While artificial intelligence continues to transform software development, large language models (LLMs) face significant hurdles in high-performance computing (HPC). For all their brilliance elsewhere, these models struggle to produce code that is both correct and fast at scale. Why? They’re basically working with one hand tied behind their back.

The training data problem is real. These models simply don’t have enough quality HPC examples to learn from: sparse data in, shaky code out. This shortage of specialized training material means even the most sophisticated LLMs flop when asked to write efficient parallel code. University of Maryland researchers recognized the gap and created the HPC-INSTRUCT dataset to address it.

Parallelism confounds them. Race conditions? Deadlocks? LLMs get confused fast. They’ll give you code that works, sure, but good luck getting it to scale across 10,000 cores. It’s like asking a house cat to herd sheep.
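
To make that concrete, here’s a minimal sketch (my own illustration, not code from any study) of the kind of OpenMP summation a model will happily hand you: it compiles, it looks parallel, and it quietly computes a different wrong answer on every run once threads pile onto the shared accumulator.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const long long n = 10'000'000;
    std::vector<double> data(n, 1.0);

    // The pattern a model often emits: the pragma makes the loop "parallel",
    // but every thread does an unsynchronized read-modify-write on racy_sum.
    // That is a data race: the result is wrong and varies from run to run.
    double racy_sum = 0.0;
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        racy_sum += data[i];
    }

    // The idiomatic fix is a one-clause change that models frequently miss:
    // reduction gives each thread a private partial sum and combines them safely.
    double reduced_sum = 0.0;
    #pragma omp parallel for reduction(+:reduced_sum)
    for (long long i = 0; i < n; ++i) {
        reduced_sum += data[i];
    }

    std::printf("racy sum: %.1f  reduced sum: %.1f  expected: %.1f\n",
                racy_sum, reduced_sum, static_cast<double>(n));
    return 0;
}
```

Compiled with -fopenmp, the racy loop drifts further from the right answer as the thread count grows; the reduction clause is the one-word difference between code that merely runs and code that scales.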

Their algorithmic choices are often laughable to experts. Studies show a stunning 90% of AI-suggested optimizations either don’t work or offer zero benefit; the overwhelming majority of suggestions are noise. LLMs prioritize code that runs, not code that runs well.
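
Here’s what that looks like in practice, as a hypothetical sketch rather than an example from the cited studies. Both routines below compute the same matrix product and both work; the only difference is loop order, yet on large matrices the second is routinely several times faster because it respects the cache. It’s exactly the kind of distinction a general-purpose model glosses over.

```cpp
#include <vector>

// Both routines compute C = A * B for N x N row-major matrices and return
// identical results: the "it runs" level of correctness an LLM optimizes for.

// Textbook i-j-k order: the inner loop walks B column-wise, so for large N
// nearly every access to B misses the cache.
void matmul_ijk(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int N) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double acc = 0.0;
            for (int k = 0; k < N; ++k)
                acc += A[i * N + k] * B[k * N + j];   // strided, cache-hostile
            C[i * N + j] = acc;
        }
}

// Same arithmetic in i-k-j order (C must be zero-initialized): the inner loop
// now streams through B and C contiguously, which is typically several times
// faster on the same machine with no change to the algorithm.
void matmul_ikj(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int N) {
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k) {
            const double a = A[i * N + k];
            for (int j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];     // unit-stride, cache-friendly
        }
}
```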

Perhaps most fundamentally, these models can’t verify their own work. They can’t run profilers. They can’t benchmark. They’re flying blind without any execution context. It’s like trying to bake a cake without tasting the batter.
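
Contrast that with the loop a human HPC developer closes dozens of times a day: run it, time it, tweak it, run it again. A minimal, hypothetical version of that loop looks like this (the kernel is just a stand-in for whatever code is under test):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Stand-in for the generated code under test.
void kernel(std::vector<double>& v) {
    for (double& x : v) x = x * 1.0000001 + 0.5;
}

int main() {
    std::vector<double> v(10'000'000, 1.0);
    double best = 1e300;

    // Time several runs and keep the best to reduce noise. This
    // execute-measure-revise loop is exactly what an LLM cannot close
    // on its own: it emits code and never sees a single timing.
    for (int run = 0; run < 5; ++run) {
        auto start = std::chrono::steady_clock::now();
        kernel(v);
        auto stop = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(stop - start).count());
    }

    std::printf("best kernel time: %.6f s\n", best);
    return 0;
}
```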

The complexity problem compounds everything. As problems get harder, LLM performance drops off a cliff. The models tend to produce terser but more convoluted solutions that human programmers struggle to maintain, which undermines long-term project sustainability and scalability.

Domain knowledge gaps are glaring. General-purpose models simply don’t understand HPC priorities or hardware specifics. They need specialized training, which circles back to the data limitation problem.

Evaluation presents its own challenges. We need better benchmarks specifically for HPC tasks, not just general coding tests. Current metrics don’t capture what actually matters: runtime performance and scaling.
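
What should such a benchmark report? At minimum, strong-scaling numbers: speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p across core counts. Here’s a tiny sketch with made-up timings, just to show the shape of the data that a pass/fail correctness check never surfaces:

```cpp
#include <cstdio>

int main() {
    // Strong-scaling metrics: speedup S(p) = T(1) / T(p) and
    // parallel efficiency E(p) = S(p) / p. The timings below are made up
    // purely to illustrate the shape of the report.
    const double t_serial = 120.0;                                  // 1-core runtime (s)
    const int    cores[]  = {1, 2, 4, 8, 16, 32};
    const double t_par[]  = {120.0, 63.0, 34.0, 19.0, 12.0, 9.0};   // runtimes on p cores (s)

    std::printf("%6s %10s %10s %12s\n", "cores", "time (s)", "speedup", "efficiency");
    for (int i = 0; i < 6; ++i) {
        const double speedup    = t_serial / t_par[i];
        const double efficiency = speedup / cores[i];
        std::printf("%6d %10.1f %10.2f %11.0f%%\n",
                    cores[i], t_par[i], speedup, efficiency * 100.0);
    }
    return 0;
}
```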

Until these issues are addressed, LLMs will remain impressive toys rather than transformative HPC tools.