The world’s fastest supercomputers have long relied on GPUs, with nine of the top 10 powered by them. However, that dominance may soon face a challenge as chipmakers increasingly optimize processors for AI workloads rather than the high-precision FP64 calculations required for scientific research.
To address this shift, U.S. national laboratories are exploring alternative architectures, including NextSilicon’s Maverick-2 processor. Built specifically for 64-bit floating-point computations, the chip is designed to handle the complex simulations run by the Department of Energy, ranging from nuclear weapons modeling to public health and national security research.
While most modern supercomputers, including systems powered by Nvidia and AMD GPUs, follow traditional designs, Sandia National Laboratories has taken a different approach with its new Spectra supercomputer. Developed alongside Penguin Solutions and NextSilicon, Spectra serves as a testing platform for the Maverick-2 architecture.
Although modest in size compared to exascale giants like Frontier and El Capitan, Spectra’s purpose is to validate the technology. Sandia recently confirmed that the system successfully met all acceptance requirements, paving the way for larger deployments in future high-performance computing systems.
Unlike conventional CPUs and GPUs built on the von Neumann architecture, Maverick-2 uses a reconfigurable dataflow design. Its compute units are dynamically assigned tasks at runtime, allowing operations to be tailored to specific workloads. The architecture also enables data processing to occur immediately as information moves through the pipeline, reducing delays associated with traditional load-and-store operations and improving efficiency for scientific computing.
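To make the contrast with a program-counter-driven design concrete, here is a toy dataflow interpreter in Python. Each node fires as soon as all of its operands have arrived, rather than when an instruction pointer reaches it. This is only a sketch of the general dataflow idea: the `run_dataflow` helper and the node names are hypothetical and say nothing about how Maverick-2 is actually implemented.

```python
from collections import deque
import operator


def run_dataflow(nodes, inputs):
    """Evaluate a dataflow graph by firing nodes once their operands are ready.

    nodes:  {name: (function, [operand names])}
    inputs: {name: value} for the graph's source values
    """
    values = dict(inputs)
    consumers = {}  # value name -> nodes waiting on that value
    missing = {}    # node name  -> number of operands not yet available
    for name, (_, operands) in nodes.items():
        missing[name] = sum(1 for op in operands if op not in values)
        for op in operands:
            consumers.setdefault(op, []).append(name)

    # Nodes whose operands are all present can fire immediately.
    ready = deque(name for name, count in missing.items() if count == 0)
    fired = []
    while ready:
        name = ready.popleft()
        fn, operands = nodes[name]
        values[name] = fn(*(values[op] for op in operands))
        fired.append(name)
        # A freshly produced value may unblock downstream nodes.
        for consumer in consumers.get(name, []):
            missing[consumer] -= 1
            if missing[consumer] == 0:
                ready.append(consumer)
    return values, fired


# Hypothetical graph for (a + b) * (a - b).
graph_nodes = {
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["a", "b"]),
    "prod": (operator.mul, ["sum", "diff"]),
}
values, fired = run_dataflow(graph_nodes, {"a": 5.0, "b": 3.0})
```

In this toy, `sum` and `diff` have no dependency on each other, so hardware following the same rule could run them side by side. `prod` fires only after both results exist.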
NextSilicon says its dataflow architecture delivers significant gains in both performance and energy efficiency for real-world scientific workloads.
While dataflow computing is not a new concept, most companies that have adopted it, including Groq, Cerebras, and SambaNova Systems, have focused primarily on AI training and inference. NextSilicon stands out by targeting high-performance computing (HPC), a field where results depend on high-precision FP64 arithmetic.
One of the biggest challenges with dataflow architectures is software compatibility. To overcome this, NextSilicon developed a compiler it claims can run existing C, Python, Fortran, and CUDA applications without requiring extensive code rewrites. The system initially executes workloads on a CPU, captures the compute graph, maps it onto the Maverick-2 architecture, and then optimizes execution to maximize performance.
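The "run it once on a CPU and capture the compute graph" step can be illustrated with a small Python tracer. This is a toy showing the general idea of recording operations while unmodified numeric code executes. The `Traced` class and the `axpy` function are hypothetical, and nothing here reflects NextSilicon's actual compiler, which the company has not detailed in this article.

```python
class Traced:
    """A number that computes normally on the CPU while logging each operation."""

    def __init__(self, value, name, graph):
        self.value = value
        self.name = name
        self.graph = graph  # shared list of (result, op, left, right) records

    def _record(self, op, other, result):
        other_name = other.name if isinstance(other, Traced) else repr(other)
        name = f"t{len(self.graph)}"
        self.graph.append((name, op, self.name, other_name))
        return Traced(result, name, self.graph)

    def __add__(self, other):
        return self._record("add", other, self.value + _val(other))

    def __mul__(self, other):
        return self._record("mul", other, self.value * _val(other))


def _val(x):
    """Unwrap a Traced operand; plain numbers pass through unchanged."""
    return x.value if isinstance(x, Traced) else x


def axpy(a, x, y):
    """Ordinary numeric code with no knowledge of the tracer: a*x + y."""
    return a * x + y


graph = []
a = Traced(2.0, "a", graph)
x = Traced(3.0, "x", graph)
y = Traced(1.0, "y", graph)
result = axpy(a, x, y)  # runs on the CPU and fills `graph` as a side effect
```

After the call, `graph` holds one record per operation in execution order. A real toolchain would then have to map such a graph onto hardware and optimize it, which is the hard part this sketch leaves out.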
Sandia National Laboratories has already tested the technology on several major HPC workloads, including the HPCG benchmark, the LAMMPS molecular dynamics suite, and the SPARTA Monte Carlo simulation framework. The successful validation of these workloads marks an important milestone for the platform.
At the same time, the GPU market is increasingly being shaped by AI demands. Nvidia’s upcoming Rubin GPUs are designed to deliver massive memory bandwidth and up to 50 petaFLOPS of FP4 performance, making them highly attractive for AI training and inference. These capabilities are also driving adoption in scientific computing environments, including systems such as the Doudna supercomputer at Lawrence Berkeley National Laboratory.
As AI continues to influence GPU design, alternative architectures like NextSilicon’s Maverick-2 could offer researchers a specialized option for workloads that depend heavily on high-precision FP64 calculations.
While FP64 performance remains important for many traditional scientific applications, Nvidia’s GPUs continue to play a key role in AI workloads at U.S. research labs.
Rubin’s AI-focused design, however, comes with a trade-off. The chip delivers just 33 teraFLOPS of native FP64 performance, making it slower in this area than Nvidia’s nearly four-year-old H100.
Still, that doesn’t mean Rubin is ill-suited for scientific computing. For matrix-intensive applications such as High Performance Linpack (HPL), Nvidia relies on a modified version of the Ozaki method, using lower-precision data types to emulate FP64 calculations.
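The slicing idea behind Ozaki-style emulation can be sketched in plain Python. The example below assumes small-integer slices: each vector is converted to fixed point with a shared exponent, cut into 7-bit pieces, and multiplied piece by piece with exact integer accumulation before the pieces are recombined. The article does not say which data types or slice counts Nvidia's modified version uses, so `SLICE_BITS`, `NUM_SLICES`, `to_slices`, and `emulated_dot` are illustrative assumptions, not Nvidia's implementation.

```python
import math

SLICE_BITS = 7   # assumption: each slice fits in a signed 8-bit integer
NUM_SLICES = 8   # assumption: 8 x 7 = 56 bits of fixed-point mantissa


def to_slices(vec):
    """Convert floats to fixed point with a shared exponent, then cut into slices."""
    exp = max(math.frexp(v)[1] for v in vec)  # every |v| < 2**exp
    total_bits = SLICE_BITS * NUM_SLICES
    mask = (1 << SLICE_BITS) - 1
    slices = [[] for _ in range(NUM_SLICES)]
    for v in vec:
        fixed = round(math.ldexp(v, total_bits - exp))  # integer below 2**total_bits
        sign = -1 if fixed < 0 else 1
        magnitude = abs(fixed)
        for k in range(NUM_SLICES):
            slices[k].append(sign * ((magnitude >> (SLICE_BITS * k)) & mask))
    return slices, exp


def emulated_dot(x, y):
    """Dot product assembled from exact small-integer partial products."""
    xs, ex = to_slices(x)
    ys, ey = to_slices(y)
    total = 0
    for k, xk in enumerate(xs):
        for l, yl in enumerate(ys):
            # Small-integer dot product with a wide accumulator: no rounding here.
            partial = sum(a * b for a, b in zip(xk, yl))
            total += partial << (SLICE_BITS * (k + l))
    # Undo the fixed-point scaling applied to both inputs.
    scale = ex + ey - 2 * SLICE_BITS * NUM_SLICES
    return math.ldexp(float(total), scale)


x = [1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0]
emulated = emulated_dot(x, y)
naive = sum(a * b for a, b in zip(x, y))  # ordinary left-to-right FP64 sum
```

All the work sits in the nested slice-by-slice products, which are small matrix-style operations. Using fewer slices would cut that work at the cost of precision.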
Nvidia says this technique enables Rubin to achieve up to 200 teraFLOPS of FP64 matrix performance. However, emulated FP64 is not a perfect substitute. While it has shown strong results in some HPC workloads, its benefits are limited in vector-heavy applications such as computational fluid dynamics.
Interestingly, those vector-focused workloads are exactly where NextSilicon has concentrated its efforts.
Although full system-level benchmarks for NextSilicon’s hardware, including the Spectra supercomputer, are not yet available, the company says a single Maverick-2 processor can deliver roughly 600 gigaFLOPS of FP64 HPCG performance. According to the startup, that puts it on par with leading GPUs while using about half the power.
While Nvidia’s latest generation prioritizes AI acceleration, AMD has taken a different path.
Like Rubin, AMD’s MI455X accelerators are optimized for AI training and inference. However, they are only one part of a broader GPU lineup manufactured by TSMC. For the MI430X, AMD replaced the AI-centric compute dies with versions specifically designed for high-performance computing.
Earlier this month, AMD revealed that the MI430X, which will power the U.S. Department of Energy’s Discovery supercomputer and Europe’s Alice Recoque system, will deliver up to 200 teraFLOPS of peak FP64 performance.
Who Needs GPUs Anyway?
Startups like NextSilicon still have to demonstrate that their processors can scale efficiently in large supercomputing deployments. But across the Pacific, China has already shown that world-class scientific computing performance is possible without relying on GPUs.
For years, China has pursued a strategy of developing custom silicon to strengthen its supercomputing capabilities.
Some of its most notable systems have been built around homegrown processors. The Sunway TaihuLight supercomputer, for example, relied on custom manycore chips featuring hundreds of proprietary RISC cores. Meanwhile, the Tianhe-2A system used the domestically developed Matrix 2000 digital signal processor (DSP) to handle FP64 workloads.
More recently, reports have surfaced about a new supercomputer known as LineShine. Like TaihuLight, it is said to use tens of thousands of custom CPUs, roughly 47,000 in total, and could reportedly deliver up to 2 exaFLOPS of FP64 performance. However, since China no longer participates in the twice-yearly Top500 rankings of the world’s fastest publicly known supercomputers, independent verification may never come.
China’s reliance on custom silicon is partly a response to U.S. export restrictions on advanced accelerators. Even when access to foreign chips remains legal, dependence on overseas suppliers has become a strategic vulnerability. In fact, Washington’s decision to block Intel’s Xeon Phi sales to China helped spur the development of the Matrix 2000 processor.
In the United States, the challenge is somewhat different. As AI demand continues to soar, chipmakers are increasingly focused on the lucrative AI accelerator market. Nvidia has become the world’s most valuable company on the back of this trend, while high-performance computing—though still critical for scientific research—remains a comparatively niche business.
Source: The Register. Edited by Bernie