Andromeda has more cores than the Frontier system • The Register
Cerebras, maker of waferscale AI chips and systems, says its new Andromeda supercomputer has more compute cores than Frontier, the world’s first and only publicly verified cluster to break the exascale barrier.
However, there’s a major catch: Andromeda won’t be able to perform the wide range of high-performance computing work that’s possible on Oak Ridge National Lab’s Frontier, which earlier this year achieved a peak of 1.1 exaflops on the HPC world’s standard Linpack benchmark.
The catch is that “real” HPC requires double-precision (64-bit) floating-point capabilities. Combining its 16 CS-2 systems allows Andromeda to achieve more than 1 exaflop at sparse 16-bit half-precision (FP16) and 120 petaflops at dense FP16, two formats Cerebras says are used to train deep neural networks.
Unveiled today at the Supercomputing 2022 (SC22) event, Andromeda consists of 16 CS-2 systems, each powered by Cerebras’ massive Wafer-Scale Engine 2 (WSE-2) chip and linked together by the startup’s SwarmX interconnect fabric.
With each WSE-2 chip housing 850,000 compute cores, Andromeda hits the 13.5 million core mark, more than the 8.7 million AMD CPU and GPU cores powering Frontier at the US Department of Energy’s Oak Ridge facility. Cerebras was keen to point this out, but there’s more to it than just the core count.
This is far from a direct comparison given the architectural differences between each core type and the kinds of workloads they are optimized for. While CPUs and GPUs can handle a wider range of HPC workloads, the WSE-2 chip only supports FP16 and 32-bit single-precision (FP32) formats, meaning it can’t help out with chunky 64-bit double-precision (FP64) math.
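To see why that matters for traditional HPC, consider how quickly 16-bit floats run out of precision. This quick illustration (our own sketch, not code from Cerebras or the article) round-trips a value through IEEE 754 half precision using Python's built-in `struct` module:

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip a Python float through IEEE 754 half precision
    # using struct's 'e' (binary16) format code.
    return struct.unpack('<e', struct.pack('<e', x))[0]

# FP16 has a 10-bit mantissa: between 2048 and 4096 it can only
# represent even integers, so adding 1 to 2048 is lost to rounding.
print(to_fp16(to_fp16(2048.0) + 1.0))   # 2048.0 -- the +1 vanished

# Python floats are FP64 (52-bit mantissa); the same sum is exact.
print(2048.0 + 1.0)                      # 2049.0
```

Long scientific simulations accumulate millions of such sums, which is why FP64 remains the baseline for "real" HPC even as AI training gets by on FP16.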
In conversation with The Register, Cerebras CEO Andrew Feldman was under no illusion that Frontier is the more capable machine for a wider range of applications.
“For supercomputing, traditional supercomputing, large simulations, trajectory analysis, it’s a better machine. It’s a bigger machine,” said Feldman, who previously tried to usher in a new era of Arm-based server CPUs at AMD before leaving in 2014 to eventually found Cerebras.
But he said the core comparison between Andromeda and Frontier is relevant because there are certain problems in the supercomputing world, like deep learning, that benefit from as many cores as possible, regardless of the kind of core. And getting so many cores to work together effectively is no small feat.
“Our cores are smaller. Our cores are optimized for AI. Our cores are not 64-bit double precision. But at AI, they are unmatched. And 13.5 million of those are really, really hard to get to act like a single machine on a single problem. Being able to access it via a few lines in a scientific notebook, like a Jupyter notebook, is unheard of,” he said.
And here is the science
This is a claim apparently supported by the DOE’s Argonne National Laboratory.
In a statement provided by Cerebras, Rick Stevens, the Argonne associate lab director who was the face of the heavily delayed Aurora supercomputer, said Andromeda achieves “near-perfect linear scaling” when training the GPT3-XL large-language model on the COVID-19 genome across one, two, four, eight, and 16 nodes.
“Linear scaling is one of the most desirable characteristics of a large cluster, and Cerebras Andromeda delivered 15.87x throughput across 16 CS-2 systems and a corresponding reduction in training time compared to a single CS-2. Andromeda is setting a new bar for AI accelerator performance,” he said.
Feldman suggested that Argonne, with its Polaris supercomputer powered by 2,000 Nvidia A100 GPUs, wouldn’t be able to do the same job. He said this is because “the GPUs couldn’t get the job done due to GPU memory and memory bandwidth limitations.” Although Argonne didn’t come out and say so explicitly, it can be inferred from the lab’s research paper detailing its work, Feldman claimed.
“I think they were under a lot of political pressure. That makes Nvidia look bad. And big companies don’t like it when you make them look bad,” he added.
What enables Cerebras to support models with trillions of parameters is the startup’s MemoryX technology, which Feldman says, combined with the SwarmX fabric, allows models to run on clusters of up to 192 CS-2 systems.
While Cerebras has attracted a good range of customers and made impressive performance claims, it must withstand the collective financial might of Nvidia, AMD, and Intel and emerge on the other side of the shrinking economy we now find ourselves in. ®
https://www.theregister.com/2022/11/15/cerebras_supercomputer_frontier/