Analysis During the recent launch of its 96-core Epyc Genoa CPUs, AMD touched on one of the biggest challenges facing modern computers: in recent years, processors have gained cores and compute power faster than the memory subsystems feeding them have gained bandwidth.
“Anything that uses a very large memory footprint is going to need a lot of bandwidth to drive the cores,” Gartner analyst Tim Harvey told The Register. “And if you’re accessing that data indiscriminately, you’re going to be missing a lot of cache, so being able to pull data in very quickly is going to be very useful.”
And this is by no means a new phenomenon, especially for high performance computing (HPC) workloads. Our sister site The Next Platform has been tracking the growing ratio of compute power to memory bandwidth for some time.
But while the move to DDR5 4800 MTps DIMMs boosts bandwidth by 50 percent over the fastest DDR4, that alone wasn’t enough to keep AMD’s 96-core Epycs fed. AMD engineers made up the difference by increasing the number of memory controllers, and therefore channels, to 12. Combined with the faster DDR5, Genoa offers more than twice the memory bandwidth of Milan.
The approach is not without compromise, however. For one, adding more channels means dedicating more on-die real estate to memory controllers. There are also signal-integrity considerations involved in supporting the larger number of DIMMs plugged into those channels. And then there’s the challenge of physically fitting all of those DIMMs into a traditional chassis, particularly in a dual-socket configuration.
Because of this, AMD will likely stick with 12 channels for at least the next few generations, relying instead on improving DDR5 memory speeds to increase bandwidth.
Micron expects memory speeds to exceed 8,800 MTps within DDR5’s lifetime. In a 12-channel system, this corresponds to a memory bandwidth of around 840 GBps.
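Those figures are easy to check with back-of-the-envelope arithmetic. A minimal sketch, assuming the standard 64-bit (8-byte) data path per DDR channel and ignoring real-world efficiency losses:

```python
def channel_bw_gbps(mtps: int, bytes_per_transfer: int = 8) -> float:
    """Peak bandwidth of one DDR channel in GB/s: MT/s x bytes per transfer."""
    return mtps * bytes_per_transfer / 1000

# Milan: 8 channels of DDR4-3200
milan = 8 * channel_bw_gbps(3200)    # 204.8 GB/s
# Genoa: 12 channels of DDR5-4800
genoa = 12 * channel_bw_gbps(4800)   # 460.8 GB/s
# A future 12-channel system at DDR5-8800
future = 12 * channel_bw_gbps(8800)  # 844.8 GB/s, i.e. roughly 840 GBps

print(f"Milan: {milan:.1f} GB/s")
print(f"Genoa: {genoa:.1f} GB/s ({genoa / milan:.2f}x Milan)")
print(f"12 x DDR5-8800: {future:.1f} GB/s")
```

Sustained bandwidth in practice falls short of these peak numbers thanks to refresh cycles and controller scheduling, but the ratios hold.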
“DDR5 performance will increase over time, but we’re still going to have this big gap between the available cores and memory bandwidth, and it’s going to be difficult to feed it,” Harvey said.
Optane lives on. Sort of
While AMD’s answer to the problem is to cram more memory controllers into its chips and faster DDR5 memory into the system, Intel has taken a different approach with its Xeon Max CPUs, which will power the US Department of Energy’s long-delayed Aurora supercomputer.
Formerly known as Sapphire Rapids HBM, the chips pack 64GB of HBM2e memory, good for 1TBps of bandwidth, into a 56-core 4th-Gen Xeon Scalable processor.
And while you can technically run the chip from the HBM alone, for those who need larger pools of memory for things like large natural language models, Intel supports tiered memory in two configurations strongly reminiscent of its recently discontinued Optane business unit.
In Intel’s HBM Flat mode, the HBM and any external DDR5 act as separate, independently addressable memory pools. In caching mode, meanwhile, the HBM is treated more like a level-4 cache for the DDR5.
While the latter may be attractive for some use cases – it’s transparent and doesn’t require software changes – Harvey argues the HBM might be underutilized if it behaves anything like Intel’s Optane persistent memory.
“For the most part, CPUs are good at instruction-level caching; they’re not very good at application-level caching,” he said, adding that operating the chip in flat mode could prove promising, although it would require special considerations from software vendors.
“If you have a large HBM cache effectively for main memory, then the operating system vendors, the hypervisor vendors, will be able to manage that much better than the CPU,” he said. “The CPU can’t see that layer of instructions, whereas the hypervisor knows: I’m going to toggle between this app and that app, so I can preload that app into HBM.”
Nvidia has also moved main memory onto the CPU package to achieve similarly high bandwidth in its first datacenter CPU. But unlike Intel with Xeon Max, Nvidia eschewed expensive, low-capacity HBM in favor of standard LPDDR5x modules.
Each Grace superchip fuses two Grace CPU dies – each with 72 Arm Neoverse V2 cores – connected by the chipmaker’s 900GBps NVLink-C2C interconnect. The dies are flanked by rows of LPDDR5x memory modules for a terabyte a second of bandwidth and a terabyte of capacity.
While it’s hard to say for sure, our best guess is that each Grace CPU die is paired with eight 64GB LPDDR5x packages running at or near 8,533 MTps. That would work out to 546GBps of bandwidth for each of the two CPU dies.
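For what it’s worth, that guess is internally consistent with Nvidia’s headline terabyte-per-second figure. A quick sanity check, where the package count and the 64-bit per-package interface width are our assumptions rather than confirmed specs:

```python
MTPS = 8533            # assumed LPDDR5x transfer rate (MT/s)
PACKAGES_PER_DIE = 8   # assumed memory packages per Grace die
BYTES_PER_XFER = 8     # assumed 64-bit interface per package

per_die_gbps = PACKAGES_PER_DIE * MTPS * BYTES_PER_XFER / 1000
superchip_gbps = 2 * per_die_gbps  # two CPU dies per Grace superchip

print(f"per die: {per_die_gbps:.1f} GB/s")      # ~546 GB/s
print(f"superchip: {superchip_gbps:.1f} GB/s")  # ~1,092 GB/s
```

Two dies at roughly 546GBps apiece lands just over the 1TBps mark.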
Apple took a similar approach, although it used slower 6,400 MTps LPDDR5 memory to achieve 800GBps of memory bandwidth on its M1 Ultra processors, which debuted in the Mac Studio earlier this year. However, Apple’s reasons had less to do with per-core memory bandwidth and more to do with feeding the chip’s integrated GPUs.
For Nvidia, the approach offers some obvious advantages over something like HBM, the biggest being capacity and cost. HBM2e tops out at 16GB per stack from vendors such as Micron – meaning Nvidia would need four times as many modules to match the capacity of its 64GB LPDDR5x packages.
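To put that in numbers – and to be clear, the 512GB target here is an arbitrary example for illustration, not a Grace spec – matching a given capacity with 16GB HBM2e stacks takes four times the module count of 64GB LPDDR5x packages:

```python
TARGET_GB = 512          # illustrative capacity target, not an Nvidia spec
HBM2E_STACK_GB = 16      # max HBM2e stack capacity cited above
LPDDR5X_PACKAGE_GB = 64  # LPDDR5x package capacity cited above

hbm_stacks = TARGET_GB // HBM2E_STACK_GB          # 32 stacks
lpddr_packages = TARGET_GB // LPDDR5X_PACKAGE_GB  # 8 packages

print(hbm_stacks, lpddr_packages, hbm_stacks // lpddr_packages)  # 32 8 4
```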
But even this approach is not without compromise, according to Harvey. Baking memory onto the CPU package means giving up flexibility. If you need more than 1TB of system memory, you can’t just slot more DIMMs into the box – at least not the way Nvidia has implemented things.
It probably still makes sense for Nvidia’s target market, though, Harvey explained: “Nvidia has a very heavy focus on AI/ML workloads with specific requirements, while Intel is more focused on these general purpose workloads.”
CXL is not the answer yet
Both AMD’s Genoa and Intel’s 4th Gen Xeon Scalable processors offer support for the CXL 1.1 interconnect standard.
Early implementations of the technology, from companies like Astera Labs and Samsung, will enable novel memory configurations, including memory expansion and memory tiering.
For now, however, the limited bandwidth available to these devices means that their usefulness in addressing the mismatch between CPU and memory performance is limited.
AMD’s implementation dedicates 64 lanes to CXL devices. However, because of how those lanes are bifurcated, each CXL device can only use four of them at a time. And since CXL 1.1 runs over PCIe 5.0, each device’s bandwidth is limited to roughly 16GBps.
“It might open up some memory bandwidth things over time, but I think the initial implementations might not be fast enough,” Harvey said.
This could change with future PCIe generations, as the interconnect typically doubles its bandwidth with each successive generation. By PCIe 7.0, a single x4 CXL device would have roughly 64GBps of bandwidth available.
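That doubling can be modeled in a few lines, assuming a PCIe 5.0 lane delivers roughly 4GBps after encoding overhead and that each generation doubles per-lane speed:

```python
def cxl_x4_bw_gbps(pcie_gen: int) -> float:
    """Approximate bandwidth of a x4 CXL link in GB/s for PCIe gen 5 and up."""
    gen5_lane_gbps = 4.0  # ~4 GB/s per PCIe 5.0 lane (approximation)
    return 4 * gen5_lane_gbps * 2 ** (pcie_gen - 5)

for gen in (5, 6, 7):
    print(f"PCIe {gen}.0 x4: ~{cxl_x4_bw_gbps(gen):.0f} GB/s")
# ~16 GB/s today, ~32 GB/s at gen 6, ~64 GB/s at gen 7
```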
For now, Harvey argues that CXL will be most valuable for memory-hungry applications that aren’t necessarily as sensitive to bandwidth, or in a tiered memory configuration. ®
https://www.theregister.com/2022/11/14/amd_intel_nvidia_ram_bandwidth/ How AMD, Intel and Nvidia keep their cores from starving • The Register