NVIDIA Unveils Grace: A High-Performance Arm Server CPU For Use In Big AI Systems
by Ryan Smith on April 12, 2021 12:20 PM EST
Kicking off another busy Spring GPU Technology Conference for NVIDIA, this morning the graphics and accelerator designer is announcing that they are going to once again design their own Arm-based CPU/SoC. Dubbed Grace – after Grace Hopper, the computer programming pioneer and US Navy rear admiral – the CPU is NVIDIA’s latest stab at more fully vertically integrating their hardware stack by being able to offer a high-performance CPU alongside their regular GPU wares. According to NVIDIA, the chip is being designed specifically for large-scale neural network workloads, and is expected to become available in NVIDIA products in 2023.
With two years to go until the chip is ready, NVIDIA is playing things relatively coy at this time. The company is offering only limited details for the chip – it will be based on a future iteration of Arm’s Neoverse cores, for example – as today’s announcement is a bit more focused on NVIDIA’s future workflow model than it is speeds and feeds. If nothing else, the company is making it clear early on that, at least for now, Grace is an internal product for NVIDIA, to be offered as part of their larger server offerings. The company isn’t directly gunning for the Intel Xeon or AMD EPYC server market, but instead they are building their own chip to complement their GPU offerings, creating a specialized chip that can directly connect to their GPUs and help handle enormous, trillion parameter AI models.
|NVIDIA SoC Specification Comparison|
|CPU Architecture|Next-Gen Arm Neoverse|(Custom Arm v8.2)|(Custom Arm v8)|
| | |PCIe 3|PCIe 3|
|Manufacturing Process|?|TSMC 12nm|TSMC 16nm|
More broadly speaking, Grace is designed to fill the CPU-sized hole in NVIDIA’s AI server offerings. The company’s GPUs are incredibly well-suited for certain classes of deep learning workloads, but not all workloads are purely GPU-bound, if only because a CPU is needed to keep the GPUs fed. NVIDIA’s current server offerings, in turn, typically rely on AMD’s EPYC processors, which are very fast for general compute purposes, but lack the kind of high-speed I/O and deep learning optimizations that NVIDIA is looking for. In particular, NVIDIA is currently bottlenecked by the use of PCI Express for CPU-GPU connectivity; their GPUs can talk quickly amongst themselves via NVLink, but not back to the host CPU or system RAM.
The solution to the problem, as was the case even before Grace, is to use NVLink for CPU-GPU communications. Previously NVIDIA has worked with the OpenPOWER foundation to get NVLink into POWER9 for exactly this reason, however that relationship is seemingly on its way out, both as POWER’s popularity wanes and POWER10 is skipping NVLink. Instead, NVIDIA is going their own way by building an Arm server CPU with the necessary NVLink functionality.
The end result, according to NVIDIA, will be a high-performance and high-bandwidth CPU that is designed to work in tandem with a future generation of NVIDIA server GPUs. With NVIDIA talking about pairing each NVIDIA GPU with a Grace CPU on a single board – similar to today’s mezzanine cards – not only do CPU performance and system memory scale up with the number of GPUs, but in a roundabout way, Grace will serve as a co-processor of sorts to NVIDIA’s GPUs. This, if nothing else, is a very NVIDIA solution to the problem, not only improving their performance, but giving them a counter should the more traditionally integrated AMD or Intel try some sort of similar CPU+GPU fusion play.
By 2023 NVIDIA will be up to NVLink 4, which will offer at least 900GB/sec of cumulative (up + down) bandwidth between the SoC and GPU, and over 600GB/sec cumulative between Grace SoCs. Critically, this is greater than the memory bandwidth of the SoC, which means that NVIDIA’s GPUs will have a cache coherent link to the CPU that can access the system memory at full bandwidth, while also allowing the entire system to have a single shared memory address space. NVIDIA describes this as balancing the amount of bandwidth available in a system, and they’re not wrong, but there’s more to it. Having an on-package CPU is a major means towards increasing the amount of memory NVIDIA’s GPUs can effectively access and use, as memory capacity continues to be the primary constraining factor for large neural networks – you can only efficiently run a network as big as your local memory pool.
|CPU & GPU Interconnect Bandwidth|
| |Grace|EPYC 2 + A100|EPYC 1 + V100|
|(Cumulative, Both Directions)| |PCIe 4 x16|PCIe 3 x16|
|(Cumulative, Both Directions)| |Infinity Fabric 2| |
And this memory-focused strategy is reflected in the memory pool design of Grace, as well. Since NVIDIA is putting the CPU on a shared package with the GPU, they’re going to put the RAM down right next to it. Grace-equipped GPU modules will include a to-be-determined amount of LPDDR5x memory, with NVIDIA targeting at least 500GB/sec of memory bandwidth. Besides being what’s likely to be the highest-bandwidth non-graphics memory option in 2023, NVIDIA is touting the use of LPDDR5x as a gain for energy efficiency, owing to the technology’s mobile-focused roots and very short trace lengths. And, since this is a server part, Grace’s memory will be ECC-enabled, as well.
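As a rough sanity check on the bandwidth figures quoted above, the sketch below compares the NVLink 4 CPU-GPU figure against the LPDDR5x target and estimates how long it would take to stream a model's weights out of system memory. The function names and the 2 bytes per parameter (FP16 weights) are illustrative assumptions, not NVIDIA figures, and the comparison uses the cumulative link number just as the article does.

```python
# Figures quoted in the article; both are "at least" targets, not final specs.
NVLINK4_CPU_GPU_GBPS = 900   # cumulative (up + down) bandwidth, GB/sec
LPDDR5X_MEMORY_GBPS = 500    # targeted system memory bandwidth, GB/sec


def link_exceeds_memory(link_gbps: float, memory_gbps: float) -> bool:
    """Return True when the CPU-GPU link is at least as fast as system memory."""
    return link_gbps >= memory_gbps


def seconds_to_stream(num_params: float, bytes_per_param: int, gbps: float) -> float:
    """Time to read every parameter once at the given bandwidth (GB = 1e9 bytes)."""
    size_gb = num_params * bytes_per_param / 1e9
    return size_gb / gbps


if __name__ == "__main__":
    # Assumption: 1 trillion parameters stored as 2-byte (FP16) values.
    print(link_exceeds_memory(NVLINK4_CPU_GPU_GBPS, LPDDR5X_MEMORY_GBPS))
    print(seconds_to_stream(1e12, 2, LPDDR5X_MEMORY_GBPS))
```

Under these assumptions the system memory, not the link, sets the pace, which is the balance the article describes.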
As for CPU performance, this is actually the part where NVIDIA has said the least. The company will be using a future generation of Arm’s Neoverse CPU cores, where the initial N1 design has already been turning heads. But other than that, all the company is saying is that the cores should break 300 points on the SPECrate2017_int_base throughput benchmark, which would be comparable to some of AMD’s second-generation 64 core EPYC CPUs. The company also isn’t saying much about how the CPUs are configured or what optimizations are being added specifically for neural network processing. But since Grace is meant to support NVIDIA’s GPUs, I would expect it to be stronger where GPUs in general are weaker.
Otherwise, as mentioned earlier, NVIDIA’s big vision goal for Grace is significantly cutting down the time required for the largest neural networking models. NVIDIA is gunning for 10x higher performance on 1 trillion parameter models, and their performance projections for a 64 module Grace+A100 system (with theoretical NVLink 4 support) would be to bring down training such a model from a month to three days. Or alternatively, it would be able to do real-time inference on a 500 billion parameter model on an 8 module system.
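The month-to-three-days projection lines up with the 10x figure; a quick check, assuming a 30-day month (the article only says "a month") and using hypothetical helper names:

```python
def speedup(baseline_days: float, projected_days: float) -> float:
    """Ratio of the baseline training time to the projected training time."""
    return baseline_days / projected_days


def projected_days(baseline_days: float, factor: float) -> float:
    """Training time after applying a given speedup factor."""
    return baseline_days / factor


if __name__ == "__main__":
    # Assumption: "a month" is taken as 30 days.
    print(speedup(30, 3))
    print(projected_days(30, 10))
```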
Overall, this is NVIDIA’s second real stab at the data center CPU market – and the first that is likely to succeed. NVIDIA’s Project Denver, which was originally announced just over a decade ago, never really panned out as NVIDIA expected. The family of custom Arm cores was never good enough, and never made it out of NVIDIA’s mobile SoCs. Grace, in contrast, is a much safer project for NVIDIA; they’re merely licensing Arm cores rather than building their own, and those cores will be in use by numerous other parties, as well. So NVIDIA’s risk is reduced to largely getting the I/O and memory plumbing right, as well as keeping the final design energy efficient.
If all goes according to plan, expect to see Grace in 2023. NVIDIA is already confirming that Grace modules will be available for use in HGX carrier boards, and by extension DGX and all the other systems that use those boards. So while we haven’t seen the full extent of NVIDIA’s Grace plans, it’s clear that they are planning to make it a core part of future server offerings.
First Two Supercomputer Customers: CSCS and LANL
And even though Grace isn’t shipping until 2023, NVIDIA has already lined up their first customers for the hardware – and they’re supercomputer customers, no less. Both the Swiss National Supercomputing Centre (CSCS) and Los Alamos National Laboratory are announcing today that they’ll be ordering supercomputers based on Grace. Both systems will be built by HPE’s Cray group, and are set to come online in 2023.
CSCS’s system, dubbed Alps, will be replacing their current Piz Daint system, a Xeon plus NVIDIA P100 cluster. According to the two companies, Alps will offer 20 ExaFLOPS of AI performance, which is presumably a combination of CPU, CUDA core, and tensor core throughput. When it’s launched, Alps should be the fastest AI-focused supercomputer in the world.
An artist's rendition of the expected Alps system
Interestingly, however, CSCS’s ambitions for the system go beyond just machine learning workloads. The institute says that they’ll be using Alps as a general purpose system, working on more traditional HPC-type tasks as well as AI-focused tasks. This includes CSCS’s traditional research into weather and the climate, which the pre-AI Piz Daint is already used for as well.
As previously mentioned, Alps will be built by HPE, who will be basing it on their previously-announced Cray EX architecture. This would make NVIDIA’s Grace the second CPU option for Cray EX, along with AMD’s EPYC processors.
Meanwhile Los Alamos’ system is being developed as part of an ongoing collaboration between the lab and NVIDIA, with LANL set to be the first US-based customer to receive a Grace system. LANL is not discussing the expected performance of their system beyond the fact that it’s expected to be “leadership-class,” though the lab is planning on using it for 3D simulations, taking advantage of the largest data set sizes afforded by Grace. The LANL system is set to be delivered in early 2023.
gescom - Tuesday, April 13, 2021
This.
CiccioB - Tuesday, April 13, 2021
Ahahaha, it appears some fanboy here got hurt by a presentation of future revolutionary products that are going to make AMD (and Intel) future "20% better performance at each generation" something pitiful.
They sell GPUs, SoCs, Network devices and... this may surprise you... also SW!!!
And they just announced they are going to sell CPUs. Yes. Pure CPUs.
Is all this thought to support their core business GPUs? Of course it is.
Is all this allowing them to create better products? Of course it is.
Is all this going to further hurt x86 market, which already lost the race for the HPC market many years ago, like a Thor Hammer? Of course it is!
And if they can really deliver Grace + Ampere Next in 2023, they will just roll over (like a bulldozer (cit.)) all AMD's and Intel's proposals for their exascale super computer (which are still quite far away from being achieved, SW support included).
In 2023, if they maintain what they have on that roadmap, they can finally demonstrate that ARM CPUs can definitely get the spot of x86 CPUs in any high computing work, making 32, 64, 128, 256 or whatever the number of x86 cores in a single package (with related 200, 300, 400, 500W power consumption and still pitiful bandwidth per core) not really interesting anymore. The same work will be done with less power consuming HW and much better balanced/parallel resources.
They can definitely destroy the CPU centric architectures where everything has to be handled by the main CPU cores we have seen so far. Work distribution (with linear scale performances) can be achieved, making beefy centric CPUs I-do-all-the-management-work-with-bandwidth-issues useless.
They are developing what they have invested into: parallel computing device connected by high bandwidth networking. That's the meaning of the 7 billion buy of Mellanox.
If you have not understood this, yes, Nvidia's presentation is just vaporware.
Much better AMD and Intel's presentations about how they are going to get 20% more performances from their architectures while using the latest most advanced, expensive and production limited PP.
See ya in the future.
RanFodar - Tuesday, April 13, 2021
Okay. You can shove me in the face that by 2023 ARM will take over the CPU market...
But aside from speculation, do you really believe that at that time, Intel and AMD will be bulldozed by a round from Nvidia's first server CPU? I know ARM has a lot more advantage than x86, but don't tell me that their launch will take over the market with their first swing. Besides, we don't really know what's the future for Intel and AMD anyway. They still have a lot to go, and ONLY time will tell.
CiccioB - Tuesday, April 13, 2021
Wherever have you read "at launch time ARM will take over the entire CPU market"?
If (and if) Nvidia maintains its promises on these CPUs they'll demonstrate that ARM has all the real potential to become a main player in the high computing market and has the potential to gather many more developers and development resources. The same thing that Apple just did with its M1, where they just broke up the exclusive support to x86 applications by developers. Now developers have to think about creating application for both worlds, which is already a great achievement which goes well beyond the real performances the M1 CPU have. Even Adobe has converted its big tanks to ARM, something they haven't done for Windows on ARM, for example.
Microsoft is not Apple and I have had always doubts they really wanted to go against Intel promoting and developing their Windows for ARM, but Apple just did that demonstrating that x86 is no more the king or the only actor that is for a powerful machine.
Nvidia could gather the same spirit with the server market (where ARM is used only in private and custom contexts) and with more and more developers that will write optimized code (and framework, and libraries and anything needed) for HPC market, x86 could soon lose the granitic position it has been keeping during these last 20 years. Position that has given them a great inertia as lots of SW is written and optimized just for x86.
Nvidia is taking a different route than those of any previous CPU designers, where the CPU is not the center of everything, but just a node of a bigger mesh where it has little importance with respect to the whole. Communication and data distribution now are the important aspects, not just how many core a die encloses and how fast it can elaborate data by itself.
This for sure could be thought and achieved only by a company that has not directly interest only in the CPU market like Intel, AMD, but also IBM have (had).
Parallelism at system architecture level, not only just inside a die (be it a CPU or GPU).
This is the revolutionary vision Nvidia has just disclosed. And they have been preparing for this with products on the needed parts: CPU, GPU, faster buses and last but not least, networking. They are in the position to dwarf (x86) CPUs importance in distributed servers.
Intel and AMD have to be quite worried about this new scenarios as this not only put them in second line in the importance to have a real scalable system, but it opens up to new actors that will be more than willingly to get their fair share of the big cake Intel (and just lately as a hope to improve its dire situation, AMD).
More money diverged from x86 development = more money invested in creating alternatives.
Nvidia will support them all. Their aim is the more they can, and be it on x86 or ARM solutions it is the same for them. But they know this move is going to weaken its historic rivals, especially AMD which gets a double hit (CPU and GPU, where the latter have not been in professional market for years and they just started developing something for it now).
And as I said at those times when Intel lost against ARM in the mobile race, saying that that would have brought much more investments to other fabs making them competing better and better (at that time Intel was using 22nm against ARM @45nm or 32nm for the bold designers like Qualcomm) making Intel lose its advantage of the PP compensating the (awful and obsolete) x86 architecture, now with TSMC able to bake dies that are better than those made by Intel and with more and more CPU designers using architectures that are much more efficient, it becomes quite difficult for Intel to continue to maintain its monopoly, despite the fact that they have enormous engineering capacity and still can deliver better products with the use of advanced bus, in-die connections and packages other cannot benefit but in next years.
Silver5urfer - Tuesday, April 13, 2021
SO much of FUD and BS.
How many comments are here singing this whole ARM nonsensical crap. Esp for all you guys I have one question, does ARM improve "YOUR" computing abilities over x86 ??
No it doesn't you cannot find a DIY machine of ARM, nor a high performance PC, M1 Mac got whipped by Ryzen 4000U series processor. Once the Zen 3 based product with low TDP range launches it's going to be shoveled hard. And Alder Lake is where Intel bet more money for laptops since it's 10nm for one and two it's big little trash on top of Intel Wifi + 5G + thin and light with Win10 out of box compat. Intel targeted Tiger Lake 10nm over RKL because of many reasons in that the volume and profits are definitely one.
And now, does Graviton2 can be owned by individuals ? nope. Fujitsu A64FX ? nope. Altra Ampere processors ? nope. Marvell ? nope. So what does ARM provide you guys to shill so damn hard and spell doom about x86. On top in Android land, the OEM controls everything top down stack. You don't get blobs from the OEMs the HW cannot be upgraded nor modified in any part. Plus BL locks on top you don't even own the HW to any extent, with centralized Appstores and control freak Google with Filesystem limitations what do you own actually ?
Yeah, nothing. Remember there's no full proper market for the big OEM companies like Dell and HP, SuperMicro, Gigabyte, Lenovo that make the Server Racks for ARM processors on top talk Volume ARM gets crushed to oblivion. Centriq was the last that was purported to be revolutionary. Recently Ampere processor and now people are heralding upcoming Microsoft server CPU based off ARM and that stupid incapable Google's Whitechapel ARM processor for smartphone and their own ARM processor for servers. Every damn thing is centralized and they simply want to save money, but why do you guys love it so hard.
With x86 you can own a mini server beast, from all the old Xeon parts and user Racks and etc in the market where people make complex Homelabs and what not to the latest Threadripper workstation professional grade processors which have insane PCIe lanes and power in your damn hand where you can install numerous OSes and VMs. Yeah people make them with Raspberry Pi too, it's superb for projects but it's not going to replace an x86 machine. We talk all day even more on this aspect alone and ARM will not come out leading anywhere.
"Grace CPU OMG, ARM is going to take over the planet, and we are going to moon" right ?
mode_13h - Wednesday, April 14, 2021
> And now, does Graviton2 can be owned by individuals ? nope. Fujitsu A64FX ? nope. Altra Ampere processors ? nope. Marvell ? nope.
The general public can buy Ampere Altra servers from Gigabyte and Fujitsu A64FX from HPE. There's even a company selling Altra-based workstations.
> in Android land, the OEM controls everything top down stack.
Why confuse ARM with Android? True fact: they even shipped Android for x86!
Also, ARM runs on regular Linux, from little R.Pi to the big servers you mentioned.
Your whole Android tangent appears to be a red herring.
> "Grace CPU OMG, ARM is going to take over the planet, and we are going to moon" right ?
Oddly, I agree with this point. IMO, what cores Grace uses are one of its less interesting aspects, and don't have any real bearing on ARM's broader trajectory in the server market.
That said, the numbers don't lie: ARM's growth in the cloud is substantial and only looks to be accelerating. The decline of x86 will be the one of the big computing stories of this decade.
Silver5urfer - Wednesday, April 14, 2021
Okay I didn't know A64FX could be bought but a $50K A64FX Rack from HPE is considered as Obtainable for General Public ? Look at what I mentioned. I asked people "you" and Homelab. That's where ARM question comes. What do you feel like x86 HW is lacking. Looking at Homelab and NAS SOHO. EPYC Rome can be really purchased and run with vast ecosystem on top even if they have deep pockets. Usually Homelab crowd uses used Xeon parts. Which speaks for itself.
Android x86, yes they did. But my point was the ARM HW which is already in circulation and owned by many can it run anything else or used as a Computing OS as in Linux or Windows Server ? They do not. Qcomm and Exynos decide what consumer gets. And why would even x86 HW run Goolag Locked Android ? I know F-Droid exists but it's more of a hobby type.
Pi runs Linux and so do all ARM. Pi is super customizable which is not just great but stellar. However the power it has cannot compete. So it's more of a fun project with Educational focus options too than a PC usecase or Homelab or a Render rig. But the thing is how do they Improve your experience over x86 that all here shill for..
As for the ARM is the future race. The latest market trend is AI. ARM is saying they are building CPU and HW around it with next decade. AMD said x86 is their future and with Xilinx M&A they are going to put that to use with FPGA which will change more. AMD is confident on that. As for Intel, will they kill their own x86 for ARM. They are targeting thin and light ARM with Big little approach.
ARM is increasing more because of AWS not others since Amazon simply wants to save money and everyone wants vertical control. Control freaks. Will see how far it will go ofc.
mode_13h - Thursday, April 15, 2021
> a $50K A64FX Rack from HPE is considered as Obtainable for General Public ?
Affordability and availability are two different things. A64FX is a specialized chip that will only really benefit a few with specific workloads. It's an important milestone in ARM's progression, but I see it as a digression from the broader point.
The more relevant data point, for people with the deep pockets to afford 64-core workstations and servers, is really Altra. However, that's not me. I am upgrading my home server to a Ryzen 5000, in the coming months.
Even if I could afford an Altra machine, I'd probably wait at least until the N2-based CPUs are out, before moving to ARM. Altra/N1 are basically A76 cores. But, the bigger issue for me is that I still care about single-thread performance and really don't need so many cores.
> Usually Homelab crowd uses used Xeon parts
BTW, the other server-ish machine in my homelab is an i3, because most of them enable ECC RAM. I also have a E5 Xeon-based workstation.
> What do you feel like x86 HW is lacking.
It's a question of competitiveness. ARM offers better performance-per-area (PPA) and therefore more performance per $. It's also more energy-efficient, due to having a simpler instruction encoding and more architectural registers. And this gives it a slight edge on performance, due to enabling a wider decoder.
For mobile and datacenter, energy efficiency is key. Also relevant is cost. And by offering better PPA, you can afford to fab it on newer nodes, since ARM chips with the same core counts and IPC will be smaller than x86 counterparts. And newer nodes confer an additional performance and energy-efficiency advantages.
So, it's really a case where a whole lot of benefits are derived from a few, key aspects of the ISA.
> Qcomm and Exynos decide what consumer gets.
Okay, but that's an issue with Qualcomm and Samsung, not ARM. Several ARM SBC's run generic desktop Linux distros built for AArch64.
> But the thing is how do they Improve your experience over x86 that all here shill for..
I don't think anyone is saying that there's yet an ARM-based answer to the mainstream PC. I'm certainly not.
I guess the new Mac Mini could be a good option for the sort who are content to use a NUC-class machine, but I'm allergic to Apple for so many reasons that even if Linux is fully-supported on those things, I wouldn't even touch one.
> AMD said x86 is their future
They have to *say* that, even if they're already deep into ARM, RISC V, or whatever. To announce an ARM-based initiative would be an acknowledgement that they're not fully invested in x86. It would create doubt in the minds of both existing and prospective customers about how long AMD will continue to offer leading x86 server products. So, maybe they go with Intel, instead.
Also, it's free advertising for ARM. So, maybe customers would just opt for an ARM CPU that's available today, like Altra, instead of waiting to see what AMD comes up with. If they feel like ARM is an inevitability for them, they might just want to get it over with and embrace that ecosystem.
So, AMD needs to wait until either the market is at a tipping point, or until they're nearly ready to launch their own new ARM chips, before they'll announce. And they're probably not going to jump into the ARM race, while they still have such growth momentum with EPYC. Again, they don't want to confuse the market or cannibalize that growth.
I'm not saying it's 100% that AMD will go with ARM, but it's got way more traction than anything else. The main source of uncertainty, in my mind, is Nvidia's ownership. That's got to make a lot of would-be adopters very nervous.
CiccioB - Wednesday, April 14, 2021
I can't understand your point.
I can't understand if you are defending x86 point of view because you can't see things beyond today and tomorrow is already too far for you.
You are just comparing today x86 situation against ARM, just when ARM is getting out from the mobile market it has been for years.
You are comparing an architecture with tens of years of support and optimizations against one that has just been born.
And you say that the new one has nothing good to offer because the old one is better.
Better in what? Consuming energy while, yes, having more cores/frequency/bloated dies and point to a 20% improvement at each new generation?
Despite the fact that Windows for ARM exists, the only real thing that makes the difference is the SW support. Not the HW performances.
Under the HW point of view, your considerations are quite biased and really useless: M1 has 4 high performance cores, Ryzen 4000U has 8 core+HT.
The power consumption are also quite different between the two. And guess who's better?
But apart actual benchmark results, what you can't understand is that a breach opened.
For sure the water spilling from it today is not the same of the river under the dam.
But this breach is not to be ignored, as it was not to be ignored the one that opened when ARM won the race in mobility market.
That allowed for more money on alternative PP different than Intel's (which was the only enjoying its scaling market business), and that provided the situation we are now, where Intel has lost its PP leadership.
This situation may fix the last brick needed for a platform to become popular: that is SW support.
What have you not understood that before M1 Adobe never made a version of their suite that was not x86? And I've seen many other competitors are creating their M1 version of their graphical suites (see Affinity suite).
What have you not understood that with M1 developers that want something published in the Apple universe have now to develop for ARM as well (and in few months when even the biggest Macs shifts to ARM based CPUs probably only for ARM)?
Can't you understand what this means? It means that more and more basic libraries, algorithms, optimizations and choices will be made for ARM architecture(s).
Today you only have professional (or mobile and embedded) developing framework for ARM. In few time you could have a suite like Visual Studio and all relative libraries being able to create a fully optimized ARM executable that runs for Linux or Windows and iOS.
May this not happen? Of course it may not. Though the Apple ARM market will remain nonetheless.
But for this not to become a bleeding nightmare for x86 historic (and last pillar) SW support, Intel and AMD have really to work hard. Not surely come up with those power sucking, pizza's size dies just for 20% more performance (and only in some tests) that the year before.
You think that everyone needs 8 cores, 16 thread and 32GB of RAM to do to their jobs (especially at home) to say that ARM is not a threat?
That goes against the success of Chromebook (which can be ARM too).
It's just a question of SW development and support before future ARM PCs can run Windows with whatever office suite they want + Adobe or any other photo retoucher + Blender + CAD + development tools and finally... games.
If next gaming consoles will switch to anything not x86 based, x86 down track will be definitive, and I doubt Intel or AMD can supply SoC at cheaper price that AMD is doing now to make designers interested in their solutions.
It will prefigure a dumping scenery.
If ARM becomes wide spread (independently of the actual performances against 16 core x86 which is owned by 0.01% of PC users) future can be less easy for x86 players.
I just think that we will hear in a couple of years (or at most three) some more announcements by Intel (and I believe first by AMD) of opening towards ARM design or even embedded compatibility (that is x86 cores that can run ARM code as well, just to not say that x86 has been completely, and I may say finally, rendered obsolete).
GeoffreyA - Thursday, April 15, 2021
CiccioB, even if ARM takes over, many of us enthusiasts will keep alive the memory of x86 in our hearts and nobody can erase that, not Apple, not ARM, not the latest, slick, up-to-date thing.