Arm's New Cortex-A77 CPU Micro-architecture: Evolving Performance
by Andrei Frumusanu on May 27, 2019 12:01 AM EST2018 was an exciting year for Arm’s own CPU designs. Last year in May we saw the release of the Cortex-A76 and the subsequent resulting silicon in the form of the Kirin 980 as well as Snapdragon 855 SoCs. We were very impressed by the IP, and Arm managed to deliver on all its performance, efficiency and area promises, resulting in some excellent SoCs and devices powering most of 2019’s flagship devices.
This year we follow-up with another TechDay disclosure, and this time around we’re uncovering Arm’s follow-up to the Cortex-A76: the new Cortex-A77. The new generation is a direct evolution of last year’s major microarchitecture introduction, and represents the second instance of Arm’s brand-new Austin core family. Today we’ll analyse how Arm has pushed the IPC of its new microarchitecture and how this will translate into real performance for upcoming late-2019/early-2020 SoCs and devices.
Deimos turns to Cortex-A77
The announcement of the Cortex-A77 doesn’t come as a surprise as Arm continues on their traditional annual IP release cadence. In fact today is not the first time that Arm has talked about the A77: In August of last year Arm had teased the CPU core when releasing its performance roadmap through 2020:
Codenamed as “Deimos”, the new Cortex-A77 picks up where the Cortex-A76 left off and follows Arm’s projected trajectory of delivering a continued solid 20-25% CAGR of performance uplift with each generation of Arm’s new Austin family of CPUs.
Before we dwell into the new Cortex-A77, we should take a look back at how the performance of the A76 has evolved for Arm:
The A76 has certainly been a hugely successful core for Arm and its licensees. The combination of the brand-new microarchitecture alongside the major improvements that the 7nm TSMC process node has brought some of the biggest performance and efficiency jumps we’ve ever seen in the industry.
The results is that the Kirin 980 as well as the Snapdragon 855 both represented major jumps over their predecessors. Qualcomm has proclaimed a 45% leap in CPU performance compared to the previous generation Snapdragon 845 with Cortex-A75 cores, the biggest generational leap ever.
While the performance increase was notable, the energy efficiency gains we saw this generation was even more impressive and directly resulted in improved battery life of devices powered by the new Kirin and Snapdragon SoCS.
While the A76 performed well, we should remember that it does have competition. While Samsung’s own microarchitecture this year with the M4 has lessened the performance/efficiency gap, the Exynos CPU still largely lags behind by a generation, even though this difference is amplified by a process node difference this year (8nm vs 7nm). The real competition for Arm here lies with Apple’s CPU design teams: Currently the A11 and A12 still hold a large performance and efficiency lead that amounts to roughly two microarchitecture generations.
Die shot credit: ChipRebel - Block labelling: AnandTech
One of Arm’s fortes however remains in delivering the best PPA in the industry. Even though the A76’s performance didn’t quite match Apple’s, it managed to achieve outstanding efficiency with incredibly small die area sizes. In fact, this is a conscious design decision by Arm as power efficiency and area efficiency are among the top priorities for Arm’s licensees.
The Cortex-A77: A Top-Level Overview
The Cortex-A77 being a direct microarchitectural successor to the A76 means the new core largely stays in line with the predecessor’s features. Arm states that the core was built in mind with vendors being able to simply upgrade the SoC IP without much effort.
In practice what this means is that the A77 is architecturally aligned with its predecessor, still being an ARMv8.2 CPU core that is meant to be paired with a Cortex-A55 little CPU inside of a DynamIQ Shared Unit (DSU) cluster.
Fundamental configuration features such as the cache sizes of the A77 also haven’t changed compared to its predecessor: We’re still seeing 64KB L1 instruction and data caches, along with a 256 or 512KB L2 cache. It’s interesting here that Arm did design the option for an 1MB L2 cache for the infrastructure Neoverse N1 CPU core (Which itself is derived from the A76 µarch), but chooses to stay with the smaller configuration options on the client (mobile) CPU IP.
As an evolution of the A76, the A77 performance jump as expected won’t be quite as impressive, both from a microarchitecture perspective, but also from an absolute performance standpoint as we’re not expecting large process node improvements for the coming SoC generation.
Here the A77 is projected to still be productised on 7nm process nodes for most customers, and Arm is proclaiming a similar 3GHz peak target frequency as its predecessor. Naturally since frequency isn’t projected to change much, this means that the core’s targeted +20% performance boost can be solely attributed to the IP’s microarchitectural changes.
To achieve the IPC (Instructions per clock) gains, Arm has reworked the microarchitecture and introduced clever new features, generally beefing up the CPU IP to what results in a wider and more performant design.
108 Comments
View All Comments
Raqia - Tuesday, May 28, 2019 - link
Another interesting development in the big AX CPUs is that they've moved from a more complex cache hierarchy in the A10 to a 2 level hierarchy with a much bigger L2 since the A11 that had better bandwidth and latency; L1's were also further boosted in size and bandwidth in the A12. This likely accounts for the continuation of growth in single threaded benchmark scores but seems to indicate that the CPU complex is oriented toward client type workloads.ARM has gone full steam ahead with more multi-processing oriented cache designs with some SoCs sporting a further layer of L4 cache and server designs sporting sophisticated un-cores. Their ambitions seem rather different than Apple's and this year's A77's will likely be implemented into servers designs sometime soon.
Apple's 3-wide OoOE little cores continue to be even more impressive than their big cores, and hold their own against the A73 in performance with much higher efficiency. One wonders if the 2-wide A73 or even the A75 could be tweaked and underclocked to be the "little" in future designs. It certainly fits the bill in terms of die area.
peevee - Tuesday, May 28, 2019 - link
"The results is that the Kirin 980 as well as the Snapdragon 855 both represented major jumps over their predecessors. Qualcomm has proclaimed a 45% leap in CPU performance compared to the previous generation Snapdragon 855 with Cortex-A76 cores, the biggest generational leap ever."Wat?
peevee - Tuesday, May 28, 2019 - link
"In the A77’s case the structure is 1.5K entries big, which if one would assume macro-ops having a similar 32-bit density as Arm instructions, would equate to about 48KB."You mean Kb, right? And of course this assumption is nonsense.
peevee - Tuesday, May 28, 2019 - link
"web-browsing is the killer-app that happens to be floating point heavy"Why? Because ECMAScript has just one number type?
I suspect WebAssembly would eliminate this problem.
ballsystemlord - Tuesday, May 28, 2019 - link
Spelling and grammar corrections:"Having less capacity would take reduce the hit-rate more significantly, while going for a larger cache would have diminishing returns."
Extra word "take":
"Having less capacity would reduce the hit-rate more significantly, while going for a larger cache would have diminishing returns."
"...and again this imbalance with a more "fat" front-end bandwidth allows the core to hide to quickly hide branch bubbles and pipeline flushes."
More extra words "to hide":
"...and again this imbalance with a more "fat" front-end bandwidth allows the core to quickly hide branch bubbles and pipeline flushes."
sireangelus - Tuesday, May 28, 2019 - link
Are there any news or rumors regarding the succesor of the cortex a55? not even just working on reducing power consumption?tuxRoller - Tuesday, May 28, 2019 - link
"The combination of the brand-new microarchitecture alongside the major improvements that the 7nm TSMC process node has brought some of the biggest performance and efficiency jumps we’ve ever seen in the industry."Or, to paraphrase many a cynical AT commenter: same old incremental improvement, nothing exciting... where's my mr fusion?!!!
AshlayW - Tuesday, May 28, 2019 - link
Can someone tell me how this stacks up to a high-performance X86 core, like Zen or Skylake please? If ARM is so powerful and efficient why are they not developing Desktop CPUs? Is it just because the software ecosystem is dominated by proprietary X86?Wilco1 - Wednesday, May 29, 2019 - link
The IPC is higher than the latest x86 cores. There are Arm server CPUs which are competitive with Skylake and beat it on HPC applications in super computers. Currently you can buy desktops based on ThunderX2 and Ampere, see https://store.avantek.co.uk/arm-desktops.html .Farfolomew - Thursday, June 6, 2019 - link
I wish this were talked about more. As awesome as Zen2 is, and as cool as the story is regarding AMD finally getting on level playing field with Intel again (ie, circa 2004), in the back of my mind I find it a bit silly when all the while that's been happening, a better architecture in all regards has managed to catch up and pass x86. That should be where the computing industry focus is, and how Intel+AMD is planning on battling that threat.