ARM Launches DynamIQ: big.LITTLE to Eight Cores Per Cluster
by Ian Cutress on March 21, 2017 1:00 AM EST

Most users delving into SoCs know about ARM core designs over the years. Initially we had single CPUs, then paired CPUs, and then quad-core processors, using early ARM cores to help drive performance. In October 2011, ARM introduced big.LITTLE, the ability to use two different ARM cores in the same design, typically pairing a two- or four-core high-performance cluster with a two- or four-core high-efficiency cluster. From this came offshoots, like MediaTek's tri-cluster designs, as well as wide core-mesh designs such as Cavium's ThunderX. As the tide of progress washes against the shore, ARM is today announcing the next step on the sandy beach with DynamIQ.
The underlying theme with DynamIQ is heterogeneous scalability. Those two words hide a lot of ecosystem jargon, but as ARM predicts that another 100 billion ARM chips will be sold in the next five years, it pins key areas such as automotive, artificial intelligence, and machine learning as the interesting end of that growth. As a result, performance, efficiency, scalability, and latency are all going to be key metrics moving forward, and DynamIQ aims to address them.
The first stage of DynamIQ is a larger cluster paradigm - which means up to eight cores per cluster. But in a twist, there can be a variable core design within a cluster. Those eight cores could be different cores entirely, from different ARM Cortex-A families in different configurations.
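To make the mixed-cluster idea concrete, here is a minimal sketch of how a single variable-configuration cluster might be described in software. The struct, the names, and the 1+3+4 split are our own illustrative assumptions; ARM has not published configuration details.

```c
/* Illustrative sketch only: a hypothetical description of a single
 * DynamIQ-style cluster mixing core types in one cluster.
 * The type names and the 1 big + 3 mid + 4 little split are assumptions
 * made for this example, not ARM definitions. */
#include <stdio.h>

enum core_type { CORE_BIG, CORE_MID, CORE_LITTLE };

struct cluster_desc {
    const char    *name;
    enum core_type cores[8];   /* up to eight cores per cluster */
    int            core_count;
};

int main(void) {
    struct cluster_desc c = {
        .name = "hypothetical-dynamiq-cluster",
        .cores = { CORE_BIG,
                   CORE_MID, CORE_MID, CORE_MID,
                   CORE_LITTLE, CORE_LITTLE, CORE_LITTLE, CORE_LITTLE },
        .core_count = 8,
    };
    printf("%s: %d cores in one cluster\n", c.name, c.core_count);
    return 0;
}
```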
Many questions come up here, such as how the cache hierarchy will allow threads to migrate between cores within a cluster (perhaps similar to how threads migrate between clusters on big.LITTLE today), even when cores have different cache arrangements. ARM did not yet go into that level of detail; however, we were told that more information will be provided in the coming months.
Each variable core-configuration cluster will be a part of a new fabric, which uses additional power-saving modes and aims to provide much lower latency. The underlying design also allows each core to be controlled independently for voltage and frequency, as well as for sleep states. Based on the slide diagrams, various other IP blocks, such as accelerators, should be able to be plugged into this fabric and benefit from that low latency. ARM cited safety-critical automotive decision-making as one example that could benefit from this.
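As an illustration of what independent per-core control looks like from software today, the short C sketch below reads each core's current frequency through the Linux cpufreq sysfs interface. Whether a given DynamIQ SoC exposes per-core policies this way will depend on vendor kernel support; the eight-core loop is an assumption for the example.

```c
/* Sketch: reading each core's current frequency via Linux cpufreq sysfs.
 * Assumes a Linux system that exposes per-core cpufreq nodes; whether a
 * given DynamIQ SoC does so depends on the vendor's kernel support. */
#include <stdio.h>

int main(void) {
    char path[128];
    for (int cpu = 0; cpu < 8; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                  /* core offline or no cpufreq node */
        long khz = 0;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu%d: %ld kHz\n", cpu, khz);
        fclose(f);
    }
    return 0;
}
```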
One of the focus areas of ARM's presentation was redundancy. The new fabric will allow a seemingly unlimited number of clusters to be used, such that if one cluster (or an accelerator) fails, the others can take its place. That being said, the sort of redundancy that some customers of ARM chips might require is fail-over in the event of physical damage, such as a vehicle retaining control when there are >2 'brains' on board and an impact disables one of them. It will be interesting to see whether ARM's vision for DynamIQ extends to that level of redundancy at the SoC level, or whether it will be up to ARM's partners to build it on top of DynamIQ.
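For illustration, the brief sketch below shows the kind of fail-over selection a system built on redundant clusters might perform. The health flags and cluster list are invented for the example; ARM has not described DynamIQ's redundancy mechanism.

```c
/* Sketch of fail-over selection across redundant clusters.
 * The health-check model is an assumption for illustration only. */
#include <stdbool.h>
#include <stdio.h>

struct cluster { const char *name; bool healthy; };

/* Pick the first cluster reporting healthy, or -1 if none remain. */
static int pick_active_cluster(const struct cluster *c, int n) {
    for (int i = 0; i < n; i++)
        if (c[i].healthy)
            return i;
    return -1;
}

int main(void) {
    struct cluster clusters[] = {
        { "cluster0", false },   /* e.g. disabled after a fault */
        { "cluster1", true  },
        { "cluster2", true  },
    };
    int active = pick_active_cluster(clusters, 3);
    if (active >= 0)
        printf("control handed to %s\n", clusters[active].name);
    else
        printf("no healthy cluster available\n");
    return 0;
}
```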
Along with the new fabric, ARM stated that a new memory sub-system design is in place to assist with the compute capabilities; however, nothing specific was mentioned. Along the lines of additional compute, ARM did state that new dedicated processor instructions (such as limited-precision math) for artificial intelligence and machine learning will be integrated into a variant of the ARMv8 architecture. We're unsure if this is an extension of ARMv8.2-A, which introduced half-precision for data processing, or a new version. ARMv8.2-A also adds RAS features and memory model enhancements, which coincides with the 'new memory sub-system design' mentioned earlier. When asked which cores can use DynamIQ, ARM stated that new cores would be required: future cores will be ARMv8.2-A compliant and will be able to be part of DynamIQ.
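As a rough illustration of limited-precision math on current ARM hardware, the sketch below uses the __fp16 half-precision type from the ARM C Language Extensions. It demonstrates the data type rather than any new DynamIQ instruction, which ARM has not yet detailed; the -march=armv8.2-a+fp16 build flag is an assumption about the target toolchain.

```c
/* Sketch: half-precision (FP16) arithmetic of the kind ARMv8.2-A enables for
 * data processing. __fp16 is ARM's half-precision C type (ACLE); on a target
 * without native FP16 arithmetic the compiler promotes values to float, so
 * this shows the data type only, not any new DynamIQ instruction.
 * Example build (assumption): an ARM toolchain with -march=armv8.2-a+fp16. */
#include <stdio.h>

int main(void) {
    __fp16 a[4] = { 1.5, 2.25, -0.5, 4.0 };
    __fp16 b[4] = { 0.5, 2.0,  3.0, -1.0 };
    float acc = 0.0f;                 /* accumulate in float to limit error */
    for (int i = 0; i < 4; i++)
        acc += (float)a[i] * (float)b[i];
    printf("fp16 dot product = %f\n", acc);
    return 0;
}
```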
ARM's presentation focused mainly on DynamIQ for new and upcoming technologies, such as AI, automotive, and mixed reality, although it was clear that DynamIQ can also be used with existing use models, such as tablets and smartphones. This will depend on how ARM supports current core designs in the market (such as updates to the A53, A72 and A73) or whether DynamIQ requires separate ARM licenses. We fully expect that any new cores announced from this point on will support the technology, in the same way that current ARM cores support big.LITTLE.
So here's some conjecture. A future tablet SoC uses DynamIQ and consists of two high-performance cores, four mid-range cores, and two low-power cores, without a dual-cluster / big.LITTLE design. Either that, or all three types of cores sit on separate clusters altogether using the new topology. Actually, the latter sounds more feasible from a silicon design standpoint, as well as for software management. That being said, the spec sheet of any future design using DynamIQ will now have to list the cores in each cluster. ARM did state that it should be fairly easy to control which cores are processing which instruction streams in order to get either the best performance or the best efficiency as needed.
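As an example of how software can already steer work onto a particular core, the sketch below pins the calling thread to one CPU with Linux's sched_setaffinity. Which CPU number corresponds to a 'big', 'mid', or 'little' core in a DynamIQ cluster is a hypothetical mapping; ARM has not described its own control mechanism.

```c
/* Sketch: steering a thread onto a specific core with Linux CPU affinity.
 * The assumption that cpu7 is a high-performance core is purely for
 * illustration; the real mapping depends on the SoC and kernel. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(7, &set);   /* assume cpu7 is one of the high-performance cores */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("thread pinned to cpu7 (assumed 'big' core)\n");
    return 0;
}
```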
ARM states that more information is to come over the next few months.
Comments
Alexvrb - Tuesday, March 21, 2017
I thought it was intentional. Like technological progress vs certainty.
g1011999 - Tuesday, March 21, 2017
My wild guess is that the Apple A10 already has a similar design: big/little cores in the same cluster, sharing the same L2 cache, with a hardware-based system monitor that automatically switches thread execution to either core.
Meteor2 - Tuesday, March 21, 2017
Well this seems to validate MediaTek's Tri-Cluster and CorePilot technologies. But it always strikes me that while the ARM ISA is popular, ARM's own cores and SoC tech are not. Apple have their own cores, Qualcomm does, Nvidia does. Rumours are that the X30 is getting very few design wins; most phone makers are buying 821s or 835s. So maybe ARM's own IP isn't that great.
Mobile-Dom - Tuesday, March 21, 2017
You say that, but:
Apple makes only custom cores.
Qualcomm does, but Kryo cores are only used in their top-performing chips; everything else is standard ARM IP.
Samsung mixes their custom IP with standard ARM IP.
Nvidia created their own custom core (Denver), which sadly failed spectacularly, but also pairs it with standard ARM IP.
MediaTek *only* uses standard ARM IP.
For all but the extremely high-performance stuff where custom is more efficient and effective, everyone uses standard ARM IP.
Meteor2 - Tuesday, March 21, 2017
Yeah, I was referring to high-end (because that's the most interesting). Is Denver a failure? It's in Jetson and Drive PX, and Nvidia are apparently rolling another custom core for Xavier.
saratoga4 - Tuesday, March 21, 2017
Qualcomm and Nvidia are moving towards ARM designs from full custom though. If anything, ARM is gaining ground.
senecarr - Tuesday, March 21, 2017
I'm not sure I'd say Qualcomm is moving towards ARM designs. The change happened when Apple shifted the market by introducing 64-bit ahead of what people expected, and Qualcomm didn't have a 64-bit design in their near-term roadmap. Since getting their first 64-bit ARM standard cores out, they've been shifting back to their business-as-usual custom designs.
Marc GP - Tuesday, March 21, 2017
Qualcomm is moving towards ARM designs. The Snapdragon 835 uses a customization of the Cortex-A73, not an evolution of the Kryo cores in the Snapdragon 820.
serendip - Wednesday, March 22, 2017
And looking at midrange designs like the 650/652, those plain ARM A72 cores pack one heck of a punch while still keeping power consumption reasonable. It might not make sense in the future to devote resources to a custom core when an ARM core is good enough while allowing faster time to market.
serendip - Wednesday, March 22, 2017
I guess it comes down to how fine the hairs can be split... You could have 10 clusters at each power level and you'd see little performance penalty if all cache was shared. A tri-cluster design could work well if there was enough power/performance differentiation for the middle cluster. Looking at MediaTek's latest stuff, it looks like they're still struggling with efficiency and adding another cluster didn't improve things.