Arm Announces Mali-G76 GPU: Scaling up Bifrost
by Ryan Smith & Andrei Frumusanu on May 31, 2018 3:00 PM EST
The Mali G76 µarch - Fine Tuning It
Section by Ryan Smith
While the biggest change in the G76 is by far Arm’s vastly wider cores, it’s not the only change to come to the Bifrost architecture. The company has also undertaken a few smaller changes to further optimize the architecture and improve performance efficiency.
First off, within their ALUs Arm has added support for Int8 dot products. These operations are becoming increasingly important in machine learning inference, as the dot product is a critical operation in processing neural networks and, despite its limited precision, Int8 is still deep enough for basic inference in a number of cases. To be sure, even the original Bifrost already natively supported Int8 data types, including packing 4 of them into a single lane, but G76 becomes the first to be able to use them in a dot product in a single cycle.
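To make the operation concrete, here is a minimal Python sketch of a 4-way Int8 dot product on values packed into a 32-bit word, which is what "packing 4 of them into a single lane" amounts to. The function names, the lane ordering (lane 0 in the low byte), and the unbounded accumulator are assumptions made for illustration; Arm has not described the actual instruction semantics here.

```python
def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (lane 0 in the low byte).

    The byte ordering is an assumption for this sketch.
    """
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    word = 0
    for lane, v in enumerate(values):
        if not -128 <= v <= 127:
            raise ValueError("value does not fit in Int8")
        word |= (v & 0xFF) << (8 * lane)  # two's complement low byte
    return word


def unpack_int8x4(word):
    """Split a 32-bit word back into four signed 8-bit integers."""
    out = []
    for lane in range(4):
        byte = (word >> (8 * lane)) & 0xFF
        out.append(byte - 256 if byte >= 128 else byte)  # sign-extend
    return out


def dot_int8x4(a_word, b_word, acc=0):
    """Multiply the four Int8 pairs and add the products to an accumulator.

    Hardware would do this as one operation; the accumulator width is not
    modeled here (Python integers do not overflow).
    """
    a = unpack_int8x4(a_word)
    b = unpack_int8x4(b_word)
    return acc + sum(x * y for x, y in zip(a, b))


a = pack_int8x4([1, -2, 3, -4])
b = pack_int8x4([5, 6, -7, 8])
print(dot_int8x4(a, b))  # 1*5 + (-2)*6 + 3*(-7) + (-4)*8 = -60
```

Without a dedicated dot product, the same result takes four multiplies and three adds on unpacked values, which is where the single-cycle version saves time.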
As a result, Arm is touting a 2.7x increase in machine learning performance. This will of course depend on the workload – particularly the framework and model used – so it’s just a high-level approximation. But Arm is betting big on machine learning, so significantly speeding up GPU machine learning inference gives Arm’s customers another option for efficiently processing these neural networks.
Meanwhile, in part as a consequence of the better scalability of Mali-G76’s core design, Arm has also taken a look at other aspects of GPU scalability to improve performance. Their research found that another potential scaling bottleneck is the tiler, which could block the rest of the GPU if it stalled during a polygon writeback. As a result, Arm has moved from an in-order writeback mechanism to an out-of-order writeback mechanism, allowing for polygons to be written back with more flexibility by bypassing those writeback stalls. Unfortunately Arm is being somewhat mum here on how this was implemented – generally changing an in-order process to out-of-order is not a simple task – so we haven’t been given much other information on the matter.
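Since Arm hasn't said how the out-of-order writeback works, the following is only a toy model of why it helps, not a description of the hardware. It assumes one writeback per cycle and gives each polygon a hypothetical cycle at which its writeback becomes possible; both function names and the scheduling policy are invented for illustration.

```python
def in_order_cycles(ready_at):
    """Cycles needed when polygons must be written back in submission order.

    ready_at[i] is the first cycle at which polygon i can be written back.
    A stalled polygon at the head blocks everything behind it.
    """
    cycle = 0
    for ready in ready_at:
        cycle = max(cycle, ready) + 1  # wait for the head, then spend one cycle
    return cycle


def out_of_order_cycles(ready_at):
    """Cycles needed when any ready polygon may be written back.

    Each cycle, the earliest-submitted polygon that is ready gets written;
    stalled polygons are simply skipped until they become ready.
    """
    pending = list(ready_at)
    cycle = 0
    while pending:
        for i, ready in enumerate(pending):
            if ready <= cycle:
                del pending[i]  # write this polygon back
                break
        cycle += 1  # a cycle passes whether or not anything was written
    return cycle


# The first polygon stalls until cycle 5; the other three are ready immediately.
stalls = [5, 0, 0, 0]
print(in_order_cycles(stalls))      # 9
print(out_of_order_cycles(stalls))  # 6
```

In this model the in-order tiler sits idle for five cycles behind the stalled polygon, while the out-of-order version uses that time to retire the three polygons queued behind it.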
Arm has also made a subtle but important change to how their tile buffers can be used in an effort to keep more traffic local to the GPU core. In certain cases, it’s now possible for applications that run out of color tile buffer space to spill over into the depth tile buffer. Arm is specifically citing workloads involving heavy use of multiple render targets without MSAA as driving this change; the lack of MSAA means that the depth tile buffer is used only sparingly, while the multiple render targets quickly chew through the color tile buffer. The net result is that it cuts down on the number of trips that need to be made to main memory, which is a rather expensive operation.
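The bookkeeping behind that trade-off can be sketched in a few lines of Python. Every number below (tile size, buffer capacities, bytes per pixel) is a hypothetical placeholder rather than a G76 specification, and the function name and the policy of filling the color buffer first are assumptions for illustration.

```python
def plan_tile_storage(num_render_targets, bytes_per_pixel, tile_pixels,
                      color_capacity, depth_capacity, depth_in_use):
    """Decide where one tile's render target data ends up, in bytes.

    Color data fills the color tile buffer first, then any unused part of
    the depth tile buffer, and only the remainder goes out to main memory.
    """
    needed = num_render_targets * bytes_per_pixel * tile_pixels
    in_color = min(needed, color_capacity)
    overflow = needed - in_color
    depth_free = max(depth_capacity - depth_in_use, 0)
    in_depth = min(overflow, depth_free)
    to_main_memory = overflow - in_depth
    return {"color": in_color, "depth": in_depth, "main_memory": to_main_memory}


# Hypothetical case: four 32-bit render targets on a 16x16 tile, no MSAA,
# so only half of the depth tile buffer is occupied.
plan = plan_tile_storage(
    num_render_targets=4, bytes_per_pixel=4, tile_pixels=16 * 16,
    color_capacity=2048, depth_capacity=2048, depth_in_use=1024,
)
print(plan)  # {'color': 2048, 'depth': 1024, 'main_memory': 1024}
```

With these made-up figures, half of the overflow stays on-chip instead of going to main memory; with MSAA enabled the depth buffer would be fuller and there would be less room to spill into.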
Speaking of spilling, G76’s thread local storage mechanism has also been optimized for how it handles register spills. Now the GPU will attempt to group data chunks from spills together so that they can be more easily fetched in the future. This is as opposed to how earlier GPUs did it, where register spills were scattered based on which SIMD lane the data ultimately belonged to.
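As a rough illustration of why grouping matters, the Python sketch below compares two hypothetical address layouts for spilled registers and counts how many cache lines must be touched to reload one register for every lane. The lane count, word size, cache line size, per-lane region size, and both address formulas are assumptions for this sketch; Arm has not published the actual layouts.

```python
LANES = 8                    # assumed SIMD width for this sketch
WORD_BYTES = 4               # one 32-bit register value per lane
CACHE_LINE_BYTES = 64        # assumed cache line size
PER_LANE_REGION_BYTES = 256  # assumed size of each lane's private spill area


def scattered_address(lane, slot):
    """Older-style layout: each lane spills into its own separate region."""
    return lane * PER_LANE_REGION_BYTES + slot * WORD_BYTES


def grouped_address(lane, slot):
    """Grouped layout: all lanes' values for one spill slot sit side by side."""
    return (slot * LANES + lane) * WORD_BYTES


def cache_lines_touched(address_fn, slot):
    """Count distinct cache lines needed to reload one slot across all lanes."""
    return len({address_fn(lane, slot) // CACHE_LINE_BYTES
                for lane in range(LANES)})


print(cache_lines_touched(scattered_address, slot=0))  # 8
print(cache_lines_touched(grouped_address, slot=0))    # 1
```

Under these assumptions, reloading a single spilled register costs eight cache line fetches in the scattered layout and one in the grouped layout.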