NVIDIA's GeForce GTX 580: Fermi Refined
by Ryan Smith on November 9, 2010 9:00 AM EST
GF110: Fermi Learns Some New Tricks
We’ll start our in-depth look at the GTX 580 with a look at GF110, the new GPU at the heart of the card.
There have been rumors about GF110 for some time now, and while they ultimately weren’t very clear, it was obvious NVIDIA would have to follow up GF100 with something similar on 40nm to carry them through the rest of the process’s lifecycle. So for some time now we’ve been speculating on what we might see with GF100’s follow-up part. An outright bigger chip was unlikely given GF100’s already large die size, but NVIDIA has a number of tricks they can use to optimize things.
Many of those tricks we’ve already seen in GF104, and had you asked us a month ago what we thought GF110 would be, we would have told you we were expecting some kind of fusion of GF104 and GF100. Primarily our bet was on the 48 CUDA Core SM making its way over to a high-end part, bringing with it GF104’s higher theoretical performance and enhancements such as superscalar execution and additional special function and texture units for each SM. What we got wasn’t quite what we were imagining. GF110 is much more heavily rooted in GF100 than GF104, but that doesn’t mean NVIDIA hasn’t learned a trick or two.
Fundamentally GF110 is the same architecture as GF100, especially when it comes to compute. 512 CUDA Cores are divided up among 4 GPCs, and in turn each GPC contains 1 raster engine and 4 SMs. Each SM contains 32 CUDA cores, 16 load/store units, 4 special function units, 4 texture units, 2 warp schedulers with 1 dispatch unit each, 1 Polymorph unit (containing NVIDIA’s tessellator), and then the 48KB+16KB L1 cache, registers, and other glue that brings an SM together. At this level NVIDIA relies on TLP (thread-level parallelism) to keep a GF110 SM occupied with work. Attached to this are the ROPs and L2 cache, with 768KB of L2 cache serving as the guardian between the SMs and the six 64bit memory controllers. Ultimately GF110’s compute performance per clock remains unchanged from GF100, at least if we had a GF100 part with all of its SMs enabled.
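The unit counts above multiply out to the chip-wide totals. As a quick sanity check, here is a minimal sketch of that arithmetic; the `FermiLayout` class and its field names are purely illustrative and not anything NVIDIA publishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FermiLayout:
    """Illustrative container for the per-level unit counts quoted in the text."""
    gpcs: int
    sms_per_gpc: int
    cuda_cores_per_sm: int
    texture_units_per_sm: int
    memory_controllers: int
    controller_width_bits: int

    @property
    def sms(self) -> int:
        return self.gpcs * self.sms_per_gpc

    @property
    def cuda_cores(self) -> int:
        return self.sms * self.cuda_cores_per_sm

    @property
    def texture_units(self) -> int:
        return self.sms * self.texture_units_per_sm

    @property
    def memory_bus_bits(self) -> int:
        return self.memory_controllers * self.controller_width_bits


# Figures taken from the paragraph above (a fully enabled GF110).
GF110 = FermiLayout(
    gpcs=4,
    sms_per_gpc=4,
    cuda_cores_per_sm=32,
    texture_units_per_sm=4,
    memory_controllers=6,
    controller_width_bits=64,
)

if __name__ == "__main__":
    print(GF110.sms, GF110.cuda_cores, GF110.texture_units, GF110.memory_bus_bits)
```

With 4 GPCs of 4 SMs each, this gives 16 SMs, 512 CUDA cores, 64 texture units, and a 384-bit aggregate memory interface.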
On the graphics side however, NVIDIA has been hard at work. They did not port over GF104’s shader design, but they did port over GF104’s texture hardware. Previously with GF100, each texture unit could compute 1 texture address and fetch 4 32bit/INT8 texture samples per clock, 2 64bit/FP16 texture samples per clock, or 1 128bit/FP32 texture sample per clock. GF104’s texture units improved this to 4 samples/clock for both 32bit and 64bit, and it’s these texture units that have been brought over for GF110. GF110 can now do 64bit/FP16 filtering at full speed versus half-speed on GF100, and this is the first of the two major steps NVIDIA took to increase GF110’s performance over GF100’s performance on a clock-for-clock basis.
NVIDIA Texture Filtering Speed (Per Texture Unit)
Format | GF110 | GF104 | GF100
32bit (INT8) | 4 Texels/Clock | 4 Texels/Clock | 4 Texels/Clock
64bit (FP16) | 4 Texels/Clock | 4 Texels/Clock | 2 Texels/Clock
128bit (FP32) | 1 Texel/Clock | 1 Texel/Clock | 1 Texel/Clock
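To make the table's takeaway concrete, the sketch below encodes the per-texture-unit rates and computes the clock-for-clock speedup of one GPU over another for each format. The dictionary layout and the `filtering_speedup` helper are illustrative names, not part of any real tool.

```python
# Texels filtered per clock, per texture unit, as listed in the table.
TEXELS_PER_CLOCK = {
    "GF110": {"INT8": 4, "FP16": 4, "FP32": 1},
    "GF104": {"INT8": 4, "FP16": 4, "FP32": 1},
    "GF100": {"INT8": 4, "FP16": 2, "FP32": 1},
}


def filtering_speedup(new_gpu: str, old_gpu: str) -> dict:
    """Per-format ratio of filtering rates (new / old) for a single texture unit."""
    new_rates = TEXELS_PER_CLOCK[new_gpu]
    old_rates = TEXELS_PER_CLOCK[old_gpu]
    return {fmt: new_rates[fmt] / old_rates[fmt] for fmt in new_rates}


if __name__ == "__main__":
    for fmt, ratio in filtering_speedup("GF110", "GF100").items():
        print(f"{fmt}: {ratio:.1f}x")
```

Only the FP16 entry comes out above 1.0, which is why the gain depends so heavily on how much 64bit texturing a game actually does.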
Like most optimizations, the impact of this one is going to be felt more on newer games than older games. Games that make heavy use of 64bit/FP16 texturing stand to gain the most, while older games that rarely (if at all) used 64bit texturing will gain the least. Also note that while 64bit/FP16 texturing has been sped up, 64bit/FP16 rendering has not – the ROPs still need 2 cycles to digest 64bit/FP16 pixels, and 4 cycles to digest 128bit/FP32 pixels.
It’s also worth noting that this means that NVIDIA’s texture:compute ratio schism remains. Compared to GF100, GF104 doubled up on texture units while only increasing the shader count by 50%; the final result was that per SM, 32 texels were processed for every 96 instructions computed (seeing as how the shader clock is 2x the base clock), giving us a 1:3 ratio. GF100 and GF110 on the other hand retain the 1:4 (16:64) ratio. Ultimately at equal clocks GF104 and GF110 differ widely in shading, but with 64 texture units total in both designs, both have equal texturing performance.
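The ratios quoted above follow from simple per-SM arithmetic, sketched below. The figure of 8 texture units per GF104 SM is inferred from "doubled up" relative to GF100's 4, and the function and constant names are illustrative.

```python
from fractions import Fraction

TEXELS_PER_UNIT_PER_CLOCK = 4  # 32bit/INT8 rate per texture unit
SHADER_CLOCK_MULTIPLIER = 2    # shader clock runs at 2x the base clock


def texture_to_compute_ratio(texture_units_per_sm: int, cuda_cores_per_sm: int):
    """Return (texels, instructions, ratio) per SM per base clock."""
    texels = texture_units_per_sm * TEXELS_PER_UNIT_PER_CLOCK
    instructions = cuda_cores_per_sm * SHADER_CLOCK_MULTIPLIER
    return texels, instructions, Fraction(texels, instructions)


# GF104: 8 texture units (assumed from "doubled up") and 48 CUDA cores per SM.
GF104_RATIO = texture_to_compute_ratio(8, 48)
# GF100 / GF110: 4 texture units and 32 CUDA cores per SM.
GF110_RATIO = texture_to_compute_ratio(4, 32)

if __name__ == "__main__":
    print("GF104:", GF104_RATIO)
    print("GF110:", GF110_RATIO)
```

This reproduces the 32:96 (1:3) figure for GF104 and the 16:64 (1:4) figure for GF100 and GF110.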
Moving on, GF110’s second trick is brand-new to GF110, and it goes hand-in-hand with NVIDIA’s focus on tessellation: improved Z-culling. As a quick refresher, Z-culling is a method of improving GPU performance by throwing out, early in the rendering process, pixels that will never be seen. By comparing the depth and transparency of a new pixel to existing pixels in the Z-buffer, it’s possible to determine whether that pixel will be seen or not; pixels that fall behind other opaque objects are discarded rather than rendered any further, saving on compute and memory resources. GPUs have had this feature for ages, and after a spurt of development early last decade under branded names such as HyperZ (AMD) and Lightspeed Memory Architecture (NVIDIA), Z-culling hasn’t been promoted in great detail since then.
Z-Culling In Action: Not Rendering What You Can't See
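The basic depth test described above can be sketched in a few lines. This is a simplified software illustration, assuming opaque fragments only and a convention where smaller depth values are closer to the viewer; it ignores transparency and is not a model of how any GPU implements the test in hardware. The `z_cull` function name is hypothetical.

```python
def z_cull(fragments, width, height, far=1.0):
    """Split opaque fragments (x, y, depth) into those shaded and those culled.

    A fragment is culled when something at the same pixel is already
    at least as close to the viewer (smaller depth means closer).
    """
    z_buffer = [[far] * width for _ in range(height)]
    shaded, culled = [], []
    for x, y, depth in fragments:
        if depth >= z_buffer[y][x]:
            culled.append((x, y, depth))   # hidden: skip all further work
        else:
            z_buffer[y][x] = depth         # new closest surface at this pixel
            shaded.append((x, y, depth))
    return shaded, culled


# Four fragments landing on a 2x1 image, in submission order.
FRAGMENTS = [(0, 0, 0.3), (0, 0, 0.7), (1, 0, 0.5), (1, 0, 0.2)]
SHADED, CULLED = z_cull(FRAGMENTS, width=2, height=1)

if __name__ == "__main__":
    print("shaded:", SHADED)
    print("culled:", CULLED)
```

Note that the result depends on submission order: the fragment at depth 0.5 is shaded because the closer fragment at the same pixel arrives after it, so only work on fragments behind already-drawn geometry is saved.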
For GF110 this is changing somewhat as Z-culling is once again being brought back to the surface, although not with the zeal of past efforts. NVIDIA has improved the efficiency of the Z-cull units in their raster engine, allowing them to retire additional pixels that were not caught in the previous iteration of their Z-cull unit. Without getting too deep into details, internal rasterizing and Z-culling take place in groups of pixels called tiles; we don’t believe NVIDIA has reduced the size of their tiles (which Beyond3D estimates at 4x2); instead we believe NVIDIA has done something to better reject individual pixels within a tile. NVIDIA hasn’t come forth with too many details beyond the fact that their new Z-cull unit supports “finer resolution occluder tracking”, so this will have to remain a mystery for another day.
In any case, the importance of this improvement is that it’s particularly weighted towards small triangles, which are fairly rare in traditional rendering setups but can be extremely common with heavily tessellated images. Or in other words, improving their Z-cull unit primarily serves to improve their tessellation performance by allowing NVIDIA to better reject pixels on small triangles. This should offer some benefit even in games with fewer, larger triangles, but as framed by NVIDIA the benefit is likely less pronounced.
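NVIDIA has not described how the new Z-cull unit works, so the following is only a generic illustration of why tracking occluders at a finer resolution can reject more pixels on small triangles. It uses a textbook conservative test (reject a fragment if it lies behind the farthest stored depth in its block), with the 4x2 tile size taken from the Beyond3D estimate mentioned above; the block sizes, depth values, and function names are all hypothetical.

```python
def block_max_depths(tile, block_w, block_h):
    """Farthest stored depth within each block_w x block_h block of a tile."""
    maxima = {}
    for y, row in enumerate(tile):
        for x, depth in enumerate(row):
            key = (x // block_w, y // block_h)
            maxima[key] = max(maxima.get(key, float("-inf")), depth)
    return maxima


def conservative_rejects(tile, fragments, block_w, block_h):
    """Count fragments (x, y, depth) that lie behind everything in their block."""
    maxima = block_max_depths(tile, block_w, block_h)
    return sum(
        1
        for x, y, depth in fragments
        if depth > maxima[(x // block_w, y // block_h)]
    )


# A 4x2 tile: the left half is covered by a near occluder, the right half is not.
TILE = [
    [0.2, 0.2, 0.9, 0.9],
    [0.2, 0.2, 0.9, 0.9],
]
# A small triangle covering two pixels, hidden behind the near occluder.
SMALL_TRIANGLE = [(0, 0, 0.5), (1, 0, 0.5)]

COARSE = conservative_rejects(TILE, SMALL_TRIANGLE, block_w=4, block_h=2)
FINE = conservative_rejects(TILE, SMALL_TRIANGLE, block_w=2, block_h=2)

if __name__ == "__main__":
    print("rejected with one value per tile:", COARSE)
    print("rejected with one value per 2x2 block:", FINE)
```

With a single depth value for the whole tile, the far pixels on the right keep the hidden fragments alive; with a value per 2x2 block, both are rejected early. Whether GF110 does anything resembling this is an open question.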
In the end these are probably the most aggressive changes NVIDIA could make in such a short period of time. Considering the GF110 project really only kicked off in earnest in February, NVIDIA only had around half a year to tinker with the design before it had to be taped out. As GPUs get larger and more complex, the amount of tweaking that can get done inside such a short window is going to continue to shrink – and this is a far cry from the days where we used to get major GPU refreshes inside of a year.
160 Comments
Taft12 - Tuesday, November 9, 2010 - link
In this article, Ryan does exactly what you are accusing him of not doing! It is you who need to be asked WTF is wrong
Iketh - Thursday, November 11, 2010 - link
ok EVERYONE belonging to this thread is on CRACK... what other option did AMD have to name the 68xx? If they named them 67xx, the differences between them and 57xx are too great. They use nearly as little power as 57xx yet the performance is 1.5x or higher!!!
im a sucker for EFFICIENCY... show me significant gains in efficiency and i'll bite, and this is what 68xx handily brings over 58xx
the same argument goes for 480-580... AT, show us power/performance ratios between generations on each side, then everyone may begin to understand the naming
i'm sorry to break it to everyone, but this is where the GPU race is now, in efficiency, where it's been for cpus for years
MrCommunistGen - Tuesday, November 9, 2010 - link
Just started reading the article and I noticed a couple of typos on p1.
"But before we get to deep in to GF110" --> "but before we get TOO deep..."
Also, the quote at the top of the page was placed inside of a paragraph which was confusing.
I read: "Furthermore GTX 480 and GF100 were clearly not the" and I thought: "the what?". So I continued and read the quote, then realized that the paragraph continued below.
MrCommunistGen - Tuesday, November 9, 2010 - link
well I see that the paragraph break has already been fixed...
ahar - Tuesday, November 9, 2010 - link
Also, on page 2 if Ryan is talking about the lifecycle of one process then "...the processes’ lifecycle." is wrong.
Aikouka - Tuesday, November 9, 2010 - link
I noticed the remark on Bitstreaming and it seems like a logical choice *not* to include it with the 580. The biggest factor is that I don't think the large majority of people actually need/want it. While the 580 is certainly quieter than the 480, it's still relatively loud and extraneous noise is not something you want in a HTPC. It's also overkill for a HTPC, which would delegate the feature to people wanting to watch high-definition content on their PC through a receiver, which probably doesn't happen much.
I'd assume the feature could've been "on the board" to add, but would've probably been at the bottom of the list and easily one of the first features to drop to either meet die size (and subsequently, TDP/Heat) targets or simply to hit their deadline. I certainly don't work for nVidia so it's really just pure speculation.
therealnickdanger - Tuesday, November 9, 2010 - link
I see your points as valid, but let me counterpoint with 3-D. I think NVIDIA dropped the ball here in the sense that there are two big reasons to have a computer connected to your home theater: games and Blu-ray. I know a few people that have 3-D HDTVs in their homes, but I don't know anyone with a 3-D HDTV and a 3-D monitor.
I realize how niche this might be, but if the 580 supported bitstreaming, then it would be perfect card for anyone that wants to do it ALL. Blu-ray, 3-D Blu-Ray, any game at 1080p with all eye-candy, any 3-D game at 1080p with all eye-candy. But without bitstreaming, Blu-ray is moot (and mute, IMO).
For a $500+ card, it's just a shame, that's all. All of AMD's high-end cards can do it.
QuagmireLXIX - Sunday, November 14, 2010 - link
Well said. There are quite a few fixes that make the 580 what I wanted in March, but the lack of bitstream is still a hard hit for what I want my PC to do.
Call me niche.
QuagmireLXIX - Sunday, November 14, 2010 - link
Actually, this is killing me. I waited for the 480 in March b4 pulling the trigger on a 5870 because I wanted HDMI to a Denon 3808 and the 480 totally dropped the ball on the sound aspect (S/PDIF connector and limited channels and all). I figured no big deal, it is a gamer card after all, so 5870 HDMI I went.
The thing is, my PC is all-in-one (HTPC, Game & typical use). The noise and temps are not a factor as I watercool. When I read that HDMI audio got internal on the 580, I thought, finally. Then I read Guru's article and seen bitstream was hardware supported and just a driver update away, I figured I was now back with the green team since 8800GT.
Now Ryan (thanks for the truth, I guess :) counters Gurus bitstream comment and backs it up with direct communication with NV. This blows, I had a lofty multimonitor config in mind and no bitstream support is a huge hit. I'm not even sure if I should spend the time to find out if I can arrange the monitor setup I was thinking.
Now I might just do a HTPC rig and Game rig or see what 6970 has coming. Eyefinity has an advantage for multiple monitors, but the display-port puts a kink in my designs also.
Mr Perfect - Tuesday, November 9, 2010 - link
So where do they go from here? Disable one SM again and call it a GTX570? GF104 is to new to replace, so I suppose they'll enable the last SM on it for a GTX560.