Here's a response to a question about graphics card performance that I answered on
a message board. It's from 2004, but it should still be a good starting point for
understanding graphics card performance with hardware transform and lighting.
Q: If I draw 1000 polygons per frame, what frame rates should I get on card
X? Is it better than card Y?
If you draw 1000 polygons, and they all fill large parts of the screen (perhaps
even with blending), then no card will be able to do that at any appreciable
frame rate.
If you draw 1000 polygons, and you have depth test/write turned off, and
they're all opaque and cover only a single pixel, your frame rate should be
limited by things like just getting the data to the card, rather than the
number of polygons.
In brief, here are descriptions of some different bottlenecks:
Pixel write rate ("fill rate") measured as how many pixels on the screen can
be written to per second. This may vary somewhat with the pattern you write
in (mostly horizontal polygons _may_ draw faster than vertical, or whatever).
If you touch the same pixel on the screen many times, you have a high
overdraw count, and you are likely fill rate limited. You can measure whether
you're fill rate limited by making the window tiny (160x120 or so); if frame
rate improves, then you're probably fill rate limited.
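To get a feel for the arithmetic, here's a minimal sketch of the fill-rate budget. The 500 Mpixel/s fill rate and the overdraw factor of 4 are made-up illustration numbers, not the figures for any particular card:

```c
#include <assert.h>

/* Frame rate you can hit if fill rate is the only limit: fill rate
   divided by the number of pixels written per frame (screen area times
   average overdraw). */
static double fill_limited_fps(double mpixels_per_sec,
                               int width, int height, double overdraw)
{
    double pixels_per_frame = (double)width * height * overdraw;
    return (mpixels_per_sec * 1e6) / pixels_per_frame;
}
```

At 1024x768 with 4x overdraw and 500 Mpixel/s, this caps out near 159 fps; shrink the window to 160x120 and the same math predicts thousands of fps, which is exactly why the tiny-window test works as a diagnostic.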
Pixel blend rate (no good name) measured as how fast you can read back pixels
to apply a second texture pass on them (and then write them back to the
framebuffer again). Note that going high on utilization of this number
probably will cut your raw fill rate substantially, as the framebuffer memory
subsystem has to switch direction a lot, although it's also likely that
there's special hardware on a modern card intended to make this additional
cost as small as possible. Blend is usually implemented very close to the
framebuffer memory, and may turn out to be free or almost-free on modern
implementations. This is, by the way, the reason you can't blend into
floating-point framebuffers, nor generally use the framebuffer as a texture
source while rendering to it.
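One way to see why blending eats into raw fill rate: a blend is a read-modify-write, so it touches roughly twice the framebuffer bytes per pixel. A sketch, assuming fill rate is purely memory-bandwidth limited (the 8 GB/s figure is a made-up example):

```c
#include <assert.h>

/* Framebuffer bytes touched per pixel: an opaque write just stores the
   new color; a blend must read the old color first, then write back. */
static int fb_bytes_per_pixel(int color_bytes, int blended)
{
    return blended ? 2 * color_bytes : color_bytes;
}

/* Megapixels per second sustainable at a given framebuffer bandwidth,
   ignoring Z traffic and any read/write turnaround penalty. */
static double bandwidth_limited_mpixels(double gbytes_per_sec,
                                        int color_bytes, int blended)
{
    return gbytes_per_sec * 1e9
         / (double)fb_bytes_per_pixel(color_bytes, blended) / 1e6;
}
```

Under this model, turning on blending halves the pixel rate; real hardware will deviate from that in both directions, for the reasons above.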
Texturing rate. How many texels can be read per second? Using MIP mapping may
substantially reduce your consumption of texturing rate. Depending on how
memory is banked on the card, using high refresh rates or large screen spaces
may reduce your texture fetch rate. Making sure the card stores textures in
16 bit mode (instead of 32) will double the amount of texels you can use in a
second. Using DXT1 compression will further compress it by 4x.
Triangle setup rate. How many triangles can physically be started on the card
per second, assuming you already have all coordinates transformed. As an
example, I've seen a TNT2 do 18 million triangles per second, when they're
all minuscule and come out of a display list, and the transform matrices
haven't changed since the list was compiled. Jitter MODELVIEW just a little
bit between each frame, and the same card/driver is down at 1.5
million/second (depending on what features you use).
Vertex transform rate ("polygons per second"). How many GL verts can the card
(or driver) process per second? Divide by three, and you get the polygons per
second rate. Well, except that there's a primitive called triangle strips,
where you only need to submit a single vertex to get another triangle on the
screen, once the first triangle has been drawn.
There's also the issue of a vertex transform cache, which can make regular
triangle lists perform on par with triangle strips. In fact, you can get to
the point of getting one triangle on screen per 0.6 transformed vertices, for
the right mesh, and the right cache. Guess which primitive the graphics
vendors use when they publish "polygons per second" performance numbers?
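The vertices-per-triangle accounting above is worth doing explicitly; here's a sketch (the regular-grid formula is the idealized best case, assuming a post-transform cache big enough that each vertex is transformed exactly once):

```c
#include <assert.h>

/* A plain triangle list always costs 3 vertices per triangle. */
static double verts_per_tri_list(void)
{
    return 3.0;
}

/* A strip of n triangles costs n + 2 vertices, so the per-triangle
   cost approaches 1 as the strip gets long. */
static double verts_per_tri_strip(int tris)
{
    return (tris + 2.0) / tris;
}

/* An indexed (w+1) x (h+1) vertex grid yields 2*w*h triangles; with a
   perfect post-transform cache each vertex is transformed once, giving
   about 0.5 transforms per triangle in the limit. Real, finite caches
   land a bit above that, which is where figures like 0.6 come from. */
static double verts_per_tri_grid(int w, int h)
{
    return (double)(w + 1) * (h + 1) / (2.0 * w * h);
}
```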
Note that some cards (GeForce and most Radeons) actually do vertex transform
in hardware on the card, and don't burden the CPU with it, so the CPU is
available to prepare the next batch of graphics. Other cards do not have
hardware transform and lighting, and tie up the CPU until the transform is
done: Intel Extreme of all flavors, the Matroxen before the Parhelia, the
NVIDIA TNT line, the ATI Rage line and some low-end Radeons, the Voodoo line
(remember those?), the S3 Savage line, built-in Intel or SiS graphics, and
pretty much most other consumer-grade graphics cards.
For cards with hardware transform, how you order your indexed primitives actually
matters as well, because the post-transform vertex cache will be more or less
effective depending on that ordering.
Texture upload rate. If your working set is bigger than the amount of memory
available on the card, just slurping textures from host system RAM onto the
card is going to take up a lot of time. Typically, this is measured as "AGP
1x", "AGP 2x" or "AGP 4x", although there are somewhat meaningful differences
between cards even within these speed classes. (Additionally, AGP 8x, and PCI
Express 8x and 16x, are now available speeds.)
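A quick sanity check on why upload rate matters, ignoring setup overhead and assuming you actually reach the bus's peak rate (you usually won't):

```c
#include <assert.h>

/* Milliseconds to stream a working set of the given size over a bus
   with the given peak bandwidth. AGP 4x peaks around 1.0 GB/s. */
static double upload_ms(double mbytes, double bus_gb_per_sec)
{
    return mbytes * 1e6 / (bus_gb_per_sec * 1e9) * 1000.0;
}
```

Streaming 64 MB of textures over AGP 4x costs about 64 ms even at the theoretical peak, roughly four whole frames at 60 Hz, which is why a working set that doesn't fit on the card hurts so badly.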
Vertex data transfer rate. For hardware transform & lighting cards, the
vertex data needs to be transferred from the host system RAM (where you
provide it) to the card. This transfer usually competes with the "texture
upload" budget, but it may also have other constraints. For example, if you
provide the data in locked, contiguous, uncacheable memory that has been
pre-prepared by the driver, there is almost no set-up cost to initiate the
transfer (this is the intention behind the NV_vertex_array_range and
ATI_vertex_buffer extensions). If you just malloc() some memory and call
glVertexPointer(), the driver has to either page in, lock, cache-flush (on
some systems) and map the memory, OR it has to copy the data into some
pre-prepared area it has previously allocated for that purpose. If I were a
driver writer, I'd do the second, as it's likely to be faster overall, but it
introduces at least one extra copy step. If you do glVertex(), chances are
that the driver may poke data at the card using programmed I/O, one vertex at
a time, for really slug-like performance -- or maybe it'll accumulate all the
vertices into some buffer, and send it all at glEnd(). Host transform systems
are more likely to do the former; hardware transform, the latter. Hopefully,
the Vertex Buffer Object extension will make management of these issues
easier, but that requires that you pay attention to the usage hints you
give it.
Also, some cards fetch data using specific memory line sizes; if your vertices
are aligned on such memory line sizes, you may avoid fetching unnecessary
memory lines and thus get better throughput. Thus, there may be some gains in
making a vertex a power of two in size (say, 16 or 32 bytes) and aligning it
on that same power of two in your memory buffer.
Last, some vertex formats are more efficient than others. GL_SHORT and GL_FLOAT
are usually supported by hardware; most other types have spottier support.
For colors, the format to use is usually GL_UNSIGNED_BYTE, and GL_BGRA external
format. Also, the amount of interleave matters. Some cards can only deal with
one or two vertex array streams; if you interleave each array separately, you
won't get very good performance because the driver needs to re-pack vertices
to a format it likes. Try packing everything into a single interleaved struct
and specify each channel using the "stride" parameter.
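Here's a sketch of such an interleaved vertex, illustrating both pieces of advice above: everything packed in one struct, padded to a power-of-two size, with the struct size serving as the stride you'd pass to glVertexPointer(), glNormalPointer() and glColorPointer() (the field layout here is just one plausible choice, not a required format):

```c
#include <assert.h>
#include <stddef.h>

/* One interleaved vertex: position, normal and color in a single
   struct, padded out to 32 bytes so each vertex starts on the same
   memory-line-friendly boundary. */
typedef struct Vertex {
    float pos[3];           /* 12 bytes, offset 0  */
    float normal[3];        /* 12 bytes, offset 12 */
    unsigned char color[4]; /*  4 bytes, offset 24: GL_UNSIGNED_BYTE */
    unsigned char pad[4];   /*  4 bytes of padding, for 32 total     */
} Vertex;
```

With an array of these, each gl*Pointer() call gets stride sizeof(Vertex) and a pointer to the first element's respective field, so the driver sees one contiguous stream it can hand to the card without re-packing.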
Program Complexity. If you're using hardware shaders (vertex or fragment
programs) then the more complex the program, and the more indirection and
texture accesses are involved, the more varied the performance will be
depending on the suitability to the underlying hardware.
For hardware fragment shading (NV_register_combiner, ATI_fragment_shader and
ARB_fragment_shader, as well as GLSL and Cg high-level shaders) the number of
operations you apply to each fragment will have an impact on fill rate. Thus,
if you have high overdraw, expensive shaders are likely to cut your frame
rate a lot. You can make this better by first laying down the Z buffer
without writing to RGBA at all; then switch off Z buffer writing and set Z
testing to LEQUAL, and re-render the scene; this will make sure that each
pixel only gets touched once by an expensive shader (all the cards capable of
running expensive shaders implement some kind of early Z testing that avoids
fragment processing for pixels which will be Z discarded).
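The payoff of the Z-first trick is easy to quantify in the idealized case, assuming early Z rejection works perfectly and ignoring the (cheap) cost of the Z-only pass itself:

```c
#include <assert.h>

/* Fragments that run the expensive shader in one frame. Without a
   depth pre-pass, every covered fragment is shaded, so cost scales
   with overdraw; with the pre-pass and early Z, each visible pixel is
   shaded roughly once. */
static double shaded_fragments(double pixels, double overdraw, int prepass)
{
    return prepass ? pixels : pixels * overdraw;
}
```

At 1024x768 with 4x overdraw, that's about 3.1 million shader invocations per frame without the pre-pass versus about 0.8 million with it, a 4x saving on fragment shading work.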
These are all factors in how "fast" a graphics card performs. Different
applications exercise these bottlenecks differently, and may thus come to
different results on the same hardware. Also, there's a whole level of
higher-level performance factors, such as whether you're using fog or not,
whether you're using specular lighting or not, how many lights you're using,
whether you're using the depth buffer, stencil buffer, destination alpha,
single-texturing, multi-texturing, or multi-pass rendering, ...
Remember: if you think you need to optimize, PROFILE FIRST and then fix
the bottlenecks the profile points at. If you're fill rate limited, optimizing
vertex transfer is going to give you only marginal benefits. Always measure
before- and after- results. If you don't make progress, you may wish to back out
the change, and try something else. If you don't know how to profile, then you
should learn; it's the second most important skill for a good software developer
(the most important skill is how to debug programs).