Here's a response to a question about graphics card performance that I answered on
a message board. It's from 2004, but it should still be a good starting point for
understanding graphics card performance with hardware transform and lighting.
Q: If I draw 1000 polygons per frame, what frame rates should I get on card
X? Is it better than card Y?
If you draw 1000 polygons, and they all fill large parts of the screen (perhaps
even with blending), then no card will be able to do that at any appreciable
frame rate.
If you draw 1000 polygons, and you have depth test/write turned off, and
they're all opaque and cover only a single pixel, your frame rate should be
limited by things like just getting the data to the card, rather than the
number of polygons.
In brief, here are descriptions of some different bottlenecks:
Pixel write rate ("fill rate") measured as how many pixels on the screen can
be written to per second. This may vary somewhat with the pattern you write
in (mostly horizontal polygons _may_ draw faster than vertical, or whatever).
If you touch the same pixel on the screen many times, you have a high
overdraw count, and you are likely fill rate limited. You can measure whether
you're fill rate limited by making the window tiny (160x120 or so); if frame
rate improves, then you're probably fill rate limited.
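To get a feel for the arithmetic, here's a minimal sketch of the fill-rate budget. The 500 Mpixel/s fill rate and the overdraw factor of 4 are made-up illustration numbers, not the figures for any particular card:

```c
#include <assert.h>

/* Frame rate you can hit if fill rate is the only limit: fill rate
   divided by the number of pixels written per frame (screen area times
   average overdraw). */
static double fill_limited_fps(double mpixels_per_sec,
                               int width, int height, double overdraw)
{
    double pixels_per_frame = (double)width * height * overdraw;
    return (mpixels_per_sec * 1e6) / pixels_per_frame;
}
```

At 1024x768 with 4x overdraw and 500 Mpixel/s, this caps out near 159 fps; shrink the window to 160x120 and the same math predicts thousands of fps, which is exactly why the tiny-window test works as a diagnostic.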
Pixel blend rate (no good name) measured as how fast you can read back pixels
to apply a second texture pass on them (and then write them back to the
framebuffer again). Note that going high on utilization of this number
probably will cut your raw fill rate substantially, as the framebuffer memory
subsystem has to switch direction a lot, although it's also likely that
there's special hardware on a modern card intended to make this additional
cost as small as possible. Blend is usually implemented very close to the
framebuffer memory, and may turn out to be free or almost-free on modern
implementations. This is, by the way, the reason you can't blend into
floating-point framebuffers, nor generally use the framebuffer as a texture
source while rendering to it.
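One way to see why blending eats into raw fill rate: a blend is a read-modify-write, so it touches roughly twice the framebuffer bytes per pixel. A sketch, assuming fill rate is purely memory-bandwidth limited (the 8 GB/s figure is a made-up example):

```c
#include <assert.h>

/* Framebuffer bytes touched per pixel: an opaque write just stores the
   new color; a blend must read the old color first, then write back. */
static int fb_bytes_per_pixel(int color_bytes, int blended)
{
    return blended ? 2 * color_bytes : color_bytes;
}

/* Megapixels per second sustainable at a given framebuffer bandwidth,
   ignoring Z traffic and any read/write turnaround penalty. */
static double bandwidth_limited_mpixels(double gbytes_per_sec,
                                        int color_bytes, int blended)
{
    return gbytes_per_sec * 1e9
         / (double)fb_bytes_per_pixel(color_bytes, blended) / 1e6;
}
```

Under this model, turning on blending halves the pixel rate; real hardware will deviate from that in both directions, for the reasons above.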
Texturing rate. How many texels can be read per second? Using MIP mapping may
substantially reduce your consumption of texturing rate. Depending on how
memory is banked on the card, using high refresh rates or large screen spaces
may reduce your texture fetch rate. Making sure the card stores textures in
16 bit mode (instead of 32) will double the amount of texels you can use in a
second. Using DXT1 compression will further compress it by 4x.
Triangle setup rate. How many triangles can physically be started on the card
per second, assuming you already have all coordinates transformed. As an
example, I've seen a TNT2 do 18 million triangles per second, when they're
all minuscule and come out of a display list, and the transform matrices
haven't changed since the list was compiled. Jitter MODELVIEW just a little
bit between each frame, and the same card/driver is down at 1.5
million/second (depending on what features you use).
Vertex transform rate ("polygons per second"). How many GL verts can the card
(or driver) process per second? Divide by three, and you get the polygons per
second rate. Well, except that there's a primitive called triangle strips,
where you only need to submit a single vertex to get another triangle on the
screen, once the first triangle has been drawn.
There's also the issue of a vertex transform cache, which can make regular
triangle lists perform on par with triangle strips. In fact, you can get to
the point of getting one triangle on screen per 0.6 transformed vertices, for
the right mesh, and the right cache. Guess which primitive the graphics
vendors use when they publish "polygons per second" performance numbers?
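The vertices-per-triangle accounting above is worth doing explicitly; here's a sketch (the regular-grid formula is the idealized best case, assuming a post-transform cache big enough that each vertex is transformed exactly once):

```c
#include <assert.h>

/* A plain triangle list always costs 3 vertices per triangle. */
static double verts_per_tri_list(void)
{
    return 3.0;
}

/* A strip of n triangles costs n + 2 vertices, so the per-triangle
   cost approaches 1 as the strip gets long. */
static double verts_per_tri_strip(int tris)
{
    return (tris + 2.0) / tris;
}

/* An indexed (w+1) x (h+1) vertex grid yields 2*w*h triangles; with a
   perfect post-transform cache each vertex is transformed once, giving
   about 0.5 transforms per triangle in the limit. Real, finite caches
   land a bit above that, which is where figures like 0.6 come from. */
static double verts_per_tri_grid(int w, int h)
{
    return (double)(w + 1) * (h + 1) / (2.0 * w * h);
}
```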
Note that some cards (GeForce and most Radeons) actually do vertex transform
in hardware on the card, and don't burden the CPU with it, so the CPU is
available to prepare the next batch of graphics. Other cards do not have
hardware transform and lighting, and tie up the CPU until the transform is
done: Intel Extreme of all flavors, the Matroxen before the Parhelia, the
NVIDIA TNT line, the ATI Rage line and some low-end Radeons, the Voodoo line
(remember those?), the S3 Savage line, built-in Intel or SiS graphics, and
pretty much most other consumer-grade graphics cards.
For cards with hardware transform, how you order your indexed primitives actually
matters as well, because the post-transform vertex cache will be more or less
effective depending on that ordering.
Texture upload rate. If your working set is bigger than the amount of memory
available on the card, just slurping textures from host system RAM onto the
card is going to take up a lot of time. Typically, this is measured as "AGP
1x", "AGP 2x" or "AGP 4x", although there are somewhat meaningful differences
between cards even within these speed classes. (Additionally, AGP 8x, and PCI
Express 8x and 16x, are now available speeds.)
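A quick sanity check on why upload rate matters, ignoring setup overhead and assuming you actually reach the bus's peak rate (you usually won't):

```c
#include <assert.h>

/* Milliseconds to stream a working set of the given size over a bus
   with the given peak bandwidth. AGP 4x peaks around 1.0 GB/s. */
static double upload_ms(double mbytes, double bus_gb_per_sec)
{
    return mbytes * 1e6 / (bus_gb_per_sec * 1e9) * 1000.0;
}
```

Streaming 64 MB of textures over AGP 4x costs about 64 ms even at the theoretical peak, roughly four whole frames at 60 Hz, which is why a working set that doesn't fit on the card hurts so badly.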
Vertex data transfer rate. For hardware transform & lighting cards, the
vertex data needs to be transferred from the host system RAM (where you
provide it) to the card. This transfer usually competes with the "texture
upload" budget, but it may also have other constraints. For example, if you
provide the data in locked, contiguous, uncacheable memory that has been
pre-prepared by the driver, there is almost no set-up cost to initiate the
transfer (this is the intention behind the NV_vertex_array_range and
ATI_vertex_buffer extensions). If you just malloc() some memory and call
glVertexPointer(), the driver has to either page in, lock, cache-flush (on
some systems) and map the memory, OR it has to copy the data into some
pre-prepared area it has previously allocated for that purpose. If I were a
driver writer, I'd do the second, as it's likely to be faster overall, but it
introduces at least one extra copy step. If you do glVertex(), chances are
that the driver may poke data at the card using programmed I/O, one vertex at
a time, for really slug-like performance -- or maybe it'll accumulate all the
vertices into some buffer, and send it all at glEnd(). Host transform systems
are more likely to do the former; hardware transform, the latter. Hopefully,
the Vertex Buffer Object extension will make management of these issues
easier, but that requires that you pay attention to the usage hints you
give it.
Also, some cards fetch data using specific memory line sizes; if your vertices
are aligned on such memory line sizes, you may avoid fetching unnecessary
memory lines and thus get better throughput. Thus, there may be some gains in
making a vertex a power of two in size (say, 16 or 32 bytes) and aligning it
on that same power of two in your memory buffer.
Last, some vertex formats are more efficient than others. GL_SHORT and GL_FLOAT
are usually supported by hardware; most other types have spottier support.
For colors, the format to use is usually GL_UNSIGNED_BYTE, and GL_BGRA external
format. Also, the amount of interleave matters. Some cards can only deal with
one or two vertex array streams; if you interleave each array separately, you
won't get very good performance because the driver needs to re-pack vertices
to a format it likes. Try packing everything into a single interleaved struct
and specify each channel using the "stride" parameter.
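Here's a sketch of such an interleaved vertex, illustrating both pieces of advice above: everything packed in one struct, padded to a power-of-two size, with the struct size serving as the stride you'd pass to glVertexPointer(), glNormalPointer() and glColorPointer() (the field layout here is just one plausible choice, not a required format):

```c
#include <assert.h>
#include <stddef.h>

/* One interleaved vertex: position, normal and color in a single
   struct, padded out to 32 bytes so each vertex starts on the same
   memory-line-friendly boundary. */
typedef struct Vertex {
    float pos[3];           /* 12 bytes, offset 0  */
    float normal[3];        /* 12 bytes, offset 12 */
    unsigned char color[4]; /*  4 bytes, offset 24: GL_UNSIGNED_BYTE */
    unsigned char pad[4];   /*  4 bytes of padding, for 32 total     */
} Vertex;
```

With an array of these, each gl*Pointer() call gets stride sizeof(Vertex) and a pointer to the first element's respective field, so the driver sees one contiguous stream it can hand to the card without re-packing.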
Program Complexity. If you're using hardware shaders (vertex or fragment
programs) then the more complex the program, and the more indirection and
texture accesses are involved, the more varied the performance will be
depending on the suitability to the underlying hardware.
For hardware fragment shading (NV_register_combiner, ATI_fragment_shader and
ARB_fragment_shader, as well as GLSL and Cg high-level shaders) the number of
operations you apply to each fragment will have an impact on fill rate. Thus,
if you have high overdraw, expensive shaders are likely to cut your frame
rate a lot. You can make this better by first laying down the Z buffer
without writing to RGBA at all; then switch off Z buffer writing and set Z
testing to LEQUAL, and re-render the scene; this will make sure that each
pixel only gets touched once by an expensive shader (all the cards capable of
running expensive shaders implement some kind of early Z testing that avoids
fragment processing for pixels which will be Z discarded).
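The payoff of the Z-first trick is easy to quantify in the idealized case, assuming early Z rejection works perfectly and ignoring the (cheap) cost of the Z-only pass itself:

```c
#include <assert.h>

/* Fragments that run the expensive shader in one frame. Without a
   depth pre-pass, every covered fragment is shaded, so cost scales
   with overdraw; with the pre-pass and early Z, each visible pixel is
   shaded roughly once. */
static double shaded_fragments(double pixels, double overdraw, int prepass)
{
    return prepass ? pixels : pixels * overdraw;
}
```

At 1024x768 with 4x overdraw, that's about 3.1 million shader invocations per frame without the pre-pass versus about 0.8 million with it, a 4x saving on fragment shading work.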
These are all factors in how "fast" a graphics card performs. Different
applications exercise these bottlenecks differently, and may thus come to
different results on the same hardware. Also, there's a whole level of
higher-level performance factors, such as whether you're using fog or not,
whether you're using specular lighting or not, how many lights you're using,
whether you're using the depth buffer, stencil buffer, destination alpha,
single-texturing, multi-texturing, or multi-pass rendering, ...
Remember: if you think you need to optimize, PROFILE FIRST and then fix
the bottlenecks the profile points at. If you're fill rate limited, optimizing
vertex transfer is going to give you only marginal benefits. Always measure
before- and after- results. If you don't make progress, you may wish to back out
the change, and try something else. If you don't know how to profile, then you
should learn; it's the second most important skill for a good software developer
(the most important skill is how to debug programs).