Here's a response to a question about graphics card performance I answered on
the OpenGL message board. It's from 2004, but should still be a good start to
understand graphics card performance using hardware transform and lighting.

Q: If I draw 1000 polygons per frame, what frame rates should I get on card
X? Is it better than card Y?
 
A:
If you draw 1000 polygons, and they all fill large parts of the screen (perhaps
with blending, even) then no card will be able to do that at any appreciable
speed.
 
If you draw 1000 polygons, and you have depth test/write turned off, and
they're all opaque and cover only a single pixel, your frame rate should be
limited by things like just getting the data to the card, rather than the
number of polygons.
 
In brief, here are descriptions of some different bottlenecks:
 
Pixel write rate ("fill rate") measured as how many pixels on the screen can
  be written to per second. This may vary somewhat with the pattern you write
  in (mostly horizontal polygons _may_ draw faster than vertical, or whatever).
  If you touch the same pixel on the screen many times, you have a high
  overdraw count, and you are likely fill rate limited. You can measure whether
  you're fill rate limited by making the window tiny (160x120 or so); if frame
  rate improves, then you're probably fill rate limited.
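As a rough illustration of the fill rate item above, here is a minimal C sketch of the arithmetic. The function names and the `min_speedup` threshold are made up for this example; they are not part of any API, and the threshold you pick is a judgment call.

```c
#include <assert.h>

/* Pixels written per second for a given window size, average overdraw
   (how many times each pixel is touched per frame) and frame rate. */
double required_fill_rate(int width, int height, double overdraw, double fps)
{
    assert(width > 0 && height > 0 && overdraw >= 0.0 && fps >= 0.0);
    return (double)width * (double)height * overdraw * fps;
}

/* Tiny-window test: if shrinking the window raises the frame rate by at
   least min_speedup (a hypothetical threshold, e.g. 1.5), the full-size
   window was probably fill rate limited. Returns 1 or 0. */
int looks_fill_limited(double fps_full_window, double fps_tiny_window,
                       double min_speedup)
{
    assert(fps_full_window > 0.0 && min_speedup > 1.0);
    return fps_tiny_window >= fps_full_window * min_speedup;
}
```

For example, a 1024x768 window with an average overdraw of 3 at 60 frames per second needs about 141.6 million pixel writes per second.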
Pixel blend rate (no good name) measured as how fast you can read back pixels
  to apply a second texture pass on them (and then write them back to the
  framebuffer again). Note that leaning heavily on blending will probably cut
  your raw fill rate substantially, as the framebuffer memory subsystem has to
  switch between reading and writing a lot, although it's also likely that
  there's special hardware on a modern card intended to make this additional
  cost as small as possible.  Blend is usually implemented very close to the
  framebuffer memory, and may turn out to be free or almost-free on modern
  implementations. This is, by the way, the reason you can't blend into
  floating-point framebuffers, nor generally use the framebuffer as a texturing
  source.
Texturing rate. How many texels can be read per second? Using MIP mapping may
  substantially reduce your consumption of texturing rate. Depending on how
  memory is banked on the card, using high refresh rates or large screen
  resolutions may reduce your texture fetch rate. Making sure the card stores
  textures in 16 bit mode (instead of 32) will double the number of texels you
  can use in a second. Using DXT1 compression (4 bits per texel) will shrink the
  data by a further 4x.
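The size ratios in the texturing item can be checked with a little arithmetic. This is only a sketch: the enum and function names are invented for the example, and it assumes the standard DXT1 layout of 8 bytes per 4x4 texel block.

```c
#include <assert.h>
#include <stddef.h>

typedef enum TexelFormat {
    TEXEL_RGBA8888,  /* 32 bits per texel */
    TEXEL_RGB565,    /* 16 bits per texel */
    TEXEL_DXT1       /* 8 bytes per 4x4 block = 4 bits per texel */
} TexelFormat;

/* Bytes occupied by one MIP level of the given size and format. */
size_t texture_level_bytes(size_t width, size_t height, TexelFormat format)
{
    assert(width > 0 && height > 0);
    switch (format) {
    case TEXEL_RGBA8888: return width * height * 4;
    case TEXEL_RGB565:   return width * height * 2;
    case TEXEL_DXT1:     return ((width + 3) / 4) * ((height + 3) / 4) * 8;
    }
    assert(!"unknown texel format");
    return 0;
}
```

A 256x256 level comes out at 262144 bytes in 32 bit mode, 131072 bytes in 16 bit mode, and 32768 bytes as DXT1, which matches the 2x and further 4x steps described above.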
Triangle setup rate. How many triangles can physically be started on the card
  per second, assuming you already have all coordinates transformed. As an
  example, I've seen a TNT2 do 18 million triangles per second, when they're
  all minuscule and come out of a display list, and the transform matrices
  haven't changed since the list was compiled. Jitter MODELVIEW just a little
  bit between each frame, and the same card/driver is down at 1.5
  million/second (depending on what features you use).
Vertex transform rate ("polygons per second"). How many GL verts can the card
  (or driver) process per second? Divide by three, and you get the polygons per
  second rate. Well, except that there's a primitive called triangle strips,
  where you only need to submit a single vertex to get another triangle on the
  screen, once the first triangle has been drawn.
  There's also the issue of a vertex transform cache, which can make regular
  triangle lists perform on par with triangle strips. In fact, you can get to 
  the point of getting one triangle on screen per 0.6 transformed vertices, for 
  the right mesh, and the right cache. Guess which primitive the graphics
  vendors use when they publish "polygons per second" performance numbers?
  Note that some cards (GeForce, most Radeons) actually do this in hardware on
  the card, and don't burden the CPU with it, so the CPU is available to
  prepare the next batch of graphics. Other cards do not have hardware
  transform and lighting, so they tie up the CPU until the transform is done.
  These include Intel Extreme of all flavors, the Matroxen before the Parhelia,
  the NVIDIA TNT line, the ATI Rage line, some low-end Radeons, the Voodoo line
  (remember those?), the S3 Savage line, built-in Intel or SiS graphics, and
  pretty much most other consumer-grade graphics cards.
  For cards with hardware transform, how you order your indexed primitives
  matters as well, because it determines how well the post-transform vertex
  cache is utilized.
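To make the vertex transform item more concrete, the sketch below counts triangles per submitted vertex for lists and strips, and simulates a post-transform vertex cache to count how many vertices actually get transformed for a given index order. The FIFO replacement policy, the cache sizes and all names are assumptions made for illustration; real hardware differs from card to card.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHE_SIZE 64

/* A triangle list needs three vertices per triangle. */
size_t triangles_in_list(size_t vertex_count)
{
    return vertex_count / 3;
}

/* A triangle strip adds one triangle per vertex after the first two. */
size_t triangles_in_strip(size_t vertex_count)
{
    return vertex_count >= 3 ? vertex_count - 2 : 0;
}

/* Counts cache misses (= vertices that must be transformed) when the
   index list is run through a FIFO cache holding cache_size entries. */
size_t count_vertex_transforms(const unsigned *indices, size_t index_count,
                               size_t cache_size)
{
    unsigned cache[MAX_CACHE_SIZE];
    size_t used = 0, next = 0, misses = 0, i, j;
    int hit;

    assert(cache_size >= 1 && cache_size <= MAX_CACHE_SIZE);
    for (i = 0; i < index_count; ++i) {
        hit = 0;
        for (j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) { hit = 1; break; }
        }
        if (!hit) {
            cache[next] = indices[i];        /* overwrite the oldest entry */
            next = (next + 1) % cache_size;
            if (used < cache_size) ++used;
            ++misses;
        }
    }
    return misses;
}
```

For two triangles sharing an edge, indexed as 0,1,2, 2,1,3, a four-entry cache transforms each of the four vertices once, while a one-entry cache ends up transforming five times.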
Texture upload rate. If your working set is bigger than the amount of memory
  available on the card, just slurping textures from host system RAM onto the
  card is going to take up a lot of time. Typically, this is measured as "AGP
  1x", "AGP 2x" or "AGP 4x" although there are somewhat meaningful differences
  between cards even within these speed classes. (AGP 8x, and PCI Express x8
  and x16, are now available as well.)
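To get a feel for the texture upload budget, here is a small sketch that turns a bus bandwidth into "textures per frame". The function name is hypothetical, and the bandwidth you pass in is your own assumption; real transfers will not reach the theoretical peak of the bus.

```c
#include <assert.h>

/* How many textures of texture_bytes each can cross the bus during one
   frame, given a sustained bus bandwidth and a target frame rate. */
double textures_per_frame(double bus_bytes_per_second, double fps,
                          double texture_bytes)
{
    assert(bus_bytes_per_second >= 0.0 && fps > 0.0 && texture_bytes > 0.0);
    return bus_bytes_per_second / fps / texture_bytes;
}
```

With the theoretical AGP 1x peak of roughly 266 MB/s, a 60 Hz frame has room for only about 17 uncompressed 256x256 32-bit textures (262144 bytes each), and that is before vertex data starts competing for the same bus.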
Vertex data transfer rate. For hardware transform & lighting cards, the
  vertex data needs to be transferred from the host system RAM (where you
  provide it) to the card. This transfer usually competes with the "texture
  upload" budget, but it may also have other constraints. For example, if you
  provide the data in locked, contiguous, uncacheable memory that has been
  pre-prepared by the driver, there is almost no set-up cost to initiate the
  transfer (this is the intention behind the NV_vertex_array_range and
  ATI_vertex_array_object extensions). If you just malloc() some memory and call
  glVertexPointer(), the driver has to either page in, lock, cache-flush (on
  some systems) and map the memory, OR it has to copy the data into some
  pre-prepared area it has previously allocated for that purpose. If I were a
  driver writer, I'd do the second, as it's likely to be faster overall, but it
  introduces at least one extra copy step. If you do glVertex(), chances are
  that the driver may poke data at the card using programmed I/O, one vertex at
  a time, for really slug-like performance -- or maybe it'll accumulate all the
  vertices into some buffer, and send it all at glEnd(). Host transform systems
  are more likely to do the former; hardware transform systems the latter. Hopefully,
  the Vertex Buffer Object extension will make management of these issues
  easier, but that requires that you pay attention to the usage hints you
  specify!
  Also, some cards fetch data using specific memory line sizes; if your vertices
  are aligned on such memory line sizes, you may avoid fetching unnecessary
  memory lines and thus get better throughput. Thus, there may be some gains in
  making a vertex a power of two in size (say, 16 or 32 bytes) and aligning it
  on that same power of two in your memory buffer.
  Last, some vertex formats are more efficient than others. GL_SHORT and GL_FLOAT
  are usually supported by hardware; most other types have spottier support.
  For colors, the format to use is usually GL_UNSIGNED_BYTE with the GL_BGRA
  external format. Also, the amount of interleave matters. Some cards can only
  deal with one or two vertex array streams; if you keep each array in its own
  separate stream, you won't get very good performance because the driver needs
  to repack the vertices into a format it likes. Try packing everything into a
  single interleaved struct and specify each channel using the "stride"
  parameter.
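One possible layout for such an interleaved, power-of-two-sized vertex is sketched below. The struct, its field choice and the helper names are illustrative assumptions, not a recommended format; the 32-byte size holds on typical ABIs but is worth checking on your own platform. The GL calls appear only in a comment, since they need a live context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Vertex {
    float   position[3];  /* GL_FLOAT x 3,         offset  0 */
    float   normal[3];    /* GL_FLOAT x 3,         offset 12 */
    uint8_t color[4];     /* GL_UNSIGNED_BYTE x 4, offset 24 */
    int16_t texcoord[2];  /* GL_SHORT x 2,         offset 28 */
} Vertex;                 /* 32 bytes on typical ABIs */

/* Returns 1 if the buffer starts on a multiple of the vertex size. */
int is_vertex_aligned(const void *buffer)
{
    return ((uintptr_t)buffer % sizeof(Vertex)) == 0;
}

/* Address to hand to a gl*Pointer call for one channel of the struct. */
const void *channel_pointer(const Vertex *base, size_t channel_offset)
{
    assert(channel_offset < sizeof(Vertex));
    return (const unsigned char *)base + channel_offset;
}

/* With a GL context, every channel shares the same stride, roughly:
     glVertexPointer  (3, GL_FLOAT, sizeof(Vertex),
                       channel_pointer(v, offsetof(Vertex, position)));
     glNormalPointer  (GL_FLOAT, sizeof(Vertex),
                       channel_pointer(v, offsetof(Vertex, normal)));
     glColorPointer   (4, GL_UNSIGNED_BYTE, sizeof(Vertex),
                       channel_pointer(v, offsetof(Vertex, color)));
     glTexCoordPointer(2, GL_SHORT, sizeof(Vertex),
                       channel_pointer(v, offsetof(Vertex, texcoord)));
*/
```

Every channel uses `sizeof(Vertex)` as its stride, so the driver sees a single stream and has no reason to repack it.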
Program Complexity. If you're using hardware shaders (vertex or fragment 
  programs) then the more complex the program, and the more indirection and
  texture accesses are involved, the more varied the performance will be
  depending on the suitability to the underlying hardware.
  For hardware fragment shading (NV_register_combiners, ATI_fragment_shader and
  ARB_fragment_shader, as well as GLSL and Cg high-level shaders) the number of
  operations you apply to each fragment will have an impact on fill rate. Thus,
  if you have high overdraw, expensive shaders are likely to cut your frame
  rate a lot. You can make this better by first laying down the Z buffer
  without writing to RGBA at all; then switch off Z buffer writing, set Z
  testing to LEQUAL, and re-render the scene. This makes sure that each
  pixel only gets touched once by an expensive shader (all the cards capable of
  running expensive shaders implement some kind of early Z testing that avoids
  fragment processing for pixels which will be Z discarded).
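The effect of laying down Z first can be shown without a graphics card. The sketch below is a tiny software model, not real GL code: a four-pixel "framebuffer", a made-up `Fragment` type, and two functions that count how many fragments would reach the expensive shader. It assumes ideal early Z rejection, and the GL state each pass stands in for is named in the comments.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PIXELS 4

typedef struct Fragment {
    int   pixel;  /* which pixel the fragment lands on, 0..NUM_PIXELS-1 */
    float depth;  /* 0.0 = near plane, 1.0 = far plane */
} Fragment;

static void clear_depth(float *zbuf)
{
    int p;
    for (p = 0; p < NUM_PIXELS; ++p)
        zbuf[p] = 1.0f;
}

/* One pass, depth test LESS with depth writes on: every fragment that
   passes the test is shaded, even if something nearer is drawn later. */
size_t shade_single_pass(const Fragment *frags, size_t count)
{
    float zbuf[NUM_PIXELS];
    size_t i, shaded = 0;

    clear_depth(zbuf);
    for (i = 0; i < count; ++i) {
        assert(frags[i].pixel >= 0 && frags[i].pixel < NUM_PIXELS);
        if (frags[i].depth < zbuf[frags[i].pixel]) {
            zbuf[frags[i].pixel] = frags[i].depth;
            ++shaded;
        }
    }
    return shaded;
}

/* Two passes: depth only first, then shading with LEQUAL and no Z writes. */
size_t shade_with_z_prepass(const Fragment *frags, size_t count)
{
    float zbuf[NUM_PIXELS];
    size_t i, shaded = 0;

    clear_depth(zbuf);
    /* Pass 1: like glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE). */
    for (i = 0; i < count; ++i) {
        assert(frags[i].pixel >= 0 && frags[i].pixel < NUM_PIXELS);
        if (frags[i].depth < zbuf[frags[i].pixel])
            zbuf[frags[i].pixel] = frags[i].depth;
    }
    /* Pass 2: like glDepthMask(GL_FALSE) and glDepthFunc(GL_LEQUAL). */
    for (i = 0; i < count; ++i) {
        if (frags[i].depth <= zbuf[frags[i].pixel])
            ++shaded;
    }
    return shaded;
}
```

With three fragments stacked back to front on one pixel and a single fragment on another, the single pass shades all four, while the prepass version shades only the two that end up visible.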
 
These are all factors in how "fast" a graphics card performs. Different
applications exercise these bottlenecks differently, and may thus come to
different results on the same hardware. Also, there's a whole level of
higher-level performance factors, such as whether you're using fog or not,
whether you're using specular lighting or not, how many lights you're using,
whether you're using the depth buffer, stencil buffer, destination alpha,
single-texturing, multi-texturing, or multi-pass rendering, ...
 
Remember: if you think you need to optimize, PROFILE FIRST and then fix
the bottlenecks the profile indicates. If you're fill rate limited, optimizing
vertex transfer is going to give you only marginal benefits. Always measure
before and after results. If you don't make progress, you may wish to back out
the change, and try something else. If you don't know how to profile, then you
should learn; it's the second most important skill for a good software developer
(the most important skill is how to debug programs).