The question came up how you would run inter-particle collision detection within a particle system.

A pretty good approach, which would cache very well, is to sort along one axis, and march along, collecting duplicates. The sort is a fast N lg N, the marching is "approximately" N lg N as well.

Assuming all the particles have the same radius, you can sort on the midpoint; else you must sort on the left edge of the bounding volume of each particle.

// pseudo-code

struct ParticleInfo {
  float x, y, z;
  ActualParticle * p;
};

std::vector< ParticleInfo > particles;

void check_collisions() {
  // assume the particles have some spread along the x axis
  sort_particles_on_x( particles );

  std::deque< ParticleInfo * > active;

  foreach ( ParticleInfo & pi in particles ) {
    retire_active( active, pi.x-radius );
    check_collisions( active, pi );
    active.push_back( pi );
  }
}

void retire_active( deque active, float minx ) {
  foreach( iterator i in active ) {
    if( i->x <= minx ) {
      active.erase( i );
    }
  }
}

void check_collisions( deque active, ParticleInfo & pi ) {
  foreach( iterator i in active ) {
    if( distance( pi, *i ) < 2*radius ) {
      got_a_collision( pi, *i );
    }
  }
}

Instead of a deque, you can use a list or a vector if the number of expected size of the container is small. In fact, vector may perform the best there, too, for many cases.

If you find that most of your particles bunch in the middle, you might want to do a secondary sort along another axis for the active list, and do a binary search when looking for collisions when adding the new particleinfo; this will reduce the average-case number of tests a little bit. However, it should only be necessary with thousands of particles that intersect a lot.

Note that the worst case is always N * N, because each particle may actually collide with every other particle, and thus you need to generate N * N / 2 actual collision pairs, thus that's the runtime of the algorithm in that case.