Wednesday, March 14, 2012

Find all collisions in large point clouds

It's been a very long time since I posted here, and I was surprised to come back to a significant number of comments--and a lot more page views than my personal blog. Fascinating. So, here's another snippet :).

Sometimes, you have a large number (say, several million) points, all floating in space, and you'd like to know if any of those points are touching any of the other points. It's even a little worse if the point positions are floating point values -- in that case, we'd like to know about not only exact equality, we'd also like to know about approximate equality, which means classic hashing and similar bit-twiddling aren't going to do the trick.

First, we'll start out with a correct and terrible algorithm. Assume tol is the allowed tolerance, and that equals(x,y,tol) is a function for determining equality within tolerance.

for( i = 0; i < points.size(); ++i )
for( j = i + 1; j < points.size(); ++j )
  if( equals(points[i], points[j], tol) )
    cout << i << " touches " << j << endl;

Given n points, this requires n*n/2 iterations to find all collisions. For a million points, this is going to need to make half a trillion comparisons, and I'd rather not wait around that long. So we need something better.

Initially, I thought of writing a spatial hash as a way to solve this problem. The idea is to place all your points in their own buckets in space according to their location, then only look through points which land in the same bucket for collisions. Even better, you can dynamically split the buckets whenever they start getting large, allowing you to handle point clouds with highly variable densities. This works, but it's a lot of code. Surely there is a simpler answer? Yes, there is.

The idea is to use a sort function, like the STL sort, but only on one dimension at a time. So we'll first find all the points that have matching X values; then, within those groups, we'll find points that have matching Y, and (if we're three-dimensional), we'll find matching Z's within the groups that have matching X and Y. Got it? Here's an attempt at some code:

// This is the cloud of points we wish to sort.
vector< Point > pts;

// This is our output value. It will hold
// the indices in pts of colliding points.
vector< vector< size_t > > point_cliques;

point_cliques.push_back( vector< size_t >() );
// Initially, all points start in one big clique that
// matches on zero dimensions.
for( size_t i = 0; i < pts.size(); ++i )
  point_cliques.back().push_back( i );

// Reuse vector in the loop to avoid alloc hits.
vector< size_t > tmp_clique;

// For each dimension (assuming 3 here)
for( int idim = 0; idim < 3; ++idim )
  vector< vector< size_t > > new_cliques;
  // Point::operator[] returns a component of the point.
  auto sort_by_dim =
    [&]( uint a, uint b ) { return pts[a][idim] < pts[b][idim]; };
  for(size_t iclique=0; iclique < point_cliques.size(); ++iclique)
    // The clique we're working on, stored as a local reference.
    vector< size_t > &clique = point_cliques[ iclique ];
    // Sort current clique by current dimension.
    sort( clique.begin(), clique.end(), sort_by_dim );

    // Break into parts according to equal ranges.
    for( auto i = clique.begin(); i != clique.end(); ++i )
      // If our point doesn't touch the previous clique...
      if( !tmp_clique.empty() &&
        !equals(tmp_clique.back()[idim], pts[*i][idim], tol))
        // Discard the previous clique if it only had one item,
        // otherwise keep it.
        if( tmp_clique.size() > 1 )
          new_cliques.push_back( tmp_clique );
      tmp_clique.push_back( *i );
  // Assign new_cliques to point_cliques cheaply.
  swap( point_cliques, new_cliques );

If you can't switch to C++11 yet, then you'll have to use a functor with local state for the sort_by_dim function, instead of a closure. Otherwise, the idea still works.

The worst case time for this function is if you have a large array of points that are all on top of each other; in this case, sort will run over the full array 3 times. This is still n log n, which is much better than our n * n above. In practice, I've found this algorithm to be very fast, and a lot easier to work with than a full spatial hash. I hope you find it useful :).