Wednesday, March 14, 2012

Find all collisions in large point clouds

It's been a very long time since I posted here, and I was surprised to come back to a significant number of comments--and a lot more page views than my personal blog. Fascinating. So, here's another snippet :).

Sometimes, you have a large number (say, several million) points, all floating in space, and you'd like to know if any of those points are touching any of the other points. It's even a little worse if the point positions are floating point values -- in that case, we'd like to know about not only exact equality, we'd also like to know about approximate equality, which means classic hashing and similar bit-twiddling aren't going to do the trick.

First, we'll start out with a correct and terrible algorithm. Assume tol is the allowed tolerance, and that equals(x,y,tol) is a function for determining equality within tolerance.

for( i = 0; i < points.size(); ++i )
for( j = i + 1; j < points.size(); ++j )
{
  if( equals(points[i], points[j], tol) )
    cout << i << " touches " << j << endl;
}

Given n points, this requires n*n/2 iterations to find all collisions. For a million points, this is going to need to make half a trillion comparisons, and I'd rather not wait around that long. So we need something better.

Initially, I thought of writing a spatial hash as a way to solve this problem. The idea is to place all your points in their own buckets in space according to their location, then only look through points which land in the same bucket for collisions. Even better, you can dynamically split the buckets whenever they start getting large, allowing you to handle point clouds with highly variable densities. This works, but it's a lot of code. Surely there is a simpler answer? Yes, there is.

The idea is to use a sort function, like the STL sort, but only on one dimension at a time. So we'll first find all the points that have matching X values; then, within those groups, we'll find points that have matching Y, and (if we're three-dimensional), we'll find matching Z's within the groups that have matching X and Y. Got it? Here's an attempt at some code:

// This is the cloud of points we wish to sort.
vector< Point > pts;

// This is our output value. It will hold
// the indices in pts of colliding points.
vector< vector< size_t > > point_cliques;

point_cliques.push_back( vector< size_t >() );
// Initially, all points start in one big clique that
// matches on zero dimensions.
for( size_t i = 0; i < pts.size(); ++i )
  point_cliques.back().push_back( i );

// Reuse vector in the loop to avoid alloc hits.
vector< size_t > tmp_clique;

// For each dimension (assuming 3 here)
for( int idim = 0; idim < 3; ++idim )
{
  vector< vector< size_t > > new_cliques;
  // Point::operator[] returns a component of the point.
  auto sort_by_dim =
    [&]( uint a, uint b ) { return pts[a][idim] < pts[b][idim]; };
  for(size_t iclique=0; iclique < point_cliques.size(); ++iclique)
  {
    // The clique we're working on, stored as a local reference.
    vector< size_t > &clique = point_cliques[ iclique ];
    // Sort current clique by current dimension.
    sort( clique.begin(), clique.end(), sort_by_dim );

    tmp_clique.clear();
    // Break into parts according to equal ranges.
    for( auto i = clique.begin(); i != clique.end(); ++i )
    {
      // If our point doesn't touch the previous clique...
      if( !tmp_clique.empty() &&
        !equals(tmp_clique.back()[idim], pts[*i][idim], tol))
      {
        // Discard the previous clique if it only had one item,
        // otherwise keep it.
        if( tmp_clique.size() > 1 )
          new_cliques.push_back( tmp_clique );
        tmp_clique.clear();
      }
      tmp_clique.push_back( *i );
    }
  }
  // Assign new_cliques to point_cliques cheaply.
  swap( point_cliques, new_cliques );
}

If you can't switch to C++11 yet, then you'll have to use a functor with local state for the sort_by_dim function, instead of a closure. Otherwise, the idea still works.

The worst case time for this function is if you have a large array of points that are all on top of each other; in this case, sort will run over the full array 3 times. This is still n log n, which is much better than our n * n above. In practice, I've found this algorithm to be very fast, and a lot easier to work with than a full spatial hash. I hope you find it useful :).

Monday, October 26, 2009

A poorly chosen warning

GCC spits out warnings when you declare a local variable and don't use it. For example:

{
int x = 3, y = 4;
cout << "The value is " << y << endl;
}

This would warn you that you never used the value 'x', and is a useful warning. But I think they're too overzealous:

int success = doSomethingImportant();
#ifdef _DEBUG_ON_
cout << "Did something important. It was "
<< (success ? "successful." :
"unsuccessful.") << endl;
#endif

If you don't define the _DEBUG_ON_ symbol, GCC will issue a warning here. And if it's late at night and you're not paying attention, you might make a mistake and move that 'doSomethingImportant()' call inside of the #ifdef, because after all, GCC is warning you that it's not doing anything. And sometime on Saturday, while fixing several hundred warnings, I did precisely that (although in my defense, the real code was a bit longer). It took several hours to figure out why the test suite was failing; if we hadn't had a test suite, we may have accidentally allowed a very buggy piece of software out the door. As it was, it kept me at work late.

This particular warning strikes me as poorly thought out. I agree with the warning when the initialization value is constant, but when the initialization is complex, with the potential for side effects, I believe the warning should not be issued. It certainly would have saved me a literal headache today.

I haven't tried this with the latest GCC yet, by the way; I only have a really old 3.4.5 compiler on my machine. Feel free to contradict my conclusions in a flaming, foaming-at-the-mouth sort of way.

Thursday, October 22, 2009

Porting C++

Blowing the dust off this old blog, I notice that I never finished my previous post about vector-based sets. I've been using one for a while, and the performance is great, but unless some reader shows an enormous amount of interest, I'm not going to bother finishing the explanation. Instead, I'm going to complain about porting.

As part of my work, I write cross-platform code. Most of the time, however, I just write in Visual Studio, and every few months we check that everything still compiles and runs in GCC, preferably with zero warnings. Compiling on GCC & Visual Studio isn't everything, but it's a pretty good litmus test. While doing this, I've found a few important "gotchas" where Visual Studio zigs and GCC zags. Here, for your delectation, is my list:

1. Templates. There are so many issues in porting Visual Studio templates to GCC that I'll just brush gently over the top and move on. It's tricky. Visual Studio allows horrible syntax problems within template code, and won't tell you there's a problem until the code is instantiated and used in a fully specified type. GCC is much stricter. GCC also sticks more tightly to the rules in the standard. If you're having problems with a template that worked in Visual Studio, some good tips are:

- Try adding the word "typename" in front of template parameters when they refer to a type. In some rare cases, this can disambiguate the syntax. Visual Studio delays evaluation of the syntax until it can check the symbol table, but GCC requires it to be unambiguous up front.

- Visual Studio will often accept a partial name (i.e. TypeName) where GCC requires a full name (i.e. TypeName< A, B >). Using full names is almost never wrong. This is very often true if you combine inheritance and templates. Also be aware that you probably need to specify the full template name when referring to member variables in a base class, i.e. "BaseClass< A, B >::member_var" rather than just "member_var". Again, there are ambiguity problems in the dark, smelly corners of the C++ language.

2. Newlines at the end of a file. Visual Studio doesn't care, GCC warns you if there isn't one.

3. Initialization order. The C++ standard requires you to initialize member variables in the order they're declared in the class. If there's a mismatch between the initializer list in the constructor and the declaration order, GCC throws a warning at you. Visual Studio doesn't care. Keep the order the same and GCC will be happy. It took me hours to fix all of these last time, and fixing them is a mistake-prone process.

4. GCC is more aggressive in warning you about local variables which were declared & initialized but never used. This is mostly helpful, honestly. It is a bit annoying when it starts picking out file-local (static) variables that are only used in debug mode, though :P. In the end, the workaround for that one was not using a static variable.

5. GCC has a 'uint' type, short for 'unsigned int'. Visual Studio does not. We get around this with an ifdef/typedef block in a common header, since we like uint.

6. The two compilers have different pragmas. #pragma warning does not work in GCC. If you don't want to be warned about unrecognized pragmas, you should wrap #ifdef _MSC_VER around your pragmas (that's a flag to test for whether you're using Visual Studio, roughly). #ifdef WIN32 is not as good, because GCC defines that symbol when you're compiling under Windows ;).

7. Your life will be better if you don't inherit std::exception. If you must, though, ensure that your child class has a destructor with the "throw()" specifier, since the auto-generated destructor has no such specifier and this will cause compile errors in GCC.

8. GCC objects (warning) if you use "switch" with an enumerated type but do not include all possible values of the enumeration in the "case:" statements. Visual Studio doesn't care. If you stick in a blank/useless "default:" case, then all is well.

9. GCC warns about type mismatches if you compare unsigned and signed integers using == and !=. Visual Studio doesn't care.

10. All of Microsoft's secured functions are non-standard (sscanf_s and so on). You can't use them in GCC. Either do it the C++ way (cin, istream, etc.), roll your own (if you're complaining about how slow the C++ ones are, then you should, because you'll probably go faster), or use the unsafe/standard functions and define _CRT_SECURE_NO_DEPRECATE to avoid the idiotic warnings in Visual Studio.

These are a few things I've run into myself recently. Some of them are a bit tricky to figure out, so maybe I shall aid some poor Google searcher :P. Happy porting!

Monday, February 4, 2008

A vector-based set (Part 1)

I recently finished the vector-based set described in my previous post, and have incorporated it into production code. I thought I'd spend a few posts explaining my thought process and discussing some of the gotchas in making STL-compatible containers. Previously, I had something like this:

map< PointIndex, set< SideIndex > > mymap;

PointIndex and SideIndex are basically integers referring to geometry in my scene. The main reason for giving them a type other than "int" is to make the code easier to read and to make it harder to mix them up.

Anyway, the problem here is that mymap often has a total of about 100,000 SideIndex values in it, averaging a few dozen per set. It takes roughly a tenth to a third of a second to make a complete copy of one of these maps (and about the same amount of time to deallocate it when deleting the object), so this is a major time sink. We're going to make a drop-in replacement for the set type that would drastically speed up these map copies.

First off, the SGI STL reference was indispensable. Although there were a few nitpicky ways in which Microsoft didn't follow their specifications, this is still the best STL reference available on the web. The link I gave is directly to the specification for a set--we're going to make our class compatible with that specification.

From that specification, there are a few main things that we're going to need to implement:
  • A whole bunch of typedefs
  • Forward and reverse iterators (plus begin()/end())
  • The insert() methods
  • The erase() methods
  • The find() and *_bound() methods.
  • Construction, destruction, and assignment
  • A bunch of simple utility methods--size(), swap(), empty(), clear(), operators, etc.

This is going to take more than one post, but I'll finish this one off with the basic data structure and class declaration:

template< class T >
class IntegerSet
{
public:
typedef size_t size_type;
// Coming soon...
protected:
T *mData;
size_type mLength, mCapacity;
};

The 'm' prefix refers to member variables. The 'data' is the actual sorted array of items. The array will have enough room in it for 'capacity' items, but only 'length' of those items will be used. The remainder is blank space to allow for growth without reallocation, the same way that vectors do it. The 'size_t' type should be standard, and is the same type that the Microsoft STL implementation uses for its size_type.

You may be a bit puzzled by my use of size_type here. Why not just use size_t directly? The reason is that this is the way the STL expects you to do it; your class is expected to have a local 'size_type' typedef which algorithms may use when operating on your class. For example, if I wanted to write a trivial templated algorithm that prints the letter "X" once for every element in a given STL container, I might do it something like this:

template< class Container >
void printX( const Container &c, ostream &o )
{
Container::size_type i;
for( i = 0; i < c.size(); ++i )
o << "X";
}

By having the size_type defined directly in the class, I can guarantee that "i < c.size()" will work without any problems, even if you need to use some kind of strange type for the size (like a 12-bit integer, say).

Anyway, that's the initial datastructure layout--a grand total of 12 bytes of overhead (24 on a 64-bit compiler). Next time, I'll start working on implementing some of the methods.

Friday, January 25, 2008

How to beat the performance of Microsft's STL sets

In my line of work, we use a lot of STL sets. For us, these are usually sets of indices into some shape--say, a set of vertex indices or a set of edge indices. While the STL sets are very convenient and have some nice properties, they come with some large performance hits (the following applies to Microsoft's STL implementation):

Firstly, every element inserted into a set requires a memory allocation. While the release-mode memory allocator is reasonably good, it's still a significant hit. The debug mode memory allocator is incredibly slow. The sets do not attempt any kind of memory pooling.

Secondly, every element has 14 bytes of overhead (left pointer, right pointer, parent pointer, color, and some kind of "isnil" flag). Considering that in our application, the value itself is a four-byte integer, this means that more than 75% of the space is wasted in storage overhead.So why use STL sets? The obvious answer is that they have very nice timing guarantees: Insertions are O(log n), deletions are O(log n), lookups are O(log n), and iteration over all elements is O(n). The underlying algorithm uses red-black trees (AVL trees are also an option, but Microsoft uses red-black trees), and there really aren't any other algorithms that give you growth guarantees this nice. In short, it seems that as bad as this is, it's good as it gets.

But hold on a moment. The O(f(n)) functions only dominate the time for sufficiently large values of n. Sometimes, n has to be a pretty large value before it starts to matter. So I decided to try an alternate implementation for sets: A simple sorted array of integers.

In a sorted array, insertions are O(n), deletions are O(n), lookups are O(log n), and iteration is O(n). Inserting N elements has a worst-case time of O(N2), which is obviously going to be worse than an STL set for some big enough value of N. But how big?

I coded up the sorted array, optimizing heavily for an array of integers (using memcpy and memmove instead of copy constructors, in particular). I would then insert a certain number of integers into both the STL set and the sorted array, timing how long each took, and tracking at about what point the sets 'overtook' the array in speed. I expect the sorted array to win for low element counts, and the STL set to eventually pull ahead. I repeated the full test 8 times, and took the median passing point.

On my machine, the STL set in release mode is reliably faster than the sorted array when you pass 1,500 elements. At this point, that O(n2) just starts to hurt too much. Still, that is a much higher value of N than I was expecting--I thought it wouldn't even make it into the triple digits. In debug mode, I gave up before I found a value of N high enough--at N=1,000,000, the array was still easily beating the set in performance by a wide margin, and the gap still appeared to be widening (!). Apparently, those allocation hits are just too painful for the set to catch up; the dynamic array allocates rarely, since it doubles its capacity each time it has to expand.

I expect the breakthrough value of N to be smaller if the datatype is larger; I also expect that you would take a significant hit if you were required to use the copy constructors correctly. For my application, however, it looks like a sorted array is going to be sufficient for all but the largest sets. However, as the set gets very large, there are some other rather interesting optimizations I can make--more on that in a future post!

Tuesday, January 1, 2008

Lightweight bit vectors

Bit vectors are a pretty common data structure, and are a good example of how abstraction can make code efficient--if you don't have a good bit vector class, you're more likely to use an array of boolean values, which is hugely inefficient. Most bit vectors work by storing a size and/or capacity, plus a pointer to some memory on the heap. This post gives a gentle improvement on that model.

The goal is to try and avoid allocating any memory on the heap unless absolutely necessary. The best reason for this is speed; if you're using a bit vector in a tight inner loop, or if you're using a very large number of them, you can quickly bog your program down with allocation and deallocation calls.

The basic idea is that when the number of bits to be stored is no more than the number of bits used for the memory pointer itself (typically 32 or 64), we store the bits directly in the pointer. Of course, you cannot safely dereference such a repurposed pointer, so we'll need to be careful about whether the value in the pointer variable is actually a pointer or whether the bit vector has been stuffed there. This is done by using a private member function to access the bits.

Here's the code for a skeletal bit vector class using this principle:

class BitVector
{
public:
typedef unsigned int size_type;
BitVector() : mBuf( NULL ), mSize( 0 ) {}
~BitVector() { if( mSize > sizeof(void*)) free(mBuf); }
void resize( size_type size );
void size() { return mSize; }
bool get( size_type index );
void set( size_type index, bool value );
private:
void *mBuf;
size_type mSize;
unsigned char *getBuf();
};

A fully featured class would be much bigger; this has been stripped to its bare bones. There's no distinction between size and capacity. Next, we need some member function implementations:

void BitVector::resize( size_type size )
{
if( mSize > sizeof(void*) ) free( mBuf );
if( size <= sizeof(void*) ) mBuf = (void *) 0;
else mBuf = malloc( size );
mSize = size;
}

The resize function lets you set the size/capacity of the bit vector. Size is expressed in bytes here, which is not really good practice, but it simplifies the code. For capacities less than the size of a void pointer, the pointer itself will be used to store values. Otherwise, the pointer will contain the address of the memory storing our bits, which is allocated with malloc.

unsigned char *getBuf()
{
if( mSize <= sizeof(void*) ) return (unsigned char *) &mBuf;
else return (unsigned char *) mBuf;
}

This function is used to abstract away the difficulties of sometimes using the pointer as a pointer and sometimes using it as a value. We use the size as a switchboard; if the pointer is not big enough to hold the number of values we have, then we can assume that mBuf is being used as a pointer. If the pointer is big enough, however, we can use mBuf as a value. Either way, this function returns a pointer to the bit vector data itself.

bool BitVector::get( size_type index )
{
return !!(getBuf()[index >> 3] & (1 << (index & 7)));
}

bool BitVector::set( size_type index, bool value )
{
if( value ) getBuf()[index >> 3] |= 1 << (index & 7);
else getBuf()[index >> 3] &= ~(1 << (index &7));
}

These are the getter and setter functions. I've included them mostly because the bit-twiddling might be a little tricky, and it shows how the getBuf function can be used to avoid complicating the code unduly.

This technique isn't always appropriate; if you're always going to have large bit vectors, then this class doesn't buy you anything. However, if you expect bit vectors containing 32 or fewer booleans to appear sometimes, This class might buy you a nice speedup.

Happy twiddling!

Wednesday, November 21, 2007

Calculating the next power of 2

A fairly common problem in systems programming is to locate the "next highest power of 2," since many data structures are much more efficient that way. Memory pages on a 32-bit OS are usually 4096 bytes long (212), and textures on graphics cards usually need to have power-of-2 dimensions.

So, suppose I have some object of a certain size (947, for example), and I need to find the smallest power-of-2 that's large enough to hold my object (in this case, 210 or 1024). We'll start with a fairly simple algorithm:

int val = 947, powof2 = 1;
// Double powof2 until >= val
while( powof2 < val ) powof2 <<= 1;

This will work, and if you're in a hurry, this is probably what will get used. If you're just calculating the size for a texture, then this is unlikely to be performance-critical code, so it may not be worth your time to optimize this. This might become a bottleneck if you're writing a memory manager, however; how can you do this more quickly?

This code has a few weaknesses. Firstly, it uses a loop, which means that the code has to run a test and then split in two possible directions after every iteration. Conditional branches of this sort aren't a huge deal on a modern processor, but they're still slower than avoiding branches altogether. Secondly, the loop can iterate quite a few times--up to 31, in fact. Thirdly, if the value stored in 'val' is over 231, and if you're using unsigned integers, then this can become an infinite loop. Hardly desirable!

It turns out that there's a fairly slick algorithm to solve this problem. The basic idea is to take your number (947 again: 0000 0011 1011 0011), and use bit operations to convert it to a solid string of 1's with the same number of digits (in this case, 0000 0011 1111 1111). Then, add 1, and you'll have the next power of 2 (0000 0100 0000 0000, or 1024). To convert any number into a solid string of 1's, we'll use right-shift and or repeatedly:

val = (val >> 1) | val;
val = (val >> 2) | val;
val = (val >> 4) | val;
val = (val >> 8) | val;
val = (val >> 16) | val;

Why do we do it five times? Well, that's because val is a 32-bit integer, and every time you iterate, you're doubling the number of digits you fill. After five steps, you've filled 25 digits, and you're done. If you're on a 64-bit architecture (26) you'll need to repeat 6 times. The steps on 947 look like:

0000 0011 1011 0011
0000 0011 1111 1011
0000 0011 1111 1111
... with the remaining steps staying at this value.

There's one small problem with this algorithm: it doesn't work properly for the powers of two themselves. It will give the next highest power of 2. This is pretty easy to fix, though: Just subtract one from the value before you begin, and you'll get the right answer in every case. Here's the final algorithm:

val = ... // Get input
val--;
val = (val >> 1) | val;
val = (val >> 2) | val;
val = (val >> 4) | val;
val = (val >> 8) | val;
val = (val >> 16) | val;
val++; // Val is now the next highest power of 2.

One final thought: What does it do on zeroes? It turns out that if you pass the algorithm a zero or any value greater than 231, it will return zero (try it if you're not sure why). Unless you're certain you will never pass either of these values, you should handle having zero as a return value.

That's all for today. Happy bit-twiddling!