Friday, January 25, 2008

How to beat the performance of Microsoft's STL sets

In my line of work, we use a lot of STL sets. For us, these are usually sets of indices into some shape--say, a set of vertex indices or a set of edge indices. While the STL sets are very convenient and have some nice properties, they come with some large performance hits (the following applies to Microsoft's STL implementation):

Firstly, every element inserted into a set requires a memory allocation. While the release-mode memory allocator is reasonably good, it's still a significant hit. The debug mode memory allocator is incredibly slow. The sets do not attempt any kind of memory pooling.

Secondly, every element has 14 bytes of overhead (left pointer, right pointer, parent pointer, color, and some kind of "isnil" flag). Considering that in our application, the value itself is a four-byte integer, this means that more than 75% of the space is wasted in storage overhead.

So why use STL sets? The obvious answer is that they have very nice timing guarantees: insertions are O(log n), deletions are O(log n), lookups are O(log n), and iteration over all elements is O(n). The underlying algorithm uses red-black trees (AVL trees are also an option, but Microsoft uses red-black trees), and there really aren't any other algorithms that give you growth guarantees this nice. In short, it seems that as bad as this is, it's as good as it gets.

But hold on a moment. The O(f(n)) functions only dominate the time for sufficiently large values of n. Sometimes, n has to be a pretty large value before it starts to matter. So I decided to try an alternate implementation for sets: A simple sorted array of integers.

In a sorted array, insertions are O(n), deletions are O(n), lookups are O(log n), and iteration is O(n). Inserting N elements has a worst-case time of O(N^2), which is obviously going to be worse than an STL set for some big enough value of N. But how big?

I coded up the sorted array, optimizing heavily for an array of integers (using memcpy and memmove instead of copy constructors, in particular). I then inserted a certain number of integers into both the STL set and the sorted array, timing how long each took, and tracking the point at which the set 'overtook' the array in speed. I expected the sorted array to win for low element counts, and the STL set to eventually pull ahead. I repeated the full test 8 times and took the median crossover point.
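The original code isn't shown, but the idea can be sketched like this: a sorted array of ints where insertion finds its spot with a binary search and shifts the tail with memmove. The class and member names here are my own, not the original implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Illustrative sorted-array "set" of ints, optimized for the trivially
// copyable case: memmove instead of per-element copy construction.
class SortedIntSet {
public:
    SortedIntSet() : mData(NULL), mSize(0), mCapacity(0) {}
    ~SortedIntSet() { free(mData); }

    // O(log n) lookup via binary search.
    bool contains(int value) const {
        return std::binary_search(mData, mData + mSize, value);
    }

    // O(n) insertion: binary search for the position, then shift the tail.
    void insert(int value) {
        int *pos = std::lower_bound(mData, mData + mSize, value);
        if (pos != mData + mSize && *pos == value) return;  // already present
        size_t offset = pos - mData;
        if (mSize == mCapacity) {  // double capacity, so allocations are rare
            mCapacity = mCapacity ? mCapacity * 2 : 8;
            mData = (int *) realloc(mData, mCapacity * sizeof(int));
        }
        memmove(mData + offset + 1, mData + offset,
                (mSize - offset) * sizeof(int));
        mData[offset] = value;
        ++mSize;
    }

    size_t size() const { return mSize; }

private:
    int *mData;
    size_t mSize, mCapacity;
};
```

The capacity doubling is what keeps allocation cost low: inserting N elements triggers only O(log N) calls to realloc.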

On my machine, the STL set in release mode is reliably faster than the sorted array once you pass 1,500 elements. At that point, the O(n^2) just starts to hurt too much. Still, that is a much higher value of N than I was expecting--I thought it wouldn't even make it into the triple digits. In debug mode, I gave up before I found a value of N high enough--at N=1,000,000, the array was still beating the set by a wide margin, and the gap still appeared to be widening (!). Apparently, those allocation hits are just too painful for the set to catch up; the dynamic array allocates rarely, since it doubles its capacity each time it has to expand.
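If you want to try this on your own machine, a harness along these lines will do; it is a sketch of my setup, not the original test code, and it uses a plain sorted std::vector rather than the hand-optimized array.

```cpp
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <set>
#include <vector>

// Time inserting n random ints into an STL set.
double timeSetInsert(int n) {
    std::clock_t start = std::clock();
    std::set<int> s;
    for (int i = 0; i < n; ++i) s.insert(std::rand());
    return double(std::clock() - start) / CLOCKS_PER_SEC;
}

// Time inserting n random ints into a sorted vector (binary search + shift).
double timeSortedVectorInsert(int n) {
    std::clock_t start = std::clock();
    std::vector<int> v;
    for (int i = 0; i < n; ++i) {
        int x = std::rand();
        std::vector<int>::iterator pos =
            std::lower_bound(v.begin(), v.end(), x);
        if (pos == v.end() || *pos != x) v.insert(pos, x);  // O(n) shift
    }
    return double(std::clock() - start) / CLOCKS_PER_SEC;
}
```

Run both over a range of n and look for the crossover; the exact value will depend on your compiler, allocator, and build mode.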

I expect the breakthrough value of N to be smaller if the datatype is larger; I also expect that you would take a significant hit if you were required to use the copy constructors correctly. For my application, however, it looks like a sorted array is going to be sufficient for all but the largest sets. However, as the set gets very large, there are some other rather interesting optimizations I can make--more on that in a future post!

Tuesday, January 1, 2008

Lightweight bit vectors

Bit vectors are a pretty common data structure, and are a good example of how abstraction can make code efficient--if you don't have a good bit vector class, you're more likely to use an array of boolean values, which is hugely inefficient. Most bit vectors work by storing a size and/or capacity, plus a pointer to some memory on the heap. This post gives a gentle improvement on that model.

The goal is to avoid allocating any memory on the heap unless absolutely necessary. The main reason for this is speed: if you're using a bit vector in a tight inner loop, or if you're using a very large number of them, you can quickly bog your program down with allocation and deallocation calls.

The basic idea is that when the number of bits to be stored is no more than the number of bits used for the memory pointer itself (typically 32 or 64), we store the bits directly in the pointer. Of course, you cannot safely dereference such a repurposed pointer, so we'll need to be careful about whether the value in the pointer variable is actually a pointer or whether the bit vector has been stuffed there. This is done by using a private member function to access the bits.

Here's the code for a skeletal bit vector class using this principle:

class BitVector
{
public:
    typedef unsigned int size_type;

    BitVector() : mBuf( NULL ), mSize( 0 ) {}
    ~BitVector() { if( mSize > sizeof(void*) ) free( mBuf ); }

    void resize( size_type size );
    size_type size() const { return mSize; }
    bool get( size_type index );
    void set( size_type index, bool value );

private:
    void *mBuf;
    size_type mSize;

    unsigned char *getBuf();
};

A fully featured class would be much bigger; this has been stripped to its bare bones. There's no distinction between size and capacity. Next, we need some member function implementations:

void BitVector::resize( size_type size )
{
    if( mSize > sizeof(void*) ) free( mBuf );
    if( size <= sizeof(void*) ) mBuf = (void *) 0;
    else mBuf = malloc( size );
    mSize = size;
}

The resize function lets you set the size/capacity of the bit vector. Size is expressed in bytes here, which is not really good practice, but it simplifies the code. For capacities no larger than the size of a void pointer, the pointer itself will be used to store the bits. Otherwise, the pointer will contain the address of the memory storing our bits, which is allocated with malloc.

unsigned char *BitVector::getBuf()
{
    if( mSize <= sizeof(void*) ) return (unsigned char *) &mBuf;
    else return (unsigned char *) mBuf;
}

This function is used to abstract away the difficulties of sometimes using the pointer as a pointer and sometimes using it as a value. We use the size as a switchboard; if the pointer is not big enough to hold the number of values we have, then we can assume that mBuf is being used as a pointer. If the pointer is big enough, however, we can use mBuf as a value. Either way, this function returns a pointer to the bit vector data itself.

bool BitVector::get( size_type index )
{
    return !!(getBuf()[index >> 3] & (1 << (index & 7)));
}

void BitVector::set( size_type index, bool value )
{
    if( value ) getBuf()[index >> 3] |= 1 << (index & 7);
    else getBuf()[index >> 3] &= ~(1 << (index & 7));
}

These are the getter and setter functions. I've included them mostly because the bit-twiddling might be a little tricky, and it shows how the getBuf function can be used to avoid complicating the code unduly.
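For reference, the fragments assemble into a single compilable class along these lines; this is still the same bare-bones sketch, not production code.

```cpp
#include <cstdlib>

// The skeletal bit vector, assembled into one unit. Sizes are in bytes;
// when the size fits in the pointer itself, the bits are stored inline.
class BitVector {
public:
    typedef unsigned int size_type;

    BitVector() : mBuf(NULL), mSize(0) {}
    ~BitVector() { if (mSize > sizeof(void *)) free(mBuf); }

    void resize(size_type size) {
        if (mSize > sizeof(void *)) free(mBuf);
        if (size <= sizeof(void *)) mBuf = (void *) 0;  // inline storage
        else mBuf = malloc(size);                        // heap storage
        mSize = size;
    }

    size_type size() const { return mSize; }

    bool get(size_type index) {
        return !!(getBuf()[index >> 3] & (1 << (index & 7)));
    }

    void set(size_type index, bool value) {
        if (value) getBuf()[index >> 3] |= 1 << (index & 7);
        else       getBuf()[index >> 3] &= ~(1 << (index & 7));
    }

private:
    void *mBuf;
    size_type mSize;

    // Points at the inline bits when small, at the heap block otherwise.
    unsigned char *getBuf() {
        if (mSize <= sizeof(void *)) return (unsigned char *) &mBuf;
        else return (unsigned char *) mBuf;
    }
};
```

For example, resize(4) gives you 32 bits stored inline on typical 32- and 64-bit platforms, while resize(16) falls back to malloc.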

This technique isn't always appropriate; if you're always going to have large bit vectors, then this class doesn't buy you anything. However, if you expect bit vectors containing 32 or fewer booleans to appear sometimes, this class might buy you a nice speedup.

Happy twiddling!