Monday, February 4, 2008

A vector-based set (Part 1)

I recently finished the vector-based set described in my previous post, and have incorporated it into production code. I thought I'd spend a few posts explaining my thought process and discussing some of the gotchas in making STL-compatible containers. Previously, I had something like this:

map< PointIndex, set< SideIndex > > mymap;

PointIndex and SideIndex are basically integers referring to geometry in my scene. The main reason for giving them a type other than "int" is to make the code easier to read and to make it harder to mix them up.

Anyway, the problem here is that mymap often has a total of about 100,000 SideIndex values in it, averaging a few dozen per set. It takes roughly a tenth to a third of a second to make a complete copy of one of these maps (and about the same amount of time to deallocate it when deleting the object), so this is a major time sink. We're going to make a drop-in replacement for the set type that will drastically speed up these map copies.

First off, the SGI STL reference was indispensable. Although there were a few nitpicky ways in which Microsoft didn't follow their specifications, this is still the best STL reference available on the web. The link I gave is directly to the specification for a set--we're going to make our class compatible with that specification.

From that specification, there are a few main things that we're going to need to implement:
  • A whole bunch of typedefs
  • Forward and reverse iterators (plus begin()/end())
  • The insert() methods
  • The erase() methods
  • The find() and *_bound() methods.
  • Construction, destruction, and assignment
  • A bunch of simple utility methods--size(), swap(), empty(), clear(), operators, etc.

This is going to take more than one post, but I'll finish this one off with the basic data structure and class declaration:

template< class T >
class IntegerSet
{
public:
    typedef size_t size_type;
    // Coming soon...

private:
    T *mData;
    size_type mLength, mCapacity;
};

The 'm' prefix refers to member variables. The 'data' is the actual sorted array of items. The array will have enough room in it for 'capacity' items, but only 'length' of those items will be used. The remainder is blank space to allow for growth without reallocation, the same way that vectors do it. The 'size_t' type should be standard, and is the same type that the Microsoft STL implementation uses for its size_type.
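As a preview of how this layout behaves, here's a minimal sketch of construction and destruction (my guess at the obvious implementation, not necessarily what the later posts will do; the accessors are hypothetical):

```cpp
#include <cstddef>
#include <cstdlib>

template< class T >
class IntegerSet
{
public:
    typedef size_t size_type;

    // An empty set owns no storage at all.
    IntegerSet() : mData( NULL ), mLength( 0 ), mCapacity( 0 ) {}

    // A single free() releases every element at once -- this is the
    // whole point of the flat layout, versus one deallocation per
    // node in a tree-based set. (free(NULL) is a no-op.)
    ~IntegerSet() { free( mData ); }

    size_type size() const { return mLength; }
    size_type capacity() const { return mCapacity; }

private:
    T *mData;
    size_type mLength, mCapacity;
};
```

Note the malloc/free pairing: it only makes sense here because the element types are plain integers that don't need constructors or destructors run.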

You may be a bit puzzled by my use of size_type here. Why not just use size_t directly? The reason is that this is the way the STL expects you to do it; your class is expected to have a local 'size_type' typedef which algorithms may use when operating on your class. For example, if I wanted to write a trivial templated algorithm that prints the letter "X" once for every element in a given STL container, I might do it something like this:

template< class Container >
void printX( const Container &c, std::ostream &o )
{
    typename Container::size_type i;
    for( i = 0; i < c.size(); ++i )
        o << "X";
}

By having the size_type defined directly in the class, I can guarantee that "i < c.size()" will work without any problems, even if you need to use some kind of strange type for the size (like a 12-bit integer, say).

Anyway, that's the initial data structure layout--a grand total of 12 bytes of overhead (24 on a 64-bit compiler). Next time, I'll start working on implementing some of the methods.

Friday, January 25, 2008

How to beat the performance of Microsoft's STL sets

In my line of work, we use a lot of STL sets. For us, these are usually sets of indices into some shape--say, a set of vertex indices or a set of edge indices. While the STL sets are very convenient and have some nice properties, they come with some large performance hits (the following applies to Microsoft's STL implementation):

Firstly, every element inserted into a set requires a memory allocation. While the release-mode memory allocator is reasonably good, it's still a significant hit. The debug mode memory allocator is incredibly slow. The sets do not attempt any kind of memory pooling.
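That one-allocation-per-insert behavior is easy to observe directly by replacing the global operator new and counting calls while filling a set. A quick diagnostic sketch (the exact count depends on the STL implementation, but node-based sets will allocate at least once per element):

```cpp
#include <cstdlib>
#include <new>
#include <set>

// Count every heap allocation the program makes.
static size_t gAllocs = 0;

void *operator new( size_t n )
{
    ++gAllocs;
    void *p = malloc( n );
    if( !p ) throw std::bad_alloc();
    return p;
}

void operator delete( void *p ) noexcept { free( p ); }

// Returns how many allocations inserting 'count' distinct ints costs.
size_t allocationsFor( int count )
{
    size_t before = gAllocs;
    std::set< int > s;
    for( int i = 0; i < count; ++i )
        s.insert( i );
    return gAllocs - before;
}
```

By contrast, a dynamic array that doubles its capacity would show only a logarithmic number of allocations for the same workload.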

Secondly, every element has 14 bytes of overhead (left pointer, right pointer, parent pointer, color, and some kind of "isnil" flag). Considering that in our application, the value itself is a four-byte integer, this means that more than 75% of the space is wasted in storage overhead.

So why use STL sets? The obvious answer is that they have very nice timing guarantees: insertions are O(log n), deletions are O(log n), lookups are O(log n), and iteration over all elements is O(n). The underlying algorithm uses red-black trees (AVL trees are also an option, but Microsoft uses red-black trees), and there really aren't any other algorithms that give you growth guarantees this nice. In short, it seems that as bad as this is, it's as good as it gets.

But hold on a moment. The O(f(n)) functions only dominate the time for sufficiently large values of n. Sometimes, n has to be a pretty large value before it starts to matter. So I decided to try an alternate implementation for sets: A simple sorted array of integers.

In a sorted array, insertions are O(n), deletions are O(n), lookups are O(log n), and iteration is O(n). Inserting N elements has a worst-case time of O(N²), which is obviously going to be worse than an STL set for some big enough value of N. But how big?

I coded up the sorted array, optimizing heavily for an array of integers (using memcpy and memmove instead of copy constructors, in particular). I then inserted a certain number of integers into both the STL set and the sorted array, timing how long each took and noting the point at which the set 'overtook' the array in speed. I expected the sorted array to win for low element counts, and the STL set to eventually pull ahead. I repeated the full test 8 times and took the median crossover point.
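The core insert operation is just a binary search followed by a memmove. Here's a sketch of the integer-specialized version described above, built on std::vector for brevity (the function name is mine; the real class manages its own buffer):

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Insert 'value' into the sorted vector 'v', keeping it sorted and
// rejecting duplicates (like set::insert). std::lower_bound is the
// O(log n) search; the memmove shuffle is the O(n) part.
bool sortedInsert( std::vector< int > &v, int value )
{
    std::vector< int >::iterator it =
        std::lower_bound( v.begin(), v.end(), value );
    if( it != v.end() && *it == value )
        return false;                   // already present
    size_t pos = it - v.begin();        // save before resize invalidates it
    v.resize( v.size() + 1 );           // may reallocate (amortized O(1))
    int *base = &v[ 0 ];
    // Shift the tail up one slot; memmove handles overlapping ranges.
    memmove( base + pos + 1, base + pos,
             ( v.size() - 1 - pos ) * sizeof( int ) );
    base[ pos ] = value;
    return true;
}
```

The memcpy/memmove trick is only valid because the elements are plain integers; for types with real copy constructors, you'd have to fall back to element-by-element copies and eat the corresponding slowdown.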

On my machine, the STL set in release mode is reliably faster than the sorted array once you pass about 1,500 elements. At that point, the O(n²) just starts to hurt too much. Still, that is a much higher value of N than I was expecting--I thought it wouldn't even make it into the triple digits. In debug mode, I gave up before I found a value of N high enough--at N=1,000,000, the array was still easily beating the set by a wide margin, and the gap still appeared to be widening (!). Apparently, those allocation hits are just too painful for the set to catch up; the dynamic array allocates rarely, since it doubles its capacity each time it has to expand.

I expect the crossover value of N to be smaller if the datatype is larger; I also expect that you would take a significant hit if you were required to use the copy constructors correctly. For my application, however, it looks like a sorted array is going to be sufficient for all but the largest sets. However, as the set gets very large, there are some other rather interesting optimizations I can make--more on that in a future post!

Tuesday, January 1, 2008

Lightweight bit vectors

Bit vectors are a pretty common data structure, and are a good example of how abstraction can make code efficient--if you don't have a good bit vector class, you're more likely to use an array of boolean values, which is hugely inefficient. Most bit vectors work by storing a size and/or capacity, plus a pointer to some memory on the heap. This post gives a gentle improvement on that model.

The goal is to try and avoid allocating any memory on the heap unless absolutely necessary. The best reason for this is speed; if you're using a bit vector in a tight inner loop, or if you're using a very large number of them, you can quickly bog your program down with allocation and deallocation calls.

The basic idea is that when the number of bits to be stored is no more than the number of bits used for the memory pointer itself (typically 32 or 64), we store the bits directly in the pointer. Of course, you cannot safely dereference such a repurposed pointer, so we'll need to be careful about whether the value in the pointer variable is actually a pointer or whether the bit vector has been stuffed there. This is done by using a private member function to access the bits.

Here's the code for a skeletal bit vector class using this principle:

class BitVector
{
public:
    typedef unsigned int size_type;

    BitVector() : mBuf( NULL ), mSize( 0 ) {}
    ~BitVector() { if( mSize > sizeof(void*) ) free( mBuf ); }

    void resize( size_type size );
    size_type size() const { return mSize; }
    bool get( size_type index );
    void set( size_type index, bool value );

private:
    void *mBuf;
    size_type mSize;
    unsigned char *getBuf();
};

A fully featured class would be much bigger; this has been stripped to its bare bones. There's no distinction between size and capacity. Next, we need some member function implementations:

void BitVector::resize( size_type size )
{
    if( mSize > sizeof(void*) ) free( mBuf );
    if( size <= sizeof(void*) ) mBuf = (void *) 0;
    else mBuf = malloc( size );
    mSize = size;
}

The resize function lets you set the size/capacity of the bit vector. Size is expressed in bytes here, which is not really good practice, but it simplifies the code. For capacities no larger than the size of a void pointer, the pointer itself will be used to store the values. Otherwise, the pointer will contain the address of the memory storing our bits, which is allocated with malloc.

unsigned char *BitVector::getBuf()
{
    if( mSize <= sizeof(void*) ) return (unsigned char *) &mBuf;
    else return (unsigned char *) mBuf;
}

This function is used to abstract away the difficulties of sometimes using the pointer as a pointer and sometimes using it as a value. We use the size as a switchboard; if the pointer is not big enough to hold the number of values we have, then we can assume that mBuf is being used as a pointer. If the pointer is big enough, however, we can use mBuf as a value. Either way, this function returns a pointer to the bit vector data itself.

bool BitVector::get( size_type index )
{
    return !!( getBuf()[ index >> 3 ] & ( 1 << ( index & 7 ) ) );
}

void BitVector::set( size_type index, bool value )
{
    if( value ) getBuf()[ index >> 3 ] |= 1 << ( index & 7 );
    else getBuf()[ index >> 3 ] &= ~( 1 << ( index & 7 ) );
}

These are the getter and setter functions. I've included them mostly because the bit-twiddling might be a little tricky, and it shows how the getBuf function can be used to avoid complicating the code unduly.

This technique isn't always appropriate; if you're always going to have large bit vectors, then this class doesn't buy you anything. However, if you expect bit vectors containing 32 or fewer booleans to show up sometimes, this class might give you a nice speedup.

Happy twiddling!