Tuesday, January 1, 2008

Lightweight bit vectors

Bit vectors are a pretty common data structure, and are a good example of how abstraction can make code efficient--if you don't have a good bit vector class, you're more likely to use an array of boolean values, which is hugely inefficient. Most bit vectors work by storing a size and/or capacity, plus a pointer to some memory on the heap. This post gives a gentle improvement on that model.

The goal is to try and avoid allocating any memory on the heap unless absolutely necessary. The best reason for this is speed; if you're using a bit vector in a tight inner loop, or if you're using a very large number of them, you can quickly bog your program down with allocation and deallocation calls.

The basic idea is that when the number of bits to be stored is no more than the number of bits used for the memory pointer itself (typically 32 or 64), we store the bits directly in the pointer. Of course, you cannot safely dereference such a repurposed pointer, so we'll need to be careful about whether the value in the pointer variable is actually a pointer or whether the bit vector has been stuffed there. This is done by using a private member function to access the bits.

Here's the code for a skeletal bit vector class using this principle:

class BitVector
{
public:
typedef unsigned int size_type;
BitVector() : mBuf( NULL ), mSize( 0 ) {}
~BitVector() { if( mSize > sizeof(void*)) free(mBuf); }
void resize( size_type size );
void size() { return mSize; }
bool get( size_type index );
void set( size_type index, bool value );
private:
void *mBuf;
size_type mSize;
unsigned char *getBuf();
};

A fully featured class would be much bigger; this has been stripped to its bare bones. There's no distinction between size and capacity. Next, we need some member function implementations:

void BitVector::resize( size_type size )
{
if( mSize > sizeof(void*) ) free( mBuf );
if( size <= sizeof(void*) ) mBuf = (void *) 0;
else mBuf = malloc( size );
mSize = size;
}

The resize function lets you set the size/capacity of the bit vector. Size is expressed in bytes here, which is not really good practice, but it simplifies the code. For capacities less than the size of a void pointer, the pointer itself will be used to store values. Otherwise, the pointer will contain the address of the memory storing our bits, which is allocated with malloc.

unsigned char *getBuf()
{
if( mSize <= sizeof(void*) ) return (unsigned char *) &mBuf;
else return (unsigned char *) mBuf;
}

This function is used to abstract away the difficulties of sometimes using the pointer as a pointer and sometimes using it as a value. We use the size as a switchboard; if the pointer is not big enough to hold the number of values we have, then we can assume that mBuf is being used as a pointer. If the pointer is big enough, however, we can use mBuf as a value. Either way, this function returns a pointer to the bit vector data itself.

bool BitVector::get( size_type index )
{
return !!(getBuf()[index >> 3] & (1 << (index & 7)));
}

bool BitVector::set( size_type index, bool value )
{
if( value ) getBuf()[index >> 3] |= 1 << (index & 7);
else getBuf()[index >> 3] &= ~(1 << (index &7));
}

These are the getter and setter functions. I've included them mostly because the bit-twiddling might be a little tricky, and it shows how the getBuf function can be used to avoid complicating the code unduly.

This technique isn't always appropriate; if you're always going to have large bit vectors, then this class doesn't buy you anything. However, if you expect bit vectors containing 32 or fewer booleans to appear sometimes, This class might buy you a nice speedup.

Happy twiddling!

1 comment:

Nate Fries said...

The concept is very clever.

However, it appears that when resizing you lose your data. This is not good.