Monday, February 4, 2008

A vector-based set (Part 1)

I recently finished the vector-based set described in my previous post, and have incorporated it into production code. I thought I'd spend a few posts explaining my thought process and discussing some of the gotchas in making STL-compatible containers. Previously, I had something like this:

map< PointIndex, set< SideIndex > > mymap;

PointIndex and SideIndex are basically integers referring to geometry in my scene. The main reason for giving them a type other than "int" is to make the code easier to read and to make it harder to mix them up.

Anyway, the problem here is that mymap often has a total of about 100,000 SideIndex values in it, averaging a few dozen per set. It takes roughly a tenth to a third of a second to make a complete copy of one of these maps (and about the same amount of time to deallocate it when deleting the object), so this is a major time sink. We're going to make a drop-in replacement for the set type that would drastically speed up these map copies.

First off, the SGI STL reference was indispensable. Although there were a few nitpicky ways in which Microsoft didn't follow their specifications, this is still the best STL reference available on the web. The link I gave is directly to the specification for a set--we're going to make our class compatible with that specification.

From that specification, there are a few main things that we're going to need to implement:
  • A whole bunch of typedefs
  • Forward and reverse iterators (plus begin()/end())
  • The insert() methods
  • The erase() methods
  • The find() and *_bound() methods.
  • Construction, destruction, and assignment
  • A bunch of simple utility methods--size(), swap(), empty(), clear(), operators, etc.

This is going to take more than one post, but I'll finish this one off with the basic data structure and class declaration:

template< class T >
class IntegerSet
{
public:
typedef size_t size_type;
// Coming soon...
protected:
T *mData;
size_type mLength, mCapacity;
};

The 'm' prefix refers to member variables. The 'data' is the actual sorted array of items. The array will have enough room in it for 'capacity' items, but only 'length' of those items will be used. The remainder is blank space to allow for growth without reallocation, the same way that vectors do it. The 'size_t' type should be standard, and is the same type that the Microsoft STL implementation uses for its size_type.

You may be a bit puzzled by my use of size_type here. Why not just use size_t directly? The reason is that this is the way the STL expects you to do it; your class is expected to have a local 'size_type' typedef which algorithms may use when operating on your class. For example, if I wanted to write a trivial templated algorithm that prints the letter "X" once for every element in a given STL container, I might do it something like this:

template< class Container >
void printX( const Container &c, ostream &o )
{
Container::size_type i;
for( i = 0; i < c.size(); ++i )
o << "X";
}

By having the size_type defined directly in the class, I can guarantee that "i < c.size()" will work without any problems, even if you need to use some kind of strange type for the size (like a 12-bit integer, say).

Anyway, that's the initial datastructure layout--a grand total of 12 bytes of overhead (24 on a 64-bit compiler). Next time, I'll start working on implementing some of the methods.

No comments: