Friday, January 25, 2008

How to beat the performance of Microsft's STL sets

In my line of work, we use a lot of STL sets. For us, these are usually sets of indices into some shape--say, a set of vertex indices or a set of edge indices. While the STL sets are very convenient and have some nice properties, they come with some large performance hits (the following applies to Microsoft's STL implementation):

Firstly, every element inserted into a set requires a memory allocation. While the release-mode memory allocator is reasonably good, it's still a significant hit. The debug mode memory allocator is incredibly slow. The sets do not attempt any kind of memory pooling.

Secondly, every element has 14 bytes of overhead (left pointer, right pointer, parent pointer, color, and some kind of "isnil" flag). Considering that in our application, the value itself is a four-byte integer, this means that more than 75% of the space is wasted in storage overhead.So why use STL sets? The obvious answer is that they have very nice timing guarantees: Insertions are O(log n), deletions are O(log n), lookups are O(log n), and iteration over all elements is O(n). The underlying algorithm uses red-black trees (AVL trees are also an option, but Microsoft uses red-black trees), and there really aren't any other algorithms that give you growth guarantees this nice. In short, it seems that as bad as this is, it's good as it gets.

But hold on a moment. The O(f(n)) functions only dominate the time for sufficiently large values of n. Sometimes, n has to be a pretty large value before it starts to matter. So I decided to try an alternate implementation for sets: A simple sorted array of integers.

In a sorted array, insertions are O(n), deletions are O(n), lookups are O(log n), and iteration is O(n). Inserting N elements has a worst-case time of O(N2), which is obviously going to be worse than an STL set for some big enough value of N. But how big?

I coded up the sorted array, optimizing heavily for an array of integers (using memcpy and memmove instead of copy constructors, in particular). I would then insert a certain number of integers into both the STL set and the sorted array, timing how long each took, and tracking at about what point the sets 'overtook' the array in speed. I expect the sorted array to win for low element counts, and the STL set to eventually pull ahead. I repeated the full test 8 times, and took the median passing point.

On my machine, the STL set in release mode is reliably faster than the sorted array when you pass 1,500 elements. At this point, that O(n2) just starts to hurt too much. Still, that is a much higher value of N than I was expecting--I thought it wouldn't even make it into the triple digits. In debug mode, I gave up before I found a value of N high enough--at N=1,000,000, the array was still easily beating the set in performance by a wide margin, and the gap still appeared to be widening (!). Apparently, those allocation hits are just too painful for the set to catch up; the dynamic array allocates rarely, since it doubles its capacity each time it has to expand.

I expect the breakthrough value of N to be smaller if the datatype is larger; I also expect that you would take a significant hit if you were required to use the copy constructors correctly. For my application, however, it looks like a sorted array is going to be sufficient for all but the largest sets. However, as the set gets very large, there are some other rather interesting optimizations I can make--more on that in a future post!

1 comment:

khartmann said...

When I run the code listed below in windows, the memory used reported in the task manager is 314592K

This is much more than would be expected assuming 14bytes of overhead per node.

(14+4)*10000000 = 180000K

Do you get the same results, and if so, would you know the reason?

Thanks,
Kevin


#include "stdafx.h"
#include {set}

int _tmain(int argc, _TCHAR* argv[])
{
std::set{ int } stuff;

for (int i = 0; i < 10000000; i++)
{
stuff.insert( i );
}

printf( "Size: %d\n", stuff.size());
printf( "Int size: %d\n", sizeof( int ) );

getchar();
return 0;
}