LIBCDS2 – work in progress

LIBCDS2 provides basic building blocks for static and space-efficient data structures. We have made a couple of design decisions that differ from LIBCDS. We will keep to maintain LIBCDS, as this serves as a faster version (you will see why further down) for building small research prototypes.

The main goal of LIBCDS2 is to allow huge structures and support multi-core/parallel computation. The new design aims at supporting these and at the same time free the user from tricky considerations. We want to provide a library that can be used in production, with proper error handling and stable up to any scale. The drawback, there is a bit of overhead we have to live with.

Lets explain this with one example: Arrays. Arrays in LIBCDS2 are one of the few structures that are not completely static. Users can change values, as long as the maximum value does not exceed the original maximum (actually, that the new value does not take more bits than the width of each field). The array optimizes under the hood, if your width is 16, it uses an array of uint16_t underneath. In order to do so, but not bother the user with knowing whether the specific width they want can be optimized, the Array::Create() method instantiates a subclass depending on the width. This adds one indirection when calling the set/access operations.

Two more examples regarding the performance hit LIBCDS2 is taking are the reference counters and the default storage of integers and sequences.

All structures in LIBCDS2 use reference counters. This has proven to be really effective when building space-efficient data structures, since they allow you to re-use substructures (and they are static, so it’s the perfect scenario). The user creates a structure with new (C++), and then just calls the method Use() to tell the library that the object is being used. When the user is done with the object, instead of deleting it, she/he calls the Unuse() method. If no one else is using the object, it is automatically freed. The problem: in multithreading, the Use and Unuse methods have to acquire a lock on the object. This slows down the Use/Unuse operations, and also adds the space required to keep a mutex per class (unless we used a global lock, which is an option we are considering, since Use/Unuse may not be that frequent).

Finally, the storage of integers and sequences. LIBCDS2 works almost exclusively with Array objects. These are slower than standard arrays, and the samples within structures are done with Array objects. Why did we do this? Well, the decision was made thinking that we don’t want to waste 64-bit samples for small structures, but we also want the library to escalate to huge (order of TBs) structures. For this, we need to represent the samplings using as many bits as needed (or close to that). Array offers that, and it also encapsulates a possible source of problems into a single class when escalating, making it much easier to maintain.

We still have many more decisions to make, and the library is slowly taking shape. Any comments, suggestions are welcome. If you want to contribute, please let us know, the project is hosted on github.

(This will be cross-posted on the libcds2 web page.)

Leave a Reply