Archive for the ‘Research’ Category.

LIBCDS2 – work in progress

LIBCDS2 provides basic building blocks for static and space-efficient data structures. We have made a couple of design decisions that differ from LIBCDS. We will keep maintaining LIBCDS, as it serves as a faster version (you will see why further down) for building small research prototypes.

The main goal of LIBCDS2 is to allow huge structures and support multi-core/parallel computation. The new design aims at supporting these while freeing the user from tricky considerations. We want to provide a library that can be used in production, with proper error handling, stable up to any scale. The drawback: there is a bit of overhead we have to live with.

Let's explain this with one example: arrays. Arrays in LIBCDS2 are one of the few structures that are not completely static. Users can change values, as long as the new value does not exceed the original maximum (more precisely, as long as the new value does not take more bits than the width of each field). The array optimizes under the hood: if your width is 16, it uses an array of uint16_t underneath. In order to do so without bothering the user with knowing whether their specific width can be optimized, the Array::Create() method instantiates a subclass depending on the width. This adds one level of indirection when calling the set/access operations.
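A minimal sketch of how such a width-dispatching factory can look. The class layout and member names here are illustrative, not the actual LIBCDS2 API:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Abstract interface: callers only ever see set/access.
class Array {
 public:
  virtual ~Array() = default;
  virtual uint64_t Access(size_t pos) const = 0;
  virtual void Set(size_t pos, uint64_t value) = 0;
  static std::unique_ptr<Array> Create(size_t length, uint8_t width);
};

// Specialization backed by a native integer type when the width fits.
template <typename T>
class TypedArray : public Array {
 public:
  explicit TypedArray(size_t length) : data_(length, 0) {}
  uint64_t Access(size_t pos) const override { return data_[pos]; }
  void Set(size_t pos, uint64_t value) override {
    data_[pos] = static_cast<T>(value);
  }
 private:
  std::vector<T> data_;
};

std::unique_ptr<Array> Array::Create(size_t length, uint8_t width) {
  // Pick the smallest native type that holds `width` bits; a real
  // bit-packed fallback would handle arbitrary widths in between.
  if (width <= 8)  return std::make_unique<TypedArray<uint8_t>>(length);
  if (width <= 16) return std::make_unique<TypedArray<uint16_t>>(length);
  if (width <= 32) return std::make_unique<TypedArray<uint32_t>>(length);
  return std::make_unique<TypedArray<uint64_t>>(length);
}
```

Callers only ever touch the Array interface, so the virtual call on Access/Set is exactly the one level of indirection mentioned above.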

Two more examples regarding the performance hit LIBCDS2 is taking are the reference counters and the default storage of integers and sequences.

All structures in LIBCDS2 use reference counters. This has proven to be really effective when building space-efficient data structures, since it allows you to re-use substructures (and they are static, so it's the perfect scenario). The user creates a structure with new (C++), and then just calls the method Use() to tell the library that the object is being used. When the user is done with the object, instead of deleting it, they call the Unuse() method. If no one else is using the object, it is automatically freed. The problem: with multithreading, the Use and Unuse methods have to acquire a lock on the object. This slows down the Use/Unuse operations, and also adds the space required to keep a mutex per object (unless we used a global lock, which is an option we are considering, since Use/Unuse may not be that frequent).
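A sketch of the Use()/Unuse() protocol with a per-object mutex. The Use/Unuse names come from the text above; everything else is an assumption and the real LIBCDS2 implementation may differ:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Illustrative base class for reference-counted static structures.
class ReferenceCounted {
 public:
  virtual ~ReferenceCounted() = default;

  void Use() {
    std::lock_guard<std::mutex> guard(lock_);
    ++users_;
  }

  // Returns true if this call freed the object.
  bool Unuse() {
    bool dead = false;
    {
      std::lock_guard<std::mutex> guard(lock_);
      dead = (--users_ == 0);
    }
    if (dead) delete this;  // static structure: safe once no one uses it
    return dead;
  }

 private:
  size_t users_ = 0;
  std::mutex lock_;  // the per-object space overhead mentioned above
};
```

A single global lock would remove the per-object mutex space at the cost of contention, which is acceptable exactly when Use/Unuse calls are infrequent.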

Finally, the storage of integers and sequences. LIBCDS2 works almost exclusively with Array objects. These are slower than standard arrays, and even the samplings within structures are stored as Array objects. Why did we do this? Well, the decision was made considering that we don't want to waste 64-bit samples on small structures, but we also want the library to scale to huge (on the order of terabytes) structures. For this, we need to represent the samplings using as many bits as needed (or close to that). Array offers that, and it also encapsulates a possible source of scaling problems in a single class, making the library much easier to maintain.

We still have many more decisions to make, and the library is slowly taking shape. Any comments or suggestions are welcome. If you want to contribute, please let us know; the project is hosted on GitHub.

(This will be cross-posted on the libcds2 web page.)

Finding the most significant bit

In one of the problems I’ve been working on with Diego Seco and Patrick Nicholson, we needed the most-significant-bit (msb) function as a primitive for our solution. As Diego pointed out today, this function was one of the bottlenecks of our structure, consuming a considerable amount of time.

In this post I’ll go through the solutions we tested in our data structure. Continue reading ‘Finding the most significant bit’ »
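Since only the excerpt appears here, a taste of the primitive itself: msb(x) is the position of the highest set bit, i.e. floor(log2 x). Below are a portable loop and the common GCC/Clang intrinsic; neither is necessarily the solution the full post settles on:

```cpp
#include <cassert>
#include <cstdint>

// Reference implementation: shift until the value is exhausted.
// Convention: msb(0) is undefined, so we return -1.
inline int msb_loop(uint64_t x) {
  int pos = -1;
  while (x) { x >>= 1; ++pos; }
  return pos;
}

// GCC/Clang intrinsic version: count leading zeros and subtract.
// __builtin_clzll is undefined for 0, so guard that case.
inline int msb_clz(uint64_t x) {
  return x ? 63 - __builtin_clzll(x) : -1;
}
```

On x86 the intrinsic typically compiles to a single BSR/LZCNT instruction, which is why table-free versions like this tend to win in microbenchmarks.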

CSCBCE, WAA and Finland!

From May 20-22 we hosted CSCBCE 2010 in Waterloo. I was part of the organizing committee, so it was a very busy week. Everything went quite well; Bob (the chair of the conference) was able to get a lot of funding, and that made everything in the conference much easier, and of course, the meals were awesome! :D. I couldn't attend many of the talks/tutorials, since I couldn't escape my TA duties (I wasn't really away :) ), but I certainly enjoyed the ones I could attend and met a lot of interesting people.

After that, on the 23rd I left for Helsinki to present the paper we wrote with Gonzalo Navarro, “Extended Compact Web Graph Representations”. It was published in Algorithms and Applications, a Festschrift for Prof. Esko Ukkonen on his 60th birthday. The workshop was great, with an interesting variety of topics. Afterwards we went for dinner at a lovely place in the center of Helsinki :-).

I stayed in Helsinki to work with Prof. Veli Makinen and part of his group on succinct data structures (Leena Salmela and Niko Valimaki). We have been looking at representations for the graph that arises from the overlaps between the sequences obtained during the genome sequencing process. So far we have some interesting practical results and some pretty pictures of how this graph looks.

About the city itself, it is an amazing city, very alive and with all sorts of things to do (maybe because it’s summer now). I went for a walk on Saturday and ended up in a festival where a Chilean group was playing. It was an amazing experience, got to meet many other Chileans that live here in Helsinki and we ended up having a really nice party.

I guess that would be my report for now, on Sunday I’m going back to Waterloo and leaving again on Wednesday to Fredericton for a workshop on data structures for spatial data.

New papers

I got two pieces of really good news after my return from LATIN. Our paper (with Gonzalo Navarro) about compact Web graph representations was accepted for publication in ACM Transactions on the Web (TWEB). This was one of the main results in my masters thesis. I uploaded some of the code and documentation related to our TWEB paper to http://webgraphs.recoded.cl .

The second piece of good news was that the extended paper about straight-line programs (also with Gonzalo Navarro) was accepted into Fundamenta Informaticae. We included some new results. In particular, we gave an improved version for finding the most common substring in a grammar-compressed text. Another novelty is an adaptation of our index to the case when the input is not a grammar but a set of rules plus a sequence of (non)terminal symbols; in this case, expanding that last sequence according to the rules generates the text.

SPIRE 2010

I was invited to help in the organization of the String Processing and Information Retrieval Symposium (SPIRE) this year. The conference is going to be held in Los Cabos, Mexico. The chairs are Edgar Chavez and Stefano Lonardi.

I’ve been to two previous editions of SPIRE, in 2007 and 2008. Both were a great experience, with interesting presentations and a lot of fun, except for the fines incurred for driving in Australia :-P.

The call for papers for this year can be found here, or here for the plain-text version. Submissions are due June 20, and given the schedule there is no room for any deadline extension.

LATIN 2010

I just got back from LATIN 2010. I was on the organizing committee, and it was a really good experience. We got invaluable help from professors and students of UABJO.

The conference started with quite a few problems. To start with, many people could not make it to the conference thanks to a friendly volcano that left Europe cut off. This forced us to change the conference program, and a considerable number of talks were cancelled.

The second problem we had (Monday night) was that the UABJO university workers went on strike. The conference was being held there, and in Mexico you are not allowed to enter the premises during a strike. The university gave us support, and Alex Lopez-Ortiz managed to move the conference to the Hotel Victoria.

Besides those “little” incidents, I think the conference went pretty well (the recommendation comes from a person close to the organization :p). The talks I got to see were really good and I really liked many of them. The others I didn’t understand that much, but I think that’s normal in a broad conference.

I was also presenting a paper with Jeremy Barbay and Gonzalo Navarro about succinct representations of binary relations.

Those are all the news from the conference. About the place, there is too much to tell. To start with, the people are friendly and cheerful, which makes Oaxaca a really nice place to visit. I took a tour (after the conference) to some places nearby and got to see some ruins, swim in natural pools, and drink mezcal. There is a lot to do and visit around; it would take too much space to cover it all :-).

The SXSI System

I promised a post about the sexy SXSI system in my last post. Its awesome name stands for Fast In-Memory XPath Search over Compressed Text and Tree Indexes, which says much more about what it does.

The main idea for representing the XML file is as follows: the shape of the tree is represented using succinct balanced parentheses, the labels are represented with structures that support rank/select and access queries, and the text with a self-index for the collection of texts inside the XML file.

Let’s go over each part:

  1. The tree: the tree shape is represented using balanced parentheses, obtained by doing a DFS over the tree and writing a ‘(‘ every time you start visiting a subtree and a ‘)’ when you leave it. This representation allows for very fast navigation over the tree; for a good reference about it, see the paper by Arroyuelo, Canovas, Navarro and Sadakane. In the same DFS order we keep another sequence with the tag of each node in the tree. This combination lets us move through the tree fast and access node tags quickly. Besides that, something really interesting about it is the ability to jump to a node with a given tag: we just need to perform a select query over the sequence of tags (recall that select(c,i) over a sequence obtains the position where the i-th c occurs). This last operation allows for traversing the tree while looking only at the nodes that have a given tag.
  2. The self-index, representing every text in the XML tree, is based on the FM-index. Every node that contains textual data introduces a text into the collection. All the elements are concatenated with delimiters in between, and every text in the collection gets an id (in our representation the texts are at the leaves of the tree, so it’s easy to map back and forth). The id is given by the position of the delimiter in the BWT, and using a range-searching data structure we can map it to the original leaf id. This index allows for fast pattern matching over the text data present in the XML file.
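The shape-plus-tags encoding from point 1 can be sketched as follows (the tree type is a toy stand-in, not SXSI’s actual structures):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy node: a tag plus children, just enough to show the encoding.
struct Node {
  char tag;
  std::vector<Node> children;
};

// DFS: write '(' on entering a subtree, ')' on leaving it, and emit
// the tags in the same DFS order into a parallel sequence.
void Encode(const Node& node, std::string* shape, std::string* tags) {
  shape->push_back('(');
  tags->push_back(node.tag);
  for (const Node& child : node.children) Encode(child, shape, tags);
  shape->push_back(')');
}
```

For the tree a(b, c(d)) this produces the shape (()(())) and the tag sequence abcd; the i-th ‘(‘ and the i-th tag both correspond to the i-th node in DFS order, which is what makes the select-based jump to a given tag work.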

This is an overview; you can get more details by looking at the original paper. The nice thing is that by using these structures we allow for a flexible navigation of the structure, and an automaton-based search procedure achieves great practical results over real XML files (for more details, look at the experimental results in the paper). This makes SXSI an interesting option as an engine for representing and querying XML files. The idea of the automaton was formalized a couple of days ago; you can see the paper by Maneth and Nguyen.

What’s next? Well, we are not done; there are some operations we would like SXSI to support that aren’t implemented yet. Another issue is that the structure is static. Making it dynamic in practice is quite tricky, as it usually is with succinct representations.

At this moment we are working on a first release (cleaning code, adding documentation, etc), and we hope to have something soon :-).

Update

It’s been a long time since my last post. Lots of things have happened, and I’ll use this post to get back on track. I usually don’t post anything about my private life, yet I consider it important to mention that my girlfriend was visiting me in Waterloo. We don’t get to see each other very often, so it was really great having her here for more than two months!

Now back to geeky stuff :-). I was traveling during December. First I went to visit Sebastian Maneth and Kim Nguyen at NICTA in Australia. We worked on our SXSI system (more on it in a following post); in particular, I ran some experiments with the XBW and regular/naive tries for answering some specific queries in the system. For path searching both options were awesome; the second one was of course faster and sometimes bigger. The way we answer queries in the system doesn’t allow us to switch exclusively to either of those two representations, but considering them as part of the index for answering specific queries could be interesting. We didn’t make any final decision, and the fact that the XBW is patented could complicate things a little bit. We expect to release a GPL version of SXSI, and we would like people to be able to use it without any restrictions.

After two weeks in Sydney I went back to Chile. It was probably the worst flight ever: NICTA paid for my trip to Australia, so I had to route back through Canada. I came back, spent 12 hours in the airport, and then took a really bad connection through Miami. The total time from when I left Australia until I got to Chile was more than 50 hours. During that time ISAAC was taking place in Hawaii. I was a coauthor on two papers accepted there: the untangling one (with Diego Arroyuelo, Reza Dorrigiv, Stephane Durocher, Meng He, Alejandro López-Ortiz, J. Ian Munro, Patrick Nicholson, Alejandro Salinger, and Matthew Skala) and the disc covering one (with Reza Dorrigiv, Stephane Durocher, Robert Fraser, Alejandro López-Ortiz, and Alejandro Salinger). While Bob and Matthew were presenting them, I was attending Alejandro Salinger’s wedding (he is also a coauthor on both papers :p). The wedding was great! I really enjoyed it, the planning was amazing, and we had so much fun!

After a couple of weeks in Chile, on the 6th of January I came back to Waterloo. When I got here we started working on the untangling problem we “solved” in the ISAAC paper about range searching. We found a flaw in our algorithm and spent a LOT of time correcting the method. We submitted a corrected journal version of the paper. Patrick and I implemented the untangling algorithm and a simplified version of the range-searching data structure (that’s where we found the problem). It works really well on real datasets, and at this moment we are preparing an experimental paper including those results.

I also went to an Android Developer Lab and got a Nexus One as a gift from Google! It was a really good experience (the lab, and of course the Nexus too :-)). We learned about implementing software for Android. I’ve played a little bit with it and so far I’m impressed with what a phone can do! (Yes, I used to have a phone that wouldn’t do more than make calls.) I’m really tempted to test the range-searching data structure on the phone and compare it with other data structures. It’s one of the scenarios we were thinking of when designing our structure.

We also got some good news about two other papers: one about compact Web graph representations in Esko Ukkonen’s Festschrift with Gonzalo Navarro, and one about indexing repetitive sequences in BIBE 2010 (with Antonio Farina, Miguel A. Martinez-Prieto and Gonzalo Navarro), where we got to see how the SLP index works in practice.

The last thing remaining to tell: we also got a paper about succinct binary relations accepted into LATIN (with Jeremy Barbay and Gonzalo Navarro). I’ll be going there in April :-).

I guess that’s a good summary of what happened in the last couple of months =). There are some other things I want to write about but I’ll start doing so during the next week. That way I’ll be able to keep posting and not let the blog sleep for so long again :-).

MFCS 2009 / PSC 2009

The conference is over, I had a great time, met a lot of interesting and nice people. The summary: a great experience.

I have now arrived in Prague to attend PSC 2009. (I still have to write about my train experience; I hope to do it in the next couple of days.) The city is awesome, incredibly beautiful. A lot of people I know are here (for PSC 2009 and SISAP 2009), so it’s nice to see familiar faces. Yesterday I went to the SISAP 2009 dinner with Nieves, Oscar and Gonzalo. After that, we took a walk around the city; the illumination of the buildings is something amazing. Of course, I forgot my camera, so I don’t have any pictures to show. Today I’m taking the camera with me ;-).

Today I’ll spend the day walking around the city and tomorrow back to attending talks at PSC2009 :-).

MFCS Talk

As I said in the previous post, the talk is ready and online. I’ll use this space to tell a little bit about the result we achieved with Gonzalo Navarro in the paper “Self-Indexed Text Compression using Straight-Line Programs”.

The main problem we address is indexing straight-line programs, which are a special kind of context-free grammar that generates a language composed of one single text. This can be used as a compressed representation of the text; for example, LZ78 and Re-Pair map to an SLP without much effort. The bad news is that finding the minimum grammar that generates the text is NP-complete, so approximations have to be used.

The formal definition is: a Straight-Line Program (SLP) \mathcal{G}=(X=\{X_1,\ldots,X_n\},\Sigma) is a grammar that defines a single finite sequence T[1,u], drawn from an alphabet \Sigma=[1,\sigma] of terminals. It has n rules, which must be of the following types:

  • X_i \rightarrow \alpha, where \alpha \in \Sigma. It represents the string \mathcal{F}(X_i)=\alpha.
  • X_i \rightarrow X_l X_r, where l,r < i. It represents the string \mathcal{F}(X_i) = \mathcal{F}(X_l)\mathcal{F}(X_r).

We call \mathcal{F}(X_i) the phrase generated by nonterminal X_i, and T=\mathcal{F}(X_n).
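As a toy illustration of the definition (the rule encoding is made up for this example), here is an SLP expanded back into the text it generates:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal SLP rule: either a terminal character, or a pair (left,
// right) of earlier rule indices. The text is F(X_n), the phrase
// of the last rule.
struct Rule {
  bool terminal;
  char symbol;       // used when terminal
  int left, right;   // used otherwise; both must be < this rule's index
};

// Expand F(X_i) by structural recursion on the rules.
std::string Expand(const std::vector<Rule>& rules, int i) {
  const Rule& r = rules[i];
  if (r.terminal) return std::string(1, r.symbol);
  return Expand(rules, r.left) + Expand(rules, r.right);
}
```

For instance, the rules X_0 -> a, X_1 -> b, X_2 -> X_0 X_1, X_3 -> X_2 X_2 generate the text abab; in general n rules can generate a text of length exponential in n, which is where the compression comes from.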

Given a pattern P and a text represented as an SLP, we focus on the following problems:

  • Count the number of occurrences of P in the text generated by the SLP.
  • Locate the positions where the pattern occurs.

In the past this problem was approached from an online-searching point of view; our goal was to provide an index that supports searching in time sublinear in the size of the SLP. For the indexing approach we represent the SLP as a labeled binary relation. We consider a binary relation \mathcal{R} \subset A\times B where A=[n_1] and B=[n_2], and a labeling function \mathcal{L}:A\times B \rightarrow L, where L=[\ell]\cup\{\perp\} is the set of labels and \perp represents no label/relation between a pair. We want to answer the following queries:

  • Label of a given pair.
  • Elements related to an element of B.
  • Elements related to an element of A.
  • Elements related through a given label.
  • Given two contiguous ranges in A and B respectively, find the pairs (a,b) in the relation such that a\in A lies in the first range and b\in B lies in the second.

The first theorem we include shows that each of these queries can be answered paying close to O(\log n_2) time per element retrieved (element in the answer). Using that result we show how to modify an SLP so that the rules are sorted in lexicographical order, represent this variant of the SLP in n\log n+o(n\log n) bits of space, and support queries such as:

  • Access to rules.
  • Reverse access to rules (given two symbols, which rule generates them in a given order).
  • Rules using a given left/right symbol.
  • Rules using a contiguous range of symbols.

All the queries require O(\log n) time per datum delivered, as a direct consequence of our labeled binary relation representation. Finally, including some extra structures we get our final result (I'm not going to go into the details of how to use the SLP representation, or where the different trade-offs come from):

Theorem:

Let T[1,u] be a text over an effective alphabet [\sigma] represented by an SLP of n rules and height h. Then there exists a representation of T using n(\log u + 3\log n + O(\log\sigma + \log h) + o(\log n)) bits, such that any substring T[l,r] can be extracted in time O((r-l+h)\log n), and the positions of occ occurrences of a pattern P[1,m] in T can be found in time O((m(m+h)+h\,occ)\log n). By removing the O(\log h) term in the space, the search time rises to O((m^2+occ)h\log n). By further removing the O(\log\sigma) term in the space, the search time rises to O((m(m+h)\log n+h\,occ)\log n). The existence problem is solved within the time corresponding to occ=0.

There are some open problems we were stuck on, included at the end of the article, and we are now working on a journal version with some interesting extensions.