Archive for April 2010

LATIN 2010

I just got back from LATIN 2010, I was organizing committee and it was a really good experience. We got invaluable help from professors and students from UABJO.

The conference started with quite a few problems. To start with, many people could not make it to the conference thanks to a friendly volcano that left Europe uncommunicated. This forced us to change the conference program and a considerable number of talks were cancelled.

The second problem we had (Monday night) is that the UABJO university workers went on a strike. The conference was being held there and in Mexico you are not allowed to enter the place during the strike. The university gave us support and Alex Lopez-Ortiz managed to move the conference to the Hotel Victoria.

Besides those “little” incidents, I think the conference went pretty well (the recommendation comes from a person close to the organization :p). The talks I got to see were really good and I did really like many of them. The others I didn’t understand that much, but I think that’s normal in a broad conference.

I was also presenting a paper with Jeremy Barbay and Gonzalo Navarro about succinct representations of binary relations.

Those are all the news about the conference. About the place there is too much to tell. To start with, the people are friendly and cheering, which makes Oaxaca a really nice place to visit. I took a tour (after the conference) to some places nearby and got to see some ruins, swim in natural pools, drink mezcal. There is a lot to do and visit around. It would take too much space to cover it all :-).

The SXSI System

I promised a post about the sexy SXSI system in my last post. Its awesome name stands for Fast In-Memory XPath Search over Compressed Text and Tree Indexes, which says much more about what it does.

The main idea for representing the XML file is as follows: the shape of the tree is represented using succinct balanced parentheses, the labels are represented with structures that support rank/select and access queries, and the text with a self-index for the collection of texts inside the xml file.

Let’s go over each part:

  1. The tree: the tree is represented using balanced parentheses, by doing a DFS over the tree and writing a ‘(‘ every time you start visiting a subtree and a ‘)’ when you leave the subtree. This representation allows for very fast navigation over the tree, for a good reference about it see the paper by Arroyuelo, Canovas, Navarro and Sadakane. ¬†In the same order we keep another sequence that corresponds to the ids of each node in the tree, we write down the ids in DFS order. This representation allows to move though the tree fast and access node ids quickly. Besides that, something that is really interesting with this combination is the ability to jump to a node with a given id, we just need to perform a select query over the sequence of tags (recall that select(c,i) over a sequence obtains the position where the i-th c occurs). This last operation allows for traversing the tree only looking at the nodes that have a given id.
  2. The self-index, representing every text in the XML tree, is based on the FM-Index. Every node that contains textual data introduces a text into the collection. All the elements are concatenated introducing delimiters and every text in the collection gets an id (in our representation the texts are at the leaves of the tree so it’s easy to map back a forward). The id is given by the position of the delimiter in the BWT, and using a range searching data structure we can map to original leaf-id. This index allows for fast pattern matching among the text data present in the XML file.

This an overview, you can get more details looking the original paper. The nice thing is that by using this structures we allow for a flexible navigation of the structure, and an automaton-based search procedure achieves great practical results over real XML files (more details look at the experimental results in the paper). This makes SXSI an interesting option as engine for representing and querying XML files. The idea of the automaton was formalized a couple of days ago, you can see the paper by Maneth and Nguyen.

What’s next? Well, we are not done, there are some operations we would like SXSI to support that aren’t implemented yet. Another issue is that the structure is static. Making it dynamic in practice is quite tricky, as it usually is with succinct representations.

At this moment we are working on a first release (cleaning code, adding documentation, etc), and we hope to have something soon :-).