I promised a post about the sexy SXSI system in my last post. Its awesome name stands for Fast In-Memory XPath Search over Compressed Text and Tree Indexes, which says much more about what it does.
The main idea for representing the XML file is as follows: the shape of the tree is represented using succinct balanced parentheses, the labels are represented with structures that support rank/select and access queries, and the text with a self-index for the collection of texts inside the xml file.
Let’s go over each part:
- The tree: the tree is represented using balanced parentheses, by doing a DFS over the tree and writing a ‘(‘ every time you start visiting a subtree and a ‘)’ when you leave the subtree. This representation allows for very fast navigation over the tree, for a good reference about it see the paper by Arroyuelo, Canovas, Navarro and Sadakane. In the same order we keep another sequence that corresponds to the ids of each node in the tree, we write down the ids in DFS order. This representation allows to move though the tree fast and access node ids quickly. Besides that, something that is really interesting with this combination is the ability to jump to a node with a given id, we just need to perform a select query over the sequence of tags (recall that select(c,i) over a sequence obtains the position where the i-th c occurs). This last operation allows for traversing the tree only looking at the nodes that have a given id.
- The self-index, representing every text in the XML tree, is based on the FM-Index. Every node that contains textual data introduces a text into the collection. All the elements are concatenated introducing delimiters and every text in the collection gets an id (in our representation the texts are at the leaves of the tree so it’s easy to map back a forward). The id is given by the position of the delimiter in the BWT, and using a range searching data structure we can map to original leaf-id. This index allows for fast pattern matching among the text data present in the XML file.
This an overview, you can get more details looking the original paper. The nice thing is that by using this structures we allow for a flexible navigation of the structure, and an automaton-based search procedure achieves great practical results over real XML files (more details look at the experimental results in the paper). This makes SXSI an interesting option as engine for representing and querying XML files. The idea of the automaton was formalized a couple of days ago, you can see the paper by Maneth and Nguyen.
What’s next? Well, we are not done, there are some operations we would like SXSI to support that aren’t implemented yet. Another issue is that the structure is static. Making it dynamic in practice is quite tricky, as it usually is with succinct representations.
At this moment we are working on a first release (cleaning code, adding documentation, etc), and we hope to have something soon :-).
Hi, I’m a graduating student from Beijing.
I’m very interested in your work on tree automata based XPath serach engine, and looking for to see your first release. May I know when will that come out?
Thank you!
We are still working on the cleaning for the journal version. I will send you an email as soon as we get a definitive date.
Cheers!
Now it’s available 😀 http://fclaude.recoded.cl/archives/193
Pingback: Francisco Claude » Blog Archive » SXSI