## New Domain

I bought a new domain for my personal Web page and to host some projects (The blog stays at WordPress).

The main reason to buy the domain was to be able to mention URLs to source code in papers and be sure that I could maintain the URL for more than 2-3 years. The new address for my Web pages is http://www.recoded.cl/.

As I mentioned in the last post, I uploaded some of the code for the compact/compressed Web graph representations in http://webgraphs.recoded.cl/, the next project I would like to upload is libcds, I’ve been working on the new version for quite a while now, as soon as I have that one I’ll create a Web page for the project and upload it here. The main reason for moving it is that I prefer to have a private svn server and publish only the final code on the Web, plus the option to build the Web page using the tools I feel more comfortable with.

Posted in Personal | 1 Comment

## New papers

I got two really good news after my return from LATIN. Our paper (with Gonzalo Navarro) about compact Web graphs representations was accepted for publications in ACM Transactions on the Web (TWEB). This was one of the main results in my masters thesis. I uploaded some of the code and documentation related to our TWEB paper to http://webgraphs.recoded.cl .

The second good news was that the extended paper about Straight-Line programs (also with Gonzalo Navarro) was accepted into Fundamenta Informaticae. We included some new results. In particular, we gave an improved version for finding the most common substring in a grammar compressed text. Another novelty is an adaptation of our index to the case when the input is not a grammar but a set of rules and a sequence of (non)terminal symbols. In this case, expanding that last sequence, according to the rules, generates the text.

## SPIRE 2010

I was invited to help in the organization of String Processing and Information Retrieval Symposium (SPIRE) this year. The conference is going to be held in Los Cabos, Mexico. The chairs are Edgar Chavez and Stefano Lonardi.

I’ve been to two previous versions of SPIRE, 2007 and 2008, both were a great experience, interesting presentations and a lot  of fun, except for the fines acquired for driving in Australia :-P.

The call for papers for this year can be found here or here for the plain-text version. The submission is due June 20, and given the schedule there is no room for any deadline extension.

## LATIN 2010

I just got back from LATIN 2010, I was organizing committee and it was a really good experience. We got invaluable help from professors and students from UABJO.

The conference started with quite a few problems. To start with, many people could not make it to the conference thanks to a friendly volcano that left Europe uncommunicated. This forced us to change the conference program and a considerable number of talks were cancelled.

The second problem we had (Monday night) is that the UABJO university workers went on a strike. The conference was being held there and in Mexico you are not allowed to enter the place during the strike. The university gave us support and Alex Lopez-Ortiz managed to move the conference to the Hotel Victoria.

Besides those “little” incidents, I think the conference went pretty well (the recommendation comes from a person close to the organization :p). The talks I got to see were really good and I did really like many of them. The others I didn’t understand that much, but I think that’s normal in a broad conference.

I was also presenting a paper with Jeremy Barbay and Gonzalo Navarro about succinct representations of binary relations.

Those are all the news about the conference. About the place there is too much to tell. To start with, the people are friendly and cheering, which makes Oaxaca a really nice place to visit. I took a tour (after the conference) to some places nearby and got to see some ruins, swim in natural pools, drink mezcal. There is a lot to do and visit around. It would take too much space to cover it all :-).

## The SXSI System

I promised a post about the sexy SXSI system in my last post. Its awesome name stands for Fast In-Memory XPath Search over Compressed Text and Tree Indexes, which says much more about what it does.

The main idea for representing the XML file is as follows: the shape of the tree is represented using succinct balanced parentheses, the labels are represented with structures that support rank/select and access queries, and the text with a self-index for the collection of texts inside the xml file.

Let’s go over each part:

1. The tree: the tree is represented using balanced parentheses, by doing a DFS over the tree and writing a ‘(‘ every time you start visiting a subtree and a ‘)’ when you leave the subtree. This representation allows for very fast navigation over the tree, for a good reference about it see the paper by Arroyuelo, Canovas, Navarro and Sadakane.  In the same order we keep another sequence that corresponds to the ids of each node in the tree, we write down the ids in DFS order. This representation allows to move though the tree fast and access node ids quickly. Besides that, something that is really interesting with this combination is the ability to jump to a node with a given id, we just need to perform a select query over the sequence of tags (recall that select(c,i) over a sequence obtains the position where the i-th c occurs). This last operation allows for traversing the tree only looking at the nodes that have a given id.
2. The self-index, representing every text in the XML tree, is based on the FM-Index. Every node that contains textual data introduces a text into the collection. All the elements are concatenated introducing delimiters and every text in the collection gets an id (in our representation the texts are at the leaves of the tree so it’s easy to map back a forward). The id is given by the position of the delimiter in the BWT, and using a range searching data structure we can map to original leaf-id. This index allows for fast pattern matching among the text data present in the XML file.

This an overview, you can get more details looking the original paper. The nice thing is that by using this structures we allow for a flexible navigation of the structure, and an automaton-based search procedure achieves great practical results over real XML files (more details look at the experimental results in the paper). This makes SXSI an interesting option as engine for representing and querying XML files. The idea of the automaton was formalized a couple of days ago, you can see the paper by Maneth and Nguyen.

What’s next? Well, we are not done, there are some operations we would like SXSI to support that aren’t implemented yet. Another issue is that the structure is static. Making it dynamic in practice is quite tricky, as it usually is with succinct representations.

At this moment we are working on a first release (cleaning code, adding documentation, etc), and we hope to have something soon :-).

Posted in Research | 4 Comments

## Update

It’s been a long time since my last post. Lots of things had happened and I’ll use this post to get back on track. I usually don’t post anything about my private life, yet I consider important to mention that my girlfriend was visiting me at Waterloo. We don’t get to see each other very often so it was really great having her here for more than two months!

Now going back to Geeky stuff :-). I was traveling during December. First I went to visit Sebastian Maneth and Kim Nguyen at NICTA in Australia. We did work in our SXSI system (more on it in a following post), in particular, I did some experiments with the XBW and regular/naive tries for answering some specific queries on the system. For path searching both options were awesome, the second one was of course faster and sometimes bigger. The way we answer queries in the system doesn’t allow us to change exclusively to any of those two representations, but considering them as part of the index for answering specific queries could have been interesting. We didn’t make any final decision and the fact that the XBW is patented could complicate things a little bit. We expect to release a GPL version of SXSI and we would like people to be able to use it without any restriction.

After two weeks in Sydney I went back to Chile. It was probably the worst flight ever. NICTA did pay for my trip to Australia, so I had to come back to Canada. I came back, spent 12 hours in the airport and then took a really bad combination through Miami. The total time since I left Australia till I got to Chile was more than 50 hours. During that time ISAAC was taking place in Hawaii. I was coauthor in two papers accepted there: the untangling (with Diego Arroyuelo, Reza Dorrigiv, Stephane Durocher, Meng He, Alejandro López-Ortiz, J. Ian Munro, Patrick Nicholson, Alejandro Salinger, and Matthew Skala) and disc covering (with Reza Dorrigiv, Stephane Durocher, Robert Fraser, Alejandro López-Ortiz, and Alejandro Salinger). While Bob and Matthew were presenting them, I was attending Alejandro Salinger’s wedding (he is also coauthor in both papers :p). The wedding was great! I did really enjoy it, the planning was amazing and we had so much fun in it!

After a couple of weeks in Chile, on the 6th of January I came back to Waterloo. When I got here we started working on the untangling problem we “solved” in the ISAAC paper about range searching. We found a flaw in our algorithm and we spent a LOT of time correcting the method. We did submit a journal version of the paper (corrected). Patrick and I did implement the untangling algorithm and a simplified version of the range searching data structure (that’s were we found the problem). It works really well on real datasets and at this moment we are preparing an experimental paper including those results.

I also went to an Android Developer Lab and I got a Nexus One as a gift from Google! It was a really good experience (the lab, and of course the nexus too :-)). We did learn about implementing software for android. I’ve played a little bit with it and so far I’m impressed with what a phone can do! (Yes, I used to have a phone that wouldn’t do more than calling.) I’m really tempted to test the range searching data structure on the phone and compare it with other data structures. It’s one of the scenarios we were thinking of when designing our structure.

We also got some good news about two other papers. One about compact web graph representations in Esko Ukkonen Festschrift with Gonzalo Navarro, and one about indexing repetitive sequences in BIBE 2010 (with Antonio Farina, Miguel A. Martinez-Prieto and Gonzalo Navarro), where we got to see how the SLP index works in practice.

The last thing remaining to tell: We also got a paper about succinct binary relations accepted into LATIN (with Jeremy Barbay and Gonzalo Navarro), I’ll be going there in April :-).

I guess that’s a good summary of what happened in the last couple of months =). There are some other things I want to write about but I’ll start doing so during the next week. That way I’ll be able to keep posting and not let the blog sleep for so long again :-).

Posted in Research | 1 Comment

## Pictures of prague

Note: My Picasa account does not exist anymore.

As promised, I did upload some pictures of Prague to my Picasa account. I didn’t take much pictures in Slovakia but did make it up in Prague taking way too many pictures. I still have some pictures of Prague at night, will upload them soon.

## MFCS 2009 / PSC 2009

The conference is over, I had a great time, met a lot of interesting and nice people. The summary: a great experience.

Now I arrived to Prage to attend to PSC 2009. (I still have to write about my train experience, hope to do it in the next couple of days.) The city is awesome, incredibly beautiful. A lot of people I know is here (for PSC2009 and SISAP2009), so it’s nice to see familiar faces. Yesterday I went to the SISAP2009 dinner with Nieves, Oscar and Gonzalo. After that, we took a walk around the city, the ilumination of the buildings in something amazing. Of couse, I forgot my camera so I don’t have any pictures to show. Today I’m taking the camera with me ;-).

Today I’ll spend the day walking around the city and tomorrow back to attending talks at PSC2009 :-).

Posted in Research | 1 Comment

## MFCS Talk

As I said in the previous post the talk is ready and online. I’ll use this space to tell a little bit about the result we achieved with Gonzalo Navarro for the paper “Self-Indexed Text Compression using Straight-Line Programs”.

The main problem we address is indexing straight-line programs, which are a special kind of free-context grammars that represent a language composed of one single text. This can be used as a compressed representation of a text, for example, LZ78 and Re-Pair map to an SLP without much effort. The bad new is that finding the minimum grammar that generates the text is NP-Complete, so approximations have to be used.

The formal definition is: A Straight-Line Program (SLP) $$\mathcal{G}=(X={X_1,\ldots,X_n},\Sigma)$$ is a grammar that defines a single finite sequence $$T[1,u]$$, drawn from an alphabet $$\Sigma =[1,\sigma]$$ of terminals. It has $$n$$ rules, which must be of the following types:

• $$X_i \rightarrow \alpha$$, where $$alpha \in \Sigma$$. It represents string $$\mathcal{F}(X_i)=\alpha$$.
• $$X_i \rightarrow X_l X_r$$, where $$l,r < i[/latex]. It represents string $\mathcal{F}(X_i) = \mathcal{F}(X_l)\mathcal{F}(X_r)$. We call [latex]\mathcal{F}(X_i)$$ the phrase generated by nonterminal $$X_i$$, and $$T=\mathcal{F}(X_n)$$.

Given a pattern P and a text represented as an SLP, we focus on the following problems:

• Count the number of occurrences of P in the text generated by the SLP
• Locate the positions where the pattern occurs.

In the past this problem was approached from an online-searching point of view, our goal was to provide an index that supports searching in sublinear time on the size of the SLP. For the indexing approach we work on representing the SLP as a labeled binary relation. We consider a binary relation $$mathcal{R} \subset A\times B$$ where $$A=[n_1]$$ and $$B=[n_2]$$, and a labeling function $$\mathcal{L}:A\times B \rightarrow L$$, where $$L=[\ell]\cup \{ \perp\}$$ is the set of labels and $$\perp$$ represents no label/relation between the pairs. We want to answer the following queries:

• Label of a given pair.
• Elements related to an element of B.
• Elements related to an element of A.
• Elements related trough a given label.
• Given to contiguous ranges in A and B respectively, find the pair that belong to the relation where $$a\in A$$ is in the range for A and $$b\in B$$ is in it’s range too.

The first theorem we include shows that each of this queries can be answered paying close to $$O(\log n_2)$$ time per element retrieved (element in the answer). Using that result we show how to modify an SLP so that the rules are sorted by lexicographical order and then represent this variant of the SLP in $$n\log n+o(n\log n)$$ bits of space, and support queries as:

• Reverse access to rules (given two symbols, which rules generates them in a given order).
• Rules using a given left/right symbol.
• Rules using a contiguous range of symbols.

All the queries require $$O(\log n)$$ time per datum delivered, as a direct consequence of our labeled binary relation representation. Finally, including some extra structures we get our final result (I’m not going to go into the details on how to use the SLP representation, or where the different trade-offs come from):

Theorem:

Let $$T[1,u]$$ be a text over an effective alphabet $$[\sigma]$$ represented by an SLP of $$n$$ rules and height $$h$$. Then there exists a representation of $$T$$ using $$n(\log u + 3\log n + O(\log\sigma + \log h) + o(\log n))$$ bits, such that any substring $$T[l,r]$$ can be extracted in time $$O((r-l+h)\log n)$$, and the positions of $$occ$$ occurrences of a pattern $$P[1,m]$$ in $$T$$ can be found in time $$O((m(m+h)+h,occ)\log n)$$. By removing the $$O(\log h)$$ term in the space, search time raises to $$O((m^2+occ)h\log n)$$. By further removing the $$O(\log\sigma)$$ term in the space, search time raises to $$O((m(m+h)\log n+h,occ)\log n)$$. The existence problem is solved within the time corresponding to $$occ=0$$.

There are some open problems we were stuck at the end of the paper included in the article and we are now working on a journal version with some interesting extensions.