Update

It’s been a long time since my last post. Lots of things have happened, and I’ll use this post to get back on track. I usually don’t post anything about my private life, yet I consider it important to mention that my girlfriend was visiting me in Waterloo. We don’t get to see each other very often, so it was really great having her here for more than two months!

Now back to geeky stuff :-). I was traveling during December. First I went to visit Sebastian Maneth and Kim Nguyen at NICTA in Australia. We worked on our SXSI system (more on it in a following post); in particular, I ran some experiments with the XBW and regular/naive tries for answering some specific queries in the system. For path searching both options were awesome; the tries were of course faster, and sometimes bigger. The way we answer queries in the system doesn’t allow us to switch exclusively to either of those two representations, but considering them as part of the index for answering specific queries could have been interesting. We didn’t make any final decision, and the fact that the XBW is patented could complicate things a little bit. We expect to release a GPL version of SXSI, and we would like people to be able to use it without any restriction.

After two weeks in Sydney I went back to Chile. It was probably the worst flight ever. Since NICTA paid for my trip to Australia from Canada, I had to fly back to Canada first. Once there, I spent 12 hours in the airport and then took a really bad connection through Miami. The total time from leaving Australia until arriving in Chile was more than 50 hours. During that time ISAAC was taking place in Hawaii. I was a coauthor of two papers accepted there: the untangling one (with Diego Arroyuelo, Reza Dorrigiv, Stephane Durocher, Meng He, Alejandro López-Ortiz, J. Ian Munro, Patrick Nicholson, Alejandro Salinger, and Matthew Skala) and the disc covering one (with Reza Dorrigiv, Stephane Durocher, Robert Fraser, Alejandro López-Ortiz, and Alejandro Salinger). While Bob and Matthew were presenting them, I was attending Alejandro Salinger’s wedding (he is also a coauthor of both papers :p). The wedding was great! I really enjoyed it; the planning was amazing and we had so much fun!

After a couple of weeks in Chile, on the 6th of January I came back to Waterloo. When I got here we started working on the untangling problem we “solved” in the ISAAC paper about range searching. We found a flaw in our algorithm and spent a LOT of time correcting the method. We submitted a corrected journal version of the paper. Patrick and I implemented the untangling algorithm and a simplified version of the range searching data structure (that’s where we found the problem). It works really well on real datasets, and at this moment we are preparing an experimental paper including those results.

I also went to an Android Developer Lab and got a Nexus One as a gift from Google! It was a really good experience (the lab, and of course the Nexus too :-)). We learned about implementing software for Android. I’ve played a little bit with it and so far I’m impressed with what a phone can do! (Yes, I used to have a phone that wouldn’t do more than make calls.) I’m really tempted to test the range searching data structure on the phone and compare it with other data structures. It’s one of the scenarios we were thinking of when designing our structure.

We also got some good news about two other papers: one about compact web graph representations in the Esko Ukkonen Festschrift (with Gonzalo Navarro), and one about indexing repetitive sequences in BIBE 2010 (with Antonio Farina, Miguel A. Martinez-Prieto and Gonzalo Navarro), where we got to see how the SLP index works in practice.

The last piece of news: we also got a paper about succinct binary relations (with Jeremy Barbay and Gonzalo Navarro) accepted at LATIN, so I’ll be going there in April :-).

I guess that’s a good summary of what happened in the last couple of months =). There are some other things I want to write about but I’ll start doing so during the next week. That way I’ll be able to keep posting and not let the blog sleep for so long again :-).

Pictures of Prague

As promised, I uploaded some pictures of Prague to my Picasa account. I didn’t take many pictures in Slovakia, but I made up for it in Prague by taking way too many. I still have some pictures of Prague at night; I’ll upload them soon.

MFCS 2009 / PSC 2009

The conference is over. I had a great time and met a lot of interesting and nice people. In summary: a great experience.

I have now arrived in Prague to attend PSC 2009. (I still have to write about my train experience; I hope to do it in the next couple of days.) The city is awesome, incredibly beautiful. A lot of people I know are here (for PSC 2009 and SISAP 2009), so it’s nice to see familiar faces. Yesterday I went to the SISAP 2009 dinner with Nieves, Oscar and Gonzalo. After that, we took a walk around the city; the illumination of the buildings is something amazing. Of course, I forgot my camera, so I don’t have any pictures to show. Today I’m taking the camera with me ;-).

Today I’ll spend the day walking around the city and tomorrow back to attending talks at PSC2009 :-).

MFCS Talk

As I said in the previous post, the talk is ready and online. I’ll use this space to tell a little bit about the result we achieved with Gonzalo Navarro in the paper “Self-Indexed Text Compression using Straight-Line Programs”.

The main problem we address is indexing straight-line programs, which are a special kind of context-free grammar that generates a language consisting of one single text. This can be used as a compressed representation of the text; for example, the output of LZ78 and Re-Pair maps to an SLP without much effort. The bad news is that finding the minimum grammar that generates a text is NP-complete, so approximations have to be used.

The formal definition is: a Straight-Line Program (SLP) \mathcal{G}=(X=\{X_1,\ldots,X_n\},\Sigma) is a grammar that defines a single finite sequence T[1,u], drawn from an alphabet \Sigma=[1,\sigma] of terminals. It has n rules, which must be of the following types:

  • X_i \rightarrow \alpha, where \alpha \in \Sigma. It represents the string \mathcal{F}(X_i)=\alpha.
  • X_i \rightarrow X_l X_r, where l,r < i. It represents the string \mathcal{F}(X_i) = \mathcal{F}(X_l)\mathcal{F}(X_r).

We call \mathcal{F}(X_i) the phrase generated by nonterminal X_i, and T=\mathcal{F}(X_n).
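To make the definition concrete, here is a minimal sketch (my own illustration, not the paper’s implementation) of how an SLP expands back into its text. I use 0-based rule indices instead of the paper’s X_1,\ldots,X_n, and a plain Python list as the rule table:

```python
# Toy SLP expansion (illustrative only). rules[i] is either a terminal
# character (X_i -> alpha) or a pair (l, r) with l, r < i (X_i -> X_l X_r).

def expand(rules, i, memo=None):
    """Return F(X_i), the phrase generated by nonterminal X_i."""
    if memo is None:
        memo = {}
    if i in memo:
        return memo[i]
    rule = rules[i]
    if isinstance(rule, str):      # X_i -> alpha, a terminal
        result = rule
    else:                          # X_i -> X_l X_r
        l, r = rule
        result = expand(rules, l, memo) + expand(rules, r, memo)
    memo[i] = result
    return result

# SLP for T = "abab": X_0 -> a, X_1 -> b, X_2 -> X_0 X_1, X_3 -> X_2 X_2
rules = ["a", "b", (0, 1), (2, 2)]
print(expand(rules, len(rules) - 1))  # -> abab
```

Note how the last rule reuses X_2 twice: this sharing is exactly what makes SLPs a compressed representation, since repeated phrases are stored only once.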

Given a pattern P and a text represented as an SLP, we focus on the following problems:

  • Count the number of occurrences of P in the text generated by the SLP.
  • Locate the positions where the pattern occurs.

In the past this problem was approached from an online-searching point of view; our goal was to provide an index that supports searching in time sublinear in the size of the SLP. For the indexing approach we represent the SLP as a labeled binary relation. We consider a binary relation \mathcal{R} \subseteq A\times B, where A=[n_1] and B=[n_2], and a labeling function \mathcal{L}:A\times B \rightarrow L, where L=[\ell]\cup \{\perp\} is the set of labels and \perp represents no label/relation between a pair. We want to answer the following queries:

  • Label of a given pair.
  • Elements related to an element of B.
  • Elements related to an element of A.
  • Elements related through a given label.
  • Given two contiguous ranges in A and B respectively, find the pairs (a,b) in the relation such that a lies in the range for A and b lies in the range for B.

The first theorem we include shows that each of these queries can be answered paying close to O(\log n_2) time per element retrieved (element in the answer). Using that result, we show how to modify an SLP so that the rules are sorted in lexicographic order, and then represent this variant of the SLP in n\log n+o(n\log n) bits of space, supporting queries such as:

  • Access to rules.
  • Reverse access to rules (given two symbols, which rules generate them in a given order).
  • Rules using a given left/right symbol.
  • Rules using a contiguous range of symbols.

All the queries require O(\log n) time per datum delivered, as a direct consequence of our labeled binary relation representation. Finally, by adding some extra structures we get our final result (I’m not going to go into the details of how to use the SLP representation, or where the different trade-offs come from):

Theorem:

Let T[1,u] be a text over an effective alphabet [\sigma] represented by an SLP of n rules and height h. Then there exists a representation of T using n(\log u + 3\log n + O(\log\sigma + \log h) + o(\log n)) bits, such that any substring T[l,r] can be extracted in time O((r-l+h)\log n), and the positions of occ occurrences of a pattern P[1,m] in T can be found in time O((m(m+h)+h\cdot occ)\log n). By removing the O(\log h) term in the space, the search time rises to O((m^2+occ)h\log n). By further removing the O(\log\sigma) term in the space, the search time rises to O((m(m+h)\log n+h\cdot occ)\log n). The existence problem is solved within the time corresponding to occ=0.

Some open problems that we got stuck on are included at the end of the article, and we are now working on a journal version with some interesting extensions.

MFCS 2009

I’m back after a while. During this month I took a week off and did a road trip with Roberto Konow through Canada. We went to Kingston, Ottawa, Montreal and Quebec City. It was an awesome trip, and then life went back to normal with work to catch up on.

Now I’m at MFCS 2009. The first day has already gone by and I’ve had the pleasure of seeing lots of good presentations :-), which raises my panic level, since tomorrow I’m presenting the paper we submitted with Gonzalo Navarro. I’ve already uploaded the slides (and the preprint) to my Web page (http://www.cs.uwaterloo.ca/~fclaude/). Also, the hotel where the conference is being held is great; I’ve already spent around 3-4 hours between the sauna and the Jacuzzi :-).

I’ll post more about the conference as it goes by and I’ll try to post a small summary of the paper during the night.

FOCS 2009

The list of accepted papers for FOCS 2009 was published a couple of days ago. The page with the abstracts is here, and a post on “Graph Theory” included the PDF files for some of the papers (those available).

I was able to find some of the missing papers:

“A Parallel Repetition Theorem for Any Interactive Argument” – Iftach Haitner.

“Optimal Long Code Test with One Free Bit” – Nikhil Bansal and Subhash Khot.

“Submodular Function Minimization under Covering Constraints” – Satoru Iwata and Kiyohito Nagano.

I’ve been checking out the ones that seem most interesting (to me; I don’t want to start a war on best papers or imply that other papers aren’t interesting).

One of them is the paper “Space-Efficient Framework for Top-k String Retrieval Problems” by Wing Kai Hon, Rahul Shah and Jeffrey Scott Vitter. It’s a really interesting paper about a problem we have been looking at for a while (with Diego Arroyuelo, Meng He and Ian Munro), and the solution they propose is a very interesting approach: it can handle a wide family of scoring functions for the top-k string retrieval problem.

Another paper that caught my interest is “Models for the compressible Web” by Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Alessandro Panconesi and Prabhakar Raghavan. I couldn’t find the PDF for that one, so I haven’t read it yet. It seems to be closely related to my master’s thesis (yes, it’s written in English), and it would be interesting to see whether some of the results there help explain, in a theoretical way, the empirical results we obtained in the thesis.

Text Partitioning

Today I saw this article posted in arXiv:

On optimally partitioning a text to improve its compression
Written by Paolo Ferragina, Igor Nitto and Rossano Venturini

I found this article really interesting and nicely presented. The problem they focus on is:

Problem: Given a compressor C and a text T of length n, drawn from an alphabet \Sigma of size \sigma. Find the optimal partition of T=T_1T_2\ldots T_k such that |C(T_1)C(T_2)\ldots C(T_k)| is minimized.

This means that we want to cut the text into k pieces, with k unknown, such that applying the compressor C over each piece achieves the best compression possible for the text. This ignores possible permutations of the text, such as the Burrows-Wheeler Transform (BWT).

A simple solution is to transform this problem into a shortest path problem: every position in the text is a node in the graph, and every node i is connected to nodes i+1,i+2,\ldots, n. The cost of going from node i to node j > i is |C(T_{i,j})|. It is easy to see that the total size of the best partition equals the cost of the shortest path from 1 to n. Here I include a figure of the graph (made with IPE :-)).

Example shortest path for text partitioning
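Since the graph is a DAG with edges going only forward, the shortest path can be computed directly by dynamic programming over text positions. Here is a toy sketch of mine (not the authors’ code) of this naive solution, using zlib as the compressor C:

```python
import zlib

def compressed_size(s: bytes) -> int:
    """|C(s)| for our toy compressor C (zlib, including its per-piece header)."""
    return len(zlib.compress(s))

def optimal_partition(T: bytes):
    """Naive O(n^2)-edge shortest-path / DP solution to the partitioning problem."""
    n = len(T)
    # best[j] = minimum total compressed size over all partitions of T[0:j]
    best = [0] + [float("inf")] * n
    cut = [0] * (n + 1)  # cut[j] = start of the last piece in an optimal partition of T[0:j]
    for j in range(1, n + 1):
        for i in range(j):  # edge i -> j costs |C(T[i:j])|
            cost = best[i] + compressed_size(T[i:j])
            if cost < best[j]:
                best[j], cut[j] = cost, i
    # Recover the pieces by walking the cut points backwards.
    pieces, j = [], n
    while j > 0:
        i = cut[j]
        pieces.append(T[i:j])
        j = i
    return best[n], pieces[::-1]

size, pieces = optimal_partition(b"aaaaaaaabbbbbbbb" * 4)
```

The double loop makes the O(n^2) edges explicit, and with a linear-time compressor the total cost is the O(n^3) bound discussed below; the partition it returns can never be worse than compressing the whole text as a single piece.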

The main problem is that just building this graph takes O(n^3) time. Assuming C takes linear time to compress a sequence, building the graph takes:

\sum_{i=1}^{n-1} \sum_{j=i+1}^n (j-i) = \sum_{i=1}^{n-1}\sum_{j=1}^{n-i} j \le \sum_{i=1}^{n-1}\sum_{j=1}^{n-1} j = \sum_{i=1}^{n-1} O(n^2) = O(n^3)

In the paper they present an algorithm to approximate the problem; it requires O(n\log_{1+\epsilon}n) time and achieves a (1+\epsilon)-approximation. The main idea behind the approach is to sparsify the graph in such a way that, by storing fewer edges, they can still approximate the cost of the shortest path. They show how to run the algorithm without building the approximate version of the graph, to keep the space consumption low. They also show how to estimate the size of the compression for 0-order and k-order compressors during the process.

Drawings for Latex

Many of you probably know programs for drawing in Linux that let you embed the image in a \LaTeX document and at the same time use \LaTeX’s math typesetting. I used to draw in Dia (dia-gnome in Ubuntu), then I devolved to xfig. OK, let’s stop here; I say devolved because I like Dia better, not because I want to start a religious fight :-).

A couple of months ago Alejandro showed me a program called IPE (also in the Ubuntu repositories). This program is amazing: it is easy to use (and that coming from a really bad user), and it supports \LaTeX in a very friendly way. You just type \LaTeX code into the text fields, and Ctrl+L switches between the code and the actual drawing, so you can see how the picture looks right away; what you see is exactly what you get.

The only problem so far is that I still don’t know how to change a text after typing it, but there is a workaround (besides the brute-force approach of redoing the whole image from scratch). When you save the image as an .ipe file, the file contains XML that represents the image, including the source of each text; you can modify it there and then load the image into IPE again.

I’m editing the post to add the link to IPE, as Victor suggested:

http://tclab.kaist.ac.kr/ipe/

Cooler address and Latex support

I just moved the blog, with its old content, to continue here; I got a much cooler URL! Anyway, the main reason for moving the blog wasn’t the URL, it was the LaTeX support :-). Now I can type geekier things like \sum_{i=0}^n 2^i = 2^{n+1}-1. And what does this have to do with the categories Math and Computer Science? I just used them so they each have at least one post :p.

Wireless Card Linux/Windows

I moved to my new place last month, and the second thing I got was the Internet connection (the first was the bed). My new apartment is small but has a *really* bad distribution of plugs in the walls. Since I didn’t have any good way to connect my computer to the router with a wire, I bought a wireless card. The card is a TRENDnet TEW-421PC/423PI, which is really cheap and supposed to work fine. My apartment is so small that it shouldn’t give any problems, yet it did. The connection dropped randomly from time to time and, as always, the drops occurred at the worst possible moment. It happened in both Linux (Ubuntu 9.04, 32 bits) and Windows XP, but my notebook was working fine (so I assumed it was the card and not the router). When I was about to take the card back to the place I bought it, I found a page explaining that you could update the driver for the card using Windows Update. That worked perfectly, which left only the Linux problem. What I did was install ndiswrapper with the drivers that came with the card and then copy the file rtl8185.sys from the system32 folder (in Windows) to /etc/ndiswrapper/net8185/. Now Linux works with the new driver too; so far everything seems to work fine.