Archive for August 2009

## MFCS 2009 / PSC 2009

The conference is over, I had a great time, met a lot of interesting and nice people. The summary: a great experience.

Now I arrived to Prage to attend to PSC 2009. (I still have to write about my train experience, hope to do it in the next couple of days.) The city is awesome, incredibly beautiful. A lot of people I know is here (for PSC2009 and SISAP2009), so it’s nice to see familiar faces. Yesterday I went to the SISAP2009 dinner with Nieves, Oscar and Gonzalo. After that, we took a walk around the city, the ilumination of the buildings in something amazing. Of couse, I forgot my camera so I don’t have any pictures to show. Today I’m taking the camera with me ;-).

Today I’ll spend the day walking around the city and tomorrow back to attending talks at PSC2009 :-).

## MFCS Talk

As I said in the previous post the talk is ready and online. I’ll use this space to tell a little bit about the result we achieved with Gonzalo Navarro for the paper “Self-Indexed Text Compression using Straight-Line Programs”.

The main problem we address is indexing straight-line programs, which are a special kind of free-context grammars that represent a language composed of one single text. This can be used as a compressed representation of a text, for example, LZ78 and Re-Pair map to an SLP without much effort. The bad new is that finding the minimum grammar that generates the text is NP-Complete, so approximations have to be used.

The formal definition is: A Straight-Line Program (SLP) $\mathcal{G}=(X={X_1,\ldots,X_n},\Sigma)$ is a grammar that defines a single finite sequence $T[1,u]$, drawn from an alphabet $\Sigma =[1,\sigma]$ of terminals. It has $n$ rules, which must be of the following types:

• $X_i \rightarrow \alpha$, where $alpha \in \Sigma$. It represents string $\mathcal{F}(X_i)=\alpha$.
• $X_i \rightarrow X_l X_r$, where $l,r < i[/latex]. It represents string $\mathcal{F}(X_i) = \mathcal{F}(X_l)\mathcal{F}(X_r)$. We call [latex]\mathcal{F}(X_i)$ the phrase generated by nonterminal $X_i$, and $T=\mathcal{F}(X_n)$.

Given a pattern P and a text represented as an SLP, we focus on the following problems:

• Count the number of occurrences of P in the text generated by the SLP
• Locate the positions where the pattern occurs.

In the past this problem was approached from an online-searching point of view, our goal was to provide an index that supports searching in sublinear time on the size of the SLP. For the indexing approach we work on representing the SLP as a labeled binary relation. We consider a binary relation $mathcal{R} \subset A\times B$ where $A=[n_1]$ and $B=[n_2]$, and a labeling function $\mathcal{L}:A\times B \rightarrow L$, where $L=[\ell]\cup \{ \perp\}$ is the set of labels and $\perp$ represents no label/relation between the pairs. We want to answer the following queries:

• Label of a given pair.
• Elements related to an element of B.
• Elements related to an element of A.
• Elements related trough a given label.
• Given to contiguous ranges in A and B respectively, find the pair that belong to the relation where $a\in A$ is in the range for A and $b\in B$ is in it's range too.

The first theorem we include shows that each of this queries can be answered paying close to $O(\log n_2)$ time per element retrieved (element in the answer). Using that result we show how to modify an SLP so that the rules are sorted by lexicographical order and then represent this variant of the SLP in $n\log n+o(n\log n)$ bits of space, and support queries as:

• Reverse access to rules (given two symbols, which rules generates them in a given order).
• Rules using a given left/right symbol.
• Rules using a contiguous range of symbols.

All the queries require $O(\log n)$ time per datum delivered, as a direct consequence of our labeled binary relation representation. Finally, including some extra structures we get our final result (I'm not going to go into the details on how to use the SLP representation, or where the different trade-offs come from):

Theorem:

Let $T[1,u]$ be a text over an effective alphabet $[\sigma]$ represented by an SLP of $n$ rules and height $h$. Then there exists a representation of $T$ using $n(\log u + 3\log n + O(\log\sigma + \log h) + o(\log n))$ bits, such that any substring $T[l,r]$ can be extracted in time $O((r-l+h)\log n)$, and the positions of $occ$ occurrences of a pattern $P[1,m]$ in $T$ can be found in time $O((m(m+h)+h,occ)\log n)$. By removing the $O(\log h)$ term in the space, search time raises to $O((m^2+occ)h\log n)$. By further removing the $O(\log\sigma)$ term in the space, search time raises to $O((m(m+h)\log n+h,occ)\log n)$. The existence problem is solved within the time corresponding to $occ=0$.

There are some open problems we were stuck at the end of the paper included in the article and we are now working on a journal version with some interesting extensions.

## MFCS 2009

I’m back after a while. During this month I took a week off and did a road-trip with Roberto Konow trough Canada. We went to Kingston, Ottawa, Montreal and Quebec City. It was an awesome trip and then life went back to normal with work to catch up.

Now I’m in MFCS 2009, the first day already went by and I’ve had the pleasure to see lots of good presentations :-), that makes my panic mode higher since tomorrow I’m presenting the paper we submitted with Gonzalo Navarro. I’ve already uploaded the slides (and the preprint) to my Web page (http://www.cs.uwaterloo.ca/~fclaude/). Also the hotel in which the conference is being held is great, I’ve already spent around 3-4 hours between the Sauna and the Jacuzzi :-).

I’ll post more about the conference as it goes by and I’ll try to post a small summary of the paper during the night.