MFCS Talk

As I said in the previous post, the talk is ready and online. I’ll use this space to say a little about the result we achieved with Gonzalo Navarro in the paper “Self-Indexed Text Compression using Straight-Line Programs”.

The main problem we address is indexing straight-line programs, a special kind of context-free grammar whose language consists of a single text. Such a grammar can serve as a compressed representation of the text; for example, LZ78 and Re-Pair map to an SLP without much effort. The bad news is that finding the minimum grammar that generates the text is NP-complete, so approximations have to be used.

The formal definition is: A Straight-Line Program (SLP) $$\mathcal{G}=(X=\{X_1,\ldots,X_n\},\Sigma)$$ is a grammar that defines a single finite sequence $$T[1,u]$$, drawn from an alphabet $$\Sigma =[1,\sigma]$$ of terminals. It has $$n$$ rules, which must be of the following types:

• $$X_i \rightarrow \alpha$$, where $$\alpha \in \Sigma$$. It represents the string $$\mathcal{F}(X_i)=\alpha$$.
• $$X_i \rightarrow X_l X_r$$, where $$l,r < i$$. It represents the string $$\mathcal{F}(X_i) = \mathcal{F}(X_l)\mathcal{F}(X_r)$$.

We call $$\mathcal{F}(X_i)$$ the phrase generated by nonterminal $$X_i$$, and $$T=\mathcal{F}(X_n)$$.
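To make the definition concrete, here is a minimal Python sketch that computes every phrase $$\mathcal{F}(X_i)$$ of a toy SLP. The dictionary encoding, the function name `phrases` and the example rules are assumptions made for illustration; they are not taken from the paper. Expanding phrases explicitly defeats the purpose of compression (the text can be exponentially longer than the grammar), so this only illustrates the definition.

```python
from typing import Dict, Tuple, Union

# Hypothetical encoding: rule i is either a terminal (a one-character string)
# or a pair (l, r) meaning X_i -> X_l X_r with l, r < i.
Rule = Union[str, Tuple[int, int]]


def phrases(rules: Dict[int, Rule]) -> Dict[int, str]:
    """Return F(X_i) for every rule, processing rules in increasing index."""
    F: Dict[int, str] = {}
    for i in sorted(rules):
        rule = rules[i]
        if isinstance(rule, str):
            F[i] = rule                      # X_i -> alpha
        else:
            l, r = rule
            if not (l < i and r < i):
                raise ValueError(f"rule {i} must refer to earlier rules only")
            F[i] = F[l] + F[r]               # X_i -> X_l X_r
    return F


# Toy grammar (illustrative only): X_5 generates the whole text T.
EXAMPLE_RULES: Dict[int, Rule] = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (4, 3)}
EXAMPLE_PHRASES = phrases(EXAMPLE_RULES)
T = EXAMPLE_PHRASES[max(EXAMPLE_RULES)]
```

Because every right-hand side refers only to smaller indices, a single pass in increasing order of $$i$$ is enough and no recursion is needed.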

Given a pattern P and a text represented as an SLP, we focus on the following problems:

• Count the number of occurrences of P in the text generated by the SLP.
• Locate the positions where the pattern occurs.

In the past this problem was approached from an online-searching point of view; our goal was to provide an index that supports searching in time sublinear in the size of the SLP. For the indexing approach we represent the SLP as a labeled binary relation. We consider a binary relation $$\mathcal{R} \subset A\times B$$ where $$A=[n_1]$$ and $$B=[n_2]$$, and a labeling function $$\mathcal{L}:A\times B \rightarrow L$$, where $$L=[\ell]\cup \{ \perp\}$$ is the set of labels and $$\perp$$ marks the pairs that are not related. We want to answer the following queries:

• Label of a given pair.
• Elements related to an element of B.
• Elements related to an element of A.
• Elements related through a given label.
• Given two contiguous ranges, one in A and one in B, find the pairs in the relation where $$a\in A$$ lies in the range for A and $$b\in B$$ lies in the range for B.
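The five queries above can be pinned down with a deliberately naive Python sketch. It stores the relation in a plain dictionary and answers most queries by a linear scan, so it has neither the space nor the time bounds of the succinct structure discussed in this post; the class and method names are hypothetical and only fix the semantics of each query.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class LabeledRelation:
    """Naive labeled binary relation over A x B (illustrative only)."""

    def __init__(self, triples: Iterable[Tuple[int, int, int]]) -> None:
        # Each triple is (a, b, label); pairs that are absent have no label.
        self.labels: Dict[Tuple[int, int], int] = {(a, b): lab for a, b, lab in triples}

    def label_of(self, a: int, b: int) -> Optional[int]:
        """Label of a given pair; None plays the role of the 'no label' symbol."""
        return self.labels.get((a, b))

    def related_to_b(self, b: int) -> List[int]:
        """Elements of A related to a given element of B."""
        return sorted(a for (a, bb) in self.labels if bb == b)

    def related_to_a(self, a: int) -> List[int]:
        """Elements of B related to a given element of A."""
        return sorted(b for (aa, b) in self.labels if aa == a)

    def with_label(self, lab: int) -> List[Tuple[int, int]]:
        """Pairs related through a given label."""
        return sorted(p for p, l in self.labels.items() if l == lab)

    def in_range(self, a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> List[Tuple[int, int]]:
        """Pairs (a, b) with a in [a_lo, a_hi] and b in [b_lo, b_hi]."""
        return sorted((a, b) for (a, b) in self.labels
                      if a_lo <= a <= a_hi and b_lo <= b <= b_hi)


# Made-up example data.
REL = LabeledRelation([(1, 2, 7), (1, 3, 8), (2, 3, 7), (4, 1, 9)])
```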

The first theorem we include shows that each of these queries can be answered in close to $$O(\log n_2)$$ time per element retrieved (that is, per element in the answer). Using that result, we show how to modify an SLP so that its rules are sorted in lexicographical order, represent this variant of the SLP in $$n\log n+o(n\log n)$$ bits of space, and support queries such as:

• Reverse access to rules (given two symbols, which rule generates them in a given order).
• Rules using a given left/right symbol.
• Rules using a contiguous range of symbols.
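One way to read these queries is to view each rule $$X_i \rightarrow X_l X_r$$ as the pair $$(l,r)$$ labeled with $$i$$. The Python sketch below follows that reading with a plain dictionary. It is an assumption-laden illustration: it keeps the original rule numbering instead of performing the lexicographic reordering, it assumes no two rules share the same right-hand side, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple, Union

Rule = Union[str, Tuple[int, int]]


def rule_relation(rules: Dict[int, Rule]) -> Dict[Tuple[int, int], int]:
    """Map each pair (l, r) to the index i of the rule X_i -> X_l X_r."""
    return {rule: i for i, rule in rules.items() if not isinstance(rule, str)}


def reverse_access(rel: Dict[Tuple[int, int], int], l: int, r: int) -> Optional[int]:
    """Which rule, if any, generates X_l X_r in that order."""
    return rel.get((l, r))


def rules_with_left(rel: Dict[Tuple[int, int], int], l: int) -> List[int]:
    """Rules whose left symbol is X_l (naive linear scan)."""
    return sorted(i for (ll, _), i in rel.items() if ll == l)


def rules_with_right(rel: Dict[Tuple[int, int], int], r: int) -> List[int]:
    """Rules whose right symbol is X_r (naive linear scan)."""
    return sorted(i for (_, rr), i in rel.items() if rr == r)


def rules_in_range(rel: Dict[Tuple[int, int], int],
                   l_lo: int, l_hi: int, r_lo: int, r_hi: int) -> List[int]:
    """Rules whose left and right symbols fall in the given index ranges."""
    return sorted(i for (l, r), i in rel.items()
                  if l_lo <= l <= l_hi and r_lo <= r <= r_hi)


# Toy grammar (illustrative only).
RULES: Dict[int, Rule] = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (4, 3)}
REL = rule_relation(RULES)
```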

All the queries require $$O(\log n)$$ time per datum delivered, as a direct consequence of our labeled binary relation representation. Finally, by adding some extra structures we get our final result (I’m not going to go into the details of how the SLP representation is used, or where the different trade-offs come from):

Theorem:

Let $$T[1,u]$$ be a text over an effective alphabet $$[\sigma]$$ represented by an SLP of $$n$$ rules and height $$h$$. Then there exists a representation of $$T$$ using $$n(\log u + 3\log n + O(\log\sigma + \log h) + o(\log n))$$ bits, such that any substring $$T[l,r]$$ can be extracted in time $$O((r-l+h)\log n)$$, and the positions of $$occ$$ occurrences of a pattern $$P[1,m]$$ in $$T$$ can be found in time $$O((m(m+h)+h\,occ)\log n)$$. By removing the $$O(\log h)$$ term in the space, search time rises to $$O((m^2+occ)h\log n)$$. By further removing the $$O(\log\sigma)$$ term in the space, search time rises to $$O((m(m+h)\log n+h\,occ)\log n)$$. The existence problem is solved within the time corresponding to $$occ=0$$.
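To give some intuition for the extraction part of the theorem, the Python sketch below extracts $$T[l,r]$$ by descending the parse tree with the help of stored phrase lengths, visiting only the nodes whose phrase overlaps the requested range. It uses plain dictionaries, so it reflects neither the space bound nor the $$\log n$$ factor; the encoding and the names `phrase_lengths` and `extract` are assumptions made for illustration, not the structure from the paper.

```python
from typing import Dict, List, Tuple, Union

Rule = Union[str, Tuple[int, int]]


def phrase_lengths(rules: Dict[int, Rule]) -> Dict[int, int]:
    """Length of F(X_i) for every rule, without expanding any phrase."""
    length: Dict[int, int] = {}
    for i in sorted(rules):
        rule = rules[i]
        length[i] = 1 if isinstance(rule, str) else length[rule[0]] + length[rule[1]]
    return length


def extract(rules: Dict[int, Rule], length: Dict[int, int], i: int, l: int, r: int) -> str:
    """Return F(X_i)[l..r], with 1-based inclusive positions."""
    if not 1 <= l <= r <= length[i]:
        raise ValueError("range out of bounds")
    out: List[str] = []

    def visit(j: int, lo: int, hi: int) -> None:
        # lo..hi is the part of F(X_j) that is still wanted (1-based, inclusive).
        rule = rules[j]
        if isinstance(rule, str):
            out.append(rule)
            return
        left, right = rule
        left_len = length[left]
        if lo <= left_len:                       # range overlaps the left child
            visit(left, lo, min(hi, left_len))
        if hi > left_len:                        # range overlaps the right child
            visit(right, max(lo, left_len + 1) - left_len, hi - left_len)

    visit(i, l, r)
    return "".join(out)


# Toy grammar (illustrative only); X_5 generates T = "abaab".
RULES: Dict[int, Rule] = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (4, 3)}
LENGTHS = phrase_lengths(RULES)
SUBSTRING = extract(RULES, LENGTHS, 5, 2, 4)
```

The recursion depth is bounded by the height $$h$$ of the grammar, and the number of visited nodes is proportional to $$r-l+h$$, which matches the shape of the extraction bound up to the $$\log n$$ factor.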

The paper closes with some open problems we got stuck on, and we are now working on a journal version with some interesting extensions.
