Text Partitioning

Today I saw this article posted on arXiv:

On optimally partitioning a text to improve its compression
Written by Paolo Ferragina, Igor Nitto and Rossano Venturini

I found this article really interesting and nicely presented. The problem they focus on is:

Problem: Given a compressor \(C\) and a text \(T\) of length \(n\), drawn from an alphabet \(\Sigma\) of size \(\sigma\), find the optimal partition \(T=T_1T_2\ldots T_k\) such that \(|C(T_1)C(T_2)\ldots C(T_k)|\) is minimized.

This means that we want to cut the text into \(k\) pieces, with \(k\) unknown, such that applying the compressor \(C\) to each piece achieves the best possible compression of the text. This ignores possible permutations of the text, such as the Burrows-Wheeler Transform (BWT).

A simple solution is to transform this problem into a shortest path problem: every position in the text is a node in the graph, and every node \(i\) is connected to nodes \(i+1,i+2,\ldots, n\). The cost of going from node \(i\) to node \(j > i\) is \(|C(T_{i,j})|\). It is easy to see that the best partition obtains a total size equal to the cost of the minimum path from \(1\) to \(n\). Here I include a figure of the graph (using IPE :-)).

[Figure: Example shortest path for text partitioning]
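The shortest path formulation can be sketched in a few lines of Python. This is only an illustration under some assumptions of mine: `zlib` stands in for the compressor \(C\), the names `compressed_size` and `optimal_partition` are hypothetical, and nodes are the \(n+1\) boundaries \(0,\ldots,n\) between characters rather than the positions \(1,\ldots,n\). Since every edge goes forward, the graph is a DAG and one left-to-right pass is enough.

```python
import zlib


def compressed_size(piece: bytes) -> int:
    """Stand-in for |C(piece)|: size in bytes of the zlib output (an assumption)."""
    return len(zlib.compress(piece, 9))


def optimal_partition(text: bytes, cost=compressed_size):
    """Exact partition by shortest path over the boundaries 0..n.

    The edge (i, j) with i < j costs cost(text[i:j]). All edges go forward,
    so relaxing them in increasing order of i gives the shortest path.
    """
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[j] = cheapest encoding of text[:j]
    prev = [-1] * (n + 1)            # prev[j] = start of the last piece ending at j
    best[0] = 0
    for i in range(n):
        for j in range(i + 1, n + 1):
            candidate = best[i] + cost(text[i:j])
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i
    # Walk the predecessor links backwards to recover the pieces.
    pieces = []
    j = n
    while j > 0:
        i = prev[j]
        pieces.append(text[i:j])
        j = i
    pieces.reverse()
    return best[n], pieces


if __name__ == "__main__":
    total, pieces = optimal_partition(b"aaaaaaaaaaaaaaaaaaaa" + b"qwertyuiopasdfghjklz")
    print(total, [len(p) for p in pieces])
```

Note that this calls the compressor on all \(n(n+1)/2\) substrings, so it is only usable on very short inputs. With `zlib` the fixed per-stream overhead also means that a single piece often wins on small texts.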

The main problem is that just building this graph takes \(O(n^3)\) time. Assume that \(C\) takes linear time to compress a sequence; then building the graph takes:

\(\sum_{i=1}^{n-1} \sum_{j=i+1}^n (j-i) =\sum_{i=1}^{n-1}\sum_{j=1}^{n-i} j \le \sum_{i=1}^{n-1}\sum_{j=1}^{n-1} j \)

\(= \sum_{i=1}^{n-1} O(n^2) = O(n^3) \)
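
For reference, the double sum can also be evaluated exactly, which shows that the cubic bound is tight up to a constant factor (still assuming a linear-time compressor). Writing \(m = n-i\) for the number of edges leaving node \(i\):

\(\sum_{i=1}^{n-1}\sum_{j=1}^{n-i} j = \sum_{m=1}^{n-1}\frac{m(m+1)}{2} = \frac{(n-1)n(n+1)}{6} = \Theta(n^3) \)

For example, \(n=4\) gives \(6+3+1=10\), which matches \(\frac{3\cdot 4\cdot 5}{6}=10\).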

In the paper they present an algorithm that approximates the problem: it requires \(O(n\log_{1+\epsilon}n)\) time and achieves a \((1+\epsilon)\)-approximation. The main idea behind the approach is to approximate the graph in such a way that, by storing fewer edges, they can still approximate the cost of the minimum path. They show how to run the algorithm without building the approximated version of the graph, to keep the space consumption low. They also show how to estimate the size of the compression for 0-th order and k-th order compressors during the process.
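
To get a feeling for the pruning idea, here is a rough sketch of how I read it, not the paper's actual algorithm. The assumptions are mine: the cost of a piece is a hypothetical fixed header (`HEADER_BITS`) plus its 0-th order empirical entropy in bits, \(|T_i|H_0(T_i)\), which never decreases as the piece grows. For every node we keep only the farthest edge whose cost stays under each threshold \((1+\epsilon)^k\). The sketch still scans all \(O(n^2)\) edges to find the ones to keep, so it illustrates the sparsification but not the fast construction; all function names are made up for the example.

```python
import math
from collections import Counter

HEADER_BITS = 32.0  # hypothetical fixed per-piece overhead, in bits (an assumption)


class EntropyCost:
    """Tracks HEADER_BITS + |piece| * H0(piece) while symbols are appended."""

    def __init__(self):
        self.counts = Counter()
        self.length = 0
        self.sum_c_log_c = 0.0  # sum over symbols of n_c * log2(n_c)

    def append(self, symbol):
        m = self.counts[symbol]
        if m > 0:
            self.sum_c_log_c -= m * math.log2(m)
        self.sum_c_log_c += (m + 1) * math.log2(m + 1)
        self.counts[symbol] = m + 1
        self.length += 1
        # n * H0 = n * log2(n) - sum_c n_c * log2(n_c)
        return HEADER_BITS + self.length * math.log2(self.length) - self.sum_c_log_c


def all_edges(text, i):
    """Every edge (j, cost) leaving boundary i: the full graph, used as a baseline."""
    tracker = EntropyCost()
    return [(j, tracker.append(text[j - 1])) for j in range(i + 1, len(text) + 1)]


def pruned_edges(text, i, eps):
    """Keep, per threshold (1 + eps)^k, only the farthest edge with cost <= threshold."""
    edges = []
    tracker = EntropyCost()
    threshold = 1.0
    prev = None  # last edge seen so far, as (j, cost)
    for j in range(i + 1, len(text) + 1):
        c = tracker.append(text[j - 1])
        if prev is not None and c > threshold:
            edges.append(prev)  # prev was the farthest edge under the old threshold
        while c > threshold:
            threshold *= 1.0 + eps
        prev = (j, c)
    if prev is not None:
        edges.append(prev)  # the edge reaching the end of the text is always kept
    return edges


def shortest_path_cost(text, edges_from):
    """Shortest path from boundary 0 to boundary n over the given edge generator."""
    dist = [math.inf] * (len(text) + 1)
    dist[0] = 0.0
    for i in range(len(text)):
        for j, c in edges_from(text, i):
            dist[j] = min(dist[j], dist[i] + c)
    return dist[-1]


if __name__ == "__main__":
    sample = "a" * 40 + "abcabcabc" * 5 + "z" * 40
    eps = 0.25
    print(shortest_path_cost(sample, all_edges))
    print(shortest_path_cost(sample, lambda t, i: pruned_edges(t, i, eps)))
    print(len(pruned_edges(sample, 0, eps)), "edges kept out of", len(all_edges(sample, 0)))
```

With a monotone cost like this one, the path found in the pruned graph should cost at most \((1+\epsilon)\) times the optimum, while each node keeps only about \(\log_{1+\epsilon}\) of its largest edge cost many edges.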


