Text Partitioning
Today I saw this article posted in arXiv:
On optimally partitioning a text to improve its compression
Written by Paolo Ferragina, Igor Nitto and Rossano Venturini
I found this article really interesting and nicely presented. The problem they focus on is:
Problem: Given a compressor and a text
of length
, drawn from an alphabet
of size
. Find the optimal partition of
such that
is minimized.
This means that we want to cut the text in pieces, with
unknown, such that applying the compressor
over each piece achieves the best compression possible over the text. This is ignoring possible permutations of the text, such as the Burrows-Wheeler Transform (BWT).
A simple solution is to transform this problem into a shortest path problem, every position in the text is a node in the graph, and every node is connected with nodes
. The cost of going from node
to node
is
. It is easy to see that the best partition obtains a total size equal to the minimum path from
to
. Here I include a figure of the graph (using IPE :-)).

Example shortest path for text partitioning
The main problem is that just building this graph takes time. Assume that
takes linear time to compress a sequence, then building the graph takes:
In the paper they present an algorithm to approximate the problem, they require time and achieve an
approximation. The main idea behind the approach is to approximate the graph in such a way that, by storing less edges, they can still approximate the cost of the minimum path. They show how to run the algorithm without building the approximated version of the graph, to keep the space consumption low. They also show how to estimate the size of the compression for 0-order and k-order compressors during the process.
mmmm I have to be honest. I didn’t undersatand a shitt… JAJAJAJA…!!!!!
I think I’ll never understand what you do ….
but I understand something more important… other more important things about you, so I can feek satisfied!!
I love you!!
XOXO