Today I saw this article posted on arXiv:
On optimally partitioning a text to improve its compression
Written by Paolo Ferragina, Igor Nitto and Rossano Venturini
I found this article really interesting and nicely presented. The problem they focus on is:
Problem: Given a compressor C and a text T of length n, drawn from an alphabet of size σ, find the optimal partition T = T_1 T_2 … T_k such that |C(T_1)| + |C(T_2)| + … + |C(T_k)| is minimized.
This means that we want to cut the text into k pieces, with k unknown, such that applying the compressor C over each piece separately achieves the best possible compression of the whole text. This ignores possible permutations of the text, such as the Burrows-Wheeler Transform (BWT).
A simple solution is to transform this problem into a shortest-path problem: every position in the text is a node in the graph, and every node i is connected with nodes i+1, i+2, …, n. The cost of going from node i to node j is |C(T[i+1, j])|, the compressed size of that piece. It is easy to see that the best partition obtains a total size equal to the cost of the minimum path from node 0 to node n. Here I include a figure of the graph (using IPE :-)).
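The shortest-path formulation can be sketched directly; since all edges go forward, a single left-to-right DP suffices. This is a minimal sketch using zlib as a stand-in for the abstract compressor C (the paper treats C generically) — it computes the exact optimum, not the paper's fast approximation:

```python
import zlib

def optimal_partition(text, compressor=lambda s: len(zlib.compress(s))):
    """Exact optimal partition via shortest path on the DAG of cut points.

    Node i is the cut just before position i (0..n); edge (i, j) costs
    compressor(text[i:j]).  All edges point forward, so one left-to-right
    pass of edge relaxations computes the shortest path from 0 to n.
    """
    n = len(text)
    INF = float("inf")
    dist = [INF] * (n + 1)   # dist[j] = cheapest cost to encode text[:j]
    parent = [0] * (n + 1)
    dist[0] = 0
    for i in range(n):
        if dist[i] == INF:
            continue
        for j in range(i + 1, n + 1):            # Theta(n^2) edges...
            c = dist[i] + compressor(text[i:j])  # ...each a compression call
            if c < dist[j]:
                dist[j], parent[j] = c, i
    # Recover the cut points by walking parents back from n.
    cuts, j = [], n
    while j > 0:
        cuts.append((parent[j], j))
        j = parent[j]
    cuts.reverse()
    return dist[n], [text[a:b] for a, b in cuts]
```

The nested loops make the cost of the naive approach concrete: every pair (i, j) triggers a fresh compression of that substring.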
The main problem is that just building this graph takes too much time. Assume that C takes linear time to compress a sequence; then building the graph requires computing |C(T[i+1, j])| for every pair i < j, which sums to Σ_{i<j} O(j − i) = O(n³) time.
In the paper they present an algorithm to approximate the problem: it requires O(n log_{1+ε} n) time and achieves a (1+ε)-approximation. The main idea behind the approach is to sparsify the graph in such a way that, by storing fewer edges, they can still approximate the cost of the minimum path. They show how to run the shortest-path algorithm without actually building the sparsified graph, to keep the space consumption low. They also show how to estimate the compressed size for 0-th order and k-th order compressors on the fly during the process.
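My hedged reading of the sparsification idea, as a sketch: from each node i, keep only the farthest edge inside each geometric cost band, so any dropped edge is dominated, up to a (1+ε) factor, by a kept one. Note this naive scan still evaluates every edge cost, so it illustrates only the pruning, not the paper's time bound (the authors find the band boundaries without computing all costs):

```python
import zlib

def approx_partition_cost(text, eps=0.5,
                          compressor=lambda s: len(zlib.compress(s))):
    """(1+eps)-style sparsification sketch (assumed reconstruction, not
    the authors' actual algorithm): from node i, among all edges whose
    cost falls in the same band [c0*(1+eps)^k, c0*(1+eps)^(k+1)), relax
    only the one reaching farthest, then run the usual forward DP.
    """
    n = len(text)
    INF = float("inf")
    dist = [INF] * (n + 1)
    dist[0] = 0
    for i in range(n):
        if dist[i] == INF:
            continue
        band = None   # upper bound of the current cost band
        prev = None   # (j, cost) of the farthest edge seen in the band
        for j in range(i + 1, n + 1):
            c = compressor(text[i:j])
            if band is None:
                band = c * (1 + eps)
            elif c > band:
                pj, pc = prev               # keep the band's farthest edge
                if dist[i] + pc < dist[pj]:
                    dist[pj] = dist[i] + pc
                while c > band:             # open the next band(s)
                    band *= (1 + eps)
            prev = (j, c)
        pj, pc = prev                       # always keep the edge to n
        if dist[i] + pc < dist[pj]:
            dist[pj] = dist[i] + pc
    return dist[n]
```

Each node keeps only O(log_{1+ε} n) edges because compressed sizes are bounded by a polynomial in n, which is where the log_{1+ε} n factor in the running time comes from.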