Archive for November 2011

SXSI

After a while, I can happily say that our SXSI system is available for download. Kim created a package that can be easily installed and tested. In this post, I include a step-by-step tutorial on how to build the system and index a sample XML file.

First we need to install the dependencies in our system. In this case, I’ll show how to do it in Ubuntu 11.10.

$ sudo apt-get install ocaml-ulex ocaml-findlib ocaml-nox libxml++2.6-dev camlp4

Then download the package. To decompress and compile it, run:

$ tar xvf sxsi.tar.gz
$ cd sxsi/libcds
$ make
$ cd ../xpathcomp/src/XMLTree
$ make clean all
$ cd ../../
$ ./configure
$ ./build
$ cd xpathcomp

This generates a binary called “main.native”, which lets you index and query XML files. Let’s first generate an XML file with the standard XMark tool:

$ ./gen_xml -f 1 > sample.xml

Now, in order to index this file we run:

$ ./main.native -v -s sample.srx sample.xml ""
Parsing XML Document : 84729.9ms
Mem use before: VmRSS:	    5380 kB
Mem use after: VmRSS:	  117652 kB
Building TextCollection : 189371ms
Mem use before: VmRSS:	  117916 kB
Mem use after: VmRSS:	  237264 kB
Building parenthesis struct : 303.961ms
Mem use before: VmRSS:	  237264 kB
Mem use after: VmRSS:	  237528 kB
Tags blen is 8
Building Tag Structure : 4715.13ms
Mem use before: VmRSS:	  237528 kB
Mem use after: VmRSS:	  214988 kB
Number of distinct tags 92
Building tag relationship table: 1209.822893ms
Parsing document: 280366.960049ms
Writing file to disk: 115.392923ms
character 0-0 Stream.Error("illegal begin of query")
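
As a rough sanity check on the log above, the per-phase timings can be added up and compared with the overall “Parsing document” figure. The sketch below is only an illustration: the function name sum_phase_ms is made up, and it assumes the verbose output was saved to a file with lines shaped exactly like the ones shown here.

```shell
# Hypothetical helper (not part of SXSI): sum the per-phase timings
# found in a saved indexing log. Assumes lines such as
#   "Building Tag Structure : 4715.13ms"
# and skips the overall "Parsing document: ...ms" total.
sum_phase_ms() {
  awk -F': *' '
    /^(Parsing XML Document|Building )/ && $2 ~ /ms$/ {
      sub(/ms$/, "", $2)   # strip the unit so awk treats it as a number
      total += $2
    }
    END { printf "%.1f\n", total }
  ' "$1"
}
```

Called as sum_phase_ms on a file holding the log above, the five phases add up to about 280329.8 ms, which is within a few tens of milliseconds of the 280366.96 ms reported for the whole run.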

Yes, I gave an empty query, but I didn’t want to search for anything yet ;-). The error on the last line only comes from that empty query; by then the index had already been written to disk. We now have an indexed version of sample.xml, built with the standard options and saved as sample.srx. You can play with different indexing options; run main.native without parameters to see the full set of options currently supported. The result so far looks like this:

$ ls -lh sample.*
-rw-r--r-- 1 fclaude fclaude 188M 2011-10-31 23:10 sample.srx
-rw-rw-r-- 1 fclaude fclaude 112M 2011-10-31 23:02 sample.xml
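
From the rounded sizes in the listing, the index is roughly 188/112 ≈ 1.68 times the size of the original XML file. To get the ratio from exact byte counts, something like the following sketch could be used; size_ratio is a made-up helper name, not something shipped with SXSI.

```shell
# Hypothetical helper: print size(index) / size(source) with two decimals.
# Usage: size_ratio sample.srx sample.xml
size_ratio() {
  idx=$(wc -c < "$1")   # exact size of the index in bytes
  src=$(wc -c < "$2")   # exact size of the source XML in bytes
  awk -v i="$idx" -v s="$src" 'BEGIN {
    if (s == 0) { print "source file is empty" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", i / s
  }'
}
```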

Finally, as an example, we can count the number of results for the query “/site/regions/africa”:

$ ./main.native -c -v sample.srx /site/regions/africa
Loading tag table: 4.344940ms
Loading parenthesis struct : 304.395ms
Mem use before: VmRSS:	    5380 kB
Mem use after: VmRSS:	    7492 kB
Loading tag names struct : 0.049ms
Mem use before: VmRSS:	    7492 kB
Mem use after: VmRSS:	    7492 kB
tags_blen is 8
11 MB for tag sequence
Loading tag struct : 11.366ms
Mem use before: VmRSS:	    7492 kB
Mem use after: VmRSS:	   27492 kB
Loading text bitvector struct : 11.738ms
Mem use before: VmRSS:	   27492 kB
Mem use after: VmRSS:	   29076 kB
Loading TextCollection : 144.782ms
Mem use before: VmRSS:	   29076 kB
Mem use after: VmRSS:	  194604 kB
Loading file: 478.782892ms
Parsing query: 0.061035ms
Parsed query:
/child::site/child::regions/child::africa
Compiling query: 0.048876ms
Automaton (0) :
States {q₀ q₁ q₂ q₃ q₄}
Initial states: {q₀}
Marking states: {q₀ q₁ q₂ q₃}
Topdown marking states: {q₀ q₁ q₂ q₃}
Bottom states: {q₀ q₁ q₂ q₃ q₄}
True states: {q₄}
Alternating transitions
_________________________________
(q₀, {'' })         → ↓₂q₀ ∧ ↓₁q₁
(q₀, Σ)             → ↓₂q₀
(q₁, {'site' })     → ↓₂q₁ ∧ ↓₁q₂
(q₁, Σ)             → ↓₂q₁
(q₂, {'regions' })  → ↓₂q₂ ∧ ↓₁q₃
(q₂, Σ)             → ↓₂q₂
(q₃, {'africa' })   ⇒ ↓₂q₃ ∧ ↓₁q₄
(q₃, Σ)             → ↓₂q₃
(q₄, Σ)             → ↓₂q₄ ∧ ↓₁q₄
_________________________________
Execution time: 1.343012ms
Number of results: 1
Maximum resident set size: VmHWM:	  200084 kB
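
To see which structure dominates memory when the index is loaded, the “Mem use before/after” pairs in the verbose output can be turned into per-phase deltas. This is a minimal sketch, assuming the output was saved to a file and keeps the layout shown above (a phase line ending in “ms”, followed by its two “Mem use” lines); mem_deltas is a made-up name.

```shell
# Hypothetical helper: print the VmRSS growth attributed to each phase.
mem_deltas() {
  awk '
    /ms$/ {                      # any timing line names the current phase
      phase = $0
      sub(/ *:.*/, "", phase)    # keep only the text before the colon
    }
    /^Mem use before:/ { before = $(NF-1) }   # value in kB
    /^Mem use after:/  { printf "%s\t%d kB\n", phase, $(NF-1) - before }
  ' "$1"
}
```

On the output above, this attributes 165528 kB to “Loading TextCollection” and 20000 kB to “Loading tag struct”, so the text collection accounts for most of the resident memory.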

And that’s it. Now you can play further with SXSI :-)