After a while, I can happily say that our SXSI system is available for download. Kim created a package that can be easily installed and tested. In this post, I include a step-by-step tutorial on how to build the system and index a sample XML file.
First, we need to install the dependencies on our system. In this case, I’ll show how to do it on Ubuntu 11.10.
$ sudo apt-get install ocaml-ulex ocaml-findlib ocaml-nox libxml++2.6-dev camlp4
Then, download the package. To decompress and compile the package do:
$ tar xvf sxsi.tar.gz
$ cd sxsi/libcds
$ make
$ cd ../xpathcomp/src/XMLTree
$ make clean all
$ cd ../../
$ ./configure
$ ./build
$ cd xpathcomp
This generates a binary called “main.native”. This binary allows you to index and query XML files. Let’s first generate an XML file with the standard XMark tool:
$ ./gen_xml -f 1 > sample.xml
Now, in order to index this file we run:
$ ./main.native -v -s sample.srx sample.xml ""
Parsing XML Document : 84729.9ms
Mem use before: VmRSS: 5380 kB
Mem use after: VmRSS: 117652 kB
Building TextCollection : 189371ms
Mem use before: VmRSS: 117916 kB
Mem use after: VmRSS: 237264 kB
Building parenthesis struct : 303.961ms
Mem use before: VmRSS: 237264 kB
Mem use after: VmRSS: 237528 kB
Tags blen is 8
Building Tag Structure : 4715.13ms
Mem use before: VmRSS: 237528 kB
Mem use after: VmRSS: 214988 kB
Number of distinct tags 92
Building tag relationship table: 1209.822893ms
Parsing document: 280366.960049ms
Writing file to disk: 115.392923ms
character 0-0 Stream.Error("illegal begin of query")
Yes, I gave an empty query, but I didn’t want to search for anything yet ;-). We now have an indexed version of sample.xml, built with the standard options and saved as sample.srx. You can play with different indexing options; run main.native without parameters to see the full set of options supported at the moment. The result so far looks like this:
$ ls -lh sample.*
-rw-r--r-- 1 fclaude fclaude 188M 2011-10-31 23:10 sample.srx
-rw-rw-r-- 1 fclaude fclaude 112M 2011-10-31 23:02 sample.xml
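To put those two sizes in perspective, here is a small sketch that computes how large the index is relative to the original XML. The two numbers are simply copied from the listing above, and the variable names are mine, not part of SXSI.

```shell
# Sizes in MB, copied by hand from the ls listing (assumed values, not read from disk).
index_mb=188
xml_mb=112

# awk does the floating-point division; the result is index size / XML size.
ratio=$(awk -v i="$index_mb" -v x="$xml_mb" 'BEGIN { printf "%.2f", i / x }')
echo "index/xml ratio: $ratio"
```

With these numbers, the index built with the standard options comes out at roughly 1.7 times the size of the XML file.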
Finally, as an example, we can count the number of results for the query “/site/regions/africa”:
$ ./main.native -c -v sample.srx /site/regions/africa
Loading tag table: 4.344940ms
Loading parenthesis struct : 304.395ms
Mem use before: VmRSS: 5380 kB
Mem use after: VmRSS: 7492 kB
Loading tag names struct : 0.049ms
Mem use before: VmRSS: 7492 kB
Mem use after: VmRSS: 7492 kB
tags_blen is 8
11 MB for tag sequence
Loading tag struct : 11.366ms
Mem use before: VmRSS: 7492 kB
Mem use after: VmRSS: 27492 kB
Loading text bitvector struct : 11.738ms
Mem use before: VmRSS: 27492 kB
Mem use after: VmRSS: 29076 kB
Loading TextCollection : 144.782ms
Mem use before: VmRSS: 29076 kB
Mem use after: VmRSS: 194604 kB
Loading file: 478.782892ms
Parsing query: 0.061035ms
Parsed query: /child::site/child::regions/child::africa
Compiling query: 0.048876ms
Automaton (0) :
States {q₀ q₁ q₂ q₃ q₄}
Initial states: {q₀}
Marking states: {q₀ q₁ q₂ q₃}
Topdown marking states: {q₀ q₁ q₂ q₃}
Bottom states: {q₀ q₁ q₂ q₃ q₄}
True states: {q₄}
Alternating transitions
_________________________________
(q₀, {'' }) → ↓₂q₀ ∧ ↓₁q₁
(q₀, Σ) → ↓₂q₀
(q₁, {'site' }) → ↓₂q₁ ∧ ↓₁q₂
(q₁, Σ) → ↓₂q₁
(q₂, {'regions' }) → ↓₂q₂ ∧ ↓₁q₃
(q₂, Σ) → ↓₂q₂
(q₃, {'africa' }) ⇒ ↓₂q₃ ∧ ↓₁q₄
(q₃, Σ) → ↓₂q₃
(q₄, Σ) → ↓₂q₄ ∧ ↓₁q₄
_________________________________
Execution time: 1.343012ms
Number of results: 1
Maximum resident set size: VmHWM: 200084 kB
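The verbose output is handy for seeing what the system does, but if you only care about the count, a tiny filter can pull it out. This is just a sketch: result_count is a hypothetical helper of mine, and it assumes that each message is printed on its own line and that the count line looks exactly like the one in the run above.

```shell
# Hypothetical helper (not part of SXSI): read verbose output on stdin and
# print only the number from the "Number of results:" line.
result_count() {
  sed -n 's/^Number of results: *\([0-9][0-9]*\).*$/\1/p'
}

# Sample lines copied from the run above, used here instead of calling the binary.
count=$(printf '%s\n' \
  'Execution time: 1.343012ms' \
  'Number of results: 1' \
  'Maximum resident set size: VmHWM: 200084 kB' | result_count)
echo "$count"
```

In practice you would pipe the output of main.native into result_count instead of the sample lines. I have not checked whether these messages go to standard output or standard error, so you may need to redirect one into the other first.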
And that’s it. Now you can play further with SXSI 🙂