SXSI

After a while I can happily say that our SXSI system is available for downloading. Kim created a package that can be easily installed and tested. In this post, I include a step by step tutorial on how to build the system and index a sample XML file.

First we need to install the dependencies in our system. In this case, I’ll show how to do it in Ubuntu 11.10.

$ sudo apt-get install ocaml-ulex ocaml-findlib ocaml-nox libxml++2.6-dev camlp4

Then, download the package. To decompress and compile the package do:

$ tar xvf sxsi.tar.gz
$ cd sxsi/libcds
$ make
$ cd ../xpathcomp/src/XMLTree
$ make clean all
$ cd ../../
$ ./configure
$ ./build
$ cd xpathcomp

This generates a binary called “main.native”. This binary allows you to index and query xml files. Lets first generate an xml file with the standard xmark tool

$ ./gen_xml -f 1 > sample.xml

Now, in order to index this file we run:

$ ./main.native -v -s sample.srx sample.xml ""
Parsing XML Document : 84729.9ms
Mem use before: VmRSS:	    5380 kB
Mem use after: VmRSS:	  117652 kB
Building TextCollection : 189371ms
Mem use before: VmRSS:	  117916 kB
Mem use after: VmRSS:	  237264 kB
Building parenthesis struct : 303.961ms
Mem use before: VmRSS:	  237264 kB
Mem use after: VmRSS:	  237528 kB
Tags blen is 8
Building Tag Structure : 4715.13ms
Mem use before: VmRSS:	  237528 kB
Mem use after: VmRSS:	  214988 kB
Number of distinct tags 92
Building tag relationship table: 1209.822893ms
Parsing document: 280366.960049ms
Writing file to disk: 115.392923ms
character 0-0 Stream.Error("illegal begin of query")

Yes, I gave an empty query, but I didn’t want to search anything yet ;-) . We have an indexed version of sample.xml with standard options saved as sample.srx. You can play with different indexing options, run main.native without parameters to see the full set of options supported at the moment. The result so far looks like this:

$ ls -lh sample.*
-rw-r--r-- 1 fclaude fclaude 188M 2011-10-31 23:10 sample.srx
-rw-rw-r-- 1 fclaude fclaude 112M 2011-10-31 23:02 sample.xml

Finally, as an example, we can count the number of results for the query “/site/regions/africa”:

$ ./main.native -c -v sample.srx /site/regions/africa
Loading tag table: 4.344940ms
Loading parenthesis struct : 304.395ms
Mem use before: VmRSS:	    5380 kB
Mem use after: VmRSS:	    7492 kB
Loading tag names struct : 0.049ms
Mem use before: VmRSS:	    7492 kB
Mem use after: VmRSS:	    7492 kB
tags_blen is 8
11 MB for tag sequence
Loading tag struct : 11.366ms
Mem use before: VmRSS:	    7492 kB
Mem use after: VmRSS:	   27492 kB
Loading text bitvector struct : 11.738ms
Mem use before: VmRSS:	   27492 kB
Mem use after: VmRSS:	   29076 kB
Loading TextCollection : 144.782ms
Mem use before: VmRSS:	   29076 kB
Mem use after: VmRSS:	  194604 kB
Loading file: 478.782892ms
Parsing query: 0.061035ms
Parsed query:
/child::site/child::regions/child::africa
Compiling query: 0.048876ms
Automaton (0) :
States {q₀ q₁ q₂ q₃ q₄}
Initial states: {q₀}
Marking states: {q₀ q₁ q₂ q₃}
Topdown marking states: {q₀ q₁ q₂ q₃}
Bottom states: {q₀ q₁ q₂ q₃ q₄}
True states: {q₄}
Alternating transitions
_________________________________
(q₀, {'' })         → ↓₂q₀ ∧ ↓₁q₁
(q₀, Σ)             → ↓₂q₀
(q₁, {'site' })     → ↓₂q₁ ∧ ↓₁q₂
(q₁, Σ)             → ↓₂q₁
(q₂, {'regions' })  → ↓₂q₂ ∧ ↓₁q₃
(q₂, Σ)             → ↓₂q₂
(q₃, {'africa' })   ⇒ ↓₂q₃ ∧ ↓₁q₄
(q₃, Σ)             → ↓₂q₃
(q₄, Σ)             → ↓₂q₄ ∧ ↓₁q₄
_________________________________
Execution time: 1.343012ms
Number of results: 1
Maximum resident set size: VmHWM:	  200084 kB

And that’s it. Now you can play further with SXSI :-)

Update

Yes, I’ve been lazy when it comes to writing here :-) . I decided to go and write a short update on what’s been going on lately before going into my next post, which is going to take me a little bit longer. This post is mainly based on my progress report I’m supposed to be writing. In Waterloo, PhD students have to write a progress report every year telling the school what they did during the year.

The last two months have been quite interesting, I just updated the publications page accordingly. Together with Patrick K. Nicholson and Diego Seco we got a paper accepted into SPIRE 2011 about constructing Wavelet Trees in little space (almost in place). Also, together with Gonzalo Navarro, we are going to be giving a tutorial on space efficient data structures.

Another set of good news came from CIKM, where I’m a coauthor on two short papers accepted this year. One is with Antonio Farina, Miguel A. Martinez-Prieto and Gonzalo Navarro, where we present indexes for highly repetitive collections in two scenarios, when we want to query documents containing a pattern or positions in documents where the pattern is contained.

The second article that went in is a joint work with Susana Ladra, who recently (April) defended her PhD (It doesn’t come as a surprise to anybody that knows her that she did great :-) ). This article mixes two know representations and uses the information of the url to split the graph and achieve a structure that supports navigation in both directions, offering the best tradeoff so far.

Finally, to close the set of good news, I was awarded with a U.S./Canada Google PhD. Fellowship in Search and Information Retrieval. This came as unexpected and highly appreciated news, thanks to this I’ll have my funding till the end of my PhD (hopefully). As part of the fellowship I was assigned a research mentor from Google for the next two years. My mentor is Stefan Büttcher, one of the authors of “Information Retrieval: Implementing and Evaluating Search Engines”. He also implemented a search engine called Wumpus. As you can imagine, I’ve a lot to learn from him and I’m looking forward to this experience :-) .

And that’s it, now I go back to finishing my progress report due tomorrow.

Project/task management software – Qubity

Today I’m writing about something less technical. In fact, I would say I’m more kind of advertising something: Qubity. Qubity is a project/task management software developed in Chile, some of the people behind this awesome project are good friends of mine, and thus I decided to contribute with a little bit of publicity :-) . I’ll try convincing you to try the system, in beta, by just telling you what it does.

Here we go!

Continue reading ‘Project/task management software – Qubity’ »

Disk Crash, Recovering Files and Doing Backups

About one and a half weeks ago I had a disk crash. I didn’t lose anything, but was pretty close, mainly because I deleted by hand an important file :-( .

It is interesting to talk about my disk crash because I faced many problems to bring my computer back. Luckily my notebook has two hard drives, so I’m up and running with the secondary one now, but to do so I had to clone the recovery partition. Then I copied my backups to my home directory, and in the process I deleted an important file, which took me days to recover. And finally, I wrote a small script to back up things. This script is not a program (there is no error checking or anything like that), but it shows how to keep an encrypted backup.

Continue reading ‘Disk Crash, Recovering Files and Doing Backups’ »

Land of LISP and Lazy Evaluation

One of my recent read was “Land of LISP” by Conrad Barski. It’s an unconventional programming book, packed with comics. Most examples are tiny games you can code in a couple of pages. I personally liked the book a lot, it is fun to read and presents a language that, in my opinion, is also fun.

My experience with LISP is quite limited, so most of the things I found in the book were new to me (I knew how to define functions and basic stuff, but nothing about macros, only a couple of built-in functions, etc.). One of the things I liked the most was one of the examples for macros where the author presents a simple solution to get lazy evaluation. In the book, the author codes a game with some weird rules, and I don’t think I would learn much by just copying that example, therefore, I will use the same idea here but with our old friend tic-tac-toe. I have to warn you that the implementation I’m going to post is not a good tic-tac-toe implementation, you can probably find more challenging opponents. The main goal of this exercise is to illustrate the lazy evaluation. Having said that, lets get started. Continue reading ‘Land of LISP and Lazy Evaluation’ »

Finding the most significant bit

In one of the problems I’ve been working on with Diego Seco and Patrick Nicholson, we needed the most-significant-bit (msb) function as a primitive for our solution. As Diego pointed out today, this function was one of the bottlenecks of our structure, consuming a considerable amount of time.

In this post I’ll go through the solutions we tested in our data structure. Continue reading ‘Finding the most significant bit’ »

Seven Languages in Seven Weeks

I read this book, by Bruce Tate, some weeks ago and totally recommend it. You can buy it from amazon or the pragmatic bookshelf.

What’s in it?

From the title of the book is easy to get an idea. This book explores seven different programming languages, the idea is that you spend one week using each programming language and get an idea of what’s out there to offer alternatives to the standards we are used to. By the standard I mean what I consider the standard, based on my experience most people program in Python, PHP, C++, C# or Java.

The languages covered areRuby, Io, Prolog, Scala, Erlang, Clojure, and Haskell.

The book has one chapter per language (plus intro and wrap-up chapters). Each language is presented in three parts, called days. Usually the first one shows the very basics, like input/output or math operations. The second day is usually used to present something that is different for this language, and the third day to present a harder example where the language shows to be useful and superior to others. For example, in Prolog, you can solve Sudoku puzzles by day 3 with little effort.

Why do I recommend it?

I liked several things about this book.

  1. The third day shows you an example where the language actually helps you. This is in contrast to what my programming languages course was,  where we went through 4 different languages. In that course we implemented pretty much the same things in each language. Maybe someone would disagree with my following statement, but in my opinion, implementing QuickSort in Prolog just makes you think you are wasting your time. You know how to implement it in C, and it works faster, so what’s the point? In this book you don’t do that kind of things. As I mentioned before, one of the examples for Prolog is solving a Sudoku puzzle. This shows you something great about the language. You can do this with almost no effort. If you try implementing it in another language, say Java, it is certainly going to take more effort.
  2. The presentation is clear. It is easy to follow and you don’t feel lost at some point looking at code you don’t understand. For instance, in the basics, you usually go through basic math operations which are really similar to at least one language you know, yet the author takes the time to go through them. The explanations are of the right length to not be boring either, which is also a great plus. Another point regarding the basics, the author goes into things like typing with examples of that part, which also makes it much more interesting.
  3. You are likely to find lots of things you didn’t know. In my case, I haven’t been exposed to that many programming languages, most of them imperative and object oriented. The only language I knew besides that was Scheme and some Common Lisp. I had a great time reading about these other languages that are somehow different and offer powerful constructs that make it easy to do things I know to be hard. For example, the future objects in Io, or the way you can build a service monitor in Erlang.
  4. The author is not selling you the languages. Well, maybe some of them more than others. But the important point is that I didn’t find the book to be too biased. In fact, for every language the author presents it and finishes listing advantages and disadvantages. This is a great thing, not only you get information about when to use a language, but also when not to.
  5. You get to hear from the people that invented the languages (or important users of it). In every chapter you get to see an interview with someone that can be considered as a worthy representative of the language. I found this really interesting, and adds external opinions to the book, which I think, increases its value.

It is important to make clear that you will not learn the seven languages by reading this book. I would not recommend it to someone trying to learn how to program or trying to learn one of those seven languages.

If you know how to program, and enjoy it, you should consider buying it.

Some interesting links

Of course, the previous post does not count, so I have to start with something. Here is a list of interesting talks/articles that I’ve seen lately.

Dan Meyer: Math class needs a makeover

Points out some problems in math education, where doing math becomes just pattern matching. Search for the variables you need and replace them in the formula. The proposal is to make them realize the information needed to solve a problem, as in real life, where it’s rare to get exactly the pieces of information you need.

Conrad Wolfram: Teaching kids real math with computers

The main critique to the math education system is that it focuses in calculation (computation) instead of real math. The proposal is to use computers to do the calculation and allow children to get intuition on it by playing with them, seeing for what the things they learn are useful. An interesting detail I found in this talks is that the presenter’s proposal for teaching the basics of calculation, like multiplicating big numbers, can be accomplished teaching the kids how to program that method instead of making the repeat it 10^6 times.

Arthur Benjamin’s formula for changing math education

A 3min short opinion-video. The main idea is to include statistics as the main subject. The main argument for this is that most people only need that in their day to day life (plus the basics, I guess).

Ken Robinson: Changing education paradigms

Those videos are awesome, they really catch you attention and it’s hard to close the video before it finishes.

10 easy ways to fail a Ph.D.

The title explains exactly what the article is about. And in it you can find a link to another article I found really interesting: Productivity tips, tricks and hacks for academics.

 

 

New Web/Blog, Why?

I updated my Web and my blog (it used to be @wordpress). The main reason to do so was to get everything in one place and to move from http://www.recoded.cl to http://fclaude.recoded.cl. The main domain redirects to this page and will continue to do so for a while, since I don’t have time right now to setup another site.

Some of the projects I’m hosting at this domain are starting to attract people to collaborate in them, and since those projects are hosted as subdomains of recoded, I wanted to allow the option of using the root domain pointing to the projects instead of being my personal page. Besides, that makes it okay to also potentially host projects where I’m not involved.

Coming back to my blog

I haven’t written in here for a long time. There are many reasons for that, lots of things to do,  traveling, and time. Yet, the most important one, I didn’t know what to blog about.

The main problem is not finding what to write about, but to decide what should actually go into the blog. So far I’ve mostly posted about my trips and papers, and that’s something one should do. If you don’t think people want to know about what you write, then maybe you shouldn’t have had written that paper in the first place :) .

The missing part is the things I didn’t want to post. For example, talks and useful links I find. Or comments about books I read. After some thought I decided that I should also include that. So my new goal, copying from Alex Bowe, is to post once every two weeks.