It’s done!

Today I finally finished my Ph.D. My defence was on Monday 22, and after some minor corrections, I submitted the final version. It bounced once because of a formatting issue, but today the library approved it and it’s all over.

I was here for 4 years and 8 months. I’m taking many good memories with me, it was certainly a good time! All the conference trips, the freedom to work or learn whatever you want, were without doubt the best of the “professional” live as a Ph.D. student. Also being able to manage your own time was a big big plus. You could spend time reading, learning stuff not completely related to your thesis, enjoy a bit of curiosity over different topics or fields.

I’m leaving Canada in a couple of days. Not for long though, since I won’t miss my convocation ceremony in mid June.

I’m going back to Chile to work in Akori. A company I founded together with an old friend from undergrad. It’s going to be an exciting and uncertain new experience. At times I’ll probably miss the flexibility provided by the academia though :-).

If you are curious about the thesis itself, it’s already available! Download it here!

LIBCDS2 – work in progress

LIBCDS2 provides basic building blocks for static and space-efficient data structures. We have made a couple of design decisions that differ from LIBCDS. We will keep to maintain LIBCDS, as this serves as a faster version (you will see why further down) for building small research prototypes.

The main goal of LIBCDS2 is to allow huge structures and support multi-core/parallel computation. The new design aims at supporting these and at the same time free the user from tricky considerations. We want to provide a library that can be used in production, with proper error handling and stable up to any scale. The drawback, there is a bit of overhead we have to live with.

Lets explain this with one example: Arrays. Arrays in LIBCDS2 are one of the few structures that are not completely static. Users can change values, as long as the maximum value does not exceed the original maximum (actually, that the new value does not take more bits than the width of each field). The array optimizes under the hood, if your width is 16, it uses an array of uint16_t underneath. In order to do so, but not bother the user with knowing whether the specific width they want can be optimized, the Array::Create() method instantiates a subclass depending on the width. This adds one indirection when calling the set/access operations.

Two more examples regarding the performance hit LIBCDS2 is taking are the reference counters and the default storage of integers and sequences.

All structures in LIBCDS2 use reference counters. This has proven to be really effective when building space-efficient data structures, since they allow you to re-use substructures (and they are static, so it’s the perfect scenario). The user creates a structure with new (C++), and then just calls the method Use() to tell the library that the object is being used. When the user is done with the object, instead of deleting it, she/he calls the Unuse() method. If no one else is using the object, it is automatically freed. The problem: in multithreading, the Use and Unuse methods have to acquire a lock on the object. This slows down the Use/Unuse operations, and also adds the space required to keep a mutex per class (unless we used a global lock, which is an option we are considering, since Use/Unuse may not be that frequent).

Finally, the storage of integers and sequences. LIBCDS2 works almost exclusively with Array objects. These are slower than standard arrays, and the samples within structures are done with Array objects. Why did we do this? Well, the decision was made thinking that we don’t want to waste 64-bit samples for small structures, but we also want the library to escalate to huge (order of TBs) structures. For this, we need to represent the samplings using as many bits as needed (or close to that). Array offers that, and it also encapsulates a possible source of problems into a single class when escalating, making it much easier to maintain.

We still have many more decisions to make, and the library is slowly taking shape. Any comments, suggestions are welcome. If you want to contribute, please let us know, the project is hosted on github.

(This will be cross-posted on the libcds2 web page.)

Programming in Go

I’ve been programming in Go for a couple of months now (as hinted by my previous post). I thought it would be interesting to talk about my experience so far. The best way you can make your own opinion about the language is by trying it.

As a summary, I love Go! I know there are some drawbacks at the moment with respect to C++, yet the benefits I’ve gained by using Go overcome these. Lets start with the benefits:

  • Concurrent programming: The goroutines are awesome, you can just launch a lightweight thread by adding “go” in front of a function call. This is so simple that writing concurrent software is simplified amazingly. As I said in a previous post, Go does not provide something new in this sense. Erlang has a similar feature using “spawn”.

  • The memory model: You just don’t have to worry about whether stuff is allocated in the stack or the heap. It also has a garbage collector, so you don’t have to worry about freeing the resources when you don’t need them anymore. This simplifies your life quite a bit, yet it has some drawbacks. For example, if you keep pointing to something you don’t need anymore, then the garbage collector does not claim that memory (the gc can guess for you of course). Yet you can think of that as forgetting to free the space anyways :p.

  • Multiple return values: A function in Go can return more than one value. Here is a quick example:

    func sqrt(i int) (int, error) {
      if i < 0 {
        return 0, errors.New("Can't compute the square root of a negative number")
      }
      return int(math.Sqrt(float64(i))), nil
    }

    The function just return (-1, some error) when the argument i is negative, and (s, nil) when i is non-negative. The value s is the square root of i. What a poor choice for the variable name! :D

  • Errors: The way you handle errors in Go is very C-like, with two differences though. First, you have to be explicit when ignoring an error. Consider the previous function, we could call it the following ways:

    s, err := sqrt(i)
    if err != nil {
      fmt.Printf("There was an error")
    }

    or

    s, _ := srqt(i)

    But we can’t do this:

    s, err := sqrt(i)

    and then ignore err. Go complains about every single thing you don’t use. So if you don’t do anything with err, it will give you a compile-time error. This is pretty amazing when you read other’s people code. You have all the information right there, it is clear what could go wrong and how that is being handled.

  • Channels: channels are part of the language and allow you to communicate between different goroutines. Their usage is amazingly simple and at the same time powerful. Here is a small example:

    func genString(c chan string) {
      s := generateStringExpensive() // generate string
      c <- s
    }
    
    func main() {
      c := make(channel string)
      go genString() // compute the string in a goroutine
      var s string
      select { // we wait for the goroutine to send the result
        case s = <- c: fmt.Println(s)
      }
    }
  • Interfaces: They remind me of Haskell classes. It’s a very simple idea, you define a set of methods and any structure implementing those methods satisfies the interface. You don’t have to be explicit about it.

  • Maps: Yes, almost all languages have them. Still I like the defaults and so far, their performance has felt pretty good.

  • The tools: The Go compiler comes with a lot of tools. The syntax is so strictly defined that there is a tool that formats the code for you. There are no styles, just one way of writing it (or at least according to this program). The build, install and get commands are great! I like the fact that installing packages from github, googlecode and similar sites is so simple, just a couple of commands (usually just one) and you are ready to use the library you downloaded.

  • The library: The library that distributes with Go has a great design. Every time I have to use something from there it’s amazing how quickly you can get it up and running. One great example is the sort package, just take a look.

There are many defaults that make sense in go. You have for example types int, int32, and int64. For storing them you read and write integers of a given size, and when you read you can detect if there is a problem. At least this is a recurrent problem I had in C++, that I would write long to a file in a 64bits machine and then read it in a 32bits machine. Usually I wasn’t careful enough and ended up having problems when reading in the other machine.

Now lets move to the things I don’t like that much, or I miss.

  • Generics: I’m missing some kind of generics. Or a better type inference. There is a very basic type inference at this time. I have no clue if this is going to change, but I would love to see a type inference that is more like the one Haskell offers. Anyways, interfaces solve many of the problems one has without generics, so it’s not too bad.

  • Memory limitation: Currently there is a memory limitation, your process can’t use more than 16GB. It is possible to fix this (unlike in Java, where you can’t allocate an array with more than 4G-elements). For a quick fix take a look at my previous post.

  • Sometimes it feels a bit too strict: I know, the language is designed to enforce that all variables are used, all imports are required, etc. Yet sometimes you end up adding a removing the same import many times. A classical example of this is when you are checking your code, at least I use fmt and log intermittently, and I have to go up and comment them when I don’t use them. Similar things happen because there is no coercion between int and int32 for example. This avoids bugs, but sometimes (when you know it won’t fail) … it kind of bothers. Just to clarify, given a comment in twitter from Alex Bowe, in general I like these features, it’s just that in some simple cases they kind of feel an overkill (but I can’t think of a good solution to overcome those simple cases, so it’s just a warning that you may have to write or delete more than usual at some points :p)

Anyways, even though there are some things I don’t like that much (at least in certain cases), I’m really happy with Go. If you feel like learning it after reading this post, go here.

Increasing the memory limit in Go

Yeah, it’s been a while, but instead of talking about why I haven’t written anything, I’ll just jump into the good stuff.

I’ve been developing in Go lately. If you haven’t tried it yet, I totally recommend it. Honestly, when you first look at the language, you tend to think that there is nothing new in there. That may be true. Many of the features are not unique, and you can find them in other languages, yet the combination is unique. After one week of using Go I didn’t want to move back to C++ (what I usually use). There are a couple of things I miss from time to time, but the gain is so big that I don’t mind.

The problem. I’ve had a couple of problems dealing with the kind of software I usually implement. Today I found one that did worry me a lot, I can’t use more than 16GB of RAM! After that, my code will just crash with an “out of memory” error. In general this is not too bad, but for the particular algorithm I’m running, I need more than 16GB. Luckily, I found out that the limitation is temporal and that you can lift it up by changing just a couple of lines in Go’s source code. So here it goes.

I will assume you installed Go in $HOME/bin/go (if not, don’t worry, we are going to re-install it, right there).

The first step is to get the source for Go:

$ wget http://go.googlecode.com/files/go1.0.2.src.tar.gz

Once that finishes, decompress the package:

$ tar xvfz go1.0.2.src.tar.gz

Now, using your favourite editor, change the following files as explained here:

  • In file go/src/pkg/runtime/malloc.h
    Change line 119, where it says “MHeapMap_Bits = 22,” for “MHeapMap_Bits = 25,”
  • In file go/src/pkg/runtime/malloc.goc
    Change line 309, where it says “arena_size = 16LL<<30;” for “arena_size = 128LL<<30;”

And that’s it, now lets install this.

$ cd go/src
$ ./all.bash
$ cd ../..
$ mkdir -p $HOME/bin/go
$ mv go $HOME/bin/

Now we are ready. Remember to set the environment variable in your .bashrc:

export GOROOT=$HOME/bin/go
export PATH=$PATH:$GOROOT/bin

You are ready to enjoy 128GB. If you want 64GB, change the 25 for 24, and the 128 by 64.

This is supposed to be experimental, so please don’t blame me if things crash. There has to be a reason for that limitation to be there. In my case, I was able to get away with this and run my code. In this particular case is the (expensive) construction of a compressed text index, so the thing runs, outputs, and that’s it. Afterwards I process the output with other binaries. I’m just happy if the program manages to go through the construction, without much care on the resources I’m using at the time.

SXSI

After a while I can happily say that our SXSI system is available for downloading. Kim created a package that can be easily installed and tested. In this post, I include a step by step tutorial on how to build the system and index a sample XML file.

First we need to install the dependencies in our system. In this case, I’ll show how to do it in Ubuntu 11.10.

$ sudo apt-get install ocaml-ulex ocaml-findlib ocaml-nox libxml++2.6-dev camlp4

Then, download the package. To decompress and compile the package do:

$ tar xvf sxsi.tar.gz
$ cd sxsi/libcds
$ make
$ cd ../xpathcomp/src/XMLTree
$ make clean all
$ cd ../../
$ ./configure
$ ./build
$ cd xpathcomp

This generates a binary called “main.native”. This binary allows you to index and query xml files. Lets first generate an xml file with the standard xmark tool

$ ./gen_xml -f 1 > sample.xml

Now, in order to index this file we run:

$ ./main.native -v -s sample.srx sample.xml ""
Parsing XML Document : 84729.9ms
Mem use before: VmRSS:	    5380 kB
Mem use after: VmRSS:	  117652 kB
Building TextCollection : 189371ms
Mem use before: VmRSS:	  117916 kB
Mem use after: VmRSS:	  237264 kB
Building parenthesis struct : 303.961ms
Mem use before: VmRSS:	  237264 kB
Mem use after: VmRSS:	  237528 kB
Tags blen is 8
Building Tag Structure : 4715.13ms
Mem use before: VmRSS:	  237528 kB
Mem use after: VmRSS:	  214988 kB
Number of distinct tags 92
Building tag relationship table: 1209.822893ms
Parsing document: 280366.960049ms
Writing file to disk: 115.392923ms
character 0-0 Stream.Error("illegal begin of query")

Yes, I gave an empty query, but I didn’t want to search anything yet ;-). We have an indexed version of sample.xml with standard options saved as sample.srx. You can play with different indexing options, run main.native without parameters to see the full set of options supported at the moment. The result so far looks like this:

$ ls -lh sample.*
-rw-r--r-- 1 fclaude fclaude 188M 2011-10-31 23:10 sample.srx
-rw-rw-r-- 1 fclaude fclaude 112M 2011-10-31 23:02 sample.xml

Finally, as an example, we can count the number of results for the query “/site/regions/africa”:

$ ./main.native -c -v sample.srx /site/regions/africa
Loading tag table: 4.344940ms
Loading parenthesis struct : 304.395ms
Mem use before: VmRSS:	    5380 kB
Mem use after: VmRSS:	    7492 kB
Loading tag names struct : 0.049ms
Mem use before: VmRSS:	    7492 kB
Mem use after: VmRSS:	    7492 kB
tags_blen is 8
11 MB for tag sequence
Loading tag struct : 11.366ms
Mem use before: VmRSS:	    7492 kB
Mem use after: VmRSS:	   27492 kB
Loading text bitvector struct : 11.738ms
Mem use before: VmRSS:	   27492 kB
Mem use after: VmRSS:	   29076 kB
Loading TextCollection : 144.782ms
Mem use before: VmRSS:	   29076 kB
Mem use after: VmRSS:	  194604 kB
Loading file: 478.782892ms
Parsing query: 0.061035ms
Parsed query:
/child::site/child::regions/child::africa
Compiling query: 0.048876ms
Automaton (0) :
States {q₀ q₁ q₂ q₃ q₄}
Initial states: {q₀}
Marking states: {q₀ q₁ q₂ q₃}
Topdown marking states: {q₀ q₁ q₂ q₃}
Bottom states: {q₀ q₁ q₂ q₃ q₄}
True states: {q₄}
Alternating transitions
_________________________________
(q₀, {'' })         → ↓₂q₀ ∧ ↓₁q₁
(q₀, Σ)             → ↓₂q₀
(q₁, {'site' })     → ↓₂q₁ ∧ ↓₁q₂
(q₁, Σ)             → ↓₂q₁
(q₂, {'regions' })  → ↓₂q₂ ∧ ↓₁q₃
(q₂, Σ)             → ↓₂q₂
(q₃, {'africa' })   ⇒ ↓₂q₃ ∧ ↓₁q₄
(q₃, Σ)             → ↓₂q₃
(q₄, Σ)             → ↓₂q₄ ∧ ↓₁q₄
_________________________________
Execution time: 1.343012ms
Number of results: 1
Maximum resident set size: VmHWM:	  200084 kB

And that’s it. Now you can play further with SXSI :-)

Update

Yes, I’ve been lazy when it comes to writing here :-). I decided to go and write a short update on what’s been going on lately before going into my next post, which is going to take me a little bit longer. This post is mainly based on my progress report I’m supposed to be writing. In Waterloo, PhD students have to write a progress report every year telling the school what they did during the year.

The last two months have been quite interesting, I just updated the publications page accordingly. Together with Patrick K. Nicholson and Diego Seco we got a paper accepted into SPIRE 2011 about constructing Wavelet Trees in little space (almost in place). Also, together with Gonzalo Navarro, we are going to be giving a tutorial on space efficient data structures.

Another set of good news came from CIKM, where I’m a coauthor on two short papers accepted this year. One is with Antonio Farina, Miguel A. Martinez-Prieto and Gonzalo Navarro, where we present indexes for highly repetitive collections in two scenarios, when we want to query documents containing a pattern or positions in documents where the pattern is contained.

The second article that went in is a joint work with Susana Ladra, who recently (April) defended her PhD (It doesn’t come as a surprise to anybody that knows her that she did great :-)). This article mixes two know representations and uses the information of the url to split the graph and achieve a structure that supports navigation in both directions, offering the best tradeoff so far.

Finally, to close the set of good news, I was awarded with a U.S./Canada Google PhD. Fellowship in Search and Information Retrieval. This came as unexpected and highly appreciated news, thanks to this I’ll have my funding till the end of my PhD (hopefully). As part of the fellowship I was assigned a research mentor from Google for the next two years. My mentor is Stefan Büttcher, one of the authors of “Information Retrieval: Implementing and Evaluating Search Engines”. He also implemented a search engine called Wumpus. As you can imagine, I’ve a lot to learn from him and I’m looking forward to this experience :-).

And that’s it, now I go back to finishing my progress report due tomorrow.

Project/task management software – Qubity

Today I’m writing about something less technical. In fact, I would say I’m more kind of advertising something: Qubity. Qubity is a project/task management software developed in Chile, some of the people behind this awesome project are good friends of mine, and thus I decided to contribute with a little bit of publicity :-). I’ll try convincing you to try the system, in beta, by just telling you what it does.

Here we go!

Continue reading ‘Project/task management software – Qubity’ »

Disk Crash, Recovering Files and Doing Backups

About one and a half weeks ago I had a disk crash. I didn’t lose anything, but was pretty close, mainly because I deleted by hand an important file :-(.

It is interesting to talk about my disk crash because I faced many problems to bring my computer back. Luckily my notebook has two hard drives, so I’m up and running with the secondary one now, but to do so I had to clone the recovery partition. Then I copied my backups to my home directory, and in the process I deleted an important file, which took me days to recover. And finally, I wrote a small script to back up things. This script is not a program (there is no error checking or anything like that), but it shows how to keep an encrypted backup.

Continue reading ‘Disk Crash, Recovering Files and Doing Backups’ »

Land of LISP and Lazy Evaluation

One of my recent read was “Land of LISP” by Conrad Barski. It’s an unconventional programming book, packed with comics. Most examples are tiny games you can code in a couple of pages. I personally liked the book a lot, it is fun to read and presents a language that, in my opinion, is also fun.

My experience with LISP is quite limited, so most of the things I found in the book were new to me (I knew how to define functions and basic stuff, but nothing about macros, only a couple of built-in functions, etc.). One of the things I liked the most was one of the examples for macros where the author presents a simple solution to get lazy evaluation. In the book, the author codes a game with some weird rules, and I don’t think I would learn much by just copying that example, therefore, I will use the same idea here but with our old friend tic-tac-toe. I have to warn you that the implementation I’m going to post is not a good tic-tac-toe implementation, you can probably find more challenging opponents. The main goal of this exercise is to illustrate the lazy evaluation. Having said that, lets get started. Continue reading ‘Land of LISP and Lazy Evaluation’ »

Finding the most significant bit

In one of the problems I’ve been working on with Diego Seco and Patrick Nicholson, we needed the most-significant-bit (msb) function as a primitive for our solution. As Diego pointed out today, this function was one of the bottlenecks of our structure, consuming a considerable amount of time.

In this post I’ll go through the solutions we tested in our data structure. Continue reading ‘Finding the most significant bit’ »