Archive for the ‘Programming’ Category.

LIBCDS2 – work in progress

LIBCDS2 provides basic building blocks for static and space-efficient data structures. We have made a couple of design decisions that differ from LIBCDS. We will keep to maintain LIBCDS, as this serves as a faster version (you will see why further down) for building small research prototypes.

The main goal of LIBCDS2 is to allow huge structures and support multi-core/parallel computation. The new design aims at supporting these and at the same time free the user from tricky considerations. We want to provide a library that can be used in production, with proper error handling and stable up to any scale. The drawback, there is a bit of overhead we have to live with.

Lets explain this with one example: Arrays. Arrays in LIBCDS2 are one of the few structures that are not completely static. Users can change values, as long as the maximum value does not exceed the original maximum (actually, that the new value does not take more bits than the width of each field). The array optimizes under the hood, if your width is 16, it uses an array of uint16_t underneath. In order to do so, but not bother the user with knowing whether the specific width they want can be optimized, the Array::Create() method instantiates a subclass depending on the width. This adds one indirection when calling the set/access operations.

Two more examples regarding the performance hit LIBCDS2 is taking are the reference counters and the default storage of integers and sequences.

All structures in LIBCDS2 use reference counters. This has proven to be really effective when building space-efficient data structures, since they allow you to re-use substructures (and they are static, so it’s the perfect scenario). The user creates a structure with new (C++), and then just calls the method Use() to tell the library that the object is being used. When the user is done with the object, instead of deleting it, she/he calls the Unuse() method. If no one else is using the object, it is automatically freed. The problem: in multithreading, the Use and Unuse methods have to acquire a lock on the object. This slows down the Use/Unuse operations, and also adds the space required to keep a mutex per class (unless we used a global lock, which is an option we are considering, since Use/Unuse may not be that frequent).

Finally, the storage of integers and sequences. LIBCDS2 works almost exclusively with Array objects. These are slower than standard arrays, and the samples within structures are done with Array objects. Why did we do this? Well, the decision was made thinking that we don’t want to waste 64-bit samples for small structures, but we also want the library to escalate to huge (order of TBs) structures. For this, we need to represent the samplings using as many bits as needed (or close to that). Array offers that, and it also encapsulates a possible source of problems into a single class when escalating, making it much easier to maintain.

We still have many more decisions to make, and the library is slowly taking shape. Any comments, suggestions are welcome. If you want to contribute, please let us know, the project is hosted on github.

(This will be cross-posted on the libcds2 web page.)

Programming in Go

I’ve been programming in Go for a couple of months now (as hinted by my previous post). I thought it would be interesting to talk about my experience so far. The best way you can make your own opinion about the language is by trying it.

As a summary, I love Go! I know there are some drawbacks at the moment with respect to C++, yet the benefits I’ve gained by using Go overcome these. Lets start with the benefits:

  • Concurrent programming: The goroutines are awesome, you can just launch a lightweight thread by adding “go” in front of a function call. This is so simple that writing concurrent software is simplified amazingly. As I said in a previous post, Go does not provide something new in this sense. Erlang has a similar feature using “spawn”.

  • The memory model: You just don’t have to worry about whether stuff is allocated in the stack or the heap. It also has a garbage collector, so you don’t have to worry about freeing the resources when you don’t need them anymore. This simplifies your life quite a bit, yet it has some drawbacks. For example, if you keep pointing to something you don’t need anymore, then the garbage collector does not claim that memory (the gc can guess for you of course). Yet you can think of that as forgetting to free the space anyways :p.

  • Multiple return values: A function in Go can return more than one value. Here is a quick example:

    func sqrt(i int) (int, error) {
      if i < 0 {
        return 0, errors.New("Can't compute the square root of a negative number")
      }
      return int(math.Sqrt(float64(i))), nil
    }

    The function just return (-1, some error) when the argument i is negative, and (s, nil) when i is non-negative. The value s is the square root of i. What a poor choice for the variable name! :D

  • Errors: The way you handle errors in Go is very C-like, with two differences though. First, you have to be explicit when ignoring an error. Consider the previous function, we could call it the following ways:

    s, err := sqrt(i)
    if err != nil {
      fmt.Printf("There was an error")
    }

    or

    s, _ := srqt(i)

    But we can’t do this:

    s, err := sqrt(i)

    and then ignore err. Go complains about every single thing you don’t use. So if you don’t do anything with err, it will give you a compile-time error. This is pretty amazing when you read other’s people code. You have all the information right there, it is clear what could go wrong and how that is being handled.

  • Channels: channels are part of the language and allow you to communicate between different goroutines. Their usage is amazingly simple and at the same time powerful. Here is a small example:

    func genString(c chan string) {
      s := generateStringExpensive() // generate string
      c <- s
    }
    
    func main() {
      c := make(channel string)
      go genString() // compute the string in a goroutine
      var s string
      select { // we wait for the goroutine to send the result
        case s = <- c: fmt.Println(s)
      }
    }
  • Interfaces: They remind me of Haskell classes. It’s a very simple idea, you define a set of methods and any structure implementing those methods satisfies the interface. You don’t have to be explicit about it.

  • Maps: Yes, almost all languages have them. Still I like the defaults and so far, their performance has felt pretty good.

  • The tools: The Go compiler comes with a lot of tools. The syntax is so strictly defined that there is a tool that formats the code for you. There are no styles, just one way of writing it (or at least according to this program). The build, install and get commands are great! I like the fact that installing packages from github, googlecode and similar sites is so simple, just a couple of commands (usually just one) and you are ready to use the library you downloaded.

  • The library: The library that distributes with Go has a great design. Every time I have to use something from there it’s amazing how quickly you can get it up and running. One great example is the sort package, just take a look.

There are many defaults that make sense in go. You have for example types int, int32, and int64. For storing them you read and write integers of a given size, and when you read you can detect if there is a problem. At least this is a recurrent problem I had in C++, that I would write long to a file in a 64bits machine and then read it in a 32bits machine. Usually I wasn’t careful enough and ended up having problems when reading in the other machine.

Now lets move to the things I don’t like that much, or I miss.

  • Generics: I’m missing some kind of generics. Or a better type inference. There is a very basic type inference at this time. I have no clue if this is going to change, but I would love to see a type inference that is more like the one Haskell offers. Anyways, interfaces solve many of the problems one has without generics, so it’s not too bad.

  • Memory limitation: Currently there is a memory limitation, your process can’t use more than 16GB. It is possible to fix this (unlike in Java, where you can’t allocate an array with more than 4G-elements). For a quick fix take a look at my previous post.

  • Sometimes it feels a bit too strict: I know, the language is designed to enforce that all variables are used, all imports are required, etc. Yet sometimes you end up adding a removing the same import many times. A classical example of this is when you are checking your code, at least I use fmt and log intermittently, and I have to go up and comment them when I don’t use them. Similar things happen because there is no coercion between int and int32 for example. This avoids bugs, but sometimes (when you know it won’t fail) … it kind of bothers. Just to clarify, given a comment in twitter from Alex Bowe, in general I like these features, it’s just that in some simple cases they kind of feel an overkill (but I can’t think of a good solution to overcome those simple cases, so it’s just a warning that you may have to write or delete more than usual at some points :p)

Anyways, even though there are some things I don’t like that much (at least in certain cases), I’m really happy with Go. If you feel like learning it after reading this post, go here.

Increasing the memory limit in Go

Yeah, it’s been a while, but instead of talking about why I haven’t written anything, I’ll just jump into the good stuff.

I’ve been developing in Go lately. If you haven’t tried it yet, I totally recommend it. Honestly, when you first look at the language, you tend to think that there is nothing new in there. That may be true. Many of the features are not unique, and you can find them in other languages, yet the combination is unique. After one week of using Go I didn’t want to move back to C++ (what I usually use). There are a couple of things I miss from time to time, but the gain is so big that I don’t mind.

The problem. I’ve had a couple of problems dealing with the kind of software I usually implement. Today I found one that did worry me a lot, I can’t use more than 16GB of RAM! After that, my code will just crash with an “out of memory” error. In general this is not too bad, but for the particular algorithm I’m running, I need more than 16GB. Luckily, I found out that the limitation is temporal and that you can lift it up by changing just a couple of lines in Go’s source code. So here it goes.

I will assume you installed Go in $HOME/bin/go (if not, don’t worry, we are going to re-install it, right there).

The first step is to get the source for Go:

$ wget http://go.googlecode.com/files/go1.0.2.src.tar.gz

Once that finishes, decompress the package:

$ tar xvfz go1.0.2.src.tar.gz

Now, using your favourite editor, change the following files as explained here:

  • In file go/src/pkg/runtime/malloc.h
    Change line 119, where it says “MHeapMap_Bits = 22,” for “MHeapMap_Bits = 25,”
  • In file go/src/pkg/runtime/malloc.goc
    Change line 309, where it says “arena_size = 16LL<<30;” for “arena_size = 128LL<<30;”

And that’s it, now lets install this.

$ cd go/src
$ ./all.bash
$ cd ../..
$ mkdir -p $HOME/bin/go
$ mv go $HOME/bin/

Now we are ready. Remember to set the environment variable in your .bashrc:

export GOROOT=$HOME/bin/go
export PATH=$PATH:$GOROOT/bin

You are ready to enjoy 128GB. If you want 64GB, change the 25 for 24, and the 128 by 64.

This is supposed to be experimental, so please don’t blame me if things crash. There has to be a reason for that limitation to be there. In my case, I was able to get away with this and run my code. In this particular case is the (expensive) construction of a compressed text index, so the thing runs, outputs, and that’s it. Afterwards I process the output with other binaries. I’m just happy if the program manages to go through the construction, without much care on the resources I’m using at the time.

Land of LISP and Lazy Evaluation

One of my recent read was “Land of LISP” by Conrad Barski. It’s an unconventional programming book, packed with comics. Most examples are tiny games you can code in a couple of pages. I personally liked the book a lot, it is fun to read and presents a language that, in my opinion, is also fun.

My experience with LISP is quite limited, so most of the things I found in the book were new to me (I knew how to define functions and basic stuff, but nothing about macros, only a couple of built-in functions, etc.). One of the things I liked the most was one of the examples for macros where the author presents a simple solution to get lazy evaluation. In the book, the author codes a game with some weird rules, and I don’t think I would learn much by just copying that example, therefore, I will use the same idea here but with our old friend tic-tac-toe. I have to warn you that the implementation I’m going to post is not a good tic-tac-toe implementation, you can probably find more challenging opponents. The main goal of this exercise is to illustrate the lazy evaluation. Having said that, lets get started. Continue reading ‘Land of LISP and Lazy Evaluation’ »

Finding the most significant bit

In one of the problems I’ve been working on with Diego Seco and Patrick Nicholson, we needed the most-significant-bit (msb) function as a primitive for our solution. As Diego pointed out today, this function was one of the bottlenecks of our structure, consuming a considerable amount of time.

In this post I’ll go through the solutions we tested in our data structure. Continue reading ‘Finding the most significant bit’ »

Seven Languages in Seven Weeks

I read this book, by Bruce Tate, some weeks ago and totally recommend it. You can buy it from amazon or the pragmatic bookshelf.

What’s in it?

From the title of the book is easy to get an idea. This book explores seven different programming languages, the idea is that you spend one week using each programming language and get an idea of what’s out there to offer alternatives to the standards we are used to. By the standard I mean what I consider the standard, based on my experience most people program in Python, PHP, C++, C# or Java.

The languages covered areRuby, Io, Prolog, Scala, Erlang, Clojure, and Haskell.

The book has one chapter per language (plus intro and wrap-up chapters). Each language is presented in three parts, called days. Usually the first one shows the very basics, like input/output or math operations. The second day is usually used to present something that is different for this language, and the third day to present a harder example where the language shows to be useful and superior to others. For example, in Prolog, you can solve Sudoku puzzles by day 3 with little effort.

Why do I recommend it?

I liked several things about this book.

  1. The third day shows you an example where the language actually helps you. This is in contrast to what my programming languages course was,  where we went through 4 different languages. In that course we implemented pretty much the same things in each language. Maybe someone would disagree with my following statement, but in my opinion, implementing QuickSort in Prolog just makes you think you are wasting your time. You know how to implement it in C, and it works faster, so what’s the point? In this book you don’t do that kind of things. As I mentioned before, one of the examples for Prolog is solving a Sudoku puzzle. This shows you something great about the language. You can do this with almost no effort. If you try implementing it in another language, say Java, it is certainly going to take more effort.
  2. The presentation is clear. It is easy to follow and you don’t feel lost at some point looking at code you don’t understand. For instance, in the basics, you usually go through basic math operations which are really similar to at least one language you know, yet the author takes the time to go through them. The explanations are of the right length to not be boring either, which is also a great plus. Another point regarding the basics, the author goes into things like typing with examples of that part, which also makes it much more interesting.
  3. You are likely to find lots of things you didn’t know. In my case, I haven’t been exposed to that many programming languages, most of them imperative and object oriented. The only language I knew besides that was Scheme and some Common Lisp. I had a great time reading about these other languages that are somehow different and offer powerful constructs that make it easy to do things I know to be hard. For example, the future objects in Io, or the way you can build a service monitor in Erlang.
  4. The author is not selling you the languages. Well, maybe some of them more than others. But the important point is that I didn’t find the book to be too biased. In fact, for every language the author presents it and finishes listing advantages and disadvantages. This is a great thing, not only you get information about when to use a language, but also when not to.
  5. You get to hear from the people that invented the languages (or important users of it). In every chapter you get to see an interview with someone that can be considered as a worthy representative of the language. I found this really interesting, and adds external opinions to the book, which I think, increases its value.

It is important to make clear that you will not learn the seven languages by reading this book. I would not recommend it to someone trying to learn how to program or trying to learn one of those seven languages.

If you know how to program, and enjoy it, you should consider buying it.