In one of the problems I’ve been working on with Diego Seco and Patrick Nicholson, we needed the most-significant-bit (msb) function as a primitive for our solution. As Diego pointed out today, this function was one of the bottlenecks of our structure, consuming a considerable amount of time.

In this post I’ll go through the solutions we tested in our data structure.

## Iterative Solution

This function is implemented in LIBCDS. It is used mainly during the construction phase of the structures in the library, which is something that hasn’t been optimised yet. Furthermore, during queries this function is usually called a constant number of times per operation, so its cost is negligible compared to everything else going on. The implementation is the following:

```c
char msb(unsigned int v) {
    char ret = 0;
    while (v) {
        v /= 2;
        ret++;
    }
    return ret;
}
```

This basically halves \(v\) until it reaches zero, counting the iterations along the way. It is not hard to convince yourself that the cost of msb is \(O(\lg v)\), which is unacceptable if this function becomes a primitive within another function that executes many times, as is our case, and there is a known constant-time solution.

In order to measure the performance, I wrote a small program that computes msb for random numbers generated with the rand() function. This code, shown below, is used to measure the other solutions as well. In all cases, the parameter rep is set to 2,000,000,000.

```cpp
int main(int argc, char **argv) {
    if (argc != 2) {
        cout << "usage: " << argv[0] << " REP" << endl;
        return 0;
    }

    unsigned int rep = atoi(argv[1]);
    unsigned int acc = 0;
    srand(rep);

    for (unsigned int i = 0; i < rep; i++) {
        unsigned int v = rand();
        acc += msb(v);
    }

    cout << "acc=" << acc << endl;
    return 0;
}
```

The \(O(\lg v)\)-time algorithm takes 1 minute and 15 seconds on my machine (Intel Core i7 1.7GHz).

## Using Assembler

I know that many would jump with something like “Nooo, don’t use that!”. The truth is that in this particular case the inline assembler code is as easy to read as the alternatives. In fact, many will probably think that it’s the simplest solution. The function I include next was extracted from the book “Advanced Linux Programming”, and as you can see from the code, the processor implements this operation for us, so problem solved :-).

```c
char msb(unsigned int v) {
    unsigned int ret = 0;
    if (v == 0) return 0;
    asm("bsrl %1, %0" : "=r" (ret) : "r" (v));
    return (char)(ret + 1);
}
```

Executing the same queries as in the previous section takes 22 seconds.

## Using Tables

This is the second approach I tried for improving over the iterative solution. The intuition is that we can store tables of pre-computed answers and use them to speed up the computation.

The idea is to create a table \(T\) where we enumerate all possible bitmaps of length \(b\), and for each one of them, we store the position of their most significant bit. Note that we need a special case to handle \(0\)s. The table takes \(2^b\lg b\) bits of space.

If we want to compute the msb of a value \(v\) represented in a word of size \(w\), which for simplicity we assume to be \(w=kb\), we iterate over the chunks of size \(b\) from left to right. As soon as we find a non-zero block \(B\) of size \(b\), say at position \(i\) counting from right to left, we return \(T[B] + (i-1)\times b\), which is the result of msb. For example, with \(w=32\), \(b=16\) and \(v=\texttt{0x00012345}\), the leftmost chunk \(\texttt{0x0001}\) is non-zero and sits at position \(i=2\), so the answer is \(T[\texttt{0x0001}] + (2-1)\times 16 = 1 + 16 = 17\).

Assuming \(w=\Omega(\lg n)\): If we set \(b=\lg n/c\), we have that \(T\) takes only \(\sqrt[c]{n}\lg\lg n = o(n)\) bits and supports msb in \(O(c)\) time. Two simple options for a practical implementation are either setting \(b\) to \(16\) or \(8\). The code for \(b=16\) is the following:

```c
#define msb16_mask ((1<<16)-1)

char msb(unsigned int v) {
    char ret = __msb_tb16[(v>>16)&msb16_mask];
    if (ret == 0) {
        ret = __msb_tb16[v&msb16_mask];
        return ret;
    }
    return ret + 16;
}
```

For completeness, I include the code for \(b=8\); it follows the same idea as the \(b=16\) version but is harder to read.

```c
#define msb8_mask ((1<<8)-1)

char msb(unsigned int v) {
    char ret = __msb_tb8[(v>>24)&msb8_mask];
    if (ret == 0) {
        ret = __msb_tb8[(v>>16)&msb8_mask];
        if (ret == 0) {
            ret = __msb_tb8[(v>>8)&msb8_mask];
            if (ret == 0) {
                ret = __msb_tb8[v&msb8_mask];
                return ret;
            }
            return ret + 8;
        }
        return ret + 16;
    }
    return ret + 24;
}
```

Both table-based versions take about 22.5 seconds, quite close to the assembler version: roughly 0.5 seconds more. The disadvantage is that the table-based version uses extra space, while the assembler version does not.

We do have to consider the possibility of facing an architecture where msb is not supported in hardware, though, and in that case the table-based option is clearly better than the iterative one. In fact, when \(b=8\), \(T\) requires only \(256\) bytes, which is negligible for almost any practical application.

The only thing I haven’t explained yet is how to build the tables __msb_tb16 and __msb_tb8. For this I implemented a short Ruby script that generates them as hard-coded arrays:

```ruby
#!/usr/bin/ruby

def print_table(k)
  n = 2**k
  print "char __msb_tb#{k}[#{n}] = {"
  (0...n).each {|v|
    msb = 1
    while msb <= k
      if v[k-msb] == 1
        break
      end
      msb += 1
    end
    msb = k - msb + 1
    if v % 20 == 0
      print ",\n\t" unless v == 0
    elsif v != 0
      print ", "
    end
    print "#{msb}"
  }
  print "};\n\n\n"
end

ARGV.each {|b|
  print_table(Integer(b))
}
```

## Conclusion

It is pretty clear which option one would use if the msb function becomes critical in a system. In our case we are using the assembler version, and the results for our structure have improved significantly.

We have to take into account that the measurement shown here favours the table-based implementation, since the benchmark puts little pressure on the cache and the lookup tables stay resident. In our structure the difference is noticeable, and the assembler version outperforms the table-based approach by a wider margin.

You can also exploit assumptions about the input (e.g., a bias towards small numbers) and, using the __builtin_expect macro in gcc, optimise the ifs to favour the most likely jumps. In general this does not help the table-based version, since it is not clear which path the execution will take.

The assembler version can be optimised using __builtin_expect, since we don’t expect \(v\) to be \(0\) very often (we get a \(1\%\) improvement in our structure by including this).

If you are interested in the technique of using pre-computed tables, you can find many examples in the literature on succinct data structures, in particular, structures for representing succinct trees and bitmaps.

## Addendum: Other Approaches (from the Comments)

There are many other possibilities. msb is essentially the integer \(\log_2\) function, so you could simply use the standard, portable C99 function ilogb, that is, msb(v) == ilogb(v)+1, except that this fails in the special case v == 0.

Another solution, relying on the IEEE floating-point standard, is to cast the value to double (fast on modern computers) and then extract the exponent with some bit magic:

```c
int msb(unsigned v) {
    union { double foo; unsigned long long x; } bar;
    bar.foo = (double)(v) + 0.5;
    return (bar.x >> 52) - 1023 + 1;
}
```

This is fast provided that (as is almost always the case) double is in IEEE 754 format and long long is a 64-bit integer type (but you can use e.g. a char array instead).