Friday, March 9, 2012

Counting the number of "1" bits

Another "ridiculous" interview I had was when an interviewer asked me how to count the number of set bits in a 32-bit or 8-bit. I told him that there are various algorithms to do this, but the easiest way but yet good enough in performance is a function like this:

#include <stdint.h>

uint32_t count_bits(uint32_t b)
{
    uint32_t n = 0;
    size_t i;

    /* Test each of the 32 bits: add the lowest bit to the
       count, then shift the input right by one. */
    for (i = 0; i < sizeof(b) * 8; i++)
    {
        n += (b & 1);
        b >>= 1;
    }
    return n;
}


I also told him that there are many other ways to do this, and referred him to "Beautiful Code" for some of the beautiful algorithms.
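One classic alternative that shows up in such collections (I didn't write it out in the interview, so this is my own sketch of it) is Kernighan's trick: clearing the lowest set bit on every pass, so the loop runs once per set bit instead of once per bit position:

uint32_t count_bits_kernighan(uint32_t v)
{
    uint32_t n = 0;

    while (v)
    {
        v &= v - 1;  /* clears the lowest set bit */
        n++;
    }
    return n;
}

For a sparse word this finishes in just a few iterations; in the worst case it is still 32.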

But then the interviewer argued. First, he said the loop should run up to "sizeof(b)*8" inclusive (meaning, instead of "<", I should have used "<="). Too bad, I said yes to him without thinking further (probably I got nervous; later at home I checked, and he was wrong — "<" is correct, since the loop has to execute exactly 32 times, for i = 0 through 31). Second, he said that the fastest way is to use a look-up table (n[0]=0, n[1]=1, n[2]=1, ..., n[255]=8, ...).

Although I agreed with that argument in principle, he was being inconsistent with his original question, and his approach has a big downside. I said that it wouldn't be an option in embedded programming, where memory is a limited resource. Furthermore, his original question was to count the ones in a 32-bit number; indexing a table directly with a 32-bit value means 2^32 (4,294,967,296, or about 4 GBytes' worth of) entries set manually in an array. That's ridiculous. Who wants to waste that much space just to count the bits in an integer? For a byte the table seems acceptable ("only" 256 entries), but compare that to what the algorithm above compiles to:


        movl $32,  %edx         # loop counter: 32 bits to test
        xorl %eax, %eax         # n = 0
.L2:
        movl %ecx, %ebx         # copy b (kept in %ecx)
        shrl %ecx               # b >>= 1
        andl $1,   %ebx         # isolate the lowest bit
        addl %ebx, %eax         # n += bit
        decl %edx
        jne  .L2

It's only 8 x86 instructions (roughly 20 bytes of code) with no memory access in the loop! Yes, it is O(32) for a 32-bit word, but that's a fixed, predictable cost. Better still, if the function is called from only one place, an optimizing compiler can inline the whole thing, so we also save the CPU time spent on the call and on pushing and popping a stack frame. That's another tiny speedup. Table lookup is probably fine (yes, it's the fastest, O(1)), but it is practical only for 8 bits; at 16 bits or more, the space-time trade-off looks too big to ignore. Besides, the algorithm above can easily be wrapped and extended polymorphically in object-oriented code.
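For reference, here is a minimal sketch of the byte-table approach the interviewer described (the names are mine, and the table is filled by a loop rather than "set manually"):

#include <stdint.h>

/* bit_counts[i] holds the number of set bits in the byte value i. */
static uint8_t bit_counts[256];

static void init_bit_counts(void)
{
    int i;

    /* i has exactly one more set bit than i/2 when i is odd. */
    for (i = 1; i < 256; i++)
        bit_counts[i] = bit_counts[i / 2] + (i & 1);
}

/* O(1), but it only answers the question for a single byte. */
static uint32_t count_bits_byte(uint8_t b)
{
    return bit_counts[b];
}

Even this modest version costs 256 bytes of storage, which is noticeable on a small microcontroller.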

When I searched Google, I found the following (on Stanford's "Bit Twiddling Hacks" page):


v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count


This compiles down to instructions like the following:

        shrl  %edx
        andl  $1431655765, %edx
        subl  %edx, %eax
        movl  %eax, %edx
        shrl  $2, %eax
        andl  $858993459, %edx
        andl  $858993459, %eax
        addl  %edx, %eax
        movl  %eax, %edx
        shrl  $4, %edx
        leal  (%edx,%eax), %eax
        andl  $252645135, %eax
        imull $16843009, %eax, %eax
        shrl  $24, %eax

That's 14 instructions, with no loop and no table. This seems to be the fastest and most compact option of all. It took me some time to understand it.
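To make sense of it, here is the same trick written out as a complete function, with my own step-by-step comments added to the Stanford snippet:

#include <stdint.h>

/* SWAR ("SIMD within a register") popcount: counts bits in parallel
   inside 2-, 4-, and 8-bit groups, then sums the four byte counts. */
uint32_t count_bits_swar(uint32_t v)
{
    /* Each 2-bit pair holds the value 2a+b; subtracting the shifted,
       masked copy (a) leaves a+b, the pair's bit count (0..2). */
    v = v - ((v >> 1) & 0x55555555);

    /* Add adjacent pairs: each 4-bit group now holds its count (0..4). */
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);

    /* Add adjacent nibbles and mask: each byte holds its count (0..8).
       Multiplying by 0x01010101 sums all four bytes into the top byte,
       and the final shift extracts it. */
    return (((v + (v >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}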

It seems the interviewer didn't have enough experience in embedded or microcontroller programming, or didn't think much about optimization. Nevertheless, I failed an interview again. I am starting to think that many interview questions are really not meant to probe a candidate's ability to think, but rather to get instant answers that match what the interviewer already has in mind. I just hope big software companies like IBM, Microsoft, Google, or Apple don't fall into this kind of trap, and instead ask logical questions to find out whether a candidate can think systematically.

What a waste!
