Monday, October 13, 2014

Left shift by negative number used to test unique characters in a string

What is the outcome of this?

val=-32;
mask = 1<<val;

I tried to execute it on Python, it failed with error message.

But if I do it in C or C++, I get something else.  Turn out, this depends o the compiler and CPU we use.  As I use Pentium4 and GCC to compile this, the following snippet of the generated assembly language is interesting:


.L4:
        movsbl  (%edx), %ecx
        movl    %esi, %edi
        addl    $1, %edx
        subl    $97, %ecx
        sall    %cl, %edi
        testl   %eax, %edi
        jg      .L5
        orl     %edi, %eax

.L2:

What is "sall" above? It is Intel CPU's mnemonic for "Shift Arithmetic Left for Long Integer".

In the machine code, -32 is transformed to Second-complement of 32:

32        = 0b00100000
~32      = 0b11011111
~32+1 = 0b11100000 = -32

According Wikipedia,  arithmetic shift left maintains the MSBit, so in this case MSB is always 1. Also, Intel CPUs only shifts the first 5-bits of the shifting bits (in this case, 5-bit LSB of 0b11100000 which is 0), so 1<<-32 is actuall 1<<0 = 1.

What about 1<<-33?

33        = 0b00100001
~33      = 0b11011110
~32+1 = 0b11100001 = -33

The first five bits is 1, so 1<<-33 = 1<<1 = 2

What about 1<<-31?

31        = 0b00011111
~31      = 0b11100000
~31+1 = 0b11100001 = -31

The first five bits is 1, so 1<<-31 = 1<<1 = 0x00000010

What about 1<<-7?

7        = 0b00000111
~7      = 0b11111000
~7+1 = 0b11111001 = -7

The first five bits is 1, so 1<<-7 = 1<<25 = 0x02000000

Now, how do we use this to check a uniqueness of characters in a string? 
Alphabet 'A' to 'Z' has 26 characters. Alphabet 'a' to 'z' has also 26 characters.

The difference between the character i in a string is: str[i] - 'a'.  Now this is the interesting part: if we declare a 32-bit variable (which is enough to cover 'A' to 'Z') where each bit corresponds to the occurence of each character in alphabet, we can determine if the character has been used before and tell the string has unique characters or not (e.g, bit 1 to tell if character 'A' has been used or not, etc.)

'A' - 'a' = -32
...
...
'Z' - 'a' = -7
...
...
'a' - 'a' = 0
...
...
'Z' - 'a' = 25

if we shift 1 to -32, it is the same as 1<<0.  If we shift 1 to -33 it is the same as 1<<1, so on.

1<<-32 = 0x01
1<<-31 = 0x02
...
...
1<<-7 = 1<<25 = 0x2000000
1<<0 = 0x01
1<<1 = 0x02
...
...
1<<25 = 0x2000000

so:
 1<<('A'-'a') == 1<<('a'-'a') = 0x01
 1<<('B'-'a') == 1<<('b'-'a') = 0x02
...
...
1<<('Z'-'a') == 1<<('z'-'a') = 0x2000000

So the code is:

for (i=0; i<strlen(s); i++)
{
    val = s[i] - 'a';
    if (flags & (1 << val))
        return 0; /* the string is not unique */
    flags |= 1 << val;
}
return 1;


Notice this algorithm might not work on different machine with different compiler (as shifting by negative number is undefined in standard C).





No comments:

Post a Comment