Lair Of The Multimedia Guru

2006-01-28

Avoiding branches/if/conditionals

Why? well because misspredictable branches are very slow on modern CPUs, doing 10 or more simple arithemetic operations can be faster, but also keep in mind that:

  • compilers sometimes compile conditional code to very efficient unconditional code, while they might not recognize your branchless code es equivalent
  • if the branch is well predictable it will probably be faster then the branchless code

Building blocks of branchless code

Turning comparissions into masks or 0/1
a<0 a>>31
a!=0 (a|-a)>>31
a<b (a-b)>>31
a>b (b-a)>>31
a≤b (a-b-1)>>31
a≥b (b-a-1)>>31

Note, you might want to replace 31 by the number of bits in whatever type you use and cast to a signed type if you want a bit mask (-1/0) or unsigned type if you want (1/0)
allso keep in mind that (a<b) and such will give you (1/0) too

converting bitmasks to 0/1

well, thats trivial, there are many ways, some even flip the 0/1 in the process
mask&1, -mask, mask+1

converting 0/1 to masks

-x, x-1

flipping the 0/1 value

x^1, 1-x

flipping the mask value

x^(-1), ~x

switch between 2 variables depening upon a mask

b+((a-b)&mask), b^((a^b)&mask)

convert mask to 0/x

mask&x

convert mask to -1/x

mask|x
Note: usefull for (-1/0) -> (-1/1)

add/sub x depening upon a mask

a += x&mask
Note, dont forget a+=mask / a-=mask for x=1/-1

negate depening upon a mask

a = (a^mask) – mask

minimum of 2 variables

b+((a-b)&((a-b)>>31))

maximum of 2 variables

a-((a-b)&((a-b)>>31))

shift by s depending upon a mask

a>>= s&mask, a< <=s&mask, a += ((a>>s)-a)&mask
Note, that shifting by a constant might be faster

exchanging 2 values depending upon a mask

a^=b
b^=a&mask
a^=b
Note, the mask doesnt need to be 0/-1 it could have just some bits set

logical operations between masks

well, that really should be obvious AND a&b, OR a|b, XOR a^b, NOT ~a

Filed under: Optimization — Michael @ 02:52

5 Comments »

  1. I remember seeing a bunch of stuff like this in the assembly code that TuxNES generates. (Hi Mike)

    It also had this optimization:

    For unsigned values, (x>a)&&(x

    Comment by None — 2006-01-30 @ 15:28

  2. err.. (x>a)&&(x<b) is equivalent to (x-a)<(b-a)

    this time with &lt;… :)

    Comment by None — 2006-01-30 @ 15:35

  3. yes, and for a=0 this becomes x<b
    and big surprisse it works for signed values too, for example:
    (x>-10)&&(x<10) == (unsigened)(x+10)<20

    PS: mike (melanson) != michael (niedermayer)

    Comment by Michael — 2006-01-30 @ 17:09

  4. Yes, but most of this stuff breaks if you’re not careful about signedness and range. For example
    (0x80000000<0x7FFFFFFF)==1 but (0x80000000-0x7FFFFFFF)>>31==0 !

    This can be a source of bugs if you’re not careful.

    Note that if you compile on x86, some of these optimizations don’t make much difference. For example, the compiler should emit a<b as a setl instruction, which isn’t a conditional branch and doesn’t stall the pipeline. So you need to look at the instructions that actually get generated…

    Comment by None — 2006-01-31 @ 04:27

  5. You’re a better hacker than I, Mr. None– I just branched the Nestra source into TuxNES. I never claimed to understand how the dynamic recompiler worked! :)

    Comment by Multimedia Mike — 2006-01-31 @ 08:04

RSS feed for comments on this post. TrackBack URI

Leave a comment

Powered by WordPress