Avoiding branches/if/conditionals
Why? well because misspredictable branches are very slow on modern CPUs, doing 10 or more simple arithemetic operations can be faster, but also keep in mind that:
- compilers sometimes compile conditional code to very efficient unconditional code, while they might not recognize your branchless code es equivalent
- if the branch is well predictable it will probably be faster then the branchless code
Building blocks of branchless code
Turning comparissions into masks or 0/1
a<0 | a>>31 |
a!=0 | (a|-a)>>31 |
a<b | (a-b)>>31 |
a>b | (b-a)>>31 |
a≤b | (a-b-1)>>31 |
a≥b | (b-a-1)>>31 |
Note, you might want to replace 31 by the number of bits in whatever type you use and cast to a signed type if you want a bit mask (-1/0) or unsigned type if you want (1/0)
allso keep in mind that (a<b) and such will give you (1/0) too
converting bitmasks to 0/1
well, thats trivial, there are many ways, some even flip the 0/1 in the process
mask&1, -mask, mask+1
converting 0/1 to masks
-x, x-1
flipping the 0/1 value
x^1, 1-x
flipping the mask value
x^(-1), ~x
switch between 2 variables depening upon a mask
b+((a-b)&mask), b^((a^b)&mask)
convert mask to 0/x
mask&x
convert mask to -1/x
mask|x
Note: usefull for (-1/0) -> (-1/1)
add/sub x depening upon a mask
a += x&mask
Note, dont forget a+=mask / a-=mask for x=1/-1
negate depening upon a mask
a = (a^mask) – mask
minimum of 2 variables
b+((a-b)&((a-b)>>31))
maximum of 2 variables
a-((a-b)&((a-b)>>31))
shift by s depending upon a mask
a>>= s&mask, a< <=s&mask, a += ((a>>s)-a)&mask
Note, that shifting by a constant might be faster
exchanging 2 values depending upon a mask
a^=b
b^=a&mask
a^=b
Note, the mask doesnt need to be 0/-1 it could have just some bits set
logical operations between masks
well, that really should be obvious AND a&b, OR a|b, XOR a^b, NOT ~a
I remember seeing a bunch of stuff like this in the assembly code that TuxNES generates. (Hi Mike)
It also had this optimization:
For unsigned values, (x>a)&&(x
Comment by None — 2006-01-30 @ 15:28
err.. (x>a)&&(x<b) is equivalent to (x-a)<(b-a)
this time with <… :)
Comment by None — 2006-01-30 @ 15:35
yes, and for a=0 this becomes x<b
and big surprisse it works for signed values too, for example:
(x>-10)&&(x<10) == (unsigened)(x+10)<20
PS: mike (melanson) != michael (niedermayer)
Comment by Michael — 2006-01-30 @ 17:09
Yes, but most of this stuff breaks if you’re not careful about signedness and range. For example
(0x80000000<0x7FFFFFFF)==1 but (0x80000000-0x7FFFFFFF)>>31==0 !
This can be a source of bugs if you’re not careful.
Note that if you compile on x86, some of these optimizations don’t make much difference. For example, the compiler should emit a<b as a setl instruction, which isn’t a conditional branch and doesn’t stall the pipeline. So you need to look at the instructions that actually get generated…
Comment by None — 2006-01-31 @ 04:27
You’re a better hacker than I, Mr. None– I just branched the Nestra source into TuxNES. I never claimed to understand how the dynamic recompiler worked! :)
Comment by Multimedia Mike — 2006-01-31 @ 08:04