Lair Of The Multimedia Guru

2006-01-28

Avoiding branches/if/conditionals

Why? well because misspredictable branches are very slow on modern CPUs, doing 10 or more simple arithemetic operations can be faster, but also keep in mind that:

compilers sometimes compile conditional code to very efficient unconditional code, while they might not recognize your branchless code es equivalent
if the branch is well predictable it will probably be faster then the branchless code

Building blocks of branchless code

Turning comparissions into masks or 0/1

a<0	a>>31
a!=0	(a\|-a)>>31
a<b	(a-b)>>31
a>b	(b-a)>>31
a≤b	(a-b-1)>>31
a≥b	(b-a-1)>>31

Note, you might want to replace 31 by the number of bits in whatever type you use and cast to a signed type if you want a bit mask (-1/0) or unsigned type if you want (1/0)
allso keep in mind that (a<b) and such will give you (1/0) too

converting bitmasks to 0/1

well, thats trivial, there are many ways, some even flip the 0/1 in the process
mask&1, -mask, mask+1

converting 0/1 to masks

-x, x-1

flipping the 0/1 value

x^1, 1-x

flipping the mask value

x^(-1), ~x

switch between 2 variables depening upon a mask

b+((a-b)&mask), b^((a^b)&mask)

convert mask to 0/x

mask&x

convert mask to -1/x

mask|x
Note: usefull for (-1/0) -> (-1/1)

add/sub x depening upon a mask

a += x&mask
Note, dont forget a+=mask / a-=mask for x=1/-1

negate depening upon a mask

a = (a^mask) – mask

minimum of 2 variables

b+((a-b)&((a-b)>>31))

maximum of 2 variables

a-((a-b)&((a-b)>>31))

shift by s depending upon a mask

a>>= s&mask, a< <=s&mask, a += ((a>>s)-a)&mask
Note, that shifting by a constant might be faster

exchanging 2 values depending upon a mask

a^=b
b^=a&mask
a^=b
Note, the mask doesnt need to be 0/-1 it could have just some bits set

logical operations between masks

well, that really should be obvious AND a&b, OR a|b, XOR a^b, NOT ~a

Filed under: Optimization — Michael @ 02:52

Comments (5)

5 Comments »

I remember seeing a bunch of stuff like this in the assembly code that TuxNES generates. (Hi Mike)

It also had this optimization:

For unsigned values, (x>a)&&(x

Comment by None — 2006-01-30 @ 15:28
err.. (x>a)&&(x<b) is equivalent to (x-a)<(b-a)

this time with <… :)

Comment by None — 2006-01-30 @ 15:35
yes, and for a=0 this becomes x<b
and big surprisse it works for signed values too, for example:
(x>-10)&&(x<10) == (unsigened)(x+10)<20

PS: mike (melanson) != michael (niedermayer)

Comment by Michael — 2006-01-30 @ 17:09
Yes, but most of this stuff breaks if you’re not careful about signedness and range. For example
(0x80000000<0x7FFFFFFF)==1 but (0x80000000-0x7FFFFFFF)>>31==0 !

This can be a source of bugs if you’re not careful.

Note that if you compile on x86, some of these optimizations don’t make much difference. For example, the compiler should emit a<b as a setl instruction, which isn’t a conditional branch and doesn’t stall the pipeline. So you need to look at the instructions that actually get generated…

Comment by None — 2006-01-31 @ 04:27
You’re a better hacker than I, Mr. None– I just branched the Nestra source into TuxNES. I never claimed to understand how the dynamic recompiler worked! :)

Comment by Multimedia Mike — 2006-01-31 @ 08:04

RSS feed for comments on this post. TrackBack URI

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31