Building a polynomial out of its roots is not particularly hard, one just multiplies the corresponding linear/first degree polynomials. But this isnt fast, even doing it recursively and using FFT based multiplication isnt all that great its still O(nlog2n). We can under some circumstances get rid of one of these log(n) factors as ill describe in a moment below. But first i wonder if below is the best that can be done or if iam silly and theres a faster or simpler way?

The idea of this method is not to build up the polynomial coefficients but to build up a vector of polynomial evaluations at evenly spaced points. Doing this for a linear factor “prototype” like x-1 on m points costs us O(m). next we build a sparse vector that is 1 where its index matches a root (and 2 for double roots,…) otherwise its 0. Now we can almost build our evaluation of the final polynom by convolution, just that convolution adds its terms while we need the factors multiplied of course. The solution is simply to convert the evaluation of the “prototype” linear factor by elementwise log() before convolution and by exp() afterwards. The value for log(0) does not matter for us except for numerical stabilty, we have to after exp() reset all roots to 0 anyway. With log(0)=0 one gets first order derivatives at the roots though. Also one can implement this using clasic log/exp and a complex value fft or finite field log/exp with a real or finite field fft. The last step of turning the evaluation vector of our polynom into coefficients can be done with a finite field fft. This make the whole thing run in O(m logm) time for a field size of m.

Whats annoying on it is that the the first part works with samples evenly spaced (aka an additative subgroup) while the second, that is turning the evaluation into coefficients is on a multiplicative subgroup of a finite field. In practice that means while my roots are along a multiplicative subgroup of GF(216+1) i have to apply the rdft over the whole field. which is kinda feeling like a waste of cpu cycles

Suggestions to improve this are welcome. Also alternatively if one knows of a linear time method to zero pad in the frequency domain a block of size 2n to twice its size than the resursive multiplication variant should also run in O(n logn) time.

Everyone who wrote assembly code probably has cursed the lack of some instuctions in the instrcution set of their target architecture, be it due to the work needed to workaround it or the complexity and consequently speed loss of the resulting code

Heres a list of what i think is lacking in the x86 instruction set …

btw, i had to use shl and shr instead of 2 less than and 2 right than symbols as wordpress randomizes the html code if & lt ; or & gt ; occur in it (i dunno why, but that did work in the past)

cadd/csub (conditional add / sub)
While there are conditional move instructions there are no conditional add/subtract instructions even though code like if(...) x+=... is pretty common
While there are mmx and sse maximum and minimum instruction there are sadly no normal integer equivalents
negative index addressing
addressing stuff like base[constant+index] is possible base[constant-index] is not, not only would that be usefull on its own it would also allow to subtract a left shifted register from another like: lea eax, [eax-8*ebx]
For fixed point calculations the mul and imul instructions are very hard or slow to use as using the low 32bit result often would overflow and the high 32bit limits the coefficient to -0.5 … 0.5, a instruction which does (a*b + (1 shl 23)) shr 24 would solve this
sarr (shift arithmetic right with rounding)
Shift right is nice but often you need to do it with rounding, a (a + (0.5 shl s)) shr s would come in handy in these cases
a pseudo random number generator and a true random number generator would be nice
While there is pabs for mmx theres no equivalent for normal integers
A a&c | b&(~c) would come in handy for combining bits from 2 registers
a hardware based bitstream reader which takes a base pointer, index in bits and length of bits it should read would be very usefull for many audio and video formats, bitstream&vlc decoding generally takes 1/3 of the time spend decoding
(a&~b) and (a|~b) and maybe others
pavgb_nornd (packed byte average without rounding)
while theres a average with rounding theres none without sadly mpeg need the later too
integer instructions like sarr and select for mmx/sse too
pcmpneq / pcmplt / pcmple / pcmpge
Just the missing compare instructions for mmx
unpack instructions which unpack the source with an implicit 0, this would avoid 1 move per 2 unpack and unpacking is very common
packsrr (pack with shift right and rounding)
like pack* but with rounding and shift right integrated
A mmx/sse instruction similar to lea which does the extreemly common operation of adding a left shifted value to another, and if this isnt able to do a=2*a+-b then an additional instruction for that as it would reduce the very common a-b, a+b butterfly operation to 2 instructions instead of 3
having to unpack 2 source operands into 4 to perform 1 operation with them sucks (=needs 6-8 instructions) so combined foobar+unpack high/low would be pretty nice, of course only if its fast though
ps(r/l)lb (packed right/left shift unsigned byte)
the missing packed byte shifts
pdabs (packed absolute difference)
absolute value of the difference of 2 values
add a signed into a unsiged (byte) value with unsigned saturation, usefull for brightness correcture and loop/postprocessing filters

You want to put several values in a single int or uint64_t, and either you dont have a cpu which supports SIMD instructions like MMX or the instructions plain dont match the datatype you have (like 2 5:6:5 rgb values in an int)

shifting some arbitrary bits left by 1

a += a&mask
note1, this is very usefull for converting betweem rgb15 and rgb16
note2, yes the bit into which you shift should be 0

unsiged shift left & right

this is simply a normal shift and then masking out the bits which where shifted out of each component
a>>/<<= shift

sign extension


10001|10000|01111|01110    + 1110011100111001110
11111|11110|00001|00000    ^ 1110011100111001110
signed shift right (signed left shift == unsigned)

unsigned shift + sign extension

sum all elements

there are several ways,
if you know nothing will overflow then a simple multiply will do
(X,a+b,X,c+d) + ((X,a+b,X,c+d)>>8) = X,X,X,a+b+c+d
if overflows might happen / the width of the types is too small then
a= (x & 1111000011110000)>>4
b= x & 0000111100001111
and then either use one of the methods above or recursively continue with this


1. remove msb, and add the lsbs
2. xor the xored msb back in
m=1000100010001000 and l=0111011101110111
((x&l) + (y&l)) ^ ((x^y)&m)


1. remove msb, and sub the lsbs
2. xor the xored msb back in
m=1000100010001000 and l=0111011101110111
((x|m) – (y&l)) ^ ((x^y^m)&m)


m=1000100010001000 and l=0111011101110111
x= ~x
((x&l) + 0001000100010001) ^ (x&m)

average round down

(x&y) + (((x^y)&1110111011101110)>>1)

average round up

(x|y) – (((x^y)&1110111011101110)>>1)

test if any element is zero

(x – 0001000100010001) & (~x) & 1000100010001000
Note, this can trivially be used to test for equality by choosing x=a^b

create a bitmask based on zeroness

m=1000100010001000 and l=0111011101110111
(((((x&l) + l) | x) & m) >>3) + l ^ l
Note, this can trivially be used to test for equality by choosing x=a^b

You want to convert a signed n-bit integer into a m-bit interger, now if n and m are one of 8,16,32,64 then its easy, just cast to the respective type and let the compiler choose how to do it efficiently, but what if n or m are not one of these?

Bit Twiddling Hacks suggests:
C= 1 << (n – 1)
-(x & C) | x
which needs 3 operations if n is constant and 5 if not

can this be done with fewer instructions? yes
C= (-1)<<(n-1)
which needs 2 operations if n is constant and 4 if not

PS: the C= (-1)<<(n-1) could be replaced by a look up table …

Why? well because misspredictable branches are very slow on modern CPUs, doing 10 or more simple arithemetic operations can be faster, but also keep in mind that:

  • compilers sometimes compile conditional code to very efficient unconditional code, while they might not recognize your branchless code es equivalent
  • if the branch is well predictable it will probably be faster then the branchless code

Building blocks of branchless code

Turning comparissions into masks or 0/1
a<0 a>>31
a!=0 (a|-a)>>31
a<b (a-b)>>31
a>b (b-a)>>31
a≤b (a-b-1)>>31
a≥b (b-a-1)>>31

Note, you might want to replace 31 by the number of bits in whatever type you use and cast to a signed type if you want a bit mask (-1/0) or unsigned type if you want (1/0)
allso keep in mind that (a<b) and such will give you (1/0) too

converting bitmasks to 0/1

well, thats trivial, there are many ways, some even flip the 0/1 in the process
mask&1, -mask, mask+1

converting 0/1 to masks

-x, x-1

flipping the 0/1 value

x^1, 1-x

flipping the mask value

x^(-1), ~x

switch between 2 variables depening upon a mask

b+((a-b)&mask), b^((a^b)&mask)

convert mask to 0/x


convert mask to -1/x

Note: usefull for (-1/0) -> (-1/1)

add/sub x depening upon a mask

a += x&mask
Note, dont forget a+=mask / a-=mask for x=1/-1

negate depening upon a mask

a = (a^mask) – mask

minimum of 2 variables


maximum of 2 variables


shift by s depending upon a mask

a>>= s&mask, a< <=s&mask, a += ((a>>s)-a)&mask
Note, that shifting by a constant might be faster

exchanging 2 values depending upon a mask

Note, the mask doesnt need to be 0/-1 it could have just some bits set

logical operations between masks

well, that really should be obvious AND a&b, OR a|b, XOR a^b, NOT ~a

When working with integers, fixedpoint math or when converting floats to integers we need to round somehow, but which way is best …
In school they probably told you to round to nearest and round halfway cases away from zero +-x.0000 … +-x.4999 to +-x and +-x.5 … +-x.9999 to +-x+-1 but thats actually not optimal

First decission, round to nearest or not, rounding to nearest is pretty clear already, it means to well round to the nearest if there is a single nearest value and thats obviously more accurate then not doing so, not rounding to nearest has one big advantage, its often but not always (PAVGB MMX instruction for example) faster (x>>2 vs. (x+2)>>2)

Second decission, what to do with x.5, it seems to not matter much but actually it often does matter, the round toward +inf and round to -inf cases are easy to implement (x+2)>>2 and (x+1)>>2 for example but they add some ugly systematic bias which can be problematic if values are used in long calculations and get rounded many times
the solution, round to even (or odd) doesnt add such a bias and is thus better, alternatively you could also repeatedly flip the direction in which rounding is done, mpeg4 for example does that in the motion compensation code …
to round to even you could do something like (x + ((x>>1)&1))>>1, (x+1+((x>>2)&1))>>2
to round to odd you could do something like (x>>1)|(x&1), (x+2-((x>>2)&1))>>2

you might also be stuck with instructions which do something like (a*constant)>>16 to reduce the rounding error of these you could simple add (1<<15) / constant to a

You have a piece of code, its too slow so you want to optimize it, but to do that you first need to be able to meassure its speed, otherwise you wont know if a change improved speed or not

Benchmarking a piece of code sounds easy but it isnt, you can try to do it with various standard “get the current time” functions but these tend to be very inaccurate, canned tools like gprof are an option too, but in my experience they are also inaccurate and they can produce highly misleading results due to function inlining. A more accurate method on the x86 architecture is to use rdtsc

So now we have a accurate method to get the current “time” but what now? simply running the code once and getting the time before and afterwards? This would be a very bad choice as the first time a piece of code is executed its not in the code cache, the data isnt in the data cache, branch prediction has not seen the code yet, … so we could ignore the first few iterations, but still a single measurement isnt very trustworthy, next improvement would be to average many interations but still its very sensitive, just think about what happens when the kernel decides to stop your program and give the cpu to your mp3 player, its like having a interation which needs 50 times longer then the average assuming the code between the rdtsc instructions doesnt take too long. Solving this is easy though, just discard such outliers which need many times the average (see the START/STOP_TIMER macros in ffmpeg/libavutil/common.h)

Ok kids, today we will look at how to quickly calculate a crc checksum, the naive way is to xor the crc polynomial with various shifts into the data to remove all 1 bits except the last 32 (or whatever length our crc polynom has) which from a mathemathical point of view is the data_polynom modulo the crc_polynom over GF(2)
now noone does that except maybe a few very smart or very stupid people :-) everyone else uses a 8bit->32bit look up table which simply contains the 32bit crc for each 8bit data polynom, more precissely:
LUT[i]= (i<<|CRC_POLY|) GF2_MOD CRC_POLY where GF2_MOD is the modulo operation between 2 binary polynoms over GF(2)
the actual code which uses this LUT looks approximately like:

for (i = 0; i < length; i++)
    crc = LUT[(crc&0xFF) ^ *buffer++] ^ (crc >>8);

a much faster method is to use 4 LUTs so that they gives the crc checksum 0,1,2,3 bytes ahead:
LUT[j][i]= (i<<(|CRC_POLY| + j*8)) GF2_MOD CRC_POLY
with that we can calculate the crc then by:

    crc ^= le2me_32(*(uint32_t*)buffer); buffer+=4;
    crc =  tab[3][ crc     &0xFF]
          ^tab[2][(crc>>8 )&0xFF]
          ^tab[0][(crc>>24)     ];
    crc = tab[0][(crc&0xFF) ^ *buffer++] ^ (crc >>8);

actually working code should be in ffmpeg/libavutil soon, if not ill post my messy proof of concept code

How do we find the greatest common divisor of 2 integers? Try all integers which are ≤ then the arguments and select the largest which divides both with no remainder, works but its too slow. Or maybe factorize both, take the common factors and multiply them together like they probably told you in school, but thats still far too slow for anything but the small numbers in school excercies