Lair Of The Multimedia Guru

2006-02-17

SIMD without SIMD

You want to put several values in a single int or uint64_t, and either you dont have a cpu which supports SIMD instructions like MMX or the instructions plain dont match the datatype you have (like 2 5:6:5 rgb values in an int)

shifting some arbitrary bits left by 1

a += a&mask
note1, this is very usefull for converting betweem rgb15 and rgb16
note2, yes the bit into which you shift should be 0

unsiged shift left & right

this is simply a normal shift and then masking out the bits which where shifted out of each component
a>>/<<= shift
a&=mask

sign extension

(a+C)^C

00011|00010|00001|00000
10001|10000|01111|01110    + 1110011100111001110
11111|11110|00001|00000    ^ 1110011100111001110
signed shift right (signed left shift == unsigned)

unsigned shift + sign extension

sum all elements

there are several ways,
if you know nothing will overflow then a simple multiply will do
AAAABBBBCCCCDDDD * 1000100010001 = SSSSXXXXXXXXXXXX
or
AAAABBBBCCCCDDDD + (AAAABBBBCCCCDDDD>>4) = X,a+b,X,c+d
(X,a+b,X,c+d) + ((X,a+b,X,c+d)>>8) = X,X,X,a+b+c+d
if overflows might happen / the width of the types is too small then
a= (x & 1111000011110000)>>4
b= x & 0000111100001111
a+=b
and then either use one of the methods above or recursively continue with this

add

1. remove msb, and add the lsbs
2. xor the xored msb back in
x=AAAABBBBCCCCDDDD and y=EEEEFFFFGGGGHHHH are our inputs
m=1000100010001000 and l=0111011101110111
((x&l) + (y&l)) ^ ((x^y)&m)

subtract

1. remove msb, and sub the lsbs
2. xor the xored msb back in
x=AAAABBBBCCCCDDDD and y=EEEEFFFFGGGGHHHH are our inputs
m=1000100010001000 and l=0111011101110111
((x|m) – (y&l)) ^ ((x^y^m)&m)

negate

x=AAAABBBBCCCCDDDD is our input
m=1000100010001000 and l=0111011101110111
x= ~x
((x&l) + 0001000100010001) ^ (x&m)

average round down

x=AAAABBBBCCCCDDDD and y=EEEEFFFFGGGGHHHH are our inputs
(x&y) + (((x^y)&1110111011101110)>>1)

average round up

x=AAAABBBBCCCCDDDD and y=EEEEFFFFGGGGHHHH are our inputs
(x|y) – (((x^y)&1110111011101110)>>1)

test if any element is zero

x=AAAABBBBCCCCDDDD is our input
(x – 0001000100010001) & (~x) & 1000100010001000
Note, this can trivially be used to test for equality by choosing x=a^b

create a bitmask based on zeroness

x=AAAABBBBCCCCDDDD is our input
m=1000100010001000 and l=0111011101110111
(((((x&l) + l) | x) & m) >>3) + l ^ l
Note, this can trivially be used to test for equality by choosing x=a^b

Filed under: Optimization — Michael @ 23:46

2006-02-10

ASCII Mandelbrot Realtime Zoom

not only that, but its 447 bytes of C source (with license header, copyright statement and indented code), and you can even specify the spot to which the realtime zoom will go on the command line
Update: new version with more nasty tricks and less whitespace only needs 285 bytes of C source

Filed under: Off Topic — Michael @ 03:14

2006-02-06

CRC32 vs. Adler32

You need to choose a checksum and dunno which, well the 2 most popular ones are the CRC and Adler32 checksum

Speed / innermost loop

CRC32:

while(buffer<end -3){
    crc ^= le2me_32(*(uint32_t*)buffer); buffer+=4;
    crc =  tab[3][ crc     &0xFF]
          ^tab[2][(crc>>8 )&0xFF]
          ^tab[1][(crc>>16)&0xFF]
          ^tab[0][(crc>>24)     ];
}

Adler32:

while(buffer<end -3){
    s1 += *buffer++; s2+=s1;
    s1 += *buffer++; s2+=s1;
    s1 += *buffer++; s2+=s1;
    s1 += *buffer++; s2+=s1;
}

as we can see adler needs 12 adds and 4 1byte reads and crc needs 1 4byte read, 4 xor, 4 table lookups, 1 add, 3 and, 3 shift, without benchmarking i would say adler should be faster

Burst error detecting capabilities

CRC-32 can detect every burst error of 32 or less bits
Adler-32 will fail with at least one 17bit burst error (0 00000010 00000000 vs. 1 00000000 00000001)

Bit error detecting capabilities

CRC-32 differs depening upon generator polynom see misusing-crcs-for-error-correction
Adler-32 can detect every 2bit error within 65521 byte or so but will fail with some 3 bit errors

Random error detecting capabilities

CRC-32 will produce a evenly distributed checksum for messages >4byte
Adler-32 will produce a quite unevenly distributed checksum for small mesages so that the effective number of bits in the checksum is significantly reduced for short messags

adler32
len=     1, collisions=488296949(0.390637%), effective bits=7.999957
len=     2, collisions=  1903788(0.001523%), effective bits=16.002699
len=     4, collisions=   372789(0.000298%), effective bits=18.355140
len=     8, collisions=    94799(0.000076%), effective bits=20.330556
len=    16, collisions=    23902(0.000019%), effective bits=22.318296
len=    32, collisions=     6068(0.000005%), effective bits=24.296135
len=    64, collisions=     1573(0.000001%), effective bits=26.243837
len=   128, collisions=      669(0.000001%), effective bits=27.477278
len=   256, collisions=      437(0.000000%), effective bits=28.091651
len=   512, collisions=      331(0.000000%), effective bits=28.492453
len=  1024, collisions=      219(0.000000%), effective bits=29.088353
----
crc32
len=     1, collisions=488296915(0.390637%), effective bits=7.999957
len=     2, collisions=  1903782(0.001523%), effective bits=16.002703
len=     4, collisions=       22(0.000000%), effective bits=32.403708
len=     8, collisions=       29(0.000000%), effective bits=32.005159
len=    16, collisions=       30(0.000000%), effective bits=31.956249
len=    32, collisions=       28(0.000000%), effective bits=32.055785
len=    64, collisions=       39(0.000000%), effective bits=31.577738
len=   128, collisions=       39(0.000000%), effective bits=31.577738
len=   256, collisions=       32(0.000000%), effective bits=31.863140
len=   512, collisions=       32(0.000000%), effective bits=31.863140
len=  1024, collisions=       42(0.000000%), effective bits=31.470823

Note, generated with: check_collision.c

Filed under: Error Correcting Codes — Michael @ 19:27

2006-02-04

Fast sign extension

You want to convert a signed n-bit integer into a m-bit interger, now if n and m are one of 8,16,32,64 then its easy, just cast to the respective type and let the compiler choose how to do it efficiently, but what if n or m are not one of these?

Bit Twiddling Hacks suggests:
C= 1 << (n – 1)
-(x & C) | x
which needs 3 operations if n is constant and 5 if not

can this be done with fewer instructions? yes
C= (-1)<<(n-1)
(x+C)^C
which needs 2 operations if n is constant and 4 if not

PS: the C= (-1)<<(n-1) could be replaced by a look up table …

Filed under: Optimization — Michael @ 23:10

Powered by WordPress