Lair Of The Multimedia Guru

2006-04-15

Decrypting Nagravision

Now after breaking videocrypt, let's take a look at (old analog) nagravision

Nagravision

the PRNG seed

Same as videocrypt (binary data from the vertical blanking interval is fed into a smartcard which then, among other things, produces a PRNG seed for the nagravision decoder, but none of this matters to us here)

the PRNG

The PRNG (pseudo random number generator) is fed with a 15bit seed and a table of 256 5bit entries, and produces 287 5bit values; the following code does exactly that:

#include <stdint.h>
#include <string.h>

extern const uint8_t keyTable[256]; /* the 256 entry 5bit table mentioned above (exact declaration assumed) */

void prng(int seed, int *cle2enc){
  uint8_t keyNdx= seed >> 7;              /* top 8 bits of the seed index into keyTable */
  int keyInc= ((seed & 0x7F) << 1) + 1;   /* low 7 bits give an odd step through it     */
  int buffer[32];

  for(int i=0; i<32; i++) buffer[i]= i;

  for(int i=0; i<255; i++){
    cle2enc[i]= buffer[ keyTable[keyNdx] ];  /* output one line number ...              */
    buffer[ keyTable[keyNdx] ]= i + 32;      /* ... and replace it by the next one      */
    keyNdx+= keyInc;                         /* wraps around as keyNdx is a uint8_t     */
  }

  memcpy(cle2enc+255, buffer, 32*sizeof(int)); /* flush the 32 remaining entries        */
}
the actual decryption

nagravision has a 32 line buffer, each entry of which can hold one line of video. for each of the 287 lines of a field, 5 bits from the PRNG are used as an index into the buffer to decide into which entry the next received line is stored and from which entry a line is output; in other words, nagravision reorders lines using a 32 line temporary buffer
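To make the mechanism concrete, here is a minimal sketch of that reordering, assuming the whole field has already been captured. The names (field_in, field_out, perm) are made up, perm is the 287-entry table produced by prng() above, and whether it has to be applied in this direction or the inverse depends on details not covered here.

#include <stdint.h>
#include <string.h>

/* reorder the 287 lines of one captured field according to the PRNG output;
   a real decoder only needs the 32 line buffer described above, this sketch
   simply assumes the whole field is in memory */
void reorder_field(uint8_t *field_out, const uint8_t *field_in,
                   const int *perm, int line_size)
{
    for (int i = 0; i < 287; i++)
        memcpy(field_out + i * line_size,
               field_in + perm[i] * line_size,
               line_size);
}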

Breaking Nagravision without knowledge of the PRNG

The naive and ideal solution would be to try all 32^287 possibilities and compare the resulting decoded images; obviously that's not realistic
if OTOH we just use a compare function which compares 2 lines (let's assume it's just simple correlation) and wish to find the global minimum, then we actually didn't improve by that much, as that's equivalent to solving the TSP
old versions of 2010 used the following algorithm to find a good permutation of the lines, which resulted in watchable movies

Step 1 (downsampling)

to speed up the line compare later, the lines can be horizontally downsampled

Step 2 (comparing lines)

all lines which are not near black and which have not yet been merged with 2 lines are compared to all other such lines; the compares are done blockwise so that the number of cache misses is minimized, and for each line a fixed number of best matches is kept
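A sketch of a possible line compare; the text above talks about correlation, for brevity this uses the sum of absolute differences over the (already downsampled) luma of two lines, so a smaller score means a better match. width and the uint8_t line layout are assumptions.

#include <stdint.h>
#include <stdlib.h>

static int line_score(const uint8_t *a, const uint8_t *b, int width)
{
    int score = 0;
    for (int x = 0; x < width; x++)
        score += abs(a[x] - b[x]);   /* SAD instead of correlation */
    return score;
}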

Step 3 (sorting)

all the line pairs are sorted depending upon their similarity

Step 4 (merging)

from best matching line pair to worst the following steps are done

Step 4A (out of pairs)

if we hit the last pair for any line (we just kept a fixed number of best matches for each line) then we go back to Step 2

Step 4B (impossible merges 1)

if either of the lines is in the middle of a block or they are the ends of the same block we continue with the next pair at Step 4

Step 4C (impossible merges 2)

we maintain 2 position limits for each block, one for each possible orientation; if a merge would force the block outside the image we continue with the next pair at Step 4
the limits are calculated based on the fact that no line can be output before it has been received

Step 4D (impossible merges 3)

we maintain the fullness state of the 32 line reorder buffer; if a merge would require overflowing it then we again continue with the next pair at Step 4

Step 4E (merge)

we merge the 2 blocks and update the buffer fullness and limits

Step 5 (sort blocks)

all blocks are sorted by their less restrictive position limit (i.e. each block is oriented so that its limit is the less restrictive one)

Step 6 (reorder lines)

we reorder the actual lines

Breaking Nagravision with knowledge of the PRNG

as there are just 2^15 possible permutations, the naive bruteforce becomes a lot more realistic, but it's still somewhat expensive for realtime decoding, so 2010 used a faster method
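For reference, the naive variant is small enough to sketch; it reuses the hypothetical prng(), reorder_field() and line_score() helpers from above and simply keeps the seed whose reordered field has the most similar consecutive lines. 2010 did not work this way, it used the precalculated-table method described below.

#include <stdint.h>
#include <limits.h>

int find_seed_bruteforce(const uint8_t *field_in, int line_size, int width)
{
    static uint8_t tmp[287 * 1024];          /* assumes line_size <= 1024 */
    int best_seed = -1;
    long best_score = LONG_MAX;

    for (int seed = 0; seed < (1 << 15); seed++) {
        int perm[287];
        long score = 0;

        prng(seed, perm);
        reorder_field(tmp, field_in, perm, line_size);
        for (int i = 1; i < 287; i++)        /* consecutive lines should match */
            score += line_score(tmp + (i - 1) * line_size,
                                tmp +  i      * line_size, width);
        if (score < best_score) {
            best_score = score;
            best_seed  = seed;
        }
    }
    return best_seed;
}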

Step 1 (downsampling)

to speed up the line compare later, the lines can be horizontally downsampled

Step 2 (compare)

every keyline is compared against all lines, the best 2 are kept

Step 3 (find seedlists)

for every line triplet (keyline and the 2 best matches) we assume that they will be consecutive
in the decoded image, we look up a list of seeds for which this is true from a precalculated table
or tree structure

Step 4 (find seed)

we find the most often occurring seed in the seedlists, and if its number of occurrences is above some threshold we decode the image, if not we drop it; this ensures practically zero wrongly decoded images

Step 5 (reorder)

after we know the seed we just need to reorder the lines and fix the chroma

Filed under: Cryptanalysis — Michael @ 11:54

2006-04-10

Decrypting VideoCrypt

To complement my last blog entry, here's a description of how VideoCrypt works and how it can be broken with reasonable quality

PAL

First before we can talk about VideoCrypt, we need to understand how PAL works
Video information in PAL (and NTSC and SECAM) is stored linewise top to bottom, and interlaced, so all the even lines and then, in the next field, all the odd lines are stored (313 lines in even fields, 312 in odd, of which ~288 lines contain a visible picture for PAL)
each line has a sync pulse to mark the end of a line, a horizontal blanking interval during which the electron beam in CRTs “moves” back to the left after the end of the previous line at the right, and obviously the visible part in which brightness = amplitude
In case of color, the U and V components of the YUV color value are stored quadrature amplitude modulated at the color subcarrier frequency, and the V component has its sign inverted in every 2nd line of a field; there's also a short reference color burst added in the HBI

VideoCrypt

the PRNG seed

Binary data from the vertical blanking interval is fed into a smartcard which then, among other things, produces a PRNG seed for the videocrypt decoder, but none of this matters to us here

the actual decryption

The PRNG (pseudo random number generator) outputs a sequence of 8bit values (cutpoints), one for each line of video. each of these lines is then split into 4 parts: parts 0 and 3 are outside the visible area and aren't touched by videocrypt, parts 1 and 2 are the visible area and are exchanged, and the location which separates parts 1 and 2 is the cutpoint. videocrypt does nothing special with the color information, it just exchanges raw digitally sampled data
because this exchange operation cannot be done with zero delay, the videocrypt decoder will also output parts 1 and 2 one line later than they were transmitted
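A sketch of the per-line cut and exchange done by the decoder. cut_start/cut_end bound the visible area (parts 1 and 2), cutpoint is the 8bit PRNG value mapped into that range (the exact mapping to a sample position depends on the sampling rate and isn't shown), and it is assumed here that the encrypted line carries part 2 first.

#include <stdint.h>
#include <string.h>

void decrypt_line(uint8_t *line, int cut_start, int cut_end, int cutpoint)
{
    uint8_t tmp[2048];                  /* assumes cut_end - cut_start <= 2048 */
    int len1 = cutpoint - cut_start;    /* length of part 1 */
    int len2 = cut_end  - cutpoint;     /* length of part 2 */

    memcpy(tmp,        line + cut_start + len2, len1);   /* part 1 back first */
    memcpy(tmp + len1, line + cut_start,        len2);   /* then part 2       */
    memcpy(line + cut_start, tmp, len1 + len2);
}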

Breaking VideoCrypt

Lacking information about the PRNG we can only guess the cutpoints, based on the assumption that color/luminance values of pixels which are close together will be strongly correlated while distant pixels will differ more
still, naive bruteforce would require 256^288 images of w*288 pixels to be compared
depending on how we test an image for how good it is, we are able to significantly reduce the complexity. if we use a score function which is the sum of some compare function over consecutive line pairs and their cutpoints, then the dynamic programming / viterbi algorithm can be used to reduce the complexity to 256*256*288 such 2 line checks. if we now further limit ourselves to the common cross correlation + some edge detector and only consider a few cutpoint differences close to the best result from cross correlation, then the complexity would be w*log(w)*288 + 9*w*288, which is what markus kuhn's antisky.c did, though antisky was far from realtime decoding, mainly because it was inefficiently written (floating point code everywhere at a time when no cpu had a reasonably fast fpu, …)

Note, if you think comparing 256^288 images would be stupid, it's not; think about choosing the image which is compressed best with jpeg or another codec, this might (or might not) be better than the linewise method

Step 1 (downsampling)

to speed up the cross correlation and other things, the whole image can optionally be downsampled

Step 2 (luminance cross correlation)

antisky.c used a FFT based cross correlation; while this is asymptotically quite fast, it is really not fast at all for the practical case of finding the best matching cutpoint difference. 2010 uses an adaptive comparison which first tries a few coarsely spaced cutpoint differences and then recursively tries more around the better ones; it's simpler and faster

Step 3 (mismatched lines detection)

lines which simply can't be matched by cross correlation (mean under some threshold, or variance under some threshold, or best matching score from cross correlation not much better than the average score)
must be marked, otherwise the (random) cross correlation result will trash the other surrounding lines

Step 4 (PAL phase detection)

as already described somewhere above, PAL flips the sign of the V component every second line, and the first line may or may not be flipped; if the capture hw doesn't provide us with this info then we need to find it if we want to decode color. one way to solve this is to let the user set the flag and then just flip the flag after every frame; a much more robust method is to compare the color of the first and last pixel of the encrypted line parts which will be exchanged. these are guaranteed to always be adjacent in the decrypted image, so there they will have almost the same color, while in the encrypted image their difference depends upon the PAL phase thingy

Step 5 (finding the Chroma phase difference)

we know the approximate cutpoint difference from the luma cross correlation, so we can simply find the chroma phase/hue difference by calculating the complex correlation coefficient (U and V are the real and imag values) (note, a little care needs to be taken here due to various things i don't remember exactly …)

Step 6 (edge detection)

simply run your favorite edge detector over the 256 possible cutpoints to detect the picture edge, for example log(abs(l[x-2] + l[x-1] - l[x+0] - l[x+1]))
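As a sketch, the formula above as a function; l is the (luma of the) current line and x a candidate cutpoint position, the small constant just avoids log(0).

#include <stdint.h>
#include <stdlib.h>
#include <math.h>

static float edge_score(const uint8_t *l, int x)
{
    int d = l[x-2] + l[x-1] - l[x+0] - l[x+1];
    return logf(abs(d) + 1e-6f);
}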

Step 7 (dynamic programming search)

Well, here we simply combine all the previous stuff to find the best cutpoint sequence reaching every cutpoint of line X, assuming we already know the best sequence to every cutpoint of line X-1
lines which couldn't be matched to their previous line are a special case which is handled like the first line (every cutpoint gets the same score). the same is done with yet another special case, namely the one where no cutpoint is possible anymore; that can happen as we restrict ourselves to a few relative cutpoints around what the cross correlation guessed is correct, and there's a small deadzone in which no cutpoints lie, so not every sequence of relative cutpoints is possible, depending upon how you define relative
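A compact sketch of that search, without the restriction to a few candidate cutpoints (so it is the full 256*256*288 variant). score() is a hypothetical cost function returning how badly line y with cutpoint cut matches line y-1 with cutpoint prev_cut (lower is better), combining the cross correlation and edge detector results from the previous steps.

#include <stdint.h>
#include <limits.h>

#define NLINES 288

extern long score(int y, int prev_cut, int cut);   /* hypothetical cost function */

void find_cutpoints(int best_cut[NLINES])
{
    static long    cost[NLINES][256];
    static uint8_t from[NLINES][256];

    for (int c = 0; c < 256; c++)
        cost[0][c] = 0;                 /* first line: every cutpoint equally good */

    for (int y = 1; y < NLINES; y++) {
        for (int c = 0; c < 256; c++) {
            long best = LONG_MAX;
            int  bestp = 0;
            for (int p = 0; p < 256; p++) {
                long v = cost[y-1][p] + score(y, p, c);
                if (v < best) { best = v; bestp = p; }
            }
            cost[y][c] = best;
            from[y][c] = bestp;
        }
    }

    /* pick the best cutpoint of the last line and trace back */
    int c = 0;
    for (int i = 1; i < 256; i++)
        if (cost[NLINES-1][i] < cost[NLINES-1][c]) c = i;
    for (int y = NLINES-1; y >= 0; y--) {
        best_cut[y] = c;
        if (y) c = from[y][c];
    }
}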

Step 8 (cutpoint sequence cache)

over some time period cutpoint sequences were repeating after an hour or so, so i added a cache to 2010 which led to perfect decoding until the repetitions disappeared …

Step 8a (finding the current sequence)

we select several consecutive relative cutpoint pairs and look them up in a hashtable; we also try to add +-1 to both so as to compensate for errors in our cutpoint guess
for each pair we will get a list of pointers to cutpoint sequences
we count the number of occurrences of each pointer and choose the most often occurring one as the correct sequence if its number of occurrences is above some threshold, or return no match if not

Step 8b (merging the current cutpoint estimates)

we add the current relative cutpoints to the sequence so that each line has a list of relative cutpoints; if that gets too long the least often occurring one is discarded
for each line the most often occurring relative cutpoint is chosen and used in later steps, and the hashtable is also updated with it

Step 9 (cut and exchange)

this is simple, just 2-3 memcpys and rotating the UV values so as to compensate for the effects of the shift and 1 line delay on the color

Filed under: Cryptanalysis — Michael @ 14:58

2006-04-07

Decrypting (old analog) PayTv

A really long time ago, 1997 or maybe even a little earlier, i wrote a program to decrypt the analog pay tv systems in use at that time, videocrypt and nagravision. sadly i was too afraid of legal issues back then, so i never released either source or binary. well, both systems are AFAIK dead nowadays, the old code hasn't been touched since 2000 and doesn't compile with any compiler i have, so i thought it's the perfect time for finally releasing that trash

When was the first working version born?

well i don't know, the oldest file i could find was a 2do list from 1997 which contains notes about improving videocrypt and nagravision decoding, so these must have been working already …

Does it contain any cryptographically new knowledge about the systems?

no, before the nagra PRNG was reverse engineered i believed that the PRNGs both systems used would be very secure and that there would be no point in “looking” at their output. afterwards, besides feeling silly, as even a blind man would have noticed how the nagra PRNG works (it's just permuting 256 5bit values in trivial 32768 ways, 127 only if we ignore new[x]= old[x+C]),
i tried to look at the videocrypt PRNG output but failed to find anything: 2 cutpoint sequences were either different with their difference random, or they were exactly equal, which happened either commonly (repeating after an hour or so) or never, depending upon when you recorded them. i also failed to find any other relations in the sequences …

Is this clean ANSI C code?

it's ultra messy DJGPP-GCC-DOS-C code, i started to clean it but gave up, a patch with my unfinished cleanup work is included for the insane. with the patch a few files compile under current gcc-linux

what else does it do? could it make coffee?

There's some teletext decoding code, some half working sync suppression decoder (with unmodified hardware) and probably a few other things i don't remember; after all it doesn't compile and i don't have my matrox meteor card here (which was the only card it supported, as that was what i had)

did you port it to java?

yes, i ported part of it to java a really long time ago. why i did that i'm not sure anymore, was i seriously believing the sun propaganda that java was a usable general purpose programming language? i guess my excuse is that my previous java code at that time was just tiny applets & applications where the java-effect isn't so noticeable
the java version was, even with hand coded mmx over jni, about 3 times slower, there were ~1sec garbage collector pauses every 30secs or so, and after 15min the computer needed a hard reset (yeah, that was even without any mmx/jni)

download this crap\h\h\h\hmasterpiece

2010

license?

GPL but what do you want to do with it?!

Filed under: Cryptanalysis — Michael @ 14:37

2006-03-24

AVI and B frames

AVI and b frames, are they allowed or not?

AVI certainly wasn't designed to support b frames, but on the other hand there's nothing which would make b frames in AVI illegal. the biggest problem isn't AVI itself but the various applications and APIs like vfw which are designed with zero delay codecs in mind

PTS(Presentation timestamps)

well, wtf are they and why would we need them? every (video) frame has a decode timestamp (DTS) and a PTS; the DTS is the time at which a frame is fed into the decoder, the PTS is the time at which the decoded frame will be presented to the user. codecs which have no b frames / zero delay always have PTS = DTS, and AVI, as it wasn't designed with b frames in mind, has no concept of PTS; there are just DTS, which are simply the frame number divided by the frame rate
So one could now argue that AVI doesn't support b frames as it doesn't store PTS, and that an application which needs the PTS (simpler players don't need it …) would have to calculate it from the frame types and the DTS. but that argument against AVI + b frames has a critical flaw: MPEG-PS and MPEG-TS don't store a PTS for every frame either, they only require one to be stored every 0.5 seconds or so, which means that the same calculate-the-PTS-from-DTS-and-frame-types code is needed for the official MPEG formats too
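A sketch of that calculation for the usual single B frame reorder delay: a B frame is presented immediately (PTS = DTS) while an I/P frame is presented when the next I/P frame is decoded (its PTS is that frame's DTS). The frame types, the fixed one-frame delay and the helper itself are assumptions, not something AVI itself defines.

#include <stdint.h>

void dts_to_pts(const char *type, const int64_t *dts, int64_t *pts, int n)
{
    int last_ref = -1;                       /* index of the previous I/P frame */
    for (int i = 0; i < n; i++) {
        if (type[i] == 'B') {
            pts[i] = dts[i];                 /* B frames are not reordered      */
        } else {
            if (last_ref >= 0)
                pts[last_ref] = dts[i];      /* previous reference shows now    */
            last_ref = i;
        }
    }
    if (last_ref >= 0 && n > 1)              /* last reference: one frame later */
        pts[last_ref] = dts[n-1] + (dts[1] - dts[0]);
}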

Packed bitstream

Packed bitstream is a very ugly hack which puts several frames into a single avi chunk. this reduces the b frame delay by 1, which avoids some problems with APIs which are not aware of b frames, but causes various problems with APIs which are aware of b frames; the resulting bitstream is also not valid according to the MPEG standards, so lossless transcoding to .mp4 is much more complicated

Filed under: Container Formats — Michael @ 19:37

2006-03-02

The AVI container file format

Some people love it, some hate it, but no doubt it is probably the most common container format for videos, though its usage is declining as superior formats like matroska, nut and others gain popularity

Why is/was AVI so successful

well, there's no way to answer this with certainty, we can only guess

  • allows storing of audio and video encoded with almost any codec without complicated codec specific hacks (ogg is probably the most common example which fundamentally failed here, but mpeg-ps/ts has the same issue, and .mp4 is, thanks to the standard committees, also full of codec specific hacks, which ironically wouldn't be needed at all)
  • simple muxing and demuxing (quicktime/.mov/.mp4/mpeg-ps fails here)
  • no known software patents (asf/wma/wmv fails here)

AVI the format vs. reality, or why some developers hate AVI

The biggest problem with AVI, at least in my humble opinion, is not some technical limitation of AVI but the fact that many encoders (muxers actually, to be precise) generate incorrect AVI files which violate the AVI spec from MS. that again means that players need to be much more complex to deal with all these variations …
some examples of such variations

  • putting all the audio in a single chunk, yeah try to implement seeking …
  • having an index which doesn't match the chunks, and just to be sure don't set the header flags which indicate that the index must be used
  • putting several variable sized frames into a single chunk
  • or the very common way of storing mp3 in AVI: simply cut the mp3 frames completely randomly and then scream “avi doesn't support vbr mp3” (hi virtualdub and clones)

Common misconceptions about AVI

AVI doesn't support variable bitrate audio
This is simple nonsense. there are 2 ways to store frames (audio/video/…) in AVI: the first is to set sample_size to the size in bytes of a packet and store an integer number of such packets per chunk, the second is to store 1 frame per chunk and set sample_size=0, which is what has to be done for variable size packets; it works for audio as well as video. now what some encoders actually do is set sample_size=1 and then chop the audio stream up at random; this surprisingly works with cbr mp3 but not vbr mp3, but it is totally wrong even for cbr mp3
AVI doesn't support variable framerate
again this isn't true, storing variable framerate in AVI is not efficient but it is possible, it's not a hack and it doesn't stretch the AVI spec, it's fully supported: you just set the framerate to the least common multiple of the rates you want to use, and use 0-byte chunks for “skipped” frames
storing variable samples per packet audio in AVI (vorbis)
while this is still possible, it stretches the spec somewhat. sample_size must be 0, and there are 2 ways to store things: 1. the one with large overhead, which sets rate/scale to a common timebase (gcd of samples per packet / samplerate) and then adds 0 sized packets to keep the timestamps correct, and 2. the one with small overhead, which sets rate/scale to (lcm of samples per packet / samplerate) and packs several packets into each chunk, which isn't really allowed but well …
AVI has 24byte per chunk overhead, ODML-AVI has 32byte per chunk overhead
well, ODML-AVI without the old redundant index has just 16byte per chunk overhead, and if you stretch the spec like some encoders successfully do by having chunks and index mismatch then you can get away with 8byte per chunk overhead, and yes, you can still seek in this
Filed under: Container Formats — Michael @ 23:53

2006-02-17

SIMD without SIMD

You want to put several values in a single int or uint64_t, and either you don't have a cpu which supports SIMD instructions like MMX or the instructions plainly don't match the datatype you have (like 2 5:6:5 rgb values in an int)

shifting some arbitrary bits left by 1

a += a&mask
note1, this is very useful for converting between rgb15 and rgb16
note2, yes, the bit into which you shift should be 0
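For example, converting rgb15 (0RRRRRGGGGGBBBBB) to rgb16 (RRRRRGGGGGGBBBBB) means shifting the red and green fields left by one, which the trick above does for 2 packed pixels at once (the low green bit of the result is simply left 0 in this sketch):

#include <stdint.h>

uint32_t rgb15to16_2px(uint32_t x)
{
    return x + (x & 0x7FE07FE0);   /* doubling the bits above blue shifts them left by 1 */
}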

unsigned shift left & right

this is simply a normal shift and then masking out the bits which were shifted out of each component
a>>/<<= shift
a&=mask

sign extension

(a+C)^C

  00011|00010|00001|00000    (2bit signed values stored in 5bit fields)
+ 01110|01110|01110|01110    (C; the per-field msbs of C are cleared so the add cannot carry into the neighbouring field)
= 10001|10000|01111|01110
^ 01110|01110|01110|01110
= 11111|11110|00001|00000    (every field sign extended to 5 bits)
signed shift right (signed left shift == unsigned)

unsigned shift + sign extension

sum all elements

there are several ways,
if you know nothing will overflow then a simple multiply will do
AAAABBBBCCCCDDDD * 1000100010001 = SSSSXXXXXXXXXXXX
or
AAAABBBBCCCCDDDD + (AAAABBBBCCCCDDDD>>4) = X,a+b,X,c+d
(X,a+b,X,c+d) + ((X,a+b,X,c+d)>>8) = X,X,X,a+b+c+d
if overflows might happen / the width of the types is too small then
a= (x & 1111000011110000)>>4
b= x & 0000111100001111
a+=b
and then either use one of the methods above or recursively continue with this

add

1. remove msb, and add the lsbs
2. xor the xored msb back in
x=AAAABBBBCCCCDDDD and y=EEEEFFFFGGGGHHHH are our inputs
m=1000100010001000 and l=0111011101110111
((x&l) + (y&l)) ^ ((x^y)&m)
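The same thing as a function, for four 4bit fields packed in a uint16_t; carries out of a field are discarded instead of spilling into the neighbour:

#include <stdint.h>

uint16_t add4x4(uint16_t x, uint16_t y)
{
    const uint16_t m = 0x8888;   /* msb of every field */
    const uint16_t l = 0x7777;   /* the lower bits     */
    return ((x & l) + (y & l)) ^ ((x ^ y) & m);
}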

subtract

1. remove msb, and sub the lsbs
2. xor the xored msb back in
x=AAAABBBBCCCCDDDD and y=EEEEFFFFGGGGHHHH are our inputs
m=1000100010001000 and l=0111011101110111
((x|m) - (y&l)) ^ ((x^y^m)&m)

negate

x=AAAABBBBCCCCDDDD is our input
m=1000100010001000 and l=0111011101110111
x= ~x
((x&l) + 0001000100010001) ^ (x&m)

average round down

x=AAAABBBBCCCCDDDD and y=EEEEFFFFGGGGHHHH are our inputs
(x&y) + (((x^y)&1110111011101110)>>1)

average round up

x=AAAABBBBCCCCDDDD and y=EEEEFFFFGGGGHHHH are our inputs
(x|y) - (((x^y)&1110111011101110)>>1)

test if any element is zero

x=AAAABBBBCCCCDDDD is our input
(x - 0001000100010001) & (~x) & 1000100010001000
Note, this can trivially be used to test for equality by choosing x=a^b

create a bitmask based on zeroness

x=AAAABBBBCCCCDDDD is our input
m=1000100010001000 and l=0111011101110111
(((((x&l) + l) | x) & m) >>3) + l ^ l
Note, this can trivially be used to test for equality by choosing x=a^b

Filed under: Optimization — Michael @ 23:46

2006-02-10

ASCII Mandelbrot Realtime Zoom

not only that, but it's 447 bytes of C source (with license header, copyright statement and indented code), and you can even specify the spot to which the realtime zoom will go on the command line
Update: new version with more nasty tricks and less whitespace only needs 285 bytes of C source

Filed under: Off Topic — Michael @ 03:14

2006-02-06

CRC32 vs. Adler32

You need to choose a checksum and don't know which; well, the 2 most popular ones are the CRC32 and Adler32 checksums

Speed / innermost loop

CRC32:

/* tab[] are presumably 4 precalculated 256-entry CRC tables (slicing-by-4),
   le2me_32() converts from little endian to native byte order */
while(buffer<end -3){
    crc ^= le2me_32(*(uint32_t*)buffer); buffer+=4;
    crc =  tab[3][ crc     &0xFF]
          ^tab[2][(crc>>8 )&0xFF]
          ^tab[1][(crc>>16)&0xFF]
          ^tab[0][(crc>>24)     ];
}

Adler32:

while(buffer<end -3){
    s1 += *buffer++; s2+=s1;
    s1 += *buffer++; s2+=s1;
    s1 += *buffer++; s2+=s1;
    s1 += *buffer++; s2+=s1;
}

as we can see, adler needs 12 adds and 4 1byte reads per 4 bytes, while crc needs 1 4byte read, 4 xors, 4 table lookups, 1 add, 3 ands and 3 shifts; without benchmarking i would say adler should be faster
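For reference, a complete (if slow) Adler-32 sketch; the loop above is just the unrolled inner part without the modulo, which zlib only applies every few thousand bytes so that s1/s2 cannot overflow:

#include <stdint.h>
#include <stddef.h>

uint32_t adler32(const uint8_t *buf, size_t len)
{
    uint32_t s1 = 1, s2 = 0;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % 65521;   /* 65521 = largest prime below 2^16 */
        s2 = (s2 + s1)     % 65521;
    }
    return (s2 << 16) | s1;
}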

Burst error detecting capabilities

CRC-32 can detect every burst error of 32 or less bits
Adler-32 will fail with at least one 17bit burst error (0 00000010 00000000 vs. 1 00000000 00000001)

Bit error detecting capabilities

CRC-32 differs depending upon the generator polynomial, see misusing-crcs-for-error-correction
Adler-32 can detect every 2bit error within 65521 bytes or so but will fail with some 3 bit errors

Random error detecting capabilities

CRC-32 will produce an evenly distributed checksum for messages >4byte
Adler-32 will produce a quite unevenly distributed checksum for small messages, so that the effective number of bits in the checksum is significantly reduced for short messages

adler32
len=     1, collisions=488296949(0.390637%), effective bits=7.999957
len=     2, collisions=  1903788(0.001523%), effective bits=16.002699
len=     4, collisions=   372789(0.000298%), effective bits=18.355140
len=     8, collisions=    94799(0.000076%), effective bits=20.330556
len=    16, collisions=    23902(0.000019%), effective bits=22.318296
len=    32, collisions=     6068(0.000005%), effective bits=24.296135
len=    64, collisions=     1573(0.000001%), effective bits=26.243837
len=   128, collisions=      669(0.000001%), effective bits=27.477278
len=   256, collisions=      437(0.000000%), effective bits=28.091651
len=   512, collisions=      331(0.000000%), effective bits=28.492453
len=  1024, collisions=      219(0.000000%), effective bits=29.088353
----
crc32
len=     1, collisions=488296915(0.390637%), effective bits=7.999957
len=     2, collisions=  1903782(0.001523%), effective bits=16.002703
len=     4, collisions=       22(0.000000%), effective bits=32.403708
len=     8, collisions=       29(0.000000%), effective bits=32.005159
len=    16, collisions=       30(0.000000%), effective bits=31.956249
len=    32, collisions=       28(0.000000%), effective bits=32.055785
len=    64, collisions=       39(0.000000%), effective bits=31.577738
len=   128, collisions=       39(0.000000%), effective bits=31.577738
len=   256, collisions=       32(0.000000%), effective bits=31.863140
len=   512, collisions=       32(0.000000%), effective bits=31.863140
len=  1024, collisions=       42(0.000000%), effective bits=31.470823

Note, generated with: check_collision.c

Filed under: Error Correcting Codes — Michael @ 19:27

2006-02-04

Fast sign extension

You want to convert a signed n-bit integer into an m-bit integer. now if n and m are one of 8, 16, 32, 64 then it's easy, just cast to the respective type and let the compiler choose how to do it efficiently, but what if n or m are not one of these?

Bit Twiddling Hacks suggests:
C= 1 << (n - 1)
-(x & C) | x
which needs 3 operations if n is constant and 5 if not

can this be done with fewer instructions? yes
C= (-1)<<(n-1)
(x+C)^C
which needs 2 operations if n is constant and 4 if not

PS: the C= (-1)<<(n-1) could be replaced by a look up table …
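As a function (a sketch; x must have its bits above the n lowest clear):

#include <stdint.h>

static inline int32_t sign_extend(uint32_t x, int n)
{
    uint32_t C = (uint32_t)-1 << (n - 1);   /* ...111 followed by n-1 zeros */
    return (int32_t)((x + C) ^ C);
}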

Filed under: Optimization — Michael @ 23:10

2006-01-28

Avoiding branches/if/conditionals

Why? well, because mispredictable branches are very slow on modern CPUs, doing 10 or more simple arithmetic operations can be faster, but also keep in mind that:

  • compilers sometimes compile conditional code to very efficient unconditional code, while they might not recognize your branchless code as equivalent
  • if the branch is well predictable it will probably be faster than the branchless code

Building blocks of branchless code

Turning comparisons into masks or 0/1
a<0 a>>31
a!=0 (a|-a)>>31
a<b (a-b)>>31
a>b (b-a)>>31
a≤b (a-b-1)>>31
a≥b (b-a-1)>>31

Note, you might want to replace 31 by the number of bits in whatever type you use and cast to a signed type if you want a bit mask (-1/0) or unsigned type if you want (1/0)
also keep in mind that (a<b) and such will give you (1/0) too

converting bitmasks to 0/1

well, that's trivial, there are many ways, some even flip the 0/1 in the process
mask&1, -mask, mask+1

converting 0/1 to masks

-x, x-1

flipping the 0/1 value

x^1, 1-x

flipping the mask value

x^(-1), ~x

switch between 2 variables depending upon a mask

b+((a-b)&mask), b^((a^b)&mask)

convert mask to 0/x

mask&x

convert mask to -1/x

mask|x
Note: useful for (-1/0) -> (-1/1)

add/sub x depending upon a mask

a += x&mask
Note, don't forget a+=mask / a-=mask for x=1/-1

negate depending upon a mask

a = (a^mask) - mask

minimum of 2 variables

b+((a-b)&((a-b)>>31))

maximum of 2 variables

a-((a-b)&((a-b)>>31))
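Both as functions; like the formulas above this assumes a-b does not overflow and that >> of a negative int is an arithmetic shift (true on the usual compilers and CPUs):

static inline int min_nobranch(int a, int b){ return b + ((a - b) & ((a - b) >> 31)); }
static inline int max_nobranch(int a, int b){ return a - ((a - b) & ((a - b) >> 31)); }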

shift by s depending upon a mask

a>>= s&mask, a<<= s&mask, a += ((a>>s)-a)&mask
Note, that shifting by a constant might be faster

exchanging 2 values depending upon a mask

a^=b
b^=a&mask
a^=b
Note, the mask doesn't need to be 0/-1, it could have just some bits set

logical operations between masks

well, that really should be obvious AND a&b, OR a|b, XOR a^b, NOT ~a

Filed under: Optimization — Michael @ 02:52