Lair Of The Multimedia Guru

2005-11-25

Fast CRC

Ok kids, today we will look at how to quickly calculate a crc checksum, the naive way is to xor the crc polynomial with various shifts into the data to remove all 1 bits except the last 32 (or whatever length our crc polynom has) which from a mathemathical point of view is the data_polynom modulo the crc_polynom over GF(2)
now noone does that except maybe a few very smart or very stupid people :-) everyone else uses a 8bit->32bit look up table which simply contains the 32bit crc for each 8bit data polynom, more precissely:
LUT[i]= (i<<|CRC_POLY|) GF2_MOD CRC_POLY where GF2_MOD is the modulo operation between 2 binary polynoms over GF(2)
the actual code which uses this LUT looks approximately like:

for (i = 0; i < length; i++)
    crc = LUT[(crc&0xFF) ^ *buffer++] ^ (crc >>8);

a much faster method is to use 4 LUTs so that they gives the crc checksum 0,1,2,3 bytes ahead:
LUT[j][i]= (i<<(|CRC_POLY| + j*8)) GF2_MOD CRC_POLY
with that we can calculate the crc then by:

while(buffer<end-3){
    crc ^= le2me_32(*(uint32_t*)buffer); buffer+=4;
    crc =  tab[3][ crc     &0xFF]
          ^tab[2][(crc>>8 )&0xFF]
          ^tab[1][(crc>>16)&0xFF]
          ^tab[0][(crc>>24)     ];
}
while(buffer<end)
    crc = tab[0][(crc&0xFF) ^ *buffer++] ^ (crc >>8);

actually working code should be in ffmpeg/libavutil soon, if not ill post my messy proof of concept code

Filed under: Optimization — Michael @ 02:43

5 Comments »

  1. I have a rs485 based motor controller that requires this rather arcane protocol. the plc that we use does not have the ability to transmit anything less than 32bit words. eg, i can get it to send a 32bit crc but all this 8bit who-ha i have to write from scratch.

    i have already written a preamble, command code, group address, and data byte generators. i do have a 2s complement checksum generator, but it takes up too much space in the processor, causing it to run slow and screw up the timing for the motor controller.

    Comment by Jeff — 2007-01-30 @ 22:58

  2. Your method uses 9 ops instead of 12 for every 4 bytes. However, your table size is 4K instead of 1K. Assuming you are computing a checksum on a 1K block, and the tables are not already in L1 or L2 cache (i.e., this is a real world application, not an artificial benchmark), its hard for me to see how hitting 5K of memory is going to be faster than hitting only 2K of memory, even if you do 768 fewer XOR’s.

    Comment by Allen — 2007-09-08 @ 18:20

  3. > Your method uses 9 ops instead of 12 for every 4 bytes. However, your table size is 4K instead
    > of 1K. Assuming you are computing a checksum on a 1K block, and the tables are not already in
    > L1 or L2 cache (i.e., this is a real world application, not an artificial benchmark), its
    > hard for me to see how hitting 5K of memory is going to be faster than hitting only 2K of
    > memory, even if you do 768 fewer XOR’s.

    if the 4k table isnt in your L2 cache then CRC calculation is only an insignificant part of what your application does (after all your application must read 512k to flush the 4k out of a hypothetical 512k L2 cache) after each 1k block
    in that case indeed you would be better out with a 1k LUT based algorithm though as crc calculations would be only an insignificant part, it would only make a very small difference in overall speed

    the same argument can also be extended to blocks with 256 bytes and an implementation of 1k LUT vs. one with a 64byte LUT (which needs about twice as many ops per byte)

    Comment by Michael — 2007-09-10 @ 13:01

  4. Thanks a lot for the implementation. It seems to beat linux’s crc32() in real workload (141MB/s vs. 203MB/s) and is over twice faster in a naive benchmark you can find at http://iki.fi/lindi/darcs/crcbenchmark/

    Comment by Timo Lindfors — 2008-06-10 @ 10:34

  5. That was a very nice optimization. I was trying to bend my head around the possibility of doing something like that for CRC-24, but so far I have come up with nothing. As there is no uint24_t that I can do aligned reads with, and since I guess I could at most have 3 LUTs anyway, there may not be any way to translate your idea into a performance gain for CRC-24?

    Comment by Jesper — 2010-05-14 @ 23:20

RSS feed for comments on this post. TrackBack URI

Leave a comment

Powered by WordPress