Lair Of The Multimedia Guru

2005-12-30

Principal components analysis / Karhunen-Loève transform

The PCA/KLT is often said to be the optimal linear transform for video coding, while the DCT is described as a similar, faster and simpler transform.
This isn't really true; the KLT is not that optimal for video coding at all. What the KLT is, is the optimal orthogonal transform for compacting the energy of a vector into its first n components. What might be better is a transform, preferably (near) orthogonal, which compacts the energy into few components, not necessarily the first few!
For example, consider 1-D 8 component vectors and assume our data set is made entirely of piecewise constants. Then the following basis functions would allow us to store these exactly with 1+n basis functions, where n is the number of discontinuities. Please note that this is just an example where the KLT isn't that optimal; 1-D piecewise constants are probably not an ideal model for images …

1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
0 0 1 1 1 1 1 1
0 0 0 1 1 1 1 1
0 0 0 0 1 1 1 1
0 0 0 0 0 1 1 1
0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 1

OK, this is quite a non-orthogonal set of basis functions, which has its problems too. A less compact but orthogonal set (if the basis functions are normalized, which for the sake of readability hasn't been done), still better than the KLT, would be based upon the following basis functions:

1 1 1 1 1 1 1 1
1 1 1 1-1-1-1-1
1 1-1-1 0 0 0 0
0 0 0 0 1 1-1-1
1-1 0 0 0 0 0 0
0 0 1-1 0 0 0 0
0 0 0 0 1-1 0 0 
0 0 0 0 0 0 1-1

Here a piecewise constant with 1 discontinuity can be represented by 2-4 basis functions, whereas the KLT case below will need all 8 for an exact representation.

 0.290596  0.341669  0.377247  0.395479  0.395472  0.377227  0.341853  0.290749  
-0.490579 -0.416017 -0.277788 -0.097204  0.097447  0.277903  0.415566  0.490129      
 0.473927  0.255266 -0.138718 -0.437652 -0.436948 -0.137245  0.255345  0.473804       
-0.415204  0.097424  0.490934  0.276661 -0.277947 -0.490521 -0.097391  0.416181       
-0.382979  0.320320  0.364628 -0.343399 -0.342868  0.364783  0.320177 -0.383024       
-0.278541  0.491575 -0.099199 -0.414873  0.416580  0.096333 -0.489054  0.277157        
-0.210508  0.464503 -0.455644  0.190764  0.181391 -0.451288  0.463825 -0.210293        
-0.096505  0.274820 -0.412134  0.489524 -0.491862  0.418348 -0.280951  0.098971

This has been generated by the (hopefully not buggy) pca.c/pca.h.
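
For the curious, here is a minimal sketch of how such a basis could be computed. This is not the pca.c mentioned above; the piecewise constant training model and all names are my own assumptions, and only the first basis function is extracted via power iteration, the remaining ones would need a full eigendecomposition or repeated deflation.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 8
#define SAMPLES 100000

int main(void){
    double cov[N][N]= {{0}}, mean[N]= {0}, v[N];
    int i, j, k;

    /* accumulate mean and covariance of random 1-D piecewise constants */
    for(k=0; k<SAMPLES; k++){
        double sample[N];
        int step= 1 + rand()%(N-1);          /* position of the discontinuity */
        double a= rand()/(double)RAND_MAX;
        double b= rand()/(double)RAND_MAX;
        for(i=0; i<N; i++)
            sample[i]= i<step ? a : b;
        for(i=0; i<N; i++){
            mean[i]+= sample[i];
            for(j=0; j<N; j++)
                cov[i][j]+= sample[i]*sample[j];
        }
    }
    for(i=0; i<N; i++)
        for(j=0; j<N; j++)
            cov[i][j]= cov[i][j]/SAMPLES - mean[i]*mean[j]/((double)SAMPLES*SAMPLES);

    /* power iteration: converges to the eigenvector with the largest
       eigenvalue of the covariance matrix, i.e. the first KLT basis function */
    for(i=0; i<N; i++) v[i]= 1;
    for(k=0; k<1000; k++){
        double t[N], norm= 0;
        for(i=0; i<N; i++){
            t[i]= 0;
            for(j=0; j<N; j++)
                t[i]+= cov[i][j]*v[j];
        }
        for(i=0; i<N; i++) norm+= t[i]*t[i];
        norm= sqrt(norm);
        for(i=0; i<N; i++) v[i]= t[i]/norm;
    }
    for(i=0; i<N; i++)
        printf("%9.6f ", v[i]);
    printf("\n");
    return 0;
}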

Another way to see the sub-optimality of the KLT is to consider the last (high frequency) basis functions: in many images and videos they are simply not used. If they were replaced by commonly occurring patterns which, without them, would need many basis functions to be represented accurately, then the compression rate could be improved; the resulting transform would be non-orthogonal though …

Filed under: DCT,VideoCoding — Michael @ 01:36

2005-12-28

Extracting scantables from binary video codecs

What's a scantable? A thing that tells you where to put the coefficients which you got from VLC/RLE decoding. So, for example, with the famous zigzag table

    0,   1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63

the first coefficient would be written at position 0, the second one at 1, the third at 8 and so on. As the destination array here is an 8×8 matrix, this produces a nice zigzag from the top left corner to the bottom right corner, ordering the DCT coefficients approximately from low frequency to high, and as the high ones are almost always zero, the VLC-RLE coding before it ends up being quite effective.
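
For illustration, a minimal sketch of how such a table is applied during decoding (the variable names here are my own, not from any particular decoder):

/* 'level' holds the coefficients in the order they were decoded,
   'block' is the 8x8 destination matrix, 'zigzag_scan' is the table above */
for(i=0; i<=last_nonzero_index; i++)
    block[zigzag_scan[i]]= level[i];
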
But back to the topic: what if we, for whatever odd reason, want to know whether a binary contains such a table? It's quite easy to find such tables by brute force: their size is something like 16 or 64 entries, with no duplicates, the first entry most often being 0, and none of the entries should be larger than the number of entries in the table. Here's some (old) code which I wrote quite some time ago that implements this and a few other things; a minimal sketch of the idea follows below.
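
A minimal sketch of such a brute force search (my own illustration, not the linked tool; it assumes the table is stored as one byte per entry, which need not be the case in a given binary):

#include <stdio.h>

/* does p[0..n-1] look like an n entry scantable? */
static int is_scantable(const unsigned char *p, int n){
    int seen[64]= {0};
    int i;
    if(p[0] != 0)                    /* first entry is almost always 0 */
        return 0;
    for(i=0; i<n; i++){
        if(p[i] >= n || seen[p[i]])  /* entry out of range or duplicated */
            return 0;
        seen[p[i]]= 1;
    }
    return 1;
}

static void find_scantables(const unsigned char *buf, int size, int n){
    int i;
    for(i=0; i+n <= size; i++)
        if(is_scantable(buf+i, n))
            printf("possible %d entry scantable at offset 0x%X\n", n, i);
}

Tables stored as 16 or 32 bit entries, or without the leading 0, would need correspondingly relaxed checks.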

And now the interesting part: what do we find with this?

 0  1  4  8 
 5  2  3  6 
 9 12 13 10 
 7 11 14 15

o-->o   o-->o
  /   /   /
o   o   o   o
| /   /   / |
o   o   o   o
  /   /   /
o-->o   o-->o


and

 0  1  2  6 
10  3  7 11 
 4  8  5  9 
12 13 14 15  

 o-->o-->o   o
         |  /|
 o   o   o / o
 | / |   |/  |
 o   o   o   o
   /
 o-->o-->o-->o
(dual scan table ascii art stolen from libavcodec/svq3.c)

Both were found in drv3.so.6.0.
Note that these tables are the 4×4 zigzag and dual-scan tables from an old H.264 draft, so RV30 seems to be an H.264 variant like SVQ3.

Filed under: Reverse Engineering,VideoCoding — Michael @ 00:43

2005-12-20

Profiling / Benchmarking code

You have a piece of code, it's too slow, so you want to optimize it. But to do that you first need to be able to measure its speed, otherwise you won't know whether a change improved the speed or not.

Benchmarking a piece of code sounds easy but it isn't. You can try to do it with various standard “get the current time” functions, but these tend to be very inaccurate. Canned tools like gprof are an option too, but in my experience they are also inaccurate and can produce highly misleading results due to function inlining. A more accurate method on the x86 architecture is to use rdtsc.

So now we have an accurate method to get the current “time”, but what now? Simply running the code once and getting the time before and afterwards? That would be a very bad choice, as the first time a piece of code is executed it is not in the code cache, the data isn't in the data cache, branch prediction has not seen the code yet, … So we could ignore the first few iterations, but a single measurement still isn't very trustworthy. The next improvement would be to average many iterations, but that is still very sensitive: just think about what happens when the kernel decides to stop your program and give the CPU to your mp3 player, it's like having an iteration which needs 50 times longer than the average (assuming the code between the rdtsc instructions doesn't take too long). Solving this is easy though: just discard such outliers which need many times the average (see the START/STOP_TIMER macros in ffmpeg/libavutil/common.h).
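
For illustration, here is a minimal sketch along those lines (it assumes x86 and GCC inline assembly; the real, more careful implementation is the START_TIMER/STOP_TIMER macro pair in ffmpeg/libavutil/common.h, and the 8×-average outlier threshold below is just an arbitrary example value):

#include <inttypes.h>
#include <stdio.h>

static inline uint64_t read_tsc(void){
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi<<32) | lo;
}

static uint64_t bench_sum= 0;
static int bench_count= 0, bench_skipped= 0;

/* usage, inside a block:  BENCH_START  code_to_measure();  BENCH_STOP */
#define BENCH_START { uint64_t bench_t0= read_tsc();
#define BENCH_STOP \
    { uint64_t bench_t= read_tsc() - bench_t0; \
      if(!bench_count || bench_t < 8*bench_sum/bench_count){ \
          bench_sum+= bench_t; bench_count++; \
      }else \
          bench_skipped++; \
      if(((bench_count + bench_skipped) & 1023) == 0) \
          printf("%"PRIu64" cycles avg, %d runs, %d skipped\n", \
                 bench_sum/bench_count, bench_count, bench_skipped); \
    } }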

Filed under: Optimization — Michael @ 13:27

2005-12-07

Correcting byte and burst errors with CRCs

Can CRCs be used to fix a whole messed up byte? Yes, as long as the codeword (data bits + CRC bits) is shorter than the value listed in this table under b≤8:

generator polynomial b≤2 b≤3 b≤4 b≤5 b≤6 b≤7 b≤8 b≤9 b≤10 b≤11 b≤12 b≤13 b≤14
CRC-4 0xF0000000 8 6 6 6 6 6 6 6 6 6 6 6 6
CRC-5 0x28000000 16 9 7 7 7 7 7 7 7 7 7 7 7
CRC-7 0xA2000000 18 18 9 9 9 9 9 9 9 9 9 9 9
CRC-7 0x12000000 34 28 12 9 9 9 9 9 9 9 9 9 9
CRC-7 0x6E000000 66 12 9 9 9 9 9 9 9 9 9 9 9
CRC-8 0x07000000 130 12 12 10 10 10 10 10 10 10 10 10 10
CRC-8 0x39000000 20 20 11 10 10 10 10 10 10 10 10 10 10
CRC-8 0xD5000000 96 24 12 10 10 10 10 10 10 10 10 10 10
CRC-12 0x80F00000 2050 555 16 16 16 14 14 14 14 14 14 14 14
CRC-12 0x80B00000 513 259 16 16 16 14 14 14 14 14 14 14 14
CRC-12 0x1F100000 68 68 21 15 15 14 14 14 14 14 14 14 14
CRC-15 0x8B320000 130 130 130 130 58 27 17 17 17 17 17 17 17
CRC-16 0x10210000 32770 7144 4181 175 19 19 19 18 18 18 18 18 18
CRC-16 0x80050000 32770 19 19 19 19 19 19 18 18 18 18 18 18
CRC-16 0xA0970000 16386 931 862 76 76 65 22 18 18 18 18 18 18
CRC-16 0xC5030000 7164 7164 188 188 164 145 28 18 18 18 18 18 18
CRC-16 0x90D90000 154 154 154 154 154 77 21 18 18 18 18 18 18
CRC-24 0x80510100 7164 7164 1028 1028 1028 1028 1028 1028 348 30 30 26 26
CRC-32 0x04C11DB7 376820511 376820511 30435040 14373578 14373578 3932619 1077949 49616 11995 5682 1731 732 40
CRC-32 0x404098E2 1026 1026 1026 1026 1026 1026 1026 1026 1026 1026 241 229 114
CRC-32 0x1EDC6F41 2147483650 258958121 193439312 62023781 3040389 1847228 603132 98658 4913 3356 1104 86 86
CRC-32 0x741B8CD7 114698 114698 16390 16390 6361 3955 1601 120 120 120 120 77 49
CRC-32 0xF4ACFB13 32770 32770 32770 32770 32770 32770 32770 32770 6508 3052 1696 152 152
CRC-32 0x32583499 32772 32772 11340 11340 6230 5348 324 324 324 324 156 44 34
CRC-32 0x20044009 32772 32772 3792 3792 3792 3792 620 360 302 302 52 52 52
CRC-32 0xA833982B 65540 65540 1928 1928 1928 1928 1928 1593 203 203 203 66 66
CRC-32 0x00210801 65540 65540 15207 15207 3211 3211 959 83 83 83 39 39 39

The table was created with crc_test_v2.c.

To actually correct byte errors, the code below could be used; a complete example (crc_test_byte.c) is available too.

static int get_error(unsigned int crc, unsigned int len, int *error){
    int i;

    /* G is the generator polynomial from the table above, left aligned to
       32 bits; crc is the syndrome (computed CRC XOR received CRC) */
    for(i=0; i<len; i++){
        if(!(crc & ~255)){
            *error= crc;   /* 8 bit XOR pattern of the damaged byte          */
            return i;      /* distance of that byte from the end, in bits    */
        }
        /* shift the syndrome back by one bit position */
        crc= (crc>>1) ^ (((G>>1)|0x80000000) & (-(crc&1)));
    }
    return -1;             /* not correctable as a single byte error */
}
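
For illustration, a usage sketch (my own, not taken from crc_test_byte.c; the exact bit ordering conventions depend on how the CRC is computed, so the linked example is the authoritative version):

/* the syndrome fed to get_error() is the CRC computed over the received
   data XORed with the CRC that was received along with it */
unsigned int syndrome= crc_calculated ^ crc_received;   /* 0 means no error */
int pattern;
int pos= get_error(syndrome, len, &pattern);
if(pos >= 0)
    printf("byte error %d bits before the end, XOR pattern 0x%02X\n", pos, pattern);
else
    printf("not correctable as a single byte error\n");
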
Filed under: Error Correcting Codes — Michael @ 23:08

2005-12-06

Correcting 2 bit errors with CRCs

1 and 2 bit errors can easily be corrected by using the following code snippets; a complete example is available too.

    unsigned int i, v=1;

    /* build a table of syndromes: entry i holds the syndrome of a bit error
       at distance 0 combined with one at distance i (entry 0 is the single
       bit error syndrome), together with the distance i itself */
    for(i=0; i<BLOCK_SIZE; i++){
        crctab[i][0]= i ? (v^1) : v;
        crctab[i][1]= i;
        v= (v<<1) ^ (G & (((int)v)>>31));   /* next single bit error syndrome */
    }

    /* sort by syndrome so it can be bsearch()ed later */
    qsort(crctab, BLOCK_SIZE, 2*sizeof(int), cmp);

static int get_error(unsigned int crc, unsigned int len, int error[2]){
    int i;

    for(i=0; i<len ; i++){
        /* does the current syndrome match a 1 or 2 bit error pattern? */
        int *result= bsearch(&crc, crctab, BLOCK_SIZE, 2*sizeof(int), cmp);
        if(result){
            error[0]= i;                   /* distance of the first error from the end  */
            error[1]= result[1] + i;       /* distance of the second error, if any      */
            return 1 + (result != crctab[0]);  /* 1 = single bit, 2 = two bit error     */
        }
        /* shift the syndrome back by one bit position */
        crc= (crc>>1) ^ (((G>>1)|0x80000000) & (-(crc&1)));
    }
    return -1;
}
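
The snippets above leave a few things undefined; a hedged guess at what they could look like (the complete linked example is the authoritative version, the concrete values here are just examples):

#define BLOCK_SIZE (1<<16)      /* codeword length in bits, example value      */
#define G          0x04C11DB7   /* generator polynomial, example value         */

static unsigned int crctab[BLOCK_SIZE][2];  /* [0]= syndrome, [1]= bit distance */

/* compare two table entries by their syndrome (the first of the 2 ints) */
static int cmp(const void *a, const void *b){
    unsigned int x= *(const unsigned int*)a;
    unsigned int y= *(const unsigned int*)b;
    return x < y ? -1 : x > y;
}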

Note: if you want to use this in anything where speed matters, then you should replace the qsort/bsearch with some hash table. I didn't, as the C standard doesn't contain any usable hashtable implementation and I was lazy.

Filed under: Error Correcting Codes — Michael @ 00:18

2005-12-05

(Mis)using CRCs for error correction

Cyclic Redundancy Check codes are commonly used for detecting errors, but they can also be used to correct single and multibit errors. To be able to correct e errors per code word and detect code words with d errors, the minimum Hamming distance of the code needs to be > 2e+d.
So what is the relation between the code word length and the minimum Hamming distance for CRC codes?

generator polynomial hd≥3 hd≥4 hd≥5 hd≥6 hd≥7 hd≥8
CRC- 4 0xF0000000 5 5 5 5 5 5
CRC- 5 0x28000000 31 4 4 4 4 4
CRC- 7 0xA2000000 15 15 7 7 7 7
CRC- 7 0x12000000 127 6 6 6 6 6
CRC- 7 0x6E000000 63 63 9 9 7 7
CRC- 8 0x07000000 127 127 8 8 8 8
CRC- 8 0x39000000 17 17 17 7 7 7
CRC- 8 0xD5000000 93 93 10 10 8 8
CRC-12 0x80F00000 2047 2047 13 13 12 12
CRC-12 0x80B00000 510 171 36 13 13 13
CRC-12 0x1F100000 65 65 65 21 13 11
CRC-15 0x8B320000 127 127 127 127 22 22
CRC-16 0x10210000 32767 32767 16 16 16 16
CRC-16 0x80050000 32767 32767 16 16 16 16
CRC-16 0xA0970000 32766 32766 83 83 24 24
CRC-16 0xC5030000 7161 496 85 36 24 22
CRC-16 0x90D90000 151 151 151 151 21 21
CRC-24 0x80510100 7161 7161 1030 1030 24 24
CRC-32 0x04C11DB7 4294967295 91638 3006 299 203 122
CRC-32 0x404098E2 1024 1023 1023 1023 1023 1023
CRC-32 0x1EDC6F41 2147483647 2147483647 5275 5275 209 209
CRC-32 0x741B8CD7 114695 114695 16392 16392 184 184
CRC-32 0xF4ACFB13 65534 65534 32768 32768 306 306
CRC-32 0x32583499 65538 65538 32770 32770 166 166
CRC-32 0x20044009 65538 65538 32770 32770 32 32
CRC-32 0xA833982B 65537 65537 65537 1091 113 89
CRC-32 0x00210801 65537 65537 65537 31 31 31

These were found using crc_test.c.
Actually correcting errors is pretty trivial. For a single bit error, feeding the code below with the calculated CRC XORed with the received CRC will give you the distance from the end where the error occurred, and yes, if they match (crc=0) then there is no single bit error.

int get_single_error(int crc, int len){
    int i, v=1;

    /* v runs through the syndromes of a single bit error at distance i from
       the end; when it matches the given syndrome, the error is found */
    for(i=0; i<len ; i++){
        if(v == crc)
            return i;
        v= (v<<1) ^ (generator_polynom & (v>>31));
    }
    return -1;   /* no single bit error explains this syndrome */
}
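
For illustration, a sketch of how the returned distance could be used (my own, under the assumption that the codeword is byte aligned and stored MSB first in buf, with len being its length in bits):

unsigned int syndrome= crc_calculated ^ crc_received;   /* 0 means no error */
int dist= get_single_error(syndrome, len);
if(dist >= 0){
    int bit= len - 1 - dist;           /* bit index counted from the start */
    buf[bit>>3] ^= 0x80 >> (bit&7);    /* flip the damaged bit */
}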

I'll post some code examples to fix multibit errors later …

Filed under: Error Correcting Codes — Michael @ 20:23

2005-12-03

PP filter comparison update

I've added some -vf uspp (postprocessing by (mis)using the Snow codec) based images to pp vs. spp filters. The reason for uspp=6 instead of higher values is that higher ones wake the OOM killer :-) and the reason for the lack of pp=ac/ha/va based samples is that they look identical to the pp=de/hb/vb ones.

Filed under: Pictures,PostProcessing,Wavelets — Michael @ 14:21

2005-11-28

15 reasons why MPEG4 sucks

  • no loop filter to reduce blocking artifacts; H.263, upon which MPEG4 is based, had one, and blocking artifacts are the thing people most often complain about when looking at MPEG4 videos. Why can't postprocessing filters cover this up? Well, it's much harder to remove multiple overlapping and randomly shifted blocking artifacts than ones which lie exactly on an 8×8 grid and appear only around macroblocks with specific parameters …
  • block boundary mirroring for qpel; this is AFAIK needed because there is no deblocking loop filter: without mirroring the filter would cross many more edges created by blocking artifacts and create a random mess. Mirroring also reduces the number of bytes which need to be fetched for each block. OK, so why am I complaining? Well, it's a mess to implement, and it forces the encoder to do redundant calculations as each block needs things mirrored differently
  • all the startcode avoidance and redundant marker bits; there are really just 2 possibilities: either the stream is undamaged, in which case we don't need to search for startcodes unless we try to seek in raw MPEG4 (.m4v, NOT .mp4!), or it is damaged, in which case the data can contain anything anyway. And if all this mess was for seeking in raw MPEG4, why are there no reliable timestamps? Additionally, all non-MPEG/ITU codecs work perfectly fine without this startcode emulation prevention thing
  • OBMC is in the spec, there's a bit in the header to enable it, but oops, it's not allowed in ANY profile
  • the IDCT, see the previous blog entry
  • GMC is very computationally intensive, and according to some documents from the JVT the only reason why there is any quality/compression gain at all is that the motion vector for skipped macroblocks is the GMC vector, where without GMC it would be (0,0); simply changing that to the predicted median vector should make the tiny gain GMC has disappear
  • a single set of fixed VLC tables, why? It's just silly; several fixed sets, or variable ones stored in a global header, would mean very little additional complexity, IMHO at least
  • the short header format (= every MPEG4 decoder must support simple H.263)
  • different motion vector prediction in B frames, why? Using the same as in P frames would be simpler and very likely higher quality
  • no intra macroblocks in B frames
  • macroblocks in B frames must be skipped if they were skipped in the last P frame; thanks to this idiotic idea encoders must check all B frames before they can mark a MB in a P frame as skipped
  • no macroblocks with 4MV and dquant at the same time; this requires special handling in the encoder and has no advantage at all, H.263 demonstrates how to do it correctly
  • all the “flash” animation multi-object stuff in MPEG4
  • dquant is limited to ±2, so encoders must somehow convert the ideal QP values from lumimasking/… into ones which don't change too much between MBs
  • QP is limited to either even or odd values in B frames; yeah, the poor encoder must decide which it wants, it can't have both. And according to the comments in ffmpeg, working around all these arbitrary MPEG4 QP limitations needs more code than the actual QP selection
Filed under: VideoCoding — Michael @ 21:31

The MPEG1/2/4 and H.261/2/3 IDCT

The 8×8 inverse discrete cosine transform used in most video codecs converts the very sparse and thus easily compressible DCT coefficient matrices into the visible 8×8 blocks, or into the 8×8 difference relative to some area of the previous frame.
The IDCT in these codecs is not exactly specified; instead the MPEG and ITU committees only require the used IDCT to be approximately equal to an idealized and very slow reference IDCT. Because of that, decoders and encoders use various different IDCTs, and if the encoder and decoder happen to use two sufficiently different IDCTs, and the video material, encoding parameters, moon phase and so on match, then the tiny ±1 differences between the IDCTs will accumulate and turn the pale face of a corpse pink or green, or add stripes, or worse …
Some examples of such artifacts:

[two sets of example images, each decoded with: LLM/IJG int IDCT, simpleidct, libmpeg2 IDCT, XviD's IDCT]

It's also very important to keep in mind that these artifacts are caused by the difference between the encoder and decoder IDCTs, not by the difference to some idealized IDCT on the decoder side alone, so using the reference IDCT on the decoder side will often not help at all, and sometimes might even make the artifacts worse.
Now maybe you are curious which IDCT was used in the encoders for the 2 examples above; well, I am curious about that too, as there is no 100% certain way to find out, although I would guess the first used the IDCT from the msmpeg4v3 codec and the second the one from XviD.
Which parameters can be tuned on the encoder side to reduce these artifacts? First use a smaller keyframe interval, then avoid qp=1, avoid qpel and use lots of B frames, or use H.264, which doesn't suffer from this mess as it has an exactly specified IDCT approximation.

Filed under: DCT,VideoCoding — Michael @ 16:37

pp vs. spp filters

Comparing the pp and spp filters from MPlayer/ffmpeg on the standard 256×256 Lena image; the unfiltered one is a simple 10% quality JPEG saved by GIMP.

[comparison images: original, unfiltered, pp=hb/vb/dr/fq=8/16/32, spp=6:8/16/32, uspp=6:8/16/32]
Filed under: Pictures,PostProcessing — Michael @ 03:33
