VideoCoding

In the last few weeks FFmpegs huffyuv codec has grown the ability to encode and decode a much broader list of pixel formats. From planar rgb variants 4:4:4 YUV to 4:1:0 YUV and bit depths up to 16bit and alpha support.
One might ask why anyone cares about all this, the awnser is simple

1 thread YUV420 10bit, matrixbench
FFv1 Huffyuv
encode fps 63fps 281fps
decode fps 82fps 418fps
filesize 478mb 765mb

Encoding parameters used:
-an -threads 1 -pix_fmt yuv420p10le -strict -2 -vcodec ffvhuff -context 1
-an -threads 1 -pix_fmt yuv420p10le -strict -2 -vcodec ffv1

6 threads YUV420 10bit, matrixbench
FFv1 Huffyuv
encode fps 269fps 1173fps
decode fps 386fps 2675fps
filesize 480mb 771mb

Encoding parameters used:
-threads 6 -an -pix_fmt yuv420p10le -strict -2 -vcodec ffvhuff -pass 2 test.nut
-threads 6 -an -pix_fmt yuv420p10le -strict -2 -vcodec ffv1 -level 3 -slices 6 test1.nut

As can be seen above huffyuv while it doesnt even get close to ffv1 in terms of compression even though i used per frame huffman tables and 2pass mode in the 2 tests, its simply much faster.

Google is planing to drop h264 support from the video tag in chrome html-video-codec-support-in-chrome and more-about-chrome-html-video-codec. With some blah blah about open and patents

  • real network, rv1, rv2, rv3, rv4 failed
  • Apple SVQ1, SVQ3 (there was no 2) failed
  • Microsoft MSMpeg4v123 WMV* failed
  • On2 VP* failed
  • Google VP ehm i mean webm ….

About open, ITU H.264 was developed on a public mailing list namely JVT-experts, using public FTP containing software, spec drafts, proposed changes, test results, meeting protocols and god knows what else. Anyone could read in near realtime what was discussed, proposed,changed and why and any expert that wanted to contribute iam pretty darn sure would have been taken serious. google webm is On2 last video codec, that to the best of my knowledge was developed behind closed doors by On2 and On2 people where on the jvt-expert list.

So calling webm more open than h264 is a insult at best, calling it less patent encumbered is something that will only be found out once its widespreadly used and third party companies have enough financial interrest in cross checking their patents against it. And about which has better compression vs quality, that was elaborated by others to great lengths already. I guess google needs a new motto, like dont be stupid

The PCA/KLT is often said to be the optimal linear transform for video coding while the DCT is a similar, faster and simpler transform
This isnt really true, the KLT is not that optimal at all for video coding, what the KLT is, is that it is the optimal orthogonal transform for compacting the energy of a vector into its first n components, what might be better is a transform, preferably (near) orthogonal which compacts the energy into few components, not neccessarily the first few!
For example, if we consider 1D 8 component vectors, lets assume our data set is entirely made of piecewice constants then the following basis functions would allow us to exactly store these with 1+n basis functions where n is the number of discontinuities, and please note this here is just an example where the KLT isnt that optimal, 1-D piecewise constants are probably not a ideal model for images …

1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
0 0 1 1 1 1 1 1
0 0 0 1 1 1 1 1
0 0 0 0 1 1 1 1
0 0 0 0 0 1 1 1
0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 1

ok this is quite a non orthogonal set of basis functions which has its problems too, a less compact and orthogonal (if the basis functions are normalized, which for sake of readability hasnt been done) but still better then KLT transform would be based upon the following basis functions:

1 1 1 1 1 1 1 1
1 1 1 1-1-1-1-1
1 1-1-1 0 0 0 0
0 0 0 0 1 1-1-1
1-1 0 0 0 0 0 0
0 0 1-1 0 0 0 0
0 0 0 0 1-1 0 0 
0 0 0 0 0 0 1-1

here a piecewise constant with 1 discontinuity can be represented by 2-4 basis functions where the KLT case below will need all 8 for an exact representation

 0.290596  0.341669  0.377247  0.395479  0.395472  0.377227  0.341853  0.290749  
-0.490579 -0.416017 -0.277788 -0.097204  0.097447  0.277903  0.415566  0.490129      
 0.473927  0.255266 -0.138718 -0.437652 -0.436948 -0.137245  0.255345  0.473804       
-0.415204  0.097424  0.490934  0.276661 -0.277947 -0.490521 -0.097391  0.416181       
-0.382979  0.320320  0.364628 -0.343399 -0.342868  0.364783  0.320177 -0.383024       
-0.278541  0.491575 -0.099199 -0.414873  0.416580  0.096333 -0.489054  0.277157        
-0.210508  0.464503 -0.455644  0.190764  0.181391 -0.451288  0.463825 -0.210293        
-0.096505  0.274820 -0.412134  0.489524 -0.491862  0.418348 -0.280951  0.098971

this has been generated by (the hopefully not buggy) pca.c/pca.h

Another way to see the sub-optimality of the KLT is to consider the last (high frquency) basis functions, in many images and videos they are simply not used, if they would be replaced by commonly occuring patterns which without them would need many basis functions to be accurately represented then the compression-rate could be improved, the resultig transform would be non orthogonal though …

Whats a scantable? A thing that tells you where to put the coefficients which you got from vlc/rle decoding, so for example with the famous zigzag table

    0,   1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63

the first Coefficient would be written at position 0, the second one at 1, third at 8 and so on. As the destination array is a 8×8 matrix here this produces a nice zigzag from the top left corner to the bottom right corner, ordering the dct coefficients approximately from low frequency to high, and as the high ones are almost always zero the vlc-rle coding before ends up being quite effective, but back to the topic, what if we for whatever odd reason want to know if a binary contains such a table?
its quite easy to find such tables by brute force, their size is something like 16 or 64 entries with no duplicates and the first entry most often being 0 and none of their entries should be larger then the number of entries in the table, heres some (old) code which i wrote quite some time ago to implement this and a few other things

And now the interresting part, what do we find with this?

 0  1  4  8 
 5  2  3  6 
 9 12 13 10 
 7 11 14 15

o-->o   o-->o
  /   /   /
o   o   o   o
| /   /   / |
o   o   o   o
  /   /   /
o-->o   o-->o


and

 0  1  2  6 
10  3  7 11 
 4  8  5  9 
12 13 14 15  

 o-->o-->o   o
         |  /|
 o   o   o / o
 | / |   |/  |
 o   o   o   o
   /
 o-->o-->o-->o
(dual scan table ascii art stolen from libavcodec/svq3.c)

in drv3.so.6.0
note these tables are the 4×4 zigzag and dualscan tables from an old H.264 draft, so RV30 seems to be a H.264 variant like SVQ3

  • no loop filter to reduce blocking artefacts, H.263 upon which MPEG4 is based had one, and the thing people most often complain about when looking at MPEG4 videos are blocking artefacts, why cant postprocessing filters cover this up? well its much harder to remove multiple overlapping and randomliy shifted bocking artefacts instead of ones which are exactly at a 8×8 grid and only around macroblocks with specific parameters …
  • block boundary mirroring for qpel, this is AFAIK needed because there is no deblocking loop filter, if mirroring wouldnt be done then the filter would cross many more edges created by blocking artefacts and create a random mess, mirroring also reduces the number of bytes which need to be fetched for each block, ok but why am i complaining? well its a mess to implement and it forces the encoder to do redundant calculations as each block needs things mirrored differently
  • all the startcode avoidance and redundant marker bits, there are really just 2 possibilities, either the stream is undamaged in which case we dont need to search for startcodes unless we try to seek in raw mpeg4 (.m4v NOT .mp4!) or it is damaged in which case the data can contain anything anyway, now if all this mess was for seeking in raw mpeg4 why are there no reliably timestamps? additionally all non-MPEG/ITU codecs work perfectly fine without this startcode emulation prevention thing
  • OBMC is in the spec, theres a bit in the header to enable it but oops its not allowed in ANY profile
  • IDCT see prevous blog entry
  • GMC is very computationally intensive and according to some documents from the JVT the only reason why there is any quality/compression gain at all is that the motion vector for skiped macroblocks is the GMC vector where without GMC it would be (0,0) now simply changing that to the predicted median vector should make the tiny gain GMC has dissapear
  • a single set of fixed vlc tables, why? its just silly, several fixed sets or variable ones stored in a global header would mean very little additional complexity IMHO at least
  • short header format (=every MPEG4 decoder must support simple H.263)
  • different motion vector prediction in bframes, why? using the same as in P frames would be simpler and very likely higher quality
  • no intra macroblocks in b frames
  • macroblocks in B frames must be skiped if they where skiped in the last P frame, thanks to this idiotic idea encoders must check all b frames before they can mark a MB in a P frame as skiped
  • no macroblocks with 4mv and dquant at the same time, requires special handling in the encoder and no advantage at all, h.263 demonstrates how to do it correctly
  • all the “flash” animation multi object stuff in MPEG4
  • dquant is limited to +-2 so encoders must somehow convert the ideal QP values from lumimasking/… to ones which dont change too much between MBs
  • QP is limited to either even or odd values in b frames, yeah poor encoder must decide which it wants, cant have both, hmm and according to the comments in ffmpeg more code is needed for working around all these arbitarray mpeg4 QP limitations about then the actual QP selection needs

The 8×8 Inverse discrete cosine transform used in most video codecs converts the very sparse and thus easily compressible dct coefficient matrixes into the 8×8 blocks vissible or the 8×8 difference relative to some area from the previous frame
The IDCT in these codecs is not exactly specified, instead the MPEG and ITU commitees only require the used IDCT to be approximately equal to a idealized and very slow reference IDCT due to that, decoders and encoders use various different IDCTs, if the actually used encoder and decoder happen to use 2 different enough IDCTs and the video material, encoding parameters, moon phase and so on match then the tiny +-1 differences between the IDCTs will accumulate and turn the pale face of a corpse pink or green or add stripes or worse …
some examples of such artifacts:

LLM/IJG int simpleidct
libmpeg2 idct xvids idct
LLM/IJG int simpleidct
libmpeg2 idct xvids idct

Its also very important to keep in mind that these artifacts are caused by the difference between the encoder and decoder IDCT and not the difference to some idealized IDCT only on the decoder side, so using the reference IDCT on the decoder side will often not help at all, and sometimes might make the artifacts worse
now maybe you are curious which IDCT was used in the encoders for the 2 examples above, well iam curious too about that as theres no 100% certain way to find out, allthough i would guess the first used the idct from the msmpeg4v3 codec and the second the one from xvid
Which prameters can be tuned on the encoder side to reduce these artifacts? First use a smaller keyframe interval, then avoid qp=1, avoid qpel and use lots of b frames or use H.264 which doesnt suffer from this mess as it has a exactly specified idct approximation