Lair Of The Multimedia Guru

June 18, 2007

mnzip

I always wanted to write my own general purpose compressor; a short while ago I actually did. It's based on the BWT + MTF variant described in yesterday's blog post, plus snow's/ffv1's range coder with a state transition table borrowed from bbb/paq8. The source code, under GPL of course, is available too.

compressed file sizes of the Canterbury Corpus

                          alice29.txt  asyoulik.txt  cp.html  fields.c  grammar.lsp  kennedy.xls  lcet10.txt  plrabn12.txt   ptt5    sum  xargs.1
bzip2 1.0.3                     43202         39569     7624      3039         1283       130280      107706        145577  49759  12909     1762
7zip 4.43 beta                  48555         44606     7692      3076         1350        50904      119569        165462  42153   9619     1860
bbb ver. 1                      40839         37818     7736      3253         1349        76523      101117        135829  44816  12593     1792
mnzip r32 with plain MTF        42698         39286     7572      2962         1227        19804      105883        143634  50624  12591     1698
mnzip r32                       40950         37835     7431      2983         1237        19287      101140        137191  45604  12428     1699

Time needed to compress

                alice29.txt  asyoulik.txt  lcet10.txt  plrabn12.txt   cp.html  fields.c  grammar.lsp  kennedy.xls       ptt5       sum   xargs.1
bzip2 1.0.3        0m0.166s      0m0.133s    0m0.533s      0m0.633s  0m0.047s  0m0.037s     0m0.007s     0m1.062s   0m0.151s  0m0.056s  0m0.006s
7zip 4.43 beta     0m0.539s      0m0.417s    0m1.732s      0m2.161s  0m0.070s  0m0.035s     0m0.019s     0m6.048s   0m1.402s  0m0.105s  0m0.022s
bbb ver. 1         0m2.675s      0m2.271s    0m7.455s      0m8.599s  0m0.559s  0m0.344s     0m0.230s    0m17.446s  0m45.407s  0m0.813s  0m0.235s
mnzip r32          0m0.273s      0m0.206s    0m0.951s      0m1.099s  0m0.031s  0m0.012s     0m0.006s     0m3.545s   0m1.173s  0m0.051s  0m0.006s

Time needed to decompress

                alice29.txt  asyoulik.txt  lcet10.txt  plrabn12.txt   cp.html  fields.c  grammar.lsp  kennedy.xls      ptt5       sum   xargs.1
bzip2 1.0.3        0m0.063s      0m0.049s    0m0.177s      0m0.222s  0m0.007s  0m0.003s     0m0.002s     0m0.210s  0m0.053s  0m0.009s  0m0.003s
7zip 4.43 beta     0m0.033s      0m0.027s    0m0.066s      0m0.085s  0m0.009s  0m0.011s     0m0.007s     0m0.099s  0m0.043s  0m0.016s  0m0.006s
bbb ver. 1         0m2.265s      0m1.918s    0m6.015s      0m6.916s  0m0.511s  0m0.332s     0m0.231s    0m13.492s  0m6.660s  0m0.715s  0m0.237s
mnzip r32          0m0.073s      0m0.061s    0m0.215s      0m0.261s  0m0.010s  0m0.005s     0m0.003s     0m0.441s  0m0.155s  0m0.017s  0m0.002s

Options used were -9 for bzip2, -mx=9 for 7zip, and f for bbb to use the fast but memory-hungry mode (this doesn't affect the compression rate for bbb). The benchmark scores are based on single runs, not a proper mean, so don't take them too seriously, and I hope I've not messed up the file order ;)

Patches are welcome!

Filed under: Entropy Coding — Michael @ 1:20

11 Comments »

  1. Here are results for ppmd. The first size is with “-m256 -o16”, the second with “-m256 -o4”. The latter is faster than default-level bzip2. The -m argument probably has no effect with files this small. Most of these results are better than those of the compressors listed above; kennedy.xls seems to be some kind of bad case.

    file          -m256 -o16   -m256 -o4
    alice29.txt        38682       39909
    asyoulik.txt       36103       36561
    cp.html             6576        6712
    fields.c            2576        2751
    grammar.lsp         1069        1090
    kennedy.xls       173000      123468
    lcet10.txt         95624      101165
    plrabn12.txt      132753      134645
    ptt5               48511       50164
    sum                11274       11983
    xargs.1             1516        1523

    Comment by uau — June 19, 2007 @ 22:28

  2. PPM-based compressors were always among the best at compressing text; maybe you want to try paq too, I suspect it should perform even better at the cost of being unusably slow.
    The problem with PPM is that it's symmetric: decompression needs the same amount of time and space as compression, and it's memory hungry. It's especially bad if you try to decompress such a -m256 file on a computer with less than 256 MB of free physical memory (of course you can be lucky and the file was small and doesn't need the whole 256 MB).

    Comment by Michael — June 20, 2007 @ 0:25

  3. These test files are too small :) Of course 7z doesn’t do well, it’s designed for larger stuff; heck, its default dictionary size is 16M. Decompression is also very fast (much faster than mnzip or other BWT-based compressors) but from these benchmarks it seems to be hurt by its startup time, again it’s not designed for such tiny files (and not specifically tailored for text). As expected it wins on the big binary files.

    To make things interesting you could try concatenating the test set…

    I don’t like BWT compressors because the BWT seems too slow to me: it’s fast for compressing but slow for decompressing, so I’d rather take LZ with huge dictionaries.

    BTW, 7z supports both LZMA and PPMD, you can even have both in the same archive, so you can use PPMD for a solid block containing text files and LZMA for binary ones. BZip2 is also supported but most 3rd party implementations won’t decode such files.

    Comment by Small — June 26, 2007 @ 8:11

  4. > These test files are too small :) Of course 7z doesn’t do well, it’s designed for larger
    > stuff; heck, its default dictionary size is 16M.

    I've also run mnzip and 7z on the Large Corpus (it's on the same page as the Canterbury Corpus):
    -rw------- 1 michael michael 4047392 1997-03-21 05:31 bible.txt
    -rw-r----- 1 michael michael  885162 2007-06-18 00:31 bible.txt.7z
    -rw-r----- 1 michael michael  742228 2007-06-18 00:14 bible.txt.mnz
    -rw------- 1 michael michael 4638690 1997-02-26 00:12 E.coli
    -rw-r----- 1 michael michael 1185614 2007-06-18 00:32 E.coli.7z
    -rw-r----- 1 michael michael 1144784 2007-06-18 00:15 E.coli.mnz
    -rw------- 1 michael michael 2473400 1997-03-18 05:13 world192.txt
    -rw-r----- 1 michael michael  487613 2007-06-18 00:32 world192.txt.7z
    -rw-r----- 1 michael michael  404576 2007-06-18 00:15 world192.txt.mnz

    > Decompression is also very fast (much faster than mnzip or other BWT-based compressors) but
    > from these benchmarks it seems to be hurt by its startup time, again it’s not designed for
    > such tiny files (and not specifically tailored for text).

    LZ is fast at decompressing, yes; how much that matters depends on what it's used for. For communication like ssh, fast decompression + slow compression is not any better than if both are the same speed, and for decompressing software once at installation I'd say it's not that important either; network speed is more likely the limiting factor.

    > BTW, 7z supports both LZMA and PPMD, you can even have both in the same archive, so you can
    > use PPMD for a solid block containing text files and LZMA for binary ones. BZip2 is also
    > supported but most 3rd party implementations won’t decode such files.

    ppmd's memory requirements for decoding make it very problematic, it easily needs 20 times as much as lz and bwt for the same size of data

    Comment by Michael — June 27, 2007 @ 9:31

  5. The files in the “large” corpus are still so small that they fit in a single bwt transform block except for E.coli which goes a bit over. I tested with linux-2.6.20.tar earlier (250316800 bytes). I don’t have the exact numbers any more (and running them again would take over 10 minutes), but the approximate results were (numbers rounded down to MB, I think I remember them about right…):
    7z with “ultra” settings from man page: 35 MB
    mnzip with default blocksize: 38 MB
    mnzip with 32 MiB blocksize: 36 MB

    If I take just a single 4 MiB block from the beginning of the tar, then mnzip wins compressing it in isolation: 922092 bytes to 988826.

    Comment by uau — June 27, 2007 @ 23:06

  6. > ppmd's memory requirements for decoding make it very problematic, it easily needs 20 times as much as lz
    > and bwt for the same size of data

    In fact current mnzip uses more memory to decompress the 4 MiB linux tar part than ppmd does (with the default decompression mode in mnzip and the ppmd file compressed with -o6, which gives better compression than mnzip). Using the low-memory mode would lower the requirements, but in any case ppmd is not orders of magnitude worse (and ppmd could possibly use a smaller block size without losing in compression).

    Comment by uau — June 27, 2007 @ 23:50

  7. I wonder how it compares to lrzip

    http://ck.kolivas.org/apps/lrzip/README

    Comment by lu_zero — June 28, 2007 @ 21:05

  8. > In fact current mnzip uses more memory to decompress the 4 MiB linux tar part than ppmd does
    > (with the default decompression mode in mnzip and the ppmd file compressed with -o6, which
    > gives better compression than mnzip). Using the low-memory mode would lower the requirements,
    > but in any case ppmd is not orders of magnitude worse (and ppmd could possibly use a smaller
    > block size without losing in compression).

    mnzip's low-memory mode can be chosen at decompression time, so there's no problem decompressing normal mnzip files with it. With PPM the amount of memory can't so easily be cut to a quarter at decompression time; if it needs 100 MB to decompress then it needs 100 MB, you can't change that easily. If you have less physical memory, then you have a problem …

    the inverse BWT is a fairly simple transform and I suspect that it can be performed quickly with less than O(n) memory and a small number of passes over a temporary file

    and the most important part you ignore is that a single case where PPM needs a normal amount of memory doesn't help, as it does need significantly more in other cases, and being able to decompress only some files is not exactly good

    Comment by Michael — June 29, 2007 @ 13:27

  9. have you tried using divsufsort bwt algorithm? http://homepage3.nifty.com/wpage/software/libdivsufsort.html
    or msufsort? http://www.michael-maniscalco.com/msufsort.htm

    they are both very fast. could you try to use them in your program (ie. those bwt algos + your second stage algos)?

    Comment by donkey7 — July 16, 2007 @ 13:00

  10. > have you tried using divsufsort bwt algorithm?
    > http://homepage3.nifty.com/wpage/software/libdivsufsort.html
    > or msufsort? http://www.michael-maniscalco.com/msufsort.htm

    No, I've not tried them.

    > they are both very fast. could you try to use them in your program
    > (ie. those bwt algos + your second stage algos)?

    well, if you want you can try them; a patch which adds support for them would be welcome if they are faster or need less memory.
    I myself likely won't add support for them anytime soon if no one else does, but I am certainly willing to review patches, answer questions and help …

    Comment by Michael — July 17, 2007 @ 0:22

  11. It's better code:
    if((prob[ctx] += b*256 - p8) < 1) prob[ctx] = 1524;

    Comment by da0 — February 26, 2012 @ 5:05
