mnzip
I always wanted to write my own general-purpose compressor, and a short while ago I actually did. It's based on the BWT + MTF variant described in yesterday's blog post, plus snow's/ffv1's range coder with a state transition table borrowed from bbb/paq8. The source code is of course available too, under the GPL.
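To make the second stage concrete, here is a minimal sketch of plain move-to-front coding in C. This is a toy illustration under the usual textbook definition, not the actual mnzip code (mnzip's default is the MTF variant from yesterday's post):

```c
/* Plain move-to-front: replace each byte by its rank in a "recently
 * used" table, then move it to the front.  After a BWT, runs of equal
 * bytes become runs of zeros, which the entropy coder likes.
 * Toy sketch, not the actual mnzip implementation. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void mtf_encode(uint8_t *buf, size_t len)
{
    uint8_t table[256];
    for (int i = 0; i < 256; i++)
        table[i] = i;

    for (size_t i = 0; i < len; i++) {
        uint8_t c = buf[i];
        int rank = 0;
        while (table[rank] != c)          /* find the byte's current rank */
            rank++;
        memmove(table + 1, table, rank);  /* shift more recent bytes down */
        table[0] = c;                     /* c is now the most recent */
        buf[i] = (uint8_t)rank;           /* emit the old rank */
    }
}
```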
Compressed file sizes (in bytes) of the Canterbury Corpus

| compressor | alice29.txt | asyoulik.txt | cp.html | fields.c | grammar.lsp | kennedy.xls | lcet10.txt | plrabn12.txt | ptt5 | sum | xargs.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bzip2 1.0.3 | 43202 | 39569 | 7624 | 3039 | 1283 | 130280 | 107706 | 145577 | 49759 | 12909 | 1762 |
| 7zip 4.43 beta | 48555 | 44606 | 7692 | 3076 | 1350 | 50904 | 119569 | 165462 | 42153 | 9619 | 1860 |
| bbb ver. 1 | 40839 | 37818 | 7736 | 3253 | 1349 | 76523 | 101117 | 135829 | 44816 | 12593 | 1792 |
| mnzip r32 with plain MTF | 42698 | 39286 | 7572 | 2962 | 1227 | 19804 | 105883 | 143634 | 50624 | 12591 | 1698 |
| mnzip r32 | 40950 | 37835 | 7431 | 2983 | 1237 | 19287 | 101140 | 137191 | 45604 | 12428 | 1699 |
Time needed to compress
| compressor | alice29.txt | asyoulik.txt | lcet10.txt | plrabn12.txt | cp.html | fields.c | grammar.lsp | kennedy.xls | ptt5 | sum | xargs.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bzip2 1.0.3 | 0m0.166s | 0m0.133s | 0m0.533s | 0m0.633s | 0m0.047s | 0m0.037s | 0m0.007s | 0m1.062s | 0m0.151s | 0m0.056s | 0m0.006s |
| 7zip 4.43 beta | 0m0.539s | 0m0.417s | 0m1.732s | 0m2.161s | 0m0.070s | 0m0.035s | 0m0.019s | 0m6.048s | 0m1.402s | 0m0.105s | 0m0.022s |
| bbb ver. 1 | 0m2.675s | 0m2.271s | 0m7.455s | 0m8.599s | 0m0.559s | 0m0.344s | 0m0.230s | 0m17.446s | 0m45.407s | 0m0.813s | 0m0.235s |
| mnzip r32 | 0m0.273s | 0m0.206s | 0m0.951s | 0m1.099s | 0m0.031s | 0m0.012s | 0m0.006s | 0m3.545s | 0m1.173s | 0m0.051s | 0m0.006s |
Time needed to decompress

| compressor | alice29.txt | asyoulik.txt | lcet10.txt | plrabn12.txt | cp.html | fields.c | grammar.lsp | kennedy.xls | ptt5 | sum | xargs.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bzip2 1.0.3 | 0m0.063s | 0m0.049s | 0m0.177s | 0m0.222s | 0m0.007s | 0m0.003s | 0m0.002s | 0m0.210s | 0m0.053s | 0m0.009s | 0m0.003s |
| 7zip 4.43 beta | 0m0.033s | 0m0.027s | 0m0.066s | 0m0.085s | 0m0.009s | 0m0.011s | 0m0.007s | 0m0.099s | 0m0.043s | 0m0.016s | 0m0.006s |
| bbb ver. 1 | 0m2.265s | 0m1.918s | 0m6.015s | 0m6.916s | 0m0.511s | 0m0.332s | 0m0.231s | 0m13.492s | 0m6.660s | 0m0.715s | 0m0.237s |
| mnzip r32 | 0m0.073s | 0m0.061s | 0m0.215s | 0m0.261s | 0m0.010s | 0m0.005s | 0m0.003s | 0m0.441s | 0m0.155s | 0m0.017s | 0m0.002s |
Options used were -9 for bzip2, -mx=9 for 7zip, and f for bbb to use the fast but memory-hungry mode (this doesn't affect the compression rate for bbb). The benchmark scores are based on single runs, not a proper mean, so don't take them too seriously, and I hope I've not messed up the file order ;)
Patches are welcome!
Here are results for ppmd. The first size is with “-m256 -o16”, the second with “-m256 -o4”. The latter is faster than default-level bzip2. The -m argument probably has no effect with files this small. Most of these results are better than those of the compressors listed above; kennedy.xls seems to be something of a bad case.
| file | -m256 -o16 | -m256 -o4 |
|---|---|---|
| alice29.txt | 38682 | 39909 |
| asyoulik.txt | 36103 | 36561 |
| cp.html | 6576 | 6712 |
| fields.c | 2576 | 2751 |
| grammar.lsp | 1069 | 1090 |
| kennedy.xls | 173000 | 123468 |
| lcet10.txt | 95624 | 101165 |
| plrabn12.txt | 132753 | 134645 |
| ptt5 | 48511 | 50164 |
| sum | 11274 | 11983 |
| xargs.1 | 1516 | 1523 |
Comment by uau — 2007-06-19 @ 22:28
PPM-based compressors were always among the best for compressing text; maybe you want to try paq too, I suspect it would perform even better, at the cost of being unusably slow.
The problem with PPM is that it's symmetric: decompression needs the same amount of time and space as compression, and it's memory hungry. It's especially bad if you try to decompress such a -m256 file on a computer with less than 256 MB of free physical memory (of course you can be lucky: if the file was small it won't need the whole 256 MB).
Comment by Michael — 2007-06-20 @ 00:25
These test files are too small :) Of course 7z doesn’t do well, it’s designed for larger stuff; heck, its default dictionary size is 16M. Decompression is also very fast (much faster than mnzip or other BWT-based compressors) but from these benchmarks it seems to be hurt by its startup time, again it’s not designed for such tiny files (and not specifically tailored for text). As expected it wins on the big binary files.
To make things interesting you could try appending the test set…
I don’t like BWT compressors because it seems to me that the BWT is too slow: it's fast for compressing but slow for decompressing, so I'd rather take LZ with huge dictionaries.
BTW, 7z supports both LZMA and PPMD, you can even have both in the same archive, so you can use PPMD for a solid block containing text files and LZMA for binary ones. BZip2 is also supported but most 3rd party implementations won’t decode such files.
Comment by Small — 2007-06-26 @ 08:11
> These test files are too small :) Of course 7z doesn’t do well, it’s designed for larger
> stuff; heck, its default dictionary size is 16M.
I've also run mnzip and 7z on the large corpus (it's on the same page as the Canterbury Corpus):
-rw------- 1 michael michael 4047392 1997-03-21 05:31 bible.txt
-rw-r----- 1 michael michael 885162 2007-06-18 00:31 bible.txt.7z
-rw-r----- 1 michael michael 742228 2007-06-18 00:14 bible.txt.mnz
-rw------- 1 michael michael 4638690 1997-02-26 00:12 E.coli
-rw-r----- 1 michael michael 1185614 2007-06-18 00:32 E.coli.7z
-rw-r----- 1 michael michael 1144784 2007-06-18 00:15 E.coli.mnz
-rw------- 1 michael michael 2473400 1997-03-18 05:13 world192.txt
-rw-r----- 1 michael michael 487613 2007-06-18 00:32 world192.txt.7z
-rw-r----- 1 michael michael 404576 2007-06-18 00:15 world192.txt.mnz
> Decompression is also very fast (much faster than mnzip or other BWT-based compressors) but
> from these benchmarks it seems to be hurt by its startup time, again it’s not designed for
> such tiny files (and not specifically tailored for text).
LZ is fast at decompressing, yes; how much that matters depends on what it's used for. For communication like ssh, fast decompression + slow compression is not any better than if both are the same speed, and for decompressing software once at installation, I'd say it's not that important either; network speed is more likely the limiting factor.
> BTW, 7z supports both LZMA and PPMD, you can even have both in the same archive, so you can
> use PPMD for a solid block containing text files and LZMA for binary ones. BZip2 is also
> supported but most 3rd party implementations won’t decode such files.
PPMD's memory requirements on decoding make it very problematic; it easily needs 20 times as much as LZ and BWT for the same amount of data.
Comment by Michael — 2007-06-27 @ 09:31
The files in the “large” corpus are still so small that they fit in a single BWT transform block, except for E.coli which goes a bit over. I tested with linux-2.6.20.tar earlier (250316800 bytes). I don’t have the exact numbers any more (and running them again would take over 10 minutes), but the approximate results were (numbers rounded down to MB, I think I remember them about right…):
7z with “ultra” settings from man page: 35 MB
mnzip with default blocksize: 38 MB
mnzip with 32 MiB blocksize: 36 MB
If I take just a single 4 MiB block from the beginning of the tar, then mnzip wins when compressing it in isolation: 922092 bytes to 988826.
Comment by uau — 2007-06-27 @ 23:06
> ppmds memory requirements on decoding makes it very problematic, it easily needs 20 times as much as lz
> and bwt for the same size of data
In fact current mnzip uses more memory to decompress the 4 MiB linux tar part than ppmd does (with the default decompression mode in mnzip, and the ppmd file compressed with -o6, which gives better compression than mnzip). Using the low-memory mode would lower the requirements, but in any case ppmd is not orders of magnitude worse (and ppmd could possibly use a smaller block size without losing compression).
Comment by uau — 2007-06-27 @ 23:50
I wonder how it compares to lrzip:
http://ck.kolivas.org/apps/lrzip/README
Comment by lu_zero — 2007-06-28 @ 21:05
> In fact current mnzip uses more memory to decompress the 4 MiB linux tar part than ppmd does
> (with the default decompression mode in mnzip and pmd file compressed with -o6 which gives
> better compression than mnzip). Using the low-memory mode would lower the requirements but
> in any case ppmd is not orders of magnitude worse (and ppmd could possibly use a smaller
> block size without losing in compression).
mnzip's low-memory mode can be chosen at decompression time, so there's no problem decompressing normal mnzip files with it. With PPM the amount of memory can't so easily be cut to a quarter at decompression time; if it needs 100 MB to decompress then that's what it needs, and you can't change it easily. If you have less physical memory, then you have a problem …
Also, the inverse BWT is a fairly simple transform, and I suspect it can be performed quickly with less than O(n) memory and a small number of passes over a temporary file (the standard in-memory inverse is sketched after this comment).
And the most important part you ignore: a single case where PPM needs a normal amount of memory doesn't help, as it needs significantly more in other cases, and being able to decompress only some files is not exactly good.
Comment by Michael — 2007-06-29 @ 13:27
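For illustration, here is a minimal sketch of the standard O(n)-time, O(n)-memory inverse BWT via counting sort plus link chasing; the sub-O(n)-memory multi-pass variant speculated about above would replace the in-memory next[] array with passes over a temporary file. The helper and its signature are made up for this sketch, not taken from mnzip:

```c
/* Standard inverse BWT: a counting sort recovers the first column of
 * the sorted-rotation matrix implicitly, next[] links each row to the
 * row holding the following character, and chasing the links from the
 * primary index rebuilds the data.  Hypothetical helper, not mnzip's
 * implementation. */
#include <stdint.h>
#include <stdlib.h>

int inverse_bwt(const uint8_t *last, uint32_t n, uint32_t primary, uint8_t *out)
{
    uint32_t count[257] = { 0 };
    uint32_t *next = malloc(n * sizeof(*next));
    if (!next)
        return -1;

    for (uint32_t i = 0; i < n; i++)   /* histogram of the last column */
        count[last[i] + 1]++;
    for (int c = 0; c < 256; c++)      /* prefix sums: where each symbol's */
        count[c + 1] += count[c];      /* run starts in the first column   */
    for (uint32_t i = 0; i < n; i++)
        next[count[last[i]]++] = i;    /* link the rows in LF order */

    for (uint32_t i = 0, p = next[primary]; i < n; i++, p = next[p])
        out[i] = last[p];              /* chase the chain to rebuild */

    free(next);
    return 0;
}
```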
have you tried using divsufsort bwt algorithm? http://homepage3.nifty.com/wpage/software/libdivsufsort.html
or msufsort? http://www.michael-maniscalco.com/msufsort.htm
they are both very fast. could you try to use them in your program (ie. those bwt algos + your second stage algos)?
Comment by donkey7 — 2007-07-16 @ 13:00
> have you tried using divsufsort bwt algorithm?
> http://homepage3.nifty.com/wpage/software/libdivsufsort.html
> or msufsort? http://www.michael-maniscalco.com/msufsort.htm
No, I've not tried them.
> they are both very fast. could you try to use them in your program
> (ie. those bwt algos + your second stage algos)?
Well, if you want you can try them; a patch which adds support for them would be welcome if they are faster or need less memory.
I myself likely won't add support for them anytime soon if no one else does, but I am certainly willing to review patches, answer questions and help …
Comment by Michael — 2007-07-17 @ 00:22
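For what it's worth, wiring libdivsufsort in would presumably be a small patch. A hedged sketch, assuming divbwt() behaves as its header documents (it computes the BWT directly and returns the primary index, or a negative value on error); the bwt_block() wrapper and its signature are made up here, not part of mnzip:

```c
/* Hypothetical glue for the suggestion above: let libdivsufsort's
 * divbwt() do the suffix sorting and BWT in one call.
 * Not an actual mnzip patch. */
#include <stdlib.h>
#include <divsufsort.h>

/* Transforms in[0..n-1] into out[0..n-1]; returns the primary index
 * (needed later by the inverse transform) or -1 on failure. */
int bwt_block(const unsigned char *in, unsigned char *out, int n)
{
    saidx_t *tmp = malloc(n * sizeof(*tmp)); /* working space for divbwt */
    if (!tmp)
        return -1;
    saidx_t primary = divbwt(in, out, tmp, n);
    free(tmp);
    return primary < 0 ? -1 : (int)primary;
}
```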
it’s better code.
if ((prob[ctx] += b*256 - p8) < 1) prob[ctx] = 1524;
Comment by da0 — 2012-02-26 @ 05:05
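For context, da0's one-liner reads like a shift-free adaptive probability update. Here is a self-contained toy under stated assumptions: prob is taken to hold the probability of a 1 bit scaled by 65536, and p8 (scaled by 256) is the value a range coder would actually be fed; none of these names or scalings are confirmed by the comment or the mnzip source:

```c
/* Toy demonstration of the update above; all assumptions, not mnzip
 * code.  prob += b*256 - (prob>>8) is an exponential moving average
 * that drifts toward 65536*b at a rate of roughly 1/256 per coded bit,
 * and the floor keeps the model from getting stuck near probability 0. */
#include <stdio.h>

int main(void)
{
    int prob = 32768;                         /* start at p(1) = 0.5 */
    const int bits[] = { 1, 1, 1, 0, 1, 1, 1, 1 };

    for (int i = 0; i < 8; i++) {
        int b  = bits[i];
        int p8 = prob >> 8;                   /* 8-bit value the coder would use */
        if ((prob += b * 256 - p8) < 1)
            prob = 1524;                      /* da0's floor, about 2.3% */
        printf("bit %d -> p(1) = %.4f\n", b, prob / 65536.0);
    }
    return 0;
}
```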