mnzip
I always wanted to write my own general-purpose compressor, and a short while ago I actually did. It's based on the BWT + MTF variant described in yesterday's blog post, plus snow's/ffv1's range coder with a state transition table borrowed from bbb/paq8. The source code is of course available too, under the GPL.
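To make the second stage concrete, here is a minimal sketch of plain move-to-front coding in C. This is a toy illustration under the usual textbook definition, not the actual mnzip code (mnzip's default is the MTF variant from yesterday's post):

```c
/* Plain move-to-front: replace each byte by its rank in a "recently
 * used" table, then move it to the front.  After a BWT, runs of equal
 * bytes become runs of zeros, which the entropy coder likes.
 * Toy sketch, not the actual mnzip implementation. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void mtf_encode(uint8_t *buf, size_t len)
{
    uint8_t table[256];
    for (int i = 0; i < 256; i++)
        table[i] = i;

    for (size_t i = 0; i < len; i++) {
        uint8_t c = buf[i];
        int rank = 0;
        while (table[rank] != c)          /* find the byte's current rank */
            rank++;
        memmove(table + 1, table, rank);  /* shift more recent bytes down */
        table[0] = c;                     /* c is now the most recent */
        buf[i] = (uint8_t)rank;           /* emit the old rank */
    }
}
```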
Compressed file sizes (in bytes) of the Canterbury Corpus

| compressor | alice29.txt | asyoulik.txt | cp.html | fields.c | grammar.lsp | kennedy.xls | lcet10.txt | plrabn12.txt | ptt5 | sum | xargs.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bzip2 1.0.3 | 43202 | 39569 | 7624 | 3039 | 1283 | 130280 | 107706 | 145577 | 49759 | 12909 | 1762 |
| 7zip 4.43 beta | 48555 | 44606 | 7692 | 3076 | 1350 | 50904 | 119569 | 165462 | 42153 | 9619 | 1860 |
| bbb ver. 1 | 40839 | 37818 | 7736 | 3253 | 1349 | 76523 | 101117 | 135829 | 44816 | 12593 | 1792 |
| mnzip r32 with plain MTF | 42698 | 39286 | 7572 | 2962 | 1227 | 19804 | 105883 | 143634 | 50624 | 12591 | 1698 |
| mnzip r32 | 40950 | 37835 | 7431 | 2983 | 1237 | 19287 | 101140 | 137191 | 45604 | 12428 | 1699 |
Time needed to compress
| compressor | alice29.txt | asyoulik.txt | lcet10.txt | plrabn12.txt | cp.html | fields.c | grammar.lsp | kennedy.xls | ptt5 | sum | xargs.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bzip2 1.0.3 | 0m0.166s | 0m0.133s | 0m0.533s | 0m0.633s | 0m0.047s | 0m0.037s | 0m0.007s | 0m1.062s | 0m0.151s | 0m0.056s | 0m0.006s |
| 7zip 4.43 beta | 0m0.539s | 0m0.417s | 0m1.732s | 0m2.161s | 0m0.070s | 0m0.035s | 0m0.019s | 0m6.048s | 0m1.402s | 0m0.105s | 0m0.022s |
| bbb ver. 1 | 0m2.675s | 0m2.271s | 0m7.455s | 0m8.599s | 0m0.559s | 0m0.344s | 0m0.230s | 0m17.446s | 0m45.407s | 0m0.813s | 0m0.235s |
| mnzip r32 | 0m0.273s | 0m0.206s | 0m0.951s | 0m1.099s | 0m0.031s | 0m0.012s | 0m0.006s | 0m3.545s | 0m1.173s | 0m0.051s | 0m0.006s |
Time needed to decompress

| compressor | alice29.txt | asyoulik.txt | lcet10.txt | plrabn12.txt | cp.html | fields.c | grammar.lsp | kennedy.xls | ptt5 | sum | xargs.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bzip2 1.0.3 | 0m0.063s | 0m0.049s | 0m0.177s | 0m0.222s | 0m0.007s | 0m0.003s | 0m0.002s | 0m0.210s | 0m0.053s | 0m0.009s | 0m0.003s |
| 7zip 4.43 beta | 0m0.033s | 0m0.027s | 0m0.066s | 0m0.085s | 0m0.009s | 0m0.011s | 0m0.007s | 0m0.099s | 0m0.043s | 0m0.016s | 0m0.006s |
| bbb ver. 1 | 0m2.265s | 0m1.918s | 0m6.015s | 0m6.916s | 0m0.511s | 0m0.332s | 0m0.231s | 0m13.492s | 0m6.660s | 0m0.715s | 0m0.237s |
| mnzip r32 | 0m0.073s | 0m0.061s | 0m0.215s | 0m0.261s | 0m0.010s | 0m0.005s | 0m0.003s | 0m0.441s | 0m0.155s | 0m0.017s | 0m0.002s |
Options used were -9 for bzip2, -mx=9 for 7zip, and f for bbb to use the fast but memory-hungry mode (this doesn't affect the compression rate for bbb). The benchmark scores are based on single runs, not a proper mean, so don't take them too seriously, and I hope I've not messed up the file order ;)
Patches are welcome!
Here are results for ppmd. The first size is with “-m256 -o16”, the second with “-m256 -o4”. The latter is faster than default-level bzip2. The -m argument probably has no effect with files this small. Most of these results are better than those of the compressors listed above; kennedy.xls seems to be something of a bad case.
| file | -m256 -o16 | -m256 -o4 |
|---|---|---|
| alice29.txt | 38682 | 39909 |
| asyoulik.txt | 36103 | 36561 |
| cp.html | 6576 | 6712 |
| fields.c | 2576 | 2751 |
| grammar.lsp | 1069 | 1090 |
| kennedy.xls | 173000 | 123468 |
| lcet10.txt | 95624 | 101165 |
| plrabn12.txt | 132753 | 134645 |
| ptt5 | 48511 | 50164 |
| sum | 11274 | 11983 |
| xargs.1 | 1516 | 1523 |
Comment by uau — 2007-06-19 @ 22:28
PPM-based compressors were always among the best for compressing text; maybe you want to try paq too, I suspect it would perform even better, at the cost of being unusably slow.
The problem with PPM is that it's symmetric: decompression needs the same amount of time and space as compression, and it's memory hungry. It's especially bad if you try to decompress such a -m256 file on a computer with less than 256 MB of free physical memory (of course you can be lucky: if the file was small it won't need the whole 256 MB).
Comment by Michael — 2007-06-20 @ 00:25
These test files are too small :) Of course 7z doesn’t do well, it’s designed for larger stuff; heck, its default dictionary size is 16M. Decompression is also very fast (much faster than mnzip or other BWT-based compressors) but from these benchmarks it seems to be hurt by its startup time, again it’s not designed for such tiny files (and not specifically tailored for text). As expected it wins on the big binary files.
To make things interesting you could try appending the test set…
I don’t like BWT compressors because it seems to me that the BWT is too slow: it's fast for compressing but slow for decompressing, so I'd rather take LZ with huge dictionaries.
BTW, 7z supports both LZMA and PPMD, you can even have both in the same archive, so you can use PPMD for a solid block containing text files and LZMA for binary ones. BZip2 is also supported but most 3rd party implementations won’t decode such files.
Comment by Small — 2007-06-26 @ 08:11
> These test files are too small :) Of course 7z doesn’t do well, it’s designed for larger
> stuff; heck, its default dictionary size is 16M.
I've also run mnzip and 7z on the large corpus (it's on the same page as the Canterbury Corpus):
-rw------- 1 michael michael 4047392 1997-03-21 05:31 bible.txt
-rw-r----- 1 michael michael 885162 2007-06-18 00:31 bible.txt.7z
-rw-r----- 1 michael michael 742228 2007-06-18 00:14 bible.txt.mnz
-rw------- 1 michael michael 4638690 1997-02-26 00:12 E.coli
-rw-r----- 1 michael michael 1185614 2007-06-18 00:32 E.coli.7z
-rw-r----- 1 michael michael 1144784 2007-06-18 00:15 E.coli.mnz
-rw------- 1 michael michael 2473400 1997-03-18 05:13 world192.txt
-rw-r----- 1 michael michael 487613 2007-06-18 00:32 world192.txt.7z
-rw-r----- 1 michael michael 404576 2007-06-18 00:15 world192.txt.mnz
> Decompression is also very fast (much faster than mnzip or other BWT-based compressors) but
> from these benchmarks it seems to be hurt by its startup time, again it’s not designed for
> such tiny files (and not specifically tailored for text).
LZ is fast at decompressing, yes; how much that matters depends on what it's used for. For communication like ssh, fast decompression + slow compression is not any better than if both are the same speed, and for decompressing software once at installation, I'd say it's not that important either; network speed is more likely the limiting factor.
> BTW, 7z supports both LZMA and PPMD, you can even have both in the same archive, so you can
> use PPMD for a solid block containing text files and LZMA for binary ones. BZip2 is also
> supported but most 3rd party implementations won’t decode such files.
PPMD's memory requirements on decoding make it very problematic; it easily needs 20 times as much as LZ and BWT for the same amount of data.
Comment by Michael — 2007-06-27 @ 09:31
The files in the “large” corpus are still so small that they fit in a single BWT transform block, except for E.coli which goes a bit over. I tested with linux-2.6.20.tar earlier (250316800 bytes). I don’t have the exact numbers any more (and running them again would take over 10 minutes), but the approximate results were (numbers rounded down to MB, I think I remember them about right…):
7z with “ultra” settings from man page: 35 MB
mnzip with default blocksize: 38 MB
mnzip with 32 MiB blocksize: 36 MB
If I take just a single 4 MiB block from the beginning of the tar, then mnzip wins when compressing it in isolation: 922092 bytes to 988826.
Comment by uau — 2007-06-27 @ 23:06
> ppmds memory requirements on decoding makes it very problematic, it easily needs 20 times as much as lz
> and bwt for the same size of data
In fact current mnzip uses more memory to decompress the 4 MiB linux tar part than ppmd does (with the default decompression mode in mnzip, and the ppmd file compressed with -o6, which gives better compression than mnzip). Using the low-memory mode would lower the requirements, but in any case ppmd is not orders of magnitude worse (and ppmd could possibly use a smaller block size without losing compression).
Comment by uau — 2007-06-27 @ 23:50
I wonder how it compares to lrzip:
http://ck.kolivas.org/apps/lrzip/README
Comment by lu_zero — 2007-06-28 @ 21:05
> In fact current mnzip uses more memory to decompress the 4 MiB linux tar part than ppmd does
> (with the default decompression mode in mnzip and pmd file compressed with -o6 which gives
> better compression than mnzip). Using the low-memory mode would lower the requirements but
> in any case ppmd is not orders of magnitude worse (and ppmd could possibly use a smaller
> block size without losing in compression).
mnzip's low-memory mode can be chosen at decompression time, so there's no problem decompressing normal mnzip files with it. With PPM the amount of memory can't so easily be cut to a quarter at decompression time; if it needs 100 MB to decompress then that's what it needs, and you can't change it easily. If you have less physical memory, then you have a problem …
Also, the inverse BWT is a fairly simple transform, and I suspect it can be performed quickly with less than O(n) memory and a small number of passes over a temporary file (the standard in-memory inverse is sketched after this comment).
And the most important part you ignore: a single case where PPM needs a normal amount of memory doesn't help, as it needs significantly more in other cases, and being able to decompress only some files is not exactly good.
Comment by Michael — 2007-06-29 @ 13:27
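For illustration, here is a minimal sketch of the standard O(n)-time, O(n)-memory inverse BWT via counting sort plus link chasing; the sub-O(n)-memory multi-pass variant speculated about above would replace the in-memory next[] array with passes over a temporary file. The helper and its signature are made up for this sketch, not taken from mnzip:

```c
/* Standard inverse BWT: a counting sort recovers the first column of
 * the sorted-rotation matrix implicitly, next[] links each row to the
 * row holding the following character, and chasing the links from the
 * primary index rebuilds the data.  Hypothetical helper, not mnzip's
 * implementation. */
#include <stdint.h>
#include <stdlib.h>

int inverse_bwt(const uint8_t *last, uint32_t n, uint32_t primary, uint8_t *out)
{
    uint32_t count[257] = { 0 };
    uint32_t *next = malloc(n * sizeof(*next));
    if (!next)
        return -1;

    for (uint32_t i = 0; i < n; i++)   /* histogram of the last column */
        count[last[i] + 1]++;
    for (int c = 0; c < 256; c++)      /* prefix sums: where each symbol's */
        count[c + 1] += count[c];      /* run starts in the first column   */
    for (uint32_t i = 0; i < n; i++)
        next[count[last[i]]++] = i;    /* link the rows in LF order */

    for (uint32_t i = 0, p = next[primary]; i < n; i++, p = next[p])
        out[i] = last[p];              /* chase the chain to rebuild */

    free(next);
    return 0;
}
```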
have you tried using divsufsort bwt algorithm? http://homepage3.nifty.com/wpage/software/libdivsufsort.html
or msufsort? http://www.michael-maniscalco.com/msufsort.htm
they are both very fast. could you try to use them in your program (ie. those bwt algos + your second stage algos)?
Comment by donkey7 — 2007-07-16 @ 13:00
> have you tried using divsufsort bwt algorithm?
> http://homepage3.nifty.com/wpage/software/libdivsufsort.html
> or msufsort? http://www.michael-maniscalco.com/msufsort.htm
No, I've not tried them.
> they are both very fast. could you try to use them in your program
> (ie. those bwt algos + your second stage algos)?
Well, if you want you can try them; a patch which adds support for them would be welcome if they are faster or need less memory.
I myself likely won't add support for them anytime soon if no one else does, but I am certainly willing to review patches, answer questions and help …
Comment by Michael — 2007-07-17 @ 00:22
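For what it's worth, wiring libdivsufsort in would presumably be a small patch. A hedged sketch, assuming divbwt() behaves as its header documents (it computes the BWT directly and returns the primary index, or a negative value on error); the bwt_block() wrapper and its signature are made up here, not part of mnzip:

```c
/* Hypothetical glue for the suggestion above: let libdivsufsort's
 * divbwt() do the suffix sorting and BWT in one call.
 * Not an actual mnzip patch. */
#include <stdlib.h>
#include <divsufsort.h>

/* Transforms in[0..n-1] into out[0..n-1]; returns the primary index
 * (needed later by the inverse transform) or -1 on failure. */
int bwt_block(const unsigned char *in, unsigned char *out, int n)
{
    saidx_t *tmp = malloc(n * sizeof(*tmp)); /* working space for divbwt */
    if (!tmp)
        return -1;
    saidx_t primary = divbwt(in, out, tmp, n);
    free(tmp);
    return primary < 0 ? -1 : (int)primary;
}
```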
it’s better code.
if ((prob[ctx] += b*256 - p8) < 1) prob[ctx] = 1524;
Comment by da0 — 2012-02-26 @ 05:05
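For context, da0's one-liner reads like a shift-free adaptive probability update. Here is a self-contained toy under stated assumptions: prob is taken to hold the probability of a 1 bit scaled by 65536, and p8 (scaled by 256) is the value a range coder would actually be fed; none of these names or scalings are confirmed by the comment or the mnzip source:

```c
/* Toy demonstration of the update above; all assumptions, not mnzip
 * code.  prob += b*256 - (prob>>8) is an exponential moving average
 * that drifts toward 65536*b at a rate of roughly 1/256 per coded bit,
 * and the floor keeps the model from getting stuck near probability 0. */
#include <stdio.h>

int main(void)
{
    int prob = 32768;                         /* start at p(1) = 0.5 */
    const int bits[] = { 1, 1, 1, 0, 1, 1, 1, 1 };

    for (int i = 0; i < 8; i++) {
        int b  = bits[i];
        int p8 = prob >> 8;                   /* 8-bit value the coder would use */
        if ((prob += b * 256 - p8) < 1)
            prob = 1524;                      /* da0's floor, about 2.3% */
        printf("bit %d -> p(1) = %.4f\n", b, prob / 65536.0);
    }
    return 0;
}
```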