Thursday, October 29, 2015

Compression is Compilation

This post's title was inspired by the title of Paul Khuong's blog post "Refactoring With LZ77: Compression Is Compilation (?)". I doubt this idea is new; it's just the most useful viewpoint (or mental model) for me to have while working on LZHAM's successor. I've explained it to several developers and haven't been laughed out of the room yet.

I think the usual description of LZ coding is a very mechanical (and boring) one that dwells too much on details like match finding and entropy coding.

To create LZHAM I had to rethink the entire way I thought about (or mentally modeled) LZ77 codecs, shifting from a compression-centric viewpoint to a compiler+VM-centric one. Before that, while working on Deflate-compatible codecs optimized for specialized processors like the X360's, I focused intensely on lower-level things like Huffman coding, match finding, simplistic parsing algorithms (greedy vs. lazy matching), and CPU-specific decompressor optimizations.

My new viewpoint is that an LZ compressor is a compiler. Its output is an executable program consisting of very efficiently encoded instructions for a sort of virtual CPU. The decompressor is a virtual machine that decodes and executes the instructions in the compressed data; its output is the decompressed data. The compiler/compressor's goal is to produce a program that, when executed, exactly reproduces the original source data.
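
To make that concrete, here's a tiny sketch (not any real codec's decoder, and with the entropy decoding step elided) of what "decompressor as VM" means: the compressed stream is a program of literal and match instructions, and the decoder is just an interpreter that executes them against the output buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The "instruction set" of this toy VM: emit one literal byte, or copy
// 'length' bytes starting 'distance' bytes back in the already-decoded output.
enum class Op : uint8_t { Literal, Match, End };

struct Instruction {
    Op op;
    uint8_t literal;   // used when op == Literal
    uint32_t distance; // used when op == Match
    uint32_t length;   // used when op == Match
};

// The decompressor as an interpreter: execute the program against the output buffer.
std::vector<uint8_t> run_lz_vm(const std::vector<Instruction>& program) {
    std::vector<uint8_t> out;
    for (const Instruction& ins : program) {
        if (ins.op == Op::Literal) {
            out.push_back(ins.literal);
        } else if (ins.op == Op::Match) {
            size_t src = out.size() - ins.distance;
            // Copy byte by byte so overlapping matches (distance < length)
            // behave like a run, just as in real LZ decoders.
            for (uint32_t i = 0; i < ins.length; ++i)
                out.push_back(out[src + i]);
        } else {
            break; // Op::End
        }
    }
    return out;
}
```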

A modern LZ compressor like LZMA or LZHAM has a backend "instruction selection" optimizer: "Typically, instruction selection is implemented with a backwards dynamic programming algorithm which computes the 'optimal' tiling for each point starting from the end of the program and based from there." If you understand that, you will easily understand LZ "optimal parsing", which is more or less the same thing. (Charles Bloom and Yann Collet have some good posts on optimal parsing, and there are several decent thread replies on it at encode.ru.)
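
Here's a rough sketch of that dynamic programming idea. The cost and match-finding functions are hypothetical stand-ins for the real codec's entropy-coded instruction prices and match finder, and real parsers (LZMA's, LZHAM's) also have to carry coder state forward between positions, which this ignores:

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct MatchCandidate { uint32_t distance; uint32_t length; };

struct ParseNode {
    double cost;            // cheapest known cost, in bits, to code data[0..pos)
    uint32_t from;          // position this node was reached from
    MatchCandidate choice;  // length == 0 means "emit a literal"
};

std::vector<ParseNode> optimal_parse(
    const std::vector<uint8_t>& data,
    double (*literal_cost_bits)(uint8_t),
    double (*match_cost_bits)(const MatchCandidate&),
    std::vector<MatchCandidate> (*find_matches)(const std::vector<uint8_t>&, size_t))
{
    const double kInf = std::numeric_limits<double>::infinity();
    std::vector<ParseNode> nodes(data.size() + 1, ParseNode{ kInf, 0, { 0, 0 } });
    nodes[0].cost = 0.0;

    for (size_t i = 0; i < data.size(); ++i) {
        if (nodes[i].cost == kInf)
            continue;

        // Option 1: code data[i] as a literal instruction.
        double lit = nodes[i].cost + literal_cost_bits(data[i]);
        if (lit < nodes[i + 1].cost)
            nodes[i + 1] = ParseNode{ lit, (uint32_t)i, { 0, 0 } };

        // Option 2: code a match instruction covering m.length bytes
        // (find_matches is assumed to only return matches that fit).
        for (const MatchCandidate& m : find_matches(data, i)) {
            double c = nodes[i].cost + match_cost_bits(m);
            if (c < nodes[i + m.length].cost)
                nodes[i + m.length] = ParseNode{ c, (uint32_t)i, m };
        }
    }
    // Walking the 'from' links backwards from nodes[data.size()] recovers the
    // cheapest instruction sequence under this (simplified) cost model.
    return nodes;
}
```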

The endless LZ77-based variants have a wide variety of "CPU" instruction sets, registers, and global state. They use a wide variety of entropy coding backends (raw bits, arithmetic, Huffman, or newfangled things like ANS/FSE) to encode instructions. ROLZ and LZP are examples of significant LZ77 variants that change up the instruction set and/or the VM's "registers".

Also, in modern LZ77 implementations there is still a large amount of redundancy in how the instructions are selected and coded; the vanilla LZ codespace is just too flexible. Charles Bloom is one of the trailblazers in publishing observations about the inefficiencies and opportunities in LZ compression. His LAM ("literal after match") exclusion concept is one example.
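
Roughly, the observation behind LAM exclusion is that (assuming the encoder only truncates matches when it has to) the literal right after a match can't equal the byte that would have extended that match, so that symbol can be excluded from, or used as a prediction for, the literal's coding context. A minimal sketch of the idea, with a hypothetical helper name:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: the symbol a LAM-aware literal model may exclude (or code
// relative to, as LZMA's literal coder does) immediately after a match.
inline uint8_t lam_excluded_symbol(const std::vector<uint8_t>& decoded_so_far,
                                   uint32_t last_match_distance) {
    // The byte the previous match would have produced next, had it been one longer.
    return decoded_so_far[decoded_so_far.size() - last_match_distance];
}
```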

Some of these variants are designed to trade off compression ratio (i.e. instruction encoding efficiency) for extremely high compression and decompression (i.e. VM execution) throughput, like LZ4 or Snappy. Others, like LZMA, LZHAM, and brotli, very carefully balance compression ratio against decompression throughput, with slow (usually offline) compressors.

I think LZHAM is useful at all because it has an instruction-set optimizer (near-optimal parsing using dynamic programming), a carefully chosen instruction set and state "registers" (basically LZMA's), and a fast entropy coding scheme that blends Huffman and bitwise arithmetic coding in a way that trades a tiny amount of ratio for higher symbol decoding throughput (i.e. VM execution speed) vs. LZMA. LZHAM shows one way to compete against another, well-entrenched codec: choose one important axis to improve upon (say decompression performance) while keeping the same (or better) compression ratio as your competitor. If all other things are equal, the new codec is "better". (Side note: another way to compete against an existing codec is to clone its API and high-level behavior and semantics as closely as you can. This is what miniz and LZHAM do vs. zlib; these codecs even try to clone zlib's obscure flush semantics.)

At an even higher level, data compression (not just LZ77) can also be viewed as a type of Machine Learning followed by Entropy Coding. LZ77 is an intensely practical and efficient compromise that constrains the implementation in just the right way: byte-level prediction/matching, with a high average number of bytes output per decoded LZ instruction during decompression.
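
A toy sketch of that "model + entropy coder" decomposition (the Model and EntropyEncoder interfaces are hypothetical, not any real codec's API): the model predicts a probability distribution over the next symbol, the coder turns the symbol and its predicted distribution into bits, and the model is then updated with the actual symbol, exactly mirrored on the decode side.

```cpp
#include <cstdint>
#include <vector>

struct Model {
    // Hypothetical: returns P(next symbol == s) for each of the 256 byte values.
    virtual std::vector<double> predict() const = 0;
    virtual void update(uint8_t actual) = 0;
    virtual ~Model() = default;
};

struct EntropyEncoder {
    // Hypothetical: emits roughly -log2(probs[s]) bits for symbol s.
    virtual void encode(uint8_t s, const std::vector<double>& probs) = 0;
    virtual ~EntropyEncoder() = default;
};

void compress(const std::vector<uint8_t>& data, Model& model, EntropyEncoder& coder) {
    for (uint8_t s : data) {
        coder.encode(s, model.predict()); // code the symbol under the model's prediction
        model.update(s);                  // learn from it, just as the decoder will
    }
}
```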

LZ codecs like zlib, LZ4, LZMA, brotli, LZHAM, Tornado, and RAD Game Tools' Oodle product all strike some of the most interesting, and currently best (or very nearly best), compromises between ratio, various important measures of throughput, memory consumption, code size, optimization level, bug-fix support, platform support, library update frequency, etc.

Benchmarks like Squash and the Large Text Compression Benchmark show the performance of a variety of codecs (not just LZ-based ones), executed on different types of CPUs with different categories and sizes of source data.

Game developers need to pitch in and contribute samples of their shipping data to the public domain, so LZ codec authors can make implementation and optimization decisions tuned to their unique datasets. We need a "Game Data Compression Benchmark" site.

1 comment:

  1. > At an even higher level, data compression (not just LZ77) can also be viewed as a type of Machine Learning followed by Entropy Coding.

    It's worth mentioning Matt Mahoney's work along these lines and his impressive codec ZPAQ.

    -JB
    @JBrooksBSI
