Monday, October 26, 2015

File patching using data compression with flushing

This neat idea is from Alexander Suvorov at Unity. There's a similar approach described in "Data Compression: The Complete Reference", section 8.14.2 File Differencing: VCDIFF, but it's not described in a way that is easily adapted to existing codecs. (Note any errors below are due to me, not Alexander.)

This simple file patching (or differential compression) technique is compatible with any deterministic lossless codec that supports flushing during compression, such as zlib and LZHAM. The codec doesn't need to explicitly support patch file generation, only flushing. (See the LZHAM_FULL_FLUSH etc. enums in lzham.h.)
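
In zlib terms, the flush you want is Z_SYNC_FLUSH: it forces out everything consumed so far, aligned to a byte boundary, while leaving the sliding window intact (zlib's Z_FULL_FLUSH also resets the window, which would throw away exactly the cross-file matches this trick relies on). Here's a minimal sketch of the call pattern, with a made-up helper name, the output buffer assumed large enough, and error handling omitted:

    #include <zlib.h>

    /* Feed `len` bytes into an already-initialized deflate stream and force the
       compressed output onto a byte boundary WITHOUT resetting the 32k sliding
       window, so later input can still match against this data. Sketch only:
       assumes dst_cap is large enough; a real caller would loop on avail_out. */
    static size_t feed_and_flush(z_stream *zs, const unsigned char *src, size_t len,
                                 unsigned char *dst, size_t dst_cap)
    {
        zs->next_in   = (Bytef *)src;
        zs->avail_in  = (uInt)len;
        zs->next_out  = dst;
        zs->avail_out = (uInt)dst_cap;
        deflate(zs, Z_SYNC_FLUSH);       /* Z_FULL_FLUSH would reset the window */
        return dst_cap - zs->avail_out;  /* compressed bytes emitted by this call */
    }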

Let's say you want to transmit a new version of a file (let's call it file B) to a remote client, and the remote client already has the first version (file A). Here's how you can create a patch file which the remote client can use to recover file B from file A.

Assumptions: Both the local and remote clients have the same codec and use the same codec settings, the compressor is deterministic, and flushing doesn't reset the compressor's internal state (it just forces out the compressed data for all input provided so far). For LZ codecs, the dictionary size must be at least filesize(A)+filesize(B) bytes.

1. The remote client compresses file A and flushes the compressor, resulting in a blob of compressed data (let's call this "CA").

Of course, it's probable the remote client already has the compressed version of file A, in which case you skip this step.

2. The local client also compresses file A and flushes the compressor, resulting in a blob of compressed data which is the same as CA in step 1. Then it immediately compresses file B (using the same compressor instance), resulting in another blob of compressed data (CB). 

The flush doesn't reset the compressor's internal state; it just causes it to output as much compressed data as possible, aligned to a byte boundary, so when it compresses file B it'll take advantage of matches against file A.

Said another way: you compress a total of filesize(A)+filesize(B) bytes using one instance of the compressor (the compressor's internal state is not reset between files; for LZ this means the sliding dictionary is not reset during the flush), but flush between A and B. The flush cleanly separates the compressed data needed to decompress file A from the compressed data needed to decompress file B. (A zlib sketch of the whole sequence follows step 4.)

3. Key step: Only transmit CB to the remote client (i.e. only the compressed data occurring immediately after the flush). The local client doesn't need to transmit CA because the remote client has already computed it in step 1.

4. Now the remote client concatenates compressed blobs CA and CB into a single compressed blob and decompresses the entire thing. It discards the first filesize(A) decompressed bytes, and what remains is file B. That's it.
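
Here's a minimal end-to-end sketch of steps 1 through 4 using zlib (purely illustrative; as the comments below point out, zlib's 32k window makes it useless for real patches, but the flush semantics are the same). The helper and function names are made up for this post, allocations and return codes are unchecked, and both sides are assumed to use the same zlib build and settings:

    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    /* Append the output of one deflate() pass over `in` to the growable buffer *out.
       `flush` is Z_SYNC_FLUSH (between files) or Z_FINISH (after the last file). */
    static void deflate_buf(z_stream *zs, const unsigned char *in, size_t in_len,
                            int flush, unsigned char **out, size_t *out_len)
    {
        unsigned char chunk[16384];
        int ret;
        zs->next_in  = (Bytef *)in;
        zs->avail_in = (uInt)in_len;
        do {
            zs->next_out  = chunk;
            zs->avail_out = sizeof(chunk);
            ret = deflate(zs, flush);            /* Z_OK / Z_STREAM_END on success */
            size_t n = sizeof(chunk) - zs->avail_out;
            *out = realloc(*out, *out_len + n);
            memcpy(*out + *out_len, chunk, n);
            *out_len += n;
        } while (zs->avail_out == 0 && ret != Z_STREAM_END);
    }

    /* Local side (steps 2 and 3): compress A, sync-flush, then compress B with the
       same z_stream. Only the bytes emitted after the flush (CB) are returned and
       transmitted. Z_FULL_FLUSH would reset the window and kill the cross-file
       matches, so Z_SYNC_FLUSH is the one we want here. */
    static unsigned char *make_patch(const unsigned char *A, size_t a_len,
                                     const unsigned char *B, size_t b_len,
                                     size_t *patch_len)
    {
        z_stream zs; memset(&zs, 0, sizeof(zs));
        deflateInit(&zs, Z_BEST_COMPRESSION);

        unsigned char *buf = NULL; size_t len = 0;
        deflate_buf(&zs, A, a_len, Z_SYNC_FLUSH, &buf, &len);   /* prefix == CA */
        size_t ca_len = len;
        deflate_buf(&zs, B, b_len, Z_FINISH, &buf, &len);       /* suffix == CB */
        deflateEnd(&zs);

        *patch_len = len - ca_len;
        unsigned char *patch = malloc(*patch_len);
        memcpy(patch, buf + ca_len, *patch_len);                 /* only CB is sent */
        free(buf);
        return patch;
    }

    /* Remote side (steps 1 and 4): recreate CA from its own copy of A with the
       exact same settings, append the received CB, inflate CA || CB, and throw
       away the first a_len decompressed bytes. `b_cap` is an upper bound on
       filesize(B); in practice you'd send the exact size alongside the patch. */
    static unsigned char *apply_patch(const unsigned char *A, size_t a_len,
                                      const unsigned char *CB, size_t cb_len,
                                      size_t b_cap, size_t *b_len)
    {
        /* Step 1: deterministically recompute CA. */
        z_stream zs; memset(&zs, 0, sizeof(zs));
        deflateInit(&zs, Z_BEST_COMPRESSION);          /* must match the sender */
        unsigned char *ca = NULL; size_t ca_len = 0;
        deflate_buf(&zs, A, a_len, Z_SYNC_FLUSH, &ca, &ca_len);
        deflateEnd(&zs);

        /* Step 4: decompress CA || CB, then skip the first a_len bytes. */
        unsigned char *cat = malloc(ca_len + cb_len);
        memcpy(cat, ca, ca_len);
        memcpy(cat + ca_len, CB, cb_len);

        unsigned char *full = malloc(a_len + b_cap);
        z_stream inf; memset(&inf, 0, sizeof(inf));
        inflateInit(&inf);
        inf.next_in  = cat;  inf.avail_in  = (uInt)(ca_len + cb_len);
        inf.next_out = full; inf.avail_out = (uInt)(a_len + b_cap);
        inflate(&inf, Z_FINISH);                       /* expect Z_STREAM_END */
        *b_len = (size_t)inf.total_out - a_len;
        inflateEnd(&inf);

        unsigned char *B = malloc(*b_len);
        memcpy(B, full + a_len, *b_len);
        free(ca); free(cat); free(full);
        return B;
    }

The only real requirement is that the remote side reproduces CA exactly as the sender did, which is why both sides initialize deflate with the same settings and feed file A the same way.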

Several downsides to this idea: The remote client must compress file A (costing CPU and/or storage), the dictionary size must be large enough to encompass both files (costing RAM), and the remote client must decompress both files. These requirements could be too expensive for mobile with large files.

Patch file generation using LZHAM is simpler because it explicitly supports seed dictionaries. You set the seed dictionary to file A and compress file B, resulting in a compressed patch file that recovers B from A. This can still require too much RAM for mobile, because you need a dictionary of at least filesize(A) bytes.
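
I won't reproduce the LZHAM seed-dictionary calls here, but zlib's preset-dictionary API is the same idea in miniature, so here's a hedged sketch using that instead. deflateSetDictionary/inflateSetDictionary only consider the last 32k of file A, and the function names below are invented for this post, so treat this as the shape of the approach rather than something practical for large files:

    #include <string.h>
    #include <zlib.h>

    /* Sender: compress B with (the tail of) A preset as the dictionary.
       The compressed output is the patch. */
    static size_t make_seed_patch(const unsigned char *A, size_t a_len,
                                  const unsigned char *B, size_t b_len,
                                  unsigned char *patch, size_t patch_cap)
    {
        z_stream zs; memset(&zs, 0, sizeof(zs));
        deflateInit(&zs, Z_BEST_COMPRESSION);
        deflateSetDictionary(&zs, A, (uInt)a_len);   /* "seed" = old file A */
        zs.next_in  = (Bytef *)B;  zs.avail_in  = (uInt)b_len;
        zs.next_out = patch;       zs.avail_out = (uInt)patch_cap;
        deflate(&zs, Z_FINISH);                      /* assumes patch_cap is enough */
        size_t n = patch_cap - zs.avail_out;
        deflateEnd(&zs);
        return n;
    }

    /* Receiver: inflate the patch, supplying A as the dictionary when asked. */
    static size_t apply_seed_patch(const unsigned char *A, size_t a_len,
                                   const unsigned char *patch, size_t patch_len,
                                   unsigned char *B, size_t b_cap)
    {
        z_stream zs; memset(&zs, 0, sizeof(zs));
        inflateInit(&zs);
        zs.next_in  = (Bytef *)patch; zs.avail_in  = (uInt)patch_len;
        zs.next_out = B;              zs.avail_out = (uInt)b_cap;
        if (inflate(&zs, Z_NO_FLUSH) == Z_NEED_DICT) {   /* zlib asks for the seed */
            inflateSetDictionary(&zs, A, (uInt)a_len);
            inflate(&zs, Z_FINISH);                      /* runs to Z_STREAM_END */
        }
        size_t n = b_cap - zs.avail_out;
        inflateEnd(&zs);
        return n;
    }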

Comments:

  1. If I understand correctly, using zlib with windowBits=15, you would be limited to 32k file size?

    1. Yes, that's correct and a good point. I mentioned zlib mostly as a point of reference that everyone is probably aware of, because it has quite good support for flushing the compressor. In practice it wouldn't be very useful for patching real files (the window size is too small).

    2. By comparison, LZHAM supports up to 512MB window size. Larger files would have to be divided up into blocks in some way.

  2. If the files are similar, I think you can split both files into 32k (or 16k?) chunks, alternate chunks of the two files in the output, and apply the algorithm to each pair of chunks.
