Thursday, March 21, 2013

A Streaming format for LZ4

It is a long time since I'm willing to produce a suitable streaming format for LZ4. As stated earlier, the format defined into lz4demo was not meant to become widespread, and is too limited. It's also completely unsuitable for network-oriented protocols.

As a consequence, you'll find in the following link a nearly-final specification of the LZ4 streaming format, in OpenDocument format.

It's only "nearly" final, since there are still a few questions left, summarized at the end of each chapter.

However, the main properties seem settled. The format accomodates for a large selection of buffer sizes, authorizes sequential and independant blocks, embed a few checksum options, accept arbitrary flushes at any point, and even define provisions for preset dictionaries mode.

At this stage, the very last questions seem ready to be settled in the next few weeks. Inputs, comments are welcomed.

[Edit] progresses :
Settled thanks to your comments (here and direct email):

  • Endian convention : more votes for Little Endian.
  • Stream Size : 8 bytes seems okay
  • Stream checksum : removed (v0.7), block-level checksum seems enough
  • High compression flag : removed (v0.8), seems not useful enough
  • Block Maximum size : reduced table (v0.9), from 1KB to 4MB
  • Block size : simplified compressed/uncompressed flags (v1.0)

[Edit] answering spec-related questions directly into the post

Jim> suggestion is to allow different checksums, with a 16-bit word identifying which hash

Actually, it was my initial intention.
But i eventually backed off. Why ?

One of LZ4 strong points is its simple specification, which makes it possible for any programmer to produce an LZ4-compatible algorithm of its own, in any language. To reach this goal, complexity is always fought and reduced to a minimum.

If multiple hash algorithms are possible, then the decoder will have to support them all to be compatible with the specification. It's a significant burden, which will deter or slow down people willing to produce, test and maintain their own decoding routine.

Obviously, the streaming format proposed here is not supposed to be "the most appropriate for any usage". I intend it to become "mainstream", cross-platform, and to replace the current "static format" of "lz4demo". But there will always be some specific circumstances in which another format will do a better job.

Matt> deflate format supports preset dictionaries, but nobody uses them. I would drop it.

Actually, I had a few requests for such a feature. The idea is that pre-set dictionaries can make a great difference when it comes to sending a lot of small independant blocks.

Matt> Do you need checksums at all?

I think so. The format is for both short-lived transmission data, and for storage one. Checksum is almost mandatory for the second use-case. Anyway, in the proposal, Checksum is still an "option", so it can be disabled by the sender if it seems better for its own use case.

Matt> Do you need both block and stream checksums? Probably not.
Mark> Stream Checksum: I don't see the point if data block checksum gives the appropriate protection 

That's the good question. It's kind of 50/50 when it comes to evaluating comments.
The simplicity argument looks good though.
[Edit 2] Stream checksum is removed (v0.7+), keeping only block-level checksum

Mark> Do you really think your block maximum size values make sense (...) All in all, I tend to think it is a too wide choice anyway

Good point. In v0.9, the choice of values for block maximum size has been reduced.

Matt> Do you need variable sized fields just to save a few bytes in the header? Probably not. 
Mark> I would make that "compressed flag" explicit, and thus keep only one "data size" field disregarding if the data was compressed or not. (...) . I'm not even sure you would need two possible sizes just to save one byte per block. 

Good points. Apparently, this part looks too complex, and would deserve some re-thinking.
[Edit 2] : the specification regarding block size is simplified in v1.0.

Adrien> Would it be better to let users select their own dictionary identifiers, rather than requiring Dict-ID to be the result of passing the dictionnary through xxHash-32 ?

Good point . This behavior mimics the spec of RFC1950 (zlib). The only advantage i can think of is that it allows the decoder (and encoder) to check if the dictionary is the right one. I'm unsure if this is really useful though....

Takayuki> LZ4 stream may be (pre) loaded on Ready Only Memory. In this case, temporal masking 0 for BC.Checkbits is difficult.

Correct. The current proposal is to load the header into RAM in order to perform the masking and checksum there. Is that an issue ?
Another proposal could be to not check the checkbits for ROM pre-loaded streams, since potential transmission error is nullified for such scenario.