Discussion
gzip decompression in 250 lines of rust
MisterTea: > twenty five thousand lines of pure C not counting CMake files. ... Keep in mind this is also 31 years of cruft and lord knows what.

Plan 9 gzip is 738 lines total:
* gzip.c: 217 lines
* gzip.h: 40 lines
* zip.c: 398 lines
* zip.h: 83 lines

Even the zipfs file server that mounts zip files as file systems is 391 lines.

> ... (and whenever working with C always keep in mind that C stands for CVE).

Sigh.
nayuki: Just like that author, many years ago I went through the process of understanding the DEFLATE compression standard and producing a short, concise decompressor for gzip+DEFLATE. Here are the resources I published:
* https://www.nayuki.io/page/deflate-specification-v1-3-html
* https://www.nayuki.io/page/simple-deflate-decompressor
* https://github.com/nayuki/Simple-DEFLATE-decompressor
tyingq: His implementation also omits the CRC (which is part of the 25k lines), has no --fast/--best/etc., is missing some output formats, and so on. I'm sure the 25k includes a lot of bloat, but the comparison is odd. Comparing to your list would make much more sense.
kibwen: I would expect a CRC to add a negligible number of lines of code. The reason that production-grade decompressors are tens of thousands of LOC is likely attributable to extreme manual optimization. For example, I wouldn't be surprised if a measurable fraction of those lines are actually inline assembly.
tyingq: Yes, there are subdirectories with language bindings for many non-C languages, an examples folder with example C code, Win32-specific C code, test code, etc. More reasons it's an odd comparison.
up2isomorphism: Another dev who doesn't show respect for what has been done and expects a particular language to do wonders for him. Also, I don't see that this is much better in terms of readability.
hybrid_study: He does mention https://github.com/trifectatechfoundation/zlib-rs, not just https://github.com/madler/zlib, but it would be interesting to hear from those developers too.
xxs: CRC32 can be written in a handful of lines of code, although it'd be better to use vector instructions, e.g. AVX, when available.
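For scale, a bitwise CRC-32 over the reflected IEEE polynomial (the one gzip uses) really does fit in about a dozen lines of Rust. This is a sketch of the compact, one-bit-per-iteration form, not the table- or SIMD-driven paths a production library would use:

```rust
// Bitwise CRC-32 (reflected IEEE polynomial 0xEDB88320), as used by gzip.
// Compact but slow: one bit per loop iteration, no lookup table.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all-ones when the low bit is set, else zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn main() {
    // Standard check value: CRC-32 of "123456789" is 0xCBF43926.
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
}
```

The table-based variant trades this inner loop for a 256-entry lookup table; the vectorized variants xxs alludes to use carry-less multiply instructions and are where the line count grows.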
ack_complete: Doesn't need to be inline assembly; just pre-encoded lookup tables and intrinsics-based vectorized CRC alone will add quite a lot of code. Most multi-platform CRC implementations tend to have at least a few paths: byte/word/dword at a time, hardware CRC, and hardware GF(2) multiply. It's not really extreme optimization, just better algorithms to match better hardware capabilities.

The Huffman decoding implementation is also bigger in production implementations, for both speed and error checking. The two Huffman trees need to be exactly complete except in the special case of a single code, and in most cases they are flattened to two-level tables for speed (though the latest desktop CPUs have enough L1 cache to use single-level tables).

Finally, the LZ copy typically has special cases added for using wider-than-byte copies for non-overlapping, non-wrapping runs. This is a significant decoding speed optimization.
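The LZ-copy special-casing ack_complete describes can be sketched in a few lines of Rust. When the back-reference doesn't overlap the bytes being produced, a bulk copy is safe; when it does (distance < length), the decoder must copy byte by byte so earlier output feeds later copies. The function name and structure here are illustrative, not taken from any particular decoder:

```rust
// Sketch of a DEFLATE back-reference copy: append `length` bytes that
// start `distance` bytes back from the end of `out`.
// Assumes the caller has validated distance <= out.len() and distance > 0.
fn lz_copy(out: &mut Vec<u8>, distance: usize, length: usize) {
    let start = out.len() - distance;
    if distance >= length {
        // Non-overlapping run: one contiguous copy. This is the case
        // production decoders accelerate with word-wide or SIMD copies.
        out.extend_from_within(start..start + length);
    } else {
        // Overlapping run (e.g. distance 1 repeats the last byte):
        // bytes pushed earlier in this loop are sources for later ones,
        // so we must copy one byte at a time.
        for i in 0..length {
            let b = out[start + i];
            out.push(b);
        }
    }
}

fn main() {
    let mut out = b"ab".to_vec();
    lz_copy(&mut out, 2, 6); // distance 2, length 6: repeats "ab"
    assert_eq!(out, b"abababab");
}
```

The branch is the whole point: a naive decompressor only needs the byte-at-a-time loop, while the fast paths for the non-overlapping case are part of why production decoders are so much larger.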
maverwa: Where do you see the lack of respect? The author wanted to learn how gzip works and chose a language they like to implement it in. As a learning tool, not because the world needs another gzip decompressor.