Internet-Draft GDEFLATE bitstream specification July 2024
Uralsky Expires 8 January 2025 [Page]
Workgroup:
Internet Engineering Task Force
Internet-Draft:
draft-uralsky-gdeflate-00
Published:
Intended Status:
Informational
Expires:
Author:
Y. Uralsky
NVIDIA Corporation

GDEFLATE bitstream specification

Abstract

This specification defines a lossless data compression algorithm optimized for high-throughput decompression on many-core data-parallel architectures. It is based on the original DEFLATE algorithm, modifying its bit stream layout to facilitate efficient SIMD decoding of the bitstream.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 8 January 2025.

Table of Contents

1. Introduction

GDeflate is a lossless data compression algorithm optimized for high-throughput decompression on many-core data-parallel architectures, such as GPUs. It is based on the original DEFLATE algorithm, modifying its bit stream layout to facilitate efficient SIMD decoding of the bitstream. Modulo the bit stream formatting, GDeflate largely reuses the compression scheme defined by RFC 1951, applying minimal changes. This enables reuse of existing algorithms and other IP for compression and decompression, only requiring changes to the addressing of the bitstream.

Throughout this document, familiarity with the original DEFLATE algorithm and encoding principles is assumed. For full details on the DEFLATE algorithm please refer to RFC 1951. Following sections will focus on GDeflate-specific changes to the DEFLATE encoding and its modified bit stream layout.

2. Deviations from the original DEFLATE encoding

2.1. DEFLATE64 extensions

GDeflate directly integrates DEFLATE64 extensions (which were not originally included in RFC 1951), allowing longer copies with distance references spanning a range of up to 64KB. These extensions comprise the following modifications: first, length code 285 is re-purposed to represent an LZ copy of length in the range between 3 and 65538 bytes. 16 extra bits of unencoded length value follow this code. Second, two previously unused distance codes 30 and 31 are now used to represent look back distances of up to 65536 bytes. They are followed with 14 extra bits of unencoded distance value, specifying distances in the range of 32769 to 49152 and 49153 to 65536 bytes, respectively.

For completeness, modified tables of length and distance codes are included below. Changes relative to the original tables (refer to section 3.2.5 in RFC 1951) are highlighted:

Length codes:

Table 1
Code Extra Bits Length(s)
257 0 3
258 0 4
259 0 5
260 0 6
261 0 7
262 0 8
263 0 9
264 0 10
265 1 11,12
266 1 13,14
267 1 15,16
268 1 17,18
269 2 19-22
270 2 23-26
271 2 27-30
272 2 31-34
273 3 35-42
274 3 43-50
275 3 51-58
276 3 59-66
277 4 67-82
278 4 83-98
279 4 99-114
280 4 115-130
281 5 131-162
282 5 163-194
283 5 195-226
284 5 227-257
285 16 3-65538

Distance codes:

Table 2
Code Extra Bits Distance
0 0 1
1 0 2
2 0 3
3 0 4
4 1 5,6
5 1 7,8
6 2 9-12
7 2 13-16
8 3 17-24
9 3 25-32
10 4 33-48
11 4 49-64
12 5 65-96
13 5 97-128
14 6 129-192
15 6 193-256
16 7 257-384
17 7 385-512
18 8 513-768
19 8 769-1024
20 9 1025-1536
21 9 1537-2048
22 10 2049-3072
23 10 3073-4096
24 11 4097-6144
25 11 6145-8192
26 12 8193-12288
27 12 12289-16384
28 13 16385-24576
29 13 24577-32768
30 14 32769-49152
31 14 49153-65536

2.2. Formatting of non-compressed blocks

GDeflate also changes formatting of non-compressed blocks (BTYPE=00, refer to section 3.2.4 in RFC 1951). In the original DEFLATE algorithm, non-compressed blocks start with a block-specific header specifying their length in bytes as a 16 bit value, followed by its one's complement. The header is also specified to start on a byte-aligned boundary. In GDeflate, the one's complement of length is dropped from the header and the header is no longer required to be byte-aligned. As a result, data across all blocks, compressed or non-compressed, forms one contiguous bit stream. This streamlines SIMD decoding of the entire bit stream, as will be explained later.

3. General assumptions and the machine model

GDeflate assumes that decoding is done on a multi-threaded many-core SIMD or SIMT machine with 32 bit registers and 32 bit memory access grain. It is expected that per-lane operations can be predicated, and that there is support for basic cross-lane communication and synchronization, allowing to construct fundamental data parallel primitives such as reductions and prefix sums.

GDeflate targets SIMD width of 32, which aligns well with most common CPU and GPU architectures today and in foreseeable future. Machines with narrower physical SIMD width can serialize decoding, or assign multiple SIMD threads to decode a single GDeflate bit stream. As an extreme example, GDeflate can be decoded on a fully scalar processor without any SIMD support. Machines with wider SIMD width can decode multiple GDeflate bit streams using a single SIMD group, or choose to leave a subset of SIMD lanes idle during decoding.

In the discussion below, the term "SIMD group" can refer to a cooperative thread group, a warp, or a SIMD instruction or several SIMD instructions, depending on the implementation. The SIMD group is assumed to comprise 32 parallel "lanes", even though the physical SIMD width of the underlying implementation may be wider or narrower.

4. High-level blocking of GDeflate stream

At high-level, GDeflate splits the data to be compressed into pages of 64KB in size. The pages are completely independent, enabling compression and decompression to operate on multiple pages in parallel. The expectation is that each page will be processed by a physical thread, or a separate core.

64KB page blocking enables random access to compressed data at page granularity and allows GDeflate bit streams to be used directly with applications utilizing sparse/tiled resource infrastructure available in modern graphics APIs.

Any higher-level enveloping of the bitstream is not discussed in this specification. It is assumed that the application will use a separate data structure (such as an array of pointers) in order to address pages comprising their overall compressed data set.

5. Bit stream rearrangement for SIMD parallelism

The key change that GDeflate makes to the original DEFLATE bit stream layout is rearranging the way its variable-length codes (representing literals, copies, or other auxiliary meta-data) are ordered in the bit stream.

In the original DEFLATE algorithm, these variable-length codes are ordered linearly, assuming the decoder consumes these codes one at a time as it traverses the bit stream serially. While simple to understand, this serial ordering makes it difficult to efficiently utilize SIMD processing on a data-parallel machine, due to serial dependencies between the codes: decoding of the immediately following code cannot start until the bit length of the current code is determined and it is fully consumed.

5.1. Sub-streams and ordering of symbols

To enable parallel parsing of the bit stream, GDeflate splits the original sequence of variable-length codes into 32 independent sub-streams. Each sub-stream is assigned to a fixed SIMD lane, so all lanes in the SIMD group can collectively parse all 32 sub-streams in parallel. Each iteration of the parsing process is referred to as "SIMD round". At each SIMD round, 32 codes from 32 sub-streams are decoded into symbols, with one code processed in each lane (with some exceptions detailed in the following sections).

Within a SIMD round, decoded symbols follow the SIMD lane order. For example, symbols corresponding to entropy-coded data within compressed blocks (literal bytes and LZ copies) which are processed within the same round always map to a contiguous range of memory addresses in the output buffer: this allows all decoded symbols to be written as a dense SIMD memory write (assuming each SIMD lane computes its output pointer accordingly). Once a SIMD round is complete, the SIMD group proceeds to the next round, decoding the next group of 32 symbols from 32 sub-streams, advancing its input and output pointers. Thus, across SIMD rounds, corresponding groups of 32 symbols follow sequentially.

5.2. Packing of variable-length codes within sub-streams

To enable concurrent processing of sub-streams, GDeflate packs the variable-length codes, comprising individual sub-streams into a set of naturally aligned 32 bit words, which can be directly addressed by the SIMD group. As a result, the bit stream is always accessed at word granularity: at each SIMD round, at most a single aligned 32 bit word is read from each sub-stream (therefore, at most 32 memory reads can be performed during a round across all lanes). These reads are done on as-needed basis: each lane keeps track of the number of bits it is yet to decode from the previous SIMD round, and when that number drops below a threshold, a memory read on the current SIMD round is scheduled. The bits returned by the new memory read are appended to the bits remaining to be decoded in that lane, so that the lane can proceed to decoding a symbol. Once a symbol is decoded, its corresponding code is consumed, reducing the number of bits available for decoding on the next round, which will eventually trigger another memory read in one of the subsequent rounds.

In GDeflate, the longest possible code contains 31 bits: 15 bits of the worst-case Huffman code (per DEFLATE spec) plus maximum of 16 extra bits (this could happen for the length symbol of a long LZ copy). Consequently, the memory read threshold in GDeflate is set to 32 bits (rounding up). This ensures that at any time during parsing, at least 32 bits of the sub-stream data are immediately available for decoding in each lane, and hence at least one symbol per lane can be fully decoded on each SIMD round.

Note that as a result of this packing, variable-length codes may be split across words. However, no variable-length code (including any extra bits it may require) can span more than 2 words, so reading a single word always adds enough data to guarantee that at least 1 symbol can be decoded per lane.

It is expected that implementations will maintain this per lane state in a pair of 32 bit registers and a counter variable tracking how many bits are currently valid. When the number of valid bits in the state variable drops below 32, a new 32 bit word is read from memory and appended to the state variable. Corresponding memory addresses for these reads can be computed by cooperative data-parallel operations across the SIMD group, such as prefix sums. Bits consumed during decoding can be removed from the state variable by shifting the register pair down.

The following pseudocode illustrates the process of reading the bitstream:

// For each SIMD lane in parallel ...

uint64_t bitBuffer; // State variable
uint     bitCount;  // Number of valid bits in the state variable

bool isRefillRequired = bitCount < 32;  // Generate refill predicate
                                        // for this lane

uint32_t refillMask = vote(isRefillRequired);   // Collect refill
    // predicates across all lanes into a mask

// Compute this lane's offset as a prefix sum over refill mask
uint localOffset = countbits(refillMask & ((1 << lane) - 1));

if (isRefillRequired) { // On a refill ...
    // ... Append new bits to the state variable
    bitBuffer |= (uint64_t)input[localOffset] << bitCount;
    bitCount += 32; // ... Track the number of valid bits available
}

// Lane 0 advances shared input pointer
if (lane == 0) input += countbits(refillMask);

// Proceed to decoding variable-length codes contained in
// LSBs of bitBuffer

5.3. Initial decoder state and formatting of the final bit stream

At the start of decoding, state variables in all SIMD lanes are initialized to contain 0 bits. Therefore, the first SIMD round of decoding always loads 32 consecutive words from the input stream, effectively initializing state variables of all 32 sub-streams. As a result, any valid GDeflate bit stream cannot be smaller than 4 * 32 = 128 bytes. Subsequent rounds will load variable number of words from the input, depending on the rate at which each SIMD lane consumes codes from the sub-stream it is responsible for.

The final bit stream comprises words from the 32 sub-stream interleaved in the order in which they are read during decoding. Note that this reordering/interleaving applies end-to-end across all blocks in the bit stream, disregarding block boundaries. Effectively, GDeflate treats the entire bit stream as a contiguous sequence of variable-length codes, permuted and packed into sub-streams as described above. Words residing at the very end of the bit stream may not be fully packed with variable-length codes, which means leaving some available bits unused. This tail effect represents certain small overhead introduced by the SIMD-oriented rearrangement of data in the bit stream.

5.4. Parsing example

To illustrate the parsing process, consider the following bit stream layout. The numbers indicate byte addresses of corresponding 32 bit words in the bit stream. Letters represent variable-length codes packed within their respective words. Note that some codes are split across two words - this is indicated with a dot to the left or to the right of the corresponding letter. The sets of 32b words loaded on each SIMD round are shown underneath the boxes.

0       4       8       12     124     128
+-------+-------+-------+-     -+-------+
|A,B,C. | D,E.  |F,G,H,I|  ...  | J,K.  |
+-------+-------+-------+-     -+-------+
|<----------- SIMD round 0 ------------>|


128     132     136     252     256     K      K+4     K+8
+-------+-------+-     -+-------+-    -+-------+-------+-------+
|  .C,L |.E,M,N |  ...  | .K,O  |  ... | X,Y.  |   Z.  |  .W   |
+-------+-------+-     -+-------+-    -+-------+-------+-------+
|<------- SIMD round 1 -------->|      |<--- SIMD round N ---->|

On the very first SIMD round, the decoder loads the initial set of 32 words from the input stream, at contiguous memory addresses spanning the first 128 bytes. This initializes per-lane state variables in each SIMD lane with 32 bits of input data, which contain packed variable-length codes. On this first round, the SIMD group processes codes A,D,F...J (in that order) which all come from 32 separate sub-streams. Processed codes are removed from their respective per-lane state variables. At that point, the number of valid bits available for decoding in each lane drops below 32, and hence on the second round each lane loads another 4-byte word of input, which reside at addresses 128 through 252 in the input stream, appending loaded bits to the valid bits remaining from the first round. On the second round, codes B,E,G...K from the 32 sub-streams are processed by the SIMD group. Note that code E was split across two words, residing at byte addresses 4 and 132. On the second round, all bits of E become available in the lane #1's state variable, hence it can be fully decoded on that round. The same situation applies to code K, split across words at addresses 124 and 252. At the end of the second round, the codes were processed in the following order: A,D,F...J,B,E,G...K.

The decoding process continues, however on subsequent rounds the SIMD group will likely be loading fewer than 32 words of input. In our example, on round N the SIMD group loads only 3 additional words, at byte addresses K through K+8. Which lanes perform these loads is a function of the number of valid bits remaining in per-lane state from previous rounds, which is in turn a function of how many variable-length codes were packed in the corresponding sub-stream processed by that lane so far. In essence, each sub-stream is represented by a chain of 32b words scattered in the input stream. All lanes in the SIMD group collectively traverse 32 such sub-streams, cooperatively computing memory addresses for each lane from round to round.

6. Block headers

Like in standard DEFLATE, each GDeflate block starts with a generic 3 bit header. The first bit indicates whether this block is the last block in the stream, and the following two bits encode the block type (see section 3.2.3 in the DEFLATE RFC for details).

Additionally, non-compressed blocks and blocks compressed with dynamic huffman codes (BTYPE=00 and BTYPE=10, respectively) also include their own block-specific headers: headers for non-compressed blocks are 16 bits long (see section 2.2), and headers for dynamic huffman blocks are 14 bits long.

In GDeflate, all block headers are always assigned to sub-stream 0 and require one dedicated SIMD round of processing per header, where only 1 SIMD lane is active. The other 31 lanes do not modify their state variables during the header decode round.

This implies that non-compressed blocks and blocks encoded with dynamic huffman codes process their headers in 2 SIMD rounds (1 round for the generic header and 1 additional round for the block-specific header). During both rounds, only 1 SIMD lane is active. Header data is then broadcast to all SIMD lanes.

Blocks encoded with fixed huffman trees do not have block-specific headers, and therefore it takes only 1 SIMD round to decode their generic header.

7. Non-compressed blocks

The data in the body of non-compressed blocks is distributed across 32 sub-streams and goes through the same reordering described in section 5. "Codes" in each sub-stream within non-compressed blocks are assumed to be fixed-size 8 bit atoms, so on every SIMD round each lane writes a byte to the output buffer and removes exactly 8 bits from its state variable. Thus, when processing non-compressed blocks each SIMD round loads eight 32 bit words from the input stream in steady state. The number of full SIMD rounds required to process a non-compressed block is determined by floor(block_size_in_bytes / 32). On the last SIMD round, only (block_size_in_bytes % 32) SIMD lanes are active, remaining lanes stay idle and do not change their state variables.

8. Blocks compressed with dynamic huffman codes

8.1. Code lengths for the huffman trees alphabet

In blocks compressed with dynamic huffman codes, the header is followed by an array of up to 19 three bit values, specifying code lengths for the code length alphabet used to describe the huffman trees. Each code length value gets assigned to a separate sub-stream, and hence SIMD lane, therefore all code length values are processed within a single SIMD round. At least 32 - 19 = 13 SIMD lanes remain inactive in this round and do not modify their state variables. The actual number of sub-streams used (and hence the number of SIMD lanes that participate in this round) is specified in the header (value of field HCLEN + 4). Each active lane decodes a fixed-length 3 bit code on this round before the SIMD groups proceeds to the next section.

8.2. Huffman trees meta-data

Block-specific huffman trees represented in the form of compressed arrays of code length values for literal/length and distance tables are encoded next in the bitstream. There can be up to 318 symbols in this section (up to 286 literal/length symbols plus up to 32 distance symbols), encoded with variable-length codes. These codes are evenly distributed across 32 sub-streams and require up to 10 (corresponding to the maximum of 318 symbols) SIMD rounds to process. The ordering of symbols in this section follows the rules described in section 5: each SIMD round processes a group of 32 consecutive symbols, with groups ordered sequentially as they are processed.

9. Blocks compressed with static huffman codes

Blocks compressed with static huffman codes have no block-specific headers, and use pre-defined huffman tables as per RFC 1951. The entropy-coded data within such blocks is ordered analogously to dynamic huffman blocks (i.e. distributed across all 32 sub-streams).

10. Processing of the end-of-block symbol

Blocks compressed using either dynamic or static huffman codes utilize a special end-of-block symbol (code 256) to indicate the end of compressed data for that block. The decoder is expected to monitor whether such code is detected on one of the SIMD rounds. If that is the case, that SIMD round is considered to be the last round for that block, and all SIMD lanes following the lane that has decoded the end-of-block symbol should not attempt to perform symbol decode and consume any bits from their state variables, leaving them unmodified instead: those bits belong to the immediately following block, and will get used shortly after it starts getting decoded.

11. LZ copy scheduling

GDeflate is designed to process (up to) 32 symbols per SIMD round, with each lane decoding a single symbol from a separate sub-stream. Since LZ copies are specified using two symbols (a length symbol followed by a distance symbol), copy processing is always split across two SIMD rounds. On the first SIMD round, the length symbol is decoded and the corresponding amount of space is reserved in the output buffer. On the immediately following SIMD round, the distance symbol is decoded and the copy is performed, filling the previously reserved space. The two symbols are always assigned to the same sub-stream so that they get processed by the same SIMD lane during two back-to-back SIMD rounds - this ensures that the state necessary to perform the copy doesn't have to migrate across SIMD lanes and the copy scheduling can be implemented with a simple per-lane state machine.

12. Normative References

13. Security Considerations

Any data compression method involves the reduction of redundancy in the data. Consequently, any corruption of the data is likely to have severe effects and be difficult to correct. Uncompressed text, on the other hand, will probably still be readable despite the presence of some corrupted bytes. It is recommended that systems using this data format provide some means of validating the integrity of the compressed data.

14. IANA Considerations

This document has no IANA actions.

15. Acknowledgments

Trademarks cited in this document are the property of their respective owners.

Yury Uralsky, Ilya Terentyev, Vikram Kushwaha, Akshay Subramaniam and Nikolay Sakharnykh contributed to GDeflate and this document.

Phil Katz designed the deflate format. Nvidia Corporation modified its bit stream layout to facilitate efficient SIMD decoding of the bitstream.

Author's Address

Yury Uralsky
NVIDIA Corporation