GitHub - awslabs/aws-c-compression: C99 implementation of huffman encoding/decoding (original) (raw)

AWS C Compression

This is a cross-platform C99 implementation of compression algorithms such as gzip, and huffman encoding/decoding. Currently only huffman is implemented.

License

This library is licensed under the Apache 2.0 License.

Usage

Building

Note that aws-c-compression has a dependency on aws-c-common:

git clone git@github.com:awslabs/aws-c-common.git
cmake -DCMAKE_PREFIX_PATH=<install-path> -DCMAKE_INSTALL_PREFIX=<install-path> -S aws-c-common -B aws-c-common/build
cmake --build aws-c-common/build --target install

git clone git@github.com:awslabs/aws-c-compression.git
cmake -DCMAKE_PREFIX_PATH=<install-path> -DCMAKE_INSTALL_PREFIX=<install-path> -S aws-c-compression -B aws-c-compression/build
cmake --build aws-c-compression/build --target install

Huffman

The Huffman implemention in this library is designed around the concept of a generic "symbol coder" object, which defines how each symbol (value between 0 and 255) is encoded and decoded. This object looks like this:

typedef struct aws_huffman_code (*aws_huffman_symbol_encoder)(uint8_t symbol, void *userdata); typedef uint8_t (*aws_huffman_symbol_decoder)(uint32_t bits, uint8_t *symbol, void *userdata);

struct aws_huffman_symbol_coder { aws_huffman_symbol_encoder encode; aws_huffman_symbol_decoder decode; void *userdata; };

These callbacks may be implemented manually, or you may use the included Huffman coder generator to generate one from a table definition file. The generator expects to be called with the following arguments:

$ aws-c-compression-huffman-generator path/to/table.def path/to/generated.c coder_name

The table definition file should be in the following format:

/* sym bits code len / HUFFMAN_CODE( 0, "1100101110", 0x32e, 10) HUFFMAN_CODE( 1, "1100101111", 0x32f, 10) / ... */

The HUFFMAN_CODE macro expects 4 arguments:

Note

This file may also be #included in the following way to generate a static list of codes:

/* Provides the HUFFMAN_CODE macro */ #include <aws/testing/compression/huffman.h>

static struct huffman_test_code_point code_points[] = { #include "test_huffman_static_table.def" };

This will emit a c file which exports a function with the following signiture:

struct aws_huffman_symbol_coder *{coder_name}_get_coder();

Note that this function does not allocate, but maintains a static instance of the coder.

An example implementation of this file is provided intests/test_huffman_static_table.def.

To use the coder, forward declare that function, and pass the result as the second argument to aws_huffman_encoder_init and aws_huffman_decoder_init.

struct aws_huffman_encoder encoder; aws_huffman_encoder_init(&encoder, {coder_name}_get_coder());

struct aws_huffman_decoder decoder; aws_huffman_decoder_init(&decoder, {coder_name}_get_coder())

Encoding

/**

The encoder is built to support partial encoding. This means that if there isn't enough space in output, the encoder will encode as much as possible, update length to indicate how much was consumed, output_size won't change, and AWS_ERROR_SHORT_BUFFER will be raised. aws_huffman_encode may then be called again like the following pseudo-code:

void encode_and_send(const char to_encode, size_t size) { while (size > 0) { uint8_t output[some_chunk_size]; size_t output_size = sizeof(output); size_t bytes_read = size; aws_huffman_encode(encoder, to_encode, &bytes_read, output, &output_size); / AWS_ERROR_SHORT_BUFFER was raised... */ send_output_to_someone_else(output, output_size);

    to_encode += bytes_read;
    size -= bytes_read;
}
/* Be sure to reset the encoder after use */
aws_huffman_encoder_reset(encoder);

}

aws_huffman_encoder also has a uint8_t field called eos_padding that defines how any unwritten bits in the last byte of output are filled. The most significant bits will used. For example, if the last byte contains only 3 bits and eos_padding is 0b01010101, 01010 will be appended to the byte.

Decoding

/**

The decoder is built to support partial encoding. This means that if there isn't enough space in output, the decoder will decode as much as possible, update length to indicate how much was consumed, output_size won't change, and AWS_ERROR_SHORT_BUFFER will be raised. aws_huffman_decode may then be called again like the following pseudo-code:

void decode_and_send(const char to_decode, size_t size) { while (size > 0) { uint8_t output[some_chunk_size]; size_t output_size = sizeof(output); size_t bytes_read = size; aws_huffman_decode(decoder, to_decode, &bytes_read, output, &output_size); / AWS_ERROR_SHORT_BUFFER was raised... */ send_output_to_someone_else(output, output_size);

    to_decode += bytes_read;
    size -= bytes_read;
}
/* Be sure to reset the decoder after use */
aws_huffman_decoder_reset(decoder);

}

Upon completion of a decode, the most significant bits ofdecoder->working_bits will contain the final bits of to_decode that could not match a symbol. This is useful for verifying the padding bits of a stream. For example, to validate that a stream ends in all 1's (like HPACK requires), you could do the following:

AWS_ASSERT(decoder->working_bits == UINT64_MAX << (64 - decoder->num_bits));