BCOS Compressed Native File Format Specification

Preliminary Draft

Chapter 1: Overview

This document describes the native file format used for compressed files. It is intended to provide simple but effective compression for native file formats only.

The method used to compress data involves finding strings of bytes (or "runs") that are repeated, and replacing the string of bytes with a reference to its duplicate.

Chapter 2: Specification Change Policy

Future versions of this specification are not guaranteed to be backward or forward compatible with this version, although individual changes may be backward or forward compatible where possible.

In general, this specification is not expected to change in the future.

Chapter 3: General File Structure

The basic structure of the file is described in Figure 3-1. File Structure.

Figure 3-1. File Structure
		End of file
	(Optional) Metadata
	Compressed data
	Extended Header	0x00000030
	Generic Header	0x00000000
Note: Not to scale.

3.1. Generic File Header

This is the native file format header defined in the BCOS Native File Format Specification. To comply with this specification, the file type field must be 0xC0000000.

3.2. Extended File Header

The extended file header follows the generic file header, and is described in Table 3-1. Extended File Header Format.

Table 3-1. Extended File Header Format
Offset	Size	Description
0x00000030	8 bytes	Uncompressed file size
0x00000038	4 bytes	Uncompressed file checksum
0x0000003C	4 bytes	Uncompressed file type

This information serves three purposes. It is used to reconstruct the first 24 bytes of the original file's native file format header during decompression (these 24 bytes are not included in the compressed data section). It allows software to determine the type of the compressed file's contents without decompressing it first. It also allows decompression software to allocate a buffer of the correct size for the decompressed data.

When a file is compressed, its checksum is copied "as is" into the compressed file's extended header, so a checksum that was incorrect before the file was compressed will still be incorrect after the file is decompressed. However, if (and only if) the file's checksum has not been set, the code used to compress the file may (and preferably should) generate a correct checksum before compressing the file.
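As an illustration only, the fields in Table 3-1 could be read as in the following Python sketch. The function name `parse_extended_header` and the dictionary keys are hypothetical. The sketch also assumes the fields are stored little-endian, which this section does not state; the byte order of native file format headers is defined elsewhere.

```python
import struct

EXTENDED_HEADER_OFFSET = 0x30  # offset of the first field in Table 3-1
EXTENDED_HEADER_END = 0x40     # 0x30 + 8 + 4 + 4 bytes


def parse_extended_header(file_bytes):
    """Return the three extended header fields as a dict (hypothetical helper)."""
    if len(file_bytes) < EXTENDED_HEADER_END:
        raise ValueError("file is too short to contain the extended header")
    # ASSUMPTION: little-endian. "<QII" = one 8-byte and two 4-byte unsigned fields.
    size, checksum, file_type = struct.unpack_from(
        "<QII", file_bytes, EXTENDED_HEADER_OFFSET
    )
    return {
        "uncompressed_size": size,
        "uncompressed_checksum": checksum,
        "uncompressed_type": file_type,
    }
```

A decompressor could use `uncompressed_size` to allocate its output buffer before it reads any entries.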

3.3. Compressed Data

The compressed data consists of a variable number of entries, which ends when both the end of the compressed file and the end of the uncompressed file are reached. There are two types of entries: unmatched runs (where the data is embedded "as is" into the compressed data) and matched runs (where the data can be copied from somewhere else).

3.3.1. Unmatched Runs

The entry for an unmatched run indicates how many bytes of data are inserted unchanged into the compressed data.

Table 3-2. Initial Byte For Unmatched Runs
Bit/s	Description
7	Must be 0 to indicate run is unmatched
5 to 6	Number of extra size bytes
0 to 4	Size bits 0 to 4

The size of the run is determined from the size bits in the initial byte, plus the bits from any extra size bytes, plus one. For example, if the initial byte is 0x65 then there are three extra size bytes following the initial byte, and if the extra size bytes are "0x56, 0x34, 0x12" then the size of the run would be "(0x65 & 0x1F) + (0x56 << 5) + (0x34 << 13) + (0x12 << 21) + 1". These bytes would be followed by the bytes of the unmatched run itself.
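The size calculation for unmatched runs can be sketched in Python as follows. The function name `decode_unmatched_header` and its return convention (the run size plus the position of the first data byte) are hypothetical choices made for illustration, not part of this specification.

```python
def decode_unmatched_header(data, pos=0):
    """Decode the entry header for an unmatched run starting at data[pos].

    Returns (run_size, position_of_first_data_byte).
    """
    initial = data[pos]
    if initial & 0x80:
        raise ValueError("bit 7 is set, so this entry is a matched run")
    extra_count = (initial >> 5) & 0x03  # bits 5 to 6
    size = initial & 0x1F                # size bits 0 to 4
    shift = 5
    for i in range(extra_count):
        # Each extra size byte supplies the next 8 size bits.
        size += data[pos + 1 + i] << shift
        shift += 8
    return size + 1, pos + 1 + extra_count
```

Calling `decode_unmatched_header(bytes([0x65, 0x56, 0x34, 0x12]))` evaluates the same expression as the example in the text and reports that the run's data starts at position 4.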

3.3.2. Matched Runs

The entry for a matched run indicates how many bytes of data are the same as the bytes at a specified offset.

Table 3-3. Initial Byte For Matched Runs
Bit/s	Description
7	Must be 1 to indicate run is matched
5 to 6	Number of extra size bytes
4	Offset encoding (0 = literal, 1 = negative)
2 to 3	Number of offset bytes - 1
0 to 1	Size bits 0 to 1

The size of the run is determined from the size bits in the initial byte, plus the bits from any extra size bytes, plus three. For example, if the initial byte is 0xE3 then there are three extra size bytes following the initial byte, and if the extra size bytes are "0x56, 0x34, 0x12" then the size of the matched run would be "(0xE3 & 0x03) + (0x56 << 2) + (0x34 << 10) + (0x12 << 18) + 3".

The extra size bytes (if any) or the initial byte (if there are no extra size bytes) are immediately followed by 1, 2, 3 or 4 bytes used to encode the offset of the matching run. The number of offset bytes is determined by bits 2 to 3 in the initial byte. For example, if the initial byte is 0x8C (no extra size bytes and 4 offset bytes) and the next 4 bytes are "0x78, 0x56, 0x34, 0x12" then the offset would be "0x78 + (0x56 << 8) + (0x34 << 16) + (0x12 << 24)".
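Putting Table 3-3 together with the size and offset rules, a matched run's entry could be decoded along these lines. This is a minimal Python sketch: the function name `decode_matched_header` and the returned tuple are hypothetical, and the sketch does not check for truncated input.

```python
def decode_matched_header(data, pos=0):
    """Decode a matched run entry starting at data[pos].

    Returns (run_size, raw_offset, is_negative, position_after_entry).
    """
    initial = data[pos]
    if not initial & 0x80:
        raise ValueError("bit 7 is clear, so this entry is an unmatched run")
    extra_count = (initial >> 5) & 0x03         # bits 5 to 6
    is_negative = bool(initial & 0x10)          # bit 4: offset encoding
    offset_count = ((initial >> 2) & 0x03) + 1  # bits 2 to 3 hold the count - 1
    size = initial & 0x03                       # size bits 0 to 1
    pos += 1
    shift = 2
    for _ in range(extra_count):
        size += data[pos] << shift
        shift += 8
        pos += 1
    # Offset bytes are least significant first, as in the example in the text.
    offset = int.from_bytes(data[pos:pos + offset_count], "little")
    return size + 3, offset, is_negative, pos + offset_count
```

For the entry "0x8C, 0x78, 0x56, 0x34, 0x12" this yields a run size of 3 and a literal offset of 0x12345678.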

If the offset is a literal offset (bit 4 in the initial byte is clear) then the offset is an offset from the beginning of the decompressed data and remains "as is". If the offset is a negative offset (bit 4 in the initial byte is set), then it is a displacement backwards from the current position in the decompressed data. Negative offsets can be converted into literal offsets using "literal_offset = (offset_for_next_byte_in_decompressed_data - 1) - negative_offset". Normally compression code chooses whichever encoding gives the shortest offset. For very large files (larger than 4 GiB), having both encodings also increases the range of offsets that can be expressed (for example, for a 12 GiB file the offset may refer to somewhere in the first 4 GiB of the decompressed file or in the 4 GiB before the current position in the decompressed file).
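The conversion formula and the copy step for a matched run might look like the following Python sketch. The names `to_literal_offset` and `copy_matched_run` are hypothetical. Two assumptions are made that this specification does not spell out: the `out` buffer holds exactly the decompressed data that offsets refer to, so `len(out)` is the offset of the next byte to be written, and a matched run may overlap the bytes it is producing, which is why the copy is done one byte at a time.

```python
def to_literal_offset(offset, is_negative, next_offset):
    """Apply literal_offset = (next_offset - 1) - negative_offset if needed."""
    if is_negative:
        return (next_offset - 1) - offset
    return offset


def copy_matched_run(out, size, offset, is_negative):
    """Append `size` bytes to the bytearray `out`, copied from earlier output."""
    src = to_literal_offset(offset, is_negative, len(out))
    if src < 0 or src >= len(out):
        raise ValueError("offset points outside the data decompressed so far")
    for i in range(size):
        # ASSUMPTION: byte-by-byte copy, so a run may read bytes that this
        # same run has just written (overlapping match).
        out.append(out[src + i])
```

With `out = bytearray(b"abc")`, the call `copy_matched_run(out, 3, 2, True)` resolves the negative offset 2 to literal offset 0 and leaves `out` holding `abcabc`.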