BCOS File FormatsProject Map
BCOS Compressed Native File Format Specification
Preliminary Draft
 

Contents

1                Overview
2                General File Structure
2.1                Generic File Header
2.2                Extended File Header
2.3                Compressed Data
2.3.1                Unmatched Runs
2.3.2                Matched Runs
3                Compression Example


Tables

Table 2.1      Extended File Header Format
Table 2.2      Initial Byte For Unmatched Runs
Table 2.3      Initial Byte For Matched Runs


Listings

Listing 3.1      Example Compressed File Data
Listing 3.2      Example Compressed File Breakdown
Listing 3.3      Example Original File



1   Overview

This document describes the native file format used for compressed files. It is intended to provide simple but effective compression for native file formats only.

The method used to compress data involves finding strings of bytes (or "runs") that are repeated and replacing the later string of bytes with a reference to the an earlier string of bytes.


2   General File Structure

The basic structure of the file is described in Figure 2.1: File Layout.

_End of file


 Compressed data 


_Offset 48
 Extended File Header 
_Offset 32
 Generic File Header 
_Offset 0

Figure 2.1 - File Layout


2.1   Generic File Header

This is the native file format header defined in the BCOS Native File Format Specification.


2.2   Extended File Header

The extended file header follows the generic file header, and is described in Table 2.1: Extended File Header Format.

OffsetSizeDescription
  0x00000020
  8 bytes
  Uncompressed file size
  0x00000028
  4 bytes
  Uncompressed file checksum
  0x0000002C
  4 bytes
  Uncompressed file type
Table 2.1 - Extended File Header Format

This information is used to reconstruct the first 24 bytes of the original file's native file format header during decompression (the first 24 bytes of the compressed file's data is not included in the compressed data section), allows software to determine the type of the compressed file without decompressing it first, and allows decompression software to allocate a buffer of the correct size for the decompressed data. When a file is compressed it's checksum is copied "as is" into the compressed file's extended header, so that if the file's checksum was incorrect before the file was compressed it will still be incorrect after the file is decompressed. However, if a file's checksum hasn't been set (and only if the file's checksum hasn't been set) then code used to compress the file may (should) generate a correct checksum before compressing the file.


2.3   Compressed Data

The compressed data consists of a variable number of entries, that ends when both the end of the compressed file and the end of the uncompressed file is reached. There's 2 types of entries: unmatched runs (where the data is embedded "as is" into the compressed data) and matched runs (where the data can be copied from somewhere else).


2.3.1   Unmatched Runs

The entry for an unmatched run indicates how many bytes of data are inserted unchanged into the compressed data.

Bit/sDescription
  7
  Must be 0 to indicate run is unmatched
  5 to 6
  Number of extra size bytes
  0 to 4
  Size bits 0 to 4
Table 2.2 - Initial Byte For Unmatched Runs

The size of the run is determined from the size bits in the initial byte, plus the bits from any extra size bytes, plus one. For example, if the initial byte is 0x65 then there's three extra size bytes following the initial byte, and if the extra size bytes are "0x56, 0x34, 0x12" then the size of the unmatched run would be "(0x25 & 0x1F) + (0x56 << 5) + (0x34 << 14) + (0x12 << 21) + 1". These byte would be immediately followed by the bytes of the unmatched run itself.


2.3.2   Matched Runs

The entry for a matched run indicates how many bytes of data are the same as the bytes at a specified offset.

Bit/sDescription
  7
  Must be 1 to indicate run is matched
  5 to 6
  Number of extra size bytes
  4
  Offset encoding (0 = literal, 1 = negative)
  2 to 3
  Number of offset bytes - 1
  0 to 1
  Size bits 0 to 1
Table 2.3 - Initial Byte For Matched Runs

The size of the run is determined from the size bits in the initial byte, plus the bits from any extra size bytes, plus three. For example, if the initial byte is 0xE3 then there's three extra size bytes following the initial byte, and if the extra size bytes are "0x56, 0x34, 0x12" then the size of the matched run would be "(0x25 & 0x03) + (0x56 << 2) + (0x34 << 10) + (0x12 << 18) + 3".

The extra size bytes (if any) or the initial byte (if there's no extra size bytes) are immediately followed by 1, 2, 3 or 4 bytes used to encode the offset of the matching run. The number of offset bytes is determined by bits 2 to 3 in the initial byte. For example, if the initial byte is 0x8C (no extra size bytes and 4 offset bytes) and the next 4 bytes are "0x78, 0x56, 0x34, 0x12" then the offset would be "0x78 + (0x56 << 8) + (0x34 << 16) + (0x12 << 24)". If the offset is a literal offset (bit 4 in the initial byte is clear) then the offset is an offset from the beginning of the decompressed data and remains "as is". If the offset is a negative offset, then it's a displacement from the current position in the decompressed data. Negative offsets can be converted into literal offsets using "literal_offset = (offset_for_next_byte_in_decompressed_data - 1) - negative_offset". Normally compression code chooses the shortest possible encoding for the offset, however for very large files (larger than 4 GiB) this increases the range of offsets possible (for example, for a 12 GiB file the offset may refer to somewhere in the first 4 GiB of the decompressed file or in the 4 GiB before the current position in the decompressed file).


3   Compression Example

The following example shows a hexadecimal dump for a compressed text file, a breakdown of what each byte in the compressed text file means, and the original plain text file.

0x00000000  56 00 00 00 00 00 00 00  42 43 4F 53 5F 4E 46 46  |V.......BCOS_NFF|
0x00000010  B3 62 18 6C 00 00 00 C0  00 00 00 00 00 00 00 00  |.b.l............|
0x00000020  48 00 00 00 00 00 00 00  3F 21 33 A3 00 00 10 00  |H.......?!3.....|
0x00000030  B1 01 00 81 08 1A 20 54  65 73 74 0A 0A 48 65 6c  |...... Test..Hel|
0x00000040  6C 6F 20 57 6F 72 6C 64  21 0a 47 6F 6F 64 62 79  |lo World!.Goodby|
0x00000050  65 B1 01 0E 00 00                                 |e.....          |
86 bytes
Listing 3.1 - Example Compressed File Data

The first 32 bytes are the generic file header for the compressed file.

0x00000000  56 00 00 00 00 00 00 00  42 43 4F 53 5F 4E 46 46  |V.......BCOS_NFF|
0x00000010  B3 62 18 6C 00 00 00 C0  00 00 00 00 00 00 00 00  |.b.l............|


The next 16 bytes are the extended file header, consisting of the size of the
original/uncompressed file (0x0000000000000048 bytes), followed by the
checksum/CRC for the original/uncompressed file (0xA333213F), followed by the
original/uncompressed file's type (0x00100000, plain text).

0x00000020  48 00 00 00 00 00 00 00  3F 21 33 A3 00 00 10 00  |H.......?!3.....|


The compressed data begins immediately after the extended header.


This first 3 byte entry tells decompression software to copy 8 bytes from offset
0x000000017 in the decompressed file to offset 0x00000018 in the decompressed
file. The first byte is decoded as "matched run, one extra byte for size,
negative offset, one byte for offset, size = 1". The second byte is the extra
byte for the size, which contains bits 2 to 10 of the size of the run. This
makes the size of the run "1 + (0x01 << 2) + 3" (or 8 bytes long), where the "1"
is from the first byte, "+ (0x01 << 2)" is from the extra size byte, and the
final "+ 3" is implied. The last byte is the offset (0x00) which is a negative
offset and is therefore relative to the current position in decompressed file.
The current position in the decompressed file would be 0x00000018, so the actual
offset of the matched run would be "0x00000018 - 0x00 - 1" or 0x00000017.

0x00000030  b1 01 00                                          |...             |


The next 2 byte entry tells decompression software to copy 4 bytes from offset
0x000000008 in the decompressed file to the current offset in the decompressed
file. The first byte is decoded as "matched run, no extra size bytes,
literal offset, one byte for offset, size = 1". The actual size of the run is
"1 + 3" (or 4 bytes long) because there are no extra size bytes, where the "1"
is from the first byte and the "+ 3" is implied. The second byte is the offset
(0x08) which is taken literally.

0x00000030           81 08                                    |   ..           |


The next 28 byte entry tells decompression software to copy 27 bytes from the
current position in the compressed file to the current position in the
decompressed file. The first byte is decoded as "unmatched run, no extra bytes
for size, size = 0x1A". The actual size of the run is "0x1A + 1" where the "+ 1"
is implied.

0x00000030                 1A 20 54  65 73 74 0A 0A 48 65 6c  |     . Test..Hel|
0x00000040  6C 6F 20 57 6F 72 6C 64  21 0a 47 6F 6F 64 62 79  |lo World!.Goodby|
0x00000050  65                                                |e               |


The next 3 byte entry tells decompression software to copy 8 bytes from offset
0x000000030 in the decompressed file to offset 0x0000003F in the decompressed
file. The first byte is decoded as "matched run, one extra byte for size,
negative offset, one byte for offset, size = 1". The second byte is the extra
byte for the size, which contains bits 2 to 10 of the size of the run. This
makes the size of the run "1 + (0x01 << 2) + 3" (or 8 bytes long), where the "1"
is from the first byte, "+ (0x01 << 2)" is from the extra size byte, and the
final "+ 3" is implied. The last byte is the offset (0x0E) which is a negative
offset and is therefore relative to the current position in decompressed file.
The current position in the decompressed file would be 0x0000003F, so the actual
offset of the matched run would be "0x0000003F - 0x0E - 1" or 0x00000030.

0x00000050     B1 01 0E 00 00                                 | .....          |


The last 2 byte entry tells decompression software to copy 1 byte from the
current position in the compressed file to the current position in the
decompressed file. The first byte is decoded as "unmatched run, no extra bytes
for size, size = 0x00". The actual size of the run is "0x00 + 1" where the "+ 1"
is implied.

0x00000050              00 00                                 |    ..          |
Listing 3.2 - Example Compressed File Breakdown

0x00000000  48 00 00 00 00 00 00 00  42 43 4F 53 5F 4E 46 46  |H.......BCOS_NFF|
0x00000010  00 00 00 00 00 00 10 00  00 00 00 00 00 00 00 00  |................|
0x00000020  42 43 4F 53 20 54 65 73  74 0A 0A 48 65 6C 6C 6F  |BCOS Test..Hello|
0x00000030  20 57 6F 72 6C 64 21 0A  47 6F 6F 64 62 79 65 20  | World!.Goodbye |
0x00000040  57 6F 72 6C 64 21 0A 00                           |World!..        |
72 bytes
Listing 3.3 - Example Original File

For this example the compressed file is larger than the original/uncompressed file. This is because a completely new native file format header is prepended to the original data to indicate that it is a valid compressed file. Basically, 72 bytes of data is compressed down to 54 bytes, and then a 32 byte header is added to give a final file size of 86 bytes.


Generated on Sat Aug 1 16:05:47 2009