BCOS Faulty RAM List File Format Specification
Version 1.0
(Preliminary Draft)
 

Contents

1                Overview
1.1                A Cautionary Tale
1.2                The Faulty RAM List File
2                General File Structure
3                File Format
3.1                Generic File Header
3.2                Extended Header
3.2.1                Faulty RAM List Version Numbers (Major and Minor)
3.2.2                RAM Test Control Flags
3.2.2.1                Simple Boot RAM Test Enable Flag
3.2.2.2                Scheduled Boot RAM Test Enable Flag
3.2.2.3                Run-time RAM Testing Enable Flag
3.2.3                Scheduled Boot RAM Test Passes
3.2.4                Platform ID
3.3                Faulty RAM List Entries
4                Platform ID '8632'
4.1                Faulty RAM List Entries
4.1.1                Area Size
4.1.2                Area Starting Address
4.1.3                Faulty RAM List Entry Examples


Tables

Table 3.1      Extended Header Format
Table 3.2      RAM Test Control Flags
Table 3.3      Platform IDs
Table 4.1      Faulty RAM List Entry, First Dword
Table 4.2      Faulty RAM List Entry Examples



1   Overview

1.1   A Cautionary Tale

In the past, one of the computers I used developed faulty RAM. The only symptom was that the web browser occasionally crashed, but I ignored this (the web browser I was using was well known for stability problems, and everything else worked). Several months later, immediately after defragmenting the file system, I found out the RAM was faulty. The utility I used to defragment the file system loaded data from disk into the faulty RAM and wrote the data back to disk in a different place, which caused around half of my files to become corrupted. Of course I had no easy way to tell which half of the files had become corrupted, so I completely reformatted the drive (after doing some testing and replacing the faulty RAM).

There are a few important lessons in this example. The first lesson is that RAM faults can remain undetected for a relatively long time. This may be partly due to the CPU's caches (e.g. if data is written to both the cache and the faulty RAM and then read back from the cache, then the corrupt data in RAM isn't used) and partly because a relatively large amount of the data stored in RAM can be modified with no noticeable problem: a pixel with a slightly different color in some graphics data, a digitized sound with slightly more "noise", etc.

The second lesson is that the RAM test done by the BIOS is almost entirely useless. In my experience it's capable of detecting incorrectly inserted RAM modules and major RAM faults, but subtle errors and intermittent errors are (almost) never detected. To detect if RAM is faulty you need to use a tool designed to test RAM properly (e.g. http://www.memtest.org/), but most people don't use a stand-alone tool regularly - they wait until they suspect problems, and by then it's too late...

If you're lucky, you'll never see a RAM failure. If you're unlucky, RAM failures can be one of the most insidious hardware errors possible. The best solution is to use ECC RAM; however, ECC RAM is significantly more expensive (often close to twice the price of standard RAM), and for this reason it's rarely used except in servers.


1.2   The Faulty RAM List File

For the purpose of fault tolerance, the operating system is capable of detecting and avoiding areas of faulty RAM. To do this, the "Faulty RAM List" file is used to keep track of areas of RAM that shouldn't be used by the operating system. In addition, the "Faulty RAM List" file is used to control features designed to detect faulty RAM areas.

The "Faulty RAM List" file is one of the very first files used during boot, so that faulty/unreliable areas of RAM can be avoided from an early stage. While the OS is running changes to this file may be made automatically by the OS (e.g. if additional faulty areas are detected) or manually by administrators (e.g. if certain fault tolerance features are enabled or disabled). This means that new areas of faulty RAM detected while the operating system is running are "remembered" and not used after rebooting.


2   General File Structure

The basic structure of the file is described in Figure 2.1: File Layout.

  (End of file)
  Faulty RAM List Entries    starts at the offset obtained from the extended header
  Extended Header            starts at offset 32
  Generic File Header        starts at offset 0

Figure 2.1 - File Layout

Note: This file format is designed to allow for future expansion. Future versions of this specification may increase the size of the extended header, or add additional areas between the extended header and the faulty RAM list entries, or define new meanings to any reserved fields in the extended header. Code written to handle faulty RAM lists that comply with this specification should also be able to handle faulty RAM lists that comply with future versions of this specification.


3   File Format

3.1   Generic File Header

This is the native file format header defined in BCOS Native File Format Specification.


3.2   Extended Header

The extended header follows the generic file header, and is described in Table 3.1: Extended Header Format.

  Offset      Size     Description
  0x00000020  1 byte   Faulty RAM List version number minor (see Subsection 3.2.1: Faulty RAM List Version Numbers (Major and Minor))
  0x00000021  1 byte   Faulty RAM List version number major (see Subsection 3.2.1: Faulty RAM List Version Numbers (Major and Minor))
  0x00000022  1 byte   RAM Test Control Flags (see Subsection 3.2.2: RAM Test Control Flags)
  0x00000023  1 byte   Scheduled Boot RAM Test Passes (see Subsection 3.2.3: Scheduled Boot RAM Test Passes)
  0x00000024  4 bytes  Platform ID (see Subsection 3.2.4: Platform ID)
  0x00000028  4 bytes  32-bit offset within file for start of Faulty RAM List Entries (see Section 3.3: Faulty RAM List Entries)
  0x0000002C  4 bytes  32-bit offset within file for byte after Faulty RAM List Entries (see Section 3.3: Faulty RAM List Entries)
Table 3.1 - Extended Header Format
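
As an illustration of the layout in Table 3.1, the following is a minimal sketch of reading the extended header from a file image. The byte order of the two 32-bit offsets is not stated in this specification; little-endian is assumed here because the only defined platform is 80x86. The names ExtendedHeader and parse_extended_header are hypothetical and are not defined by the specification.

```python
import struct
from typing import NamedTuple

EXT_HEADER_OFFSET = 0x20         # the extended header starts at offset 32
EXT_HEADER_FORMAT = "<BBBB4sII"  # ASSUMPTION: little-endian; not stated in the spec


class ExtendedHeader(NamedTuple):
    version_minor: int     # BCD, offset 0x20
    version_major: int     # BCD, offset 0x21
    test_flags: int        # offset 0x22
    scheduled_passes: int  # offset 0x23
    platform_id: bytes     # 4 characters, no zero terminator, offset 0x24
    entries_start: int     # offset 0x28
    entries_end: int       # offset 0x2C


def parse_extended_header(data: bytes) -> ExtendedHeader:
    """Unpack the 16 bytes at offsets 0x20 to 0x2F into named fields."""
    size = struct.calcsize(EXT_HEADER_FORMAT)
    if len(data) < EXT_HEADER_OFFSET + size:
        raise ValueError("file is too short to hold the extended header")
    fields = struct.unpack_from(EXT_HEADER_FORMAT, data, EXT_HEADER_OFFSET)
    return ExtendedHeader(*fields)
```

The generic file header (offsets 0 to 31) is skipped rather than interpreted, because its format is defined elsewhere.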


3.2.1   Faulty RAM List Version Numbers (Major and Minor)

The version numbers allow code to determine which version of this specification a Faulty RAM List file complies with. These fields are encoded in BCD. For display purposes, the version number is displayed as 2 separate decimal numbers separated by a full stop (e.g. "major.minor"), where leading zeros are suppressed for the major version number and displayed for the minor version number, and trailing zeros are suppressed for the minor version number and displayed for the major version number. For example, if the major version number is 0x01 and the minor version number is 0x02 then it would be displayed as "1.02"; and if the major version number is 0x10 and the minor version number is 0x20 then it would be displayed as "10.2".

To indicate compliance with this version of the specification, the major version number must be 0x01 and the minor version number must be 0x00 (version 1.0).
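
The display rule can be sketched as follows. The function name format_version is hypothetical. One detail is an assumption: when every digit of a field is zero, a single "0" is kept, which matches the "version 1.0" wording above but is not spelled out as a rule.

```python
def format_version(major_bcd: int, minor_bcd: int) -> str:
    """Render BCD-encoded major and minor version bytes as 'major.minor'."""
    for value in (major_bcd, minor_bcd):
        # Each nibble of a BCD byte must be a decimal digit (0 to 9).
        if not 0 <= value <= 0xFF or (value >> 4) > 9 or (value & 0x0F) > 9:
            raise ValueError(f"not a valid BCD byte: {value:#x}")
    # The hex digits of a BCD byte are its decimal digits.
    major = f"{major_bcd:02x}".lstrip("0") or "0"  # suppress leading zeros
    minor = f"{minor_bcd:02x}".rstrip("0") or "0"  # suppress trailing zeros
    return f"{major}.{minor}"
```

With this sketch, format_version(0x01, 0x02) gives "1.02" and format_version(0x10, 0x20) gives "10.2", matching the examples in the text.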


3.2.2   RAM Test Control Flags

The "RAM Test Control Flags" field in the extended header controls the RAM testing features of the operating system.

  Bit/s   Description
  0       Simple Boot RAM Test Enable (clear = disable, set = enable)
  1       Scheduled Boot RAM Test Enable (clear = disable, set = enable)
  2 to 6  Reserved (must be clear)
  7       Run-time RAM Testing Enable (clear = disable, set = enable)
Table 3.2 - RAM Test Control Flags
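
A minimal sketch of decoding this byte is shown below. The names RamTestFlags, decode_test_flags and has_reserved_bits are hypothetical. Reserved bits are reported separately rather than rejected; this is a design assumption, made because future versions of the specification may give reserved fields new meanings.

```python
from enum import IntFlag


class RamTestFlags(IntFlag):
    SIMPLE_BOOT_TEST = 1 << 0     # bit 0
    SCHEDULED_BOOT_TEST = 1 << 1  # bit 1
    RUNTIME_TEST = 1 << 7         # bit 7


RESERVED_MASK = 0b0111_1100  # bits 2 to 6


def has_reserved_bits(value: int) -> bool:
    """True if any reserved bit (2 to 6) is set in the flags byte."""
    return bool(value & RESERVED_MASK)


def decode_test_flags(value: int) -> RamTestFlags:
    """Return the defined flags of a flags byte, ignoring reserved bits."""
    if not 0 <= value <= 0xFF:
        raise ValueError("flags field is a single byte")
    return RamTestFlags(value & ~RESERVED_MASK)
```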


3.2.2.1   Simple Boot RAM Test Enable Flag

This flag (when set) enables a "single pass" RAM test that occurs every time the OS boots. In this case, "in use" RAM is tested during boot (to make sure the boot code isn't relying on faulty RAM) and all RAM that's allocated after this step is tested once before it's used (either by boot code, or by the operating system itself). In effect, the time taken to test RAM during boot is reduced by postponing some of the RAM testing until it's necessary, or until the operating system is idle.


3.2.2.2   Scheduled Boot RAM Test Enable Flag

This flag (when set) enables a more thorough RAM test during boot, where the number of passes performed is specified via the Scheduled Boot RAM Test Passes field (see Subsection 3.2.3: Scheduled Boot RAM Test Passes) in the extended header.

If the Simple Boot RAM Test and the Scheduled Boot RAM Test are both enabled then Simple Boot RAM Test will be skipped.

It is expected that the operating system will clear this flag and update the "Faulty RAM List" file after the OS has booted (and after the Scheduled Boot RAM Test is done). This allows a normal utility to schedule regular tests (e.g. once per week a utility might set this flag and reboot the OS). The Scheduled Boot RAM Test is (typically) also used when the operating system is first installed, to ensure that RAM is reliable and so that a more permanent "Faulty RAM List" file (one that includes any areas of faulty RAM that should be avoided, and has the Scheduled Boot RAM Test Enable Flag cleared) is generated and installed.


3.2.2.3   Run-time RAM Testing Enable Flag

Testing the computer's RAM during boot doesn't help in some situations (for example, a server that runs 24 hours per day for years). In general, the Simple Boot RAM Test and the Scheduled Boot RAM Test try to detect faults that occurred while the computer was off, and the Run-time RAM Test is intended to detect faults that occur while the computer is running.

Run-time RAM testing is designed to run in the background (where possible) to minimize the effect RAM testing has on performance. However, because the operating system uses paging, run-time RAM testing can't be as thorough as RAM testing done during boot (address line testing can't be done). This is considered acceptable, as address line failure in a running system is extremely rare and is likely to cause catastrophic software failures (which would typically cause a reboot, hopefully resulting in the Simple Boot RAM Test or Scheduled Boot RAM Test being performed).


3.2.3   Scheduled Boot RAM Test Passes

This byte specifies the number of times that a full RAM test should be performed during a Scheduled Boot RAM Test. This value is ignored if the Scheduled Boot RAM Test is disabled (see Subsection 3.2.2: RAM Test Control Flags).

Non-zero values specify between 1 and 255 passes. The value zero is used to specify an infinite number of RAM test passes, where RAM testing is continuously done until the computer is turned off or reset. This effectively turns the operating system's boot code into a stand-alone RAM testing utility. In this case, nothing that's normally executed after the RAM test (e.g. the operating system itself) is necessary - it won't be used.
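
The interaction between the two boot test flags and this field can be sketched as below. The function name plan_boot_test and its return convention (None meaning "repeat forever") are hypothetical choices made for illustration only.

```python
from typing import Optional, Tuple

SIMPLE_BOOT_TEST = 0x01     # bit 0 of the RAM Test Control Flags
SCHEDULED_BOOT_TEST = 0x02  # bit 1 of the RAM Test Control Flags


def plan_boot_test(flags: int, passes_field: int) -> Tuple[str, Optional[int]]:
    """Return (test kind, pass count); a pass count of None means 'repeat forever'."""
    if flags & SCHEDULED_BOOT_TEST:
        # The scheduled test replaces the simple test when both are enabled.
        # A passes field of zero requests an infinite number of passes.
        return ("scheduled", passes_field if passes_field != 0 else None)
    if flags & SIMPLE_BOOT_TEST:
        # The passes field is ignored when the scheduled test is disabled.
        return ("simple", 1)
    return ("none", 0)
```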


3.2.4   Platform ID

The platform ID is used to determine the format for Faulty RAM List Entries and System Area CRC Entries. Boot code that uses the Faulty RAM List file should check to make sure that the platform ID is the correct platform ID for the computer being booted. The platform ID is a 4 character string (without a zero terminator).

Defined platform IDs are listed in Table 3.3: Platform IDs, including a reference to the section within this document that describes the format for Faulty RAM List Entries and System Area CRC Entries for each platform ID.

  Platform ID  Section                        Platform Description
  "8632"       Chapter 4: Platform ID '8632'  All 80x86 systems (including 64-bit 80x86 systems)
Table 3.3 - Platform IDs


3.3   Faulty RAM List Entries

Faulty RAM List Entries are used to inform boot code of faulty or unreliable areas of RAM, so that these areas can be avoided during boot and after boot.

The offset of the first Faulty RAM List Entry within the file and the offset of the byte after the last Faulty RAM List Entry are specified in the extended header (see Table 3.1: Extended Header Format). The format for a Faulty RAM List Entry depends on the platform ID (see Table 3.3: Platform IDs), and is described in the section corresponding to the platform ID.

If there are no Faulty RAM List Entries, then the "32-bit offset within file for start of Faulty RAM List Entries" field and the "32-bit offset within file for byte after Faulty RAM List Entries" field in the extended header must still be valid - they describe a zero length list.
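
A reader might check the two offsets before using them, roughly as sketched below. The function name entries_region is hypothetical. The checks themselves are assumptions about what "valid" means: the list must not start inside the 48 bytes occupied by the generic and extended headers, and it must not run past the end of the file.

```python
HEADERS_END = 0x30  # generic header (32 bytes) plus extended header (16 bytes)


def entries_region(entries_start: int, entries_end: int, file_size: int) -> range:
    """Return the byte range holding the entries, or raise ValueError."""
    if entries_start < HEADERS_END:
        raise ValueError("entry list overlaps the file headers")
    if not entries_start <= entries_end <= file_size:
        raise ValueError("entry offsets are inconsistent with the file size")
    # Equal offsets describe a zero-length list, which is valid.
    return range(entries_start, entries_end)
```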

To improve search times, all Faulty RAM List Entries must be sorted in order of lowest starting address to highest starting address.
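
Sorted entries allow a binary search, as in the sketch below. It assumes the entries have already been decoded into (start address, length in bytes) tuples and that areas do not overlap; neither the tuple representation nor the name is_faulty comes from this specification.

```python
from bisect import bisect_right
from typing import List, Tuple


def is_faulty(entries: List[Tuple[int, int]], address: int) -> bool:
    """Binary search a list of (start, length) areas sorted by start address.

    ASSUMPTION: areas do not overlap, so only the last area that starts
    at or before `address` needs to be examined.
    """
    # (address, inf) sorts after every entry whose start equals `address`.
    index = bisect_right(entries, (address, float("inf"))) - 1
    if index < 0:
        return False  # address is below the first faulty area
    start, length = entries[index]
    return address < start + length
```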


4   Platform ID '8632'

4.1   Faulty RAM List Entries

For 80x86 systems the operating system manages 4 KiB pages, and each Faulty RAM List Entry contains the address of the start of the first page containing faulty RAM and the number of sequential pages containing faulty RAM. A Faulty RAM List Entry is encoded as between one and three 32-bit dwords to save space.

The first dword in a Faulty RAM List Entry has the format shown in Table 4.1: Faulty RAM List Entry, First Dword.

  Bit/s     Description
  0 to 10   Area Size (see Subsection 4.1.1: Area Size)
  11        Area Starting Address Size Flag (see Subsection 4.1.2: Area Starting Address)
  12 to 31  Area Starting Address Low (see Subsection 4.1.2: Area Starting Address)
Table 4.1 - Faulty RAM List Entry, First Dword


4.1.1   Area Size

If the Area Size field in the first dword is non-zero, then it specifies the number of faulty pages of RAM at the starting address. Otherwise (if the Area Size field in the first dword is zero), the last dword in the Faulty RAM List Entry contains the number of faulty pages of RAM at the starting address.


4.1.2   Area Starting Address

If the Area Starting Address Size Flag in the first dword is clear, then the starting address of the first page of faulty RAM is a 32-bit address that is entirely contained within the Area Starting Address Low field of the first dword of the Faulty RAM List Entry. Otherwise (if the Area Starting Address Size Flag in the first dword is set) the starting address of the first page of faulty RAM is a 64-bit address, where the next dword in the Faulty RAM List Entry contains the most significant bits (bits 32 to 63) of the starting address and the Area Starting Address Low field of the first dword of the Faulty RAM List Entry contains bits 12 to 31 of the starting address.

Note: Because pages start on 4 KiB boundaries the least significant 12 bits of a starting address are always zero.


4.1.3   Faulty RAM List Entry Examples

The following table provides examples of Faulty RAM List Entries.

  Entry Data                        Description
  0x76543001                        One faulty page (4 KiB) at 0x0000000076543000
  0x76543400                        1024 faulty pages (4 MiB) at 0x0000000076543000
  0x76543801,0xFEDCBA98             One faulty page (4 KiB) at 0xFEDCBA9876543000
  0x76543000,0x00001234             4660 faulty pages (18640 KiB) at 0x0000000076543000
  0x76543800,0xFEDCBA98,0x00012345  74565 faulty pages (298260 KiB) at 0xFEDCBA9876543000
Table 4.2 - Faulty RAM List Entry Examples
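
The decoding rules from Subsections 4.1.1 and 4.1.2 can be sketched as follows. The function name decode_entries is hypothetical, and little-endian dwords are an assumption (byte order is not stated in this specification, but the platform is 80x86).

```python
import struct
from typing import List, Tuple


def decode_entries(data: bytes) -> List[Tuple[int, int]]:
    """Decode '8632' entries into (starting address, page count) tuples."""
    if len(data) % 4:
        raise ValueError("entry data is not a whole number of dwords")
    # ASSUMPTION: dwords are stored little-endian.
    dwords = [d for (d,) in struct.iter_unpack("<I", data)]
    entries = []
    i = 0
    while i < len(dwords):
        first = dwords[i]
        i += 1
        pages = first & 0x7FF          # bits 0 to 10: Area Size
        wide = bool(first & 0x800)     # bit 11: Area Starting Address Size Flag
        address = first & 0xFFFFF000   # bits 12 to 31: Area Starting Address Low
        extra = int(wide) + int(pages == 0)
        if i + extra > len(dwords):
            raise ValueError("truncated Faulty RAM List Entry")
        if wide:
            address |= dwords[i] << 32  # next dword holds bits 32 to 63
            i += 1
        if pages == 0:
            pages = dwords[i]           # last dword holds the page count
            i += 1
        entries.append((address, pages))
    return entries
```

Applied to the entry data in the table above, this sketch yields the addresses and page counts listed in the Description column.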

This encoding allows a "single dword" Faulty RAM List Entry to refer to up to 2047 faulty pages (8188 KiB) at a 32-bit starting address. A "double word" Faulty RAM List Entry with a 32-bit starting address, or a "triple word" Faulty RAM List Entry with a 64-bit starting address, can refer to up to 4294967295 faulty pages (almost 16 TiB). For larger areas multiple Faulty RAM List Entries can be used.
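
Going the other way, an encoder might pick the shortest form for each area and split areas that exceed the maximum page count. This is a sketch under assumptions: the specification does not require the shortest form, and the names encode_entry and encode_area are hypothetical.

```python
from typing import List

PAGE_SIZE = 4096             # 4 KiB pages
MAX_INLINE_PAGES = 0x7FF     # largest value of the 11-bit Area Size field
MAX_PAGES = 0xFFFFFFFF       # largest value of a separate page count dword


def encode_entry(address: int, pages: int) -> List[int]:
    """Encode one area as a list of one to three dwords."""
    if address % PAGE_SIZE or not 0 <= address < 1 << 64:
        raise ValueError("address must be page aligned and fit in 64 bits")
    if not 1 <= pages <= MAX_PAGES:
        raise ValueError("page count out of range for a single entry")
    first = address & 0xFFFFF000
    rest = []
    if address >> 32:
        first |= 0x800                 # set the Area Starting Address Size Flag
        rest.append(address >> 32)     # bits 32 to 63 of the address
    if pages <= MAX_INLINE_PAGES:
        first |= pages                 # count fits in the Area Size field
    else:
        rest.append(pages)             # Area Size stays zero; count goes last
    return [first] + rest


def encode_area(address: int, pages: int) -> List[int]:
    """Encode an area of any size, splitting it into several entries if needed."""
    dwords = []
    while pages > 0:
        chunk = min(pages, MAX_PAGES)
        dwords.extend(encode_entry(address, chunk))
        address += chunk * PAGE_SIZE
        pages -= chunk
    return dwords
```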


Generated on Sat Aug 1 16:05:47 2009