Error correcting codes (ECCs) are used in computer and communication systems to improve resiliency to bit flips caused by permanent hardware faults or transient conditions, such as neutron particles from cosmic rays, known generally as soft errors. This note describes the principles of Hamming codes that underpin ECC schemes, ECC codes are constructed, focusing on single-error correction and double error detection, and how they are implemented.

ECCs work by adding additional redundant bits to be stored or transported with data. The bits are encoded as a function of the data in such a way that it is possible to detect erroneous bit flips and to correct them. The ratio of the number of data bits to the total number of bits encoded is called the code rate, with a rate of 1 being a an impossible encoding with no overhead.

Simple ECCs

Parity coding adds a single bit that indicates whether the number of set bits in the data is odd or even. When the data and parity bit is accessed or received, the parity can be recomputed and compared. This is sufficient to detect any odd number of bit flips but not to correct them. For applications where the error rate is low, so that only single bit flips are likely and double bit flips are rare enough to be ignored, parity error detection is sufficient and desirable due to it’s low overhead (just a single bit) and simple implementation.

Repetition coding simply repeats each data bit a fixed number of times. When the encoded data is received, if each of the repeated bits are non identical, an error has occurred. With a repetition of two, single-bit errors can be detected but not corrected. With a repetition of three, single bit flips can be corrected by determining each data bit as the majority value in each triple, but double bit flips are undetectable and will cause an erroneous correction. Repetition codes are simple to implement but have a high overhead.

Hamming codes

Hamming codes are an efficient family of codes using additional redundant bits to detect up to two-bit errors and correct single-bit errors (technically, they are linear error-correcting codes). In them, check bits are added to data bits to form a codeword, and the codeword is valid only when the check bits have been generated from the data bits, according to the Hamming code. The check bits are chosen so that there is a fixed Hamming distance between any two valid codewords (the number of positions in which bits differ).

When valid codewords have a Hamming distance of two, any single bit flip will invalidate the word and allow the error to be detected. For example, the valid codewords 00 and 11 are separated for single bit flips by the invalid codewords 01 and 10. If either of the invalid words is obtained an error has occurred, but neither can be associated with a valid codeword. Two bit flips are undetectable since they always map to a valid codeword. Note that parity encoding is an example of a distance-two Hamming code.

00 < Valid codeword
|
10 < Invalid codeword (obtained by exactly 1 bit flip)
|
11 < Valid codeword

With Hamming distance three, any single bit flip in a valid codeword makes an invalid one, and the invalid codeword is Hamming distance one from exactly one valid codeword. Using this, the valid codeword can be restored, enabling single error correction. Any two bit flips map to an invalid codeword, which would cause correction to the wrong valid codeword.

000 < Valid codeword
 |
001
 |
011
 |
111 < Valid codeword

With Hamming distance four, two bit flips moves any valid codeword Hamming distance two from exactly two valid codewords, allowing detection of two flips but not correction. Single bit flips can be corrected as they were for distance three. Distance-four codes are widely used in computing, where is it often the case where single errors are frequent, double errors are rare and triple errors occur so rarely they can be ignored. These codes are referred to as ‘SECDED ECC’ (single error correction, double error detection).

0000 < Valid codeword
 |
0001
 |
0011 < Two bit flips from either codeword.
 |
0111
 |
1111 < Valid codeword

Double errors can be corrected with a distance-five code, as well as enabling the detection of triple errors. In general, if a Hamming code can detect $d$ errors, it must have a minimum distance of $d+1$ so there is no way $d$ errors can change one valid codeword into another one. If a code can correct $d$ errors, it must have a minimum distance of $2d+1$ so that the originating code is always the closest one. The following table summarises Hamming codes.

Distance Max bits corrected Max bits detected
2 1 0 Single error detection (eg parity code)
3 1 1 Single error correction (eg triple repetition code)
4 1 2 Single error correction, double error detection (a ‘SECDED’ code)
5 2 2 Double error correction
6 2 3 Double error correction, triple error detection

Creating a Hamming code

A codeword includes the data bits and checkbits. Each check bit corresponds to a subset of the data bits and it is set when the parity of those data bits is odd. To obtain a code with a particular Hamming distance, the number of check bits and their mapping to data bits must be chosen carefully.

To build a single-error correcting (SEC) code, which requires Hamming distance three between valid codewords, it is necessary for:

To see why this works, consider two distinct codewords in which necessarily must have different data bits. If the data bits differ by:

To build a SECDED code, which requires Hamming distance four between valid codewords, it is necessary for:

Following a similar argument, consider two distinct codewords, data differing by:

Check bits:  0 1 2 3
data[a]      x x x
data[b]        x x x
-----------  --------
Flips        x     x

Check bits:  0 1 2 3 4
data[a]      x x x
data[b]      x x x x x
----------   ---------
Flips              x x
Check bits:  0 1 2 3 4
data[a]      x x x
data[b]        x x x
data[c]      x     x x
-----------  ---------
Flips                x

Check bits:  0 1 2 3 4
data[a]      x x x x x
data[b]      x x x
data[c]      x     x x
-----------  ---------
Flips        x

An example SEC code for eight data bits with four parity bits:

Check bits:  0 1 2 3
data[0]      x x x
data[1]        x x x
data[2]      x   x x
data[3]      x x   x
data[4]      x x
data[5]        x x
data[6]          x x
data[7]      x     x

An example SECDED code for eight data bits with five parity bits:

Check bits:  0 1 2 3 4
data[0]      x x x
data[1]      x x   x
data[2]      x   x x
data[3]        x x x
data[4]      x x     x
data[5]      x   x   x
data[6]        x x   x
data[7]      x     x x

Note that mappings of data bits to check bits can be chosen flexibly, providing they maintain the rules that set the Hamming distance. This flexibility is useful when implementing ECC to reduce the cost of calculating the check bits. In contrast, many descriptions of ECC that I have found in text books and on Wikipedia describe a specific encoding that does not acknowledge this freedom. The encoding they describe allows the syndrome to be interpreted as the bit index of the single bit error, by the check bit in position $i$ covering data bits in position $i$. Additionally, they specify that parity bits are positioned in the codeword at power-of-two positions, for no apparent benefit.

Implementing ECC

Given data bits and check bits, and mapping of data bits to check bits, ECC encoding works by calculating the check bits from the data bits, then combining data bits and check bits to form the codeword. Decoding works by taking the data bits from a codeword, recalculating the check bits, then calculating the bitwise XOR between the original check bits and the recalculated ones. This value is called the syndrome. By inspecting the number of bits set in the syndrome, it is possible to determine whether there has been an error, whether it is correctable, and how to correct it.

Using the SEC check-bit encoding above, creating a codeword from data[7:0], the check bits are calculated as follows (using Verilog syntax):

assign check_word[0] = data[0] ^ data[2] ^ data[3] ^ data[4] ^ data[7];
assign check_word[1] = data[0] ^ data[1] ^ data[3] ^ data[4] ^ data[5];
assign check_word[2] = data[0] ^ data[1] ^ data[2] ^ data[5] ^ data[6];
assign check_word[3] = data[1] ^ data[2] ^ data[3] ^ data[6] ^ data[7];

And the codeword formed by concatenating the check bits and data:

assign codeword = {check[3:0], data[7:0]};

Decoding of a codeword, splits it into the checkword and data bits, recomputes the check bits and calculates the syndrome:

assign {old_check_word, old_data} = codeword;
assign new_check_word[0] = ...;
assign new_check_word[1] = ...;
assign new_check_word[2] = ...;
assign new_check_word[3] = ...;
assign syndrome = new_check_word ^ old_check_word;

When single bit errors occur, the syndrome will have the bit pattern corresponding to a particular data bit, so a correction can be applied by creating a mask to flip the bit in that position:

unique case(syndrome)
  4'b1110: correction = 1<<0;
  4'b0111: correction = 1<<1;
  4'b1011: correction = 1<<2;
  4'b1101: correction = 1<<3;
  4'b1100: correction = 1<<4;
  4'b0110: correction = 1<<5;
  4'b0011: correction = 1<<6;
  4'b1001: correction = 1<<7;
  default: correction = 0;
endcase

And using it to generate the corrected data:

assign corrected_data = data ^ correction;

The value of the syndrome can be further inspected to signal what action has been taken. If the syndrome is:

The above SECDED check-bit encoding can be implemented in a similar way, but since it uses only three-bit patterns, mapping syndromes to correction masks can be done with three-input AND gates:

unique case(syndrome)
  syndrome[0] && syndrome[1] && syndrome[2]: correction = 1<<0;
  syndrome[0] && syndrome[1] && syndrome[3]: correction = 1<<1;
  syndrome[0] && syndrome[2] && syndrome[3]: correction = 1<<2;
  syndrome[1] && syndrome[2] && syndrome[3]: correction = 1<<3;
  syndrome[0] && syndrome[1] && syndrome[4]: correction = 1<<4;
  syndrome[0] && syndrome[2] && syndrome[4]: correction = 1<<5;
  syndrome[1] && syndrome[2] && syndrome[4]: correction = 1<<6;
  syndrome[0] && syndrome[3] && syndrome[4]: correction = 1<<7;
  default:                                   correction = 0;
endcase

And any syndromes with one or two bits set are correctable, and otherwise uncorrectable.

References / further reading