Markus Grassl
Max‐Planck‐Institut für die Physik des Lichts, Staudstraße 2, 91058 Erlangen, Germany
Information theory establishes a framework for any kind of communication and information processing. It allows us to derive bounds on the complexity or cost of tasks such as storing information using the minimal amount of space or sending data over a noisy channel, and it provides the means to quantify information. So one may ask, “How much does someone know about what somebody else knows?” – a question that is important in cryptographic contexts. Before studying the new aspects that quantum mechanics adds to information theory in later chapters, we will have a brief look at the basics of classical information theory in the next section.
While information theory provides an answer to the question of how fast one can send information over a noisy channel, it usually does not give a constructive solution to this task. This is the problem that error correction deals with. In Section 1.3, we give a short introduction to linear block codes, laying the groundwork for the discussion of error-correcting codes for quantum systems in Chapter 7.
The foundations of information theory have been laid by Claude Shannon in his landmark paper “A Mathematical Theory of Communication” (1). In that paper, Shannon introduces the basic mathematical concepts for communication systems and proves two fundamental coding theorems. Here, we mainly follow his approach.
The first important observation of Shannon is that although the process of communication is intended to transfer a message with some meaning, the design of a communication system can abstract from any meaning:
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.
Additionally, we can, to some extent, abstract from the physical channel that is used to transmit the message. For this, we introduce a transmitter and a receiver that convert the messages into some physical signal and vice versa. The general layout of such a communication system is illustrated in Figure 1.1. Given a channel and an information source, the basic problem is to transmit the messages produced by the information source through the channel as efficiently and as reliably as possible. Efficient means that we can send as much information as possible per use of the channel, and reliable means that, despite the disturbance due to the noise added by the channel, the original message is (with high probability) reproduced by the receiver. Shannon has shown that one can treat the two problems separately. First, we will consider a noiseless channel, which transmits every input perfectly, and then we will deal with noisy channels. For simplicity, we will consider only discrete channels here, that is, both the input and the output of the channel, as well as those of the transmitter and receiver, are symbols of a finite discrete set.
For a channel that transmits its inputs perfectly, the design of a communication system aims to maximize its efficiency, that is, the amount of information that can be sent through the channel per unit of time. Usually, it is assumed that the transmission takes the same amount of time for each channel input. Then, we want to maximize the throughput per channel use. Otherwise, we first have to consider how many symbols we can send through the channel per unit of time. Following (1), we define
If we use symbols with durations t_1, ..., t_n, then the number N(t) of allowed sequences of total duration t satisfies the recursive equation

N(t) = N(t - t_1) + N(t - t_2) + ... + N(t - t_n),

as we can partition the sequences of duration t by, say, the last symbol. For large t, N(t) tends to X_0^t, where X_0 is the largest real solution of the characteristic equation

X^(-t_1) + X^(-t_2) + ... + X^(-t_n) = 1.
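Shannon's result states that the capacity of such a noiseless channel is C = log2 X_0 bits per unit of time. As a small numerical sketch (the two symbol durations below are hypothetical), the largest root of the characteristic equation can be found by bisection, since the left-hand side is strictly decreasing in X:

```python
import math

def largest_root(durations, lo=1.0, hi=4.0, iters=100):
    """Find the largest real solution X0 of sum(X**(-t)) = 1 by bisection.
    f(X) = sum(X**(-t)) - 1 is strictly decreasing for X > 1, so the
    interval [lo, hi] must bracket the root."""
    f = lambda x: sum(x ** (-t) for t in durations) - 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical example: two symbols with durations 1 and 2 time units.
# The characteristic equation 1/X + 1/X^2 = 1 has the golden ratio as its
# largest root, so the capacity is log2((1 + sqrt(5))/2) bits per time unit.
X0 = largest_root([1, 2])
capacity = math.log2(X0)
print(round(X0, 3), round(capacity, 3))   # 1.618 0.694
```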
Summarizing, we get
In order to maximize the efficiency of the communication system, we additionally need a measure for the amount of information that is produced by the source. Recall that we abstract from the meaning of a message, that is, a single message by itself does not provide any information. Instead, we always consider a set of possible symbols, each of which occurs with some probability. The less frequent a symbol, the more surprising its occurrence, and hence the more information it bears. The amount of information of a source is described as follows:
In this definition, we have assumed that the symbols are emitted independently, that is, the probability of receiving a sequence s_1 s_2 ... s_n of length n is given by the product p(s_1) p(s_2) ... p(s_n).
If there are some correlations between the symbols, the entropy of the source decreases when the original symbols are combined into new compound symbols. This is due to the fact that the entropy is maximal when all probabilities are equal. As an example, we may consider any natural language. The entropy depends on whether we consider the single-letter entropy, or the entropy of pairs or triples of letters, whole words, or even sentences. Of course, the entropy also depends on the language itself.
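As a small illustration (a sketch, not from the text), the entropy of a finite probability distribution can be computed directly from its definition as the negative sum of p_i log2 p_i:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum(p * log2 p); zero-probability
    symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The uniform distribution maximizes the entropy: log2(4) = 2 bits.
print(entropy([0.25] * 4))
# A biased source carries less information per symbol.
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits
```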
From now on, we fix the alphabet of the information source and assume that there are no correlations between the symbols. A very important concept in information theory is that of typical words.
Asymptotically, this means that a sequence either occurs with negligible probability, that is, it is nontypical, or it is a so-called typical word; the typical words all occur with approximately the same probability.
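This can be made quantitative in a small numerical sketch (the source parameters below are hypothetical): for a memoryless binary source, one can sort all sequences of length n by probability and count how many of the most probable ones are needed to accumulate a given total probability q; the logarithm of that count, divided by n, approaches the entropy of the source as n grows (the convergence is slow).

```python
import math
from itertools import product

# Hypothetical parameters: an i.i.d. binary source with p(1) = 0.2,
# blocks of length n = 16, target probability q = 0.5.
p, n, q = 0.2, 16, 0.5

# Probabilities of all 2^16 sequences, sorted in decreasing order.
probs = sorted((p ** sum(s) * (1 - p) ** (n - sum(s))
                for s in product([0, 1], repeat=n)), reverse=True)

# Count the minimum number of most probable sequences with total probability q.
total, count = 0.0, 0
for pr in probs:
    total += pr
    count += 1
    if total >= q:
        break

h_source = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(count, math.log2(count) / n, h_source)
```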
Fixing the length n, one can order the sequences of length n by decreasing probability. For 0 < q < 1, we define n(q) as the minimum number of sequences of length n that accumulate a total probability of at least q. Shannon has shown that in the limit of large n, the quantity (log n(q))/n is independent of q:
The quantity log n(q) can be interpreted as the number of bits that are required to describe a sequence when considering only the most probable sequences with total probability q. From Theorem 1.1, we get that even for finite length n, almost all words can be described in this way. The bounds for sending arbitrary sequences through the channel are given by Shannon's first fundamental coding theorem:
The small defect compared to the maximal achievable transmission speed is due to the small extra information that is needed to encode the nontypical words of the source. An efficient scheme for encoding the output of the source is, for example, the so-called Huffman coding (2). In view of Theorem 1.1, one can also ignore the nontypical words that have a negligible total probability in the encoding, resulting in a small error (lossy data compression).
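A minimal sketch of Huffman's construction (the function and symbol names are my own, not notation from the text): repeatedly merge the two least probable nodes, prefixing their codewords with 0 and 1, respectively.

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code for {symbol: probability}: repeatedly merge the
    two least probable nodes; returns {symbol: bit string}."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)              # tie-breaker so that dicts are never compared
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

freqs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # hypothetical source
code = huffman_code(freqs)
avg_len = sum(freqs[s] * len(w) for s, w in code.items())
print(code, avg_len)
```

For dyadic probabilities, as in this hypothetical source, the average codeword length of 1.75 bits meets the entropy of the source exactly.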
A discrete noisy channel maps an input symbol from the (finite) input alphabet to an output symbol from the output alphabet. A common assumption is that the channel is memoryless, that is, the probability of observing an output symbol depends only on the current channel input. The sizes of the input and output alphabets need not be the same, as depicted in Figure 1.2. Given the channel output, the task of the receiver is to determine the most likely input to the channel. For this, we consider how much information the channel output provides about the channel input. First, we define some general quantities for pairs of random variables (see, e.g., (3)).
For the joint entropy, we consider the channel input and the channel output together as one symbol.
The conditional entropy is a measure for the information that we additionally get when considering both X and Y together rather than X alone. This is reflected by the following chain rule (see (3), Theorem 2.2.1).
Another important quantity in information theory is the mutual information.
The relationship between entropy and mutual information is illustrated in Figure 1.3. From Theorem 1.4, we get the following equivalent expressions for the mutual information:
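The identity I(X;Y) = H(X) + H(Y) - H(X,Y) can be verified numerically. The following sketch uses a hypothetical joint distribution, a uniform input sent through a binary symmetric channel with error probability 0.1:

```python
import math

def h(probs):
    """Entropy of a list of probabilities, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), with joint[x][y] the joint distribution."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    pxy = [p for row in joint for p in row]     # joint as a flat list
    return h(px) + h(py) - h(pxy)

# Hypothetical example: uniform input through a BSC with error probability 0.1,
# for which the mutual information is 1 - h2(0.1), about 0.531 bits.
p = 0.1
joint = [[0.5 * (1 - p), 0.5 * p],
         [0.5 * p, 0.5 * (1 - p)]]
print(round(mutual_information(joint), 3))   # 0.531
```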
With this preparation, we are ready to define the capacity of a noisy discrete memoryless channel.
The justification of this definition is provided by Shannon's second fundamental coding theorem.
For the proof of this theorem, one considers a particular set of encoding schemes and then averages the frequency of errors. This average can be made arbitrarily small, implying that at least one of the encoding schemes must have a negligible error probability.
Before we turn our attention to the explicit construction of error-correcting codes, we consider a particularly interesting channel.
The generalization of the BSC to more than two symbols is shown in Figure 1.4. Again, a symbol is transmitted correctly with probability 1 - p. If an error occurs, each of the, say, m - 1 other symbols is equally likely, that is, it occurs with probability p/(m - 1). These types of channels are extremal in the sense that the transition probabilities only depend on whether a symbol is transmitted correctly or not. Hence, an incorrect symbol bears minimal information about the input symbol. Any deviation from this symmetry results in an increased capacity.
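For these symmetric channels, the capacity has a standard closed form (not derived in the text): C = log2 m - h2(p) - p log2(m - 1), where h2 is the binary entropy function, achieved by the uniform input distribution. A small sketch:

```python
import math

def h2(p):
    """Binary entropy function, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def symmetric_capacity(m, p):
    """Capacity (bits per use) of the m-ary symmetric channel: a symbol is
    transmitted correctly with probability 1-p; otherwise each of the other
    m-1 symbols occurs with probability p/(m-1)."""
    return math.log2(m) - h2(p) - p * math.log2(m - 1)

print(round(symmetric_capacity(2, 0.1), 3))   # BSC: 1 - h2(0.1), about 0.531
print(round(symmetric_capacity(4, 0.1), 3))   # quaternary symmetric channel
```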
When sending information over a noisy channel, on the highest level of abstraction, we distinguish only whether a symbol is transmitted correctly or not. Then the difference between the input sequence and the output sequence is measured by the Hamming distance.
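A direct implementation of the Hamming distance (a trivial sketch):

```python
def hamming_distance(x, y):
    """Number of positions in which two equal-length sequences differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

print(hamming_distance("10110", "11010"))   # 2
```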
In order to be able to correct errors, we use only a subset of all possible sequences. In particular, we may take a subset of all possible sequences of length n.
The simplest code that can be used to detect or correct errors is the repetition code. A repetition code with rate 1/2 transmits every symbol twice. At the receiver, the two symbols are compared, and if they differ, an error is detected. Using this code over a binary symmetric channel with error probability p, the probability of an undetected error is p^2, as an error goes unnoticed only when both copies are flipped. Sending more than two copies of each symbol, we can decrease the probability of an undetected error even further. But at the same time, the rate of the code decreases, since the number of codewords remains fixed while the length of the code increases.
A repetition code can not only be used to detect errors but also to correct them. For this, we send three copies of each symbol, that is, we have a repetition code with rate 1/3. At the receiver, the three symbols are compared. If at most one symbol is wrong, the two error-free symbols agree, and we assume that the corresponding symbol is correct. Again, increasing the number of copies sent increases the number of errors that can be corrected. For the general situation, we consider the distance between two words of the block code.
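The repetition codes just described can be sketched in a few lines (the helper names are my own); the decoder takes a majority vote over each block of r copies, correcting up to (r-1)/2 errors per block:

```python
from collections import Counter

def encode_repetition(bits, r=3):
    """Repetition code of rate 1/r: send r copies of every symbol."""
    return [b for b in bits for _ in range(r)]

def decode_repetition(received, r=3):
    """Majority vote over each block of r copies."""
    return [Counter(received[i:i + r]).most_common(1)[0][0]
            for i in range(0, len(received), r)]

sent = encode_repetition([1, 0, 1])        # [1,1,1, 0,0,0, 1,1,1]
sent[1] ^= 1                               # flip one bit in the first block
print(decode_repetition(sent))             # [1, 0, 1]: the error is corrected
```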
The error‐correcting ability of a code is related to its minimum distance.
For a general block code over an alphabet, we have to make a list of all codewords, that is, the description of the code is proportional to its size. In order to get a more efficient description (and thereby more efficient algorithms for encoding and decoding), we impose some additional structure. In particular, we require that the elements of the alphabet have the algebraic structure of a field, that is, we can add, subtract, and multiply any two elements, and every nonzero element has a multiplicative inverse. First, we consider a finite field whose size is a prime number.
The smallest field is the binary field, which has only the two elements 0 and 1. Note that the integers modulo a composite number do not form a field, as some nonzero elements do not have a multiplicative inverse. For example, for the integers modulo 4 we have 2 * 2 = 4 = 0 (mod 4), so 2 cannot have a multiplicative inverse.
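This can be checked by brute force (a small sketch): in the integers modulo a prime, every nonzero element has a multiplicative inverse, while modulo 4 the element 2 has none.

```python
def inverses_mod(n):
    """For each nonzero a in Z_n, the multiplicative inverse if one exists."""
    return {a: next((b for b in range(1, n) if (a * b) % n == 1), None)
            for a in range(1, n)}

print(inverses_mod(5))   # every nonzero element is invertible: Z_5 is a field
print(inverses_mod(4))   # 2 has no inverse (2*2 = 0 mod 4): Z_4 is not a field
```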
In order to construct a field whose size is not a prime number, one uses the following construction.
It can be shown that for any prime number p and for any positive integer m, there exists an irreducible polynomial of degree m over the field with p elements; that is, for any prime power p^m, there exists a finite field of that size. Furthermore, it can be shown that any finite field can be obtained by the construction of Proposition 1.2. Hence, we get (see, e.g., (5))
In order to get a more efficient description of a block code of length n, we consider only codes whose alphabet is a finite field. Furthermore, we require that the code forms a linear vector space over the field, that is,
This implies that the code has q^k elements for some k with 0 <= k <= n. We will use the notation [n, k]. Instead of listing all q^k elements, it is sufficient to specify a basis of k linearly independent vectors. Alternatively, the linear space can be given as the solution of n - k linearly independent homogeneous equations.
The k x n generator matrix G provides a compact description of a code with q^k elements. Moreover, encoding an information sequence u of length k corresponds to the linear map given by u -> uG, that is,
The parity check matrix H can be used to check whether a vector lies in the code.
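As a concrete sketch (the specific matrices are my own example, the [7, 4] binary Hamming code in systematic form, not taken from the text), encoding is the map u -> uG, and membership in the code is tested with the parity check matrix:

```python
# Generator matrix G = [I | P] and parity check matrix H = [P^T | I] of the
# [7,4] binary Hamming code in systematic form; by construction G H^T = 0.
G = [[1, 0, 0, 0, 0, 1, 1],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 1, 1, 0],
     [0, 0, 0, 1, 1, 1, 1]]
H = [[0, 1, 1, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [1, 1, 0, 1, 0, 0, 1]]

def matvec(M, v):
    """Matrix-vector product over GF(2)."""
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

u = [1, 0, 1, 1]                                                     # information word
c = [sum(ui * gi for ui, gi in zip(u, col)) % 2 for col in zip(*G)]  # c = uG
print(c, matvec(H, c))   # the zero syndrome [0, 0, 0] shows c lies in the code
```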
The definition of the parity check matrix as a matrix with n columns and n - k rows, and not as its transpose, is motivated by the following.
As we have seen in Theorem 1.6, the minimum distance of a code is a criterion for its error-correcting ability. For linear codes, the minimum distance equals the minimum Hamming weight of the nonzero codewords, as the distance of two codewords is the weight of their difference, which is again a codeword.
The minimum Hamming weight of a linear code can be computed using the parity check matrix.
The last proposition can be used to construct codes. For a single-error-correcting code, we require a minimum distance of at least 3. This implies that any two columns in H have to be linearly independent, that is, no column is a scalar multiple of another column. If we fix the redundancy m = n - k, it is possible to find (q^m - 1)/(q - 1) vectors with this property, which can be combined to a parity check matrix H. This construction gives the following class of single-error-correcting codes (see (6,7)).
For binary Hamming codes, the parity check matrix consists of all nonzero vectors of length m as columns. If we order those columns in such a way that the ith column equals the binary expansion of i, error correction is particularly easy. If e is an error of weight 1 at position i, then the syndrome equals the ith column of H and hence the binary expansion of i. Therefore, the syndrome directly provides the position of the error.
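A sketch of this decoder for the binary Hamming code of length 7 (m = 3); the column ordering makes the syndrome, read as a binary number, the position of the error:

```python
# Parity check matrix whose i-th column is the binary expansion of i (i = 1..7).
H = [[(i >> 2) & 1 for i in range(1, 8)],
     [(i >> 1) & 1 for i in range(1, 8)],
     [i & 1 for i in range(1, 8)]]

def correct(y):
    """Compute the syndrome; read as a binary number, it is the position of a
    single error (0 means no error detected)."""
    s = [sum(hj * yj for hj, yj in zip(row, y)) % 2 for row in H]
    pos = 4 * s[0] + 2 * s[1] + s[2]
    if pos:
        y = y[:]
        y[pos - 1] ^= 1   # flip the erroneous bit
    return y

c = [0, 0, 0, 0, 0, 0, 0]          # a codeword (here the all-zero word)
y = c[:]; y[4] ^= 1                # single error in position 5
print(correct(y) == c)             # True
```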
Usually, a received vector will be decoded as the codeword that is closest in Hamming distance. In general, decoding an arbitrary linear binary code is an NP-hard problem (8). More precisely, it was shown that it is an NP-complete problem to decide whether there is a vector that corresponds to a given syndrome and whose Hamming weight is at most some given bound. Hence, we cannot expect to have an efficient general algorithm for decoding.
Instead, by exhaustive search we can precompute an error vector of minimal Hamming weight corresponding to each syndrome. For this, we first arrange all codewords as the first row of an array, with the all-zero codeword as the first element. Among the remaining vectors of length n, we pick a vector with minimal Hamming weight. This vector is the first element of the next row in our array. The remaining entries of this row are obtained by adding the vector to the corresponding codeword in the first row. This guarantees that all elements of a row correspond to the same syndrome. We proceed until all vectors have been arranged into an array with 2^(n-k) rows and 2^k columns, the so-called standard array. The elements in the first column of the standard array are called coset leaders; they have minimal Hamming weight among all vectors in their row. Table 1.2 shows the standard array of a binary [7, 3] code, which is the dual of the Hamming code of Example 1.4. Actually, the code is a subcode of the Hamming code. In the first row, the 8 codewords are listed. For the next seven rows, the coset leader is the unique vector of Hamming weight 1 in each coset, reflecting the fact that the code can correct a single error. For the following seven rows, the coset leader has weight 2, but each of these cosets contains three vectors of weight 2. Hence, decoding succeeds only in one out of three cases. In the final row, there are even seven vectors of weight 3.
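The coset leaders can also be found programmatically. The following sketch (the matrix and helper names are my own) enumerates all vectors in order of increasing Hamming weight and keeps the first representative of each syndrome; the rows of H generate the [7, 4] Hamming code, so H is a parity check matrix of the [7, 3] code of Table 1.2.

```python
from itertools import combinations

# Parity check matrix of the [7,3] code of Table 1.2; its rows generate the
# dual code, the [7,4] binary Hamming code.
H = [[1, 0, 0, 0, 0, 1, 1],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 1, 1, 0],
     [0, 0, 0, 1, 1, 1, 1]]

def syndrome(v):
    return tuple(sum(h * x for h, x in zip(row, v)) % 2 for row in H)

# Enumerate all 2^7 vectors by increasing Hamming weight; the first vector
# seen for each syndrome is a coset leader (an error of minimal weight).
leaders = {}
for w in range(8):
    for pos in combinations(range(7), w):
        v = tuple(1 if i in pos else 0 for i in range(7))
        leaders.setdefault(syndrome(v), v)

def decode(y):
    """Subtract (over GF(2): add) the coset leader of y's syndrome."""
    e = leaders[syndrome(y)]
    return tuple((a + b) % 2 for a, b in zip(y, e))

print(len(leaders))                      # 16 cosets, one leader each
print(decode((0, 0, 0, 1, 1, 1, 0)))    # corrects 0001110 to the codeword 0001111
```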
Table 1.2 Standard array for decoding the [7, 3] code, the dual of a binary Hamming code.
0000000 | 0001111 | 0110011 | 0111100 | 1010101 | 1011010 | 1100110 | 1101001 |
0000001 | 0001110 | 0110010 | 0111101 | 1010100 | 1011011 | 1100111 | 1101000 |
0000010 | 0001101 | 0110001 | 0111110 | 1010111 | 1011000 | 1100100 | 1101011 |
0000100 | 0001011 | 0110111 | 0111000 | 1010001 | 1011110 | 1100010 | 1101101 |
0001000 | 0000111 | 0111011 | 0110100 | 1011101 | 1010010 | 1101110 | 1100001 |
0010000 | 0011111 | 0100011 | 0101100 | 1000101 | 1001010 | 1110110 | 1111001 |
0100000 | 0101111 | 0010011 | 0011100 | 1110101 | 1111010 | 1000110 | 1001001 |
1000000 | 1001111 | 1110011 | 1111100 | 0010101 | 0011010 | 0100110 | 0101001 |
0110000 | 0111111 | 0000011 | 0001100 | 1100101 | 1101010 | 1010110 | 1011001 |
1000001 | 1001110 | 1110010 | 1111101 | 0010100 | 0011011 | 0100111 | 0101000 |
1000010 | 1001101 | 1110001 | 1111110 | 0010111 | 0011000 | 0100100 | 0101011 |
1000100 | 1001011 | 1110111 | 1111000 | 0010001 | 0011110 | 0100010 | 0101101 |
1001000 | 1000111 | 1111011 | 1110100 | 0011101 | 0010010 | 0101110 | 0100001 |
1010000 | 1011111 | 1100011 | 1101100 | 0000101 | 0001010 | 0110110 | 0111001 |
1100000 | 1101111 | 1010011 | 1011100 | 0110101 | 0111010 | 0000110 | 0001001 |
1110000 | 1111111 | 1000011 | 1001100 | 0100101 | 0101010 | 0010110 | 0011001 |
We have seen that the Hamming code is a code for which the correction of errors is rather simple, but it can only correct a single error. On the other hand, for an arbitrary linear code, the problem of error correction is NP-complete. Fortunately, there are other families of error-correcting codes for which efficient algorithms exist to correct at least all errors of bounded weight. More about these codes can be found in any textbook on coding theory, such as the book by MacWilliams and Sloane (7), which is an excellent reference for the theory of error-correcting codes.