May 12, 2010

Statistical Physics and Error-Correcting Codes #1

As I have been working on the statistical physics of error-correcting codes since 2006, I guess it is only fair to write a bit about that. Although it may not seem as appealing as cosmology, strings or the LHC, there is something fundamental and interesting behind it. Besides, since I started to talk about quantum codes, I thought I should write something more basic about classical error correction and information theory, as they are the basis for their quantum versions.

Since I will have to introduce some information theory first, I thought it would be a better idea to break this post into two or three parts, depending on how it goes. The standard reference for information theory is the book by Cover and Thomas:

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory

The basic idea of error correction is very simple. The so-called sender wants to transmit some information, be it a word, a song or a picture, to someone usually called the receiver. However, some part of that information may be lost during the transmission. We say that the information is being transmitted through a noisy channel. To give an unorthodox example, if you burn a CD with some songs to give to a friend but accidentally scratch the surface of the disc, some of the data will not be correctly read by the laser head. What do you do to avoid losing the information?

Every human language already has the basic feature required for error correction, something we call redundancy. Can you r**d th*s phr**e withou* some of t*e le**ers? You can, because every language has some degree of redundancy, which means that information is encoded with many more symbols than are really needed. Because of this, if you lose some of the symbols, you can still manage to understand the original message. In the case of a language, redundancy is a very intricate thing, for even if entire words are missing from a document, you can still retrieve them.

Erasing a symbol (or several), like I did above, is not the only way of corrupting a message. Letters can be scrambled or exchanged for others and, unlike in the example I gave, they may be corrupted without you even knowing exactly where it happened.

Error correction is the driving force behind the digital revolution, simply because binary messages are much easier to correct. For digital messages, which are simply composed of 0's and 1's, errors can be reduced to two types: erasure, when you simply cannot read which symbol is there, or flipping, when a 0 becomes a 1 and vice versa.

The simplest, most naive technique of error correction is called a repetition code. If your message is 1, then you just repeat it several times, like 111. If the number of errors is small, for instance at most 1 corrupted symbol in each block of 3 in the example above, you can retrieve the original message by taking a majority vote. It means that if you receive 110 instead of 111, you know what the correct message is just by looking at which symbol is in the majority. If the noise (the number of errors) is larger, you need to add more redundancy. For example, if up to two of the copies can be flipped, you need to use at least 5 copies of the symbol to be able to retrieve the original message. The result of the original message plus the redundancy is what we call the codeword.
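
To make this concrete, here is a minimal sketch in Python of the repetition code with majority-vote decoding. The function names, the repetition length and the example message are my own choices for illustration:

```python
# A minimal sketch of a repetition code: each bit of the message is
# repeated n_rep times, and decoding takes a majority vote per block.

def encode_repetition(bits, n_rep=3):
    """Repeat every bit n_rep times: 1 -> 111, 0 -> 000 (for n_rep = 3)."""
    return [b for b in bits for _ in range(n_rep)]

def decode_repetition(received, n_rep=3):
    """Recover each bit by a majority vote over its block of n_rep copies."""
    decoded = []
    for i in range(0, len(received), n_rep):
        block = received[i:i + n_rep]
        decoded.append(1 if sum(block) > n_rep // 2 else 0)
    return decoded

message = [1, 0, 1]
codeword = encode_repetition(message)   # [1,1,1, 0,0,0, 1,1,1]
corrupted = codeword[:]
corrupted[1] = 0                        # flip one bit in the first block
print(decode_repetition(corrupted))     # [1, 0, 1] -- the single flip is corrected
```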

Of course you have already noticed that to protect against many errors using the repetition code we need to repeat the symbol many times. The ratio R=N/M of the size N of the original message to the size M of the codeword is called the code rate. The code rate is always between zero and one and we obviously want to make it as close to one as possible. For example, in our repetition codes the first code rate was 1/3 and the second was 1/5. It is obvious that the higher the redundancy, the lower the code rate. This is the main dichotomy of codes: we want to add as much redundancy as possible but keep the codeword as small as we can.

It's easy to see that repetition codes give very low code rates; they're not very efficient. The next level of the game is called linear codes. A codeword in a linear code is simply the result of multiplying the original N-sized (binary) message m by a binary matrix G, of size M×N, to get the M-sized codeword t. Mathematically speaking, we simply have t=Gm. G is called the generator matrix. Okay, but how does it help? Well, there is another piece missing, and that is the star of the play: a matrix A called the parity-check matrix. A is also binary and is chosen in such a way that AG=0. Now, if you apply the parity-check matrix to the codeword, you can easily see that the result is zero.
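
Here is a small sketch of this machinery in Python. The post does not commit to any particular G and A, so, purely for illustration, I use the matrices of the standard [7,4] Hamming code, written in the notation above: G is M×N with M=7 and N=4, A is the parity-check matrix, and AG=0 over GF(2):

```python
import numpy as np

# Illustrative generator and parity-check matrices of the [7,4] Hamming code,
# in the post's convention: t = G m, with A G = 0 over GF(2).

G = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 1]])            # 7 x 4 generator matrix

A = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])   # 3 x 7 parity-check matrix

print((A @ G) % 2)    # all zeros: A G = 0 over GF(2)

m = np.array([1, 0, 1, 1])     # N = 4 message bits
t = (G @ m) % 2                # M = 7 codeword, t = G m
print(t)                       # [1 0 1 1 0 1 0]
print((A @ t) % 2)             # [0 0 0]: every codeword passes the parity checks
```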

I need to make an observation here. All the operations with these matrices and vectors are supposed to be over the binary field, a.k.a. the Galois field of order two or GF(2). I am saying this because things can be more or less generalised to higher-order finite fields (another name for a Galois field), but that's getting too far ahead of us for now.
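
Just to fix ideas, here is a tiny sketch of what arithmetic over GF(2) amounts to: addition is the XOR of two bits and multiplication is the AND, which is all that the "% 2" in the snippets here relies on:

```python
# GF(2) arithmetic in a nutshell: reducing ordinary integer results mod 2
# reproduces the XOR table for addition and the AND table for multiplication.
for a in (0, 1):
    for b in (0, 1):
        assert (a + b) % 2 == a ^ b   # addition mod 2 == XOR
        assert (a * b) % 2 == a & b   # multiplication mod 2 == AND

print("1 + 1 =", (1 + 1) % 2, " and  1 * 1 =", (1 * 1) % 2, " over GF(2)")
```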

The parity-check matrix has this name because it can be seen as a set of constraints on t, each forcing some chosen combination of its entries to add to zero (addition mod 2, for we are still talking about the binary field). A binary sum of this type has only two possible results, 0 or 1. As we can associate it with a number being even or odd, we end up calling this value the parity of the sum, and then we say that we need to check that the parity is 0.

Let us now model the noise in the channel, where by channel I just mean the medium through which the information is being transmitted. Consider the simplest possible model, which is called the binary symmetric channel, where each bit of your message can be flipped independently with probability p. To flip a binary number you just need to add 1 to it, for in the binary field addition is defined by 1+1=0, 1+0=1, 0+0=0 and 0+1=1. So we can model the noise in our transmitted codeword by adding to it a vector of zeros and ones, a binary vector, which we call the noise vector and in which each bit has probability p of being one. Let us call the received message r. In our notation, we then have

r = t + n. (1)
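
A quick sketch of this channel in Python; the flip probability p and the codeword are made-up values for illustration:

```python
import numpy as np

# Binary symmetric channel: each bit of the codeword t is flipped
# independently with probability p. Flipping is addition of 1 mod 2,
# so transmission is just r = (t + n) % 2 with a random binary noise vector n.
rng = np.random.default_rng(seed=0)

p = 0.1                                    # flip probability (illustrative value)
t = np.array([1, 0, 1, 1, 0, 1, 0])        # some codeword
n = (rng.random(t.size) < p).astype(int)   # noise vector: each bit is 1 with probability p
r = (t + n) % 2                            # received vector, equation (1)

print("noise    :", n)
print("received :", r)
```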

One thing that you should notice is that if we add the noise vector to the received vector, we get back the transmitted codeword, clean and uncorrupted. Why? Because of the addition rule I gave above. Note that any number added to itself results in zero. That means that n+n=0 and r+n=t+n+n=t. So if we can discover n, we can recover our codeword. Now, you have probably already noticed that this does not help by itself; I am just restating the problem in a different way. We still have nothing from which we can recover the message. But now apply A to both sides of equation (1). Then we end up with what we call the error syndrome:
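
A tiny check of this cancellation, again with made-up vectors:

```python
import numpy as np

# Over GF(2) every vector is its own inverse: n + n = 0, so adding the noise
# to the received vector undoes the corruption, r + n = t + n + n = t.
t = np.array([1, 0, 1, 1, 0, 1, 0])    # transmitted codeword
n = np.array([0, 1, 0, 0, 0, 0, 1])    # noise vector (two flipped bits)
r = (t + n) % 2                        # received vector

print((n + n) % 2)                     # [0 0 0 0 0 0 0]
print((r + n) % 2)                     # equals t: [1 0 1 1 0 1 0]
print(np.array_equal((r + n) % 2, t))  # True
```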

z = Ar = An. (2)

You may or may not have noticed already, but now we are talking! The parity-check matrix is something that both the sender and the receiver know. The received message is what the receiver has, after all. Therefore the receiver can perform the above operation and calculate the syndrome. With that, equation (2) gives a set of linear equations that can in principle be solved to find n and therefore the original codeword t. Of course there are many subtleties; I will talk about them a little in the next posts. The bottom line, however, is that parity-check codes are much more efficient than repetition codes. Actually, they seem to be among the most efficient, if not the most efficient, error-correcting codes at the moment.
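
To close the loop, here is a sketch of the simplest kind of syndrome decoding, using the same illustrative [7,4] Hamming code as above. For a single flipped bit, the syndrome z equals the column of A at the flipped position, so a simple lookup recovers n and hence t. For general codes and stronger noise the decoding step is far less trivial; that is one of the subtleties left for the next posts.

```python
import numpy as np

# Syndrome decoding for the illustrative [7,4] Hamming code used earlier.
# The receiver only knows A and r; the syndrome z = A r = A n (equation (2))
# reveals the noise. For a single flipped bit, z equals the column of A at
# the flipped position, so a lookup recovers n and then t = r + n.
A = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

t = np.array([1, 0, 1, 1, 0, 1, 0])    # transmitted codeword (A t = 0)
n = np.zeros(7, dtype=int); n[2] = 1   # noise: a single flip at position 2
r = (t + n) % 2                        # received vector

z = (A @ r) % 2                        # syndrome, z = A r = A n
print("syndrome:", z)                  # [0 1 1] -- column 2 of A

# Locate which column of A matches the syndrome and flip that bit back.
n_hat = np.zeros(7, dtype=int)
if z.any():
    pos = next(j for j in range(7) if np.array_equal(A[:, j], z))
    n_hat[pos] = 1
t_hat = (r + n_hat) % 2
print("decoded :", t_hat, "matches t:", np.array_equal(t_hat, t))
```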

What is the relationship with physics? We are not ready to appreciate it yet, but just notice that we have here a system where each variable can take two values, more or less like an electron spin, and the variables must obey constraints. A constraint relates one variable to others, or you may say that these variables "interact" in some sense. I will leave you to think a bit about that before spoiling the surprise completely.
