Motivation
Error Correcting Codes
- Error correcting codes are a way of extending a message into a codeword so that if it is altered in transmission, errors introduced into the message can be easily detected and corrected. For the purposes of interactive proofs, we're only interested in using these codes to efficiently compare data, not for error correction, but we still often hear them referred to as error correcting codes.
- Error correcting codes are distance amplifying, making them valuable for proof systems. Distance amplification means that if a message is altered in even a small way, the resulting codeword for the message will be almost completely different. This allows us to efficiently compare long messages by comparing a few random symbols in the codewords. The full codeword for a message is much longer than the original message, but comparing a symbol is much faster than comparing the full message.
- Reed-Solomon encoding is done by treating each character in a message as a coefficient in a polynomial. The polynomial is evaluated for a set of many inputs to generate the code. We also call these codes extension codes, and we refer to the polynomial as a unique low degree extension of a message.
- There are different approaches to obtaining low degree extensions of data, but for a message of length $n$, each will determine a unique polynomial which will have degree at most $n-1$. Reed-Solomon and Lagrange interpolation are two approaches to obtaining unique low degree extension polynomials which can be used to create distance amplifying extension codes for efficient probabilistic comparison of the underlying data.
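As a sketch of the interpolation approach, the snippet below performs Lagrange interpolation over a small prime field. The field size ($p = 17$), the sample message, and the function name `lagrange_eval` are all illustrative choices for this example, not part of the text above.

```python
# Illustrative sketch of Lagrange interpolation over a small prime
# field; the field size (p = 17), sample message, and function name
# are our own choices for the example.
p = 17

def lagrange_eval(points, x):
    """Evaluate the unique lowest-degree polynomial passing through
    `points` (pairs (x_i, y_i) with distinct x_i) at `x`, mod p."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                # Multiply by (x - xj) / (xi - xj); division is a
                # modular inverse via Fermat's little theorem.
                term = term * (x - xj) * pow(xi - xj, p - 2, p) % p
        total = (total + term) % p
    return total

# A length-3 message viewed as evaluations at x = 0, 1, 2.
message = [4, 9, 2]
points = list(zip(range(3), message))

# The interpolated polynomial agrees with the message on its domain ...
assert [lagrange_eval(points, x) for x in range(3)] == message
# ... and extends it to all p points of the field.
extension = [lagrange_eval(points, x) for x in range(p)]
```

Because the interpolating polynomial has degree at most 2, these 17 evaluations are a distance amplified encoding of the 3-symbol message.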
Random Sampling and Probabilistic Error
- Proof systems use randomness to improve efficiency at the expense of introducing a small probabilistic error. In the case of message comparison via error correcting codes, the tradeoff is the same. We can evaluate a few entries in the low degree extension encoding of two messages, and if they are the same it is extremely likely that the messages are the same.
- The probability of the unique low degree extension polynomials $p_a$ and $p_b$ for two different messages $a$ and $b$ producing the same output for a random value $r$ is very low, so messages can be probabilistically compared by selecting a random $r$ and checking whether $p_a(r) = p_b(r)$.
- The entries $p_a(r)$ and $p_b(r)$ are much shorter than the messages, so comparing $p_a(r)$ and $p_b(r)$ is much more efficient than comparing the full messages.
Reed-Solomon Encodings
To obtain the Reed-Solomon encoding of an $n$-length message $m$, we define a finite field $\mathbb{F}_p$ with order $p$ such that $p \geq n^2$ and $p \geq k$, where $k$ is the number of distinct symbols in the character set of $m$. Working over the field, we obtain the low degree extension of $m$ by using the entries of $m$ as the weights in a linear combination of polynomials in the standard monomial basis $\{1, x, x^2, \ldots, x^{n-1}\}$.
Low Degree Extension of $m$:

$$p_m(x) = \sum_{i=1}^{n} m_i \cdot x^{i-1}$$

By evaluating $p_m$ at all $x \in \mathbb{F}_p$, we obtain a $p$-dimensional vector which is the low degree extension encoding of $m$.
Low Degree Extension Encoding of $m$:

$$\big(p_m(0),\; p_m(1),\; \ldots,\; p_m(p-1)\big)$$
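The encoding above can be sketched in a few lines. This is a minimal illustration with a toy message and field size; the names `poly_eval` and `encode` are our own, and a real parameter choice would need $p \geq n^2$ and $p \geq k$.

```python
# Minimal sketch of the low degree extension encoding, assuming a toy
# message and field size; the names `poly_eval` and `encode` are ours.
p = 11  # field order (a real choice would need p >= n^2 and p >= k)

def poly_eval(coeffs, x, p):
    """Evaluate sum_i coeffs[i] * x**i mod p using Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def encode(message, p):
    """The p-dimensional low degree extension encoding: the message
    polynomial evaluated at every element of F_p."""
    return [poly_eval(message, x, p) for x in range(p)]

m = [3, 1, 4]           # message entries serve as polynomial coefficients
code = encode(m, p)     # the length-p codeword
assert len(code) == p
assert code[0] == m[0]  # the evaluation at 0 is the constant coefficient
```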
Reed-Solomon Algorithm
- Given a message $m$ of length $n$ where each character has up to $k$ possible symbol values, choose a prime number $p$ such that $p \geq n^2$ and $p \geq k$.
- Selecting $p \geq n^2$ limits the probability that two different messages will have equal entries at a random index to less than $\frac{1}{n}$ (as we will see later).
- Selecting $p \geq k$ ensures that every possible symbol will have a unique representation in the field $\mathbb{F}_p$.
- We interpret the message as an $n$-dimensional vector $m = (m_1, m_2, \ldots, m_n)$ such that $m_i \in \mathbb{F}_p$.
- Let $\mathbb{F}_p$ be the finite field of integers modulo $p$ and let $p_m$ be the following polynomial of degree at most $n-1$ over $\mathbb{F}_p$: $p_m(x) = \sum_{i=1}^{n} m_i \cdot x^{i-1}$
- Expanded, this looks like: $p_m(x) = m_1 + m_2 \cdot x + m_3 \cdot x^2 + \cdots + m_n \cdot x^{n-1}$
- The full Reed-Solomon encoding is a $p$-dimensional vector of the evaluations of $p_m(x)$ for all $x \in \mathbb{F}_p$, such that the $i$th entry in the encoding is $p_m(i)$ for $i \in \{0, 1, \ldots, p-1\}$.
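The parameter-selection step can be sketched as follows. The helper `is_prime` is a naive trial-division test used purely for illustration; a real implementation would use a proper primality test.

```python
# A sketch of the parameter-selection step; `is_prime` is a naive
# trial-division test used only for illustration.
def is_prime(x):
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True

def choose_prime(n, k):
    """Smallest prime p satisfying p >= n**2 and p >= k."""
    p = max(n * n, k)
    while not is_prime(p):
        p += 1
    return p

# e.g. length-8 messages over ASCII (k = 128): need p >= 64 and p >= 128.
assert choose_prime(8, 128) == 131
```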
Example
- Consider the scenario where two long messages $a$ and $b$ on separate machines need to be checked for equality. Transmitting the full messages is prohibitively expensive, but fingerprinting and random sampling can be used to allow one message holder, Alice, to inexpensively verify that she has the same message as the other, Bob.
- Alice and Bob agree on a prime field $\mathbb{F}_p$ such that $p \geq n^2$ and $p \geq k$, where $n$ is the length of the messages (if their lengths are not equal, the messages are clearly not equal) and $k$ is the number of possible values for each symbol in the messages.
- Alice interprets the message $a$ as an $n$-dimensional vector of elements of $\mathbb{F}_p$, so $a = (a_1, a_2, \ldots, a_n)$.
- Alice uses the vector as the coefficients of a polynomial $p_a(x) = \sum_{i=1}^{n} a_i \cdot x^{i-1}$.
- Bob performs the same procedure for his message $b$, resulting in a polynomial $p_b(x) = \sum_{i=1}^{n} b_i \cdot x^{i-1}$.
- Alice selects a random element $r$ such that $r \in \mathbb{F}_p$.
- Alice generates the value $v = p_a(r)$ and sends $(r, v)$ to Bob.
- Bob checks if $p_b(r) = v$.
- If they are not equal, Bob can conclude $a \neq b$. Otherwise, Bob can conclude that $a$ and $b$ are probably equal: $a = b$ with probability at least $1 - \frac{1}{n}$.
- They can repeat this random sampling procedure as many times as they'd like to increase their confidence. For $t$ repetitions, the probability that two differing messages pass every check drops to $\frac{1}{n^t}$ or lower.
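The exchange above can be simulated end to end. The helper name `fingerprint` and all parameters below are our own illustrative choices; a deployed version would also use a cryptographically secure source of randomness.

```python
import random

# A toy end-to-end run of the fingerprinting exchange. The helper name
# `fingerprint` and the parameters are illustrative choices.
def fingerprint(message, r, p):
    """Evaluate sum_i message[i] * r**i mod p via Horner's rule."""
    acc = 0
    for c in reversed(message):
        acc = (acc * r + c) % p
    return acc

p = 101                                        # prime with p >= n^2 for n = 10
a = [random.randrange(2) for _ in range(10)]   # Alice's message (bits, so k = 2)
b = list(a)                                    # Bob holds an identical copy

r = random.randrange(p)                        # Alice's random field element
v = fingerprint(a, r, p)                       # Alice sends (r, v) to Bob

assert fingerprint(b, r, p) == v               # equal messages always agree

# If Bob's copy is corrupted in one position, the fingerprints disagree
# for all but at most n - 1 = 9 choices of r.
b[3] ^= 1
disagreements = sum(fingerprint(a, x, p) != fingerprint(b, x, p) for x in range(p))
assert disagreements >= p - 9
```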
Different Polynomials are Different Almost Everywhere
- Polynomials are especially useful as representations of data because two different polynomials are different at almost all points. In the case of Alice and Bob, that means that if $p_a$ and $p_b$ are non-equal polynomials, then for almost all $r \in \mathbb{F}_p$, $p_a(r) \neq p_b(r)$. As a result, we can easily compare polynomials probabilistically.
- Two distinct polynomials $p_a$ and $p_b$ of degree at most $n-1$ will intersect at $n-1$ or fewer points, because $p_a - p_b$ is a nonzero polynomial of degree at most $n-1$ with a root at each point where $p_a$ and $p_b$ intersect (where $p_a(x) - p_b(x) = 0$). If $p_a$ and $p_b$ intersected at more than $n-1$ points, $p_a - p_b$ would be a nonzero polynomial of degree at most $n-1$ with more than $n-1$ roots, which would violate the fundamental theorem of algebra.
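This intersection bound can be checked exhaustively over a small field. The degree, field size, and helper names below are illustrative assumptions.

```python
import random

# Exhaustive check of the intersection bound over a small field; the
# degree, field size, and helper names are illustrative assumptions.
p, deg = 97, 4
random.seed(1)

def ev(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x, mod p."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

for _ in range(100):
    f = [random.randrange(p) for _ in range(deg + 1)]
    g = [random.randrange(p) for _ in range(deg + 1)]
    if f == g:
        continue  # the bound only applies to distinct polynomials
    agreements = sum(ev(f, x) == ev(g, x) for x in range(p))
    assert agreements <= deg  # distinct degree-<=4 polynomials meet at <= 4 points
```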
Probabilistic Error
- The extension code discussed previously is far larger than the message vector. In this case, since $p \geq n^2$, the length of the code is at least the length of the message squared. While the codes are guaranteed to be different for two different messages, comparing them instead of the messages would certainly not save us any processing or communication. Random sampling allows us to gain confidence that two messages are the same with the transmission of just two field elements.
- We've seen that two distinct polynomials of degree at most $n-1$ can intersect at no more than $n-1$ points, and this fact allows us to bound the probability of a random element $r$ resulting in the same evaluation for distinct polynomials $p_a$ and $p_b$.
- Since $p_a$ and $p_b$ are polynomials of degree at most $n-1$, there can be up to $n-1$ values $x$ for which $p_a(x) = p_b(x)$. There are $p$ total elements in the field, so the full encodings agree at at most $n-1$ out of $p$ coordinates.
- With the random fingerprinting, the probability of two differing messages producing the same evaluation is $\frac{n-1}{p}$ or lower.
- We've intentionally selected $p$ such that $p \geq n^2$ to further bound this probability, so we can conclude that the probability of an error is below $\frac{1}{n}$ for each fingerprinting exchange.
- The order $p$ of the field determines the probability that two distinct messages will have the same polynomial evaluations at a single element.
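The effect of the $p \geq n^2$ choice can be checked numerically; the sizes below are arbitrary examples.

```python
# Numeric sanity check: choosing p >= n^2 drives the per-exchange error
# bound (n - 1) / p below 1 / n. The sizes here are arbitrary examples.
for n in (10, 100, 1000):
    p = n * n          # any prime >= n^2 only makes the bound smaller
    bound = (n - 1) / p
    assert bound < 1 / n
```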
Efficiency
- Recall that a string of $b$ bits can represent up to $2^b$ distinct values ($0$ to $2^b - 1$). Since the character sets for our messages have $k$ possible values, we must find $b$ such that $2^b \geq k$. That means $\log_2 k$ bits are required for each character. Since the number of bits must be an integer, we'd really use the ceiling of this value, $\lceil \log_2 k \rceil$.
- The cost to transmit a message of length $n$ with $k$ possible character values is $n \cdot \lceil \log_2 k \rceil$ bits, or $O(n \log k)$.
- In the fingerprinting protocol, we only need to transmit two values: the random field element $r$ and the polynomial evaluation $v = p_a(r)$. The cost is $O(\log p)$ bits since there are $p$ possible field elements. Assuming we choose $p$ to be polynomially related to $n$ ($p \leq n^c$ where $c$ is a constant, rather than, say, exponential in $n$), we can simplify the cost to $O(\log n^c) = O(c \log n)$. Dropping the constant, we have reduced the communication cost of comparing messages from $O(n \log k)$ to $O(\log n)$.
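As a back-of-the-envelope comparison of the two costs, consider a hypothetical 1-million-character ASCII message ($k = 128$); the concrete numbers below are illustrative only.

```python
import math

# Back-of-the-envelope comparison of the two costs, assuming a
# 1-million-character ASCII message (k = 128); numbers are illustrative.
n, k = 1_000_000, 128
naive_bits = n * math.ceil(math.log2(k))        # transmit the whole message
p = n ** 2                                      # stand-in for a prime near n^2
fingerprint_bits = 2 * math.ceil(math.log2(p))  # transmit r and p_a(r)

assert naive_bits == 7_000_000
assert fingerprint_bits == 80   # 2 * ceil(log2(10^12)) = 2 * 40
```

Seven million bits versus eighty: the gap between $O(n \log k)$ and $O(\log n)$ is what makes fingerprinting practical for very long messages.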