Reed-Solomon Encoding and Fingerprinting

July 01, 2024

Motivation

Error Correcting Codes

  • Error correcting codes are a way of extending a message into a codeword so that errors introduced in transmission can be easily detected and corrected. For the purposes of interactive proofs, we're only interested in using these codes to efficiently compare data, not for error correction, but we still often hear them referred to as error correcting codes.
  • Error correcting codes are distance amplifying, making them valuable for proof systems. Distance amplification means that if a message is altered in even a small way, the resulting codeword for the message will be almost completely different. This allows us to efficiently compare long messages by comparing a few random symbols in the codewords. The full codeword for a message is much longer than the original message, but comparing a symbol is much faster than comparing the full message.
  • Reed-Solomon encoding is done by treating each character in a message as a coefficient in a polynomial. The polynomial is evaluated for a set of many inputs to generate the code. We also call these codes extension codes, and we refer to the polynomial as a unique low degree extension of a message.
  • There are different approaches to obtaining low degree extensions of data, but each determines a unique polynomial of degree at most $n-1$. Reed-Solomon encoding and Lagrange interpolation are two approaches to obtaining unique low degree extension polynomials which can be used to create distance amplifying extension codes for efficient probabilistic comparison of the underlying data.

Random Sampling and Probabilistic Error

  • Proof systems use randomness to improve efficiency at the expense of introducing a small probabilistic error. In the case of message comparison via error correcting codes, the tradeoff is the same. We can evaluate a few entries in the low degree extension encoding of two messages, and if they are the same it is extremely likely that the messages are the same.
  • The probability of the unique low degree extension polynomials $P_a$ and $P_b$ for two different messages $a$ and $b$ producing the same output for a random value $r$ is very low, so messages can be probabilistically compared by selecting a random $r$ and checking whether $P_a(r) = P_b(r)$.
  • The entries are much shorter than the messages, so comparing $P_a(r)$ and $P_b(r)$ is much more efficient than comparing the full messages.

Reed-Solomon Encodings

To obtain the Reed-Solomon encoding of an $n$-length message $m$, we define a finite field with order $p \gg n$ and $p > k$, where $k$ is the number of distinct symbols in the character set of $m$. Working over the field, we obtain the low degree extension of $m = (m_1, \dots, m_n)$ by using the entries of $(m_1, \dots, m_n)$ as the weights in a linear combination of polynomials in the standard monomial basis $\{x^0, x^1, \dots, x^{n-1}\}$.

Low Degree Extension of $m$:

$$P_m(x) = \sum_{i=1}^{n} m_i \cdot x^{i-1}$$

By evaluating $P_m(x)$ at all $x \in \mathbb{F}_p$, we obtain a $p$-dimensional vector which is the low degree extension encoding of $m$.

Low Degree Extension Encoding of $m$:

$$c = (P_m(0), \dots, P_m(p-1))$$
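
As a quick illustration, take the message $m = (2, 1, 3)$ over $\mathbb{F}_7$ (a deliberately tiny field chosen for readability; it does not satisfy the $p \geq n^2$ requirement introduced below). Then $P_m(x) = 2 + x + 3x^2$, and evaluating at every element of $\mathbb{F}_7$ gives

$$c = (P_m(0), P_m(1), \dots, P_m(6)) = (2, 6, 2, 4, 5, 5, 4)$$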

Reed-Solomon Algorithm

  • Given a message $m$ of length $n$ where each character $m_i$ has up to $k$ possible symbol values, choose a prime number $p$ such that $p \geq n^2$ and $p \geq k$.
  • Selecting $p \geq n^2$ limits the probability that two different messages will have equal entries at a random index to $1/n$ (as we will see later).
  • Selecting $p \geq k$ ensures that every possible symbol has a unique representation in the field $\mathbb{F}_p$.
  • Let $\mathbb{F}_p$ be the finite field of integers modulo $p$. We interpret the message $m$ as an $n$-dimensional vector such that $m = (m_1, \ldots, m_n) \in \mathbb{F}_p^n$.
  • Let $P_m(x)$ be the following polynomial of degree $\leq n-1$ over $\mathbb{F}_p$:
    $$P_m(x) = \sum_{i=1}^{n} m_i \cdot x^{i-1}$$
  • Expanded, this looks like:
    $$P_m(x) = m_1 \cdot x^{0} + m_2 \cdot x^{1} + \dots + m_n \cdot x^{n-1}$$
  • The full Reed-Solomon encoding is a $p$-dimensional vector of the evaluations $P_m(i)$ for all $i \in \mathbb{F}_p$, such that the $i$th entry in the encoding is $P_m(i)$ (a code sketch follows this list).
    $$c = (P_m(0), P_m(1), \dots, P_m(p-1))$$
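
A minimal sketch of this procedure in Python (the function name and the plain-list message representation are illustrative choices, not part of any standard library):

```python
def reed_solomon_encode(m, p):
    """Return the p-dimensional Reed-Solomon encoding of message m over F_p.

    m: message as a list of integers in [0, p), one per character
    p: a prime with p >= n^2 and p >= k (not checked here)
    """
    def eval_poly(x):
        # Horner's rule for P_m(x) = m[0] + m[1]*x + ... + m[n-1]*x^(n-1)
        acc = 0
        for coeff in reversed(m):
            acc = (acc * x + coeff) % p
        return acc

    return [eval_poly(x) for x in range(p)]

# The toy message from the earlier illustration: (2, 1, 3) over F_7.
print(reed_solomon_encode([2, 1, 3], 7))  # [2, 6, 2, 4, 5, 5, 4]
```

Note that materializing the full encoding costs on the order of $p \cdot n$ field operations, which is exactly why the protocol below samples a single evaluation rather than comparing whole codewords.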

Example

  • Consider the scenario where two long messages $a$ and $b$ on separate machines need to be checked for equality. Transmitting the full messages is prohibitively expensive, but fingerprinting and random sampling can be used to allow one message holder, Alice, to inexpensively verify that she has the same message as the other, Bob. (A code sketch of the full exchange follows this list.)
    1. Alice and Bob agree on a prime field $\mathbb{F}_p$ such that $p \geq n^2$ and $p \geq k$, where $n$ is the length of the messages (if their lengths are not equal, the messages are clearly not equal) and $k$ is the number of possible values for each symbol in the messages.
    2. Alice interprets the message $a$ as an $n$-dimensional vector of elements of $\mathbb{F}_p$, so $a = (a_1, a_2, \dots, a_n) \in \mathbb{F}_p^n$.
    3. Alice uses the vector as the coefficients of a polynomial $P_a(x)$.
      $$P_a(x) = \sum_{i=1}^{n} a_i \cdot x^{i-1}$$
      $$P_a(x) = a_1 \cdot x^{0} + a_2 \cdot x^{1} + \dots + a_n \cdot x^{n-1}$$
    4. Bob performs the same procedure for his message $b$, resulting in a polynomial $P_b(x)$.
      $$P_b(x) = \sum_{i=1}^{n} b_i \cdot x^{i-1}$$
      $$P_b(x) = b_1 \cdot x^{0} + b_2 \cdot x^{1} + \dots + b_n \cdot x^{n-1}$$
    5. Alice selects a random element $r \in \mathbb{F}_p$.
    6. Alice computes the value $v := P_a(r)$ and sends $r, v$ to Bob.
    7. Bob checks whether $P_b(r) = v$.
    8. If they are not equal, Bob can conclude $a \neq b$. Otherwise, Bob can conclude that $a$ and $b$ are probably equal: unequal messages pass the check with probability at most $\frac{1}{n}$.
    9. They can repeat this random sampling procedure as many times as they'd like to increase their confidence that the messages are equal. For $t$ repetitions, the probability that unequal messages pass every check drops to $\leq \frac{1}{n^t}$.
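
A hypothetical end-to-end sketch of this exchange in Python. Everything runs in one process, so "sending" $r$ and $v$ is just passing values, and the function names are illustrative:

```python
import random

def fingerprint(message, p, r):
    """Evaluate the message's extension polynomial at r over F_p (Horner's rule)."""
    acc = 0
    for coeff in reversed(message):
        acc = (acc * r + coeff) % p
    return acc

def equality_check(a, b, p, repetitions=1):
    """Alice/Bob protocol: False means definitely unequal; True means equal
    with error probability at most ((n-1)/p) ** repetitions."""
    if len(a) != len(b):
        return False  # unequal lengths settle it immediately
    for _ in range(repetitions):
        r = random.randrange(p)            # step 5: Alice's random challenge
        v = fingerprint(a, p, r)           # step 6: Alice's fingerprint
        if fingerprint(b, p, r) != v:      # step 7: Bob's check
            return False                   # step 8: a != b for certain
    return True                            # steps 8-9: probably equal

# n = 4 byte-valued symbols (k = 256), so any prime p >= max(n^2, k) works; 257 here.
a = [10, 20, 30, 40]
b = [10, 20, 30, 41]
print(equality_check(a, a, 257, repetitions=3))  # True
print(equality_check(a, b, 257, repetitions=3))  # almost surely False
```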

Different Polynomials are Different Almost Everywhere

  • Polynomials are especially useful as representations of data because two different polynomials are different at almost all points. In the case of Alice and Bob, that means that if $P_a$ and $P_b$ are unequal polynomials, then for almost all $r \in \mathbb{F}_p$, $P_a(r) \neq P_b(r)$. As a result, we can easily compare polynomials probabilistically.
  • Two distinct polynomials $P_a$ and $P_b$ of degree $\leq n$ can intersect at no more than $n$ points, because $P_a - P_b$ is a nonzero polynomial of degree $\leq n$ with a root at each point where $P_a(x) = P_b(x)$. If $P_a$ and $P_b$ intersected at more than $n$ points, $P_a - P_b$ would be a nonzero polynomial of degree $\leq n$ with more than $n$ roots, which is impossible: over a field, a nonzero polynomial of degree $\leq n$ has at most $n$ roots (a consequence of the factor theorem). A brute-force check of this bound appears below.
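
As a sanity check, here is a brute-force count of agreement points for two arbitrarily chosen distinct degree-4 polynomials over a small field (the specific field and coefficients are illustrative only):

```python
# Two distinct degree-4 polynomials over F_101, coefficients low-to-high order.
p = 101
P_a = [3, 1, 4, 1, 5]   # 3 + x + 4x^2 + x^3 + 5x^4
P_b = [2, 7, 1, 8, 6]   # different coefficients throughout

def evaluate(coeffs, x, p):
    # Horner's rule evaluation mod p.
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

agreements = sum(evaluate(P_a, x, p) == evaluate(P_b, x, p) for x in range(p))
print(agreements)  # always <= 4, since deg(P_a - P_b) <= 4
```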

Probabilistic Error

  • The extension code discussed previously is far larger than the message vector. In this case, since $p \geq n^2$, the length of the code is at least the length of the message squared. While the codes are guaranteed to be different for two different messages, comparing them instead of the messages would certainly not save us any processing or communication. Random sampling allows us to gain confidence that two messages are the same with the transmission of just two field elements.
  • We've seen that two distinct polynomials of degree $\leq n$ can intersect at no more than $n$ points, and this fact allows us to bound the probability that a random element $r \in \mathbb{F}_p$ results in the same evaluation for distinct polynomials $P_a$ and $P_b$.
  • Since $P_a$ and $P_b$ are polynomials of degree $\leq n-1$, there can be at most $n-1$ values for which $P_a(x) = P_b(x)$. There are $p$ total elements in the field, so the full encodings agree at no more than $n-1$ out of $p$ coordinates.
  • With random fingerprinting, the probability that two differing messages produce the same evaluation is therefore at most $(n-1)/p$.
  • We've intentionally selected $p$ such that $p \geq n^2$ to further bound this probability, so we can conclude that the probability of an error is at most $(n-1)/n^2 < 1/n$ for each fingerprinting exchange.
  • The order of the field $\mathbb{F}_p$ determines the probability that two distinct messages will have the same polynomial evaluations at a single element.
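
To put numbers on this: for messages of length $n = 10^6$, choosing a prime $p \geq n^2 = 10^{12}$ means a single exchange errs with probability at most $(n-1)/p < 10^{-6}$, and three independent repetitions push the error below $10^{-18}$.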

Efficiency

  • Recall that a string of $b$ bits can represent up to $2^b$ distinct values ($0$ to $2^b - 1$). Since the character sets for our messages have $k$ possible values, we must find $b$ such that $2^b \geq k$. That means $\log_2(k)$ bits are required for each character. Since the number of bits must be an integer, we'd really use the ceiling of this value, $\lceil \log_2(k) \rceil$.
  • The cost to transmit a message of length $n$ with $k$ possible character values is $O(n \cdot \log_2(k))$.
  • In the fingerprinting protocol, we only need to transmit two values: the random field element $r$ and the polynomial evaluation $P_m(r)$. The cost is $O(\log_2(p))$ since there are $p$ possible field elements. Assuming we choose $p$ to be polynomially related to $n$ ($p = n^c$ where $c$ is a constant, rather than, say, $p = n^n$), we can simplify the cost to $O(\log_2(n^c)) = O(c \cdot \log_2(n))$. Dropping the constant, we have reduced the communication cost of comparing message $m$ from $O(n \cdot \log_2(k))$ to $O(\log_2(n))$.
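
For example, a message of $n = 10^6$ bytes ($k = 256$) costs about $n \cdot \log_2(k) = 8 \times 10^6$ bits to transmit, while a fingerprint over a field with $p \approx n^2 = 10^{12}$ costs only about $2 \cdot \log_2(p) \approx 80$ bits for $r$ and $P_m(r)$ combined.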


Written by Breads