2.2 Notes - Reed-Solomon Encoding and Fingerprinting

July 06, 2024

Randomness Can Improve Efficiency

  • Given two nn-dimensional vectors aa and bb, deterministic comparison of the vectors will always require communicating all nn characters and have perfect soundness.
    • That's O(n)O(n) communication cost before factoring in the size of the character set.
  • Probabilistic comparison can be done with constant communication at the expense of a soundness error that is proportional to the size of the field over which the low degree extensions of the vectors are defined.
  • Probabilistic comparison of the vectors can be done by creating unique low degree extension polynomials of the vectors over the domain Fp\mathbb{F}_p and comparing evaluations of the unique low degree extensions pa(r)p_a(r) and pb(r)p_b(r) for some random rFpr\in \mathbb{F}_p
    • Pick p>>np >> n and p>kp>k where kk is the size of the character set for aa and bb .
    • The low degree extensions will have degree n1\leq n-1.
    • Two field elements are transmitted, rr and pa(r)p_a(r), resulting in constant communication.
    • The soundness error (probability that the evaluations will be the same for different vectors) is
      PrrFp[pa(r)=pb(r)]n1p\Pr_{r\in\mathbb{F}_p}[p_a(r) = p_b(r)] \leq \frac{n-1}{p}
  • The trivial comparison requires transmitting all nn characters, each with kk possible values. Each character requires log2(k)log_2(k) bits to represent, so the communication cost would be O(nlog2(k))O(n \cdot log_2(k)).
  • In the probabilistic approach two field elements are transmitted. The field has order pp, so there are pp possible values requiring log2(p)log_2(p) bits each to represent. If the order of the field is a constant power of nn, then p=ncp = n^c and the communication cost is 2log2(nc)=2clog2(n)=O(log2(n))2 \cdot log_2(n^c) = 2c \cdot log_2(n) = O(log_2(n)).
  • We pick pp such that p>kp>k and p>>np >>n. It is likely that nn is the term that decides how large pp should be, but if the messages were very short but each character had a high number of possible values, the field could be defined to the be the same size as kk. Then we would have 2log2(k)2 \cdot log_2(k) bits being transmitted and a communication cost of O(log2(k))O(log_2(k)).

Low Degree Extensions and Fingerprints

  • Pick a prime number pp such that p>n2p > n^2 where nn is the length of the message being encoded and p>kp>k where kk is the size of the character set for the message.
  • Interpret each entry in the message m=(m1,,mn)m = (m_1,\dots,m_n) as an element in Fp\mathbb{F}_p.
  • The unique low degree extension of an nn-length message mm is a polynomial of degree n1n-1 or lower
    pm(x):=i=1nmixi1p_m(x):= \sum_{i=1}^{n} m_{i} \cdot x^{i-1}
  • Each evaluation of the low degree extension at a field element can be thought of as a hash or fingerprint. For each element rFpr\in \mathbb{F}_p, there is a hash function hrh_r representing evaluation of the extension at the element rr.
    hr(m1,,mn):=i=1nmiri1h_r(m_1,\dots,m_n) := \sum_{i=1}^{n} m_i \cdot r^{i-1}
  • The family of hash functions HH is the set of all hash functions for all elements in Fp\mathbb{F}_p
    H={hr:rFp}H = \{h_r :r\in \mathbb{F}_p\}
  • The hash function hrh_r for rFpr\in \mathbb{F}_p applied to the message mm is the low degree extension pmp_m evaluated at rr
    hr(m1,,mn)=pm(r)h_r(m_1,\dots,m_n) = p_m(r)
  • Rather than comparing nn-length messages aa and bb directly, we can compare their hashes hr(a)=hr(b)h_r(a) = h_r(b), or equivalently, we can compare the evaluations of their low degree extensions pa(r)=pb(r)p_a(r) = p_b(r) for rFpr\in \mathbb{F}_p
    PrrFp[hr(a)=hr(b)]=PrrFp[pa(r)=pb(r)]n1p\Pr_{r\in \mathbb{F}_p} [h_r(a) = h_r(b)] = \Pr_{r\in \mathbb{F}_p} [p_a(r) = p_b(r)] \leq \frac{n-1}{p}

Message Comparison

  • Given two nn-length messages aa and bb, if the messages are identical, their hashes hr(a)h_r(a) and hr(b)h_r(b) will be the same for all rFpr \in \mathbb{F}_p because their low degree extension polynomials will be the same. If a=ba = b then i{1,,n}(ai=bi)\forall i \in \{1,\dots,n\} (a_i = b_i).
    hr(a)=pa(r)=i=1nairi1=i=1nbiri1=pb(r)=hr(b)h_r(a) = p_a(r)= \sum_{i=1}^{n} a_{i} \cdot r^{i-1} = \sum_{i=1}^{n} b_{i} \cdot r^{i-1} = p_b(r) = h_r(b)
  • If the messages differ, their hashes will differ with high probability over rFpr \leftarrow \mathbb{F}_p.
    • The extension polynomials pa(x)=i=1naixi1p_a(x) = \sum_{i=1}^{n} a_{i} \cdot x^{i-1} and pb(x)=i=1nbixi1p_b(x) = \sum_{i=1}^{n} b_{i} \cdot x^{i-1} will have degree n1n-1 by definition, so if they are different polynomials they will intersect at no more than n1n-1 points.
    • Two distinct polynomials pap_a and pbp_b of degree n1\leq n-1 will intersect at n1n-1 or fewer points because papbp_a - p_b is an n1n-1 or lower degree polynomial with roots at each point where pap_a and pbp_b intersect. If pap_a and pbp_b intersect at more than nn points, papbp_a - p_b is an n1n-1 or lower degree polynomial with more than n1n-1 roots which would violate the fundamental theorem of algebra.
    • There are pp points in Fp\mathbb{F}_p, so there is an n1p\frac{n-1}{p} chance of agreement at a random point
      PrrFp[pa(r)=pb(r)]n1p\Pr_{r\in\mathbb{F}_p}[p_a(r) = p_b(r)] \leq \frac{n-1}{p}

Discussion

  • The vector (a1,,an)(a_1,\dots,a_n) has an error corrected encoding (pa(1),,pa(p))p_a(1),\dots,p_a(p)), and pa(r)p_a(r) for rFpr\in \mathbb{F}_p is a random entry in the encoding.
    • Alternatively, the set of all evaluations (h1(a),,hp(a))(h_1(a),\dots,h_p(a)) is the error corrected encoding of aa
  • There are other encoding protocols, all that is required is a family of hashes HH such that for aba\neq b, PrhH[h(a)=h(b)]\Pr_{h\in H}[h(a)=h(b)] is low
  • The algebraic structure of Reed-Solomon fingerprinting makes it useful for probabilistic proofs.

Implementation

/**
 * Evaluates a low degree extension polynomial derived from the Reed Solomon encoding of an ASCII message at a given field element `r`.
 *
 * This function interprets the given ASCII message as coefficients of a polynomial over the finite field F.
 * It constructs the polynomial as a linear transformation over the standard monomial basis:
 *   P(x) = c₀ + c₁·x + c₂·x² + ... + cₙ₋₁·xⁿ⁻¹
 * where each coefficient cᵢ is the field representation of the ASCII code of the i-th character of message.
 *
 * @param message - The ASCII string message to encode.
 * @param r - The field element at which to evaluate the polynomial.
 * @returns The field element P(r), the evaluation of the polynomial at r.
 * @throws Error if the message is empty.
 */
function getReedSolomon(message: string, r: Field): Field {
  if (message.length === 0) {
    throw new Error("Message cannot be empty.");
  }

  // Accumulator for summation  
  let result = Field(0);

  // Current power of the field element `r` (r⁰=1 initially)  
  let currentPower = Field(1);

  // Sum over each character in the message to construct and evaluate the polynomial  
  for (let i = 0; i < message.length; i++) {
    // Convert the ASCII character to its numeric representation  
    const coefficient = Field(asciiToNumber(message[i]));

    // Add the current term (coefficient * r^i) to the result  
    result = result.add(coefficient.mul(currentPower));

    // Update the current power of `r` for the next term (r^(i+1))  
    currentPower = currentPower.mul(r);
  }

  return result;
}

/**
 * Converts an ASCII character to its numeric ASCII code. * * @param char - The ASCII character to convert.
 * @returns The numeric ASCII code of the character.
 * @throws Error if the character is not a valid ASCII character (code between 0 and 127).
 */
export function asciiToNumber(char: string): number {
  const num = char.charCodeAt(0);
  if (num < 0 || num > 127) {
    throw new Error("Input must be a valid ASCII character (0-127).");
  }
  return num;
}

Profile picture

Written by Breads