Supplemental: Randomness

Primality Testing

A key component of many cryptography algorithms is finding large prime numbers, with hundreds or even thousands of digits. A typical approach is to randomly choose a large number and then apply a primality test to determine whether it is prime – the prime number theorem states that approximately \(1/m\) of the \(m\)-bit numbers are prime, so we only expect to check \(m\) numbers before finding one that is prime [1]. As long as the primality test itself is efficient with respect to \(m\), the expected time to find a prime number with \(m\) bits is polynomial in \(m\).

Formally, we wish to decide the following language:

\[\PRIMES = \set{m \in \N : m \text{ is prime}} \; \text.\]

A positive integer \(m \geq 2\) is prime if it has no positive divisors other than \(1\) and itself. (For technical reasons, the number \(1\) is not considered prime, although its only positive divisors are \(1\) and itself, which are the same.) Equivalently, it has no divisor between \(2\) and \(m-1\), inclusive.

Is the language \(\PRIMES\) in \(\P\)? Let us consider the following simple algorithm to decide the language:

\begin{algorithmic}
\INPUT a positive integer $m \geq 2$
\OUTPUT whether $m$ is prime
\FUNCTION{IsPrime}{$m$}
\FOR{$i = 2$ to $m-1$}
  \IF{$m$ is divisible by $i$} \REJECT \ENDIF
\ENDFOR
\ACCEPT
\ENDFUNCTION
\end{algorithmic}

This algorithm does \(\O(m)\) trial divisions. Is this efficient? The size of the input \(m\) is the number of bits required to represent \(m\), which is \(\O(\log m)\). Thus, the number of operations is \(\O(m) = \O(2^{\log m})\), which is exponential in the size of \(m\).

We can improve the algorithm by observing that if \(m\) has a factor within the interval \([2, m-1]\), it must have a factor between \(2\) and \(\floor{\sqrt{m}}\) (inclusive) – because the product of two or more larger numbers is larger than \(m\). Thus, we need only iterate up to \(\floor{\sqrt{m}}\):

\begin{algorithmic}
\FUNCTION{IsPrime}{$m$}
\FOR{$i = 2$ to $\floor{\sqrt{m}}$}
  \IF{$m$ is divisible by $i$} \REJECT \ENDIF
\ENDFOR
\ACCEPT
\ENDFUNCTION
\end{algorithmic}

This algorithm does \(\O(\sqrt{m})\) trial divisions. However, this still is not polynomial; we have:

\[\O(\sqrt{m}) = \O(m^{1/2}) = \O(2^{(\log m)/2}) \; \text.\]

To be efficient, a primality-testing algorithm must have runtime \(\O(\log^k m)\) for some constant \(k\). Neither of the algorithms above meet this threshold.

In fact, there is a known efficient algorithm for primality testing, the AKS primality test. Thus, it is indeed the case that \(\PRIMES \in \P\). However, this algorithm is somewhat complicated, and its running time is high enough to preclude it from being used in practice. Instead, we consider a randomized primality test that is efficient and works well in practice for most inputs. The algorithm we construct relies on the extended Fermat’s little theorem.

Theorem 254 (Extended Fermat’s Little Theorem)

Let \(n \in \N\) be a natural number such that \(n \ge 2\). Let \(a\) be a witness in the range \(1 \le a \le n - 1\). Then:

  • If \(n\) is prime, then \(a^{n-1} \equiv 1 \pmod{n}\) for any witness \(a\).

  • If \(n\) is composite and \(n\) is not a Carmichael number, then \(a^{n-1} \equiv 1 \pmod{n}\) for at most half the witnesses \(1 \le a \le n - 1\).

We postpone discussion of Carmichael numbers for the moment. Instead, we take a look at some small cases of composite numbers to see that the extended Fermat’s little theorem holds. We first consider \(n = 6\). We have:

\[\begin{split}\begin{array}{lll} a = 1 :& 1^5 ~&\equiv 1 \pmod{6}\\ a = 2 :& 2^5 = 32 = 2 + 5\cdot 6 ~&\equiv 2 \pmod{6}\\ a = 3 :& 3^5 = 243 = 3 + 40\cdot 6 ~&\equiv 3 \pmod{6}\\ a = 4 :& 4^5 = 1024 = 4 + 170\cdot 6 ~&\equiv 4 \pmod{6}\\ a = 5 :& 5^5 = 3125 = 5 + 520\cdot 6 ~&\equiv 5 \pmod{6} \end{array}\end{split}\]

We see that \(a^{n-1} \equiv 1 \pmod{n}\) for only the single witness \(a = 1\). Similarly, we consider \(n = 9\):

\[\begin{split}\begin{array}{lll} a = 1 :& 1^8 ~&\equiv 1 \pmod{9}\\ a = 2 :& 2^8 = 256 = 4 + 28\cdot 9 ~&\equiv 4 \pmod{9}\\ a = 3 :& 3^8 = 6561 = 0 + 729\cdot 9 ~&\equiv 0 \pmod{9}\\ a = 4 :& 4^8 = 65536 = 7 + 7281\cdot 9 ~&\equiv 7 \pmod{9}\\ a = 5 :& 5^8 = 390625 = 7 + 43402\cdot 9 ~&\equiv 7 \pmod{9}\\ a = 6 :& 6^8 = 1679616 = 0 + 186624\cdot 9 ~&\equiv 0 \pmod{9}\\ a = 7 :& 7^8 = 5764801 = 4 + 640533\cdot 9 ~&\equiv 4 \pmod{9}\\ a = 8 :& 8^8 = 16777216 = 1 + 1864135\cdot 9 ~&\equiv 1 \pmod{9} \end{array}\end{split}\]

Here, there are two witnesses \(a\) where \(a^{n-1} \equiv 1 \pmod{n}\), out of eight total. Finally, we consider \(n = 7\):

\[\begin{split}\begin{array}{lll} a = 1 :& 1^6 ~&\equiv 1 \pmod{7}\\ a = 2 :& 2^6 = 64 = 1 + 9\cdot 7 ~&\equiv 1 \pmod{7}\\ a = 3 :& 3^6 = 729 = 1 + 104\cdot 7 ~&\equiv 1 \pmod{7}\\ a = 4 :& 4^6 = 4096 = 1 + 585\cdot 7 ~&\equiv 1 \pmod{7}\\ a = 5 :& 5^6 = 15625 = 1 + 2232\cdot 7 ~&\equiv 1 \pmod{7}\\ a = 6 :& 6^6 = 46656 = 1 + 6665\cdot 7 ~&\equiv 1 \pmod{7}\\ \end{array}\end{split}\]

Since \(7\) is prime, we have \(a^{n-1} \equiv 1 \pmod{n}\) for all witnesses \(a\).

The extended Fermat’s little theorem leads directly to a simple, efficient randomized algorithm for primality testing. This Fermat primality test is as follows:

\begin{algorithmic}
\FUNCTION{FermatTest}{$m$}
\STATE choose uniformly random $a$ such that $1 \leq a \leq m-1$
\IF{$a^{m-1} \equiv 1 \pmod{m}$} \ACCEPT \ENDIF
\STATE \REJECT
\ENDFUNCTION
\end{algorithmic}

The modular exponentiation in this algorithm can be done with \(\O(\log m)\) multiplications using a divide and conquer strategy, and each multiplication can be done efficiently. Thus, this algorithm is polynomial with respect to the size of \(m\).

As for correctness, we have:

  • If \(m\) is prime, then the algorithm always accepts \(m\). In other words:

    \[\Pr[\text{the Fermat test accepts $m$}] = 1\]
  • If \(m\) is composite and not a Carmichael number, then the algorithm rejects \(m\) with probability at least \(\frac{1}{2}\). In other words:

    \[\Pr[\text{the Fermat test rejects $m$}] \ge \frac{1}{2}\]

Thus, if the algorithm accepts \(m\), we can be fairly confident that \(m\) is prime. And as we will see shortly, we can repeat the algorithm to obtain higher confidence that we get the right answer.

We now return to the problem of Carmichael numbers. A Carmichael number is a composite number \(n\) such that \(a^{n-1} \equiv 1 \pmod{n}\) for all witnesses \(a\) that are relatively prime to \(n\), i.e. \(\gcd(a, n) = 1\). This implies that for a Carmichael number, the Fermat test reports with high probability that the number is prime, despite it being composite. We call a number that passes the Fermat test with high probability a pseudoprime, and the Fermat test is technically a randomized algorithm for deciding the following language:

\[\begin{split}\mathrm{PSEUDOPRIMES} = \left\{ \begin{aligned} m \in \N :~ &a^{m-1} \bmod{m} \equiv 1 \text{ for at least half}\\ &\text{the witnesses } 1 \le a \le m - 1 \end{aligned} \right\}\end{split}\]

Carmichael numbers are much rarer than prime numbers, so for many applications, the Fermat test is sufficient to determine with high confidence that a randomly chosen number is prime. On the other hand, if the number is chosen by an adversary, then the Fermat test is unsuitable, and a more complex randomized algorithm such as the Miller-Rabin primality test must be used instead.

The Miller-Rabin Test

The Miller-Rabin test is designed around the fact that for prime \(m\), the only square roots of 1 modulo \(m\) are -1 and 1. More specifically, if \(x\) is a square root of 1 modulo \(m\), we have

\[x^2 \equiv 1 \pmod{m}\]

By definition of modular arithmetic, this means that

\[x^2 - 1 = km\]

for some integer \(k\) (i.e. \(m\) evenly divides the difference of \(x^2\) and 1). We can factor \(x^2-1\) to obtain:

\[(x + 1)(x - 1) = km\]

When \(m\) is composite, so that \(m = pq\) for integers \(p,q > 1\), then it is possible for \(p\) to divide \(x+1\) and \(q\) to divide \(x-1\), in which case \(pq = m\) divides their product \((x+1)(x-1)\). However, when \(m\) is prime, the only way for the equation to hold is if \(m\) divides either \(x+1\) or \(x-1\); otherwise, the prime factorization of \((x+1)(x-1)\) does not contain \(m\), and by the fundamental theorem of arithmetic, it cannot be equal to a number whose prime factorization contains \(m\) [2]. Thus, we have either \(x+1 = am\) for some integer \(a\), or \(x-1 = bm\) for some \(b \in \Z\). The former gives us \(x \equiv -1 \pmod{m}\), while the latter results in \(x \equiv 1 \pmod{m}\)., This, if \(m\) is prime and \(x^2 \equiv 1 \pmod{m}\), then either \(x \equiv 1 \pmod{m}\) or \(x \equiv -1 \pmod{m}\).

The Miller-Rabin test starts with the Fermat test: choose a witness \(1 \le a \le m-1\) and check whether \(a^{m-1} \equiv 1 \pmod{m}\). If this does not hold, then \(m\) fails the Fermat test and therefore is not prime. If it does indeed hold, then the test checks the square root \(a^{\frac{1}{2}(m-1)}\) of \(a^{m-1}\) to see if it is 1 or -1. If it is 1, the test checks the square root \(a^{\frac{1}{4}(m-1)}\) of \(a^{\frac{1}{2}(m-1)}\) to see if it is 1 or -1, and so on. The termination conditions are as follows:

  • The test finds a square root of 1 that is not -1 or 1. By the reasoning above, \(m\) must be composite.

  • The test finds a square root of 1 that is -1. The reasoning above only holds for square roots of 1, so the test cannot continue by computing the square root of -1. In this case, \(m\) may be prime or composite.

  • The test reaches some \(r\) for which \(1/2^r\cdot(m-1)\) is no longer an integer. Then it cannot compute \(a\) to this power modulo \(m\). In this case, \(m\) may be prime or composite.

To compute these square roots, we first extract powers of 2 from the number \(m-1\) (which is even for odd \(m\), the cases of interest):

\begin{algorithmic}
\FUNCTION{ExtractPowersOf2}{$x$}
\IF{$x$ is odd} \RETURN $(x,0)$ \ENDIF
\STATE $(d,s') = $ \CALL{ExtractPowersOf2}{$x/2$}
\RETURN $(d, s'+1)$
\ENDFUNCTION
\end{algorithmic}

So, \(\algo{ExtractPowersOf2}(m-1)\) computes \(d\) and \(s\) such that

\[m-1 = 2^s d\]

where \(\gcd(d, 2) = 1\). Then we have \(a^{2^{s-1}d}\) is the square root of \(a^{2^sd}\), since

\[\large\parens{a^{2^{s-1}d}}^2 = a^{2\cdot 2^{s-1}d} = a^{2^sd}\]

The full Miller-Rabin algorithm is as follows:

\begin{algorithmic}
\FUNCTION{MillerRabin}{$m$}
\STATE choose uniformly random $1 \leq a \leq m-1$
\STATE compute $s,d$ such that $m-1 = 2^s d$ and $d$ is odd
\STATE compute $a^d, a^{2d}, a^{4d}, \ldots, a^{2^{s-1} d}, a^{2^s
d}$
\IF{$a^{2^s d} \not\equiv 1 \pmod{m}$} \REJECT \ENDIF
\FOR{$t=s-1$ down to $0$}
  \IF{$a^{2^t d} \equiv -1 \pmod{m}$} \ACCEPT
  \ELIF{$a^{2^t d} \not\equiv 1 \pmod{m}$} \REJECT
  \ENDIF
\ENDFOR
\STATE \ACCEPT
\ENDFUNCTION
\end{algorithmic}

This algorithm is efficient: \(a^d\) can be computed using \(\O(\log d) = \O(\log m)\) multiplications, and it takes another \(\O(\log m)\) multiplications to compute \(a^{2d}, a^{4d}, \dots, a^{2^sd}\) since \(s = \O(\log m)\). Each multiplication is efficient as it is done modulo \(m\), so the entire algorithm takes polynomial time in the size of \(m\).

As for correctness, we have:

  • If \(m\) is prime, then the algorithm accepts \(m\) with probability 1.

  • If \(m\) is composite, then the algorithm rejects \(m\) with probability at least \(\frac{3}{4}\).

The latter is due to the fact that when \(m\) is composite, \(m\) passes the Miller-Rabin test for at most \(\frac{1}{4}\) of the witnesses \(1 \le a \le m-1\). Unlike the Fermat test, there are no exceptions to this, making the Miller-Rabin test a better choice in many applications.

Exercise 255

Polynomial identity testing is the problem of determining whether two polynomials \(p(x)\) and \(q(x)\) are the same, or equivalently, whether \(p(x) - q(x)\) is the zero polynomial.

  1. Let \(d\) be an upper bound on the degree of \(p(x)\) and \(q(x)\). Show that for a randomly chosen integer \(a \in [1, 4d]\), if \(p(x) \ne q(x)\), then \(\Pr[p(a) \ne q(a)] \ge \frac{3}{4}\).

    Hint: A nonzero polynomial \(r(x)\) of degree at most \(d\) can have at most \(d\) roots \(s_i\) such that \(r(s_i) = 0\).

  2. Devise an efficient, randomized algorithm \(A\) to determine whether \(p(x)\) and \(q(x)\) are the same, with the behavior that:

    • if \(p(x) = q(x)\), then

      \[\Pr[A \text{ accepts } p(x),q(x)] = 1\]
    • if \(p(x) \ne q(x)\), then

      \[\Pr[A \text{ rejects } p(x),q(x)] \ge \frac{3}{4}\]

    Assume that a polynomial can be efficiently evaluated on a single input.

Multiplicative Chernoff Bounds

Like Markov’s inequality, Chernoff bounds are a form of concentration bounds that allow us to reason about the deviation of a random variable from its expectations. There are multiple variants of Chernoff bounds; we restrict ourselves to the following “multiplicative” versions.

Let \(X = X_1 + \dots + X_n\) be the sum of independent indicator random variables, where the indicator \(X_i\) has the expectation

\[\Ex[X_i] = \Pr[X_i = 1] = p_i\]

Let \(\mu\) be the expected value of \(X\):

\[\mu = \Ex[X] = \sum_i \Ex[X_i] = \sum_i p_i\]

Here, we’ve applied linearity of expectation to relate the expectation of \(X\) to that of the indicators \(X_i\). The Chernoff bounds then are as follows.

Theorem 256 (Multiplicative Chernoff Bound – Upper Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\), and \(\mu = \sum_i p_i\). Suppose we wish to bound the probability of \(X\) exceeding its expectation \(\mu\) by at least a factor of \(1 + \delta\), where \(\delta > 0\). Then

\[\Pr[X \ge (1 + \delta)\mu] \le \parens{\frac{e^\delta}{(1+\delta)^{1+\delta}}}^\mu\]
Theorem 257 (Multiplicative Chernoff Bound – Lower Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\), and \(\mu = \sum_i p_i\). Suppose we wish to bound the probability of \(X\) being below its expectation \(\mu\) by at least a factor of \(1 - \delta\), where \(0 < \delta < 1\). Then

\[\Pr[X \le (1 - \delta)\mu] \le \parens{\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}}^\mu\]

These inequalities can be unwieldy to work with, so we often use the following simpler, looser bounds.

Theorem 258 (Multiplicative Chernoff Bound – Simplified Upper Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\), and \(\mu = \sum_i p_i\). Suppose we wish to bound the probability of \(X\) exceeding its expectation \(\mu\) by at least a factor of \(1 + \delta\), where \(\delta > 0\). Then

\[\large \Pr[X \ge (1 + \delta)\mu] \le e^{-\frac{\delta^2\mu}{2+\delta}}\]
Theorem 259 (Multiplicative Chernoff Bound – Simplified Lower Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\), and \(\mu = \sum_i p_i\). Suppose we wish to bound the probability of \(X\) being below its expectation \(\mu\) by at least a factor of \(1 - \delta\), where \(0 < \delta < 1\). Then

\[\large \Pr[X \le (1 - \delta)\mu] \le e^{-\frac{\delta^2\mu}{2}}\]

Before we proceed to make use of these bounds, we first prove that the unsimplified upper-tail and lower-tail bounds hold [3].

Proof 260 (Proof of Upper-tail Chernoff Bound)

To demonstrate the upper-tail Chernoff bound, we make use of the Chernoff bounding technique – rather than reasoning about the random variable \(X\) directly, we instead reason about \(e^{tX}\), since small deviations in \(X\) turn into large deviations in \(e^{tX}\). We have that \(X \ge (1+\delta)\mu\) exactly when \(e^{tX} \ge e^{t(1+\delta)\mu}\) for any \(t \ge 0\); we obtain this by raising \(e^t \ge 1\) to the power of both sides of the former inequality. Then

\[\begin{split}\large \begin{aligned} \Pr[X \ge (1+\delta)\mu] &= \Pr[e^{tX} \ge e^{t(1+\delta)\mu}]\\ &\le \frac{\Ex[e^{tX}]}{e^{t(1+\delta)\mu}} \end{aligned}\end{split}\]

where the latter step follows from Markov’s inequality. Continuing, we have

\[\begin{split}\large \begin{aligned} \frac{\Ex[e^{tX}]}{e^{t(1+\delta)\mu}} &= e^{-t(1+\delta)\mu}\cdot\Ex[e^{tX}]\\ &= e^{-t(1+\delta)\mu}\cdot\Ex[e^{t(X_1 + \dots + X_n)}]\\ &= e^{-t(1+\delta)\mu}\cdot\lrEx{\prod_i e^{tX_i}} \end{aligned}\end{split}\]

We now make use of the fact that the \(X_i\) (and therefore the \(e^{tX_i}\)) are independent. For independent random variables \(Y\) and \(Z\), we have \(\Ex[YZ] = \Ex[Y]\cdot\Ex[Z]\) (see the appendix for a proof). Thus,

\[\large e^{-t(1+\delta)\mu}\cdot\lrEx{\prod_i e^{tX_i}} = e^{-t(1+\delta)\mu}\prod_i \Ex[e^{tX_i}]\]

The \(X_i\) are indicators, with \(\Pr[X_i = 1] = p_i\) and \(\Pr[X_i = 0] = 1 - p_i\). When \(X_i = 1\), \(e^{tX_i} = e^t\), and when \(X_i = 0\), \(e^{tX_i} = e^0 = 1\). Thus, the distribution of \(e^{tX_i}\) is

\[\begin{split}\large \begin{aligned} \Pr[e^{tX_i} = e^t] &= p_i\\ \Pr[e^{tX_i} = 1] &= 1 - p_i \end{aligned}\end{split}\]

We can then compute the expectation of \(e^{tX_i}\):

\[\begin{split}\large \begin{aligned} \Ex[e^{tX_i}] &= e^t\cdot \Pr[e^{tX_i} = e^t] + 1\cdot\Pr[e^{tX_i} = 1]\\ &= e^t p_i + 1 - p_i\\ &= 1 + p_i(e^t - 1) \end{aligned}\end{split}\]

Plugging this into the upper bound that resulted from Markov’s inequality, we get

\[\large e^{-t(1+\delta)\mu}\prod_i \Ex[e^{tX_i}] = e^{-t(1+\delta)\mu}\prod_i 1 + p_i(e^t-1)\]

We can further simplify this expression by observing that \(1+x \le e^x\) for all \(x\):

graph showing that the function 1+x is bounded from above by e^x over all x


Using \(x = p_i(e^t-1)\), we have

\[\large 1 + p_i(e^t-1) \le e^{p_i(e^t-1)}\]

Applying this to our previously computed upper bound, we get

\[\begin{split}\large \begin{aligned} e^{-t(1+\delta)\mu}\prod_i 1 + p_i(e^t-1) &\le e^{-t(1+\delta)\mu}\prod_i e^{p_i(e^t-1)}\\ &= e^{-t(1+\delta)\mu}e^{(e^t-1)\sum_i p_i}\\ &= e^{-t(1+\delta)\mu}e^{(e^t-1)\mu}\\ &= e^{\mu(e^t - 1 - t(1+\delta))} \end{aligned}\end{split}\]

We can now choose \(t\) to minimize the exponent, which in turn minimizes the value of the exponential itself. Let \(f(t) = e^t - 1 - t(1+\delta)\). Then we compute the derivative with respect to \(t\) and set that to zero to find an extremum:

\[\begin{split}f(t) &= e^t - 1 - t(1+\delta)\\ f'(t) &= e^t - (1+\delta) = 0\\ e^t &= 1+\delta\\ t &= \ln(1+\delta)\end{split}\]

Computing the second derivative \(f''(\ln(1+\delta)) = e^{\ln(1+\delta)} = 1+\delta\), we see that it is positive since \(\delta > 0\), therefore \(t = \ln(1+\delta)\) is a minimum. Substituting it into our earlier expression, we obtain

\[\begin{split}\large \begin{aligned} e^{\mu(e^t - 1 - t(1+\delta))} &\le e^{\mu(e^{\ln(1+\delta)} - 1 - (1+\delta)\ln(1+\delta))}\\ &= e^{\mu((1+\delta) - 1 - (1+\delta)\ln(1+\delta))}\\ &= e^{\mu(\delta - (1+\delta)\ln(1+\delta))}\\ &= \parens{e^{\delta - (1+\delta)\ln(1+\delta)}}^\mu\\ &= \parens{\frac{e^{\delta}}{e^{(1+\delta)\ln(1+\delta)}}}^\mu\\ &= \parens{\frac{e^{\delta}}{(e^{\ln(1+\delta)})^{1+\delta}}}^\mu\\ &= \parens{\frac{e^{\delta}}{(1+\delta)^{1+\delta}}}^\mu \end{aligned}\end{split}\]
Proof 261 (Proof of Lower-tail Chernoff Bound)

The proof of the lower-tail Chernoff bound follows similar reasoning as that of the upper-tail bound. We have:

\[\begin{split}\large \begin{aligned} \Pr[X \le (1-\delta)\mu] &= \Pr[-X \ge -(1-\delta)\mu]\\ &= \Pr[e^{-tX} \ge e^{-t(1-\delta)\mu}]\\ &\le \frac{\Ex[e^{-tX}]}{e^{-t(1-\delta)\mu}} \end{aligned}\end{split}\]

Here, we can apply Markov’s inequality to the random variable \(e^{-tX}\) since it is nonnegative, unlike \(-X\). Continuing, we have

\[\begin{split}\large \begin{aligned} \frac{\Ex[e^{-tX}]}{e^{-t(1-\delta)\mu}} &= e^{t(1-\delta)\mu}\cdot\Ex[e^{-tX}]\\ &= e^{t(1-\delta)\mu}\cdot\Ex[e^{-t(X_1 + \dots + X_n)}]\\ &= e^{t(1-\delta)\mu}\cdot\lrEx{\prod_i e^{-tX_i}}\\ &= e^{t(1-\delta)\mu}\prod_i \Ex[e^{-tX_i}] \end{aligned}\end{split}\]

In the last step, we make use of the fact that the \(X_i\) and therefore the \(e^{-tX_i}\) are independent. The distribution of \(e^{-tX_i}\) is

\[\begin{split}\large \begin{aligned} \Pr[e^{-tX_i} = e^{-t}] &= p_i\\ \Pr[e^{-tX_i} = 1] &= 1 - p_i \end{aligned}\end{split}\]

Then the expectation of \(e^{-tX_i}\) is:

\[\begin{split}\large \begin{aligned} \Ex[e^{-tX_i}] &= e^{-t}\cdot \Pr[e^{-tX_i} = e^{-t}] + 1\cdot\Pr[e^{-tX_i} = 1]\\ &= e^{-t} p_i + 1 - p_i\\ &= 1 + p_i(e^{-t} - 1) \end{aligned}\end{split}\]

Plugging this into the bound we’ve computed so far, we get

\[\large e^{t(1-\delta)\mu}\prod_i \Ex[e^{-tX_i}] = e^{t(1-\delta)\mu}\prod_i 1 + p_i(e^{-t}-1)\]

As with the upper-tail, we use the fact that \(1+x \le e^x\) for all \(x\) to simplify this, with \(x = p_i(e^{-t}-1)\):

\[\begin{split}\large \begin{aligned} e^{t(1-\delta)\mu}\prod_i 1 + p_i(e^{-t}-1) &\le e^{t(1-\delta)\mu}\prod_i e^{p_i(e^{-t}-1)}\\ &= e^{t(1-\delta)\mu}e^{(e^{-t}-1)\sum_i p_i}\\ &= e^{t(1-\delta)\mu}e^{(e^{-t}-1)\mu}\\ &= e^{\mu(e^{-t} - 1 + t(1-\delta))} \end{aligned}\end{split}\]

We choose \(t\) to minimize the exponent. Let \(f(t) = e^{-t} - 1 + t(1-\delta)\). Then we compute the derivative with respect to \(t\) and set that to zero to find an extremum:

\[\begin{split}f(t) &= e^{-t} - 1 + t(1-\delta)\\ f'(t) &= -e^{-t} + (1-\delta) = 0\\ e^{-t} &= 1-\delta\\ t &= -\ln(1-\delta)\end{split}\]

Computing the second derivative \(f''(-\ln(1-\delta)) = e^{-(-\ln(1-\delta))} = 1-\delta\), we see that it is positive since \(0 < \delta < 1\), therefore \(t = \ln(1-\delta)\) is a minimum. Substituting it into our earlier expression, we obtain

\[\begin{split}\large \begin{aligned} e^{\mu(e^{-t} - 1 + t(1-\delta))} &\le e^{\mu(e^{\ln(1-\delta)} - 1 - (1-\delta)\ln(1-\delta))}\\ &= e^{\mu((1-\delta) - 1 - (1-\delta)\ln(1-\delta))}\\ &= e^{\mu(-\delta - (1-\delta)\ln(1-\delta))}\\ &= \parens{e^{-\delta - (1-\delta)\ln(1-\delta)}}^\mu\\ &= \parens{\frac{e^{-\delta}}{e^{(1-\delta)\ln(1-\delta)}}}^\mu\\ &= \parens{\frac{e^{-\delta}}{(e^{\ln(1-\delta)})^{1-\delta}}}^\mu\\ &= \parens{\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}}^\mu \end{aligned}\end{split}\]

In lieu of applying Chernoff bounds to the algorithm for estimating \(\pi\), we consider a different example of flipping a coin. If we flip a biased coin with probability \(p\) of heads, we expect to see \(np\) heads. Let \(H\) be the total number of heads, and let \(H_i\) be an indicator variable corresponding to whether the \(i\)th flip is heads. We have \(\Pr[H_i = 1] = p\), and \(\Ex[H] = np\) by linearity of expectation.

Suppose the coin is fair. What is the probability of getting at least six heads out of ten flips? This is a fractional deviation of \(\delta = 6/5 - 1 = 0.2\), and applying the upper-tail Chernoff bound gives us:

\[\begin{split}\Pr[H \ge (1+0.2)\cdot 5] &\le \parens{\frac{e^{0.2}}{1.2^{1.2}}}^5\\ &\approx 0.9814^5\\ &\approx 0.91\end{split}\]

What is the probability of getting at least 60 heads out of 100 flips, which is the same fractional deviation \(\delta = 0.2\) from the expectation? Applying the upper tail again, we get

\[\begin{split}\Pr[H \ge (1+0.2)\cdot 50] &\le \parens{\frac{e^{0.2}}{1.2^{1.2}}}^{50}\\ &\approx 0.9814^{50}\\ &\approx 0.39\end{split}\]

Thus, the probability of deviating by a factor of \(1+\delta\) decreases significantly as the number of samples increases. For this particular example, we can compute the actual probabilities quite tediously from exact formulas for a binomial distribution, which yields 0.37 for getting six heads out of ten flips and 0.03 for 60 heads out of 100 flips. However, this approach becomes more and more expensive as \(n\) increases, and the Chernoff bound produces a reasonable result with much less work.

Polling Analysis with Chernoff Bounds

Recall that in polling, a confidence level \(1-\gamma\) and margin of error \(\varepsilon\) requires

\[\lrPr{\abs*{\frac{X}{n} - p} \le \varepsilon} \ge 1-\gamma\]

or equivalently

\[\lrPr{\abs*{\frac{X}{n} - p} > \varepsilon} < \gamma\]

Formally, we define indicator variables \(X_i\) as

\[\begin{split}X_i = \begin{cases} 1 &\mbox{if person $i$ supports the candidate}\\ 0 &\mbox{otherwise} \end{cases}\end{split}\]

for each person \(i\) in the set that we poll. Then \(X = X_1 + \dots + X_n\) is the sum of independent indicator variables, with

\[\mu = \Ex[X] = \sum_i \Ex[X_i] = np\]

For an arbitrary margin of error \(\varepsilon\), we want to bound the probability

\[\begin{split}\lrPr{\abs*{\frac{X}{n} - p} > \varepsilon} &= \lrPr{\frac{X}{n} - p > \varepsilon} + \lrPr{\frac{X}{n} - p < -\varepsilon}\\ &= \lrPr{\frac{X}{n} > \varepsilon+p} + \lrPr{\frac{X}{n} < -\varepsilon+p}\\ &= \Pr[X > \varepsilon n+pn] + \Pr[X < -\varepsilon n+pn]\\ &= \Pr[X > \varepsilon n+\mu] + \Pr[X < -\varepsilon n+\mu]\\ &= \Pr[X > (1+\varepsilon n/\mu)\mu] + \Pr[X < (1-\varepsilon n/\mu)\mu]\\ &= \Pr[X \ge (1+\varepsilon n/\mu)\mu] + \Pr[X \le (1-\varepsilon n/\mu)\mu]\\ &~~~~~ - \Pr[X = (1+\varepsilon n/\mu)\mu] - \Pr[X = (1-\varepsilon n/\mu)\mu]\\ &\le \Pr[X \ge (1+\varepsilon n/\mu)\mu] + \Pr[X \le (1-\varepsilon n/\mu)\mu]\end{split}\]

In the second-to-last step, we use the fact that \(X = x\) and \(X > x\) are disjoint events, since a random variable maps each sample point to a single value, and that \((X \ge x) = (X = x) \cup (X > x)\).

We now have events that are in the right form to apply a Chernoff bound, with \(\delta = \varepsilon n/\mu\). We first apply the simplified upper-tail bound to the first term, obtaining

\[\begin{split}\large \begin{aligned} \Pr[X \ge (1+\varepsilon n/\mu)\mu] &\le e^{-\frac{\delta^2\mu}{2+\delta}} ~~~~\text{where $\delta=\varepsilon n/\mu$}\\ &= e^{-\frac{(\varepsilon n)^2\mu}{\mu^2(2+\varepsilon n/\mu)}}\\ &= e^{-\frac{(\varepsilon n)^2}{\mu(2+\varepsilon n/\mu)}}\\ &= e^{-\frac{(\varepsilon n)^2}{2\mu+\varepsilon n}}\\ &= e^{-\frac{(\varepsilon n)^2}{2np+\varepsilon n}}\\ &= e^{-\frac{\varepsilon^2 n}{2p+\varepsilon}} \end{aligned}\end{split}\]

We want this to be less than some value \(\beta\):

\[\begin{split}\large \begin{aligned} e^{-\frac{\varepsilon^2 n}{2p+\varepsilon}} &< \beta\\ \frac{1}{\beta} &< e^{\frac{\varepsilon^2 n}{2p+\varepsilon}}\\ \ln\parens{\frac{1}{\beta}} &< \frac{\varepsilon^2 n}{2p+\varepsilon}\\ n &> \frac{2p+\varepsilon}{\varepsilon^2}\ln\parens{\frac{1}{\beta}} \end{aligned}\end{split}\]

Unfortunately, this expression includes \(p\), the quantity we’re trying to estimate However, we know that \(p \le 1\), so we can set

\[n > \frac{2+\varepsilon}{\varepsilon^2}\ln\parens{\frac{1}{\beta}}\]

to ensure that \(\Pr[X \ge (1+\varepsilon n/\mu)\mu] < \beta\).

We can also apply the simplified lower-tail bound to the second term in our full expression above:

\[\begin{split}\large \begin{aligned} \Pr[X \le (1-\varepsilon n/\mu)\mu] &\le e^{-\frac{\delta^2\mu}{2}} ~~~~\text{where $\delta=\varepsilon n/\mu$}\\ &= e^{-\frac{(\varepsilon n)^2\mu}{2\mu^2}}\\ &= e^{-\frac{(\varepsilon n)^2}{2\mu}}\\ &= e^{-\frac{(\varepsilon n)^2}{2np}}\\ &= e^{-\frac{\varepsilon^2 n}{2p}} \end{aligned}\end{split}\]

As before, we want this to be less some value \(\beta\):

\[\begin{split}\large \begin{aligned} e^{-\frac{\varepsilon^2 n}{2p)}} &< \beta\\ \frac{1}{\beta} &< e^{\frac{\varepsilon^2 n}{2p}}\\ \ln\parens{\frac{1}{\beta}} &< \frac{\varepsilon^2 n}{2p}\\ n &> \frac{2p}{\varepsilon^2}\ln\parens{\frac{1}{\beta}} \end{aligned}\end{split}\]

We again observe that \(p \le 1\), so we can set

\[n > \frac{2}{\varepsilon^2}\ln\parens{\frac{1}{\beta}}\]

to ensure that \(\Pr[X \le (1-\varepsilon n/\mu)\mu] < \beta\).

To bound both terms simultaneously, we can choose \(\beta=\gamma/2\), so that

\[\Pr[X \ge (1+\varepsilon n/\mu)\mu] + \Pr[X \le (1-\varepsilon n/\mu)\mu] < 2\beta = \gamma\]

To achieve this, we require

\[\begin{split}n &> \max\parens{ \frac{2+\varepsilon}{\varepsilon^2}\ln\parens{\frac{2}{\gamma}}, \frac{2}{\varepsilon^2}\ln\parens{\frac{2}{\gamma}}}\\ &= \frac{2+\varepsilon}{\varepsilon^2}\ln\parens{\frac{2}{\gamma}}\end{split}\]

For example, for a 95% confidence level and a margin of error of ±2%, we have \(\varepsilon=0.02\) and \(\gamma=0.05\). Plugging those values into the result above, we need no more than

\[\frac{2+\varepsilon}{\varepsilon^2}\ln\parens{\frac{2}{\gamma}} = \frac{2.02}{0.02^2}\ln 40 \approx 18623\]

samples to achieve the desired confidence level and margin of error. Observe that this does not depend on the total population size!

Probabilistic Complexity Classes

The Fermat primality test is an example of a one-sided-error randomized algorithm – an input that is pseudoprime is always accepted, while a non-pseudoprime is sometimes accepted and sometimes rejected. We can define complexity classes corresponding to decision problems that have efficient one-sided-error randomized algorithms.

Definition 262 (RP)

\(\RP\) is the class of languages that have efficient one-sided-error randomized algorithms with no false positives. A language \(L\) is in \(\RP\) if there is an efficient randomized algorithm \(A\) such that:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge c\)

  • if \(x \notin L\), \(\Pr[A \text{ rejects } x] = 1\)

Here, \(c\) must be a constant greater than 0. Often, \(c\) is chosen to be \(\frac{1}{2}\).

Definition 263 (coRP)

\(\coRP\) is the class of languages that have efficient one-sided-error randomized algorithms with no false negatives. A language \(L\) is in \(\coRP\) if there is an efficient randomized algorithm \(A\) such that:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] = 1\)

  • if \(x \notin L\), \(\Pr[A \text{ rejects } x] \ge c\)

Here, \(c\) must be a constant greater than 0. Often, \(c\) is chosen to be \(\frac{1}{2}\).

\(\RP\) stands for randomized polynomial time. If a language \(L\) is in \(\RP\), then its complement language \(\overline{L}\) is in \(\coRP\) – an algorithm for \(L\) with no false positives can be converted into an algorithm for \(\overline{L}\) with no false negatives. The Fermat test produces no false negatives, so \(\mathrm{PSEUDOPRIMES} \in \coRP\). Thus, the language

\[\begin{split}\mathrm{PSEUDOPRIMES} = \left\{ \begin{aligned} m \in \N :~ &a^{m-1} \bmod{m} \equiv 1 \text{ for at least half}\\ &\text{the witnesses } 1 \le a \le m - 1 \end{aligned} \right\}\end{split}\]

is in \(\RP\).

The constant \(c\) in the definition of \(\RP\) and \(\coRP\) is arbitrary. With any \(c > 0\), we can amplify the probability of success by repeatedly running the algorithm. For instance, if we have a randomized algorithm \(A\) with no false positives for a language \(L\), we have:

\[\begin{split}x \in L &\implies \Pr[A \text{ accepts } x] \ge c\\ x \notin L &\implies \Pr[A \text{ accepts } x] = 0\end{split}\]

We can construct a new algorithm \(B\) as follows:

\begin{algorithmic}
\FUNCTION{$B$}{$x$}
\STATE run $A(x)$ twice (with fresh randomness each time)
\IF{it accepts at least once} \ACCEPT \ENDIF
\REJECT
\ENDFUNCTION
\end{algorithmic}

\(B\) just runs \(A\) twice on the input, accepting if at least one run of \(A\) accepts; \(B\) only rejects if both runs of \(A\) reject. Thus, the probability that \(B\) rejects \(x\) is:

\[\begin{split}\Pr[B \text{ rejects } x] &= \Pr[A \text{ rejects $x$ in run 1}, A \text{ rejects $x$ in run 2}]\\ &= \Pr[A \text{ rejects $x$ in run 1}] \cdot \Pr[A \text{ rejects $x$ in run 2}]\\ &= \Pr[A \text{ rejects } x]^2\end{split}\]

Here, we have used the fact that the two runs of \(A\) are independent, so the probability of \(A\) rejecting twice is the product of the probabilities it rejects each time. This gives us:

\[\begin{split}x \in L &\implies \Pr[B \text{ rejects } x] \le (1 - c)^2\\ x \notin L &\implies \Pr[B \text{ rejects } x] = 1\end{split}\]

or equivalently:

\[\begin{split}x \in L &\implies \Pr[B \text{ accepts } x] \ge 1 - (1 - c)^2\\ x \notin L &\implies \Pr[B \text{ accepts } x] = 0\end{split}\]

Repeating this reasoning, if we modify \(B\) to run \(A\) \(k\) times, we get:

\[\begin{split}x \in L &\implies \Pr[B \text{ accepts } x] \ge 1 - (1 - c)^k\\ x \notin L &\implies \Pr[B \text{ accepts } x] = 0\end{split}\]

By applying a form of Bernoulli’s inequality, \((1 + x)^r \le e^{rx}\), we get:

\[x \in L \implies \Pr[B \text{ accepts } x] \ge 1 - e^{-ck}\]

If we want a particular lower bound \(1 - \varepsilon\) for acceptance of true positives, we need

\[\begin{split}e^{-ck} &\le \varepsilon\\ -ck &\le \ln \varepsilon\\ k &\ge \frac{-\ln \varepsilon}{c} = \frac{\ln(1/\varepsilon)}{c}\end{split}\]

This is a constant for any constants \(c\) and \(\varepsilon\). Thus, we can amplify the probability of accepting a true positive from \(c\) to \(1 - \varepsilon\) for any \(\varepsilon\) with a constant \(\ln(1/\varepsilon)/c\) number of repetitions. The same reasoning can be applied to amplify one-sided-error algorithms with no false negatives.

One-sided-error randomized algorithms are a special case of two-sided-error randomized algorithms. Such an algorithm for a language \(L\) can produce the wrong result for an input \(x\) whether or not \(x \in L\). However, the probability of getting the wrong result is bounded by a constant.

Definition 264 (BPP)

\(\BPP\) is the class of languages that have efficient two-sided-error randomized algorithms. A language \(L\) is in \(\BPP\) if there is an efficient randomized algorithm \(A\) such that:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge c\)

  • if \(x \notin L\), \(\Pr[A \text{ rejects } x] \ge c\)

Here, \(c\) must be a constant greater than \(\frac{1}{2}\), so that the algorithm produces the correct answer the majority of the time. Often, \(c\) is chosen to be \(\frac{2}{3}\) or \(\frac{3}{4}\).

\(\BPP\) stands for bounded-error probabilistic polynomial time. Given the symmetric definition of a two-sided error algorithm, it is clear that the class \(\BPP\) is closed under complement – if a language \(L \in \BPP\), then \(\overline{L} \in \BPP\) as well.

Languages in \(\RP\) and in \(\coRP\) both trivially satisfy the conditions for \(\BPP\); for instance, we have the following for \(\RP\):

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge c\)

  • if \(x \notin L\), \(\Pr[A \text{ rejects } x] = 1 \ge c\)

Thus, \(\RP \cup \coRP \subseteq \BPP\). By similar reasoning, we can relate \(\P\) to these classes as follows:

\begin{gather*} \P \subseteq \RP \subseteq \BPP\\ \P \subseteq \coRP \subseteq \BPP \end{gather*}

As with one-sided-error algorithms, the probability of success for two-sided-error randomized algorithms can be amplified arbitrarily, as we will see shortly.

One-sided-error and two-sided-error randomized algorithms are known as Monte Carlo algorithms. Such algorithms have a bounded runtime, but they may produce the wrong answer. Contrast this with Las Vegas algorithms, which always produce the correct answer, but where the runtime bound only holds in expectation. We can define a complexity class for languages that have Las Vegas algorithms with expected runtime that is polynomial in the size of the input.

Definition 265 (ZPP)

\(\ZPP\) is the class of languages that have efficient Las Vegas algorithms. A language \(L\) is in \(\ZPP\) if there is a randomized algorithm \(A\) such that:

  • If \(x \in L\), \(A\) always accepts \(x\).

  • If \(x \notin L\), \(A\) always rejects \(x\).

  • The expected runtime of \(A\) on input \(x\) is \(\O(\abs{x}^k)\) for some constant \(k\).

There is a relationship between Monte Carlo and Las Vegas algorithms: if a language \(L\) has both a one-sided-error algorithm with no false positives and a one-sided-error algorithm with no false negatives, then it has a Las Vegas algorithm, and vice versa. In terms of complexity classes, we have \(L\) is in both \(\RP\) and \(\coRP\) if and only if \(L\) is in \(\ZPP\). This implies that

\[\RP \cap \coRP = \ZPP\]

We demonstrate this as follows. Suppose \(L\) is in \(\RP \cap \coRP\). Then it has an efficient, one-sided-error algorithm \(A\) with no false positives such that:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge \frac{1}{2}\)

  • if \(x \notin L\), \(\Pr[A \text{ accepts } x] = 0\)

Here, we have selected \(c = \frac{1}{2}\) for concreteness. \(L\) also has an efficient, one-sided algorithm \(B\) with no false negatives such that:

  • if \(x \in L\), \(\Pr[B \text{ rejects } x] = 0\)

  • if \(x \notin L\), \(\Pr[B \text{ rejects } x] \ge \frac{1}{2}\)

We can construct a new algorithm \(C\) as follows:

\begin{algorithmic}
\FUNCTION{$C$}{$x$}
\WHILE{true}
  \IF{$A(x)$ accepts} \ACCEPT \ENDIF
  \IF{$B(x)$ rejects} \REJECT \ENDIF
\ENDWHILE
\ENDFUNCTION
\end{algorithmic}

Since \(A\) only accepts \(x \in L\) and \(C\) only accepts when \(A\) does, \(C\) only accepts \(x \in L\). Similarly, \(B\) only rejects \(x \notin L\), so \(C\) only rejects \(x \notin L\). Thus, \(C\) always produces the correct answer for a given input.

To show that \(L \in \ZPP\), we must also demonstrate that the expected runtime of \(C\) is polynomial. For each iteration \(i\) of \(C\), we have:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x \text{ in iteration } i] \ge \frac{1}{2}\)

  • if \(x \notin L\), \(\Pr[B \text{ rejects } x \text{ in iteration } i] \ge \frac{1}{2}\)

Thus, if \(C\) gets to iteration \(i\), the probability that \(C\) terminates in that iteration is at least \(\frac{1}{2}\). We can model the number of iterations as a random variable \(X\), and \(X\) has the probability distribution:

\[\Pr[X > k] \le \parens{\frac{1}{2}}^k\]

This is similar to a geometric distribution, where

\[\Pr[Y > k] = (1 - p)^k\]

for some \(p \in (0, 1]\). The expected value of such a distribution is \(\Ex[Y] = 1/p\). This gives us:

\[\Ex[X] \le 2\]

Thus, the expected number of iterations of \(C\) is no more than two, and since \(A\) and \(B\) are both efficient, calling them twice is also efficient.

We have shown that \(\RP \cap \coRP \subseteq \ZPP\). We will proceed to show that \(\ZPP \subseteq \RP\), meaning that if a language has an efficient Las Vegas algorithm, it also has an efficient one-sided-error algorithm with no false positives. Assume that \(C\) is a Las Vegas algorithm for \(L \in \ZPP\), with expected runtime of no more than \(\abs{x}^k\) steps for an input \(x\) and some constant \(k\). We construct a new algorithm \(A\) as follows:

\begin{algorithmic}
\FUNCTION{$A$}{$x$}
\SIMULATE $C(x)$ for up to $2 \abs{x}^k$ steps (or until it halts)
\IF{it accepts} \ACCEPT \ENDIF
\STATE \REJECT
\ENDFUNCTION
\end{algorithmic}

\(A\) is clearly efficient in the size of the input \(x\). It also has no false positives, since it only accepts if \(C\) accepts within the first \(2\abs{x}^k\) steps, and \(C\) always produces the correct answer. All that is left is to show that if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge \frac{1}{2}\) (again, we arbitrarily choose \(c = \frac{1}{2}\)).

Let \(S\) be the number of steps before \(C\) accepts \(x\). By assumption, we have

\[\Ex[S] \le \abs{x}^k\]

\(A\) runs \(C\) for \(2\abs{x}^k\) steps, so we need to bound the probability that \(C\) takes more than \(2\abs{x}^k\) steps on \(x\). We have:

\[\begin{split}\Pr[S > 2\abs{x}^k] &\le \Pr[S \ge 2\abs{x}^k]\\ &\le \frac{\Ex[S]}{2\abs{x}^k}\\ &\le \frac{\abs{x}^k}{2\abs{x}^k}\\ &= \frac{1}{2}\end{split}\]

Here, we applied Markov’s inequality in the second step. Then:

\[\begin{split}\Pr[A \text{ accepts } x] &= \Pr[S \le 2\abs{x}^k]\\ &= 1 - \Pr[S > 2\abs{x}^k]\\ &\ge \frac{1}{2}\end{split}\]

Thus, \(A\) is indeed a one-sided-error algorithm with no false positives, so \(L \in \RP\). A similar construction can be used to demonstrate that \(L \in \coRP\), allowing us to conclude that \(\ZPP \subseteq \RP \cap \coRP\). Combined with our previous proof that \(\RP \cap \coRP \subseteq \ZPP\), we have \(\RP \cap \coRP = \ZPP\).

One final observation we will make is that if a language \(L\) is in \(\RP\), then it is also in \(\NP\). Since \(L \in \RP\), there is an efficient, one-sided-error randomized algorithm \(A\) to decide \(L\). \(A\) is allowed to make random choices, and we can model each choice as a coin flip, i.e. being either 0 or 1 according to some probability distribution. We can represent the combination of these choices in a particular run of \(A\) as a binary string \(c\). This enables us to write an efficient verifier \(V\) for \(L\) as follows:

\begin{algorithmic}
\FUNCTION{$V$}{$x, c = c_1 c_2 \cdots c_m$}
\SIMULATE $A(x)$, using $c_i$ for the $i$th random bit used by $A$
\IF{$A$ accepts} \ACCEPT \ENDIF
\STATE \REJECT
\ENDFUNCTION
\end{algorithmic}

If \(x \in L\), then \(\Pr[A \text{ accepts } x] \ge \frac{1}{2}\), so at least half the possible sequences of random choices lead to \(A\) accepting \(x\). On the other hand, if \(x \notin L\), then \(\Pr[A \text{ rejects } x] = 1\), so all sequences of random choices lead to \(A\) rejecting \(x\). Thus, \(V\) accepts at least half of all possible certificates when \(x \in L\), and \(V\) rejects all certificates when \(x \notin L\), so \(V\) is a verifier for \(L\). Since \(A\) is efficient, \(V\) is also efficient.

We summarize the known relationships between complexity classes as follows:

\begin{gather*} \P \subseteq \ZPP \subseteq \RP \subseteq \NP\\ \P \subseteq \ZPP \subseteq \RP \subseteq \BPP\\ \P \subseteq \ZPP \subseteq \coRP \subseteq \coNP\\ \P \subseteq \ZPP \subseteq \coRP \subseteq \BPP \end{gather*}

These relationships are represented pictorially, with edges signifying containment, as follows:

complexity-class hierarchy, with arrows from P to ZPP, from ZPP to RP and coRP, from RP to NP and BPP, and from coRP to coNP and BPP

Here, \(\coNP\) is the class of languages whose complements are in \(\NP\):

\[\coNP = \{L : \overline{L} \in \NP\}\]

We do not know if any of the containments above are strict, and we do not know the relationships between \(\NP\), \(\coNP\), and \(\BPP\). It is commonly believed that \(\P\) and \(\BPP\) are equal and thus \(\BPP\) is contained in \(\NP \cap \coNP\), that neither \(\P\) nor \(\BPP\) contain all of \(\NP\) or all of \(\coNP\), and that \(\NP\) and \(\coNP\) are not equal. However, none of these conjectures has been proven [4].

Amplification for Two-Sided-Error Algorithms

Previously, we saw how to amplify the probability of success for one-sided-error randomized algorithms. We now consider amplification for two-sided-error algorithms. Recall that such an algorithm \(A\) for a language \(L\) has the following behavior:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge c\)

  • if \(x \notin L\), \(\Pr[A \text{ rejects } x] \ge c\)

Here, \(c\) must be a constant that is strictly greater than \(\frac{1}{2}\).

Unlike in the one-sided case with no false positives, we cannot just run the algorithm multiple times and observe if it accepts at least once. A two-sided-error algorithm can accept both inputs in and not in the language, and it can reject both such inputs as well. However, we observe that because \(c > \frac{1}{2}\), we expect to get the right answer the majority of the time when we run a two-sided-error algorithm on an input. More formally, suppose we run the algorithm \(n\) times. Let \(Y_i\) be an indicator random variable that is 1 if the algorithm accepts in the \(i\)th run. Then:

  • if \(x \in L\), \(\Ex[Y_i] = \Pr[Y_i = 1] \ge c\)

  • if \(x \notin L\), \(\Ex[Y_i] = \Pr[Y_i = 1] \le 1 - c\)

Let \(Y = Y_1 + \dots + Y_n\) be the total number of accepts out of \(n\) trials. By linearity of expectation, we have:

  • if \(x \in L\), \(\Ex[Y] \ge cn > \frac{n}{2}~~~\) (since \(c > \frac{1}{2}\))

  • if \(x \notin L\), \(\Ex[Y] \le (1 - c)n < \frac{n}{2}~~~\) (since \(1 - c < \frac{1}{2}\))

This motivates an amplification algorithm \(B\) that runs the original algorithm \(A\) multiple times and takes the majority vote:

\begin{algorithmic}
\FUNCTION{$B$}{$x$}
\STATE run $A(x)$ $n$ times (with fresh randomness each time)
\IF{it accepts at least $n/2$ times} \ACCEPT \ENDIF
\REJECT
\ENDFUNCTION
\end{algorithmic}

Suppose we wish to obtain a bound that is within \(\gamma\) of 1:

  • if \(x \in L\), \(\Pr[B \text{ accepts } x] \ge 1 - \gamma\)

  • if \(x \notin L\), \(\Pr[B \text{ rejects } x] \ge 1 - \gamma\)

What value for \(n\) should we use in \(B\)?

We first consider the case where \(x \in L\). The indicators \(Y_i\) are independent, allowing us to use Hoeffding’s inequality on their sum \(Y\). \(B\) accepts \(x\) when \(Y \ge \frac{n}{2}\), or equivalently, \(\frac{Y}{n} \ge \frac{1}{2}\). We want

\[\lrPr{\frac{Y}{n} \ge \frac{1}{2}} \ge 1 - \gamma\]

or equivalently,

\[\begin{split}\lrPr{\frac{Y}{n} < \frac{1}{2}} &\le \lrPr{\frac{Y}{n} \le \frac{1}{2}}\\ &= \lrPr{\frac{Y}{n} \le c - \parens{c - \frac{1}{2}}}\\ &= \lrPr{\frac{Y}{n} \le c - \varepsilon}\\ &\le \gamma\end{split}\]

where \(\varepsilon = c - \frac{1}{2}\). To apply Hoeffding’s inequality, we need something of the form

\[\lrPr{\frac{Y}{n} \le p - \varepsilon}\]

where \(p = \lrEx{\frac{Y}{n}}\). Unfortunately, we do not know the exact value of \(p\); all we know is that \(p = \lrEx{\frac{Y}{n}} \ge c\). However, we know that because \(p \ge c\),

\[\lrPr{\frac{Y}{n} \le c - \varepsilon} \le \lrPr{\frac{Y}{n} \le p - \varepsilon}\]

In general, the event \(X \le a\) includes at most as many sample points as \(X \le b\) when \(a \le b\); the latter includes all the outcomes in \(X \le a\), as well as those in \(a < X \le b\). We thus need only compute an upper bound on \(\lrPr{\frac{Y}{n} \le p - \varepsilon}\), and that same upper bound will apply to \(\lrPr{\frac{Y}{n} \le c - \varepsilon}\). Taking \(\varepsilon = c - \frac{1}{2}\) and applying the lower-tail Hoeffding’s inequality, we get:

\[\begin{split}\lrPr{\frac{Y}{n} \le p - \varepsilon} &= \lrPr{\frac{Y}{n} \le p - \parens{c - \frac{1}{2}}}\\ &\le e^{-2(c-1/2)^2 n}\end{split}\]

We want this to be bound by \(\gamma\):

\[\begin{split}e^{-2(c-1/2)^2 n} &\le \gamma\\ e^{2(c-1/2)^2 n} &\ge \frac{1}{\gamma}\\ 2(c - 1/2)^2 n &\ge \ln\parens{\frac{1}{\gamma}}\\ n &\ge \frac{1}{2(c - 1/2)^2} \ln\parens{\frac{1}{\gamma}}\end{split}\]

As a concrete example, suppose \(c = \frac{3}{4}\), meaning that \(A\) accepts \(x \in L\) with probability at least \(\frac{3}{4}\). Suppose we want \(B\) to accept \(x \in L\) at least 99% of the time, giving us \(\gamma = 0.01\). Then:

\[\begin{split}n &\ge \frac{1}{2(3/4 - 1/2)^2} \ln\parens{\frac{1}{0.01}}\\ &= \frac{1}{2\cdot 0.25^2} \ln 100\\ &\approx 36.8\end{split}\]

Thus, it is sufficient for \(B\) to run \(n \ge 37\) trials of \(A\) on \(x\).

We now consider \(x \notin L\). We want \(B\) to reject \(x\) with probability at least \(1 - \gamma\), or equivalently, \(B\) to accept \(x\) with probability at most \(\gamma\):

\[\lrPr{\frac{Y}{n} \ge \frac{1}{2}} \le \gamma\]

Similar to before, we have

\[\begin{split}\lrPr{\frac{Y}{n} \ge \frac{1}{2}} &= \lrPr{\frac{Y}{n} \ge (1 - c) + \parens{c - \frac{1}{2}}}\\ &= \lrPr{\frac{Y}{n} \ge (1 - c) + \varepsilon}\end{split}\]

with \(\varepsilon = c - \frac{1}{2}\). Since \(p = \lrEx{\frac{Y}{n}} \le (1 - c)\), we have

\[\lrPr{\frac{Y}{n} \ge (1 - c) + \varepsilon} \le \lrPr{\frac{Y}{n} \ge p + \varepsilon}\]

This follows from similar reasoning as earlier: an event \(X \ge a\) contains no more sample points than \(X \ge b\) when \(a \ge b\). With \(\varepsilon = c - \frac{1}{2}\), the upper-tail Hoeffding’s inequality gives us:

\[\begin{split}\lrPr{\frac{Y}{n} \ge p + \varepsilon} &= \lrPr{\frac{Y}{n} \ge p + \parens{c - \frac{1}{2}}}\\ &\le e^{-2(c-1/2)^2 n}\end{split}\]

We want this to be no more than \(\gamma\), which leads to the same solution as before:

\[n \ge \frac{1}{2(c - 1/2)^2} \ln\parens{\frac{1}{\gamma}}\]

Using the same concrete example as before, if we want \(B\) to reject \(x \notin L\) at least 99% of the time when \(A\) rejects \(x \notin L\) with probability at least \(\frac{3}{4}\), it suffices for \(B\) to run \(n \ge 37\) trials of \(A\) on \(x\).