\[\def\gets{:=} \def\N{\mathbb{N}} \def\Z{\mathbb{Z}} \def\R{\mathbb{R}} \def\Q{\mathbb{Q}} \def\O{\mathrm{O}} \def\Prop{\mathbb{P}} \def\Algorithm{\textbf{Algorithm}~} \def\Subroutine{\textbf{Subroutine}~} \def\Let{\textbf{Let}~} \def\Find{\textbf{Find}~} \def\Compute{\textbf{Compute}~} \def\For{\textbf{For}~} \def\While{\textbf{While}~} \def\If{\textbf{If}~} \def\Else{\textbf{Else}~} \def\ElseIf{\textbf{Else If}~} \def\Return{\textbf{Return}~} \def\return{~\textbf{return}~} \def\Output{\textbf{Output}~} \def\output{~\textbf{output}~} \def\tot{~\textrm{to}~} \def\andt{~\textrm{and}~} \def\ort{~\textrm{or}~} \def\Forallt{\textbf{For all}~} \def\forallt{~\textbf{for all}~} \def\Foreacht{\textbf{For each}~} \def\foreacht{~\textbf{for each}~} \newcommand{\abs}[1]{|#1|} \newcommand{\lrabs}[1]{\left|#1\right|} \newcommand{\parens}[1]{\left(#1\right)} \newcommand{\brackets}[1]{\left[#1\right]} \newcommand{\floor}[1]{\left\lfloor#1\right\rfloor} \newcommand{\ceil}[1]{\left\lceil#1\right\rceil} \def\gcd{\mathrm{gcd}} \def\qst{q_{start}} \def\qacc{q_{accept}} \def\qrej{q_{reject}} \newcommand{\lang}[1]{L_\mathrm{#1}} \def\atm{\lang{ACC}} \def\htm{\lang{HALT}} \def\ehtm{\lang{\varepsilon\text{-}HALT}} \def\eqtm{\lang{EQ}} \def\etm{\lang{\varnothing}} \def\oninput{\text{ on input }} \def\oninputs{\text{ on inputs }} \def\Run{\textbf{Run}~} \def\ont{~\textrm{on}~} \def\Accept{\textbf{Accept}~} \def\accept{~\textbf{accept}~} \def\accepts{~\textrm{accepts}~} \def\Reject{\textbf{Reject}~} \def\reject{~\textbf{reject}~} \def\rejects{~\textrm{rejects}~} \def\Halt{\textbf{Halt}~} \def\halt{~\textbf{halt}~} \newcommand{\inner}[1]{\langle #1 \rangle} \def\Tr{\le_\mathrm{T}} \def\pmr{\le_\mathrm{p}} \def\Construct{\textbf{Construct}~} \def\bquote{\text{"}} \def\equote{\text{"}} \def\RD{\mathsf{R}} \def\RE{\mathsf{RE}} \def\coRE{\mathsf{coRE}} \def\DTIME{\mathsf{DTIME}} \def\P{\mathsf{P}} \def\NP{\mathsf{NP}} \def\coNP{\mathsf{coNP}} \def\MAZE{\mathrm{MAZE}} \def\TSP{\mathrm{TSP}} \def\SAT{\mathrm{SAT}} \def\TSAT{\mathrm{3SAT}} \def\VC{\mathrm{VERTEX\text{-}COVER}} \def\SC{\mathrm{SET\text{-}COVER}} \def\HC{\mathrm{HAMCYCLE}} \def\HP{\mathrm{HAMPATH}} \def\IS{\mathrm{INDEPENDENT\text{-}SET}} \def\CLIQUE{\mathrm{CLIQUE}} \def\SSUM{\mathrm{SUBSET\text{-}SUM}} \def\KNAPSACK{\mathrm{KNAPSACK}} \def\MAXCUT{\mathrm{MAX\text{-}CUT}} \def\Ex{\mathbb{E}} \newcommand{\lrPr}[1]{\Pr\brackets{#1}} \newcommand{\lrEx}[1]{\Ex\brackets{#1}} \newcommand{\lrexp}[1]{\exp\parens{#1}} \def\MSAT{\mathrm{Max}\text{-}\mathrm{3SAT}} \def\PRIMES{\mathrm{PRIMES}} \def\RP{\mathsf{RP}} \def\coRP{\mathsf{coRP}} \def\BPP{\mathsf{BPP}} \def\ZPP{\mathsf{ZPP}} \def\Xj{X^{(j)}} \def\BQP{\mathsf{BQP}} \def\NPI{\mathsf{NPI}} \newcommand{\expp}[1]{\exp\left(#1\right)}\]

Randomness

Randomized Algorithms

The algorithms we have seen thus far have been deterministic, meaning that they execute the same steps each time they are run and produce the same result. In this unit, we consider how randomness can be applied to computation. We will exploit randomness to design simple, probabilistic algorithms, and we will discuss the tools that are necessary to analyze randomized algorithms.

As a first example, consider a game of rock-paper-scissors between two players. In this game, both players simultaneously choose one of three options: rock, paper, or scissors. The game is a tie if both players choose the same option. If they make different choices, then rock beats scissors, scissors beats paper, and paper beats rock:

table showing that rock beats scissors, scissors beats paper, and paper beats rock

Suppose we write a program to act as a player in a best-out-of-five game of rock-paper-scissors. If we hardcode a strategy such as playing rock in the first two moves, then paper, then scissors, and then paper, what is the probability that the program wins a game against an opponent? The program is deterministic, so a smart opponent would be able to observe the program’s strategy once and defeat it every single time thereafter. If an opponent is able to read the program’s source code, then a preliminary observation isn’t even necessary. This illustrates the problem with a deterministic strategy – it makes the program’s actions predictable and thus easily countered.

What if we change the program to choose a random action in each move, with equal probability for rock, paper, or scissors? Now even if an opponent knows the program’s strategy, the opponent cannot know which action the program will take and therefore cannot universally counter that action. In fact, no matter what strategy the opponent uses, the probability the opponent wins an individual round is always \(\frac{1}{3}\). Before we see why, we review some basic tools we use to analyze probabilistic outcomes.

Review of Probability

A random variable is a variable whose value is dependent on the outcome of a random phenomenon. The set of outcomes associated with a random phenomenon is called the sample space, and a random variable \(X\) can be regarded as a function from the sample space \(\Omega\) to the set of real numbers, i.e. \(X: \Omega \to \R\). Thus, the random variable associates a numerical value with each outcome.

A random variable \(X\) can be described with a probability distribution – for each possible value \(a_i\), the random variable has a probability \(p_i\) of taking on the value \(a_i\):

\[\Pr[X = a_i] = p_i\]

The probability \(p_i\) must be in the interval \([0, 1]\), i.e. \(0 \le p_i \le 1\). Furthermore, the probabilities \(p_i\) over all possible \(i\) must add up to 1:

\[\sum_i p_i = 1\]

A random variable \(X\) takes on the value \(a_i\) when the outcome \(\omega\) of a random experiment has associated value \(X(\omega) = a_i\). Thus, \(X = a_i\) is an event, a subset of the sample space of an experiment. More specifically, it is the event:

\[(X = a_i) = \{\omega \in \Omega: X(\omega) = a_i\}\]

The probability of an event can be computed as the sum of the probabilities for each outcome in the event. Thus, the probability \(\Pr[X = a_i]\) is:

\[\Pr[X = a_i] = \sum_{\omega\in\Omega: X(\omega) = a_i} \Pr[\omega]\]

If each outcome has equal probability \(\frac{1}{\abs{\Omega}}\), we get

\[\Pr[X = a_i] = \frac{\abs{\{\omega \in \Omega: X(\omega) = a_i\}}}{\abs{\Omega}}\]
Example 127

Suppose we roll a pair of distinguishable, six-sided dice. Let \(X\) be a random variable corresponding to the sum of the values of the dice – if the values are \((i, j)\), then \(X = i + j\). The events corresponding to each value of \(X\) are as follows:

\[\begin{split}(X = 2) &= \{(1,1)\}\\ (X = 3) &= \{(1,2),(2,1)\}\\ (X = 4) &= \{(1,3),(2,2),(3,1)\}\\ (X = 5) &= \{(1,4),(2,3),(3,2),(4,1)\}\\ (X = 6) &= \{(1,5),(2,4),(3,3),(4,2),(5,1)\}\\ (X = 7) &= \{(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\}\\ (X = 8) &= \{(2,6),(3,5),(4,4),(5,3),(6,2)\}\\ (X = 9) &= \{(3,6),(4,5),(5,4),(6,3)\}\\ (X = 10) &= \{(4,6),(5,5),(6,4)\}\\ (X = 11) &= \{(5,6),(6,5)\}\\ (X = 12) &= \{(6,6)\}\end{split}\]

If the dice are fair, each outcome has equal probability. This gives us the following probability distribution for \(X\):

\[\begin{split}\Pr[X = 2] &= \frac{1}{36}\\ \Pr[X = 3] &= \frac{2}{36} = \frac{1}{18}\\ \Pr[X = 4] &= \frac{3}{36} = \frac{1}{12}\\ \Pr[X = 5] &= \frac{4}{36} = \frac{1}{9}\\ \Pr[X = 6] &= \frac{5}{36}\\ \Pr[X = 7] &= \frac{6}{36} = \frac{1}{6}\\ \Pr[X = 8] &= \frac{5}{36}\\ \Pr[X = 9] &= \frac{4}{36} = \frac{1}{9}\\ \Pr[X = 10] &= \frac{3}{36} = \frac{1}{12}\\ \Pr[X = 11] &= \frac{2}{36} = \frac{1}{18}\\ \Pr[X = 12] &= \frac{1}{36}\end{split}\]
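
This distribution can be verified by direct enumeration. The following Python snippet (an illustration of our own, not part of the text's algorithms) lists all 36 outcomes and tallies the probability of each value of \(X\):

    from collections import Counter

    # Enumerate all 36 equally likely outcomes of two distinguishable dice.
    outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

    # X(omega) = i + j; tally how many outcomes map to each value of X.
    counts = Counter(i + j for i, j in outcomes)

    for a in sorted(counts):
        print(f"Pr[X = {a}] = {counts[a]}/36")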

The probability distribution provides full detail about a random variable. However, we often are interested in more succinct information that in a sense summarizes the details. One such piece of information is the expectation or expected value of a random variable, which corresponds to the “average” value that the random variable takes on. We compute the expectation \(\Ex[X]\) as follows:

\[\Ex[X] = \sum_i a_i \cdot \Pr[X = a_i]\]

Thus, the expectation is the average value of the variable weighted by the probability of obtaining each value.

Recall that

\[\Pr[X = a_i] = \sum_{\omega\in\Omega: X(\omega) = a_i} \Pr[\omega]\]

Plugging this into the formula for expectation, we get

\[\begin{split}\Ex[X] &= \sum_i a_i \cdot \Pr[X = a_i]\\ &= \sum_i a_i \cdot \parens{\sum_{\omega\in\Omega: X(\omega) = a_i} \Pr[\omega]}\\ &= \sum_i \sum_{\omega\in\Omega: X(\omega) = a_i} X(\omega) \cdot \Pr[\omega]\\ &= \sum_{\omega \in \Omega} X(\omega) \cdot \Pr[\omega]\end{split}\]

as an alternate expression of \(\Ex[X]\).

Often, we can express a random variable \(X\) in terms of other random variables \(X_i\). We can then apply linearity of expectation to compute the expectation of \(X\) from the expectations of each \(X_i\).

Theorem 128 (Linearity of Expectation)

Let \(X\) be a weighted sum of random variables \(X_i\), each multiplied by a constant \(c_i\):

\[X = \sum_i c_i \cdot X_i\]

Then \(\Ex[X]\) is:

\[\Ex[X] = \sum_i c_i \cdot \Ex[X_i]\]
Proof 129

We prove the case where \(X\) is the weighted sum of two variables

\[X = c_1 X_1 + c_2 X_2\]

From the alternate formulation of expectation, we have

\[\begin{split}\Ex[X] &= \sum_{\omega\in\Omega} X(\omega) \cdot \Pr[\omega]\\ &= \sum_{\omega\in\Omega} (c_1 X_1(\omega) + c_2 X_2(\omega)) \cdot \Pr[\omega]\\ &= \sum_{\omega\in\Omega} \parens{c_1 X_1(\omega) \cdot \Pr[\omega] + c_2 X_2(\omega) \cdot \Pr[\omega]}\\ &= \parens{\sum_{\omega\in\Omega} c_1 X_1(\omega) \cdot \Pr[\omega]} + \parens{\sum_{\omega\in\Omega} c_2 X_2(\omega) \cdot \Pr[\omega]}\\ &= c_1 \cdot \parens{\sum_{\omega\in\Omega} X_1(\omega) \cdot \Pr[\omega]} + c_2 \cdot \parens{\sum_{\omega\in\Omega} X_2(\omega) \cdot \Pr[\omega]}\\ &= c_1 \cdot \Ex[X_1] + c_2 \cdot \Ex[X_2]\end{split}\]
Example 130

Suppose we roll a pair of distinguishable, six-sided dice. Let \(X_1\) be the value of the first die and \(X_2\) the value of the second. If the dice are fair, then each outcome is equally likely, so we have

\[\begin{split}\Ex[X_1] = \Ex[X_2] &= \sum_{i=1}^6 i \cdot \frac{1}{6}\\ &= 3.5\end{split}\]

Let \(X = X_1 + X_2\) be the sum of the values of the two dice. By linearity of expectation, we have

\[\Ex[X] = \Ex[X_1] + \Ex[X_2] = 7\]

Computing the expectation from the probability distribution of \(X\) gives us the same result, but it requires much more work to do so.
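
The following snippet checks this claim, computing \(\Ex[X]\) both directly from the outcomes and via linearity (a sketch of our own for illustration):

    from fractions import Fraction

    outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

    # Direct computation: sum over all outcomes of X(omega) * Pr[omega].
    direct = sum(Fraction(i + j, 36) for i, j in outcomes)

    # Linearity of expectation: E[X] = E[X1] + E[X2] = 7/2 + 7/2.
    single = sum(Fraction(i, 6) for i in range(1, 7))
    print(direct, 2 * single)  # both are 7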

A sequence of random variables \(X_1, X_2, \dots, X_n\) is independent if for every possible set of values \(b_1, b_2, \dots, b_n\), we have:

\[\Pr[X_1 = b_1, X_2 = b_2, \dots, X_n = b_n] = \prod_i \Pr[X_i = b_i]\]

The notation \(\Pr[X_1 = b_1, X_2 = b_2, \dots, X_n = b_n]\) represents the joint probability of the events \(X_1 = b_1\), \(X_2 = b_2\), \(\dots\), \(X_n = b_n\) occurring simultaneously. The random variables are independent if the joint probability of them taking on any sequence of values is the same as the product of each of them individually taking on the respective value.

A common type of random variable is an indicator random variable, also called a 0-1 random variable. As the latter name suggests, it is a random variable that only takes on the values 0 or 1. If \(X\) is an indicator and \(\Pr[X = 1] = p\), we have \(\Pr[X = 0] = 1 - p\), and

\[\begin{split}\Ex[X] &= 0 \cdot \Pr[X = 0] + 1 \cdot \Pr[X = 1]\\ &= \Pr[X = 1] = p\end{split}\]

If we can define a random variable \(Y\) as the sum of \(n\) indicator random variables \(X_i\), each with the same probability \(\Pr[X_i = 1] = p\), then:

\[\begin{split}\Ex[Y] &= \sum_i \Ex[X_i]\\ &= \sum_i \Pr[X_i = 1]\\ &= np\end{split}\]
Example 131

Suppose we repeatedly flip a biased coin with probability \(p\) of heads. What is the expected number of heads over \(n\) flips?

Define \(X_i\) to be 0 if the \(i\)th flip is tails and 1 if it is heads. Then \(X_i\) is an indicator with \(\Ex[X_i] = \Pr[X_i = 1] = p\). Let \(Y\) be the sum of the \(X_i\). Since \(Y\) is the sum of \(n\) indicators \(X_i\), each with probability \(\Pr[X_i = 1] = p\), we have \(\Ex[Y] = np\). Observe also that \(Y\) is the total number of heads, since \(X_i\) contributes 1 to the value of \(Y\) when the \(i\)th flip is heads and 0 otherwise. Thus, the expected number of heads is \(np\).
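
A quick simulation illustrates this result; the parameters \(n\), \(p\), and the trial count below are arbitrary choices of our own:

    import random

    def count_heads(n: int, p: float) -> int:
        """Flip a biased coin n times; return the number of heads."""
        return sum(1 for _ in range(n) if random.random() < p)

    n, p, trials = 100, 0.3, 10_000
    avg = sum(count_heads(n, p) for _ in range(trials)) / trials
    print(avg, n * p)  # the empirical average is close to np = 30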

Returning to the rock-paper-scissors example, let \(R\), \(P\), and \(S\) be arbitrary numbers corresponding to rock, paper, and scissors. Let \(A\) be a random variable representing the action taken by our randomized program, and let \(B\) be a random variable representing the opponent’s action. Let \(W\) be the event that the opponent wins – this occurs when the program chooses rock and the opponent paper, the program paper and the opponent scissors, or the program scissors and the opponent rock. These are disjoint events, so the probability that any one of them happens is the sum of their individual probabilities. Then we have:

\[\begin{split}\Pr[W] &= \Pr[A=R,B=P] + \Pr[A=P,B=S] + \Pr[A=S,B=R]\\ &= \Pr[A=R]\cdot\Pr[B=P] + \Pr[A=P]\cdot\Pr[B=S] + \Pr[A=S]\cdot\Pr[B=R]\\ &= \frac{1}{3} (\Pr[B=P] + \Pr[B=S] + \Pr[B=R])\\ &= \frac{1}{3}\end{split}\]

In the second step, we make use of the fact that the program’s action and the opponent’s are independent. Then we apply the fact that the program chooses each action with equal probability \(\frac{1}{3}\). Finally, the opponent must choose one of the three possible actions, so regardless of how the opponent makes their choice, the probabilities sum to one. The conclusion is that the probability the opponent wins is \(\frac{1}{3}\) no matter what strategy they use. This is a significant improvement over the deterministic program, where the opponent could always win once they discover the program’s strategy.
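
We can also check this conclusion empirically. The sketch below plays a uniformly random strategy against a deterministic opponent; the copy-our-last-action opponent is an arbitrary stand-in, and any other strategy yields the same result:

    import random

    ACTIONS = ["rock", "paper", "scissors"]
    BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

    def opponent(history):
        """An arbitrary deterministic opponent: copy our previous action."""
        return history[-1] if history else "rock"

    history, opponent_wins, rounds = [], 0, 100_000
    for _ in range(rounds):
        ours = random.choice(ACTIONS)  # uniformly random action
        theirs = opponent(history)
        if BEATS[theirs] == ours:      # the opponent's action beats ours
            opponent_wins += 1
        history.append(ours)

    print(opponent_wins / rounds)  # approaches 1/3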

The rock-paper-scissors example is illustrative of a more general notion of considering algorithms as games. We allow one player to choose an algorithm to solve a problem. A second, more powerful player known as the adversary has the ability to choose an input. The goal of the adversary is to maximize the number of steps taken by the chosen algorithm on the given input. The game is unfair – the adversary knows the process by which the first player chooses an algorithm (e.g. from prior observations, or from reading the source code for the first player). If the first player chooses the algorithm deterministically, the adversary can always find an input that maximizes the number of steps. On the other hand, if the first player chooses an algorithm randomly according to some distribution, the adversary can no longer always find a worst input, even if the adversary knows the distribution the first player uses.

Randomized Approximations

We can exploit randomness to construct simple algorithms that produce an \(\alpha\)-approximation in expectation, meaning that the expected value of the output is within a factor \(\alpha\) of the value of the optimal solution. As an example, we consider the problem of satisfying as many clauses as possible given an exact 3CNF formula. Such a formula is in 3CNF, with the added stipulation that each clause has three distinct variables. For example, the clause

\[(x \vee \neg y \vee \neg z)\]

is in exact 3CNF, while the clause

\[(x \vee \neg y \vee \neg x)\]

is not, since it only has two distinct variables.

The problem of finding an assignment that maximizes the number of satisfied clauses in an exact 3CNF formula is known as \(\MSAT\) (also called \(\mathrm{Max}\text{-}\mathrm{E3SAT}\)). This is an \(\NP\)-Hard optimization problem. However, a simple, randomized algorithm that assigns the truth value of each variable randomly (with equal probability for true or false) achieves a 7/8-approximation in expectation. We demonstrate this as follows.

Suppose \(\phi\) is an exact 3CNF formula with \(n\) variables and \(m\) clauses. Let \(a\) be a random assignment of the variables, and let \(X(a)\) be the number of satisfied clauses for this assignment. Then \(X\) is a random variable, since it associates each outcome with a number. We define indicator random variables \(X_i\), where \(X_i = 1\) if clause \(i\) is satisfied and \(X_i = 0\) if it is not. Then we have

\[X = X_1 + \dots + X_m\]

An exact 3CNF clause has three literals, each with a distinct variable, and the probability that an individual literal is satisfied by a random assignment to its variable is \(\frac{1}{2}\); a literal of the form \(x\) is satisfied when \(x = 1\), which occurs with probability \(\frac{1}{2}\), and a literal of the form \(\neg x\) is satisfied when \(x = 0\), which also occurs with probability \(\frac{1}{2}\). A clause is unsatisfied when all three of its literals are false. Since the literals have distinct variables, the events that each is false are independent. Thus, we have

\[\begin{split}\Pr[\text{all three literals in clause $i$ are false}] &= \prod_{j=1}^3 \Pr[\text{literal $j$ in clause $i$ is false}]\\ &= \parens{\frac{1}{2}}^3 = \frac{1}{8}\end{split}\]

If even a single literal is true, the clause is satisfied. Thus, the probability that clause \(i\) is satisfied with a random assignment is

\[\begin{split}\Pr[X_i = 1] &= 1 - \Pr[\text{all three literals in clause $i$ are false}]\\ &= 1 - \frac{1}{8} = \frac{7}{8}\end{split}\]

We then have \(\Ex[X_i] = \frac{7}{8}\), so by linearity of expectation,

\[\begin{split}\Ex[X] &= \Ex[X_1] + \dots + \Ex[X_m]\\ &= \frac{7}{8}m\end{split}\]

Thus, the expected number of satisfied clauses is \(\frac{7}{8}m\). Since the optimal assignment can satisfy at most \(m\) clauses, the random assignment is a 7/8-approximation, in expectation, of the optimal.
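
The algorithm itself is only a few lines. The sketch below uses a clause encoding of our own choosing – each clause is a list of three (variable, polarity) pairs – and estimates the expected number of satisfied clauses by repeated trials:

    import random

    def random_assignment_satisfied(clauses, n):
        """Assign each of n variables True/False uniformly at random;
        return the number of clauses the assignment satisfies."""
        assignment = [random.random() < 0.5 for _ in range(n)]
        # A clause is satisfied if at least one literal is true.
        return sum(1 for clause in clauses
                   if any(assignment[var] == positive
                          for var, positive in clause))

    # (x v ~y v ~z) and (~x v y v z) over variables 0, 1, 2
    phi = [[(0, True), (1, False), (2, False)],
           [(0, False), (1, True), (2, True)]]
    trials = 10_000
    avg = sum(random_assignment_satisfied(phi, 3)
              for _ in range(trials)) / trials
    print(avg)  # close to (7/8) * 2 = 1.75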

We can further conclude from this result that any exact 3CNF formula has an assignment that satisfies at least \(\frac{7}{8}\) of its clauses. We demonstrate this via contradiction. Suppose that a formula \(\phi\) does not have an assignment that satisfies at least \(\frac{7}{8}\) of its clauses. Let \(m\) be the number of clauses in \(\phi\), and let \(X(a)\) be the number of clauses satisfied by an assignment \(a\). We showed above that \(\Ex[X] = \frac{7}{8}m\). By the alternate formulation of expectation, we have

\[\Ex[X] = \sum_{a \in \Omega} X(a) \cdot \Pr[a]\]

where \(\Omega\) is the set of all possible assignments. By assumption, we have \(X(a) < \frac{7}{8}m\) for all \(a \in \Omega\). This gives us

\[\begin{split}\Ex[X] &< \sum_{a \in \Omega} \frac{7}{8}m \cdot \Pr[a]\\ &= \frac{7}{8}m \cdot \sum_{a \in \Omega} \Pr[a]\\ &= \frac{7}{8}m\end{split}\]

However, this contradicts our result that \(\Ex[X] = \frac{7}{8}m\). Thus, our assumption that no assignment satisfies at least \(\frac{7}{8}\) of the clauses must be false.

More generally, the reasoning above gives us the following for an arbitrary random variable \(X\):

Claim 132

Suppose \(X\) is a random variable over a sample space \(\Omega\). Then:

  • There is at least one outcome \(\omega \in \Omega\) such that

    \[X(\omega) \ge \Ex[X]\]
  • There is at least one outcome \(\omega \in \Omega\) such that

    \[X(\omega) \le \Ex[X]\]

Applying the above, given \(m\) objects, if a property holds for each object with probability \(p\), then there must be an outcome in which the property holds for at least \(mp\) of the objects simultaneously. This is because the expected number of objects for which the property holds is \(mp\), and by Claim 132, there is some outcome where the count is at least its expectation.

Another question we may have is, what is the probability that a random assignment satisfies at least half the clauses of an exact 3CNF formula? In terms of the random variable \(X\), what is \(\Pr[X \ge \frac{1}{2}m]\) for a formula with \(m\) clauses? We can apply Markov’s inequality to place a bound on this probability.

Theorem 133 (Markov’s Inequality)

Let \(X\) be a nonnegative random variable (i.e. \(X\) never takes on a negative value), and let \(v > 0\). Then

\[\Pr[X \ge v] \le \frac{\Ex[X]}{v}\]
Proof 134

We prove Markov’s inequality for a discrete random variable \(X\), one that can only take on a countable number of values. By definition of expectation, we have:

\[\begin{split}\Ex[X] &= \sum_i a_i \cdot \Pr[X = a_i]\\ &= \parens{\sum_{a_i < v} a_i \cdot \Pr[X = a_i]} + \parens{\sum_{a_i \ge v} a_i \cdot \Pr[X = a_i]}\\ &\ge \sum_{a_i \ge v} a_i \cdot \Pr[X = a_i]\\ &\ge \sum_{a_i \ge v} v \cdot \Pr[X = a_i]\\ &= v \cdot \Pr[X \ge v]\end{split}\]

The third step follows from the second because \(X\) is nonnegative, so \(a_i \ge 0\) and therefore the left-hand summation is nonnegative.

Dividing both sides of the result by \(v\), we get

\[\frac{\Ex[X]}{v} \ge \Pr[X \ge v]\]

In the case of \(\MSAT\), \(X\) is a count of the number of satisfied clauses, so it never takes on a negative value. Thus, we can apply Markov’s inequality with \(v = \frac{1}{2}m\) to obtain:

\[\lrPr{X \ge \frac{1}{2}m} \le \parens{\frac{7}{8}m} / \parens{\frac{1}{2}m} = \frac{7}{4}\]

Unfortunately, this doesn’t tell us anything we didn’t know – we already know by definition of probability that \(\Pr[X \ge \frac{1}{2}m] \le 1\). Furthermore, we would like a lower bound on the probability that at least half the clauses are satisfied, not an upper bound. We need another tool to reason about this.

Theorem 135 (Reverse Markov’s Inequality)

Let \(X\) be a random variable that is never larger than some value \(b\), and let \(v < b\). Then

\[\Pr[X > v] \ge \frac{\Ex[X] - v}{b - v}\]

More loosely, we have

\[\Pr[X \ge v] \ge \frac{\Ex[X] - v}{b - v}\]
Proof 136

Define \(Y = b - X\). Since \(X\) is never larger than \(b\), \(Y\) is a nonnegative random variable. We have

\[\begin{split}\Pr[X \le v] &= \Pr[b - Y \le v]\\ &= \Pr[Y \ge b - v]\end{split}\]

Applying Markov’s inequality, we get:

\[\begin{split}\Pr[Y \ge b - v] &\le \frac{\Ex[Y]}{b - v}\\ &= \frac{b - \Ex[X]}{b - v}\end{split}\]

Then we have

\[\begin{split}\Pr[X > v] &= 1 - \Pr[X \le v]\\ &= 1 - \Pr[Y \ge b - v]\\ &\ge 1 - \frac{b - \Ex[X]}{b - v}\\ &= \frac{\Ex[X] - v}{b - v}\end{split}\]

To obtain the looser formulation, we note that

\[\begin{split}\Pr[X \ge v] &= \Pr[X > v] + \Pr[X = v]\\ &\ge \Pr[X > v]\end{split}\]

Combining this with the previous result, we obtain

\[\Pr[X \ge v] \ge \frac{\Ex[X] - v}{b - v}\]

Observe that for \(\MSAT\), we have an upper bound of \(m\) on the value of \(X\), where \(m\) is the number of clauses. Setting \(v = \frac{1}{2}m < m\), we apply the reverse Markov’s inequality to get

\[\begin{split}\lrPr{X \ge \frac{1}{2}m} &\ge \parens{\frac{7}{8}m - \frac{1}{2}m} / \parens{m - \frac{1}{2}m}\\ &= \parens{\frac{3}{8}} / \parens{\frac{1}{2}} = \frac{3}{4}\end{split}\]

Thus, the randomized algorithm produces an assignment that satisfies at least half the clauses with probability \(\ge \frac{3}{4}\).
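
In code, this suggests running the random-assignment algorithm several times and returning any assignment that satisfies at least half the clauses; each attempt succeeds with probability at least \(\frac{3}{4}\), so all \(k\) attempts fail with probability at most \((\frac{1}{4})^k\). A sketch, using the same clause encoding as before:

    import random

    def satisfied_count(clauses, assignment):
        return sum(1 for clause in clauses
                   if any(assignment[v] == pos for v, pos in clause))

    def half_satisfying_assignment(clauses, n, tries=20):
        """Retry the random assignment until at least half the clauses
        are satisfied; all tries fail with probability <= (1/4)^tries."""
        for _ in range(tries):
            a = [random.random() < 0.5 for _ in range(n)]
            if satisfied_count(clauses, a) >= len(clauses) / 2:
                return a
        return None  # overwhelmingly unlikely for an exact 3CNF formula

    phi = [[(0, True), (1, False), (2, False)],
           [(0, False), (1, True), (2, True)]]
    print(half_satisfying_assignment(phi, 3))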

Exercise 137

The following is a randomized algorithm for producing a cut in an undirected graph \(G\):

\[\begin{split}&\Algorithm Random\text{-}Cut(G = (V,E)):\\ &~~~S \gets \emptyset\\ &~~~\Foreacht v \in V:\\ &~~~~~~\text{Add $v$ to $S$ with probability } 1/2\\ &~~~\Return S, T = V \setminus S\end{split}\]

Recall that \(C(S,T)\) is the set of crossing edges for the cut \(S,T\), i.e., those that have one endpoint in \(S\) and the other endpoint in \(T\). The size of the cut is \(\abs{C(S,T)}\).

  1. Prove that the randomized algorithm is a 1/2-approximation in expectation for finding the maximal cut of an undirected graph. That is, the expected size of the returned cut is at least half the size of a maximal cut.

    Hint: Use linearity of expectation to show that in expectation, half the edges of the graph are in \(C(S,T)\).

  2. Argue that, for any undirected graph \(G=(V,E)\), there exists a cut of size at least \(\frac{1}{2}\abs{E}\).

Primality Testing

A key component of many cryptography algorithms is finding large prime numbers, with hundreds or even thousands of digits. A typical approach is to randomly choose a large number and then apply a primality test to determine whether it is prime – the prime number theorem states that approximately \(1/m\) of the \(m\)-bit numbers are prime, so we only expect to check \(m\) numbers before finding one that is prime 1. As long as the primality test itself is efficient with respect to \(m\), the expected time to find a prime number with \(m\) bits is polynomial in \(m\).

1

This follows from the fact that the expected value of a geometric distribution with parameter \(p\) is \(1/p\).

Formally, we wish to decide the following language:

\[\PRIMES = \{m \in \N : m \text{ is prime}\}\]

A positive integer \(m\) is a member of \(\PRIMES\) if it has no positive factors other than \(1\) and \(m\).

Is the language \(\PRIMES\) in \(\P\)? Let us consider the following simple algorithm to decide the language:

\[\begin{split}&\Algorithm Prime\text{-}Test(m):\\ &~~~\For i = 2 \tot m-1:\\ &~~~~~~\If m \bmod i \equiv 0, \reject\\ &~~~\Accept\end{split}\]

This algorithm does \(\O(m)\) mod operations. Is this efficient? The size of the input \(m\) is the number of bits required to represent \(m\), which is \(\O(\log m)\). Thus, the number of operations is \(\O(m) = \O(2^{\log m})\), which is exponential in the size of \(m\).

We can improve the algorithm by observing that if \(m\) has a factor within the interval \([2, m-1]\), it must have a factor within \([2, \sqrt{m}]\) – otherwise, we would obtain a product larger than \(m\) when we multiplied its factors together. Thus, we need only iterate up to \(\sqrt{m}\):

\[\begin{split}&\Algorithm Prime\text{-}Test(m):\\ &~~~\For i = 2 \tot \sqrt{m}:\\ &~~~~~~\If m \bmod i \equiv 0, \reject\\ &~~~\Accept\end{split}\]

This algorithm does \(\O(\sqrt{m})\) mod operations. However, this still is not polynomial; we have:

\[\O(\sqrt{m}) = \O(m^{1/2}) = \O(2^{1/2 \cdot \log m})\]
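
For concreteness, here is the \(\sqrt{m}\) version of the test in Python (a direct transcription of the pseudocode above); it is correct, but its running time is exponential in the number of bits of \(m\):

    from math import isqrt

    def prime_test(m: int) -> bool:
        """Trial division up to sqrt(m): O(sqrt(m)) mod operations."""
        if m < 2:
            return False
        for i in range(2, isqrt(m) + 1):
            if m % i == 0:
                return False
        return True

    print(prime_test(97), prime_test(91))  # True, False (91 = 7 * 13)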

To be efficient, a primality-testing algorithm must have runtime \(\O(\log^k m)\) for some constant \(k\). Neither of the algorithms above meets this threshold.

In fact, there is a known efficient algorithm for primality testing, the AKS primality test. Thus, it is indeed the case that \(\PRIMES \in \P\). However, this algorithm is somewhat complicated, and its running time is high enough to preclude it from being used in practice. Instead, we consider a randomized primality test that is efficient and works well in practice for most inputs. The algorithm we construct relies on the extended Fermat’s little theorem.

Theorem 138 (Extended Fermat’s Little Theorem)

Let \(n \in \N\) be a natural number such that \(n \ge 2\). Let \(a\) be a witness in the range \(1 \le a \le n - 1\). Then:

  • If \(n\) is prime, then \(a^{n-1} \equiv 1 \pmod{n}\) for any witness \(a\).

  • If \(n\) is composite and \(n\) is not a Carmichael number, then \(a^{n-1} \equiv 1 \pmod{n}\) for at most half the witnesses \(1 \le a \le n - 1\).

We postpone discussion of Carmichael numbers for the moment. Instead, we take a look at some small cases of composite numbers to see that the extended Fermat’s little theorem holds. We first consider \(n = 6\). We have:

\[\begin{split}\begin{array}{lll} a = 1 :& 1^5 ~&\equiv 1 \pmod{6}\\ a = 2 :& 2^5 = 32 = 2 + 5\cdot 6 ~&\equiv 2 \pmod{6}\\ a = 3 :& 3^5 = 243 = 3 + 40\cdot 6 ~&\equiv 3 \pmod{6}\\ a = 4 :& 4^5 = 1024 = 4 + 170\cdot 6 ~&\equiv 4 \pmod{6}\\ a = 5 :& 5^5 = 3125 = 5 + 520\cdot 6 ~&\equiv 5 \pmod{6} \end{array}\end{split}\]

We see that \(a^{n-1} \equiv 1 \pmod{n}\) for only the single witness \(a = 1\). Similarly, we consider \(n = 9\):

\[\begin{split}\begin{array}{lll} a = 1 :& 1^8 ~&\equiv 1 \pmod{9}\\ a = 2 :& 2^8 = 256 = 4 + 28\cdot 9 ~&\equiv 4 \pmod{9}\\ a = 3 :& 3^8 = 6561 = 0 + 729\cdot 9 ~&\equiv 0 \pmod{9}\\ a = 4 :& 4^8 = 65536 = 7 + 7281\cdot 9 ~&\equiv 7 \pmod{9}\\ a = 5 :& 5^8 = 390625 = 7 + 43402\cdot 9 ~&\equiv 7 \pmod{9}\\ a = 6 :& 6^8 = 1679616 = 0 + 186624\cdot 9 ~&\equiv 0 \pmod{9}\\ a = 7 :& 7^8 = 5764801 = 4 + 640533\cdot 9 ~&\equiv 4 \pmod{9}\\ a = 8 :& 8^8 = 16777216 = 1 + 1864135\cdot 9 ~&\equiv 1 \pmod{9} \end{array}\end{split}\]

Here, there are two witnesses \(a\) where \(a^{n-1} \equiv 1 \pmod{n}\), out of eight total. Finally, we consider \(n = 7\):

\[\begin{split}\begin{array}{lll} a = 1 :& 1^6 ~&\equiv 1 \pmod{7}\\ a = 2 :& 2^6 = 64 = 1 + 9\cdot 7 ~&\equiv 1 \pmod{7}\\ a = 3 :& 3^6 = 729 = 1 + 104\cdot 7 ~&\equiv 1 \pmod{7}\\ a = 4 :& 4^6 = 4096 = 1 + 585\cdot 7 ~&\equiv 1 \pmod{7}\\ a = 5 :& 5^6 = 15625 = 1 + 2232\cdot 7 ~&\equiv 1 \pmod{7}\\ a = 6 :& 6^6 = 46656 = 1 + 6665\cdot 7 ~&\equiv 1 \pmod{7}\\ \end{array}\end{split}\]

Since \(7\) is prime, we have \(a^{n-1} \equiv 1 \pmod{n}\) for all witnesses \(a\).

The extended Fermat’s little theorem leads directly to a simple, efficient randomized algorithm for primality testing. This Fermat primality test is as follows:

\[\begin{split}&\Algorithm Fermat\text{-}Test(m):\\ &~~~\text{Pick a random witness $1 \le a \le m-1$}\\ &~~~\If a^{m-1} \bmod m \equiv 1, \accept\\ &~~~\Else \reject\end{split}\]

The modular exponentiation in this algorithm can be done with \(\O(\log m)\) multiplications using a divide and conquer strategy, and each multiplication can be done efficiently. Thus, this algorithm is polynomial with respect to the size of \(m\).
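
In Python, the built-in three-argument pow performs exactly this modular exponentiation by repeated squaring, so the test is a few lines (a sketch of our own):

    import random

    def fermat_test(m: int) -> bool:
        """One round of the Fermat test: accept if the random witness a
        satisfies a^(m-1) = 1 (mod m)."""
        a = random.randint(1, m - 1)
        return pow(a, m - 1, m) == 1  # O(log m) multiplications

    print(fermat_test(101))  # a prime: always True
    print(fermat_test(100))  # a composite: False with probability >= 1/2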

As for correctness, we have:

  • If \(m\) is prime, then the algorithm always accepts \(m\). In other words:

    \[\Pr[\text{the Fermat test accepts $m$}] = 1\]
  • If \(m\) is composite and not a Carmichael number, then the algorithm rejects \(m\) with probability at least \(\frac{1}{2}\). In other words:

    \[\Pr[\text{the Fermat test rejects $m$}] \ge \frac{1}{2}\]

Thus, if the algorithm accepts \(m\), we can be fairly confident that \(m\) is prime. And as we will see shortly, we can repeat the algorithm to obtain higher confidence that we get the right answer.

We now return to the problem of Carmichael numbers. A Carmichael number is a composite number \(n\) such that \(a^{n-1} \equiv 1 \pmod{n}\) for all witnesses \(a\) that are relatively prime to \(n\), i.e. \(\gcd(a, n) = 1\). This implies that for a Carmichael number, the Fermat test reports with high probability that the number is prime, despite it being composite. We call a number that passes the Fermat test with high probability a pseudoprime, and the Fermat test is technically a randomized algorithm for deciding the following language:

\[\begin{split}\mathrm{PSEUDOPRIMES} = \left\{ \begin{aligned} m \in \N :~ &a^{m-1} \bmod{m} \equiv 1 \text{ for at least half}\\ &\text{the witnesses } 1 \le a \le m - 1 \end{aligned} \right\}\end{split}\]

Carmichael numbers are much rarer than prime numbers, so for many applications, the Fermat test is sufficient to determine with high confidence that a randomly chosen number is prime. On the other hand, if the number is chosen by an adversary, then the Fermat test is unsuitable, and a more complex randomized algorithm such as the Miller-Rabin primality test must be used instead.

The Miller-Rabin Test

The Miller-Rabin test is designed around the fact that for prime \(m\), the only square roots of 1 modulo \(m\) are -1 and 1. More specifically, if \(x\) is a square root of 1 modulo \(m\), we have

\[x^2 \equiv 1 \pmod{m}\]

By definition of modular arithmetic, this means that

\[x^2 - 1 = km\]

for some integer \(k\) (i.e. \(m\) evenly divides the difference of \(x^2\) and 1). We can factor \(x^2-1\) to obtain:

\[(x + 1)(x - 1) = km\]

When \(m\) is composite, so that \(m = pq\) for integers \(p,q > 1\), it is possible for \(p\) to divide \(x+1\) and \(q\) to divide \(x-1\), in which case \(pq = m\) divides their product \((x+1)(x-1)\). However, when \(m\) is prime, the only way for the equation to hold is if \(m\) divides either \(x+1\) or \(x-1\); otherwise, the prime factorization of \((x+1)(x-1)\) does not contain \(m\), and by the fundamental theorem of arithmetic, it cannot be equal to a number whose prime factorization contains \(m\) 2. Thus, we have either \(x+1 = am\) for some integer \(a\), or \(x-1 = bm\) for some \(b \in \Z\). The former gives us \(x \equiv -1 \pmod{m}\), while the latter results in \(x \equiv 1 \pmod{m}\). Thus, if \(m\) is prime and \(x^2 \equiv 1 \pmod{m}\), then either \(x \equiv 1 \pmod{m}\) or \(x \equiv -1 \pmod{m}\).

2

Euclid’s lemma more directly states that if \(m\) is prime and \(m\) divides a product \(ab\), then \(m\) must divide either \(a\) or \(b\).

The Miller-Rabin test starts with the Fermat test: choose a witness \(1 \le a \le m-1\) and check whether \(a^{m-1} \equiv 1 \pmod{m}\). If this does not hold, then \(m\) fails the Fermat test and therefore is not prime. If it does indeed hold, then the test checks the square root \(a^{\frac{1}{2}(m-1)}\) of \(a^{m-1}\) to see if it is 1 or -1. If it is 1, the test checks the square root \(a^{\frac{1}{4}(m-1)}\) of \(a^{\frac{1}{2}(m-1)}\) to see if it is 1 or -1, and so on. The termination conditions are as follows:

  • The test finds a square root of 1 that is not -1 or 1. By the reasoning above, \(m\) must be composite.

  • The test finds a square root of 1 that is -1. The reasoning above only holds for square roots of 1, so the test cannot continue by computing the square root of -1. In this case, \(m\) may be prime or composite.

  • The test reaches some \(r\) for which \(1/2^r\cdot(m-1)\) is no longer an integer. Then it cannot compute \(a\) to this power modulo \(m\). In this case, \(m\) may be prime or composite.

To compute these square roots, we first extract powers of 2 from the number \(m-1\) (which is even for odd \(m\), the cases of interest):

\[\begin{split}&\Algorithm ExtractPowersOf2(x):\\ &~~~\If x\bmod 2 = 1 \return (x, 0)\\ &~~~(d, s') \gets ExtractPowersOf2(x / 2)\\ &~~~\Return (d, s'+1)\end{split}\]

Then \(ExtractPowersOf2(m-1)\) computes \(d\) and \(s\) such that

\[m-1 = 2^s d\]

where \(\gcd(d, 2) = 1\), i.e. \(d\) is odd. Then \(a^{2^{s-1}d}\) is a square root of \(a^{2^sd}\), since

\[\large\parens{a^{2^{s-1}d}}^2 = a^{2\cdot 2^{s-1}d} = a^{2^sd}\]

The full Miller-Rabin algorithm is as follows:

\[\begin{split}&\Algorithm Miller\text{-}Rabin(m):\\ &~~~\text{Pick a random witness $1 \le a \le m-1$}\\ &~~~\Compute s,d \text{ such that } m-1=2^sd \andt \gcd(d, 2) = 1\\ &~~~\Compute a^d, a^{2d}, a^{4d}, \dots, a^{2^{s-1}d}, a^{2^sd}\\ &~~~\If a^{2^sd} \not\equiv 1 \pmod{m}, \reject ~~~~\textit{(Fermat test)}\\ &~~~\For t = s-1, s-2, \dots, 0:\\ &~~~~~~\If a^{2^td} \equiv -1 \pmod{m}, \accept\\ &~~~~~~\ElseIf a^{2^td} \not\equiv 1 \pmod{m}, \reject\\ &~~~\Accept\end{split}\]

This algorithm is efficient: \(a^d\) can be computed using \(\O(\log d) = \O(\log m)\) multiplications, and it takes another \(\O(\log m)\) multiplications to compute \(a^{2d}, a^{4d}, \dots, a^{2^sd}\) since \(s = \O(\log m)\). Each multiplication is efficient as it is done modulo \(m\), so the entire algorithm takes polynomial time in the size of \(m\).
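
The following Python sketch is equivalent to the pseudocode above, but it walks the chain of squarings upward from \(a^d\) rather than downward from \(a^{2^sd}\); the handling of \(m \le 3\) and even \(m\) is our own addition:

    import random

    def miller_rabin(m: int) -> bool:
        """One round of Miller-Rabin: always accepts primes; rejects a
        composite with probability at least 3/4."""
        if m < 4:
            return m in (2, 3)
        if m % 2 == 0:
            return False
        s, d = 0, m - 1
        while d % 2 == 0:          # write m - 1 = 2^s * d with d odd
            s, d = s + 1, d // 2
        a = random.randint(1, m - 1)
        x = pow(a, d, m)           # a^d mod m
        if x == 1 or x == m - 1:   # note -1 is m - 1 modulo m
            return True
        for _ in range(s - 1):
            x = pow(x, 2, m)       # successive squarings a^(2^t * d)
            if x == m - 1:
                return True
        return False  # failed the Fermat test or hit a bad square root

Repeating the test with \(k\) independent witnesses reduces the probability of accepting a composite to at most \((\frac{1}{4})^k\).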

As for correctness, we have:

  • If \(m\) is prime, then the algorithm accepts \(m\) with probability 1.

  • If \(m\) is composite, then the algorithm rejects \(m\) with probability at least \(\frac{3}{4}\).

The latter is due to the fact that when \(m\) is composite, \(m\) passes the Miller-Rabin test for at most \(\frac{1}{4}\) of the witnesses \(1 \le a \le m-1\). Unlike the Fermat test, there are no exceptions to this, making the Miller-Rabin test a better choice in many applications.

Exercise 139

Polynomial identity testing is the problem of determining whether two polynomials \(p(x)\) and \(q(x)\) are the same, or equivalently, whether \(p(x) - q(x)\) is the zero polynomial.

  1. Let \(d\) be an upper bound on the degree of \(p(x)\) and \(q(x)\). Show that for a randomly chosen integer \(a \in [1, 4d]\), if \(p(x) \ne q(x)\), then \(\Pr[p(a) \ne q(a)] \ge \frac{3}{4}\).

    Hint: A nonzero polynomial \(r(x)\) of degree at most \(d\) can have at most \(d\) roots \(s_i\) such that \(r(s_i) = 0\).

  2. Devise an efficient, randomized algorithm \(A\) to determine whether \(p(x)\) and \(q(x)\) are the same, with the behavior that:

    • if \(p(x) = q(x)\), then

      \[\Pr[A \text{ accepts } p(x),q(x)] = 1\]
    • if \(p(x) \ne q(x)\), then

      \[\Pr[A \text{ rejects } p(x),q(x)] \ge \frac{3}{4}\]

    Assume that a polynomial can be efficiently evaluated on a single input.

Skip Lists

Randomness can also be utilized to construct simple data structures that provide excellent expected performance. As an example, we consider a set abstract data type, which is a container that supports inserting, deleting, and searching for elements of some mutually comparable type. There are many ways to implement a set abstraction, using a variety of underlying data structures including linked lists, arrays, and binary search trees. However, simple, deterministic implementations suffer a linear runtime in the worst case for some operations. For example, if an adversary inserts the items \(1, 2, \dots, n\) into a non-balancing binary search tree, the insertions take \(\O(n)\) time each, for a total time of \(\O(n^2)\). The usual way to avoid this is to perform complicated rebalancing operations, and there are a variety of schemes to do so (e.g. AVL trees, red-black trees, scapegoat trees, and so on). Can we use randomness instead to design a simple data structure that provides expected \(\O(\log n)\) runtime for all operations on a set of size \(n\)?

One solution is a skip list, which is essentially a linked list with multiple levels, where each level contains about half the elements of the level below (ratios other than \(\frac{1}{2}\) may be used instead, trading off the space and time overheads). Whether or not an element is duplicated to the next level is a random decision, which prevents an adversary from coming up with a sequence of operations that degrades performance. The following is an illustration of a skip list:

a skip list is a linked-list with multiple levels, where all elements are in the bottom level, and each level above duplicates about half the elements of the level below; there is also a sentinel head at each level

In the scheme illustrated above, each level has a sentinel head and is terminated by a null pointer. An element may appear at multiple levels, and the list may be traversed by following links rightward or by moving downward at a duplicated element.

To search for an item, we start at the sentinel node at the top left. We maintain the invariant that we are always located at a node whose value is less than the item we are looking for (or is a sentinel head node). Then in each step, we check the node to the right. If its value is greater than the item we are looking for, or if the right pointer is null, we move downward; however, if we are on the lowest level, we cannot move downward, so the item is not in the list. If the value of the node to the right is less than the item, we move to the right. If the value is equal to the item, we have found the item and terminate the search. The following illustrates a search for the item 0 in the list above:

example of searching for an element, which moves down a level if the element to the right is larger than the item being searched for; otherwise the search moves to the right

The search terminates in failure, since we reach the value -3 in the bottom level with the value 1 to the right. The following illustrates searching for the item 9:

example of searching for an element, which moves down a level if the element to the right is larger than the item being searched for; otherwise the search moves to the right

To insert an item, we first do a search for the item. If the item is in the list, then no further work is necessary. If it is not, the search terminates in the bottom level at the maximal element that is less than the item. Therefore, we insert the item immediately to the right. We then make a random decision of whether or not to duplicate the element to the next level. If the answer is yes, we make another random decision for the next level, and so on until the answer is no. If the probability of duplicating an item is \(\frac{1}{2}\), the expected number of copies of an item is \(2\). Thus, insertion only takes an expected constant amount of time in addition to the search.
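
The following is a minimal Python sketch of this scheme (the node layout and names are our own; deletion follows the same search-then-unlink pattern):

    import random

    class Node:
        def __init__(self, value, nxt=None, down=None):
            self.value, self.next, self.down = value, nxt, down

    class SkipList:
        def __init__(self):
            self.head = Node(None)  # sentinel head of the top level

        def search(self, item):
            node = self.head
            while node is not None:
                # Move right while the next value is less than the item.
                while node.next is not None and node.next.value < item:
                    node = node.next
                if node.next is not None and node.next.value == item:
                    return True
                node = node.down  # move down a level; None at the bottom
            return False

        def insert(self, item):
            # Record the rightmost node visited on each level.
            path, node = [], self.head
            while node is not None:
                while node.next is not None and node.next.value < item:
                    node = node.next
                if node.next is not None and node.next.value == item:
                    return  # already present
                path.append(node)
                node = node.down
            # Insert on the bottom level, then flip coins to replicate.
            down = None
            while True:
                if path:
                    pred = path.pop()
                else:
                    # All existing levels used: grow a new top level.
                    self.head = Node(None, None, self.head)
                    pred = self.head
                pred.next = Node(item, pred.next, down)
                down = pred.next
                if random.random() >= 0.5:  # stop with probability 1/2
                    break

    lst = SkipList()
    for x in [3, -3, 1, 12, 7]:
        lst.insert(x)
    print(lst.search(7), lst.search(0))  # True False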

We proceed to determine the expected time of a search on a skip list. We start by determining the expected number of elements on each level \(i\). Suppose the skip list contains \(n\) elements total. Let \(n_i\) be the number of elements on level \(i\). We have \(n_1 = n\), since the first level contains all elements. For \(i > 1\), define \(X_{ij}\) to be an indicator random variable for whether or not the \(j\)th element is replicated to level \(i\). We have

\[\Pr[X_{ij} = 1] = 1/2^{i-1}\]

since it takes \(i-1\) “yes” decisions for the element to make it to level \(i\), and each decision is made independently with probability \(1/2\). Thus, \(\Ex[X_{ij}] = \Pr[X_{ij} = 1] = 1/2^{i-1}\).

The total number of elements \(n_i\) on level \(i\) is just the sum of the indicators \(X_{ij}\). This gives us:

\[\begin{split}\Ex[n_i] &= \lrEx{\sum_j X_{ij}}\\ &= \sum_j \Ex[X_{ij}]\\ &= \sum_j \frac{1}{2^{i-1}}\\ &= \frac{n}{2^{i-1}}\end{split}\]

Next, we consider how many levels we expect the list to contain. Suppose we only maintain levels that have at least one element. From our previous result, the expected number of elements is 1 for level \(i\) when

\[\begin{split}\frac{n}{2^{i-1}} &= 1\\ n &= 2^{i-1}\\ \log n &= i - 1\\ i &= \log n + 1\end{split}\]

Intuitively, this implies we should expect somewhere around \(\log n\) levels. To reason about the expected number more formally, consider the number of elements on some level \(i = \log n + k\). By Markov’s inequality, we have

\[\begin{split}\lrPr{n_{\log n + k} \ge 1} &\le \lrEx{n_{\log n + k}}/1\\ &= \frac{n}{2^{\log n + k - 1}}\\ &= \frac{n}{n2^{k-1}}\\ &= \frac{1}{2^{k-1}}\end{split}\]

Then let \(Z_k\) be an indicator for whether \(n_{\log n + k} \ge 1\). The number of levels \(L\) is at most

\[L \le \log n + \sum_{k=1}^\infty Z_k\]

We have

\[\begin{split}\Ex[Z_k] &= \Pr[Z_k = 1]\\ &= \lrPr{n_{\log n + k} \ge 1}\\ &\le \frac{1}{2^{k-1}}\end{split}\]

Then by linearity of expectation,

\[\begin{split}\Ex[L] &\le \lrEx{\log n + \sum_{k=1}^\infty Z_k}\\ &= \log n + \sum_{k=1}^\infty \Ex[Z_k]\\ &\le \log n + \sum_{k=1}^\infty \frac{1}{2^{k-1}}\\ &= \log n + 2\end{split}\]

Thus, the expected number of levels is \(\O(\log n)\).

Finally, we reason about how many nodes a search encounters on each level \(i\). Suppose we are looking for some element \(y\). Let \(w\) be the maximal element on level \(i\) that is less than \(y\) (or the sentinel head if no such element exists), and let \(z\) be the minimal element on level \(i\) that is greater than or equal to \(y\) (or the null sentinel if no such element exists). Let \(v_1, v_2, \dots\) be the elements on level \(i\) that are less than \(w\), with \(v_1\) the largest, \(v_2\) the next largest, and so on.

$w$ is the maximal element on level $i$ that is less than $y$, $z$ is the minimal element that is greater than or equal to $y$, and $v_1, v_2, \dots$ are the elements on level $i$ that are less than $w$

The search encounters the nodes corresponding to \(w\) and \(z\). It does not touch nodes on level \(i\) that come after \(z\) – either \(y = z\), in which case the search terminates, or \(z > y\) (or \(z\) is null), in which case the search moves down to level \(i-1\). Thus, we need only consider whether the search encounters the nodes corresponding to \(v_1, v_2, \dots\).

We observe that each element on level \(i\) is replicated on level \(i+1\) with probability \(1/2\).

$w, v_1, v_2, \dots$ are each replicated on level $i+1$ with independent probability 1/2

There are two possible routes the search can take to get to \(w\) on level \(i\):

  • The search can come down from level \(i+1\) to level \(i\) at \(w\). This happens exactly when \(w\) is replicated on level \(i+1\). The search process proceeds along the same level until it reaches a node that is greater than the target value \(y\), and since \(w < y\), it would not come down to level \(i\) before \(w\) if \(w\) appears on level \(i+1\). Thus, this possibility occurs with probability \(1/2\).

  • The search can come from the node corresponding to \(v_1\) on level \(i\). This occurs when \(w\) is not replicated on level \(i+1\). In that case, the search must come down to level \(i\) before \(w\), in which case it goes through \(v_1\). This possibility also occurs with probability \(1/2\).

These two possibilities are illustrated below:

if $w$ is replicated on level $i+1$, the search comes down to level $i$ at $w$, in which case $v_1$ is not encountered; if $w$ is not replicated on level $i+1$, then the search comes to $w$ from $v_1$ on level $i$, in which case $v_1$ is encountered; each of these happens with probability 1/2

Thus, \(v_1\) is encountered on level \(i\) exactly when \(w\) is not replicated to level \(i+1\), which occurs with probability \(1/2\). We can inductively repeat the same reasoning for preceding elements. The search only encounters \(v_2\) on level \(i\) if neither \(w\) nor \(v_1\) are replicated on level \(i+1\). Since they are replicated independently with probability \(1/2\), the probability that neither is replicated is \(1/4\). In general, the search encounters \(v_j\) on level \(i\) exactly when none of \(w, v_1, v_2, \dots, v_{j-1}\) are replicated to level \(i+1\), which happens with probability \(1/2^j\).

We can now compute the expected number of nodes encountered on level \(i\). Let \(Y_i\) be this number, and let \(V_{ij}\) be an indicator for whether or not element \(v_j\) is encountered. Then we have

\[Y_i = 2 + \sum_j V_{ij}\]

The first term \(2\) is due to the fact that the search encounters the nodes corresponding to \(w\) and \(z\). From our reasoning above, we have

\[\Ex[V_{ij}] = \Pr[V_{ij} = 1] = \frac{1}{2^j}\]

Then by linearity of expectation,

\[\begin{split}\Ex[Y_i] &= \lrEx{2 + \sum_j V_{ij}}\\ &= 2 + \sum_j \Ex[V_{ij}]\\ &= 2 + \sum_j \frac{1}{2^j}\\ &< 2 + \sum_{k=1}^\infty \frac{1}{2^k}\\ &= 3\end{split}\]

The second-to-last step follows from the fact that \(\sum_j \frac{1}{2^j}\) has finitely many terms since there are only finitely many elements less than \(y\). Thus, the expected number of nodes encountered on a level \(i\) is \(\O(1)\).

Combining this with our previous result that the expected number of levels is \(\O(\log n)\), the expected total number of nodes encountered across all levels is \(\O(\log n)\). Thus, a search on a skip list does an expected \(\O(\log n)\) comparisons. Since insertion does only a constant number of operations in addition to the search, it too does \(\O(\log n)\) operations, and the same reasoning holds for deletion as well. We achieve this expected runtime with a much simpler implementation than a self-balancing search tree.

Quick Sort

Another example of taking advantage of randomness in algorithm design is quick sort. The core algorithm requires selecting a pivot element from a sequence, and then partitioning the sequence into those elements that are less than the pivot and those that are greater than (or equal to) the pivot. The two partitions are recursively sorted. The following is an illustration of how this algorithm works, with the recursive steps elided:

quick sort works by choosing a pivot, partitioning the elements into those that are less than, equal to, and greater than the pivot, and recursively sorting the partitions less than and greater than the pivot

In analyzing the runtime, we focus on the number of comparisons between elements – they dominate the runtime, and other aspects of the algorithm depend on the choice of data structure. Partitioning an array of \(n\) elements requires comparing each element to the pivot, for a total of \(n-1 = \O(n)\) comparisons in each recursive step. If the two partitions corresponding to the elements less than and greater than the pivot are approximately equal in size, the recurrence for the number of comparisons is:

\[T(n) = 2T\parens{\frac{n}{2}} + \O(n)\]

This is the same recurrence as merge sort, and the algorithm achieves a runtime complexity of \(\O(n \log n)\) comparisons for a sequence of \(n\) elements. One way to accomplish this balanced partitioning is to find the median element – this can be done in linear time. However, a simpler solution is to just pick a random element as the pivot. We proceed to analyze this version of the algorithm, which is as follows:

Algorithm 140 (Quick Sort)
\[\begin{split}&\Algorithm QuickSort(A[1..n] : \text{array of $n$ integers}):\\ &~~~\If n \le 1 \return A\\ &~~~p \gets \text{index chosen uniformly at random from } \{1,\dots,n\}\\ &~~~L, R \gets partition(A, p)\\ &~~~\Return QuickSort(L) + A[p] + QuickSort(R)\\ \\ &\Subroutine partition(A[1..n], p):\\ &~~~L \gets \text{empty array}\\ &~~~R \gets \text{empty array}\\ &~~~\For i = 1 \tot n:\\ &~~~~~~\If i \ne p \andt A[i] < A[p]:\\ &~~~~~~~~~L \gets L + A[i]\\ &~~~~~~\ElseIf i \ne p:\\ &~~~~~~~~~R \gets R + A[i]\\ &~~~\Return L, R\end{split}\]
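
The pseudocode translates directly into Python; the version below is a sketch that, like the pseudocode, builds new arrays rather than partitioning in place:

    import random

    def quick_sort(A):
        """Randomized quick sort: expected O(n log n) comparisons."""
        if len(A) <= 1:
            return A[:]
        p = random.randrange(len(A))  # pivot index, uniformly at random
        pivot = A[p]
        L = [x for i, x in enumerate(A) if i != p and x < pivot]
        R = [x for i, x in enumerate(A) if i != p and x >= pivot]
        return quick_sort(L) + [pivot] + quick_sort(R)

    print(quick_sort([37, 15, 86, 4, 52, 70, 99, 0]))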

Let \(X\) be a random variable corresponding to the rank of the chosen pivot – the pivot has rank \(i\) if it is the \(i\)th largest element. The algorithm still works well when \(n/6 \le X \le 5n/6\). By examining the probability distribution of \(X\), we observe that:

\[\begin{split}\Ex[X] &= \frac{n+1}{2} \approx \frac{n}{2}\\ \lrPr{\frac{n}{6} \le X \le \frac{5n}{6}} &\approx \frac{2}{3}\end{split}\]

This follows from the fact that the element is chosen uniformly at random – half the outcomes have rank above \(\frac{n}{2}\) and half below, and two-thirds have rank between \(n/6\) and \(5n/6\). Thus, we obtain a good pivot both with high probability and in expectation. Over the full course of the algorithm, most of the pivots chosen are good, and the expected runtime, in terms of number of comparisons, is still \(\O(n \log n)\).

For a more detailed analysis of the algorithm, we first modify it so that all the randomness is fixed at the beginning of the algorithm. Given \(n\) elements, we start by choosing a permutation of the numbers \(\{1,2,\dots,n\}\) uniformly at random to be the priorities of the elements. Then when choosing a pivot for a subset of the elements, we pick the element that has the lowest priority over that subset. The following illustrates an execution of this modified algorithm:

modified quick sort works by choosing priorities for the elements, then picking the element with lowest priority to be the pivot at each point

In the first step, the element 37 has the lowest priority, so it is chosen as the pivot. At the second level, the element 15 has the minimal priority in the first recursive call, while the element 86 has the lowest priority in the second call. At the third level of the recursion, the element 4 has lowest priority over its subset, and similarly the element 52 for the subset \(\{70, 52\}\).

The full definition of the modified algorithm is as follows:

Algorithm 141 (Modified Quick Sort)
\[\begin{split}&\Algorithm ModifiedQuickSort(A[1..n] : \text{array of $n$ integers}):\\ &~~~P \gets \text{a random permutation of } \{1,\dots,n\} \text{, chosen uniformly at random}\\ &~~~\Return prioritizedSort(A, P)\\ \\ &\Subroutine prioritizedSort(A[1..n], P[1..n]):\\ &~~~\If n \le 1 \return A\\ &~~~p \gets \text{index of the minimal element in $P$}\\ &~~~L, PL, R, PR \gets partition(A, P, p)\\ &~~~\Return prioritizedSort(L, PL) + A[p] + prioritizedSort(R, PR)\\ \\ &\Subroutine partition(A[1..n], P[1..n], p):\\ &~~~L \gets \text{empty array}\\ &~~~PL \gets \text{empty array}\\ &~~~R \gets \text{empty array}\\ &~~~PR \gets \text{empty array}\\ &~~~\For i = 1 \tot n:\\ &~~~~~~\If i \ne p \andt A[i] < A[p]:\\ &~~~~~~~~~L \gets L + A[i]\\ &~~~~~~~~~PL \gets PL + P[i]\\ &~~~~~~\ElseIf i \ne p:\\ &~~~~~~~~~R \gets R + A[i]\\ &~~~~~~~~~PR \gets PR + P[i]\\ &~~~\Return L, PL, R, PR\end{split}\]

This modified algorithm matches the behavior of the original randomized quick sort in two important ways:

  1. In both algorithms, when a pivot is chosen among a subset consisting of \(m\) elements, each element is chosen with a uniform probability \(1/m\). This is explicitly the case in the original version. In the modified version, the priorities assigned to each of the \(m\) elements are unique, and for any set of \(m\) priorities, each element receives the lowest priority in exactly \(1/m\) of the permutations of those priorities.

  2. The two algorithms compare elements in exactly the same situation – when partitioning the array, each element is compared to the pivot. (The modified version also compares priorities to find the minimal, but we can just ignore that in our analysis. Even if we did consider them, they just increase the number of comparisons by a constant factor of two.)

Thus, the modified algorithm models the execution of the original algorithm in both the evolution of the recursion as well as the element comparisons. We proceed to analyze the modified algorithm, and the result will equally apply to the original.

For simplicity, we assume for the purposes of analysis that the initial array does not contain any duplicate elements. We have defined our two versions of quick sort to be stable, meaning that they preserve the relative ordering of duplicate elements. Thus, even when duplicates are present, the algorithm distinguishes between them, resulting in the same behavior as if there were no duplicates.

Claim 142

Let \(A\) be an array of \(n\) elements, and let \(S\) be the result of sorting \(A\). Let \(priority(S[i])\) be the priority that was assigned to the element \(S[i]\) in the modified quick-sort algorithm. Then \(S[i]\) and \(S[j]\) were compared by the algorithm if and only if \(i \ne j\), and one of the following holds:

\[\begin{split}priority(S[i]) &= \min_{i \le k \le j}\{priority(S[k])\} \text{, or}\\ priority(S[j]) &= \min_{i \le k \le j}\{priority(S[k])\}\end{split}\]

In other words, \(S[i]\) and \(S[j]\) are compared exactly when the lowest priority among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\) is that of either \(S[i]\) or \(S[j]\).

The claim above holds regardless of the original order of the elements in \(A\). For example, take a look at the result from our illustration of the modified algorithm above:

sorted array with assigned priorities; two elements were compared if and only if at least one of their priorities is lower than those of all the elements that lie between them

As stated above, two elements are compared only in the partitioning step, and one of the two must be the pivot. Referring to the execution above, we can see that 0 and 15 were compared in the second level of the recursion, whereas 52 and 99 were never compared to each other. Now consider a different initial ordering of the elements, but with the same priorities assigned to each element:

the same elements and priorities, but with a different initial order; the algorithm evolves in the exact same way with the same subsets at each step, but each subset may be reordered

The execution of the algorithm is essentially the same! It still picks 37 as the first pivot, since it has the lowest priority, and forms the same partitions; the only possible difference is the ordering of a partition. Subsequent steps again pick the same pivots, since they are finding the minimal priority over the same subset of elements and associated priorities. Thus, given just the sorted set of elements and their priorities, we can determine which elements were compared as stated in Claim 142, regardless of how they were initially ordered. We proceed to prove Claim 142:

Proof 143

First, we show that if the lowest priority among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\) is that of either \(S[i]\) or \(S[j]\), then \(S[i]\) and \(S[j]\) are compared. Without loss of generality, suppose \(S[i]\) is the element with the lowest priority in this subset. This implies that no element in \(S[i+1], \dots, S[j]\) will be picked as a pivot prior to \(S[i]\) being picked. Thus, all pivots chosen by the algorithm before \(S[i]\) must be either less than \(S[i]\) or greater than \(S[j]\). For each such pivot, the elements \(S[i], S[i+1], \dots, S[j]\) are all placed in the same partition – they are in sorted order, and if the pivot is less than \(S[i]\), all of these elements are larger than the pivot, while if the pivot is greater than \(S[j]\), then all these elements are smaller than the pivot. Then when the algorithm chooses \(S[i]\) as the pivot, \(S[j]\) is in the same subset of the array as \(S[i]\) and thus will be compared to \(S[i]\) in the partitioning step. The same reasoning holds for when \(S[j]\) has the lowest priority – it will eventually be chosen as the pivot, at which point \(S[i]\) will be in the same subset and will be compared to \(S[j]\).

Next, we show that if the lowest priority among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\) is neither that of \(S[i]\) nor of \(S[j]\), then \(S[i]\) and \(S[j]\) are never compared. As long as the algorithm chooses pivots outside of the set \(S[i], S[i+1], \dots, S[j]\), the two elements \(S[i]\) and \(S[j]\) remain in the same subset of the array without having been compared to each other – they remain in the same subset since the pivots so far have been either less than both \(S[i]\) and \(S[j]\) or greater than both of them, and they have not been compared yet since neither has been chosen as a pivot. When the algorithm first chooses a pivot from among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\), the pivot it chooses must be one of \(S[i+1], \dots, S[j-1]\), since one of those elements has lower priority than both \(S[i]\) and \(S[j]\). In the partitioning step for that pivot, \(S[i]\) and \(S[j]\) are placed into separate partitions, since \(S[i]\) is smaller than the pivot while \(S[j]\) is larger than the pivot. Since \(S[i]\) and \(S[j]\) end up in separate partitions, they cannot be compared in subsequent steps of the algorithm.

We can now proceed to compute the expected number of comparisons. Let \(X_{ij}\) be an indicator random variable such that \(X_{ij} = 1\) if \(S[i]\) and \(S[j]\) were compared at some point in the algorithm, \(X_{ij} = 0\) otherwise. From Claim 142, we can conclude that

\[\Pr[X_{ij} = 1] = \frac{2}{j-i+1}\]

This is because each of the \(j-i+1\) elements in \(S[i], S[i+1], \dots, S[j-1], S[j]\) has an equal chance of having the lowest priority among those elements, and \(S[i]\) and \(S[j]\) are compared exactly when one of those two elements has the lowest priority in this set.

We also observe that a pair of elements is compared at most once – they are compared once in the partitioning step if they are in the same subarray and one of the two is chosen as a pivot, and after the partitioning step is over, the pivot is never compared to anything else. Thus, the total number of comparisons \(X\) is just the number of pairs that are compared, out of the \(n(n-1)/2\) distinct pairs:

\[X = \sum_{i=1}^n \sum_{j=i+1}^n X_{ij}\]

We have

\[\Ex[X_{ij}] = \Pr[X_{ij} = 1] = \frac{2}{j-i+1}\]

and by linearity of expectation, we get

\[\begin{split}\Ex[X] &= \sum_{i=1}^n \sum_{j=i+1}^n \Ex[X_{ij}]\\ &= \sum_{i=1}^n \sum_{j=i+1}^n \frac{2}{j-i+1}\end{split}\]

Let \(t = j - i + 1\). We can then rewrite this sum as

\[\begin{split}\Ex[X] &= \sum_{i=1}^n \sum_{j=i+1}^n \frac{2}{j-i+1}\\ &= \sum_{i=1}^n \sum_{t=2}^{n-i+1} \frac{2}{t}\\ &= 2 \cdot \sum_{i=1}^n \sum_{t=2}^{n-i+1} \frac{1}{t}\end{split}\]

The quantity \(\sum_{t=2}^{n-i+1} \frac{1}{t}\) is a partial sum of the harmonic series, which has an upper bound

\[\sum_{m=1}^k \frac{1}{m} \le \ln k + 1\]

Thus, we have

\[\begin{split}\sum_{t=2}^{n-i+1} \frac{1}{t} &= -1 + \sum_{t=1}^{n-i+1} \frac{1}{t}\\ &\le -1 + \ln (n-i+1) + 1\\ &= \ln (n-i+1)\\ &\le \ln n \quad\text{(for $i \ge 1$)}\end{split}\]

Plugging this into our expression for \(\Ex[X]\), we get

\[\begin{split}\Ex[X] &= 2 \cdot \sum_{i=1}^n \sum_{t=2}^{n-i+1} \frac{1}{t}\\ &\le 2 \cdot \sum_{i=1}^n \ln n\\ &= 2 n \ln n\\ &= \O(n \log n)\end{split}\]

Thus, the expected number of comparisons is \(\O(n \log n)\), as previously claimed. Since the modified quick-sort algorithm models the behavior of the original randomized algorithm, this result also applies to the latter.

In practice, randomized quick sort typically outperforms both the deterministic version and other deterministic \(\O(n \log n)\) sorts such as merge sort, and it is much simpler to implement in place (i.e. without using an auxiliary data structure). Thus, variants of randomized quick sort are widely used in practice.
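To make the algorithm concrete, here is a short Python sketch of randomized quick sort; it is our own illustration, not part of the text's pseudocode. It uses the standard in-place Lomuto partitioning scheme rather than the stable version analyzed above, and it counts comparisons so that the \(2 n \ln n\) bound can be checked empirically.

```python
import math
import random

def randomized_quicksort(arr, lo=0, hi=None):
    """Sort arr[lo..hi] in place with uniformly random pivots.
    Returns the number of element comparisons performed."""
    if hi is None:
        hi = len(arr) - 1
    if lo >= hi:
        return 0
    # choose a pivot uniformly at random and move it to the end
    p = random.randint(lo, hi)
    arr[p], arr[hi] = arr[hi], arr[p]
    pivot, i, comparisons = arr[hi], lo, 0
    for j in range(lo, hi):  # partition: one comparison per element
        comparisons += 1
        if arr[j] < pivot:
            arr[i], arr[j] = arr[j], arr[i]
            i += 1
    arr[i], arr[hi] = arr[hi], arr[i]  # pivot lands in its final position
    comparisons += randomized_quicksort(arr, lo, i - 1)
    comparisons += randomized_quicksort(arr, i + 1, hi)
    return comparisons

n = 10_000
data = random.sample(range(n), n)
print(randomized_quicksort(data), "comparisons; 2n ln n =", round(2 * n * math.log(n)))
```

On a random permutation of 10,000 elements, the observed count typically lands well under the bound \(2 n \ln n \approx 184{,}000\).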

Exercise 144

Randomized quick sort has worst-case running time \(\Theta(n^2)\) but expected running time \(\O(n \log n)\) when run on any array of \(n\) elements.

Prove that for an array \(A\) of \(n\) elements and any constant \(c > 0\),

\[\lim_{n \to \infty} \Pr[\text{randomized quick sort on input $A$ takes at least $c n^2$ steps}] = 0\]

That is, randomized quick sort is very unlikely to run in \(\Omega(n^2)\) time.

Hint: The number of steps a randomized algorithm takes is a random variable. You do not need to know anything about how randomized quick sort works; just use the facts above and Markov’s inequality.

Probabilistic Complexity Classes

The Fermat primality test is an example of a one-sided-error randomized algorithm – an input that is pseudoprime is always accepted, while a non-pseudoprime is sometimes accepted and sometimes rejected. We can define complexity classes corresponding to decision problems that have efficient one-sided-error randomized algorithms.

Definition 145 (RP)

\(\RP\) is the class of languages that have efficient one-sided-error randomized algorithms with no false positives. A language \(L\) is in \(\RP\) if there is an efficient randomized algorithm \(A\) such that:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge c\)

  • if \(x \notin L\), \(\Pr[A \text{ rejects } x] = 1\)

Here, \(c\) must be a constant greater than 0. Often, \(c\) is chosen to be \(\frac{1}{2}\).

Definition 146 (coRP)

\(\coRP\) is the class of languages that have efficient one-sided-error randomized algorithms with no false negatives. A language \(L\) is in \(\coRP\) if there is an efficient randomized algorithm \(A\) such that:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] = 1\)

  • if \(x \notin L\), \(\Pr[A \text{ rejects } x] \ge c\)

Here, \(c\) must be a constant greater than 0. Often, \(c\) is chosen to be \(\frac{1}{2}\).

\(\RP\) stands for randomized polynomial time. If a language \(L\) is in \(\RP\), then its complement language \(\overline{L}\) is in \(\coRP\) – an algorithm for \(L\) with no false positives can be converted into an algorithm for \(\overline{L}\) with no false negatives. The Fermat test produces no false negatives, so the language

\[\begin{split}\mathrm{PSEUDOPRIMES} = \left\{ \begin{aligned} m \in \N :~ &a^{m-1} \bmod{m} \equiv 1 \text{ for at least half}\\ &\text{the witnesses } 1 \le a \le m - 1 \end{aligned} \right\}\end{split}\]

is in \(\coRP\), and its complement \(\overline{\mathrm{PSEUDOPRIMES}}\) is in \(\RP\).

The constant \(c\) in the definition of \(\RP\) and \(\coRP\) is arbitrary. With any \(c > 0\), we can amplify the probability of success by repeatedly running the algorithm. For instance, if we have a randomized algorithm \(A\) with no false positives for a language \(L\), we have:

\[\begin{split}x \in L &\implies \Pr[A \text{ accepts } x] \ge c\\ x \notin L &\implies \Pr[A \text{ accepts } x] = 0\end{split}\]

We can construct a new algorithm \(B\) as follows:

\[\begin{split}&\Algorithm B(x):\\ &~~~\Run A \ont x \text{ twice}\\ &~~~\If A \text{ accepts at least once}, \accept\\ &~~~\Else \reject\end{split}\]

\(B\) just runs \(A\) twice on the input, accepting if at least one run of \(A\) accepts; \(B\) only rejects if both runs of \(A\) reject. Thus, the probability that \(B\) rejects \(x\) is:

\[\begin{split}\Pr[B \text{ rejects } x] &= \Pr[A \text{ rejects $x$ in run 1}, A \text{ rejects $x$ in run 2}]\\ &= \Pr[A \text{ rejects $x$ in run 1}] \cdot \Pr[A \text{ rejects $x$ in run 2}]\\ &= \Pr[A \text{ rejects } x]^2\end{split}\]

Here, we have used the fact that the two runs of \(A\) are independent, so the probability of \(A\) rejecting twice is the product of the probabilities it rejects each time. This gives us:

\[\begin{split}x \in L &\implies \Pr[B \text{ rejects } x] \le (1 - c)^2\\ x \notin L &\implies \Pr[B \text{ rejects } x] = 1\end{split}\]

or equivalently:

\[\begin{split}x \in L &\implies \Pr[B \text{ accepts } x] \ge 1 - (1 - c)^2\\ x \notin L &\implies \Pr[B \text{ accepts } x] = 0\end{split}\]

Repeating this reasoning, if we modify \(B\) to run \(A\) \(k\) times, we get:

\[\begin{split}x \in L &\implies \Pr[B \text{ accepts } x] \ge 1 - (1 - c)^k\\ x \notin L &\implies \Pr[B \text{ accepts } x] = 0\end{split}\]

By applying the inequality \(1 + x \le e^x\), which yields \((1 + x)^r \le e^{rx}\) for \(r \ge 0\), we get:

\[x \in L \implies \Pr[B \text{ accepts } x] \ge 1 - e^{-ck}\]

If we want a particular lower bound \(1 - \varepsilon\) for acceptance of true positives, we need

\[\begin{split}e^{-ck} &\le \varepsilon\\ -ck &\le \ln \varepsilon\\ k &\ge \frac{-\ln \varepsilon}{c} = \frac{\ln(1/\varepsilon)}{c}\end{split}\]

This is a constant for any constants \(c\) and \(\varepsilon\). Thus, we can amplify the probability of accepting a true positive from \(c\) to \(1 - \varepsilon\) for any \(\varepsilon\) with a constant \(\ln(1/\varepsilon)/c\) number of repetitions. The same reasoning can be applied to amplify one-sided-error algorithms with no false negatives.
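As a quick sanity check of this calculation, the following sketch (a helper of our own, not from the text) computes the number of repetitions needed for given values of \(c\) and \(\varepsilon\):

```python
import math

def repetitions_needed(c, eps):
    """Runs of a one-sided-error algorithm that succeeds with probability
    at least c needed to push the failure probability below eps,
    using the bound k >= ln(1/eps) / c derived above."""
    return math.ceil(math.log(1 / eps) / c)

# amplifying from c = 1/2 to a 99% acceptance probability:
print(repetitions_needed(0.5, 0.01))  # 10, since e^(-0.5 * 10) < 0.01
```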

One-sided-error randomized algorithms are a special case of two-sided-error randomized algorithms. Such an algorithm for a language \(L\) can produce the wrong result for an input \(x\) whether or not \(x \in L\). However, the probability of getting the wrong result is bounded by a constant.

Definition 147 (BPP)

\(\BPP\) is the class of languages that have efficient two-sided-error randomized algorithms. A language \(L\) is in \(\BPP\) if there is an efficient randomized algorithm \(A\) such that:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge c\)

  • if \(x \notin L\), \(\Pr[A \text{ rejects } x] \ge c\)

Here, \(c\) must be a constant greater than \(\frac{1}{2}\), so that the algorithm produces the correct answer the majority of the time. Often, \(c\) is chosen to be \(\frac{2}{3}\) or \(\frac{3}{4}\).

\(\BPP\) stands for bounded-error probabilistic polynomial time. Given the symmetric definition of a two-sided error algorithm, it is clear that the class \(\BPP\) is closed under complement – if a language \(L \in \BPP\), then \(\overline{L} \in \BPP\) as well.

Languages in \(\RP\) and in \(\coRP\) both trivially satisfy the conditions for \(\BPP\); for instance, we have the following for \(\RP\):

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge c\)

  • if \(x \notin L\), \(\Pr[A \text{ rejects } x] = 1 \ge c\)

Thus, \(\RP \cup \coRP \subseteq \BPP\). By similar reasoning, we can relate \(\P\) to these classes as follows:

\begin{gather*} \P \subseteq \RP \subseteq \BPP\\ \P \subseteq \coRP \subseteq \BPP \end{gather*}

As with one-sided-error algorithms, the probability of success for two-sided-error randomized algorithms can be amplified arbitrarily. However, we do not yet have the tools we need to reason about how to do so, so we will come back to this later.

One-sided-error and two-sided-error randomized algorithms are known as Monte Carlo algorithms. Such algorithms have a bounded runtime, but they may produce the wrong answer. Contrast this with Las Vegas algorithms, which always produce the correct answer, but where the runtime bound only holds in expectation. We can define a complexity class for languages that have Las Vegas algorithms with expected runtime that is polynomial in the size of the input.

Definition 148 (ZPP)

\(\ZPP\) is the class of languages that have efficient Las Vegas algorithms. A language \(L\) is in \(\ZPP\) if there is a randomized algorithm \(A\) such that:

  • If \(x \in L\), \(A\) always accepts \(x\).

  • If \(x \notin L\), \(A\) always rejects \(x\).

  • The expected runtime of \(A\) on input \(x\) is \(\O(\abs{x}^k)\) for some constant \(k\).

There is a relationship between Monte Carlo and Las Vegas algorithms: if a language \(L\) has both a one-sided-error algorithm with no false positives and a one-sided-error algorithm with no false negatives, then it has a Las Vegas algorithm, and vice versa. In terms of complexity classes, we have \(L\) is in both \(\RP\) and \(\coRP\) if and only if \(L\) is in \(\ZPP\). This implies that

\[\RP \cap \coRP = \ZPP\]

We demonstrate this as follows. Suppose \(L\) is in \(\RP \cap \coRP\). Then it has an efficient, one-sided-error algorithm \(A\) with no false positives such that:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge \frac{1}{2}\)

  • if \(x \notin L\), \(\Pr[A \text{ accepts } x] = 0\)

Here, we have selected \(c = \frac{1}{2}\) for concreteness. \(L\) also has an efficient, one-sided algorithm \(B\) with no false negatives such that:

  • if \(x \in L\), \(\Pr[B \text{ rejects } x] = 0\)

  • if \(x \notin L\), \(\Pr[B \text{ rejects } x] \ge \frac{1}{2}\)

We can construct a new algorithm \(C\) as follows:

\[\begin{split}&\Algorithm C(x):\\ &~~~\textbf{Repeat}:\\ &~~~~~~\Run A \ont x, \accept \text{if } A \accepts\\ &~~~~~~\Run B \ont x, \reject \text{if } B \rejects\end{split}\]

Since \(A\) only accepts \(x \in L\) and \(C\) only accepts when \(A\) does, \(C\) only accepts \(x \in L\). Similarly, \(B\) only rejects \(x \notin L\), so \(C\) only rejects \(x \notin L\). Thus, \(C\) always produces the correct answer for a given input.

To show that \(L \in \ZPP\), we must also demonstrate that the expected runtime of \(C\) is polynomial. For each iteration \(i\) of \(C\), we have:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x \text{ in iteration } i] \ge \frac{1}{2}\)

  • if \(x \notin L\), \(\Pr[B \text{ rejects } x \text{ in iteration } i] \ge \frac{1}{2}\)

Thus, if \(C\) gets to iteration \(i\), the probability that \(C\) terminates in that iteration is at least \(\frac{1}{2}\). We can model the number of iterations as a random variable \(X\), which satisfies the tail bound:

\[\Pr[X > k] \le \parens{\frac{1}{2}}^k\]

This is similar to a geometric distribution, where

\[\Pr[Y > k] = (1 - p)^k\]

for some \(p \in (0, 1]\). The expected value of such a distribution is \(\Ex[Y] = 1/p\). In our case, since \(\Ex[X] = \sum_{k \ge 0} \Pr[X > k] \le \sum_{k \ge 0} 2^{-k}\), this gives us:

\[\Ex[X] \le 2\]

Thus, the expected number of iterations of \(C\) is no more than two, and since \(A\) and \(B\) are both efficient, calling them twice is also efficient.
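A minimal Python sketch of the construction of \(C\), assuming \(A\) and \(B\) are supplied as functions that return True to accept and False to reject; the toy algorithms below are our own illustration, not part of the text:

```python
import random

def las_vegas(A, B, x):
    """Alternate A (no false positives) and B (no false negatives)
    until one gives a definitive answer; the expected number of
    iterations is at most 2."""
    while True:
        if A(x):         # A accepts only when x is in L
            return True
        if not B(x):     # B rejects only when x is not in L
            return False

# toy language L = strings of even length, with c = 1/2
def A(x):
    return len(x) % 2 == 0 and random.random() < 0.5

def B(x):
    return len(x) % 2 == 0 or random.random() < 0.5

print(las_vegas(A, B, "abcd"), las_vegas(A, B, "abc"))  # True False
```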

We have shown that \(\RP \cap \coRP \subseteq \ZPP\). We will proceed to show that \(\ZPP \subseteq \RP\), meaning that if a language has an efficient Las Vegas algorithm, it also has an efficient one-sided-error algorithm with no false positives. Assume that \(C\) is a Las Vegas algorithm for \(L \in \ZPP\), with expected runtime of no more than \(\abs{x}^k\) steps for an input \(x\) and some constant \(k\). We construct a new algorithm \(A\) as follows:

\[\begin{split}&\Algorithm A(x):\\ &~~~\Run C \ont x \text{ for $2\abs{x}^k$ steps (or until it halts, whichever comes first)}\\ &~~~\Accept \text{ if $C$ accepted, else} \reject\end{split}\]

\(A\) is clearly efficient in the size of the input \(x\). It also has no false positives, since it only accepts if \(C\) accepts within the first \(2\abs{x}^k\) steps, and \(C\) always produces the correct answer. All that is left is to show that if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge \frac{1}{2}\) (again, we arbitrarily choose \(c = \frac{1}{2}\)).

Suppose \(x \in L\), and let \(S\) be the number of steps before \(C\) accepts \(x\). By assumption, we have

\[\Ex[S] \le \abs{x}^k\]

\(A\) runs \(C\) for \(2\abs{x}^k\) steps, so we need to bound the probability that \(C\) takes more than \(2\abs{x}^k\) steps on \(x\). We have:

\[\begin{split}\Pr[S > 2\abs{x}^k] &\le \Pr[S \ge 2\abs{x}^k]\\ &\le \frac{\Ex[S]}{2\abs{x}^k}\\ &\le \frac{\abs{x}^k}{2\abs{x}^k}\\ &= \frac{1}{2}\end{split}\]

Here, we applied Markov’s inequality in the second step. Then:

\[\begin{split}\Pr[A \text{ accepts } x] &= \Pr[S \le 2\abs{x}^k]\\ &= 1 - \Pr[S > 2\abs{x}^k]\\ &\ge \frac{1}{2}\end{split}\]

Thus, \(A\) is indeed a one-sided-error algorithm with no false positives, so \(L \in \RP\). A similar construction can be used to demonstrate that \(L \in \coRP\), allowing us to conclude that \(\ZPP \subseteq \RP \cap \coRP\). Combined with our previous proof that \(\RP \cap \coRP \subseteq \ZPP\), we have \(\RP \cap \coRP = \ZPP\).
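The timeout construction can also be sketched in Python. Here we model a Las Vegas algorithm as a generator that yields once per simulated step and returns its answer when it halts; this step-counting interface is our own device for illustration, not a standard one.

```python
def rp_from_zpp(C, x, k):
    """Run the Las Vegas generator C(x) for at most 2*|x|**k steps;
    accept only if it accepts within the budget, so there are
    no false positives."""
    gen = C(x)
    try:
        for _ in range(2 * len(x) ** k):
            next(gen)                 # advance C by one step
    except StopIteration as halted:
        return bool(halted.value)     # C halted: its answer is always correct
    return False                      # out of time: reject

def C(x):  # trivial "Las Vegas" algorithm: |x| steps, then accept
    for _ in range(len(x)):
        yield
    return True

print(rp_from_zpp(C, "abcd", 1))  # True: C halts within the 8-step budget
```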

One final observation we will make is that if a language \(L\) is in \(\RP\), then it is also in \(\NP\). Since \(L \in \RP\), there is an efficient, one-sided-error randomized algorithm \(A\) to decide \(L\). \(A\) is allowed to make random choices, and we can model each choice as a coin flip, i.e. being either 0 or 1 according to some probability distribution. We can represent the combination of these choices in a particular run of \(A\) as a binary string \(c\). This enables us to write an efficient verifier \(V\) for \(L\) as follows:

\[\begin{split}&\Algorithm V(x, c = c_1 c_2 \dots c_m):\\ &~~~\text{Simulate the execution of $A$ on $x$, taking choice $c_i$}\\ &~~~~~~\text{for the $i$th random decision in $A$}\\ &~~~\Accept \text{if } A \accepts, \text{else} \reject\end{split}\]

If \(x \in L\), then \(\Pr[A \text{ accepts } x] \ge \frac{1}{2}\), so at least half the possible sequences of random choices lead to \(A\) accepting \(x\). On the other hand, if \(x \notin L\), then \(\Pr[A \text{ rejects } x] = 1\), so all sequences of random choices lead to \(A\) rejecting \(x\). Thus, \(V\) accepts at least half of all possible certificates when \(x \in L\), and \(V\) rejects all certificates when \(x \notin L\), so \(V\) is a verifier for \(L\). Since \(A\) is efficient, \(V\) is also efficient.

We summarize the known relationships between complexity classes as follows:

\begin{gather*} \P \subseteq \ZPP \subseteq \RP \subseteq \NP\\ \P \subseteq \ZPP \subseteq \RP \subseteq \BPP\\ \P \subseteq \ZPP \subseteq \coRP \subseteq \coNP\\ \P \subseteq \ZPP \subseteq \coRP \subseteq \BPP \end{gather*}

These relationships are represented pictorially, with edges signifying containment, as follows:

complexity-class hierarchy, with arrows from P to ZPP, from ZPP to RP and coRP, from RP to NP and BPP, and from coRP to coNP and BPP

Here, \(\coNP\) is the class of languages whose complements are in \(\NP\):

\[\coNP = \{L : \overline{L} \in \NP\}\]

We do not know if any of the containments above are strict, and we do not know the relationships between \(\NP\), \(\coNP\), and \(\BPP\). It is commonly believed that \(\P\) and \(\BPP\) are equal and thus \(\BPP\) is contained in \(\NP \cap \coNP\), that neither \(\P\) nor \(\BPP\) contain all of \(\NP\) or all of \(\coNP\), and that \(\NP\) and \(\coNP\) are not equal. However, none of these conjectures has been proven 3.

3

It is known, however, that if \(\P = \NP\), then \(\P = \BPP\). This is because \(\BPP\) is contained within the polynomial hierarchy, and one of the consequences of \(\P = \NP\) is the “collapse” of the hierarchy. The details are beyond the scope of this text.

Monte Carlo Methods and Concentration Bounds

Some algorithms rely on repeated trials to compute an approximation of a result. Such algorithms are called Monte Carlo methods, which are distinct from Monte Carlo algorithms – a Monte Carlo method does repeated sampling of a probability distribution, while a Monte Carlo algorithm is an algorithm that may produce the wrong result within some bounded probability. The former are commonly approximation algorithms for estimating a quantity, while the latter are exact algorithms that sometimes produce the wrong result. As we will see, there is a connection – repeated sampling (the strategy of a Monte Carlo method) can be used to amplify the probability of getting the correct result from a Monte Carlo algorithm.

As an example of a Monte Carlo method, we consider an algorithm for estimating the value of \(\pi\), the area of a unit circle (a circle with a radius of one). If such a circle is located in the plane, centered at the origin, its top-right quadrant is as follows:

overlap between a circle of radius one and a square with sides one, with the square located at the top-right of the circle

This quadrant falls within the square interval between \((0, 0)\) and \((1, 1)\). The area of the quadrant is \(\pi/4\), while the area of the interval is \(1\). Thus, if we choose a random point in this interval, the probability that it lies within the quadrant of the circle is \(\pi/4\) 4. This motivates the following algorithm for estimating the value of \(\pi\):

\[\begin{split}&\Algorithm ComputePi(n):\\ &~~~count \gets 0\\ &~~~\For i = 1 \tot n:\\ &~~~~~~x \gets \text{a value in $[0, 1)$ chosen uniformly at random}\\ &~~~~~~y \gets \text{a value in $[0, 1)$ chosen uniformly at random}\\ &~~~~~~\If \sqrt{x^2 + y^2} \le 1:\\ &~~~~~~~~~count \gets count + 1\\ &~~~\Return 4 \cdot count / n\end{split}\]
4

This follows from applying the tools of continuous probability, which we will not discuss in any detail in this text.

The algorithm randomly chooses points between \((0, 0)\) and \((1, 1)\), counting how many of them fall within the unit circle, which is when the point’s distance from the origin is at most one. We expect this number to be \(\pi/4\) of the samples, so the algorithm returns four times the ratio of the points that fell within the circle as an estimate of \(\pi\).
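A direct Python transcription of ComputePi follows; comparing \(x^2 + y^2\) to 1 avoids computing the square root without changing the outcome of the test.

```python
import random

def compute_pi(n):
    """Estimate pi from n uniform samples in the unit square."""
    count = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1:  # point falls inside the quarter circle
            count += 1
    return 4 * count / n

print(compute_pi(10**6))  # typically within a few thousandths of pi
```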

We formally show that the algorithm returns the value of \(\pi\) in expectation. Let \(X\) be a random variable corresponding to the value returned, and let \(Y_i\) be an indicator random variable that is 1 if the \(i\)th point falls within the unit circle. As argued previously, we have:

\[\Pr[Y_i = 1] = \pi/4\]

We also have that

\[X = 4/n \cdot (Y_1 + \dots + Y_n)\]

This leads to the following expected value for \(X\):

\[\begin{split}\Ex[X] &= 4/n \cdot (\Ex[Y_1] + \Ex[Y_2] + \dots + \Ex[Y_n])\\ &= 4/n \cdot (\Pr[Y_1 = 1] + \Pr[Y_2 = 1] + \dots + \Pr[Y_n = 1])\\ &= 4/n \cdot (\pi/4 + \pi/4 + \dots + \pi/4)\\ &= \pi\end{split}\]

The expected value of the algorithm’s output is indeed \(\pi\).

Chernoff Bounds

When estimating \(\pi\), how likely are we to actually get a result that is close to the expected value? While we can apply Markov’s inequality, the bound we get is very loose, and it does not give us any information about how the number of samples affects the quality of the estimate. The law of large numbers states that the actual result converges to the expected value as the number of samples \(n\) increases. But how fast does it converge? There are many types of concentration bounds that allow us to reason about the deviation of a random variable from its expectation; Markov’s inequality is just the simplest one. Chernoff bounds are another tool. There are multiple variants of Chernoff bounds, but the ones we will use are as follows.

Let \(X = X_1 + \dots + X_n\) be the sum of independent indicator random variables, where the indicator \(X_i\) has the expectation

\[\Ex[X_i] = \Pr[X_i = 1] = p_i\]

Let \(\mu\) be the expected value of \(X\):

\[\mu = \Ex[X] = \sum_i \Ex[X_i] = \sum_i p_i\]

Here, we’ve applied linearity of expectation to relate the expectation of \(X\) to that of the indicators \(X_i\). The Chernoff bounds then are as follows.

Theorem 149 (Chernoff Bound – Upper Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\), and \(\mu = \sum_i p_i\). Suppose we wish to bound the probability of \(X\) exceeding its expectation \(\mu\) by at least a factor of \(1 + \delta\), where \(\delta > 0\). Then

\[\Pr[X \ge (1 + \delta)\mu] \le \parens{\frac{e^\delta}{(1+\delta)^{1+\delta}}}^\mu\]
Theorem 150 (Chernoff Bound – Lower Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\), and \(\mu = \sum_i p_i\). Suppose we wish to bound the probability of \(X\) being below its expectation \(\mu\) by at least a factor of \(1 - \delta\), where \(0 < \delta < 1\). Then

\[\Pr[X \le (1 - \delta)\mu] \le \parens{\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}}^\mu\]

These inequalities can be unwieldy to work with, so we often use the following simpler, looser bounds.

Theorem 151 (Chernoff Bound – Simplified Upper Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\), and \(\mu = \sum_i p_i\). Suppose we wish to bound the probability of \(X\) exceeding its expectation \(\mu\) by at least a factor of \(1 + \delta\), where \(\delta > 0\). Then

\[\large \Pr[X \ge (1 + \delta)\mu] \le e^{-\frac{\delta^2\mu}{2+\delta}}\]
Theorem 152 (Chernoff Bound – Simplified Lower Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\), and \(\mu = \sum_i p_i\). Suppose we wish to bound the probability of \(X\) being below its expectation \(\mu\) by at least a factor of \(1 - \delta\), where \(0 < \delta < 1\). Then

\[\large \Pr[X \le (1 - \delta)\mu] \le e^{-\frac{\delta^2\mu}{2}}\]

Before we proceed to make use of these bounds, we first prove that the unsimplified upper-tail and lower-tail bounds hold 5.

5

Refer to the appendix for proofs of the simplified bounds.

Proof 153 (Proof of Upper-tail Chernoff Bound)

To demonstrate the upper-tail Chernoff bound, we make use of the Chernoff bounding technique – rather than reasoning about the random variable \(X\) directly, we instead reason about \(e^{tX}\), since small deviations in \(X\) turn into large deviations in \(e^{tX}\). For any \(t > 0\), we have \(X \ge (1+\delta)\mu\) exactly when \(e^{tX} \ge e^{t(1+\delta)\mu}\), since raising \(e^t > 1\) to the power of each side of an inequality preserves its direction. Then

\[\begin{split}\large \begin{aligned} \Pr[X \ge (1+\delta)\mu] &= \Pr[e^{tX} \ge e^{t(1+\delta)\mu}]\\ &\le \frac{\Ex[e^{tX}]}{e^{t(1+\delta)\mu}} \end{aligned}\end{split}\]

where the latter step follows from Markov’s inequality. Continuing, we have

\[\begin{split}\large \begin{aligned} \frac{\Ex[e^{tX}]}{e^{t(1+\delta)\mu}} &= e^{-t(1+\delta)\mu}\cdot\Ex[e^{tX}]\\ &= e^{-t(1+\delta)\mu}\cdot\Ex[e^{t(X_1 + \dots + X_n)}]\\ &= e^{-t(1+\delta)\mu}\cdot\lrEx{\prod_i e^{tX_i}} \end{aligned}\end{split}\]

We now make use of the fact that the \(X_i\) (and therefore the \(e^{tX_i}\)) are independent. For independent random variables \(Y\) and \(Z\), we have \(\Ex[YZ] = \Ex[Y]\cdot\Ex[Z]\) (see the appendix for a proof). Thus,

\[\large e^{-t(1+\delta)\mu}\cdot\lrEx{\prod_i e^{tX_i}} = e^{-t(1+\delta)\mu}\prod_i \Ex[e^{tX_i}]\]

The \(X_i\) are indicators, with \(\Pr[X_i = 1] = p_i\) and \(\Pr[X_i = 0] = 1 - p_i\). When \(X_i = 1\), \(e^{tX_i} = e^t\), and when \(X_i = 0\), \(e^{tX_i} = e^0 = 1\). Thus, the distribution of \(e^{tX_i}\) is

\[\begin{split}\large \begin{aligned} \Pr[e^{tX_i} = e^t] &= p_i\\ \Pr[e^{tX_i} = 1] &= 1 - p_i \end{aligned}\end{split}\]

We can then compute the expectation of \(e^{tX_i}\):

\[\begin{split}\large \begin{aligned} \Ex[e^{tX_i}] &= e^t\cdot \Pr[e^{tX_i} = e^t] + 1\cdot\Pr[e^{tX_i} = 1]\\ &= e^t p_i + 1 - p_i\\ &= 1 + p_i(e^t - 1) \end{aligned}\end{split}\]

Plugging this into the upper bound that resulted from Markov’s inequality, we get

\[\large e^{-t(1+\delta)\mu}\prod_i \Ex[e^{tX_i}] = e^{-t(1+\delta)\mu}\prod_i \parens{1 + p_i(e^t-1)}\]

We can further simplify this expression by observing that \(1+x \le e^x\) for all \(x\):

graph showing that the function 1+x is bounded from above by e^x over all x


Using \(x = p_i(e^t-1)\), we have

\[\large 1 + p_i(e^t-1) \le e^{p_i(e^t-1)}\]

Applying this to our previously computed upper bound, we get

\[\begin{split}\large \begin{aligned} e^{-t(1+\delta)\mu}\prod_i \parens{1 + p_i(e^t-1)} &\le e^{-t(1+\delta)\mu}\prod_i e^{p_i(e^t-1)}\\ &= e^{-t(1+\delta)\mu}e^{(e^t-1)\sum_i p_i}\\ &= e^{-t(1+\delta)\mu}e^{(e^t-1)\mu}\\ &= e^{\mu(e^t - 1 - t(1+\delta))} \end{aligned}\end{split}\]

We can now choose \(t\) to minimize the exponent, which in turn minimizes the value of the exponential itself. Let \(f(t) = e^t - 1 - t(1+\delta)\). Then we compute the derivative with respect to \(t\) and set that to zero to find an extremum:

\[\begin{split}f(t) &= e^t - 1 - t(1+\delta)\\ f'(t) &= e^t - (1+\delta) = 0\\ e^t &= 1+\delta\\ t &= \ln(1+\delta)\end{split}\]

Computing the second derivative \(f''(\ln(1+\delta)) = e^{\ln(1+\delta)} = 1+\delta\), we see that it is positive since \(\delta > 0\), therefore \(t = \ln(1+\delta)\) is a minimum. Substituting it into our earlier expression, we obtain

\[\begin{split}\large \begin{aligned} e^{\mu(e^t - 1 - t(1+\delta))} &= e^{\mu(e^{\ln(1+\delta)} - 1 - (1+\delta)\ln(1+\delta))}\\ &= e^{\mu((1+\delta) - 1 - (1+\delta)\ln(1+\delta))}\\ &= e^{\mu(\delta - (1+\delta)\ln(1+\delta))}\\ &= \parens{e^{\delta - (1+\delta)\ln(1+\delta)}}^\mu\\ &= \parens{\frac{e^{\delta}}{e^{(1+\delta)\ln(1+\delta)}}}^\mu\\ &= \parens{\frac{e^{\delta}}{(e^{\ln(1+\delta)})^{1+\delta}}}^\mu\\ &= \parens{\frac{e^{\delta}}{(1+\delta)^{1+\delta}}}^\mu \end{aligned}\end{split}\]
Proof 154 (Proof of Lower-tail Chernoff Bound)

The proof of the lower-tail Chernoff bound follows similar reasoning as that of the upper-tail bound. We have:

\[\begin{split}\large \begin{aligned} \Pr[X \le (1-\delta)\mu] &= \Pr[-X \ge -(1-\delta)\mu]\\ &= \Pr[e^{-tX} \ge e^{-t(1-\delta)\mu}]\\ &\le \frac{\Ex[e^{-tX}]}{e^{-t(1-\delta)\mu}} \end{aligned}\end{split}\]

Here, we can apply Markov’s inequality to the random variable \(e^{-tX}\) since it is nonnegative, unlike \(-X\). Continuing, we have

\[\begin{split}\large \begin{aligned} \frac{\Ex[e^{-tX}]}{e^{-t(1-\delta)\mu}} &= e^{t(1-\delta)\mu}\cdot\Ex[e^{-tX}]\\ &= e^{t(1-\delta)\mu}\cdot\Ex[e^{-t(X_1 + \dots + X_n)}]\\ &= e^{t(1-\delta)\mu}\cdot\lrEx{\prod_i e^{-tX_i}}\\ &= e^{t(1-\delta)\mu}\prod_i \Ex[e^{-tX_i}] \end{aligned}\end{split}\]

In the last step, we make use of the fact that the \(X_i\) and therefore the \(e^{-tX_i}\) are independent. The distribution of \(e^{-tX_i}\) is

\[\begin{split}\large \begin{aligned} \Pr[e^{-tX_i} = e^{-t}] &= p_i\\ \Pr[e^{-tX_i} = 1] &= 1 - p_i \end{aligned}\end{split}\]

Then the expectation of \(e^{-tX_i}\) is:

\[\begin{split}\large \begin{aligned} \Ex[e^{-tX_i}] &= e^{-t}\cdot \Pr[e^{-tX_i} = e^{-t}] + 1\cdot\Pr[e^{-tX_i} = 1]\\ &= e^{-t} p_i + 1 - p_i\\ &= 1 + p_i(e^{-t} - 1) \end{aligned}\end{split}\]

Plugging this into the bound we’ve computed so far, we get

\[\large e^{t(1-\delta)\mu}\prod_i \Ex[e^{-tX_i}] = e^{t(1-\delta)\mu}\prod_i \parens{1 + p_i(e^{-t}-1)}\]

As with the upper-tail, we use the fact that \(1+x \le e^x\) for all \(x\) to simplify this, with \(x = p_i(e^{-t}-1)\):

\[\begin{split}\large \begin{aligned} e^{t(1-\delta)\mu}\prod_i \parens{1 + p_i(e^{-t}-1)} &\le e^{t(1-\delta)\mu}\prod_i e^{p_i(e^{-t}-1)}\\ &= e^{t(1-\delta)\mu}e^{(e^{-t}-1)\sum_i p_i}\\ &= e^{t(1-\delta)\mu}e^{(e^{-t}-1)\mu}\\ &= e^{\mu(e^{-t} - 1 + t(1-\delta))} \end{aligned}\end{split}\]

We choose \(t\) to minimize the exponent. Let \(f(t) = e^{-t} - 1 + t(1-\delta)\). Then we compute the derivative with respect to \(t\) and set that to zero to find an extremum:

\[\begin{split}f(t) &= e^{-t} - 1 + t(1-\delta)\\ f'(t) &= -e^{-t} + (1-\delta) = 0\\ e^{-t} &= 1-\delta\\ t &= -\ln(1-\delta)\end{split}\]

Computing the second derivative \(f''(-\ln(1-\delta)) = e^{\ln(1-\delta)} = 1-\delta\), we see that it is positive since \(0 < \delta < 1\), therefore \(t = -\ln(1-\delta)\) is a minimum. Substituting it into our earlier expression, we obtain

\[\begin{split}\large \begin{aligned} e^{\mu(e^{-t} - 1 + t(1-\delta))} &= e^{\mu(e^{\ln(1-\delta)} - 1 - (1-\delta)\ln(1-\delta))}\\ &= e^{\mu((1-\delta) - 1 - (1-\delta)\ln(1-\delta))}\\ &= e^{\mu(-\delta - (1-\delta)\ln(1-\delta))}\\ &= \parens{e^{-\delta - (1-\delta)\ln(1-\delta)}}^\mu\\ &= \parens{\frac{e^{-\delta}}{e^{(1-\delta)\ln(1-\delta)}}}^\mu\\ &= \parens{\frac{e^{-\delta}}{(e^{\ln(1-\delta)})^{1-\delta}}}^\mu\\ &= \parens{\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}}^\mu \end{aligned}\end{split}\]

In lieu of applying Chernoff bounds to the algorithm for estimating \(\pi\), we consider a different example of flipping a coin. If we flip a biased coin \(n\) times, where each flip comes up heads with probability \(p\), we expect to see \(np\) heads. Let \(H\) be the total number of heads, and let \(H_i\) be an indicator variable corresponding to whether the \(i\)th flip is heads. We have \(\Pr[H_i = 1] = p\), and \(\Ex[H] = np\) by linearity of expectation.

Suppose the coin is fair. What is the probability of getting at least six heads out of ten flips? This is a fractional deviation of \(\delta = 6/5 - 1 = 0.2\), and applying the upper-tail Chernoff bound gives us:

\[\begin{split}\Pr[H \ge (1+0.2)\cdot 5] &\le \parens{\frac{e^{0.2}}{1.2^{1.2}}}^5\\ &\approx 0.9814^5\\ &\approx 0.91\end{split}\]

What is the probability of getting at least 60 heads out of 100 flips, which is the same fractional deviation \(\delta = 0.2\) from the expectation? Applying the upper tail again, we get

\[\begin{split}\Pr[H \ge (1+0.2)\cdot 50] &\le \parens{\frac{e^{0.2}}{1.2^{1.2}}}^{50}\\ &\approx 0.9814^{50}\\ &\approx 0.39\end{split}\]

Thus, the probability of deviating by a factor of \(1+\delta\) decreases significantly as the number of samples increases. For this particular example, we can also compute the actual probabilities, albeit tediously, from the exact formula for the binomial distribution, which yields about 0.38 for at least six heads out of ten flips and about 0.03 for at least 60 heads out of 100 flips. However, this approach becomes more and more expensive as \(n\) increases, and the Chernoff bound produces a reasonable result with much less work.
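The following sketch (ours) reproduces these numbers, comparing the unsimplified upper-tail Chernoff bound against the exact binomial tail:

```python
import math

def chernoff_upper(mu, delta):
    """Upper-tail Chernoff bound on Pr[X >= (1 + delta) * mu]."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def exact_tail(n, k):
    """Exact Pr[at least k heads in n fair coin flips]."""
    return sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n

for n, k in [(10, 6), (100, 60)]:
    mu = n / 2
    delta = k / mu - 1
    print(n, round(chernoff_upper(mu, delta), 2), round(exact_tail(n, k), 2))
# prints: 10 0.91 0.38  and  100 0.39 0.03
```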

Hoeffding’s Inequality

Another concentration bound that can often produce a tighter result than Chernoff bounds is Hoeffding’s inequality. For simplicity, we restrict ourselves to the special case 6 of a sum \(X = X_1 + \dots + X_n\) of independent indicator random variables, where each indicator \(X_i\) has expectation

\[\Ex[X_i] = \Pr[X_i = 1] = p_i\]

Let \(p\) be the expected value of \(\frac{1}{n}X\). We have

\[\begin{split}p &= \lrEx{\frac{1}{n}X}\\ &= \frac{1}{n} \sum_i \Ex[X_i]\\ &= \frac{1}{n} \sum_i p_i\end{split}\]

Hoeffding’s inequality places a limit on how much the actual value of \(\frac{1}{n}X\) may deviate from the expectation \(p\).

6

See the appendix for the general case of Hoeffding’s inequality, as well as proofs of the special and general cases.

Theorem 155 (Hoeffding’s Inequality – Upper Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\). Let

\[p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i\]

Let \(\varepsilon > 0\) be a deviation from the expectation. Then

\[\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le e^{-2\varepsilon^2 n}\]
Theorem 156 (Hoeffding’s Inequality – Lower Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\). Let

\[p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i\]

Let \(\varepsilon > 0\) be a deviation from the expectation. Then

\[\lrPr{\frac{1}{n}X \le p - \varepsilon} \le e^{-2\varepsilon^2 n}\]
Theorem 157 (Hoeffding’s Inequality – Combined)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\). Let

\[p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i\]

Let \(\varepsilon > 0\) be a deviation from the expectation. Then

\[\lrPr{\lrabs{\frac{1}{n}X - p} \ge \varepsilon} \le 2e^{-2\varepsilon^2 n}\]

We consider again the example of flipping a fair coin. Let \(H\) be the total number of heads out of \(n\) flips, and let \(H_i\) be an indicator variable corresponding to whether the \(i\)th flip is heads. We have \(\Pr[H_i = 1] = \frac{1}{2}\), and \(\Ex[\frac{1}{n}H] = \frac{1}{2}\) for any number of flips \(n\).

What is the probability of getting at least six heads out of ten flips? This is a deviation of \(\varepsilon = 0.1\), and applying the upper-tail Hoeffding’s inequality gives us:

\[\begin{split}\lrPr{\frac{1}{n}H \ge p + \varepsilon} &= \lrPr{\frac{1}{n}H \ge 0.5 + 0.1}\\ &\le e^{-2 \cdot 0.1^2 \cdot 10}\\ &\approx 0.9802^{10}\\ &\approx 0.82\end{split}\]

What is the probability of getting at least 60 heads out of 100 flips, which is the same deviation \(\varepsilon = 0.1\) from the expectation? Applying the upper tail again, we get

\[\lrPr{\frac{1}{n}H \ge p + \varepsilon} \le e^{-2 \cdot 0.1^2 \cdot 100} \approx 0.9802^{100} \approx 0.14\]

This is a significantly tighter bound than that produced by the Chernoff bound.
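The corresponding computation for Hoeffding’s inequality is even simpler; this snippet (ours) reproduces the two bounds above:

```python
import math

def hoeffding_upper(n, eps):
    """Upper-tail Hoeffding bound on Pr[X/n >= p + eps]."""
    return math.exp(-2 * eps ** 2 * n)

print(round(hoeffding_upper(10, 0.1), 2))   # 0.82, versus 0.91 from Chernoff
print(round(hoeffding_upper(100, 0.1), 2))  # 0.14, versus 0.39 from Chernoff
```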

Polling

Rather than applying concentration bounds to compute the probability of a deviation for a specific sample size, we often wish to determine how many samples we need to be within a particular deviation with high confidence. One application of this is big data, where we have vast datasets that are too large to examine in their entirety, so we sample the data instead to estimate the quantities of interest. Similar to this is polling – outside of elections themselves, we typically do not have the resources to ask the entire population for their opinions, so we need to sample the populace to estimate the support for a particular candidate or political position. In both applications, we need to determine how many samples are needed to obtain a good estimate.

In general, a poll estimates the fraction of the population that supports a particular candidate by asking \(n\) randomly chosen people whether they support that candidate. Let \(X\) be a random variable corresponding to the number of people who answer this question in the affirmative. Then \(X/n\) is an estimate of the level of support in the full population.

A typical poll has both a confidence level and a margin of error – the latter corresponds to the deviation from the true fraction \(p\) of people who support the candidate, and the former corresponds to a bound on the probability that the estimate is within that deviation. For example, a 95% confidence level and a margin of error of ±2% requires that

\[\lrPr{\lrabs{\frac{X}{n} - p} \le 0.02} \ge 0.95\]

More generally, for a confidence level \(1-\gamma\) and margin of error \(\varepsilon\), we require

\[\lrPr{\lrabs{\frac{X}{n} - p} \le \varepsilon} \ge 1-\gamma\]

or equivalently

\[\lrPr{\lrabs{\frac{X}{n} - p} > \varepsilon} < \gamma\]

Formally, we define indicator variables \(X_i\) as

\[\begin{split}X_i = \begin{cases} 1 &\mbox{if person $i$ supports the candidate}\\ 0 &\mbox{otherwise} \end{cases}\end{split}\]

for each person \(i\) in the set that we poll. Then \(X = X_1 + \dots + X_n\) is the sum of independent indicator variables, with

\[\mu = \Ex[X] = \sum_i \Ex[X_i] = np\]

Analysis with Chernoff Bounds

For an arbitrary margin of error \(\varepsilon\), we want to bound the probability

\[\begin{split}\lrPr{\lrabs{\frac{X}{n} - p} > \varepsilon} &= \lrPr{\frac{X}{n} - p > \varepsilon} + \lrPr{\frac{X}{n} - p < -\varepsilon}\\ &= \lrPr{\frac{X}{n} > \varepsilon+p} + \lrPr{\frac{X}{n} < -\varepsilon+p}\\ &= \Pr[X > \varepsilon n+pn] + \Pr[X < -\varepsilon n+pn]\\ &= \Pr[X > \varepsilon n+\mu] + \Pr[X < -\varepsilon n+\mu]\\ &= \Pr[X > (1+\varepsilon n/\mu)\mu] + \Pr[X < (1-\varepsilon n/\mu)\mu]\\ &= \Pr[X \ge (1+\varepsilon n/\mu)\mu] + \Pr[X \le (1-\varepsilon n/\mu)\mu]\\ &~~~~~ - \Pr[X = (1+\varepsilon n/\mu)\mu] - \Pr[X = (1-\varepsilon n/\mu)\mu]\\ &\le \Pr[X \ge (1+\varepsilon n/\mu)\mu] + \Pr[X \le (1-\varepsilon n/\mu)\mu]\end{split}\]

In the second-to-last step, we use the fact that \(X = x\) and \(X > x\) are disjoint events, since a random variable maps each sample point to a single value, and that \((X \ge x) = (X = x) \cup (X > x)\).

We now have events that are in the right form to apply a Chernoff bound, with \(\delta = \varepsilon n/\mu\). We first apply the simplified upper-tail bound to the first term, obtaining

\[\begin{split}\large \begin{aligned} \Pr[X \ge (1+\varepsilon n/\mu)\mu] &\le e^{-\frac{\delta^2\mu}{2+\delta}} ~~~~\text{where $\delta=\varepsilon n/\mu$}\\ &= e^{-\frac{(\varepsilon n)^2\mu}{\mu^2(2+\varepsilon n/\mu)}}\\ &= e^{-\frac{(\varepsilon n)^2}{\mu(2+\varepsilon n/\mu)}}\\ &= e^{-\frac{(\varepsilon n)^2}{2\mu+\varepsilon n}}\\ &= e^{-\frac{(\varepsilon n)^2}{2np+\varepsilon n}}\\ &= e^{-\frac{\varepsilon^2 n}{2p+\varepsilon}} \end{aligned}\end{split}\]

We want this to be less than some value \(\beta\):

\[\begin{split}\large \begin{aligned} e^{-\frac{\varepsilon^2 n}{2p+\varepsilon}} &< \beta\\ \frac{1}{\beta} &< e^{\frac{\varepsilon^2 n}{2p+\varepsilon}}\\ \ln\parens{\frac{1}{\beta}} &< \frac{\varepsilon^2 n}{2p+\varepsilon}\\ n &> \frac{2p+\varepsilon}{\varepsilon^2}\ln\parens{\frac{1}{\beta}} \end{aligned}\end{split}\]

Unfortunately, this expression includes \(p\), the quantity we’re trying to estimate. However, we know that \(p \le 1\), so we can set

\[n > \frac{2+\varepsilon}{\varepsilon^2}\ln\parens{\frac{1}{\beta}}\]

to ensure that \(\Pr[X \ge (1+\varepsilon n/\mu)\mu] < \beta\).

We can also apply the simplified lower-tail bound to the second term in our full expression above:

\[\begin{split}\large \begin{aligned} \Pr[X \le (1-\varepsilon n/\mu)\mu] &\le e^{-\frac{\delta^2\mu}{2}} ~~~~\text{where $\delta=\varepsilon n/\mu$}\\ &= e^{-\frac{(\varepsilon n)^2\mu}{2\mu^2}}\\ &= e^{-\frac{(\varepsilon n)^2}{2\mu}}\\ &= e^{-\frac{(\varepsilon n)^2}{2np}}\\ &= e^{-\frac{\varepsilon^2 n}{2p}} \end{aligned}\end{split}\]

As before, we want this to be less than some value \(\beta\):

\[\begin{split}\large \begin{aligned} e^{-\frac{\varepsilon^2 n}{2p}} &< \beta\\ \frac{1}{\beta} &< e^{\frac{\varepsilon^2 n}{2p}}\\ \ln\parens{\frac{1}{\beta}} &< \frac{\varepsilon^2 n}{2p}\\ n &> \frac{2p}{\varepsilon^2}\ln\parens{\frac{1}{\beta}} \end{aligned}\end{split}\]

We again observe that \(p \le 1\), so we can set

\[n > \frac{2}{\varepsilon^2}\ln\parens{\frac{1}{\beta}}\]

to ensure that \(\Pr[X \le (1-\varepsilon n/\mu)\mu] < \beta\).

To bound both terms simultaneously, we can choose \(\beta=\gamma/2\), so that

\[\Pr[X \ge (1+\varepsilon n/\mu)\mu] + \Pr[X \le (1-\varepsilon n/\mu)\mu] < 2\beta = \gamma\]

To achieve this, we require

\[\begin{split}n &> \max\parens{ \frac{2+\varepsilon}{\varepsilon^2}\ln\parens{\frac{2}{\gamma}}, \frac{2}{\varepsilon^2}\ln\parens{\frac{2}{\gamma}}}\\ &= \frac{2+\varepsilon}{\varepsilon^2}\ln\parens{\frac{2}{\gamma}}\end{split}\]

For example, for a 95% confidence level and a margin of error of ±2%, we have \(\varepsilon=0.02\) and \(\gamma=0.05\). Plugging those values into the result above, we need no more than

\[\frac{2+\varepsilon}{\varepsilon^2}\ln\parens{\frac{2}{\gamma}} = \frac{2.02}{0.02^2}\ln 40 \approx 18629\]

samples to achieve the desired confidence level and margin of error. Observe that this does not depend on the total population size!

Analysis with Hoeffding’s Inequality

We can also apply Hoeffding’s inequality to polling. The combined inequality gives us

\[\lrPr{\lrabs{\frac{1}{n}X - p} \ge \varepsilon} \le 2e^{-2\varepsilon^2 n}\]

For a 95% confidence level and a margin of error of ±2%, we require that

\[\lrPr{\lrabs{\frac{X}{n} - p} \le 0.02} \ge 0.95\]

However, this isn’t quite in the form where we can apply Hoeffding’s inequality, so we need to do some manipulation first. We have:

\[\begin{split}\lrPr{\lrabs{\frac{X}{n} - p} \le 0.02} &= 1 - \lrPr{\lrabs{\frac{X}{n} - p} > 0.02}\\ &\ge 1 - \lrPr{\lrabs{\frac{X}{n} - p} \ge 0.02}\end{split}\]

Hoeffding’s inequality gives us:

\[\lrPr{\lrabs{\frac{X}{n} - p} \ge 0.02} \le 2e^{-2 \cdot 0.02^2 \cdot n}\]

Substituting this into the above, we get:

\[\lrPr{\lrabs{\frac{X}{n} - p} \le 0.02} \ge 1 - 2e^{-2 \cdot 0.02^2 \cdot n}\]

We want this to be at least 0.95:

\[\begin{split}1 - 2e^{-2 \cdot 0.02^2 \cdot n} &\ge 0.95\\ 2e^{-2 \cdot 0.02^2 \cdot n} &\le 0.05\\ e^{2 \cdot 0.02^2 \cdot n} &\ge 40\\ 2 \cdot 0.02^2 \cdot n &\ge \ln 40\\ n &\ge \frac{\ln 40}{2 \cdot 0.02^2}\\ &\approx 4611.1\end{split}\]

Thus, we obtain the given confidence level and margin of error by polling at least 4612 people, which is a significantly better result than we obtain from the simplified Chernoff bounds.

For an arbitrary margin of error ±\(\varepsilon\), we obtain:

\[\begin{split}\lrPr{\lrabs{\frac{X}{n} - p} \le \varepsilon} &= 1 - \lrPr{\lrabs{\frac{X}{n} - p} > \varepsilon}\\ &\ge 1 - \lrPr{\lrabs{\frac{X}{n} - p} \ge \varepsilon}\\ &\ge 1 - 2e^{-2\varepsilon^2 n}\end{split}\]

To achieve an arbitrary confidence level \(1 - \gamma\), we need:

\[\begin{split}1 - 2e^{-2\varepsilon^2 n} &\ge 1 - \gamma\\ 2e^{-2\varepsilon^2 n} &\le \gamma\\ e^{2\varepsilon^2 n} &\ge \frac{2}{\gamma}\\ 2\varepsilon^2 n &\ge \ln\parens{\frac{2}{\gamma}}\\ n &\ge \frac{1}{2\varepsilon^2} \ln\parens{\frac{2}{\gamma}}\end{split}\]

This is a factor of four better than the result obtained from the simplified Chernoff bounds.

More generally, if we wish to gauge the level of support for \(m\) different candidates, the sampling theorem tells us that the number of samples required is logarithmic in \(m\).

Theorem 158 (Sampling Theorem)

Suppose \(n\) people are polled to ask which candidate they support, out of \(m\) possible candidates. Let \(\Xj\) be the number of people who state that they support candidate \(j\), and let \(p_j\) be the true level of support for that candidate. We wish to obtain

\[\lrPr{\bigcap_j \parens{\lrabs{\frac{\Xj}{n} - p_j} \le \varepsilon}} \ge 1 - \gamma\]

In other words, we desire a confidence level \(1 - \gamma\) that all estimates \(\Xj/n\) are within margin of error ±\(\varepsilon\) of the true values \(p_j\). We obtain this when the number of samples \(n\) is

\[n \ge \frac{1}{2\varepsilon^2} \ln\parens{\frac{2m}{\gamma}}\]

The sampling theorem can be derived from applying the union bound, which we will discuss shortly.

In conclusion, when sampling from a large dataset, the number of samples required does depend on the desired accuracy of the estimation and the range size (i.e. number of possible answers). But it does not depend on the population size. This makes sampling a powerful technique for dealing with big data, as long as we are willing to tolerate a small possibility of obtaining an inaccurate estimate.
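The Hoeffding-based sample-size bound and the sampling theorem are easy to package as a calculator; the function below is our own helper, with \(m = 1\) giving the single-candidate bound.

```python
import math

def poll_samples(eps, gamma, m=1):
    """Samples sufficient for confidence 1 - gamma that all m estimates
    are within +/- eps, via n >= ln(2m / gamma) / (2 eps^2)."""
    return math.ceil(math.log(2 * m / gamma) / (2 * eps ** 2))

print(poll_samples(0.02, 0.05))        # 4612, as computed above
print(poll_samples(0.02, 0.05, m=10))  # 7490: ten candidates cost ~60% more samples
```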

Amplification for Two-Sided-Error Algorithms

Previously, we saw how to amplify the probability of success for one-sided-error randomized algorithms. We now consider amplification for two-sided-error algorithms. Recall that such an algorithm \(A\) for a language \(L\) has the following behavior:

  • if \(x \in L\), \(\Pr[A \text{ accepts } x] \ge c\)

  • if \(x \notin L\), \(\Pr[A \text{ rejects } x] \ge c\)

Here, \(c\) must be a constant that is strictly greater than \(\frac{1}{2}\).

Unlike in the one-sided case with no false positives, we cannot just run the algorithm multiple times and observe if it accepts at least once. A two-sided-error algorithm can accept both inputs in and not in the language, and it can reject both such inputs as well. However, we observe that because \(c > \frac{1}{2}\), we expect to get the right answer the majority of the time when we run a two-sided-error algorithm on an input. More formally, suppose we run the algorithm \(n\) times. Let \(Y_i\) be an indicator random variable that is 1 if the algorithm accepts in the \(i\)th run. Then:

  • if \(x \in L\), \(\Ex[Y_i] = \Pr[Y_i = 1] \ge c\)

  • if \(x \notin L\), \(\Ex[Y_i] = \Pr[Y_i = 1] \le 1 - c\)

Let \(Y = Y_1 + \dots + Y_n\) be the total number of accepts out of \(n\) trials. By linearity of expectation, we have:

  • if \(x \in L\), \(\Ex[Y] \ge cn > \frac{n}{2}~~~\) (since \(c > \frac{1}{2}\))

  • if \(x \notin L\), \(\Ex[Y] \le (1 - c)n < \frac{n}{2}~~~\) (since \(1 - c < \frac{1}{2}\))

This motivates an amplification algorithm \(B\) that runs the original algorithm \(A\) multiple times and takes the majority vote:

\[\begin{split}&\Algorithm B(x):\\ &~~~\Run A \ont x \text{ repeatedly, for a total of $n$ times}\\ &~~~\If A \text{ accepts at least $n/2$ times}, \accept\\ &~~~\Else \reject\end{split}\]

Suppose we wish to obtain a bound that is within \(\gamma\) of 1:

  • if \(x \in L\), \(\Pr[B \text{ accepts } x] \ge 1 - \gamma\)

  • if \(x \notin L\), \(\Pr[B \text{ rejects } x] \ge 1 - \gamma\)

What value for \(n\) should we use in \(B\)?

We first consider the case where \(x \in L\). The indicators \(Y_i\) are independent, allowing us to use Hoeffding’s inequality on their sum \(Y\). \(B\) accepts \(x\) when \(Y \ge \frac{n}{2}\), or equivalently, \(\frac{Y}{n} \ge \frac{1}{2}\). We want

\[\lrPr{\frac{Y}{n} \ge \frac{1}{2}} \ge 1 - \gamma\]

or equivalently,

\[\begin{split}\lrPr{\frac{Y}{n} < \frac{1}{2}} &\le \lrPr{\frac{Y}{n} \le \frac{1}{2}}\\ &= \lrPr{\frac{Y}{n} \le c - \parens{c - \frac{1}{2}}}\\ &= \lrPr{\frac{Y}{n} \le c - \varepsilon}\\ &\le \gamma\end{split}\]

where \(\varepsilon = c - \frac{1}{2}\). To apply Hoeffding’s inequality, we need something of the form

\[\lrPr{\frac{Y}{n} \le p - \varepsilon}\]

where \(p = \lrEx{\frac{Y}{n}}\). Unfortunately, we do not know the exact value of \(p\); all we know is that \(p = \lrEx{\frac{Y}{n}} \ge c\). However, we know that because \(p \ge c\),

\[\lrPr{\frac{Y}{n} \le c - \varepsilon} \le \lrPr{\frac{Y}{n} \le p - \varepsilon}\]

In general, the event \(X \le a\) includes at most as many sample points as \(X \le b\) when \(a \le b\); the latter includes all the outcomes in \(X \le a\), as well as those in \(a < X \le b\). We thus need only compute an upper bound on \(\lrPr{\frac{Y}{n} \le p - \varepsilon}\), and that same upper bound will apply to \(\lrPr{\frac{Y}{n} \le c - \varepsilon}\). Taking \(\varepsilon = c - \frac{1}{2}\) and applying the lower-tail Hoeffding’s inequality, we get:

\[\begin{split}\lrPr{\frac{Y}{n} \le p - \varepsilon} &= \lrPr{\frac{Y}{n} \le p - \parens{c - \frac{1}{2}}}\\ &\le e^{-2(c-1/2)^2 n}\end{split}\]

We want this to be bound by \(\gamma\):

\[\begin{split}e^{-2(c-1/2)^2 n} &\le \gamma\\ e^{2(c-1/2)^2 n} &\ge \frac{1}{\gamma}\\ 2(c - 1/2)^2 n &\ge \ln\parens{\frac{1}{\gamma}}\\ n &\ge \frac{1}{2(c - 1/2)^2} \ln\parens{\frac{1}{\gamma}}\end{split}\]

As a concrete example, suppose \(c = \frac{3}{4}\), meaning that \(A\) accepts \(x \in L\) with probability at least \(\frac{3}{4}\). Suppose we want \(B\) to accept \(x \in L\) at least 99% of the time, giving us \(\gamma = 0.01\). Then:

\[\begin{split}n &\ge \frac{1}{2(3/4 - 1/2)^2} \ln\parens{\frac{1}{0.01}}\\ &= \frac{1}{2\cdot 0.25^2} \ln 100\\ &\approx 36.8\end{split}\]

Thus, it is sufficient for \(B\) to run \(n \ge 37\) trials of \(A\) on \(x\).

We now consider \(x \notin L\). We want \(B\) to reject \(x\) with probability at least \(1 - \gamma\), or equivalently, \(B\) to accept \(x\) with probability at most \(\gamma\):

\[\lrPr{\frac{Y}{n} \ge \frac{1}{2}} \le \gamma\]

Similar to before, we have

\[\begin{split}\lrPr{\frac{Y}{n} \ge \frac{1}{2}} &= \lrPr{\frac{Y}{n} \ge (1 - c) + \parens{c - \frac{1}{2}}}\\ &= \lrPr{\frac{Y}{n} \ge (1 - c) + \varepsilon}\end{split}\]

with \(\varepsilon = c - \frac{1}{2}\). Since \(p = \lrEx{\frac{Y}{n}} \le (1 - c)\), we have

\[\lrPr{\frac{Y}{n} \ge (1 - c) + \varepsilon} \le \lrPr{\frac{Y}{n} \ge p + \varepsilon}\]

This follows from similar reasoning as earlier: an event \(X \ge a\) contains no more sample points than \(X \ge b\) when \(a \ge b\). With \(\varepsilon = c - \frac{1}{2}\), the upper-tail Hoeffding’s inequality gives us:

\[\begin{split}\lrPr{\frac{Y}{n} \ge p + \varepsilon} &= \lrPr{\frac{Y}{n} \ge p + \parens{c - \frac{1}{2}}}\\ &\le e^{-2(c-1/2)^2 n}\end{split}\]

We want this to be no more than \(\gamma\), which leads to the same solution as before:

\[n \ge \frac{1}{2(c - 1/2)^2} \ln\parens{\frac{1}{\gamma}}\]

Using the same concrete example as before, if we want \(B\) to reject \(x \notin L\) at least 99% of the time when \(A\) rejects \(x \notin L\) with probability at least \(\frac{3}{4}\), it suffices for \(B\) to run \(n \ge 37\) trials of \(A\) on \(x\).
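A Python sketch of the majority-vote amplifier \(B\), with the number of trials computed from the bound above; the toy algorithm \(A\) is our own illustration:

```python
import math
import random

def amplify(A, x, c, gamma):
    """Majority vote over n runs of a two-sided-error algorithm A that
    is correct with probability >= c > 1/2, where
    n >= ln(1/gamma) / (2 (c - 1/2)^2)."""
    n = math.ceil(math.log(1 / gamma) / (2 * (c - 0.5) ** 2))
    accepts = sum(1 for _ in range(n) if A(x))
    return accepts >= n / 2

def A(x):  # toy: decides whether x is even, correct with probability 3/4
    answer = (x % 2 == 0)
    return answer if random.random() < 0.75 else not answer

print(amplify(A, 42, c=0.75, gamma=0.01))  # True with probability >= 0.99
```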

Load Balancing

Job scheduling is another application where we can exploit randomness to construct a simple, highly effective algorithm. In this problem, we have \(k\) servers, and there are \(n\) jobs that need to be distributed to these servers. The goal is to balance the load among the servers, so that no one server is overloaded. This problem is very relevant to content-delivery networks – there may be on the order of millions or even billions of concurrent users, and each user needs to be routed to one of only hundreds or thousands of servers. Thus, we have \(n \gg k\) in such a case.

One possible algorithm is to always send a job to the most lightly loaded server. However, this requires significant coordination – the scheduler must keep track of the load on each server, which requires extra communication, space, and computational resources. Instead, we consider a simple, randomized algorithm that just sends each job to a random server. The expected number of jobs on each server is \(n/k\). But how likely are we to be close to this ideal, balanced load?

We start our analysis by defining random variables \(\Xj\) corresponding to the number of jobs assigned to server \(j\). We would like to demonstrate that the joint probability of \(\Xj\) being close to \(n/k\) for all \(j\) is high. We first reason about the load on an individual server. In particular, we wish to compute a bound on

\[\lrPr{\Xj \ge \frac{n}{k} + c}\]

for some value \(c > 0\), i.e. the probability that server \(j\) is overloaded by at least \(c\) jobs. Let \(\Xj_i\) be an indicator random variable that is 1 if job \(i\) is sent to server \(j\). Since the target server for job \(i\) is chosen uniformly at random out of \(k\) possible choices, we have

\[\lrEx{\Xj_i} = \lrPr{\Xj_i = 1} = \frac{1}{k}\]

We also have \(\Xj = \Xj_1 + \dots + \Xj_n\), giving us

\[\lrEx{\Xj} = \sum_i \lrEx{\Xj_i} = \frac{n}{k}\]

as we stated before. Then:

\[\begin{split}\lrPr{\Xj \ge \frac{n}{k} + c} &= \lrPr{\frac{1}{n}\Xj \ge \frac{1}{k} + \frac{c}{n}}\\ &\le e^{-2(c/n)^2 n}\\ &= e^{-2c^2/n}\end{split}\]

by the upper-tail Hoeffding’s inequality.

Now that we have a bound on an individual server being overloaded by \(c\) jobs, we wish to bound the probability that there is some server that is overloaded. The complement event is that no server is overloaded, so we can equivalently compute the joint probability that none of the servers is overloaded by \(c\) or more jobs:

\[\lrPr{X^{(1)} < \frac{n}{k} + c, \dots, X^{(k)} < \frac{n}{k} + c}\]

However, we cannot do so by simply multiplying the individual probabilities together – that only works if the probabilities are independent, and in this case, they are not. In particular, if one server is overloaded, some other server must be underloaded, so the random variables \(\Xj\) are not independent.

Rather than trying to work with the complement event, we attempt to directly compute the probability that there is at least one overloaded server:

\[\lrPr{\parens{X^{(1)} \ge \frac{n}{k} + c} \cup \dots \cup \parens{X^{(k)} \ge \frac{n}{k} + c}}\]

More succinctly, we denote this as:

\[\lrPr{\bigcup_j \parens{\Xj \ge \frac{n}{k} + c}}\]

We have a union of events, and we want to compute the probability of that union. The union bound allows us to compute a bound on that probability.

Theorem 159 (Union Bound)

Let \(E_1, E_2, \dots, E_m\) be a sequence of events over the same sample space. Then

\[\lrPr{\bigcup_{i=1}^m E_i} \le \sum_{i=1}^m \Pr[E_i]\]

In other words, the probability of the union is no more than the sum of the probabilities of the individual events.

To see why the union bound holds, consider the outcomes that are in the union event \(\cup_i E_i\). The probability of this event is the sum of the probabilities of the individual outcomes in \(\cup_i E_i\). In the maximal case, the events \(E_i\) are disjoint, so that an individual outcome is not a member of more than one \(E_i\). Then the sum of the probabilities of the events \(E_i\) is also the sum of the probabilities of the individual outcomes in \(\cup_i E_i\). In the general case, the \(E_i\)’s may overlap:

overlapping events in a sample space; the number of outcomes in the union of the events is at most the sum of the numbers in the individual events (and is usually less because some outcomes are in multiple events and thus "double counted")

In this case, the sum \(\sum_i \Pr[E_i]\) overcounts the probability \(\Pr[\cup_i E_i]\), since we “double count” outcomes that occur in more than one \(E_i\). Thus, \(\sum_i \Pr[E_i]\) is an upper bound on the probability \(\Pr[\cup_i E_i]\).
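For a concrete example, consider rolling a fair six-sided die, with \(E_1\) the event that the result is even and \(E_2\) the event that the result is at least 4. Then

\[\lrPr{E_1 \cup E_2} = \Pr[\{2, 4, 5, 6\}] = \frac{2}{3} \le \Pr[E_1] + \Pr[E_2] = \frac{1}{2} + \frac{1}{2} = 1\]

where the slack comes from the outcomes 4 and 6 being counted twice on the right-hand side.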

Returning to the job-scheduling problem, we have:

\[\begin{split}\lrPr{\bigcup_j \parens{\Xj \ge \frac{n}{k} + c}} &\le \sum_j \lrPr{\Xj \ge \frac{n}{k} + c}\\ &\le \sum_j e^{-2c^2/n}\\ &= k \cdot e^{-2c^2/n}\\ &= e^{\ln k} \cdot e^{-2c^2/n}\\ &= e^{\ln k - 2c^2/n}\end{split}\]

For \(c = \sqrt{n\ln k}\), we get:

\[\begin{split}\lrPr{\bigcup_j \parens{\Xj \ge \frac{n}{k} + \sqrt{n\ln k}}} &\le e^{\ln k - 2(\sqrt{n\ln k})^2/n}\\ &= e^{\ln k - 2(n\ln k)/n}\\ &= e^{-\ln k}\\ &= 1/e^{\ln k}\\ &= \frac{1}{k}\end{split}\]

With concrete values \(n = 10^{10}\) and \(k = 1000\), we compute the overload relative to the expected value as:

\[\begin{split}\frac{\sqrt{n \ln k}}{n/k} &= \frac{\sqrt{10^{10} \ln 1000}}{10^7}\\ &\approx 0.026\end{split}\]

This is an overhead of about 2.6% above the expected load. The probability that there is a server overloaded by at least 2.6% is bounded from above by \(1/k = 0.001\), or ≤0.1%. Thus, when there are \(n = 10^{10}\) jobs and \(k = 1000\) servers, the randomized algorithm has a high probability (≥99.9%) of producing a schedule where the servers all have a low overhead (≤2.6%).
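A simulation makes this guarantee tangible. The sketch below (our own, with a smaller \(n\) than the text's \(10^{10}\) so it runs quickly) assigns jobs uniformly at random and compares the observed maximum load against the \(1 + \sqrt{n \ln k}/(n/k)\) bound:

```python
import math
import random
from collections import Counter

def max_relative_load(n, k):
    """Assign n jobs to k servers uniformly at random; return the
    heaviest server's load as a multiple of the ideal n/k."""
    loads = Counter(random.randrange(k) for _ in range(n))
    return max(loads.values()) / (n / k)

n, k = 10**6, 100
bound = 1 + math.sqrt(n * math.log(k)) / (n / k)
print(f"bound {bound:.3f} (holds with prob >= {1 - 1/k}), observed {max_relative_load(n, k):.3f}")
```

With these parameters the bound evaluates to about 1.215, while a typical run observes a maximum relative load of roughly 1.03, comfortably inside the guarantee.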

Exercise 160

Previously, we demonstrated that to use polling to achieve a confidence level \(1 - \gamma\) and a margin of error ±\(\varepsilon\),

\[n \ge \frac{1}{2\varepsilon^2} \ln\parens{\frac{2}{\gamma}}\]

samples are sufficient when there is a single candidate. We also saw the sampling theorem, which states that for \(m\) candidates,

\[n \ge \frac{1}{2\varepsilon^2} \ln\parens{\frac{2m}{\gamma}}\]

samples suffice to achieve a confidence level \(1 - \gamma\) that all estimates \(\Xj/n\) are within margin of error ±\(\varepsilon\).

Use the union bound to demonstrate that this latter result follows from the result for a single candidate.