Thus far, every algorithm we have seen is deterministic. This means
that, when run on a given input, the steps the algorithm performs and
the output it produces are entirely determined. So, when run multiple
times on the same input, the algorithm produces the same output. At
heart, this is because any Turing machine’s transition rules are
deterministic, i.e., it has a transition function.
In this unit, we consider how randomness can be applied in
computation. We will exploit randomness to design simple,
probabilistic algorithms, and we will discuss several tools that can
be used to analyze randomized algorithms.
Since Turing machines are deterministic, they cannot make random
choices—so how are randomized algorithms possible? In real life,
when we want to make a random choice, we can flip a coin, roll a die,
or the like. Similarly, to model randomized algorithms and Turing
machines, we augment them with an external source of randomness.
For this purpose, the simplest source of randomness is a (virtual)
“fair coin” that the algorithm can “flip” as needed. Formally, we can
define a randomized Turing machine as having a special “coin flip”
state that, whenever it is entered, writes a fresh unbiased random bit
at the location of the tape head, and transitions back to the previous
state. With this basic feature it is possible to simulate richer
sources of randomness, like the roll of a die with any finite number
of sides.
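To make this concrete, here is a minimal Python sketch (ours, not part of the formal Turing-machine model) of how a source of unbiased random bits can simulate a fair die with any finite number of sides, via rejection sampling; the function names are illustrative.

import random

def coin_flip():
    # One unbiased random bit, playing the role of the "coin flip" state.
    return random.randrange(2)

def die_roll(sides=6):
    # Simulate a fair die using only fair coin flips (rejection sampling):
    # flip enough bits to cover the range, and retry if the value is too large.
    bits = (sides - 1).bit_length()
    while True:
        value = 0
        for _ in range(bits):
            value = 2 * value + coin_flip()
        if value < sides:
            return value + 1  # uniform over {1, ..., sides}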
In this text, we write algorithms and pseudocode that make truly
random, unbiased, and independent choices from whatever finite set is
convenient. In the real world, when implementing and using randomized
algorithms—especially for cryptography—it is vital to use a
high-quality source of randomness. (Ensuring this can be a very delicate
matter, but it is outside the scope of this course.)
As a first example of the utility of randomized algorithms, consider
the game of rock-paper-scissors between two players. In this game,
both players simultaneously choose one of three options: rock, paper,
or scissors. The game is a tie if both players choose the same option.
If they make different choices, then rock beats (“smashes”) scissors,
scissors beats (“cuts”) paper, and paper beats (“covers”) rock:
Suppose we write an algorithm (or strategy) to play the game, e.g., in
a best-out-of-five sequence of rounds. Similar to worst-case analysis
of algorithms, we should consider our strategy’s performance against
an opponent strategy that plays optimally against it. In other words,
we should assume that our opponent “knows” our strategy, and plays the
best it can with that knowledge.
If we “hard-code” a strategy—like playing rock in the first two
rounds, then paper, then scissors, then paper—then how will this
strategy perform? The strategy is fixed and deterministic, so an
optimal opponent can win every round (namely, play paper in the first
two rounds, then scissors, then rock, then scissors). More generally,
for any deterministic strategy—even one that is “adaptive” based
on the opponent’s moves—there exists an opponent strategy that
entirely beats it. This illustrates the problem with deterministic
strategies: they make the program’s actions predictable, and thus
easily countered.
Now consider a strategy that chooses a random action in each move,
with equal probability for rock, paper, and scissors. Here, even an
opponent that knows this strategy does not know in advance which
action the strategy will choose, and therefore cannot universally
counter that move. In fact, we prove below that no matter what
strategy the opponent uses, the probability that the opponent wins an
individual round is just \(1/3\) (and the same goes for tying and losing
the round). To see why this is the case, we first review some basic
tools of probability.
We first recall the basic definitions and axioms of probability. For
simplicity, we focus on discrete probability, which will be
sufficient for our purposes.
A (discrete) probability space is a countable set \(\Omega\),
called the sample space of possible outcomes, along with a
probability function \(\Pr \colon \Omega \to [0,1]\) that assigns a
probability to each outcome, and satisfies \(\sum_{\omega \in \Omega}
\Pr[\omega] = 1\).
Let’s elaborate on this definition and consider some examples. The
sample space \(\Omega\) is the set of every possible outcome—i.e.,
“every thing that could potentially happen”—in a randomized process,
which is also known as an experiment. Each outcome has a probability
between \(0\) and \(1\) (inclusive) of occurring, as defined by the
probability function \(\Pr\), and all these probabilities must sum to
\(1\).
Example 207(Three Fair Coins: Probability Space)
Consider the experiment of tossing three “fair” (unbiased) coins in
sequence. We formalize this as a probability space—i.e., a sample
space and a probability function on it.
The sample space \(\Omega\) is the set of all \(8 = 2^3\) possible
sequences of three heads or tails that can occur, i.e.,
\[\Omega = \set{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} \; \text,\]
where \(H\) represents heads and \(T\) represents tails.
Next, we define an appropriate probability function. To represent
the fairness of the coins, we require that each of the eight
outcomes has the same probability. Since by definition the
probabilities of all eight outcomes must sum to \(1\), we should
define \(\Pr[\omega] = 1/8\) for every outcome \(\omega \in \Omega\),
e.g., \(\Pr[HHH] = \Pr[HHT] = \cdots = \Pr[TTT] = 1/8\). [1]
These definitions generalize straightforwardly to the probability
experiment of tossing \(n \geq 0\) fair coins, where the sample space
is \(\Omega = \set{H,T}^n\), and each outcome \(\omega \in \Omega\) has
the same probability \(\Pr[\omega] = 1/2^n\).
Definition 208(Event)
For a probability space \((\Omega,\Pr)\), an event is any subset \(E
\subseteq \Omega\) of the sample space. The probability function is
extended to events as
\[\Pr[E] = \sum_{\omega \in E} \Pr[\omega] \; \text.\]
In other words, an event is any subset of outcomes, and its
probability is the sum of the probabilities of those outcomes. We now
build on the previous example.
Example 209(Three Fair Coins: Events)
Continuing from the probability space for tossing three fair coins
as defined in Example 207, we now
consider some example events.
The event “the first toss comes up heads” is formalized as \(E_1 =
\set{HHH, HHT, HTH, HTT} \subseteq \Omega\), i.e., the subset of
all outcomes in which the first character is \(H\). By definition,
the probability of this event is the sum of the probabilities of
all the outcomes in the event. Since this event has four
outcomes, and each has probability \(1/8\), we get that \(\Pr[E_1] =
4/8 = 1/2\). This matches our intuition that the probability of
getting heads on the first toss (ignoring what happens on the
remaining two tosses) is \(1/2\).
The event “the first and third tosses come up the same” is
formalized as the subset \(E_2 = \set{HHH, HTH, THT, TTT}
\subseteq \Omega\). Similarly, the probability of this event is
\(\Pr[E_2] = 4/8 = 1/2\). This also matches our intuition: no
matter how the first toss comes up, the third toss has a \(1/2\)
probability of matching it.
The event “the same number of heads and tails come up” is
formalized as the empty subset \(E_3 = \emptyset \subseteq
\Omega\). This is because no outcome has the same number of heads
and tails. [2] By definition, the probability of this event is
\(\Pr[E_3] = 0\).
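As a sanity check, the following short Python sketch (ours) enumerates the eight equally likely outcomes and computes the probabilities of the events \(E_1\), \(E_2\), and \(E_3\) from this example.

from itertools import product

# Sample space: all sequences of three coin results, each with probability 1/8.
omega = [''.join(s) for s in product('HT', repeat=3)]

def pr(event):
    # Probability of an event = sum of the probabilities of its outcomes.
    return sum(1 / 8 for w in omega if event(w))

print(pr(lambda w: w[0] == 'H'))                    # E1: first toss heads      -> 0.5
print(pr(lambda w: w[0] == w[2]))                   # E2: first and third match -> 0.5
print(pr(lambda w: w.count('H') == w.count('T')))   # E3: impossible            -> 0.0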
Because events are just subsets of the sample space, we can use set
relations to relate their probabilities. For example, for any events
\(E_1 \subseteq E_2\), we have that \(\Pr[E_1] \leq \Pr[E_2]\), because
\(\Pr[\omega] \geq 0\) for every outcome \(\omega\). (However, we caution
that \(E_1 \subsetneq E_2\) does not imply that \(\Pr[E_1] < \Pr[E_2]\),
because the events in \(E_2 \setminus E_1\) might have probability
zero.)
We can also use set operations on events to yield other events. For
example, every event \(E\) has a complement event \(\overline{E} =
\Omega \setminus E\), whose probability is \(\Pr[\overline{E}] = 1 -
\Pr[E]\), because \(\Pr[E] + \Pr[\overline{E}] = \sum_{\omega \in
\Omega} \Pr[\omega] = 1\).
We can also consider intersections and unions of events. Notationally,
instead of \(\cap\) and \(\cup\), we usually use the logical operators
\(\wedge\) (AND) and \(\vee\) (OR), respectively. This captures the idea
that the intersection of two events represents the occurrence of both
events simultaneously, and their union represents the occurrence of
one event or the other (or both).
For two events \(E_1,E_2\), the probability \(\Pr[E_1 \wedge E_2]\) that
both occur simultaneously is called their joint probability (and
similarly for more events). It follows immediately that this
probability is at most \(\min \set{\Pr[E_1], \Pr[E_2]}\), because \(E_1
\wedge E_2 \subseteq E_1, E_2\). When considering the intersection of
events, an important case is when, informally, the (non-)occurrence of
one event does not affect the probability of the (non-)occurrence of
another event. This is known as independence; the formal definition
is as follows.
Definition 210(Independent Events)
Two events \(E_1, E_2\) (of the same probability space) are
independent if \(\Pr[E_1 \wedge E_2] = \Pr[E_1] \cdot \Pr[E_2]\).
More generally, events \(E_1, \ldots, E_n\) are mutually
independent if \(\Pr[E_1 \wedge \cdots \wedge E_n] = \Pr[E_1]
\cdots \Pr[E_n]\). They are pairwise independent if every pair of
them is independent.
Mutual independence of several events is a stronger notion than
pairwise independence: a collection of events can be pairwise
independent but not mutually independent (see Example
218 below).
Example 211(Three Fair Coins: Independence)
Continuing from Example 209, we
consider several events and whether they are independent:
The events \(E_1\) (“the first toss comes up heads”) and \(E_2\)
(“the first and third tosses come up the same”) are independent,
because \(E_1 \wedge E_2 = \set{HHH, HTH}\) and hence \(\Pr[E_1
\wedge E_2] = 1/4 = 1/2 \cdot 1/2 = \Pr[E_1] \cdot \Pr[E_2]\).
This matches our intuition that whether or not the first toss
comes up heads, the probability that the third toss matches the
first one is not affected.
The three events “the first/second/third toss comes up heads” are
mutually independent (and hence also pairwise independent):
their intersection is the event \(\set{HHH}\), which has
probability \(1/8 = (1/2)^3\), which is the product of the \(1/2\)
probabilities of the three individual events. The same holds if
we replace “heads” with “tails” in any of these events. All this
matches our intuition that the results of any of the tosses do
not affect any of the other tosses.
The two events “the first/last pair of tosses come up heads” are
not independent: their intersection is the event \(\set{HHH}\),
which has probability \(1/8\), but they each have probability
\(1/4\). This matches our intuition: the first pair of tosses
coming up heads makes it more likely that the last pair of
tosses also does, because the two pairs have the middle (second)
toss in common.
The three events “the [first two/last two/first and third] tosses
come up the same” are pairwise independent, but not mutually
independent. Each of them has probability \(1/2\), and the
intersection of any two of them is the event \(\set{HHH,TTT}\),
which has probability \(1/4 = 1/2 \cdot 1/2\). However, the
intersection of all three of them is also \(\set{HHH,TTT}\),
whereas the product of their individual \(1/2\) probabilities is
\(1/8\). This failure of mutual independence is because the
occurrence of any two of these events implies the occurrence of
the third one.
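The following Python sketch (ours, reusing the enumeration idea above) verifies numerically that the three “pair of tosses come up the same” events are pairwise independent but not mutually independent.

from itertools import product

omega = [''.join(s) for s in product('HT', repeat=3)]

def pr(event):
    return sum(1 / 8 for w in omega if event(w))

# Event i: the pair of tosses not including toss i come up the same.
pairs = [(1, 2), (0, 2), (0, 1)]
events = [lambda w, a=a, b=b: w[a] == w[b] for (a, b) in pairs]

# Pairwise independence: Pr[Ei and Ej] equals Pr[Ei] * Pr[Ej] for every pair.
for i in range(3):
    for j in range(i + 1, 3):
        both = pr(lambda w: events[i](w) and events[j](w))
        assert both == pr(events[i]) * pr(events[j])

# Not mutually independent: Pr[E1 and E2 and E3] is 1/4, but the product is 1/8.
all_three = pr(lambda w: all(e(w) for e in events))
print(all_three, pr(events[0]) * pr(events[1]) * pr(events[2]))  # 0.25 0.125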
When considering the union of events, the following lemma is a basic
but widely applicable tool.
Lemma 212(Union Bound)
Let \(E_1, E_2 \subseteq \Omega\) be events of the same probability
space. Then
\[\Pr[E_1 \vee E_2] \leq \Pr[E_1] + \Pr[E_2] \; \text,\]
with equality if the events are disjoint (i.e., if \(E_1 \wedge
E_2 = \emptyset\)).
More generally, the probability of the union of any countably
many events is at most the sum of their individual probabilities,
with equality if the events are pairwise disjoint.
Proof 213
We have
\[\begin{split}\Pr[E_1 \vee E_2]
&= \sum_{\omega \in E_1 \vee E_2} \Pr[\omega] \\
&\leq \sum_{\omega \in E_1} \Pr[\omega] + \sum_{\omega \in E_2} \Pr[\omega]
= \Pr[E_1] + \Pr[E_2] \; \text,\end{split}\]
where the inequality holds because the two sums on the right
include every \(\omega \in E_1 \vee E_2\) at least once, and
\(\Pr[\omega] \geq 0\) for those \(\omega\) that are double-counted,
i.e., the \(\omega \in E_1 \wedge E_2\). In particular, if \(E_1,E_2\)
are disjoint, then no outcome is double-counted, hence the bound is
an equality.
The more general claim (for more than two events) follows via
induction, by applying the main claim (for two events) to the first
event and the union of the others, and applying the inductive
hypothesis to the latter union.
A random variable is, informally, a numerical quantity whose value
is determined by the outcome of a probability experiment. It is
defined formally as follows.
Definition 214(Random Variable)
A random variable of a probability space \((\Omega,\Pr)\) is a
function \(X \colon \Omega \to \R\) from the sample space to the real
numbers. So, the set of potential values of \(X\) is its image
\(X(\Omega) = \set{X(\omega) : \omega \in \Omega}\).
For any real number \(v \in \R\), the notation \(X=v\) denotes the
event \(X^{-1}(v) = \set{\omega \in \Omega : X(\omega)=v}\), i.e.,
the set of all outcomes that map to value \(v\) under \(X\). The events
\(X \geq v\), \(X < v\), etc. are defined similarly.
By the above definition, the probabilities of the events \(X=v\) and \(X
\geq v\) are, respectively,
\[\Pr[X = v] = \sum_{\omega \in \Omega : X(\omega) = v} \Pr[\omega]
\quad \text{and} \quad
\Pr[X \geq v] = \sum_{\omega \in \Omega : X(\omega) \geq v} \Pr[\omega] \; \text.\]
A frequently useful kind of random variable is an indicator random
variable, also called a 0-1 random variable.
Definition 215(Indicator Random Variable)
An indicator random variable (of a probability space \((\Omega,
\Pr)\)) is one that is limited to the values 0 and 1, i.e., a
function \(X \colon \Omega \to \bit\).
An indicator random variable can be defined for any event \(E\), as
\(X(\omega)=1\) for every \(\omega \in E\) and \(X(\omega)=0\) otherwise.
This means that the random variable has value \(1\) if the event occurs,
and value \(0\) if the event does not occur. In the other direction, any
indicator random variable \(X\) has a corresponding event \(X^{-1}(1) =
\set{\omega \in \Omega : X(\omega)=1}\), the set of exactly those
outcomes that make the random variable have value \(1\).
Similar to the case for events, we can define independence for random
variables.
Definition 216(Independent Random Variables)
Random variables \(X_1, X_2\) are independent if for any \(v_1,
v_2 \in \R\), the events \(X_1 = v_1\) and \(X_2 = v_2\) are
independent, i.e.,
\[\Pr[X_1 = v_1 \wedge X_2 = v_2] = \Pr[X_1 = v_1] \cdot \Pr[X_2 = v_2] \; \text.\]
Example 217(Three Fair Coins: Random Variables)
Continuing with the three-fair-coins experiment, define the random
variable \(X\) to be the number of heads that come up, i.e.,
\(X(\omega)\) is the number of \(H\) characters in the outcome
\(\omega\). The event \(X=0\) is \(\set{TTT}\), and hence has probability \(\Pr[X=0]
= 1/8\). The event \(X \geq 2\) is \(\set{HHT,HTH,THH,HHH}\), and hence
has probability \(\Pr[X \geq 2] = 4/8 = 1/2\). The events \(X=1.5\) and
\(X=5\) are both \(\emptyset\), and hence each have probability zero.
For each \(i=1,2,3\), we can define an indicator random variable
\(X_i\) that indicates whether the \(i\)th toss comes up heads.
Specifically, \(X_i(\omega)=1\) if the \(i\)th character of \(\omega\) is
\(H\), and \(X_i(\omega)=0\) otherwise. For example, \(X_2\) is the
indicator random variable for the event \(\set{HHH, HHT, THH, THT}\).
By inspection, \(\Pr[X_i = 0] = \Pr[X_i = 1] = 1/2\) for every \(i\).
Observe that \(X = X_1 + X_2 + X_3\), because \(X\) is the total number
of heads, and each \(X_i\) contributes \(1\) to the sum if and only if
the \(i\)th toss comes up heads (otherwise it contributes \(0\)).
The random variables \(X_1, X_2, X_3\) are mutually independent
(and hence also pairwise independent). To see this, consider any
\(v_1,v_2,v_3 \in \bit\). Then the event \(X_1 = v_1 \wedge X_2 = v_2
\wedge X_3 = v_3\) consists of exactly one outcome (because the
\(v_i\) values collectively specify the result of every toss), so it
has probability \(1/8 = (1/2)^3\), which is the product of the \(1/2\)
probabilities of the three individual events \(X_i = v_i\).
All of the definitions and claims from this example generalize
straightforwardly to the probability experiment of tossing any
number of fair coins.
Example 218(Three Fair Coins: More Random Variables and
(Non-)Independence)
We can also define a random variable \(Y\) representing the number
of pairs of tosses that come up the same (out of the three possible
pairs: first and second, second and third, first and third).
For each pair of coin tosses, we can define an indicator random
variable that indicates whether that pair comes up the same.
Specifically, for each \(i=1,2,3\), consider the pair of tosses that
does not include the \(i\)th toss; then let \(Y_i=1\) if this pair
comes up the same, and \(Y_i=0\) otherwise. For example, \(Y_1\) is the
indicator random variable for the event \(\set{HHH, HTT, THH, TTT}\),
so \(\Pr[Y_1 = 1] = \Pr[Y_1 = 0] = 1/2\), and similarly for \(Y_2,
Y_3\).
Observe that \(Y = Y_1 + Y_2 + Y_3\), because \(Y\) is the total number
of pairs that come up the same, and each \(Y_i\) corresponds to one
of the three pairs, contributing \(1\) to the sum if that pair comes
up the same, and \(0\) otherwise.
The random variables \(Y_1, Y_2, Y_3\) are pairwise independent,
but not mutually independent. To see pairwise independence, take
any values \(v_i, v_j \in \bit\) for any \(i \neq j\). Then the event
\(Y_i = v_i \wedge Y_j = v_j\) consists of exactly two outcomes, so
it has probability \(1/4\), which is the product of the \(1/2\)
probabilities of the two individual events \(Y_i = v_i\) and \(Y_j =
v_j\). This matches our intuition that whether one pair of coins
comes up the same does not affect whether another pair of coins
does (even if the two pairs have a coin in common).
To see that the \(Y_i\) are not mutually independent, there are
several counterexamples to the definition. For instance, the event
\(Y_1 = 1 \wedge Y_2 = 1 \wedge Y_3 = 1\) is \(\set{HHH,TTT}\), so it
has probability \(1/4\), which is not the product of the \(1/2\)
probabilities of the three individual events \(Y_i = 1\) for
\(i=1,2,3\). This matches our intuition: if two pairs of coins come
up the same, then this implies that the third pair comes up the
same as well.
Definition 219
The probability distribution (also known as probability mass
function) of a random variable \(X\) is the function \(p_X \colon \R
\to [0,1]\) that maps each real value to the probability that \(X\)
takes on this value, i.e., \(p_X(v) = \Pr[X=v]\).
Example 220(Three Fair Coins: Probability Distributions)
Continuing from Example 217, the
probability distribution of the random variable \(X\) representing
the number of heads that come up is
\[p_X(0) = \frac{1}{8}, \quad
p_X(1) = \frac{3}{8}, \quad
p_X(2) = \frac{3}{8}, \quad
p_X(3) = \frac{1}{8} \; \text,\]
and \(p_X(v) = 0\) for every other \(v \in \R\).
The probability distribution of a random variable gives the
probability of each potential value of the variable. Often, we seek
more succinct “summary” quantities about the random variable. One such
quantity is its expectation, which is the “weighted average” of the
random variable, where each value is weighted by its probability.
Definition 221(Expectation)
The expected value, also known as expectation, of a random
variable \(X\) is defined as
\[\Ex[X] = \sum_{v \in X(\Omega)} v \cdot \Pr[X = v] \; \text.\]
The name “expected value” is a bit of a misnomer, because it is not
really the value we “expect” to get in the probability experiment.
Indeed, a random variable might not even be able to take on its
expected value at all! (Example 224 below gives a simple example of this.)
Instead, it is good to think of the expectation as the average value
of a random variable.
Example 224(Three Fair Coins: Expectation)
Continuing the three-fair-coins example, the expectation of \(X\) (the
number of heads that come up) is
\[\Ex[X] = 0 \cdot \frac{1}{8} + 1 \cdot \frac{3}{8} + 2 \cdot \frac{3}{8} + 3 \cdot \frac{1}{8} = \frac{3}{2} \; \text.\]
This matches our intuition that when we toss three fair coins, the
average number of heads that come up is \(3/2\). Notice that this
expectation is not a value that \(X\) can actually take on; it is
merely the “average” value.
Continuing from Example 218, the
expectation of \(Y\) (the number of pairs of tosses that come up
the same) is
\[\Ex[Y] = 1 \cdot \frac{6}{8} + 3 \cdot \frac{2}{8} = \frac{3}{2} \; \text,\]
because \(Y = 1\) for six of the eight outcomes and \(Y = 3\) for the
remaining two outcomes \(HHH\) and \(TTT\).
This also might (or might not!) match our intuition: there are
three pairs of tosses, and each one has probability \(1/2\) of coming
up the same. Even though these three events are not mutually
independent (see Example 218), the
expected number of equal pairs of tosses happens to be \(3/2\). This
is not a coincidence, as we will see next.
An important feature of averages is that it is impossible for all
potential outcomes to be above average, nor can they all be below
average. In other words, there must be some outcome that is at least
the expectation, and one that is at most the expectation. This is
formalized in the following lemma, which is known as the averaging
argument.
Lemma 225(Averaging Argument)
Let \(X\) be a random variable of a probability space \((\Omega,\Pr)\).
Then there exists an outcome \(\omega \in \Omega\) for which
\(X(\omega) \geq \Ex[X]\), and an outcome \(\omega' \in \Omega\) for
which \(X(\omega') \leq \Ex[X]\).
For example, building on Example 224, there exists at least one outcome for
which \(X\), the number of heads that come up, is at least \(3/2\).
Indeed, \(HHH\) and \(HHT\) are two such outcomes.
Proof 226
Let \(E = \Ex[X]\), and suppose for the purposes of contradiction
that \(X(\omega) < E\) for every \(\omega \in \Omega\). Then by the
alternative formula for the expectation, and because the
\(\Pr[\omega]\) are non-negative and sum to 1 (so at least one of
them is positive),
\[\begin{split}\Ex[X]
&= \sum_{\omega \in \Omega} X(\omega) \cdot \Pr[\omega] \\
&< \sum_{\omega \in \Omega} E \cdot \Pr[\omega] \\
&= E \cdot \sum_{\omega \in \Omega} \Pr[\omega] = E \; \text.\end{split}\]
Thus \(\Ex[X] < E\), which contradicts the definition of \(E\), hence
the first claim is proved. The second claim proceeds symmetrically.
For a non-negative random variable, its expectation gives some
useful information about its probability distribution. Informally, it is
“not so likely” that the random variable is “much larger than” its
expectation. Markov’s inequality quantifies this.
Theorem 227(Markov’s Inequality)
Let \(X\) be a non-negative random variable, i.e., \(X\) never takes on
a negative value, and let \(a > 0\). Then
\[\Pr[X \geq a] \leq \frac{\Ex[X]}{a} \; \text.\]
If \(\Ex[X] > 0\), this implies that for any \(t > 0\), the probability
that \(X\) is at least \(t\) times its expectation is at most \(1/t\).
Notice that these statements are trivial (or “useless”) for \(a \leq
\Ex[X]\) or \(t \leq 1\), because in these cases Markov’s inequality
gives probability upper bounds that are \(\geq 1\), and the probability
of any event is trivially \(\leq 1\). So, Markov’s inequality is
useful only when we take \(a > \Ex[X]\) or \(t > 1\).
As an example, if the class average on an exam is \(70\) points (and
negative points are not possible), then Markov’s inequality implies
that the fraction of the class that earned at least \(a = 90\) points is
at most \(70/90 = 7/9\). (Stated probabilistically: if we choose a
student uniformly at random, then the probability that the student
earned \(\geq 90\) points is \(\leq 7/9\).) Similarly, at most \((7/6)\)ths
of the class earned \(\geq 60\) points, but this is a trivial statement.
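As a quick numeric illustration (our own check, not from the text), for the three-coin random variable \(X\) from the running example, Markov’s inequality with \(a = 2\) gives \(\Pr[X \geq 2] \leq (3/2)/2 = 3/4\), whereas the true probability is \(1/2\):

from itertools import product

omega = [''.join(s) for s in product('HT', repeat=3)]
X = lambda w: w.count('H')                               # number of heads

expectation = sum(X(w) * (1 / 8) for w in omega)         # 1.5
markov_bound = expectation / 2                           # 0.75
true_prob = sum(1 / 8 for w in omega if X(w) >= 2)       # 0.5
print(true_prob <= markov_bound)                         # True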
Proof 228
By definition of expectation,
\[\begin{split}\Ex[X]
&= \sum_{v \in X(\Omega)} v \cdot \Pr[X = v] \\
&= \parens[\Big]{\sum_{v < a} v \cdot \Pr[X = v]} +
\parens[\Big]{\sum_{v \geq a} v \cdot \Pr[X = v]} \\
&\geq \sum_{v \geq a} v \cdot \Pr[X = v] \\
&\geq \sum_{v \geq a} a \cdot \Pr[X = v] \\
&= a \cdot \Pr[X \geq a]
\; \text.\end{split}\]
The first inequality holds because \(X\) is non-negative, so every
\(v\) in the first sum is non-negative, and therefore that sum is
non-negative. The second inequality holds because \(v \geq a\), and
all the \(\Pr[X=v]\) are non-negative. The last equality holds
because the event \(X \geq a\) is the union of all the disjoint
events \(X \geq v\) for \(v \geq a\).
Finally, dividing both sides of the inequality by \(a > 0\) (which
preserves the direction of the inequality), we get that
\[\frac{\Ex[X]}{a} \geq \Pr[X \geq a] \; \text.\]
The following ‘reverse’ version of Markov’s inequality says,
informally, that if a random variable is upper bounded, then it has a
certain positive probability of exceeding any value less than its
expectation. (Note that by the averaging argument, there is some
nonzero probability that the random variable is at least its
expectation. The reverse Markov inequality gives a specific
probability bound, which increases as the threshold of interest
decreases.)
Theorem 229(Reverse Markov’s Inequality)
Let \(X\) be a random variable that is never larger than some value
\(b\). Then for any \(a < b\),
\[\Pr[X \geq a] \geq \Pr[X > a] \geq \frac{\Ex[X] - a}{b - a} \; \text.\]
Notice that the statement is trivial (or “useless”) for \(a \geq
\Ex[X]\), because in this case it gives a probability lower bound that
is \(\leq 0\), and the probability of any event is \(\geq 0\).
For example, if the class average on an exam is \(70\) points out of a
maximum of \(b=100\), then the ‘reverse’ Markov inequality implies that
at least \((70-60)/(100-60) = 1/4\) of the class earned more than \(a=60\)
points. However, it does not say anything about how much of the
class earned \(70\) or more points.
Proof 230
The first inequality holds because the event \(X > a\) is a subset of
the event \(X \geq a\). So, it remains to prove the second
inequality.
Define the random variable \(Y = b - X\). This is non-negative
because \(X\) is never larger than \(b\), so we can apply Markov’s
inequality to it, and it has expectation \(\Ex[Y] = b - \Ex[X]\).
Hence,
\[\begin{split}\Pr[X > a]
&= \Pr[Y < b - a]
= 1 - \Pr[Y \geq b - a] \\
&\geq 1 - \frac{\Ex[Y]}{b - a}
= 1 - \frac{b - \Ex[X]}{b - a}
= \frac{\Ex[X] - a}{b - a} \; \text.\end{split}\]
Often, we can express a random variable \(X\) as a linear combination of
some other, typically “simpler,” random variables \(X_i\). We can then
apply linearity of expectation to compute the expectation of \(X\)
from the expectations of the \(X_i\). Importantly, the \(X_i\) do not
need to be independent in order for linearity of expectation to
apply.
Theorem 231(Linearity of Expectation)
Let \(X_1, X_2, \ldots\) be any countably many random variables, and
\(c_1, c_2, \ldots \in \R\). Then
\[\Ex\brackets[\Big]{\sum_i c_i \cdot X_i} = \sum_i c_i \cdot \Ex[X_i] \; \text.\]
In other words, constant-factor scalings and summations can be
“moved in and out of” expectations.
Proof 232
We prove the case for a linear combination of two variables; the
general case then follows by induction. From the alternative
formulation of expectation,
\[\begin{split}\Ex[c_1 \cdot X_1 + c_2 \cdot X_2]
&= \sum_{\omega \in \Omega} (c_1 \cdot X_1(\omega) + c_2 \cdot X_2(\omega)) \cdot \Pr[\omega] \\
&= c_1 \sum_{\omega \in \Omega} X_1(\omega) \cdot \Pr[\omega] +
c_2 \sum_{\omega \in \Omega} X_2(\omega) \cdot \Pr[\omega] \\
&= c_1 \cdot \Ex[X_1] + c_2 \cdot \Ex[X_2] \; \text.\end{split}\]
Example 233(Three Fair Coins: Linearity of Expectation)
Continuing from Example 217, recall
that the random variable \(X\) is the number of heads that come up.
In Example 224, we showed
directly via the definition that \(\Ex[X] = 3/2\). What follows is an
alternative, simpler way—especially when generalizing to more
coin tosses—to obtain the same result using linearity of
expectation.
Recall that \(X\) is the sum of the three indicator random variables
\(X_i\) that indicate whether the \(i\)th toss comes up heads: \(X =
X_1 + X_2 + X_3\). Recall from Example 217 that \(\Ex[X_i] = \Pr[X_i = 1] = 1/2\) for every
\(i\). So, by linearity of expectation (Theorem 231),
\[\Ex[X] = \Ex[X_1] + \Ex[X_2] + \Ex[X_3] = 3 \cdot \frac{1}{2} = \frac{3}{2} \; \text.\]
Similarly, in Example 218, we
defined the random variable \(Y\) to be the number of pairs of
tosses that come up the same, and showed directly from the
definition that \(\Ex[Y] = 3/2\).
We observed that \(Y\) is the sum of the three indicator random
variables \(Y_i\) that indicate whether the corresponding pair of
tosses come up the same: \(Y = Y_1 + Y_2 + Y_3\). Even though these
indicator random variables are not mutually independent, we can
still apply linearity of expectation to their sum:
\[\Ex[Y] = \Ex[Y_1 + Y_2 + Y_3] = 3/2 \; \text.\]
Example 234(Biased Coins)
Suppose we repeatedly flip a biased coin that has probability \(p
\in [0,1]\) of coming up heads. We can determine the expected number
of heads in \(n\) flips, using indicator random variables and
linearity of expectation.
For \(i=1,\ldots, n\), define \(X_i\) to be the indicator random
variable indicating whether the \(i\)th flip comes up heads. Then
\(\Ex[X_i] = \Pr[X_i = 1] = p\). Define \(X\) to be the total number of
heads that come up in \(n\) flips, and observe that \(X = \sum_{i=1}^n
X_i\). So, by linearity of expectation,
\[\Ex[X]
= \Ex\brackets[\Big]{\sum_{i=1}^n X_i}
= \sum_{i=1}^n \Ex[X_i]
= \sum_{i=1}^n p
= p \cdot n \; \text.\]
By contrast, deriving this expectation directly via the definitions
(of expectation, and of the random variable \(X\)) would be rather
complicated: we would need to determine \(\Pr[X=v]\) for each
\(v=0,1,\ldots,n\), then analyze \(\sum_{i=1}^n v \cdot \Pr[X=v]\).
Linearity of expectation makes the derivation almost trivial.
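A small simulation sketch (ours; the parameters are illustrative) that checks this conclusion: the empirical average number of heads over many trials of \(n\) flips of a \(p\)-biased coin should be close to \(p \cdot n\).

import random

def heads_in_n_flips(n, p):
    # One experiment: flip a p-biased coin n times and count the heads.
    return sum(1 for _ in range(n) if random.random() < p)

n, p, trials = 20, 0.3, 100_000
average = sum(heads_in_n_flips(n, p) for _ in range(trials)) / trials
print(average, p * n)  # the two values should be close, e.g., about 6.0 and 6.0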
Now let’s return to the rock-paper-scissors example, modeling one
round of the game as a probability space, and analyzing the
appropriate events. The sample space—i.e., the set of outcomes that
can potentially occur—is the set \(\Omega = \set{R,P,S}^2 = \set{RR,
RP, RS, PR, PP, PS, SR, SP, SS}\) of possible move pairs made by the
two players, where \(R,P,S\) respectively denote the moves rock, paper,
and scissors, and the \(i\)th character represents player \(i\)’s move,
for \(i=1,2\).
Next, let \(R_i\) be the event that player \(i\) plays rock, i.e., \(R_1 =
\set{RR, RP, RS}\) and \(R_2 = \set{RR, PR, SR}\), and similarly for
\(P_i\) and \(S_i\).
We need to assign probabilities to the outcomes. Player \(i\)’s strategy
can be seen as choosing a move at random according to some specified
probabilities of events \(R_i, P_i, S_i\), which must sum to \(1\) because
they partition the sample space. (This is true even for deterministic
strategies, in which case the chosen move has probability \(1\).) Since
the players move simultaneously, we model their choices as
independent. So, letting \(M_i\) be any one of \(R_i,P_i,S_i\) for
\(i=1,2\), we see that the event \(M_1 \wedge M_2\) consists of a single
outcome, and we have that \(\Pr[M_1 \wedge M_2] = \Pr[M_1] \cdot \Pr[M_2]\).
Now, let \(W_1 = \set{RS, PR, SP}\) be the event that player \(1\) wins.
Suppose that player \(1\) plays uniformly at random, i.e., plays R,P,S
each with equal probability, i.e., \(\Pr[R_1]=\Pr[P_1]=\Pr[S_1]=1/3\).
Then no matter how player \(2\) plays, the probability that player \(1\)
wins is
\[\Pr[W_1]
= \Pr[R_1] \cdot \Pr[S_2] + \Pr[P_1] \cdot \Pr[R_2] + \Pr[S_1] \cdot \Pr[P_2]
= \frac{\Pr[S_2] + \Pr[R_2] + \Pr[P_2]}{3}
= \frac{1}{3} \; \text.\]
The first equation uses independence, and the final equation uses the
fact that the events \(S_2,R_2,P_2\) partition the sample space, so
their probabilities must sum to \(1\). The conclusion is that playing
uniformly at random yields a \(1/3\) probability of winning, no matter
what strategy the opponent uses. This is a significant improvement
over any deterministic strategy, against which an opponent can win
with certainty. [3] A similar analysis shows that the probabilities
of tying and of losing are both \(1/3\) as well.
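To illustrate, here is a Python simulation sketch (ours, not part of the formal analysis) in which player 1 plays uniformly at random against an arbitrary fixed opponent strategy; the empirical win, tie, and loss frequencies each approach \(1/3\).

import random

BEATS = {'R': 'S', 'P': 'R', 'S': 'P'}   # each move and the move it beats

def play_round(opponent_move):
    # Player 1 chooses uniformly at random, regardless of the opponent.
    my_move = random.choice('RPS')
    if my_move == opponent_move:
        return 'tie'
    return 'win' if BEATS[my_move] == opponent_move else 'loss'

# The opponent may use any strategy; here it happens to always play paper.
results = [play_round('P') for _ in range(90_000)]
for outcome in ('win', 'tie', 'loss'):
    print(outcome, results.count(outcome) / len(results))  # each is about 1/3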
The rock-paper-scissors example is illustrative of a more general
notion of considering algorithms as games. In this perspective, we
view one “player” as choosing an algorithm for a computational
problem. A second player known as the adversary then chooses an
input (of a certain size) to the algorithm. The adversary aims to
maximize the number of steps taken by the algorithm on the given
input; i.e., to induce the algorithm’s worst-case runtime behavior.
If the first player chooses a deterministic algorithm, then the
adversary can find an input that maximizes the number of steps—even
if the algorithm happens to be much faster on “most” other inputs. On
the other hand, if the first player chooses a randomized
algorithm—i.e., one that makes some random choices—then the
adversary might not be able to induce worst-case behavior, because it
cannot predict the algorithm’s random choices. In a good randomized
algorithm, the random choices will tend to avoid the worst-case
behavior no matter what the input is, although there is still some
(ideally very small) probability that it will make “unlucky” choices
that lead to poor performance.
We can use randomness to design simple algorithms that produce an
\(\alpha\)-approximation in expectation, meaning
that the expected value of the output is within an \(\alpha\) factor of
the value of an optimal solution.
As an example, recall the search version of the \(\TSAT\) problem: given
a 3CNF Boolean formula, find a satisfying assignment
for it, if one exists. Since a 3CNF formula is the AND of several
clauses, this problem is asking for an assignment that satisfies all
of the clauses simultaneously.
Because \(\TSAT\) is \(\NP\)-hard, there is no efficient algorithm for
this problem unless \(\P=\NP\). So, let us relax the goal, and seek an
assignment that satisfies as many clauses as we can manage. In other
words, we seek an approximation algorithm for the problem of
maximizing the number of satisfied clauses.
In what follows, we actually restrict the input to be what we call an
“exact”-3CNF formula. This is a 3CNF formula with the added
restriction that each clause involves three distinct variables.
(Different clauses can involve the same variable, however.) For
example, the clause
\[(x \vee \neg y \vee \neg z)\]
is allowed, but the clause
\[(x \vee \neg y \vee \neg x)\]
is not, since it involves only two distinct variables (the literals
\(x\) and \(\neg x\) involve the same variable). Similarly, the clause \((y
\vee y \vee \neg z)\) is not allowed.
The problem of finding an assignment that maximizes the number of
satisfied clauses in a given exact-3CNF formula is known as \(\MSAT\),
and its decision version is \(\NP\)-complete. Despite this, there is a
very simple randomized algorithm that obtains a \(7/8\)-approximation,
in expectation. The algorithm is as follows: assign random truth
values (true or false) to the formula’s variables, uniformly and
independently. That is, for each variable, assign it to be true or
false each with \(1/2\) probability, independently of all the others.
Theorem 235
Let \(\phi\) be an exact-3CNF formula with \(m\) clauses. Then
assigning its variables with uniformly random and independent truth
values satisfies \((7/8)\)ths of \(\phi\)’s clauses in expectation, i.e.,
\[\Ex[\text{number of satisfied clauses in $\phi$}] = 7m/8 \;
\text.\]
In particular, this is a \(7/8\)-approximation algorithm for \(\MSAT\),
in expectation.
The proof of this theorem proceeds by a powerful and widely applicable
strategy. First, we define a suitable probability space and a random
variable \(X\) that captures the quantity of interest, namely, the
number of satisfied clauses. This variable \(X\) typically has a
complicated probability distribution that makes it hard to analyze its
expectation directly. Instead, we express \(X\) as the sum of several
simpler indicator random variables \(X_i\). By linearity of
expectation, \(\Ex[X]\) is the sum of the individual expectations
\(\Ex[X_i] = \Pr[X_i=1]\), where the equality holds because the \(X_i\)
are indicators. Finally, we analyze the probabilities of the
individual events \(X_i=1\), which are simple to understand, and arrive
at the conclusion.
Proof 236
First, suppose that the expected number of satisfied clauses is
indeed \(7m/8\) (which we prove below). The optimal number of clauses
that can be simultaneously satisfied is \(\OPT \leq m\), so the
expected number of satisfied clauses is \(7m/8 \geq (7/8) \cdot
\OPT\), hence this is indeed a \(7/8\)-approximation algorithm for
\(\MSAT\).
We now analyze the expected number of satisfied clauses. The
probability space is the set of all assignments to the variables of
\(\phi\), where each assignment is equally likely—i.e., the
assignment is chosen uniformly at random.
Now, define \(X\) to be the random variable for the number of satisfied
clauses in \(\phi\). That is, for any particular assignment \(\alpha\),
we define \(X(\alpha)\) to be the number of clauses satisfied by
\(\alpha\). Because the clauses of \(\phi\) can share variables in
various ways, the probability distribution and expectation of \(X\)
can be quite complicated and difficult to analyze directly.
Instead, we decompose \(X\) into a sum of much simpler indicator
variables.
Specifically, for each \(i=1,\ldots,m\) we define an indicator random
variable \(X_i\) that indicates whether the \(i\)th clause is
satisfied, i.e., \(X_i = 1\) if so and \(X_i = 0\) if not. Then because
\(X\) is the total number of satisfied clauses, we have that
\[X = \sum_{i=1}^m X_i
\quad \text{and hence, by linearity of expectation,} \quad
\Ex[X] = \sum_{i=1}^m \Ex[X_i] \; \text.\]
It is important to realize that because the clauses of \(\phi\) can
share variables in various ways, the indicators \(X_i\) are typically
not independent, even pairwise. For example, because clauses \((x
\vee y \vee \neg z)\) and \((u \vee \neg v \vee x)\) share the
variable \(x\), the former clause being satisfied makes it more
likely that the latter clause is also satisfied (i.e., their
indicators are positively correlated). Fortunately, linearity of
expectation holds even for dependent variables, so the above
equation is still valid.
All that remains is to analyze \(\Ex[X_i] = \Pr[X_i = 1]\) (see
Lemma 222), i.e., the
probability that a uniformly random assignment satisfies the \(i\)th
clause. Every clause in the exact-3CNF formula \(\phi\) is the OR of
exactly three literals involving distinct variables, e.g., \((x \vee
\neg y \vee z)\). The clause fails to be satisfied exactly when all
three of its literals are false. Because the variables are
assigned uniformly at random, each literal has probability \(1/2\) of
being false (regardless of whether the literal is a variable or its
negation). And because the literals in clause \(i\) involve distinct
variables, their values are mutually independent. Thus,
\[\begin{split}\Pr[X_i = 0]
&= \Pr[\text{clause $i$ is unsatisfied}] \\
&= \Pr[\text{the 1st and 2nd and 3rd literals of clause $i$ are all false}] \\
&= \prod_{j=1}^3 \Pr[\text{the $j$th literal of clause $i$ is
false}] \\
&= (1/2)^3 = 1/8 \; \text.\end{split}\]
Next, because \(X_i\) is an indicator random variable (it is limited
to values \(0\) and \(1\)),
\[\Ex[X_i] = \Pr[X_i = 1] = 1 - \Pr[X_i = 0] = \frac{7}{8} \; \text.\]
Therefore, by linearity of expectation, \(\Ex[X] = \sum_{i=1}^m
\Ex[X_i] = 7m/8\), as claimed.
By the averaging argument (Lemma 225), from
Theorem 235 we can further conclude that
for any exact-3CNF formula, there exists an assignment that
satisfies at least \(7/8\)ths of the clauses. Notice that this is a
“deterministic” statement—it makes no reference to probability—but
we proved it using the tools of probability, and it is not obvious how
we could hope to do so without them! (This is an example of the
“probabilistic method,” which is a very powerful technique with many
applications.)
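For concreteness, the following Python sketch (ours) implements the random-assignment algorithm; the clause encoding (each clause as a tuple of three nonzero integers, where a negative integer denotes a negated variable) is our own convention, not fixed by the text.

import random

def random_assignment_satisfied(clauses, num_vars):
    # Assign each variable True/False uniformly and independently.
    assignment = {v: random.random() < 0.5 for v in range(1, num_vars + 1)}

    def literal_true(lit):
        value = assignment[abs(lit)]
        return value if lit > 0 else not value

    # Count the clauses that have at least one true literal.
    return sum(1 for clause in clauses if any(literal_true(l) for l in clause))

# Example exact-3CNF formula: (x1 or not x2 or x3) and (not x1 or x2 or x4).
clauses = [(1, -2, 3), (-1, 2, 4)]
trials = 100_000
avg = sum(random_assignment_satisfied(clauses, 4) for _ in range(trials)) / trials
print(avg)  # close to (7/8) * 2 = 1.75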
Theorem 235 says that if we assign
variables at random, then on average, \(7/8\)ths of the clauses are
satisfied. However, this does not say anything about how likely it
is that a certain fraction of clauses are satisfied. For example, we
might want a lower bound on the probability that at least half (or
two-thirds, or three-quarters, or 90%) of the clauses are satisfied.
Written in terms of the random variable \(X\) representing the number of
satisfied clauses, we might hope to get a lower bound on \(\Pr[X \geq c
\cdot m]\) for \(c=1/2\), or various other \(c \in [0,1]\).
Notice that because \(X\) is defined as the number of satisfied clauses,
it is non-negative (i.e., it never takes on a negative value). Thus,
we can apply Markov’s inequality (Theorem 227)
to obtain that
\[\Pr[X \geq m/2] \leq \frac{\Ex[X]}{m/2} = \frac{7m/8}{m/2} = \frac{7}{4} \; \text.\]
Unfortunately, this is a trivial (or “useless”) statement, because the
probability bound is \(7/4 > 1\), and the probability of any event is
\(\leq 1\). Furthermore, the probability bound goes in the “wrong
direction” from what we want: it says that the probability of
satisfying at least half the clauses is at most some value, whereas
we hope to show that it is at least some positive value. An upper
bound is not useful to us, because it is consistent with the actual
probability being zero. So, even if we replaced \(c=1/2\) with, say, \(c
= 9/10\) to get a nontrivial bound, the statement would not be of the
kind we want. We need to use a different tool.
Observe that the number of satisfied clauses \(X\) cannot exceed \(m\),
the total number of clauses in the given formula. So, we can apply the
‘reverse’ Markov inequality (Theorem 229) to it, with \(b=m\). Setting \(a = m/2 < b\), we get
that
\[\Pr[X > m/2] \geq \frac{\Ex[X] - m/2}{m - m/2} = \frac{7m/8 - m/2}{m/2} = \frac{3}{4} \; \text.\]
That is, a random assignment satisfies more than half the clauses with
at least \(75\) percent probability. More generally, for any \(c < 7/8\),
setting \(a = c \cdot m < b\),
\[\Pr[X > c \cdot m] \geq \frac{7m/8 - c \cdot m}{m - c \cdot m} =
\frac{7/8-c}{1-c} > 0 \; \text.\]
So, for any \(c < 7/8\) there is a positive probability of satisfying
more than a \(c\) fraction of the clauses. However, the ‘reverse’ Markov
inequality does not tell us anything useful about the probability of
satisfying \(7/8\)ths or more of the clauses.
Exercise 237
Recall the max-cut problem. The following is a
randomized algorithm that outputs a random cut in a given
undirected graph (which has no self-loop edge, without loss of
generality):
\begin{algorithmic}
\FUNCTION{RandomCut}{$G=(V,E)$}
\STATE $S = \emptyset$
\FORALL{$v \in V$}
\STATE add $v$ to $S$ with probability $1/2$
\ENDFOR
\STATE \RETURN $S$
\ENDFUNCTION
\end{algorithmic}
Recall that \(C(S)\) is the set of crossing edges for the cut \(S\),
i.e., those that have exactly one endpoint in \(S\). The size of the
cut is \(\abs{C(S)}\).
Prove that in expectation, this algorithm outputs a
1/2-approximation of a maximum cut in the given undirected
graph. That is, the expected size of the returned cut is at
least half the size of a maximum cut.
Hint: Use linearity of expectation to show that in
expectation, half of the edges of the graph are in \(C(S)\).
Similarly, give an upper bound on the size of a maximum cut
(also in terms of the number of edges in the graph).
Conclude that, for any undirected graph \(G=(V,E)\), there exists
a cut of size at least \(\abs{E}/2\). (By contrast, in
Theorem 203 we saw an
efficient deterministic algorithm that finds such a cut, with
certainty.)
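For concreteness, a minimal Python rendering of RandomCut (our translation of the pseudocode above; the representation of the graph as a vertex list and an edge list is an assumption):

import random

def random_cut(vertices, edges):
    # Put each vertex into S independently with probability 1/2.
    S = {v for v in vertices if random.random() < 0.5}
    # C(S): edges with exactly one endpoint in S.
    crossing = [(u, v) for (u, v) in edges if (u in S) != (v in S)]
    return S, crossing

# Example: a 4-cycle; in expectation, half of its 4 edges cross the cut.
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (3, 4), (4, 1)]
S, C = random_cut(V, E)
print(S, len(C))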
Another example of an algorithm that takes advantage of randomness is
quick sort. This is a recursive algorithm that sorts an array as
follows:
First, it chooses some element of the array as the pivot.
Then, it partitions the array around the pivot, by comparing the
pivot to every other element, placing all the smaller elements to
the “left” of the pivot—not necessarily in sorted order—and all
the larger ones to the “right” of the pivot. Without loss of
generality, we assume that the array elements are distinct, by
breaking ties according to the elements’ positions in the array.
Finally, it recursively sorts the two subarrays to the left and
right of the pivot, which results in the entire array being sorted.
The following is an illustration of an execution of quick sort, with
the recursive work elided:
The running time of quick sort is dominated by the number of
comparisons between pairs of elements. Once a pivot is selected,
partitioning an array of \(n\) elements compares every other element to
the pivot. This is a total of \(n-1 = \Theta(n)\) comparisons of
non-recursive work, which is followed by the recursive calls on the
two subarrays on each side of the pivot. If selecting the pivot takes
\(O(n)\) time and can guarantee that the two subarrays are “balanced,”
i.e., approximately equal in size, then the recurrence for the total
number of comparisons (and hence the overall running time) is
\[T(n) = 2T(n/2) + \Theta(n) \; \text.\]
This is the same recurrence as for merge sort, and
it has the closed-form solution \(T(n) = \Theta(n \log n)\). However,
quick sort tends to be faster in practice than merge sort and other
deterministic \(\O(n \log n)\) sorting algorithms, in large part because
it is fast and easy to implement in place—i.e., reordering the
array within its own memory space, using only a small amount of
auxiliary memory. (By comparison, the standard merge subroutine from
merge sort makes a full copy of the array with each call.) Thus,
variants of quick sort are widely used in practice.
One way to guarantee a balanced partition is to use the median
element as the pivot, which can be found in linear time. However, this
subroutine is rather complicated to implement, and its linear runtime
has a fairly large hidden constant factor.
A much simpler and more practical solution is to choose a uniformly
random array element as the pivot. The pseudocode is as follows.
Algorithm 238(Randomized Quick Sort)
\begin{algorithmic}
\FUNCTION{RandQuickSort}{$A[1,\ldots,n]$}
\IF{$n=1$} \RETURN $A$ \ENDIF
\STATE $p = \text{ an index chosen uniformly at random from }
\set{1,\ldots,n}$
\STATE $(L,R) =$ \CALL{Partition}{$A,p$}
\STATE \RETURN \CALL{RandQuickSort}{$L$} $+ A[p] +$
\CALL{RandQuickSort}{$R$}
\ENDFUNCTION
\FUNCTION{Partition}{$A[1,\ldots,n],p$}
\STATE initialize empty arrays $L,R$
\FOR{$i=1$ to $n$}
\IF{$i \neq p$ and $A[i] < A[p]$}
\STATE $L = L + A[i]$
\ELIF{$i \neq p$}
\STATE $R = R + A[i]$
\ENDIF
\ENDFOR
\STATE \RETURN $(L,R)$
\ENDFUNCTION
\end{algorithmic}
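Here is a direct Python rendering of Algorithm 238 (ours; it returns a new sorted list rather than sorting in place, mirroring the pseudocode above rather than the in-place variants typically used in practice).

import random

def rand_quick_sort(A):
    if len(A) <= 1:
        return list(A)
    p = random.randrange(len(A))            # pivot index, chosen uniformly at random
    pivot = A[p]
    L = [x for i, x in enumerate(A) if i != p and x < pivot]
    R = [x for i, x in enumerate(A) if i != p and x >= pivot]
    return rand_quick_sort(L) + [pivot] + rand_quick_sort(R)

print(rand_quick_sort([5, 2, 8, 1, 9, 3]))  # [1, 2, 3, 5, 8, 9]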
The following figure illustrates a possible execution of
RandQuickSort on an example input array, with the random
choices of pivot elements highlighted, and the rank of each element
displayed below it. The rank of an element is the number of elements
in the array (including itself) that are less than or equal to it. So,
the smallest element has rank \(1\), the next-smallest has rank \(2\),
etc.
The running time of RandQuickSort is a random variable that
depends in a complicated way on the particular random choices of
pivots made throughout the algorithm. Roughly speaking, choices of
pivots that yield “unbalanced” subarrays tend to yield larger running
times than ones that yield more balanced subarrays. How can we hope to
analyze this very complicated probability experiment? As we did above,
we express the running time as a sum of much simpler indicator
variables, apply linearity of expectation, and then analyze these
indicators individually. The rest of this section is devoted to
proving the following theorem.
Theorem 239
On any array of \(n\) elements, the expected running time of
RandQuickSort is \(\O(n \log n)\).
Recall that the running time of RandQuickSort is dominated by
the number of comparisons of pairs of array elements. First observe
that if a certain pair of elements is ever compared, then it is never
compared again. This is because at the time of comparison, one of the
elements is the pivot, which is never compared with anything after
partitioning around it is complete. So, the total number of
comparisons done by RandQuickSort is the total number of
distinct pairs of elements that are ever compared.
Accordingly, define the random variable \(X\) to be the total number of
comparisons during the entire execution of quick sort on an arbitrary
array of \(n\) elements. In addition, for any ranks \(1 \leq i < j \leq
n\), define \(X_{ij}\) to be the indicator random variable that indicates
whether the pair of elements having ranks \(i,j\) are ever compared
during the execution, i.e., \(X_{ij}=1\) if this pair is ever compared,
and \(X_{ij}=0\) otherwise. By the observations from the previous
paragraph,
\[X = \sum_{1 \leq i < j \leq n} X_{ij}
\quad \text{and hence, by linearity of expectation,} \quad
\Ex[X] = \sum_{1 \leq i < j \leq n} \Ex[X_{ij}] \; \text.\]
So, for every \(i < j\) it just remains to analyze the individual
expectations \(\Ex[X_{ij}] = \Pr[X_{ij}=1]\) in isolation (where the
equality holds by Lemma 222),
and sum them. (As above, the random variables \(X_{ij}\) are highly
correlated in difficult-to-understand ways, but this does not matter
for linearity of expectation.)
Under what conditions are the elements of ranks \(i < j\) ever compared,
and what is the probability \(\Pr[X_{ij}=1]\) that this occurs? By
inspection of Algorithm 238, we
see that elements are compared only in the Partition
subroutine. In particular, the only comparisons made during a call to
Partition are between the selected pivot element and all the
other elements in the input subarray. This has two important
implications:
When a subarray is input to Partition, none of its elements
have been compared with each other yet.
In the subarray input to Partition, if one element is less
than the pivot and another is greater than the pivot, then these two
elements are placed on opposite sides of the pivot, and thus are
never compared with each other (because the recursive calls operate
completely separately on the two sides).
Therefore, the two elements \(a < b\) having ranks \(i < j\) are compared
if and only if the first pivot chosen from the range \([a,b]\) is
either \(a\) or \(b\). To elaborate: for as long as pivots are chosen from
outside this range, \(a\) and \(b\) remain in the same subarray, and have
not yet been compared. Then, once an element from this range is chosen
as a pivot:
it is either \(a\) or \(b\), in which case \(a\) and \(b\) are compared with
each other, or
it is an element strictly between \(a\) and \(b\), in which case \(a\) and
\(b\) are placed in different subarrays and never compared with each
other.
Therefore, \(\Ex[X_{ij}]\) is the probability that the first pivot whose
rank is between \(i\) and \(j\) (inclusive) has rank either \(i\) or \(j\).
There are \(j-i+1\) elements in this range. Each pivot is chosen
uniformly at random from the subarray input to Partition, and
as argued above, when the first pivot from this range is selected, the
entire range is part of the subarray. So, given that the pivot is in
the range, it is uniformly random over the range, i.e., it is
equally likely to be any of the \(j-i+1\) elements in the range. (This
statement can be formalized using conditional probability.) Because
exactly two of these possible choices make \(X_{ij}=1\), we have that
\[\Ex[X_{ij}] = \Pr[X_{ij} = 1] = \frac{2}{j-i+1} \; \text.\]
Finally, summing over all pairs of ranks and applying linearity of
expectation,
\[\Ex[X]
= \sum_{1 \leq i < j \leq n} \frac{2}{j-i+1}
\leq \sum_{i=1}^{n} \sum_{k=2}^{n} \frac{2}{k}
= \O(n \log n) \; \text,\]
because each inner sum is \(\O(\log n)\) by the standard bound on the
harmonic series. This proves Theorem 239.
Although RandQuickSort runs in expected time \(O(n \log
n)\) on an array of \(n\) elements, it can potentially run in time
\(\Omega(n^2)\), due to an unlucky choice of random pivots.
Prove that for any constant \(c > 0\),
\[\lim_{n \to \infty} \Pr[\algo{RandQuickSort} \text{ on an
$n$-element array takes $\geq c n \log^2 n$ steps}] = 0 \; \text.\]
That is, as the array size grows large, randomized quick sort is
very unlikely to run in even \(\Omega(n \log^2 n)\) time.
Hint: This follows immediately from Theorem 239 and Markov’s inequality
(Theorem 227), without needing to understand
any of the analysis of RandQuickSort.
Randomness can also be utilized to construct simple data structures
that provide excellent expected performance. As an example, we
consider a set abstract data type, which is a container that supports
inserting, deleting, and searching for elements of some mutually
comparable type. There are many ways to implement a set abstraction,
using a variety of underlying data structures including linked lists,
arrays, and binary search trees. However, simple, deterministic
implementations suffer a linear runtime in the worst case for some
operations. For example, if an adversary inserts the items \(1, 2,
\dots, n\) in increasing order into a non-balancing binary search tree, the \(i\)th
insertion takes \(\Theta(i)\) time, for a total of \(\Theta(n^2)\) time. The usual way to
avoid this is to perform complicated rebalancing operations, and there
are a variety of schemes to do so (e.g. AVL trees, red-black trees, scapegoat
trees, and so on).
Can we use randomness instead to design a simple data structure that
provides expected \(\O(\log n)\) runtime for all operations on a set of
size \(n\)?
One solution is a skip list, which is essentially a linked list with
multiple levels, where each level contains about half the elements of
the level below (ratios other than \(\frac{1}{2}\) may be used instead,
trading off the space and time overheads). Whether or not an element
is duplicated to the next level is a random decision, which prevents
an adversary from coming up with a sequence of operations that
degrades performance. The following is an illustration of a skip list:
In the scheme illustrated above, each level has a sentinel head and is
terminated by a null pointer. An element may appear at multiple
levels, and the list may be traversed by following links rightward or
by moving downward at a duplicated element.
To search for an item, we start at the sentinel node at the top left.
We maintain the invariant that we are always located at a node whose
value is less than the item we are looking for (or is a sentinel head
node). Then in each step, we check the node to the right. If its value
is greater than the item we are looking for, or if the right pointer
is null, we move downward; however, if we are on the lowest level, we
cannot move downward, so the item is not in the list. If the value of
the node to the right is less than the item, we move to the right. If
the value is equal to the item, we have found the item and terminate
the search. The following illustrates a search for the item 0 in the
list above:
The search terminates in failure, since we reach the value -3 in the
bottom level with the value 1 to the right. The following illustrates
searching for the item 9:
To insert an item, we first do a search for the item. If the item is
in the list, then no further work is necessary. If it is not, the
search terminates in the bottom level at the maximal element that is
less than the item. Therefore, we insert the item immediately to the
right. We then make a random decision of whether or not to duplicate
the element to the next level. If the answer is yes, we make another
random decision for the next level, and so on until the answer is no.
If the probability of duplicating an item is \(\frac{1}{2}\), the
expected number of copies of an item is \(2\). Thus, insertion only
takes an expected constant amount of time in addition to the search.
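The following Python sketch (ours, and simplified: deletion is omitted, and an element promoted past the current top level gains only one extra level) mirrors the search and insertion procedures described above, including the random duplication decisions.

import random

class Node:
    def __init__(self, value, right=None, down=None):
        self.value, self.right, self.down = value, right, down

class SkipList:
    def __init__(self):
        self.head = Node(None)              # sentinel head of the (single) initial level

    def search(self, item):
        node = self.head
        while node is not None:
            nxt = node.right
            if nxt is not None and nxt.value == item:
                return True
            if nxt is not None and nxt.value < item:
                node = nxt                  # keep moving right on this level
            else:
                node = node.down            # next value too large (or absent): drop down
        return False

    def insert(self, item):
        # Walk top-down, recording the rightmost node < item on each level.
        path, node = [], self.head
        while node is not None:
            while node.right is not None and node.right.value < item:
                node = node.right
            path.append(node)
            node = node.down
        if path[-1].right is not None and path[-1].right.value == item:
            return                          # item is already present
        # Insert on the bottom level, then duplicate upward with probability 1/2.
        down = None
        while path:
            pred = path.pop()
            pred.right = Node(item, pred.right, down)
            down = pred.right
            if random.random() >= 0.5:      # stop duplicating to higher levels
                return
        # Promoted past the current top level: add one new level holding the item.
        self.head = Node(None, Node(item, None, down), self.head)

skip = SkipList()
for x in [3, 1, 4, 5, 9, 2, 6]:
    skip.insert(x)
print(skip.search(5), skip.search(7))       # True False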
We proceed to determine the expected time of a search on a skip list.
We start by determining the expected number of elements on each level
\(i\). Suppose the skip list contains \(n\) elements total. Let \(n_i\) be
the number of elements on level \(i\). We have \(n_1 = n\), since the
first level contains all elements. For \(i > 1\), define \(X_{ij}\) to
be an indicator random variable for whether or not the \(j\)th element
is replicated to level \(i\). We have
\[\Pr[X_{ij} = 1] = 1/2^{i-1}\]
since it takes \(i-1\) “yes” decisions for the element to make it to
level \(i\), and each decision is made independently with probability
\(1/2\). Thus, \(\Ex[X_{ij}] = \Pr[X_{ij} = 1] = 1/2^{i-1}\).
The total number of elements \(n_i\) on level \(i\) is just the sum of the
indicators \(X_{ij}\). This gives us
\[\Ex[n_i] = \sum_{j=1}^{n} \Ex[X_{ij}] = \frac{n}{2^{i-1}} \; \text.\]
Next, we consider how many levels we expect the list to contain.
Suppose we only maintain levels that have at least one element. From
our previous result, the expected number of elements is 1 for level
\(i\) when
\[\begin{split}\frac{n}{2^{i-1}} &= 1\\
n &= 2^{i-1}\\
\log n &= i - 1\\
i &= \log n + 1\end{split}\]
Intuitively, this implies we should expect somewhere around \(\log n\)
levels. To reason about the expected number more formally, consider
the number of elements on some level \(i = \log n + k\). By Markov’s
inequality, we have
\[\begin{split}\lrPr{n_{\log n + k} \ge 1} &\le \lrEx{n_{\log n + k}}/1\\
&= \frac{n}{2^{\log n + k - 1}}\\
&= \frac{n}{n2^{k-1}}\\
&= \frac{1}{2^{k-1}}\end{split}\]
Then let \(Z_k\) be an indicator for whether \(n_{\log n + k} \ge 1\). The
number of levels \(L\) is at most
\[\begin{split}\Ex[L] &\le \lrEx{\log n + \sum_{k=1}^\infty Z_k}\\
&= \log n + \sum_{k=1}^\infty \Ex[Z_k]\\
&\le \log n + \sum_{k=1}^\infty \frac{1}{2^{k-1}}\\
&= \log n + 2\end{split}\]
Thus, the expected number of levels is \(\O(\log n)\).
Finally, we reason about how many nodes a search encounters on each
level \(i\). Suppose we are looking for some element \(y\). Let \(w\) be the
maximal element on level \(i\) that is less than \(y\) (or the sentinel
head if no such element exists), and let \(z\) be the minimal element on
level \(i\) that is greater than or equal to \(y\) (or the null sentinel
if no such element exists). Let \(v_1, v_2, \dots\) be the elements on
level \(i\) that are less than \(w\), with \(v_1\) the largest, \(v_2\) the
next largest, and so on.
The search encounters the nodes corresponding to \(w\) and \(z\). It does
not touch nodes on level \(i\) that come after \(z\) – either \(y = z\), in
which case the search terminates, or \(z > y\) (or \(z\) is null), in
which case the search moves down to level \(i-1\). Thus, we need only
consider whether the search encounters the nodes corresponding to
\(v_1, v_2, \dots\).
We observe that each element on level \(i\) is replicated on level \(i+1\)
with probability \(1/2\).
There are two possible routes the search can take to get to \(w\) on
level \(i\):
The search can come down from level \(i+1\) to level \(i\) at \(w\). This
happens exactly when \(w\) is replicated on level \(i+1\). The search
process proceeds along the same level until it reaches a node that
is greater than the target value \(y\), and since \(w < y\), it would
not come down to level \(i\) before \(w\) if \(w\) appears on level \(i+1\).
Thus, this possibility occurs with probability \(1/2\).
The search can come from the node corresponding to \(v_1\) on level \(i\).
This occurs when \(w\) is not replicated on level \(i+1\). In that case,
the search must come down to level \(i\) before \(w\), in which case it
goes through \(v_1\). This possibility also occurs with probability
\(1/2\).
These two possibilities are illustrated below:
Thus, \(v_1\) is encountered on level \(i\) exactly when \(w\) is not
replicated to level \(i+1\), which occurs with probability \(1/2\). We can
inductively repeat the same reasoning for preceding elements. The
search only encounters \(v_2\) on level \(i\) if neither \(w\) nor \(v_1\) are
replicated on level \(i+1\). Since they are replicated independently
with probability \(1/2\), the probability that neither is replicated is
\(1/4\). In general, the search encounters \(v_j\) on level \(i\) exactly
when none of \(w, v_1, v_2, \dots, v_{j-1}\) are replicated to level
\(i+1\), which happens with probability \(1/2^j\).
We can now compute the expected number of nodes encountered on level
\(i\). Let \(Y_i\) be this number, and let \(V_{ij}\) be an indicator for
whether or not element \(v_j\) is encountered. Then we have
\[Y_i = 2 + \sum_j V_{ij}\]
The first term \(2\) is due to the fact that the search encounters the
nodes corresponding to \(w\) and \(z\). From our reasoning above, we have
\[\begin{split}\Ex[Y_i] &= 2 + \sum_j \Ex[V_{ij}]\\
&= 2 + \sum_j \frac{1}{2^j}\\
&\le 2 + \sum_{j=1}^\infty \frac{1}{2^j}\\
&= 3\end{split}\]
The second-to-last step follows from the fact that \(\sum_j
\frac{1}{2^j}\) has only finitely many terms, since there are only
finitely many elements less than \(y\); it is therefore bounded by the
full geometric series \(\sum_{j=1}^\infty \frac{1}{2^j} = 1\). Thus, the
expected number of nodes encountered on a level \(i\) is \(\O(1)\).
Combining this with our previous result that the expected number of
levels is \(\O(\log n)\), the expected total number of nodes encountered
across all levels is \(\O(\log n)\). Thus, a search on a skip list
performs an expected \(\O(\log n)\) comparisons. Since insertion does
only a constant number of operations in addition to a search, it too
performs an expected \(\O(\log n)\) operations, and the same reasoning
holds for deletion. We achieve this expected runtime with a much simpler
implementation than a self-balancing search tree.
Some algorithms rely on repeated trials to compute an approximation of
a result. Such algorithms are called Monte Carlo methods, which are
distinct from Monte Carlo algorithms – a
Monte Carlo method does repeated sampling of a probability
distribution, while a Monte Carlo algorithm is an algorithm that may
produce the wrong result within some bounded probability. The former
are commonly approximation algorithms for estimating a quantity, while
the latter are exact algorithms that sometimes produce the wrong
result. As we will see, there is a
connection – repeated sampling (the strategy of a Monte Carlo method)
can be used to amplify the probability of getting the correct result
from a Monte Carlo algorithm.
As an example of a Monte Carlo method, we consider an algorithm for
estimating the value of \(\pi\), the area of a unit circle (a circle
with a radius of one). If such a circle is located in the plane,
centered at the origin, its top-right quadrant is as follows:
This quadrant falls within the unit square with corners \((0, 0)\) and
\((1, 1)\). The area of the quadrant is \(\pi/4\), while the area of the
square is \(1\). Thus, if we choose a random point in this square,
the probability that it lies within the quadrant of the circle is
\(\pi/4\)[4]. This motivates the following algorithm for estimating
the value of \(\pi\):
\begin{algorithmic}
\FUNCTION{EstimatePi}{$n$}
\STATE $\text{count} = 0$
\FOR{$i=1$ to $n$}
\STATE $x,y = $ values in $[0,1]$ chosen uniformly and
independently at random
\IF{$x^2+y^2 \leq 1$}
\STATE $\text{count} = \text{count} + 1$
\ENDIF
\ENDFOR
\STATE \RETURN $4 \cdot \text{count} / n$
\ENDFUNCTION
\end{algorithmic}
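For concreteness, here is one possible Python translation of the pseudocode
above; it is a direct sketch rather than an optimized implementation.

import random

def estimate_pi(n):
    """Estimate pi by sampling n points uniformly from the unit square and
    counting the fraction that land inside the quarter circle x^2 + y^2 <= 1."""
    count = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1:
            count += 1
    return 4 * count / n

if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(n, estimate_pi(n))

The estimate tends to improve as \(n\) grows, which is quantified by the
concentration bounds discussed below.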
The algorithm randomly chooses points between \((0, 0)\) and \((1, 1)\),
counting how many of them fall within the unit circle, which is when
the point’s distance from the origin is at most one. We expect this
number to be \(\pi/4\) of the samples, so the algorithm returns four
times the ratio of the points that fell within the circle as an
estimate of \(\pi\).
We formally show that the algorithm returns the value of \(\pi\) in
expectation. Let \(X\) be a random variable corresponding to the value
returned, and let \(Y_i\) be an indicator random variable that is 1 if
the \(i\)th point falls within the unit circle. As argued previously, we
have:
\[\Pr[Y_i = 1] = \pi/4\]
We also have that
\[X = 4/n \cdot (Y_1 + \dots + Y_n)\]
This leads to the following expected value for \(X\):
\[\begin{split}\Ex[X] &= \frac{4}{n}\lrEx{Y_1 + \dots + Y_n}\\
&= \frac{4}{n}(\Ex[Y_1] + \dots + \Ex[Y_n])\\
&= \frac{4}{n}\cdot n\cdot\frac{\pi}{4}\\
&= \pi\end{split}\]
The expected value of the algorithm’s output is indeed \(\pi\).
When estimating \(\pi\), how likely are we to actually get a result that
is close to the expected value? While we can apply Markov’s
inequality, the bound we get is very loose, and it does not give us
any information about how the number of samples affects the quality of
the estimate. The law of large numbers states that the actual result
converges to the expected value as the number of samples \(n\)
increases. But how fast does it converge? There are many types of
concentration bounds that allow us to reason about the deviation of
a random variable from its expectation; Markov’s inequality is just
the simplest one. Chebyshev’s inequality
is another simple bound that makes use of more information about a
random variable, namely its variance. Chernoff bounds are yet
another tool. There are multiple variants of Chernoff bounds,
including the multiplicative form and the
additive Hoeffding bounds.
The expectation of a random variable gives us one piece of information
about its probability distribution, but there are many aspects of the
distribution that it does not capture. For instance, the following
illustrates three random variables that have different distributions
but the same expected value of 0.5:
Beyond the expectation, the next most important aspect of a
probability distribution is its “spread” [5]. The variance of a
random variable encapsulates this information.
Definition 241(Variance)
Suppose \(X\) is a random variable with expectation \(\Ex[X]\). Then
the variance of \(X\) is defined as
\[\Var(X) = \Ex[(X-\Ex[X])^2]\]
or, equivalently
\[\Var(X) = \Ex[X^2] - \Ex[X]^2\]
The second definition follows from the first due to linearity of
expectation:
\[\begin{split}\Var(X) &= \Ex[(X-\Ex[X])^2]\\
&= \Ex[X^2 - 2X\Ex[X] + \Ex[X]^2]\\
&= \Ex[X^2] - \Ex[2X\Ex[X]] + \Ex[\Ex[X]^2]\\
&= \Ex[X^2] - 2\Ex[X]\Ex[X] + \Ex[X]^2\\
&= \Ex[X^2] - \Ex[X]^2\end{split}\]
In the fourth step, we used the fact that \(\Ex[X]\) is a constant to
pull it out of the outer expectation – by linearity of expectation,
\(\Ex[cY] = c\Ex[Y]\) for a constant \(c\).
The variance tells us the average square of the distance of a random
variable from its expectation. Taking the square root of the variance
gives us an approximate measure of the distance itself; this is called
the standard deviation, and it is often denoted by the symbol
\(\sigma\):
\[\sigma(X) = \sqrt{\Var(X)}\]
Example 242
Suppose the random variable \(X\) has the distribution
\[X = \begin{cases}
0.5 &\mbox{with probability $1$}
\end{cases}\]
and \(Y\) has the distribution
\[\begin{split}Y = \begin{cases}
0 &\mbox{with probability $1/2$}\\
1 &\mbox{with probability $1/2$}
\end{cases}\end{split}\]
Both \(X\) and \(Y\) have an expected value of 0.5. We compute their
variances. The distributions of \(X^2\) and \(Y^2\) are as follows:
\[ \begin{align}\begin{aligned}\begin{split}X^2 = \begin{cases}
0.25 &\mbox{with probability $1$}
\end{cases}\\\end{split}\\\begin{split}Y^2 = \begin{cases}
0 &\mbox{with probability $1/2$}\\
1 &\mbox{with probability $1/2$}
\end{cases}\end{split}\end{aligned}\end{align} \]
Thus \(\Ex[X^2] = 0.25\) and \(\Ex[Y^2] = 0.5\), giving us
\[\begin{split}\Var(X) &= \Ex[X^2] - \Ex[X]^2 = 0.25 - 0.25 = 0\\
\Var(Y) &= \Ex[Y^2] - \Ex[Y]^2 = 0.5 - 0.25 = 0.25\end{split}\]
Even though \(X\) and \(Y\) have the same expectation, \(Y\) has the
larger variance, reflecting its greater spread.
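As an illustrative aside (not from the original text), the two definitions of
variance can be checked numerically on a finite distribution such as \(Y\)
above; the helper functions below are our own.

def expectation(dist):
    """Expected value of a finite distribution given as {value: probability}."""
    return sum(v * p for v, p in dist.items())

def variance_def1(dist):
    """Var(X) = E[(X - E[X])^2]."""
    mu = expectation(dist)
    return sum(p * (v - mu) ** 2 for v, p in dist.items())

def variance_def2(dist):
    """Var(X) = E[X^2] - E[X]^2."""
    mu = expectation(dist)
    second_moment = sum(p * v ** 2 for v, p in dist.items())
    return second_moment - mu ** 2

X = {0.5: 1.0}
Y = {0: 0.5, 1: 0.5}
print(variance_def1(X), variance_def2(X))  # 0.0 0.0
print(variance_def1(Y), variance_def2(Y))  # 0.25 0.25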
As a more complex example, define \(X\) to be the number of heads over
\(n\) flips of a biased coin with probability \(p\) of heads. We computed
in Example 234 that \(\Ex[X] =
np\). If we define \(X_i\) to be an indicator for whether or not flip \(i\)
produces heads, we can infer from Example 243 that \(\Var(X_i) = p(1-p)\). What about
\(\Var(X)=\lrVar{\sum_i X_i}\)?
For expectation, we know from Theorem 231 that linearity of expectation always holds.
However, for variance, this is not necessarily the case. As an
example, for a random variable \(Y\), define \(Z = Y + Y = 2Y\). It is
not the case that \(\Var(Z) = 2\Var(Y)\); in actuality, \(\Var(Z) =
4\Var(Y)\).
Theorem 244
Suppose \(X\) is a random variable. Then for any constant \(c\),
\(\Var(cX) = c^2\Var(X)\).
\[\begin{split}\Var(cX) &= \Ex[(cX)^2] - \Ex[cX]^2\\
&= \Ex[c^2X^2] - (c\Ex[X])^2\\
&= c^2\Ex[X^2] - c^2\Ex[X]^2\\
&= c^2\Var(X)\end{split}\]
In the second and third steps, we applied linearity of expectation
to pull the constants \(c\) and \(c^2\) out of the expectations.
For a more general sum of random variables, the variances only add if
the random variables are independent. To establish this, we first
demonstrate that the expectation of the product of independent random
variables is the product of their expectations.
Lemma 246
Let \(X\) and \(Y\) be independent random variables. Then
\[\Ex[XY] = \Ex[X]\cdot\Ex[Y]\]
Using this lemma, it can be shown that variance is additive for
independent random variables.
Theorem 248
Let \(X_1, \dots, X_n\) be independent random variables. Then
\[\lrVar{X_1 + \dots + X_n} = \Var(X_1) + \dots + \Var(X_n)\]
While variance tells us the expected squared distance from the
expectation, Chebyshev’s inequality allows us to bound the
probability of the absolute distance exceeding a given threshold.
Theorem 250(Chebyshev’s Inequality)
Let \(X\) be a random variable and let \(a > 0\). Then
\[\lrPr{\abs{X - \Ex[X]} \ge a} \le \frac{\Var(X)}{a^2}\]
We can derive Chebyshev’s inequality by applying Markov’s
inequality to the random variable \(Y = (X - \Ex[X])^2\).
Observe that since \(Y\) is a squared quantity, it is always
nonnegative, so Markov is applicable. Furthermore, for a
nonnegative \(a\), \((X - \Ex[X])^2 \ge a^2\) exactly when \(\abs{X -
\Ex[X]} \ge a\). Thus:
\[\begin{split}\lrPr{\abs{X - \Ex[X]} \ge a}
&= \lrPr{(X - \Ex[X])^2 \ge a^2}\\
&\le \frac{\lrEx{(X - \Ex[X])^2}}{a^2}\\
&= \frac{\Var(X)}{a^2}\end{split}\]
Suppose we flip a fair coin \(n\) times. What is the probability of
getting \(\le 49\%\) or \(\ge 51\%\) heads?
Let \(X\) be the number of heads. We previously computed that \(\Ex[X]
= \frac{n}{2}\), and \(\Var(X) = n\cdot\frac{1}{2}\cdot\frac{1}{2} =
\frac{n}{4}\). Then
\[\begin{split}\lrPr{\abs{X - \frac{n}{2}} \ge 0.01n}
&\le \frac{\Var(X)}{(0.01n)^2}\\
&= \frac{n/4}{0.0001n^2}\\
&= \frac{2500}{n}\end{split}\]
Thus, for 10,000 flips, the probability of deviating from the
expectation by 1% is at most 1/4, while for 1,000,000 flips, it is
at most 1/400.
In the example above, the random variable \(X\) can be defined as the
sum of independent, identically distributed (i.i.d.) random
variables \(X = X_1 + \dots + X_n\). In such a situation, we often
prefer to reason about \(X/n\) rather than \(X\) itself, since the
expectation of \(X/n\) is independent of \(n\); in particular, \(\Ex[X] =
n\Ex[X_i]\), whereas \(\Ex[X/n] = \Ex[X_i]\). The following is an
alternative expression of Chebyshev’s inequality for such a case:
Corollary 253
Let \(X = X_1 + \dots + X_n\) be the sum of \(n\) independent,
identically distributed random variables. Let \(\varepsilon > 0\)
be a deviation from the expectation \(\Ex[X/n] = \Ex[X_i]\). Then
\[\lrPr{\abs{\frac{X}{n} - \Ex[X_i]} \ge \varepsilon}
\le \frac{\Var(X_i)}{\varepsilon^2 n}\]
The event \(\abs{X/n-\Ex[X_i]} \ge \varepsilon\) is equivalent to the
event \(\abs{X-\Ex[X]} \ge \varepsilon n\), since \(\Ex[X] =
n\Ex[X_i]\) (multiply both sides of the first inequality by \(n\)). By
the standard form of Chebyshev’s inequality, we have
\[\begin{split}\lrPr{\abs{X - \Ex[X]} \ge \varepsilon n}
&\le \frac{\Var(X)}{\varepsilon^2 n^2}\\
&= \frac{n\Var(X_i)}{\varepsilon^2 n^2}\\
&= \frac{\Var(X_i)}{\varepsilon^2 n}\end{split}\]
In the second-to-last step, we used Theorem 248 to deduce that \(\Var(X) = n\Var(X_i)\) since the
\(X_i\) are independent.
Example 255
Suppose we flip a fair coin \(n\) times. What is the probability of
getting \(\le 49\%\) or \(\ge 51\%\) heads?
Let \(X_i\) be an indicator for whether flip \(i\) produces heads, and
let \(X = X_1 + \dots + X_n\). We have \(\Ex[X/n] = \Ex[X_i] = 1/2\),
and \(\Var(X_i) = 1/4\). Then
\[\lrPr{\abs{\frac{X}{n} - \frac{1}{2}} \ge 0.01}
\le \frac{1/4}{0.01^2\,n} = \frac{2500}{n}\]
Using Chebyshev’s inequality, we observe that the probability of
deviating by 1% from the expectation over \(n\) flips of a coin
decreases (at least) linearly in the number of flips \(n\). In reality,
the probability decreases much faster than this – using
multiplicative Chernoff bounds or
Hoeffding’s inequality, we can conclude
that the probability decreases exponentially in the number of flips.
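To see the difference concretely, the following sketch (illustrative, not
from the original text) evaluates the Chebyshev bound \(2500/n\) from the
example above alongside the two-sided Hoeffding bound
\(2e^{-2\varepsilon^2 n}\) introduced in the next section, for a 1%
deviation. For small \(n\) the two are comparable, but the Hoeffding bound
shrinks exponentially as \(n\) grows.

import math

def chebyshev_bound(n, eps=0.01):
    """Chebyshev bound on |X/n - 1/2| >= eps over n fair flips: Var(X_i)/(eps^2 n)."""
    return 0.25 / (eps ** 2 * n)

def hoeffding_bound(n, eps=0.01):
    """Two-sided Hoeffding bound on the same deviation: 2 * exp(-2 eps^2 n)."""
    return 2 * math.exp(-2 * eps ** 2 * n)

for n in (10_000, 100_000, 1_000_000):
    print(n, chebyshev_bound(n), hoeffding_bound(n))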
Hoeffding’s inequality, also called Chernoff-Hoeffding bounds, is
a set of concentration bounds that can give tighter results than
Markov’s inequality, Chebyshev’s inequality, or other forms of
Chernoff bounds. For simplicity, we restrict ourselves to the special
case [6] of independent indicator random variables, with expectation
\[\Pr[X_i = 1] = p_i\]
for each indicator \(X_i\), and let \(X = X_1 + \dots + X_n\) be their sum.
Let \(p\) be the expected value of \(\frac{1}{n}X\). We have
\[p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_{i=1}^n p_i\]
The Hoeffding bounds then state that for any \(\varepsilon > 0\),
\[\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le e^{-2\varepsilon^2 n}
\qquad\text{(upper tail)}\]
\[\lrPr{\frac{1}{n}X \le p - \varepsilon} \le e^{-2\varepsilon^2 n}
\qquad\text{(lower tail)}\]
We consider again the example of flipping a fair coin. Let \(H\) be the
total number of heads, and let \(H_i\) be an
indicator variable corresponding to whether the \(i\)th flip is heads.
We have \(\Pr[H_i = 1] = \frac{1}{2}\), and \(\Ex[\frac{1}{n}H] =
\frac{1}{2}\) for any number of flips \(n\).
What is the probability of getting at least six heads out of ten
flips? This is a deviation of \(\varepsilon = 0.1\), and applying the
upper-tail Hoeffding’s inequality gives us:
\[\lrPr{\frac{1}{n}H \ge \frac{1}{2} + 0.1}
\le e^{-2\cdot 0.1^2\cdot 10} = e^{-0.2} \approx 0.819\]
What is the probability of getting at least 60 heads out of 100 flips,
which is the same deviation \(\varepsilon = 0.1\) from the expectation?
Applying the upper tail again, we get
\[\lrPr{\frac{1}{n}H \ge \frac{1}{2} + 0.1}
\le e^{-2\cdot 0.1^2\cdot 100} = e^{-2} \approx 0.135\]
The bound decreases exponentially as the number of flips grows.
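As an illustrative aside, we can evaluate these bounds numerically and set
them against the exact binomial tail probabilities; the exact values show
that the bounds are valid but not tight.

import math

def hoeffding_upper(n, eps):
    """Upper-tail Hoeffding bound: Pr[H/n >= 1/2 + eps] <= exp(-2 eps^2 n)."""
    return math.exp(-2 * eps ** 2 * n)

def exact_upper(n, k):
    """Exact probability of at least k heads in n fair coin flips."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

print(hoeffding_upper(10, 0.1), exact_upper(10, 6))    # ~0.819 vs ~0.377
print(hoeffding_upper(100, 0.1), exact_upper(100, 60)) # ~0.135 vs ~0.028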
Example 259
Suppose we have a coin that is biased by probability \(\varepsilon\)
towards either heads or tails, but we don’t know which one. In other
words, either:
\(\Pr[\text{heads}] = \frac{1}{2} + \varepsilon\) and
\(\Pr[\text{tails}] = \frac{1}{2} - \varepsilon\), or
\(\Pr[\text{heads}] = \frac{1}{2} - \varepsilon\) and
\(\Pr[\text{tails}] = \frac{1}{2} + \varepsilon\).
To determine in which direction the coin is biased, we flip the
coin \(n\) times. If the results include at least \(n/2\) heads, we
assume the coin is biased towards heads, otherwise we assume it is
biased towards tails. How many flips should we do to guarantee our
answer is correct with probability at least \(1-\delta\)?
Let \(X\) be the number of heads, and let \(X_i\) be an indicator that
is 1 if the \(i\)th flip is heads, 0 if it is tails. Then \(X = X_1 +
\dots + X_n\) is the sum of independent indicator random variables
with \(\Ex[X_i]\) equal to either \(\frac{1}{2} + \varepsilon\) or to
\(\frac{1}{2} - \varepsilon\). We analyze the two cases individually.
Case 1: The coin is biased towards heads. Then \(\Ex[X_i] =
p = \frac{1}{2} + \varepsilon\). Our guess is erroneous when we see
fewer than \(n/2\) heads, i.e. when \(\frac{1}{n}X < \frac{1}{2} = p -
\varepsilon\). By the lower-tail Hoeffding bound, this occurs with
probability
\[\lrPr{\frac{1}{n}X \le p - \varepsilon}
\le e^{-2\varepsilon^2 n}\]
Case 2: The coin is biased towards tails. Then \(\Ex[X_i] =
p = \frac{1}{2} - \varepsilon\). Our guess is erroneous when we see at
least \(n/2\) heads, i.e. when \(\frac{1}{n}X \ge \frac{1}{2} = p +
\varepsilon\). By the upper-tail Hoeffding bound, this occurs with
probability
\[\lrPr{\frac{1}{n}X \ge p + \varepsilon}
\le e^{-2\varepsilon^2 n}\]
In either case, the probability of error after \(n\) flips is
upper-bounded by \(e^{-2\varepsilon^2 n}\). We want this probability
to be no more than \(\delta\):
\[\begin{split}e^{-2\varepsilon^2 n} &\le \delta\\
-2\varepsilon^2 n &\le \ln\delta\\
n &\ge \frac{\ln(1/\delta)}{2\varepsilon^2}\end{split}\]
If \(\varepsilon = 0.01\) (i.e. the coin is biased towards heads or
tails by 1%) and \(\delta = 0.0001\) (we want to be correct at least
99.99% of the time), then
\[n \ge \frac{\ln 10^4}{2\cdot 0.01^2} \approx 46052\]
flips suffice.
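Under the same bound, the required number of flips is easy to compute
programmatically; this small helper is illustrative.

import math

def flips_needed(eps, delta):
    """Smallest n satisfying exp(-2 eps^2 n) <= delta, i.e.
    n >= ln(1/delta) / (2 eps^2)."""
    return math.ceil(math.log(1 / delta) / (2 * eps ** 2))

print(flips_needed(0.01, 0.0001))  # 46052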
Suppose we have a coin that is either fair, or is biased by
probability \(\varepsilon\) towards heads, but we don’t know which
is the case. In other words, either:
\(\Pr[\text{heads}] = \frac{1}{2}\) and
\(\Pr[\text{tails}] = \frac{1}{2}\), or
\(\Pr[\text{heads}] = \frac{1}{2} + \varepsilon\) and
\(\Pr[\text{tails}] = \frac{1}{2} - \varepsilon\).
We flip the coin \(n\) times and count the number of heads \(X\).
For what values of \(X\) should we guess that the coin is fair, and
for what values should we guess that it is biased towards heads?
How many flips should we do to guarantee our answer is correct
with probability at least \(1 - \delta\)?
How many flips should we do for \(\varepsilon = 0.01\) and \(\delta
= 0.0001\)? How does this compare to the situation in
Example 259 with
\(\varepsilon = 0.005\) and \(\delta = 0.0001\)?
Rather than applying concentration bounds to compute the probability
of a deviation for a specific sample size, we often wish to determine
how many samples we need to be within a particular deviation with high
confidence. One application of this is big data, where we have vast
datasets that are too large to examine in their entirety, so we sample
the data instead to estimate the quantities of interest. Similar to
this is polling – outside of elections themselves, we typically do
not have the resources to ask the entire population for their
opinions, so we need to sample the populace to estimate the support
for a particular candidate or political position. In both
applications, we need to determine how many samples are needed to
obtain a good estimate.
In general, a poll estimates the fraction of the population that
supports a particular candidate by asking \(n\) randomly chosen people
whether they support that candidate. Let \(X\) be a random variable
corresponding to the number of people who answer this question in the
affirmative. Then \(X/n\) is an estimate of the level of support in the
full population.
A typical poll has both a confidence level and a margin of error
– the latter corresponds to the deviation from the true fraction \(p\)
of people who support the candidate, and the former corresponds to a
bound on the probability that the estimate is within that deviation.
For example, a 95% confidence level and a margin of error of ±2%
requires that
\[\lrPr{\abs{\frac{X}{n} - p} \le 0.02} \ge 0.95\]
Equivalently, the probability of deviating from \(p\) by more than 2%
must be at most 0.05. Applying the Hoeffding bounds to both tails, it
suffices that
\[2e^{-2\cdot 0.02^2\cdot n} \le 0.05\]
which holds when \(n \ge \frac{\ln 40}{2\cdot 0.02^2} \approx 4611.1\).
Thus, we obtain the given confidence level and margin of error by
polling at least 4612 people. Observe that this does not depend on the
total population size!
For an arbitrary margin of error ±\(\varepsilon\) at the same 95%
confidence level, we obtain:
\[n \ge \frac{\ln 40}{2\varepsilon^2}\]
More generally, if we wish to gauge the level of support for \(m\)
different candidates, the sampling theorem tells us that the number
of samples required is logarithmic in \(m\).
Theorem 261(Sampling Theorem)
Suppose \(n\) people are polled to ask which candidate they support,
out of \(m\) possible candidates. Let \(\Xj\) be the number of people
who state that they support candidate \(j\), and let \(p_j\) be the
true level of support for that candidate. We wish to obtain
\[\lrPr{\abs{\frac{\Xj}{n} - p_j} \le \varepsilon \text{ for all } j}
\ge 1 - \gamma\]
In other words, we desire a confidence level \(1 - \gamma\) that all
estimates \(\Xj/n\) are within margin of error ±\(\varepsilon\) of the
true values \(p_j\). We obtain this when the number of samples \(n\) is
\[n \ge \frac{\ln(2m/\gamma)}{2\varepsilon^2}\]
The sampling theorem can be derived by applying the union bound: each
individual estimate \(\Xj/n\) deviates from \(p_j\) by more than
\(\varepsilon\) with probability at most \(2e^{-2\varepsilon^2 n}\) by
the Hoeffding bounds, so the probability that any of the \(m\)
estimates deviates is at most \(2me^{-2\varepsilon^2 n}\), which is at
most \(\gamma\) for the above choice of \(n\).
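Assuming the Hoeffding-plus-union-bound form given above, the required poll
size can be computed directly; the helper below is a sketch with
illustrative names.

import math

def poll_size(eps, gamma, m=1):
    """Number of samples so that, with probability at least 1 - gamma, all m
    estimated fractions are within eps of their true values:
    n >= ln(2m / gamma) / (2 eps^2)."""
    return math.ceil(math.log(2 * m / gamma) / (2 * eps ** 2))

print(poll_size(0.02, 0.05))        # 4612, matching the example above
print(poll_size(0.02, 0.05, m=10))  # grows only logarithmically in m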
In conclusion, when sampling from a large dataset, the number of
samples required does depend on the desired accuracy of the estimation
and the range size (i.e. number of possible answers). But it does not
depend on the population size. This makes sampling a powerful
technique for dealing with big data, as long as we are willing to
tolerate a small possibility of obtaining an inaccurate estimate.
Job scheduling is another application where we can exploit randomness
to construct a simple, highly effective algorithm. In this problem, we
have \(k\) servers, and there are \(n\) jobs that need to be distributed
to these servers. The goal is to balance the load among the servers,
so that no one server is overloaded. This problem is very relevant to
content-delivery networks – there may be on the order of millions or
even billions of concurrent users, and each user needs to be routed to
one of only hundreds or thousands of servers. Thus, we have \(n \gg k\)
in such a case.
One possible algorithm is to always send a job to the most lightly
loaded server. However, this requires significant coordination – the
scheduler must keep track of the load on each server, which requires
extra communication, space, and computational resources. Instead, we
consider a simple, randomized algorithm that just sends each job to a
random server. The expected number of jobs on each server is \(n/k\).
But how likely are we to be close to this ideal, balanced load?
We start our analysis by defining random variables \(\Xj\) corresponding
to the number of jobs assigned to server \(j\). We would like to
demonstrate that the joint probability of \(\Xj\) being close to \(n/k\)
for all \(j\) is high. We first reason about the load on an individual
server. In particular, we wish to compute a bound on
\[\lrPr{\Xj \ge \frac{n}{k} + c}\]
for some value \(c > 0\), i.e. the probability that server \(j\) is
overloaded by at least \(c\) jobs. Let \(\Xj_i\) be an indicator random
variable that is 1 if job \(i\) is sent to server \(j\). Since the target
server for job \(i\) is chosen uniformly at random out of \(k\) possible
choices, we have
\[\lrEx{\Xj_i} = \lrPr{\Xj_i = 1} = \frac{1}{k}\]
We also have \(\Xj = \Xj_1 + \dots + \Xj_n\), giving us
\[\lrEx{\Xj} = \sum_{i=1}^n \lrEx{\Xj_i} = \frac{n}{k}\]
Applying the upper-tail Hoeffding bound with deviation \(\varepsilon =
c/n\), we obtain
\[\lrPr{\Xj \ge \frac{n}{k} + c} \le e^{-2c^2/n}\]
Now that we have a bound on an individual server being overloaded by
\(c\) jobs, we wish to bound the probability that there is some server
that is overloaded. The complement event is that no server is
overloaded, so we can equivalently compute the joint probability that
none of the servers is overloaded by \(c\) or more jobs:
\[\lrPr{\Xj < \frac{n}{k} + c \text{ for all } j}\]
However, we cannot do so by simply multiplying the individual
probabilities together – that only works if the probabilities are
independent, and in this case, they are not. In particular, if one
server is overloaded, some other server must be underloaded, so the
random variables \(\Xj\) are not independent.
Rather than trying to work with the complement event, we attempt to
directly compute the probability that there is at least one overloaded
server. By the union bound,
\[\lrPr{\exists j:\ \Xj \ge \frac{n}{k} + c}
\le \sum_{j=1}^k \lrPr{\Xj \ge \frac{n}{k} + c}
\le k e^{-2c^2/n}\]
Choosing \(c = \sqrt{n\ln k}\) makes this bound equal to
\(k e^{-2\ln k} = 1/k\). For \(n = 10^{10}\) jobs and \(k = 1000\)
servers, this gives \(c = \sqrt{10^{10}\ln 1000} \approx 2.6\times
10^5\) extra jobs, compared to an expected load of \(n/k = 10^7\) jobs
per server.
This is an overhead of about 2.6% above the expected load. The
probability that there is a server overloaded by at least 2.6% is
bounded from above by \(1/k = 0.001\), or ≤0.1%. Thus, when there are \(n
= 10^{10}\) jobs and \(k = 1000\) servers, the randomized algorithm has a
high probability (≥99.9%) of producing a schedule where the servers
all have a low overhead (≤2.6%).
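As a numerical check (illustrative, and assuming the per-server Hoeffding
bound used above), we can verify the 2.6% overhead threshold and the \(1/k\)
failure probability for \(n = 10^{10}\) and \(k = 1000\).

import math

def overload_probability(n, k, c):
    """Union bound over k servers of the chance that some server receives at
    least n/k + c jobs: k * exp(-2 c^2 / n)."""
    return k * math.exp(-2 * c ** 2 / n)

n, k = 10 ** 10, 1000
c = math.sqrt(n * math.log(k))       # threshold chosen in the analysis above
print(c / (n / k))                   # ~0.026, i.e. about a 2.6% overhead
print(overload_probability(n, k, c)) # 0.001 = 1/k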
Exercise 262
Previously, we demonstrated that to use
polling to achieve a confidence level \(1 - \gamma\) and a margin of
error ±\(\varepsilon\),