Randomness

Randomized Algorithms

Thus far, every algorithm we have seen is deterministic. This means that, when run on a given input, the steps the algorithm performs and the output it produces are entirely determined. So, when run multiple times on the same input, the algorithm produces the same output. At heart, this is because any Turing machine’s transition rules are deterministic, i.e., it has a transition function.

In this unit, we consider how randomness can be applied in computation. We will exploit randomness to design simple, probabilistic algorithms, and we will discuss several tools that can be used to analyze randomized algorithms.

Since Turing machines are deterministic, they cannot make random choices—so how are randomized algorithms possible? In real life, when we want to make a random choice, we can flip a coin, roll a die, or the like. Similarly, to model randomized algorithms and Turing machines, we augment them with an external source of randomness.

For this purpose, the simplest source of randomness is a (virtual) “fair coin” that the algorithm can “flip” as needed. Formally, we can define a randomized Turing machine as having a special “coin flip” state that, whenever it is entered, writes a fresh unbiased random bit at the location of the tape head, and transitions back to the previous state. With this basic feature it is possible to simulate richer sources of randomness, like the roll of a die with any finite number of sides.
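
For concreteness, the following is a small Python sketch (ours; the function names and the use of the random module are illustrative, not part of the formal model) that simulates a fair die with any number of sides using only fair coin flips, via rejection sampling.

import random

def coin_flip():
    """One fair coin flip: returns a uniformly random bit (0 or 1)."""
    return random.randrange(2)

def die_roll(sides=6):
    """Simulate a fair roll of a die with the given number of sides,
    using only fair coin flips (rejection sampling)."""
    bits = (sides - 1).bit_length()      # enough bits to cover 0..sides-1
    while True:
        # Build a uniformly random integer in {0, ..., 2^bits - 1}.
        value = 0
        for _ in range(bits):
            value = 2 * value + coin_flip()
        if value < sides:                # accept; otherwise flip again
            return value + 1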

In this text, we write algorithms and pseudocode that make truly random, unbiased, and independent choices from whatever finite set is convenient. In the real world, when implementing and using randomized algorithms—especially for cryptography—it is vital to use a high-quality source of randomness. (Ensuring this can be a very delicate matter, but it is outside the scope of this course.)

As a first example of the utility of randomized algorithms, consider the game of rock-paper-scissors between two players. In this game, both players simultaneously choose one of three options: rock, paper, or scissors. The game is a tie if both players choose the same option. If they make different choices, then rock beats (“smashes”) scissors, scissors beats (“cuts”) paper, and paper beats (“covers”) rock:

table showing that rock beats scissors, scissors beats paper, and paper beats rock

Suppose we write an algorithm (or strategy) to play the game, e.g., in a best-out-of-five sequence of rounds. Similar to worst-case analysis of algorithms, we should consider our strategy’s performance against an opponent strategy that plays optimally against it. In other words, we should assume that our opponent “knows” our strategy, and plays the best it can with that knowledge.

If we “hard-code” a strategy—like playing rock in the first two rounds, then paper, then scissors, then paper—then how will this strategy perform? The strategy is fixed and deterministic, so an optimal opponent can win every round (namely, play paper in the first two rounds, then scissors, then rock, then scissors). More generally, for any deterministic strategy—even one that is “adaptive” based on the opponent’s moves—there exists an opponent strategy that entirely beats it. This illustrates the problem with deterministic strategies: they make the program’s actions predictable, and thus easily countered.

Now consider a strategy that chooses a random action in each move, with equal probability for rock, paper, and scissors. Here, even an opponent that knows this strategy does not know in advance which action the strategy will choose, and therefore cannot universally counter that move. In fact, we prove below that no matter what strategy the opponent uses, the probability that the opponent wins an individual round is just \(1/3\) (and the same goes for tying and losing the round). To see why this is the case, we first review some basic tools of probability.

Review of Probability

We first recall the basic definitions and axioms of probability. For simplicity, we focus on discrete probability, which will be sufficient for our purposes.

Probability Spaces and Events

Definition 206 (Probability Space)

A (discrete) probability space is a countable set \(\Omega\), called the sample space of possible outcomes, along with a probability function \(\Pr \colon \Omega \to [0,1]\) that assigns a probability to each outcome, and satisfies

\[\sum_{\omega \in \Omega} \Pr[\omega] = 1 \; \text.\]

Let’s elaborate on this definition and consider some examples. The sample space \(\Omega\) is the set of every possible outcome—i.e., “everything that could potentially happen”—in a randomized process, which is also known as an experiment. Each outcome has a probability between \(0\) and \(1\) (inclusive) of occurring, as defined by the probability function \(\Pr\), and all these probabilities must sum to \(1\).

Example 207 (Three Fair Coins: Probability Space)

Consider the experiment of tossing three “fair” (unbiased) coins in sequence. We formalize this as a probability space—i.e., a sample space and a probability function on it.

The sample space \(\Omega\) is the set of all \(8 = 2^3\) possible sequences of three heads or tails that can occur, i.e.,

\[\Omega = \set{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} \; \text,\]

where \(H\) represents heads and \(T\) represents tails.

Next, we define an appropriate probability function. To represent the fairness of the coins, we require that each of the eight outcomes has the same probability. Since by definition the probabilities of all eight outcomes must sum to \(1\), we should define \(\Pr[\omega] = 1/8\) for every outcome \(\omega \in \Omega\), e.g., \(\Pr[HHH] = \Pr[HHT] = \cdots = \Pr[TTT] = 1/8\). [1]

These definitions generalize straightforwardly to the probability experiment of tossing \(n \geq 0\) fair coins, where the sample space is \(\Omega = \set{H,T}^n\), and each outcome \(\omega \in \Omega\) has the same probability \(\Pr[\omega] = 1/2^n\).

Definition 208 (Event)

For a probability space \((\Omega,\Pr)\), an event is any subset \(E \subseteq \Omega\) of the sample space. The probability function is extended to events as

\[\Pr[E] = \sum_{\omega \in E} \Pr[\omega] \; \text.\]

In other words, an event is any subset of outcomes, and its probability is the sum of the probabilities of those outcomes. We now build on the previous example.

Example 209 (Three Fair Coins: Events)

Continuing from the probability space for tossing three fair coins as defined in Example 207, we now consider some example events.

  • The event “the first toss comes up heads” is formalized as \(E_1 = \set{HHH, HHT, HTH, HTT} \subseteq \Omega\), i.e., the subset of all outcomes in which the first character is \(H\). By definition, the probability of this event is the sum of the probabilities of all the outcomes in the event. Since this event has four outcomes, and each has probability \(1/8\), we get that \(\Pr[E_1] = 4/8 = 1/2\). This matches our intuition that the probability of getting heads on the first toss (ignoring what happens on the remaining two tosses) is \(1/2\).

  • The event “the first and third tosses come up the same” is formalized as the subset \(E_2 = \set{HHH, HTH, THT, TTT} \subseteq \Omega\). Similarly, the probability of this event is \(\Pr[E_2] = 4/8 = 1/2\). This also matches our intuition: no matter how the first toss comes up, the third toss has a \(1/2\) probability of matching it.

  • The event “the same number of heads and tails come up” is formalized as the empty subset \(E_3 = \emptyset \subseteq \Omega\). This is because no outcome has the same number of heads and tails. [2] By definition, the probability of this event is \(\Pr[E_3] = 0\).
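
To double-check these probabilities, the following Python sketch (ours, purely illustrative) enumerates the sample space for three fair coins and sums the outcome probabilities over each of the three events above.

from itertools import product

# Sample space for three fair coins: all strings in {H,T}^3, each with
# probability 1/8.
omega = ["".join(s) for s in product("HT", repeat=3)]
prob = {w: 1/8 for w in omega}

def pr(event):
    """Probability of an event (a set of outcomes)."""
    return sum(prob[w] for w in event)

E1 = {w for w in omega if w[0] == "H"}                    # first toss is heads
E2 = {w for w in omega if w[0] == w[2]}                   # first and third match
E3 = {w for w in omega if w.count("H") == w.count("T")}   # equal heads and tails

print(pr(E1), pr(E2), pr(E3))   # 0.5 0.5 0 (E3 is empty)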

Because events are just subsets of the sample space, we can use set relations to relate their probabilities. For example, for any events \(E_1 \subseteq E_2\), we have that \(\Pr[E_1] \leq \Pr[E_2]\), because \(\Pr[\omega] \geq 0\) for every outcome \(\omega\). (However, we caution that \(E_1 \subsetneq E_2\) does not imply that \(\Pr[E_1] < \Pr[E_2]\), because the outcomes in \(E_2 \setminus E_1\) might have probability zero.)

We can also use set operations on events to yield other events. For example, every event \(E\) has a complement event \(\overline{E} = \Omega \setminus E\), whose probability is \(\Pr[\overline{E}] = 1 - \Pr[E]\), because

\[\begin{split}\Pr[E] + \Pr[\overline{E}] &= \sum_{\omega \in E} \Pr[\omega] + \sum_{\omega \in \overline{E}} \Pr[\omega] \\ &= \sum_{\omega \in \Omega} \Pr[\omega] \\ &= 1 \; \text.\end{split}\]

We can also consider intersections and unions of events. Notationally, instead of \(\cap\) and \(\cup\), we usually use the logical operators \(\wedge\) (AND) and \(\vee\) (OR), respectively. This captures the idea that the intersection of two events represents the occurrence of both events simultaneously, and their union represents the occurrence of one event or the other (or both).

For two events \(E_1,E_2\), the probability \(\Pr[E_1 \wedge E_2]\) that both occur simultaneously is called their joint probability (and similarly for more events). It follows immediately that this probability is at most \(\min \set{\Pr[E_1], \Pr[E_2]}\), because \(E_1 \wedge E_2 \subseteq E_1, E_2\). When considering the intersection of events, an important case is when, informally, the (non-)occurrence of one event does not affect the probability of the (non-)occurrence of another event. This is known as independence; the formal definition is as follows.

Definition 210 (Independent Events)

Two events \(E_1, E_2\) (of the same probability space) are independent if \(\Pr[E_1 \wedge E_2] = \Pr[E_1] \cdot \Pr[E_2]\).

More generally, events \(E_1, \ldots, E_n\) are mutually independent if \(\Pr[E_1 \wedge \cdots \wedge E_n] = \Pr[E_1] \cdots \Pr[E_n]\). They are pairwise independent if every pair of them is independent.

Mutual independence of several events is a stronger notion than pairwise independence: a collection of events can be pairwise independent but not mutually independent (see Example 218 below).

Example 211 (Three Fair Coins: Independence)

Continuing from Example 209, we consider several events and whether they are independent:

  • The events \(E_1\) (“the first toss comes up heads”) and \(E_2\) (“the first and third tosses come up the same”) are independent, because \(E_1 \wedge E_2 = \set{HHH, HTH}\) and hence \(\Pr[E_1 \wedge E_2] = 1/4 = 1/2 \cdot 1/2 = \Pr[E_1] \cdot \Pr[E_2]\). This matches our intuition that whether or not the first toss comes up heads, the probability that the third toss matches the first one is not affected.

  • The three events “the first/second/third toss comes up heads” are mutually independent (and hence also pairwise independent): their intersection is the event \(\set{HHH}\), which has probability \(1/8 = (1/2)^3\), which is the product of the \(1/2\) probabilities of the three individual events. The same holds if we replace “heads” with “tails” in any of these events. All this matches our intuition that the results of any of the tosses do not affect any of the other tosses.

  • The two events “the first/last pair of tosses come up heads” are not independent: their intersection is the event \(\set{HHH}\), which has probability \(1/8\), but they each have probability \(1/4\), and \(1/4 \cdot 1/4 = 1/16 \neq 1/8\). This matches our intuition: the first pair of tosses coming up heads makes it more likely that the last pair does as well, because they have the middle (second) toss in common.

  • The three events “the [first two/last two/first and third] tosses come up the same” are pairwise independent, but not mutually independent. Each of them has probability \(1/2\), and the intersection of any two of them is the event \(\set{HHH,TTT}\), which has probability \(1/4 = 1/2 \cdot 1/2\). However, the intersection of all three of them is also \(\set{HHH,TTT}\), whereas the product of their individual \(1/2\) probabilities is \(1/8\). This failure of mutual independence is because the occurrence of any two of these events implies the occurrence of the third one.

When considering the union of events, the following lemma is a basic but widely applicable tool.

overlapping events in a sample space; the probability of their union is at most the sum of the probabilities of the individual events (and may be strictly less, because some outcomes are in multiple events, and thus their probabilities are counted multiple times)
Lemma 212 (Union Bound)

Let \(E_1, E_2 \subseteq \Omega\) be events of the same probability space. Then

\[\Pr[E_1 \vee E_2] \leq \Pr[E_1] + \Pr[E_2] \; \text,\]

with equality if the events are disjoint (i.e., if \(E_1 \wedge E_2 = \emptyset\)).

More generally, the probability of the union of any countably many events is at most the sum of their individual probabilities, with equality if the events are pairwise disjoint.

Proof 213

By definition,

\[\begin{split}\Pr[E_1 \vee E_2] &= \sum_{\omega \in E_1 \vee E_2} \Pr[\omega] \\ &\leq \sum_{\omega \in E_1} \Pr[\omega] + \sum_{\omega \in E_2} \Pr[\omega] \\ &= \Pr[E_1] + \Pr[E_2] \; \text,\end{split}\]

where the inequality holds because the two sums on the right include every \(\omega \in E_1 \vee E_2\) at least once, and \(\Pr[\omega] \geq 0\) for those \(\omega\) that are double-counted, i.e., the \(\omega \in E_1 \wedge E_2\). In particular, if \(E_1,E_2\) are disjoint, then no outcome is double-counted, hence the bound is an equality.

The more general claim (for more than two events) follows via induction, by applying the main claim (for two events) to the first event and the union of the others, and applying the inductive hypothesis to the latter union.

Random Variables

A random variable is, informally, a numerical quantity whose value is determined by the outcome of a probability experiment. It is defined formally as follows.

Definition 214 (Random Variable)

A random variable of a probability space \((\Omega,\Pr)\) is a function \(X \colon \Omega \to \R\) from the sample space to the real numbers. So, the set of potential values of \(X\) is its image \(X(\Omega) = \set{X(\omega) : \omega \in \Omega}\).

For any real number \(v \in \R\), the notation \(X=v\) denotes the event \(X^{-1}(v) = \set{\omega \in \Omega : X(\omega)=v}\), i.e., the set of all outcomes that map to value \(v\) under \(X\). The events \(X \geq v\), \(X < v\), etc. are defined similarly.

By the above definition, the probabilities of the events \(X=v\) and \(X \geq v\) are, respectively,

\[\begin{split}\Pr[X=v] &= \sum_{\omega : X(\omega)=v} \Pr[\omega] \; \text, \\ \Pr[X \geq v] &= \sum_{\omega : X(\omega) \geq v} \Pr[\omega] \; \text.\end{split}\]

A frequently useful kind of random variable is an indicator random variable, also called a 0-1 random variable.

Definition 215 (Indicator Random Variable)

An indicator random variable (of a probability space \((\Omega, \Pr)\)) is one that is limited to the values 0 and 1, i.e., a function \(X \colon \Omega \to \bit\).

An indicator random variable can be defined for any event \(E\), as \(X(\omega)=1\) for every \(\omega \in E\) and \(X(\omega)=0\) otherwise. This means that the random variable has value \(1\) if the event occurs, and value \(0\) if the event does not occur. In the other direction, any indicator random variable \(X\) has a corresponding event \(X^{-1}(1) = \set{\omega \in \Omega : X(\omega)=1}\), the set of exactly those outcomes that make the random variable have value \(1\).

Similar to the case for events, we can define independence for random variables.

Definition 216 (Independent Random Variables)

Random variables \(X_1, X_2\) are independent if for any \(v_1, v_2 \in \R\), the events \(X_1 = v_1\) and \(X_2 = v_2\) are independent, i.e.,

\[\Pr[X_1 = v_1 \wedge X_2 = v_2] = \Pr[X_1=v_1] \cdot \Pr[X_2=v_2] \; \text.\]

Mutual and pairwise independence of several random variables are defined similarly, as in Definition 210.

Example 217 (Three Fair Coins: Random Variables and Independence)

Building on Example 207 from above, we define several random variables and consider whether they are independent.

We first define a random variable \(X\) representing the number of heads that come up, as the function \(X \colon \Omega \to \R\) where

\[\begin{split}X(TTT) &= 0 \; \text, \\ X(HTT) = X(THT) = X(TTH) &= 1 \; \text, \\ X(HHT) = X(HTH) = X(THH) &= 2 \; \text, \\ X(HHH) &= 3 \; \text.\end{split}\]

The event \(X=0\) is \(\set{TTT}\), and hence has probability \(\Pr[X=0] = 1/8\). The event \(X \geq 2\) is \(\set{HHT,HTH,THH,HHH}\), and hence has probability \(\Pr[X \geq 2] = 4/8 = 1/2\). The events \(X=1.5\) and \(X=5\) are both \(\emptyset\), and hence each have probability zero.

For each \(i=1,2,3\), we can define an indicator random variable \(X_i\) that indicates whether the \(i\)th toss comes up heads. Specifically, \(X_i(\omega)=1\) if the \(i\)th character of \(\omega\) is \(H\), and \(X_i(\omega)=0\) otherwise. For example, \(X_2\) is the indicator random variable for the event \(\set{HHH, HHT, THH, THT}\). By inspection, \(\Pr[X_i = 0] = \Pr[X_i = 1] = 1/2\) for every \(i\).

Observe that \(X = X_1 + X_2 + X_3\), because \(X\) is the total number of heads, and each \(X_i\) contributes \(1\) to the sum if and only if the \(i\)th toss comes up heads (otherwise it contributes \(0\)).

The random variables \(X_1, X_2, X_3\) are mutually independent (and hence also pairwise independent). To see this, consider any \(v_1,v_2,v_3 \in \bit\). Then the event \(X_1 = v_1 \wedge X_2 = v_2 \wedge X_3 = v_3\) consists of exactly one outcome (because the \(v_i\) values collectively specify the result of every toss), so it has probability \(1/8 = (1/2)^3\), which is the product of the \(1/2\) probabilities of the three individual events \(X_i = v_i\).

All of the definitions and claims from this example generalize straightforwardly to the probability experiment of tossing any number of fair coins.

Example 218 (Three Fair Coins: More Random Variables and (Non-)Independence)

We can also define a random variable \(Y\) representing the number of pairs of tosses that come up the same. That is,

\[\begin{split}Y(HHH) = Y(TTT) &= 3 \; \text, \\ Y(\omega) &= 1 \; \text{otherwise.}\end{split}\]

For each pair of coin tosses, we can define an indicator random variable that indicates whether that pair comes up the same. Specifically, for each \(i=1,2,3\), consider the pair of tosses that does not include the \(i\)th toss; then let \(Y_i=1\) if this pair comes up the same, and \(Y_i=0\) otherwise. For example, \(Y_1\) is the indicator random variable for the event \(\set{HHH, HTT, THH, TTT}\), so \(\Pr[Y_1 = 1] = \Pr[Y_1 = 0] = 1/2\), and similarly for \(Y_2, Y_3\).

Observe that \(Y = Y_1 + Y_2 + Y_3\), because \(Y\) is the total number of pairs that come up the same, and each \(Y_i\) corresponds to one of the three pairs, contributing \(1\) to the sum if that pair comes up the same, and \(0\) otherwise.

The random variables \(Y_1, Y_2, Y_3\) are pairwise independent, but not mutually independent. To see pairwise independence, take any values \(v_i, v_j \in \bit\) for any \(i \neq j\). Then the event \(Y_i = v_i \wedge Y_j = v_j\) consists of exactly two outcomes, so it has probability \(1/4\), which is the product of the \(1/2\) probabilities of the two individual events \(Y_i = v_i\) and \(Y_j = v_j\). This matches our intuition that whether one pair of coins comes up the same does not affect whether another pair of coins does (even if the two pairs have a coin in common).

To see that the \(Y_i\) are not mutually independent, there are several counterexamples to the definition. For instance, the event \(Y_1 = 1 \wedge Y_2 = 1 \wedge Y_3 = 1\) is \(\set{HHH,TTT}\), so it has probability \(1/4\), which is not the product of the \(1/2\) probabilities of the three individual events \(Y_i = 1\) for \(i=1,2,3\). This matches our intuition: if two pairs of coins come up the same, then this implies that the third pair comes up the same as well.
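
These independence claims can also be verified mechanically. The following Python sketch (ours) enumerates the sample space, checks pairwise independence of \(Y_1, Y_2, Y_3\), and exhibits the failure of mutual independence.

from itertools import combinations, product

omega = ["".join(s) for s in product("HT", repeat=3)]
prob = {w: 1/8 for w in omega}

def pr(event):
    return sum(prob[w] for w in event)

# Y_i concerns the pair of tosses that does *not* include toss i
# (0-indexed positions of that pair).
pairs = {1: (1, 2), 2: (0, 2), 3: (0, 1)}

def event(i, v):
    """The event Y_i = v, as a set of outcomes."""
    a, b = pairs[i]
    return {w for w in omega if int(w[a] == w[b]) == v}

# Pairwise independence: Pr[Y_i=v_i and Y_j=v_j] = Pr[Y_i=v_i] * Pr[Y_j=v_j].
for i, j in combinations([1, 2, 3], 2):
    for vi, vj in product([0, 1], repeat=2):
        joint = pr(event(i, vi) & event(j, vj))
        assert abs(joint - pr(event(i, vi)) * pr(event(j, vj))) < 1e-12

# Mutual independence fails: Pr[Y_1=1, Y_2=1, Y_3=1] = 1/4, not (1/2)^3 = 1/8.
print(pr(event(1, 1) & event(2, 1) & event(3, 1)))   # 0.25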

Definition 219

The probability distribution (also known as probability mass function) of a random variable \(X\) is the function \(p_X \colon \R \to [0,1]\) that maps each real value to the probability that \(X\) takes on this value, i.e., \(p_X(v) = \Pr[X=v]\).

Example 220 (Three Fair Coins: Probability Distributions)

Continuing from Example 217, the probability distribution of the random variable \(X\) representing the number of heads that come up is

\[\begin{split}p_X(0) = \Pr[X=0] &= 1/8 \; \text, \\ p_X(1) = \Pr[X=1] &= 3/8 \; \text, \\ p_X(2) = \Pr[X=2] &= 3/8 \; \text, \\ p_X(3) = \Pr[X=3] &= 1/8 \; \text, \\ p_X(v) = \Pr[X=v] &= 0 \text{ otherwise.}\end{split}\]

The probability distribution of the random variable \(Y\) from Example 218 is \(p_Y(1) = 3/4\), \(p_Y(3) = 1/4\), and \(p_Y(v) = 0\) otherwise.

Expectation

The probability distribution of a random variable gives the probability of each potential value of the variable. Often, we seek more succinct “summary” quantities about the random variable. One such quantity is its expectation, which is the “weighted average” of the random variable, where each value is weighted by its probability.

Definition 221 (Expectation)

The expected value, also known as expectation, of a random variable \(X\) is defined as

\[\Ex[X] = \sum_{v \in X(\Omega)} v \cdot \Pr[X = v] \; \text.\]

The name “expected value” is a bit of a misnomer, because it is not really the value we “expect” to get in the probability experiment. Indeed, a random variable might not even be able to take on its expected value at all! (Example 224 below gives a simple example of this.) Instead, it is good to think of the expectation as the average value of a random variable.

Recall that

\[\Pr[X = v] = \sum_{\omega: X(\omega) = v} \Pr[\omega] \; \text.\]

Plugging this into the definition of expectation, we get that

\[\begin{split}\Ex[X] &= \sum_{v \in X(\Omega)} v \cdot \parens[\Big]{\sum_{\omega: X(\omega) = v} \Pr[\omega]} \\ &= \sum_{v \in X(\Omega)} \sum_{\omega: X(\omega) = v} X(\omega) \cdot \Pr[\omega] \\ &= \sum_{\omega \in \Omega} X(\omega) \cdot \Pr[\omega]\end{split}\]

as an alternative expression of \(\Ex[X]\).

Lemma 222

If \(X\) is an indicator random variable, then \(\Ex[X] = \Pr[X=1]\).

Proof 223

This follows directly from the definition of expectation, and the fact that an indicator random variable is limited to the values \(0,1\):

\[\Ex[X] = 0 \cdot \Pr[X=0] + 1 \cdot \Pr[X=1] = \Pr[X=1] \; \text.\]
Example 224 (Three Fair Coins: Expectations)

Continuing from Example 217, the expectation of the random variable \(X\) (the number of heads that come up) is

\[\begin{split}\Ex[X] &= \sum_{v \in X(\Omega)} v \cdot \Pr[X=v] \\ &= \sum_{v=0}^3 v \cdot \Pr[X=v] \\ &= 0 \cdot 1/8 + 1 \cdot 3/8 + 2 \cdot 3/8 + 3 \cdot 1/8 \\ &= 12/8 = 3/2 \; \text.\end{split}\]

This matches our intuition, that when we toss three fair coins, the average number of heads that come up is \(3/2\). Notice that this expectation is not a value that \(X\) can actually take on; it is merely the “average” value.

Continuing from Example 218, the expectation of \(Y\) (the number of pairs of tosses that come up the same) is

\[\begin{split}\Ex[Y] &= \sum_{v \in Y(\Omega)} v \cdot \Pr[Y=v] \\ &= 1 \cdot 3/4 + 3 \cdot 1/4 \\ &= 6/4 = 3/2 \; \text.\end{split}\]

This also might (or might not!) match our intuition: there are three pairs of tosses, and each one has probability \(1/2\) of coming up the same. Even though these three events are not mutually independent (see Example 218), the expected number of equal pairs of tosses happens to be \(3/2\). This is not a coincidence, as we will see next.
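
As a quick sanity check, the following Python sketch (ours) computes both expectations exactly, by enumerating the eight equally likely outcomes and applying the alternative formula \(\Ex[Z] = \sum_{\omega} Z(\omega) \cdot \Pr[\omega]\).

from itertools import product

omega = ["".join(s) for s in product("HT", repeat=3)]

def X(w):   # number of heads
    return w.count("H")

def Y(w):   # number of pairs of tosses that come up the same
    return sum(w[a] == w[b] for a, b in [(0, 1), (1, 2), (0, 2)])

def expect(Z):
    """E[Z] = sum over outcomes of Z(outcome) * Pr[outcome], with Pr = 1/8 each."""
    return sum(Z(w) * (1/8) for w in omega)

print(expect(X), expect(Y))   # 1.5 1.5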

An important feature of averages is that not all potential outcomes can be above average, nor can they all be below average. In other words, there must be some outcome that is at least the expectation, and one that is at most the expectation. This is formalized in the following lemma, which is known as the averaging argument.

Lemma 225 (Averaging Argument)

Let \(X\) be a random variable of a probability space \((\Omega,\Pr)\). Then there exists an outcome \(\omega \in \Omega\) for which \(X(\omega) \geq \Ex[X]\), and an outcome \(\omega' \in \Omega\) for which \(X(\omega') \leq \Ex[X]\).

For example, building on Example 224, there exists at least one outcome for which \(X\), the number of heads that come up, is at least \(3/2\). Indeed, \(HHH\) and \(HHT\) are two such outcomes.

Proof 226

Let \(E = \Ex[X]\), and suppose for the purposes of contradiction that \(X(\omega) < E\) for every \(\omega \in \Omega\). Then by the alternative formula for the expectation, and because the \(\Pr[\omega]\) are non-negative and sum to 1 (so at least one of them is positive),

\[\begin{split}\Ex[X] &= \sum_{\omega \in \Omega} X(\omega) \cdot \Pr[\omega] \\ &< \sum_{\omega \in \Omega} E \cdot \Pr[\omega] \\ &= E \cdot \sum_{\omega \in \Omega} \Pr[\omega] = E \; \text.\end{split}\]

Thus \(\Ex[X] < E\), which contradicts the definition of \(E\), hence the first claim is proved. The second claim proceeds symmetrically.

For a non-negative random variable, its expectation gives some useful information about its probability distribution. Informally, it is “not so likely” that the random variable is “much larger than” its expectation. Markov’s inequality quantifies this.

Theorem 227 (Markov’s Inequality)

Let \(X\) be a non-negative random variable, i.e., \(X\) never takes on a negative value, and let \(a > 0\). Then

\[\Pr[X \geq a] \leq \frac{\Ex[X]}{a} \; \text.\]

If \(\Ex[X] > 0\), this implies that for any \(t > 0\), the probability that \(X\) is at least \(t\) times its expectation is at most \(1/t\). Notice that these statements are trivial (or “useless”) for \(a \leq \Ex[X]\) or \(t \leq 1\), because in these cases Markov’s inequality gives probability upper bounds that are \(\geq 1\), and the probability of any event is trivially \(\leq 1\). So, Markov’s inequality is useful only when we take \(a > \Ex[X]\) or \(t > 1\).

As an example, if the class average on an exam is \(70\) points (and negative points are not possible), then Markov’s inequality implies that the fraction of the class that earned at least \(a = 90\) points is at most \(70/90 = 7/9\). (Stated probabilistically: if we choose a student uniformly at random, then the probability that the student earned \(\geq 90\) points is \(\leq 7/9\).) Similarly, at most \((7/6)\)ths of the class earned \(\geq 60\) points, but this is a trivial statement.

Proof 228

By definition of expectation,

\[\begin{split}\Ex[X] &= \sum_{v \in X(\Omega)} v \cdot \Pr[X = v] \\ &= \parens[\Big]{\sum_{v < a} v \cdot \Pr[X = v]} + \parens[\Big]{\sum_{v \geq a} v \cdot \Pr[X = v]} \\ &\geq \sum_{v \geq a} v \cdot \Pr[X = v] \\ &\geq \sum_{v \geq a} a \cdot \Pr[X = v] \\ &= a \cdot \Pr[X \geq a] \; \text.\end{split}\]

The first inequality holds because \(X\) is non-negative, so every \(v\) in the first sum is non-negative, and therefore that sum is non-negative. The second inequality holds because \(v \geq a\), and all the \(\Pr[X=v]\) are non-negative. The last equality holds because the event \(X \geq a\) is the union of all the disjoint events \(X = v\) for \(v \geq a\).

Finally, dividing both sides of the inequality by \(a > 0\) (which preserves the direction of the inequality), we get that

\[\frac{\Ex[X]}{a} \geq \Pr[X \geq a] \; \text.\]

The following ‘reverse’ version of Markov’s inequality says, informally, that if a random variable is upper bounded, then it has a certain positive probability of exceeding any value less than its expectation. (Note that by the averaging argument, there is some nonzero probability that the random variable is at least its expectation. The reverse Markov inequality gives a specific probability bound, which increases as the threshold of interest decreases.)

Theorem 229 (Reverse Markov’s Inequality)

Let \(X\) be a random variable that is never larger than some value \(b\). Then for any \(a < b\),

\[\Pr[X \geq a] \geq \Pr[X > a] \geq \frac{\Ex[X] - a}{b - a} \; \text.\]

Notice that the statement is trivial (or “useless”) for \(a \geq \Ex[X]\), because in this case it gives a probability lower bound that is \(\leq 0\), and the probability of any event is \(\geq 0\).

For example, if the class average on an exam is \(70\) points out of a maximum of \(b=100\), then the ‘reverse’ Markov inequality implies that at least \((70-60)/(100-60) = 1/4\) of the class earned more than \(a=60\) points. However, it does not say anything about how much of the class earned \(70\) or more points.

Proof 230

The first inequality holds because the event \(X > a\) is a subset of the event \(X \geq a\). So, it remains to prove the second inequality.

Define the random variable \(Y = b - X\). This is non-negative because \(X\) is never larger than \(b\), so we can apply Markov’s inequality to it, and it has expectation \(\Ex[Y] = b - \Ex[X]\). Hence,

\[\begin{split}\Pr[X \leq a] &= \Pr[b - Y \leq a] \\ &= \Pr[Y \geq b - a] \\ &\leq \frac{\Ex[Y]}{b - a} \\ &= \frac{b - \Ex[X]}{b - a} \; \text.\end{split}\]

So, considering the complementary event,

\[\Pr[X > a] = 1 - \Pr[X \leq a] \geq \frac{(b-a) - (b-\Ex[X])}{b-a} = \frac{\Ex[X]-a}{b-a} \; \text.\]
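
As a small illustration of both inequalities, the following Python helper (ours; the two functions simply evaluate the formulas from Theorem 227 and Theorem 229) recomputes the exam examples above.

def markov_bound(mean, a):
    """Upper bound on Pr[X >= a] for a non-negative X with E[X] = mean."""
    return mean / a

def reverse_markov_bound(mean, a, b):
    """Lower bound on Pr[X > a] for X never larger than b, E[X] = mean, a < b."""
    return (mean - a) / (b - a)

# Exam with class average 70 and scores in [0, 100]:
print(markov_bound(70, 90))               # 0.777... : at most 7/9 earned >= 90
print(reverse_markov_bound(70, 60, 100))  # 0.25     : at least 1/4 earned > 60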

Often, we can express a random variable \(X\) as a linear combination of some other, typically “simpler,” random variables \(X_i\). We can then apply linearity of expectation to compute the expectation of \(X\) from the expectations of the \(X_i\). Importantly, the \(X_i\) do not need to be independent in order for linearity of expectation to apply.

Theorem 231 (Linearity of Expectation)

Let \(X_1, X_2, \ldots\) be any countably many random variables, and \(c_1, c_2, \ldots \in \R\). Then

\[\Ex\brackets[\Big]{\sum_i c_i \cdot X_i} = \sum_i c_i \cdot \Ex[X_i] \; \text.\]

In other words, constant-factor scalings and summations can be “moved in and out of” expectations.

Proof 232

We prove the case for a linear combination of two variables; the general case then follows by induction. From the alternative formulation of expectation,

\[\begin{split}\Ex[c_1 X_1 + c_2 X_2] &= \sum_{\omega\in\Omega} (c_1 X_1(\omega) + c_2 X_2(\omega)) \cdot \Pr[\omega] \\ &= \sum_{\omega\in\Omega} (c_1 X_1(\omega) \cdot \Pr[\omega] + c_2 X_2(\omega) \cdot \Pr[\omega]) \\ &= \sum_{\omega\in\Omega} c_1 X_1(\omega) \cdot \Pr[\omega] + \sum_{\omega\in\Omega} c_2 X_2(\omega) \cdot \Pr[\omega] \\ &= c_1 \cdot \sum_{\omega\in\Omega} X_1(\omega) \cdot \Pr[\omega] + c_2 \cdot \sum_{\omega\in\Omega} X_2(\omega) \cdot \Pr[\omega] \\ &= c_1 \cdot \Ex[X_1] + c_2 \cdot \Ex[X_2] \; \text.\end{split}\]
Example 233 (Three Fair Coins: Linearity of Expectation)

Continuing from Example 217, recall that the random variable \(X\) is the number of heads that come up. In Example 224, we showed directly via the definition that \(\Ex[X] = 3/2\). What follows is an alternative, simpler way—especially when generalizing to more coin tosses—to obtain the same result using linearity of expectation.

Recall that \(X\) is the sum of the three indicator random variables \(X_i\) that indicate whether the \(i\)th toss comes up heads: \(X = X_1 + X_2 + X_3\). Recall from Example 217 that \(\Ex[X_i] = \Pr[X_i = 1] = 1/2\) for every \(i\). So, by linearity of expectation (Theorem 231),

\[\Ex[X] = \Ex[X_1 + X_2 + X_3] = \Ex[X_1] + \Ex[X_2] + \Ex[X_3] = 3/2 \; \text.\]

Similarly, in Example 218, we defined the random variable \(Y\) to be the number of pairs of tosses that come up the same, and showed directly from the definition that \(\Ex[Y] = 3/2\).

We observed that \(Y\) is the sum of the three indicator random variables \(Y_i\) that indicate whether the corresponding pair of tosses come up the same: \(Y = Y_1 + Y_2 + Y_3\). Even though these indicator random variables are not mutually independent, we can still apply linearity of expectation to their sum:

\[\Ex[Y] = \Ex[Y_1 + Y_2 + Y_3] = 3/2 \; \text.\]
Example 234 (Biased Coins)

Suppose we repeatedly flip a biased coin that has probability \(p \in [0,1]\) of coming up heads. We can determine the expected number of heads in \(n\) flips, using indicator random variables and linearity of expectation.

For \(i=1,\ldots, n\), define \(X_i\) to be the indicator random variable indicating whether the \(i\)th flip comes up heads. Then \(\Ex[X_i] = \Pr[X_i = 1] = p\). Define \(X\) to be the total number of heads that come up in \(n\) flips, and observe that \(X = \sum_{i=1}^n X_i\). So, by linearity of expectation,

\[\Ex[X] = \Ex\brackets[\Big]{\sum_{i=1}^n X_i} = \sum_{i=1}^n \Ex[X_i] = \sum_{i=1}^n p = p \cdot n \; \text.\]

By contrast, deriving this expectation directly via the definitions (of expectation, and of the random variable \(X\)) would be rather complicated: we would need to determine \(\Pr[X=v]\) for each \(v=0,1,\ldots,n\), then analyze \(\sum_{i=1}^n v \cdot \Pr[X=v]\). Linearity of expectation makes the derivation almost trivial.
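
A short simulation (a sketch of ours, using Python's random module) agrees with this formula: the empirical average number of heads over many trials is close to \(p \cdot n\).

import random

def count_heads(n, p):
    """Flip a p-biased coin n times and return the number of heads."""
    return sum(random.random() < p for _ in range(n))

n, p, trials = 100, 0.3, 10_000
average = sum(count_heads(n, p) for _ in range(trials)) / trials
print(average)   # close to p * n = 30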

Analyzing Rock-Paper-Scissors

Now let’s return to the rock-paper-scissors example, modeling one round of the game as a probability space, and analyzing the appropriate events. The sample space—i.e., the set of outcomes that can potentially occur—is the set \(\Omega = \set{R,P,S}^2 = \set{RR, RP, RS, PR, PP, PS, SR, SP, SS}\) of possible move pairs made by the two players, where \(R,P,S\) respectively denote the moves rock, paper, and scissors, and the \(i\)th character represents player \(i\)’s move, for \(i=1,2\).

Next, let \(R_i\) be the event that player \(i\) plays rock, i.e., \(R_1 = \set{RR, RP, RS}\) and \(R_2 = \set{RR, PR, SR}\), and similarly for \(P_i\) and \(S_i\).

We need to assign probabilities to the outcomes. Player \(i\)’s strategy can be seen as choosing a move at random according to some specified probabilities of events \(R_i, P_i, S_i\), which must sum to \(1\) because they partition the sample space. (This is true even for deterministic strategies, in which case the chosen move has probability \(1\).) Since the players move simultaneously, we model their choices as independent. So, letting \(M_i\) be any one of \(R_i,P_i,S_i\) for \(i=1,2\), we see that the event \(M_1 \wedge M_2\) consists of a single outcome, and we have that

\[\Pr[M_1 \wedge M_2] = \Pr[M_1] \cdot \Pr[M_2] \; \text.\]

Now, let \(W_1 = \set{RS, PR, SP}\) be the event that player \(1\) wins. Suppose that player \(1\) plays uniformly at random, i.e., plays R,P,S each with equal probability, i.e., \(\Pr[R_1]=\Pr[P_1]=\Pr[S_1]=1/3\). Then no matter how player \(2\) plays, the probability that player \(1\) wins is

\[\begin{split}\Pr[W_1] &= \Pr[R_1 \wedge S_2] + \Pr[P_1 \wedge R_2] + \Pr[S_1 \wedge P_2] \\ &= \Pr[R_1] \cdot \Pr[S_2] + \Pr[P_1] \cdot \Pr[R_2] + \Pr[S_1] \cdot \Pr[P_2] \\ &= \frac{1}{3} (\Pr[S_2] + \Pr[R_2] + \Pr[P_2]) \\ &= 1/3 \; \text.\end{split}\]

The first equation holds because \(W_1\) is the disjoint union of the three events \(R_1 \wedge S_2\), \(P_1 \wedge R_2\), and \(S_1 \wedge P_2\); the second uses the independence of the two players’ choices; and the final equation uses the fact that the events \(S_2,R_2,P_2\) partition the sample space, so their probabilities must sum to \(1\). The conclusion is that playing uniformly at random yields a \(1/3\) probability of winning, no matter what strategy the opponent uses. This is a significant improvement over any deterministic strategy, against which an opponent can win with certainty. [3] A similar analysis shows that the probabilities of tying and of losing are both \(1/3\) as well.
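
To see this numerically, the following simulation sketch (ours; the always-paper opponent is just one arbitrary choice) estimates the win/tie/loss probabilities of the uniformly random strategy over many rounds; each estimate comes out close to \(1/3\) regardless of the opponent strategy plugged in.

import random

BEATS = {"R": "S", "P": "R", "S": "P"}   # key beats value

def uniform_player():
    return random.choice("RPS")

def opponent():
    # Any fixed (or even adaptive) opponent strategy; here, always paper.
    return "P"

wins = ties = losses = 0
rounds = 100_000
for _ in range(rounds):
    us, them = uniform_player(), opponent()
    if us == them:
        ties += 1
    elif BEATS[us] == them:
        wins += 1
    else:
        losses += 1

print(wins / rounds, ties / rounds, losses / rounds)   # each close to 1/3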

The rock-paper-scissors example is illustrative of a more general notion of considering algorithms as games. In this perspective, we view one “player” as choosing an algorithm for a computational problem. A second player known as the adversary then chooses an input (of a certain size) to the algorithm. The adversary aims to maximize the number of steps taken by the algorithm on the given input; i.e., to induce the algorithm’s worst-case runtime behavior.

If the first player chooses a deterministic algorithm, then the adversary can find an input that maximizes the number of steps—even if the algorithm happens to be much faster on “most” other inputs. On the other hand, if the first player chooses a randomized algorithm—i.e., one that makes some random choices—then the adversary might not be able to induce worst-case behavior, because it cannot predict the algorithm’s random choices. In a good randomized algorithm, the random choices will tend to avoid the worst-case behavior no matter what the input is, although there is still some (ideally very small) probability that it will make “unlucky” choices that lead to poor performance.

Randomized Approximation Algorithms

We can use randomness to design simple algorithms that produce an \(\alpha\)-approximation in expectation, meaning that the expected value of the output is within an \(\alpha\) factor of the value of an optimal solution.

As an example, recall the search version of the \(\TSAT\) problem: given a 3CNF Boolean formula, find a satisfying assignment for it, if one exists. Since a 3CNF formula is the AND of several clauses, this problem is asking for an assignment that satisfies all of the clauses simultaneously.

Because \(\TSAT\) is \(\NP\)-hard, there is no efficient algorithm for this problem unless \(\P=\NP\). So, let us relax the goal, and seek an assignment that satisfies as many clauses as we can manage. In other words, we seek an approximation algorithm for the problem of maximizing the number of satisfied clauses.

In what follows, we actually restrict the input to be what we call an “exact”-3CNF formula. This is a 3CNF formula with the added restriction that each clause involves three distinct variables. (Different clauses can involve the same variable, however.) For example, the clause

\[(x \vee \neg y \vee \neg z)\]

is allowed, but the clause

\[(x \vee \neg y \vee \neg x)\]

is not, since it involves only two distinct variables (the literals \(x\) and \(\neg x\) involve the same variable). Similarly, the clause \((y \vee y \vee \neg z)\) is not allowed.

The problem of finding an assignment that maximizes the number of satisfied clauses in a given exact-3CNF formula is known as \(\MSAT\), and its decision version is \(\NP\)-complete. Despite this, there is a very simple randomized algorithm that obtains a \(7/8\)-approximation, in expectation. The algorithm is as follows: assign random truth values (true or false) to the formula’s variables, uniformly and independently. That is, for each variable, assign it to be true or false each with \(1/2\) probability, independently of all the others.
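
The following Python sketch (ours) implements this random-assignment algorithm and counts how many clauses it satisfies. The clause representation—a tuple of three (variable, negated) literal pairs—is an assumption of this sketch, not a standard input format.

import random

def random_assignment(variables):
    """Assign each variable True/False uniformly and independently."""
    return {x: random.random() < 0.5 for x in variables}

def count_satisfied(clauses, assignment):
    """A clause is a tuple of (variable, negated) literals; it is satisfied
    if at least one literal evaluates to True under the assignment."""
    return sum(
        any(assignment[x] != negated for (x, negated) in clause)
        for clause in clauses
    )

# Example: (x or not y or not z) and (x or y or w)
clauses = [(("x", False), ("y", True), ("z", True)),
           (("x", False), ("y", False), ("w", False))]
variables = {x for clause in clauses for (x, _) in clause}
alpha = random_assignment(variables)
print(count_satisfied(clauses, alpha))   # on average, 7/8 of the clauses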

Theorem 235

Let \(\phi\) be an exact-3CNF formula with \(m\) clauses. Then assigning its variables with uniformly random and independent truth values satisfies \((7/8)\)ths of \(\phi\)’s clauses in expectation, i.e.,

\[\Ex[\text{number of satisfied clauses in $\phi$}] = 7m/8 \; \text.\]

In particular, this is a \(7/8\)-approximation algorithm for \(\MSAT\), in expectation.

The proof of this theorem proceeds by a powerful and widely applicable strategy. First, we define a suitable probability space and a random variable \(X\) that captures the quantity of interest, namely, the number of satisfied clauses. This variable \(X\) typically has a complicated probability distribution that makes it hard to analyze its expectation directly. Instead, we express \(X\) as the sum of several simpler indicator random variables \(X_i\). By linearity of expectation, \(\Ex[X]\) is the sum of the individual expectations \(\Ex[X_i] = \Pr[X_i=1]\), where the equality holds because the \(X_i\) are indicators. Finally, we analyze the probabilities of the individual events \(X_i=1\), which are simple to understand, and arrive at the conclusion.

Proof 236

First, suppose that the expected number of satisfied clauses is indeed \(7m/8\) (which we prove below). The optimal number of clauses that can be simultaneously satisfied is \(\OPT \leq m\), so the expected number of satisfied clauses is \(7m/8 \geq (7/8) \cdot \OPT\), hence this is indeed a \(7/8\)-approximation algorithm for \(\MSAT\).

We now analyze the expected number of satisfied clauses. The probability space is the set of all assignments to the variables of \(\phi\), where each assignment is equally likely—i.e., the assignment is chosen uniformly at random.

Now, define \(X\) to be the random variable for the number of satisfied clauses in \(\phi\). That is, for any particular assignment \(\alpha\), we define \(X(\alpha)\) to be the number of clauses satisfied by \(\alpha\). Because the clauses of \(\phi\) can share variables in various ways, the probability distribution and expectation of \(X\) can be quite complicated and difficult to analyze directly. Instead, we decompose \(X\) into a sum of much simpler indicator variables.

Specifically, for each \(i=1,\ldots,m\) we define an indicator random variable \(X_i\) that indicates whether the \(i\)th clause is satisfied, i.e., \(X_i = 1\) if so and \(X_i = 0\) if not. Then because \(X\) is the total number of satisfied clauses, we have that

\[X = X_1 + \cdots + X_m = \sum_{i=1}^m X_i \; \text.\]

So, by linearity of expectation (Theorem 231),

\[\Ex[X] = \Ex\brackets[\Big]{\sum_{i=1}^m X_i} = \sum_{i=1}^m \Ex[X_i] \; \text.\]

It is important to realize that because the clauses of \(\phi\) can share variables in various ways, the indicators \(X_i\) are typically not independent, even pairwise. For example, because clauses \((x \vee y \vee \neg z)\) and \((u \vee \neg v \vee x)\) share the variable \(x\), the former clause being satisfied makes it more likely that the latter clause is also satisfied (i.e., their indicators are positively correlated). Fortunately, linearity of expectation holds even for dependent variables, so the above equation is still valid.

All that remains is to analyze \(\Ex[X_i] = \Pr[X_i = 1]\) (see Lemma 222), i.e., the probability that a uniformly random assignment satisfies the \(i\)th clause. Every clause in the exact-3CNF formula \(\phi\) is the OR of exactly three literals involving distinct variables, e.g., \((x \vee \neg y \vee z)\). The clause fails to be satisfied exactly when all three of its literals are false. Because the variables are assigned uniformly at random, each literal has probability \(1/2\) of being false (regardless of whether the literal is a variable or its negation). And because the literals in clause \(i\) involve distinct variables, their values are mutually independent. Thus,

\[\begin{split}\Pr[X_i = 0] &= \Pr[\text{clause $i$ is unsatisfied}] \\ &= \Pr[\text{the 1st and 2nd and 3rd literals of clause $i$ are all false}] \\ &= \prod_{j=1}^3 \Pr[\text{the $j$th literal of clause $i$ is false}] \\ &= (1/2)^3 = 1/8 \; \text.\end{split}\]

Next, because \(X_i\) is an indicator random variable (it is limited to values \(0\) and \(1\)),

\[\Ex[X_i] = \Pr[X_i = 1] = 1 - \Pr[X_i=0] = 7/8 \; \text.\]

Finally, returning to \(X\) itself,

\[\Ex[X] = \sum_{i=1}^m \Ex[X_i] = \sum_{i=1}^m (7/8) = 7m/8 \; \text.\]

By the averaging argument (Lemma 225), from Theorem 235 we can further conclude that for any exact-3CNF formula, there exists an assignment that satisfies at least \(7/8\)ths of the clauses. Notice that this is a “deterministic” statement—it makes no reference to probability—but we proved it using the tools of probability, and it is not obvious how we could hope to do so without them! (This is an example of the “probabilistic method,” which is a very powerful technique with many applications.)

Theorem 235 says that if we assign variables at random, then on average, \(7/8\)ths of the clauses are satisfied. However, this does not say anything about how likely it is that a certain fraction of clauses are satisfied. For example, we might want a lower bound on the probability that at least half (or two-thirds, or three-quarters, or 90%) of the clauses are satisfied. Written in terms of the random variable \(X\) representing the number of satisfied clauses, we might hope to get a lower bound on \(\Pr[X \geq c \cdot m]\) for \(c=1/2\), or various other \(c \in [0,1]\).

Notice that because \(X\) is defined as the number of satisfied clauses, it is non-negative (i.e., it never takes on a negative value). Thus, we can apply Markov’s inequality (Theorem 227) to obtain that

\[\Pr[X \geq m/2] \leq \frac{7m/8}{m/2} = \frac{7}{4} \; \text.\]

Unfortunately, this is a trivial (or “useless”) statement, because the probability bound is \(7/4 > 1\), and the probability of any event is \(\leq 1\). Furthermore, the probability bound goes in the “wrong direction” from what we want: it says that the probability of satisfying at least half the clauses is at most some value, whereas we hope to show that it is at least some positive value. An upper bound is not useful to us, because it is consistent with the actual probability being zero. So, even if we replaced \(c=1/2\) with, say, \(c = 9/10\) to get a nontrivial bound, the statement would not be of the kind we want. We need to use a different tool.

Observe that the number of satisfied clauses \(X\) cannot exceed \(m\), the total number of clauses in the given formula. So, we can apply the ‘reverse’ Markov inequality (Theorem 229) to it, with \(b=m\). Setting \(a = m/2 < b\), we get that

\[\Pr[X > m/2] \geq \frac{7m/8 - m/2}{m - m/2} = \frac{3m/8}{m/2} = 3/4 \; \text.\]

That is, a random assignment satisfies more than half the clauses with at least \(75\) percent probability. More generally, for any \(c < 7/8\), setting \(a = c \cdot m < b\),

\[\Pr[X > c \cdot m] \geq \frac{7m/8 - c \cdot m}{m - c \cdot m} = \frac{7/8-c}{1-c} > 0 \; \text.\]

So, for any \(c < 7/8\) there is a positive probability of satisfying more than a \(c\) fraction of the clauses. However, the ‘reverse’ Markov inequality does not tell us anything useful about the probability of satisfying \(7/8\)ths or more of the clauses.

Exercise 237

Recall the max-cut problem. The following is a randomized algorithm that outputs a random cut in a given undirected graph (which has no self-loop edge, without loss of generality):

\begin{algorithmic}
\FUNCTION{RandomCut}{$G=(V,E)$}
\STATE $S = \emptyset$
\FORALL{$v \in V$}
  \STATE add $v$ to $S$ with probability $1/2$
\ENDFOR
\STATE \RETURN $S$
\ENDFUNCTION
\end{algorithmic}

Recall that \(C(S)\) is the set of crossing edges for the cut \(S\), i.e., those that have exactly one endpoint in \(S\). The size of the cut is \(\abs{C(S)}\).

  1. Prove that in expectation, this algorithm outputs a 1/2-approximation of a maximum cut in the given undirected graph. That is, the expected size of the returned cut is at least half the size of a maximum cut.

    Hint: Use linearity of expectation to show that in expectation, half of the edges of the graph are in \(C(S)\). Similarly, give an upper bound on the size of a maximum cut (also in terms of the number of edges in the graph).

  2. Conclude that, for any undirected graph \(G=(V,E)\), there exists a cut of size at least \(\abs{E}/2\). (By contrast, in Theorem 203 we saw an efficient deterministic algorithm that finds such a cut, with certainty.)

Quick Sort

Another example of an algorithm that takes advantage of randomness is quick sort. This is a recursive algorithm that sorts an array as follows:

  1. First, it chooses some element of the array as the pivot.

  2. Then, it partitions the array around the pivot, by comparing the pivot to every other element, placing all the smaller elements to the “left” of the pivot—not necessarily in sorted order—and all the larger ones to the “right” of the pivot. Without loss of generality, we assume that the array elements are distinct, by breaking ties according to the elements’ positions in the array.

  3. Finally, it recursively sorts the two subarrays to the left and right of the pivot, which results in the entire array being sorted.

The following is an illustration of an execution of quick sort, with the recursive work elided:

quick sort works by choosing a pivot, partitioning the elements into those that are less than, equal to, and greater than the pivot, and recursively sorting the partitions less than and greater than the pivot

The running time of quick sort is dominated by the number of comparisons between pairs of elements. Once a pivot is selected, partitioning an array of \(n\) elements compares every other element to the pivot. This is a total of \(n-1 = \Theta(n)\) comparisons of non-recursive work, which is followed by the recursive calls on the two subarrays on each side of the pivot. If selecting the pivot takes \(O(n)\) time and can guarantee that the two subarrays are “balanced,” i.e., approximately equal in size, then the recurrence for the total number of comparisons (and hence the overall running time) is

\[T(n) = 2T(n/2) + \Theta(n) \; \text.\]

This is the same recurrence as for merge sort, and it has the closed-form solution \(T(n) = \Theta(n \log n)\). However, quick sort tends to be faster in practice than merge sort and other deterministic \(\O(n \log n)\) sorting algorithms, in large part because it is fast and easy to implement in place—i.e., reordering the array within its own memory space, using only a small amount of auxiliary memory. (By comparison, the standard merge subroutine from merge sort makes a full copy of the array with each call.) Thus, variants of quick sort are widely used in practice.

One way to guarantee a balanced partition is to use the median element as the pivot, which can be found in linear time. However, this subroutine is rather complicated to implement, and its linear runtime has a fairly large hidden constant factor.

A much simpler and more practical solution is to choose a uniformly random array element as the pivot. The pseudocode is as follows.

Algorithm 238 (Randomized Quick Sort)
\begin{algorithmic}
\FUNCTION{RandQuickSort}{$A[1,\ldots,n]$}
\IF{$n=1$} \RETURN $A$ \ENDIF
\STATE $p = \text{ an index chosen uniformly at random from }
\set{1,\ldots,n}$
\STATE $(L,R) =$ \CALL{Partition}{$A,p$}
\STATE \RETURN \CALL{RandQuickSort}{$L$} $+ A[p] +$
\CALL{RandQuickSort}{$R$}
\ENDFUNCTION

\FUNCTION{Partition}{$A[1,\ldots,n],p$}
\STATE initialize empty arrays $L,R$
\FOR{$i=1$ to $n$}
  \IF{$i \neq p$ and $A[i] < A[p]$}
    \STATE $L = L + A[i]$
  \ELIF{$i \neq p$}
    \STATE $R = R + A[i]$
  \ENDIF
\ENDFOR
\STATE \RETURN $(L,R)$
\ENDFUNCTION
\end{algorithmic}
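
For reference, here is a direct Python transcription of Algorithm 238 (a sketch of ours; like the pseudocode, it returns a new sorted list rather than sorting in place, and ties go to the right subarray).

import random

def rand_quick_sort(A):
    """Randomized quick sort, following Algorithm 238."""
    if len(A) <= 1:
        return list(A)
    p = random.randrange(len(A))          # pivot index, uniformly at random
    pivot = A[p]
    # Partition around the pivot (elements equal to the pivot go right).
    L = [x for i, x in enumerate(A) if i != p and x < pivot]
    R = [x for i, x in enumerate(A) if i != p and x >= pivot]
    return rand_quick_sort(L) + [pivot] + rand_quick_sort(R)

print(rand_quick_sort([5, 2, 9, 1, 7, 3]))   # [1, 2, 3, 5, 7, 9]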

The following figure illustrates a possible execution of RandQuickSort on an example input array, with the random choices of pivot elements highlighted, and the rank of each element displayed below it. The rank of an element is the number of elements in the array (including itself) that are less than or equal to it. So, the smallest element has rank \(1\), the next-smallest has rank \(2\), etc.

a full run of the quick-sort algorithm; when a pivot is chosen, it is compared against every element in the same subarray; if one element is less than the pivot and another greater than the pivot, they are placed in different partitions and therefore never directly compared

The running time of RandQuickSort is a random variable that depends in a complicated way on the particular random choices of pivots made throughout the algorithm. Roughly speaking, choices of pivots that yield “unbalanced” subarrays tend to yield larger running times than ones that yield more balanced subarrays. How can we hope to analyze this very complicated probability experiment? As we did above, we express the running time as a sum of much simpler indicator variables, apply linearity of expectation, and then analyze these indicators individually. The rest of this section is devoted to proving the following theorem.

Theorem 239

On any array of \(n\) elements, the expected running time of RandQuickSort is \(\O(n \log n)\).

Recall that the running time of RandQuickSort is dominated by the number of comparisons of pairs of array elements. First observe that if a certain pair of elements is ever compared, then it is never compared again. This is because at the time of comparison, one of the elements is the pivot, which is never compared with anything after partitioning around it is complete. So, the total number of comparisons done by RandQuickSort is the total number of distinct pairs of elements that are ever compared.

Accordingly, define the random variable \(X\) to be the total number of comparisons during the entire execution of quick sort on an arbitrary array of \(n\) elements. In addition, for any ranks \(1 \leq i < j \leq n\), define \(X_{ij}\) to be the indicator random variable that indicates whether the pair of elements having ranks \(i,j\) are ever compared during the execution, i.e., \(X_{ij}=1\) if this pair is ever compared, and \(X_{ij}=0\) otherwise. By the observations from the previous paragraph,

\[X = \sum_{1 \leq i < j \leq n} X_{ij} = \sum_{i=1}^{n-1} \sum_{j=i+1}^n X_{ij} \; \text.\]

Therefore, by linearity of expectation (Theorem 231), the expected total number of comparisons is

\[\Ex[X] = \sum_{i=1}^{n-1} \sum_{j=i+1}^n \Ex[X_{ij}] \; \text.\]

So, for every \(i < j\) it just remains to analyze the individual expectations \(\Ex[X_{ij}] = \Pr[X_{ij}=1]\) in isolation (where the equality holds by Lemma 222), and sum them. (As above, the random variables \(X_{ij}\) are highly correlated in difficult-to-understand ways, but this does not matter for linearity of expectation.)

Under what conditions are the elements of ranks \(i < j\) ever compared, and what is the probability \(\Pr[X_{ij}=1]\) that this occurs? By inspection of Algorithm 238, we see that elements are compared only in the Partition subroutine. In particular, the only comparisons made during a call to Partition are between the selected pivot element and all the other elements in the input subarray. This has two important implications:

  • When a subarray is input to Partition, none of its elements have been compared with each other yet.

  • In the subarray input to Partition, if one element is less than the pivot and another is greater than the pivot, then these two elements are placed on opposite sides of the pivot, and thus are never compared with each other (because the recursive calls operate completely separately on the two sides).

Therefore, the two elements \(a < b\) having ranks \(i < j\) are compared if and only if the first pivot chosen from the range \([a,b]\) is either \(a\) or \(b\). To elaborate: for as long as pivots are chosen from outside this range, \(a\) and \(b\) remain in the same subarray, and have not yet been compared. Then, once an element from this range is chosen as a pivot:

  • it is either \(a\) or \(b\), in which case \(a\) and \(b\) are compared with each other, or

  • it is an element strictly between \(a\) and \(b\), in which case \(a\) and \(b\) are placed in different subarrays and never compared with each other.

choosing a pivot; if a pivot is chosen outside of ranks i through j, they remain in the same subarray and we recurse on it; otherwise, a pivot is chosen somewhere between ranks i and j, and is a uniformly random such element

Therefore, \(\Ex[X_{ij}]\) is the probability that the first pivot whose rank is between \(i\) and \(j\) (inclusive) has rank either \(i\) or \(j\). There are \(j-i+1\) elements in this range. Each pivot is chosen uniformly at random from the subarray input to Partition, and as argued above, when the first pivot from this range is selected, the entire range is part of the subarray. So, given that the pivot is in the range, it is uniformly random over the range, i.e., it is equally likely to be any of the \(j-i+1\) elements in the range. (This statement can be formalized using conditional probability.) Because exactly two of these possible choices make \(X_{ij}=1\), we have that

\[\Ex[X_{ij}] = \Pr[X_{ij} = 1] = \frac{2}{j-i+1} \; \text.\]

Resuming from what we showed above,

\[\begin{split}\Ex[X] &= \sum_{i=1}^{n-1} \sum_{j=i+1}^n \Ex[X_{ij}] \\ &= \sum_{i=1}^{n-1} \sum_{j=i+1}^n \frac{2}{j-i+1} \\ &= 2 \sum_{i=1}^{n-1} \sum_{t=2}^{n-i+1} \frac{1}{t} \; \text.\end{split}\]

The quantity \(\sum_{t=2}^{k} 1/t\) is part of the harmonic series, and is known to be less than \(\ln(k)\). Thus,

\[\Ex[X] < 2 \sum_{i=1}^{n-1} \ln(n-i+1) \leq 2 \sum_{i=1}^{n-1} \ln(n) = 2(n-1) \ln(n) = O(n \log n) \; \text,\]

which completes the proof.

Exercise 240

Although RandQuickSort runs in expected time \(O(n \log n)\) on an array of \(n\) elements, it can potentially run in time \(\Omega(n^2)\), due to an unlucky choice of random pivots.

Prove that for any constant \(c > 0\),

\[\lim_{n \to \infty} \Pr[\algo{RandQuickSort} \text{ on an $n$-element array takes $\geq c n \log^2 n$ steps}] = 0 \; \text.\]

That is, as the array size grows large, randomized quick sort is very unlikely to run in even \(\Omega(n \log^2 n)\) time.

Hint: This follows immediately from Theorem 239 and Markov’s inequality (Theorem 227), without needing to understand any of the analysis of RandQuickSort.

Skip Lists

Randomness can also be utilized to construct simple data structures that provide excellent expected performance. As an example, we consider a set abstract data type, which is a container that supports inserting, deleting, and searching for elements of some mutually comparable type. There are many ways to implement a set abstraction, using a variety of underlying data structures including linked lists, arrays, and binary search trees. However, simple, deterministic implementations suffer linear worst-case runtime for some operations. For example, if an adversary inserts the items \(1, 2, \dots, n\) in order into a non-balancing binary search tree, the \(i\)th insertion takes \(\Theta(i)\) time, for a total time of \(\Theta(n^2)\). The usual way to avoid this is to perform complicated rebalancing operations, and there are a variety of schemes to do so (e.g. AVL trees, red-black trees, scapegoat trees, and so on). Can we use randomness instead to design a simple data structure that provides expected \(\O(\log n)\) runtime for all operations on a set of size \(n\)?

One solution is a skip list, which is essentially a linked list with multiple levels, where each level contains about half the elements of the level below (ratios other than \(\frac{1}{2}\) may be used instead, trading off the space and time overheads). Whether or not an element is duplicated to the next level is a random decision, which prevents an adversary from coming up with a sequence of operations that degrades performance. The following is an illustration of a skip list:

a skip list is a linked-list with multiple levels, where all elements are in the bottom level, and each level above duplicates about half the elements of the level below; there is also a sentinel head at each level

In the scheme illustrated above, each level has a sentinel head and is terminated by a null pointer. An element may appear at multiple levels, and the list may be traversed by following links rightward or by moving downward at a duplicated element.

To search for an item, we start at the sentinel node at the top left. We maintain the invariant that we are always located at a node whose value is less than the item we are looking for (or is a sentinel head node). Then in each step, we check the node to the right. If its value is greater than the item we are looking for, or if the right pointer is null, we move downward; however, if we are on the lowest level, we cannot move downward, so the item is not in the list. If the value of the node to the right is less than the item, we move to the right. If the value is equal to the item, we have found the item and terminate the search. The following illustrates a search for the item 0 in the list above:

example of searching for an element, which moves down a level if the element to the right is larger than the item being searched for; otherwise the search moves to the right

The search terminates in failure, since we reach the value -3 in the bottom level with the value 1 to the right. The following illustrates searching for the item 9:

example of searching for an element, which moves down a level if the element to the right is larger than the item being searched for; otherwise the search moves to the right

To insert an item, we first do a search for the item. If the item is in the list, then no further work is necessary. If it is not, the search terminates in the bottom level at the maximal element that is less than the item. Therefore, we insert the item immediately to the right. We then make a random decision of whether or not to duplicate the element to the next level. If the answer is yes, we make another random decision for the next level, and so on until the answer is no. If the probability of duplicating an item is \(\frac{1}{2}\), the expected number of copies of an item is \(2\). Thus, insertion only takes an expected constant amount of time in addition to the search.
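To make these operations concrete, here is a minimal Python sketch of a skip list supporting search and insertion, following the scheme described above: a sentinel head on every level, and an independent fair-coin flip deciding each promotion. The class and method names are our own illustrative choices, not a prescribed interface.

import random

class _Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * level      # forward[i] is the next node on level i

class SkipList:
    MAX_LEVEL = 32                         # plenty of levels for any realistic size

    def __init__(self):
        self.head = _Node(None, self.MAX_LEVEL)   # sentinel head spans all levels
        self.level = 1                            # number of levels currently in use

    def _random_level(self):
        # Keep flipping a fair coin; each "yes" promotes the element one level.
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def search(self, value):
        node = self.head
        # Start at the top level; move right while the next value is smaller,
        # and drop down a level otherwise.
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].value < value:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.value == value

    def insert(self, value):
        update = [self.head] * self.MAX_LEVEL     # last node visited on each level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].value < value:
                node = node.forward[i]
            update[i] = node
        if node.forward[0] is not None and node.forward[0].value == value:
            return                                # item already present
        new_level = self._random_level()
        self.level = max(self.level, new_level)
        new_node = _Node(value, new_level)
        for i in range(new_level):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

s = SkipList()
for x in [3, -3, 1, 7, 12]:
    s.insert(x)
print(s.search(7), s.search(0))   # True False

Deletion follows the same pattern: locate the predecessors on each level, then splice the node out of every level on which it appears.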

We proceed to determine the expected time of a search on a skip list. We start by determining the expected number of elements on each level \(i\). Suppose the skip list contains \(n\) elements total. Let \(n_i\) be the number of elements on level \(i\). We have \(n_1 = n\), since the first level contains all elements. For \(i > 1\), define \(X_{ij}\) to be an indicator random variable for whether or not the \(j\)th element is replicated to level \(i\). We have

\[\Pr[X_{ij} = 1] = 1/2^{i-1}\]

since it takes \(i-1\) “yes” decisions for the element to make it to level \(i\), and each decision is made independently with probability \(1/2\). Thus, \(\Ex[X_{ij}] = \Pr[X_{ij} = 1] = 1/2^{i-1}\).

The total number of elements \(n_i\) on level \(i\) is just the sum of the indicators \(X_{ij}\). This gives us:

\[\begin{split}\Ex[n_i] &= \lrEx{\sum_j X_{ij}}\\ &= \sum_j \Ex[X_{ij}]\\ &= \sum_j \frac{1}{2^{i-1}}\\ &= \frac{n}{2^{i-1}}\end{split}\]

Next, we consider how many levels we expect the list to contain. Suppose we only maintain levels that have at least one element. From our previous result, the expected number of elements is 1 for level \(i\) when

\[\begin{split}\frac{n}{2^{i-1}} &= 1\\ n &= 2^{i-1}\\ \log n &= i - 1\\ i &= \log n + 1\end{split}\]

Intuitively, this implies we should expect somewhere around \(\log n\) levels. To reason about the expected number more formally, consider the number of elements on some level \(i = \log n + k\). By Markov’s inequality, we have

\[\begin{split}\lrPr{n_{\log n + k} \ge 1} &\le \lrEx{n_{\log n + k}}/1\\ &= \frac{n}{2^{\log n + k - 1}}\\ &= \frac{n}{n2^{k-1}}\\ &= \frac{1}{2^{k-1}}\end{split}\]

Then let \(Z_k\) be an indicator for whether \(n_{\log n + k} \ge 1\). The number of levels \(L\) is at most

\[L \le \log n + \sum_{k=1}^\infty Z_k\]

We have

\[\begin{split}\Ex[Z_k] &= \Pr[Z_k = 1]\\ &= \lrPr{n_{\log n + k} \ge 1}\\ &\le \frac{1}{2^{k-1}}\end{split}\]

Then by linearity of expectation,

\[\begin{split}\Ex[L] &\le \lrEx{\log n + \sum_{k=1}^\infty Z_k}\\ &= \log n + \sum_{k=1}^\infty \Ex[Z_k]\\ &\le \log n + \sum_{k=1}^\infty \frac{1}{2^{k-1}}\\ &= \log n + 2\end{split}\]

Thus, the expected number of levels is \(\O(\log n)\).

Finally, we reason about how many nodes a search encounters on each level \(i\). Suppose we are looking for some element \(y\). Let \(w\) be the maximal element on level \(i\) that is less than \(y\) (or the sentinel head if no such element exists), and let \(z\) be the minimal element on level \(i\) that is greater than or equal to \(y\) (or the null sentinel if no such element exists). Let \(v_1, v_2, \dots\) be the elements on level \(i\) that are less than \(w\), with \(v_1\) the largest, \(v_2\) the next largest, and so on.

$w$ is the maximal element on level $i$ that is less than $y$, $z$ is the minimal element that is greater than or equal to $y$, and $v_1, v_2, \dots$ are the elements on level $i$ that are less than $w$

The search encounters the nodes corresponding to \(w\) and \(z\). It does not touch nodes on level \(i\) that come after \(z\) – either \(y = z\), in which case the search terminates, or \(z > y\) (or \(z\) is null), in which case the search moves down to level \(i-1\). Thus, we need only consider whether the search encounters the nodes corresponding to \(v_1, v_2, \dots\).

We observe that each element on level \(i\) is replicated on level \(i+1\) with probability \(1/2\).

$w, v_1, v_2, \dots$ are each replicated on level $i+1$ with independent probability 1/2

There are two possible routes the search can take to get to \(w\) on level \(i\):

  • The search can come down from level \(i+1\) to level \(i\) at \(w\). This happens exactly when \(w\) is replicated on level \(i+1\). The search process proceeds along the same level until it reaches a node that is greater than the target value \(y\), and since \(w < y\), it would not come down to level \(i\) before \(w\) if \(w\) appears on level \(i+1\). Thus, this possibility occurs with probability \(1/2\).

  • The search can come from the node corresponding to \(v_1\) on level \(i\). This occurs when \(w\) is not replicated on level \(i+1\). In that case, the search must come down to level \(i\) before \(w\), in which case it goes through \(v_1\). This possibility also occurs with probability \(1/2\).

These two possibilities are illustrated below:

if $w$ is replicated on level $i+1$, the search comes down to level $i$ at $w$, in which case $v_1$ is not encountered; if $w$ is not replicated on level $i+1$, then the search comes to $w$ from $v_1$ on level $i$, in which case $v_1$ is encountered; each of these happens with probability 1/2

Thus, \(v_1\) is encountered on level \(i\) exactly when \(w\) is not replicated to level \(i+1\), which occurs with probability \(1/2\). We can inductively repeat the same reasoning for preceding elements. The search only encounters \(v_2\) on level \(i\) if neither \(w\) nor \(v_1\) are replicated on level \(i+1\). Since they are replicated independently with probability \(1/2\), the probability that neither is replicated is \(1/4\). In general, the search encounters \(v_j\) on level \(i\) exactly when none of \(w, v_1, v_2, \dots, v_{j-1}\) are replicated to level \(i+1\), which happens with probability \(1/2^j\).

We can now compute the expected number of nodes encountered on level \(i\). Let \(Y_i\) be this number, and let \(V_{ij}\) be an indicator for whether or not element \(v_j\) is encountered. Then we have

\[Y_i = 2 + \sum_j V_{ij}\]

The first term \(2\) is due to the fact that the search encounters the nodes corresponding to \(w\) and \(z\). From our reasoning above, we have

\[\Ex[V_{ij}] = \Pr[V_{ij} = 1] = \frac{1}{2^j}\]

Then by linearity of expectation,

\[\begin{split}\Ex[Y_i] &= \lrEx{2 + \sum_j V_{ij}}\\ &= 2 + \sum_j \Ex[V_{ij}]\\ &= 2 + \sum_j \frac{1}{2^j}\\ &< 2 + \sum_{k=1}^\infty \frac{1}{2^k}\\ &= 3\end{split}\]

The second-to-last step follows from the fact that \(\sum_j \frac{1}{2^j}\) has finitely many terms since there are only finitely many elements less than \(y\). Thus, the expected number of nodes encountered on a level \(i\) is \(\O(1)\).

Combining this with our previous result that the expected number of levels is \(\O(\log n)\), the expected total number of nodes encountered across all levels is \(\O(\log n)\). Thus, a search on a skip list does an expected \(\O(\log n)\) comparisons. Since insertion does only a constant number of operations in addition to the search, it too does \(\O(\log n)\) operations, and the same reasoning holds for deletion as well. We achieve this expected runtime with a much simpler implementation than a self-balancing search tree.

Monte Carlo Methods and Concentration Bounds

Some algorithms rely on repeated trials to compute an approximation of a result. Such algorithms are called Monte Carlo methods, which are distinct from Monte Carlo algorithms – a Monte Carlo method does repeated sampling of a probability distribution, while a Monte Carlo algorithm is an algorithm that may produce the wrong result with some bounded probability. The former are commonly approximation algorithms for estimating a quantity, while the latter are exact algorithms that sometimes produce the wrong result. As we will see, there is a connection – repeated sampling (the strategy of a Monte Carlo method) can be used to amplify the probability of getting the correct result from a Monte Carlo algorithm.

As an example of a Monte Carlo method, we consider an algorithm for estimating the value of \(\pi\), the area of a unit circle (a circle with a radius of one). If such a circle is located in the plane, centered at the origin, its top-right quadrant is as follows:

overlap between a circle of radius one and a square with sides one, with the square located at the top-right of the circle

This quadrant falls within the unit square with corners \((0, 0)\) and \((1, 1)\). The area of the quadrant is \(\pi/4\), while the area of the square is \(1\). Thus, if we choose a random point in this square, the probability that it lies within the quadrant of the circle is \(\pi/4\) [4]. This motivates the following algorithm for estimating the value of \(\pi\):

\begin{algorithmic}
\FUNCTION{EstimatePi}{$n$}
\STATE $\text{count} = 0$
\FOR{$i=1$ to $n$}
  \STATE $x,y = $ values in $[0,1]$ chosen uniformly and
  independently at random
  \IF{$x^2+y^2 \leq 1$}
  \STATE $\text{count} = \text{count} + 1$
  \ENDIF
\ENDFOR
\STATE \RETURN $4 \cdot \text{count} / n$
\ENDFUNCTION
\end{algorithmic}

The algorithm randomly chooses points between \((0, 0)\) and \((1, 1)\), counting how many of them fall within the unit circle, which is when the point’s distance from the origin is at most one. We expect this number to be \(\pi/4\) of the samples, so the algorithm returns four times the ratio of the points that fell within the circle as an estimate of \(\pi\).
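A direct Python rendering of this pseudocode might look as follows; this is a minimal sketch, with the function name chosen to mirror the pseudocode.

import random

def estimate_pi(n):
    """Estimate pi by sampling n uniform points in the unit square and
    counting how many fall inside the quarter circle x^2 + y^2 <= 1."""
    count = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1:
            count += 1
    return 4 * count / n

print(estimate_pi(1_000_000))   # typically within a few thousandths of pi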

We formally show that the algorithm returns the value of \(\pi\) in expectation. Let \(X\) be a random variable corresponding to the value returned, and let \(Y_i\) be an indicator random variable that is 1 if the \(i\)th point falls within the unit circle. As argued previously, we have:

\[\Pr[Y_i = 1] = \pi/4\]

We also have that

\[X = 4/n \cdot (Y_1 + \dots + Y_n)\]

This leads to the following expected value for \(X\):

\[\begin{split}\Ex[X] &= 4/n \cdot (\Ex[Y_1] + \Ex[Y_2] + \dots + \Ex[Y_n])\\ &= 4/n \cdot (\Pr[Y_1 = 1] + \Pr[Y_2 = 1] + \dots + \Pr[Y_n = 1])\\ &= 4/n \cdot (\pi/4 + \pi/4 + \dots + \pi/4)\\ &= \pi\end{split}\]

The expected value of the algorithm’s output is indeed \(\pi\).

When estimating \(\pi\), how likely are we to actually get a result that is close to the expected value? While we can apply Markov’s inequality, the bound we get is very loose, and it does not give us any information about how the number of samples affects the quality of the estimate. The law of large numbers states that the actual result converges to the expected value as the number of samples \(n\) increases. But how fast does it converge? There are many types of concentration bounds that allow us to reason about the deviation of a random variable from its expectation; Markov’s inequality is just the simplest one. Chebyshev’s inequality is another simple bound that makes use of more information about a random variable, namely its variance. Chernoff bounds are yet another tool. There are multiple variants of Chernoff bounds, including the multiplicative form and the additive Hoeffding bounds.

Variance and Chebyshev’s Inequality

The expectation of a random variable gives us one piece of information about its probability distribution, but there are many aspects of the distribution that it does not capture. For instance, the following illustrates three random variables that have different distributions but the same expected value of 0.5:

three random variables with different distributions but the same expectation; the first can only take on a value equal to the expectation, the second can take on two values with equal probability, and the third takes on a range of values with different probabilities

Beyond the expectation, the next most important aspect of a probability distribution is its “spread” [5]. The variance of a random variable encapsulates this information.

Definition 241 (Variance)

Suppose \(X\) is a random variable with expectation \(\Ex[X]\). Then the variance of \(X\) is defined as

\[\Var(X) = \Ex[(X-\Ex[X])^2]\]

or, equivalently

\[\Var(X) = \Ex[X^2] - \Ex[X]^2\]

The second definition follows from the first due to linearity of expectation:

\[\begin{split}\Var(X) &= \Ex[(X-\Ex[X])^2]\\ &= \Ex[X^2 - 2\Ex[X]\cdot X + \Ex[X]^2]\\ &= \Ex[X^2] - 2\Ex[\Ex[X]\cdot X] + \Ex[\Ex[X]^2]\\ &= \Ex[X^2] - 2\Ex[X]\cdot\Ex[X] + \Ex[X]^2\\ &= \Ex[X^2] - 2\Ex[X]^2 + \Ex[X]^2\\ &= \Ex[X^2] - \Ex[X]^2\end{split}\]

In the fourth step, we used the fact that \(\Ex[X]\) is a constant to pull it out of the outer expectation – by linearity of expectation, \(\Ex[cY] = c\Ex[Y]\) for a constant \(c\).

The variance tells us the average square of the distance of a random variable from its expectation. Taking the square root of the variance gives us an approximate measure of the distance itself; this is called the standard deviation, and it is often denoted by the symbol \(\sigma\):

\[\sigma(X) = \sqrt{\Var(X)}\]
Example 242

Suppose the random variable \(X\) has the distribution

\[X = \begin{cases} 0.5 &\mbox{with probability $1$} \end{cases}\]

and \(Y\) has the distribution

\[\begin{split}Y = \begin{cases} 0 &\mbox{with probability $1/2$}\\ 1 &\mbox{with probability $1/2$} \end{cases}\end{split}\]

Both \(X\) and \(Y\) have an expected value of 0.5. We compute their variances. The distributions of \(X^2\) and \(Y^2\) are as follows:

\[ \begin{align}\begin{aligned}\begin{split}X^2 = \begin{cases} 0.25 &\mbox{with probability $1$} \end{cases}\\\end{split}\\\begin{split}Y^2 = \begin{cases} 0 &\mbox{with probability $1/2$}\\ 1 &\mbox{with probability $1/2$} \end{cases}\end{split}\end{aligned}\end{align} \]

Then we have:

\[\begin{split}\Var(X) = \Ex[X^2]-\Ex[X]^2 = 0.25 - 0.5^2 = 0\\ \Var(Y) = \Ex[Y^2]-\Ex[Y]^2 = 0.5 - 0.5^2 = 0.25\end{split}\]

We see that while \(X\) and \(Y\) have the same expectation, their variances do differ.

Example 243

Let \(X\) be an indicator random variable with probability \(p\) of being 1:

\[\begin{split}X = \begin{cases} 0 &\mbox{with probability $1-p$}\\ 1 &\mbox{with probability $p$} \end{cases}\end{split}\]

Then \(\Ex[X] = \Pr[X=1] = p\). Observe that \(X^2\) is always equal to \(X\), so \(\Ex[X^2] = \Ex[X] = p\). Thus, the variance of \(X\) is

\[\begin{split}\Var(X) &= \Ex[X^2] - \Ex[X]^2\\ &= p - p^2\\ &= p(1-p)\end{split}\]

As a more complex example, define \(X\) to be the number of heads over \(n\) flips of a biased coin with probability \(p\) of heads. We computed in Example 234 that \(\Ex[X] = np\). If we define \(X_i\) to be an indicator for whether or not flip \(i\) produces heads, we can infer from Example 243 that \(\Var(X_i) = p(1-p)\). What about \(\Var(X)=\lrVar{\sum_i X_i}\)?

For expectation, we know from Theorem 231 that linearity of expectation always holds. However, for variance, this is not necessarily the case. As an example, for a random variable \(Y\), define \(Z = Y + Y = 2Y\). It is not the case that \(\Var(Z) = 2\Var(Y)\); in actuality, \(\Var(Z) = 4\Var(Y)\).

Theorem 244

Suppose \(X\) is a random variable. Then for any constant \(c\), \(\Var(cX) = c^2\Var(X)\).

Proof 245

By definition of variance, we have

\[\begin{split}\Var(cX) &= \Ex[(cX)^2] - \Ex[cX]^2\\ &= \Ex[c^2 X^2] - (c\Ex[X])^2\\ &= c^2\Ex[X^2] - c^2\Ex[X]^2\\ &= c^2(\Ex[X^2] - \Ex[X]^2)\\ &= c^2\Var(X)\end{split}\]

In the second and third steps, we applied linearity of expectation to pull the constants \(c\) and \(c^2\) out of the expectations.

For a more general sum of random variables, the variances add when the random variables are independent. To establish this, we first demonstrate that the expectation of the product of independent random variables is the product of their expectations.

Lemma 246

Let \(X\) and \(Y\) be independent random variables. Then

\[\Ex[XY] = \Ex[X]\cdot\Ex[Y]\]
Proof 247

By definition of expectation, we have

\[\begin{split}\Ex[XY] &= \sum_{x,y} xy\Pr[X = x, Y = y]\\ &= \sum_x \sum_y xy\Pr[X = x]\cdot\Pr[Y = y]\\ &= \sum_x \parens{x\Pr[X = x] \sum_y y\Pr[Y = y]}\\ &= \sum_x x\Pr[X = x]\cdot\Ex[Y]\\ &= \Ex[Y] \sum_x x\Pr[X = x]\\ &= \Ex[Y]\cdot\Ex[X]\end{split}\]

In the second step, we used the fact that \(X\) and \(Y\) are independent, so that \(\Pr[X = x, Y = y] = \Pr[X = x]\cdot\Pr[Y = y]\).

By induction, we can conclude that

\[\lrEx{\prod_i X_i} = \prod_i \Ex[X_i]\]

when the \(X_i\) are independent.

We now consider the variance of the sum of independent random variables.

Theorem 248

Suppose \(X\) and \(Y\) are independent random variables. Then \(\Var(X+Y) = \Var(X) + \Var(Y)\).

Proof 249

By definition of variance, we have

\[\begin{split}\Var(X+Y) &= \Ex[(X+Y)^2] - \Ex[X+Y]^2\\ &= \Ex[X^2 + 2XY + Y^2] - \Ex[X+Y]^2\\ &= \Ex[X^2] + 2\Ex[XY] + \Ex[Y^2] - (\Ex[X] + \Ex[Y])^2\\ &= \Ex[X^2] + 2\Ex[XY] + \Ex[Y^2] - (\Ex[X]^2 + 2\Ex[X]\Ex[Y] + \Ex[Y]^2)\\ &= \Ex[X^2] + 2\Ex[X]\Ex[Y] + \Ex[Y^2] - (\Ex[X]^2 + 2\Ex[X]\Ex[Y] + \Ex[Y]^2)\\ &= \Ex[X^2] + \Ex[Y^2] - \Ex[X]^2 - \Ex[Y]^2\\ &= \Var(X) + \Var(Y)\end{split}\]

In the third step, we applied linearity of expectation, and we applied Lemma 246 in the fifth step.

By induction, we can conclude that

\[\lrVar{\sum_i X_i} = \sum_i \Var(X_i)\]

when the \(X_i\) are independent. Thus, if \(X\) is the number of heads over \(n\) flips of a biased coin with probability \(p\) of heads, we have

\[\begin{split}\Var(X) &= \lrVar{\sum_i X_i}\\ &= \sum_i \Var(X_i)\\ &= \sum_i p(1-p)\\ &= np(1-p)\end{split}\]
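As an informal check of this formula, the following sketch (our own illustration, with arbitrary parameter choices) estimates the variance of the number of heads empirically and compares it to \(np(1-p)\).

import random

def count_heads(n, p):
    return sum(1 for _ in range(n) if random.random() < p)

n, p, trials = 100, 0.3, 20_000
samples = [count_heads(n, p) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
print(f"empirical variance: {var:.2f}, np(1-p): {n * p * (1 - p):.2f}")   # both near 21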

While variance tells us the expected squared distance from the expectation, Chebyshev’s inequality allows us to bound the probability of the absolute distance exceeding a given threshold.

Theorem 250 (Chebyshev’s Inequality)

Let \(X\) be a random variable and let \(a > 0\). Then

\[\Pr[\abs{X-\Ex[X]} \ge a] \le \frac{\Var(X)}{a^2}\]
Proof 251

We can derive Chebyshev’s inequality by applying Markov’s inequality to the random variable \(Y = (X - \Ex[X])^2\). Observe that since \(Y\) is a squared quantity, it is always nonnegative, so Markov is applicable. Furthermore, for a nonnegative \(a\), \((X - \Ex[X])^2 \ge a^2\) exactly when \(\abs{X - \Ex[X]} \ge a\). Thus:

\[\begin{split}\Pr[\abs{X-\Ex[X]} \ge a] &= \Pr[(X - \Ex[X])^2 \ge a^2]\\ &\le \frac{\Ex[(X-\Ex[X])^2]}{a^2}\\ &= \frac{\Var(X)}{a^2}\end{split}\]
Example 252

Suppose we flip a fair coin \(n\) times. What is the probability of getting \(\le 49\%\) or \(\ge 51\%\) heads?

Let \(X\) be the number of heads. We previously computed that \(\Ex[X] = \frac{n}{2}\), and \(\Var(X) = n\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{n}{4}\). Then

\[\begin{split}\lrPr{\abs*{X-\frac{n}{2}} \ge 0.01n} &\le \frac{\Var(X)}{(0.01n)^2}\\ &= 10000 \frac{n\cdot 1/4}{n^2}\\ &= \frac{2500}{n}\end{split}\]

Thus, for 10,000 flips, the probability of deviating from the expectation by 1% is at most 1/4, while for 1,000,000 flips, it is at most 1/400.
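The following sketch (with hypothetical parameter choices) compares the Chebyshev bound \(2500/n\) against the empirically observed frequency of a 1% deviation; the bound holds, but it is quite loose.

import random

def fraction_heads(n):
    return sum(1 for _ in range(n) if random.random() < 0.5) / n

n, trials = 10_000, 500
count = sum(1 for _ in range(trials) if abs(fraction_heads(n) - 0.5) >= 0.01)
print(f"empirical frequency: {count / trials:.3f}, Chebyshev bound: {2500 / n}")
# The observed frequency is typically around 0.05, well below the bound of 0.25.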

In the example above, the random variable \(X\) can be defined as the sum of independent, identically distributed (i.i.d.) random variables \(X = X_1 + \dots + X_n\). In such a situation, we often prefer to reason about \(X/n\) rather than \(X\) itself, since the expectation of \(X/n\) is independent of \(n\); in particular, \(\Ex[X] = n\Ex[X_i]\), whereas \(\Ex[X/n] = \Ex[X_i]\). The following is an alternative expression of Chebyshev’s inequality for such a case:

Corollary 253

Let \(X = X_1 + \dots + X_n\) be the sum of \(n\) independent, identically distributed random variables. Let \(\varepsilon > 0\) be a deviation from the expectation \(\Ex[X/n] = \Ex[X_i]\). Then

\[\lrPr{\abs*{\frac{1}{n}X-\Ex[X_i]} \ge \varepsilon} \le \frac{\Var(X_i)}{\varepsilon^2 n}\]
Proof 254

The event \(\abs{X/n-\Ex[X_i]} \ge \varepsilon\) is equivalent to the event \(\abs{X-\Ex[X]} \ge \varepsilon n\), since \(\Ex[X] = n\Ex[X_i]\) (multiply both sides of the first inequality by \(n\)). By the standard form of Chebyshev’s inequality, we have

\[\begin{split}\lrPr{\abs*{\frac{1}{n}X-\Ex[X_i]} \ge \varepsilon} &= \lrPr{\abs{X-\Ex[X]} \ge \varepsilon n}\\ &\le \frac{\Var(X)}{\varepsilon^2 n^2}\\ &= \frac{n\Var(X_i)}{\varepsilon^2 n^2}\\ &= \frac{\Var(X_i)}{\varepsilon^2 n}\end{split}\]

In the second-to-last step, we used Theorem 248 to deduce that \(\Var(X) = n\Var(X_i)\) since the \(X_i\) are independent.

Example 255

Suppose we flip a fair coin \(n\) times. What is the probability of getting \(\le 49\%\) or \(\ge 51\%\) heads?

Let \(X_i\) be an indicator for whether flip \(i\) produces heads, and let \(X = X_1 + \dots + X_n\). We have \(\Ex[X/n] = \Ex[X_i] = 1/2\), and \(\Var(X_i) = 1/4\). Then

\[\begin{split}\lrPr{\abs*{\frac{1}{n}X-\frac{1}{2}} \ge 0.01} &\le \frac{\Var(X_i)}{0.01^2 n}\\ &= 10000 \frac{1/4}{n}\\ &= \frac{2500}{n}\end{split}\]

as before.

Using Chebyshev’s inequality, we observe that the probability of deviating by 1% from the expectation over \(n\) flips of a coin decreases (at least) linearly in the number of flips \(n\). In reality, the probability decreases much faster than this – using multiplicative Chernoff bounds or Hoeffding’s inequality, we can conclude that the probability decreases exponentially in the number of flips.

Hoeffding’s Inequality

Hoeffding’s inequality, also called Chernoff-Hoeffding bounds, is a set of concentration bounds that can give tighter results than Markov’s inequality, Chebyshev’s inequality, or other forms of Chernoff bounds. For simplicity, we restrict ourselves to the special case [6] of independent indicator random variables, with expectation

\[\Pr[X_i = 1] = p_i\]

for each indicator \(X_i\).

Let \(X = X_1 + \dots + X_n\), and let \(p\) be the expected value of \(\frac{1}{n}X\). We have

\[\begin{split}p &= \lrEx{\frac{1}{n}X}\\ &= \frac{1}{n} \sum_i \Ex[X_i]\\ &= \frac{1}{n} \sum_i p_i\end{split}\]

Hoeffding’s inequality places a limit on how much the actual value of \(\frac{1}{n}X\) may deviate from the expectation \(p\).

Theorem 256 (Hoeffding’s Inequality – Upper Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\). Let

\[p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i\]

Let \(\varepsilon > 0\) be a deviation from the expectation. Then

\[\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le e^{-2\varepsilon^2 n}\]
Theorem 257 (Hoeffding’s Inequality – Lower Tail)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\). Let

\[p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i\]

Let \(\varepsilon > 0\) be a deviation from the expectation. Then

\[\lrPr{\frac{1}{n}X \le p - \varepsilon} \le e^{-2\varepsilon^2 n}\]
Theorem 258 (Hoeffding’s Inequality – Combined)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent indicator random variables with \(\Ex[X_i] = p_i\). Let

\[p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i\]

Let \(\varepsilon > 0\) be a deviation from the expectation. Then

\[\lrPr{\abs*{\frac{1}{n}X - p} \ge \varepsilon} \le 2e^{-2\varepsilon^2 n}\]

We consider again the example of flipping a fair coin. Let \(H\) be the total number of heads, and let \(H_i\) be an indicator variable corresponding to whether the \(i\)th flip is heads. We have \(\Pr[H_i = 1] = \frac{1}{2}\), and \(\Ex[\frac{1}{n}H] = \frac{1}{2}\) for any number of flips \(n\).

What is the probability of getting at least six heads out of ten flips? This is a deviation of \(\varepsilon = 0.1\), and applying the upper-tail Hoeffding’s inequality gives us:

\[\begin{split}\lrPr{\frac{1}{n}H \ge p + \varepsilon} &= \lrPr{\frac{1}{n}H \ge 0.5 + 0.1}\\ &\le e^{-2 \cdot 0.1^2 \cdot 10}\\ &\approx 0.9802^{10}\\ &\approx 0.82\end{split}\]

What is the probability of getting at least 60 heads out of 100 flips, which is the same deviation \(\varepsilon = 0.1\) from the expectation? Applying the upper tail again, we get

\[\lrPr{\frac{1}{n}H \ge p + \varepsilon} \le e^{-2 \cdot 0.1^2 \cdot 100} \approx 0.9802^{100} \approx 0.14\]

This is a significantly tighter bound than that produced by the multiplicative Chernoff bound.
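For comparison, we can compute the exact tail probability from the binomial distribution alongside the Hoeffding bound; the short sketch below (our own addition) does this for the 100-flip case.

from math import comb, exp

n, threshold = 100, 60                # at least 60 heads out of 100 fair flips
exact = sum(comb(n, k) for k in range(threshold, n + 1)) / 2 ** n
bound = exp(-2 * 0.1 ** 2 * n)        # Hoeffding with epsilon = 0.1
print(f"exact: {exact:.4f}, Hoeffding bound: {bound:.4f}")   # about 0.028 vs 0.135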

Example 259

Suppose we have a coin that is biased by probability \(\varepsilon\) towards either heads or tails, but we don’t know which one. In other words, either:

  • \(\Pr[\text{heads}] = \frac{1}{2} + \varepsilon\) and \(\Pr[\text{tails}] = \frac{1}{2} - \varepsilon\), or

  • \(\Pr[\text{heads}] = \frac{1}{2} - \varepsilon\) and \(\Pr[\text{tails}] = \frac{1}{2} + \varepsilon\)

To determine in which direction the coin is biased, we flip the coin \(n\) times. If the results include at least \(n/2\) heads, we assume the coin is biased towards heads, otherwise we assume it is biased towards tails. How many flips should we do to guarantee our answer is correct with probability at least \(1-\delta\)?

Let \(X\) be the number of heads, and let \(X_i\) be an indicator that is 1 if the \(i\)th flip is heads, 0 if it is tails. Then \(X = X_1 + \dots + X_n\) is the sum of independent indicator random variables with \(\Ex[X_i]\) equal to either \(\frac{1}{2} + \varepsilon\) or to \(\frac{1}{2} - \varepsilon\). We analyze the two cases individually.

  • Case 1: The coin is biased towards heads. Then \(\Ex[X_i] = p = \frac{1}{2} + \varepsilon\). Our guess is erroneous when:

    \[\begin{split}X &< \frac{n}{2}\\ \frac{1}{n}X &< \frac{1}{2}\\ &= \parens{\frac{1}{2} + \varepsilon} - \varepsilon\\ &= p - \varepsilon\end{split}\]

    By the lower-tail Hoeffding’s inequality, we have

    \[\begin{split}\lrPr{\frac{1}{n}X < p - \varepsilon} &\le \lrPr{\frac{1}{n}X \le p - \varepsilon}\\ &\le e^{-2\varepsilon^2 n}\end{split}\]

    In the first step, we used the fact that \(\Pr[Y \le a] = \Pr[Y < a] + \Pr[Y = a] \ge \Pr[Y < a]\) for a random variable \(Y\) and any value \(a\).

  • Case 2: The coin is biased towards tails. Then \(\Ex[X_i] = p = \frac{1}{2} - \varepsilon\). Our guess is erroneous when:

    \[\begin{split}X &\ge \frac{n}{2}\\ \frac{1}{n}X &\ge \frac{1}{2}\\ &= \parens{\frac{1}{2} - \varepsilon} + \varepsilon\\ &= p + \varepsilon\end{split}\]

    By the upper-tail Hoeffding’s inequality, we have

    \[\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le e^{-2\varepsilon^2 n}\]

In either case, the probability of error after \(n\) flips is upper-bounded by \(e^{-2\varepsilon^2 n}\). We want this probability to be no more than \(\delta\):

\[\begin{split}e^{-2\varepsilon^2 n} &\le \delta\\ 1/\delta &\le e^{2\varepsilon^2 n}\\ \ln(1/\delta) &\le 2\varepsilon^2 n\\ \frac{\ln(1/\delta)}{2\varepsilon^2} &\le n\end{split}\]

If \(\varepsilon = 0.01\) (i.e. the coin is biased towards heads or tails by 1%) and \(\delta = 0.0001\) (we want to be correct at least 99.99% of the time), then

\[n \ge \frac{\ln(1/0.0001)}{2\cdot 0.01^2} \approx 46052\]

flips suffice.
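A small helper like the following (an illustrative sketch; the function name is our own) turns this bound into the number of flips needed for given \(\varepsilon\) and \(\delta\).

from math import ceil, log

def flips_needed(eps, delta):
    """Flips needed so the majority-vote guess of the bias direction is wrong
    with probability at most delta, by Hoeffding's inequality."""
    return ceil(log(1 / delta) / (2 * eps ** 2))

print(flips_needed(0.01, 0.0001))   # 46052, matching the calculation above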

Exercise 260

Suppose we have a coin that is either fair, or is biased by probability \(\varepsilon\) towards heads, but we don’t know which is the case. In other words, either:

  • \(\Pr[\text{heads}] = \frac{1}{2}\) and \(\Pr[\text{tails}] = \frac{1}{2}\), or

  • \(\Pr[\text{heads}] = \frac{1}{2} + \varepsilon\) and \(\Pr[\text{tails}] = \frac{1}{2} - \varepsilon\)

We flip the coin \(n\) times and count the number of heads \(X\).

  1. For what values of \(X\) should we guess the coin is fair, and for what values that it is biased towards heads?

  2. How many flips should we do to guarantee our answer is correct with probability at least \(1 - \delta\)?

  3. How many flips should we do for \(\varepsilon = 0.01\) and \(\delta = 0.0001\)? How does this compare to the situation in Example 259 with \(\varepsilon = 0.005\) and \(\delta = 0.0001\)?

Polling

Rather than applying concentration bounds to compute the probability of a deviation for a specific sample size, we often wish to determine how many samples we need to be within a particular deviation with high confidence. One application of this is big data, where we have vast datasets that are too large to examine in their entirety, so we sample the data instead to estimate the quantities of interest. Similar to this is polling – outside of elections themselves, we typically do not have the resources to ask the entire population for their opinions, so we need to sample the populace to estimate the support for a particular candidate or political position. In both applications, we need to determine how many samples are needed to obtain a good estimate.

In general, a poll estimates the fraction of the population that supports a particular candidate by asking \(n\) randomly chosen people whether they support that candidate. Let \(X\) be a random variable corresponding to the number of people who answer this question in the affirmative. Then \(X/n\) is an estimate of the level of support in the full population.

A typical poll has both a confidence level and a margin of error – the latter corresponds to the deviation from the true fraction \(p\) of people who support the candidate, and the former corresponds to a bound on the probability that the estimate is within that deviation. For example, a 95% confidence level and a margin of error of ±2% requires that

\[\lrPr{\abs*{\frac{X}{n} - p} \le 0.02} \ge 0.95\]

More generally, for a confidence level \(1-\gamma\) and margin of error \(\varepsilon\), we require

\[\lrPr{\abs*{\frac{X}{n} - p} \le \varepsilon} \ge 1-\gamma\]

or equivalently

\[\lrPr{\abs*{\frac{X}{n} - p} > \varepsilon} < \gamma\]

Formally, we define indicator variables \(X_i\) as

\[\begin{split}X_i = \begin{cases} 1 &\mbox{if person $i$ supports the candidate}\\ 0 &\mbox{otherwise} \end{cases}\end{split}\]

for each person \(i\) in the set that we poll. Then \(X = X_1 + \dots + X_n\) is the sum of independent indicator variables, with

\[\mu = \Ex[X] = \sum_i \Ex[X_i] = np\]

Analysis with Hoeffding’s Inequality

Applying Hoeffding’s inequality to polling, the combined inequality gives us

\[\lrPr{\abs*{\frac{1}{n}X - p} \ge \varepsilon} \le 2e^{-2\varepsilon^2 n}\]

For a 95% confidence level and a margin of error of ±2%, we require that

\[\lrPr{\abs*{\frac{X}{n} - p} \le 0.02} \ge 0.95\]

However, this isn’t quite in the form where we can apply Hoeffding’s inequality, so we need to do some manipulation first. We have:

\[\begin{split}\lrPr{\abs*{\frac{X}{n} - p} \le 0.02} &= 1 - \lrPr{\abs*{\frac{X}{n} - p} > 0.02}\\ &\ge 1 - \lrPr{\abs*{\frac{X}{n} - p} \ge 0.02}\end{split}\]

Hoeffding’s inequality gives us:

\[\lrPr{\abs*{\frac{X}{n} - p} \ge 0.02} \le 2e^{-2 \cdot 0.02^2 \cdot n}\]

Substituting this into the above, we get:

\[\lrPr{\abs*{\frac{X}{n} - p} \le 0.02} \ge 1 - 2e^{-2 \cdot 0.02^2 \cdot n}\]

We want this to be at least 0.95:

\[\begin{split}1 - 2e^{-2 \cdot 0.02^2 \cdot n} &\ge 0.95\\ 2e^{-2 \cdot 0.02^2 \cdot n} &\le 0.05\\ e^{2 \cdot 0.02^2 \cdot n} &\ge 40\\ 2 \cdot 0.02^2 \cdot n &\ge \ln 40\\ n &\ge \frac{\ln 40}{2 \cdot 0.02^2}\\ &\approx 4611.1\end{split}\]

Thus, we obtain the given confidence level and margin of error by polling at least 4612 people. Observe that this does not depend on the total population size!

For an arbitrary margin of error ±\(\varepsilon\), we obtain:

\[\begin{split}\lrPr{\abs*{\frac{X}{n} - p} \le \varepsilon} &= 1 - \lrPr{\abs*{\frac{X}{n} - p} > \varepsilon}\\ &\ge 1 - \lrPr{\abs*{\frac{X}{n} - p} \ge \varepsilon}\\ &\ge 1 - 2e^{-2\varepsilon^2 n}\end{split}\]

To achieve an arbitrary confidence level \(1 - \gamma\), we need:

\[\begin{split}1 - 2e^{-2\varepsilon^2 n} &\ge 1 - \gamma\\ 2e^{-2\varepsilon^2 n} &\le \gamma\\ e^{2\varepsilon^2 n} &\ge \frac{2}{\gamma}\\ 2\varepsilon^2 n &\ge \ln\parens{\frac{2}{\gamma}}\\ n &\ge \frac{1}{2\varepsilon^2} \ln\parens{\frac{2}{\gamma}}\end{split}\]
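This calculation can be packaged as a small function; the following is a sketch of our own, with illustrative parameter names.

from math import ceil, log

def poll_size(margin, confidence):
    """Samples needed so the estimate X/n is within +/- margin of the true
    support p with probability at least `confidence`, by Hoeffding's inequality."""
    gamma = 1 - confidence
    return ceil(log(2 / gamma) / (2 * margin ** 2))

print(poll_size(0.02, 0.95))   # 4612, as computed above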

More generally, if we wish to gauge the level of support for \(m\) different candidates, the sampling theorem tells us that the number of samples required is logarithmic in \(m\).

Theorem 261 (Sampling Theorem)

Suppose \(n\) people are polled to ask which candidate they support, out of \(m\) possible candidates. Let \(\Xj\) be the number of people who state that they support candidate \(j\), and let \(p_j\) be the true level of support for that candidate. We wish to obtain

\[\lrPr{\bigcap_j \parens{\abs*{\frac{\Xj}{n} - p_j} \le \varepsilon}} \ge 1 - \gamma\]

In other words, we desire a confidence level \(1 - \gamma\) that all estimates \(\Xj/n\) are within margin of error ±\(\varepsilon\) of the true values \(p_j\). We obtain this when the number of samples \(n\) is

\[n \ge \frac{1}{2\varepsilon^2} \ln\parens{\frac{2m}{\gamma}}\]

The sampling theorem can be derived from applying the union bound.

In conclusion, when sampling from a large dataset, the number of samples required does depend on the desired accuracy of the estimation and the range size (i.e. number of possible answers). But it does not depend on the population size. This makes sampling a powerful technique for dealing with big data, as long as we are willing to tolerate a small possibility of obtaining an inaccurate estimate.

Load Balancing

Job scheduling is another application where we can exploit randomness to construct a simple, highly effective algorithm. In this problem, we have \(k\) servers, and there are \(n\) jobs that need to be distributed to these servers. The goal is to balance the load among the servers, so that no one server is overloaded. This problem is very relevant to content-delivery networks – there may be on the order of millions or even billions of concurrent users, and each user needs to be routed to one of only hundreds or thousands of servers. Thus, we have \(n \gg k\) in such a case.

One possible algorithm is to always send a job to the most lightly loaded server. However, this requires significant coordination – the scheduler must keep track of the load on each server, which requires extra communication, space, and computational resources. Instead, we consider a simple, randomized algorithm that just sends each job to a random server. The expected number of jobs on each server is \(n/k\). But how likely are we to be close to this ideal, balanced load?

We start our analysis by defining random variables \(\Xj\) corresponding to the number of jobs assigned to server \(j\). We would like to demonstrate that the joint probability of \(\Xj\) being close to \(n/k\) for all \(j\) is high. We first reason about the load on an individual server. In particular, we wish to compute a bound on

\[\lrPr{\Xj \ge \frac{n}{k} + c}\]

for some value \(c > 0\), i.e. the probability that server \(j\) is overloaded by at least \(c\) jobs. Let \(\Xj_i\) be an indicator random variable that is 1 if job \(i\) is sent to server \(j\). Since the target server for job \(i\) is chosen uniformly at random out of \(k\) possible choices, we have

\[\lrEx{\Xj_i} = \lrPr{\Xj_i = 1} = \frac{1}{k}\]

We also have \(\Xj = \Xj_1 + \dots + \Xj_n\), giving us

\[\lrEx{\Xj} = \sum_i \lrEx{\Xj_i} = \frac{n}{k}\]

as we stated before. Then:

\[\begin{split}\lrPr{\Xj \ge \frac{n}{k} + c} &= \lrPr{\frac{1}{n}\Xj \ge \frac{1}{k} + \frac{c}{n}}\\ &\le e^{-2(c/n)^2 n}\\ &= e^{-2c^2/n}\end{split}\]

by the upper-tail Hoeffding’s inequality.

Now that we have a bound on an individual server being overloaded by \(c\) jobs, we wish to bound the probability that there is some server that is overloaded. The complement event is that no server is overloaded, so we can equivalently compute the joint probability that none of the servers is overloaded by \(c\) or more jobs:

\[\lrPr{X^{(1)} < \frac{n}{k} + c, \dots, X^{(k)} < \frac{n}{k} + c}\]

However, we cannot do so by simply multiplying the individual probabilities together – that only works if the probabilities are independent, and in this case, they are not. In particular, if one server is overloaded, some other server must be underloaded, so the random variables \(\Xj\) are not independent.

Rather than trying to work with the complement event, we attempt to directly compute the probability that there is at least one overloaded server:

\[\lrPr{\parens{X^{(1)} \ge \frac{n}{k} + c} \cup \dots \cup \parens{X^{(k)} \ge \frac{n}{k} + c}}\]

More succinctly, we denote this as:

\[\lrPr{\bigcup_j \parens{\Xj \ge \frac{n}{k} + c}}\]

We have a union of events, and we want to analyze the probability of that union. We use the union bound (Lemma 212):

\[\begin{split}\lrPr{\bigcup_j \parens{\Xj \ge \frac{n}{k} + c}} &\le \sum_j \lrPr{\Xj \ge \frac{n}{k} + c}\\ &\le \sum_j e^{-2c^2/n}\\ &= k \cdot e^{-2c^2/n}\\ &= e^{\ln k} \cdot e^{-2c^2/n}\\ &= e^{\ln k - 2c^2/n}\end{split}\]

For \(c = \sqrt{n\ln k}\), we get:

\[\begin{split}\lrPr{\bigcup_j \parens{\Xj \ge \frac{n}{k} + \sqrt{n\ln k}}} &\le e^{\ln k - 2(\sqrt{n\ln k})^2/n}\\ &= e^{\ln k - 2(n\ln k)/n}\\ &= e^{-\ln k}\\ &= 1/e^{\ln k}\\ &= \frac{1}{k}\end{split}\]

With concrete values \(n = 10^{10}\) and \(k = 1000\), we compute the overload relative to the expected value as:

\[\begin{split}\frac{\sqrt{n \ln k}}{n/k} &= \frac{\sqrt{10^{10} \ln 1000}}{10^7}\\ &\approx 0.026\end{split}\]

This is an overhead of about 2.6% above the expected load. The probability that there is a server overloaded by at least 2.6% is bounded from above by \(1/k = 0.001\), or ≤0.1%. Thus, when there are \(n = 10^{10}\) jobs and \(k = 1000\) servers, the randomized algorithm has a high probability (≥99.9%) of producing a schedule where the servers all have a low overhead (≤2.6%).
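The following sketch simulates the random assignment on a smaller instance (\(n = 10^6\) jobs and \(k = 100\) servers, chosen only so the simulation runs quickly) and compares the maximum load against the threshold \(n/k + \sqrt{n \ln k}\).

import random
from math import log, sqrt

n, k = 1_000_000, 100
loads = [0] * k
for _ in range(n):
    loads[random.randrange(k)] += 1   # send each job to a uniformly random server

threshold = n / k + sqrt(n * log(k))
print(f"max load: {max(loads)}, threshold: {threshold:.0f}")
# With probability at least 1 - 1/k, no server exceeds the threshold.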

Exercise 262

Previously, we demonstrated that to use polling to achieve a confidence level \(1 - \gamma\) and a margin of error ±\(\varepsilon\),

\[n \ge \frac{1}{2\varepsilon^2} \ln\parens{\frac{2}{\gamma}}\]

samples are sufficient when there is a single candidate. We also saw the sampling theorem, which states that for \(m\) candidates,

\[n \ge \frac{1}{2\varepsilon^2} \ln\parens{\frac{2m}{\gamma}}\]

samples suffice to achieve a confidence level \(1 - \gamma\) that all estimates \(\Xj/n\) are within margin of error ±\(\varepsilon\).

Use the union bound to demonstrate that this latter result follows from the result for a single candidate.