Thus far, every algorithm we have seen is deterministic. This means
that, when run on a given input, the steps the algorithm performs and
the output it produces are entirely determined. So, when run multiple
times on the same input, the algorithm produces the same output. At
heart, this is because any Turing machine’s transition rules are
deterministic, i.e., it has a transition function.
In this unit, we consider how randomness can be applied in
computation. We will exploit randomness to design simple,
probabilistic algorithms, and we will discuss several tools that can
be used to analyze randomized algorithms.
Since Turing machines are deterministic, they cannot make random
choices—so how are randomized algorithms possible? In real life,
when we want to make a random choice, we can flip a coin, roll a die,
or the like. Similarly, to model randomized algorithms and Turing
machines, we augment them with an external source of randomness.
For this purpose, the simplest source of randomness is a (virtual)
“fair coin” that the algorithm can “flip” as needed. Formally, we can
define a randomized Turing machine as having a special “coin flip”
state that, whenever it is entered, writes a fresh unbiased random bit
at the location of the tape head, and transitions back to the previous
state. With this basic feature it is possible to simulate richer
sources of randomness, like the roll of a die with any finite number
of sides.
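To make this concrete, here is a minimal Python sketch (ours, not part of the formal Turing-machine model) of how a source of unbiased random bits can simulate a fair die with any finite number of sides, via rejection sampling; the function names are illustrative.

import random

def coin_flip():
    # One unbiased random bit, playing the role of the "coin flip" state.
    return random.randrange(2)

def die_roll(sides=6):
    # Simulate a fair die using only fair coin flips (rejection sampling):
    # flip enough bits to cover the range, and retry if the value is too large.
    bits = (sides - 1).bit_length()
    while True:
        value = 0
        for _ in range(bits):
            value = 2 * value + coin_flip()
        if value < sides:
            return value + 1  # uniform over {1, ..., sides}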
In this text, we write algorithms and pseudocode that make truly
random, unbiased, and independent choices from whatever finite set is
convenient. In the real world, when implementing and using randomized
algorithms—especially for cryptography—it is vital to use a
high-quality source of randomness. (Ensuring this can be a very delicate
matter, but it is outside the scope of this course.)
As a first example of the utility of randomized algorithms, consider
the game of rock-paper-scissors between two players. In this game,
both players simultaneously choose one of three options: rock, paper,
or scissors. The game is a tie if both players choose the same option.
If they make different choices, then rock beats (“smashes”) scissors,
scissors beats (“cuts”) paper, and paper beats (“covers”) rock:
Suppose we write an algorithm (or strategy) to play the game, e.g., in
a best-out-of-five sequence of rounds. Similar to worst-case analysis
of algorithms, we should consider our strategy’s performance against
an opponent strategy that plays optimally against it. In other words,
we should assume that our opponent “knows” our strategy, and plays the
best it can with that knowledge.
If we “hard-code” a strategy—like playing rock in the first two
rounds, then paper, then scissors, then paper—then how will this
strategy perform? The strategy is fixed and deterministic, so an
optimal opponent can win every round (namely, play paper in the first
two rounds, then scissors, then rock, then scissors). More generally,
for any deterministic strategy—even one that is “adaptive” based
on the opponent’s moves—there exists an opponent strategy that
entirely beats it. This illustrates the problem with deterministic
strategies: they make the program’s actions predictable, and thus
easily countered.
Now consider a strategy that chooses a random action in each move,
with equal probability for rock, paper, and scissors. Here, even an
opponent that knows this strategy does not know in advance which
action the strategy will choose, and therefore cannot universally
counter that move. In fact, we prove below that no matter what
strategy the opponent uses, the probability that the opponent wins an
individual round is just \(1/3\) (and the same goes for tying and losing
the round). To see why this is the case, we first review some basic
tools of probability.
We first recall the basic definitions and axioms of probability. For
simplicity, we focus on discrete probability, which will be
sufficient for our purposes.
A (discrete) probability space is a countable set \(\Omega\),
called the sample space of possible outcomes, along with a
probability function \(\Pr \colon \Omega \to [0,1]\) that assigns a
probability to each outcome, and satisfies \(\sum_{\omega \in \Omega}
\Pr[\omega] = 1\).
Let’s elaborate on this definition and consider some examples. The
sample space \(\Omega\) is the set of every possible outcome—i.e.,
“every thing that could potentially happen”—in a randomized process,
which is also known as an experiment. Each outcome has a probability
between \(0\) and \(1\) (inclusive) of occurring, as defined by the
probability function \(\Pr\), and all these probabilities must sum to
\(1\).
Example 207(Three Fair Coins: Probability Space)
Consider the experiment of tossing three “fair” (unbiased) coins in
sequence. We formalize this as a probability space—i.e., a sample
space and a probability function on it.
The sample space \(\Omega\) is the set of all \(8 = 2^3\) possible
sequences of three heads or tails that can occur, i.e.,
\[\Omega = \set{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} \; \text,\]
where \(H\) represents heads and \(T\) represents tails.
Next, we define an appropriate probability function. To represent
the fairness of the coins, we require that each of the eight
outcomes has the same probability. Since by definition the
probabilities of all eight outcomes must sum to \(1\), we should
define \(\Pr[\omega] = 1/8\) for every outcome \(\omega \in \Omega\),
e.g., \(\Pr[HHH] = \Pr[HHT] = \cdots = \Pr[TTT] = 1/8\). [1]
These definitions generalize straightforwardly to the probability
experiment of tossing \(n \geq 0\) fair coins, where the sample space
is \(\Omega = \set{H,T}^n\), and each outcome \(\omega \in \Omega\) has
the same probability \(\Pr[\omega] = 1/2^n\).
Definition 208(Event)
For a probability space \((\Omega,\Pr)\), an event is any subset \(E
\subseteq \Omega\) of the sample space. The probability function is
extended to events as
\[\Pr[E] = \sum_{\omega \in E} \Pr[\omega] \; \text.\]
In other words, an event is any subset of outcomes, and its
probability is the sum of the probabilities of those outcomes. We now
build on the previous example.
Example 209(Three Fair Coins: Events)
Continuing from the probability space for tossing three fair coins
as defined in Example 207, we now
consider some example events.
The event “the first toss comes up heads” is formalized as \(E_1 =
\set{HHH, HHT, HTH, HTT} \subseteq \Omega\), i.e., the subset of
all outcomes in which the first character is \(H\). By definition,
the probability of this event is the sum of the probabilities of
all the outcomes in the event. Since this event has four
outcomes, and each has probability \(1/8\), we get that \(\Pr[E_1] =
4/8 = 1/2\). This matches our intuition that the probability of
getting heads on the first toss (ignoring what happens on the
remaining two tosses) is \(1/2\).
The event “the first and third tosses come up the same” is
formalized as the subset \(E_2 = \set{HHH, HTH, THT, TTT}
\subseteq \Omega\). Similarly, the probability of this event is
\(\Pr[E_2] = 4/8 = 1/2\). This also matches our intuition: no
matter how the first toss comes up, the third toss has a \(1/2\)
probability of matching it.
The event “the same number of heads and tails come up” is
formalized as the empty subset \(E_3 = \emptyset \subseteq
\Omega\). This is because no outcome has the same number of heads
and tails. [2] By definition, the probability of this event is
\(\Pr[E_3] = 0\).
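As a sanity check, the following short Python sketch (ours) enumerates the eight equally likely outcomes and computes the probabilities of the events \(E_1\), \(E_2\), and \(E_3\) from this example.

from itertools import product

# Sample space: all sequences of three coin results, each with probability 1/8.
omega = [''.join(s) for s in product('HT', repeat=3)]

def pr(event):
    # Probability of an event = sum of the probabilities of its outcomes.
    return sum(1 / 8 for w in omega if event(w))

print(pr(lambda w: w[0] == 'H'))                    # E1: first toss heads      -> 0.5
print(pr(lambda w: w[0] == w[2]))                   # E2: first and third match -> 0.5
print(pr(lambda w: w.count('H') == w.count('T')))   # E3: impossible            -> 0.0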
Because events are just subsets of the sample space, we can use set
relations to relate their probabilities. For example, for any events
\(E_1 \subseteq E_2\), we have that \(\Pr[E_1] \leq \Pr[E_2]\), because
\(\Pr[\omega] \geq 0\) for every outcome \(\omega\). (However, we caution
that \(E_1 \subsetneq E_2\) does not imply that \(\Pr[E_1] < \Pr[E_2]\),
because the events in \(E_2 \setminus E_1\) might have probability
zero.)
We can also use set operations on events to yield other events. For
example, every event \(E\) has a complement event \(\overline{E} =
\Omega \setminus E\), whose probability is \(\Pr[\overline{E}] = 1 -
\Pr[E]\), because \(\Pr[E] + \Pr[\overline{E}] = \sum_{\omega \in
\Omega} \Pr[\omega] = 1\).
We can also consider intersections and unions of events. Notationally,
instead of \(\cap\) and \(\cup\), we usually use the logical operators
\(\wedge\) (AND) and \(\vee\) (OR), respectively. This captures the idea
that the intersection of two events represents the occurrence of both
events simultaneously, and their union represents the occurrence of
one event or the other (or both).
For two events \(E_1,E_2\), the probability \(\Pr[E_1 \wedge E_2]\) that
both occur simultaneously is called their joint probability (and
similarly for more events). It follows immediately that this
probability is at most \(\min \set{\Pr[E_1], \Pr[E_2]}\), because \(E_1
\wedge E_2 \subseteq E_1, E_2\). When considering the intersection of
events, an important case is when, informally, the (non-)occurrence of
one event does not affect the probability of the (non-)occurrence of
another event. This is known as independence; the formal definition
is as follows.
Definition 210(Independent Events)
Two events \(E_1, E_2\) (of the same probability space) are
independent if \(\Pr[E_1 \wedge E_2] = \Pr[E_1] \cdot \Pr[E_2]\).
More generally, events \(E_1, \ldots, E_n\) are mutually
independent if \(\Pr[E_1 \wedge \cdots \wedge E_n] = \Pr[E_1]
\cdots \Pr[E_n]\). They are pairwise independent if every pair of
them is independent.
Mutual independence of several events is a stronger notion than
pairwise independence: a collection of events can be pairwise
independent but not mutually independent (see Example
218 below).
Example 211(Three Fair Coins: Independence)
Continuing from Example 209, we
consider several events and whether they are independent:
The events \(E_1\) (“the first toss comes up heads”) and \(E_2\)
(“the first and third tosses come up the same”) are independent,
because \(E_1 \wedge E_2 = \set{HHH, HTH}\) and hence \(\Pr[E_1
\wedge E_2] = 1/4 = 1/2 \cdot 1/2 = \Pr[E_1] \cdot \Pr[E_2]\).
This matches our intuition that whether or not the first toss
comes up heads, the probability that the third toss matches the
first one is not affected.
The three events “the first/second/third toss comes up heads” are
mutually independent (and hence also pairwise independent):
their intersection is the event \(\set{HHH}\), which has
probability \(1/8 = (1/2)^3\), which is the product of the \(1/2\)
probabilities of the three individual events. The same holds if
we replace “heads” with “tails” in any of these events. All this
matches our intuition that the results of any of the tosses do
not affect any of the other tosses.
The two events “the first/last pair of tosses come up heads” are
not independent: their intersection is the event \(\set{HHH}\),
which has probability \(1/8\), but they each have probability
\(1/4\). This matches our intuition: the first pair of tosses
coming up heads makes it more likely that the last pair of
tosses also does, because the two pairs have the middle (second)
toss in common.
The three events “the [first two/last two/first and third] tosses
come up the same” are pairwise independent, but not mutually
independent. Each of them has probability \(1/2\), and the
intersection of any two of them is the event \(\set{HHH,TTT}\),
which has probability \(1/4 = 1/2 \cdot 1/2\). However, the
intersection of all three of them is also \(\set{HHH,TTT}\),
whereas the product of their individual \(1/2\) probabilities is
\(1/8\). This failure of mutual independence is because the
occurrence of any two of these events implies the occurrence of
the third one.
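The following Python sketch (ours, reusing the enumeration idea above) verifies numerically that the three “pair of tosses come up the same” events are pairwise independent but not mutually independent.

from itertools import product

omega = [''.join(s) for s in product('HT', repeat=3)]

def pr(event):
    return sum(1 / 8 for w in omega if event(w))

# Event i: the pair of tosses not including toss i come up the same.
pairs = [(1, 2), (0, 2), (0, 1)]
events = [lambda w, a=a, b=b: w[a] == w[b] for (a, b) in pairs]

# Pairwise independence: Pr[Ei and Ej] equals Pr[Ei] * Pr[Ej] for every pair.
for i in range(3):
    for j in range(i + 1, 3):
        both = pr(lambda w: events[i](w) and events[j](w))
        assert both == pr(events[i]) * pr(events[j])

# Not mutually independent: Pr[E1 and E2 and E3] is 1/4, but the product is 1/8.
all_three = pr(lambda w: all(e(w) for e in events))
print(all_three, pr(events[0]) * pr(events[1]) * pr(events[2]))  # 0.25 0.125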
When considering the union of events, the following lemma is a basic
but widely applicable tool.
Lemma 212(Union Bound)
Let \(E_1, E_2 \subseteq \Omega\) be events of the same probability
space. Then
\[\Pr[E_1 \vee E_2] \leq \Pr[E_1] + \Pr[E_2] \; \text,\]
with equality if the events are disjoint (i.e., if \(E_1 \wedge
E_2 = \emptyset\)).
More generally, the probability of the union of any countably
many events is at most the sum of their individual probabilities,
with equality if the events are pairwise disjoint.
Proof 213
We have
\[\begin{split}\Pr[E_1 \vee E_2]
&= \sum_{\omega \in E_1 \vee E_2} \Pr[\omega] \\
&\leq \sum_{\omega \in E_1} \Pr[\omega] + \sum_{\omega \in E_2} \Pr[\omega]
= \Pr[E_1] + \Pr[E_2] \; \text,\end{split}\]
where the inequality holds because the two sums on the right
include every \(\omega \in E_1 \vee E_2\) at least once, and
\(\Pr[\omega] \geq 0\) for those \(\omega\) that are double-counted,
i.e., the \(\omega \in E_1 \wedge E_2\). In particular, if \(E_1,E_2\)
are disjoint, then no outcome is double-counted, hence the bound is
an equality.
The more general claim (for more than two events) follows via
induction, by applying the main claim (for two events) to the first
event and the union of the others, and applying the inductive
hypothesis to the latter union.
A random variable is, informally, a numerical quantity whose value
is determined by the outcome of a probability experiment. It is
defined formally as follows.
Definition 214(Random Variable)
A random variable of a probability space \((\Omega,\Pr)\) is a
function \(X \colon \Omega \to \R\) from the sample space to the real
numbers. So, the set of potential values of \(X\) is its image
\(X(\Omega) = \set{X(\omega) : \omega \in \Omega}\).
For any real number \(v \in \R\), the notation \(X=v\) denotes the
event \(X^{-1}(v) = \set{\omega \in \Omega : X(\omega)=v}\), i.e.,
the set of all outcomes that map to value \(v\) under \(X\). The events
\(X \geq v\), \(X < v\), etc. are defined similarly.
By the above definition, the probabilities of the events \(X=v\) and \(X
\geq v\) are, respectively,
\[\Pr[X = v] = \sum_{\omega \in \Omega : X(\omega) = v} \Pr[\omega]
\quad \text{and} \quad
\Pr[X \geq v] = \sum_{\omega \in \Omega : X(\omega) \geq v} \Pr[\omega] \; \text.\]
A frequently useful kind of random variable is an indicator random
variable, also called a 0-1 random variable.
Definition 215(Indicator Random Variable)
An indicator random variable (of a probability space \((\Omega,
\Pr)\)) is one that is limited to the values 0 and 1, i.e., a
function \(X \colon \Omega \to \bit\).
An indicator random variable can be defined for any event \(E\), as
\(X(\omega)=1\) for every \(\omega \in E\) and \(X(\omega)=0\) otherwise.
This means that the random variable has value \(1\) if the event occurs,
and value \(0\) if the event does not occur. In the other direction, any
indicator random variable \(X\) has a corresponding event \(X^{-1}(1) =
\set{\omega \in \Omega : X(\omega)=1}\), the set of exactly those
outcomes that make the random variable have value \(1\).
Similar to the case for events, we can define independence for random
variables.
Definition 216(Independent Random Variables)
Random variables \(X_1, X_2\) are independent if for any \(v_1,
v_2 \in \R\), the events \(X_1 = v_1\) and \(X_2 = v_2\) are
independent, i.e.,
\[\Pr[X_1 = v_1 \wedge X_2 = v_2] = \Pr[X_1 = v_1] \cdot \Pr[X_2 = v_2] \; \text.\]
Example 217(Three Fair Coins: Random Variables)
Continuing with the three-fair-coins experiment, define the random
variable \(X\) to be the number of heads that come up, i.e.,
\(X(\omega)\) is the number of \(H\) characters in the outcome
\(\omega\). The event \(X=0\) is \(\set{TTT}\), and hence has probability \(\Pr[X=0]
= 1/8\). The event \(X \geq 2\) is \(\set{HHT,HTH,THH,HHH}\), and hence
has probability \(\Pr[X \geq 2] = 4/8 = 1/2\). The events \(X=1.5\) and
\(X=5\) are both \(\emptyset\), and hence each have probability zero.
For each \(i=1,2,3\), we can define an indicator random variable
\(X_i\) that indicates whether the \(i\)th toss comes up heads.
Specifically, \(X_i(\omega)=1\) if the \(i\)th character of \(\omega\) is
\(H\), and \(X_i(\omega)=0\) otherwise. For example, \(X_2\) is the
indicator random variable for the event \(\set{HHH, HHT, THH, THT}\).
By inspection, \(\Pr[X_i = 0] = \Pr[X_i = 1] = 1/2\) for every \(i\).
Observe that \(X = X_1 + X_2 + X_3\), because \(X\) is the total number
of heads, and each \(X_i\) contributes \(1\) to the sum if and only if
the \(i\)th toss comes up heads (otherwise it contributes \(0\)).
The random variables \(X_1, X_2, X_3\) are mutually independent
(and hence also pairwise independent). To see this, consider any
\(v_1,v_2,v_3 \in \bit\). Then the event \(X_1 = v_1 \wedge X_2 = v_2
\wedge X_3 = v_3\) consists of exactly one outcome (because the
\(v_i\) values collectively specify the result of every toss), so it
has probability \(1/8 = (1/2)^3\), which is the product of the \(1/2\)
probabilities of the three individual events \(X_i = v_i\).
All of the definitions and claims from this example generalize
straightforwardly to the probability experiment of tossing any
number of fair coins.
Example 218(Three Fair Coins: More Random Variables and
(Non-)Independence)
We can also define a random variable \(Y\) representing the number
of pairs of tosses that come up the same (out of the three possible
pairs: first and second, second and third, first and third).
For each pair of coin tosses, we can define an indicator random
variable that indicates whether that pair comes up the same.
Specifically, for each \(i=1,2,3\), consider the pair of tosses that
does not include the \(i\)th toss; then let \(Y_i=1\) if this pair
comes up the same, and \(Y_i=0\) otherwise. For example, \(Y_1\) is the
indicator random variable for the event \(\set{HHH, HTT, THH, TTT}\),
so \(\Pr[Y_1 = 1] = \Pr[Y_1 = 0] = 1/2\), and similarly for \(Y_2,
Y_3\).
Observe that \(Y = Y_1 + Y_2 + Y_3\), because \(Y\) is the total number
of pairs that come up the same, and each \(Y_i\) corresponds to one
of the three pairs, contributing \(1\) to the sum if that pair comes
up the same, and \(0\) otherwise.
The random variables \(Y_1, Y_2, Y_3\) are pairwise independent,
but not mutually independent. To see pairwise independence, take
any values \(v_i, v_j \in \bit\) for any \(i \neq j\). Then the event
\(Y_i = v_i \wedge Y_j = v_j\) consists of exactly two outcomes, so
it has probability \(1/4\), which is the product of the \(1/2\)
probabilities of the two individual events \(Y_i = v_i\) and \(Y_j =
v_j\). This matches our intuition that whether one pair of coins
comes up the same does not affect whether another pair of coins
does (even if the two pairs have a coin in common).
To see that the \(Y_i\) are not mutually independent, there are
several counterexamples to the definition. For instance, the event
\(Y_1 = 1 \wedge Y_2 = 1 \wedge Y_3 = 1\) is \(\set{HHH,TTT}\), so it
has probability \(1/4\), which is not the product of the \(1/2\)
probabilities of the three individual events \(Y_i = 1\) for
\(i=1,2,3\). This matches our intuition: if two pairs of coins come
up the same, then this implies that the third pair comes up the
same as well.
Definition 219
The probability distribution (also known as probability mass
function) of a random variable \(X\) is the function \(p_X \colon \R
\to [0,1]\) that maps each real value to the probability that \(X\)
takes on this value, i.e., \(p_X(v) = \Pr[X=v]\).
Example 220(Three Fair Coins: Probability Distributions)
Continuing from Example 217, the
probability distribution of the random variable \(X\) representing
the number of heads that come up is
\[p_X(0) = \frac{1}{8}, \quad
p_X(1) = \frac{3}{8}, \quad
p_X(2) = \frac{3}{8}, \quad
p_X(3) = \frac{1}{8} \; \text,\]
and \(p_X(v) = 0\) for every other \(v \in \R\).
The probability distribution of a random variable gives the
probability of each potential value of the variable. Often, we seek
more succinct “summary” quantities about the random variable. One such
quantity is its expectation, which is the “weighted average” of the
random variable, where each value is weighted by its probability.
Definition 221(Expectation)
The expected value, also known as expectation, of a random
variable \(X\) is defined as
\[\Ex[X] = \sum_{v \in X(\Omega)} v \cdot \Pr[X = v] \; \text.\]
The name “expected value” is a bit of a misnomer, because it is not
really the value we “expect” to get in the probability experiment.
Indeed, a random variable might not even be able to take on its
expected value at all! (Example 224 below gives a simple example of this.)
Instead, it is good to think of the expectation as the average value
of a random variable.
Example 224(Three Fair Coins: Expectation)
Continuing the three-fair-coins example, the expectation of \(X\) (the
number of heads that come up) is
\[\Ex[X] = 0 \cdot \frac{1}{8} + 1 \cdot \frac{3}{8} + 2 \cdot \frac{3}{8} + 3 \cdot \frac{1}{8} = \frac{3}{2} \; \text.\]
This matches our intuition that when we toss three fair coins, the
average number of heads that come up is \(3/2\). Notice that this
expectation is not a value that \(X\) can actually take on; it is
merely the “average” value.
Continuing from Example 218, the
expectation of \(Y\) (the number of pairs of tosses that come up
the same) is
\[\Ex[Y] = 1 \cdot \frac{6}{8} + 3 \cdot \frac{2}{8} = \frac{3}{2} \; \text,\]
because \(Y = 1\) for six of the eight outcomes and \(Y = 3\) for the
remaining two outcomes \(HHH\) and \(TTT\).
This also might (or might not!) match our intuition: there are
three pairs of tosses, and each one has probability \(1/2\) of coming
up the same. Even though these three events are not mutually
independent (see Example 218), the
expected number of equal pairs of tosses happens to be \(3/2\). This
is not a coincidence, as we will see next.
An important feature of averages is that it is impossible for all
potential outcomes to be above average, nor can they all be below
average. In other words, there must be some outcome that is at least
the expectation, and one that is at most the expectation. This is
formalized in the following lemma, which is known as the averaging
argument.
Lemma 225(Averaging Argument)
Let \(X\) be a random variable of a probability space \((\Omega,\Pr)\).
Then there exists an outcome \(\omega \in \Omega\) for which
\(X(\omega) \geq \Ex[X]\), and an outcome \(\omega' \in \Omega\) for
which \(X(\omega') \leq \Ex[X]\).
For example, building on Example 224, there exists at least one outcome for
which \(X\), the number of heads that come up, is at least \(3/2\).
Indeed, \(HHH\) and \(HHT\) are two such outcomes.
Proof 226
Let \(E = \Ex[X]\), and suppose for the purposes of contradiction
that \(X(\omega) < E\) for every \(\omega \in \Omega\). Then by the
alternative formula for the expectation, and because the
\(\Pr[\omega]\) are non-negative and sum to 1 (so at least one of
them is positive),
\[\begin{split}\Ex[X]
&= \sum_{\omega \in \Omega} X(\omega) \cdot \Pr[\omega] \\
&< \sum_{\omega \in \Omega} E \cdot \Pr[\omega] \\
&= E \cdot \sum_{\omega \in \Omega} \Pr[\omega] = E \; \text.\end{split}\]
Thus \(\Ex[X] < E\), which contradicts the definition of \(E\), hence
the first claim is proved. The second claim proceeds symmetrically.
For a non-negative random variable, its expectation gives some
useful information about its probability distribution. Informally, it is
“not so likely” that the random variable is “much larger than” its
expectation. Markov’s inequality quantifies this.
Theorem 227(Markov’s Inequality)
Let \(X\) be a non-negative random variable, i.e., \(X\) never takes on
a negative value, and let \(a > 0\). Then
\[\Pr[X \geq a] \leq \frac{\Ex[X]}{a} \; \text.\]
If \(\Ex[X] > 0\), this implies that for any \(t > 0\), the probability
that \(X\) is at least \(t\) times its expectation is at most \(1/t\).
Notice that these statements are trivial (or “useless”) for \(a \leq
\Ex[X]\) or \(t \leq 1\), because in these cases Markov’s inequality
gives probability upper bounds that are \(\geq 1\), and the probability
of any event is trivially \(\leq 1\). So, Markov’s inequality is
useful only when we take \(a > \Ex[X]\) or \(t > 1\).
As an example, if the class average on an exam is \(70\) points (and
negative points are not possible), then Markov’s inequality implies
that the fraction of the class that earned at least \(a = 90\) points is
at most \(70/90 = 7/9\). (Stated probabilistically: if we choose a
student uniformly at random, then the probability that the student
earned \(\geq 90\) points is \(\leq 7/9\).) Similarly, at most \((7/6)\)ths
of the class earned \(\geq 60\) points, but this is a trivial statement.
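As a quick numeric illustration (our own check, not from the text), for the three-coin random variable \(X\) from the running example, Markov’s inequality with \(a = 2\) gives \(\Pr[X \geq 2] \leq (3/2)/2 = 3/4\), whereas the true probability is \(1/2\):

from itertools import product

omega = [''.join(s) for s in product('HT', repeat=3)]
X = lambda w: w.count('H')                               # number of heads

expectation = sum(X(w) * (1 / 8) for w in omega)         # 1.5
markov_bound = expectation / 2                           # 0.75
true_prob = sum(1 / 8 for w in omega if X(w) >= 2)       # 0.5
print(true_prob <= markov_bound)                         # True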
Proof 228
By definition of expectation,
\[\begin{split}\Ex[X]
&= \sum_{v \in X(\Omega)} v \cdot \Pr[X = v] \\
&= \parens[\Big]{\sum_{v < a} v \cdot \Pr[X = v]} +
\parens[\Big]{\sum_{v \geq a} v \cdot \Pr[X = v]} \\
&\geq \sum_{v \geq a} v \cdot \Pr[X = v] \\
&\geq \sum_{v \geq a} a \cdot \Pr[X = v] \\
&= a \cdot \Pr[X \geq a]
\; \text.\end{split}\]
The first inequality holds because \(X\) is non-negative, so every
\(v\) in the first sum is non-negative, and therefore that sum is
non-negative. The second inequality holds because \(v \geq a\), and
all the \(\Pr[X=v]\) are non-negative. The last equality holds
because the event \(X \geq a\) is the union of all the disjoint
events \(X \geq v\) for \(v \geq a\).
Finally, dividing both sides of the inequality by \(a > 0\) (which
preserves the direction of the inequality), we get that
\[\frac{\Ex[X]}{a} \geq \Pr[X \geq a] \; \text.\]
The following ‘reverse’ version of Markov’s inequality says,
informally, that if a random variable is upper bounded, then it has a
certain positive probability of exceeding any value less than its
expectation. (Note that by the averaging argument, there is some
nonzero probability that the random variable is at least its
expectation. The reverse Markov inequality gives a specific
probability bound, which increases as the threshold of interest
decreases.)
Theorem 229(Reverse Markov’s Inequality)
Let \(X\) be a random variable that is never larger than some value
\(b\). Then for any \(a < b\),
\[\Pr[X \geq a] \geq \Pr[X > a] \geq \frac{\Ex[X] - a}{b - a} \; \text.\]
Notice that the statement is trivial (or “useless”) for \(a \geq
\Ex[X]\), because in this case it gives a probability lower bound that
is \(\leq 0\), and the probability of any event is \(\geq 0\).
For example, if the class average on an exam is \(70\) points out of a
maximum of \(b=100\), then the ‘reverse’ Markov inequality implies that
at least \((70-60)/(100-60) = 1/4\) of the class earned more than \(a=60\)
points. However, it does not say anything about how much of the
class earned \(70\) or more points.
Proof 230
The first inequality holds because the event \(X > a\) is a subset of
the event \(X \geq a\). So, it remains to prove the second
inequality.
Define the random variable \(Y = b - X\). This is non-negative
because \(X\) is never larger than \(b\), so we can apply Markov’s
inequality to it, and it has expectation \(\Ex[Y] = b - \Ex[X]\).
Hence,
\[\begin{split}\Pr[X > a]
&= \Pr[Y < b - a]
= 1 - \Pr[Y \geq b - a] \\
&\geq 1 - \frac{\Ex[Y]}{b - a}
= 1 - \frac{b - \Ex[X]}{b - a}
= \frac{\Ex[X] - a}{b - a} \; \text.\end{split}\]
Often, we can express a random variable \(X\) as a linear combination of
some other, typically “simpler,” random variables \(X_i\). We can then
apply linearity of expectation to compute the expectation of \(X\)
from the expectations of the \(X_i\). Importantly, the \(X_i\) do not
need to be independent in order for linearity of expectation to
apply.
Theorem 231(Linearity of Expectation)
Let \(X_1, X_2, \ldots\) be any countably many random variables, and
\(c_1, c_2, \ldots \in \R\). Then
\[\Ex\brackets[\Big]{\sum_i c_i \cdot X_i} = \sum_i c_i \cdot \Ex[X_i] \; \text.\]
In other words, constant-factor scalings and summations can be
“moved in and out of” expectations.
Proof 232
We prove the case for a linear combination of two variables; the
general case then follows by induction. From the alternative
formulation of expectation,
\[\begin{split}\Ex[c_1 \cdot X_1 + c_2 \cdot X_2]
&= \sum_{\omega \in \Omega} (c_1 \cdot X_1(\omega) + c_2 \cdot X_2(\omega)) \cdot \Pr[\omega] \\
&= c_1 \sum_{\omega \in \Omega} X_1(\omega) \cdot \Pr[\omega] +
c_2 \sum_{\omega \in \Omega} X_2(\omega) \cdot \Pr[\omega] \\
&= c_1 \cdot \Ex[X_1] + c_2 \cdot \Ex[X_2] \; \text.\end{split}\]
Example 233(Three Fair Coins: Linearity of Expectation)
Continuing from Example 217, recall
that the random variable \(X\) is the number of heads that come up.
In Example 224, we showed
directly via the definition that \(\Ex[X] = 3/2\). What follows is an
alternative, simpler way—especially when generalizing to more
coin tosses—to obtain the same result using linearity of
expectation.
Recall that \(X\) is the sum of the three indicator random variables
\(X_i\) that indicate whether the \(i\)th toss comes up heads: \(X =
X_1 + X_2 + X_3\). Recall from Example 217 that \(\Ex[X_i] = \Pr[X_i = 1] = 1/2\) for every
\(i\). So, by linearity of expectation (Theorem 231),
\[\Ex[X] = \Ex[X_1] + \Ex[X_2] + \Ex[X_3] = 3 \cdot \frac{1}{2} = \frac{3}{2} \; \text.\]
Similarly, in Example 218, we
defined the random variable \(Y\) to be the number of pairs of
tosses that come up the same, and showed directly from the
definition that \(\Ex[Y] = 3/2\).
We observed that \(Y\) is the sum of the three indicator random
variables \(Y_i\) that indicate whether the corresponding pair of
tosses come up the same: \(Y = Y_1 + Y_2 + Y_3\). Even though these
indicator random variables are not mutually independent, we can
still apply linearity of expectation to their sum:
\[\Ex[Y] = \Ex[Y_1 + Y_2 + Y_3] = 3/2 \; \text.\]
Example 234(Biased Coins)
Suppose we repeatedly flip a biased coin that has probability \(p
\in [0,1]\) of coming up heads. We can determine the expected number
of heads in \(n\) flips, using indicator random variables and
linearity of expectation.
For \(i=1,\ldots, n\), define \(X_i\) to be the indicator random
variable indicating whether the \(i\)th flip comes up heads. Then
\(\Ex[X_i] = \Pr[X_i = 1] = p\). Define \(X\) to be the total number of
heads that come up in \(n\) flips, and observe that \(X = \sum_{i=1}^n
X_i\). So, by linearity of expectation,
\[\Ex[X]
= \Ex\brackets[\Big]{\sum_{i=1}^n X_i}
= \sum_{i=1}^n \Ex[X_i]
= \sum_{i=1}^n p
= p \cdot n \; \text.\]
By contrast, deriving this expectation directly via the definitions
(of expectation, and of the random variable \(X\)) would be rather
complicated: we would need to determine \(\Pr[X=v]\) for each
\(v=0,1,\ldots,n\), then analyze \(\sum_{i=1}^n v \cdot \Pr[X=v]\).
Linearity of expectation makes the derivation almost trivial.
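A small simulation sketch (ours; the parameters are illustrative) that checks this conclusion: the empirical average number of heads over many trials of \(n\) flips of a \(p\)-biased coin should be close to \(p \cdot n\).

import random

def heads_in_n_flips(n, p):
    # One experiment: flip a p-biased coin n times and count the heads.
    return sum(1 for _ in range(n) if random.random() < p)

n, p, trials = 20, 0.3, 100_000
average = sum(heads_in_n_flips(n, p) for _ in range(trials)) / trials
print(average, p * n)  # the two values should be close, e.g., about 6.0 and 6.0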
Now let’s return to the rock-paper-scissors example, modeling one
round of the game as a probability space, and analyzing the
appropriate events. The sample space—i.e., the set of outcomes that
can potentially occur—is the set \(\Omega = \set{R,P,S}^2 = \set{RR,
RP, RS, PR, PP, PS, SR, SP, SS}\) of possible move pairs made by the
two players, where \(R,P,S\) respectively denote the moves rock, paper,
and scissors, and the \(i\)th character represents player \(i\)’s move,
for \(i=1,2\).
Next, let \(R_i\) be the event that player \(i\) plays rock, i.e., \(R_1 =
\set{RR, RP, RS}\) and \(R_2 = \set{RR, PR, SR}\), and similarly for
\(P_i\) and \(S_i\).
We need to assign probabilities to the outcomes. Player \(i\)’s strategy
can be seen as choosing a move at random according to some specified
probabilities of events \(R_i, P_i, S_i\), which must sum to \(1\) because
they partition the sample space. (This is true even for deterministic
strategies, in which case the chosen move has probability \(1\).) Since
the players move simultaneously, we model their choices as
independent. So, letting \(M_i\) be any one of \(R_i,P_i,S_i\) for
\(i=1,2\), we see that the event \(M_1 \wedge M_2\) consists of a single
outcome, and we have that \(\Pr[M_1 \wedge M_2] = \Pr[M_1] \cdot \Pr[M_2]\).
Now, let \(W_1 = \set{RS, PR, SP}\) be the event that player \(1\) wins.
Suppose that player \(1\) plays uniformly at random, i.e., plays R,P,S
each with equal probability, i.e., \(\Pr[R_1]=\Pr[P_1]=\Pr[S_1]=1/3\).
Then no matter how player \(2\) plays, the probability that player \(1\)
wins is
\[\Pr[W_1]
= \Pr[R_1] \cdot \Pr[S_2] + \Pr[P_1] \cdot \Pr[R_2] + \Pr[S_1] \cdot \Pr[P_2]
= \frac{\Pr[S_2] + \Pr[R_2] + \Pr[P_2]}{3}
= \frac{1}{3} \; \text.\]
The first equation uses independence, and the final equation uses the
fact that the events \(S_2,R_2,P_2\) partition the sample space, so
their probabilities must sum to \(1\). The conclusion is that playing
uniformly at random yields a \(1/3\) probability of winning, no matter
what strategy the opponent uses. This is a significant improvement
over any deterministic strategy, against which an opponent can win
with certainty. [3] A similar analysis shows that the probabilities
of tying and of losing are both \(1/3\) as well.
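To illustrate, here is a Python simulation sketch (ours, not part of the formal analysis) in which player 1 plays uniformly at random against an arbitrary fixed opponent strategy; the empirical win, tie, and loss frequencies each approach \(1/3\).

import random

BEATS = {'R': 'S', 'P': 'R', 'S': 'P'}   # each move and the move it beats

def play_round(opponent_move):
    # Player 1 chooses uniformly at random, regardless of the opponent.
    my_move = random.choice('RPS')
    if my_move == opponent_move:
        return 'tie'
    return 'win' if BEATS[my_move] == opponent_move else 'loss'

# The opponent may use any strategy; here it happens to always play paper.
results = [play_round('P') for _ in range(90_000)]
for outcome in ('win', 'tie', 'loss'):
    print(outcome, results.count(outcome) / len(results))  # each is about 1/3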
The rock-paper-scissors example is illustrative of a more general
notion of considering algorithms as games. In this perspective, we
view one “player” as choosing an algorithm for a computational
problem. A second player known as the adversary then chooses an
input (of a certain size) to the algorithm. The adversary aims to
maximize the number of steps taken by the algorithm on the given
input; i.e., to induce the algorithm’s worst-case runtime behavior.
If the first player chooses a deterministic algorithm, then the
adversary can find an input that maximizes the number of steps—even
if the algorithm happens to be much faster on “most” other inputs. On
the other hand, if the first player chooses a randomized
algorithm—i.e., one that makes some random choices—then the
adversary might not be able to induce worst-case behavior, because it
cannot predict the algorithm’s random choices. In a good randomized
algorithm, the random choices will tend to avoid the worst-case
behavior no matter what the input is, although there is still some
(ideally very small) probability that it will make “unlucky” choices
that lead to poor performance.
We can use randomness to design simple algorithms that produce an
\(\alpha\)-approximation in expectation, meaning
that the expected value of the output is within an \(\alpha\) factor of
the value of an optimal solution.
As an example, recall the search version of the \(\TSAT\) problem: given
a 3CNF Boolean formula, find a satisfying assignment
for it, if one exists. Since a 3CNF formula is the AND of several
clauses, this problem is asking for an assignment that satisfies all
of the clauses simultaneously.
Because \(\TSAT\) is \(\NP\)-hard, there is no efficient algorithm for
this problem unless \(\P=\NP\). So, let us relax the goal, and seek an
assignment that satisfies as many clauses as we can manage. In other
words, we seek an approximation algorithm for the problem of
maximizing the number of satisfied clauses.
In what follows, we actually restrict the input to be what we call an
“exact”-3CNF formula. This is a 3CNF formula with the added
restriction that each clause involves three distinct variables.
(Different clauses can involve the same variable, however.) For
example, the clause
\[(x \vee \neg y \vee \neg z)\]
is allowed, but the clause
\[(x \vee \neg y \vee \neg x)\]
is not, since it involves only two distinct variables (the literals
\(x\) and \(\neg x\) involve the same variable). Similarly, the clause \((y
\vee y \vee \neg z)\) is not allowed.
The problem of finding an assignment that maximizes the number of
satisfied clauses in a given exact-3CNF formula is known as \(\MSAT\),
and its decision version is \(\NP\)-complete. Despite this, there is a
very simple randomized algorithm that obtains a \(7/8\)-approximation,
in expectation. The algorithm is as follows: assign random truth
values (true or false) to the formula’s variables, uniformly and
independently. That is, for each variable, assign it to be true or
false each with \(1/2\) probability, independently of all the others.
Theorem 235
Let \(\phi\) be an exact-3CNF formula with \(m\) clauses. Then
assigning its variables with uniformly random and independent truth
values satisfies \((7/8)\)ths of \(\phi\)’s clauses in expectation, i.e.,
\[\Ex[\text{number of satisfied clauses in $\phi$}] = 7m/8 \;
\text.\]
In particular, this is a \(7/8\)-approximation algorithm for \(\MSAT\),
in expectation.
The proof of this theorem proceeds by a powerful and widely applicable
strategy. First, we define a suitable probability space and a random
variable \(X\) that captures the quantity of interest, namely, the
number of satisfied clauses. This variable \(X\) typically has a
complicated probability distribution that makes it hard to analyze its
expectation directly. Instead, we express \(X\) as the sum of several
simpler indicator random variables \(X_i\). By linearity of
expectation, \(\Ex[X]\) is the sum of the individual expectations
\(\Ex[X_i] = \Pr[X_i=1]\), where the equality holds because the \(X_i\)
are indicators. Finally, we analyze the probabilities of the
individual events \(X_i=1\), which are simple to understand, and arrive
at the conclusion.
Proof 236
First, suppose that the expected number of satisfied clauses is
indeed \(7m/8\) (which we prove below). The optimal number of clauses
that can be simultaneously satisfied is \(\OPT \leq m\), so the
expected number of satisfied clauses is \(7m/8 \geq (7/8) \cdot
\OPT\), hence this is indeed a \(7/8\)-approximation algorithm for
\(\MSAT\).
We now analyze the expected number of satisfied clauses. The
probability space is the set of all assignments to the variables of
\(\phi\), where each assignment is equally likely—i.e., the
assignment is chosen uniformly at random.
Now, define \(X\) to be the random variable for the number of satisfied
clauses in \(\phi\). That is, for any particular assignment \(\alpha\),
we define \(X(\alpha)\) to be the number of clauses satisfied by
\(\alpha\). Because the clauses of \(\phi\) can share variables in
various ways, the probability distribution and expectation of \(X\)
can be quite complicated and difficult to analyze directly.
Instead, we decompose \(X\) into a sum of much simpler indicator
variables.
Specifically, for each \(i=1,\ldots,m\) we define an indicator random
variable \(X_i\) that indicates whether the \(i\)th clause is
satisfied, i.e., \(X_i = 1\) if so and \(X_i = 0\) if not. Then because
\(X\) is the total number of satisfied clauses, we have that
\[X = \sum_{i=1}^m X_i
\quad \text{and hence, by linearity of expectation,} \quad
\Ex[X] = \sum_{i=1}^m \Ex[X_i] \; \text.\]
It is important to realize that because the clauses of \(\phi\) can
share variables in various ways, the indicators \(X_i\) are typically
not independent, even pairwise. For example, because clauses \((x
\vee y \vee \neg z)\) and \((u \vee \neg v \vee x)\) share the
variable \(x\), the former clause being satisfied makes it more
likely that the latter clause is also satisfied (i.e., their
indicators are positively correlated). Fortunately, linearity of
expectation holds even for dependent variables, so the above
equation is still valid.
All that remains is to analyze \(\Ex[X_i] = \Pr[X_i = 1]\) (see
Lemma 222), i.e., the
probability that a uniformly random assignment satisfies the \(i\)th
clause. Every clause in the exact-3CNF formula \(\phi\) is the OR of
exactly three literals involving distinct variables, e.g., \((x \vee
\neg y \vee z)\). The clause fails to be satisfied exactly when all
three of its literals are false. Because the variables are
assigned uniformly at random, each literal has probability \(1/2\) of
being false (regardless of whether the literal is a variable or its
negation). And because the literals in clause \(i\) involve distinct
variables, their values are mutually independent. Thus,
\[\begin{split}\Pr[X_i = 0]
&= \Pr[\text{clause $i$ is unsatisfied}] \\
&= \Pr[\text{the 1st and 2nd and 3rd literals of clause $i$ are all false}] \\
&= \prod_{j=1}^3 \Pr[\text{the $j$th literal of clause $i$ is
false}] \\
&= (1/2)^3 = 1/8 \; \text.\end{split}\]
Next, because \(X_i\) is an indicator random variable (it is limited
to values \(0\) and \(1\)),
\[\Ex[X_i] = \Pr[X_i = 1] = 1 - \Pr[X_i = 0] = \frac{7}{8} \; \text.\]
Therefore, by linearity of expectation, \(\Ex[X] = \sum_{i=1}^m
\Ex[X_i] = 7m/8\), as claimed.
By the averaging argument (Lemma 225), from
Theorem 235 we can further conclude that
for any exact-3CNF formula, there exists an assignment that
satisfies at least \(7/8\)ths of the clauses. Notice that this is a
“deterministic” statement—it makes no reference to probability—but
we proved it using the tools of probability, and it is not obvious how
we could hope to do so without them! (This is an example of the
“probabilistic method,” which is a very powerful technique with many
applications.)
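For concreteness, the following Python sketch (ours) implements the random-assignment algorithm; the clause encoding (each clause as a tuple of three nonzero integers, where a negative integer denotes a negated variable) is our own convention, not fixed by the text.

import random

def random_assignment_satisfied(clauses, num_vars):
    # Assign each variable True/False uniformly and independently.
    assignment = {v: random.random() < 0.5 for v in range(1, num_vars + 1)}

    def literal_true(lit):
        value = assignment[abs(lit)]
        return value if lit > 0 else not value

    # Count the clauses that have at least one true literal.
    return sum(1 for clause in clauses if any(literal_true(l) for l in clause))

# Example exact-3CNF formula: (x1 or not x2 or x3) and (not x1 or x2 or x4).
clauses = [(1, -2, 3), (-1, 2, 4)]
trials = 100_000
avg = sum(random_assignment_satisfied(clauses, 4) for _ in range(trials)) / trials
print(avg)  # close to (7/8) * 2 = 1.75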
Theorem 235 says that if we assign
variables at random, then on average, \(7/8\)ths of the clauses are
satisfied. However, this does not say anything about how likely it
is that a certain fraction of clauses are satisfied. For example, we
might want a lower bound on the probability that at least half (or
two-thirds, or three-quarters, or 90%) of the clauses are satisfied.
Written in terms of the random variable \(X\) representing the number of
satisfied clauses, we might hope to get a lower bound on \(\Pr[X \geq c
\cdot m]\) for \(c=1/2\), or various other \(c \in [0,1]\).
Notice that because \(X\) is defined as the number of satisfied clauses,
it is non-negative (i.e., it never takes on a negative value). Thus,
we can apply Markov’s inequality (Theorem 227)
to obtain that
\[\Pr[X \geq m/2] \leq \frac{\Ex[X]}{m/2} = \frac{7m/8}{m/2} = \frac{7}{4} \; \text.\]
Unfortunately, this is a trivial (or “useless”) statement, because the
probability bound is \(7/4 > 1\), and the probability of any event is
\(\leq 1\). Furthermore, the probability bound goes in the “wrong
direction” from what we want: it says that the probability of
satisfying at least half the clauses is at most some value, whereas
we hope to show that it is at least some positive value. An upper
bound is not useful to us, because it is consistent with the actual
probability being zero. So, even if we replaced \(c=1/2\) with, say, \(c
= 9/10\) to get a nontrivial bound, the statement would not be of the
kind we want. We need to use a different tool.
Observe that the number of satisfied clauses \(X\) cannot exceed \(m\),
the total number of clauses in the given formula. So, we can apply the
‘reverse’ Markov inequality (Theorem 229) to it, with \(b=m\). Setting \(a = m/2 < b\), we get
that
\[\Pr[X > m/2] \geq \frac{\Ex[X] - m/2}{m - m/2} = \frac{7m/8 - m/2}{m/2} = \frac{3}{4} \; \text.\]
That is, a random assignment satisfies more than half the clauses with
at least \(75\) percent probability. More generally, for any \(c < 7/8\),
setting \(a = c \cdot m < b\),
\[\Pr[X > c \cdot m] \geq \frac{7m/8 - c \cdot m}{m - c \cdot m} =
\frac{7/8-c}{1-c} > 0 \; \text.\]
So, for any \(c < 7/8\) there is a positive probability of satisfying
more than a \(c\) fraction of the clauses. However, the ‘reverse’ Markov
inequality does not tell us anything useful about the probability of
satisfying \(7/8\)ths or more of the clauses.
Exercise 237
Recall the max-cut problem. The following is a
randomized algorithm that outputs a random cut in a given
undirected graph (which has no self-loop edge, without loss of
generality):
\begin{algorithmic}
\FUNCTION{RandomCut}{$G=(V,E)$}
\STATE $S = \emptyset$
\FORALL{$v \in V$}
\STATE add $v$ to $S$ with probability $1/2$
\ENDFOR
\STATE \RETURN $S$
\ENDFUNCTION
\end{algorithmic}
Recall that \(C(S)\) is the set of crossing edges for the cut \(S\),
i.e., those that have exactly one endpoint in \(S\). The size of the
cut is \(\abs{C(S)}\).
Prove that in expectation, this algorithm outputs a
1/2-approximation of a maximum cut in the given undirected
graph. That is, the expected size of the returned cut is at
least half the size of a maximum cut.
Hint: Use linearity of expectation to show that in
expectation, half of the edges of the graph are in \(C(S)\).
Similarly, give an upper bound on the size of a maximum cut
(also in terms of the number of edges in the graph).
Conclude that, for any undirected graph \(G=(V,E)\), there exists
a cut of size at least \(\abs{E}/2\). (By contrast, in
Theorem 203 we saw an
efficient deterministic algorithm that finds such a cut, with
certainty.)
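For concreteness, a minimal Python rendering of RandomCut (our translation of the pseudocode above; the representation of the graph as a vertex list and an edge list is an assumption):

import random

def random_cut(vertices, edges):
    # Put each vertex into S independently with probability 1/2.
    S = {v for v in vertices if random.random() < 0.5}
    # C(S): edges with exactly one endpoint in S.
    crossing = [(u, v) for (u, v) in edges if (u in S) != (v in S)]
    return S, crossing

# Example: a 4-cycle; in expectation, half of its 4 edges cross the cut.
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (3, 4), (4, 1)]
S, C = random_cut(V, E)
print(S, len(C))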
Another example of an algorithm that takes advantage of randomness is
quick sort. This is a recursive algorithm that sorts an array as
follows:
First, it chooses some element of the array as the pivot.
Then, it partitions the array around the pivot, by comparing the
pivot to every other element, placing all the smaller elements to
the “left” of the pivot—not necessarily in sorted order—and all
the larger ones to the “right” of the pivot. Without loss of
generality, we assume that the array elements are distinct, by
breaking ties according to the elements’ positions in the array.
Finally, it recursively sorts the two subarrays to the left and
right of the pivot, which results in the entire array being sorted.
The following is an illustration of an execution of quick sort, with
the recursive work elided:
The running time of quick sort is dominated by the number of
comparisons between pairs of elements. Once a pivot is selected,
partitioning an array of \(n\) elements compares every other element to
the pivot. This is a total of \(n-1 = \Theta(n)\) comparisons of
non-recursive work, which is followed by the recursive calls on the
two subarrays on each side of the pivot. If selecting the pivot takes
\(O(n)\) time and can guarantee that the two subarrays are “balanced,”
i.e., approximately equal in size, then the recurrence for the total
number of comparisons (and hence the overall running time) is
\[T(n) = 2T(n/2) + \Theta(n) \; \text.\]
This is the same recurrence as for merge sort, and
it has the closed-form solution \(T(n) = \Theta(n \log n)\). However,
quick sort tends to be faster in practice than merge sort and other
deterministic \(\O(n \log n)\) sorting algorithms, in large part because
it is fast and easy to implement in place—i.e., reordering the
array within its own memory space, using only a small amount of
auxiliary memory. (By comparison, the standard merge subroutine from
merge sort makes a full copy of the array with each call.) Thus,
variants of quick sort are widely used in practice.
One way to guarantee a balanced partition is to use the median
element as the pivot, which can be found in linear time. However, this
subroutine is rather complicated to implement, and its linear runtime
has a fairly large hidden constant factor.
A much simpler and more practical solution is to choose a uniformly
random array element as the pivot. The pseudocode is as follows.
Algorithm 238(Randomized Quick Sort)
\begin{algorithmic}
\FUNCTION{RandQuickSort}{$A[1,\ldots,n]$}
\IF{$n=1$} \RETURN $A$ \ENDIF
\STATE $p = \text{ an index chosen uniformly at random from }
\set{1,\ldots,n}$
\STATE $(L,R) =$ \CALL{Partition}{$A,p$}
\STATE \RETURN \CALL{RandQuickSort}{$L$} $+ A[p] +$
\CALL{RandQuickSort}{$R$}
\ENDFUNCTION
\FUNCTION{Partition}{$A[1,\ldots,n],p$}
\STATE initialize empty arrays $L,R$
\FOR{$i=1$ to $n$}
\IF{$i \neq p$ and $A[i] < A[p]$}
\STATE $L = L + A[i]$
\ELIF{$i \neq p$}
\STATE $R = R + A[i]$
\ENDIF
\ENDFOR
\STATE \RETURN $(L,R)$
\ENDFUNCTION
\end{algorithmic}
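Here is a direct Python rendering of Algorithm 238 (ours; it returns a new sorted list rather than sorting in place, mirroring the pseudocode above rather than the in-place variants typically used in practice).

import random

def rand_quick_sort(A):
    if len(A) <= 1:
        return list(A)
    p = random.randrange(len(A))            # pivot index, chosen uniformly at random
    pivot = A[p]
    L = [x for i, x in enumerate(A) if i != p and x < pivot]
    R = [x for i, x in enumerate(A) if i != p and x >= pivot]
    return rand_quick_sort(L) + [pivot] + rand_quick_sort(R)

print(rand_quick_sort([5, 2, 8, 1, 9, 3]))  # [1, 2, 3, 5, 8, 9]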
The following figure illustrates a possible execution of
RandQuickSort on an example input array, with the random
choices of pivot elements highlighted, and the rank of each element
displayed below it. The rank of an element is the number of elements
in the array (including itself) that are less than or equal to it. So,
the smallest element has rank \(1\), the next-smallest has rank \(2\),
etc.
The running time of RandQuickSort is a random variable that
depends in a complicated way on the particular random choices of
pivots made throughout the algorithm. Roughly speaking, choices of
pivots that yield “unbalanced” subarrays tend to yield larger running
times than ones that yield more balanced subarrays. How can we hope to
analyze this very complicated probability experiment? As we did above,
we express the running time as a sum of much simpler indicator
variables, apply linearity of expectation, and then analyze these
indicators individually. The rest of this section is devoted to
proving the following theorem.
Theorem 239
On any array of \(n\) elements, the expected running time of
RandQuickSort is \(\O(n \log n)\).
Recall that the running time of RandQuickSort is dominated by
the number of comparisons of pairs of array elements. First observe
that if a certain pair of elements is ever compared, then it is never
compared again. This is because at the time of comparison, one of the
elements is the pivot, which is never compared with anything after
partitioning around it is complete. So, the total number of
comparisons done by RandQuickSort is the total number of
distinct pairs of elements that are ever compared.
Accordingly, define the random variable \(X\) to be the total number of
comparisons during the entire execution of quick sort on an arbitrary
array of \(n\) elements. In addition, for any ranks \(1 \leq i < j \leq
n\), define \(X_{ij}\) to be the indicator random variable that indicates
whether the pair of elements having ranks \(i,j\) are ever compared
during the execution, i.e., \(X_{ij}=1\) if this pair is ever compared,
and \(X_{ij}=0\) otherwise. By the observations from the previous
paragraph,
\[X = \sum_{1 \leq i < j \leq n} X_{ij}
\quad \text{and hence, by linearity of expectation,} \quad
\Ex[X] = \sum_{1 \leq i < j \leq n} \Ex[X_{ij}] \; \text.\]
So, for every \(i < j\) it just remains to analyze the individual
expectations \(\Ex[X_{ij}] = \Pr[X_{ij}=1]\) in isolation (where the
equality holds by Lemma 222),
and sum them. (As above, the random variables \(X_{ij}\) are highly
correlated in difficult-to-understand ways, but this does not matter
for linearity of expectation.)
Under what conditions are the elements of ranks \(i < j\) ever compared,
and what is the probability \(\Pr[X_{ij}=1]\) that this occurs? By
inspection of Algorithm 238, we
see that elements are compared only in the Partition
subroutine. In particular, the only comparisons made during a call to
Partition are between the selected pivot element and all the
other elements in the input subarray. This has two important
implications:
When a subarray is input to Partition, none of its elements
have been compared with each other yet.
In the subarray input to Partition, if one element is less
than the pivot and another is greater than the pivot, then these two
elements are placed on opposite sides of the pivot, and thus are
never compared with each other (because the recursive calls operate
completely separately on the two sides).
Therefore, the two elements \(a < b\) having ranks \(i < j\) are compared
if and only if the first pivot chosen from the range \([a,b]\) is
either \(a\) or \(b\). To elaborate: for as long as pivots are chosen from
outside this range, \(a\) and \(b\) remain in the same subarray, and have
not yet been compared. Then, once an element from this range is chosen
as a pivot:
it is either \(a\) or \(b\), in which case \(a\) and \(b\) are compared with
each other, or
it is an element strictly between \(a\) and \(b\), in which case \(a\) and
\(b\) are placed in different subarrays and never compared with each
other.
Therefore, \(\Ex[X_{ij}]\) is the probability that the first pivot whose
rank is between \(i\) and \(j\) (inclusive) has rank either \(i\) or \(j\).
There are \(j-i+1\) elements in this range. Each pivot is chosen
uniformly at random from the subarray input to Partition, and
as argued above, when the first pivot from this range is selected, the
entire range is part of the subarray. So, given that the pivot is in
the range, it is uniformly random over the range, i.e., it is
equally likely to be any of the \(j-i+1\) elements in the range. (This
statement can be formalized using conditional probability.) Because
exactly two of these possible choices make \(X_{ij}=1\), we have that
\[\Ex[X_{ij}] = \Pr[X_{ij} = 1] = \frac{2}{j-i+1} \; \text.\]
Finally, summing over all pairs of ranks and applying linearity of
expectation,
\[\Ex[X]
= \sum_{1 \leq i < j \leq n} \frac{2}{j-i+1}
\leq \sum_{i=1}^{n} \sum_{k=2}^{n} \frac{2}{k}
= \O(n \log n) \; \text,\]
because each inner sum is \(\O(\log n)\) by the standard bound on the
harmonic series. This proves Theorem 239.
Although RandQuickSort runs in expected time \(O(n \log
n)\) on an array of \(n\) elements, it can potentially run in time
\(\Omega(n^2)\), due to an unlucky choice of random pivots.
Prove that for any constant \(c > 0\),
\[\lim_{n \to \infty} \Pr[\algo{RandQuickSort} \text{ on an
$n$-element array takes $\geq c n \log^2 n$ steps}] = 0 \; \text.\]
That is, as the array size grows large, randomized quick sort is
very unlikely to run in even \(\Omega(n \log^2 n)\) time.
Hint: This follows immediately from Theorem 239 and Markov’s inequality
(Theorem 227), without needing to understand
any of the analysis of RandQuickSort.
Randomness can also be utilized to construct simple data structures
that provide excellent expected performance. As an example, we
consider a set abstract data type, which is a container that supports
inserting, deleting, and searching for elements of some mutually
comparable type. There are many ways to implement a set abstraction,
using a variety of underlying data structures including linked lists,
arrays, and binary search trees. However, simple, deterministic
implementations suffer a linear runtime in the worst case for some
operations. For example, if an adversary inserts the items \(1, 2,
\dots, n\) in increasing order into a non-balancing binary search tree, the \(i\)th
insertion takes \(\Theta(i)\) time, for a total of \(\Theta(n^2)\) time. The usual way to
avoid this is to perform complicated rebalancing operations, and there
are a variety of schemes to do so (e.g. AVL trees, red-black trees, scapegoat
trees, and so on).
Can we use randomness instead to design a simple data structure that
provides expected \(\O(\log n)\) runtime for all operations on a set of
size \(n\)?
One solution is a skip list, which is essentially a linked list with
multiple levels, where each level contains about half the elements of
the level below (ratios other than \(\frac{1}{2}\) may be used instead,
trading off the space and time overheads). Whether or not an element
is duplicated to the next level is a random decision, which prevents
an adversary from coming up with a sequence of operations that
degrades performance. The following is an illustration of a skip list:
In the scheme illustrated above, each level has a sentinel head and is
terminated by a null pointer. An element may appear at multiple
levels, and the list may be traversed by following links rightward or
by moving downward at a duplicated element.
To search for an item, we start at the sentinel node at the top left.
We maintain the invariant that we are always located at a node whose
value is less than the item we are looking for (or is a sentinel head
node). Then in each step, we check the node to the right. If its value
is greater than the item we are looking for, or if the right pointer
is null, we move downward; however, if we are on the lowest level, we
cannot move downward, so the item is not in the list. If the value of
the node to the right is less than the item, we move to the right. If
the value is equal to the item, we have found the item and terminate
the search. The following illustrates a search for the item 0 in the
list above:
The search terminates in failure, since we reach the value -3 in the
bottom level with the value 1 to the right. The following illustrates
searching for the item 9:
To insert an item, we first do a search for the item. If the item is
in the list, then no further work is necessary. If it is not, the
search terminates in the bottom level at the maximal element that is
less than the item. Therefore, we insert the item immediately to the
right. We then make a random decision of whether or not to duplicate
the element to the next level. If the answer is yes, we make another
random decision for the next level, and so on until the answer is no.
If the probability of duplicating an item is \(\frac{1}{2}\), the
expected number of copies of an item is \(2\). Thus, insertion only
takes an expected constant amount of time in addition to the search.
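The following Python sketch (ours, and simplified: deletion is omitted, and an element promoted past the current top level gains only one extra level) mirrors the search and insertion procedures described above, including the random duplication decisions.

import random

class Node:
    def __init__(self, value, right=None, down=None):
        self.value, self.right, self.down = value, right, down

class SkipList:
    def __init__(self):
        self.head = Node(None)              # sentinel head of the (single) initial level

    def search(self, item):
        node = self.head
        while node is not None:
            nxt = node.right
            if nxt is not None and nxt.value == item:
                return True
            if nxt is not None and nxt.value < item:
                node = nxt                  # keep moving right on this level
            else:
                node = node.down            # next value too large (or absent): drop down
        return False

    def insert(self, item):
        # Walk top-down, recording the rightmost node < item on each level.
        path, node = [], self.head
        while node is not None:
            while node.right is not None and node.right.value < item:
                node = node.right
            path.append(node)
            node = node.down
        if path[-1].right is not None and path[-1].right.value == item:
            return                          # item is already present
        # Insert on the bottom level, then duplicate upward with probability 1/2.
        down = None
        while path:
            pred = path.pop()
            pred.right = Node(item, pred.right, down)
            down = pred.right
            if random.random() >= 0.5:      # stop duplicating to higher levels
                return
        # Promoted past the current top level: add one new level holding the item.
        self.head = Node(None, Node(item, None, down), self.head)

skip = SkipList()
for x in [3, 1, 4, 5, 9, 2, 6]:
    skip.insert(x)
print(skip.search(5), skip.search(7))       # True False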
We proceed to determine the expected time of a search on a skip list.
We start by determining the expected number of elements on each level
\(i\). Suppose the skip list contains \(n\) elements total. Let \(n_i\) be
the number of elements on level \(i\). We have \(n_1 = n\), since the
first level contains all elements. For \(i > 1\), define \(X_{ij}\) to
be an indicator random variable for whether or not the \(j\)th element
is replicated to level \(i\). We have
\[\Pr[X_{ij} = 1] = 1/2^{i-1}\]
since it takes \(i-1\) “yes” decisions for the element to make it to
level \(i\), and each decision is made independently with probability
\(1/2\). Thus, \(\Ex[X_{ij}] = \Pr[X_{ij} = 1] = 1/2^{i-1}\).
The total number of elements \(n_i\) on level \(i\) is just the sum of the
indicators \(X_{ij}\). This gives us
\[\Ex[n_i] = \sum_{j=1}^{n} \Ex[X_{ij}] = \frac{n}{2^{i-1}} \; \text.\]
Next, we consider how many levels we expect the list to contain.
Suppose we only maintain levels that have at least one element. From
our previous result, the expected number of elements is 1 for level
\(i\) when
\[\begin{split}\frac{n}{2^{i-1}} &= 1\\
n &= 2^{i-1}\\
\log n &= i - 1\\
i &= \log n + 1\end{split}\]
Intuitively, this implies we should expect somewhere around \(\log n\)
levels. To reason about the expected number more formally, consider
the number of elements on some level \(i = \log n + k\). By Markov’s
inequality, we have
\[\begin{split}\lrPr{n_{\log n + k} \ge 1} &\le \lrEx{n_{\log n + k}}/1\\
&= \frac{n}{2^{\log n + k - 1}}\\
&= \frac{n}{n2^{k-1}}\\
&= \frac{1}{2^{k-1}}\end{split}\]
Then let \(Z_k\) be an indicator for whether \(n_{\log n + k} \ge 1\). The
number of levels \(L\) is at most
\[\begin{split}\Ex[L] &\le \lrEx{\log n + \sum_{k=1}^\infty Z_k}\\
&= \log n + \sum_{k=1}^\infty \Ex[Z_k]\\
&\le \log n + \sum_{k=1}^\infty \frac{1}{2^{k-1}}\\
&= \log n + 2\end{split}\]
Thus, the expected number of levels is \(\O(\log n)\).
Finally, we reason about how many nodes a search encounters on each
level \(i\). Suppose we are looking for some element \(y\). Let \(w\) be the
maximal element on level \(i\) that is less than \(y\) (or the sentinel
head if no such element exists), and let \(z\) be the minimal element on
level \(i\) that is greater than or equal to \(y\) (or the null sentinel
if no such element exists). Let \(v_1, v_2, \dots\) be the elements on
level \(i\) that are less than \(w\), with \(v_1\) the largest, \(v_2\) the
next largest, and so on.
The search encounters the nodes corresponding to \(w\) and \(z\). It does
not touch nodes on level \(i\) that come after \(z\) – either \(y = z\), in
which case the search terminates, or \(z > y\) (or \(z\) is null), in
which case the search moves down to level \(i-1\). Thus, we need only
consider whether the search encounters the nodes corresponding to
\(v_1, v_2, \dots\).
We observe that each element on level \(i\) is replicated on level \(i+1\)
with probability \(1/2\).
There are two possible routes the search can take to get to \(w\) on
level \(i\):
The search can come down from level \(i+1\) to level \(i\) at \(w\). This
happens exactly when \(w\) is replicated on level \(i+1\). The search
process proceeds along the same level until it reaches a node that
is greater than the target value \(y\), and since \(w < y\), it would
not come down to level \(i\) before \(w\) if \(w\) appears on level \(i+1\).
Thus, this possibility occurs with probability \(1/2\).
The search can come from the node corresponding to \(v_1\) on level \(i\).
This occurs when \(w\) is not replicated on level \(i+1\). In that case,
the search must come down to level \(i\) before \(w\), in which case it
goes through \(v_1\). This possibility also occurs with probability
\(1/2\).
These two possibilities are illustrated below:
Thus, \(v_1\) is encountered on level \(i\) exactly when \(w\) is not
replicated to level \(i+1\), which occurs with probability \(1/2\). We can
inductively repeat the same reasoning for preceding elements. The
search only encounters \(v_2\) on level \(i\) if neither \(w\) nor \(v_1\) are
replicated on level \(i+1\). Since they are replicated independently
with probability \(1/2\), the probability that neither is replicated is
\(1/4\). In general, the search encounters \(v_j\) on level \(i\) exactly
when none of \(w, v_1, v_2, \dots, v_{j-1}\) are replicated to level
\(i+1\), which happens with probability \(1/2^j\).
We can now compute the expected number of nodes encountered on level
\(i\). Let \(Y_i\) be this number, and let \(V_{ij}\) be an indicator for
whether or not element \(v_j\) is encountered. Then we have
\[Y_i = 2 + \sum_j V_{ij}\]
The first term \(2\) is due to the fact that the search encounters the
nodes corresponding to \(w\) and \(z\). From our reasoning above, we have
\[\begin{split}\Ex[Y_i] &= 2 + \sum_j \Ex[V_{ij}]\\
&= 2 + \sum_j \frac{1}{2^j}\\
&\le 2 + \sum_{j=1}^\infty \frac{1}{2^j}\\
&= 3\end{split}\]
The second-to-last step follows from the fact that \(\sum_j
\frac{1}{2^j}\) has only finitely many terms, since there are only
finitely many elements less than \(y\); it is therefore bounded by the
full geometric series \(\sum_{j=1}^\infty \frac{1}{2^j} = 1\). Thus, the
expected number of nodes encountered on a level \(i\) is \(\O(1)\).
Combining this with our previous result that the expected number of
levels is \(\O(\log n)\), the expected total number of nodes encountered
across all levels is \(\O(\log n)\). Thus, a search on a skip list
performs an expected \(\O(\log n)\) comparisons. Since insertion does
only a constant number of operations in addition to a search, it too
performs an expected \(\O(\log n)\) operations, and the same reasoning
holds for deletion. We achieve this expected runtime with a much simpler
implementation than a self-balancing search tree.
Some algorithms rely on repeated trials to compute an approximation of
a result. Such algorithms are called Monte Carlo methods, which are
distinct from Monte Carlo algorithms – a
Monte Carlo method does repeated sampling of a probability
distribution, while a Monte Carlo algorithm is an algorithm that may
produce the wrong result within some bounded probability. The former
are commonly approximation algorithms for estimating a quantity, while
the latter are exact algorithms that sometimes produce the wrong
result. As we will see, there is a
connection – repeated sampling (the strategy of a Monte Carlo method)
can be used to amplify the probability of getting the correct result
from a Monte Carlo algorithm.
As an example of a Monte Carlo method, we consider an algorithm for
estimating the value of \(\pi\), the area of a unit circle (a circle
with a radius of one). If such a circle is located in the plane,
centered at the origin, its top-right quadrant is as follows:
This quadrant falls within the unit square with corners \((0, 0)\) and
\((1, 1)\). The area of the quadrant is \(\pi/4\), while the area of the
square is \(1\). Thus, if we choose a random point in this square,
the probability that it lies within the quadrant of the circle is
\(\pi/4\)[4]. This motivates the following algorithm for estimating
the value of \(\pi\):
\begin{algorithmic}
\FUNCTION{EstimatePi}{$n$}
\STATE $\text{count} = 0$
\FOR{$i=1$ to $n$}
\STATE $x,y = $ values in $[0,1]$ chosen uniformly and
independently at random
\IF{$x^2+y^2 \leq 1$}
\STATE $\text{count} = \text{count} + 1$
\ENDIF
\ENDFOR
\STATE \RETURN $4 \cdot \text{count} / n$
\ENDFUNCTION
\end{algorithmic}
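For concreteness, here is one possible Python translation of the pseudocode
above; it is a direct sketch rather than an optimized implementation.

import random

def estimate_pi(n):
    """Estimate pi by sampling n points uniformly from the unit square and
    counting the fraction that land inside the quarter circle x^2 + y^2 <= 1."""
    count = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1:
            count += 1
    return 4 * count / n

if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(n, estimate_pi(n))

The estimate tends to improve as \(n\) grows, which is quantified by the
concentration bounds discussed below.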
The algorithm randomly chooses points between \((0, 0)\) and \((1, 1)\),
counting how many of them fall within the unit circle, which is when
the point’s distance from the origin is at most one. We expect this
number to be \(\pi/4\) of the samples, so the algorithm returns four
times the ratio of the points that fell within the circle as an
estimate of \(\pi\).
We formally show that the algorithm returns the value of \(\pi\) in
expectation. Let \(X\) be a random variable corresponding to the value
returned, and let \(Y_i\) be an indicator random variable that is 1 if
the \(i\)th point falls within the unit circle. As argued previously, we
have:
\[\Pr[Y_i = 1] = \pi/4\]
We also have that
\[X = 4/n \cdot (Y_1 + \dots + Y_n)\]
This leads to the following expected value for \(X\):
\[\begin{split}\Ex[X] &= \frac{4}{n}\lrEx{Y_1 + \dots + Y_n}\\
&= \frac{4}{n}(\Ex[Y_1] + \dots + \Ex[Y_n])\\
&= \frac{4}{n}\cdot n\cdot\frac{\pi}{4}\\
&= \pi\end{split}\]
The expected value of the algorithm’s output is indeed \(\pi\).
When estimating \(\pi\), how likely are we to actually get a result that
is close to the expected value? While we can apply Markov’s
inequality, the bound we get is very loose, and it does not give us
any information about how the number of samples affects the quality of
the estimate. The law of large numbers states that the actual result
converges to the expected value as the number of samples \(n\)
increases. But how fast does it converge? There are many types of
concentration bounds that allow us to reason about the deviation of
a random variable from its expectation; Markov’s inequality is just
the simplest one. Chebyshev’s inequality
is another simple bound that makes use of more information about a
random variable, namely its variance. Chernoff bounds are yet
another tool. There are multiple variants of Chernoff bounds,
including the multiplicative form and the
additive Hoeffding bounds.
The expectation of a random variable gives us one piece of information
about its probability distribution, but there are many aspects of the
distribution that it does not capture. For instance, the following
illustrates three random variables that have different distributions
but the same expected value of 0.5:
Beyond the expectation, the next most important aspect of a
probability distribution is its “spread” [5]. The variance of a
random variable encapsulates this information.
Definition 241(Variance)
Suppose \(X\) is a random variable with expectation \(\Ex[X]\). Then
the variance of \(X\) is defined as
\[\Var(X) = \Ex[(X-\Ex[X])^2]\]
or, equivalently
\[\Var(X) = \Ex[X^2] - \Ex[X]^2\]
The second definition follows from the first due to linearity of
expectation:
\[\begin{split}\Var(X) &= \Ex[(X-\Ex[X])^2]\\
&= \Ex[X^2 - 2X\Ex[X] + \Ex[X]^2]\\
&= \Ex[X^2] - \Ex[2X\Ex[X]] + \Ex[\Ex[X]^2]\\
&= \Ex[X^2] - 2\Ex[X]\Ex[X] + \Ex[X]^2\\
&= \Ex[X^2] - \Ex[X]^2\end{split}\]
In the fourth step, we used the fact that \(\Ex[X]\) is a constant to
pull it out of the outer expectation – by linearity of expectation,
\(\Ex[cY] = c\Ex[Y]\) for a constant \(c\).
The variance tells us the average square of the distance of a random
variable from its expectation. Taking the square root of the variance
gives us an approximate measure of the distance itself; this is called
the standard deviation, and it is often denoted by the symbol
\(\sigma\):
\[\sigma(X) = \sqrt{\Var(X)}\]
Example 242
Suppose the random variable \(X\) has the distribution
\[X = \begin{cases}
0.5 &\mbox{with probability $1$}
\end{cases}\]
and \(Y\) has the distribution
\[\begin{split}Y = \begin{cases}
0 &\mbox{with probability $1/2$}\\
1 &\mbox{with probability $1/2$}
\end{cases}\end{split}\]
Both \(X\) and \(Y\) have an expected value of 0.5. We compute their
variances. The distributions of \(X^2\) and \(Y^2\) are as follows:
\[ \begin{align}\begin{aligned}\begin{split}X^2 = \begin{cases}
0.25 &\mbox{with probability $1$}
\end{cases}\\\end{split}\\\begin{split}Y^2 = \begin{cases}
0 &\mbox{with probability $1/2$}\\
1 &\mbox{with probability $1/2$}
\end{cases}\end{split}\end{aligned}\end{align} \]
Thus \(\Ex[X^2] = 0.25\) and \(\Ex[Y^2] = 0.5\), giving us
\[\begin{split}\Var(X) &= \Ex[X^2] - \Ex[X]^2 = 0.25 - 0.25 = 0\\
\Var(Y) &= \Ex[Y^2] - \Ex[Y]^2 = 0.5 - 0.25 = 0.25\end{split}\]
Even though \(X\) and \(Y\) have the same expectation, \(Y\) has the
larger variance, reflecting its greater spread.
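As an illustrative aside (not from the original text), the two definitions of
variance can be checked numerically on a finite distribution such as \(Y\)
above; the helper functions below are our own.

def expectation(dist):
    """Expected value of a finite distribution given as {value: probability}."""
    return sum(v * p for v, p in dist.items())

def variance_def1(dist):
    """Var(X) = E[(X - E[X])^2]."""
    mu = expectation(dist)
    return sum(p * (v - mu) ** 2 for v, p in dist.items())

def variance_def2(dist):
    """Var(X) = E[X^2] - E[X]^2."""
    mu = expectation(dist)
    second_moment = sum(p * v ** 2 for v, p in dist.items())
    return second_moment - mu ** 2

X = {0.5: 1.0}
Y = {0: 0.5, 1: 0.5}
print(variance_def1(X), variance_def2(X))  # 0.0 0.0
print(variance_def1(Y), variance_def2(Y))  # 0.25 0.25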
As a more complex example, define \(X\) to be the number of heads over
\(n\) flips of a biased coin with probability \(p\) of heads. We computed
in Example 234 that \(\Ex[X] =
np\). If we define \(X_i\) to be an indicator for whether or not flip \(i\)
produces heads, we can infer from Example 243 that \(\Var(X_i) = p(1-p)\). What about
\(\Var(X)=\lrVar{\sum_i X_i}\)?
For expectation, we know from Theorem 231 that linearity of expectation always holds.
However, for variance, this is not necessarily the case. As an
example, for a random variable \(Y\), define \(Z = Y + Y = 2Y\). It is
not the case that \(\Var(Z) = 2\Var(Y)\); in actuality, \(\Var(Z) =
4\Var(Y)\).
Theorem 244
Suppose \(X\) is a random variable. Then for any constant \(c\),
\(\Var(cX) = c^2\Var(X)\).
\[\begin{split}\Var(cX) &= \Ex[(cX)^2] - \Ex[cX]^2\\
&= \Ex[c^2X^2] - (c\Ex[X])^2\\
&= c^2\Ex[X^2] - c^2\Ex[X]^2\\
&= c^2\Var(X)\end{split}\]
In the second and third steps, we applied linearity of expectation
to pull the constants \(c\) and \(c^2\) out of the expectations.
For a more general sum of random variables, the variances only add if
the random variables are independent. To establish this, we first
demonstrate that the expectation of the product of independent random
variables is the product of their expectations.
Lemma 246
Let \(X\) and \(Y\) be independent random variables. Then
\[\Ex[XY] = \Ex[X]\cdot\Ex[Y]\]
Using this lemma, it can be shown that variance is additive for
independent random variables.
Theorem 248
Let \(X_1, \dots, X_n\) be independent random variables. Then
\[\lrVar{X_1 + \dots + X_n} = \Var(X_1) + \dots + \Var(X_n)\]
While variance tells us the expected squared distance from the
expectation, Chebyshev’s inequality allows us to bound the
probability of the absolute distance exceeding a given threshold.
Theorem 250(Chebyshev’s Inequality)
Let \(X\) be a random variable and let \(a > 0\). Then
\[\lrPr{\abs{X - \Ex[X]} \ge a} \le \frac{\Var(X)}{a^2}\]
We can derive Chebyshev’s inequality by applying Markov’s
inequality to the random variable \(Y = (X - \Ex[X])^2\).
Observe that since \(Y\) is a squared quantity, it is always
nonnegative, so Markov is applicable. Furthermore, for a
nonnegative \(a\), \((X - \Ex[X])^2 \ge a^2\) exactly when \(\abs{X -
\Ex[X]} \ge a\). Thus:
\[\begin{split}\lrPr{\abs{X - \Ex[X]} \ge a}
&= \lrPr{(X - \Ex[X])^2 \ge a^2}\\
&\le \frac{\lrEx{(X - \Ex[X])^2}}{a^2}\\
&= \frac{\Var(X)}{a^2}\end{split}\]
Suppose we flip a fair coin \(n\) times. What is the probability of
getting \(\le 49\%\) or \(\ge 51\%\) heads?
Let \(X\) be the number of heads. We previously computed that \(\Ex[X]
= \frac{n}{2}\), and \(\Var(X) = n\cdot\frac{1}{2}\cdot\frac{1}{2} =
\frac{n}{4}\). Then
\[\begin{split}\lrPr{\abs{X - \frac{n}{2}} \ge 0.01n}
&\le \frac{\Var(X)}{(0.01n)^2}\\
&= \frac{n/4}{0.0001n^2}\\
&= \frac{2500}{n}\end{split}\]
Thus, for 10,000 flips, the probability of deviating from the
expectation by 1% is at most 1/4, while for 1,000,000 flips, it is
at most 1/400.
In the example above, the random variable \(X\) can be defined as the
sum of independent, identically distributed (i.i.d.) random
variables \(X = X_1 + \dots + X_n\). In such a situation, we often
prefer to reason about \(X/n\) rather than \(X\) itself, since the
expectation of \(X/n\) is independent of \(n\); in particular, \(\Ex[X] =
n\Ex[X_i]\), whereas \(\Ex[X/n] = \Ex[X_i]\). The following is an
alternative expression of Chebyshev’s inequality for such a case:
Corollary 253
Let \(X = X_1 + \dots + X_n\) be the sum of \(n\) independent,
identically distributed random variables. Let \(\varepsilon > 0\)
be a deviation from the expectation \(\Ex[X/n] = \Ex[X_i]\). Then
\[\lrPr{\abs{\frac{X}{n} - \Ex[X_i]} \ge \varepsilon}
\le \frac{\Var(X_i)}{\varepsilon^2 n}\]
The event \(\abs{X/n-\Ex[X_i]} \ge \varepsilon\) is equivalent to the
event \(\abs{X-\Ex[X]} \ge \varepsilon n\), since \(\Ex[X] =
n\Ex[X_i]\) (multiply both sides of the first inequality by \(n\)). By
the standard form of Chebyshev’s inequality, we have
\[\begin{split}\lrPr{\abs{X - \Ex[X]} \ge \varepsilon n}
&\le \frac{\Var(X)}{\varepsilon^2 n^2}\\
&= \frac{n\Var(X_i)}{\varepsilon^2 n^2}\\
&= \frac{\Var(X_i)}{\varepsilon^2 n}\end{split}\]
In the second-to-last step, we used Theorem 248 to deduce that \(\Var(X) = n\Var(X_i)\) since the
\(X_i\) are independent.
Example 255
Suppose we flip a fair coin \(n\) times. What is the probability of
getting \(\le 49\%\) or \(\ge 51\%\) heads?
Let \(X_i\) be an indicator for whether flip \(i\) produces heads, and
let \(X = X_1 + \dots + X_n\). We have \(\Ex[X/n] = \Ex[X_i] = 1/2\),
and \(\Var(X_i) = 1/4\). Then
\[\lrPr{\abs{\frac{X}{n} - \frac{1}{2}} \ge 0.01}
\le \frac{1/4}{0.01^2\,n} = \frac{2500}{n}\]
Using Chebyshev’s inequality, we observe that the probability of
deviating by 1% from the expectation over \(n\) flips of a coin
decreases (at least) linearly in the number of flips \(n\). In reality,
the probability decreases much faster than this – using
multiplicative Chernoff bounds or
Hoeffding’s inequality, we can conclude
that the probability decreases exponentially in the number of flips.
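To see the difference concretely, the following sketch (illustrative, not
from the original text) evaluates the Chebyshev bound \(2500/n\) from the
example above alongside the two-sided Hoeffding bound
\(2e^{-2\varepsilon^2 n}\) introduced in the next section, for a 1%
deviation. For small \(n\) the two are comparable, but the Hoeffding bound
shrinks exponentially as \(n\) grows.

import math

def chebyshev_bound(n, eps=0.01):
    """Chebyshev bound on |X/n - 1/2| >= eps over n fair flips: Var(X_i)/(eps^2 n)."""
    return 0.25 / (eps ** 2 * n)

def hoeffding_bound(n, eps=0.01):
    """Two-sided Hoeffding bound on the same deviation: 2 * exp(-2 eps^2 n)."""
    return 2 * math.exp(-2 * eps ** 2 * n)

for n in (10_000, 100_000, 1_000_000):
    print(n, chebyshev_bound(n), hoeffding_bound(n))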
Hoeffding’s inequality, also called Chernoff-Hoeffding bounds, is
a set of concentration bounds that can give tighter results than
Markov’s inequality, Chebyshev’s inequality, or other forms of
Chernoff bounds. For simplicity, we restrict ourselves to the special
case [6] of independent indicator random variables, with expectation
\[\Pr[X_i = 1] = p_i\]
for each indicator \(X_i\), and let \(X = X_1 + \dots + X_n\) be their sum.
Let \(p\) be the expected value of \(\frac{1}{n}X\). We have
\[p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_{i=1}^n p_i\]
The Hoeffding bounds then state that for any \(\varepsilon > 0\),
\[\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le e^{-2\varepsilon^2 n}
\qquad\text{(upper tail)}\]
\[\lrPr{\frac{1}{n}X \le p - \varepsilon} \le e^{-2\varepsilon^2 n}
\qquad\text{(lower tail)}\]
We consider again the example of flipping a fair coin. Let \(H\) be the
total number of heads, and let \(H_i\) be an
indicator variable corresponding to whether the \(i\)th flip is heads.
We have \(\Pr[H_i = 1] = \frac{1}{2}\), and \(\Ex[\frac{1}{n}H] =
\frac{1}{2}\) for any number of flips \(n\).
What is the probability of getting at least six heads out of ten
flips? This is a deviation of \(\varepsilon = 0.1\), and applying the
upper-tail Hoeffding’s inequality gives us:
\[\lrPr{\frac{1}{n}H \ge \frac{1}{2} + 0.1}
\le e^{-2\cdot 0.1^2\cdot 10} = e^{-0.2} \approx 0.819\]
What is the probability of getting at least 60 heads out of 100 flips,
which is the same deviation \(\varepsilon = 0.1\) from the expectation?
Applying the upper tail again, we get
\[\lrPr{\frac{1}{n}H \ge \frac{1}{2} + 0.1}
\le e^{-2\cdot 0.1^2\cdot 100} = e^{-2} \approx 0.135\]
The bound decreases exponentially as the number of flips grows.
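As an illustrative aside, we can evaluate these bounds numerically and set
them against the exact binomial tail probabilities; the exact values show
that the bounds are valid but not tight.

import math

def hoeffding_upper(n, eps):
    """Upper-tail Hoeffding bound: Pr[H/n >= 1/2 + eps] <= exp(-2 eps^2 n)."""
    return math.exp(-2 * eps ** 2 * n)

def exact_upper(n, k):
    """Exact probability of at least k heads in n fair coin flips."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

print(hoeffding_upper(10, 0.1), exact_upper(10, 6))    # ~0.819 vs ~0.377
print(hoeffding_upper(100, 0.1), exact_upper(100, 60)) # ~0.135 vs ~0.028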
Example 259
Suppose we have a coin that is biased by probability \(\varepsilon\)
towards either heads or tails, but we don’t know which one. In other
words, either:
\(\Pr[\text{heads}] = \frac{1}{2} + \varepsilon\) and
\(\Pr[\text{tails}] = \frac{1}{2} - \varepsilon\), or
\(\Pr[\text{heads}] = \frac{1}{2} - \varepsilon\) and
\(\Pr[\text{tails}] = \frac{1}{2} + \varepsilon\).
To determine in which direction the coin is biased, we flip the
coin \(n\) times. If the results include at least \(n/2\) heads, we
assume the coin is biased towards heads, otherwise we assume it is
biased towards tails. How many flips should we do to guarantee our
answer is correct with probability at least \(1-\delta\)?
Let \(X\) be the number of heads, and let \(X_i\) be an indicator that
is 1 if the \(i\)th flip is heads, 0 if it is tails. Then \(X = X_1 +
\dots + X_n\) is the sum of independent indicator random variables
with \(\Ex[X_i]\) equal to either \(\frac{1}{2} + \varepsilon\) or to
\(\frac{1}{2} - \varepsilon\). We analyze the two cases individually.
Case 1: The coin is biased towards heads. Then \(\Ex[X_i] =
p = \frac{1}{2} + \varepsilon\). Our guess is erroneous when we see
fewer than \(n/2\) heads, i.e. when \(\frac{1}{n}X < \frac{1}{2} = p -
\varepsilon\). By the lower-tail Hoeffding bound, this occurs with
probability
\[\lrPr{\frac{1}{n}X \le p - \varepsilon}
\le e^{-2\varepsilon^2 n}\]
Case 2: The coin is biased towards tails. Then \(\Ex[X_i] =
p = \frac{1}{2} - \varepsilon\). Our guess is erroneous when we see at
least \(n/2\) heads, i.e. when \(\frac{1}{n}X \ge \frac{1}{2} = p +
\varepsilon\). By the upper-tail Hoeffding bound, this occurs with
probability
\[\lrPr{\frac{1}{n}X \ge p + \varepsilon}
\le e^{-2\varepsilon^2 n}\]
In either case, the probability of error after \(n\) flips is
upper-bounded by \(e^{-2\varepsilon^2 n}\). We want this probability
to be no more than \(\delta\):
\[\begin{split}e^{-2\varepsilon^2 n} &\le \delta\\
-2\varepsilon^2 n &\le \ln\delta\\
n &\ge \frac{\ln(1/\delta)}{2\varepsilon^2}\end{split}\]
If \(\varepsilon = 0.01\) (i.e. the coin is biased towards heads or
tails by 1%) and \(\delta = 0.0001\) (we want to be correct at least
99.99% of the time), then
\[n \ge \frac{\ln 10^4}{2\cdot 0.01^2} \approx 46052\]
flips suffice.
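Under the same bound, the required number of flips is easy to compute
programmatically; this small helper is illustrative.

import math

def flips_needed(eps, delta):
    """Smallest n satisfying exp(-2 eps^2 n) <= delta, i.e.
    n >= ln(1/delta) / (2 eps^2)."""
    return math.ceil(math.log(1 / delta) / (2 * eps ** 2))

print(flips_needed(0.01, 0.0001))  # 46052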
Suppose we have a coin that is either fair, or is biased by
probability \(\varepsilon\) towards heads, but we don’t know which
is the case. In other words, either:
\(\Pr[\text{heads}] = \frac{1}{2}\) and
\(\Pr[\text{tails}] = \frac{1}{2}\), or
\(\Pr[\text{heads}] = \frac{1}{2} + \varepsilon\) and
\(\Pr[\text{tails}] = \frac{1}{2} - \varepsilon\).
We flip the coin \(n\) times and count the number of heads \(X\).
For what values of \(X\) should we guess that the coin is fair, and
for what values should we guess that it is biased towards heads?
How many flips should we do to guarantee our answer is correct
with probability at least \(1 - \delta\)?
How many flips should we do for \(\varepsilon = 0.01\) and \(\delta
= 0.0001\)? How does this compare to the situation in
Example 259 with
\(\varepsilon = 0.005\) and \(\delta = 0.0001\)?
Rather than applying concentration bounds to compute the probability
of a deviation for a specific sample size, we often wish to determine
how many samples we need to be within a particular deviation with high
confidence. One application of this is big data, where we have vast
datasets that are too large to examine in their entirety, so we sample
the data instead to estimate the quantities of interest. Similar to
this is polling – outside of elections themselves, we typically do
not have the resources to ask the entire population for their
opinions, so we need to sample the populace to estimate the support
for a particular candidate or political position. In both
applications, we need to determine how many samples are needed to
obtain a good estimate.
In general, a poll estimates the fraction of the population that
supports a particular candidate by asking \(n\) randomly chosen people
whether they support that candidate. Let \(X\) be a random variable
corresponding to the number of people who answer this question in the
affirmative. Then \(X/n\) is an estimate of the level of support in the
full population.
A typical poll has both a confidence level and a margin of error
– the latter corresponds to the deviation from the true fraction \(p\)
of people who support the candidate, and the former corresponds to a
bound on the probability that the estimate is within that deviation.
For example, a 95% confidence level and a margin of error of ±2%
requires that
\[\lrPr{\abs{\frac{X}{n} - p} \le 0.02} \ge 0.95\]
Equivalently, the probability of deviating from \(p\) by more than 2%
must be at most 0.05. Applying the Hoeffding bounds to both tails, it
suffices that
\[2e^{-2\cdot 0.02^2\cdot n} \le 0.05\]
which holds when \(n \ge \frac{\ln 40}{2\cdot 0.02^2} \approx 4611.1\).
Thus, we obtain the given confidence level and margin of error by
polling at least 4612 people. Observe that this does not depend on the
total population size!
For an arbitrary margin of error ±\(\varepsilon\) at the same 95%
confidence level, we obtain:
\[n \ge \frac{\ln 40}{2\varepsilon^2}\]
More generally, if we wish to gauge the level of support for \(m\)
different candidates, the sampling theorem tells us that the number
of samples required is logarithmic in \(m\).
Theorem 261(Sampling Theorem)
Suppose \(n\) people are polled to ask which candidate they support,
out of \(m\) possible candidates. Let \(\Xj\) be the number of people
who state that they support candidate \(j\), and let \(p_j\) be the
true level of support for that candidate. We wish to obtain
\[\lrPr{\abs{\frac{\Xj}{n} - p_j} \le \varepsilon \text{ for all } j}
\ge 1 - \gamma\]
In other words, we desire a confidence level \(1 - \gamma\) that all
estimates \(\Xj/n\) are within margin of error ±\(\varepsilon\) of the
true values \(p_j\). We obtain this when the number of samples \(n\) is
\[n \ge \frac{\ln(2m/\gamma)}{2\varepsilon^2}\]
The sampling theorem can be derived by applying the union bound: each
individual estimate \(\Xj/n\) deviates from \(p_j\) by more than
\(\varepsilon\) with probability at most \(2e^{-2\varepsilon^2 n}\) by
the Hoeffding bounds, so the probability that any of the \(m\)
estimates deviates is at most \(2me^{-2\varepsilon^2 n}\), which is at
most \(\gamma\) for the above choice of \(n\).
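Assuming the Hoeffding-plus-union-bound form given above, the required poll
size can be computed directly; the helper below is a sketch with
illustrative names.

import math

def poll_size(eps, gamma, m=1):
    """Number of samples so that, with probability at least 1 - gamma, all m
    estimated fractions are within eps of their true values:
    n >= ln(2m / gamma) / (2 eps^2)."""
    return math.ceil(math.log(2 * m / gamma) / (2 * eps ** 2))

print(poll_size(0.02, 0.05))        # 4612, matching the example above
print(poll_size(0.02, 0.05, m=10))  # grows only logarithmically in m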
In conclusion, when sampling from a large dataset, the number of
samples required does depend on the desired accuracy of the estimation
and the range size (i.e. number of possible answers). But it does not
depend on the population size. This makes sampling a powerful
technique for dealing with big data, as long as we are willing to
tolerate a small possibility of obtaining an inaccurate estimate.
Job scheduling is another application where we can exploit randomness
to construct a simple, highly effective algorithm. In this problem, we
have \(k\) servers, and there are \(n\) jobs that need to be distributed
to these servers. The goal is to balance the load among the servers,
so that no one server is overloaded. This problem is very relevant to
content-delivery networks – there may be on the order of millions or
even billions of concurrent users, and each user needs to be routed to
one of only hundreds or thousands of servers. Thus, we have \(n \gg k\)
in such a case.
One possible algorithm is to always send a job to the most lightly
loaded server. However, this requires significant coordination – the
scheduler must keep track of the load on each server, which requires
extra communication, space, and computational resources. Instead, we
consider a simple, randomized algorithm that just sends each job to a
random server. The expected number of jobs on each server is \(n/k\).
But how likely are we to be close to this ideal, balanced load?
We start our analysis by defining random variables \(\Xj\) corresponding
to the number of jobs assigned to server \(j\). We would like to
demonstrate that the joint probability of \(\Xj\) being close to \(n/k\)
for all \(j\) is high. We first reason about the load on an individual
server. In particular, we wish to compute a bound on
\[\lrPr{\Xj \ge \frac{n}{k} + c}\]
for some value \(c > 0\), i.e. the probability that server \(j\) is
overloaded by at least \(c\) jobs. Let \(\Xj_i\) be an indicator random
variable that is 1 if job \(i\) is sent to server \(j\). Since the target
server for job \(i\) is chosen uniformly at random out of \(k\) possible
choices, we have
\[\lrEx{\Xj_i} = \lrPr{\Xj_i = 1} = \frac{1}{k}\]
We also have \(\Xj = \Xj_1 + \dots + \Xj_n\), giving us
\[\lrEx{\Xj} = \sum_{i=1}^n \lrEx{\Xj_i} = \frac{n}{k}\]
Applying the upper-tail Hoeffding bound with deviation \(\varepsilon =
c/n\), we obtain
\[\lrPr{\Xj \ge \frac{n}{k} + c} \le e^{-2c^2/n}\]
Now that we have a bound on an individual server being overloaded by
\(c\) jobs, we wish to bound the probability that there is some server
that is overloaded. The complement event is that no server is
overloaded, so we can equivalently compute the joint probability that
none of the servers is overloaded by \(c\) or more jobs:
\[\lrPr{\Xj < \frac{n}{k} + c \text{ for all } j}\]
However, we cannot do so by simply multiplying the individual
probabilities together – that only works if the probabilities are
independent, and in this case, they are not. In particular, if one
server is overloaded, some other server must be underloaded, so the
random variables \(\Xj\) are not independent.
Rather than trying to work with the complement event, we attempt to
directly compute the probability that there is at least one overloaded
server. By the union bound,
\[\lrPr{\exists j:\ \Xj \ge \frac{n}{k} + c}
\le \sum_{j=1}^k \lrPr{\Xj \ge \frac{n}{k} + c}
\le k e^{-2c^2/n}\]
Choosing \(c = \sqrt{n\ln k}\) makes this bound equal to
\(k e^{-2\ln k} = 1/k\). For \(n = 10^{10}\) jobs and \(k = 1000\)
servers, this gives \(c = \sqrt{10^{10}\ln 1000} \approx 2.6\times
10^5\) extra jobs, compared to an expected load of \(n/k = 10^7\) jobs
per server.
This is an overhead of about 2.6% above the expected load. The
probability that there is a server overloaded by at least 2.6% is
bounded from above by \(1/k = 0.001\), or ≤0.1%. Thus, when there are \(n
= 10^{10}\) jobs and \(k = 1000\) servers, the randomized algorithm has a
high probability (≥99.9%) of producing a schedule where the servers
all have a low overhead (≤2.6%).
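As a numerical check (illustrative, and assuming the per-server Hoeffding
bound used above), we can verify the 2.6% overhead threshold and the \(1/k\)
failure probability for \(n = 10^{10}\) and \(k = 1000\).

import math

def overload_probability(n, k, c):
    """Union bound over k servers of the chance that some server receives at
    least n/k + c jobs: k * exp(-2 c^2 / n)."""
    return k * math.exp(-2 * c ** 2 / n)

n, k = 10 ** 10, 1000
c = math.sqrt(n * math.log(k))       # threshold chosen in the analysis above
print(c / (n / k))                   # ~0.026, i.e. about a 2.6% overhead
print(overload_probability(n, k, c)) # 0.001 = 1/k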
Exercise 262
Previously, we demonstrated that to use
polling to achieve a confidence level \(1 - \gamma\) and a margin of
error ±\(\varepsilon\),