Thus far, every algorithm we have seen is deterministic. This means
that, when run on a given input, the steps the algorithm performs and
the output it produces are entirely determined. So, when run multiple
times on the same input, the algorithm produces the same output. At
heart, this is because any Turing machine’s transition rules are
deterministic, i.e., it has a transition function.
In this unit, we consider how randomness can be applied in
computation. We will exploit randomness to design simple,
probabilistic algorithms, and we will discuss several tools that can
be used to analyze randomized algorithms.
Since Turing machines are deterministic, they cannot make random
choices—so how are randomized algorithms possible? In real life,
when we want to make a random choice, we can flip a coin, roll a die,
or the like. Similarly, to model randomized algorithms and Turing
machines, we augment them with an external source of randomness.
For this purpose, the simplest source of randomness is a (virtual)
“fair coin” that the algorithm can “flip” as needed. Formally, we can
define a randomized Turing machine as having a special “coin flip”
state that, whenever it is entered, writes a fresh unbiased random bit
at the location of the tape head, and transitions back to the previous
state. With this basic feature it is possible to simulate richer
sources of randomness, like the roll of a die with any finite number
of sides.
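For concreteness, here is a minimal Python sketch of one such simulation, using rejection sampling to produce a fair six-sided die roll from fair coin flips. (The function names are ours, and random.getrandbits(1) stands in for the machine's coin-flip state.)

import random

def coin_flip():
    # One fair coin flip (0 or 1); stands in for the "coin flip" state.
    return random.getrandbits(1)

def die_roll():
    # Simulate a fair six-sided die using only fair coin flips.
    while True:
        # Three flips yield a uniform value in {0, 1, ..., 7}.
        value = (coin_flip() << 2) | (coin_flip() << 1) | coin_flip()
        if value < 6:
            return value + 1  # uniform over {1, ..., 6}
        # Otherwise (value is 6 or 7), reject and flip again.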
In this text, we write algorithms and pseudocode that make truly
random, unbiased, and independent choices from whatever finite set is
convenient. In the real world, when implementing and using randomized
algorithms—especially for cryptography—it is vital to use a
high-quality source of randomness. (Ensuring this can be a very delicate
matter, but it is outside the scope of this course.)
As a first example of the utility of randomized algorithms, consider
the game of rock-paper-scissors between two players. In this game,
both players simultaneously choose one of three options: rock, paper,
or scissors. The game is a tie if both players choose the same option.
If they make different choices, then rock beats (“smashes”) scissors,
scissors beats (“cuts”) paper, and paper beats (“covers”) rock:
Suppose we write an algorithm (or strategy) to play the game, e.g., in
a best-out-of-five sequence of rounds. Similar to worst-case analysis
of algorithms, we should consider our strategy’s performance against
an opponent strategy that plays optimally against it. In other words,
we should assume that our opponent “knows” our strategy, and plays the
best it can with that knowledge.
If we “hard-code” a strategy—like playing rock in the first two
rounds, then paper, then scissors, then paper—then how will this
strategy perform? The strategy is fixed and deterministic, so an
optimal opponent can win every round (namely, play paper in the first
two rounds, then scissors, then rock, then scissors). More generally,
for any deterministic strategy—even one that is “adaptive” based
on the opponent’s moves—there exists an opponent strategy that
entirely beats it. This illustrates the problem with deterministic
strategies: they make the program’s actions predictable, and thus
easily countered.
Now consider a strategy that chooses a random action in each move,
with equal probability for rock, paper, and scissors. Here, even an
opponent that knows this strategy does not know in advance which
action the strategy will choose, and therefore cannot universally
counter that move. In fact, we prove below that no matter what
strategy the opponent uses, the probability that the opponent wins an
individual round is just \(1/3\) (and the same goes for tying and losing
the round). To see why this is the case, we first review some basic
tools of probability.
We first recall the basic definitions and axioms of probability. For
simplicity, we focus on discrete probability, which will be
sufficient for our purposes.
A (discrete) probability space is a countable set \(\Omega\),
called the sample space of possible outcomes, along with a
probability function \(\Pr \colon \Omega \to [0,1]\) that assigns a
probability to each outcome, and satisfies
\[\sum_{\omega \in \Omega} \Pr[\omega] = 1 \; \text.\]
Let’s elaborate on this definition and consider some examples. The
sample space \(\Omega\) is the set of every possible outcome—i.e.,
“every thing that could potentially happen”—in a randomized process,
which is also known as an experiment. Each outcome has a probability
between \(0\) and \(1\) (inclusive) of occurring, as defined by the
probability function \(\Pr\), and all these probabilities must sum to
\(1\).
Example 207 (Three Fair Coins: Probability Space)
Consider the experiment of tossing three “fair” (unbiased) coins in
sequence. We formalize this as a probability space—i.e., a sample
space and a probability function on it.
The sample space \(\Omega\) is the set of all \(8 = 2^3\) possible
sequences of three heads or tails that can occur, i.e.,
\[\Omega = \set{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} \; \text,\]
where \(H\) represents heads and \(T\) represents tails.
Next, we define an appropriate probability function. To represent
the fairness of the coins, we require that each of the eight
outcomes has the same probability. Since by definition the
probabilities of all eight outcomes must sum to \(1\), we should
define \(\Pr[\omega] = 1/8\) for every outcome \(\omega \in \Omega\),
e.g., \(\Pr[HHH] = \Pr[HHT] = \cdots = \Pr[TTT] = 1/8\). [1]
These definitions generalize straightforwardly to the probability
experiment of tossing \(n \geq 0\) fair coins, where the sample space
is \(\Omega = \set{H,T}^n\), and each outcome \(\omega \in \Omega\) has
the same probability \(\Pr[\omega] = 1/2^n\).
Definition 208 (Event)
For a probability space \((\Omega,\Pr)\), an event is any subset \(E
\subseteq \Omega\) of the sample space. The probability function is
extended to events as
\[\Pr[E] = \sum_{\omega \in E} \Pr[\omega] \; \text.\]
In other words, an event is any subset of outcomes, and its
probability is the sum of the probabilities of those outcomes. We now
build on the previous example.
Example 209 (Three Fair Coins: Events)
Continuing from the probability space for tossing three fair coins
as defined in Example 207, we now
consider some example events.
The event “the first toss comes up heads” is formalized as \(E_1 =
\set{HHH, HHT, HTH, HTT} \subseteq \Omega\), i.e., the subset of
all outcomes in which the first character is \(H\). By definition,
the probability of this event is the sum of the probabilities of
all the outcomes in the event. Since this event has four
outcomes, and each has probability \(1/8\), we get that \(\Pr[E_1] =
4/8 = 1/2\). This matches our intuition that the probability of
getting heads on the first toss (ignoring what happens on the
remaining two tosses) is \(1/2\).
The event “the first and third tosses come up the same” is
formalized as the subset \(E_2 = \set{HHH, HTH, THT, TTT}
\subseteq \Omega\). Similarly, the probability of this event is
\(\Pr[E_2] = 4/8 = 1/2\). This also matches our intuition: no
matter how the first toss comes up, the third toss has a \(1/2\)
probability of matching it.
The event “the same number of heads and tails come up” is
formalized as the empty subset \(E_3 = \emptyset \subseteq
\Omega\). This is because no outcome has the same number of heads
and tails. [2] By definition, the probability of this event is
\(\Pr[E_3] = 0\).
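Because this sample space is so small, these probabilities can be checked by direct enumeration. The following is a minimal Python sketch, assuming nothing beyond the definitions above:

from itertools import product

# Sample space for three fair coins; each outcome has probability 1/8.
omega = [''.join(t) for t in product('HT', repeat=3)]
pr = {w: 1/8 for w in omega}

def prob(event):
    # The probability of an event is the sum of its outcomes' probabilities.
    return sum(pr[w] for w in event)

E1 = {w for w in omega if w[0] == 'H'}        # first toss is heads
E2 = {w for w in omega if w[0] == w[2]}       # first and third tosses match
E3 = {w for w in omega if w.count('H') == w.count('T')}  # empty event

print(prob(E1), prob(E2), prob(E3))  # 0.5 0.5 0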
Because events are just subsets of the sample space, we can use set
relations to relate their probabilities. For example, for any events
\(E_1 \subseteq E_2\), we have that \(\Pr[E_1] \leq \Pr[E_2]\), because
\(\Pr[\omega] \geq 0\) for every outcome \(\omega\). (However, we caution
that \(E_1 \subsetneq E_2\) does not imply that \(\Pr[E_1] < \Pr[E_2]\),
because the outcomes in \(E_2 \setminus E_1\) might have probability
zero.)
We can also use set operations on events to yield other events. For
example, every event \(E\) has a complement event \(\overline{E} =
\Omega \setminus E\), whose probability is \(\Pr[\overline{E}] = 1 -
\Pr[E]\), because
\[\Pr[E] + \Pr[\overline{E}]
= \sum_{\omega \in E} \Pr[\omega] + \sum_{\omega \in \overline{E}} \Pr[\omega]
= \sum_{\omega \in \Omega} \Pr[\omega] = 1 \; \text.\]
We can also consider intersections and unions of events. Notationally,
instead of \(\cap\) and \(\cup\), we usually use the logical operators
\(\wedge\) (AND) and \(\vee\) (OR), respectively. This captures the idea
that the intersection of two events represents the occurrence of both
events simultaneously, and their union represents the occurrence of
one event or the other (or both).
For two events \(E_1,E_2\), the probability \(\Pr[E_1 \wedge E_2]\) that
both occur simultaneously is called their joint probability (and
similarly for more events). It follows immediately that this
probability is at most \(\min \set{\Pr[E_1], \Pr[E_2]}\), because \(E_1
\wedge E_2 \subseteq E_1, E_2\). When considering the intersection of
events, an important case is when, informally, the (non-)occurrence of
one event does not affect the probability of the (non-)occurrence of
another event. This is known as independence; the formal definition
is as follows.
Definition 210 (Independent Events)
Two events \(E_1, E_2\) (of the same probability space) are
independent if \(\Pr[E_1 \wedge E_2] = \Pr[E_1] \cdot \Pr[E_2]\).
More generally, events \(E_1, \ldots, E_n\) are mutually
independent if \(\Pr[E_1 \wedge \cdots \wedge E_n] = \Pr[E_1]
\cdots \Pr[E_n]\). They are pairwise independent if every pair of
them is independent.
Mutual independence of several events is a stronger notion than
pairwise independence: a collection of events can be pairwise
independent but not mutually independent (see Example
218 below).
Example 211 (Three Fair Coins: Independence)
Continuing from Example 209, we
consider several events and whether they are independent:
The events \(E_1\) (“the first toss comes up heads”) and \(E_2\)
(“the first and third tosses come up the same”) are independent,
because \(E_1 \wedge E_2 = \set{HHH, HTH}\) and hence \(\Pr[E_1
\wedge E_2] = 1/4 = 1/2 \cdot 1/2 = \Pr[E_1] \cdot \Pr[E_2]\).
This matches our intuition that whether or not the first toss
comes up heads, the probability that the third toss matches the
first one is not affected.
The three events “the first/second/third toss comes up heads” are
mutually independent (and hence also pairwise independent):
their intersection is the event \(\set{HHH}\), which has
probability \(1/8 = (1/2)^3\), which is the product of the \(1/2\)
probabilities of the three individual events. The same holds if
we replace “heads” with “tails” in any of these events. All this
matches our intuition that the results of any of the tosses do
not affect any of the other tosses.
The two events “the first/last pair of tosses come up heads” are
not independent: their intersection is the event \(\set{HHH}\),
which has probability \(1/8\), but they each have probability
\(1/4\). This matches our intuition: the first pair of tosses
coming up heads makes it more likely that the last pair of
tosses also does, because they have the middle (second) toss in
common.
The three events “the [first two/last two/first and third] tosses
come up the same” are pairwise independent, but not mutually
independent. Each of them has probability \(1/2\), and the
intersection of any two of them is the event \(\set{HHH,TTT}\),
which has probability \(1/4 = 1/2 \cdot 1/2\). However, the
intersection of all three of them is also \(\set{HHH,TTT}\),
whereas the product of their individual \(1/2\) probabilities is
\(1/8\). This failure of mutual independence is because the
occurrence of any two of these events implies the occurrence of
the third one.
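These (non-)independence claims can be verified mechanically by the same kind of enumeration. A sketch, where the three events below mirror the three “pair of tosses” events just discussed:

from itertools import product

omega = [''.join(t) for t in product('HT', repeat=3)]
pr = {w: 1/8 for w in omega}
prob = lambda event: sum(pr[w] for w in event)

A = {w for w in omega if w[0] == w[1]}  # first two tosses the same
B = {w for w in omega if w[1] == w[2]}  # last two tosses the same
C = {w for w in omega if w[0] == w[2]}  # first and third the same

# Pairwise independent: every pairwise joint probability is 1/4 = 1/2 * 1/2.
print(prob(A & B), prob(A & C), prob(B & C))  # 0.25 0.25 0.25
# Not mutually independent: the triple joint probability is 1/4, not 1/8.
print(prob(A & B & C))  # 0.25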
When considering the union of events, the following lemma is a basic
but widely applicable tool.
Lemma 212 (Union Bound)
Let \(E_1, E_2 \subseteq \Omega\) be events of the same probability
space. Then
\[\Pr[E_1 \vee E_2] \leq \Pr[E_1] + \Pr[E_2] \; \text,\]
with equality if the events are disjoint (i.e., if \(E_1 \wedge
E_2 = \emptyset\)).
More generally, the probability of the union of any countably
many events is at most the sum of their individual probabilities,
with equality if the events are pairwise disjoint.
Proof 213
By the definition of the probability of an event,
\[\begin{split}\Pr[E_1 \vee E_2]
&= \sum_{\omega \in E_1 \vee E_2} \Pr[\omega] \\
&\leq \parens[\Big]{\sum_{\omega \in E_1} \Pr[\omega]} +
\parens[\Big]{\sum_{\omega \in E_2} \Pr[\omega]}
= \Pr[E_1] + \Pr[E_2] \; \text,\end{split}\]
where the inequality holds because the two sums on the right
include every \(\omega \in E_1 \vee E_2\) at least once, and
\(\Pr[\omega] \geq 0\) for those \(\omega\) that are double-counted,
i.e., the \(\omega \in E_1 \wedge E_2\). In particular, if \(E_1,E_2\)
are disjoint, then no outcome is double-counted, hence the bound is
an equality.
The more general claim (for more than two events) follows via
induction, by applying the main claim (for two events) to the first
event and the union of the others, and applying the inductive
hypothesis to the latter union.
A random variable is, informally, a numerical quantity whose value
is determined by the outcome of a probability experiment. It is
defined formally as follows.
Definition 214 (Random Variable)
A random variable of a probability space \((\Omega,\Pr)\) is a
function \(X \colon \Omega \to \R\) from the sample space to the real
numbers. So, the set of potential values of \(X\) is its image
\(X(\Omega) = \set{X(\omega) : \omega \in \Omega}\).
For any real number \(v \in \R\), the notation \(X=v\) denotes the
event \(X^{-1}(v) = \set{\omega \in \Omega : X(\omega)=v}\), i.e.,
the set of all outcomes that map to value \(v\) under \(X\). The events
\(X \geq v\), \(X < v\), etc. are defined similarly.
By the above definition, the probabilities of the events \(X=v\) and \(X
\geq v\) are, respectively,
\[\Pr[X = v] = \sum_{\omega \in \Omega : X(\omega) = v} \Pr[\omega]
\qquad \text{and} \qquad
\Pr[X \geq v] = \sum_{\omega \in \Omega : X(\omega) \geq v} \Pr[\omega] \; \text.\]
A frequently useful kind of random variable is an indicator random
variable, also called a 0-1 random variable.
Definition 215 (Indicator Random Variable)
An indicator random variable (of a probability space \((\Omega,
\Pr)\)) is one that is limited to the values 0 and 1, i.e., a
function \(X \colon \Omega \to \bit\).
An indicator random variable can be defined for any event \(E\), as
\(X(\omega)=1\) for every \(\omega \in E\) and \(X(\omega)=0\) otherwise.
This means that the random variable has value \(1\) if the event occurs,
and value \(0\) if the event does not occur. In the other direction, any
indicator random variable \(X\) has a corresponding event \(X^{-1}(1) =
\set{\omega \in \Omega : X(\omega)=1}\), the set of exactly those
outcomes that make the random variable have value \(1\).
Similar to the case for events, we can define independence for random
variables.
Definition 216 (Independent Random Variables)
Random variables \(X_1, X_2\) are independent if for any \(v_1,
v_2 \in \R\), the events \(X_1 = v_1\) and \(X_2 = v_2\) are
independent, i.e.,
\[\Pr[X_1 = v_1 \wedge X_2 = v_2] = \Pr[X_1 = v_1] \cdot \Pr[X_2 = v_2] \; \text.\]
Example 217 (Three Fair Coins: Random Variables)
Continuing from Example 211, define \(X\) to be the random variable
representing the number of heads that come up in the three tosses.
The event \(X=0\) is \(\set{TTT}\), and hence has probability \(\Pr[X=0]
= 1/8\). The event \(X \geq 2\) is \(\set{HHT,HTH,THH,HHH}\), and hence
has probability \(\Pr[X \geq 2] = 4/8 = 1/2\). The events \(X=1.5\) and
\(X=5\) are both \(\emptyset\), and hence each have probability zero.
For each \(i=1,2,3\), we can define an indicator random variable
\(X_i\) that indicates whether the \(i\)th toss comes up heads.
Specifically, \(X_i(\omega)=1\) if the \(i\)th character of \(\omega\) is
\(H\), and \(X_i(\omega)=0\) otherwise. For example, \(X_2\) is the
indicator random variable for the event \(\set{HHH, HHT, THH, THT}\).
By inspection, \(\Pr[X_i = 0] = \Pr[X_i = 1] = 1/2\) for every \(i\).
Observe that \(X = X_1 + X_2 + X_3\), because \(X\) is the total number
of heads, and each \(X_i\) contributes \(1\) to the sum if and only if
the \(i\)th toss comes up heads (otherwise it contributes \(0\)).
The random variables \(X_1, X_2, X_3\) are mutually independent
(and hence also pairwise independent). To see this, consider any
\(v_1,v_2,v_3 \in \bit\). Then the event \(X_1 = v_1 \wedge X_2 = v_2
\wedge X_3 = v_3\) consists of exactly one outcome (because the
\(v_i\) values collectively specify the result of every toss), so it
has probability \(1/8 = (1/2)^3\), which is the product of the \(1/2\)
probabilities of the three individual events \(X_i = v_i\).
All of the definitions and claims from this example generalize
straightforwardly to the probability experiment of tossing any
number of fair coins.
Example 218 (Three Fair Coins: More Random Variables and (Non-)Independence)
We can also define a random variable \(Y\) representing the number
of pairs of tosses that come up the same. That is,
For each pair of coin tosses, we can define an indicator random
variable that indicates whether that pair comes up the same.
Specifically, for each \(i=1,2,3\), consider the pair of tosses that
does not include the \(i\)th toss; then let \(Y_i=1\) if this pair
comes up the same, and \(Y_i=0\) otherwise. For example, \(Y_1\) is the
indicator random variable for the event \(\set{HHH, HTT, THH, TTT}\),
so \(\Pr[Y_1 = 1] = \Pr[Y_1 = 0] = 1/2\), and similarly for \(Y_2,
Y_3\).
Observe that \(Y = Y_1 + Y_2 + Y_3\), because \(Y\) is the total number
of pairs that come up the same, and each \(Y_i\) corresponds to one
of the three pairs, contributing \(1\) to the sum if that pair comes
up the same, and \(0\) otherwise.
The random variables \(Y_1, Y_2, Y_3\) are pairwise independent,
but not mutually independent. To see pairwise independence, take
any values \(v_i, v_j \in \bit\) for any \(i \neq j\). Then the event
\(Y_i = v_i \wedge Y_j = v_j\) consists of exactly two outcomes, so
it has probability \(1/4\), which is the product of the \(1/2\)
probabilities of the two individual events \(Y_i = v_i\) and \(Y_j =
v_j\). This matches our intuition that whether one pair of coins
comes up the same does not affect whether another pair of coins
does (even if the two pairs have a coin in common).
To see that the \(Y_i\) are not mutually independent, there are
several counterexamples to the definition. For instance, the event
\(Y_1 = 1 \wedge Y_2 = 1 \wedge Y_3 = 1\) is \(\set{HHH,TTT}\), so it
has probability \(1/4\), which is not the product of the \(1/2\)
probabilities of the three individual events \(Y_i = 1\) for
\(i=1,2,3\). This matches our intuition: if two pairs of coins come
up the same, then this implies that the third pair comes up the
same as well.
Definition 219
The probability distribution (also known as probability mass
function) of a random variable \(X\) is the function \(p_X \colon \R
\to [0,1]\) that maps each real value to the probability that \(X\)
takes on this value, i.e., \(p_X(v) = \Pr[X=v]\).
Example 220 (Three Fair Coins: Probability Distributions)
Continuing from Example 217, the
probability distribution of the random variable \(X\) representing
the number of heads that come up is
\[p_X(0) = 1/8, \quad p_X(1) = 3/8, \quad p_X(2) = 3/8, \quad p_X(3) = 1/8,\]
and \(p_X(v) = 0\) for every other \(v \in \R\).
The probability distribution of a random variable gives the
probability of each potential value of the variable. Often, we seek
more succinct “summary” quantities about the random variable. One such
quantity is its expectation, which is the “weighted average” of the
random variable, where each value is weighted by its probability.
Definition 221 (Expectation)
The expected value, also known as expectation, of a random
variable \(X\) is defined as
\[\Ex[X] = \sum_{v \in X(\Omega)} v \cdot \Pr[X = v] \; \text.\]
The name “expected value” is a bit of a misnomer, because it is not
really the value we “expect” to get in the probability experiment.
Indeed, a random variable might not even be able to take on its
expected value at all! (Example 224 below gives a simple example of this.)
Instead, it is good to think of the expectation as the average value
of a random variable.
By grouping outcomes according to the value that \(X\) assigns to
them, we also obtain the following alternative formulation of the
expectation:
\[\Ex[X] = \sum_{\omega \in \Omega} X(\omega) \cdot \Pr[\omega] \; \text.\]
Lemma 222 (Indicator Expectation)
For any indicator random variable \(X\), we have \(\Ex[X] = \Pr[X=1]\),
because \(\Ex[X] = 0 \cdot \Pr[X=0] + 1 \cdot \Pr[X=1] = \Pr[X=1]\).
Example 224 (Three Fair Coins: Expectation)
Continuing from Example 220, the expectation of the random variable
\(X\) (the number of heads that come up) is
\[\Ex[X] = 0 \cdot \tfrac{1}{8} + 1 \cdot \tfrac{3}{8} + 2 \cdot
\tfrac{3}{8} + 3 \cdot \tfrac{1}{8} = \tfrac{3}{2} \; \text.\]
This matches our intuition, that when we toss three fair coins, the
average number of heads that come up is \(3/2\). Notice that this
expectation is not a value that \(X\) can actually take on; it is
merely the “average” value.
Continuing from Example 218, the
expectation of \(Y\) (the number of pairs of tosses that come up
the same) is
\[\Ex[Y] = 1 \cdot \Pr[Y=1] + 3 \cdot \Pr[Y=3]
= 1 \cdot \tfrac{3}{4} + 3 \cdot \tfrac{1}{4} = \tfrac{3}{2} \; \text,\]
because \(Y=3\) exactly for the two outcomes \(HHH,TTT\), and \(Y=1\)
otherwise.
This also might (or might not!) match our intuition: there are
three pairs of tosses, and each one has probability \(1/2\) of coming
up the same. Even though these three events are not mutually
independent (see Example 218), the
expected number of equal pairs of tosses happens to be \(3/2\). This
is not a coincidence, as we will see next.
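Both of these expectations can be confirmed by enumerating the eight equally likely outcomes; a minimal sketch:

from itertools import product

omega = [''.join(t) for t in product('HT', repeat=3)]

X = lambda w: w.count('H')  # number of heads
Y = lambda w: sum(w[i] == w[j] for i in range(3) for j in range(i + 1, 3))

# Expectation = sum of value * probability; each outcome has probability 1/8.
print(sum(X(w) for w in omega) / 8)  # 1.5
print(sum(Y(w) for w in omega) / 8)  # 1.5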
An important feature of averages is that it is impossible for all
potential outcomes to be above average, nor can they all be below
average. In other words, there must be some outcome that is at least
the expectation, and one that is at most the expectation. This is
formalized in the following lemma, which is known as the averaging
argument.
Lemma 225 (Averaging Argument)
Let \(X\) be a random variable of a probability space \((\Omega,\Pr)\).
Then there exists an outcome \(\omega \in \Omega\) for which
\(X(\omega) \geq \Ex[X]\), and an outcome \(\omega' \in \Omega\) for
which \(X(\omega') \leq \Ex[X]\).
For example, building on Example 224, there exists at least one outcome for
which \(X\), the number of heads that come up, is at least \(3/2\).
Indeed, \(HHH\) and \(HHT\) are two such outcomes.
Proof 226
Let \(E = \Ex[X]\), and suppose for the purposes of contradiction
that \(X(\omega) < E\) for every \(\omega \in \Omega\). Then by the
alternative formula for the expectation, and because the
\(\Pr[\omega]\) are non-negative and sum to 1 (so at least one of
them is positive),
\[\begin{split}\Ex[X]
&= \sum_{\omega \in \Omega} X(\omega) \cdot \Pr[\omega] \\
&< \sum_{\omega \in \Omega} E \cdot \Pr[\omega] \\
&= E \cdot \sum_{\omega \in \Omega} \Pr[\omega] = E \; \text.\end{split}\]
Thus \(\Ex[X] < E\), which contradicts the definition of \(E\), hence
the first claim is proved. The second claim proceeds symmetrically.
For a non-negative random variable, its expectation gives some
useful information about its probability distribution. Informally, it is
“not so likely” that the random variable is “much larger than” its
expectation. Markov’s inequality quantifies this.
Theorem 227 (Markov’s Inequality)
Let \(X\) be a non-negative random variable, i.e., \(X\) never takes on
a negative value, and let \(a > 0\). Then
\[\Pr[X \geq a] \leq \frac{\Ex[X]}{a} \; \text.\]
If \(\Ex[X] > 0\), this implies that for any \(t > 0\), the probability
that \(X\) is at least \(t\) times its expectation is at most \(1/t\).
Notice that these statements are trivial (or “useless”) for \(a \leq
\Ex[X]\) or \(t \leq 1\), because in these cases Markov’s inequality
gives probability upper bounds that are \(\geq 1\), and the probability
of any event is trivially \(\leq 1\). So, Markov’s inequality is
useful only when we take \(a > \Ex[X]\) or \(t > 1\).
As an example, if the class average on an exam is \(70\) points (and
negative points are not possible), then Markov’s inequality implies
that the fraction of the class that earned at least \(a = 90\) points is
at most \(70/90 = 7/9\). (Stated probabilistically: if we choose a
student uniformly at random, then the probability that the student
earned \(\geq 90\) points is \(\leq 7/9\).) Similarly, at most \((7/6)\)ths
of the class earned \(\geq 60\) points, but this is a trivial statement.
Proof 228
By definition of expectation,
\[\begin{split}\Ex[X]
&= \sum_{v \in X(\Omega)} v \cdot \Pr[X = v] \\
&= \parens[\Big]{\sum_{v < a} v \cdot \Pr[X = v]} +
\parens[\Big]{\sum_{v \geq a} v \cdot \Pr[X = v]} \\
&\geq \sum_{v \geq a} v \cdot \Pr[X = v] \\
&\geq \sum_{v \geq a} a \cdot \Pr[X = v] \\
&= a \cdot \Pr[X \geq a]
\; \text.\end{split}\]
The first inequality holds because \(X\) is non-negative, so every
\(v\) in the first sum is non-negative, and therefore that sum is
non-negative. The second inequality holds because \(v \geq a\), and
all the \(\Pr[X=v]\) are non-negative. The last equality holds
because the event \(X \geq a\) is the union of all the disjoint
events \(X = v\) for \(v \geq a\).
Finally, dividing both sides of the inequality by \(a > 0\) (which
preserves the direction of the inequality), we get that
\[\frac{\Ex[X]}{a} \geq \Pr[X \geq a] \; \text.\]
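As a quick sanity check, we can compare Markov's bound against the exact tail probabilities of the three-coin random variable \(X\) (which has expectation \(3/2\)); a minimal sketch:

from itertools import product

omega = [''.join(t) for t in product('HT', repeat=3)]
X = lambda w: w.count('H')
EX = sum(X(w) for w in omega) / 8  # expectation: 1.5

for a in (2, 3):
    exact = sum(1 for w in omega if X(w) >= a) / 8
    print(a, exact, EX / a)  # exact probability vs. Markov's upper bound
# a=2: exact 0.5   <= bound 0.75
# a=3: exact 0.125 <= bound 0.5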
The following ‘reverse’ version of Markov’s inequality says,
informally, that if a random variable is upper bounded, then it has a
certain positive probability of exceeding any value less than its
expectation. (Note that by the averaging argument, there is some
nonzero probability that the random variable is at least its
expectation. The reverse Markov inequality gives a specific
probability bound, which increases as the threshold of interest
decreases.)
Theorem 229 (Reverse Markov’s Inequality)
Let \(X\) be a random variable that is never larger than some value
\(b\). Then for any \(a < b\),
\[\Pr[X \geq a] \geq \Pr[X > a] \geq \frac{\Ex[X] - a}{b - a} \; \text.\]
Notice that the statement is trivial (or “useless”) for \(a \geq
\Ex[X]\), because in this case it gives a probability lower bound that
is \(\leq 0\), and the probability of any event is \(\geq 0\).
For example, if the class average on an exam is \(70\) points out of a
maximum of \(b=100\), then the ‘reverse’ Markov inequality implies that
at least \((70-60)/(100-60) = 1/4\) of the class earned more than \(a=60\)
points. However, it does not say anything about how much of the
class earned \(70\) or more points.
Proof 230
The first inequality holds because the event \(X > a\) is a subset of
the event \(X \geq a\). So, it remains to prove the second
inequality.
Define the random variable \(Y = b - X\). This is non-negative
because \(X\) is never larger than \(b\), so we can apply Markov’s
inequality to it, and it has expectation \(\Ex[Y] = b - \Ex[X]\).
Hence,
\[\Pr[X \leq a] = \Pr[Y \geq b - a] \leq \frac{\Ex[Y]}{b - a}
= \frac{b - \Ex[X]}{b - a} \; \text,\]
and therefore \(\Pr[X > a] = 1 - \Pr[X \leq a] \geq 1 - \frac{b -
\Ex[X]}{b - a} = \frac{\Ex[X] - a}{b - a}\), as claimed.
Often, we can express a random variable \(X\) as a linear combination of
some other, typically “simpler,” random variables \(X_i\). We can then
apply linearity of expectation to compute the expectation of \(X\)
from the expectations of the \(X_i\). Importantly, the \(X_i\) do not
need to be independent in order for linearity of expectation to
apply.
Theorem 231 (Linearity of Expectation)
Let \(X_1, X_2, \ldots\) be any countably many random variables, and
\(c_1, c_2, \ldots \in \R\). Then
\[\Ex\brackets[\Big]{\sum_{i} c_i \cdot X_i} = \sum_{i} c_i \cdot \Ex[X_i] \; \text.\]
In other words, constant-factor scalings and summations can be
“moved in and out of” expectations.
Proof 232
We prove the case for a linear combination of two variables; the
general case then follows by induction. From the alternative
formulation of expectation,
\[\begin{split}\Ex[c_1 X_1 + c_2 X_2]
&= \sum_{\omega \in \Omega} \parens[\big]{c_1 X_1(\omega) + c_2 X_2(\omega)} \cdot \Pr[\omega] \\
&= c_1 \sum_{\omega \in \Omega} X_1(\omega) \cdot \Pr[\omega] +
c_2 \sum_{\omega \in \Omega} X_2(\omega) \cdot \Pr[\omega] \\
&= c_1 \cdot \Ex[X_1] + c_2 \cdot \Ex[X_2] \; \text.\end{split}\]
Example 233 (Three Fair Coins: Linearity of Expectation)
Continuing from Example 217, recall
that the random variable \(X\) is the number of heads that come up.
In Example 224, we showed
directly via the definition that \(\Ex[X] = 3/2\). What follows is an
alternative, simpler way—especially when generalizing to more
coin tosses—to obtain the same result using linearity of
expectation.
Recall that \(X\) is the sum of the three indicator random variables
\(X_i\) that indicate whether the \(i\)th toss comes up heads: \(X =
X_1 + X_2 + X_3\). Recall from Example 217 that \(\Ex[X_i] = \Pr[X_i = 1] = 1/2\) for every
\(i\). So, by linearity of expectation (Theorem 231),
\[\Ex[X] = \Ex[X_1 + X_2 + X_3] = \Ex[X_1] + \Ex[X_2] + \Ex[X_3]
= 3 \cdot \tfrac{1}{2} = \tfrac{3}{2} \; \text.\]
Similarly, in Example 218, we
defined the random variable \(Y\) to be the number of pairs of
tosses that come up the same, and showed directly from the
definition that \(\Ex[Y] = 3/2\).
We observed that \(Y\) is the sum of the three indicator random
variables \(Y_i\) that indicate whether the corresponding pair of
tosses come up the same: \(Y = Y_1 + Y_2 + Y_3\). Even though these
indicator random variables are not mutually independent, we can
still apply linearity of expectation to their sum:
\[\Ex[Y] = \Ex[Y_1 + Y_2 + Y_3] = 3/2 \; \text.\]
Example 234 (Biased Coins)
Suppose we repeatedly flip a biased coin that has probability \(p
\in [0,1]\) of coming up heads. We can determine the expected number
of heads in \(n\) flips, using indicator random variables and
linearity of expectation.
For \(i=1,\ldots, n\), define \(X_i\) to be the indicator random
variable indicating whether the \(i\)th flip comes up heads. Then
\(\Ex[X_i] = \Pr[X_i = 1] = p\). Define \(X\) to be the total number of
heads that come up in \(n\) flips, and observe that \(X = \sum_{i=1}^n
X_i\). So, by linearity of expectation,
\[\Ex[X]
= \Ex\brackets[\Big]{\sum_{i=1}^n X_i}
= \sum_{i=1}^n \Ex[X_i]
= \sum_{i=1}^n p
= p \cdot n \; \text.\]
By contrast, deriving this expectation directly via the definitions
(of expectation, and of the random variable \(X\)) would be rather
complicated: we would need to determine \(\Pr[X=v]\) for each
\(v=0,1,\ldots,n\), then analyze \(\sum_{i=1}^n v \cdot \Pr[X=v]\).
Linearity of expectation makes the derivation almost trivial.
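A quick simulation agrees with this derivation. In the sketch below, random.random() < p is one standard way to realize a \(p\)-biased flip, and the particular parameters are arbitrary:

import random

def average_heads(n, p, trials=100_000):
    # Estimate the expected number of heads in n flips of a p-biased coin.
    total = 0
    for _ in range(trials):
        total += sum(random.random() < p for _ in range(n))
    return total / trials

print(average_heads(n=10, p=0.3))  # close to p * n = 3.0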
Now let’s return to the rock-paper-scissors example, modeling one
round of the game as a probability space, and analyzing the
appropriate events. The sample space—i.e., the set of outcomes that
can potentially occur—is the set \(\Omega = \set{R,P,S}^2 = \set{RR,
RP, RS, PR, PP, PS, SR, SP, SS}\) of possible move pairs made by the
two players, where \(R,P,S\) respectively denote the moves rock, paper,
and scissors, and the \(i\)th character represents player \(i\)’s move,
for \(i=1,2\).
Next, let \(R_i\) be the event that player \(i\) plays rock, i.e., \(R_1 =
\set{RR, RP, RS}\) and \(R_2 = \set{RR, PR, SR}\), and similarly for
\(P_i\) and \(S_i\).
We need to assign probabilities to the outcomes. Player \(i\)’s strategy
can be seen as choosing a move at random according to some specified
probabilities of events \(R_i, P_i, S_i\), which must sum to \(1\) because
they partition the sample space. (This is true even for deterministic
strategies, in which case the chosen move has probability \(1\).) Since
the players move simultaneously, we model their choices as
independent. So, letting \(M_i\) be any one of \(R_i,P_i,S_i\) for
\(i=1,2\), we see that the event \(M_1 \wedge M_2\) consists of a single
outcome, and we have that
\[\Pr[M_1 \wedge M_2] = \Pr[M_1] \cdot \Pr[M_2] \; \text.\]
Now, let \(W_1 = \set{RS, PR, SP}\) be the event that player \(1\) wins.
Suppose that player \(1\) plays uniformly at random, i.e., plays R,P,S
each with equal probability, i.e., \(\Pr[R_1]=\Pr[P_1]=\Pr[S_1]=1/3\).
Then no matter how player \(2\) plays, the probability that player \(1\)
wins is
\[\Pr[W_1]
= \Pr[R_1] \cdot \Pr[S_2] + \Pr[P_1] \cdot \Pr[R_2] + \Pr[S_1] \cdot \Pr[P_2]
= \frac{\Pr[S_2] + \Pr[R_2] + \Pr[P_2]}{3} = \frac{1}{3} \; \text.\]
The first equation uses independence, and the final equation uses the
fact that the events \(S_2,R_2,P_2\) partition the sample space, so
their probabilities must sum to \(1\). The conclusion is that playing
uniformly at random yields a \(1/3\) probability of winning, no matter
what strategy the opponent uses. This is a significant improvement
over any deterministic strategy, against which an opponent can win
with certainty. [3] A similar analysis shows that the probabilities
of tying and of losing are both \(1/3\) as well.
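The following simulation sketch illustrates this analysis. The opponent strategy below is a fixed arbitrary choice for illustration; by the analysis above, any opponent strategy yields approximately the same \(1/3\) frequencies:

import random

MOVES = 'RPS'
BEATS = {'R': 'S', 'P': 'R', 'S': 'P'}  # each move beats the move it maps to

def play_round(opponent_move):
    ours = random.choice(MOVES)  # play uniformly at random
    if BEATS[ours] == opponent_move:
        return 'win'
    if BEATS[opponent_move] == ours:
        return 'loss'
    return 'tie'

results = [play_round('P') for _ in range(100_000)]  # opponent: always paper
for outcome in ('win', 'tie', 'loss'):
    print(outcome, results.count(outcome) / len(results))  # each close to 1/3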
The rock-paper-scissors example is illustrative of a more general
notion of considering algorithms as games. In this perspective, we
view one “player” as choosing an algorithm for a computational
problem. A second player known as the adversary then chooses an
input (of a certain size) to the algorithm. The adversary aims to
maximize the number of steps taken by the algorithm on the given
input; i.e., to induce the algorithm’s worst-case runtime behavior.
If the first player chooses a deterministic algorithm, then the
adversary can find an input that maximizes the number of steps—even
if the algorithm happens to be much faster on “most” other inputs. On
the other hand, if the first player chooses a randomized
algorithm—i.e., one that makes some random choices—then the
adversary might not be able to induce worst-case behavior, because it
cannot predict the algorithm’s random choices. In a good randomized
algorithm, the random choices will tend to avoid the worst-case
behavior no matter what the input is, although there is still some
(ideally very small) probability that it will make “unlucky” choices
that lead to poor performance.
We can use randomness to design simple algorithms that produce an
\(\alpha\)-approximation in expectation, meaning
that the expected value of the output is within an \(\alpha\) factor of
the value of an optimal solution.
As an example, recall the search version of the \(\TSAT\) problem: given
a 3CNF Boolean formula, find a satisfying assignment
for it, if one exists. Since a 3CNF formula is the AND of several
clauses, this problem is asking for an assignment that satisfies all
of the clauses simultaneously.
Because \(\TSAT\) is \(\NP\)-hard, there is no efficient algorithm for
this problem unless \(\P=\NP\). So, let us relax the goal, and seek an
assignment that satisfies as many clauses as we can manage. In other
words, we seek an approximation algorithm for the problem of
maximizing the number of satisfied clauses.
In what follows, we actually restrict the input to be what we call an
“exact”-3CNF formula. This is a 3CNF formula with the added
restriction that each clause involves three distinct variables.
(Different clauses can involve the same variable, however.) For
example, the clause
\[(x \vee \neg y \vee \neg z)\]
is allowed, but the clause
\[(x \vee \neg y \vee \neg x)\]
is not, since it involves only two distinct variables (the literals
\(x\) and \(\neg x\) involve the same variable). Similarly, the clause \((y
\vee y \vee \neg z)\) is not allowed.
The problem of finding an assignment that maximizes the number of
satisfied clauses in a given exact-3CNF formula is known as \(\MSAT\),
and its decision version is \(\NP\)-complete. Despite this, there is a
very simple randomized algorithm that obtains a \(7/8\)-approximation,
in expectation. The algorithm is as follows: assign random truth
values (true or false) to the formula’s variables, uniformly and
independently. That is, for each variable, assign it to be true or
false each with \(1/2\) probability, independently of all the others.
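A minimal Python sketch of this algorithm follows. The clause representation is a hypothetical one of our choosing: each clause is a list of three (variable index, negated?) pairs over distinct variables.

import random

def random_assignment_satisfied(clauses, num_vars):
    # Assign each variable True/False uniformly and independently,
    # then count how many clauses are satisfied.
    assignment = [random.random() < 0.5 for _ in range(num_vars)]
    satisfied = 0
    for clause in clauses:
        # A literal (v, negated) is true iff assignment[v] != negated.
        if any(assignment[v] != negated for (v, negated) in clause):
            satisfied += 1
    return satisfied

# Example: (x0 or not x1 or not x2) and (x1 or x2 or not x3).
phi = [[(0, False), (1, True), (2, True)],
       [(1, False), (2, False), (3, True)]]
trials = 100_000
print(sum(random_assignment_satisfied(phi, 4) for _ in range(trials)) / trials)
# close to (7/8) * 2 = 1.75 satisfied clauses on average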
Theorem 235
Let \(\phi\) be an exact-3CNF formula with \(m\) clauses. Then
assigning its variables with uniformly random and independent truth
values satisfies \((7/8)\)ths of \(\phi\)’s clauses in expectation, i.e.,
\[\Ex[\text{number of satisfied clauses in $\phi$}] = 7m/8 \;
\text.\]
In particular, this is a \(7/8\)-approximation algorithm for \(\MSAT\),
in expectation.
The proof of this theorem proceeds by a powerful and widely applicable
strategy. First, we define a suitable probability space and a random
variable \(X\) that captures the quantity of interest, namely, the
number of satisfied clauses. This variable \(X\) typically has a
complicated probability distribution that makes it hard to analyze its
expectation directly. Instead, we express \(X\) as the sum of several
simpler indicator random variables \(X_i\). By linearity of
expectation, \(\Ex[X]\) is the sum of the individual expectations
\(\Ex[X_i] = \Pr[X_i=1]\), where the equality holds because the \(X_i\)
are indicators. Finally, we analyze the probabilities of the
individual events \(X_i=1\), which are simple to understand, and arrive
at the conclusion.
Proof 236
First, suppose that the expected number of satisfied clauses is
indeed \(7m/8\) (which we prove below). The optimal number of clauses
that can be simultaneously satisfied is \(\OPT \leq m\), so the
expected number of satisfied clauses is \(7m/8 \geq (7/8) \cdot
\OPT\), hence this is indeed a \(7/8\)-approximation algorithm for
\(\MSAT\).
We now analyze the expected number of satisfied clauses. The
probability space is the set of all assignments to the variables of
\(\phi\), where each assignment is equally likely—i.e., the
assignment is chosen uniformly at random.
Now, define \(X\) to be the random variable for the number of satisfied
clauses in \(\phi\). That is, for any particular assignment \(\alpha\),
we define \(X(\alpha)\) to be the number of clauses satisfied by
\(\alpha\). Because the clauses of \(\phi\) can share variables in
various ways, the probability distribution and expectation of \(X\)
can be quite complicated and difficult to analyze directly.
Instead, we decompose \(X\) into a sum of much simpler indicator
variables.
Specifically, for each \(i=1,\ldots,m\) we define an indicator random
variable \(X_i\) that indicates whether the \(i\)th clause is
satisfied, i.e., \(X_i = 1\) if so and \(X_i = 0\) if not. Then because
\(X\) is the total number of satisfied clauses, we have that
\[X = \sum_{i=1}^m X_i \; \text, \qquad \text{and hence} \qquad
\Ex[X] = \sum_{i=1}^m \Ex[X_i]\]
by linearity of expectation (Theorem 231).
It is important to realize that because the clauses of \(\phi\) can
share variables in various ways, the indicators \(X_i\) are typically
not independent, even pairwise. For example, because clauses \((x
\vee y \vee \neg z)\) and \((u \vee \neg v \vee x)\) share the
variable \(x\), the former clause being satisfied makes it more
likely that the latter clause is also satisfied (i.e., their
indicators are positively correlated). Fortunately, linearity of
expectation holds even for dependent variables, so the above
equation is still valid.
All that remains is to analyze \(\Ex[X_i] = \Pr[X_i = 1]\) (see
Lemma 222), i.e., the
probability that a uniformly random assignment satisfies the \(i\)th
clause. Every clause in the exact-3CNF formula \(\phi\) is the OR of
exactly three literals involving distinct variables, e.g., \((x \vee
\neg y \vee z)\). The clause fails to be satisfied exactly when all
three of its literals are false. Because the variables are
assigned uniformly at random, each literal has probability \(1/2\) of
being false (regardless of whether the literal is a variable or its
negation). And because the literals in clause \(i\) involve distinct
variables, their values are mutually independent. Thus,
\[\begin{split}\Pr[X_i = 0]
&= \Pr[\text{clause $i$ is unsatisfied}] \\
&= \Pr[\text{the 1st and 2nd and 3rd literals of clause $i$ are all false}] \\
&= \prod_{j=1}^3 \Pr[\text{the $j$th literal of clause $i$ is
false}] \\
&= (1/2)^3 = 1/8 \; \text.\end{split}\]
Next, because \(X_i\) is an indicator random variable (it is limited
to values \(0\) and \(1\)),
\[\Ex[X_i] = \Pr[X_i = 1] = 1 - \Pr[X_i = 0] = \frac{7}{8} \; \text,\]
and therefore \(\Ex[X] = \sum_{i=1}^m \Ex[X_i] = 7m/8\), as claimed.
By the averaging argument (Lemma 225), from
Theorem 235 we can further conclude that
for any exact-3CNF formula, there exists an assignment that
satisfies at least \(7/8\)ths of the clauses. Notice that this is a
“deterministic” statement—it makes no reference to probability—but
we proved it using the tools of probability, and it is not obvious how
we could hope to do so without them! (This is an example of the
“probabilistic method,” which is a very powerful technique with many
applications.)
Theorem 235 says that if we assign
variables at random, then on average, \(7/8\)ths of the clauses are
satisfied. However, this does not say anything about how likely it
is that a certain fraction of clauses are satisfied. For example, we
might want a lower bound on the probability that at least half (or
two-thirds, or three-quarters, or 90%) of the clauses are satisfied.
Written in terms of the random variable \(X\) representing the number of
satisfied clauses, we might hope to get a lower bound on \(\Pr[X \geq c
\cdot m]\) for \(c=1/2\), or various other \(c \in [0,1]\).
Notice that because \(X\) is defined as the number of satisfied clauses,
it is non-negative (i.e., it never takes on a negative value). Thus,
we can apply Markov’s inequality (Theorem 227)
to obtain that
\[\Pr[X \geq m/2] \leq \frac{\Ex[X]}{m/2} = \frac{7m/8}{m/2} = \frac{7}{4} \; \text.\]
Unfortunately, this is a trivial (or “useless”) statement, because the
probability bound is \(7/4 > 1\), and the probability of any event is
\(\leq 1\). Furthermore, the probability bound goes in the “wrong
direction” from what we want: it says that the probability of
satisfying at least half the clauses is at most some value, whereas
we hope to show that it is at least some positive value. An upper
bound is not useful to us, because it is consistent with the actual
probability being zero. So, even if we replaced \(c=1/2\) with, say, \(c
= 9/10\) to get a nontrivial bound, the statement would not be of the
kind we want. We need to use a different tool.
Observe that the number of satisfied clauses \(X\) cannot exceed \(m\),
the total number of clauses in the given formula. So, we can apply the
‘reverse’ Markov inequality (Theorem 229) to it, with \(b=m\). Setting \(a = m/2 < b\), we get
that
\[\Pr[X > m/2] \geq \frac{\Ex[X] - m/2}{m - m/2} = \frac{7m/8 - m/2}{m/2}
= \frac{3}{4} \; \text.\]
That is, a random assignment satisfies more than half the clauses with
at least \(75\) percent probability. More generally, for any \(c < 7/8\),
setting \(a = c \cdot m < b\),
\[\Pr[X > c \cdot m] \geq \frac{7m/8 - c \cdot m}{m - c \cdot m} =
\frac{7/8-c}{1-c} > 0 \; \text.\]
So, for any \(c < 7/8\) there is a positive probability of satisfying
more than a \(c\) fraction of the clauses. However, the ‘reverse’ Markov
inequality does not tell us anything useful about the probability of
satisfying \(7/8\)ths or more of the clauses.
Exercise 237
Recall the max-cut problem. The following is a
randomized algorithm that outputs a random cut in a given
undirected graph (which has no self-loop edge, without loss of
generality):
\begin{algorithmic}
\FUNCTION{RandomCut}{$G=(V,E)$}
\STATE $S = \emptyset$
\FORALL{$v \in V$}
\STATE add $v$ to $S$ with probability $1/2$
\ENDFOR
\STATE \RETURN $S$
\ENDFUNCTION
\end{algorithmic}
Recall that \(C(S)\) is the set of crossing edges for the cut \(S\),
i.e., those that have exactly one endpoint in \(S\). The size of the
cut is \(\abs{C(S)}\).
Prove that in expectation, this algorithm outputs a
1/2-approximation of a maximum cut in the given undirected
graph. That is, the expected size of the returned cut is at
least half the size of a maximum cut.
Hint: Use linearity of expectation to show that in
expectation, half of the edges of the graph are in \(C(S)\).
Similarly, give an upper bound on the size of a maximum cut
(also in terms of the number of edges in the graph).
Conclude that, for any undirected graph \(G=(V,E)\), there exists
a cut of size at least \(\abs{E}/2\). (By contrast, in
Theorem 203 we saw an
efficient deterministic algorithm that finds such a cut, with
certainty.)
Another example of an algorithm that takes advantage of randomness is
quick sort. This is a recursive algorithm that sorts an array as
follows:
First, it chooses some element of the array as the pivot.
Then, it partitions the array around the pivot, by comparing the
pivot to every other element, placing all the smaller elements to
the “left” of the pivot—not necessarily in sorted order—and all
the larger ones to the “right” of the pivot. Without loss of
generality, we assume that the array elements are distinct, by
breaking ties according to the elements’ positions in the array.
Finally, it recursively sorts the two subarrays to the left and
right of the pivot, which results in the entire array being sorted.
The following is an illustration of an execution of quick sort, with
the recursive work elided:
The running time of quick sort is dominated by the number of
comparisons between pairs of elements. Once a pivot is selected,
partitioning an array of \(n\) elements compares every other element to
the pivot. This is a total of \(n-1 = \Theta(n)\) comparisons of
non-recursive work, which is followed by the recursive calls on the
two subarrays on each side of the pivot. If selecting the pivot takes
\(O(n)\) time and can guarantee that the two subarrays are “balanced,”
i.e., approximately equal in size, then the recurrence for the total
number of comparisons (and hence the overall running time) is
\[T(n) = 2T(n/2) + \Theta(n) \; \text.\]
This is the same recurrence as for merge sort, and
it has the closed-form solution \(T(n) = \Theta(n \log n)\). However,
quick sort tends to be faster in practice than merge sort and other
deterministic \(\O(n \log n)\) sorting algorithms, in large part because
it is fast and easy to implement in place—i.e., reordering the
array within its own memory space, using only a small amount of
auxiliary memory. (By comparison, the standard merge subroutine from
merge sort makes a full copy of the array with each call.) Thus,
variants of quick sort are widely used in practice.
One way to guarantee a balanced partition is to use the median
element as the pivot, which can be found in linear time. However, this
subroutine is rather complicated to implement, and its linear runtime
has a fairly large hidden constant factor.
A much simpler and more practical solution is to choose a uniformly
random array element as the pivot. The pseudocode is as follows.
Algorithm 238 (Randomized Quick Sort)
\begin{algorithmic}
\FUNCTION{RandQuickSort}{$A[1,\ldots,n]$}
\IF{$n \leq 1$} \RETURN $A$ \ENDIF
\STATE $p = \text{ an index chosen uniformly at random from }
\set{1,\ldots,n}$
\STATE $(L,R) =$ \CALL{Partition}{$A,p$}
\STATE \RETURN \CALL{RandQuickSort}{$L$} $+ A[p] +$
\CALL{RandQuickSort}{$R$}
\ENDFUNCTION
\FUNCTION{Partition}{$A[1,\ldots,n],p$}
\STATE initialize empty arrays $L,R$
\FOR{$i=1$ to $n$}
\IF{$i \neq p$ and $A[i] < A[p]$}
\STATE $L = L + A[i]$
\ELIF{$i \neq p$}
\STATE $R = R + A[i]$
\ENDIF
\ENDFOR
\STATE \RETURN $(L,R)$
\ENDFUNCTION
\end{algorithmic}
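For concreteness, the following is a direct (not in-place) Python transcription of this pseudocode; it is a sketch for illustration rather than an optimized implementation, and it places elements equal to the pivot on the right:

import random

def rand_quick_sort(A):
    # Choose a uniformly random pivot, partition, and recurse on both sides.
    if len(A) <= 1:
        return A
    p = random.randrange(len(A))  # pivot index, uniform over the array
    pivot = A[p]
    L = [x for i, x in enumerate(A) if i != p and x < pivot]
    R = [x for i, x in enumerate(A) if i != p and x >= pivot]
    return rand_quick_sort(L) + [pivot] + rand_quick_sort(R)

print(rand_quick_sort([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]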
The following figure illustrates a possible execution of
RandQuickSort on an example input array, with the random
choices of pivot elements highlighted, and the rank of each element
displayed below it. The rank of an element is the number of elements
in the array (including itself) that are less than or equal to it. So,
the smallest element has rank \(1\), the next-smallest has rank \(2\),
etc.
The running time of RandQuickSort is a random variable that
depends in a complicated way on the particular random choices of
pivots made throughout the algorithm. Roughly speaking, choices of
pivots that yield “unbalanced” subarrays tend to yield larger running
times than ones that yield more balanced subarrays. How can we hope to
analyze this very complicated probability experiment? As we did above,
we express the running time as a sum of much simpler indicator
variables, apply linearity of expectation, and then analyze these
indicators individually. The rest of this section is devoted to
proving the following theorem.
Theorem 239
On any array of \(n\) elements, the expected running time of
RandQuickSort is \(\O(n \log n)\).
Recall that the running time of RandQuickSort is dominated by
the number of comparisons of pairs of array elements. First observe
that if a certain pair of elements is ever compared, then it is never
compared again. This is because at the time of comparison, one of the
elements is the pivot, which is never compared with anything after
partitioning around it is complete. So, the total number of
comparisons done by RandQuickSort is the total number of
distinct pairs of elements that are ever compared.
Accordingly, define the random variable \(X\) to be the total number of
comparisons during the entire execution of quick sort on an arbitrary
array of \(n\) elements. In addition, for any ranks \(1 \leq i < j \leq
n\), define \(X_{ij}\) to be the indicator random variable that indicates
whether the pair of elements having ranks \(i,j\) are ever compared
during the execution, i.e., \(X_{ij}=1\) if this pair is ever compared,
and \(X_{ij}=0\) otherwise. By the observations from the previous
paragraph,
\[X = \sum_{1 \leq i < j \leq n} X_{ij} \; \text, \qquad \text{so} \qquad
\Ex[X] = \sum_{1 \leq i < j \leq n} \Ex[X_{ij}]\]
by linearity of expectation.
So, for every \(i < j\) it just remains to analyze the individual
expectations \(\Ex[X_{ij}] = \Pr[X_{ij}=1]\) in isolation (where the
equality holds by Lemma 222),
and sum them. (As above, the random variables \(X_{ij}\) are highly
correlated in difficult-to-understand ways, but this does not matter
for linearity of expectation.)
Under what conditions are the elements of ranks \(i < j\) ever compared,
and what is the probability \(\Pr[X_{ij}=1]\) that this occurs? By
inspection of Algorithm 238, we
see that elements are compared only in the Partition
subroutine. In particular, the only comparisons made during a call to
Partition are between the selected pivot element and all the
other elements in the input subarray. This has two important
implications:
When a subarray is input to Partition, none of its elements
have been compared with each other yet.
In the subarray input to Partition, if one element is less
than the pivot and another is greater than the pivot, then these two
elements are placed on opposite sides of the pivot, and thus are
never compared with each other (because the recursive calls operate
completely separately on the two sides).
Therefore, the two elements \(a < b\) having ranks \(i < j\) are compared
if and only if the first pivot chosen from the range \([a,b]\) is
either \(a\) or \(b\). To elaborate: for as long as pivots are chosen from
outside this range, \(a\) and \(b\) remain in the same subarray, and have
not yet been compared. Then, once an element from this range is chosen
as a pivot:
it is either \(a\) or \(b\), in which case \(a\) and \(b\) are compared with
each other, or
it is an element strictly between \(a\) and \(b\), in which case \(a\) and
\(b\) are placed in different subarrays and never compared with each
other.
Therefore, \(\Ex[X_{ij}]\) is the probability that the first pivot whose
rank is between \(i\) and \(j\) (inclusive) has rank either \(i\) or \(j\).
There are \(j-i+1\) elements in this range. Each pivot is chosen
uniformly at random from the subarray input to Partition, and
as argued above, when the first pivot from this range is selected, the
entire range is part of the subarray. So, given that the pivot is in
the range, it is uniformly random over the range, i.e., it is
equally likely to be any of the \(j-i+1\) elements in the range. (This
statement can be formalized using conditional probability.) Because
exactly two of these possible choices make \(X_{ij}=1\), we have that
\[\Ex[X_{ij}] = \Pr[X_{ij} = 1] = \frac{2}{j-i+1} \; \text.\]
Finally, summing over all pairs of ranks via linearity of expectation,
\[\Ex[X] = \sum_{1 \leq i < j \leq n} \frac{2}{j-i+1}
\leq \sum_{i=1}^{n} \sum_{k=2}^{n} \frac{2}{k} = O(n \log n) \; \text,\]
because each inner sum is a harmonic series, which is \(O(\log n)\).
This proves Theorem 239.
Exercise 240
Although RandQuickSort runs in expected time \(O(n \log
n)\) on an array of \(n\) elements, it can potentially run in time
\(\Omega(n^2)\), due to an unlucky choice of random pivots.
Prove that for any constant \(c > 0\),
\[\lim_{n \to \infty} \Pr[\algo{RandQuickSort} \text{ on an
$n$-element array takes $\geq c n \log^2 n$ steps}] = 0 \; \text.\]
That is, as the array size grows large, randomized quick sort is
very unlikely to run in even \(\Omega(n \log^2 n)\) time.
Hint: This follows immediately from Theorem 239 and Markov’s inequality
(Theorem 227), without needing to understand
any of the analysis of RandQuickSort.
Randomness can also be used to obtain very simple data structures
that provide excellent performance in expectation. As an example, here
we consider a structure called a “skip list”, which implements a
dictionary abstract data type. A dictionary is a structure that
stores data as key-value pairs, and supports inserting, deleting,
and finding (looking up) data by its associated key. For example, in a
database, students’ academic records might be keyed by their unique ID
numbers.
For simplicity, we focus on storing just the keys themselves, since
the associated data can be attached to them using pointers. We assume
without loss of generality that keys can be ordered in some
consistent way. (For example, in the absence of a “natural” ordering,
one can use lexicographic order on the string representations of the
keys.)
There are many ways to implement a dictionary abstraction using a
variety of underlying data structures, such as arrays, linked lists,
hash tables, and binary search trees. However, simple deterministic
implementations have worst-case running times that are linear in the
number of items in the dictionary, for some or all operations. For
example, if we insert elements \(1, 2, \ldots, n\) into a sorted linked
list or non-balancing binary search tree, most of the insertions take
\(\Theta(n)\) time each. Similarly, finding a key takes \(\Theta(n)\) time
in the worst case, because we may need to traverse most or all of the
list.
One common way to avoid such poor performance is to use a more
advanced data structure that does some “rebalancing” operations to
ensure \(O(\log n)\) worst-case runtimes for all operations, regardless
of which elements are inserted or deleted. There are a variety of such
structures, like AVL trees, red-black trees, scapegoat
trees, and so on.
However, these structures can be complicated and difficult to
implement correctly, and can have significant hidden constant factors
in their runtimes and memory overheads.
Here we investigate an alternative, very simple data structure
called a “skip list”, which uses randomness to obtain expected
\(\O(\log n)\) runtimes for all dictionary operations. A skip list is
essentially a linked list with multiple levels, where each level has
about half the elements of the level below it. [4]
Like in a subway or highway system, skip lists can be thought of as
having “express lanes” that make it possible to reach a desired
element more quickly by “skipping over” many undesired elements at a
time—just as a subway’s express train can go faster by skipping past
many regular stops. A skip list, however, typically has several
levels of “express lanes,” each one able to skip over more elements
than the one below it. Moreover, the choice of which elements appear
on each level, and which are skipped, is made at random. This tends
to keep the data structure sufficiently “balanced” to support fast
operations, in expectation. [5]
The above diagram illustrates a possible state of a skip list. Every
level has a sentinel “head” and a terminal null pointer; for
convenience, their values are defined to be smaller and larger
(respectively) than every element. The bottom level has all of the
elements, in a sorted linked list. Every other level has some subset
of the elements in the level below it, also in a sorted linked list.
In addition, each element above the bottom level also points to its
duplicate in the level below it; these downward pointers are
illustrated by stacking the duplicate elements. Therefore, from any
non-terminal position in the skip list, we may go “rightward” to the
next-larger element in the same level, or (when not in the bottom
level) “downward” to the same element at the next-lowest level. [6]
To search for an element \(e\), we start at the sentinel head of the
top level. We maintain the invariant that we are always located at
some element \(a < e\) in some level, and repeat the following loop:
Consider the successor element \(a'\) of \(a\) in the current level.
If \(a' < e\), move to \(a'\) (setting \(a=a'\)) and repeat.
Otherwise (i.e., \(a' \geq e\)), if we are not in the bottom level,
move downward (to the same element \(a\)) and repeat; else terminate
the loop.
At this point, we are in the bottom level, at the largest element
smaller than \(e\). If its successor is the desired element \(e\), output
it; otherwise, report that \(e\) is not in the structure.
Note that as an optimization, if we ever find that \(a'=e\) in some
level, we can immediately return it, instead of always descending to
the bottom level. However, for defining the insertion and deletion
algorithms, and for the probabilistic analysis below, it is more
convenient to define the search so that it ultimately ends in the
bottom level.
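A sketch of this search procedure in Python follows; the node representation (with right and down pointers, and sentinel heads) is a hypothetical one of our choosing that matches the description above.

class Node:
    def __init__(self, key, right=None, down=None):
        self.key = key      # sentinel heads conceptually hold -infinity
        self.right = right  # next-larger element in the same level
        self.down = down    # the same element, one level below

def search(head, e):
    # Search for key e, starting at the top level's sentinel head.
    a = head
    while True:
        # Move rightward while the successor is still smaller than e.
        while a.right is not None and a.right.key < e:
            a = a.right
        if a.down is None:
            break           # bottom level: a is the largest element < e
        a = a.down          # otherwise, move downward and repeat
    successor = a.right
    return successor if (successor is not None and successor.key == e) else None

# Tiny example: bottom level 1 -> 3 -> 5; top level 1 -> 5.
b1, b3, b5 = Node(1), Node(3), Node(5)
b1.right, b3.right = b3, b5
t1, t5 = Node(1, down=b1), Node(5, down=b5)
t1.right = t5
head = Node(float('-inf'), right=t1, down=Node(float('-inf'), right=b1))
print(search(head, 3).key, search(head, 4))  # 3 None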
For example, the following diagram illustrates the search for element
0 in the above skip list. The search terminates in failure, since in
the bottom level we end up at -3, whose successor is 1.
The following diagram illustrates the successful search for element 9
(optimized to return as soon as it is found, above the bottom level):
To delete an element \(e\) from the structure, we first search for it.
If \(e\) is not in the structure, there is nothing more to do.
Otherwise, the search finds \(e\) in the bottom level, and along the way
it also finds \(e\)’s predecessor in every level in which \(e\) appears.
(These are the elements at which the search moves downward.) So, we
just delete \(e\) from each of those levels, by making each predecessor
point to the corresponding successor of \(e\) in that level. Notice that
the running time of deletion is within a constant factor of the
running time of the search.
To insert an element \(e\), we first search for \(e\). If it is already
in the structure, then there is nothing more to do. Otherwise, the
search terminates in the bottom level, at the largest element \(a < e\);
along the way, it also finds the largest element smaller than \(e\) in
each level.
We first insert \(e\) as the successor of \(a\) in the bottom level. Then,
we determine at random which higher levels the element \(e\) will
occupy, using the following loop:
Flip a fair coin.
If it comes up heads, insert \(e\) in the next level up (as the
successor of the largest element smaller than \(e\)), and repeat.
Otherwise, stop.
In other words, we insert \(e\) into one higher level for each head that
comes up, until tails comes up. Note that there is no upper limit to
how many levels we might insert \(e\) into, and we create new levels as
needed.
Observe that the number of copies of the inserted element is a random
variable, which equals the total number of coin flips until tails
comes up (inclusive). So, this random variable follows the geometric
distribution
\[\Pr[\text{number of copies} = k] = (1/2)^k \quad \text{for } k = 1, 2, \ldots \; \text,\]
with success probability \(1/2\), which has expectation \(2\). Thus, the
runtime of insertion is, in expectation, only a constant more than the
runtime of the search.
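The level-sampling loop is easy to state in code; the following minimal sketch estimates the expected number of copies of an inserted element, which should be close to \(2\):

import random

def num_copies():
    # 1 copy in the bottom level, plus one more per heads until tails.
    copies = 1
    while random.getrandbits(1) == 1:  # heads with probability 1/2
        copies += 1
    return copies

trials = 100_000
print(sum(num_copies() for _ in range(trials)) / trials)  # close to 2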
Based on the above insertion process, we can also bound the expected
size (memory usage) of a skip list. If there are \(n\) distinct
elements, then by linearity of expectation on the number of copies of
each element, the expected total number of copies is \(2n\). So, the
expected total memory usage is \(O(n)\).
Finally, based on what we have already seen, to determine the expected
runtime of any of the operations (insert, delete, search), it just
remains to analyze the expected runtime of a search. We have the
following theorem.
Theorem 241
Assume that all the operations (insert, delete, search) performed
on a skip list are independent of the internal random choices made
during insertions. Then in a skip list with \(n\) distinct elements,
the expected running time of a search is \(O(\log n)\).
Before proving the theorem, we discuss the meaning and importance of
the assumption that the sequence of operations is independent of the
internal random choices. Without this assumption, a skip list may have
very poor performance. For example, if an adversarial user is able to
learn which elements appear above the bottom level, then the user
could just delete all such elements from the structure. This makes the
skip list just an ordinary linked list, with its associated
\(\Theta(n)\) runtimes. But notice that in this scenario, the sequence
of operations depends on the skip list’s internal random choices,
because the deleted elements are exactly those that are randomly
promoted above the bottom level. By contrast, under the independence
assumption, we can treat every element as if it has just been inserted
and promoted at random according to the above process.
To prove the theorem, we first show the following useful lemma.
Lemma 242
Under the independence assumption from Theorem 241, in a skip list with \(n\) distinct
elements, the expected height (i.e., number of levels above the
bottom level) is less than \(\log_2 n + 2\).
Proof 243
We first analyze the expected number of elements on each level, and
use this to determine the expected height of the structure.
Define \(X_{ij}\) to be the indicator random variable that indicates
whether the \(j\)th element appears on level \(i\), where \(i=0\)
corresponds to the bottom level. Then by the insertion rule and the
independence assumption,
\[\Pr[X_{ij} = 1] = 1/2^{i} \; \text.\]
This is because an element appears on level \(i\) if at least \(i\)
heads come up when inserting it, and the coin flips are fair and
independent. Thus, \(\Ex[X_{ij}] = \Pr[X_{ij} = 1] = 1/2^{i}\).
Now define random variable \(n_i\) to be the number of (non-sentinel)
elements on level \(i\); this is just the sum of the indicators
\(X_{ij}\). So, by linearity of expectation and the formula for
geometric series, the expected number of elements on level \(i\) is
\[\Ex[n_i] = \sum_{j=1}^{n} \Ex[X_{ij}] = \frac{n}{2^i} \; \text.\]
(For example, the expected number of elements on level 1 is \(n/2\),
on level 2 is \(n/4\), etc.)
Now, we analyze the expected height of the structure, i.e., the
number of levels above the bottom one that have at least one
element. For intuition, the expected number of elements on level
\(i\) is less than \(1\) when
\[\Ex[n_i] = \frac{n}{2^i} < 1 \iff 2^i > n \iff i > \log_2 n \;
\text,\]
so we should expect roughly \(\ell(n) = \log_2 n\) levels.
To show this properly, define \(H_i\) to be the indicator random
variable that indicates whether level \(i\) has at least one element,
i.e., whether \(n_i \geq 1\). Because \(n_i\) is non-negative, by
Markov’s inequality and the fact that \(n = 2^{\ell(n)}\),
\[\Ex[H_i] = \Pr[n_i \geq 1] \leq \Ex[n_i] = \frac{n}{2^i} = 2^{\ell(n)-i} \; \text.\]
Notice that this upper bound on \(\Ex[H_i]\) is nontrivial only for
\(i > \ell(n)\), because the expectation of any indicator random
variable is at most \(1\). For \(i \leq \ell(n)\) we have the trivial
but tighter bound \(\Ex[H_i] \leq 1\).
Now, the total height \(H\) of the skip list is
\[H = \sum_{i=1}^\infty H_i \; \text,\]
so by linearity of expectation, the above bounds on \(\Ex[H_i]\), and
the formula for a geometric series, the expected height is
\[\Ex[H] = \sum_{i=1}^{\infty} \Ex[H_i]
\leq \ell(n) + \sum_{i > \ell(n)} 2^{\ell(n)-i}
< \log_2 n + 2 \; \text.\]
Now we can prove Theorem 241,
which says that the expected running time of a search in a skip list
of \(n\) distinct elements is \(O(\log n)\).
Proof 244
The running time of the search for an element \(e\) is proportional
to the total number of downward and rightward moves during the
search, starting from the topmost sentinel head. The number of
downward moves is exactly the height of the structure, whose
expectation we analyzed above in Lemma 242. So, it remains to analyze the
expected number of rightward moves. The simplest way to do this
rigorously is to consider the search path in reverse.
The end of the search path is the largest value \(a < e\) in the
bottom level, and hence in the entire structure as well. Suppose
that this \(a\) is not the sentinel head, otherwise there is no
rightward move in the entire search. The search moved rightward to
this \(a\) only if \(a\) is not present in the next level up. Indeed,
if \(a\) were in the next level up, then the prior step of the search
must have been downward from that copy of \(a\), because the search
reaches the largest value less than \(e\) on each level.
By definition of the insertion process (and the independence
assumption), with probability \(1/2\), the first coin toss when \(a\)
was inserted came up tails, in which case \(a\) is not present in the
next level up, and so the prior step was rightward. In this case,
the exact same reasoning applies to the step prior to that one, and
so on, until we reach an element that either is in the next level
up or is the sentinel head. So, in the bottom level, the number of
rightward moves \(R_0\) is the number of independent fair coin tosses
(out of \(\leq n\) tosses) that come up tails before the first heads
comes up. This is bounded by one less than a geometric random
variable with success probability \(1/2\), so \(\Ex[R_0] < 1\).
The exact same reasoning applies to the number of rightward moves
on every level. That is, the last value visited on level \(i\) is
the largest value \(a < e\) on that level. If \(a\) is not the sentinel
head, then the prior step was rightward only if \(a\) is not in the
next level up, which holds with probability \(1/2\) by definition of
the insertion process, and so on. So, defining random variable
\(R_i\) to be the number of rightward moves on level \(i\), we have
that \(\Ex[R_i] < 1\) for all \(i\).
Separately, the number of rightward moves on level \(i\) cannot
exceed the total number of (non-sentinel) elements \(n_i\) on level
\(i\). So, \(\Ex[R_i] \leq \Ex[n_i] = n/2^i\), as shown in the proof of
Lemma 242 above. This
bound on \(\Ex[R_i]\) is tighter than the one obtained above when
\(n/2^i < 1\), or equivalently, when \(i > \ell(n) = \log_2(n)\).
Finally, let \(H\) be the height of the structure, which is the
number of downward moves in the search, and let \(R\) be the total
number of rightward moves, which is the sum of all the \(R_i\) for \(i
\geq 0\). Then by linearity of expectation, the above bounds on
\(\Ex[R_i]\), and Lemma 242 (and proceeding similarly to its
proof), the expected total number of moves in the search is
\[\Ex[H] + \Ex[R] = \Ex[H] + \sum_{i \geq 0} \Ex[R_i]
< (\log_2 n + 2) + (\log_2 n + 3) = O(\log n) \; \text,\]
using \(\Ex[R_i] < 1\) for each of the at most \(\log_2 n + 1\) levels
\(i \leq \ell(n)\) (including the bottom level), and
\(\sum_{i > \ell(n)} \Ex[R_i] \leq \sum_{i > \ell(n)} n/2^i \leq 2\).
Some algorithms rely on repeated trials to compute an approximation of
a result. Such algorithms are called Monte Carlo methods, which are
distinct from Monte Carlo algorithms – a
Monte Carlo method does repeated sampling of a probability
distribution, while a Monte Carlo algorithm is an algorithm that may
produce the wrong result with some bounded probability. The former
are commonly approximation algorithms for estimating a quantity, while
the latter are exact algorithms that sometimes produce the wrong
result. As we will see, there is a
connection: repeated sampling (the strategy of a Monte Carlo method)
can be used to amplify the probability of getting the correct result
from a Monte Carlo algorithm.
As an example of a Monte Carlo method, we consider an algorithm for
estimating the value of \(\pi\), the area of a unit circle (a circle
with a radius of one). If such a circle is located in the plane,
centered at the origin, its top-right quadrant is as follows:
This quadrant falls within the square interval between \((0, 0)\) and
\((1, 1)\). The area of the quadrant is \(\pi/4\), while the area of the
interval is \(1\). Thus, if we choose a random point in this interval,
the probability that it lies within the quadrant of the circle is
\(\pi/4\)[7]. This motivates the following algorithm for estimating
the value of \(\pi\):
\begin{algorithmic}
\FUNCTION{EstimatePi}{$n$}
\STATE $\text{count} = 0$
\FOR{$i=1$ to $n$}
\STATE $x,y = $ values in $[0,1]$ chosen uniformly and
independently at random
\IF{$x^2+y^2 \leq 1$}
\STATE $\text{count} = \text{count} + 1$
\ENDIF
\ENDFOR
\STATE \RETURN $4 \cdot \text{count} / n$
\ENDFUNCTION
\end{algorithmic}
The algorithm randomly chooses points between \((0, 0)\) and \((1, 1)\),
counting how many of them fall within the unit circle, which is when
the point’s distance from the origin is at most one. We expect this
number to be \(\pi/4\) of the samples, so the algorithm returns four
times the ratio of the points that fell within the circle as an
estimate of \(\pi\).
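The pseudocode translates directly into runnable Python; the following
sketch mirrors \textsc{EstimatePi}:
\begin{verbatim}
import random

def estimate_pi(n):
    """Monte Carlo estimate of pi from n random points."""
    count = 0
    for _ in range(n):
        # Sample a uniformly random point in the unit square.
        x, y = random.random(), random.random()
        # Count it if it lies within the quarter circle of radius 1.
        if x * x + y * y <= 1:
            count += 1
    return 4 * count / n

# estimate_pi(1_000_000) typically yields a value near 3.14.
\end{verbatim}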
We formally show that the algorithm returns the value of \(\pi\) in
expectation. Let \(X\) be a random variable corresponding to the value
returned, and let \(Y_i\) be an indicator random variable that is 1 if
the \(i\)th point falls within the unit circle. As argued previously, we
have:
\[\Pr[Y_i = 1] = \pi/4 \; \text.\]
We also have that
\[X = 4/n \cdot (Y_1 + \dots + Y_n) \; \text.\]
This leads to the following expected value for \(X\):
\[\Ex[X] = \frac{4}{n} \cdot (\Ex[Y_1] + \dots + \Ex[Y_n])
= \frac{4}{n} \cdot n \cdot \frac{\pi}{4} = \pi \; \text.\]
The expected value of the algorithm’s output is indeed \(\pi\).
When estimating \(\pi\), how likely are we to actually get a result that
is close to the expected value? While we can apply Markov’s
inequality, the bound we get is very loose, and it does not give us
any information about how the number of samples affects the quality of
the estimate. The law of large numbers states that the actual result
converges to the expected value as the number of samples \(n\)
increases. But how fast does it converge? There are many types of
concentration bounds that allow us to reason about the deviation of
a random variable from its expectation; Markov’s inequality is just
the simplest one. Chebyshev’s inequality
is another simple bound that makes use of more information about a
random variable, namely its variance. Chernoff bounds are yet
another tool. There are multiple variants of Chernoff bounds,
including the multiplicative form and the
additive Hoeffding bounds.
The expectation of a random variable gives us one piece of information
about its probability distribution, but there are many aspects of the
distribution that it does not capture. For instance, the following
illustrates three random variables that have different distributions
but the same expected value of 0.5:
Beyond the expectation, the next most important aspect of a
probability distribution is its “spread” [8]. The variance of a
random variable encapsulates this information.
Definition 245(Variance)
Suppose \(X\) is a random variable with expectation \(\Ex[X]\). Then
the variance of \(X\) is defined as
\[\Var(X) = \Ex[(X-\Ex[X])^2] \; \text,\]
or equivalently,
\[\Var(X) = \Ex[X^2] - \Ex[X]^2 \; \text.\]
The second definition follows from the first due to linearity of
expectation:
\[\begin{split}\Var(X) &= \Ex[(X - \Ex[X])^2] \\
&= \Ex[X^2 - 2X \cdot \Ex[X] + \Ex[X]^2] \\
&= \Ex[X^2] - \Ex[2X \cdot \Ex[X]] + \Ex[X]^2 \\
&= \Ex[X^2] - 2\Ex[X] \cdot \Ex[X] + \Ex[X]^2 \\
&= \Ex[X^2] - \Ex[X]^2 \; \text.\end{split}\]
In the fourth step, we used the fact that \(\Ex[X]\) is a constant to
pull it out of the outer expectation: by linearity of expectation,
\(\Ex[cY] = c\Ex[Y]\) for any constant \(c\).
The variance tells us the average square of the distance of a random
variable from its expectation. Taking the square root of the variance
gives us an approximate measure of the distance itself; this is called
the standard deviation, and it is often denoted by the symbol
\(\sigma\):
\[\sigma(X) = \sqrt{\Var(X)} \; \text.\]
Example 246
Suppose that random variable \(X\) has the distribution
\[X = 0.5 \text{ with probability $1$}\]
and \(Y\) has the distribution
\[\begin{split}Y = \begin{cases}
0 &\text{with probability $1/2$} \\
1 &\text{with probability $1/2$.}
\end{cases}\end{split}\]
Both \(X\) and \(Y\) have expectations \(0.5\). We calculate their
variances. The distributions of \(X^2\) and \(Y^2\) are as follows:
\[\begin{split}X^2 &= 0.25 \text{ with probability $1$} \\
Y^2 &= \begin{cases}
0 & \text{with probability $1/2$} \\
1 & \text{with probability $1/2$.}
\end{cases}\end{split}\]
Thus,
\[\Var(X) = \Ex[X^2] - \Ex[X]^2 = 0.25 - 0.25 = 0 \; \text,\]
while
\[\Var(Y) = \Ex[Y^2] - \Ex[Y]^2 = 0.5 - 0.25 = 0.25 \; \text.\]
We see that while \(X\) and \(Y\) have the same expectation, their
variances differ. This quantifies the fact that \(X\) has a certain
value with certainty, whereas \(Y\) can take on multiple values, so
\(Y\) has larger variance.
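The calculations in this example are easy to check mechanically. The
following Python sketch evaluates \(\Var(X) = \Ex[X^2] - \Ex[X]^2\) for
a discrete distribution given as value–probability pairs (the helper
name is ours):
\begin{verbatim}
def variance(dist):
    """dist: list of (value, probability) pairs."""
    ex = sum(p * v for v, p in dist)        # E[X]
    ex2 = sum(p * v * v for v, p in dist)   # E[X^2]
    return ex2 - ex ** 2                    # E[X^2] - E[X]^2

variance([(0.5, 1.0)])            # 0.0,  as for X above
variance([(0, 0.5), (1, 0.5)])    # 0.25, as for Y above
\end{verbatim}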
Example 247
Let \(X\) be an indicator random variable with probability \(p\) of
being 1:
\[\begin{split}X = \begin{cases}
0 & \text{with probability $1-p$} \\
1 & \text{with probability $p$.}
\end{cases}\end{split}\]
Then \(\Ex[X] = \Pr[X=1] = p\). Observe that \(X^2 = X\) in all cases,
so \(\Ex[X^2] = \Ex[X] = p\). Thus, the variance of \(X\) is
\[\Var(X) = \Ex[X^2] - \Ex[X]^2 = p - p^2 = p(1-p) \; \text.\]
As a more complex example, define \(X\) to be the number of heads over
\(n\) tosses of a biased coin with probability \(p\) of coming up heads.
In Example 234 we calculated
that \(\Ex[X] = pn\). Defining \(X_i\) to be the indicator random variable
for whether the \(i\)th toss comes up heads, we can infer from
Example 247 that \(\Var(X_i) =
p(1-p)\). What can we say about the variance
\(\Var(X)=\Var\parens[\big]{\sum_i X_i}\) of \(X\) itself?
For expectation, we know from Theorem 231 that linearity of expectation holds for
any random variables, even arbitrarily correlated ones. For variance,
this is not the case. As an example, for a random variable \(Y\), define
\(Z = Y + Y = 2Y\). It is not the case that \(\Var(Z) = 2\Var(Y)\); in
fact, \(\Var(Z) = 4\Var(Y)\).
Theorem 248
Let \(X\) be a random variable. Then \(\Var(cX) = c^2 \cdot \Var(X)\)
for any constant \(c\).
Proof 249
\[\begin{split}\Var(cX) &= \Ex[(cX)^2] - \Ex[cX]^2 \\
&= \Ex[c^2 X^2] - (c \cdot \Ex[X])^2 \\
&= c^2 \cdot \Ex[X^2] - c^2 \cdot \Ex[X]^2 \\
&= c^2 \cdot \Var(X) \; \text.\end{split}\]
In the second and third steps, we applied linearity of expectation
to pull the constants \(c\) and \(c^2\) out of the expectations.
For a sum of random variables, the variances sum if the random
variables are independent; in fact, pairwise independence suffices
for this. To establish this, we first demonstrate that the expectation
of the product of independent random variables is the product of their
expectations.
Lemma 250
Let \(X\) and \(Y\) be independent random variables. Then
\[\Ex[XY] = \Ex[X] \cdot \Ex[Y] \; \text.\]
It follows that for pairwise independent random variables
\(X_1, \dots, X_n\),
\[\Var(X_1 + \dots + X_n) = \Var(X_1) + \dots + \Var(X_n) \; \text.\]
We can now state Chebyshev’s inequality.
Theorem 254(Chebyshev’s Inequality)
Let \(X\) be a random variable, and let \(a > 0\). Then
\[\Pr[\abs{X - \Ex[X]} \geq a] \leq \frac{\Var(X)}{a^2} \; \text.\]
Proof 255
We apply Markov’s inequality to the random variable
\(Y = (X - \Ex[X])^2\), which has expectation \(\Ex[Y] = \Var(X)\) by
definition. Observe that because \(Y\) is the square of a real
number, it is non-negative, so Markov’s inequality applies to it.
Furthermore, because \(a\) is positive, \(Y = (X - \Ex[X])^2 \geq a^2\)
holds if and only if \(\abs{X - \Ex[X]} \geq a\). Thus,
\[\Pr[\abs{X - \Ex[X]} \geq a] = \Pr[Y \geq a^2]
\leq \frac{\Ex[Y]}{a^2} = \frac{\Var(X)}{a^2} \; \text.\]
Example 256
Suppose we toss a fair coin \(n\) times. We use Chebyshev’s
inequality to calculate an upper bound on the probability of
getting \(\leq 49\%\) or \(\geq 51\%\) heads.
Let \(X\) be the number of heads. We previously showed that \(\Ex[X] =
n/2\), and \(\Var(X) = n/4\). Then
\[\lrPr{\abs{X - n/2} \geq 0.01 n}
\leq \frac{\Var(X)}{(0.01 n)^2}
= \frac{n/4}{10^{-4} n^2} = \frac{2500}{n} \; \text.\]
For example, for 10,000 tosses, the probability of deviating from
the expectation by 1% is at most 1/4, while for 1,000,000 tosses,
it is at most 1/400.
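As a sanity check on how loose this bound is, we can compare it against
simulation. A sketch, with illustrative parameters:
\begin{verbatim}
import random

def empirical_deviation(n, trials=1000):
    """Estimate Pr[|#heads - n/2| >= 0.01*n] over many experiments."""
    hits = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n))
        if abs(heads - n / 2) >= 0.01 * n:
            hits += 1
    return hits / trials

# Chebyshev guarantees at most 2500/n, e.g. 0.25 for n = 10,000; the
# empirical probability is closer to 0.05, so the bound is quite loose.
\end{verbatim}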
In the above example, the random variable \(X = X_1 + \cdots + X_n\) is
the sum of (mutually) independent, identically distributed (i.i.d.)
random variables \(X_i\). In such a scenario, it is often more
convenient to reason about the “normalized” value \(X/n\) rather than
\(X\) itself, since the expectation of \(X/n\) is independent of \(n\): by
linearity of expectation, \(\Ex[X] = n \cdot \Ex[X_i]\), whereas
\(\Ex[X/n] = \Ex[X]/n = \Ex[X_i]\). The following is an alternative
formulation of Chebyshev’s inequality for this case.
Corollary 257
Let \(X = X_1 + \cdots + X_n\) be the sum of \(n \geq 1\) independent,
identically distributed random variables \(X_i\). Let \(\varepsilon >
0\) be a deviation from the expectation \(\Ex[X/n] = \Ex[X_i]\). Then
\[\lrPr{\abs{X/n - \Ex[X_i]} \geq \varepsilon}
\leq \frac{\Var(X_i)}{\varepsilon^2 n} \; \text.\]
The event \(\abs{X/n-\Ex[X_i]} \geq \varepsilon\) is equivalent to
the event \(\abs{X-\Ex[X]} \geq \varepsilon n\), by multiplying both
sides by \(n\). By the standard form of Chebyshev’s inequality, we
have
\[\lrPr{\abs{X - \Ex[X]} \geq \varepsilon n}
\leq \frac{\Var(X)}{(\varepsilon n)^2}
= \frac{n \cdot \Var(X_i)}{\varepsilon^2 n^2}
= \frac{\Var(X_i)}{\varepsilon^2 n} \; \text,\]
where \(\Var(X) = n \cdot \Var(X_i)\) because the \(X_i\) are
independent.
In conclusion, Chebyshev’s inequality tells us that for \(n\)
independent coin tosses, the probability of deviating by at least
(say) 1% from the expected number of heads decreases at least linearly
in \(n\). In fact, the probability actually decreases much faster than
this! Using multiplicative Chernoff bounds or
Hoeffding’s inequality from the
following sections, we can show that the probability decreases
exponentially in the number of tosses.
Hoeffding’s inequality, also called Chernoff-Hoeffding bounds, is
a set of concentration bounds that can give tighter results than
Markov’s inequality, Chebyshev’s inequality, or other forms of
Chernoff bounds. For simplicity, we restrict ourselves to the special
case of independent indicator random variables having expectation
\(\Ex[X_i] = p\). In this special case, for the sum \(X = X_1 + \dots +
X_n\) of \(n\) such variables and any \(\varepsilon > 0\), Hoeffding’s
inequality gives the upper-tail bound
\[\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le e^{-2\varepsilon^2 n}\]
and the lower-tail bound
\[\lrPr{\frac{1}{n}X \le p - \varepsilon} \le e^{-2\varepsilon^2 n} \; \text.\]
We again consider the example of tossing a fair coin \(n\) times. Let
\(H\) be the total number of heads that come up, and let \(H_i\) be the
indicator random variable that indicates whether the \(i\)th toss comes
up heads. We have that \(\Ex[H_i] = \Pr[H_i = 1] = 1/2\), and \(\Ex[H/n]
= \Ex[H_i] = 1/2\) for any number of tosses \(n\).
What is a bound on the probability that in ten tosses, at least six
heads come up? This is a deviation of \(\varepsilon = 0.1\) from the
(normalized) expectation of \(p=0.5\), so applying the upper-tail
Hoeffding’s inequality gives us:
\[\lrPr{\frac{1}{10}H \ge 0.5 + 0.1}
\le e^{-2 \cdot 0.1^2 \cdot 10} = e^{-0.2} \approx 0.819 \; \text.\]
Now, what is a bound on the probability that in \(100\) tosses, at least
60 heads come up? This is again the same deviation \(\varepsilon = 0.1\)
from the expectation of \(p=0.5\); only the number of trials is
different. Applying the upper tail again, we get
\[\lrPr{\frac{1}{100}H \ge 0.5 + 0.1}
\le e^{-2 \cdot 0.1^2 \cdot 100} = e^{-2} \approx 0.135 \; \text.\]
This is a significantly tighter bound than what we would get from
Chebyshev’s inequality (Theorem 254); see
Example 256 for a
comparison. It is also tighter than what we would get from the
multiplicative Chernoff bound.
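Evaluating the Hoeffding bound for different sample sizes is a
one-liner; a sketch (the function name is ours):
\begin{verbatim}
import math

def hoeffding_upper_tail(n, eps):
    """Bound on Pr[X/n >= p + eps] for n independent indicators."""
    return math.exp(-2 * eps ** 2 * n)

hoeffding_upper_tail(10, 0.1)    # ~0.819
hoeffding_upper_tail(100, 0.1)   # ~0.135
\end{verbatim}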
Example 260
Suppose we have a coin that is biased by probability \(\varepsilon\)
towards either heads or tails, but we don’t know which one. In other
words, either:
\(\Pr[\text{heads}] = \frac{1}{2} + \varepsilon\) and
\(\Pr[\text{tails}] = \frac{1}{2} - \varepsilon\), or
\(\Pr[\text{heads}] = \frac{1}{2} - \varepsilon\) and
\(\Pr[\text{tails}] = \frac{1}{2} + \varepsilon\).
To determine in which direction the coin is biased, we toss the
coin \(n\) times. If the results include at least \(n/2\) heads, we
assume the coin is biased towards heads, otherwise we assume it is
biased towards tails. How many tosses should we do to guarantee our
answer is correct with probability at least \(1-\delta\)?
Let \(X\) be the number of heads, and let \(X_i\) be an indicator that
is 1 if the \(i\)th toss is heads, 0 if it is tails. Then \(X = X_1 +
\dots + X_n\) is the sum of independent indicator random variables
with \(\Ex[X_i]\) equal to either \(\frac{1}{2} + \varepsilon\) or to
\(\frac{1}{2} - \varepsilon\). We analyze the two cases individually.
Case 1: The coin is biased towards heads. Then \(\Ex[X_i] =
p = \frac{1}{2} + \varepsilon\). Our guess is erroneous when fewer
than \(n/2\) of the tosses come up heads, i.e., when \(\frac{1}{n}X
\le p - \varepsilon\). By the lower-tail Hoeffding bound,
\[\lrPr{\frac{1}{n}X \le p - \varepsilon}
\le e^{-2\varepsilon^2 n} \; \text.\]
Case 2: The coin is biased towards tails. Then \(\Ex[X_i] =
p = \frac{1}{2} - \varepsilon\). Our guess is erroneous when at least
\(n/2\) of the tosses come up heads, i.e., when \(\frac{1}{n}X
\ge p + \varepsilon\). By the upper-tail Hoeffding bound,
\[\lrPr{\frac{1}{n}X \ge p + \varepsilon}
\le e^{-2\varepsilon^2 n} \; \text.\]
In either case, the probability of error after \(n\) tosses is
upper-bounded by \(e^{-2\varepsilon^2 n}\). We want this probability
to be no more than \(\delta\):
\[e^{-2\varepsilon^2 n} \le \delta
\iff n \ge \frac{\ln(1/\delta)}{2\varepsilon^2} \; \text.\]
If \(\varepsilon = 0.01\) (i.e. the coin is biased towards heads or
tails by 1%) and \(\delta = 0.0001\) (we want to be correct at least
99.99% of the time), then
\[n \ge \frac{\ln(1/0.0001)}{2 \cdot 0.01^2}
= \frac{\ln 10^4}{0.0002} \approx 46{,}052 \; \text,\]
so about 46,052 tosses suffice.
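The bound gives a simple recipe for the required number of tosses; a
sketch (the function name is ours):
\begin{verbatim}
import math

def tosses_needed(eps, delta):
    """Smallest n with exp(-2*eps^2*n) <= delta."""
    return math.ceil(math.log(1 / delta) / (2 * eps ** 2))

tosses_needed(0.01, 0.0001)   # 46052
\end{verbatim}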
Exercise 261
Suppose we have a coin that is either fair, or is biased by
probability \(\varepsilon\) towards heads, but we don’t know which
is the case. In other words, either:
\(\Pr[\text{heads}] = \frac{1}{2}\) and
\(\Pr[\text{tails}] = \frac{1}{2}\), or
\(\Pr[\text{heads}] = \frac{1}{2} + \varepsilon\) and
\(\Pr[\text{tails}] = \frac{1}{2} - \varepsilon\).
We toss the coin \(n\) times and count the number of heads \(X\).
For what values of \(X\) should we guess the coin is fair, and for
what values that it is biased towards heads?
How many tosses should we do to guarantee our answer is correct
with probability at least \(1 - \delta\)?
How many tosses should we do for \(\varepsilon = 0.01\) and \(\delta
= 0.0001\)? How does this compare to the situation in
Example 260 with
\(\varepsilon = 0.005\) and \(\delta = 0.0001\)?
Rather than applying concentration bounds to compute the probability
of a deviation for a specific sample size, we often wish to determine
how many samples we need to be within a particular deviation with high
confidence. One application of this is big data, where we have vast
datasets that are too large to examine in their entirety, so we sample
the data instead to estimate the quantities of interest. Similar to
this is polling – outside of elections themselves, we typically do
not have the resources to ask the entire population for their
opinions, so we need to sample the populace to estimate the support
for a particular candidate or political position. In both
applications, we need to determine how many samples are needed to
obtain a good estimate.
In general, a poll estimates the fraction of the population that
supports a particular candidate by asking \(n\) randomly chosen people
whether they support that candidate. Let \(X\) be a random variable
corresponding to the number of people who answer this question in the
affirmative. Then \(X/n\) is an estimate of the level of support in the
full population.
A typical poll has both a confidence level and a margin of error
– the latter corresponds to the deviation from the true fraction \(p\)
of people who support the candidate, and the former corresponds to a
bound on the probability that the estimate is within that deviation.
For example, a 95% confidence level and a margin of error of ±2%
requires that
\[\lrPr{\abs{X/n - p} \le 0.02} \ge 0.95 \; \text.\]
By the two-sided Hoeffding bound,
\[\lrPr{\abs{X/n - p} \ge 0.02} \le 2e^{-2 \cdot 0.02^2 \cdot n} \le 0.05
\iff n \ge \frac{\ln 40}{2 \cdot 0.02^2} \approx 4611.1 \; \text.\]
Thus, we obtain the given confidence level and margin of error by
polling at least 4612 people. Observe that this does not depend on the
total population size!
For an arbitrary margin of error ±\(\varepsilon\) and confidence level
\(1 - \gamma\), we obtain:
\[n \ge \frac{\ln(2/\gamma)}{2\varepsilon^2} \; \text.\]
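This yields a simple sample-size calculator; a sketch (the function
name is ours):
\begin{verbatim}
import math

def poll_sample_size(eps, gamma):
    """People to poll for margin ±eps at confidence 1 - gamma."""
    return math.ceil(math.log(2 / gamma) / (2 * eps ** 2))

poll_sample_size(0.02, 0.05)   # 4612, matching the example above
\end{verbatim}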
More generally, if we wish to gauge the level of support for \(m\)
different candidates, the sampling theorem tells us that the number
of samples required is logarithmic in \(m\).
Theorem 262(Sampling Theorem)
Suppose \(n\) people are polled to ask which candidate they support,
out of \(m\) possible candidates. Let \(\Xj\) be the number of people
who state that they support candidate \(j\), and let \(p_j\) be the
true level of support for that candidate. We wish to obtain
\[\lrPr{\abs{\Xj/n - p_j} \le \varepsilon \text{ for all } j}
\ge 1 - \gamma \; \text.\]
In other words, we desire a confidence level \(1 - \gamma\) that all
estimates \(\Xj/n\) are within margin of error ±\(\varepsilon\) of the
true values \(p_j\). We obtain this when the number of samples \(n\) is
\[n \ge \frac{\ln(2m/\gamma)}{2\varepsilon^2} \; \text.\]
The sampling theorem can be derived by applying the union bound: each
individual estimate \(\Xj/n\) deviates by more than \(\varepsilon\)
with probability at most \(2e^{-2\varepsilon^2 n}\), so the probability
that any of the \(m\) estimates deviates is at most
\(m \cdot 2e^{-2\varepsilon^2 n}\), which is at most \(\gamma\) for the
above \(n\).
In conclusion, when sampling from a large dataset, the number of
samples required does depend on the desired accuracy of the estimation
and the range size (i.e. number of possible answers). But it does not
depend on the population size. This makes sampling a powerful
technique for dealing with big data, as long as we are willing to
tolerate a small possibility of obtaining an inaccurate estimate.
Job scheduling is another application where we can exploit randomness
to construct a simple, highly effective algorithm. In this problem, we
have \(k\) servers, and there are \(n\) jobs that need to be distributed
to these servers. The goal is to balance the load among the servers,
so that no one server is overloaded. This problem is very relevant to
content-delivery networks – there may be on the order of millions or
even billions of concurrent users, and each user needs to be routed to
one of only hundreds or thousands of servers. Thus, we have \(n \gg k\)
in such a case.
One possible algorithm is to always send a job to the most lightly
loaded server. However, this requires significant coordination – the
scheduler must keep track of the load on each server, which requires
extra communication, space, and computational resources. Instead, we
consider a simple, randomized algorithm that just sends each job to a
random server. The expected number of jobs on each server is \(n/k\).
But how likely are we to be close to this ideal, balanced load?
We start our analysis by defining random variables \(\Xj\) corresponding
to the number of jobs assigned to server \(j\). We would like to
demonstrate that the joint probability of \(\Xj\) being close to \(n/k\)
for all \(j\) is high. We first reason about the load on an individual
server. In particular, we wish to compute a bound on
\[\lrPr{\Xj \ge \frac{n}{k} + c}\]
for some value \(c > 0\), i.e. the probability that server \(j\) is
overloaded by at least \(c\) jobs. Let \(\Xj_i\) be an indicator random
variable that is 1 if job \(i\) is sent to server \(j\). Since the target
server for job \(i\) is chosen uniformly at random out of \(k\) possible
choices, we have
\[\lrEx{\Xj_i} = \lrPr{\Xj_i = 1} = \frac{1}{k}\]
We also have \(\Xj = \Xj_1 + \dots + \Xj_n\), giving us
\[\lrEx{\Xj} = \lrEx{\Xj_1} + \dots + \lrEx{\Xj_n} = \frac{n}{k} \; \text.\]
Since \(\Xj\) is a sum of independent indicator random variables, the
upper-tail Hoeffding bound with deviation \(\varepsilon = c/n\) gives
\[\lrPr{\Xj \ge \frac{n}{k} + c}
= \lrPr{\frac{1}{n}\Xj \ge \frac{1}{k} + \frac{c}{n}}
\le e^{-2c^2/n} \; \text.\]
Now that we have a bound on an individual server being overloaded by
\(c\) jobs, we wish to bound the probability that there is some server
that is overloaded. The complement event is that no server is
overloaded, so we can equivalently compute the joint probability that
none of the servers is overloaded by \(c\) or more jobs:
However, we cannot do so by simply multiplying the individual
probabilities together – that only works if the probabilities are
independent, and in this case, they are not. In particular, if one
server is overloaded, some other server must be underloaded, so the
random variables \(\Xj\) are not independent.
Rather than trying to work with the complement event, we apply the
union bound to directly bound the probability that there is at least
one overloaded server:
\[\lrPr{\exists\, j : \Xj \ge \frac{n}{k} + c}
\le \sum_{j=1}^{k} \lrPr{\Xj \ge \frac{n}{k} + c}
\le k \cdot e^{-2c^2/n} \; \text.\]
Choosing \(c = \sqrt{n \ln k}\) makes this bound
\(k \cdot e^{-2 \ln k} = 1/k\). For \(n = 10^{10}\) jobs and \(k =
1000\) servers, we get \(c = \sqrt{10^{10} \cdot \ln 1000} \approx 2.63
\times 10^5\), while \(n/k = 10^7\).
This is an overhead of about 2.6% above the expected load. The
probability that there is a server overloaded by at least 2.6% is
bounded from above by \(1/k = 0.001\), or ≤0.1%. Thus, when there are \(n
= 10^{10}\) jobs and \(k = 1000\) servers, the randomized algorithm has a
high probability (≥99.9%) of producing a schedule where the servers
all have a low overhead (≤2.6%).
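A quick simulation illustrates this concentration; a sketch with
illustrative parameters (smaller than the example’s \(n = 10^{10}\) to
keep the runtime reasonable):
\begin{verbatim}
import random
from collections import Counter

def max_overhead(n, k):
    """Send each of n jobs to a uniformly random server among k;
    return the worst server's relative overload above the ideal n/k."""
    loads = Counter(random.randrange(k) for _ in range(n))
    return max(loads.values()) / (n / k) - 1

# With n = 1_000_000 jobs and k = 100 servers, the maximum overhead
# is typically only a few percent.
print(max_overhead(1_000_000, 100))
\end{verbatim}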
Exercise 263
Previously, we demonstrated that to use
polling to achieve a confidence level \(1 - \gamma\) and a margin of
error ±\(\varepsilon\),