# Appendix

## Proof of the Master Theorem

We first prove the standard master theorem without log factors.

We prove the master theorem for the case where \(n\) is a power of \(b\); when this is not the case, we can pad the input size to the smallest power of \(b\) that is larger than \(n\), which increases the input size by at most the constant factor \(b\) (can you see why?).

First, we determine how many subproblems we have at each level of the recursion. Initially, we have a single problem of size \(n\). We subdivide it into \(k\) subproblems, each of size \(n/b\). Each of those gets subdivided into \(k\) subproblems of size \(n/b^2\), for a total of \(k^2\) problems of size \(n/b^2\). Those \(k^2\) problems in turn are subdivided into a total of \(k^3\) problems of size \(n/b^3\).

In general, at the \(i\)th level of recursion (with \(i = 0\) the initial problem), there are \(k^i\) subproblems of size \(n/b^i\). We assume that the base case is when the problem size is 1, which is when

Thus, there are \(1 + \log_b n\) total levels in the recursion.

We now consider how much work is done in each subproblem, which corresponds to the \(\O(n^d)\) term of the recurrence relation. For a subproblem of size \(n/b^i\), this is

With \(k^i\) subproblems at level \(i\), the total work \(T_i\) at level \(i\) is

Summing over all the levels, we get a total work

where \(r = k/b^d\).

Since we are working with asymptotics, we only care about the dominating term in the sum

There are three cases:

\(r < 1\): The terms have decreasing value, so that the initial term 1 is the largest and dominates [1]. Thus, we have

\[\begin{split}T &= \O\parens{n^d \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d)\end{split}\]\(r = 1\): The terms all have the same value 1. Then

\[\begin{split}T &= \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} r^i}\\ &= \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} 1}\\ &= \O\parens{n^d \cdot (1 + \log_b n)}\\ &= \O\parens{n^d + n^d \log_b n)}\\ &= \O(n^d \log_b n)\end{split}\]where we discard the lower-order term in the last step. Since logarithms of different bases differ only by a multiplicative constant factor (see the logarithmic identity for changing the base), which does not affect asymptotics, we can elide the base:

\[\begin{split}T &= \O(n^d \log_b n)\\ &= \O(n^d \log n)\end{split}\]\(r > 1\): The terms have increasing value, so that the final term \(r^{\log_b n}\) is the largest. Then we have

\[\begin{split}T &= \O\parens{n^d \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O\parens{n^d r^{\log_b n}}\\ &= \O\parens{n^d \parens{\frac{k}{b^d}}^{\log_b n}}\\ &= \O\parens{n^d \cdot k^{\log_b n} \cdot b^{-d \log_b n}}\\ &= \O\parens{n^d \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-d}}\end{split}\]To simplify this further, we observe the following:

\(b^{\log_b n} = n\), which we can see by taking \(\log_b\) of both sides.

\(k^{\log_b n} = n^{\log_b k}\). We show this as follows:

\[k^{\log_b n} = \parens{b^{\log_b k}}^{\log_b n}\]Here, we applied the previous observation to substitute \(k = b^{\log_b k}\). We proceed to get

\[\begin{split}k^{\log_b n} &= \parens{b^{\log_b k}}^{\log_b n}\\ &= \parens{b^{\log_b n}}^{\log_b k}\\ &= n^{\log_b k}\end{split}\]applying the prior observation once again.

Applying both observations, we get

\[\begin{split}T &= \O\parens{n^d \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-d}}\\ &= \O\parens{n^d \cdot n^{\log_b k} \cdot n^{-d}}\\ &=\O\parens{n^{\log_b k}}\end{split}\]Here, we cannot discard the base of the logarithm, as it appears in the exponent and so affects the value by an exponential rather than a constant factor.

Thus, we have demonstrated all three cases of the master theorem.

We now proceed to prove the version with log factors.

We need to modify the reasoning in Proof 211 to account for the amount of work done in each subproblem. For a subproblem of size \(n/b^i\), we now have work

We consider the cases of \(k/b^d \le 1\) and \(k/b^d > 1\) separately.

\(k/b^d \le 1\): We compute an upper bound on the work as follows:

\[\begin{split}\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}} &= \O\parens{\frac{n^d}{b^{id}} \cdot (\log^w n - \log^w b^i)}\\ &= b^{-id} \cdot \O(n^d \cdot (\log^w n - \log^w b^i))\\ &= b^{-id} \cdot \O(n^d \log^w n)\end{split}\]since \(n^d \cdot (\log^w n - \log^w b^i) \le n^d \log^w n\) for \(b > 1\), and we only need an upper bound for the asymptotics. Then the work \(T_i\) at level \(i\) is

\[\begin{split}T_i &= \frac{k^i}{b^{id}} \cdot \O(n^d \log^w n)\\ &= \parens{\frac{k}{b^d}}^i \cdot \O(n^d \log^w n)\end{split}\]Summing over all the levels, we get a total work

\[\begin{split}T &= \sum_{i=0}^{\log_b n} T_i\\ &= \O(n^d \log^w n) \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^d}}^i\\ &= \O(n^d \log^w n) \cdot \sum_{i=0}^{\log_b n} r^i\\ &= \O\parens{n^d \log^w n \cdot \sum_{i=0}^{\log_b n} r^i}\end{split}\]where \(r = k/b^d\). When \(r < 1\), the initial term of the summation dominates as before, so we have

\[\begin{split}T &= \O\parens{n^d \log^w n \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d \log^w n)\end{split}\]When \(r = 1\), the terms all have equal size, also as before. Then

\[\begin{split}T &= \O\parens{n^d \log^w n \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d \log^w n \cdot (1 + \log_b n))\\ &= \O(n^d \log^w n \log_b n)\\ &= \O(n^d \log^w n \log n)\\ &= \O(n^d \log^{w+1} n)\end{split}\]As before, we drop the lower order term, as well as the base in \(\log_b n\) since it only differs by a constant factor from \(\log n\).

\(k/b^d > 1\): We start by observing that since \(\log^w n\) is subpolynomial in \(n\), it grows slower than any polynomial in \(n\). We have

\[\frac{k}{b^d} > 1\]or equivalently

\[\begin{split}\log_b k &> d\\ 0 &< \log_b k - d\end{split}\]We choose a small exponent \(\epsilon\) such that

\[0 < \epsilon < \log_b k - d\]For instance, we can choose \(\epsilon = (\log_b k - d) / 2\). Then

\[\begin{split}\log_b k &> d + \epsilon\\ k &> b^{d+\epsilon}\\ \frac{k}{b^{d+\epsilon}} > 1\end{split}\]We also have

\[\log^w n \le C n^\epsilon\]for a sufficiently large constant \(C\) and \(n \ge 1\). For a subproblem of size \(n/b^i\), we now have work

\[\begin{split}\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}} &= \O\parens{\parens{\frac{n}{b^i}}^d \cdot C \parens{\frac{n}{b^i}}^\epsilon}\\ &= \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}\end{split}\]Then our work \(T_i\) at level \(i\) is

\[T_i = k^i \cdot \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}\]The total work \(T\) is

\[\begin{split}T &= \sum_{i=0}^{\log_b n} \parens{k^i \cdot \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}}\\ &= \O\parens{\sum_{i=0}^{\log_b n} \parens{k^i \cdot C \parens{\frac{n}{b^i}}^{d+\epsilon}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{k^i \cdot \parens{b^{-i}}^{d+\epsilon}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{k^i \cdot \parens{b^{d+\epsilon}}^{-i}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^{d+\epsilon}}}^i}\end{split}\]Since \(k/b^{d+\epsilon} > 1\), the last term in the summation dominates as in Proof 211, so we get

\[\begin{split}T &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^{d+\epsilon}}}^i}\\ &= \O\parens{C n^{d+\epsilon} \cdot \parens{\frac{k}{b^{d+\epsilon}}}^{\log_b n}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot \parens{b^{-(d+\epsilon)}}^{\log_b n}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-(d+\epsilon)}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot n^{-(d+\epsilon)}}\\ &= \O\parens{C k^{\log_b n}}\\ &= \O\parens{C n^{\log_b k}}\end{split}\]In the last step, we applied the observation that \(k^{\log_b n} = n^{\log_b k}\). Folding the constant \(C\) into the big-O notation, we end up with

\[\begin{split}T &= \O\parens{C k^{\log_b n}}\\ &= \O\parens{n^{\log_b k}}\end{split}\]as required.

## Alternative Analysis of Quick Sort

We can conduct an alternative analysis of the quick-sort
algorithm by modifying the algorithm so that all the
randomness is fixed at the beginning. Given \(n\) elements, we start by
choosing a permutation of the numbers \(\{1,2,\dots,n\}\) uniformly at
random to be the *priorities* of the elements. Then when choosing a
pivot for a subset of the elements, we pick the element that has the
lowest priority over that subset. The following illustrates an
execution of this modified algorithm:

In the first step, the element 37 has the lowest priority, so it is chosen as the pivot. At the second level, the element 15 has the minimal priority in the first recursive call, while the element 86 has the lowest priority in the second call. At the third level of the recursion, the element 4 has lowest priority over its subset, and similarly the element 52 for the subset \(\{70, 52\}\).

The full definition of the modified algorithm is as follows:

This modified algorithm matches the behavior of the original randomized quick sort in two important ways:

In both algorithms, when a pivot is chosen among a subset consisting of \(m\) elements, each element is chosen with a uniform probability \(1/m\). This is explicitly the case in the original version. In the modified version, the priorities assigned to each of the \(m\) elements are unique, and for any set of \(m\) priorities, each element receives the lowest priority in exactly \(1/m\) of the permutations of those priorities.

The two algorithms compare elements in exactly the same situation – when partitioning the array, each element is compared to the pivot. (The modified version also compares priorities to find the minimal, but we can just ignore that in our analysis. Even if we did consider them, they just increase the number of comparisons by a constant factor of two.)

Thus, the modified algorithm models the execution of the original algorithm in both the evolution of the recursion as well as the element comparisons. We proceed to analyze the modified algorithm, and the result will equally apply to the original.

For simplicity, we assume for the purposes of analysis that the
initial array does not contain any duplicate elements. We have defined
our two versions of quick sort to be *stable*, meaning that they
preserve the relative ordering of duplicate elements. Thus, even when
duplicates are present, the algorithm distinguishes between them,
resulting in the same behavior as if there were no duplicates.

Let \(A\) be an array of \(n\) elements, and let \(S\) be the result of sorting \(A\). Let \(priority(S[i])\) be the priority that was assigned to the element \(S[i]\) in the modified quick-sort algorithm. Then \(S[i]\) and \(S[j]\) were compared by the algorithm if and only if \(i \ne j\), and one of the following holds:

In other words, \(S[i]\) and \(S[j]\) are compared exactly when the lowest priority among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\) is that of either \(S[i]\) or \(S[j]\).

The claim above holds regardless of the original order of the elements in \(A\). For example, take a look at the result from our illustration of the modified algorithm above:

As stated above, two elements are compared only in the partitioning step, and one of the two must be the pivot. Referring to the execution above, we can see that 0 and 15 were compared in the second level of the recursion, whereas 52 and 99 were never compared to each other. Now consider a different initial ordering of the elements, but with the same priorities assigned to each element:

The execution of the algorithm is essentially the same! It still picks 37 as the first pivot, since it has the lowest priority, and forms the same partitions; the only possible difference is the ordering of a partition. Subsequent steps again pick the same pivots, since they are finding the minimal priority over the same subset of elements and associated priorities. Thus, given just the sorted set of elements and their priorities, we can determine which elements were compared as stated in Claim 214, regardless of how they were initially ordered. We proceed to prove Claim 214:

First, we show that if the lowest priority among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\) is that of either \(S[i]\) or \(S[j]\), then \(S[i]\) and \(S[j]\) are compared. Without loss of generality, suppose \(S[i]\) is the element with the lowest priority in this subset. This implies that no element in \(S[i+1], \dots, S[j]\) will be picked as a pivot prior to \(S[i]\) being picked. Thus, all pivots chosen by the algorithm before \(S[i]\) must be either less than \(S[i]\) or greater than \(S[j]\). For each such pivot, the elements \(S[i], S[i+1], \dots, S[j]\) are all placed in the same partition – they are in sorted order, and if the pivot is less than \(S[i]\), all of these elements are larger than the pivot, while if the pivot is greater than \(S[j]\), then all these elements are smaller than the pivot. Then when the algorithm chooses \(S[i]\) as the pivot, \(S[j]\) is in the same subset of the array as \(S[i]\) and thus will be compared to \(S[i]\) in the partitioning step. The same reasoning holds for when \(S[j]\) has the lowest priority – it will eventually be chosen as the pivot, at which point \(S[i]\) will be in the same subset and will be compared to \(S[j]\).

Next, we show that if the lowest priority among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\) is neither that of \(S[i]\) nor of \(S[j]\), then \(S[i]\) and \(S[j]\) are never compared. As long as the algorithm chooses pivots outside of the set \(S[i], S[i+1], \dots, S[j]\), the two elements \(S[i]\) and \(S[j]\) remain in the same subset of the array without having been compared to each other – they remain in the same subset since the pivots so far have been either less than both \(S[i]\) and \(S[j]\) or greater than both of them, and they have not been compared yet since neither has been chosen as a pivot. When the algorithm first chooses a pivot from among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\), the pivot it chooses must be one of \(S[i+1], \dots, S[j-1]\), since one of those elements has lower priority than both \(S[i]\) and \(S[j]\). In the partitioning step for that pivot, \(S[i]\) and \(S[j]\) are placed into separate partitions, since \(S[i]\) is smaller than the pivot while \(S[j]\) is larger than the pivot. Since \(S[i]\) and \(S[j]\) end up in separate partitions, they cannot be compared in subsequent steps of the algorithm.

We can now proceed to compute the expected number of comparisons. Let \(X_{ij}\) be an indicator random variable such that \(X_{ij} = 1\) if \(S[i]\) and \(S[j]\) were compared at some point in the algorithm, \(X_{ij} = 0\) otherwise. From Claim 214, we can conclude that

This is because each of the \(j-1+1\) elements in \(S[i], S[i+1], \dots, S[j-1], S[j]\) has equal chance of having the lowest priority among those elements, and \(S[i]\) and \(S[j]\) are compared when one of those two elements has lowest priority among this set.

We also observe that a pair of elements is compared at most once – they are compared once in the partitioning step if they are in the same subarray and one of the two is chosen as a pivot, and after the partitioning step is over, the pivot is never compared to anything else. Thus, the total number of comparisons \(X\) is just the number of pairs that are compared, out of the \(n(n-1)/2\) distinct pairs:

The rest of the analysis is as before, where we concluded that \(\Ex[X] \le 2 n \ln n = \O(n \log n)\). Since the modified quick-sort algorithm models the behavior of the original randomized algorithm, this result also applies to the latter.

## Proof of the Simplified Multiplicative Chernoff Bounds

We previously showed that the unsimplified Chernoff bounds

hold. We now demonstrate that the simplified bounds

follow from the unsimplified bounds.

We first consider the upper-tail Chernoff bound. For \(\delta > 0\), we have

From a list of logarithmic identities, we obtain the inequality

for \(x > 0\). This gives us

Applying this to our Chernoff-bound expression with \(x = \delta\), we get

resulting in the simplified upper-tail bound.

For the lower-tail Chernoff bound, we have for \(0 < \delta < 1\):

The Taylor series for the natural logarithm gives us:

The second step follows since \(\frac{x^3}{3} + \frac{x^4}{4} + \dots\) is positive for \(x > 0\). Applying the result to our Chernoff-bound expression with \(x = \delta\), we obtain

Since \(\delta > 0\), we have \(\delta^3/2 > 0\), and

Thus,

completing our proof of the simplified lower-tail bound.

## Proof of the Upper-Tail Hoeffding’s Inequality

The upper-tail Hoeffding’s inequality we use (Theorem 150) states that for a sequence of independent indicator random variables \(X_1, \dots, X_n\),

where \(X = X_1 + \dots + X_n\), \(\Ex[X_i] = p_i\), and \(\lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i = p\). We proceed to prove a more general variant of this bound:

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent random variables such that \(X_i \in [0, 1]\), and the \(X_i\) have expectation \(\Ex[X_i] = p_i\), respectively. Let

Let \(\varepsilon > 0\) be a deviation from the expectation. Then

(Note that \(\expp{x}\) is alternate notation for \(e^{x}\).)

Here, we do not require that the \(X_i\) are indicators, just that they are in the range \([0, 1]\).

We proceed to prove Theorem 217. We have

The latter step holds for all \(s \ge 0\). The process of working with
the random variable \(e^{sX}\) rather than \(X\) directly is the *Chernoff
bounding technique*; the idea is that small deviations in \(X\) turn
into large deviations in \(e^{sX}\), so that Markov’s inequality
(Theorem 128) produces a more useful result.
We will choose an appropriate value of \(s\) later. For any value of
\(s\), \(e^{sX}\) is nonnegative, so we can apply Markov’s inequality to
obtain:

In the last step, we applied the fact that the \(X_i\) are independent to conclude that the random variables \(e^{sX_i}\) are also independent, and therefore by Lemma 140, the expectation of their product is the product of their expectations.

We now need to establish a bound on \(\Ex[e^{sX_i}]\).

Let \(X\) be a random variable such that \(X \in [0, 1]\) and \(\Ex[X] = p\). Then

for all \(s \in \R\).

We first observe that because \(e^{sx}\) is a convex function, we can bound it from above on the interval \([0, 1]\) by the line that passes through the endpoints \(1\) and \(e^s\):

This line has slope \(e^s - 1\) and y-intercept \(1\), giving us \(xe^s - x + 1\). Thus,

in the interval \(0 \le x \le 1\). Then

where the second step follows from linearity of expectation. We now have an upper bound on \(\Ex[e^{sX}]\), but it is not a convenient one – we would like it to be of the form \(e^\alpha\), so that we can better use it in proving Theorem 217, for which we’ve already obtained a bound that contains an exponential. Using the fact that \(z = e^{\ln z}\), we get

To further bound this expression, let

Then finding an upper bound for \(\phi(s)\) will in turn allow us to bound \(e^{\phi(s)}\). By the Lagrange form of Taylor’s theorem, we get

for some \(v \in [0, s]\). To differentiate \(\phi(s)\), we let \(f(s) = \ln s\) and \(g(s) = pe^s - p + 1\), so that \(\phi(s) = (f \circ g)(s)\). Then by the chain rule, we have [2]

The function \(g(s) = pe^s - p + 1\) does not have any real roots for \(p \in [0, 1]\), so it is safe to divide by it.

To compute \(\phi''(s)\), we now let \(\hat{f}(s) = pe^s\), so that \(\phi'(s) = \hat{f}(s)/g(s)\). Then by the quotient rule, we get

where \(t = \frac{pe^s}{pe^s - p + 1}\). Observe that \(t(1 - t)\) is a parabola with a maximum at \(t = \frac{1}{2}\):

Setting \(h'(t) = 1 - 2t = 0\), we get that \(t = \frac{1}{2}\) is an extremum, and since the second derivative \(h''(\frac{1}{2})\) is negative, it is a maximum. Thus, we have

We can now plug everything into our result from applying Taylor’s theorem:

Thus,

as claimed.

Continuing our proof of Theorem 217, we have

Applying Lemma 218, we get

We now choose \(s\) to minimize the exponent \(r(s) = -sn\varepsilon + s^2n/8\). We have:

We have another parabola, and since the second derivative is positive, we obtain a minimum for \(r(s)\) at

Then

Thus,

completing our proof of Theorem 217.

## General Case of Hoeffding’s Inequality

We can further generalize Theorem 217 to the case where the individual random variables \(X_i\) are in the range \([a_i, b_i]\). We start by generalizing Lemma 218 to obtain Hoeffding’s lemma.

Let \(X\) be a random variable such that \(X \in [a, b]\), where \(a < b\), and \(\Ex[X] = p\). Then

for all \(s \in \R\).

Let \(X' = \frac{X - a}{b - a}\). Then \(X'\) is a random variable such that \(X' \in [0, 1]\), and

by linearity of expectation. Let \(p' = \frac{p - a}{b - a}\). Applying Lemma 218 to \(X'\), we get

Now consider \(e^{sX}\). We have \(X = a + (b - a)X'\), so

Let \(s' = s(b - a)\). From the result above, we have

Substituting \(p' = \frac{p - a}{b - a}\) and \(s' = s(b - a)\), we obtain the following for the exponent:

Thus,

as claimed.

We can now derive the general case of Hoeffding’s inequality.

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent random variables such that \(X_i \in [a_i, b_i]\) for \(a_i < b_i\) and \(\Ex[X_i] = p_i\). Let

Let \(\varepsilon > 0\) be a deviation from the expectation. Then

We proceed as in the proof of Theorem 217 to obtain

Applying Hoeffding’s lemma (Lemma 220), we obtain

We choose \(s\) to minimize the exponent \(r(s)\):

Since \(a_i < b_i\), we have \(b_i - a_i > 0\) for all \(i\), so that \(r''(s)\) is positive. Thus, we obtain a minimum for \(r(s)\) when \(r'(s) = 0\):

Then

Thus,

completing our proof of the general case of Hoeffding’s inequality.