
# Appendix¶

## Proof of the Master Theorem¶

We first prove the standard master theorem without log factors.

Proof 211

We prove the master theorem for the case where $$n$$ is a power of $$b$$; when this is not the case, we can pad the input size to the smallest power of $$b$$ that is larger than $$n$$, which increases the input size by at most the constant factor $$b$$ (can you see why?).

First, we determine how many subproblems we have at each level of the recursion. Initially, we have a single problem of size $$n$$. We subdivide it into $$k$$ subproblems, each of size $$n/b$$. Each of those gets subdivided into $$k$$ subproblems of size $$n/b^2$$, for a total of $$k^2$$ problems of size $$n/b^2$$. Those $$k^2$$ problems in turn are subdivided into a total of $$k^3$$ problems of size $$n/b^3$$.

In general, at the $$i$$th level of recursion (with $$i = 0$$ the initial problem), there are $$k^i$$ subproblems of size $$n/b^i$$. We assume that the base case is when the problem size is 1, which is when

$\begin{split}1 &= \frac{n}{b^i}\\ b^i &= n\\ i &= \log_b n\end{split}$

Thus, there are $$1 + \log_b n$$ total levels in the recursion.

We now consider how much work is done in each subproblem, which corresponds to the $$\O(n^d)$$ term of the recurrence relation. For a subproblem of size $$n/b^i$$, this is

$\begin{split}\O\parens{\parens{\frac{n}{b^i}}^d} &= \O\parens{\frac{n^d}{b^{id}}}\\ &= b^{-id} \cdot \O(n^d)\end{split}$

With $$k^i$$ subproblems at level $$i$$, the total work $$T_i$$ at level $$i$$ is

$\begin{split}T_i &= \frac{k^i}{b^{id}} \cdot \O(n^d)\\ &= \parens{\frac{k}{b^d}}^i \cdot \O(n^d)\end{split}$

Summing over all the levels, we get a total work

$\begin{split}T &= \sum_{i=0}^{\log_b n} T_i\\ &= \sum_{i=0}^{\log_b n} \parens{\parens{\frac{k}{b^d}}^i \cdot \O(n^d)}\\ &= \O(n^d) \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^d}}^i\\ &= \O(n^d) \cdot \sum_{i=0}^{\log_b n} r^i\\ &= \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} r^i}\\ &= \O\parens{n^d \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\end{split}$

where $$r = k/b^d$$.

Since we are working with asymptotics, we only care about the dominating term in the sum

$\sum_{i=0}^{\log_b n} r^i = 1 + r + r^2 + \dots + r^{\log_b n}$

There are three cases:

• $$r < 1$$: The terms have decreasing value, so that the initial term 1 is the largest and dominates 1. Thus, we have

$\begin{split}T &= \O\parens{n^d \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d)\end{split}$
1

We’re being a bit sloppy here – the largest term in a sum doesn’t necessarily dominate, as can be seen with a divergent series such as the harmonic series $$\sum_{n=1}^\infty 1/n$$. However, what we have here is a geometric series $$\sum_{i=0}^n r^i$$, which has the closed-form solution $$\sum_{i=0}^n r^i = (1 - r^{n+1}) / (1 - r)$$ for $$r \ne 1$$. When $$0 \le r < 1$$ (and therefore $$r^{n+1} < 1$$), this is bounded from above by the constant $$1 / (1 - r)$$, so the sum is $$\O(1)$$.

• $$r = 1$$: The terms all have the same value 1. Then

$\begin{split}T &= \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} r^i}\\ &= \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} 1}\\ &= \O\parens{n^d \cdot (1 + \log_b n)}\\ &= \O\parens{n^d + n^d \log_b n)}\\ &= \O(n^d \log_b n)\end{split}$

where we discard the lower-order term in the last step. Since logarithms of different bases differ only by a multiplicative constant factor (see the logarithmic identity for changing the base), which does not affect asymptotics, we can elide the base:

$\begin{split}T &= \O(n^d \log_b n)\\ &= \O(n^d \log n)\end{split}$
• $$r > 1$$: The terms have increasing value, so that the final term $$r^{\log_b n}$$ is the largest. Then we have

$\begin{split}T &= \O\parens{n^d \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O\parens{n^d r^{\log_b n}}\\ &= \O\parens{n^d \parens{\frac{k}{b^d}}^{\log_b n}}\\ &= \O\parens{n^d \cdot k^{\log_b n} \cdot b^{-d \log_b n}}\\ &= \O\parens{n^d \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-d}}\end{split}$

To simplify this further, we observe the following:

• $$b^{\log_b n} = n$$, which we can see by taking $$\log_b$$ of both sides.

• $$k^{\log_b n} = n^{\log_b k}$$. We show this as follows:

$k^{\log_b n} = \parens{b^{\log_b k}}^{\log_b n}$

Here, we applied the previous observation to substitute $$k = b^{\log_b k}$$. We proceed to get

$\begin{split}k^{\log_b n} &= \parens{b^{\log_b k}}^{\log_b n}\\ &= \parens{b^{\log_b n}}^{\log_b k}\\ &= n^{\log_b k}\end{split}$

applying the prior observation once again.

Applying both observations, we get

$\begin{split}T &= \O\parens{n^d \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-d}}\\ &= \O\parens{n^d \cdot n^{\log_b k} \cdot n^{-d}}\\ &=\O\parens{n^{\log_b k}}\end{split}$

Here, we cannot discard the base of the logarithm, as it appears in the exponent and so affects the value by an exponential rather than a constant factor.

Thus, we have demonstrated all three cases of the master theorem.

We now proceed to prove the version with log factors.

Proof 212

We need to modify the reasoning in Proof 211 to account for the amount of work done in each subproblem. For a subproblem of size $$n/b^i$$, we now have work

$\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}}$

We consider the cases of $$k/b^d \le 1$$ and $$k/b^d > 1$$ separately.

• $$k/b^d \le 1$$: We compute an upper bound on the work as follows:

$\begin{split}\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}} &= \O\parens{\frac{n^d}{b^{id}} \cdot (\log^w n - \log^w b^i)}\\ &= b^{-id} \cdot \O(n^d \cdot (\log^w n - \log^w b^i))\\ &= b^{-id} \cdot \O(n^d \log^w n)\end{split}$

since $$n^d \cdot (\log^w n - \log^w b^i) \le n^d \log^w n$$ for $$b > 1$$, and we only need an upper bound for the asymptotics. Then the work $$T_i$$ at level $$i$$ is

$\begin{split}T_i &= \frac{k^i}{b^{id}} \cdot \O(n^d \log^w n)\\ &= \parens{\frac{k}{b^d}}^i \cdot \O(n^d \log^w n)\end{split}$

Summing over all the levels, we get a total work

$\begin{split}T &= \sum_{i=0}^{\log_b n} T_i\\ &= \O(n^d \log^w n) \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^d}}^i\\ &= \O(n^d \log^w n) \cdot \sum_{i=0}^{\log_b n} r^i\\ &= \O\parens{n^d \log^w n \cdot \sum_{i=0}^{\log_b n} r^i}\end{split}$

where $$r = k/b^d$$. When $$r < 1$$, the initial term of the summation dominates as before, so we have

$\begin{split}T &= \O\parens{n^d \log^w n \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d \log^w n)\end{split}$

When $$r = 1$$, the terms all have equal size, also as before. Then

$\begin{split}T &= \O\parens{n^d \log^w n \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d \log^w n \cdot (1 + \log_b n))\\ &= \O(n^d \log^w n \log_b n)\\ &= \O(n^d \log^w n \log n)\\ &= \O(n^d \log^{w+1} n)\end{split}$

As before, we drop the lower order term, as well as the base in $$\log_b n$$ since it only differs by a constant factor from $$\log n$$.

• $$k/b^d > 1$$: We start by observing that since $$\log^w n$$ is subpolynomial in $$n$$, it grows slower than any polynomial in $$n$$. We have

$\frac{k}{b^d} > 1$

or equivalently

$\begin{split}\log_b k &> d\\ 0 &< \log_b k - d\end{split}$

We choose a small exponent $$\epsilon$$ such that

$0 < \epsilon < \log_b k - d$

For instance, we can choose $$\epsilon = (\log_b k - d) / 2$$. Then

$\begin{split}\log_b k &> d + \epsilon\\ k &> b^{d+\epsilon}\\ \frac{k}{b^{d+\epsilon}} > 1\end{split}$

We also have

$\log^w n \le C n^\epsilon$

for a sufficiently large constant $$C$$ and $$n \ge 1$$. For a subproblem of size $$n/b^i$$, we now have work

$\begin{split}\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}} &= \O\parens{\parens{\frac{n}{b^i}}^d \cdot C \parens{\frac{n}{b^i}}^\epsilon}\\ &= \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}\end{split}$

Then our work $$T_i$$ at level $$i$$ is

$T_i = k^i \cdot \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}$

The total work $$T$$ is

$\begin{split}T &= \sum_{i=0}^{\log_b n} \parens{k^i \cdot \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}}\\ &= \O\parens{\sum_{i=0}^{\log_b n} \parens{k^i \cdot C \parens{\frac{n}{b^i}}^{d+\epsilon}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{k^i \cdot \parens{b^{-i}}^{d+\epsilon}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{k^i \cdot \parens{b^{d+\epsilon}}^{-i}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^{d+\epsilon}}}^i}\end{split}$

Since $$k/b^{d+\epsilon} > 1$$, the last term in the summation dominates as in Proof 211, so we get

$\begin{split}T &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^{d+\epsilon}}}^i}\\ &= \O\parens{C n^{d+\epsilon} \cdot \parens{\frac{k}{b^{d+\epsilon}}}^{\log_b n}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot \parens{b^{-(d+\epsilon)}}^{\log_b n}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-(d+\epsilon)}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot n^{-(d+\epsilon)}}\\ &= \O\parens{C k^{\log_b n}}\\ &= \O\parens{C n^{\log_b k}}\end{split}$

In the last step, we applied the observation that $$k^{\log_b n} = n^{\log_b k}$$. Folding the constant $$C$$ into the big-O notation, we end up with

$\begin{split}T &= \O\parens{C k^{\log_b n}}\\ &= \O\parens{n^{\log_b k}}\end{split}$

as required.

## Alternative Analysis of Quick Sort¶

We can conduct an alternative analysis of the quick-sort algorithm by modifying the algorithm so that all the randomness is fixed at the beginning. Given $$n$$ elements, we start by choosing a permutation of the numbers $$\{1,2,\dots,n\}$$ uniformly at random to be the priorities of the elements. Then when choosing a pivot for a subset of the elements, we pick the element that has the lowest priority over that subset. The following illustrates an execution of this modified algorithm:

In the first step, the element 37 has the lowest priority, so it is chosen as the pivot. At the second level, the element 15 has the minimal priority in the first recursive call, while the element 86 has the lowest priority in the second call. At the third level of the recursion, the element 4 has lowest priority over its subset, and similarly the element 52 for the subset $$\{70, 52\}$$.

The full definition of the modified algorithm is as follows:

Algorithm 213 (Modified Quick Sort)
$\begin{split}&\Algorithm ModifiedQuickSort(A[1..n] : \text{array of n integers}):\\ &~~~P \gets \text{a random permutation of } \{1,\dots,n\} \text{, chosen uniformly at random}\\ &~~~\Return prioritizedSort(A, P)\\ \\ &\Subroutine prioritizedSort(A[1..n], P[1..n]):\\ &~~~\If n = 1 \return A\\ &~~~p \gets \text{index of the minimal element in P}\\ &~~~L, PL, R, PR \gets partition(A, P, p)\\ &~~~\Return prioritizedSort(L, PL) + A[p] + prioritizedSort(R, PR)\\ \\ &\Subroutine partition(A[1..n], P[1..n], p):\\ &~~~L \gets \text{empty array}\\ &~~~PL \gets \text{empty array}\\ &~~~R \gets \text{empty array}\\ &~~~PR \gets \text{empty array}\\ &~~~\For i = 1 \tot n:\\ &~~~~~~\If i \ne p \andt A[i] < A[p]:\\ &~~~~~~~~~L \gets L + A[i]\\ &~~~~~~~~~PL \gets PL + P[i]\\ &~~~~~~\ElseIf i \ne p:\\ &~~~~~~~~~R \gets R + A[i]\\ &~~~~~~~~~PR \gets PR + P[i]\\ &~~~\Return L, PL, R, PR\end{split}$

This modified algorithm matches the behavior of the original randomized quick sort in two important ways:

1. In both algorithms, when a pivot is chosen among a subset consisting of $$m$$ elements, each element is chosen with a uniform probability $$1/m$$. This is explicitly the case in the original version. In the modified version, the priorities assigned to each of the $$m$$ elements are unique, and for any set of $$m$$ priorities, each element receives the lowest priority in exactly $$1/m$$ of the permutations of those priorities.

2. The two algorithms compare elements in exactly the same situation – when partitioning the array, each element is compared to the pivot. (The modified version also compares priorities to find the minimal, but we can just ignore that in our analysis. Even if we did consider them, they just increase the number of comparisons by a constant factor of two.)

Thus, the modified algorithm models the execution of the original algorithm in both the evolution of the recursion as well as the element comparisons. We proceed to analyze the modified algorithm, and the result will equally apply to the original.

For simplicity, we assume for the purposes of analysis that the initial array does not contain any duplicate elements. We have defined our two versions of quick sort to be stable, meaning that they preserve the relative ordering of duplicate elements. Thus, even when duplicates are present, the algorithm distinguishes between them, resulting in the same behavior as if there were no duplicates.

Claim 214

Let $$A$$ be an array of $$n$$ elements, and let $$S$$ be the result of sorting $$A$$. Let $$priority(S[i])$$ be the priority that was assigned to the element $$S[i]$$ in the modified quick-sort algorithm. Then $$S[i]$$ and $$S[j]$$ were compared by the algorithm if and only if $$i \ne j$$, and one of the following holds:

$\begin{split}priority(S[i]) &= \min_{i \le k \le j}\{priority(S[k])\} \text{, or}\\ priority(S[j]) &= \min_{i \le k \le j}\{priority(S[k])\}\end{split}$

In other words, $$S[i]$$ and $$S[j]$$ are compared exactly when the lowest priority among the elements $$S[i], S[i+1], \dots, S[j-1], S[j]$$ is that of either $$S[i]$$ or $$S[j]$$.

The claim above holds regardless of the original order of the elements in $$A$$. For example, take a look at the result from our illustration of the modified algorithm above:

As stated above, two elements are compared only in the partitioning step, and one of the two must be the pivot. Referring to the execution above, we can see that 0 and 15 were compared in the second level of the recursion, whereas 52 and 99 were never compared to each other. Now consider a different initial ordering of the elements, but with the same priorities assigned to each element:

The execution of the algorithm is essentially the same! It still picks 37 as the first pivot, since it has the lowest priority, and forms the same partitions; the only possible difference is the ordering of a partition. Subsequent steps again pick the same pivots, since they are finding the minimal priority over the same subset of elements and associated priorities. Thus, given just the sorted set of elements and their priorities, we can determine which elements were compared as stated in Claim 214, regardless of how they were initially ordered. We proceed to prove Claim 214:

Proof 215

First, we show that if the lowest priority among the elements $$S[i], S[i+1], \dots, S[j-1], S[j]$$ is that of either $$S[i]$$ or $$S[j]$$, then $$S[i]$$ and $$S[j]$$ are compared. Without loss of generality, suppose $$S[i]$$ is the element with the lowest priority in this subset. This implies that no element in $$S[i+1], \dots, S[j]$$ will be picked as a pivot prior to $$S[i]$$ being picked. Thus, all pivots chosen by the algorithm before $$S[i]$$ must be either less than $$S[i]$$ or greater than $$S[j]$$. For each such pivot, the elements $$S[i], S[i+1], \dots, S[j]$$ are all placed in the same partition – they are in sorted order, and if the pivot is less than $$S[i]$$, all of these elements are larger than the pivot, while if the pivot is greater than $$S[j]$$, then all these elements are smaller than the pivot. Then when the algorithm chooses $$S[i]$$ as the pivot, $$S[j]$$ is in the same subset of the array as $$S[i]$$ and thus will be compared to $$S[i]$$ in the partitioning step. The same reasoning holds for when $$S[j]$$ has the lowest priority – it will eventually be chosen as the pivot, at which point $$S[i]$$ will be in the same subset and will be compared to $$S[j]$$.

Next, we show that if the lowest priority among the elements $$S[i], S[i+1], \dots, S[j-1], S[j]$$ is neither that of $$S[i]$$ nor of $$S[j]$$, then $$S[i]$$ and $$S[j]$$ are never compared. As long as the algorithm chooses pivots outside of the set $$S[i], S[i+1], \dots, S[j]$$, the two elements $$S[i]$$ and $$S[j]$$ remain in the same subset of the array without having been compared to each other – they remain in the same subset since the pivots so far have been either less than both $$S[i]$$ and $$S[j]$$ or greater than both of them, and they have not been compared yet since neither has been chosen as a pivot. When the algorithm first chooses a pivot from among the elements $$S[i], S[i+1], \dots, S[j-1], S[j]$$, the pivot it chooses must be one of $$S[i+1], \dots, S[j-1]$$, since one of those elements has lower priority than both $$S[i]$$ and $$S[j]$$. In the partitioning step for that pivot, $$S[i]$$ and $$S[j]$$ are placed into separate partitions, since $$S[i]$$ is smaller than the pivot while $$S[j]$$ is larger than the pivot. Since $$S[i]$$ and $$S[j]$$ end up in separate partitions, they cannot be compared in subsequent steps of the algorithm.

We can now proceed to compute the expected number of comparisons. Let $$X_{ij}$$ be an indicator random variable such that $$X_{ij} = 1$$ if $$S[i]$$ and $$S[j]$$ were compared at some point in the algorithm, $$X_{ij} = 0$$ otherwise. From Claim 214, we can conclude that

$\Pr[X_{ij} = 1] = \frac{2}{j-i+1}$

This is because each of the $$j-1+1$$ elements in $$S[i], S[i+1], \dots, S[j-1], S[j]$$ has equal chance of having the lowest priority among those elements, and $$S[i]$$ and $$S[j]$$ are compared when one of those two elements has lowest priority among this set.

We also observe that a pair of elements is compared at most once – they are compared once in the partitioning step if they are in the same subarray and one of the two is chosen as a pivot, and after the partitioning step is over, the pivot is never compared to anything else. Thus, the total number of comparisons $$X$$ is just the number of pairs that are compared, out of the $$n(n-1)/2$$ distinct pairs:

$X = \sum_{i=1}^n \sum_{j=i+1}^n X_{ij}$

The rest of the analysis is as before, where we concluded that $$\Ex[X] \le 2 n \ln n = \O(n \log n)$$. Since the modified quick-sort algorithm models the behavior of the original randomized algorithm, this result also applies to the latter.

## Proof of the Simplified Multiplicative Chernoff Bounds¶

We previously showed that the unsimplified Chernoff bounds

$\begin{split}\Pr[X \ge (1 + \delta)\mu] &\le \parens{\frac{e^\delta}{(1+\delta)^{1+\delta}}}^\mu ~~~~\text{for \delta > 0}\\ \Pr[X \le (1 - \delta)\mu] &\le \parens{\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}}^\mu ~~~~\text{for 0 < \delta < 1}\end{split}$

hold. We now demonstrate that the simplified bounds

\begin{split}\large \begin{aligned} \Pr[X \ge (1 + \delta)\mu] &\le e^{-\frac{\delta^2\mu}{2+\delta}} ~~~~\text{for \delta > 0}\\ \Pr[X \le (1 - \delta)\mu] &\le e^{-\frac{\delta^2\mu}{2}} ~~~~\text{for 0 < \delta < 1} \end{aligned}\end{split}

follow from the unsimplified bounds.

Proof 216

We first consider the upper-tail Chernoff bound. For $$\delta > 0$$, we have

\begin{split}\large \begin{aligned} \Pr[X \ge (1 + \delta)\mu] &\le \parens{\frac{e^\delta}{(1+\delta)^{1+\delta}}}^\mu\\ &= \parens{\frac{e^\delta}{(e^{\ln(1+\delta)})^{1+\delta}}}^\mu\\ &= \parens{\frac{e^\delta}{e^{(1+\delta)\ln(1+\delta)}}}^\mu\\ &= \parens{e^{\delta-(1+\delta)\ln(1+\delta)}}^\mu \end{aligned}\end{split}

From a list of logarithmic identities, we obtain the inequality

$\ln(1+x) \ge \frac{2x}{2+x}$

for $$x > 0$$. This gives us

\begin{split}\large \begin{aligned} -\ln(1+x) &\le -\frac{2x}{2+x}\\ e^{-\ln(1+x)} &\le e^{-\frac{2x}{2+x}} \end{aligned}\end{split}

Applying this to our Chernoff-bound expression with $$x = \delta$$, we get

\begin{split}\large \begin{aligned} \Pr[X \ge (1 + \delta)\mu] &\le \parens{e^{\delta-(1+\delta)\ln(1+\delta)}}^\mu\\ &\le \parens{e^{\delta-(1+\delta)\frac{2\delta}{2+\delta}}}^\mu\\ &= \parens{e^{\frac{\delta(2+\delta)-2\delta(1+\delta)} {2+\delta}}}^\mu\\ &= \parens{e^{\frac{2\delta+\delta^2-2\delta-2\delta^2} {2+\delta}}}^\mu\\ &= \parens{e^{\frac{-\delta^2}{2+\delta}}}^\mu\\ &= e^{-\frac{\delta^2\mu}{2+\delta}} \end{aligned}\end{split}

resulting in the simplified upper-tail bound.

For the lower-tail Chernoff bound, we have for $$0 < \delta < 1$$:

\begin{split}\large \begin{aligned} \Pr[X \le (1 - \delta)\mu] &\le \parens{\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}}^\mu\\ &= \parens{\frac{e^\delta}{(e^{\ln(1-\delta)})^{1-\delta}}}^\mu\\ &= \parens{\frac{e^{-\delta}}{e^{(1-\delta)\ln(1-\delta)}}}^\mu\\ &= \parens{e^{-\delta-(1-\delta)\ln(1-\delta)}}^\mu \end{aligned}\end{split}

The Taylor series for the natural logarithm gives us:

$\begin{split}\ln(1-x) &= -x - \frac{x^2}{2} - \frac{x^3}{3} - \frac{x^4}{4} - \dots\\ &\ge -x - \frac{x^2}{2} ~~~~\text{for 0 < x < 1}\\ -\ln(1-x) &\le x + \frac{x^2}{2}\end{split}$

The second step follows since $$\frac{x^3}{3} + \frac{x^4}{4} + \dots$$ is positive for $$x > 0$$. Applying the result to our Chernoff-bound expression with $$x = \delta$$, we obtain

\begin{split}\large \begin{aligned} \Pr[X \le (1 - \delta)\mu] &\le \parens{e^{-\delta-(1-\delta)\ln(1-\delta)}}^\mu\\ &\le \parens{e^{-\delta+(1-\delta)(\delta+\frac{\delta^2}{2})}}^\mu\\ &= \parens{e^{-\delta+\delta-\delta^2+\frac{\delta^2}{2}- \frac{\delta^3}{2}}}^\mu\\ &= \parens{e^{-\frac{\delta^2}{2}-\frac{\delta^3}{2}}}^\mu \end{aligned}\end{split}

Since $$\delta > 0$$, we have $$\delta^3/2 > 0$$, and

$-\frac{\delta^2}{2}-\frac{\delta^3}{2} \le -\frac{\delta^2}{2}$

Thus,

\begin{split}\large \begin{aligned} \Pr[X \le (1 - \delta)\mu] &\le \parens{e^{-\frac{\delta^2}{2}-\frac{\delta^3}{2}}}^\mu\\ &\le \parens{e^{-\frac{\delta^2}{2}}}^\mu\\ &= e^{-\frac{\delta^2\mu}{2}} \end{aligned}\end{split}

completing our proof of the simplified lower-tail bound.

## Proof of the Upper-Tail Hoeffding’s Inequality¶

The upper-tail Hoeffding’s inequality we use (Theorem 150) states that for a sequence of independent indicator random variables $$X_1, \dots, X_n$$,

$\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le e^{-2\varepsilon^2 n}$

where $$X = X_1 + \dots + X_n$$, $$\Ex[X_i] = p_i$$, and $$\lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i = p$$. We proceed to prove a more general variant of this bound:

Theorem 217

Let $$X = X_1 + \dots + X_n$$, where the $$X_i$$ are independent random variables such that $$X_i \in [0, 1]$$, and the $$X_i$$ have expectation $$\Ex[X_i] = p_i$$, respectively. Let

$p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i$

Let $$\varepsilon > 0$$ be a deviation from the expectation. Then

$\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le \expp{-2\varepsilon^2 n}$

(Note that $$\expp{x}$$ is alternate notation for $$e^{x}$$.)

Here, we do not require that the $$X_i$$ are indicators, just that they are in the range $$[0, 1]$$.

We proceed to prove Theorem 217. We have

$\begin{split}\lrPr{\frac{1}{n}X \ge p + \varepsilon} &= \Pr[X \ge n(p + \varepsilon)]\\ &= \Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}]\end{split}$

The latter step holds for all $$s \ge 0$$. The process of working with the random variable $$e^{sX}$$ rather than $$X$$ directly is the Chernoff bounding technique; the idea is that small deviations in $$X$$ turn into large deviations in $$e^{sX}$$, so that Markov’s inequality (Theorem 128) produces a more useful result. We will choose an appropriate value of $$s$$ later. For any value of $$s$$, $$e^{sX}$$ is nonnegative, so we can apply Markov’s inequality to obtain:

$\begin{split}\Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}] &\le \Ex[e^{sX}]/\expp{sn(p + \varepsilon)}\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[e^{sX}]\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[\expp{s(X_1 + X_2 + \dots + X_n)}]\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[e^{sX_1} e^{sX_2} \dots e^{sX_n}]\\ &= \expp{-sn(p + \varepsilon)} \prod_i \Ex[e^{sX_i}]\end{split}$

In the last step, we applied the fact that the $$X_i$$ are independent to conclude that the random variables $$e^{sX_i}$$ are also independent, and therefore by Lemma 140, the expectation of their product is the product of their expectations.

We now need to establish a bound on $$\Ex[e^{sX_i}]$$.

Lemma 218

Let $$X$$ be a random variable such that $$X \in [0, 1]$$ and $$\Ex[X] = p$$. Then

$\Ex[e^{sX}] \le \expp{sp + s^2/8}$

for all $$s \in \R$$.

Proof 219

We first observe that because $$e^{sx}$$ is a convex function, we can bound it from above on the interval $$[0, 1]$$ by the line that passes through the endpoints $$1$$ and $$e^s$$:

This line has slope $$e^s - 1$$ and y-intercept $$1$$, giving us $$xe^s - x + 1$$. Thus,

$e^{sx} \le xe^s - x + 1$

in the interval $$0 \le x \le 1$$. Then

$\begin{split}\Ex[e^{sX}] &\le \Ex[Xe^s - X + 1]\\ &= (e^s - 1)\Ex[X] + 1\\ &= pe^s - p + 1\end{split}$

where the second step follows from linearity of expectation. We now have an upper bound on $$\Ex[e^{sX}]$$, but it is not a convenient one – we would like it to be of the form $$e^\alpha$$, so that we can better use it in proving Theorem 217, for which we’ve already obtained a bound that contains an exponential. Using the fact that $$z = e^{\ln z}$$, we get

$\begin{split}\Ex[e^{sX}] &\le pe^s - p + 1\\ &= \expp{\ln (pe^s - p + 1)}\end{split}$

To further bound this expression, let

$\phi(s) = \ln (pe^s - p + 1)$

Then finding an upper bound for $$\phi(s)$$ will in turn allow us to bound $$e^{\phi(s)}$$. By the Lagrange form of Taylor’s theorem, we get

$\phi(s) = \phi(0) + s\phi'(0) + \frac{1}{2}s^2\phi''(v)$

for some $$v \in [0, s]$$. To differentiate $$\phi(s)$$, we let $$f(s) = \ln s$$ and $$g(s) = pe^s - p + 1$$, so that $$\phi(s) = (f \circ g)(s)$$. Then by the chain rule, we have 2

$\begin{split}\phi'(s) &= (f \circ g)'(s)\\ &= f'(g(s)) \cdot g'(s)\\ &= \frac{1}{g(s)} \cdot g'(s)~~~~~~~~~\text{(since f'(s) = \frac{d}{ds}\ln s = \frac{1}{s})}\\ &= \frac{1}{pe^s - p + 1} \cdot pe^s\\ &= \frac{pe^s}{pe^s - p + 1}\end{split}$
2

The function $$g(s) = pe^s - p + 1$$ does not have any real roots for $$p \in [0, 1]$$, so it is safe to divide by it.

To compute $$\phi''(s)$$, we now let $$\hat{f}(s) = pe^s$$, so that $$\phi'(s) = \hat{f}(s)/g(s)$$. Then by the quotient rule, we get

$\begin{split}\phi''(s) &= \frac{\hat{f}'(s)g(s) - \hat{f}(s)g'(s)}{g^2(s)}\\ &= \frac{pe^s(pe^s - p + 1) - pe^s pe^s}{(pe^s - p + 1)^2}\\ &= \frac{pe^s}{pe^s - p + 1}\cdot\frac{(pe^s - p + 1) - pe^s}{pe^s - p + 1}\\ &= \frac{pe^s}{pe^s - p + 1}\parens{1 - \frac{pe^s}{pe^s - p + 1}}\\ &= t(1 - t)\end{split}$

where $$t = \frac{pe^s}{pe^s - p + 1}$$. Observe that $$t(1 - t)$$ is a parabola with a maximum at $$t = \frac{1}{2}$$:

$\begin{split}h(t) &= t(1 - t) = t - t^2\\ h'(t) &= 1 - 2t\\ h''(t) &= -2\end{split}$

Setting $$h'(t) = 1 - 2t = 0$$, we get that $$t = \frac{1}{2}$$ is an extremum, and since the second derivative $$h''(\frac{1}{2})$$ is negative, it is a maximum. Thus, we have

$\begin{split}\phi''(s) &= t(1 - t)\\ &\le \frac{1}{2}\parens{1 - \frac{1}{2}}\\ &= \frac{1}{4}\end{split}$

We can now plug everything into our result from applying Taylor’s theorem:

$\begin{split}\phi(s) &= \phi(0) + s\phi'(0) + \frac{1}{2}s^2\phi''(v)\\ &= \ln(p - p + 1) + s\frac{p}{p - p + 1} + \frac{1}{2}s^2\phi''(v)\\ &= sp + \frac{1}{2}s^2\phi''(v)\\ &\le sp + \frac{1}{2}s^2\cdot \frac{1}{4}\\ &= sp + \frac{s^2}{8}\end{split}$

Thus,

$\begin{split}\Ex[e^{sX}] &\le e^{\phi(s)}\\ &\le \expp{sp + s^2/8}\end{split}$

as claimed.

Continuing our proof of Theorem 217, we have

$\Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}] \le \expp{-sn(p + \varepsilon)} \prod_i \Ex[e^{sX_i}]$

Applying Lemma 218, we get

$\begin{split}\Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}] &\le \expp{-sn(p + \varepsilon)} \prod_i \Ex[e^{sX_i}]\\ &\le \expp{-sn(p + \varepsilon)} \prod_i \expp{sp_i + s^2/8}\\ &= \expp{-sn(p + \varepsilon)} \cdot \expp{\sum_i (sp_i + s^2/8)}\\ &= \expp{-sn(p + \varepsilon)} \cdot \expp{snp + s^2n/8}\\ &= \expp{-sn\varepsilon + s^2n/8}\end{split}$

We now choose $$s$$ to minimize the exponent $$r(s) = -sn\varepsilon + s^2n/8$$. We have:

$\begin{split}r(s) &= -sn\varepsilon + \frac{n}{8}s^2\\ r'(s) &= -n\varepsilon + \frac{n}{8}\cdot 2s = -n\varepsilon + \frac{n}{4}s\\ r''(s) &= \frac{n}{4}\end{split}$

We have another parabola, and since the second derivative is positive, we obtain a minimum for $$r(s)$$ at

$\begin{split}r'(s) &= -n\varepsilon + \frac{n}{4}s = 0\\ s &= 4\varepsilon\end{split}$

Then

$\begin{split}r(4\varepsilon) &= -4n\varepsilon^2 + \frac{n}{8}\cdot 16\varepsilon^2\\ &= -4n\varepsilon^2 + 2n\varepsilon^2\\ &= -2n\varepsilon^2\end{split}$

Thus,

$\begin{split}\lrPr{\frac{1}{n}X \ge p + \varepsilon} &\le \Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}]\\ &\le \expp{-2n\varepsilon^2}\end{split}$

completing our proof of Theorem 217.

## General Case of Hoeffding’s Inequality¶

We can further generalize Theorem 217 to the case where the individual random variables $$X_i$$ are in the range $$[a_i, b_i]$$. We start by generalizing Lemma 218 to obtain Hoeffding’s lemma.

Lemma 220 (Hoeffding’s Lemma)

Let $$X$$ be a random variable such that $$X \in [a, b]$$, where $$a < b$$, and $$\Ex[X] = p$$. Then

$\Ex[e^{sX}] \le \expp{sp + s^2(a - b)^2/8}$

for all $$s \in \R$$.

Proof 221

Let $$X' = \frac{X - a}{b - a}$$. Then $$X'$$ is a random variable such that $$X' \in [0, 1]$$, and

$\Ex[X'] = \lrEx{\frac{X - a}{b - a}} = \frac{p - a}{b - a}$

by linearity of expectation. Let $$p' = \frac{p - a}{b - a}$$. Applying Lemma 218 to $$X'$$, we get

$\Ex[e^{s'X'}] \le \expp{s'p' + s'^2/8}$

Now consider $$e^{sX}$$. We have $$X = a + (b - a)X'$$, so

$\begin{split}\Ex[e^{sX}] &= \Ex[\expp{sa + s(b - a)X'}]\\ &= e^{sa}\Ex[\expp{s(b - a)X'}]\end{split}$

Let $$s' = s(b - a)$$. From the result above, we have

$\begin{split}e^{sa}\Ex[\expp{s(b - a)X'}] &= e^{sa}\Ex[e^{s'X'}]\\ &\le e^{sa}\expp{s'p' + s'^2/8}\\ &= \expp{sa + s'p' + s'^2/8}\end{split}$

Substituting $$p' = \frac{p - a}{b - a}$$ and $$s' = s(b - a)$$, we obtain the following for the exponent:

$\begin{split}sa + s'p' + \frac{1}{8}s'^2 &= sa + s(b - a)\frac{p - a}{b - a} + \frac{1}{8}s^2(b - a)^2\\ &= sa + s(p - a) + \frac{1}{8}s^2(b - a)^2\\ &= sp + \frac{1}{8}s^2(b - a)^2\end{split}$

Thus,

$\begin{split}\Ex[e^{s'X'}] &\le \expp{sa + s'p' + s'^2/8}\\ &= \expp{sp + s^2(b - a)^2/8}\end{split}$

as claimed.

We can now derive the general case of Hoeffding’s inequality.

Theorem 222 (Hoeffding’s Inequality – General Case)

Let $$X = X_1 + \dots + X_n$$, where the $$X_i$$ are independent random variables such that $$X_i \in [a_i, b_i]$$ for $$a_i < b_i$$ and $$\Ex[X_i] = p_i$$. Let

$p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i$

Let $$\varepsilon > 0$$ be a deviation from the expectation. Then

$\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le \lrexp{-\frac{2n^2\varepsilon^2}{\sum_i (b_i - a_i)^2}}$
Proof 223

We proceed as in the proof of Theorem 217 to obtain

$\begin{split}\lrPr{\frac{1}{n}X \ge p + \varepsilon} &= \Pr[X \ge n(p + \varepsilon)]\\ &= \Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}]\\ &\le \Ex[e^{sX}]/\expp{sn(p + \varepsilon)}~~~~~~~~~\text{(Markov's inequality)}\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[e^{sX}]\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[\expp{s(X_1 + X_2 + \dots + X_n)}]\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[e^{sX_1} e^{sX_2} \dots e^{sX_n}]\\ &= \expp{-sn(p + \varepsilon)} \prod_i \Ex[e^{sX_i}]~~~~~~~~~\text{(independence of the X_i)}\end{split}$

Applying Hoeffding’s lemma (Lemma 220), we obtain

$\begin{split}\expp{-sn(p + \varepsilon)} \prod_i \Ex[e^{sX_i}] &\le \expp{-sn(p + \varepsilon)} \prod_i \expp{sp_i + s^2(b_i - a_i)^2/8}\\ &= \expp{-sn(p + \varepsilon)} \cdot \lrexp{\sum_i (sp_i + s^2(b_i - a_i)^2/8)}\\ &= \expp{-sn(p + \varepsilon)} \cdot \lrexp{snp + \frac{s^2}{8} \sum_i (b_i - a_i)^2}\\ &= \lrexp{-sn\varepsilon + \frac{s^2}{8} \sum_i (b_i - a_i)^2}\end{split}$

We choose $$s$$ to minimize the exponent $$r(s)$$:

$\begin{split}r(s) &= -sn\varepsilon + \frac{s^2}{8} \sum_i (b_i - a_i)^2\\ r'(s) &= -n\varepsilon + \frac{s}{4} \sum_i (b_i - a_i)^2\\ r''(s) &= \frac{1}{4} \sum_i (b_i - a_i)^2\end{split}$

Since $$a_i < b_i$$, we have $$b_i - a_i > 0$$ for all $$i$$, so that $$r''(s)$$ is positive. Thus, we obtain a minimum for $$r(s)$$ when $$r'(s) = 0$$:

$\begin{split}r'(s) &= -n\varepsilon + \frac{s}{4} \sum_i (b_i - a_i)^2 = 0\\ s &= \frac{4n\varepsilon}{\sum_i (b_i - a_i)^2}\end{split}$

Then

$\begin{split}r\parens{\frac{4n\varepsilon}{\sum_i (b_i - a_i)^2}} &= -n\varepsilon\frac{4n\varepsilon}{\sum_i (b_i - a_i)^2} + \frac{16n^2\varepsilon^2}{\parens{\sum_i (b_i - a_i)^2}^2}\cdot \frac{1}{8}\sum_i (b_i - a_i)^2\\ &= -\frac{4n^2\varepsilon^2}{\sum_i (b_i - a_i)^2} + \frac{2n^2\varepsilon^2}{\sum_i (b_i - a_i)^2}\\ &= -\frac{2n^2\varepsilon^2}{\sum_i (b_i - a_i)^2}\end{split}$

Thus,

$\begin{split}\lrPr{\frac{1}{n}X \ge p + \varepsilon} &\le \Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}]\\ &\le \lrexp{-\frac{2n^2\varepsilon^2}{\sum_i (b_i - a_i)^2}}\end{split}$

completing our proof of the general case of Hoeffding’s inequality.