Appendix

Proof of the Master Theorem

We first prove the standard master theorem without log factors.

Proof 266

We prove the master theorem for the case where \(n\) is a power of \(b\); when this is not the case, we can pad the input size to the smallest power of \(b\) that is larger than \(n\), which increases the input size by at most the constant factor \(b\) (can you see why?).

First, we determine how many subproblems we have at each level of the recursion. Initially, we have a single problem of size \(n\). We subdivide it into \(k\) subproblems, each of size \(n/b\). Each of those gets subdivided into \(k\) subproblems of size \(n/b^2\), for a total of \(k^2\) problems of size \(n/b^2\). Those \(k^2\) problems in turn are subdivided into a total of \(k^3\) problems of size \(n/b^3\).

recursion tree of the subproblems in a master-type recurrence, with $k^i$ problems of size $n/b^i$ at each level $i$

In general, at the \(i\)th level of recursion (with \(i = 0\) the initial problem), there are \(k^i\) subproblems of size \(n/b^i\). We assume that the base case is when the problem size is 1, which is when

\[\begin{split}1 &= \frac{n}{b^i}\\ b^i &= n\\ i &= \log_b n\end{split}\]

Thus, there are \(1 + \log_b n\) total levels in the recursion.
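
For instance (a quick added illustration), with \(n = 64\) and \(b = 2\), the subproblem sizes are \(64, 32, 16, 8, 4, 2, 1\), for \(1 + \log_2 64 = 7\) levels.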

We now consider how much work is done in each subproblem, which corresponds to the \(\O(n^d)\) term of the recurrence relation. For a subproblem of size \(n/b^i\), this is

\[\begin{split}\O\parens{\parens{\frac{n}{b^i}}^d} &= \O\parens{\frac{n^d}{b^{id}}}\\ &= b^{-id} \cdot \O(n^d)\end{split}\]

With \(k^i\) subproblems at level \(i\), the total work \(T_i\) at level \(i\) is

\[\begin{split}T_i &= \frac{k^i}{b^{id}} \cdot \O(n^d)\\ &= \parens{\frac{k}{b^d}}^i \cdot \O(n^d)\end{split}\]

Summing over all the levels, we get a total work

\[\begin{split}T &= \sum_{i=0}^{\log_b n} T_i\\ &= \sum_{i=0}^{\log_b n} \parens{\parens{\frac{k}{b^d}}^i \cdot \O(n^d)}\\ &= \O(n^d) \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^d}}^i\\ &= \O(n^d) \cdot \sum_{i=0}^{\log_b n} r^i\\ &= \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} r^i}\\ &= \O\parens{n^d \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\end{split}\]

where \(r = k/b^d\).

Since we are working with asymptotics, we only care about the dominating term in the sum

\[\sum_{i=0}^{\log_b n} r^i = 1 + r + r^2 + \dots + r^{\log_b n}\]

There are three cases:

  • \(r < 1\): The terms have decreasing value, so that the initial term 1 is the largest and dominates [1]. Thus, we have

    \[\begin{split}T &= \O\parens{n^d \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d)\end{split}\]
  • \(r = 1\): The terms all have the same value 1. Then

    \[\begin{split}T &= \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} r^i}\\ &= \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} 1}\\ &= \O\parens{n^d \cdot (1 + \log_b n)}\\ &= \O\parens{n^d + n^d \log_b n}\\ &= \O(n^d \log_b n)\end{split}\]

    where we discard the lower-order term in the last step. Since logarithms of different bases differ only by a multiplicative constant factor (see the logarithmic identity for changing the base), which does not affect asymptotics, we can elide the base:

    \[\begin{split}T &= \O(n^d \log_b n)\\ &= \O(n^d \log n)\end{split}\]
  • \(r > 1\): The terms have increasing value, so that the final term \(r^{\log_b n}\) is the largest. Then we have

    \[\begin{split}T &= \O\parens{n^d \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O\parens{n^d r^{\log_b n}}\\ &= \O\parens{n^d \parens{\frac{k}{b^d}}^{\log_b n}}\\ &= \O\parens{n^d \cdot k^{\log_b n} \cdot b^{-d \log_b n}}\\ &= \O\parens{n^d \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-d}}\end{split}\]

    To simplify this further, we observe the following:

    • \(b^{\log_b n} = n\), which we can see by taking \(\log_b\) of both sides.

    • \(k^{\log_b n} = n^{\log_b k}\). We show this as follows:

      \[k^{\log_b n} = \parens{b^{\log_b k}}^{\log_b n}\]

      Here, we applied the previous observation to substitute \(k = b^{\log_b k}\). We proceed to get

      \[\begin{split}k^{\log_b n} &= \parens{b^{\log_b k}}^{\log_b n}\\ &= \parens{b^{\log_b n}}^{\log_b k}\\ &= n^{\log_b k}\end{split}\]

      applying the prior observation once again.

    Applying both observations, we get

    \[\begin{split}T &= \O\parens{n^d \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-d}}\\ &= \O\parens{n^d \cdot n^{\log_b k} \cdot n^{-d}}\\ &=\O\parens{n^{\log_b k}}\end{split}\]

    Here, we cannot discard the base of the logarithm, as it appears in the exponent and so affects the value by an exponential rather than a constant factor.

Thus, we have demonstrated all three cases of the master theorem.
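
As a concrete illustration of the three cases (examples added here for reference; they use standard recurrences rather than ones from the text), we can read off the following:

\[\begin{split}T(n) &= 2T(n/2) + \O(n): ~ r = \frac{2}{2^1} = 1, \text{ so } T = \O(n \log n) ~~~~\text{(merge sort)}\\ T(n) &= T(n/2) + \O(1): ~ r = \frac{1}{2^0} = 1, \text{ so } T = \O(\log n) ~~~~\text{(binary search)}\\ T(n) &= 3T(n/2) + \O(n): ~ r = \frac{3}{2} > 1, \text{ so } T = \O(n^{\log_2 3}) ~~~~\text{(Karatsuba multiplication)}\\ T(n) &= 2T(n/2) + \O(n^2): ~ r = \frac{2}{4} < 1, \text{ so } T = \O(n^2)\end{split}\]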

We now proceed to prove the version with log factors.

Proof 267

We need to modify the reasoning in Proof 266 to account for the amount of work done in each subproblem. For a subproblem of size \(n/b^i\), we now have work

\[\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}}\]

We consider the cases of \(k/b^d \le 1\) and \(k/b^d > 1\) separately.

  • \(k/b^d \le 1\): We compute an upper bound on the work as follows:

    \[\begin{split}\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}} &= \O\parens{\frac{n^d}{b^{id}} \cdot \log^w \frac{n}{b^i}}\\ &= b^{-id} \cdot \O\parens{n^d \log^w \frac{n}{b^i}}\\ &= b^{-id} \cdot \O(n^d \log^w n)\end{split}\]

    since \(\log^w \frac{n}{b^i} \le \log^w n\) for \(b > 1\) and \(n/b^i \ge 1\), and we only need an upper bound for the asymptotics. Then the work \(T_i\) at level \(i\) is

    \[\begin{split}T_i &= \frac{k^i}{b^{id}} \cdot \O(n^d \log^w n)\\ &= \parens{\frac{k}{b^d}}^i \cdot \O(n^d \log^w n)\end{split}\]

    Summing over all the levels, we get a total work

    \[\begin{split}T &= \sum_{i=0}^{\log_b n} T_i\\ &= \O(n^d \log^w n) \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^d}}^i\\ &= \O(n^d \log^w n) \cdot \sum_{i=0}^{\log_b n} r^i\\ &= \O\parens{n^d \log^w n \cdot \sum_{i=0}^{\log_b n} r^i}\end{split}\]

    where \(r = k/b^d\). When \(r < 1\), the initial term of the summation dominates as before, so we have

    \[\begin{split}T &= \O\parens{n^d \log^w n \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d \log^w n)\end{split}\]

    When \(r = 1\), the terms all have equal size, also as before. Then

    \[\begin{split}T &= \O\parens{n^d \log^w n \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d \log^w n \cdot (1 + \log_b n))\\ &= \O(n^d \log^w n \log_b n)\\ &= \O(n^d \log^w n \log n)\\ &= \O(n^d \log^{w+1} n)\end{split}\]

    As before, we drop the lower order term, as well as the base in \(\log_b n\) since it only differs by a constant factor from \(\log n\).

  • \(k/b^d > 1\): We start by observing that \(\log^w n\) is subpolynomial in \(n\); that is, it grows more slowly than \(n^\epsilon\) for any fixed exponent \(\epsilon > 0\). We have

    \[\frac{k}{b^d} > 1\]

    or equivalently

    \[\begin{split}\log_b k &> d\\ 0 &< \log_b k - d\end{split}\]

    We choose a small exponent \(\epsilon\) such that

    \[0 < \epsilon < \log_b k - d\]

    For instance, we can choose \(\epsilon = (\log_b k - d) / 2\). Then

    \[\begin{split}\log_b k &> d + \epsilon\\ k &> b^{d+\epsilon}\\ \frac{k}{b^{d+\epsilon}} &> 1\end{split}\]

    We also have

    \[\log^w n \le C n^\epsilon\]

    for a sufficiently large constant \(C\) and \(n \ge 1\). For a subproblem of size \(n/b^i\), we now have work

    \[\begin{split}\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}} &= \O\parens{\parens{\frac{n}{b^i}}^d \cdot C \parens{\frac{n}{b^i}}^\epsilon}\\ &= \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}\end{split}\]

    Then our work \(T_i\) at level \(i\) is

    \[T_i = k^i \cdot \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}\]

    The total work \(T\) is

    \[\begin{split}T &= \sum_{i=0}^{\log_b n} \parens{k^i \cdot \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}}\\ &= \O\parens{\sum_{i=0}^{\log_b n} \parens{k^i \cdot C \parens{\frac{n}{b^i}}^{d+\epsilon}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{k^i \cdot \parens{b^{-i}}^{d+\epsilon}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{k^i \cdot \parens{b^{d+\epsilon}}^{-i}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^{d+\epsilon}}}^i}\end{split}\]

    Since \(k/b^{d+\epsilon} > 1\), the last term in the summation dominates as in Proof 266, so we get

    \[\begin{split}T &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^{d+\epsilon}}}^i}\\ &= \O\parens{C n^{d+\epsilon} \cdot \parens{\frac{k}{b^{d+\epsilon}}}^{\log_b n}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot \parens{b^{-(d+\epsilon)}}^{\log_b n}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-(d+\epsilon)}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot n^{-(d+\epsilon)}}\\ &= \O\parens{C k^{\log_b n}}\\ &= \O\parens{C n^{\log_b k}}\end{split}\]

    In the last step, we applied the observation that \(k^{\log_b n} = n^{\log_b k}\). Folding the constant \(C\) into the big-O notation, we end up with

    \[\begin{split}T &= \O\parens{C n^{\log_b k}}\\ &= \O\parens{n^{\log_b k}}\end{split}\]

    as required.
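
As an added sanity check of the two cases above (these example recurrences are not from the original text), the theorem gives

\[\begin{split}T(n) &= 2T(n/2) + \O(n \log n): ~ \frac{k}{b^d} = \frac{2}{2} = 1, \text{ so } T = \O(n \log^2 n)\\ T(n) &= 3T(n/2) + \O(n \log n): ~ \frac{k}{b^d} = \frac{3}{2} > 1, \text{ so } T = \O(n^{\log_2 3})\end{split}\]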

Alternative Analysis of Quick Sort

We can conduct an alternative analysis of the quick-sort algorithm by modifying the algorithm so that all the randomness is fixed at the beginning. Given \(n\) elements, we start by choosing a permutation of the numbers \(\{1,2,\dots,n\}\) uniformly at random to be the priorities of the elements. Then when choosing a pivot for a subset of the elements, we pick the element that has the lowest priority over that subset. The following illustrates an execution of this modified algorithm:

modified quick sort works by choosing priorities for the elements, then picking the element with lowest priority to be the pivot at each point

In the first step, the element 37 has the lowest priority, so it is chosen as the pivot. At the second level, the element 15 has the minimal priority in the first recursive call, while the element 86 has the lowest priority in the second call. At the third level of the recursion, the element 4 has lowest priority over its subset, and similarly the element 52 for the subset \(\{70, 52\}\).

The full definition of the modified algorithm is as follows:

Algorithm 268 (Modified Quick Sort)
\begin{algorithmic}
\FUNCTION{ModifiedQuickSort}{$A[1,\ldots,n]$}
\STATE $P = \text{a uniformly random permutation of }
\set{1,\ldots,n}$
\STATE \RETURN \CALL{PrioritizedSort}{$A,P$}
\ENDFUNCTION

\FUNCTION{PrioritizedSort}{$A[1,\ldots,n], P[1,\ldots,n]$}
\IF{$n \le 1$} \RETURN $A$ \ENDIF
\STATE $p = $ index of the smallest element in $P$
\STATE $(L, PL, R, PR) = $ \CALL{Partition}{$A,P,p$}
\STATE \RETURN \CALL{PrioritizedSort}{$L,PL$} $+ A[p] +$ \CALL{PrioritizedSort}{$R,PR$}
\ENDFUNCTION

\FUNCTION{Partition}{$A[1,\ldots,n], P[1,\ldots,n], p$}
\STATE initialize empty arrays $L, PL, R, PR$
\FOR{$i=1$ to $n$}
  \IF{$i \neq p$ and $A[i] < A[p]$}
    \STATE $L = L + A[i]$
    \STATE $PL = PL + P[i]$
  \ELIF{$i \neq p$}
    \STATE $R = R + A[i]$
    \STATE $PR = PR + P[i]$
  \ENDIF
\ENDFOR
\STATE \RETURN $(L, PL, R, PR)$
\ENDFUNCTION
\end{algorithmic}
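
The pseudocode above can be rendered as a minimal executable sketch; the Python below is an added illustration (not part of the original text) and assumes distinct elements, matching the assumption made in the analysis that follows.

    import random

    def modified_quick_sort(a):
        """Sketch of Algorithm 268: fix all randomness up front as a priority permutation."""
        p = list(range(1, len(a) + 1))
        random.shuffle(p)                    # uniformly random permutation of {1, ..., n}
        return prioritized_sort(a, p)

    def prioritized_sort(a, p):
        if len(a) <= 1:
            return a
        piv = min(range(len(a)), key=lambda i: p[i])   # index of the smallest priority
        left, pl, right, pr = partition(a, p, piv)
        return prioritized_sort(left, pl) + [a[piv]] + prioritized_sort(right, pr)

    def partition(a, p, piv):
        left, pl, right, pr = [], [], [], []
        for i in range(len(a)):
            if i == piv:
                continue
            if a[i] < a[piv]:
                left.append(a[i]); pl.append(p[i])
            else:
                right.append(a[i]); pr.append(p[i])
        return left, pl, right, pr

    # Example: modified_quick_sort([37, 15, 86, 4, 52]) returns [4, 15, 37, 52, 86].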

This modified algorithm matches the behavior of the original randomized quick sort in two important ways:

  1. In both algorithms, when a pivot is chosen among a subset consisting of \(m\) elements, each element is chosen with a uniform probability \(1/m\). This is explicitly the case in the original version. In the modified version, the priorities assigned to each of the \(m\) elements are unique, and for any set of \(m\) priorities, each element receives the lowest priority in exactly \(1/m\) of the permutations of those priorities.

  2. The two algorithms compare elements in exactly the same situation – when partitioning the array, each element is compared to the pivot. (The modified version also compares priorities to find the minimum, but we can ignore those comparisons in our analysis. Even if we did count them, they would only increase the number of comparisons by a constant factor of two.)

Thus, the modified algorithm models the execution of the original algorithm in both the evolution of the recursion as well as the element comparisons. We proceed to analyze the modified algorithm, and the result will equally apply to the original.

For simplicity, we assume for the purposes of analysis that the initial array does not contain any duplicate elements. We have defined our two versions of quick sort to be stable, meaning that they preserve the relative ordering of duplicate elements. Thus, even when duplicates are present, the algorithm distinguishes between them, resulting in the same behavior as if there were no duplicates.

Claim 269

Let \(A\) be an array of \(n\) elements, and let \(S\) be the result of sorting \(A\). Let \(priority(S[i])\) be the priority that was assigned to the element \(S[i]\) in the modified quick-sort algorithm. Then \(S[i]\) and \(S[j]\) were compared by the algorithm if and only if \(i \ne j\), and one of the following holds:

\[\begin{split}priority(S[i]) &= \min_{i \le k \le j}\{priority(S[k])\} \text{, or}\\ priority(S[j]) &= \min_{i \le k \le j}\{priority(S[k])\}\end{split}\]

In other words, \(S[i]\) and \(S[j]\) are compared exactly when the lowest priority among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\) is that of either \(S[i]\) or \(S[j]\).

The claim above holds regardless of the original order of the elements in \(A\). For example, take a look at the result from our illustration of the modified algorithm above:

sorted array with assigned priorities; two elements were compared if and only if at least one of their priorities is lower than those of all the elements that lie between them

As stated above, two elements are compared only in the partitioning step, and one of the two must be the pivot. Referring to the execution above, we can see that 0 and 15 were compared in the second level of the recursion, whereas 52 and 99 were never compared to each other. Now consider a different initial ordering of the elements, but with the same priorities assigned to each element:

the same elements and priorities, but with a different initial order; the algorithm evolves in the exact same way with the same subsets at each step, but each subset may be reordered

The execution of the algorithm is essentially the same! It still picks 37 as the first pivot, since it has the lowest priority, and forms the same partitions; the only possible difference is the ordering of a partition. Subsequent steps again pick the same pivots, since they are finding the minimal priority over the same subset of elements and associated priorities. Thus, given just the sorted set of elements and their priorities, we can determine which elements were compared as stated in Claim 269, regardless of how they were initially ordered. We proceed to prove Claim 269:

Proof 270

First, we show that if the lowest priority among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\) is that of either \(S[i]\) or \(S[j]\), then \(S[i]\) and \(S[j]\) are compared. Without loss of generality, suppose \(S[i]\) is the element with the lowest priority in this subset. This implies that no element in \(S[i+1], \dots, S[j]\) will be picked as a pivot prior to \(S[i]\) being picked. Thus, all pivots chosen by the algorithm before \(S[i]\) must be either less than \(S[i]\) or greater than \(S[j]\). For each such pivot, the elements \(S[i], S[i+1], \dots, S[j]\) are all placed in the same partition – they are in sorted order, and if the pivot is less than \(S[i]\), all of these elements are larger than the pivot, while if the pivot is greater than \(S[j]\), then all these elements are smaller than the pivot. Then when the algorithm chooses \(S[i]\) as the pivot, \(S[j]\) is in the same subset of the array as \(S[i]\) and thus will be compared to \(S[i]\) in the partitioning step. The same reasoning holds for when \(S[j]\) has the lowest priority – it will eventually be chosen as the pivot, at which point \(S[i]\) will be in the same subset and will be compared to \(S[j]\).

Next, we show that if the lowest priority among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\) is neither that of \(S[i]\) nor of \(S[j]\), then \(S[i]\) and \(S[j]\) are never compared. As long as the algorithm chooses pivots outside of the set \(S[i], S[i+1], \dots, S[j]\), the two elements \(S[i]\) and \(S[j]\) remain in the same subset of the array without having been compared to each other – they remain in the same subset since the pivots so far have been either less than both \(S[i]\) and \(S[j]\) or greater than both of them, and they have not been compared yet since neither has been chosen as a pivot. When the algorithm first chooses a pivot from among the elements \(S[i], S[i+1], \dots, S[j-1], S[j]\), the pivot it chooses must be one of \(S[i+1], \dots, S[j-1]\), since one of those elements has lower priority than both \(S[i]\) and \(S[j]\). In the partitioning step for that pivot, \(S[i]\) and \(S[j]\) are placed into separate partitions, since \(S[i]\) is smaller than the pivot while \(S[j]\) is larger than the pivot. Since \(S[i]\) and \(S[j]\) end up in separate partitions, they cannot be compared in subsequent steps of the algorithm.

We can now proceed to compute the expected number of comparisons. Let \(X_{ij}\) be an indicator random variable such that \(X_{ij} = 1\) if \(S[i]\) and \(S[j]\) were compared at some point in the algorithm, \(X_{ij} = 0\) otherwise. From Claim 269, we can conclude that

\[\Pr[X_{ij} = 1] = \frac{2}{j-i+1}\]

This is because each of the \(j-i+1\) elements in \(S[i], S[i+1], \dots, S[j-1], S[j]\) has an equal chance of having the lowest priority among those elements, and \(S[i]\) and \(S[j]\) are compared exactly when one of those two elements has the lowest priority in this set.

We also observe that a pair of elements is compared at most once – they are compared once in the partitioning step if they are in the same subarray and one of the two is chosen as a pivot, and after the partitioning step is over, the pivot is never compared to anything else. Thus, the total number of comparisons \(X\) is just the number of pairs that are compared, out of the \(n(n-1)/2\) distinct pairs:

\[X = \sum_{i=1}^n \sum_{j=i+1}^n X_{ij}\]

The rest of the analysis is as before, where we concluded that \(\Ex[X] \le 2 n \ln n = \O(n \log n)\). Since the modified quick-sort algorithm models the behavior of the original randomized algorithm, this result also applies to the latter.
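
For completeness, the calculation referenced above can be recapped here (using linearity of expectation and the harmonic-sum bound \(1 + \frac{1}{2} + \dots + \frac{1}{n} \le 1 + \ln n\)):

\[\begin{split}\Ex[X] &= \sum_{i=1}^n \sum_{j=i+1}^n \Pr[X_{ij} = 1] = \sum_{i=1}^n \sum_{j=i+1}^n \frac{2}{j-i+1}\\ &\le \sum_{i=1}^n 2\parens{\frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n}}\\ &\le 2n \ln n\end{split}\]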

Proof of the Simplified Multiplicative Chernoff Bounds

We previously showed that the unsimplified Chernoff bounds

\[\begin{split}\Pr[X \ge (1 + \delta)\mu] &\le \parens{\frac{e^\delta}{(1+\delta)^{1+\delta}}}^\mu ~~~~\text{for $\delta > 0$}\\ \Pr[X \le (1 - \delta)\mu] &\le \parens{\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}}^\mu ~~~~\text{for $0 < \delta < 1$}\end{split}\]

hold. We now demonstrate that the simplified bounds

\[\begin{split}\large \begin{aligned} \Pr[X \ge (1 + \delta)\mu] &\le e^{-\frac{\delta^2\mu}{2+\delta}} ~~~~\text{for $\delta > 0$}\\ \Pr[X \le (1 - \delta)\mu] &\le e^{-\frac{\delta^2\mu}{2}} ~~~~\text{for $0 < \delta < 1$} \end{aligned}\end{split}\]

follow from the unsimplified bounds.

Proof 271

We first consider the upper-tail Chernoff bound. For \(\delta > 0\), we have

\[\begin{split}\large \begin{aligned} \Pr[X \ge (1 + \delta)\mu] &\le \parens{\frac{e^\delta}{(1+\delta)^{1+\delta}}}^\mu\\ &= \parens{\frac{e^\delta}{(e^{\ln(1+\delta)})^{1+\delta}}}^\mu\\ &= \parens{\frac{e^\delta}{e^{(1+\delta)\ln(1+\delta)}}}^\mu\\ &= \parens{e^{\delta-(1+\delta)\ln(1+\delta)}}^\mu \end{aligned}\end{split}\]

From a list of logarithmic identities, we obtain the inequality

\[\ln(1+x) \ge \frac{2x}{2+x}\]

for \(x > 0\). Negating both sides and then multiplying by \(1 + x > 0\), this gives us

\[\begin{split}\large \begin{aligned} -\ln(1+x) &\le -\frac{2x}{2+x}\\ -(1+x)\ln(1+x) &\le -\frac{2x(1+x)}{2+x} \end{aligned}\end{split}\]

Applying this to our Chernoff-bound expression with \(x = \delta\), we get

\[\begin{split}\large \begin{aligned} \Pr[X \ge (1 + \delta)\mu] &\le \parens{e^{\delta-(1+\delta)\ln(1+\delta)}}^\mu\\ &\le \parens{e^{\delta-(1+\delta)\frac{2\delta}{2+\delta}}}^\mu\\ &= \parens{e^{\frac{\delta(2+\delta)-2\delta(1+\delta)} {2+\delta}}}^\mu\\ &= \parens{e^{\frac{2\delta+\delta^2-2\delta-2\delta^2} {2+\delta}}}^\mu\\ &= \parens{e^{\frac{-\delta^2}{2+\delta}}}^\mu\\ &= e^{-\frac{\delta^2\mu}{2+\delta}} \end{aligned}\end{split}\]

resulting in the simplified upper-tail bound.

For the lower-tail Chernoff bound, we have for \(0 < \delta < 1\):

\[\begin{split}\large \begin{aligned} \Pr[X \le (1 - \delta)\mu] &\le \parens{\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}}^\mu\\ &= \parens{\frac{e^{-\delta}}{(e^{\ln(1-\delta)})^{1-\delta}}}^\mu\\ &= \parens{\frac{e^{-\delta}}{e^{(1-\delta)\ln(1-\delta)}}}^\mu\\ &= \parens{e^{-\delta-(1-\delta)\ln(1-\delta)}}^\mu \end{aligned}\end{split}\]

The Taylor series for the natural logarithm gives us:

\[\begin{split}\ln(1-x) &= -x - \frac{x^2}{2} - \frac{x^3}{3} - \frac{x^4}{4} - \dots\\ (1-x)\ln(1-x) &= \parens{-x - \frac{x^2}{2} - \frac{x^3}{3} - \dots} + \parens{x^2 + \frac{x^3}{2} + \frac{x^4}{3} + \dots}\\ &= -x + \frac{x^2}{2} + \sum_{k \ge 3} \parens{\frac{1}{k-1} - \frac{1}{k}} x^k\\ &\ge -x + \frac{x^2}{2} ~~~~\text{for $0 < x < 1$}\end{split}\]

The second step multiplies the series by \(1 - x\) and collects like powers of \(x\); the inequality then follows since each coefficient \(\frac{1}{k-1} - \frac{1}{k} = \frac{1}{k(k-1)}\) is positive. Negating both sides gives \(-(1-x)\ln(1-x) \le x - \frac{x^2}{2}\). Applying this to our Chernoff-bound expression with \(x = \delta\), we obtain

\[\begin{split}\large \begin{aligned} \Pr[X \le (1 - \delta)\mu] &\le \parens{e^{-\delta-(1-\delta)\ln(1-\delta)}}^\mu\\ &\le \parens{e^{-\delta+\delta-\frac{\delta^2}{2}}}^\mu\\ &= \parens{e^{-\frac{\delta^2}{2}}}^\mu\\ &= e^{-\frac{\delta^2\mu}{2}} \end{aligned}\end{split}\]

completing our proof of the simplified lower-tail bound.
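
As a numeric illustration of the upper-tail bound (an added example; the parameters below are arbitrary), the following Python compares the simplified bound against the exact tail of a sum of fair indicator variables:

    import math

    def chernoff_upper(mu, delta):
        # Simplified upper-tail bound: Pr[X >= (1 + delta) * mu] <= exp(-delta^2 * mu / (2 + delta))
        return math.exp(-delta ** 2 * mu / (2 + delta))

    def exact_upper_tail(n, t):
        # Exact Pr[X >= t] for X a sum of n independent fair indicators, i.e., Binomial(n, 1/2)
        k0 = math.ceil(t)
        return sum(math.comb(n, k) for k in range(k0, n + 1)) / 2 ** n

    n, delta = 200, 0.2
    mu = n / 2                                        # expectation of the sum
    print(exact_upper_tail(n, (1 + delta) * mu))      # exact tail probability (roughly 0.003)
    print(chernoff_upper(mu, delta))                  # Chernoff bound (roughly 0.16), an upper bound on the above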

Proof of the Upper-Tail Hoeffding’s Inequality

The upper-tail Hoeffding’s inequality we use (Theorem 205) states that for a sequence of independent indicator random variables \(X_1, \dots, X_n\),

\[\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le e^{-2\varepsilon^2 n}\]

where \(X = X_1 + \dots + X_n\), \(\Ex[X_i] = p_i\), and \(\lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i = p\). We proceed to prove a more general variant of this bound:

Theorem 272

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent random variables such that \(X_i \in [0, 1]\), and the \(X_i\) have expectation \(\Ex[X_i] = p_i\), respectively. Let

\[p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i\]

Let \(\varepsilon > 0\) be a deviation from the expectation. Then

\[\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le \expp{-2\varepsilon^2 n}\]

(Note that \(\expp{x}\) is alternate notation for \(e^{x}\).)

Here, we do not require that the \(X_i\) are indicators, just that they are in the range \([0, 1]\).
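
As an added illustration of how this bound is typically used (the numbers here are arbitrary): if we estimate \(p\) by the empirical average \(\frac{1}{n}X\) and want the probability of overshooting by \(\varepsilon = 0.05\) to be at most \(0.05\), it suffices to take

\[e^{-2\varepsilon^2 n} = e^{-0.005n} \le 0.05, ~~~~\text{i.e.,}~~ n \ge \frac{\ln 20}{0.005} \approx 600\]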

We proceed to prove Theorem 272. We have

\[\begin{split}\lrPr{\frac{1}{n}X \ge p + \varepsilon} &= \Pr[X \ge n(p + \varepsilon)]\\ &= \Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}]\end{split}\]

The latter step holds for all \(s > 0\), since the map \(x \mapsto e^{sx}\) is then strictly increasing. The process of working with the random variable \(e^{sX}\) rather than \(X\) directly is the Chernoff bounding technique; the idea is that small deviations in \(X\) turn into large deviations in \(e^{sX}\), so that Markov’s inequality (Theorem 183) produces a more useful result. We will choose an appropriate value of \(s\) later. For any value of \(s\), \(e^{sX}\) is nonnegative, so we can apply Markov’s inequality to obtain:

\[\begin{split}\Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}] &\le \Ex[e^{sX}]/\expp{sn(p + \varepsilon)}\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[e^{sX}]\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[\expp{s(X_1 + X_2 + \dots + X_n)}]\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[e^{sX_1} e^{sX_2} \dots e^{sX_n}]\\ &= \expp{-sn(p + \varepsilon)} \prod_i \Ex[e^{sX_i}]\end{split}\]

In the last step, we applied the fact that the \(X_i\) are independent to conclude that the random variables \(e^{sX_i}\) are also independent, and therefore by Lemma 195, the expectation of their product is the product of their expectations.

We now need to establish a bound on \(\Ex[e^{sX_i}]\).

Lemma 273

Let \(X\) be a random variable such that \(X \in [0, 1]\) and \(\Ex[X] = p\). Then

\[\Ex[e^{sX}] \le \expp{sp + s^2/8}\]

for all \(s \in \R\).

Proof 274

We first observe that because \(e^{sx}\) is a convex function of \(x\), we can bound it from above on the interval \([0, 1]\) by the line that passes through the points \((0, 1)\) and \((1, e^s)\):

an exponential function f(x) is convex, so every point on the interval [0, 1] lies at or beneath the line that passes through the endpoints f(0) and f(1)

This line has slope \(e^s - 1\) and y-intercept \(1\), giving us \(xe^s - x + 1\). Thus,

\[e^{sx} \le xe^s - x + 1\]

in the interval \(0 \le x \le 1\). Then

\[\begin{split}\Ex[e^{sX}] &\le \Ex[Xe^s - X + 1]\\ &= (e^s - 1)\Ex[X] + 1\\ &= pe^s - p + 1\end{split}\]

where the second step follows from linearity of expectation. We now have an upper bound on \(\Ex[e^{sX}]\), but it is not a convenient one – we would like it to be of the form \(e^\alpha\), so that we can better use it in proving Theorem 272, for which we’ve already obtained a bound that contains an exponential. Using the fact that \(z = e^{\ln z}\), we get

\[\begin{split}\Ex[e^{sX}] &\le pe^s - p + 1\\ &= \expp{\ln (pe^s - p + 1)}\end{split}\]

To further bound this expression, let

\[\phi(s) = \ln (pe^s - p + 1)\]

Then finding an upper bound for \(\phi(s)\) will in turn allow us to bound \(e^{\phi(s)}\). By the Lagrange form of Taylor’s theorem, we get

\[\phi(s) = \phi(0) + s\phi'(0) + \frac{1}{2}s^2\phi''(v)\]

for some \(v\) between \(0\) and \(s\). To differentiate \(\phi(s)\), we let \(f(s) = \ln s\) and \(g(s) = pe^s - p + 1\), so that \(\phi(s) = (f \circ g)(s)\). Then by the chain rule, we have [2]

\[\begin{split}\phi'(s) &= (f \circ g)'(s)\\ &= f'(g(s)) \cdot g'(s)\\ &= \frac{1}{g(s)} \cdot g'(s)~~~~~~~~~\text{(since $f'(s) = \frac{d}{ds}\ln s = \frac{1}{s}$)}\\ &= \frac{1}{pe^s - p + 1} \cdot pe^s\\ &= \frac{pe^s}{pe^s - p + 1}\end{split}\]

To compute \(\phi''(s)\), we now let \(\hat{f}(s) = pe^s\), so that \(\phi'(s) = \hat{f}(s)/g(s)\). Then by the quotient rule, we get

\[\begin{split}\phi''(s) &= \frac{\hat{f}'(s)g(s) - \hat{f}(s)g'(s)}{g^2(s)}\\ &= \frac{pe^s(pe^s - p + 1) - pe^s pe^s}{(pe^s - p + 1)^2}\\ &= \frac{pe^s}{pe^s - p + 1}\cdot\frac{(pe^s - p + 1) - pe^s}{pe^s - p + 1}\\ &= \frac{pe^s}{pe^s - p + 1}\parens{1 - \frac{pe^s}{pe^s - p + 1}}\\ &= t(1 - t)\end{split}\]

where \(t = \frac{pe^s}{pe^s - p + 1}\). Observe that \(t(1 - t)\) is a parabola with a maximum at \(t = \frac{1}{2}\):

\[\begin{split}h(t) &= t(1 - t) = t - t^2\\ h'(t) &= 1 - 2t\\ h''(t) &= -2\end{split}\]

Setting \(h'(t) = 1 - 2t = 0\), we get that \(t = \frac{1}{2}\) is an extremum, and since the second derivative \(h''(\frac{1}{2})\) is negative, it is a maximum. Thus, we have

\[\begin{split}\phi''(s) &= t(1 - t)\\ &\le \frac{1}{2}\parens{1 - \frac{1}{2}}\\ &= \frac{1}{4}\end{split}\]

We can now plug everything into our result from applying Taylor’s theorem:

\[\begin{split}\phi(s) &= \phi(0) + s\phi'(0) + \frac{1}{2}s^2\phi''(v)\\ &= \ln(p - p + 1) + s\frac{p}{p - p + 1} + \frac{1}{2}s^2\phi''(v)\\ &= sp + \frac{1}{2}s^2\phi''(v)\\ &\le sp + \frac{1}{2}s^2\cdot \frac{1}{4}\\ &= sp + \frac{s^2}{8}\end{split}\]

Thus,

\[\begin{split}\Ex[e^{sX}] &\le e^{\phi(s)}\\ &\le \expp{sp + s^2/8}\end{split}\]

as claimed.

Continuing our proof of Theorem 272, we have

\[\Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}] \le \expp{-sn(p + \varepsilon)} \prod_i \Ex[e^{sX_i}]\]

Applying Lemma 273, we get

\[\begin{split}\Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}] &\le \expp{-sn(p + \varepsilon)} \prod_i \Ex[e^{sX_i}]\\ &\le \expp{-sn(p + \varepsilon)} \prod_i \expp{sp_i + s^2/8}\\ &= \expp{-sn(p + \varepsilon)} \cdot \expp{\sum_i (sp_i + s^2/8)}\\ &= \expp{-sn(p + \varepsilon)} \cdot \expp{snp + s^2n/8}\\ &= \expp{-sn\varepsilon + s^2n/8}\end{split}\]

We now choose \(s\) to minimize the exponent \(r(s) = -sn\varepsilon + s^2n/8\). We have:

\[\begin{split}r(s) &= -sn\varepsilon + \frac{n}{8}s^2\\ r'(s) &= -n\varepsilon + \frac{n}{8}\cdot 2s = -n\varepsilon + \frac{n}{4}s\\ r''(s) &= \frac{n}{4}\end{split}\]

We have another parabola, and since the second derivative is positive, we obtain a minimum for \(r(s)\) at

\[\begin{split}r'(s) &= -n\varepsilon + \frac{n}{4}s = 0\\ s &= 4\varepsilon\end{split}\]

Then

\[\begin{split}r(4\varepsilon) &= -4n\varepsilon^2 + \frac{n}{8}\cdot 16\varepsilon^2\\ &= -4n\varepsilon^2 + 2n\varepsilon^2\\ &= -2n\varepsilon^2\end{split}\]

Thus,

\[\begin{split}\lrPr{\frac{1}{n}X \ge p + \varepsilon} &\le \Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}]\\ &\le \expp{-2n\varepsilon^2}\end{split}\]

completing our proof of Theorem 272.

General Case of Hoeffding’s Inequality

We can further generalize Theorem 272 to the case where the individual random variables \(X_i\) are in the range \([a_i, b_i]\). We start by generalizing Lemma 273 to obtain Hoeffding’s lemma.

Lemma 275 (Hoeffding’s Lemma)

Let \(X\) be a random variable such that \(X \in [a, b]\), where \(a < b\), and \(\Ex[X] = p\). Then

\[\Ex[e^{sX}] \le \expp{sp + s^2(b - a)^2/8}\]

for all \(s \in \R\).

Proof 276

Let \(X' = \frac{X - a}{b - a}\). Then \(X'\) is a random variable such that \(X' \in [0, 1]\), and

\[\Ex[X'] = \lrEx{\frac{X - a}{b - a}} = \frac{p - a}{b - a}\]

by linearity of expectation. Let \(p' = \frac{p - a}{b - a}\). Applying Lemma 273 to \(X'\), we get

\[\Ex[e^{s'X'}] \le \expp{s'p' + s'^2/8}\]

Now consider \(e^{sX}\). We have \(X = a + (b - a)X'\), so

\[\begin{split}\Ex[e^{sX}] &= \Ex[\expp{sa + s(b - a)X'}]\\ &= e^{sa}\Ex[\expp{s(b - a)X'}]\end{split}\]

Let \(s' = s(b - a)\). From the result above, we have

\[\begin{split}e^{sa}\Ex[\expp{s(b - a)X'}] &= e^{sa}\Ex[e^{s'X'}]\\ &\le e^{sa}\expp{s'p' + s'^2/8}\\ &= \expp{sa + s'p' + s'^2/8}\end{split}\]

Substituting \(p' = \frac{p - a}{b - a}\) and \(s' = s(b - a)\), we obtain the following for the exponent:

\[\begin{split}sa + s'p' + \frac{1}{8}s'^2 &= sa + s(b - a)\frac{p - a}{b - a} + \frac{1}{8}s^2(b - a)^2\\ &= sa + s(p - a) + \frac{1}{8}s^2(b - a)^2\\ &= sp + \frac{1}{8}s^2(b - a)^2\end{split}\]

Thus,

\[\begin{split}\Ex[e^{sX}] &\le \expp{sa + s'p' + s'^2/8}\\ &= \expp{sp + s^2(b - a)^2/8}\end{split}\]

as claimed.

We can now derive the general case of Hoeffding’s inequality.

Theorem 277 (Hoeffding’s Inequality – General Case)

Let \(X = X_1 + \dots + X_n\), where the \(X_i\) are independent random variables such that \(X_i \in [a_i, b_i]\) for \(a_i < b_i\) and \(\Ex[X_i] = p_i\). Let

\[p = \lrEx{\frac{1}{n}X} = \frac{1}{n}\sum_i p_i\]

Let \(\varepsilon > 0\) be a deviation from the expectation. Then

\[\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le \lrexp{-\frac{2n^2\varepsilon^2}{\sum_i (b_i - a_i)^2}}\]

Proof 278

We proceed as in the proof of Theorem 272 to obtain

\[\begin{split}\lrPr{\frac{1}{n}X \ge p + \varepsilon} &= \Pr[X \ge n(p + \varepsilon)]\\ &= \Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}]\\ &\le \Ex[e^{sX}]/\expp{sn(p + \varepsilon)}~~~~~~~~~ \text{(Markov's inequality)}\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[e^{sX}]\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[\expp{s(X_1 + X_2 + \dots + X_n)}]\\ &= \expp{-sn(p + \varepsilon)} \cdot \Ex[e^{sX_1} e^{sX_2} \dots e^{sX_n}]\\ &= \expp{-sn(p + \varepsilon)} \prod_i \Ex[e^{sX_i}]~~~~~~~~~\text{(independence of the $X_i$)}\end{split}\]

Applying Hoeffding’s lemma (Lemma 275), we obtain

\[\begin{split}\expp{-sn(p + \varepsilon)} \prod_i \Ex[e^{sX_i}] &\le \expp{-sn(p + \varepsilon)} \prod_i \expp{sp_i + s^2(b_i - a_i)^2/8}\\ &= \expp{-sn(p + \varepsilon)} \cdot \lrexp{\sum_i (sp_i + s^2(b_i - a_i)^2/8)}\\ &= \expp{-sn(p + \varepsilon)} \cdot \lrexp{snp + \frac{s^2}{8} \sum_i (b_i - a_i)^2}\\ &= \lrexp{-sn\varepsilon + \frac{s^2}{8} \sum_i (b_i - a_i)^2}\end{split}\]

We choose \(s\) to minimize the exponent \(r(s)\):

\[\begin{split}r(s) &= -sn\varepsilon + \frac{s^2}{8} \sum_i (b_i - a_i)^2\\ r'(s) &= -n\varepsilon + \frac{s}{4} \sum_i (b_i - a_i)^2\\ r''(s) &= \frac{1}{4} \sum_i (b_i - a_i)^2\end{split}\]

Since \(a_i < b_i\), we have \(b_i - a_i > 0\) for all \(i\), so that \(r''(s)\) is positive. Thus, we obtain a minimum for \(r(s)\) when \(r'(s) = 0\):

\[\begin{split}r'(s) &= -n\varepsilon + \frac{s}{4} \sum_i (b_i - a_i)^2 = 0\\ s &= \frac{4n\varepsilon}{\sum_i (b_i - a_i)^2}\end{split}\]

Then

\[\begin{split}r\parens{\frac{4n\varepsilon}{\sum_i (b_i - a_i)^2}} &= -n\varepsilon\frac{4n\varepsilon}{\sum_i (b_i - a_i)^2} + \frac{16n^2\varepsilon^2}{\parens{\sum_i (b_i - a_i)^2}^2}\cdot \frac{1}{8}\sum_i (b_i - a_i)^2\\ &= -\frac{4n^2\varepsilon^2}{\sum_i (b_i - a_i)^2} + \frac{2n^2\varepsilon^2}{\sum_i (b_i - a_i)^2}\\ &= -\frac{2n^2\varepsilon^2}{\sum_i (b_i - a_i)^2}\end{split}\]

Thus,

\[\begin{split}\lrPr{\frac{1}{n}X \ge p + \varepsilon} &\le \Pr[e^{sX} \ge \expp{sn(p + \varepsilon)}]\\ &\le \lrexp{-\frac{2n^2\varepsilon^2}{\sum_i (b_i - a_i)^2}}\end{split}\]

completing our proof of the general case of Hoeffding’s inequality.
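
As an added closing illustration, when every variable has the same range \(X_i \in [a, b]\), the denominator becomes \(n(b - a)^2\) and the bound specializes to

\[\lrPr{\frac{1}{n}X \ge p + \varepsilon} \le \lrexp{-\frac{2n\varepsilon^2}{(b - a)^2}}\]

which recovers Theorem 272 when \(a = 0\) and \(b = 1\).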