# Design and Analysis of Algorithms

# Introduction

Every complex problem has a clear, simple and wrong solution. — H. L. Mencken

Welcome to *Foundations of Computer Science*! This text covers
foundational aspects of Computer Science that will help you reason
about any computing task. In particular, we are concerned with the
following with respect to problem solving:

- What are common, effective approaches to designing an algorithm?

- Given an algorithm, how do we reason about whether it is correct and how efficient it is?

- Are there limits to what problems we can solve with computers, and how do we identify whether a particular problem is solvable?

- What problems are *efficiently* solvable, and how do we determine whether a particular problem is?

- What techniques can we use to approximate solutions to problems that we don’t know how to solve efficiently?

- Can randomness help us in solving problems?

- How can we exploit difficult problems to build algorithms for cryptography and security?

The only way to answer these questions is to construct a model and apply a proof-based (i.e. math-like) methodology to the questions. Thus, this text will feel much like a math text, but we apply the techniques directly to widely applicable problems in Computer Science.

As an example, how can we demonstrate that there is no general algorithm for determining whether or not two programs have the same functionality? A simple but incorrect approach would be to try all possible algorithms to see if they work. However, there are infinitely many possible algorithms, so we have no hope of this approach working. Instead, we need to construct a model that captures the notion of what is computable by any algorithm and use that to demonstrate that no such algorithm exists.

## Text Objectives

The main purpose of this text is to give you the tools to approach
problems you’ve never seen before. Rather than being given a
particular algorithm or data structure to implement, approaching a new
problem requires reasoning about whether the problem is solvable, how
to relate it to problems that you’ve seen before, what algorithmic
techniques are applicable, whether the algorithm you come up with is
correct, and how efficient the resulting algorithm is. These are all
steps that must be taken *prior* to actually writing code to implement
a solution, and these steps are independent of your choice of
programming language or framework.

Thus, in a sense, the fact that you will not have to write code (though you are free to do so if you like) is a feature, not a bug. We focus on the prerequisite algorithmic reasoning required before writing any code, and this reasoning is independent of the implementation details. Were we to have you implement all the algorithms you design in this text, the workload would be far greater, and it would only replicate the coding practice you get in your programming courses. Instead, we focus on the aspects of problem solving that you have not had much experience in thus far. The training we give you in this text will make you a better programmer, as algorithmic design is crucial to effective programming. This text also provides a solid framework for further exploration of theoretical Computer Science should you wish to pursue that path. However, you will find the material here useful regardless of which subfields of Computer Science you decide to study.

As an example, suppose your boss tells you that you need to make a business trip to visit several cities, and you must minimize the cost of visiting all those cities. This is an example of the classic traveling salesperson problem, and you may have heard that it is an intractable problem. What exactly does that mean? Does it mean that it cannot be solved at all? What if we change the problem so that you don’t have to minimize the cost, but instead must fit the cost within a given budget (say $2000)? Does this make the problem any easier? Or if we require that the total cost be within a factor of two of the optimal rather than actually being minimized?

Before we can reason about the total cost of a trip, we need to know how much it costs to travel between two consecutive destinations in the trip. Suppose we are traveling by air. There may be no direct flight between those two cities, or it might be very expensive. To help us keep our budget down, we need to figure out what the cheapest itinerary between those two cities is, considering intermediate layover stops. And since we don’t know a priori in what order we will visit all the cities, we need to know the cheapest itineraries between all pairs of cities. This is an instance of the all-pairs shortest path problem. Is this a “solvable” problem, and if so, what algorithmic techniques can we use to construct a solution?

We will consider both of the problems above in this text. We will learn what it means for a problem to be solvable or tractable, how to show that a particular problem is or is not solvable, and techniques for designing and analyzing algorithms.

## Tools for Abstraction

*Abstraction* is a core principle in Computer Science, allowing us to
reason about and use complex systems without paying attention to
implementation details. As mentioned earlier, the focus of this text
is on reasoning about problems and algorithms independently of
implementation. To do so, we need appropriate abstract models that are
applicable to any programming language or system architecture.

Our abstract model for expressing algorithms is *pseudocode*, which
describes the steps in the algorithm at a high level without
implementation details. As an example, the following is a pseudocode
description of the Floyd-Warshall algorithm, which we will discuss
later:
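The pseudocode listing itself is not reproduced here; the following is a rough Python sketch of the same algorithm, under the assumption that the graph is given as an \(n \times n\) matrix of edge weights (the representation and function name are our own choices):

```python
from math import inf

def floyd_warshall(w):
    """All-pairs shortest paths. Input w is an n-by-n matrix of edge
    weights, with w[u][v] = inf when there is no edge from u to v."""
    n = len(w)
    d = [row[:] for row in w]   # d[u][v]: shortest u-to-v distance found so far
    for u in range(n):
        d[u][u] = 0             # the empty path from u to itself
    for i in range(n):          # allow vertex i as an intermediate stop
        for u in range(n):
            for v in range(n):
                # either avoid vertex i, or go from u to i and then i to v
                d[u][v] = min(d[u][v], d[u][i] + d[i][v])
    return d
```

Note that the triple loop mirrors the structure described below: the outer loop grows the set of allowed intermediate vertices, while the inner loops update every pair of endpoints.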

This description is independent of how the graph \(G\) is represented, or the two-dimensional matrices \(d^i\), or the specific syntax of a loop. Yet it is still clear what each step of the algorithm is doing. Expressing this algorithm in an actual programming language such as C++ would only add unnecessary syntactic and implementation details that are an artifact of the chosen language and data structures and not intrinsic to the algorithm itself. Instead, pseudocode gives us a simple means of expressing an algorithm that facilitates understanding and reasoning about the core elements of the algorithm.

We also need abstraction in our methodology for reasoning about the efficiency of an algorithm. The actual running time and space usage of a program depend on many factors, including choice of language and data structures, available compiler optimizations, and characteristics of the underlying hardware. Again, these are not intrinsic to an algorithm itself. Instead, we focus not on absolute efficiency but on how the efficiency scales with respect to input size. Specifically, we look at time complexity in terms of:

the number of basic operations as a function of input size,

asymptotically, as the input size grows,

ignoring leading constants,

over worst-case inputs.

We measure space complexity in a similar manner, but with respect to the number of memory cells rather than the number of basic operations. These measures of time and space complexity allow us to evaluate algorithms in the abstract rather than with respect to a particular implementation. (This is not to say that absolute performance is irrelevant. Rather, the asymptotic complexity gives us a “higher-order” view of an algorithm. An algorithm with poor asymptotic time complexity will be inefficient when run on large inputs regardless of how good the implementation is.)

Later in the text, we will also reason about the intrinsic solvability
of a problem. We will see how to express problems in the abstract
(e.g. as *languages* and *decision problems*), and we will examine a
simple model of computation (*Turing machines*) that encapsulates the
essence of computation.

## The First Algorithm

One of the oldest known algorithms is Euclid’s algorithm for computing
the *greatest common divisor (GCD)* of two natural numbers. The
greatest common divisor of \(x,y \in \N\) is the largest number \(z \in
\N\) such that \(z\) evenly divides both \(x\) and \(y\). We often use the
notation \(\gcd(x,y)\) to refer to the GCD of \(x\) and \(y\), and if
\(\gcd(x,y) = 1\), we say that \(x\) and \(y\) are *coprime*. For example,
\(\gcd(21,9) = 3\), so 21 and 9 are not coprime, but \(\gcd(121, 5) = 1\),
so 121 and 5 are coprime.

A naïve algorithm for computing the GCD of \(x\) and \(y\) can try every number starting from \(x\) down to 1 to see if it evenly divides both \(x\) and \(y\). (For simplicity, we will assume that \(x \geq y\); otherwise we can swap their values without changing the answer.)
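The pseudocode listing is omitted here; as a sketch, the naïve approach might be written in Python as follows (the function name is our own):

```python
def naive_gcd(x, y):
    """Greatest common divisor by brute force, assuming x >= y >= 1.
    Try every candidate from x down to 1; the first candidate that
    evenly divides both x and y is the GCD."""
    for z in range(x, 0, -1):
        if x % z == 0 and y % z == 0:
            return z
```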

Here, the \(\bmod\) operation computes the remainder of the first operand divided by the second. For example, \(9 \bmod 6 = 3\) since \(9 = 6 \cdot 1 + 3\), and \(9 \bmod 3 = 0\) since \(9 = 3 \cdot 3 + 0\). The result of \(x \bmod z\) is always an integer in the range \([0, z-1]\), and the result is not defined when \(z\) is 0.

How efficient is the algorithm above? In the worst case, it performs two mod operations for every value in the range \([1, x]\). Using big-O notation, the number of operations is \(\O(x)\).

Can we do better?

Observe that if a number \(z\) divides both \(x\) and \(y\), then it also divides \(x - my\) for any \(m \in \N\); if \(x = z \cdot a\) and \(y = z \cdot b\) for some \(a,b \in \N\), then \(x - my = za - mzb = z \cdot (a - mb)\), so \(z\) evenly divides \(x - my\) as well. By the same reasoning, if a number \(z'\) divides both \(x - my\) and \(y\), it also divides \(x\). Thus, the set of factors common to both \(x\) and \(y\) is the same as the set common to both \(x - my\) and \(y\), so we can subtract off any multiple of \(y\) from \(x\) when computing the GCD without changing the answer:

\[\gcd(x, y) = \gcd(x - my,\; y)\]

Since any \(m\) will do, let’s pick \(m\) to minimize the value \(x - my\). We do so by picking \(m = \floor{x/y}\), the integer quotient of \(x\) and \(y\). Then if we compute \(r = x - my = x - \floor{x/y}y\), we obtain the remainder of \(x\) divided by \(y\). Thus, we have \(r = x \bmod y\), using our previous definition of the mod operation.

This results in another identity:

\[\gcd(x, y) = \gcd(x \bmod y,\; y)\]

Or, since the GCD is symmetric:

\[\gcd(x, y) = \gcd(y,\; x \bmod y)\]

The latter identity will maintain our invariant that the first argument of \(\gcd\) is greater than or equal to the second.

We have determined a recurrence relation for \(\gcd\). We also need base cases. As mentioned previously, \(x \bmod y\) is not defined when \(y\) is 0. Thus, we need a base case for \(y = 0\). In this case, \(x\) evenly divides both \(x\) and 0, so \(\gcd(x, 0) = x\). We can also observe that \(\gcd(x, 1) = 1\). (This latter base case is not technically necessary; some descriptions of Euclid’s algorithm include it while others do not.)

This leads us to the Euclidean algorithm:
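The pseudocode for the algorithm is not reproduced here; a direct Python transcription of the recurrence and the two base cases above might look like:

```python
def euclid(x, y):
    """Euclid's algorithm for gcd(x, y), assuming x >= y >= 0."""
    if y == 0:
        return x      # everything divides 0, so gcd(x, 0) = x
    if y == 1:
        return 1      # optional base case: gcd(x, 1) = 1
    return euclid(y, x % y)
```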

The following are some examples of running this algorithm:

\[\begin{split}\gcd(21, 9) &= \gcd(9, 3) = \gcd(3, 0) = 3\\ \gcd(8, 5) &= \gcd(5, 3) = \gcd(3, 2) = \gcd(2, 1) = 1\end{split}\]

How efficient is this algorithm? It is no longer obvious how many operations it performs. For instance, the computation \(Euclid(30, 19)\) does six iterations, while \(Euclid(376281, 376280)\) does only two. There doesn’t seem to be a clear relationship between the input size and the number of iterations.
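These iteration counts can be checked with a short script (an iterative transcription of the algorithm, instrumented to count iterations; the naming is ours):

```python
def euclid_iterations(x, y):
    """Count the iterations Euclid's algorithm performs on (x, y),
    counting the initial call as the first iteration."""
    count = 1
    while y > 1:              # base cases: y == 0 or y == 1
        x, y = y, x % y
        count += 1
    return count
```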

This is an extremely simple algorithm, consisting of just a handful of lines of code. But that does not make it simple to reason about. We need new techniques to analyze code such as this to determine a measure of its time complexity.

# The Potential Method

Let’s set aside Euclid’s algorithm for the moment and examine a game instead. Consider a flipping game that has an \(11\times 11\) board covered with two-sided chips, say UM on the front side and OSU on the back. Initially, every chip on the board is face down (showing OSU).

The game pits a row player against a column player, and both take turns to flip an entire row or column, respectively. A row or column may be flipped only if it contains more face-down (OSU) than face-up (UM) chips. The game ends when a player can make no legal moves, and that player loses the game. For example, if the game reaches the following state and it is the column player’s move, the column player has no possible moves and therefore loses.

We will set aside strategy and ask a simpler question: must the game end in a finite number of moves, or is there a way for the game to continue indefinitely?

An \(11\times 11\) board is rather large to reason about, so a good strategy is to simplify the problem by considering a \(3\times 3\) board instead. Let’s consider the following game state.

Assume that it is the row player’s turn, and the player chooses to flip the bottom row.

Notice that in general, a move may flip some chips from UM to OSU and others vice versa. This move in particular flipped the first chip in the row from UM to OSU and the latter two chips from OSU to UM. The number of chips flipped in each direction depends on the state of a row or column, and it is not generally the case that every move flips three OSU chips to UM in the \(3\times 3\) board.

Continuing the game, the column player only has a single move available.

Again, one UM chip is flipped to OSU and two OSU chips are flipped to UM.

It is again the row player’s turn, but unfortunately, no row flips are possible. The game ends, with the victory going to the column player.

We observe that each move we examined flipped both UM and OSU chips,
but each move flipped *more* OSU chips to UM than vice versa. Is this
always the case? Indeed it is: a move is only legal if the respective
row or column has more OSU than UM chips. The move flips each OSU chip
to a UM one, and each UM chip to an OSU chip. Since there are more OSU
than UM chips, more OSU to UM swaps happen than vice versa. More
formally, for an \(n\times n\) board, suppose an individual row or column has \(k\) OSU chips and \(n-k\) UM chips. A move is only legal if \(k > n-k\).
After the flip, the row or column will have \(k\) UM chips and \(n-k\) OSU
chips. The number of UM chips gained is \(k - (n-k)\). Since we know
that \(k > n-k\), subtracting \(n-k\) from both sides gives us \(k - (n-k)
> 0\). Thus, at least one UM chip is added by each move, which implies
that at least one OSU chip is removed by a move.

In the \(3\times 3\) case, we start with nine OSU chips. No board configuration has fewer than zero OSU chips. Then if each move removes at least one OSU chip, no more than nine moves are possible. By the same reasoning, no more than 121 moves are possible in the original \(11\times 11\) game.

Strictly speaking, it will always take fewer than 121 moves to reach a configuration where a player has no moves available. But we don’t need an exact number to answer our original question. We have placed an upper bound of 121 on the number of moves, so we have established that a game is indeed bounded by a finite number of moves.
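We can sanity-check this bound by simulating random play (a sketch; the board encoding and function name are our own, with `True` marking a face-down OSU chip):

```python
import random

def flip_game_length(n, rng):
    """Play random legal moves in the n-by-n flipping game until the
    player to move is stuck; return the total number of moves made."""
    board = [[True] * n for _ in range(n)]   # True = face down (OSU)
    moves = 0
    turn = 'row'
    while True:
        if turn == 'row':
            # a row may be flipped only if it has more OSU than UM chips
            options = [i for i in range(n) if sum(board[i]) > n / 2]
        else:
            options = [j for j in range(n)
                       if sum(board[i][j] for i in range(n)) > n / 2]
        if not options:
            return moves                     # current player loses
        idx = rng.choice(options)
        if turn == 'row':
            board[idx] = [not c for c in board[idx]]
        else:
            for i in range(n):
                board[i][idx] = not board[i][idx]
        moves += 1
        turn = 'col' if turn == 'row' else 'row'
```

Every simulated game ends within \(n^2\) moves, matching the potential-function bound.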

The core of our reasoning above is that we compute a measure of the board’s state, e.g. the number of OSU chips. We observe that in each step, this measure decreases. There is a lower bound on the measure, so eventually that lower bound will be reached.

This pattern of reasoning is called the *potential method*. Formally,
if we have an algorithm \(A\), we define a function \(s: A \to \R\) that
maps the state of the algorithm to a number. The function \(s\) is a
*potential function* if:

- \(s\) strictly decreases with every step of the algorithm,[^1] and

- \(s\) has a lower bound.

[^1]: We need to take a bit more care to ensure that \(s\) converges to its lower bound in a finite number of steps. A sufficient condition is that \(s\) decreases by at least some fixed constant \(c\) in each step. In the game example, we established that \(s\) decreases by at least \(c=1\) in each step.

By defining a meaningful potential function and establishing both its lower bound and how it decreases, we can reason about the number of steps required to run a complex algorithm.

## A Potential Function for Euclid’s Algorithm

Let’s come back to Euclid’s algorithm and try to come up with a potential function we can use to reason about its efficiency. Specifically, we want to determine an upper bound on the number of iterations for a given input \(x\) and \(y\).

Observe that \(x\) and \(y\) both decrease from one step to the next. For instance, \(Euclid(30,19)\) calls \(Euclid(19,11)\), so \(x\) decreases from 30 to 19 and \(y\) decreases from 19 to 11. However, the amount each argument individually decreases can vary. In \(Euclid(376281,376280)\), \(x\) decreases by only one in the next iteration while \(y\) decreases by 376279.

Given that the algorithm has two variables, both of which change as the algorithm proceeds, it seems reasonable to define a potential function that takes both into account. Let \(x_i\) and \(y_i\) be the values of the two variables in iteration \(i\). Initially, \(x_0 = x\) and \(y_0 = y\), where \(x\) and \(y\) are the original inputs to the algorithm. We try a simple sum of the two variables as our potential function:

\[s_i = x_i + y_i\]

Before we look at some examples, let’s first establish that this is a valid potential function. Examining the algorithm, we see that:

\[x_{i+1} = y_i, \qquad y_{i+1} = r_i\]

where \(r_i = x_i \bmod y_i\). Given the invariant maintained by the algorithm that \(x_i > y_i\), and that \(x_i \bmod y_i \in [0, y_i-1]\), we have that \(r_i < y_i < x_i\). Then:

\[s_{i+1} = x_{i+1} + y_{i+1} = y_i + r_i < x_i + y_i = s_i\]

Thus, \(s\) always decreases from one iteration to the next, satisfying the first requirement of a potential function. We also observe that \(y_i \geq 0\) for all \(i\); coupled with \(x_i > y_i\), this implies that \(x_i \geq 1\), so \(s_i \geq 1 + 0 = 1\) for all \(i\). Since we have established a lower bound on \(s\), it meets the second requirement for a potential function.

As an example, we look at the values of the potential function in the case of \(Euclid(21, 9)\):

| \(i\) | \(x_i\) | \(y_i\) | \(s_i\) |
|---|---|---|---|
| 0 | 21 | 9 | 30 |
| 1 | 9 | 3 | 12 |
| 2 | 3 | 0 | 3 |

The following are the values for \(Euclid(8, 5)\):

| \(i\) | \(x_i\) | \(y_i\) | \(s_i\) |
|---|---|---|---|
| 0 | 8 | 5 | 13 |
| 1 | 5 | 3 | 8 |
| 2 | 3 | 2 | 5 |
| 3 | 2 | 1 | 3 |

While the values decrease rather quickly for \(Euclid(21, 9)\), they do so more slowly for \(Euclid(8, 5)\). In these examples, the ratio of \(s_{i+1}/s_i\) is highest for \(s_2/s_1\) in \(Euclid(8, 5)\), where it is 0.625. In fact, we will demonstrate an upper bound that is not far from that value.
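The values of \(s_i\) can be reproduced by tracing \(x_i + y_i\) through the iterations of the algorithm (a sketch, with our own function name):

```python
def potential_trace(x, y):
    """Return the values s_i = x_i + y_i over the iterations of
    Euclid's algorithm on inputs (x, y)."""
    values = [x + y]
    while y > 1:              # base cases: y == 0 or y == 1
        x, y = y, x % y
        values.append(x + y)
    return values
```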

**Lemma.** For all valid inputs \(x,y\) to Euclid’s algorithm, \(s_{i+1} \leq \frac{2}{3} s_i\) for all iterations \(i\) of \(Euclid(x, y)\).

The recursive case of \(Euclid(x_i, y_i)\) invokes \(Euclid(y_i, x_i \bmod y_i)\), so \(x_{i+1} = y_i\) and \(y_{i+1} = x_i \bmod y_i = r_i\), the remainder of dividing \(x_i\) by \(y_i\). By definition of remainder, we can express \(x_i\) as:

\[x_i = q_i y_i + r_i\]

where \(q_i = \floor{x_i/y_i}\) is the integer quotient of dividing \(x_i\) by \(y_i\). Since \(x_i > y_i\), we have that \(q_i \geq 1\). Then:

\[s_i = x_i + y_i = q_i y_i + r_i + y_i \geq 2 y_i + r_i\]

We are close to what we need to relate \(s_i\) to \(s_{i+1} = y_i + r_i\), but we need a common multiplier for both the \(y_i\) and \(r_i\) terms. Let’s split the difference by adding \(r_i/2\) and subtracting \(y_i/2\):

\[s_i \geq 2 y_i + r_i \geq 2 y_i + r_i + \frac{r_i}{2} - \frac{y_i}{2} = \frac{3}{2} y_i + \frac{3}{2} r_i\]

The latter step holds because \(r_i < y_i\) (since it is the remainder of dividing \(x_i\) by \(y_i\)), which means that \(y_i - r_i\) is positive. By subtracting a positive number, we only make the right-hand side of the inequality smaller. We started with the left-hand side being bigger than the right-hand side, and making the right-hand side even smaller maintains that inequality.

Continuing onward, we have:

\[s_i \geq \frac{3}{2} y_i + \frac{3}{2} r_i = \frac{3}{2} (y_i + r_i) = \frac{3}{2} s_{i+1}\]

Rearranging the inequality, we conclude that \(s_{i+1} \leq \frac{2}{3} s_i\).

By repeated applications of the lemma, starting with \(s_0 = x_0 + y_0 = x + y\), we can conclude that:

**Corollary.** For all valid inputs \(x,y\) to Euclid’s algorithm, \(s_i \leq \parens{\frac{2}{3}}^i (x + y)\) for all iterations \(i\) of \(Euclid(x, y)\).

We can now prove the following:

**Theorem.** For inputs \(x,y\), the number of iterations of Euclid’s algorithm is \(\O(\log(x + y))\).

We have previously shown that \(1 \leq s_i \leq \parens{\frac{2}{3}}^i (x + y)\) for any iteration \(i\). Thus,

\[\begin{split}1 &\leq \parens{\frac{2}{3}}^i (x + y)\\ \parens{\frac{3}{2}}^i &\leq x + y\\ i &\leq \log_{3/2}(x + y)\end{split}\]

We have established an upper bound on \(i\), which means that the number of recursive calls cannot exceed \(\log_{3/2}(x + y) = \O(\log(x + y))\). (We can change bases by multiplying by a constant, and since \(\O\)-notation ignores leading constants, the base in \(\O(\log(x + y))\) does not matter. Unless otherwise specified, the base of a logarithm is assumed to be 2 in this text.) Since each iteration does at most one mod operation, the total number of operations is \(\O(\log(x + y))\), completing the proof.

Under our assumption that \(x \geq y\), we have that \(\O(\log(x + y)) = \O(\log 2x) = \O(\log x)\). Recall that the naïve algorithm had complexity \(\O(x)\). This means that Euclid’s algorithm is exponentially faster than the naïve algorithm.
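As a quick empirical check of this bound (a sketch, using our own instrumented version of the algorithm that counts the initial call as the first iteration):

```python
import math

def euclid_iterations(x, y):
    """Count the iterations Euclid's algorithm performs on (x, y)."""
    count = 1
    while y > 1:              # base cases: y == 0 or y == 1
        x, y = y, x % y
        count += 1
    return count

# The analysis bounds the iteration index by log_{3/2}(x + y), so the
# total number of calls is at most one more than that.
for x in range(2, 300):
    for y in range(1, x + 1):
        assert euclid_iterations(x, y) <= math.log(x + y, 1.5) + 1
```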

We see that the potential method gives us an important tool in reasoning about the complexity of algorithms, enabling us to establish an upper bound on the runtime of Euclid’s algorithm.

# Divide and Conquer

The *divide-and-conquer* algorithmic pattern involves subdividing a
large, complex problem instance into smaller versions of the same
problem. The subproblems are solved recursively, and their solutions
are combined in a “meaningful” way to construct the solution for the
original instance.

Since divide and conquer is a recursive paradigm, the main tool for analyzing divide-and-conquer algorithms is induction. When it comes to complexity analysis, such algorithms generally give rise to recurrence relations expressing the time or space complexity. While these relations can be solved inductively, we usually rely on other tools to do so. We will see one such tool in the form of the master theorem.

As an example of a divide-and-conquer algorithm, the following is a
description of the *merge sort* algorithm for sorting mutually
comparable items:
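The pseudocode listing is not shown here; a compact Python version of the same scheme (the function names are our own) is:

```python
def merge(left, right):
    """Combine two sorted lists into a single sorted list."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])   # at most one of these two is nonempty
    result.extend(right[j:])
    return result

def merge_sort(a):
    """Sort a list by recursively sorting each half and merging."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))
```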

The algorithm sorts an array by recursively sorting the two halves,
then combining the sorted halves with the *merge* operation. Thus, it
follows the pattern of a divide-and-conquer algorithm.

A naïve algorithm such as insertion sort has a time complexity of \(\O(n^2)\). How does merge sort compare?

Define \(T(n)\) to be the number of operations performed by merge sort on an array of \(n\) elements. Let \(S(n)\) be the number of operations performed by the merge step for an input array of size \(n\). We can express the following recurrence for \(T(n)\):

\[T(n) = 2\,T\parens{\frac{n}{2}} + S(n)\]

Observe that the merge step does ~\(n\) comparisons and ~\(n\) array concatenations. Assuming that each of these operations takes a constant amount of time,[^2] we have that \(S(n) = \O(n)\). This leaves us with the following for \(T(n)\):

\[T(n) = 2\,T\parens{\frac{n}{2}} + \O(n)\]

[^2]: Using a linked data structure, adding an element to the front does indeed take a constant amount of time. On the other hand, the assumption of a constant-time comparison only generally holds for fixed-size data types, such as a 32-bit integer or 64-bit floating-point number. For arbitrary-size numbers or other variable-length data types (e.g. strings), this assumption does not hold, and the cost of the comparison needs to be considered more carefully.

How do we solve this recurrence? While we can do so using induction or other tools for directly solving recurrence relations, in many cases we can take a short cut and apply the master theorem.

## The Master Theorem

Consider an arbitrary divide-and-conquer algorithm that breaks a problem of size \(n\) into:

- \(k\) smaller subproblems, where \(k \geq 1\),

- each subproblem of size \(n/b\), where \(b > 1\),

- with a cost of \(\O(n^d)\) to split the input and combine the results.

This results in a recurrence relation:

\[T(n) = k\,T\parens{\frac{n}{b}} + \O(n^d)\]

More generally, when \(b\) does not evenly divide \(n\), we may end up with one of the following recurrences:

\[T(n) = k\,T\parens{\floor{n/b}} + \O(n^d) \qquad\text{or}\qquad T(n) = k\,T\parens{\lceil n/b \rceil} + \O(n^d)\]

The *master theorem* provides the solution to all three of these
recurrences.

Let \(T(n)\) be a recurrence with one of the following forms, where \(k \ge 1, b > 1, d \ge 0\) are constants:

\[T(n) = k\,T\parens{\frac{n}{b}} + \O(n^d), \qquad T(n) = k\,T\parens{\floor{n/b}} + \O(n^d), \qquad T(n) = k\,T\parens{\lceil n/b \rceil} + \O(n^d)\]

Then:

\[T(n) = \begin{cases}\O(n^d) & \text{if } k < b^d\\ \O(n^d \log n) & \text{if } k = b^d\\ \O\parens{n^{\log_b k}} & \text{if } k > b^d\end{cases}\]

The relationship between \(k\), \(b\), and \(d\) can alternatively be expressed in logarithmic form. For instance,

\[\begin{split}k &= b^d\\ \log_b k &= \log_b b^d = d\end{split}\]

taking the \(\log_b\) of both sides in the last step. We end up with:

\[T(n) = \begin{cases}\O(n^d) & \text{if } \log_b k < d\\ \O(n^d \log n) & \text{if } \log_b k = d\\ \O\parens{n^{\log_b k}} & \text{if } \log_b k > d\end{cases}\]

We prove the master theorem for the case where \(n\) is a power of \(b\); when this is not the case, we can pad the input size to the smallest power of \(b\) that is larger than \(n\), which increases the input size by at most the constant factor \(b\) (can you see why?).

First, we determine how many subproblems we have at each level of the recursion. Initially, we have a single problem of size \(n\). We subdivide it into \(k\) subproblems, each of size \(n/b\). Each of those gets subdivided into \(k\) subproblems of size \(n/b^2\), for a total of \(k^2\) problems of size \(n/b^2\). Those \(k^2\) problems in turn are subdivided into a total of \(k^3\) problems of size \(n/b^3\).

In general, at the \(i\)th level of recursion (with \(i = 0\) the initial problem), there are \(k^i\) subproblems of size \(n/b^i\). We assume that the base case is when the problem size is 1, which is when

\[\frac{n}{b^i} = 1, \qquad\text{i.e. when } i = \log_b n\]

Thus, there are \(1 + \log_b n\) total levels in the recursion.

We now consider how much work is done in each subproblem, which corresponds to the \(\O(n^d)\) term of the recurrence relation. For a subproblem of size \(n/b^i\), this is

\[\O\parens{\parens{\frac{n}{b^i}}^d}\]

With \(k^i\) subproblems at level \(i\), the total work \(T_i\) at level \(i\) is

\[T_i = k^i \cdot \O\parens{\parens{\frac{n}{b^i}}^d} = \parens{\frac{k}{b^d}}^i \cdot \O(n^d)\]

Summing over all the levels, we get a total work

\[T = \sum_{i=0}^{\log_b n} T_i = \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} r^i}\]

where \(r = k/b^d\).

Since we are working with asymptotics, we only care about the dominating term in the sum

\[1 + r + r^2 + \dots + r^{\log_b n}\]

There are three cases:

\(r < 1\): The terms have decreasing value, so that the initial term 1 is the largest and dominates.[^3] Thus, we have

\[\begin{split}T &= \O\parens{n^d \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d)\end{split}\]

[^3]: We’re being a bit sloppy here – the largest term in a sum doesn’t necessarily dominate, as can be seen with a divergent series such as the harmonic series \(\sum_{n=1}^\infty 1/n\). However, what we have here is a geometric series \(\sum_{i=0}^n r^i\), which has the closed-form solution \(\sum_{i=0}^n r^i = (1 - r^{n+1}) / (1 - r)\) for \(r \ne 1\). When \(0 \le r < 1\) (and therefore \(r^{n+1} < 1\)), this is bounded from above by the constant \(1 / (1 - r)\), so the sum is \(\O(1)\).

\(r = 1\): The terms all have the same value 1. Then

\[\begin{split}T &= \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} r^i}\\ &= \O\parens{n^d \cdot \sum_{i=0}^{\log_b n} 1}\\ &= \O\parens{n^d \cdot (1 + \log_b n)}\\ &= \O\parens{n^d + n^d \log_b n}\\ &= \O(n^d \log_b n)\end{split}\]

where we discard the lower-order term in the last step. Since logarithms of different bases differ only by a multiplicative constant factor (see the logarithmic identity for changing the base), which does not affect asymptotics, we can elide the base:

\[\begin{split}T &= \O(n^d \log_b n)\\ &= \O(n^d \log n)\end{split}\]

\(r > 1\): The terms have increasing value, so that the final term \(r^{\log_b n}\) is the largest. Then we have

\[\begin{split}T &= \O\parens{n^d \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O\parens{n^d r^{\log_b n}}\\ &= \O\parens{n^d \parens{\frac{k}{b^d}}^{\log_b n}}\\ &= \O\parens{n^d \cdot k^{\log_b n} \cdot b^{-d \log_b n}}\\ &= \O\parens{n^d \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-d}}\end{split}\]

To simplify this further, we observe the following:

\(b^{\log_b n} = n\), which we can see by taking \(\log_b\) of both sides.

\(k^{\log_b n} = n^{\log_b k}\). We show this as follows:

\[k^{\log_b n} = \parens{b^{\log_b k}}^{\log_b n}\]

Here, we applied the previous observation to substitute \(k = b^{\log_b k}\). We proceed to get

\[\begin{split}k^{\log_b n} &= \parens{b^{\log_b k}}^{\log_b n}\\ &= \parens{b^{\log_b n}}^{\log_b k}\\ &= n^{\log_b k}\end{split}\]

applying the prior observation once again.

Applying both observations, we get

\[\begin{split}T &= \O\parens{n^d \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-d}}\\ &= \O\parens{n^d \cdot n^{\log_b k} \cdot n^{-d}}\\ &=\O\parens{n^{\log_b k}}\end{split}\]

Here, we cannot discard the base of the logarithm, as it appears in the exponent and so affects the value by an exponential rather than a constant factor.

Thus, we have demonstrated all three cases of the master theorem.

In the case of merge sort, we have \(d = 1\) and \(k/b^d = 1\), so the solution is \(T(n) = \O(n \log n)\). Thus, merge sort is more efficient than insertion sort.
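The three cases can be captured in a small helper that, given \(k\), \(b\), and \(d\), reports the asymptotic solution (a sketch; the string output format is our own):

```python
import math

def master_theorem(k, b, d):
    """Solve T(n) = k T(n/b) + O(n^d) asymptotically, returning a
    human-readable description of the growth rate."""
    if k < b ** d:
        return f"O(n^{d})"
    if k == b ** d:
        return f"O(n^{d} log n)"
    e = math.log(k, b)        # exponent log_b k for the k > b^d case
    return f"O(n^{e:.2f})"
```

For merge sort, `master_theorem(2, 2, 1)` reproduces the \(\O(n \log n)\) bound.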

### Master Theorem with Log Factors

A recurrence such as

\[T(n) = 2T\parens{\frac{n}{2}} + \O(n \log n)\]

does not quite fit into the pattern of the master theorem above, since the combination cost \(\O(n \log n)\) is not of the form \(\O(n^d)\) for some constant \(d\). Such a recurrence requires a modified form of the theorem.

Let \(T(n)\) be a recurrence with one of the following forms, where \(k \ge 1, b > 1, d \ge 0, w \ge 0\) are constants:

\[T(n) = k\,T\parens{\frac{n}{b}} + \O(n^d \log^w n)\]

(where \(n/b\) may also be rounded up or down to an integer). Then:

\[T(n) = \begin{cases}\O(n^d \log^w n) & \text{if } k < b^d\\ \O(n^d \log^{w+1} n) & \text{if } k = b^d\\ \O\parens{n^{\log_b k}} & \text{if } k > b^d\end{cases}\]

We need to modify the reasoning in Proof 15 to account for the amount of work done in each subproblem. For a subproblem of size \(n/b^i\), we now have work

\[\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}}\]

We consider the cases of \(k/b^d \le 1\) and \(k/b^d > 1\) separately.

\(k/b^d \le 1\): We compute an upper bound on the work as follows:

\[\begin{split}\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}} &= \O\parens{\frac{n^d}{b^{id}} \cdot \log^w \frac{n}{b^i}}\\ &= b^{-id} \cdot \O\parens{n^d \log^w \frac{n}{b^i}}\\ &= b^{-id} \cdot \O(n^d \log^w n)\end{split}\]

since \(\log^w \frac{n}{b^i} \le \log^w n\) for \(b > 1\), and we only need an upper bound for the asymptotics. Then the work \(T_i\) at level \(i\) is

\[\begin{split}T_i &= \frac{k^i}{b^{id}} \cdot \O(n^d \log^w n)\\ &= \parens{\frac{k}{b^d}}^i \cdot \O(n^d \log^w n)\end{split}\]

Summing over all the levels, we get a total work

\[\begin{split}T &= \sum_{i=0}^{\log_b n} T_i\\ &= \O(n^d \log^w n) \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^d}}^i\\ &= \O(n^d \log^w n) \cdot \sum_{i=0}^{\log_b n} r^i\\ &= \O\parens{n^d \log^w n \cdot \sum_{i=0}^{\log_b n} r^i}\end{split}\]

where \(r = k/b^d\). When \(r < 1\), the initial term of the summation dominates as before, so we have

\[\begin{split}T &= \O\parens{n^d \log^w n \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d \log^w n)\end{split}\]

When \(r = 1\), the terms all have equal size, also as before. Then

\[\begin{split}T &= \O\parens{n^d \log^w n \cdot (1 + r + r^2 + \dots + r^{\log_b n})}\\ &= \O(n^d \log^w n \cdot (1 + \log_b n))\\ &= \O(n^d \log^w n \log_b n)\\ &= \O(n^d \log^w n \log n)\\ &= \O(n^d \log^{w+1} n)\end{split}\]

As before, we drop the lower-order term, as well as the base in \(\log_b n\) since it only differs by a constant factor from \(\log n\).

\(k/b^d > 1\): We start by observing that since \(\log^w n\) is subpolynomial in \(n\), it grows slower than any polynomial in \(n\). We have

\[\frac{k}{b^d} > 1\]

or equivalently

\[\begin{split}\log_b k &> d\\ 0 &< \log_b k - d\end{split}\]

We choose a small exponent \(\epsilon\) such that

\[0 < \epsilon < \log_b k - d\]

For instance, we can choose \(\epsilon = (\log_b k - d) / 2\). Then

\[\begin{split}\log_b k &> d + \epsilon\\ k &> b^{d+\epsilon}\\ \frac{k}{b^{d+\epsilon}} &> 1\end{split}\]

We also have

\[\log^w n \le C n^\epsilon\]

for a sufficiently large constant \(C\) and \(n \ge 1\). For a subproblem of size \(n/b^i\), we now have work

\[\begin{split}\O\parens{\parens{\frac{n}{b^i}}^d \log^w \frac{n}{b^i}} &= \O\parens{\parens{\frac{n}{b^i}}^d \cdot C \parens{\frac{n}{b^i}}^\epsilon}\\ &= \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}\end{split}\]

Then our work \(T_i\) at level \(i\) is

\[T_i = k^i \cdot \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}\]

The total work \(T\) is

\[\begin{split}T &= \sum_{i=0}^{\log_b n} \parens{k^i \cdot \O\parens{C \parens{\frac{n}{b^i}}^{d+\epsilon}}}\\ &= \O\parens{\sum_{i=0}^{\log_b n} \parens{k^i \cdot C \parens{\frac{n}{b^i}}^{d+\epsilon}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{k^i \cdot \parens{b^{-i}}^{d+\epsilon}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{k^i \cdot \parens{b^{d+\epsilon}}^{-i}}}\\ &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^{d+\epsilon}}}^i}\end{split}\]

Since \(k/b^{d+\epsilon} > 1\), the last term in the summation dominates as in Proof 15, so we get

\[\begin{split}T &= \O\parens{C n^{d+\epsilon} \cdot \sum_{i=0}^{\log_b n} \parens{\frac{k}{b^{d+\epsilon}}}^i}\\ &= \O\parens{C n^{d+\epsilon} \cdot \parens{\frac{k}{b^{d+\epsilon}}}^{\log_b n}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot \parens{b^{-(d+\epsilon)}}^{\log_b n}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot \parens{b^{\log_b n}}^{-(d+\epsilon)}}\\ &= \O\parens{C n^{d+\epsilon} \cdot k^{\log_b n} \cdot n^{-(d+\epsilon)}}\\ &= \O\parens{C k^{\log_b n}}\end{split}\]

In the last step, we applied the observation that \(k^{\log_b n} = n^{\log_b k}\). Folding the constant \(C\) into the big-O notation, we end up with

\[\begin{split}T &= \O\parens{C k^{\log_b n}}\\ &= \O\parens{n^{\log_b k}}\end{split}\]

as required.

Applying the modified master theorem to the recurrence

\[T(n) = 2T\parens{\frac{n}{2}} + \O(n \log n)\]

we have \(k = 2, b = 2, d = 1, w = 1, k/b^d = 1\). Thus,

\[\begin{split}T(n) &= \O(n^d \log^{w+1} n)\\ &= \O(n \log^2 n)\end{split}\]

### Non-master-theorem Recurrences¶

Recurrences that do not match the pattern required by the master
theorem can often be manipulated to do so using *substitutions*. We
take a look at two examples here.

Consider the following recurrence:

\[T(n) = 2T\parens{\frac{n}{2}} + \O(n \log n)\]

While we solved this recurrence above using the master theorem with log factors, we can also do so through substitution. We perform the substitution \(S(n) = T(n)/\log{n}\) to get the following:

\[\begin{split}S(n)\log n &= 2S\parens{\frac{n}{2}}\log\frac{n}{2} + \O(n \log n)\\ &= 2S\parens{\frac{n}{2}}(\log n - 1) + \O(n \log n)\end{split}\]

To get this, we substituted \(T(n) = S(n)\log{n}\) and \(T(n/2) = S(n/2)\log(n/2)\) in the first step. Since \(\log n - 1 \le \log n\), we can turn the above into the inequality:

\[S(n)\log n \le 2S\parens{\frac{n}{2}}\log n + \O(n \log n)\]

We can then divide out the \(\log n\) from each term to get:

\[S(n) \le 2S\parens{\frac{n}{2}} + \O(n)\]

We now have something that is in the right form for applying the master theorem. Even though it is an inequality rather than an equality, because we are computing an upper bound, we can still apply the master theorem. We have \(k/b^d = 2/2^1 = 1\), so we get:

\[S(n) = \O(n \log n)\]

We can then undo the substitution \(S(n) = T(n)/\log{n}\) to get:

\[T(n) = S(n)\log n = \O(n \log^2 n)\]
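As a numeric sanity check on this result, we can unroll the recurrence directly. The following is a sketch of ours, assuming the concrete cost function \(n \log_2 n\) and base case \(T(1) = 0\) (both choices are ours, not from the text); for \(n = 2^k\), unrolling gives exactly \(T(n) = n \cdot k(k+1)/2\), which is \(\Theta(n \log^2 n)\).

```python
import math

def T(n):
    """Exactly unroll T(n) = 2*T(n/2) + n*log2(n), with T(1) = 0.

    n must be a power of two so the halving is exact."""
    if n == 1:
        return 0.0
    return 2 * T(n // 2) + n * math.log2(n)

# For n = 2^k, unrolling gives T(n) = n * k * (k + 1) / 2: Theta(n log^2 n).
for k in range(1, 15):
    n = 2 ** k
    closed_form = n * k * (k + 1) / 2
    assert abs(T(n) - closed_form) < 1e-6 * closed_form
```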

In Example 18, we assumed that if \(f(n) = \O(\log n)\), then \(f(n)/\log{n} = \O(1)\). Prove from the definition of big-O that this holds.

## Integer Multiplication¶

We now turn our attention to algorithms for integer multiplication. For fixed-size data types, such as 32-bit integers, multiplication can be done in a constant amount of time and is typically implemented as a hardware instruction for common sizes. However, if we are working with arbitrary \(n\)-bit numbers, we will have to implement multiplication ourselves in software.

Let’s first take a look at the standard grade-school long-multiplication algorithm.

Here, the algorithm is illustrated for binary numbers, but it is the same as for decimal numbers. We first multiply the top number by the last digit in the bottom number. We then multiply the top number by the second-to-last digit in the bottom number, but we shift over the result leftward by one digit. We repeat this for each digit in the bottom number, adding one more leftward shift as we proceed. Once we have done all the multiplications for each digit of the bottom number, we add up the results to compute the final product.

How efficient is this algorithm? If the input numbers are each \(n\) bits long, each individual multiplication takes a linear amount of time – we have to multiply each digit in the top number by the single digit in the bottom number (plus a carry if we are working in decimal). Since we have to do \(n\) multiplications, computing the partial results takes \(\O(n^2)\) total time. We then have \(n\) partial results that we need to add. The longest partial result is the last one, which is ~\(2n\) digits long. Thus, we add \(n\) numbers, each of which has \(\O(n)\) digits. Adding two \(\O(n)\)-digit numbers takes \(\O(n)\) time, so adding \(n\) of them takes a total of \(\O(n^2)\) time. Adding the time for the multiplications and additions leaves us with \(\O(n^2)\) time total for the entire operation.
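The procedure above can be sketched as follows. This is a sketch of ours, not code from the text: it represents numbers as little-endian lists of decimal digits and accumulates each shifted partial product directly into the result, rather than storing all \(n\) partial results and adding them at the end.

```python
def long_multiply(x, y):
    """Grade-school multiplication of little-endian digit lists (base 10).

    Takes O(n^2) time for two n-digit numbers."""
    result = [0] * (len(x) + len(y))
    for j, d in enumerate(y):            # one partial product per digit of y
        carry = 0
        for i, e in enumerate(x):        # multiply x by the single digit d
            total = result[i + j] + d * e + carry
            result[i + j] = total % 10
            carry = total // 10
        result[j + len(x)] += carry      # leftover carry past the last digit
    return result

# 123 * 45 = 5535; digit lists are little-endian
assert long_multiply([3, 2, 1], [5, 4]) == [5, 3, 5, 5, 0]
```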

Can we do better? Let’s try to make use of the divide-and-conquer paradigm. We need a way of splitting up a number into smaller pieces, but we can do that by splitting it into the first \(n/2\) digits and the last \(n/2\) digits. (For the rest of our discussion, we will work with decimal numbers, though the same reasoning applies to numbers in other bases.) Assume that \(n\) is even for simplicity.

As an example, consider the number 376280. Here, \(n = 6\), and splitting it into two gives us 376 and 280. How are these pieces related to the original number? We have:

\[376280 = 376 \cdot 1000 + 280 = 376 \cdot 10^{n/2} + 280\]

In general, when we split an \(n\)-digit number \(N\) into two pieces \(a\) and \(b\), we have that \(N = a\cdot 10^{n/2} + b\).

Let’s now apply this splitting process to multiply two \(n\)-digit numbers \(N_1\) and \(N_2\). Split \(N_1\) into \(a\) and \(b\) and \(N_2\) into \(c\) and \(d\) such that:

\[\begin{split}N_1 &= a \cdot 10^{n/2} + b\\ N_2 &= c \cdot 10^{n/2} + d\end{split}\]

We can then compute \(N_1 \times N_2\) as:

\[\begin{split}N_1 \times N_2 &= \parens{a \cdot 10^{n/2} + b} \times \parens{c \cdot 10^{n/2} + d}\\ &= (a \times c) \cdot 10^n + (a \times d + b \times c) \cdot 10^{n/2} + b \times d\end{split}\]

How efficient is this computation? First, observe that multiplying a number by \(10^k\) is the same as shifting it to the left, appending \(k\) zeros to its end, so it can be done in \(\O(k)\) time. Then we have the following subcomputations:

4 recursive multiplications of \(n/2\)-digit numbers (\(a \times c\), \(a \times d\), \(b \times c\), \(b \times d\))

2 left shifts, each of which takes \(\O(n)\) time

3 additions of \(\O(n)\)-digit numbers, each of which takes \(\O(n)\) time

Let \(T(n)\) be the time it takes to multiply two \(n\)-digit numbers. Using the algorithm above, we have:

\[T(n) = 4T\parens{\frac{n}{2}} + \O(n)\]

Applying the master theorem, we have \(k = 4\), \(b = 2\), \(d = 1\), so \(k/b^d = 2 > 1\). Thus:

\[T(n) = \O\parens{n^{\log_2 4}} = \O(n^2)\]

This is the same as the long-multiplication algorithm! We did a lot of work to come up with a divide-and-conquer algorithm, and it doesn’t do any better than a naïve algorithm. It seems that our division and combination wasn’t sufficiently “meaningful” to result in an improvement.

### The Karatsuba Algorithm¶

Let’s try again, but this time, let’s try to rearrange the computation such that we have fewer than four subproblems. We have shown that

\[N_1 \times N_2 = (a \times c) \cdot 10^n + (a \times d + b \times c) \cdot 10^{n/2} + b \times d\]

Rather than computing \(a \times c\), \(a \times d\), \(b \times c\), and \(b \times d\) directly, we compute the following three terms:

\[\begin{split}m_1 &= (a + b) \times (c + d)\\ m_2 &= a \times c\\ m_3 &= b \times d\end{split}\]

Observe that \(m_1 = a \times c + a \times d + b \times c + b \times d\), and we can subtract \(m_2\) and \(m_3\) to obtain:

\[m_1 - m_2 - m_3 = a \times d + b \times c\]

This is exactly the middle term in our expansion for \(N_1 \times N_2\). Thus:

\[N_1 \times N_2 = m_2 \cdot 10^n + (m_1 - m_2 - m_3) \cdot 10^{n/2} + m_3\]

This algorithm is known as the *Karatsuba algorithm*. How efficient is
the computation? We have the following subcomputations:

Computing \(m_1\) requires two additions of \(n/2\)-digit numbers, resulting in two numbers that are ~\(n/2\) digits each. This takes \(\O(n)\) time. The two ~\(n/2\)-digit numbers get multiplied, which takes \(T(n/2)\) time.

Computing \(m_2\) and \(m_3\) each take \(T(n/2)\) time.

Putting together the terms in the expansion for \(N_1 \times N_2\) requires two left shifts, two additions, and two subtractions, all of which take \(\O(n)\) time.

This leads to the recurrence:

\[T(n) = 3T\parens{\frac{n}{2}} + \O(n)\]

Applying the master theorem, we have \(k = 3\), \(b = 2\), \(d = 1\), so \(k/b^d = 1.5 > 1\). This gives us:

\[T(n) = \O\parens{n^{\log_2 3}} \approx \O(n^{1.585})\]

Thus, the Karatsuba algorithm gives us a runtime that is faster than the naïve algorithm, and it was the first algorithm discovered for integer multiplication that takes less than quadratic time.
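The three-multiplication scheme can be sketched on Python integers. This is a sketch of ours, not code from the text; Python’s built-in arbitrary-precision arithmetic handles the \(\O(n)\) additions, shifts, and the single-digit base case.

```python
def karatsuba(n1, n2):
    """Multiply two nonnegative integers with three recursive multiplications."""
    if n1 < 10 or n2 < 10:               # base case: a single-digit operand
        return n1 * n2
    half = max(len(str(n1)), len(str(n2))) // 2
    a, b = divmod(n1, 10 ** half)        # n1 = a * 10^half + b
    c, d = divmod(n2, 10 ** half)        # n2 = c * 10^half + d
    m1 = karatsuba(a + b, c + d)         # (a+b)(c+d) = ac + ad + bc + bd
    m2 = karatsuba(a, c)
    m3 = karatsuba(b, d)
    middle = m1 - m2 - m3                # ad + bc
    return m2 * 10 ** (2 * half) + middle * 10 ** half + m3

assert karatsuba(376280, 415702) == 376280 * 415702
```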

## The Closest-Pair Problem¶

In the *closest-pair problem*, we are given \(n\) points in
\(d\)-dimensional space, and our task is to find the pair of points
whose mutual distance is smallest among all pairs. This problem has
several applications in computational geometry and data mining (e.g.
clustering). The following is an example of this problem in two
dimensions, where the closest pair is at the top left.

A naïve algorithm compares the distance between every pair of points; since there are \(\O(n^2)\) pairs, the algorithm takes \(\O(n^2)\) time. Can we do better?

Let’s start with a simpler problem. Given a list of \(n\) real numbers \(x_1, x_2, \dots, x_n\), we wish to find the pair of numbers that are closest together. In other words, find \(x_i,x_j\) that minimize \(\abs{x_i - x_j}\), where \(i \ne j\).

Rather than comparing every pair of numbers, we can first sort the list. Then it must be the case that the closest pair of numbers is adjacent in the sorted list. (Exercise: prove this by contradiction.) So we only need to compare each adjacent pair to find the one with minimal distance. The following is a complete algorithm:

As we saw previously, merge sort takes \(\O(n \log n)\) time (assuming fixed-width numbers). The algorithm above also iterates over the sorted list, doing a constant amount of work in each iteration. This takes \(\O(n)\) time. Putting the two steps together results in a total time of \(\O(n \log n)\), which is better than the naïve \(\O(n^2)\).
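The sorting-based algorithm can be sketched as follows. This is a sketch of ours; Python’s built-in sort stands in for merge sort and has the same \(\O(n \log n)\) bound.

```python
def closest_pair_1d(xs):
    """Return the closest pair among n >= 2 real numbers in O(n log n) time."""
    s = sorted(xs)                       # O(n log n), e.g. via merge sort
    best = (s[0], s[1])
    for a, b in zip(s, s[1:]):           # check adjacent pairs only: O(n)
        if b - a < best[1] - best[0]:
            best = (a, b)
    return best

assert closest_pair_1d([4.0, 1.5, 9.0, 1.0, 7.0]) == (1.0, 1.5)
```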

This algorithm works for points in one-dimensional space, which are just real numbers. Unfortunately, it is not clear how to generalize this algorithm to two-dimensional points. While there are various ways we can sort such points, there is no obvious ordering that provides the invariant that the pair of closest points must be adjacent in the resulting ordering.

Let’s instead take another look at the single-dimensional problem, taking a more direct divide-and-conquer approach. We start by finding the median of all the points and partitioning them into two halves according to the median.

We now have two subproblems, and we can recursively find the closest pairs that reside in just the left and just the right halves. However, the closest pair overall may cross the two halves, with one point in the left and one in the right. But we actually only need to consider the pair formed by the maximal point in the left half and the minimal point in the right – any other crossing pair will be further apart than these two.

Thus, we have three pairs of points, and we just need to compare them to determine the closest overall:

the closest pair in the left half

the closest pair in the right half

the closest crossing pair, consisting of the maximal element on the left and the minimal element on the right

The full algorithm is as follows:

Analyzing this algorithm for its runtime: we can find the median of a set of points by sorting them and then examining the middle element. We can also obtain the maximal element in the left half and the minimal element in the right half from the sorted list. Splitting the points by the median takes \(\O(n)\) time – we just need to compare each point to the median. The combination step is dominated by the \(\O(n \log n)\) sort, so we end up with the recurrence:

\[T(n) = 2T\parens{\frac{n}{2}} + \O(n \log n)\]

This doesn’t quite fit into the master theorem since the combination is not of the form \(\O(n^d)\). We showed using substitution in Example 18 that \(T(n) = \O(n \log^2 n)\). This means that this algorithm is less efficient than our previous one. However, there are two modifications we can make:

Use an \(\O(n)\) median-finding algorithm rather than sorting to find the median.

Sort the points once at the beginning, so that we don’t need to re-sort in each step.

Either modification brings the combination step down to \(\O(n)\), resulting in \(T(n) = \O(n \log n)\). (The second option adds an additional \(\O(n \log n)\) on top of this for the presorting, but that still results in a total time of \(\O(n \log n)\).)
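The divide-and-conquer variant can be sketched as follows. This is a sketch of ours: for simplicity it sorts once up front (the second modification above), splits at the midpoint of the sorted list, and returns the minimal distance rather than the pair itself.

```python
def min_gap(xs):
    """Distance between the closest pair of n >= 2 numbers, by divide and conquer."""
    def helper(s):                       # s is a sorted list
        if len(s) < 2:
            return float("inf")
        mid = len(s) // 2                # median position
        left, right = s[:mid], s[mid:]
        crossing = right[0] - left[-1]   # only crossing pair that can be closest
        return min(helper(left), helper(right), crossing)
    return helper(sorted(xs))            # presort once: O(n log n)

assert min_gap([4.0, 1.5, 9.0, 1.0, 7.0]) == 0.5
```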

In the one-dimensional closest-pair algorithm, we computed \(m\) as the median, \(\delta_L\) as the distance between the closest pair on the left, and \(\delta_R\) as the distance between the closest pair on the right. Let \(\delta = \min(\delta_L, \delta_R)\). How many points can lie in the interval \([m, m + \delta)\)? What about the interval \((m - \delta, m + \delta)\)?

Now let’s try to generalize this algorithm to two dimensions. It’s not
clear how to split the points according to a median point. So rather
than doing that, let’s compute a median *line* instead as the median
x-coordinate.

As before, we can recursively find the closest pair solely on the left and the closest pair solely on the right. When it comes to crossing pairs, however, we no longer have just one pair to consider. It may be that the points closest to the median line on the left and right are very far apart in the y-dimension, whereas another crossing pair is slightly further apart in the x-dimension but much closer in the y-dimension. Thus, we need to consider all crossing pairs that are “close enough” to the median line.

How far from the median line do we need to consider? Let \(\delta_L\) be the distance between the closest pair solely on the left and \(\delta_R\) similarly on the right. Let \(\delta = \min(\delta_L, \delta_R)\) be the minimal distance between all pairs of points where both are solely on the left or solely on the right. Then when looking at crossing pairs, we can ignore any point that is \(\delta\) or more away from the median – the distance between such a point and a point on the other side is at least \(\delta\) in just the x-dimension. Then the overall distance is also at least \(\delta\), and they can’t be closer together than the closer of the two non-crossing pairs we have already found.

Thus, we need only consider the points that lie within
the \(\delta\)-*strip*, the space that lies within a distance of
\(\delta\) of the median line in the x-dimension. This leads to the
following algorithm:

Other than examining the \(\delta\)-strip, the computations are the same as in the one-dimensional algorithm, and they take \(\O(n)\) time. (We can presort the points by x-coordinate or use an \(\O(n)\) median-finding algorithm, as before.) How long does it take to examine the \(\delta\)-strip? A naïve examination would consider every pair of points where one lies within the left side of the \(\delta\)-strip and the other on its right side. How many points can lie in the \(\delta\)-strip? In the worst case, all of them! This can happen if the points are close together in the x-dimension but far apart in the y-dimension. So in the worst case, we have \(n/2\) points in the left part of the \(\delta\)-strip and \(n/2\) points in the right, leaving us with \(n^2/4 = \O(n^2)\) pairs to consider. So we end up with the recurrence:

\[T(n) = 2T\parens{\frac{n}{2}} + \O(n^2)\]which the master theorem (\(k/b^d = 2/4 < 1\)) solves as \(T(n) = \O(n^2)\).

This is no better than the naïve algorithm that unconditionally compares all pairs. As with our first attempt at divide-and-conquer multiplication, we have not come up with a “meaningful” enough division and combination.

Let’s take another look at the \(\delta\)-strip to see if we can find a more efficient method for examining it. Consider an arbitrary point \(p_L = (x_L, y_L)\) in the left side of the \(\delta\)-strip. How many points in the right side that are below \(p_L\) (i.e. have a y-coordinate that is \(\le\) the y-coordinate of \(p_L\)) can lie within distance \(\delta\) of \(p_L\)?

Observe that in order for a point \(p_R\) on the right to be below \(p_L\) but within distance \(\delta\), its y-coordinate must be in the range \((y_L - \delta, y_L]\) – otherwise, it would be at least \(\delta\) away from \(p_L\) in just the y-dimension, and thus at least \(\delta\) away overall. Thus, all such points \(p_R\) that are on the right, below \(p_L\), and within a distance \(\delta\) must lie within the \(\delta \times 2\delta\) portion of the \(\delta\)-strip that lies just below \(p_L\). Note that some points in this rectangle are on the left, and some may be further from \(p_L\) than distance \(\delta\), but all points that are on the right, below \(p_L\), and within \(\delta\) of \(p_L\) must lie within the rectangle.

How many points can be in this \(\delta \times 2\delta\) rectangle? We claim no more than eight, including \(p_L\), but we leave the proof as an exercise. Thus, for each point in the \(\delta\)-strip, we need only pair it with the seven points that lie directly below it in the strip.

We have already seen that we can presort by x-coordinate to avoid sorting to find the median in each step. We can also separately presort by y-coordinate (into a different array) so that we do not have to sort the \(\delta\)-strip in each iteration. Instead, we merely filter the points from the presorted array according to whether they lie within the strip. Thus, in the worst case when all \(n\) points are in the \(\delta\)-strip, constructing a sorted \(\delta\)-strip takes \(\O(n)\) time to do the filtering, and we consider ~\(7n = \O(n)\) pairs in the strip. The total combination time is then \(\O(n)\), resulting in the recurrence:

\[T(n) = 2T\parens{\frac{n}{2}} + \O(n)\]which gives \(T(n) = \O(n \log n)\) by the master theorem.

This matches the efficiency of the one-dimensional algorithm. The algorithm can be further generalized to higher dimensions, retaining the \(\O(n \log n)\) runtime for any fixed dimension.
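The two-dimensional algorithm can be sketched as follows. This is a sketch of ours assuming distinct points: it presorts by x- and by y-coordinate, filters the y-sorted points into the \(\delta\)-strip, and checks each strip point against only the next seven in y-order.

```python
import math

def closest_pair(points):
    """Minimal distance among n >= 2 distinct points in the plane: O(n log n)."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def helper(px, py):                  # same points, sorted by x and by y
        if len(px) <= 3:                 # base case: brute force
            return min(dist(p, q) for i, p in enumerate(px)
                       for q in px[i + 1:])
        mid = len(px) // 2
        x_mid = px[mid][0]               # median line
        lx, rx = px[:mid], px[mid:]
        left_set = set(lx)
        ly = [p for p in py if p in left_set]
        ry = [p for p in py if p not in left_set]
        delta = min(helper(lx, ly), helper(rx, ry))
        strip = [p for p in py if abs(p[0] - x_mid) < delta]  # y-sorted strip
        for i, p in enumerate(strip):    # each point vs. the next 7 in y-order
            for q in strip[i + 1:i + 8]:
                delta = min(delta, dist(p, q))
        return delta

    px = sorted(points)
    py = sorted(points, key=lambda p: p[1])
    return helper(px, py)

pts = [(0, 0), (5, 4), (1, 1), (9, 9), (5, 5)]
assert closest_pair(pts) == 1.0          # the pair (5, 4), (5, 5)
```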

Prove that the left square area in the \(\delta \times 2\delta\) rectangle from the \(\delta\)-strip can contain at most 4 points from the left subset, and similarly that the right square area can contain at most 4 points from the right subset.

**Hint:** Partition each square area into four congruent square sub-areas, and show that each sub-area can have at most one point from the relevant subset.

Use the previous part to argue why it suffices to check only the seven points that lie directly below each point in the \(\delta\)-strip, when considering only the points that are within the strip itself.

Note: the value seven here is not optimal, but for the asymptotics we only need a constant.

# Dynamic Programming¶

The idea of subdividing a large problem into smaller versions of the
same problem lies at the core of the divide-and-conquer paradigm.
However, for some problems, this recursive subdivision may result in
encountering many instances of the exact same subproblem. This gives
rise to the paradigm of *dynamic programming*, which is applicable to
problems that have the following features:

The *principle of optimality*, also known as an *optimal substructure*. This means that the optimal solution to a larger problem can be constructed from optimal solutions to its smaller subproblems. For example, shortest-path problems on graphs generally obey the principle of optimality. If the shortest path between vertices \(a\) and \(b\) in a graph goes through some other vertex \(c\), so that the path has the form \(a, u_1, \dots, u_j, c, v_1, \dots, v_k, b\), then the subpath \(a, u_1, \dots, u_j, c\) is the shortest path from \(a\) to \(c\), and similarly for the subpath between \(c\) and \(b\).

*Overlapping subproblems*. This means that the same subproblem appears many times when recursively decomposing the original problem down to the base cases. A classic example of this is a recursive computation of the Fibonacci sequence, which has the recurrence \(F(n) = F(n-1) + F(n-2)\). The same subproblem appears over and over again, such that a naïve computation takes time exponential in \(n\).

The latter characteristic of overlapping subproblems is what distinguishes dynamic programming from divide and conquer.

## Implementation Strategies¶

The first step in applying dynamic programming to a problem is to determine a recurrence relation, along with the base cases. In the case of the Fibonacci sequence, for example, the recurrence and base cases are:

\[\begin{split}F(0) &= 1\\ F(1) &= 1\\ F(n) &= F(n-1) + F(n-2) \text{ for } n \ge 2\end{split}\]

Once we have established the base cases and recurrence relation, we can decide on an implementation strategy. There are three typical patterns:

*Top-down recursive*. The naïve implementation directly translates the recurrence relation into a recursive algorithm, as in the following:Algorithm 26 (Top-down Fibonacci)\[\begin{split}&\Algorithm Fib(n):\\ &~~~\If n \le 1 \return 1\\ &~~~\Return Fib(n-1) + Fib(n-2)\end{split}\]As alluded to previously, the problem with this strategy is that it repeats the same computations many times, to the extent that the overall number of recursive calls is exponential in \(n\). Thus, the naïve top-down strategy is time inefficient when there are overlapping subproblems. However, the algorithm does not use any auxiliary storage 4, so it is space efficient.

- 4
Space is required to represent the recursion stack; for the Fibonacci computation, there are as many as \(\O(n)\) active calls at any point, so the naïve approach actually requires linear space. For other algorithms, however, the naïve implementation can incur lower space costs than the table-based approaches.

*Top-down memoized*, or simply*memoization*. This approach also translates the recurrence relation into a recursive algorithm, but it saves results in a lookup table and queries that table before doing any computation. The following is an example:Algorithm 27 (Memoized Fibonacci)\[\begin{split}&memo \gets \text{ an empty table (i.e. a map or dictionary)}\\ \\ &\Algorithm Fib(n):\\ &~~~\If n \le 1 \return 1\\ &~~~\If n \notin memo:\\ &~~~~~~memo(n) \gets Fib(n-1) + Fib(n-2)\\ &~~~\Return memo(n)\end{split}\]This memoized algorithm avoids recomputing a previously encountered subproblem. Each call to \(Fib(n)\), where \(n\) is not a base case, first checks the \(memo\) table to see if \(Fib(n)\) has been computed before. If not, it goes ahead and computes it, installing the result in \(memo\). If the subproblem was previously encountered, the algorithm just returns the previously computed result.

Memoization trades space for time. The computation of \(Fib(n)\) requires \(\O(n)\) auxiliary space to store the result of each subproblem. On the other hand, since each subproblem is only computed once, the overall number of operations required is \(\O(n)\), a significant improvement over the exponential naïve algorithm.

*Bottom-up table* 5. Rather than starting with the desired input and working our way down to the base cases, we can invert the computation to start with the base cases and then work our way up to the desired input. In general, we need a table to store the results of the subproblems we have computed so far, as those results will be needed to compute larger subproblems.

- 5
In many contexts, the term “dynamic programming” is used to refer specifically to the bottom-up approach. In this text, however, we use the term to refer to both the top-down-memoized and the bottom-up-table strategies, and we will be explicit when we refer to a specific approach.

The following is a bottom-up implementation of computing the Fibonacci sequence:

Algorithm 28 (Bottom-up Fibonacci)\[\begin{split}&\Algorithm Fib(n):\\ &~~~table \gets \text{ an empty table (i.e. a map or dictionary)}\\ &~~~table(0) \gets 1\\ &~~~table(1) \gets 1\\ &~~~\For i = 2 \To n:\\ &~~~~~~table(i) \gets table(i-1) + table(i-2)\\ &~~~\Return table(n)\end{split}\]We start with an empty table and populate it with the results for the base cases. Then we work our way forward from there, computing the result of each new subproblem from the results previously computed for the smaller subproblems. We stop when we reach the desired input and return the result. The following is an illustration of the table for \(Fib(9)\).

When we get to computing \(Fib(i)\), we have already computed \(Fib(i-1)\) and \(Fib(i-2)\), so we can just take those results from the table and add them to get the result for \(Fib(i)\).

Like memoization, the bottom-up approach trades space for time. In the case of \(Fib(n)\), it too requires \(\O(n)\) auxiliary space to store the result of each subproblem, and the overall number of operations required is \(\O(n)\).

In the specific case of computing the Fibonacci sequence, we don’t actually need to keep the entire table around – once we get to computing \(Fib(i)\), we never use the results for \(Fib(i-3)\) or lower. So we only need to keep around the two previously computed results at any time. This lowers the storage overhead to \(\O(1)\). However, this isn’t the case in general for dynamic programming. Other problems require maintaining a larger subset or all of the table throughout the computation.
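The two-variable optimization can be sketched as follows (a sketch of ours, matching the text’s convention that \(Fib(0) = Fib(1) = 1\)):

```python
def fib(n):
    """Bottom-up Fibonacci keeping only the last two results.

    O(n) time, O(1) auxiliary space."""
    prev, curr = 1, 1                    # Fib(0), Fib(1)
    for _ in range(n - 1):
        prev, curr = curr, prev + curr   # slide the two-entry window forward
    return curr if n >= 1 else prev

assert [fib(i) for i in range(7)] == [1, 1, 2, 3, 5, 8, 13]
```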

The three implementation strategies produce different tradeoffs. The
naïve top-down strategy often takes the least implementation effort,
as it is a direct translation of the recurrence relation. It is also
usually the most space efficient, but it typically is the least time
efficient. Memoization adds some implementation effort in working with
a lookup table (though some programming languages provide built-in
facilities for converting a naïve recursive function into a memoized
one, such as `@functools.lru_cache` in Python). Its main advantages over the bottom-up approach are:

It maintains the same structure as the recurrence relation, so it typically is simpler to reason about and implement.

It only computes the subproblems that are actually needed. If the recursive computation is *sparse*, meaning that it requires only a subset of the subproblems that are smaller than the input, the top-down memoization approach can be more time and space efficient than the bottom-up strategy.

However, memoization often suffers from higher constant-factor overheads than the bottom-up approach (e.g. function-call overheads and working with sparse lookup structures that are less time efficient than dense data structures). Thus, the bottom-up approach is preferable when a large fraction of the subproblems are actually needed for computing the desired result.
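In Python, for instance, the naïve recursive version can be memoized just by applying the decorator mentioned above (a minimal sketch, using the text’s convention \(Fib(0) = Fib(1) = 1\)):

```python
from functools import lru_cache

@lru_cache(maxsize=None)                 # the cache plays the role of the memo table
def fib(n):
    if n <= 1:
        return 1
    return fib(n - 1) + fib(n - 2)

# Without the decorator, this call would take an exponential number of steps.
assert fib(30) == 1346269
```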

## Longest Common Subsequence¶

As a more complex example of dynamic programming, we take a look at
the problem of finding the *longest common subsequence (LCS)* of two
strings. A *subsequence* is an ordered subset of the characters in a
string, preserving the order in which the characters appear in the
original string. However, the characters in the subsequence need not
be adjacent in the original string 6. The following are all
subsequences of the string \(\texttt{"Fibonacci sequence"}\):

\(\texttt{"Fun"}\)

\(\texttt{"seen"}\)

\(\texttt{"cse"}\)

- 6
Contrast this with a *substring*, where the characters in the substring must all be adjacent in the original string.

The longest common subsequence \(LCS(S_1, S_2)\) of two strings \(S_1\) and \(S_2\) is the longest string that is both a subsequence of \(S_1\) and of \(S_2\). For instance, the longest common subsequence of \(\texttt{"Go blue!"}\) and \(\texttt{"Wolverine"}\) is \(\texttt{"ole"}\).

Finding the longest common subsequence is useful in many applications, including DNA sequencing and computing the `diff` of two files. As such, we wish to devise an efficient algorithm for determining the LCS of two arbitrary strings.

Let’s assume that we are given two strings \(S_1\) and \(S_2\) as input,
and let their lengths be \(N\) and \(M\), respectively. For now, let’s set
aside the problem of finding the actual longest common subsequence and
focus on just determining the **length** of this subsequence. To apply
dynamic programming, we first need to come up with a recurrence
relation that relates the length of the LCS of \(S_1\) and \(S_2\) to
smaller subproblems. Let’s consider just the last character 7 in each string. There are two possibilities:

- 7
It is equally valid to start with the first character in each string, but we have chosen to start with the last character as it simplifies some implementation details when working with strings in real programming languages.

The last character is the same in both strings, i.e. \(S_1[N] = S_2[M]\); call this \(x\). Then \(x\) must be the last character of the LCS of \(S_1\) and \(S_2\). We can prove this by contradiction. Let \(C = LCS(S_1, S_2)\), and suppose that \(C\) does not end with \(x\). Then it must consist of characters that appear in the first \(N-1\) characters of \(S_1\) and the first \(M-1\) characters of \(S_2\). But \(C + x\) is a valid subsequence of both \(S_1\) and \(S_2\), since \(x\) appears in \(S_1\) and \(S_2\) after all the characters in \(C\). Since \(C + x\) is a longer common subsequence than \(C\), this contradicts the assumption that \(C\) is a longest common subsequence. Thus, the LCS must end with \(x\).

Let \(k\) be the length of \(C = LCS(S_1, S_2)\). By similar reasoning to the above, we can demonstrate that \(C[1..k-1] = LCS(S_1[1..N-1], S_2[1..M-1])\) – that is, \(C\) with its last character removed is the LCS of the strings consisting of the first \(N-1\) characters of \(S_1\) and the first \(M-1\) characters of \(S_2\). This is an example of the principle of optimality, that we can construct an optimal solution to the larger problem from the optimal solutions to the smaller subproblems.

Let \(L(S_1, S_2)\) be the length of the LCS of \(S_1\) and \(S_2\). We have demonstrated that when \(S_1[N] = S_2[M]\), we have:

\[L(S_1, S_2) = L(S_1[1..N-1], S_2[1..M-1]) + 1.\]

The last character differs between the two strings, i.e. \(S_1[N] \ne S_2[M]\). Then at least one of these characters must not be in \(LCS(S_1, S_2)\). We do not know a priori which of the two (or both) are not in the LCS, but we observe the following:

If \(S_1[N]\) is not in \(LCS(S_1, S_2)\), then all the characters in \(LCS(S_1, S_2)\) must appear in the first \(N-1\) characters of \(S_1\). Then \(LCS(S_1, S_2) = LCS(S_1[1..N-1], S_2)\).

If \(S_2[M]\) is not in \(LCS(S_1, S_2)\), then all the characters in \(LCS(S_1, S_2)\) must appear in the first \(M-1\) characters of \(S_2\). Then \(LCS(S_1, S_2) = LCS(S_1, S_2[1..M-1])\).

Since at least one of these cases must hold, we just compute both \(LCS(S_1[1..N-1], S_2)\) and \(LCS(S_1, S_2[1..M-1])\) and see which one is longer. In terms of length alone, we have demonstrated that when \(S_1[N] \ne S_2[M]\), we have that:

\[L(S_1, S_2) = \max(L(S_1[1..N-1], S_2), L(S_1, S_2[1..M-1]))\]

Finally, we note that when either \(S_1\) or \(S_2\) is empty (i.e. \(N=0\) or \(M=0\), respectively), then \(LCS(S_1, S_2)\) is also empty. Putting everything together, we have:

\[L(S_1[1..i], S_2[1..j]) = \begin{cases}0 &\text{if } i = 0 \text{ or } j = 0\\ L(S_1[1..i-1], S_2[1..j-1]) + 1 &\text{if } S_1[i] = S_2[j]\\ \max\parens{L(S_1[1..i-1], S_2[1..j]),\; L(S_1[1..i], S_2[1..j-1])} &\text{otherwise}\end{cases}\]

Now that we have a recurrence relation, we can proceed to construct an algorithm. Let’s take the bottom-up approach. We need a table to store the solutions to the subproblems, which consist of \(L(S_1[1..i], S_2[1..j])\) for all \((i,j) \in [0,N] \times [0,M]\) (we consider \(S_1[1..0]\) to denote an empty string). Thus, we need a table of size \((N+1) \times (M+1)\). The following is the table for \(S_1 = \texttt{"Go blue!"}\) and \(S_2 = \texttt{"Wolverine"}\).

The \((i,j)\)th entry in the table denotes the length of the longest common subsequence for \(S_1[1..i]\) and \(S_2[1..j]\). Using the recurrence relation, we can compute the value of the \((i,j)\)th entry from the entries at locations \((i-1,j-1)\), \((i-1,j)\), and \((i,j-1)\), with the latter two required when \(S_1[i] \ne S_2[j]\) and the former when \(S_1[i] = S_2[j]\). The entry at \((N,M)\) is the length of the LCS of the full strings \(S_1[1..N]\) and \(S_2[1..M]\).

The last thing we need before proceeding to the algorithm is to determine an order in which to compute the table entries. There are multiple valid orders, but we will compute the entries row by row from top to bottom, moving left to right within a row. With this order, the entries required for \((i,j)\) have already been computed by the time we get to \((i,j)\).

We can now write the algorithm for computing the table:

As stated before, the length of \(LCS(S_1, S_2)\) is in the last table entry at position \((N,M)\).
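A direct bottom-up implementation of the table can be sketched as follows. This is a sketch of ours; Python strings are 0-indexed, so table entry \((i, j)\) covers the first \(i\) characters of \(S_1\) and the first \(j\) characters of \(S_2\).

```python
def lcs_table(s1, s2):
    """Build the (N+1) x (M+1) table of LCS lengths for all prefix pairs."""
    n, m = len(s1), len(s2)
    table = [[0] * (m + 1) for _ in range(n + 1)]   # row/column 0: empty prefix
    for i in range(1, n + 1):                       # row by row, left to right
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:              # matching last characters
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table

# The length of the LCS is the last entry; LCS("Go blue!", "Wolverine") = "ole".
assert lcs_table("Go blue!", "Wolverine")[-1][-1] == 3
```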

Now that we have determined how to compute the length of the longest
common subsequence, let’s return to the problem of computing the
subsequence itself. It turns out that we can *backtrack* through the
table to recover the actual characters. We start with the last entry
and determine the path to that entry from a base case. For each entry
on the path, we check to see if the characters corresponding to that
location match. If so, the matching character contributes to the LCS.
If the characters do not match, we check to see if we came from the
left or from above – the larger of the two neighboring entries is the
source, since we took the max of the two in computing the current
entry. If both neighbors have the same value, we can arbitrarily
choose one or the other. The following demonstrates the path that
results from backtracking. The solid, red path arbitrarily chooses to
go up when both neighbors have the same value, while the dashed, blue
path chooses to go left. Both paths result in a valid LCS (and in this
case, the same set of characters, though that isn’t necessarily always
the case).

The algorithm for backtracking is as follows:
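A Python sketch of backtracking follows; it rebuilds the table so that the function is self-contained, and it arbitrarily moves up on ties, like the solid path described above:

```python
def lcs(s1, s2):
    """Recover a longest common subsequence by backtracking."""
    n, m = len(s1), len(s2)
    # Build the table of LCS lengths bottom-up.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Backtrack from (n, m) toward a base case.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])    # matching character is part of the LCS
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1                   # came from above (go up on ties)
        else:
            j -= 1                   # came from the left
    return "".join(reversed(out))
```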

How efficient is this algorithm? Computing a single table entry requires a constant number of operations. Since there are \((N+1) \cdot (M+1)\) entries, constructing the table takes \(\O(NM)\) time and requires \(\O(NM)\) space. Backtracking also does a constant number of operations per entry on the path, and the path length is at most \(N+M+1\), so backtracking takes \(\O(N+M)\) time. Thus, the total complexity of this algorithm is \(\O(NM)\) in both time and space.

## Longest Increasing Subsequence¶

Given a sequence of numbers \(S\), an *increasing subsequence* of \(S\) is
a subsequence such that the elements are in strictly increasing order.
As with a subsequence of a string, the elements do have to appear in
the same relative order in the subsequence as they do in the original
sequence \(S\). Suppose we want to find a *longest increasing
subsequence (LIS)* of \(S\). For example, the sequence \(S = (0, 8, 7,
12, 5, 10, 4, 14, 3, 6)\) has several longest increasing subsequences:

\((0, 8, 12, 14)\)

\((0, 8, 10, 14)\)

\((0, 7, 12, 14)\)

\((0, 7, 10, 14)\)

\((0, 5, 10, 14)\)

As in the previous problem, we focus first on finding the length of an LIS before concerning ourselves with the actual elements. Let \(N\) be the length of \(S\). For our first attempt, we take a look at the subproblems of computing the length of the LIS for each sequence \(S[1..i]\) for \(i \in [1,N]\). In the case of \(S = (0, 8, 7, 12, 5, 10, 4, 14, 3, 6)\), we can determine these lengths by hand.

However, it is not clear how we can relate the length of the LIS for a sequence \(S\) of \(N\) elements to a smaller sequence. In the example above, \(S[1..9]\) and \(S[1..10]\) both have the same length LIS, but that is not the case for \(S[1..7]\) and \(S[1..8]\). Yet, in both cases, the additional element is larger than the previous one (i.e. \(S[10] > S[9]\) and \(S[8] > S[7]\)). Without knowing the contents of the LIS itself, it is not obvious how the result for the larger problem is related to that of smaller ones.

For dynamic programming to be applicable, we must identify a problem formulation that satisfies the principle of optimality. So rather than tackling the original problem directly, let’s constrain ourselves to subsequences of \(S[1..N]\) that actually contain the last element \(S[N]\). Define \(CLIS(S[1..N])\) to be the longest of the increasing subsequences of \(S[1..N]\) that contain \(S[N]\). Then \(CLIS(S[1..N])\) has the form \((S[i_1], S[i_2], \dots, S[i_k], S[N])\). Observe that by definition of increasing subsequence, we must have \(i_k < N\) and \(S[i_k] < S[N]\). In addition, \((S[i_1], S[i_2], \dots, S[i_k])\) is itself an increasing subsequence that contains \(S[i_k]\). Moreover, we can show via contradiction that \(CLIS(S[1..i_k]) = (S[i_1], S[i_2], \dots, S[i_k])\). Thus, \(CLIS(S[1..i_k])\) is a subproblem of \(CLIS(S[1..N])\) when \(S[i_k] < S[N]\).

We do not know a priori which index \(i_k\) produces a result that is a subsequence of \(CLIS(S[1..N])\). But only the indices \(i_k\) such that \(i_k < N\) and \(S[i_k] < S[N]\) may do so, so we can just try all such indices and take the maximum. Let \(L(i)\) be the length of \(CLIS(S[1..i])\). Then we have:

\[L(i) = 1 + \max\{L(j) : j < i \text{ and } S[j] < S[i]\}\]

The following illustrates this computation for the concrete sequence \(S\) above, specifically showing which values need to be compared when computing \(L(10)\).

We also have to account for the base cases, where no such \(j\) exists. In that case, \(L(i) = 1\), since \(CLIS(S[1..i])\) consists of just \(S[i]\) itself. The full recurrence is as follows:

\[L(i) = \begin{cases} 1 &\text{if there is no } j \text{ with } j < i \text{ and } S[j] < S[i]\\ 1 + \max\{L(j) : j < i \text{ and } S[j] < S[i]\} &\text{otherwise} \end{cases}\]

We can then construct an algorithm for computing \(L(i)\). Once we have \(L(i)\) for all \(i \in [1, N]\), we just have to take the maximum value of \(L(i)\) as the result for the length of the unconstrained LIS. As with LCS, we can also backtrack through the table of values for \(L(i)\) to find the actual elements belonging to the longest increasing subsequence.
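A bottom-up sketch in Python, computing \(L(i)\) and then backtracking through a `prev` array (an illustrative bookkeeping choice) to recover the elements:

```python
def longest_increasing_subsequence(seq):
    """Compute L(i) bottom-up, then backtrack to recover an LIS."""
    n = len(seq)
    if n == 0:
        return []
    L = [1] * n        # base case: the subsequence (seq[i],) by itself
    prev = [-1] * n    # predecessor index on a best CLIS ending at i
    for i in range(n):
        for j in range(i):
            if seq[j] < seq[i] and L[j] + 1 > L[i]:
                L[i] = L[j] + 1
                prev[i] = j
    # The unconstrained LIS ends wherever L(i) is maximal.
    end = max(range(n), key=lambda i: L[i])
    out = []
    while end != -1:
        out.append(seq[end])
        end = prev[end]
    return out[::-1]
```

On the sample sequence above, this tie-breaking happens to recover \((0, 8, 12, 14)\), one of the longest increasing subsequences listed earlier.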

How efficient is the algorithm, assuming a bottom-up approach? We must compute \(N\) values for \(L(i)\), and each value requires scanning over all the previous elements of \(S\), as well as all the previous values of \(L(i)\) in the worst case. Thus, it takes \(\O(N)\) time to compute \(L(i)\) for a single \(i\), and \(\O(N^2)\) to compute them all. Finding the maximum \(L(i)\) takes linear time, as does backtracking. So the algorithm as a whole takes \(\O(N^2)\) time, and it uses \(\O(N)\) space to store the values of \(L(i)\).

## All-Pairs Shortest Path¶

Suppose you are building a flight-aggregator website. Each day, you receive a list of flights from several airlines with their associated costs, and some flight segments may actually have negative cost if the airline wants to incentivize a particular route (see hidden-city ticketing for an example of how this can be exploited in practice, and what the perils of doing so are). You’d like your users to be able to find the cheapest itinerary from point A to point B. To provide this service efficiently, you determine in advance the cheapest itineraries between all possible origin and destination locations, so that you need only look up an already computed result when the user puts in a query.

This situation is an example of the *all-pairs shortest path* problem.
The set of cities and flights can be represented as a graph \(G =
(V,E)\), with the cities represented as vertices and the flight
segments as edges in the graph. There is also a weight function
\(weight: E\to\R\) that maps each edge to a cost. While an individual
edge may have a negative cost, no negative cycles are allowed.
(Otherwise a traveler could just fly around that cycle to make as much
money as they want, which would be very bad for the airlines!) Our
task is to find the lowest-cost path between all pairs of vertices in
the graph.

How can we apply dynamic programming to this problem? We need to
formulate it such that there are self-similar subproblems. To gain
some insight, we observe that many aggregators allow the selection of
*layover* airports, which are intermediate stops between the origin
and destination, when searching for flights. The following is an
example from kayak.com.

We take the set of allowed layover airports as one of the key characteristics of a subproblem – computing shortest paths with a smaller set of allowed layover airports is a subproblem of computing shortest paths with a larger set of allowed layover airports. Then the base case is allowing only direct flights, with no layover airports.

Coming back to the graph representation of this problem, we formalize
the notion of a layover airport as an *intermediate vertex* of a
simple path, which is a path without cycles. Let \(p = (v_1, v_2,
\dots, v_m)\) be a path from origin \(v_1\) to destination \(v_m\). Then
\(v_2, \dots, v_{m-1}\) are its intermediate vertices.

Assume that the vertices are labeled as numbers in the set \(\{1, 2, \dots, \abs{V}\}\). We parameterize a subproblem by \(k\), which signifies that the allowed set of intermediate vertices (layover airports) is restricted to \(\{1, 2, \dots, k\}\). Then we define \(d^k(i,j)\) to be the length of the shortest path between vertices \(i\) and \(j\), where the path is only allowed to go through intermediate vertices in the set \(\{1, 2, \dots, k\}\).

We have already determined that when no intermediate vertices are allowed, which is when \(k = 0\), the shortest path between \(i\) and \(j\) is just the direct edge (flight) between them. Thus, our base case is

\[d^0(i,j) = weight(i,j),\]

where \(weight(i,j)\) is the weight of the edge between \(i\) and \(j\).

We proceed to the recursive case. We have at our disposal the value of \(d^{k-1}(i',j')\) for all \(i',j' \in V\), and we want to somehow relate \(d^k(i,j)\) to those values. The latter represents adding vertex \(k\) to our set of permitted intermediate vertices. There are two possible cases for the shortest path between \(i\) and \(j\) that is allowed to use any of the intermediate vertices \(1, 2, \dots, k\):

Case 1: The path does not go through \(k\). Then the length of the shortest path that is allowed to go through \(1, 2, \dots, k\) is the same as that of the shortest path that is only allowed to go through \(1, 2, \dots, k-1\), so \(d^k(i,j) = d^{k-1}(i,j)\).

Case 2: The path does go through \(k\). Then this path is composed of two segments, one that goes from \(i\) to \(k\) and another that goes from \(k\) to \(j\). We minimize the cost of the total path by minimizing the cost of each of the segments – the costs respect the principle of optimality.

Neither of the two segments may have \(k\) as an intermediate vertex – otherwise we would have a cycle. The only way for a path with a cycle to have lower cost than one without is for the cycle as a whole to have negative weight, which was explicitly prohibited in our problem statement. Since \(k\) is not an intermediate vertex in the segment between \(i\) and \(k\), the shortest path between them that is allowed to go through intermediate vertices \(1, 2, \dots, k\) is the same as the shortest path that is only permitted to go through \(1, 2, \dots, k-1\). In other words, the length of this segment is \(d^{k-1}(i,k)\). By the same reasoning, the length of the segment between \(k\) and \(j\) is \(d^{k-1}(k,j)\).

Thus, we have that in this case, \(d^k(i,j) = d^{k-1}(i,k) + d^{k-1}(k,j)\).

We don’t know a priori which of these two cases holds, but we can just compute them both and take the minimum. This gives us the recursive case:

\[d^k(i,j) = \min\bigl(d^{k-1}(i,j),\ d^{k-1}(i,k) + d^{k-1}(k,j)\bigr)\]

Combining this with the base case, we have our complete recurrence relation:

\[d^k(i,j) = \begin{cases} weight(i,j) &\text{if } k = 0\\ \min\bigl(d^{k-1}(i,j),\ d^{k-1}(i,k) + d^{k-1}(k,j)\bigr) &\text{otherwise} \end{cases}\]

We can now construct a bottom-up algorithm to compute the shortest paths:
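A Python sketch of this bottom-up computation follows. Vertices are numbered \(0\) to \(n-1\) rather than \(1\) to \(\abs{V}\), and `weight` is given as a dictionary of edge costs (an illustrative encoding). Updating the table in place is safe here because row and column \(k\) do not change during iteration \(k\):

```python
def floyd_warshall(n, weight):
    """All-pairs shortest path lengths on vertices 0..n-1.

    weight maps (i, j) edge pairs to costs; missing edges are treated
    as infinitely expensive. Assumes no negative cycles.
    """
    INF = float("inf")
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0                  # empty path from a vertex to itself
    for (i, j), w in weight.items():
        d[i][j] = min(d[i][j], w)    # k = 0: direct flights only
    for k in range(n):               # now allow vertex k as a layover
        for i in range(n):
            for j in range(n):
                # case 1: avoid k; case 2: go i -> k -> j
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d
```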

This is known as the *Floyd-Warshall* algorithm, and it runs in time
\(\O(\abs{V}^3)\). The space usage is \(\O(\abs{V}^2)\) if we only keep
around the computed values of \(d^m(i,j)\) for iterations \(m\) and \(m+1\).
Once we have computed these shortest paths, we need only look up the
already computed result to find the shortest path between a particular
origin and destination.

# Greedy Algorithms¶

A *greedy algorithm* computes a solution to a problem by making a
locally optimal choice at each step of the algorithm. In general,
there is no guarantee that locally optimal choices lead to a globally
optimal result. However, for some specific greedy algorithms, we can
demonstrate that the result is globally optimal.

As an example, consider the problem of finding a *minimal spanning
tree (MST)* of a weighted, undirected graph. Given such a graph, we
would like to find a subset of the edges such that the subgraph
induced by those edges is connected, touches every vertex, and has the
least total edge cost. This is an important problem in designing
networks, including transportation and communication networks, where
we want to ensure that there is a path between any two vertices while
minimizing the cost of constructing the network.

Before we proceed, let’s review the definition of a *tree*. There
are three equivalent definitions:

Let \(G = (V,E)\) be an undirected graph. Then \(G\) is a tree if it is connected without cycles.

A graph is *connected* when there is a path between any two vertices.
A *cycle* is a sequence of adjacent edges that starts and ends at the
same vertex.

Let \(G = (V,E)\) be an undirected graph. Then \(G\) is a tree if it is minimally connected.

A graph is *minimally connected* if removing any single edge causes it
to be disconnected.

Let \(G = (V,E)\) be an undirected graph. Then \(G\) is a tree if it is maximal without cycles.

A graph \(G\) is *maximal without cycles* if adding any edge that is not
in \(G\) but connects two vertices in \(G\) causes the graph to contain a
cycle.

Show that the three definitions of a tree are all equivalent.

In the MST problem, our goal is to find a subset of the edges of a connected input graph such that:

1. The subset touches all vertices in the graph, i.e. it *spans* the graph.

2. The subset has no cycles.

3. The subset has the least total edge cost over all subsets that meet the first two requirements.

The first requirement ensures that the subset is connected and spans
the original graph. Combined with the second requirement, the end
result is a tree by Definition 32. Since the tree
spans the original graph, we call it a *spanning tree*. A *minimal
spanning tree* is then a spanning tree that has the minimal cost over
all spanning trees of the original graph.

The following illustrates three of the spanning trees of a sample graph. The middle one is the MST, as its total weight of 8 is lower than that of any other spanning tree.

A graph may have multiple minimal spanning trees. In the following graph, any two edges form an MST.

Now that we understand what a minimal spanning tree is, we take a look
at *Kruskal’s algorithm*, a greedy algorithm for computing an MST. The
algorithm simply examines the edges in order of weight, taking any
edge that does not induce a cycle in the partially computed result.
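As a sketch, Kruskal's algorithm can be written in Python using a union-find structure to detect cycles (the union-find representation is one possible choice; two vertices are in the same set exactly when they are already connected in the partial result):

```python
def kruskal(n, edges):
    """Kruskal's algorithm on vertices 0..n-1.

    edges is a list of (weight, u, v) tuples; returns the MST's edges.
    """
    parent = list(range(n))

    def find(x):
        # union-find root lookup with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):    # examine edges in order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                 # u, v not yet connected: no cycle induced
            parent[ru] = rv
            mst.append((w, u, v))
    return mst
```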

As an example, we use Kruskal’s algorithm to compute the MST of the following graph. We start by ordering the edges according to their weights.

Then we look at the edges in order, inserting an edge into our partial result if adding it does not result in a cycle.

Observe that when the algorithm looks at the edge with weight 8, the two adjacent vertices are already connected in the partially computed tree, so adding that edge would introduce a cycle. Thus, the algorithm skips it and continues on to the next edge. In this graph, there are two edges of weight 9, so the algorithm arbitrarily picks one of them to examine first.

The algorithm terminates when all edges have been examined. For this graph, the resulting spanning tree has a total weight of 66.

Kruskal’s algorithm is a greedy algorithm – in each step of the algorithm, it examines the minimal-weight edge remaining, so it is making locally optimal choices. As stated previously, it isn’t always the case that locally optimal choices lead to a globally optimal result. However, as we will now show, it is indeed the case for Kruskal’s algorithm that the result is a minimal spanning tree.

The output of Kruskal’s algorithm is a tree.

To prove this, we will assume that the input graph \(G\) is *fully
connected*, meaning that there is an edge between every pair of
vertices in the graph. We can turn a graph into a fully connected one
by adding the missing edges with some very large (or infinite) weight,
so that the new edges do not change the MST.

Let \(T\) be the output of Kruskal’s algorithm on a fully connected input graph \(G\). Recall from Definition 34 that a tree is a maximal graph without cycles. Clearly \(T\) does not have any cycles, as Kruskal’s algorithm only adds an edge to \(T\) when it does not induce a cycle. Then we need only show that \(T\) is maximal. Suppose for the purpose of contradiction that it is not. By definition, this means that we can add some edge \(e\) to \(T\) without introducing a cycle. However, the algorithm examines every edge in the input graph, so it must have encountered \(e\) at some point, and the algorithm adds an edge to \(T\) if it would not create a cycle. Since it did not add \(e\) to \(T\), adding \(e\) to \(T\) must induce a cycle, which contradicts our assumption that it does not. Thus, no such edge exists, so \(T\) is maximal. Since \(T\) is a maximal graph without cycles, it is a tree.

We can similarly demonstrate that the output of Kruskal’s algorithm is a spanning tree, but we leave that as an exercise.

Show that if the input graph \(G\) is connected, the output \(T\) of Kruskal’s algorithm spans all the vertices in \(G\).

Next, we show that the result of Kruskal’s algorithm is a minimal spanning tree. To do so, we actually prove a stronger claim:

If \(S\) is a set of edges chosen at *any* stage of Kruskal’s
algorithm on an input graph \(G\), then there is *some* minimal
spanning tree of \(G\) that contains all the edges in \(S\).

If this claim is true for all intermediate sets of edges in Kruskal’s algorithm, it is true for the final set of edges. Since we have already shown that the final set of edges is a tree, and a tree is maximal without cycles, an MST containing that set can have no additional edges. Hence the final set is itself an MST.

To prove Claim 40, we make use of the
*exchange property* of minimal spanning trees:

Let \(T\) be a minimal spanning tree of a graph \(G\), and let \(e \notin T\) be an edge that is not in \(T\) but is in the original graph \(G\). Then:

\(T \cup \{e\}\) has a set of edges \(C\) that comprise a cycle.

For any edge \(f\) in \(C\), the weight of \(f\) is no more than the weight of \(e\), i.e. \(\forall f \in C.~ w(f) \le w(e)\).

Recall by Definition 34 that a tree is a maximal graph without cycles. Then the first part of the exchange property immediately follows, since \(T\) is a tree.

To show the second part, observe that all the edges in \(C\) other
than \(e\) are contained in \(T\). Pick any such edge \(f \in C, f \ne
e\). Consider the tree that *exchanges* the edge \(f\) for \(e\), i.e.
the tree \(T' = T \cup \{e\} \setminus \{f\}\).

Since \(T\) is a minimal spanning tree, the total weight of \(T\) must be no more than the total weight of \(T'\). Then we have:

\[w(T) \le w(T') = w(T) + w(e) - w(f)\]

Rearranging the terms in the inequality \(w(T) \le w(T) + w(e) - w(f)\), we obtain \(w(f) \le w(e)\), the second part of the exchange property.

We can now prove Claim 40.

We prove Claim 40 by induction over the size of \(S\).

**Base case:** \(S\) is empty, so every MST contains \(S\).

**Inductive step:** Let \(T\) be an MST that contains \(S\). Suppose the algorithm chooses an edge \(e\) in the next step. We need to show that there is some MST \(T'\) that contains the edges in \(S \cup \{e\}\). There are two possibilities: either \(e\) is contained in \(T\), or it is not.

Case 1: \(e \in T\). Since \(T\) is an MST that contains \(S\), and \(T\) also contains \(e\), \(T\) is an MST that contains all of \(S \cup \{e\}\). (In other words, \(T' = T\) in this case.)

Case 2: \(e \notin T\). Then by Definition 34, \(T \cup \{e\}\) has a cycle \(C\).

From the logic in Kruskal’s algorithm we know that \(S \cup \{e\}\) does not contain a cycle. Thus, there must be some edge \(f\) that is in \(C\) but is not in \(S \cup \{e\}\). Since \(C\) is composed of edges from \(T\) plus the additional edge \(e\), \(f\) is in \(T\).

Observe that \(S \cup \{f\} \subseteq T\), since all edges in \(S\) are also contained in \(T\). Since \(T\) does not contain a cycle, neither does \(S \cup \{f\}\).

Since adding \(f\) to \(S\) does not induce a cycle, the algorithm could have chosen to add \(f\) to \(S\). The only circumstance in which it chooses \(e\) first is when the weight of \(e\) is less than or equal to that of \(f\), i.e. \(w(e) \le w(f)\).

On the other hand, since \(T\) is an MST and \(T \cup \{e\}\) has the cycle \(C\) that contains both \(e\) and \(f\), we have \(w(f) \le w(e)\) by the exchange property of MSTs.

Thus, we have \(w(e) = w(f)\). Then the exchanged tree \(T' = T \cup \{e\} \setminus \{f\}\) has weight \(w(T') = w(T) + w(e) - w(f) = w(T)\). Since it has the same weight as \(T\) and \(T\) is an MST, \(T'\) is also an MST. Since \(S \subseteq T\) and \(f \notin S\), \(T' = T \cup \{e\} \setminus \{f\}\) contains all the edges in \(S \cup \{e\}\). Thus, \(T'\) is an MST that contains all the edges in \(S \cup \{e\}\).

As we have shown that the set of edges chosen at any point by Kruskal’s algorithm is contained within an MST, the final set of edges is also contained within some MST. Since the final set of edges is itself a tree and a tree is maximal, that final set must be an MST.

We have demonstrated that Kruskal’s algorithm does indeed produce an MST. We will also state without proof that the running time on an input graph \(G = (V,E)\) is \(\O(\abs{E} \log \abs{E})\) with the appropriate choice of data structure for the algorithm.

Observe that the analysis of Kruskal’s algorithm was nontrivial. Unfortunately, this is often the case for greedy algorithms. Again, it is not necessarily the case that locally optimal choices lead to a globally optimal solution, so we typically need to do significant work to demonstrate that this is the case for a particular problem and algorithm.