Lecture Notes for CSCI 3110: Design and Analysis of Algorithms

Travis Gagie
Faculty of Computer Science
Dalhousie University
Summer 2021

Contents

1 “Clink” versus “BOOM”

I Divide and Conquer

2 Colouring Graphs
  Assignment 1
  Solution
3 Euclid, Karatsuba, Strassen
4 Fast Fourier Transform
  Assignment 2
  Solutions
5 Asymptotic Bounds
6 Sorting
  Assignment 3
  Solutions

II Greedy Algorithms

7 Huffman Coding
8 Minimum Spanning Trees
9 More Greedy Algorithms
  Assignment 4
  Midterm 1
  Solutions

III Dynamic Programming

10 Edit Distance
11 Subset Sum and Knapsack
12 Optimal Binary Search Trees and Longest Increasing Subsequences
  Assignment 5

Chapter 1

“Clink” versus “BOOM”

Imagine I’m standing in front of you holding up two little pieces of silver-grey metal and between us is a bucket of water. You can see the pieces of metal look pretty much the same; I tap them on the desk so you can tell they sound about the same; I pass them around so you can check they feel about the same, weigh about the same, smell the same (not really at all), although I warn you not to taste them. I take them back and throw first one into the bucket, where it goes “clink”, and then the other, where it goes “BOOM”.

Even if we had in-person classes I probably wouldn’t be allowed to explode things in front of my students — that’s what tenure is for! — but I want you to think about how two things that seem the same can behave so differently. Explaining that for those two little pieces of metal gets us into discussions that get pretty deep pretty quickly, of chemical reactions and molecules and atoms. The famous physicist Richard Feynman said that if some disaster were going to wipe out nearly all scientific knowledge and he could choose one sentence to pass on to future generations, it would be “all things are made of atoms”.
Other scientists might choose deep statements from their own fields that have changed humanity’s view of the universe and our place in it: for an astronomer, “the earth goes around the sun”; for a biologist, “species arise through a process of natural selection”; for a psychologist, “the mind is a function of the brain”. Statements about mathematics usually aren’t quite so controversial, at least among non-mathematicians, but the story goes that the first mathematician to prove that the square root of 2 is irrational was thrown overboard by his fellow Pythagoreans.

So, what have we got? You’ve been studying computer science in university for a few years now, and I’d like to know what you’ve learned about our field that can reasonably be considered one of the great scientific truths. “The speed of processors doubles every 18 months”? “Adding workers to a late software project makes it later”? “All your base are belong to us”? “The cake is a lie”? This is when I’d really like to be able to stop and let you think, and then hear what you have to say. Under the circumstances, however, I’m just going to have to assume most of you chose to study computer science because it’s fun, you can do cool things with it (including saving people’s lives and making the world a better place), it pays pretty well and it’s indoor work with no heavy lifting — even though it may not tell us deep things about the universe.

That’s rather a shame.
If you check Scientific American’s list of “100 or so Books that Shaped a Century of Science”, under “Physical Sciences” and mixed in with Stephen Hawking’s A Brief History of Time, Primo Levi’s The Periodic Table, Carl Sagan’s The Pale Blue Dot, Paul Dirac’s Quantum Mechanics, The Collected Papers of Albert Einstein, Linus Pauling’s Nature of the Chemical Bond and Feynman’s QED, you’ll find books about computer science (or things close to it): Benoit Mandelbrot’s Fractals, Norbert Wiener’s Cybernetics and, last but definitely not least, Knuth’s Art of Computer Programming. So apparently we are doing deep things, we’re just not teaching you about them.

Actually, I think that in a very important sense, we are the deepest of the sciences. To understand where the basic rules of psychology come from, for example, you must know something about biology; to understand where the basic rules of biology come from, you must know something about chemistry; to understand where the basic rules of chemistry come from, you must know something about physics; to understand where the basic rules of physics come from, you must know something about mathematics; but who tells the mathematicians what they can and cannot do? We do. Philosophers, theologians and politicians might claim to, but I think the mathematicians largely ignore them. One field has really stood up to mathematics and laid down the law, and that field is computer science.

At the International Congress of Mathematicians in 1900, a famous mathematician named David Hilbert proposed ten problems that he thought mathematicians should work on in the 20th century. He later expanded the list to 23 problems, which were published in 1902 in the Bulletin of the American Mathematical Society.
Let’s talk about Hilbert’s Tenth Problem (for the integers): “to devise a process according to which it can be determined in a finite number of operations whether [a given Diophantine equation with integer coefficients] is solvable in [the integers]”. Notice Hilbert didn’t ask whether such a process existed; back then, people assumed that if a mathematical statement were true, with enough effort they should be able to prove it. I’ll assume you all know what a quadratic equation in one variable is, and that you can tell if one has a solution in the reals using the quadratic formula. There are similar but more complicated formulas for solving cubic and quartic equations in one variable, which fortunately I never learned. If you’re given a quintic equation in one variable with integer coefficients, then you may be able to use the rational-root theorem (which I actually did learn once upon a time!) to get it down to a quartic, which you can then solve with the quartic formulas. (For anything more complicated, maybe you can try Galois Theory, but I can’t help you there.) Anyway, the Diophantine equations Hilbert was talking about are sort of like these equations but with a finite number of variables. He wanted a way to solve any given Diophantine equation that has integer coefficients or, at least, a way to tell if it even has an integer solution. So, solving Diophantine equations over the integers was considered an important mathematics problem by important mathematicians. Unfortunately for them, in 1970 Yuri Matiyasevich showed in his doctoral thesis that we can take a certain general model of a computer, called a Turing Machine, and encode it as a Diophantine equation with integer coefficients, in such a way that the Turing Machine halts if and only if the Diophantine equation has an integer solution. 
It was already known that determining whether a Turing Machine halts is generally incomputable, meaning there is no algorithm for it (and that had already thrown a spanner in Hilbert’s plan for 20th-century mathematics), so Matiyasevich’s Theorem (also known as the Matiyasevich-Robinson-Davis-Putnam Theorem, for some other mathematicians who laid the groundwork) says Hilbert’s Tenth Problem is impossible (for the integers). So, computer science told mathematics that it couldn’t have something it really wanted.

I guess 1970 seems like a long time ago and we don’t want to relive old victories so — although we’ll talk a bit about Turing Machines and the Halting Problem and computability and Hilbert’s Entscheidungsproblem — in this course mostly we’re not going to focus on the gap between possible and impossible problems (and I won’t mention Diophantine equations again). Instead, we’re going to focus on the nature of the gap between easy problems (we have a polynomial-time solution that runs on a standard computer) and hard problems (we don’t) — which is the deepest open question in computer science and has been for a long time, since 1971. Echoing Hilbert, in 2000 the Clay Mathematics Institute proposed a list of seven problems they thought mathematicians should work on in the 21st century, including this one of easy versus hard, and backed each with a million-dollar prize.

To quickly illustrate the gap between easy and hard, I’d like to talk about two classic problems, determining whether a graph has an Eulerian cycle and whether it has a Hamiltonian cycle. At first these problems seem almost the same, but it turns out that one goes “clink” and the other goes “BOOM” (we think).

Figure 1.1: The seven bridges of Königsberg in Euler’s time (https://commons.wikimedia.org/wiki/File:Konigsberg_bridges.png).
In the 1730s, the citizens of Königsberg (then in Prussia, now the Russian city of Kaliningrad) wrote to the famous mathematician Leonhard Euler asking whether it was possible to start somewhere in their city (shown as it existed then in Figure 1.1), walk around it in such a way as to cross each of their seven bridges once and only once, and arrive back at the starting point. Euler answered that it was not possible since, if we draw a graph with each shore and island as a vertex and each bridge as an edge, all of the vertices have odd degree, where a vertex’s degree is the number of edges incident to it. He observed that in such a walk — now called an Eulerian cycle of a graph — every time we visit a vertex we reduce the number of uncrossed edges incident to it by 2 (we arrive across one and leave across one); therefore, in order for a graph to have an Eulerian cycle, it is a necessary condition that it be connected (that is, it must be possible to get from any vertex to any other vertex) and every vertex must have even degree.

Euler also showed that this is also a sufficient condition, which is less obvious. Suppose we start at some vertex v in a graph G that satisfies the condition, and walk around until we get stuck (that is, we arrive at a vertex all of whose incident edges we’ve already crossed). Where can we get stuck? Well, when we’re at a vertex w ≠ v then we’ll have “crossed off” an odd number of edges incident to w (one for each time we’ve arrived there and one for each time we’ve left, and we’ve arrived once more often than we’ve left); therefore, since an even number minus an odd number is an odd number and 0 isn’t odd, there’s always at least one edge left for us to leave across. It follows that we can only get stuck back at v, after tracing out a cycle C in G. Let G′ be the graph resulting from deleting from G the edges in C.
If v has positive degree in G′ then we can start there again and find and delete another cycle; otherwise, if some other vertex has positive degree in G′, we can start there and find and delete another cycle. If we keep doing this until there are no edges left, we get a decomposition of G into edge-disjoint cycles (meaning the cycles can share vertices but not edges). Take any of the cycles C1, choose a vertex v it shares with any of the other cycles, and choose another cycle C2 containing v. (If C1 doesn’t share any vertices with any other cycles, then its vertices aren’t connected to the rest of the graph, contrary to our assumption.) Replace C1 and C2 by a single (non-simple) cycle C3 that makes a figure 8 in some sense: if we start at v then we first cross all the edges in C1 and return to v, then cross all the edges in C2 and return to v. Repeating this cycle-merging, we eventually obtain an Eulerian cycle of the graph. It’s not totally trivial to implement this algorithm cleanly but it’s not all that hard either (so I gave it as a homework exercise last year) and it obviously takes only polynomial time in the size of the graph. In other words, the problem of determining whether a graph is Eulerian (contains an Eulerian cycle) goes “clink”. It’s actually an important problem that comes up in de novo genome assembly: one of the most common ways to assemble DNA reads is by building what’s called a de Bruijn graph of the k-mers they contain, for some appropriate value of k, and then trying to find an Eulerian tour of that graph (“tour” instead of “cycle” because it need not end where it started). It’s likely, for example, that this played a role in sequencing the Covid genome, which was important for developing a test for whether people were infected. Now let’s consider a problem which seems deceptively similar at first. 
In 1857 William Rowan Hamilton showed how to walk along the edges of a dodecahedron, visiting each vertex once and only once, until arriving back at the starting point. Such a walk in a graph is now called a Hamiltonian cycle, and it’s an obvious counterpart to an Eulerian cycle but focusing on visiting vertices instead of crossing edges. Hamilton didn’t give a general result like Euler’s, however, and in fact no one has ever found an efficient algorithm for determining whether a given graph is Hamiltonian (that is, it has a Hamiltonian cycle). By “efficient”, in computer science we mean that the algorithm takes time polynomial in the size of the graph (in this case, the sum of the number of vertices and the number of edges). Obviously there are necessary conditions we can check quickly (for example, the graph has to be connected), and there are sufficient conditions we can check quickly (for example, if a graph is a clique then it’s Hamiltonian, meaning it contains a Hamiltonian cycle), and there are some kinds of graphs that are easy (for example, cycles or trees), and some graphs are small enough that we can deal with them by brute force... but nobody’s ever found an algorithm that can handle all graphs in polynomial time. In other words, it seems this problem goes “BOOM”.

To see why people would really like to be able to determine whether a graph is Hamiltonian, suppose you’re in charge of logistics for a delivery company and you’re looking at a map trying to plan the route of a delivery truck that has to visit n cities, in any order, and then return home. You can tell the distance between each pair of cities and you want to know if it’s possible for the truck to make its deliveries while travelling at most a certain total distance. (This problem is called the Travelling Salesperson Problem or TSP or, assuming the truck travels only in the plane, Euclidean TSP; computer scientists often use Small Caps for tricky problems.)
If every road between two cities has distance 1, then the truck can travel distance n if and only if the graph with cities as vertices and roads as edges is Hamiltonian. It follows that if HamCycle (the problem of determining whether a graph is Hamiltonian) goes “BOOM”, then TSP goes “BOOM” too. That is, if HamCycle is hard, then so is TSP, because in some sense HamCycle can be turned into TSP. Don’t worry, we’ll see a lot more about this later in the course.

How can two problems — determining if a graph is Eulerian and if it’s Hamiltonian — that seem so similar be so different? Well, we hope investigating that will lead us to a deep truth, like investigating the reactions of those pieces of metal to water could lead us to studying atoms. We may not even really care as much about the problems themselves as we do about what studying them can tell us: my chemistry teacher might forgive me if I forgot that sodium metal explodes in water, but he’d be really annoyed if I forgot the atomic theory; similarly, you may pass this course even if you forget Euler’s algorithm, but I really do want you to remember something about the difference between easy and hard problems.

Another way of saying this is that you shouldn’t miss the forest for the trees. Think of Euler’s Algorithm as a tree and HamCycle as a tree and yourself as a student of geography: you’re not really interested in individual trees as much as you’re interested in the general layout of the forest and, metaphorically, whether there’s a big river splitting it into two pieces. In the next class, we’ll talk about another classic problem, colouring graphs, some versions of which (1-colourability, 2-colourability, planar k-colourability for k ≥ 4) are easy and some versions of which (planar 3-colourability, general k-colourability for k ≥ 3) are thought to be hard. Does anyone really care about 47-colourability in particular, say?
Probably not, but by the end of this course you should understand that it’s on the same side of the river as HamCycle and TSP and planar 3-colourability and general k-colourability for any other k ≥ 3 (that is, the “BOOM” side, to mix metaphors), and on the opposite side from finding Eulerian cycles and 1- or 2-colourability (the “clink” side).

Investigating the difference between the complexity of finding Eulerian and Hamiltonian cycles and between 2-colouring and 3-colouring graphs also leads us to a shallow but important truth about this course: it’s dangerous to answer questions just by pattern matching! That is, just because a problem sounds similar to something you’ve seen before, don’t think you can just write the answer to the old problem and get part marks.

Summing up, here are some things to remember while taking this course:

- Computer science is the deepest of the sciences!
- Some problems are possible, some aren’t — and that’s deep.
- Some problems are easy, some probably aren’t — and that’s deep, too.
- Don’t miss the forest for the trees.
- It’s dangerous to answer questions just by pattern matching.

Part I

Divide and Conquer

Chapter 2

Colouring Graphs

When I was in primary school geography was mainly about colouring maps, which I was pretty good at except that I kept losing my pencil crayons, so my maps weren’t as colourful as the other children’s. I was therefore really impressed when I learned that in 1852 Francis Guthrie found a way to colour a map of the counties of England (there were 39 back then) using only four colours, such that no two counties with the same colour share a border (although meeting at a point is ok). This made him wonder if it was possible to 4-colour any (planar) map like this. He wrote to his brother, who was studying with the mathematician De Morgan (who formulated De Morgan’s Laws of propositional logic, which you may have heard about last year), and the conjecture was published in The Athenæum magazine in 1854 and 1860.
In 1890 Heawood proved any planar map can be 5-coloured, and in 1976 Appel and Haken finally proved the original conjecture by checking 1834 cases with a computer. (An important (and very long) part of the proof was checked manually by Haken’s daughter Dorothea, who taught me third-year algorithms at Queen’s 22 years later; she’s also a coauthor of the Master Theorem that we’ll discuss next week.)

Appel and Haken’s Four-Colour Theorem is considered part of graph theory, instead of geography, because if someone gives us a planar map and we draw a vertex in each region and put an edge between two vertices if and only if their regions share a border, then we get a planar graph that can be coloured with k colours such that no two vertices with the same colour share an edge, if and only if the original map can be coloured with k colours. For example, Figure 2.1 shows the graph we get for the map of the counties of Nova Scotia. (Actually, if we put an edge for every shared border — so one edge between France and Spain for their border west of Andorra and another for their border east of it, for example — then the graph and the map are duals, meaning that from the graph we can reconstruct something that’s topologically equivalent to the map, and from that we can reconstruct the graph, and so on.)

It’s easy to find planar graphs that require 4 colours, such as a 4-clique (a graph on four vertices with edges between every pair), but it’s not so easy to tell whether a given planar graph can be 3-coloured. In fact, as mentioned in the last lecture, that’s a “BOOM” problem.

Figure 2.1: A graph that is 3-colourable if and only if the map of the counties of Nova Scotia is 3-colourable.

Suppose someone gives us a planar graph, meaning one that can be drawn in the plane such that edges don’t cross, and asks us if it can be coloured with k colours.
(There are polynomial-time algorithms for determining whether a graph is planar and, if so, finding a way to draw it without edge-crossings, so we won’t worry about that.) If k = 1 then the answer is “yes” if and only if the graph contains no edges. If k = 2 then there’s a simple algorithm based on breadth-first search: choose any vertex and colour it blue; colour its neighbours red; colour their uncoloured neighbours blue; keep going until the whole graph is coloured; if there’s an edge with endpoints the same colour, the answer is “no” and otherwise it’s “yes”. If k ≥ 4 then, by the Four-Colour Theorem and the fact the graph is planar, the answer is “yes”. No one has ever found an algorithm, however, for determining whether a graph is 3-colourable, whether or not it’s restricted to be planar, that runs in polynomial time in the size of the graph in the worst case.

Of course, just because we don’t have an algorithm that runs quickly in the worst case, doesn’t mean we have to give up. For small graphs, we can just use brute force. Even for large graphs, it may be that almost all graphs are easy. We’re going to explore that in this class, along with the idea of “degrees of BOOM”, and then finally we’ll end with some programming (which will probably be a relief by then).

Notice we didn’t use the fact that the graph was planar in our tests for 1- and 2-colourability, so those problems are easy for general graphs too. If we can solve 3-Colourability (affectionately known as 3-Col) for general graphs, then obviously we can solve it for planar graphs as well. Later in the course we’ll see that if we can solve Planar 3-Col in polynomial time then we can use it to solve general 3-Col. Therefore, Planar 3-Col goes “BOOM” if and only if general 3-Col goes “BOOM” — even though the latter is probably a bigger “BOOM”. (Don’t worry, by the end of the course you’ll know the technical names for “clink” problems and “BOOM” problems.)
When we consider 4-colourability, though, there’s a big difference: while any planar graph can be 4-coloured, not every graph can be (think of a 5-clique). In fact, if we could solve general 4-Col in polynomial time, then we could also solve 3-Col (and, thus, Planar 3-Col) in polynomial time. To see why, suppose we’ve been asked if a graph G is 3-colourable and we have a magic box that tells us quickly whether a graph is 4-colourable. If we build a graph G′ by adding a vertex v to G and putting edges between v and all the original vertices of G, then G′ is 4-colourable if and only if G is 3-colourable: if G′ is 4-colourable and we take a 4-colouring of it and remove v, then the remainder — which is G — must be 3-coloured, because v couldn’t have been the same colour as any other vertex; if G is 3-colourable and we take a 3-colouring of it and add v and colour v a new colour, then we get a 4-colouring of G′. In general, a polynomial-time algorithm for general (k + 1)-colourability immediately implies a polynomial-time algorithm for k-colourability.

Later in the course we’ll see that if we have a polynomial-time algorithm for 3-Col (either planar or general, since I’ll show you how to use the former to solve the latter) then we can use it to get a polynomial-time algorithm for a problem called Sat (determining whether a given propositional formula has a satisfying truth assignment). We’ll then see that if we have such an algorithm for Sat then we can use it to get one for k-colourability for any k. We’ll also see that if we have such an algorithm for Sat then we can use it to get one for HamCycle, and vice versa, and the same for TSP. This means that Sat goes “BOOM” if and only if 47-Col goes “BOOM”, and that happens if and only if TSP goes “BOOM”. If one of these problems is hard, they all are and, conversely, if we can solve one then we’ll have solved them all.
This is probably all getting a little confusing by now, so I’ve drawn Figure 2.2 to help you visualize things, with an arrow from a problem u to a problem v if having a solution for v gives us a solution for u. The arrow goes from u to v instead of the other way around because, instead of changing the algorithm for v into an algorithm for u directly, what we normally do is design a polynomial-time algorithm that takes an instance of u and produces an instance of v such that the former is a “yes”-instance of u if and only if the latter is a “yes”-instance of v. For example, the arrow from 3-Col to 4-Col is because we saw how to take a graph G and produce a graph G′ such that G is 3-colourable if and only if G′ is 4-colourable. This is called a reduction. Note that the reduction can (and usually does) change the type: for example, when we reduce Sat to 3-Col later in the course, our reduction will turn a propositional formula into a graph. The solid arrows I should be able to test you on right now, and the dashed arrows I’ll be able to test you on later in the course.

We won’t have time to see many reductions in this course, unfortunately, but there are hundreds of famous problems that all turn out to be essentially the same problem in different guises, and we don’t know how to solve any of them quickly in the worst case. Do you get the feeling we might be onto something deep here?

Figure 2.2: A very sketchy diagram of how some problems are related. (On the “clink” side: 1-colourability, 2-colourability and Eulerian cycle; on the “BOOM” side: HamCycle, planar 3-Col, 4-colourability, 4-Col, 5-Col, ..., 47-Col, ..., Sat and 3-Sat; beyond them, “this way to the Halting Problem, Diophantine equations on Z, etc.”)

Now that we’ve talked about how we probably can’t solve lots of interesting problems in polynomial time in the worst case, let’s talk about how we can solve some of them in practice by divide and conquer. (Most people learn the divide-and-conquer approach to algorithm design for sorting
and thus get the impression it’s only good for easy problems. I’m trying to avoid that with this course.)

To start with, let’s talk about Planar 3-Col or, rather, an even harder version of it: how many 3-colourings there are of the map of the 18 historical counties of Nova Scotia using only the colours red, green and yellow — such as the colouring shown in Figure 2.3? I’ve just told you that determining if there’s at least one valid 3-colouring of a planar graph is a “BOOM” problem, but now I want to know exactly how many there are. Don’t worry, it’s not that bad: there are only 3^18 possibilities to check, after all, and that’s not even 400 million. If we simply list the counties in alphabetical order, and for each list their neighbours, we get what’s called the adjacency-list representation of the graph, as shown in Table 2.1. It’s pretty easy to turn that into a simple program for counting 3-colourings, as shown in Figure 2.4, which might take a couple of seconds on a modern laptop.

It hardly seems worth optimizing that, but it will be worth it when we get to bigger instances later. The first optimization you might notice is that we’re unnecessarily comparing Digby and Yarmouth twice (once as Digby != Yarmouth and again as Yarmouth != Digby). A more important optimization is that, since Cape Breton Island is an island, we can count its 3-colourings and the 3-colourings of the mainland separately, and multiply them to get the number of 3-colourings of the whole province, as shown in Figure 2.5. This way, instead of 3^18 possibilities, we’re only checking 3^4 + 3^14, so our new program should be about 80 times faster. It’s not as obvious what to do next, since we don’t have another big island to work with, but we can break the mainland into two “islands” of size 6 each if we consider Hants and Lunenberg first, as shown in Figure 2.6.
The idea is that, for each way to colour Hants and Lunenberg, we count the ways to colour the mainland counties north of them, and the ways to colour the mainland counties south of them, and multiply those counts to get the number of ways of colouring the mainland given how we chose to colour Hants and Lunenberg. Summing over the 9 ways to colour Hants and Lunenberg (only 6 of which are valid!) we get the number of ways to colour the mainland. This way, we consider only 3^4 + 3^2 · 2 · 3^6 = 13203 possibilities. A final optimization is to check the endpoints of each edge are different colours as soon as we’ve coloured both endpoints. For example, in Figure 2.6, right after the line for (int Lunenberg = RED; Lunenberg <= YELLOW; Lunenberg++) we should check that Lunenberg is not the same colour as Hants, if (Lunenberg != Hants) { ... }.

Figure 2.3: A 3-colouring of the counties of Nova Scotia.

Just as Hants and Lunenberg break the mainland into two fairly small pieces, there’s a theorem that says that in a planar graph on n vertices, we can always find a set of O(√n) vertices such that, when we remove them, none of the remaining connected pieces of the graph contain more than 2n/3 vertices. The vertices we remove are called a separator and there are efficient algorithms for choosing them, but we’re not going to see those in this course as, for graphs that aren’t very big, it’s usually possible to do it fairly well by hand. In any case, this result about separators gives us a recursive algorithm for counting the 3-colourings of planar graphs: find a separator and, for each way to colour the separator, recursively count the ways to colour each of the remaining pieces.
Writing a recurrence for this, we get something like

    T(n) = 3^(2√n) · 2T(2n/3)
         = 3^(2√n) · 2 · 3^(2√(2n/3)) · 2 · 3^(2√(4n/9)) · · ·
         = 3^(O(n^(1/2) log n)),

which isn’t polynomial but is much better than 2^n n^(O(1)), which is what we get with the fastest known algorithm for finding the chromatic number of a general graph (the number of colours needed to colour it). Notice that it’s not generally possible to choose a small separator in a general graph: no matter how many vertices we remove from an n-clique, it remains connected (until we’ve removed the whole graph).

    Annapolis     Digby, Kings, Lunenburg, Queens
    Antigonish    Guysborough, Pictou
    Cape Breton   Richmond, Victoria
    Colchester    Cumberland, Halifax, Hants, Pictou
    Cumberland    Colchester
    Digby         Annapolis, Queens, Yarmouth
    Guysborough   Antigonish, Halifax, Pictou
    Halifax       Colchester, Guysborough, Hants, Lunenburg
    Hants         Colchester, Halifax, Kings, Lunenburg
    Inverness     Richmond, Victoria
    Kings         Annapolis, Hants, Lunenburg
    Lunenburg     Annapolis, Halifax, Hants, Kings, Queens
    Pictou        Antigonish, Colchester, Guysborough
    Queens        Annapolis, Digby, Lunenburg, Shelburne
    Richmond      Cape Breton, Inverness
    Shelburne     Queens, Yarmouth
    Victoria      Cape Breton, Inverness
    Yarmouth      Digby, Shelburne

Table 2.1: The adjacency-list representation of the planar graph dual to the map of the counties of Nova Scotia.

This all leads us to your homework, due in a little over a week: on Brightspace you’ll find a text file with the adjacency-list representation of a graph corresponding to a simplified world map; figure out how many ways there are to 4-colour it.
If you’re feeling really enthusiastic, you can try to write general code that takes an encoding of a tree, such as the one shown in Figure 2.7, and uses it to compute the number of colourings: at a × node, start a new counter for each child, compute their values recursively, and multiply them; at a + node, increment the current counter; at a node for a county, for each colour that isn’t already assigned to one of the ancestors it’s pointing at, assign it that colour and descend. (I may have made a mistake with the tree — it happens.) Honestly, any time you see that many nested for-loops, all doing essentially the same thing, you should think “There must be a better way!” (and maybe “Recursion!”). I’m not going to distribute a nice, general answer, because I want to use this exercise again (just switching maps). Although the answer (and your code) are due in a week, I suggest you should try to get it done in the next 24 hours or so, while all of this is still fresh in your minds. Actually, I’m suggesting that only so I can finish the class by saying

Divide and Conquer! Today, Nova Scotia; tomorrow, the world!

    #define RED 0
    #define GREEN 1
    #define YELLOW 2

    int main()
    {
        int count = 0;
        for (int Annapolis = RED; Annapolis <= YELLOW; Annapolis++) {
            for (int Antigonish = RED; Antigonish <= YELLOW; Antigonish++) {
                ...
                for (int Yarmouth = RED; Yarmouth <= YELLOW; Yarmouth++) {
                    if (Annapolis != Digby && Annapolis != Kings &&
                        Annapolis != Lunenberg && Annapolis != Queens &&
                        Antigonish != Guysborough && Antigonish != Pictou &&
                        ...
                        Yarmouth != Digby && Yarmouth != Shelburne) {
                        count++;
                    }
                }
                ...
            }
        }
        printf("There are %i ways to 3-colour Nova Scotia.\n", count);
        return 0;
    }

Figure 2.4: A naïve program for counting the 3-colourings of Nova Scotia.
#define RED 0
#define GREEN 1
#define YELLOW 2

int main() {
  int CB_count = 0;
  int mainland_count = 0;
  for (int Cape_Breton = RED; Cape_Breton <= YELLOW; Cape_Breton++) {
    for (int Inverness = RED; Inverness <= YELLOW; Inverness++) {
      for (int Richmond = RED; Richmond <= YELLOW; Richmond++) {
        for (int Victoria = RED; Victoria <= YELLOW; Victoria++) {
          if (Cape_Breton != Victoria && Cape_Breton != Richmond &&
              Inverness != Victoria && Inverness != Richmond) {
            CB_count++;
          }
        }
      }
    }
  }
  for (int Annapolis = RED; Annapolis <= YELLOW; Annapolis++) {
    for (int Antigonish = RED; Antigonish <= YELLOW; Antigonish++) {
      ...
    }
  }
  printf("There are %i ways to 3-colour Nova Scotia.\n", CB_count * mainland_count);
  return(0);
}

Figure 2.5: A somewhat faster program for counting the 3-colourings of Nova Scotia. We have omitted the details of how to count the 3-colourings of the mainland, since they have not changed except that we leave out the counties on Cape Breton Island.

int mainland_count = 0;
for (int Hants = RED; Hants <= YELLOW; Hants++) {
  for (int Lunenburg = RED; Lunenburg <= YELLOW; Lunenburg++) {
    int north_count = 0;
    int south_count = 0;
    for (int Antigonish = RED; Antigonish <= YELLOW; Antigonish++) {
      for (int Colchester = RED; Colchester <= YELLOW; Colchester++) {
        ...
        for (int Pictou = RED; Pictou <= YELLOW; Pictou++) {
          if (Hants != Colchester && Hants != Halifax && Hants != Lunenburg &&
              Lunenburg != Halifax && Antigonish != Guysborough &&
              ...
              Guysborough != Pictou) {
            north_count++;
          }
        }
        ...
      }
    }
    for (int Annapolis = RED; Annapolis <= YELLOW; Annapolis++) {
      for (int Digby = RED; Digby <= YELLOW; Digby++) {
        ...
        for (int Yarmouth = RED; Yarmouth <= YELLOW; Yarmouth++) {
          if (Hants != Kings && Hants != Lunenburg && Lunenburg != Annapolis &&
              Lunenburg != Kings && Lunenburg != Queens && Annapolis != Digby &&
              ...
              Shelburne != Yarmouth) {
            south_count++;
          }
        }
        ...
      }
    }
    mainland_count += north_count * south_count;
  }
}

Figure 2.6: A somewhat faster way to compute the number of 3-colourings of the mainland.
Figure 2.7: An abstract representation of a way to compute the number of 3-colourings of Nova Scotia (a tree of × and + nodes over the counties CB, Inv, Rich, Vic, Han, Lun, Col, Guy, Hal, Ant, Cum, Pict, Ann, Kin, Shel, Yar, Dig and Que).

Assignment 1
posted: 05.05.2021
due: midnight 14.05.2021

You can work in groups of up to three people. One group member should submit a copy of the solutions on Brightspace, with all members’ names and banner numbers on it; the other group members should submit text files with all members’ names and banner numbers (otherwise Brightspace won’t let us assign them marks!). You may consult with other people but each group should understand the solutions: after discussions with people outside the groups, discard any notes and do something unrelated for an hour before writing up your solutions; it’s a problem if no one in a group can explain one of their answers. For programming questions you should submit your code, which should compile and run correctly to receive full marks.

1. Use divide-and-conquer to count the valid 4-colourings of the map shown in Figure 2.8. (Apologies to New Zealand and Antarctica!) In this case, let’s say a valid 4-colouring is an assignment of the colours red, green, yellow and purple to the regions of the map (ignore water and the current colours) such that no two regions assigned the same colour share a border of positive length (they can touch at a point) or have a line drawn between them. The adjacency-list representation of the corresponding graph (with the vertices named A1 through F4) is on the back of this sheet and will be posted as a text file on Brightspace. If you find a discrepancy between the map and the adjacency lists, please email Travis as soon as possible!

Figure 2.8: How many ways can this map (https://commons.wikimedia.org/wiki/File:Risk_game_map_fixed.png) be 4-coloured?
A1: A2, A5, A6
A2: A1, A3, A4, A5, A6, B7
A3: A2, A5, B7, D5
A4: A2, A6
A5: A1, A2, A3, D5, D7, F2
A6: A1, A2, A4
B1: B2, B3, B7, B11, D6
B2: B1, B3, B8, B9, B10, B11
B3: B1, B2, B7, B9
B4: B6, B8, B10, B12
B5: B6, B8
B6: B4, B5, B8, B12, E1
B7: A2, A3, B1, B3, D5, D6
B8: B2, B4, B5, B6, B10
B9: B2, B3, C2
B10: B2, B4, B8, B11, B12
B11: B1, B2, B10, D6
B12: B4, B6, B10
C1: C3, C4
C2: B9, C3, C4
C3: C1, C2, C4
C4: C1, C2, C3
D1: D2, D3, D4, D7
D2: D1, D4, E5
D3: D1, D4, D5, D6, D7
D4: D1, D2, D3, D6
D5: A3, A5, B7, D3, D6, D7
D6: B1, B7, B11, D3, D4, D5
D7: A5, D1, D3, D5
E1: B6, E2, E6
E2: E1, E6, E7, E9
E3: E4, E9, F4
E4: E3, E7, E8, E9
E5: D2, E6, E7, E8
E6: E1, E2, E5, E7
E7: E2, E4, E5, E6, E8, E9
E8: E4, E5, E7
E9: E2, E3, E4, E7
F1: F2, F3
F2: A5, F1, F3, F4
F3: F1, F2, F4
F4: E3, F2, F3

Solution

The solution below is not pretty — as one of you amusingly wrote in the Zoom chat when I showed you my code for counting the 3-colourings of Nova Scotia, “my eyes are on fire, it’s so ugly” — but as I said, I’m not going to give out a general solution because I want to be able to reuse this exercise if Dal has me teach this course again, just changing the map. The code I originally posted had a couple of bugs in it and the answer it returned was about 3 times larger than what the code below returns (and what we now think is the real answer), 8720115499008. Both versions ran in a second or two on my very standard, two-year-old laptop. As long as your answer was within a factor of 1000 either way and wasn’t obviously wrong, we’d consider it correct.

#include <stdio.h>
long long int countA();
long long int countB();
long long int countC();
long long int countD();
long long int countE();
long long int countF();

long long int A1, A2, A3, A4, A5, A6;
long long int B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12;
long long int C1, C2, C3, C4;
long long int D1, D2, D3, D4, D5, D6, D7;
long long int E1, E2, E3, E4, E5, E6, E7, E8, E9;
long long int F1, F2, F3, F4;

int main() {
  long long int counter = 0;
  for (A5 = 0; A5 < 4; A5++) {
    for (B6 = 0; B6 < 4; B6++) {
      for (B7 = 0; B7 < 4; B7++) {
        for (B9 = 0; B9 < 4; B9++) {
          for (D5 = 0; D5 < 4; D5++) {
            if (D5 != A5 && D5 != B7) {
              for (D6 = 0; D6 < 4; D6++) {
                if (D6 != B7 && D6 != D5) {
                  for (E3 = 0; E3 < 4; E3++) {
                    for (E5 = 0; E5 < 4; E5++) {
                      counter += countA() * countB() * countC() *
                                 countD() * countE() * countF();
  }}}}}}}}}}
  printf("The number of 4-colourings is %lli.\n", counter);
  return(0);
}

long long int countA() {
  long long int counterA = 0;
  for (A1 = 0; A1 < 4; A1++) {
    if (A1 != A5) {
      for (A2 = 0; A2 < 4; A2++) {
        if (A2 != A1 && A2 != A5 && A2 != B7) {
          for (A3 = 0; A3 < 4; A3++) {
            if (A3 != A2 && A3 != A5 && A3 != B7 && A3 != D5) {
              for (A4 = 0; A4 < 4; A4++) {
                if (A4 != A2) {
                  for (A6 = 0; A6 < 4; A6++) {
                    if (A6 != A1 && A6 != A2 && A6 != A4) {
                      counterA++;
  }}}}}}}}}}
  return(counterA);
}

long long int countB() {
  long long int counterB = 0;
  for (B1 = 0; B1 < 4; B1++) {
    if (B1 != B7 && B1 != D6) {
      for (B2 = 0; B2 < 4; B2++) {
        if (B2 != B1 && B2 != B9) {
          for (B3 = 0; B3 < 4; B3++) {
            if (B3 != B1 && B3 != B2 && B3 != B7 && B3 != B9) {
              for (B4 = 0; B4 < 4; B4++) {
                if (B4 != B6) {
                  for (B5 = 0; B5 < 4; B5++) {
                    if (B5 != B6) {
                      for (B8 = 0; B8 < 4; B8++) {
                        if (B8 != B2 && B8 != B4 && B8 != B5 && B8 != B6) {
                          for (B10 = 0; B10 < 4; B10++) {
                            if (B10 != B2 && B10 != B4 && B10 != B8) {
                              for (B11 = 0; B11 < 4; B11++) {
                                if (B11 != B1 && B11 != B2 && B11 != B10 && B11 != D6) {
                                  for (B12 = 0; B12 < 4; B12++) {
                                    if (B12 != B4 && B12 != B6 && B12 != B10) {
                                      counterB++;
  }}}}}}}}}}}}}}}}}}
  return(counterB);
}

long long int countC() {
  long long int counterC = 0;
  for (C1 = 0; C1 < 4; C1++) {
    for (C2 = 0; C2 < 4; C2++) {
      if (C2 != B9) {
        for (C3 = 0; C3 < 4; C3++) {
          if (C3 != C1 && C3 != C2) {
            for (C4 = 0; C4 < 4; C4++) {
              if (C4 != C1 && C4 != C2 && C4 != C3) {
                counterC++;
  }}}}}}}
  return(counterC);
}

long long int countD() {
  long long int counterD = 0;
  for (D1 = 0; D1 < 4; D1++) {
    for (D2 = 0; D2 < 4; D2++) {
      if (D2 != D1 && D2 != E5) {
        for (D3 = 0; D3 < 4; D3++) {
          if (D3 != D1 && D3 != D5 && D3 != D6) {
            for (D4 = 0; D4 < 4; D4++) {
              if (D4 != D1 && D4 != D2 && D4 != D3 && D4 != D6) {
                for (D7 = 0; D7 < 4; D7++) {
                  if (D7 != A5 && D7 != D1 && D7 != D3 && D7 != D5) {
                    counterD++;
  }}}}}}}}}
  return(counterD);
}

long long int countE() {
  long long int counterE = 0;
  for (E1 = 0; E1 < 4; E1++) {
    if (E1 != B6) {
      for (E2 = 0; E2 < 4; E2++) {
        if (E2 != E1) {
          for (E4 = 0; E4 < 4; E4++) {
            if (E4 != E3) {
              for (E6 = 0; E6 < 4; E6++) {
                if (E6 != E1 && E6 != E2 && E6 != E5) {
                  for (E7 = 0; E7 < 4; E7++) {
                    if (E7 != E2 && E7 != E4 && E7 != E5 && E7 != E6) {
                      for (E8 = 0; E8 < 4; E8++) {
                        if (E8 != E4 && E8 != E5 && E8 != E7) {
                          for (E9 = 0; E9 < 4; E9++) {
                            if (E9 != E2 && E9 != E3 && E9 != E4 && E9 != E7) {
                              counterE++;
  }}}}}}}}}}}}}}
  return(counterE);
}

long long int countF() {
  long long int counterF = 0;
  for (F1 = 0; F1 < 4; F1++) {
    for (F2 = 0; F2 < 4; F2++) {
      if (F2 != A5 && F2 != F1) {
        for (F3 = 0; F3 < 4; F3++) {
          if (F3 != F1 && F3 != F2) {
            for (F4 = 0; F4 < 4; F4++) {
              if (F4 != E3 && F4 != F2 && F4 != F3) {
                counterF++;
  }}}}}}}
  return(counterF);
}

Chapter 3

Euclid, Karatsuba, Strassen

Euclid’s algorithm is one of the oldest, having appeared in his Elements about 300 BCE, and says that given two integers a and b — for simplicity, assume a > b > 0 — we can find the greatest common divisor 〈a, b〉 of a and b by finding c = a mod b and then recursively finding 〈b, c〉. For example, 〈168, 78〉 = 〈78, 12〉 = 〈12, 6〉 = 〈6, 0〉 = 6.
Euclid actually proposed repeatedly subtracting the smaller of the two numbers from the larger one, but that can obviously be really slow: it takes ⌊a/b⌋ steps to do the same thing as a single mod step, and so the whole algorithm can take Ω(a) steps in the worst case, when a is huge and b is tiny. (We’ll see Ω soon; for now, just think of O as meaning “roughly at most” and Ω as meaning “roughly at least”.) Rather surprisingly, the complexity of the mod version wasn’t analyzed until the 19th century, when it was shown to take O(log a) steps. To see why, consider that if b < a/2 then a mod b < b < a/2, and if b ≥ a/2 then a mod b ≤ a − b < a/2. Either way, at each step the sum of the two numbers we’re considering decreases by at least half the larger number, which is at least a quarter of the sum, so the number of steps can’t be more than about log_{4/3}(a + b). In fact, Lamé proved in 1844 that the number of steps is no more than 5 times the number of digits in the decimal representation of a, and this bound is tight when a and b are consecutive Fibonacci numbers. Lamé’s proof is considered the beginning of computational complexity theory (the study of how long computations take).

I was careful to write “steps” in the preceding paragraph because it’s not clear that taking the mod of two numbers takes constant time, especially not for a human. In just a moment we’ll discuss how long basic arithmetic operations take, but first we should decide how we measure the size of the input (so we can express the complexity as a function of that size), what model of computation we’re working in, and how to measure the difficulty of a problem.
When I wrote “O(log a) steps”, I was expressing the complexity in terms of the numbers themselves (since a > b means O(log a) = O(log(a + b))); Lamé’s bound — linear in the number of digits in a — is in terms of the size of the input, with “size” interpreted as the amount of paper it would occupy when the numbers are written in base c for some constant c > 1. (Notice that choosing another constant base d > 1 would change the number of digits by only a constant factor, which disappears in the O, although the number of digits in unary — writing a as a copies of 1 — is exponentially larger.) Although there can be exceptions, that “size of the paper” measure is the default.

To remember this, think of factoring integers. You may have heard that factoring is believed to be hard, and we don’t have a fast algorithm for it, at least not on classical computers. Probably the most famous quantum algorithm, Shor’s algorithm, should be able to factor in polynomial time if we can build the hardware needed to run it. (If you haven’t heard that factoring is believed to be hard, I recommend the movie Sneakers (starring Robert Redford, Dan Aykroyd, Ben Kingsley, Mary McDonnell, River Phoenix, Sidney Poitier and David Strathairn), about what happens when a mathematician figures out how to factor quickly. Incidentally, I just read that Len Adleman — the A in the RSA public-key cryptosystem — prepared the lecture that’s given in one of the scenes, in exchange for his wife meeting Robert Redford.)

How hard is it to factor a number n? Consider the following simple program and how many steps it takes (with % meaning mod in C):

for (int i = 2; i < n; i++) {
  if (n % i == 0) {
    printf("%i\n", i);
  }
}

Pretty clearly this takes O(n) steps: linear in n, but exponential in the number of digits in n.
This will come up later in the course when we discuss the difficulty of the “BOOM” problem Knapsack, which is NP-hard only in the weak sense, meaning it can be solved in time polynomial in the numbers involved but not (we think) in the number of digits in them. If a problem is NP-hard in the strong sense, then we don’t know how to solve it in polynomial time even when the numbers are written in unary.

Usually, when we tell a computer to add two numbers, for example, we can assume it’ll happen in a constant number of clock cycles. It’s fairly rare to be dealing with numbers so big they won’t fit in a machine word — normally 64 bits or at least 32 bits on any reasonably recent machine — and word-level parallelism allows us to do a lot with numbers that will fit. (Even the number of 4-colourings from your assignment fits in 64 bits!) Algorithm designers often work in the word-RAM model with words whose size is logarithmic in the size of the input. One reason for this is that, if an algorithm runs in time polynomial in the size of the input, then it also uses space (memory) polynomial in the size of the input, so pointers take a logarithmic number of bits and fit in a constant number of words. Operations on words are considered to take constant time if they can be computed by AC^0 circuits, which have constant depth and a number of gates polynomial in the word size (so polylogarithmic in the input size, in the word-RAM model). Addition and subtraction are in AC^0 but multiplication and parity (checking whether the word contains an odd or even number of 1s) aren’t, so some people still tweak their algorithms to avoid multiplications, even though multiplication is still fast in practice on modern machines. We’ll usually assume we’re working in the word-RAM model in this course, but we won’t mention it much. If you go further in algorithmics, it’ll become more important, but for now you just need to know that there is a formal definition of what’s fast and what’s not.
We’re also usually going to consider the difficulty of a problem, as a function of the size of the input, to be the difficulty of the hardest instances as the size goes to infinity. That is, if we say we can compute a function of graphs on n vertices and m edges in O(n + m) time, we mean that no matter how big n and m are, we can compute the function that quickly for any graph with n vertices and m edges. (We’ll talk in a later class about exactly what O means and why it means we only have to worry about sufficiently large graphs.) This is worst-case complexity, and it’s the default. You’ll also hear about best-case, expected-case and average-case complexities. A lot of people seem to think O means “in the worst case” and Ω means “in the best case” but, as we’ll discuss in that later class, they’re confused.

So, how long does addition take for a human? Suppose I give you two n-digit numbers and you perform the standard algorithm for addition that we learn in primary school, working from right to left and carrying when necessary. It’s easy to prove by induction that, with two numbers, the carry is never bigger than 1; therefore, you use O(n) time. That’s optimal, since you at least have to read the input.

Now, how long does multiplication take, for a human (and for a computer when the numbers are much bigger than the word size)? Again, suppose I give you two n-digit numbers and you perform the standard algorithm for multiplication that we learn in primary school, working from right to left in the second number and multiplying each digit by the whole first number, padding on the right with 0s, and then summing up all the products. This takes O(n^2) time and O(n^2) digits of space, using O(n)-time addition as a subroutine. (Actually, both those O(n^2)s should be Θ(n^2)s, but we’ll discuss that in another class.) Can we do better?
For example, suppose we’re multiplying 85419621 by 75339405:

        85419621
        75339405
        --------
       427098105
      0000000000
     34167848400
    768776589000
   2562588630000
  25625886300000
 427098105000000
5979373470000000
----------------
6435463421465505

First, it’s not really hard to reduce the space to O(n), by figuring out the digits in the addition column-wise instead of row-wise. For example, suppose we’ve worked out that the last 8 digits in the answer are 21465505 and the carry into the ninth column from the right in the addition is 4, all of which takes O(n) digits of space to store. Also, suppose we know that the carries are as follows: 4 for the multiplication by the 5 on the right (because 5 × 8 + 2 = 42); 0 for the multiplication by 0; 1 for the multiplication by 4; 1 for the multiplication by 9; 2 for the multiplication by the 3 on the right; 1 for the multiplication by the 3 on the left; 1 for the multiplication by the 5 on the left; and 0 for the multiplication by 7. These carries also take O(n) digits of space to store.

There are no more digits in 85419621 to multiply the 5 on the right by, so its entry in the ninth column from the right is just its carry, 4; the entry for 0 is (0 × 8 + 0) mod 10 = 0; the entry for 4 is (4 × 5 + 1) mod 10 = 1 and its next carry is ⌊(4 × 5 + 1)/10⌋ = 2; the entry for 9 is (9 × 4 + 1) mod 10 = 7 and its next carry is ⌊(9 × 4 + 1)/10⌋ = 3; the entry for the 3 on the right is (3 × 1 + 2) mod 10 = 5 and its next carry is ⌊(3 × 1 + 2)/10⌋ = 0; the entry for the 3 on the left is (3 × 9 + 1) mod 10 = 8 and its next carry is ⌊(3 × 9 + 1)/10⌋ = 2; the entry for the 5 on the left is (5 × 6 + 1) mod 10 = 1 and its next carry is ⌊(5 × 6 + 1)/10⌋ = 3; finally, the entry for 7 is (7 × 2 + 0) mod 10 = 4 and its next carry is ⌊(7 × 2 + 0)/10⌋ = 1.
Summing up the carry into the ninth column of the addition and the entries, we get 4 + (4 + 0 + 1 + 7 + 5 + 8 + 1 + 4) = 34, so the ninth digit from the right in the answer is 4 and the carry into the tenth column from the right in the addition is 3. Now we can forget the entries in the ninth column and just remember the carries and the last 9 digits of the answer, 421465505. This still takes O(n^2) time but at least we never need to store more than O(n) digits.

There are lots of folklore methods for fast multiplication, but I don’t think any of them actually beat that O(n^2) time bound. At least, in a seminar in 1960, the famous mathematician Andrei Kolmogorov (who’ll come up again when we get to Kolmogorov complexity and why it’s incomputable) conjectured that Ω(n^2) is a lower bound for multiplication. A student in the seminar, Anatoly Karatsuba, thought about the problem for a few days and then presented Kolmogorov with a simple divide-and-conquer algorithm that takes subquadratic time, disproving Kolmogorov’s conjecture. Kolmogorov was so impressed that he presented the algorithm in lectures, wrote it up and sent it to a journal with Karatsuba’s name on it.

Suppose we are given two n-digit numbers x and y to multiply and, for simplicity, assume x and y are written in decimal and n is a power of 2. If n is 1, we can work out xy in constant time, so suppose n > 1. We rewrite xy as

(x_1 · 10^{n/2} + x_0)(y_1 · 10^{n/2} + y_0),

where x_1 and y_1 consist of the first n/2 digits of x and y respectively, and x_0 and y_0 consist of the last n/2 digits of x and y respectively. The obvious way to work this out,

(x_1 · 10^{n/2} + x_0)(y_1 · 10^{n/2} + y_0) = x_1 y_1 · 10^n + (x_1 y_0 + x_0 y_1) · 10^{n/2} + x_0 y_0,

seems to require four multiplications of (n/2)-digit numbers, plus some easy shifts and additions, which doesn’t help since 4(n/2)^2 = n^2.
However, Karatsuba noticed that if we define

z_2 = x_1 y_1,
z_1 = x_1 y_0 + x_0 y_1,
z_0 = x_0 y_0,

so

xy = z_2 · 10^n + z_1 · 10^{n/2} + z_0,

then since

z_1 = (x_1 + x_0)(y_1 + y_0) − z_2 − z_0,

we can use one multiplication to work out z_2, one multiplication to work out z_0, and then one multiplication (and two additions and two subtractions, which are easy) to work out z_1. (You might wonder if x_1 + x_0 and y_1 + y_0 couldn’t be (n/2 + 1)-digit numbers. They could, but let’s ignore that and get through this lecture. Karatsuba’s algorithm doesn’t actually need the number of digits in the numbers to be powers of 2, or equal, since we can always pad on the left with 0s.)

Of course just saving a quarter of the multiplications doesn’t improve the asymptotic bound; for that, we have to apply this technique recursively, working out each multiplication of two (n/2)-digit numbers using three multiplications of (n/4)-digit numbers. This gives us a recurrence

T(n) = 3T(n/2) + 2n,

where the 2n is just my guess at the complexity of all the additions and subtractions, and doesn’t really matter. Expanding this, we get

T(n) = 3T(n/2) + 2n
     = 3(3T(n/4) + 2(n/2)) + 2n
     = 3(3(3T(n/8) + 2(n/4)) + 2(n/2)) + 2n
     ...

so expanding i times gives us

T(n) = 3^i T(n/2^i) + 2n Σ_{j=0}^{i−1} (3/2)^j.

If we set i = lg n and assume T(1) = 1, where lg denotes log_2, then since

2n Σ_{j=0}^{lg n − 1} (3/2)^j < 2n (3/2)^{lg n − 1} Σ_{k≥0} (2/3)^k = O(3^{lg n}),

we get T(n) = O(3^{lg n}) ⊆ O(n^{1.585}), since 3^{lg n} = n^{lg 3} and lg 3 ≈ 1.585.

For example, to multiply 85419621 and 75339405 with Karatsuba’s algorithm, we’d compute

z_2 = 8541 × 7533 = 64339353,
z_0 = 9621 × 9405 = 90485505,
z_1 = (8541 + 9621)(7533 + 9405) − z_2 − z_0
    = 18162 × 16938 − 64339353 − 90485505
    = 307627956 − 154824858
    = 152803098

and then

6433935300000000 + 1528030980000 + 90485505 = 6435463421465505.

As you can probably guess, it doesn’t really make sense to use Karatsuba’s algorithm for any except really huge numbers.
(It’s also worth noting that, although Karatsuba’s algorithm is recursive, just because you start breaking up big numbers and multiplying the pieces doesn’t mean you’re stuck doing that; once the pieces get small enough, you can switch to the standard algorithm.) Still, I like teaching Karatsuba’s algorithm for several reasons:

Who would have guessed that after thousands of years of people multiplying in quadratic time, multiplication would finally be sped up in 1962?

Stephen Cook (one of my professors at Toronto, who’ll also come up again in this course) studied the complexity of multiplication for his PhD thesis, and a generalization of Karatsuba’s algorithm (splitting each number into k parts) is named Toom-Cook.

In March 2019 — just over two years ago, and just six months before the first time I taught this course! — Harvey and van der Hoeven gave an O(n log n)-time algorithm for multiplying two n-digit numbers.

Karatsuba’s algorithm is a gentle introduction to the ideas used in Strassen’s algorithm for matrix multiplication — which is kind of mysterious otherwise — and the matrix-multiplication exponent is sort of theoretical computer science’s answer to the gravitational constant.

Once we know we can speed up multiplication of integers by divide-and-conquer, it’s natural to wonder what other kinds of multiplication we can speed up. Strassen considered matrix multiplication, such as

[ A_{1,1}  A_{1,2} ] [ B_{1,1}  B_{1,2} ]   [ C_{1,1}  C_{1,2} ]
[ A_{2,1}  A_{2,2} ] [ B_{2,1}  B_{2,2} ] = [ C_{2,1}  C_{2,2} ].

In high school, we learned to get the value in cell (i, j) of C by multiplying the ith row of A by the jth column of B. If A, B and C are n × n matrices, however — with each of A_{1,1} through C_{2,2} being an (n/2) × (n/2) matrix — then this involves computing the dot product of two n-component vectors for each of the n^2 entries of C, and thus takes Ω(n^3) time overall, even assuming multiplying each pair of components takes constant time.
In 1969 Strassen took the equations we all know,

C_{1,1} = A_{1,1} B_{1,1} + A_{1,2} B_{2,1},
C_{1,2} = A_{1,1} B_{1,2} + A_{1,2} B_{2,2},
C_{2,1} = A_{2,1} B_{1,1} + A_{2,2} B_{2,1},
C_{2,2} = A_{2,1} B_{1,2} + A_{2,2} B_{2,2},

and defined 7 intermediate matrices similar to Karatsuba’s z_2, z_1, z_0 (but harder to memorize — you don’t have to!):

M_1 = (A_{1,1} + A_{2,2})(B_{1,1} + B_{2,2}),
M_2 = (A_{2,1} + A_{2,2}) B_{1,1},
M_3 = A_{1,1}(B_{1,2} − B_{2,2}),
M_4 = A_{2,2}(B_{2,1} − B_{1,1}),
M_5 = (A_{1,1} + A_{1,2}) B_{2,2},
M_6 = (A_{2,1} − A_{1,1})(B_{1,1} + B_{1,2}),
M_7 = (A_{1,2} − A_{2,2})(B_{2,1} + B_{2,2}).

Once we’ve computed M_1 through M_7 using 7 multiplications of (n/2) × (n/2) matrices and a constant number of matrix additions and subtractions (which take O(n^2) time each, assuming we can add and subtract components in constant time), then we can compute C_{1,1} through C_{2,2} using only a constant number more additions and subtractions, as follows:

C_{1,1} = M_1 + M_4 − M_5 + M_7,
C_{1,2} = M_3 + M_5,
C_{2,1} = M_2 + M_4,
C_{2,2} = M_1 − M_2 + M_3 + M_6.

Checking only C_{1,1}, for brevity, we see that indeed

C_{1,1} = M_1 + M_4 − M_5 + M_7
        = (A_{1,1} + A_{2,2})(B_{1,1} + B_{2,2}) + A_{2,2}(B_{2,1} − B_{1,1}) − (A_{1,1} + A_{1,2}) B_{2,2} + (A_{1,2} − A_{2,2})(B_{2,1} + B_{2,2})
        = A_{1,1} B_{1,1} + A_{1,1} B_{2,2} + A_{2,2} B_{1,1} + A_{2,2} B_{2,2} + A_{2,2} B_{2,1} − A_{2,2} B_{1,1} − A_{1,1} B_{2,2} − A_{1,2} B_{2,2} + A_{1,2} B_{2,1} + A_{1,2} B_{2,2} − A_{2,2} B_{2,1} − A_{2,2} B_{2,2}
        = A_{1,1} B_{1,1} + A_{1,2} B_{2,1}.

Computing C from M_1 through M_7 saves us one multiplication compared to computing it directly from A and B, which leads to the recurrence T(n) = 7T(n/2) + cn^2 for some constant c (for the additions and subtractions). Expanding and plugging in i = lg n as before,

T(n) = 7T(n/2) + cn^2
     = 7(7T(n/4) + c(n/2)^2) + cn^2
     = 7(7(7T(n/8) + c(n/4)^2) + c(n/2)^2) + cn^2
     ...

we get

T(n) = 7^{lg n} + cn^2 Σ_{j=0}^{lg n − 1} (7/4)^j.

Since

cn^2 Σ_{j=0}^{lg n − 1} (7/4)^j < cn^2 (7/4)^{lg n − 1} Σ_{k≥0} (4/7)^k = O(7^{lg n}),

we get

T(n) = O(7^{lg n}) ⊆ O(n^{2.8074}).

That is, Strassen’s algorithm multiplies matrices in subcubic time. You can read more about it on Wikipedia (https://en.wikipedia.org/wiki/Strassen_algorithm), which is where I just looked up the definitions of M_1 through M_7.
I’ve taught Karatsuba’s algorithm often enough that I’ve pretty much memorized z_1 = (x_1 + x_0)(y_1 + y_0) − z_2 − z_0, but if I ever get to the point where I’ve memorized the definitions of M_1 through M_7 then I think it’ll be time to quit. People actually do use Strassen’s algorithm in practice for large enough matrices, padding with rows and/or columns of 0s to deal with matrices that aren’t square or whose height and/or width are odd, although there are asymptotically faster algorithms. Currently the asymptotically fastest algorithm, published by Alman and Vassilevska-Williams in January this year, runs in O(n^{2.3728596}) time. Since matrix multiplication is used as a subroutine in many other algorithms (we’ll see some later in the course), according to some people the question of what that exponent actually is stands as one of the big open problems in theoretical computer science.

Chapter 4

Fast Fourier Transform

We’ve looked at multiplying integers and square matrices quickly using divide-and-conquer, and those sets are two classic examples of rings: addition and multiplication are defined, there are additive and multiplicative identity elements and additive inverses, and multiplication is associative and distributive but may not be commutative. Another classic example of a ring is the set of polynomials, and today we’re going to look at a divide-and-conquer algorithm, the Fast Fourier Transform (FFT), for multiplying those quickly.* Since polynomials are functions, the most obvious thing to do with them is evaluate them.
Given a degree-n polynomial

A(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_{n−1} x^{n−1}

and a value for x, we can evaluate A(x) in O(n) time as follows:

double eval(int n, double *A, double x) {
  double powx = 1.0;
  double value = 0.0;
  for (int j = 0; j < n; j++) {
    value += A[j] * powx;
    powx *= x;
  }
  return(value);
}

*Gauss knew about a form of the FFT in the early 1800s but its widespread use dates from Cooley and Tukey’s rediscovery of it in the 1960s, and it’s now one of the most widely used numerical algorithms. I’ll try to give a basic introduction here but, if you want more details, you should watch Max’s lecture (actually, you should already have watched that), read the chapter on the FFT in Introduction to Algorithms or watch Demaine’s lecture about it (https://www.youtube.com/watch?v=iTMn0Kt18tg) and read his lecture notes (https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-design-and-analysis-of-algorithms-spring-2015/lecture-notes/MIT6_046JS15_lec03.pdf) — which is how I reviewed the FFT before writing this.

(This can be written more mathematically as a formula called Horner’s Rule, but perhaps a program is clearer for us.) If we want to add A(x) and another degree-n polynomial

B(x) = b_0 + b_1 x + b_2 x^2 + · · · + b_{n−1} x^{n−1},

then we can just add corresponding coefficients, also in O(n) time:

A(x) + B(x) = (a_0 + b_0) + (a_1 + b_1) x + (a_2 + b_2) x^2 + · · · + (a_{n−1} + b_{n−1}) x^{n−1}.

Now suppose we want to multiply A(x) and B(x). The standard way we were taught in high school involves Θ(n^2) multiplications of coefficients,

C(x) = A(x)B(x) = Σ_{j=0}^{2n−2} x^j Σ_{k=0}^{j} a_k b_{j−k}.

As you’ll see in the next assignment, it’s possible to speed that up using a version of Karatsuba’s algorithm.
For example, if

A(x) = 5 + 3x − 2x^2 + 4x^3,
B(x) = −2 − x + 3x^2 + x^3

then

A(x) + B(x) = (5 − 2) + (3 − 1)x + (−2 + 3)x^2 + (4 + 1)x^3 = 3 + 2x + x^2 + 5x^3

but

C(x) = A(x)B(x)
     = (5 + 3x − 2x^2 + 4x^3)(−2 − x + 3x^2 + x^3)
     = (−10 − 5x + 15x^2 + 5x^3) + (−6x − 3x^2 + 9x^3 + 3x^4) + (4x^2 + 2x^3 − 6x^4 − 2x^5) + (−8x^3 − 4x^4 + 12x^5 + 4x^6)
     = −10 + (−5 − 6)x + (15 − 3 + 4)x^2 + (5 + 9 + 2 − 8)x^3 + (3 − 6 − 4)x^4 + (−2 + 12)x^5 + 4x^6
     = −10 − 11x + 16x^2 + 8x^3 − 7x^4 + 10x^5 + 4x^6.

The first idea behind the FFT is something like evaluating A(x) and B(x) at n distinct values x_0, . . . , x_{n−1}; calculating C(x_j) = A(x_j)B(x_j) for each x_j; and then recovering the coefficients of C(x) from the pairs (x_0, C(x_0)), . . . , (x_{n−1}, C(x_{n−1})). This isn’t quite right — because C(x) is a polynomial of degree 2n and thus can be uniquely specified by 2n points on its curve but not by only n points — but it is if we assume (without loss of generality) that n is a power of 2 and the last n/2 coefficients in A(x) and B(x) are all 0s.

Figure 4.1: The 8th roots of unity, drawn as 8 points equally spaced around the unit circle in the complex plane: 1, −1, i, −i, (1 + i)/√2, (−1 − i)/√2, (−1 + i)/√2 and (1 − i)/√2.

The second idea is to choose x_0, . . . , x_{n−1} to be the n distinct complex numbers c such that c^n = 1 (called “the nth roots of unity”). Drawn in the complex plane, as in Figure 4.1 for n = 8, these are n points equally spaced around the unit circle, starting at 1 = 1 + 0i (which corresponds to (1, 0) in the Euclidean plane). To evaluate A(x) at x_0, . . . , x_{n−1}, we compute the matrix-vector product

[ 1  x_0      x_0^2      . . .  x_0^{n−2}      x_0^{n−1}     ] [ a_0     ]   [ A(x_0)     ]
[ 1  x_1      x_1^2      . . .  x_1^{n−2}      x_1^{n−1}     ] [ a_1     ]   [ A(x_1)     ]
[ ...                                                        ] [ ...     ] = [ ...        ]
[ 1  x_{n−2}  x_{n−2}^2  . . .  x_{n−2}^{n−2}  x_{n−2}^{n−1} ] [ a_{n−2} ]   [ A(x_{n−2}) ]
[ 1  x_{n−1}  x_{n−1}^2  . . .  x_{n−1}^{n−2}  x_{n−1}^{n−1} ] [ a_{n−1} ]   [ A(x_{n−1}) ]

The matrix V on the left is a Vandermonde matrix (meaning the entries in each row are increasing powers) and we’ll discuss it in detail next; the components of the vector ~A by which it’s multiplied are just the coefficients of A(x); and the components of the product are A(x) evaluated at x_0, . . . , x_{n−1}.
To see why we choose x_0, . . . , x_{n−1} to be the nth roots of unity, consider V for the 8th roots of unity and ~A for our example polynomial A(x):

[ 1  1           1   1           1   1           1   1          ] [  5 ]
[ 1  −1          1   −1          1   −1          1   −1         ] [  3 ]
[ 1  i           −1  −i          1   i           −1  −i         ] [ −2 ]
[ 1  −i          −1  i           1   −i          −1  i          ] [  4 ]
[ 1  (1+i)/√2    i   (−1+i)/√2   −1  (−1−i)/√2   −i  (1−i)/√2   ] [  0 ]
[ 1  (−1−i)/√2   i   (1−i)/√2    −1  (1+i)/√2    −i  (−1+i)/√2  ] [  0 ]
[ 1  (−1+i)/√2   −i  (1+i)/√2    −1  (1−i)/√2    i   (−1−i)/√2  ] [  0 ]
[ 1  (1−i)/√2    −i  (−1−i)/√2   −1  (−1+i)/√2   i   (1+i)/√2   ] [  0 ]

There seems to be a lot of structure to V, doesn’t there? Among other things, in our example it’s an 8 × 8 matrix containing 8 distinct values. We’ll use that structure to multiply V by ~A in O(n log n) time rather than Ω(n^2) time, which the high-school algorithm would use.

The third idea behind the FFT is that A(x) = A_even(x^2) + x A_odd(x^2), where

A_even(x) = a_0 + a_2 x + a_4 x^2 + · · · + a_{n−2} x^{(n−2)/2}
A_odd(x) = a_1 + a_3 x + a_5 x^2 + · · · + a_{n−1} x^{(n−2)/2}

so

A_even(x^2) = a_0 + a_2 x^2 + a_4 x^4 + · · · + a_{n−2} x^{n−2}
x A_odd(x^2) = a_1 x + a_3 x^3 + a_5 x^5 + · · · + a_{n−1} x^{n−1}.

We can compute A_even(x^2) and A_odd(x^2) for all the values x that are nth roots of unity, by first deleting every other column of V (which contain the odd powers of the nth roots of unity) and removing duplicate rows, and then multiplying by the vectors ~A_even and ~A_odd which consist of the components a_0, a_2, a_4, . . . , a_{n−2} and a_1, a_3, a_5, . . . , a_{n−1}, respectively. After the deletion of the columns, half the rows of V are duplicates since, when n is a power of 2, squaring the n distinct nth roots of unity gives us the n/2 distinct (n/2)th roots of unity. Assuming we’ve listed the nth roots of unity such that roots squaring to the same value are adjacent (as we have in our example), we can think of these multiplications as

[ 1  x_0^2      . . .  x_0^{n−2}     ] [ a_0     ]   [ A_even(x_0^2)     ]
[ 1  x_2^2      . . .  x_2^{n−2}     ] [ a_2     ]   [ A_even(x_2^2)     ]
[ ...                                ] [ ...     ] = [ ...               ]
[ 1  x_{n−2}^2  . . .  x_{n−2}^{n−2} ] [ a_{n−2} ]   [ A_even(x_{n−2}^2) ]

and

[ 1  x_0^2      . . .  x_0^{n−2}     ] [ a_1     ]   [ A_odd(x_0^2)     ]
[ 1  x_2^2      . . .  x_2^{n−2}     ] [ a_3     ]   [ A_odd(x_2^2)     ]
[ ...                                ] [ ...     ] = [ ...              ]
[ 1  x_{n−2}^2  . . .  x_{n−2}^{n−2} ] [ a_{n−1} ]   [ A_odd(x_{n−2}^2) ]

For our example, these multiplications are

[ 1  1   1   1  ] [  5 ]   [ 3      ]
[ 1  −1  1   −1 ] [ −2 ]   [ 7      ]
[ 1  i   −1  −i ] [  0 ] = [ 5 − 2i ]
[ 1  −i  −1  i  ] [  0 ]   [ 5 + 2i ]

and

[ 1  1   1   1  ] [ 3 ]   [ 7      ]
[ 1  −1  1   −1 ] [ 4 ]   [ −1     ]
[ 1  i   −1  −i ] [ 0 ] = [ 3 + 4i ]
[ 1  −i  −1  i  ] [ 0 ]   [ 3 − 4i ]

Notice we are now performing two matrix-vector multiplications with a 4 × 4 matrix containing 4 distinct values. This means that in our example,

A(−i) = A_even((−i)^2) + (−i) A_odd((−i)^2) = A_even(−1) + (−i) A_odd(−1) = 7 + (−i)(−1) = 7 + i,

for instance. (We can tell A_even(−1) = 7 and A_odd(−1) = −1 because the second components of the matrix-vector products immediately above are 7 and −1.) Checking the inner product of the 4th row of V — in which the components are the powers of x_3 = −i — with the transpose of ~A in our example, indeed we have

(1, −i, −1, i, 1, −i, −1, i) · (5, 3, −2, 4, 0, 0, 0, 0) = 5 − 3i + 2 + 4i + 0 + 0 + 0 + 0 = 7 + i.

Using this divide-and-conquer approach, the time to evaluate A(x) on the nth roots of unity by computing V~A is T(n) = 2T(n/2) + O(n) = Θ(n log n). If we do the same for B(x), and then compute C(x_0) = A(x_0)B(x_0), . . . , C(x_{n−1}) = A(x_{n−1})B(x_{n−1}), we’ll have n distinct points on the curve C(x), which uniquely specify C(x). To recover the coefficients of C(x) from C(x_0), . . . , C(x_{n−1}), we multiply V^{−1} ~C, where V^{−1} is V’s inverse and ~C is the vector with components C(x_0), . . . , C(x_{n−1}). We can do this with the Inverse FFT (IFFT), which also works in O(n log n) time and is based on the fact that V^{−1} = V̄/n, where V̄ is the result of replacing each entry of V by its complex conjugate (that is, multiplying each entry’s imaginary part by −1). Notice that taking the complex conjugates of the matrix in our example only swaps the rows. This is true in general, because taking complex conjugates doesn’t change the set of nth roots of unity (and their powers), it just rearranges them.
Therefore, we can compute the IFFT the same way we computed the FFT, and then multiply the resulting vector by 1/n.

Assignment 2
posted: 14.05.2021
due: midnight 28.05.2021

You can work in groups of up to three people. One group member should submit a copy of the solutions on Brightspace, with all members' names and banner numbers on it; the other group members should submit text files with all members' names and banner numbers (otherwise Brightspace won't let us assign them marks!). You may consult with other people but each group should understand the solutions: after discussions with people outside the groups, discard any notes and do something unrelated for an hour before writing up your solutions; it's a problem if no one in a group can explain one of their answers. For programming questions you should submit your code, which should compile and run correctly to receive full marks.

1. Given a tree on n vertices, we can always find a single vertex whose removal leaves a forest in which no tree has more than n/2 vertices. Suppose we use our divide-and-conquer algorithm to count the 3-colourings of a tree on n vertices; about how long does it take? How fast can you compute the answer? You can assume n is a power of 2 and the tree is always split into exactly two pieces of size n/2 (even though the two pieces together should have n − 1 vertices instead of n, since we removed a vertex to split the tree).

2. In the lecture, we saw that implementing Euclid's algorithm on positive integers a and b with a > b by repeated subtraction takes Ω(a) time in the worst case but implementing it by mod takes O(log a) time, assuming subtraction and mod each take constant time. Now suppose subtracting two n-digit numbers takes n time but taking their mod takes n^2 time; comparing two numbers takes time 1. About how much bigger does a have to be than b in order for it to be faster to compute a mod b with mod directly than with repeated subtraction?
For example, if a = 1523 and b = 0427, then computing a mod b = 242 by repeated subtraction means subtracting 0427 from 1523 to get 1096 in 4 time units, checking 1096 is still bigger than 0427 in 1 time unit, subtracting 0427 from 1096 to get 0669 in 4 time units, checking 0669 is still bigger than 0427 in 1 time unit, subtracting 0427 from 0669 to get 0242 in 4 time units, and checking whether 0242 is bigger than 0427 in 1 time unit (and finding it's not). That takes a total of 4 + 1 + 4 + 1 + 4 + 1 = 15 time units, whereas computing a mod b = 242 directly takes 4^2 = 16 time units, so in this case repeated subtraction is faster.

3. Describe how to build a circuit consisting of AND, OR and NOT gates that takes two n-bit binary numbers x and y and outputs the (n+1)-bit binary number x + y. Your circuit should be a directed acyclic graph (a DAG) whose size is at most polynomial in n and whose depth is constant (where "depth" means the length of the longest directed path); the fan-in and fan-out are not bounded (where "fan-in" and "fan-out" mean the maximum in- and out-degree of any vertex).

4. Give a divide-and-conquer program for https://leetcode.com/problems/maximum-subarray (you don't have to pay for a membership!) and explain how to use your solution to solve https://leetcode.com/problems/maximum-sum-circular-subarray neatly. (If you don't use divide-and-conquer or your solution looks like it's been copied, you will not get the mark and you may be reported to FCS.)

5. Suppose you didn't understand Max's lecture on the FFT but you still want to multiply degree-n polynomials in time subquadratic in n. Show how to use Karatsuba's algorithm to do it in O(n^{lg 3}) time, assuming arithmetic operations on coefficients take constant time. For example, consider multiplying the two polynomials

    A(x) = 8x^7 + 5x^6 + 4x^5 + x^4 + 9x^3 + 6x^2 + 2x + 1
    B(x) = 7x^7 + 5x^6 + 3x^5 + 3x^4 + 9x^3 + 4x^2 + 5 .
Notice

    (8x^3 + 5x^2 + 4x + 1)(9x^3 + 4x^2 + 5) + (9x^3 + 6x^2 + 2x + 1)(7x^3 + 5x^2 + 3x + 3)
    = (8x^3 + 5x^2 + 4x + 1 + 9x^3 + 6x^2 + 2x + 1)(7x^3 + 5x^2 + 3x + 3 + 9x^3 + 4x^2 + 5)
      − (8x^3 + 5x^2 + 4x + 1)(7x^3 + 5x^2 + 3x + 3) − (9x^3 + 6x^2 + 2x + 1)(9x^3 + 4x^2 + 5) ;

does this look familiar?

Solutions

1. Given a tree on n vertices, we can always find a single vertex whose removal leaves a forest in which no tree has more than n/2 vertices. Suppose we use our divide-and-conquer algorithm to count the 3-colourings of a tree on n vertices; about how long does it take? How fast can you compute the answer? You can assume n is a power of 2 and the tree is always split into exactly two pieces of size n/2 (even though the two pieces together should have n − 1 vertices instead of n, since we removed a vertex to split the tree).

Solution: In this case, the recurrence for our colouring algorithm from Assignment 1 is

    T(n) = 3 · 2T(n/2) + f(n)
         = 6(6T(n/4) + f(n/2)) + f(n)
         = 6(6(6T(n/8) + f(n/4)) + f(n/2)) + f(n)
         ⋮
         = 6^i T(n/2^i) + ∑_{j=0}^{i−1} 6^j f(n/2^j) .

I forgot to say it takes O(n) time to find the vertex whose removal leaves a forest in which no tree has more than n/2 vertices — that is, f(n) = n — but trying every vertex naïvely would take O(n^2) time. Plugging in even the naïve f(n) = n^2 and setting i = lg n, we still get

    T(n) ≈ 6^{lg n} + n^2 ∑_{j=0}^{lg n − 1} (3/2)^j = O(6^{lg n}) = O(n^{lg 6}) .

On the other hand, it doesn't actually matter what the shape of the tree is: for whichever vertex you decide to colour first, you have 3 choices; for all its neighbours, you have 2 choices; for all their neighbours, you have 2 choices; etc. So the number of ways to 3-colour a tree on n vertices is always 3 · 2^{n−1}. (That's a big number, but it's easy to work with because in binary it's just 11 followed by n − 1 copies of 0.)

2.
In the lecture, we saw that implementing Euclid's algorithm on positive integers a and b with a > b by repeated subtraction takes Ω(a) time in the worst case but implementing it by mod takes O(log a) time, assuming subtraction and mod each take constant time. Now suppose subtracting two n-digit numbers takes n time but taking their mod takes n^2 time; comparing two numbers takes time 1. About how much bigger does a have to be than b in order for it to be faster to compute a mod b with mod directly than with repeated subtraction?

For example, if a = 1523 and b = 0427, then computing a mod b = 242 by repeated subtraction means subtracting 0427 from 1523 to get 1096 in 4 time units, checking 1096 is still bigger than 0427 in 1 time unit, subtracting 0427 from 1096 to get 0669 in 4 time units, checking 0669 is still bigger than 0427 in 1 time unit, subtracting 0427 from 0669 to get 0242 in 4 time units, and checking whether 0242 is bigger than 0427 in 1 time unit (and finding it's not). That takes a total of 4 + 1 + 4 + 1 + 4 + 1 = 15 time units, whereas computing a mod b = 242 directly takes 4^2 = 16 time units, so in this case repeated subtraction is faster.

Solution: It's faster to compute a mod b directly when a is at least about n times larger than b: if a/b = ω(n) then it takes ω(n^2) time to do the repeated subtractions; if a/b = Θ(n) then it takes Θ(n^2) time either to do the repeated subtractions or to compute the modulus; and if a/b = o(n) then it takes o(n^2) time to do the repeated subtractions.

3. Describe how to build a circuit consisting of AND, OR and NOT gates that takes two n-bit binary numbers x and y and outputs the (n+1)-bit binary number x + y. Your circuit should be a directed acyclic graph (a DAG) whose size is at most polynomial in n and whose depth is constant (where "depth" means the length of the longest directed path); the fan-in and fan-out are not bounded (where "fan-in" and "fan-out" mean the maximum in- and out-degree of any vertex).
Solution: To be consistent with what I wrote in the discussion on Brightspace, let's number the columns from the right, say the sum of x and y is z, and consider how to compute the bit z[i]. If we know the carry-bit c[i] from adding x[i−1..0] and y[i−1..0], then for 1 ≤ i < n we can compute z[i] according to the following truth table:

    x[i]  y[i]  c[i]  z[i]
     0     0     0     0
     0     0     1     1
     0     1     0     1
     0     1     1     0
     1     0     0     1
     1     0     1     0
     1     1     0     0
     1     1     1     1

That is, z[i] is 1 if and only if an odd number of x[i], y[i] and c[i] are 1s, for 1 ≤ i < n. (If i = 0 then we ignore the carry-bit, and z[n] = c[n].)

As people pointed out in the discussion, that still leaves the problem of figuring out the carry-bit c[i]. Notice that c[i] = 1 if and only if, for some k < i, we have x[k] = y[k] = 1 and, for all j with i > j > k, we do not have x[j] = y[j] = 0. In other words (or, rather, symbols),

    c[i] = ⋁_{k<i} ( x[k] ∧ y[k] ∧ ⋀_{i>j>k} (x[j] ∨ y[j]) ) .

For each choice of i and k < i and j with i > j > k, we have an AND gate, so our entire circuit has Θ(n^3) gates — but that's allowed. The main thing is, the entire circuit has constant depth.

4. Give a divide-and-conquer program for https://leetcode.com/problems/maximum-subarray (you don't have to pay for a membership!) and explain how to use your solution to solve https://leetcode.com/problems/maximum-sum-circular-subarray neatly. (If you don't use divide-and-conquer or your solution looks like it's been copied, you will not get the mark and you may be reported to FCS.)

Solution: The main idea is that the max-sum subarray nums[i..j] is either entirely in the first half of nums, so j < ⌊numsSize/2⌋; or entirely in the second half, so i ≥ ⌊numsSize/2⌋; or crosses the middle, in which case it's a non-empty suffix of the first half followed by a non-empty prefix of the second half. We find the max-sum subarrays in nums[0..⌊numsSize/2⌋ − 1] and nums[⌊numsSize/2⌋..numsSize − 1] recursively.
With a for-loop, we work out the sum of each non-empty suffix of the first half and, with another for-loop, we work out the sum of each non-empty prefix of the second half. We concatenate the max-sum suffix of the first half and the max-sum prefix of the second half to get the max-sum subarray that crosses the middle. We return the largest of the sums of the max-sum subarray entirely in the first half, the max-sum subarray entirely in the second half, and the max-sum subarray that crosses the middle.

    int maxSubArray(int* nums, int numsSize) {
        if (numsSize == 1) {
            return nums[0];
        }
        int split = numsSize / 2;
        /* best sums entirely in the left and right halves */
        int maxLeft = maxSubArray(nums, split);
        int maxRight = maxSubArray(&nums[split], numsSize - split);
        /* best sum of a non-empty suffix of the left half */
        int sufSum = nums[split - 1];
        int maxSufSum = sufSum;
        for (int i = split - 2; i >= 0; i--) {
            sufSum += nums[i];
            if (sufSum > maxSufSum) {
                maxSufSum = sufSum;
            }
        }
        /* best sum of a non-empty prefix of the right half */
        int prefSum = nums[split];
        int maxPrefSum = prefSum;
        for (int i = split + 1; i < numsSize; i++) {
            prefSum += nums[i];
            if (prefSum > maxPrefSum) {
                maxPrefSum = prefSum;
            }
        }
        /* best sum crossing the middle */
        int maxMiddle = maxSufSum + maxPrefSum;
        if (maxLeft >= maxRight && maxLeft >= maxMiddle) {
            return maxLeft;
        } else if (maxRight >= maxMiddle) {
            return maxRight;
        } else {
            return maxMiddle;
        }
    }

The coolest way to reuse this code to find the max-sum subarray in a circular array A is to find the max-sum subarray A[i..j] in A considering it as a normal array, then find the max-sum subarray A′[i′..j′] in A′, where A′[k] = −A[k] for all k, and then report the larger of the sum of A[i..j] and the sum of A[j′ + 1..|A| − 1] concatenated with A[0..i′ − 1]. The more obvious way to find the max-sum subarray is to find the max-sum subarray A[i..j], find the max-sum prefix of A, find the max-sum suffix of A, and then report the larger of the sum of A[i..j] and the sum of the max-sum suffix concatenated with the max-sum prefix.

5.
Suppose you didn't understand Max's lecture on the FFT but you still want to multiply degree-n polynomials in time subquadratic in n. Show how to use Karatsuba's algorithm to do it in O(n^{lg 3}) time, assuming arithmetic operations on coefficients take constant time. For example, consider multiplying the two polynomials

    A(x) = 8x^7 + 5x^6 + 4x^5 + x^4 + 9x^3 + 6x^2 + 2x + 1
    B(x) = 7x^7 + 5x^6 + 3x^5 + 3x^4 + 9x^3 + 4x^2 + 5 .

Notice

    (8x^3 + 5x^2 + 4x + 1)(9x^3 + 4x^2 + 5) + (9x^3 + 6x^2 + 2x + 1)(7x^3 + 5x^2 + 3x + 3)
    = (8x^3 + 5x^2 + 4x + 1 + 9x^3 + 6x^2 + 2x + 1)(7x^3 + 5x^2 + 3x + 3 + 9x^3 + 4x^2 + 5)
      − (8x^3 + 5x^2 + 4x + 1)(7x^3 + 5x^2 + 3x + 3) − (9x^3 + 6x^2 + 2x + 1)(9x^3 + 4x^2 + 5) ;

does this look familiar?

Solution: We can imagine x is a base so large that we'll never have to worry about carries, and then apply Karatsuba's algorithm. In the example, we set

    a_1(x) = 8x^3 + 5x^2 + 4x + 1
    a_0(x) = 9x^3 + 6x^2 + 2x + 1
    b_1(x) = 7x^3 + 5x^2 + 3x + 3
    b_0(x) = 9x^3 + 4x^2 + 0x + 5

so

    A(x) = a_1(x)x^4 + a_0(x)
    B(x) = b_1(x)x^4 + b_0(x)

and

    A(x)B(x) = (a_1(x)x^4 + a_0(x))(b_1(x)x^4 + b_0(x))
             = a_1(x)b_1(x)x^8 + (a_1(x)b_0(x) + a_0(x)b_1(x))x^4 + a_0(x)b_0(x) .

Setting

    z_2(x) = a_1(x)b_1(x)
    z_0(x) = a_0(x)b_0(x)
    z_1(x) = a_1(x)b_0(x) + a_0(x)b_1(x) = (a_1(x) + a_0(x))(b_1(x) + b_0(x)) − z_2(x) − z_0(x) ,

we get

    A(x)B(x) = z_2(x)x^8 + z_1(x)x^4 + z_0(x) .

Chapter 5

Asymptotic Bounds

My friend Simon was part of a team that took turns running on a treadmill and together ran over 420 km in 24 hours, trying to break the world record (about 429 km back then, currently about 453 km). His ambition was to be the first PhD-level computer scientist to run faster than Turing, although I don't think he quite made it.* Suppose Simon normally runs distance n ≥ 1 in time f(n) = n^{3/2}. His exact performance on any given day depends on many factors, however: if he's wearing his favourite shoes then we should multiply his time by 0.8; if he's just had a coffee, we multiply it by 0.9; if he's tired, we multiply it by 1.1; etc.
Suppose I normally run distance n ≥ 1 in time n^{7/4}, but if I'm running against Simon then he might give me a head start, so I'll take time g(n) = n^{7/4} − 10. If I'm wearing my favourite shoes then we should multiply my time by 0.95, etc. Clearly Simon is much faster than I am in the long run (pun intended), even though in a short enough race, with a big enough head start, and when I'm wearing my favourite shoes and he's not, etc., I might still win. How can we formalize the notion of "in the long run" and our intuition that, regardless of head starts and shoes and coffee and fatigue, as long as we run far enough then Simon is going to leave me in the dust?

This is important in computer science because, when we're talking about how much time algorithms take, we don't want to worry too much about how they behave on small inputs, or whether one algorithm is slightly faster when implemented in some particular language on some particular architecture. For example, Simon is also a really good programmer and he works in Silicon Valley

*Incidentally, if you've seen The Imitation Game and think Benedict Cumberbatch's wimpy portrayal of Turing was accurate, think again: he sometimes ran more than 60 km from Bletchley Park to London for meetings. When he was 13 he was featured in a local newspaper for cycling over 100 km to his boarding school during a train strike: "By 1926, Alan's parents were living in France and with Alan due to begin his first term at Sherborne School on 4 May 1926 he set off alone the preceding day, taking the Channel ferry from St Malo and arriving in Southampton only to discover that owing to the General Strike no trains were running. Undeterred, Alan sent his housemaster a telegram informing him that he would not now be arriving until the following day and set out to cycle the 63 miles from Southampton to Sherborne across what was for him unknown territory.
He stopped overnight at the Crown Hotel in Blandford Forum and the following morning, he reported later in a letter to his parents, the hotel staff turned out to see him off on the remainder of his journey to Sherborne." (https://oldshirburnian.org.uk/the-sherborne-formula-the-making-of-alan-turing)

so he can probably afford a faster computer than I can, but maybe his brain has atrophied slightly since he switched from academia to industry, so he's not as good at designing algorithms any more. (I admit this is unlikely, but it's a convenient supposition for this lecture.) Let's suppose he designs an algorithm that can normally process an input of size n in g(n) time and I design an algorithm for the same problem that can normally process such an input in f(n) time. Even if Simon can speed his algorithm up by a factor of 1.5 using fancy programming tricks, and by another factor of 2 using a fancy computer, and his algorithm has a head start because he pre-computes what it should do for the first few steps on any input, in the long run my algorithm is still going to be faster.

One way to say what we want is

    lim_{n→∞} f(n)/g(n) = 0 ,

but computer scientists usually use asymptotic notation and just say "f(n) ∈ o(g(n))" or, equivalently, "g(n) ∈ ω(f(n))". You've seen O in previous courses and we've mentioned Ω and Θ before in this course, and I told you to think of them as meaning "at most about", "at least about" and "about", respectively. (Historically, O stood for "on the order of", which is a bit confusing because in English that means "about", not "at most about".) In this class we'll cover o and ω too and give formal definitions, then talk about proving upper and lower bounds, and finally about how to use the Master Theorem to solve many of the recurrences you'll run across using divide-and-conquer algorithms.
The first thing we should get straight is that O(g(n)), Ω(g(n)), Θ(g(n)), o(g(n)) and ω(g(n)) are all sets — more specifically, sets of functions — and we should write f(n) ∈ O(g(n)) instead of f(n) = O(g(n)), even though being picky about the proper use of asymptotic notation is as frustrating as being picky about the proper use of apostrophes. It's really common for people to write f(n) = O(g(n)), but it's also misleading, since equality is supposed to be symmetric — that is, if x = y then y = x — but f(n) = O(g(n)) (which is true in this case) doesn't imply g(n) = O(f(n)) (which is false in this case).

Being really precise, O(g(n)) is the set of functions h(n) such that, for some constants n_0 and c > 0 and for every n ≥ n_0, we have h(n) ≤ c·g(n). We can write this in mathematical notation, with ∃ for "there exists" and ∀ for "for all", as

    O(g(n)) = { h(n) : ∃ n_0, c > 0 ∀ n ≥ n_0 . h(n) ≤ c·g(n) } .

Notice our f(n) is in this set, f(n) ∈ O(g(n)). You can see that the "about" in my "at most about" means "for sufficiently large n and ignoring constant factors". To go from "at most about" to "at least about", we just switch things around:

    Ω(f(n)) = { h(n) : ∃ n_0, c > 0 ∀ n ≥ n_0 . h(n) ≥ c·f(n) } .

I've switched from g(n) to f(n) here because, for the functions f(n) and g(n) we've chosen, we have g(n) ∈ Ω(f(n)) but f(n) ∉ Ω(g(n)). Of course if we'd chosen f(n) = n^{3/2} and g(n) = 5n^{3/2} + lg n − 7, for example, then we'd have all of g(n) ∈ O(f(n)) and g(n) ∈ Ω(f(n)) and f(n) ∈ O(g(n)) and f(n) ∈ Ω(g(n)). In this case, we write f(n) ∈ Θ(g(n)) or g(n) ∈ Θ(f(n)), so when I said Θ means "about", I meant "within a positive constant factor of each other, for sufficiently large n". Formally, Θ(f(n)) = O(f(n)) ∩ Ω(f(n)).

Notice f(n) ∈ O(g(n)) implies g(n) ∈ Ω(f(n)) and vice versa, so f(n) ∉ O(g(n)) implies g(n) ∉ Ω(f(n)) and vice versa, but it's not true that n sin n ∉ O(n cos n) implies n sin n ∈ Ω(n cos n), for example.
Although for real numbers x and y either we have x < y or x = y or x > y, functions like n sin n and n cos n aren't always comparable like that, because for any constant c we have (n sin n)/(n cos n) > c for infinitely many values of n, and (n cos n)/(n sin n) > c for infinitely many values of n.

Of course, none of these concepts really say what we started out wanting to say about how Simon and I run: he's not just at least about as fast as I am, in the long run he's much faster. For that we need o or ω. Saying f(n) ∈ o(g(n)) means that for any positive constant c and sufficiently large n, we have f(n) ≤ c·g(n) or, in mathematical notation,

    o(g(n)) = { h(n) : ∀ c > 0 ∃ n_0 ∀ n ≥ n_0 . h(n) ≤ c·g(n) } .

Switching things around, saying g(n) ∈ ω(f(n)) means that for any positive constant c and sufficiently large n, we have g(n) ≥ c·f(n) or, in mathematical notation,

    ω(f(n)) = { h(n) : ∀ c > 0 ∃ n_0 ∀ n ≥ n_0 . h(n) ≥ c·f(n) } .

Notice f(n) ∈ o(g(n)) implies g(n) ∈ ω(f(n)) and vice versa, and they're both equivalent to saying

    lim_{n→∞} f(n)/g(n) = 0   or   lim_{n→∞} g(n)/f(n) = ∞ .

Since O(g(n)), Ω(g(n)), Θ(g(n)), o(g(n)) and ω(g(n)) are all sets, we can draw them as a Venn diagram, with the following relations:

    o(g(n)) ⊂ O(g(n)) ,
    Θ(g(n)) = O(g(n)) ∩ Ω(g(n)) ,
    ω(g(n)) ⊂ Ω(g(n)) ,
    g(n) ∈ Θ(g(n)) ,
    g(n) ∉ o(g(n)) ∪ ω(g(n)) ,

as shown in Figure 5.1. Notice we can write ⊂ instead of ⊆ because g(n) ∈ Θ(g(n)) but g(n) ∉ o(g(n)) ∪ ω(g(n)). In fact, since f(n) ∈ o(g(n)), we also have the following relations, which are beyond my abilities at drawing Venn diagrams:

    o(f(n)) ⊂ O(f(n)) ,
    Θ(f(n)) = O(f(n)) ∩ Ω(f(n)) ,
    ω(f(n)) ⊂ Ω(f(n)) ,
    f(n) ∈ Θ(f(n)) ,
    f(n) ∉ o(f(n)) ∪ ω(f(n)) ,
    O(f(n)) ⊂ o(g(n)) ,
    Ω(g(n)) ⊂ ω(f(n)) .

Figure 5.1: A Venn diagram showing the relationships between O(g(n)), Ω(g(n)), Θ(g(n)), o(g(n)) and ω(g(n)).
It's traditional for instructors to give their students lots of examples of functions and how they compare asymptotically, and drive home the message of how asymptotic notation lets us ignore constant factors and lower-order terms. Because that's traditional, however, you can find lots of explanations and exercises on Khan Academy or YouTube, as well as in Introduction to Algorithms. Actually, one of the coauthors of the lesson on Khan Academy (https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/asymptotic-notation) is a coauthor of Introduction to Algorithms.

Instead, I want to focus on something I've heard from students in the past, which is that Ω is for best-case bounds and O is for worst-case bounds. I don't know where people get that from, but it's WRONG! Stirling's formula says

    ln n! = n ln n − n + O(log n) .

(The sum of the lower-order terms hidden by the O(log n) is positive. If you want to be really exact, it's between (1/2) ln n + ln √(2π) + 1/(12n + 1) and (1/2) ln n + ln √(2π) + 1/(12n).) Because it includes an O, is Stirling's formula talking about the worst case? In what sense is it bad? Incidentally, notice that we don't care about the base of the logarithm, since changing from one constant base greater than 1 to another just changes the constant coefficient hidden by the O.

It often makes sense to say things like "even in the best case, we still use at least about √n time" or "even in the worst case, we still use at most about n^2 time", and then we indeed use Ω for a best-case bound and O for a worst-case bound, but it can also make sense to say "in the best case we use at most about n^{3/4} time" or "in the worst case we use at least about n log n time".

    1 1 1 1 1 3
    1 1 1 1 3 4
    1 1 1 3 4 4
    1 1 3 4 4 4
    1 2 4 4 4 4
    3 4 4 4 4 4

Figure 5.2: A 6 × 6 matrix with 1s in the top left, 4s in the bottom right and a diagonal of 3s separating them, except for a single 2.
Instead of associating Ω with best-case bounds and O with worst-case bounds, it's more accurate to say Ω is for lower bounds (or "bounds from below") and O is for upper bounds (or "bounds from above"). Let's look at my favourite example. Suppose I give you a scratchcard with an n × n matrix on it, and tell you that in each cell there is a number between 1 and n^2, arranged such that every row and every column is sorted in non-decreasing order from left to right and from top to bottom. You draw a target number x from a hat with n^2 pieces of paper, one with each number from 1 to n^2, and then have to scratch enough cells to either find x in the matrix or show it isn't there. How many cells do you need to scratch?

Even in the best case, you obviously need to scratch at least one cell, and if you're lucky and draw x = 1, then you just need to check the top left, so the best-case bound is in Ω(1) ∩ O(1) = Θ(1). When our lower bound and upper bound match for this case, ignoring constant coefficients and lower-order terms (although those match too in this instance), we can use Θ and we say the bound is tight.

What about in the worst case? Clearly, you still need Ω(1) in the worst case, and you don't have to scratch more than the whole matrix, so O(n^2) is enough — but we don't have a tight bound yet. If you perform binary search on each row, then you scratch O(n log n) cells. It feels like you might be able to use some kind of 2-dimensional binary search, first checking the cell in the middle of the matrix and eliminating either the top-left quadrant or the bottom-right quadrant. Does that work?

In fact it doesn't. To see why, suppose the top left of the matrix is full of 1s, the bottom right is full of 4s, and the diagonal separating them is a line of 3s, except one of the cells on the diagonal contains a 2, as shown in Figure 5.2.
If you're unlucky and draw x = 2, then you have to check every cell on the diagonal until you find the 2, because if you don't then you can't be sure whether there's a 2 or not. Whatever your strategy, the 2 could be in the last cell you check, so in the worst case you need to scratch Ω(n) cells. That is, tempting though it is, binary search won't help much in the worst case.

It's not hard to see that we can take the idea we just used to prove a lower bound on the number of cells you need to scratch, and derive a matching upper bound. Suppose you start at the top-left corner and move right, scratching each cell as you go, until you either find x (in which case you're done) or a value greater than x, or you reach the top-right corner, in which case you turn and go down and continue until you find x or a value greater than x or you reach the bottom-right corner (in which case you know x is not in the matrix). Assuming you find a value greater than x, you can trace out the border between the values less than x and those at least x by scratching O(n) cells. Since x has to lie along that border, you can find it or determine that it's not in the matrix, scratching only those O(n) cells. Since our upper bound O(n) matches our lower bound Ω(n), we have a tight bound, Θ(n).

Now that we more or less understand asymptotic notation (it gets more complicated with more than one variable, but we won't worry about that in this course), we can understand what the Master Theorem says. There are stronger versions, but the one stated in Introduction to Algorithms says this:

Theorem 1 Let a ≥ 1 and b > 1 be constants, let f(n) be a function, and let T(n) be defined on the nonnegative integers by the recurrence

    T(n) = aT(n/b) + f(n) ,

where we interpret n/b to mean either ⌊n/b⌋ or ⌈n/b⌉. Then T(n) has the following asymptotic bounds:

1. If f(n) = O(n^{log_b a − ε}) for some constant ε > 0, then T(n) = Θ(n^{log_b a}).

2. If f(n) = Θ(n^{log_b a}), then T(n) = Θ(n^{log_b a} log n).

3. If f(n) = Ω(n^{log_b a + ε}) for some constant ε > 0, and if a·f(n/b) ≤ c·f(n) for some constant c < 1 and all sufficiently large n, then T(n) = Θ(f(n)).

This won't help you solve all recurrences, nor even all recurrences you'll encounter with divide-and-conquer algorithms — for example, the recurrence T(n) = 3^{2√n} · 2T(2n/3) from Chapter 2 doesn't have the right form, since 3^{2√n} · 2 isn't a constant — but it will help with a lot of them. For example, it will help with the recurrences for Karatsuba's algorithm, T(n) = 3T(n/2) + 2n, and for Strassen's algorithm, T(n) = 7T(n/2) + cn^2 for some constant c. For the recurrence for Karatsuba's algorithm, we have a = 3, b = 2 and f(n) = 2n, and 2n = O(n^{log_2 3 − ε}) for ε < 0.58, so we're in Case 1 and T(n) = Θ(n^{lg 3}). For the recurrence for Strassen's algorithm, we have a = 7, b = 2 and f(n) = cn^2 for some constant c, and cn^2 = O(n^{log_2 7 − ε}) for ε < 0.8, so we're again in Case 1 and T(n) = Θ(n^{lg 7}). You've probably seen the recurrence for MergeSort, T(n) = 2T(n/2) + cn, and for that we're in Case 2, since a = 2, b = 2 and cn = Θ(n^{log_2 2}) = Θ(n), so MergeSort takes Θ(n log n) time.

If we think of the trace of a divide-and-conquer algorithm as a tree of subproblems, then we can give informal characterizations of the three cases. Case 1 is when we're subdividing the problem into many subproblems, but the subproblems aren't shrinking as fast as we're subdividing. For example, with Strassen's algorithm we're dividing 2 matrices of size n × n into a total of 8 submatrices and then doing 7 multiplications, but the submatrices are still of size (n/2) × (n/2). This means that most of the work in the tree is done at the leaves. Case 2 is when we're subdividing and the subproblems are shrinking at about the same speed, so we do about the same amount of work at each level of the tree.
For example, in MergeSort, in the ith round of the recursion we're doing linear work on 2^i lists, each of length about n/2^i, which takes O(n) time in total for that round.

Case 3 is when the overhead of splitting the problem into subproblems and then merging the subsolutions into a solution is really high, so most of the work is at the root of the tree. For example, suppose it takes us 2^n time to split the problem up into two subproblems each of size n/2 and later to merge the subsolutions of those into a solution for our original problem. Then the recurrence is T(n) = 2T(n/2) + 2^n and we have a = 2, b = 2 and f(n) = 2^n = Ω(n^{log_2 2 + ε}) for any constant ε, so T(n) = Θ(2^n). This makes sense because

    T(n) = 2T(n/2) + 2^n
         = 2(2T(n/4) + 2^{n/2}) + 2^n
         = 2(2(2T(n/8) + 2^{n/4}) + 2^{n/2}) + 2^n
         ⋮
         = 2^{lg n} + ∑_{i=0}^{lg n − 1} 2^{n/2^i + i}

is dominated by the term 2^n.

Chapter 6

Sorting

You may have heard that sorting n integers takes O(n log n) time. As long as each integer fits in a constant number of machine words, that's true, as we'll see in this lecture. In fact, we'll see that sorting n integers takes Θ(n log n) time in the worst case in some models (and by now you should understand why saying Θ(n log n) is stronger than saying O(n log n)). Before we consider cases when we actually need Ω(n log n) time, though, we should discuss two special cases when it's possible to do better, the latter of which comes up fairly often in practice.*

Suppose we have n positive binary integers x_1, …, x_n drawn from a range of size cn for some constant c, with each x_i fitting in a constant number of machine words. We can easily scan them to find min_j{x_j} and then subtract it from each x_i to get n integers x′_1, …, x′_n in the range [0, cn − 1], all in O(n) time. If we can sort x′_1, …, x′_n in O(n) time, then we can just add min_j{x_j} to each x′_i and recover x_1, …, x_n, sorted. Therefore, without loss of generality, we can assume x_1, …, x_n are from the range [0, cn − 1].
In O(cn) = O(n) time we create an integer array A[0..cn − 1] with each A[i] = 0 initially. We scan x_1, ..., x_n and, for each x_i, we increment A[x_i]. Finally, we scan A[0..cn − 1] and, for each A[i], we report that there were that many occurrences of i in x_1, ..., x_n. This all takes O(n) time, and the procedure is called counting sort. If we want to return the actual x_is (perhaps because they have satellite data associated with them), then we can make A an array of linked lists instead, and add each x_i to the appropriate linked list instead of just incrementing the appropriate counter. Notice that, if we prepend each x_i to the head of its linked list but eventually report the contents of each linked list in tail-to-head order (which still takes linear time), then we sort x_1, ..., x_n stably: that is, if x_i = x_j and i < j, then we report x_i before x_j.

Now suppose we have n positive binary integers x_1, ..., x_n drawn from a range of size n^c for some constant c, with each x_i still fitting in a constant number of machine words. Again, without loss of generality, we can assume x_1, ..., x_n are from the range [0, n^c − 1]. Let b = ⌈lg n⌉, so we can view each x_i as a sequence of c binary numbers, each of b bits and in the range [0, 2^{⌈lg n⌉} − 1]. Since 2^{⌈lg n⌉} < 2n, we can then view x_1, ..., x_n as an n × c matrix, with each cell containing a number in the range [0, 2n − 1]. This is like viewing each x_i as a c-digit number in base 2^b, and we can use radix sort to sort x_1, ..., x_n quickly. For each of the c columns, we use counting sort to sort in O(n) time the rows of our matrix on the basis of the numbers in that column.

*In the word-RAM model, which is our default for most of this course, I think the best known upper bound for sorting n integers that each fit in a constant number of machine words is O(n √(log log n)), by Han and Thorup from FOCS 2002; we won't discuss that in this course, though.
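The counting sort just described can be sketched in Python (a minimal sketch: the function name is mine, Python lists stand in for the linked lists, and appending to a list and reading it front-to-back plays the same role as prepending and reading tail-to-head, so the sort is still stable):

```python
def counting_sort(xs, c):
    """Stably sort n integers drawn from the range [0, c*n - 1] in O(c*n) time."""
    n = len(xs)
    buckets = [[] for _ in range(c * n)]   # A[0..cn-1], lists instead of counters
    for x in xs:                           # appending keeps equal keys in input
        buckets[x].append(x)               # order, which makes the sort stable
    return [x for bucket in buckets for x in bucket]

print(counting_sort([5, 3, 5, 1, 0, 2, 4], c=1))   # [0, 1, 2, 3, 4, 5, 5]
```

The two scans take O(n) and O(cn) time respectively, matching the analysis above; the price is Θ(cn) working space for the buckets.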
(Swapping two rows takes constant time, because each x_i fits in a constant number of machine words.) Overall we use O(cn) = O(n) time so, as long as we're working in the word-RAM model and each integer fits in a constant number of machine words, we can sort n integers from a range of size polynomial in n in linear time.

If we sort based on the columns of the matrix working from left to right, it corresponds to most-significant-bit-first (MSB) radix sort (although our "bits" in this case are digits in base 2^b); if we work right to left, it corresponds to least-significant-bit-first (LSB) radix sort. Each of these has advantages and disadvantages: for example, with MSB, if we get interrupted halfway through, we've still roughly sorted x_1, ..., x_n; with LSB, we don't need to split the matrix into buckets as long as we use stable counting sort. To see what I mean, consider the example

x_1 = 10001101, x_2 = 11101001, x_3 = 00110010, x_4 = 01001000, x_5 = 01001010, x_6 = 10100101, x_7 = 01110100.

(Let's skip subtracting the minimum this time.) Since n = 7 we have 2^b = 2^{⌈lg 7⌉} = 8 and we view each x_i as a 3-digit octal number (or, equivalently, we view x_1, ..., x_7 as a matrix with 7 rows and 3 columns, with each cell containing a number between 0 and 7):

x_1 = 215, x_2 = 351, x_3 = 062, x_4 = 110, x_5 = 112, x_6 = 245, x_7 = 164.

If we use counting sort on the least significant digits, we get

x_4 = 110, x_2 = 351, x_3 = 062, x_5 = 112, x_7 = 164, x_1 = 215, x_6 = 245.

Now if we use counting sort on the second digits, ignoring the least significant ones, if we're not careful we could get

x_1 = 215, x_5 = 112, x_4 = 110, x_6 = 245, x_2 = 351, x_7 = 164, x_3 = 062,

which undoes all the sorting we just did based on the least significant digits.
If we use stable counting sort on the second digits, however, then we get

x_4 = 110, x_5 = 112, x_1 = 215, x_6 = 245, x_2 = 351, x_3 = 062, x_7 = 164,

with x_4 preceding x_5, for example, because it preceded it after we sorted on the least significant digit, because 0 is smaller than 2 and those are x_4's and x_5's least significant digits, respectively. Applying stable counting sort to the most significant digits, we get

x_3 = 062, x_4 = 110, x_5 = 112, x_7 = 164, x_1 = 215, x_6 = 245, x_2 = 351.

Finally, translating back into binary, we have

x_3 = 000110010, x_4 = 001001000, x_5 = 001001010, x_7 = 001110100, x_1 = 010001101, x_6 = 010100101, x_2 = 011101001.

[Figure 6.1: A decision tree for sorting 3 distinct numbers.]

I guess you might have seen all this in first or second year, but it's probably good to have a reminder before continuing, anyway. Just in case, I'm going to restate what we saw above as a theorem:

Theorem 2 In the word-RAM model, if each integer fits in a constant number of machine words then sorting n integers from a range of size polynomial in n takes O(n) time.

Counting sort and radix sort "look inside" their arguments, but for the rest of the lecture we'll be working in the comparison model, which disallows direct addressing, pointer arithmetic, and even adding and subtracting the integers in the input. Since the word-RAM model lets us do more (as long as the integers each fit in a constant number of machine words, so we can compare two of them in constant time), upper bounds in the comparison model hold in the word-RAM model but not necessarily vice versa, and lower bounds in the word-RAM model hold in the comparison model but not necessarily vice versa. This means showing that MergeSort works in O(n log n) time in the comparison model, for example, implies we can sort in that time in the word-RAM model as well.
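The LSB radix sort worked through above can be sketched in Python (a sketch under the same assumptions: a stable counting sort on each digit, least significant digit first; the function name is mine):

```python
def radix_sort(xs, base, digits):
    """LSB-first radix sort: stably counting-sort the numbers on each digit,
    working from the least significant digit to the most significant."""
    for d in range(digits):
        buckets = [[] for _ in range(base)]              # counting sort with lists
        for x in xs:
            buckets[(x // base ** d) % base].append(x)   # stable: keeps current order
        xs = [x for bucket in buckets for x in bucket]
    return xs

# The seven numbers from the example, written in octal (base 8, 3 digits):
nums = [0o215, 0o351, 0o062, 0o110, 0o112, 0o245, 0o164]
print([oct(x) for x in radix_sort(nums, base=8, digits=3)])
```

Because each pass is stable, ties on the current digit are broken by the earlier passes, which is exactly why no bucketing of the matrix is needed.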
First, let's see why we need Ω(n log n) time in the worst case when we're in the comparison model. Suppose we have a set of n numbers, x_1, ..., x_n, now with "set" implying they're all distinct. A pairwise comparison between x_i and x_j returns exactly 1 bit of information: either x_i < x_j or x_i > x_j. Suppose you have an algorithm A that does pairwise comparisons and eventually tells you how to order x_1, ..., x_n, and consider A as a decision tree. For example, Figure 6.1 shows a decision tree for sorting 3 numbers with pairwise comparisons. (Sort of: some people only count comparisons in the comparison model and allow other computations for free, in which case those bounds may not translate well into the RAM model.)

Since the decision tree is binary and has a leaf for each of the n! possible orderings, one of those leaves must be at depth at least ⌈lg n!⌉. Stirling's Formula says that's

⌈lg n!⌉ = ⌈ln n!/ln 2⌉ = (1/ln 2)(n ln n − n + O(log n)) = n lg n − n/ln 2 + O(log n) = Ω(n log n).

Therefore, for some ordering, algorithm A makes Ω(n log n) pairwise comparisons — regardless of what algorithm A is.

Lemma 3 In the comparison model, sorting takes Ω(n log n) time.

Does this mean that we always need Ω(n log n) time to sort n numbers? Consider how InsertionSort behaves on the list 1, 2, 3, ..., n: after i − 1 steps, it has removed 1, ..., i − 1 from its input list and inserted them into its output list; during step i, it removes i from the front of its input list and scans through its output list, comparing i to 1, ..., i − 1 in turn, and then finally adds i at the end of that list. In total, it uses 1 + 2 + 3 + ··· + (n − 1) = Ω(n^2) comparisons. This is actually its worst case. Now, however, consider how InsertionSort behaves on the list n, n − 1, n − 2, ..., 3, 2, 1: after i − 1 steps, it has removed n, n − 1, ...
, n − i + 2 from its input list and inserted them into its output list; during step i, it removes n − i + 1 from the front of its input list, compares it to n − i + 2 at the front of its output list, and adds n − i + 1 at the start of that list. In total, it uses O(n) comparisons. This is actually its best case. Most algorithms do pretty well on their best case — which is one reason comparing best-case performance isn't very informative — but not all. For example, SelectionSort always uses Θ(n^2) comparisons. It's more interesting to talk about the average case or expected case, which we'll see when we get to QuickSort. My friend Jérémy has spent years studying how sorting algorithms can take advantage of pre-sortedness in their inputs, such as being made up of only a few interleaved increasing subsequences. I might be able to squeeze that into the next assignment.

I think you all know MergeSort, but let's review it quickly: given an array of n ≥ 2 numbers, we divide it roughly in half and recursively sort both halves, then merge the sorted subarrays. To merge two sorted subarrays S1 and S2 of length n1 and n2 into a sorted array S of length n = n1 + n2, we can conceptually append ∞ to both lists and run the following code:

    i = 0; j = 0; k = 0;
    while (S1[i] != infty || S2[j] != infty) {
        if (S1[i] < S2[j]) {
            S[k] = S1[i];
            i++;
            k++;
        } else {
            S[k] = S2[j];
            j++;
            k++;
        }
    }

This takes time O(n1 + n2) = O(n). Incidentally, is this implementation of MergeSort stable?

The recurrence for MergeSort is T(n) = 2T(n/2) + n, since splitting the input in half takes constant time (and no comparisons) and merging the sorted sublists takes n time and comparisons. This is Case 2 of the Master Theorem, with a = 2, b = 2 and f(n) = n = Θ(n^{log_b a}) = Θ(n), so we use Θ(n log n) time overall.

Lemma 4 In the comparison model, sorting takes O(n log n) time.
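The merge routine above, together with the surrounding recursion, can be rendered in Python (a sketch: float('inf') plays the role of the ∞ sentinel, and the function names are mine, not from the notes):

```python
def merge(s1, s2):
    """Merge two sorted lists in O(len(s1) + len(s2)) time, using infinity sentinels."""
    s1 = s1 + [float('inf')]            # conceptual "append infinity to both lists"
    s2 = s2 + [float('inf')]
    s, i, j = [], 0, 0
    while s1[i] != float('inf') or s2[j] != float('inf'):
        if s1[i] < s2[j]:
            s.append(s1[i]); i += 1
        else:
            s.append(s2[j]); j += 1
    return s

def merge_sort(xs):
    """MergeSort: split roughly in half, sort both halves recursively, merge."""
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))

print(merge_sort([5, 2, 4, 6, 1, 3]))   # [1, 2, 3, 4, 5, 6]
```

Note that, as in the code above, ties go to S2, which is relevant to the stability question just posed.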
Notice that, even though we just proved that MergeSort uses Θ(n log n) time, that only gives us an O(n log n) upper bound for sorting, until we combine it with Lemma 3 to get a tight bound:

Theorem 5 In the comparison model, sorting takes Θ(n log n) time.

So, why not just stop here? Why are there literally dozens of other sorting algorithms? Well, some of them have other useful properties, like being adaptive to various kinds of pre-sortedness, and some are just faster in practice, such as QuickSort.

I think you also all know QuickSort, but we should probably review that too: given an array S[1..n] of n numbers, we choose a pivot S[i] somehow; scan S and then partition it into numbers smaller than S[i], numbers equal to S[i], and numbers greater than S[i]; and recursively sort the numbers smaller than S[i] and the numbers bigger than S[i]. How long can this take in the worst case? If we're extremely unlucky, S could contain the numbers 1 to n and we could choose as pivots 1, 2, 3, ..., n. In this case, for the ith step we choose i as a pivot and spend n − i time scanning i + 1, ..., n and finding they're all bigger than i. Thus, in the worst case we use (n − 1) + (n − 2) + ··· + 3 + 2 + 1 = Θ(n^2) time.

To see why anyone uses QuickSort, we must analyze how fast it usually runs. Suppose we pick a pivot S[i] uniformly at random from S; what is the probability that neither the partition of numbers smaller than S[i] nor the partition of numbers bigger than S[i] contains more than 3n/4 numbers?
If we imagine S already sorted then, because we pick S[i] uniformly at random, the probability it is in the first quarter — in which case the partition of numbers bigger than S[i] can contain more than 3n/4 numbers — is 1/4, and the probability it is in the last quarter — in which case the partition of numbers smaller than S[i] can contain more than 3n/4 numbers — is also 1/4, but the probability it is in neither of those quarters — in which case neither of those partitions has size more than 3n/4 — is 1/2. Imagining S sorted is just for analysis, of course; we don't really want to sort S before choosing S[i] (although later we'll discuss an idea along the same lines).

Suppose we change our implementation of QuickSort to be fussier: if at some step it picks a pivot and scans and partitions the subarray it's trying to sort recursively, and finds that either the partition of numbers smaller than the pivot contains more than 3/4 of all the numbers in the subarray, or the partition of numbers bigger than the pivot does, then it undoes the partitioning, picks another pivot and tries again. How many tries do we expect it to make before it finds a pivot it likes? As we just argued, it makes exactly 1 try with probability p ≥ 1/2; it makes exactly 2 tries with probability p(1 − p); it makes exactly 3 tries with probability p(1 − p)^2; etc. Therefore, the expected number of tries is

Σ_{i≥1} i p(1 − p)^{i−1} ≤ Σ_{i≥1} i/2^i = 2,

and so the expected time before it recurses is linear in the size of the subarray. As we did when considering the cases of the Master Theorem, let's visualize the recursion as a tree of height O(log_{4/3} n) = O(log n), with each vertex corresponding to a subarray, each of the n leaves corresponding to a single number, and the size of each vertex's subarray being the number of leaves in its subtree.
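A minimal Python sketch of QuickSort with a uniformly random pivot and three-way partitioning, as in the review above (the fussy re-picking of bad pivots is omitted, since it was introduced only as an analysis device; the function name is mine):

```python
import random

def quick_sort(s):
    """QuickSort with a pivot chosen uniformly at random: expected O(n log n) time,
    worst-case Theta(n^2) time."""
    if len(s) <= 1:
        return s
    pivot = random.choice(s)                   # uniformly random pivot S[i]
    smaller = [x for x in s if x < pivot]      # three-way partition, as in the review
    equal   = [x for x in s if x == pivot]
    larger  = [x for x in s if x > pivot]
    return quick_sort(smaller) + equal + quick_sort(larger)

print(quick_sort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]
```

(In-place partitioning, as asked for on Assignment 3, avoids the extra lists this sketch allocates.)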
At each vertex — that is, at each step in the recursion — we expect to spend time linear in the size of the subarray, which means each leaf in its subtree is contributing constant expected time to that vertex. By linearity of expectation — that is, because the expected value of a sum is the sum of the expected values of the terms — the total expected time is proportional to the sum of the leaves' depths, or O(n log n). As long as you believe making QuickSort fussier doesn't actually speed it up (it doesn't), then we can conclude that QuickSort uses O(n log n) time in the expected case.

Since we're talking about QuickSort, we can quickly cover a related algorithm, called QuickSelect. Suppose we just want to find the kth number in sorted order in an array S[1..n]. Of course we could sort S to find the kth number in sorted order, but that seems like overkill. Suppose we run QuickSort, but we discard whichever partition doesn't contain the kth number. For example, suppose we're looking for the 197th number in an array of 300 numbers. If we choose a pivot and partition and it turns out there are 103 numbers smaller than the pivot and 3 numbers equal to the pivot, then we can discard those and recursively find the 91st number in the 194 numbers larger than the pivot. With QuickSelect, we're conceptually just descending one branch of the recursion tree, which takes expected time at most proportional to

n + (3/4)n + (3/4)^2 n + ··· + 1 = O(n),

although, like QuickSort, in the worst case QuickSelect uses Θ(n^2) time.

In fact, it's possible to find the kth number in linear time even in the worst case. To do this, we partition S into quintuples and find the median of each quintuple, which takes O(n) time. We collect the medians into a set S′ of size about n/5 and recursively find the median of S′.
The median of S′ needn't be the median of S, but it can't be in the first quarter or the last quarter: half the n/5 quintuples had medians smaller than the median of S′ (ignoring rounding), and each of those quintuples had 3 numbers (the median of the quintuple and the two smaller elements) smaller than the median of S′, so there are at least 3n/10 numbers in S smaller than the median of S′ and, symmetrically, there are at least 3n/10 numbers in S larger than the median of S′. It follows that the median of S′ is a good pivot.

Analyzing this algorithm is a bit complicated, because we're finding the pivot recursively and then recursing on one of the partitions, but the recurrence and its solution is given in Introduction to Algorithms. So, why don't people use this technique to find the pivots for QuickSort, to make it run in O(n log n) time in the worst case, as well as the expected case? Because then it's slower in practice than just running MergeSort!

Assignment 3

posted: 21.05.2021
due: midnight 04.06.2021

You can work in groups of up to three people. One group member should submit a copy of the solutions on Brightspace, with all members' names and banner numbers on it; the other group members should submit text files with all members' names and banner numbers (otherwise Brightspace won't let us assign them marks!). You may consult with other people but each group should understand the solutions: after discussions with people outside the groups, discard any notes and do something unrelated for an hour before writing up your solutions; it's a problem if no one in a group can explain one of their answers. For programming questions you should submit your code, which should compile and run correctly to receive full marks.

1. Mark the true statements.
(Your score for this question will be proportional to the number of statements you mark correctly, plus the number you correctly leave unmarked, minus the number you should have marked but didn't, minus the number you shouldn't have marked but did.)

(a) lg n^n = O(lg n!)
(b) n^{(n+1) mod 2} = Ω(n^{n mod 2}) for n ∈ N
(c) n^{(3n) mod 2} = O(n^{n mod 2}) for n ∈ N
(d) If T(n) = 7T(n/3) + n^4 then T(n) = Ω(n^4/log n).
(e) If T(n) = 8T(n/2) + 8n^3 then T(n) = Θ(n^3 log n).

2. Explain how you can sort a sequence of n integers from a range of size n^{lg lg n} in O(n log log n) time (assuming each integer fits in a constant number of machine words).

3. An in-place algorithm uses a constant number of machine words of memory on top of the memory initially occupied by its input and eventually occupied by its output. Give code or pseudo-code for an in-place version of QuickSort.

4. You've probably seen in previous courses how to build a min-heap on n elements in O(n) time and how to extract the minimum value from one in O(log n) time. Do you think we can easily extract the minimum value in o(log n) time while still leaving a heap on the remaining elements? Why or why not?

5. Suppose you have an algorithm that, given a sequence of n integers that can be partitioned into d non-decreasing subsequences but not fewer, does so in O(n log d) time. Explain how you can also sort such a sequence in O(n log d) time.

Solutions

1. Mark the true statements. (Your score for this question will be proportional to the number of statements you mark correctly, plus the number you correctly leave unmarked, minus the number you should have marked but didn't, minus the number you shouldn't have marked but did.)

(a) lg n^n = O(lg n!)
(b) n^{(n+1) mod 2} = Ω(n^{n mod 2}) for n ∈ N
(c) n^{(3n) mod 2} = O(n^{n mod 2}) for n ∈ N
(d) If T(n) = 7T(n/3) + n^4 then T(n) = Ω(n^4/log n).
(e) If T(n) = 8T(n/2) + 8n^3 then T(n) = Θ(n^3 log n).

Solution: a, c, d, e

2.
Explain how you can sort a sequence of n integers from a range of size n^{lg lg n} in O(n log log n) time (assuming each integer fits in a constant number of machine words).

Solution: Consider each number as a (lg lg n)-digit number in base 2^b, where b = ⌈lg n⌉. Since 2^b is linear in n, we can use counting sort on each of the lg lg n columns of digits in O(n) time per column, so radix sort uses O(n log log n) time overall.

3. An in-place algorithm uses a constant number of machine words of memory on top of the memory initially occupied by its input and eventually occupied by its output. Give code or pseudo-code for an in-place version of QuickSort.

Solution: Start a pointer p at the start of the subarray and a pointer q at the end of the subarray; move p right until we find a value larger than the pivot and move q left until we find a value smaller than the pivot; then swap the elements pointed to by p and q; continue. We stop when p meets q.

4. You've probably seen in previous courses how to build a min-heap on n elements in O(n) time and how to extract the minimum value from one in O(log n) time. Do you think we can easily extract the minimum value in o(log n) time while still leaving a heap on the remaining elements? Why or why not?

Solution: If we could do this easily, then we could easily implement HeapSort in o(n log n) time, contradicting the Ω(n log n) lower bound for sorting in the comparison model.

5. Suppose you have an algorithm that, given a sequence of n integers that can be partitioned into d non-decreasing subsequences but not fewer, does so in O(n log d) time. Explain how you can also sort such a sequence in O(n log d) time.

Solution: We partition the sequence into the d subsequences, and then merge them in O(n log d) time using a min-heap on d elements, containing the first element in each list. When we extract the minimum element, we insert into the heap the next element from the list that the extracted element came from, maintaining the invariant that the smallest element of each list not yet included in the sorted output is in the heap.
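The heap-based merge in the solution to Question 5 can be sketched in Python with the standard heapq module (a sketch; the function name is mine):

```python
import heapq

def merge_d_lists(lists):
    """Merge d sorted lists into one sorted list in O(n log d) time using a min-heap
    that always holds at most one element from each list."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)                  # one (value, list index, position) per list
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)  # extract the overall minimum, O(log d)
        out.append(val)
        if j + 1 < len(lists[i]):        # replace it with the next element
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out

print(merge_d_lists([[1, 4, 7], [2, 5], [0, 3, 6, 8]]))   # [0, 1, ..., 8]
```

Each of the n elements is pushed and popped once, and each heap operation costs O(log d), giving the O(n log d) bound.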
Part II

Greedy Algorithms

Chapter 7

Huffman Coding

When I asked during the first lecture for examples of profound truths we have found through computer science, someone suggested Shannon's information theory. I hadn't thought of that and it's a good answer, even though I think information theory is often considered more part of electrical engineering than computer science (for example, the Transactions on Information Theory is published by the Information Theory Society of the Institute of Electrical and Electronics Engineers, better known as the IEEE). I want to introduce greedy algorithms by teaching you Huffman's algorithm, which gives us a chance to discuss some of the basics of noiseless coding, which was the topic of the first half of Shannon's 1948 article introducing information theory (the other half was about noisy coding).

Before Shannon's article, it wasn't clear how to measure information. All other things being equal, it was intuitively clear that a big book contained more information than a one-page letter, but it was also easy to think the situation might be reversed if the book was boring and inaccurate while the letter held vitally important news. Shannon neatly sidestepped this by focusing only on transmitting losslessly whatever message we're given, regardless of its content. In fact, he assumed that we and the people we're transmitting to all know the probability distribution — over the universe of possible messages — according to which the message is chosen, and we've been able to prepare in advance a code assigning a distinct self-delimiting codeword to each possible message (where "possible" means the distribution assigns it a non-zero probability). This way, we don't have to talk about the message, per se, only about the distribution. Shannon's information theory is thus concerned with random variables that take on values which are individual messages, not with the individual messages themselves.
(Later in the course we'll talk a little about Kolmogorov complexity, which is concerned with the individual messages — but, as we'll see, that's incomputable.) We need not care about the significance of the random variables: a random variable that takes on the value "yes" with probability 0.5 and "no" with probability 0.5 is the same for us whether the question was "Launch the missile?" or "Do you want fries with that?". (My first statistics professor used to say "let's consider a random variable, such as your mark on the next test", which made us uncomfortable even though he was completely correct.) Some people still object to how information theory separates "information" from "meaning" and says a page of randomly-chosen characters contains more information than a page of beautiful poetry — but consider which is easier to memorize exactly.

One way to remember that information theory is about random variables is to think of transmitting information as removing uncertainty. It's not often mentioned in electrical engineering, but Shannon proposed three axioms for measuring uncertainty:

Suppose we have a set of possible events whose probabilities of occurrence are p_1, p_2, ..., p_n. These probabilities are known but that is all we know concerning which event will occur. Can we find a measure of how much "choice" is involved in the selection of the event or of how uncertain we are of the outcome? If there is such a measure, say H(p_1, p_2, ..., p_n), it is reasonable to require of it the following properties:

1. H should be continuous in the p_i.

2. If all the p_i are equal, p_i = 1/n, then H should be a monotonic increasing function of n. With equally likely events there is more choice, or uncertainty, when there are more possible events.

3. If a choice be broken down into two successive choices, the original H should be the weighted sum of the individual values of H. The meaning of this is illustrated in Figure 7.1.
[Figure 7.1: Decomposition of a choice from three possibilities (from Shannon's article).]

At the left we have three possibilities p_1 = 1/2, p_2 = 1/3, p_3 = 1/6. On the right we first choose between two possibilities each with probability 1/2, and if the second occurs make another choice with probabilities 2/3, 1/3. The final results have the same probabilities as before. We require, in this special case, that

H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2)H(2/3, 1/3).

The coefficient 1/2 is because this second choice only occurs half the time.

Shannon showed that the only function H satisfying those three axioms is of the form

H = −K Σ_{i=1}^{n} p_i log p_i,

where K is a positive constant. Changing the base of the logarithm changes the value of K or the unit in which we measure information and uncertainty. If the base of the logarithm is 2 and K = 1 then we are measuring uncertainty in bits, with 1 bit (an abbreviation of "binary digit" and a term suggested to Shannon by Tukey, one of the re-discoverers of the Fast Fourier Transform) being our uncertainty about the outcome of a flip of a fair coin. If the base of the logarithm is e and K = 1 then we are measuring uncertainty in units called nats (which I think only electrical engineers use). He called this function the entropy of the random variable (or of the probability distribution, when we're concentrating on that). In this course, we'll usually be concerned with the binary entropy,

H(p_1, ..., p_n) = − Σ_{i=1}^{n} p_i lg p_i,

and we'll measure information and uncertainty in bits. Quite apart from his axiomatic derivation he showed that, given P = p_1, ..., p_n, it is not possible to assign self-delimiting binary codewords to the n possible outcomes such that the expected codeword length is less than H(P), but it is possible to assign such codewords so that the expected codeword length is less than H(P) + 1. This result is often called "Shannon's Noiseless Coding Theorem".
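Shannon's decomposition requirement from the figure can be checked numerically (a small Python sketch; the function name H is mine):

```python
import math

def H(*ps):
    """Binary entropy -sum p_i lg p_i, in bits (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# H(1/2, 1/3, 1/6) should equal H(1/2, 1/2) + (1/2) H(2/3, 1/3).
lhs = H(1/2, 1/3, 1/6)
rhs = H(1/2, 1/2) + (1/2) * H(2/3, 1/3)
print(abs(lhs - rhs) < 1e-12)   # True
```

Both sides work out to 2/3 + (1/2) lg 3 ≈ 1.459 bits, agreeing up to floating-point rounding.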
It's not hard to prove the first part of the Noiseless Coding Theorem (the lower bound) but it's less important for this class, so we'll leave it until the end and for now prove only the second part (the upper bound). Without loss of generality, suppose p_1 ≥ p_2 ≥ ··· ≥ p_n > 0. Let s_i = Σ_{j=1}^{i−1} p_j and consider how many bits of the binary representation of s_i we must write to uniquely distinguish s_i. Notice that if two binary fractions agree on their first b bits after the binary point then they differ by less than 1/2^b: the smaller binary fraction can end with an infinite number of 0s after those b bits, but the larger cannot end with an infinite number of 1s, because we would write 0.0101100111111111111..., say, as 0.0101101000000000000.... Conversely, if two binary fractions differ by at least 1/2^b for an integer b then they must differ on one of their first b bits after the binary point. It follows that, since s_i differs from s_{i−1} by p_{i−1} ≥ p_i for i > 1 and from s_{i+1} by p_i for i ≤ n, s_i is uniquely distinguished by its first ⌈lg(1/p_i)⌉ bits after the binary point. Assigning the first ⌈lg(1/p_i)⌉ bits of s_i to the ith outcome, the expected codeword length is

Σ_{i=1}^{n} p_i ⌈lg(1/p_i)⌉ < Σ_{i=1}^{n} p_i lg(1/p_i) + 1 = H(P) + 1.

Codes in which every codeword is self-delimiting are called prefix-free (sometimes abbreviated to just "prefix"), because no codeword can be a prefix of another codeword, or instantaneous, because we can tell as soon as we've reached the end of a codeword. Prefix-free codes built with Shannon's construction are called, naturally enough, Shannon codes. Shortly after Shannon published his article, Robert Fano proposed a construction that recursively divides the p_is into two subsets whose sums are as close as possible.
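Shannon's construction can be sketched in Python (a sketch under the stated assumption that the probabilities are given in non-increasing order; the function name is mine):

```python
import math

def shannon_code(ps):
    """Shannon code: the codeword for the ith outcome is the first ceil(lg(1/p_i))
    bits of s_i = p_1 + ... + p_{i-1} after the binary point."""
    codewords = []
    s = 0.0
    for p in ps:                          # assumes ps is sorted non-increasingly
        length = math.ceil(math.log2(1 / p))
        bits, frac = '', s
        for _ in range(length):           # write s_i in binary, bit by bit
            frac *= 2
            bit, frac = divmod(frac, 1)
            bits += str(int(bit))
        codewords.append(bits)
        s += p
    return codewords

print(shannon_code([1/2, 1/4, 1/8, 1/8]))   # ['0', '10', '110', '111']
```

For the dyadic distribution in the example the code meets the entropy bound exactly; in general the expected length is below H(P) + 1, as shown above.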
Even before he was the father of information theory, Shannon had made important contributions to electrical engineering: for example, in his master's thesis at MIT, written when he was 21, he introduced the idea of modelling digital circuits with Boolean logic (and included a diagram of a 4-bit full adder); if you think Question 3 on Assignment 2 was hard, imagine what it would have been like without the notion of logic gates! Suffice to say, like Turing (with whom he met at Bell Labs), Shannon was a genius. Nevertheless, Shannon codes can be sub-optimal, meaning the expected codeword length is not always minimum. For example, for P = 6/16, 5/16, 5/16, Shannon's construction produces a code with all codeword lengths equal to

⌈lg(16/6)⌉ = ⌈lg(16/5)⌉ = 2,

but if we assign the first outcome the codeword 0 and the second and third the codewords 10 and 11, then the expected codeword length is

6/16 + 2(5/16 + 5/16) = 1.625.

People realized immediately that Shannon's article was a masterpiece and by 1951 Fano was teaching a graduate course on information theory. He gave the students the option of doing a project or writing a final exam, and one of the students, David Huffman, decided to find an efficient algorithm for building optimal prefix-free codes. The story goes that he failed repeatedly and was about to give up when inspiration struck and he found the famous greedy algorithm now named after him, which he published in 1952 and which we're going to cover in this lecture.

Before we try to understand Huffman's algorithm, however, I'd like us to warm up by developing a greedy algorithm for another problem. Suppose that during the last lecture I tried to demonstrate MergeSort by taking an unsorted sequence, splitting it in half, and sending the two subsequences to two students with instructions to continue the recursion and send me their sorted subsequence when it was ready.
Somehow wires got crossed, however, and some other students sent their sorted subsequences directly to me instead of giving them back to the people they'd got them from. To save time, I decide to merge the subsequences myself, using the code I showed you in the last lecture that merges two sorted subsequences of lengths n1 and n2 into a single sorted sequence of length n = n1 + n2 in O(n) time. The subsequences are of many different sizes, however, and I want to know in which order I should merge them to minimize the time I spend.

Faced with this problem of planning how to perform pairwise merges of subsequences, with each merge taking time proportional to the combined length of the two merged subsequences, an obvious first step is to merge the two shortest subsequences. Since this minimizes the time taken by the first merge, in some sense it's the obvious greedy choice. Suppose we start with n subsequences — ignoring the fact that in the last lecture n was the length of the original sequence and thus the total length of the subsequences — so after we merge the two shortest, we have n − 1 subsequences (including the result of that first merge). For now, let's concentrate on proving only that there exists some strategy — maybe greedy, maybe not — that merges those n − 1 subsequences such that the total time (including the first merge) is the minimum possible for any pairwise merging strategy.

Suppose, for example, that we've asked my friend Jérémy (an expert on sorting, as I mentioned in the last lecture) to merge our subsequences for us, but he's been delayed and we've decided to choose and perform the first merge ourselves. How can we be sure that after we merge the two shortest subsequences, when Jérémy does eventually arrive he'll say something like

"That's not exactly what I would have done, but I can work with it; we can still get an optimal solution (so the total time spent merging is minimized)."

and not something like

"Sacré bleu, what have you done?
All is lost! There's no way to reach an optimal solution now!"?

(Jérémy is French, as you may have guessed.)

Any strategy for merging the subsequences pairwise can be viewed as a strictly binary tree (that is, a tree in which every internal vertex has exactly 2 children), with the input subsequences at the leaves, each internal vertex being the merge of its children, and the root being the complete sequence. Symmetrically, any such tree can be viewed as a strategy for merging the subsequences. Notice that an element of the input subsequence for leaf v is processed during the merges resulting in the subsequences for v's proper ancestors (that is, the ancestors of v excluding v itself), taking constant time during each of those merges. Therefore, the total contribution of all the elements in v's subsequence to the time for merging is proportional to the length of that subsequence times v's depth. (Yes, this is similar to our analysis of QuickSort; that's deliberate.) If we assign each leaf weight equal to the length of its subsequence, then a strategy according to which the time spent merging is minimized corresponds to a tree in which the sum of the products of the leaves' weights and depths is minimized. Let's call that sum the tree's cost.

Suppose we are given a tree T_Opt that corresponds to an optimal merging strategy, so it has minimum cost. If a leaf with a smaller weight is higher in T_Opt than a leaf with a larger weight, then we can swap them and reduce T_Opt's overall cost, contrary to the assumption that its cost is minimum. Therefore, if we choose to merge subsequences with lengths w_i and w_j because those are the shortest, we can assume T_Opt has two leaves at its bottom level with weights equal to w_i and w_j. (It can't have a single leaf at its bottom level because it's strictly binary.)
Swapping leaves at the same level doesn’t change the cost of the tree, and neither does swapping leaves with the same weight, so there exists a tree T ′Opt with the same cost as TOpt in which the leaves for the subsequences we merge first are siblings (at the bottom level). In other words, there indeed exists some optimal merging strategy that starts by merging the two subsequences we merge — meaning Je´re´my will be able to take over and reach an optimal solution. Now consider the algorithm that merges subsequences into a single subsequence by repeatedly merging the shortest two subsequences (breaking ties arbitrarily). We have just shown that if we merge the shortest two of n subsequences and then merge all the resulting n − 1 subsequences optimally, then we use the minimum possible time for merging. It follows that if our algorithm is optimal for merging n− 1 subsequences, then it is optimal for merging n subsequences. Clearly our algorithm is optimal for 0 or 1 subsequences (for which it has nothing to do) or even two subsequences (for which it makes the only possible merge). Therefore, by induction, it is optimal for any number n of subsequences. Theorem 6 Given a set of sorted subsequences to merge pairwise into a single sequence, with each merge taking time proportional to the combined length of the two merged subsequences, we can minimize the total time used by repeatedly merging the two shortest subsequences. 67 uv wi wj wjwi v′u′ TOpt T ′ Opt wi u′ u wi wj v′ v wj Figure 7.2: If we merge subsequences with lengths wi and wjcorresponding to leaves u and v, then there must be leaves u′ and v′ at the bottom level of TOpt with those weights. We can swap u and v with u′ and v′ to have u and v at the bottom level and swap them with other leaves at the bottom level to make them siblings in a new tree T ′Opt with the same cost as TOpt. 
Considering again pairwise merging strategies as corresponding to binary trees, with optimal strategies corresponding to trees with minimum cost (where a tree’s cost is the sum of the products of its leaves’ weights and depths), we can rephrase Theorem 6 as follows: Theorem 7 Given positive weights, the following algorithm builds a binary tree whose leaves are assigned those weights, in some order, such that the sum of the products of the leaves’ weights and depths is minimized: 1. we create a set of vertices with those weights; 2. until there is only one vertex in the set, we repeatedly (a) remove the two vertices u and v with the smallest weights in the set, (b) make u and v the children of a new vertex t with weight equal to the sum of u’s and v’s weights, (c) insert t into the set; 3. we return the single vertex remaining in the set, as the root of the tree. An obvious way to implement our algorithm in Theorem 7 is to keep vertices with the current weights in a priority queue, in non-decreasing order by weight. When we extract the two vertices with the smallest weights wi and wj , we make them the children of a new vertex with weight wi+wj , which we insert into the priority queue. Using a min-heap as a priority queue, the algorithm then takes O(n log n) time. Now that we have warmed up, we are ready to consider the problem of designing an optimal prefix-free code, as Huffman did. Specifically, given a probability distribution p1, . . . , pn, we want to associate binary codewords to the pis such that no codeword is a prefix of any other codeword and, when the ith codeword is chosen with probability pi, the expected codeword length is minimum. 
The key to this problem is to realize that a collection of n binary codewords in which no one codeword is a prefix of any other codeword, can be viewed as a binary tree with edges from parents 68 0 0 0 0 0 0 1 1 1 1 1 1 Figure 7.3: A binary tree corresponding to a prefix-free code with codewords 000, 001, 01, 100, 1010, 1011, 11. to left children with 0s, edges from parents to right children with 1s, and the paths from the root to the leaves labelled with the codewords. Figure 7.3 illustrates this correspondence. From this perspective, building an optimal prefix-free code is equivalent to building a binary tree whose leaves are assigned weights p1, . . . , pn, in some order, such that the sum of the products of the leaves’ weights and depths is minimized. Because we can assume all the probabilities are positive, in fact we have already solved this problem with Theorem 7! Our supposed warm-up was actually a derivation of Huffman’s algorithm. As one of my former supervisors would say, “I was tricking you!” I hope this presentation is easier to understand than the traditional one: last year I had a student spontaneously suggest merging the two shortest lists as a first step, and I’ve never had anyone spontaneously suggest something like “make the two leaves labelled with the least likely characters in the alphabet the children of a new vertex assigned a probability equal to the sum of theirs”. Figure 7.4 gives an illustration of Huffman’s algorithm. In order to fill some space and hide my trick from people glancing ahead in these notes (who might be suspicious if we finish too quickly after proving Theorem 7), now we will discuss Van Leeuwen’s algorithm, which is a version of Huffman’s algorithm that runs in linear time when the probabilities are given already sorted (or can be represented by integers in a range of size polynomial in n, in which case we can sort them in linear time ourselves). 
This is in contrast to using a priority queue, which uses Ω(n log n) time with most common implementations of priority queues. Van Leeuwen’s algorithm is also a nice example of how the clever use of data structures, even standard ones, can significantly speed up algorithms. Suppose we are given weights w1, . . . , wn with w1 ≤ · · · ≤ wn. We enqueue vertices with weights w1, . . . , wn in order in a queue Q1, such that the vertex with weight w1 is at the head of Q1 and the vertex with weight wn is at its tail. We create an empty queue Q2 and repeat the following steps until there is only one vertex left in the queues: 1. dequeue a vertex with weight x from whichever of Q1 and Q2 has a lighter vertex at its head (or from whichever has a vertex at its head, if one is empty); 2. dequeue a vertex with weight y from whichever of Q1 and Q2 has a lighter vertex at its head (or from whichever has a vertex at its head, if one is empty); 69 0.07 0.08 0.13 0.13 0.68 0.86 1.31 1.70 1.74 1.77 1.96 2.08 2.12 2.43 2.45 3.54 3.74 5.27 5.36 5.57 5.94 6.13 6.61 7.19 7.97 11.18 11.97 0.68 0.86 1.31 1.70 1.74 1.77 1.96 2.08 2.12 2.43 2.45 3.54 3.74 5.27 5.36 5.57 5.94 6.13 6.61 7.19 7.97 11.18 11.97 0.13 0.13 0.07 0.08 0.15 0.68 0.86 1.31 1.70 1.74 1.77 1.96 2.08 2.12 2.43 2.45 3.54 3.74 5.27 5.36 5.57 5.94 6.13 6.61 7.19 7.97 11.18 11.97 0.07 0.08 0.15 0.13 0.13 0.26 0.68 0.86 1.31 1.70 1.74 1.77 1.96 2.08 2.12 2.43 2.45 3.54 3.74 5.27 5.36 5.57 5.94 6.13 6.61 7.19 7.97 11.18 11.97 0.07 0.08 0.15 0.13 0.13 0.26 0.41 1.31 1.70 1.74 1.77 1.96 2.08 2.12 2.43 2.45 3.54 3.74 5.27 5.36 5.57 5.94 6.13 6.61 7.19 7.97 11.18 11.97 0.86 0.68 0.07 0.08 0.15 0.13 0.13 0.26 0.41 1.09 1.96 2.08 2.12 2.43 2.45 3.54 3.74 5.27 5.36 5.57 5.94 6.13 6.61 7.19 7.97 11.18 11.97 1.31 1.70 1.74 1.77 0.86 0.68 0.07 0.08 0.15 0.13 0.13 0.26 0.41 1.09 1.95 3.54 3.74 5.27 5.36 5.57 5.94 6.13 6.61 7.19 7.97 11.18 11.97 1.74 1.77 0.86 0.68 0.07 0.08 0.15 0.13 0.13 0.26 0.41 1.09 1.95 1.96 2.08 2.12 2.43 2.45 1.31 
1.70 3.01 . . . 2.43 2.45 4.88 5.27 10.15 5.36 5.57 10.93 21.08 11.18 11.97 23.15 44.23 5.94 6.13 12.07 13.31 25.38 6.61 1.31 1.70 3.01 1.74 1.77 3.51 6.52 7.19 3.54 3.74 7.28 14.47 7.97 2.08 2.12 4.20 0.86 0.68 0.15 0.26 0.41 1.09 1.95 1.96 0.07 0.08 0.13 0.133.91 8.11 16.08 30.55 55.93 100.16 Figure 7.4: The first few steps of Huffman’s algorithm, implemented with a priority queue, and the complete tree (bottom right). The weights are the approximate frequencies (as percentages) of the characters A through Z and SPACE in English text. (They sum to 100.16 instead of 100 because of rounding.) 70 3. make those vertices the children of a new vertex with weight x+ y; 4. enqueue that new vertex in Q2. Since enqueuing and dequeuing take constant time, the algorithm takes O(n) time overall. As long as both queues are always sorted, dequeuing from whichever has a lighter vertex at its head is equivalent to extracting from a priority queue. Obviously Q1 stays sorted, because we only dequeue from it and never enqueue to it. To see why Q2 stays sorted, assume for the sake of a contradiction that at some point we enqueue a vertex v to Q2 with a weight x+ y that is smaller than weight of the vertex u ahead of it in Q2, and that previously both queues were always sorted. Since Q2 is initially empty and we enqueue only vertices with weights that are the sums of the vertices they are replacing, u’s weight is x′ + y′ for some weights x′ and y′ of vertices that were dequeued before the vertices with weights x and y that v is replacing. If both queues were always sorted up to this point, however, and we dequeued vertices with weights x′ and y′ before vertices with weights x and y, however, then x′ + y′ ≤ x+ y, so v is heavier than u. Theorem 8 If we are given a distribution with n sorted probabilities, then we can build an optimal prefix-free code for them in O(n) time. 
As promised, and to fill some more space, now we will see the proof (more or less) of the lower bound in the Noise Coding Theorem: given P = p1, . . . , pn, it is not possible to assign self-delimiting binary codewords to the n possible outcomes such that the expected codeword length is less than H(P ). Let’s assume the pis are positive rationals, so each pi can be expressed as ai/bi for some positive integers ai and bi. Let m be a sufficiently large common multiple of the bis and let mi = aim/bi, so m1 + · · ·+mn = m. For the sake of a contradiction, assume it is possible to assign self-delimiting binary codewords w1, . . . , wn to the outcomes such that the expected codeword length∑ i pi|wi| < H(P )− for some positive constant . Consider an alphabet {c1, . . . , cn} and all the possible strings of length m in which ci occurs mi times, for each i. There are ( m m1,...,mn ) such strings, where the multinomial coefficient( m m1, . . . ,mn ) = ( m m1 )( m−m1 m2 ) · · · ( m−m1 − · · · −mn−2 mn−1 )( m−m1 − · · · −mn−2 −mn−1 mn ) = ( m! (m−m1)!m1! )( (m−m1)! (m−m1 −m2)!m2! ) · · · ( (m−m1 − · · · −mn−2 −mn−1)! (m−m1 − · · · −mn−2 −mn−1 −mn)!mn! ) = m! m1!m2! · · ·mn! . Notice this is equal to 2lgm!−lgm1!−···−lgmn!. 71 By Stirling’s formula, lgm!− lgm1!− · · · − lgmn! = (m lgm−m lg e+O(logm))− ∑ i (mi lgmi −mi lg e+O(lgmi)) = m lgm− ∑ i mi lgmi −m lg e+ ∑ i mi lg e+O(logm)− ∑ i O(logmi) = ∑ i mi lgm− ∑ i mi lgmi − ∑ i mi lg e+ ∑ i mi lg e+O(logm)− ∑ i O(logmi) = ∑ i mi lg(m/mi) +O(logm)− ∑ i O(logmi) = m ∑ i (mi/m) lg(m/mi) +O(logm)− ∑ i O(logmi) = mH(P ) +O(logm)− ∑ i O(logmi) ≥ mH(P )−O(n log(n/m)) . For sufficiently large m, this is greater than m(H(P )− /2), so we can assume we are considering more than 2m(H(P )−/2) possible strings. Suppose we assign each ci codeword wi and encode each of those possible strings by concate- nating the codewords of its characters. 
Then we encode each such string with a distinct binary string consisting of ∑ i mi|wi| = m ∑ i pi|wi| < m(H(P )− ) bits — but there are fewer than 2m(H(P )−)+1 − 1 of those and, for sufficiently large m, 2m(H(P )−/2) > 2m(H(P )−)+1 − 1 , so we have a contradiction. For a course on algorithms, you don’t really need to know why the Noiseless Coding Theorem holds, but the proofs of the upper and lower bounds aren’t really all that hard (and used some things we’ve seen before) and the result is generally considered deep and beautiful, so I thought it worth including them. Understanding the full proof of the Noiseless Coding Theorem may also help you understand an apparent contradiction: that theorem says we can’t beat the entropy and that Huffman coding is optimal, but you’ll often hear people saying arithmetic coding beats Huffman coding and that more sophisticated schemes do even better. To see that this isn’t really a contradiction, it’s important to remember that Shannon and Huffman were assuming we know the probability distribution over the universe of possible entire messages. That’s usually not the case and, when people actually use Huffman coding, they’re usually taking a string, counting how many times each character in the alphabet occurs in that string, assigning each character probability proportional to its frequency, building a Huffman code for the resulting probability distribution, and then using that code to encode the string character by character. That’s quite different from trying to encode the whole string at once! 72 As we just saw, if the string is long enough and the characters are shuffled randomly, then we can’t achieve an average codeword length of H(P )− for the characters, where P is the distribution of characters in the string. 
If the characters are shuffled or if they’re sampled from a multiset with replacement — for example, drawn from a hat containing slips of paper with the characters written one them, possibly with a different number of slips for each character, and then returned to the hat — then Huffman coding is (essentially) optimal among all codes that encode the string character by character, using an integer number of bits for each character. Such codes are sometimes called character codes or, when prefix-free, instantaneous codes (since we can decode each character when we reach the end of its codeword), and they can be beaten by schemes, such as arithmetic coding, that can encode several characters together. More sophisticated compression schemes also take advantage of the fact that the choice of each character is not independent and identically distributed (often abbreviated to “iid”). For example, a “q” in in English is usually followed by a “u”, and if we’ve just seen “th” then the next character is probably an “e” or an “a”, or maybe a space or an “o” or an “r”, but almost certainly not a “t”, even though “t” is a fairly common in English generally. Those schemes can beat Huffman coding when they use a contextual model and it’s used as a memoryless character code. Arithmetic coding can also be used with a contextual model; in fact, it was invented partly to determine which is the best model, according to the principle of minimum description length: of a class of models for a string, the best one is the one that minimizes the total size of the model and of the encoding of the string with respect to the model. You may hear more about this in courses on machine learning. 73 Chapter 8 Minimum Spanning Trees Building a minimum spanning tree of a connected, edge-weighted graph is a classic problem, and a classic example of a problem we can solve with greedy algorithms. 
There are many greedy algorithms for building minimum spanning trees, and we’ll look at three: Kruskal’s, Prim’s and Bor˚uvka’s. They have different advantages and disadvantages which you should consider when choosing one for a particular situation. Recall that a graph is connected if and only if there exists a path from any vertex to any other vertex. (For this lecture, we’ll consider only undirected graphs.) An edge-weighted graph has a cost associated with each edge; without loss of generality, we can assume all the weights are positive. (If some of the edges have negative weights or weight 0, then we can include all those in our solution without increasing its total cost, and then consider as single vertices the connected components of the resulting graph.) In a connected graph with positive edge weights, a connected subgraph with minimum weight that spans all the vertices — that is, includes them all — will necessarily be a tree. (If it included a cycle, we could remove one of the edges in that cycle and reduce the subgraph’s weight without disconnecting it.) Such a subgraph is thus called a minimum spanning tree (MST). There are many applications for which MSTs are useful, which is one reason there are so many algorithms, and why they have been rediscovered so often: Bor˚uvka developed his algorithm in 1926 for designing an electrical grid for part of the eastern Czech Republic (then part of Czechoslo- vakia) and, according to Wikipedia (https://en.wikipedia.org/wiki/Boruvkas_algorithm) it was later rediscover by Choquet in 1938, by Florek et al. in 1951, and by Georges Sollin in 1965 (and is sometimes known as “Sollin’s algorithm”); Prim’s algorithm also first developed in Czech- slovakia, by Jarnik in 1930; it was then rediscovered by Prim in 1957 and again by Dijkstra in 1959 (so it is sometimes known as “Jarnik’s algorithm”, or “Prim-Jarnik”, or “Prim-Dijkstra” or “DJP”). 
For example, suppose you’re the county planner in charge of paving some roads in your county such that it’s possible to drive between any two towns along paved roads, and every road starts at 74 a town and ends at another town without forking or intersecting other roads along the way.* Let’s assume it is currently possible to drive between any two towns along dirt roads without leaving the county, you know the cost for paving each road and, although you need not worry about how long it will take to drive between towns along paved roads, you want to spend as little money as possible on paving. How do you choose which roads to pave? We can model this as a graph, with the towns as vertices, the roads as edges, the cost of paving a road as the weight of the corresponding edge, and an optimal plan for paving as an MST of the graph. Let’s start by considering what at first seems like the simplest algorithm for this, Kruskal’s: first, we sort the edges by weight, breaking ties arbitrarily; then, we process the edges in non- decreasing order by weight, adding each edge to our current subgraph if it does not create a cycle and discarding it if it does. To see why Kruskal’s algorithm works, let’s consider the standard form of a proof of correctness for a greedy algorithm: Before we take any steps, our (empty) subsolution can be extended to an optimal solution. Assume that, after i ≥ 0 steps, our subsolution can be extended to an optimal solution S. Then we show that after i + 1 steps, our subsolution can be extended to an optimal solution S′. Therefore, by induction, we obtain an optimal solution. To show that our subsolution after i+ 1 steps can be extended to an optimal solution, consider the edge e we process in our (i+ 1)st step. 
If we discard e then, because we discard edges only when they create a cycle with our current subsolution and S extends our current subsolution — so it contains all of the edges we have taken so far and none of the edges we have discarded — S cannot contain e, so S itself extends our subsolution after i+1 steps (meaning S′ = S). If we take e and S includes e then, again, S extends our subsolution after i + 1 steps (and S′ = S). Finally, if we take e and S does not include e, then we must show how to change S to obtain a solution S′ that does include e but does not cost more. We will do this in the normal way, by an exchange argument: proving we can exchange e for another edge in S of equal or greater weight without disconnecting the subgraph in S. If S does not already include e, then adding e to S creates a cycle. There must be some edge f in this cycle we have not already considered (because S agrees with our subsolution after i steps). Since we consider the edges in non-decreasing order by weight, f ’s weight must be at least as great as e’s. Therefore, by replacing f with e, we obtain a solution S′ that still spans the graph and has total weight no more than S. To use Kruskal’s algorithm, it’s important that you know the whole graph for which you’re trying to build an MST. It’s reasonable to assume a county planner has a map of their county, of *If we consider intersections and allow plans in which pavement on roads can stop at intersections instead of at towns, then we have an instance of the minimum Steiner-tree problem, which is another of those “BOOM” problems essentially equivalent to finding Hamiltonian paths. This is a more formal version of the argument in the last lecture that my friend Je´re´my can show up after we take a step and still reach an optimal solution. 
75 course, but suppose you’re trying to choose a subset of communication links between computers in a network such that a message can be sent from any computer to any other through those links, the links have various costs, and you find out how a computer is linked to the rest of the network only when you have selected links allowing it to exchange messages with your computer. Prim’s algorithm can deal with this version of the MST problem because, while Kruskal’s algorithm builds subsolutions that are forests, Prim’s subsolutions are always trees: first, we choose a vertex from which to start our MST; then, at each step, we consider the edge with lowest weight incident to our current tree which we have not already considered and add it to our tree unless it creates a cycle, in which case we discard that edge. To see why Prim’s algorithm works, we again follow the standard form: Before we take any steps, our (empty) subsolution can be extended to an optimal solution. Assume that, after i ≥ 0 steps, our subsolution can be extended to an optimal solution S. Then we show that after i + 1 steps, our subsolution can be extended to an optimal solution S′. Therefore, by induction, we obtain an optimal solution. (I’m repeating this because I want you to be able to remember it during an exam.) Again, we will use an exchange argument to show that, if we take an edge e in the (i+ 1)st step and S does not include e, then we can change S into a solution that includes e without increasing the total weight. The argument again starts with adding e to S and considering the resulting cycle, but this time we cannot claim that any edge in that cycle we have not yet considered must have weight at least as great as e’s, because some of them may not be incident to our tree after i steps. 
Nevertheless, since the cycle includes e, and one of e’s endpoints is in our tree after i steps but the other is not (otherwise our taking e would create a cycle in that tree), then some edge f in the cycle is not in our tree after i steps but is incident to it. Therefore, since we consider the edges incident to our tree in non-decreasing order by weight, f ’s weight must be at least as great as e’s, so by replacing f by e we obtain an optimal solution S′ that extends our subsolution after i+ 1 steps. To get a broader perspective, I encourage you to read about Kruskal’s and Prim’s algorithms in Introduction to Algorithms. For consistency, since I used it in the video, I’ve also copied their example of how those algorithms produce MSTs (which I hope falls under “fair use” of copyrighted material), in Figures 8.1 and 8.2. Finally, suppose that all the computers in the network want to work together to build an MST quickly, but know only about the links incident to computers with which they they can exchange messages. In this case, we can use Bor˚uvka’s algorithm (which I don’t think Introduction to Algorithms covers): like Kruskal’s algorithm, a subsolution is a forest and not necessarily a tree; like Prim’s algorithm, in each step, the computers in each tree consider all the edges incident to that tree and add the one with lowest weight, unless it creates a cycle. For the sake of simplicity, let’s assume either the links all have slightly different costs, or there’s some locking mechanism to prevent the computers in two trees simultaneously adding edges with equal weight that create a cycle. If the computers in each tree take 1 time unit to choose the edge 76 Figure 8.1: The example of Kruskal’s algorithm from Introduction to Algorithms: “The execution of Kruskal’s algorithm on the graph from Figure 23.1. Shaded edges belong to the forest. The algorithm considers each edge in sorted order by weight. 
An arrow points to the edge under consideration at each step of the algorithm. If the edge joins two distinct trees in the forest, it is added to the forest, thereby merging the two trees.” 77 Figure 8.2: The example of Prim’s algorithm from Introduction to Algorithms: “The execution of Prim’s algorithm on the graph from Figure 23.1. The root vertex is a. Shaded edges are in the tree being grown, and black vertices are in the tree. At each step of the algorithm, the vertices in the tree determine a cut of the graph [Travis: don’t worry what this means, we’ll learn about cuts if we have time to cover network flows], and a light edge crossing the cut is added to the tree. In the second step, for example, the algorithm has a choice of adding either edge (b, c) or edge (a, h) to the tree, since both are light edges crossing the cut.” 78 incident to that tree with lowest weight and add it to the tree, then after 0 time units the trees consist of 1 vertex each; after 1 time unit the trees consists of at least 2 vertices each; in general, after i time units the trees consist of at least 2i vertices each; and we have an MST in O(log n) time units, where n is the number of vertices in the graph. That is, although building an MST takes Ω(n) work, it can be done in o(n) time if we can use parallelism. Although the strength of Bor˚uvka’s algorithm is that computers in the different trees can work in parallel, it’s easiest to see why it is correct by considering the computers in one tree adding one edge in isolation. In this case, we can consider the other trees as single vertices, so the correctness of a step in Bor˚uvka’s algorithm follows from the correctness of Prim’s algorithm (or of Kruskal’s algorithm, from a different perspective). Let’s now look at how to implement each algorithm. With Prim’s algorithm we need a priority queue of the edges incident to our current tree. 
If we use a min-heap with Θ(logm) time to extract the min or insert a new element, where m is the number of edges, then Prim’s algorithm takes Θ(m logm) = O(m log n) time. Detecting when adding an edge would create a cycle is relatively easy: we keep all the edges in our tree coloured a certain colour and, whenever we add an edge to the tree, we colour its endpoint that was just added to the tree and consider all the edges incident to that vertex; we discard any that have both endpoints coloured and insert into the priority queue any that have only one endpoint coloured. We can also implement Bor˚uvka’s algorithm in O(m log n) time: we initially colour all the vertices different colours and then proceeding in O(log n) rounds, maintaining the invariant that all the vertices in any one tree are all the same colour, and a different colour than the vertices in any other tree; in each round, we scan all the edges in the graph; if an edge e’s endpoints are two different colours and one of the endpoints is red, for example, and e’s weight is the smallest we’ve seen so far during this round for any edge with a red endpoint, then we keep e as our candidate edge to add to the red tree (discarding any previous candidate); when we finish the scan, we have the edge with lowest weight incident to each tree; we then add those edges — taking some care not to create a cycle by adding two edges with the same weight incident to the same tree (two candidates with endpoints that share a colour) — and recolour vertices so all the vertices in each tree are the same colour (perhaps with breadth-first or depth-first traversals of the trees). This takes O(m+ n) = O(m) time per round, or O(m log n) time in total. The most interesting implementation is of Kruskal’s algorithm. At first it seems the simplest, because we need only sort the edges once. 
It is not so obvious how to detect when adding an edge would create a cycle, however: we cannot use a single colour, as with Prim’s algorithm; but if we use many colours, like Bor˚uvka’s algorithm, then we must be careful not to spend too much time recolouring. Suppose we initially colour all the vertices different colours and maintain the invariant that all the vertices in any one tree are all the same colour, and a different colour than the vertices in any other tree, and we know the count of how many vertices are in each tree. Whenever we consider an edge, if its endpoints are the same colour then we discard it; otherwise, we choose the smaller of the two trees it connects and recolour all its vertices to match the vertices in the larger tree. With a fairly simple implementation this takes time linear in the size of the smaller tree, and it is not hard to see that any vertex is recoloured O(log n) times. It follows that, after sorting the edges, we 79 spend O(n log n) time building an MST with Kruskal’s algorithm. Since we can sort the edges in O(m logm) = O(m log n) time, we get an O(m log n) bound for Kruskal’s algorithm as well. As you’ll remember from the lecture on sorting, however, in some models or in some circum- stances we can sort the m edges in o(m log n) time. Therefore, it makes sense to try to optimize the post-sorting construction and, for that, we’ll now briefly cover the union-find data structure. This data structure maintains a collection of disjoint sets with a representative for each set, and supports the following operations: create(x) creates a new set {x} (when x is not already in any set); union(x, y) merges the sets containing x and y (unless they are already in the same set); find(x) returns the representative of the set containing x. 
It shouldn’t be too hard to see how to use this data structure to implement Kruskal’s algorithm: whenever we consider an edge, we find the representatives of its endpoints and, if they are the same, we discard the edge; if not, we union the sets containing the endpoints. It also shouldn’t be too hard to see how, as long as our sets contain a total of n elements, we can support all of the operations in O(log n) time, using the idea for implementing Kruskal’s algorithm in O(m log n) time: we keep a coloured tree for each set, whose vertices are the elements in that set, and a representative vertex associated with each colour; when we union two elements, we add an edge between their vertices and recolour the smaller of their two trees. The standard implementation of union-find actually works much faster than this. We still keep the elements of a set in a tree, but now the tree is rooted and each vertex points to its parent (with the root pointing to itself). We consider the root to be the representative of the set and, when we union two sets, we switch the pointer at the root of the smaller tree from pointing to itself, to pointing to the root of the larger tree. The main change is in how we implement find, using an idea called path compression: when we perform find(x), we follow the pointers up the path from x to the root of its tree and, as we go, we push x’s ancestors on a stack so that, when we finally reach the root of the tree, we can reset all of their pointers to the root. This increases the worst-case time of a find by a constant factor, but it greatly decreases the amortized time. I’m not sure if you’ve seen amortized analysis in previous courses. The simplest example is an expandable array, which is useful when we are receiving elements and want to store them in an array but don’t know how many there are or how much space to allocate. 
By starting with a small array and doubling the size whenever it fills up, we never use more than a constant times as much space as we need and, although the worst-case time to receive an element is linear in the number of elements we have received — to create the new array, copy all the elements into it, and free the old array — the total time is linear in the number of elements and so the time to receive and process each element is constant when amortized over all of the elements. To see why, suppose we charge 3 units to receive each element, spend 1 to insert it into the current array, and save 2; then, when an array of n elements is full, we have 2(n/2) = n units saved to pay for copying each element into a new array. The analysis of union-find with path compression is quite detailed, and even Introduction to Algorithm presents only a simplified version, showing that it makes Kruskal’s algorithm take O(m log∗ n) time after sorting the edges, where lg∗ n is the number of times we need to apply lg to 80 n to get down to 1. The function lg∗ n grows very slowly; for example, even though 222 22 ≈ 2·1019728, lg∗ 22 22 2 = lg∗ 22 22 + 1 = lg∗ 22 2 + 2 = lg∗ 22 + 3 = lg∗ 2 + 4 = 5. In fact, m operations on an initially empty collection of sets, of which n operations are create, takes O(α(m,n)) time, where α is the Inverse Ackermann Function, which grows even more slowly than lg∗. 81 Chapter 9 More Greedy Algorithms We started our study of greedy algorithms with Huffman coding so I could trick you into redis- covering Huffman’s algorithm while pretending we were warming up with a simple exercise about merging sorted lists. Most other courses on algorithms start with simpler problems, not known to have stumped Shannon (“Father of the Information Age”). Now that we’ve seen Huffman’s, Kruskal’s, Prim’s and Bor˚uvka’s algorithms, let’s look at a couple of those easy problems — and some harder versions of them, to keep things interesting. 
The first problem Introduction to Algorithms shows how to solve with a greedy algorithm is Activity Selection: we are given a list of activities with their start times and finish times; we can think of an activity with start time s and finish time f as a half-open interval [s, f); an activity [s, f) is compatible with another activity [s′, f′) if and only if their intervals do not intersect, [s, f) ∩ [s′, f′) = ∅; we want to choose the largest possible subset of pairwise-compatible activities. The example of Activity Selection I have in my edition of Introduction to Algorithms has these activities: [3, 5), [5, 7), [12, 14), [1, 4), [8, 11), [5, 9), [6, 10), [8, 12), [3, 8), [0, 6), [2, 13). For example, if your school-day lasts from 9:30 am to 4:30 pm, 0 means 9:30 and 14 means 4:30, then [5, 7) could be lunch from 12:00 to 1:00 and the other activities could be lectures, seminars, tutorials, sports, naps, etc. Assuming you can do only one thing at a time but you can get from one activity to another instantly — hooray for online learning! — and you value all activities equally — even 3110 lectures, and even lunch! — how can we choose your activities so that you do as many as possible? Remember that greedy algorithms usually start by sorting their inputs in some way. I listed the activities in order by length, with ties broken by start time, but that’s not always the best way to sort them. For example, if your activities are [0, 3), [2, 4), [3, 6) then it’s better to take [0, 3) and [3, 6) than only [2, 4), even though that’s the shortest. How else can you sort the activities? The obvious other ways are by start time — which isn’t optimal, in cases such as [0, 14), [1, 2), [2, 3), . . . , [12, 13) — and by finish time.
If you try a few examples, it seems that processing the activities in order by finish time, from earliest to latest, and scheduling each activity if it is compatible with all the activities you’ve already scheduled and ignoring it otherwise, always produces an optimal schedule. (Ok, this is symmetric to processing the activities by start time, from latest to earliest — just think of time as running backwards.) Why is this the case? Before we take any steps, our (empty) subsolution can be extended to an optimal solution. Assume that, after i ≥ 0 steps, our subsolution can be extended to an optimal solution S. Then we show that after i + 1 steps, our subsolution can be extended to an optimal solution S′. Therefore, by induction, we obtain an optimal solution. Suppose that in the (i + 1)st step we consider activity [x, y). If we discard it then it must be incompatible with one of the activities we have already scheduled, in which case S cannot include it, so S′ = S. If we schedule it and S includes it then, again, S′ = S. The only interesting case is when we schedule [x, y) and S discards it. If S does not schedule any activities incompatible with [x, y), then we can add [x, y) to S and achieve an improved solution (contrary to the assumption that S is optimal). Therefore, assume S schedules such an activity, [x′, y′). Notice that x′ < y (otherwise the two activities are not incompatible) and y′ ≥ y (otherwise we would have considered [x′, y′) before [x, y)), so S contains only one activity incompatible with [x, y). If we remove [x′, y′) from S, then the rest of S’s activities are compatible with [x, y), so we can add [x, y) and achieve a solution S′ containing [x, y) that has as many activities as S and, so, is optimal. Activity Selection is perhaps the simplest problem in a field known as job scheduling, which includes many “BOOM” problems.
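The earliest-finish-time rule just proved correct can be sketched as follows (the function name is mine; activities are (start, finish) pairs):

```python
def select_activities(activities):
    """Greedily schedule activities in order by finish time, keeping each
    one that is compatible with everything already scheduled."""
    chosen = []
    last_finish = float('-inf')
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:  # [start, finish) misses everything chosen so far
            chosen.append((start, finish))
            last_finish = finish
    return chosen
```

On the textbook instance above it schedules [1, 4), [5, 7), [8, 11) and [12, 14) — four activities.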
Imagine each activity can be started after a certain point (its release time), should be finished before a certain point (its deadline), has a certain profit (which can vary between activities), may depend on other activities having already been completed, and takes a certain amount of time (its duration) less than or equal to the difference between its deadline and its release time, with the duration depending on who among several people performs that activity — and you’ll start to see why we’re not going deeper into this topic in this course. For Assignment 3, you showed how to sort quickly using a procedure that “given a sequence of n integers that can be partitioned into d non-decreasing subsequences but not fewer, does so in O(n log d) time”. There actually exists such an algorithm, called Supowit’s algorithm, that does this partitioning online and greedily. By “online” I mean that it processes the integers in order and has always partitioned those it has seen so far into a minimum number of non-decreasing subsequences. For example, if the original sequence is 7, 2, 4, 3, 1, 2, 6, 9, 5, 8, 7, 4, 3, how can you partition it online? The first 7 must start a subsequence, and the first 2 is less than 7 so it must start its own subsequence, but the first 4 could either start its own subsequence or follow the first 2; if the first 4 follows the first 2 then the first 3 must start its own subsequence, otherwise it can follow the first 2; etc. For Supowit’s algorithm, we keep the last element in each subsequence in a dynamic predecessor data structure (such as an AVL tree) with operations taking time logarithmic in the number of elements stored. We append each element x to the subsequence currently ending with x’s predecessor among the last elements in the subsequences (that is, the largest element appearing last in a subsequence that is at most x); if x is smaller than all of the current last elements, we start a new subsequence with it.
If we append x to a subsequence, then we delete from the predecessor data structure the element that was the last element in the subsequence; in both cases, we insert x into the predecessor data structure. To process the entire sequence, we use O(n log d) time. In our example, we start a subsequence with 7 and insert it into the predecessor data structure; search for the predecessor of 2 but find nothing, and start a new subsequence with 2 and insert it into the predecessor data structure; search for the predecessor of 4 and find 2, append 4 to the subsequence 2, delete 2 from the predecessor data structure and insert 4; search for the predecessor of 3 but find nothing, and start a new subsequence with 3 and insert it into the predecessor data structure; search for the predecessor of 1 but find nothing, and start a new subsequence with 1 and insert it into the predecessor data structure; search for the predecessor of 2 and find 1, append 2 to the subsequence 1, delete 1 from the predecessor data structure and insert 2; etc. Eventually, we obtain the following 5 subsequences:

7, 9
2, 4, 6, 8
3, 5, 7
1, 2, 4
3

Why can we be sure it is impossible to partition our example sequence into fewer than 5 non-decreasing subsequences? Before we take any steps, our (empty) subsolution can be extended to an optimal solution. Assume that, after i ≥ 0 steps, our subsolution can be extended to an optimal solution S. Then we show that after i + 1 steps, our subsolution can be extended to an optimal solution S′. Therefore, by induction, we obtain an optimal solution. Suppose that an optimal solution S extends our subsolution after i steps, when we have processed i integers. If S has the (i + 1)st integer x in the same subsequence that we put it in, then S extends our subsolution after i + 1 steps, so S′ = S. If we start a new subsequence with x, then it cannot be appended to any of the existing subsequences, so S must start a new subsequence with x as well and, again, S′ = S.
The only case left to consider is when we append x to a subsequence A and S appends it to another (possibly empty) subsequence B. Let a be the last element of A before x in our subsolution and let b be the last element of B before x in S. Let C and D be the suffixes in S of the subsequences that start with A and with B, x, respectively — meaning S includes subsequences A, C and B, x, D. Since we append x to the list ending with x’s predecessor among the last elements of the subsequences when we consider x, we have b ≤ a ≤ x. Therefore, the first integer in C (if any) is at least a ≥ b, so we can swap C and x, D to obtain a solution with the same number of subsequences as S and containing the subsequences A, x, D and B, C. This solution extends our subsolution after i + 1 steps. Figure 9.1: A vertex cover of size 4 (shown with red markers) of a graph with 7 vertices. Interestingly, the problem of partitioning a sequence of integers into the minimum number of monotone subsequences — that is, with each subsequence either increasing or decreasing — is a “BOOM” problem. For various reasons, people who work in data compression and data structures would like to be able to partition sequences of integers into non-decreasing subsequences so as to minimize the entropy of the distribution of elements to subsequences. This isn’t the same as minimizing the number of non-decreasing subsequences: for example, another solution to our example instance

7, 9
2, 4, 6, 8
3, 4
1, 2, 5, 7
3

has the same number of subsequences but the entropy of its distribution

(2/13) lg(13/2) + (4/13) lg(13/4) + (2/13) lg(13/2) + (4/13) lg(13/4) + (1/13) lg(13/1) ≈ 2.16

is slightly lower than the entropy of the distribution of the solution given by Supowit’s algorithm,

(2/13) lg(13/2) + (4/13) lg(13/4) + (3/13) lg(13/3) + (3/13) lg(13/3) + (1/13) lg(13/1) ≈ 2.20 .
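Supowit’s algorithm can be sketched as follows; for simplicity, this sketch keeps the current last elements in a sorted Python list and uses `bisect` for the predecessor searches (a balanced tree such as an AVL tree would give the O(n log d) bound; the names are mine):

```python
import bisect

def supowit_partition(seq):
    """Partition seq online into a minimum number of non-decreasing
    subsequences, greedily, as in Supowit's algorithm."""
    subs = []    # the subsequences built so far
    tails = []   # sorted list: the current last element of each subsequence
    owner = []   # owner[i]: index in subs of the subsequence ending with tails[i]
    for x in seq:
        i = bisect.bisect_right(tails, x) - 1  # predecessor: largest tail <= x
        if i < 0:
            # x is smaller than every current last element: start a new subsequence
            subs.append([x])
            tails.insert(0, x)
            owner.insert(0, len(subs) - 1)
        else:
            k = owner[i]
            subs[k].append(x)                  # append x after its predecessor
            del tails[i]
            del owner[i]
            j = bisect.bisect_right(tails, x)  # re-insert x as the new last element
            tails.insert(j, x)
            owner.insert(j, k)
    return subs
```

On the example sequence 7, 2, 4, 3, 1, 2, 6, 9, 5, 8, 7, 4, 3 this produces exactly the 5 subsequences shown above.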
As far as I know, no one knows whether partitioning sequences of integers into non-decreasing subsequences so as to minimize the entropy goes “clink” or goes “BOOM”. If you figure it out, either way, I promise to give you as many bonus marks as the faculty will let me. Vertex Cover is a classic “BOOM” problem: given a graph on n vertices and an integer k ≤ n, decide if there is a subset of k vertices such that every edge is incident to (“covered by”) at least one vertex in that subset. (There’s an analogous problem called Edge Cover for which we must select a subset of edges so as to cover all the vertices. The way to keep them straight is to think “We’re looking for a vertex cover (of the edges)”.) Figure 9.1 shows a vertex cover of size 4 of a graph with 7 vertices. Notice that, if you find a vertex cover consisting of fewer than k vertices, you can always add more vertices to get a vertex cover consisting of exactly k vertices, so we can write “of k vertices” instead of “of at most k vertices”. This is called a decision problem because it asks for a yes-or-no answer; the corresponding optimization problem is to find the smallest vertex cover. Figure 9.2: A maximal (but not maximum!) matching, and the vertex cover we get from including all the endpoints of its edges. Theorists like thinking about decision problems because you can call the set of all strings encoding “yes” instances of Vertex Cover a language, for example, and start using terms and ideas from formal language theory. Even though there’s no known polynomial-time algorithm for finding the smallest vertex cover in a graph, there is an easy way to find a vertex cover that’s at most twice as big as the smallest: as often as we can, we choose an edge arbitrarily, then remove all the other edges incident to its endpoints; when we’re done, we have a maximal matching. (A matching is a subset of the edges such that no two share an endpoint.
We say “maximal” instead of “maximum” because we can’t add any more edges, but that doesn’t necessarily mean there isn’t a larger matching.) If we then add both endpoints of each edge in our matching to our vertex cover, every edge is covered (because if an edge e were uncovered, then we could have added it to our matching) and our vertex cover is at most twice the size of the smallest one (because any vertex cover must include at least one endpoint of each edge in our matching, and none of those edges share endpoints). Figure 9.2 shows a matching consisting of 2 edges for the graph shown in Figure 9.1, and the vertex cover — again of size 4 — that we get by choosing both endpoints of each edge in our matching. Although in this case we obtain a vertex cover as small as possible (I think), if we’d considered the edges in a different order when building our matching then we could have ended up with 3 edges instead of 2 (so our current matching is maximal but not maximum), and then our vertex cover would have 6 vertices instead of only 4. When we’re faced with instances of “BOOM” problems in real life, we usually can’t just give up, so people have developed approaches to dealing with them. We can use approximation algorithms, which aren’t guaranteed to find optimal solutions but are guaranteed to find good ones (incidentally, there are better approximation algorithms for Vertex Cover; we just won’t see them in this course); we can use heuristics and hope for the best, because sometimes worst-case analysis is unduly pessimistic; we can try to solve them using brute force and big computers; or we can try to find characteristics of the instances we’re interested in that make them special and easier to solve. For example, although Vertex Cover goes “BOOM” in general, on trees it goes “clink” — in fact, there’s a greedy algorithm! (If you take Norbert’s 4th-year course on algorithms, you’ll see a lot more algorithms that work well on special cases.)
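The matching-based 2-approximation can be sketched as follows (the function name is mine; a single pass works because skipping an edge with an already-covered endpoint is exactly “removing the other edges incident to its endpoints”):

```python
def vertex_cover_2approx(edges):
    """Greedily build a maximal matching and take both endpoints of each
    matched edge; the result is a vertex cover at most twice the minimum."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:  # (u, v) joins the matching
            cover.add(u)
            cover.add(v)
    return cover
```

For example, on the path 1–2–3–4 it matches (1, 2) and (3, 4) and returns the cover {1, 2, 3, 4}, which has twice the size of the minimum cover {2, 3}.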
Figure 9.3: A tree with a vertex cover of size 4, but no vertex cover of size 4 including the central vertex. The obvious greedy strategy for choosing a small vertex cover of a tree — start by adding the vertex with highest degree — doesn’t always work optimally. For example, consider the cross-shaped tree shown in Figure 9.3: if we choose the central vertex because it has degree 4, then we still need to choose 4 more vertices to cover the edges incident to the leaves; if we choose the 4 vertices with degree 2, however, then we cover all the edges. Instead, suppose we choose a leaf of the tree, add to our (initially empty) vertex cover the other endpoint v of the single edge incident to that leaf, delete that edge and all the other edges incident to v, and recurse on each tree of the resulting forest (ignoring all the vertices which are now isolated). Figure 9.4 shows an example of how this algorithm can work on a tree, with the numbers showing the order in which we consider vertices as leaves (which need not be leaves when we start) and red disks showing which vertices we add to our vertex cover. How can we be sure the vertex cover we obtain will be as small as possible? Before we take any steps, our (empty) subsolution can be extended to an optimal solution. Assume that, after i ≥ 0 steps, our subsolution can be extended to an optimal solution S. Then we show that after i + 1 steps, our subsolution can be extended to an optimal solution S′. Therefore, by induction, we obtain an optimal solution.* Suppose that during our (i + 1)st step we consider an edge (u, v) and take v. If S includes v, then S′ = S. If S does not include v, then it must contain u in order to cover that edge. Since all the other edges incident to u must already have been covered — because we consider (u, v) only when one of its endpoints is a leaf and then we take the other endpoint — removing u from S leaves only (u, v) uncovered.
Therefore, S′ = {v} ∪ (S \ {u}) (that is, v added to S with u removed) is a vertex cover of the same size as S that extends our subsolution after i + 1 steps. *I hope you’re getting the impression that I want you to remember this. Figure 9.4: A minimum vertex cover of a tree. Another “BOOM” problem with easy and hard versions is Knapsack: suppose you’re going on a hiking trip and you’re deciding which of a set of items to take with you in a knapsack, with each item having a weight and a profit and the knapsack having a certain limit on the weight it can hold, called its capacity; you want to choose items such that you obtain the maximum possible profit without exceeding the knapsack’s capacity. The easy version of Knapsack is called Fractional Knapsack, and I think it’s really only used as an example problem for teaching greedy algorithms. The normal, hard version of Knapsack is usually just called Knapsack, but sometimes also Binary Knapsack or 0-1 Knapsack to distinguish it from Fractional Knapsack. Binary Knapsack will come up again when we get to dynamic programming, and again when we get to NP-completeness. I’ll show you an algorithm for Binary Knapsack that solves even the hard version optimally, and runs in time polynomial in the capacity — even though it’s a “BOOM” (that is, NP-complete) problem. It was to prepare you for that apparent contradiction that I told you we usually measure running times in terms of the size of the paper it takes to write out the instance, and had you think about why factoring is used in crypto-systems even though we can factor a number n in O(n) arithmetic operations. For now, let’s concentrate on Fractional Knapsack, which is called that because we can cut the items: if we cut an item and discard some fraction of it, then the profit of the remaining part is proportional to its original profit times the fraction that we keep.
Figure 9.5 shows some of the figures from the scribe notes on this topic, which for some reason make me smile. Taking the most profitable item (such as a giant watermelon) first could be a mistake since, as shown in the second part of Figure 9.5, it might fill the whole knapsack. Taking the smallest item first could also be a mistake. Since we can cut the items, the best approach is to order the items into non-increasing order by their profit-to-weight ratios — their densities — then take each item as long as it fits in the knapsack and, when we reach an item that won’t fit in the knapsack, to cut it and take as much of it as we can. “Empty knapsack with items can be taken” “A giant watermelon vs cheese+bread+chocolate bar” “Each item with value density” “cut cheese into half and get half value” Figure 9.5: Some memorable figures from the 2020 scribe notes. Why does this fill the knapsack as profitably as possible? Before we take any steps, our (empty) subsolution can be extended to an optimal solution. Assume that, after i ≥ 0 steps, our subsolution can be extended to an optimal solution S. Then we show that after i + 1 steps, our subsolution can be extended to an optimal solution S′. Therefore, by induction, we obtain an optimal solution. Suppose that for our (i + 1)st step we take g grams of some item x and S includes g′ grams of it. Since we always take the whole item when we can, and as much as possible when it won’t all fit, g ≥ g′. Suppose we take S and cut out g − g′ grams of whatever it includes instead of those last g − g′ grams of x, and fill the extra space up with x. The profit cannot go down, because g − g′ grams of x are at least as profitable as g − g′ grams of anything we consider after x, so we obtain a solution S′ that is as profitable as S and extends our subsolution after i + 1 steps. Don’t worry, this is the last time I’m going to write this. Of course, you should probably still write it for various assignments and exams...
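The density-ordered greedy for Fractional Knapsack can be sketched as follows (the names are mine; items are (profit, weight) pairs):

```python
def fractional_knapsack(items, capacity):
    """Take items in non-increasing order of profit-to-weight ratio,
    cutting the first item that doesn't fit whole."""
    total_profit = 0.0
    remaining = capacity
    for profit, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if remaining <= 0:
            break
        take = min(weight, remaining)             # cut the item if necessary
        total_profit += profit * (take / weight)  # profit proportional to fraction kept
        remaining -= take
    return total_profit
```

For example, with items of (profit, weight) = (60, 10), (100, 20), (120, 30) and capacity 50, it takes the first two items whole and two-thirds of the third, for a profit of 240.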
Assignment 4 posted: 11.06.2021 due: midnight 18.06.2021 You can work in groups of up to three people. One group member should submit a copy of the solutions on Brightspace, with all members’ names and banner numbers on it; the other group members should submit text files with all members’ names and banner numbers (otherwise Brightspace won’t let us assign them marks!). You may consult with other people but each group should understand the solutions: after discussions with people outside the groups, discard any notes and do something unrelated for an hour before writing up your solutions; it’s a problem if no one in a group can explain one of their answers. For programming questions you should submit your code, which should compile and run correctly to receive full marks. 1. You have a week to complete an assignment with several questions, each worth the same number of marks. You don’t want to spend more than h hours on the whole assignment and you can estimate accurately how many hours each question will take you. Give a greedy algorithm to decide which questions to answer. PROVE YOUR ALGORITHM CORRECT! 2. Your professor is training to run against his friend Simon, but he’s not sure he can make it around his whole planned route in one go, so he’s made a list of places where he can stop for a break, coffee, etc. (For example, if he starts at the shipyards and runs along the coast, he can stop at the Tim Horton’s by the ferry terminal, then in the Salt Yard, then at one of the restaurants along the waterfront, then at the Garrison Brewery or Tomavinos by the Seaport Market, then at the entrance of Point Pleasant, then at the top of Arm Road in the park, etc.) Suppose he gives you this list, with the distance between each consecutive pair of potential pitstops, and the distance d he can run without stopping.
Give a greedy algorithm that tells him where to stop such that 1) he never runs more than distance d without a break and 2) he makes the minimum number of stops. PROVE YOUR ALGORITHM CORRECT! 3. A cross parsing of a string S[1..m] with respect to a string T[1..n] is a partition of S into a minimum number of substrings each of which occurs in T. Suppose you are given an array L[1..m] such that, for 1 ≤ i ≤ m, the substring S[i..i + L[i] − 1] occurs in T but the substring S[i..i + L[i]] doesn’t. Give a greedy algorithm for computing a cross parsing of S with respect to T. PROVE YOUR ALGORITHM CORRECT! 4. According to https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1710000901, the populations of Canada’s provinces and territories in the first quarter of 2021 were as follows:

Newfoundland and Labrador 520,438
Prince Edward Island 159,819
Nova Scotia 979,449
New Brunswick 782,078
Quebec 8,575,944
Ontario 14,755,211
Manitoba 1,380,935
Saskatchewan 1,178,832
Alberta 4,436,258
British Columbia 5,153,039
Yukon 42,192
Northwest Territories 45,136
Nunavut 39,407

Suppose we choose a resident of Canada uniformly at random and let X be the province or territory where they live. (a) Compute the entropy (in bits) of the random variable X. (b) Compute ∑_i p_i ⌈lg(1/p_i)⌉, where p_i is the probability the resident of Canada lives in the ith province or territory listed above. (c) Build a Huffman code for the probability distribution p_1, . . . , p_13; what is its expected codeword length? 5. Suppose you have season passes for m train lines between n cities, with different expiry dates. A ticket lets you travel on the line between two cities as many times as you like, in either direction, from now until the ticket expires. How can you quickly determine the last date on which you will be able to reach any city from any other city using your passes? (a) Give a solution with the union-find data structure that takes O(mα(m, n)) time after you’ve sorted the passes by expiry date.
(b) Give a solution that colours and re-colours the cities, that takes O(m) time after you’ve sorted the passes by expiry date. You need not prove your solutions correct. Midterm 1 posted: 9 am, 07.06.2021 due: midnight, 11.06.2021 You are not allowed to work in groups for the midterm. You can look in books and online, but you must not discuss the exam problems with anyone. If you don’t understand a question, contact Travis or one of the TAs for an explanation (but no hints). All problems are weighted the same and, for problems broken down into subproblems, all subproblems are weighted the same. 1. For each cell (i, j) in the matrix below, write o, O, Θ, Ω or ω to indicate the relationship of the ith function on the left to the jth function along the top. If none of those relationships hold, leave the cell blank. Only the best answer possible will be considered correct (so writing O when the best answer is o doesn’t count, for example). The cell (1, 1) is filled in as an example: n^(1/4) ∈ o(n^2), so that cell contains “o”. You need not explain your answers.

                           n^2   ((−1)^n + 1)n   3^(lg n)   n^(lg lg n)   n lg n
n^(1/4)                     o
2^n
⌈lg n⌉!
n ∑_{j=1}^{n} (1/j)
T(n) = 5T(n/4) + n^(3/2)

2. Assuming you have an O(n)-time algorithm to find a separator of a planar graph on n vertices — that is, a subset of at most 2√n vertices after whose removal all remaining connected components each consist of at most 2n/3 vertices — give a divide-and-conquer algorithm to find a minimum vertex cover of a planar graph on n vertices, similar to our algorithm for colouring a planar graph. Your mark will partly depend on how fast your algorithm is (although there is not believed to exist a polynomial-time algorithm). You need not analyze your algorithm nor prove it correct. 3. Suppose we have a function that, given an unsorted sequence of n integers, in O(n) time returns the (n/q)th, (2n/q)th, . . . , ((q − 1)n/q)th smallest elements, called q-quantiles.
Considering the time to compare elements to quantiles, (a) how quickly can we sort with this function when q is constant? (b) how quickly can we sort with this function when q = √n? (c) if we can choose q freely, how should we choose it to sort as quickly as possible with this function? You need not explain your answers. (10% bonus question) How fast can we sort if the function takes O(n) time and separates the integers in the array into √n bins such that the ith bin contains the ((i − 1)√n + 1)st through (i√n)th smallest integers? (That is, the first bin contains the smallest √n elements, the second bin contains the next √n elements, etc.) You need not explain your answer. 4. Imagine you’re planning a post-lockdown canoe trip with friends, but people want to bring different amounts of equipment, everyone wants to be in the same canoe as their equipment, you can have only so much equipment in each canoe (all the canoes are the same, and consider only the weight of the equipment), any one person’s equipment fits in one canoe, and everyone wants to row (so you can have at most two people in each canoe). You have a list of how much equipment each person wants to take (in kilos), and you know how much fits in a canoe. For example, if there are 3 people going and they want to take 37 kg, 52 kg and 19 kg of equipment and a canoe can hold up to 60 kg of equipment (plus up to 2 people), then you need at least 2 canoes: you can put the first and third people and their 37 + 19 = 56 kg of equipment in one and the second person and their 52 kg in the other. Give a greedy algorithm to find the minimum number of canoes you need AND GIVE A PROOF OF CORRECTNESS! 5. Give a greedy algorithm for Binary Knapsack that runs in O(n log n) time, where n is the number of items to consider, and achieves at least half the maximum profit when all the items have the same profit-to-weight ratio. Explain why your algorithm achieves this. Solutions 1.
For each cell (i, j) in the matrix below, write o, O, Θ, Ω or ω to indicate the relationship of the ith function on the left to the jth function along the top. If none of those relationships hold, leave the cell blank. Only the best answer possible will be considered correct (so writing O when the best answer is o doesn’t count, for example). The cell (1, 1) is filled in as an example: n^(1/4) ∈ o(n^2), so that cell contains “o”. You need not explain your answers.

                           n^2   ((−1)^n + 1)n   3^(lg n)   n^(lg lg n)   n lg n
n^(1/4)                     o                        o           o           o
2^n                         ω          ω             ω           ω           ω
⌈lg n⌉!                     ω          ω             ω           o           ω
n ∑_{j=1}^{n} (1/j)         o          ω             o           o           Θ
T(n) = 5T(n/4) + n^(3/2)    o          ω             o           o           ω

2. Assuming you have an O(n)-time algorithm to find a separator of a planar graph on n vertices — that is, a subset of at most 2√n vertices after whose removal all remaining connected components each consist of at most 2n/3 vertices — give a divide-and-conquer algorithm to find a minimum vertex cover of a planar graph on n vertices, similar to our algorithm for colouring a planar graph. Your mark will partly depend on how fast your algorithm is (although there is not believed to exist a polynomial-time algorithm). You need not analyze your algorithm nor prove it correct. Solution: We first find a separator S of the graph and then, for each subset T of S, we recursively find the minimum vertex cover of the graph that contains every vertex in T and none of the other vertices in S. To do this, (a) we ensure there are no edges between two vertices in S\T (if there are, then no vertex cover omits both); (b) for each vertex v in T, we add v to the vertex cover and delete all the edges incident to v; (c) for every remaining edge incident to a vertex u in S\T, we add the other endpoint w of that edge to the vertex cover and delete all the edges incident to w; (d) we recurse on the remaining connected components. This takes time bounded by T(n) = 2^(2√n) · 2T(2n/3) ∈ O(2^(n^(1/2 + ε))) for any positive constant ε — but you don’t have to include an analysis. 3.
Suppose we have a function that, given an unsorted sequence of n integers, in O(n) time returns the (n/q)th, (2n/q)th, . . . , ((q − 1)n/q)th smallest elements, called q-quantiles. Considering the time to compare elements to quantiles, (a) how quickly can we sort with this function when q is constant? Solution: Θ(n log n) (in the comparison model, at least) (b) how quickly can we sort with this function when q = √n? Solution: the fastest I could figure out was still Θ(n log n) (c) if we can choose q freely, how should we choose it to sort as quickly as possible with this function? Solution: if we choose q = n then we can sort in linear time You need not explain your answers. (10% bonus question) How fast can we sort if the function takes O(n) time and separates the integers in the array into √n bins such that the ith bin contains the ((i − 1)√n + 1)st through (i√n)th smallest integers? (That is, the first bin contains the smallest √n elements, the second bin contains the next √n elements, etc.) You need not explain your answer. Solution: The recurrence is

T(n) = n^(1/2) T(n^(1/2)) + O(n)
     = n^(1/2) (n^(1/4) T(n^(1/4)) + O(n^(1/2))) + O(n)
     = n^(1/2 + 1/4) T(n^(1/4)) + O(n) + O(n)
     = n^(1/2 + 1/4) (n^(1/8) T(n^(1/8)) + O(n^(1/4))) + O(n) + O(n)
     = n^(1/2 + 1/4 + 1/8) T(n^(1/8)) + O(n) + O(n) + O(n)
     = · · ·
     = O(n lg lg n) ,

since the unrolling continues for lg lg n levels and each level contributes O(n). The trick here is to ask for what i we have n^(1/2^i) = 2 or, equivalently, 2^(2^i) = n; taking lg twice on both sides, we get i = lg lg n. You don’t need to write the recurrence or the explanation, though — writing O(n lg lg n) is enough. 4. Imagine you’re planning a post-lockdown canoe trip with friends, but people want to bring different amounts of equipment, everyone wants to be in the same canoe as their equipment, you can have only so much equipment in each canoe (all the canoes are the same, and consider only the weight of the equipment), any one person’s equipment fits in one canoe, and everyone wants to row (so you can have at most two people in each canoe).
You have a list of how much equipment each person wants to take (in kilos), and you know how much fits in a canoe. For example, if there are 3 people going and they want to take 37 kg, 52 kg and 19 kg of equipment and a canoe can hold up to 60 kg of equipment (plus up to 2 people), then you need at least 2 canoes: you can put the first and third people and their 37 + 19 = 56 kg of equipment in one and the second person and their 52 kg in the other. Give a greedy algorithm to find the minimum number of canoes you need AND GIVE A PROOF OF CORRECTNESS! Solution: Let c be the capacity of a canoe in kilos. Sort the people into non-increasing order by the amount of equipment they want to take and consider them in that order. Suppose the ith person has w_i kilos of equipment. Pair them with the person who has the closest to c − w_i kilos of equipment, without going over. Before we take any steps, our (empty) subsolution can be extended to an optimal solution. Assume that, after i ≥ 0 steps, our subsolution can be extended to an optimal solution S. Then we show that after i + 1 steps, our subsolution can be extended to an optimal solution S′. Therefore, by induction, we obtain an optimal solution.
Suppose we pair the (i + 1)st person with someone who has wj ≤ c − wi+1 kilos of equipment. If S also has those people paired, then S′ = S. If S doesn’t have the (i + 1)st person paired with anyone, then we can move the person with wj kilos into the same boat as the (i + 1)st person and obtain a solution S′ that uses no more canoes than S and extends our subsolution after i + 1 steps. So, suppose S puts someone else in the canoe with the (i + 1)st person, with wj′ kilos of equipment. Since wj′ ≤ wj, we can swap the people with those amounts of equipment, to obtain a solution S′ that uses no more canoes than S and extends our subsolution after i + 1 steps.

5. Give a greedy algorithm for Binary Knapsack that runs in O(n log n) time, where n is the number of items to consider, and achieves at least half the maximum profit when all the items have the same profit-to-weight ratio. Explain why your algorithm achieves this.

Solution: We sort the items into non-increasing order by profit (and, thus, also by weight). After we discard items that don’t fit in the knapsack at all (even when it’s empty), at least one of the following statements is always true: the knapsack is at least half full, there is space for the next item, or there are no more items. If the knapsack is less than half full, then either it is completely empty (in which case any single remaining item fits), or the last item we put in took less than half the capacity, in which case the next item we consider takes less than half the capacity (because we’re considering the items in non-increasing order by weight). If we must stop because we’ve more than half-filled the knapsack, then we already have at least half the possible profit. If we must stop because we’ve run out of items, then we already have all the possible profit.
When I wrote before “Don’t worry, this is the last time I’m going to write this”, I meant it was the last time that day.

Part III

Dynamic Programming

Chapter 10

Edit Distance

My high school biology teacher, Mr Klages, wasn’t very strict when it came to tests. If a student asked him to clarify a question, he would first explain what the question was asking and then, if the student was still confused, he’d also explain the steps in the reasoning leading to the answer, and eventually he’d often end up saying something like “So the answer’s 5, isn’t it?”. One of my classmates, Ryan, had an odd sense of humour (and a lot of patience!) and thought it was fun to keep asking Mr Klages questions until he’d revealed the answers to all the questions on the test:

Ryan: “Mr Klages, I don’t understand Question 1.”
Mr Klages: “Well, Ryan, it’s asking for the cell organelles; remember those from last week?”
Ryan: “Not really.”
...
Mr Klages: “So the last one is Golgi body, isn’t it?”
Ryan: “Right, thanks, Golgi body. . . ” [writes]
Mr Klages: “Ok, so now you’re all set?”
Ryan: “Yes, but I don’t understand Question 2 either.”

(I sympathize more with Mr Klages now that I’ve tried teaching; http://phdcomics.com/comics/archive.php?comicid=5 .)

Let’s suppose Mr Klages is testing Ryan’s ability to compute the minimum cost of a path from the top left corner of the matrix in Figure 10.1 to the bottom right, according to the following rules: at each step we can move right one cell, down one cell, or diagonally right and down one cell; crossing a black arrow costs $1; crossing a red arrow is free.

Figure 10.1: What is the minimum cost of a path from the top left to the bottom right if crossing a black arrow costs $1 and crossing a red arrow is free?
Naturally, Ryan is going to say he doesn’t know the minimum cost of a path from one corner to the other, and Mr Klages will say something like “Well, if I told you the minimum costs x, y and z of paths to the cells above and to the left, above, and to the left of the cell in the bottom right corner, respectively, then you could work out the minimum cost min(x + 1, y + 1, z + 1) of a path to that cell, couldn’t you?” Ryan being Ryan, he’ll say yes and ask what x, y and z are, to which Mr Klages will answer something like “Well, if I told you the minimum costs s, t, u, v and w of paths to the cells above and to the left of those cells, then you could work out

x = min(s + 1, t + 1, v + 1) ,
y = min(t + 1, u + 1, x + 1) ,
z = min(v, x + 1, w + 1) ,

couldn’t you? (Notice we can get from the cell with cost v to the cell with cost z for free, so we don’t add a 1 to v when computing z!)”

What Mr Klages tells Ryan is essentially the following recurrence,

A[0, 0] = 0 ,
A[i, 0] = A[i − 1, 0] + 1 for i > 0 ,
A[0, j] = A[0, j − 1] + 1 for j > 0 ,
A[i, j] = min(A[i − 1, j − 1] + B_{i,j}, A[i − 1, j] + 1, A[i, j − 1] + 1) for i, j > 0,

where B_{i,j} is an indicator variable equal to 1 if the arrow from A[i − 1, j − 1] to A[i, j] is black and 0 if it’s red.

Seeing a recurrence, computer scientists automatically think of recursion, and it’s true Ryan could call the following program on the coordinates of the bottom right corner:

int cost(int i, int j) {
    if (i == 0 && j == 0) {
        return(0);
    } else if (i == 0) {
        return(cost(i, j - 1) + 1);
    } else if (j == 0) {
        return(cost(i - 1, j) + 1);
    } else {
        return(min(cost(i - 1, j - 1) + B[i][j],
                   cost(i - 1, j) + 1,
                   cost(i, j - 1) + 1));
    }
}

How long is that going to take, though? Notice we’ll call cost for a cell once for every way to reach that cell from the bottom right corner using only moves left, up, and diagonally left and up.
If we ignore diagonal moves, we’ll get a lower bound on the number of calls to cost that’s something like the number of binary strings on at most 22 bits containing at most 12 copies of 0 and at most 10 copies of 1, which is huge. Not even Mr Klages and Ryan are that patient! We could memoize the answers to each call to cost — meaning we record them in an array so we don’t need to recompute them — but instead of “unwrapping” the problem from the bottom right corner and backtracking, it’s a bit cleaner to work from the top left corner. If we decide the top left corner has coordinates (0, 0) and the bottom right corner has coordinates (m, n) — there’s a reason for this that we’ll see later — then we can use the following iterative program:

int costDP() {
    A[0][0] = 0;
    for (int i = 1; i <= m; i++) {
        A[i][0] = A[i - 1][0] + 1;
    }
    for (int j = 1; j <= n; j++) {
        A[0][j] = A[0][j - 1] + 1;
    }
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            A[i][j] = min(A[i - 1][j - 1] + B[i][j],
                          A[i - 1][j] + 1,
                          A[i][j - 1] + 1);
        }
    }
    return(A[m][n]);
}

This is dynamic programming,* with A[i, j] storing the minimum cost of reaching cell (i, j) in our original matrix. Of course the first and second for loops just set A[i, 0] = i and A[0, j] = j, but it’s useful to remember why they’re doing that. The main work of the algorithm is the nested for loops, which take a total of Θ(mn) time.

This is a fairly typical dynamic program so, before moving on to more serious problems, let’s spend some time trying to optimize it. First, notice this program finds only the cost of the minimum path, but not the path itself. We could fix that by creating a matrix C[0..m, 0..n] and, whenever we fill in A[i, j], storing in C[i, j] the direction from which we would arrive at A[i, j] to achieve the minimum cost: if A[i, j] = A[i − 1, j − 1] + B_{i,j} then we store (i − 1, j − 1) in C[i, j]; otherwise, if A[i, j] = A[i − 1, j] + 1 then we store (i − 1, j) in C[i, j]; otherwise, we store (i, j − 1) in C[i, j].
This lets us walk back along the path from C[m, n]. Because we’re filling in an (m + 1) × (n + 1) matrix (or two of them, if we want the path), we also use Θ(mn) space. Notice, however, that if we don’t need the path, just the cost, then after we fill in A[i + 1, 0..n], we never use A[i, 0..n] again and can discard it to save space. If m is much less than n, then we can fill A in column-major order instead of row-major order. This means we can use Θ(mn) time but only Θ(min(m, n)) space. If we want to recover the path itself while still using o(mn) space, there are various tradeoffs we can use, but those are beyond the scope of this course.

An important optimization is called banded dynamic programming. Suppose you want to get from the top left corner to the bottom right corner but you have only $7. Notice that reaching a cell (i, j) with i > j + 7 requires more than 7 moves straight down and thus costs more than $7, while if j > i + 7 then reaching the cell requires more than 7 moves straight right and thus costs more than $7. Therefore, we need only concern ourselves with the cells of the matrix in the band A[i, j] with |i − j| ≤ 7, which contains at most (2 · 7 + 1)n cells, as shown in Figure 10.2. In general, if we have a budget of k then we can find A[m, n] in time O(k min(m, n)) using this technique. This is important in practice because we often are only interested in the exact cost of an optimal solution when it is fairly small; as the cost grows, either we lose interest in the solution or we are more willing to accept some error.

Now that everyone’s completely comfortable with finding a minimum-cost path through a matrix, we can discuss how to compute the edit distance between two strings S[1..m] and T[1..n],

*Incidentally, don’t worry if you don’t see why it’s called that: https://en.wikipedia.org/wiki/Dynamic_programming#History .
Figure 10.2: If we are interested only in solutions of cost 7 or less, we need fill in only the band A[i, j] with |i − j| ≤ 7.

which is the number of single-character insertions, deletions and substitutions needed to change S into T. For example, if S = AGATACATCA and T = GATTAGATACAT, then we can change S into T by prepending GATT and deleting the last two characters AC, so the edit distance is at most 6. How can we quickly tell if this is minimum?

Computing the edit distances between strings is a key task in bioinformatics in particular, since it lets us judge how similar pieces of DNA are and we can modify the computation to align pieces of DNA. For example, the alignment corresponding to the series of edits given above is

----AGATACATCA
GATTAGATACAT--

and another is

AGAT-ACAT-CA-
-GATTAGATACAT

with the second corresponding to only 5 edits to change S into T: deleting an A, inserting a T, changing a C to a G, inserting an A, and inserting a T.

It’s worth noting that edit distance, also sometimes called Levenshtein distance, is a true distance measure: the distance between S and T is 0 if and only if they are equal, the distance from S to T is the distance from T to S, and the distance from S to T plus the distance from T to another string U is at least the distance from S to U (that is, the Triangle Inequality holds).

Let M[i][j] be the minimum number of edits to turn a prefix S[1..i] of S into a prefix T[1..j] of T. Turning the empty prefix of S into the empty prefix of T requires no edits, so M[0][0] = 0. (It is convenient to index the characters of S and T starting from 1, even though it differs from how strings are usually indexed in code, so we have a row for the empty prefix of S and a column for the empty prefix of T.)
Turning a prefix S[1..i] of S into the empty prefix of T requires i edits, and turning the empty prefix of S into a prefix T[1..j] of T requires j edits, so M[i][0] = i and M[0][j] = j. Suppose we have computed M[i − 1][j − 1], M[i − 1][j] and M[i][j − 1]. To turn S[1..i] into T[1..j], we can

turn S[1..i − 1] into T[1..j − 1] with M[i − 1][j − 1] edits and then turn S[i] into T[j] with 0 edits if they’re already equal and 1 edit if not (so S[i] is above T[j] in the corresponding alignment);

turn S[1..i − 1] into T[1..j] with M[i − 1][j] edits and then delete S[i] with 1 edit (so S[i] is above a ‘-’ in the alignment);

turn S[1..i] into T[1..j − 1] with M[i][j − 1] edits and then insert T[j] with 1 edit (so T[j] is below a ‘-’ in the alignment).

In other words,

M[0, 0] = 0 ,
M[i, 0] = M[i − 1, 0] + 1 for i > 0 ,
M[0, j] = M[0, j − 1] + 1 for j > 0 ,
M[i, j] = min(M[i − 1, j − 1] + B_{i,j}, M[i − 1, j] + 1, M[i, j − 1] + 1) for i, j > 0,

where B_{i,j} is an indicator variable equal to 0 if S[i] = T[j] and 1 if they are not equal. But this is the same recurrence we had for filling in A and finding the minimum-cost path through the matrix in Figure 10.1! In fact, if we consider Figure 10.3, we can see that the red arrows point into the cells (i, j) for which S[i] = T[j]. (The leftmost column and the top row don’t have characters associated with them because they correspond to the empty prefixes of S and T.) So, much as we essentially rediscovered Huffman’s algorithm while considering how to merge sorted lists, we’ve essentially derived the dynamic-programming algorithm for edit distance while thinking about paths through matrices. Once again, as my former supervisor would say, “I was tricking you!”

The technique of banded dynamic programming is so useful because we rarely try to compute the edit distance of long, dissimilar strings; most of the time, we’re aligning human DNA with other human DNA, or at least two pieces of DNA from the same phylum.
Figure 10.3: Computing the edit distance between S = AGATACATCA and T = GATTAGATACAT is equivalent to finding the cost of the cheapest path through the matrix from Figure 10.1. The red arrows point into cells (i, j) such that S[i] = T[j].

We’re almost done, but I’d like to mention two things. First, it’s often useful to find the best alignment of a small string (such as a DNA read) against a substring of a long string (such as a chromosome). It doesn’t make sense to use the algorithm we’ve developed so far, because the minimum number of edits to change a small string into a long string is at least the difference in lengths. If S = ACATA and T = GATTAGATACAT, for example, then we want the alignment

----ACATA---
GATTAGATACAT

to correspond to only 1 edit — changing C to G — and not 8. To make the algorithm compute this modified cost, it is enough to change the costs to 0 for moving to the right along the top and bottom rows (in other words, changing the black arrows in the top and bottom rows to red). Whereas an alignment computed with the previous algorithm is called a global alignment, an alignment computed with this algorithm is called a local alignment.

The last thing I want to mention is a closely related problem, called Longest Common Subsequence (LCS), for which we are asked to find the longest subsequence common to two strings. (Recall that the characters in a subsequence need not be consecutive; if they are then it is a substring.)

Figure 10.4: The matrix for computing the local alignment of ACATA and GATTAGATACAT.

If we consider all the characters that are not edited when turning S into T, then they are a common subsequence. In other words, all the characters for which we cross red arrows while computing a global alignment are a common subsequence between the two strings.
Therefore, if we change the recurrence so that crossing black arrows is free and we are paid $1 for crossing red arrows — and we maximize what we are paid instead of minimizing what we spend — we can compute an LCS in essentially the same way that we compute edit distance. Code for computing an LCS is shown below.

#include <stdio.h>

#define MAX(a, b) (a > b ? a : b)
#define MAX3(a, b, c) (MAX(a, b) > c ? MAX(a, b) : c)

int LCS(char *S, char *T, int m, int n) {
    int A[m + 1][n + 1];
    A[0][0] = 0;
    for (int i = 1; i <= m; i++) {
        A[i][0] = 0;
    }
    for (int j = 1; j <= n; j++) {
        A[0][j] = 0;
    }
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            A[i][j] = MAX3(A[i - 1][j],
                           A[i][j - 1],
                           A[i - 1][j - 1] + (S[i - 1] == T[j - 1] ? 1 : 0));
        }
    }
    return(A[m][n]);
}

int main() {
    char *S = "AGATACATCA";
    char *T = "GATTAGATACAT";
    printf("The length of the LCS between %s and %s is %i.\n", S, T, LCS(S, T, 10, 12));
    return(0);
}

Chapter 11

Subset Sum and Knapsack

I’ve mentioned before that Knapsack is a “BOOM” problem but we can solve it in time polynomial in the capacity of the knapsack. More specifically, we can solve it in time polynomial in the capacity of the knapsack when measured in a unit such that the capacity and all the items’ weights are integers. This means we can’t speed things up by a factor of 1000 just by measuring the capacity of the knapsack in kilos instead of grams.

Before we consider Knapsack, however, let’s consider a simpler “BOOM” problem that can be solved by dynamic programming, called Subset Sum. For this problem, we are given a set X = {x1, . . . , xm} of integers and an integer target n, and asked if a subset of X sums to n. (I’m using m to denote the number of integers and n to denote the target because it will lead to us filling in another (m + 1) × (n + 1) matrix.) For example, suppose m = 6 and x1, . . . , x6 = 2, 5, 8, 9, 11, 12 and n = 15. It’s pretty obvious the only solution is 2 + 5 + 8 = 15, but it’s not so obvious how to find that solution in a way that will still be reasonably efficient even for large instances.
The key step to designing a dynamic program is usually deciding what you’ll store in each cell of your matrix. (I used to work with someone who would often propose solving open problems by “DP!”, meaning dynamic programming, without giving any more details; that’s only slightly more informative than saying “Math!”.) For Subset Sum, the classic solution fills a Boolean matrix A[0..m, 0..n] such that cell A[i, j] stores true if a subset of {x1, . . . , xi} sums to exactly j, and false otherwise.* It’s easy to fill in A[0..m, 0] and A[0, 1..n], because the empty set sums to 0, so A[i, 0] = true for i ≥ 0 and A[0, j] = false for j > 0.

To see how to fill in A[i, j] when we’ve filled in A[i − 1, 0..n], consider that when trying to select a subset of {x1, . . . , xi} that sums to exactly j, we have only two choices of what to do with xi:

we can include it in the subset, in which case we are left with the subproblem of selecting a subset of {x1, . . . , xi−1} that sums to exactly j − xi — and there is one if and only if A[i − 1, j − xi] = true;

we can exclude it from the subset, in which case we are left with the subproblem of selecting a subset of {x1, . . . , xi−1} that sums to exactly j — and there is one if and only if A[i − 1, j] = true.

Therefore, we fill in the matrix with the recurrence

A[i, j] = true if j = 0
A[i, j] = false if i = 0 and j ≠ 0
A[i, j] = A[i − 1, j − xi] ∨ A[i − 1, j] otherwise,

treating A[i − 1][j − xi] as false whenever j − xi < 0.

*For simplicity, I’m assuming the xis are positive; otherwise, we should fill in a matrix A[0..i, s−..s+], where s− is the sum of the negative xis and s+ is the sum of the positive xis. Notice it’s not as easy as offsetting each number by the same amount such that the smallest number is 0, since we don’t know the size of the subset that sums to n, if there is one, and so we don’t know how many times to add the offset to the target n.
This takes Θ(mn) time (which is polynomial in m and n but not necessarily in the size of the input (X, n), since we need only O(lg n) bits to write n). The correctness of the algorithm follows from the correctness of the base cases, the recurrence, and induction. Code to solve our example instance is shown below, with the matrix printed to stderr, and Figure 11.1 shows its output.

#include <stdio.h>

#define bool int
#define TRUE 1
#define FALSE 0

bool subsetSum(int *X, int m, int n) {
    bool A[m + 1][n + 1];
    for (int i = 0; i <= m; i++) {
        A[i][0] = TRUE;
    }
    for (int j = 1; j <= n; j++) {
        A[0][j] = FALSE;
    }
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            A[i][j] = A[i - 1][j] || ((j - X[i] >= 0) && A[i - 1][j - X[i]]);
        }
    }
    fprintf(stderr, "\t");
    for (int j = 0; j <= n; j++) {
        fprintf(stderr, "\t%i", j);
    }
    fprintf(stderr, "\n");
    for (int i = 0; i <= m; i++) {
        fprintf(stderr, "\t%i", i);
        for (int j = 0; j <= n; j++) {
            if (A[i][j]) {
                fprintf(stderr, "\tT");
            } else {
                fprintf(stderr, "\tF");
            }
        }
        fprintf(stderr, "\n");
    }
    return(A[m][n]);
}

int main() {
    int X[] = {0, 2, 5, 8, 9, 11, 12}; // the leading 0 means we can index X from 1,
                                       // which is less confusing
    int m = 6;
    int n = 15;
    if (subsetSum(X, m, n)) {
        printf("True.\n");
    } else {
        printf("False.\n");
    }
    return(0);
}

Figure 11.1: The matrix for our example of Subset Sum.

Once you understand the dynamic-programming algorithm for Subset Sum, understanding the algorithm for Knapsack is not too difficult. Recall that for Knapsack we are given a set X = {x1, . . . , xm} of (weight, profit) pairs, with xi = (wi, pi), and a capacity n, and asked to find the maximum profit of a subset of X with weight at most n. Once again, I’m naming the parameters so that we fill in an (m + 1) × (n + 1) matrix; for that, we must also assume the weights and capacity have been scaled so they are all integers.
For example, suppose x1 = (2, 1), x2 = (3, 4), x3 = (3, 3), x4 = (4, 5) and n = 8, so the maximum profit achievable is 9 (with x2 and x4). Now, we want A[i, j] to store the maximum profit we can achieve with a subset of x1, . . . , xi with weight exactly j. Assuming the weights and profits are all positive, it’s again easy to fill in A[0..m, 0], since we can get no profit with no weight; for technical reasons that should be clearer in a moment, we set A[0, 1..n] to −∞ to indicate we cannot choose an empty subset with positive weight.

To see how to fill in A[i, j] when we’ve filled in A[i − 1, 0..n], consider that when trying to select a subset of {x1, . . . , xi} with maximum profit whose weights sum to exactly j, we have only two choices of what to do with xi:

we can include it in the subset and receive profit pi for it, in which case the maximum profit we can achieve from {x1, . . . , xi} is pi plus the maximum profit we can achieve from a subset of {x1, . . . , xi−1} with weight exactly j − wi;

we can exclude it from the subset and receive no profit for it, in which case the maximum profit we can achieve from {x1, . . . , xi} is the maximum profit we can achieve from a subset of {x1, . . . , xi−1} with weight exactly j.

Therefore, we fill in the matrix with the recurrence

A[i, j] = 0 if j = 0
A[i, j] = −∞ if i = 0 and j ≠ 0
A[i, j] = max(pi + A[i − 1, j − wi], A[i − 1, j]) otherwise,

treating A[i − 1][j − wi] as −∞ whenever j − wi < 0. This takes Θ(mn) time (which is again polynomial in m and n but not necessarily in the size of the input (X, n)). Again, the correctness of the algorithm follows from the correctness of the base cases, the recurrence, and induction. We put −∞ in a cell A[i, j] to indicate we cannot select a subset of x1, . . . , xi with weight exactly j, and consider A[i, j] = −∞ for j < 0, because −∞ will not change no matter what we add to it.
We could also sum all the profits, multiply the sum by −1 and subtract 1 (as in the code below); or take the largest profit, multiply it by −m and subtract 1; etc. Although it’s simpler to fill in the matrix when we define A[i, j] to be the maximum profit achievable from a subset of {x1, . . . , xi} with weight exactly j, we eventually want the maximum profit achievable with a subset of any weight up to the capacity, which is the maximum entry in the row A[m, 0..n]. Code to solve our example instance is shown below, with the matrix printed to stderr, and Figure 11.2 shows its output. This version is an optimization problem; for the decision version, we need a target parameter t that is the minimum acceptable profit, so we should answer true if there is a subset with weight at most n and profit at least t, and false otherwise.

#include <stdio.h>

#define MAX(a, b) (a > b ? a : b)

int Knapsack(int *W, int *P, int m, int n) {
    int A[m + 1][n + 1];
    int sum = 0, max = 0;
    for (int i = 0; i <= m; i++) {
        A[i][0] = 0;
        sum += P[i];
    }
    for (int j = 1; j <= n; j++) {
        A[0][j] = - 1 * sum - 1;
    }
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            A[i][j] = MAX(A[i - 1][j],
                          P[i] + (j - W[i] >= 0 ? A[i - 1][j - W[i]] : - 1 * sum - 1));
        }
    }
    for (int j = 0; j <= n; j++) {
        max = MAX(A[m][j], max);
    }
    fprintf(stderr, "\t");
    for (int j = 0; j <= n; j++) {
        fprintf(stderr, "\t%i", j);
    }
    fprintf(stderr, "\n");
    for (int i = 0; i <= m; i++) {
        fprintf(stderr, "\t%i", i);
        for (int j = 0; j <= n; j++) {
            fprintf(stderr, "\t%i", A[i][j]);
        }
        fprintf(stderr, "\n");
    }
    return(max);
}

Figure 11.2: The matrix for our example of Knapsack.
int main() {
    int W[] = {0, 2, 3, 3, 4};
    int P[] = {0, 1, 4, 3, 5}; // the leading 0s mean we can index P and W from 1,
                               // which is less confusing
    int m = 4;
    int n = 8;
    printf("%i.\n", Knapsack(W, P, m, n));
    return(0);
}

There’s a lot more to know about using dynamic programming to solve “BOOM” problems, but it’s hard to discuss it reasonably when we’re still using terms like “clink” and “BOOM” and it’s probably beyond the scope of this course anyway. If you study algorithms in the future, you’ll probably learn how to use dynamic programming to compute in truly polynomial time approximate answers to “BOOM” problems for which we can compute exact answers in time polynomial in the parameters (rather than the sizes of their binary representations).

It would be unfair to finish, however, without giving you an example of a “BOOM” problem we think is hard even if all its parameters are given in unary rather than binary (so “polynomial-time” means “polynomial in the sum of all the numbers”). You probably remember the question about canoes from the midterm, for which you had to partition a list of numbers into a minimum number of subsets such that every subset had cardinality at most 2 and no subset summed to more than a given amount. The restriction on the cardinality was justified because “everyone wants to row”. What if people don’t care about rowing and thus we can put 3 people in a canoe? More formally, given a set of n numbers, can we partition it into subsets of size exactly 3 such that every subset sums to the same amount? This problem is called 3-Partition and, although we can obviously solve it by trying all possible n^{n/3} partitions, it is believed there is no algorithm that runs in polynomial time even in the sum of the numbers (and if there is, then we can find Hamiltonian paths in polynomial time, etc).
Chapter 12

Optimal Binary Search Trees and Longest Increasing Subsequences

A binary search tree (BST) on n keys must have height Ω(log n), so the worst-case time for a search is also Ω(log n). If we know the frequency with which each key is sought, however, then we can arrange the tree (maintaining the order of the keys) such that the more frequent keys are higher, and thus reduce the average time for a search. For example, if the keys are apple, banana, grape, kiwi, mango, pear and quince and their frequencies are 5, 3, 4, 1, 1, 2, 1, then with the tree shown on the left in Figure 12.1 we use a total of

5 · 3 + 3 · 2 + 4 · 3 + 1 · 1 + 1 · 3 + 2 · 2 + 1 · 3 = 44

comparisons for the 17 searches, while with the tree shown on the right we use only

5 · 2 + 3 · 3 + 4 · 1 + 1 · 4 + 1 · 3 + 2 · 2 + 1 · 3 = 37

comparisons (assuming that, when searching for a key x, we compare it to the key stored at each vertex on the path from the root to the vertex storing x). A BST is optimal with respect to a certain distribution over its keys if it minimizes the average number of comparisons during a search, which is called the cost of the tree. In our example, the cost of the tree on the left in Figure 12.1 is 44/17 while the cost of the one on the right is 37/17.

We can use dynamic programming to efficiently build an optimal BST for a given distribution, because any subtree of an optimal BST is itself optimal. To see why, consider that the total number of comparisons we perform while searching in a BST T is the total number of comparisons we perform in its left subtree TL, plus the total number of comparisons we perform in its right subtree TR, plus the number of comparisons we perform at the root (which is the number of searches or, equivalently, the total of all the frequencies). If TL or TR are sub-optimal then we can rearrange them to reduce the number of comparisons we perform in them during searches for their keys, and thus reduce the cost of T overall.
Imagine Mr Klages asking Ryan to build an optimal BST:

Figure 12.1: Two BSTs with keys apple, banana, grape, kiwi, mango, pear and quince. The tree on the left has better worst-case performance (3 comparisons versus 4 for a single search) but, if the frequencies with which the keys are sought are 5, 3, 4, 1, 1, 2, 1, then the tree on the right has better average-case performance (37/17 comparisons per search versus 44/17).

Ryan: “Mr Klages, I don’t know what the optimal BST is.”
Mr Klages: “Well, Ryan, one of the keys has to be at the root. If it’s apple, then the optimal BST has an empty left subtree, apple at the root, and a right subtree that’s an optimal BST for banana to quince.”
Ryan: “But I don’t know the optimal BST for banana to quince.”
Mr Klages: “Well, again, one of the keys is at the root of that subtree. If it’s banana, then the optimal subtree has an empty left subtree, banana at the root, and a right subtree that’s an optimal BST for grape to quince.”
Ryan: “But I don’t know the optimal BST for grape to quince.”
...
Mr Klages: “So now you know what the optimal BST is, assuming the root stores apple.”
Ryan: “But what if it doesn’t? What if it stores banana?”
Mr Klages: “Well, then the left subtree contains only apple, the root stores banana, and the right subtree is an optimal BST for grape to quince.”
Ryan: “But I don’t know the optimal BST for grape to quince.”
Travis: “Yes you do, he just told you! Pay attention, Ryan! How am I in the same class as this bozo?!”*

It’s easy to build an optimal binary search tree for a single key — it’s simply one vertex, storing that key — and if we know the optimal BST for any set of up to i consecutive keys out of n, then we can compute an optimal BST for any set of up to i + 1 consecutive keys by considering all the possible roots and choosing the one that minimizes the total cost.
*It’s important to remember the differences between recursion and dynamic programming, to avoid using exponential time and annoying your classmates.

Suppose we are given keys x0, . . . , xn−1 with frequencies f0, . . . , fn−1. Let A[i, j] be the number of comparisons we perform in an optimal BST for xi, . . . , xj. Then our recurrence is

A[i, j] = 0 if i > j,
A[i, j] = fi if i = j,
A[i, j] = min_{i ≤ r ≤ j} {A[i, r − 1] + A[r + 1, j] + fi + · · · + fj} otherwise.

We’re still filling in a 2-dimensional matrix (or, more specifically, the top-right half, where i ≤ j), but now we cannot fill it in row-major or column-major order. Instead, we have to work along the diagonals, starting with the diagonal j = i and then continuing with the diagonals j = i + 1, j = i + 2, etc — so the final answer is stored in A[0, n − 1]. Since j − i + 1 is the size of the tree for A[i, j], in the code below we use a counter size. We can recover the tree itself by storing for each cell which choice of root minimized the cost of the corresponding subtree.

#include <stdio.h>

#define MIN(a, b) (a < b ? a : b)

int optBST(int *F, int n) {
    int A[n][n];
    for (int i = 0; i < n; i++) {
        A[i][i] = F[i];
    }
    for (int size = 2; size <= n; size++) {
        for (int i = 0; i < n - size + 1; i++) {
            int j = i + size - 1;
            A[i][j] = A[i + 1][j];
            for (int r = i + 1; r <= j - 1; r++) {
                A[i][j] = MIN(A[i][j], A[i][r - 1] + A[r + 1][j]);
            }
            A[i][j] = MIN(A[i][j], A[i][j - 1]);
            for (int k = i; k <= j; k++) {
                A[i][j] += F[k];
            }
        }
    }
    return(A[0][n - 1]);
}

int main() {
    int F[] = {5, 3, 4, 1, 1, 2, 1};
    printf("%i\n", optBST(F, 7));
    return(0);
}

Figure 12.2: The top-right half of the matrix we fill in for our example.

If we add the following code before the return statement in the function optBST, we get the output shown in Figure 12.2:

fprintf(stderr, "\t");
for (int j = 0; j < n; j++) {
    fprintf(stderr, "\t%i", j);
}
fprintf(stderr, "\n");
for (int i = 0; i < n; i++) {
    fprintf(stderr, "\t%i", i);
    for (int j = 0; j < n; j++) {
        fprintf(stderr, "\t");
        if (j >= i) {
            fprintf(stderr, "%i", A[i][j]);
        }
    }
    fprintf(stderr, "\n");
}

As we can see from the output, the BST on the right in Figure 12.1 is indeed optimal. This algorithm takes Θ(n³) time because, to fill in each cell A[i, j] with j > i, it checks j − i + 1 cells that have previously been filled in. (It also sums fi through fj, but this can be avoided.)

Knuth showed that the best choice for the root of the tree with keys xi, . . . , xj is not to the left of the best choice for the root of the tree with keys xi, . . . , xj−1 nor to the right of the best choice for the root of the tree with keys xi+1, . . . , xj. In other words, adding keys to the right cannot make the best choice of root move left, and adding them to the left cannot make it move right. Let r_{i,j} be the best choice of root for the tree with keys xi, . . . , xj. Then when taking advantage of Knuth’s lemma, we use time proportional to

∑_{i=0}^{n−1} ∑_{j=i+1}^{n−1} (r_{i+1,j} − r_{i,j−1} + 1) .

In particular, when working out the best choices of roots for subtrees of size s, we use time proportional to

(r_{1,s−1} − r_{0,s−2}) + (r_{2,s} − r_{1,s−1}) + · · · + (r_{n−s+1,n−1} − r_{n−s,n−2}) = r_{n−s+1,n−1} − r_{0,s−2} ≤ n ,

so overall we use O(n²) time. There’s an algorithm that runs in O(n log n) time, but it works only when all our searches are guaranteed to be successful. The dynamic programming algorithm we’ve been considering works even when the gaps between the keys are assigned frequencies as well. For example, if the gap between banana and grape is assigned frequency 4, then we might search twice for cherry and once each for date and fig.
Another classic problem solvable with dynamic programming is finding the longest increasing subsequence (LIS) in a sequence of numbers. Let’s assume we’re interested in subsequences that are strictly increasing, not just non-decreasing (although the same techniques work in both cases). For example, in the sequence S = 5, 3, 2, 4, 1, 7, 9, 6, 8, 2, 7, 1, 5, the subsequence 2, 4, 6, 8 is increasing but the subsequence 2, 2, 7 is not.

There is a nice dynamic-programming algorithm for LIS, which illustrates that the arrays for dynamic programs can also be 1-dimensional (which is easy to forget). If we define S[−1] = −∞, A[−1] = 0 and A[i] to be the length of the longest increasing subsequence ending at S[i], for 0 ≤ i ≤ n − 1, then

    A[i] = max{ A[j] : −1 ≤ j < i, S[j] < S[i] } + 1 .

Therefore, we can fill in A with the following code:

    for (int i = 0; i < n; i++) {
        A[i] = 1;
        for (int j = 0; j < i; j++) {
            if (S[j] < S[i]) {
                A[i] = MAX(A[i], A[j] + 1);
            }
        }
    }

We set A[i] to 1 because we can always consider a single element as a subsequence. Since A[n − 1] is only the length of the longest increasing subsequence ending exactly at S[n − 1], we must make a final pass to report the maximum value in A[0..n − 1]. Recall we’ve already seen how to partition a sequence into the minimum number of increasing subsequences, using Supowit’s greedy algorithm; I like to keep people on their toes by showing two similar-sounding problems with completely different solutions.

Figure 12.3: The query range-max(−∞, 6) we use to compute A[7] in our example.

This algorithm takes Θ(n²) time but it can be improved to O(n log n) time fairly easily, using a dynamic range-max data structure.
This data structure stores a set of 2-dimensional points and supports the following operations:

    insert(x, y)        adds the point (x, y) to the set;
    delete(x, y)        deletes the point (x, y) from the set;
    range-max(x1, x2)   returns the highest point in the half-open range [x1, x2).

It’s possible to adjust the definition of the range-max query so the query range is [x1, x2], (x1, x2] or (x1, x2); for finding LISs, the version above is the most convenient. The definition of the delete operation is given only for the sake of completeness, since we won’t use it here.

Suppose we have a dynamic range-max data structure that supports all operations in O(log n) time when storing O(n) points. If we start with the set of points empty and insert the point (S[j], A[j]) when we compute A[j], then we can compute A[i] in O(log n) time by using the query range-max(−∞, S[i]) to find the point (S[j], A[j]) with S[j] < S[i] and A[j] as large as possible. Figure 12.3 shows the points when processing S[7] = 6 in our example: the highest point in the query range [−∞, 6) is (4, 2), so we set A[7] = 3 and insert the point (6, 3); the grey area is meant to show the query range, although it should be a half-plane instead of a box, with the dashed line indicating that it does not include the points with x = 6.

We can implement a static range-max data structure with O(log n)-time queries with an augmented BST, as shown in Figure 12.4. We store each x-coordinate as a key with the corresponding y-coordinate as satellite data; at each vertex in the tree, we store the point with the maximum y-coordinate stored in that vertex’s subtree. Given a range [x1, x2), we can search for x1 and x2 and find a collection of O(log n) single vertices and subtrees which together store all the points with x-coordinates in the range [x1, x2), and no other points. For the single vertices, we check their y-coordinates and, for the subtrees, we check the point with the maximum y-coordinate stored in that subtree.
This way, in O(log n) time, we can find the point whose x-coordinate is in the range [x1, x2) and whose y-coordinate is as large as possible.*

Figure 12.4: An augmented BST storing the points (1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (7, 3), (9, 4) and supporting range-max queries. The shaded vertices are the ones with x-coordinates in the range [−∞, 6) and the numbers in red are those we check for answering that query.

Figure 12.4 shows an augmented BST storing the points shown in Figure 12.3, with the top pair in each vertex being the x-coordinate stored as the key together with the y-coordinate as satellite data, and the bottom pair being the highest point stored in the vertex’s subtree. The shaded vertices are the ones with x-coordinates in [−∞, 6), and we check the numbers shown in red for that query.

If you’ve seen AVL trees, then it shouldn’t be too hard for you to figure out how to maintain the additional information in the tree when performing rotations — it’s at the level of an interesting homework problem. Similarly, if you’ve seen red-black trees or (2, 4)-trees or some other dynamic balanced binary search tree, then you should be able to figure out how to implement their updates such that they maintain the additional information. As a consequence, we obtain a dynamic range-max data structure, with insertions and deletions also taking logarithmic time. If you haven’t seen any dynamic balanced binary search trees, then I’m afraid there was a gaping hole in your second-year course on data structures (especially considering what university costs these days!).

*Actually, we don’t need the full power of range-max queries for this particular problem — see https://leetcode.com/problems/longest-increasing-subsequence/solution, which is much clearer than it was last year — but I think you do for one of the questions on the assignment.
Unfortunately, we don’t really have time to fix that hole in this course, which is full enough already. Dynamic balanced binary search trees aren’t that tricky, though, so an hour of YouTube tutorials should be more than enough.

Assignment 5

posted: 18.06.2021
due: midnight 02.07.2021

You can work in groups of up to three people. One group member should submit a copy of the solutions on Brightspace, with all members’ names and banner numbers on it; the other group members should submit text files with all members’ names and banner numbers (otherwise Brightspace won’t let us assign them marks!). You may consult with other people but each group should understand the solutions: after discussions with people outside the group, discard any notes and do something unrelated for an hour before writing up your solutions; it’s a problem if no one in a group can explain one of their answers. For programming questions you should submit your code, which should compile and run correctly to receive full marks.

1. Write a program that takes two strings S and T and outputs an optimal alignment in O(|S||T|) time, displayed as in the lecture notes. For example, if S = AGATACATCA and T = GATTAGATACAT, then an optimal alignment is

       AGAT-ACAT-CA-
       -GATTAGATACAT

   (I think.) You need not analyze your algorithm nor prove it correct.

2. Write a linear-time dynamic-programming algorithm for https://leetcode.com/problems/maximum-subarray and EXPLAIN IT. (It’s not hard to find code online, but most of the explanations are terrible. Remember, if you can’t explain your answer, there’s a problem!)

3. Suppose your professor has assigned a “profit” to each of several indivisible food items, expressing how much he likes each item. He’s now filling his knapsack and trying to select items to maximize the total profit. The food items are light but bulky, so the key constraint now is the volume the knapsack can hold, rather than the weight.
Even though the food items cannot be cut, they can be squashed, which reduces their volume by a factor of 2 — but also reduces their profit by a factor of 2 (since squashed food is not as appetizing). Write a dynamic-programming algorithm that runs in time polynomial in the number of items and the capacity of the knapsack (in litres) and tells your professor which food items to select and, of the selected ones, which to squash. You can assume the knapsack’s capacity and the original volume in litres of each item are integers. Explain why your algorithm is correct.

4. Write pseudo-code — you don’t have to code this — for an O(n log n)-time algorithm that takes a sequence of n integers and finds the longest slowly increasing subsequence (LSIS), where an LSIS is a subsequence in which each number after the first is larger than its predecessor, but not by more than 10. Explain why your algorithm is correct.

5. Modify the code in the lecture notes (to be posted by June 24th) for building an optimal binary search tree such that it runs in O(n²) time instead of O(n³) time. You need not analyze your algorithm nor prove it correct.