CSE143 Notes for Monday, 10/17/05

I spent a few minutes running the assassin program and answering questions about it. I also showed people the book "Programming Interviews Exposed" by Mongan and Suojanen that basically describes how to get a job at a company like Microsoft. I found it interesting that in their chapter on linked lists, they mention that linked lists are a favorite topic for interview questions (just one more argument for studying them).

At that point I switched into a new topic: complexity. The word "complexity" can be interpreted in many ways. It sounds like a measure of how complex or how complicated a program is, but that's not how computer scientists use the term. When we refer to the complexity of an algorithm or a code fragment, we are referring to the resources that it requires to execute. The two resources that we are generally most interested in are time (how long the code takes to run) and space (how much memory it requires).

We'll find that a common result is that these two primary resources can often be traded off. We can generally make a program work with less memory if we're willing to have it take more time to run. We can also generally get programs to run faster if we're willing to allocate some extra memory to the task.

Of these two, the resource that computer scientists most often refer to when talking about complexity is time. In particular, we are interested in the growth rate as the input size increases. We begin by deciding on some way to measure the size of the input (e.g., the number of names to sort, the number of numbers to examine, etc.) and call this "n". We are interested in what happens when we change n. For example, if it takes time "t" to execute for n items, how much time does it take to execute for 2n items?

I pointed out that this is one of the few places where computer science is actually like a science. Some instructors ask their students to collect empirical timing data for different input sizes and have them plot these values to see if the plot matches the prediction. Unfortunately, these experiments are more difficult to perform on modern computers because features like cache memory skew the results. The important thing is that the predictions hold for large values of n.

I then mentioned a simple rule of thumb that you can apply to Java programs to figure out the complexity of a code segment. I mentioned that it "almost" works. The idea is to find the line of code that is executed most often. In thinking about this, you have to be careful how you count. For example, with a for loop, we'd count the loop itself as executing just once, but the statements controlled by the loop might be executed many times. Of course, a for loop can be inside a for loop in which case the inner loop is executed multiple times. But think in terms of how many times you enter the loop when counting the number of executions of the line of code that begins with "for".

So I said imagine that you have a listing of your program with a white background. Suppose that each time a line of code is executed, the background gets a little grayer. The way to determine the complexity of the code is to let it run for a while, then open it up and look for the line of code that has turned the darkest charcoal black. If you express the number of times that line of code is executed as a function of n, then that function will be the complexity of the code.
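As a minimal sketch of this rule of thumb (not code from the lecture; the class and method names here are made up for illustration), we can attach a counter to each line of a doubly nested loop and see which one "turns darkest" as n grows:

```java
// Hypothetical illustration of the "darkest line" rule of thumb.
class LineCounts {
    // Returns {times the outer loop body runs, times the inner loop body runs}
    // for a simple doubly nested loop over n items.
    static long[] countLines(int n) {
        long outerBody = 0;
        long innerBody = 0;
        for (int i = 0; i < n; i++) {
            outerBody++;                  // executed n times
            for (int j = 0; j < n; j++) {
                innerBody++;              // executed n * n times: the darkest line
            }
        }
        return new long[] {outerBody, innerBody};
    }

    public static void main(String[] args) {
        for (int n = 10; n <= 1000; n *= 10) {
            long[] counts = countLines(n);
            System.out.println("n = " + n + ": outer body " + counts[0]
                               + ", inner body " + counts[1]);
        }
    }
}
```

The inner body's count, n * n, is the function the rule of thumb would report as the complexity of this fragment.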

Then I talked about a specific problem. The idea is that we have a list of integers, both positive and negative, and we want to find the subsequence that has the highest sum. If there weren't any negative integers, you'd always include all of the numbers. But because some of them can be negative, it might be the case that some portion of the list has a sum that is greater than any other sequence from the list. The subsequences always involve taking a contiguous chunk of the list. This particular problem has often been used by Microsoft as an interview question, probably because there are different ways to solve it, some of which are much faster than others.

As an example, suppose the list stores the values (5000, -45, -8000, 4000, 2000). The 5000 at the front of the list is the single largest value, so you'd think we want to include it. But it turns out that the last two numbers add up to something larger (6000). Trying to include both the 5000 at the front and the 4000 and 2000 at the end would require us to include the -8000 in the middle, which leads to a smaller sum (2955). So you get the maximum sum with the last two values (4000, 2000).

There is a simple way to solve this that involves picking each possible subsequence. We can have one for loop that generates each possible starting point and another for loop that generates each possible stopping point:

        for (int start = 0; start < list.length; start++) {
            for (int stop = start; stop < list.length; stop++) {
                // look at the numbers from start to stop
            }
        }

So how do we "look at the numbers from start to stop"? We can write a loop that adds up each of those numbers:

        int sum = 0;
        for (int i = start; i <= stop; i++) {
            sum += list[i];
        }

And once we have that sum, we can compare it against the maximum sum we've seen so far and reset the maximum if this sum is better:

        if (sum > max) {
            max = sum;
        }

Putting these pieces together, we end up with the following code:

        for (int start = 0; start < list.length; start++) {
            for (int stop = start; stop < list.length; stop++) {
                int sum = 0;
                for (int i = start; i <= stop; i++) {
                    sum += list[i];
                }
                if (sum > max) {
                    max = sum;
                }
            }
        }

That's the first approach. The line that is executed most often in this approach is the "sum += ..." line inside the innermost for loop (the "i" loop that adds up the list).
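To try the first approach on the example list, the fragment can be wrapped in a method. The following is only a sketch: the class name, the counter, and the choice to start max at list[0] (which assumes a nonempty list) are assumptions made for illustration, not details taken from the handout.

```java
// A sketch only: names and the initial value of max are assumptions.
class MaxSum1 {
    // Number of times the innermost "sum += ..." line has executed.
    static long count = 0;

    // Returns the largest sum of any contiguous run of values.
    // Assumes list has at least one element.
    static int findMax1(int[] list) {
        count = 0;
        int max = list[0];
        for (int start = 0; start < list.length; start++) {
            for (int stop = start; stop < list.length; stop++) {
                int sum = 0;
                for (int i = start; i <= stop; i++) {
                    sum += list[i];
                    count++;
                }
                if (sum > max) {
                    max = sum;
                }
            }
        }
        return max;
    }

    public static void main(String[] args) {
        int[] list = {5000, -45, -8000, 4000, 2000};
        System.out.println("max = " + findMax1(list));  // prints max = 6000
        System.out.println("count = " + count);         // prints count = 35
    }
}
```

For a list of 5 values there are only 15 start/stop pairs, yet the innermost line runs 35 times, which hints at how quickly the work grows.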

Then I asked how the algorithm could be improved. How can we do this faster? The bottleneck is the line that is adding up individual numbers and the key to improving the algorithm is noticing how we're doing a lot of duplicate work. Think about what happens the first time through the outer loop when "start" is equal to 0. We go through the inner loop for all possible values of "stop". So suppose the list is 2000 long. We're going to compute:

        the sum from 0 to 0
        the sum from 0 to 1
        the sum from 0 to 2
        the sum from 0 to 3
        the sum from 0 to 4
        ...
        the sum from 0 to 1999

Those are all the possibilities that start with 0. We have to explore each of these possibilities, but think about how we're computing the sums. We have an inner "i" loop that is computing the sum from scratch each time. For example, suppose that we just finished computing the sum from 0 to 6. We next compute the sum from 0 to 7. But we start from the very beginning and have i go through all of the values 0 through 7 when we've just computed the sum from 0 to 6.

This becomes even more obvious when you think about larger subsequences. For example, suppose that you just added up all of the values from 0 to 1000. That was a lot of work. Then you throw away that sum and start from scratch to add up the values from 0 to 1001. But why start back at the beginning? If you know what the values from 0 to 1000 add up to, then just add the value at position 1001 to get the sum from 0 to 1001.

So the key is to eliminate the inner "i" loop by keeping a running sum. This requires us to move the initialization of sum from the inner loop to the outer loop so that we don't forget the work we've already done.

        for (int start = 0; start < list.length; start++) {
            int sum = 0;
            for (int stop = start; stop < list.length; stop++) {
                sum += list[stop];
                if (sum > max) {
                    max = sum;
                }
            }
        }

In this code the most frequently executed statements are inside the for loop for "stop" (the line that begins "sum +=" and the if).
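The running-sum version can be packaged the same way. Again this is a sketch under assumptions: the class name, the counter, and starting max at list[0] (so the list must be nonempty) are illustrative choices rather than details from the handout.

```java
// A sketch only: names and the initial value of max are assumptions.
class MaxSum2 {
    // Number of times the "sum += list[stop]" line has executed.
    static long count = 0;

    // Returns the largest sum of any contiguous run of values,
    // keeping a running sum for each starting point.
    // Assumes list has at least one element.
    static int findMax2(int[] list) {
        count = 0;
        int max = list[0];
        for (int start = 0; start < list.length; start++) {
            int sum = 0;
            for (int stop = start; stop < list.length; stop++) {
                sum += list[stop];
                count++;
                if (sum > max) {
                    max = sum;
                }
            }
        }
        return max;
    }

    public static void main(String[] args) {
        int[] list = {5000, -45, -8000, 4000, 2000};
        System.out.println("max = " + findMax2(list));  // prints max = 6000
        System.out.println("count = " + count);         // prints count = 15
    }
}
```

Here the count equals the number of start/stop pairs (15 for a list of 5), since each pair now costs a single addition.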

I mentioned that there is a third algorithm, although I wouldn't have time to discuss it in detail and it is the most difficult to understand, so I did not attempt to prove its correctness. I did, however, try to explain the basic idea. The key is to avoid computing all of the sums. We want to have some heuristic that would allow us to ignore certain possibilities. We do that with a single loop and by keeping track of the highest possible sum you can form with a subsequence that ends at list[i].

So suppose that we are considering the i-th value in the list for some i greater than 0. Let's say i is 10. Think about subsequences that include list[10]. Some of them begin with list[10] and others begin earlier, including list[9] and potentially other values that appear before list[10]. Under what circumstances would we get a higher sum by starting with list[10] versus including these earlier values? That's the key question.

The answer is that if the best subsequence you can find ending in list[9] adds up to a positive number, then that sequence is worth including. If it adds up to a negative number, then it takes away from the sum we are trying to generate. In other words, if those earlier values add up to a negative number, then we can get a higher sum by excluding them and starting our sequence with list[10].
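One way to turn this idea into code is sketched below. It is not the handout's findMax3; the names are invented for illustration, and it assumes a nonempty list. The variable bestEndingHere holds the highest sum of any subsequence that ends at the current position.

```java
// A sketch of the single-loop idea; not the code from the handout.
class MaxSum3 {
    // Returns the largest sum of any contiguous run of values.
    // Assumes list has at least one element.
    static int findMax3(int[] list) {
        int bestEndingHere = list[0];  // best sum of a run ending at index 0
        int max = list[0];
        for (int i = 1; i < list.length; i++) {
            if (bestEndingHere > 0) {
                // the earlier values help, so extend that run with list[i]
                bestEndingHere += list[i];
            } else {
                // the earlier values only hurt, so start fresh at list[i]
                bestEndingHere = list[i];
            }
            if (bestEndingHere > max) {
                max = bestEndingHere;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        int[] list = {5000, -45, -8000, 4000, 2000};
        System.out.println("max = " + findMax3(list));  // prints max = 6000
    }
}
```

On the example list, the run that starts with 5000 drops to -3045 once the -8000 is included, so the sketch starts over at 4000 and ends with 6000.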

This algorithm is somewhat tricky, so it's not essential that you understand exactly why it works. By running all three algorithms together, we can compare their results and get at least some evidence that the third algorithm produces the same answers as the other two.

I then switched to the computer and I showed people a program I had written to explore different algorithms for this problem (handout #11).

The program includes a DEBUGGING constant that allows me to turn debugging on and off. When it's on, the code prints the overall list and it prints the best subsequence that it finds. This is useful to do when you're dealing with short lists and you want to verify that the code is working. But if you're dealing with thousands of elements in your list, you'd want to turn this off.

I pointed out the general structure of method main. It makes various calls on the method System.currentTimeMillis() to get the clock reading in milliseconds. I compute the time that elapses between calls on three different methods: findMax1, findMax2 and findMax3.
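The timing pattern looks roughly like the sketch below. This is not the program from the handout: the class, the helper methods, the random test data, and the input sizes are all assumptions for illustration, and only the running-sum algorithm is timed.

```java
import java.util.Random;

// A sketch of the timing pattern; not the program from the handout.
class TimingSketch {
    static int lastResult;  // keeps the answer so the work is not discarded

    // Builds a list of n pseudo-random values between -1000 and 1000.
    static int[] randomList(int n, long seed) {
        Random rand = new Random(seed);
        int[] list = new int[n];
        for (int i = 0; i < n; i++) {
            list[i] = rand.nextInt(2001) - 1000;
        }
        return list;
    }

    // Running-sum version of the maximum sum search (assumes a nonempty list).
    static int findMax(int[] list) {
        int max = list[0];
        for (int start = 0; start < list.length; start++) {
            int sum = 0;
            for (int stop = start; stop < list.length; stop++) {
                sum += list[stop];
                if (sum > max) {
                    max = sum;
                }
            }
        }
        return max;
    }

    // Reads the clock before and after the call and returns the difference.
    static long elapsedMillis(int[] list) {
        long start = System.currentTimeMillis();
        lastResult = findMax(list);
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        for (int n = 5000; n <= 20000; n *= 2) {
            long millis = elapsedMillis(randomList(n, 42));
            System.out.println("n = " + n + ", time = " + millis + " ms");
        }
    }
}
```

The reported times will vary from machine to machine and from run to run, so the ratios between them are more meaningful than the individual values.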

We started by exploring calls on findMax1. We ran the program for 500 elements, 1000, 2000 and 3000 and set up an Excel spreadsheet with the times for each execution and the line counts for each execution. Then we computed some ratios. For example, we explored what happens when you double the input by looking at the values for 1000 divided by the values for 500 and by looking at the values for 2000 divided by the values for 1000. We also looked at what happens when you triple the size of the input by computing the values for 3000 divided by the values for 1000.

We found several things. First of all, we found that the growth of line count was a pretty good predictor of the growth of the time (in other words, the ratios were similar). That's good because our theory was that the line count growth rate predicts the time growth rate. We also noticed that these values got closer for larger values of n. This is a common occurrence. With small values of n, other factors can interfere with the timing. But for large values of n, we'll find that the line count becomes a great predictor of the time. Finally, we noticed that the growth rate seems to be n^3. When we doubled the input, time and line count each increased by a factor of about 8 (2^3). When we tripled it, they increased by a factor of about 27 (3^3).

Next we commented out the call on the first method. That's because it's so slow that we can't explore the other ones if we include the call on the first one. For the second algorithm we looked at input sizes ten times larger than before. In other words, we looked at times and line counts for n of 5000, 10000, 20000 and 30000. We again found that the ratios between the times and the line counts were similar, with the ratios getting closer for larger values of n. In contrast to the first algorithm, this one had a growth rate of n^2. As we doubled the input size, it took about 4 times longer (2^2) and the line count increased by a factor of 4. When we tripled the input size, it took about 9 times longer (3^2) and the line count increased by a factor of 9.
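The line-count ratios can also be predicted without running the algorithms at all. For a list of n values, the innermost line of the triple-loop version runs n(n+1)(n+2)/6 times and the "sum +=" line of the running-sum version runs n(n+1)/2 times; these formulas come from counting the loop iterations and are not figures from the lecture's spreadsheet. A sketch with made-up class and method names:

```java
// A sketch that evaluates the counting formulas; names are illustrative.
class GrowthRatios {
    // Executions of the innermost line in the triple-loop version:
    // the sum of (stop - start + 1) over all pairs with start <= stop.
    static long cubicCount(long n) {
        return n * (n + 1) * (n + 2) / 6;
    }

    // Executions of the "sum +=" line in the running-sum version:
    // one per pair with start <= stop.
    static long quadraticCount(long n) {
        return n * (n + 1) / 2;
    }

    static double ratio(long bigger, long smaller) {
        return (double) bigger / smaller;
    }

    public static void main(String[] args) {
        System.out.println("first algorithm, 1000 vs 500:   "
            + ratio(cubicCount(1000), cubicCount(500)));
        System.out.println("first algorithm, 3000 vs 1000:  "
            + ratio(cubicCount(3000), cubicCount(1000)));
        System.out.println("second algorithm, 10000 vs 5000:  "
            + ratio(quadraticCount(10000), quadraticCount(5000)));
        System.out.println("second algorithm, 30000 vs 10000: "
            + ratio(quadraticCount(30000), quadraticCount(10000)));
    }
}
```

The ratios come out slightly below 8, 27, 4 and 9 because of the lower-order terms in the formulas, and they move closer to those values as n grows.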

I didn't have time to discuss the third algorithm in detail or to test its running time, but we did manage to run one test for 100,000 values that took less than 0.1 seconds to run. The final algorithm is linear, meaning that the growth function is n.

I had written the code to report the time the algorithm took as well as a count of how many times the most frequently executed statement was executed.

The moral of the story is that choosing the right algorithm can make a huge difference, particularly if you can choose an algorithm from a better complexity class.

We'll pick up with this example in Friday's lecture when we discuss more about complexity.

Stuart Reges
Last modified: Wed Oct 19 13:01:13 PDT 2005