CSE143 Notes for Monday, 10/17/11

I said that we were beginning a week-long discussion of how to use the most important structures in the Java Collections Framework. In other words, we're going to learn how to be clients of the collections classes.

I first discussed the built-in ArrayList class. We began by looking at this client code:

        ArrayList<String> list = new ArrayList<String>();
        list.add("four");
        list.add("score");
        list.add("seven");
        list.add("years");
        list.add("what was next?");
        list.add("ago");
        list.add(2, "and");
        list.remove(5);
        System.out.println("list = " + list);
        System.out.println(list.indexOf("seven"));

which produces this output:

        list = [four, score, and, seven, years, ago]
        3

All of the methods we have seen with ArrayIntList are defined for ArrayList: the appending add, add at an index, remove, size, get, etc. I asked what's wrong with this code and someone pointed out that I should be using an interface for the type:

        List<String> list = new ArrayList<String>();

Then we talked about how to loop over the structure. We can use the size and get methods to write a loop that looks a lot like an array processing loop:

        for (int i = 0; i < list.size(); i++) {
            System.out.println(list.get(i));
        }

This is often a reasonable way to manipulate a list, but it relies on the "get" method being able to quickly access any element of the structure. This property is known as random access. We say that arrays and the ArrayList and ArrayIntList classes that are built on top of them are random access structures because we can quickly access any element of the structure. If you knew that for the rest of your life, you'd always be working with arrays, then you'd have little use for iterators. You'd just call the get method because with arrays you get fast random access.

But not all data structures have quick access. If we used a LinkedList instead of an ArrayList, the loop above would be very expensive because the get method requires starting at the front of the list each time to get to the appropriate value. So what would normally be an O(n) operation would become an O(n²) operation with the code above.

I said that I wanted to explore a different approach using what is known as an iterator. In general, we think of an iterator as having three basic operations:

a "has next" operation that tells you whether or not there are any values left
a "get next" operation that lets you see what the next value is
a "move to next" operation that moves the iterator to the next value

Sun adopted the convention early on that the second and third steps would be combined into one operation known as "next" that does two different things: it returns the next value and it advances the iterator to the next value. So in Java there are two fundamental operations:

a "hasNext" method that tells you whether or not there are any values left
a "next" method that returns the next value and advances to the one beyond
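To make the two-operation convention concrete, here is a minimal sketch of an iterator over a plain String array. The class name ArrayWordIterator and its fields are hypothetical and exist only for this illustration; this is not how the library classes are written. Notice how next does both jobs: it fetches the value and then advances.

```java
import java.util.*;

// Hypothetical iterator over a String array, written only to illustrate
// the hasNext/next convention.
class ArrayWordIterator implements Iterator<String> {
    private String[] words;
    private int position;  // index of the next value to return

    public ArrayWordIterator(String[] words) {
        this.words = words;
        this.position = 0;
    }

    // "has next": are there any values left?
    public boolean hasNext() {
        return position < words.length;
    }

    // "get next" and "move to next" combined into one call
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = words[position];  // get the next value
        position++;                       // advance past it
        return result;
    }

    // removal is an optional operation; this sketch does not support it
    public void remove() {
        throw new UnsupportedOperationException();
    }

    public static void main(String[] args) {
        String[] data = {"four", "score", "and"};
        Iterator<String> itr = new ArrayWordIterator(data);
        while (itr.hasNext()) {
            System.out.println(itr.next());
        }
    }
}
```

Running main prints the three words, one per line.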

We can rewrite the printing loop using an iterator as follows:

        Iterator<String> i = list.iterator();
        while (i.hasNext()) {
            System.out.println(i.next());
        }

This involves a new kind of object of type Iterator<String>. Iterator<E> is an interface in the java.util package. Notice that we ask the list to construct the iterator for us by calling the method called "iterator". Once we have our iterator, we use a while loop to print out the next value as long as there is a next value to process.
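To tie this back to the earlier concern about LinkedList, the sketch below traverses the same list two ways. The class and method names (TraversalDemo, joinWithGet, joinWithIterator) are made up for illustration. Both methods return the same string; the difference is in cost, since each call on get for a linked list has to walk through nodes to reach the index, while the iterator moves forward one step at a time.

```java
import java.util.*;

class TraversalDemo {
    // Index-based loop: fine for an ArrayList, but each get(i) on a
    // LinkedList has to walk through the nodes to reach index i.
    static String joinWithGet(List<String> list) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < list.size(); i++) {
            sb.append(list.get(i)).append(' ');
        }
        return sb.toString().trim();
    }

    // Iterator-based loop: each call on next() moves one step forward,
    // so the whole traversal is a single pass for either kind of list.
    static String joinWithIterator(List<String> list) {
        StringBuilder sb = new StringBuilder();
        Iterator<String> itr = list.iterator();
        while (itr.hasNext()) {
            sb.append(itr.next()).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        List<String> list = new LinkedList<String>();
        list.add("four");
        list.add("score");
        list.add("and");
        System.out.println(joinWithGet(list));       // four score and
        System.out.println(joinWithIterator(list));  // four score and
    }
}
```

Because both methods take a List<String> parameter, the same code works unchanged if the client switches between ArrayList and LinkedList.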

Then I discussed the for-each loop. It is implemented using an iterator but provides a simpler syntax for those situations where you simply want to go through all of the data in your collection from beginning to end. We can rewrite our printing loop as:

        for (String s : list) {
            System.out.println(s);
        }

We generally read the for-each header as, "For each String s in list...". The choice of "s" is arbitrary. It defines a local variable for the loop. I could just as easily have called it "x" or "foo" or "value". This for-each loop is implemented by constructing an iterator and executing the same code that we had previously.

There are some limitations of for-each loops. You can't use them to change the contents of the list. If you assign a value to the variable s, you are just changing a local variable inside the loop. It has no effect on the list itself.
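A small sketch of this limitation follows. The class and method names are hypothetical, and the second method relies on the List method set, which these notes do not otherwise cover. The first method assigns to the loop variable and leaves the list untouched; the second replaces each element by index.

```java
import java.util.*;

class ForEachLimit {
    // Assigning to the loop variable only changes the local variable.
    static void upperCaseWithForEach(List<String> list) {
        for (String s : list) {
            s = s.toUpperCase();  // no effect on the list
        }
    }

    // Calling set at each index actually replaces the stored values.
    static void upperCaseWithSet(List<String> list) {
        for (int i = 0; i < list.size(); i++) {
            list.set(i, list.get(i).toUpperCase());
        }
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<String>();
        list.add("four");
        list.add("score");
        upperCaseWithForEach(list);
        System.out.println(list);  // [four, score]
        upperCaseWithSet(list);
        System.out.println(list);  // [FOUR, SCORE]
    }
}
```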

Next, I mentioned that we will be looking at a collection known as a Set. Java has an interface Set<E> that is implemented by HashSet<E> and TreeSet<E>. The HashSet is a bit faster, but doesn't keep the values in any particular order. The TreeSet keeps values in sorted order.

For example, to make a set of integers using an array of data, we can say:

        int[] data = {18, 4, 97, 3, 4, 18, 72, 4, 42, 42, -3};
        Set<Integer> numbers = new TreeSet<Integer>();

        for (int n : data) {
            numbers.add(n);
        }
        System.out.println("set = " + numbers);

This produced the following output:

        set = [-3, 3, 4, 18, 42, 72, 97]

There are two major differences between a set and a list. Sets don't allow duplicates. So the duplicate values like 42 and 4 in the array appear just once in the set. Sets also don't allow the client to control the order of elements. The TreeSet class keeps things in sorted order. So the numbers will always be in that order. If you want to control the order, then you should use a list instead.
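As a side-by-side sketch of these two differences, the code below loads the same array into a list and into a TreeSet. The class name ListVersusSet and its helper methods are hypothetical. It also shows that the set's add method returns a boolean that reports whether the value was actually new.

```java
import java.util.*;

class ListVersusSet {
    // Copies every value, duplicates included, in the original order.
    static List<Integer> toList(int[] data) {
        List<Integer> list = new ArrayList<Integer>();
        for (int n : data) {
            list.add(n);
        }
        return list;
    }

    // Keeps one copy of each value, in sorted order.
    static Set<Integer> toSortedSet(int[] data) {
        Set<Integer> set = new TreeSet<Integer>();
        for (int n : data) {
            set.add(n);
        }
        return set;
    }

    public static void main(String[] args) {
        int[] data = {18, 4, 97, 3, 4, 18, 72, 4, 42, 42, -3};
        List<Integer> list = toList(data);
        Set<Integer> set = toSortedSet(data);
        System.out.println("list size = " + list.size());  // list size = 11
        System.out.println("set size = " + set.size());    // set size = 7
        System.out.println(set.add(42));  // false, 42 is already in the set
        System.out.println(set.add(5));   // true, 5 is new
    }
}
```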

Sets have many of the same methods that lists do. You can add to a set, get its size, ask for an iterator, and use it with a for-each loop. But a set doesn't have a notion of indexing. So you can't remove at an index. Instead you remove a specific value. And you can't get the value at a specific index. Instead you use an iterator or a for-each loop.

We saw that we could name a specific value to remove from a set, and use the set in a for-each loop, as in:

        numbers.remove(97);
        for (int n : numbers) {
            System.out.println(n);
        }

This produced the following output:

        -3
        3
        4
        18
        42
        72

Notice that this produced each of the numbers from the original version of the set but without the value 97.

Then we talked about how to remove values from a set using an iterator. You can remove specific values directly, as we did with 97, but more often we want to examine each value in the set and remove the values that have a certain property. We do this by calling the iterator's remove method. It removes the most recent value returned by a call on the iterator's method called next.

For example, we wrote this loop to remove the values from the set that are multiples of 3:

        Iterator<Integer> i2 = numbers.iterator();
        while (i2.hasNext()) {
            int n = i2.next();
            if (n % 3 == 0) {
                System.out.println("removing " + n);
                i2.remove();
            }
        }
        System.out.println("set = " + numbers);

This produced the following output:

        removing -3
        removing 3
        removing 18
        removing 42
        removing 72
        set = [4]

In other words, we ended up removing everything but the value 4. We briefly discussed the limitations of the remove method:

You can't call remove until you have called next. So if you construct the iterator and immediately call remove, the iterator will throw an exception (an IllegalStateException).
You can't call remove two times in a row. You can only remove the most recent value returned by a call on next. So if you call remove, then you have to call next again before you can call remove again.

We also briefly discussed the fact that you can't mix code that changes the collection directly with code that uses an iterator. This will produce what is known as a concurrent modification exception (ConcurrentModificationException). So if you are going to change a set while you iterate over it, you have to go through the iterator's remove method to do so.
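The three restrictions above can be tried out with a sketch like the following. The class IteratorRules and its methods are hypothetical and use a small TreeSet containing 1 through 4. Each method attempts one of the forbidden operations and reports which exception, if any, was thrown.

```java
import java.util.*;

class IteratorRules {
    // Hypothetical helper: builds the set [1, 2, 3, 4] for each experiment.
    static Set<Integer> makeSet() {
        Set<Integer> numbers = new TreeSet<Integer>();
        for (int n = 1; n <= 4; n++) {
            numbers.add(n);
        }
        return numbers;
    }

    // Calls remove before any call on next.
    static String removeBeforeNext() {
        Iterator<Integer> itr = makeSet().iterator();
        try {
            itr.remove();
            return "no exception";
        } catch (IllegalStateException e) {
            return "IllegalStateException";
        }
    }

    // Calls remove twice after a single call on next.
    static String removeTwice() {
        Iterator<Integer> itr = makeSet().iterator();
        itr.next();
        itr.remove();  // legal: removes the value just returned
        try {
            itr.remove();  // nothing new has been returned by next
            return "no exception";
        } catch (IllegalStateException e) {
            return "IllegalStateException";
        }
    }

    // Changes the set directly in the middle of a for-each loop.
    static String modifyDuringLoop() {
        Set<Integer> numbers = makeSet();
        try {
            for (int n : numbers) {
                numbers.remove(4);  // bypasses the iterator
            }
            return "no exception";
        } catch (ConcurrentModificationException e) {
            return "ConcurrentModificationException";
        }
    }

    public static void main(String[] args) {
        System.out.println(removeBeforeNext());  // IllegalStateException
        System.out.println(removeTwice());       // IllegalStateException
        System.out.println(modifyDuringLoop());  // ConcurrentModificationException
    }
}
```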

Then we began a discussion of a program to examine a file of words. As an example, I asked people how we could write a program that would count the number of unique words in an input file. I had a copy of the text of Moby Dick that we looked at to think about this. I showed some starter code that constructs a Scanner object tied to a file:

        import java.util.*;
        import java.io.*;
        
        public class WordCount {
            public static void main(String[] args) throws FileNotFoundException {
                Scanner console = new Scanner(System.in);
                System.out.print("What is the name of the text file? ");
                String fileName = console.nextLine();
                Scanner input = new Scanner(new File(fileName));

                while (input.hasNext()) {
                    String next = input.next();
                    // process next
                }
            }
        }

Notice that in the loop we use input.next() to read individual words and we have this in a while loop testing against input.hasNext(). I pointed out that we'll have trouble with things like capitalization and punctuation. I said that we should at least turn the string to all lowercase letters so that we don't count Strings like "The" and "the" as different words:

        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            // process next
        }

So how do we count the words? Someone suggested that a Set would be the perfect structure to solve this problem. It eliminates duplicates, so it will keep track of how many different words there are. So we changed the loop to be:

        Set<String> words = new HashSet<String>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            words.add(next);
        }
        System.out.println("Total words = " + words.size());

We could have used a TreeSet, but we decided to use the somewhat faster HashSet because we didn't need to keep the words in sorted order.
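To experiment with the counting loop without a file, a sketch like the one below can be used. It assumes a Scanner built over a String in place of the file Scanner, and the class name UniqueWords and method countUnique are hypothetical. The logic inside the loop is the same as above.

```java
import java.util.*;

class UniqueWords {
    // Counts the distinct whitespace-separated tokens, ignoring case.
    static int countUnique(Scanner input) {
        Set<String> words = new HashSet<String>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            words.add(next);
        }
        return words.size();
    }

    public static void main(String[] args) {
        // A Scanner over a String stands in for the file in this sketch.
        Scanner input = new Scanner("The whale the Whale and the sea");
        System.out.println("Total words = " + countUnique(input));  // Total words = 4

        // Punctuation is not handled: "whale" and "whale." count separately.
        Scanner input2 = new Scanner("whale whale.");
        System.out.println("Total words = " + countUnique(input2));  // Total words = 2
    }
}
```

The second call illustrates the punctuation problem mentioned earlier: tokens are split only on whitespace, so a trailing period makes a word look different.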

Here is a sample log of execution:

        What is the name of the text file? moby.txt
        Total words = 30368

I said that in the next lecture we would continue this example but we would count the occurrences of each word, which will require a different kind of collection called a map.

Stuart Reges

Last modified: Thu Oct 20 10:55:34 PDT 2011