6. Constructing a Button-Box Interface

7. Empirical Results

We tested the interactive note-taking software on notes drawn from a variety of domains. Tables 5 through 11 list sample notes from seven domains (in addition to the PowerBook and fabric pattern sample notes listed above).


CVA-62 8/6/63 to 3/4/64 Mediterranean A-5A AG 60X

CVA-61 8/5/64 to 5/6/65 Vietnam RA-5C NG 10X

Table 5: Sample notes from the airwing domain. Listed above are 2 of the 78 notes about airwing assignments aboard aircraft carriers collected from (Grove & Miller, 1989).


B, 81, 5, 151 (2.5), Cyl. 4, 2-bbl., Pontiac

C, 82, X, 173 (2.8), Cyl. 6, 2-bbl., Chevrolet

Table 6: Sample notes from the engine code domain. Listed above are 2 of the 20 notes about the meaning of engine codes stamped on automobile identification plates collected from Chilton's Repair & Tune-Up Guide (1985).


90, Mazda MPV, 40K MI, 7 Pass, V6, Auto
ABS, PL/PW, Cruise, Dual Air

87, Grand Caravan, 35K MI, 7 Pass, V6, Auto
Cruise, Air, Tilt, Tinting

Table 7: Sample notes from the minivan domain. Listed above are 2 of the 22 notes about minivan automobiles collected by the first author.


Lorus Disney Oversize Mickey Mouse Watch.
Genuine leather strap.

Seiko Disney Ladies' Minnie Mouse Watch.
Leather strap.

Table 8: Sample notes from the watch domain. Listed above are 2 of the 89 notes about personal watches collected from the Best catalog (a department store).


azatadine maleate
Blood: thrombocytopenia.
CNS: disturbed coordination, dizziness, drowsiness, sedation, vertigo.
CV: palpitations, hypotension.
GI: anorexia, dry mouth and throat, nausea, vomiting.
GU: Urinary retention.
Skin: rash, urticaria.
Other: chills, thickening of bronchial secretions.

brompheniramine maleate
Blood: aganulocytosis, thrombocytopenia.
CNS: dizziness, insomnia, irritability, tremors.
CV: hypotension, palpitations.
GI: anorexia, dry mouth and throat, nausea, vomiting.
GU: urinary retention.
Skin: rash, urticaria.
After parenteral administration:
 local reaction, sweating, syncope may occur.

Table 9: Sample notes from the antihistamine domain. Listed above are 2 of the 17 notes on the side effects of antihistamines collected from the Nurses Guide to Drugs (1979).


Canon FD f/1.8, 6oz., f/22, 13in.,
good sharpness, poor freedom from flare,
better freedom from distortion,
focal length marked on sides as well as
on front of lens

Chinon f/1.7, 6oz., f/22, 9in.,
poor sharpness, good freedom from flare,
good freedom from distortion,
cannot be locked in program mode, which
is only a problem, of course, when lens is
used on program-mode cameras

Table 10: Sample notes from the lens domain. Listed above are 2 of the 31 notes about 35mm SLR camera normal lenses collected from the Consumer Reports (1988).


22in. W. 48in.
A very large falcon. Three color phases occur:
blackish, white, and gray-brown. All
are more uniformly colored than the
Peregrine Falcon, which has dark
mustaches and hood.

16-24in. W. 42in.
Long-winged, long-tailed hawk with a
white rump, usually seen soaring
unsteadily over marshes with its wings
held in a shallow 'V'. Male has a pale
gray back, head, and breast. Female
and young are brown above, streaked
below, young birds with a rusty tone.

Table 11: Sample notes from the raptor domain. Listed above are 2 of the 21 notes about North American birds of prey collected from (Bull & Farrand, 1977).

Summary characteristics of the nine domains are listed in Table 12, together with some simple measures that indicate prediction difficulty. For instance, Column 1 shows the number of notes in the domain. The larger the number of notes, the easier it should be to train a predictive method accurately. Column 4 shows the standard deviation (STD) of the length of all notes in each domain. A well-behaved FSM is more likely to be discovered when STD is low. In this and successive tables, the domains are ranked by STD. Column 5 presents the percentage of unique tokens in the notes. The fewer novel tokens a note has, the more likely it is that successive tokens can be predicted. This measure places an upper bound on predictive accuracy. Column 6 shows the percentage of constant tokens, those that always appear in a fixed position. These constant tokens are easier to predict. Finally, Column 7 indicates the percentage of repeated tokens. The fewer tokens that are repeated verbatim within a note, the less likely the predictive method is to become confused about its locale within a note during prediction.
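The measures described above can be approximated in a few lines of code. The following Python sketch is illustrative only: the function name `domain_stats` and the exact definitions are assumptions, since the text does not give precise formulas. It reads Column 5 as distinct tokens over total tokens, Column 6 as tokens that occupy the same position (counted from the start) in every note, and Column 7 as tokens that duplicate an earlier token in the same note.

```python
from collections import Counter
from statistics import pstdev


def domain_stats(notes):
    """Rough difficulty measures for a domain.

    notes: non-empty list of non-empty token lists (one list per note).
    All definitions here are assumed readings of the prose, not the
    authors' exact formulas.
    """
    lengths = [len(n) for n in notes]
    total = sum(lengths)
    counts = Counter(tok for n in notes for tok in n)

    # Assumed reading of "unique tokens": distinct tokens / all tokens.
    distinct = len(counts)

    # Assumed reading of "constant tokens": same token at the same
    # index in every note (only indices present in the shortest note).
    constant = 0
    for i in range(min(lengths)):
        if len({n[i] for n in notes}) == 1:
            constant += len(notes)

    # Assumed reading of "repeated tokens": duplicates within one note.
    repeated = sum(len(n) - len(set(n)) for n in notes)

    return {
        "notes": len(notes),
        "length_std": pstdev(lengths),
        "pct_unique": 100.0 * distinct / total,
        "pct_constant": 100.0 * constant / total,
        "pct_repeated": 100.0 * repeated / total,
    }
```

Population standard deviation (`pstdev`) is used here; whether the reported STD is the population or sample variant is not stated.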

Table 12: Quantitative properties of the nine domains used to test alternative methods.

The first six domains are natural for the interactive note-taking task because they exhibit a regular syntax. The last three domains are included to test the software's ability on less suitable domains. Notes from the Antihistamine, Lens, and Raptor domains contain highly variable lists of terms or natural language sentences. Learned FSMs for notes in these domains are unlikely to converge, and, in the experiments reported here, only the FSM for the Lens data exceeded the maturity threshold (average state usage greater than 2).

7.1 Contextual Prediction Accuracy

Column 7 of Table 13 lists the accuracy of next-token predictions made by the software in prompting mode. The first nine rows list predictive accuracy over all tokens as notes from each of the nine domains are independently processed in the order they were collected. The last row lists predictive accuracy over all tokens as notes from all nine domains are collectively processed. This simulates a user taking notes about several domains simultaneously.
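One way to read this evaluation protocol is as a test-then-train loop: each token is predicted from the tokens written so far, the prediction is scored, and the note is then added to the training data. The Python sketch below is a hedged illustration of that loop. The function `online_accuracy`, the `predict`/`learn` interface, and the toy predictor `MostFrequentToken` are all hypothetical, and learning once per completed note (rather than after every token) is an assumption.

```python
from collections import Counter


class MostFrequentToken:
    """Toy stand-in predictor (hypothetical, not from the paper):
    always guesses the most frequent token seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, prefix):
        # prefix is ignored by this toy predictor.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, note):
        self.counts.update(note)


def online_accuracy(notes, predictor):
    """Percentage of tokens predicted correctly when notes are
    processed in order, predicting each token before learning it."""
    correct = total = 0
    for note in notes:
        for i, tok in enumerate(note):
            if predictor.predict(note[:i]) == tok:
                correct += 1
            total += 1
        predictor.learn(note)  # assumed: train after each full note
    return 100.0 * correct / total if total else 0.0
```

Passing the notes of a single domain simulates the per-domain rows; concatenating notes from all domains simulates the combined row.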

To put these results in context, the table also lists predictive accuracies for several other methods. Column 1 lists the accuracy for a lower-bound method. It assumes that each note shares a fixed sequence of tokens. Termed common, this method initializes its structure to the first note. It then removes each token in this sequential structure that cannot be found in order in the other notes. At best, this method can only predict the constant, delimiter-like tokens that may appear regularly in notes. Its performance is limited by the percentage of constant tokens reported in Column 6 of Table 12. It performs best for the PowerBook notes, where it learns the following note syntax:

Example 3:
* :NULL * "K" * " PowerBook" * "MB" * "MB" * " Int." * .

(The asterisks indicate Kleene star notation.) This reads as some sequence of zero or more tokens then the token :NULL, followed by zero or more tokens then "K", followed by zero or more tokens then " PowerBook", and so on. It is less successful for the minivan notes, where it learns a simpler syntax:

Example 4:
* :NULL * "K" * " MI" * " Pass" * .
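The common method described above can be sketched as a simple in-order filtering of the first note against every later note. The Python below is a minimal illustration under stated assumptions: the function name `common_pattern` is hypothetical, the tokenization in the usage example is invented for illustration, and the matching is greedy (leftmost match), which is one possible reading of "found in order" rather than a confirmed detail of the original implementation.

```python
def common_pattern(notes):
    """Return the tokens of the first note that can be found, in
    order, in every other note (greedy leftmost matching).

    notes: non-empty list of token lists.
    """
    pattern = list(notes[0])
    for note in notes[1:]:
        kept = []
        pos = 0  # next position in `note` that may still be matched
        for tok in pattern:
            try:
                idx = note.index(tok, pos)
            except ValueError:
                continue  # token not found in order: drop it
            kept.append(tok)
            pos = idx + 1
        pattern = kept
    return pattern
```

For two hypothetical tokenized minivan notes, `common_pattern([["90", "Mazda", "40", "K", " MI", "7", " Pass"], ["87", "Grand", "35", "K", " MI", "7", " Pass"]])` keeps only the shared tokens `"K"`, `" MI"`, `"7"`, and `" Pass"`. Because the matching is greedy, a token matched late in a note can block later pattern tokens; a longest-common-subsequence computation would avoid this at higher cost.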

Columns 2 and 3 of Table 13 list the accuracy of using a classifier to directly predict the next token without explicitly learning a syntax. In this paradigm, examples are prefixes of token sequences. Attributes are the last token in the sequence, the second to last token, the third to last token, and so on. Class values are the next token in the sequence, that is, the one to be predicted. Column 2 lists the performance of a simple Bayes classifier, and Column 3 lists the performance of an incremental variant of ID3 (Schlimmer & Fisher, 1986). Perhaps surprisingly, these methods perform considerably worse than the simple conjunctive method. Without the benefit of a narrow context provided by the FSM, these methods must implicitly construct representations to detect differences between similar situations that arise within a single note. For example, in the PowerBook notes, a classifier-only approach must learn to discriminate between the first and second occurrences of the "MB" token.
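The encoding of prefixes as classifier examples can be made concrete with a short sketch. The Python below is illustrative only: the function name `prefix_examples`, the fixed `context` window, and the `pad` value for missing history are assumptions (the text says "and so on" without stating how far back the attributes reach).

```python
def prefix_examples(tokens, context=3, pad=None):
    """Turn one note into (attributes, class) training examples.

    attributes[0] is the last token written, attributes[1] the second
    to last, and so on; the class is the token to be predicted.
    """
    examples = []
    for i, target in enumerate(tokens):
        attrs = tuple(
            tokens[i - k] if i - k >= 0 else pad
            for k in range(1, context + 1)
        )
        examples.append((attrs, target))
    return examples
```

With a hypothetical token sequence `["4", "MB", "40", "MB", "Int."]` and `context=1`, the two examples whose only attribute is `"MB"` have different classes (`"40"` and `"Int."`). This is the ambiguity between the first and second "MB" that a classifier without FSM context must resolve from longer histories.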

Table 13: Percentage of tokens correctly predicted as a function of the learning method.

Column 4 of Table 13 lists the accuracy of a more viable prediction mechanism. Based on simple ideas of memorization and termed digram, the method maintains a list of tokens that have immediately followed each observed token. For example, in the fabric pattern domain, this method retains the list of tokens {"8-10-12", "10", "11/12", "12"} as those that follow the token "Size". Each list of follow tokens is kept in order from most to least frequent. To predict the next token, the system looks for the last token written and predicts the most frequent follow token. This method is nearly as effective as any other in Table 13, especially on the combined task when notes from each domain are entered in random order. Laird (1992) describes an efficient algorithm for maintaining higher-dimensional n-grams, in effect increasing the context of each prediction and effectively memorizing longer sequences of tokens. Laird's algorithm builds a Markov tree and incorporates heuristics that keep the size of the tree from growing excessively large. Regrettably, these methods are unsuitable for the interactive note-taking software because of the difficulty of using them to construct a custom user interface. It is plausible to construct a panel of exclusive choices based directly on the set of follow tokens, but it is unclear how to identify optional choices corresponding to loops in finite-state machines. Moreover, if notes are drawn from different domains, and those domains share even a single token, then some follow set will include tokens from different domains. Using these follow sets to construct a user interface will unnecessarily confuse the user by introducing options from more than one domain at a time.
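The digram method lends itself to a compact illustration. The following Python sketch is a minimal version under stated assumptions: the class name `Digram` and its method names are hypothetical, follow lists are stored as frequency counters rather than explicitly sorted lists, and how ties between equally frequent follow tokens are broken is not specified by the text.

```python
from collections import Counter, defaultdict


class Digram:
    """Remember which tokens have immediately followed each token."""

    def __init__(self):
        # token -> Counter of the tokens observed right after it
        self.follows = defaultdict(Counter)

    def observe(self, tokens):
        """Update follow counts from one note (a list of tokens)."""
        for prev, nxt in zip(tokens, tokens[1:]):
            self.follows[prev][nxt] += 1

    def predict(self, last_token):
        """Return the most frequent follow token, or None if the
        last token has never been seen with a successor."""
        counts = self.follows.get(last_token)
        if not counts:
            return None
        return counts.most_common(1)[0][0]
```

For example, after observing `["Size", "10"]`, `["Size", "12"]`, and `["Size", "10"]`, calling `predict("Size")` returns `"10"`. The sketch also makes the cross-domain problem visible: if two domains share a token, its single follow counter mixes successors from both.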

Column 5 of Table 13 lists the accuracy of prediction based solely on the learned FSMs. Without an embedded classifier, this method must rely on prediction of the most common transition (or termination) from each state. Because the prediction is based on simple counts (as noted in Section 4, Learning Embedded Classifiers), this method never predicts optional transitions.

Columns 6 and 7 of Table 13 list the accuracy of predicting using FSMs and embedded classifiers. The classifiers used are simple Bayes and the incremental ID3, respectively. The latter outperforms either the FSM alone or the FSM with embedded Bayes classifiers. If the system only makes predictions when its confidence measure is greater than 0.25, the accuracy is significantly different for the Engine Code, Minivan, Lens, and Raptor domains, with improvements ranging between 10 and 22 percentage points.

Column 8 of Table 13 lists an estimate of the upper-bound on predictive accuracy. This was calculated by assuming that prediction errors were only made the first time each distinct token was written.
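This estimate can be computed directly from the token stream. The Python sketch below counts an error only on the first occurrence of each distinct token and treats every later occurrence as correctly predicted; the function name `upper_bound_accuracy` is hypothetical, and exact token equality is assumed as the criterion for "distinct".

```python
def upper_bound_accuracy(notes):
    """Estimated best-case percentage of correctly predicted tokens,
    assuming an error only the first time each distinct token appears.

    notes: list of token lists, in the order they are processed.
    """
    seen = set()
    correct = total = 0
    for note in notes:
        for tok in note:
            total += 1
            if tok in seen:
                correct += 1  # token was seen before: assumed predictable
            else:
                seen.add(tok)  # first occurrence: counted as an error
    return 100.0 * correct / total if total else 0.0
```

Under this definition the estimate equals 100 minus the percentage of distinct tokens among all tokens in the processed notes.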

7.2 Design Decisions

The note-taking software embodies a number of design decisions. Table 14 lists the effects of these decisions on predictive accuracy by comparing versions of the software with and without each design feature. The first column lists the predictive accuracy for the software's nominal configuration. Column 2 lists the accuracy data for a slightly different generic tokenizer. Accuracy is higher for some domains, lower for others. A custom-built tokenizer is one way to incorporate knowledge about the domain. Columns 3 and 4 show the accuracy for the system using only the original two FSM merging rules (cf. Table 1) and all but the last merging rule (cf. Table 2), respectively. The decreased structural generality tends to lower predictive accuracy, but the embedded classifiers help compensate for the reduced accuracy. Column 5 lists the accuracy when the FSM does not heuristically continue parsing upon encountering a token for which there is no immediate transition. As expected, accuracy suffers considerably in some domains because a novel token in a sequence completely foils any subsequent prediction. Columns 6 and 7 list accuracy for different values of the free parameter controlling the clustering of notes together into an FSM. There is little effect on predictive accuracy in this case. Column 8 shows the accuracy when embedded classifiers do not use information about repeated states in the FSM. Without this information, the classifiers cannot predict that a loop transition should be taken exactly once. Surprisingly, elimination of this feature has little effect on accuracy. Column 9 lists the accuracy when the embedded classifiers associated with a pair of FSM states are discarded when the states are merged. Finally, Column 10 lists the accuracy when a new FSM state is assigned a unique ID rather than the ID of the oldest of the two merged states.

Table 14: Percentage of tokens correctly predicted as a function of design variations.

7.3 Sample Button-Box Interfaces

In addition to Figure 3, Figures 11 through 15 depict button-box interfaces for the five other well-behaved note-taking domains listed at the top of Table 12. These interfaces are visual and offer the user an organized view of their notes, presenting options in a natural way. However, whenever unique tokens are involved, the current software makes no attempt to explicitly generalize tokens. This effect is reflected in the tour dates for the Airwing notes in Figure 11. Note that the radio button panel consists of a long series of dates, none of which is likely to be selected for a new note.

Figure 11: Screen snapshot of the note-taking software in button-box mode for an airwing note.

Figure 12: Screen snapshot of the note-taking software in button-box mode for a fabric pattern note.

Figure 13: Screen snapshot of the note-taking software in button-box mode for an engine code note.

Figure 14: Screen snapshot of the note-taking software in button-box mode for a minivan note.

Figure 15: Screen snapshot of the note-taking software in button-box mode for a watch note.

8. Related Work