3. Learning a Syntax

4. Learning Embedded Classifiers

Finite-state machines are useful representations for capturing the syntax of a user's notes, and they are easy to learn. When predicting a note's completion, it is essential that a prediction be made from the correct state in the FSM (as discussed above). It is also necessary to decide whether to terminate (indicating acceptance of the note) or continue prediction, and, in the later case, which transition to predict. To facilitate these decisions, the FSM can maintain a count of how many times parsing terminated and how many times each transition was taken. Prediction can then return the option with the maximum frequency.

Figure 10 depicts a FSM for which this method will prove insufficient. There is only one state, an accepting state, and the transition corresponding to the token "X" is optional. (This corresponds to a check box interface item.) There are two problems with a frequency-based prediction. First, the FSM does not indicate that the transition is to be taken at most once, yet this is quite clear from the user interface. Second, simple frequency-based prediction would always recommend termination and never the transition. The FSM accepts whether the box is checked or not, thus the frequency of termination is greater than or equal to the frequency of the transition. This problem arises whenever there is a loop.

Figure 10: Simple finite-state machine with one state.

Embedding general classifiers in a FSM can alleviate some of the FSM's representational shortcomings. For example, in the FSM depicted in Figure 10, a decision tree embedded in this state easily tests whether the transition has already been taken and can advise against repeating it. Moreover, a classifier can predict based on previous transitions rather than just the frequency of the current state's transitions. Therefore, a decision tree embedded in the state of Figure 10 can predict when the transition should be taken as a function of other, earlier tokens in the sequence. Table 3 lists sample decision trees embedded in states of the FSM depicted in Figure 7. The first tree tests which token was parsed by a distant state, in effect augmenting the FSM representation. It relates memory size to hard disk capacity (small amounts of memory correlate with a small hard disk). The second tree prevents an optional loop from being taken a second time by testing to see if the state has yet been visited during a parse of the note. After processing additional notes, this second decision tree becomes more complex as the system tries to predict which PowerBooks have FAX modems and which do not.


Decision tree embedded in State 3:

If State 1 exited with "2048"
  Then predict " 20"
Else if with "4096"
  Then predict " 40"
Else if with "6144"
  Then predict " 40"
Else if with "8192"
  Then predict " 40" .

Decision tree embedded in State 7:

If State 7 has not been visited
  Then predict " FAX"
Else if State 7 exited with " FAX"
  Then predict " Modem" .

Table 3: Sample decision trees embedded in the finite-state machine depicted in Figure 7.

A classifier is trained for each state in the FSM which: (a) has more than one transition, or (b) is marked as a terminal state but also has a transition. The classifiers are updated incrementally after the user finishes each note. The classifier's training data are token sequences parsed at this state. The class value of the data is the transition taken from, or termination at, this state by the token sequences. Only those classifiers whose states are used in a parse are updated. The attributes of the data are the names of states prior to this one, and the values of the attributes are the transitions taken from those states. A distinct attribute is defined each time a state is visited during a given parse, so when a loop transition is taken a specific attribute reflects this fact. For any of the attributes, if the corresponding state was not visited while parsing the token sequence, the attribute has a special, empty value.

Consider the PowerBook FSM shown in Figure 7. A classifier would be embedded at States 1, 2, 3, 4, 5, 6, 7. A training example corresponding to the note in Example 1 for the classifier at State 6 would be:

Attributes:   Values:
S1          = "4096"
S2          = " 170"
S3          = NIL
S4          = " 40"
S5          = " Drives"
S6          = ", 2400/9600"
S7          = " FAX"
S7-1        = " Modem"
Class:      = :TERMINATE .

Note that there is no value for State 3, denoting that it wasn't visited during the parse of Example 1. Also there are two attributes for State 7 denoting that it has been visited twice.

The classifier gives informed advice about which transition to take or whether to terminate. The FSM in turn gives the classifier a specific context for operation. If only a single classifier were used to predict the next token, it would be hard pressed to represent the different predictions required. The domain is naturally narrowed by the FSM and therefore reduces the representational demands on the classifier. Later, we present empirical results comparing a single classifier to a set of classifiers embedded in a FSM. The findings there show that the latter outperforms the former, confirming the intuition that learning is more effective if situated within a narrow context.

From the classifier's point of view, the learning task is non-stationary. The concept to be learned is changing over time because the structure of the FSM is changing. When two states are merged, one of the two classifiers is discarded. The other is now embedded in a different position in the FSM, and it sees different training data. Similarly, when other states are merged, the attributes of the training data also change. To help mitigate this effect, the new state takes the oldest identifier assigned to the two merged states. Empirical results in Table 14 illustrate that the FSM does not have to be fixed before the classifier can learn useful information.

5. Contextual Prompting