Table of Contents

2.1 Decision Trees and Cost-Sensitive Classification


2.2 Calculating the Average Cost of Classification

In this section, we describe how we calculate the average cost of classification for a decision tree, given a set of testing data. The method described here is applied uniformly to the decision trees generated by the five algorithms examined here (EG2, CS-ID3, IDX, C4.5, and ICET). The method assumes only a standard classification decision tree (such as generated by C4.5); it makes no assumptions about how the tree is generated. The purpose of the method is to give a plausible estimate of the average cost that can be expected in a real-world application of the decision tree.

We assume that the dataset has been split into a training set and a testing set. The expected cost of classification is estimated by the average cost of classification for the testing set. The average cost of classification is calculated by dividing the total cost for the whole testing set by the number of cases in the testing set. The total cost includes both the costs of tests and the costs of classification errors. In the simplest case, we assume that we can specify test costs simply by listing each test, paired with its corresponding cost. More complex cases will be considered later in this section. We assume that we can specify the costs of classification errors using a classification cost matrix.

Suppose there are c distinct classes. A classification cost matrix is a matrix, where the element is the cost of guessing that a case belongs in class i, when it actually belongs in class j. We do not need to assume any constraints on this matrix, except that costs are finite, real values. We allow negative costs, which can be interpreted as benefits. However, in the experiments reported here, we have restricted our attention to classification cost matrices in which the diagonal elements are zero (we assume that correct classification has no cost) and the off-diagonal elements are positive numbers.

To calculate the cost of a particular case, we follow its path down the decision tree. We add up the cost of each test that is chosen (i.e., each test that occurs in the path from the root to the leaf). If the same test appears twice, we only charge for the first occurrence of the test. For example, one node in a path may say "patient age is less than 10 years" and another node may say "patient age is more than 5 years", but we only charge once for the cost of determining the patient's age. The leaf of the tree specifies the tree's guess for the class of the case. Given the actual class of the case, we use the cost matrix to determine the cost of the tree's guess. This cost is added to the costs of the tests, to determine the total cost of classification for the case.

This is the core of our method for calculating the average cost of classification of a decision tree. There are two additional elements to the method, for handling conditional test costs and delayed test results.

We allow the cost of a test to be conditional on the choice of prior tests. Specifically, we consider the case where a group of tests shares a common cost. For example, a set of blood tests shares the common cost of collecting blood from the patient. This common cost is charged only once, when the decision is made to do the first blood test. There is no charge for collecting blood for the second blood test, since we may use the blood that was collected for the first blood test. Thus the cost of a test in this group is conditional on whether another member of the group has already been chosen.

Common costs appear frequently in testing. For example, in diagnosis of an aircraft engine, a group of tests may share the common cost of removing the engine from the plane and installing it in a test cell. In semiconductor manufacturing, a group of tests may share the common cost of reserving a region on the silicon wafer for a special test structure. In image recognition, a group of image processing algorithms may share a common preprocessing algorithm. These examples show that a realistic assessment of the cost of using a decision tree will frequently need to make allowances for conditional test costs.

It often happens that the result of a test is not available immediately. For example, a medical doctor typically sends a blood test to a laboratory and gets the result the next day. We allow a test to be labelled either "immediate" or "delayed". If a test is delayed, we cannot use its outcome to influence the choice of the next test. For example, if blood tests are delayed, then we cannot allow the outcome of one blood test to play a role in the decision to do a second blood test. We must make a commitment to doing (or not doing) the second blood test before we know the results of the first blood test.

Delayed tests are relatively common. For example, many medical tests must be shipped to a laboratory for analysis. In gas turbine engine diagnosis, the main fuel control is frequently shipped to a specialized company for diagnosis or repair. In any classification problem that requires multiple experts, one of the experts might not be immediately available.

We handle immediate tests in a decision tree as described above. We handle delayed tests as follows. We follow the path of a case from the root of the decision tree to the appropriate leaf. If we encounter a node, anywhere along this path, that is a delayed test, we are then committed to performing all of the tests in the subtree that is rooted at this node. Since we cannot make the decision to do tests below this node conditional on the outcome of the test at this node, we must pledge to pay for all the tests that we might possibly need to perform, from this point onwards in the decision tree.

Our method for handling delayed tests may seem a bit puzzling at first. The difficulty is that a decision tree combines a method for selecting tests with a method for classifying cases. When tests are delayed, we are forced to proceed in two phases. In the first phase, we select tests. In the second phase, we collect test results and classify the case. For example, a doctor collects blood from a patient and sends the blood to a laboratory. The doctor must tell the laboratory what tests are to be done on the blood. The next day, the doctor gets the results of the tests from the laboratory and then decides on the diagnosis of the patient. A decision tree does not naturally handle a situation like this, where the selection of tests is isolated from the classification of cases. In our method, in the first phase, the doctor uses the decision tree to select the tests. As long as the tests are immediate, there is no problem. As soon as the first delayed test is encountered, the doctor must select all the tests that might possibly be needed in the second phase. That is, the doctor must select all the tests in the subtree rooted at the first delayed test. In the second phase, when the test results arrive the next day, the doctor will have all the information required to go from the root of the tree to a leaf, to make a classification. The doctor must pay for all of the tests in the subtree, even though only the tests along one branch of the subtree will actually be used. The doctor does not know in advance which branch will actually be used, at the time when it is necessary to order the blood tests. The laboratory that does the blood tests will naturally want the doctor to pay for all the tests that were ordered, even if they are not all used in making the diagnosis.

In general, it makes sense to do all of the desired immediate tests before we do any of the desired delayed tests, since the outcome of an immediate test can be used to influence the decision to do a delayed test, but not vice versa. For example, a medical doctor will question a patient (questions are immediate tests) before deciding what blood tests to order (blood tests are delayed tests).

When all of the tests are delayed (as they are in the BUPA data in Appendix A.1), we must decide in advance (before we see any test results) what tests are to be performed. For a given decision tree, the total cost of tests will be the same for all cases. In situations of this type, the problem of minimizing cost simplifies to the problem of choosing the best subset of the set of available tests (Aha and Bankert, 1994). The sequential order of the tests is no longer important for reducing cost.

Let us consider a simple example to illustrate the method. Table 1 shows the test costs for four tests. Two of the tests are immediate and two are delayed. The two delayed tests share a common cost of $2.00. There are two classes, 0 and 1. Table 2 shows the classification cost matrix. Figure 1 shows a decision tree. Table 3 traces the path through the tree for a particular case and shows how the cost is calculated. The first step is to do the test at the root of the tree (test alpha). In the second step, we encounter a delayed test (delta), so we must calculate the cost of the entire subtree rooted at this node. Note that epsilon only costs $8.00, since we have already selected delta, and delta and epsilon have a common cost. In the third step, we do test epsilon, but we do not need to pay, since we already paid in the second step. In the fourth step, we guess the class of the case. Unfortunately, we guess incorrectly, so we pay a penalty of $50.00.

In summary, this section presents a method for estimating the average cost of using a given decision tree. The decision tree can be any standard classification decision tree; no special assumptions are made about the tree; it does not matter how the tree was generated. The method requires (1) a decision tree (Figure 1), (2) information on the calculation of test costs (Table 1), (3) a classification cost matrix (Table 2), and (4) a set of testing data (Table 3). The method is (i) sensitive to the cost of tests, (ii) sensitive to the cost of classification errors, (iii) capable of handling conditional test costs, and (iv) capable of handling delayed tests. In the experiments reported in Section 4, this method is applied uniformly to all five algorithms.


2.3 Cost and Accuracy