The experiment presented here establishes the baseline performance of the five algorithms. The hypothesis was that ICET would, on average, perform better than the other four algorithms. The classification cost matrix was set to a positive constant value k when the guessed class i does not equal the actual class j, and to $0.00 when i equals j. We experimented with seven settings for k: $10, $50, $100, $500, $1000, $5000, and $10000.
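As an illustration of the cost matrix just described, the following is a minimal sketch. The function name `make_cost_matrix` and the choice of two classes are assumptions made for the example, not details taken from the experiment.

```python
def make_cost_matrix(num_classes, k):
    """Build C, where C[i][j] is the cost of guessing class i
    when the actual class is j: k off the diagonal, 0.0 on it."""
    return [
        [0.0 if i == j else float(k) for j in range(num_classes)]
        for i in range(num_classes)
    ]


# The seven settings for k listed in the text (in dollars).
K_SETTINGS = [10, 50, 100, 500, 1000, 5000, 10000]

# One matrix per setting; two classes is an assumption for illustration only.
matrices = {k: make_cost_matrix(2, k) for k in K_SETTINGS}
```

With this layout, `matrices[50][0][1]` is the penalty for guessing class 0 when the case actually belongs to class 1.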
Initially, we used the average cost of classification as the performance measure, but we found that there are three problems with using the average cost of classification to compare the five algorithms. First, the differences in costs among the algorithms become relatively small as the penalty for classification errors increases. This makes it difficult to see which algorithm is best. Second, it is difficult to combine the results for the five datasets in a fair manner. It is not fair to average the five datasets together, since their test costs have different scales (see Appendix A). The test costs in the Heart Disease dataset, for example, are substantially larger than the test costs in the other four datasets. Third, it is difficult to combine average costs for different values of k in a fair manner, since more weight will be given to the situations where k is large than to the situations where it is small.
To address these concerns, we decided to normalize the average cost of classification. We normalized the average cost by dividing it by the standard cost. Let $f_i$ be the frequency of class i in the given dataset. That is, $f_i$ is the fraction of the cases in the dataset that belong in class i. We calculate $f_i$ using the entire dataset, not just the training set. Let $C_{i,j}$ be the cost of guessing that a case belongs in class i, when it actually belongs in class j. Let T be the total cost of doing all of the possible tests. The standard cost is defined as follows:
$$T + \min_i (1 - f_i) \cdot \max_{i,j} C_{i,j} \tag{6}$$
We can decompose formula (6) into three components:
$$T \tag{7}$$
$$\min_i (1 - f_i) \tag{8}$$
$$\max_{i,j} C_{i,j} \tag{9}$$
We may think of (7) as an upper bound on test expenditures, (8) as an upper bound on error rate, and (9) as an upper bound on the penalty for errors. The standard cost is always less than the maximum possible cost, which is given by the following formula:
$$T + \max_{i,j} C_{i,j} \tag{10}$$
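The sketch below computes the standard cost and the maximum possible cost from the quantities defined above. It assumes the standard cost has the form T + min_i(1 - f_i) * max_{i,j} C_{i,j}, which is what the three-component decomposition suggests, and that the maximum cost is T + max_{i,j} C_{i,j}. The function names and the example numbers are hypothetical.

```python
from collections import Counter


def class_frequencies(labels):
    """Fraction of cases in each class, computed over the entire dataset."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: count / n for cls, count in counts.items()}


def standard_cost(labels, cost_matrix, total_test_cost):
    """Assumed form: T + min_i(1 - f_i) * max_{i,j} C[i][j]."""
    freqs = class_frequencies(labels)
    # Error rate of always guessing the most frequent class.
    error_bound = min(1.0 - f for f in freqs.values())
    # Largest penalty anywhere in the cost matrix.
    max_penalty = max(max(row) for row in cost_matrix)
    return total_test_cost + error_bound * max_penalty


def maximum_cost(cost_matrix, total_test_cost):
    """Assumed form: T + max_{i,j} C[i][j] (wrong on every guess)."""
    return total_test_cost + max(max(row) for row in cost_matrix)


# Hypothetical example: 75% of cases in class 0, k = 100, T = 50.
labels = [0, 0, 0, 1]
C = [[0.0, 100.0], [100.0, 0.0]]
print(standard_cost(labels, C, 50.0))  # 50 + 0.25 * 100 = 75.0
print(maximum_cost(C, 50.0))           # 50 + 100 = 150.0
```

The gap between the two values shows why the standard cost is the tighter of the two bounds: it charges only for the errors made by always guessing the most frequent class.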
Strictly speaking, (8) is not an upper bound on error rate, since it is possible to be wrong with every guess. However, our experiments suggest that the standard cost is better for normalization, since it is a more realistic (tighter) upper bound on the average cost. In our experiments, the average cost never went above the standard cost, although it occasionally came very close.
Figure 3 shows the result of using formula (6) to normalize the average cost of classification. In the plots, the x axis is the value of k and the y axis is the average cost of classification as a percentage of the standard cost of classification. We see that, on average (the sixth plot in Figure 3), ICET has the lowest classification cost. The one dataset where ICET does not perform particularly well is the Heart Disease dataset (we discuss this later, in Sections 4.3.2 and 4.3.3).
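The normalization plotted on the y axis can be sketched as follows. The dataset names and cost figures are invented for illustration and are not results from the experiments. The point of the sketch is that datasets with very different test cost scales become comparable once each average cost is expressed as a percentage of its own standard cost.

```python
def normalized_cost(average_cost, standard_cost):
    """Average cost of classification as a percentage of the standard cost."""
    if standard_cost <= 0:
        raise ValueError("standard cost must be positive")
    return 100.0 * average_cost / standard_cost


# Hypothetical (average cost, standard cost) pairs on very different scales.
results = {
    "dataset_a": (60.0, 300.0),  # expensive tests
    "dataset_b": (6.0, 20.0),    # cheap tests
}

percentages = {
    name: normalized_cost(avg, std) for name, (avg, std) in results.items()
}

# Averaging percentages gives each dataset equal weight, whereas averaging
# the raw costs would let the expensive dataset dominate.
mean_percentage = sum(percentages.values()) / len(percentages)
```

Here `dataset_a` comes out at 20% and `dataset_b` at 30% of their respective standard costs, so the combined figure is 25%, even though the raw costs differ by a factor of ten.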
To come up with a single number that characterizes the performance of each algorithm, we averaged the numbers in the sixth plot in Figure 3. We calculated 95% confidence regions for the averages, using the standard deviations across the 10 random splits of the datasets. The result is shown in Table 5.
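The text does not state how the 95% confidence regions were derived from the standard deviations. One common choice, shown here purely as an assumption, is a Student t interval over the 10 per-split averages. The function name and the sample values are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom,
# appropriate for 10 random splits. Pass a different value for other sizes.
T_CRIT_95_DF9 = 2.262


def mean_with_ci95(values, t_crit=T_CRIT_95_DF9):
    """Return (mean, half_width) so the interval is mean +/- half_width.

    Assumes the values are per-split averages, treated as independent
    samples; this is an illustrative choice, not the documented method.
    """
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean from the sample standard deviation.
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, t_crit * std_err


# Hypothetical normalized costs (percent of standard cost) for 10 splits.
per_split = [40.0, 60.0, 40.0, 60.0, 40.0, 60.0, 40.0, 60.0, 40.0, 60.0]
mean, half_width = mean_with_ci95(per_split)
print(f"{mean:.1f} +/- {half_width:.1f}")
```

Under this assumption, two algorithms would be called significantly different when their intervals do not overlap. A paired test on the per-split differences would be an alternative, since all algorithms share the same splits.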
Table 5 shows the averages for the first three misclassification error costs alone ($10, $50, and $100), in addition to showing the averages for all seven misclassification error costs ($10 to $10000). We have two averages (the two columns in Table 5), based on two groups of data, to address the following argument: As the penalty for misclassification errors increases, the cost of the tests becomes relatively insignificant. With very high misclassification error cost, the test cost is effectively zero, so the task becomes simply to maximize accuracy. As we see in Figure 3, the gap between C4.5 (which maximizes accuracy) and the other algorithms becomes smaller as the cost of misclassification error increases. Therefore the benefit of sensitivity to test cost decreases as the cost of misclassification error increases. It could be argued that one would only bother with an algorithm that is sensitive to test cost when tests are relatively expensive, compared to the cost of misclassification errors. Thus the most realistic measure of performance is to examine the average cost of classification when the cost of tests is the same order of magnitude as the cost of misclassification errors ($10 to $100). This is why Table 5 shows both averages.
Our conclusion, based on Table 5, is that ICET performs significantly better than the other four algorithms when the cost of tests is the same order of magnitude as the cost of misclassification errors ($10, $50, and $100). When the cost of misclassification errors dominates the test costs, ICET still performs better than the competition, but the difference is less significant. The other three cost-sensitive algorithms (EG2, CS-ID3, and IDX) perform significantly better than C4.5 (which ignores cost). The performance of EG2 and IDX is indistinguishable, but CS-ID3 appears to be consistently more costly than EG2 and IDX.