"The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining"
Soft Computing for Knowledge Discovery and Data Mining,
(O. Maimon and L. Rokach, Editors), Springer, Heidelberg, Germany,
Part 4, Chapter 5, pp. 391-431, 2007.
by Huy Nguyen Anh Pham and Evangelos Triantaphyllou
Abstract:
Many classification studies conclude with a summary table that presents the performance results
of applying various data mining approaches to different datasets. No single method outperforms all
other methods all the time. Furthermore, the performance of a classification method in terms of its
false-positive and false-negative rates may be totally unpredictable. Attempts to minimize either of
these two rates may lead to an increase in the other. If the model allows new data to be deemed
unclassifiable when there is not adequate information to classify them, then both of these error
rates can be very low while, at the same time, the rate of unclassifiable new examples is very high.
The root of this critical problem lies in the overfitting and overgeneralization behaviors of a given
classification approach when it processes a particular dataset. Although this situation is of fundamental
importance to data mining, it has not been studied from a comprehensive point of view.
Thus, this chapter analyzes these issues in depth. It also proposes a new approach, called the
Homogeneity-Based Algorithm (HBA), for optimally controlling the three rates described above: the
false-positive, false-negative, and unclassifiable rates. This is done by first formulating an
optimization problem. The key development in this chapter is a special way of analyzing the space of
the training data and then partitioning it according to the data density of its different regions.
Next, the classification task is pursued based on this partitioning of the training space. In this way,
the three rates can be controlled in a comprehensive manner. Some preliminary computational results
seem to indicate that the proposed approach has significant potential to fill a critical gap in
current data mining methodologies.
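
To make the three rates discussed in the abstract concrete, the following is a minimal sketch of how they might be computed for a binary classifier that is allowed to abstain. The function names (error_rates, total_cost), the use of None as the "unclassifiable" marker, the choice to normalize each rate by the total number of examples, and the weighted-sum cost with weights c_fp, c_fn, and c_uc are all assumptions made for illustration; the abstract does not specify the form of the optimization problem.

```python
def error_rates(y_true, y_pred):
    """Return (false-positive, false-negative, unclassifiable) rates.

    Labels are 1 (positive) or 0 (negative). A prediction of None means
    the model declined to classify the example (an assumption of this sketch).
    Each rate is a fraction of all examples, which is also an assumption.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    if n == 0:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if p == 1 and t == 0)  # predicted positive, truly negative
    fn = sum(1 for t, p in pairs if p == 0 and t == 1)  # predicted negative, truly positive
    uc = sum(1 for _, p in pairs if p is None)          # model abstained
    return fp / n, fn / n, uc / n


def total_cost(rates, c_fp=1.0, c_fn=1.0, c_uc=1.0):
    """Hypothetical weighted sum of the three rates; the weights are illustrative."""
    rate_fp, rate_fn, rate_uc = rates
    return c_fp * rate_fp + c_fn * rate_fn + c_uc * rate_uc


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 1, 0, None, None, 1, 0]
    rates = error_rates(y_true, y_pred)
    print(rates)
    print(total_cost(rates, c_fp=1.0, c_fn=20.0, c_uc=3.0))
```

A model that abstains on every example drives the first two rates to zero while the unclassifiable rate reaches one, which is the trade-off the abstract describes. Under a weighted sum such as the hypothetical one above, that trade-off becomes visible in a single number.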