"An Optimization Approach for Improving Accuracy by Balancing
Overfitting and Overgeneralization in Data Mining"
Submitted for publication to a research journal,
January 2007, pending review.
Huy Anh Nguyen Pham and Evangelos Triantaphyllou
Abstract:
The performance of a classification approach, when measured only in terms of its false-positive and false-negative
rates, may be totally unpredictable. Attempts to minimize either of these two rates may lead to an increase in
the other rate [1]. Furthermore, if the model allows new data to be deemed unclassifiable when there is not
adequate information to classify them, then the previous two error rates may be very low while,
at the same time, the rate of unclassifiable new examples is very high. The root of this critical problem lies in the
overfitting and overgeneralization behavior of a given classification approach when it processes a particular dataset.
Although the above situation is of fundamental importance to data mining, it has not been studied from a comprehensive point of view.
Thus, this paper analyzes the above issues in depth. It also proposes a new approach called the Homogeneity-Based Algorithm (HBA) for optimally
controlling the three error rates mentioned above. This is done by adjusting the classification models inferred from
traditional classification approaches through the formulation of the three error rates as an optimization problem.
Next, the error rates are optimized by employing a genetic algorithm (GA) approach. The HBA works in conjunction with
existing classification approaches. Some computational results indicate that the proposed approach has significant potential to fill
a critical gap in traditional classification approaches.
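
To make the three error rates concrete, the following is a minimal sketch of how they might be measured and combined into a single objective. It is an illustration only, not the HBA itself: the function names `error_rates` and `total_cost`, the penalty weights `c_fp`, `c_fn`, and `c_uc`, the use of `None` to mark an unclassifiable example, and the normalization of every rate by the total number of examples are all assumptions that do not come from the abstract.

```python
from typing import Optional, Sequence, Tuple


def error_rates(
    y_true: Sequence[int], y_pred: Sequence[Optional[int]]
) -> Tuple[float, float, float]:
    """Return (false-positive, false-negative, unclassifiable) rates.

    Assumed encoding: 1 = positive, 0 = negative, None = unclassifiable.
    Each rate is normalized by the total number of examples (an assumption).
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if p == 1 and t == 0)  # negative called positive
    fn = sum(1 for t, p in pairs if p == 0 and t == 1)  # positive called negative
    uc = sum(1 for _, p in pairs if p is None)          # left unclassified
    return fp / n, fn / n, uc / n


def total_cost(
    rates: Tuple[float, float, float],
    c_fp: float = 1.0,
    c_fn: float = 1.0,
    c_uc: float = 1.0,
) -> float:
    """Weighted sum of the three rates; the weights are hypothetical penalties."""
    rate_fp, rate_fn, rate_uc = rates
    return c_fp * rate_fp + c_fn * rate_fn + c_uc * rate_uc


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 1, 0, None, None, 1, 0]
    rates = error_rates(y_true, y_pred)
    print(rates, total_cost(rates, c_fp=1.0, c_fn=3.0, c_uc=2.0))
```

Under these assumptions, an optimizer such as a GA would search over the adjustable parameters of a classification model for the setting that minimizes `total_cost`. Raising `c_uc` discourages a model from lowering its false-positive and false-negative rates simply by declining to classify many examples.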