"An Optimization Approach for Improving Accuracy by Balancing Overfitting and Overgeneralization in Data Mining"

Submitted for publication to a research journal, January 2007, pending review.

Huy Anh Nguyen Pham and Evangelos Triantaphyllou

Abstract:
The performance of a classification approach, when judged only in terms of its false-positive and false-negative rates, may be totally unpredictable. Attempts to minimize either of these two rates may lead to an increase in the other [1]. Furthermore, if the model allows new data to be deemed unclassifiable when there is not adequate information to classify them, then both of these error rates may be very low while, at the same time, the rate of unclassifiable new examples is very high. The root of this critical problem is the overfitting and overgeneralization behavior of a given classification approach when it processes a particular dataset. Although this situation is of fundamental importance to data mining, it has not been studied from a comprehensive point of view. This paper therefore analyzes these issues in depth. It also proposes a new approach, called the Homogeneity-Based Algorithm (HBA), for optimally controlling the three error rates mentioned above. This is done by formulating the three error rates as an optimization problem and adjusting the classification models inferred by traditional classification approaches accordingly. The error rates are then optimized by employing a genetic algorithm (GA) approach. The HBA works in conjunction with existing classification approaches. Some computational results indicate that the proposed approach has significant potential to fill a critical gap in traditional classification approaches.
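To make the three error rates concrete, the following minimal Python sketch computes them for a two-class problem and combines them into a single weighted cost. It is an illustration under stated assumptions, not the authors' formulation: the function names `error_rates` and `total_cost`, the penalty weights `c_fp`, `c_fn`, and `c_uc`, the use of `None` to mark an unclassifiable example, the choice to express each rate as a fraction of all examples, and the weighted-sum objective itself are all hypothetical. The abstract states only that the three rates are formulated as an optimization problem.

```python
from typing import Optional, Sequence, Tuple


def error_rates(
    y_true: Sequence[int], y_pred: Sequence[Optional[int]]
) -> Tuple[float, float, float]:
    """Return (false-positive, false-negative, unclassifiable) rates.

    Assumptions: labels are 1 (positive) or 0 (negative), a prediction of
    None means the model declined to classify the example, and each rate
    is a fraction of the total number of examples.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if p == 1 and t == 0)
    fn = sum(1 for t, p in pairs if p == 0 and t == 1)
    uc = sum(1 for _, p in pairs if p is None)
    return fp / n, fn / n, uc / n


def total_cost(
    rates: Tuple[float, float, float],
    c_fp: float = 1.0,
    c_fn: float = 1.0,
    c_uc: float = 1.0,
) -> float:
    """Hypothetical objective: a weighted sum of the three rates."""
    fp_rate, fn_rate, uc_rate = rates
    return c_fp * fp_rate + c_fn * fn_rate + c_uc * uc_rate


# Five examples: one false positive, one false negative, two unclassifiable.
rates = error_rates([1, 0, 1, 0, 1], [1, 1, 0, None, None])
cost = total_cost(rates, c_fp=1.0, c_fn=20.0, c_uc=3.0)
```

Under this assumed objective, a search procedure such as a genetic algorithm would look for model adjustments that lower `cost`; the weights let an application state, for instance, that a false negative is far more costly than leaving an example unclassified.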

Key Words:
Data mining, classification, prediction, overfitting, overgeneralization, false-positive, false-negative, homogenous set, homogeneity degree, optimization, genetic algorithms.




Download this paper as a PDF file (size = 460 KB). At the present time it is not available to the general public.



