"Prediction of Diabetes by Employing a New Data Mining Approach Which Balances Fitting and Generalization"
Computers and Information Science,
a volume in the Springer book series “Studies in Computational Intelligence,”
(Roger Yin Lee, Editor), Springer, Heidelberg, Germany,
Vol. 131, Chapter 2, pp. 11-26, 2008.
by Huy Nguyen Anh Pham and Evangelos Triantaphyllou
Abstract:
The Pima Indian diabetes (PID) dataset [1], originally donated by Vincent Sigillito from the Applied Physics
Laboratory at the Johns Hopkins University, is one of the best-known datasets for testing classification
algorithms. This dataset consists of records describing 768 female patients of Pima Indian heritage who are
at least 21 years old and live near Phoenix, Arizona, USA. The problem is to predict whether a new patient
would test positive for diabetes. However, the classification accuracy that current algorithms achieve on
this dataset is often coincidental. The root of this problem is the overfitting and
overgeneralization behavior of a classification algorithm when it processes a dataset.
Although this issue is of fundamental importance in data mining, it has not been studied from
a comprehensive point of view. This paper therefore describes a new approach,
the Homogeneity-Based Algorithm (HBA), developed by Pham and Triantaphyllou [2-3], to optimally
control the overfitting and overgeneralization behavior of classification on this dataset.
The HBA is used in conjunction with traditional classification approaches (such as Support Vector Machines (SVMs),
Artificial Neural Networks (ANNs), or Decision Trees (DTs)) to enhance their classification accuracy.
Computational results suggest that the proposed approach significantly outperforms
current approaches.