by Huy Nguyen Anh Pham and Evangelos Triantaphyllou
Abstract:
The Pima Indian diabetes (PID) dataset [1], originally donated by Vincent Sigillito from the
Applied Physics Laboratory at the Johns Hopkins University, is one of the most well-known
datasets for testing classification algorithms. This dataset consists of records
describing 768 female patients of Pima Indian heritage, at least 21 years old and
living near Phoenix, Arizona, USA. The problem is to diagnose whether a new patient would test
positive for diabetes. However, the classification accuracy that current algorithms achieve on
this dataset is often coincidental. The root of this critical problem lies in the overfitting
and overgeneralization behaviors of a classification algorithm when it processes a dataset.
Although this issue is of fundamental importance in data mining, it has not been studied from
a comprehensive point of view. This paper therefore describes a new approach,
the Homogeneity-Based Algorithm (HBA) developed by Pham and Triantaphyllou in [2], to
optimally control the overfitting and overgeneralization behaviors of classification on this dataset.
The HBA is used in conjunction with traditional classification approaches to enhance their classification accuracy.
Some computational results seem to indicate that the proposed approach significantly outperforms current approaches.
Key Words:
Diabetes prediction, data mining, classification accuracy, optimization.