"Prediction of Diabetes by Employing a New Data Mining Approach Which Balances Fitting and Generalization"

To appear in the Proceedings of the 7th IEEE International Conference on Computers and Information Science (ICIS 2008), Portland, Oregon, USA, May 14-16, 2008.

by Huy Nguyen Anh Pham and Evangelos Triantaphyllou

Abstract:
The Pima Indian diabetes (PID) dataset [1], originally donated by Vincent Sigillito of the Applied Physics Laboratory at the Johns Hopkins University, is one of the best-known datasets for testing classification algorithms. It consists of records describing 768 female patients of Pima Indian heritage who are at least 21 years old and live near Phoenix, Arizona, USA. The problem is to diagnose whether a new patient would test positive for diabetes. However, the classification accuracy that current algorithms achieve on this dataset is oftentimes coincidental. The root of this critical problem is the overfitting and overgeneralization behavior of a classification algorithm when it processes a dataset. Although this issue is of fundamental importance in data mining, it has not been studied from a comprehensive point of view. This paper therefore describes a new approach, called the Homogeneity-Based Algorithm (HBA), developed by Pham and Triantaphyllou in [2], to optimally control the overfitting and overgeneralization behaviors of classification on this dataset. The HBA is used in conjunction with traditional classification approaches to enhance their classification accuracy. Some computational results seem to indicate that the proposed approach significantly outperforms current approaches.

Key Words:
Diabetes prediction, data mining, classification accuracy, optimization.
