TABLE OF CONTENTS Click here to download the Table of Contents as a PDF file Foreword........................................................vii Preface..........................................................xi Acknowledgments................................................xvii List of Figures...............................................xxvii List of Tables.................................................xxxi ___________________________________________________ PART I Theory and Algorithmic Issues ___________________________________________________ Chapter 1 INTRODUCTION......................................................3 1.1 What is Data Mining and Knowledge Discovery?..............3 1.2 Some Potential Application Areas for Data Mining and Knowledge Discovery...................................4 1.2.1 Applications in Engineering.......................5 1.2.2 Applications In Medical Sciences..................5 1.2.3 Applications In the Basic Sciences................6 1.2.4 Applications In Business..........................6 1.2.5 Applications In the Political and Social Sciences...................................7 1.3 The Data Mining and Knowledge Discovery Process...........7 1.3.1 Problem Definition................................7 1.3.2 Collecting the Data...............................9 1.3.3 Data Preprocessing...............................10 1.3.4 Application of the Main Data Mining and Knowledge Discovery Algorithms...................11 1.3.5 Interpretation of the Results of the Data Mining and Knowledge Discovery Process...........12 1.4 Four Key Research Challenges in Data Mining and Knowledge Discovery......................................12 1.4.1 Collecting Observations about the Behavior of a System......................................13 1.4.2 Identifying Patterns from Collections of Data....14 1.4.3 Which Data to Consider for Evaluation Next?......17 1.4.4 Do Patterns Always Exist in Data?................19 1.5 Concluding Remarks.......................................20 Chapter 2 INFERRING A BOOLEAN FUNCTION FROM POSITIVE AND NEGATIVE EXAMPLES............................................21 2.1 Introduction.............................................21 2.2 Some Background Information..............................22 2.3 Data Binarization........................................26 2.4 Definitions and Terminology..............................29 2.5 Generating Clauses from Negative Examples Only...........32 2.6 Clause Inference as a Satisfiability Problem.............33 2.7 An SAT Approach for Inferring CNF Clauses................34 2.8 The One Clause At a Time (OCAT) Concept..................35 2.9 A Branch-and-Bound Approach for Inferring a Single Clause............................38 2.10 A Heuristic for Problem Preprocessing....................45 2.11 Some Computational Results...............................47 2.12 Concluding Remarks.......................................50 APPENDIX The SAT Formulation for the Illustrative Example (the CNF case)...................................................52 Chapter 3 A REVISED BRANCH-AND-BOUND APPROACH FOR INFERRING A BOOLEAN FUNCTION FROM EXAMPLES...................................57 3.1 Some Background Information..............................57 3.2 The Revised Branch-and-Bound Algorithm...................57 3.2.1 Generating a Single CNF Clause...................58 3.2.2 Generating a Single DNF Clause...................62 3.3 Some Computational Results...............................64 3.4 Concluding Remarks.......................................69 Chapter 4 SOME FAST HEURISTICS FOR INFERRING A BOOLEAN FUNCTION FROM EXAMPLES....................................................73 4.1 Some Background Information..............................73 4.2 A Fast Heuristic for Inferring a Boolean Function From Complete Data.......................................75 4.3 A Fast Heuristic for Inferring a Boolean Function From Incomplete Data.....................................80 4.4 Some Computational Results...............................84 4.4.1 Results for the RA1 Algorithm on the Wisconsin Cancer Data............................86 4.4.2 Results for the RA2 Heuristic on the Wisconsin Cancer Data with Some Missing Values...91 4.4.3 Comparison of the RA1 Algorithm and the B&B Method by Using Large Random Data Sets...........92 4.5 Concluding Remarks.......................................98 Chapter 5 AN APPROACH TO GUIDED LEARNING OF BOOLEAN FUNCTIONS.............101 5.1 Some Background Information.............................101 5.2 Problem Description.....................................104 5.3 The Proposed Approach...................................105 5.4 On the Number of Candidate Solutions....................110 5.5 An Illustrative Example.................................111 5.6 Some Computational Results..............................113 5.7 Concluding Remarks......................................122 Chapter 6 AN INCREMENTAL LEARNING ALGORITHM FOR INFERRING BOOLEAN FUNCTIONS...............................................125 6.1 Some Background Information.............................125 6.2 Problem Description.....................................126 6.3 Some Related Developments...............................127 6.4 The Proposed Incremental Algorithm......................130 6.4.1 Repairing a Boolean Function that Incorrectly Rejects a Positive Example..........131 6.4.2 Repairing of a Boolean Function that Incorrectly Accepts a Negative Example..........133 6.4.3 Computational Complexity of the Algorithms for the ILE Approach............................134 6.5 Experimental Data.......................................134 6.6 Analysis of the Computational Results...................135 6.6.1 Results on the Classification Accuracy..........136 6.6.2 Results on the Number of Clauses................139 6.6.3 Results on the CPU Times........................141 6.7 Concluding Remarks......................................144 Chapter 7 A DUALITY RELATIONSHIP BETWEEN BOOLEAN FUNCTIONS IN CNF AND DNF DERIVABLE FROM THE SAME TRAINING EXAMPLES...............147 7.1 Introduction............................................147 7.2 Generating Boolean Functions in CNF and DNF.............147 7.3 An Example of Deriving Boolean Functions in CNF and DNF.............................................148 7.4 Some Computational Results..............................149 7.5 Concluding Remarks......................................150 Chapter 8 THE REJECTABILITY GRAPH OF TWO SETS OF EXAMPLES.................151 8.1 Introduction............................................151 8.2 The Definition of the Rejectability Graph...............152 8.2.1 Properties of the Rejectability Graph...........153 8.2.2 On the Minimum Clique Cover of the Rejectability Graph.............................155 8.3 Problem Decomposition...................................156 8.3.1 Connected Components............................156 8.3.2 Clique Cover....................................157 8.4 An Example of Using the Rejectability Graph.............158 8.5 Some Computational Results..............................160 8.6 Concluding Remarks......................................170 ___________________________________________________ PART II Application Issues ___________________________________________________ Chapter 9 THE RELIABILITY ISSUE IN DATA MINING: THE CASE OF COMPUTER-AIDED BREAST CANCER DIAGNOSIS..........................173 9.1 Introduction............................................173 9.2 Some Background Information on Computer-Aided Breast Cancer Diagnosis.................................173 9.3 Reliability Criteria....................................175 9.4 The Representation / Narrow Vicinity Hypothesis.........178 9.5 Some Computational Results..............................181 9.6 Concluding Remarks......................................183 APPENDIX I Definitions of the Key Attributes...............................185 APPENDIX II Technical Procedures............................................187 9.A.1 The Interactive Approach................................187 9.A.2 The Hierarchical Approach...............................188 9.A.3 The Monotonicity Property...............................188 9.A.4 Logical Discriminant Functions..........................189 Chapter 10 DATA MINING AND KNOWLEDGE DISCOVERY BY MEANS OF MONOTONE BOOLEAN FUNCTIONS......................................191 10.1 Introduction............................................191 10.2 Background Information..................................193 10.2.1 Problem Descriptions............................193 10.2.2 Hierarchical Decomposition of Variables.........196 10.2.3 Some Key Properties of Monotone Boolean Functions...............................197 10.2.4 Existing Approaches to Problem 1................201 10.2.5 An Existing Approach to Problem 2...............203 10.2.6 Existing Approaches to Problem 3................204 10.2.7 Stochastic Models for Problem 3.................204 10.3 Inference Objectives and Methodology....................206 10.3.1 The Inference Objective for Problem 1...........206 10.3.2 The Inference Objective for Problem 2...........207 10.3.3 The Inference Objective for Problem 3...........208 10.3.4 Incremental Updates for the Fixed Misclassification Probability Model.....................208 10.3.5 Selection Criteria for Problem 1................209 10.3.6 Selection Criteria for Problems 2.1, 2.2, and 2.3......................210 10.3.7 Selection Criterion for Problem 3...............210 10.4 Experimental Results....................................215 10.4.1 Experimental Results for Problem 1..............215 10.4.2 Experimental Results for Problem 2..............217 10.4.3 Experimental Results for Problem 3..............219 10.5 Summary and Discussion..................................223 10.5.1 Summary of the Research Findings................223 10.5.2 Significance of the Research Findings...........225 10.5.3 Future Research Directions......................226 10.6 Concluding Remarks......................................227 Chapter 11 SOME APPLICATION ISSUES ON MONOTONE BOOLEAN FUNCTIONS...........229 11.1 Some Background Information.............................229 11.2 Expressing Any Boolean Function in Terms of Monotone Ones...........................................229 11.3 Formulations of Diagnostic Problems as the Inference of Nested Monotone Boolean Functions....................231 11.3.1 An Application to a Reliability Engineering Problem.............................231 11.3.2 An Application to the Breast Cancer Diagnosis Problem...............................232 11.4 Design Problems.................................233 11.5 Process Diagnosis Problems......................234 11.6 Three Major Illusions in the Evaluation of the Accuracy of Data Mining Models......................234 11.6.1 First Illusion: The Single Index Accuracy Rate...................................235 11.6.2 Second Illusion: Accurate Diagnosis without Hard Cases..............................235 11.6.3 Third Illusion: High Accuracy on Random Test Data Only..................................236 11.7 Identification of the Monotonicity Property.............236 11.8 Concluding Remarks......................................239 Chapter 12 MINING OF ASSOCIATION RULES.....................................241 12.1 Some Background Information.............................241 12.2 Problem Description.....................................243 12.3 Methodology.............................................244 12.3.1 Some Related Algorithms.........................244 12.3.2 Alterations to the RA1 Algorithm................245 12.4 Computational Experiments...............................247 12.5 Concluding Remarks......................................255 Chapter 13 DATA MINING OF TEXT DOCUMENTS...................................257 13.1 Some Background Information.....................257 13.2 A Brief Description of the Document Clustering Process..............................259 13.3 Using the OACT Approach to Classify Text Documents..................................260 13.4 An Overview of the Vector Space Model...........262 13.5 A Guided Learning Approach for the Classification of Text Documents..................................264 13.6 Experimental Data...............................265 13.7 Testing Methodology.............................267 13.7.1 The Leave-One-Out Cross Validation..............267 13.7.2 The 30/30 Cross Validation......................267 13.7.3 Statistical Performance of Both Algorithms......267 13.7.4 Experimental Setting for the Guided Learning Approach........................................268 13.8 Results for the Leave-One-Out and the 30/30 Cross Validations.............................................269 13.9 Results for the Guided Learning Approach................272 13.10 Concluding Remarks......................................275 Chapter 14 FIRST CASE STUDY: PREDICTING MUSCLE FATIGUE FROM EMG SIGNALS...277 14.1 Introduction............................................277 14.2 General Problem Description.............................277 14.3 Experimental Data.......................................279 14.4 Analysis of the EMG Data................................280 14.4.1 The Effects of Load and Electrode Orientation...280 14.4.2 The Effects of Muscle Condition, Load, and Electrode.............................280 14.5 A Comparative Analysis of the EMG Data..................281 14.5.1 Results by the OCAT / RA1 Approach..............282 14.5.2 Results by the Fisher's Linear Discriminant Analysis...........................283 14.5.3 Results by Logistic Regression..................284 14.5.4 A Neural Network Approach.......................285 14.6 Concluding Remarks......................................287 Chapter 15 SECOND CASE STUDY: INFERENCE OF DIAGNOSTIC RULES FOR BREAST CANCER...................................................289 15.1 Introduction............................................289 15.2 Description of the Data Set.............................289 15.3 Description of the Inferred Rules.......................292 15.4 Concluding Remarks......................................296 Chapter 16 A FUZZY LOGIC APPROACH TO ATTRIBUTE FORMALIZATION: ANALYSIS OF LOBULATION FOR BREAST CANCER DIAGNOSIS.......................297 16.1 Introduction............................................297 16.2 Some Background Information on Digital Mammography.............................................297 16.3 Some Background Information on Fuzzy Sets...............299 16.4 Formalization with Fuzzy Logic..................300 16.5 Degrees of Lobularity and Microlobularity.......306 16.6 Concluding Remarks......................................308 Chapter 17 CONCLUSIONS....................................................309 17.1 General Concluding Remarks.............................309 17.2 Twelve Key Areas of Potential Future Research on Data Mining and Knowledge Discovery from Databases..........310 17.2.1 Overfitting and Overgeneralization.............310 17.2.2 Guided Learning................................311 17.2.3 Stochasticity..................................311 17.2.4 More on Monotonicity...........................311 17.2.5 Visualization..................................311 17.2.6 Systems for Distributed Environments...........312 17.2.7 Developing Better Exact Algorithms and Heuristics.................................312 17.2.8 Hybridization and Other Algorithmic Issues...312 17.2.9 Systems with Self Explanatory Capabilities.....313 17.2.10 New Systems for Image Analysis.................313 17.2.11 Systems for Web Applications...................313 17.2.12 Developing More Applications...................314 17.3 Epilogue...............................................314 References.....................................................317 Subject Index..................................................335 Author Index...................................................345 About the Author...............................................349 Visit Dr. Triantaphyllou's Homepage Dr. Triantaphyllou's Books / Special Issues web site Send suggestions / comments to Dr. E. Triantaphyllou (trianta@lsu.edu). |