|
Data Mining and Knowledge Discovery Via Logic-Based Methods: Theory, Algorithms, and Applications
by
Evangelos Triantaphyllou
A monograph expected to be published by
the end of fall 2009 or early spring 2010 by Springer.
It is 418 pages long.
ISBN ****************
|
TABLE OF CONTENTS
List of Figures..................................................xv
List of Tables..................................................xxi
Foreword......................................................xxvii
Preface........................................................xxxi
Acknowledgments..............................................xxxvii
STRONG>
Chapter 1
INTRODUCTION......................................................1
1.1 What is Data Mining and Knowledge Discovery?..............1
1.2 Some Potential Application Areas for Data Mining
and Knowledge Discovery...................................2
1.2.1 Applications in Engineering.......................2
1.2.2 Applications In Medical Sciences..................2
1.2.3 Applications In the Basic Sciences................3
1.2.4 Applications In Business..........................4
1.2.5 Applications In the Political and
Social Sciences...................................4
1.3 The Data Mining and Knowledge Discovery Process...........5
1.3.1 Problem Definition................................6
1.3.2 Collecting the Data...............................6
1.3.3 Data Pre-Processing...............................8
1.3.4 Application of the Main Data Mining and
Knowledge Discovery Algorithms....................8
1.3.5 Interpretation of the Results of the Data
Mining and Knowledge Discovery Process...........10
1.4 Four Key Research Challenges in Data Mining and
Knowledge Discovery......................................10
1.4.1 Collecting Observations about the Behavior
of a System......................................11
1.4.2 Identifying Patterns from Collections of Data....11
1.4.3 Which Data to Consider for Evaluation Next?......15
1.4.4 Do Patterns Always Exist in Data?................17
1.5 Concluding Remarks.......................................18
Chapter 2
INFERRING A BOOLEAN FUNCTION FROM POSITIVE
AND NEGATIVE EXAMPLES............................................19
2.1 Introduction.............................................19
2.2 Some Background Information..............................20
2.3 Data Binarization........................................24
2.4 Definitions and Terminology..............................28
2.5 Generating Clauses from Negative Examples Only...........31
2.6 Clause Inference as a Satisfiability Problem.............32
2.7 An SAT Approach for Inferring CNF Clauses................33
2.8 The One Clause At a Time (OCAT) Concept..................34
2.9 A Branch-and-Bound Approach
for Inferring a Single Clause............................38
2.10 A Heuristic for Problem Pre-Processing...................45
2.11 Some Computational Results...............................47
2.12 Concluding Remarks.......................................51
APPENDIX
The SAT Formulation for the Illustrative Example
(the CNF case)...................................................53
Chapter 3
A REVISED BRANCH-AND-BOUND APPROACH FOR INFERRING A
BOOLEAN FUNCTION FROM EXAMPLES...................................55
3.1 Some Background Information..............................55
3.2 The Revised Branch-and-Bound Algorithm...................55
3.2.1 Generating a Single CNF Clause...................56
3.2.2 Generating a Single DNF Clause...................61
3.3 Some Computational Results...............................62
3.4 Concluding Remarks.......................................69
Chapter 4
SOME FAST HEURISTICS FOR INFERRING A BOOLEAN FUNCTION
FROM EXAMPLES....................................................71
4. 1 Some Background Information..............................71
4.2 A Fast Heuristic for Inferring a Boolean Function
From Complete Data.......................................73
4.3 A Fast Heuristic for Inferring a Boolean Function
From Incomplete Data.....................................79
4.4 Some Computational Results...............................83
4.4.1 Results for the RA1 Algorithm on the
Wisconsin Cancer Data............................85
4.4.2 Results for the RA2 Heuristic on the
Wisconsin Cancer Data with Some Missing Values...90
4.4.3 Comparison of the RA1 Algorithm and the B&B
Method by Using Large Random Data Sets...........93
4.5 Concluding Remarks......................................100
Chapter 5
AN APPROACH TO GUIDED LEARNING OF BOOLEAN FUNCTIONS.............103
5.1 Some Background Information.............................103
5.2 Problem Description.....................................106
5.3 The Proposed Approach...................................108
5.4 On the Number of Candidate Solutions....................113
5.5 An Illustrative Example.................................114
5.6 Some Computational Results..............................116
5.7 Concluding Remarks......................................127
Chapter 6
AN INCREMENTAL LEARNING ALGORITHM FOR INFERRING
BOOLEAN FUNCTIONS...............................................129
6.1 Some Background Information.............................129
6.2 Problem Description.....................................130
6.3 Some Related Developments...............................131
6.4 The Proposed Incremental Algorithm......................135
6.4.1 Repairing a Boolean Function that
Incorrectly Rejects a Positive Example..........135
6.4.2 Repairing of a Boolean Function that
Incorrectly Accepts a Negative Example..........138
6.4.3 Computational Complexity of the Algorithms
for the ILE Approach............................139
6.5 Experimental Data.......................................140
6.6 Analysis of the Computational Results...................140
6.6.1 Results on the Classification Accuracy..........141
6.6.2 Results on the Number of Clauses................144
6.6.3 Results on the CPU Times........................147
6.7 Concluding Remarks......................................149
Chapter 7
A DUALITY RELATIONSHIP BETWEEN BOOLEAN FUNCTIONS IN CNF
AND DNF DERIVABLE FROM THE SAME TRAINING EXAMPLES...............151
7.1 Introduction............................................151
7.2 Generating Boolean Functions in CNF and DNF.............152
7.3 An Example of Deriving Boolean Functions in
CNF and DNF.............................................152
7.4 Some Computational Results..............................153
7.5 Concluding Remarks......................................155
Chapter 8
THE REJECTABILITY GRAPH OF TWO SETS OF EXAMPLES.................157
8.1 Introduction............................................157
8.2 The Definition of the Rejectability Graph...............158
8.2.1 Properties of the Rejectability Graph...........160
8.2.2 On the Minimum Clique Cover of the
Rejectability Graph.............................162
8.3 Problem Decomposition...................................163
8.3.1 Connected Components............................163
8.3.2 Clique Cover....................................164
8.4 An Example of Using the Rejectability Graph.............165
8.5 Some Computational Results..............................167
8.6 Concluding Remarks......................................175
Chapter 9
THE RELIABILITY ISSUE IN DATA MINING: THE CASE OF
COMPUTER-AIDED BREAST CANCER DIAGNOSIS..........................177
9.1 Introduction............................................177
9.2 Some Background Information on Computer-Aided
Breast Cancer Diagnosis.................................177
9.3 Reliability Criteria....................................179
9.4 The Representation / Narrow Vicinity Hypothesis.........183
9.5 Some Computational Results..............................185
9.6 Concluding Remarks......................................189
APPENDIX I
Definitions of the Key Attributes...............................191
APPENDIX II
Technical Procedures............................................193
9.A.1 The Interactive Approach................................193
9.A.2 The Hierarchical Approach...............................194
9.A.3 The Monotonicity Property...............................194
9.A.4 Logical Discriminant Functions..........................195
Chapter 10
DATA MINING AND KNOWLEDGE DISCOVERY BY MEANS OF
MONOTONE BOOLEAN FUNCTIONS......................................197
10.1 Introduction............................................197
10.2 Background Information..................................199
10.2.1 Problem Descriptions............................199
10.2.2 Hierarchical Decomposition of Variables.........202
10.2.3 Some Key Properties of Monotone
Boolean Functions...............................204
10.2.4 Existing Approaches to Problem 1................209
10.2.5 An Existing Approach to Problem 2...............211
10.2.6 Existing Approaches to Problem 3................211
10.2.7 Stochastic Models for Problem 3.................211
10.3 Inference Objectives and Methodology....................214
10.3.1 The Inference Objective for Problem 1...........214
10.3.2 The Inference Objective for Problem 2...........215
10.3.3 The Inference Objective for Problem 3...........215
10.3.4 Incremental Updates for the Fixed
Misclassification Probability Model.....................216
10.3.5 Selection Criteria for Problem 1................216
10.3.6 Selection Criteria for
Problems 2.1, 2.2, and 2.3......................217
10.3.7 Selection Criterion for Problem 3...............218
10.4 Experimental Results....................................223
10.4.1 Experimental Results for Problem 1..............223
10.4.2 Experimental Results for Problem 2..............225
10.4.3 Experimental Results for Problem 3..............228
10.5 Summary and Discussion..................................232
10.5.1 Summary of the Research Findings................232
10.5.2 Significance of the Research Findings...........234
10.5.3 Future Research Directions......................235
10.6 Concluding Remarks......................................236
Chapter 11
SOME APPLICATION ISSUES ON MONOTONE BOOLEAN FUNCTIONS...........237
11.1 Some Background Information.............................237
11.2 Expressing Any Boolean Function in Terms of
Monotone Ones...........................................237
11.3 Formulations of Diagnostic Problems as the Inference
of Nested Monotone Boolean Functions....................239
11.3.1 An Application to a Reliability
Engineering Problem.............................239
11.3.2 An Application to the Breast Cancer
Diagnosis Problem...............................240
11.4 Design Problems.................................241
11.5 Process Diagnosis Problems......................243
11.6 Three Major Illusions in the Evaluation of
the Accuracy of Data Mining Models......................243
11.6.1 First Illusion: The Single Index
Accuracy Rate...................................243
11.6.2 Second Illusion: Accurate Diagnosis
without Hard Cases..............................244
11.6.3 Third Illusion: High Accuracy on Random
Test Data Only..................................244
11.7 Identification of the Monotonicity Property.............245
11.8 Concluding Remarks......................................248
Chapter 12
MINING OF ASSOCIATION RULES.....................................249
12.1 Some Background Information.............................249
12.2 Problem Description.....................................251
12.3 Methodology.............................................252
12.3.1 Some Related Algorithms.........................252
12.3.2 Alterations to the RA1 Algorithm................253
12.4 Computational Experiments...............................257
12.5 Concluding Remarks......................................264
Chapter 13
DATA MINING OF TEXT DOCUMENTS...................................267
13.1 Some Background Information.....................267
13.2 A Brief Description of the Document
Clustering Process..............................270
13.3 Using the OACT Approach to Classify
Text Documents..................................271
13.4 An Overview of the Vector Space Model...........273
13.5 A Guided Learning Approach for the Classification of
Text Documents..................................276
13.6 Experimental Data...............................276
13.7 Testing Methodology.............................278
13.7.1 The Leave-One-Out Cross Validation..............279
13.7.2 The 30/30 Cross Validation......................279
13.7.3 Statistical Performance of Both Algorithms......279
13.7.4 Experimental Setting for the Guided Learning
Approach........................................280
13.8 Results for the Leave-One-Out and the 30/30 Cross
Validations.............................................280
13.9 Results for the Guided Learning Approach................284
13.10 Concluding Remarks......................................288
Chapter 14
FIRST CASE STUDY: PREDICTING MUSCLE FATIGUE FROM EMG SIGNALS...289
14.1 Introduction............................................289
14.2 General Problem Description.............................289
14.3 Experimental Data.......................................290
14.4 Analysis of the EMG Data................................291
14.4.1 The Effects of Load and Electrode Orientation...291
14.4.2 The Effects of Muscle Condition,
Load, and Electrode.............................292
14.5 A Comparative Analysis of the EMG Data..................293
14.5.1 Results by the OCAT / RA1 Approach..............294
14.5.2 Results by the Fisher's Linear
Discriminant Analysis...........................295
14.5.3 Results by Logistic Regression..................296
14.5.4 A Neural Network Approach.......................297
14.6 Concluding Remarks......................................299
Chapter 15
SECOND CASE STUDY: INFERENCE OF DIAGNOSTIC RULES FOR
BREAST CANCER...................................................301
15.1 Introduction............................................301
15.2 Description of the Data Set.............................301
15.3 Description of the Inferred Rules.......................305
15.4 Concluding Remarks......................................310
Chapter 16
A FUZZY LOGIC APPROACH TO ATTRIBUTE FORMALIZATION: ANALYSIS
OF LOBULATION FOR BREAST CANCER DIAGNOSIS.......................311
16.1 Introduction............................................311
16.2 Some Background Information on Digital
Mammography.............................................311
16.3 Some Background Information on Fuzzy Sets...............313
16.4 Formalization with Fuzzy Logic..................316
16.5 Degrees of Lobularity and Microlobularity.......321
16.6 Concluding Remarks......................................324
Chapter 17
CONCLUSIONS....................................................325
17.1 General Concluding Remarks.............................325
17.2 Twelve Key Areas of Potential Future Research on Data
Mining and Knowledge Discovery from Databases..........326
17.2.1 Overfitting and Overgeneralization.............326
17.2.2 Guided Learning................................326
17.2.3 Stochasticity..................................327
17.2.4 More on Monotonicity...........................327
17.2.5 Visualization..................................327
17.2.6 Systems for Distributed Environments...........328
17.2.7 Developing Better Exact Algorithms
and Heuristics.................................328
17.2.8 Hybridization and Other Algorithmic Issues...328
17.2.9 Systems with Self Explanatory Capabilities.....329
17.2.10 New Systems for Image Analysis.................329
17.2.11 Systems for Web Applications...................329
17.2.12 Developing More Applications...................329
17.3 Epilogue...............................................330
REFERENCES.....................................................333
Subject Index..................................................353
Author Index...................................................363
About the Author...............................................369
Visit Dr. Triantaphyllou's Homepage
Dr. Triantaphyllou's Books /
Special Issues web site
Send suggestions / comments to Dr. E. Triantaphyllou (trianta@lsu.edu).
|