Accuracy, 37
Artificial intelligence (AI) analysis, 3, 42
Artificial neural networks, 108, 110
Association, 30, 43
Auto expenditure
comparisons in, 160
logistic regression for, 155
Backpropagation learning rule, 110
Banking industry, 6–7
Bankruptcy data, 55–56
Bar coding, 1
Big data, features of, 1
Business data mining, 5–10
tools. See Data mining tools
Business intelligence, 10
Casino gaming, 4–5
Centroid discriminant model, 91
of insurance fraud data, 91
Churn, 9
Classification, 43
errors, 35
CLEMENTINE, 28
Cloning, 9
Cluster analysis, 19–22, 41
Coding, variables, 85
Coincidence matrixes, 35–36
decision tree, 160
false negative, 35
false positive, 35
for full and pruned logistic regression model, 156
KNIME, 100, 118
logistic regression and, 88, 92
for R logistic regression model, 98
for R decision tree model, 148
R neural net, 114
for rules, 134, 141–143
using squared error distance, 94
true negative, 35
true positive, 35
from WEKA, 103
WEKA neural network, 120
Confusion matrix, 37
Credit card management, 7–8
CRISP-DM, 25–32
business understanding, 26
data modeling, 29–31
data preparation, 27–29
data understanding, 26–27
Customer profiling, 3, 15–18
expected payoff, calculation of, 17–18
lift calculation in, 15–18
Customer relationship management, 14
Customer segmentation, 13
Data
collection, 27
conversion to 0–1 scale, 93
description, 27
evaluation, 31
exploration, 27
marts, 10
modeling, 29–31
quality, 27
transformation, 4
treatment, 30
types. See Data types
understanding, 26–27
warehouses, 2, 10
Data mining
application areas of, 5–6
banking, 6–7
credit card management, 7–8
human resource management, 10
insurance, 8–9
retailing, 6
telecommunications, 9–10
applications by method, 44–45
business applications of, 14
defined, 1
entropy measure, 128–130
functions, 43–45
in business, 2
methods of, 19–23
modeling tools, 40–41
neural networks in, 107–150
perspectives, 41–43
processes for. See Data mining processes
regression, 83–105
requirements for, 3–5
software. See Data mining software
tools. See Data mining tools
Data mining methods
cluster analysis, 19–22
decision tree analysis, 23
discrimination analysis, 22
knowledge discovery, 19–22
neural networks, 23
regression, 22–23
Data mining processes
CRISP-DM, 25–32
SEMMA, 25, 32–37
Data mining software, 59–82
KNIME, 64–72
R (Rattle), 59–64
WEKA, 72–82
Data mining techniques, 30, 39–57
Data mining tools, 14, 13–24
customer profiling, 15–18
demonstration datasets, 45–56
Data types
FLAG, 28
RANGE, 28
SET, 28
TYPELESS, 28
Decision tree, 123–150
analysis, 23
coincidence matrix, 160
coincidence matrix for rules, 141–143
entropy calculation for age, 130–131, 139
grouped data, 136–138
J48 decision tree, 76
machine learning, 127–135
operation, 124–127
for rule, 135
scalability, 158–160
software demonstrations, 135–149
KNIME, 146–148
R (Rattle), 144–146
WEKA, 148–149
Decision Tree Predictor node, 72
Demographic data, 27
Demonstration datasets, 45–56
bankruptcy data, 55–56
expenditure data, 54
insurance fraud data, 50–53
job application data, 46, 50–51
loan analysis data, 46, 47–49
Deployment, 31–32
Detection, 43
Discrimination analysis, 22
Entropy
calculation for age, 130–131, 139
measure, data mining use, 128–130
Expenditure data, 54, 152
False discovery rate (FDR), 37
False negative (FN), 35
False negative rate (FNR), 37
False positive (FP), 35
False positive rate (FPR), 37
Farmers Insurance Group, 8
FLAG data type, 28
Genetic algorithms, 42
Graphical user interface (GUI), 59
Human resource management, 10
Hypothesis testing, 3
IBM’s Intelligent Miner, 59
Imputation, 27
Insurance claim data
logistic regression model for, 86–87
training data, 87
Insurance fraud data, 50–53
centroid discriminant model of, 91
coincidence matrix for, 88, 91, 92
logistic regression model, 88, 90, 92
OLS regression output, 89–90
Insurance industry, 8–9
Job application data, 46, 50–51
K-median, 22
KNIME, 64–72, 99–101
coincidence matrix from, 100, 118
data port, 68–69
decision tree, 70–72, 146–148
logistic regression model from, 100–101
neural networks, 116–119
node status, 66–68
opening new workflow, 69–72
ports, 68
versions, 65
workflow for loan data, 100
workflow for neural network model, 118
Knowledge discovery, 3, 19–22
Lift
calculation of, 15–17
profit impact of, 17–18
Linear discriminant analysis, 85
Linear regression, 83
Loan analysis data, 46, 47–49
Logistic regression, 83, 85–87, 91–94
application, 157
coincidence matrix and, 88
insurance claim data for, 86–87
insurance fraud test cases, 88, 90, 92
KNIME, 100–101
loading data for, 95
output for loan data from R, 96
R (Rattle), 155
regression learner for loan, 99
scalability and, 155–157
WEKA, 102–103
Machine learning, 127–135
loan application cases and, 127–128
Micromarketing, 3
Multilayered feedforward neural networks, 110
Negative predictive value (NPV), 37
Neural network, 23, 42–43, 83–84, 107–150
artificial, 108, 110
multilayered feedforward, 110
negative features of, 43
operation of, 109–111
products, 122
scalability, 156–158
self-organizing, 110
software demonstration, 111–121
KNIME, 116–119
R (Rattle), 111–115
WEKA, 119–121
NPV. See Negative predictive value
Online analytical processing (OLAP), 10
Pattern recognition, 19
PolyAnalyst, 29
Positive predictive value (PPV), 37
Prediction, 43
Qualitative data, 27
Quantitative data, 27
R (Rattle), 59–64
coincidence matrix for, 98
decision tree model, 63–64, 144–146
evaluation, 97
logistic, 95–98
logistic regression
for auto expenditures, 155
output for loan data from, 96
pruned regression model, 156
neural networks, 111–115
scalability and, 152–160
RANGE data type, 28–29
Rattle (R). See R (Rattle)
Receiver operating characteristic (ROC) curve, 37
Recency, Frequency, and Monetary (RFM) analysis, 18
Regression, 22–23, 41, 83–105
discriminant model, 84–85, 89–91
classical test of, 84–85
linear discriminant analysis, 85
logistic, 85–87
software demonstrations, 87–105
KNIME, 100–101
R (Rattle) logistic, 95–98
WEKA, 101–105
Renminbi (RMB), 2
Retailing, 6
RFM analysis. See Recency, Frequency, and Monetary analysis
ROC curve. See Receiver operating characteristic curve
Rule induction, 42, 127
Rule-based system model, 125
SAS Enterprise Miner, 59
SAS Institute Inc., 32
Scalability, 4, 151–161
expenditure data, 152
decision tree, 158–160
neural networks, 156–158
R (Rattle) calculations, 152–160
Self-organizing neural networks, 110
SEMMA, 25, 32–37
assessing, 34–35
exploration, 33
modeling, 34
modification, 33–34
sampling, 32–33
Sequential pattern analysis, 30–31
SET data type, 28–29
Similar time sequences, 31
Sociographic data, 27
Software demonstrations
decision trees, 135–149
neural network models, 111–121
regression, 87–105
Squared distance
calculation of, 94
coincidence matrix for, 94
Statistical methods, 29
Supervised data mining approach, 3
Targeting, 3
Telecommunications, 9–10
Transactional data, 27
True negative (TN), 35
True negative rate (TNR), 37
True positive (TP), 35
True positive rate (TPR), 37
Type I error, 35
Type II error, 35
TYPELESS data type, 28
Unsupervised data mining approach, 3
Versatile, 4
Visualization tools, 29, 31
WEKA, 72–82, 101–105, 119–121
classify screen, 74
coincidence matrix from, 103
data loading for loan dataset, 102
decision tree, 148–149
explore screen, 73
explorer classification algorithm menu, 75
J48 decision tree, 76
J48 output, 76–78
J48 parameter settings, 76
logistic regression model, 102–103, 104
modified J48 decision tree, 79
MultiLayerPerceptron display, 119
neural network coincidence matrix, 120
new cases for, 121
opening screen, 73
output predictions, 80–82
predictions for loan data, 121