Accuracy, 37

Artificial intelligence (AI) analysis, 3, 42

Artificial neural networks, 108, 110

Association, 30, 43

Auto expenditure

comparisons in, 160

logistic regression for, 155

Backpropagation learning rule, 110

Banking industry, 6–7

Bankruptcy data, 55–56

Bar coding, 1

Big data, features of, 1

Business data mining, 5–10

tools. See Data mining tools

Business intelligence, 10

Casino gaming, 4–5

Centroid discriminant model, 91

of insurance fraud data, 91

Churn, 9

Classification, 43

errors, 35

CLEMENTINE, 28

Cloning, 9

Cluster analysis, 19–22, 41

Coding, variables, 85

Coincidence matrixes, 35–36

decision tree, 160

false negative, 35

false positive, 35

for full and pruned logistic regression model, 156

KNIME, 100, 118

logistic regression and, 88, 92

for R logistic regression model, 98

for R decision tree model, 148

R neural net, 114

for rules, 134, 141–143

using squared error distance, 94

true negative, 35

true positive, 35

from WEKA, 103

WEKA neural network, 120

Confusion matrix, 37

Credit card management, 7–8

CRISP-DM, 25–32

business understanding, 26

data modeling, 29–31

data preparation, 27–29

data understanding, 26–27

Customer profiling, 3, 15–18

expected payoff, calculation of, 17–18

lift calculation in, 15–18

Customer relationship management, 14

Customer segmentation, 13

Data

collection, 27

conversion to 0–1 scale, 93

description, 27

evaluation, 31

exploration, 27

marts, 10

modeling, 29–31

quality, 27

transformation, 4

treatment, 30

types. See Data types

understanding, 26–27

warehouses, 2, 10

Data mining

application areas of, 5–6

banking, 6–7

credit card management, 7–8

human resource management, 10

insurance, 8–9

retailing, 6

telecommunications, 9–10

applications by method, 44–45

business applications of, 14

defined, 1

entropy measure, 128–130

functions, 43–45

in business, 2

methods of, 19–23

modeling tools, 40–41

neural networks in, 107–150

perspectives, 41–43

processes for. See Data mining processes

regression, 83–105

requirements for, 3–5

software. See Data mining software

tools. See Data mining tools

Data mining methods

cluster analysis, 19–22

decision tree analysis, 23

discrimination analysis, 22

knowledge discovery, 19–22

neural networks, 23

regression, 22–23

Data mining processes

CRISP-DM, 25–32

SEMMA, 25, 32–37

Data mining software, 59–82

KNIME, 64–72

R (Rattle), 59–64

WEKA, 72–82

Data mining techniques, 30, 39–57

Data mining tools, 14, 13–24

customer profiling, 15–18

demonstration datasets, 45–56

Data types

FLAG, 28

RANGE, 28

SET, 28

TYPELESS, 28

Decision tree, 123–150

analysis, 23

coincidence matrix, 160

coincidence matrix for rules, 141–143

entropy calculation for age, 130–131, 139

grouped data, 136–138

J48 decision tree, 76

machine learning, 127–135

operation, 124–127

for rule, 135

scalability, 158–160

software demonstrations, 135–149

KNIME, 146–148

R (Rattle), 144–146

WEKA, 148–149

Decision Tree Predictor node, 72

Demographic data, 27

Demonstration datasets, 45–56

bankruptcy data, 55–56

expenditure data, 54

insurance fraud data, 50–53

job application data, 46, 50–51

loan analysis data, 46, 47–49

Deployment, 31–32

Detection, 43

Discrimination analysis, 22

Entropy

calculation for age, 130–131, 139

measure, data mining use, 128–130

Expenditure data, 54, 152

False discovery rate (FDR), 37

False negative (FN), 35

False negative rate (FNR), 37

False positive (FP), 35

False positive rate (FPR), 37

Farmers Insurance Group, 8

FLAG data type, 28

Genetic algorithms, 42

Graphical user interface (GUI), 59

Human resource management, 10

Hypothesis testing, 3

IBM’s Intelligent Miner, 59

Imputation, 27

Insurance claim data

logistic regression model for, 86–87

training data, 87

Insurance fraud data, 50–53

centroid discriminant model of, 91

coincidence matrix for, 88, 91, 92

logistic regression model, 88, 90, 92

OLS regression output, 89–90

Insurance industry, 8–9

Job application data, 46, 50–51

K-median, 22

KNIME, 64–72, 99–101

coincidence matrix from, 100, 118

data port, 68–69

decision tree, 70–72, 146–148

logistic regression model from, 100–101

neural networks, 116–119

node status, 66–68

opening new workflow, 69–72

ports, 68

versions, 65

workflow for loan data, 100

workflow for neural network model, 118

Knowledge discovery, 3, 19–22

Lift

calculation of, 15–17

profit impact of, 17–18

Linear discriminant analysis, 85

Linear regression, 83

Loan analysis data, 46, 47–49

Logistic regression, 83, 85–87, 91–94

application, 157

coincidence matrix and, 88

KNIME, 100–101

insurance claim data for, 86–87

insurance fraud test cases, 88, 90, 92

loading data for, 95

output for loan data from R, 96

R (Rattle), 155

regression learner for loan, 99

scalability and, 155–157

WEKA, 102–103

Machine learning, 127–135

loan application cases and, 127–128

Micromarketing, 3

Multilayered feedforward neural networks, 110

Negative predictive value (NPV), 37

Neural network, 23, 42–43, 83–84, 107–150

artificial, 108, 110

multilayered feedforward, 110

negative features of, 43

operation of, 109–111

products, 122

scalability, 156–158

self-organizing, 110

software demonstration, 111–121

KNIME, 116–119

R (Rattle), 111–115

WEKA, 119–121

NPV. See Negative predictive value

Online analytical processing (OLAP), 10

Pattern recognition, 19

PolyAnalyst, 29

Positive predictive value (PPV), 37

Prediction, 43

Qualitative data, 27

Quantitative data, 27

R (Rattle), 59–64

coincidence matrix for, 98

decision tree model, 63–64, 144–146

evaluation, 97

logistic, 95–98

logistic regression

for auto expenditures, 155

output for loan data from, 96

pruned regression model, 156

neural networks, 111–115

scalability and, 152–160

RANGE data type, 28–29

Rattle (R). See R (Rattle)

Receiver operating characteristic (ROC) curve, 37

Recency, Frequency, and Monetary (RFM) analysis, 18

Regression, 22–23, 41, 83–105

discriminant model, 84–85, 89–91

classical test of, 84–85

linear discriminant analysis, 85

logistic, 85–87

software demonstrations, 87–105

KNIME, 100–101

R—logistic, 95–98

WEKA, 101–105

Renminbi (RMB), 2

Retailing, 6

RFM analysis. See Recency, Frequency, and Monetary analysis

ROC curve. See Receiver operating characteristic curve

Rule induction, 42, 127

Rule-based system model, 125

SAS Enterprise Miner, 59

SAS Institute Inc., 32

Scalability, 4, 151–161

expenditure data, 152

decision tree, 158–160

neural networks, 156–158

R (Rattle) calculations, 152–160

Self-organizing neural networks, 110

SEMMA, 25, 32–37

assessing, 34–35

exploration, 33

modeling, 34

modification, 33–34

sampling, 32–33

Sequential pattern analysis, 30–31

SET data type, 28–29

Similar time sequences, 31

Sociographic data, 27

Software demonstrations

decision trees, 135–149

neural network models, 111–121

regression, 87–105

Squared distance

calculation of, 94

coincidence matrix for, 94

Statistical methods, 29

Supervised data mining approach, 3

Targeting, 3

Telecommunications, 9–10

Transactional data, 27

True negative (TN), 35

True negative rate (TNR), 37

True positive (TP), 35

True positive rate (TPR), 37

Type I error, 35

Type II error, 35

TYPELESS data type, 28

Unsupervised data mining approach, 3

Versatile, 4

Visualization tools, 29, 31

WEKA, 72–82, 101–105, 119–121

classify screen, 74

coincidence matrix from, 103

data loading for loan dataset, 102

decision tree, 148–149

explore screen, 73

explorer classification algorithm menu, 75

J48 decision tree, 76

J48 output, 76–78

J48 parameter settings, 76

logistic regression model, 102–103, 104

modified J48 decision tree, 79

MultiLayerPerceptron display, 119

neural network coincidence matrix, 120

new cases for, 121

opening screen, 73

output predictions, 80–82

predictions for loan data, 121
