Chapter 5

Data Mining Software

There are many excellent commercial data mining software products, although these tend to be expensive. These include SAS Enterprise Miner and IBM’s Intelligent Miner, as well as many more recent variants, with new products appearing regularly. One source of information is www.kdnuggets.com under “software”; some of the products listed there are free. The most popular software products, as ranked by rdstats.com/articles/popularity, are shown in Table 5.1.


Table 5.1 Data mining software by popularity (rdstats.com)

Rank  Software     Type
1     R            Open source
2     SAS          Commercial
3     SPSS         Commercial
4     WEKA         Open source
5     Statistica   Commercial
5     Rapid Miner  Commercial



Rattle is a graphical user interface (GUI) system for R (also open source), and is also highly recommended. WEKA is a great system, but we have found issues with reading test data, making it a bit troublesome. Down the list in 11th place is KNIME, a very easy-to-use open source GUI system that we will demonstrate. KNIME can read both R and WEKA models, and offers click-and-drag functionality for building workflows similar to the SAS and SPSS products.

R

To install R, visit https://cran.rstudio.com/

Open a folder for R.

Select Download R for Windows.

To install Rattle:

Open the R Desktop icon (32-bit or 64-bit) and enter the following command at the R prompt. R will ask for a CRAN mirror. Choose a nearby location.

> install.packages("rattle")

Enter the following two commands at the R prompt. This loads the Rattle package into the library and then starts up Rattle.

> library(rattle)

> rattle()

If the RGtk2 package has not yet been installed, there will be an error popup indicating that libatk-1.0-0.dll is missing from your computer. Click on the OK button, and you will then be asked whether you would like to install GTK+. Click on OK to do so. This downloads and installs the appropriate GTK+ libraries for your computer. After this has finished, exit from R and restart it so that it can find the newly installed libraries.

When running Rattle, a number of other packages will be downloaded and installed as needed, with Rattle asking for the user’s permission before doing so. They only need to be downloaded once.

The installation has been tested to work on Microsoft Windows (32-bit and 64-bit; XP, Vista, and 7) with R 3.1.1, Rattle 3.1.0, and RGtk2 2.20.31. If you are missing something, you will get a message from R asking you to install a package. For example, when we read nominal (string) data, R prompted us to install “stringr.” On the R console (see Figure 5.1), click on the “Packages” tab on the top line.

Image

Figure 5.1 R console

Select “Install packages,” which will direct you to an HTTPS CRAN mirror. Select one of the sites (such as “USA(TX) [https]”), find “stringr,” and click on it to download and install the package. You may have to restart R.

To run a model, on the Filename line, click on the icon and browse for the file “LoanRaw.csv.” Click on the Execute icon on the upper left of the Rattle window. This yields Figure 5.2.

Image

Figure 5.2 LoanRaw.csv data read

We can Explore; the default is Summary. Execute yields the output shown in Figure 5.3.

Image

Figure 5.3 Summary of LoanRaw.csv

Here, the variable “Risk” is a function of “Assets,” “Debt,” and “Want.” Rattle treated “Debt” as an identifier variable and deleted it from the analysis. This can be adjusted if the user so desires.

Select the Model tab, yielding Figure 5.4.

Image

Figure 5.4 Model tab with Tree selected

This yields options to set parameters for a decision tree, which we will examine later in the book. For now, we can use the default settings shown, Execute, and obtain Figure 5.5.

Image

Figure 5.5 Decision tree model for LoanRaw.csv data

Rattle can also provide a descriptive decision tree by selecting the Rules button, yielding Figure 5.6.

Image

Figure 5.6 Rules from the decision tree model

Selecting the Draw button yields Figure 5.7, a graphic decision tree.

Image

Figure 5.7 Graphic decision tree display from Rattle

This just provides an initial glimpse of what R (through Rattle) can provide. We will demonstrate the analysis in greater depth in subsequent chapters.
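Behind the GUI, Rattle’s Tree model is built on the rpart package, so the same kind of model can be fit directly at the R prompt. The sketch below is illustrative only: since LoanRaw.csv is not reproduced here, it builds a small synthetic data frame as a hypothetical stand-in, reusing the variable names (Assets, Debt, Want, Risk) from the summary above.

```r
# Sketch of what Rattle does when you Execute the Tree model.
# The loan data frame is a made-up stand-in for LoanRaw.csv.
library(rpart)

set.seed(42)
n <- 200
loan <- data.frame(
  Assets = runif(n, 0, 100000),
  Debt   = runif(n, 0, 50000),
  Want   = runif(n, 0, 20000)
)
# Made-up risk rule, just to give the tree something to find
loan$Risk <- factor(ifelse(loan$Assets > loan$Debt + loan$Want,
                           "low", "high"))

# Fit a classification tree with rpart, as Rattle does internally
fit <- rpart(Risk ~ Assets + Debt + Want, data = loan, method = "class")
print(fit)   # text form of the tree
```

Rattle’s Log tab records the R commands it issues, which is a convenient way to move from the GUI to scripted R like this.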

KNIME

To install KNIME, you have to register at tech.knime.org/user to obtain a username and password. You can then proceed to installation. Table 5.2 describes the KNIME versions and the platforms on which each is available.


Table 5.2 KNIME versions (from their website)

                                  Linux  Windows  Mac OS X
KNIME (32-bit)                    Yes    Yes      No
KNIME (64-bit)                    Yes    Yes      Yes
KNIME Developer Version (32-bit)  Yes    Yes      No
KNIME Developer Version (64-bit)  Yes    Yes      Yes



Installation is accomplished by downloading one of the aforementioned versions and unzipping it to any directory. For Windows, click on the knime.exe file; for Linux, click on knime to start KNIME. When KNIME is started for the first time, a welcome screen (Figure 5.8) appears.

Image

Figure 5.8 KNIME welcome screen

From here, you can

  1. Open KNIME workbench: Opens the KNIME workbench to immediately start exploring KNIME, build your own workflows, and explore your data.
  2. Get additional nodes: In addition to the ready-to-start basic KNIME installation, there are additional plug-ins for KNIME, for example, an R and Weka integration, or the integration of the Chemistry Development Kit with the additional nodes for the processing of chemical structures, compounds, and so on. You can download these features later from within KNIME (File, Update KNIME) as well.

The KNIME workbench is organized as in Figure 5.9.

Figure 5.9 KNIME workflow

A workflow is built by dragging the nodes from the Node Repository to the Workflow Editor and connecting them. Nodes are the basic processing units of a workflow. Each node has a number of input and output ports. Data (or a model) is transferred via a connection from an out-port to the in-port of another node.

Node Status

When a node is dragged to the workflow editor, the status light lights up red, which means that the node has to be configured before it can be executed. A node is configured by right-clicking it, choosing Configure, and adjusting the necessary settings in the node’s dialog, as displayed in Figure 5.10.

Figure 5.10 KNIME configuration

When the dialog is closed by clicking on the OK button, the node is configured and the status light changes to yellow: the node is ready to be executed. Right-clicking the node again shows an enabled Execute option; clicking on it will execute the node and the result of this node will be available at the out-port. After a successful execution, the status light of the node is green. The result(s) can be inspected by exploring the out-port view(s): the last entries in the context menu open them.

Ports

The ports on the left are input ports, where the data from the out-port of the predecessor node is provided. The ports on the right are out-ports. The result of the node’s operation on the data is provided at the out-port to successor nodes. A tooltip provides information about the output of the node; further information can be found in the node description. Ports are typed such that only ports of the same type can be connected.

Data Port

Figure 5.11 shows the most common type, the data port (a white triangle), which transfers flat data tables from node to node.

Figure 5.11 KNIME data port

Figure 5.12 shows a database port: Nodes executing commands inside a database are recognized by these database ports displayed as brown squares.

Figure 5.12 KNIME database port

Data mining nodes learn a model, which is passed to the referring predictor node via a blue square PMML port (Figure 5.13).

Figure 5.13 KNIME PMML port

Whenever a node provides data that does not fit a flat data table structure, a general purpose port for structured data is used (dark cyan square). All ports not listed earlier are known as “unknown” types (gray square), as in Figure 5.14.

Image

Figure 5.14 Other KNIME ports

Opening a New Workflow

We can open a new workflow for LoanRaw.csv. We first input the data, by clicking and dragging a File Reader node. We enter the location of the file by clicking on Browse … and locating the file. Figure 5.15 exhibits this operation.

Image

Figure 5.15 Opening LoanRaw.csv

Click on the File Reader node and select Apply followed by OK. Then, click on File Reader (which now should have a yellow status light) and click on Execute. If all is well, the status light will change to green. This enables linkage to other nodes. Click-and-drag the Decision Tree Learner node and link it to the File Reader icon. Click on this Decision Tree Learner node and select Apply followed by OK. Then, click on the icon again and select Execute. The status light should change to green.

In Chapter 3, we discussed a data mining process in which it is good practice to build the model on one set of data and then test it on a separate subset of data. The purpose of building data mining models is to have them available to predict new cases. Figure 5.16 shows the KNIME workflow process.

Image

Figure 5.16 KNIME LoanRaw.csv decision tree process

In Figure 5.16, we demonstrate a complete process where a training set is read into the first File Reader and the Decision Tree Learner is used to build a model. This feeds into the Decision Tree Predictor node, which is linked to another File Reader node with test data. The Decision Tree Predictor node feeds into a Scorer node, which provides a confusion matrix displaying model fit on the test data. To apply this model to new data, new cases are read through a third File Reader node, feeding into another Decision Tree Predictor node (linked to the Decision Tree Learner model), which provides output to an Interactive Table giving model predictions for the new cases. In the Decision Tree Learner node, we apply MDL pruning. Right-clicking on the node, we obtain the decision tree shown in Figure 5.17.
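The same learn, predict, and score pipeline can be scripted in R with rpart. The sketch below follows the shape of the KNIME workflow; since LoanRaw.csv is not reproduced here, a synthetic data frame stands in for the training and test files.

```r
# R sketch of the KNIME workflow: learn on training data, score test data.
# The loan data frame is a hypothetical stand-in for LoanRaw.csv.
library(rpart)

set.seed(1)
n <- 300
loan <- data.frame(
  Assets = runif(n, 0, 100000),
  Debt   = runif(n, 0, 50000),
  Want   = runif(n, 0, 20000)
)
loan$Risk <- factor(ifelse(loan$Assets > loan$Debt + loan$Want,
                           "low", "high"))

train <- loan[1:200, ]    # first File Reader (training set)
test  <- loan[201:300, ]  # second File Reader (test set)

# Decision Tree Learner
fit <- rpart(Risk ~ Assets + Debt + Want, data = train, method = "class")

# Decision Tree Predictor plus Scorer: confusion matrix on the test data
pred <- predict(fit, newdata = test, type = "class")
conf <- table(actual = test$Risk, predicted = pred)
print(conf)
```

New cases would be scored the same way, with another predict() call on a third data frame.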

Image

Figure 5.17 KNIME decision tree

We will demonstrate KNIME throughout the book. We also cover WEKA installation, as WEKA provides a strong set of data mining algorithms.

WEKA

WEKA can be downloaded from the Web at www.cs.waikato.ac.nz/ml/weka/. The download comes with documentation.

In WEKA:

Hit the Open file … button on the upper left.

Link to LoanRaw.csv (or any .csv or .arff file you want to analyze).

After installation, you should get Figure 5.18.

Image

Figure 5.18 WEKA opening screen

Select Explorer, yielding Figure 5.19.

Image

Figure 5.19 WEKA explorer screen

Select Open file … and pick file from your hard drive. In Figure 5.20, we picked LoanRaw.csv.

Image

Figure 5.20 WEKA screen for LoanRaw.csv

You can play around with Visualize, Select attributes (even Cluster or Associate), but the point for us is to build classification models with Classify, as in Figure 5.21.

Image

Figure 5.21 WEKA classify screen

Select Choose to get Figure 5.22.

Image

Figure 5.22 WEKA explorer classification algorithm menu

Select Trees to get Figure 5.23.

Image

Figure 5.23 WEKA tree algorithm menu

There are 16 different decision tree models. The interesting ones for us are J48 and SimpleCart.

If you select J48 and then click on the Choose line (showing J48 -C 0.25 -M 2), you get a control window, as in Figure 5.24.

Image

Figure 5.24 WEKA J48 parameter settings

You can change the confidence factor of the algorithm (requiring a minimum level of confidence before retaining a rule) and the minimum number of cases for a rule (called support). This provides a means to try to adjust the number of decision tree rules.
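WEKA’s M parameter has a close analog in rpart’s minbucket control, so the effect is easy to see at the R prompt. This sketch, on made-up data (not LoanRaw.csv), shows that raising the minimum number of cases per leaf shrinks the tree, giving fewer rules.

```r
# Analog of WEKA's M parameter: rpart's minbucket sets the minimum number
# of cases a leaf may hold. Larger minimums mean fewer, broader rules.
library(rpart)

set.seed(7)
n <- 400
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(ifelse(d$x1 + 0.3 * d$x2 + rnorm(n, sd = 0.1) > 0.6,
                     "bad", "OK"))

small_m <- rpart(y ~ ., data = d, method = "class",
                 control = rpart.control(minbucket = 2, cp = 0.001))
large_m <- rpart(y ~ ., data = d, method = "class",
                 control = rpart.control(minbucket = 50, cp = 0.001))

# Count leaves: rows of the tree frame marked "<leaf>"
leaves <- function(fit) sum(fit$frame$var == "<leaf>")
print(c(small_M = leaves(small_m), large_M = leaves(large_m)))
```

The confidence factor plays a similar pruning role to rpart’s cp complexity parameter, though the two are not computed the same way.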

To run a J48 model (with defaults of C=0.25 and M=2), select OK in Figure 5.25.

Image

Figure 5.25 Running J48 decision tree

Select Start.

This yields Figure 5.26.

Image

Figure 5.26 J48 output for LoanRaw.csv

The result is a tree with no rules—it cheats and says all loan applications are OK (it is wrong 65 times out of 650, giving a correct classification rate of 0.90). I call this a degenerate model—it just says everything is OK.
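The correct classification rate quoted here is just the complement of the error rate, easy to verify at the R prompt:

```r
# Degenerate model: predicts "OK" for every loan application.
# It is wrong 65 times out of 650 cases.
errors <- 65
total  <- 650
rate   <- 1 - errors / total
print(rate)   # 0.9
```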

To get a more interesting model, play with C and M.

For C=0.5 and M=6, I get Figure 5.27.

Image

Figure 5.27 Modified J48 decision tree

The actual tree is more compactly shown in Figure 5.28.

Image

Figure 5.28 Compact tree for Figure 5.27

This tree has six leaves (ways this rule set can reach a conclusion); note that you can shorten that (here, I did it in two). It has a tree size of 9 (nodes used to express the rule set, although here I count 8); tree size doesn’t mean much.

Hit the Choose button.

Under Functions, you can select

   Logistic for logistic regression

   Multilayer perceptron or RBF Network for a neural network

Under Trees, you can select

   J48 for a good tree

   Decision Stump for a simple tree

   SimpleCart

For J48, you can control the number of rules by manipulating parameter M (which is the minimum number of cases required in order to build a rule).

If Output Variable Is Continuous

Functions

   Linear regression

Trees

   M5P gives a tree of multiple linear regressions

Select the button for Use training set, Cross-validation with 10 folds, or Percentage split.

WEKA will give you a decision tree. Under Trees, you can select a variety of decision tree models, including J48 (which requires categorical output). For data files containing only continuous data, you can use Decision Stump or M5P. The answer you get in that case will be an estimate of the proportion of the outcome variable.

To Predict Within WEKA

This is the hardest thing about WEKA. There is a way to apply the neural net model to test cases, but it is cumbersome. Generally, I’ve found that using a supplied test set works pretty well for generating predictions when using only numeric data (e.g., the expenditure files). You have to make sure the header rows match exactly and add an “actual” value for the dependent variable column (our assignments have not provided an actual value for test cases). You have to make sure the data structures match exactly, so in the case of the expenditure files, there are some text fields that have to be converted into numeric values.

WEKA doesn’t handle text data as cleanly, but it can be done. I appended the test cases to the training .csv file and saved it with a different name. All text values have to match the training set exactly (e.g., Undergrad = UG, None = none, etc.). It is case-sensitive, which adds to the difficulty. Load the training set (this part is the same for numeric or text data). From Test options, use Supplied test set (click on Set and browse to find the file). Then, go to More options and click on Output predictions. Run the model and the predictions appear above the Summary. WEKA will give predictions (what the model projects) and actuals (what was in the dataset) for all observations. The added data will be at the end, so you only have to scroll up to just above the confusion matrix.

When that fails, one thing that seems to work is:

To the original dataset, add the new cases at the bottom (making sure that the spelling is the same).

Have the original data set loaded (Figure 5.29).

Image

Figure 5.29 Original data load—LoanRaw.csv

When you select the model, instead of 10-fold testing, click on Supply test set and link the file with the 10 cases at the bottom.

Under More options, select Output Predictions.

When you then run the model, the predictions (for all 510 cases) will be listed above the confusion matrix (Figure 5.30). The “actual” values will be phony, but the “predicted” values will be there.

Image Image

Figure 5.30 Prediction output

Summary

There are many excellent data mining software products, commercial (which are often expensive, but quite easy to use) as well as open source. WEKA was one of the earlier open source products. R has grown to be viable for major data mining, and the Rattle GUI makes it easy to implement. KNIME is newer and has some of the click-and-drag features that made commercial software so easy to use.
