There are many excellent commercial data mining software products, although these tend to be expensive. They include SAS Enterprise Miner and IBM’s Intelligent Miner, with new variants and products appearing regularly. One source of information is www.kdnuggets.com under “software”; some of the products listed there are free. The most popular products, as ranked at rdstats.com/articles/popularity, are shown in Table 5.1.
Table 5.1 Data mining software by popularity (rdstats.com)
Rank | Software | Type
1 | R | Open source
2 | SAS | Commercial
3 | SPSS | Commercial
4 | WEKA | Open source
5 | Statistica | Commercial
5 | Rapid Miner | Commercial
Rattle is a graphical user interface (GUI) system for R (also open source), and is also highly recommended. WEKA is a great system, but we have found issues with reading test data, making it a bit troublesome. Down the list in 11th place is KNIME, a very easy-to-use open source GUI system that we will demonstrate. KNIME can read both R and WEKA models, and offers the click-and-drag workflow building familiar from the SAS and SPSS products.
R
To install R, visit https://cran.rstudio.com/
Open a folder for R.
Select Download R for Windows.
To install Rattle:
Open the R Desktop icon (32-bit or 64-bit) and enter the following command at the R prompt. R will ask for a CRAN mirror. Choose a nearby location.
> install.packages("rattle")
Enter the following two commands at the R prompt. This loads the Rattle package into the library and then starts up Rattle.
> library(rattle)
> rattle()
If the RGtk2 package has not yet been installed, an error popup will indicate that libatk-1.0-0.dll is missing from your computer. Click on the OK button, and you will be asked whether you would like to install GTK+. Click on OK to do so. This downloads and installs the appropriate GTK+ libraries for your computer. After this has finished, exit R and restart it so that it can find the newly installed libraries.
When running Rattle, a number of other packages will be downloaded and installed as needed, with Rattle asking for the user’s permission before doing so. They only need to be downloaded once.
The installation has been tested on Microsoft Windows XP, Vista, and 7 (32-bit and 64-bit) with R 3.1.1, Rattle 3.1.0, and RGtk2 2.20.31. If you are missing something, you will get a message from R asking you to install a package. For example, I read nominal (string) data and was prompted that I needed “stringr.” On the R console (see Figure 5.1), click on the “Packages” tab on the top line.
Give the command “Install packages,” which will direct you to an HTTPS CRAN mirror. Select one of the sites (such as “USA(TX) [https]”), find “stringr,” and click on it to install the package. You may have to restart R.
To run a model, on the Filename line, click on the icon and browse for the file “LoanRaw.csv.” Click on the Execute icon on the upper left of the Rattle window. This yields Figure 5.2.
We can Explore—the default is Summary. Clicking Execute yields the output shown in Figure 5.3.
Here the variable “Risk” is a function of “Assets,” “Debt,” and “Want.” Rattle treated “Debt” as an identifier variable and deleted it from the analysis. This can be adjusted if the user so desires.
Select the Model tab, yielding Figure 5.4.
This yields options to set parameters for a decision tree, which we will examine later in the book. For now, we can use the default settings shown, Execute, and obtain Figure 5.5.
Rattle can also provide a descriptive decision tree by selecting the Rules button, yielding Figure 5.6.
Selecting the Draw button yields Figure 5.7, a graphic decision tree.
This just provides an initial glimpse of what R (through Rattle) can provide. We will demonstrate the analysis in greater depth in subsequent chapters.
KNIME
To install KNIME, you have to register at tech.knime.org/user to obtain a username and password. You can then proceed to installation. Table 5.2 describes the KNIME versions and the platforms on which they are available.
Table 5.2 KNIME versions (from their website)
Version | Linux | Windows | Mac OS X
KNIME (32-bit) | Yes | Yes | No
KNIME (64-bit) | Yes | Yes | Yes
KNIME Developer Version (32-bit) | Yes | Yes | No
KNIME Developer Version (64-bit) | Yes | Yes | Yes
Installation is accomplished by:
Download one of the aforementioned versions and unzip it to any directory. For Windows, click on the knime.exe file; for Linux, click on knime to start KNIME. When KNIME is started for the first time, a welcome screen (Figure 5.8) appears.
From here, you can proceed to the workbench.
The KNIME workbench is organized as in Figure 5.9.
A workflow is built by dragging the nodes from the Node Repository to the Workflow Editor and connecting them. Nodes are the basic processing units of a workflow. Each node has a number of input and output ports. Data (or a model) is transferred via a connection from an out-port to the in-port of another node.
Node Status
When a node is dragged to the workflow editor, its status light shows red, meaning that the node has to be configured before it can be executed. A node is configured by right-clicking it, choosing Configure, and adjusting the necessary settings in the node’s dialog, as displayed in Figure 5.10.
When the dialog is closed by clicking on the OK button, the node is configured and the status light changes to yellow: the node is ready to be executed. Right-clicking the node again shows an enabled Execute option; clicking on it will execute the node and the result of this node will be available at the out-port. After a successful execution, the status light of the node is green. The result(s) can be inspected by exploring the out-port view(s): the last entries in the context menu open them.
Ports
The ports on the left are input ports, where the data from the out-port of the predecessor node is provided. The ports on the right are out-ports. The result of the node’s operation on the data is provided at the out-port to successor nodes. A tooltip provides information about the output of the node; further information can be found in the node description. Nodes are typed such that only ports of the same type can be connected.
Data Port
Figure 5.11 shows the most common type, the data port (a white triangle), which transfers flat data tables from node to node.
Figure 5.12 shows a database port: Nodes executing commands inside a database are recognized by these database ports displayed as brown squares.
Data mining nodes learn a model, which is passed to the corresponding predictor node via a blue square PMML port (Figure 5.13).
Whenever a node provides data that does not fit a flat data table structure, a general purpose port for structured data is used (dark cyan square). All ports not listed earlier are known as “unknown” types (gray square), as in Figure 5.14.
Opening a New Workflow
We can open a new workflow for LoanRaw.csv. We first input the data, by clicking and dragging a File Reader node. We enter the location of the file by clicking on Browse … and locating the file. Figure 5.15 exhibits this operation.
Click on the File Reader node and select Apply followed by OK. Then, click on File Reader (which now should have a yellow status light) and click on Execute. If all is well, the status light will change to green. This enables linkage to other nodes. Click-and-drag the Decision Tree Learner node and link it to the File Reader icon. Click on this Decision Tree Learner node and select Apply followed by OK. Then, click on the icon again and select Execute. The status light should change to green.
In Chapter 3, we discussed a data mining process in which it is good practice to build the model on one set of data and then test it on another subset. The purpose of building data mining models is to have them available to predict new cases. Figure 5.16 shows the KNIME workflow process.
In Figure 5.16, we demonstrate a complete process where a training set is read into the first File Reader and the Decision Tree Learner is used to build a model. This feeds into the Decision Tree Predictor node, which is linked to another File Reader node with test data. The Decision Tree Predictor node feeds into a Scorer node, which provides a confusion matrix, displaying model fit on the test data. To apply this model to new data, a third File Reader node is where new cases are linked, feeding into another Decision Tree Predictor node (linked to the Decision Tree Learner model), providing output to an Interactive Table providing model predictions for the new cases. In the Decision Tree Learner node, we apply Pruning of MDL. Right-clicking on the node, we obtain the following decision tree (Figure 5.17).
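The learner, predictor, and scorer chain in Figure 5.16 can be mimicked in plain code. Here is a rough sketch in Python with invented toy data; the field names (Assets, Risk) mirror LoanRaw.csv, but KNIME’s Decision Tree Learner is far more sophisticated than this one-attribute “stump”:

```python
# A rough Python sketch of KNIME's Learner -> Predictor -> Scorer chain.
# Data rows are invented for illustration only.
from collections import Counter, defaultdict

def learn_stump(rows, attribute, target):
    """Learner node: remember the majority target for each attribute value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def predict(model, rows, attribute):
    """Predictor node: look up each row's attribute value in the model."""
    return [model.get(row[attribute]) for row in rows]

def score(rows, predictions, target):
    """Scorer node: confusion matrix as {(actual, predicted): count}."""
    return Counter((row[target], pred) for row, pred in zip(rows, predictions))

train = [{"Assets": "low", "Risk": "high"}, {"Assets": "high", "Risk": "low"},
         {"Assets": "low", "Risk": "high"}, {"Assets": "high", "Risk": "low"}]
test = [{"Assets": "low", "Risk": "high"}, {"Assets": "high", "Risk": "high"}]

model = learn_stump(train, "Assets", "Risk")
preds = predict(model, test, "Assets")
matrix = score(test, preds, "Risk")
```

The scorer’s counts of each (actual, predicted) pair are exactly what the Scorer node in Figure 5.16 displays for the test data.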
We will demonstrate KNIME throughout the book. We also add WEKA installation, as it has great tools for data mining algorithms.
WEKA
WEKA can be downloaded from the Web at www.cs.waikato.ac.nz/ml/weka/. The download comes with documentation.
On WEKA:
Hit the Open file … button on the upper left.
Link to LoanRaw.csv (or any .csv or .arff file you want to analyze).
Once installed, hopefully you get Figure 5.18.
Select Explorer, yielding Figure 5.19.
Select Open file … and pick file from your hard drive. In Figure 5.20, we picked LoanRaw.csv.
You can play around with Visualize, Select attributes (even Cluster or Associate), but the point for us is to build classification models with Classify, as in Figure 5.21.
Select Choose to get Figure 5.22.
Select Trees to get Figure 5.23.
There are 16 different decision tree models. The interesting ones for us are J48 and SimpleCart.
If you select J48 and then click on the Choose line (showing J48 -C 0.25 -M 2), you get a control window, as in Figure 5.24.
You can change the confidence factor of the algorithm (requiring a minimum level of confidence before retaining a rule) and the minimum number of cases for a rule (called support). This provides a means to try to adjust the number of decision tree rules.
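The effect of a minimum-cases threshold can be seen in a toy sketch. This is not WEKA’s actual J48 pruning algorithm, and the rules and coverage counts below are invented purely for illustration:

```python
# Toy illustration of a minimum-cases threshold like J48's M parameter.
# NOT WEKA's algorithm; rules and coverage counts are invented.
candidate_rules = [
    ("Assets=high -> OK",   412),  # training cases the rule covers
    ("Assets=low -> risky", 161),
    ("Want=high -> risky",    4),  # covers too few cases to keep at M=6
]

def prune(rules, min_cases):
    """Keep only rules that cover at least min_cases observations."""
    return [rule for rule, covered in rules if covered >= min_cases]

kept_m2 = prune(candidate_rules, 2)  # all three rules survive
kept_m6 = prune(candidate_rules, 6)  # the 4-case rule is dropped
```

Raising M in WEKA works in the same spirit: fewer rules, each supported by more cases.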
To run a J48 model (with defaults of C=0.25 and M=2), select OK in Figure 5.25.
Select Start.
This yields Figure 5.26.
The result is a tree with no rules—it cheats and says all loan applications are OK (it is wrong 65 times out of 650, giving a correct classification rate of 0.90). I call this a degenerate model—it just says everything is OK.
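The arithmetic behind that rate is straightforward, using the counts from the text:

```python
# Correct classification rate of the degenerate "everything is OK" model:
# 65 wrong out of 650 applications, per the text above.
total, wrong = 650, 65
accuracy = (total - wrong) / total  # 0.90
```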
To get a more interesting model, play with C and M.
For C=0.5 and M=6, I get Figure 5.27.
The actual tree is more compactly shown in Figure 5.28.
This tree has six leaves (ways this rule set can reach a conclusion)—note that you can shorten that; here, I did it in two. It has a tree size of 9 (rows to express the rule set—although here I count 8); tree size doesn’t mean much.
Hit the Choose button.
Under Functions, you can select:
Logistic for logistic regression
Multilayer perceptron or RBF Network for a neural network
Under Trees, you can select:
J48 for a good tree
Decision Stump for a simple tree
SimpleCart
For J48, you can control the number of rules by manipulating parameter M (which is the minimum number of cases required in order to build a rule).
If Output Variable Is Continuous
Under Functions:
Linear regression
Under Trees:
M5P gives a tree of multiple linear regressions
Select the button for Use training data or Cross-validation with 10 folds, or Percentage split.
WEKA will give you a decision tree. Under Trees, you can select a variety of decision tree models, to include J48 (requires categorical output). For data files containing only continuous data, you can use Decision Stump or M5P. The answer you get in that case will be an estimate of the proportion of the outcome variable.
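The Cross-validation with 10 folds option partitions the data so that every case is tested exactly once. A minimal Python sketch of such a split (round-robin assignment for simplicity; WEKA’s own fold assignment is stratified by class and more careful than this):

```python
# Minimal sketch of splitting 650 rows into 10 cross-validation folds.
# Round-robin assignment; WEKA's actual fold assignment may differ.
def make_folds(n_rows, k=10):
    """Assign row indices to k folds; train on k-1 folds, test on the rest."""
    return [[i for i in range(n_rows) if i % k == f] for f in range(k)]

folds = make_folds(650)
# Each of the 10 folds holds 65 rows, and every row appears exactly once.
```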
To Predict Within WEKA
This is the hardest thing about WEKA. There is a way to apply the neural net model to test cases, but it is cumbersome. Generally, I have found that using a supplied test set works pretty well for generating predictions when the data are purely numeric (e.g., the expenditure files). You have to make sure the header rows match exactly and add an “actual” value in the outcome variable column (our assignments have not provided an actual value for test cases). You also have to make sure the data structures match exactly, so in the case of the expenditure files, some text fields have to be converted into numeric values.
WEKA doesn’t handle text data as cleanly, but it can be done. I appended the test cases to the training .csv file and saved it under a different name. All text values have to match the training set exactly (e.g., Undergrad = UG, None = none, etc.); matching is case-sensitive, which adds to the difficulty. Load the training set (this part is the same for numeric or text data). From Test options, use Supplied test set (click on Set and browse to find the file). Then, go to More options and click on Output predictions. Run the model, and the predictions appear above the Summary. WEKA will give predictions (what the model projects) and actuals (what was in the dataset) for all observations. The added data will be at the end, so you only have to scroll up just above the confusion matrix.
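The append-and-rename trick just described can be checked mechanically. Here is a small sketch using Python’s standard csv module; the column names mirror LoanRaw.csv, but the data rows and the “OK” dummy value are invented:

```python
# Sketch of appending new cases to a training CSV for WEKA's supplied
# test set, with a phony "actual" value in the outcome column.
# Column names mirror LoanRaw.csv; the data rows are invented.
import csv
import io

training_csv = "Assets,Debt,Want,Risk\nhigh,low,low,OK\nlow,high,high,risky\n"
new_cases = [{"Assets": "high", "Debt": "low", "Want": "high"}]

reader = csv.DictReader(io.StringIO(training_csv))
rows = list(reader)
header = reader.fieldnames

for case in new_cases:
    row = dict(case)
    row["Risk"] = "OK"  # phony "actual"; WEKA requires the column anyway
    # spelling and case must match the training file exactly
    assert set(row) == set(header), "column mismatch with training header"
    rows.append(row)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=header)
writer.writeheader()
writer.writerows(rows)
combined = out.getvalue()  # save under a different name than the original
```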
When that fails, one thing that seems to work is:
To the original dataset, add the new cases at the bottom (making sure that the spelling is the same).
Have the original data set loaded (Figure 5.29).
When you select the model, instead of 10-fold testing, click on Supply test set and link the file with the 10 cases at the bottom.
Under More options, select Output Predictions.
When you then run the model, the predictions (for all 510 cases) will be listed above the confusion matrix (Figure 5.30). The “actual” values will be phony—but the “predicted” values will be there.
Summary
There are many excellent data mining software products, commercial (which are often expensive, but quite easy to use) as well as open source. WEKA was one of the earlier open source products. R has grown to be viable for major data mining, and the Rattle GUI makes it easy to implement. KNIME is newer and has some of the click-and-drag features that made commercial software so easy to use.