In Rattle, you have to explicitly declare the role of each variable. A variable can have five different roles:
Rattle can load data from many data sources. Here are some options:
.Rdata extension.
In this book, we're going to load data from CSV files to explain Rattle's functionalities. CSV is widely used for sharing data, and many example datasets on the Internet are available as CSV files.
As we've seen before, we'll use a CSV file from Kaggle to learn how to load a dataset into Rattle. Download the file train.csv from the competition page at http://www.kaggle.com/c/titanic-gettingStarted.
The steps to load the train.csv file are as follows:
Rattle loads the data from the file, analyzes it, and guesses the structure of the dataset. Now we can start exploring the structure of our data. In the Rattle window, we can see that the loaded dataset has 891 observations with nine input variables and Survived as the target variable. We can change the role of each variable with the radio buttons. Note that Age, Cabin, and Embarked have missing values:
We'll focus on these missing values in the next section of this chapter.
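The missing-value counts that Rattle reports can be approximated outside the GUI. The following is a minimal Python sketch, not Rattle's own implementation: it counts the empty cells in each column of some CSV text. The three sample rows are made up for illustration (they only reuse column names from train.csv), the function name count_missing is hypothetical, and treating an empty cell as a missing value is an assumption.

```python
import csv
import io

# Hypothetical sample in the shape of train.csv; the values are made up.
SAMPLE = """PassengerId,Survived,Age,Cabin,Embarked
1,0,22,,S
2,1,,C85,C
3,1,26,,
"""

def count_missing(csv_text):
    """Return {column: number of empty cells}, treating '' as missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # A short row yields None for the absent columns.
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing

if __name__ == "__main__":
    print(count_missing(SAMPLE))
```

Run against the real train.csv (by passing the file's contents instead of SAMPLE), the same function should flag the columns that Rattle marks as having missing values, provided the file encodes them as empty cells.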
The objective of this dataset is to predict whether or not a passenger will survive the sinking of the Titanic. Our target variable is Survived, and it has two possible values:
0 (not survived)
1 (survived)
The variable Name is an identifier: it identifies a unique passenger. For this reason, it has 891 observations and 891 different values.
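The property that marks an identifier (as many distinct values as observations) is easy to check by hand. Here is a minimal Python sketch, assuming the column has already been loaded as a list of strings. The function name looks_like_identifier and the sample values are hypothetical, and this is not necessarily the rule Rattle applies internally.

```python
def looks_like_identifier(values):
    """True when the column is non-empty and every observation has a different value."""
    return len(values) > 0 and len(set(values)) == len(values)

# Made-up values for illustration.
names = ["Passenger A", "Passenger B", "Passenger C"]
survived = ["0", "1", "1"]

print(looks_like_identifier(names))     # 3 observations, 3 distinct values
print(looks_like_identifier(survived))  # 3 observations, 2 distinct values
```

A column that passes this check carries no information a model can generalize from, which is why such variables are given an identifier role rather than an input role.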
Make changes in the roles of the different variables and click on the Execute button to update the data. To save your work, click on the Save button and give it an appropriate file name.
The Save button will save our work, but it will not modify the data source (the CSV file).
In Rattle's Data tab, there are two useful buttons: View and Edit. With these buttons, you can view and edit your data. We also have a Partition check box, as you can see in the following screenshot:
Generally, we split a dataset into three subsets: a training dataset, a validation dataset, and a testing dataset. We're going to leave this option for now, and we'll come back to partitioning in Chapter 5, Clustering and Other Unsupervised Learning Methods, and Chapter 6, Decision Trees and Other Supervised Learning Methods.
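To make the three-way split concrete, here is a minimal Python sketch that shuffles the row indices and cuts them into three groups. The 70/15/15 proportions and the seed are assumptions chosen for illustration, not values taken from this chapter, and the function name partition is hypothetical.

```python
import random

def partition(n_rows, train=0.70, validate=0.15, seed=42):
    """Split row indices 0..n_rows-1 into training, validation, and testing lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_train = int(n_rows * train)
    n_validate = int(n_rows * validate)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_validate],
        indices[n_train + n_validate:],  # testing gets whatever remains
    )

train_idx, validate_idx, test_idx = partition(891)
print(len(train_idx), len(validate_idx), len(test_idx))
```

Because the sizes are rounded down, the testing subset absorbs the remainder, so every observation lands in exactly one subset.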
The last option in data loading is Weight Calculator. This option allows us to enter a formula to give more importance to some observations.
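As an illustration of what such a formula does, the Python sketch below computes one weight per observation. The rule used here (observations with Survived equal to 1 count twice) and the sample rows are made up, the function name compute_weights is hypothetical, and the sketch does not use Rattle's formula syntax.

```python
def compute_weights(rows, formula):
    """Apply a weighting formula to every observation (a dict of column values)."""
    return [formula(row) for row in rows]

# Made-up observations for illustration.
rows = [{"Survived": 0}, {"Survived": 1}, {"Survived": 1}]

# Hypothetical rule: observations with Survived == 1 get twice the weight.
weights = compute_weights(rows, lambda row: 2.0 if row["Survived"] == 1 else 1.0)
print(weights)
```

A model that honors these weights treats each heavier observation as if it appeared more often in the data.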