Loading data

In Rattle, you have to explicitly declare the role of each variable. A variable can have five different roles:

  • Input: The prediction process will use input variables to predict the value of the target variable.
  • Target: The target variable is the output of our model.
  • Risk: The risk variable measures the size of the outcome recorded in the target variable, for example, the amount of money at stake in each observation.
  • Ident or Identifier: An identifier is a variable that identifies a unique occurrence of an object. In our preceding example, the variable Person is an identifier that identifies a unique person.
  • Ignore: A variable marked Ignore will be ignored by the model. We'll come back to this role later, because some variables can create noise and decrease the performance of your predictive model.
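
The five roles can be pictured as a label attached to each variable. The following Python sketch is only an illustration of that idea, not Rattle code: the Role enumeration and the variables_with_role helper are hypothetical, and every variable name except Person is invented for the example.

```python
from enum import Enum


class Role(Enum):
    """The five variable roles described above."""
    INPUT = "Input"
    TARGET = "Target"
    RISK = "Risk"
    IDENT = "Ident"
    IGNORE = "Ignore"


# Hypothetical role assignment: only Person comes from the text,
# the other variable names are invented for illustration.
roles = {
    "Person": Role.IDENT,
    "Age": Role.INPUT,
    "Income": Role.INPUT,
    "Notes": Role.IGNORE,
    "Approved": Role.TARGET,
}


def variables_with_role(assignment, role):
    """Return the names of all variables that carry the given role."""
    return [name for name, r in assignment.items() if r is role]


print(variables_with_role(roles, Role.INPUT))   # variables the model may use
print(variables_with_role(roles, Role.TARGET))  # variable the model predicts
```

Only the variables labeled Input would be offered to the model as predictors; Person and Notes are left out because of their roles.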

Rattle can load data from many data sources. Here are some options:

  • Use the Spreadsheet option to load data from a Comma Separated Value (CSV) file.
  • Open Database Connectivity (ODBC) is a standard to define database connectivity. Using this standard, you can load from most common databases. This will allow you to load data from ERP, CRM, data warehouse systems, and others.
  • Use Attribute-Relation File Format (ARFF) to load data from Weka files. Weka is machine learning software written in Java.
  • You can also load R Datasets; these are tables loaded in memory using R. Currently, Rattle supports R data frames.
  • The RData file option allows you to load an R Dataset that has been saved in a file, usually with the .Rdata extension.
  • With the Library option, Rattle can load sample datasets provided by R packages.
  • The Corpus option allows loading and processing a folder of documents.
  • In the following screenshot, you can see a Script option, but this option is not implemented. It will be available in a future version.

In this book, we're going to load data from CSV files to explain Rattle's functionality. CSV is widely used to exchange data, and many example datasets on the Internet are available as CSV files.

Loading a CSV File

As we've seen before, we'll use a CSV file from Kaggle to learn how to load a dataset into Rattle. Download the file train.csv from the competition page at http://www.kaggle.com/c/titanic-gettingStarted.

The steps to load the train.csv file are as follows:

  1. Open Rattle and go to the Data tab.
  2. Select Spreadsheet as the data source and click on the Filename folder icon.
  3. Select the file train.csv and click on Open.
  4. Finally, click the Execute button to load the dataset.
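
For comparison, here is roughly what reading a CSV file involves outside the Rattle interface. This is a minimal Python sketch, not what Rattle runs internally. The load_csv helper is hypothetical, and the three sample rows are invented stand-ins that use only some of the columns of train.csv; to read the real file, you would pass open("train.csv", newline="") instead of the in-memory sample.

```python
import csv
import io

# Invented sample rows; the real train.csv has more columns and 891 rows.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,,S
2,1,1,"Roe, Mrs. Jane",female,38,C85,C
3,1,3,"Poe, Miss. Ann",female,,,S
"""


def load_csv(handle):
    """Read a CSV file object and return (column names, list of row dicts)."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return reader.fieldnames, rows


columns, rows = load_csv(io.StringIO(SAMPLE_CSV))
print(len(rows), "observations,", len(columns), "variables")
```

Note that the quoted Name values contain commas; the csv module handles the quoting, so each name stays in a single field.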

Rattle loads the data from the file, analyzes it, and guesses the structure of the dataset. Now we can start exploring the structure of our data. In the Rattle window, we can see that the loaded dataset has 891 observations with nine input variables and Survived as the target variable. We can change the role of each variable with the radio buttons. Note that Age, Cabin, and Embarked have missing values.
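
Counting missing values per variable is a simple check that you can reproduce yourself. The sketch below is a hypothetical illustration in Python: the count_missing helper and the three observations are invented, and an empty string is assumed to mean "missing" because that is how Python's csv module reads an empty field.

```python
# Invented sample observations; empty strings stand for missing values.
rows = [
    {"Age": "22", "Cabin": "", "Embarked": "S", "Sex": "male"},
    {"Age": "38", "Cabin": "C85", "Embarked": "C", "Sex": "female"},
    {"Age": "", "Cabin": "", "Embarked": "", "Sex": "female"},
]


def count_missing(observations):
    """Return a dict mapping each variable name to its number of empty values."""
    counts = {}
    for row in observations:
        for name, value in row.items():
            is_missing = 1 if value.strip() == "" else 0
            counts[name] = counts.get(name, 0) + is_missing
    return counts


missing = count_missing(rows)
with_missing = [name for name, n in missing.items() if n > 0]
print(with_missing)  # variables that need attention before modeling
```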

We'll focus on these missing values in the next section of this chapter.

The objective of this dataset is to predict whether or not a passenger survived the sinking of the Titanic. Our target variable is Survived, which has two possible values:

  • 0 (did not survive)
  • 1 (survived)

The variable Name is an identifier that identifies a unique passenger. For this reason, it has 891 observations and 891 different values.
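
The test for an identifier described here, as many distinct values as observations, is easy to express in code. The following Python sketch is an assumption about how such a check could look, not Rattle's own rule; the is_identifier helper and the sample values are invented.

```python
def is_identifier(values):
    """True when every observation has its own distinct, non-empty value."""
    values = list(values)
    return len(values) > 0 and "" not in values and len(set(values)) == len(values)


names = ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann"]  # invented names
sexes = ["male", "female", "female"]                            # repeated values

print(is_identifier(names), is_identifier(sexes))
```

A variable that passes this test carries no pattern a model can generalize from, which is why it is kept out of the inputs.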

Make changes in the roles of the different variables and click on the Execute button to update the data. To save your work, click on the Save button and give it an appropriate file name.

The Save button will save our work, but it will not modify the data source (the CSV file).

In Rattle's Data tab, there are two useful buttons: View and Edit. With these buttons, you can view and edit your data. The Data tab also has a Partition check box.

Generally, we split a dataset into three subsets: a training dataset, a validation dataset, and a testing dataset. We'll leave this option alone for now and come back to partitioning in Chapter 5, Clustering and Other Unsupervised Learning Methods, and Chapter 6, Decision Trees and Other Supervised Learning Methods.
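
To make the idea of partitioning concrete, here is a minimal Python sketch that splits observation indices into three random subsets. The partition function is hypothetical, and the 70/15/15 proportions and the fixed seed are assumptions chosen for the example; the text above does not state which proportions Rattle uses.

```python
import random


def partition(n_observations, train=0.70, validate=0.15, seed=42):
    """Shuffle the observation indices and cut them into three subsets.

    The proportions and the seed are assumptions for this example.
    Whatever is left after the training and validation subsets
    becomes the testing subset.
    """
    indices = list(range(n_observations))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_train = int(n_observations * train)
    n_validate = int(n_observations * validate)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_validate],
        indices[n_train + n_validate:],
    )


train_idx, validate_idx, test_idx = partition(891)
print(len(train_idx), len(validate_idx), len(test_idx))
```

Each observation lands in exactly one subset, so the model is never evaluated on data it was trained on.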

The last option in data loading is Weight Calculator. This option allows us to enter a formula to give more importance to some observations.
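
The effect of weighting observations can be seen in a small calculation. The Python sketch below does not use Rattle's formula syntax. The weight rule (first-class passengers count double) is hypothetical, and the three rows are invented; Pclass is the passenger class column of the Kaggle file.

```python
def weight(row):
    """Hypothetical weighting rule: first-class passengers count double."""
    return 2.0 if row["Pclass"] == 1 else 1.0


# Invented sample observations.
rows = [
    {"Pclass": 1, "Survived": 1},
    {"Pclass": 3, "Survived": 0},
    {"Pclass": 3, "Survived": 0},
]

weights = [weight(r) for r in rows]

# Unweighted survival rate: every observation counts once.
plain_rate = sum(r["Survived"] for r in rows) / len(rows)

# Weighted survival rate: each observation counts according to its weight.
weighted_rate = sum(w * r["Survived"] for w, r in zip(weights, rows)) / sum(weights)

print(plain_rate, weighted_rate)
```

Doubling the weight of the first-class passenger pulls the weighted rate toward that observation's outcome.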

Tip

You can assign roles to variables automatically by modifying their names in the data source. When you load a variable with a name that starts with ID, Rattle automatically gives it the Ident role. In the same way, you can mark a variable as target, risk, or ignore by starting its name with Target, Risk, or Ignore.
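
The naming convention in this tip amounts to a prefix lookup. The following Python sketch shows the idea only; it is not Rattle's implementation. The role_from_name function is hypothetical, and two details are assumptions: the match ignores case, and any name without a known prefix defaults to Input.

```python
# Prefixes taken from the tip above, paired with the role they trigger.
PREFIX_ROLES = [
    ("ID", "Ident"),
    ("TARGET", "Target"),
    ("RISK", "Risk"),
    ("IGNORE", "Ignore"),
]


def role_from_name(name, default="Input"):
    """Guess a variable's role from the prefix of its name.

    Assumptions: matching is case-insensitive, and names without a
    known prefix fall back to the default role.
    """
    upper = name.upper()
    for prefix, role in PREFIX_ROLES:
        if upper.startswith(prefix):
            return role
    return default


for column in ["ID_customer", "Target_churn", "Ignore_comment", "Age"]:
    print(column, "->", role_from_name(column))
```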
