Train the model

This is the most important phase: it is time to build and train our machine learning model. In this step, we define the model and then train it. During training, the model starts to extract knowledge from the large amount of data available to us, data from which we have not yet drawn any conclusions.

Let's now split the data into a training set and a test set. Training and testing the model forms the basis for its further use in predictive analytics. Given a dataset of 699 rows, which includes the predictor and response variables, we split it in a convenient ratio (say 70:30) and allocate 490 rows for training and 209 rows for testing. The rows are selected at random to reduce bias. Once the training data is available, it is fed to the machine learning algorithm, which fits a function that maps the inputs to the output. The training data determines how the features are used to get from the input to the output.
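The row counts quoted above can be checked with a few lines of Python. This is only a sketch: it assumes that the fractional row count (70% of 699 is about 489.3) is rounded up, which is the rule that reproduces the 490 and 209 figures; the text does not say how the rounding is actually done.

```python
import math

TOTAL_ROWS = 699       # dataset size quoted in the text
TRAIN_FRACTION = 0.7   # the 70:30 ratio

# Assumption: the fractional row count is rounded up to a whole row.
train_rows = math.ceil(TOTAL_ROWS * TRAIN_FRACTION)
test_rows = TOTAL_ROWS - train_rows

print(train_rows, test_rows)
```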

Once sufficient convergence is achieved, the model is stored in memory and the next step is to test it. We pass the 209 test rows to the model to check whether the predicted output matches the actual output. Testing produces various metrics that can be used to validate the model. If the accuracy is too low, the model has to be rebuilt with changes to the training data and to the other parameters passed to the machine learning algorithm.
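To make the idea of comparing actual and predicted outputs concrete, here is a minimal sketch that counts the matches and mismatches and derives the accuracy. The function name evaluate and the two short label lists are hypothetical, chosen only for illustration; they are not results from the experiment.

```python
def evaluate(actual, predicted, positive=1):
    """Count the four outcome types and compute the accuracy."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "true_positive": tp,
        "true_negative": tn,
        "false_positive": fp,
        "false_negative": fn,
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical labels: 0 = benign, 1 = malignant.
actual = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]

metrics = evaluate(actual, predicted)
print(metrics)
```

In a medical setting, the false negative count (malignant cases predicted as benign) usually deserves as much attention as the overall accuracy.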

As anticipated, the model is trained and tested on separate data: one dataset for the training phase and another for the testing phase. To split the dataset into these two subsets, we will use the Split Data module. Follow this procedure:

To start, drag the Split Data module onto the canvas area and connect its input to the output of the Clean Missing Data module, which was the last module added to the flowchart. The Split Data module is located in the following path of the left-hand sidebar: Data Transformation | Sample and Split.
Click on the Split Data module to select it. Find the Fraction of rows in the first output dataset option in the Properties panel to the right of the canvas and set it to 0.7. In this way, 70% of the data will be used for model training and 30% for model testing. To identify the model that best approximates our data, we can of course experiment with different percentages.

The following screenshot shows the essential elements of the procedure just described:

As we can see in the preceding screenshot, the Split Data module has two output ports: port 1 provides the training set and port 2 the testing set. We can also note that, in the Properties section, the Randomized split parameter is selected. This means that the 70% of the data sent to the first port is chosen at random: the 490 training rows are not simply the first 490 rows in order, but are drawn from the entire dataset. The remaining rows form the testing dataset. The Random seed parameter is also present; it controls the seeding of the pseudorandom number generator. Setting it makes our example reproducible.
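The effect of the randomized split and of the random seed can be imitated in plain Python. This is a sketch of the idea only, not of how the Split Data module works internally: the function split_rows, the seed value 123, and the rounding-up rule are assumptions made for illustration.

```python
import math
import random

def split_rows(rows, fraction, seed):
    """Randomly split rows; the first list receives `fraction` of them."""
    rng = random.Random(seed)   # private generator, seeded for reproducibility
    shuffled = list(rows)       # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    # Assumption: round up, which gives the 490/209 split quoted in the text.
    cut = math.ceil(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(699))         # stand-in row indices for the 699-row dataset

train_a, test_a = split_rows(rows, 0.7, seed=123)
train_b, test_b = split_rows(rows, 0.7, seed=123)

print(len(train_a), len(test_a))   # sizes of the two subsets
print(train_a == train_b)          # same seed, same split
```

Running the split twice with the same seed yields identical subsets, which is what makes the experiment repeatable; changing the seed produces a different random selection of rows.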

At this point, we have the two datasets for training and testing. The time has come to choose the machine learning algorithm. As we said at the beginning of the section, the problem we want to tackle is to recognize the type of breast cancer based on some measured parameters. Obviously, this is a classification problem. The output variable (Output_Class) takes only two values (0, 1), which correspond to the two diagnoses (benign, malignant).

One way to address a classification problem is to use logistic regression. Logistic regression predicts the value of a dichotomous (two-valued) dependent variable from a set of explanatory variables, which may be qualitative or quantitative. The dependent variable is a qualitative response that describes whether a random event occurs or not. Follow this procedure:
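Before going through the steps, it may help to see how a logistic regression model turns explanatory variables into a dichotomous answer: a weighted sum of the inputs is passed through the logistic (sigmoid) function to obtain a probability, which is then compared with a threshold. The following sketch illustrates this; the weights, the bias, the 0.5 threshold, and the two sample rows are hypothetical values, not parameters learned from the breast cancer data.

```python
import math

def predict_probability(features, weights, bias):
    """Probability of class 1 given the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable form of the logistic function 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_class(features, weights, bias, threshold=0.5):
    """Return 1 if the probability reaches the threshold, otherwise 0."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0

# Hypothetical model with two explanatory variables.
weights = [0.8, 0.5]
bias = -6.0

low_values = [1, 1]    # small measurements
high_values = [8, 7]   # large measurements

print(predict_class(low_values, weights, bias))
print(predict_class(high_values, weights, bias))
```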

To select the learning algorithm, expand the Machine Learning category in the module palette to the left of the canvas area, and then expand Initialize Model. You will see different categories of modules that can be used to initialize machine learning algorithms. For this experiment, select the Two-Class Logistic Regression module from the Classification category and drag it into the experiment canvas area.
Once the machine learning algorithm has been chosen, we need to add the module that allows us to train the model: Train Model. This module trains a model once its parameters have been defined and set, and it requires labeled data. You can also use Train Model to retrain an existing model with new data. The Train Model module is located in the following path of the module palette: Machine Learning | Train. Drag the Train Model module to the experiment canvas area.
Connect the output port of the Two-Class Logistic Regression module to the left input port of the Train Model module, and connect the left output port of the Split Data module to the right input port of the Train Model module.

At this point, a symbol on the Train Model module box reminds us that something still needs to be set. Click on the Train Model module, click on Launch column selector in the Properties pane to the right of the canvas, and then select the Output_Class column. This is the value that we intend to predict with the model.

The following screenshot highlights the essential elements used in the previous procedure:

At this point, we can run the experiment. The result is a trained logistic regression model. It can be used to assign a class (benign or malignant) to new data from clinical tests, in order to determine the nature of a tumor.
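For readers who want to see what training amounts to outside the graphical environment, the following sketch fits a one-variable logistic regression and then uses it to classify values. It is an illustration under stated assumptions: plain batch gradient descent stands in for whatever optimizer the Two-Class Logistic Regression module uses internally, and the six training rows, the learning rate, and the number of epochs are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y   # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

def classify(x, w, b):
    """Assign class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Hypothetical training rows: one measured value and its class (0 or 1).
train_x = [1, 2, 3, 7, 8, 9]
train_y = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(train_x, train_y)
predictions = [classify(x, w, b) for x in train_x]
print(predictions)
```

The learned weight and bias play the role of the trained model: once they are available, classify can be applied to values that were not part of the training rows.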
