Microsoft Azure Machine Learning Studio (henceforth referred to as MAML) is an online, collaborative, drag‐and‐drop tool for building machine learning models. Instead of requiring you to implement machine learning algorithms in languages like Python or R, MAML encapsulates the most commonly used machine learning algorithms as modules and lets you build learning models visually using your dataset. This shields beginning data science practitioners from the details of the algorithms while still letting advanced users fine‐tune the hyperparameters of each algorithm. Once a learning model is tested and evaluated, you can publish it as a web service so that your custom apps or BI tools, such as Excel, can consume it. What's more, MAML supports embedding your Python or R scripts within your learning models, giving advanced users the opportunity to write custom machine learning algorithms.
In this chapter, you will take a break from all of the coding that you have been doing in the previous few chapters. Instead of implementing machine learning using Python and Scikit‐learn, you will take a look at how to use MAML to perform machine learning visually using drag‐and‐drop.
Now that you have a good sense of what machine learning is and what it can do, let's get started with an experiment using MAML. For this experiment, you will be using a classic example in machine learning—predicting the survival of a passenger on the Titanic.
In case you are not familiar with the Titanic, on April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 out of 2,224 passengers and crew. Although the main cause of the deaths was a shortage of lifeboats, most of those who survived were women, children, and upper‐class passengers. As such, this presents a very interesting experiment in machine learning. Given a set of data points containing the profiles of passengers (such as gender, cabin class, age, and so forth) and whether they survived the sinking, we can use machine learning to predict the survivability of a passenger based on that passenger's profile.
Interestingly, you can get the Titanic data from Kaggle (https://www.kaggle.com/c/titanic/data). Two sets of data are provided (see Figure 11.1):
You use the training set to train your learning model so that you can use it to make predictions. Once your learning model is trained, you will make use of the testing set to predict the survivability of passengers.
Because the testing set does not contain a label specifying if a passenger survived, we will not use it for this experiment. Instead, we will only use the training set for training and testing our model.
Once the training set is downloaded, examine its contents (see Figure 11.2).
The training set should have the following fields:
We are now ready to load the data into MAML. Using your web browser, navigate to http://studio.azureml.net, and click the “Sign up here” link (see Figure 11.3).
If you just want to experience MAML without any financial commitment, choose the Free Workspace option and click Sign In (see Figure 11.4).
Once you are signed in, you should see a list of items on the left side of the page (see Figure 11.5). I will highlight some of the items on this panel as we move along.
To create learning models, you need datasets. For this example, we will use the dataset that you have just downloaded.
Click the + NEW item located at the bottom‐left of the page. Select DATASET on the left (see Figure 11.6), and then click the item on the right labeled FROM LOCAL FILE.
Click the Choose File button (see Figure 11.7) and locate the training set downloaded earlier. When finished, click the tick button to upload the dataset to MAML.
You are now ready to create an experiment in MAML. Click the + NEW button at the bottom‐left of the page and select Blank Experiment (see Figure 11.8).
You should now see the canvas, as shown in Figure 11.9.
You can give a name to your experiment by typing it over the default experiment name at the top (see Figure 11.10).
Once that is done, let's add our training dataset to the canvas. You can do so by typing the name of the training set in the search box on the left, and the matching dataset will now appear (see Figure 11.11).
Drag and drop the train.csv dataset onto the canvas (see Figure 11.12).
The train.csv dataset has an output port (represented by a circle with a 1 inside). Clicking it will reveal a context menu (see Figure 11.13).
Click Visualize to view the content of the dataset. The dataset is now displayed, as shown in Figure 11.14.
Take a minute to scroll through the data. Observe the following:
All of the fields that are not discarded are useful in helping us to create a learning model. These fields are known as features.
Now that we have identified the features we want, let's add the Select Columns in Dataset module to the canvas (see Figure 11.16).
In the Properties pane, click the Launch column selector and select the columns, as shown in Figure 11.17.
The Select Columns in Dataset module will reduce the dataset to the columns that you have specified. Next, we want to make some of the columns categorical. To do that, add the Edit Metadata module, as shown in Figure 11.18, and connect it as shown. Click the Launch column selector button, and select the Survived, Pclass, SibSp, and Parch fields. In the Categorical section of the properties pane, select “Make categorical.”
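If you would like to relate these two modules to the Python code from earlier chapters, the following is a rough sketch of the same two ideas in plain Python: keeping a subset of columns and treating some of them as categories. The inline sample rows are made up, and the list of kept columns is only an assumption for illustration (your actual selection is the one shown in Figure 11.17). The column names follow the layout of Kaggle's train.csv.

```python
import csv
import io

# Hypothetical sample in the layout of Kaggle's train.csv (values are made up).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T1,7.25,,S
2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C
3,1,3,Passenger C,female,,0,0,T3,7.92,,S
"""

# Assumed selection; substitute the columns you picked in the column selector.
KEEP = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# The fields that the chapter marks as categorical.
CATEGORICAL = ["Survived", "Pclass", "SibSp", "Parch"]


def select_columns(rows, keep):
    """Return new rows containing only the requested columns."""
    return [{name: row[name] for name in keep} for row in rows]


def category_levels(rows, columns):
    """Collect the distinct values (levels) seen in each categorical column."""
    return {name: sorted({row[name] for row in rows}) for name in columns}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
selected = select_columns(rows, KEEP)
levels = category_levels(selected, CATEGORICAL)
print(levels)
```

Treating a column as categorical means that its values are read as a fixed set of levels rather than as quantities, which is why the sketch keeps them as strings.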
You can now run the experiment by clicking the RUN button located at the bottom of the MAML page. Once the experiment is run, click the output port of the Edit Metadata module and select Visualize. Examine the dataset displayed.
If you examine the dataset returned by the Edit Metadata module carefully, you will see that the Age column has some missing values. A simple way to deal with them is to remove the rows that contain missing values so that they do not distort the learning model. To do that, add a Clean Missing Data module to the canvas and connect it as shown in Figure 11.19. In the properties pane, set the “Cleaning mode” to “Remove entire row.”
Click RUN. The dataset should now have no more missing values. Also notice that the number of rows has been reduced to 712 (see Figure 11.20).
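The effect of the “Remove entire row” setting can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how the Clean Missing Data module is implemented, and the three rows below are hypothetical.

```python
def remove_rows_with_missing(rows):
    """Keep only the rows in which every field has a non-empty value."""
    return [
        row
        for row in rows
        if all(value is not None and str(value).strip() != "" for value in row.values())
    ]


# Hypothetical rows; the second one has no Age value.
rows = [
    {"Survived": "0", "Pclass": "3", "Age": "22"},
    {"Survived": "1", "Pclass": "3", "Age": ""},
    {"Survived": "1", "Pclass": "1", "Age": "38"},
]

cleaned = remove_rows_with_missing(rows)
print(len(rows), "->", len(cleaned))
```

Dropping rows is the simplest option, but it also discards the other, valid fields in those rows, which is why the row count shrinks.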
When building your learning model, it is essential that you test it with sample data after the training is done. If you only have one single set of data, you can split it into two parts—one for training and one for testing. This is accomplished by the Split Data module (see Figure 11.21). For this example, I am splitting 80 percent of the dataset for training and the remaining 20 percent for testing.
The left output port of the Split Data module will return 80 percent of the dataset while the right output port will return the remaining 20 percent.
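As a rough code equivalent, an 80/20 split can be sketched as a shuffle followed by a cut. The function name, the fixed seed, and the way the cut point is rounded are assumptions made for this sketch; the Split Data module may round or randomize differently.

```python
import random


def split_rows(rows, fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into two lists at the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


# Integers stand in for the 712 rows that remain after cleaning.
rows = list(range(712))
train, test = split_rows(rows)
print(len(train), len(test))
```

Shuffling before cutting matters because a dataset is often stored in some meaningful order, and an unshuffled cut could leave the test portion unrepresentative.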
You are now ready to create the training model. Add the Two‐Class Logistic Regression and Train Model modules to the canvas and connect them as shown in Figure 11.22. The Train Model module takes in a learning algorithm and a training dataset. You will also need to tell the Train Model module the label for which you are training it. In this case, it is the Survived column.
Once you have trained the model, it is essential that you verify its effectiveness. To do so, use the Score Model module, as shown in Figure 11.23. The Score Model takes in a trained model (which is the output of the Train Model module) and a testing dataset.
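To make the division of labor between training and scoring concrete, here is a minimal sketch of logistic regression written from scratch. It is not the implementation behind the Two‐Class Logistic Regression module; it uses plain batch gradient descent, and the toy features (`is_female`, `is_first_class`), the learning rate, and the number of epochs are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_probability(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def score(model, features, threshold=0.5):
    """Apply a trained model to unseen rows: return labels and probabilities."""
    weights, bias = model
    probabilities = [predict_probability(weights, bias, x) for x in features]
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    return predicted, probabilities


# Hypothetical toy data: [is_female, is_first_class] -> survived
train_x = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [1, 0], [0, 0]]
train_y = [1, 1, 1, 0, 0, 0, 1, 0]
test_x = [[1, 1], [0, 0]]

model = train_logistic(train_x, train_y)
predicted, probabilities = score(model, test_x)
print(predicted, [round(p, 2) for p in probabilities])
```

The split into `train_logistic` and `score` mirrors the two modules on the canvas: one consumes an algorithm and training data and produces a model, and the other applies that model to data it has not seen.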
You are now ready to run the experiment again. Click RUN. Once it is completed, click the output port of the Score Model module, select Visualize, and then select the Scored Labels column (see Figure 11.24). This column contains the model's predictions for the test dataset. The column next to it, Scored Probabilities, gives the predicted probability of the positive class (here, survival), which indicates how confident each prediction is. With the Scored Labels column selected, look at the right side of the screen and, above the chart, select Survived for the item named “compare to.” This will plot the confusion matrix.
The y‐axis of the confusion matrix shows the actual survival information of passengers: 1 for survived and 0 for did not survive. The x‐axis shows the prediction. As you can see, 75 were correctly predicted not to survive the disaster, and 35 were correctly predicted to survive the disaster. The two other boxes show the predictions that were incorrect.
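The four boxes of a confusion matrix are simply counts of (actual, predicted) pairs. The following sketch shows one way to compute them in Python; the two label lists are hypothetical and are not the Titanic results.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a two-class problem with labels 0 and 1."""
    counts = Counter(zip(actual, predicted))
    return {
        "true_negative": counts[(0, 0)],
        "false_positive": counts[(0, 1)],
        "false_negative": counts[(1, 0)],
        "true_positive": counts[(1, 1)],
    }


# Hypothetical labels, not the Titanic results.
actual = [0, 0, 0, 1, 1, 1, 0, 1]
predicted = [0, 0, 1, 1, 0, 1, 0, 1]

matrix = confusion_matrix(actual, predicted)
print(matrix)
```

In the terms used above, the true negatives are the passengers correctly predicted not to survive, and the true positives are those correctly predicted to survive.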
While the numbers for the predictions look pretty decent, it is not sufficient to conclude at this moment that we have chosen the right algorithm for this problem. MAML comes with 25 machine learning algorithms for different types of problems. Now let's use another algorithm provided by MAML, Two‐Class Decision Jungle, to train another model. Add the modules as shown in Figure 11.25.
Click RUN. You can click the output port of the second Score Model module to view the result of the model, just like the previous learning model. However, it would be more useful to be able to compare them directly. You can accomplish this using the Evaluate Model module (see Figure 11.26).
Click RUN to run the experiment. When done, click the output port of the Evaluate Model module and you should see something like Figure 11.27.
The blue line represents the algorithm on the left input port of the Evaluate Model module (Two‐Class Logistic Regression), while the red line represents the algorithm on the right (Two‐Class Decision Jungle). When you click either the blue or red box, you will see the various metrics for each algorithm displayed below the chart.
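Assuming that the chart is the ROC curve that the Evaluate Model module displays by default, one number that summarizes each line is the area under the curve (AUC). The sketch below computes AUC directly from its pairwise definition; the scores for the two models are hypothetical and only serve to show how two algorithms can be compared on the same test labels.

```python
def auc(actual, probabilities):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example receives
    the higher score. Ties count as half a win."""
    positives = [p for a, p in zip(actual, probabilities) if a == 1]
    negatives = [p for a, p in zip(actual, probabilities) if a == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test labels and scores from two models.
actual = [1, 0, 1, 0, 1, 0]
model_a_scores = [0.9, 0.2, 0.8, 0.4, 0.6, 0.3]
model_b_scores = [0.9, 0.7, 0.4, 0.3, 0.6, 0.5]

auc_a = auc(actual, model_a_scores)
auc_b = auc(actual, model_b_scores)
print(auc_a, auc_b)
```

A value of 1.0 means that every survivor was ranked above every non-survivor, while 0.5 is what random guessing would achieve, so the curve that bows further toward the top-left corner belongs to the better-ranking model.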
Now that you have seen an experiment performed using two specific machine learning algorithms—Two‐Class Logistic Regression and Two‐Class Decision Jungle—let's step back a little and examine the various metrics that were generated by the Evaluate Model module. Specifically, let's define the meaning of the following terms:
This set of numbers is known as the confusion matrix. The confusion matrix is discussed in detail in Chapter 7, “Supervised Learning—Classification Using Logistic Regression.” So if you are not familiar with it, be sure to read up on Chapter 7.
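The four counts in a confusion matrix are commonly combined into summary metrics such as accuracy, precision, recall, and F1 score. The sketch below computes them from their standard definitions; the counts passed in are hypothetical and are not the results of this experiment.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common metrics from the four confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, not the Titanic results.
metrics = classification_metrics(tp=3, fp=1, tn=3, fn=1)
print(metrics)
```

Precision asks how many of the predicted survivors actually survived, while recall asks how many of the actual survivors the model found; F1 is the harmonic mean of the two.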
Once the most effective machine learning algorithm has been determined, you can publish the learning model as a web service. Doing so will allow you to build custom apps to consume the service. Imagine that you are building a learning model to help doctors diagnose breast cancer. Publishing as a web service would allow you to build apps to pass the various features to the learning model to make the prediction. Best of all, by using MAML, there is no need to handle the details of publishing the web service—MAML will host it for you on the Azure cloud.
To publish our experiment as a web service:
This will create a new Predictive experiment, as shown in Figure 11.28.
Click RUN, and then DEPLOY WEB SERVICE. The page seen in Figure 11.29 will now be shown.
Click the Test hyperlink. The test page shown in Figure 11.30 is displayed. Click the Enable button to fill the fields with sample values from your training set, which saves you the chore of entering them by hand.
The fields should now be filled with values from the training data. At the bottom of the page, click Test Request/Response and the prediction will be shown on the right.
At the top of the Test page, you should see a Consume link as shown in Figure 11.31. Click it.
You will see the credentials that you need to use in order to access your web service, as well as the URLs for the web service. At the bottom of the page, you will see the sample code generated for you that you could use to access the web service programmatically (see Figure 11.32). The sample code is available in C#, Python 2, Python 3, and R.
Click the Python 3+ tab, and copy the code generated. Click the View in Studio link at the top‐right of the page to return to MAML. Back in MAML, click the + NEW button at the bottom of the screen. Click NOTEBOOK on the left, and you should be able to see the various notebooks, as shown in Figure 11.33.
Click Python 3, give a name to your notebook, and paste in the Python code that you copied earlier (see Figure 11.34).
Be sure to replace the value of the api_key variable with that of your primary key. Press Ctrl+Enter to run the Python code. If the web service is deployed correctly, you should see the result at the bottom of the screen (see Figure 11.35).
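If you are curious about what such a client does, the following sketch builds a comparable request using only the Python standard library. Treat it as an approximation: the payload layout (`Inputs`, `input1`, `GlobalParameters`), the helper names, the placeholder URL, and the passenger fields are assumptions made for illustration, and the code generated on your own Consume page is the authoritative version.

```python
import json
import urllib.request


def build_request(url, api_key, passenger):
    """Build (but do not send) a POST request carrying one passenger record."""
    # Assumed payload layout; check it against the code generated for your service.
    payload = {"Inputs": {"input1": [passenger]}, "GlobalParameters": {}}
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers)


def call_service(request):
    """Send the request and decode the JSON reply (needs a deployed service)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholders: use the URL and primary key shown on your own Consume page.
url = "https://example.invalid/execute"
api_key = "YOUR_PRIMARY_KEY"
# Hypothetical passenger; the required fields depend on your experiment.
passenger = {"Pclass": "3", "Sex": "male", "Age": "22", "SibSp": "1", "Parch": "0"}

request = build_request(url, api_key, passenger)
print(request.get_method(), request.full_url)
```

Once the placeholders are replaced with real values, calling `call_service(request)` would send the request. The sketch stops short of doing so because the placeholder URL does not point at a real service.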
In this chapter, you have seen how you can use MAML to create machine learning experiments. Instead of writing your code in Python, you can use the various algorithms provided by Microsoft and build your machine learning models visually using drag and drop. This is very useful for beginners who want to get started with machine learning without diving into the details. Best of all, MAML helps you to deploy your machine learning model as a web service automatically, and it even provides the code for you to consume it.
In the next chapter, you will learn how to deploy your machine learning models created in Python and Scikit‐learn manually using Python and the Flask micro‐framework.