Get the data

As the data source, we will use a dataset downloaded from the UCI Machine Learning Repository, which is now available on our PC in .csv format. The first operation is to upload the dataset to Azure Machine Learning Studio. To do this, we perform the following steps:

  1. Click on + NEW at the bottom of the Azure Machine Learning workspace window.
  2. Select DATASET.
  3. Select FROM LOCAL FILE. The Upload a new dataset dialog window opens.
  4. In the Upload a new dataset dialog window, click on Browse and find the .csv file in the local filesystem.
  5. Enter a name for the dataset or accept the proposed name.
  6. For data type, select Generic CSV File with header (.csv).
  7. Add a description if you like.
  8. Click on the OK check mark.
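
Because the upload steps above select the Generic CSV File with header data type, it can help to confirm locally that the first row of the file really is a header before uploading. The following is only a minimal sketch of such a check using Python's standard library; the function name first_row_looks_like_header and the sample column names are hypothetical, and the rule it applies (non-empty, unique, non-numeric cells) is a simple heuristic, not something Azure Machine Learning Studio itself runs.

```python
import csv
import io


def first_row_looks_like_header(text):
    """Heuristic check: header cells are non-empty, unique, and not numeric.

    This is an illustrative assumption about what a header row looks like,
    not a rule taken from Azure Machine Learning Studio.
    """
    first_row = next(csv.reader(io.StringIO(text)), [])
    if not first_row:
        return False

    def is_number(cell):
        try:
            float(cell)
            return True
        except ValueError:
            return False

    cells = [cell.strip() for cell in first_row]
    return (
        all(cells)
        and len(set(cells)) == len(cells)
        and not any(is_number(cell) for cell in cells)
    )


# Hypothetical sample content; replace with the text of your own .csv file.
with_header = "sensor_a,sensor_b,label\n1.0,10,x\n"
without_header = "1.0,10,x\n2.0,20,y\n"
print(first_row_looks_like_header(with_header))
print(first_row_looks_like_header(without_header))
```

If the check fails, another data type in the same drop-down may be a better fit than the one with a header.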

After uploading the data, a new dataset appears in the dataset window. This dataset is now available for any kind of analysis. Now that the data is loaded into the workspace, we can include it in our experiment. To do this, we create a new experiment:

  1. Click on the + NEW button at the bottom left of the Azure Machine Learning Studio workspace; a pop-up window opens.
  2. Choose the blank experiment option; an experiment canvas opens.
  3. In the module palette to the left of the experiment canvas, expand Saved Datasets.
  4. Under My Datasets, drag the uploaded .csv dataset onto the canvas.

The following screenshot shows a new dataset module added onto the experiment canvas:

To get an overview of the data and some statistical information, right-click on the output port of the dataset (the small circle at the bottom, labeled with the number 1) and select Visualize. A new window opens with a summary of the file's contents (the number of rows and columns) and a listing of all the columns in the dataset. For each column, the window shows the name, a histogram of the data, and the values in the first rows. To get more statistics, click on the column that interests you. The right-hand side of the window then reports a series of statistics: Mean, Median, Min, Max, Standard Deviation, Unique Values, Missing Values, and Feature Type. You can also view the histogram of the data contained in this column, as shown in the following screenshot:
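
The statistics listed in the Visualize window can also be approximated locally, which is a useful cross-check on the uploaded file. The sketch below uses only Python's standard library. The sample data in CSV_TEXT, its column names, and the helper names load_rows and column_summary are all hypothetical. It also assumes the sample standard deviation and uses simplified feature-type labels, so its figures and labels may not match the Studio's output exactly.

```python
import csv
import io
import statistics

# Hypothetical stand-in for the uploaded .csv file (names and values invented).
CSV_TEXT = """sensor_a,sensor_b,label
1.0,10,x
2.0,,y
3.0,30,x
4.0,40,y
"""


def load_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def column_summary(rows, column):
    """Compute per-column statistics similar to those in the Visualize window."""
    raw = [row[column] for row in rows]
    present = [value for value in raw if value != ""]  # empty cell = missing
    summary = {
        "Unique Values": len(set(present)),
        "Missing Values": len(raw) - len(present),
    }
    if not present:
        summary["Feature Type"] = "unknown"
        return summary
    try:
        numbers = [float(value) for value in present]
    except ValueError:
        # At least one value is not a number: treat the column as text.
        summary["Feature Type"] = "string"
        return summary
    summary["Feature Type"] = "numeric"
    summary["Mean"] = statistics.mean(numbers)
    summary["Median"] = statistics.median(numbers)
    summary["Min"] = min(numbers)
    summary["Max"] = max(numbers)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    summary["Standard Deviation"] = (
        statistics.stdev(numbers) if len(numbers) > 1 else 0.0
    )
    return summary


rows = load_rows(CSV_TEXT)
print(f"{len(rows)} rows, {len(rows[0])} columns")
for name in rows[0]:
    print(name, column_summary(rows, name))
```

Reading the real file instead of CSV_TEXT only requires passing its contents to load_rows.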

All modules on the canvas have ports, represented by small circles: input ports at the top and output ports at the bottom. Earlier, we used the output port of the dataset to visualize its contents. By connecting one module's output port to the input port of another, we can create a data stream through the experiment. To preview the state of the data at a specific point, use the output port of a dataset or module in the same way. This analysis also shows whether any column contains missing values.
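
The missing-value check described above can be reproduced outside the Studio as well. The following is a minimal sketch using Python's standard library; the sample data, the column names, and the helper name missing_counts are hypothetical. It assumes that a missing value appears in the file as an empty or whitespace-only cell.

```python
import csv
import io

# Hypothetical stand-in for the uploaded .csv file (names and values invented).
CSV_TEXT = """sensor_a,sensor_b,label
1.0,10,x
2.0,,y
3.0,30,x
4.0,40,y
"""


def missing_counts(text):
    """Count empty cells per column in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in counts:
            value = row[name]
            # DictReader fills short rows with None; treat that as missing too.
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


counts = missing_counts(CSV_TEXT)
columns_with_missing = [name for name, n in counts.items() if n > 0]
print(counts)
print("Columns with missing values:", columns_with_missing)
```

Datasets that mark missing values with a placeholder such as ? or NA would need that marker added to the comparison.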
