In this chapter, we will cover how to train and enrich your data using a no-code, UI-based machine learning approach in Synapse Studio. We will learn how to configure and use AutoML, the automated machine learning capability of the cloud-based Azure Machine Learning service, to build a machine learning model with ease. Using AutoML, you can develop a highly scalable, efficient, and robust model through a code-free experience. You will learn how to connect to an existing data source and train a model without writing a single line of code, using the machine learning capabilities within Synapse Studio.
You will also learn something very interesting: how to integrate Azure Cognitive Services so that you can combine the power of AI with predictive analytics.
Apart from this, you will also learn how to build scalable machine learning models using Spark ML and the Machine Learning Library (MLlib). The Synapse runtime includes many open source libraries, including the Azure Machine Learning SDK, which you can leverage to build your models.
We will cover the following recipes:
Azure Synapse Studio gives you the flexibility to develop a machine learning model on top of your dataset. In this recipe, you will learn how to use the AutoML feature to train a model on existing Spark tables. You simply select the Spark table you want to train on and build the model through AutoML's code-free experience.
We will be using a regression model in this recipe. However, the choice depends entirely on the problem you are trying to solve; you can choose from regression, classification, or time series forecasting models to fit your need.
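The task type is picked in the UI rather than in code, but the decision comes down to the shape of the target column. As a rough illustration only (the helper below is hypothetical and is not part of Synapse Studio or the Azure ML SDK), a continuous numeric target suggests regression, a small set of discrete labels suggests classification, and a numeric target tracked over time suggests forecasting:

```python
def suggest_task(target_values, has_time_index=False):
    """Suggest an AutoML task type from a sample of target values.

    Hypothetical helper for illustration only; in the recipe you choose
    the task yourself in the AutoML UI.
    """
    numeric = all(isinstance(v, (int, float)) for v in target_values)
    if numeric and has_time_index:
        return "forecasting"      # predict future values of a series
    if numeric and len(set(target_values)) > 20:
        return "regression"       # many distinct numeric values -> continuous target
    return "classification"       # discrete labels

print(suggest_task([float(i) for i in range(30)]))               # regression
print(suggest_task(["yes", "no", "yes"]))                        # classification
print(suggest_task([101.0, 103.5, 99.8], has_time_index=True))   # forecasting
```

The threshold of 20 distinct values is an arbitrary heuristic for the sketch; real tooling inspects column types and cardinality more carefully.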
We will be using the same Spark tables that we created in Chapter 2, Creating Robust Data Pipelines and Data Transformation.
We will need to do some setup to prepare for this recipe:
Let's begin this recipe and see how we can create the AutoML model with Azure Synapse Studio. We will be leveraging the existing Spark table to build the Azure Machine Learning model using AutoML:
You can now configure the experiment-related parameters in the UI. Make sure you set Target column to a numeric column; otherwise, you will not be able to create the regression experiment:
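Because the UI rejects a non-numeric target, it can save time to pre-check the column before submitting. The check below is purely illustrative (it is not part of Synapse Studio); it mirrors the rule that every non-null target value must be readable as a number:

```python
def is_valid_regression_target(values):
    """Return True if every non-null value can be read as a number.

    Illustrative pre-check only, mirroring the UI's requirement that a
    regression target column be numeric.
    """
    for v in values:
        if v is None:
            continue          # nulls are tolerated; they are not type errors
        try:
            float(v)
        except (TypeError, ValueError):
            return False
    return True

print(is_valid_regression_target([12.5, "13.2", None, 7]))  # True
print(is_valid_regression_target(["twelve", 3.0]))          # False
```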
The AutoML run will be executed in the Azure Machine Learning workspace once you have submitted the experiment for execution. You can monitor the model execution from the Azure Machine Learning portal:
The model training will leverage the existing Spark pool that you created in Synapse Analytics as the Spark compute. However, to monitor the Automated ML model training, you can go directly to the Azure Machine Learning portal to check the experiment execution and monitor it from there:
You should be able to see all the child runs and their statuses from the Azure Machine Learning portal, along with other details such as the run name, the submitted time, and the run duration, as shown in Figure 6.10:
Let's now look at how you can build a regression model in Azure Synapse Studio using a Jupyter notebook and then deploy it to the Azure Machine Learning workspace. In the previous recipe, we saw how to build and train a machine learning model with a code-free experience.
It's time to explore how to build the regression model with the Synapse Studio notebook experience and deploy the resulting model to the Azure Machine Learning workspace.
We will be leveraging the same Spark pool to build and train the model and deploy it from the Azure Synapse workspace to the Azure Machine Learning workspace, which is linked to Synapse.
We will perform this within the same notebook experience.
Make sure the following have been completed before you begin:
Let's get back to the same Synapse workspace, and under Develop, create a new notebook with the name AMLSparkNotebook:
import azureml.core
from azureml.core import Experiment, Workspace, Dataset, Datastore
from azureml.train.automl import AutoMLConfig
from notebookutils import mssparkutils
from azureml.data.dataset_factory import TabularDatasetFactory
The notebook view is shown in Figure 6.11:
linkedService_name = "AzureMLService"
experiment_name = "synapsewrkspac-mybookexperiement"
ws = mssparkutils.azureML.getWorkspace(linkedService_name)
experiment = Experiment(ws, experiment_name)
Figure 6.12 shows the notebook:
df = spark.sql("SELECT * FROM default.yellow_tripdataml")
datastore = Datastore.get_default(ws)
dataset = TabularDatasetFactory.register_spark_dataframe(df, datastore, name = experiment_name + "-dataset")
Please refer to Figure 6.13 for the dataset definition:
automl_config = AutoMLConfig(spark_context = sc,
task = "regression",
training_data = dataset,
label_column_name = "fare_amount",
primary_metric = "spearman_correlation",
experiment_timeout_hours = 1,
max_concurrent_iterations = 1,
enable_onnx_compatible_models = True)
You can refer to Figure 6.14 to understand how to define the AutoML configuration:
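The primary_metric = "spearman_correlation" setting tells AutoML to rank candidate models by the rank correlation between predicted and actual fare amounts. To make the metric concrete, here is a minimal stdlib sketch of Spearman correlation (the real implementation Azure ML uses handles edge cases far more robustly):

```python
def rank(values):
    # Assign 1-based average ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(actual, predicted):
    # Spearman correlation = Pearson correlation of the two rank vectors.
    ra, rp = rank(actual), rank(predicted)
    n = len(ra)
    ma, mp = sum(ra) / n, sum(rp) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(ra, rp))
    sa = sum((a - ma) ** 2 for a in ra) ** 0.5
    sp = sum((p - mp) ** 2 for p in rp) ** 0.5
    return cov / (sa * sp)

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0: perfectly monotone predictions
```

A value of 1.0 means the model orders the fares perfectly, even if the absolute predictions are off by a constant scale.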
run = experiment.submit(automl_config)
displayHTML("<a href={} target='_blank'>Your experiment in Azure Machine Learning portal: {}</a>".format(run.get_portal_url(), run.id))
After you submit the Spark job, you will see the following output. You can click on the link generated by the run, as shown in Figure 6.15:
This recipe leverages the power of the Spark pool you created to perform data exploration and train your machine learning model. The notebook experience within Synapse makes it a one-stop shop for developers and data analysts to collaborate and perform their respective activities, and it empowers the data scientist to create the AutoML model:
The takeaway from this recipe is that you can combine a notebook and UI-based approach to build the machine learning model. The notebook can be published to the Synapse workspace, and you can load it anytime and customize the model as per your need.
In this recipe, you will learn how to enrich data that resides in a dedicated SQL pool by applying the existing machine learning model that we created in the Training a model using AutoML in Synapse section. This helps data analysts and data professionals directly select and run an existing machine learning model without worrying about writing the model itself.
This is the best way to utilize the model that you trained in the Training a model using AutoML in Synapse recipe and leverage it to run predictions against existing SQL pool tables with the help of the Predict model wizard in the Synapse workspace.
To complete this recipe:
Now, let's begin the actual recipe.
Let's get back to the same Synapse workspace, and under the Data tab, expand the SQL pool database and navigate to the Tables folder:
You need to map the table columns to the machine learning model's inputs and define the model output as variable1. The mappings are mostly pre-populated, since the model is already deployed in the Azure Machine Learning workspace. Click Continue:
CREATE PROCEDURE aml_model_procedure
AS
BEGIN
SELECT
CAST([VendorID] AS [varchar]) AS [VendorID],
CAST([tpep_pickup_datetime] AS [varchar]) AS [tpep_pickup_datetime],
CAST([tpep_dropoff_datetime] AS [varchar]) AS [tpep_dropoff_datetime],
CAST([passenger_count] AS [varchar]) AS [passenger_count],
CAST([trip_distance] AS [varchar]) AS [trip_distance],
CAST([RateCodeID] AS [varchar]) AS [RatecodeID],
CAST([store_and_fwd_flag] AS [varchar]) AS [store_and_fwd_flag],
CAST([PULocationID] AS [varchar]) AS [PULocationID],
CAST([DOLocationID] AS [varchar]) AS [DOLocationID],
CAST([payment_type] AS [varchar]) AS [payment_type],
CAST([extra] AS [varchar]) AS [extra],
CAST([mta_tax] AS [varchar]) AS [mta_tax],
CAST([tip_amount] AS [varchar]) AS [tip_amount],
CAST([tolls_amount] AS [varchar]) AS [tolls_amount],
CAST([improvement_surcharge] AS [varchar]) AS [improvement_surcharge],
CAST([total_amount] AS [varchar]) AS [total_amount],
CAST([congestion_surcharge] AS [varchar]) AS [congestion_surcharge]
INTO [NYTaxiSTG].[#Tripadf]
FROM [NYTaxiSTG].[Tripadf];
SELECT *
FROM PREDICT (MODEL = (SELECT [model] FROM aml_models WHERE [ID] = 'synapsewrkspac-yellow_tripdataml-20210927014551-Best:1'),
DATA = [NYTaxiSTG].[#Tripadf],
RUNTIME = ONNX) WITH ([variable1] [real])
END
GO
EXEC aml_model_procedure
You can refer to Figure 6.22 to check the output of the stored procedure:
Let's understand what we have done so far and how this works. The dedicated SQL pool has the capability to run and score the existing machine learning model against a historical dataset. You can predict and score with familiar T-SQL scripts and call the model from within them. You saw how to create a stored procedure and define a table output for scoring using the existing machine learning model.
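The pattern behind the T-SQL PREDICT call is simple: hand each row of the staging table to the model and append the model's output as the variable1 column. The Python sketch below mirrors that flow for illustration; the model here is a made-up stand-in callable, whereas in the recipe the real model is the ONNX model stored in the aml_models table:

```python
def predict_rows(rows, model):
    """Mimic the T-SQL PREDICT pattern: score each row and append the
    model output as a new 'variable1' column.

    `model` is a stand-in callable for illustration; the recipe uses the
    deployed ONNX model instead.
    """
    scored = []
    for row in rows:
        out = dict(row)                  # keep all input columns, as PREDICT's SELECT * does
        out["variable1"] = model(row)    # model output column
        scored.append(out)
    return scored

# Made-up "fare model" for illustration only: a simple linear rule.
fake_model = lambda r: 2.5 + 1.75 * float(r["trip_distance"])

trips = [{"trip_distance": "2.0"}, {"trip_distance": "10.0"}]
for r in predict_rows(trips, fake_model):
    print(r["variable1"])  # 6.0, then 20.0
```

Note that the inputs arrive as strings, echoing the CAST(... AS varchar) conversions the stored procedure performs before scoring.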
You can refer to the following architecture to understand how the overall recipe works and what we have done. Note, however, that this functionality is currently not supported in the serverless SQL pool:
Azure Synapse Analytics provides you with a single collaborative platform for in-memory data processing, leveraging the power of Apache Spark. On this in-memory distributed platform, you have the option to run scalable machine learning algorithms.
MLlib and Spark ML are Spark's highly distributed and scalable machine learning libraries. The Synapse runtime also includes popular machine learning libraries such as TensorFlow, scikit-learn, PyTorch, and XGBoost by default.
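To ground what MLlib actually distributes, here is a toy single-machine sketch of the gradient-descent loop behind a linear regression. MLlib's LinearRegression parallelizes the gradient computation across data partitions on the cluster; this sequential, one-feature version is only meant to show the computation being distributed, not to replace the library:

```python
def fit_linear(xs, ys, lr=0.01, epochs=2000):
    """Toy one-feature linear regression via gradient descent.

    Illustrates the per-record computation that MLlib distributes across
    Spark partitions; not a substitute for the real library.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Recover y = 2x + 1 from a few noiseless points.
w, b = fit_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 2), round(b, 2))  # approximately 2.0 1.0
```

In MLlib, each partition computes its share of the gradient sums in parallel and the driver aggregates them, which is what makes the same loop scale to billions of rows.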
SynapseML (previously MMLSpark) is the Microsoft machine learning library for Apache Spark. It includes many distributed frameworks for Spark and provides seamless integration with the Microsoft Cognitive Toolkit (CNTK) and OpenCV. This enables high throughput with extraordinary performance because of the Spark cluster running underneath.
In this recipe, we will learn how we can integrate Azure Cognitive Services into the Synapse workspace.
With Azure Cognitive Services, we are now bringing together the power of AI to enrich our data with pre-trained AI models.
For this recipe, you will need the following:
Let's go through the step-by-step process of integrating Cognitive Services with the Synapse workspace:
Let's understand how we will leverage the Anomaly Detector Cognitive Service on the Spark pool table to run predictions with the model. Azure Cognitive Services is a set of cloud-based REST APIs that help you build various kinds of cognitive intelligence into your applications.
Here, we are leveraging the existing Cognitive Service, which we created in the How to do it… section.
You need to create a machine learning prediction model on the Spark table on which you want to run anomaly detection:
This will eventually connect with the existing Cognitive Service that we created in the Getting ready section in this recipe:
This will eventually generate the code for calling the Anomaly Detector pre-trained model, which you can run and modify as required:
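Conceptually, Anomaly Detector flags points that deviate from a series' expected pattern. The stdlib sketch below illustrates the underlying idea with a simple z-score over a trailing window; this is a toy stand-in, and the real service applies far richer seasonal and trend models to the series:

```python
def zscore_anomalies(series, window=5, threshold=3.0):
    """Flag points whose deviation from the trailing-window mean exceeds
    `threshold` standard deviations.

    Toy stand-in for the Anomaly Detector service, for intuition only.
    """
    flags = []
    for i, v in enumerate(series):
        past = series[max(0, i - window):i]
        if len(past) < window:
            flags.append(False)          # not enough history to judge yet
            continue
        mean = sum(past) / len(past)
        var = sum((p - mean) ** 2 for p in past) / len(past)
        std = var ** 0.5 or 1e-9         # guard against a zero-variance window
        flags.append(abs(v - mean) / std > threshold)
    return flags

series = [10, 11, 10, 12, 11, 10, 11, 95, 10, 11]
anomalies = [i for i, f in enumerate(zscore_anomalies(series)) if f]
print(anomalies)  # [7] -- the spike stands out from its window
```

The generated Synapse code instead sends the series to the Anomaly Detector REST endpoint, which returns a similar per-point anomaly flag.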