© Julian Soh, Marshall Copeland, Anthony Puca, and Micheleen Harris 2020
J. Soh et al., Microsoft Azure, https://doi.org/10.1007/978-1-4842-5958-0_19

19. Data Engineering and the Modern Data Estate

Julian Soh1 , Marshall Copeland2, Anthony Puca3 and Micheleen Harris1
(1)
Washington, WA, USA
(2)
Texas, TX, USA
(3)
Colorado, CO, USA
 

In Chapter 17, we discussed data abundance and emerging data trends. Then in Chapter 18, we looked at database migration to Azure to better manage our data by providing more redundancy and protection. The reality is that data can exist on-premises, in the cloud, or both, so there need to be tools and processes to help manage the moving, copying, merging, and transforming of data.

Furthermore, when we talk about data in the cloud, we are not only talking about IaaS or PaaS. Data sources also reside in SaaS, such as in Office 365, Dynamics 365, SalesForce, SAP, and Oracle Service Cloud. Data in PaaS can reside in Amazon S3, Amazon Redshift, Google AdWords, or Azure SQL Database. So, data abundance is not only volume and type; it includes the vast diversity in storage and applications as well.

In this chapter, we explore the role of the data engineer and the tools and processes to manage this vast data estate. A data engineer prepares and cleans data so that it can be used in reports and analyses; this work is sometimes referred to as data wrangling. The quality of downstream workloads, including the success of advanced analytical capabilities like machine learning, depends on the quality of the available data, not just the quantity.

Terminology

There are a few new terms that describe the landscape of data engineering and the evolution of data transformation processes.

Data Estate

Data estate is the term used to describe not only what type of data an organization owns and has access to, but also where that data resides. Traditionally, data is locked into individual databases and applications, often called data silos because it is very difficult to unlock that data. Today, data continues to exist in silos, but those silos now extend from on-premises into different clouds. A successful digital transformation journey starts with knowing what kind of data you own, where it resides, and how to free it from silos so that it can unlock profound insights that drive business transformation with AI and machine learning. So, the journey for digital transformation begins with knowing the state of your data estate.

Note

There is a good tutorial about discovering and managing shadow IT in your network, which also addresses data silos. You can find this tutorial at https://docs.microsoft.com/en-us/cloud-app-security/tutorial-shadow-it.

Modern Data Warehouse: ELT vs. ETL

When we talk about data warehousing here, we are not talking about a product or a technology. We are talking about the strategies and processes that stage and transform our data so that it is ready for data scientists and analysts to use.

Data in its most rudimentary form is known as raw data, and we put that raw data through stages of cleaning and enrichment so that it becomes more meaningful. Traditionally, this is done via a process known as ETL (extract-transform-load). The idea behind ETL is to create a data pipeline that extracts data from some system, maps it to the desired fields in another database (usually a data warehouse), transforms the data to conform to the target fields, and then loads it into that database. The process runs exactly in the order implied by its name: extract, transform, load.

With the vast amount of data and the need to stay agile, the modern approach reverses the last two steps of ETL. Instead of transforming the data and then loading it, the idea is to load the data as-is into a staging area and transform it after the fact. This allows the data pipeline to extract and load data efficiently, since the transformation is decoupled from that process. If desired, which is often the case, a second, separate pipeline can carry out the transformation, thereby streamlining the process. This approach is known as ELT (extract-load-transform).

In many cases, ELT is driven by the speed of data; take an IoT scenario, for example. Bursts of streaming data from IoT devices need to be captured as soon as they arrive. There is no time to transform the data, and different downstream systems may need all or a subset of this streaming data to make decisions, some of which must be made in real time.

Modern Storage and Big Data

Storage for today’s data needs to be reliable, fast, redundant, and capable of storing a lot of data (Big Data). Furthermore, it must be more economically viable than owning the infrastructure to deliver these capabilities. Moreover, storage for today’s data must account for, and scale with, the three Vs: volume, velocity, and variety.

Examples of storage options in Azure designed with these characteristics include Azure Blob, Azure Data Lake, Azure Queues, and Azure Tables. In this chapter, we focus primarily on Azure Blob storage and Azure Data Lake Services (ADLS).

Modern Data Platform Strategies

To optimize cost, one of the modern data platform strategies is to decouple storage from compute, which is a primary reason driving the modern data warehouse ELT approach. For example, if you load data into a SQL database, that database is supported by a database engine that relies on compute resources. By decoupling storage from compute, we can store raw and staging data in standalone storage like Azure Blob or Azure Data Lake Services.

Compute is used only when there is a need to analyze the data. PaaS services like Azure Databricks can access raw or staged data in Azure Blob or Azure Data Lake Services and spin up managed Spark clusters only when compute is needed. When idle, the Spark clusters can be shut down without affecting the data storage. Likewise, data can be loaded into Azure Synapse Analytics, the new name for Azure SQL Data Warehouse. Azure Synapse is a service built specifically for analysis. It can load external data from Azure Blob storage using PolyBase, and its database engine can also be paused. The strategy of decoupling storage from compute can significantly reduce the cost of managing a data estate.

Separating storage from compute also gives organizations the ability to tier data properly. Getting data to end users at the right time can further affect costs by creating hot and cold paths. Data needed for real-time analysis goes to the hot path and therefore uses high-performance storage, whereas data that does not need to be analyzed in real time but still provides good insight (which is essential for machine learning and developing models) is served by a cold or warm path.
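
As a concrete illustration of tiering, Azure Blob storage exposes per-blob access tiers (Hot, Cool, and Archive). The following is a minimal sketch using the azure-storage-blob (v12) Python package; the connection string is a placeholder, and the container and blob names simply echo the stage container and customer.csv file used later in this chapter.

```python
# Minimal sketch: move a blob that is no longer needed on the hot path to the Cool tier.
# Assumes the azure-storage-blob (v12) package; the connection string, container,
# and blob names are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="stage", blob="customer.csv")

# Data served by the cold/warm path can live in a cheaper tier.
blob.set_standard_blob_tier("Cool")
```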

In Chapter 14, we introduced Azure Blob and Azure Data Lake Services. We also looked at using Azure Storage Explorer to navigate and manage the data stored in them. In this chapter, we leverage the Azure Blob storage we created in Chapter 14 as part of our modern data warehouse pipeline process. Relational databases and other sources are still common, but the three Vs continue to drive the adoption of Blob storage as the preferred source for future data solutions.

Azure Data Factory (ADF)

Azure Data Factory is a PaaS data orchestrator. Its primary use case is being the single tool for all data estate management activities, such as moving, copying, loading, and transforming data. Azure Data Factory comes with many on-premises and cloud connectors to different types of data sources and applications to unlock the data within those sources. As with everything else, the best way to understand Azure Data Factory is to get our hands dirty with hands-on exercises. You start with an exercise to deploy Azure Data Factory and explore its interface and capabilities. Then there are more exercises based on a few use case scenarios.

We have also forked a copy of Microsoft’s documentation for Azure Data Factory to our GitHub repo at https://github.com/harris-soh-copeland-puca/azure-docs/tree/master/articles/data-factory.

Hands-on: Installing Azure Data Factory

In this first exercise, you deploy an instance of Azure Data Factory and explore its user interface (UI) and some of Azure Data Factory’s capabilities.
  1. 1.

    From the Azure portal, click Create a resource.

     
  2. 2.

    Type Data Factory in the search box, and then click Data Factory in the search results.

     
  3. 3.

    Click Create.

     
  4. 4.

    Give this instance of Azure Data Factory a globally unique name.

     
  5. 5.

    Keep V2 as the version to deploy.

     
  6. 6.

    Place this Azure Data Factory in an existing resource group or create a new one.

     
  7. 7.

    Pick a location closest to you.

     
  8. 8.

    Uncheck the checkbox to enable Git.

    Note Git is a source and change control technology that can protect the artifacts in Azure Data Factory, such as pipelines, datasets, connections, and so forth. We highly recommend that you turn on Git even though we do not go through the steps here. We cover Git and source control in Chapter 21, and you can turn on Git after Azure Data Factory is deployed. At the time of writing, Azure Data Factory can only use GitHub or Azure DevOps but not other Git-based services.

     
  9. 9.

    Click Create.

     
  10. 10.

    After Azure Data Factory is created, click Go to resource.

     

For Azure Data Factory, the core user interface is the hosted authoring tool. There is no software to install, and you spend most of your time in the authoring tool.
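
The portal walkthrough above is the quickest way to get started, but a factory can also be deployed programmatically, which is useful for repeatable environments. The following is a minimal sketch using the azure-identity and azure-mgmt-datafactory Python packages; the subscription, resource group, factory name, and region are placeholders, and the exact credential class you use may vary with your SDK versions.

```python
# Minimal sketch: create an Azure Data Factory (V2) instance with the Python SDK.
# Assumes the azure-identity and azure-mgmt-datafactory packages are installed;
# the names below (subscription, resource group, factory, region) are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"
rg_name = "<your-resource-group>"       # must already exist
df_name = "<globally-unique-adf-name>"  # same naming rule as step 4 above

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) the factory in the region closest to you.
df_resource = Factory(location="westus2")
df = adf_client.factories.create_or_update(rg_name, df_name, df_resource)
print(df.name, df.provisioning_state)
```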

Hands-on: Exploring Azure Data Factory

In this exercise, you explore the Azure Data Factory authoring workspace. This workspace is the core of Azure Data Factory and is the place where data engineers spend most of their time.
  1. 1.

    From the Overview pane of Azure Data Factory, locate and click Author and monitor. This launches the Azure Data Factory authoring workspace.

Note Instead of going to the Azure portal, selecting the Azure Data Factory instance, and then clicking Author and monitor, you can go directly to the Azure Data Factory authoring workspace at https://adf.azure.com. If you have multiple subscriptions and/or Azure Data Factory instances, you can select the correct subscription and Azure Data Factory instance there.

     
  2. 2.

    From the left border, locate and click the pencil icon, which launches the authoring UI, as seen in Figure 19-1.

     
Figure 19-1

Authoring in Azure Data Factory

  1. 3.

    When the authoring workspace is launched, locate the five different classes of Azure Data Factory resources—pipelines, datasets, data flows, connections, and triggers, as seen in Figure 19-2.

     
Figure 19-2

Azure Data Factory resources

Let’s take a moment to explore these Azure Data Factory resources. These resources are the elements we use to build a data pipeline activity. Using them via the UI is a matter of dragging and dropping them onto the canvas on the right, or clicking the + located next to the search box. Figure 19-3 depicts the relationship between the Azure Data Factory resources and how they are used as building blocks.
Figure 19-3

Architecture of Azure Data Factory resources

Triggers

Triggers are quite self-explanatory: they activate the data pipeline process and are based on schedules, events, or timed intervals (tumbling windows).
  1. 1.

Click Triggers in the Azure Data Factory authoring workspace, and then click + New.

     
  2. 2.

    Observe the properties of the trigger object, as seen in Figure 19-4. Click Cancel for now because you are only exploring the trigger and not actually implementing it.

     
Figure 19-4

Azure Data Factory trigger options
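
The same trigger types can also be defined in code once a pipeline is published. The following is a minimal sketch using the azure-mgmt-datafactory Python models of a schedule trigger that runs the Copy_Blob_SQL pipeline (built later in this chapter) once a day; adf_client is a DataFactoryManagementClient as sketched earlier, the start time is a placeholder, and the call that starts a trigger is named triggers.start or triggers.begin_start depending on your SDK version.

```python
# Minimal sketch: a schedule trigger that runs a pipeline daily.
# Assumes an existing DataFactoryManagementClient (adf_client) and a published
# pipeline; the resource group, factory name, and start time are placeholders.
from datetime import datetime, timezone
from azure.mgmt.datafactory.models import (
    ScheduleTrigger, ScheduleTriggerRecurrence, TriggerResource,
    TriggerPipelineReference, PipelineReference,
)

recurrence = ScheduleTriggerRecurrence(
    frequency="Day", interval=1,
    start_time=datetime(2020, 6, 1, tzinfo=timezone.utc), time_zone="UTC",
)
trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            reference_name="Copy_Blob_SQL", type="PipelineReference"),
        parameters={},
    )],
)
adf_client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "DailyTrigger",
    TriggerResource(properties=trigger),
)
# Triggers are created in a stopped state; start the trigger when ready
# (the method is triggers.start or triggers.begin_start, depending on SDK version).
```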

Data Pipeline or Data Flow

The data pipeline or data flow represents the orchestration work to be done by Azure Data Factory. It describes the sequence in which different activities need to take place to get the data into the right form and location. Figure 19-5 overlays the conceptual representation of a data pipeline onto what an actual pipeline might look like in Azure Data Factory. In the figure, each box represents an actual Azure Data Factory activity, carried out in sequence from left to right. Each activity has dependencies on the previous activity and may take parameters as inputs to control the workflow. A programmatic sketch of such a pipeline follows Figure 19-5.
Figure 19-5

The data pipeline/flow
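
Everything you assemble on the canvas is stored as JSON behind the scenes (you can inspect it with the Code button in step 5 below), and the same objects can be authored with the Azure SDK. The following is a minimal sketch, patterned on Microsoft's Python quickstart and using the azure-mgmt-datafactory models, of a pipeline containing a single Copy data activity. The activity and pipeline names anticipate the ones you create in the upcoming exercises, the dataset references assume the datasets defined later in this chapter, and adf_client is a DataFactoryManagementClient as sketched earlier.

```python
# Minimal sketch: a pipeline with one Copy data activity, expressed with the
# Python SDK models. Assumes an existing DataFactoryManagementClient (adf_client)
# and the datasets created later in this chapter; the resource group and factory
# names are placeholders.
from azure.mgmt.datafactory.models import (
    CopyActivity, PipelineResource, DatasetReference, BlobSource, BlobSink,
)

copy_activity = CopyActivity(
    name="Copy_from_Blob_to_Blob",
    inputs=[DatasetReference(reference_name="CSV_on_Blob_raw", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="CSV_on_Blob_stage", type="DatasetReference")],
    source=BlobSource(),   # read from Azure Blob storage
    sink=BlobSink(),       # write to Azure Blob storage
)

pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "Copy_Blob_SQL", pipeline,
)
```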

  1. 1.

    Click the + next to the search box in Azure Data Factory’s authoring workspace, and select Pipeline, as seen in Figure 19-6.

     
Figure 19-6

Creating a new pipeline in Azure Data Factory

  1. 2.

    A new pipeline named pipeline1 is created and placed under Pipelines.

     
  2. 3.

    A new pane titled Activities appears. Explore the different types of activities that a data pipeline can hold. Locate each of the activities shown in Figure 19-5.

     
  3. 4.

    Expand Move & transform and drag and drop the Copy data activity to the canvas for pipeline1 on the right. Then change the name of this activity and see it change in the canvas as well, as seen in Figure 19-7. For this exercise, name this activity Copy_from_Blob_to_Blob.

     
Figure 19-7

Copy data activity in Azure Data Factory

  1. 5.

    Click Code in the top-right corner of the canvas, and look at the JSON definition for this pipeline. Click Cancel when you are done.

     

Datasets

Datasets are the repositories where data is sourced or deposited. These repositories may be databases, such as SQL Server or Oracle; services, like a REST API endpoint; or SaaS APIs, like SalesForce or Dynamics 365.

Figure 19-8 shows the properties of an Azure SQL Database dataset object in Azure Data Factory. Clicking the tabs for this dataset object shows the different properties that you must set up to establish connectivity to the data source, and ultimately the content. Each dataset object that is available in Azure Data Factory consists of unique and known properties that are specific to the data source type.

In this example, since the dataset is an Azure SQL Database, its properties contain a field for the table that this dataset object should access. Compare this to the properties of a JSON dataset, which is essentially a file that resides in some location. The properties of such a dataset, as seen in Figure 19-9, do not contain a field to identify a table, but rather a file path and the encoding type.
Figure 19-8

Properties of an Azure SQL Database dataset in Azure Data Factory

Figure 19-9

Properties of a JSON file as a dataset

  1. 1.

    Click the ellipses next to Datasets and select New dataset.

     
  2. 2.

    Browse through the different built-in datasets that Azure Data Factory can use. Click the Azure tab and select Azure Blob Storage.

     
  3. 3.

    Click Continue.

     
  4. 4.

    Select DelimitedText as the format, and then click Continue.

     
  5. 5.

    Give this dataset a name. For this exercise, use the name CSV_on_Blob_raw.

     
  6. 6.

    Click the drop-down menu to expand the Linked service options, and select + New.

     

From here, we start defining the Linked service, which is a separate Azure Data Factory resource that a dataset depends on. We continue this exercise in the next section.

Linked Services

Linked services are objects that define the connection to the location where a data source resides. The connection can be to a type of database engine, a file share, a REST endpoint, or even a SaaS application. Generally, the core information required defines the location and provides credentials for authentication. Once a connection is successfully established, it can be used as the transport layer to populate the datasets that rely on it. That is the role of a Linked service.
  1. 1.

    Picking up from the previous exercise, you should see the pane to establish a New linked service for Azure Blob storage.

     
  2. 2.

    Give this linked service a name. For this exercise, name it Azure_Blob_LS.

     
  3. 3.

Leave AutoResolveIntegrationRuntime selected as the Connect via integration runtime option.

     
  4. 4.

Leave the Authentication method set to Account key.

     
  5. 5.

    Next, remember the Azure Blob storage that you created in Chapter 14? You use that storage location in this exercise. So, open a new tab or a different browser instance and go to https://portal.azure.com.

     
  6. 6.

    Navigate to the Azure Blob storage resource built in Chapter 14 and retrieve the storage account name and one of the Access keys.

     
  7. 7.

    Go back to Azure Data Factory.

     
  8. 8.

    Select Connection string, and then select Enter manually as the option for the account selection method.

     
  9. 9.

Copy and paste the storage account name and storage account key that you retrieved in step 6.

    Note When Azure Data Factory and Azure Blob storage are in the same Azure subscription, as is the case in this exercise, you could select From Azure subscription as the account selection method and browse for the Azure Blob storage instead of copying and pasting account names and keys. But this exercise demonstrates that Azure Blob storage could reside in a different subscription, and the only way to access that location is via other authentication methods. In this case, we used an access key, but we could have used Azure Active Directory, Key Vault, or a temporary SAS token.

     
  10. 10.

    Click Test connection at the bottom of the pane and make sure the connection is successful. Next, click Create.

     
  11. 11.

    You have now created a Linked service and are back to finalizing the details of the dataset. Click Browse and select the customer.csv file located in the raw folder. This file is from an exercise in Chapter 14.

     
  12. 12.

    Click OK.

     
  13. 13.

    Check the First row as header box.

     
  14. 14.

    Click OK again to complete the creation of the dataset.

     

Datasets and Linked Services

In the last two exercises, you created a dataset and its underlying linked service in a single sequence because of their dependency. If you browse Azure Data Factory now, you see both objects in their respective locations in the editor. Because they are discrete objects, you can create a linked service object first and then use it in a dataset later. In fact, you can also reuse linked services. Now that we have a linked service to our Azure Blob storage in Azure Data Factory, we do not have to create another one. If we need a different dataset populated with another file from this same location, we just reuse this linked service.
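
The same objects can be created programmatically. The following is a minimal sketch, patterned on Microsoft's Python quickstart and using the azure-mgmt-datafactory models, of a storage linked service and a blob dataset roughly equivalent to Azure_Blob_LS and CSV_on_Blob_raw; the quickstart's AzureBlobDataset type stands in for the delimited-text dataset the UI created, the connection string and paths are placeholders, and adf_client is a DataFactoryManagementClient as sketched earlier.

```python
# Minimal sketch: create an Azure Blob storage linked service and a dataset
# that references it. Assumes an existing DataFactoryManagementClient (adf_client);
# the connection string, resource group, and factory names are placeholders.
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService, SecureString,
    DatasetResource, AzureBlobDataset, LinkedServiceReference,
)

rg_name, df_name = "<resource-group>", "<factory-name>"

# Linked service: the connection (location + credentials), reusable by many datasets.
conn = SecureString(value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")
ls = LinkedServiceResource(properties=AzureStorageLinkedService(connection_string=conn))
adf_client.linked_services.create_or_update(rg_name, df_name, "Azure_Blob_LS", ls)

# Dataset: a specific file reached through that linked service.
ls_ref = LinkedServiceReference(reference_name="Azure_Blob_LS", type="LinkedServiceReference")
ds = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref,
    folder_path="raw",            # container/folder
    file_name="customer.csv",
))
adf_client.datasets.create_or_update(rg_name, df_name, "CSV_on_Blob_raw", ds)
```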

Cloning

Any resource in Azure Data Factory can be cloned. In this exercise, we clone the CSV_on_Blob_raw dataset.
  1. 1.

    Click the ellipses next to the CSV_on_Blob_raw dataset and select Clone.

     
  2. 2.

    The dataset is cloned with the same name and _copy1 appended. So, you should see CSV_on_Blob_raw_copy1 as a new dataset.

     
  3. 3.

    Rename this dataset CSV_on_Blob_stage.

     
  4. 4.

    Click the Connection tab.

     
  5. 5.

    Remove the contents of the container, directory, and file name. Enter stage as the file path, as seen in Figure 19-10.

     
Figure 19-10

Editing file path for a dataset

In this exercise, you explored the capability of cloning a resource like a dataset. In the modified dataset, we reused the linked service by leaving that field unchanged, but pointed the dataset to a different container in Azure Blob storage.

You can also clone resources that contain other resources. For example, when you clone an entire pipeline, the pipeline itself is cloned, but be aware that the individual activities it contains are not. So, if you modify the properties of an activity in the cloned pipeline, you are modifying the same activity referenced by the original pipeline.

Next Steps: Self-Guided Assignment

Before continuing, we leave it up to you to go through the preceding exercises once again to create another dataset for Azure SQL Database and its required linked service. Use the tpcc Azure SQL Database from Chapter 17. You need this second dataset pointing to the customer table so that you can append the contents of the customer.csv file from Azure Blob storage to that table using a Copy data pipeline in Azure Data Factory. Use Users_on_SQLDB as the name of the dataset and Azure_SQLDB_LS as the name of the Azure SQL Database linked service.
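
If you want to sanity-check your work in code, the following is a minimal sketch of roughly what the assignment produces, again using the azure-mgmt-datafactory Python models; the connection string is a placeholder, the table name assumes the customer table in the tpcc database, and adf_client is a DataFactoryManagementClient as sketched earlier.

```python
# Minimal sketch: an Azure SQL Database linked service and a table dataset,
# roughly equivalent to Azure_SQLDB_LS and Users_on_SQLDB from the assignment.
# Assumes an existing DataFactoryManagementClient (adf_client); the connection
# string, resource group, and factory names are placeholders.
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureSqlDatabaseLinkedService, SecureString,
    DatasetResource, AzureSqlTableDataset, LinkedServiceReference,
)

rg_name, df_name = "<resource-group>", "<factory-name>"

sql_conn = SecureString(
    value="Server=tcp:<server>.database.windows.net;Database=tpcc;"
          "User ID=<user>;Password=<password>;Encrypt=true;"
)
ls = LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(connection_string=sql_conn))
adf_client.linked_services.create_or_update(rg_name, df_name, "Azure_SQLDB_LS", ls)

ls_ref = LinkedServiceReference(reference_name="Azure_SQLDB_LS", type="LinkedServiceReference")
ds = DatasetResource(properties=AzureSqlTableDataset(
    linked_service_name=ls_ref,
    table_name="dbo.customer",   # the customer table used as the sink
))
adf_client.datasets.create_or_update(rg_name, df_name, "Users_on_SQLDB", ds)
```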

Hands-on: Creating a Copy Data Pipeline

You are going to continue building pipeline1, which you created earlier in this chapter.
  1. 1.

    Select pipeline1.

     
  2. 2.

    On the editing canvas, click any part of the white space to unselect the Copy data activity. This ensures that the properties shown below the canvas belong to the pipeline, and not the activity. Compare the two screenshots in Figure 19-11.

     
Figure 19-11

Selecting the right properties to edit

  1. 3.

    Rename pipeline1 as Copy_Blob_SQL.

     
  2. 4.

    Select the Copy_from_Blob_to_Blob activity, and then click the Source tab in the properties below the canvas.

     
  3. 5.

    Click the Source dataset drop-down menu, and select CSV_on_Blob_raw as the dataset.

     
  4. 6.

    Click Open to open the dataset.

     
  5. 7.

    Select the Connection tab and check the First row as header box.

     
  6. 8.

    Select the Schema tab, and click Import schema.

     
  7. 9.

    Select From connection/store as the schema source. You should see the schema from the CSV file imported.

     
  8. 10.

    At the top of the canvas, select Copy_Blob_SQL.

     
  9. 11.

    Click the Sink tab in the properties pane beneath the canvas and select CSV_on_Blob_stage as the sink dataset.

     
  10. 12.

    Type .csv in the File extension box, as seen in Figure 19-12.

     
Figure 19-12

Specifying file extension

  1. 13.

    Click Open to open the sink dataset properties and select the Connection tab. Make sure the First row as header box is checked.

     
  2. 14.

At the top of the canvas, select the Copy_Blob_SQL tab to show the pipeline.

     
  3. 15.

    Click Debug to run this pipeline.

     
  4. 16.

    The properties pane beneath the canvas focuses on the Output tab. Click the icon that looks like a circled arrow to refresh and see the status of the pipeline, as seen in Figure 19-13.

     
Figure 19-13

Status of pipeline execution

  1. 17.

The pipeline should have executed successfully. At this point, this pipeline has a single copy activity that copied a CSV file from one container in Azure Blob storage to another container in the same storage account. If you had used a different linked service for the sink dataset, this file could have been copied to another Azure subscription or to a completely different service altogether.

     

Using the knowledge gained from Chapter 14 (Azure Storage), confirm that customer.csv was copied to the stage folder in Azure Blob storage (use Azure Storage Explorer). Once you have run this pipeline, you may want to re-run it to see the new timestamp on the file and validate that the copy took place.
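
Debug runs the pipeline interactively, but once published, a pipeline can also be run and monitored programmatically, which is a handy way to validate copies like this one. The following is a minimal sketch using the azure-mgmt-datafactory Python SDK; adf_client, the resource group, and the factory name are as in the earlier sketches.

```python
# Minimal sketch: trigger a published pipeline run and poll its status.
# Assumes an existing DataFactoryManagementClient (adf_client); the resource
# group and factory names are placeholders.
import time

rg_name, df_name = "<resource-group>", "<factory-name>"

run = adf_client.pipelines.create_run(rg_name, df_name, "Copy_Blob_SQL", parameters={})

# Poll until the run finishes (Succeeded, Failed, or Cancelled).
while True:
    pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
    print("Status:", pipeline_run.status)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)
```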

Saving Your Work

Look at the top of the authoring workspace in Azure Data Factory, and you see a Publish all button with a number next to it. This number indicates how many resources have not yet been pushed to Azure Data Factory, or to a Git repo if source control is enabled.

Click Publish All. A separate pane appears; it lists the pending changes and new workloads that need to be published to Azure Data Factory, as shown in Figure 19-14.
Figure 19-14

Resources that need to be published to Azure Data Factory

Click the Publish button, and the resources are pushed up to Azure Data Factory, GitHub, or Azure DevOps Repos, depending on your configuration.

Note

Publishing in Azure Data Factory is akin to saving your work. It is important to note that all changes remain in the browser session until they are published. Therefore, if you close the browser session without publishing your work, all the changes are lost. During publishing, all changes are validated, and errors must be addressed before the work can be published. You can run a separate validation without publishing by clicking the Validate option located next to Debug.

Hands-on: Multiple Activities in a Pipeline

Usually, there are multiple activities in a pipeline. You can have one activity per pipeline and then have one pipeline trigger another, as seen in Figure 19-15, but that is not the best method. The ability to execute another pipeline is useful when you are reusing existing pipelines to build a new process, or when splitting up a pipeline that has so many activities that it is difficult to keep track of.

You may also want to control the timing of one set of activities relative to a second set; in that case, you can place them in two separate pipelines and have the first pipeline trigger the second when certain conditions are met (a sketch of the Execute Pipeline activity follows Figure 19-15).
Figure 19-15

Executing another pipeline as one of the activities within a pipeline
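
In the SDK, chaining one pipeline to another like this is just another activity type. The following is a minimal sketch using the azure-mgmt-datafactory Python models of a parent pipeline whose only activity executes Copy_Blob_SQL; the parent pipeline name is a placeholder, and adf_client is a DataFactoryManagementClient as sketched earlier.

```python
# Minimal sketch: a pipeline whose only activity runs another pipeline.
# Assumes an existing DataFactoryManagementClient (adf_client); the resource
# group, factory, and parent pipeline names are placeholders.
from azure.mgmt.datafactory.models import (
    ExecutePipelineActivity, PipelineReference, PipelineResource,
)

run_child = ExecutePipelineActivity(
    name="Run_Copy_Blob_SQL",
    pipeline=PipelineReference(reference_name="Copy_Blob_SQL", type="PipelineReference"),
    wait_on_completion=True,   # block until the child pipeline finishes
)
parent = PipelineResource(activities=[run_child])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "Parent_Pipeline", parent,
)
```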

One easy strategy to adopt is to clone pipelines throughout the build stage so you can do unit testing as data is being copied and transformed from stage to stage.

In the preceding exercise, we now have a pipeline that copies data from one container to another (raw to stage) in Azure Blob storage, and we know it works. In this exercise, we continue building this pipeline by moving the data from the stage container to the Azure SQL Database dataset you created earlier as part of your self-guided assignment.
  1. 1.

    Click the Copy_Blob_SQL pipeline.

     
  2. 2.

    Expand Move & transform. Drag a new Copy data activity and place it to the right of Copy_from_Blob_to_Blob in the canvas, as seen in Figure 19-16.

     
Figure 19-16

New Copy data activity added to the pipeline

  1. 3.

    Rename the new Copy data activity from Copy data1 to Copy_from_Blob_to_SQL.

     
  2. 4.

    Click the Source tab in the properties pane beneath the canvas and select CSV_on_Blob_stage as the source dataset.

     
  3. 5.

    Click Preview data to see the data in the CSV file. This confirms that Azure Data Factory can access the file and can successfully read its contents.

     
  4. 6.

    Select Wildcard file path as the option for file path type.

     
  5. 7.

    Click the Sink tab and select the Users_on_SQLDB dataset as the sink dataset.

     
  6. 8.

    Click the Mapping tab, and then click Import schemas.

     
  7. 9.

    Azure Data Factory attempts to map the fields between the source and sink datasets. Notice that it was not able to map several of the fields, as seen in Figure 19-17.

     
Figure 19-17

Initial mapping of source and sink datasets

  1. 10.

Modify the mapping appropriately by defining the source types and mapping them to the right fields in the sink dataset. Delete the fields that do not match up by hovering next to each field and clicking the trash icon, as seen in Figure 19-18.

     
Figure 19-18

Mapping the source and sink datasets

  1. 11.

    Next, click and hold down on the Success Output interface of the Copy_from_Blob_to_Blob activity. Connect a line to the input interface of the Copy_from_Blob_to_SQL activity, as seen in Figure 19-19.

     
Figure 19-19

Connecting two activities in a pipeline

  1. 12.

    At the top of the canvas, click Debug.

     
  2. 13.

The pipeline now runs in two stages. First, it runs the copy data activity that copies customer.csv from the raw container in Azure Blob storage to the stage container.

     
  3. 14.

The pipeline then moves to the next copy data activity, where it copies the fields in customer.csv in the stage container in Azure Blob storage to the customer table in Azure SQL Database, using the mappings defined in step 10. The results of both successful stages are shown in the Output pane of the pipeline, as seen in Figure 19-20.

     
Figure 19-20

Successful execution of a pipeline with multiple activities

These exercises should have provided you with a good foundation for how Azure Data Factory works. Although we only covered a very basic pipeline with copy activities, you should explore the special activities that let you build logic into your pipelines, like the ForEach and Until loops and the If Condition activity, located under Iteration & conditionals.
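
As a taste of those control-flow activities, the following is a minimal sketch, again with the azure-mgmt-datafactory Python models, of a ForEach activity that runs an inner activity once per item of an array pipeline parameter; the parameter name is a placeholder, inner_copy is assumed to be an activity you have already defined (for example, a CopyActivity as sketched earlier), and the @-prefixed expression uses the same expression language you see in the UI.

```python
# Minimal sketch: a ForEach activity iterating over a pipeline parameter.
# Assumes an existing DataFactoryManagementClient (adf_client) and a previously
# defined inner activity (inner_copy); names and the parameter are placeholders.
from azure.mgmt.datafactory.models import (
    ForEachActivity, Expression, PipelineResource, ParameterSpecification,
)

for_each = ForEachActivity(
    name="For_each_file",
    items=Expression(value="@pipeline().parameters.fileNames"),
    activities=[inner_copy],   # e.g., a CopyActivity defined as shown earlier
)
pipeline = PipelineResource(
    parameters={"fileNames": ParameterSpecification(type="Array")},
    activities=[for_each],
)
adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "ForEach_Demo", pipeline,
)
```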

More complex activities and the use of variables, parameters, and dynamically created content are published on this book’s GitHub repo at https://github.com/harris-soh-copeland-puca.

Accessing On-Premises Data Sources

The existence of hybrid clouds is expected to continue. Therefore, there are scenarios where Azure Data Factory must connect to on-premises data sources to copy or move data to the cloud, or vice versa.

This hybrid scenario depends on special software called the self-hosted integration runtime (IR). The self-hosted integration runtime is an Azure Data Factory agent that runs on a computer connected to the on-premises environment. It has access to the on-premises data sources and has outbound Internet connectivity to contact an Azure Data Factory instance. In that sense, the self-hosted integration runtime serves as a data gateway between the on-premises environment and Azure Data Factory.

The Architecture of the Self-Hosted Integration Runtime

Figure 19-21, from Microsoft’s documentation, is a high-level depiction of hybrid data flows involving the self-hosted integration runtime (see https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime).
Figure 19-21

High-level diagram of hybrid data flows using the self-hosted IR (Source: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime)

Installing and Configuring the Self-Hosted Integration Runtime

Installing the self-hosted integration runtime is extensively documented and straightforward. You can find a forked copy of the latest installation instructions in our GitHub repo at https://github.com/harris-soh-copeland-puca/azure-docs/blob/master/articles/data-factory/create-self-hosted-integration-runtime.md.

The installation and configuration of the self-hosted integration runtime involve the following steps.
  1. 1.

    In the Azure portal, launch the authoring workspace in Azure Data Factory and click Connections.

     
  2. 2.

    Click the Integration runtime tab, and then click + New.

     
  3. 3.

    Select Self-Hosted, and then click Continue.

     
  4. 4.

    Provide a name for the self-hosted integration runtime, and then click Create.

     
  5. 5.

Once created, you are provided two keys. Copy one of the keys; it is needed during the installation of the self-hosted integration runtime software later.

     
  6. 6.

    Identify a computer or virtual machine that hosts the self-hosted integration runtime. There are special considerations, such as not hosting the self-hosted integration runtime together with a database server. Refer to the detailed documentation from our GitHub repo using the link at the top of this section.

     
  7. 7.

    Download the self-hosted integration runtime to the computer and initiate the installation process.

     
  8. 8.

When the installation is complete, it asks for a key to register with an Azure Data Factory instance. Paste the key copied in step 5 and click Register.

     
Those are the steps for deploying a self-hosted integration runtime. When you look in Azure Data Factory, you should see the new self-hosted integration runtime, and its status should be Running, as seen in Figure 19-22.
Figure 19-22

Azure Data Factory showing the status of the self-hosted integration runtime

Now, when you define datasets that draw from on-premises resources, you need to define a Linked service that utilizes the self-hosted integration runtime instead of the default AutoResolveIntegrationRuntime.

AutoResolveIntegrationRuntime is an Azure-hosted integration runtime used to connect to any Azure or cloud-based endpoint. Self-hosted integration runtimes access on-premises data sources or endpoints behind a firewall.
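
For example, a SQL Server linked service that points at an on-premises database references the self-hosted integration runtime by name rather than AutoResolveIntegrationRuntime. The following is a minimal sketch using the azure-mgmt-datafactory Python models; the integration runtime name, server, and credentials are placeholders, and adf_client is a DataFactoryManagementClient as sketched earlier.

```python
# Minimal sketch: an on-premises SQL Server linked service that connects through
# a self-hosted integration runtime instead of AutoResolveIntegrationRuntime.
# Assumes an existing DataFactoryManagementClient (adf_client); the IR name,
# server, credentials, resource group, and factory names are placeholders.
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, SqlServerLinkedService, SecureString,
    IntegrationRuntimeReference,
)

ir_ref = IntegrationRuntimeReference(reference_name="<self-hosted-ir-name>",
                                     type="IntegrationRuntimeReference")
ls = LinkedServiceResource(properties=SqlServerLinkedService(
    connection_string=SecureString(
        value="Server=<on-prem-server>;Database=<db>;User ID=<user>;Password=<password>;"
    ),
    connect_via=ir_ref,   # route traffic through the self-hosted IR
))
adf_client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>", "OnPrem_SQL_LS", ls,
)
```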

Summary

Azure Data Factory is a very powerful data orchestration service without the footprint of multiple on-premises tools, and it proves to be an extremely valuable part of the data engineer’s toolkit. It is cloud based, with an integrated code repository. With more than 90 connectors and the self-hosted integration runtime, it can access data that resides anywhere.

There are many use case scenarios, connectors, and data flow examples that can be gleaned from the extensive documentation provided and updated by the Microsoft Azure Data Factory product group. A forked copy of the evergreen documentation for Azure Data Factory is at our GitHub repo at https://github.com/harris-soh-copeland-puca/azure-docs/tree/master/articles/data-factory.

Note

Two data activities are important in the modern data world but are not covered in this chapter: data warehousing and data cataloging. This is because, at the time of writing, Azure SQL Data Warehouse had just been rebranded as Azure Synapse Analytics, and Azure Data Catalog Gen 2 was in early preview. We provide updates on these two services at https://harris-soh-copeland-puca.github.io.
