The value proposition of edge computing is to process data closer to the source and deliver intelligent, near real-time responsiveness across a wide range of applications and use cases. Additionally, edge computing reduces the amount of data that must be transferred to the cloud, thus saving on network bandwidth costs. High-performance edge applications often require local compute, storage, networking, data analytics, and machine learning capabilities to process high-fidelity data at low latencies. Although AWS IoT Greengrass allows you to run sophisticated edge applications on devices and gateways, these devices remain resource-constrained compared to the horsepower available in the cloud. Therefore, across many use cases, it's quite common to leverage the scale of cloud computing for high-volume, complex data processing needs.
In the previous chapter, you learned about different design patterns for data transformation strategies at the edge. This chapter focuses on how you can build different data workflows on the cloud based on the velocity, variety, and volume of data collected from the HBS hub running a Greengrass instance. Specifically, you will learn how to persist data in a transactional data store, develop API-driven access, and build a serverless data warehouse to serve data to end users. The chapter is divided into the following topics:
The technical requirements for this chapter are the same as those outlined in Chapter 2, Foundations of Edge Workloads; refer to that chapter for the full list.
The term Big in big data is relative, as the influx of data has grown substantially over the last two decades, from terabytes to exabytes, driven by the digital transformation of enterprises and connected ecosystems. The advent of big data technologies has allowed people (think social media) and enterprises (think digital transformation) to generate, store, and analyze huge amounts of data. Analyzing datasets of this volume requires sophisticated computing infrastructure that can scale elastically based on the amount of input data and the required outcome. This characteristic of big data workloads, along with the availability of cloud computing, has democratized the adoption of big data technologies by companies of all sizes. Even with the evolution of edge computing, big data processing on the cloud plays a key role in IoT workloads, as data is more valuable when it's adjacent to and enriched with data from other systems. In this chapter, we will learn how the big data ecosystem enables advanced processing and analytical capabilities on the huge volume of raw measurements or events collected from the edge, so that different personas can consume actionable information.
The integration of IoT with big data ecosystems has opened up a diverse set of analytical capabilities that allows the generation of additional business insights. These include the following:
The outcome of these processes gives organizations increased visibility into new information, emerging trends, or hidden data correlations to improve efficiencies or generate new revenue streams. In this chapter, you will learn about approaches to both descriptive and predictive analytics on data collected from the edge. In addition, you will learn how to implement design patterns such as streaming to a data lake or a transactional data store on the cloud, along with leveraging API-driven access; these patterns are considered anti-patterns for the edge. So, let's get started with the design methodologies of big data that are relevant for IoT workloads.
Big data processing is generally categorized in terms of the three Vs: the volume of data (for example, a terabyte, petabyte, or more), the variety of data (that is, structured, semi-structured, or unstructured), and the velocity of data (that is, the speed with which it's produced or consumed). However, as more organizations begin to adopt big data technologies, there have been additions to the list of Vs, such as the following:
For edge computing and the Internet of Things (IoT), all six Vs are relevant. The following diagram presents a visual summary of the range of data that has become available with the advent of IoT and big data technologies. This requires you to consider different ways in which to organize data at scale based on its respective characteristics:
You have already learned about data modeling concepts in Chapter 5, Ingesting and Streaming Data from the Edge; data modeling is a standard way of organizing data into meaningful structures, based on data types and relationships, in order to extract value from it. However, collecting data, storing it in a stream or a persistence layer, and processing it quickly to take intelligent actions is only one side of the story. The next challenge is to keep the quality of data high throughout its life cycle so that it continues to generate business value for downstream applications instead of inconsistency or risk. For IoT workloads, this aspect is critical, as devices and gateways reside in the physical world, operate with intermittent connectivity, and are, at times, susceptible to different forms of interference. This is where the domain-driven design (DDD) approach can help.
To manage data quality better, we need to learn how to organize data by content, such as by data domains or subject areas. One of the most common ways to do that is through DDD, which was introduced by Eric Evans in 2003. In his book, Eric states, "The heart of software is its ability to solve domain-related problems for its user. All other features, vital though they may be, support this basic purpose." Therefore, DDD is an approach to software development that centers on the requirements, rules, and processes of a business domain.
The DDD approach includes two core concepts: bounded context and ubiquitous language. Let's dive deeper into each of them:
The following diagram is an illustration of a bounded context:
All of these different business capabilities can be defined as bounded contexts. Now that the business capabilities have been determined, we can define the technology requirements within each bounded context to deliver the required business outcome. The general rule of thumb is that applications, data, and processes should be cohesive within a context and should not span across contexts for consumption by others. In this chapter, we will primarily focus on building bounded contexts for external capabilities required by end consumers, using different technologies.
Note
However, in the real world, there can be many additional factors to bear in mind when defining a bounded context, such as organizational structure, product ownership, and more. We will not dive deep into these factors, as they are not relevant to the topic being discussed here.
Note
The DDD model doesn't mandate how to determine bounded contexts for application or data management. Therefore, it's recommended that you work backward from your use case and determine the appropriate cohesion.
With this foundation in place, let's define some design principles for data management on the cloud; several of these will be used throughout the remainder of the chapter.
We will outline a set of guardrails (that is, principles) to understand how to design data workloads using DDD:
In the remainder of the chapter, we will touch upon most of these design principles (if not all) as we dive deeper into the different design patterns, data flows, and hands-on labs.
As data flows securely from the edge to the cloud over different channels (such as the speed or batch layers), it is common practice to store the data in different staging areas or a centralized location based on its velocity or variety. These data sources act as a single source of truth and help to ensure the quality of the data within their respective bounded contexts. In this section, we will discuss different data storage options, data flow patterns, and anti-patterns on the cloud. Let's begin with data storage.
As we learned in earlier chapters, edge solutions are constrained in terms of computing resources, so it's important to optimize the number of applications and the amount of data persisted locally based on the use case. The cloud, on the other hand, doesn't have that constraint, as it offers virtually unlimited resources with different compute and storage options. This makes it a perfect fit for big data applications that grow and contract based on demand. In addition, it provides easy access to a global infrastructure to orchestrate data for different downstream or end consumers in the regions closest to them. Finally, data is more valuable when it's augmented with other data or metadata; thus, in recent times, patterns such as data lakes have become very popular. So, what is a data lake?
A data lake is a centralized, secure, and durable storage platform that allows you to ingest and store structured and unstructured data and to transform the raw data as required. You can think of a data lake as a superset of the data pond concept introduced in Chapter 5, Ingesting and Streaming Data from the Edge. Since IoT devices and gateways are relatively low on storage, only highly valuable data that's relevant to edge operations should be persisted locally in a data pond:
Some of the foundational characteristics of the data lake architecture are explained here:
You must be wondering how data from a data lake is made available to a data warehouse or an ODS. That's where data integration patterns play a key role.
Data Integration and Interoperability (DII) happens through the batch, speed, and serving layers. A common methodology in the big data world that intertwines all these layers is Extract, Transform, and Load (ETL) or Extract, Load, and Transform (ELT). We have already explained these concepts, in detail, in Chapter 5, Ingesting and Streaming Data from the Edge, and discussed how they have evolved with time into different data flow patterns such as event-driven, batch, lambda, and complex event processing. Therefore, we will not be repeating the concepts here. But in the next section, we will explain how they relate to data workflows in the cloud.
Earlier in this chapter, we discussed how bounded contexts can be used to segregate different external capabilities for end consumers, such as fleet telemetry, fleet monitoring, or fleet analytics. Now, it's time to learn how these concepts can be implemented using different data flow patterns.
Let's consider a scenario: you discover that you have been getting a higher electricity bill for the last six months, and you would like to compare the utilization of different equipment over that period. Alternatively, you want visibility into more granular information, such as: how many times did the washing machine run per day over the last six months? For how long? And how many watts did that consume?
This is where batch processing helps. It was the industry's de facto standard before event-driven architecture gained popularity, and it is still heavily used for use cases such as order management, billing, payroll, financial statements, and more. In this mode of processing, a large volume of data, such as thousands or hundreds of thousands of records (or more), is typically transmitted in a file format (such as TXT or CSV), then cleaned, transformed, and loaded into a relational database or data warehouse. Thereafter, the data is used for reconciliation or analytical purposes. A typical batch processing environment also includes a job scheduler that can trigger an analytical workflow based on feed availability or business-defined schedules.
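To make the scheduler idea concrete, the following is a minimal sketch that registers a nightly schedule with Amazon EventBridge using the AWS SDK for JavaScript; the rule name and cron expression are illustrative assumptions, and a real workflow would attach a target (such as a Lambda function or AWS Glue workflow) to the rule:

const AWS = require('aws-sdk');
const events = new AWS.EventBridge();

async function scheduleNightlyBatch() {
  // Create (or update) a rule that fires at 02:00 UTC every day; the batch
  // workflow attached as a target would then run on this cadence.
  await events.putRule({
    Name: 'nightly-fleet-analytics',          // hypothetical rule name
    ScheduleExpression: 'cron(0 2 * * ? *)'   // 02:00 UTC daily
  }).promise();
}

scheduleNightlyBatch().catch(console.error);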
To design the fleet analytics bounded context, we have designed a batch workflow, as follows:
In this pattern, the following activities are taking place:
Fun fact
Amazon EMR and Amazon Redshift support big data processing by decoupling the compute and storage layers, which means there is no need to copy all the data from the data lake into local storage. This makes processing more cost-efficient and operationally lean.
The ubiquitous language used in this bounded context includes the following:
Batch processing is powerful since it doesn't have any windowing restrictions. There is a lot of flexibility in how individual data points can be correlated with the entire dataset, whether it's terabytes or exabytes in size, to achieve the desired analytical outcomes.
Let's consider the following scenario: you have rushed out of your home, and you get a notification after boarding your commute that you left the cooking stove on. Since you have a connected stove, you can immediately turn it off remotely from an app to avoid fire hazards. Bingo!
This looks easy, but a certain level of intelligence is required at the local hub (such as the HBS hub), along with a chain of events, to facilitate this workflow. These might include the following:
So, as you can observe, a lot happens in a matter of seconds between the edge, the cloud, and the end user to help mitigate the hazard. This is why patterns such as event-driven architecture (EDA) have become very popular over the last decade or so.
Prior to EDA, polling and webhooks were the common mechanisms for communicating events between different components. Polling is inefficient since there is always a lag between fetching new updates from the data source and syncing them with downstream services. Webhooks are not always the first choice, as they might require custom authorization and authentication configurations. In short, both methods require additional integration work or have scaling issues. Hence the concept of events, which can be filtered, routed, and pushed to other services or systems with less bandwidth and lower resource utilization, since the data is transmitted as a stream of small events or datasets. Similar to the edge, streaming allows the data to be processed as it arrives, without incurring any delay.
Generally, event-driven architectures come in two topologies: the mediator topology and the broker topology. They are explained here:
The broker topology is very common with edge workloads since it decouples the edge from the cloud and allows the overall solution to scale better. Therefore, for the fleet telemetry bounded context, we have designed an event-driven architecture using a broker topology, as shown in the following diagram.
In the following data flow, the events streamed from a connected HBS hub (that is, the edge) are routed over MQTT to an IoT gateway (that is, AWS IoT Core), which allows the filtering of data (if required) through a built-in rules engine and persists the data to an ODS (that is, Amazon DynamoDB). Amazon DynamoDB is a highly performant non-relational database service that can scale automatically with the volume of data streamed from millions of edge devices. From the previous chapter, you should already be familiar with how to model data and optimize NoSQL databases for time series data. Once the data is persisted in Amazon DynamoDB, Create, Read, Update, and Delete (CRUD) operations can be performed on top of the data using serverless functions (that is, AWS Lambda). Finally, the data is made available through an API access layer (that is, Amazon API Gateway) in a synchronous or asynchronous manner:
The ubiquitous language used in this bounded context includes the following:
Stream processing and EDA are powerful for many IoT use cases that require near real-time attention, such as alerting, anomaly detection, and more, since the data is analyzed as soon as it arrives. However, there is a trade-off with every architecture, and EDA is no exception. With a stream, since processed results are made available immediately, the analysis of a particular data point cannot consider future values. Even for past values, the analysis is restricted to a short time interval, generally specified through different windowing mechanisms (such as sliding, tumbling, and more). That's where batch processing plays a key role.
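To make the windowing constraint concrete, here is a minimal sketch of a tumbling window aggregator in Node.js; it assumes events carry a numeric temperature field and an epoch-millisecond timestamp, and the one-minute window size is purely illustrative:

const WINDOW_MS = 60 * 1000;               // one-minute tumbling windows
const windows = new Map();                 // windowStart -> { sum, count }

function onEvent(event) {
  // Align the event to the start of its window; anything outside this
  // interval (past or future) is invisible to the aggregation.
  const windowStart = Math.floor(event.timestamp / WINDOW_MS) * WINDOW_MS;
  const agg = windows.get(windowStart) || { sum: 0, count: 0 };
  agg.sum += event.temperature;
  agg.count += 1;
  windows.set(windowStart, agg);
}

function closeWindow(windowStart) {
  // Emit the aggregate once the window closes, then discard the state.
  const agg = windows.get(windowStart);
  if (agg) {
    const avg = (agg.sum / agg.count).toFixed(2);
    console.log(`avg temperature for ${new Date(windowStart).toISOString()}: ${avg}`);
    windows.delete(windowStart);
  }
}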
Let's consider the following scenario: you plan to reduce food wastage at home. So, every time you check in at a grocery store, you receive a notification listing the perishable items in your refrigerator (or food shelf) that are still unopened or underutilized and are nearing their expiry date.
This might sound like an easy problem to solve, but a certain amount of intelligence is required at the local hub (such as the HBS hub), together with a complex event processing (CEP) workflow on the cloud, to facilitate this. It might include the following:
This problem can become further complicated for a restaurant business due to the volume of perishable items and the scale at which it operates. In such a scenario, having near real-time visibility to identify waste based on current practices can help the business optimize its supply chain and save significant costs. So, as you can imagine, the convergence of edge and IoT with big data processing capabilities such as CEP can help unblock challenging use cases.
Processing and querying events as they arrive, in small chunks or in bulk, is relatively easy compared to recognizing patterns by correlating events. That's where CEP is useful. It's considered a subset of stream processing, with a focus on identifying special (or complex) events by correlating events from multiple sources or by listening to telemetry data over a longer period of time. One of the common patterns for implementing CEP is building state machines.
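As an illustration, here is a minimal state machine sketch in Node.js for the earlier stove scenario; the event names (stove_on, stove_off, door_locked) are hypothetical placeholders for whatever your hub actually publishes:

const states = { IDLE: 'IDLE', STOVE_ON: 'STOVE_ON', ALERT: 'ALERT' };
let state = states.IDLE;

function notifyUser(message) {
  console.log(`ALERT: ${message}`); // in practice, push a mobile notification
}

function onEvent(event) {
  switch (state) {
    case states.IDLE:
      if (event.type === 'stove_on') state = states.STOVE_ON;
      break;
    case states.STOVE_ON:
      if (event.type === 'stove_off') {
        state = states.IDLE;
      } else if (event.type === 'door_locked') {
        // The complex event: two independent simple events correlated over time
        state = states.ALERT;
        notifyUser('The stove is on and the house appears to be empty.');
      }
      break;
  }
}

// Example event sequence that triggers the alert
['stove_on', 'door_locked'].forEach((type) => onEvent({ type }));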
In the following flow, the events streamed from a connected HBS hub (that is, the edge) are routed over MQTT to an IoT gateway (that is, AWS IoT Core), which filters the complex events based on set criteria and pushes them to different state machines defined within the complex event processing engine (that is, AWS IoT Events). AWS IoT Events is a fully managed CEP service that allows you to monitor equipment or device fleets for failures or changes in operation and, thereafter, trigger actions based on defined events:
The ubiquitous language used in the fleet monitoring bounded context includes the following:
CEP can be useful for many IoT use cases that require attention based on events from multiple sensors, timelines, or other environmental factors.
There are many other design patterns to consider when designing a real-life IoT workload. These could stem from functional or non-functional requirements, such as data archival for regulatory compliance, data replication for redundancy, and disaster recovery for achieving the required recovery time objective (RTO) or recovery point objective (RPO). However, they are beyond the scope of this book, as they are general principles and not specific to edge computing or IoT workloads. Many other books and resources are available on those topics if they are of interest to you.
The anti-patterns for processing data on the cloud from edge devices can be better explained using three laws, namely the law of physics, the law of economics, and the law of the land:
AWS offers different edge services to support use cases that need to comply with the preceding laws, and these are not limited to IoT services only. For example, consider the following:
The preceding services are beyond the scope of this book; the information is provided only so that you are well informed of the breadth and depth of AWS edge services.
In this section, you will learn how to design an architecture on the cloud leveraging the concepts you have learned in this chapter. Specifically, you will continue to use the lambda architecture pattern introduced in Chapter 5, Ingesting and Streaming Data from the Edge, to process the data on the cloud:
In the previous chapter, you already completed steps 1 and 4. This chapter will help you complete steps 2, 3, 5, 6, and 7, which include consuming the telemetry data and building an analytics pipeline for performing BI:
In this hands-on section, your objectives include the following:
This lab builds on top of the cloud resources that you already deployed in Chapter 5, Ingesting and Streaming Data from the Edge. So, please ensure you have completed the hands-on section there before proceeding with the following steps. In addition, please go ahead and deploy the CloudFormation template from the chapter 6/cfn folder to create the resources required in this lab, such as the Amazon API Gateway endpoint, Lambda functions, and the AWS Glue crawler.
Note
Please retrieve the parameters required for this CloudFormation template (such as the S3 bucket) from the Outputs section of the CloudFormation stack deployed in Chapter 5, Ingesting and Streaming Data from the Edge.
In addition, you can always find the specific resource names required for this lab (such as the Lambda functions) in the Resources or Outputs sections of the deployed CloudFormation stack. It is good practice to copy those into a local notepad so that you can refer to them quickly.
Once the CloudFormation stack has been deployed successfully, please continue with the following steps.
Navigate to the AWS console and try to generate insights from the data persisted in the operational (or transactional) data store. As you learned in the previous chapter, all the near real-time processed data is persisted in a DynamoDB table (packt_sensordata):
Here, the query interface allows you to filter the data quickly based on different criteria. If you are familiar with SQL, you can also try the PartiQL editor on the DynamoDB console.
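For example, a PartiQL statement against the same table might look like the following; the device ID used here is a hypothetical value:

SELECT * FROM "packt_sensordata" WHERE device_id = 'hvac-unit-1'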
For better performance and faster response times, we recommend that you use Query over Scan.
In addition to having interactive query capabilities on data, you will often need to build a presentation layer and business logic for various other personas (such as consumers, fleet operators, and more) to access data. You can define the business logic layer using Lambda:
const AWS = require("aws-sdk");
const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  let body;
  let statusCode = 200;
  const headers = { "Content-Type": "application/json" };

  try {
    switch (event.routeKey) {
      case "GET /items/{device_id}":
        // Query by partition key; cheaper than scanning the whole table
        var nid = String(event.pathParameters.device_id);
        body = await dynamo
          .query({
            TableName: "<table-name>",
            KeyConditionExpression: "device_id = :nid",
            ExpressionAttributeValues: {
              ":nid": nid
            }
          })
          .promise();
        break;
      case "GET /items":
        // Scan returns every item; expensive on large tables
        body = await dynamo.scan({ TableName: "<table-name>" }).promise();
        break;
      case "PUT /items":
        let requestJSON = JSON.parse(event.body);
        await dynamo
          .put({
            TableName: "<table-name>",
            Item: {
              device_id: requestJSON.id,
              temperature: requestJSON.temperature,
              humidity: requestJSON.humidity,
              device_name: requestJSON.device_name
            }
          })
          .promise();
        body = `Put item ${requestJSON.id}`;
        break;
      default:
        throw new Error(`Unsupported route: "${event.routeKey}"`);
    }
  } catch (err) {
    statusCode = 400;
    body = err.message;
  } finally {
    body = JSON.stringify(body);
  }

  return {
    statusCode,
    body,
    headers
  };
};
Please note that we are using Lambda functions here because serverless functions have become a common pattern for processing event-driven data in near real time. Since they alleviate the need to manage or operate any servers throughout the application's life cycle, your only responsibility is to write the code in a supported language and upload it to the Lambda console.
Fun fact
AWS IoT Greengrass provides a Lambda runtime environment for the edge, supporting different languages such as Python, Java, Node.js, and C++. That means you don't need to maintain two different code bases (such as embedded and cloud) or multiple development teams. This cuts down your development time, enables a uniform development stack from the edge to the cloud, and accelerates your time to market.
Now that the business logic has been developed using the Lambda function, let's create the HTTP interface (aka the presentation layer) using Amazon API Gateway. This is a managed service for creating, managing, and deploying APIs at scale:
GET /items – allows access to all the items in the DynamoDB table (this can be an expensive operation)
a. Query all items from the table
curl https://xxxxxxxx.execute-api.<region>.amazonaws.com/items
You should see a long list of items in your terminal, retrieved from the DynamoDB table.
Amazon API Gateway allows you to create different types of APIs; the one configured earlier falls into the HTTP API category, which allows access to Lambda functions and other HTTP endpoints. We could have used a REST API here, but the HTTP option was chosen for its simplicity, as it can automatically stage and deploy the required APIs without any additional effort and can be more cost-effective. In summary, you have now completed implementing the bounded context of an ODS through query and API interfaces.
In this next section, you will build an analytics pipeline on the batch data persisted in Amazon S3 (that is, the data lake). To achieve this, you will use AWS Glue to crawl the data and generate a data catalog. Thereafter, you will use Athena for interactive querying and QuickSight for visualizing the data through charts and dashboards.
AWS Glue is a managed service that offers many ETL capabilities, such as data crawlers, data catalogs, batch jobs, integration with CI/CD pipelines, Jupyter notebook integration, and more. You will primarily use the data crawler and cataloging capabilities in this lab. That should be sufficient for IoT professionals, since data engineers are mostly responsible for these activities in the real world. However, if you are curious, please feel free to play with the other features:
In this lab, you used a crawler with a single data source; however, you can add multiple data sources if your use case requires it. Once the data catalog is updated and the data (or metadata) is available, you can consume it through different AWS services. You might also often need to clean, filter, or transform your data. However, these responsibilities are generally not performed by IoT practitioners and fall primarily to data analysts or data scientists.
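If you prefer to script the console steps, a crawler can also be run programmatically. The following is a minimal sketch using the AWS SDK for JavaScript; it assumes the crawler already exists, and the placeholder name should be replaced with the one created by the CloudFormation stack:

const AWS = require('aws-sdk');
const glue = new AWS.Glue();

async function runCrawler(name) {
  // Kick off the crawler, then poll until it returns to the READY state.
  await glue.startCrawler({ Name: name }).promise();
  let state = 'RUNNING';
  while (state !== 'READY') {
    await new Promise((resolve) => setTimeout(resolve, 30000)); // wait 30s
    const { Crawler } = await glue.getCrawler({ Name: name }).promise();
    state = Crawler.State; // RUNNING -> STOPPING -> READY
  }
  console.log(`Crawler ${name} finished.`);
}

runCrawler('<your-crawler-name>').catch(console.error);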
Amazon Athena acts as a serverless data warehouse where you run analytical queries on data curated by an ETL engine such as Glue. Athena uses a schema-on-read approach; thus, a schema is projected onto your data when you run a query. And since Athena decouples the compute and storage layers, you can connect it to different data lake services, such as S3, to run these queries. Athena uses Apache Hive for DDL operations, such as defining tables and creating databases. For the different functions supported through queries, Presto is used under the hood. Both Hive and Presto are open source SQL engines:
This should bring up the results with only 10 records. If you would like to view the entire dataset, remove the LIMIT 10 clause from the end of the query. If you are familiar with SQL, feel free to tweak the query and play with different attributes or join conditions.
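For reference, a representative query would look something like the following; the database and table names are assumptions, so substitute the ones your Glue crawler actually generated:

SELECT device_id, temperature, humidity
FROM "packt_db"."packt_sensordata"
LIMIT 10;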
Note
Here, you processed JSON data streamed from an HBS hub device. But what if your organization wants to leverage a more lightweight data format? Athena offers native support for various data formats, such as CSV, Avro, Parquet, and ORC, through the use of serializer-deserializer (SerDe) libraries. Even complex schemas are supported through regular expressions.
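For instance, a table over JSON data can be declared with an explicit SerDe, as in the following sketch; the table name, columns, and S3 location are placeholders for your own resources:

CREATE EXTERNAL TABLE IF NOT EXISTS packt_sensordata_json (
  device_id string,
  temperature double,
  humidity double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<your-data-lake-bucket>/telemetry/';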
So far, you have crawled the data from the data lake, created the tables, and successfully queried the data. Now, in the final step, let's learn how to build dashboards and charts that can enable BI on this data.
As an IoT practitioner, building business dashboards might not be part of your core responsibilities. However, some basic knowledge is always useful. With traditional BI solutions, it can take data engineers weeks or months to build complex, interactive, ad hoc data exploration and visualization capabilities, so business users are constrained to pre-canned reports and preselected queries. Also, these traditional BI solutions require significant upfront investment and don't perform well at scale as data sources grow. That's where Amazon QuickSight helps. It's a managed service that's easy to use, highly scalable, and supports the complex capabilities required by businesses:
Feel free to play with different fields or visual types to visualize other smart home-related information. As you might have observed while creating the dataset, QuickSight natively supports different AWS and third-party data sources, such as Salesforce, ServiceNow, Adobe Analytics, Twitter, Jira, and more. Additionally, it allows instant access to the data through mobile apps (iOS and Android), so business users or operations teams can quickly infer data insights for a specific workload, along with integrations for machine learning augmentation.
Congratulations! You have completed the entire life cycle of data processing and data consumption on the cloud using different AWS applications and data services. Now, let's wrap up this chapter with a summary and knowledge-check questions.
Challenge zone (Optional)
In the Amazon API Gateway section, you built an interface to retrieve all the items from the DynamoDB table. However, what if you need to extract a specific item (or set of items) for a particular device, such as the HVAC? That can be a less costly operation than scanning all the data.
Hint: you need to define a route such as GET /items/{device_id}. Check the Lambda function to gain a better understanding of how it maps to the backend logic.
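Once that route is wired up, invoking it would look something like the following; the endpoint and device ID are placeholders:

curl https://xxxxxxxx.execute-api.<region>.amazonaws.com/items/hvac-unit-1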
In this chapter, you were introduced to big data concepts relevant to IoT workloads. You learned how to design data flows using the DDD approach, along with different data storage and data integration patterns that are common in IoT workloads. You implemented a lambda architecture to process fleet telemetry data, as well as an analytical pipeline. Finally, you validated the workflow by consuming data through APIs and visualizing it through business dashboards. In the next chapter, you will learn how all of this data can be used to build, train, and deploy machine learning models.
Before moving on to the next chapter, test your knowledge by answering these questions. The answers can be found at the end of the book:
Take a look at the following resources for additional information on the concepts discussed in this chapter: