The Internet of Things (IoT) had a humble beginning in 1999 at Procter & Gamble, when Kevin Ashton introduced the idea of integrating a radio-frequency identification (RFID) antenna into lipstick shelves to enable branch managers to better track cosmetic inventory for replenishment. Since then, this technology has been adopted across all industry segments in some form or another and has become ubiquitous in today's world.
Managing a set of RFID tags, sensors, and actuators inside a known physical boundary is a relatively easy task. However, managing millions (or billions, or trillions) of these devices globally throughout their life cycle is not, especially when those devices are spread across different locations with various forms of connectivity and interfaces.
Therefore, in this chapter, you will learn about the best practices for onboarding, maintaining, and diagnosing a fleet of devices remotely through AWS native services. Additionally, you will gain hands-on experience in building an operational hub to assess the health of the connected fleet and take the required actions.
In this chapter, we will be covering the following topics:
The technical requirements for this chapter are the same as those outlined in Chapter 2, Foundations of Edge Workloads; refer to that chapter for the full list.
We already introduced you to the different activities involved in the IoT manufacturing supply chain in Chapter 8, DevOps and MLOps for the Edge. Onboarding refers to the process of manufacturing, assembling, and registering a device with a registration authority. In this section, we will dive deeper into the following activities that play a part in the onboarding workflow:
So far, in this book, you have been using a Raspberry Pi (or a virtual environment) to perform the hands-on exercises. This is a common practice for development and prototyping needs. However, as your project progresses toward a higher environment (such as QA or production), it is recommended that you consider hardware that's industry-grade and can operate in various conditions. Therefore, all the aforementioned activities in Figure 9.1 need to be completed before your device (that is, the connected HBS hub) can be made available through your distribution channels at different retail stores (and sell like hotcakes!).
For the remainder of this chapter, we assume that your company already has a defined supply chain with your preferred vendors and the device manufacturing workflow is operational to assemble the devices. As your customers unbox these devices (with so much excitement!) and kick off the setup process, the device needs to bootstrap locally (on the edge) and register with the AWS IoT services successfully to become fully operational.
So, you must be thinking: what steps need to be performed in advance for the device registration to be a success? Let's find out.
There are different types of credentials, such as a user ID and password, vended tokens (such as JWT or OAuth), and symmetric or asymmetric keys, that can be used by an IoT device. We recommend using asymmetric keys as, at the time of writing, these are considered the most secure approach in the industry. In all the previous hands-on exercises, you took advantage of the asymmetric X.509 keys and certificates generated by the AWS IoT Certificate Authority (CA) that were embedded in the connected HBS hub running Greengrass. A CA is a trusted entity that issues cryptographic credentials such as digital keys and certificates. These credentials are registered on the cloud and embedded in the devices to enable Transport Layer Security (TLS)-based mutual authentication. Specifically, there are four digital resources associated with a mutual TLS authentication workflow, as follows:
The following diagram shows the workflow and the location of these digital resources on the device and in the cloud:
Therefore, it's critical to decide on a CA early on in the design process. This is so that it can issue the aforementioned digital resources required by the devices to perform registration with a backend authority and become fully operational. There are three different ways a CA can be used with AWS IoT Core, as shown in the following table along with a list of pros and cons:
Once the CA setup is complete, the next step is to choose the device provisioning approach based on the scenario. Let's understand that in more detail next.
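Before we do, it's worth noting that the bring-your-own-CA option from the preceding table can also be scripted. The following is a minimal boto3 sketch, assuming you have already created a rootCA.pem for your own CA and signed a verificationCert.pem whose Common Name is the registration code returned by AWS IoT; the file names are placeholders for illustration:

import boto3

iot = boto3.client("iot")

# The registration code must appear as the Common Name of a verification
# certificate that is signed by the CA you are about to register.
registration_code = iot.get_registration_code()["registrationCode"]
print(f"Sign a verification CSR with CN={registration_code} using your CA")

# Register the CA and allow device certificates it issues to be
# auto-registered on first connect (just-in-time registration).
with open("rootCA.pem") as ca_file, open("verificationCert.pem") as ver_file:
    response = iot.register_ca_certificate(
        caCertificate=ca_file.read(),
        verificationCertificate=ver_file.read(),
        setAsActive=True,
        allowAutoRegistration=True,
    )
print(response["certificateArn"])

With the CA registered, device certificates signed by it in your own supply chain can be recognized by AWS IoT without a dependency on the AWS IoT CA.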
Terms such as provisioning and registration are used interchangeably in many contexts in the IoT world, but we believe there is a clear distinction between them. For us, device provisioning is the amalgamation of two activities – device registration and device activation. Device registration is the process where a device successfully authenticates using its initial cryptographic credentials against a registration authority (such as the AWS IoT Identity service), reports distinctive attributes such as the model and serial number, and gets associated with a unique identity in a device registry. Additionally, the authority can return a new set of credentials, which the device can use to replace the prior ones. Following this, a privilege escalator can enhance the privileges of the associated principal (such as the X.509 certificate) so that the device is activated and fully operational.
There are different approaches to these provisioning steps, which are often derived from the level of control or convenience an organization intends to have in the manufacturing supply chain. Often, this choice is determined by several factors such as in-house skills, cost, security, time to market, or sensitivity to the intellectual property of the product. You learned about the approach of automatic provisioning through IoT Device Tester in Chapter 2, Foundations of Edge Workloads, which is prevalent for prototyping and experimentation purposes.
In this section, we will discuss two production-grade provisioning approaches that can scale from one device to millions of devices (or more) by working backward from the following scenarios.
In this scenario, you can provision the fleet of HBS hubs in bulk in your supply chain using unique firmware images that include unique cryptographic credentials:
{"ThingName": "hbshub-one", "SerialNumber": "0001", "CSR": "*** CSR FILE CONTENT ***"}
{"ThingName": "hbshub-two", "SerialNumber": "0002", "CSR": "*** CSR FILE CONTENT ***"}
{"ThingName": "hbshub-three", "SerialNumber": "0003", "CSR": "*** CSR FILE CONTENT ***"}
This approach is pretty common with microcontrollers running an RTOS, since incremental updates are not supported (yet) on that class of hardware. However, for the connected HBS hub, it's more agile and operationally efficient to decouple the firmware image from the crypto credentials.
That's where the second option comes in. Here, you will still generate the things and unique credentials from your CA in the same way as you did in the previous step, but you will not inject them into the firmware. Instead, you will develop intelligent firmware that can accept credentials over different interfaces such as Secure Shell (SSH), a Network File System (NFS), or a serial connection. As a best practice, it's also common to store the credentials in a separate chip such as a secure element or a Trusted Platform Module (TPM). Additionally, the firmware can use a public key cryptography standards interface (such as PKCS#11) to retrieve the keys and certificates as required by the firmware or other local applications in real time. At the time of writing, Greengrass v2 is awaiting support for TPM, although it was a supported feature in Greengrass v1:
Let us take a look at how to go about the second option:
Let's consider the following scenario: your devices might not have the capability to accept unique credentials at the time of manufacturing, or it might be cost-prohibitive for your organization to undertake the operational overhead of embedding unique credentials in each HBS hub in your supply chain. This is where another pattern emerges, referred to as fleet provisioning by claim, where, as a device maker, you embed a non-unique shared credential (referred to as a claim) in your fleet. However, we recommend that you do not share the same claim across the entire fleet, but only across a percentage of it, to reduce the blast radius in the case of any security issues. Take a look at the following steps:
def pre_provisioning_hook(event, context):
    # Approve the provisioning request and override template parameters,
    # for example, pinning the device's location attribute.
    return {
        'allowProvisioning': True,
        'parameterOverrides': {
            'DeviceLocation': 'NewYork'
        }
    }
Here, when the hub wakes up and connects to AWS IoT for the first time, the claim certificate is exchanged for permanent X.509 credentials that have been signed by the CA (AWS or BYO). This is where the fleet provisioning plugin helps, as it allows the device to publish and subscribe to the required MQ Telemetry Transport (MQTT) topics, accept the unique credentials, and persist them in a secure type of storage:
Word of Caution
Fleet provisioning by claim poses security risks if the shared claims are not protected through the supply channels.
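On the cloud side, the pre-provisioning hook shown earlier is attached to a fleet provisioning template. The following is a minimal boto3 sketch, assuming the Lambda function from the previous snippet has already been deployed; the template name, role ARN, function ARN, and the stripped-down template body are placeholders (a production template would also attach an IoT policy to the issued certificate):

import json
import boto3

iot = boto3.client("iot")

template_body = {
    "Parameters": {
        "SerialNumber": {"Type": "String"},
        "AWS::IoT::Certificate::Id": {"Type": "String"},
    },
    "Resources": {
        "thing": {
            "Type": "AWS::IoT::Thing",
            "Properties": {
                "ThingName": {"Fn::Join": ["", ["hbshub-", {"Ref": "SerialNumber"}]]},
            },
        },
        "certificate": {
            "Type": "AWS::IoT::Certificate",
            "Properties": {
                "CertificateId": {"Ref": "AWS::IoT::Certificate::Id"},
                "Status": "Active",
            },
        },
    },
}

response = iot.create_provisioning_template(
    templateName="HbsHubFleetProvisioning",  # placeholder name
    templateBody=json.dumps(template_body),
    provisioningRoleArn="arn:aws:iam::123456789012:role/HbsProvisioningRole",  # placeholder
    enabled=True,
    preProvisioningHook={
        "payloadVersion": "2020-04-01",
        "targetArn": "arn:aws:lambda:us-east-1:123456789012:function:pre_provisioning_hook",  # placeholder
    },
)
print(response["templateArn"])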
Once the devices have been provisioned, the next step is to organize them to ease their management throughout the life cycle. The Greengrass core devices can be organized into thing groups, which is the construct for organizing a fleet of devices within the AWS IoT ecosystem. A thing group can be either static or dynamic in nature. As their name suggests, static groups allow the organization of devices based on non-changing attributes such as the product type, the manufacturer, the serial number, the production date, and more.
Additionally, static groups permit building a hierarchy of parent and child groups that can span up to seven levels. For example, querying a group of washing machine sensors within a serial number range that belongs to company XYZ can be useful for identifying devices that need to be recalled due to a production defect.
In comparison, dynamic groups are created using indexed information such as the connectivity status, registry metadata, or device shadow. Therefore, the membership of dynamic groups is always changing, which is why dynamic groups are not associated with any device hierarchy. For example, querying a group of HBS devices that are connected at a point in time and have a firmware version of v1 can allow a fleet operator to push a firmware update notification to the respective owners.
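For illustration, the following boto3 sketch creates a small static hierarchy for the HBS hubs and a dynamic group driven by a fleet indexing query; the firmwareVersion attribute is an assumption, and fleet indexing with connectivity data must already be enabled:

import boto3

iot = boto3.client("iot")

# Static hierarchy: product family as the parent, hardware revision as a child.
iot.create_thing_group(thingGroupName="hbs-hubs")
iot.create_thing_group(thingGroupName="hbs-hubs-gen1", parentGroupName="hbs-hubs")

# Dynamic group: membership is recalculated continuously from the fleet index.
iot.create_dynamic_thing_group(
    thingGroupName="hbs-hubs-connected-v1",
    queryString="connectivity.connected:true AND attributes.firmwareVersion:v1",
)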
Another advantage of using thing groups is the ability to assign fleet permissions (that is, policies) at the device group level, which then cascades to all the devices in that hierarchy. This eases the overhead of managing policies at each device level. Concurrently, though, it's possible to have device-specific policies, and the AWS IoT Identity service will automatically assess the least-privileged level of access permitted between the group and device level during the authentication and authorization workflow.
Now you have a good understanding of how to provision and organize the HBS hubs using different approaches. Next, let's discuss how to manage the fleet once it has been rolled out.
Although it might be easy to monitor a handful of devices, managing a fleet of devices at scale can turn out to be an operational nightmare. Why? Well, this is because IoT devices (such as the HBS hub) are not just deployed in a controlled perimeter (such as a data center). As you should have gathered by now, these devices can be deployed anywhere, such as homes, offices, and business locations, which might have disparate power utilization, network connectivity, and security postures. For example, there can be times when the devices operate offline and are not available over a public or private network due to the intermittent unavailability of Wi-Fi connectivity on the premises. Therefore, as an IoT professional, you have to consider various scenarios and plan in advance for managing your fleet at scale.
In the context of a connected HBS hub, device management can help you achieve the following:
So, as you might have gathered, developing an IoT solution and rolling it out to the customers is just the beginning. It's necessary to govern the entire life cycle of the solution to achieve the business outcomes cited earlier. Therefore, device management can also be considered as a bigger umbrella for the following activities:
The following is a diagram showing the IoT Device Management workflow:
We have already discussed the first three topics in the preceding section in the context of Greengrass. Therefore, we will move on to focus on the remaining activities.
Monitoring the Greengrass-enabled HBS hubs and the associated devices will be key in achieving a reliable, highly available, and performant IoT workload. You should have a mechanism to collect monitoring data from the edge solution to debug failures when they occur. Greengrass supports the collection of system health telemetry and custom metrics, which are diagnostic data points to monitor the performance of critical operations of different components and applications on the Greengrass core devices. The following is a list of the different ways to gather this data:
Later, in the hands-on section, you will collect these metrics and process them on the cloud.
As you are collecting all of these data points (metrics and logs) from the HBS hub and publishing them to the cloud, the next step is to allow different personas, such as fleet operators (or other downstream businesses), to consume this information. This can be achieved through CloudWatch, which natively offers various capabilities related to logging insights, generating dashboards, setting up alarms, and more. If your organization has already standardized on a monitoring solution (such as Splunk, Sumo Logic, Datadog, or others), CloudWatch also supports integration with it.
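As an example, the following boto3 sketch raises a CloudWatch alarm when a hub's reported CPU utilization stays high; the namespace, metric name, dimensions, and SNS topic are placeholders that depend on how your telemetry pipeline publishes the data:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="hbshub-one-high-cpu",
    Namespace="HBS/HubTelemetry",          # placeholder namespace
    MetricName="CpuUsage",                 # placeholder metric
    Dimensions=[{"Name": "ThingName", "Value": "hbshub-one"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:fleet-operators"],  # placeholder
)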
Finally, in the control plane, Greengrass integrates with AWS CloudTrail to log access events related to service APIs, IAM users, and roles. These access logs can be used to determine additional details about Greengrass access such as the IP address from which a request was made, who the request was made by, and when it was made, which can be useful for various security and operational needs.
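A minimal sketch of querying those access logs with boto3 might look as follows, pulling the last 24 hours of Greengrass control-plane calls along with the caller and source IP address:

import json
from datetime import datetime, timedelta

import boto3

cloudtrail = boto3.client("cloudtrail")

events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "greengrass.amazonaws.com"}
    ],
    StartTime=datetime.utcnow() - timedelta(days=1),
)
for event in events["Events"]:
    detail = json.loads(event["CloudTrailEvent"])
    print(event["EventName"], detail.get("sourceIPAddress"), event["EventTime"])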
The previously explained services, such as Amazon CloudWatch (or a third-party solution), can be robust enough to generate the various insights required to monitor the health of IoT workloads. However, another common ask from IoT administrators or fleet operators is to have a single-pane-of-glass view that allows them to consume a comprehensive set of information from the device fleet, to quickly troubleshoot operational events.
For example, consider a scenario where customers are complaining that their HBS hubs are malfunctioning. As a fleet operator, you can observe a lot of connection drops and high resource utilization on the dashboard. Therefore, you look up the logs (on a device or in the cloud) and identify it as a memory leak issue caused by a specific component (such as Aggregator). Based on your operations playbook, you need to identify whether this is a one-off issue or whether more devices in the fleet are showing similar behavior. Therefore, you need an interface to search, identify, and visualize metrics such as the device state, device connection, and battery level across the fleet, or on a set filtered by user location. This is where a fleet management solution such as AWS IoT Fleet Hub comes in, which allows the creation of a fully managed web application to cater to various personas using a no-code approach. In our scenario, this web application can help the operators to view, query, and interact with a fleet of connected HBS hubs in near real time and troubleshoot the issue further. In addition to monitoring, the operators can also respond to alarms and trigger a remote operation over the air (OTA) to remediate deployed devices from a single interface. Fleet Hub applications also enable the following:
In summary, these capabilities from Fleet Hub can allow an organization to respond more quickly to different operational events and, thereby, improve customer experience.
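Under the hood, these filters map to fleet indexing queries. The following boto3 sketch runs the same kind of query directly against the fleet index; the firmwareVersion attribute is an assumption for illustration:

import boto3

iot = boto3.client("iot")

result = iot.search_index(
    queryString="thingName:hbshub* AND connectivity.connected:true "
                "AND attributes.firmwareVersion:v1"
)
for thing in result["things"]:
    print(thing["thingName"], thing.get("connectivity", {}).get("connected"))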
In the preceding scenario, you learned how a fleet operator can stay well informed about operational events in near real time through a single-pane-of-glass view. However, what if the issue needs to be diagnosed further because the remote actions available through Fleet Hub are not sufficient to remediate it? For example, an operator might have triggered a remote action to restart the Aggregator component or the HBS hub itself, but that did not solve the problem for the end consumer. Therefore, as a next step, the operator is required to gain direct access to the hub or the associated sensors for further troubleshooting. Traditionally, in such a situation, a company will schedule an appointment with a technician, which means additional cost and wait time for the customers. That's where a remote diagnostics capability such as AWS IoT Secure Tunneling can be useful. This is an AWS-managed service that allows fleet operators to gain additional privileges (such as SSH or RDP access) over a secure tunnel to the destination device.
The secure tunneling component of Greengrass enables secure bidirectional communication between an operator workstation and a Greengrass-enabled device (such as the HBS hub) even if it is behind a restrictive firewall. This is made possible because the remote operations navigate through a secure tunnel under the hood. Moreover, the devices continue to use the same cryptographic credentials (that is, X.509 certificates) used for telemetry in this remote operation. The only other dependency on the client side (that is, the fleet operator) is the installation of proxy software on the laptop or a web browser. This proxy software makes the magic happen by exchanging temporary credentials (that is, access tokens) with the tunneling service when the sessions are initiated. The following diagram shows the workflow of secure tunneling:
For our scenario, the source refers to the workstation of the fleet operator, the destination refers to the connected HBS hub, and the secure tunnel service is managed by AWS.
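To give an idea of the control-plane call involved, the following is a minimal boto3 sketch that opens an SSH tunnel to a hub; the thing name is a placeholder, and the returned tokens are what the operator's local proxy and the device-side component consume:

import boto3

tunneling = boto3.client("iotsecuretunneling")

tunnel = tunneling.open_tunnel(
    description="Diagnose hbshub-one",
    destinationConfig={"thingName": "hbshub-one", "services": ["SSH"]},
    timeoutConfig={"maxLifetimeTimeoutMinutes": 60},
)
# The source access token goes to the operator's local proxy; the destination
# token is delivered to the device for the secure tunneling component.
print(tunnel["tunnelId"])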
Now that you have gained a good understanding of how to better monitor, maintain, and diagnose edge devices, let's get our hands dirty in the final section of this chapter.
In this section, you will learn how to use the nucleus emitter and the telemetry agent to capture various metrics and logs from edge devices and visualize them through Amazon CloudWatch and AWS IoT Fleet Hub. The following is the architecture that shows the different services and steps you will complete during the lab:
The following table lists the services that you will use in this exercise:
Your objective in this hands-on section includes the following steps, as depicted in the preceding architecture:
Let's take a look at the preceding steps, in more detail, next.
In this section, you will learn how to set up a Fleet Hub application that can be used to monitor the metrics from the connected HBS hub. Perform the following steps:
Congratulations! You are all set up with the Fleet Hub dashboard!
Note
Although we have only created one user for this lab, you can integrate AWS SSO with your organization's identity management systems such as Active Directory. This will allow role-based access to the dashboard for different personas. Ideally, this configuration will fall under the purview of identity engineers and won't be the responsibility of the IoT professionals.
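If you prefer to script this Fleet Hub setup instead of using the console, the following boto3 sketch enables fleet indexing (which Fleet Hub depends on) and creates the application; the application name and role ARN are placeholders:

import boto3

iot = boto3.client("iot")
fleethub = boto3.client("iotfleethub")

# Fleet Hub requires the registry (and ideally shadow and connectivity data)
# to be indexed.
iot.update_indexing_configuration(
    thingIndexingConfiguration={
        "thingIndexingMode": "REGISTRY_AND_SHADOW",
        "thingConnectivityIndexingMode": "STATUS",
    }
)

app = fleethub.create_application(
    applicationName="hbs-fleet-operations",   # placeholder
    roleArn="arn:aws:iam::123456789012:role/HbsFleetHubRole",  # placeholder
)
print(app["applicationId"])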
Next, let's set up the routing rules for ingesting the telemetry data from the HBS hub to the cloud:
Great work! EventBridge is all set to ingest telemetry data and publish it to the CloudWatch group.
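For reference, the console steps above roughly correspond to the following boto3 calls; the event source pattern and the log group ARN are assumptions to verify against the telemetry events actually arriving in your account:

import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="hbs-telemetry-to-logs",
    EventPattern=json.dumps({"source": ["aws.greengrass"]}),  # assumed source
)
events.put_targets(
    Rule="hbs-telemetry-to-logs",
    Targets=[{
        "Id": "cloudwatch-logs",
        "Arn": "arn:aws:logs:us-east-1:123456789012:log-group:/hbs/telemetry",  # placeholder
    }],
)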
We will revisit these dashboards again at the end of the lab to visualize the collected data from the hubs. For now, let's switch gears to configure and deploy the edge connectors on Greengrass.
In this section, you will learn how to deploy the Nucleus Emitter and Log Manager agents to a Greengrass-enabled HBS hub so that it publishes health telemetry to the cloud. Perform the following steps:
Now that you have set up the Fleet Hub application and deployed the agents on the edge, you can visualize the health telemetry data using the following steps:
Therefore, as a fleet operator, you can now visualize the health of your device fleet and be alerted when thresholds are breached.
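If you would rather script the component deployment from this lab instead of using the console, the following boto3 sketch captures the idea; the target thing group ARN, component versions, publish interval, and MQTT topic are placeholders and assumptions to adjust for your setup:

import json
import boto3

greengrass = boto3.client("greengrassv2")

deployment = greengrass.create_deployment(
    targetArn="arn:aws:iot:us-east-1:123456789012:thinggroup/hbs-hubs",  # placeholder
    deploymentName="hbs-hub-telemetry",
    components={
        "aws.greengrass.telemetry.NucleusEmitter": {
            "componentVersion": "1.0.3",  # placeholder version
            "configurationUpdate": {
                "merge": json.dumps({
                    "telemetryPublishIntervalMs": 60000,   # assumed config key
                    "mqttTopic": "hbs/hub/telemetry",      # assumed config key
                })
            },
        },
        "aws.greengrass.LogManager": {
            "componentVersion": "2.2.8",  # placeholder version
        },
    },
)
print(deployment["deploymentId"])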
So far, you have learned how to operate a fleet of Greengrass-enabled devices using AWS native solutions. These patterns also apply to non-Greengrass devices, for example, devices that leverage other device SDKs (such as the AWS IoT Device SDK or FreeRTOS).
Challenge Zone
I would like to throw a quick challenge your way: determine how you can trigger an OTA job from AWS IoT Fleet Hub to a specific HBS hub device. This can be useful when you have to push an update, such as a configuration file, during an operational event. Best of luck!
Let's wrap up this chapter with a quick summary and a set of knowledge-check questions.
In this chapter, you were introduced to design patterns and the best practices of onboarding, managing, and maintaining a fleet of devices. These practices can help you to provision and operate millions (or more) of connected devices across different geographic locations. Additionally, you learned how to diagnose edge devices remotely for common problems or tunnel in securely for advanced troubleshooting. Following this, you implemented an edge-to-cloud architecture, leveraging various AWS-built components and services. This allowed you to collect health telemetry from the HBS hubs, which a fleet operator can visualize through dashboards, be notified through alarms, or take action as required.
In the next and final chapter, we will summarize all the key lessons that you have learned throughout the book (and more), so you are all set to build well-architected solutions for the real world.
Before moving on to the next chapter, test your knowledge by answering these questions. The answers can be found at the end of the book:
Take a look at the following resources for additional information on the concepts discussed in this chapter: