You now have the skills required to create edge machine learning (ML) solutions. This chapter serves both as a summary of the key lessons learned throughout this book and as a demonstration of why they are best practices, grounded in a review of the delivered solution. By reviewing the solution, we can see how the Home Base Solutions prototype hub design holds up and where there are further opportunities to improve it. You will learn what it is like to perform a deep analysis of the solution using the AWS Well-Architected Framework, a mechanism created for reviewing complex solutions. Finally, we'll leave you with suggested next steps for your journey as a practitioner of delivering intelligent workloads to the edge.
In this chapter, we're going to cover the following main topics:
In this section, we will group and summarize the key lessons from throughout this book as a quick reference to ensure that the most important lessons were not missed. There is a loose chronology to the groupings based on the material from Chapters 1 to 9, but some lessons may appear in a group outside the order in which they appeared in this book.
The following key lessons capture the definition, value proposition, and shape of an edge ML solution:
The following key lessons summarize the definition of AWS IoT Greengrass and the best practices for using it to deliver edge ML solutions:
The following key lessons summarize the techniques and patterns you should consider when breaking down a problem into modeled data and the ML workloads that you use in your edge ML solutions:
The following key lessons summarize important distinctions in the production version of your solution and how to operate the solution at scale:
In the next section, you will learn about a framework provided by AWS for evaluating design trade-offs when building solutions on the platform.
In 2015, AWS launched a framework for guiding developers through the process of making good design decisions when building on AWS. The AWS Well-Architected Framework codifies the best practices for defining, deploying, and operating workloads on the AWS cloud. It exists as a whitepaper of best practices and a web-based tool to approach a solution evaluation as a checklist of considerations and suggested mitigation strategies. This expertise aims to serve AWS customers but is delivered in a format that is generally useful for evaluating any kind of digital workload. We will use this framework to retroactively review this book's solution of the Home Base Solutions appliance monitoring product.
The Well-Architected Framework organizes best practices into five pillars. A pillar is a section of the framework that aggregates design principles and guiding questions to resolve under a common purpose. The five pillars are as follows:
You may recognize some of these pillars as the key benefits we used to define the value proposition of edge ML solutions in Chapter 1, Introduction to the Data-Driven Edge with Machine Learning! Each pillar includes a narrative and a set of questions to evaluate and consider. The questions that the architect does not have a clear response or existing mitigation strategy for are then used to define the gap between how well the solution is architected now and where it needs to be. For example, if the review helps us identify a single point of failure in our architecture, then we would decide between the acceptability of that risk in our solution or whether to refactor with a failover mechanism.
It's important to understand that when you're answering the framework's questions to review your solution, there are no objectively right or wrong answers. The overall posture of your solution is not a quantifiable outcome of completing a review. The process you use to answer individual questions may identify important refactors or highlight gaps in the original design. It's up to your team's definition of done to decide how complete or thorough your answers must be and how many questions must be resolved before your team is satisfied with the due diligence performed. A lazy or superficial review may not lead to any meaningful change. As the criticality of the solution increases, the rigor of your review should scale proportionally, or even non-linearly.
In your application of the framework, you may find value in moving pillar by pillar, answering each question in series, or by crafting a subset of prioritized questions as a cross-section of all the pillars. It is also recommended and more common to review the framework between the steps of designing the solution and implementing it. This helps architects prevent failures and raise the security posture before investing time and resources in building the solution. For this book, we elected to save the review for the end to move quickly into the hands-on projects, recognizing that we are practicing in a safe, prototype environment.
The AWS Well-Architected Framework also includes extensions that are referred to as lenses. A lens is a collection of additional best practices related to a particular domain or type of solution, such as a SaaS application or an IoT solution. These lenses help architects within their domains to critically analyze their solutions, though, unlike the main body of the framework, their guidance doesn't broadly apply to all kinds of solutions. Our review in this chapter will use a mix of framework questions between the main body and the IoT Lens. Links to both resources are included in this chapter's References section. In the next section, we will review our solution using a subset of the questions posed by the AWS Well-Architected Framework.
Before we perform a solution review, let's restate the problem, revisit the target solution, and reflect on what was built in this book. This will help us refresh our memory and contextualize the solution review using the Well-Architected Framework.
Our fictional narrative had us working at Home Base Solutions as the IoT architect responsible for designing a new home appliance monitoring product. This product was a combination of a hub device that connects to consumers' home networks and interacts with paired appliance monitoring kits. These kits are attached to consumers' large appliances, such as furnaces or washing machines, and send telemetry data to the hub device. The hub device processes telemetry data, streams it to the cloud to train ML models, and hosts local inference workloads using new telemetry and the deployed models. The following diagram shows how these entities are related in consumers' homes:
Our target solution was to prototype the hub device on a Raspberry Pi to collect telemetry data and run the ML workloads, all while using a SenseHAT expansion module to collect sensor data and signal results visually to the LED matrix. We used AWS IoT Greengrass to deploy a runtime environment to the hub device that could install and run our code as components. These components encapsulated our business logic to collect sensor telemetry, route data through the edge and cloud, fetch resources from the cloud, and run our ML inference workloads.
We used Amazon SageMaker to train a new ML model in the cloud using the sensor telemetry that was acquired by the hub device and streamed it to the cloud as training data. This ML model was deployed to the edge to intelligently assess the health of our monitored appliance and signal to the consumer if any anomalous behavior is detected. Finally, we planned how to scale up our solution to a fleet of hub devices, their monitoring kits, and ML models, and how to operate this fleet in a production environment. The following diagram reviews our solution architecture:
With this brief review of our business objective and solution architecture to set the context, let's apply the AWS Well-Architected Framework to analyze our solution.
The format we will use for the Well-Architected review is to state a question from the framework and then respond with an answer in the role of the HBS IoT architect. As a reminder, highlights from the framework were selected from the base material and the IoT Lens to drive interesting analysis for this chapter. There are more best practices to consider in the complete body of the framework.
Note
The following sections pull in questions from the AWS Well-Architected Framework and the IoT Lens extension. Questions labeled as OPS 4, for example, indicate that they are from the Well-Architected Framework. A question labeled as IOTOPS 4 indicates it is from the IoT Lens extension. This distinction is not relevant for this chapter but it identifies which source material the question was copied from.
The operational excellence pillar reinforces thinking about how we operate the live solution. It organizes its guidance into four sub-areas: organization, preparation, operation, and evolution. This pillar stresses the importance of an organization's work culture and mechanisms for anticipating the inevitability of failure, reducing the influence of human error, and learning from mistakes. Now, let's review a selection of the questions from this pillar and some sample responses we might see as output from the architect.
We will summarize a response to these three related questions from the operational excellence pillar. In this context, the workload means anything related to meeting our business objectives, such as informing customers of their failing appliances. This is different from operations, which refers to anything related to the technical implementation we use to operate the workload, such as the deployment mechanisms or tools we use to notify our team of the impact.
We have designed each level of our workload to report some kind of health state. Our workload can be defined at three levels, each with mechanisms for reporting its state so that we can automate monitoring and alerting. These three levels are the fleet of devices, the components running on a hub device, and the cloud pipeline of training ML models. At the fleet level, hub devices report the health of their deployments and connectivity status to the cloud with services such as AWS IoT Greengrass and Amazon CloudWatch. We can use services such as AWS IoT Device Management to monitor for devices in unhealthy states and take action against them. The components that are running on devices are monitored by the IoT Greengrass core software, and logs for each component can be shipped to the cloud for automated analysis. The ML training pipeline reports metrics on training accuracy so that we can measure the overall state of meeting our business objectives.
We will implement threshold alarms on critical failures, such as devices failing deployments and appliance monitoring kits losing connection to their hub devices. These enable us to proactively mitigate failures before they impact our customers, or reach out to customers to inform them of actions they can take to restore local operations.
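As an illustrative sketch of such a threshold alarm, the following builds the arguments for a CloudWatch metric alarm with boto3. The metric name, namespace, and SNS topic are hypothetical placeholders; in practice, they would match whatever health metric your fleet actually publishes:

```python
def deployment_failure_alarm(sns_topic_arn):
    """Build arguments for a CloudWatch alarm on failed Greengrass deployments.

    The metric name and namespace are hypothetical examples, not values
    emitted by Greengrass itself.
    """
    return {
        "AlarmName": "hbs-hub-failed-deployments",
        "Namespace": "HBS/Fleet",             # hypothetical custom namespace
        "MetricName": "FailedDeployments",    # hypothetical custom metric
        "Statistic": "Sum",
        "Period": 300,                        # evaluate 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 1.0,                     # alarm on the first failure
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],      # notify the operations team
    }


def create_alarm(sns_topic_arn):
    """Create the alarm in your account (requires AWS credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(
        **deployment_failure_alarm(sns_topic_arn))
```

Keeping the alarm definition as a plain dictionary makes it easy to review in code and reuse across environments before it is ever sent to the CloudWatch API.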
To reduce defects and mitigate deployment risks, we must include a physical copy of each target hardware profile running our solution in our testing and deployment pipeline. These devices will be the first to receive new deployments through Greengrass by specifying them as a separate group in AWS IoT Core. We can configure our CI/CD pipeline to create new deployments for that group and wait for these deployments to be reported as successful before advancing the deployment to the first wave of production devices.
We get some out-of-the-box remediation value from Greengrass anyway because, by default, it will roll back failed deployments to the previous state. This helps minimize the downtime of production devices and instantly signals to our team that something is wrong with the deployment. Greengrass can also stop the fleet of grouped devices from being deployed if a certain portion of them fail their deployment activity.
Since our solution uses Greengrass, we know the documented minimum requirements for running the Greengrass software. We used the IoT Device Tester software to validate that our target hardware platform is compatible with Greengrass's requirements and can connect to the AWS service. We should use IoT Device Tester to validate any future hardware platforms that we want to use as HBS hub devices.
We should also calculate the additional resources consumed by all of our components. For example, if we know that our static resources will consume a total of 1 GB on disk, we need at least that much capacity, plus room for storing logs, temporary resources, and so on. Once we have calculated the minimum requirements for our solution, we can add a custom test to IoT Device Tester that validates that each new hardware target is ready to run our solution.
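This back-of-the-envelope sizing can be captured in a small helper. The component names, log headroom, and safety margin below are illustrative assumptions, not Greengrass requirements:

```python
def minimum_disk_bytes(component_artifact_bytes,
                       log_headroom_bytes=512 * 1024**2,
                       safety_margin=0.25):
    """Estimate the minimum disk space our components need on a hub device.

    component_artifact_bytes: mapping of component name -> static artifact
    size in bytes. The log headroom and safety margin are illustrative.
    """
    static_total = sum(component_artifact_bytes.values())
    return int((static_total + log_headroom_bytes) * (1 + safety_margin))


# Example: 1 GiB of static artifacts (hypothetical component names) plus
# 512 MiB of log headroom and a 25% safety margin.
components = {
    "com.hbs.hub.ReadSenseHAT": 200 * 1024**2,
    "com.hbs.hub.DetectAnomalies": 824 * 1024**2,
}
required = minimum_disk_bytes(components)
```

A custom IoT Device Tester test could then compare this figure against the free space reported by the candidate hardware.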
The security pillar reinforces thinking about how to maintain or raise your workload's security posture, such as protecting access to data and systems. It organizes best practices into the following sub-areas: Identity and Access Management, detection, infrastructure protection, data protection, and incident response. This pillar stresses clearly defining the resources and actors in your workload, the boundaries and access patterns between them, and the mechanisms for enforcing those boundaries.
The identities and permissions for people are managed by AWS Identity and Access Management (IAM). Our customers will log in to their management app using federated identities from OAuth providers such as Google or Facebook, or create new usernames directly with us using Amazon Cognito. We will tie Cognito identities to the devices they own and interact with using policies.
Identities and permissions for devices are managed by a combination of AWS IAM and AWS IoT Core. The device-to-cloud identity uses an X.509 private key and certificate registered with AWS IoT Core to establish MQTT connections. The same certificate can also be exchanged for temporary AWS credentials. These temporary credentials are tied to an IAM role whose attached policies determine what the credentials are allowed to do with various AWS services. By using a unique private key on each device, the identity of a device cannot be spoofed by a malicious actor.
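The certificate-for-credentials exchange happens over the AWS IoT credentials provider endpoint using mutual TLS. A minimal standard-library sketch follows; the endpoint, role alias name, thing name, and file paths are placeholders you would substitute with your account's values:

```python
import json
import ssl
import urllib.request


def credentials_url(endpoint, role_alias):
    """Build the credentials provider URL for a given IoT role alias."""
    return f"https://{endpoint}/role-aliases/{role_alias}/credentials"


def fetch_temporary_credentials(endpoint, role_alias, thing_name,
                                cert_path, key_path, root_ca_path):
    """Exchange a device's X.509 certificate for temporary AWS credentials.

    All arguments are placeholders; the endpoint is the account-specific
    credentials provider endpoint, not the MQTT data endpoint.
    """
    context = ssl.create_default_context(cafile=root_ca_path)
    # Present the device's certificate and private key for mutual TLS
    context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    request = urllib.request.Request(
        credentials_url(endpoint, role_alias),
        headers={"x-amzn-iot-thingname": thing_name})
    with urllib.request.urlopen(request, context=context, timeout=10) as resp:
        # Response contains accessKeyId, secretAccessKey, sessionToken
        return json.loads(resp.read())["credentials"]
```

The returned temporary credentials can then be passed to any AWS SDK client in place of long-lived keys.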
At the edge, we can classify data as either runtime data that is derived from sensors or used to deliver business outcomes or operational data that comes from software and system logs. In our current design, we do not handle runtime and operational data any differently in terms of management or protection. Here, we have the opportunity to better safeguard any potential customer privacy data, such as video feeds from connected cameras.
At the edge, any data that's in transit between the components of the Greengrass solution is not encrypted. We use the permissions model of Greengrass's components and interprocess communication (IPC) to protect access to data that's published over IPC. Data in transit between leaf devices and the Greengrass device using MQTT is encrypted over the network using the private key and certificate with mutual TLS.
At the edge, data at rest is not encrypted; instead, we rely on the permissions of the Unix filesystem to protect access to it. We must ensure we use proper user and group configurations to protect access to data at rest. Here, we have the opportunity to put a validation mechanism in place that alerts us if system users or groups are created or modified. Each time we add a new component to the solution, we must perform a security threat analysis to check that the proper protections for data access are in place.
From the edge to the cloud, we should use mutual TLS to encrypt MQTT traffic in transit and AWS Signature Version 4 to sign any other traffic that's exchanged with AWS APIs using the temporary credentials (that traffic is itself encrypted with TLS over HTTPS). Data at rest that's stored in AWS services uses the encryption policies of each service. For example, data stored in Amazon Simple Storage Service (S3) can use server-side encryption with AWS-managed encryption keys.
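For example, an upload to S3 can request server-side encryption explicitly on each request. This sketch uses hypothetical bucket and key names:

```python
def encrypted_put_args(bucket, key, body, algorithm="AES256"):
    """Build arguments for an S3 PutObject request with server-side encryption.

    algorithm may be "AES256" (S3-managed keys) or "aws:kms" (AWS KMS keys).
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": algorithm,
    }


def upload_telemetry(bucket, key, body):
    """Perform the upload (requires AWS credentials and an existing bucket)."""
    import boto3

    boto3.client("s3").put_object(**encrypted_put_args(bucket, key, body))
```

Alternatively, a bucket-level default encryption policy removes the need to specify the header on every request; the per-request form shown here simply makes the intent explicit in code.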
Our operations team has alarms set on the operational health metrics of the fleet of devices. For example, if a device fails a deployment, the operations team will receive an incident ticket as a notification from the alarm. If a group of devices fails a deployment, we will page our operations team for immediate triaging.
We will author a series of runbooks for anticipated failure events for our operations team to follow as a first response. The first step will be to define the minimum set of runbooks needed before we are comfortable with the first wave of production devices.
We will work with our ODM to document the security life cycle from the supply chain of parts, through assembly and delivery to our warehouse for inclusion in the retail packaging. It is important to us that parts such as the central processor, volatile and non-volatile memory, and the Trusted Platform Module (TPM), which houses the private key, are authentic and haven't been tampered with before they are assembled into our product.
All TPMs provided by the ODM for our devices will be associated with a certificate authority (CA) that we will register in AWS IoT Core. We will pre-provision each device in the cloud so that devices can simply connect using their protected credentials and not require any just-in-time (JIT) registration process.
Should we identify any device as having a compromised identity, we will assess whether a certificate rotation activity is a sufficient mitigation. If not, we will revoke its certificate in AWS IoT Core to prevent it from exchanging further data and proactively reach out to the customer to start an exchange process.
The reliability pillar reinforces that a workload should continue to operate as it was designed and when it is expected to. It organizes best practices into the following sub-areas: foundations, workload architecture, change management, and failure management. This pillar stresses concepts such as failover and healing mechanisms in response to failures, testing recovery scenarios, and monitoring for availability during steady-state operations and after deploying a change.
We have designed our workload using a service-oriented architecture and implemented the principles of isolated, decoupled services. We use this architecture to make it easier to design, author, test, and ship code, as well as to minimize the impact that a single faulting service has on the solution. We codify this architecture design using the mechanisms defined by the core Greengrass software and its components.
For our edge solution, we use versioned components to incrementally update the software running on our devices through Greengrass deployments. We deploy changes on test devices before rolling those changes out to production devices. Deployments that fail on a device will be automatically rolled back. Deployments that fail to 10% of a fleet will roll back the entire deployment to that fleet.
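These rollout policies can be expressed directly when creating a Greengrass deployment through the AWS API. The sketch below builds the arguments for such a deployment; the thing group ARN, component name, and the minimum-executions value are placeholders:

```python
def fleet_deployment_args(target_group_arn, components):
    """Build create_deployment arguments with per-device rollback and a
    fleet-wide 10% abort threshold.

    components: mapping of component name -> {"componentVersion": ...}.
    """
    return {
        "targetArn": target_group_arn,
        "components": components,
        "deploymentPolicies": {
            # Roll a failed device back to its previous working state
            "failureHandlingPolicy": "ROLLBACK",
        },
        "iotJobConfiguration": {
            "abortConfig": {
                "criteriaList": [{
                    "failureType": "ALL",
                    "action": "CANCEL",
                    "thresholdPercentage": 10.0,      # stop at 10% failures
                    "minNumberOfExecutedThings": 10,  # after >= 10 attempts
                }]
            }
        },
    }


def deploy(target_group_arn, components):
    """Create the deployment (requires AWS credentials)."""
    import boto3

    boto3.client("greengrassv2").create_deployment(
        **fleet_deployment_args(target_group_arn, components))
```

A usage example might target `arn:aws:iot:us-east-1:123456789012:thinggroup/hbs-hub-fleet` with `{"com.hbs.hub.DetectAnomalies": {"componentVersion": "1.0.0"}}`, both of which are hypothetical identifiers.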
For our cloud solutions, we use CloudFormation templates and stacks to provision cloud resources and make changes to them. We do not make any changes to the production infrastructure that are not authored through IaC mechanisms, and every change must be reviewed by a peer on the team before it can be deployed. We can use CloudWatch metrics and logs for our provisioned resources to monitor for unhealthy statuses and roll back CloudFormation changes in the event of operational impact.
Our edge ML solutions are designed to operate independently from the cloud. Some features are impacted during periods of network instability, such as publishing failure events to the cloud for customer push notifications to their mobile app. Events, telemetry data, and logs that are destined for the cloud are buffered locally and will eventually get to the cloud once network instability has been resolved. Data that is published to the cloud but does not get an acknowledgment of this will be retried, such as with an MQTT quality of service set to an at least once level of service.
When the cloud is trying to communicate with devices, such as when a new deployment is ready to be fetched, we use durable services such as Greengrass, which keep track of devices that are offline and haven't completed a pending deployment activity yet.
In this case, component failure does not refer to Greengrass components specifically. We use a service-oriented architecture to withstand component failures, so any of our custom services should be able to fail without bringing down the entire solution. For example, if the component that reads measurements from the temperature sensor fails, the hub device and edge solution will still be operational, albeit with less accuracy when it comes to detecting appliance anomalies.
There are some components provided by Greengrass that, if failing, could impact multiple outcomes in our solution, such as the IPC messaging bus. If components such as these fail, our custom components will not be able to publish new messages, and receiving components would stop getting new messages to work with. We should update our custom component code, which publishes messages, so that it can buffer messages where we cannot afford to drop messages while IPC is unavailable. We should also study the behavior of Greengrass and its ability to self-recover when a provided function such as IPC is impacted.
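That buffering behavior can be sketched in plain Python, with the actual publish call abstracted as a callback. In a real component, the callback would be the Greengrass IPC SDK's publish operation, and you might persist the buffer to disk rather than memory:

```python
from collections import deque


class BufferedPublisher:
    """Hold messages locally while publishing fails, flushing on recovery."""

    def __init__(self, publish_fn, max_buffered=1000):
        self._publish = publish_fn  # e.g., an IPC publish call
        # Bounded buffer: the oldest messages are dropped at capacity
        self._pending = deque(maxlen=max_buffered)

    def publish(self, message):
        """Queue a message and attempt to drain the buffer."""
        self._pending.append(message)
        return self.flush()

    def flush(self):
        """Drain the buffer in order, stopping at the first failure."""
        while self._pending:
            try:
                self._publish(self._pending[0])
            except RuntimeError:
                return False  # publishing unavailable; keep buffering
            self._pending.popleft()
        return True
```

A periodic timer in the component could call `flush()` so that buffered messages drain soon after IPC recovers, not only when the next new message arrives.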
If any of our cyber-physical hardware interfaces fail, such as a sensor no longer being able to be read, we would stop seeing values being published over IPC and get error messages in the corresponding software component that uses the sensor. We may be able to triage events like these remotely using uploaded logs. If any of our compute, memory, disk, or network hardware components fail, the entire solution will likely be disabled and require on-premises triaging or the device being exchanged through our customer support program.
The performance efficiency pillar reinforces that we strike a balance between consumed resources and the available budget and that we continue to seek out efficiency gains as technology evolves. It organizes best practices into the following sub-areas: selection, review, monitoring, and tradeoffs. This pillar stresses delegating complex tasks for solved problems, planning for data to be at the right place at the right time, and reducing how much infrastructure your team must manage.
Concerning our ML model training needs, we will initially select compute instances on AWS based on our default settings and evaluate whether there are more cost-effective instance profiles to use in our training life cycle through trial and error. Since ML is a differentiator for our consumer product, we want to enable the ML model on our customers' devices within an appropriate service-level agreement (SLA), such as within one business day after accumulating enough training data to produce an accurate model. As we ramp up our production fleet, we may find value in batching training jobs to maximize the utilization of provisioned compute instances.
Concerning our target device hardware at the edge, we will measure the performance of our full production workload on the prototype device, such as a Raspberry Pi, and iterate toward a production hardware profile based on the overall utilization of the compute device. We want to leave some buffer room in the total utilization in case we deploy new workloads to devices as a future upgrade.
We will monitor new releases from AWS for opportunities to bring in new managed Greengrass components that handle even more undifferentiated heavy lifting for our edge workload. We will also monitor new releases in the Amazon SageMaker and Amazon Elastic Compute Cloud (EC2) portfolios for opportunities to optimize our ML training pipeline.
We will use the managed component for enabling AWS IoT Device Defender to collect system-level metrics from each device, such as compute, memory, and disk utilization. We will monitor for anomalies and threshold breaches and act in response to any detected impacts.
For high-priority business outcomes and operational alerts, such as informing others of a detected anomaly or a drop in sensor values, data will be transmitted from devices to the cloud as soon as such data is available. For other classes of data, such as reporting component logs or sensor telemetry to use in a new ML training job, data can be transmitted in batches daily.
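A sketch of this two-tier routing policy follows. The batch-size threshold is illustrative, and in practice `flush()` would also be invoked by a daily timer rather than only when the batch fills:

```python
class DataRouter:
    """Send high-priority records immediately; batch the rest for bulk upload."""

    def __init__(self, send_now, send_batch, batch_max=100):
        self._send_now = send_now      # e.g., an MQTT publish to the cloud
        self._send_batch = send_batch  # e.g., a bulk upload to Amazon S3
        self._batch_max = batch_max
        self._batch = []

    def route(self, record, priority=False):
        if priority:
            self._send_now(record)  # anomaly alerts go out right away
        else:
            self._batch.append(record)
            if len(self._batch) >= self._batch_max:
                self.flush()

    def flush(self):
        """Invoked when the batch fills, or once per day by a scheduler."""
        if self._batch:
            self._send_batch(self._batch)
            self._batch = []
```

Separating the two send callbacks keeps the routing decision independent of the transport, so the same component logic works whether the batch lands in S3 directly or goes through a stream manager.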
The cost optimization pillar reinforces how to operate a solution that meets business needs at the lowest cost. It organizes best practices into the following sub-areas: financial management, usage awareness, cost-effective resources, managing demand and supply, and optimizing over time. This pillar stresses measuring the overall efficiency of your cloud expenditure, measuring return on investment to prioritize where next to optimize, and seeking implementation details that can lower costs without compromising on requirements.
We will use a combination of Amazon CloudWatch for metrics and logs, as well as the AWS Billing console to monitor the usage and cost of consumed AWS services. The most significant source of cost is anticipated to be cloud compute instances for our ML training workloads. We will monitor the costs associated with each device for outliers where individual devices are consuming more in cloud costs than the fleet's average.
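Detecting per-device outliers can be as simple as comparing each device's attributed cost against a multiple of the fleet average. The factor of 2 here is an arbitrary starting point, and the device IDs are hypothetical:

```python
from statistics import mean


def cost_outliers(device_costs, factor=2.0):
    """Return device IDs whose cost exceeds `factor` times the fleet average.

    device_costs: mapping of device ID -> attributed cost for the period.
    """
    if not device_costs:
        return []
    average = mean(device_costs.values())
    return sorted(device for device, cost in device_costs.items()
                  if cost > factor * average)
```

A more robust production version might use a median-based comparison so that a single runaway device does not inflate the baseline it is compared against.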
To capture sensor telemetry from our appliance monitoring kits, we will batch the telemetry data for daily transmission to the cloud, which will go directly to Amazon S3. This will dramatically lower the cost of the transmission compared to sending each payload as it is published by the sensor components. We do not have plans to further optimize the payload sizes for any operational messages that are exchanged between Greengrass devices and the cloud because we do not anticipate these messages to make up a significant expense.
That concludes our sample responses to the AWS Well-Architected review. Are there any responses you disagree with or would otherwise modify? The review process is a guideline and is not designed to contain right or wrong answers. It is up to you and your team of collaborators to define how complete the answers should be and whether or not you have action items as a result of the review. Questions that the team has no answer to or cannot articulate a detailed answer to are good opportunities to learn more about your architecture and anticipate problems before they surface in your solution. In the next section, we will provide some final coverage of the AWS features you may find useful but that did not fit in the scope of this book.
This book focused on a specific use case as a fictitious narrative to selectively highlight features available from AWS that can be used to deliver intelligent workloads to the edge. There is so much more you can achieve with AWS IoT Greengrass, the other services in the AWS IoT suite, the ML suite of services, and the rest of AWS than what we could cover in a single book.
In this section, we will point out a few more features and services that may be of interest to you as an architect in this space, as well as offer some ideas on how to extend the solution you've built so far to further your proficiency.
The Greengrass features we used in the solution represent a subset of the flexibility that Greengrass solutions can offer. You learned how to build with components, deploy software to the edge, fetch ML resources, and make use of built-in features for routing messages throughout the edge and the cloud. The components we used in the hub device prototype primarily downloaded Python code and launched long-lived applications that interacted with the IPC messaging bus. Components can be designed to run one-off programs per deployment, per device boot, or on a schedule. They can be designed to run services that act as dependencies for your other component software, or wait to start once other dependencies in your component graph have run successfully.
Your components can interact with the deployment life cycle by subscribing to notifications about deployment events. For example, your software can request a deferment until a safe milestone is met (such as draining a queue or writing in-memory records to disk), or signal to the Greengrass nucleus that it is now ready for an update.
Components can signal to other components that they should pause or resume their functionality. For example, if a component responsible for a limited resource such as disk space or memory identifies a high utilization event, it could request that the components consuming those resources pause until the utilization comes back into the desired range.
Components can interact with each other's configuration by requesting the current configuration state, subscribing to further changes, or setting a new value for a component's configuration. Returning to the previous example, if a resource watchdog component didn't want to fully pause a consuming component, it could specify a new configuration value for the consuming component to write sampled values less frequently or enter a low-power state.
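The watchdog interaction described above can be sketched in plain Python, with local messaging abstracted as a callback. In a real component, the callback would publish over Greengrass IPC, and the consuming component would apply the new value to its own configuration; the thresholds and interval values are illustrative:

```python
class ResourceWatchdog:
    """Ask consumers to slow down at high utilization and resume when it drops."""

    def __init__(self, publish, high_water=0.90, low_water=0.70):
        self._publish = publish   # e.g., IPC local pub/sub publish
        self._high = high_water   # thresholds are illustrative assumptions
        self._low = low_water
        self._throttled = False

    def observe(self, utilization):
        """Call with the latest disk or memory utilization (0.0 to 1.0)."""
        if utilization >= self._high and not self._throttled:
            self._throttled = True
            # Request a slower sample rate instead of a full pause
            self._publish({"action": "configure",
                           "sample_interval_seconds": 60})
        elif utilization <= self._low and self._throttled:
            self._throttled = False
            self._publish({"action": "configure",
                           "sample_interval_seconds": 5})
```

Using separate high and low water marks gives the watchdog hysteresis, so it does not flap between throttled and normal states when utilization hovers near a single threshold.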
All three of the previously mentioned features work using Greengrass IPC and are straightforward applications of local messaging between your components and the Greengrass nucleus that governs the component life cycle. There is a lot of utility in these features for your solution design, and they demonstrate how you can build systems for component interaction on top of IPC.
Here are a few more features of Greengrass that you should be aware of as you continue your journey as an edge ML solution architect. The documentation for Greengrass's features can be found online at https://docs.aws.amazon.com/greengrass:
Now, let's review a few more features in the wider suite of AWS IoT services that can power up your next project.
The suite of AWS IoT services covers use cases for device connectivity and control, managing a fleet at scale, detecting and mitigating security vulnerabilities, performing complex event detection, analyzing industrial IoT operations, and more. Greengrass is a model implementation of designing edge solutions on top of existing AWS IoT services and also natively integrates with them in powerful ways. Here are a few more features in the AWS IoT suite to take a look at when designing your next edge ML solution. The documentation for these features can be found at https://docs.aws.amazon.com/iot:
Now, let's review a few more services in the ML suite of AWS that add more intelligence to your workloads.
The ML services of AWS span from tools that help developers train and deploy models to high-level artificial intelligence services that solve specific use cases. While the following services run exclusively in the AWS cloud today, they can help augment your AI edge solutions that can work with remote services:
Next, we will provide a few ideas on the next steps you could take to extend this book's solution.
With your working solution of a hub device running a local ML workload, you have practiced using all the necessary tools to deploy intelligent workloads to the edge. You may already have a project in mind to apply the lessons you've learned from this book and reinforce what you've learned through practical application. If you are looking for inspiration on the next steps to take, we have compiled a few ideas for extending what you have already built as a means to develop further proficiency with this book's topics:
Do you have other suggestions for project extensions or want to show off what you've built? Feel free to engage with our community of readers and architects by opening a pull request or issue in this book's GitHub repository!
That's all we have for you! We, the authors, believe that these are the best techniques, practices, and tools you can use to continue your journey as an architect of edge ML solutions. While some of the tools are specific to AWS, everything else should generally serve you in building these kinds of solutions. Solutions built with AWS IoT Greengrass can just as easily include components that communicate with your web services or the services of cloud vendors such as Microsoft or Google. The guiding principle of this book was to prioritize teaching you how to build and how to think about building edge ML solutions over using specific tools.
As you take your next steps, whether they are extending this book's prototype hub device, starting a new solution, or modernizing an existing solution, we hope you find value in reflecting upon the lessons you've learned and critically thinking about the tradeoffs that help you reach your goals. We welcome your feedback on this book's content and the technical examples through the communication methods included in this book's GitHub repository. We are committed to the continued excellence of the examples and tutorials provided, recognizing that in the world of information technology, tools evolve, dependencies update, and code breaks.
We are sincerely grateful that you decided to invest your precious time in exploring the world of bringing intelligent workloads to the edge with our book. We wish you the best of luck in your future endeavors!
Take a look at the following resources for additional information on the concepts that were discussed in this chapter: