Chapter 10: Reviewing the Solution with AWS Well-Architected Framework

You now have the skills required to create edge machine learning (ML) solutions. This chapter serves both as a summary of the key lessons learned throughout this book and as a demonstration of why they are best practices, by reviewing the delivered solution. Through that review, we can see how the Home Base Solutions prototype hub design holds up and where there are further opportunities to improve it. You will learn what it is like to perform a deep analysis of the solution using the AWS Well-Architected Framework, a mechanism that was created for reviewing complex solutions. Finally, we'll leave you with suggested next steps for your journey as a practitioner of delivering intelligent workloads to the edge.

In this chapter, we're going to cover the following main topics:

  • Summarizing the key lessons
  • Describing the AWS Well-Architected Framework
  • Reviewing the solution
  • Diving deeper into AWS services

Summarizing the key lessons

In this section, we will group and summarize the key lessons from throughout this book as a quick reference to ensure that the most important lessons were not missed. There is a loose chronology to the groupings based on the material from Chapters 1 to 9, but some lessons may appear in a group outside the order in which they appeared in this book.

Defining edge ML solutions

The following key lessons capture the definition, value proposition, and shape of an edge ML solution:

  • Definition of an edge ML solution: Bringing intelligent workloads to the edge means applying ML technology that's been incorporated into cyber-physical solutions that interoperate between the analog and digital spaces. An edge ML solution uses devices that have sufficient compute power to run ML workloads and either directly interface with physical components such as sensors and actuators, or indirectly interface with end devices over a local network or serial protocol.
  • Key benefits for ML at the edge: The four key benefits of bringing intelligent workloads to the edge are reducing the latency of reacting to a measured event, improving the solution's availability by reducing its runtime dependency on a remote network entity such as a server, reducing the cost of operations by reducing the quantity of data to transmit over the WAN, and enabling compliance with data governance policies by handling protected data exclusively at the edge.
  • Tools for edge ML solutions: The three tools that are needed to operate intelligent workloads at the edge are a runtime for orchestrating your edge software, an ML library and ML model, and a mechanism for exchanging data bi-directionally throughout the edge and with remote services.
  • Decoupled, isolated services: Design your edge ML solutions using principles from service-oriented architecture to deliver a cohesive whole made up of isolated services that interact through decoupling mechanisms. Code that's been designed with a singular purpose is easier to write, test, reuse, and maintain. The code that acquires measurements from a sensor does not need to know how to directly invoke an ML inference service. The inference service does not need to know how to emit results directly to a connected actuator. The degree of isolation and separation of concerns is a spectrum, and it is the architect's job to weigh the trade-offs (see the sketch after this list).
  • Don't re-engineer solved problems: Use proven, trusted technologies to solve implementation details that aren't core to the business problems that are solved by your edge ML solution. For example, don't create a new messaging protocol or data storage layer unless none already satisfy your business requirements.
  • Common edge topologies: Four topologies that are common in building edge solutions are star, bus, tree, and hybrid. The star topology describes how leaf devices interact with a hub or gateway device that tends to run any ML workloads. The bus topology describes how isolated services can interact with each other using decoupling mechanisms. The tree topology describes how a fleet of edge solutions can be managed from a central service. The hybrid topology describes the general shape of any macro-level architecture of edge solutions interacting with cloud services.
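
To make the decoupled-services lesson concrete, here is a minimal, framework-free Python sketch of two isolated services that interact only through a shared local message bus; the topic names and payload shape are illustrative and not tied to any particular runtime:

    class LocalMessageBus:
        """A tiny in-process pub/sub bus that decouples publishers from subscribers."""
        def __init__(self):
            self._subscribers = {}

        def subscribe(self, topic, handler):
            self._subscribers.setdefault(topic, []).append(handler)

        def publish(self, topic, message):
            for handler in self._subscribers.get(topic, []):
                handler(message)

    def sensor_service(bus):
        # Knows only how to read a sensor and publish readings; nothing about inference.
        reading = {"sensor": "vibration", "value": 0.42}  # placeholder for a real read
        bus.publish("sensors/vibration", reading)

    def inference_service(bus):
        # Knows only how to score readings and publish results; nothing about sensors or actuators.
        def on_reading(message):
            anomaly = message["value"] > 0.4  # placeholder for a real ML model
            bus.publish("inference/results", {"anomaly": anomaly, "source": message["sensor"]})
        bus.subscribe("sensors/vibration", on_reading)

    bus = LocalMessageBus()
    inference_service(bus)
    bus.subscribe("inference/results", lambda m: print("actuator would react to:", m))
    sensor_service(bus)

Either service can be rewritten, tested, or replaced without touching the other, which is the property we carry into the Greengrass component model later in this chapter.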

Using IoT Greengrass

The following key lessons summarize the definition of AWS IoT Greengrass and the best practices for using it to deliver edge ML solutions:

  • What is Greengrass? AWS IoT Greengrass is a runtime orchestration tool designed for delivering intelligent workloads to the edge by solving many of the implementation details common to IoT and ML solutions, enabling architects to focus on rapidly delivering business objectives. Greengrass supports a service-oriented architecture by letting developers define components that run independently and optionally interact with the wider solution by using decoupled mechanisms such as interprocess communication channels, streams, files, or message queues. Greengrass natively interacts with AWS services to deliver common functionality that architects would otherwise need to solve for, such as deploying software, fetching resources, and transmitting data to the cloud.
  • Building with components: Greengrass defines units of functionality as components, which are recipes for bundling static resources such as artifacts, configuration, dependencies, and runtime behavior. Developers can run any kind of code as a component, be it a shell script, interpreted code, a compiled binary, an AWS Lambda function, or a container such as Docker.
  • Deploying components to the edge: Components can be deployed locally during development cycles using the Greengrass CLI on a test device. For production use, components are registered in the AWS IoT Greengrass service and deployed remotely as part of a rollout to one or more grouped devices. A deployment defines the set of components to include, the version of each component to use, and optionally any configuration to apply to those components at deployment time (see the sketch after this list). A device can belong to multiple groups and will aggregate all the current deployments of the groups that it belongs to.
  • Security model between the edge and the cloud: The security model between a Greengrass device and AWS services uses a combination of asymmetric cryptography, roles, and policies. Greengrass identifies itself to AWS IoT services using a private key and certificate registered with AWS IoT Core. This certificate is attached to an IoT policy that grants permissions, such as connecting and exchanging messages. The device can request temporary AWS credentials from the AWS IoT Core credentials provider service to identify itself with other AWS services. This works by specifying an AWS Identity and Access Management role that has policies attached to it to grant permissions to other AWS services. Before you add another AWS service interaction to your Greengrass solution, you need to attach a new policy or update an attached policy to include the appropriate permission for that API.
  • Accelerate building with managed components: Use AWS-managed components to solve requirements when applicable. These components solve common requirements such as interacting with AWS services, deploying a local MQTT broker to connect to local devices, synchronizing device state between the edge and the cloud, and running ML workloads.
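
To illustrate how a remote deployment ties together grouped devices, component versions, and per-deployment configuration, here is a boto3 sketch that creates a Greengrass deployment; the thing group name, component name, and configuration values are hypothetical:

    import json
    import boto3

    greengrass = boto3.client("greengrassv2")

    response = greengrass.create_deployment(
        targetArn="arn:aws:iot:us-west-2:123456789012:thinggroup/hbs-hub-prototypes",  # hypothetical group
        deploymentName="hbs-hub-prototype-rollout",
        components={
            "com.hbs.hub.SenseHatPublisher": {   # hypothetical custom component
                "componentVersion": "1.1.0",
                "configurationUpdate": {
                    # Merge a per-deployment configuration value over the component's defaults.
                    "merge": json.dumps({"sampleIntervalSeconds": 30})
                },
            },
        },
    )
    print("Created deployment:", response["deploymentId"])

Every device in the targeted group receives this deployment and merges it with the deployments of any other groups it belongs to.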

Modeling data and ML workloads

The following key lessons summarize the techniques and patterns you should consider when breaking down a problem into modeled data and the ML workloads that you use in your edge ML solutions:

  • Types of data: Data that's acquired at the edge can be classified into three types: structured (a well-defined schema), semi-structured (a schema with some variance in terms of used keys), and unstructured data (a schema with high variance or no schema). All three kinds of data can be evaluated by ML workloads, but the training methods and algorithms may differ for each.
  • Analyze data to select implementation choices: Use data modeling techniques to break down a high-level problem from the conceptual model, to a logical model, to a physical model to inform implementation decisions when choosing technologies for collecting, storing, and accessing data. Analyze your data's size, shape, velocity, and consistency requirements to inform implementation decisions when choosing data storage technologies.
  • Common data flow patterns: Some of the common data flow patterns that can be used in an edge ML architecture are extract, transform, load (ETL), event-driven (streaming), micro-batch processing, and Lambda architecture (parallel hot/cold paths). Avoid anti-patterns for edge architecture such as complex event detection, batch processing, data replication, and data archiving. These patterns are best implemented at layers in your topologies, such as a data center or cloud services.
  • Domain-driven design: Consider the 10 principles of domain-driven design to best organize your data: manage data ownership through domains, define domains using bounded contexts, link a bounded context to application workloads, share the ubiquitous language within the bounded context, preserve the original sourced data, associate the data with metadata, use the right tool for the right job, tier your data storage, secure and govern the data pipeline, and design for scale.
  • The three laws of edge workloads: Keep data workloads at the edge (instead of the cloud) when you must observe the three laws. The law of physics means that the latency between the edge and the cloud has limits, and sometimes your workload requirements cannot tolerate this latency. The law of economics means it may be cost-prohibitive to move all your data to the cloud. The law of the land means that there are data governance and compliance requirements that necessitate that some data remains at the edge.
  • Types of ML training algorithms: ML models can be trained with one of three patterns: supervised (the training data is labeled by a human), unsupervised (the training data is unlabeled; the machine finds patterns or conclusions on its own), or semi-supervised (a mix of labeled and unlabeled data). Training a model to mimic the work of a human expert, such as classifying objects in an image, typically means using a supervised or semi-supervised pattern. Training a model to find relationships between data typically means using an unsupervised pattern.
  • Iterating the data-to-model life cycle: Use the Cross-Industry Standard Process for Data Mining (CRISP-DM) to iterate your ML workloads, from understanding your data to preparing it for training, to evaluating model performance, and then deploying models to the edge.
  • Use ML appropriately: Not every problem can or should be solved with ML. Small datasets or data with low signal-to-noise ratios tend not to train useful models. Simple requirements (such as needing a one-off prediction) can be solved with conventional data analysis, querying, and regression techniques (see the sketch after this list).
  • Value of the cloud in training models: Use the scale of the cloud to train models efficiently and on sufficiently large datasets. Once your models are performing well in the evaluation phase, use model optimization to compress the model so that it has high efficiency and a small footprint on the target hardware platform running your edge ML solution. Continue to test and evaluate the performance of your compressed model on the device and after any retraining events.
  • ML needs a team: A single technical resource can push all the buttons needed to gather data, train a model, and deploy it to the edge, but the process of training an effective model is multi-faceted. Training effective models and deploying them to the edge requires experts from multiple domains to reach a successful outcome. It's okay that one person can't do it all.
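
As a reminder of the Use ML appropriately lesson, here is a sketch of a conventional statistical check that may be all a simple requirement needs before reaching for a trained model; the readings and threshold are illustrative:

    import statistics

    # A small window of recent temperature readings from a monitored appliance (illustrative values).
    readings = [41.2, 41.5, 40.9, 41.8, 41.1, 47.6]

    mean = statistics.mean(readings[:-1])
    stdev = statistics.stdev(readings[:-1])
    latest = readings[-1]

    # A z-score threshold is often enough for simple, one-off checks; reach for ML
    # when the signal is multi-dimensional or the "normal" pattern is hard to hand-code.
    z_score = (latest - mean) / stdev
    if abs(z_score) > 3:
        print(f"Anomalous reading {latest} (z-score {z_score:.1f}); investigate before training a model.")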

Operating a production solution

The following key lessons summarize important distinctions in the production version of your solution and how to operate the solution at scale:

  • DevOps is cultural: Development and operations (DevOps) is not just about new technology and tools. It represents a cultural shift in how organizations promote ownership, collaboration, and cohesiveness across teams to foster innovation. The DevOps paradigm yields benefits to the software delivery life cycle of edge ML solutions, in addition to traditional software development.
  • Use managed components for monitoring: Use components provided by AWS to store logs and metrics in your Amazon CloudWatch account. This will help your team operate a fleet of devices by diagnosing issues remotely through the logs and monitoring for unhealthy devices with alarms on the metrics.
  • IaC is valuable for the edge, too: Store and deploy your solution as Infrastructure as Code (IaC) resources where possible. This makes it easier to maintain your solution's definition and reliably reproduce results across deployments.
  • Your device life cycle begins with manufacturing: Providing identities to devices and defining their provisioning processes has implications for your device's supply chain. Provisioning a test device on your desk is easy. Creating a provisioning pipeline for a production fleet is much more challenging. Communicate the requirements early with your supply chain vendors, original equipment manufacturers (OEMs), and original device manufacturers (ODMs).
  • Ship code in virtualized environments: Your software components can be defined as scripts, source code, binaries, and anything in between. Prefer to ship your code in virtualized environments such as Docker and AWS Lambda where possible to deliver more predictability for runtime operations at the edge.
  • MLOps is circular: Much like the CRISP-DM model for solving problems with data science, the pattern of building and operating ML models is circular. MLOps with models deployed to the edge can be extra challenging as devices are often remote, offline, or exposed to unpredictable elements. Design MLOps into your product life cycle early to lean into good habits. Adding it later is only harder.
  • Deployments can be expensive: Services such as AWS IoT Greengrass make it easy to deploy software to the edge, but the cost of transmitting data must be considered. Many edge solutions are at the end of expensive network connections where you cannot afford to incrementally push models over and over again or fix broken deployments. Set up your DevOps and MLOps pipelines so that you have the highest confidence in your deployments before they go out to the production fleet.
  • Scale the provisioning process: Certificate authorities (CAs) let you define your device's identity with unique certificates. Use your own CA, one from a trusted vendor, or the one provided by AWS to scale up the identities of your device fleets. Use automated provisioning strategies such as Just-in-Time (JIT) provisioning to onboard devices as they connect to your service for the first time.
  • Operators need to scale, too: Scaled production fleets of devices can mean managing thousands to millions of devices. Use tooling that simplifies how to operate that many entities by focusing on outliers and high severity issues. This means you need a solution that captures and indexes this kind of operational data. You also need a solution that makes it easy to dive deep into a single device or apply a fix to a large selection of impacted devices at a time.
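
To support that last lesson with a concrete example, the following boto3 sketch queries the fleet index of AWS IoT Device Management for disconnected devices so that operators can focus on outliers; it assumes fleet indexing with connectivity data is enabled, and the query string is illustrative:

    import boto3

    iot = boto3.client("iot")

    # Find devices that are currently disconnected so operators triage outliers, not the whole fleet.
    response = iot.search_index(queryString="connectivity.connected:false")
    for thing in response["things"]:
        print("Needs attention:", thing["thingName"])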

In the next section, you will learn about a framework provided by AWS for evaluating design trade-offs when building solutions on the platform.

Describing the AWS Well-Architected Framework

In 2015, AWS launched a framework for guiding developers through the process of making good design decisions when building on AWS. The AWS Well-Architected Framework codifies the best practices for defining, deploying, and operating workloads on the AWS cloud. It exists as a whitepaper of best practices and a web-based tool to approach a solution evaluation as a checklist of considerations and suggested mitigation strategies. This expertise aims to serve AWS customers but is delivered in a format that is generally useful for evaluating any kind of digital workload. We will use this framework to retroactively review this book's solution of the Home Base Solutions appliance monitoring product.

The Well-Architected Framework organizes best practices into five pillars. A pillar is a section of the framework that aggregates design principles and guiding questions to resolve under a common purpose. The five pillars are as follows:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization

You may recognize some of these pillars as the key benefits we used to define the value proposition of edge ML solutions in Chapter 1, Introduction to the Data-Driven Edge with Machine Learning! Each pillar includes a narrative and a set of questions to evaluate and consider. The questions that the architect does not have a clear response or existing mitigation strategy for are then used to define the gap between how well the solution is architected now and where it needs to be. For example, if the review helps us identify a single point of failure in our architecture, then we would decide between the acceptability of that risk in our solution or whether to refactor with a failover mechanism.

It's important to understand that when you're answering the framework's questions to review your solution, there are no objectively right or wrong answers. The overall posture of your solution is not a quantifiable outcome of completing a review. The process you use to answer individual questions may identify important refactors or highlight gaps in the original design. It's up to your team's definition of done to decide how complete or thorough your answers must be and how many questions are resolved, in the sense that your team is satisfied with the due diligence that's been performed. A lazy or superficial review may not lead to any meaningful change. As the criticality of the solution increases, the amount of rigor in your review may scale proportionally or even non-linearly.

In your application of the framework, you may find value in moving pillar by pillar, answering each question in series, or by crafting a subset of prioritized questions as a cross-section of all the pillars. It is also recommended and more common to review the framework between the steps of designing the solution and implementing it. This helps architects prevent failures and raise the security posture before investing time and resources in building the solution. For this book, we elected to save the review for the end to move quickly into the hands-on projects, recognizing that we are practicing in a safe, prototype environment.

The AWS Well-Architected Framework also includes extensions that are referred to as lenses. A lens is a collection of additional best practices related to a particular domain or type of solution, such as a SaaS application or an IoT solution. These lenses help architects within their domains to critically analyze their solutions, though, unlike the main body of the framework, the guidance within a lens doesn't broadly apply to all kinds of solutions. Our review in this chapter will use a mix of questions from the main body of the framework and the IoT Lens. Links to both resources are included in this chapter's References section. In the next section, we will review our solution using a subset of the questions posed by the AWS Well-Architected Framework.

Reviewing the solution

Before we perform a solution review, let's restate the problem, revisit the target solution, and reflect on what was built in this book. This will help us refresh our memory and contextualize the solution review using the Well-Architected Framework.

Reflecting upon the solution

Our fictional narrative had us working at Home Base Solutions as the IoT architect responsible for designing a new home appliance monitoring product. This product was a combination of a hub device that connects to consumers' home networks and interacts with paired appliance monitoring kits. These kits are attached to consumers' large appliances, such as furnaces or washing machines, and send telemetry data to the hub device. The hub device processes telemetry data, streams it to the cloud to train ML models, and hosts local inference workloads using new telemetry and the deployed models. The following diagram shows how these entities are related in consumers' homes:

Figure 10.1 – Reviewing the HBS smart hub product's design

Our target solution was to prototype the hub device on a Raspberry Pi to collect telemetry data and run the ML workloads, all while using a SenseHAT expansion module to collect sensor data and signal results visually to the LED matrix. We used AWS IoT Greengrass to deploy a runtime environment to the hub device that could install and run our code as components. These components encapsulated our business logic to collect sensor telemetry, route data through the edge and cloud, fetch resources from the cloud, and run our ML inference workloads.

We used Amazon SageMaker to train a new ML model in the cloud using the sensor telemetry that was acquired by the hub device and streamed it to the cloud as training data. This ML model was deployed to the edge to intelligently assess the health of our monitored appliance and signal to the consumer if any anomalous behavior is detected. Finally, we planned how to scale up our solution to a fleet of hub devices, their monitoring kits, and ML models, and how to operate this fleet in a production environment. The following diagram reviews our solution architecture:

Figure 10.2 – The original solution architecture diagram from Chapter 1, Introduction to the Data-Driven Edge with Machine Learning

With this brief review of our business objective and solution architecture to set the context, let's apply the AWS Well-Architected Framework to analyze our solution.

Applying the framework

The format we will use for the Well-Architected review is to state a question from the framework and then respond with an answer in the role of the HBS IoT architect. As a reminder, highlights from the framework were selected from the base material and the IoT Lens to drive interesting analysis for this chapter. There are more best practices to consider in the complete body of the framework.

Note

The following sections pull in questions from the AWS Well-Architected Framework and the IoT Lens extension. Questions labeled as OPS 4, for example, indicate that they are from the Well-Architected Framework. A question labeled as IOTOPS 4 indicates it is from the IoT Lens extension. This distinction is not relevant for this chapter but it identifies which source material the question was copied from.

Operational excellence

The operational excellence pillar reinforces thinking about how we operate the live solution. It organizes its guidance into four sub-areas: organization, preparation, operation, and evolution. This pillar stresses the importance of an organization's work culture and mechanisms for anticipating the inevitability of failure, reducing the influence of human error, and learning from mistakes. Now, let's review a selection of the questions from this pillar and some sample responses we might see as output from the architect.

OPS 4, OPS 8, and OPS 9 – How do you design your workload so that you can understand its state? How do you understand the health of your workload? How do you understand the health of your operations?

We will summarize a response to these three related questions from the operational excellence pillar. In this context, the workload means anything related to meeting our business objectives, such as informing customers of their failing appliances. This is different from operations, which refers to anything related to the technical implementation we use to operate the workload, such as the deployment mechanisms or tools we use to notify our team of the impact.

We have designed each level of our workload to report some kind of health state. Our workload can be defined at three levels, each with mechanisms for reporting its state so that we can automate monitoring and alerting. These three levels are the fleet of devices, the components running on a hub device, and the cloud pipeline of training ML models. At the fleet level, hub devices report the health of their deployments and connectivity status to the cloud with services such as AWS IoT Greengrass and Amazon CloudWatch. We can use services such as AWS IoT Device Management to monitor for devices in unhealthy states and take action against them. The components that are running on devices are monitored by the IoT Greengrass core software, and logs for each component can be shipped to the cloud for automated analysis. The ML training pipeline reports metrics on training accuracy so that we can measure the overall state of meeting our business objectives.

We will implement threshold alarms on critical failures, such as devices failing deployments and appliance monitoring kits losing connection to their hub devices. These enable us to proactively mitigate failures before they impact our customers, or reach out to customers to inform them of actions they can take to restore local operations.
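
As one concrete (and hypothetical) example of such an alarm, the following boto3 sketch raises an alert when devices report failed deployments through a custom metric; the namespace, metric name, and SNS topic are placeholders for whatever the fleet actually emits:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when any hub device reports failed deployments (hypothetical custom metric).
    cloudwatch.put_metric_alarm(
        AlarmName="hbs-hub-deployment-failures",
        Namespace="HBS/HubFleet",              # hypothetical namespace emitted by our components
        MetricName="FailedDeployments",
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-west-2:123456789012:hbs-operations-alerts"],  # hypothetical SNS topic
    )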

OPS 5 and OPS 6 – How do you reduce defects, ease remediation, and improve the flow into production? How do you mitigate deployment risks?

To reduce defects and mitigate deployment risks, we must include a physical copy of each target hardware profile running our solution in our testing and deployment pipeline. These devices will be the first to receive new deployments through Greengrass by specifying them as a separate group in AWS IoT Core. We can configure our CI/CD pipeline to create new deployments for that group and wait for these deployments to be reported as successful before advancing the deployment to the first wave of production devices.

We also get some out-of-the-box remediation value from Greengrass because, by default, it rolls back a failed deployment on a device to the previous working state. This helps minimize the downtime of production devices and immediately signals to our team that something is wrong with the deployment. Greengrass can also halt a deployment to a group of devices if a certain portion of them fail their deployment activity.
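
A rough sketch of the CI/CD gate described above might poll the health of the test device after a deployment before promoting the change; the device name is hypothetical, and a production pipeline would likely also inspect the deployment's per-device execution status:

    import time
    import boto3

    greengrass = boto3.client("greengrassv2")

    def wait_for_healthy_device(core_device_name, timeout_seconds=1800):
        """Poll a Greengrass core device until it reports HEALTHY after a deployment."""
        deadline = time.time() + timeout_seconds
        while time.time() < deadline:
            status = greengrass.get_core_device(coreDeviceThingName=core_device_name)["status"]
            if status == "HEALTHY":
                return True
            time.sleep(30)
        return False

    # Only advance to the first production wave if the test device reports healthy.
    if wait_for_healthy_device("hbs-hub-test-device-01"):  # hypothetical test device name
        print("Test device healthy; promote the deployment to the first production wave.")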

IOTOPS 3 – How are you ensuring that newly provisioned devices have the required operational prerequisites?

In our solution of using Greengrass, we know what the documented minimum requirements are for running the Greengrass software. We used the IoT Device Tester software to validate that our target hardware platform is compatible with Greengrass's requirements and can connect to the AWS service. We should use the IoT Device Tester software to validate any future hardware platforms that we want to use as HBS hub devices.

We should also calculate the necessary additional resources that are consumed by all of our components. For example, if we know that all of our total static resources will consume 1 GB on disk, we know we need at least that much, plus room for storing logs, temporary resources, and so on. Once we have calculated the minimum requirements for our solution, we can add a custom test to IoT Device Tester that can validate that each new hardware target is ready to run our solution.

Security

The security pillar reinforces thinking about how to maintain or raise your workload's security posture, such as protecting access to data and systems. It organizes best practices into the following sub-areas: Identity and Access Management, detection, infrastructure protection, data protection, and incident response. This pillar stresses clearly defining the resources and actors in your workload, the boundaries and access patterns between them, and the mechanisms for enforcing those boundaries.

SEC 2 and SEC 3 – How do you manage identities for people and machines? How do you manage permissions for people and machines?

The identities and permissions for people are managed by AWS Identity and Access Management (IAM). Our customers will log in to their management app using federated identities from OAuth providers such as Google or Facebook, or create new usernames directly with us using Amazon Cognito. We will tie Cognito identities to the devices they own and interact with using policies.

Identities and permissions for devices are managed by a combination of AWS IAM and AWS IoT Core. The device-to-cloud identity uses an X.509 private key and certificate registered with AWS IoT Core to establish MQTT connections. The device can exchange this certificate for temporary AWS credentials by calling the AWS IoT Core credentials provider. These temporary AWS credentials are tied to an IAM role that has policies attached to it to determine what the credentials are allowed to do with various AWS services. By using a unique private key on each device, the identity of a device cannot be spoofed by a malicious actor.
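
The certificate-for-credentials exchange is an HTTPS call to the AWS IoT credentials provider over mutual TLS. Here is a minimal sketch using the requests library; the endpoint prefix, role alias, thing name, and file paths are placeholders:

    import requests

    # Account-specific credentials provider endpoint and role alias (placeholders).
    ENDPOINT = "https://c2abcdefghijkl.credentials.iot.us-west-2.amazonaws.com"
    ROLE_ALIAS = "hbs-hub-token-exchange-role-alias"

    response = requests.get(
        f"{ENDPOINT}/role-aliases/{ROLE_ALIAS}/credentials",
        cert=("/greengrass/v2/thingCert.crt", "/greengrass/v2/privKey.key"),  # device certificate and key
        headers={"x-amzn-iot-thingname": "hbs-hub-device-01"},                # hypothetical thing name
        timeout=10,
    )
    response.raise_for_status()

    # The response contains temporary AWS credentials scoped by the IAM role behind the alias.
    credentials = response.json()["credentials"]
    print("Temporary access key:", credentials["accessKeyId"])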

IOTSEC 10 – How do you classify, manage, and protect your data in transit and at rest?

At the edge, we can classify data as either runtime data (derived from sensors and used to deliver business outcomes) or operational data (software and system logs). In our current design, we do not handle runtime and operational data any differently in terms of management or protection. Here, we have the opportunity to better safeguard any privacy-sensitive customer data, such as video feeds from connected cameras.

At the edge, any data that's in transit between the components of the Greengrass solution is not encrypted. We use the permissions model of Greengrass's components and interprocess communication (IPC) to protect access to data that's published over IPC. Data in transit between leaf devices and the Greengrass device using MQTT is encrypted over the network using the private key and certificate with mutual TLS.

At the edge, data at rest is not encrypted; instead, we rely on the permissions of the Unix filesystem to protect access to it. We must ensure we use proper user and group configurations to protect access to data at rest. Here, we have the opportunity to put a validation mechanism in place that alerts us if new system users or groups are created or modified, and to perform a security threat analysis each time we add a new component to the solution, checking that it has the proper access controls for data in place.

From the edge to the cloud, we should use mutual TLS to encrypt MQTT traffic in transit and AWS Signature Version 4 to sign any other requests exchanged with AWS APIs using the temporary credentials (that traffic is likewise encrypted with TLS). Data at rest that's stored in AWS services uses the encryption policies of each service. For example, data stored in Amazon Simple Storage Service (S3) can use server-side encryption with AWS-managed encryption keys.
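
For data at rest in S3, enabling server-side encryption can be as simple as one parameter on the upload call, as in this sketch with hypothetical bucket, key, and file names:

    import boto3

    s3 = boto3.client("s3")

    # Upload a batched telemetry file with server-side encryption enabled.
    with open("/tmp/telemetry-batch.json", "rb") as body:
        s3.put_object(
            Bucket="hbs-hub-telemetry-us-west-2",          # hypothetical bucket
            Key="hub-device-01/2022-01-15/telemetry.json", # hypothetical object key
            Body=body,
            ServerSideEncryption="aws:kms",                # or "AES256" for S3-managed keys
        )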

IOTSEC 11 – How do you prepare to respond to an incident that impacts a single device or a fleet of devices?

Our operations team has alarms set on the operational health metrics of the fleet of devices. For example, if a device fails a deployment, the operations team will receive an incident ticket as a notification from the alarm. If a group of devices fails a deployment, we will page our operations team for immediate triaging.

We will author a series of runbooks for anticipated failure events for our operations team to follow as a first response. The first step will be to define the minimum set of runbooks needed before we are comfortable with the first wave of production devices.

IOTSEC 8 – How are you planning the security life cycle of your IoT devices?

We will work with our ODM to document the security life cycle from the supply chain of parts, through assembly and delivery to our warehouse for inclusion in the retail packaging. It is important to us that parts such as the central processor, volatile and non-volatile memory, and the Trusted Platform Module (TPM), which houses the private key, are authentic and haven't been tampered with before they are assembled into our product.

All TPMs provided by the ODM for our devices will be associated with a CA that we will register in AWS IoT Core. We will pre-provision each device in the cloud so that the devices can simply connect using their protected credentials and not require any JIT registration process.

Should we identify any device as having a compromised identity, we will assess whether a certificate rotation activity is a sufficient mitigation. If not, we will revoke its certificate in AWS IoT Core to prevent it from exchanging further data and proactively reach out to the customer to start an exchange process.

Reliability

The reliability pillar reinforces that a workload should continue to operate as it was designed and when it is expected to. It organizes best practices into the following sub-areas: foundations, workload architecture, change management, and failure management. This pillar stresses concepts such as failover and healing mechanisms in response to failures, testing recovery scenarios, and monitoring for availability during steady-state operations and after deploying a change.

REL 3 – How do you design your workload service architecture?

We have designed our workload service architecture using a service-oriented architecture and implemented the principles of isolated, decoupled services. We use this architecture to make it easier to design, author, test, and ship code, as well as to minimize the impact that an isolated service experiencing faults will have on the solution. We codify this architecture design using the mechanisms defined by the core Greengrass software and its components.

REL 8 – How do you implement change?

For our edge solution, we use versioned components to incrementally update the software running on our devices through Greengrass deployments. We deploy changes to test devices before rolling those changes out to production devices. Deployments that fail on a device will be automatically rolled back. Deployments that fail on 10% of a fleet will halt the rollout to that fleet (see the sketch that follows).
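
The rollback and 10% abort behavior described above can be expressed directly on a Greengrass deployment through its deployment policies and the underlying IoT job configuration; this boto3 sketch uses hypothetical names and thresholds:

    import boto3

    greengrass = boto3.client("greengrassv2")

    greengrass.create_deployment(
        targetArn="arn:aws:iot:us-west-2:123456789012:thinggroup/hbs-hub-production",  # hypothetical group
        deploymentName="hbs-hub-production-wave-1",
        components={"com.hbs.hub.SenseHatPublisher": {"componentVersion": "1.2.0"}},   # hypothetical component
        deploymentPolicies={
            # Roll a device back to its previous state if its own deployment fails.
            "failureHandlingPolicy": "ROLLBACK"
        },
        iotJobConfiguration={
            "abortConfig": {
                "criteriaList": [
                    {
                        # Cancel the fleet-wide rollout if 10% of devices fail the deployment.
                        "failureType": "FAILED",
                        "action": "CANCEL",
                        "thresholdPercentage": 10,
                        "minNumberOfExecutedThings": 10,
                    }
                ]
            }
        },
    )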

For our cloud solutions, we use CloudFormation templates and stacks to provision cloud resources and make changes to them. We do not make any changes to the production infrastructure not authored through IaC mechanisms. These changes must be reviewed by a peer on the team before they can be deployed. We can use CloudWatch Metrics and Logs for our provisioned resources to monitor for any unhealthy statuses and roll back CloudFormation changes in the event of operational impact.

IOTREL 3 – How do you handle device reliability when communicating with the cloud?

Our edge ML solutions are designed to operate independently of the cloud. Some features are impacted during periods of network instability, such as publishing failure events to the cloud for customer push notifications to their mobile app. Events, telemetry data, and logs that are destined for the cloud are buffered locally and will eventually get to the cloud once network instability has been resolved. Data that is published to the cloud without receiving an acknowledgment will be retried, for example by using an MQTT quality of service of at least once (QoS 1).

When the cloud is trying to communicate with devices, such as when a new deployment is ready to be fetched, we use durable services such as Greengrass, which keep track of devices that are offline and haven't completed a pending deployment activity yet.

REL 11 and IOTREL 6 – How do you design your workload to withstand component failures? How do you verify different levels of hardware failure modes for your physical assets?

(In this case, component failure does not mean Greengrass components specifically.) Here, we use a service-oriented architecture to withstand component failures so that any of our custom services should be able to fail without bringing down the entire solution. For example, if the component that reads measurements from the temperature sensor fails, the hub device and edge solution will still be operational, albeit with less accuracy when it comes to detecting appliance anomalies.

There are some components provided by Greengrass that, if failing, could impact multiple outcomes in our solution, such as the IPC messaging bus. If components such as these fail, our custom components will not be able to publish new messages, and receiving components would stop getting new messages to work with. We should update our custom components that publish messages so that they can buffer messages locally wherever we cannot afford to drop them while IPC is unavailable (see the sketch below). We should also study the behavior of Greengrass and its ability to self-recover when a provided function such as IPC is impacted.
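
A simple buffering strategy for a publishing component could look like the following sketch, where the publish function stands in for whichever IPC or messaging call the component actually uses:

    import collections

    class BufferedPublisher:
        """Queue messages locally when the messaging layer is unavailable, then drain on recovery."""

        def __init__(self, publish_fn, max_buffered=1000):
            self._publish_fn = publish_fn          # e.g., a wrapper around the IPC publish call
            self._buffer = collections.deque(maxlen=max_buffered)  # oldest messages drop first when full

        def publish(self, topic, message):
            self._buffer.append((topic, message))
            self._drain()

        def _drain(self):
            while self._buffer:
                topic, message = self._buffer[0]
                try:
                    self._publish_fn(topic, message)
                except Exception:
                    # The transport is unavailable; keep the message and retry on the next publish or a timer.
                    return
                self._buffer.popleft()

    # Usage: wrap the real publish call; the backlog drains once the transport recovers.
    publisher = BufferedPublisher(publish_fn=lambda t, m: print("published", t, m))
    publisher.publish("sensors/vibration", {"value": 0.42})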

If any of our cyber-physical hardware interfaces fail, such as a sensor no longer being able to be read, we would stop seeing values being published over IPC and get error messages in the corresponding software component that uses the sensor. We may be able to triage events like these remotely using uploaded logs. If any of our compute, memory, disk, or network hardware components fail, the entire solution will likely be disabled and require on-premises triaging or the device being exchanged through our customer support program.

Performance efficiency

The performance efficiency pillar reinforces that we strike a balance between consumed resources and the available budget and that we continue to seek out efficiency gains as technology evolves. It organizes best practices into the following sub-areas: selection, review, monitoring, and tradeoffs. This pillar stresses delegating complex tasks for solved problems, planning for data to be at the right place at the right time, and reducing how much infrastructure your team must manage.

PERF 2 – How do you select your compute solution?

Concerning our ML model training needs, we will initially select compute instances on AWS based on our default settings and evaluate whether there are more cost-effective instance profiles to use in our training life cycle through trial and error. Since ML is a differentiator for our consumer product, we want to enable the ML model on our customers' devices within an appropriate service-level agreement (SLA), such as within one business day after accumulating enough training data to produce an accurate model. As we ramp up our production fleet, we may find value in batching training jobs to maximize the utilization of provisioned compute instances.

Concerning our target device hardware at the edge, we will measure the performance of our full production workload on the prototype device, such as a Raspberry Pi, and iterate toward a production hardware profile based on the overall utilization of the compute device. We want to leave some buffer room in the total utilization in case we deploy new workloads to devices as a future upgrade.

PERF 6 – How do you evolve your workload to take advantage of new releases?

We will monitor new releases from AWS for opportunities to bring in new managed Greengrass components that handle even more undifferentiated heavy lifting for our edge workload. We will also monitor new releases in the Amazon SageMaker and Amazon Elastic Compute Cloud (EC2) portfolios for opportunities to optimize our ML training pipeline.

PERF 7 – How do you monitor your resources to ensure they are performing?

We will use the managed component for enabling AWS IoT Device Defender to collect system-level metrics from each device, such as compute, memory, and disk utilization. We will monitor for anomalies and threshold breaches and act in response to any detected impacts.

IOTPERF 10 – How frequently is data transmitted from devices to your IoT application?

For high-priority business outcomes and operational alerts, such as informing a customer of a detected anomaly or alerting on a drop in sensor values, data will be transmitted from devices to the cloud as soon as it is available. For other classes of data, such as reporting component logs or sensor telemetry to use in a new ML training job, data can be transmitted in batches daily.

Cost optimization

The cost optimization pillar reinforces how to operate a solution that meets business needs at the lowest cost. It organizes best practices into the following sub-areas: financial management, usage awareness, cost-effective resources, managing demand and supply, and optimizing over time. This pillar stresses measuring the overall efficiency of your cloud expenditure, measuring return on investment to prioritize where next to optimize, and seeking implementation details that can lower costs without compromising on requirements.

COST 3 – How do you monitor usage and cost?

We will use a combination of Amazon CloudWatch for metrics and logs, as well as the AWS Billing console to monitor the usage and cost of consumed AWS services. The most significant source of cost is anticipated to be cloud compute instances for our ML training workloads. We will monitor the costs associated with each device for outliers where individual devices are consuming more in cloud costs than the fleet's average.

IOTCOST 1 and IOTCOST 3 – How do you select an approach for batch, enriched, and aggregate data that's delivered from your IoT platform to other services? How do you optimize the payload's size between devices and your IoT platform?

To capture sensor telemetry from our appliance monitoring kits, we will batch the telemetry data for daily transmission to the cloud, which will go directly to Amazon S3. This will dramatically lower the cost of the transmission compared to sending each payload as it is published by the sensor components. We do not have plans to further optimize the payload sizes for any operational messages that are exchanged between Greengrass devices and the cloud because we do not anticipate these messages to make up a significant expense.
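
A minimal sketch of that daily batching approach, assuming readings are appended locally as newline-delimited JSON and uploaded to a hypothetical bucket once per day, is shown here. In practice, the managed stream manager component offers similar batching and direct-to-S3 export without custom code.

    import datetime
    import json
    import boto3

    BATCH_FILE = "/greengrass/v2/work/telemetry-batch.jsonl"   # hypothetical local batch file

    def append_reading(reading):
        # Cheap local append; no network cost is incurred until the daily upload.
        with open(BATCH_FILE, "a") as f:
            f.write(json.dumps(reading) + "\n")

    def upload_daily_batch(device_name="hbs-hub-device-01"):   # hypothetical device name
        s3 = boto3.client("s3")
        key = f"{device_name}/{datetime.date.today().isoformat()}/telemetry.jsonl"
        s3.upload_file(BATCH_FILE, "hbs-hub-telemetry-us-west-2", key)  # hypothetical bucket
        open(BATCH_FILE, "w").close()  # truncate only after a successful upload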

That concludes our sample responses to the AWS Well-Architected review. Are there any responses you disagree with or would otherwise modify? The review process is a guideline and is not designed to contain right or wrong answers. It is up to you and your team of collaborators to define how complete the answers should be and whether or not you have action items as a result of the review. Questions that the team has no answer to or cannot articulate a detailed answer to are good opportunities to learn more about your architecture and anticipate problems before they surface in your solution. In the next section, we will provide some final coverage of the AWS features you may find useful but that did not fit in the scope of this book.

Diving deeper into AWS services

This book focused on a specific use case as a fictitious narrative to selectively highlight features available from AWS that can be used to deliver intelligent workloads to the edge. There is so much more you can achieve with AWS IoT Greengrass, the other services in the AWS IoT suite, the ML suite of services, and the rest of AWS than what we could cover in a single book.

In this section, we will point out a few more features and services that may be of interest to you as an architect in this space, as well as offer some ideas on how to extend the solution you've built so far to further your proficiency.

AWS IoT Greengrass

The Greengrass features we used in the solution represent a subset of the flexibility that Greengrass solutions can offer. You learned how to build with components, deploy software to the edge, fetch ML resources, and make use of built-in features for routing messages throughout the edge and the cloud. The components we used in the hub device prototype primarily downloaded Python code and launched long-lived applications that interacted with the IPC messaging bus. Components can be designed to run one-off programs per deployment, per device boot, or on a schedule. They can be designed to run services that act as dependencies for your other component software, or wait to start once other dependencies in your component graph have run successfully.

Your components can interact with the deployment life cycle by subscribing to notifications about deployment events. For example, your software can request a deferment until a safe milestone is met (such as draining a queue or writing in-memory records to disk), or signal to the Greengrass nucleus that it is now ready for an update.
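
As a rough sketch of that deferral pattern, assuming the IPC client v2 from the AWS IoT Device SDK for Python (exact method and field names may differ between SDK versions):

    # Assumes this runs inside a Greengrass component with the awsiotsdk package installed.
    from awsiot.greengrasscoreipc.clientv2 import GreengrassCoreIPCClientV2

    ipc = GreengrassCoreIPCClientV2()

    def work_still_in_flight():
        return False  # placeholder for real bookkeeping (e.g., queue depth, open file handles)

    def on_component_update(event):
        # Before a deployment restarts this component, ask the nucleus to re-check later
        # so we can finish flushing in-memory records to disk.
        pre_update = getattr(event, "pre_update_event", None)
        if pre_update is not None and work_still_in_flight():
            ipc.defer_component_update(
                deployment_id=pre_update.deployment_id,
                recheck_after_ms=30_000,   # ask the nucleus to check again in 30 seconds
            )

    ipc.subscribe_to_component_updates(on_stream_event=on_component_update)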

Components can signal to other components that they should pause or resume their functionality. For example, if a component responsible for a limited resource such as disk space or memory identifies a high utilization event, it could request that the components consuming those resources pause until the utilization comes back into the desired range.

Components can interact with each other's configuration by requesting the current configuration state, subscribing to further changes, or setting a new value for a component's configuration. Returning to the previous example, if a resource watchdog component didn't want to fully pause a consuming component, it could specify a new configuration value for the consuming component to write sampled values less frequently or enter a low-power state.

All three of the previously mentioned features work using Greengrass IPC and are simple applications of local messaging between your components and the Greengrass nucleus, which governs the component life cycle. There is a lot of utility in these features for your solution design, and they demonstrate how you can build systems for component interaction on top of IPC.

Here are a few more features of Greengrass that you should be aware of as you continue your journey as an edge ML solution architect. The documentation for Greengrass's features can be found online at https://docs.aws.amazon.com/greengrass:

  • Nucleus configuration: When installing the Greengrass core software on your device (or later, through deployments of updated configuration), you have several options to explore for optimizing consumed resources, network configuration, and how the device interacts with the cloud service. All of these have intelligent defaults to get you started, but your production implementations may need to include refinements of these per the results of your well-architected review!
  • Run Greengrass inside a Docker container: In this book's solution, we installed Greengrass as a service running on the Raspberry Pi. Greengrass can also be installed on a device running in its own Docker container. You may find this valuable for simplifying a custom installation across devices using IaC. This can also be used to ship your entire Greengrass solution as an isolated service as part of a grander solution architecture running on the device.
  • Run Docker containers as components: Your Docker containers can be imported into an edge ML solution without modification if you wrap them as a new component. Greengrass offers a managed component for interacting with Docker Engine running on the device. It can pull images down from Docker Hub and Amazon Elastic Container Registry. This can expedite your path to adopting Greengrass in existing workloads where your business logic is already defined in Docker containers.

Now, let's review a few more features in the wider suite of AWS IoT services that can power up your next project.

AWS IoT services

The suite of AWS IoT services covers use cases for device connectivity and control, managing a fleet at scale, detecting and mitigating security vulnerabilities, performing complex event detection, analyzing industrial IoT operations, and more. Greengrass is a model implementation of designing edge solutions on top of existing AWS IoT services and also natively integrates with them in powerful ways. Here are a few more features in the AWS IoT suite to take a look at when designing your next edge ML solution. The documentation for these features can be found at https://docs.aws.amazon.com/iot:

  • Secure tunneling: In Chapter 4, Extending the Cloud to the Edge, you uploaded system and component logs to Amazon CloudWatch to remotely triage your device. What happens if you need more information than what your logs are capturing or you need to run a command on the device but don't want to write a component just for that? With secure tunneling, you can signal your devices to establish a tunnel through the AWS cloud that your operators can use for SSH sessions, eliminating the need for inbound network connections to your device. Greengrass has a managed component to enable this feature on your device.
  • Fleet indexing: When using the shadow service of AWS IoT Core to synchronize the state of your Greengrass devices, leaf devices connected to hubs, and even your components, all of your shadows can be indexed for queries and dynamic groups using the fleet indexing service of AWS IoT Device Management. Fleet indexing makes it easy to search for shadow-backed entities in your solution for analysis and action. For example, you could create a dynamic group of devices reporting under 20% battery level and inform your remote technicians to prioritize battery replacements on those devices.
  • Device Defender: AWS IoT Device Defender is a service for automating the process of detecting and mitigating security vulnerabilities that can use ML to build a profile of your fleets' normal behavior. The service can inform your security team of devices operating in unusual ways, such as a spike in network traffic or disconnection events that could represent a malicious actor interfering with your device.

Now, let's review a few more services in the ML suite of AWS that add more intelligence to your workloads.

Machine learning services

The ML services of AWS span from tools that help developers train and deploy models to high-level artificial intelligence services that solve specific use cases. While the following services run exclusively in the AWS cloud today, they can help augment your AI edge solutions that can work with remote services:

  • Amazon Lex and Polly: You can build voice interfaces into your solutions using the same technology that powers Amazon's Alexa voice assistant. Lex is a service for building interactive experiences from voice and text inputs. Polly is a service for translating text to speech. You can use both to process audio requests from your devices and return a lifelike synthesized response.
  • Amazon Rekognition: You can augment an intelligent workload at the edge with deeper insights from the cloud. For example, your edge ML workload may use a simpler motion or object detection model to capture significant events as video clips, then send only these clips to the cloud for deeper analysis with a service such as Rekognition. This pattern of escalation can help you cut down on the resources that are needed at the edge and reduce the costs of operating ML workloads exclusively in the cloud.
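
The escalation pattern could be as small as the following sketch: the edge model decides when a frame is interesting, and only then is the cloud asked for richer analysis; the bucket, key, and confidence threshold are illustrative:

    import boto3

    rekognition = boto3.client("rekognition")

    def escalate_to_cloud(bucket, key):
        # Only called when the lightweight edge model has already flagged the clip or frame.
        response = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MaxLabels=10,
            MinConfidence=80,
        )
        return [label["Name"] for label in response["Labels"]]

    # Hypothetical flow: the edge model flagged motion, the frame was uploaded, now enrich it.
    labels = escalate_to_cloud("hbs-hub-events-us-west-2", "hub-device-01/frames/0001.jpg")
    print("Cloud analysis labels:", labels)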

Next, we will provide a few ideas on the next steps you could take to extend this book's solution.

Ideas for further proficiency

With your working solution of a hub device running a local ML workload, you have practiced using all the necessary tools to deploy intelligent workloads to the edge. You may already have a project in mind to apply the lessons you've learned from this book and reinforce what you've learned through practical application. If you are looking for inspiration on the next steps to take, we have compiled a few ideas for extending what you have already built as a means to develop further proficiency with this book's topics:

  • Add a new cyber-physical interface: The Raspberry Pi's GPIO interface, USB ports, and audio output jack offer a wide design space for extending the cyber-physical interfaces of your hub device. You could add a new sensor, author a custom component for reading values from it, stream values to the cloud, and train your ML model to detect anomalies in that feed. You could plug in a speaker and generate audio alerts in response to the object detection model we deployed in Chapter 4, Extending the Cloud to the Edge. You could go one step further and synthesize custom speech audio from Amazon Polly and play that over the speaker!
  • Build a prototype of the appliance monitoring kit: In this book, we used an onboard sensor from SenseHAT to acquire measurements as an approximation of the appliance monitoring kit. With two devices, you could provision one as the hub device and another as the monitoring kit, connecting the kit to the hub over MQTT, as demonstrated in Chapter 4, Extending the Cloud to the Edge. Remove the component from the hub device that emits sensor readings and write a new component that subscribes to an MQTT topic and forwards new messages from the monitoring kit to the same IPC topic, as defined in Chapter 3, Building the Edge. Your monitoring kit device doesn't need to deploy Greengrass; it can run an application similar to the one we saw in Chapter 4, Extending the Cloud to the Edge, for connecting to the Greengrass MQTT broker using the AWS IoT Device SDK.
  • Perform your own AWS Well-Architected review: The sample review that we provided in this chapter highlighted a few of the questions that are posed to architects and kept answers at a somewhat high-level analysis. As your next step, you could complete the rest of the review with questions that have not been included in this chapter and also take the opportunity to document your answers to the same questions, perhaps for your workload.

Do you have other suggestions for project extensions or want to show off what you've built? Feel free to engage with our community of readers and architects by opening a pull request or issue in this book's GitHub repository!

Summary

That's all we have for you! We, the authors, believe that these are the best techniques, practices, and tools you can use to continue your journey as an architect of edge ML solutions. While some of the tools are specific to AWS, everything else should generally serve you in building these kinds of solutions. Solutions built with AWS IoT Greengrass can just as easily include components that communicate with your web services or the services of cloud vendors such as Microsoft or Google. The guiding principle of this book was to prioritize teaching you how to build and how to think about building edge ML solutions over using specific tools.

As you take your next steps, whether they are extending this book's prototype hub device, starting a new solution, or modernizing an existing solution, we hope you find value in reflecting upon the lessons you've learned and critically thinking about the tradeoffs that help you reach your goals. We welcome your feedback on this book's content and the technical examples through the communication methods included in this book's GitHub repository. We are committed to the continued excellence of the examples and tutorials provided, recognizing that in the world of information technology, tools evolve, dependencies update, and code breaks.

We are sincerely grateful that you decided to invest your precious time in exploring the world of bringing intelligent workloads to the edge with our book. We wish you the best of luck in your future endeavors!

References

Take a look at the following resources for additional information on the concepts that were discussed in this chapter:

  • The AWS Well-Architected Framework: https://aws.amazon.com/architecture/well-architected/
  • The AWS Well-Architected Framework IoT Lens: https://docs.aws.amazon.com/wellarchitected/latest/iot-lens/welcome.html