THE AWS CERTIFIED SYSOPS ADMINISTRATOR - ASSOCIATE EXAM TOPICS COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:
This chapter covers high availability on AWS. In previous chapters, you were introduced to compute, networking, databases, and storage on AWS. What you haven’t been exposed to are some of the fully managed services that can help you deploy and maintain a highly available and scalable application.
What is high availability? Availability refers to the amount of time your system is in a functioning condition. In general terms, availability is expressed as 100 percent minus your system’s downtime percentage. Because events that disrupt availability are never entirely predictable, there are always further ways to make an application more available; improving availability typically increases cost, however. When considering how to make your environment more available, it’s important to balance the cost of the improvement against the benefit to your users. Does high availability mean that your application is always alive and reachable, or does it mean that the application is servicing requests within an acceptable level of performance? A common way to make an application both highly available and scalable is to decouple its components so that they scale independently and can survive small disruptions. See Table 10.1 for varying levels of high availability.
TABLE 10.1 Levels of High Availability
| Percent of Uptime | Max Downtime per Year | Equivalent Downtime per Day |
|---|---|---|
| 90% (1 nine) | 36.5 days | 2.4 hours |
| 99% (2 nines) | 3.65 days | 14 minutes |
| 99.9% (3 nines) | 8.76 hours | 86 seconds |
| 99.99% (4 nines) | 52.6 minutes | 8.6 seconds |
| 99.999% (5 nines) | 5.25 minutes | 0.86 seconds |
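The downtime figures in Table 10.1 follow directly from the uptime percentage. A quick sketch of the arithmetic (function and variable names are illustrative):

```python
def max_downtime(uptime_percent, period_seconds):
    """Maximum allowed downtime within a period for a given uptime percentage."""
    return period_seconds * (100.0 - uptime_percent) / 100.0

YEAR = 365 * 24 * 3600  # seconds in a non-leap year
DAY = 24 * 3600         # seconds in a day

# 99.9% ("three nines") allows roughly 8.76 hours of downtime per year...
hours_per_year = max_downtime(99.9, YEAR) / 3600
# ...and about 86 seconds per day
seconds_per_day = max_downtime(99.9, DAY)
```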
In this chapter, you will be introduced to Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Notification Service (Amazon SNS), which can be used to decouple applications and retain transactions. In addition to decoupling tiers within an application, these services can help deliver high availability and fault tolerance to your application. This chapter will also cover additional strategies to keep your infrastructure highly available, redundant, and fault tolerant.
Amazon Simple Queue Service (Amazon SQS) is a web service that gives you access to message queues that store messages waiting to be processed. With Amazon SQS, you can quickly build message queuing applications that can run on any computer. You can use the service to move data between diverse, distributed application components without losing messages and without requiring each component always to be available.
Amazon SQS can help you build a distributed application with decoupled components by working closely with the Amazon Elastic Compute Cloud (Amazon EC2) and other AWS infrastructure services. You can access the service via the Amazon SQS console, the AWS Command Line Interface (AWS CLI), a generic web services Application Programming Interface (API), and any programming language that the AWS Software Development Kit (SDK) supports. Amazon SQS supports both standard and First-In, First-Out (FIFO) queues.
Think of a queue as a temporary repository for messages that are awaiting processing. Using Amazon SQS, you can decouple the components of an application so that they run independently of each other, with Amazon SQS easing message management between components. The queue acts as a buffer between the component producing and saving data and the component receiving the data for processing. This means that the queue resolves issues that arise if the producer (for example, a web front end) is producing work faster than the consumer (such as the application worker) can process it or if the producer or consumer is only intermittently connected to the network.
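With the AWS SDK for Python (boto3), this producer/consumer pattern reduces to three calls: send_message on the producer side, and receive_message plus delete_message on the consumer side. A minimal sketch, with the client passed in as a parameter (the queue URL and boto3 wiring shown in the comment are assumptions):

```python
def produce(sqs, queue_url, body):
    """Producer side: enqueue one unit of work."""
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)

def consume_one(sqs, queue_url, handler):
    """Consumer side: receive up to one message, process it, then delete it.

    Deleting only after the handler succeeds means a crash mid-processing
    leaves the message in the queue to be redelivered later.
    """
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               WaitTimeSeconds=10)  # long poll
    for msg in resp.get("Messages", []):
        handler(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])

# In a real deployment the client would come from boto3:
#   import boto3
#   sqs = boto3.client("sqs")
```

Passing the client in keeps the loop testable and makes it easy to swap in a stub during development.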
Amazon SQS is designed to deliver your message at least once and supports multiple readers and writers interacting with the same queue. A single queue can be used simultaneously by many distributed application components with no need for those components to coordinate with each other to share the queue.
Amazon SQS is engineered always to be available and deliver messages. This is achieved by the system being distributed between multiple machines and multiple facilities. Due to this highly distributed architecture, there is a trade-off—when using standard queues, Amazon SQS does not guarantee FIFO delivery of messages. This may be okay for many distributed applications, as long as each message can stand on its own and as long as all messages are delivered. In that scenario, the order is not important. If your system requires that order be preserved, you can place sequencing information in each message so that you can reorder the messages when the queue returns them.
You can have as many queues with as many messages as you like in the Amazon SQS system. A queue can be empty if you haven’t sent any messages to it, or if you have deleted all of the messages from it.
You assign a name to each of your queues. You can get a list of all of your queues or a subset of your queues that share the same initial characters in their names (for example, you could get a list of all of your queues whose names start with Q3).
Use Cases for Amazon SQS
You can delete a queue at any time, whether it is empty or not. Amazon SQS queues retain messages for a set period of time. The default retention period is four days; however, you can configure a queue to retain messages for up to 14 days.
You can achieve loose coupling of systems by using queues between systems and exchanging messages that transfer jobs. This enables asynchronous linking of systems. This method lets you increase the number of virtual servers that receive and process the messages in parallel. If there are no messages left to process, you can configure Auto Scaling to terminate the excess servers. See Figure 10.1 for an example of a queuing chain pattern.
Although you can use this pattern without cloud technology, the queue itself is provided as an AWS Cloud service (Amazon SQS), which makes it easier for you to use this pattern.
Taking image processing as an example, the sequential operations of uploading, storing, and encoding the image, creating a thumbnail, and copyrighting are tightly linked. This tight linkage complicates the recovery operations when there has been a failure.
How This Benefits Your Application
A standard queue makes a best effort to preserve order in messages, but due to the distributed nature of the queue, AWS cannot guarantee that you will receive messages in the exact order that you sent them. If your system requires that order be preserved, we recommend using a FIFO queue.
Amazon SQS stores copies of your messages on multiple servers for redundancy and high availability. On rare occasions, one of the servers storing a copy of a message might be unavailable when you receive or delete the message. If that occurs, the copy of the message will not be deleted on that unavailable server, and you might get that message copy again when you receive messages. Because of this, you must design your application to be idempotent (that is, it must not be adversely affected if it processes the same message more than once). In short, Amazon SQS will deliver a message at least once, and may in some cases deliver the message again. For more information, refer to the visibility timeout discussion later in this chapter.
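Because standard queues deliver at least once, the consumer must tolerate duplicates. One common approach, sketched here with an in-memory set (a real worker fleet would use a shared store such as a DynamoDB table with a conditional write), is to record each message ID as it is processed and skip IDs already seen:

```python
def process_if_new(message_id, body, seen_ids, handler):
    """Run the handler only if this message ID hasn't been handled before.

    Returns True if the handler ran, False if the message was a duplicate
    delivery (which is then safe to delete without reprocessing).
    """
    if message_id in seen_ids:
        return False
    handler(body)
    seen_ids.add(message_id)  # record only after successful processing
    return True
```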
When you retrieve messages from the queue, Amazon SQS samples a subset of the servers and returns messages from just those servers. Amazon SQS supports two types of polling methods:
Figure 10.2 shows an example of a short poll of the queue and messages being returned after a receive request. Amazon SQS samples several of the servers (in gray) and returns the messages from those servers (Messages A, C, D, and B). Message E is not returned to this particular request, but it would be returned to a subsequent request.
Visibility timeout is the period of time that a message is invisible to the rest of your application after an application component gets it from the queue. During the visibility timeout, the component that received the message usually processes it and then deletes it from the queue. This prevents multiple components from processing the same message.
Here is how it works:
When the application needs more time for processing, the visibility timeout can be changed dynamically via the ChangeMessageVisibility operation.
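A consumer that discovers mid-processing that it needs more time can extend the timeout with the ChangeMessageVisibility operation. A minimal sketch with the client injected (the queue URL and receipt handle in the comment are placeholders):

```python
def extend_visibility(sqs, queue_url, receipt_handle, seconds):
    """Give this consumer more time before the message reappears in the queue."""
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=seconds,  # allowed range is 0 to 43200 (12 hours)
    )

# With a real client:
#   import boto3
#   extend_visibility(boto3.client("sqs"), queue_url, receipt_handle, 300)
```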
Amazon SQS offers standard queues as the default queue type. A standard queue allows a nearly unlimited number of transactions per second. Standard queues support at-least-once message delivery; occasionally, because of the highly distributed architecture, more than one copy of a message might be delivered, and messages might arrive out of order. Standard queues therefore provide best-effort ordering, in contrast to a First-In, First-Out (FIFO) queue, which ensures that messages are delivered in the same order as they’re sent.
You can use standard message queues in many scenarios, as long as your application can process messages that arrive more than once and out of order. For example:
A First-In, First-Out (FIFO) queue has all of the capabilities of the standard queue. FIFO queues are designed to enhance messaging between applications when the order of operations and events is critical or where duplicates can’t be tolerated. FIFO queues also provide exactly-once processing and are limited to 300 Transactions Per Second (TPS).
FIFO queues are a good fit when the order of operations and events is critical. For example:
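Creating a FIFO queue and sending to it differs from the standard case in a few parameters: the queue name must end in .fifo, the FifoQueue attribute must be set, and each message needs a MessageGroupId (the ordering scope) and, unless content-based deduplication is enabled, a MessageDeduplicationId. A sketch of the request parameters, with the boto3 calls themselves shown in hedged comments (queue and group names are illustrative):

```python
def fifo_queue_params(name):
    """Parameters for creating a FIFO queue; the name must end in .fifo."""
    assert name.endswith(".fifo")
    return {
        "QueueName": name,
        "Attributes": {
            "FifoQueue": "true",
            # Deduplicate by a hash of the body, so no explicit
            # MessageDeduplicationId is needed per message:
            "ContentBasedDeduplication": "true",
        },
    }

def fifo_message_params(queue_url, body, group_id):
    """Parameters for sending one message to a FIFO queue."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": group_id,  # strict ordering is per message group
    }

# With a real client:
#   import boto3
#   sqs = boto3.client("sqs")
#   url = sqs.create_queue(**fifo_queue_params("orders.fifo"))["QueueUrl"]
#   sqs.send_message(**fifo_message_params(url, "order-1", "customer-42"))
```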
A Dead Letter Queue (DLQ) is an Amazon SQS queue that you configure to receive messages from other Amazon SQS queues, referred to as “source queues.” Typically, you set up a DLQ to receive messages after a maximum number of processing attempts has been reached. A DLQ provides the ability to isolate messages that could not be processed.
A DLQ is just like any other Amazon SQS queue: messages can be sent to it and received from it in the same way. You can create a DLQ using the Amazon SQS API or from the Amazon SQS console in the AWS Management Console.
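Attaching a DLQ to a source queue is done through the source queue's RedrivePolicy attribute, which names the DLQ's ARN and the number of receives allowed before a message is moved. A sketch of building that attribute (the ARN and count are illustrative):

```python
import json

def redrive_policy(dlq_arn, max_receive_count=5):
    """RedrivePolicy attribute value, set on the *source* queue."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        # After this many receives without deletion, the message moves to the DLQ:
        "maxReceiveCount": str(max_receive_count),
    })

# With a real client:
#   import boto3
#   boto3.client("sqs").set_queue_attributes(
#       QueueUrl=source_queue_url,
#       Attributes={"RedrivePolicy": redrive_policy(
#           "arn:aws:sqs:us-east-1:123456789012:my-dlq")})
```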
A developer associates an access policy statement (specifying the permissions being granted) with the queue to be shared. Amazon SQS provides APIs to create and manage the access policy statements: AddPermission, RemovePermission, SetQueueAttributes, and GetQueueAttributes. Refer to the latest API specification for more details.
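The AddPermission API grants another AWS account access to specific queue actions. A sketch of the request parameters (the account ID and label are placeholders):

```python
def share_queue_params(queue_url, other_account_id, label="CrossAccountSend"):
    """Parameters for sqs.add_permission granting SendMessage to another account."""
    return {
        "QueueUrl": queue_url,
        "Label": label,                  # unique name for this permission statement
        "AWSAccountIds": [other_account_id],
        "Actions": ["SendMessage"],      # could also grant ReceiveMessage, etc.
    }

# With a real client:
#   import boto3
#   boto3.client("sqs").add_permission(
#       **share_queue_params(queue_url, "111122223333"))
```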
Amazon SQS in each region is entirely independent, with its own message stores and queue names; therefore, messages cannot be shared between queues in different regions.
Queues can be shared in the following ways:
Characteristics of Amazon SQS
Amazon Simple Notification Service (Amazon SNS) is a web service that coordinates and manages the delivery or sending of messages to subscribing endpoints or clients. In Amazon SNS, there are two types of clients—publishers and subscribers, which are also referred to as producers and consumers. Publishers communicate asynchronously with subscribers by producing and sending a message to a topic, which is a logical access point and communication channel. Subscribers (for example, web servers, email addresses, Amazon SQS queues, and AWS Lambda functions) consume or receive the message or notification over one of the supported protocols (such as Amazon SQS, HTTP/S, email, Short Message Service [SMS], or AWS Lambda) when they are subscribed to the topic. See Figure 10.3 for details on the supported protocols.
When using Amazon SNS, you create a topic and control access to it by defining resource policies that determine which publishers and subscribers can communicate with the topic. Refer to Chapter 3, “Security and AWS Identity and Access Management (IAM),” for more details on resource policies.
A publisher sends messages to topics that they have created or to topics to which they have permission to publish. Instead of including a specific destination address in each message, a publisher sends a message to the topic. Amazon SNS matches the topic to a list of subscribers that have subscribed to that topic and delivers the message to each of those subscribers. Each topic has a unique name that identifies the Amazon SNS endpoint for publishers to post messages and subscribers to register for notifications. Subscribers receive all messages published to the topics to which they subscribe, and all subscribers to a topic receive the same messages. The service is accessed via the Amazon SNS console, the AWS CLI, a generic web services API, and any programming language that the AWS SDK supports.
Amazon SNS lets you push messages to mobile devices or distributed services via API or an easy-to-use management console. You can seamlessly scale from a handful of messages per day to millions of messages or more.
With Amazon SNS you can publish a message once and deliver it one or more times. You can choose to direct unique messages to individual Apple, Android, or Amazon devices or broadcast deliveries to many mobile devices with a single publish request.
Amazon SNS allows you to group multiple recipients using topics. A topic is an “access point” for allowing recipients to subscribe dynamically to identical copies of the same notification. One topic can support deliveries to multiple endpoint types; for example, you can group together iOS, Android, and SMS recipients. When you publish once to a topic, Amazon SNS delivers appropriately formatted copies of your message to each subscriber.
In a fan-out scenario, an Amazon SNS message is sent to a topic and then replicated and pushed to multiple Amazon SQS queues, HTTP/S endpoints, or email addresses. This allows for parallel asynchronous processing. For example, you could develop an application that sends an Amazon SNS message to a topic whenever a change in a stock price occurs. Then the Amazon SQS queues that are subscribed to that topic would receive identical notifications for the new stock price. The Amazon EC2 server instance attached to one of the queues could handle the trade reporting, while the other server instance could be attached to a data warehouse for analysis and the trending of the stock price over a specified period of time. Figure 10.4 describes the fan-out scenario.
To have Amazon SNS deliver notifications to an Amazon SQS queue, a developer should subscribe to a topic specifying “Amazon SQS” as the transport and a valid Amazon SQS queue as the endpoint. In order to allow the Amazon SQS queue to receive notifications from Amazon SNS, the Amazon SQS queue owner must subscribe the Amazon SQS queue to the topic for Amazon SNS. If the user owns both the Amazon SNS topic being subscribed to and the Amazon SQS queue receiving the notifications, nothing further is required. Any message published to the topic will automatically be delivered to the specified Amazon SQS queue. If the owner of the Amazon SQS queue is not the owner of the topic, Amazon SNS will require an explicit confirmation to the subscription request.
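Wiring the fan-out takes three steps with boto3: create the topic, subscribe each queue by ARN with the sqs protocol, and, when the queue owner is not the topic owner, confirm the subscription. A sketch of the subscription parameters (topic name and queue ARNs in the comments are placeholders):

```python
def sqs_subscription_params(topic_arn, queue_arn):
    """Parameters for subscribing an Amazon SQS queue to an Amazon SNS topic."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "sqs",      # delivery transport
        "Endpoint": queue_arn,  # the queue's ARN, not its URL
    }

# With a real client, following the stock-price fan-out example:
#   import boto3
#   sns = boto3.client("sns")
#   topic_arn = sns.create_topic(Name="stock-price-changes")["TopicArn"]
#   sns.subscribe(**sqs_subscription_params(topic_arn, trade_queue_arn))
#   sns.subscribe(**sqs_subscription_params(topic_arn, warehouse_queue_arn))
#   sns.publish(TopicArn=topic_arn, Message='{"ticker": "AMZN", "price": 101.5}')
```

Both queues then receive identical copies of each published message, which is the fan-out behavior shown in Figure 10.4.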
Characteristics of Amazon SNS
In Chapter 5, “Networking,” you learned that the boundary of an Amazon Virtual Private Cloud (Amazon VPC) is at the region, and that regions comprise Availability Zones. In Chapter 4, “Compute,” you learned how Amazon EC2 instances are bound to the Availability Zone. What happens if a natural disaster takes out your Amazon EC2 instance running in an affected Availability Zone? What if your application goes viral and starts taking on more traffic than expected? How do you continue to serve your users?
So far in this chapter, you have learned that services like Amazon SNS and Amazon SQS are highly available, fault-tolerant systems, which can be used to decouple your application. Chapter 7, “Databases,” introduced Amazon DynamoDB and Amazon ElastiCache as NoSQL databases. Now we are going to put all these parts together to deliver a highly available architecture.
Figure 10.5 illustrates a highly available three-tier architecture (the concept of which was discussed in Chapter 2, “Working with AWS Cloud Services”).
Network Address Translation (NAT) gateways were covered in Chapter 5. NAT servers allow traffic from private subnets to traverse the Internet or connect to other AWS Cloud services. Individual NAT servers can be a single point of failure. The NAT gateway is a managed device, and each NAT gateway is created in a specific Availability Zone and implemented with redundancy in that Availability Zone. To achieve optimum availability, use NAT gateways in each Availability Zone.
As discussed in Chapter 5, Elastic Load Balancing comes in two types: Application Load Balancer and Classic Load Balancer. Elastic Load Balancing allows you to decouple an application’s web tier (or front end) from the application tier (or back end). In the event of a node failure, the Elastic Load Balancing load balancer will stop sending traffic to the affected Amazon EC2 instance, thus keeping the application available, although in a degraded state. Self-healing can be achieved by using Auto Scaling. For a refresher on Elastic Load Balancing, refer to Chapter 5.
Auto Scaling is a web service designed to launch or terminate Amazon EC2 instances automatically based on user-defined policies, schedules, and health checks. Application Auto Scaling automatically scales supported AWS Cloud services with an experience similar to Auto Scaling for Amazon EC2 resources. Application Auto Scaling works with Amazon EC2 Container Service (Amazon ECS) and will not be covered in this guide.
Auto Scaling helps to ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You create collections of Amazon EC2 instances, called Auto Scaling groups. You can specify the minimum number of instances in each Auto Scaling group, and Auto Scaling ensures that your group never goes below this size. Likewise, you can specify the maximum number of instances in each Auto Scaling group, and Auto Scaling ensures that your group never goes above this size. If you specify the desired capacity, either when you create the group or at any time thereafter, Auto Scaling ensures that your group has that many instances. If you specify scaling policies, then Auto Scaling can launch or terminate instances on demand as your application needs increase or decrease.
For example, the Auto Scaling group shown in Figure 10.6 has a minimum size of one instance, a desired capacity of two instances, and a maximum size of four instances. The scaling policies that you define adjust the number of instances, within your minimum and maximum number of instances, based on the criteria that you specify.
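The group in Figure 10.6 maps directly onto the create_auto_scaling_group parameters. A sketch (group name, launch configuration, AMI, and subnet IDs are placeholders):

```python
def asg_params(name, launch_config, subnet_ids,
               min_size=1, desired=2, max_size=4):
    """Parameters mirroring the Figure 10.6 group: min 1, desired 2, max 4."""
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config,
        "MinSize": min_size,        # Auto Scaling never goes below this
        "DesiredCapacity": desired, # the steady-state target
        "MaxSize": max_size,        # and never above this
        "VPCZoneIdentifier": ",".join(subnet_ids),  # spread across AZs
    }

# With a real client:
#   import boto3
#   autoscaling = boto3.client("autoscaling")
#   autoscaling.create_launch_configuration(
#       LaunchConfigurationName="web-lc",
#       ImageId="ami-0123456789abcdef0", InstanceType="t3.micro")
#   autoscaling.create_auto_scaling_group(
#       **asg_params("web-asg", "web-lc", ["subnet-aaaa", "subnet-bbbb"]))
```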
Auto Scaling is set in motion by Amazon CloudWatch. Amazon CloudWatch is covered in Chapter 9, “Monitoring and Metrics.”
Auto Scaling Components
Launch configuration Launch configuration defines how Auto Scaling should launch your Amazon EC2 instances. Auto Scaling provides you with an option to create a new launch configuration using the attributes from an existing Amazon EC2 instance. When you use this option, Auto Scaling copies the attributes from the specified instance into a template from which you can launch one or more Auto Scaling groups.
Auto Scaling group Your Auto Scaling group uses a launch configuration to launch Amazon EC2 instances. You create the launch configuration by providing information about the image that you want Auto Scaling to use to launch Amazon EC2 instances. The information can be the image ID, instance type, key pairs, security groups, and block device mapping.
Auto Scaling policy An Auto Scaling group uses a combination of policies and alarms to determine when the specified conditions for launching and terminating instances are met. An alarm is an object that watches over a single metric (for example, the average CPU utilization of your Amazon EC2 instances in an Auto Scaling group) over a time period that you specify. When the value of the metric breaches the thresholds that you define over a number of time periods that you specify, the alarm performs one or more actions. An action can be sending messages to Auto Scaling. A policy is a set of instructions for Auto Scaling that tells the service how to respond to alarm messages. These alarms are defined in Amazon CloudWatch (refer to Chapter 9 for more details).
Scheduled action Scheduled action is scaling based on a schedule, allowing you to scale your application in response to predictable load changes. To configure your Auto Scaling group to scale based on a schedule, you need to create scheduled actions. A scheduled action tells Auto Scaling to perform a scaling action at a certain time in the future. To create a scheduled scaling action, you specify the start time at which you want the scaling action to take effect, along with the new minimum, maximum, and desired size you want for that group at that time. At the specified time, Auto Scaling will update the group to set the new values for minimum, maximum, and desired sizes, as specified by your scaling action.
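A scheduled action is created with the PutScheduledUpdateGroupAction API, naming the group, a start time, and the new sizes. A sketch of the parameters (the group name, action name, and times are illustrative):

```python
from datetime import datetime, timezone

def scheduled_action_params(group, name, start, min_size, desired, max_size):
    """Parameters for autoscaling.put_scheduled_update_group_action."""
    return {
        "AutoScalingGroupName": group,
        "ScheduledActionName": name,
        "StartTime": start,          # when the new sizes take effect
        "MinSize": min_size,
        "DesiredCapacity": desired,
        "MaxSize": max_size,
    }

# Scale up ahead of a predictable Monday-morning load spike:
#   import boto3
#   boto3.client("autoscaling").put_scheduled_update_group_action(
#       **scheduled_action_params(
#           "web-asg", "monday-scale-up",
#           datetime(2025, 1, 6, 8, 0, tzinfo=timezone.utc),
#           min_size=2, desired=6, max_size=10))
```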
In order to scale out or back in, your application needs to be stateless. Consider storing session-related information off of the instance. Amazon DynamoDB or Amazon ElastiCache can be used for session state management. For more information on Amazon ElastiCache and Amazon DynamoDB, refer to Chapter 7.
Auto Recovery is an Amazon EC2 feature that is designed to increase instance availability. It allows you to recover supported instances automatically when a system impairment is detected.
To use Auto Recovery, create an Amazon CloudWatch Alarm that monitors an Amazon EC2 instance and automatically recovers the instance if it becomes impaired due to an underlying hardware failure or a problem that requires AWS involvement to repair. Terminated instances cannot be recovered. A recovered instance is identical to the original instance, including the instance ID, private IP addresses, elastic IP addresses, and all instance metadata.
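The recovery alarm watches the StatusCheckFailed_System metric and lists both the EC2 recover action and an Amazon SNS topic in its alarm actions. A sketch of the PutMetricAlarm parameters (the instance ID, region, topic ARN, and evaluation settings are illustrative):

```python
def recover_alarm_params(instance_id, region, sns_topic_arn):
    """Parameters for cloudwatch.put_metric_alarm wiring up Auto Recovery."""
    return {
        "AlarmName": f"recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 2,   # two consecutive failed one-minute checks
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [
            f"arn:aws:automate:{region}:ec2:recover",  # the recover action
            sns_topic_arn,                             # notify subscribers
        ],
    }

# With a real client:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **recover_alarm_params("i-0123456789abcdef0", "us-east-1", topic_arn))
```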
When the StatusCheckFailed_System alarm is triggered and the recover action is initiated, you will be notified by the Amazon SNS topic that you selected when you created the alarm and associated the recover action. During instance recovery, the instance is migrated during an instance reboot, and any data that is in-memory is lost. When the process is complete, information is published to the Amazon SNS topic that you configured for the alarm. Anyone who is subscribed to this Amazon SNS topic will receive an email notification that includes the status of the recovery attempt and any further instructions. You will notice an instance reboot on the recovered instance.
Examples of problems that cause system status checks to fail include the following:
The recover action can also be triggered when an instance is scheduled by AWS to stop or retire due to degradation of the underlying hardware.
In Chapter 7, we covered how to achieve high availability of your database by deploying Amazon Relational Database Service (Amazon RDS) across multiple Availability Zones. You can also improve the performance of a read-heavy database by using Read Replicas to scale your database horizontally. Amazon RDS MySQL, PostgreSQL, and MariaDB can have up to 5 Read Replicas, and Amazon Aurora can have up to 15 Read Replicas.
You can also place your Read Replica in a different AWS Region closer to your users for better performance. Additionally, you can use Read Replicas to increase the availability of your database by promoting a Read Replica to a master for faster recovery in the event of a disaster. Read Replicas are not a replacement for the high availability and automatic failover capabilities that Multi-AZ architectures provide, however. For a refresher on Amazon RDS high availability and Read Replicas, refer to Chapter 7.
In addition to building a highly available application that runs in a single region, your application may require regional fault tolerance. This can be delivered by placing and running infrastructure in another region and then using Amazon Route 53 to load balance the traffic between the regions.
If you need to keep Amazon Simple Storage Service (Amazon S3) data in multiple regions, you can use cross-region replication. Cross-region replication is a bucket-level feature that enables automatic, asynchronous copying of objects across buckets in different AWS Regions. More information on Amazon S3 and cross-region replication is available in Chapter 6, “Storage Systems.”
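Cross-region replication is enabled with a replication configuration on the source bucket: an IAM role that Amazon S3 assumes to copy objects, plus one or more rules naming the destination bucket. A sketch using the original (prefix-based) rule schema; the role and bucket ARNs are placeholders, and versioning must already be enabled on both buckets:

```python
def replication_config(role_arn, dest_bucket_arn):
    """ReplicationConfiguration for s3.put_bucket_replication on the source bucket."""
    return {
        "Role": role_arn,  # role S3 assumes to replicate on your behalf
        "Rules": [{
            "ID": "replicate-everything",
            "Status": "Enabled",
            "Prefix": "",  # empty prefix = replicate all objects
            "Destination": {"Bucket": dest_bucket_arn},
        }],
    }

# With a real client:
#   import boto3
#   boto3.client("s3").put_bucket_replication(
#       Bucket="source-bucket",
#       ReplicationConfiguration=replication_config(
#           "arn:aws:iam::123456789012:role/crr-role",
#           "arn:aws:s3:::destination-bucket"))
```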
Amazon DynamoDB uses DynamoDB Streams to replicate data between regions. An application in one AWS Region modifies the data in an Amazon DynamoDB table. A second application in another AWS Region reads these data modifications and writes the data to another table, creating a replica that stays in sync with the original table.
When you have more than one resource performing the same function (for example, more than one HTTP/S server or mail server), you can configure Amazon Route 53 to check the health of your resources and respond to Domain Name System (DNS) queries using only the healthy resources. For example, suppose your website, Example.com, is hosted on 10 servers, 2 each in 5 regions around the world. You can configure Amazon Route 53 to check the health of those servers and to respond to DNS queries for Example.com using only the servers that are currently healthy.
You can set up a variety of failover configurations using Amazon Route 53 alias, weighted, latency, geolocation routing, and failover resource record sets.
Active-active failover Use this failover configuration when you want all of your resources to be available the majority of the time. When a resource becomes unavailable, Amazon Route 53 can detect that it is unhealthy and stop including it when responding to queries.
Active-passive failover Use this failover configuration when you want a primary group of resources to be available the majority of the time and a secondary group of resources to be on standby in case all of the primary resources become unavailable. When responding to queries, Amazon Route 53 includes only the healthy primary resources. If all of the primary resources are unhealthy, Amazon Route 53 begins to include only the healthy secondary resources in response to DNS queries.
Active-active-passive and other mixed configurations You can combine alias and non-alias resource record sets to produce a variety of Amazon Route 53 behaviors. More information on record types can be found in Chapter 5.
In order for these failover configurations to work, health checks will need to be configured. There are three types of health checks: health checks that monitor an endpoint, health checks that monitor Amazon CloudWatch Alarms, and health checks that monitor other health checks.
The following sections discuss simple and complex failover configurations.
The simplest failover configuration of having two or more resources performing the same function can benefit from health checks. For example, you might have multiple Amazon EC2 servers running HTTP server software responding to requests for your Example.com website. In Amazon Route 53, you create a group of resource record sets that have the same name and type, such as weighted resource record sets or latency resource record sets of type A. You create one resource record set for each resource, and you configure Amazon Route 53 to check the health of the corresponding resource. In this configuration, Amazon Route 53 chooses which resource record set will respond to a DNS query for Example.com and bases the choice in part on the health of your resources.
As long as all of the resources are healthy, Amazon Route 53 responds to queries using all of your Example.com weighted resource record sets. When a resource becomes unhealthy, Amazon Route 53 responds to queries using only the healthy resource record sets for Example.com.
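Each weighted record in such a group carries a weight and, for health-aware answers, a health check ID. A sketch of one record-set change for the configuration above (the hosted zone ID, IP addresses, and health check IDs in the comments are placeholders):

```python
def weighted_a_record(name, ip, weight, set_id, health_check_id):
    """One weighted A record; Route 53 skips it while its health check is failing."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,   # distinguishes records sharing name and type
            "Weight": weight,          # relative share of DNS answers
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
            "HealthCheckId": health_check_id,
        },
    }

# With a real client:
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="Z1EXAMPLE",
#       ChangeBatch={"Changes": [
#           weighted_a_record("example.com.", "192.0.2.10", 10, "server-1", "hc-1"),
#           weighted_a_record("example.com.", "192.0.2.20", 10, "server-2", "hc-2"),
#       ]})
```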
Following are the steps for how you configure Amazon Route 53 to check the health of your resources in this simple configuration and how Amazon Route 53 responds to queries based on the health of your resources.
The following example shows a group of weighted resource record sets in which the third resource record set is unhealthy. Initially, Amazon Route 53 selects a resource record set based on the weights of all three resource record sets. If it happens to select the unhealthy resource record set the first time, Amazon Route 53 selects another resource record set, but this time it omits the weight of the third resource record set from the calculation. See Figure 10.7 for an example of an unhealthy health check.
If you omit a health check from one or more resource record sets in a group of resource record sets, Amazon Route 53 treats those resource record sets as healthy. Amazon Route 53 has no basis for determining the health of the corresponding resource without an assigned health check and might choose a resource record set for which the resource is unhealthy. See Figure 10.8 for more information.
Checking the health of resources in complex configurations works much the same way as in simple configurations. In complex configurations, however, you use a combination of alias resource record sets (including weighted alias, latency alias, and failover alias) and non-alias resource record sets to build a decision tree that gives you greater control over how Amazon Route 53 responds to requests.
For example, you might use latency alias resource record sets to select a region close to a user and use weighted resource record sets for two or more resources within each region to protect against the failure of a single endpoint or an Availability Zone. Figure 10.9 shows this configuration.
An overview of how Amazon EC2 and Amazon Route 53 are configured follows:
When you have multiple resources in a region, you can create weighted or failover resource record sets for your resources. You can also make even more complex configurations by creating weighted alias or failover alias resource record sets that, in turn, refer to multiple resources.
You use the Evaluate Target Health setting for each latency alias resource record set to make Amazon Route 53 evaluate the health of the alias targets—the weighted resource record sets—and respond accordingly.
Figure 10.10 demonstrates the sequence of events that follows:
In this section, we discuss how to make your connections to AWS redundant. In Chapter 5, we discussed the various connectivity options to connect to AWS and, more specifically, your Amazon VPC. We will discuss the Virtual Private Network (VPN) and AWS Direct Connect connectivity options and how to make them highly available.
To get your connectivity up and running quickly, you can implement VPN connections because they are a quick, easy, and cost-effective way to set up remote connectivity to your Amazon VPC. To enable redundancy, each AWS Virtual Private Gateway (VGW) has two VPN endpoints with capabilities for static and dynamic routing. Although statically routed VPN connections from your router (also known as the customer gateway) are sufficient for establishing remote connectivity to an Amazon VPC, this is not a highly available configuration.
The best practice for making VPN connections highly available is to use redundant routers (customer gateways) and dynamic routing for automatic failover between AWS and customer VPN endpoints. In Figure 10.11, each VPN connection, consisting of two IPsec tunnels (one to each VGW endpoint), is drawn as a single line. See Figure 10.11 for a sample active-active VPN connection.
The configuration in Figure 10.11 consists of four fully meshed, dynamically routed IPsec tunnels between both VGW endpoints and two customer gateways. AWS provides configuration templates for a number of supported VPN devices to assist in establishing these IPsec tunnels and configuring Border Gateway Protocol (BGP) for dynamic routing.
In addition to the AWS-provided VPN and BGP configuration details, customers must configure Amazon VPCs to route traffic efficiently to their datacenter networks. In the example in Figure 10.11, the VGW will prefer to send 10.0.0.0/16 traffic to Data Center 1 through Customer Gateway 1 and only reroute this traffic through Data Center 2 if the connection to Data Center 1 is down. Likewise, 10.1.0.0/16 traffic will prefer the VPN connection originating from Data Center 2.
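The route preference just described can be sketched as a longest-prefix match over the prefixes each datacenter advertises. This is a simplified local model, not VGW behavior verbatim; the gateway names mirror Figure 10.11, and in a real deployment BGP would withdraw a route on failure, at which point the remaining advertisement for that prefix would win:

```python
import ipaddress

# Illustrative route table mirroring Figure 10.11: each on-premises prefix
# is preferred via the VPN connection from its own datacenter.
routes = [
    (ipaddress.ip_network("10.0.0.0/16"), "Customer Gateway 1"),
    (ipaddress.ip_network("10.1.0.0/16"), "Customer Gateway 2"),
]

def next_hop(destination, routes):
    """Pick the most specific (longest-prefix) route matching the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in routes if addr in net]
    if not matches:
        raise ValueError(f"no route to {destination}")
    return max(matches, key=lambda m: m[0].prefixlen)[1]
```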
This configuration relies on the Internet to carry traffic between on-premises networks and Amazon VPC. Although AWS leverages multiple Internet Service Providers (ISPs), and even if the customer leverages multiple ISPs, an Internet service disruption can still affect the availability of VPN network connectivity due to the interdependence of ISPs and Internet routing. The only way to control the exact network path of your traffic is to provision private network connectivity with AWS Direct Connect.
If you need to establish private connectivity between AWS and your datacenter, office, or colocation environment, use AWS Direct Connect. More information on AWS Direct Connect can be found in Chapter 5.
AWS Direct Connect uses a dedicated, physical connection in one AWS Direct Connect location. Therefore, to make AWS Direct Connect highly available, consider implementing multiple dynamically-routed AWS Direct Connect connections. Architectures with the highest levels of availability will leverage different AWS Direct Connect partner networks to ensure network-provider redundancy. In Figure 10.12, each AWS Direct Connect connection, a physical connection carrying one or more logical connections, is drawn as a single continuous line.
The configuration depicted in Figure 10.12 consists of AWS Direct Connect connections to separate AWS Direct Connect routers in two locations from two independently configured customer devices. AWS provides example router configurations to assist in establishing AWS Direct Connect connections and configuring BGP for dynamic routing.
In addition to the AWS-provided configuration details, you must configure your Amazon VPCs to route traffic efficiently to the datacenter networks. In the example in Figure 10.12, the VGW will prefer to send 10.0.0.0/16 traffic to Data Center 1 and only reroute this traffic to Data Center 2 if connectivity to Customer Device 1 is down. Likewise, 10.1.0.0/16 traffic will prefer the AWS Direct Connect connection from Data Center 2.
AWS Direct Connect allows you to create resilient connections to AWS because you have full control over the network path and network providers between your remote networks and AWS.
Some AWS customers would like the benefits of one or more AWS Direct Connect connections for their primary connectivity to AWS, combined with a lower-cost backup connection. To achieve this objective, they can establish AWS Direct Connect connections with a VPN backup, as depicted in Figure 10.13.
The configuration depicted in Figure 10.13 consists of two dynamically-routed connections—one using AWS Direct Connect and the other using a VPN connection from two different customer gateways. AWS provides example router configurations to assist in establishing both AWS Direct Connect and VPN connections with BGP for dynamic routing. By default, AWS will always prefer to send traffic over an AWS Direct Connect connection, so no additional configuration is required to define primary and backup connections. In the example in Figure 10.13, both Customer Gateway 1 (VPN) and Customer Device 2 (AWS Direct Connect) advertise a summary route of 10.0.0.0/16. AWS will send all traffic to Customer Device 2 as long as this network path is available.
Although BGP prefers AWS Direct Connect to VPN connections by default, you still have the ability to influence AWS routing decisions by advertising more specific (or static) routes. If, for example, you want to leverage your backup VPN connection for a subset of traffic (for example, developer traffic versus production traffic), you can advertise specific routes from Customer Gateway 1. Refer to Chapter 5 for more information on route preference.
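The interaction between prefix length and connection-type preference can be sketched locally. In this simplified model (the gateway names follow Figure 10.13, and the /24 "developer traffic" prefix is invented for illustration), the longest prefix wins first, and ties are broken in favor of AWS Direct Connect:

```python
import ipaddress

# Advertised routes as (prefix, connection type, gateway); names illustrative.
advertisements = [
    (ipaddress.ip_network("10.0.0.0/16"), "vpn", "Customer Gateway 1"),
    (ipaddress.ip_network("10.0.0.0/16"), "direct_connect", "Customer Device 2"),
    # A more specific prefix advertised over VPN to pull a subset of traffic
    (ipaddress.ip_network("10.0.99.0/24"), "vpn", "Customer Gateway 1"),
]

PREFERENCE = {"direct_connect": 0, "vpn": 1}  # lower value is preferred

def select_path(destination, advertisements):
    """Longest prefix wins; among equal prefixes, Direct Connect beats VPN."""
    addr = ipaddress.ip_address(destination)
    matches = [a for a in advertisements if addr in a[0]]
    best = min(matches, key=lambda a: (-a[0].prefixlen, PREFERENCE[a[1]]))
    return best[2]
```

Traffic to the summary prefix follows the Direct Connect path, while traffic matching the more specific VPN advertisement is steered onto the backup connection, which is exactly the technique described above.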
Disaster recovery (DR) is about preparing for and recovering from a disaster. Any event that has a negative impact on your company’s business continuity or finances could be termed a disaster. This includes hardware or software failure, a network outage, a power outage, physical damage to a building such as fire or flooding, human error, or some other significant event. You should plan how to minimize the impact of a disaster by investing time and resources to plan and prepare, train employees, and document and update processes.
The amount of investment for DR planning for a particular system can vary dramatically depending on the cost of a potential outage. Companies that have traditional physical environments typically must duplicate their infrastructure to ensure the availability of spare capacity in the event of a disaster. The infrastructure needs to be procured, installed, and maintained so that it is ready to support the anticipated capacity requirements. During normal operations, the infrastructure typically is under-utilized or over-provisioned. With Amazon Web Services (AWS), your company can scale up its infrastructure on an as-needed, pay-as-you-go basis. You get access to the same highly secure, reliable, and fast infrastructure that Amazon uses to run its own global network of websites. AWS also gives you the flexibility to change and optimize resources quickly during a DR event, which can result in significant cost savings.
We will be using two common industry terms for disaster planning:
Recovery time objective (RTO) This represents the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). For example, if a disaster occurs at 12:00 PM (noon) and the RTO is eight hours, the DR process should restore the business process to the acceptable service level by 8:00 PM.
Recovery point objective (RPO) This is the acceptable amount of data loss measured in time. For example, if a disaster occurs at 12:00 PM (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 AM. Data loss will span only one hour, between 11:00 AM and 12:00 PM (noon).
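These two definitions reduce to simple time arithmetic, as the following sketch shows, reproducing the noon example from the text:

```python
from datetime import datetime, timedelta

def recovery_targets(disaster_time, rto, rpo):
    """Return (restore-by deadline, oldest point whose data must be recovered)."""
    return disaster_time + rto, disaster_time - rpo

# Disaster at 12:00 PM (noon), RTO of eight hours, RPO of one hour
disaster = datetime(2017, 6, 1, 12, 0)
deadline, recovery_point = recovery_targets(
    disaster, rto=timedelta(hours=8), rpo=timedelta(hours=1))
# The business process must be restored by 8:00 PM, and all data written
# before 11:00 AM must be recoverable; at most one hour of data may be lost.
```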
A company typically decides on an acceptable RTO and RPO based on the financial impact to the business when systems are unavailable. The company determines financial impact by considering many factors, such as the loss of business and damage to its reputation due to downtime and the lack of systems availability.
IT organizations then plan solutions to provide cost-effective system recovery based on the RPO within the timeline and the service level established by the RTO.
In this final section of the chapter, we will discuss the various types of disaster recovery scenarios on Amazon Web Services.
In most traditional environments, data is backed up to tape and sent off-site regularly. If you use this method, it can take a long time to restore your system in the event of a disruption or disaster. Amazon S3 is an ideal destination for backup data that might be needed quickly to perform a restore. Transferring data to and from Amazon S3 is typically done through the network, and it is therefore accessible from any location. There are many commercial and open-source backup solutions that integrate with Amazon S3. You can use AWS Import/Export Snowball to transfer very large datasets by shipping storage devices directly to AWS. For longer-term data storage where retrieval times of several hours are adequate, there is Amazon Glacier, which has the same durability model as Amazon S3. Amazon Glacier is a low-cost alternative to Amazon S3. Amazon Glacier and Amazon S3 can be used in conjunction to produce a tiered backup solution.
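As a hedged illustration of the tiered Amazon S3/Amazon Glacier backup solution mentioned above, the dictionary below follows the shape accepted by the S3 lifecycle API. The rule ID, key prefix, and day counts are illustrative choices, not recommendations:

```python
# Lifecycle rule in the shape accepted by the S3 API
# (put_bucket_lifecycle_configuration); names and numbers are made up.
lifecycle = {
    "Rules": [{
        "ID": "tiered-backups",
        "Filter": {"Prefix": "backups/"},
        "Status": "Enabled",
        # Keep recent backups in Amazon S3 for fast restores, then move
        # older ones to Amazon Glacier for low-cost long-term retention.
        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 365},
    }]
}
# With boto3, this would be applied along the lines of:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-backup-bucket", LifecycleConfiguration=lifecycle)
```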
AWS Storage Gateway enables snapshots of your on-premises data volumes to be transparently copied into Amazon S3 for backup. You can create local volumes or Amazon EBS volumes from these snapshots. Storage-cached volumes allow you to store your primary data in Amazon S3 while keeping your frequently accessed data local for low-latency access. As with other AWS Storage Gateway configurations, you can snapshot the data volumes to provide highly durable backups. In the event of DR, you can restore the cached volumes either to a second site running an AWS Storage Gateway or to Amazon EC2.
You can use the gateway-VTL (Virtual Tape Library) configuration of AWS Storage Gateway as a backup target for your existing backup management software. This can be used as a replacement for traditional magnetic tape backup. For systems running on AWS, you also can back up into Amazon S3. Snapshots of Amazon EBS volumes, Amazon RDS databases, and Amazon Redshift data warehouses can be stored in Amazon S3. Alternatively, you can copy files directly into Amazon S3, or you can choose to create backup files and copy those to Amazon S3. There are many backup solutions that store data directly in Amazon S3, and these can be used from Amazon EC2 systems as well.
Key Steps for the Backup and Restore Method of Disaster Recovery
The term pilot light is often used to describe a DR scenario in which a minimal version of an environment is always running in AWS. The idea of the pilot light is an analogy that comes from a natural gas heater. In a natural gas heater, a small flame or pilot light is always on so it can quickly ignite the entire furnace to heat up the house.
This scenario is similar to a backup-and-restore scenario. For example, with AWS you can maintain a pilot light by configuring and running the most critical core elements of your system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core. Infrastructure elements for the pilot light itself typically include your database servers, which would replicate data to Amazon EC2 or Amazon RDS.
Depending on the system, there might be other critical data outside of the database that needs to be replicated to AWS. This is the critical core of the system (the pilot light) around which all other infrastructure pieces in AWS (the rest of the furnace) can quickly be provisioned to restore the complete system.
Unlike the backup-and-restore method, the pilot light method requires some upfront configuration and deployment prior to the occurrence of a disaster.
Key Preparation Steps for the Pilot Light Method of Disaster Recovery
To recover the rest of your environment around the pilot light, you can start your systems from the Amazon Machine Images (AMIs) within minutes on the appropriate instance types. You can resize your dynamic data servers to handle production volumes as needed or add capacity accordingly. Horizontal scaling is often the most cost-effective and scalable approach to add capacity to a system. For example, you can add more web servers at peak times. However, you can also choose larger Amazon EC2 instance types and thus scale vertically for applications such as relational databases. From a networking perspective, any required DNS updates can be done in parallel.
After recovery, you should ensure that redundancy is restored as quickly as possible. A failure of your DR environment shortly after your production environment fails is unlikely; however, it is possible, so be prepared. Continue to take regular backups of your system, and consider additional redundancy at the data layer. Remember that your DR environment is now your production environment.
The term warm-standby is used to describe a DR scenario in which a scaled-down version of a fully-functional environment is always running in the cloud. A warm-standby solution extends the pilot light elements and preparation. It further decreases the recovery time because some services are always running. By identifying your business-critical systems, you can fully duplicate these systems on AWS and have them always on.
These servers can be running on a minimum-sized fleet of Amazon EC2 instances on the smallest sizes possible. This solution is not scaled to take a full-production load; however, it is fully functional. In a disaster, the system is scaled up quickly to handle the production load. In AWS, this can be done by adding more instances to the load balancer and by resizing the small capacity servers to run on larger Amazon EC2 instance types.
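The scale-up step can be sketched as a simple capacity calculation. The per-instance throughput figures below are invented for illustration; in practice you would derive them from load testing your own application:

```python
import math

def scale_out(current_instances, rps_per_instance, production_rps):
    """Instances the standby fleet needs to absorb the production load,
    never scaling below the fleet that is already running."""
    needed = math.ceil(production_rps / rps_per_instance)
    return max(needed, current_instances)

# e.g. a 2-instance warm-standby fleet, each instance handling
# 500 requests/second, facing a 10,000 requests/second production load
target = scale_out(current_instances=2,
                   rps_per_instance=500,
                   production_rps=10_000)
```

In AWS, reaching this target is typically a matter of raising the desired capacity behind the load balancer (or letting Auto Scaling do so), rather than provisioning the fleet from scratch, which is why warm standby recovers faster than pilot light.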
Just like the pilot light method of disaster recovery, the warm-standby method requires some upfront deployment and preparation prior to the occurrence of a disaster.
Key Preparation Steps for the Warm-Standby Solution for Disaster Recovery
In the case of failure of the production system, the standby environment will be scaled up for production load, and DNS records will be changed to route all traffic to AWS.
Key Recovery Steps for the Warm-Standby Method of Disaster Recovery
A multi-site solution runs in AWS, as well as on your existing on-site infrastructure, in an active-active configuration. The data replication method that you employ will be determined by the recovery point that you choose. Refer to the RPO definition at the beginning of this section.
You should use a DNS service that supports weighted routing, such as Amazon Route 53, to route production traffic to different sites that deliver the same application or service. A proportion of traffic will go to your infrastructure in AWS, and the remainder will go to your on-site infrastructure.
In an on-site disaster situation, you can adjust the DNS weighting and send all traffic to the AWS servers. The capacity of the AWS service can be rapidly increased to handle the full production load. You can use Amazon EC2 Auto Scaling to automate this process.
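A minimal model of this weight adjustment might look like the following; the site names and weights are hypothetical, and real weighted routing is probabilistic per query rather than an exact proportional split:

```python
def route(weights, picks):
    """Distribute DNS resolutions across sites in proportion to their weights."""
    total = sum(weights.values())
    return {site: picks * w // total for site, w in weights.items()}

# Normal operation: a 20/80 split between AWS and the on-site infrastructure
normal = route({"aws": 20, "on_site": 80}, picks=100)

# On-site disaster: set the on-site weight to 0 so every resolution
# returns the AWS endpoint, then scale the AWS fleet to full capacity
failover = route({"aws": 20, "on_site": 0}, picks=100)
```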
You might need some application logic to detect the failure of the primary database services and cut over to the parallel database services running in AWS. The cost of this scenario is determined by how much production traffic is handled by AWS during normal operation.
In the recovery phase, you pay only for what you use for the duration that the DR environment is required at full scale. You can further reduce cost by purchasing Amazon EC2 Reserved Instances for your “always on” AWS servers. Refer to Amazon EC2 purchasing options in Chapter 4.
In a multi-site disaster recovery solution, the second site is a duplicate of your current production site and handles a share of the production traffic during normal operation.
Key Preparation Steps for the Multi-Site Solution Method for Disaster Recovery
When failing over to the available site, DNS records will need to be changed to direct all production traffic to the available production site.
Key Recovery Steps for the Multi-Site Solution Method of Disaster Recovery
Once you have restored your primary site to a working state, you will need to restore your normal service, which is often referred to as a “fail back.” Depending on your DR strategy, this typically means reversing the flow of data replication so that any data updates received while the primary site was down can be replicated back, without the loss of data. Let’s discuss failback from the various methods of failover.
To failback from the disaster recovery site to the now working production site, you will need to follow these key steps:
Key Steps for Failing Back with the Backup and Restore Method of Disaster Recovery
Failback from the disaster recovery site when Pilot Light, Warm-Standby, and Multi-Site are used follows the same methodology to restore service to the former production site.
Key Failback Steps for the Pilot Light, Warm-Standby, and Multi-Site Methods of Disaster Recovery
In this chapter, we discussed how to decouple applications for scalability and high availability. The techniques discussed include using Amazon SQS and Amazon SNS to decouple an application’s front end from its back end, thus reducing their dependencies on one another (so they can scale independently, for example). Regional services such as Amazon S3 and Amazon DynamoDB can replicate their data into other regions to provide a highly available multi-region solution. When building systems, it is a best practice to use the AWS managed services, as they are inherently fault tolerant. For workloads that cannot take advantage of managed services, it is up to you to make them highly available. AWS provides the parts and pieces; you put them together to deliver a highly available or fault-tolerant system. Lastly, prepare for a disaster and develop a recovery plan using the various DR failover scenarios and the failback methods for each.
Understand Amazon SQS polling. Amazon SQS uses short polling by default, querying only a subset of the servers (based on a weighted random distribution) to determine whether any messages are available for inclusion in the response. Long polling eliminates false empty responses by querying all (rather than a limited number) of the servers.
Understand visibility timeout and where it is applied. Immediately after the message is received, it remains in the queue. To prevent other consumers from processing the message again, Amazon SQS sets a visibility timeout, which is a period of time during which Amazon SQS prevents other consuming components from receiving and processing the message.
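A toy model of the visibility timeout (not the real Amazon SQS API) can make this behavior concrete: a received message stays in the queue but is hidden from other consumers until either it is deleted or the timeout elapses:

```python
import time

class ToyQueue:
    """Toy model of an SQS queue's visibility timeout; timestamps in seconds."""
    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = []              # [body, invisible_until]

    def send(self, body):
        self.messages.append([body, 0.0])

    def receive(self, now=None):
        now = time.monotonic() if now is None else now
        for msg in self.messages:
            if msg[1] <= now:           # visible again?
                # Hide the message from other consumers for the timeout window
                msg[1] = now + self.visibility_timeout
                return msg[0]
        return None                     # nothing visible right now

    def delete(self, body):
        # A consumer deletes the message after successful processing
        self.messages = [m for m in self.messages if m[0] != body]
```

If the consumer fails to delete the message before the timeout expires, the message becomes visible again and another consumer will receive it, which is why processing should be idempotent.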
Understand message order with Amazon SQS standard queues. Standard queues provide best-effort ordering, which attempts to preserve the order of messages but does not guarantee it.
Understand First-In, First-Out (FIFO) queues. FIFO queues preserve the exact order in which messages are sent and received.
Understand the various protocols that Amazon SNS supports for its subscribers. These are HTTP/HTTPS, Email/Email-JSON, Amazon SQS, SMS, and AWS Lambda.
Know the difference between Amazon SQS and Amazon SNS. Amazon SQS uses a pull method to receive messages; Amazon SNS uses a push method.
Understand how Amazon DynamoDB replicates data between regions. DynamoDB streams can be used to replicate data from one Amazon DynamoDB table in one region to another Amazon DynamoDB table in another region.
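A minimal sketch of this pattern, using plain dictionaries in place of real tables and a list in place of the stream (in practice, an AWS Lambda function would consume the stream records and write them to the replica region):

```python
# Toy model of cross-region replication driven by a DynamoDB stream:
# writes in the source region emit stream records, which are replayed
# against a replica table in the destination region.
source, replica, stream = {}, {}, []

def put_item(key, item):
    """Write to the source table and emit a stream record."""
    source[key] = item
    stream.append({"eventName": "INSERT", "Key": key, "NewImage": item})

def replicate(stream, replica):
    """Replay stream records against the replica table."""
    for record in stream:
        if record["eventName"] in ("INSERT", "MODIFY"):
            replica[record["Key"]] = record["NewImage"]
        elif record["eventName"] == "REMOVE":
            replica.pop(record["Key"], None)
```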
Understand how to scale an Amazon RDS deployment for high availability. Amazon RDS supports a multi-AZ deployment. Know the Amazon RDS databases that support Read Replicas in other regions.
Have a working knowledge of how Amazon Route 53 can be used to failover to another region. Understand what failover routing is and how it can be used to failover to other regions and to static websites hosted on Amazon S3.
Understand how health checks work with Amazon Route 53. Know the types of health checks and how to set alarms that trigger actions for a given metric.
Know how to set up a highly available VPN. The Virtual Private Gateway (VGW) supports two tunnels. Most routers can function in an active/active pattern with both tunnels; older routers may only handle active/passive configurations.
Understand how to make AWS Direct Connect highly available. To make AWS Direct Connect highly available, deploy two AWS Direct Connect connections, ideally terminating in different AWS Direct Connect locations.
Understand how to use a VPN as a backup to AWS Direct Connect. If you are unable to deploy a second AWS Direct Connect or have a requirement for additional availability, then keep your existing VPN. If you do not have one, then you can deploy a VPN for backup connectivity to your Amazon Virtual Private Cloud (Amazon VPC).
Understand the various types of disaster recovery methods available on AWS. The various types of failover are Backup/Restore, Pilot Light, Warm-Standby, and Multi-Site. Understand how to failover and how to fail back.
By now you should have set up an account in AWS. If you haven’t, now would be the time to do so. It is important to note that these exercises are in your AWS account and thus are not free.
Use the Free Tier when launching resources. The Amazon AWS Free Tier applies to participating services across the following AWS Regions: US East (Northern Virginia), US West (Oregon), US West (Northern California), Canada (Central), EU (London), EU (Ireland), EU (Frankfurt), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), and South America (Sao Paulo). For more information, see https://aws.amazon.com/s/dm/optimization/server-side-test/free-tier/free_np/.
If you have not yet installed the AWS Command Line utilities, refer to Chapter 2, Exercise 2.1 (Linux) or Exercise 2.2 (Windows).
The reference for the AWS CLI can be found at http://docs.aws.amazon.com/cli/latest/reference/.
These exercises will walk you through some high-availability scenarios and will have you research the “how to” in order to complete them. By looking up how to perform a specific task, you will be on your way to mastering it. The goal of this guide isn’t just to prepare you to pass the AWS Certified SysOps Administrator – Associate exam; it is designed to serve as a reference companion in your day-to-day duties as an AWS Certified SysOps Administrator.
You are looking to use Amazon Simple Notification Service (Amazon SNS) to send alerts to members of your operations team. What protocols does Amazon SNS support? (Choose all that apply.)
What is the first step that you must take to set up Amazon Simple Notification Service?
You have just finished setting up your AWS Direct Connect to connect your datacenter to AWS. You have one Amazon Virtual Private Cloud (Amazon VPC). You are worried that the connection is not redundant. Your company is going to implement a redundant AWS Direct Connect connection in the coming months. What can you suggest to management as an intermediate backup to a redundant AWS Direct Connect connection?
Your company has just implemented their redundant AWS Direct Connect connection. What must you do in order to ensure that in case of failure, traffic will route over the redundant AWS Direct Connect link?
You manage an Amazon Simple Storage Service (Amazon S3) bucket for your company’s Legal department. The bucket is called Legal. You were approached by one of the lawyers who is concerned about the availability of their Legal bucket. What could you do to ensure a higher level of availability for the Legal bucket?
You are working with your application development team and have created an Amazon Simple Queue Service (Amazon SQS) standard queue. Your developers are complaining that messages are being processed more than once. What can you do to limit or stop this behavior?
You have a test version of a production application in an Amazon Virtual Private Cloud. The test application is using Amazon Relational Database Service (Amazon RDS) for the database. You have just been mandated to move your production version of the application to AWS. The application owner is concerned about the availability of the database. How can you make Amazon RDS highly available?
You are managing a multi-region application deployed on AWS. You currently use your own DNS. What types of failover does Amazon Route 53 support? (Choose three.)
Amazon Route 53 supports the following types of health checks. (Choose all that apply.)
How do you make your hardware VPN redundant?
You are part of the disaster recovery team, and you are working to create a DR policy. What is the bare minimum type of disaster recovery that you can perform with AWS?
As you are getting more familiar with AWS, you are looking to respond better to disasters that may happen in your datacenter. Traditional backup and restore operations take too long. The longest part of the restore process is the database. What would be a better option?
You are not seeing network traffic flow on the AWS side of your VPN connection to your Amazon VPC. How do you check the status?
You have just inherited an application that has a recovery point objective (RPO) of zero and a recovery time objective (RTO) of zero. What type of disaster recovery method should you deploy?
You have an application with a recovery point objective of zero and a recovery time objective of 20 minutes. What type of disaster recovery method should you use?