THE AWS CERTIFIED SYSOPS ADMINISTRATOR - ASSOCIATE EXAM TOPICS COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:
This chapter covers high availability on AWS. In previous chapters, you were introduced to compute, networking, databases, and storage on AWS. What you haven’t been exposed to are some of the fully managed services that can help you deploy and maintain a highly available and scalable application.
What is high availability? Availability refers to the amount of time your system is in a functioning condition. In general terms, availability is expressed as 100 percent minus your system’s downtime percentage. Because events that disrupt availability are never entirely predictable, there are always further ways to make an application more available; improving availability typically increases cost, however. When considering how to make your environment more available, it’s important to balance the cost of the improvement against the benefit to your users. Does high availability mean that your application is always alive and reachable, or does it mean that the application is servicing requests within an acceptable level of performance? A common way to make an application both highly available and scalable is to decouple its components so that they scale independently and can survive small disruptions. See Table 10.1 for varying levels of high availability.
TABLE 10.1 Levels of High Availability
| Percent of Uptime | Max Downtime per Year | Equivalent Downtime per Day |
|---|---|---|
| 90% (1 nine) | 36.5 days | 2.4 hours |
| 99% (2 nines) | 3.65 days | 14 minutes |
| 99.9% (3 nines) | 8.76 hours | 86 seconds |
| 99.99% (4 nines) | 52.6 minutes | 8.6 seconds |
| 99.999% (5 nines) | 5.25 minutes | 0.86 seconds |
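The downtime figures in Table 10.1 follow directly from the uptime percentage. A quick sketch of the arithmetic (function and variable names are illustrative):

```python
def max_downtime(uptime_percent, period_seconds):
    """Maximum allowed downtime within a period for a given uptime percentage."""
    return period_seconds * (100.0 - uptime_percent) / 100.0

YEAR = 365 * 24 * 3600  # seconds in a non-leap year
DAY = 24 * 3600         # seconds in a day

# 99.9% ("three nines") allows roughly 8.76 hours of downtime per year...
hours_per_year = max_downtime(99.9, YEAR) / 3600
# ...and about 86 seconds per day
seconds_per_day = max_downtime(99.9, DAY)
```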
In this chapter, you will be introduced to Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Notification Service (Amazon SNS), which can be used to decouple applications and retain transactions. In addition to decoupling tiers within an application, these services can help deliver high availability and fault tolerance to your application. This chapter will also cover additional strategies to keep your infrastructure highly available, redundant, and fault tolerant.
Amazon Simple Queue Service (Amazon SQS) is a web service that gives you access to message queues that store messages waiting to be processed. With Amazon SQS, you can quickly build message queuing applications that can run on any computer. You can use the service to move data between diverse, distributed application components without losing messages and without requiring each component always to be available.
Amazon SQS can help you build a distributed application with decoupled components by working closely with the Amazon Elastic Compute Cloud (Amazon EC2) and other AWS infrastructure services. You can access the service via the Amazon SQS console, the AWS Command Line Interface (AWS CLI), a generic web services Application Programming Interface (API), and any programming language that the AWS Software Development Kit (SDK) supports. Amazon SQS supports both standard and First-In, First-Out (FIFO) queues.
Think of a queue as a temporary repository for messages that are awaiting processing. Using Amazon SQS, you can decouple the components of an application so that they run independently of each other, with Amazon SQS easing message management between components. The queue acts as a buffer between the component producing and saving data and the component receiving the data for processing. This means that the queue resolves issues that arise if the producer (for example, a web front end) is producing work faster than the consumer (such as the application worker) can process it or if the producer or consumer is only intermittently connected to the network.
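With the AWS SDK for Python (boto3), this producer/consumer pattern reduces to three calls: send_message on the producer side, and receive_message plus delete_message on the consumer side. A minimal sketch, with the client passed in as a parameter (the queue URL and boto3 wiring shown in the comment are assumptions):

```python
def produce(sqs, queue_url, body):
    """Producer side: enqueue one unit of work."""
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)

def consume_one(sqs, queue_url, handler):
    """Consumer side: receive up to one message, process it, then delete it.

    Deleting only after the handler succeeds means a crash mid-processing
    leaves the message in the queue to be redelivered later.
    """
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               WaitTimeSeconds=10)  # long poll
    for msg in resp.get("Messages", []):
        handler(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])

# In a real deployment the client would come from boto3:
#   import boto3
#   sqs = boto3.client("sqs")
```

Passing the client in keeps the loop testable and makes it easy to swap in a stub during development.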
Amazon SQS is designed to deliver your message at least once and supports multiple readers and writers interacting with the same queue. A single queue can be used simultaneously by many distributed application components with no need for those components to coordinate with each other to share the queue.
Amazon SQS is engineered always to be available and deliver messages. This is achieved by the system being distributed between multiple machines and multiple facilities. Due to this highly distributed architecture, there is a trade-off—when using standard queues, Amazon SQS does not guarantee FIFO delivery of messages. This may be okay for many distributed applications, as long as each message can stand on its own and as long as all messages are delivered. In that scenario, the order is not important. If your system requires that order be preserved, you can place sequencing information in each message so that you can reorder the messages when the queue returns them.
You can have as many queues with as many messages as you like in the Amazon SQS system. A queue can be empty if you haven’t sent any messages to it, or if you have deleted all of the messages from it.
You assign a name to each of your queues. You can get a list of all of your queues or a subset of your queues that share the same initial characters in their names (for example, you could get a list of all of your queues whose names start with Q3).
Use Cases for Amazon SQS
You can delete a queue at any time, whether it is empty or not. Amazon SQS queues retain messages for a set period of time. The default retention period is four days; however, you can configure a queue to retain messages for up to 14 days.
You can achieve loose coupling of systems by using queues between systems and exchanging messages that transfer jobs. This enables asynchronous linking of systems. This method lets you increase the number of virtual servers that receive and process the messages in parallel. If there are no messages left to process, you can configure Auto Scaling to terminate the excess servers. See Figure 10.1 for an example of a queuing chain pattern.
Although you can use this pattern without cloud technology, the queue itself is provided as an AWS Cloud service (Amazon SQS), which makes it easier for you to use this pattern.
Taking image processing as an example, the sequential operations of uploading, storing, and encoding the image, creating a thumbnail, and copyrighting are tightly linked. This tight linkage complicates the recovery operations when there has been a failure.
How This Benefits Your Application
A standard queue makes a best effort to preserve order in messages, but due to the distributed nature of the queue, AWS cannot guarantee that you will receive messages in the exact order that you sent them. If your system requires that order be preserved, we recommend using a FIFO queue.
Amazon SQS stores copies of your messages on multiple servers for redundancy and high availability. On rare occasions, one of the servers storing a copy of a message might be unavailable when you receive or delete the message. If that occurs, the copy of the message will not be deleted on that unavailable server, and you might get that message copy again when you receive messages. Because of this, you must design your application to be idempotent (that is, it must not be adversely affected if it processes the same message more than once). In short, Amazon SQS will deliver a message at least once, and may in some cases deliver the message again. For more information, refer to the visibility timeout discussion later in this chapter.
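Because standard queues deliver at least once, the consumer must tolerate duplicates. One common approach, sketched here with an in-memory set (a real worker fleet would use a shared store such as a DynamoDB table with a conditional write), is to record each message ID as it is processed and skip IDs already seen:

```python
def process_if_new(message_id, body, seen_ids, handler):
    """Run the handler only if this message ID hasn't been handled before.

    Returns True if the handler ran, False if the message was a duplicate
    delivery (which is then safe to delete without reprocessing).
    """
    if message_id in seen_ids:
        return False
    handler(body)
    seen_ids.add(message_id)  # record only after successful processing
    return True
```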
When you retrieve messages from the queue, Amazon SQS samples a subset of the servers and returns messages from just those servers. Amazon SQS supports two types of polling methods:
Figure 10.2 shows an example of a short poll of the queue and messages being returned after a receive request. Amazon SQS samples several of the servers (in gray) and returns the messages from those servers (Messages A, C, D, and B). Message E is not returned to this particular request, but it would be returned to a subsequent request.
Visibility timeout is the period of time that a message is invisible to the rest of your application after an application component gets it from the queue. During the visibility timeout, the component that received the message usually processes it and then deletes it from the queue. This prevents multiple components from processing the same message.
Here is how it works:
When the application needs more time for processing, the visibility timeout can be changed dynamically via the ChangeMessageVisibility operation.
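A consumer that discovers mid-processing that it needs more time can extend the timeout with the ChangeMessageVisibility operation. A minimal sketch with the client injected (the queue URL and receipt handle in the comment are placeholders):

```python
def extend_visibility(sqs, queue_url, receipt_handle, seconds):
    """Give this consumer more time before the message reappears in the queue."""
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=seconds,  # allowed range is 0 to 43200 (12 hours)
    )

# With a real client:
#   import boto3
#   extend_visibility(boto3.client("sqs"), queue_url, receipt_handle, 300)
```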
Amazon SQS offers standard queues as the default queue type. A standard queue allows a nearly unlimited number of transactions per second. Standard queues support at-least-once message delivery; occasionally, because of the highly distributed architecture, more than one copy of a message might be delivered, and messages might arrive out of order. Standard queues therefore provide best-effort ordering, in contrast to a First-In, First-Out (FIFO) queue, which ensures that messages are delivered in the same order as they’re sent.
You can use standard message queues in many scenarios, as long as your application can process messages that arrive more than once and out of order. For example:
A First-In, First-Out (FIFO) queue has all of the capabilities of the standard queue. FIFO queues are designed to enhance messaging between applications when the order of operations and events is critical or where duplicates can’t be tolerated. FIFO queues also provide exactly-once processing and are limited to 300 Transactions Per Second (TPS).
FIFO queues are a good fit when the order of operations and events is critical. For example:
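Creating a FIFO queue and sending to it differs from the standard case in a few parameters: the queue name must end in .fifo, the FifoQueue attribute must be set, and each message needs a MessageGroupId (the ordering scope) and, unless content-based deduplication is enabled, a MessageDeduplicationId. A sketch of the request parameters, with the boto3 calls themselves shown in hedged comments (queue and group names are illustrative):

```python
def fifo_queue_params(name):
    """Parameters for creating a FIFO queue; the name must end in .fifo."""
    assert name.endswith(".fifo")
    return {
        "QueueName": name,
        "Attributes": {
            "FifoQueue": "true",
            # Deduplicate by a hash of the body, so no explicit
            # MessageDeduplicationId is needed per message:
            "ContentBasedDeduplication": "true",
        },
    }

def fifo_message_params(queue_url, body, group_id):
    """Parameters for sending one message to a FIFO queue."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": group_id,  # strict ordering is per message group
    }

# With a real client:
#   import boto3
#   sqs = boto3.client("sqs")
#   url = sqs.create_queue(**fifo_queue_params("orders.fifo"))["QueueUrl"]
#   sqs.send_message(**fifo_message_params(url, "order-1", "customer-42"))
```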
A Dead Letter Queue (DLQ) is an Amazon SQS queue that you configure to receive messages from other Amazon SQS queues, referred to as “source queues.” Typically, you set up a DLQ to receive messages after a maximum number of processing attempts has been reached. A DLQ provides the ability to isolate messages that could not be processed.
A DLQ is just like any other Amazon SQS queue: messages can be sent to it and received from it in the same way. You can create a DLQ using the Amazon SQS API or from the Amazon SQS console in the AWS Management Console.
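Attaching a DLQ to a source queue is done through the source queue's RedrivePolicy attribute, which names the DLQ's ARN and the number of receives allowed before a message is moved. A sketch of building that attribute (the ARN and count are illustrative):

```python
import json

def redrive_policy(dlq_arn, max_receive_count=5):
    """RedrivePolicy attribute value, set on the *source* queue."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        # After this many receives without deletion, the message moves to the DLQ:
        "maxReceiveCount": str(max_receive_count),
    })

# With a real client:
#   import boto3
#   boto3.client("sqs").set_queue_attributes(
#       QueueUrl=source_queue_url,
#       Attributes={"RedrivePolicy": redrive_policy(
#           "arn:aws:sqs:us-east-1:123456789012:my-dlq")})
```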
A developer associates an access policy statement (specifying the permissions being granted) with the queue to be shared. Amazon SQS provides APIs to create and manage the access policy statements: AddPermission, RemovePermission, SetQueueAttributes, and GetQueueAttributes. Refer to the latest API specification for more details.
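The AddPermission API grants another AWS account access to specific queue actions. A sketch of the request parameters (the account ID and label are placeholders):

```python
def share_queue_params(queue_url, other_account_id, label="CrossAccountSend"):
    """Parameters for sqs.add_permission granting SendMessage to another account."""
    return {
        "QueueUrl": queue_url,
        "Label": label,                  # unique name for this permission statement
        "AWSAccountIds": [other_account_id],
        "Actions": ["SendMessage"],      # could also grant ReceiveMessage, etc.
    }

# With a real client:
#   import boto3
#   boto3.client("sqs").add_permission(
#       **share_queue_params(queue_url, "111122223333"))
```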
Amazon SQS in each region is entirely independent, with its own message stores and queue names; therefore, messages cannot be shared between queues in different regions.
Queues can be shared in the following ways:
Characteristics of Amazon SQS
Amazon Simple Notification Service (Amazon SNS) is a web service that coordinates and manages the delivery or sending of messages to subscribing endpoints or clients. In Amazon SNS, there are two types of clients—publishers and subscribers, which are also referred to as producers and consumers. Publishers communicate asynchronously with subscribers by producing and sending a message to a topic, which is a logical access point and communication channel. Subscribers (for example, web servers, email addresses, Amazon SQS queues, and AWS Lambda functions) consume or receive the message or notification over one of the supported protocols (such as Amazon SQS, HTTP/S, email, Short Message Service [SMS], or AWS Lambda) when they are subscribed to the topic. See Figure 10.3 for details on the supported protocols.
When using Amazon SNS, you create a topic and control access to it by defining resource policies that determine which publishers and subscribers can communicate with the topic. Refer to Chapter 3, “Security and AWS Identity and Access Management (IAM),” for more details on resource policies.
A publisher sends messages to topics that they have created or to topics to which they have permission to publish. Instead of including a specific destination address in each message, a publisher sends a message to the topic. Amazon SNS matches the topic to a list of subscribers that have subscribed to that topic and delivers the message to each of those subscribers. Each topic has a unique name that identifies the Amazon SNS endpoint for publishers to post messages and subscribers to register for notifications. Subscribers receive all messages published to the topics to which they subscribe, and all subscribers to a topic receive the same messages. The service is accessed via the Amazon SNS console, the AWS CLI, a generic web services API, and any programming language that the AWS SDK supports.
Amazon SNS lets you push messages to mobile devices or distributed services via API or an easy-to-use management console. You can seamlessly scale from a handful of messages per day to millions of messages or more.
With Amazon SNS you can publish a message once and deliver it one or more times. You can choose to direct unique messages to individual Apple, Android, or Amazon devices or broadcast deliveries to many mobile devices with a single publish request.
Amazon SNS allows you to group multiple recipients using topics. A topic is an “access point” for allowing recipients to subscribe dynamically to identical copies of the same notification. One topic can support deliveries to multiple endpoint types; for example, you can group together iOS, Android, and SMS recipients. When you publish once to a topic, Amazon SNS delivers appropriately formatted copies of your message to each subscriber.
In a fan-out scenario, an Amazon SNS message is sent to a topic and then replicated and pushed to multiple Amazon SQS queues, HTTP/S endpoints, or email addresses. This allows for parallel asynchronous processing. For example, you could develop an application that sends an Amazon SNS message to a topic whenever a change in a stock price occurs. Then the Amazon SQS queues that are subscribed to that topic would receive identical notifications for the new stock price. The Amazon EC2 server instance attached to one of the queues could handle the trade reporting, while the other server instance could be attached to a data warehouse for analysis and the trending of the stock price over a specified period of time. Figure 10.4 describes the fan-out scenario.
To have Amazon SNS deliver notifications to an Amazon SQS queue, a developer should subscribe to a topic specifying “Amazon SQS” as the transport and a valid Amazon SQS queue as the endpoint. In order to allow the Amazon SQS queue to receive notifications from Amazon SNS, the Amazon SQS queue owner must subscribe the Amazon SQS queue to the topic for Amazon SNS. If the user owns both the Amazon SNS topic being subscribed to and the Amazon SQS queue receiving the notifications, nothing further is required. Any message published to the topic will automatically be delivered to the specified Amazon SQS queue. If the owner of the Amazon SQS queue is not the owner of the topic, Amazon SNS will require an explicit confirmation to the subscription request.
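Wiring the fan-out takes three steps with boto3: create the topic, subscribe each queue by ARN with the sqs protocol, and, when the queue owner is not the topic owner, confirm the subscription. A sketch of the subscription parameters (topic name and queue ARNs in the comments are placeholders):

```python
def sqs_subscription_params(topic_arn, queue_arn):
    """Parameters for subscribing an Amazon SQS queue to an Amazon SNS topic."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "sqs",      # delivery transport
        "Endpoint": queue_arn,  # the queue's ARN, not its URL
    }

# With a real client, following the stock-price fan-out example:
#   import boto3
#   sns = boto3.client("sns")
#   topic_arn = sns.create_topic(Name="stock-price-changes")["TopicArn"]
#   sns.subscribe(**sqs_subscription_params(topic_arn, trade_queue_arn))
#   sns.subscribe(**sqs_subscription_params(topic_arn, warehouse_queue_arn))
#   sns.publish(TopicArn=topic_arn, Message='{"ticker": "AMZN", "price": 101.5}')
```

Both queues then receive identical copies of each published message, which is the fan-out behavior shown in Figure 10.4.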
Characteristics of Amazon SNS
In Chapter 5, “Networking,” you learned that the boundary of an Amazon Virtual Private Cloud (Amazon VPC) is at the region, and that regions comprise Availability Zones. In Chapter 4, “Compute,” you learned how Amazon EC2 instances are bound to the Availability Zone. What happens if a natural disaster takes out your Amazon EC2 instance running in an affected Availability Zone? What if your application goes viral and starts taking on more traffic than expected? How do you continue to serve your users?
So far in this chapter, you have learned that services like Amazon SNS and Amazon SQS are highly available, fault-tolerant systems, which can be used to decouple your application. Chapter 7, “Databases,” introduced Amazon DynamoDB and Amazon ElastiCache as NoSQL databases. Now we are going to put all these parts together to deliver a highly available architecture.
Figure 10.5 illustrates a highly available three-tier architecture (the concept of which was discussed in Chapter 2, “Working with AWS Cloud Services”).
Network Address Translation (NAT) gateways were covered in Chapter 5. NAT servers allow traffic from private subnets to traverse the Internet or connect to other AWS Cloud services. Individual NAT servers can be a single point of failure. The NAT gateway is a managed device, and each NAT gateway is created in a specific Availability Zone and implemented with redundancy in that Availability Zone. To achieve optimum availability, use NAT gateways in each Availability Zone.
As discussed in Chapter 5, Elastic Load Balancing comes in two types: Application Load Balancer and Classic Load Balancer. Elastic Load Balancing allows you to decouple an application’s web tier (or front end) from the application tier (or back end). In the event of a node failure, the Elastic Load Balancing load balancer will stop sending traffic to the affected Amazon EC2 instance, thus keeping the application available, although in a degraded state. Self-healing can be achieved by using Auto Scaling. For a refresher on Elastic Load Balancing, refer to Chapter 5.
Auto Scaling is a web service designed to launch or terminate Amazon EC2 instances automatically based on user-defined policies, schedules, and health checks. Application Auto Scaling automatically scales supported AWS Cloud services with an experience similar to Auto Scaling for Amazon EC2 resources. Application Auto Scaling works with Amazon EC2 Container Service (Amazon ECS) and will not be covered in this guide.
Auto Scaling helps to ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You create collections of Amazon EC2 instances, called Auto Scaling groups. You can specify the minimum number of instances in each Auto Scaling group, and Auto Scaling ensures that your group never goes below this size. Likewise, you can specify the maximum number of instances in each Auto Scaling group, and Auto Scaling ensures that your group never goes above this size. If you specify the desired capacity, either when you create the group or at any time thereafter, Auto Scaling ensures that your group has that many instances. If you specify scaling policies, then Auto Scaling can launch or terminate instances on demand as your application needs increase or decrease.
For example, the Auto Scaling group shown in Figure 10.6 has a minimum size of one instance, a desired capacity of two instances, and a maximum size of four instances. The scaling policies that you define adjust the number of instances, within your minimum and maximum number of instances, based on the criteria that you specify.
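The group in Figure 10.6 maps directly onto the create_auto_scaling_group parameters. A sketch (group name, launch configuration, AMI, and subnet IDs are placeholders):

```python
def asg_params(name, launch_config, subnet_ids,
               min_size=1, desired=2, max_size=4):
    """Parameters mirroring the Figure 10.6 group: min 1, desired 2, max 4."""
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config,
        "MinSize": min_size,        # Auto Scaling never goes below this
        "DesiredCapacity": desired, # the steady-state target
        "MaxSize": max_size,        # and never above this
        "VPCZoneIdentifier": ",".join(subnet_ids),  # spread across AZs
    }

# With a real client:
#   import boto3
#   autoscaling = boto3.client("autoscaling")
#   autoscaling.create_launch_configuration(
#       LaunchConfigurationName="web-lc",
#       ImageId="ami-0123456789abcdef0", InstanceType="t3.micro")
#   autoscaling.create_auto_scaling_group(
#       **asg_params("web-asg", "web-lc", ["subnet-aaaa", "subnet-bbbb"]))
```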
Auto Scaling is set in motion by Amazon CloudWatch. Amazon CloudWatch is covered in Chapter 9, “Monitoring and Metrics.”
Auto Scaling Components
Launch configuration Launch configuration defines how Auto Scaling should launch your Amazon EC2 instances. Auto Scaling provides you with an option to create a new launch configuration using the attributes from an existing Amazon EC2 instance. When you use this option, Auto Scaling copies the attributes from the specified instance into a template from which you can launch one or more Auto Scaling groups.
Auto Scaling group Your Auto Scaling group uses a launch configuration to launch Amazon EC2 instances. You create the launch configuration by providing information about the image that you want Auto Scaling to use to launch Amazon EC2 instances. The information can be the image ID, instance type, key pairs, security groups, and block device mapping.
Auto Scaling policy An Auto Scaling group uses a combination of policies and alarms to determine when the specified conditions for launching and terminating instances are met. An alarm is an object that watches over a single metric (for example, the average CPU utilization of your Amazon EC2 instances in an Auto Scaling group) over a time period that you specify. When the value of the metric breaches the thresholds that you define over a number of time periods that you specify, the alarm performs one or more actions. An action can be sending messages to Auto Scaling. A policy is a set of instructions for Auto Scaling that tells the service how to respond to alarm messages. These alarms are defined in Amazon CloudWatch (refer to Chapter 9 for more details).
Scheduled action Scheduled action is scaling based on a schedule, allowing you to scale your application in response to predictable load changes. To configure your Auto Scaling group to scale based on a schedule, you need to create scheduled actions. A scheduled action tells Auto Scaling to perform a scaling action at a certain time in the future. To create a scheduled scaling action, you specify the start time at which you want the scaling action to take effect, along with the new minimum, maximum, and desired size you want for that group at that time. At the specified time, Auto Scaling will update the group to set the new values for minimum, maximum, and desired sizes, as specified by your scaling action.
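A scheduled action is created with the PutScheduledUpdateGroupAction API, naming the group, a start time, and the new sizes. A sketch of the parameters (the group name, action name, and times are illustrative):

```python
from datetime import datetime, timezone

def scheduled_action_params(group, name, start, min_size, desired, max_size):
    """Parameters for autoscaling.put_scheduled_update_group_action."""
    return {
        "AutoScalingGroupName": group,
        "ScheduledActionName": name,
        "StartTime": start,          # when the new sizes take effect
        "MinSize": min_size,
        "DesiredCapacity": desired,
        "MaxSize": max_size,
    }

# Scale up ahead of a predictable Monday-morning load spike:
#   import boto3
#   boto3.client("autoscaling").put_scheduled_update_group_action(
#       **scheduled_action_params(
#           "web-asg", "monday-scale-up",
#           datetime(2025, 1, 6, 8, 0, tzinfo=timezone.utc),
#           min_size=2, desired=6, max_size=10))
```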
In order to scale out or back in, your application needs to be stateless. Consider storing session-related information off of the instance. Amazon DynamoDB or Amazon ElastiCache can be used for session state management. For more information on Amazon ElastiCache and Amazon DynamoDB, refer to Chapter 7.
Auto Recovery is an Amazon EC2 feature that is designed to increase instance availability. It allows you to recover supported instances automatically when a system impairment is detected.
To use Auto Recovery, create an Amazon CloudWatch Alarm that monitors an Amazon EC2 instance and automatically recovers the instance if it becomes impaired due to an underlying hardware failure or a problem that requires AWS involvement to repair. Terminated instances cannot be recovered. A recovered instance is identical to the original instance, including the instance ID, private IP addresses, elastic IP addresses, and all instance metadata.
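The recovery alarm watches the StatusCheckFailed_System metric and lists both the EC2 recover action and an Amazon SNS topic in its alarm actions. A sketch of the PutMetricAlarm parameters (the instance ID, region, topic ARN, and evaluation settings are illustrative):

```python
def recover_alarm_params(instance_id, region, sns_topic_arn):
    """Parameters for cloudwatch.put_metric_alarm wiring up Auto Recovery."""
    return {
        "AlarmName": f"recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 2,   # two consecutive failed one-minute checks
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [
            f"arn:aws:automate:{region}:ec2:recover",  # the recover action
            sns_topic_arn,                             # notify subscribers
        ],
    }

# With a real client:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **recover_alarm_params("i-0123456789abcdef0", "us-east-1", topic_arn))
```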
When the StatusCheckFailed_System alarm is triggered and the recover action is initiated, you will be notified by the Amazon SNS topic that you selected when you created the alarm and associated the recover action. During instance recovery, the instance is migrated during an instance reboot, and any data that is in-memory is lost. When the process is complete, information is published to the Amazon SNS topic that you configured for the alarm. Anyone who is subscribed to this Amazon SNS topic will receive an email notification that includes the status of the recovery attempt and any further instructions. You will notice an instance reboot on the recovered instance.
Examples of problems that cause system status checks to fail include the following:
The recover action can also be triggered when an instance is scheduled by AWS to stop or retire due to degradation of the underlying hardware.
In Chapter 7, we covered how to achieve high availability of your database by deploying Amazon Relational Database Service (Amazon RDS) across multiple Availability Zones. You can also improve the performance of a read-heavy database by using Read Replicas to scale your database horizontally. Amazon RDS MySQL, PostgreSQL, and MariaDB can have up to 5 Read Replicas, and Amazon Aurora can have up to 15 Read Replicas.
You can also place your Read Replica in a different AWS Region closer to your users for better performance. Additionally, you can use Read Replicas to increase the availability of your database by promoting a Read Replica to a master for faster recovery in the event of a disaster. Read Replicas are not a replacement for the high availability and automatic failover capabilities that Multi-AZ architectures provide, however. For a refresher on Amazon RDS high availability and Read Replicas, refer to Chapter 7.
In addition to building a highly available application that runs in a single region, your application may require regional fault tolerance. This can be delivered by placing and running infrastructure in another region and then using Amazon Route 53 to load balance the traffic between the regions.
If you need to keep Amazon Simple Storage Service (Amazon S3) data in multiple regions, you can use cross-region replication. Cross-region replication is a bucket-level feature that enables automatic, asynchronous copying of objects across buckets in different AWS Regions. More information on Amazon S3 and cross-region replication is available in Chapter 6, “Storage Systems.”
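Cross-region replication is enabled with a replication configuration on the source bucket: an IAM role that Amazon S3 assumes to copy objects, plus one or more rules naming the destination bucket. A sketch using the original (prefix-based) rule schema; the role and bucket ARNs are placeholders, and versioning must already be enabled on both buckets:

```python
def replication_config(role_arn, dest_bucket_arn):
    """ReplicationConfiguration for s3.put_bucket_replication on the source bucket."""
    return {
        "Role": role_arn,  # role S3 assumes to replicate on your behalf
        "Rules": [{
            "ID": "replicate-everything",
            "Status": "Enabled",
            "Prefix": "",  # empty prefix = replicate all objects
            "Destination": {"Bucket": dest_bucket_arn},
        }],
    }

# With a real client:
#   import boto3
#   boto3.client("s3").put_bucket_replication(
#       Bucket="source-bucket",
#       ReplicationConfiguration=replication_config(
#           "arn:aws:iam::123456789012:role/crr-role",
#           "arn:aws:s3:::destination-bucket"))
```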
Amazon DynamoDB uses DynamoDB Streams to replicate data between regions. An application in one AWS Region modifies the data in an Amazon DynamoDB table. A second application in another AWS Region reads these data modifications and writes the data to another table, creating a replica that stays in sync with the original table.
When you have more than one resource performing the same function (for example, more than one HTTP/S server or mail server), you can configure Amazon Route 53 to check the health of your resources and respond to Domain Name System (DNS) queries using only the healthy resources. For example, suppose your website, Example.com, is hosted on 10 servers, 2 each in 5 regions around the world. You can configure Amazon Route 53 to check the health of those servers and to respond to DNS queries for Example.com using only the servers that are currently healthy.
You can set up a variety of failover configurations using Amazon Route 53 alias, weighted, latency, geolocation routing, and failover resource record sets.
Active-active failover Use this failover configuration when you want all of your resources to be available the majority of the time. When a resource becomes unavailable, Amazon Route 53 can detect that it is unhealthy and stop including it when responding to queries.
Active-passive failover Use this failover configuration when you want a primary group of resources to be available the majority of the time and a secondary group of resources to be on standby in case all of the primary resources become unavailable. When responding to queries, Amazon Route 53 includes only the healthy primary resources. If all of the primary resources are unhealthy, Amazon Route 53 begins to include only the healthy secondary resources in response to DNS queries.
Active-active-passive and other mixed configurations You can combine alias and non-alias resource record sets to produce a variety of Amazon Route 53 behaviors. More information on record types can be found in Chapter 5.
In order for these failover configurations to work, health checks will need to be configured. There are three types of health checks: health checks that monitor an endpoint, health checks that monitor Amazon CloudWatch Alarms, and health checks that monitor other health checks.
The following sections discuss simple and complex failover configurations.
The simplest failover configuration of having two or more resources performing the same function can benefit from health checks. For example, you might have multiple Amazon EC2 servers running HTTP server software responding to requests for your Example.com website. In Amazon Route 53, you create a group of resource record sets that have the same name and type, such as weighted resource record sets or latency resource record sets of type A. You create one resource record set for each resource, and you configure Amazon Route 53 to check the health of the corresponding resource. In this configuration, Amazon Route 53 chooses which resource record set will respond to a DNS query for Example.com and bases the choice in part on the health of your resources.
As long as all of the resources are healthy, Amazon Route 53 responds to queries using all of your Example.com weighted resource record sets. When a resource becomes unhealthy, Amazon Route 53 responds to queries using only the healthy resource record sets for Example.com.
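Each weighted record in such a group carries a weight and, for health-aware answers, a health check ID. A sketch of one record-set change for the configuration above (the hosted zone ID, IP addresses, and health check IDs in the comments are placeholders):

```python
def weighted_a_record(name, ip, weight, set_id, health_check_id):
    """One weighted A record; Route 53 skips it while its health check is failing."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,   # distinguishes records sharing name and type
            "Weight": weight,          # relative share of DNS answers
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
            "HealthCheckId": health_check_id,
        },
    }

# With a real client:
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="Z1EXAMPLE",
#       ChangeBatch={"Changes": [
#           weighted_a_record("example.com.", "192.0.2.10", 10, "server-1", "hc-1"),
#           weighted_a_record("example.com.", "192.0.2.20", 10, "server-2", "hc-2"),
#       ]})
```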
Following are the steps for how you configure Amazon Route 53 to check the health of your resources in this simple configuration and how Amazon Route 53 responds to queries based on the health of your resources.
The following example shows a group of weighted resource record sets in which the third resource record set is unhealthy. Initially, Amazon Route 53 selects a resource record set based on the weights of all three resource record sets. If it happens to select the unhealthy resource record set the first time, Amazon Route 53 selects another resource record set, but this time it omits the weight of the third resource record set from the calculation. See Figure 10.7 for an example of an unhealthy health check.
If you omit a health check from one or more resource record sets in a group of resource record sets, Amazon Route 53 treats those resource record sets as healthy. Amazon Route 53 has no basis for determining the health of the corresponding resource without an assigned health check and might choose a resource record set for which the resource is unhealthy. See Figure 10.8 for more information.
Checking the health of resources in complex configurations works much the same way as in simple configurations. In complex configurations, however, you use a combination of alias resource record sets (including weighted alias, latency alias, and failover alias) and non-alias resource record sets to build a decision tree that gives you greater control over how Amazon Route 53 responds to requests.
For example, you might use latency alias resource record sets to select a region close to a user and use weighted resource record sets for two or more resources within each region to protect against the failure of a single endpoint or an Availability Zone. Figure 10.9 shows this configuration.
An overview of how Amazon EC2 and Amazon Route 53 are configured follows:
When you have multiple resources in a region, you can create weighted or failover resource record sets for your resources. You can also make even more complex configurations by creating weighted alias or failover alias resource record sets that, in turn, refer to multiple resources.
You use the Evaluate Target Health setting for each latency alias resource record set to make Amazon Route 53 evaluate the health of the alias targets—the weighted resource record sets—and respond accordingly.
Figure 10.10 demonstrates the sequence of events that follows:
In this section, we discuss how to make your connections to AWS redundant. In Chapter 5, we discussed the various connectivity options to connect to AWS and, more specifically, your Amazon VPC. We will discuss the Virtual Private Network (VPN) and AWS Direct Connect connectivity options and how to make them highly available.
To get your connectivity up and running quickly, you can implement VPN connections because they are a quick, easy, and cost-effective way to set up remote connectivity to your Amazon VPC. To enable redundancy, each AWS Virtual Private Gateway (VGW) has two VPN endpoints with capabilities for static and dynamic routing. Although statically routed VPN connections from your router (also known as the customer gateway) are sufficient for establishing remote connectivity to an Amazon VPC, this is not a highly available configuration.
The best practice for making VPN connections highly available is to use redundant routers (customer gateways) and dynamic routing for automatic failover between AWS and customer VPN endpoints. In Figure 10.11, each VPN connection, consisting of two IPsec tunnels (one to each VGW endpoint), is drawn as a single line. See Figure 10.11 for a sample active-active VPN connection.
The configuration in Figure 10.11 consists of four fully meshed, dynamically routed IPsec tunnels between both VGW endpoints and two customer gateways. AWS provides configuration templates for a number of supported VPN devices to assist in establishing these IPsec tunnels and configuring Border Gateway Protocol (BGP) for dynamic routing.
In addition to the AWS-provided VPN and BGP configuration details, customers must configure Amazon VPCs to route traffic efficiently to their datacenter networks. In the example in Figure 10.11, the VGW will prefer to send 10.0.0.0/16 traffic to Data Center 1 through Customer Gateway 1 and only reroute this traffic through Data Center 2 if the connection to Data Center 1 is down. Likewise, 10.1.0.0/16 traffic will prefer the VPN connection originating from Data Center 2.
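The route preference just described can be sketched as a longest-prefix match over the prefixes each datacenter advertises. This is a simplified local model, not VGW behavior verbatim; the gateway names mirror Figure 10.11, and in a real deployment BGP would withdraw a route on failure, at which point the remaining advertisement for that prefix would win:

```python
import ipaddress

# Illustrative route table mirroring Figure 10.11: each on-premises prefix
# is preferred via the VPN connection from its own datacenter.
routes = [
    (ipaddress.ip_network("10.0.0.0/16"), "Customer Gateway 1"),
    (ipaddress.ip_network("10.1.0.0/16"), "Customer Gateway 2"),
]

def next_hop(destination, routes):
    """Pick the most specific (longest-prefix) route matching the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in routes if addr in net]
    if not matches:
        raise ValueError(f"no route to {destination}")
    return max(matches, key=lambda m: m[0].prefixlen)[1]
```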
This configuration relies on the Internet to carry traffic between on-premises networks and Amazon VPC. Although AWS leverages multiple Internet Service Providers (ISPs), and even if the customer leverages multiple ISPs, an Internet service disruption can still affect the availability of VPN network connectivity due to the interdependence of ISPs and Internet routing. The only way to control the exact network path of your traffic is to provision private network connectivity with AWS Direct Connect.
If you need to establish private connectivity between AWS and your datacenter, office, or colocation environment, use AWS Direct Connect. More information on AWS Direct Connect can be found in Chapter 5.
AWS Direct Connect uses a dedicated, physical connection in one AWS Direct Connect location. Therefore, to make AWS Direct Connect highly available, consider implementing multiple dynamically-routed AWS Direct Connect connections. Architectures with the highest levels of availability will leverage different AWS Direct Connect partner networks to ensure network-provider redundancy. In Figure 10.12, each AWS Direct Connect connection, a physical connection carrying one or more logical connections, is drawn as a single continuous line.
The configuration depicted in Figure 10.12 consists of AWS Direct Connect connections to separate AWS Direct Connect routers in two locations from two independently configured customer devices. AWS provides example router configurations to assist in establishing AWS Direct Connect connections and configuring BGP for dynamic routing.
In addition to the AWS-provided configuration details, you must configure your Amazon VPCs to route traffic efficiently to the datacenter networks. In the example in Figure 10.12, the VGW will prefer to send 10.0.0.0/16 traffic to Data Center 1 and only reroute this traffic to Data Center 2 if connectivity to Customer Device 1 is down. Likewise, 10.1.0.0/16 traffic will prefer the AWS Direct Connect connection from Data Center 2.
AWS Direct Connect allows you to create resilient connections to AWS because you have full control over the network path and network providers between your remote networks and AWS.
Some AWS customers would like the benefits of one or more AWS Direct Connect connections for their primary connectivity to AWS, combined with a lower-cost backup connection. To achieve this objective, they can establish AWS Direct Connect connections with a VPN backup, as depicted in Figure 10.13.
The configuration depicted in Figure 10.13 consists of two dynamically-routed connections—one using AWS Direct Connect and the other using a VPN connection from two different customer gateways. AWS provides example router configurations to assist in establishing both AWS Direct Connect and VPN connections with BGP for dynamic routing. By default, AWS will always prefer to send traffic over an AWS Direct Connect connection, so no additional configuration is required to define primary and backup connections. In the example in Figure 10.13, both Customer Gateway 1 (VPN) and Customer Device 2 (AWS Direct Connect) advertise a summary route of 10.0.0.0/16. AWS will send all traffic to Customer Device 2 as long as this network path is available.
Although BGP prefers AWS Direct Connect to VPN connections by default, you still have the ability to influence AWS routing decisions by advertising more specific (or static) routes. If, for example, you want to leverage your backup VPN connection for a subset of traffic (for example, developer traffic versus production traffic), you can advertise specific routes from Customer Gateway 1. Refer to Chapter 5 for more information on route preference.
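The interaction between prefix length and connection-type preference can be sketched locally. In this simplified model (the gateway names follow Figure 10.13, and the /24 "developer traffic" prefix is invented for illustration), the longest prefix wins first, and ties are broken in favor of AWS Direct Connect:

```python
import ipaddress

# Advertised routes as (prefix, connection type, gateway); names illustrative.
advertisements = [
    (ipaddress.ip_network("10.0.0.0/16"), "vpn", "Customer Gateway 1"),
    (ipaddress.ip_network("10.0.0.0/16"), "direct_connect", "Customer Device 2"),
    # A more specific prefix advertised over VPN to pull a subset of traffic
    (ipaddress.ip_network("10.0.99.0/24"), "vpn", "Customer Gateway 1"),
]

PREFERENCE = {"direct_connect": 0, "vpn": 1}  # lower value is preferred

def select_path(destination, advertisements):
    """Longest prefix wins; among equal prefixes, Direct Connect beats VPN."""
    addr = ipaddress.ip_address(destination)
    matches = [a for a in advertisements if addr in a[0]]
    best = min(matches, key=lambda a: (-a[0].prefixlen, PREFERENCE[a[1]]))
    return best[2]
```

Traffic to the summary prefix follows the Direct Connect path, while traffic matching the more specific VPN advertisement is steered onto the backup connection, which is exactly the technique described above.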
Disaster recovery (DR) is about preparing for and recovering from a disaster. Any event that has a negative impact on your company’s business continuity or finances could be termed a disaster. This includes hardware or software failure, a network outage, a power outage, physical damage to a building such as fire or flooding, human error, or some other significant event. You should plan how to minimize the impact of a disaster by investing time and resources to plan and prepare, train employees, and document and update processes.
The amount of investment for DR planning for a particular system can vary dramatically depending on the cost of a potential outage. Companies that have traditional physical environments typically must duplicate their infrastructure to ensure the availability of spare capacity in the event of a disaster. The infrastructure needs to be procured, installed, and maintained so that it is ready to support the anticipated capacity requirements. During normal operations, the infrastructure typically is under-utilized or over-provisioned. With Amazon Web Services (AWS), your company can scale up its infrastructure on an as-needed, pay-as-you-go basis. You get access to the same highly secure, reliable, and fast infrastructure that Amazon uses to run its own global network of websites. AWS also gives you the flexibility to change and optimize resources quickly during a DR event, which can result in significant cost savings.
We will be using two common industry terms for disaster planning:
Recovery time objective (RTO) This represents the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). For example, if a disaster occurs at 12:00 PM (noon) and the RTO is eight hours, the DR process should restore the business process to the acceptable service level by 8:00 PM.
Recovery point objective (RPO) This is the acceptable amount of data loss measured in time. For example, if a disaster occurs at 12:00 PM (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 AM. Data loss will span only one hour, between 11:00 AM and 12:00 PM (noon).
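These two definitions reduce to simple time arithmetic, as the following sketch shows, reproducing the noon example from the text:

```python
from datetime import datetime, timedelta

def recovery_targets(disaster_time, rto, rpo):
    """Return (restore-by deadline, oldest point whose data must be recovered)."""
    return disaster_time + rto, disaster_time - rpo

# Disaster at 12:00 PM (noon), RTO of eight hours, RPO of one hour
disaster = datetime(2017, 6, 1, 12, 0)
deadline, recovery_point = recovery_targets(
    disaster, rto=timedelta(hours=8), rpo=timedelta(hours=1))
# The business process must be restored by 8:00 PM, and all data written
# before 11:00 AM must be recoverable; at most one hour of data may be lost.
```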
A company typically decides on an acceptable RTO and RPO based on the financial impact to the business when systems are unavailable. The company determines financial impact by considering many factors, such as the loss of business and damage to its reputation due to downtime and the lack of systems availability.
IT organizations then plan solutions to provide cost-effective system recovery based on the RPO within the timeline and the service level established by the RTO.
In this final section of the chapter, we will discuss the various types of disaster recovery scenarios on Amazon Web Services.
In most traditional environments, data is backed up to tape and sent off-site regularly. If you use this method, it can take a long time to restore your system in the event of a disruption or disaster. Amazon S3 is an ideal destination for backup data that might be needed quickly to perform a restore. Transferring data to and from Amazon S3 is typically done through the network, and it is therefore accessible from any location. There are many commercial and open-source backup solutions that integrate with Amazon S3. You can use AWS Import/Export Snowball to transfer very large datasets by shipping storage devices directly to AWS. For longer-term data storage where retrieval times of several hours are adequate, there is Amazon Glacier, which has the same durability model as Amazon S3. Amazon Glacier is a low-cost alternative to Amazon S3. Amazon Glacier and Amazon S3 can be used in conjunction to produce a tiered backup solution.
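As a hedged illustration of the tiered Amazon S3/Amazon Glacier backup solution mentioned above, the dictionary below follows the shape accepted by the S3 lifecycle API. The rule ID, key prefix, and day counts are illustrative choices, not recommendations:

```python
# Lifecycle rule in the shape accepted by the S3 API
# (put_bucket_lifecycle_configuration); names and numbers are made up.
lifecycle = {
    "Rules": [{
        "ID": "tiered-backups",
        "Filter": {"Prefix": "backups/"},
        "Status": "Enabled",
        # Keep recent backups in Amazon S3 for fast restores, then move
        # older ones to Amazon Glacier for low-cost long-term retention.
        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 365},
    }]
}
# With boto3, this would be applied along the lines of:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-backup-bucket", LifecycleConfiguration=lifecycle)
```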
AWS Storage Gateway enables snapshots of your on-premises data volumes to be transparently copied into Amazon S3 for backup. You can create local volumes or Amazon EBS volumes from these snapshots. Storage-cached volumes allow you to store your primary data in Amazon S3 while keeping your frequently accessed data local for low-latency access. As with other AWS Storage Gateway configurations, you can snapshot the data volumes to provide highly durable backups. In the event of DR, you can restore the cached volumes either to a second site running an AWS Storage Gateway or to Amazon EC2.
You can use the gateway-VTL (Virtual Tape Library) configuration of AWS Storage Gateway as a backup target for your existing backup management software. This can be used as a replacement for traditional magnetic tape backup. For systems running on AWS, you also can back up into Amazon S3. Snapshots of Amazon EBS volumes, Amazon RDS databases, and Amazon Redshift data warehouses can be stored in Amazon S3. Alternatively, you can copy files directly into Amazon S3, or you can choose to create backup files and copy those to Amazon S3. There are many backup solutions that store data directly in Amazon S3, and these can be used from Amazon EC2 systems as well.
Key Steps for the Backup and Restore Method of Disaster Recovery
The term pilot light is often used to describe a DR scenario in which a minimal version of an environment is always running in AWS. The idea of the pilot light is an analogy that comes from a natural gas heater. In a natural gas heater, a small flame or pilot light is always on so it can quickly ignite the entire furnace to heat up the house.
This scenario is similar to a backup-and-restore scenario. For example, with AWS you can maintain a pilot light by configuring and running the most critical core elements of your system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core. Infrastructure elements for the pilot light itself typically include your database servers, which would replicate data to Amazon EC2 or Amazon RDS.
Depending on the system, there might be other critical data outside of the database that needs to be replicated to AWS. This is the critical core of the system (the pilot light) around which all other infrastructure pieces in AWS (the rest of the furnace) can quickly be provisioned to restore the complete system.
Unlike the backup-and-restore method, the pilot light method requires some upfront configuration and deployment prior to the occurrence of a disaster.
Key Preparation Steps for the Pilot Light Method of Disaster Recovery
To recover the rest of your environment around the pilot light, you can start your systems from the Amazon Machine Images (AMIs) within minutes on the appropriate instance types. You can resize your dynamic data servers to handle production volumes as needed or add capacity accordingly. Horizontal scaling is often the most cost-effective and scalable approach to add capacity to a system. For example, you can add more web servers at peak times. However, you can also choose larger Amazon EC2 instance types and thus scale vertically for applications such as relational databases. From a networking perspective, any required DNS updates can be done in parallel.
After recovery, you should ensure that redundancy is restored as quickly as possible. A failure of your DR environment shortly after your production environment fails is unlikely; however, it is possible, so be prepared. Continue to take regular backups of your system, and consider additional redundancy at the data layer. Remember that your DR environment is now your production environment.
The term warm-standby is used to describe a DR scenario in which a scaled-down version of a fully-functional environment is always running in the cloud. A warm-standby solution extends the pilot light elements and preparation. It further decreases the recovery time because some services are always running. By identifying your business-critical systems, you can fully duplicate these systems on AWS and have them always on.
These servers can be running on a minimum-sized fleet of Amazon EC2 instances on the smallest sizes possible. This solution is not scaled to take a full-production load; however, it is fully functional. In a disaster, the system is scaled up quickly to handle the production load. In AWS, this can be done by adding more instances to the load balancer and by resizing the small capacity servers to run on larger Amazon EC2 instance types.
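The scale-up step can be sketched as a simple capacity calculation. The per-instance throughput figures below are invented for illustration; in practice you would derive them from load testing your own application:

```python
import math

def scale_out(current_instances, rps_per_instance, production_rps):
    """Instances the standby fleet needs to absorb the production load,
    never scaling below the fleet that is already running."""
    needed = math.ceil(production_rps / rps_per_instance)
    return max(needed, current_instances)

# e.g. a 2-instance warm-standby fleet, each instance handling
# 500 requests/second, facing a 10,000 requests/second production load
target = scale_out(current_instances=2,
                   rps_per_instance=500,
                   production_rps=10_000)
```

In AWS, reaching this target is typically a matter of raising the desired capacity behind the load balancer (or letting Auto Scaling do so), rather than provisioning the fleet from scratch, which is why warm standby recovers faster than pilot light.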
Just like the pilot light method of disaster recovery, the warm-standby method requires some upfront deployment and preparation prior to the occurrence of a disaster.
Key Preparation Steps for the Warm-Standby Solution for Disaster Recovery
In the case of failure of the production system, the standby environment will be scaled up for production load, and DNS records will be changed to route all traffic to AWS.
Key Recovery Steps for the Warm-Standby Method of Disaster Recovery
A multi-site solution runs in AWS, as well as on your existing on-site infrastructure, in an active-active configuration. The data replication method that you employ will be determined by the recovery point that you choose. Refer to the RPO definition at the beginning of this section.
You should use a DNS service that supports weighted routing, such as Amazon Route 53, to route production traffic to different sites that deliver the same application or service. A proportion of traffic will go to your infrastructure in AWS, and the remainder will go to your on-site infrastructure.
In an on-site disaster situation, you can adjust the DNS weighting and send all traffic to the AWS servers. The capacity of the AWS service can be rapidly increased to handle the full production load. You can use Amazon EC2 Auto Scaling to automate this process.
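A minimal model of this weight adjustment might look like the following; the site names and weights are hypothetical, and real weighted routing is probabilistic per query rather than an exact proportional split:

```python
def route(weights, picks):
    """Distribute DNS resolutions across sites in proportion to their weights."""
    total = sum(weights.values())
    return {site: picks * w // total for site, w in weights.items()}

# Normal operation: a 20/80 split between AWS and the on-site infrastructure
normal = route({"aws": 20, "on_site": 80}, picks=100)

# On-site disaster: set the on-site weight to 0 so every resolution
# returns the AWS endpoint, then scale the AWS fleet to full capacity
failover = route({"aws": 20, "on_site": 0}, picks=100)
```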
You might need some application logic to detect the failure of the primary database services and cut over to the parallel database services running in AWS. The cost of this scenario is determined by how much production traffic is handled by AWS during normal operation.
In the recovery phase, you pay only for what you use for the duration that the DR environment is required at full scale. You can further reduce cost by purchasing Amazon EC2 Reserved Instances for your “always on” AWS servers. Refer to Amazon EC2 purchasing options in Chapter 4.
In a multi-site disaster recovery solution, the second site is a duplicate of your current production site and handles a share of the production traffic during normal operation.
Key Preparation Steps for the Multi-Site Solution Method for Disaster Recovery
When failing over to the available site, DNS records will need to be changed to direct all production traffic to the available production site.
Key Recovery Steps for the Multi-Site Solution Method of Disaster Recovery
Once you have restored your primary site to a working state, you will need to restore your normal service, which is often referred to as a “fail back.” Depending on your DR strategy, this typically means reversing the flow of data replication so that any data updates received while the primary site was down can be replicated back, without the loss of data. Let’s discuss failback from the various methods of failover.
To failback from the disaster recovery site to the now working production site, you will need to follow these key steps:
Key Steps for Failing Back with the Backup and Restore Method of Disaster Recovery
Failback from the disaster recovery site when Pilot Light, Warm-Standby, and Multi-Site are used follows the same methodology to restore service to the former production site.
Key Failback Steps for the Pilot Light, Warm-Standby, and Multi-Site Methods of Disaster Recovery
In this chapter, we discussed how to decouple applications for scalability and high availability. The techniques discussed include using Amazon SQS and Amazon SNS to decouple an application’s front end from its back end, thus reducing their dependencies on one another (so they can scale independently, for example). Regional services such as Amazon S3 and Amazon DynamoDB can replicate their data into other regions to provide a highly available multi-region solution. When building systems, it is a best practice to use the AWS managed services, as they are inherently fault tolerant. For workloads that cannot take advantage of managed services, it is up to you to make them highly available. AWS provides the parts and pieces; you put them together to deliver a highly available or fault-tolerant system. Lastly, prepare for a disaster and develop a recovery plan using the various DR failover scenarios and the failback methods for each.
Understand Amazon SQS polling. Amazon SQS uses short polling by default, querying only a subset of the servers (based on a weighted random distribution) to determine whether any messages are available for inclusion in the response. Long polling eliminates false empty responses by querying all (rather than a limited number) of the servers.
Understand visibility timeout and where it is applied. Immediately after the message is received, it remains in the queue. To prevent other consumers from processing the message again, Amazon SQS sets a visibility timeout, which is a period of time during which Amazon SQS prevents other consuming components from receiving and processing the message.
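A toy model of the visibility timeout (not the real Amazon SQS API) can make this behavior concrete: a received message stays in the queue but is hidden from other consumers until either it is deleted or the timeout elapses:

```python
import time

class ToyQueue:
    """Toy model of an SQS queue's visibility timeout; timestamps in seconds."""
    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = []              # [body, invisible_until]

    def send(self, body):
        self.messages.append([body, 0.0])

    def receive(self, now=None):
        now = time.monotonic() if now is None else now
        for msg in self.messages:
            if msg[1] <= now:           # visible again?
                # Hide the message from other consumers for the timeout window
                msg[1] = now + self.visibility_timeout
                return msg[0]
        return None                     # nothing visible right now

    def delete(self, body):
        # A consumer deletes the message after successful processing
        self.messages = [m for m in self.messages if m[0] != body]
```

If the consumer fails to delete the message before the timeout expires, the message becomes visible again and another consumer will receive it, which is why processing should be idempotent.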
Understand message order with Amazon SQS standard queues. Standard queues provide best-effort ordering, which attempts to preserve the order of messages but does not guarantee it.
Understand First-In, First-Out (FIFO) queues. FIFO queues preserve the exact order in which messages are sent and received.
Understand the various protocols that Amazon SNS supports for its subscribers. These are HTTP/HTTPS, Email/Email-JSON, Amazon SQS, SMS, and AWS Lambda.
Know the difference between Amazon SQS and Amazon SNS. Amazon SQS uses a pull method to receive messages; Amazon SNS uses a push method.
Understand how Amazon DynamoDB replicates data between regions. DynamoDB streams can be used to replicate data from one Amazon DynamoDB table in one region to another Amazon DynamoDB table in another region.
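A minimal sketch of this pattern, using plain dictionaries in place of real tables and a list in place of the stream (in practice, an AWS Lambda function would consume the stream records and write them to the replica region):

```python
# Toy model of cross-region replication driven by a DynamoDB stream:
# writes in the source region emit stream records, which are replayed
# against a replica table in the destination region.
source, replica, stream = {}, {}, []

def put_item(key, item):
    """Write to the source table and emit a stream record."""
    source[key] = item
    stream.append({"eventName": "INSERT", "Key": key, "NewImage": item})

def replicate(stream, replica):
    """Replay stream records against the replica table."""
    for record in stream:
        if record["eventName"] in ("INSERT", "MODIFY"):
            replica[record["Key"]] = record["NewImage"]
        elif record["eventName"] == "REMOVE":
            replica.pop(record["Key"], None)
```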
Understand how to scale an Amazon RDS deployment for high availability. Amazon RDS supports a multi-AZ deployment. Know the Amazon RDS databases that support Read Replicas in other regions.
Have a working knowledge of how Amazon Route 53 can be used to failover to another region. Understand what failover routing is and how it can be used to failover to other regions and to static websites hosted on Amazon S3.
Understand how health checks work with Amazon Route 53. Know the types of health checks and how to set alarms that trigger actions for a given metric.
Know how to set up a highly available VPN. The Virtual Private Gateway (VGW) supports two tunnels. Most routers can function in an active/active pattern with both tunnels; older routers may only handle active/passive configurations.
Understand how to make AWS Direct Connect highly available. To make AWS Direct Connect highly available, deploy two AWS Direct Connect connections, ideally terminating in different AWS Direct Connect locations.
Understand how to use a VPN as a backup to AWS Direct Connect. If you are unable to deploy a second AWS Direct Connect or have a requirement for additional availability, then keep your existing VPN. If you do not have one, then you can deploy a VPN for backup connectivity to your Amazon Virtual Private Cloud (Amazon VPC).
Understand the various types of disaster recovery methods available on AWS. The various types of failover are Backup/Restore, Pilot Light, Warm-Standby, and Multi-Site. Understand how to failover and how to fail back.
By now you should have set up an account in AWS. If you haven’t, now would be the time to do so. It is important to note that these exercises are in your AWS account and thus are not free.
Use the Free Tier when launching resources. The Amazon AWS Free Tier applies to participating services across the following AWS Regions: US East (Northern Virginia), US West (Oregon), US West (Northern California), Canada (Central), EU (London), EU (Ireland), EU (Frankfurt), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), and South America (Sao Paulo). For more information, see https://aws.amazon.com/s/dm/optimization/server-side-test/free-tier/free_np/.
If you have not yet installed the AWS Command Line utilities, refer to Chapter 2, Exercise 2.1 (Linux) or Exercise 2.2 (Windows).
The reference for the AWS CLI can be found at http://docs.aws.amazon.com/cli/latest/reference/.
These exercises will walk you through some high-availability scenarios and will have you research the “how to” in order to complete them. By looking up how to perform a specific task, you will be on your way to mastering it. The goal of this guide isn’t just to prepare you to pass the AWS Certified SysOps Administrator – Associate exam; it is designed to serve as a reference companion in your day-to-day duties as an AWS Certified SysOps Administrator.
You are looking to use Amazon Simple Notification Service (Amazon SNS) to send alerts to members of your operations team. What protocols does Amazon SNS support? (Choose all that apply.)
What is the first step that you must take to set up Amazon Simple Notification Service?
You have just finished setting up your AWS Direct Connect to connect your datacenter to AWS. You have one Amazon Virtual Private Cloud (Amazon VPC). You are worried that the connection is not redundant. Your company is going to implement a redundant AWS Direct Connect connection in the coming months. What can you suggest to management as an intermediate backup to a redundant AWS Direct Connect connection?
Your company has just implemented their redundant AWS Direct Connect connection. What must you do in order to ensure that in case of failure, traffic will route over the redundant AWS Direct Connect link?
You manage an Amazon Simple Storage Service (Amazon S3) bucket for your company’s Legal department. The bucket is called Legal. You were approached by one of the lawyers who is concerned about the availability of their Legal bucket. What could you do to ensure a higher level of availability for the Legal bucket?
You are working with your application development team and have created an Amazon Simple Queue Service (Amazon SQS) standard queue. Your developers are complaining that messages are being processed more than once. What can you do to limit or stop this behavior?
You have a test version of a production application in an Amazon Virtual Private Cloud. The test application is using Amazon Relational Database Service (Amazon RDS) for the database. You have just been mandated to move your production version of the application to AWS. The application owner is concerned about the availability of the database. How can you make Amazon RDS highly available?
You are managing a multi-region application deployed on AWS. You currently use your own DNS. What types of failover does Amazon Route 53 support? (Choose three.)
Amazon Route 53 supports the following types of health checks. (Choose all that apply.)
How do you make your hardware VPN redundant?
You are part of the disaster recovery team, and you are working to create a DR policy. What is the bare minimum type of disaster recovery that you can perform with AWS?
As you are getting more familiar with AWS, you are looking to respond better to disasters that may happen in your datacenter. Traditional backup and restore operations take too long. The longest part of the restore process is the database. What would be a better option?
You are not seeing network traffic flow on the AWS side of your VPN connection to your Amazon VPC. How do you check the status?
You have just inherited an application that has a recovery point objective (RPO) of zero and a recovery time objective (RTO) of zero. What type of disaster recovery method should you deploy?
You have an application with a recovery point objective of zero and a recovery time objective of 20 minutes. What type of disaster recovery method should you use?