CHAPTER SIX

Autoscaling

In Chapter 5, we talked about installing hardware in your on-premises datacenter and deploying it to production. Use of the public cloud obviates the need for this. Further, it obviates the need to maintain the infrastructure and thereby boosts product development agility, which is crucial in an increasingly competitive landscape. For instance, in 2008, Netflix kicked off its migration from on-premises datacenters to Amazon Web Services (AWS), and in February 2016, it announced the completion of the migration of its streaming service to the cloud.1 Migration to the cloud eliminated the cycles spent on hardware procurement and datacenter maintenance, and resulted in higher development agility.

That said, the use of a cloud service at large scale is much more expensive than the use of an in-house datacenter. This calls for techniques that minimize the cost overhead associated with the use of a public cloud without sacrificing its various benefits, such as elasticity. Autoscaling allows capacity to be scaled up automatically when it is needed and scaled back down when it is not. An enterprise can set a minimum required capacity across Availability Zones (AZs)—isolated locations in a given geographic region—to ensure quick accessibility. This helps deliver the best availability and performance for a given cost. For cases in which reserved capacity is purchased, its cost can be amortized by scheduling jobs of noncritical and batch services during off-peak hours.

NOTE

This part of the discussion is based on Arun’s experience at Netflix.

The Challenge

Capitalizing on the elasticity of the cloud efficiently is nontrivial. Specifically, you must be wary of the following:

  • Aggressive scale-down can potentially adversely affect latency and throughput (in the worst case, the service might become unavailable). Higher latency would degrade the experience of the end users. Further, from a corporate standpoint, lower throughput would adversely impact the bottom line (this holds in general for any end-user facing service).

  • Aggressive scale-up can result in overprovisioning, thereby ballooning the footprint on the cloud. Of course, higher operational costs would adversely affect the bottom line.

Figure 6-1 illustrates these caveats. Additionally, efficient exploitation of elasticity of the cloud across multiple applications can contain the overall footprint.

Figure 6-1. Excess versus unserved demand

In the long run, on-demand usage is much more expensive than the use of reserved instances.2 Consequently, it is critical to develop novel techniques to exploit elasticity of the cloud systematically.

For the remainder of this chapter, we discuss the various aspects of autoscaling using Amazon EC2 as a reference public cloud. The key underlying concepts are, however, applicable for autoscaling on any public cloud.

Autoscaling on Amazon EC2

Amazon’s Auto Scaling service lets you launch or terminate EC2 instances automatically (within defined minimum and maximum group sizes) based on user-defined policies, schedules, and health checks. You can use Amazon’s CloudWatch for real-time monitoring of EC2 instances. Metrics such as CPU utilization, latency, and request counts are provided automatically by CloudWatch. Further, you can use CloudWatch to access up-to-the-minute statistics, view graphs, and set alarms (defined here):

Definition 1

An Amazon CloudWatch alarm is an object that watches over a single metric. An alarm can change state depending on the value of the metric. An action is invoked when an alarm changes state and remains in that state for a number of time periods.

You can configure a CloudWatch alarm to send a message to autoscaling whenever a specific metric reaches a threshold value. When the alarm sends the message, autoscaling executes the associated policy on an Auto Scaling group (ASG) to scale the group up or down. Note that an Auto Scaling action is invoked only when the specified metric remains above the threshold value for a number of time periods. This ensures that a scaling action is not triggered by a momentary spike in the metric.

Definition 2

A policy is a set of instructions that tells Auto Scaling how to respond to CloudWatch alarm messages.

Separate policies are instituted for autoscaling up and autoscaling down. The two key parameters associated with an Auto Scaling Policy are the following:

  • ScalingAdjustment: The number of instances by which to scale. AdjustmentType determines the interpretation of this number (e.g., as an absolute number or as a percentage of the existing ASG size). A positive increment adds to the current capacity and a negative value removes from the current capacity.

  • AdjustmentType: This specifies whether the ScalingAdjustment is an absolute number or a percentage of the current capacity. Valid values are ChangeInCapacity or PercentChangeInCapacity (described later in the chapter).

An autoscaling action, say scale-up, usually takes a while to take effect. In light of this, you can specify a cooldown period (defined momentarily) to ensure that a new autoscaling event is triggered only after the previous autoscaling event has completed and taken effect.

Definition 3

Cooldown is the period of time after autoscaling initiates a scaling activity during which no other scaling activity can take place. A cooldown period allows the effect of a scaling activity to become visible in the metrics that originally triggered the activity. This period is configurable and gives the system time to perform and adjust to any new scaling activities (such as scale-in and scale-out) that affect capacity.
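To make the preceding definitions concrete, here is a minimal sketch, using the Python AWS SDK (boto3), that wires a CloudWatch alarm to a simple scale-up policy with a cooldown. The ASG name, metric, threshold, and periods are illustrative assumptions, not values prescribed in this chapter.

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    # Scale-up policy: add 3 instances, then enforce a 5-minute cooldown
    # before another scaling activity can start.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="my-service-asg",   # hypothetical ASG name
        PolicyName="scale-up-on-high-cpu",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=3,
        Cooldown=300,
    )

    # Alarm: average CPU above 60% for 5 consecutive 1-minute periods
    # invokes the policy defined above.
    cloudwatch.put_metric_alarm(
        AlarmName="my-service-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-service-asg"}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=5,
        Threshold=60.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )

Requiring five consecutive periods above the threshold mirrors the note above: a single momentary spike in the metric does not trigger a scaling action.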

On AWS, autoscaling can also be carried out in a temporal fashion, referred to as scheduled scaling. In particular, scaling based on a schedule allows you to scale an application in response to predictable load changes. For instance, if traffic begins to increase on Friday evening and remains high until Sunday evening, you can schedule scaling activities based on the predictable traffic pattern of the web application. To create a scheduled scaling action, you must specify the start time of the scaling action and the new minimum, maximum, and desired sizes for the scaling action. At the specified time, autoscaling updates the group with the minimum, maximum, and desired sizes specified by the scaling action.
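As a sketch of a scheduled scaling action (again with boto3; the group name, sizes, and recurrence are illustrative assumptions), the following raises a group's bounds every Friday evening; a companion action would restore the weekday sizes on Sunday evening.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Recurring action: every Friday at 18:00 UTC, raise the group's
    # minimum, maximum, and desired sizes ahead of the weekend traffic bump.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="my-service-asg",   # hypothetical ASG name
        ScheduledActionName="friday-evening-scale-up",
        Recurrence="0 18 * * 5",                 # cron syntax, evaluated in UTC
        MinSize=12,
        MaxSize=40,
        DesiredCapacity=24,
    )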

At Netflix, we employed scaling by policy, wherein a given cluster was scaled up or down based on the incoming requests per second (RPS) of a given application. We used incoming RPS as the metric to drive autoscaling because it is independent of the application and directly relates to throughput.
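CloudWatch does not know an application's RPS out of the box, so the service must publish it (or a load-balancer metric must be used as a proxy). A minimal sketch of publishing a per-node RPS value as a custom metric with boto3 follows; the namespace, metric name, and dimension are illustrative assumptions.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def publish_rps(asg_name, rps_per_node):
        """Publish the per-node request rate as a custom CloudWatch metric."""
        cloudwatch.put_metric_data(
            Namespace="MyService",               # hypothetical namespace
            MetricData=[{
                "MetricName": "RPSPerNode",
                "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
                "Value": rps_per_node,
                "Unit": "Count/Second",
            }],
        )

    publish_rps("my-service-asg", 212.0)

An alarm on this metric can then drive scale-up and scale-down policies in the same way as the CPU-based example shown earlier.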

Design Guidelines

In this section, we detail the various design guidelines underlying the algorithms for autoscaling discussed later in this chapter.

Avoiding the ping-pong effect

During a scale-up event, new nodes are added to a given ASG. As a consequence, the RPS per node drops. If the RPS per node drops below the threshold specified for scaling down the ASG, a scale-down event is triggered. This results in alternating scale-up and scale-down events, as illustrated in Figure 6-4 (a), referred to as the ping-pong effect. At Netflix, we observed that ping-ponging can potentially result in higher latency and, in the worst case, can cause violation of the Service-Level Agreement (SLA) of the service.

Thus, when defining autoscaling policies, it is imperative to ensure that a policy is not susceptible to the ping-pong effect. The desired autoscaling profile is exemplified by Figure 6-4 (b).

Figure 6-4. (a) Illustrating ping-pong effect (b) Desired autoscaling profile (Y-axis corresponds to the number of nodes in the ASG and X-axis corresponds to time)

Be proactive, not reactive

As mentioned earlier, applications such as the recommendation engine at Netflix take a long time to start. This can be ascribed to a variety of reasons; for example, loading of metadata of Netflix subscribers and precomputation of certain features. For such applications, it is critical to trigger the scale-up event in a proactive fashion, not reactively.

Let us consider the scenario shown in Figure 6-5. The solid arrow in the figure corresponds to the need to scale up a given ASG, as mandated by the SLA and increasing traffic. However, owing to the long application startup time, the scale-up is triggered proactively, as signified by the dashed arrow in the figure. The proactive approach ensures that the ASG is sufficiently provisioned by the time the latency approaches the SLA and that the SLA is never violated!

Figure 6-5. Illustration of scaling in a proactive fashion. Solid arrow signifies the need to scale up (as governed by SLA of the application at hand) and the dashed arrow signifies the corresponding autoscaling event (governed by the start up of the application)

Aggressive upwards, conservative downwards

Delivering the best user experience is critical for business. Thus, you might want to employ an aggressive scale-up policy so as to be able to handle a more-than-expected increase in traffic. Also, an aggressive scale-up approach provides a buffer for an increase in traffic during the cooldown period. In contrast, you might want to employ a conservative scale-down policy so as to be able to handle a slower (than the historical trend) ramp-down of traffic. Aggressive scale-down might accidentally result in under-provisioning, thereby adversely affecting latency and throughput.

Scalability Analysis

Determining the threshold for scale-up is an integral step in defining an autoscaling policy. A low threshold will result in under-utilization of the instances in the ASG; conversely, a high threshold can result in higher latency, thereby degrading the user experience. To this end, load testing is carried out to determine the throughput corresponding to the SLA of the application (see Figure 6-6).

Figure 6-6. Trade-off between latency and throughput (load)
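As a minimal sketch of how the SLA-compliant throughput might be read off load-test results, the function below returns the highest per-node RPS whose measured latency still meets the SLA; the sample measurements and the 100 ms SLA are made up for illustration.

    # Load-test results: (RPS per node, measured p99 latency in ms); illustrative values.
    load_test = [(50, 38), (100, 41), (150, 47), (200, 58), (250, 83), (300, 160)]

    def t_sla(results, sla_ms):
        """Maximum RPS per node at which the measured latency still meets the SLA."""
        feasible = [rps for rps, latency in results if latency <= sla_ms]
        if not feasible:
            raise ValueError("no load level satisfies the SLA")
        return max(feasible)

    T = t_sla(load_test, sla_ms=100)   # -> 250 RPS per node
    T_U = 0.90 * T                     # scale-up threshold, as used later in the chapter
    T_D = 0.50 * T_U                   # scale-down threshold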

Properties

Each scale-up event should satisfy the following:

Property 1

RPS per node after scale up should be more than the scale-down threshold (TD).

Property 1 ensures that scale-up does not induce a ping-pong effect. Likewise, each scale-down event should satisfy the following:

Property 2

RPS per node after scale-down should be less than the scale-up threshold (TU).

Akin to Property 1, Property 2 ensures that scale-down does not induce a ping-pong effect.

Autoscaling by Fixed Amount

In this section, we present a technique for scaling an ASG up/down by a fixed number of instances and as per the guidelines laid out earlier. AdjustmentType for the scaling policy is set to ChangeInCapacity, which is defined as follows:

ChangeInCapacity

This AdjustmentType is used to increase or decrease the capacity by a fixed amount on top of the existing capacity. For instance, let’s assume that the capacity of a given ASG is three and that ChangeInCapacity is set to five. When the policy is executed, autoscaling will add five more instances to the ASG.

Algorithm 1 (shown in the following section) details the parameters and the steps to determine the scaling thresholds (for both scaling up and scaling down). The scale-down value D and the scale-up value U are inputs to the algorithm. The constants 0.90 and 0.50 used in defining TU, TD were determined empirically so as to minimize the impact on user experience and contain ASG under-utilization. Loop L1 in Algorithm 1 corresponds to scaling up an ASG as the incoming traffic increases. Loop L2 in Algorithm 1 scales down an ASG as the incoming traffic decreases.

Algorithm 1—autoscaling up/down by a fixed amount

Input

An application with a specified SLA.

Parameters

D Scale down value

U Scale up value

TD Scale down threshold (RPS per node)

TU Scale up threshold (RPS per node)

Nmin Minimum number of nodes in the ASG

Let T(SLA) return the maximum RPS per node for the specified SLA.

TU ← 0.90 × T(SLA)
TD ← 0.50 × TU

Let Nc and RPSn denote the current number of nodes and the RPS per node, respectively.

L1: /* Scale Up (if RPSn > TU) */
repeat
    N*c ← Nc
    Nc ← Nc + U
    RPSn ← RPSn × N*c / Nc
until RPSn > TU

L2: /* Scale Down (if RPSn < TD) */
repeat
    N*c ← Nc
    Nc ← max(Nmin, Nc − D)
    RPSn ← RPSn × N*c / Nc
until RPSn < TD or Nc = Nmin
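A plain Python rendering of one scale-up and one scale-down step of Algorithm 1 is sketched below; the variable names follow the algorithm, and everything else (including the example numbers) is illustrative.

    def scale_up_fixed(n_c, rps_n, u, t_u):
        """One scale-up event: add U nodes when the RPS per node exceeds TU."""
        if rps_n <= t_u:
            return n_c, rps_n              # threshold not breached; no action
        n_old = n_c
        n_c = n_c + u
        rps_n = rps_n * n_old / n_c        # per-node RPS after adding nodes
        return n_c, rps_n

    def scale_down_fixed(n_c, rps_n, d, t_d, n_min):
        """One scale-down event: remove D nodes when the RPS per node falls below TD."""
        if rps_n >= t_d or n_c <= n_min:
            return n_c, rps_n
        n_old = n_c
        n_c = max(n_min, n_c - d)
        rps_n = rps_n * n_old / n_c
        return n_c, rps_n

    # Example in the spirit of Table 6-1: 6 nodes at 1740 RPS total, U = 3, TU = 230
    print(scale_up_fixed(6, 1740 / 6, u=3, t_u=230))   # -> (9, 193.33...)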

Illustration of Algorithm 1

For a better understanding of Algorithm 1, let’s walk through a case study. The parameters of the algorithm are listed before Tables 6-1 and 6-2. Initially, RPSASG = 500 and Nc = 6. As RPSASG increases to 1540, RPSn approaches TU. An autoscaling up-event is triggered, thereby adding 3 (= U) nodes to the ASG. As RPSASG increases subsequently, further autoscaling up-events are triggered. Note that all the entries in the last column (New RPSn) of Table 6-1 satisfy Property 1.

During scale-down, the initial RPSASG = 5000 and Nc = 18. As RPSASG decreases to 3240, RPSn approaches TD. An autoscaling down-event is triggered, thereby removing 2 (= D) nodes from the ASG. Note that all the entries in the last column (New RPSn) of Table 6-2 satisfy Property 2.

Illustration of Algorithm 1 (D = 2, U = 3, TD = 180, TU = 230)

Table 6-1. Scale Up
# Nodes (current) | Nodes added | RPSASG | RPSn   | Total nodes | New RPSn
6                 | 0           | 500    | 83.33  | 6           |
6                 | 3           | 1740   | 290.00 | 9           | 193.33
9                 | 3           | 2610   | 290.00 | 12          | 217.50
12                | 3           | 3480   | 290.00 | 15          | 232.00
15                | 3           | 4350   | 290.00 | 18          | 241.67
18                | 3           | 5220   | 290.00 | 21          | 248.57
Table 6-2. Scale Down
# Nodes (current) | Nodes removed | RPSASG | RPSn   | Total nodes | New RPSn
18                |               | 5000   | 277.78 | 18          |
18                | 2             | 3240   | 180.00 | 16          | 202.50
16                | 2             | 2880   | 180.00 | 14          | 205.71
14                | 2             | 2520   | 180.00 | 12          | 210.00
12                | 2             | 2160   | 180.00 | 10          | 216.00
10                | 2             | 1800   | 180.00 | 8           | 225.00

Scaling by Percentage

In this section, we present a technique for scaling an ASG up or down by a percentage of its current capacity, as per the guidelines laid out earlier. AdjustmentType for the scaling policy is set to PercentChangeInCapacity, which is defined here:

PercentChangeInCapacity

This AdjustmentType is used to increase or decrease the capacity by a percentage of the desired capacity. For instance, let’s assume that an ASG has 15 instances and a scale-up policy of type PercentChangeInCapacity with the adjustment set to 15. When the policy is run, autoscaling will increase the ASG size by two (15 percent of 15 is 2.25, which is rounded down to 2).

Note that if the PercentChangeInCapacity adjustment computes to a value between 0 and 1, autoscaling rounds it up to 1. If it computes to a value greater than 1, autoscaling rounds it down to the nearest integer (for example, 12.7 becomes 12).
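This rounding rule can be captured in a small helper; the sketch below mirrors the rule as just described (for positive, scale-up adjustments) and is not AWS code.

    import math

    def effective_adjustment(current_capacity, percent):
        """Instances implied by a positive PercentChangeInCapacity adjustment."""
        raw = current_capacity * percent / 100.0
        if 0 < raw < 1:
            return 1                   # values between 0 and 1 are rounded up to 1
        return math.floor(raw)         # values greater than 1 are rounded down

    print(effective_adjustment(15, 15))   # 2.25 -> 2 (the example above)
    print(effective_adjustment(6, 10))    # 0.60 -> 1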

Algorithm 2 details the parameters and the steps to determine the scaling thresholds (for both scaling up and scaling down). The scale-down value D and the scale-up value U (note that both are percentages) are inputs to the algorithm. The constants 0.90 and 0.50 used in defining TU and TD were determined empirically so as to minimize the impact on user experience and contain ASG under-utilization. Loop L1 in Algorithm 2 corresponds to scaling up an ASG as the incoming traffic increases. Loop L2 in Algorithm 2 scales down an ASG as the incoming traffic decreases.

Algorithm 2—autoscaling up/down by a percentage of current capacity

Input

An application with a specified SLA.

Parameters

D Scale down percentage value

U Scale up percentage value

TD Scale down threshold (RPS per node)

TU Scale up threshold (RPS per node)

Nmin Minimum number of nodes in the ASG

Let T(SLA) return the maximum RPS per node for the specified SLA.

TU ← 0.90 × T(SLA)
TD ← 0.50 × TU

Let Nc and RPSn denote the current number of nodes and the RPS per node, respectively.

L1: /* Scale Up (if RPSn > TU) */
repeat
    N*c ← Nc
    Nc ← Nc + max(1, Nc × U/100)
    RPSn ← RPSn × N*c / Nc
until RPSn > TU

L2: /* Scale Down (if RPSn < TD) */
repeat
    N*c ← Nc
    Nc ← max(Nmin, Nc − max(1, Nc × D/100))
    RPSn ← RPSn × N*c / Nc
until RPSn < TD or Nc = Nmin

Illustration of Algorithm 2

For a better understanding of Algorithm 2, let’s again walk through a case study. The parameters of the algorithm are mentioned before Tables 6-3 and 6-4. Nmin is set to 1. Initially, RPSASG = 500 and Nc = 6. As RPSASG increases to 1540 and RPSn approaches TU, an autoscaling up-event is triggered, thereby adding 1 (= max(1, 6 × 10/100)) node to the ASG. As RPSASG increases subsequently, further autoscaling up-events are triggered. Note that all the entries in the last column (New RPSn) of Table 6-3 satisfy Property 1.

During scale-down, the initial RPSASG = 5000 and Nc = 18. As RPSASG decreases to 4140, RPSn approaches TD. An autoscaling down-event is triggered, thereby removing 1 (= ⌊max(1, 18 × 8/100)⌋) node from the ASG. Note that all the entries in the last column (New RPSn) of Table 6-4 satisfy Property 2.

Illustration of Algorithm 2 (D = 8, U = 10, Nmin = 1, TD = 230, TU = 290)

Table 6-3. Scale Up
# Nodes (current) | Nodes added | RPSASG | RPSn   | Total nodes | New RPSn
6                 | 0           | 500    | 83.33  | 6           |
6                 | 1           | 1740   | 290.00 | 7           | 248.57
7                 | 1           | 2030   | 290.00 | 8           | 253.75
8                 | 1           | 2320   | 290.00 | 9           | 257.78
9                 | 1           | 2610   | 290.00 | 10          | 261.00
10                | 1           | 2900   | 290.00 | 11          | 263.64
11                | 1           | 3190   | 290.00 | 12          | 265.83
12                | 1           | 3480   | 290.00 | 13          | 267.69
13                | 1           | 3770   | 290.00 | 14          | 269.29
14                | 1           | 4060   | 290.00 | 15          | 270.67
15                | 1           | 4350   | 290.00 | 16          | 271.88
16                | 1           | 4640   | 290.00 | 17          | 272.94
17                | 1           | 4930   | 290.00 | 18          | 273.89
18                | 1           | 5220   | 290.00 | 19          | 274.74
Table 6-4. Scale Down
# Nodes (current) | Nodes removed | RPSASG | RPSn   | Total nodes | New RPSn
18                |               | 5000   | 277.78 | 18          |
18                | 1             | 4140   | 230.00 | 17          | 243.53
17                | 1             | 3910   | 230.00 | 16          | 244.38
16                | 1             | 3680   | 230.00 | 15          | 245.33
15                | 1             | 3450   | 230.00 | 14          | 246.43
14                | 1             | 3220   | 230.00 | 13          | 247.69
13                | 1             | 2990   | 230.00 | 12          | 249.17
12                | 1             | 2760   | 230.00 | 11          | 250.91
11                | 1             | 2530   | 230.00 | 10          | 253.00
10                | 1             | 2300   | 230.00 | 9           | 255.56
9                 | 1             | 2070   | 230.00 | 8           | 258.75
8                 | 1             | 1840   | 230.00 | 7           | 262.86
7                 | 1             | 1610   | 230.00 | 6           | 268.33

Upon comparing the illustrations of Algorithms 1 and 2, we note that the threshold values TU and TD are higher in the case of the latter. This boosts hardware utilization and reduces the footprint on the cloud.

Startup Time Aware Scaling

In this section, we extend Algorithm 2 to guide autoscaling for applications with long startup times. Long application startup times can be ascribed to a variety of reasons; for example, loading of metadata. As discussed earlier, in the presence of long startup times, autoscaling up needs to be done proactively. For this, we employ the following steps:

  • For a historical time–series of RPS in production, determine the change in RPS over every Astart minutes, where Astart denotes the application startup time. This would yield a time–series with these data points:

    • RPSAstart − RPS0

    • RPSAstart+1 − RPS1

    • RPSAstart+2 − RPS2

    • RPSAstart+3 − RPS3

    • ...

    where RPSt denotes the RPS at time t. The derived time–series, referred to as rolling RPS change, captures the change in RPS in any window of width Astart minutes.

  • Compute the 99th percentile of the rolling time–series, denoted by RRPS.

  • Compute γ = TU − RRPS. The parameter γ is the effective threshold for scale-up. The use of the 99th percentile of the rolling RPS change time–series is consistent with the Aggressive Upwards guideline outlined earlier. (These steps are sketched in code after this list.)
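A minimal sketch of the three steps above, assuming the historical RPS series is available as a per-minute Python list; the nearest-rank percentile below is a simplification.

    def rolling_rps_change(rps_series, a_start):
        """Change in RPS over every window of width a_start minutes."""
        return [rps_series[t + a_start] - rps_series[t]
                for t in range(len(rps_series) - a_start)]

    def percentile(values, p):
        """Nearest-rank percentile (0 < p <= 100)."""
        ranked = sorted(values)
        k = max(0, int(round(p / 100.0 * len(ranked))) - 1)
        return ranked[k]

    def effective_scale_up_threshold(rps_series, a_start, t_u):
        """gamma = TU - RRPS, where RRPS is the 99th percentile of the rolling change."""
        r_rps = percentile(rolling_rps_change(rps_series, a_start), 99)
        return t_u - r_rps

    # For the Figure 6-7 example, a_start = 30 and TU = 14 yield gamma of about 12.9.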

The top of Figure 6-7 shows an example RPS time–series (with one-minute granularity) of an application. The startup time of the application was 30 minutes. The corresponding rolling RPS time–series is shown at the bottom of Figure 6-7. The 99th percentile of the rolling time–series is 1.109.

Figure 6-7. Rolling change in RPSn for an application startup time of 30 minutes

Algorithm 3 details the parameters and the steps to determine the scaling thresholds (for both scaling up and scaling down). The scale-down value D and the scale-up value U are inputs to the algorithm. The constants 0.90 and 0.50 used in defining TU and TD were determined empirically so as to minimize the impact on user experience and contain ASG under-utilization. Loop L1 in Algorithm 3 corresponds to scaling up an ASG as the incoming traffic increases. Loop L2 in Algorithm 3 scales down an ASG as the incoming traffic decreases. Unlike the scale-up threshold, the scale-down threshold TD need not be adjusted, because applications do not induce a long delay during termination of instances on Amazon’s EC2.

Algorithm 3—application start up aware autoscaling up/down by a percentage of current capacity

Input

An application with a specified SLA.

Parameters

D Scale down percentage value

U Scale up percentage value

Astart Application start up time (mins)

TD Scale down threshold (RPS per node)

TU Scale up threshold (RPS per node)

Nmin Minimum number of nodes in the ASG

Let T(SLA) return the maximum RPS per node for the specified SLA.

TU ← 0.90 × T(SLA)
TD ← 0.50 × TU

Let Nc and RPSn denote the current number of nodes and the RPS per node, respectively.

Transform the RPS time series into a rolling Astart (min) change time series.
Let RRPS denote the 99th percentile of the rolling time series.
Let γ = TU − RRPS.

L1: /* Scale Up (if RPSn > γ) */
repeat
    N*c ← Nc
    Nc ← Nc + max(1, Nc × U/100)
    RPSn ← RPSn × N*c / Nc
until RPSn > TU

L2: /* Scale Down (if RPSn < TD) */
repeat
    N*c ← Nc
    Nc ← max(Nmin, Nc − max(1, Nc × D/100))
    RPSn ← RPSn × N*c / Nc
until RPSn < TD or Nc = Nmin

Illustration of Algorithm 3

For a better understanding of Algorithm 3, let’s walk through one more case study (refer to Tables 6-5 and 6-6). The RPS and the rolling RPS change time–series for the application are shown at the top and bottom of Figure 6-7, respectively. The parameters of the algorithm are mentioned before Tables 6-5 and 6-6. Nmin is set to 1. Initially, RPSASG = 800, Nc = 170, and γ = 12.9. As RPSASG increases to 2193, RPSn approaches γ. An autoscaling up-event is triggered, thereby adding 25 (= ⌊max(1, 170 × 15/100)⌋) nodes to the ASG. As RPSASG increases subsequently, further autoscaling up-events are triggered. Note that all the entries in the last column (New RPSn) of Table 6-5 satisfy Property 1.

During scale-down, the initial RPSASG = 4400 and Nc = 389. As RPSASG decreases to 3890, RPSn approaches TD. An autoscaling down-event is triggered, thereby removing 38 (= ⌊max(1, 389 × 10/100)⌋) nodes from the ASG. Note that all the entries in the last column (New RPSn) of Table 6-6 satisfy Property 2.

Illustration of Algorithm 3 (D = 10, U = 15, Nmin = 1, Astart = 30, RRPS = 1.1, TD = 10, TU = 14)

Table 6-5. Scale Up
# Nodes (current) | Nodes added | RPSASG | RPSn  | γ = TU − RRPS | Total nodes | New RPSn
170               | 0           | 800    | 4.71  | 12.9          | 170         |
170               | 25          | 2193   | 12.90 | 12.9          | 195         | 11.25
195               | 29          | 2515.5 | 12.90 | 12.9          | 224         | 11.23
224               | 33          | 2889.6 | 12.90 | 12.9          | 257         | 11.24
257               | 38          | 3315.3 | 12.90 | 12.9          | 295         | 11.24
295               | 44          | 3805.5 | 12.90 | 12.9          | 339         | 11.23
339               | 50          | 4373.1 | 12.90 | 12.9          | 389         | 11.24

(At 389 nodes, the next scale-up event would trigger at RPSASG = 5018.1.)
Table 6-6. Scale Down
# Nodes (current) | Nodes removed | RPSASG | RPSn  | Total nodes | New RPSn
389               |               | 4400   | 11.31 | 389         |
389               | 38            | 3890   | 10.00 | 351         | 11.08
351               | 35            | 3510   | 10.00 | 316         | 11.11
316               | 31            | 3160   | 10.00 | 285         | 11.09
285               | 28            | 2850   | 10.00 | 257         | 11.09
257               | 25            | 2570   | 10.00 | 232         | 11.08
232               | 23            | 2320   | 10.00 | 209         | 11.10

(At 209 nodes, the next scale-down event would trigger at RPSASG = 2090.)

Potpourri

There have been cases wherein the CPU utilization on production nodes spiked without any increase in traffic. This can happen owing to a variety of accidental events. To handle such cases, instituting add-on scale-up policies (i.e., besides the scale-up policy based on RPS), as exemplified in Figure 6-8, helps to mitigate the impact on the end users.

Figure 6-8. Add-on policies to check “meltdown”

In July 2015, AWS introduced new scaling policies with steps. For example, you can specify different responses for different levels of average CPU utilization, say <50%, [50%, 60%), [60%, 80%), and ≥80%. Further, if you create multiple-step scaling policies for the same resource (perhaps based on CPU utilization and inbound network traffic) and both of them fire at approximately the same time, autoscaling will look at both policies and choose the one that results in the change of the highest magnitude.
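A sketch of such a step scaling policy with boto3 follows; the ASG name, CPU bands, and adjustments are illustrative assumptions, with the bands corresponding to an alarm threshold of 50% average CPU utilization.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Step policy: the further the metric is above the alarm threshold (50% CPU),
    # the larger the capacity increase.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="my-service-asg",    # hypothetical ASG name
        PolicyName="stepped-scale-up-on-cpu",
        PolicyType="StepScaling",
        AdjustmentType="PercentChangeInCapacity",
        MetricAggregationType="Average",
        EstimatedInstanceWarmup=300,
        StepAdjustments=[
            # [50%, 60%): grow by 10%; [60%, 80%): grow by 20%; >= 80%: grow by 40%
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 10, "ScalingAdjustment": 10},
            {"MetricIntervalLowerBound": 10, "MetricIntervalUpperBound": 30, "ScalingAdjustment": 20},
            {"MetricIntervalLowerBound": 30, "ScalingAdjustment": 40},
        ],
    )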

In certain scenarios, you might want to protect certain instances in an ASG from termination. For example, an instance might be handling a long-running work task, perhaps pulled from an SQS queue; protecting the instance from termination avoids wasted work. In a similar vein, an instance might serve a special purpose within the group; for example, it could be the master node of a Hadoop cluster, or a “canary” that flags the entire group of instances as up and running. To this end, you can use the instance protection feature offered by AWS. In most cases, at least one instance in an ASG should be left unprotected; if all of the instances are protected, no scale-in action can be taken.
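A minimal sketch of enabling scale-in protection for a specific instance with boto3; the ASG name and instance ID are placeholders.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Protect the special-purpose instance (e.g., a master node) from scale-in;
    # the rest of the group remains eligible for termination.
    autoscaling.set_instance_protection(
        AutoScalingGroupName="my-service-asg",      # hypothetical ASG name
        InstanceIds=["i-0123456789abcdef0"],        # placeholder instance ID
        ProtectedFromScaleIn=True,
    )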

Leading companies such as Netflix and Facebook have been using autoscaling to improve cluster performance and service availability, and to reduce costs. In a post on its engineering blog, Facebook shared the following:

...a particular type of web server at Facebook consumes about 60 watts of power when it’s idle (0 RPS, or requests-per-second). The power consumption jumps to 130 watts when it runs at low-level CPU utilization (small RPS). But when it runs at medium-level CPU utilization, power consumption increases only slightly to 150 watts. Therefore, from a power-efficiency perspective, we should try to avoid running a server at low RPS and instead try to run at medium RPS.


Besides RPS, other metrics have been used for autoscaling: CPU utilization, memory usage, disk I/O bandwidth, network link load, peak workload, jobs in progress, service rate, the number of concurrent users, the number of active connections, jitter, delay, and the average response time per request. Regression modeling has been employed to predict the amount of resources needed for the actual workload and, possibly, to retract overprovisioned resources. Likewise, several other approaches have been employed for resource-demand prediction, such as prediction based on changes in the request arrival rate (i.e., the slope of the workload). You can employ sensitivity analysis to characterize the different types of inputs and determine the types of resources that have the highest impact on the throughput (or the performance metric of interest) of the application. Subsequently, you can set up multiple autoscaling rules based on one or more resource types.

In recent years, the use of containers has received wide attention. Amazon EC2 Container Service (ECS), Google Container Engine, and Microsoft Azure Container Service are the most popular public container services. Multi-AZ clusters make the ECS infrastructure highly available, thereby providing a safeguard against potential zone failure. The AZ-aware ECS scheduler manages, scales, and distributes tasks across the cluster, thus making the architecture highly available. On AWS, akin to EC2 instances, autoscaling policies can also be defined for ECS instances. You can use the approaches discussed earlier in this chapter in the context of autoscaling container instances as well.

Advanced Approaches

Given the importance of exploiting the elasticity of the cloud in the best possible fashion, several advanced techniques have been proposed for autoscaling, in both industry and academia. For instance, Facebook employed classic control theory and a proportional-integral (PI) controller to achieve fast reaction times. Netflix developed two prediction algorithms—one of which is an augmented linear regression–based algorithm, the other based on the Fast Fourier Transform (FFT). One of the key highlights of the Netflix approach is that it is predictive. Specifically, the approach learns the request pattern from historical data and subsequently drives the scale-up or scale-down action. Both approaches have been deployed in production environments. Given that no comparative analysis has been presented, it is difficult to assess how these techniques fare against the techniques proposed previously.

Many other approaches for autoscaling have been proposed based on control theory, queuing theory, fuzzy logic, neural networks, reinforcement learning, support vector machines, wavelet transform, regression splines, pattern matching, Kalman filters, sliding window, proportional thresholding, second-order regression, histograms, time–series models, the secant method, voting systems, and look-ahead control. In most cases, these techniques “learn” from past traffic patterns and resource usage and hence are unable to adapt to any new pattern that might appear as a result of the dynamic nature of web traffic.

NOTE

For more information, refer to the section “Readings”.

In practice, the applicability of autoscaling approaches based on the preceding techniques is limited owing to a wide variety of reasons. For instance, reinforcement learning–based approaches require a long time to learn and adapt only to slowly changing conditions; therefore, you cannot apply such techniques to real applications that usually experience sudden traffic bursts. In a similar vein, queuing theory–based approaches impose strong assumptions that are typically not valid for real, complex systems. Besides, such approaches are intended for stationary scenarios, and hence you will need to recalculate the queuing model whenever the conditions of the application change. In the case of control theory–based approaches, determining the gain parameters is nontrivial.

Summary

At times, we observe spikes in incoming traffic. This can happen due to a variety of reasons. For instance, at the end of events such as the Super Bowl, you would observe (as expected) a sudden rise in incoming traffic (e.g., in the number of tweets). Figure 6-9 presents an example traffic profile with spikes. State-of-the-art autoscaling techniques do not fare well against such spikes.

Figure 6-9. Spikes in load in production

Akin to the preceding, “burstiness” in the workload at finer timescales (on the order of seconds) can potentially adversely affect the efficacy of autoscaling techniques, as demonstrated in Figure 6-10.

Figure 6-10. A bursty workload

In the under-provisioning scenario, fine-scale burstiness can potentially cause an increased queuing effect and a high request defection rate, thereby resulting in increased SLA violations. On the other hand, in the over-provisioning scenario, fine-scale burstiness can potentially result in reduced resource utilization at the application server. Thus, as a community, we need to build support for fine-grained monitoring and develop more agile, adaptive policies to guarantee effective elasticity under fine-scale burstiness.

Autoscaling a service down independently of the traffic upstream can potentially result in meltdowns. Thus, it is critical to develop autoscaling techniques that capture the interaction between different services in a Service-Oriented Architecture (SOA). Outages in the cloud and in datacenters (see “Resources”) have become increasingly frequent. One way to minimize the impact of outages is to extend the SOA to span multiple Infrastructure as a Service (IaaS) vendors. This would in turn call for extending the techniques proposed in this chapter to be vendor-aware.

Readings

  1. http://docs.rightscale.com/cm/dashboard/manage/arrays/arrays_actions.html#set-up-autoscaling-using-voting-tags

  2. A. Ilyushkin et al. (2017). An Experimental Performance Evaluation of Autoscaling Policies for Complex Workflows.

  3. A. V. Papadopoulos et al. (2016). PEAS: A Performance Evaluation Framework for Auto-Scaling Strategies in Cloud Applications.

  4. M. Grechanik et al. (2016). Enhancing Rules For Cloud Resource Provisioning Via Learned Software Performance Models.

  5. C. Qu et al. (2016). A reliable and cost-efficient auto-scaling system for web applications using heterogeneous spot instances.

  6. L. Zheng et al. (2015). How to Bid the Cloud.

  7. A. N. Toosi et al. (2015). SipaaS: Spot instance pricing as a Service framework and its implementation in OpenStack.

  8. W. Guo et al. (2015). Bidding for Highly Available Services with Low Price in Spot Instance Market.

  9. S. Islam et al. (2015). Evaluating the impact of fine-scale burstiness on cloud elasticity.

  10. V. R. Messias et al. (2015). Combining time series prediction models using genetic algorithm to autoscaling Web applications hosted in the cloud infrastructure.

  11. M. Beltran. (2015). Defining an Elasticity Metric for Cloud Computing Environments.

  12. A. Y. Nikravesh et al. (2015). Towards an autonomic auto-scaling prediction system for cloud resource provisioning.

  13. M. Barati and S. Sharifian. (2015). A hybrid heuristic-based tuned support vector regression model for cloud load prediction.

  14. P. Padala et al. (2014). Scaling of Cloud Applications Using Machine Learning.

  15. T. Lorido-Botran et al. (2014). A Review of Auto-scaling Techniques for Elastic Applications in Cloud Environments.

  16. H. Alipour et al. (2014). Analyzing auto-scaling issues in cloud environments.

  17. H. Fernandez et al. (2014). Autoscaling Web Applications in Heterogeneous Cloud Infrastructures.

  18. N. R. Herbst et al. (2013). Self-adaptive workload classification and forecasting for proactive resource provisioning.

  19. E. Barrett et al. (2012). Applying reinforcement learning towards automating resource allocation and application scalability in the cloud.

  20. D. Villegas et al. (2012). An analysis of provisioning and allocation policies for infrastructure-as-a-service clouds.

  21. S. Islam et al. (2012). How a consumer can measure elasticity for cloud platforms.

  22. S. Islam et al. (2012). Empirical Prediction Models for Adaptive Resource Provisioning in the Cloud.

  23. R. Han et al. (2012). Lightweight Resource Scaling for Cloud Applications.

  24. X. Dutreilh et al. (2011). Using Reinforcement Learning for Autonomic Resource Allocation in Clouds: towards a fully automated workflow.

  25. N. Roy et al. (2011). Efficient Autoscaling in the Cloud Using Predictive Models for Workload Forecasting.

  26. W. Iqbal et al. (2011). Adaptive resource provisioning for read intensive multi-tier applications in the cloud.

  27. M. Mao and M. Humphrey. (2011). Auto-scaling to minimize cost and meet application deadlines in cloud workflows.

  28. Zhiming Shen et al. (2011). Cloudscale: Elastic resource scaling for multi-tenant cloud systems.

  29. P. Lama and X. Zhou. (2010). Autonomic Provisioning with Self-Adaptive Neural Fuzzy Control for End-to-end Delay Guarantee.

  30. E. Caron et al. (2010). Forecasting for Cloud computing on-demand resources based on pattern matching.

  31. Z. Gong et al. (2010). PRESS: PRedictive elastic resource scaling for cloud systems.

  32. S. Meng et al. (2010). Tide: Achieving self-scaling in virtualized datacenter management middleware.

  33. S. Yi et al. (2010). Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud.

  34. H. C. Lim et al. (2009). Automated control in cloud computing: Challenges and opportunities.

  35. E. Kalyvianaki et al. (2009). Self-adaptive and self-configured CPU resource provisioning for virtualized servers using Kalman filters.

  36. B. Urgaonkar et al. (2008). Agile dynamic provisioning of multi-tier internet applications.

Resources

  1. “Delta Meltdown Reflects Problems with Aging Technology.” (2016). http://on.wsj.com/2wCcsPq.

  2. “Southwest Outage, Canceled Flights Cost an Estimated $54M.” (2016). https://bloom.bg/2wsoJp4

  3. “WhatsApp apologises as service crashes on New Year’s Eve: Users worldwide unable to connect as messaging app goes offline.” (2015). http://dailym.ai/2vtE6J3.

  4. “Google Docs Outage Further Saps Friday Productivity.” (2015). http://on.wsj.com/2vtEbMR.

  5. “Slack outage cues massive freakout, but it’s significant for more than that.” (2015). http://mashable.com/2015/11/23/slack-down-reactions/.

  6. “AWS Outage.” (2012). http://aws.amazon.com/message/67457/.

  7. “Twitter Is Down, Again.” (2012). http://tcrn.ch/2wK5Ihz.

  8. “Twitter Outage.” (2012). http://bit.ly/twitter-outage-2012.

  9. “Google Talk Is Down: Worldwide Outage Since 6:50 AM EDT.” (2012). http://tcrn.ch/2vjNIqH.

  10. “AWS Outage.” (2011). http://aws.amazon.com/message/65648/.

  11. “Twitter Outage.” (2011). http://status.twitter.com/post/2369720246/streaming-outage.

  12. “Time is Money: The Value of On-Demand.” by Joe Weinman (2011). http://joeweinman.com/Resources/Joe_Weinman_Time_Is_Money.pdf

  13. “Lightning Strike Triggers Amazon EC2 Outage.” (2009). http://www.datacenterknowledge.com/archives/2009/06/11/lightning-strike-triggers-amazon-ec2-outage/.

  14. “Outage for Amazon Web Services.” (2009). http://www.datacenterknowledge.com/archives/2009/07/19/outage-for-amazon-web-services/.

  15. “Brief Power Outage for Amazon Data Center.” (Dec. 2009). http://www.datacenterknowledge.com/archives/2009/12/10/power-outage-for-amazon-data-center/.

  16. “Major Outage for Amazon S3 and EC2.” (Feb. 2008). http://www.datacenterknowledge.com/archives/2008/02/15/major-outage-for-amazon-s3-and-ec2/.

  17. “Amazon EC2 Outage Wipes Out Data.” (Oct. 2007). http://www.datacenterknowledge.com/archives/2007/10/02/amazon-ec2-outage-wipes-out-data/.

  18. “List of web host service outages.” http://bit.ly/list-host-outages.

1 “Completing the Netflix Cloud Migration.” (2016) https://media.netflix.com/en/company-blog/completing-the-netflix-cloud-migration

2 We encourage you to compare the prices of Reserved and On-Demand instances on AWS at https://aws.amazon.com/ec2/pricing/.

