12 Cost of cloud computing

This chapter covers

  • Investigating cost drivers for cloud computing
  • Comparing cost-optimization practices
  • Implementing tests for cost-compliant IaC
  • Calculating an estimate for infrastructure cost

When you use a cloud provider, it’s easy to get excited about the ease of provisioning. After all, you can create a resource with the click of a mouse or a single command. However, the cost of cloud computing becomes a concern as your organization scales and grows. Your updates to infrastructure as code can affect the overall cost of the cloud!

We must build cost considerations into our infrastructure just as we build security into it. If you build your system and then discover you overran your budget, you could break the system while removing resources to reduce your cloud computing bill. In chapter 8, I recommended baking security into IaC like a cake. The cost of ingredients affects how many cakes we can bake, which is essential to know before you start.

This chapter covers the practices you can combine with IaC to manage the cost of cloud computing and reduce unused resources. You will find some high-level, general cost-control practices and patterns that I describe in the context of IaC. However, regularly apply these practices to re-optimize costs as your system evolves based on customer demand, organizational scale, and cloud provider billing.

What about the cost of my data center or managed services?

I focus primarily on cloud computing because of its flexibility and on-demand billing. You often account for the cost of data center computing with your organization’s chargeback system. Each business unit establishes a budget for its data center resources, and the technology department issues chargeback based on resource usage, taking into account the operating cost of the data center.

You can always apply the practices for managing your cost drivers with cost control and estimation. However, the techniques I outline for cost reduction and optimization may or may not work for every situation (no matter whether you’re using the cloud, data center, or managed service). Depending on your scale, geographic architecture, business domain, or data center usage, your use cases and systems may require specialized assessment or re-platforming.

12.1 Manage cost drivers

Say you work as a consultant for a company that needs to migrate its platform that supports conferences and events to the public cloud. The company asks you to “lift and shift” its configuration in its data center to the public cloud. You help the company’s teams build out infrastructure in GCP, applying all of the principles and practices you learned from this book. Eventually, your team rolls out the platform on GCP and successfully supports its first customer: a small three-hour community conference.

A few weeks after the event, your client schedules a cryptic meeting. When the meeting starts, the client shows you their cloud bill. It totals over $10,000 for the development and support of a single three-hour conference! The finance team does not seem happy with the cost, especially since the company lost money running the conference. You get your next task: reduce the cost per conference as much as possible.

Are you using an actual cloud bill?

The preceding example uses a fictitious, very simplified cloud bill that approximates the cost of a conference platform service based on GCP’s pricing calculator (https://cloud.google.com/products/calculator) as of 2021. The estimates may not include all of the offerings you need, updated pricing for the platform, differences between environments, or the sizes you might use in a comparable system. I rounded the subtotals to streamline the example.

If you run the example, you may reach a GCP quota for the N2D machine type instances. The servers will exceed the platform’s free tier! You can change the machine type to free-tier instances to run the examples without a charge.

Thanks to the tagging practices you borrowed from chapter 8, the bill uses the tag to identify which resources belong to the community conference and their environment. You manage to break down your cloud computing bill, as shown in table 12.1, and identify the costs by the type and size of the infrastructure resource.

Table 12.1 Your cloud bill by offering and the environment

Offering                            Subtotal, testing   Subtotal, production   Subtotal
                                    environment         environment
Compute (servers)                   $400                $3,600                 $4,000
Database (Cloud SQL)                $250                $2,250                 $2,500
Messaging (Pub/Sub)                 $100                $900                   $1,000
Object storage (Cloud Storage)      $100                $900                   $1,000
Data transfer (networking egress)   $100                $900                   $1,000
Other (Cloud CDN, Support)          $50                 $450                   $500
Total                               $1,000              $9,000                 $10,000

AWS and Azure equivalents

The cloud bill mostly abstracts the specific names for equivalent Azure and AWS services. For clarity, I list some of the GCP offerings with approximations to AWS or Azure ones:

  • Database (Cloud SQL)—Amazon RDS, Azure SQL Database

  • Messaging (Pub/Sub)—Amazon Simple Queue Service (SQS) and Simple Notification Service (SNS), Azure Service Bus

  • Object storage (Cloud Storage)—Amazon S3, Azure Blob Storage

  • Other (Cloud CDN, Support)—Amazon CloudFront, Azure Content Delivery Network (CDN)

Separating cost by offerings and environments helps you identify which factors contribute to the cost and where you should investigate further. To start reducing the bill, you must determine the cost drivers, the factors or activities that affect the total cost.

Definition Cost drivers are the factors or activities that affect your total cloud computing cost.

When you assess cost drivers, calculate the percentage cost of cloud offerings. Some offerings will always cost more than others. You can still use the breakdown to help you identify services to optimize. Breaking down cost by environment helps you identify the footprint of testing versus production environments. Comparing the two will give you a better picture of which environment has inefficiencies you can reduce.

Based on your breakdown, you calculate the percentage for each offering and environment. In figure 12.1, you chart out that compute resources take 40% of the bill. You also discover that the team spent 10% of the total on the testing environment and 90% on the production environment.

Figure 12.1 The resource tags break down the cost by service and environment.
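
As a quick sketch, you could compute those percentages from the subtotals in table 12.1 (the dictionary keys are shorthand labels I made up for this example, not API identifiers):

bill = {
    'compute': 4000, 'database': 2500, 'messaging': 1000,
    'object_storage': 1000, 'data_transfer': 1000, 'other': 500
}
environments = {'testing': 1000, 'prod': 9000}

total = sum(bill.values())                # $10,000
offering_percent = {
    offering: subtotal / total * 100      # compute takes 40%
    for offering, subtotal in bill.items()
}
environment_percent = {
    env: subtotal / total * 100           # 10% testing, 90% production
    for env, subtotal in environments.items()
}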

A large part of the bill goes to compute resources—specifically, servers. If your team members need to support an even larger conference, they need to control the type of resources they create and to optimize the resource size based on usage. You decide to investigate methods of controlling the size and types of servers the team can use.

12.1.1 Implement tests to control cost

You examine the metrics for the conference and the resource usage for each server. None of the servers exceeded their virtual CPU (vCPU) or memory usage. For the most part, you determine that you need at most 32 vCPUs for your production environment. Your client’s infrastructure team confirms that the maximum usage does not exceed 32 vCPUs.

Note GCP uses the term machine type to refer to a predefined virtual machine shape with specific vCPU and memory ratios to fit your workload requirements. Similarly, AWS uses the term instance type, and Azure uses the term size.

However, the public cloud makes it easy for anyone to adjust a server to use 48 vCPUs. Your bill increases by 50% because of the additional CPUs, and you don’t even use all of them. To more proactively control the cost using IaC, you combine unit testing from chapter 6 and policy enforcement from chapter 8.

Figure 12.2 Your test should parse machine types from server configurations, check the GCP API for the number of CPUs, and verify they do not exceed the limit.

Figure 12.2 adds a new policy test to the server’s delivery pipeline. The test checks every server defined in IaC for its machine type and compares it to the value returned by the GCP API. If the vCPU count from the API exceeds your maximum limit of 32, your test fails.

Why call the infrastructure provider’s API for vCPU information? Many infrastructure providers offer an API or client library to retrieve information about a given machine type from their catalogs. You can use this to dynamically get information about the number of CPUs and memory.

Infrastructure providers change offerings frequently. Furthermore, you cannot account for every possible server type. Writing your test to call the infrastructure provider API for the most updated information helps improve the overall evolvability of the test.

In listing 12.1, let’s implement the policy test to check for the maximum vCPU limit. First, you build a method to call the GCP API for the number of vCPUs for a given machine type.

Listing 12.1 Retrieving vCPU count for machine type from the GCP API

import googleapiclient.discovery


class MachineType():
    def __init__(self, gcp_json):
        self.name = gcp_json['name']
        self.cpus = gcp_json['guestCpus']
        self.ram = self._convert_mb_to_gb(
            gcp_json['memoryMb'])
        self.maxPersistentDisks = gcp_json[
            'maximumPersistentDisks']
        self.maxPersistentDiskSizeGb = gcp_json[
            'maximumPersistentDisksSizeGb']
        self.isSharedCpu = gcp_json['isSharedCpu']

    def _convert_mb_to_gb(self, mb):
        GIGABYTE = 1.0/1024
        return GIGABYTE * mb


def get_machine_type(project, zone, type):
    service = googleapiclient.discovery.build(
        'compute', 'v1')
    result = service.machineTypes().list(
        project=project,
        zone=zone,
        filter=f'name:"{type}"').execute()
    types = result['items'] if 'items' in result else None
    if not types or len(types) != 1:
        return None
    return MachineType(types[0])

Defines a machine type object to store any attributes you might need to check, including number of vCPUs

Converts megabytes of memory to gigabytes for consistent unit measure

Calls the GCP API to retrieve the number of vCPUs for a given machine type

Returns a machine type object with vCPU and disk attributes

AWS and Azure equivalents

You can use AWS’s Python SDK to retrieve the EC2 instances and parse the instance type. Then, describe the instance types to get the vCPU and memory information (http://mng.bz/qYYK).
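
For example, a minimal sketch of the AWS variant with boto3 (assuming you have already parsed the instance type out of your configuration) might look like this:

import boto3


def get_instance_type(instance_type):
    ec2 = boto3.client('ec2')
    # describe the instance type to retrieve vCPU and memory details
    response = ec2.describe_instance_types(
        InstanceTypes=[instance_type])
    info = response['InstanceTypes'][0]
    return {
        'cpus': info['VCpuInfo']['DefaultVCpus'],
        'ram': info['MemoryInfo']['SizeInMiB'] / 1024
    }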

To get the machine type and stock-keeping units (SKUs) from Azure, use the Azure library for Python. After retrieving the size from a list of instances, you can call the Resource Skus API for information on the number of CPUs and memory (http://mng.bz/YG1N).

Anytime you use a new machine type, you can use the same function to retrieve vCPUs and memory. Next, you write a test to parse every server defined in your configuration for its machine type. In the following listing, you retrieve the number of vCPUs for the machine type for a list of servers and verify that the vCPUs do not exceed the limit of 32.

Listing 12.2 Writing a policy test to check that servers do not exceed 32 vCPUs

import pytest
import os
import compute
import json

ENVIRONMENTS = ['testing', 'prod']
CONFIGURATION_FILE = 'main.tf.json'

PROJECT = os.environ['CLOUDSDK_CORE_PROJECT']


@pytest.fixture(scope="module")
def configuration():
    merged = []
    for environment in ENVIRONMENTS:
        with open(f'{environment}/{CONFIGURATION_FILE}', 'r') as f:
            environment_configuration = json.load(f)
            merged += environment_configuration['resource']
    return merged


def resources(configuration, resource_type):
    resource_list = []
    for resource in configuration:
        if resource_type in resource.keys():
            resource_name = list(
                resource[resource_type].keys())[0]
            resource_list.append(
                resource[resource_type]
                [resource_name])
    return resource_list


@pytest.fixture
def servers(configuration):
    return resources(configuration,
                     'google_compute_instance')


def test_cpu_size_less_than_or_equal_to_limit(servers):
    CPU_LIMIT = 32
    non_compliant_servers = []
    for server in servers:
        type = compute.get_machine_type(
            PROJECT, server['zone'],
            server['machine_type'])
        if type.cpus > CPU_LIMIT:
            non_compliant_servers.append(server['name'])
    assert len(non_compliant_servers) == 0, \
        f'Servers found using over {CPU_LIMIT}' + \
        f' vCPUs: {non_compliant_servers}'

Parses and extracts any server JSON configurations across testing and production environments

Sets the CPU limit to 32, the maximum required for the application

Initializes a list of noncompliant servers that exceed the 32 vCPU limit

For each server configuration, retrieves the machine type attribute and calls the GCP API for more information

If the server configuration includes a machine type that exceeds 32 vCPUs, add it to a list of noncompliant servers.

Checks that all servers comply with the CPU limit. If not, fail the test and throw an error for the servers that exceed 32 CPUs.

You configure the test with a soft mandatory enforcement policy. Soft mandatory enforcement means your team reviews and approves the more expensive resource type before you create it. You can override the machine type to a larger size if you have a business justification.
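
One way to implement the override is sketched below, assuming a hypothetical POLICY_OVERRIDE environment variable that your pipeline sets after the team reviews and approves the exception:

import os
import pytest


def cpu_limit_policy_overridden():
    # hypothetical flag, set by the pipeline after team review and approval
    return os.environ.get('POLICY_OVERRIDE', 'false') == 'true'


def test_cpu_size_less_than_or_equal_to_limit(servers):
    if cpu_limit_policy_overridden():
        pytest.skip('override approved, skipping vCPU limit policy')
    # ...otherwise, the vCPU limit check from listing 12.2 runs here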

Besides checking machine types for vCPU and memory limits, you may also need to add overrides for unique architectures or machine types that apply to certain use cases, like machine learning. These specialized types cost more than a general-purpose resource type.

You can test that IaC uses general-purpose resources by default. General-purpose machine or resource types offer lower-cost options. If someone needs a specialized, more expensive resource, you can enable it with soft mandatory enforcement.
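
For example, a sketch of such a test might check the machine type’s family prefix, reusing the servers fixture from listing 12.2 and assuming you treat GCP’s E2, N1, N2, and N2D families as general purpose:

# assumed set of general-purpose GCP machine family prefixes
GENERAL_PURPOSE_FAMILIES = ('e2-', 'n1-', 'n2-', 'n2d-')


def test_servers_use_general_purpose_machine_types(servers):
    noncompliant = [
        server['name'] for server in servers
        if not server['machine_type'].startswith(
            GENERAL_PURPOSE_FAMILIES)
    ]
    assert len(noncompliant) == 0, \
        f'Servers with specialized machine types: {noncompliant}'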

Other tests might include checks for specific configurations like scheduled reboots, automatic scaling, or private networking. Each of these configurations contributes to optimizing the cost of your resources. Expressing them in IaC lets you verify that configuration conforms to best practices to reduce cost early in the development process.

12.1.2 Automate cost estimation

Policy tests let you react to oversized or expensive resource changes and control costs. But what if you want a proactive way to check how changing your cost drivers affects your budget? Imagine you want to know how adjusting your production server size to the machine type of n2d-standard-16 (16 vCPUs) might affect the future cost of a different three-hour conference.

Figure 12.3 outlines the workflow to estimate the cost of five servers with the machine type n2d-standard-16. Once you calculate the price, you can add a policy test to verify that the total does not exceed your monthly budget.

Figure 12.3 Cost estimation parses IaC for machine types, calculates the monthly cost of the resource, and generates a value to compare with your expected budget.

Cost estimation parses your IaC for resource attributes and generates an estimate of their cost. You can use cost estimation to check that your changes stay within your budget or assess adjustments to cost drivers.

Definition Cost estimation extracts infrastructure resource attributes and generates an estimate of their total cost.

How does cost estimation help you evolve your infrastructure? Cost estimation offers additional transparency into cost drivers that may affect your architecture. As you change your system, you can use these tests to help budget and communicate charge-back across teams.

Cost estimation example and tools

I wrote minimal code to demonstrate the general workflow of cost estimation. For clarity, I omitted some of the code from the text. You can find all of the code organized at https://github.com/joatmon08/manning-book/tree/main/ch12.

The example uses the Google Cloud Billing Catalog API, which offers a service catalog with pricing. I also use a specialized Python client library for the Cloud Catalog to access the billing API (http://mng.bz/mOOn). The example does not account for specialized pricing, such as sustained use discounts or preemptible instances (the equivalent of Spot Instances in AWS).

You will find a few tools that offer cost estimation. Each cloud provider offers its own cost estimator user interface for entering your resources. Other tools implement a more scalable workflow than my example code to parse configuration, call a cloud provider’s API, and calculate a cost estimate. I will not list them in this chapter, as they change frequently and depend on the cloud provider and your IaC tooling.

Get the price per unit

I recommend dynamically requesting information from a cloud provider’s service catalog API. Price per unit can change, and hardcoding the prices often results in the wrong cost estimation. To start implementing cost estimation in the example, you need some logic to call the cloud provider’s catalog and retrieve the price per unit based on your machine type.

The Google Cloud Billing Catalog API offers a list of services and SKUs based on price per unit of CPU or memory (RAM). In listing 12.3, you get the service identifier for the Google Compute Engine service. The Google Cloud Billing Catalog API categorizes prices based on service identifiers, which you must retrieve dynamically.

Listing 12.3 Getting the Google Compute Engine service from the catalog

from google.cloud import billing_v1


class ComputeService:
    def __init__(self):
        self.billing = \
            billing_v1.services.cloud_catalog.CloudCatalogClient()
        for result in self.billing.list_services():
            if result.display_name == 'Compute Engine':
                self.name = result.name

Creates a client with the Python library for Google Cloud Billing Catalog API

Gets the service identifier for Google Compute Engine in the catalog

AWS and Azure equivalents

For AWS, update the code to call the AWS Cost Explorer API (http://mng.bz/5Qm4) instead of the GCP client library. Azure, on the other hand, offers an open REST API endpoint for retail prices (http://mng.bz/6X9G). You can write some additional code to request catalog information.

You can call the Google Cloud Billing Catalog API a second time for the price of a machine type in listing 12.4. Using the service identifier from the preceding step, you get a list of SKUs for the Google Compute Engine service. You write some code to parse its response list of SKUs to match the machine type and purpose, and retrieve the unit price per CPU or gigabyte of memory.

Listing 12.4 Getting the CPU and RAM price for Compute Engine SKUs

from google.cloud import billing_v1


class ComputeSKU:
    def __init__(self, machine_type, service_name):
        self.billing = \
            billing_v1.services.cloud_catalog.CloudCatalogClient()
        self.service_name = service_name
        self.cpu_pricing = 0
        self.ram_pricing = 0
        type_name = machine_type.split('-')
        self.family = type_name[0]
        self.exclude = [
            'custom',
            'preemptible',
            'sole tenancy',
            'commitment'
        ] if type_name[1] == 'standard' else []

    def _filter(self, description):
        return not any(
            type in description for type in self.exclude
        )

    def _get_unit_price(self, result):
        expression = result.pricing_info[0]
        unit_price = (
            expression.pricing_expression.tiered_rates[0].unit_price.nanos
            if expression else 0
        )
        category = result.category.resource_group
        if category == 'CPU':
            self.cpu_pricing = unit_price
        if category == 'RAM':
            self.ram_pricing = unit_price

    def get_pricing(self, region):
        for result in self.billing.list_skus(parent=self.service_name):
            description = result.description.lower()
            if (region in result.service_regions and
                    self.family in description and
                    self._filter(description)):
                self._get_unit_price(result)
        return self.cpu_pricing, self.ram_pricing

Creates a client with the Python library for the Google Cloud Billing Catalog API

For a machine type like n2d-standard-16, extracts the machine family (N2D) and purpose (standard) to identify the SKU

If you use a standard machine type, do not search for any specialized Compute Service SKUs in the catalog.

Retrieves the unit price per CPU or RAM in nano-dollars (10⁻⁹)

Calls the Google Cloud Billing Catalog and retrieves a list of SKUs for the Compute Service

Finds SKUs that match the region, machine family, and purpose of the machine type based on its description

The Google Cloud Billing Catalog sets a unit price based on the number of CPUs and gigabytes of memory. As a result, you cannot search based on the name of the machine type. Instead, you need to correlate the general-purpose machine type with the catalog’s description.

Calculate the monthly cost for a single resource

Once you retrieve the CPU and RAM unit price for a given machine type, you can use it to calculate a monthly cost for a single instance of the machine. Some cloud catalogs express the unit price with a scaling factor. For example, GCP uses nano units, which means you also need to multiply by that factor. Listing 12.5 implements the code to calculate the monthly cost for a single server. You multiply the unit price by the average number of hours per month, 730, and the nano units.

Listing 12.5 Calculating the monthly cost for a single server

HOURS_IN_MONTH = 730
NANO_UNITS = 10**-9


def calculate_monthly_compute(machine_type, region):
    service_name = ComputeService().name
    sku = ComputeSKU(machine_type.name, service_name)
    cpu_price, ram_price = sku.get_pricing(region)

    cpu_cost = (machine_type.cpus * cpu_price *
                HOURS_IN_MONTH if cpu_price else 0)
    ram_cost = (machine_type.ram * ram_price *
                HOURS_IN_MONTH if ram_price else 0)
    return (cpu_cost + ram_cost) * NANO_UNITS

Sets constants for the average of 730 hours in a month and for converting nano-dollars (10⁻⁹) to dollars

Gets the Compute Engine service identifier from the Google Cloud Billing Catalog API

Sets up the SKU for the machine type and gets its CPU and RAM unit pricing

Multiplies a machine type’s number of CPUs by unit price and hours in the month

Multiplies a machine type’s gigabytes of memory (RAM) by unit price and hours in the month

Sums up the CPU and RAM cost and converts it to dollar units

You now have a minimal form of cost estimation that calculates the cost of a single server. With the initial cost calculation for a single server, you can parse your IaC for all servers, retrieve their machine types and regions, and calculate a total cost. In the future, you can add more logic to retrieve SKUs for other services, like databases or messaging.

Check that your cost does not exceed the budget

You decide that you can do more with your cost estimation. You write a test with a soft mandatory enforcement approach to check whether your estimated cost exceeds your monthly budget. For example, your client tells you that a conference should not exceed a monthly budget of $4,500. You can compare your cost estimation to the budget and proactively identify any cost drivers.

Let’s write a test to estimate the new cost of servers and compare it to the budget. In the following listing, you parse your IaC for all servers and count the number of servers with specific machine types and regions.

Listing 12.6 Parsing IaC for all servers

from compute import get_machine_type
import pytest
import os
import json

ENVIRONMENTS = ['testing', 'prod']
CONFIGURATION_FILE = 'main.tf.json'

PROJECT = os.environ['CLOUDSDK_CORE_PROJECT']


@pytest.fixture(scope="module")
def configuration():
    merged = []
    for environment in ENVIRONMENTS:
        with open(f'{environment}/{CONFIGURATION_FILE}', 'r') as f:
            environment_configuration = json.load(f)
            merged += environment_configuration['resource']
    return merged


@pytest.fixture
def servers(configuration):
    servers = dict()
    # uses the resources() helper defined in listing 12.2
    server_configs = resources(configuration,
                               'google_compute_instance')
    for server in server_configs:
        region = server['zone'].rsplit('-', 1)[0]
        machine_type = server['machine_type']
        key = f'{region},{machine_type}'
        if key not in servers:
            type = get_machine_type(
                PROJECT, server['zone'],
                machine_type)
            servers[key] = {
                'type': type,
                'num_servers': 1
            }
        else:
            servers[key]['num_servers'] += 1
    return servers

Reads every configuration file that defines each environment, such as testing and production

For each server in the configuration file, creates a list of their regions and machine types

Calls the Google Compute API and gets details on the machine type, such as its number of CPUs and memory

Tracks the number of servers with the specific machine type and region to streamline the SKUs you need to retrieve

You can call these methods in your test to retrieve cost information for each machine type in a specific region and sum the total cost. The test in the following listing checks whether the total cost exceeds the monthly budget of $4,500.

Listing 12.7 Testing that the monthly compute cost stays within budget

from estimation import calculate_monthly_compute


MONTHLY_COMPUTE_BUDGET = 4500


def test_monthly_compute_budget_not_exceeded(servers):
    total = 0
    for key, value in servers.items():
        region, _ = key.split(',')
        total += (calculate_monthly_compute(value['type'], region) *
                  value['num_servers'])
    assert total < MONTHLY_COMPUTE_BUDGET

Sets a constant to communicate the expected monthly compute budget

Tests that the cost of your servers does not exceed the monthly compute budget

Calculates the monthly total per server based on machine type and region, multiplied by servers for that machine type, and sums up the total

Confirms that the estimated total cost does not exceed the monthly compute budget

You now have a test to estimate the total monthly cost of computing resources and compare it to your budget! Each time someone changes the infrastructure, the test recalculates the new cost of the system.

A cost estimation gives you a general view of your infrastructure cost but may not accurately reflect your actual bill. You’ll have to account for a certain margin of error. If your estimation exceeds the monthly budget, it may indicate that you need to reassess size and resource usage. You’ll also refine your monthly budget over time depending on the growth of your systems.

Continuous delivery with cost estimation

How do you check that infrastructure changes do not exceed your budget? Each time you change your IaC and push it to a repository, your cost estimation and budget test run. Budget tests in your pipeline help you identify costly infrastructure changes and refine the resource in a testing environment. The process helps prevent surprise charges in the production environment.

For example, let’s say you want to add another server to the testing environment for a different conference. In figure 12.4, you create a configuration to add another server with the machine type of n2d-standard-8. The pipeline runs a test to calculate the monthly cost with the new server and check it against the monthly budget.

Figure 12.4 Adding another server exceeds your budget of $4,500, causing the test to fail.

You push the configuration to the repository, and your delivery pipeline runs the tests to check for budget compliance. The pipeline fails! You check the logs and discover that your cost estimation exceeds the expected monthly budget:

$ pytest test_budget.py
FAILED test_budget.py::test_monthly_compute_budget_not_exceeded -
assert 4687.6161600000005 < 4500

You speak with your finance team. The finance analyst confirms that the budget can increase to accommodate the new testing instance. You update the monthly budget within the test to $4,700 for future changes!

Whether you write a cost estimation mechanism or use a tool, you should consider adding it to your delivery pipelines as another test for policy. Estimation helps guide instance sizing and usage. It should never stop a change before production. Instead, it should provide an opportunity for you to reassess the need for the resource.

Do not account for every cost driver as part of your cost estimation. Instead, choose the resources that make up the bulk of your bill. The example focuses on computing resources like servers, which often contribute significantly to the cost. You might implement cost estimation for other resources, like databases or messaging frameworks.

Always question the accuracy of your cost estimation! You cannot predict the resources you will create or how you use them. For example, you might find it hard to estimate the cost to transfer data between regions or services until it happens. Reconcile your cost estimate with your monthly bill and assess which cost drivers contribute to the difference.
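
For example, a quick sketch of that reconciliation with hypothetical numbers:

estimate = 4000          # estimated monthly compute cost in dollars
actual = 4600            # actual amount on the cloud bill

multiplier = actual / estimate        # 1.15, a 15% margin of error

# budget the actual cost of next month's estimate with the multiplier
next_month_estimate = 4200
next_month_budget = next_month_estimate * multiplier   # about $4,830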

A monthly comparison can help you identify any changes and budget for the actual cost based on a multiplier of the estimate. In the remainder of the chapter, we’ll discuss ways to reduce cloud waste and optimize cost outside of proactive measures like testing or estimation.

Exercise 12.1

Given the following code, which of the following statements are true? (Choose all that apply.)

HOURS_IN_MONTH = 730
MONTHLY_BUDGET = 5000
DATABASE_COST_PER_HOUR = 5
NUM_DATABASES = 2
BUFFER = 0.1


def test_monthly_budget_not_exceeded():
    total = HOURS_IN_MONTH * NUM_DATABASES * DATABASE_COST_PER_HOUR
    assert total < MONTHLY_BUDGET + MONTHLY_BUDGET * BUFFER

A) The test will pass because the cost of the database is within the budget.

B) The test estimates the monthly cost of databases.

C) The test does not account for different database types.

D) The test calculates the monthly cost per database instance.

E) The test includes a buffer of 10% for any cost overage as a soft mandatory policy.

See appendix B for answers to exercises.

12.2 Reduce cloud waste

You can use IaC to implement proactive measures to manage cost drivers for cloud computing. However, you need to combine them with other practices to continue to reduce and optimize costs. After all, your client in the example still doesn’t appreciate a $10,000 cloud bill for a three-hour conference!

If you provision a large server but do not use all of its CPU or memory, that unused capacity goes to waste. You have an opportunity to reduce your cloud computing cost! One approach to improving your bill involves eliminating cloud waste, unused or underutilized infrastructure resources.

Definition Cloud waste is unused or underutilized infrastructure resources.

You can reduce cloud waste by deleting, expiring, or stopping unused resources; scheduling or scaling instances based on usage; and assessing the right resource size or type for your system; see figure 12.5.

Figure 12.5 You can reduce cloud waste by removing unused resources, or scheduling and sizing resources to accommodate usage patterns.

Identifying cloud waste often starts as the first response to a surprising cost on your public cloud bill. However, you can use these techniques in your data center, especially for a private cloud. While they do not provide an immediate short-term benefit, they help optimize data center resource usage and long-term cost reduction.

12.2.1 Stop untagged or unused resources

Sometimes you and your team will create infrastructure resources for testing or one-off configuration. You end up forgetting about them until they show up on your cloud bill. As a first iteration of reducing cloud waste, you can identify unused resources and remove them.

Recall that you have a mission of reducing the cost of running a conference for your client. Could you reduce costs by identifying unused resources? Yes! Sometimes our teams create resources for testing and forget to remove them.

For example, you retrieve a list of servers in your Google Cloud projects and examine them in table 12.2. While many of them in testing and production have tags, you notice two instances with no tags. The n2d-standard-16 machines cost about $700 (7% of the total monthly bill).

Table 12.2 Cost of servers by type and environment

Machine type      Environment   Number of servers   Subtotal
n2d-standard-8    Testing       2                   $400
n2d-standard-16   Production    2                   $700
n2d-standard-32   Production    3                   $2,900
Total                                               $4,000

You ask the team about the untagged instances in production. They created the servers for a sandbox to verify the application but never used them. Just to make sure, you check the server usage metrics over the month, and they all stayed at zero. You identified some cloud waste!

The team did create the servers with IaC. You delete the configuration and push the changes to remove the unused instances. Deleting the configuration removes the disks and resources attached to the instances. Fortunately, the cloud bill for your next conference reflects the reduction.

Why would you confirm usage of the server by metrics and team members? You do not want to remove the resource by accident. Sometimes you have unexpected dependencies on what seems like an unused resource.

Make sure that any resources you intend to delete do not have additional dependencies. If you have concerns about deleting an untagged or unused resource, you can always stop the resource for a week or two, wait to determine if it breaks the system, and then delete it.
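
If you stop the instance outside of IaC while you wait, a minimal sketch with the GCP API client might look like this (the project, zone, and instance name are placeholders):

import googleapiclient.discovery


def stop_instance(project, zone, name):
    service = googleapiclient.discovery.build('compute', 'v1')
    # stop, but do not delete, the instance while you confirm
    # that nothing depends on it
    return service.instances().stop(
        project=project, zone=zone, instance=name).execute()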

12.2.2 Start and stop resources on a schedule

Your next cloud bill comes back 7% less, thanks to the removal of unused servers. However, the finance team wants you to reduce it even more. You puzzle over this until you talk to one of the client’s team members. They mention that they never run testing or use any of their infrastructure resources over the weekends. The client needs the platform available the weekend before the conference.

Could you find a way to turn off the servers on a Friday night and turn them on each Monday? You do not get charged for the 48 hours you shut down the servers. Scheduling a regular shutdown means cost reduction.

You discover that GCP defines an instance shutdown schedule with a compute resource policy (http://mng.bz/o25N). You start the servers each Monday and shut them down each Saturday, as outlined in figure 12.6.

Figure 12.6 You can reduce cost by scheduling a resource to start and stop when not in use.

Shutting down instances on a schedule alleviates the cost of running servers. However, this technique works only if you understand the behavior of your system. Starting and stopping resources on a schedule can disrupt development work.

Some applications do not have fault tolerance and would continue to fail even if a resource restarts successfully. In general, most reboot schedules run only in testing environments. The schedule provides an opportunity for you to verify your system’s resilience because you have a planned outage each weekend.

Listing 12.8 implements the resource policy for instance scheduling in GCP. The schedule expires the week before the conference, so it does not shut down the servers over the weekend. The development team might need to work on the platform in the few days before the conference.

Listing 12.8 Creating a resource policy for instance scheduling

from datetime import datetime, timezone


def build(name, region, week_before_conference):
    expiration_time = datetime.strptime(
        week_before_conference,
        '%Y-%m-%d').replace(
            tzinfo=timezone.utc).isoformat().replace(
                '+00:00', 'Z')
    return {
        'google_compute_resource_policy': {
            'weekend': {
                'name': name,
                'region': region,
                'description':
                    'start and stop instances over the weekend',
                'instance_schedule_policy': {
                    'vm_start_schedule': {
                        'schedule': '0 0 * * MON'
                    },
                    'vm_stop_schedule': {
                        'schedule': '0 0 * * SAT'
                    },
                    'time_zone': 'US/Central',
                    'expiration_time': expiration_time
                }
            }
        }
    }

Expires the schedule the week before the conference, using the RFC 3339 date format

Creates a compute resource policy with an instance schedule

Starts the virtual machines every Monday at midnight

Stops the virtual machines every Saturday at midnight

Runs the schedule in the US central time zone since development teams work in the central United States

AWS and Azure equivalents

Other public cloud providers usually offer a similar automation capability to start and stop the virtual machines on a schedule. For example, AWS uses Instance Scheduler to start and stop servers and databases (http://mng.bz/nNev). Similarly, Azure uses a start/stop virtual machine workflow based on Azure functions (http://mng.bz/v6Xx).

If your public or private cloud platform does not offer scheduled shutdown capabilities, you will need to write your own automation to run on a regular schedule. I have implemented this before using various tools, including serverless functions, cron jobs on container orchestrators, and scheduled runs on continuous integration frameworks.

Instead of running seven servers 730 hours per month, you run them about 144 hours less (assuming three weekends in a month and a 48-hour shutdown each weekend). Using your cost estimation code, you update your calculation to 586 hours per month. It shows that you reduced the overall cost by $700 (7% of the total monthly bill)!
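
A minimal sketch of that updated calculation reuses calculate_monthly_compute and HOURS_IN_MONTH from listing 12.5 and assumes a hypothetical shutdown_hours parameter:

WEEKEND_SHUTDOWN_HOURS = 3 * 48   # three 48-hour weekend shutdowns


def calculate_scheduled_compute(machine_type, region,
                                shutdown_hours=WEEKEND_SHUTDOWN_HOURS):
    # scale the always-on monthly estimate down to the scheduled hours
    monthly = calculate_monthly_compute(machine_type, region)
    scheduled_hours = HOURS_IN_MONTH - shutdown_hours   # 586 hours
    return monthly * scheduled_hours / HOURS_IN_MONTH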

The example adds the schedule to the testing environment. However, you could add a reboot schedule to a production environment if it has cyclical usage patterns. For example, the conference platform runs on demand for three hours and serves user traffic only during the week. Shutting down servers and databases for 48 hours does not disrupt user traffic. However, you may not want to implement a reboot schedule in a production environment that serves requests continuously.

12.2.3 Choose the correct resource type and size

If your production environment needs to serve customers 24 hours a day, seven days a week, you can still reduce cloud waste without a resource schedule by assessing your resource types and sizes. Many resources do not utilize their CPU or memory fully.

Many times, we provision larger resources because we don’t know how much we need. After running a system for some time, you can adjust the size of the resource for its actual usage. You can reduce cost by changing a resource type, size, reservation, replicas, or even cloud provider, as shown in figure 12.7.

Figure 12.7 You can change a resource’s attributes to utilize it better and reduce its cost.

You decide to investigate your client’s conference platform to find cloud waste in resource type and size. You realize you cannot reduce the cost of computing resources further, so you check the database (Cloud SQL). The team provisioned the production database with 4 TB of solid-state drive (SSD) storage, detailed in table 12.3.

Table 12.3 Your cloud bill by offering and resource

Offering    Type                                        Environment   Number   Subtotal
Cloud SQL                                                                      $2,500
            db-standard-1, 400 GB SSD, 600 GB backup    Testing       1        $250
            db-standard-4, 4 TB SSD, 6 TB backup        Production    1        $2,250

After checking metrics and database usage, you realize that it needs only a 1 TB SSD. You update the disk size of the database in your IaC. Fixing the size reduces the cost of the database by $1,350 (13.5% of the total monthly bill)!

You might not use a resource to its full potential in many other ways. You might consider changing the type of resource if it uses a more expensive machine type. You need to ask yourself and your team, “Do we need this high-performance database in my testing environment if we do not run performance tests?”

Probably not! Choosing the right size and type for a given environment may take a few iterations. You want to choose a resource type, size, and replicas that simulate production without making it an exact duplicate in cost.

For the conference example, you might have three n2d-standard-32 instances in production and three n2d-standard-8 instances in your testing environment. The configuration still tests three application instances without incurring the cost of an extra 72 vCPUs.

Other times, you can change the resource’s reservation type. GCP and many other cloud providers offer an ephemeral (also known as spot or preemptible) resource type. The resource costs less, but the cloud provider reserves the right to stop the resource and give CPU or memory to another customer. While an ephemeral resource reservation can reduce cost, you need to carefully consider whether your application and system can handle the disruption.
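
As a sketch in the builder style of listing 12.8, you might mark a GCP server as preemptible through its scheduling block (I omit the boot disk and network attributes that a real instance also requires):

def build(name, machine_type, zone):
    # a preemptible instance costs less, but GCP may stop it at any time
    return {
        'google_compute_instance': {
            name: {
                'name': name,
                'machine_type': machine_type,
                'zone': zone,
                'scheduling': {
                    'preemptible': True,
                    'automatic_restart': False
                }
            }
        }
    }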

12.2.4 Enable autoscaling

You tried to identify as much cloud waste as possible in your environment but still want to reduce costs further. Many systems have customer usage patterns that do not require their CPU, memory, or bandwidth in the system every hour of every day.

For example, the conference platform needed 100% of its capacity during only the three hours of the conference! Could you have automatically increased or decreased the number of servers based on demand?

Figure 12.8 sets the target utilization to 75% of CPU usage so that the GCP managed instance group starts and stops servers to match the target metric. It increases and decreases the size of the group based on demand.

Figure 12.8 An autoscaling group includes a target utilization rate, which allows it to start and stop resources to adjust for usage automatically.

You add autoscaling for each group of servers. Autoscaling increases or decreases the number of resources in a group based on a metric, such as CPU or memory. Many public cloud providers offer an autoscaling group resource that you can create with IaC.

Definition Autoscaling is the practice of automatically increasing or decreasing the number of resources in a group based on a metric.

GCP autoscaling requires that you set a target metric to scale up or scale down resources to achieve. For most of the month with low traffic, you expect to use only one server. However, when peak traffic runs through your conference platform, you need a maximum of three. You decide to use a metric for CPU utilization and set the target to 75%.

You update the IaC for the servers. In listing 12.9, you replace the original servers and instance scheduling resource policy with a managed instance group and autoscaling policy. The autoscaling schedule starts each morning, increases or decreases instances to achieve 75% CPU utilization, and scales the instances to zero each evening.

Listing 12.9 Creating an autoscaling group based on CPU utilization

def build(name, machine_type, zone,
          min, max, cpu_utilization,
          cooldown=60,
          network='default'):
    # region would be used by the omitted instance group manager
    region = zone.rsplit('-', 1)[0]
    return [{
        'google_compute_autoscaler': {
            name: {
                'name': name,
                'zone': zone,
                'target': '${google_compute_instance_group_manager.' +
                          f'{name}.id}}',
                'autoscaling_policy': {
                    'max_replicas': max,
                    'min_replicas': 0,
                    'cooldown_period': cooldown,
                    'cpu_utilization': {
                        'target': cpu_utilization
                    },
                    'scaling_schedules': {
                        'name': 'weekday-scaleup',
                        'min_required_replicas': min,
                        'schedule': '0 6 * * MON-FRI',
                        'duration_sec': '57600',
                        'time_zone': 'US/Central'
                    }
                }
            }
        }
    }]

Attaches an instance group to the autoscaling resource. We omitted the instance group for clarity.

Sets the maximum number of replicas to scale up if CPU utilization increases above 75%

Sets the minimum number of replicas to zero by default, which means it stops the virtual machines

Uses CPU utilization as the target metric for the autoscaling group

Sets a scaling schedule that increases the minimum number of replicas each Monday through Friday morning for the development team’s usage patterns

AWS and Azure equivalents

GCP attaches managed instance groups to autoscaling policies. GCP does not allow you to attach an instance schedule resource policy to a managed instance group, so you must implement the schedule in the autoscaling policy.

Other public cloud providers also offer autoscaling capabilities for servers and sometimes databases. AWS uses autoscaling groups. Azure uses autoscale rules for scale sets.

You can set a scaling schedule, as in the example, to mimic the weekend shutdown you previously implemented. In general, use the patterns for modules to create an autoscaling module. The module should set an opinionated, default target metric depending on your workload.

If you have a unique workload that does not fit the module’s preset target metrics, you can set its default to target CPU utilization or memory and assess its behavior over time. When you roll out the instance group, apply the blue-green deployment pattern from chapter 9 to replace active workloads or instances. Rolling out a schedule and autoscaling group should not disrupt applications.

To encourage teams to use autoscaling groups and scheduling, you can create several policy tests to make sure your autoscaling group reduces cloud waste. For example, one test could verify that your IaC has no individual servers and contains only autoscaling groups. The test encourages the team to take advantage of elasticity.

Another test you can add involves checking the maximum replica limit. Suppose your application suddenly consumes a lot of CPU or memory, or a bad actor injects a cryptocurrency mining binary on your machine. In that case, you don’t want the autoscaling group to automatically increase its capacity to 100 machines.
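
A sketch of both tests might look like the following, reusing the configuration fixture and resources() helper from listing 12.2 and assuming a hypothetical limit of 10 replicas:

REPLICA_LIMIT = 10   # assumed upper bound for any autoscaling group


def test_no_standalone_servers(configuration):
    servers = resources(configuration, 'google_compute_instance')
    assert len(servers) == 0, \
        'define servers in autoscaling groups, not individually'


def test_max_replicas_do_not_exceed_limit(configuration):
    autoscalers = resources(configuration, 'google_compute_autoscaler')
    noncompliant = [
        autoscaler['name'] for autoscaler in autoscalers
        if autoscaler['autoscaling_policy']['max_replicas'] > REPLICA_LIMIT
    ]
    assert len(noncompliant) == 0, \
        f'Autoscalers allowing over {REPLICA_LIMIT} replicas: {noncompliant}'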

12.2.5 Set a resource expiration tag

You may reduce cloud waste by dynamically scaling up and down resources based on utilization, but you also need to accommodate on-demand, manually created resources. For example, the client team members complain that they often need to create sandbox servers for further testing. However, they often forget about the servers. Can you “expire” the servers after some time if no one updates them?

You decide to update your tag module to attach a new tag for the expiration date in the testing environment. Recall that you can use the prototype pattern (chapter 3) to establish standard tags. After applying the policy tests to check for tag compliance in chapter 8, you know that each resource in the testing environment will have an expiration date.

For example, the team members might create a server with an initial expiration date of February 2, as shown in figure 12.9. However, they decide to update the server. As part of the change, the tag module retrieves the current date (February 5), adds seven days, and updates the tag on the server to a new date (February 12).

Figure 12.9 Create an expiration date in the tag module that resets the expiration date of the resource to one week from the change.

Why use set expiration as part of the tag module? Your tag module should get applied across all your IaC. This allows you to establish a default duration of seven days and apply it to all infrastructure resources.

You can also control when to apply the expiration tag as part of a module. The module applies the expiration tag only if you don’t create a resource in the production environment or continuously run it in the testing environment. The following listing updates the prototype module for default tags with an expiration date.

Listing 12.10 Tag module with an expiration date

import datetime

EXPIRATION_DATE_FORMAT = '%Y-%m-%d'
EXPIRATION_NUMBER_OF_DAYS = 7


class DefaultTags():
    def __init__(self, environment, long_term=False):
        self.tags = {
            'customer': 'community',
            'automated': True,
            'cost_center': 123456,
            'environment': environment
        }
        if environment != 'prod' and not long_term:
            self._set_expiration()

    def get(self):
        return self.tags

    def _set_expiration(self):
        expiration_date = (
            datetime.datetime.now() +
            datetime.timedelta(
                days=EXPIRATION_NUMBER_OF_DAYS)
        ).strftime(EXPIRATION_DATE_FORMAT)
        self.tags['expiration'] = expiration_date

Formats the date to a string of year, month, and day

Calculates the expiration date for seven days from the current date

Sets the expiration tag if you do not create the resource in production or mark it as a long-term resource

Calculates the expiration date for seven days from the current date

Formats the date to a string of year, month, and day

You set one week as a default because it gives team members enough time to develop and test a resource. They can always renew another week if needed by running their delivery pipeline to update the tag automatically. However, you do need to enable an override to allow long-term resources in testing environments.

How do you enforce expiration date tagging by default but exempt resources from an expiration date? You can create a policy test with soft mandatory enforcement. A soft mandatory policy makes exceptions and audits long-term resources in testing environments.

Let’s write a test that enforces the expiration tag for each server resource in listing 12.11. If a server without the tag does not appear in the list of exempt resources, it fails the test and stops the delivery pipeline from deploying changes to production.

Listing 12.11 Test checks that testing resources have an expiration date

import pytest


def test_all_nonprod_resources_should_have_expiration_tag(
        servers, server_exemptions):
    noncompliant = []
    for name, values in servers.items():
        if ('expiration' not in values['labels'].keys() and
                name not in server_exemptions):
            noncompliant.append(name)
    assert len(noncompliant) == 0, \
        'all nonprod resources should have ' + \
        f'expiration tag, {noncompliant}'

Retrieves a list of servers in configuration and those exempt from the policy

Checks if the tag for expiration exists in server tags

If you did not exempt the server, you must flag the server as noncompliant with the policy.

Adding the resource to an exemption list means that your teammates will carefully examine which resources persist in the testing environment. During peer review (chapter 7), you can identify any new, persistent resource based on changes to the exemption list. A single source of persistent resources in a testing environment ensures that you can audit and discuss cost control early in the development process.

After implementing the expiration tag in IaC, you need to write a script that runs daily. Figure 12.10 shows the script’s workflow. It checks whether the expiration date matches the current date. If so, the automation deletes the resource.
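
A minimal sketch of that daily script, assuming the expiration label from listing 12.10 and the GCP compute API:

from datetime import datetime

import googleapiclient.discovery


def delete_expired_servers(project, zone):
    today = datetime.now().strftime('%Y-%m-%d')
    service = googleapiclient.discovery.build('compute', 'v1')
    result = service.instances().list(
        project=project, zone=zone).execute()
    for instance in result.get('items', []):
        expiration = instance.get('labels', {}).get('expiration')
        # delete the instance once its expiration date arrives
        if expiration and expiration <= today:
            service.instances().delete(
                project=project, zone=zone,
                instance=instance['name']).execute()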

Why set an expiration date by using IaC? The workflow of setting an expiration date using a tag module builds in the ability to renew the resource expiration date! Rather than introduce development friction by adding separate automation, you build renewals into the development process.

Figure 12.10 Setting an expiration tag allows daily automation to determine whether to delete a temporary resource and thus reduce cost.

For example, if a team still needs the resource, it can always rerun its IaC delivery pipeline to reset the expiration date for another seven days. Active changes to the resource also reset the expiration date. If you change your infrastructure, you probably still need the resource.

What happens when a resource expires and you still need it? You can always rerun your IaC and create a new one. Using IaC to add and renew the expiration date provides cost compliance and functional visibility across teams.

Note Sometimes you’ll find separate automation for automatic tagging. The automation adds the expiration date to infrastructure resources after their creation. While the automatic tagging means greater control for cost compliance, it also introduces drift between the actual and intended configurations. Furthermore, automatic expiration often confuses team members. Unless they paid attention to your communications, they might find their resource deleted after a few days!

You can always set the expiration time interval to something other than several days. If you want to offer more flexibility to teams, you can offer a range of days through the tag module. I recommend calculating the absolute expiration date and adding it to the tag instead of the time interval for easier cleanup automation.

With all of the changes in the example, what happened to your client’s cloud computing bill? Your bill went from a little over $10,000 to about $6,500 (a 35% reduction)! Your client appreciates the efficient use of their cloud resources.

In reality, you might not achieve the same dramatic cost reduction as the example. However, you can always apply these practices and techniques to your IaC to introduce minor changes that reduce cost where possible. Capturing cost reduction practices in IaC with tests ensures that everyone writes configuration with cost as a constraint in mind.

Exercise 12.2

Imagine you have three servers. You examine their utilization and notice the following:

  • You need one server to serve the minimum traffic.

  • You need three servers to serve the maximum traffic.

  • The server handles traffic 24 hours a day, seven days a week.

What can you do to optimize the cost of the servers for the next month?

A) Schedule the resources to stop on the weekends.

B) Add an autoscaling_policy to scale based on memory.

C) Set an expiration of three hours for all servers.

D) Change the servers to smaller CPU and memory machine types.

E) Migrate the application to containers and pack applications more densely on servers.

See appendix B for answers to exercises.

12.3 Optimize cost

You can apply other IaC principles and practices to reduce cloud waste and manage cost drivers. Techniques like building environments on demand, updating routes between regions, or testing in production can use IaC to further optimize cost, as shown in figure 12.11.

Figure 12.11 Cost optimization requires IaC practices for scaling and deploying infrastructure resources.

In particular, the principles of reproducibility, composability, and evolvability can help with creative techniques to further optimize cost. These techniques include reproducing environments on demand to reduce persistent testing environments, composing infrastructure across cloud providers, and evolving production infrastructure across regions and cloud providers.

Recall that you reduced your client’s cloud computing cost for a conference by 35%. A year later, the finance team asks for your help to optimize the cost of their platform. They’ve grown their business and want to optimize costs across hundreds of customers and a managed service.

12.3.1 Build environments on demand

From a broader perspective, you need to examine which environments exist in testing and production. You might start by reducing cloud waste across all environments. However, as your company grows, it adds more environments to support and test more products.

Imagine you examine your client’s infrastructure. The client has many testing environments. You determine that three or four of them run continuously and support specialized testing. For example, the quality assurance (QA) team uses one of the environments for performance testing twice a year. For the rest of the year, these environments remain dormant.

You decide to remove the continuously running environments. If the QA team wants an environment for performance testing, it can create one on demand. The team copies the production environment using its factory and builder modules, which accept inputs. The modules provide the flexibility to specify variables and parameters for different environments.
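
A builder module for on-demand environments might look something like this sketch; the class name, parameters, and machine type are hypothetical stand-ins for the conference platform’s modules.

class EnvironmentBuilder:
    """Reproduce production's shape with parameters for one-off testing."""

    def __init__(self, name, machine_type='e2-standard-2', server_count=3):
        self.name = name
        self.machine_type = machine_type
        self.server_count = server_count

    def build(self):
        return {
            'name': self.name,
            'servers': [
                {'name': f'{self.name}-server-{i}',
                 'machine_type': self.machine_type}
                for i in range(self.server_count)
            ],
        }


# The QA team builds a short-lived, larger copy for performance testing.
qa_performance = EnvironmentBuilder('qa-performance', server_count=6).build()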

Figure 12.12 shows the rest of the workflow for creating an on-demand environment. The QA team copies the IaC to a new repository specific to the testing environment in the organization’s multirepository structure. The team updates the parameters and variables, runs its tests, and removes the environment.

Figure 12.12 You can copy the configuration for production to create and customize on-demand environments for testing.

Why use reproducibility to create new environments on demand and delete them? A new, updated environment ensures that the latest configuration matches production. If you use the environment only twice a year, you do not want to keep it constantly running for the rest of the year.

While it takes time to create a new environment, you probably take the same amount of time trying to fix drift between your testing and production environment. Identifying unnecessary long-running environments and switching them to an on-demand model can help mitigate costs, especially if you can easily re-create them.

12.3.2 Use multiple clouds

With a few cloud provider options, you may also consider deploying to other clouds and optimizing cost based on resource, workload, and time of day. IaC can help standardize and organize your configuration for multiple clouds. Deploying to multiple clouds can accommodate specialized workloads or teams that want specific infrastructure resources.

For example, imagine your client uses Google Cloud Dataflow for stream data processing. However, the cost varies depending on the type of pipeline. You convince some reporting teams to convert a few of the batch-processing pipelines to Amazon EMR to reduce the overall cost.

Azure equivalent

The Azure equivalent of Amazon EMR or Google Cloud Dataflow is Azure HDInsight.

In figure 12.13, the report service team switches its IaC to use the Amazon EMR module. To minimize the disruption of jobs, the team members use the blue-green deployment pattern from chapter 9 to gradually increase the number of jobs they run in Amazon EMR.

Figure 12.13 The report service switches its batch-processing job to use Amazon EMR instead of Google Cloud Dataflow by referencing a different module.

The principle of composability becomes an important part of a multicloud configuration. IaC makes it easier to manage and identify infrastructure resources across different cloud providers. Using modules to express dependencies between clouds also helps you evolve your resources over time.

In chapter 5, we separated IaC configuration into different folders based on tools and providers. Many IaC tools do not offer a unified data model for cloud provider resources, so build a different module for each cloud you plan to support. Separating modules by provider reduces complexity, supports isolated testing, and makes it easy to identify which resources belong to which provider.
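
The following hedged sketch shows one way to separate modules by provider behind a shared interface; the classes are illustrative, and the resource names loosely mirror Terraform resource types.

class GoogleDataflowModule:
    """Builds the GCP stream- or batch-processing job."""

    def build(self, name):
        return {'resource': 'google_dataflow_job', 'name': name}


class AmazonEMRModule:
    """Builds the AWS batch-processing cluster."""

    def build(self, name):
        return {'resource': 'aws_emr_cluster', 'name': name}


BATCH_MODULES = {
    'gcp': GoogleDataflowModule(),
    'aws': AmazonEMRModule(),
}


def build_batch_pipeline(provider, job_name):
    # The report service switches providers by referencing a different module.
    return BATCH_MODULES[provider].build(name=job_name)


# Blue-green cutover: move batch jobs from Dataflow to EMR a few at a time.
print(build_batch_pipeline('aws', 'reports-batch-1'))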

12.3.3 Assess data transfer between regions and clouds

With the adoption of multiple clouds, you might discover that you didn’t reduce your overall cloud computing bill. You need to consider multiple clouds carefully because providers will charge for data transfer between regions and outside their cloud network. Data transfer costs add up in surprising ways!

You check your client’s cloud computing bill and notice that a lot of the cost comes from data transfers across regions and out of the network. After some investigation, you discover that many of the services and testing environments communicate across regions, availability zones, and over the public internet.

For example, integration tests for the chat service in us-central1-a use the public IP address of the user profile service in us-central1-b! You realize that the services in the integration testing environment do not need to communicate across availability zones, regions, or the public internet.

Integration testing verifies the functionality of a service relative to other services, not the system as a whole. Figure 12.14 applies the refactoring techniques from chapter 10 to consolidate the infrastructure resources in the integration testing environment into one availability zone.

Figure 12.14 Refactor your IaC to support an integration testing environment in a single availability zone and resolve to a private IP address.

What happens if the availability zone fails? You can always switch the IaC to a different zone or region. Meanwhile, the applications communicate over a private network, so you do not pay for data transfer out of Google Cloud’s network or between regions and zones.
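
You could enforce the consolidation with policy tests in the style of listing 12.11. This sketch assumes a servers fixture shaped like the earlier examples, with access_config standing in for the attribute that grants a server an external IP address.

def test_integration_servers_share_one_zone(servers):
    zones = {values['zone'] for values in servers.values()}
    assert len(zones) == 1, \
        f'integration tests should run in one zone, found {zones}'


def test_integration_servers_use_private_ips_only(servers):
    public = [name for name, values in servers.items()
              if values.get('access_config')]
    assert not public, \
        f'servers should not have public IP addresses, {public}'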

Favor private over public networking, not just for security but also for cost and efficiency. If you use multiple clouds, know which resources need to communicate across clouds. Sometimes you might find it more cost-efficient to shift an entire set of services to another cloud than pay for data transfer between clouds. Applying the change and refactoring techniques in chapters 9 and 10 can help consolidate services and communications.

12.3.4 Test in production

Even after shifting to multiple clouds and optimizing data transfer, you might find that your testing environment cannot fully mimic production and costs too much to run. At some point, you cannot get away with testing in an isolated environment. Rather than simulate the production environment in its entirety, you can continue to optimize costs by testing directly in the production environment.

In the case of the conference platform, you help the video service team implement changes to test in production. In figure 12.15, the team members stage a new set of infrastructure resources with the expected changes hidden behind a feature flag. Then they toggle the flag to direct all application and user traffic to the new resources and verify their functionality in production. The two sets of resources run simultaneously for a few weeks before the team deletes the old infrastructure resources.

Figure 12.15 Testing in production for IaC applies blue-green deployments and feature flagging for a set of resources.
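
As a minimal sketch of the pattern, a feature flag in IaC might select which set of resources receives traffic; the flag, backend names, and resource shape here are assumptions.

USE_GREEN_VIDEO_SERVICE = True  # flip to False to roll back to blue


def build_load_balancer_backend():
    """Point the load balancer at the blue or green set of resources."""
    backend = ('video-service-green' if USE_GREEN_VIDEO_SERVICE
               else 'video-service-blue')
    return {
        'resource': 'google_compute_backend_service',
        'name': 'video-service',
        'backend_group': backend,
    }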

The team tested its service in production by using a blue-green deployment, without a separate testing environment.

Definition Testing in production is a set of practices that allow you to run tests against production data and systems.

In software development, you use techniques like feature flagging to hide certain functionality in production for testing. Similarly, you use canary deployments to test functionality with a small group of users before offering it to everyone on the platform.

For IaC, testing in production does not map entirely onto the software development practices. You don’t want to test that the code works for a small group of users. You just want to know whether you created broken infrastructure that might affect users! You can still apply a few of the techniques, such as feature flagging and canary deployments, as we did in chapters 9 and 10.

You could test IaC directly in production without a blue-green deployment pattern or feature flagging. However, you will need an established roll-forward plan in case of failure. I worked in one organization that depended entirely on local testing before pushing changes to production. If our change failed, we would attempt to update the system to a previous state. If all else failed, we would create a whole new infrastructure environment and direct all application and user traffic to the new environment.

You might go through all cost optimization, cloud waste reduction, and cost driver control techniques and still never fully optimize your cloud bill! Over time, your organization’s usage and product demands change.

If you have proper monitoring and instrumentation in your systems, you might discover that you have periods of greater or lower demand. For example, your client’s platform has the most demand in May, June, October, and November, for peak conference season.
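
If you uncover that kind of seasonality, you can encode it in IaC. This hedged sketch approximates the shape of a GCP autoscaler scaling schedule; the cron expression, duration, and replica count are assumptions for the conference platform.

# Hold extra minimum capacity during peak conference months.
PEAK_SEASON_SCHEDULE = {
    'name': 'conference-peak-season',
    'min_required_replicas': 3,
    'schedule': '0 8 * 5,6,10,11 *',  # 08:00 daily in May, June, Oct, Nov
    'duration_sec': 43200,            # hold for 12 hours
    'time_zone': 'America/Chicago',
}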

Note You might need to re-architect your systems to take advantage of public cloud elasticity (the ability to scale up or down) and reduce cost over time. Some software architectures do not make it easy to dynamically scale resources. The solution often requires refactoring or re-platforming the application to improve the system’s cost efficiency.

Understanding your system’s resource usage and requirements over time can help you further optimize costs beyond the techniques I described in this chapter. You could reduce costs even further by negotiating a contract with the infrastructure provider. Choosing a new pricing model, such as reserving a certain number of instances or committing to usage volume for discounts, can lower your rates.

The longer you run the system, the more metrics you aggregate. The information can help you iterate on cost driver control, cloud waste reduction, and overall cost optimization in IaC. You can also use the information to negotiate with your cloud provider and reduce the surprises on your cloud computing bill!

Summary

  • Before changing IaC to optimize cost, identify the cost drivers (resources or activities) that may affect the total cost.

  • Manage cost drivers in infrastructure, such as compute resources, by adding policy tests to check for resource type, size, and reservation.

  • Cost estimation parses your IaC for resource attributes and generates an estimate of their cost based on a cloud provider’s API.

  • You can add a policy test to check the output of cost estimation to approximate whether you exceeded your budget.

  • Cloud waste describes unused or underutilized infrastructure resources.

  • Eliminating cloud waste can help decrease your cloud computing cost.

  • Reduce cloud waste by removing or stopping untagged or unused resources, starting and stopping resources on a schedule, right-sizing infrastructure resources for better utilization, enabling autoscaling, and tagging a resource expiration date.

  • Autoscaling increases or decreases the number or amount of resources in a given group based on a metric, such as CPU or memory.

  • Techniques for optimizing cloud computing cost involve building environments on demand, using multiple clouds, assessing data transfer, and testing in production.

  • Testing in production uses practices like blue-green deployments and feature flagging to test infrastructure changes without a testing environment.
