9 Making changes

This chapter covers

  • Determining when to use patterns like blue-green deployments to update infrastructure
  • Establishing immutability to create environments across multiple regions with IaC
  • Determining change strategies for updating stateful infrastructure

In previous chapters, we covered helpful practices and patterns for modularizing, decoupling, testing, and deploying infrastructure. However, you also need to manage changes to existing infrastructure with IaC techniques. In this chapter, you’ll learn how to change IaC with strategies that apply immutability to minimize the impact of potential failures.

Let’s return to Datacenter for Veggies and its struggles to modularize IaC from chapter 5. The company acquires Datacenter for Carnivorous Plants as a subsidiary. Its carnivorous plants, like Venus flytraps, require particular growing conditions.

As a result, Datacenter for Carnivorous Plants needs global networking and network-optimized servers and components. Most teams configure their IaC but realize that their code can’t handle such a widespread change. They ask you, a Datacenter for Veggies engineer with some experience in IaC, to help them.

The engineering team directs you to the Cape sundew team to update its infrastructure first. As a hardy carnivorous plant, the Cape sundew can best handle any system downtime that causes fluctuations in temperature and watering. All of the infrastructure resources for the Cape sundew infrastructure exist in a single repository with a few configuration files.

You investigate and diagram the architecture of the sundew system. Figure 9.1 shows a regional forwarding rule (load balancer) that sends traffic to a regional network with a container cluster and three servers. All traffic circulates in the same region and not globally.

Figure 9.1 The Cape sundew applications use an infrastructure system with shared load balancing resources, three servers, and one container cluster.

You want to change all resources to send traffic globally instead of routing in a single region. However, the sundew team defined all of these resources in the same IaC, also known as the singleton pattern (refer to chapter 3). The resources also share infrastructure state.

How do you roll out the network changes to the system and minimize their impact? You worry about disrupting the watering application if you treat the infrastructure mutably by making the change in place. For example, changing your network to use global routing may affect the servers supporting the watering application.

You remember from chapter 2 that you can use the idea of immutability to build new infrastructure with new changes. If you can apply these techniques to the system, you can isolate and test the changes in a new environment without affecting the old one. In this chapter, you’ll learn how to isolate and make changes to IaC.

Note Demonstrating change strategies requires a sufficiently large (and complex) example. If you run the complete example, you will incur a cost that exceeds the GCP free tier. This book includes only the relevant lines of code and omits the rest for readability. For the listing in its entirety, refer to the book’s code repository at https://github.com/joatmon08/manning-book/tree/main/ch09. If you convert these examples for AWS and Azure, you will also incur a cost. When possible, I offer notations on converting the examples to the cloud provider of your choice.

9.1 Pre-change practices

You jump right into changing the sundew system. Unfortunately, you accidentally delete a configuration attribute that tags a server with blue; the tag allows traffic between all blue instances in the network. You push your change to your delivery pipeline to test the configuration and apply it to production.

Testing misses the deleted tag. Fortunately, your monitoring system sends you an alert that the watering application cannot communicate with your new server! You divert all requests to a duplicate server instance to ensure that the sundews still get watered while you debug.

You realize that you should not have jumped straight into changing the system. The sundew system has existing architecture and tools you need to understand before you start. You also need to know whether the system has backups or alternative environments ready for use if you break something. What should you do before you make a change?

9.1.1 Following a checklist

You always run the risk of introducing bugs and other problems when changing IaC. You need testing, monitoring, and observability (the ability to infer a system’s internal state from its outputs) to ensure that you haven’t affected the system during your change. If you do not have some visibility into your system, you cannot quickly troubleshoot problems from broken changes.

Before you change the sundew system, you decide to review a few things about your system. Figure 9.2 shows what you review. You first add a test to check for your deleted tag. Next, you examine your monitoring system for both system and application health checks and metrics. Finally, you create duplicate servers as backup, just in case you break the existing servers. You can send traffic to the backup server if you accidentally affect the updated server.

Figure 9.2 Before updating the network, you need to verify test coverage, systems and application monitoring, and redundancy.

Why did you create the duplicate servers as backup? It helps to have extra resources that duplicate the previous configuration and that you use only if the primary resources fail. This redundancy keeps your system running.

Definition Redundancy is the duplication of resources to improve the system’s reliability. If updated components fail, the system can use working resources with a previous configuration.

In general, review the following checklist before you make a change:

  • Can you preview each change and test the changes in an isolated environment?

  • Does the system have monitoring and alerts configured to identify any anomalies?

  • Does the application track error responses with health checks, logging, observability, or metrics?

  • Do the applications and their systems have any redundancy?

The items on the list focus on visibility and awareness. Without monitoring systems or tests to help identify problems, you may have trouble identifying or resolving broken changes. I once pushed a change that broke an application and did not find out until two weeks after release. It took so long to realize the problem because we did not have alerts on the application!

A pre-change checklist sets the foundation for debugging any problems and establishing any backup plans should the change fail. You can even use the practices from chapters 6 and 8 to build this checklist into delivery pipelines as quality gates.
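For example, a quality gate for the deleted tag might look like the following test against the dry run. This is a minimal sketch, assuming you export the plan as JSON (for example, with terraform show -json plan.out > plan.json); the plan.json file name and the blue tag value are placeholders for the sundew example.

import json

import pytest

PLAN_FILE = 'plan.json'   # assumed output of `terraform show -json plan.out`


@pytest.fixture
def planned_servers():
   with open(PLAN_FILE) as f:
       plan = json.load(f)
   # Collects every planned compute instance from the dry run.
   return [
       resource
       for resource in plan['planned_values']['root_module']['resources']
       if resource['type'] == 'google_compute_instance'
   ]


def test_servers_keep_blue_network_tag(planned_servers):
   # Fails the pipeline if any server loses the tag that allows traffic
   # between blue instances in the network.
   for server in planned_servers:
       assert 'blue' in (server['values'].get('tags') or [])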

9.1.2 Adding reliability

After reviewing the pre-change checklist, you realize that you need a better backup environment in your system. To ensure that you don’t bring down the entire sundew system in your continued refactoring efforts, you need additional redundancy. Then, when you finally deploy a change to the sundew team’s modules, you will not need to worry about disrupting the entire system.

Unfortunately, the sundew system exists only in us-central1. The sundews don’t get watered if the region fails! You decide to build an idle production sundew system in another region (us-west1) so you can restart the watering application. You use IaC to reproduce the active region in us-central1 to the passive (idle) region in us-west1.

You can now use the environment in the passive region as a backup. In figure 9.3, you update the sundew team’s configuration to use a server module and push the change to the active environment. If it does not work, you temporarily send all traffic to the passive environment while you debug the problem. Otherwise, you run tests and update the passive environment with the module changes.

Figure 9.3 You use IaC to implement an active-passive configuration for the sundew system to improve its reliability during changes.

The sundew system now uses an active-passive configuration, in which one environment sits idle and serves as a backup.

Definition In an active-passive configuration, one system is the active environment for completing user requests, and the other is the backup environment.

If the environment in us-central1 stops working, you can always send traffic to the passive environment in us-west1. Switching from a failed active environment to the working passive one is known as failover.

Definition Failover is the practice of using passive (or standby) resources to take over when the primary resources fail.

Why would you want an active-passive configuration? Building a passive environment in a second region improves the overall reliability of the system. Reliability measures how long a system performs correctly within a time period.

Definition Reliability measures how long a system performs correctly within a time period.

You want to keep your system reliable as you make IaC changes. Improving reliability minimizes disruption to business-critical applications and, ultimately, end users. You reduce your blast radius to the active environment with the option to cut over traffic to a working passive environment.

Let’s create the active-passive configuration in code. In your terminal, you copy the file named blue.py containing the sundew’s infrastructure resources to a new file named passive.py:

$ cp blue.py passive.py

For passive.py in listing 9.1, you update a few variables to create the passive sundew environment, including region and name.

Listing 9.1 Updating the passive sundew environment for us-west1

TEAM = 'sundew'
ENVIRONMENT = 'production'
VERSION = 'passive'                                                 
REGION = 'us-west1'                                                 
IP_RANGE = '10.0.1.0/24'                                            
 
 
zone = f'{REGION}-a'
network_name = f'{TEAM}-{ENVIRONMENT}-network-{VERSION}'            
server_name = f'{TEAM}-{ENVIRONMENT}-server-{VERSION}'              
 
cluster_name = f'{TEAM}-{ENVIRONMENT}-cluster-{VERSION}'            
cluster_nodes = f'{TEAM}-{ENVIRONMENT}-cluster-nodes-{VERSION}'     
cluster_service_account = f'{TEAM}-{ENVIRONMENT}-sa-{VERSION}'      
 
labels = {                                                          
   'team': TEAM,                                                    
   'environment': ENVIRONMENT,                                      
   'automated': True                                                
}                                                                   

Sets the version to identify the passive environment

Changes the region from us-central1 to us-west1 for the passive environment

Uses a different IP address range for the passive environment to avoid overlapping with the active environment’s address space

Remaining variables and functions reference the constants for the passive environment, including region.

Defines labels for resources so you can identify the production environment

AWS and Azure equivalents

GCP labels are similar to AWS and Azure tags. You can take the object defined in the labels variable and pass it to AWS and Azure resource tags.
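For example, a hypothetical AWS version of a sundew server could reuse the labels object directly as tags. This sketch is not part of the chapter’s repository; the AMI ID and instance type are placeholders.

def server(name=server_name, tags=labels):
   # Hypothetical AWS equivalent: reuses the GCP labels object as EC2 tags.
   # Terraform converts the boolean 'automated' value to the string "true".
   return [{
       'aws_instance': {
           VERSION: [{
               'ami': 'ami-12345678',        # placeholder Ubuntu AMI ID
               'instance_type': 't2.micro',  # rough equivalent of f1-micro
               'name': name,
               'tags': tags
           }]
       }
   }]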

You now have a backup environment in case your module change goes wrong. Imagine you push the change and break the active environment in us-central1. You can update and push the configuration for a production global load balancer to fail over and send everything to the passive environment. In the following listing, you change the weight on your global load balancer to send 100% of traffic to the passive environment.

Listing 9.2 Failover to passive sundew environment in us-west1

import blue                                                       
import passive                                                    
 
services_list = [
   {
       'version': 'blue',                                         
       'zone': blue.zone,
       'name': f'{shared_name}-blue',
       'weight': 0                                                
   }, {
       'version': 'passive',                                      
       'zone': passive.zone,
       'name': f'{shared_name}-passive',
       'weight': 100                                              
   }
]
 
 
def _generate_backend_services(services):                         
   backend_services_list = []
   for service in services:
       version = service['version']
       weight = service['weight']
       backend_services_list.append({
           'backend_service': (                                   
               '${google_compute_backend_service.'                
               f'{version}.id}}'                                  
           ),                                                     
           'weight': weight,                                      
       })
   return backend_services_list
 
 
def load_balancer(name, default_version, services):
   return [{
       'google_compute_url_map': {                                
           TEAM: [{
               'name': name,
               'path_matcher': [{
                   'name': 'allpaths',
                   'path_rule': [{                                
                       'paths': [                                 
                           '/*'                                   
                       ],                                         
                       'route_action': {                          
                           'weighted_backend_services':           
                               _generate_backend_services(        
                                   services)                      
                       }                                          
                   }]
               }]
           }]
       }
   }]

Imports IaC for both blue (active environment) and passive environment

Defines a list of versions for each environment to attach to the load balancer, blue and passive

Configures the load balancer with a weight to send 0% of traffic to the blue version

Configures the load balancer with a weight to send 100% of traffic to the passive version

Adds the two versions to the load balancer with their weights as routes to the load-balancing rule

Defines backend services for the blue and passive environments with a weight to direct traffic to each one

Creates the Google compute URL map (load-balancing rule) using a Terraform resource based on the path, blue (active) and passive servers, and weight

Sets up a path rule that directs all paths to the active or passive servers

Adds the two versions to the load balancer with their weights as routes to the load-balancing rule

AWS and Azure equivalents

The Google Cloud URL map is similar to AWS Application Load Balancer (ALB) or Azure Traffic Manager and Application Gateway. To convert listing 9.2 to AWS, you will need to update the resource to create an AWS ALB and listener rule. Then, add path routing and weight attributes to the ALB listener rule.

For Azure, you will need to link an Azure Traffic Manager profile and endpoint to an Azure Application Gateway. Update the Azure Traffic Manager with weights and route them to the correct backend address pool attached to an Azure Application Gateway.

After you fail over the system to the passive environment, the sundew team reports the return of the watering application. You have a chance to debug problems with your module in the blue (active) environment. The active-passive configuration will protect against failures in individual regions in the future.

The sundew team members tell you that, eventually, they want to send traffic to both regions, with the environments in both regions processing requests. Figure 9.4 shows their dream configuration. The next time they update the module, they want to push it out to one region at a time. If that region breaks, the other region still processes most requests. The system waters the sundews less frequently, but you have an opportunity to fix the broken region.

Figure 9.4 In the future, the sundew team will further refactor the system configuration to support an active-active configuration and send requests to both regions.

Why aspire to run two active environments? Many distributed systems run in an active-active configuration, which means both systems process and accept requests and replicate data between them. Public cloud architectures recommend using a multiregion, active-active configuration to improve the system’s reliability.

Definition In an active-active configuration, multiple systems are the active environments for completing user requests and replicating data between them.

Your changes to IaC will differ depending on whether you run an active-passive or active-active configuration. To support an active-active configuration, the sundew team must refactor its IaC into more modular components and add data replication between environments. Assuming the sundew team refactors its applications to support an active-active configuration, you will need to implement some form of global load balancing in your IaC and connect it to each region.

IaC follows the principle of reproducibility, which we covered in chapter 1. Thanks to this principle, we can create a new environment in a new region with a few updates to attributes. You do not have to painstakingly rebuild an environment resource by resource as we did in chapter 2.

However, you may find yourself copying and pasting a lot of the same configuration. Try to pass regions as inputs and modularize your IaC to reduce copying and pasting. Separate shared resources configuration, like global load balancer definitions, away from the environment configurations.
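As a minimal sketch of that idea, you could build each environment from one function that takes the version, region, and IP range as inputs. The environment function below is illustrative and not part of the chapter’s repository; it assumes the same naming and resource conventions as the listings in this chapter.

def environment(version, region, ip_range,
               team='sundew', env='production'):
   # Builds one environment's resources from shared inputs instead of
   # copying blue.py for every new region.
   name = f'{team}-{env}-network-{version}'
   return [{
       'google_compute_network': {
           version: [{
               'name': name,
               'auto_create_subnetworks': False,
               'routing_mode': 'REGIONAL'
           }]
       }
   }, {
       'google_compute_subnetwork': {
           version: [{
               'name': f'{name}-subnet',
               'region': region,
               'network': f'${{google_compute_network.{version}.name}}',
               'ip_cidr_range': ip_range
           }]
       }
   }]


# The active and passive environments become two calls with different inputs.
active = environment('blue', 'us-central1', '10.0.0.0/24')
passive = environment('passive', 'us-west1', '10.0.1.0/24')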

Multiregion environments always incur a cost in terms of money and time but can help improve system reliability. IaC expedites creating copies of new environments in other regions and enforces consistent configurations across regions. Inconsistencies between regions can introduce significant system failures and increase maintenance effort! Chapter 12 discusses cost management and its considerations.

9.2 Blue-green deployment

Now that you have another environment to fall back on if you accidentally corrupt the active one, you can start updating the sundew system to use global networking and premium tier network access for servers. The sundew system uses an active-passive configuration, which means duplicating a whole new environment in a new region just to make changes.

You realize that some changes don’t require duplicating entire environments across multiple regions. Why have an entire passive environment just to update a server? Can’t you just update one server? After all, we want to minimize the blast radius of failures and optimize our resource efficiency. Instead of using an active-passive configuration, you can apply the pattern to fewer resources on a much smaller scale.

In figure 9.5, you reproduce a new network with global networking instead of the entire environment. You label the new network green and deploy a set of three servers and one cluster on it. After testing the new resources, you use your global load balancer to send a small percentage of traffic to the new resources. The requests succeed, indicating that the update to global routing worked.

Figure 9.5 Use blue-green deployment to create a green environment for the global network, send some traffic to the new resources, and remove the old resources.

The pattern of creating a new set of resources and gradually cutting over to them applies the principles of composability and evolvability of your system. You add a new green set of resources to your environment and allow them to evolve independently of the old resources. If you want to make changes to infrastructure, repeat this workflow to reduce your blast radius of changes and test new resources before sending traffic to them.

This pattern, called a blue-green deployment, creates a new subset of infrastructure resources that stages the changes you want to make, labeled green. You then direct a few requests to the green staging infrastructure resources and ensure that everything works. Over time, you send all requests to the green infrastructure resources and remove the old production resources, labeled blue.

Definition Blue-green deployment is a pattern that creates a new subset of infrastructure resources with the changes you want to make. You gradually shift traffic from the old set of resources (blue) to the new set of resources (green) and eventually remove the old ones.

Blue-green deployment allows you to isolate and test changes in a (temporarily) new staging environment before sending requests to it. After validating the green environment, you can switch the new environment to the production one and delete the old one. You temporarily pay for two environments for a few weeks but minimize the overall cost of maintaining persistent environments.

Note Blue-green deployment has a few different labels, occasionally with subtle nuances depending on the context. It doesn’t matter what color or names you label the environments, as long as you can identify which environment serves as the existing production one or the new staging one. I have also used production/staging and version numbering (v1/v2) to identify old and new resources during blue-green deployment.

You should use a blue-green deployment pattern to refactor or update your IaC beyond a few minimal configuration changes. Blue-green deployment depends on the principle of immutability to create new resources, cut over traffic or functionality to them, and remove old resources. Most of the patterns for refactoring (chapter 10) and changing IaC involve the principle of reproducibility.

Image building and configuration management

You can similarly mitigate the risk of a failed machine image or configuration by applying a blue-green deployment pattern. Isolate machine image or configuration management updates to a new server (green), send traffic to it, and test its functionality before removing the old server (blue).

You already have a global load balancer for the existing blue network that you can use later to connect the new green network. In the following sections, let’s implement each step of a blue-green deployment for the sundew system.

9.2.1 Deploying the green infrastructure

To start a blue-green deployment for the sundew system’s global networking and premium tier servers, you copy the configuration for the existing blue network. You create the file named green.py and paste the blue network configuration. In the following listing, you make changes to the network definition so it uses a global routing mode.

Listing 9.3 Creating the green network

TEAM = 'sundew'
ENVIRONMENT = 'production'
VERSION = 'green'                                            
REGION = 'us-central1'
IP_RANGE = '10.0.0.0/24'                                     
 
zone = f'{REGION}-a'
network_name = f'{TEAM}-{ENVIRONMENT}-network-{VERSION}'     
 
labels = {
   'team': TEAM,
   'environment': ENVIRONMENT,
   'automated': True
}
 
 
def build():                                                 
   return network()                                          
 
 
def network(name=network_name,                               
           region=REGION,
           ip_range=IP_RANGE):
   return [
       {
           'google_compute_network': {                       
               VERSION: [{
                   'name': name,
                   'auto_create_subnetworks': False,
                   'routing_mode': 'GLOBAL'                  
               }]
           }
       },
       {
           'google_compute_subnetwork': {                    
               VERSION: [{
                   'name': f'{name}-subnet',
                   'region': region,
                   'network': f'${{google_compute_network.{VERSION}.name}}',
                   'ip_cidr_range': ip_range
               }]
           }
       }
   ]

Sets the name of the new network version to “green”

Keeps the IP address range for green the same as the blue network. GCP allows the two networks to have the same CIDR block if you have not set up peering.

Uses the module to create the JSON configuration for the network and subnetwork for the green network

Creates the Google network by using a Terraform resource based on the name and a global routing mode

Updates the green network’s routing mode to global to expose routes globally

Creates the Google subnetwork by using a Terraform resource based on the name, region, network, and IP address range

AWS and Azure equivalents

If you convert listing 9.3 to AWS or Azure, the global routing mode does not apply. You can still update the code listing to AWS or Azure by changing Google’s network and subnetwork to a VPC or virtual network, and changing the subnets and routing tables.

You want to keep the same configuration for blue and green resources when possible. They should differ only in the changes you want to make to the green resources. However, you might have some differences!

For example, if I had some specific peering configuration for my networks, I could not use the blue network’s IP address range for the green network. Instead, I would need a different IP address range, like 10.0.1.0/24, and update any dependencies to communicate to another IP address range.

Blue-green deployment favors immutability, creating new, updated resources and isolating the changes away from the old resources. However, deploying a new version of a low-level resource like networking does not mean you can immediately send live traffic to it. You always start by changing and testing the infrastructure resource you want to update. Then, you must change and test other resources that depend on it.

9.2.2 Deploying high-level dependencies to the green infrastructure

When you use the blue-green deployment pattern, you always need to deploy a new infrastructure resource with the changes and a new set of high-level resources that depend on it. You finished updating the network but cannot use it unless you have servers and applications on it. The new network needs high-level infrastructure that depends on it.

You work with the sundew team to deploy new clusters and servers onto the green network, as shown in figure 9.6. The servers must use premium networking on the global network. The sundew team also deploys its applications onto the cluster and servers.

Figure 9.6 After creating the network, a low-level infrastructure resource, you also need to create new high-level resources like the server and container cluster that depend on it.

In this example, changing a low-level infrastructure resource like the network affects the high-level resources. The servers must run with premium networking. Updating the original blue network in place from regional to global routing would likely have affected the servers and cluster. With a blue-green deployment, you evolve the network attributes for the servers without affecting the live environment.

Let’s review a sample of the IaC that the sundew team members added to deploy the cluster onto the green network in the following listing. They copied the cluster configuration from the blue resource and updated its attributes to run on the green network.

Listing 9.4 Adding a new cluster to the green network

VERSION = 'green'                                                                
 
cluster_name = f'{TEAM}-{ENVIRONMENT}-cluster-{VERSION}'                         
cluster_nodes = f'{TEAM}-{ENVIRONMENT}-cluster-nodes-{VERSION}'                  
cluster_service_account = f'{TEAM}-{ENVIRONMENT}-sa-{VERSION}'                   
 
 
def build():
    return network() + \
        cluster()
 
def cluster(name=cluster_name,                                                   
           node_name=cluster_nodes,                                              
           service_account=cluster_service_account,                              
           region=REGION):                                                       
   return [
       {
           'google_container_cluster': {                                         
               VERSION: [
                   {
                       'initial_node_count': 1,   
                       'location': region,
                       'name': name,                                             
                       'remove_default_node_pool': True,
                       'network': f'${{google_compute_network.{VERSION}.name}}', 
                       'subnetwork':                                            
                         f'${{google_compute_subnetwork.{VERSION}.name}}'        
                   }
               ]
           }
       }
   ]

Labels the new version of the cluster “green”

Uses the module to create the JSON configuration for the network, subnetwork, and cluster for the green network

Builds the cluster on the green network and subnetwork

Passes required attributes to the cluster, including name, node names, service accounts for automation, and region

Creates the Google container cluster by using a Terraform resource with one node and on the green network

Builds the cluster on the green network and subnetwork

AWS and Azure equivalents

You can update the code by changing the Google container cluster to an Amazon EKS cluster or Azure Kubernetes Service (AKS) cluster. You will need an Amazon VPC and Azure virtual network for the Kubernetes node pools (also called groups).

The cluster does not require any changes to adapt to the global networking configuration. However, the servers need premium networking. You copy the server configuration from blue and change it to use premium networking attributes in green.py.

Listing 9.5 Adding premium networking to servers on the green network

 
VERSION = 'green'                                           
 
server_name = f'{TEAM}-{ENVIRONMENT}-server-{VERSION}'      
 
 
def build():
    return network() + \
        cluster() + \
        server0() + \
        server1() + \
        server2()
 
 
def server0(name=f'{server_name}-0',                        
           zone=zone):                                      
   return [
       {
           'google_compute_instance': {                     
               f'{VERSION}_0': [{
                   'allow_stopping_for_update': True,
                   'boot_disk': [{
                       'initialize_params': [{
                           'image': 'ubuntu-1804-lts'    
                       }]
                   }],
                   'machine_type': 'f1-micro',          
                   'name': name,
                   'zone': zone,
                   'network_interface': [{
                       'subnetwork': 
                            f'${{google_compute_subnetwork.{VERSION}.name}}',
                       'access_config': {
                           'network_tier': 'PREMIUM'        
                       }
                   }]
               }]
           }
       }
   ] 

Labels the new version of the network “green”

Creates a template for the server name, which includes the team, environment, and version (blue or green)

Uses the module to create the JSON configuration for the network, subnetwork, cluster, and server for the green network

Builds the three servers on the green network with the cluster

Copies and pastes each server configuration. This code snippet features the first server, server0. Other server configurations are omitted for clarity.

Creates a small Google compute instance (server) by using a Terraform resource on the green network

Sets the network tier to use premium networking. This enables compatibility with the underlying subnet, which uses global routing.

AWS and Azure equivalents

If you convert listing 9.5 to AWS or Azure, the network tier does not apply. You can still update the code by changing the Google compute instance to an Amazon EC2 instance or Azure Linux virtual machine with an Ubuntu 18.04 image. You will need an Amazon VPC and Azure virtual network first.

Updating the network tier to premium should not affect the functionality of the applications, although you can’t know for certain until you test! The green environment allows you to identify and mitigate any problems before affecting sundew growth. After the sundew team makes the updates, it pushes the changes and checks the test results in the delivery pipeline.

The tests include unit, integration, and end-to-end testing to ensure that you can run the applications on the new container cluster and send requests to the new green servers. Fortunately, the tests pass, and you feel ready to send live traffic to the green resources.
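For example, one of the end-to-end tests might be a smoke test that sends a request to the green servers before any live traffic reaches them. This is a minimal sketch; the GREEN_ENDPOINT address and health path are hypothetical.

import urllib.request

# Hypothetical address of a green server or test endpoint behind the
# green backend service.
GREEN_ENDPOINT = 'http://10.0.0.10:8080/health'


def test_green_server_responds():
   # Smoke test: the watering application on the green servers should
   # answer health checks before receiving live traffic.
   with urllib.request.urlopen(GREEN_ENDPOINT, timeout=5) as response:
       assert response.status == 200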

9.2.3 Using a canary deployment to the green infrastructure

You could immediately send all traffic to the green network, servers, and cluster. However, you don’t want to bring down the sundew system! Ideally, you want to switch all traffic back to blue when you find a problem with your system. In figure 9.7, you adjust the global load balancer to send 90% of traffic to the blue network and 10% of traffic to the services on the green network.

Figure 9.7 Configure the global load balancer to run a canary test and send a small percentage of traffic to green resources.

Sending this small amount of traffic to the updated resources is known as a canary deployment. If the requests result in errors, you need to debug and fix your changes.

Definition Canary deployment is a pattern that sends a small percentage of traffic to the updated resources in the system. If the requests complete successfully, you increase the percentage of traffic over time.

Why send a small amount of traffic first? You do not want all of your requests failing. Sending a few requests to the updated resources helps identify critical problems before you affect your entire system.

Canary in a coal mine

A canary in software or infrastructure serves as a first indicator of whether your new system, feature, or application will work. The term comes from the expression “canary in a coal mine.” Miners would bring caged birds with them into mines. If the mine had dangerous gas, the birds would serve as the first indicator.

You’ll often find references to canary testing in software development, which measures user experience for a new version of an application or feature. I highly recommend canary deployment, the technique of sending a small percentage of traffic to new resources, anytime you make significant infrastructure changes.

Note that you do not have to use load balancers to achieve a canary deployment. You can use any method to send a few requests to updated infrastructure resources. For example, you could add one updated application instance to an existing pool of three application instances. A round-robin load-balancing scheme sends about 25% of your requests to the new, updated instance and 75% to old, existing application instances.
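As a sketch of that alternative, you could place one updated server into an unmanaged instance group alongside three existing servers. The function below is illustrative and assumes all four servers share the same zone and network; it is not part of the sundew configuration.

def canary_instance_group(name, zone, existing_servers, updated_server):
   # Unmanaged instance group mixing three existing servers with one updated
   # server. Round-robin load balancing sends roughly 25% of requests to the
   # updated server.
   return [{
       'google_compute_instance_group': {
           'canary': [{
               'name': name,
               'zone': zone,
               'instances': existing_servers + [updated_server]
           }]
       }
   }]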

For the sundew team, you separate the global load balancer configuration from the green and blue environments. This improves the load balancer’s evolvability. You add the green servers as a separate backend service to the load balancer and control requests between the green and blue environments.

You define the load balancer in a file named shared.py in the following listing. Let’s add the green version of the network (servers and cluster, too!) to the list of versioned environments with a weight of 10.

Listing 9.6 Adding a green version to the list of load-balancing services

import blue                                         
import green                                        
 
shared_name = f'{TEAM}-{ENVIRONMENT}-shared'        
 
 
services_list = [                                   
   {
       'version': 'blue',                           
       'zone': blue.zone,                           
       'name': f'{shared_name}-blue',               
       'weight': 90                                 
   },
   {
       'version': 'green',                          
       'zone': green.zone,                          
       'name': f'{shared_name}-green',              
       'weight': 10                                 
   }
]
 
 
def _generate_backend_services(services):           
   backend_services_list = []
   for service in services:                         
       version = service['version']                 
       weight = service['weight']                   
       backend_services_list.append({
           'backend_service': (                     
               '${google_compute_backend_service.'  
               f'{version}.id}}'                    
           ),    
           'weight': weight,                        
       })
   return backend_services_list

Imports IaC for both blue and green environments

Creates the name for the shared global load balancer based on the team and environment

Defines a list of versions for each environment to attach to the load balancer, blue and green environments

Adds the blue network, servers, and cluster to the load balancer list. Retrieves the availability zone of the blue environment from its IaC.

Sets the weight of traffic to the blue server instances to 90, representing 90% of requests

Adds the green network, servers, and cluster to the load balancer list. Retrieves the availability zone of the green environment from its IaC.

Sets the weight of traffic to the green server instances to 10, representing 10% of requests

Creates a function to generate a list of backend services for a load balancer

For each environment, defines a Google load-balancing backend service with a version and weight

AWS and Azure equivalents

The backend services in listing 9.6 are similar to an AWS target group for an AWS ALB. However, Azure requires additional resources. You will need to create an Azure Traffic Manager profile and endpoint to the backend address pool attached to the Azure Application Gateway.

The load balancer in shared.py already accepts a list of backend services with differing weights. Once you deploy the list of weights and services in the following listing, the load balancer configuration starts sending 10% of traffic to the green network.

Listing 9.7 Updating the load balancer to send traffic to green

default_version = 'blue'                                       
 
def load_balancer(name, default_version, services):            
   return [{
       'google_compute_url_map': {                             
           TEAM: [{
               'default_service': (                            
                   '${google_compute_backend_service.'         
                   f'{default_version}.id}}'                   
               ),    
               'description': f'URL Map for {TEAM}',
               'host_rule': [{
                   'hosts': [
                       f'{TEAM}.{COMPANY}.com'
                   ],
                   'path_matcher': 'allpaths'
               }],
               'name': name,
               'path_matcher': [{
                   'default_service': (                        
                       '${google_compute_backend_service.'     
                       f'{default_version}.id}}'               
                   ),                                          
                   'name': 'allpaths',
                   'path_rule': [{
                       'paths': [
                           '/*'
                       ],
                       'route_action': {
                           'weighted_backend_services':        
                               _generate_backend_services(     
                                   services)                   
                       }
                   }]
               }]
           }]
       }
   }]

Sends all traffic from the load balancer to the blue environment by default

Uses the module to create the JSON configuration for the load balancer to send 10% of traffic to green and 90% of traffic to blue

Creates the Google compute URL map (load-balancing rule) by using a Terraform resource based on the path, blue and green environments, and weight

Sets routing rules on the load balancer to send 10% of traffic to green and 90% of traffic to blue.

AWS and Azure equivalents

The Google Cloud URL map is similar to an AWS ALB or Azure Traffic Manager and Application Gateway. To convert listing 9.7 to AWS, you will need to update the resource to create an AWS ALB and listener rule. Then, add path routing and weight attributes to the ALB listener rule.

For Azure, you will need to link an Azure Traffic Manager profile and endpoint to an Azure Application Gateway. Update the Azure Traffic Manager with weights and route them to the correct backend address pool attached to an Azure Application Gateway.

You run the Python code to build the Terraform JSON configuration, shown in listing 9.8, for review. The JSON configuration for the load balancer includes the instance groups that organize the blue and green servers, the backend services that target the blue and green instance groups, and the weighted routing actions.

Listing 9.8 JSON configuration for the load balancer

{
   "resource": [
       {
           "google_compute_url_map": {                                    
               "sundew": [
                   {
                       "default_service":                                
                           "${google_compute_backend_service.blue.id}",   
                       "description": "URL Map for sundew",
                       "host_rule": [
                           {
                               "hosts": [
                                   "sundew.dc4plants.com"
                               ],
                               "path_matcher": "allpaths"
                           }
                       ],
                       "name": "sundew-production-shared",
                       "path_matcher": [
                           {
                               "default_service": 
                                 "${google_compute_backend_service.blue.id}",
                               "name": "allpaths",
                               "path_rule": [
                                   {
                                       "paths": [                         
                                           "/*"                           
                                       ],                                 
                                       "route_action": {
                                           "weighted_backend_services": [
                                                {
                                                    "backend_service": "${google_compute_backend_service.blue.id}",
                                                    "weight": 90
                                                },
                                                {
                                                    "backend_service": "${google_compute_backend_service.green.id}",
                                                    "weight": 10
                                                }
                                           ]
                                       }
                                   }
                               ]
                           }
                       ]
                   }
               ]
           }
       }
   ]
}

Defines the Google compute URL map (load-balancing rule) using a Terraform resource based on the path, blue and green environments, and weight

Defines the default service for the Google compute URL map (load-balancing rule) to the blue environment

Sends all requests to the blue or green environments based on weight

Sends 90% of traffic to blue backend services, which use the blue network

Sends 10% of traffic to green backend services, which use the green network

Why send all traffic by default to the blue environment? You know the blue environment processes requests successfully. If your green environment breaks, you can quickly switch the load balancer to send traffic to the default blue environment.

In general, copy, paste, and update the green resources. If you express the blue resources in a module, you need to change only the attributes passed to the module. I keep the green and blue environment definitions in separate folders or files when possible. This makes it easier to identify the environments later.

You may notice some Python code in shared.py that makes it easier to evolve the list of environments and the default environment attached to the load balancer. I usually define a list of environments and a variable for the default environment. Then, I iterate over the list of environments and attach the attributes to a load balancer. This ensures that the high-level load balancer can evolve to accommodate the different resources and environments.

As you add new resources, you can adjust your load balancer to send traffic to additional environments. You may find yourself updating your load balancer’s IaC each time you want to run a blue-green deployment. Taking the time and effort to configure the load balancer helps mitigate any problems from changes and controls the rollout of potentially disruptive updates.

9.2.4 Performing regression testing

If you immediately send all traffic to the green network and it fails, you could disrupt the watering system for the Cape sundews. As a result, you start with a canary deployment and increase the ratio of traffic to the green network by 10% each day. The process takes about two weeks, but you feel confident that you updated the network correctly! If you find a problem, you reduce traffic to the green network and debug.

Figure 9.8 shows your gradual process of increasing traffic and testing the green environment over time. You gradually decrease traffic to the blue environment until it reaches 0%. You inversely increase traffic to the green environment until it reaches 100%. You run the green environment for a week or two before disabling the blue network just in case the change broke the system in the green environment.

Figure 9.8 Allow a week for a regression test before cutting all traffic to the new network and removing the old one, which allows for time to verify functionality.

The process of gradually increasing traffic and waiting a week before proceeding seems so painful! However, you need to allow the system to run enough traffic through the green environment to determine whether you can proceed. Some failures appear only with enough traffic through the system, while others take time to detect.

The window of time you spend testing, observing, and monitoring the system for errors becomes part of a regression test for the system. Regression testing checks whether changes to the system affect existing or new functionality. Gradually increasing the traffic over time allows you to assess the system’s functionality while mitigating the potential failure impact.

Definition Regression testing checks whether changes to the system affect existing or new functionality.

How much should you increase traffic to the green environment? Increasing traffic by 1% each day does not provide much information unless your system serves millions of requests. “Gradual” doesn’t offer clear criteria for which increments you should use. I recommend assessing the number of requests your system serves daily and the cost of failure (such as errors on user requests).

I usually start with increments of 10% and check how many requests that means for the system. If I do not get a sufficient sample size of requests to identify failures, I increase the increment. You want to insert a regression-testing window between each percentage increase to identify system failures.
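A small helper can make the rollout schedule explicit. This sketch is not part of the chapter’s repository; it generates the blue and green weights for each regression-testing window, which you could feed into the services_list in shared.py.

def canary_schedule(increment=10):
   # Returns (blue_weight, green_weight) pairs, one per regression-testing
   # window, until the green environment receives all traffic.
   schedule = []
   green_weight = 0
   while green_weight < 100:
       green_weight = min(green_weight + increment, 100)
       schedule.append((100 - green_weight, green_weight))
   return schedule


# With 10% increments, the rollout takes ten windows:
# [(90, 10), (80, 20), ..., (10, 90), (0, 100)]
print(canary_schedule())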

Even after increasing the load balancer to send all the traffic to the green network, you still want to keep running tests and monitor system functionality for a week or two. Why run regression tests for a few weeks? Sometimes you might encounter edge cases from application requests that break functionality. By allowing a period for regression tests, you can observe whether the system can handle unexpected or uncommon payloads or requests.

9.2.5 Deleting the blue infrastructure

You observe the sundew system for two weeks and resolve any errors. You know that the blue network has not processed any requests or data for about two weeks, which means you can remove it without additional migration steps. You confirm the inactivity with a peer or change advisory board review. Figure 9.9 shows how you update the default service to the green environment before deleting the blue environment from IaC.

Figure 9.9 Decommission the old network by deleting it from IaC and removing all references.

I consider deleting the blue environment, including its network, a major change that requires additional peer review. You may not know who uses the network. Other resources that you do not share with other teams, like servers, may not need additional peer review or change approval. Assess the potential impact of deleting an environment and categorize the change based on patterns in chapter 7. Let’s adjust the load balancer in shared.py by setting the default service to the green network and removing the blue network from its backend services, as shown in the following listing.

Listing 9.9 Removing the blue environment load balancer

import blue                             
import green                            
 
TEAM = 'sundew'
ENVIRONMENT = 'production'
PORT = 8080
 
shared_name = f'{TEAM}-{ENVIRONMENT}-shared'
 
default_version = 'green'               
 
services_list = [                       
   {
       'version': 'green',
       'zone': green.zone,
       'name': f'{shared_name}-green',
       'weight': 100                    
   }
]

Imports IaC for both blue and green environments

Changes the default version of the network for the load balancer to green

Removes the blue network and instances from the list of backend services to generate

Sends all traffic to the green network

AWS and Azure equivalents

Listing 9.9 stays the same for AWS and Azure. You will want to map the version, availability zone, name, and weight to the AWS ALB or Azure Traffic Manager.

You apply the changes to the load balancer. However, you do not delete the blue resources immediately because you must ensure that the load balancer does not reference any blue resources. After testing the changes, you remove the code to build the blue environment from main.py and leave the green environment, as follows.

Listing 9.10 Removing the blue environment from main.py

import green
import json
import shared
 
if __name__ == "__main__":
   resources = {
       'resource':
       shared.build() +                              
       green.build()                                 
   }
 
   with open('main.tf.json', 'w') as outfile:        
       json.dump(resources, outfile,                 
                 sort_keys=True, indent=4)           

Uses the shared module to create the JSON configuration for the global load balancer

Uses the green module to create the JSON configuration for the network with global routing, servers with premium networking, and the cluster

Writes the Python dictionary out to a JSON file to be executed by Terraform later

You apply the changes, and your IaC tool deletes all blue resources. You decide to delete the blue.py file to prevent anyone from creating new blue resources. I recommend removing any files you do not use to reduce confusion for your teammates in the future. Otherwise, you may have a system with more resources than you need.

Exercise 9.1

Consider the following code:

if __name__ == "__main__":
  network.build()
  queue.build(network)
  server.build(network, queue)
  load_balancer.build(server)
  dns.build(load_balancer)

The queue depends on the network. The server depends on the network and queue. How would you run a blue-green deployment to upgrade the queue with SSL?

See appendix B for answers to exercises.

9.2.6 Additional considerations

Imagine the sundew team needs to change the network again. Instead of creating a new green network, the team can create a new blue network and repeat the deployment, regression test, and deletion process! Since the old blue network no longer exists, this update does not conflict with an existing environment.

What you name your versions or iterations of changes does not matter, as long as you differentiate between old and new resources. For networking specifically, consider allocating two sets of IP address ranges and permanently reserving one for the blue network and the other for the green network. This allocation gives you the flexibility to make changes by using blue-green deployment without searching for open network space.
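A minimal sketch of that reservation, using the same constants style as the rest of the chapter; the CIDR blocks are placeholders:

# Permanently reserved address space for each deployment color.
RESERVED_IP_RANGES = {
   'blue': '10.0.0.0/24',
   'green': '10.0.1.0/24',
}

VERSION = 'green'
IP_RANGE = RESERVED_IP_RANGES[VERSION]   # looks up the range by version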

In general, I decide to use a blue-green deployment strategy when I encounter the following:

  • Reverting changes to the resource takes a long time.

  • I am not confident that I can revert changes to the resource after deployment.

  • The resource has many high-level dependencies that I cannot easily identify.

  • The resource change affects critical applications that cannot have downtime.

Not all infrastructure resources should use blue-green deployment. For example, you can update IAM policies in place and quickly revert them if you identify a problem. You’ll learn more about reverting changes in chapter 11.

A blue-green deployment strategy costs less in time and money than maintaining multiple environments. However, this strategy will cost more when you have to deploy low-level infrastructure resources like networks, projects, or accounts! I usually consider the pattern worth the cost. It isolates the change to specific resources and provides a lower-risk methodology for deploying changes and minimizing disruption to systems.

9.3 Stateful infrastructure

Throughout this chapter, the example has omitted an essential class of infrastructure resources. The sundew system also includes numerous resources that process, manage, and store data. For example, it runs a Google SQL database on a network with regional routing.

9.3.1 Blue-green deployment

The sundew application team members remind you that they need to update their database to use the new network with global routing. You update the private network ID in IaC and push the changes to your repository. Your deployment pipeline fails on a compliance test (something you learned about in chapter 8).

You notice that the test checks the dry run (plan) for database deletions and fails:

$ pytest . -q
 
F                                                 [100%]
====== FAILURES ======
_____ test_if_plan_deletes_database _____
 
database = {'address': 'google_sql_database_instance.blue', 'change': 
{'actions': ['delete'], 'after': None, 'after_sensitive': False, 
'after_unknown': {}, ...}, 'mode': 'managed', 'name': 'blue', ...}
 
    def test_if_plan_deletes_database(database):
>       assert database['change']['actions'][0] != 'delete'
E       AssertionError: assert 'delete' != 'delete'
 
test/test_database_plan.py:35: AssertionError
======= short test summary info =======
FAILED test/test_database_plan.py::test_if_plan_deletes_database - 
AssertionError: assert 'delete' != 'delete'
1 failed in 0.04s

The compliance test prevents you from deleting a critical database! If you applied the change without the test, you would delete all of the sundew data! The sundew team raises concerns about updating the database’s network in place, so you need to do a blue-green deployment.
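A minimal sketch of what such a compliance test might look like, assuming the dry run is exported as JSON (for example, with terraform show -json plan.out > plan.json). The fixture name matches the failure output above, but the plan parsing details are illustrative.

import json

import pytest

PLAN_FILE = 'plan.json'   # assumed output of `terraform show -json plan.out`


@pytest.fixture(params=['blue'])
def database(request):
   with open(PLAN_FILE) as f:
       plan = json.load(f)
   # Finds the planned change for the named database instance.
   for change in plan['resource_changes']:
       if (change['type'] == 'google_sql_database_instance'
               and change['name'] == request.param):
           return change
   pytest.skip('no database found in the plan')


def test_if_plan_deletes_database(database):
   # The pipeline should never delete a database that stores sundew data.
   assert database['change']['actions'][0] != 'delete'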

In figure 9.10, you manually check whether you can migrate the database to the green network and discover that you cannot. The sundew system can tolerate some missing data, so you copy the IaC for the blue database and create a new green database instance on the premium network. After migrating the data from the blue to the green database, you switch applications to use the new database and remove the old one.

Figure 9.10 If you can’t update the database in place, you must deploy a new green database, migrate and reconcile the data, and change the database endpoint from blue to green.

The blue-green deployment strategy applies to the database, a class of infrastructure resource that focuses on state. Stateful infrastructure resources like the database store and manage data. In reality, all applications process and store some amount of data. However, stateful infrastructure requires additional care because changes can directly affect data. This type of infrastructure includes databases, queues, caches, or stream-processing tools.

Definition Stateful infrastructure describes infrastructure resources that store and manage data.

Why use a blue-green deployment for infrastructure resources with data? Sometimes you cannot update the resource in place with IaC. Replacing the resource may corrupt or lose data, which affects your applications. A blue-green deployment helps you test the functionality of your new database before your applications use it.

9.3.2 Update delivery pipeline

Let’s return to the sundew team. You must fix the delivery pipeline to automate the update for the database. In figure 9.11, you update your delivery pipeline with a step that automatically migrates data from the blue database to the green one. When you add the green database and deploy it, the pipeline deploys the new database, runs integration tests, runs the data migration, and finishes with end-to-end tests.

Preserve idempotency in the automation for your migration step: the script should produce the same database state each time you run it and should never duplicate data on a rerun, as in the sketch that follows. The data migration process differs depending on the kind of stateful infrastructure you have (database, queue, cache, or stream-processing tool).
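For example, if the sundew database were PostgreSQL-compatible, the pipeline's migration step might run a script along these lines. The endpoints, table, and columns are hypothetical; the point is that rerunning the step leaves the green database in the same state.

import psycopg2

# Hypothetical private endpoints for the two database instances
BLUE_DSN = 'host=blue-db.internal dbname=sundew'
GREEN_DSN = 'host=green-db.internal dbname=sundew'


def migrate_watering_records():
    with psycopg2.connect(BLUE_DSN) as blue, psycopg2.connect(GREEN_DSN) as green:
        with blue.cursor() as source, green.cursor() as target:
            source.execute(
                'SELECT id, plant, watered_at FROM watering_records')
            for row in source:
                # ON CONFLICT keeps the step idempotent: rerunning the
                # migration never duplicates rows already copied to green.
                target.execute(
                    'INSERT INTO watering_records (id, plant, watered_at) '
                    'VALUES (%s, %s, %s) ON CONFLICT (id) DO NOTHING',
                    row,
                )


if __name__ == '__main__':
    migrate_watering_records()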

Figure 9.11 The infrastructure deployment pipeline should add the green database and copy data from blue to green.

Note You can dedicate entire books to migrating and managing stateful infrastructure with minimal (or zero) downtime. I recommend Database Reliability Engineering by Laine Campbell and Charity Majors (O’Reilly, 2017) for other patterns and practices on managing databases. You can refer to specific documentation on migration, upgrading, and availability for other stateful infrastructure resources.

Depending on how often you update your stateful infrastructure, you should capture the automated data migration in your deployment pipeline, not in your IaC. Separating the data migration lets you change its steps and debug migration problems independently of creating and removing the stateful resources.

9.3.3 Canary deployment

To complete the database update for the sundew team, you change the application configuration to use the green database. The applications are high-level resources that depend on the database, much as a load balancer depends on the servers it sends traffic to. You can use a modified form of canary deployment to cut over to the new database.

Figure 9.12 shows the pattern of regression testing and configuring the application to use the database. You first update the application to write data to both the blue and green databases. After a period of regression testing (to ensure that functionality still works), you update the application to write data exclusively to the green database. After another period of regression testing, you update the application to read data from the green database.

Figure 9.12 You incrementally update the application deployment pipeline to write to both databases, write to the green database, and then read from the green database.

You roll out the changes to the application incrementally so you can revert to the blue database if you encounter an issue. Since the application writes to two databases, you may have to write additional automation to reconcile data. However, writing to the new stateful infrastructure ensures that you test any critical functionality related to storing and updating data. Only then can applications properly read and process data.
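A minimal application-side sketch of those steps might look like the following. The repository class, database clients, and flags are hypothetical; real applications often hide this switch behind a feature-flag service instead.

class WateringRepository:
    """Hypothetical data access layer for the watering application."""

    def __init__(self, blue_db, green_db,
                 write_to_blue=True, read_from_green=False):
        self.blue_db = blue_db
        self.green_db = green_db
        self.write_to_blue = write_to_blue        # step 2 flips this off
        self.read_from_green = read_from_green    # step 3 flips this on

    def record_watering(self, plant, watered_at):
        # Step 1: dual write, so either database can serve reads and you
        # can still revert to blue if regression testing finds a problem.
        if self.write_to_blue:
            self.blue_db.insert(plant=plant, watered_at=watered_at)
        self.green_db.insert(plant=plant, watered_at=watered_at)

    def last_watering(self, plant):
        # Step 3: read from green once its data checks out.
        db = self.green_db if self.read_from_green else self.blue_db
        return db.query_last(plant=plant)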

Figure 9.13 Delete the blue database by removing it from IaC and pushing the changes into your deployment pipeline.

Now that the sundew applications use the green database, you remove the blue database by deleting it from the IaC and pushing the change through your deployment pipeline (figure 9.13). Note that the compliance test will fail because you are deleting a database! You update the test so that it fails only when the plan deletes the green database, not the blue one. That manual override lets you remove the blue database, since applications no longer use it.
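The updated compliance test might add one condition to the earlier sketch, so that only a planned deletion of the green database fails the pipeline. The plan fixture reads the same hypothetical dry-run output as before.

import json

import pytest


@pytest.fixture
def plan():
    with open('plan.json') as f:   # hypothetical dry-run output
        return json.load(f)


def test_if_plan_deletes_green_database(plan):
    # Deleting the retired blue instance is expected; protect only green.
    for change in plan['resource_changes']:
        if (change['type'] == 'google_sql_database_instance'
                and change['name'] == 'green'):
            assert 'delete' not in change['change']['actions']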

Techniques like canary deployment provide rapid feedback to mitigate the impact of failure, especially in a situation involving data processing. It can mean the difference between fixing a few wrong entries in a database versus restoring it entirely from backup! I get a sense of comfort knowing that I am making changes to stateful infrastructure in an isolated green environment instead of the live production system.

Strategies using immutability like blue-green deployment offer a structured process to make changes and minimize the blast radius of potential failures. Thanks to the principle of reproducibility, you can typically use an immutable approach for changing IaC by duplicating and editing configuration. The principle also allows you to improve the redundancy of your infrastructure system with a similar process of duplication.

Summary

  • Ensure that your system has testing, monitoring, and observability for infrastructure and applications before starting any infrastructure updates.

  • Redundancy in IaC means adding extra, idle resources to the configuration so you can fail over in case of component failure.

  • Adding redundant configuration to IaC can improve system reliability, which measures how long a system performs correctly over a given period of time.

  • An active-passive configuration includes an active environment serving requests and a duplicate idle environment for when the active environment fails.

  • Failover switches traffic from a failing active environment to an idle passive environment.

  • An active-active configuration sets two active environments serving requests, both of which you can duplicate and manage with IaC.

  • Blue-green deployment creates a new subset of infrastructure resources that stages the changes you want to make and gradually switches requests to the new subset. You can then remove the old set of resources.

  • In a blue-green deployment, deploy the resource you want to change and the high-level resources that depend on it.

  • Canary deployment sends a small percentage of traffic to the new infrastructure resources to verify whether the system works correctly. Over time, you increase the percentage of traffic.

  • Allow a few weeks for regression testing to check whether changes to the system affect existing or new functionality.

  • Stateful infrastructure resources—like databases, caches, queues, and stream-processing tools—store and manage data.

  • Add a migration step to the blue-green deployment of stateful infrastructure resources. You must copy data between blue and green stateful infrastructure.
