10 Refactoring

This chapter covers

  • Determining when to refactor IaC to avoid impacting systems
  • Applying feature flagging to change infrastructure attributes mutably
  • Applying rolling updates to complete changes in place

Over time, you might outgrow the patterns and practices you use to collaborate on infrastructure as code. Even change techniques like blue-green deployment cannot resolve conflicts in configuration as your team works on the same IaC. You must deliver a series of major changes to your IaC and address problems with scaling the practice.

For example, the sundew team for Datacenter for Carnivorous Plants expresses that it can no longer comfortably and confidently roll out new changes to its system. The team put all infrastructure resources in one repository (the singleton pattern) to deliver the system quickly and has kept adding new updates on top of it.

The sundew team outlines a few problems with its system. First, the team finds its updates to infrastructure configuration constantly overlapping. One teammate works on updating servers, only to find another teammate has updated the network in a way that affects their changes.

Second, it takes more than 30 minutes to run a single change. One change makes hundreds of calls to your infrastructure API to retrieve the state of resources, which slows the feedback cycle.

Finally, the security team expresses concern that the sundew infrastructure may have an insecure configuration. The current configuration does not use standardized, hardened company infrastructure modules.

You realize you need to change the sundew team’s configuration. The configuration should use an existing server module approved by the security team. You also need to break the configuration into separate resources to minimize the blast radius of changes.

This chapter discusses some IaC patterns and techniques to break down large singleton repositories with hundreds of resources. As the IaC helper on the sundew team, you’ll refactor the system’s singleton configuration into separate repositories and structure the server configurations to use modules to avoid conflicts and comply with security standards.

Note Demonstrating refactoring requires a sufficiently large (and complex) example. If you run the complete example, you will incur a cost that exceeds the GCP free tier. This book includes only the relevant lines of code and omits the rest for readability. For complete listings, refer to the book’s code repository at https://github.com/joatmon08/manning-book/tree/main/ch10. If you convert these examples for AWS and Azure, you will also incur a cost. When possible, I offer notations on converting the examples to the cloud provider of your choice.

10.1 Minimizing the refactoring impact

The sundew team needs help breaking down its infrastructure configuration. You decide to refactor the IaC to isolate conflicts better, reduce the amount of time to apply changes to production, and secure them according to company standards. Refactoring IaC involves restructuring configuration or code without impacting existing infrastructure resources.

Definition Refactoring IaC is the practice of restructuring configuration or code without impacting existing infrastructure resources.

You communicate to the sundew team that its configuration needs to undergo a refactor to fix the problems. While the team members support your effort, they challenge you to minimize the impact of your refactor. Challenge accepted: you apply a few techniques to reduce the potential blast radius as you refactor IaC.

Technical debt

Refactoring often resolves technical debt. Technical debt began as a metaphor to describe the cost of any code or approach that makes the overall system challenging to change or extend.

To understand technical debt applied to IaC, recall that the sundew team put all its infrastructure resources into one repository. The sundew team accumulates debt in time and effort. A change to a server that should take a day takes four days because the team needs to resolve conflicts with another change and wait for hundreds of requests to the infrastructure API. Note that you’ll always have some technical debt in complex systems, but you need continuous efforts to minimize it.

A management team dreads hearing that you need to address technical debt because it means you aren’t working on features. I argue that the technical debt you accumulate in infrastructure will always come back to haunt you. The gremlin of technical debt comes in the form of someone changing infrastructure and causing application downtime, or worse, a security breach that exposes personal information and incurs a monetary cost. Assessing the impact of not fixing technical debt helps justify the effort.

10.1.1 Reduce blast radius with rolling updates

The Datacenter for Carnivorous Plants platform and security team offer a server module with secure configurations, which you can use for the sundew system. The sundew team’s infrastructure configuration has three server configurations but no usage of the secure module. How do you change the sundew IaC to use the module?

Imagine you create three new servers together and immediately send traffic to them. If the applications do not run correctly on the server, you could disrupt the sundew system entirely, and the poor plants do not get watered! Instead, you might reduce the blast radius of your server module refactor by gradually changing the servers one by one.

In figure 10.1, you create one server configuration using the module, deploy the application to the new server, validate that the application works, and delete the old server. You repeat the process two more times for each server. You gradually roll out the change to one server before updating the next one.

Figure 10.1 Use rolling updates to create each new server, deploy the application, and test its functionality while minimizing disruption to other resources.

A rolling update gradually changes similar resources one by one and tests each one before continuing the update.

Definition A rolling update is a practice of changing a group of similar resources one by one and testing them before implementing the change to the next one.

Applying a rolling update to the sundew team’s configuration isolates failures to individual servers each time you make the update and allows you to test the server’s functionality before proceeding to the next one.

The practice of rolling updates can save you the pain of detangling a large set of failed changes or incorrectly configured IaC. For example, if the Datacenter for Carnivorous Plants module doesn’t work on one server, you have not yet rolled it out and affected the remaining servers. A rolling update lets you check that you have the proper IaC for each server before continuing to the next one. A gradual approach also mitigates any downtime in the applications or failures in updating the servers.
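
To make the workflow concrete, here is a minimal Python sketch of the rolling update loop. The helper functions are hypothetical stubs standing in for your IaC tool and tests (they are not the sundew team’s actual automation); the point is the control flow of create, deploy, test, and only then delete the old server:

OLD_SERVERS = ['blue_0', 'blue_1', 'blue_2']


def create_server_from_module(old_server):
    # Hypothetical stub: would generate configuration from the secure module and apply it
    print(f'creating replacement for {old_server} from the secure module')
    return f'module_{old_server}'


def deploy_application(server):
    # Hypothetical stub: would deploy the watering application to the new server
    print(f'deploying application to {server}')


def passes_tests(server):
    # Hypothetical stub: would run functional tests against the new server
    print(f'testing application on {server}')
    return True


def delete_server(server):
    # Hypothetical stub: would remove the server from configuration and apply the change
    print(f'deleting {server}')


def rolling_update(old_servers):
    for old_server in old_servers:
        new_server = create_server_from_module(old_server)
        deploy_application(new_server)
        if not passes_tests(new_server):
            # Stop the rollout; only this one server is affected
            delete_server(new_server)
            raise RuntimeError(f'update failed on {old_server}, stopping rollout')
        delete_server(old_server)


if __name__ == '__main__':
    rolling_update(OLD_SERVERS)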

Note I borrowed rolling updates for refactoring from workload orchestrators, like Kubernetes. When you need to roll out new nodes (virtual machines) for a workload orchestrator, you may find it uses an automated rolling update mechanism. The orchestrator cordons the old node, prevents new workloads from running on it, starts all of the running processes on a new node, drains all of the processes on the old node, and sends workloads and requests to the new node. You should mimic this workflow when you refactor!

Thanks to the rolling update and incremental testing, you know that the servers can run with the secure module. You tell the team that you finished refactoring the servers and confirmed that they work with internal services. The sundew team can now send all customer traffic to the newly secured servers. However, the team members tell you that they need to update the customer-facing load balancer first!

10.1.2 Stage refactoring with feature flags

You need a way to hide the new servers from the customer-facing load balancer for a few days and attach them when the team approves. However, you have all the configurations ready! You want to hide the server attachments behind a single variable to simplify the change for the sundew team. When the team members complete their load balancer update, they need to update only one variable to add the new servers to the load balancer.

Figure 10.2 outlines how you set up the variable to add new servers created by the module. You create a Boolean to enable or disable the new server module, using True or False. Then, you add an if statement to IaC that references the Boolean value. A True variable adds the new servers to the load balancer. A False variable removes the servers from the load balancer.

Figure 10.2 Feature flags in IaC involve a process of creation, management, and removal.

The Boolean variable helps with the composability and evolvability of IaC. A single change to the variable adds, removes, or updates a configuration. The variable, called a feature flag (or feature toggle), enables or disables infrastructure resources, dependencies, and attributes. You often find feature flags in software development with a trunk-based development model, in which everyone works on the main branch.

Definition Feature flags (also known as feature toggles) enable or disable infrastructure resources, dependencies, or attributes by using a Boolean.

Flags hide certain features or code and prevent them from impacting the rest of the team on the main branch. For the sundew team, you hide the new servers from the load balancer until the team completes the load balancer change. Similarly, you can use feature flags in IaC to stage configuration and push the update with a single variable.

Set up the flag

To start implementing a feature flag and stage changes for the new servers, you add a flag and set it to False. You default a feature flag to False to preserve the original infrastructure state, as shown in figure 10.3. The sundew configuration disables the server module by default so that nothing happens to the original servers.

Let’s implement the feature flag in Python. You set the server module flag’s default to False in a separate file called flags.py. The file defines the flag, ENABLE_SERVER_MODULE, and sets it to False:

ENABLE_SERVER_MODULE = False

 

Figure 10.3 Set the feature flag to False by default to preserve the infrastructure resources’ original state and dependencies.

You could also embed feature flags as variables in other files, but you might lose track of them! You decide to put them in a separate Python file.

Note I always define feature flags in a file to identify and change them in one place.

The following listing imports the feature flag in main.py and adds the logic to generate a list of servers to add to the load balancer.

Listing 10.1 Including the feature flag to add servers to the load balancer

import flags                                                         
 
def _generate_servers(version):
   instances = [                                                     
       f'${{google_compute_instance.{version}_0.id}}',               
       f'${{google_compute_instance.{version}_1.id}}',               
       f'${{google_compute_instance.{version}_2.id}}'                
   ]                                                                 
   if flags.ENABLE_SERVER_MODULE:                                    
       instances = [                                                 
           f'${{google_compute_instance.module_{version}_0.id}}',    
           f'${{google_compute_instance.module_{version}_1.id}}',    
           f'${{google_compute_instance.module_{version}_2.id}}',    
       ]                           
   return instances

Imports the file that defines all of the feature flags

Defines a list of existing Google compute instances (servers) by using a Terraform resource in the system

Uses a conditional statement to evaluate the feature flag and adds the server module’s resources to the load balancer

A feature flag set to True will attach the servers created by the module to the load balancer. Otherwise, it will keep the original servers

AWS and Azure equivalents

To convert listing 10.1 to AWS or Azure, use the AWS EC2 Terraform resource (http://mng.bz/VMMr) or the Azure Linux virtual machine Terraform resource (http://mng.bz/xnnq). You will need to update only the references within the list of instances.

Run Python with the feature flag toggled off to generate a JSON configuration. The resulting JSON configuration adds only the original servers to the load balancer, which preserves the existing state of infrastructure resources.

Listing 10.2 JSON configuration with feature flag disabled

{
   "resource": [
       {
           "google_compute_instance_group": {                          
               "blue": [
                   {
                       "instances": [
                           "${google_compute_instance.blue_0.id}",     
                           "${google_compute_instance.blue_1.id}",     
                           "${google_compute_instance.blue_2.id}"      
                       ]
                   }
               ]
           }
       }
   ]
}

Creates a Google compute instance group using a Terraform resource to attach to the load balancer

Configuration includes a list of the original Google compute instances, preserving the current state of infrastructure resources

AWS and Azure equivalents

A Google Compute instance group has no straightforward equivalent in AWS or Azure. Instead, you will need to replace the compute instance group with a resource definition for an AWS Target Group (http://mng.bz/AyyE) for a load balancer. For Azure, you will need a backend address pool and three addresses to the virtual machine instances (http://mng.bz/ZAAj).

Defaulting the feature flag to False applies the principle of idempotency: when you run the IaC, your infrastructure state should not change. The default ensures that you do not accidentally change existing infrastructure. Preserving the original state of the existing servers minimizes disruption to dependent applications.
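
One way to gain confidence in that default is a quick unit test against the generator from listing 10.1. The following sketch assumes _generate_servers lives in main.py next to flags.py and that any file writing sits behind an if __name__ == '__main__' guard; run it with pytest:

import flags
import main


def test_disabled_flag_keeps_original_servers():
    # With the flag off, the generated list must match the original servers
    flags.ENABLE_SERVER_MODULE = False
    assert main._generate_servers('blue') == [
        '${google_compute_instance.blue_0.id}',
        '${google_compute_instance.blue_1.id}',
        '${google_compute_instance.blue_2.id}',
    ]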

Enable the flag

The sundew team made its changes and provided approval to add the new servers created by the module to the load balancer. You set the feature flag to True, as shown in figure 10.4. When you deploy the change, you attach servers from a module to the load balancer and remove the old servers.

Figure 10.4 Set the feature flag to True, attach the three new servers created by the module to the load balancer, and detach the old servers.

Let’s examine the updated feature flag in action. You start by setting the feature flag for the servers to True:

ENABLE_SERVER_MODULE = True

Run Python to generate a new JSON configuration. The configuration in the following listing now includes the servers created by the module that you will attach to the load balancer.

Listing 10.3 JSON configuration with feature flag enabled

{
   "resource": [
       {
           "google_compute_instance_group": {
               "blue": [
                   {
                       "instances": [
                           "${google_compute_instance.module_blue_0.id}",  
                           "${google_compute_instance.module_blue_1.id}",  
                           "${google_compute_instance.module_blue_2.id}"   
                       ]
                   }
               ]
           }
       }
   ]
}

The new servers created by the module replace the old servers because you enabled the feature flag.

The feature flag allows you to stage the module’s low-level server resources without affecting the load balancer’s high-level dependency. You can rerun the code with the feature toggle off to reattach the old servers.

Why use the feature flag to switch to the server module? A feature flag hides functionality from production until you feel ready to deploy resources associated with it. You offer one variable to add, remove, or update a set of resources. You can also use the same variable to revert changes.

Remove the flag

After running the servers for some time, the sundew team reports that the new server module works. You can now remove the old servers, as shown in listing 10.4. You no longer need the feature flag, and you don’t want to confuse another team member when they read the code. You refactor the Python code for the load balancer to remove the old servers and delete the feature flag.

Listing 10.4 Removing the feature flag after the change completes

import blue                                                       
 
 
def _generate_servers(version):
   instances = [                                                  
       f'${{google_compute_instance.module_{version}_0.id}}',     
       f'${{google_compute_instance.module_{version}_1.id}}',     
       f'${{google_compute_instance.module_{version}_2.id}}',     
   ]    
   return instances

You can remove the import for the feature flags because you no longer need it for your servers.

Permanently attaches the servers created by the module to the load balancer and removes the feature flag

Domain-specific languages

Listing 10.4 shows a feature flag in a programming language. You can also use feature flags in DSLs, although you must adapt them based on your tool’s syntax. In Terraform, you can mimic a feature flag by using a variable and the count meta-argument (http://mng.bz/R44n):

variable "enable_server_module" {
 type        = bool
 default     = false
 description = "Choose true to build servers with a module."
}
module "server" {
 count   = var.enable_server_module ? 1 : 0
 ## omitted for clarity
}

In AWS CloudFormation, you can pass a parameter and set a condition (http://mng.bz/2nnN) to enable or disable resource creation:

AWSTemplateFormatVersion: 2010-09-09
Description: Truncated example for CloudFormation feature flag
Parameters:
 EnableServerModule:
   AllowedValues:
     - 'true'
     - 'false'
   Default: 'false'
   Description: Choose true to build servers with a module.
   Type: String
Conditions:
 EnableServerModule: !Equals
   - !Ref EnableServerModule
   - true
Resources:
 ServerModule:
   Type: AWS::CloudFormation::Stack
   Condition: EnableServerModule
   ## omitted for clarity

Besides feature flags to enable and disable entire resources, you can use conditional statements to enable or disable specific attributes for a resource.
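
In the chapter’s Python examples, a flag on a single attribute might look like the following sketch. ENABLE_PUBLIC_IP and the access_config block are illustrative assumptions rather than part of the sundew configuration; an empty access_config block is what requests an ephemeral external IP for a GCP compute instance:

ENABLE_PUBLIC_IP = False


def network_interface(subnetwork):
    # Always attach the instance to the subnetwork
    interface = {'subnetwork': subnetwork}
    if ENABLE_PUBLIC_IP:
        # Toggle a single attribute: request an ephemeral external IP address
        interface['access_config'] = [{}]
    return [interface]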

As a general rule, remove the feature flag after you finish the change. Too many feature flags can clutter your IaC with complicated logic, making it hard to troubleshoot infrastructure configuration.

Use cases

The example uses feature flags to refactor singleton configurations into infrastructure modules. I often apply feature flags to this use case to simplify the creation and removal of infrastructure resources. Other use cases for feature flags include the following:

  • Collaborating and avoiding change conflicts on the same infrastructure resources or dependencies

  • Staging a group of changes and rapidly deploying them with a single update to the flag

  • Testing a change and quickly disabling it on failure

A feature flag offers a technique to hide or isolate infrastructure resource, attribute, and dependency changes during the refactoring of infrastructure configuration. However, changing the toggle can still disrupt a system. In the example of the sundew team’s servers, we cannot simply toggle the feature flag to True and expect the servers to run the application. Instead, we combine the feature flag with other techniques like rolling updates to minimize disruption to the system.

10.2 Breaking down monoliths

The sundew team members express that they still have a problem with their system. You identify the singleton configuration with hundreds of resources and attributes as the root cause. Whenever teammates make a change, they must resolve conflicts with one another and wait 30 minutes for the change to finish.

A monolithic architecture for IaC means defining all infrastructure resources in one place. You need to break the monolith of IaC into smaller, modular components to minimize working conflicts between teammates and speed up the deployment of changes.

Definition A monolithic architecture for IaC defines all infrastructure resources in a single configuration and the same state.

In this section, we’ll walk through a refactor of the sundew team’s monolith. The most crucial step begins with identifying and grouping high-level infrastructure resources and dependencies. We complete the refactor with the low-level infrastructure resources.

Monolith vs. monorepository

Recall that you can put your infrastructure configuration into a single repository (chapter 5). Does a single repository mean you have a monolithic architecture? Not necessarily. You can subdivide a single repository into separate subdirectories. Each subdirectory contains separate IaC.

A monolithic architecture means you manage many resources together and tightly couple them, making it difficult to change a subset in isolation. The monolith usually results from an initial singleton pattern (all configurations in one place) that expands over time.

You might have noticed that I started immediately with patterns for modularizing infrastructure resources and dependencies in chapters 3 and 4. Why not present this chapter on refactoring earlier? If you can identify and apply some of the patterns early in IaC development, you can avoid the monolithic architecture. However, you sometimes inherit a monolith and often need to refactor it.

10.2.1 Refactor high-level resources

The sundew team manages hundreds of resources in one set of configuration files. Where should you start breaking down the IaC? You decide to look for high-level infrastructure resources that do not depend on other resources.

The sundew team has one set of high-level infrastructure in GCP project-level IAM service accounts and roles. The IAM service accounts and roles don’t require a network or server to exist before setting user and service account rules on the project. None of the other resources depend on the IAM roles and service accounts. You can group and extract them first.

You cannot use a blue-green deployment approach because GCP does not allow duplicate policies. Nor can you simply delete the roles and accounts from the monolithic configuration and re-create them from a new repository; deleting them prevents everyone from logging into the project! How can you extract them?

You can copy and paste the configuration into its separate repository or directory, initialize the state for the separated configuration, and import the resources into the infrastructure state associated with the new configuration. Then, you delete the IAM configuration in the monolithic configuration. As with the rolling update, you gradually change each set of infrastructure resources, test the changes, and proceed to the next one.

Figure 10.5 outlines the solution to refactoring a monolith for high-level resources. You copy the code from the monolith to a new folder and import the live infrastructure resource into the state of the code in the new folder. You redeploy the code to make sure it does not change existing infrastructure. Finally, remove the high-level resources from the monolith.

Figure 10.5 The sundew system’s IAM policies for the GCP project have no dependencies, and you can easily refactor them without disrupting other infrastructure.

As with feature flags, we use the principle of idempotency to run IaC and verify that we do not affect the active infrastructure state. Anytime you refactor, make sure you deploy the changes and check the dry run. You do not want to accidentally change an existing resource and affect its dependencies.

We will refactor the example in the following few sections. Stay with it! I know refactoring tends to feel tedious, but a gradual approach ensures that you do not introduce widespread failures into your system.

Copy from the monolith to a separate state

Your initial refactor begins by copying the code to create the IAM roles and service accounts to a new directory. The sundew team wants to keep a single repository structure that stores all the IaC in one source control repository but separates the configurations into folders.

You identify the IAM roles and service accounts to copy the team’s code to a new folder, as shown in figure 10.6. The active IAM policies and their infrastructure state in GCP do not change.

Why reproduce the IaC for the IAM policies in a separate folder? You want to split up your monolithic IaC without affecting any of the active resources. The most important practice when refactoring involves preserving idempotency. Your active state should never change when you move your IaC.

Figure 10.6 Copy the files for the IAM policies into a new directory for the sundew-production-iam configuration, and avoid changing the live infrastructure resources in GCP.

Let’s start refactoring the IAM policies out of the monolith. Create a new directory that manages only the IAM policies for the GCP project:

$ mkdir -p sundew_production_iam

Copy the IAM configuration from the monolith into the new directory:

$ cp iam.py sundew_production_iam/

You don’t need to change anything since the IAM policies do not depend on other infrastructure. The file iam.py in the following listing separates the creation and role assignment for a set of users.

Listing 10.5 IAM configuration separated from the monolith

import json
 
TEAM = 'sundew'
TERRAFORM_GCP_SERVICE_ACCOUNT_TYPE = 'google_service_account'             
TERRAFORM_GCP_ROLE_ASSIGNMENT_TYPE = 'google_project_iam_member'          
 
 
users = {                                                                 
   'audit-team': 'roles/viewer',                                          
   'automation-watering': 'roles/editor',                                 
   'user-02': 'roles/owner'                                               
}                                                                         
 
 
def get_user_id(user):
   return user.replace('-', '_')
 
 
def build():
   return iam()
 
 
def iam(users=users):                                                     
   iam_members = []
   for user, role in users.items():
       user_id = get_user_id(user)
       iam_members.append({
           TERRAFORM_GCP_SERVICE_ACCOUNT_TYPE: [{                         
               user_id: [{                                                
                   'account_id': user,                                    
                   'display_name': user                                   
               }]                                                         
           }]                                                             
       })
       iam_members.append({
           TERRAFORM_GCP_ROLE_ASSIGNMENT_TYPE: [{                         
               user_id: [{                                                
                   'role': role,                                          
                   'member': 'serviceAccount:${google_service_account.'   
                   + f'{user_id}' + '.email}'                             
               }]                                                         
           }]                                                             
       })
   return iam_members

Sets the resource types that Terraform uses as constants so you can reference them later if needed

Keeps all of the users you added to the project as part of the monolith

Uses the module to create the JSON configuration for the IAM policies outside the monolith

Creates a GCP service account for the project for each user in the sundew production project

Assigns the specific role defined for each service account, such as viewer, editor, or owner

AWS and Azure equivalents

Listing 10.5 creates all users and groups as service accounts in GCP so you can run the example to completion. You typically use service accounts for automation.

A service account in GCP is similar to an AWS IAM user dedicated to service automation or Azure Active Directory application registered with a client secret. To rebuild the code in AWS or Azure, adjust the roles for viewer, editor, and owner to fit AWS or Azure roles.

Set constants and create methods that output resource types and identifiers when separating configuration. You can always use them for other automation and continued system maintenance, especially as you continue to refactor the monolith!

In the following listing, create a main.py file in the sundew_production_iam folder that references the IAM configuration and outputs the Terraform JSON for it.

Listing 10.6 Entry point to build the separate JSON configuration for IAM

import iam                                       
import json
 
if __name__ == "__main__":
   resources = {
       'resource': iam.build()                   
   }
 
   with open('main.tf.json', 'w') as outfile:    
       json.dump(resources, outfile,             
                 sort_keys=True, indent=4)       

Imports the IAM configuration code and builds the IAM policies

Writes the Python dictionary out to a JSON file to be executed by Terraform later

Do not run Python yet to create the Terraform JSON or deploy the IAM policies! You already have IAM policies defined as part of GCP. If you run python main.py and apply the Terraform JSON with the separated IAM configuration, GCP throws an error that the user account and assignment already exist:

$ python main.py
 
$ terraform apply -auto-approve
## output omitted for clarity
| Error: Error creating service account: googleapi:
Error 409: Service account audit-team already exists within project 
projects/infrastructure-as-code-book., alreadyExists

The sundew team members do not want you to remove and create new accounts and roles. If you delete and create new accounts, they cannot log into their GCP project. You need a way to migrate the existing resources defined in the monolith and link them to code defined in its own folder.

Import the resources to the new state

Sometimes creating new resources with your refactored IaC will disrupt development teams and business-critical systems. You cannot use the principle of immutability to delete the old resources and create new ones. Instead, you must migrate active resources from one IaC definition to another.

Figure 10.7 Get the current state of the separated resources from the infrastructure provider and import the identifiers before reapplying IaC.

In the case of the sundew team, you extract the identifiers for each service account from the monolithic configuration and “move” them to the new state. Figure 10.7 demonstrates how to detach each service account and its role assignments from the monolith and attach them to the IaC in the sundew_production_iam directory. You call the GCP API for the current state of the IAM policies and import the live infrastructure resources into the separated configuration and state. Running the IaC should reveal no changes to its dry run.

Why import the IAM policy information with the GCP API? You want to import the updated, active state of the resource. A cloud provider’s API offers the most up-to-date configuration for resources. You can call the GCP API to retrieve the user emails, roles, and identifiers for the sundew team.

Rather than write your own import capability and save the identifiers in a file, you decide to use Terraform’s import capability to add existing resources to the state. You write some Python code in the following listing that wraps around Terraform to automate a batch import of IAM resources so that the sundew team can reuse it.

Listing 10.7 File import.py separately imports sundew IAM resources

import iam                                                              
import os
import googleapiclient.discovery                                        
import subprocess
 
PROJECT = os.environ['CLOUDSDK_CORE_PROJECT']                           
 
 
def _get_members_from_gcp(project, roles):                              
   roles_and_members = {}                                               
   service = googleapiclient.discovery.build(                           
       'cloudresourcemanager', 'v1')                                    
   result = service.projects().getIamPolicy(                            
       resource=project, body={}).execute()                             
   bindings = result['bindings']                                        
   for binding in bindings:                                             
       if binding['role'] in roles:                                     
           roles_and_members[binding['role']] = binding['members']      
   return roles_and_members                                             
 
 
def _set_emails_and_roles(users, all_members):                          
   members = []                                                         
   for username, role in users.items():                                 
       members += [(iam.get_user_id(username), m, role)                 
                   for m in all_members[role] if username in m]         
   return members                                                       
 
 
def check_import_status(ret, err):
   return (ret != 0 and
           'Resource already managed by Terraform' not in str(err))
 
 
def import_service_account(project_id, user_id, user_email):            
   email = user_email.replace('serviceAccount:', '')                    
   command = ['terraform', 'import', '-no-color',                       
              f'{iam.TERRAFORM_GCP_SERVICE_ACCOUNT_TYPE}.{user_id}',    
              f'projects/{project_id}/serviceAccounts/{email}']         
   return _terraform(command)                                           
 
 
def import_project_iam_member(project_id, role,                         
                             user_id, user_email):                      
   command = ['terraform', 'import', '-no-color',                       
              f'{iam.TERRAFORM_GCP_ROLE_ASSIGNMENT_TYPE}.{user_id}',    
              f'{project_id} {role} {user_email}']                      
   return _terraform(command)                                           
 
 
def _terraform(command):                                                
   process = subprocess.Popen(                                          
       command,                                                         
       stdout=subprocess.PIPE,                                          
       stderr=subprocess.PIPE)                                          
   stdout, stderr = process.communicate()                               
   return process.returncode, stdout, stderr                            
 
 
if __name__ == "__main__":
   sundew_iam = iam.users                                               
   all_members_for_roles = _get_members_from_gcp(                       
       PROJECT, set(sundew_iam.values()))                               
   import_members = _set_emails_and_roles(                              
       sundew_iam, all_members_for_roles)                               
   for user_id, email, role in import_members:
       ret, _, err = import_service_account(PROJECT, user_id, email)    
       if check_import_status(ret, err):                                
           print(f'import service account failed: {err}')               
       ret, _, err = import_project_iam_member(PROJECT, role,           
                                               user_id, email)          
       if check_import_status(ret, err):                                
           print(f'import iam member failed: {err}')                    

Retrieves the list of sundew users from iam.py in sundew_production_iam

Uses the Google Cloud Client Libraries for Python to get a list of members assigned to a role in the GCP project

Retrieves the GCP project ID from the CLOUDSDK_CORE_PROJECT environment variable

Gets the email and user IDs for the sundew IAM members only

Imports the service account to the sundew_production_iam state based on project and user email, using the resource type constant you set in iam.py

Imports a role assignment to the sundew_production_iam state based on project, role, and user email, using the resource type constant you set in iam.py

Both import methods wrap around the Terraform CLI command and return any errors and output.


If the import fails and it did not already import the resource, outputs the error

Libcloud vs. Cloud Provider SDKs

The examples in this chapter need to use the Google Cloud Client Library for Python instead of Apache Libcloud, which I showed in chapter 4. While Apache Libcloud works for retrieving information about virtual machines, it does not work for other resources in GCP. For more information about the Google Cloud Client Library for Python, review http://mng.bz/1ooZ.

You can update listing 10.7 to use the Azure libraries for Python (http://mng.bz/Pnn2) or AWS SDK for Python (https://aws.amazon.com/sdk-for-python/) to retrieve information about users. These would replace the GCP API client library.

As with defining dependencies, you want to dynamically retrieve identifiers from the infrastructure provider API for your resources to import. You never know when someone will change the resource, and the identifier you thought you needed no longer exists! Use your tags and naming conventions to search the API response for the resources you need.

When you run python import.py and perform a dry run of the Terraform JSON with the separated IAM configuration, you get a message that you do not have to make any changes. You successfully imported the existing IAM resources into their separate configuration and state:

$ python main.py
 
$ terraform plan
No changes. Your infrastructure matches the configuration.
 
Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.
 

Sometimes your dry run indicates drift between the active resource state and the separated configuration: your copied configuration does not match the active state of the resource. The differences often come from someone manually changing the active state of an infrastructure resource or from an update to the default value of an attribute. Update your separated IaC to match the attributes of the active infrastructure resource.
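
Rather than eyeballing the plan output for drift, you can wrap the dry run in the same subprocess style the chapter already uses. The sketch below is not part of the sundew team’s tooling; it relies on terraform plan -detailed-exitcode, which exits with 0 for no changes, 1 for an error, and 2 when changes are pending:

import subprocess
import sys


def plan_has_changes():
    # Run a dry run and use the detailed exit code to detect drift
    process = subprocess.run(
        ['terraform', 'plan', '-no-color', '-detailed-exitcode'],
        capture_output=True, text=True)
    if process.returncode == 1:
        raise RuntimeError(f'terraform plan failed: {process.stderr}')
    return process.returncode == 2


if __name__ == '__main__':
    if plan_has_changes():
        print('Drift detected: the refactored IaC does not match the active state.')
        sys.exit(1)
    print('No changes. The refactor preserved the infrastructure state.')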

Import with and without a provisioning tool

Many provisioning tools have a function for importing resources. For example, AWS CloudFormation uses the resource import command. The example uses Python wrapped around terraform import to move service accounts. Breaking down monolithic configuration will become tedious without it.

If you write IaC without a tool, you do not need a direct import capability. Instead, you need logic to check that the resources exist. The sundew service accounts and role assignments can work without Terraform or IaC import capability, as sketched after this list:

  1. Call the GCP API to check whether the sundew team’s service accounts and role attachments exist.

  2. If they do, check whether the API response for the service account attributes matches your desired configuration. Update the service account as needed.

  3. If they do not, create the service accounts and role attachments.
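
The following sketch covers steps 1 and 3 for a single service account, using the Google Cloud Client Library for Python that the chapter already imports. The update logic of step 2 is omitted, and the PROJECT environment variable mirrors listing 10.7; treat it as a starting point, not a complete replacement for terraform import:

import os

import googleapiclient.discovery
from googleapiclient.errors import HttpError

PROJECT = os.environ['CLOUDSDK_CORE_PROJECT']


def ensure_service_account(account_id):
    service = googleapiclient.discovery.build('iam', 'v1')
    email = f'{account_id}@{PROJECT}.iam.gserviceaccount.com'
    name = f'projects/{PROJECT}/serviceAccounts/{email}'
    try:
        # Step 1: check whether the service account already exists
        return service.projects().serviceAccounts().get(name=name).execute()
    except HttpError as err:
        if err.resp.status != 404:
            raise
    # Step 3: the account does not exist, so create it
    return service.projects().serviceAccounts().create(
        name=f'projects/{PROJECT}',
        body={'accountId': account_id,
              'serviceAccount': {'displayName': account_id}}).execute()


if __name__ == '__main__':
    ensure_service_account('audit-team')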

Remove the refactored resources from the monolith

You managed to extract and move the sundew team’s service accounts and role assignments to separate IaC. However, you don’t want the resources to stay in the monolith. You remove the resources from the monolith’s state and configuration before reapplying and updating your tool, as shown in figure 10.8.

Figure 10.8 Remove the policies from the monolith’s state and configuration before applying the updates and completing the refactor.

This step helps maintain IaC hygiene. Remember from chapter 2 that our IaC should serve as the source of truth. You do not want to manage one resource with two sets of IaC. If they conflict, the two IaC definitions for the resource may affect dependencies and the configuration of the system.

You want the IAM policy directory to serve as the source of truth. Going forward, the sundew team needs to declare changes to its IAM policy in the separate directory and not in the monolith. To avoid confusion, let’s remove the IAM resources from the IaC monolith.

To start, you must remove the sundew IAM resources from Terraform state, represented in a JSON file. Terraform includes a state removal command that you can use to take out portions of the JSON based on the resource identifier. Listing 10.8 uses Python code to wrap around the Terraform command. The code allows you to pass any resource type and identifier you want to remove from the infrastructure state.

Listing 10.8 File remove.py removes resources from the monolith's state

from sundew_production_iam import iam                                  
import subprocess
 
 
def check_state_remove_status(ret, err):
   return (ret != 0
           and 'No matching objects found' not in str(err))
 
  
def state_remove(resource_type, resource_identifier):                  
   command = ['terraform', 'state', 'rm', '-no-color',                 
              f'{resource_type}.{resource_identifier}']                
   return _terraform(command)                                          
 
 
def _terraform(command):                                               
   process = subprocess.Popen(
       command,
       stdout=subprocess.PIPE,
       stderr=subprocess.PIPE)
   stdout, stderr = process.communicate()
   return process.returncode, stdout, stderr
 
 
if __name__ == "__main__":
   sundew_iam = iam.users                                              
   for user in iam.users:                                              
       ret, _, err = state_remove(                                     
           iam.TERRAFORM_GCP_SERVICE_ACCOUNT_TYPE,                     
           iam.get_user_id(user))                                      
       if check_state_remove_status(ret, err):                         
           print(f'remove service account from state failed: {err}')   
       ret, _, err = state_remove(                                     
           iam.TERRAFORM_GCP_ROLE_ASSIGNMENT_TYPE,                     
           iam.get_user_id(user))                                      
       if check_state_remove_status(ret, err):                         
           print(f'remove role assignment from state failed: {err}')   

Retrieves the list of sundew users from iam.py in sundew_production_iam. Referencing the variable from the separated IaC allows you to run the removal automation for future refactoring efforts.

If the removal failed and did not already remove the resource, outputs the error

Creates a method that wraps around Terraform’s state removal command. The command passes the resource type, such as service account and identifier to remove.

Opens a subprocess that runs the Terraform command to remove the resource from the state

For each user in sundew_production_iam, removes their service account and role assignment from the monolith’s state

Removes the GCP service account from the monolith’s Terraform state based on its user identifier

Checks that the subprocess’s Terraform command successfully removed the resource from the monolith’s state

Removes the GCP role assignment from the monolith’s Terraform state based on its user identifier

Do not run python remove.py yet! Your monolith still contains a definition of the IAM policies. Open your monolithic IaC’s main.py. In the following listing, remove the code that builds the IAM service accounts and role assignments for the sundew team.

Listing 10.9 Removing the IAM policies from the monolith’s code

import blue                                      
import json                                      
 
if __name__ == "__main__":
   resources = {
       'resource': blue.build()                  
   }
 
   with open('main.tf.json', 'w') as outfile:    
       json.dump(resources, outfile,             
                 sort_keys=True, indent=4)       

Removes the import for the IAM policies

Removes the code to build the IAM policies within the monolith and leaves the other resources

Writes the configuration out to a JSON file to be executed by Terraform later. The configuration does not include the IAM policies.

You can now update your monolith. First, use python remove.py to delete the IAM resources from the monolith’s state:

$ python remove.py

This step signals that your monolith no longer serves as the source of truth for the IAM policies and service accounts. You do not delete the IAM resources! You can imagine this as handing over ownership of the IAM resources to the new IaC in a separate folder.

In your terminal, you can finally update the monolith. Generate a new Terraform JSON without the IAM policies and apply the updates; you should not have any changes:

$ python main.py
 
$ terraform apply
google_service_account.blue: Refreshing state...
google_compute_network.blue: Refreshing state...
google_compute_subnetwork.blue: Refreshing state...
google_container_cluster.blue: Refreshing state...
google_container_node_pool.blue: Refreshing state...
 
No changes. Your infrastructure matches the configuration.
 
Terraform has compared your real infrastructure against your configuration 
and found no differences, so no changes are needed.
 
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

If your dry run includes a resource you refactored, you know that you did not remove it from the monolith’s state or configuration. You need to examine the resources and identify whether to remove them manually.
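
A quick way to spot leftovers is to list the monolith’s state and search for the resource types you already refactored. This sketch wraps terraform state list in a subprocess, just as the other scripts in this chapter wrap Terraform, and reuses the constants from iam.py; run it from the monolith’s directory:

import subprocess

from sundew_production_iam import iam

REFACTORED_TYPES = (iam.TERRAFORM_GCP_SERVICE_ACCOUNT_TYPE,
                    iam.TERRAFORM_GCP_ROLE_ASSIGNMENT_TYPE)


def leftover_resources():
    # List every resource address still tracked in the monolith's state
    process = subprocess.run(['terraform', 'state', 'list'],
                             capture_output=True, text=True, check=True)
    return [address for address in process.stdout.splitlines()
            if address.startswith(REFACTORED_TYPES)]


if __name__ == '__main__':
    for address in leftover_resources():
        print(f'still in the monolith, remove from state or configuration: {address}')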

10.2.2 Refactor resources with dependencies

You can now work on the infrastructure resources with dependencies, such as the sundew team’s container orchestrator. The sundew team members ask you to avoid creating a new orchestrator and destroying the old one, since they do not want to disrupt applications. You need to refactor and extract the container orchestrator in place.

Start copying the container configuration out of the monolith, repeating the same process you used for refactoring the IAM service accounts and roles. You create a separate folder labeled sundew_production_orchestrator:

$ mkdir -p sundew_production_orchestrator

You select and copy the method to create the cluster into sundew_production_orchestrator/cluster.py. However, you have a problem. The container orchestrator needs the network and subnet names. How do you get the name of the network and subnet when the container orchestrator cannot reference the monolith?

Figure 10.9 shows how to implement dependency injection with an existing monolith, using the infrastructure provider’s API as the abstraction layer. The IaC to create the cluster calls the GCP API to get network information. You pass the network ID to the cluster to use.

Figure 10.9 Copy the infrastructure and add new methods to call the GCP API and get the network ID for the cluster.

A monolith passes the dependency explicitly between resources. When you create a new folder, your separated resources need information about their low-level dependencies. Recall that you can decouple infrastructure modules with dependency injection (chapter 4). A high-level module calls an abstraction layer to get identifiers for low-level dependencies.

When you start refactoring resources with dependencies, you must implement an interface for dependency injection. In the sundew team’s code for listing 10.10, update sundew_production_orchestrator/cluster.py to use the Google Cloud Client Library and retrieve the subnet and network names for the cluster configuration.

Note Several dependencies, variables, and imports have been removed from listing 10.10 for additional clarity. Refer to the book’s code repository at https://github.com/joatmon08/manning-book/tree/main/ch10/s03/s02 for the full example.

Listing 10.10 Using dependency inversion for the network name in the cluster

import googleapiclient.discovery                                   
                                
 
def _get_network_from_gcp():                                       
   service = googleapiclient.discovery.build(                      
       'compute', 'v1')                                            
   result = service.subnetworks().list(                            
       project=PROJECT,                                            
       region=REGION,                                              
       filter=f'name:"{TEAM}-{ENVIRONMENT}-*"').execute()          
   subnetworks = result['items'] if 'items' in result else []
   if len(subnetworks) != 1:
       print("Network not found")
       exit(1)
   return (subnetworks[0]['network'].split('/')[-1],
           subnetworks[0]['name'])
 
 
def cluster(name=cluster_name,                                     
           node_name=cluster_nodes,
           service_account=cluster_service_account,
           region=REGION):
   network, subnet = _get_network_from_gcp()                       
   return [
       {
           'google_container_cluster': {                           
               VERSION: [
                   {
                       'name': name,
                       'network': network,                         
                       'subnetwork': subnet                        
                   }
               ]
           },
           'google_container_node_pool': {                         
               VERSION: [                                          
                   {                                               
                       'cluster':                                  
                       '${google_container_cluster.' +             
                           f'{VERSION}' + '.name}'                 
                   }                                               
               ]                                                   
           },                                                      
           'google_service_account': {                             
               VERSION: [
                   {
                       'account_id': service_account,
                       'display_name': service_account
                   }
               ]
           }
       }
   ]

Sets up access to the GCP API using the Google Cloud Client Library for Python

Creates a method that retrieves the network information from GCP and implements dependency injection

Queries the GCP API for a list of subnetworks with names that start with sundew-production

Throws an error if the GCP API did not find the subnetwork

Returns the network name and subnetwork name

Several dependencies, variables, and imports have been removed from the code listing for additional clarity. Refer to the book’s code repository for the full example.

Applies the dependency inversion principle and calls the GCP API to retrieve the network and subnet names

Creates the Google container cluster, node pool, and service account by using a Terraform resource

Uses the network and subnet names to update the container cluster

AWS and Azure equivalents

You can update listing 10.10 to use the Azure libraries for Python (http://mng.bz/Pnn2) or AWS SDK for Python (https://aws.amazon.com/sdk-for-python/) to replace the GCP API client library.

Next, update the resources. Create an Amazon VPC and Azure virtual network for the Kubernetes node pools (also called groups). Then, switch the Google container cluster to an Amazon EKS cluster or AKS cluster.

When refactoring an infrastructure resource with dependencies, you must implement dependency injection to retrieve the low-level resource attributes. Listing 10.10 uses an infrastructure provider’s API, but you can use any abstraction layer you choose. An infrastructure provider’s API often provides the most straightforward abstraction. You can use it to avoid implementing your own.

After copying and updating the container cluster to reference network and subnet names from the GCP API, you repeat the refactoring workflow shown in figure 10.10. You import the live infrastructure resource into sundew_production_orchestrator, apply the separate configuration, check for any drift between the active state and the IaC, and remove the resource’s configuration and reference in the monolith’s state.

The main difference between refactoring a high-level resource versus a lower-level resource out of a monolith involves the implementation of dependency injection. You can choose the type of dependency injection you want to use, such as the infrastructure provider’s API, module outputs, or infrastructure state. Note that you might need to change the monolithic IaC to output the attributes if you do not use the infrastructure provider’s API.
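
For example, if you prefer infrastructure state as the abstraction layer, the orchestrator’s code could read the monolith’s outputs instead of calling the GCP API. The sketch below assumes you add network and subnetwork outputs to the monolith and that its configuration lives in a sibling directory named monolith; it would stand in for _get_network_from_gcp in listing 10.10:

import json
import subprocess


def _get_network_from_state(monolith_dir='../monolith'):
    # Read the monolith's declared outputs instead of querying the GCP API
    process = subprocess.run(['terraform', 'output', '-json'],
                             cwd=monolith_dir, capture_output=True,
                             text=True, check=True)
    outputs = json.loads(process.stdout)
    return outputs['network']['value'], outputs['subnetwork']['value']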

Otherwise, ensure that you apply idempotency by rerunning your IaC after refactoring. You want to avoid affecting the active resources and isolate all changes to the IaC. If your dry run reflects changes, you must fix the drift between your refactored code and infrastructure state before moving forward with other resources.

Figure 10.10 Refactor higher-level resources to get low-level identifiers with the GCP API before continuing to refactor low-level resources.

10.2.3 Repeat refactoring workflow

After you extract the IAM service accounts and roles and the container orchestrator, you can continue to break down the sundew system’s monolithic IaC configuration. The workflow in figure 10.11 summarizes the general pattern for breaking down monolithic IaC. You identify which resources depend on each other, extract their configuration, and update their dependencies to use dependency injection.

Figure 10.11 The workflow for refactoring an IaC monolith starts by identifying high-level resources with no dependencies.

Identify the high-level infrastructure resources that do not depend on anything else and have nothing depending on them. I use the high-level resources to test the workflow of copying, separating, importing, and deleting them from the monolith. Next, I identify the higher-level resources that depend on other resources. During the copying, I refactor them to reference attributes through dependency injection. I identify and repeat the process through the system, eventually concluding with the lowest-level resources that do not have any dependencies.

Configuration management

While this chapter focused primarily on IaC provisioning tools, configuration management can also turn into a monolith of automation and result in the same challenges, including taking a long time to run or having conflicting changes to parts of the configuration. You can apply a similar refactoring workflow to monolithic configuration management:

  1. Extract the most independent parts of the automation with no dependencies and separate them into a module.

  2. Run the configuration manager and make sure you did not change your resource state.

  3. Identify configuration that depends on the outputs or existence of lower-level automation. Extract them and apply dependency injection to retrieve any required values for the configuration.

  4. Run the configuration manager and make sure you did not change your resource state.

  5. Repeat the process until you effectively reach your configuration manager’s first step.

As you refactor IaC monoliths, identify ways to decouple the resources from one another. I find refactoring a challenge, and it rarely goes without some failures and mistakes. Isolating individual components and carefully testing them will help identify problems and minimize disruption to the system. If I do encounter failures, I use the techniques in chapter 11 to fix them.

Exercise 10.1

Given the following code, what order and grouping of resources would you use to refactor and break down the monolith?

if __name__ == "__main__":
  zones = ['us-west1-a', 'us-west1-b', 'us-west1-c']
  project.build()
  network.build(project)
  for zone in zones:
    subnet.build(project, network, zone)
  database.build(project, network)
  for zone in zones:
    server.build(project, network, zone)
  load_balancer.build(project, network)
  dns.build()

A) DNS, load balancer, servers, database, network + subnets, project

B) Load balancer + DNS, database, servers, network + subnets, project

C) Project, network + subnets, servers, database, load balancer + DNS

D) Database, load balancer + DNS, servers, network + subnets, project

See appendix B for answers to exercises.

Summary

  • Refactoring IaC involves restructuring configuration or code without impacting existing infrastructure resources.

  • Refactoring resolves technical debt, a metaphor to describe the cost of changing code.

  • A rolling update changes similar infrastructure resources one by one and tests each resource before moving to the next one.

  • Rolling updates allow you to implement and troubleshoot changes incrementally.

  • Feature flags (also known as feature toggles) enable or disable infrastructure resources, dependencies, or attributes.

  • Apply feature flags to test, stage, and hide changes before applying them to production.

  • Define feature flags in one place (such as a file or configuration manager) to identify their values at a glance.

  • Remove feature flags when you do not need them anymore.

  • Monolithic IaC happens when you define all of your infrastructure resources in one place, and removing one resource causes the entire configuration to fail.

  • Refactoring a resource out of a monolith involves separating and copying the configuration into a new directory or repository, importing it into a new separate state, and removing the resources from the monolithic configuration and state.

  • If your resource depends on another resource, update your separated resource configuration to use dependency injection and retrieve identifiers from an infrastructure provider API.

  • Breaking down a monolith starts by refactoring high-level resources or configurations with no dependencies, then resources or configurations with dependencies, and concluding with low-level resources or configurations with no dependencies.
