Over time, you might outgrow the patterns and practices you use to collaborate on infrastructure as code. Even change techniques like blue-green deployment cannot resolve the conflicts that arise as your team works on the same IaC. You must deliver a series of major changes to your IaC and address problems with scaling the practice.
For example, the sundew team for Datacenter for Carnivorous Plants expresses that it can no longer comfortably and confidently roll out new changes to its system. The team put all infrastructure resources in one repository (the singleton pattern) to deliver the system quickly and just kept adding new updates on top of it.
The sundew team outlines a few problems with its system. First, the team finds its updates to infrastructure configuration constantly overlapping. One teammate works on updating servers, only to find that another teammate has updated the network in a way that affects their changes.
Second, it takes more than 30 minutes to run a single change. One change makes hundreds of calls to the infrastructure API to retrieve the state of resources, which slows the feedback cycle.
Finally, the security team expresses concern that the sundew infrastructure may have an insecure configuration. The current configuration does not use standardized, hardened company infrastructure modules.
You realize you need to change the sundew team’s configuration. The configuration should use an existing server module approved by the security team. You also need to break the configuration into separate resources to minimize the blast radius of changes.
This chapter discusses some IaC patterns and techniques to break down large singleton repositories with hundreds of resources. As the IaC helper on the sundew team, you’ll refactor the system’s singleton configuration into separate repositories and structure the server configurations to use modules to avoid conflicts and comply with security standards.
Note Demonstrating refactoring requires a sufficiently large (and complex) example. If you run the complete example, you will incur a cost that exceeds the GCP free tier. This book includes only the relevant lines of code and omits the rest for readability. For complete listings, refer to the book’s code repository at https://github.com/joatmon08/manning-book/tree/main/ch10. If you convert these examples for AWS and Azure, you will also incur a cost. When possible, I offer notations on converting the examples to the cloud provider of your choice.
The sundew team needs help breaking down its infrastructure configuration. You decide to refactor the IaC to isolate conflicts better, reduce the amount of time to apply changes to production, and secure them according to company standards. Refactoring IaC involves restructuring configuration or code without impacting existing infrastructure resources.
Definition Refactoring IaC is the practice of restructuring configuration or code without impacting existing infrastructure resources.
You communicate to the sundew team that its configuration needs to undergo a refactor to fix the problems. While the team members support your effort, they challenge you to minimize the impact of your refactor. Challenge accepted: you apply a few techniques to reduce the potential blast radius as you refactor IaC.
The Datacenter for Carnivorous Plants platform and security team offer a server module with secure configurations, which you can use for the sundew system. The sundew team’s infrastructure configuration has three server configurations but no usage of the secure module. How do you change the sundew IaC to use the module?
Imagine you create all three new servers at once and immediately send traffic to them. If the applications do not run correctly on the servers, you could disrupt the sundew system entirely, and the poor plants do not get watered! Instead, you can reduce the blast radius of your server module refactor by gradually changing the servers one by one.
In figure 10.1, you create one server configuration using the module, deploy the application to the new server, validate that the application works, and delete the old server. You repeat the process two more times for each server. You gradually roll out the change to one server before updating the next one.
A rolling update gradually changes similar resources one by one and tests each one before continuing the update.
Definition A rolling update is a practice of changing a group of similar resources one by one and testing them before implementing the change to the next one.
Applying a rolling update to the sundew team’s configuration isolates failures to individual servers each time you make the update and allows you to test the server’s functionality before proceeding to the next one.
The practice of rolling updates can save you the pain of detangling a large set of failed changes or incorrectly configured IaC. For example, if the Datacenter for Carnivorous Plants module doesn’t work on one server, you have not yet rolled it out and affected the remaining servers. A rolling update lets you check that you have the proper IaC for each server before continuing to the next one. A gradual approach also mitigates any downtime in the applications or failures in updating the servers.
Note I borrowed rolling updates for refactoring from workload orchestrators, like Kubernetes. When a workload orchestrator needs to update its nodes (virtual machines), you may find it uses an automated rolling update mechanism. The orchestrator cordons the old node to prevent new workloads from running on it, starts all of the running processes on a new node, drains all of the processes on the old node, and sends workloads and requests to the new node. You can mimic this workflow when you refactor!
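The rolling-update workflow can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the create, validate, and delete callables are hypothetical stand-ins for your real provisioning, testing, and teardown steps.

```python
# Minimal sketch of a rolling update: replace servers one at a time,
# validating each replacement before deleting the old server and
# moving on. The create/validate/delete callables are hypothetical
# stand-ins for real provisioning, testing, and teardown steps.

def rolling_update(servers, create, validate, delete):
    updated = []
    for old in servers:
        new = create(old)       # build the replacement (e.g., with the secure module)
        if not validate(new):   # test it before touching the next server
            raise RuntimeError(f'validation failed for {new}; stopping rollout')
        delete(old)             # remove the old server only after validation
        updated.append(new)
    return updated

# Stub usage: replace three servers with their module-based equivalents.
result = rolling_update(
    ['blue_0', 'blue_1', 'blue_2'],
    create=lambda s: f'module_{s}',
    validate=lambda s: True,
    delete=lambda s: None,
)
print(result)
```

Because validation runs inside the loop, a failure stops the rollout before it spreads to the remaining servers, which is exactly the property the sundew team relies on.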
Thanks to the rolling update and incremental testing, you know that the servers can run with the secure module. You tell the team that you finished refactoring the servers and confirmed that they work with internal services. The sundew team can now send all customer traffic to the newly secured servers. However, the team members tell you that they need to update the customer-facing load balancer first!
You need a way to hide the new servers from the customer-facing load balancer for a few days and attach them when the team approves. However, you have all the configurations ready! You want to hide the server attachments behind a single variable to simplify the change for the sundew team. When the team members complete their load balancer update, they need to update only one variable to add the new servers to the load balancer.
Figure 10.2 outlines how you set up the variable to add new servers created by the module. You create a Boolean to enable or disable the new server module, using True or False. Then, you add an if statement to IaC that references the Boolean value. A True variable adds the new servers to the load balancer. A False variable removes the servers from the load balancer.
The Boolean variable helps with the composability and evolvability of IaC. A single change to the variable adds, removes, or updates a configuration. The variable, called a feature flag (or feature toggle), enables or disables infrastructure resources, dependencies, and attributes. You often find feature flags in software development with a trunk-based development model, in which everyone works on the main branch.
Definition Feature flags (also known as feature toggles) enable or disable infrastructure resources, dependencies, or attributes by using a Boolean.
Flags hide certain features or code and prevent them from impacting the rest of the team on the main branch. For the sundew team, you hide the new servers from the load balancer until the team completes the load balancer change. Similarly, you can use feature flags in IaC to stage configuration and push the update with a single variable.
To start implementing a feature flag and stage changes for the new servers, you add a flag and set it to False. You default a feature flag to False to preserve the original infrastructure state, as shown in figure 10.3. The sundew configuration disables the server module by default so that nothing happens to the original servers.
Let’s implement the feature flag in Python. You set the server module flag’s default to False in a separate file called flags.py. The file defines the flag, ENABLE_SERVER_MODULE, and sets it to False.
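The listing for flags.py is omitted here; based on the description, it would look something like this sketch:

```python
# flags.py: defines all feature flags in one place. The flag defaults
# to False so that running the IaC does not change existing infrastructure.
ENABLE_SERVER_MODULE = False
```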
You could also embed feature flags as variables in other files, but you might lose track of them! You decide to put them in a separate Python file.
Note I always define feature flags in a file to identify and change them in one place.
The following listing imports the feature flag in main.py and adds the logic to generate a list of servers to add to the load balancer.
import flags                                                       ❶

def _generate_servers(version):
    instances = [                                                  ❷
        f'${{google_compute_instance.{version}_0.id}}',            ❷
        f'${{google_compute_instance.{version}_1.id}}',            ❷
        f'${{google_compute_instance.{version}_2.id}}'             ❷
    ]                                                              ❷
    if flags.ENABLE_SERVER_MODULE:                                 ❸
        instances = [                                              ❹
            f'${{google_compute_instance.module_{version}_0.id}}', ❹
            f'${{google_compute_instance.module_{version}_1.id}}', ❹
            f'${{google_compute_instance.module_{version}_2.id}}', ❹
        ]
    return instances
❶ Imports the file that defines all of the feature flags
❷ Defines a list of existing Google compute instances (servers) by using a Terraform resource in the system
❸ Uses a conditional statement to evaluate the feature flag and adds the server module’s resources to the load balancer
❹ A feature flag set to True will attach the servers created by the module to the load balancer. Otherwise, it will keep the original servers
Run Python with the feature flag toggled off to generate a JSON configuration. The resulting JSON configuration adds only the original servers to the load balancer, which preserves the existing state of infrastructure resources.
{
    "resource": [
        {
            "google_compute_instance_group": {                     ❶
                "blue": [
                    {
                        "instances": [
                            "${google_compute_instance.blue_0.id}", ❷
                            "${google_compute_instance.blue_1.id}", ❷
                            "${google_compute_instance.blue_2.id}"  ❷
                        ]
                    }
                ]
            }
        }
    ]
}
❶ Creates a Google compute instance group using a Terraform resource to attach to the load balancer
❷ Configuration includes a list of the original Google compute instances, preserving the current state of infrastructure resources
The feature flag set to False by default uses the principle of idempotency. When you run the IaC, your infrastructure state should not change. Setting the flag ensures that you do not accidentally change existing infrastructure. Preserving the original state of the existing servers minimizes disruption to dependent applications.
The sundew team made its changes and provided approval to add the new servers created by the module to the load balancer. You set the feature flag to True, as shown in figure 10.4. When you deploy the change, you attach servers from a module to the load balancer and remove the old servers.
Let’s examine the updated feature flag in action. You start by setting the feature flag for the servers to True.
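The listing is omitted here; the change described amounts to flipping the flag in flags.py, roughly:

```python
# flags.py: enable the server module so the generated configuration
# attaches the module-created servers to the load balancer.
ENABLE_SERVER_MODULE = True
```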
Run Python to generate a new JSON configuration. The configuration in the following listing now includes the servers created by the module that you will attach to the load balancer.
{
    "resource": [
        {
            "google_compute_instance_group": {
                "blue": [
                    {
                        "instances": [
                            "${google_compute_instance.module_blue_0.id}", ❶
                            "${google_compute_instance.module_blue_1.id}", ❶
                            "${google_compute_instance.module_blue_2.id}"  ❶
                        ]
                    }
                ]
            }
        }
    ]
}
❶ The new servers created by the module replace the old servers because you enabled the feature flag.
The feature flag allows you to stage the module’s low-level server resources without affecting the load balancer’s high-level dependency. You can rerun the code with the feature toggle off to reattach the old servers.
Why use the feature flag to switch to the server module? A feature flag hides functionality from production until you feel ready to deploy resources associated with it. You offer one variable to add, remove, or update a set of resources. You can also use the same variable to revert changes.
After running the servers for some time, the sundew team reports that the new server module works. You can now remove the old servers in listing 10.4. You no longer need the feature flag, and you don’t want to confuse another team member when they read the code. You refactor the Python code for the load balancer to remove the old servers and delete the feature flag.
import blue                                                        ❶

def _generate_servers(version):
    instances = [                                                  ❷
        f'${{google_compute_instance.module_{version}_0.id}}',     ❷
        f'${{google_compute_instance.module_{version}_1.id}}',     ❷
        f'${{google_compute_instance.module_{version}_2.id}}',     ❷
    ]                                                              ❷
    return instances
❶ You can remove the import for the feature flags because you no longer need it for your servers.
❷ Permanently attaches the servers created by the module to the load balancer and removes the feature flag
The example uses feature flags to refactor singleton configurations into infrastructure modules. I often apply feature flags to this use case to simplify the creation and removal of infrastructure resources. Other use cases for feature flags include the following:
Collaborating and avoiding change conflicts on the same infrastructure resources or dependencies
Staging a group of changes and rapidly deploying them with a single update to the flag
A feature flag offers a technique to hide or isolate infrastructure resource, attribute, and dependency changes during the refactoring of infrastructure configuration. However, changing the toggle can still disrupt a system. In the example of the sundew team’s servers, we cannot simply toggle the feature flag to True and expect the servers to run the application. Instead, we combine the feature flag with other techniques like rolling updates to minimize disruption to the system.
The sundew team members express that they still have a problem with their system. You identify the singleton configuration, with hundreds of resources and attributes, as the root cause. Whenever someone makes a change, they must resolve conflicts with another teammate. They also have to wait 30 minutes for each change to apply.
A monolithic architecture for IaC means defining all infrastructure resources in one place. You need to break the monolith of IaC into smaller, modular components to minimize working conflicts between teammates and speed up the deployment of changes.
Definition A monolithic architecture for IaC defines all infrastructure resources in a single configuration and the same state.
In this section, we’ll walk through a refactor of the sundew team’s monolith. The most crucial step begins with identifying and grouping high-level infrastructure resources and dependencies. We complete the refactor with the low-level infrastructure resources.
The sundew team manages hundreds of resources in one set of configuration files. Where should you start breaking down the IaC? You decide to look for high-level infrastructure resources that do not depend on other resources.
The sundew team has one set of high-level infrastructure in GCP project-level IAM service accounts and roles. The IAM service accounts and roles don’t need to create a network or server before setting user and service account rules on the project. None of the other resources depend on the IAM roles and service accounts. You can group and extract them first.
You cannot use a blue-green deployment approach because GCP does not allow duplicate policies. You also cannot simply delete the roles and accounts from the monolithic configuration and copy them to a new repository; deleting them prevents everyone from logging into the project! How can you extract them?
You can copy and paste the configuration into its separate repository or directory, initialize the state for the separated configuration, and import the resources into the infrastructure state associated with the new configuration. Then, you delete the IAM configuration in the monolithic configuration. As with the rolling update, you gradually change each set of infrastructure resources, test the changes, and proceed to the next one.
Figure 10.5 outlines the solution to refactoring a monolith for high-level resources. You copy the code from the monolith to a new folder and import the live infrastructure resource into the state of the code in the new folder. You redeploy the code to make sure it does not change existing infrastructure. Finally, remove the high-level resources from the monolith.
As with feature flags, we use the principle of idempotency to run IaC and verify that we do not affect the active infrastructure state. Anytime you refactor, make sure you deploy the changes and check the dry run. You do not want to accidentally change an existing resource and affect its dependencies.
We will refactor the example in the following few sections. Stay with it! I know refactoring tends to feel tedious, but a gradual approach ensures that you do not introduce widespread failures into your system.
Copy from the monolith to a separate state
Your initial refactor begins by copying the code that creates the IAM roles and service accounts to a new directory. The sundew team wants to keep all of its IaC in one source control repository but separate the configurations into folders.
You identify the IAM roles and service accounts to copy the team’s code to a new folder, as shown in figure 10.6. The active IAM policies and their infrastructure state in GCP do not change.
Why reproduce the IaC for the IAM policies in a separate folder? You want to split up your monolithic IaC without affecting any of the active resources. The most important practice when refactoring involves preserving idempotency. Your active state should never change when you move your IaC.
Let’s start refactoring the IAM policies out of the monolith. Create a new directory that manages only the IAM policies for the GCP project, and copy the IAM configuration from the monolith into the new directory.
You don’t need to change anything since the IAM policies do not depend on other infrastructure. The file iam.py in the following listing separates the creation and role assignment for a set of users.
import json

TEAM = 'sundew'
TERRAFORM_GCP_SERVICE_ACCOUNT_TYPE = 'google_service_account'      ❶
TERRAFORM_GCP_ROLE_ASSIGNMENT_TYPE = 'google_project_iam_member'   ❶

users = {                                                          ❷
    'audit-team': 'roles/viewer',                                  ❷
    'automation-watering': 'roles/editor',                         ❷
    'user-02': 'roles/owner'                                       ❷
}                                                                  ❷

def get_user_id(user):
    return user.replace('-', '_')

def build():
    return iam()

def iam(users=users):                                              ❸
    iam_members = []
    for user, role in users.items():
        user_id = get_user_id(user)
        iam_members.append({
            TERRAFORM_GCP_SERVICE_ACCOUNT_TYPE: [{                 ❹
                user_id: [{                                        ❹
                    'account_id': user,                            ❹
                    'display_name': user                           ❹
                }]                                                 ❹
            }]                                                     ❹
        })
        iam_members.append({
            TERRAFORM_GCP_ROLE_ASSIGNMENT_TYPE: [{                 ❺
                user_id: [{                                        ❺
                    'role': role,                                  ❺
                    'member': 'serviceAccount:${google_service_account.' ❺
                              + f'{user_id}' + '.email}'           ❺
                }]                                                 ❺
            }]                                                     ❺
        })
    return iam_members
❶ Sets the resource types that Terraform uses as constants so you can reference them later if needed
❷ Keeps all of the users you added to the project as part of the monolith
❸ Uses the module to create the JSON configuration for the IAM policies outside the monolith
❹ Creates a GCP service account for the project for each user in the sundew production project
❺ Assigns the specific role defined for each service account, such as viewer, editor, or owner
When separating configuration, set constants and create methods that output resource types and identifiers. You can reuse them for other automation and continued system maintenance, especially as you continue to refactor the monolith!
In the following listing, create a main.py file in the sundew_production_iam folder that references the IAM configuration and outputs the Terraform JSON for it.
import iam                                         ❶
import json

if __name__ == "__main__":
    resources = {
        'resource': iam.build()                    ❶
    }
    with open('main.tf.json', 'w') as outfile:     ❷
        json.dump(resources, outfile,              ❷
                  sort_keys=True, indent=4)        ❷
❶ Imports the IAM configuration code and builds the IAM policies
❷ Writes the Python dictionary out to a JSON file to be executed by Terraform later
Do not run Python yet to create the Terraform JSON or deploy the IAM policies! You already have IAM policies defined as part of GCP. If you run python main.py and apply the Terraform JSON with the separated IAM configuration, GCP throws an error that the user account and assignment already exist:
$ python main.py
$ terraform apply -auto-approve

## output omitted for clarity

│ Error: Error creating service account: googleapi:
➥Error 409: Service account audit-team already exists within project
➥projects/infrastructure-as-code-book., alreadyExists
The sundew team members do not want you to remove and create new accounts and roles. If you delete and create new accounts, they cannot log into their GCP project. You need a way to migrate the existing resources defined in the monolith and link them to code defined in its own folder.
Import the resources to the new state
Sometimes creating new resources with your refactored IaC will disrupt development teams and business-critical systems. You cannot use the principle of immutability to delete the old resources and create new ones. Instead, you must migrate active resources from one IaC definition to another.
In the case of the sundew team, you extract the identifiers for each service account from the monolithic configuration and “move” them to the new state. Figure 10.7 demonstrates how to detach each service account and its role assignments from the monolith and attach them to the IaC in the sundew_production_iam directory. You call the GCP API for the current state of the IAM policies and import the live infrastructure resources into the separated configuration and state. Running the IaC should reveal no changes to its dry run.
Why import the IAM policy information with the GCP API? You want to import the updated, active state of the resource. A cloud provider’s API offers the most up-to-date configuration for resources. You can call the GCP API to retrieve the user emails, roles, and identifiers for the sundew team.
Rather than write your own import capability and save the identifiers in a file, you decide to use Terraform’s import capability to add existing resources to the state. You write some Python code in the following listing that wraps around Terraform to automate a batch import of IAM resources so that the sundew team can reuse it.
import iam                                                     ❶
import os
import googleapiclient.discovery                               ❷
import subprocess

PROJECT = os.environ['CLOUDSDK_CORE_PROJECT']                  ❸

def _get_members_from_gcp(project, roles):                     ❷
    roles_and_members = {}
    service = googleapiclient.discovery.build(
        'cloudresourcemanager', 'v1')
    result = service.projects().getIamPolicy(
        resource=project, body={}).execute()
    bindings = result['bindings']
    for binding in bindings:
        if binding['role'] in roles:
            roles_and_members[binding['role']] = binding['members']
    return roles_and_members

def _set_emails_and_roles(users, all_members):                 ❹
    members = []
    for username, role in users.items():
        members += [(iam.get_user_id(username), m, role)
                    for m in all_members[role] if username in m]
    return members

def check_import_status(ret, err):
    return (ret != 0 and
            'Resource already managed by Terraform' not in str(err))

def import_service_account(project_id, user_id, user_email):   ❺
    email = user_email.replace('serviceAccount:', '')
    command = ['terraform', 'import', '-no-color',
               f'{iam.TERRAFORM_GCP_SERVICE_ACCOUNT_TYPE}.{user_id}',
               f'projects/{project_id}/serviceAccounts/{email}']
    return _terraform(command)

def import_project_iam_member(project_id, role,                ❻
                              user_id, user_email):
    command = ['terraform', 'import', '-no-color',
               f'{iam.TERRAFORM_GCP_ROLE_ASSIGNMENT_TYPE}.{user_id}',
               f'{project_id} {role} {user_email}']
    return _terraform(command)

def _terraform(command):                                       ❼
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    return process.returncode, stdout, stderr

if __name__ == "__main__":
    sundew_iam = iam.users                                     ❽
    all_members_for_roles = _get_members_from_gcp(             ❾
        PROJECT, set(sundew_iam.values()))
    import_members = _set_emails_and_roles(                    ❿
        sundew_iam, all_members_for_roles)
    for user_id, email, role in import_members:
        ret, _, err = import_service_account(PROJECT, user_id, email)  ❺
        if check_import_status(ret, err):                      ⓫
            print(f'import service account failed: {err}')     ⓫
        ret, _, err = import_project_iam_member(PROJECT, role, ❻
                                                user_id, email)
        if check_import_status(ret, err):                      ⓫
            print(f'import iam member failed: {err}')          ⓫
❶ Retrieves the list of sundew users from iam.py in sundew_production_iam
❷ Uses the Google Cloud Client Libraries for Python to get a list of members assigned to a role in the GCP project
❸ Retrieves the GCP project ID from the CLOUDSDK_CORE_PROJECT environment variable
❹ Gets the email and user IDs for the sundew IAM members only
❺ Imports the service account to the sundew_production_iam state based on project and user email, using the resource type constant you set in iam.py
❻ Imports a role assignment to the sundew_production_iam state based on project, role, and user email, using the resource type constant you set in iam.py
❼ Both import methods wrap around the Terraform CLI command and return any errors and output.
❽ Retrieves the list of sundew users from iam.py in sundew_production_iam
❾ Uses the Google Cloud Client Libraries for Python to get a list of members assigned to a role in the GCP project
❿ Gets the email and user IDs for the sundew IAM members only
⓫ If the import fails and it did not already import the resource, outputs the error
As with defining dependencies, you want to dynamically retrieve identifiers from the infrastructure provider API for your resources to import. You never know when someone will change the resource, and the identifier you thought you needed no longer exists! Use your tags and naming conventions to search the API response for the resources you need.
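As a sketch of that advice, the following filters an API-style response by a naming convention rather than by hardcoded identifiers. This is an illustration, not code from the book: the bindings list is a simplified stand-in for the shape of a GCP getIamPolicy response.

```python
# Sketch: search an IAM policy response for a team's members by
# naming convention instead of hardcoding resource identifiers.
# The bindings list is a simplified stand-in for the GCP
# getIamPolicy response shape.

def members_matching(bindings, naming_convention):
    matches = []
    for binding in bindings:
        # Keep only members whose names follow the team's convention.
        matches += [m for m in binding['members'] if naming_convention in m]
    return matches

bindings = [
    {'role': 'roles/viewer',
     'members': ['serviceAccount:audit-team@sundew.iam.gserviceaccount.com']},
    {'role': 'roles/owner',
     'members': ['serviceAccount:admin@other.iam.gserviceaccount.com']},
]
print(members_matching(bindings, 'sundew'))
```

Searching by convention keeps the import automation working even when someone recreates a resource and its identifier changes.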
When you run python import.py and perform a dry run of the Terraform JSON with the separated IAM configuration, you get a message that you do not have to make any changes. You successfully imported the existing IAM resources into their separate configuration and state:
$ python main.py
$ terraform plan

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
➥and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
Sometimes your dry run indicates drift between the active resource state and the separated configuration. Your copied configuration does not match the active state of the resource. The differences often come from someone changing the active state of an infrastructure resource during a manual change or during updates to the default value for an attribute. Update your separated IaC to match the attributes of the active infrastructure resource.
Remove the refactored resources from the monolith
You managed to extract and move the sundew team’s service accounts and role assignments to separate IaC. However, you don’t want the resources to stay in the monolith. You remove the resources from the monolith’s state and configuration before reapplying and updating your tool, as shown in figure 10.8.
This step helps maintain IaC hygiene. Remember from chapter 2 that our IaC should serve as the source of truth. You do not want to manage one resource with two sets of IaC. If they conflict, the two IaC definitions for the resource may affect dependencies and the configuration of the system.
You want the IAM policy directory to serve as the source of truth. Going forward, the sundew team needs to declare changes to its IAM policy in the separate directory and not in the monolith. To avoid confusion, let’s remove the IAM resources from the IaC monolith.
To start, you must remove the sundew IAM resources from Terraform state, represented in a JSON file. Terraform includes a state removal command that you can use to take out portions of the JSON based on the resource identifier. Listing 10.8 uses Python code to wrap around the Terraform command. The code allows you to pass any resource type and identifier you want to remove from the infrastructure state.
from sundew_production_iam import iam                  ❶
import subprocess

def check_state_remove_status(ret, err):               ❷
    return (ret != 0 and                               ❷
            'No matching objects found' not in str(err)) ❷

def state_remove(resource_type, resource_identifier):  ❸
    command = ['terraform', 'state', 'rm', '-no-color', ❸
               f'{resource_type}.{resource_identifier}'] ❸
    return _terraform(command)                         ❸

def _terraform(command):                               ❹
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    return process.returncode, stdout, stderr

if __name__ == "__main__":
    sundew_iam = iam.users                             ❶
    for user in iam.users:                             ❺
        ret, _, err = state_remove(                    ❻
            iam.TERRAFORM_GCP_SERVICE_ACCOUNT_TYPE,    ❻
            iam.get_user_id(user))                     ❻
        if check_state_remove_status(ret, err):        ❼
            print(f'remove service account from state failed: {err}') ❼
        ret, _, err = state_remove(                    ❽
            iam.TERRAFORM_GCP_ROLE_ASSIGNMENT_TYPE,    ❽
            iam.get_user_id(user))                     ❽
        if check_state_remove_status(ret, err):        ❼
            print(f'remove role assignment from state failed: {err}') ❼
❶ Retrieves the list of sundew users from iam.py in sundew_production_iam. Referencing the variable from the separated IaC allows you to run the removal automation for future refactoring efforts.
❷ If the removal failed and did not already remove the resource, outputs the error
❸ Creates a method that wraps around Terraform’s state removal command. The command passes the resource type, such as service account and identifier to remove.
❹ Opens a subprocess that runs the Terraform command to remove the resource from the state
❺ For each user in sundew_production_iam, removes their service account and role assignment from the monolith’s state
❻ Removes the GCP service account from the monolith’s Terraform state based on its user identifier
❼ Checks that the subprocess’s Terraform command successfully removed the resource from the monolith’s state
❽ Removes the GCP role assignment from the monolith’s Terraform state based on its user identifier
Do not run python remove.py yet! Your monolith still contains a definition of the IAM policies. Open your monolithic IaC’s main.py. In the following listing, remove the code that builds the IAM service accounts and role assignments for the sundew team.
import blue                                        ❶
import json                                        ❶

if __name__ == "__main__":
    resources = {
        'resource': blue.build()                   ❷
    }
    with open('main.tf.json', 'w') as outfile:     ❸
        json.dump(resources, outfile,              ❸
                  sort_keys=True, indent=4)        ❸
❶ Removes the import for the IAM policies
❷ Removes the code to build the IAM policies within the monolith and leaves the other resources
❸ Writes the configuration out to a JSON file to be executed by Terraform later. The configuration does not include the IAM policies.
You can now update your monolith. First, use python remove.py to delete the IAM resources from the monolith’s state.
This step signals that your monolith no longer serves as the source of truth for the IAM policies and service accounts. You do not delete the IAM resources! You can imagine this as handing over ownership of the IAM resources to the new IaC in a separate folder.
In your terminal, you can finally update the monolith. Generate a new Terraform JSON without the IAM policies and apply the updates; you should not have any changes:
$ python main.py
$ terraform apply

google_service_account.blue: Refreshing state...
google_compute_network.blue: Refreshing state...
google_compute_subnetwork.blue: Refreshing state...
google_container_cluster.blue: Refreshing state...
google_container_node_pool.blue: Refreshing state...

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
➥and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
If your dry run includes a resource you refactored, you know that you did not remove it from the monolith’s state or configuration. You need to examine the resources and identify whether to remove them manually.
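One way to audit for leftovers is to diff the monolith's terraform state list output against the set of addresses you refactored out. A small sketch, using hypothetical resource addresses:

```python
def find_leftovers(state_addresses, refactored_addresses):
    """Return refactored resources still present in the monolith's state.

    state_addresses: lines from `terraform state list` in the monolith.
    refactored_addresses: addresses you moved to the new configuration.
    """
    return sorted(set(state_addresses) & set(refactored_addresses))


# Hypothetical example: the IAM member was never removed from state.
monolith_state = [
    'google_compute_network.blue',
    'google_container_cluster.blue',
    'google_project_iam_member.sundew',
]
moved = ['google_service_account.sundew', 'google_project_iam_member.sundew']
print(find_leftovers(monolith_state, moved))
# → ['google_project_iam_member.sundew']
```

Any address the function returns still needs a manual terraform state rm in the monolith.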
You can now work on the lower-level infrastructure resources with dependencies, such as the sundew team’s container orchestrator. The sundew team members ask you to avoid creating a new orchestrator and destroying the old one since they do not want to disrupt applications. You need to refactor and extract the low-level container orchestrator in place.
Start by copying the container configuration out of the monolith, repeating the same process you used for refactoring the IAM service accounts and roles. You create a separate folder labeled sundew_production_orchestrator:
You select and copy the method to create the cluster into sundew_production_orchestrator/cluster.py. However, you have a problem. The container orchestrator needs the network and subnet names. How do you get the name of the network and subnet when the container orchestrator cannot reference the monolith?
A monolith passes dependencies explicitly between resources. When you move resources into a new folder, the separated resources still need information about their low-level dependencies. Recall from chapter 4 that you can decouple infrastructure modules with dependency injection: a high-level module calls an abstraction layer to get identifiers for low-level dependencies.
Figure 10.9 implements dependency injection for an existing monolith using the infrastructure provider's API as the abstraction layer. The IaC that creates the cluster calls the GCP API to get the network information and passes the network ID to the cluster.
When you start refactoring resources with dependencies, you must implement an interface for dependency injection. In the sundew team’s code for listing 10.10, update sundew_production_orchestrator/cluster.py to use the Google Cloud Client Library and retrieve the subnet and network names for the cluster configuration.
Note Several dependencies, variables, and imports have been removed from listing 10.10 for additional clarity. Refer to the book’s code repository at https://github.com/joatmon08/manning-book/tree/main/ch10/s03/s02 for the full example.
import googleapiclient.discovery                           ❶


def _get_network_from_gcp():                               ❷
    service = googleapiclient.discovery.build(             ❶
        'compute', 'v1')                                   ❶
    result = service.subnetworks().list(                   ❸
        project=PROJECT,                                   ❸
        region=REGION,                                     ❸
        filter=f'name:"{TEAM}-{ENVIRONMENT}-*"').execute() ❸
    subnetworks = result['items'] if 'items' in result else []
    if len(subnetworks) != 1:                              ❹
        print("Network not found")                         ❹
        exit(1)                                            ❹
    return (subnetworks[0]['network'].split('/')[-1],      ❺
            subnetworks[0]['name'])                        ❺


def cluster(name=cluster_name,                             ❻
            node_name=cluster_nodes,
            service_account=cluster_service_account,
            region=REGION):
    network, subnet = _get_network_from_gcp()              ❼
    return [
        {
            'google_container_cluster': {                  ❽
                VERSION: [
                    {
                        'name': name,
                        'network': network,                ❾
                        'subnetwork': subnet               ❾
                    }
                ]
            },
            'google_container_node_pool': {                ❽
                VERSION: [                                 ❽
                    {                                      ❽
                        'cluster':                         ❽
                            '${google_container_cluster.' +  ❽
                            f'{VERSION}' + '.name}'        ❽
                    }                                      ❽
                ]                                          ❽
            },                                             ❽
            'google_service_account': {                    ❽
                VERSION: [
                    {
                        'account_id': service_account,
                        'display_name': service_account
                    }
                ]
            }
        }
    ]
❶ Sets up access to the GCP API using the Google Cloud Client Library for Python
❷ Creates a method that retrieves the network information from GCP and implements dependency injection
❸ Queries the GCP API for a list of subnetworks with names that start with sundew-production
❹ Exits with an error if the GCP API does not return exactly one matching subnetwork
❺ Returns the network name and subnetwork name
❻ Defines the method that builds the cluster configuration, with defaults for the cluster name, nodes, service account, and region
❼ Applies the dependency inversion principle and calls the GCP API to retrieve the network and subnet names
❽ Creates the Google container cluster, node pool, and service account by using a Terraform resource
❾ Uses the network and subnet names to update the container cluster
When refactoring an infrastructure resource with dependencies, you must implement dependency injection to retrieve the low-level resource attributes. Listing 10.10 uses an infrastructure provider’s API, but you can use any abstraction layer you choose. An infrastructure provider’s API often provides the most straightforward abstraction. You can use it to avoid implementing your own.
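For example, if you preferred infrastructure state over the provider's API, you could read another workspace's outputs with terraform output -json and parse the identifiers from there. A minimal sketch, assuming the monolith exposes hypothetical network_name and subnet_name outputs:

```python
import json
import subprocess


def parse_network_outputs(raw_json):
    # terraform output -json wraps each output in an object
    # with a 'value' key (alongside 'type' and 'sensitive').
    outputs = json.loads(raw_json)
    return outputs['network_name']['value'], outputs['subnet_name']['value']


def get_network_from_outputs(monolith_dir):
    # Ask the monolith's workspace for its outputs as JSON.
    raw = subprocess.run(
        ['terraform', 'output', '-json'], cwd=monolith_dir,
        capture_output=True, check=True, text=True).stdout
    return parse_network_outputs(raw)


# Hypothetical `terraform output -json` payload for illustration.
sample = ('{"network_name": {"value": "sundew-production-network"},'
          ' "subnet_name": {"value": "sundew-production-subnet"}}')
print(parse_network_outputs(sample))
# → ('sundew-production-network', 'sundew-production-subnet')
```

This approach requires changing the monolith to declare those outputs, which is the trade-off noted below: the provider's API works without touching the monolith at all.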
After copying and updating the container cluster to reference network and subnet names from the GCP API, you repeat the refactoring workflow shown in figure 10.10. You import the live infrastructure resource into sundew_production_orchestrator, apply the separate configuration, check for any drift between the active state and the IaC, and remove the resource’s configuration and reference in the monolith’s state.
The main difference between refactoring a high-level resource versus a lower-level resource out of a monolith involves the implementation of dependency injection. You can choose the type of dependency injection you want to use, such as the infrastructure provider’s API, module outputs, or infrastructure state. Note that you might need to change the monolithic IaC to output the attributes if you do not use the infrastructure provider’s API.
Whichever abstraction you choose, verify idempotency by rerunning your IaC after refactoring. You want to avoid affecting the active resources and isolate all changes to the IaC. If your dry run reflects changes, you must fix the drift between your refactored code and the infrastructure state before moving on to other resources.
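You can automate this drift check. terraform plan -detailed-exitcode returns 0 when there are no changes, 1 on error, and 2 when changes are pending. A minimal sketch (the working directory is an assumption):

```python
import subprocess


def interpret_plan_exit_code(code):
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending.
    if code == 0:
        return False  # the refactor was idempotent
    if code == 2:
        return True   # drift: fix it before refactoring more resources
    raise RuntimeError('terraform plan failed')


def has_drift(working_dir='.'):
    result = subprocess.run(
        ['terraform', 'plan', '-detailed-exitcode'],
        cwd=working_dir, capture_output=True)
    return interpret_plan_exit_code(result.returncode)
```

Wiring this into a pipeline lets the build fail fast whenever a refactoring step leaves the configuration and state out of sync.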
After you extract the IAM service accounts and roles and the container orchestrator, you can continue to break down the sundew system’s monolithic IaC configuration. The workflow in figure 10.11 summarizes the general pattern for breaking down monolithic IaC. You identify which resources depend on each other, extract their configuration, and update their dependencies to use dependency injection.
First, identify the high-level infrastructure resources that do not depend on anything and have nothing depending on them. I use these high-level resources to test the workflow of copying, separating, importing, and deleting them from the monolith. Next, I identify the higher-level resources that depend on other resources. As I copy them, I refactor them to reference attributes through dependency injection. I repeat the process through the system, eventually concluding with the lowest-level resources, which depend on nothing themselves.
As you refactor IaC monoliths, identify ways to decouple the resources from one another. I find refactoring challenging; it rarely goes without some failures and mistakes. Isolating individual components and carefully testing them helps identify problems and minimizes disruption to the system. If I do encounter failures, I use the techniques in chapter 11 to fix them.
Refactoring IaC involves restructuring configuration or code without impacting existing infrastructure resources.
Refactoring resolves technical debt, a metaphor to describe the cost of changing code.
A rolling update changes similar infrastructure resources one by one and tests each resource before moving to the next one.
Rolling updates allow you to implement and troubleshoot changes incrementally.
Feature flags (also known as feature toggles) enable or disable infrastructure resources, dependencies, or attributes.
Apply feature flags to test, stage, and hide changes before applying them to production.
Define feature flags in one place (such as a file or configuration manager) to identify their values at a glance.
Monolithic IaC happens when you define all of your infrastructure resources in one place, and removing one resource causes the entire configuration to fail.
Refactoring a resource out of a monolith involves separating and copying the configuration into a new directory or repository, importing it into a new separate state, and removing the resources from the monolithic configuration and state.
If your resource depends on another resource, update your separated resource configuration to use dependency injection and retrieve identifiers from an infrastructure provider API.
Breaking down a monolith starts by refactoring high-level resources or configurations with no dependencies, then resources or configurations with dependencies, and concluding with low-level resources or configurations with no dependencies.