We took many chapters to discuss writing and collaborating on infrastructure as code. All of the practices and principles you learned for IaC culminate in the crucial moment when you push a change, it causes your system to fail, and you need to roll it back! However, IaC doesn’t support rollback; you never fully revert IaC changes. What does it mean to fix failures if you don’t roll them back?
This chapter focuses on fixing failed changes from IaC. First, we’ll discuss what it means to “revert” IaC changes by rolling forward. Then, you’ll learn workflows for troubleshooting and fixing the failed change. While the techniques in this chapter might not apply to every scenario you’ll encounter in your system, they establish a broad set of practices you can use to start repairing IaC failures.
Imagine you work for a company called Cool Caps for Keys. It creates custom keyboard caps and connects customers with artists to design the caps. As the security engineer, you need to narrow down the access control for applications and users across GCP projects.
You copy the Google Cloud SQL database configuration and update the access control to implement least-privilege access for team members and applications. You choose the policies required for different applications to use infrastructure and verify that the applications still work.
Next, you talk to the promotions team. Its application accesses the database directly by using a database username and password. Direct access to the database means that you can remove the policy for roles/cloudsql.admin from the promotions application’s service account. You remove the policy, test the changes, confirm with the promotions team that the change did not affect its application in its testing environment, and push it to production; see figure 11.1.
An hour later, the promotions team tells you that its application keeps throwing an error that it can’t access the database! You suspect that your change might have introduced a problem. While you could start digging around for the problem, you prioritize fixing the promotions service so it can access the database before investigating further.
You need to fix the service so users can make requests to the system. However, you cannot simply return the system to a previously working state. IaC prioritizes immutability, which means any changes to the system, including reverted changes, must create new resources!
For example, let’s fix the promotions service for Cool Caps for Keys by reverting the change and adding the role back to the service account. In figure 11.2, you revert the commit and add roles/cloudsql.admin back to the service account. Then, you push the changes to your testing and production environments.
You revert the commit and push forward the changes to the testing and production environments. You roll forward IaC because it uses immutability to return the system to a working state.
Definition The practice of rolling forward IaC reverts changes to a system and uses immutability to return the system to a working state.
Rollback implies that you return infrastructure to a previous state. In reality, the immutability of IaC means that you create a new state anytime you make changes, so you can never fully restore infrastructure to its previous state. Sometimes you can’t restore even an equivalent state because your change has a large blast radius.
Let’s revert your changes to the service account and roll forward changes to add back the permission. First, check your commit history, because version control keeps track of all the changes you make. The commit prefixed with a31 includes your removal of roles/cloudsql.admin:
$ git log --oneline -2
a3119fc (HEAD -> main) Remove database admin access from promotions
6f84f5d Add database admin to promotions service
Applying the GitOps practices from chapter 7, you want to avoid making manual, break-glass changes. Instead, you favor operational changes through IaC! You revert the commit to push updates to restore the promotions service to a working state:
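The mechanics of the revert itself are worth seeing end to end. The following self-contained demonstration builds a throwaway repository, makes the “bad” change, and reverts it; the file name and commit messages are illustrative, not from the book’s repository. The key point is that git revert never rewrites history — it adds a new commit that undoes the broken one:

```shell
# Demonstration: git revert as a forward-moving undo (throwaway repo).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "You"

# Commit 1: the service account has the database admin role.
echo "role = 'roles/cloudsql.admin'" > main.tf.py
git add . && git commit -qm "Add database admin to promotions service"

# Commit 2: the broken change that removed the role.
echo "# role removed" > main.tf.py
git commit -qam "Remove database admin access from promotions"

# Roll forward: a third, NEW commit that restores the role.
git revert --no-edit HEAD
grep "cloudsql.admin" main.tf.py   # the role line is back
git log --oneline                  # add, remove, and revert commits
```

In the chapter’s scenario, you would revert commit a3119fc on main and let the pipeline deliver the restored configuration.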
You push the commit, and the pipeline adds the role back to the service account. After you roll forward, the application works again. You’ve successfully returned the infrastructure to a working state. However, you never achieve a full restore; instead, you roll out a new state that matches the previous working state.
Rolling back IaC often means rolling forward changes to the infrastructure state. You use git revert as a forward-moving undo: it preserves immutability while rolling forward updates that undo changes to infrastructure.
The benefit of a roll-forward mentality is that it expands your troubleshooting approach. In the example, you reverted a broken commit and restored functionality to the promotions service by matching the new state to a prior working one. However, sometimes reverting a commit doesn’t fix your system and makes everything worse! Instead, you can roll forward new changes to restore functionality.
Let’s imagine the promotions service still doesn’t work after you roll forward changes. Rather than try to fix the application, you create a new environment with the change and a new promotions service. You start a canary deployment technique from chapter 9 to gradually increase the traffic to fully restore the application, as shown in figure 11.3. You disable the failed environment after all requests go to the new service instance for debugging.
IaC allows you to reproduce environments with less effort. Furthermore, conforming to immutability means that you already have a pattern of creating new environments for changes. The combination of the two principles helps mitigate higher-risk changes with a larger blast radius.
Use cases that involve data or completely irrecoverable resources cannot roll forward to a prior state. You could corrupt application data or affect other infrastructure while detangling a cascading failure. Rather than roll forward to revert, you can roll forward and implement new changes by applying the change techniques from chapter 9.
You may also restore functionality with a combination of reverts and completely new changes. Expanding your roll-forward mentality to include new changes outside of undoing old ones offers a helpful alternative to restore functionality quickly and minimize disruption to other parts of your system.
You put a bandage on your system so the promotions team can still send out promotional offers for Cool Caps for Keys. However, you still need to secure the IAM of the application! Where do you start to find out why the promotions service failed when you removed administrative permission that the promotions team shouldn’t need?
Troubleshooting your IaC also follows specific patterns. Even in the most complex infrastructure systems, many failed changes from IaC usually come from three causes: drift, dependencies, or differences. Examining your configuration for any of these causes helps you identify the problem and a potential fix.
Many broken infrastructure changes stem from configuration drift between configuration and resource state. In figure 11.4, you start by checking for drift. Make sure the IaC for the service account matches the state of the service account in GCP.
Checking for drift between code and state eliminates one class of failure. Differences between the two can introduce unexpected problems, so removing them ensures that your change behaves as expected.
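Conceptually, a drift check compares the role your IaC declares with the bindings the cloud API reports. The following Python sketch is illustrative — the drifted helper and the inlined policy are not part of the book’s modules; in practice you would parse the output of gcloud projects get-iam-policy --format=json:

```python
# Illustrative drift check: compare the role declared in IaC with the
# roles actually bound to the service account in the live policy.
iac_role = 'roles/cloudsql.admin'  # what the configuration declares
account = ('serviceAccount:promotions-prod@'
           'infrastructure-as-code-book.iam.gserviceaccount.com')

# Stand-in for parsed `gcloud projects get-iam-policy ... --format=json`
live_policy = {
    'bindings': [
        {'members': [account], 'role': 'roles/cloudsql.admin'}
    ]
}

def drifted(policy, member, expected_role):
    """Return True if the member's live roles differ from the IaC role."""
    live_roles = {binding['role'] for binding in policy['bindings']
                  if member in binding['members']}
    return live_roles != {expected_role}

print(drifted(live_policy, account, iac_role))  # no drift -> False
```

If the function reports drift, you investigate whether the difference contributes to the failure before changing anything else.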
In the case of Cool Caps for Keys, you review the permissions for the promotions service account that you define in IaC. The following listing outlines the IaC that defines the service and roles.
import database   ❶
import iam        ❷
import network    ❸
import server     ❹
import json
import os

SERVICE = 'promotions'
ENVIRONMENT = 'prod'
REGION = 'us-central1'
ZONE = 'us-central1-a'
PROJECT = os.environ['CLOUDSDK_CORE_PROJECT']
role = 'roles/cloudsql.admin'   ❺

if __name__ == "__main__":
    resources = {   ❻
        'resource':
            network.Module(SERVICE, ENVIRONMENT, REGION).build() +    ❼
            iam.Module(SERVICE, ENVIRONMENT, REGION, PROJECT,         ❽
                       role).build() +                                ❽
            database.Module(SERVICE, ENVIRONMENT, REGION).build() +   ❾
            server.Module(SERVICE, ENVIRONMENT, ZONE).build()         ❿
    }
    with open('main.tf.json', 'w') as outfile:   ⓫
        json.dump(resources, outfile,            ⓫
                  sort_keys=True, indent=4)      ⓫
❶ Imports the database module to build the Google Cloud SQL database
❷ Imports the Google service account module and creates the configuration with the permissions
❸ Imports the network module to build the Google network and subnetwork
❹ Imports the server module to build the Google compute instance
❺ Promotions service account should have permission for the “cloudsql.admin” role to access the database
❻ Uses the module to create the JSON configuration for the database, network, service account, and server
❼ Uses the network module to build the Google network and subnetwork
❽ Uses the service account module to create the configuration with the permissions
❾ Uses the database module to build the Google Cloud SQL database
❿ Uses the server module to build the Google compute instance
⓫ Writes out the Python dictionary to a JSON file to be executed by Terraform later
Then, compare the code to the promotions application’s service account permissions in GCP. The service account has only roles/cloudsql.admin permissions, consistent with your IaC:
$ gcloud projects get-iam-policy $CLOUDSDK_CORE_PROJECT
bindings:
- members:
  - serviceAccount:promotions-prod@infrastructure-as-code-book.iam.gserviceaccount.com
  role: roles/cloudsql.admin
version: 1
If you find configuration drift between IaC and the active resource state, you can further investigate whether it affects system functionality. You may choose to eliminate some of the drift to ensure that it doesn’t contribute to the root cause. However, just because you detect some drift does not mean that it breaks your system! Some drift may have nothing to do with the failure.
If you determine drift does not contribute to the failure, you can check for resources that depend on your updated one. In figure 11.5, you start graphing which resources depend on the service account. In both the IaC and production environment, the server depends on the service account.
You want to check that the expected dependencies match the actual. Unexpected dependencies disrupt change behavior. When you review the code in the following listing, you verify that the service account’s email gets passed to the server.
class Module():   ❶
    def __init__(self, service, environment, zone,
                 machine_type='e2-micro'):
        self._name = f'{service}-{environment}'
        self._environment = environment
        self._zone = zone
        self._machine_type = machine_type

    def build(self):   ❶
        return [
            {
                'google_compute_instance': {   ❷
                    self._environment: {
                        'allow_stopping_for_update': True,
                        'boot_disk': [{
                            'initialize_params': [{
                                'image': 'ubuntu-1804-lts'
                            }]
                        }],
                        'machine_type': self._machine_type,
                        'name': self._name,
                        'zone': self._zone,
                        'network_interface': [{
                            'subnetwork': ('${google_compute_subnetwork.' +
                                           f'{self._environment}' + '.name}'),
                            'access_config': {
                                'network_tier': 'STANDARD'
                            }
                        }],
                        'service_account': [{                           ❸
                            'email': ('${google_service_account.' +     ❸
                                      f'{self._environment}' +          ❸
                                      '.email}'),                       ❸
                            'scopes': ['cloud-platform']                ❸
                        }]                                              ❸
                    }
                }
            }
        ]
❶ Uses the module to create the JSON configuration for the server
❷ Creates the Google compute instance by using a Terraform resource based on the name, zone, machine type, and network interface
❸ The factory for the promotions application’s server uses a service account to access GCP services.
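One way to map these dependencies without clicking through the console is to scan the generated Terraform JSON for interpolation references to the service account. The depends_on helper below is a hypothetical sketch, assuming the JSON structure the earlier listings produce:

```python
import json
import re

# A pared-down version of the generated main.tf.json: the server
# references the service account; the database does not.
config = {
    'resource': [
        {'google_compute_instance': {'prod': {
            'service_account': [{
                'email': '${google_service_account.prod.email}',
                'scopes': ['cloud-platform']}]}}},
        {'google_sql_database_instance': {'prod': {
            'name': 'promotions-prod'}}}
    ]
}

def depends_on(resources, target):
    """Return resource types whose attributes reference the target."""
    dependents = []
    for resource in resources['resource']:
        for resource_type, body in resource.items():
            # Terraform interpolation references look like ${target.…}
            if re.search(r'\$\{' + target + r'\.', json.dumps(body)):
                dependents.append(resource_type)
    return dependents

print(depends_on(config, 'google_service_account'))
# ['google_compute_instance'] -- the server depends on the service account
```

Any resource the scan surfaces deserves a closer look before you change the service account again.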
However, the promotions team mentions that its application directly accesses the database by using its IP address, username, and password. Why would the server need the service account if the application reads the database connection string from a file?
You realize this highlights a discrepancy. You ask the promotions team to show you the application code. The application configuration does not use the database IP address, username, or password!
After additional debugging with the promotions team, you discover that the promotions application connects to the database on localhost. The configuration uses the Cloud SQL Auth proxy (https://cloud.google.com/sql/docs/mysql/sql-proxy), which handles the connection and logs into the database! Therefore, the service account connected to the server needs database access.
Figure 11.6 shows that the promotions application accesses the database through a proxy. The proxy uses the service account to authenticate and access the database. The service account needs access to the database with a policy.
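The distinction matters for permissions. With a direct connection, the application itself holds database credentials; through the proxy, the application only knows about localhost, and the proxy authenticates to GCP with the server’s service account. A sketch of the two connection configurations, with hypothetical values:

```python
# Direct connection: the application holds the database credentials.
direct = {
    'host': '10.2.0.3',      # hypothetical database IP address
    'user': 'promotions',
    'password': 'from-a-secret-store',
}

# Through the Cloud SQL Auth proxy: the application talks to localhost,
# and the proxy authenticates to GCP with the server's service account.
# The service account, not the application, needs the Cloud SQL role.
proxied = {
    'host': '127.0.0.1',     # proxy listens locally on the server
    'port': 3306,            # tcp port from the proxy's -instances flag
    'user': 'promotions',
    'password': 'from-a-secret-store',
}
```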
Congratulations, you discovered why the promotions application broke when you removed the service account! However, you get a little suspicious. Shouldn’t you have found the same problem in the testing environment? After all, you tested the change in a testing environment, and the application did not break.
Why did the change work in testing but not in production? You examine the promotions application in testing. The application does not connect to the database on localhost. Instead, it uses the database IP address, username, and password.
You explain to the application team that the production IaC uses the Cloud SQL Auth proxy, while the testing IaC directly calls the database; see figure 11.7. Both configurations use the roles/cloudsql.admin permission.
After further discussion with the promotions team, you discover that the team implemented an emergency change to secure production with the Cloud SQL Auth proxy. However, the team did not have a chance to update the testing environment to match! The mismatch allowed your updates to succeed in the testing environment but fail in the production environment.
You want to keep the testing and production environments as similar as possible. However, you cannot always reproduce production in testing environments. As a result, you will encounter failed changes due to discrepancies in both. Systematically identifying differences between testing and production environments helps highlight gaps in testing and change delivery.
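You can make that identification systematic by diffing the attributes you render for each environment. A minimal sketch, with illustrative attribute names:

```python
# Diff two rendered environment configurations to surface mismatches
# between testing and production (attribute names are illustrative).
testing = {'database_access': 'direct',
           'role': 'roles/cloudsql.admin'}
production = {'database_access': 'cloud-sql-auth-proxy',
              'role': 'roles/cloudsql.admin'}

def differences(a, b):
    """Return {key: (a_value, b_value)} for attributes that differ."""
    return {key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)}

print(differences(testing, production))
# {'database_access': ('direct', 'cloud-sql-auth-proxy')}
```

Running a comparison like this before a change would have flagged the proxy mismatch that let your update pass testing and fail production.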
While IaC should document all changes and configuration to your system, you might still discover a few surprises between IaC and environments. Figure 11.8 summarizes your structured approach to debugging failed changes in the promotions application’s IaC. You check for drift, dependencies, and finally, differences between environments.
After determining a root cause, you can finally implement a long-term fix. You must now reconcile the difference between the testing and production environment and revisit least-privilege access to secure the promotions application’s service account.
Your original task for Cool Caps for Keys involved updating the service account permissions for each application to ensure least-privilege access to services. You tried to remove database administrative access from the promotions application’s service account but failed. After troubleshooting the issue, you can now fix the problem.
You might find yourself a bit impatient at this point! After all, you have not finished updating the access for other applications in Cool Caps for Keys. However, don’t just go in and change everything at once. Pushing a batch of changes can make it difficult to debug the source of failure (previously referenced in chapter 7). Your testing environment still doesn’t match production, and you can still affect the promotions application if you make too many changes at once.
Throughout the book, I mention the process of making minor changes to minimize the blast radius of potential failure. Similarly, incremental fixes break down the changes you need to make to a system to prevent future failure.
Definition Incremental fixes break changes into smaller parts to gradually improve a system and prevent future failure.
Making minor configuration changes and gradually deploying them helps you recognize the first sign of trouble and stage your IaC for future success.
As I mentioned in chapter 2, you need to reconcile any manual changes to the infrastructure state with IaC. If you find some drift, you need to address it first! Your system should not have too many break-glass changes if you prioritize the use of IaC.
Recall that the promotions application for Cool Caps for Keys implemented a break-glass change that resulted in a difference between testing and production environments. The production application uses the Cloud SQL Auth proxy to connect to the database, while the testing application directly connects to the database through the IP address and password. You need to build a Cloud SQL Auth proxy in the testing environment.
To start fixing drift, you need to reconstruct the current state of infrastructure in configuration. Figure 11.9 reconstructs the installation commands for the Cloud SQL Auth proxy based on the production server. Then, you add the commands to IaC and apply them to the testing environment.
In this example, the team did not add IaC for the manual change. As a result, you spend additional time rebuilding the installation of the Cloud SQL Auth proxy. An out-of-band change such as the proxy caused a failed change, which takes even more time and effort to fix.
To help minimize some of these problems, use the process of migrating to IaC described in chapter 2. Capturing the manual change as IaC helps minimize differences between environments and drift between IaC and actual state. If you need to reconstruct the state of infrastructure, remember that chapter 2 includes a high-level example of migrating existing infrastructure into IaC. However, you typically have to find or write a tool to transform the state to IaC.
Let’s write the IaC to install the proxy. You checked the command history on the promotions application’s server in production and reconstructed the installation of the Cloud SQL Auth proxy. The following listing automates the commands and installation process in a startup script for the promotions application’s server.
class Module():
    def _startup_script(self):   ❶
        proxy_download = ('https://dl.google.com/cloudsql/' +   ❷
                          'cloud_sql_proxy.linux.amd64')        ❷
        exec_start = ('/usr/local/bin/cloud_sql_proxy ' +                  ❸
                      '-instances=${google_sql_database_instance.' +       ❸
                      f'{self._environment}.connection_name}}=tcp:3306')   ❸
        return f"""   ❹
#!/bin/bash
wget {proxy_download} -O /usr/local/bin/cloud_sql_proxy
chmod +x /usr/local/bin/cloud_sql_proxy
cat << EOF > /usr/lib/systemd/system/cloudsqlproxy.service   ❺
[Install]
WantedBy=multi-user.target

[Unit]
Description=Google Cloud Compute Engine SQL Proxy
Requires=networking.service
After=networking.service

[Service]
Type=simple
WorkingDirectory=/usr/local/bin
ExecStart={exec_start}
Restart=always
StandardOutput=journal
User=root
EOF
systemctl daemon-reload     ❺
systemctl start cloudsqlproxy   ❺
"""

    def build(self):   ❻
        return [
            {
                'google_compute_instance': {   ❼
                    self._environment: {       ❼
                        'metadata_startup_script': self._startup_script()   ❼
                    }   ❼
                }   ❼
            }
        ]
❶ Creates a startup script that reconstructs the manual installation commands for the Cloud SQL Auth proxy
❷ Sets a variable as the proxy download URL
❸ Sets a variable that runs the Cloud SQL Auth proxy binary on port 3306
❹ Returns a shell script that installs the proxy and starts it up with the server
❺ Configures the systemd daemon to start and stop the Cloud SQL Auth proxy
❻ Uses the module to create the JSON configuration for the Google compute instance and includes a startup script to install the proxy
❼ Adds the startup script to the server. I omit other attributes for clarity.
You do not update the service account with new permissions! In the spirit of incremental fixes, you want to avoid adding more changes to track as you push to production. You add the startup script to the promotions application’s server and change the testing environment without more updates.
While you updated your IaC to account for drift, you also need to make sure testing and production environments use your new IaC. For Cool Caps for Keys, you make sure the database connection works in the testing environment. Then, you ask the promotions team to update its application configuration to connect to the database through the proxy on localhost.
The promotions team pushes its application configuration to use the Cloud SQL Auth proxy into the testing environment, runs tests, and updates production, as shown in figure 11.10. You keep the roles/cloudsql.admin permission on the service account because the proxy needs it.
The push re-creates the production server with the new startup script. After additional end-to-end testing for the promotions application, you confirm that you successfully updated both testing and production environments.
Why start with reconciling drift before differences between production and testing environments? In this example, you opt to reconcile drift first because you will spend more time manually installing the packages in the testing environment. If you update your IaC and automate the package installation, you can ensure that the change works in the testing environment before pushing it to the production environment.
You might choose to reconcile testing and production environments first because you have a large amount of drift. In that case, match testing and production environments before fixing drift. You want an accurate testing environment before you implement your reconciliation changes.
Reconcile drift and differences in environments to help the next person make updates to the system. They do not have to worry about knowing the difference in configuration or manually configuring the proxy. The extra time you spend updating your IaC helps you avoid additional time debugging!
Now that you’ve minimized the blast radius of potential failure by reconciling drift and updating your environments, you can finally push forward the original change. Your debugging and incremental fixes change your infrastructure. When you return to implementing the original change, you may need to adjust your code.
Let’s finish the original change for the Cool Caps for Keys promotions application. Recall that the security team asked you to remove administrative permissions from the service account. This process ensures least-privilege access and accounts for using the Cloud SQL Auth proxy.
You know that the service account must have database access because the application uses the Cloud SQL Auth proxy. Now, you try to figure out the minimal access the application needs. The roles/cloudsql.client role offers enough access for a service account to get a list of instances and connect to them.
In figure 11.11, you change the service account’s permissions from administrative access to roles/cloudsql.client. You push this change to the testing environment, verify that promotions still work, and deploy the roles/cloudsql.client permission to production.
You reconciled the difference between testing and production environments for the proxy. In theory, the testing environment should now catch any issues with your change; any failed change should appear there first. Let’s change the permission for the service account in the following listing from roles/cloudsql.admin to roles/cloudsql.client.
import database
import iam
import network
import server
import json
import os

SERVICE = 'promotions'
ENVIRONMENT = 'prod'
REGION = 'us-central1'
ZONE = 'us-central1-a'
PROJECT = os.environ['CLOUDSDK_CORE_PROJECT']
role = 'roles/cloudsql.client'   ❶

if __name__ == "__main__":
    resources = {   ❷
        'resource':   ❷
            network.Module(SERVICE, ENVIRONMENT, REGION).build() +    ❷
            iam.Module(SERVICE, ENVIRONMENT, REGION, PROJECT,         ❸
                       role).build() +                                ❸
            database.Module(SERVICE, ENVIRONMENT, REGION).build() +   ❷
            server.Module(SERVICE, ENVIRONMENT, ZONE).build()         ❷
    }   ❷
    with open('main.tf.json', 'w') as outfile:
        json.dump(resources, outfile,
                  sort_keys=True, indent=4)
❶ Changes the promotions service account role to client access, which allows connecting to the database instance
❷ Imports the network, database, and server modules without changing the resources
❸ Imports the service account and attaches the “roles/cloudsql.client” role to its permissions
You commit and apply the change. The testing environment applies the change and validates that the application still works! You confirm with the promotions team, which approves the production change.
The team promotes the new permissions change to production, runs end-to-end tests, and confirms that the promotions application can access the database! After a few weeks of debugging and making changes, you can finally fix the other applications in Cool Caps for Keys.
Why dedicate an entire chapter to a single example of fixing failed changes? Because it represents the reality of fixing IaC: you want to resolve the failure as quickly as possible without making it worse.
Rolling forward helps restore the system’s working state and minimize disruption to infrastructure resources. Then, you can work on troubleshooting the root cause. Many infrastructure failures come from drift, dependencies, or differences between testing and production environments. After you address these discrepancies, you implement your original change.
Learning the art of rolling forward IaC takes time and experience. While you could just log into the cloud provider console and make manual changes to get the system working, remember that the bandage will fall off very quickly and won’t promote long-term system healing. Using IaC to track and fix the system incrementally minimizes the impact of the repairs and provides context for anyone else updating the system.
Repairing failures for IaC involves rolling forward fixes instead of rolling them back.
Rolling forward IaC uses immutability to return the system to a working state.
Before debugging and implementing long-term fixes, prioritize stabilizing and restoring the system to a working state.
When troubleshooting IaC, check for drift, unexpected dependencies, and differences between environments as part of the root cause.
Work on incremental fixes to quickly recognize and reduce the blast radius of potential failure.
Before you re-implement the original change that failed, make sure you reconcile drift and differences between environments for accurate testing and future system updates.
Reconstructing state in IaC to reconcile drift involves aggregating manual commands for server configuration or transforming infrastructure metadata into IaC.