11 Fixing failures

This chapter covers

  • Determining how to roll forward failed changes to restore functionality
  • Organizing an approach for IaC troubleshooting
  • Categorizing repairs for failed changes

We took many chapters to discuss writing and collaborating on infrastructure as code. All of the practices and principles you learned for IaC culminate in the crucial moment when you push a change, it causes your system to fail, and you need to roll it back! However, IaC doesn’t support rollback. You do not fully revert IaC changes. What does it mean to fix failures if you don’t roll them back?

This chapter focuses on fixing failed changes from IaC. First, we’ll discuss what it means to “revert” IaC changes by rolling forward. Then, you’ll learn workflows for troubleshooting and fixing the failed change. While the techniques in this chapter might not apply to every scenario you’ll encounter in your system, they establish a broad set of practices you can use to start repairing IaC failures.

Troubleshooting and site reliability engineering

I do not dive too deeply into the process and principles of troubleshooting failed systems in this book. Most of the troubleshooting discussion centers on how to manage it in the context of IaC. For more information on troubleshooting and building reliable systems, I recommend Site Reliability Engineering by Betsy Beyer et al. (O’Reilly, 2016).

As a general rule, prioritize stabilization and restoration of functionality for services and customers before you debug further for a root cause (the issue that led to the problem). A temporary bandage provides you the opportunity to troubleshoot and implement a longer-term fix for the system.

11.1 Restoring functionality

Imagine you work for a company called Cool Caps for Keys. It creates custom keyboard caps and connects customers with artists to design the caps. As the security engineer, you need to narrow down the access control for applications and users across GCP projects.

You copy the Google Cloud SQL database configuration and update the access control to implement least-privilege access for team members and applications. You choose the policies required for different applications to use infrastructure and verify that the applications still work.

Next, you talk to the promotions team. Its application accesses the database directly by using a database username and password. Direct access to the database means that you can remove the policy for roles/cloudsql.admin from the promotions application’s service account. You remove the policy, test the changes, confirm with the promotions team that the change did not affect its application in its testing environment, and push it to production; see figure 11.1.

Figure 11.1 After removing the administrative database access for the promotions service, you discover that the change broke the service’s ability to access the database.

An hour later, the promotions team tells you that its application keeps throwing an error that it can’t access the database! You suspect that your change might have introduced a problem. While you could start digging around for the problem, you prioritize fixing the promotions service so it can access the database before investigating further.

11.1.1 Rolling forward to revert changes

You need to fix the service so users can make requests to the system. However, you cannot simply return the system to a previously working state. IaC prioritizes immutability, which means any changes to the system, including reverted changes, must create new resources!

For example, let’s fix the promotions service for Cool Caps for Keys by reverting the change and adding the role to the service account. In figure 11.2, you revert the commit and add roles/cloudsql.admin back to the service account. Then, you push the changes to your testing and production environments.

Figure 11.2 You add the administrative database role for the promotions service to roll forward the system to a working state.

You revert the commit and push forward the changes to the testing and production environments. You roll forward IaC because it uses immutability to return the system to a working state.

Definition The practice of rolling forward IaC reverts changes to a system and uses immutability to return the system to a working state.

Rollback implies that you return infrastructure to a previous state. In reality, the immutability of IaC means that you create a new state anytime you make changes, so you cannot fully restore infrastructure to its previous state. When your change has a large blast radius, it may affect so many resources that you cannot return to the previous state at all.

Let’s revert your changes to the service account and roll forward changes to add back the permission. First, check your commit history because version control keeps track of all the changes you make. The commit prefixed with a31 includes your removal of roles/cloudsql.admin:

$ git log --oneline -2
a3119fc (HEAD -> main) Remove database admin access from promotions
6f84f5d Add database admin to promotions service

Applying the GitOps practices from chapter 7, you want to avoid making manual, break-glass changes. Instead, you favor operational changes through IaC! You revert the commit to push updates to restore the promotions service to a working state:

$ git revert a3119fc

You push the commit, and the pipeline adds the role back to the service account. After you roll forward, the application works again. You’ve successfully returned the infrastructure state to a working state. However, you never achieve a full restore of state. Instead, you rolled out a new state that matched the previous working state.

Rolling back IaC often means rolling forward changes to the infrastructure state. You use git revert as a forward-moving undo: it adds a new commit that undoes the old one, which preserves immutability and rolls the undo forward to infrastructure.
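To see why a revert moves history forward, consider a minimal sketch in Python. It is a toy model, not how Git stores commits: the Commit class and revert function are hypothetical names, and each commit records the full set of roles on the service account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Toy commit: a message plus the roles granted after it applies."""
    message: str
    roles: frozenset


def revert(history, index):
    """Return a new history with one more commit that undoes history[index].

    Simplification: the new commit restores the whole snapshot from
    before the target, so this only models reverting the latest change.
    No existing commit is removed or edited.
    """
    if index < 1:
        raise ValueError('cannot revert the first commit in this toy model')
    target = history[index]
    restored = history[index - 1].roles
    return history + [Commit(f'Revert "{target.message}"', restored)]


history = [
    Commit('Add database admin to promotions service',
           frozenset({'roles/cloudsql.admin'})),
    Commit('Remove database admin access from promotions',
           frozenset()),
]

rolled_forward = revert(history, 1)
```

The history grows from two commits to three. The newest state matches the earlier working one, and the failed change stays in the record.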

Configuration management

Configuration management does not prioritize immutability but still rolls forward reverted changes to a server or resource. For example, imagine you install a package with version 3.0.0 and need to revert to 2.0.0. Your configuration management tool may choose to uninstall the new version and reinstall the old version. You do not restore the package and its configuration to its previous state. You just restore the server to a new working state with an older package.

11.1.2 Rolling forward for new changes

The benefit of a roll-forward mentality is that you expand your troubleshooting approach. In the example, you reverted a broken commit and restored functionality to the promotions service by matching the new state to a prior working one. However, sometimes reverting a commit doesn’t fix your system and makes everything worse! Instead, you can roll forward new changes and restore functionality.

Let’s imagine the promotions service still doesn’t work after you roll forward changes. Rather than try to fix the application, you create a new environment with the change and a new promotions service. You use the canary deployment technique from chapter 9 to gradually increase the traffic and fully restore the application, as shown in figure 11.3. After all requests go to the new service instance, you disable the failed environment and keep it for debugging.

Figure 11.3 When you cannot recover the promotions application, you can use a canary deployment to cut over traffic to a new instance and restore the system.
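The gradual cutover can be sketched as a loop that raises the share of traffic on the new instance only while a health check passes. This is a simplified illustration, not a real load balancer API: canary_cutover, the step percentages, and the is_healthy callback are all assumptions made for the example.

```python
def canary_cutover(is_healthy, steps=(10, 25, 50, 100)):
    """Increase traffic to the new instance one step at a time.

    is_healthy: callable that takes the percentage of traffic on the
    new instance and returns True if the service still works.
    Returns the last percentage that passed the health check (0 if
    none passed).
    """
    current = 0
    for percent in steps:
        if not is_healthy(percent):
            # Stop increasing traffic and keep the last healthy percentage.
            break
        current = percent
    return current


# Example: the new instance starts failing once it receives half of
# the traffic, so the cutover stops at 25 percent.
final_percent = canary_cutover(lambda percent: percent < 50)
```

In a real system, the health check would query monitoring for errors before each step, and a result below 100 would tell you to stop and investigate the new instance.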

IaC allows you to reproduce environments with less effort. Furthermore, conforming to immutability means that you already have a pattern of creating new environments for changes. The combination of the two principles helps mitigate higher-risk changes with a larger blast radius.

Use cases that involve data or completely irrecoverable resources cannot roll forward to a prior state. You could corrupt application data or affect other infrastructure while detangling a cascading failure. Rather than roll forward to revert, you can roll forward and implement new changes by applying the change techniques from chapter 9.

You may also restore functionality with a combination of reverts and completely new changes. Expanding your roll-forward mentality to include new changes outside of undoing old ones offers a helpful alternative to restore functionality quickly and minimize disruption to other parts of your system.

11.2 Troubleshooting

You put a bandage on your system so the promotions team can still send out promotional offers for Cool Caps for Keys. However, you still need to secure the IAM of the application! Where do you start to find out why the promotions service failed when you removed administrative permission that the promotions team shouldn’t need?

Troubleshooting your IaC also follows specific patterns. Even in the most complex infrastructure systems, many failed changes from IaC usually come from three causes: drift, dependencies, or differences. Examining your configuration for any of these causes helps you identify the problem and a potential fix.

11.2.1 Check for drift

Many broken infrastructure changes stem from configuration drift between configuration and resource state. In figure 11.4, you start by checking for drift. Make sure the IaC for the service account matches the state of the service account in GCP.

Figure 11.4 Start by checking for drift between IaC and state.

Checking for drift between code and state ensures that you eliminate any failures due to differences between the two. Differences between code and state can introduce unexpected problems. Removing those differences ensures that your change behavior works as expected.

In the case of Cool Caps for Keys, you review the permissions for the promotions service account that you define in IaC. The following listing outlines the IaC that defines the service and roles.

Listing 11.1 Promotions service account with database admin permission

import database
import iam                                                          
import network                                                      
import server                                                       
import json
import os
 
SERVICE = 'promotions'
ENVIRONMENT = 'prod'
REGION = 'us-central1'
ZONE = 'us-central1-a'
PROJECT = os.environ['CLOUDSDK_CORE_PROJECT']
role = 'roles/cloudsql.admin'                                       
 
if __name__ == "__main__":
   resources = {                                                    
       'resource':
       network.Module(SERVICE, ENVIRONMENT, REGION).build() +       
       iam.Module(SERVICE, ENVIRONMENT, REGION, PROJECT,            
                  role).build() +                                   
       database.Module(SERVICE, ENVIRONMENT, REGION).build() +      
       server.Module(SERVICE, ENVIRONMENT, ZONE).build()            
   }
 
   with open('main.tf.json', 'w') as outfile:                       
       json.dump(resources, outfile,                                
                 sort_keys=True, indent=4)                          

Imports the database module to build the Google Cloud SQL database

Imports the Google service account module and creates the configuration with the permissions

Imports the network module to build the Google network and subnetwork

Imports the server module to build the Google compute instance

Promotions service account should have permission for the “cloudsql.admin” role to access the database

Uses the module to create the JSON configuration for the database, network, service account, and server

Writes out the Python dictionary to a JSON file to be executed by Terraform later

AWS and Azure equivalents

The AWS equivalent of the GCP Cloud SQL administrator permission is similar to AmazonRDSFullAccess. Azure does not have an exact equivalent. Instead, you will need to add an Azure Active Directory account to the database directly and grant administrative consent for Azure SQL Database API permissions.

Then, compare the code to the promotions application’s service account permissions in GCP. The service account has only roles/cloudsql.admin permissions consistent with your IaC:

$ gcloud projects get-iam-policy $CLOUDSDK_CORE_PROJECT
bindings:
- members:
  - serviceAccount:promotions-prod@infrastructure-as-code-book.iam.gserviceaccount.com
  role: roles/cloudsql.admin
version: 1

If you find configuration drift between IaC and the active resource state, you can further investigate whether it affects system functionality. You may choose to eliminate some of the drift to ensure that it doesn’t contribute to the root cause. However, just because you detect some drift does not mean that it breaks your system! Some drift may have nothing to do with the failure.
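You can automate the comparison between the roles in IaC and the roles in the cloud provider. The following sketch assumes you exported the policy as JSON (for example, by adding --format=json to the gcloud command) and loaded it into a dictionary with the same shape as the output shown earlier. The function names are hypothetical.

```python
def roles_for_member(policy, member):
    """Collect the roles bound to a member in an IAM policy dictionary."""
    return {
        binding['role']
        for binding in policy.get('bindings', [])
        if member in binding.get('members', [])
    }


def detect_drift(expected_roles, policy, member):
    """Compare the roles defined in IaC with the roles in the policy."""
    actual = roles_for_member(policy, member)
    expected = set(expected_roles)
    return {
        'missing': sorted(expected - actual),      # in IaC, not in GCP
        'unexpected': sorted(actual - expected),   # in GCP, not in IaC
    }


MEMBER = ('serviceAccount:promotions-prod@'
          'infrastructure-as-code-book.iam.gserviceaccount.com')

# Same shape as the gcloud output for the promotions service account
policy = {
    'bindings': [
        {'members': [MEMBER], 'role': 'roles/cloudsql.admin'},
    ],
    'version': 1,
}

drift = detect_drift(['roles/cloudsql.admin'], policy, MEMBER)
```

Two empty lists mean that code and state agree for this service account. Any entry in either list gives you a specific difference to investigate.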

11.2.2 Check for dependencies

If you determine drift does not contribute to the failure, you can check for resources that depend on your updated one. In figure 11.5, you start graphing which resources depend on the service account. In both the IaC and production environment, the server depends on the service account.

Figure 11.5 Troubleshoot any resources that depend on the one you want to update.

You want to check that the expected dependencies match the actual ones. Unexpected dependencies disrupt change behavior. When you review the code in the following listing, you verify that the service account’s email gets passed to the server.

Listing 11.2 Promotions server depends on the promotions service account

class Module():                                                       
   def __init__(self, service, environment,
                zone, machine_type='e2-micro'):
       self._name = f'{service}-{environment}'
       self._environment = environment
       self._zone = zone
       self._machine_type = machine_type
 
   def build(self):                                                   
       return [
           {
               'google_compute_instance': {                           
                   self._environment: {
                       'allow_stopping_for_update': True,
                       'boot_disk': [{
                           'initialize_params': [{
                               'image': 'ubuntu-1804-lts'
                           }]
                       }],
                       'machine_type': self._machine_type,
                       'name': self._name,
                       'zone': self._zone,
                       'network_interface': [{
                           'subnetwork':
                           '${google_compute_subnetwork.' +
                           f'{self._environment}' + '.name}',
                           'access_config': {
                               'network_tier': 'STANDARD'
                           }
                       }],
                       'service_account': [{                          
                           'email': '${google_service_account.' +     
                           f'{self._environment}' + '.email}',        
                           'scopes': ['cloud-platform']               
                       }]                                             
                   }
               }
           }
       ]

Uses the module to create the JSON configuration for the server

Creates the Google compute instance by using a Terraform resource based on the name, address, region, and network

The factory for the promotions application’s server uses a service account to access GCP services.
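If you have many resources, you can search the generated configuration for references instead of reading each module. The sketch below is one possible approach: it scans a list of resource blocks, shaped like the output of build(), for Terraform interpolations such as ${google_service_account.prod.email}. The find_dependents function and the sample resources are illustrative assumptions.

```python
import json
import re

# Matches the start of an interpolation such as ${type.name.attribute}
REFERENCE = re.compile(r'\$\{([a-z0-9_]+)\.([A-Za-z0-9_-]+)\.')


def find_dependents(resources, target_type, target_name):
    """Return (type, name) pairs whose configuration references the target."""
    dependents = []
    for block in resources:
        for resource_type, instances in block.items():
            for name, config in instances.items():
                # Serialize the configuration so nested values get searched
                references = REFERENCE.findall(json.dumps(config))
                if (target_type, target_name) in references:
                    dependents.append((resource_type, name))
    return dependents


# Trimmed-down sample resources for illustration only
resources = [
    {'google_service_account': {'prod': {'account_id': 'promotions-prod'}}},
    {'google_compute_instance': {'prod': {
        'name': 'promotions-prod',
        'service_account': [{
            'email': '${google_service_account.prod.email}',
            'scopes': ['cloud-platform'],
        }],
    }}},
]

dependents = find_dependents(resources, 'google_service_account', 'prod')
```

The result lists the compute instance as the only resource that depends on the service account, which matches the dependency you expect from the listing.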

AWS and Azure equivalents

Create a network in AWS or Azure. Then, update the code listing to use an Azure Linux virtual machine Terraform resource (http://mng.bz/J22p) with a managed identity block. The identity block should include a list of user IDs with permissions to access Azure. For AWS, you would define an IAM instance profile for the AWS EC2 Terraform resource.

However, the promotions team mentions that its application directly accesses the database by using its IP address, username, and password. Why would the server need the service account if the application reads the database connection string from a file?

You realize this highlights a discrepancy. You ask the promotions team to show you the application code. The application configuration does not use the database IP address, username, or password!

After additional debugging with the promotions team, you discover that the promotions application connects to the database on localhost. The configuration uses the Cloud SQL Auth proxy (https://cloud.google.com/sql/docs/mysql/sql-proxy), which handles the connection and logs into the database! Therefore, the service account connected to the server needs database access.

Figure 11.6 shows that the promotions application accesses the database through a proxy. The proxy uses the service account to authenticate and access the database. The service account needs access to the database with a policy.

Figure 11.6 The promotions application accesses the database through a proxy, which needs a service account with database permissions.

AWS and Azure equivalents

The AWS equivalent of the GCP Cloud SQL Auth proxy is Amazon RDS Proxy. The proxy helps enforce database connections and avoids the need for database usernames and passwords in application code.

Azure does not have an equivalent SQL proxy option. Instead, you must set up an Azure Private Link to the database. This allocates an IP address on a private network of your choice. You can configure your database to allow your application to log in with an Azure Active Directory service principal.

Congratulations, you discovered why the promotions application broke when you removed the role from the service account! However, you get a little suspicious. Shouldn’t you have found the same problem in the testing environment? After all, you tested the change in a testing environment, and the application did not break.

11.2.3 Check for differences in environments

Why did the change work in testing but not in production? You examine the promotions application in testing. The application does not connect to the database on localhost. Instead, it uses the database IP address, username, and password.

You explain to the application team that the production IaC uses the Cloud SQL Auth proxy, while the testing IaC directly calls the database; see figure 11.7. Both configurations use the roles/cloudsql.admin permission.

Figure 11.7 Check for differences between testing and production to reconcile any tested changes that fail.

After further discussion with the promotions team, you discover that the team implemented an emergency change to secure production with the Cloud SQL Auth proxy. However, the team did not have a chance to update the testing environment to match! The mismatch allowed your updates to succeed in the testing environment but fail in the production environment.

You want to keep the testing and production environments as similar as possible. However, you cannot always reproduce production in testing environments. As a result, you will encounter failed changes due to discrepancies between the two. Systematically identifying differences between testing and production environments helps highlight gaps in testing and change delivery.
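One way to identify differences systematically is to flatten each environment’s configuration and compare the values key by key. The following is a minimal sketch: the function names, the configuration keys, and the sample values (including the IP address) are hypothetical stand-ins for whatever your environments define.

```python
def flatten(config, prefix=''):
    """Flatten nested dictionaries into dotted keys for easy comparison."""
    flat = {}
    for key, value in config.items():
        path = f'{prefix}{key}'
        if isinstance(value, dict):
            flat.update(flatten(value, f'{path}.'))
        else:
            flat[path] = value
    return flat


def diff_environments(testing, production):
    """Return {key: (testing value, production value)} for each mismatch."""
    flat_testing = flatten(testing)
    flat_production = flatten(production)
    keys = sorted(set(flat_testing) | set(flat_production))
    return {
        key: (flat_testing.get(key), flat_production.get(key))
        for key in keys
        if flat_testing.get(key) != flat_production.get(key)
    }


# Hypothetical summaries of how each environment reaches the database
testing = {
    'role': 'roles/cloudsql.admin',
    'database': {'host': '10.0.0.5', 'auth': 'password'},
}
production = {
    'role': 'roles/cloudsql.admin',
    'database': {'host': 'localhost', 'auth': 'cloud-sql-auth-proxy'},
}

differences = diff_environments(testing, production)
```

Both environments use the same role, so the comparison reports only the database host and authentication method. Those two differences explain why the change passed in testing and failed in production.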

While IaC should document all changes and configuration to your system, you might still discover a few surprises between IaC and environments. Figure 11.8 summarizes your structured approach to debugging failed changes in the promotions application’s IaC. You check for drift, dependencies, and finally, differences between environments.

Figure 11.8 You use IaC to troubleshoot your broken change by checking for drift, unexpected dependencies, and differences between testing and production.

After determining a root cause, you can finally implement a long-term fix. You must now reconcile the difference between the testing and production environment and revisit least-privilege access to secure the promotions application’s service account.

Exercise 11.1

A team reports that its application can no longer connect to another application. The application worked last week, but requests have failed since Monday. The team has made no changes to its application and suspects the problem may be a firewall rule. Which steps can you take to troubleshoot the problem? (Choose all that apply.)

A) Log into the cloud provider and check the firewall rules for the application.

B) Deploy new infrastructure and applications to a green environment for testing.

C) Examine the changes in IaC for the application.

D) Compare the firewall rules in the cloud provider with IaC.

E) Edit the firewall rules and allow all traffic between the applications.

See appendix B for answers to exercises.

11.3 Fixing

Your original task for Cool Caps for Keys involved updating the service account permissions for each application to ensure least-privilege access to services. You tried to remove database administrative access from the promotions application’s service account but failed. After troubleshooting the issue, you can now fix the problem.

You might find yourself a bit impatient at this point! After all, you have not finished updating the access for other applications in Cool Caps for Keys. However, don’t just go in and change everything at once. Pushing a batch of changes can make it difficult to debug the source of failure (previously referenced in chapter 7). Your testing environment still doesn’t match production, and you can still affect the promotions application if you make too many changes at once.

Throughout the book, I mention the process of making minor changes to minimize the blast radius of potential failure. Similarly, incremental fixes break down the changes you need to make to a system to prevent future failure.

Definition Incremental fixes break changes into smaller parts to gradually improve a system and prevent future failure.

Making minor configuration changes and gradually deploying them helps you recognize the first sign of trouble and stage your IaC for future success.
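As a rough illustration of incremental fixes, the sketch below applies changes one at a time and verifies the system after each one. It does not call any real deployment tool: apply_incrementally, the change names, and the state dictionary are assumptions made for the example.

```python
def apply_incrementally(changes, verify):
    """Apply changes one at a time and stop at the first failed check.

    changes: list of (name, apply) pairs, where apply is a callable.
    verify: callable that returns True when the system still works.
    Returns (names applied successfully, name of the failed change or None).
    """
    applied = []
    for name, apply in changes:
        apply()
        if not verify():
            # The most recent change is the likely source of failure.
            return applied, name
        applied.append(name)
    return applied, None


# Hypothetical stand-in for the state of the testing environment
state = {'proxy_in_testing': False, 'role': 'roles/cloudsql.admin'}

changes = [
    ('reconcile drift',
     lambda: state.update(proxy_in_testing=True)),
    ('change role',
     lambda: state.update(role='roles/cloudsql.client')),
]

applied, failed = apply_incrementally(
    changes, verify=lambda: state['proxy_in_testing'])
```

Because each change gets its own verification, a failure points to a single change rather than a batch.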

11.3.1 Reconcile drift

As I mentioned in chapter 2, you need to reconcile any manual changes to the infrastructure state with IaC. If you find some drift, you need to address it first! Your system should not have too many break-glass changes if you prioritize the use of IaC.

Recall that the promotions application for Cool Caps for Keys implemented a break-glass change that resulted in a difference between testing and production environments. The production application uses the Cloud SQL Auth proxy to connect to the database, while the testing application directly connects to the database through the IP address and password. You need to build a Cloud SQL Auth proxy in the testing environment.

To start fixing drift, you need to reconstruct the current state of infrastructure in configuration. Figure 11.9 reconstructs the installation commands for the Cloud SQL Auth proxy based on the production server. Then, you add the commands to IaC and apply them to the testing environment.

Figure 11.9 In testing, you need to install the Cloud SQL Auth proxy package onto the promotions application’s server to reconcile drift from the break-glass change.

In this example, the team did not add IaC for the manual change. As a result, you spend additional time rebuilding the installation of the Cloud SQL Auth proxy. An out-of-band change such as the proxy caused a failed change, which takes even more time and effort to fix.

To help minimize some of these problems, use the process of migrating to IaC described in chapter 2. Capturing the manual change as IaC helps minimize differences between environments and drift between IaC and actual state. If you need to reconstruct the state of infrastructure, remember that chapter 2 includes a high-level example of migrating existing infrastructure into IaC. However, you typically have to find or write a tool to transform the state to IaC.

Let’s write the IaC to install the proxy. You checked the command history on the promotions application’s server in production and reconstructed the installation of the Cloud SQL Auth proxy. The following listing automates the commands and installation process in a startup script for the promotions application’s server.

Listing 11.3 Installing the Cloud SQL Auth proxy in the server startup script

import textwrap


class Module():
   def _startup_script(self):
       # Parentheses let each expression continue across lines.
       proxy_download = ('https://dl.google.com/cloudsql/' +
                         'cloud_sql_proxy.linux.amd64')
       exec_start = ('/usr/local/bin/cloud_sql_proxy ' +
                     '-instances=${google_sql_database_instance.' +
                     f'{self._environment}.connection_name}}=tcp:3306')

       # dedent() and lstrip() keep '#!/bin/bash' on the first line and
       # the heredoc terminator (EOF) in the first column.
       return textwrap.dedent(f"""
       #!/bin/bash
       wget {proxy_download} -O /usr/local/bin/cloud_sql_proxy
       chmod +x /usr/local/bin/cloud_sql_proxy

       cat << EOF > /usr/lib/systemd/system/cloudsqlproxy.service
       [Install]
       WantedBy=multi-user.target

       [Unit]
       Description=Google Cloud Compute Engine SQL Proxy
       Requires=networking.service
       After=networking.service

       [Service]
       Type=simple
       WorkingDirectory=/usr/local/bin
       ExecStart={exec_start}
       Restart=always
       StandardOutput=journal
       User=root
       EOF

       systemctl daemon-reload
       systemctl start cloudsqlproxy
       """).lstrip()
 
   def build(self):                                                      
       return [
           {
               'google_compute_instance': {                              
                   self._environment: {                                  
                       'metadata_startup_script': self._startup_script() 
                   }                                                     
               }                                                         
           }
       ]

Creates a startup script that reconstructs the manual installation commands for the Cloud SQL Auth proxy

Sets a variable as the proxy download URL

Sets a variable that runs the Cloud SQL Auth proxy binary on port 3306

Returns a shell script that installs the proxy and starts it up with the server

Configures the systemd daemon to start and stop the Cloud SQL Auth proxy

Uses the module to create the JSON configuration for the Google compute instance and includes a startup script to install the proxy

Adds the startup script to the server. I omit other attributes for clarity.

AWS and Azure equivalents

For AWS and Azure, you do not have to install software for the proxy on the instance. If you would like to reproduce listing 11.3 in AWS and Azure for practice, you can pass the startup script as user_data to the AWS instance or as custom_data to the Azure Linux virtual machine.

You do not update the service account with new permissions! In the spirit of incremental fixes, you want to avoid adding more changes to track as you push to production. You add the startup script to the promotions application’s server and change the testing environment without more updates.

Startup script, configuration manager, or image builder?

I use the startup script field in this example to avoid introducing more syntax. In practice, you should implement the configuration of any new packages or processes with a configuration manager or image builder.

For example, the configuration manager would push the Cloud SQL Auth proxy installation process to any server with the promotions application. Similarly, the image builder configures the proxy for each image you bake! Whenever you reference the image for the promotions application, you always have the proxy built into the server.

11.3.2 Reconcile differences in environments

While you updated your IaC to account for drift, you also need to make sure testing and production environments use your new IaC. For Cool Caps for Keys, you make sure the database connection works in the testing environment. Then, you ask the promotions team to update its application configuration to connect to the database through the proxy on localhost.

The promotions team pushes its application configuration to use the Cloud SQL Auth proxy into the testing environment, runs tests, and updates production, as shown in figure 11.10. You keep the roles/cloudsql.admin permission on the service account because the proxy needs it.

Figure 11.10 You need to push your IaC changes onto the promotions application’s server in testing and production environments.

The push re-creates the production server with the new startup script. After additional end-to-end testing for the promotions application, you confirm that you successfully updated both testing and production environments.

Why start with reconciling drift before differences between production and testing environments? In this example, you opt to reconcile drift first because you will spend more time manually installing the packages in the testing environment. If you update your IaC and automate the package installation, you can ensure that the change works in the testing environment before pushing it to the production environment.

You might choose to reconcile testing and production environments first because you have a large amount of drift. In that case, match testing and production environments before fixing drift. You want an accurate testing environment before you implement your reconciliation changes.

Reconcile drift and differences in environments to help the next person make updates to the system. They do not have to worry about knowing the difference in configuration or manually configuring the proxy. The extra time you spend updating your IaC helps you avoid additional time debugging!

11.3.3 Implement the original change

Now that you’ve minimized the blast radius of potential failure by reconciling drift and updating your environments, you can finally push forward the original change. Your debugging and incremental fixes change your infrastructure. When you return to implementing the original change, you may need to adjust your code.

Let’s finish the original change for the Cool Caps for Keys promotions application. Recall that the security team asked you to remove administrative permissions from the service account. This process ensures least-privilege access and accounts for using the Cloud SQL Auth proxy.

You know that the service account must have database access because the application uses the Cloud SQL Auth proxy. Now, you try to figure out what kind of minimal access the application should use. The roles/cloudsql.client permission offers enough access for a service account to get a list of instances and connect to them.

In figure 11.11, you change the service account’s permissions from administrative access to roles/cloudsql.client. You push this change to the testing environment, verify that promotions still work, and deploy the roles/cloudsql.client permission to production.

Figure 11.11 You need to push your IaC changes onto the promotions application’s server in testing and production environments.

You reconciled the difference between testing and production environments for the proxy. In theory, the testing environment should now catch any issues with your change. Any failed changes should now appear in the testing environment. Let’s change the permission for the service account in the following listing from roles/cloudsql.admin to roles/cloudsql.client.

Listing 11.4 Changing the service account role to the database client

import json
import os

import database
import iam
import network
import server

SERVICE = 'promotions'
ENVIRONMENT = 'prod'
REGION = 'us-central1'
ZONE = 'us-central1-a'
PROJECT = os.environ['CLOUDSDK_CORE_PROJECT']
# Least-privilege role: list and connect to Cloud SQL instances
role = 'roles/cloudsql.client'

if __name__ == "__main__":
    # Combine every module's resources into one Terraform JSON document
    resources = {
        'resource':
        network.Module(SERVICE, ENVIRONMENT, REGION).build() +
        iam.Module(SERVICE, ENVIRONMENT, REGION, PROJECT,
                   role).build() +
        database.Module(SERVICE, ENVIRONMENT, REGION).build() +
        server.Module(SERVICE, ENVIRONMENT, ZONE).build()
    }

    with open('main.tf.json', 'w') as outfile:
        json.dump(resources, outfile,
                  sort_keys=True, indent=4)

Changes the promotions service account role to client access, which allows connecting to the database instance

Imports the network, database, and server modules without changing the resources

Imports the service account and attaches the “roles/cloudsql.client” role to its permissions
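The listing passes the role into the IAM module but does not show what the module builds. The following is a minimal sketch of how such a module might turn the role into Terraform JSON, not the book's actual iam module. The function name build_iam_resources, the resource naming scheme, and the example project ID are assumptions made for illustration. The resource types google_service_account and google_project_iam_member come from the Terraform Google provider.

```python
import json


def build_iam_resources(service, environment, project, role):
    """Return Terraform JSON resources for a service account and one role.

    Hypothetical helper: the names and structure are assumptions, not the
    implementation behind iam.Module in the listing.
    """
    name = f'{service}-{environment}'
    # Terraform interpolation that resolves to the service account's email
    member = 'serviceAccount:${google_service_account.' + name + '.email}'
    return [
        {
            'google_service_account': {
                name: [{
                    'account_id': name,
                    'display_name': f'{service} {environment}',
                }]
            }
        },
        {
            'google_project_iam_member': {
                name: [{
                    'project': project,
                    'role': role,
                    'member': member,
                }]
            }
        },
    ]


if __name__ == "__main__":
    # 'example-project' is a placeholder project ID
    resources = build_iam_resources(
        'promotions', 'prod', 'example-project', 'roles/cloudsql.client')
    print(json.dumps({'resource': resources}, sort_keys=True, indent=4))
```

Because the role arrives as a parameter, moving from administrative access to client access becomes a one-line change in the calling script, which keeps the change small and easy to review.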

AWS and Azure equivalents

The closest AWS equivalent of the GCP Cloud SQL administrator role is the AmazonRDSFullAccess managed policy. Azure does not have an exact equivalent. Instead, you will need to add an Azure Active Directory account to the database directly and grant administrative consent for Azure SQL Database API permissions.

In AWS, you can add the rds-db:connect action to the IAM role attached to the EC2 instance. In Azure, you will need to revoke administrative access and grant SELECT access to the Azure AD user linked to the database user (http://mng.bz/woo7).
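For the AWS case, the least-privilege policy might look like the following sketch. It builds an IAM policy document that allows only the rds-db:connect action for one database user. The function name rds_connect_policy and every value passed to it (region, account ID, database resource ID, and database user) are placeholders, so substitute the identifiers from your own environment.

```python
import json


def rds_connect_policy(region, account_id, db_resource_id, db_user):
    """Build an IAM policy document that allows connecting as one DB user.

    Hypothetical helper for illustration. The resource ARN follows the
    format AWS documents for IAM database authentication.
    """
    resource = (f'arn:aws:rds-db:{region}:{account_id}:'
                f'dbuser:{db_resource_id}/{db_user}')
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Action': ['rds-db:connect'],
            'Resource': [resource],
        }]
    }


if __name__ == "__main__":
    # All arguments below are placeholder values
    policy = rds_connect_policy(
        'us-east-1', '123456789012', 'db-EXAMPLERESOURCEID', 'promotions')
    print(json.dumps(policy, indent=4))
```

Scoping the resource to a single database user mirrors the intent of roles/cloudsql.client: the application can connect, but it cannot administer the instance.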

You commit and apply the change. The testing environment applies the change and validates that the application still works! You confirm with the promotions team, which approves the production change.
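Alongside the application tests, you could add a small check on the generated configuration before promoting it. The sketch below scans a Terraform JSON document for project role bindings that still look administrative. The function name find_broad_roles, the list of suffixes, and the assumed shape of the document (a list of resource blocks, each mapping a resource name to a list of bodies) are assumptions for illustration, so adjust them to match your own generated main.tf.json.

```python
# Assumption: role names ending in these words grant broad access
BROAD_ROLE_SUFFIXES = {'admin', 'owner', 'editor'}


def find_broad_roles(config):
    """Return roles in google_project_iam_member blocks that look broad."""
    broad = []
    for block in config.get('resource', []):
        bindings = block.get('google_project_iam_member', {})
        for bodies in bindings.values():
            for body in bodies:
                role = body.get('role', '')
                # 'roles/cloudsql.admin' -> 'admin', 'roles/owner' -> 'owner'
                suffix = role.split('/')[-1].split('.')[-1].lower()
                if suffix in BROAD_ROLE_SUFFIXES:
                    broad.append(role)
    return broad


if __name__ == "__main__":
    # Inline sample standing in for the contents of main.tf.json
    sample = {
        'resource': [{
            'google_project_iam_member': {
                'promotions-prod': [{'role': 'roles/cloudsql.client'}]
            }
        }]
    }
    print(find_broad_roles(sample))
```

A check like this does not replace end-to-end tests. It only guards against the administrative role slipping back into the configuration in a later change.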

The team promotes the new permissions change to production, runs end-to-end tests, and confirms that the promotions application can access the database! After a few weeks of debugging and making changes, you can finally fix the other applications in Cool Caps for Keys.

Why dedicate an entire chapter to a single example of fixing failed changes? It represents the reality of fixing IaC. You want to resolve the failure as quickly as possible without making it worse.

Rolling forward helps restore the system’s working state and minimize disruption to infrastructure resources. Then, you can work on troubleshooting the root cause. Many infrastructure failures come from drift, dependencies, or differences between testing and production environments. After you address these discrepancies, you implement your original change.

Learning the art of rolling forward IaC takes time and experience. While you could just log in to the cloud provider console and make manual changes to get the system working, remember that the bandage will fall off very quickly and won’t promote long-term system healing. Using IaC to track and fix the system incrementally minimizes the impact of the repairs and provides context for anyone else updating the system.

Summary

  • Repairing failures for IaC involves rolling forward fixes instead of rolling them back.

  • Rolling forward IaC uses immutability to return the system to a working state.

  • Before debugging and implementing long-term fixes, prioritize stabilizing and restoring the system to a working state.

  • When troubleshooting IaC, check for drift, unexpected dependencies, and differences between environments as part of the root cause.

  • Work on incremental fixes to quickly recognize and reduce the blast radius of potential failure.

  • Before you re-implement the original change that failed, make sure you reconcile drift and differences between environments for accurate testing and future system updates.

  • Reconstructing state in IaC to reconcile drift involves aggregating manual commands for server configuration or transforming infrastructure metadata into IaC.
