1 Introducing infrastructure as code

This chapter covers

  • Defining infrastructure
  • Defining infrastructure as code
  • Understanding why infrastructure as code is important

If you just started working with public cloud providers or data center infrastructure, you might feel overwhelmed with all that you have to learn. You don’t know what you need to know to do your job! Between data center infrastructure concepts, new public cloud offerings, container orchestrators, programming languages, and software development, you have a lot to research.

Besides learning everything you can, you also have to keep up with your company’s requirements to innovate and grow. Building systems to support it all gets challenging. You need a way to support more complex systems, minimize your maintenance effort, and avoid disruption to customers using your application.

What do you need to know to work with cloud computing or data center infrastructure? How can you scale your systems across a team and your organization? The answer to both questions involves infrastructure as code (IaC), the process of automating infrastructure changes in a codified manner to achieve scalability, reliability, and security.

Everyone can use IaC, from system administrators, site reliability engineers, DevOps engineers, security engineers, and software developers, to quality assurance engineers. Whether you’ve just run your first tutorial for IaC or passed a public cloud certification (congratulations!), you can apply IaC to larger systems and teams to simplify, sustain, and scale your infrastructure.

This book offers a practical approach to IaC by applying software development practices and patterns to infrastructure management. The book presents practices like testing, continuous delivery, refactoring, and design patterns with an infrastructure twist. You’ll find practices and patterns to help you manage your infrastructure no matter the automation, tool, platform, or technology.

I divided this book into three parts (figure 1.1). Part 1 covers the practices you can apply to write IaC, while part 2 describes patterns and practices your team can use to collaborate on it. Part 3 covers some approaches to scaling IaC across your organization.

Figure 1.1 In this book, you will learn the intersections of practices among you, your team, and your organization and how they help you scale systems and support mission-critical applications.

Many of the patterns and practices in this book intersect these three interests. Writing good IaC individually helps you better share and scale it across your team and organization. Well-written IaC helps solve problems with collaborating on IaC, especially as more people adopt it.

Part 1 starts by defining infrastructure and explaining common IaC design patterns. These topics involve foundational concepts that help you scale IaC across your team. You may already be familiar with some of the material in this part, so review these chapters to establish foundations for more advanced concepts.

In parts 2 and 3, you’ll learn the patterns and practices you need to scale systems and support infrastructure for mission-critical applications. These practices extend from you to your team and organization, from creating one metrics alert for an application to implementing a network change across a 50,000-person organization. Many terms and concepts build on each other in these parts, so you might find it helpful to read the chapters in order.

1.1 What is infrastructure?

Before I dive into IaC, let’s begin with the definition of infrastructure. When I began working in a data center, the literature often defined infrastructure as hardware or devices that provide network, storage, or compute capability. Figure 1.2 shows how applications run on servers (compute), connect through a switch (network), and maintain data on disks (storage).

Figure 1.2 The data center definition of infrastructure includes networking, compute, and storage resources used to run an application.

These three categories matched the physical devices we managed in a data center. We made changes by scanning our IDs to get into a building, plugging into a device, typing commands, and hoping that everything still worked. With the advent of cloud computing, we continued to use these categories to discuss virtualization of specific devices.

However, the data center definition of infrastructure doesn’t quite apply to today’s services and offerings. Imagine that another team requests that you help them deliver their application to production for users. You run through a checklist that includes setting up the following:

  • Enough servers

  • Network connectivity for users

  • A database for storing application data

Does completing this checklist mean the team can run this application in production? Not necessarily. You do not know if you set up enough servers or the proper access to log into the application. You also need to know if network latency affects the application’s database connection.

In this narrow definition of infrastructure, you omit some critical tasks necessary for production readiness, including the following:

  • Monitoring application metrics

  • Exporting metrics for business reporting

  • Setting up alerts for teams operating the application

  • Adding health checks for servers and databases

  • Supporting user authentication

  • Logging and aggregating application events

  • Storing and rotating database passwords in a secrets manager

You need these to-do items to deliver an application to production that will work reliably and securely. You might think of them as operational requirements, but they still require infrastructure resources.

Besides infrastructure related to operations, public cloud providers abstract the management of base networking, compute, and storage and offer platform as a service (PaaS) offerings instead, from object stores like storage buckets to event-streaming platforms like managed Apache Kafka. Providers even offer function as a service (FaaS) or containers as a service, additional abstractions for computing resources. You might also need offerings from the growing software as a service (SaaS) marketplace, such as hosted application performance-monitoring software, to support an application in production, and these could also count as infrastructure.

With so many services, we cannot describe infrastructure with just compute, network, and storage categories. We need to include operational infrastructure, PaaS, or SaaS offerings in our application delivery. Figure 1.3 adjusts the model of infrastructure to include additional service offerings like SaaS and PaaS that help us deliver applications.

Because of the growing complexity, varying operating models, and user abstraction of data center management, you can’t limit the definition of infrastructure to hardware or physical devices related to compute, network, or storage.

Definition Infrastructure refers to the software, platform, or hardware that delivers or deploys applications to production.

Here is a non-exhaustive list of infrastructure you might encounter:

  • Servers

  • Workload orchestration platforms (e.g., Kubernetes, HashiCorp Nomad)

  • Network switches

  • Load balancers

  • Databases

  • Object stores

  • Caches

  • Queues

  • Event-streaming platforms

  • Monitoring platforms

  • Data pipeline systems

  • Payment platforms

Expanding the infrastructure definition provides a common language across teams managing resources for various purposes. For example, a team managing an organization’s continuous integration (CI) framework utilizes infrastructure from either a continuous integration SaaS or compute resources from a public cloud. Another team builds upon the framework, thus making it critical infrastructure.

Figure 1.3 The infrastructure for an application might include queues on a public cloud, containers running applications, serverless functions for additional processing, or even monitoring services to check the system’s health.

1.2 What is infrastructure as code?

Before explaining infrastructure as code, we must understand manual infrastructure configuration. In this section, I outline the problems with configuring infrastructure manually. Then I define infrastructure as code.

1.2.1 Manual configuration of infrastructure

As part of a network team, I learned to change a network switch by copying and pasting commands from a text document. I once pasted shutdown instead of no shutdown, turning off a network interface! I quickly turned it back on, hoping that no one noticed and it didn’t affect anything. A week later, however, I discovered that it shut off connectivity to a critical application and affected a few customer requests.

In retrospect, I ran into a few problems with my manual copying and pasting of commands and infrastructure configuration. First, I had no idea which resources my change would affect (also known as the blast radius). I did not know which networks or applications used the interface.

Definition The blast radius refers to the impact a failed change has on a system. A larger blast radius often affects more components or the most critical components.

Second, the network switch accepted my command without testing its effects or checking its intent. Finally, no one else knew what affected the application’s processing of customer requests, and it took a week for them to identify the root cause as my miscopied command.

How does writing infrastructure in a codified manner help catch my miscopied network switch command? I could store my configuration and automation under source control to record the commands. To catch my mistake going forward, I could create a virtual switch and a test that runs my script and checks the health of the interface.

After the tests pass, I would promote the change to production, knowing that the tests check for the correct command. If I still applied the wrong command, I could search the infrastructure configuration to determine which applications run on the affected network. You can refer to chapter 6 for testing practices and chapter 11 for reverting changes.
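As a rough illustration of the test described above, the following Python sketch replaces the virtual switch with an in-memory stand-in. The FakeSwitch class, the enable_interface function, and the interface name eth0 are all hypothetical names invented for this example; a real test would talk to a virtual switch through whatever interface your network tooling provides.

```python
class FakeSwitch:
    """Hypothetical in-memory stand-in for a virtual network switch."""

    def __init__(self, interfaces):
        # Map each interface name to whether it is administratively up.
        self.interfaces = {name: True for name in interfaces}
        # Keep a record of every command for later review.
        self.history = []

    def run(self, interface, command):
        self.history.append((interface, command))
        if command == "shutdown":
            self.interfaces[interface] = False
        elif command == "no shutdown":
            self.interfaces[interface] = True
        else:
            raise ValueError(f"unsupported command: {command}")


def enable_interface(switch, interface):
    """The change under test: bring an interface up."""
    switch.run(interface, "no shutdown")


def test_interface_is_up_after_change():
    switch = FakeSwitch(["eth0"])
    switch.interfaces["eth0"] = False  # start from a disabled interface
    enable_interface(switch, "eth0")
    assert switch.interfaces["eth0"], "interface should be up after the change"


if __name__ == "__main__":
    test_interface_is_up_after_change()
    print("interface health check passed")
```

If enable_interface sent shutdown by mistake, the assertion would fail before the change ever reached a production switch.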

Besides the risk of misconfiguration, my development momentum sometimes slows because of manual infrastructure configuration. Once, it took nearly two months for me to test my application against a database. Throughout those two months, my team submitted over 10 tickets related to creating the database, configuring new routing to connect the application to the database, and opening firewall rules to allow my application. The platform team manually configured everything in a public cloud. Development teams did not have direct access because of security concerns.

In other words, manual configuration of infrastructure often does not scale as systems and teams grow. Manual changes increase the change failure rate of systems, slow development, and expose the system’s attack surface to a potential security exploit. You will always have the temptation to update some values in a console. However, these changes accumulate.

The next person who makes a change to the system may introduce a failure into the system that will be difficult to troubleshoot because changes haven’t been audited or organized. Changes like updating a firewall to allow some traffic during the development process can inadvertently leave the system vulnerable to attack.

1.2.2 Infrastructure as code

What should you do to change infrastructure, if not manual changes? You can apply a software development life cycle for infrastructure resources and configuration in the form of IaC. However, an infrastructure development life cycle goes beyond configuration files and scripts.

Infrastructure needs to scale, manage failure, support rapid software development, and secure applications. A development life cycle for infrastructure involves more specific patterns and practices to support collaboration, deployment, and testing. In figure 1.4, a simplified workflow changes infrastructure by using configuration or scripts and committing them to version control. A commit automatically starts a workflow to deploy and test the changes to your infrastructure.

Figure 1.4 A development life cycle for infrastructure includes writing the code as documentation, committing it to version control, applying it to infrastructure in an automated way, and testing it.

Why should you remember the development life cycle? You can use it as a general pattern for managing changes and verifying that they don’t affect your system. The life cycle captures infrastructure as code, which automates infrastructure changes in a codified manner and applies DevOps practices such as version control and continuous delivery.

Definition Infrastructure as code (IaC) applies DevOps practices to automating infrastructure changes in a codified manner to achieve scalability, resiliency, and security.

I often find IaC cited as a necessary practice of DevOps. It certainly addresses the automation piece of the CAMS model (culture, automation, measurement, and sharing). Figure 1.5 positions IaC as part of automation practices and philosophy in the DevOps model. The practices of code as documentation, version control, software development patterns, and continuous delivery align with the development life cycle workflow we discussed previously.

Figure 1.5 IaC applies version control, software development patterns, continuous integration, and code as documentation to infrastructure.

Why focus on IaC as part of automation in the DevOps model? Your organization does not have to adopt DevOps to use IaC. Its benefits improve DevOps adoption and metrics but still apply to any infrastructure configuration. You can still use IaC practices to improve the process of making infrastructure changes without affecting production.

Note You will see some DevOps practices included in this book, but I do not focus on its theory or principles. I recommend Accelerate by Nicole Forsgren et al. (IT Revolution Press, 2018) for a higher-level understanding of DevOps. You can also peruse The Phoenix Project by Gene Kim et al. (IT Revolution Press, 2013), which describes the fictional transformation of an organization adopting DevOps.

This book covers some approaches for codifying infrastructure to eliminate the friction of scale while maintaining the reliability and security of infrastructure for application users, whether you use the data center or cloud. Software development practices such as version control of configuration files, CI pipelines, and testing can help scale and evolve changes to infrastructure while mitigating downtime and building secure configuration.

1.2.3 What is not infrastructure as code?

Could you be doing IaC if you type out some configuration in a document? You might consider IaC to include adding configuration instructions in a change ticket. You can argue that a tutorial for building a queue or a shell script for configuring a server counts as IaC. Each of these examples can be forms of IaC if you can use them to do the following:

  • Reliably and accurately reproduce the infrastructure it expresses

  • Revert configuration to a specific version or point in time

  • Communicate and assess the blast radius of a change to the resources

The configurations or scripts, however, are usually outdated, unversioned, or ambiguous in intent. You may even find yourself struggling to understand and change the configuration written in an IaC tool. The tool facilitates IaC workflows but not necessarily the practices and approaches that allow systems to grow while reducing operational responsibility and change failure. You need a set of principles to identify IaC.

1.3 Principles of infrastructure as code

As I mentioned, not every piece of code or configuration related to infrastructure will scale or mitigate downtime. Throughout the book, I highlight how IaC principles apply to certain code listings or practices. You can even use these principles to assess your IaC.

While others may add to or subtract from this list of principles, I remember four of the most important principles with the mnemonic RICE. This stands for reproducibility, idempotency, composability, and evolvability. I define and apply each principle in the following sections.

1.3.1 Reproducibility

Imagine someone asks you to create a development environment with a queue and a server. You share a set of configuration files with your teammate. They use them to re-create a new environment for themselves in less than an hour. Figure 1.6 shows how you shared your configuration and enabled your teammate to reproduce a new environment. You discovered the power of reproducibility, the first principle of IaC!

Why should IaC conform to this principle of reproducibility? The ability to copy and reuse infrastructure configuration saves you and your team’s initial engineering time. You don’t have to reinvent the wheel to create new environments or infrastructure resources.

Figure 1.6 Reproducibility means that you can share your configuration files with a teammate, who uses them to re-create a new environment.

Definition The principle of reproducibility means that you can use the same configuration to reproduce an environment or infrastructure resources.

However, you’ll find adhering to the principle of reproducibility more complicated than copying and pasting a configuration. To demonstrate this nuance, imagine you need to reduce a network address space from /16 to /24. You do have IaC written that expresses the network. However, you decide to choose the easy route of logging into the cloud provider and typing /24 into the text box.

Before you log into the cloud provider, you reflect on whether your change workflow adheres to reproducibility. You ask yourself the following questions:

  • Will a teammate know that you’ve updated the network?

  • If you run your configuration, will the network address space return to the /16?

  • If you create a new environment with the configuration in version control, will it have an address space of /24?

You answer no to each of those questions. You cannot guarantee that you will reproduce the manual change successfully.

If you go ahead and type /24 in the cloud provider’s console, your network has drifted from its desired state expressed in IaC (figure 1.7). To conform to reproducibility, you decide to update the version control configuration to /24 and apply the automation.

Figure 1.7 Manual changes introduce drift between version control and actual state and affect reproducibility, so you instead update changes in version control.

This scenario demonstrates the challenge of conforming to reproducibility. You need to minimize the inconsistency between the expected and actual infrastructure configuration, also known as configuration drift.

Definition Configuration drift is the deviation of the actual infrastructure configuration from the desired configuration.
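To make drift concrete, here is a minimal Python sketch that compares a desired configuration (what version control says) with an actual one (what the provider reports). The detect_drift function, the dictionary keys, and the 10.0.0.0 address range are assumptions made for illustration; IaC tools typically perform a comparison like this for you.

```python
import ipaddress


def detect_drift(desired, actual):
    """Return {attribute: (desired, actual)} for every attribute that differs."""
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        if desired.get(key) != actual.get(key):
            drift[key] = (desired.get(key), actual.get(key))
    return drift


# Hypothetical network: version control still says /16,
# but someone typed /24 into the provider's console.
desired = {"name": "dev-network", "cidr": "10.0.0.0/16"}
actual = {"name": "dev-network", "cidr": "10.0.0.0/24"}

drift = detect_drift(desired, actual)
for key, (want, have) in drift.items():
    print(f"{key}: version control says {want}, provider reports {have}")

# Parsing the CIDR blocks shows how much the address space shrank.
desired_network = ipaddress.ip_network(desired["cidr"])
actual_network = ipaddress.ip_network(actual["cidr"])
print(desired_network.num_addresses, actual_network.num_addresses)
```

An empty result from detect_drift means the actual state matches version control, which is the condition reproducibility depends on.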

As a practice, you can ensure the principle of reproducibility by placing your configuration files in version control and keeping version control as updated as possible. Maintaining the principle of reproducibility helps you collaborate better and manage testing environments similar to production.

In chapter 6, you’ll learn more about infrastructure testing environments, which benefit from reproducibility. You’ll also apply reproducibility to practices and patterns in testing and upgrading infrastructure, from creating test infrastructure mirroring production to deploying new infrastructure to replace older systems (blue-green deployments).

1.3.2 Idempotency

Some definitions of IaC include repeatability as a principle, which means running the same automation and yielding consistent results. I posit that IaC needs a stricter requirement. Running automation should result in the same end state for an infrastructure resource. After all, I have one main objective when I write automation: the ability to run the automation multiple times and get the same result.

Let’s consider why IaC needs a stricter requirement. Imagine you write a network script that first configures an interface and then reboots. The first time you run the script, the switch configures the interface and reboots. You save this script as version 1.

A few months later, your teammate asks you to run the script again on the switch. You run the script, and the switch reboots. However, the reboot disconnects some critical applications! You already configured the network interface. Why do you need to reboot the switch?

You figure out a way to prevent the switch reboot if you’ve already configured the network interface. In figure 1.8, you create version 2 of the script and add a conditional if statement. The statement checks whether you have configured the interface already before rebooting the switch. When you run the version 2 script again, you do not disconnect the applications.

The conditional statement conforms to the principle of idempotency. Idempotency ensures that you can repeat the automation without affecting the infrastructure unless you change the configuration or drift occurs. If an infrastructure configuration or script is idempotent, you can rerun the automation multiple times without affecting the state or operability of a resource.

Figure 1.8 In version 1 of the script, you reboot the switch each time you run the script. In version 2, you check whether the network interface is configured before rebooting the switch. This preserves the working state of the network.

Definition The principle of idempotency ensures that you can repeatedly run the automation on infrastructure without affecting its end state or having any side effects. You should affect infrastructure only when you update its attributes in automation.

Why should you adhere to idempotency in your IaC, such as in the case of your network script? In the example, you want to avoid rebooting the network switch to keep the network running correctly. You already configured the network interface; why configure it again? You should need to configure the interface only if it doesn’t exist or its configuration changes.

Without idempotency, your automation might break accidentally. For example, you may repeat a script and create a new set of servers, doubling their numbers. More catastrophically, you might automate a database update only to remove a critical database!

You can ensure the principle of idempotency by checking the repeatability of scripts and configuration. As a general practice, include a number of conditional statements checking whether a configuration matches the expected one before running your automation. Conditional statements help apply changes when required and avoid side effects that may affect the operability of infrastructure.
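The difference between version 1 and version 2 of the network script can be sketched in Python as follows. The Switch class is a hypothetical in-memory model that only counts reboots, and the interface name and settings are made up for the example; the point is the conditional check in apply_v2, not the switch itself.

```python
class Switch:
    """Hypothetical in-memory switch that counts how often it reboots."""

    def __init__(self):
        self.interfaces = {}
        self.reboots = 0

    def configure(self, name, settings):
        self.interfaces[name] = dict(settings)

    def reboot(self):
        self.reboots += 1


def apply_v1(switch, name, settings):
    """Version 1: always configure the interface and reboot."""
    switch.configure(name, settings)
    switch.reboot()


def apply_v2(switch, name, settings):
    """Version 2: act only when the interface differs from the desired settings."""
    if switch.interfaces.get(name) == settings:
        return False  # already in the desired state, so do nothing
    switch.configure(name, settings)
    switch.reboot()
    return True


settings = {"vlan": 100, "enabled": True}

v1 = Switch()
apply_v1(v1, "eth0", settings)
apply_v1(v1, "eth0", settings)
print("version 1 reboots:", v1.reboots)

v2 = Switch()
apply_v2(v2, "eth0", settings)
apply_v2(v2, "eth0", settings)
print("version 2 reboots:", v2.reboots)
```

Running version 1 twice reboots the switch twice, while version 2 reboots it only on the first run. Version 2 still applies the change and reboots if you later update the settings, which matches the definition: you affect infrastructure only when its attributes change.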

Designing automation with idempotency reduces risk, as it encourages the inclusion of logic to preserve the final expected state of the system. If the automation fails once and causes an outage in the system, organizations no longer want to automate again because of its perceived risk. Idempotency will become a guiding principle as you learn how to safely roll forward changes and preview automation changes before deployment in chapter 11.

1.3.3 Composability

You want to mix and match any set of infrastructure components, no matter the tool or configuration. You also need to update the individual configurations without affecting the entire system. Both of these requirements promote modularity and decoupling infrastructure dependencies, something you’ll learn more about in chapters 3 and 4.

For example, imagine you create infrastructure for an application that someone accesses at hello-world.com. These are the minimum resources you need for a secure and production-ready configuration:

  • A server

  • A load balancer

  • A private network for a server

  • A public network for a load balancer

  • A routing rule to allow traffic out of the private network

  • A routing rule to allow public traffic to the load balancer

  • A routing rule to allow traffic from the load balancer to the server

  • A domain name for hello-world.com

You could write this configuration from scratch. However, what if you found preconstructed modules that group infrastructure components you can use to assemble the system? You now have multiple modules that create the following:

  • Networks (private and public networks, gateways to route traffic from the private network, routing rules to allow traffic out of the private network)

  • Servers

  • Load balancers (domain name, routing rules to allow traffic from the load balancer to the server)

In figure 1.9, you pick the network, server, and load balancer modules to build your production environment. Later, you realize you need a premium load balancer. You swap out the standard load balancer with a premium one so you can serve more traffic. The server and network continue running without affecting users.

Your teammate can even add a database into the environment without affecting the load balancer, servers, or network. You group and select infrastructure resources in various combinations, which adheres to the principle of composability.

Definition Composability ensures that you can assemble any combination of infrastructure resources and update each one without affecting the others.

Figure 1.9 You build your production environment with building blocks of infrastructure so you can add new resources, like the premium load balancer, easily.

The more composable your configuration, the easier it is to create new systems with less effort. Think of constructing your IaC with building blocks. You want to be able to update or evolve subsets of resources without toppling the entire system! If you do not consider the composability of your IaC, you run the risk of change failures due to unknown dependencies in complex infrastructure systems.
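One way to picture these building blocks is as small functions that each return a group of resources, which you then combine into an environment. The sketch below assumes hypothetical module functions (network_module, server_module, load_balancer_module), a compose helper, and resource attributes invented for this example; it does not reflect any particular tool's module format.

```python
def network_module(name):
    """Private and public networks plus a gateway for outbound traffic."""
    return {
        f"{name}-private-network": {"type": "network", "public": False},
        f"{name}-public-network": {"type": "network", "public": True},
        f"{name}-gateway": {"type": "gateway", "network": f"{name}-private-network"},
    }


def server_module(name):
    return {
        f"{name}-server": {"type": "server", "network": f"{name}-private-network"},
    }


def load_balancer_module(name, tier="standard"):
    return {
        f"{name}-load-balancer": {
            "type": "load_balancer",
            "tier": tier,
            "domain": "hello-world.com",
            "target": f"{name}-server",
        },
    }


def compose(*modules):
    """Merge modules into one environment, rejecting duplicate resource names."""
    environment = {}
    for module in modules:
        overlap = environment.keys() & module.keys()
        if overlap:
            raise ValueError(f"duplicate resources: {sorted(overlap)}")
        environment.update(module)
    return environment


standard = compose(
    network_module("prod"), server_module("prod"), load_balancer_module("prod")
)
premium = compose(
    network_module("prod"),
    server_module("prod"),
    load_balancer_module("prod", tier="premium"),
)

# Only the load balancer differs between the two environments.
changed = sorted(key for key in premium if premium[key] != standard[key])
print(changed)
```

Swapping the load balancer module changes only the load balancer resource. The network and server definitions stay identical, so an update to one building block does not ripple into the others.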

The self-service benefit of composability can help your organization scale and empower teams to interact safely with infrastructure systems. Chapters 3 and 4 examine some patterns that can help you approach more modular infrastructure construction and improve composability.

1.3.4 Evolvability

You want to account for the scale and growth of your system but not optimize the configuration too early and unnecessarily. Much of infrastructure configuration will change over time, including its architecture.

As a practical example, you might initially name an infrastructure resource example. Later, you need to change the resource name to production. You start the change by finding and replacing hundreds of tags, names, dependent resources, and more. The find-and-replace procedure requires a great deal of effort.

You notice that you forgot to change some fields as you apply the changes, and your new infrastructure changes fail. To ensure the future evolution of names, tags, and other metadata, you instead create a variable for the name, and the configuration references the variable. In figure 1.10, you update the global NAME variable, and the change propagates across the entire system.

Figure 1.10 Rather than finding and replacing all instances of the name, you can set a top-level variable with the name for all resources.
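A minimal Python sketch of the top-level variable might look like the following. The build_resources function and the three resources are assumptions for illustration; the idea is that every name and tag derives from a single value.

```python
# Top-level variable: change the name here and nowhere else.
NAME = "example"


def build_resources(name):
    """Derive every resource name and tag from one variable."""
    return {
        "server": {"name": f"{name}-server", "tags": {"environment": name}},
        "queue": {"name": f"{name}-queue", "tags": {"environment": name}},
        "network": {"name": f"{name}-network", "tags": {"environment": name}},
    }


resources = build_resources(NAME)
renamed = build_resources("production")

for resource in renamed.values():
    print(resource["name"], resource["tags"])
```

Because no resource hardcodes the name, renaming the system becomes a one-line change rather than a find-and-replace across hundreds of fields.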

The example seems almost too simple. Why does changing a name matter? IaC built with evolvability as a principle minimizes the effort (time and cost) to change a system and the risk of failure for the change.

Definition The principle of evolvability ensures that you can change your infrastructure resources to accommodate system scale or growth while minimizing effort and risk of failure.

System evolution includes changes beyond minor ones, such as name changes. A more disruptive change in infrastructure architecture might involve a replacement of Google Cloud Bigtable with Amazon Elastic MapReduce (EMR). The application requiring the replacement has been future-proofed using Apache HBase, an open source distributed database that supports both offerings and simply requires the database endpoint.

We account for this evolution in the IaC by outputting the database endpoint for the application to retrieve and completing the update behind the scenes by creating configurations for both offerings. After testing the Amazon Web Services (AWS) database, we output its endpoint for consumption by the application.
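The following Python sketch shows the shape of that approach under simplifying assumptions: both database configurations exist side by side, and the application reads only a single output. The DATABASES dictionary, the build_outputs function, and the endpoint values are hypothetical placeholders, not real addresses or a real tool's output format.

```python
# Hypothetical configurations for both offerings.
# The endpoints are placeholders, not real addresses.
DATABASES = {
    "bigtable": {"provider": "gcp", "endpoint": "bigtable.example.internal:9090"},
    "emr": {"provider": "aws", "endpoint": "emr.example.internal:9090"},
}


def build_outputs(active):
    """Expose only the endpoint of the active database to the application."""
    if active not in DATABASES:
        raise KeyError(f"unknown database: {active}")
    return {"database_endpoint": DATABASES[active]["endpoint"]}


def describe_connection(outputs):
    """The application depends on the output, not on a specific offering."""
    return f"connecting to HBase-compatible endpoint {outputs['database_endpoint']}"


before = build_outputs("bigtable")
after = build_outputs("emr")

print(describe_connection(before))
print(describe_connection(after))
```

The application code stays the same across the migration. Only the value behind database_endpoint changes once the new database passes its tests.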

Note I do not fully cover the theory behind evolving your architecture in this book. I highly recommend Building Evolutionary Architectures by Neal Ford et al. (O’Reilly, 2017) if you want to learn more. The book discusses how to build your infrastructure architecture to account for changes.

You might find yourself struggling to evolve your system because you have not used patterns and practices that allow it to change. Useful IaC focuses on techniques to facilitate future evolution. Many of the chapters in this book demonstrate patterns that help maintain evolvability and minimize the impact of changes to critical systems.

1.3.5 Applying the principles

Reproducibility, idempotency, composability, and evolvability seem specific in their definitions. However, they all help constrain your infrastructure architecture and define the behavior of many IaC tools. Your IaC must align with all four principles to scale, support collaboration, and accommodate change across your company. Figure 1.11 summarizes these four important principles and their definitions.

Figure 1.11 IaC should be reproducible, idempotent, composable, and evolvable. You can ask yourself a series of questions to determine whether your IaC conforms to all four principles.

As you write IaC, ask whether you conform to all four principles. These principles help you write and share your IaC with less effort and ideally minimize the impact of changes to your system. A missing principle can hinder updates to infrastructure resources or increase the blast radius of potential failure.

As you practice IaC, ask whether your configuration or tools align with the principles. For example, ask the following questions about your tool:

  • Does the tool allow you to re-create entire environments?

  • What happens when you rerun the tool to enforce configuration?

  • Can you mix and match various configuration snippets to make a new set of infrastructure components?

  • Does the tool offer capabilities to help you evolve an infrastructure resource without impacting other systems?

This book uses the principles to answer these questions and provide you with the skills to test, upgrade, and deploy infrastructure with resiliency and scalability in mind.

Exercise 1.1

Choose an infrastructure script or configuration in your organization. Assess whether it adheres to the principles of IaC. Does it promote reproducibility, use idempotency, help with composability, and ease evolvability?

See appendix B for answers to exercises.

1.4 Why use infrastructure as code?

IaC is typically considered a DevOps practice. However, you don’t have to be applying DevOps across your entire organization to use it. You still want to manage your infrastructure in a way that reduces the change failure rate and mean time to resolution (MTTR), so you can sleep in on the weekends as an operator or spend more time writing code as a developer. There are a few reasons to use IaC, even if you think you don’t need it.

1.4.1 Change management

You might experience a sinking feeling when you’ve applied a change to certain infrastructure, only to realize that someone reported it broke something. Organizations try to prevent this with change management, a set of steps and reviews to ensure that your changes do not affect production. The process often includes a change review board to review the changes, or change windows to block off time to execute the change.

Definition Change management outlines a set of steps and reviews that you take in your company to implement changes in production and prevent their failure.

However, no change is risk-free. Applying IaC practices can mitigate the risk of a change by modularizing your infrastructure (chapter 3) and rolling forward changes (chapter 11) to limit the blast radius.

In one regrettable instance, I ignored my intuition to use IaC to mitigate the risk of a change. I had to roll out a new binary to servers, which required them to restart a set of dependent services. I wrote a script, asked my teammate to check it, and had the change review board sign off to run it. After applying and verifying the change over the weekend, I came in on Monday to several messages telling me servers supporting a reconciliation application went down overnight. My teammate traced it to an older operating system incompatibility with the dependencies in my script.

In retrospect, IaC could have mitigated the risk of the change. When I applied the RICE principles to the change, I realized that I forgot the following:

  • Reproducibility—I did not reproduce my script on various test instances mimicking the various servers.

  • Idempotency—I did not include logic to check the operating system before running the command.

  • Composability—I did not limit the blast radius of the change to a small set of less critical servers.

  • Evolvability—I did not update the servers with a newer operating system and reduce variation across the infrastructure.
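As a rough illustration of two of these checks, the sketch below selects which servers a rollout script may touch. It is not the script from the story. The Server fields, the SUPPORTED_OS_VERSIONS set, and the server names are all hypothetical, and the example assumes you can look up each server’s operating system version before running anything.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    os_version: str  # hypothetical field, for example "8.4"
    critical: bool   # hypothetical flag marking business-critical servers


# Hypothetical set of operating system versions the new binary was tested on.
SUPPORTED_OS_VERSIONS = {"8.4", "9.0"}


def select_targets(servers, canary_only=True):
    """Return the servers that are safe to update in this run."""
    targets = []
    for server in servers:
        # Check the operating system before running the command.
        if server.os_version not in SUPPORTED_OS_VERSIONS:
            continue
        # Limit the blast radius to less critical servers on the first pass.
        if canary_only and server.critical:
            continue
        targets.append(server)
    return targets


servers = [
    Server("batch-01", "8.4", critical=False),
    Server("recon-01", "7.2", critical=True),
    Server("recon-02", "9.0", critical=True),
]

# First pass: only non-critical servers with a supported operating system.
print([s.name for s in select_targets(servers)])
# Second pass: every server with a supported operating system.
print([s.name for s in select_targets(servers, canary_only=False)])
```

In this sketch, the server with the older operating system never receives the change, and the critical servers receive it only after the first pass succeeds.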

Reducing variation allows for infrastructure evolution and risk mitigation because the actual configuration matches the one you expect during your automation, making changes more straightforward and reliable to apply. We’ll discuss how to fit IaC into your change management process in chapter 7.

1.4.2 Return on time investment

The time investment in IaC can be challenging to justify, especially if your devices or hardware do not have suitable automation interfaces. In addition to a lack of easy automation, it can be difficult to justify spending time automating a task that you do only once a year or even once a decade. While IaC might take extra time to implement, it lowers the time to execute changes in the long term. How exactly does that work?

Imagine you need to update the same package on 10 servers. You used to do this without IaC. You’d manually log in, update the package, verify that everything works, fix errors, and move to the next one. On average, you’d spend 10 hours updating the servers.

Figure 1.12 shows that changes made without IaC have a constant level of effort over time. If you have additional changes, you may spend a few more hours fixing or updating the system. A failed change might mean you spend more effort over several days to fix the system.

You decide to invest time into building IaC for these servers (the solid line in figure 1.12). You reduce the servers’ configuration drift, which takes about 40 hours. After your initial time investment, you spend less than 5 minutes updating all of the servers each time you make the change.
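To make the trade-off concrete, you can work out when the investment pays for itself. The sketch below uses the numbers from this example and assumes, for simplicity, that every manual change costs the same 10 hours and that every automated change takes the full 5 minutes. Real changes vary, so treat the result as a rough estimate.

```python
MANUAL_HOURS_PER_CHANGE = 10   # manual update of all 10 servers
IAC_INITIAL_HOURS = 40         # one-time investment in writing IaC
IAC_HOURS_PER_CHANGE = 5 / 60  # "less than 5 minutes", used as an upper bound


def manual_effort(changes):
    """Cumulative hours spent after a number of manual changes."""
    return MANUAL_HOURS_PER_CHANGE * changes


def iac_effort(changes):
    """Cumulative hours spent with IaC, including the initial investment."""
    return IAC_INITIAL_HOURS + IAC_HOURS_PER_CHANGE * changes


def break_even_change():
    """Return the first change count at which IaC costs less in total."""
    changes = 1
    while iac_effort(changes) >= manual_effort(changes):
        changes += 1
    return changes


print(break_even_change())
```

Under these assumptions, IaC becomes cheaper in total by the fifth change: five manual changes cost 50 hours, while the initial investment plus five automated changes costs a little over 40.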

Why bother understanding the relationship between time and effort for IaC? Prevention helps reduce the effort of remediation. Without IaC, you might find yourself spending weeks trying to remediate a major system outage. You typically spend those weeks trying to reverse engineer the manual changes, revert specific changes, or at worst, build a new system from scratch.

Figure 1.12 IaC requires less effort over time after the high initial automation effort. Without IaC, time to execute changes can be highly variable.

You must make an initial investment in writing IaC, even if it seems steep. This investment helps you over time as you spend less time debugging failed configurations or recovering broken systems in the long term. If your system completely fails one day, you can reproduce it easily by running your IaC.

The automation and tests encourage predictability and limit the blast radius of a failed change. They lower the change failure rate and MTTR of failed systems. As your infrastructure system evolves and scales, you can use the detailed testing practices covered in this book to improve the change failure rate in your system and alleviate the burden of future changes.

1.4.3 Knowledge sharing

IaC communicates infrastructure architecture and configuration, which helps reduce human error and improve reliability. An engineer once told me, “We don’t need to do IaC for a network switch for a tertiary passive data center (used for backup). It’s just me configuring stuff anyway, and we’d only need to make this change once and never touch it again.”

The engineer left the organization shortly after configuring the switch. Later, my team needed to convert the tertiary passive data center to an active data center for compliance requirements. In a panic, we scrambled to reverse engineer the configuration on the switch. It took us a good part of two months to figure out the network connection, rework its configuration, and manage it with IaC.

Even if a task seems particularly obscure or the team configuring infrastructure consists of one person, investing the time and effort to approach infrastructure configuration in an “as code” manner can help accommodate evolution, especially as infrastructure systems and teams scale. You’ll find that you spend more time remembering how you configured the obscure switch when someone reports an outage or when teaching the server configuration to a new team member.

Writing a task “as code,” also known as code as documentation, communicates the expected state of the infrastructure and the system’s architecture.

Definition Code as documentation ensures that the code communicates the intent of the software or system without the need for additional reference documentation.

Someone unfamiliar with the system should be able to examine the infrastructure configuration and understand its intent. From a practical standpoint, you cannot expect all code to serve as documentation. However, the code should reflect most of your infrastructure architecture and system expectations.

1.4.4 Security

Auditing and checking for insecure configurations in IaC can highlight security concerns earlier in the development process. You may hear this described as shifting security left. If you incorporate security checks earlier in the process, you find fewer vulnerabilities when the system configuration runs in production. You’ll learn more about security patterns and practices in chapter 8.

For example, you might temporarily increase access to an object store such that anyone can write and read from it in development. You push it to production. However, some of the objects in the store allow everyone to write to and read from them. While this seems like a simple mistake, the configuration has grave implications if the store contains customer data.

Note For more examples of insecure infrastructure configuration, you can search the news for a misconfigured object store that exposed driver’s license information or even a database with a default password that revealed millions of consumer credit cards. Some security breaches involve legitimate vulnerabilities, but many involve insecure configuration. Organizations that prevent these misconfigurations in the first place can usually quickly examine configuration, audit access control, assess the blast radius, and remediate the breach.

IaC simplifies access control by expressing it in a single configuration. With IaC, you can test the configuration to ensure that the object store does not allow public access. Furthermore, you can include a production check to verify that your policy allows only read access to specific objects. Even security policies in a data center, such as firewall rules, can be expressed in IaC and audited to ensure that their rules allow only inbound connections from known sources.
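A test for public access can be quite small. The following sketch is an illustration rather than a listing from this book. It assumes the configuration is a Python dictionary in Terraform’s JSON shape, that access is granted with the Google provider’s google_storage_bucket_iam_member resource, and that public access appears as the GCP members allUsers or allAuthenticatedUsers. The bucket, group, and resource names are hypothetical.

```python
# GCP identifiers that grant access to anyone on the internet
# or to anyone with a Google account.
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def find_public_bindings(config):
    """Return (resource_name, role) pairs that grant access to everyone."""
    findings = []
    iam_members = config.get("resource", {}).get(
        "google_storage_bucket_iam_member", {}
    )
    for name, attributes in iam_members.items():
        if attributes.get("member") in PUBLIC_MEMBERS:
            findings.append((name, attributes.get("role")))
    return findings


# Hypothetical configuration with one safe and one public binding.
config = {
    "resource": {
        "google_storage_bucket_iam_member": {
            "team_reader": {
                "bucket": "customer-data",
                "role": "roles/storage.objectViewer",
                "member": "group:team@example.com",
            },
            "dev_shortcut": {
                "bucket": "customer-data",
                "role": "roles/storage.objectAdmin",
                "member": "allUsers",
            },
        }
    }
}

print(find_public_bindings(config))
```

Running a check like this before each change flags the development shortcut before it reaches production. A real configuration may grant access through other resources as well, so a complete test would need to cover those too.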

If you experience a security breach, IaC allows you to examine configuration, audit access control quickly, assess the blast radius, and remediate the breach. You can use the same IaC practices to make all sorts of changes. You’ll discover some practices in this book to help audit and secure your infrastructure in adherence to IaC principles.

1.5 Tools

IaC tooling varies widely because it applies to many kinds of resources. Most tools fall into one of three use cases, which address very different functions and behave quite differently:

  • Provisioning

  • Configuration management

  • Image building

In this book, I primarily focus on tools used for provisioning, which deploy and manage sets of infrastructure resources. I include some sidebars and examples in configuration management and image building to highlight differences in approach.

1.5.1 Examples in this book

In this book, I faced the challenge of building concrete examples agnostic of tools or platforms. Because this is a patterns and practices book, I needed to find a way to express the concepts in a general programming language without rewriting the logic of a tool.

Python and Terraform

Figure 1.13 outlines the workflow of the code listings and examples. For more information, appendix A details the technical implementation. I wrote the code listings in Python to create a JavaScript Object Notation (JSON) file consumed by HashiCorp Terraform, a provisioning tool with various integrations across public clouds and other infrastructure providers.

Figure 1.13 This book’s examples use Python to create a JSON file that Terraform can consume.

When you run the Python script by using python run.py, the code creates a JSON file with the file extension *.tf.json. The JSON file uses syntax specific to Terraform. Then you can go into the directory with the *.tf.json file and run terraform init and terraform apply to create the resources. While the Python code seems to add unnecessary abstraction, it ensures that I can offer concrete examples agnostic of platform and tool.

I recognize that the complexity of this workflow seems nonsensical. However, it serves two purposes. The Python files provide a generalized implementation of patterns and practices with a programming language. The JSON configuration allows you to run and create the resources with a tool instead of me writing in abstractions.
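A minimal sketch of this workflow might look like the following. It is not one of the book’s listings. The function name, the resource label, the network name, and the output filename main.tf.json are assumptions for illustration, and the sketch omits the provider configuration you would need before terraform apply could create anything.

```python
import json


def build_network(label, name):
    """Return a Terraform JSON structure describing one GCP network."""
    return {
        "resource": {
            "google_compute_network": {
                label: {
                    "name": name,
                    # Do not create subnetworks automatically.
                    "auto_create_subnetworks": False,
                }
            }
        }
    }


if __name__ == "__main__":
    # Write the structure to a file that Terraform reads as configuration.
    with open("main.tf.json", "w") as outfile:
        json.dump(build_network("example", "example-network"), outfile, indent=4)
```

The Python function holds the pattern, while the generated file holds only data in the syntax Terraform expects.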

Note You can find the complete code examples at this book’s code repository: https://github.com/joatmon08/manning-book.

You do not have to know Python or Terraform in depth to understand the code examples. If you would like to run the examples and create the resources, I recommend reviewing an introductory tutorial to Terraform or Python for syntax and commands.

Note You can find numerous resources on Terraform and Python. Check out Terraform in Action by Scott Winkler (Manning, 2021), The Quick Python Book by Naomi Ceder (Manning, 2018), or Python Workout by Reuven M. Lerner (Manning, 2020).

Google Cloud Platform

While AWS or Microsoft Azure may be more popular choices at publication, I decided to use Google Cloud Platform (GCP) as the primary cloud provider for three main reasons. First, GCP requires overall fewer resources to achieve a similar architecture. This reduces the verbosity of the examples and focuses on the pattern and approach instead of the configuration.

Second, GCP’s offerings use more straightforward naming and generic infrastructure terminology. If you work in a data center, you’ll still be able to recognize what the services create. For example, Google Cloud SQL creates SQL databases.

If you run the examples, you will use the following resources in GCP:

  • Networking (networks, load balancers, and firewalls)

  • Compute

  • Managed queues (Pub/Sub)

  • Storage (Cloud Storage)

  • Identity and access management (IAM)

  • Kubernetes offerings (Kubernetes Engine and Cloud Run)

  • Database (Cloud SQL)

You do not have to know the details of each service. I use them to demonstrate dependency management between infrastructure resources. I avoided using specialized services that are different for each cloud platform, such as machine learning.

AWS and Azure equivalents

Each example includes sidebars on equivalent AWS and Azure service offerings to further solidify specific patterns and techniques. To update the examples for the cloud provider of your choice, you may need to do some language replacements and change some of your dependencies. For example, you can create GCP networks with built-in gateways, while you must explicitly build them in AWS networks.

Some examples have an AWS equivalent in the book’s code repository (https://github.com/joatmon08/manning-book). You’ll also find more information on setting up AWS or Azure to run with examples in appendix A.

The third reason I wrote the examples in GCP involves cost. GCP offers a free usage tier. If you create a new account with GCP, you will receive a free trial of up to $300 (at time of publication). You can use offerings in the free usage tier if you have an existing account. I note any required resources that do not qualify for the free usage tier at the time of publication.

Using Google Cloud Platform

For more information on GCP’s free program, refer to its program web page at https://cloud.google.com/free.

I recommend creating a separate GCP project to run all of the examples. A separate project will isolate your resources. When you finish this book, you can delete the project and its resources. Review the tutorial for creating GCP projects at http://mng.bz/e7QG.

1.5.2 Provisioning

Provisioning tools create and manage sets of infrastructure resources for a given provider, whether it’s a public cloud, data center, or hosted monitoring solution. A provider refers to a data center, IaaS, PaaS, or SaaS responsible for providing infrastructure resources.

Definition A provisioning tool creates and manages a set of infrastructure resources for a public cloud, data center, or hosted monitoring solution.

Some provisioning tools work with only a specific provider, while others integrate with multiple providers (table 1.1).

Table 1.1 Examples of provisioning tools and providers

Tool | Provider
--- | ---
AWS CloudFormation | Amazon Web Services
Google Cloud Deployment Manager | Google Cloud Platform
Azure Resource Manager | Microsoft Azure
Bicep | Microsoft Azure
HashiCorp Terraform | Various (for a complete list, see www.terraform.io/docs/providers/index.html)
Pulumi | Various (for a complete list, see www.pulumi.com/docs/intro/cloud-providers/)
AWS Cloud Development Kit | Amazon Web Services
Kubernetes manifests | Kubernetes (container orchestrator)

Most provisioning tools can express dependencies between infrastructure resources and preview changes to a system before applying them. This preview is also known as a dry run.

Definition A dry run analyzes and outputs expected changes to infrastructure before you apply the changes to resources.

For example, you can express a dependency between a network and a server. If you change the network, the provisioning tool will show that the server may also change.
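The idea behind that preview can be sketched in a few lines. The example below is a simplification and not how any particular provisioning tool works internally. The DEPENDS_ON map and the resource names are hypothetical, and the sketch only lists resources that may change rather than computing what would change.

```python
# Hypothetical dependency map: each resource lists what it depends on.
DEPENDS_ON = {
    "network": [],
    "server": ["network"],
    "database": ["network"],
    "queue": [],
}


def affected_by(changed, depends_on):
    """Return resources that directly or indirectly depend on the changed one."""
    affected = []
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for resource, dependencies in depends_on.items():
            if current in dependencies and resource not in affected:
                affected.append(resource)
                frontier.append(resource)
    return sorted(affected)


def dry_run(changed, depends_on):
    """Describe the expected changes without applying anything."""
    plan = [f"~ {changed} (changed)"]
    for name in affected_by(changed, depends_on):
        plan.append(f"~ {name} (may change)")
    return plan


for line in dry_run("network", DEPENDS_ON):
    print(line)
```

Changing the network flags the server and the database, while the queue stays out of the plan because nothing connects it to the network.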

1.5.3 Configuration management

Configuration management tools ensure that servers and computer systems run in the desired state. Most configuration management tools excel at device configuration, such as server installation and maintenance.

Definition A configuration management tool configures a set of servers or resources for the packages and attributes hosted on them.

For example, if you have 10,000 servers in your data center, how do you ensure that all of them run a specific version of the packages your security team approved? It does not scale for you to log into 10,000 servers and manually type commands to review. If you configure the servers with a configuration management tool, you can run a single command to review all 10,000 servers and execute the packages’ updates.
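The sketch below shows the kind of review such a tool performs, reduced to plain Python. It is not the interface of any configuration management tool. The inventory, the package name, the version numbers, and the function names are hypothetical, and the update step only modifies an in-memory dictionary instead of logging into servers.

```python
# Hypothetical approved versions from the security team.
APPROVED = {"openssl": "3.0.13"}


def find_noncompliant(inventory, approved):
    """Return {server: {package: installed_version}} for every mismatch."""
    report = {}
    for server, packages in inventory.items():
        mismatches = {
            package: packages.get(package)
            for package, version in approved.items()
            if packages.get(package) != version
        }
        if mismatches:
            report[server] = mismatches
    return report


def apply_updates(inventory, approved):
    """Set mismatched packages to the approved version; return the update count."""
    updates = 0
    for server, mismatches in find_noncompliant(inventory, approved).items():
        for package in mismatches:
            inventory[server][package] = approved[package]
            updates += 1
    return updates


# Hypothetical inventory of installed packages per server.
inventory = {
    "web-01": {"openssl": "3.0.13"},
    "web-02": {"openssl": "1.1.1"},
    "web-03": {},
}

print(find_noncompliant(inventory, APPROVED))
print(apply_updates(inventory, APPROVED))  # first run updates two servers
print(apply_updates(inventory, APPROVED))  # second run finds nothing to change
```

The second run makes no changes because every server already matches the approved version, which is the behavior you want when you rerun a tool to enforce configuration.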

A non-exhaustive list of configuration management tools that address this problem space includes the following:

  • Chef

  • Puppet

  • Red Hat Ansible

  • Salt

  • CFEngine

While this book focuses on provisioning tools and managing multiprovider systems, I will address some configuration management practices related to infrastructure testing, updating infrastructure, and security. Configuration management can help tune your server and network infrastructure.

Note I recommend books and tutorials for a configuration management tool of your choice for additional information. They will provide a more detailed guide specific to their design approach.

To add to the confusion, you might notice that some configuration management tools offer integrations with the data center and cloud providers. As a result, you might think of using your configuration tool as your provisioning tool. While possible, this approach may not be ideal because provisioning tools often have a different design approach for specifically addressing dependencies among infrastructure resources. I explore this nuance in the next chapter.

1.5.4 Image building

When you create a server, you must specify a machine image with an operating system. Image-building tools create images used for application runtime, whether a container or server.

Definition An image-building tool builds machine images for application runtimes, such as containers or servers.

Most image builders allow you to specify a runtime environment and build targets. Table 1.2 outlines a few tools, their supported runtime environments, and their build targets and platforms.

Table 1.2 Examples of image builders and providers

Tool | Runtime environment | Build target
--- | --- | ---
HashiCorp Packer | Containers and servers | Various (for a complete list, see www.packer.io/docs/builders)
Docker | Containers | Container registries
EC2 Image Builder | Servers | Amazon Web Services
Azure VM Image Builder | Servers | Microsoft Azure

I do not include detailed discussions on image building in this book. However, the patterns for testing, delivery, and compliance do have sidebars for image building in chapters 6–8. In the next chapter, you’ll learn about immutability, a critical paradigm that informs the approach of image builders.

Figure 1.14 shows how image building, configuration management, and provisioning tools work together. The process of deploying a new server configuration often starts with a configuration management tool, as you start building the foundations and testing whether your server configuration is correct.

After establishing the server configuration you want, you use an image builder to preserve the server’s image with its versioning and runtime. Finally, your provisioning tool references the image builder’s snapshot to create new production servers with the configuration you want.

Figure 1.14 Each type of IaC tool contributes to the life cycle of a server infrastructure resource, from configuration to image capture and deployment.

This workflow represents the ideal end-to-end approach for managing and deploying servers with IaC tooling. However, as you will learn, infrastructure can be complex, and this workflow may not apply to every use case. Various infrastructure systems and their dependencies complicate provisioning, which is why I chose provisioning as the primary focus of this book and why the examples use provisioning tools.

Summary

  • Infrastructure can be software, a platform, or hardware that delivers or deploys applications to production.

  • Infrastructure as code is a DevOps practice of automating infrastructure to achieve reliability, scalability, and security.

  • The principles of IaC are reproducibility, idempotency, composability, and evolvability.

  • By following the principles of IaC, you can improve change management processes, lower time spent on fixing failed systems in the long term, better share knowledge and context, and build security into your infrastructure.

  • IaC tools include provisioning, configuration management, and image-building tools.
