8 Security and compliance

This chapter covers

  • Choosing protections for credentials and secrets in IaC
  • Implementing policies to enforce compliant and secure infrastructure
  • Preparing end-to-end tests for security and compliance

In previous chapters, I alluded to the importance of securing infrastructure as code and checking its conformance with your organization’s security and compliance requirements. Oftentimes, you don’t address these requirements until later in your engineering process. By that point, you may have already deployed an insecure configuration or violated a compliance requirement about data privacy!

For example, imagine you work for a retail company called uDress. Your team has six months to build a new frontend application on GCP. The company needs it available by the holiday season. Your team works very hard and develops enough functionality to go live. However, a month before you deploy and test the new application, the compliance and security team performs an audit—and you fail.

Now, you have new items in your backlog to fix the security and compliance issues and adhere to company policy. Unfortunately, these fixes delay your delivery timeline or, at worst, break functionality. You might wish that you knew about these from the very beginning, at least so you could plan for them!

Your company’s policy ensures that systems comply with security, audit, and organizational requirements. In addition, your security or compliance teams often define policies based on industry, country, and more.

Definition A policy is a set of rules and standards in your organization to ensure compliance with security, industry, or regulatory requirements.

This chapter will teach you to protect credentials and secrets and write tests to enforce policies for security and compliance. If you think about these practices before you write IaC, you can build secure, compliant infrastructure and avoid delays in your delivery timeline. As a manager I used to work with put it, “We’re baking security into the infrastructure instead of icing it later.”

8.1 Managing access and secrets

I already introduced the idea of “baking” security into IaC in chapter 2. IaC uses two sets of secrets. You use API credentials to automate infrastructure and sensitive variables such as passwords to pass to resources. You can store both secrets in secrets managers to handle their protection and rotation.

In this section, I focus on securing IaC delivery pipelines. IaC expresses the expected state of the infrastructure, which often includes root passwords, usernames, private keys, and other sensitive information. Infrastructure delivery pipelines control the deployment and release of infrastructure that needs this information.

Let’s imagine you build delivery pipelines for the new uDress system to deploy infrastructure. The pipelines use a set of infrastructure provider credentials to create and update resources. Each pipeline also reads a database password from a secrets manager and passes it as an attribute to create the database.

Your security team points out two problems with your approach. First, the infrastructure delivery pipeline uses full administrative credentials to configure GCP. Second, your team’s delivery pipeline accidentally prints out the root database password in its logs!

Your delivery pipeline just increased the attack surface (sum of the different points of attack) of your system.

Definition The attack surface describes the sum of different points of attack where an unauthorized user can compromise a system.

Anyone can use the administrative credentials or root database password to gain information and compromise your system. You need a solution to better secure the credentials and the database password. The solution should hopefully minimize the attack surface.

8.1.1 Principle of least privilege

IaC delivery pipelines have points of attack that allow unauthorized users to use credentials with elevated access. For example, in chapter 7 you built a pipeline that continuously delivers infrastructure changes to production. The pipeline needs some permissions to change infrastructure in GCP.

Initially, your team gives the pipeline full administrative credentials so it can create all of the resources in GCP. If someone accesses those credentials, they could create and update anything in the uDress system. Someone could exploit your team’s pipeline to run machine learning models or access other customer data!

The pipeline does not need access to every resource. You decide to update the credentials so they use only the minimal set of permissions needed to update specific resources. You determine that the IaC creates only network, Google App Engine, and Cloud SQL resources. You remove administrative access from the credentials and replace it with write access to those three resource types.

When the pipeline runs, as shown in figure 8.1, the new credentials have just enough access to update the three sets of resources. The pipeline also retrieves the database password from the secrets manager before deploying updates to the network, application, and database. After deploying the changes to a testing environment, you add a unit test to verify that the credentials no longer have administrative access.

Figure 8.1 Remove administrative credentials from the uDress frontend delivery pipeline and limit them to network, application, and database access.
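A sketch of such a unit test might look like the following. The hardcoded roles list is illustrative; in practice, you would parse the roles from your IaC configuration rather than restating them in the test.

```python
# Sketch of a unit test checking that the pipeline's credentials no
# longer include GCP's broad "basic" roles (roles/owner, roles/editor).
# The hardcoded roles list is illustrative only.
BASIC_ROLES = {'roles/owner', 'roles/editor'}

pipeline_roles = [
    'roles/compute.networkAdmin',
    'roles/appengine.appAdmin',
    'roles/cloudsql.admin',
]


def test_pipeline_has_no_administrative_access():
    granted = BASIC_ROLES.intersection(pipeline_roles)
    assert not granted, (
        f'pipeline credentials must not include basic roles: {granted}')


if __name__ == '__main__':
    test_pipeline_has_no_administrative_access()
```

Running this test against the credentials with administrative access fails; running it against the reduced role set passes.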

You remediated the security concern of pipeline credentials by using the principle of least privilege. This principle ensures that a user or service account gets only the minimum access they require to complete their task.

Definition The principle of least privilege indicates that users or service accounts should have minimum access requirements to a system. They should have only enough to complete their tasks.

Maintaining the principle of least privilege takes time and effort. You usually change access as you add new resources to IaC. In general, attach roles to delivery pipeline credentials. Grouping access permissions into roles helps promote composability so you can add and remove access as needed.

Apply the module practices from chapter 3 to offer modules of permission sets. For example, you may offer a factory module for uDress’s web applications to customize network, application, and database write access. Any web application can use the module and properly reproduce the minimum set of privileges it needs.

Let’s use the access management module to implement least privilege access management for uDress’s frontend delivery pipeline in listing 8.1. You limit the pipeline to network, application, and Cloud SQL administrative credentials. These credentials allow the pipeline to create, delete, and update the network, application, and database, but not update any other resource types.

Listing 8.1 Least privilege access management policy for the frontend

import json
import iam


def build_frontend_configuration():
    name = 'frontend'
    roles = [
        'roles/compute.networkAdmin',
        'roles/appengine.appAdmin',
        'roles/cloudsql.admin'
    ]

    frontend = iam.ApplicationFactoryModule(name, roles)
    resources = {
        'resource': frontend._build()
    }
    return resources


if __name__ == "__main__":
    resources = build_frontend_configuration()

    with open('main.tf.json', 'w') as outfile:
        json.dump(resources, outfile, sort_keys=True, indent=4)

Creates the role configuration based on a list of roles for a service account, including networking, App Engine, and Cloud SQL

Imports the application access management factory module to create access management roles for the frontend application

Uses the method to create the JSON configuration for the pipeline’s access permissions

Writes the Python dictionary out to a JSON file to be executed by Terraform later
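Listing 8.1 imports an `iam` module that the chapter does not show. A minimal sketch of what `ApplicationFactoryModule` might look like follows, under the assumption that it builds a service account and one project IAM binding per role; the resource structure and project name are my assumptions, not the book’s actual module.

```python
import json


class ApplicationFactoryModule:
    """Hypothetical factory that builds a service account and one
    project IAM member binding per role, as Terraform JSON."""

    def __init__(self, name, roles, project='udress-project'):
        self.name = name
        self.roles = roles
        self.project = project

    def _build(self):
        # The service account the delivery pipeline authenticates as
        resources = [{
            'google_service_account': [{
                self.name: [{
                    'account_id': self.name,
                    'display_name': f'{self.name} delivery pipeline'
                }]
            }]
        }]
        # One least-privilege role binding per entry in the roles list
        for i, role in enumerate(self.roles):
            resources.append({
                'google_project_iam_member': [{
                    f'{self.name}_{i}': [{
                        'project': self.project,
                        'role': role,
                        'member': ('serviceAccount:'
                                   f'{self.name}@{self.project}'
                                   '.iam.gserviceaccount.com')
                    }]
                }]
            })
        return resources


if __name__ == '__main__':
    module = ApplicationFactoryModule(
        'frontend', ['roles/compute.networkAdmin'])
    print(json.dumps({'resource': module._build()}, indent=2))
```

Any web application team can instantiate the factory with its own role list and reproduce the same least-privilege structure.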

AWS and Azure equivalents

Google App Engine is similar to AWS Elastic Beanstalk or Azure App Service, which deploy web applications and services to provider-managed infrastructure.

Google Cloud SQL is similar to Amazon Relational Database Service (RDS), which deploys different managed databases. Azure has different services for specific databases, such as Azure Database for PostgreSQL or Azure SQL Database offerings.

As you adhere to the principle of least privilege, take care when you remove permissions. Sometimes a pipeline needs more specific permissions to read or update dependencies. You can break infrastructure or applications if they do not have sufficient permissions.

Some infrastructure providers, including GCP, analyze the permissions used for a service account or user and output a set of excess permissions. You can also run other third-party tools to analyze access and identify unused permissions. I recommend using these tools to check and update your access control each time you add a new infrastructure resource.

8.1.2 Protecting secrets in configuration

Besides using administrative credentials from a pipeline to access an infrastructure provider, someone could alter the pipeline to print out sensitive information about infrastructure. For example, the frontend delivery pipeline outputs the root database password in the logs. Anyone accessing the logs from the pipeline can use the root password to log into the database!

To address this security concern, you decide to mark the password as a sensitive variable by using your IaC tool. The tool redacts the password in the logs. You also install a plugin in your pipeline tool to identify and redact any sensitive information, such as the password. You add these two configurations to your pipeline in figure 8.2 to avoid compromising the database password in the pipeline logs. As a safety precaution, you rotate the database password in the secrets manager and directly change the password in the database, rather than using IaC.

You can use tools to mask the password in the delivery pipeline by either suppressing or redacting the plaintext information.

Definition Masking your sensitive information means suppressing or redacting its plaintext format to prevent someone from reading the information.

Using one or both mechanisms will prevent the sensitive information from appearing in the pipeline logs. Sensitive information can include passwords, encryption keys, or infrastructure identifiers like IP addresses. If you think someone can use the information to gain access to your system, consider masking the value in your pipeline.
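With Terraform’s JSON syntax, for example, marking the database password as sensitive might look like the following sketch, so the tool redacts the value in plan and apply output. The variable and output names are illustrative.

```python
import json

# Declare the database password as a sensitive Terraform variable and
# mark the derived output as sensitive, so Terraform redacts both in
# its logs. Names here are illustrative.
configuration = {
    'variable': [{
        'database_password': [{
            'type': 'string',
            'sensitive': True
        }]
    }],
    'output': [{
        'database_connection': [{
            'value': '${google_sql_database_instance.frontend.connection_name}',
            'sensitive': True
        }]
    }]
}

if __name__ == '__main__':
    with open('variables.tf.json', 'w') as outfile:
        json.dump(configuration, outfile, sort_keys=True, indent=4)
```

The pipeline plugin that redacts sensitive strings provides a second, independent layer on top of the tool-level masking.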

However, masking sensitive information doesn’t guarantee protection from unauthorized access. You still need a workflow to remediate the exposed credentials as quickly as possible. As a solution, store and rotate the credentials with a secrets manager after you use them to configure your IaC.

Figure 8.2 You can protect the root database password by using tools to mask the value and rotating the credential after applying changes to IaC.

Separately managing secrets introduces mutability, or in-place changes, to your IaC. While it introduces drift between the actual root database password and the one expressed in IaC, managing the password mutably prevents someone from exploiting the IaC pipeline and using the credentials.

As you build IaC, think about this checklist of security requirements in your delivery pipeline to minimize its attack surface:

  1. Check for least privilege access for infrastructure provider credentials from the beginning. You should provide enough permissions to apply and secure your IaC.

  2. Generate a secret by using a function to generate a random string or read the secret from a secrets manager. Avoid passing secrets as static variables to your configuration.

  3. Check that your pipeline masks sensitive configuration data in its dry-run capability or command outputs.

  4. Provide a mechanism to revoke and rotate compromised credentials or data quickly.

You can solve many of the requirements in the checklist with a secrets manager. The secrets manager eliminates the need to statically define secrets in configuration. While some requirements serve as general security practices for delivery pipelines, they also apply to secure IaC. You can review chapter 2 for the pattern of securing secrets with a secrets manager.
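For example, instead of passing the password as a static variable, the configuration can read it from the secrets manager at apply time. The following sketch uses GCP Secret Manager’s Terraform data source; the secret name is hypothetical.

```python
import json

# Read the database password from GCP Secret Manager at apply time
# instead of passing it as a static variable. The secret name is
# hypothetical.
def build_secret_configuration():
    return {
        'data': [{
            'google_secret_manager_secret_version': [{
                'database_password': [{
                    'secret': 'udress-frontend-database-password'
                }]
            }]
        }]
    }


if __name__ == '__main__':
    with open('secrets.tf.json', 'w') as outfile:
        json.dump(build_secret_configuration(), outfile,
                  sort_keys=True, indent=4)
```

Rotating the secret then happens in the secrets manager, not in your configuration files or pipeline variables.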

8.2 Tagging infrastructure

After securing your infrastructure, you have the challenge of running and supporting it. Operating infrastructure requires a set of troubleshooting and auditing patterns and practices. As you continue to add infrastructure to your system, you need a way to identify the purpose and life cycle of resources.

Imagine the uDress frontend application goes live. However, your team gets a message from the finance team. Your infrastructure provider billing has exceeded the expected budget for the past two or three months. You search in the provider’s interface to determine which resources have contributed most to the cost. How do you know the owner and environment of each resource?

GCP offers labels, which allow you to add metadata to your resources for identification and audit purposes. You update your resources’ labels to include owner and environment. In figure 8.3, uDress includes identification of owner and environment, standards for tag format, and automation metadata. You decide to dash-delimit tag names and values so the tags work with GCP.

Figure 8.3 Tags should include identification of owner, environment, and automation for easy troubleshooting.

Outside of GCP, other infrastructure providers allow you to add metadata to identify resources. In your organization, you’ll develop a tagging strategy to define a standard set of metadata used for auditing your infrastructure system.

Definition A tagging strategy defines a set of metadata (also known as tags) used for auditing, managing, and securing infrastructure resources in your organization.

Why use metadata in the form of tags? Tags help you search and audit resources, actions necessary for billing and compliance. You can also use tags to do bulk automation of infrastructure resources. Bulk automation includes cleanup or break-glass (manual changes to stabilize or fix system failures) updates to a subset of resources.
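As a sketch of bulk automation, the following selects only resources whose labels match a filter, such as automated production resources targeted for cleanup. The resource list and label values are made up for illustration.

```python
# Bulk automation by tags: select only the resources whose labels
# match every filter, e.g., automated production resources. The
# resource data below is made up for illustration.
resources = [
    {'name': 'frontend-server',
     'labels': {'environment': 'production', 'automated': 'true'}},
    {'name': 'scratch-server',
     'labels': {'environment': 'testing', 'automated': 'false'}},
]


def select_by_labels(resources, **labels):
    """Return resources whose labels match every key/value given."""
    return [
        r for r in resources
        if all(r.get('labels', {}).get(k) == v for k, v in labels.items())
    ]


if __name__ == '__main__':
    for r in select_by_labels(resources,
                              environment='production', automated='true'):
        print(r['name'])
```

The same filtering idea applies through provider CLIs and APIs, which typically support querying resources by label.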

Let’s implement standard tags for uDress in the following listing. From chapter 3, you apply the prototype pattern to define a list of standard tags for uDress. You reference the uDress tag module to create a list of labels for a GCP server in your code.

Listing 8.2 Using the tags module to set standard tags for the server

class TagsPrototypeModule():
    def __init__(
            self, service, department,
            business_unit, company, team_email,
            environment):
        self.resource = {
            'service': service,
            'department': department,
            'business-unit': business_unit,
            'company': company,
            'email': team_email,
            'environment': environment,
            'automated': True,
            'repository': f"{company}-{service}-infrastructure"
        }


class ServerFactory:
    def __init__(self, name, network, zone='us-central1-a', tags={}):
        self.name = name
        self.network = network
        self.zone = zone
        self.tags = TagsPrototypeModule(
            'frontend', 'web', 12345, 'udress',
            'frontend@udress.com', 'production')
        self.resource = self._build()

    def _build(self):
        return {
            'resource': [
                {
                    'google_compute_instance': [
                        {
                            self.name: [
                                {
                                    'allow_stopping_for_update': True,
                                    'boot_disk': [
                                        {
                                            'initialize_params': [
                                                {
                                                    'image': 'ubuntu-1804-lts'
                                                }
                                            ]
                                        }
                                    ],
                                    'machine_type': 'f1-micro',
                                    'name': self.name,
                                    'network_interface': [
                                        {
                                            'network': self.network
                                        }
                                    ],
                                    'zone': self.zone,
                                    'labels': self.tags.resource
                                }
                            ]
                        }
                    ]
                }
            ]
        }

The tag module uses the prototype pattern to define a standard set of tags.

Sets tags to identify owner, department, business unit for billing, and repository for the resource

Passes the required parameters to set tags for the frontend application

Uses the module to create the JSON configuration for the server

Creates the Google compute instance (server) by using a Terraform resource

Adds the tags from the tag module as labels to the Google compute instance

AWS and Azure equivalents

To convert listing 8.2 to another cloud provider, change the resource to an Amazon EC2 instance or Azure Linux virtual machine. Then, pass self.tags to the tags attribute for the AWS or Azure resource.

How do you know which tags to add? Recall from chapter 2 that you must standardize the naming and tagging of your infrastructure resources. Discuss these considerations with compliance, security, and finance teams. That will help determine which tags you need and how to use them. At a minimum, I always have a tag for the following:

  • Service or team

  • Team email or communication channel

  • Environment (development or production)

For example, let’s say the uDress security team audits the frontend resources and discovers some misconfigured infrastructure. The team members can check the tags, identify the service and environment with the problem, and reach out to the team that created the resource.

You may also include tags for the following:

  • Automation, which helps you identify manually created resources from automated ones

  • Repository, which allows you to correlate the resource with its original configuration in version control

  • The business unit, which identifies the billing or charge-back identifier for accounting

  • Compliance, which identifies whether the resource has compliance or policy requirements for handling personal information

As you decide on your tagging, make sure it conforms to a general set of constraints so you can apply the same tags across any infrastructure provider. Most infrastructure providers have character restrictions on tags. I usually prefer dash-case, which uses lowercase tag names and values split with hyphens. While you can use camel case (stylistically, camelCase), not all providers have case-sensitive tagging.

Tag character limits also vary depending on the infrastructure provider. Most providers support a maximum length of 128 characters for the tag key and 256 characters for the tag value. You will have to balance the verbosity of descriptive names (described in chapter 2) with the provider’s tag limits!
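The constraints described above can be checked automatically. The following sketch validates dash-case naming and the common 128/256-character limits; the limits are configurable because they vary by provider.

```python
import re

# Validate a tag against cross-provider constraints: dash-case names
# and provider length limits (128-character keys and 256-character
# values are common maximums; adjust per provider).
DASH_CASE = re.compile(r'^[a-z0-9]+(-[a-z0-9]+)*$')


def validate_tag(key, value, max_key=128, max_value=256):
    """Return a list of constraint violations (empty when valid)."""
    errors = []
    if len(key) > max_key:
        errors.append(f'key "{key}" exceeds {max_key} characters')
    if len(str(value)) > max_value:
        errors.append(f'value for "{key}" exceeds {max_value} characters')
    if not DASH_CASE.match(key):
        errors.append(f'key "{key}" is not dash-case')
    return errors


if __name__ == '__main__':
    print(validate_tag('business-unit', '12345'))
    print(validate_tag('BusinessUnit', '12345'))
```

You can run a validator like this in your pipeline so nonconforming tags fail before they reach any provider.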

Another part of your tagging strategy involves deciding whether you delete untagged resources. Consider enforcing tags for all resources in the production environment. The testing environment can support untagged resources for manual testing. In general, I do not recommend immediately deleting untagged resources without careful examination. You don’t want to delete an essential resource by accident.

8.3 Policy as code

Securing access and secrets in infrastructure delivery pipelines and managing tags in infrastructure providers can improve security and compliance practices. However, you might wish to identify insecure or noncompliant infrastructure configuration before it goes to production. You’d like to catch a problem before someone finds it in your production system.

Imagine connecting the uDress frontend application to another database. You open a firewall rule to allow all traffic inbound to a managed database for testing. After testing, you expect to remove the database, so you do not tag it.

You forget about the firewall and tag configuration and send it off for review. Unfortunately, your teammate misses them in code review and pushes the changes to production. Two weeks later, you discover that an unknown entity has accessed some data! However, you have no tags to identify the compromised database.

What could you have done differently? Recall the importance of unit tests or static analysis of infrastructure configuration in chapter 6. You can apply the same techniques to write tests specifically for security and policy.

Rather than depend on a teammate to catch the problem, you can express policy as code to statically analyze the configuration for the permissive firewall rule or lack of tags. Policy as code tests infrastructure metadata and verifies that it complies with security or compliance requirements.

Definition Policy as code (also known as shift-left security testing, or static analysis of IaC) tests infrastructure metadata and verifies that values comply with security or compliance requirements before pushing changes to production. Policy as code includes the rules you write for dynamic analysis tools or vulnerability scanning.

I discussed the long-term benefit of automating and testing IaC in chapters 1 and 6. You similarly have an initial short-term time investment for writing policy as code. The policy checks continuously verify the compliance of each change you want to make to production. You minimize the surprises after the compliance and security teams audit your system. Over time, you decrease the long-term time investment with a shorter time to production.

8.3.1 Policy engines and standards

Tools can help run policy as code by evaluating metadata based on a set of rules. Most testing tools in this space use a policy engine. A policy engine takes policies as input and evaluates infrastructure resources for compliance.

Definition A policy engine takes policies as the input and evaluates resource metadata for compliance to policies.

Many policy engines parse and check fields in infrastructure configuration or state. In figure 8.4, a policy engine extracts JSON or other metadata from the IaC or system state. Then it passes the metadata to a security or policy test. The engine runs the test to parse fields, check their values, and fail if the actual values do not match the expected values.

Figure 8.4 Tests for security and policy parse the configuration or state of the system for the correct field values and fail if they do not match an expected value.

This workflow applies to policy as code tools and any tests you write yourself. Policy as code tools make testing for values more straightforward because the tools abstract the complexity of parsing for fields and checking the values. However, tools don’t cover every value or use case you want to test.

As a result, you usually write your own policy engine to suit your purposes. In the examples for this chapter, I use pytest, a Python testing framework, as a primitive “policy engine” to check for a secure and compliant configuration.
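For instance, a primitive tag-compliance policy in pytest style might look like the following sketch. The configuration structure and required label names are assumptions for illustration.

```python
# A primitive "policy engine": assert every resource in a parsed
# configuration carries the required labels. The configuration
# structure and label names are assumptions for illustration.
REQUIRED_LABELS = {'service', 'email', 'environment'}

configuration = {
    'google_sql_database_instance': {
        'frontend': {
            'labels': {
                'service': 'frontend',
                'email': 'frontend@udress.com',
                'environment': 'production',
            }
        }
    }
}


def test_resources_have_required_labels():
    for resource_type, instances in configuration.items():
        for name, attributes in instances.items():
            missing = REQUIRED_LABELS - set(attributes.get('labels', {}))
            assert not missing, (
                f'{resource_type}.{name} is missing labels: {missing}')


if __name__ == '__main__':
    test_resources_have_required_labels()
```

A test like this would have flagged the untagged testing database before your teammate approved the change.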

Policy engines

The policy as code ecosystem has different tools for different purposes. Most tools fall into one of three use cases, which vary widely in function and behavior:

  1. Security tests for specific platforms

  2. Policy tests for industry or regulatory standards

  3. Custom policies

Table 8.1 includes a non-exhaustive list of policy engines for provisioning tools, both vendor and open source. I’ve outlined a few of the technology integrations and the use case category for each tool.

Table 8.1 Examples of policy engines for provisioning tools

  • AWS CloudFormation Guard
    Use case(s): security tests for specific platforms; custom policies
    Technology integration(s): AWS CloudFormation

  • HashiCorp Sentinel
    Use case(s): security tests for specific platforms; custom policies
    Technology integration(s): HashiCorp Terraform

  • Pulumi CrossGuard
    Use case(s): security tests for specific platforms; custom policies
    Technology integration(s): Pulumi SDK

  • Open Policy Agent (underlying technology for Fugue, Conftest, Kubernetes Gatekeeper, and more)
    Use case(s): security tests for specific platforms (tool-dependent); policy tests for industry or regulatory standards (tool-dependent); custom policies
    Technology integration(s): various (for a complete list, see www.openpolicyagent.org/docs/latest/ecosystem/)

  • Chef InSpec
    Use case(s): security tests for specific platforms; custom policies
    Technology integration(s): various (for a complete list, search the Chef marketplace at https://supermarket.chef.io)

  • Kyverno
    Use case(s): security tests for specific platforms; custom policies
    Technology integration(s): Kubernetes

No single tool covers every use case, so you often need to mix and match them. Some tools offer customization, which you can use to build policies of your own. In general, consider extending an existing tool with custom policies so you can establish opinionated patterns and defaults with your security, compliance, and engineering teams. In reality, you’ll probably adopt five or six policy engines to cover the tools, platforms, and policies you need.

Note that I do not include any security or policy tooling specific to data center appliances, which often depend on your organization’s procurement requirements. You may also find some community projects outside of the examples listed in table 8.1. I often find these tools and their integrations replaced by newer ones since the ecosystem changes rapidly.

Image building and configuration management

Few security or policy tools exist for image building, so you tend to write your own tests. Configuration management tools follow a similar approach to provisioning tools: you will need to find community or built-in tools that verify security and policy configuration.

Industry or regulatory standards

You might examine table 8.1 and discover that few tools include policy tests for industry or regulatory standards. Most of these policies exist in documentation form, and you often have to write them yourself. On occasion, you can find policy test suites created by the community that you’ll need to augment with your own.

For example, the National Institute of Standards and Technology (NIST) in the United States publishes a list of security benchmarks as part of the National Checklist Program (https://ncp.nist.gov/repository). A reviewer for this book also recommended Security Technical Implementation Guides (STIGs) from the US Department of Defense, including technical testing and configuration standards.

Note Yes, I am missing many tools or standards in this section. The standards I included apply to the United States and not necessarily worldwide. By the time you read this, policy engines will have changed features, integrations, or open source status, and the industry or regulatory standards will have updated drafts. If you’d like to recommend one, please let me know at https://github.com/joatmon08/tdd-infrastructure.

8.3.2 Security tests

What should you test to secure your infrastructure? Some policy as code tools offer opinionated defaults that capture best practices for a secure system. However, you might need to write your own for your company’s specific platforms and infrastructure.

Let’s start fixing your database security breach. Fortunately, the testing data did not contain anything important. However, in the future, you don’t want a teammate copying the testing configuration and deploying it to production. To catch the permissive configuration before it reaches production, you write a test for the database’s network rules.

The database needs a very restrictive, least-privilege (minimum access) firewall rule. Figure 8.5 shows how you implement a test to retrieve the firewall configuration from IaC. The configuration goes to a test, which parses the source range from the firewall rule. If the range contains a permissive rule, 0.0.0.0/0, the test fails.

Figure 8.5 Retrieve the source range value from the firewall rule configuration and determine if it contains an overly permissive range.

GCP uses 0.0.0.0/0 to denote that any IP address can access the database. If someone gains access to your network, they can access your database if they have the username and password. Your new test fails before an overly permissive rule like 0.0.0.0/0 goes to production.

Listing 8.3 implements the test for the firewall rule in Python. In your test, you implement code to open the JSON configuration file, retrieve the source_ranges list, and check if the list contains 0.0.0.0/0.

Listing 8.3 Using a test to parse the firewall rule for 0.0.0.0/0

import json
import pytest
from main import APP_NAME, CONFIGURATION_FILE


@pytest.fixture(scope="module")
def resources():
    with open(CONFIGURATION_FILE, 'r') as f:
        config = json.load(f)
    return config['resource']


@pytest.fixture
def database_firewall_rule(resources):
    return resources[0][
        'google_compute_firewall'][0][APP_NAME][0]


def test_database_firewall_rule_should_not_allow_everything(
        database_firewall_rule):
    assert '0.0.0.0/0' not in database_firewall_rule['source_ranges'], (
        'database firewall rule must not '
        'allow traffic from 0.0.0.0/0, specify source_ranges '
        'with exact IP address ranges')

Loads the infrastructure configuration from a JSON file

Parses the resource block out of the JSON configuration file

Parses the Google compute firewall resource defined by Terraform from the JSON configuration

Uses a descriptive test name explaining the policy for the firewall rule, which should not allow all traffic

Checks that 0.0.0.0/0, or allow all, is not defined in the rule’s source ranges

Uses a descriptive error message describing how to correct the firewall rule, such as removing 0.0.0.0/0 from source ranges

AWS and Azure equivalents

A firewall rule in GCP is equivalent to an AWS security group (http://mng.bz/Qvvm) or Azure network security group (http://mng.bz/XZZY). To update the code, create a security group resource in the cloud provider of your choice. Then, edit the test to switch GCP’s source_ranges with the security_rule.source_address_prefix attribute for Azure or the ingress.cidr_blocks attribute for AWS.

Imagine your new teammate wants to run some tests on the database from their laptop. They make a change to open the firewall rule to 0.0.0.0/0 in IaC. The pipeline runs the Python code to generate JSON:

$ python main.py

The pipeline runs unit tests against the JSON configuration file. It recognizes that the firewall rule contains 0.0.0.0/0 in the list of allowed source ranges and throws an error:

$ pytest test_security.py
====== short test summary info ======
FAILED test_security.py::test_database_firewall_rule_should_not_allow_everything - 
     AssertionError: database firewall rule must not allow traffic 
     from 0.0.0.0/0, specify source_ranges with exact IP address ranges
===== 1 failed in 0.04s ======

Your teammate reads the error description and realizes that the firewall rule should not allow all traffic. They can correct their configuration to add their laptop IP address to source ranges.

Just like functional tests in chapter 6, security tests educate the rest of your team on ideal secure practices for infrastructure. While the tests do not necessarily catch all security violations, they communicate important information about security expectations. Moving security best practices from unknown knowns to known knowns helps eliminate repeated mistakes.

These tests also help scale security practices in your organization. Your teammate feels empowered to correct the configuration. Furthermore, your security team has fewer investigations and follow-ups for security violations. Making security part of everyone’s responsibility reduces the time and effort for future remediation.

Positive versus negative testing

In the example of the database IP address range, you checked that an IP address range does not match every IP address (0.0.0.0/0). Called negative testing, this process asserts that the value does not match. You can also use positive testing to assert that attributes do match an expected value.

Some references suggest that you express all security or policy tests with one type. However, I usually write tests with both positive and negative testing assertions. The combination better expresses the intent of the security and policy requirement. For example, you can use the negative test to check for any IP address range against any infrastructure configuration written by any team. On the other hand, if you have an IP address range that every firewall rule must include, such as a VPN connection, you can use a positive test.
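A positive test for the VPN scenario could look like the following sketch. The required range and resource shape are hypothetical, assumed only for illustration:

```python
# Hypothetical requirement: every firewall rule must allow the
# company VPN range. The range and rule shape are illustrative.
REQUIRED_VPN_RANGE = '10.8.0.0/24'


def includes_vpn_range(firewall_rule):
    """Return True if the rule allows traffic from the VPN range."""
    return REQUIRED_VPN_RANGE in firewall_rule.get('source_ranges', [])


compliant_rule = {'source_ranges': ['10.8.0.0/24', '192.168.1.0/24']}
noncompliant_rule = {'source_ranges': ['192.168.1.0/24']}
print(includes_vpn_range(compliant_rule))     # True
print(includes_vpn_range(noncompliant_rule))  # False
```

The positive assertion (the range must appear) reads closer to the intent of the requirement than a list of negative assertions about every range it must not contain.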

You can write tests to check other secure configurations, including these:

  • Ports, IP ranges, or protocols on other network policies

  • Access control for no administrative or root access of infrastructure resources, servers, or containers

  • Metadata configuration to mitigate exploitation of instance metadata

  • Access and audit logging configuration for security information and event management (SIEM), such as for load balancers, IAM, or storage buckets

  • Package or configuration versions for fixed vulnerabilities

This non-exhaustive list covers some general configurations. However, you should consult with your security team or other industry benchmarks for additional information and tests.
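For instance, the audit-logging item in the preceding list could be sketched as a check that a storage bucket defines a logging destination. The resource shape below is a hypothetical simplification of a Terraform google_storage_bucket logging block:

```python
# Hypothetical check that a storage bucket sends access logs to a
# destination bucket for SIEM. The resource shape is illustrative.
def bucket_has_access_logging(bucket):
    """Return True if the bucket defines a logging destination."""
    logging = bucket.get('logging', [])
    return len(logging) > 0 and bool(logging[0].get('log_bucket'))


logged = {'logging': [{'log_bucket': 'udress-audit-logs'}]}
unlogged = {'logging': []}
print(bucket_has_access_logging(logged))    # True
print(bucket_has_access_logging(unlogged))  # False
```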

8.3.3 Policy tests

Security tests verify that you minimize the attack surface of misconfiguration in your IaC. However, you need other tests for auditing, reporting, billing, and troubleshooting. For example, your testing database should have a tag on it so someone can identify its owner and report the security breach.

The uDress compliance team reminds you to add tags to your GCP database so they can identify the database owner. They also notify you that the security breach caused the database resources to scale, which increased your cloud computing bill. Without tags, the compliance team had a difficult time identifying who to contact about the security problem and the increased bill.

You add tags to the database configuration. To remind yourself of tagging in the future, you use the workflow shown in figure 8.6 to implement a unit test to check for tags. As with the security test for the firewall rule configuration, you parse a JSON file with the database configuration to check that you correctly filled the labels with tags. If the test has empty labels, the test fails.

Figure 8.6 You implement a test that parses the database configuration and checks for a list of tags in the database’s user labels.

The policy test behaves similarly to the security test. However, it tests the tags instead of the IP source ranges. While the policy does not better secure the infrastructure, it improves your ability to troubleshoot and identify resources.

Let’s implement the test workflow. In the following listing, you write a test to check that you have more than zero tags under the GCP user_labels parameter.

Listing 8.4 Using a test to parse the database configuration for tags

import json
import pytest
from main import APP_NAME, CONFIGURATION_FILE
 
 
@pytest.fixture(scope="module")
def resources():
    with open(CONFIGURATION_FILE, 'r') as f:
        config = json.load(f)
    return config['resource']


@pytest.fixture
def database_instance(resources):
    return resources[2][
        'google_sql_database_instance'][0][APP_NAME][0]


def test_database_instance_should_have_tags(database_instance):
    assert database_instance['settings'][0]['user_labels'] \
        is not None
    assert len(
        database_instance['settings'][0]['user_labels']) > 0, \
        'database instance must have `user_labels` ' + \
        'configuration with tags'

Loads the infrastructure configuration from a JSON file

Parses the resource block out of the JSON configuration file

Parses user labels in the Google SQL database instance defined by Terraform from the JSON configurations

Uses a descriptive test name explaining the policy for tagging the database

Checks that the user labels on the database do not have an empty list or null value

Uses a descriptive error message describing the addition of tags to GCP user labels

AWS and Azure equivalents

To convert listing 8.4 to AWS or Azure, change the Google SQL database instance to a PostgreSQL offering from either cloud. You can use AWS RDS (http://mng.bz/yvvJ) or Azure Database for PostgreSQL (http://mng.bz/M552). Then, parse the database instance resource for the tags attribute. Both Azure and AWS use tags.
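A rough sketch of the AWS variant follows. The tags attribute is standard for AWS resources, but the resource shape and helper below are hypothetical:

```python
# Hypothetical tagging check for an AWS database resource parsed
# from JSON. The resource shape is illustrative only.
def database_has_tags(database_instance):
    """Return True if the database defines at least one tag."""
    tags = database_instance.get('tags') or {}
    return len(tags) > 0


print(database_has_tags({'tags': {'owner': 'frontend-team'}}))  # True
print(database_has_tags({'tags': {}}))                          # False
```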

You add the test to your security tests. The next time your teammate makes a change and forgets tags, the test fails. Your teammate reads the error message and corrects their IaC to include the tags. You can implement tests for other organizational policies, including these:

  • Required tags for all resources

  • Number of approvers for a change

  • The geographic location of an infrastructure resource

  • Log outputs and target servers for auditing

  • Separate development data from production data

This non-exhaustive list covers some general configurations. However, you should consult with your compliance team or other industry benchmarks for additional information and tests. As you write your tests, ensure that you include clear error messages outlining which policies the test checks.
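As one more illustration, the geographic-location policy in the preceding list could be sketched as an allow list of regions. The region names and resource shape are hypothetical:

```python
# Hypothetical location policy: resources may run only in regions
# approved by the compliance team. The allow list is illustrative.
ALLOWED_REGIONS = {'us-central1', 'us-east1'}


def region_is_compliant(resource):
    """Return True if the resource runs in an approved region."""
    return resource.get('region') in ALLOWED_REGIONS


print(region_is_compliant({'region': 'us-central1'}))   # True
print(region_is_compliant({'region': 'europe-west1'}))  # False
```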

8.3.4 Practices and patterns

As you write more security and policy tests, you gain confidence that your configuration remains secure and compliant. How can you teach this across your team and company? You can apply some of the practices and patterns for testing to checking infrastructure security and compliance. Next, I’ll cover the practices and patterns for writing security and policy tests in greater detail.

Use detailed test names and error messages

You’ll notice detailed test names and error messages for the uDress policy and security tests. These names and messages seem verbose but communicate to teammates precisely what the policy looks for and how they should correct it! I introduced a technique in chapter 2 to verify the quality of your naming and code. Try asking someone else to read the test. If they can understand its purpose, they can update their configuration to conform to the policy as code.

Modularize tests

You can apply some of the module patterns from chapter 3 to policy as code. For example, the uDress payments team asks to borrow your security and policy tests for their infrastructure. You divide your database policies into database-tests, and firewall policies into firewall-tests.

The security team also asks you to add a Center for Internet Security (CIS) benchmark. This industry benchmark includes tests to verify the best practices for secure configuration on GCP. After adding the security benchmark, you realize that you have too many tests to track in multiple repositories.

Figure 8.7 moves all of these tests into a repository named gcp-security-tests. The repository organizes all tests for uDress’s GCP infrastructure. The uDress frontend and payments teams can reference a shared repository, import the tests, and run them against their configuration. Meanwhile, the security team can update the security benchmarks in one place in the gcp-security-tests repository.

Figure 8.7 Add policy as code to a shared repository for distribution across all teams creating infrastructure.

As with your infrastructure approach to code repository structures, you can choose to put your organization’s policy as code in a single repository or divide it across multiple repositories based on the environment. In either structure, make sure all teams have visibility into security and policy tests for the organization to learn how to deploy compliant infrastructure.

Furthermore, divide the tests based on business unit, function, environment, infrastructure provider, infrastructure resource, stack, or benchmark. You want to evolve the types of tests individually as your business changes. Some business units may need one type of test, while another may not. Dividing the tests and running them selectively helps.
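One way to sketch this selective division with pytest is to tag tests with markers. The marker names below are hypothetical; register them in pytest.ini so pytest does not warn about unknown markers:

```python
import pytest

# Hypothetical markers dividing shared policy tests by function.
# A team selects what applies to its infrastructure at run time:
#   pytest -m firewall        # run only firewall policies
#   pytest -m "not advisory"  # skip advisory checks


@pytest.mark.firewall
def test_firewall_rule_should_not_allow_everything():
    pass  # firewall policy assertion from listing 8.3 goes here


@pytest.mark.database
def test_database_instance_should_have_tags():
    pass  # tagging policy assertion from listing 8.4 goes here
```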

Add policy as code to delivery pipelines

Your team wants to make sure to run the security and policy tests before pushing to production, so they add them as a stage of their delivery pipeline. Policy as code runs after deploying the changes to a testing environment but before releasing to production. You get fast feedback on infrastructure changes, prioritizing functionality but checking policy before production.

The security team also adds policy as code to scan the running production environment. This dynamic analysis continuously verifies the security and compliance of any emergency or break-glass changes to infrastructure.

Figure 8.8 shows the workflow of static analysis in a delivery pipeline and dynamic analysis of running infrastructure to check configuration changes and address issues in resources proactively. After deploying the changes to a testing environment, you run the security and policy tests. They should pass before releasing the changes to production. When the resources get the changes, you scan the running infrastructure with similar tests for runtime security and policy checks.

Figure 8.8 Tests for security and policy check for improper infrastructure configuration and prevent the changes from going to production.

You may have different tests for static and dynamic analysis. Some tests, such as live verification of endpoint access, can work on only running infrastructure. As a result, you want to run some tests before and after you push to production.

If your static analysis tests take too long, you could run a subset of tests after you push the change to production. However, you will have to quickly remediate any security or compliance violations. As a result, I recommend running the most critical security and policy tests as part of your pipeline.

Image building

You might encounter the practice of building immutable server or container images. By baking the packages you want into a server or container image, you can create new servers with updates without the problems of in-place updates.

Use the same workflow of the infrastructure pipeline with policy as code to build the immutable images. The workflow includes unit tests to check the scripts for specific installation requirements, such as a company package registry, and integration tests against a test server to verify that the package versions comply with policy and security.

You can always use dynamic analysis in the form of an agent to scan a server and make sure its configuration complies with rules. For example, Sysdig offers Falco, a runtime security tool that runs on a server and checks for rule compliance.

Most teams do not want their security or policy tests to block all changes from going to production. For example, what if customers need to access public endpoints for infrastructure? The test to check for a private network only may not apply. Sometimes you find exceptions in your security policy.

Define enforcement levels

As you build more policy as code, you must identify the most important ones and make exceptions for others. For example, the uDress security team identifies the database tagging policy to be hard mandatory. The delivery pipeline must fail if it does not find tags, and someone must add the tags.

You define three categories of policy, as shown in figure 8.9. The security team mandates that you fix the database tags before pushing to production. However, the team makes an exception for your firewall rule because customers need access to your endpoint. The team also includes some advice on more secure infrastructure configurations.

Figure 8.9 You can divide policy as code into three enforcement categories that gate changes before production.

I classify policy as code into three categories of enforcement (borrowed from HashiCorp Sentinel’s terminology):

  • Hard mandatory for required policies

  • Soft mandatory for policies that may require manual analysis for an exception

  • Advisory for knowledge-sharing of best practices
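As a minimal sketch, you could record an enforcement level next to each policy and let the pipeline decide what a failure means. The level names borrow Sentinel’s terminology; the function and return values are illustrative, not a specific tool’s API:

```python
# Hypothetical enforcement levels attached to each policy test.
HARD_MANDATORY = 'hard-mandatory'  # failure blocks the pipeline
SOFT_MANDATORY = 'soft-mandatory'  # failure requires manual approval
ADVISORY = 'advisory'              # failure only logs a warning


def pipeline_action(level, passed):
    """Decide what the pipeline does with a policy test result."""
    if passed:
        return 'continue'
    if level == HARD_MANDATORY:
        return 'fail'
    if level == SOFT_MANDATORY:
        return 'wait-for-approval'
    return 'warn'


print(pipeline_action(HARD_MANDATORY, passed=False))  # fail
print(pipeline_action(SOFT_MANDATORY, passed=False))  # wait-for-approval
print(pipeline_action(ADVISORY, passed=False))        # warn
```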

The security team classifies the firewall rule as a soft mandatory. Some public load balancers must allow access from 0.0.0.0/0. If the firewall rule test fails, someone from the security team must review the rule and manually approve the change to production in the pipeline.

The security team sets the CIS benchmarks as advisory for knowledge sharing and best practices. They ask you to correct the configuration, if possible, but they do not require enforcement before production.

Do you have to run security tests before changes go to production? They take a while to run, after all! If you worry that security and policy tests will gate the changes too long, run the hard mandatory tests before deploying to production.

You can run the soft mandatory or advisory tests asynchronously, so only the necessary tests block your pipeline. I do not recommend running all security and policy tests asynchronously because you may temporarily introduce a noncompliant configuration to production, even if you fix it quickly after running asynchronous tests!

Figure 8.10 summarizes testing patterns and practices, such as writing detailed test names and error messages. Similar to infrastructure, you can modularize tests based on function and add tests to delivery pipelines for production.

Figure 8.10 Practices and patterns for security and policy tests include detailed test names and error messages, modularized tests, and policy checks in delivery pipelines before production.

No matter the tool, security benchmark, or policy rules, you should express and communicate the practices in the form of tests. Following these patterns and practices will help you and your teammates improve your security and compliance knowledge.

As your organization grows and sets more policies, you expand your security and policy tests with it. Early adoption of policy as code sets a foundation for baking security and compliance practices into your IaC. If you cannot find a tool to run the tests you need, consider writing your own tests to parse IaC.

Summary

  • A company policy ensures that systems comply with security, audit, and organizational requirements. Your company defines policies based on industry, country, and other factors.

  • The principle of least privilege gives a user or service account only the minimum access they require.

  • Ensure that your IaC uses credentials with least privilege access to the infrastructure provider. Least privilege prevents someone from exploiting the credentials and creating unauthorized resources in your environment.

  • Use a tool to suppress or redact plaintext, sensitive information in a delivery pipeline.

  • Rotate any usernames or passwords generated by IaC after applying it in the pipeline.

  • Tagging infrastructure with service, owner, email, accounting information, environment, and automation details makes it easier to identify and audit security and billing.

  • Policy as code tests some infrastructure metadata and verifies that it complies with a secure or compliant configuration.

  • Use policy as code to test the security and compliance of your infrastructure before pushing to production but after functional testing of your system.

  • Apply clean IaC and module patterns to managing and scaling policy as code.

  • Classify each security and policy test into one of three enforcement categories, such as hard mandatory (must fix), soft mandatory (manual review), and advisory (best practice, but not blocking production).
