Cloud security operations comprise a mix of old and new practices for understanding, mitigating, and monitoring security risks in an organization's cloud environments. Old practices include standard activities that apply to legacy or on-premises IT, such as legal and regulatory compliance management, while new practices include novel activities like orchestrating cloud infrastructure by writing virtual machine (VM) definitions instead of physically installing new hardware and software.
Key to cloud security operations are the two main roles in cloud computing: the cloud service provider (CSP) and the cloud consumer. The CSP and consumer share responsibilities for securely operating and using the cloud, respectively, and require clearly defined, agreed-upon objectives documented in contracts and service level agreements (SLAs).
It is important to bear in mind that many aspects of secure cloud operations will be handled by the CSP and therefore may be largely invisible to the cloud consumers. Security professionals must understand the importance of both the provider and consumer roles; particularly important is including adequate oversight of the CSP in third-party security risk management activities. From the CSP's perspective, proper isolation controls are essential due to the multitenant nature of the cloud, as are appropriate capacity, redundancy, and resiliency to ensure that the cloud service meets the availability requirements that customers demand.
In public cloud deployments, hardware configuration will be handled by the CSP rather than the cloud consumer. Obviously, private and community clouds will require the security practitioner to properly configure and secure hardware, and in some cases, public clouds may offer a virtual private cloud (VPC) option, where some of these elements are configurable by the consumer. The rules for hardening, or securely configuring, systems in cloud environments are the same as they are for on-prem systems, though the methods may be different.
There are several targets for hardening hardware, including the following:
Storage controllers are hardware components that, as their name implies, control storage devices. This may involve a number of functions, including access control, assembly of data to fulfill a request (for example, reconstructing a file that has been broken into multiple blocks and stored across disks), and providing users or applications with an interface to the storage services. Several standards exist, including iSCSI and Fibre Channel/Fibre Channel over Ethernet (FCoE). These are storage area network (SAN) technologies that create dedicated networks for data storage and retrieval. Security concerns for SANs are much the same as for regular network services, including proper access control, encryption of data in transit and at rest, and adequate isolation/segmentation to address both availability and confidentiality.
For public cloud consumers, the majority of network configuration is likely to happen in a software-defined network (SDN) management console rather than via hardware-based network device configuration. It is the responsibility of the CSP to manage the underlying physical hardware including network controller devices such as switches and network interface cards (NICs). Concerns for this physical hardware include the following:
Virtualization management tools require particular security and oversight measures, as they are essential to cloud computing. Without such measures in place, the compromise of a single management tool could lead to further compromise of hundreds or thousands of VMs and the data they hold. Tools that fall into this category include VMware vSphere, for example, as well as many of the CSP's administrative consoles that provide configuration and management of cloud environments by consumers.
Best practices for these tools will obviously be driven in large part by the virtualization platform in use and will track closely to practices in place for other high-criticality server-based assets. Vendor-recommended installation and hardening instructions should always be followed and possibly augmented by external hardening standards such as Center for Internet Security (CIS) Benchmarks or Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs). Other best practices include the following:
The cloud's heavy reliance on virtualization and multitenancy creates a new risk of data breaches when multiple users share a single piece of hardware. In a nonvirtual environment, it is comparatively difficult to leak data between systems, but a VM shares physical hardware with potentially hundreds of other machines.
The biggest issue related to virtual hardware security is the enforcement, by the hypervisor, of strict segregation between the guest operating systems running on a single host. The hypervisor acts as a form of reference monitor by mediating access by the various guest machines to the physical resources of the hardware they are running on and, in some cases, inter-VM network communication. Since most hypervisors are proprietary and produced by software vendors, there are two main forms of control a CCSP should be aware of:
Virtual hardware is highly configurable and, especially in general-purpose environments like platform as a service (PaaS) or infrastructure as a service (IaaS) public cloud offerings, may not be configured for security by default. It is the responsibility of the cloud consumer to properly configure the cloud environment to meet their specific needs, such as segmenting networks and configuring network access controls to permit only appropriate host-to-host communications. Particular concerns for virtual network security controls include the following:
Another security concern related to virtual hardware configuration is the amount of virtual hardware provisioned. Many tools will allow the user to specify quantities such as amount of memory, processor speed, and number of processing cores, as well as other attributes such as the type of storage for a VM. Availability is obviously one concern: sufficient quantities of these virtual resources must be provisioned to support the intended workload. From a business perspective, however, these virtual resources should not be overprovisioned, because they do have associated costs.
An emerging trend in cloud environments is the provisioning of hardware using definition files, referred to as infrastructure as code. These definition files are read by the CSP and used to specify virtual hardware parameters and configurations, simplifying the process of setting up and configuring the environment. In many organizations, this changes the old paradigm of developers writing code and operations personnel configuring the hosts, because developers can package infrastructure definitions with their application code. This requires adequate training for developers to ensure that they understand the business needs, security requirements, and configuration options available to them.
The ability to deploy infrastructure using a definition file enables a feature of cloud computing known as autoscaling. Resources deployed in the cloud environment can be monitored for utilization, and as resources reach their limit, additional resources are automatically added. For example, a web server hosting an online ordering app may come under increased traffic when a celebrity endorses a product; in an autoscaling cloud environment, new instances of the web server are spun up automatically to deal with the increased traffic. Serverless computing is another feature of cloud computing that can support availability. Serverless environments like Azure Functions and AWS Lambda allow developers to deploy their application code without specifying server resources required. When the application is run—for example, a customer wants to place an order—the CSP provides sufficient resources to handle that demand, supporting availability. The cloud consumer pays for the resources only when they are running the particular application, saving costs.
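To illustrate, the following minimal Python sketch uses the AWS boto3 SDK, with a hypothetical Auto Scaling group named web-asg and AWS credentials assumed to be configured, to attach a target-tracking policy so that additional instances are launched automatically when average CPU utilization rises above a target value:

import boto3

# Hypothetical example: attach a target-tracking scaling policy to an existing
# Auto Scaling group named "web-asg" so new web server instances are added
# automatically when average CPU utilization exceeds the target value.
autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",          # assumed, pre-existing group
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,                 # keep average CPU near 60 percent
    },
)

The CSP handles the actual monitoring and instance launches; the consumer's responsibility is choosing sensible targets and limits so that scaling supports availability without running up unnecessary costs.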
Toolsets exist that can provide extended functionality for various guest operating systems including Unix, Linux, and Microsoft Windows. These toolsets provide specific or enhanced capabilities for a particular operating system (OS), such as support for additional devices, driver software, or enhanced interactivity. In a public cloud, these toolsets will typically be provided by the CSP; if a customer requires functionality that depends on a particular OS toolset, it is up to the customer to verify if the CSP can support it before using that provider's cloud.
When managing your own virtualization environment, installing these toolsets should follow the concept of minimum necessary functionality. If a virtualization cluster will not be running virtual Windows servers, then the Windows OS toolset does not need to be installed. The personnel responsible for the virtualization cluster need to understand the business requirements and build the solution appropriately.
Cloud computing has shifted many responsibilities for managing physical and logical infrastructure away from the users of the corresponding services. When organizations hosted their own infrastructure, it was essential to have adequate processes to assess risks and provision adequate security controls to mitigate them, but in the cloud, many of these tasks are the responsibility of the cloud provider instead. However, the consumers must still be aware of the risks inherent in using the cloud. It is essential for a CCSP to understand these matters to adequately assess cloud services and doubly important if the organization is providing a cloud service. In the case of a private cloud host or a security practitioner working for a CSP, these controls will be directly under the purview of the organization.
In most instances, access to cloud resources will be done remotely, so adequate security controls must be built into the remote administration tools used to support these functions. A CCSP should be familiar with the following protocols for supporting remote administration:
Access to RDP can be controlled in a variety of ways. As a Microsoft standard, Active Directory is often utilized for identification and authentication, and the RDP standard also supports smart card authentication.
In situations where local administration is being performed, a secure Keyboard Video Mouse (KVM) switch may be utilized. This is a device that allows access to multiple hosts using a single set of human interface peripherals such as a keyboard, mouse, and monitor—a user does not need to have multiple keyboards on their desk physically attached to different computers.
A basic KVM allows the user to switch their peripherals to interact with various computers; a secure KVM adds additional protections for highly secured environments that primarily address the potential for data to leak between the various connected systems. Attributes of secure KVMs include the following:
All CSPs provide remote administrative access to cloud users via an administrative console. In AWS this is known as the Management Console, in Azure as the Portal, and in Google Cloud as the Cloud Console. All three offer a visual user interface (UI) that allows for the creation and administration of resources including user accounts, VMs, cloud services such as compute or storage, and network configurations. These functions are also available via command-line interface (CLI) tools, which call the same underlying APIs and enable automation, such as creating resources from infrastructure as code definition files.
The CSP is responsible for ensuring that access to these consoles is limited to properly authenticated users, and in many cases the CSP restricts even its own access to consumer resources. For example, it is possible for a user in Azure to create a VM and remove all network access from it, effectively locking themselves out as well; even Microsoft cannot reconfigure that VM, because allowing that level of access to consumer resources is incredibly risky. By contrast, the cloud consumer is responsible for implementing appropriate authorization and access control for their own members in accordance with access management policies, roles, and rules. Because of the highly sensitive nature and abilities granted by these consoles, they should be heavily isolated and protected, ideally with multifactor authentication. All use of the admin console should be logged and routinely reviewed as a critical element of a continuous monitoring capability. Access to the UI or CLI functions is a key area for personnel security and access control policies, similar to any system or database admin position in an on-prem environment.
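As a brief illustration of why API credentials deserve the same protection as console logins, the following Python sketch (using boto3 and assuming AWS credentials are configured) pulls the same instance inventory the Management Console displays; any identity able to run this could equally create or delete resources, so such calls should be controlled and logged:

import boto3

# The same inventory visible in the web console can be pulled via the API;
# any identity holding these credentials can also create, modify, or delete
# resources, so API access must be governed like console access.
ec2 = boto3.client("ec2")

for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])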
One aspect of cloud services is their broad network accessibility, so it is virtually impossible to find a cloud security concern that does not in some way relate back to secure network connectivity. Several protocols and concepts are important to understand as they relate to securing networks and the data transmitted.
VLANs were originally designed to support the goal of availability by reducing contention on a shared communications line. They do so by isolating traffic to just a subset of network hosts, so all hosts connected to the VLAN broadcast their communications to each other but not to the broader network. Communication with other VLANs or subnets must go through a control device of some sort, often a firewall, which offers confidentiality and integrity protections by enforcing network-level access control. For example, in a multitiered application architecture, the web servers will traditionally be isolated from database servers in separate VLANs, and the database layer will implement much more restrictive access controls.
VLAN network traffic is identified by the sending device using a VLAN tag, specified in the IEEE 802.1Q standard. The tag identifies the VLAN that a particular data frame belongs to and is used by network equipment to determine where the frame will be distributed. For example, a switch will not broadcast frames tagged for VLAN 1 to devices connected to VLAN 2, alleviating congestion. Firewalls that sit between the two VLANs can decide to allow or drop traffic based on rules specifying allowed or denied ports, protocols, or sender/recipient addresses.
An extension of VLAN technology specific to cloud computing environments is the virtual extensible LAN (VXLAN) framework. It provides a method for overlaying virtualized layer 2 networks on top of layer 3 networks; effectively, it allows for the creation of virtual LANs that may span different data centers or cloud environments. VXLAN is more suitable than VLANs for complex, distributed, and virtualized environments due to limitations on the number of devices that can be part of a VLAN, as well as limitations of protocols designed to support availability of layer 2 devices, such as Spanning Tree Protocol (STP). It is specified in RFC 7348.
From the standpoint of the cloud consumer, network security groups (NSGs) are also a key tool in providing secure network services and provide a hybrid function of network traffic isolation and filtering similar to a firewall. The NSG allows or denies network traffic access based on a list of rules such as source IP, destination IP, protocol, and port number. Virtual resources can be segmented (isolated) from each other based on the NSGs applied to them. For example, a development environment may allow inbound traffic only from your organization's IP addresses on a broad range of ports, while a production environment may allow access from any IP address but only on ports 80/443 for web traffic. Resources in the development environment could also be prevented from communicating with any resources in production to prevent an attacker from pivoting.
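A minimal sketch of the kind of rule described above, using Python and boto3 against a placeholder AWS security group ID, might look like the following; the group ID, port, and CIDR values are illustrative:

import boto3

ec2 = boto3.client("ec2")

# Allow inbound HTTPS (443) from anywhere to a production security group;
# the group ID below is a placeholder for an existing security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Public HTTPS"}],
    }],
)

A development environment's group would instead restrict the IpRanges entry to the organization's own address space, giving the isolation between environments described above.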
Transport Layer Security (TLS) is a set of cryptographic protocols that provide encryption for data in transit, replacing the earlier Secure Sockets Layer (SSL) protocol. The current version of TLS is 1.3; earlier versions are either considered less secure or have known compromises. TLS provides a framework of supported cryptographic ciphers and key lengths that may be used to secure communications, and this flexibility ensures broad compatibility with a range of devices and systems. It also means the security practitioner must carefully configure TLS-protected systems to support only ciphers that are known to be secure and to disable older options that have been compromised.
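As a short illustration of enforcing a minimum protocol version, the following Python sketch uses the standard library ssl module to refuse anything older than TLS 1.2 and to print the negotiated version and cipher; the hostname is illustrative:

import socket
import ssl

# Build a client-side TLS context with certificate verification enabled and
# older protocol versions refused.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse SSL and TLS 1.0/1.1

hostname = "www.isc2.org"
with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        print("Negotiated:", tls.version(), tls.cipher()[0])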
TLS specifies a handshake protocol when two parties establish an encrypted communications channel. This comprises these three steps:
TLS can be used to provide both encryption of data and authentication/proof of origin, as it relies on digital certificates and public key cryptography. In many cases, such as publicly available web apps, one-way authentication is used for the client's browser to authenticate the server it is connecting to, such as an online banking application. In mutual authentication, both client and server (or server and server) are required to exchange certificates. If both parties trust the issuer of the certificates, they can mutually authenticate their identities. Some high-security or high-integrity environments require mutual authentication before data is transmitted, but the overhead makes it largely infeasible for all TLS encryption (imagine the PKI required to authenticate every Internet user, web app, and IoT device on the planet).
Dynamic Host Configuration Protocol (DHCP) enables computer network communications by providing IP addresses to hosts that dynamically join and leave the network. The process of dynamically assigning IP addresses is as follows:
An easy mnemonic to remember this DHCP process is DORA for Discover, Offer, Request, Acknowledge. As with many network protocols, security was not originally part of the DHCP design. The current DHCP version 6 specifies how IPSec can be utilized for authentication and encryption of DHCP requests. An improperly configured DHCP server can lead to denial of service if incorrect IP addresses are assigned or, worse, can be used in a man-in-the-middle attack to misdirect traffic.
Imagine how difficult it would be if you were required to remember every company's and person's IP address to send them data. Human-readable addresses like www.isc2.org are used instead, but these need to be converted to a machine-readable IP address. DNS does this by a process known as resolving.
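A minimal sketch of resolving in practice, using Python's standard library and the example hostname above:

import socket

# Resolve a human-readable name to machine-readable addresses using the
# system's configured DNS resolver.
for family, _, _, _, sockaddr in socket.getaddrinfo("www.isc2.org", 443):
    print(family.name, sockaddr[0])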
DNS operates using records, definitive mappings of fully qualified domain names (FQDNs) to IP addresses, which are stored in a distributed database across zones. DNS queries are facilitated by DNS servers sharing information with one another from time to time via zone transfers, which enable a server to resolve client requests without having to iterate the query to other servers.
As with many network protocols, DNS was originally designed without security in mind, which leaves it open to attack. Cache poisoning is an attack where a malicious user updates a DNS record to point an FQDN to an incorrect IP address. One way of doing this is to initiate a zone transfer, which by default does not include authentication of the originator. This poisoned record causes users to be redirected to an incorrect IP address, where the attacker can host a malicious phishing site, malware, or simply nothing at all to create a denial of service.
DNS spoofing is another attack against DNS. In this case, an attacker spoofs a DNS service on a network with the goal of resolving a user's requested FQDN to an attacker-controlled IP address. Some attacks also abuse the visual appearance of domain names rather than the underlying DNS infrastructure; this can be achieved using characters that appear identical to users, such as the digit zero and the letter o, sending users to an illegitimate site.
DNS Security Extensions (DNSSEC) is a set of specifications primarily aimed at reinforcing the integrity of DNS. It achieves this by providing for cryptographic authentication of DNS data using digital signatures. This provides proof of origin and makes cache poisoning and spoofing attacks more difficult if users cannot create a proper digital signature for their DNS data. It does not provide for confidentiality, since digital signatures rely on publicly decryptable information, nor does it stop attacks using graphically similar domain names, as these may be legitimate records registered with an authoritative DNS zone.
Resources inside a network can be protected with security tools like firewalls and access controls, but what happens if users are not on the same network? A VPN gives external users the ability to virtually and remotely join a network, gaining access to resources hosted on that network and benefiting from the security controls in place. This security is achieved by setting up a secure tunnel, or encrypted communication channel, between the connecting host and the network they want to join. On the target network, there is often a VPN device or server that authenticates the user connecting and mediates their access to the network.
Commonly used to provide remote workers access to in-office network resources, VPNs can also be useful in cloud architectures to safely share data between offices, cloud environments, or other networks. These are often implemented at the edge of networks to allow secure communication between any hosts on connected networks and are called site-to-site or gateway-to-gateway VPNs.
There are a variety of VPN protocols and implementations that a CCSP should be familiar with, such as the following:
A software-defined perimeter (SDP) is an emerging concept driven by the decentralized nature of cloud applications and services, which have upended the traditional model of a network with a perimeter (secure boundary). Since cloud applications may reside in data centers anywhere in the world, and users may connect from anywhere in the world, it is no longer possible to define a fixed perimeter.
For further reference, see the CSA SDP site: cloudsecurityalliance.org/research/working-groups/software-defined-perimeter.
Hardening is the configuration of a machine into a secure state. Common hardening practices include changing default credentials, locking/disabling default accounts that are not needed, installing security tools such as anti-malware software, configuring security settings available in the OS, and removing or disabling unneeded applications, services, and functions.
Prior to virtualization, the process of hardening an OS was often a manual task; once a system's hardware was installed, an administrator had to manually install and configure the OS and any application software. The advent of virtualization introduced machine images, which are essentially templates for building a VM. All VMs created from the same image will have the same settings applied, which offers dual benefits of efficiency and security. Of course, the image cannot remain static—patches and other security updates must be applied to the image to ensure that new VMs remain secure. Existing VMs must also be updated independently of the image, which is a concern of patching and configuration management disciplines.
A modern OS has thousands of configuration options, so to speed up hardening, an organization may choose to create or adopt a baseline configuration. Baselines are simply a documented, standard state of an information system, such as access control requiring multifactor authentication, vulnerable services such as File Transfer Protocol (FTP) disabled, and nonessential services such as Windows Media Player removed. Each of these configuration options should map to a risk mitigation (security control objective).
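A simplified, tool-agnostic sketch of checking a host against such a baseline follows; the settings and values are illustrative only, not a complete hardening standard:

# Illustrative baseline check: compare a host's reported settings against a
# documented baseline and report any drift. The settings shown are examples.
baseline = {
    "mfa_required": True,
    "ftp_service_enabled": False,
    "media_player_installed": False,
}

host_settings = {
    "mfa_required": True,
    "ftp_service_enabled": True,    # drift: vulnerable service left enabled
    "media_player_installed": False,
}

drift = {k: (baseline[k], host_settings.get(k)) for k in baseline
         if host_settings.get(k) != baseline[k]}

for setting, (expected, actual) in drift.items():
    print(f"Noncompliant: {setting} expected={expected} actual={actual}")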
This baseline and corresponding documentation may be achieved in a number of ways.
DISA STIGs: The U.S. Defense Information Systems Agency (DISA) produces baseline documents known as Security Technical Implementation Guides (STIGs). These documents provide guidance for hardening systems used in high-security environments and, as such, may include configurations that are too restrictive for many organizations. They are freely available, can be tailored, and cover a broad range of OS and application software. Many configuration and vulnerability management tools incorporate hardening guidance from the STIGs to perform checks.
STIGs and additional information can be found here: public.cyber.mil/stigs/downloads.
NIST checklists: The National Institute of Standards and Technology (NIST) maintains a repository of configuration checklists for various OS and application software. It is a free resource.
The NIST Checklist repository (which also includes access to DISA STIGs) can be found here: nvd.nist.gov/ncp/repository.
CIS Benchmarks: The Center for Internet Security (CIS) publishes baseline guides for a variety of operating systems, applications, and devices, which incorporate many security best practices. These can be used by any organization with appropriate tailoring and are also built into many security tools such as vulnerability scanners.
More information can be found here: www.cisecurity.org/cis-benchmarks.
Stand-alone hosts are isolated, dedicated hosts for the use of a single tenant. These are often required for contractual or regulatory reasons, such as processing highly sensitive data like healthcare information. The use of nonshared resources carries obvious cost consequences. A CSP may be able to offer secure dedicated hosting similar to colocation, which still offers cost savings to the consumer because physical facilities and utilities such as power are shared. A CCSP will need to gather and analyze the organization's requirements to identify whether the costs of stand-alone hosting are justified. In some CSPs, the use of virtual private resources may be an acceptable, lower-cost alternative, relying on shared physical infrastructure with strong logical separation from other tenants.
Clusters are a grouping of resources with some coordinating element, often a software agent that facilitates communication, resource sharing, and routing of tasks among the cluster. Clustered hosts can offer a number of advantages, including high availability via redundancy, optimized performance via distributed workloads, and the ability to scale resources without disrupting processing via addition or removal of hosts to the cluster. Clusters are a critical part of the resource pooling that is foundational to cloud computing and are implemented in some fashion for most resources needed in modern computing systems, including processing, storage, network traffic handling, and application hosting.
The cluster management agent, often part of hypervisor or load balancer software, is responsible for mediating access to shared resources in a cluster. Reservations are guarantees for a certain minimum level of resources available to a specified virtual machine. The virtualization toolset or CSP console is often where this can be configured, such as a certain number of compute cores or RAM allocated to a VM. A limit is a maximum allocation, while a share is a weighting given to a particular VM that is used to calculate percentage-based access to pooled resources when there is contention.
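The following simplified Python illustration (not any vendor's actual algorithm) shows how share weightings translate into percentage-based access to a contended resource pool:

# Simplified illustration of share-based allocation under contention:
# each VM's weight divided by the total weight gives its fraction of the pool.
# The numbers are illustrative.
vm_shares = {"vm-prod": 4000, "vm-test": 2000, "vm-dev": 1000}
pool_ghz = 14.0  # total CPU capacity available in the contended pool

total = sum(vm_shares.values())
for vm, shares in vm_shares.items():
    pct = shares / total
    print(f"{vm}: {pct:.0%} of pool = {pct * pool_ghz:.1f} GHz")

Reservations and limits would then act as a floor and ceiling, respectively, around whatever the share calculation yields.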
Maintenance mode refers to the practices surrounding the routine maintenance activities for clustered hosts. Although taking a system offline primarily impacts availability, there are considerations for all three elements of the confidentiality, integrity, availability (CIA) triad related to maintenance mode.
Availability and uptime are often used interchangeably, but there is a subtle difference between the terms. Uptime simply measures the amount of time a system is running. In a cloud environment, if a system is running but not reachable due to a network outage, it is not available. Availability encompasses infrastructure and other supporting elements in addition to a system's uptime; high availability (HA) is defined by a robust system and infrastructure to ensure that a system is not just up but also available. It is often measured as a number of 9s, for example, five nines or 99.999 percent availability. This equates to approximately five minutes of downtime per year and should be measured by the cloud consumer to ensure that the CSP is meeting SLA obligations.
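The arithmetic behind these figures is straightforward, as the following short calculation shows:

# Permitted downtime per (365-day) year for common availability targets.
minutes_per_year = 365 * 24 * 60

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime = (1 - availability) * minutes_per_year
    print(f"{availability:.5%} -> {downtime:.2f} minutes/year")

Five nines works out to roughly 5.26 minutes of permissible downtime per year, which is why the measurement method and exclusions defined in the SLA matter so much.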
Organizations can implement multiple strategies to achieve HA. Some, detailed in the following sections, are vendor-specific implementations of cluster management features for maintaining system uptime. Other strategies include redundancy of infrastructure such as network connectivity and utilities. The Uptime Institute publishes specifications for physical and environmental redundancy, expressed as tiers, that organizations can implement to achieve HA.
The Distributed Resource Scheduler (DRS) is the coordination element in a cluster of VMware ESXi hosts; it mediates access to the physical resources and provides additional features supporting high availability and management. It is a software component that handles the resources available to a particular cluster, the reservations and limits for the VMs running on the cluster, and maintenance features.
DRS maintenance features include the ability to dynamically move running VMs from one physical hardware component to another without disruption for the end users; this is obviously useful if hardware maintenance needs to be performed or additional capacity is added to the cluster and the workload needs to be rebalanced. This supports the element of rapid elasticity and self-service provisioning in the cloud by automating dynamic creation and release of resources as client demands change.
DRS can also handle energy management in physical hardware to reduce energy consumption when processing demands are low and then power resources back up when required. This enables cost savings for both the CSP and consumer, as hardware that is not actively being used does not consume energy.
Similar to VMware's DRS, Microsoft's Virtual Machine Manager (VMM) software handles power management, live VM migration, and optimization of both storage and compute resources. Hosts (servers) and storage capacity can be grouped into a cluster; options available to configure in the VMM console include the movement of VMs and virtual hard disks between hosts to balance workloads across available resources.
Storage clusters create a pool of storage, with the goal of providing reliability, increased performance, or possibly additional capacity. They can also support dynamic system availability by making data available to services running anywhere—if a data center in one part of the country fails and web hosts are migrated to another data center, they can connect to the same storage cluster without needing to be reconfigured. There are two primary architectures for storage clusters.
The availability of a guest OS in a CSP's environment is generally the consumer's responsibility, as the CSP only provides a base image. Once a VM is created in IaaS, the CSP no longer has direct control over the OS, while in PaaS the CSP maintains control. In the software as a service (SaaS) model, the consumer only needs to plan for availability of the data their organization puts into the app.
Ensuring the availability of guest OSs in a cloud environment may involve planning for backup and restoration, which will be similar to traditional on-prem backup and recovery planning, or it may involve utilizing cloud-specific features to design resiliency into the system. Details of these two approaches are presented here:
As with all backup and restoration activity, there are concerns across all three CIA triad elements. Backup integrity should be routinely tested to ensure recovery, and the backups should not be stored on the same physical hardware as the primary systems since this single point of failure could impact availability. Snapshots will contain data with the same sensitivity level as the systems they are made from, so adequate access controls and other measures to enforce confidentiality are required.
Many cloud services also have resiliency options built in, such as worldwide data replication and availability zones or regions that can transfer running apps or services transparently in the event of a failure. The cloud consumer is responsible for choosing and configuring these resiliency options, and in some cases will need to make trade-offs. For example, some CSPs offer database encryption that makes it harder to perform data replication. In traditional on-prem architecture, building such a resilient app would have been cost prohibitive for all but the largest organizations, but the inherent structure of cloud computing makes this type of resiliency broadly available.
Although many elements of physical and logical infrastructure in cloud environments will be under the direct control of the CSP, it is essential for cloud consumers and the CCSP practitioner to be aware of these practices. In some cases, there will be shared responsibilities that both parties are required to perform, and in others, the consumer must adequately understand these practices and conduct oversight activities like SLA reviews to ensure that security objectives are being met.
Remote administration is the default for a majority of cloud administrators from both the CSP and the consumer side. Tools including Remote Desktop Protocol (RDP), used primarily for Windows systems, and Secure Shell (SSH), used primarily on Unix and Linux systems, must be provisioned to support this remote management.
Secure remote access is a top-level priority in many security frameworks due to the highly sensitive nature of operations it entails and its inherent exposure to attacks. Remote access often relies on untrusted network segments for transmitting data, and in a cloud environment, this will entail users connecting via the Internet. Physical controls that could prevent unwanted access in a data center will be largely missing in a cloud as well; not that the CSP is ignoring physical security controls, but the inherently network-accessible nature of the cloud means most administrative functions must be exposed and are therefore susceptible to network-based threats.
There are a number of concerns that should be addressed to reduce the risk associated with remote access, including the following:
A hardened OS implements many of the security controls required by an organization's risk tolerance and may also implement security configurations designed to meet the organization's compliance obligations. Once built, however, these systems do need to be monitored to ensure that they stay hardened. This ongoing monitoring and remediation of any noncompliant systems must be part of the organization's configuration management processes, designed to ensure that no unauthorized changes are made, any unauthorized changes are identified and rolled back, and changes are properly approved and applied through a change control process.
Monitoring and managing OS configuration against baselines can be achieved in a number of ways. Some are similar to legacy, on-prem techniques, while newer methods are also emerging, including the following:
Maintaining a known good state is not a static activity, unfortunately. Today's hardened system is tomorrow's highly vulnerable target for attack, as new vulnerabilities are discovered, reported, and weaponized by attackers. Patch management involves identifying vulnerabilities in your environment, applying appropriate patches or software updates, and validating that the patch has remediated the vulnerability without breaking any functionality or creating additional vulnerabilities.
Information needed to perform patch management can come from a variety of sources, but primary sources include vendor-published patch notifications, such as Microsoft's Patch Tuesday, as well as vulnerability scanning of your environment. Patch management processes will vary depending on the cloud service model you are using.
Patch management tools exist that can help to identify known software vulnerabilities and the state of patching across an organization's systems, such as Windows Server Update Services (WSUS). Such tools can also be used to orchestrate and automate patch application, though in some cases automation may not be desirable if a patch has operational impacts. There are plenty of patches and updates from major companies that caused unforeseen issues when installed, including turning otherwise functional hardware into bricks! A generic patch management process ought to incorporate the following:
Evolving cloud architectures are introducing new ways of managing patching, which can offer significant advantages if applied correctly.
Monitoring is a critical concern for all parties in cloud computing. The CSP should implement monitoring to ensure that they are able to meet customer demands and promised capacity, and consumers need to perform monitoring to ensure that service providers are meeting their obligations and that the organization's systems remain available for users.
The majority of monitoring tasks will be in support of the availability objective, though indicators of attack or misuse may be revealed as well, such as spikes in processor utilization caused by cryptocurrency mining malware. Alerts should be generated based on established thresholds, and appropriate action plans initiated in the event of a disruption. Monitoring is not only about detecting incidents, however; it is also critical for CSPs to measure the amount of services being used by customers so they can be billed accurately.
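A simplified sketch of threshold-based alerting follows; the metric names, sample values, and thresholds are placeholders for data a real monitoring agent would supply:

# Illustrative threshold check: compare sampled utilization metrics against
# alert thresholds and flag anything out of range.
thresholds = {"cpu_pct": 85, "memory_pct": 90, "disk_pct": 80}
samples = {"cpu_pct": 97, "memory_pct": 62, "disk_pct": 72}

for metric, limit in thresholds.items():
    value = samples[metric]
    if value > limit:
        print(f"ALERT: {metric} at {value}% exceeds threshold of {limit}%")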
Infrastructure elements that should be monitored include the following:
Regardless of what is being monitored and who performs it, adequate staffing is critical to make monitoring effective. Just as reviews make log files useful, appropriate use of performance data is also essential. If a metric is captured but the cloud consumer never reviews it, they run the risk of a service becoming unavailable with no forewarning, or of paying for services that were not actually usable. CSPs also face risks of customer complaints, customers refusing to pay, and loss of reputation if their services are not routinely available.
Although cloud computing relies heavily on virtualization, at some point physical hardware is necessary to provide all the services. Monitoring this physical hardware is essential, especially for availability, as hardware failures can have an outsized impact in virtualized environments where multiple VMs rely on a single set of hardware.
Various tools exist to perform physical monitoring, and the choice will depend on the hardware being used as well as organizational business needs. Similar to capacity monitoring, all hardware under monitoring should have alert thresholds and response actions if a metric goes outside expected values, such as automated migration of VMs from faulty hardware or an alert to investigate and replace a faulty component. Hardware targets for monitoring may include the following:
Some devices also include built-in monitoring, such as hard drives that support Self-Monitoring, Analysis, and Reporting Technology (SMART). SMART drives monitor a number of factors related to drive health and can identify when a failure is imminent. Many SSDs also provide reporting when a set number of sectors have failed, which indicates the drive is reaching the end of its useful life. Especially in large environments, storage tools like storage area networks (SANs) will include their own health and diagnostic tools; the information from these should be integrated into the organization's continuous monitoring strategy.
The types of monitoring tools in use will depend on a number of factors. Many vendors of cloud-grade hardware such as SAN controllers or virtualization clusters include diagnostic and monitoring tools. The usefulness of these built-in tools may be limited if your organization's measurement needs require data that is not captured, in which case a third-party tool may be required.
In general, hardware monitoring will be the purview of the CSP and not the consumer, as the CSP is likely to retain physical control of hardware. Data center design and infrastructure management are entire fields of endeavor largely outside the scope of the CCSP, but it is obviously important for a CSP to have appropriately skilled team members. The Uptime Institute (uptimeinstitute.com/tiers) is one resource that provides guidance and education on designing and managing infrastructure; another is the International Data Center Authority (www.idc-a.org).
There is a clear delineation of responsibility between the CSP and consumer when it comes to configuring, testing, and managing backup and restoration functions in cloud environments. In SaaS cloud models, the CSP retains full control over backup and restore and will often be governed by SLA commitments for restoration in the event of an incident or outage.
In the PaaS model, there will be backup and restoration responsibilities for both the consumer and the CSP, especially for VMs. The CSP is responsible for maintaining backup and restoration of the host OS and any hypervisor software and for ensuring the availability of the system in line with agreed-upon service levels.
Backup and recovery of individual VMs in IaaS are the responsibility of the consumers and may be done in a variety of ways. This might include full backups, snapshots, or definition files used for infrastructure as code deployments. Regardless of which backup method is utilized, a number of security considerations should be taken into account.
Configuration of resiliency functions, such as the use of automatic data replication, failover between availability zones offered by the CSP, or the use of network load balancing, will always be the responsibility of the consumer. The CSP is responsible for maintaining the capabilities that enable these options, but the consumer must architect their cloud environment, infrastructure, and applications appropriately to meet their own resiliency objectives.
Cloud environments are inherently network accessible, so the security of data in transit between the consumer and the CSP is a critical concern, with both parties sharing responsibility for architecting secure networks. CSPs must ensure that they adequately support the networks they are providing in the cloud service environment, and consumers are responsible, in some cloud service models, for architecting their own secure networks using a combination of their own tools and those provided by the CSP.
One major concern related to network security is the ability of some tools to function in a cloud paradigm. The early days of virtualization brought challenges for many security tools that relied on capturing network traffic as it flowed across a switch; the devices were attached to a SPAN or mirror port on the switch and received a copy of all traffic for analysis. VMs running on the same physical host did not need to send traffic outside the host to communicate, rendering tools listening for traffic on a switch useless. Complex software-defined networks (SDNs) that can span multiple data centers around the world likely require more advanced solutions, and security practitioners must be aware of these challenges.
A firewall is most broadly defined as a security tool designed to isolate and control access between segments of a network, whether between an internal network and the public Internet or between internal environments, such as an application with highly sensitive data and other internal apps. Firewalls operate by inspecting traffic and deciding whether to forward it (allow) or drop it (deny), and they are often used to isolate or segment networks by controlling what traffic is allowed to flow between segments. There are a variety of firewall types.
Firewalls may be hardware appliances that would traditionally be deployed by the CSP, or virtual appliances that can be deployed by a cloud consumer as a VM. Host-based firewalls, which are software-based, are also often considered a best practice in a layered defense model. In the event a main network firewall fails, each host still has some protection from malicious traffic, though all devices obviously need to be properly configured.
There are a number of cloud-specific considerations related to firewall deployment and configuration, such as the use of security groups for managing network-level traffic coupled with host-based firewalls to filter traffic to specific hosts. This approach is an example of microsegmentation, which amounts to controlling traffic on a granular basis, often at the level of a single host. In a cloud environment, an NSG might block traffic on specific ports from entering a DMZ, and the host firewalls would further restrict traffic reaching each host based on ports or protocols. Traditional firewall rules may also be ineffective in a cloud environment, which necessitates these new approaches: in an autoscaling environment, new hosts are brought online and dynamically assigned IP addresses, so a traditional firewall would need its ruleset updated before those hosts could handle any traffic. With security groups, the newly created resources can be automatically placed into the proper group with no additional configuration required.
As the name implies, an intrusion detection system (IDS) is designed to detect a system intrusion when it occurs. An intrusion prevention system (IPS) is a bit of a misnomer, however—it acts to limit damage once an intrusion has been detected. The goal in both cases is to limit the impact of an intrusion, either by alerting personnel to an intrusion so they can take remedial action or by automatically shutting down an attempted attack.
An IDS is a passive device that analyzes traffic and generates an alert when traffic matching a pattern is detected, such as a large volume of unfinished TCP handshakes. IPS goes further by taking action to stop the attack, such as blocking traffic from the malicious host with a firewall rule, disabling a user account generating unwanted traffic, or even shutting down an application or server that has come under attack.
Both IDS and IPS can be deployed in two ways, and the deployment method as well as location are critical to ensure that the devices can see all traffic they require to be effective. A network-based intrusion detection system/intrusion prevention system (NIDS/NIPS) sits on a network where it can observe all traffic and may often be deployed at a network's perimeter for optimum visibility. Similar to firewalls, however, NIDS/NIPSs may be challenged in a virtualized environment where network traffic between VMs never crosses a switch. A host-based intrusion detection system/intrusion prevention system (HIDS/HIPS) is deployed on a specific host to monitor traffic. While this helps overcome problems associated with invisible network traffic, the agents required introduce processing overhead, may require licensing costs, and may not be available for all platforms an organization is using.
Honeypots and honeynets can be useful monitoring tools if used appropriately. They should be designed to detect or gather information about unauthorized attempts to gain access to data and information systems, often by appearing to be a valuable resource. In reality, they contain no sensitive data, but attackers attempting to access them may be distracted or deflected from high-value targets or give up information about themselves such as IP addresses.
In most jurisdictions, there are significant legal issues concerning the use of honeypots or honeynets, centered around the concept of entrapment. This legal concept describes an agent inducing a person to commit a crime, which may be used as a defense by the perpetrator and render any attempt to prosecute them ineffective. It is therefore imperative that these devices never be set up with an explicit purpose of being attractive targets or designed to “catch the bad guys.”
Vulnerability assessments should be part of a broader vulnerability management program, with the goal of detecting vulnerabilities before an attacker finds them. Many organizations will have a regulatory or compliance obligation to conduct vulnerability assessments, which will dictate not only the schedule but also the form of the assessment. An organization with an annual PCI assessment requirement should be checking for required configurations and vulnerabilities related to credit cardholder data, while a medical organization should be checking for required protected health information controls and vulnerabilities.
Vulnerability scanners are an often-used tool in conducting vulnerability assessments and can be configured to scan on a relatively frequent basis as a detective control. Human vulnerability assessments can also be utilized, such as an internal audit function or standard reviews like access and configuration management checks. Even a physical walk-through of a facility to identify users who are not following clean desk or workstation locking policies can uncover vulnerabilities, which should be treated as risks and remediated.
A more advanced form of assessments an organization might conduct is penetration or pen testing, which typically involves a human tester attempting to exploit any vulnerabilities identified. Vulnerability scanners typically identify and report on software or configuration vulnerabilities, but it can be difficult to determine if a particular software vulnerability could actually be exploited in a complex environment. The use of vulnerability scanners and pen testers may be limited by your CSP's terms of service, so a key concern for a CCSP is understanding the type and frequency of testing that is allowed.
The management plane is mostly used by the CSP and provides virtual management options analogous to the physical administration options a legacy data center would provide, such as powering VMs on and off or provisioning virtual infrastructure for VMs such as RAM and storage. The management plane will also be the tool used by administrators for tasks such as migrating running VMs to different physical hardware before performing hardware maintenance.
Because of the functionality it provides, the management plane requires appropriate logging, monitoring, and access controls, similar to the raised floor space in a data center or access to domain admin functions. Depending upon the virtualization toolset, the management plane may be used to perform patching and maintenance on the virtualization software itself. Functionality of the management plane is usually exposed through an API, which may be controlled by an administrator from a command line or via a graphical interface.
A key concept related to the management plane is orchestration, or the automated configuration and management of resources. Rather than requiring an administrator to individually migrate VMs off a cluster before applying patches, the management plane can automate this process. The admin schedules a patch for deployment, and the software comprising the management plane coordinates moving all VMs off the cluster, preventing new VMs from being started, and then enters maintenance mode to apply the patches.
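The following pseudocode-style Python sketch outlines that flow; the cluster object and its methods are hypothetical stand-ins, not a real management plane API:

# Hypothetical orchestration flow for patching a virtualization cluster.
def patch_cluster(cluster, patch, spare_capacity):
    cluster.disable_new_vm_placement()             # stop scheduling new VMs here
    for vm in cluster.running_vms():
        cluster.live_migrate(vm, spare_capacity)   # move workloads without downtime
    cluster.enter_maintenance_mode()
    cluster.apply_patch(patch)                     # patch the now-empty hosts
    cluster.exit_maintenance_mode()
    cluster.enable_new_vm_placement()              # resume normal scheduling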
The cloud management console is often confused with the cloud management plane, and in reality, they perform similar functions and may be closely related. The management console is usually a web-based console for use by the cloud consumer to provision and manage their cloud services, though it may also be exposed as an API that customers can utilize from other programs or a command line. It may utilize the management plane's API for starting/stopping VMs or configuring VM resources such as RAM and network access, but it should not give a cloud consumer total control over the entire CSP infrastructure. The management plane's access controls must enforce minimum necessary authorization to ensure that each consumer is able to manage their own infrastructure and not that of another customer.
IT service management (ITSM) frameworks consist of operational controls designed to help organizations design, implement, and improve IT operations in a consistent manner. They can be useful in speeding up IT delivery tasks and providing more consistent oversight, and they are also critical to processes where elements of security risk management are implemented. Change management is one example; it helps the organization to maintain a consistent IT environment that meets user needs and also implements security controls such as a change control board where the security impact of changes can be adequately researched and addressed.
The two standards that a CCSP should be familiar with are ISO 20000-1 (not to be confused with ISO 27001) and ITIL (formerly an acronym meaning Information Technology Infrastructure Library). Both frameworks focus on the process-driven aspects of delivering IT services to an organization, such as remote collaboration services, rather than focusing on just delivering IT systems like an Exchange server. In ITIL, the set of services available is called a service catalog, which includes all the services available to the organization.
Both frameworks start with the need for policies to govern the ITSM processes, which should be documented, well understood by relevant members of the organization, and kept up-to-date to reflect changing needs and requirements. ISO 20000-1 and ITIL emphasize the need to deeply understand user needs and also focus on gathering feedback to deliver continuous service improvement. Stated another way, those in charge of IT services should have a close connection to the users of the IT system and strive to make continual improvements; in this regard, it is similar to the Agile development methodology.
Change management is concerned with keeping the organization operating effectively, even when changes are needed such as the modification of existing services, addition of new services, or retirement of old services. To do this, the organization must implement a proactive set of formal activities and processes to request, review, implement, and document all changes.
Many organizations utilize a ticketing system to document all steps required for a change. The first step is initiation in the form of a change request, which should capture details such as the purpose of the proposed change, the owner, resources required, and any impacts that have been identified, such as downtime required to implement the change or impacts to the organization's risk posture.
The change then goes for a review, often by a change control or change advisory board (CCB or CAB). This review is designed to verify whether the proposed change offers business benefits/value appropriate to its associated costs, understand the impact of the change and ensure that it does not introduce unacceptable levels of risk, and, ideally, confirm that the change has been properly planned and can be reversed or rolled back in the event it is unsuccessful. This step may involve testing and additional processes such as decision analysis, and it may be iterated if the change board needs additional information from the requestor.
Once a change has been approved, it is ready for the owner to execute the appropriate plan to implement it. Since many changes will result in the acquisition of new hardware, software, or IT services, there will be a number of security concerns that operate concurrently with the change, including acquisition security management, security testing, and the use of the organization's certification and accreditation process if the change is large enough. In the event a change is not successful, fallback, rollback, or other restoration actions need to be planned to prevent a loss of availability.
Not all changes will be treated the same, and many organizations will implement different procedures based on categories of changes.
Continuity is concerned with the availability aspect of the CIA triad and is a critical consideration for both cloud customers and providers. Continuity management addresses the reality that, despite best efforts and mitigating activities, sometimes adverse events happen. How the organization responds should be planned and adequate resources identified prior to an incident, and the business continuity policy, plan(s), and other documentation should be readily available to support the organization's members during an interruption.
It is essential for both cloud customers and providers to do the following:
There are a variety of standards related to continuity management; these may be useful to the organization in planning, testing, and preparing for contingency circumstances. Many legal and regulatory frameworks mandate the use of a particular standard depending on an organization's industry or location. The CCSP should be aware of the relevant framework for their industry; the following are some key frameworks:
The goal of an information security management system (ISMS) is to ensure a coherent organizational approach to managing information security risks; stated another way, it is the overarching approach an organization takes to preserving the confidentiality, integrity, and availability (the CIA triad) of systems and data in use. The operational aspects of an ISMS include standard security risk management activities in the form of security controls such as encryption, as well as supporting business functions required for the organization to achieve risk management goals like formal support and buy-in from management, skills and training, and adequate oversight and performance evaluation.
Various standards and frameworks exist to help organizations implement, manage, and, in some cases, audit or certify their ISMS. Most contain requirements to be met in order to support goals of the CIA triad, as well as best practices for implementation and guidance on proper use and operation of the framework. While many security control frameworks exist, not all are focused on the larger operational task of implementing an ISMS. Payment Card Industry Data Security Standard (PCI DSS), for example, focuses specifically on securing cardholder data that an organization is processing. Frameworks that focus on both security controls as well as the overall ISMS functions include the following:
While the NIST RMF and SP 800-53 standards are mandated for use in many parts of the U.S. federal government, they are free for any organization to use. The NIST Cybersecurity Framework (CSF) was originally designed to help private-sector critical infrastructure providers design and implement information security programs; however, its free availability and relatively lightweight approach have made it a popular ISMS tool for many nongovernment organizations.
Most ITSM models include some form of monitoring capability utilizing functions such as internal audit, external audit and reporting, or the generation of security metrics and management oversight of processes via these metrics. The organization's IT services, including the ISMS and all related processes, should be monitored for effectiveness and placed into a cycle of continuous improvements. The goals of this continuous improvement program should be twofold: first to ensure that the IT services (including security services) are meeting the organization's business objectives and second to ensure that the organization's security risks remain adequately mitigated.
One critical element of continual service improvement includes elements of monitoring and measurement, which often take the form of security metrics. Metrics can be tricky to gather, particularly if they need to be presented to a variety of audiences. It may be the case that business leaders will be less interested in deeply technical topics, which means the metrics should be used to aggregate information and present it in an easily understood, actionable way.
For instance, rather than reporting a long list of patches and Common Vulnerabilities and Exposures (CVEs) addressed (undoubtedly an important aspect of security risk management), a more appropriate metric might be the percentage of machines patched within the defined timeframe for the criticality of the patch; for example, 90 percent of machines were patched within seven days of release. Acceptable values should also be defined, which allows for key performance indicators (KPIs) to be reported. In this example, the KPI might be red (bad) if the organization's target is 99 percent patch deployment within seven days of release—a clear indicator to management that something needs their attention.
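As a rough illustration of how such a metric and KPI might be computed, the sketch below calculates the percentage of hosts patched within the defined window and compares it against a target. The records, window, and target values are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical patch records: (host, patch released, patch applied or None)
patch_records = [
    ("web-01", date(2023, 3, 1), date(2023, 3, 4)),
    ("web-02", date(2023, 3, 1), date(2023, 3, 10)),
    ("db-01",  date(2023, 3, 1), None),  # not yet patched
]

SLA_DAYS = 7          # required deployment window for this patch criticality
TARGET_PERCENT = 99   # organizational KPI target

within_sla = sum(
    1 for _, released, applied in patch_records
    if applied is not None and (applied - released) <= timedelta(days=SLA_DAYS)
)
percent = 100 * within_sla / len(patch_records)

status = "green" if percent >= TARGET_PERCENT else "red"
print(f"{percent:.1f}% of machines patched within {SLA_DAYS} days -> KPI {status}")
```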
There are other sources of improvement opportunity information as well, including audits and actual incidents. Audits may be conducted internally or externally, and findings from those audits can be viewed as improvement opportunities. Actual incidents, such as a business interruption or widespread malware outbreak, should be concluded with a lessons learned or postmortem analysis, which provides another source of improvement opportunities. The root cause of the incident and any observations made during the recovery can be used to improve the organization's IT security services.
It is important to understand the formal distinction between events and incidents as the foundation for incident management.
All incidents should be investigated and remediated as appropriate to restore the organization's normal operations as quickly as possible and to minimize adverse impact to the organization such as lost productivity or revenue. This resumption of normal service is the primary goal of incident management.
Not all incidents will require participation by the security team. For example, a spike in new user traffic to an application after a marketing campaign goes live, which leads to a partial loss of availability, is an operational issue and not a security one. A coordinated denial-of-service attack by a foreign nation-state, however, is an incident that requires participation by both IT and security personnel to successfully remediate.
All organizations require some level of incident management capability, that is, the tools and resources needed to identify, categorize, and remediate the impacts of incidents. This capability will revolve around an incident management plan, which should document the following:
A variety of standards exist to support organizations developing an incident response capability, including the ITIL framework, NIST Special Publication 800-61, “Computer Security Incident Handling Guide,” and ISO 27035, “Information security incident management.” All of these standards implement a lifecycle approach to managing incidents: planning before an incident occurs, activating the response and following the documented steps when one does, and reporting on the results and any lessons learned to help the organization better mitigate or respond to such incidents in the future. Figure 5.1 shows an example of the NIST SP 800-61 lifecycle and a description of the activities.
The CCSP's role in developing this capability is, obviously, to define the responses required for security incidents, while other stakeholders from IT or operations will provide input relevant to their responsibilities. All incident response frameworks emphasize the importance of planning ahead by identifying likely scenarios and developing response strategies before an incident occurs; incidents are high-stress situations, and ad hoc responses are less preferable than preplanned, rehearsed responses.
As with all aspects of cloud service use, there are shared responsibilities between the CSP and the consumer when responding to incidents. For many incidents, the CSP will not be involved; for example, an internal user at a consumer organization breaches policies to misuse company data. This does not impact the CSP; however, some incidents will require coordination, such as a denial-of-service attack against one consumer, which could impact other consumers by exhausting resources. The CSP could also suffer an incident, like an outage of a facility or theft of hardware resources, which must be reported to consumer organizations and may trigger their incident management procedures. The major CSPs have dedicated incident management teams to coordinate incident responses, including resources designed to provide notice to consumers such as a status page or dedicated account managers. Your incident planning must include coordination with your CSPs, and you should be aware of what capabilities are and are not available to you. Many CSPs forbid physical access to their resources during an incident response unless valid law enforcement procedures, such as obtaining a warrant, have been followed.
In addition to managing incidents that involve the CSP, incident management processes should take into account the difference between first- and third-party incidents. A first-party incident is one that happens internally, such as an employee stealing information or a server being infected with malware. A third-party incident is one that affects another organization, such as a contractor or vendor. Some third-party incidents may be operational only, such as a vendor being hit by ransomware that prevents them from providing services, while others may be security related, like a data breach at a contractor that exposes the data of the contractor's customers. For third-party incidents, the incident response plan should include information such as points of contact and the steps needed to coordinate the incident response with the third party. This should include any legal or regulatory obligations the organization must meet in the event of a data breach, such as reporting to regulators or impacted customers.
Another important aspect of the organization's incident management capability is the proper categorization and prioritization of incidents based on their impact and criticality. Incident management seeks to restore normal operations as quickly as possible, so prioritizing incidents and recovery steps is critical. This is similar to the risk assessment process where risks are analyzed according to their impact and likelihood; however, since incidents have already occurred, they are measured by the following:
Many organizations utilize a P score (P0–P5) to categorize incidents. Members of the incident response team use this score to prioritize the work required to resolve an incident. For example, a P5 or low-priority item may be addressed as time permits, while a P0, which equates to a complete disruption of operations, requires that all other work be suspended. In many organizations, a certain priority rating may also be a trigger for other organizational capabilities, such as the invocation of a business continuity or disaster recovery plan if the incident is sufficiently disruptive to normal operations.
In the ITIL framework, problems are the causes of incidents or adverse events, and the practice of problem management seeks to improve the organization's handling of these incidents. Problems are, in essence, the root cause of incidents, so problem management utilizes root-cause analysis to identify the underlying problem or problems that lead to an incident and seeks to minimize the likelihood or impact of incidents in the future; it is therefore a form of risk management.
Identified problems are tracked as a form of common knowledge, often in a known issues or known errors database. These document an identified root cause that the organization is aware of, as well as any shared knowledge regarding how to fix or avoid them. Examples might include a set of procedural steps to follow when troubleshooting a particular system, or a workaround, which is a temporary fix for an incident. Workarounds do not mitigate the likelihood of a problem occurring but do provide a quick fix, which supports the incident management goal of restoring normal service as quickly as possible. Problems are risks to the organization, and if the workarounds do not provide sufficient risk mitigation, then the organization should investigate a more permanent solution to resolve the underlying cause.
The last few years have seen an enormous shift from traditional release management practices due to widespread adoption of Agile development methodologies. The primary change is the frequency of releases due to the increased speed of development activities in continuous integration/continuous delivery, often referred to as a CI/CD pipeline. Under this model, developers work on small units of code and merge them back into the main branch of the application's code as soon as they are finished.
Inherent in a CI/CD pipeline is the concept of automated testing, which is designed to identify problems more quickly and in a way that makes them easier to solve. Running tests on only the small units of code being integrated makes it easier for developers to identify where a problem is and how to fix it. From the user's perspective, they get access to new features more quickly than waiting for a monolithic release and, ideally, encounter fewer bugs due to the faster feedback that automated testing gives developers.
Release management activities typically comprise the logistics needed to release the changed software or service and may include identifying the relevant components of the service, scheduling the release, and conducting a post-implementation review to confirm that the change was implemented as intended, that is, that the new software or service is functioning correctly. The process has obvious overlap with change management processes.
In the Agile methodology, the process of release management may not involve manual scheduling, instead relying on the organization's standard weekly, monthly, or quarterly release schedule. For organizations using other development methodologies, the process for scheduling the deployment might require coordination between the service provider and consumers to mitigate risks associated with downtime, or the deployment may be scheduled for a previously reserved maintenance window during which customer access will be unavailable.
The release manager must perform a variety of tasks after the release is scheduled and before it occurs, including identifying whether all changes to be released have successfully passed required automated tests as well as any manual testing requirements. Other manual processes such as updating documentation and writing release notes may also be part of the organization's release management activities. Once all steps have been completed, the release can be deployed and tested and is ready for users.
In more mature organizations, the CD in CI/CD stands for continuous deployment, which automates the process of release management to deliver a truly automatic CI/CD pipeline. Once a developer has written their code and checked it in, an automated process is triggered to test the code, and if all tests pass, it is integrated and deployed automatically to users. This has the advantage of getting updated software and services deployed to users quickly and offers security benefits as well. For example, automated testing is typically less expensive than manual testing, making it feasible to conduct more tests and increase the frequency in complement to the organization's manual testing plans.
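The following is a minimal sketch of the gate logic behind continuous deployment: run the automated test suite and deploy only if every test passes. The run_tests and deploy helpers are hypothetical placeholders, and in practice this logic lives in a CI/CD tool's pipeline configuration rather than in a standalone script.

```python
import subprocess
import sys


def run_tests() -> bool:
    """Run the automated test suite; returns True only if all tests pass."""
    # Placeholder: invoke whatever test runner the project uses (pytest here).
    result = subprocess.run(["pytest", "--quiet"])
    return result.returncode == 0


def deploy() -> None:
    """Placeholder deployment step, e.g., pushing a container image or artifact."""
    print("All tests passed; deploying to production...")


if __name__ == "__main__":
    if run_tests():
        deploy()
    else:
        # Failing tests block the release, keeping defects out of production.
        print("Tests failed; deployment blocked.")
        sys.exit(1)
```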
Even organizations with continuous deployment may require some deployment management processes to deal with deployments that cannot be automated, such as new hardware or software. In this case, the release management process should develop a set of deployment steps including all required assets, dependencies, and deployment order to ensure that the deployment process is successful.
One recent technology development supporting this trend of more frequent deployment is containerization, which packages application code and non-OS software the application requires into a container. Containers can be run on any computing platform regardless of underlying hardware or operating system, so long as container software such as the Docker Engine is available. The container software makes the resources of the computing environment available in response to the requirements of the containerized applications when they run, similar to the way virtualization makes hardware resources available to a virtualized guest OS.
Containers offer portability advantages, meaning they can be run on any OS and hardware platform with container software, as well as availability advantages over traditional infrastructure because they require fewer pieces of software to run. Continuous deployment pipelines often make use of containers, as they provide more flexibility and can speed development.
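As a small illustration of how container software makes host resources available to a packaged application, the sketch below starts a container using the Docker SDK for Python (the third-party docker package). It assumes a local Docker Engine is running; the nginx:latest image is simply a convenient public example.

```python
import docker  # third-party Docker SDK for Python (pip install docker)

# Connect to the local Docker Engine using environment defaults.
client = docker.from_env()

# Run a containerized application; the engine supplies CPU, memory, and
# networking to the container regardless of the underlying host OS.
container = client.containers.run(
    "nginx:latest",         # example public image
    detach=True,            # run in the background
    ports={"80/tcp": 8080}  # map container port 80 to host port 8080
)

print(f"Started container {container.short_id}")
container.stop()
container.remove()
```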
Deployment scheduling for noncontinuous environments may follow a set schedule such as a routine maintenance window or be deployed in a phased approach where a small subset of users receive the new deployment. This phased approach can offer advantages for riskier deployments such as large operating system updates, where unexpected bugs or issues may be encountered. Rolling out updates to a subset of the user pool reduces the impact of any bugs in the release, and the organization has more opportunity to find and correct them. Organizations may also choose not to push deployments but instead allow users to pull the updated software/service on their own schedule. This is the model many consumer OSs follow, where an update is made available and users are free to accept or delay the update at their discretion. This is advantageous for uptime as a user may not want to restart their machine in the middle of an important task, though it does lead to the problem of software never being updated, which means vulnerabilities fixed by that update are also not addressed.
Configuration management (CM, not to be confused with change management, which is also abbreviated CM) comprises practices, activities, and processes designed to maintain a known good configuration of something. Configuration items (CIs) are the things that are placed under configuration control and may be assets such as source code, operating systems, documentation, or even entire information systems.
Changes to CIs are usually required to go through a formal change process designed to ensure that the change does not create an unacceptable risk situation, such as introducing a vulnerable application or architecture. Part of the change management process must include updating the configuration management database (CMDB) to reflect the new state of the services or components after the change is executed. For example, if a particular host is running Service Pack 1 (SP1) of an OS and a major upgrade to SP2 is performed, the CMDB must be updated once the change is completed successfully.
IT service CM may include hardware, software, or the cloud services and configurations in use by a consumer organization, while the CSP would also need to include configurations of the service infrastructure as well as the supply chain used to provide the services.
In many organizations, a formal CMDB will be used to track all CIs, acting as a system of record against which current configurations may be compared in order to detect systems that have gone out of line with expected configurations. The CMDB can also be useful for identifying vulnerabilities. As a source of truth for systems and software versions running in the organization, it is possible to query the CMDB to identify software running in the organization that contains a newly disclosed vulnerability. Furthermore, the CMDB can be useful to support audits by acting as a source of population data to allow auditors to choose a subset of systems to review. If the CMDB is not updated after changes are made, the organization is likely to face audit findings stemming from failure to properly follow processes.
Due to the type of information contained in a CMDB, such as version numbers, vendor information, hardware components, etc., it can also be used as the organization's asset inventory. Many tools that provide CMDB functionality can also be used to automatically detect and inventory systems, such as by monitoring network records to identify when a new system joins or by integrating with cloud administrative tools and adding any new cloud services that are invoked to the CMDB.
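The following sketch illustrates the kind of query described above: checking CMDB records against a newly disclosed vulnerability to identify affected systems. The record layout, package name, and version string are assumptions for the example; a real CMDB would be queried through its own API or query language.

```python
# Hypothetical CMDB records: configuration items with installed software versions
cmdb = [
    {"hostname": "web-01", "software": {"openssl": "1.1.1k", "nginx": "1.20.1"}},
    {"hostname": "web-02", "software": {"openssl": "3.0.1",  "nginx": "1.20.1"}},
    {"hostname": "db-01",  "software": {"openssl": "1.1.1k", "postgres": "13.4"}},
]


def affected_hosts(package: str, vulnerable_versions: set[str]) -> list[str]:
    """Return hosts whose CMDB record lists a vulnerable version of a package."""
    return [
        ci["hostname"]
        for ci in cmdb
        if ci["software"].get(package) in vulnerable_versions
    ]


# Example: a new advisory affects a specific openssl version
print(affected_hosts("openssl", {"1.1.1k"}))  # ['web-01', 'db-01']
```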
Checklists or baselines are often mentioned in discussions of CM, primarily as starting points or guidance on the desired secure configuration of particular system types. Configuration checklists are often published by industry or regional groups with specific guidance for hardening operating systems like Windows, macOS, and various Linux distributions, as well as hardening popular applications such as Microsoft Office and collaboration tools. In many cases, the vendors themselves publish security checklists indicating how their products' various security settings can be configured. In all cases, these checklists are usually an input to an organization's CM process and should be tailored to meet the organization's unique needs.
In the ITSM view of IT as a set of services, including information security, there is a function for defining, measuring, and correcting issues related to delivery of the services. In other words, performance management is a critical part of ITSM. This is closely related to the process of continual service improvement, and in fact, the same metrics are likely to be used by the organization to determine if services are meeting their defined goals.
Service level management rests on the organization's defined requirements for a service. The most common service level many cloud organizations encounter is availability, often expressed as a percentage of time that a system can be reached and utilized, such as 99.9 percent. A service with a 99.9 percent availability level must be reachable approximately 364.64 days per year; put another way, the system can be down for less than nine hours each year. Examples of other service levels that may be managed include the number of concurrent users supported by a system, durability of data, response times to customer support requests, recovery time in the event of an interruption, and timeframes for deployment of patches based on criticality.
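The arithmetic behind these figures is straightforward; the short sketch below converts an availability percentage into the maximum downtime allowed per year, assuming a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Maximum yearly downtime permitted by an availability service level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% availability -> {allowed_downtime_hours(sla):.2f} hours of downtime per year")
# 99.0%  -> 87.60 hours
# 99.9%  -> 8.76 hours
# 99.99% -> 0.88 hours
```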
A key tool in managing service levels is the service level agreement (SLA), which is a formal agreement similar to a contract, but focused on measurable outcomes of the service being provided. This measurement aspect is what makes SLAs critical elements of security risk mitigation programs for cloud consumers, as it allows them to define, measure, and hold the cloud provider accountable for the services being consumed.
SLAs require routine monitoring for enforcement, and this typically relies on metrics designed to indicate whether the service level is being met. Availability metrics are often measured with tools that check to see if a service can be reached. For example, a script may run that checks to see if a website loads at a particular address. The script may run once an hour and log its results; if the SLA is 99.9 percent, then the service should not be down for more than nine hours in a given year. If the service level is not met, the SLA should define penalties, usually in the form of a refund or no obligation for the consumer to pay for the time the service was unavailable.
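A minimal version of the kind of check script described above might look like the following. It assumes the third-party requests library and a placeholder URL; production monitoring would normally use a dedicated tool, run the probe on a schedule (for example, via cron), and feed the results into SLA reporting.

```python
import logging
from datetime import datetime, timezone

import requests  # third-party HTTP client (pip install requests)

logging.basicConfig(filename="availability.log", level=logging.INFO)

URL = "https://status.example.com/"  # placeholder service endpoint


def check_once() -> bool:
    """Record whether the service responded successfully to a single probe."""
    timestamp = datetime.now(timezone.utc).isoformat()
    try:
        response = requests.get(URL, timeout=10)
        up = response.status_code == 200
    except requests.RequestException:
        up = False
    logging.info("%s %s %s", timestamp, URL, "UP" if up else "DOWN")
    return up


if __name__ == "__main__":
    # Scheduled hourly (e.g., by cron), the resulting log supports SLA reporting.
    check_once()
```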
Defining the levels of service is usually up to the cloud provider in public cloud environments, though there is obviously a need to meet customer demands in order to win business. Requirements should be gathered for cloud service offerings regardless of the deployment model, and customer feedback should also be gathered and used as input to the continual service improvement. The metrics reported in SLAs are a convenient source of input to understand if the services are meeting the customers' needs.
In cloud environments, the ability of users to reach and make use of the service is incredibly important, so the provider must ensure that adequate measures are in place to preserve the availability aspect of the relevant services. Availability and uptime are often used synonymously, but there is an important distinction: a service may be “up,” that is, reachable, yet not available, meaning it cannot actually be used. This could be the case if a dependency such as the access control system is down, so users can get to a login page for the cloud service but no further.
Due to the expansive nature of availability management, it is critical to view this as a holistic process. Factors that could negatively impact availability include many of the same concerns that an organization would consider in business continuity and disaster recovery, including loss of power, natural disasters, or loss of network connectivity.
There are additional concerns for providing a service that meets the agreed-upon service levels, including the issue of maintenance. Some cloud service providers exclude periods of scheduled maintenance from their availability guarantees. For example, a system will be available 99 percent of the time with the exception of the third Saturday of each month. This gives defined timeframes to make changes that require a loss of availability, as well as some flexibility for unexpected events or emergency maintenance outside the normal schedule.
Many of the tools that make cloud computing possible provide integral high availability options. For example, many virtualization tools support automatic moving of guest machines from a failed host in the event of an outage, or provide for load balancing so that sudden increases in demand can be distributed to prevent a denial of service. Many cloud services are also designed to be highly resilient, particularly PaaS and SaaS offerings that can offer features such as automatic data replication to multiple data centers around the world, or concurrent hosting of applications in multiple data centers so that an outage at one does not render the service unreachable.
Cloud consumers have a role to play in availability management as well. Consumers of IaaS will, obviously, have the most responsibility with regard to availability of their cloud environment, since they are responsible for virtually everything except the physical facility. PaaS and SaaS users will need to properly architect their cloud solutions to take advantage of their provider's availability options. For example, some cloud providers offer automatic data replication for cloud-hosted databases, but there may be configuration changes required to enable this functionality. There may be other concerns as well, such as data residency or the use of encryption, which can complicate availability; it is up to the cloud consumer to gather and understand these requirements and to configure their cloud services appropriately.
In ITSM, one of the core concerns of availability is the amount of service capacity available compared with the amount being subscribed to. In a simple example, if a service has 100 active users but only 50 licenses available, that means the service is over capacity and 50 users will face a denial-of-service condition. In this simple example, the service is oversubscribed, meaning there are more users than capacity. The service provider must be able to predict, measure, and plan for adequate capacity to meet its obligations; failure to do so could result in financial penalties in the form of SLA enforcement.
As users, we are aware of the negative impacts resulting from a lack of system resources—irritating situations such as a spinning hourglass or beach ball when our desktop computer's RAM capacity is exceeded. While a minor irritant for individual users, this situation could prove quite costly for a business relying on a cloud service provider's infrastructure. Any service that is being consumed should be measurable, whether it is network bandwidth, storage space, processing capability, or availability of an application. Measured service is one of the core elements of cloud computing, so metrics that illustrate demand for the service are relatively easy to identify.
Cloud service providers must take appropriate measures to identify the service capacity they need to provision. These measures might include analysis of past growth trends to predict future capacity, identifying capacity agreed to in SLAs, or even analysis of external factors such as knowing that a holiday season will cause a spike in demand at certain customers like online retailers. Monitoring of current services, including utilization and demand, should also be part of the analysis and forecasting model.
In some cases, cloud service providers and their customers may be willing to accept a certain amount of oversubscription, especially as it could offer cost savings. To extend the previous example, assume the service provider offers 50 licenses and the business has 100 users split between the United States and India. Given the time zone difference between the two countries, it is unlikely that all 100 users will try to access the system simultaneously, so oversubscription does not present an issue.
If the consumer does require concurrent accessibility for all 100 users, then they must specify that as an SLA requirement. The provider should then utilize their capacity management processes to ensure that adequate capacity is provisioned to meet the anticipated demand.
Digital forensics, broadly, is the application of scientific techniques to the collection, examination, and interpretation of digital data. The primary concern in forensics is the integrity of data, as demonstrated by the chain of custody. Digital forensics is a field that requires very particular skills and is often outsourced to highly trained professionals, but a CCSP must be aware of digital forensic needs when architecting systems to support forensics and must know how to acquire appropriate skills as needed to respond to a security incident.
Digital forensics in cloud environments is complicated by a number of factors; some of the very advantages of cloud services are also major disadvantages when it comes to forensics. For example, high availability and data replication mean that data is stored in multiple locations around the world simultaneously, which complicates the identification of a single crime scene. Multitenant models of most cloud services also present a challenge, as there are simply more people in the environment who must be ruled out as suspects. The shared responsibility model also impacts digital forensics in the cloud. As mentioned previously, most CSPs do not allow consumers physical access to hardware or facilities, and even with court orders like a warrant, the CSPs may have procedures in place that make investigation, collection, and preservation of information more difficult. This is not to frustrate law enforcement but is a predicament caused by the multitenant model; allowing investigation of one consumer's data might inadvertently expose data belonging to other consumers. Investigation of one security incident should, as a rule, not be the cause of other security breaches!
In legal terminology, discovery means the examination of information pertinent to a legal action. E-discovery is a digital equivalent comprising steps including identification, collection, preservation, analysis, and review of electronic information. There are two important standards a CCSP should be familiar with related to e-discovery.
cloudsecurityalliance.org/artifacts/csa-security-guidance-domain-3-legal-issues-contracts-and-electronic-discovery.
When legal action is undertaken, it is often necessary to suspend some normal operations, such as the routine destruction of data or records according to a defined schedule. In this case, a process known as legal hold will be utilized, whereby data is preserved until the legal action is completed. Provisions for legal hold, such as extra storage availability and proper handling procedures, must be part of contracts and SLAs, and during legal proceedings, the cloud consumer should have easy access to appropriate points of contact at the CSP to facilitate e-discovery or other law enforcement requirements.
The process of collecting evidence is generally a specialized activity performed by experts, but security practitioners should be aware of some steps, especially those performed at the beginning before a forensics expert is brought in.
When handling evidence, the chain of custody documents the integrity of data, including details of time, manner, and person responsible for various actions such as collecting, making copies, performing analysis, and presenting the evidence. Chain of custody does not mean that the data has not been altered in any way, as it is often necessary to make changes such as physically collecting and moving a piece of hardware from a crime scene to a lab. Instead, chain of custody provides a documented, reliable history of how the data has been handled, so if it is submitted as evidence, it may be relied upon. Adequate policies and procedures should exist, and it may be appropriate to utilize the skills of trained forensic experts for evidence handling.
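One common supporting practice is hashing evidence at the time it is collected so that later copies can be verified against the original. The sketch below records a simple custody entry alongside a SHA-256 digest; the field names and file path are illustrative assumptions, and real investigations should follow the organization's documented forensic procedures.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_file(path: str) -> str:
    """Compute a SHA-256 digest to support later integrity verification."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_entry(path: str, action: str, handler: str) -> dict:
    """Create one chain-of-custody record for an action taken on evidence."""
    return {
        "evidence": path,
        "action": action,        # e.g., collected, copied, analyzed
        "handler": handler,      # person responsible for the action
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sha256": sha256_of_file(path),
    }


# Example: logging the initial collection of a disk image (hypothetical path)
entry = custody_entry("/evidence/host01-disk.img", "collected", "analyst.jdoe")
print(json.dumps(entry, indent=2))
```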
The scope of evidence collection describes what is relevant when collecting data. In a multitenant cloud environment, this may be particularly relevant, as collecting data from a storage cluster could inadvertently expose data that does not belong to the requesting party. Imagine that two competing companies both utilize a CSP, and Company A makes a request for data relevant to legal proceedings. If the CSP is not careful about the evidence collected and provided to Company A, they may expose sensitive data about Company B to one of their competitors!
Evidence presented to different audiences will follow different rules. For example, an industry regulator may have a lower integrity threshold for evidence, as they are not able to assess criminal penalties for wrongdoing. A court of law, however, will have higher constraints as the stakes are higher—they have the power to levy fines or possibly imprison individuals. Evidence should possess these five attributes to be useful.
There are four general phases of digital evidence handling: collection, examination, analysis, and reporting. There are a number of concerns in the first phase, collection, that are essential for a CCSP to understand. Evidence may be acquired as part of standard incident response processes before the need for forensic investigation and criminal prosecution has been identified, so the incident response team needs to handle evidence appropriately in case a chain of custody needs to be demonstrated. Proper evidence handling and decision-making should be part of the incident response procedures and training for team members performing response activities.
There are a number of challenges associated with evidence collection in a cloud environment, including the following:
There are a number of important steps the organization should take prior to an incident that can support investigations and forensics. These can be built into several processes the organization is likely to perform for other security objectives, including the following:
Although forensics experts may perform significant amounts of evidence collection, security practitioners must be aware of some best practices, including the following:
Once evidence has been collected, it must be adequately preserved to support a variety of goals: maintain the chain of custody, ensure admissibility, and be available for analysis to support the investigation. Preservation activities and concerns should cover the following:
Adequate coordination with a variety of stakeholders is critical in any IT operation, and the move to utilize cloud computing resources coupled with an increasingly regulated and dispersed supply chain elevates the priority of managing these relationships. Communication is a cornerstone of this management; providing adequate and timely information is critical. While this may be a skillset fundamental to project managers rather than security practitioners, it is worth understanding the importance of effective communication and supporting it whenever possible.
Effective communication should possess a number of qualities. The contents, nature, and delivery of communications will drive many decisions, which can be elicited using a series of questions about the information to be conveyed.
There are a number of stakeholders or constituents with whom an organization is likely to communicate regarding IT services, organizational news and happenings, and emergency information. Establishing clear methods, channels, and formats for this communication is critical.
Few organizations exist in a vacuum; the modern supply chain spans the globe, and regulators have begun to enforce more stringent oversight of it. It is therefore essential that an organization and its security practitioners understand the supply chain and establish adequate communications.
The first step in establishing communications with vendors is an inventory of critical third parties on which the organization depends. This inventory will drive third-party or vendor risk management activities in two key ways. First, some vendors may be critical to the organization's ongoing functioning, such as a CSP whose architecture has been adopted by the organization. Second, some vendors of goods and services may provide critical inputs to an organization like a payment card processor whose service supports the organization's ability to collect money for its goods or services.
Communication with critical vendors should be similar to internal communications due to the critical role these vendors play in the business. If a vendor incident is likely to impact an organization's operations, the organization ought to have well-established communications protocols to receive as much advance notice as possible. If a consumer notices an incident such as loss of availability of a vendor's service, there should be adequate reporting mechanisms to raise the issue and resolve it as quickly as possible.
Many vendor communications will be governed by contract and SLA terms. When a CSP is a critical vendor, there should be adequate means for bidirectional communication of any issues related to the service, such as customer notices of any planned outages or downtime, emergency notifications of unplanned downtime, and customer reporting for service downtime or enhancements. In many cases, this will be done through a customer support channel with dedicated personnel as well as through ticketing systems, which creates a trackable notice of the issue, allowing all parties to monitor its progress.
As cloud consumers, most organizations will be the recipients of communications from their chosen CSPs. While this might seem to imply there are no responsibilities other than passively receiving information from and reporting issues to the CSP, consumers do have a critical accountability: defining SLA terms. Levels of communication service from the CSP should all be defined and agreed upon by both parties, such as speed of acknowledging and triaging incidents, required schedule for notification of planned downtime or maintenance, days/times support resources are available, and even the timeframe and benchmarks for reporting on the service performance. SLAs may be generic and standardized for all customers of a CSP or may be highly specific and negotiated per customer, which offers more flexibility but usually at greater cost.
A key source of information to be communicated between CSPs and their customers is the responsibility for various security elements of the service. The CSP is solely responsible for operational concerns like environmental controls within the data center, as well as security concerns like physical access controls. Customers using the cloud service are responsible for implementing data security controls, like encryption, that are appropriate to the type of data they are storing and processing in the cloud. Some areas require action by both the provider and customer, so it is crucial for a CCSP to understand which cloud service models are in use by the organization and which areas of security must be addressed by each party. This is commonly referred to as the shared responsibility model, which defines who is responsible for different aspects of security across the different cloud service models. The generic model in Table 5.1 identifies key areas of responsibility and ownership in various cloud service models.
TABLE 5.1 Cloud Shared Responsibility Model
C = Customer, P = Provider
| Responsibility | IaaS | PaaS | SaaS |
| --- | --- | --- | --- |
| Data classification | C | C | C |
| Identity and access management | C | C/P | C/P |
| Application security | C | C/P | C/P |
| Network security | C/P | P | P |
| Host infrastructure | C/P | P | P |
| Physical security | P | P | P |
A variety of CSP-specific documentation exists to define shared responsibility in that CSP's offerings, and a CCSP should obviously be familiar with the particulars of the CSP their organization is utilizing. The following is a brief description of the shared responsibility model for several major CSPs and links to further resources:
More information can be found here: aws.amazon.com/compliance/shared-responsibility-model.
More information can be found here: docs.microsoft.com/en-us/azure/security/fundamentals/shared-responsibility.
More information can be found here: cloud.google.com/security.
Partners will often have a level of access to an organization's systems similar to that of the organization's own employees but are not directly under the organization's control. Communication with partners will be similar to communication with employees, with initial steps required when a new relationship is established, ongoing maintenance activities throughout the partnership, and termination activities when the partnership is wound down. Each of these phases should deliver clear expectations regarding security requirements.
There are a vast array of regulatory bodies governing information security, and most of them have developed cloud-specific guidance for compliant use of cloud services. In the early days of cloud computing, security practitioners often faced significant unknowns when moving to the cloud, but as the cloud became ubiquitous, regulators delivered clear guidance, and CSPs moved quickly to deliver cloud solutions tailored to those compliance requirements. A CCSP is still responsible for ensuring that their cloud environment is in compliance with all regulatory obligations applicable to their organization.
The main component of regulatory communication is monitoring incoming information about regulatory requirements that affect cloud use. For example, the implementation of GDPR in the European Union (EU) forced many organizations to make architectural decisions regarding their cloud applications. GDPR's restrictions on data leaving the geographic boundaries of the EU mean many organizations need to implement additional privacy controls to transfer data out of the EU; otherwise, they must host their applications in an EU-based data center. The CCSP should subscribe to feeds and be aware of regulatory changes that impact their organization's use of the cloud.
Similar to monitoring incoming cloud requirements, a CCSP may also be required to report information to regulatory bodies regarding their organization's state of compliance. For example, U.S.-based companies handling EU personal data were required to communicate with the U.S. Department of Commerce on the status of their privacy compliance under the EU-U.S. Privacy Shield framework (since invalidated and succeeded by the EU-U.S. Data Privacy Framework). CCSPs should be aware of regulatory reporting requirements specific to their organization and ensure that required documentation, artifacts, audits, etc., are communicated in a timely fashion.
Communications may not be a primary job responsibility of a CCSP, but important details of security risk management work may need to be shared. Working with appropriate personnel in the organization will be crucial to ensure that information is communicated in a timely manner and with relevant details for each audience, such as the following:
Security operations represent all activities undertaken by an organization in monitoring, maintaining, and generally running their security program. This may include continuous oversight of security systems to identify anomalies or incidents, as well as an organizational unit to house concerns related to security processes like incident response and business continuity. A large percentage of security process and procedure documentation will be attached to this function, and these processes should also be the target of continuous improvement efforts.
As in many security topics, an ISO standard exists that may be useful for security practitioners conducting security operations: ISO 18788, “Management system for private security operations — Requirements with guidance for use.” A word of caution: as its numbering suggests (it is not part of the 27000 series), this document contains a great deal of material not applicable to cybersecurity or infosec concerns, such as implementing and managing a private security force with authorization to use force. However, it also provides a business management framework for understanding the organization's needs, guidance on designing strategic and tactical plans for operating security programs, and a risk management approach, and it suggests a standard plan, do, check, act framework for implementing and improving security operations.
The security operations center (SOC) is an organizational unit designed to centralize a variety of security tasks and personnel at the tactical (mid-term) and operational (day-to-day) levels of the organization. While security strategy may rely on input from top leaders such as a board of directors, department secretary or minister, or the C-suite executives, SOC personnel are responsible for implementing steps required to achieve that strategy and maintain daily operations. Building and running a SOC in a traditional, all on-prem environment is basically building a monitoring and response function for IT infrastructure. Extending these concepts to the cloud may require some trade-offs, as the CSP will not offer the same level of access for monitoring that an on-prem environment does.
It is also important to note that there will be at least two SOCs involved in cloud environments. The CSP should run and manage their own SOC focused on the elements they control under the shared responsibility model such as infrastructure and physical security, while consumers should run their own SOC for their responsibility areas, chiefly data security when using the cloud.
The CISO Mind Map published by security author Rafeeq Rehman, found at rafeeqrehman.com/?s=mindmap, provides a more information-security-centric view of security operations than ISO 18788. Updated each year, the Mind Map details the items that a CISO's role should cover; the largest element of the job responsibilities is security operations, which is broken down into three main categories. This provides a strong framework for responsibilities the SOC should undertake, including the following:
A SOC is typically made up of security analysts, whose job involves taking incoming data and extracting useful information, and security engineers who can keep operations running smoothly. There may be overlap with operations personnel, and in some organizations, the SOC may be combined with other operational functions. Common functions that may be performed in the SOC are the following:
Alert prioritization: Not all alerts are critical, require immediate attention, or represent imminent harm to the organization. SOC functions related to log management should include definitions to assist in prioritizing alerts received from various sources such as monitoring tools, as well as defined procedures for taking action on alerts.
Loss of commercial power at a facility is an alert worth monitoring but may not be a major incident if backup power is available and the power is likely to be restored quickly. If the organization is experiencing exceptional operations, such as retail during a major holiday, then the organization may choose to preemptively declare a business interruption and shift processing to an alternate facility. Detecting and triaging incidents, up to and including declaration of an interruption or disaster, is a logical function for the SOC to perform due to the type of data they work with.
As always, there are two key perspectives for the SOC. CSPs will likely need a robust SOC function with 24/7/365 monitoring of the environment. While such a capability will be expensive, the cost is likely justified by requirements of the cloud consumers and can be shared among this large group. Cloud consumers may operate a SOC for their own operations, which will include any on-prem IT services as well as their cloud services and environments. This may require the use of some legacy tools deployed to more traditional cloud services such as IaaS or PaaS, newer tools designed to monitor services such as SaaS (for example, access management or encryption tools), or even the use of CSP-provided monitoring capabilities.
As an example, the major CSPs offer security incident reporting services that customers can log in to if an incident affects them. They also offer the following public status pages that list operational information for public-facing services:
status.aws.amazon.com
status.azure.com/en-us/status
status.cloud.google.com
One crucial decision to be made when designing a SOC is the use of internal resources or outsourcing the function (build versus buy). As previously mentioned, the CSPs can likely justify the cost of robust SOC resources due to cost sharing among customers and the requirements those customers will impose. A small organization with only cloud-based architecture may decide to outsource their security operations and monitoring, as many services provide dedicated support for specific cloud platforms at a lower cost than building the same function internally. These are known as managed security services providers (MSSPs). Like most business decisions, this will be a trade-off between control and cost and should be made by business leaders with input from security practitioners. A CCSP should understand the cloud architecture and communicate any risks the organization might assume by utilizing a third-party SOC.
Monitoring of security controls used to be an activity closely related to formal audits that occur relatively infrequently, sometimes once a year or even once every three years. A newer concept is known as continuous monitoring, which is described in the NIST SP 800-37 Risk Management Framework (RMF) as “Maintaining ongoing awareness to support organizational risk decisions.” Information that comes from an audit conducted more than a year ago is not ongoing awareness. Instead, the RMF specifies the creation of a continuous monitoring strategy for getting near real-time risk information.
Real-time or near real-time information regarding security controls comprises two key elements: the status of the controls and any alerts or actionable information they have created. Network resources are at risk of attacks, so network security controls like IDS are deployed. Continuous monitoring of the IDS's uptime is critical to ensure that risk is being adequately mitigated. A facility to view any alerts generated by the device, as well as personnel and processes to respond to them, is also crucial to the organization's goal of mitigating security risks; if the IDS identifies malicious network activity but no action is taken to stop it, the control is not effective.
A longer-term concern for monitoring security controls and risk management is the suitability of the current set of tools. As organizations evolve, their infrastructure will likely change, which can render existing tools ineffective. The SOC should be charged with ensuring that it can monitor the organization's current technology stack, and a representative should be part of change management or change control processes. Migrating a business system from on-prem hosting to a SaaS model will likely have a security impact with regard to the tools needed to monitor it, and the change board should ensure that this risk is planned for as part of the change.
In general, the SOC should have some monitoring capabilities across all physical and logical infrastructure, though detailed monitoring of some systems may be performed by another group. For example, a physical access control system dashboard may be best monitored by security guards who can perform appropriate investigation if an alarm is triggered. Some organizations run a network operations center (NOC) to monitor network health, and NOC engineers would be best suited to manage telecommunications equipment and ISP vendors. However, an operational incident in either of these two systems, such as a break-in or loss of ISP connectivity, could be an input to the SOC's incident management function. Several controls that might be particularly important for SOC monitoring include the following:
NIST SP 800-92, “Guide to Computer Security Log Management,” defines a log as “a record of the events occurring within an organization's systems and networks” and further states that “Many logs within an organization contain records related to computer security … including security software, such as antivirus software, firewalls, and intrusion detection and prevention systems; operating systems on servers, workstations, and networking equipment; and applications.” These logs of internal activity can unfortunately be overwhelming for humans to attempt to meaningfully review due to the sheer number of events generated by modern information systems. Security information and event management (SIEM) tools offer assistance.
SIEM tools provide a number of functions useful to security, namely, the following:
Normalization: Data from different log sources often arrives in inconsistent formats, which makes it harder to analyze, so SIEM platforms can transform data to a common format, such as the use of UTC for timestamps to avoid issues with time zones, or the use of consistent field names like Timestamp instead of Event Time.
Correlation: Correlation refers to discovering relationships between two or more events. For example, if a user has suddenly logged in from a new location, accessed email, and started downloading files, it could indicate compromised credentials being used to steal data. It could also indicate that the user is attending a conference and is getting some work done in between sessions, but the organization should still perform checks to verify. If travel is a common situation, the organization might also integrate data from an HR or travel reservation system to correlate travel records for a certain user with their activity accessing data from a particular country. If the user is not known to be traveling in that country, the activity is highly suspicious.
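As a rough illustration of the normalization step, the sketch below converts events from two hypothetical log sources into a common structure with UTC timestamps and a consistent Timestamp field name. Real SIEM platforms perform this with built-in parsers and far richer schemas.

```python
from datetime import datetime, timezone

# Hypothetical raw events from two different log sources
raw_events = [
    {"Event Time": "2023-03-01T09:15:00-05:00", "user": "adoe", "msg": "login"},
    {"timestamp": "2023-03-01T14:20:00+00:00", "username": "adoe", "message": "file download"},
]


def normalize(event: dict) -> dict:
    """Map source-specific fields onto a common schema with UTC timestamps."""
    raw_time = event.get("Event Time") or event.get("timestamp")
    utc_time = datetime.fromisoformat(raw_time).astimezone(timezone.utc)
    return {
        "Timestamp": utc_time.isoformat(),
        "User": event.get("user") or event.get("username"),
        "Message": event.get("msg") or event.get("message"),
    }


for normalized in map(normalize, raw_events):
    print(normalized)
# Both events now share field names and a UTC time base, making correlation easier.
```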
Other sources of information could include external information like threat intelligence or CSP status feeds, which can be relevant in investigating anomalies and categorizing them as incidents. Major CSPs offer monitoring capabilities in their platforms as well, and this data may be critical for investigating anomalies.
PaaS and SaaS in particular can pose issues for logging and monitoring, as the logs themselves may not be visible to the consumer. CSPs typically don't share internal logs with consumers due to the risk of inadvertently exposing customer details, so these services may be a black hole in the organization's monitoring strategy. Solutions have been designed that can address this, such as the cloud access security broker (CASB), which is designed to log and monitor user access to cloud services. A CASB can be deployed inline in the user's connection to cloud services or can collect information from the cloud services via API. The CASB can monitor access and interaction with applications; for example, user Alice Doe logged into Dropbox at 12:30 and uploaded file Super Secret Marketing Data.xlsx at 12:32. Dropbox as a CSP may not share this data with consumers, but the CASB can help overcome that blind spot.
NIST SP 800-92 details critical requirements for securely managing log data, such as defining standard processes and managing the systems used to store and analyze logs. Because of their critical nature in supporting incident investigations, logs are often a highly critical data asset and worthy of robust security mechanisms. These may include the following:
An incident is any unplanned event that reduces, or has the potential to reduce, the quality of an IT service. In security terms, reducing the quality is synonymous with impacting any element of the CIA triad. As an example, an event could be a loss of commercial power at a data center. If the organization has been notified that the power company is performing maintenance and is able to continue running with backup generators, then this is merely an event; the IT services provided by the data center can continue uninterrupted. If, instead, the power is cut unexpectedly and the facility must switch to backup power, systems could become unresponsive or lose data during the transition, which is an obvious negative impact to data integrity and system availability.
Incident management or incident response (IR) exists to help an organization plan for incidents, identify them when they occur, and restore normal operations as quickly as possible with minimal adverse impact to business operations. This is referred to as a capability, or the combination of procedures and resources needed to respond to incidents, and generally comprises three key elements.
To ensure that an incident is dealt with correctly, it is important to determine how critical it is and prioritize the response appropriately. Each organization may classify incidents differently, but a generic scheme plots Urgency against Impact. Each is assigned a value of Low, Medium, or High, and incidents with the highest resulting priority are handled first. The following are descriptions and examples of these criteria:
Incident classification criteria and examples should be documented for easy reference. The IRP should contain this information, and it is also advisable to include it in any supporting systems like incident trackers or ticketing systems. These ratings are subjective and may change as the incident is investigated, so the IR coordinator should ensure that critical information like prioritization is communicated to the team.
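A minimal sketch of such an Urgency-versus-Impact scheme follows; the matrix values are illustrative only, and each organization should define and document its own ratings.

```python
# Illustrative urgency/impact priority matrix (example values, not a standard).
LEVELS = ("Low", "Medium", "High")

PRIORITY_MATRIX = {
    ("Low", "Low"): "Low",        ("Low", "Medium"): "Low",       ("Low", "High"): "Medium",
    ("Medium", "Low"): "Low",     ("Medium", "Medium"): "Medium", ("Medium", "High"): "High",
    ("High", "Low"): "Medium",    ("High", "Medium"): "High",     ("High", "High"): "High",
}

def prioritize(urgency: str, impact: str) -> str:
    """Return the response priority for a given (urgency, impact) rating."""
    if urgency not in LEVELS or impact not in LEVELS:
        raise ValueError("Ratings must be Low, Medium, or High")
    return PRIORITY_MATRIX[(urgency, impact)]

# A malware outbreak on critical systems is handled first.
print(prioritize("High", "High"))  # -> High
# A cosmetic defect in a rarely used report can wait.
print(prioritize("Low", "Low"))    # -> Low
```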
The organization's IRP should include detailed steps broken down by phases. At a high level, there are activities to be conducted prior to an incident and after an incident occurs; namely, planning the IR capability and the actual execution of a response when an incident is detected. There are a number of IR models that contain slightly different definitions for each phase, but in general they all contain the following:
Security researchers or malicious actors may draw attention to a vulnerability they have discovered or, worse, exploited, in which case the organization must take steps to investigate whether the claim is true and take appropriate action. Organizations are also beginning to subscribe to external intelligence feeds from third-party services that can provide advance alert of an incident, such as compromised user credentials showing up on the dark web, or adjacent domains being registered that might be used in a phishing attack.
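As a simple illustration of the last example, the sketch below generates a few common typosquatting variants of an organization's domain and checks whether any already resolve in DNS; the domain is a placeholder, and the variant list is deliberately tiny compared with what commercial monitoring services generate.

```python
import socket

DOMAIN = "example.com"  # placeholder for the organization's real domain

def typosquat_candidates(domain):
    """Generate a few simple lookalike variants (character swaps and doubled letters)."""
    name, tld = domain.rsplit(".", 1)
    variants = set()
    for i in range(len(name) - 1):  # adjacent character swaps
        variants.add(f"{name[:i]}{name[i + 1]}{name[i]}{name[i + 2:]}.{tld}")
    for i in range(len(name)):      # doubled characters
        variants.add(f"{name[:i]}{name[i]}{name[i:]}.{tld}")
    variants.discard(domain)
    return sorted(variants)

def registered(domain):
    """Return True if the domain currently resolves in DNS."""
    try:
        socket.gethostbyname(domain)
        return True
    except socket.gaierror:
        return False

for candidate in typosquat_candidates(DOMAIN):
    if registered(candidate):
        print(f"Possible phishing domain already registered: {candidate}")
```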
As soon as an incident is detected, it must be documented, and all actions from the point of detection through to resolution should be documented as well. Initial analysis or triage of incidents, prioritization, members of the IRT called upon to deal with the incident, and plans for implementing recovery strategies should be documented. As discussed in the section on digital forensics, it may be the case that a seemingly simple incident evolves into a malicious act requiring criminal charges. In this case, as much evidence as possible, handled correctly, will be crucial and cannot be created after the fact.
Investigation will begin as the IRT starts to gather information about the incident. This can be as simple as attempting to reproduce a user-reported app issue to determine if it is affecting only that user or is a system-wide issue. This is a key integration point with the practice of digital forensics: as soon as it appears that the incident may require prosecution or escalation to law enforcement, appropriately trained digital forensics experts must be brought in.
Notification may also occur during this stage, once the incident is properly categorized. In many cases, this will be done to satisfy legal or compliance obligations, such as U.S. state privacy laws and the EU GDPR, which require notification to authorities within a certain timeframe after an incident is detected. Appropriate communication resources like legal counsel or public relations personnel may be required on the IRT to handle these incidents.
Once sufficient information is gathered, a containment strategy must be formulated and implemented. This should follow the documented scenario-based responses in the IRP whenever possible—for example, responding to a ransomware attack by isolating any affected machines and working to establish how the malware was installed. Once this is ascertained, deploying techniques for preventing other machines from being affected may be more important than recovering data on machines that have already been compromised. These types of decisions should be made by qualified personnel with as much information as possible at their disposal.
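As one illustration of isolating affected machines, the sketch below assumes an AWS environment and swaps a compromised instance's security groups for a quarantine group with no inbound or outbound rules, cutting off network access while preserving the machine for forensics; the instance and group IDs are placeholders, and other platforms offer equivalent controls.

```python
import boto3

# Placeholder identifiers -- substitute real values from the investigation.
COMPROMISED_INSTANCES = ["i-0123456789abcdef0"]
QUARANTINE_SG = "sg-0fedcba9876543210"  # security group with no inbound/outbound rules

ec2 = boto3.client("ec2")

def quarantine(instance_id):
    """Replace the instance's security groups with the quarantine group."""
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[QUARANTINE_SG])
    print(f"{instance_id} isolated behind {QUARANTINE_SG}")

for instance in COMPROMISED_INSTANCES:
    quarantine(instance)
```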
Notification is typically handled as part of incident detection, and formal reporting is an ongoing task that starts after the initial notification. In simple cases, the report may comprise a single document at the resolution of an incident with the particulars, while in other cases, ongoing reporting of a dynamic situation will be required. As you may have seen during high-profile data breaches at public organizations, initial reporting is made on the incident with information available at that moment, even if it is estimated. As the investigation proceeds, updated reports are delivered to convey new information. The IR coordinator, along with appropriate communication resources on the IRT, is responsible for creating and disseminating all required reporting to appropriate stakeholders.
During the response phase, additional information should be gathered and documented regarding the incident. This may include details of any external attacker or malicious insider who might be responsible and should be conducted in accordance with the digital evidence handling previously discussed.
Eradication of the underlying cause is also a critical element of the recovery. This may include actions such as replacing faulty hardware, blocking a specific IP address or user from accessing a resource, or rebuilding compromised systems from known good images. Containment prevents the incident from spreading further, while eradication removes the immediate cause of the incident. It may be the case that neither prevents the problem from reoccurring in the future, which is a part of post-incident activities.
Recovery and eradication end when the organization returns to normal operations. This is defined as the pre-incident service level delivered using standard procedures.
Incident response is designed to help the organization restore normal operations quickly, but there are situations when this will be impossible. In these cases, the incident may need to be upgraded to an interruption, which is an event whose impact is significant enough to disrupt the organization's ability to achieve its goals or mission. A few users with malware infections on their workstations is an incident that can likely be handled by normal IT resources, but an outbreak affecting all critical systems and a large percentage of users is likely to require more resources.
In such cases, the IR coordinator may be empowered to declare an interruption or disaster, which invokes processes like BCDR. Similar to IR, there should be clear plans in place to guide the organization's response, such as emergency authorization to buy new equipment outside of normal purchasing processes or invocation of alternate procedures like using a third party to process information. The IR coordinator should be aware of their role and responsibilities in these plans, including the process for declaring and notifying appropriate members of the BCDR team to take over.
As discussed, the migration to a third-party CSP can introduce additional complexity to an organization. When planning for incident management, the CSP must be considered a critical stakeholder. Appropriate points of contact should be documented and reachable in the event of an incident, such as a service delivery or account manager who can support the organization's incident response and recovery. An incident at the CSP should be communicated to all the CSP's consumers, and the CSP may be required to provide specific information in the case of regulated data or negligence. Even if a data breach is the fault of the CSP, the consumer acting as data controller typically remains legally responsible for obligations such as notifying affected individuals and may still face fines.
Communication from a consumer to the CSP may be critical, especially if the incident has the ability to affect other CSP customers. Additionally, some options for performing incident management, such as rebuilding compromised architecture, will be different in the cloud environment. It is possible to rapidly redeploy completely virtual architecture to a known good state in the cloud; the same task in a traditional data center environment could take significant time.
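For example, assuming an AWS environment with a hardened, known good machine image on hand, a compromised virtual machine can be replaced in a few API calls; the image, instance, and subnet identifiers below are placeholders, and evidence preservation steps are omitted for brevity.

```python
import boto3

# Placeholder values for a hypothetical environment.
KNOWN_GOOD_AMI = "ami-0abc1234def567890"     # hardened, pre-incident image
COMPROMISED_INSTANCE = "i-0123456789abcdef0"
SUBNET_ID = "subnet-0aa11bb22cc33dd44"

ec2 = boto3.client("ec2")

# Terminate the compromised instance (after evidence has been preserved)...
ec2.terminate_instances(InstanceIds=[COMPROMISED_INSTANCE])

# ...and launch a replacement from the known good image.
response = ec2.run_instances(
    ImageId=KNOWN_GOOD_AMI,
    InstanceType="t3.micro",
    SubnetId=SUBNET_ID,
    MinCount=1,
    MaxCount=1,
)
print("Replacement instance:", response["Instances"][0]["InstanceId"])
```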
There are three standards that can guide a security practitioner in designing and operating an incident management capability.
You can find the technical report document at resources.sei.cmu.edu/asset_files/TechnicalReport/2018_005_001_538866.pdf.
You can find NIST SP 800-61 here: csrc.nist.gov/publications/detail/sp/800-61/rev-2/final.
Cloud security operations, like all other security practices, must be anchored by two key principles: operations must be driven by the organization's business objectives or mission, and they must preserve the confidentiality, integrity, and availability of data and systems in the cloud. Operations is a far-reaching topic covering the selection, implementation, and monitoring of physical and logical infrastructure, as well as security controls designed to address the risks posed in cloud computing.
There are a variety of standards that can assist the CCSP in implementing or managing security controls for cloud environments. These cover major objectives such as access control, securing network activity, designing operational control programs, and handling communications. Choosing the correct standard will be driven by each organization's location, industry, and possibly costs associated with the various standards. All programs implemented should have feedback mechanisms designed to continuously improve security as risks evolve.
Also key is an understanding of the shared responsibility model. CSPs perform the majority of work related to physical infrastructure, though cloud consumers may still need to physically secure the infrastructure that connects them to the cloud. Logical infrastructure is a more equally shared responsibility: in non-SaaS models, the CSP runs the underlying infrastructure, but consumers have key responsibilities for securing the logical infrastructure in their virtual slice of the cloud.