Cloud security operations comprise a mix of old and new practices for understanding, mitigating, and monitoring security risks in an organization's cloud environments. Old practices include standard activities that apply to legacy or on-premises IT, such as legal and regulatory compliance management, while new practices include novel activities like orchestrating cloud infrastructure by writing virtual machine (VM) definitions instead of physically installing new hardware and software.
Key to cloud security operations are the two main roles in cloud computing: the cloud service provider (CSP) and the cloud consumer. The CSP and consumer share responsibilities for securely operating and using the cloud, respectively, and require clearly defined, agreed-upon objectives documented in contracts and service level agreements (SLAs).
It is important to bear in mind that many aspects of secure cloud operations will be handled by the CSP and therefore may be largely invisible to the cloud consumers. Security professionals must understand the importance of both the provider and consumer roles; particularly important is including adequate oversight of the CSP in third-party security risk management activities. From the CSP's perspective, proper isolation controls are essential due to the multitenant nature of the cloud, as are appropriate capacity, redundancy, and resiliency to ensure that the cloud service meets the availability requirements that customers demand.
In public cloud deployments, hardware configuration will be handled by the CSP rather than the cloud consumer. Obviously, private and community clouds will require the security practitioner to properly configure and secure hardware, and in some cases, public clouds may offer a virtual private cloud (VPC) option, where some of these elements are configurable by the consumer. The rules for hardening, or securely configuring, systems in cloud environments are the same as they are for on-prem systems, though the methods may be different.
There are several targets for hardening hardware, including the following:
Storage controllers are hardware components that, as their name implies, control storage devices. This may involve a number of functions, including access control, assembly of data to fulfill a request (for example, reconstructing a file that has been broken into multiple blocks and stored across disks), and providing users or applications with an interface to the storage services. Several standards exist, including iSCSI and Fibre Channel/Fibre Channel over Ethernet (FCoE). These are storage area network (SAN) technologies that create dedicated networks for data storage and retrieval. Security concerns for SANs are much the same as for regular network services, including proper access control, encryption of data in transit and at rest, and adequate isolation/segmentation to address both availability and confidentiality.
For public cloud consumers, the majority of network configuration is likely to happen in a software-defined network (SDN) management console rather than via hardware-based network device configuration. It is the responsibility of the CSP to manage the underlying physical hardware including network controller devices such as switches and network interface cards (NICs). Concerns for this physical hardware include the following:
Virtualization management tools require particular security and oversight measures, as they are essential to cloud computing. Without such measures in place, the compromise of a single management tool could lead to further compromise of hundreds or thousands of VMs and the data they hold. Tools that fall into this category include VMware vSphere, for example, as well as many of the CSP's administrative consoles that provide configuration and management of cloud environments by consumers.
Best practices for these tools will obviously be driven in large part by the virtualization platform in use and will track closely to practices in place for other high-criticality server-based assets. Vendor-recommended installation and hardening instructions should always be followed and possibly augmented by external hardening standards such as Center for Internet Security (CIS) Benchmarks or Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs). Other best practices include the following:
The cloud's heavy reliance on virtualization and multitenancy creates a new risk of data breaches when multiple users share a single piece of hardware. In a nonvirtual environment, it is comparatively difficult to leak data between systems, but a VM shares physical hardware with potentially hundreds of other machines.
The biggest issue related to virtual hardware security is the enforcement, by the hypervisor, of strict segregation between the guest operating systems running on a single host. The hypervisor acts as a form of reference monitor by mediating access by the various guest machines to the physical resources of the hardware they are running on and, in some cases, inter-VM network communication. Since most hypervisors are proprietary and produced by software vendors, there are two main forms of control a CCSP should be aware of:
Virtual hardware is highly configurable and, especially in general-purpose environments like platform as a service (PaaS) or infrastructure as a service (IaaS) public cloud offerings, may not be configured for security by default. It is the responsibility of the cloud consumer to properly configure the cloud environment to meet their specific needs, such as segmenting networks and configuring network access controls to permit only appropriate host-to-host communications. Particular concerns for virtual network security controls include the following:
Another security concern related to virtual hardware configuration is the amount of virtual hardware provisioned. Many tools will allow the user to specify quantities such as amount of memory, processor speed, and number of processing cores, as well as other attributes such as the type of storage for a VM. Availability is obviously one concern: sufficient quantities of these virtual resources must be provisioned to support the intended workload. From a business perspective, however, these virtual resources should not be overprovisioned, because they do have associated costs.
An emerging trend in cloud environments is the provisioning of hardware using definition files, referred to as infrastructure as code. These definition files are read by the CSP and used to specify virtual hardware parameters and configurations, simplifying the process of setting up and configuring the environment. In many organizations, this changes the old paradigm of developers writing code and operations personnel configuring the hosts, because developers can package infrastructure definitions with their application code. This requires adequate training for developers to ensure that they understand the business needs, security requirements, and configuration options available to them.
The ability to deploy infrastructure using a definition file enables a feature of cloud computing known as autoscaling. Resources deployed in the cloud environment can be monitored for utilization, and as resources reach their limit, additional resources are automatically added. For example, a web server hosting an online ordering app may come under increased traffic when a celebrity endorses a product; in an autoscaling cloud environment, new instances of the web server are spun up automatically to deal with the increased traffic. Serverless computing is another feature of cloud computing that can support availability. Serverless environments like Azure Functions and AWS Lambda allow developers to deploy their application code without specifying server resources required. When the application is run—for example, a customer wants to place an order—the CSP provides sufficient resources to handle that demand, supporting availability. The cloud consumer pays for the resources only when they are running the particular application, saving costs.
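To illustrate, the following minimal Python sketch uses the AWS boto3 SDK, with a hypothetical Auto Scaling group named web-asg and AWS credentials assumed to be configured, to attach a target-tracking policy so that additional instances are launched automatically when average CPU utilization rises above a target value:

import boto3

# Hypothetical example: attach a target-tracking scaling policy to an existing
# Auto Scaling group named "web-asg" so new web server instances are added
# automatically when average CPU utilization exceeds the target value.
autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",          # assumed, pre-existing group
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,                 # keep average CPU near 60 percent
    },
)

The CSP handles the actual monitoring and instance launches; the consumer's responsibility is choosing sensible targets and limits so that scaling supports availability without running up unnecessary costs.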
Toolsets exist that can provide extended functionality for various guest operating systems including Unix, Linux, and Microsoft Windows. These toolsets provide specific or enhanced capabilities for a particular operating system (OS), such as support for additional devices, driver software, or enhanced interactivity. In a public cloud, these toolsets will typically be provided by the CSP; if a customer requires functionality that depends on a particular OS toolset, it is up to the customer to verify if the CSP can support it before using that provider's cloud.
When managing your own virtualization environment, installing these toolsets should follow the concept of minimum necessary functionality. If a virtualization cluster will not be running virtual Windows servers, then the Windows OS toolset does not need to be installed. The personnel responsible for the virtualization cluster need to understand the business requirements and build the solution appropriately.
Cloud computing has shifted many responsibilities for managing physical and logical infrastructure away from the users of the corresponding services. When organizations hosted their own infrastructure, it was essential to have adequate processes to assess risks and provision adequate security controls to mitigate them, but in the cloud, many of these tasks are the responsibility of the cloud provider instead. However, the consumers must still be aware of the risks inherent in using the cloud. It is essential for a CCSP to understand these matters to adequately assess cloud services and doubly important if the organization is providing a cloud service. In the case of a private cloud host or a security practitioner working for a CSP, these controls will be directly under the purview of the organization.
In most instances, access to cloud resources will be done remotely, so adequate security controls must be built into the remote administration tools used to support these functions. A CCSP should be familiar with the following protocols for supporting remote administration:
Access to RDP can be controlled in a variety of ways. As a Microsoft standard, Active Directory is often utilized for identification and authentication, and the RDP standard also supports smart card authentication.
In situations where local administration is being performed, a secure Keyboard Video Mouse (KVM) switch may be utilized. This is a device that allows access to multiple hosts using a single set of human interface peripherals such as a keyboard, mouse, and monitor—a user does not need to have multiple keyboards on their desk physically attached to different computers.
A basic KVM allows the user to switch their peripherals to interact with various computers; a secure KVM adds additional protections for highly secured environments that primarily address the potential for data to leak between the various connected systems. Attributes of secure KVMs include the following:
All CSPs provide remote administrative access to cloud users via an administrative console. In AWS this is known as the Management Console, in Azure as the Portal, and in Google Cloud as the Cloud Console. All three offer a visual user interface (UI) that allows for the creation and administration of resources including user accounts, VMs, cloud services such as compute or storage, and network configurations. These functions are also available via command-line interface (CLI) tools, which call the same underlying APIs and enable automation, such as creating resources from infrastructure as code definition files.
The CSP is responsible for ensuring that access to these consoles is limited to properly authenticated users, and in many cases the CSP restricts even its own access to consumer resources. For example, it is possible for a user in Azure to create a VM and remove all network access from it, effectively locking themselves out as well; even Microsoft cannot reconfigure that VM, because allowing that level of access to consumer resources is incredibly risky. By contrast, the cloud consumer is responsible for implementing appropriate authorization and access control for their own members in accordance with access management policies, roles, and rules. Because of the highly sensitive nature and abilities granted by these consoles, they should be heavily isolated and protected, ideally with multifactor authentication. All use of the admin console should be logged and routinely reviewed as a critical element of a continuous monitoring capability. Access to the UI or CLI functions is a key area for personnel security and access control policies, similar to any system or database admin position in an on-prem environment.
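As a brief illustration of why API credentials deserve the same protection as console logins, the following Python sketch (using boto3 and assuming AWS credentials are configured) pulls the same instance inventory the Management Console displays; any identity able to run this could equally create or delete resources, so such calls should be controlled and logged:

import boto3

# The same inventory visible in the web console can be pulled via the API;
# any identity holding these credentials can also create, modify, or delete
# resources, so API access must be governed like console access.
ec2 = boto3.client("ec2")

for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])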
One aspect of cloud services is their broad network accessibility, so it is virtually impossible to find a cloud security concern that does not in some way relate back to secure network connectivity. Several protocols and concepts are important to understand as they relate to securing networks and the data transmitted.
VLANs were originally designed to support the goal of availability by reducing contention on a shared communications line. They do so by isolating traffic to just a subset of network hosts, so all hosts connected to the VLAN broadcast their communications to each other but not to the broader network. Communication with other VLANs or subnets must go through a control device of some sort, often a firewall, which offers confidentiality and integrity protections by enforcing network-level access control. For example, in a multitiered application architecture, the web servers will traditionally be isolated from database servers in separate VLANs, and the database layer will implement much more restrictive access controls.
VLAN network traffic is identified by the sending device using a VLAN tag, specified in the IEEE 802.1Q standard. The tag identifies the VLAN that a particular data frame belongs to and is used by network equipment to determine where the frame will be distributed. For example, a switch will not broadcast frames tagged for VLAN 1 to devices connected to VLAN 2, alleviating congestion. Firewalls that sit between the two VLANs can decide to allow or drop traffic based on rules specifying allowed or denied ports, protocols, or sender/recipient addresses.
An extension of VLAN technology specific to cloud computing environments is the virtual extensible LAN (VXLAN) framework. It provides a method for overlaying virtualized layer 2 networks on top of layer 3 networks; effectively, it allows for the creation of virtual LANs that may span different data centers or cloud environments. VXLAN is more suitable than VLANs for complex, distributed, and virtualized environments due to limitations on the number of devices that can be part of a VLAN, as well as limitations of protocols designed to support availability of layer 2 devices, such as Spanning Tree Protocol (STP). It is specified in RFC 7348.
From the standpoint of the cloud consumer, network security groups (NSGs) are also a key tool in providing secure network services and provide a hybrid function of network traffic isolation and filtering similar to a firewall. The NSG allows or denies network traffic access based on a list of rules such as source IP, destination IP, protocol, and port number. Virtual resources can be segmented (isolated) from each other based on the NSGs applied to them. For example, a development environment may allow inbound traffic only from your organization's IP addresses on a broad range of ports, while a production environment may allow access from any IP address but only on ports 80/443 for web traffic. Resources in the development environment could also be prevented from communicating with any resources in production to prevent an attacker from pivoting.
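A minimal sketch of the kind of rule described above, using Python and boto3 against a placeholder AWS security group ID, might look like the following; the group ID, port, and CIDR values are illustrative:

import boto3

ec2 = boto3.client("ec2")

# Allow inbound HTTPS (443) from anywhere to a production security group;
# the group ID below is a placeholder for an existing security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Public HTTPS"}],
    }],
)

A development environment's group would instead restrict the IpRanges entry to the organization's own address space, giving the isolation between environments described above.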
Transport Layer Security (TLS) is a set of cryptographic protocols that provide encryption for data in transit, replacing the earlier Secure Sockets Layer (SSL) protocol. The current version of TLS is 1.3; earlier versions are either considered less secure or have known compromises. TLS provides a framework of supported cryptographic ciphers and key lengths that may be used to secure communications, and this flexibility ensures broad compatibility with a range of devices and systems. It also means the security practitioner must carefully configure TLS-protected systems to support only ciphers that are known to be secure and to disable older options that have been compromised.
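As a short illustration of enforcing a minimum protocol version, the following Python sketch uses the standard library ssl module to refuse anything older than TLS 1.2 and to print the negotiated version and cipher; the hostname is illustrative:

import socket
import ssl

# Build a client-side TLS context with certificate verification enabled and
# older protocol versions refused.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse SSL and TLS 1.0/1.1

hostname = "www.isc2.org"
with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        print("Negotiated:", tls.version(), tls.cipher()[0])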
TLS specifies a handshake protocol when two parties establish an encrypted communications channel. This comprises these three steps:
TLS can be used to provide both encryption of data and authentication/proof of origin, as it relies on digital certificates and public key cryptography. In many cases, such as publicly available web apps, one-way authentication is used for the client's browser to authenticate the server it is connecting to, such as an online banking application. In mutual authentication, both client and server (or server and server) are required to exchange certificates. If both parties trust the issuer of the certificates, they can mutually authenticate their identities. Some high-security or high-integrity environments require mutual authentication before data is transmitted, but the overhead makes it largely infeasible for all TLS encryption (imagine the PKI required to authenticate every Internet user, web app, and IoT device on the planet).
Dynamic Host Configuration Protocol (DHCP) enables computer network communications by providing IP addresses to hosts that dynamically join and leave the network. The process of dynamically assigning IP addresses is as follows:
An easy mnemonic to remember this DHCP process is DORA for Discover, Offer, Request, Acknowledge. As with many network protocols, security was not originally part of the DHCP design. The current DHCP version 6 specifies how IPSec can be utilized for authentication and encryption of DHCP requests. An improperly configured DHCP server can lead to denial of service if incorrect IP addresses are assigned or, worse, can be used in a man-in-the-middle attack to misdirect traffic.
Imagine how difficult it would be if you were required to remember every company's and person's IP address to send them data. Human-readable addresses like www.isc2.org are used instead, but these need to be converted to a machine-readable IP address. DNS does this by a process known as resolving.
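A minimal sketch of resolving in practice, using Python's standard library and the example hostname above:

import socket

# Resolve a human-readable name to machine-readable addresses using the
# system's configured DNS resolver.
for family, _, _, _, sockaddr in socket.getaddrinfo("www.isc2.org", 443):
    print(family.name, sockaddr[0])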
DNS operates using records, definitive mappings of fully qualified domain names (FQDNs) to IP addresses, which are stored in a distributed database across zones. DNS queries are facilitated by DNS servers sharing information with one another from time to time via zone transfers, which enable a server to resolve client requests without having to iterate the query to other servers.
As with many network protocols, DNS was originally designed without security in mind, which leaves it open to attack. Cache poisoning is an attack where a malicious user updates a DNS record to point an FQDN to an incorrect IP address. One way of doing this is to initiate a zone transfer, which by default does not include authentication of the originator. This poisoned record causes users to be redirected to an incorrect IP address, where the attacker can host a malicious phishing site, malware, or simply nothing at all to create a denial of service.
DNS spoofing is another attack against DNS. In this case, an attacker spoofs a DNS service on a network with the goal of resolving a user's requested FQDN to an attacker-controlled IP address. Some attacks also abuse the visual appearance of domain names rather than the underlying DNS infrastructure; this can be achieved using characters that appear identical to users, such as the digit zero and the letter o, sending users to an illegitimate site.
DNS Security Extensions (DNSSEC) is a set of specifications primarily aimed at reinforcing the integrity of DNS. It achieves this by providing for cryptographic authentication of DNS data using digital signatures. This provides proof of origin and makes cache poisoning and spoofing attacks more difficult if users cannot create a proper digital signature for their DNS data. It does not provide for confidentiality, since digital signatures rely on publicly decryptable information, nor does it stop attacks using graphically similar domain names, as these may be legitimate records registered with an authoritative DNS zone.
Resources inside a network can be protected with security tools like firewalls and access controls, but what happens if users are not on the same network? A VPN gives external users the ability to virtually and remotely join a network, gaining access to resources hosted on that network and benefiting from the security controls in place. This security is achieved by setting up a secure tunnel, or encrypted communication channel, between the connecting host and the network they want to join. On the target network, there is often a VPN device or server that authenticates the user connecting and mediates their access to the network.
Commonly used to provide remote workers access to in-office network resources, VPNs can also be useful in cloud architectures to safely share data between offices, cloud environments, or other networks. These are often implemented at the edge of networks to allow secure communication between any hosts on connected networks and are called site-to-site or gateway-to-gateway VPNs.
There are a variety of VPN protocols and implementations that a CCSP should be familiar with, such as the following:
A software-defined perimeter (SDP) is an emerging concept driven by the decentralized nature of cloud applications and services, which have upended the traditional model of a network with a perimeter (secure boundary). Since cloud applications may reside in data centers anywhere in the world, and users may connect from anywhere in the world, it is no longer possible to define a fixed perimeter.
For further reference, see the CSA SDP site: cloudsecurityalliance.org/research/working-groups/software-defined-perimeter.
Hardening is the configuration of a machine into a secure state. Common hardening practices include changing default credentials, locking/disabling default accounts that are not needed, installing security tools such as anti-malware software, configuring security settings available in the OS, and removing or disabling unneeded applications, services, and functions.
Prior to virtualization, the process of hardening an OS was often a manual task; once a system's hardware was installed, an administrator had to manually install and configure the OS and any application software. The advent of virtualization introduced machine images, which are essentially templates for building a VM. All VMs created from the same image will have the same settings applied, which offers dual benefits of efficiency and security. Of course, the image cannot remain static—patches and other security updates must be applied to the image to ensure that new VMs remain secure. Existing VMs must also be updated independently of the image, which is a concern of patching and configuration management disciplines.
A modern OS has thousands of configuration options, so to speed up hardening, an organization may choose to create or adopt a baseline configuration. Baselines are simply a documented, standard state of an information system, such as access control requiring multifactor authentication, vulnerable services such as File Transfer Protocol (FTP) disabled, and nonessential services such as Windows Media Player removed. Each of these configuration options should map to a risk mitigation (security control objective).
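A simplified, tool-agnostic sketch of checking a host against such a baseline follows; the settings and values are illustrative only, not a complete hardening standard:

# Illustrative baseline check: compare a host's reported settings against a
# documented baseline and report any drift. The settings shown are examples.
baseline = {
    "mfa_required": True,
    "ftp_service_enabled": False,
    "media_player_installed": False,
}

host_settings = {
    "mfa_required": True,
    "ftp_service_enabled": True,    # drift: vulnerable service left enabled
    "media_player_installed": False,
}

drift = {k: (baseline[k], host_settings.get(k)) for k in baseline
         if host_settings.get(k) != baseline[k]}

for setting, (expected, actual) in drift.items():
    print(f"Noncompliant: {setting} expected={expected} actual={actual}")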
This baseline and corresponding documentation may be achieved in a number of ways.
DISA STIGs: The U.S. Defense Information Systems Agency (DISA) produces baseline documents known as Security Technical Implementation Guides (STIGs). These documents provide guidance for hardening systems used in high-security environments and, as such, may include configurations that are too restrictive for many organizations. They are freely available, can be tailored, and cover a broad range of OS and application software. Many configuration and vulnerability management tools incorporate hardening guidance from the STIGs to perform checks.
STIGs and additional information can be found here: public.cyber.mil/stigs/downloads.
NIST checklists: The National Institute of Standards and Technology (NIST) maintains a repository of configuration checklists for various OS and application software. It is a free resource.
The NIST Checklist repository (which also includes access to DISA STIGs) can be found here: nvd.nist.gov/ncp/repository.
CIS Benchmarks: The Center for Internet Security (CIS) publishes baseline guides for a variety of operating systems, applications, and devices, which incorporate many security best practices. These can be used by any organization with appropriate tailoring and are also built into many security tools such as vulnerability scanners.
More information can be found here: www.cisecurity.org/cis-benchmarks.
Stand-alone hosts are isolated, dedicated hosts for the use of a single tenant. These are often required for contractual or regulatory reasons, such as processing highly sensitive data like healthcare information. The use of nonshared resources carries obvious cost consequences. A CSP may be able to offer secure dedicated hosting similar to colocation, which still offers cost savings to the consumer because physical facilities and utilities such as power are shared. A CCSP will need to gather and analyze the organization's requirements to identify whether the costs of stand-alone hosting are justified. In some CSPs, the use of virtual private resources may be an acceptable, lower-cost alternative, relying on shared physical infrastructure with strong logical separation from other tenants.
Clusters are a grouping of resources with some coordinating element, often a software agent that facilitates communication, resource sharing, and routing of tasks among the cluster. Clustered hosts can offer a number of advantages, including high availability via redundancy, optimized performance via distributed workloads, and the ability to scale resources without disrupting processing via addition or removal of hosts to the cluster. Clusters are a critical part of the resource pooling that is foundational to cloud computing and are implemented in some fashion for most resources needed in modern computing systems, including processing, storage, network traffic handling, and application hosting.
The cluster management agent, often part of hypervisor or load balancer software, is responsible for mediating access to shared resources in a cluster. Reservations are guarantees for a certain minimum level of resources available to a specified virtual machine. The virtualization toolset or CSP console is often where this can be configured, such as a certain number of compute cores or RAM allocated to a VM. A limit is a maximum allocation, while a share is a weighting given to a particular VM that is used to calculate percentage-based access to pooled resources when there is contention.
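The following simplified Python illustration (not any vendor's actual algorithm) shows how share weightings translate into percentage-based access to a contended resource pool:

# Simplified illustration of share-based allocation under contention:
# each VM's weight divided by the total weight gives its fraction of the pool.
# The numbers are illustrative.
vm_shares = {"vm-prod": 4000, "vm-test": 2000, "vm-dev": 1000}
pool_ghz = 14.0  # total CPU capacity available in the contended pool

total = sum(vm_shares.values())
for vm, shares in vm_shares.items():
    pct = shares / total
    print(f"{vm}: {pct:.0%} of pool = {pct * pool_ghz:.1f} GHz")

Reservations and limits would then act as a floor and ceiling, respectively, around whatever the share calculation yields.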
Maintenance mode refers to the practices surrounding the routine maintenance activities for clustered hosts. Although taking a system offline primarily impacts availability, there are considerations for all three elements of the confidentiality, integrity, availability (CIA) triad related to maintenance mode.
Availability and uptime are often used interchangeably, but there is a subtle difference between the terms. Uptime simply measures the amount of time a system is running. In a cloud environment, if a system is running but not reachable due to a network outage, it is not available. Availability encompasses infrastructure and other supporting elements in addition to a system's uptime; high availability (HA) is defined by a robust system and infrastructure to ensure that a system is not just up but also available. It is often measured as a number of 9s, for example, five nines or 99.999 percent availability. This equates to approximately five minutes of downtime per year and should be measured by the cloud consumer to ensure that the CSP is meeting SLA obligations.
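The arithmetic behind these figures is straightforward, as the following short calculation shows:

# Permitted downtime per (365-day) year for common availability targets.
minutes_per_year = 365 * 24 * 60

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime = (1 - availability) * minutes_per_year
    print(f"{availability:.5%} -> {downtime:.2f} minutes/year")

Five nines works out to roughly 5.26 minutes of permissible downtime per year, which is why the measurement method and exclusions defined in the SLA matter so much.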
Organizations can implement multiple strategies to achieve HA. Some, detailed in the following sections, are vendor-specific implementations of cluster management features for maintaining system uptime. Other strategies include redundancy of infrastructure such as network connectivity and utilities. The Uptime Institute publishes specifications for physical and environmental redundancy, expressed as tiers, that organizations can implement to achieve HA.
The Distributed Resource Scheduler (DRS) is the coordination element in a cluster of VMware ESXi hosts; it mediates access to the physical resources and provides additional features supporting high availability and management. It is a software component that handles the resources available to a particular cluster, the reservations and limits for the VMs running on the cluster, and maintenance features.
DRS maintenance features include the ability to dynamically move running VMs from one physical hardware component to another without disruption for the end users; this is obviously useful if hardware maintenance needs to be performed or additional capacity is added to the cluster and the workload needs to be rebalanced. This supports the element of rapid elasticity and self-service provisioning in the cloud by automating dynamic creation and release of resources as client demands change.
DRS can also handle energy management in physical hardware to reduce energy consumption when processing demands are low and then power resources back up when required. This enables cost savings for both the CSP and consumer, as hardware that is not actively being used does not consume energy.
Similar to VMware's DRS, Microsoft's Virtual Machine Manager (VMM) software handles power management, live VM migration, and optimization of both storage and compute resources. Hosts (servers) and storage capacity can be grouped into a cluster; options available to configure in the VMM console include the movement of VMs and virtual hard disks between hosts to balance workloads across available resources.
Storage clusters create a pool of storage, with the goal of providing reliability, increased performance, or possibly additional capacity. They can also support dynamic system availability by making data available to services running anywhere—if a data center in one part of the country fails and web hosts are migrated to another data center, they can connect to the same storage cluster without needing to be reconfigured. There are two primary architectures for storage clusters.
The availability of a guest OS in a CSP's environment is generally the consumer's responsibility, as the CSP only provides a base image. Once a VM is created in IaaS, the CSP no longer has direct control over the OS, while in PaaS the CSP maintains control. In the software as a service (SaaS) model, the consumer only needs to plan for availability of the data their organization puts into the app.
Ensuring the availability of guest OSs in a cloud environment may involve planning for backup and restoration, which will be similar to traditional on-prem backup and recovery planning, or it may involve utilizing cloud-specific features to design resiliency into the system. Details of these two approaches are presented here:
As with all backup and restoration activity, there are concerns across all three CIA triad elements. Backup integrity should be routinely tested to ensure recovery, and the backups should not be stored on the same physical hardware as the primary systems since this single point of failure could impact availability. Snapshots will contain data with the same sensitivity level as the systems they are made from, so adequate access controls and other measures to enforce confidentiality are required.
Many cloud services also have resiliency options built in, such as worldwide data replication and availability zones or regions that can transfer running apps or services transparently in the event of a failure. The cloud consumer is responsible for choosing and configuring these resiliency options, and in some cases will need to make trade-offs. For example, some CSPs offer database encryption that makes it harder to perform data replication. In traditional on-prem architecture, building such a resilient app would have been cost prohibitive for all but the largest organizations, but the inherent structure of cloud computing makes this type of resiliency broadly available.
Although many elements of physical and logical infrastructure in cloud environments will be under the direct control of the CSP, it is essential for cloud consumers and the CCSP practitioner to be aware of these practices. In some cases, there will be shared responsibilities that both parties are required to perform, and in others, the consumer must adequately understand these practices and conduct oversight activities like SLA reviews to ensure that security objectives are being met.
Remote administration is the default for a majority of cloud administrators from both the CSP and the consumer side. Tools including Remote Desktop Protocol (RDP), used primarily for Windows systems, and Secure Shell (SSH), used primarily on Unix and Linux systems, must be provisioned to support this remote management.
Secure remote access is a top-level priority in many security frameworks due to the highly sensitive nature of operations it entails and its inherent exposure to attacks. Remote access often relies on untrusted network segments for transmitting data, and in a cloud environment, this will entail users connecting via the Internet. Physical controls that could prevent unwanted access in a data center will be largely missing in a cloud as well; not that the CSP is ignoring physical security controls, but the inherently network-accessible nature of the cloud means most administrative functions must be exposed and are therefore susceptible to network-based threats.
There are a number of concerns that should be addressed to reduce the risk associated with remote access, including the following:
A hardened OS implements many of the security controls required by an organization's risk tolerance and may also implement security configurations designed to meet the organization's compliance obligations. Once built, however, these systems do need to be monitored to ensure that they stay hardened. This ongoing monitoring and remediation of any noncompliant systems must be part of the organization's configuration management processes, designed to ensure that no unauthorized changes are made, any unauthorized changes are identified and rolled back, and changes are properly approved and applied through a change control process.
Monitoring and managing OS configuration against baselines can be achieved in a number of ways. Some are similar to legacy, on-prem techniques, while newer methods are also emerging, including the following:
Maintaining a known good state is not a static activity, unfortunately. Today's hardened system is tomorrow's highly vulnerable target for attack, as new vulnerabilities are discovered, reported, and weaponized by attackers. Patch management involves identifying vulnerabilities in your environment, applying appropriate patches or software updates, and validating that the patch has remediated the vulnerability without breaking any functionality or creating additional vulnerabilities.
Information needed to perform patch management can come from a variety of sources, but primary sources include vendor-published patch notifications, such as Microsoft's Patch Tuesday, as well as vulnerability scanning of your environment. Patch management processes will vary depending on the cloud service model you are using.
Patch management tools exist that can help to identify known software vulnerabilities and the state of patching across an organization's systems, such as Windows Server Update Services (WSUS). Such tools can also be used to orchestrate and automate patch application, though in some cases automation may not be desirable if a patch has operational impacts. There are plenty of patches and updates from major companies that caused unforeseen issues when installed, including turning otherwise functional hardware into bricks! A generic patch management process ought to incorporate the following:
Evolving cloud architectures are introducing new ways of managing patching, which can offer significant advantages if applied correctly.
Monitoring is a critical concern for all parties in cloud computing. The CSP should implement monitoring to ensure that they are able to meet customer demands and promised capacity, and consumers need to perform monitoring to ensure that service providers are meeting their obligations and that the organization's systems remain available for users.
The majority of monitoring tasks will be in support of the availability objective, though indicators of attack or misuse may be revealed as well, such as spikes in processor utilization caused by cryptocurrency mining malware. Alerts should be generated based on established thresholds, and appropriate action plans initiated in the event of a disruption. Monitoring is not only about detecting incidents, however; it is also critical for CSPs to measure the amount of services being used by customers so they can be billed accurately.
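A simplified sketch of threshold-based alerting follows; the metric names, sample values, and thresholds are placeholders for data a real monitoring agent would supply:

# Illustrative threshold check: compare sampled utilization metrics against
# alert thresholds and flag anything out of range.
thresholds = {"cpu_pct": 85, "memory_pct": 90, "disk_pct": 80}
samples = {"cpu_pct": 97, "memory_pct": 62, "disk_pct": 72}

for metric, limit in thresholds.items():
    value = samples[metric]
    if value > limit:
        print(f"ALERT: {metric} at {value}% exceeds threshold of {limit}%")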
Infrastructure elements that should be monitored include the following:
Regardless of what is being monitored and who performs it, adequate staffing is critical to make monitoring effective. Just as reviews make log files useful, appropriate use of performance data is also essential. If a metric is captured but the cloud consumer never reviews it, they run the risk of a service becoming unavailable with no forewarning, or of paying for services that were not actually usable. CSPs also face risks of customer complaints, customers refusing to pay, and loss of reputation if their services are not routinely available.
Although cloud computing relies heavily on virtualization, at some point physical hardware is necessary to provide all the services. Monitoring this physical hardware is essential, especially for availability, as hardware failures can have an outsized impact in virtualized environments where multiple VMs rely on a single set of hardware.
Various tools exist to perform physical monitoring, and the choice will depend on the hardware being used as well as organizational business needs. Similar to capacity monitoring, all hardware under monitoring should have alert thresholds and response actions if a metric goes outside expected values, such as automated migration of VMs from faulty hardware or an alert to investigate and replace a faulty component. Hardware targets for monitoring may include the following:
Some devices also include built-in monitoring, such as hard drives that support Self-Monitoring, Analysis, and Reporting Technology (SMART). SMART drives monitor a number of factors related to drive health and can identify when a failure is imminent. Many SSDs also provide reporting when a set number of sectors have failed, which indicates the drive is reaching the end of its useful life. Especially in large environments, storage tools like storage area networks (SANs) will include their own health and diagnostic tools; the information from these should be integrated into the organization's continuous monitoring strategy.
The types of monitoring tools in use will depend on a number of factors. Many vendors of cloud-grade hardware such as SAN controllers or virtualization clusters include diagnostic and monitoring tools. The usefulness of these built-in tools may be limited if your organization's measurement needs require data that is not captured, in which case a third-party tool may be required.
In general, hardware monitoring will be the purview of the CSP and not the consumer, as the CSP is likely to retain physical control of hardware. Data center design and infrastructure management are entire fields of endeavor largely outside the scope of the CCSP, but it is obviously important for a CSP to have appropriately skilled team members. The Uptime Institute (uptimeinstitute.com/tiers) is one resource that provides guidance and education on designing and managing infrastructure; another is the International Data Center Authority (www.idc-a.org).
There is a clear delineation of responsibility between the CSP and consumer when it comes to configuring, testing, and managing backup and restoration functions in cloud environments. In SaaS cloud models, the CSP retains full control over backup and restore and will often be governed by SLA commitments for restoration in the event of an incident or outage.
In the PaaS model, there will be backup and restoration responsibilities for both the consumer and the CSP, especially for VMs. The CSP is responsible for maintaining backup and restoration of the host OS and any hypervisor software and for ensuring the availability of the system in line with agreed-upon service levels.
Backup and recovery of individual VMs in IaaS are the responsibility of the consumers and may be done in a variety of ways. This might include full backups, snapshots, or definition files used for infrastructure as code deployments. Regardless of which backup method is utilized, a number of security considerations should be taken into account.
Configuration of resiliency functions, such as the use of automatic data replication, failover between availability zones offered by the CSP, or the use of network load balancing, will always be the responsibility of the consumer. The CSP is responsible for maintaining the capabilities that enable these options, but the consumer must architect their cloud environment, infrastructure, and applications appropriately to meet their own resiliency objectives.
Cloud environments are inherently network accessible, so the security of data in transit between the consumer and the CSP is a critical concern, with both parties sharing responsibility for architecting secure networks. CSPs must ensure that they adequately support the networks they are providing in the cloud service environment, and consumers are responsible, in some cloud service models, for architecting their own secure networks using a combination of their own tools and those provided by the CSP.
One major concern related to network security is the ability of some tools to function in a cloud paradigm. The early days of virtualization brought challenges for many security tools that relied on capturing network traffic as it flowed across a switch; the devices were attached to a SPAN or mirror port on the switch and received a copy of all traffic for analysis. VMs running on the same physical host did not need to send traffic outside the host to communicate, rendering tools listening for traffic on a switch useless. Complex software-defined networks (SDNs) that can span multiple data centers around the world likely require more advanced solutions, and security practitioners must be aware of these challenges.
A firewall is most broadly defined as a security tool designed to isolate and control access between segments of a network, whether between an internal network and the public Internet or between internal environments, such as an application with highly sensitive data and other internal apps. Firewalls operate by inspecting traffic and deciding whether to forward it (allow) or drop it (deny), and they are often used to isolate or segment networks by controlling what traffic is allowed to flow between segments. There are a variety of firewall types.
Firewalls may be hardware appliances that would traditionally be deployed by the CSP, or virtual appliances that can be deployed by a cloud consumer as a VM. Host-based firewalls, which are software-based, are also often considered a best practice in a layered defense model. In the event a main network firewall fails, each host still has some protection from malicious traffic, though all devices obviously need to be properly configured.
There are a number of cloud-specific considerations related to firewall deployment and configuration, such as the use of security groups for managing network-level traffic coupled with host-based firewalls to filter traffic to specific hosts. This approach is an example of microsegmentation, which amounts to controlling traffic on a granular basis, often at the level of a single host. In a cloud environment, an NSG might block traffic on specific ports from entering a DMZ, and the host firewalls would further restrict traffic reaching each host based on ports or protocols. Traditional firewall rules may also be ineffective in a cloud environment, which necessitates these new approaches: in an autoscaling environment, new hosts are brought online and dynamically assigned IP addresses, so a traditional firewall would need its ruleset updated before those hosts could handle any traffic. With security groups, the newly created resources can be automatically placed into the proper group with no additional configuration required.
As the name implies, an intrusion detection system (IDS) is designed to detect a system intrusion when it occurs. An intrusion prevention system (IPS) is a bit of a misnomer, however—it acts to limit damage once an intrusion has been detected. The goal in both cases is to limit the impact of an intrusion, either by alerting personnel to an intrusion so they can take remedial action or by automatically shutting down an attempted attack.
An IDS is a passive device that analyzes traffic and generates an alert when traffic matching a pattern is detected, such as a large volume of unfinished TCP handshakes. IPS goes further by taking action to stop the attack, such as blocking traffic from the malicious host with a firewall rule, disabling a user account generating unwanted traffic, or even shutting down an application or server that has come under attack.
Both IDS and IPS can be deployed in two ways, and the deployment method as well as location are critical to ensure that the devices can see all traffic they require to be effective. A network-based intrusion detection system/intrusion prevention system (NIDS/NIPS) sits on a network where it can observe all traffic and may often be deployed at a network's perimeter for optimum visibility. Similar to firewalls, however, NIDS/NIPSs may be challenged in a virtualized environment where network traffic between VMs never crosses a switch. A host-based intrusion detection system/intrusion prevention system (HIDS/HIPS) is deployed on a specific host to monitor traffic. While this helps overcome problems associated with invisible network traffic, the agents required introduce processing overhead, may require licensing costs, and may not be available for all platforms an organization is using.
Honeypots and honeynets can be useful monitoring tools if used appropriately. They should be designed to detect or gather information about unauthorized attempts to gain access to data and information systems, often by appearing to be a valuable resource. In reality, they contain no sensitive data, but attackers attempting to access them may be distracted or deflected from high-value targets or give up information about themselves such as IP addresses.
In most jurisdictions, there are significant legal issues concerning the use of honeypots or honeynets, centered around the concept of entrapment. This legal concept describes an agent inducing a person to commit a crime, which may be used as a defense by the perpetrator and render any attempt to prosecute them ineffective. It is therefore imperative that these devices never be set up with an explicit purpose of being attractive targets or designed to “catch the bad guys.”
Vulnerability assessments should be part of a broader vulnerability management program, with the goal of detecting vulnerabilities before an attacker finds them. Many organizations will have a regulatory or compliance obligation to conduct vulnerability assessments, which will dictate not only the schedule but also the form of the assessment. An organization with an annual PCI assessment requirement should be checking for required configurations and vulnerabilities related to credit cardholder data, while a medical organization should be checking for required protected health information controls and vulnerabilities.
Vulnerability scanners are an often-used tool in conducting vulnerability assessments and can be configured to scan on a relatively frequent basis as a detective control. Human vulnerability assessments can also be utilized, such as an internal audit function or standard reviews like access and configuration management checks. Even a physical walk-through of a facility to identify users who are not following clean desk or workstation locking policies can uncover vulnerabilities, which should be treated as risks and remediated.
A more advanced form of assessments an organization might conduct is penetration or pen testing, which typically involves a human tester attempting to exploit any vulnerabilities identified. Vulnerability scanners typically identify and report on software or configuration vulnerabilities, but it can be difficult to determine if a particular software vulnerability could actually be exploited in a complex environment. The use of vulnerability scanners and pen testers may be limited by your CSP's terms of service, so a key concern for a CCSP is understanding the type and frequency of testing that is allowed.
The management plane is mostly used by the CSP and provides virtual management options analogous to the physical administration options a legacy data center would provide, such as powering VMs on and off or provisioning virtual infrastructure for VMs such as RAM and storage. The management plane will also be the tool used by administrators for tasks such as migrating running VMs to different physical hardware before performing hardware maintenance.
Because of the functionality it provides, the management plane requires appropriate logging, monitoring, and access controls, similar to the raised floor space in a data center or access to domain admin functions. Depending upon the virtualization toolset, the management plane may be used to perform patching and maintenance on the virtualization software itself. Functionality of the management plane is usually exposed through an API, which may be controlled by an administrator from a command line or via a graphical interface.
A key concept related to the management plane is orchestration, or the automated configuration and management of resources. Rather than requiring an administrator to individually migrate VMs off a cluster before applying patches, the management plane can automate this process. The admin schedules a patch for deployment, and the software comprising the management plane coordinates moving all VMs off the cluster, preventing new VMs from being started, and then enters maintenance mode to apply the patches.
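The following pseudocode-style Python sketch outlines that flow; the cluster object and its methods are hypothetical stand-ins, not a real management plane API:

# Hypothetical orchestration flow for patching a virtualization cluster.
def patch_cluster(cluster, patch, spare_capacity):
    cluster.disable_new_vm_placement()             # stop scheduling new VMs here
    for vm in cluster.running_vms():
        cluster.live_migrate(vm, spare_capacity)   # move workloads without downtime
    cluster.enter_maintenance_mode()
    cluster.apply_patch(patch)                     # patch the now-empty hosts
    cluster.exit_maintenance_mode()
    cluster.enable_new_vm_placement()              # resume normal scheduling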
The cloud management console is often confused with the cloud management plane, and in reality, they perform similar functions and may be closely related. The management console is usually a web-based console for use by the cloud consumer to provision and manage their cloud services, though it may also be exposed as an API that customers can utilize from other programs or a command line. It may utilize the management plane's API for starting/stopping VMs or configuring VM resources such as RAM and network access, but it should not give a cloud consumer total control over the entire CSP infrastructure. The management plane's access controls must enforce minimum necessary authorization to ensure that each consumer is able to manage their own infrastructure and not that of another customer.
IT service management (ITSM) frameworks consist of operational controls designed to help organizations design, implement, and improve IT operations in a consistent manner. They can be useful in speeding up IT delivery tasks and providing more consistent oversight, and they are also critical to processes where elements of security risk management are implemented. Change management is one example; it helps the organization to maintain a consistent IT environment that meets user needs and also implements security controls such as a change control board where the security impact of changes can be adequately researched and addressed.
The two standards that a CCSP should be familiar with are ISO 20000-1 (not to be confused with ISO 27001) and ITIL (formerly an acronym meaning Information Technology Infrastructure Library). Both frameworks focus on the process-driven aspects of delivering IT services to an organization, such as remote collaboration services, rather than focusing on just delivering IT systems like an Exchange server. In ITIL, the set of services available is called a service catalog, which includes all the services available to the organization.
Both frameworks start with the need for policies to govern the ITSM processes, which should be documented, well understood by relevant members of the organization, and kept up-to-date to reflect changing needs and requirements. ISO 20000-1 and ITIL emphasize the need to deeply understand user needs and also focus on gathering feedback to deliver continuous service improvement. Stated another way, those in charge of IT services should have a close connection to the users of the IT system and strive to make continual improvements; in this regard, it is similar to the Agile development methodology.
Change management is concerned with keeping the organization operating effectively, even when changes are needed such as the modification of existing services, addition of new services, or retirement of old services. To do this, the organization must implement a proactive set of formal activities and processes to request, review, implement, and document all changes.
Many organizations utilize a ticketing system to document all steps required for a change. The first step is initiation in the form of a change request, which should capture details such as the purpose of the proposed change, the owner, resources required, and any impacts that have been identified, such as downtime required to implement the change or impacts to the organization's risk posture.
The change then goes for a review, often by a change control or change advisory board (CCB or CAB). This review is designed to verify whether the proposed change offers business benefits/value appropriate to its associated costs, understand the impact of the change and ensure that it does not introduce unacceptable levels of risk, and, ideally, confirm that the change has been properly planned and can be reversed or rolled back in the event it is unsuccessful. This step may involve testing and additional processes such as decision analysis, and it may be iterated if the change board needs additional information from the requestor.
Once a change has been approved, it is ready for the owner to execute the appropriate plan to implement it. Since many changes will result in the acquisition of new hardware, software, or IT services, there will be a number of security concerns that operate concurrently with the change, including acquisition security management, security testing, and the use of the organization's certification and accreditation process if the change is large enough. In the event a change is not successful, fallback, rollback, or other restoration actions need to be planned to prevent a loss of availability.
Not all changes will be treated the same, and many organizations will implement different procedures based on categories of changes.
Continuity is concerned with the availability aspect of the CIA triad and is a critical consideration for both cloud customers and providers. Continuity management addresses the reality that, despite best efforts and mitigating activities, sometimes adverse events happen. How the organization responds should be planned and adequate resources identified prior to an incident, and the business continuity policy, plan(s), and other documentation should be readily available to support the organization's members during an interruption.
It is essential for both cloud customers and providers to do the following:
There are a variety of standards related to continuity management; these may be useful to the organization in planning, testing, and preparing for contingency circumstances. Many legal and regulatory frameworks mandate the use of a particular standard depending on an organization's industry or location. The CCSP should be aware of the relevant framework for their industry; the following are some key frameworks:
The goal of an information security management system (ISMS) is to ensure a coherent organizational approach to managing information security risks; stated another way, it is the overarching approach an organization takes to preserving the confidentiality, integrity, and availability (the CIA triad) of systems and data in use. The operational aspects of an ISMS include standard security risk management activities in the form of security controls such as encryption, as well as supporting business functions required for the organization to achieve risk management goals like formal support and buy-in from management, skills and training, and adequate oversight and performance evaluation.
Various standards and frameworks exist to help organizations implement, manage, and, in some cases, audit or certify their ISMS. Most contain requirements to be met in order to support goals of the CIA triad, as well as best practices for implementation and guidance on proper use and operation of the framework. While many security control frameworks exist, not all are focused on the larger operational task of implementing an ISMS. Payment Card Industry Data Security Standard (PCI DSS), for example, focuses specifically on securing cardholder data that an organization is processing. Frameworks that focus on both security controls as well as the overall ISMS functions include the following:
While the NIST RMF and SP 800-53 standards are mandated for use in many parts of the U.S. federal government, they are free for any organization to use. The NIST Cybersecurity Framework (CSF) was originally designed to help private-sector critical infrastructure providers design and implement information security programs; however, its free availability and relatively lightweight approach have made it a popular ISMS tool for many nongovernment organizations.
Most ITSM models include some form of monitoring capability utilizing functions such as internal audit, external audit and reporting, or the generation of security metrics and management oversight of processes via these metrics. The organization's IT services, including the ISMS and all related processes, should be monitored for effectiveness and placed into a cycle of continuous improvements. The goals of this continuous improvement program should be twofold: first to ensure that the IT services (including security services) are meeting the organization's business objectives and second to ensure that the organization's security risks remain adequately mitigated.
One critical element of continual service improvement includes elements of monitoring and measurement, which often take the form of security metrics. Metrics can be tricky to gather, particularly if they need to be presented to a variety of audiences. It may be the case that business leaders will be less interested in deeply technical topics, which means the metrics should be used to aggregate information and present it in an easily understood, actionable way.
For instance, rather than reporting a long list of patches and Common Vulnerabilities and Exposures (CVEs) addressed (undoubtedly an important aspect of security risk management), a more appropriate metric might be the percentage of machines patched within the defined timeframe for the criticality of the patch; for example, 90 percent of machines were patched within seven days of release. Acceptable values should also be defined, which allows for key performance indicators (KPIs) to be reported. In this example, the KPI might be red (bad) if the organization's target is 99 percent patch deployment within seven days of release—a clear indicator to management that something needs their attention.
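As a rough illustration of how such a metric and KPI might be computed, the sketch below calculates the percentage of hosts patched within the defined window and compares it against a target. The records, window, and target values are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical patch records: (host, patch released, patch applied or None)
patch_records = [
    ("web-01", date(2023, 3, 1), date(2023, 3, 4)),
    ("web-02", date(2023, 3, 1), date(2023, 3, 10)),
    ("db-01",  date(2023, 3, 1), None),  # not yet patched
]

SLA_DAYS = 7          # required deployment window for this patch criticality
TARGET_PERCENT = 99   # organizational KPI target

within_sla = sum(
    1 for _, released, applied in patch_records
    if applied is not None and (applied - released) <= timedelta(days=SLA_DAYS)
)
percent = 100 * within_sla / len(patch_records)

status = "green" if percent >= TARGET_PERCENT else "red"
print(f"{percent:.1f}% of machines patched within {SLA_DAYS} days -> KPI {status}")
```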
There are other sources of improvement opportunity information as well, including audits and actual incidents. Audits may be conducted internally or externally, and findings from those audits can be viewed as improvement opportunities. Actual incidents, such as a business interruption or widespread malware outbreak, should be concluded with a lessons learned or postmortem analysis, which provides another source of improvement opportunities. The root cause of the incident and any observations made during the recovery can be used to improve the organization's IT security services.
It is important to understand the formal distinction between events and incidents as the foundation for incident management.
All incidents should be investigated and remediated as appropriate to restore the organization's normal operations as quickly as possible and to minimize adverse impact to the organization such as lost productivity or revenue. This resumption of normal service is the primary goal of incident management.
Not all incidents will require participation by the security team. For example, a spike in new user traffic to an application after a marketing campaign goes live, which leads to a partial loss of availability, is an operational issue and not a security one. A coordinated denial-of-service attack by a foreign nation-state, however, is an incident that requires participation by both IT and security personnel to successfully remediate.
All organizations require some level of incident management capability, that is, the tools and resources needed to identify, categorize, and remediate the impacts of incidents. This capability will revolve around an incident management plan, which should document the following:
A variety of standards exist to support organizations developing an incident response capability, including the ITIL framework, NIST Special Publication 800-61, “Computer Security Incident Handling Guide,” and ISO 27035, “Information security incident management.” All of these standards implement a lifecycle approach to managing incidents: planning before an incident occurs, activating the response and following the documented steps when one does, and reporting on the results and any lessons learned to help the organization better mitigate or respond to such incidents in the future. Figure 5.1 shows an example of the NIST SP 800-61 lifecycle and a description of the activities.
The CCSP's role in developing this capability is, obviously, to define the responses required for security incidents, while other stakeholders from IT or operations will provide input relevant to their responsibilities. All incident response frameworks emphasize the importance of planning ahead by identifying likely scenarios and developing response strategies before an incident occurs; incidents are high-stress situations, and ad hoc responses are less preferable than preplanned, rehearsed responses.
As with all aspects of cloud service use, there are shared responsibilities between the CSP and the consumer when responding to incidents. For many incidents, the CSP will not be involved; for example, an internal user at a consumer organization breaches policies to misuse company data. This does not impact the CSP; however, some incidents will require coordination, such as a denial-of-service attack against one consumer, which could impact other consumers by exhausting resources. The CSP could also suffer an incident, like an outage of a facility or theft of hardware resources, which must be reported to consumer organizations and may trigger their incident management procedures. The major CSPs have dedicated incident management teams to coordinate incident responses, including resources designed to provide notice to consumers such as a status page or dedicated account managers. Your incident planning must include coordination with your CSPs, and you should be aware of what capabilities are and are not available to you. Many CSPs forbid physical access to their resources during an incident response unless valid law enforcement procedures, such as obtaining a warrant, have been followed.
In addition to managing incidents that involve the CSP, incident management processes should take into account the difference between first- and third-party incidents. A first-party incident is one that happens internally, such as an employee stealing information or a server being infected with malware. A third-party incident is one that affects another organization, such as a contractor or vendor. Some third-party incidents may be operational only, such as a vendor being hit by ransomware that prevents them from providing services, while others may be security related, like a data breach at a contractor that exposes the data of the contractor's customers. For third-party incidents, the incident response plan should include information such as points of contact and the steps needed to coordinate the incident response with the third party. This should include any legal or regulatory obligations the organization must meet in the event of a data breach, such as reporting to regulators or impacted customers.
Another important aspect of the organization's incident management capability is the proper categorization and prioritization of incidents based on their impact and criticality. Incident management seeks to restore normal operations as quickly as possible, so prioritizing incidents and recovery steps is critical. This is similar to the risk assessment process where risks are analyzed according to their impact and likelihood; however, since incidents have already occurred, they are measured by the following:
Many organizations utilize a P score (P0–P5) to categorize incidents. Members of the incident response team use this score to prioritize the work required to resolve an incident. For example, a P5 or low-priority item may be addressed as time permits, while a P0, which equates to a complete disruption of operations, requires that all other work be suspended. In many organizations, a certain priority rating may also be a trigger for other organizational capabilities, such as the invocation of a business continuity or disaster recovery plan if the incident is sufficiently disruptive to normal operations.
In the ITIL framework, problems are the causes of incidents or adverse events, and the practice of problem management seeks to improve the organization's handling of these incidents. Problems are, in essence, the root cause of incidents, so problem management utilizes root-cause analysis to identify the underlying problem or problems that lead to an incident and seeks to minimize the likelihood or impact of incidents in the future; it is therefore a form of risk management.
Identified problems are tracked as a form of common knowledge, often in a known issues or known errors database. These document an identified root cause that the organization is aware of, as well as any shared knowledge regarding how to fix or avoid them. Examples might include a set of procedural steps to follow when troubleshooting a particular system, or a workaround, which is a temporary fix for an incident. Workarounds do not mitigate the likelihood of a problem occurring but do provide a quick fix, which supports the incident management goal of restoring normal service as quickly as possible. Problems are risks to the organization, and if the workarounds do not provide sufficient risk mitigation, then the organization should investigate a more permanent solution to resolve the underlying cause.
The last few years have seen an enormous shift from traditional release management practices due to widespread adoption of Agile development methodologies. The primary change is the frequency of releases due to the increased speed of development activities in continuous integration/continuous delivery, often referred to as a CI/CD pipeline. Under this model, developers work on small units of code and merge them back into the main branch of the application's code as soon as they are finished.
Inherent in a CI/CD pipeline is the concept of automated testing, which is designed to identify problems more quickly and in a way that makes them easier to solve. Running tests on only the small units of code being integrated makes it easier for developers to identify where a problem is and how to fix it. From the user's perspective, they get access to new features more quickly than waiting for a monolithic release and, ideally, encounter fewer bugs due to the faster feedback that automated testing gives developers.
Release management activities typically comprise the logistics needed to release the changed software or service and may include identifying the relevant components of the service, scheduling the release, and conducting a post-implementation review to confirm that the change was implemented as intended, that is, that the new software or service is functioning correctly. The process has obvious overlap with change management processes.
In the Agile methodology, the process of release management may not involve manual scheduling, instead relying on the organization's standard weekly, monthly, or quarterly release schedule. For organizations using other development methodologies, the process for scheduling the deployment might require coordination between the service provider and consumers to mitigate risks associated with downtime, or the deployment may be scheduled for a previously reserved maintenance window during which customer access will be unavailable.
The release manager must perform a variety of tasks after the release is scheduled and before it occurs, including identifying whether all changes to be released have successfully passed required automated tests as well as any manual testing requirements. Other manual processes such as updating documentation and writing release notes may also be part of the organization's release management activities. Once all steps have been completed, the release can be deployed and tested and is ready for users.
In more mature organizations, the CD in CI/CD stands for continuous deployment, which automates the process of release management to deliver a truly automatic CI/CD pipeline. Once a developer has written their code and checked it in, an automated process is triggered to test the code, and if all tests pass, it is integrated and deployed automatically to users. This has the advantage of getting updated software and services deployed to users quickly and offers security benefits as well. For example, automated testing is typically less expensive than manual testing, making it feasible to conduct more tests and increase the frequency in complement to the organization's manual testing plans.
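The following is a minimal sketch of the gate logic behind continuous deployment: run the automated test suite and deploy only if every test passes. The run_tests and deploy helpers are hypothetical placeholders, and in practice this logic lives in a CI/CD tool's pipeline configuration rather than in a standalone script.

```python
import subprocess
import sys


def run_tests() -> bool:
    """Run the automated test suite; returns True only if all tests pass."""
    # Placeholder: invoke whatever test runner the project uses (pytest here).
    result = subprocess.run(["pytest", "--quiet"])
    return result.returncode == 0


def deploy() -> None:
    """Placeholder deployment step, e.g., pushing a container image or artifact."""
    print("All tests passed; deploying to production...")


if __name__ == "__main__":
    if run_tests():
        deploy()
    else:
        # Failing tests block the release, keeping defects out of production.
        print("Tests failed; deployment blocked.")
        sys.exit(1)
```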
Even organizations with continuous deployment may require some deployment management processes to deal with deployments that cannot be automated, such as new hardware or software. In this case, the release management process should develop a set of deployment steps including all required assets, dependencies, and deployment order to ensure that the deployment process is successful.
One recent technology development supporting this trend of more frequent deployment is containerization, which packages application code and non-OS software the application requires into a container. Containers can be run on any computing platform regardless of underlying hardware or operating system, so long as container software such as the Docker Engine is available. The container software makes the resources of the computing environment available in response to the requirements of the containerized applications when they run, similar to the way virtualization makes hardware resources available to a virtualized guest OS.
Containers offer portability advantages, meaning they can be run on any OS and hardware platform with container software, as well as availability advantages over traditional infrastructure because they require fewer pieces of software to run. Continuous deployment pipelines often make use of containers, as they provide more flexibility and can speed development.
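As a small illustration of how container software makes host resources available to a packaged application, the sketch below starts a container using the Docker SDK for Python (the third-party docker package). It assumes a local Docker Engine is running; the nginx:latest image is simply a convenient public example.

```python
import docker  # third-party Docker SDK for Python (pip install docker)

# Connect to the local Docker Engine using environment defaults.
client = docker.from_env()

# Run a containerized application; the engine supplies CPU, memory, and
# networking to the container regardless of the underlying host OS.
container = client.containers.run(
    "nginx:latest",         # example public image
    detach=True,            # run in the background
    ports={"80/tcp": 8080}  # map container port 80 to host port 8080
)

print(f"Started container {container.short_id}")
container.stop()
container.remove()
```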
Deployment scheduling for noncontinuous environments may follow a set schedule such as a routine maintenance window or be deployed in a phased approach where a small subset of users receive the new deployment. This phased approach can offer advantages for riskier deployments such as large operating system updates, where unexpected bugs or issues may be encountered. Rolling out updates to a subset of the user pool reduces the impact of any bugs in the release, and the organization has more opportunity to find and correct them. Organizations may also choose not to push deployments but instead allow users to pull the updated software/service on their own schedule. This is the model many consumer OSs follow, where an update is made available and users are free to accept or delay the update at their discretion. This is advantageous for uptime as a user may not want to restart their machine in the middle of an important task, though it does lead to the problem of software never being updated, which means vulnerabilities fixed by that update are also not addressed.
Configuration management (CM, not to be confused with change management, which is also abbreviated CM) comprises practices, activities, and processes designed to maintain a known good configuration of something. Configuration items (CIs) are the things that are placed under configuration control and may be assets such as source code, operating systems, documentation, or even entire information systems.
Changes to CIs are usually required to go through a formal change process designed to ensure that the change does not create an unacceptable risk situation, such as introducing a vulnerable application or architecture. Part of the change management process must include updating the configuration management database (CMDB) to reflect the new state of the services or components after the change is executed. For example, if a particular host is running Service Pack 1 (SP1) of an OS and a major upgrade to SP2 is performed, the CMDB must be updated once the change is completed successfully.
IT service CM may include hardware, software, or the cloud services and configurations in use by a consumer organization, while the CSP would also need to include configurations of the service infrastructure as well as the supply chain used to provide the services.
In many organizations, a formal CMDB will be used to track all CIs, acting as a system of record against which current configurations may be compared in order to detect systems that have gone out of line with expected configurations. The CMDB can also be useful for identifying vulnerabilities. As a source of truth for systems and software versions running in the organization, it is possible to query the CMDB to identify software running in the organization that contains a newly disclosed vulnerability. Furthermore, the CMDB can be useful to support audits by acting as a source of population data to allow auditors to choose a subset of systems to review. If the CMDB is not updated after changes are made, the organization is likely to face audit findings stemming from failure to properly follow processes.
Due to the type of information contained in a CMDB, such as version numbers, vendor information, hardware components, etc., it can also be used as the organization's asset inventory. Many tools that provide CMDB functionality can also be used to automatically detect and inventory systems, such as by monitoring network records to identify when a new system joins or by integrating with cloud administrative tools and adding any new cloud services that are invoked to the CMDB.
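The following sketch illustrates the kind of query described above: checking CMDB records against a newly disclosed vulnerability to identify affected systems. The record layout, package name, and version string are assumptions for the example; a real CMDB would be queried through its own API or query language.

```python
# Hypothetical CMDB records: configuration items with installed software versions
cmdb = [
    {"hostname": "web-01", "software": {"openssl": "1.1.1k", "nginx": "1.20.1"}},
    {"hostname": "web-02", "software": {"openssl": "3.0.1",  "nginx": "1.20.1"}},
    {"hostname": "db-01",  "software": {"openssl": "1.1.1k", "postgres": "13.4"}},
]


def affected_hosts(package: str, vulnerable_versions: set[str]) -> list[str]:
    """Return hosts whose CMDB record lists a vulnerable version of a package."""
    return [
        ci["hostname"]
        for ci in cmdb
        if ci["software"].get(package) in vulnerable_versions
    ]


# Example: a new advisory affects a specific openssl version
print(affected_hosts("openssl", {"1.1.1k"}))  # ['web-01', 'db-01']
```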
Checklists or baselines are often mentioned in discussions of CM, primarily as starting points or guidance on the desired secure configuration of particular system types. Configuration checklists are often published by industry or regional groups with specific guidance for hardening operating systems like Windows, macOS, and various Linux distributions, as well as hardening popular applications such as Microsoft Office and collaboration tools. In many cases, the vendors themselves publish security checklists indicating how their products' various security settings can be configured. In all cases, these checklists are usually an input to an organization's CM process and should be tailored to meet the organization's unique needs.
In the ITSM view of IT as a set of services, including information security, there is a function for defining, measuring, and correcting issues related to delivery of the services. In other words, performance management is a critical part of ITSM. This is closely related to the process of continual service improvement, and in fact, the same metrics are likely to be used by the organization to determine if services are meeting their defined goals.
Service level management rests on the organization's defined requirements for a service. The most common service level many cloud organizations encounter is availability, often expressed as a percentage of time that a system can be reached and utilized, such as 99.9 percent. A service with a 99.9 percent availability level must be reachable approximately 364.64 days per year; put another way, the system can be down for less than nine hours each year. Examples of other service levels that may be managed include the number of concurrent users supported by a system, durability of data, response times to customer support requests, recovery time in the event of an interruption, and timeframes for deployment of patches based on criticality.
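The arithmetic behind these figures is straightforward; the short sketch below converts an availability percentage into the maximum downtime allowed per year, assuming a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Maximum yearly downtime permitted by an availability service level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% availability -> {allowed_downtime_hours(sla):.2f} hours of downtime per year")
# 99.0%  -> 87.60 hours
# 99.9%  -> 8.76 hours
# 99.99% -> 0.88 hours
```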
A key tool in managing service levels is the service level agreement (SLA), which is a formal agreement similar to a contract, but focused on measurable outcomes of the service being provided. This measurement aspect is what makes SLAs critical elements of security risk mitigation programs for cloud consumers, as it allows them to define, measure, and hold the cloud provider accountable for the services being consumed.
SLAs require routine monitoring for enforcement, and this typically relies on metrics designed to indicate whether the service level is being met. Availability metrics are often measured with tools that check to see if a service can be reached. For example, a script may run that checks to see if a website loads at a particular address. The script may run once an hour and log its results; if the SLA is 99.9 percent, then the service should not be down for more than nine hours in a given year. If the service level is not met, the SLA should define penalties, usually in the form of a refund or no obligation for the consumer to pay for the time the service was unavailable.
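A minimal version of the kind of check script described above might look like the following. It assumes the third-party requests library and a placeholder URL; production monitoring would normally use a dedicated tool, run the probe on a schedule (for example, via cron), and feed the results into SLA reporting.

```python
import logging
from datetime import datetime, timezone

import requests  # third-party HTTP client (pip install requests)

logging.basicConfig(filename="availability.log", level=logging.INFO)

URL = "https://status.example.com/"  # placeholder service endpoint


def check_once() -> bool:
    """Record whether the service responded successfully to a single probe."""
    timestamp = datetime.now(timezone.utc).isoformat()
    try:
        response = requests.get(URL, timeout=10)
        up = response.status_code == 200
    except requests.RequestException:
        up = False
    logging.info("%s %s %s", timestamp, URL, "UP" if up else "DOWN")
    return up


if __name__ == "__main__":
    # Scheduled hourly (e.g., by cron), the resulting log supports SLA reporting.
    check_once()
```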
Defining the levels of service is usually up to the cloud provider in public cloud environments, though there is obviously a need to meet customer demands in order to win business. Requirements should be gathered for cloud service offerings regardless of the deployment model, and customer feedback should also be gathered and used as input to the continual service improvement. The metrics reported in SLAs are a convenient source of input to understand if the services are meeting the customers' needs.
In cloud environments, the ability of users to reach and make use of the service is incredibly important, so the provider must ensure that adequate measures are in place to preserve the availability aspect of the relevant services. Availability and uptime are often used synonymously, but there is an important distinction: a service may be “up,” that is, reachable, yet not available, meaning it cannot actually be used. This could be the case if a dependency such as the access control system is down, so users can get to a login page for the cloud service but no further.
Due to the expansive nature of availability management, it is critical to view this as a holistic process. Factors that could negatively impact availability include many of the same concerns that an organization would consider in business continuity and disaster recovery, including loss of power, natural disasters, or loss of network connectivity.
There are additional concerns for providing a service that meets the agreed-upon service levels, including the issue of maintenance. Some cloud service providers exclude periods of scheduled maintenance from their availability guarantees. For example, a system will be available 99 percent of the time with the exception of the third Saturday of each month. This gives defined timeframes to make changes that require a loss of availability, as well as some flexibility for unexpected events or emergency maintenance outside the normal schedule.
Many of the tools that make cloud computing possible provide integral high availability options. For example, many virtualization tools support automatic moving of guest machines from a failed host in the event of an outage, or provide for load balancing so that sudden increases in demand can be distributed to prevent a denial of service. Many cloud services are also designed to be highly resilient, particularly PaaS and SaaS offerings that can offer features such as automatic data replication to multiple data centers around the world, or concurrent hosting of applications in multiple data centers so that an outage at one does not render the service unreachable.
Cloud consumers have a role to play in availability management as well. Consumers of IaaS will, obviously, have the most responsibility with regard to availability of their cloud environment, since they are responsible for virtually everything except the physical facility. PaaS and SaaS users will need to properly architect their cloud solutions to take advantage of their provider's availability options. For example, some cloud providers offer automatic data replication for cloud-hosted databases, but there may be configuration changes required to enable this functionality. There may be other concerns as well, such as data residency or the use of encryption, which can complicate availability; it is up to the cloud consumer to gather and understand these requirements and to configure their cloud services appropriately.
In ITSM, one of the core concerns of availability is the amount of service capacity available compared with the amount being subscribed to. In a simple example, if a service has 100 active users but only 50 licenses available, that means the service is over capacity and 50 users will face a denial-of-service condition. In this simple example, the service is oversubscribed, meaning there are more users than capacity. The service provider must be able to predict, measure, and plan for adequate capacity to meet its obligations; failure to do so could result in financial penalties in the form of SLA enforcement.
As users, we are aware of the negative impacts resulting from a lack of system resources—irritating situations such as a spinning hourglass or beach ball when our desktop computer's RAM capacity is exceeded. While a minor irritant for individual users, this situation could prove quite costly for a business relying on a cloud service provider's infrastructure. Any service that is being consumed should be measurable, whether it is network bandwidth, storage space, processing capability, or availability of an application. Measured service is one of the core elements of cloud computing, so metrics that illustrate demand for the service are relatively easy to identify.
Cloud service providers must take appropriate measures to identify the service capacity they need to provision. These measures might include analysis of past growth trends to predict future capacity, identifying capacity agreed to in SLAs, or even analysis of external factors such as knowing that a holiday season will cause a spike in demand at certain customers like online retailers. Monitoring of current services, including utilization and demand, should also be part of the analysis and forecasting model.
In some cases, cloud service providers and their customers may be willing to accept a certain amount of oversubscription, especially as it could offer cost savings. To extend the previous example, assume the service provider offers 50 licenses and the business has 100 users split between the United States and India. Given the time zone difference between the two countries, it is unlikely that all 100 users will try to access the system simultaneously, so oversubscription does not present an issue.
If the consumer does require concurrent accessibility for all 100 users, then they must specify that as an SLA requirement. The provider should then utilize their capacity management processes to ensure that adequate capacity is provisioned to meet the anticipated demand.
Digital forensics, broadly, is the application of scientific techniques to the collection, examination, and interpretation of digital data. The primary concern in forensics is the integrity of data, as demonstrated by the chain of custody. Digital forensics is a field that requires very particular skills and is often outsourced to highly trained professionals, but a CCSP must be aware of digital forensic needs when architecting systems to support forensics and must know how to acquire appropriate skills as needed to respond to a security incident.
Digital forensics in cloud environments is complicated by a number of factors; some of the very advantages of cloud services are also major disadvantages when it comes to forensics. For example, high availability and data replication mean that data is stored in multiple locations around the world simultaneously, which complicates the identification of a single crime scene. Multitenant models of most cloud services also present a challenge, as there are simply more people in the environment who must be ruled out as suspects. The shared responsibility model also impacts digital forensics in the cloud. As mentioned previously, most CSPs do not allow consumers physical access to hardware or facilities, and even with court orders like a warrant, the CSPs may have procedures in place that make investigation, collection, and preservation of information more difficult. This is not to frustrate law enforcement but is a predicament caused by the multitenant model; allowing investigation of one consumer's data might inadvertently expose data belonging to other consumers. Investigation of one security incident should, as a rule, not be the cause of other security breaches!
In legal terminology, discovery means the examination of information pertinent to a legal action. E-discovery is a digital equivalent comprising steps including identification, collection, preservation, analysis, and review of electronic information. There are two important standards a CCSP should be familiar with related to e-discovery.
cloudsecurityalliance.org/artifacts/csa-security-guidance-domain-3-legal-issues-contracts-and-electronic-discovery.
When legal action is undertaken, it is often necessary to suspend some normal operations, such as the routine destruction of data or records according to a defined schedule. In this case, a process known as legal hold will be utilized, whereby data is preserved until the legal action is completed. Provisions for legal hold, such as extra storage availability and proper handling procedures, must be part of contracts and SLAs, and during legal proceedings, the cloud consumer should have easy access to appropriate points of contact at the CSP to facilitate e-discovery or other law enforcement requirements.
The process of collecting evidence is generally a specialized activity performed by experts, but security practitioners should be aware of some steps, especially those performed at the beginning before a forensics expert is brought in.
When handling evidence, the chain of custody documents the integrity of data, including details of time, manner, and person responsible for various actions such as collecting, making copies, performing analysis, and presenting the evidence. Chain of custody does not mean that the data has not been altered in any way, as it is often necessary to make changes such as physically collecting and moving a piece of hardware from a crime scene to a lab. Instead, chain of custody provides a documented, reliable history of how the data has been handled, so if it is submitted as evidence, it may be relied upon. Adequate policies and procedures should exist, and it may be appropriate to utilize the skills of trained forensic experts for evidence handling.
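One common supporting practice is hashing evidence at the time it is collected so that later copies can be verified against the original. The sketch below records a simple custody entry alongside a SHA-256 digest; the field names and file path are illustrative assumptions, and real investigations should follow the organization's documented forensic procedures.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_file(path: str) -> str:
    """Compute a SHA-256 digest to support later integrity verification."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_entry(path: str, action: str, handler: str) -> dict:
    """Create one chain-of-custody record for an action taken on evidence."""
    return {
        "evidence": path,
        "action": action,        # e.g., collected, copied, analyzed
        "handler": handler,      # person responsible for the action
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sha256": sha256_of_file(path),
    }


# Example: logging the initial collection of a disk image (hypothetical path)
entry = custody_entry("/evidence/host01-disk.img", "collected", "analyst.jdoe")
print(json.dumps(entry, indent=2))
```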
The scope of evidence collection describes what is relevant when collecting data. In a multitenant cloud environment, this may be particularly relevant, as collecting data from a storage cluster could inadvertently expose data that does not belong to the requesting party. Imagine that two competing companies both utilize a CSP, and Company A makes a request for data relevant to legal proceedings. If the CSP is not careful about the evidence collected and provided to Company A, they may expose sensitive data about Company B to one of their competitors!
Evidence presented to different audiences will follow different rules. For example, an industry regulator may have a lower integrity threshold for evidence, as they are not able to assess criminal penalties for wrongdoing. A court of law, however, will have higher constraints as the stakes are higher—they have the power to levy fines or possibly imprison individuals. Evidence should possess these five attributes to be useful.
There are four general phases of digital evidence handling: collection, examination, analysis, and reporting. There are a number of concerns in the first phase, collection, that are essential for a CCSP to understand. Evidence may be acquired as part of standard incident response processes before the need for forensic investigation and criminal prosecution has been identified, so the incident response team needs to handle evidence appropriately in case a chain of custody needs to be demonstrated. Proper evidence handling and decision-making should be part of the incident response procedures and training for team members performing response activities.
There are a number of challenges associated with evidence collection in a cloud environment, including the following:
There are a number of important steps the organization should take prior to an incident that can support investigations and forensics. These can be built into several processes the organization is likely to perform for other security objectives, including the following:
Although forensics experts may perform significant amounts of evidence collection, security practitioners must be aware of some best practices, including the following:
Once evidence has been collected, it must be adequately preserved to support a variety of goals: maintain the chain of custody, ensure admissibility, and be available for analysis to support the investigation. Preservation activities and concerns should cover the following:
Adequate coordination with a variety of stakeholders is critical in any IT operation, and the move to utilize cloud computing resources coupled with an increasingly regulated and dispersed supply chain elevates the priority of managing these relationships. Communication is a cornerstone of this management; providing adequate and timely information is critical. While this may be a skillset fundamental to project managers rather than security practitioners, it is worth understanding the importance of effective communication and supporting it whenever possible.
Effective communication should possess a number of qualities. The contents, nature, and delivery of communications will drive many decisions, which can be elicited using a series of questions about the information to be conveyed.
There are a number of stakeholders or constituents with whom an organization is likely to communicate regarding IT services, organizational news and happenings, and emergency information. Establishing clear methods, channels, and formats for this communication is critical.
Few organizations exist in a vacuum; the modern supply chain spans the globe, and regulators have begun to enforce more stringent oversight of it. It is therefore essential that an organization and its security practitioners understand the supply chain and establish adequate communications.
The first step in establishing communications with vendors is an inventory of critical third parties on which the organization depends. This inventory will drive third-party or vendor risk management activities in two key ways. First, some vendors may be critical to the organization's ongoing functioning, such as a CSP whose architecture has been adopted by the organization. Second, some vendors of goods and services may provide critical inputs to an organization like a payment card processor whose service supports the organization's ability to collect money for its goods or services.
Communication with critical vendors should be similar to internal communications due to the critical role these vendors play in the business. If a vendor incident is likely to impact an organization's operations, the organization ought to have well-established communications protocols to receive as much advance notice as possible. If a consumer notices an incident such as loss of availability of a vendor's service, there should be adequate reporting mechanisms to raise the issue and resolve it as quickly as possible.
Many vendor communications will be governed by contract and SLA terms. When a CSP is a critical vendor, there should be adequate means for bidirectional communication of any issues related to the service, such as customer notices of any planned outages or downtime, emergency notifications of unplanned downtime, and customer reporting for service downtime or enhancements. In many cases, this will be done through a customer support channel with dedicated personnel as well as through ticketing systems, which creates a trackable notice of the issue, allowing all parties to monitor its progress.
As cloud consumers, most organizations will be the recipients of communications from their chosen CSPs. While this might seem to imply there are no responsibilities other than passively receiving information from and reporting issues to the CSP, consumers do have a critical accountability: defining SLA terms. Levels of communication service from the CSP should all be defined and agreed upon by both parties, such as speed of acknowledging and triaging incidents, required schedule for notification of planned downtime or maintenance, days/times support resources are available, and even the timeframe and benchmarks for reporting on the service performance. SLAs may be generic and standardized for all customers of a CSP or may be highly specific and negotiated per customer, which offers more flexibility but usually at greater cost.
A key source of information to be communicated between CSPs and their customers is the responsibility for various security elements of the service. The CSP is solely responsible for operational concerns like environmental controls within the data center, as well as security concerns like physical access controls. Customers using the cloud service are responsible for implementing data security controls, like encryption, that are appropriate to the type of data they are storing and processing in the cloud. Some areas require action by both the provider and customer, so it is crucial for a CCSP to understand which cloud service models are in use by the organization and which areas of security must be addressed by each party. This is commonly referred to as the shared responsibility model, which defines who is responsible for different aspects of security across the different cloud service models. The generic model in Table 5.1 identifies key areas of responsibility and ownership in various cloud service models.
TABLE 5.1 Cloud Shared Responsibility Model
C = Customer, P = Provider
| Responsibility | IaaS | PaaS | SaaS |
| --- | --- | --- | --- |
| Data classification | C | C | C |
| Identity and access management | C | C/P | C/P |
| Application security | C | C/P | C/P |
| Network security | C/P | P | P |
| Host infrastructure | C/P | P | P |
| Physical security | P | P | P |
A variety of CSP-specific documentation exists to define shared responsibility in that CSP's offerings, and a CCSP should obviously be familiar with the particulars of the CSP their organization is utilizing. The following is a brief description of the shared responsibility model for several major CSPs and links to further resources:
More information can be found here: aws.amazon.com/compliance/shared-responsibility-model.
More information can be found here: docs.microsoft.com/en-us/azure/security/fundamentals/shared-responsibility.
More information can be found here: cloud.google.com/security.
Partners will often have a level of access to an organization's systems similar to that of the organization's own employees but are not directly under the organization's control. Communication with partners will be similar to communication with employees, with initial steps required when a new relationship is established, ongoing maintenance activities throughout the partnership, and termination activities when the partnership is wound down. Each of these phases should deliver clear expectations regarding security requirements.
There are a vast array of regulatory bodies governing information security, and most of them have developed cloud-specific guidance for compliant use of cloud services. In the early days of cloud computing, security practitioners often faced significant unknowns when moving to the cloud, but as the cloud became ubiquitous, regulators delivered clear guidance, and CSPs moved quickly to deliver cloud solutions tailored to those compliance requirements. A CCSP is still responsible for ensuring that their cloud environment is in compliance with all regulatory obligations applicable to their organization.
The main component of regulatory communication is monitoring incoming information about regulatory requirements that affect cloud use. For example, the implementation of GDPR in the European Union (EU) forced many organizations to make architectural decisions regarding their cloud applications. GDPR's restrictions on data leaving the geographic boundaries of the EU mean many organizations need to implement additional privacy controls to transfer data out of the EU; otherwise, they must host their applications in an EU-based data center. The CCSP should subscribe to feeds and be aware of regulatory changes that impact their organization's use of the cloud.
Similar to monitoring incoming cloud requirements, a CCSP may also be required to report information to regulatory bodies regarding their organization's state of compliance. For example, U.S.-based companies handling EU personal data were required to communicate with the U.S. Department of Commerce on the status of their privacy compliance under the EU-U.S. Privacy Shield framework (since invalidated and succeeded by the EU-U.S. Data Privacy Framework). CCSPs should be aware of regulatory reporting requirements specific to their organization and ensure that required documentation, artifacts, audits, etc., are communicated in a timely fashion.
Communications may not be a primary job responsibility of a CCSP, but important details of security risk management work may need to be shared. Working with appropriate personnel in the organization will be crucial to ensure that information is communicated in a timely manner and with relevant details for each audience, such as the following:
Security operations represent all activities undertaken by an organization in monitoring, maintaining, and generally running their security program. This may include continuous oversight of security systems to identify anomalies or incidents, as well as an organizational unit to house concerns related to security processes like incident response and business continuity. A large percentage of security process and procedure documentation will be attached to this function, and these processes should also be the target of continuous improvement efforts.
As in many security topics, an ISO standard exists that may be useful for security practitioners conducting security operations: ISO 18788, “Management system for private security operations — Requirements with guidance for use.” A word of caution: as its numbering suggests (it is not part of the 27000 series), this document contains a great deal of material not applicable to cybersecurity or infosec concerns, such as implementing and managing a private security force with authorization to use force. However, it also provides a business management framework for understanding the organization's needs, guidance on designing strategic and tactical plans for operating security programs, and a risk management approach, and it suggests a standard plan, do, check, act framework for implementing and improving security operations.
The security operations center (SOC) is an organizational unit designed to centralize a variety of security tasks and personnel at the tactical (mid-term) and operational (day-to-day) levels of the organization. While security strategy may rely on input from top leaders such as a board of directors, department secretary or minister, or the C-suite executives, SOC personnel are responsible for implementing steps required to achieve that strategy and maintain daily operations. Building and running a SOC in a traditional, all on-prem environment is basically building a monitoring and response function for IT infrastructure. Extending these concepts to the cloud may require some trade-offs, as the CSP will not offer the same level of access for monitoring that an on-prem environment does.
It is also important to note that there will be at least two SOCs involved in cloud environments. The CSP should run and manage their own SOC focused on the elements they control under the shared responsibility model such as infrastructure and physical security, while consumers should run their own SOC for their responsibility areas, chiefly data security when using the cloud.
The CISO Mind Map published by security author Rafeeq Rehman, found at rafeeqrehman.com/?s=mindmap, provides a more information-security-centric view of security operations than ISO 18788. Updated each year, the Mind Map details the items that a CISO's role should cover; the largest element of the job responsibilities is security operations, which is broken down into three main categories. This provides a strong framework for responsibilities the SOC should undertake, including the following:
A SOC is typically made up of security analysts, whose job involves taking incoming data and extracting useful information, and security engineers who can keep operations running smoothly. There may be overlap with operations personnel, and in some organizations, the SOC may be combined with other operational functions. Common functions that may be performed in the SOC are the following:
Alert prioritization: Not all alerts are critical, require immediate attention, or represent imminent harm to the organization. SOC functions related to log management should include definitions to assist in prioritizing alerts received from various sources such as monitoring tools, as well as defined procedures for taking action on alerts.
Loss of commercial power at a facility is an alert worth monitoring but may not be a major incident if backup power is available and the power is likely to be restored quickly. If the organization is experiencing exceptional operations, such as retail during a major holiday, then the organization may choose to preemptively declare a business interruption and shift processing to an alternate facility. Detecting and triaging incidents, up to and including declaration of an interruption or disaster, is a logical function for the SOC to perform due to the type of data they work with.
As always, there are two key perspectives for the SOC. CSPs will likely need a robust SOC function with 24/7/365 monitoring of the environment. While such a capability will be expensive, the cost is likely justified by requirements of the cloud consumers and can be shared among this large group. Cloud consumers may operate a SOC for their own operations, which will include any on-prem IT services as well as their cloud services and environments. This may require the use of some legacy tools deployed to more traditional cloud services such as IaaS or PaaS, newer tools designed to monitor services such as SaaS (for example, access management or encryption tools), or even the use of CSP-provided monitoring capabilities.
As an example, the major CSPs offer security incident reporting services that customers can log in to if an incident affects them. They also offer the following public status pages that list operational information for public-facing services:
status.aws.amazon.com
status.azure.com/en-us/status
status.cloud.google.com
One crucial decision to be made when designing a SOC is the use of internal resources or outsourcing the function (build versus buy). As previously mentioned, the CSPs can likely justify the cost of robust SOC resources due to cost sharing among customers and the requirements those customers will impose. A small organization with only cloud-based architecture may decide to outsource their security operations and monitoring, as many services provide dedicated support for specific cloud platforms at a lower cost than building the same function internally. These are known as managed security services providers (MSSPs). Like most business decisions, this will be a trade-off between control and cost and should be made by business leaders with input from security practitioners. A CCSP should understand the cloud architecture and communicate any risks the organization might assume by utilizing a third-party SOC.
Monitoring of security controls used to be an activity closely related to formal audits that occur relatively infrequently, sometimes once a year or even once every three years. A newer concept is known as continuous monitoring, which is described in the NIST SP 800-37 Risk Management Framework (RMF) as “Maintaining ongoing awareness to support organizational risk decisions.” Information that comes from an audit conducted more than a year ago is not ongoing awareness. Instead, the RMF specifies the creation of a continuous monitoring strategy for getting near real-time risk information.
Real-time or near real-time information regarding security controls comprises two key elements: the status of the controls and any alerts or actionable information they have created. Network resources are at risk of attacks, so network security controls like IDS are deployed. Continuous monitoring of the IDS's uptime is critical to ensure that risk is being adequately mitigated. A facility to view any alerts generated by the device, as well as personnel and processes to respond to them, is also crucial to the organization's goal of mitigating security risks; if the IDS identifies malicious network activity but no action is taken to stop it, the control is not effective.
A longer-term concern for monitoring security controls and risk management is the suitability of the current set of tools. As organizations evolve, their infrastructure will likely change, which can render existing tools ineffective. The SOC should be charged with ensuring that it can monitor the organization's current technology stack, and a representative should be part of change management or change control processes. Migrating a business system from on-prem hosting to a SaaS model will likely have a security impact with regard to the tools needed to monitor it, and the change board should ensure that this risk is planned for as part of the change.
In general, the SOC should have some monitoring capabilities across all physical and logical infrastructure, though detailed monitoring of some systems may be performed by another group. For example, a physical access control system dashboard may be best monitored by security guards who can perform appropriate investigation if an alarm is triggered. Some organizations run a network operations center (NOC) to monitor network health, and NOC engineers would be best suited to manage telecommunications equipment and ISP vendors. However, an operational incident in either of these two systems, such as a break-in or loss of ISP connectivity, could be an input to the SOC's incident management function. Several controls that might be particularly important for SOC monitoring include the following:
NIST SP 800-92, “Guide to Computer Security Log Management,” defines a log as “a record of the events occurring within an organization's systems and networks” and further states that “Many logs within an organization contain records related to computer security … including security software, such as antivirus software, firewalls, and intrusion detection and prevention systems; operating systems on servers, workstations, and networking equipment; and applications.” These logs of internal activity can unfortunately be overwhelming for humans to attempt to meaningfully review due to the sheer number of events generated by modern information systems. Security information and event management (SIEM) tools offer assistance.
SIEM tools provide a number of functions useful to security, namely, the following:
Normalization: Data from different log sources often arrives in inconsistent formats, which makes it harder to analyze, so SIEM platforms can transform data to a common format, such as the use of UTC for timestamps to avoid issues with time zones, or the use of consistent field names like Timestamp instead of Event Time.
Correlation: Correlation refers to discovering relationships between two or more events. For example, if a user has suddenly logged in from a new location, accessed email, and started downloading files, it could indicate compromised credentials being used to steal data. It could also indicate that the user is attending a conference and is getting some work done in between sessions, but the organization should still perform checks to verify. If travel is a common situation, the organization might also integrate data from an HR or travel reservation system to correlate travel records for a certain user with their activity accessing data from a particular country. If the user is not known to be traveling in that country, the activity is highly suspicious.
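As a rough illustration of the normalization step, the sketch below converts events from two hypothetical log sources into a common structure with UTC timestamps and a consistent Timestamp field name. Real SIEM platforms perform this with built-in parsers and far richer schemas.

```python
from datetime import datetime, timezone

# Hypothetical raw events from two different log sources
raw_events = [
    {"Event Time": "2023-03-01T09:15:00-05:00", "user": "adoe", "msg": "login"},
    {"timestamp": "2023-03-01T14:20:00+00:00", "username": "adoe", "message": "file download"},
]


def normalize(event: dict) -> dict:
    """Map source-specific fields onto a common schema with UTC timestamps."""
    raw_time = event.get("Event Time") or event.get("timestamp")
    utc_time = datetime.fromisoformat(raw_time).astimezone(timezone.utc)
    return {
        "Timestamp": utc_time.isoformat(),
        "User": event.get("user") or event.get("username"),
        "Message": event.get("msg") or event.get("message"),
    }


for normalized in map(normalize, raw_events):
    print(normalized)
# Both events now share field names and a UTC time base, making correlation easier.
```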
Other sources of information could include external information like threat intelligence or CSP status feeds, which can be relevant in investigating anomalies and categorizing them as incidents. Major CSPs offer monitoring capabilities in their platforms as well, and this data may be critical for investigating anomalies.
PaaS and SaaS in particular can pose issues for logging and monitoring, as the logs themselves may not be visible to the consumer. CSPs typically don't share internal logs with consumers due to the risk of inadvertently exposing customer details, so these services may be a black hole in the organization's monitoring strategy. Solutions have been designed that can address this, such as the cloud access security broker (CASB), which is designed to log and monitor user access to cloud services. A CASB can be deployed inline in the user's connection to cloud services or can collect information from the cloud services via API. The CASB can monitor access and interaction with applications; for example, user Alice Doe logged into Dropbox at 12:30 and uploaded file Super Secret Marketing Data.xlsx at 12:32. Dropbox as a CSP may not share this data with consumers, but the CASB can help overcome that blind spot.
NIST SP 800-92 details critical requirements for securely managing log data, such as defining standard processes and managing the systems used to store and analyze logs. Because of their critical nature in supporting incident investigations, logs are often a highly critical data asset and worthy of robust security mechanisms. These may include the following:
An incident is any unplanned event that reduces, or has the potential to reduce, the quality of an IT service. In security terms, reducing the quality is synonymous with impacting any element of the CIA triad. As an example, an event could be a loss of commercial power at a data center. If the organization has been notified that the power company is performing maintenance and is able to continue running with backup generators, then this is merely an event; the IT services provided by the data center can continue uninterrupted. If, instead, the power is cut unexpectedly and the facility must switch to backup power, systems could become unresponsive or lose data during the transition, which is an obvious negative impact to data integrity and system availability.
Incident management or incident response (IR) exists to help an organization plan for incidents, identify them when they occur, and restore normal operations as quickly as possible with minimal adverse impact to business operations. This is referred to as a capability, or the combination of procedures and resources needed to respond to incidents, and generally comprises three key elements.
To ensure that an incident is dealt with correctly, it is important to determine how critical it is and prioritize the response appropriately. Each organization may classify incidents differently, but a generic scheme plots Urgency against Impact. Each is assigned a value of Low, Medium, or High, and incidents with the highest resulting priority are handled first. The following are descriptions and examples of these criteria:
Incident classification criteria and examples should be documented for easy reference. The IRP should contain this information, and it is also advisable to include it in any supporting systems like incident trackers or ticketing systems. These ratings are subjective and may change as the incident is investigated, so the IR coordinator should ensure that critical information like prioritization is communicated to the team.
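A minimal sketch of such an Urgency-versus-Impact scheme follows; the matrix values are illustrative only, and each organization should define and document its own ratings.

```python
# Illustrative urgency/impact priority matrix (example values, not a standard).
LEVELS = ("Low", "Medium", "High")

PRIORITY_MATRIX = {
    ("Low", "Low"): "Low",        ("Low", "Medium"): "Low",       ("Low", "High"): "Medium",
    ("Medium", "Low"): "Low",     ("Medium", "Medium"): "Medium", ("Medium", "High"): "High",
    ("High", "Low"): "Medium",    ("High", "Medium"): "High",     ("High", "High"): "High",
}

def prioritize(urgency: str, impact: str) -> str:
    """Return the response priority for a given (urgency, impact) rating."""
    if urgency not in LEVELS or impact not in LEVELS:
        raise ValueError("Ratings must be Low, Medium, or High")
    return PRIORITY_MATRIX[(urgency, impact)]

# A malware outbreak on critical systems is handled first.
print(prioritize("High", "High"))  # -> High
# A cosmetic defect in a rarely used report can wait.
print(prioritize("Low", "Low"))    # -> Low
```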
The organization's IRP should include detailed steps broken down by phases. At a high level, there are activities to be conducted prior to an incident and after an incident occurs; namely, planning the IR capability and the actual execution of a response when an incident is detected. There are a number of IR models that contain slightly different definitions for each phase, but in general they all contain the following:
Security researchers or malicious actors may draw attention to a vulnerability they have discovered or, worse, exploited, in which case the organization must take steps to investigate whether the claim is true and take appropriate action. Organizations are also beginning to subscribe to external intelligence feeds from third-party services that can provide advance alert of an incident, such as compromised user credentials showing up on the dark web, or adjacent domains being registered that might be used in a phishing attack.
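As a simple illustration of the last example, the sketch below generates a few common typosquatting variants of an organization's domain and checks whether any already resolve in DNS; the domain is a placeholder, and the variant list is deliberately tiny compared with what commercial monitoring services generate.

```python
import socket

DOMAIN = "example.com"  # placeholder for the organization's real domain

def typosquat_candidates(domain):
    """Generate a few simple lookalike variants (character swaps and doubled letters)."""
    name, tld = domain.rsplit(".", 1)
    variants = set()
    for i in range(len(name) - 1):  # adjacent character swaps
        variants.add(f"{name[:i]}{name[i + 1]}{name[i]}{name[i + 2:]}.{tld}")
    for i in range(len(name)):      # doubled characters
        variants.add(f"{name[:i]}{name[i]}{name[i:]}.{tld}")
    variants.discard(domain)
    return sorted(variants)

def registered(domain):
    """Return True if the domain currently resolves in DNS."""
    try:
        socket.gethostbyname(domain)
        return True
    except socket.gaierror:
        return False

for candidate in typosquat_candidates(DOMAIN):
    if registered(candidate):
        print(f"Possible phishing domain already registered: {candidate}")
```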
As soon as an incident is detected, it must be documented, and all actions from the point of detection through to resolution should be documented as well. Initial analysis or triage of incidents, prioritization, members of the IRT called upon to deal with the incident, and plans for implementing recovery strategies should be documented. As discussed in the section on digital forensics, it may be the case that a seemingly simple incident evolves into a malicious act requiring criminal charges. In this case, as much evidence as possible, handled correctly, will be crucial and cannot be created after the fact.
Investigation will begin as the IRT starts to gather information about the incident. This can be as simple as attempting to reproduce a user-reported app issue to determine if it is affecting only that user or is a system-wide issue. This is a key integration point with the practice of digital forensics: as soon as it appears that the incident may require prosecution or escalation to law enforcement, appropriately trained digital forensics experts must be brought in.
Notification may also occur during this stage, once the incident is properly categorized. In many cases, this will be done to satisfy legal or compliance obligations, such as U.S. state privacy laws and the EU GDPR, which require notification to authorities within a certain timeframe after an incident is detected. Appropriate communication resources like legal counsel or public relations personnel may be required on the IRT to handle these incidents.
Once sufficient information is gathered, a containment strategy must be formulated and implemented. This should follow the documented scenario-based responses in the IRP whenever possible—for example, responding to a ransomware attack by isolating any affected machines and working to establish how the malware was installed. Once this is ascertained, deploying techniques for preventing other machines from being affected may be more important than recovering data on machines that have already been compromised. These types of decisions should be made by qualified personnel with as much information as possible at their disposal.
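As one illustration of isolating affected machines, the sketch below assumes an AWS environment and swaps a compromised instance's security groups for a quarantine group with no inbound or outbound rules, cutting off network access while preserving the machine for forensics; the instance and group IDs are placeholders, and other platforms offer equivalent controls.

```python
import boto3

# Placeholder identifiers -- substitute real values from the investigation.
COMPROMISED_INSTANCES = ["i-0123456789abcdef0"]
QUARANTINE_SG = "sg-0fedcba9876543210"  # security group with no inbound/outbound rules

ec2 = boto3.client("ec2")

def quarantine(instance_id):
    """Replace the instance's security groups with the quarantine group."""
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[QUARANTINE_SG])
    print(f"{instance_id} isolated behind {QUARANTINE_SG}")

for instance in COMPROMISED_INSTANCES:
    quarantine(instance)
```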
Notification is typically handled as part of incident detection, and formal reporting is an ongoing task that starts after the initial notification. In simple cases, the report may comprise a single document at the resolution of an incident with the particulars, while in other cases, ongoing reporting of a dynamic situation will be required. As you may have seen during high-profile data breaches at public organizations, initial reporting is made on the incident with information available at that moment, even if it is estimated. As the investigation proceeds, updated reports are delivered to convey new information. The IR coordinator, along with appropriate communication resources on the IRT, is responsible for creating and disseminating all required reporting to appropriate stakeholders.
During the response phase, additional information should be gathered and documented regarding the incident. This may include details of any external attacker or malicious insider who might be responsible and should be conducted in accordance with the digital evidence handling previously discussed.
Eradication of the underlying cause is also a critical element of the recovery. This may include actions such as replacing faulty hardware, blocking a specific IP address or user from accessing a resource, or rebuilding compromised systems from known good images. Containment prevents the incident from spreading further, while eradication removes the immediate cause of the incident. It may be the case that neither prevents the problem from reoccurring in the future, which is a part of post-incident activities.
Recovery and eradication end when the organization returns to normal operations. This is defined as the pre-incident service level delivered using standard procedures.
Incident response is designed to help the organization restore normal operations quickly, but there are situations when this will be impossible. In these cases, the incident may need to be upgraded to an interruption, which is an event whose impact is significant enough to disrupt the organization's ability to achieve its goals or mission. A few users with malware infections on their workstations is an incident that can likely be handled by normal IT resources, but an outbreak affecting all critical systems and a large percentage of users is likely to require more resources.
In such cases, the IR coordinator may be empowered to declare an interruption or disaster, which invokes processes like BCDR. Similar to IR, there should be clear plans in place to guide the organization's response, such as emergency authorization to buy new equipment outside of normal purchasing processes or invocation of alternate procedures like using a third party to process information. The IR coordinator should be aware of their role and responsibilities in these plans, including the process for declaring and notifying appropriate members of the BCDR team to take over.
As discussed, the migration to a third-party CSP can introduce additional complexity to an organization. When planning for incident management, the CSP must be considered a critical stakeholder. Appropriate points of contact should be documented and reachable in the event of an incident, such as a service delivery or account manager who can support the organization's incident response and recovery. An incident at the CSP should be communicated to all the CSP's consumers, and the CSP may be required to provide specific information in the case of regulated data or negligence. Even if a data breach is the fault of the CSP, the consumer acting as data controller typically remains legally responsible for obligations such as notifying affected individuals and may still face fines.
Communication from a consumer to the CSP may be critical, especially if the incident has the ability to affect other CSP customers. Additionally, some options for performing incident management, such as rebuilding compromised architecture, will be different in the cloud environment. It is possible to rapidly redeploy completely virtual architecture to a known good state in the cloud; the same task in a traditional data center environment could take significant time.
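For example, assuming an AWS environment with a hardened, known good machine image on hand, a compromised virtual machine can be replaced in a few API calls; the image, instance, and subnet identifiers below are placeholders, and evidence preservation steps are omitted for brevity.

```python
import boto3

# Placeholder values for a hypothetical environment.
KNOWN_GOOD_AMI = "ami-0abc1234def567890"     # hardened, pre-incident image
COMPROMISED_INSTANCE = "i-0123456789abcdef0"
SUBNET_ID = "subnet-0aa11bb22cc33dd44"

ec2 = boto3.client("ec2")

# Terminate the compromised instance (after evidence has been preserved)...
ec2.terminate_instances(InstanceIds=[COMPROMISED_INSTANCE])

# ...and launch a replacement from the known good image.
response = ec2.run_instances(
    ImageId=KNOWN_GOOD_AMI,
    InstanceType="t3.micro",
    SubnetId=SUBNET_ID,
    MinCount=1,
    MaxCount=1,
)
print("Replacement instance:", response["Instances"][0]["InstanceId"])
```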
There are three standards that can guide a security practitioner in designing and operating an incident management capability.
You can find the technical report document at resources.sei.cmu.edu/asset_files/TechnicalReport/2018_005_001_538866.pdf.
You can find NIST SP 800-61 here: csrc.nist.gov/publications/detail/sp/800-61/rev-2/final.
Cloud security operations, like all other security practices, must be anchored by two key principles: operations must be driven by the organization's business objectives or mission, and they must preserve the confidentiality, integrity, and availability of data and systems in the cloud. Operations is a far-reaching topic covering the selection, implementation, and monitoring of physical and logical infrastructure, as well as security controls designed to address the risks posed in cloud computing.
There are a variety of standards that can assist the CCSP in implementing or managing security controls for cloud environments. These cover major objectives such as access control, securing network activity, designing operational control programs, and handling communications. Choosing the correct standard will be driven by each organization's location, industry, and possibly costs associated with the various standards. All programs implemented should have feedback mechanisms designed to continuously improve security as risks evolve.
Also key is an understanding of the shared responsibility model. CSPs perform the majority of work related to physical infrastructure, though cloud consumers may still need to physically secure the infrastructure that connects them to the cloud. Logical infrastructure is a more equally shared responsibility: in non-SaaS models, the CSP runs the underlying infrastructure, but consumers have key responsibilities for securing the logical infrastructure in their virtual slice of the cloud.