15
Analysis of Cloud Digital Evidence

Irfan Ahmed and Vassil Roussev

University of New Orleans, New Orleans, LA, USA

15.1 Introduction

Analysis of digital evidence acquired from cloud computing deployments, which we refer to as cloud evidence analysis, is in its very early stages of development. It is still in its exploration and experimentation phase where new, ad hoc solutions are developed on a per‐case basis; efforts are made to map problems in the cloud domain to prior solutions; and, most of all, ideas for the future are put forward. In other words, the state of knowledge is quite immature, and that is well illustrated by the steady stream of recommendations – primarily from academia – on what should be done by providers and clients to make cloud forensics easier and better (for the existing toolset).

The goal of this chapter is to present a broad framework for reasoning about cloud forensics architectures and scenarios, and for the type of evidence they provide. We use this framework to classify and summarize the current state of knowledge, as well as to identify the blank spots and likely future direction of cloud forensics research and development.

15.1.1 Cloud Forensics as a Reactive Technology

Since our discussion is primarily focused on the technical aspects of analyzing cloud evidence (and not on legal concerns), we adopt the following technical definition of digital forensics (Roussev 2016):

Digital forensics is the process of reconstructing the relevant sequence of events that have led to the currently observable state of a target IT system or (digital) artifacts.

The notion of relevance is inherently case‐specific, and a big part of a forensic analyst's expertise is the ability to identify case‐relevant evidence. Frequently, a critical component of the forensic analysis is the causal attribution of an event sequence to specific human actors of the system (such as users and administrators). When used in legal proceedings, the provenance, reliability, and integrity of the data used as evidence are of primary importance. In other words, we view all efforts to perform system or artifact analysis after the fact as a form of forensics. This includes common activities such as incident response and internal investigations, which almost never result in any legal actions. On balance, only a tiny fraction of forensic analyses make it to the courtroom as formal evidence.

Digital forensics is fundamentally reactive in nature – we cannot investigate systems and artifacts that do not exist; we cannot have best practices before an experimental period during which different technical approaches are tried, (court‐)tested, and validated. This means there is always a lag between the introduction of a piece of information technology and the time an adequate corresponding forensic capability is in place. The evolution of the IT infrastructure is driven by economics and technology; forensics merely identifies and follows the digital breadcrumbs left behind.

It follows that forensic research is also inherently reactive and should focus primarily on understanding and adapting to the predominant IT landscape, as opposed to trying to shape it in any significant fashion. Throughout the rest of this chapter, we will make the case that the Cloud presents a new type of challenge for digital forensics and that it requires an entirely different toolset, as the existing one becomes quickly and increasingly inadequate.

Twelve years have elapsed since the introduction in 2006 of public cloud services by Amazon under the Amazon Web Services (AWS) brand. As of 2015, according to RightScale's State of the Cloud Report (RightScale 2015), cloud adoption has become ubiquitous: 93% of businesses are at least experimenting with cloud deployments, with 82% adopting a hybrid strategy, which combines the use of multiple providers (usually in a public‐private configuration). However, much of the technology transition is still ahead, as 68% of enterprises have less than 20% of their application portfolio running in a cloud setup. Similarly, (Gartner 2014) predicts that another two to five years will be needed before cloud computing reaches the "plateau of productivity," which marks mass mainstream adoption and widespread productivity gains.

Unsurprisingly, cloud forensics is still in its infancy despite dozens of articles in the literature over the last five years; there is a notable dearth of practical technical solutions on the analysis of cloud evidence. Thus, much of our discussion will necessarily tilt toward identifying new challenges and general approaches as opposed to summarizing existing experiences, of which there are few.

15.1.2 The New Forensics Landscape

Most cloud forensics discussions start with the false premise that, unless the current model of digital forensic processing is directly and faithfully reconstructed with respect to the Cloud, we are bound to lose all notions of completeness and integrity. The root of this misunderstanding is the use of the traditional desktop‐centric computational model that emerged in the 1990s as the point of reference. (This approach has been subsequently tweaked to work for successive generations of ever‐more‐mobile client devices.)

The key attribute of this model is that practically all computations take place on the device itself. Applications are monolithic, self‐contained pieces of code that have immediate access to user input and consume it instantly with (almost) no trace left behind; periodically, the current state is saved to stable storage. Since a big part of forensics is attributing the observed state of the system to user‐triggered events, we (forensic researchers and tool developers) have obsessively focused on two driving problems: discover every little piece of log/timestamp information, and extract every last bit of discarded data that applications and the operating system (OS) leave behind, either for performance reasons, or just plain sloppiness. Courtesy of countless hours spent on painstaking reverse engineering, we have become quite good at these tasks.

Indeed, our existing toolset is almost exclusively built to feast upon the leftovers of computations – an approach that is becoming more challenging even in traditional (non‐cloud) cases. For example, file carving of acquired media (Richard and Roussev 2005) only exists because it is highly inefficient for the OS to sanitize the media. However, for solid state disk (SSD) devices, the opposite is true – they need to be prepared before reuse. The result: deleted data is sanitized, and there is little left to carve (King and Vidas 2011).

The very notion of low‐level physical acquisition is reaching its expiration date even from a purely technological perspective – the current generation of high‐capacity hard disk drives (HDDs) (8 TB+) uses a track‐shingling technique and has its very own Acorn RISC Machine (ARM) based processor, which is tasked with identifying hot and cold data and choosing an appropriate physical representation for it. The HDD device exposes an object store interface (not unlike key‐value databases) that effectively makes physical acquisition, in a traditional sense, impossible – legacy block‐level access is still supported, but the block identifiers and physical layout are no longer coupled as they were in prior generations of devices. By extension, the feasibility of most current data‐recovery efforts, such as file carving, will rapidly diminish.

Mass hardware disk encryption is another problem worth mentioning, as it is becoming increasingly necessary and routine in IT procedures. This is driven both by the fact that there is no observable performance penalty, and by the need to effectively sanitize ever‐larger and ‐slower HDDs. The only practical solution to the latter is to always encrypt and dispose of the key when the disk needs to be reclaimed.

In sum, the whole concept of acquiring a physical image of the storage medium is increasingly technically infeasible and is progressively less relevant because interpreting the physical image requires understanding of the (proprietary) internals of the device's data structures and algorithms. The inevitable conclusion is that forensic tools will have to increasingly rely on the logical view of the data presented by the device.

Logical evidence acquisition and processing will also be the norm in most cloud investigations, and it will be performed at an even higher level of abstraction via software‐defined interfaces. Conceptually, the main difference between cloud computing and client‐side computing is that most of the computation and, more importantly, the application logic executes on the server, with the client effectively becoming a remote terminal for collecting user input (and environment information) and for displaying the results of the computation.

Another consequential trend is the way cloud‐based software is developed and organized. Instead of one monolithic piece of code, the application logic is almost always decomposed into several layers and modules that interact with each other over well‐defined service interfaces. Once the software components and their communication are formalized, it becomes quite easy to organize extensive logging of all aspects of the system. Indeed, it becomes necessary to have this information just to be able to test, debug, and monitor cloud‐based applications and services. Eventually, this will end up helping forensics tremendously as important stages of computation are routinely logged, with user input being both the single most important source of events and the least demanding to store and process.

Returning to our driving problem of analyzing cloud artifacts, it should be clear that – in principle – our task of forensically reconstructing prior events should be becoming easier over time as a growing fraction of the relevant information is being explicitly recorded. As an example, consider that, with a modest protocol reverse‐engineering effort, (Somers 2014) – later expanded by (Roussev and McCulley 2016) – was able to demonstrate that Google Docs timestamps and logs every single user keystroke. The entire history of a document can be replayed with a simple piece of JavaScript code; in fact, the log is the primary representation of the document, and the current (or prior) state of the document is computed on the fly by replaying (part of) the history. From a forensics perspective, this is a time‐travel machine and is practically everything we could ask for in terms of evidence collection, with every user event recorded with microsecond accuracy.
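
To make the mechanism concrete, the following minimal Python sketch illustrates how such a revision log could be pulled for offline examination. The endpoint shape, parameters, and cookie handling are assumptions based on Somers's description of the internal, undocumented Docs protocol and may change without notice; the document ID and session cookie are placeholders.

```python
# Illustrative sketch only: the revision-log endpoint is an internal,
# undocumented Google Docs interface (as described by Somers), so the URL
# shape and parameters below are assumptions and may change at any time.
import requests

DOC_ID = "YOUR_DOCUMENT_ID"          # hypothetical placeholder
COOKIES = {"SID": "..."}             # an authenticated browser session is assumed

# The per-keystroke history is served as a JSON-like "changelog" covering a
# revision range; start/end bound the slice of history requested.
url = (f"https://docs.google.com/document/d/{DOC_ID}/revisions/load"
       f"?id={DOC_ID}&start=1&end=2000")

resp = requests.get(url, cookies=COOKIES)
resp.raise_for_status()

# Google prefixes JSON responses with an anti-XSSI guard line (")]}'"),
# which must be stripped before the changelog can be parsed.
body = resp.text.lstrip(")]}'\n")
print(body[:500])   # raw changelog: a list of timestamped editing operations
```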

15.1.3 Adapting to the New Landscape

According to (Ruan et al. 2013), more forensic examiners see cloud computing as making forensics harder (46%) than easier (37%). However, the more revealing answers come from analyzing the reasoning behind these responses. The most frequent justifications for the harder answer fall into two categories: (i) restricted access to data (due to lack of jurisdiction and/or cooperation from the cloud provider), and (ii) inability to physically control the process of evidence acquisition/recovery, and to apply currently standard processing techniques. While data access is primarily a nontechnical issue – to be solved by appropriate legal and contractual means – the natural inclination to attempt to apply legacy techniques will take some time and new tools to resolve.

All of these concerns merely emphasize that we are in a transitional period during which requirements, procedures, and tools are still being fleshed out. From a technical perspective, the simplest way to appreciate why forensics professionals lack any cloud‐specific tools is to consider the current dearth of tools that support the investigation of server deployments. Practically all integrated forensic environments, both commercial and open source, focus on the client and the local execution model discussed earlier. However, a server‐side inquiry is an exercise in tracking down and correlating the available logs – a task that, today, is accomplished largely in an ad hoc manner using custom scripts and other “homegrown” solutions. Investigating a cloud application is not unlike performing a server‐side investigation, with a lot more log data and more complex data relationships.

From a research perspective, it may appear that log analysis does not present a very exciting prospect for the future. However, the real challenge will come from the sheer volume and variety of logs and the need to develop ever‐more‐intelligent tools to process and understand them. The long‐term upside – and, likely, necessity – is that we can, for the first time, see a plausible path toward substantially higher levels of automation in forensics. At present, the process of extracting bits and pieces of artifacts and putting them together in a coherent story is too circuitous and vaguely defined for it to be further automated to a meaningful degree. Once log data becomes the primary source of evidence, it is conceivable that a substantial number of forensic questions could be formalized as data queries to be automatically answered, complete with formal statistical error estimates.

The transition from artifact‐centric to log‐centric forensics will take some time – artifact‐analysis methods will always be needed, but they will be applied much more selectively. Nevertheless, it is a relatively safe prediction that, within a decade, this transition will fundamentally change forensic computing and the nature of investigative work.

15.1.4 Summary

Cloud computing presents a qualitatively new challenge for the existing set of forensic tools, which are focused on analyzing leftover artifacts on local client devices. Cloud services are inherently distributed and server‐centric, which renders ineffective large parts of our present acquisition and analysis toolchain. In this transitional period, forensic analysts are struggling to bridge the gap using excess manual technical and legal effort to fit the new data sources into the existing pipeline.

Over the medium‐to‐long term, the field requires an entirely new set of cloud‐native forensic capabilities that work in unison with cloud services. Once such tools are developed and established, this will open up the opportunity to define more formally legal and contractual requirements that providers can satisfy at a reasonable cost. As soon as technical and legal issues are worked out, we will face the unprecedented opportunity of automating and scaling up forensic analysis to a degree that is currently not possible.

The rest of this chapter is organized as follows: Section 15.2 presents the necessary background on cloud computing service models, followed by Section 15.3, which discusses the current approaches for each model. Section 15.4 presents potential approaches and solutions as a way forward to address forensics in cloud computing. Sections 15.5 and 15.6 present a detailed discussion and the conclusions, respectively.

15.2 Background

Cloud computing services are commonly classified into one of three canonical models – Software‐as‐a‐Service (SaaS), Platform‐as‐a‐Service (PaaS), and Infrastructure‐as‐a‐Service (IaaS) – and we use this split as a starting point in our discussion. We should note, however, that practical distinctions are often less clear‐cut, and a real IT cloud solution (and a potential investigative target) can incorporate elements of all of these. As illustrated in Figure 15.1, it is useful to break down cloud computing environments into a stack of layers (from lower to higher): hardware such as storage and networking; virtualization, consisting of a hypervisor allowing the installation of virtual machines (VMs); the operating system installed on each VM; middleware and runtime environment; and application and data.

Figure 15.1 Layers of the cloud computing environment owned by the customer and cloud service provider on three service models: IaaS, PaaS, and SaaS (public cloud).

In a private (cloud) deployment, the entire stack is hosted by the owner, and the overall forensic picture is very similar to the problem of investigating a non‐cloud IT target. Data ownership is clear, as is the legal and procedural path to obtain it; indeed, the very use of the term cloud is mostly immaterial to forensics; therefore, we will not discuss this case any further.

In a public deployment, the SaaS/PaaS/IaaS classification becomes important as it indicates the ownership of data and service responsibilities. Figure 15.1 shows the typical ownership of layers by customer and service provider on different service models; Table 15.1 presents the examples of commercial products of the cloud service models.

Table 15.1 Examples of some popular commercial products based on the cloud service models: Software‐as‐a‐Service, Platform‐as‐a‐Service, and Infrastructure‐as‐a‐Service.

Software‐as‐a‐Service    Platform‐as‐a‐Service    Infrastructure‐as‐a‐Service
Google Gmail             Apprenda                 Amazon Web Services
Microsoft 365            Google App Engine        Microsoft Azure
Salesforce                                        Google Compute Engine
Citrix GoToMeeting
Cisco WebEx

In hybrid deployments, layer ownership can be split between the customer and the provider and/or across multiple providers. Further, it can change over time as, for example, the customer may handle the base load on owned infrastructure, but burst to the public cloud to handle peak demand or system failures.

15.2.1 Software‐as‐a‐Service (SaaS)

In this model, cloud service providers (CSPs) own all the layers including the application layer that runs the software offered as a service to customers. In other words, the customer has only indirect and incomplete control (if any) over the underlying operating infrastructure and applications (in the form of policies). However, since the CSP manages the infrastructure (including the application), maintenance costs on the customer side are substantially reduced. Google Gmail/Docs, Microsoft 365, Salesforce, Citrix GoToMeeting, and Cisco WebEx are popular examples of SaaS, which run directly from the web browser without downloading and installing any software. Their desktop and smartphone versions are also available to run on the client machine. The applications have a varying but limited presence on the client machine, making the client an incomplete source of evidence; therefore, investigators need access to server‐side logs to paint a complete picture.

SaaS applications log extensively, especially when it comes to user‐initiated events. For instance, Google Docs records every insert, update, and delete operation of characters performed by the user along with timestamps, which makes it possible to identify specific changes made by different users in a document (Somers 2014). Clearly, such information is a treasure trove for a forensic analyst and is a much more detailed and direct account of prior events than is typically recoverable from a client device.

15.2.2 Platform‐as‐a‐Service (PaaS)

In the PaaS service model, customers develop their applications using software components built into middleware. Apprenda (https://apprenda.com) and Google App Engine (https://cloud.google.com/appengine/pricing) are popular examples of PaaS, offering quick and cost‐effective solutions for developing, testing, and deploying customer applications. In this case, the cloud infrastructure hosts customer‐developed applications and provides high‐level services that simplify the development process. PaaS gives customers full control over the application layer, including the interaction of applications with their dependencies (such as databases and storage), which enables customers to perform extensive logging for forensics and security purposes.

15.2.3 Infrastructure‐as‐a‐Service (IaaS)

In IaaS, the CSP is the party managing the VMs; however, this is done in direct response to customer requests. Customers then install the OS and applications within the machine without any interference from the service providers. AWS, Microsoft Azure, and Google Compute Engine (GCE) are popular examples of IaaS. IaaS provides capabilities to take snapshots of the disk and physical memory of VMs, which has significant forensic value for quick acquisition of disk and memory. Since VMs support the same data interfaces as physical machines, traditional forensic tools for data acquisition and analysis can also be used inside the VMs, much as they are used in the remote investigation of a physical machine. Furthermore, VM introspection provided by a hypervisor enables CSPs to examine live memory and disk data, and to perform instant data acquisition and analysis. However, since this functionality is supported at the hypervisor level, customers cannot take advantage of it.
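
As an illustration, the following minimal sketch shows how a disk snapshot might be requested through an IaaS management API, using AWS and the boto3 library as an example; the region, volume ID, and case label are placeholders, and any real acquisition would also need to follow the provider's and the investigator's evidence‐handling procedures.

```python
# A minimal sketch of disk acquisition through the IaaS management plane,
# using AWS/boto3 as an example; the volume ID and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a point-in-time snapshot of the suspect instance's EBS volume.
snap = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",               # hypothetical volume ID
    Description="Forensic snapshot - case 2018-042"  # hypothetical case label
)
snapshot_id = snap["SnapshotId"]

# Block until the snapshot is complete before relying on it as evidence.
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])
print("Snapshot ready:", snapshot_id)
```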

In summary, we can expect SaaS and PaaS investigations to have a high dependency on logs, since disk and memory image acquisition is difficult to perform due to lack of control over middleware, OSs, and lower layers. In IaaS, the customer has control over the OS and upper layers, which makes it possible to acquire disk and memory images and perform traditional forensic investigations.

15.3 Current Approaches

The different architectural models shown in Figure 15.1 imply the need for a differentiated approach to building forensic tools for each of them. An additional dimension to this challenge is that implementations of the same class of service (e.g. IaaS) can vary substantially across providers. Moreover, providers could be using and/or reselling other providers' services, making the task of physically acquiring the source data impractically complicated, or even intractable.

Over time, we can expect the large (and still growing) number of cloud service types and implementations to naturally coalesce into a smaller set of de facto standards, which may eventually provide some needed visibility into the provider's operations. In the meantime, cloud forensics research is likely best served by focusing on the information available at the subscriber‐provider boundary interface. The key observation is that providers must collect and retain substantial amounts of log information for accounting and operational purposes. For example, an IaaS provider must have detailed records of VM operations (launch, shutdown, CPU/network/disk usage), assignment of IP addresses, changes to firewall rules, long‐term data storage requests, and so on. Such data should be readily available, segregated by subscriber, and, therefore, readily obtainable via the legal process. However, invasive requests for physical media and information that cut to the core of a provider's operation are highly unlikely to succeed in the general case.
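
As a concrete illustration of the kind of operational records available at this boundary, the following hedged sketch queries AWS CloudTrail for VM launch events using boto3; the region, date range, and event‐name filter are illustrative assumptions rather than a prescribed procedure.

```python
# A sketch of retrieving VM lifecycle records at the subscriber-provider
# boundary, using AWS CloudTrail as an example; dates and the event-name
# filter are illustrative.
from datetime import datetime
import boto3

ct = boto3.client("cloudtrail", region_name="us-east-1")

resp = ct.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName",
                       "AttributeValue": "RunInstances"}],
    StartTime=datetime(2018, 1, 1),
    EndTime=datetime(2018, 6, 30),
)

for event in resp["Events"]:
    # Each record identifies who launched which VM instance, and when.
    print(event["EventTime"], event.get("Username"), event["EventName"])
```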

15.3.1 SaaS Forensics

Cloud customers access SaaS applications such as Google Gmail, Google Docs, and Microsoft 365 through a web interface or desktop application from their personal computing devices such as laptop/desktop computers and smartphones. The applications maintain a detailed history/log of user inputs that is accessible to customers and has forensic value. Since applications are accessed from a client machine, remnants of digital artifacts pertaining to user activities in the applications are usually present and can be forensically retrieved and analyzed from the hard disk of the machine.

15.3.1.1 Cloud‐Native Application Forensics

Perhaps the very first cloud‐native tool with forensics applications is Draftback (https://draftback.com): a browser extension created by the writer and programmer James Somers, which can replay the complete history of a Google Docs document (Somers 2014). The primary intent of the code is to give writers the ability to look over their own shoulder and analyze how they write. Coincidentally, this is precisely what a forensic investigator would like to be able to do – rewind to any point in the life of a document, right to the very beginning.

In addition to providing in‐browser playback of all the editing actions – in either fast‐forward or real‐time mode – Draftback provides an analytical interface that maps the time of editing sessions to locations in the document (Figure 15.2). This can be used to narrow down the scope of inquiry for long‐lived documents.

Figure 15.2 Draftback analytical interface, showing the timeline of editing activity across six weekly sessions (Saturday 4/5/2014 through Saturday 5/10/2014), with vertical markers indicating edits within each session.

Somers's work, although not motivated by forensics, is probably the single best example of SaaS analysis that does not rely on trace data resident on the client – all results are produced solely by reverse engineering the web application's simple data protocol. Assuming that an investigator is in possession of valid user credentials (or such are provided by Google under legal order), the examination can be performed on the spot: any spot with a browser and an Internet connection.

The most profound forensic development here is that, as far as Google Docs is concerned, there is no such thing as deletion of data; every user editing action is recorded and timestamped. Indeed, even the investigator cannot spoil the evidence, because any actions on the document will simply be added to the editing history with the appropriate timestamp. This setup is likely to be sufficient for most informal/internal scenarios; however, in order to make it to the courtroom, some additional tooling and procedures will need to be developed.

Clearly, the acquisition and long‐term preservation of the evidence is yet to be addressed. The data component is the low‐hanging fruit because the existing replay code can be modified to produce a log of the desired format. The challenging part is preserving the application logic that interprets the log; unlike client‐side applications, a web app's code is split between the client and the server, and there is no practical way to acquire an archival copy of the execution environment as of a particular date. Since the data protocol is internal, there is no guarantee that a log acquired now could be replayed years later.

One possible solution is to produce a screencast video of the entire replayed user session. The downside here is that most documents are bigger than a single screen, so the video would have to be accompanied by periodic snapshots of the actual document (e.g. in PDF). Another approach would be to create a reference text‐editor playback application in which the logs could be replayed. This would require extra effort and faces questions of how closely the investigated application can be emulated by the reference one.
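
The following minimal sketch illustrates the reference‐playback idea: a neutral tool replays a log of timestamped insert/delete operations to reconstruct the document text at any chosen point. The operation format shown is hypothetical and much simpler than Google's internal changelog; it is meant only to convey the approach.

```python
# A minimal sketch of a "reference playback" tool: it replays a log of
# editing operations to reconstruct the document state at any point in time.
# The operation format here is hypothetical, not Google's internal one.

def replay(operations, stop_at=None):
    """Rebuild document text by applying timestamped insert/delete ops."""
    text = ""
    for op in operations:
        if stop_at is not None and op["ts"] > stop_at:
            break                      # stop early to view an intermediate state
        if op["type"] == "insert":
            i = op["pos"]
            text = text[:i] + op["chars"] + text[i:]
        elif op["type"] == "delete":
            text = text[:op["start"]] + text[op["end"]:]
    return text

# Example log: each entry carries a timestamp, as the Docs changelog does.
log = [
    {"ts": 1, "type": "insert", "pos": 0, "chars": "Hello wrld"},
    {"ts": 2, "type": "delete", "start": 7, "end": 10},
    {"ts": 3, "type": "insert", "pos": 7, "chars": "orld"},
]
print(replay(log))             # -> "Hello world"
print(replay(log, stop_at=2))  # -> "Hello w"
```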

The next logical question is, how common is fine‐grain user logging among cloud applications? As one would expect, a number of other editing applications from Google's Apps for Work suite – such as Sheets, Slides, and Sites – use a very similar model and provide scrupulously detailed history (Slides advertises “unlimited revision history”; https://www.google.com/work/apps/business/products/slides). A deeper look reveals that the particular history available for a document depends on the age of the file and/or the size of the revisions, as revisions may be merged to save storage space (https://support.google.com/docs/answer/95902). In other words, Google starts out with a detailed version of the history; over time, an active document's history may be trimmed and partially replaced with snapshots of its state. It appears that the process is primarily driven by technical and usability considerations, not policy.

Microsoft's Office 365 service also maintains detailed revisions based on edits, in part because, like Google, it supports real‐time collaboration. Zoho is another business suite of online apps that supports detailed history via a shared track‐changes feature similar to Microsoft Word's. Most cloud drive services offer generic file‐revision history; however, the available data is more limited because it is stored as simple snapshots.

15.3.1.2 Cloud Drive Forensics

Cloud users access their cloud storage through personal computing devices such as laptop/desktop computers and smartphones. (Chung et al. 2012) suggest that the traces of services present in client devices can be helpful in investigating criminal cases, particularly when CSPs do not provide the cloud server logs in order to protect their clients' privacy. An investigation of cloud storage identifies user activities from the subscription to the service until the end of its use. The authors analyze four cloud storage services (Amazon S3, Google Docs, Dropbox, and Evernote) and report that the services may create different artifacts depending on their specific features. The authors propose a process model for forensic investigation of cloud storage services. The model combines the collection and analysis of artifacts of cloud storage services from both personal computers and smartphones. It suggests acquiring volatile data from personal (Mac/Windows) computers (if possible) and then gathering data from the Internet history, log files, files, and directories. On Android phones, rooting is performed to gather data; on iPhones, data is gathered via iTunes, along with the iTunes backup files stored on personal computers. The analysis then checks for the traces of a cloud storage service that exist in the collected data.

(Hale 2013) discusses the digital artifacts left behind after an Amazon Cloud Drive has been accessed or manipulated from a computer. Amazon's cloud storage service allows users to upload and download files from any location, without having a specific folder on a local hard drive to sync with the cloud drive. The user uses a desktop application or online web interface to transfer selected files or folders to/from the cloud drive. The online interface is similar in appearance to Windows Explorer, with Upload, Download, and Delete buttons to perform actions on files and folders. The desktop application also provides a drag‐and‐drop facility. Hale analyzes the cloud drive by accessing and manipulating the drive's content via the desktop application and the online web interface. He found artifacts of the online interface in the web browser history and cache files. The desktop application left artifacts in the Windows registry, application installation files in the default location, and a SQLite database used by the application to hold transfer tasks (i.e. upload/download) while the task's status was pending.
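
A sketch of how such a task database might be examined is shown below; the file name, table, and column names are hypothetical stand‐ins, since the actual Amazon Cloud Drive schema is not reproduced here.

```python
# A sketch of examining the kind of SQLite task database Hale describes;
# the file name, table, and column names below are hypothetical stand-ins,
# not the actual Amazon Cloud Drive schema.
import sqlite3

conn = sqlite3.connect("ADriveNativeClientService.sqlite")  # hypothetical path
conn.row_factory = sqlite3.Row

# List transfer tasks (uploads/downloads) that were still pending when the
# application last ran - each row ties a local path to a cloud object.
for row in conn.execute(
        "SELECT task_id, direction, local_path, created_at, status "
        "FROM transfer_tasks WHERE status = 'pending' ORDER BY created_at"):
    print(dict(row))
conn.close()
```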

(Quick and Choo 2013) discuss the digital artifacts of Dropbox after a user has accessed and manipulated Dropbox content. Dropbox is a file‐hosting service enabling users to store and share files and folders; it is accessed through a web browser or client software. The authors use hash analysis and keyword searches to determine whether the Dropbox client software has been used. They determine the Dropbox username from the browser history (of Mozilla Firefox, Google Chrome, or Microsoft Internet Explorer) and establish the use of Dropbox through several avenues such as directory listings, prefetch files, link files, thumbnails, the registry, the browser history, and memory captures.

(Martini and Choo 2013) discuss the digital artifacts of ownCloud on both the server and client side. ownCloud is a file‐sync and file‐share software application configured and hosted on the server. It provides client software and a web interface to access files on the server. The authors recover artifacts including sync and file‐management metadata (such as logging, database, and configuration data), cached files describing files the user has stored on the client device and uploaded to the cloud environment or vice versa, and browser artifacts.

15.3.1.3 Building New Tools for SaaS Forensics

As our review of existing work demonstrates, research and development efforts have been focused on the traditional approach of finding local artifacts on the client. This is an inherently limited and inefficient approach requiring substantial reverse‐engineering effort; the future of SaaS forensics lies in working with the web infrastructure the way web applications do: through application programming interfaces (APIs).

Practically all user‐facing cloud services strive to be platforms (i.e. they provide an API) in order to attract third‐party developers who build apps and extensions enhancing the base product. The most relevant example is the wide availability of backup products, such as those provided by EMC's Spanning (http://spanning.com/products), which provides scheduled cloud‐to‐cloud backup services on multiple platforms – Google Apps, Salesforce, and Office 365. Otixo (www.otixo.com) is another service that provides single sign‐on and data‐transfer capability across more than 20 different SaaS platforms.

The latter provides a clear demonstration that the systematic acquisition of SaaS data can be readily accomplished via the provided APIs. At present, the amount of data in cloud storage is small relative to the size of local storage – free services offer only up to 15GB. However, as businesses and consumers get comfortable with the services, and more willing to pay for them, we can expect a fast and substantial increase in volume. For example, Google Drive currently offers up to 30TB at $10/month per terabyte.

This development is likely to blur the line between acquisition and analysis. As full acquisition becomes too burdensome and impractically slow, one logical development would be to use the search interface offered by cloud service APIs to narrow down the scope of the acquisition. Procedurally, this aligns well with what already takes place during e‐discovery procedures.
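
A brief sketch of such search‐scoped, API‐driven acquisition is shown below, using the Google Drive v3 API via google-api-python-client; the stored OAuth credentials file and the query terms are assumptions supplied for illustration.

```python
# A sketch of API-driven, search-scoped acquisition using the Google Drive v3
# API (google-api-python-client); "token.json" is a placeholder for previously
# obtained OAuth credentials, and the query string is illustrative.
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

creds = Credentials.from_authorized_user_file("token.json")
drive = build("drive", "v3", credentials=creds)

# Server-side search narrows acquisition to case-relevant files only.
hits = drive.files().list(
    q="fullText contains 'invoice' and modifiedTime > '2018-01-01T00:00:00'",
    fields="files(id,name,mimeType,modifiedTime)"
).execute()

for f in hits.get("files", []):
    # Enumerate the revision history of each hit for later acquisition.
    revs = drive.revisions().list(
        fileId=f["id"], fields="revisions(id,modifiedTime)").execute()
    print(f["name"], f["mimeType"], len(revs.get("revisions", [])), "revisions")
```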

Looking further ahead, as cloud storage grows, it will be increasingly infeasible to acquire data by downloading it over the Internet. This will bring a new impetus to the development of forensics‐as‐a‐service (FaaS), because the only practical means to perform the processing in a reasonable amount of time would be to collocate the data and the forensic computation in the same data center. Clearly, the practical use of FaaS is some years away, and a nontrivial number of legal and regulatory issues would have to be addressed beforehand. Nonetheless, the fact that storage‐capacity growth constantly outpaces network‐bandwidth growth inexorably leads to the need to move the computation close to the data. Forensic computation will be no exception, and procedures will have to be adjusted to account for technological realities.

15.3.2 PaaS/IaaS Forensics

PaaS packages middleware platforms on top of the OS. Commercial PaaS products are available, such as Google App Engine. However, the forensic research community has not paid much attention to PaaS, which is evident from the lack of research papers on the topic. On the other hand, IaaS has received attention because, of the three service models, it most closely resembles traditional computing infrastructure: it has VMs running contemporary OSs and software, similar to physical machines. Conventional forensic tools for data acquisition and analysis, such as WinDD, Forensic Toolkit (FTK), and EnCase, can be used on VMs. Furthermore, IaaS offers unique features such as VM snapshotting for quick physical memory and HDD acquisition.

(Dykstra and Sherman 2012) evaluated the effectiveness of EnCase, FTK, and three physical‐memory acquisition tools (HBGary's FastDump, Mandiant's Memoryze, and FTK Imager) in a cloud computing environment. They remotely acquired geographically dispersed forensic evidence over the Internet and tested their success at gathering evidence, the time to do so, and the trust required. They used a public cloud (Elastic Compute Cloud [EC2] from AWS) for their experiments and concluded that forensic tools are technically capable of remote data acquisition. They illustrated IaaS in six layers (from lowest to highest): network, physical hardware, host OS, virtualization, guest OS, and guest application/data. Each layer (except the last) has to trust all the lower layers. Given the trust requirements, the authors note that technology alone is insufficient to produce trustworthy data and solve the cloud forensics acquisition problem. They recommend a management plane enabling consumers to manage and control virtual assets through an out‐of‐band channel interfacing with the cloud infrastructure, such as the AWS Management Console. The management plane interfaces with the provider's underlying file system and hypervisor, and is used to provision, start, and stop VMs.

(Dykstra and Sherman 2013) developed a tool called FROST, which integrates forensic capabilities with OpenStack. FROST uses a management plane through a website and APIs, and collects data from the host OS level outside the guest VMs, assuming that the hardware, host OS, hypervisor, and cloud employees are trusted. OpenStack creates a directory on the host OS containing the virtual disk, RAMdisk, and other host‐specific files. FROST retrieves the files from the host and transforms the virtual disk format to raw using the available utilities. For instance, QEMU provides utilities to convert QEMU QCOW2 images to raw format.
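
For illustration, the conversion step described above can be reproduced with the standard qemu-img utility; the following short sketch simply wraps that call, with placeholder file names, and assumes qemu-img is installed on the examination host.

```python
# A small sketch of the conversion step described above: turning a QCOW2
# virtual disk retrieved from the host into a raw image that standard
# forensic tools can ingest. Paths are placeholders.
import subprocess

src = "instance-0001.qcow2"   # virtual disk copied from the OpenStack host
dst = "instance-0001.raw"     # raw image for conventional forensic tools

subprocess.run(
    ["qemu-img", "convert", "-f", "qcow2", "-O", "raw", src, dst],
    check=True,
)
print("Converted", src, "->", dst)
```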

15.4 Proposed Comprehensive Approaches

At present, there are no dedicated deployable solutions that can handle cloud forensics tasks and evidence at the same level of comprehensiveness as the current integrated forensic suites already discussed. The key reasons are of a procedural and technical nature. In many respects, evidence procedures have been – until now – relatively easy to establish, because the physical location and ownership of the hardware, software, and data were closely tied and readily accessible. In turn, this allowed many legal evidence procedures to be directly translated to the digital world without having to rethink the rules. Cloud services break this model and will, eventually, require that legal procedures evolve to catch up with technology development. Technical challenges arise from the overall shift from client‐centric to server‐centric computation, which breaks many of the assumptions of traditional digital forensics and makes applying the currently prevalent processing workflow problematic.

Thus, efforts to address the challenges posed by cloud forensics in a general way have taken one of two approaches. Work coming from digital forensics researchers tends to heavily favor the procedural approach, which assumes that we need (primarily) new acquisition processes, but the toolset and investigative process will mostly remain the same. Often, it is assumed that new legal obligations will be placed on service providers and/or tenants.

The alternative is to look for new technical solutions; these need not be strictly forensics in origin, but may address related problems in auditing, security, and privacy. In essence, this would acknowledge the concept that cloud‐based software works differently, and that we need a completely new toolset to perform effective and efficient forensic analysis on cloud systems.

15.4.1 Procedural Expansion of Existing Forensic Practices

(Martini and Choo 2012) is a representative example of the approach embraced in the digital forensics literature – it seeks to identify the problems but has little in the way of technical detail. The focus is on minor refinements of the accepted procedural models of forensics, such as (McKemmish 1999) and (Kent et al. 2006). Further, the authors prescribe a six‐step process to get the final version of the new framework: conceptual framework, explication interviews, technical experiments, framework refinement, validation interviews, finalized framework.

It is entirely possible that digital forensics researchers and practitioners, as a community, will adopt such a deliberate, time‐consuming process to solve the problem; however, looking at the history of digital forensics, this seems highly unlikely. Experience tells us that the way new practices get established is much more improvised during disruptive technological transitions. Once enough experience, understanding, and acceptance of the new forensic techniques are gained, practices and procedures undergo revisions to account for the new development.

An instructive example in that regard is main‐memory forensics. As recently as 10 years ago, best practices widely prescribed pulling the plug (literally) on any computer found running during search and seizure operations. This was not entirely unreasonable, because dedicated memory forensics tools did not exist, so there was no extra evidence to be gained. Today, we have highly sophisticated tools to acquire and analyze memory images (Ligh et al. 2014), and they are the only key to solving many investigative scenarios. Accordingly, training and field manuals have been rewritten to point to the new state of knowledge.

To summarize, by their very nature, standards lag technical development and merely incorporate the successful innovations into the canon of best practices. In our view, it is critically important to understand that we need new technical solutions, and no amount of procedural framework enhancement will bring them about. Technical advances are achieved by extensive and deliberate research and experimentation and often borrow and adapt methods from other fields.

15.4.2 API‐Centric Acquisition and Processing

In traditional forensic models, the investigator works with physical evidence carriers, such as storage media or integrated computing devices. Thus, it is possible to identify the computer performing the computations and the media that store (traces of) processing, and to physically collect, preserve, and analyze information content. Because of this, research has focused on discovering and acquiring every little piece of log and timestamp information, and extracting every last bit of discarded data that applications and the OS have left behind.

Conceptually, cloud computing breaks this model in two major ways. First, resources – CPU cycles, RAM, storage, etc. – are first pooled (e.g. Redundant Array of Independent Disks [RAID] storage) and then allocated at a fine granularity. This results in physical media potentially containing data owned by many users, and to data relevant to a single case being spread among numerous providers. Applying the conventional model creates a long list of procedural, legal, and technical problems that are unlikely to have an efficient solution in the general case. Second, both computations and storage contain a much more ephemeral record because VM instances are created and destroyed with regularity and working storage is sanitized.

As we discussed in the prior section, current work on cloud storage forensics has treated the problem as just another instance of application forensics. It applies differential analysis techniques to gain a basic understanding of the artifacts left on client devices by taking before and after snapshots of the target compute system, and deducing relevant cause‐and‐effect relationships. During an actual investigation, the analyst would then interpret the state of the system based on these known relationships.
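
A minimal sketch of this differential approach is shown below: it hashes every file in a "before" and an "after" snapshot of a directory tree and reports what was created, deleted, or modified in between. The snapshot directory names are placeholders.

```python
# A minimal sketch of before/after differential analysis: hash every file in
# two snapshots of a client system (or application profile directory) and
# report what was created, deleted, or modified in between.
import hashlib
from pathlib import Path

def snapshot(root):
    """Map each relative file path under root to its SHA-256 digest."""
    result = {}
    for p in Path(root).rglob("*"):
        if p.is_file():
            result[str(p.relative_to(root))] = hashlib.sha256(p.read_bytes()).hexdigest()
    return result

def diff(before, after):
    created  = sorted(set(after) - set(before))
    deleted  = sorted(set(before) - set(after))
    modified = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return created, deleted, modified

# "snap_before" and "snap_after" are placeholder directories captured before
# and after exercising the cloud-drive client.
created, deleted, modified = diff(snapshot("snap_before"), snapshot("snap_after"))
print("created:", created)
print("deleted:", deleted)
print("modified:", modified)
```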

Unfortunately, there are several serious problems with this extension of existing client‐side methods:

  • Completeness: The reliance on client‐side data can leave out critical case data. One example is older versions of files, which most services provide; another is cloud‐only data, such as a Google Docs document, which literally has no serialized local representation other than a link. Some services, such as a variety of personal information management apps, live only in the browser, so a flush of the cache would make them go away.
  • Reproducibility: Because cloud storage apps are updated on a regular basis and versions have a relatively short life‐span, it becomes harder to maintain the reproducibility of the analysis and may require frequent repetition of the original procedure.
  • Scalability: As a continuation of the prior point, manual client‐side analysis is burdensome and simply does not scale with the rapid growth of the variety of services and their versions.

We have performed some initial work using an alternative approach for the acquisition of evidence data from cloud storage providers: one that uses the official APIs provided by the services. Such an approach addresses most of the shortcomings just described:

  • Correctness: APIs are well‐documented, official interfaces through which cloud apps on the client communicate with the service. They tend to change slowly, and changes are clearly marked – only new features need to be incrementally incorporated into the acquisition tool.
  • Completeness/reproducibility: It is easy to demonstrate completeness (based on the API specification), and reproducibility becomes straightforward.
  • Scalability: There is no need to reverse‐engineer the application logic. Web APIs tend to follow patterns, which makes it possible to adapt existing code to a new (similar) service with modest effort. It is often feasible to write an acquisition tool for a completely new service from scratch in a short time.

We have developed a proof‐of‐concept prototype called kumodd (Roussev et al. 2016) that can perform complete (or partial) acquisition of a cloud storage account's data. It works with four popular services – Dropbox, Box, Google Drive, and Microsoft's OneDrive – and supports the acquisition of revisions and cloud‐only documents. The prototype is written in Python and offers both command‐line and web‐based user interfaces.

15.4.3 Audit‐Centric Forensic Services

The move to cloud computing raises a number of security, privacy, and audit problems among the parties involved – tenants, (multiple) cloud providers, and cloud brokers. The only practical means to address them is for all to have a trustworthy history – a log – of all the relevant events in the computation. Such logs are created on the boundary between clients and servers, as well as during the normal operation of services, typically in response to client requests.

For example, a tenant providing SaaS for medical professionals will need to convince itself and its auditors that it is complying with all relevant privacy and audit regulations. Since the computation executes on the provider's platform, the tenant needs the assurances of a third party, such as a trusted logging service.

In many cases, the same log information collected for other purposes can directly answer forensic queries, such as the history of user input over time. It is also likely to provide much greater detail than is currently available on the client. For example, Google Cloud Storage (https://cloud.google.com/storage/docs/access‐logs) maintains access logs in comma‐separated values (CSV) format containing information about the access requests made on the cloud storage area allocated to the user. Tables 15.2 and 15.3 present the lists of fields, and their descriptions, maintained in the log files on Google Cloud Storage and AWS, respectively.

Table 15.2 List of fields (with their data types and descriptions) used in Google storage access log format.

Field               Type     Description
time_micros         Integer  The time at which the request was completed, in microseconds since the Unix epoch
c_ip                String   The IP address of the client system from which the request is made
c_ip_type           Integer  The version of IP used, i.e. either IPv4 or IPv6
cs_method           String   The HTTP method of the request from client to server
cs_uri              String   URI of the request
sc_status           Integer  HTTP status code sent from server to client
cs_bytes            Integer  Number of bytes sent in an HTTP request message from client to server
sc_bytes            Integer  Number of bytes sent in an HTTP response message from server to client
time_taken_micros   Integer  The time it took to serve the request by the server, in microseconds
cs_object           String   The object specified in the request
cs_operation        String   The Google Cloud Storage operation, such as GET_Object

Table 15.3 List of fields (and their descriptions) used in the Amazon Web Services server access log format.

Field             Description
Time              The time at which the request was received
Remote IP         IP address of the requester
Requester         Canonical user ID of the requester
Request ID        A string generated by Amazon S3 to uniquely identify each request
Operation         Such as SOAP.operation and REST.HTTP_method
Request‐URI       URI in the HTTP request message
HTTP status       HTTP status code in the response message
Error Code        Amazon S3 error code
Bytes Sent        Number of bytes sent in the response message
Object Size       Total size of the object requested
Total Time        Number of milliseconds the request was in flight from the server's perspective
Turn‐Around Time  Number of milliseconds that Amazon S3 spent processing the request
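
As a small illustration of how such logs can feed an analysis, the following sketch parses a Google Cloud Storage access‐log CSV (with the fields of Table 15.2) into a per‐client timeline of requests; the log file name is a placeholder.

```python
# A small sketch of turning the CSV access logs described in Table 15.2 into
# a per-client event timeline; the log file name is a placeholder, and the
# field names follow the documented access-log header row.
import csv
from collections import defaultdict

timeline = defaultdict(list)   # client IP -> list of (time, method, object)

with open("bucket_usage_2018_06_01.csv", newline="") as f:
    for row in csv.DictReader(f):
        timeline[row["c_ip"]].append(
            (int(row["time_micros"]), row["cs_method"], row["cs_object"]))

for ip, events in timeline.items():
    for t, method, obj in sorted(events):
        print(ip, t, method, obj)
```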

(Zavou et al. 2013) developed Cloudopsy, a visualization tool that addresses cloud customers' privacy concerns about whether third‐party service/infrastructure providers handle customer data properly. The tool offers customers of cloud‐hosted services a user‐friendly interface to independently monitor the handling of their sensitive data by third parties. Cloudopsy's mechanism is divided into three main components: (i) the generation of audit trails, (ii) the transformation of the audit logs for efficient processing, and (iii) the visualization of the resulting audit trails. Cloudopsy uses the Circos visualization tool to generate graphs that enable users with no technical background to gain a better understanding of how their data is managed by third‐party cloud services.

(Pappas et al. 2013) proposed CloudFence, a data flow‐tracking framework for cloud‐based applications. CloudFence provides APIs for integrating data‐flow tracking into cloud services, marking sensitive user data so that its propagation can be monitored and protected. The authors implemented a prototype of CloudFence using Intel's Pin (Luk et al. 2005), a dynamic binary instrumentation toolkit, which allows the tracking to be added without modifying the applications themselves. CloudFence can protect against information‐disclosure attacks with modest performance overhead.

15.5 Discussion

Clearly, the discussed work in other relevant domains may not perfectly align with the typical needs of forensics. Nonetheless, it is representative of the kind of data that is likely to be available for forensic purposes. This can provide critical help in both building new analytical tools and in defining reasonable additional data‐retention and provenance requirements for different cloud actors. In our view, one likely outcome is the emergence of secure logging services (SecLaaS) such as the ones proposed by (Zawoad 2013).

Another bright spot for the future is the fact that logging is central to all aspects of the functioning of a cloud system, and most of the forensic concerns related to investigating logs (Ruan et al. 2013; Zawoad and Hasan 2013), such as cleanup, time synchronization, data correlation, and timeline analysis, are quite generic. Therefore, a robust log‐management infrastructure will be available for the right forensic tools to build upon (Marty 2011).

Even as digital forensics practitioners struggle to come to terms with the current state of cloud computing, an even bigger wave of challenges is on the horizon. Indeed, the very definition of cloud forensics is likely to expand to include much of mobile device forensics (as more of the computation and logged data remain in the Cloud) and the emerging concept of automated machine‐to‐machine interaction (a.k.a. the Internet of Things [IoT]). The latter is poised to dramatically escalate the amount of data generated as the current limiting factor – the human operator – is removed from the loop. In other words, the number of possible machine‐to‐machine interactions will no longer be tied to the number of humans and can readily grow at an exponential rate.

15.6 Conclusions

In this chapter, we argued that the acquisition and analysis of cloud artifacts is in its infancy, and that current‐generation tools are ill‐suited to the task. Specifically, the continued focus on analyzing client‐side artifacts is a direct extension of existing procedural approaches that, in the cloud context, fail to deliver on two critical forensic requirements: completeness and reproducibility. We have shown that SaaS cloud services routinely provide versioning and use cloud‐native artifacts, which demands a new API‐centric approach to discovery, acquisition, and analysis.

Another major point of our analysis is that we are in the very early stages of a paradigm shift from artifact‐centric to log‐centric forensics. That is, the current focus on extracting snapshots in time of OS and application data structures (primarily out of the file system) will have wide applicability only to IaaS investigative scenarios. For PaaS/SaaS situations – which will be increasingly common – the natural approach is to work with the existing log data. In some cases, such as Google Docs, such data can provide a complete chronology of user edits since the creation of the document.

Finally, although data growth was identified as a primary concern for forensic tool design over a decade ago (Roussev and Richard 2004), we are facing an even steeper curve because the IoT – built on automated machine‐to‐machine interaction – will escalate the amount of data available that (potentially) needs to be examined. This implies that the drive for ever‐higher levels of automated processing will no longer be just an issue of efficiency, but a clear and present necessity. On the bright side, we note that logical (API‐based) acquisition will enable higher levels of automated processing by eliminating tedious, low‐level device acquisition and interpretation; the acquired data will have a known structure and semantics, thereby eliminating much of the need for manual reverse engineering.

It is important to recognize that, in forensics, technical experimentation and development have always led the way, with best practices and legal scrutiny following suit. At present, we are at the starting point of a major technology transition, which naturally leads to a moment of confusion, hesitation, and efforts to tweak old tools to a new purpose. There seems to be some conviction that, more than ever, we need a multidisciplinary approach to cloud forensics (NIST 2014); that is, we should attempt to solve all problems – technical and nontechnical – at the same time.

It is worth remembering that the current state of digital forensics is the result of some 30 years of research, development, and practice, all on client devices. We face a moment of disruption with the weight (and history) of computations shifting to the server environment. This is a substantial change, and “big project” approaches almost never succeed in managing such an abrupt transition. What does succeed is small‐scale technical experimentation, followed by tool development and field use. The legal and procedural framework can only be meaningfully developed once a critical amount of experience is accumulated.

References

  1. Chung, H., Park, J., Lee, S., and Kang, C. (2012). Digital forensic investigation of cloud storage services. Journal of Digital Investigation 9 (2): 81–95. https://doi.org/10.1016/j.diin.2012.05.015.
  2. Dykstra, J. and Sherman, A.T. (2012). Acquiring forensic evidence from infrastructure‐as‐a‐service cloud computing: Exploring and evaluating tools, trust, and techniques. Proceedings of the Twelfth Annual Digital Forensic Research Conference (DFRWS) S90–S98. doi:10.1016/j.diin.2012.05.001.
  3. Dykstra, J. and Sherman, A.T. (2013). Design and implementation of FROST: digital forensic tools for the OpenStack cloud computing platform. Journal of Digital Investigation 10 (Supplement): S87–S95. https://doi.org/10.1016/j.diin.2013.06.010.
  4. Gartner. (2014). Gartner's 2014 hype cycle of emerging technologies maps. http://www.gartner.com/newsroom/id/2819918.
  5. Hale, J.S. (2013). Amazon cloud drive forensic analysis. Journal of Digital Investigation 10 (3): 259–265. https://doi.org/10.1016/j.diin.2013.04.006.
  6. Kent K., Chevalier S., Grance T. et al. (2006). Guide to integrating forensic techniques into incident response. SP800–86. Gaithersburg: U.S. Department of Commerce.
  7. King, C. and Vidas, T. (2011). Empirical analysis of solid state disk data retention when used with contemporary operating systems. Journal of Digital Investigation 8: S111–S117. https://doi.org/10.1016/j.diin.2011.05.013.
  8. Ligh, M.H., Case, A., Levy, J., and Walters, A. (2014). The Art of Memory Forensics: Detecting Malware and Threats in Windows, Linux, and Mac Memory. Wiley. ISBN: 978‐1118825099.
  9. Luk, C.K., Cohn, R., Muth, R. et al. (2005). Pin: building customized program analysis tools with dynamic instrumentation. In: Proceedings of PLDI, 190–200.
  10. Martini, B. and Choo, K.R. (2012). An integrated conceptual digital forensic framework for cloud computing. Journal of Digital Investigation 9 (2): 71–80. https://doi.org/10.1016/j.diin.2012.07.001.
  11. Martini, B. and Choo, K.R. (2013). Cloud storage forensics: ownCloud as a case study. Journal of Digital Investigation 10 (4): 287–299. https://doi.org/10.1016/j.diin.2013.08.005.
  12. Marty, R. (2011). Cloud application logging for forensics. In: Proceedings of the 2011 ACM Symposium on Applied Computing (SAC '11), 178–184. ACM. http://doi.acm.org/10.1145/1982185.1982226.
  13. McKemmish, R. (1999). What is forensic computing? Trends & Issues in Crime and Criminal Justice 118: 1–6.
  14. NIST. (2014). NIST cloud computing forensic science challenges. (draft NISTIR 8006). NIST Cloud Computing Forensic Science Working Group. http://csrc.nist.gov/publications/drafts/nistir‐8006/draft_nistir_8006.pdf.
  15. Pappas, V., Kemerlis, V.P., Zavou, A. et al. (2013). CloudFence: data flow tracking as a cloud service. In: 16th International Symposium, RAID 2013, 411–431.
  16. Quick, D. and Choo, K.R. (2013). Dropbox analysis: data remnants on user machines. Journal of Digital Investigation 10 (1): 3–18. https://doi.org/10.1016/j.diin.2013.02.003.
  17. Richard, G. and Roussev, V. (2005). Scalpel: a frugal, high‐performance file carver. In: Proceedings of the 2005 Digital Forensics Research Conference (DFRWS). New Orleans, LA.
  18. RightScale. (2015). RightScale 2015 state of the Cloud report. http://assets.rightscale.com/uploads/pdfs/RightScale‐2015‐State‐of‐the‐Cloud‐Report.pdf.
  19. Roussev, V. (2016). Digital Forensic Science: Issues, Methods, and Challenges. Morgan & Claypool.
  20. Roussev, V. and McCulley, S. (2016). Forensic analysis of cloud‐native artifacts. 3rd Annual Digital Forensic Research Conference Europe (DFRWS‐EU), Geneva, Switzerland. https://doi.org/10.1016/j.diin.2016.01.013.
  21. Roussev, V. and Richard, G. (2004). Breaking the performance wall: the case for distributed digital forensics. In: Proceedings of the 2004 Digital Forensics Research Workshop (DFRWS). Baltimore, MD.
  22. Roussev, V., Barreto, A., and Ahmed, I. (2016). API‐based forensic acquisition of cloud drives. In: Research Advances in Digital Forensics XII (ed. G. Peterson and S. Shenoi), 213–235. Springer. doi:10.1007/978‐3‐319‐46279‐0_11.
  23. Ruan, K., Carthy, J., Kechadi, T., and Baggili, I. (2013). Cloud forensics definitions and critical criteria for cloud forensic capability: an overview of survey results. Journal of Digital Investigation 10 (1): 34–43. https://doi.org/10.1016/j.diin.2013.02.004.
  24. Somers, J. (2014). How I reverse engineered Google Docs to play back any document's keystrokes. http://features.jsomers.net/how‐i‐reverse‐engineered‐google‐docs (accessed 20 December 2014).
  25. Zavou, A., Pappas, V., Kemerlis, V.P. et al. (2013). Cloudopsy: an autopsy of data flows in the Cloud. HCI (27): 366–375.
  26. Zawoad, S. and Hasan, R. (2013). Cloud forensics: a meta‐study of challenges, approaches, and open problems. https://arxiv.org/abs/1302.6312.