A fairly large portion of this book is dedicated to the techniques the “bad guys” will use to locate sensitive information. We present this information to help you become better informed about their motives so that you can protect yourself and perhaps your customers. We’ve already looked at some of the benign basic searching techniques that are foundational for any Google user who wants to break the barrier of the basics and charge through to the next level: the ways of the Google hacker. Now we’ll start looking at more nefarious uses of Google that hackers are likely to employ.

First, we’ll talk about Google’s cache. If you haven’t already experimented with the cache, you’re missing out. I suggest you at least click a few cached links from the Google search results page before reading further. As any decent Google hacker will tell you, there’s a certain anonymity that comes with browsing the cached version of a page. That anonymity only goes so far, and there are some limitations to the coverage it provides. Google can, however, very nicely veil your crawling activities to the point that the target Web site might not even get a single packet of data from you as you cruise the Web site. We’ll show you how it’s done.

Next, we’ll talk about directory listings. These “ugly” Web pages are chock full of information, and their mere existence serves as the basis for some of the more advanced attack searches that we’ll discuss in later chapters.

To round things out, we’ll take a look at a technique that has come to be known as traversing: the expansion of a search to attempt to gather more information. We’ll look at directory traversal, number range expansion, and extension trolling, all of which are techniques that should be second nature to any decent hacker—and the good guys that defend against them.

Anonymity with Caches

Google’s cache feature is truly an amazing thing. The simple fact is that if Google crawls a page or document, you can almost always count on getting a copy of it, even if the original source has since dried up and blown away. Of course the down side of this is that hackers can get a copy of your sensitive data even if you’ve pulled the plug on that pesky Web server. Another down side of the cache is that the bad guys can crawl your entire Web site (including the areas you “forgot” about) without even sending a single packet to your server. If your Web server doesn’t get so much as a packet, it can’t write anything to the log files. (You are logging your Web connections, aren’t you?) If there’s nothing in the log files, you might not have any idea that your sensitive data has been carried away. It’s sad that we even have to think in these terms, but untold megabytes, gigabytes, and even terabytes of sensitive data leak from Web servers every day. Understanding how hackers can mount an anonymous attack on your sensitive data via Google’s cache is of utmost importance.

Google grabs a copy of most Web data that it crawls. There are exceptions, and this behavior is preventable, as we’ll discuss later, but the vast majority of the data Google crawls is copied and filed away, accessible via the cached link on the search page. We need to examine some subtleties to Google’s cached document banner. The banner shown in Figure 3.1 was gathered from


Figure 3-1. This Cached Banner Contains a Subtle Warning About Images

If you’ve gotten so familiar with the cache banner that you just blow right past it, slow down a bit and actually read it. The cache banner in Figure 3.1 notes, “This cached page may reference images which are no longer available.” This message is easy to miss, but it provides an important clue about what Google’s doing behind the scenes.

To get a better idea of what’s happening, let’s take a look at a snippet of tcpdump output gathered while browsing this cached page. To capture this data, tcpdump is simply run as tcpdump -n. Your installation or implementation of tcpdump might require you to also set a listening interface with the -i switch. The output of the tcpdump command is shown in Figure 3.2.


Figure 3-2. Tcpdump Output Fragment Gathered While Viewing a Cached Page

Let’s take apart this output a bit, starting at the bottom. This is a port 80 (Web) conversation between our browser machine ( and a Google server ( This is the type of traffic we should expect from any transaction with Google, but the beginning of the capture reveals another port 80 (Web) connection to This is not a Google server, and an nslookup of that Internet Protocol (IP) shows that it is the Web server. The connection to this server can be explained by rerunning tcpdump with more options specifically designed to show a few hundred bytes of the data inside the packets as well as the headers. The partial capture shown in Figure 3.3 was gathered by running:

tcpdump -Xx -s 500 -n

and shift-reloading the cached page. Shift-reloading forces most browsers to contact the Web host again, not relying on any caches the browser might be using.


Figure 3.3 A Partial HTTP Request Showing the Host Header Field

Lines 0x30 and 0x40 show that we are downloading (via a GET request) an image file—specifically, a JPG image from the server. Farther along in the network trace, a Host field reveals that we are talking to the Web server. Because of this Host header and the fact that this packet was sent to IP address, we can safely assume that the Phrack Web server is virtually hosted on the physical server located at that address. This means that when viewing the cached copy of the Phrack Web page, we are pulling images directly from the Phrack server itself. If we were striving for anonymity by viewing the Google cached page, we just blew our cover! Furthermore, line 0x90 shows that the REFERER field was passed to the Phrack server, and that field contained a Uniform Resource Locator (URL) reference to Google’s cached copy of Phrack’s page. This means that not only were we not anonymous, but our browser informed the Phrack Web server that we were trying to view a cached version of the page! So much for anonymity.
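To see why a cached page can still generate traffic to the original server, it helps to list the hosts that a page’s image tags point at. The following is a minimal sketch in Python, assuming the cached HTML has already been saved locally. The class name, the sample markup, and the www.example.com host are hypothetical stand-ins, not data from the capture above, and the sketch ignores relative references that a browser would resolve against a base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageHostCollector(HTMLParser):
    """Collects the host names referenced by <img src=...> attributes."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                host = urlparse(value).netloc
                # Relative URLs carry no host of their own, so this
                # sketch simply skips them.
                if host:
                    self.hosts.add(host.lower())


def external_image_hosts(html_text):
    """Return the sorted list of hosts named in absolute image URLs."""
    collector = ImageHostCollector()
    collector.feed(html_text)
    return sorted(collector.hosts)


# Hypothetical cached markup: one absolute image reference, one relative.
sample = (
    '<html><body><img src="http://www.example.com/logo.jpg">'
    '<img src="/local.gif"></body></html>'
)
print(external_image_hosts(sample))
```

Any host that shows up in this list is a server your browser would contact directly while rendering the cached copy, which is exactly the behavior the capture revealed.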

It’s worth noting that most real hackers use proxy servers when browsing a target’s Web pages, and even their Google activities are first bounced off a proxy server. If we had used an anonymous proxy server for our testing, the Phrack Web server would have only gotten our proxy server’s IP address, not our actual IP address.

Notes from the Underground…
Google Hacker’s Tip

It’s a good idea to use a proxy server if you value your anonymity online. Penetration testers use proxy servers to emulate what a real attacker would do during an actual break-in attempt. Locating working, high-quality proxy servers can be an arduous task, unless of course we use a little Google hacking to do the grunt work for us! To locate proxy servers using Google, try these queries:

inurl:“nph-proxy.cgi” “Start browsing”


“cacheserverreport for” “This analysis was produced by calamaris”

These queries locate online public proxy servers that can be used for testing purposes. Nothing like Googling for proxy servers! Remember, though, that there are lots of places to obtain proxy servers, such as the atomintersoft site or the proxy site. Try Googling for those!

The cache banner does, however, provide an option to view only the data that Google has captured, without any external references. As you can see in Figure 3.1, a link is available in the header, titled “Click here for the cached text only.” Clicking this link produces the tcpdump output shown in Figure 3.4, captured with tcpdump -n.


Figure 3.4 Cached Text Only Captured with Tcpdump

Despite the fact that we loaded the same page as before, this time we communicated only with a Google server (at, not any external servers. If we were to look at the URL generated by clicking the “cached text only” link in the cached page’s header, we would discover that Google appended an interesting parameter, &strip=1. This parameter forces a Google cache URL to display only cached text, avoiding any external references. This URL parameter only applies to URLs that reference a Google cached page.

Pulling it all together, we can browse a cached page with a fair amount of anonymity without a proxy server, using a quick cut and paste and a URL modification. As an example, consider query for Instead of clicking the cached link, we will right-click the cached link and copy the URL to the Clipboard, as shown in Figure 3.5. Browsers handle this action differently, so use whichever technique works for you to capture the URL of this link.


Figure 3-5. Anonymous Cache Viewing Via Cut and Paste

Once the URL is copied to the Clipboard, paste it into the address bar of your browser, and append the &strip=1 parameter to the end of the URL. The URL should now look something like Press Enter after modifying the URL to load the page, and you should be taken to the stripped version of the cached page, which has a slightly different banner, as shown in Figure 3.6.
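The cut-and-paste step can also be scripted. This is a rough sketch that appends the strip=1 parameter described above to a copied cache URL. The function name and the cache.example URL are hypothetical placeholders, since the real URL depends on what your browser copied.

```python
from urllib.parse import urlsplit, urlunsplit


def add_strip_parameter(cache_url):
    """Return cache_url with strip=1 appended to its query string."""
    parts = urlsplit(cache_url)
    # Keep the existing parameters untouched, dropping any earlier
    # strip= value so the result never carries the parameter twice.
    pairs = [p for p in parts.query.split("&") if p and not p.startswith("strip=")]
    pairs.append("strip=1")
    return urlunsplit(parts._replace(query="&".join(pairs)))


# Hypothetical copied link; substitute the URL from your own Clipboard.
copied = "http://cache.example/search?q=cache:abc123:www.example.com/&hl=en"
print(add_strip_parameter(copied))
```

The query string is edited as plain text rather than decoded and re-encoded, so the rest of the URL stays exactly as it was copied.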


Figure 3-6. A Stripped Cached Page’s Header

Notice that the stripped cache header reads differently than the standard cache header. Instead of the “This cached page may reference images which are no longer available” line, there is a new line that reads, “Click here for the full cached version with images included.” This is an indicator that the current cached page has been stripped of external references. Unfortunately, the stripped page does not include graphics, so the page could look quite different from the original, and in some cases a stripped page might not be legible at all. If this is the case, it never hurts to load up a proxy server and hit the page, but real Google hackers “don’t need no steenkin’ proxy servers!”

Notes from the Underground…
Google’s Highlight Tool

If you’ve ever scrolled through page after page of a document looking for a particular word or phrase, you probably already know that Google’s cached version of the page will highlight search terms for you. What you might not realize is that you can use Google’s highlight tool to highlight terms on a cached page that weren’t included in your original search. This takes a bit of URL mangling, but it’s fairly straightforward. For example, if you searched for peeps marshmallows and viewed the second cached page, part of the cached page’s URL looks something like Notice the search terms we used listed after the base page URL. To highlight other terms, simply play around with the area after the base URL, in this case +peeps+marshmallows. Simply add or subtract words and press Enter, and Google will highlight your terms! For example, to add fear and risk to the list of highlighted words, simply add them into the URL, making it read something like =en. Did you ever know that Marshmallow Peeps actually feel fear? Don’t believe me? Just ask Google.
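If you would rather not edit the URL by hand, the term list can be rebuilt with a few lines of Python. This sketch assumes that the value being edited has the layout described above, a base page reference followed by plus-separated terms. The function name and the sample value are hypothetical.

```python
def set_highlight_terms(q_value, terms):
    """Replace the plus-separated terms that follow the base page reference."""
    # Everything before the first "+" is treated as the base reference.
    base = q_value.split("+", 1)[0]
    return "+".join([base] + list(terms))


# Hypothetical value in the assumed layout.
original = "cache:abc123:www.example.com/page.html+peeps+marshmallows"
print(set_highlight_terms(original, ["peeps", "marshmallows", "fear", "risk"]))
```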

Directory Listings

A directory listing is a type of Web page that lists files and directories that exist on a Web server. Designed to be navigated by clicking directory links, directory listings typically have a title that describes the current directory, a list of files and directories that can be clicked, and often a footer that marks the bottom of the directory listing. Each of these elements is shown in the sample directory listing in Figure 3.7.


Figure 3-7. A Directory Listing Has Several Recognizable Elements

Much like an FTP server, directory listings offer a no-frills, easy-install solution for granting access to files that can be stored in categorized folders. Unfortunately, directory listings have many faults, specifically:

  • They are not secure in and of themselves. They do not prevent users from downloading certain files or accessing certain directories. This task is often left to the protection measures built into the Web server software or third-party scripts, modules, or programs designed specifically for that purpose.
  • They can display information that helps an attacker learn specific technical details about the Web server.
  • They do not discriminate between files that are meant to be public and those that are meant to remain behind the scenes.
  • They are often displayed accidentally, since many Web servers display a directory listing if a top-level index file (index.htm, index.html, default.asp, and so on) is missing or invalid.

All this adds up to a deadly combination.

In this section, we’ll take a look at some of the ways Google hackers can take advantage of directory listings.

Locating Directory Listings

The most obvious way an attacker can abuse a directory listing is by simply finding one! Since directory listings offer “parent directory” links and allow browsing through files and folders, even the most basic attacker might soon discover that sensitive data can be found by simply locating the listings and browsing through them.

Locating directory listings with Google is fairly straightforward. Figure 3.7 shows that most directory listings begin with the phrase “Index of,” which also shows in the title. An obvious query to find this type of page might be intitle:index.of, which could find pages with the term index of in the title of the document. Remember that the period (“.”) serves as a single-character wildcard in Google. Unfortunately, this query will return a large number of false positives, such as pages with the following titles:

Index of Native American Resources on the Internet

LibDex - Worldwide index of library catalogues

Iowa State Entomology Index of Internet Resources

Judging from the titles of these documents, it is obvious that not only are these Web pages intentional, they are also not the type of directory listings we are looking for. As Ben Kenobi might say, “This is not the directory listing you’re looking for.” Several alternate queries provide more accurate results, for example intitle:index.of “parent directory” (shown in Figure 3.8) or intitle:index.of name size. These queries indeed reveal directory listings by not only focusing on index.of in the title, but on keywords often found inside directory listings, such as parent directory, name, and size. Even judging from the summary on the search results page, you can see that these results are indeed the types of directory listings we’re looking for.
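The same reasoning works as a quick filter when you are reviewing pages from your own servers. The sketch below is an illustration, not a definitive detector: it assumes a listing’s title is “Index of” followed by a path beginning with a slash, and that the body contains the parent directory, name, and size keywords. The function name and the sample strings are hypothetical.

```python
import re

# Assumed title layout: "Index of" followed by a path such as /admin.
_TITLE_RE = re.compile(r"^\s*index of\s+/", re.IGNORECASE)
_BODY_MARKERS = ("parent directory", "name", "size")


def looks_like_directory_listing(title, body_text):
    """Heuristic check that a page is a server-generated directory listing."""
    if not _TITLE_RE.search(title):
        return False
    text = body_text.lower()
    return all(marker in text for marker in _BODY_MARKERS)


print(looks_like_directory_listing(
    "Index of /admin", "Name Last modified Size Parent Directory"))
print(looks_like_directory_listing(
    "Index of Native American Resources on the Internet", "Name Size"))
```

The first call reports a likely listing and the second does not, because the intentional page’s title has no path after “Index of.”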


Figure 3-8. A Good Search for Directory Listings

Finding Specific Directories

In some cases, it might be beneficial not only to look for directory listings, but to look for directory listings that allow access to a specific directory. This is easily accomplished by adding the name of the directory to the search query. To locate “admin” directories that are accessible from directory listings, queries such as intitle:index.of.admin or intitle:index.of inurl:admin will work well, as shown in Figure 3.9.


Figure 3-9. Locating Specific Directories in a Directory Listing

Finding Specific Files

Because these types of pages list names of files and directories, it is possible to find very specific files within a directory listing. For example, to find WS_FTP log files, try a search such as intitle:index.of ws_ftp.log, as shown in Figure 3.10. This technique can be extended to just about any kind of file by keying in on the index.of in the title and the filename in the text of the Web page.


Figure 3-10. Locating Files in a Directory Listing

You can also use filetype and inurl to search for specific files. To search again for ws_ftp.log files, try a query like filetype:log inurl:ws_ftp.log. This technique will generally find more results than the somewhat restrictive index.of search. We’ll be working more with specific file searches throughout the book.
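Both query styles follow a simple pattern, so they are easy to generate when you need to check your own sites for several filenames. This small sketch only builds the query strings and does not send anything to Google. The function name is hypothetical.

```python
def file_queries(filename):
    """Build the two query styles discussed above for one filename."""
    queries = [f"intitle:index.of {filename}"]
    # The filetype form needs an extension; skip it when there is none.
    if "." in filename:
        extension = filename.rsplit(".", 1)[-1]
        queries.append(f"filetype:{extension} inurl:{filename}")
    return queries


for query in file_queries("ws_ftp.log"):
    print(query)
```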

Server Versioning

One piece of information an attacker can use to determine the best method for attacking a Web server is the exact software version. An attacker could retrieve that information by connecting directly to the Web port of that server and issuing a request for the Hypertext Transfer Protocol (HTTP) (Web) headers. It is possible, however, to retrieve similar information from Google without ever connecting to the target server. One method involves using the information provided in a directory listing.

Figure 3.11 shows the bottom portion of a typical directory listing. Notice that some directory listings provide the name of the server software as well as the version number. An adept Web administrator could fake these server tags, but most often this information is legitimate and exactly the type of information an attacker will use to refine his attack against the server.


Figure 3-11. This Server Tag Can Be Used to Profile a Web Server

The Google query used to locate servers this way is simply an extension of the intitle:index.of query. The listing shown in Figure 3.11 was located with a query of intitle:index.of “server at”. This query will locate all directory listings on the Web with index of in the title and server at anywhere in the text of the page. This might not seem like a very specific search, but the results are very clean and do not require further refinement.
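Once you have collected a server tag, pulling it apart is straightforward. The sketch below assumes the footer has the form product/version, then “Server at,” a host name, and “Port” with a number. The function name and the www.example.com host are hypothetical, and other servers may format their tags differently.

```python
import re

# Assumed footer layout: "<product>/<version> ... Server at <host> Port <n>".
_FOOTER_RE = re.compile(
    r"(?P<product>[\w.+-]+)/(?P<version>\S+)\s+.*?"
    r"Server at\s+(?P<host>\S+)\s+Port\s+(?P<port>\d+)",
    re.IGNORECASE,
)


def parse_listing_footer(footer):
    """Return the footer's fields as a dict, or None if the layout differs."""
    match = _FOOTER_RE.search(footer)
    if match is None:
        return None
    fields = match.groupdict()
    fields["port"] = int(fields["port"])
    return fields


print(parse_listing_footer("Apache/1.3.21 Server at www.example.com Port 80"))
```

Running this against the listings on your own servers is a quick way to see exactly what version information you are advertising.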

Notes from the Underground…
Server Version? Who Cares?

Although server versioning might seem fairly harmless, realize that there are two ways an attacker might use this type of information. If the attacker has already chosen his target and discovers this information on that target server, he could begin searching for an exploit (which may or may not exist) to use against that specific software version. Conversely, if the attacker already has a working exploit for a very specific version of Web server software, he could perform a Google search for targets that he can compromise with that exploit. An attacker, armed with an exploit and drawn to a potentially vulnerable server, is especially dangerous. Even small information leaks like this can have big payoffs for a clever attacker.

To search for a specific server version, the intitle:index.of query can be extended even further to something like intitle:index.of “Apache/1.3.21 Server at”. This query would find pages like the one listed in Figure 3.11. As shown in Table 3.1, many different servers can be identified through a directory listing.

Table 3.1 Some Specific Servers Locatable Via Directory Listings

Directory Listing of Web Servers
“AnWeb/1.42h” intitle:index.of
“Apache Tomcat/” intitle:index.of
“Apache-AdvancedExtranetServer/” intitle:index.of
“Apache/” intitle:index.of
“Apache/AmEuro” intitle:index.of
“Apache/Blast” intitle:index.of
“Apache/WWW” intitle:index.of
“Apache/df-exts” intitle:index.of
“CERN httpd 3.0B (VAX VMS)” intitle:index.of
“CompySings/2.0.40” intitle:index.of
“Davepache/2.02.003 (Unix)” intitle:index.of
“DinaHTTPd Server/1.15” intitle:index.of
“HP Apache-based Web Server/1.3.26” intitle:index.of
“HP Apache-based Web Server/1.3.27 (Unix) mod_ssl/2.8.11 OpenSSL/0.9.6g” intitle:index.of
“HP-UX_Apache-based_Web_Server/2.0.43” intitle:index.of
“httpd+ssl/kttd” * server at intitle:index.of
“IBM_HTTP_Server” intitle:index.of
“IBM_HTTP_Server/2.0.42” intitle:index.of
“JRun Web Server” intitle:index.of
“LiteSpeed Web” intitle:index.of
“MCWeb” intitle:index.of
“MaXX/3.1” intitle:index.of
“Microsoft-IIS/* server at” intitle:index.of
“Microsoft-IIS/4.0” intitle:index.of
“Microsoft-IIS/5.0 server at” intitle:index.of
“Microsoft-IIS/6.0” intitle:index.of
“OmniHTTPd/2.10” intitle:index.of
“OpenSA/1.0.4” intitle:index.of
“OpenSSL/0.9.7d” intitle:index.of
“Oracle HTTP Server/1.3.22” intitle:index.of
“Oracle-HTTP-Server/1.3.28” intitle:index.of
“Oracle-HTTP-Server” intitle:index.of
“Oracle HTTP Server Powered by Apache” intitle:index.of
“Patchy/1.3.31” intitle:index.of
“Red Hat Secure/2.0” intitle:index.of
“Red Hat Secure/3.0 server at” intitle:index.of
“Savant/3.1” intitle:index.of
“SEDWebserver*” “server at” intitle:index.of
“SEDWebserver/1.3.26” intitle:index.of
“TcNet httpsrv 1.0.10” intitle:index.of
“WebServer/1.3.26” intitle:index.of
“WebTopia/2.1.1a” intitle:index.of
“Yaws 1.65” intitle:index.of
“Zeus/4.3” intitle:index.of

Table 3.2 Directory Listings of Apache Versions

Queries That Locate Apache Versions Through Directory Listings
“Apache/1.0” intitle:index.of
“Apache/1.1” intitle:index.of
“Apache/1.2” intitle:index.of
“Apache/1.2.0 server at” intitle:index.of
“Apache/1.2.4 server at” intitle:index.of
“Apache/1.2.6 server at” intitle:index.of
“Apache/1.3.0 server at” intitle:index.of
“Apache/1.3.2 server at” intitle:index.of
“Apache/1.3.1 server at” intitle:index.of
“Apache/ server at” intitle:index.of
“Apache/1.3.3 server at” intitle:index.of
“Apache/1.3.4 server at” intitle:index.of
“Apache/1.3.6 server at” intitle:index.of
“Apache/1.3.9 server at” intitle:index.of
“Apache/1.3.11 server at” intitle:index.of
“Apache/1.3.12 server at” intitle:index.of
“Apache/1.3.14 server at” intitle:index.of
“Apache/1.3.17 server at” intitle:index.of
“Apache/1.3.19 server at” intitle:index.of
“Apache/1.3.20 server at” intitle:index.of
“Apache/1.3.22 server at” intitle:index.of
“Apache/1.3.23 server at” intitle:index.of
“Apache/1.3.24 server at” intitle:index.of
“Apache/1.3.26 server at” intitle:index.of
“Apache/1.3.27 server at” intitle:index.of
“Apache/1.3.27-fil” intitle:index.of
“Apache/1.3.28 server at” intitle:index.of
“Apache/1.3.29 server at” intitle:index.of
“Apache/1.3.31 server at” intitle:index.of
“Apache/1.3.33 server at” intitle:index.of
“Apache/1.3.34 server at” intitle:index.of
“Apache/1.3.35 server at” intitle:index.of
“Apache/2.0 server at” intitle:index.of
“Apache/2.0.32 server at” intitle:index.of
“Apache/2.0.35 server at” intitle:index.of
“Apache/2.0.36 server at” intitle:index.of
“Apache/2.0.39 server at” intitle:index.of
“Apache/2.0.40 server at” intitle:index.of
“Apache/2.0.42 server at” intitle:index.of
“Apache/2.0.43 server at” intitle:index.of
“Apache/2.0.44 server at” intitle:index.of
“Apache/2.0.45 server at” intitle:index.of
“Apache/2.0.46 server at” intitle:index.of
“Apache/2.0.47 server at” intitle:index.of
“Apache/2.0.48 server at” intitle:index.of
“Apache/2.0.49 server at” intitle:index.of
“Apache/2.0.49a server at” intitle:index.of
“Apache/2.0.50 server at” intitle:index.of
“Apache/2.0.51 server at” intitle:index.of
“Apache/2.0.52 server at” intitle:index.of
“Apache/2.0.55 server at” intitle:index.of
“Apache/2.0.59 server at” intitle:index.of

In addition to identifying the Web server version, it is also possible to determine the operating system of the server as well as modules and other software that is installed. We’ll look at more specific techniques to accomplish this later, but the server versioning technique we’ve just looked at can be extended by including more details in our query. Table 3.3 shows queries that located extremely esoteric server software combinations, revealed by server tags. These tags list a great deal of information about the servers they were found on and are shining examples proving that even a seemingly small information leak can sometimes explode out of control, revealing more information than expected.

Table 3.3 Locating Specific and Esoteric Server Versions

Queries That Locate Specific and Esoteric Server Versions
“Apache/1.3.12 (Unix) mod_fastcgi/2.2.12 mod_dyntag/1.0 mod_advert/1.12 mod_czech/3.1.1b2” intitle:index.of
“Apache/1.3.12 (Unix) mod_fastcgi/2.2.4 secured_by_Raven/1.5.0” intitle:index.of
“Apache/1.3.12 (Unix) mod_ssl/2.6.6 OpenSSL/0.9.5a” intitle:index.of
“Apache/1.3.12 Cobalt (Unix) Resin/2.0.5 StoreSense-Bridge/1.3 ApacheJServ/1.1.1 mod_ssl/2.6.4 OpenSSL/0.9.5a mod_auth_pam/1.0a FrontPage/ mod_perl/1.24” intitle:index.of
“Apache/1.3.14 - PHP4.02 - Iprotect 1.6 CWIE (Unix) mod_fastcgi/2.2.12 PHP/4.0.3pl1” intitle:index.of
“Apache/1.3.14 Ben-SSL/1.41 (Unix) mod_throttle/2.11 mod_perl/1.24_01 PHP/4.0.3pl1 FrontPage/ rus/PL30.0” intitle:index.of
“Apache/1.3.20 (Win32)” intitle:index.of
“Apache/1.3.20 Sun Cobalt (Unix) PHP/4.0.3pl1 mod_auth_pam_external/0.1 FrontPage/ mod_perl/1.25” intitle:index.of
“Apache/1.3.20 Sun Cobalt (Unix) PHP/4.0.4 mod_auth_pam_external/0.1 FrontPage/ mod_ssl/2.8.4 OpenSSL/0.9.6b mod_perl/1.25” intitle:index.of
“Apache/1.3.20 Sun Cobalt (Unix) PHP/4.0.6 mod_ssl/2.8.4 OpenSSL/0.9.6 FrontPage/ mod_perl/1.26” intitle:index.of
“Apache/1.3.20 Sun Cobalt (Unix) mod_ssl/2.8.4 OpenSSL/0.9.6b PHP/4.0.3pl1 mod_auth_pam_external/0.1 FrontPage/ mod_perl/1.25” intitle:index.of
“Apache/1.3.20 Sun Cobalt (Unix) mod_ssl/2.8.4 OpenSSL/0.9.6b PHP/4.0.3pl1 mod_fastcgi/2.2.8 mod_auth_pam_external/0.1 mod_perl/1.25” intitle:index.of
“Apache/1.3.20 Sun Cobalt (Unix) mod_ssl/2.8.4 OpenSSL/0.9.6b PHP/4.0.4 mod_auth_pam_external/0.1 mod_perl/1.25” intitle:index.of
“Apache/1.3.20 Sun Cobalt (Unix) mod_ssl/2.8.4 OpenSSL/0.9.6b PHP/4.0.6 mod_auth_pam_external/0.1 FrontPage/ mod_perl/1.25” intitle:index.of
“Apache/1.3.20 Sun Cobalt (Unix) mod_ssl/2.8.4 OpenSSL/0.9.6b mod_auth_pam_external/0.1 mod_perl/1.25” intitle:index.of
“Apache/1.3.26 (Unix) Debian GNU/Linux PHP/4.1.2 mod_dtcl” intitle:index.of
“Apache/1.3.26 (Unix) PHP/4.2.2” intitle:index.of
“Apache/1.3.26 (Unix) mod_ssl/2.8.9 OpenSSL/0.9.6b” intitle:index.of
“Apache/1.3.26 (Unix) mod_ssl/2.8.9 OpenSSL/0.9.7” intitle:index.of
“Apache/1.3.26+PH” intitle:index.of
“Apache/1.3.27 (Darwin)” intitle:index.of
“Apache/1.3.27 (Unix) mod_log_bytes/1.2 mod_bwlimited/1.0 PHP/4.3.1 FrontPage/ mod_ssl/2.8.12 OpenSSL/0.9.6b” intitle:index.of
“Apache/1.3.27 (Unix) mod_ssl/2.8.11 OpenSSL/0.9.6g FrontPage/ mod_gzip/1.3.26 PHP/4.1.2 mod_throttle/3.1.2” intitle:index.of

One convention used by these sprawling tags is the use of parentheses to set off the operating system of the server. For example, Apache/1.3.26 (Unix) indicates a UNIX-based operating system. Other more specific tags are used as well, some of which are listed below.

  • CentOS
  • Debian
  • Debian GNU/Linux
  • Fedora
  • FreeBSD
  • Linux/SUSE
  • Linux/SuSE
  • NETWARE
  • Red Hat
  • Ubuntu
  • UNIX
  • Win32

An attacker can use the information in these operating system tags in conjunction with the Web server version tag to formulate a specific attack. If this information does not hint at a specific vulnerability, an attacker can still use this information in a data-mining or information-gathering campaign, as we will see in a later chapter.
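To get a feel for how much a single tag gives away, it can be split into its operating system and its name/version components. The following is a minimal sketch. It assumes the operating system sits in the first pair of parentheses and that every component is written as name/version with the version starting with a digit. The function name is hypothetical.

```python
import re


def split_server_tag(tag):
    """Return (operating_system, {component: version}) for a server tag."""
    # The first parenthesized group is taken to be the operating system.
    os_match = re.search(r"\(([^)]+)\)", tag)
    operating_system = os_match.group(1) if os_match else None

    # Requiring a digit after the slash avoids treating text such as
    # "GNU/Linux" as a component with a version.
    components = dict(re.findall(r"([A-Za-z][\w+-]*)/(\d[\w.]*)", tag))
    return operating_system, components


print(split_server_tag("Apache/1.3.26 (Unix) mod_ssl/2.8.9 OpenSSL/0.9.6b"))
```

Each entry in the resulting dictionary is a separate piece of software whose version an attacker could research.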

Going Out on a Limb: Traversal Techniques

The next technique we’ll examine is known as traversal. Traversal in this context simply means to travel across. Attackers use traversal techniques to expand a small “foothold” into a larger compromise.

Directory Traversal

To illustrate how traversal might be helpful, consider a directory listing that was found with intitle:index.of inurl:“admin”, as shown in Figure 3.12.


Figure 3-12. Traversal Example Found with index.of

In this example, our query brings us to a relative URL of /admin/php/tour. If you look closely at the URL, you’ll notice an “admin” directory two directory levels above our current location. If we were to click the “parent directory” link, we would be taken up one directory, to the “php” directory. Clicking the “parent directory” link from the “php” directory would take us to the “admin” directory, a potentially juicy directory. This is very basic directory traversal. We could explore each and every parent directory and each of the subdirectories, looking for juicy stuff. Alternatively, we could use a creative site search combined with an inurl search to locate a specific file or term inside a specific subdirectory, such as inurl:admin ws_ftp.log, for example. We could also explore this directory structure by modifying the URL in the address bar.

Regardless of how we were to “walk” the directory tree, we would be traversing outside the Google search, wandering around on the target Web server. This is basic traversal, specifically directory traversal. Another simple example would be replacing the word admin with the word student or public.

Another more serious traversal technique could allow an attacker to take advantage of software flaws to traverse to directories outside the Web server directory tree. For example, if a Web server is installed in the /var/www directory, and public Web documents are placed in /var/www/htdocs, by default any user attaching to the Web server’s top-level directory is really viewing files located in /var/www/htdocs. Under normal circumstances, the Web server will not allow Web users to view files above the /var/www/htdocs directory. Now, let’s say a poorly coded third-party software product is installed on the server that accepts directory names as arguments. A normal URL used by this product might be This URL would instruct the program to “fetch” the file located at /var/www/htdocs/index.html and display it to the user, perhaps with a nifty header and footer attached. An attacker might attempt to take advantage of this type of program by sending a URL such as If the program is vulnerable to a directory traversal attack, it would break out of the /var/www/htdocs directory, crawl up to the real root directory of the server, dive down into the /etc directory, and “fetch” the system password file, displaying it to the user with a nifty header and footer attached!
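For the good guys, the fix for this kind of flaw is to resolve the requested name and confirm that it still lies inside the document root before opening anything. Here is a minimal defensive sketch in Python. The function name is hypothetical, and the /var/www/htdocs root simply reuses the example directory from the text.

```python
import os


def resolve_inside_root(document_root, requested):
    """Resolve a requested name, refusing anything outside document_root."""
    root = os.path.realpath(document_root)
    # Strip leading separators so the request is always treated as
    # relative to the document root, then collapse any ".." segments.
    candidate = os.path.realpath(os.path.join(root, requested.lstrip("/\\")))
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError("request escapes the document root")
    return candidate


print(resolve_inside_root("/var/www/htdocs", "index.html"))
try:
    resolve_inside_root("/var/www/htdocs", "../../../etc/passwd")
except ValueError as error:
    print("rejected:", error)
```

The check compares fully resolved paths rather than searching the request for “..”, so encoded or nested variations that end up outside the root are rejected as well.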

Automated tools can do a much better job of locating these types of files and vulnerabilities, if you don’t mind all the noise they create. If you’re a programmer, you will be very interested in the Libwhisker Perl library, written and maintained by Rain Forest Puppy (RFP) and available from Security Focus wrote a great article on using Libwhisker. That article is available from If you aren’t a programmer, RFP’s Whisker tool, also available from the Wiretrip site, is excellent, as are other tools based on Libwhisker, such as nikto, written by [email protected], which is said to be updated even more than the Whisker program itself. Another tool that performs (amongst other things) file and directory mining is Wikto from SensePost that can be downloaded at The advantage of Wikto is that it does not suffer from false positives on Web sites that respond with friendly 404 messages.

Incremental Substitution

Another technique similar to traversal is incremental substitution. This technique involves replacing numbers in a URL in an attempt to find directories or files that are hidden, or unlinked from other pages. Remember that Google generally only locates files that are linked from other pages, so if it’s not linked, Google won’t find it. (Okay, there’s an exception to every rule. See the FAQ at the end of this chapter.) As a simple example, consider a document called exhc-1.xls, found with Google. You could easily modify the URL for that document, changing the 1 to a 2, making the filename exhc-2.xls. If the document is found, you have successfully used the incremental substitution technique! In some cases it might be simpler to use a Google query to find other similar files on the site, but remember, not all files on the Web are in Google’s databases. Use this technique only when you’re sure a simple query modification won’t find the files first.

This technique does not apply only to filenames, but just about anything that contains a number in a URL, even parameters to scripts. Using this technique to toy with parameters to scripts is beyond the scope of this book, but if you’re interested in trying your hand at some simple file or directory substitutions, scare up some test sites with queries such as filetype:xls inurl:1.xls or intitle:index.of inurl:0001 or even an images search for 1.jpg. Now use substitution to try to modify the numbers in the URL to locate other files or directories that exist on the site. Here are some examples:

  • /docs/bulletin/1.xls could be modified to /docs/bulletin/2.xls
  • /DigLib_thumbnail/spmg/hel/0001/H/ could be changed to /DigLib_thumbnail/spmg/hel/0002/H/
  • /gallery/wel008-1.jpg could be modified to /gallery/wel008-2.jpg
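The substitutions above can be sketched in a few lines of Python. This is only an illustration, not a tool from this book: the function name increment_last_number is hypothetical, and the sketch assumes that the number you want to change is the last run of digits in the URL. It keeps any zero padding, so 0001 becomes 0002.

```python
import re


def increment_last_number(url: str, step: int = 1) -> str:
    """Return url with its last run of digits increased by step.

    Zero padding is preserved, so "0001" becomes "0002".
    Assumption: the number of interest is the LAST run of digits in the URL.
    """
    matches = list(re.finditer(r"\d+", url))
    if not matches:
        raise ValueError("no number found in URL")
    last = matches[-1]
    digits = last.group()
    # Pad the new value back out to the original width.
    bumped = str(int(digits) + step).zfill(len(digits))
    return url[:last.start()] + bumped + url[last.end():]


if __name__ == "__main__":
    for example in (
        "/docs/bulletin/1.xls",
        "/DigLib_thumbnail/spmg/hel/0001/H/",
        "/gallery/wel008-1.jpg",
    ):
        print(example, "->", increment_last_number(example))
```

Because the sketch always picks the last run of digits, an extension that contains a digit (mp3, for instance) would be modified instead of the filename. In that case you would need to choose the number by hand.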

Extension Walking

We’ve already discussed file extensions and how the filetype operator can be used to locate files with specific file extensions. For example, we could easily search for HTM files with a query such as filetype:HTM (see footnote 1). Once you’ve located HTM files, you could apply the substitution technique to find files with the same file name and a different extension. For example, if you found /docs/index.htm, you could modify the URL to /docs/index.asp to try to locate an index.asp file in the docs directory. If this seems somewhat pointless, rest assured, this is, in fact, rather pointless. We can, however, make more intelligent substitutions. Consider the directory listing shown in Figure 3.13. This listing shows evidence of a very common practice, the creation of backup copies of Web pages.


Figure 3-13. Backup Copies of Web Pages Are Very Common

Backup files can be a very interesting find from a security perspective. In some cases, backup files are older versions of an original file. This is evidenced in Figure 3.13. Backup files on the Web have an interesting side effect: they have a tendency to reveal source code. Source code of a Web page is quite a find for a security practitioner, because it can contain behind-the-scenes information about the author, the code creation and revision process, authentication information, and more.

To see this concept in action, consider the directory listing shown in Figure 3.13. Clicking the link for index.php will display that page in your browser with all the associated graphics and text, just as the author of the page intended. If this were an HTM or HTML file, viewing the source of the page would be as easy as right-clicking the page and selecting View Source. PHP files, by contrast, are first executed on the server. The results of that executed program are then sent to your browser in the form of HTML code, which your browser then displays. Performing a View Source on HTML code that was generated from a PHP script will not show you the PHP source code, only the HTML. It is not possible to view the actual PHP source code unless something somewhere is misconfigured. An example of such a misconfiguration would be copying the PHP code to a filename that ends in something other than PHP, like BAK. Most Web servers do not understand what a BAK file is. Those servers, then, will serve a PHP.BAK file as plain text instead of executing it. When this happens, the actual PHP source code is displayed as text in your browser. As shown in Figure 3.14, PHP source code can be quite revealing, showing things like Structured Query Language (SQL) queries that list information about the structure of the SQL database that is used to store the Web server’s data.


Figure 3-14. Backup Files Expose SQL Data

The easiest way to determine the names of backup files on a server is to locate a directory listing using intitle:index.of or to search for specific files with queries such as intitle:index.of index.php.bak or inurl:index.php.bak. Directory listings are fairly uncommon, especially among corporate-grade Web servers. However, remember that Google’s cache captures a snapshot of a page in time. Just because a Web server isn’t hosting a directory listing now doesn’t mean the site never displayed a directory listing. The page shown in Figure 3.15 was found in Google’s cache and was displayed as a directory listing because an index.php (or similar file) was missing. In this case, if you were to visit the server on the Web, it would look like a normal page because the index file has since been created. Clicking the cache link, however, shows this directory listing, leaving the list of files on the server exposed. This list of files can be used to intelligently locate files that still most likely exist on the server (via URL modification) without guessing at file extensions.
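The queries named above follow a simple pattern, which the following Python sketch builds for any filename. It is only an illustration: the function name backup_queries and its parameters are hypothetical, and the optional site: restriction is an assumption added for convenience rather than something the text above requires.

```python
from typing import List, Optional


def backup_queries(filename: str, backup_ext: str = "bak",
                   site: Optional[str] = None) -> List[str]:
    """Build Google queries that look for a backup copy of filename.

    The three query shapes mirror the ones discussed in the text:
    a bare directory-listing search, a directory-listing search for
    the backup file, and an inurl search for the backup file.
    """
    target = f"{filename}.{backup_ext}"
    queries = [
        "intitle:index.of",
        f"intitle:index.of {target}",
        f"inurl:{target}",
    ]
    if site:
        # Assumption: narrowing each query to one site with the site: operator.
        queries = [f"site:{site} {q}" for q in queries]
    return queries


if __name__ == "__main__":
    for query in backup_queries("index.php"):
        print(query)
```

The sketch only builds query strings. Entering them into Google, and checking the cache link on any hit, is still done by hand.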


Figure 3-15. Cached Pages Can Expose Directory Listings

Directory listings also provide insight into the file extensions that are in use in other places on the site. If a system administrator or Web authoring program creates backup files with a .BAK extension in one directory, there’s a good chance that BAK files will exist in other directories as well.
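Once a backup extension has been spotted in one directory, the same naming pattern can be applied to files elsewhere on the site. The Python sketch below generates candidate backup URLs from a known URL. It is a minimal illustration under stated assumptions: the function name backup_candidates is hypothetical, the host www.example.com is a placeholder, and the two naming patterns (appending the extension, and replacing the original extension) are simply the common ones, not a complete list. The sketch builds URLs only and sends no requests.

```python
import posixpath
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit


def backup_candidates(url: str, backup_exts: Iterable[str] = ("bak",)) -> List[str]:
    """Return candidate backup URLs for the file named in url.

    For /docs/index.php and the extension "bak", this yields
    /docs/index.php.bak (extension appended) and
    /docs/index.bak (extension replaced).
    """
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    stem, _original_ext = posixpath.splitext(filename)

    names = []
    for ext in backup_exts:
        names.append(f"{filename}.{ext}")  # e.g. index.php.bak
        names.append(f"{stem}.{ext}")      # e.g. index.bak
    # Drop duplicates (a file with no extension produces the same name twice)
    # while keeping the original order.
    unique_names = list(dict.fromkeys(names))

    return [
        urlunsplit((parts.scheme, parts.netloc,
                    posixpath.join(directory, name), "", ""))
        for name in unique_names
    ]


if __name__ == "__main__":
    for candidate in backup_candidates("http://www.example.com/docs/index.php"):
        print(candidate)
```

Passing the extensions actually observed in a directory listing, rather than guessing, keeps the candidate list short and the number of requests you eventually make small.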


The Google cache is a powerful tool in the hands of the advanced user. It can be used to locate old versions of pages that may expose information that normally would be unavailable to the casual user. The cache can be used to highlight terms in the cached version of a page, even if the terms were not used as part of the query to find that page. The cache can also be used to view a Web page anonymously via the &strip=1 URL parameter, and can be used as a basic transparent proxy server. An advanced Google user will always pay careful attention to the details contained in the cached page’s header, since there can be important information about the date the page was crawled, the terms that were found in the search, whether the cached page contains external images, links to the original page, and the text of the URL used to access the cached version of the page. Directory listings provide unique behind-the-scenes views of Web servers, and directory traversal techniques allow an attacker to poke around through files that may not be intended for public view.

Frequently Asked Questions

Q: Searching for backup files seems cumbersome. Is there a better way?
A: Better, meaning faster, yes. Many automated Web tools (such as WebInspect) offer the capability to query a server for variations of existing filenames, turning an existing index.html file into queries for index.html.bak or index.bak, for example. These scans are generally very thorough but very noisy, and will almost certainly alert the site that you’re scanning. WebInspect is better suited for this task than Google hacking, but many times a low-profile Google scan can be used to get a feel for the security of a site without alerting the site’s administrators or Intrusion Detection System (IDS). As an added benefit, any information gathered with Google can be reused later in an assessment.
Q: Backup files seem to create security problems, but these files help in the development of a site and provide peace of mind that changes can be rolled back. Isn’t there some way to keep backup files around without the undue risk?
A: Yes. A major problem with backup files is that in most cases, the Web server handles them differently because they have a different file extension. So there are a few options. First, if you create backup files, keep the extensions the same. Don’t copy index.php to index.bak, but rather to something like index.bak.php. This way the server still knows it’s a PHP file. Second, you could keep your backup files out of the Web directories. Keep them in a place where you can access them, but where Web visitors can’t get to them. The third (and best) option is to use a real configuration management system. Consider using a CVS-style system that allows you to register and check out source code. This way you can always roll back to an older version, and you don’t have to worry about backup files sitting around.

1 Remember that filetype searches used to require a search parameter. They don’t anymore. In the old days, all filetype searches required the extension to be added as a search term. Filetype:htm would not work, but filetype:htm htm would!

