Chapter 3: Server-Side Strategies

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

3
Server-Side Strategies

Findability strategies extend beyond the client-side. The way you structure your files, build your URLs, design your 404 pages, and optimize your server for speed can significantly improve the findability of your site.

Oftentimes search engine optimization best practices are hyper-focused on making changes client-side. There are many simple things you can do sever-side as well that will make your site easier for search engines to index and help your audience find their way around better. You can significantly improve the findability of your site with file naming, choosing the right domain name, creating search engine friendly URLs, serving custom 404 pages, and plenty of server optimization for fast indexing.

File and Folders: The Power of a Name

What’s in a name? Well, when it comes to the names of the files and folders in your site, a lot. Creating keyword density on your pages can help lift your site in search listings, but there are other places besides in your markup that can help out. Search engines index keywords in the names of files and folders in order to understand the content of your pages so choosing relevant, descriptive names for them is important. Here are a few recommendations to keep in mind when naming your files:

Include keywords in file and folder names where natural, and certain ubiquitous files such as in the name of the logo, and default style sheet.

Separate keywords in file and folder names with a hyphen rather than an underscore to ensure that search engines can read each word individually rather than as one large word. For example, most search engines will read a file named my-page.html as “my” “page,” whereas my_page.html would be read as “my_page,” which is not likely to match a search query. Google recently updated its system to recognize individual words in file names separated by an underscore. Since other search engines could be tripped up by the underscore, stick with hyphen-delimited keywords in your file names to be safe.

Keep your keywords brief, and relevant to your audience. The more keywords you add, the more you dilute the power of each one.

Contrary to rumors, a .html file extension will not rank higher than a .php extension, so feel free to use the one that is appropriate for your page.

Choosing and Managing Domain Names

Domain names are also a great place to include keywords that can help users find you. When choosing a domain name, it doesn’t matter which extension you choose: .com, .net, .biz and all other extensions start on equal footing and will not in any way impact search engine rankings. Of course, .com domains tend to be more memorable to users because of their popularity, and therefore may be more desirable for word-of-mouth referrals.

Just as keywords in file names can help boost search listings, keywords in domain names are important as well. Keywords in a domain name separated by hyphens are individually readable, and are more desirable for search engine indexing, but can be a little trickier for users to remember. It may be a good idea to register your domain name with hyphens for search engines, and without for the memorability and easy verbal referral by users. You can park both domain names on the same server, pointing them to the same site. List the hyphen-delimited domain with search engines, and include the domain without the hyphens on print collateral and advertisements.

An added benefit of including your target keywords in your domain name is that it will help encourage all inbound links to your site to contain the same keywords, making your site more relevant to search queries.

The age of a domain name can play a significant role in the assigning of a Google PageRank (a mathematical formula that defines the reputation of a site specific to Google). Young domains are generally ranked low, since many disreputable link-farm sites created by spammers pop up temporarily to make money from link referrals then quickly disappear before search engines blacklist them. Consequently, older domains are always preferable. However, if you buy a recently expired domain name that has a high PageRank because of its age and accumulated reputation, the PageRank does not transfer to the new owner. Search engines can usually see when domain names change ownership, and will reset their rankings in such situations. You’ll just have to buy a domain name that is relevant to your audience, and wait for it to age like a fine wine.

Determining Your Google PageRank

Using the free Google Toolbar (http://toolbar.Google.com), you can determine your site’s Page-Rank just by viewing it in your browser. Pages are ranked on a 0–10 scale, where 10 is the highest ranking. When indexed for the first time, most sites start with a PageRank of around 2. As your domain name ages and you gain inbound links, the credibility of the site will increase along with your PageRank. PageRank is updated about four times a year, so be patient waiting for dividends from your efforts.

A high PageRank will make your site a stronger competitor for the keywords you target, thus raising your listing position on search engine return pages. See FIGURE 3.1.

Figure 3.1 The Google toolbar is an easy way to view PageRank.

Solving the Google Canonical Problem

When Google indexes sites, it sees URLs with and without the preceding www as entirely different sites. Referred to as the Google canonical problem, this indexing approach can negatively affect your PageRank as some inbound links to your site may include the www while others may not, which divides the number of links to your site from Google’s perspective and splits your PageRank. You can tell if your site is suffering from the Google canonical problem by checking the PageRank of a page with and without the www in the URL. If you see two different PageRanks, then you’ll want to fix this issue. We can solve the problem using an Apache module called mod_rewrite, which can automatically map all requests to a single, consistent format.

mod_rewrite is a handy Apache module that rewrites URLs when specified patterns are detected. It’s exceptionally powerful, and can provide solutions to a number of SEO-related challenges including the Google canonical problem. We’ll use it to execute a 301 redirect, sending the user’s browser to a URL with the www included in the URL.

Servers provide different HTTP status codes to indicate a response to a request. You’re probably familiar with a 404 status code that indicates the page requested was not found. A 301 status code is a redirect from the URL requested to another one specified by the server administrator.

To send out a 301 redirect status code, create a file called htaccess.txt with the following code to give Apache the message:

Force WWW

RewriteEngine On
RewriteCond %{HTTP_HOST} ^example.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

This code uses regular expressions to locate specific patterns when evaluating URLs. Regular expressions are commonly used in many scripting and programming languages to identify patterns in strings—a series of text characters. They can be intimidating because of the cryptic characters they use to locate specific text, and aren’t exactly an intuitive read without some prior research.

In this example we see strange characters that indicate text pattern scenarios. For example, the ^ indicates the start of the string, and the $ indicates the end of the string. Each special character has some pattern matching meaning. You can print out a quick reference to demystify regular expressions at http://www.ilovejackdaniels.com/cheat-sheets/regular-expressions-cheat-sheet/.

Now that you’ve got your rewrite code in place, upload the htaccess.txt file to the Web root of your server, then rename it to .htaccess. Naming the file .htaccess locally can cause problems, as it is a reserved name recognized by some operating systems such as Mac OS X, which would automatically hide it.

In case you are new to working with Apache, the .htaccess file provides on-the-fly configuration commands to the server and the various modules associated with it. Apache looks for this file whenever it needs to send out a response, and supports the use of unique .htaccess files in different directories for very granular control of the server’s configuration. We’ll be using it often to configure Apache to meet our findability goals.

301 Redirects and URL Rewriting on a Windows Server

If you are running a Windows server with IIS (Internet Information Services) you can still rewrite URLs, create 301 redirects and build search engine friendly URLs, but the method is a little different. Just like Apache uses modules to extend its feature set, IIS uses ISAPI (Internet Services Application Programming Interface) filters to provide features that are not available by default. An ISAPI filter can be used to rewrite URLs on the fly very much like mod_rewrite for Apache.

It’s possible to create your own ISAPI rewrite filter, but it can be rather complex. Instead, you may want to use one of the many that have already been developed. There are a lot of ISAPI rewrite filters that you could track down with a quick Google search, but most of them you’d have to pay for. IIRF (Ionic’s IIS Rewrite Filter) is a free rewrite filter that can do the same things as Apache’s mod_rewrite and with a similar syntax (http://cheeso.members.winisp.net/IIRF.aspx).

Using the mod_rewrite Cheat Sheet

mod_rewrite can be intimidating and cryptic if you are new to it because of the many mysterious regular expression characters. You may want to refer to Dave Child’s handy PDF mod_rewrite cheat sheet, which will introduce you to the basics, and even show you some common solutions for other SEO-related issues.

http://www.ilovejackdaniels.com/mod_rewrite_cheat_sheet.pdf

You can also learn more about mod_rewrite and see some good examples in the official documentation at http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html.

The first line in the code example turns mod_rewrite on, and the next one sets a condition to be on the lookout for any page request with the domain name in the URL. Of course, example.com will need to be changed to your domain name in order for this to work. The last line creates a rewrite rule that redirects users to their intended destination with the www in the URL.

If you’d like to do the inverse and remove the www, just make sure your rewrite condition contains the www, and remove it from the rewrite rule.

Remove WWW

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www.example.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

There is one other way to ensure Google is using the desired URL structure when ranking your pages. The Google Web Master tools at http://www.google.com/webmasters/sitemaps/, which are discussed further in the bonus chapter entitled “Free Search Engine Tools and Services” on the companion site (http://buildingfindablewebsites.com), allow you to define your preference to include or exclude the www in your URLs when indexing. Taking this approach is good, but doesn’t address canonical problems with other search engines, so it’s still a good idea to employ the mod_rewrite solution as well.

Building Search Engine Friendly URLs

Poorly designed URLs can stop a search engine spider dead in its tracks, resulting in an incomplete indexing of your site. URLs with GET variable strings or session IDs in them are sometimes viewed incorrectly by search engines, and at other times are ignored completely. A URL like this, though very common in many ecommerce sites and Web applications, is not conducive to search engine indexing, nor is it very usable for humans:

http://example.com/top-sellers/products.php?color=red&prodId=12

Everything after the question mark, called GET variables or a query string, is often ignored by search engines, which would likely result in the display of the page without the necessary variables required to render it with content from a database. Though search engines may not care for them, dynamic pages that use GET data are still essential to many websites, so it’s not an option to abandon them entirely in favor of static pages. There are a few different ways to build search engine friendly URLS, and we’ll explore two popular solutions, both of which assume you are running an Apache server and have the ability use a .htaccess file to configure it, as is typical of most shared hosting environments. In both cases, we’ll be rebuilding the above URL to something like this, which search engines and users will appreciate.

http://example.com/top-sellers/products/red/12

A Simple Solution

The simplest solution requires that your dynamic pages that use GET variables be named without their typical extension like .php so they can appear in the URL as if they were a directory. If you are creating a new site rather than modifying an existing one where changing file names is inconvenient, you can create extension-less file names then force Apache to recognize their file type. Again we’ll use a .htaccess file to configure Apache.

<Files products>
ForceType application/x-httpd-php
</Files>

The <Files products> tag indicates the file the server should keep an eye out for, and ForceType application/x-httpd-php forces the file to be recognized as PHP. If you had many dynamic pages that used GET variables, you’d need to replicate the above code in your .htaccess file to identify each page that should be forced to be PHP. Now that Apache will once again parse the PHP code, we’ll need to create an easy way to grab the GET variables off the end of the URL, as PHP will no longer recognize the GET variables at the end of the URL. To do this, we’ll need to split the URL into separate pieces then assign them to variables that PHP can once again use.

This function, saved in an external include file called getParams.php, receives a string indicating the base path from the Web root directory to the PHP page so it can locate where the parameter list starts in the URL, then returns an array for easy use. In this case, the base path will be /top-sellers/products/.

<?php
// Pull parameters off URL and return as an array
function getParams($baseUrl){
  if (!strstr($_SERVER['REQUEST_URI'],$baseUrl)){
      // error, base path is not in the URL
      trigger_error('$baseUrl is invalid: '.$baseUrl);
  }

  // pull paramters off of URI, replacing with an empty string
  if ($baseUrl != '/'){
      $fragment = str_replace($baseUrl,'',$_SERVER['REQUEST_
URI']);
  }else{
      $fragment = $_SERVER['REQUEST_URI'];
  }

  // convert "/" seperated params to an array
  $params = explode('/',$fragment);
  return $params;
}
?>

The $_SERVER['REQUEST_URI'] super global variable, which is built into the PHP language, retrieves the URL. Using PHP’s strstr() function to search for a string within a string, the base path can be found so we know where to start looking for the parameters we are after. An error is thrown if the base path is not found as the parameters would be impossible to locate without it. Assuming the base path was found, it gets deleted from the URL string by replacing it with an empty string. Finally, using the explode() function, which splits strings into an array with many values, the parameters string is turned into an array by looking for the / which separates them. The resulting array is then returned to the location where the function was called. Here’s what that array would look like if we used print_r() to write it to the page for a quick evaluation:

Array ( [0] => red [1] => 12 )

On the product page where the parameters need to be used, we simply include the file, call the function sending it the base path, then use the parameters as we like:

<?
require_once('inc/getParams.php'),
$parameters = getParams('/top-sellers/products/'),
$color = $parameters[0];
$prodId = $parameters[1];

// Do some database query with retrieved parameters
?>

Though a few extra lines of code are required in your dynamic pages to retrieve the GET data, the search engine friendly URLs are well worth the effort.

Using mod_rewrite

If you have mod_rewrite installed on your server, as is the case on most Unix/Linux hosting environments, then you may prefer to have it automatically rewrite your URLs for you, rather than having to name files without an extension. mod_rewrite maps URLs to other locations using regular expressions, which were introduced earlier in this chapter in the section entitled “Solving the Google Canonical Problem.” Let’s take a look at a simple rewrite rule that will remap the URL for the products.php script:

RewriteEngine On
RewriteRule ^products/([a-zA-Z0-9]+)/([0-9]+) products.
php?color=$1&;id=$2

The first line simply turns mod_rewrite on to execute the following command. The RewriteRule looks for the products script with a query string trailing, then converts it to products/ first GET variable in letters or numbers/second GET variable in numbers. You could add slots for more GET variables in your URLs by modifying the rewrite rule:

RewriteRule ^products/([a-zA-Z0-9]+)/([0-9]+)/([a-zA-Z0-9]+)
products.php?color=$1&id=$2&newVar=string

The function we used in the first example would also be used with the mod_rewrite approach, and could accommodate any number of GET variables you need to grab from the URL.

An important thing to note when testing this is that the above rewrite is not an auto redirection, just a remapping. If you try testing this by entering the original URL with the query string, it will not automatically redirect you to the new search engine friendly URL. It simply points the search engine friendly URL to the products.php page.

General Guidelines for URL Design

Before getting too far into the development of your site, you’ll want to take some time to consider the structure of your URLs, and the variables that will need to be passed within them. Keep these guidelines in mind as you hatch your search engine friendly URL plans:

Try to make your URLs predictable to users. Define a system that users will instantly understand, such as naming directories and/or files with the same names as the navigation labels, and be sure to stay consistent.

Where possible, avoid including too many dynamic parameters that can make the URL cumbersome to reference and impossible to type. Even a search engine friendly URL can quickly become unwieldy for humans.

Shorter, descriptive URLs are more convenient for people who wish to link to your site, thus encouraging inbound links or reference in printed materials.

If you can make good on these three simple recommendations, your site is going to be much more navigable for search engines, and your users.

Moving Pages and Domains with 301 Redirects

From time to time it’s necessary to move the location of a page, change its name, or even move to a totally different domain name. In these situations you want to make sure you get your users and search engines informed about the new location. As mentioned in Chapter 2, a meta refresh, which automatically sends users to a new URL using a special meta tag, is the wrong solution for the problem, as it is used by spammers to trick search engines into incorrectly indexing pages. You can redirect requests for the old URLs directly from the server using mod_rewrite to do a 301 redirection. There’s no negative stigma around 301 redirects in the eyes of search engines as there is with meta refresh, so there are no worries of being mistaken for a spammer. When search engines receive a 301 redirect, they automatically update their records to replace the old URL with the new one.

Imagine we’ve updated a contact page that was once simply HTML with the typical contact info, but now we’ve decided to add a PHP contact form so users can more conveniently contact the organization. To do this, the extension of the page needs to change from .html to .php so the server will run the page through the PHP engine. It’s a good idea to automatically send people to the new location in case users try to access the old HTML page that no longer exists. To make this happen, add the following code to your .htaccess file on your server:

# Redirect to new contact page
RewriteEngine On
RewriteRule ^contact.html(.*)$ /contact.php [L,R=301]

If you already have another rewrite rule in your file, you won’t need to turn the rewrite engine on again as is shown in the second line. Note that you can comment your .htaccess files by preceding your notes with the #. The last line simply looks for any URL that points to contact.html in the Web root directory, and sends a 301 HTTP response to redirect the request to contact.php.

This same approach is equally useful if you are moving to an entirely new domain name. Although the code is slightly more verbose, the ideas are the same:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^(www.)?old-site.com$ [NC]
RewriteRule ^(.*)$ http://www.new-site.com/$1 [R=301,L]

After the rewrite engine is enabled, a condition is set looking for the domain name with or without the proceeding www, and regardless of capitalization. Next, requests are redirected to the new domain name with the preceding www and with the trailing path that may have been entered for the original request. Keep in mind this assumes you have the same directory and file structure with the new domain as the old one.

Using the 301 redirect will help ensure the Google PageRank of the old domain gets transferred to the new one, but be aware that redirects can slow down the user experience, and Apache as it will have additional lookup tasks on top of serving the requested file. In most situations this will not be very perceptible, but on high-traffic sites redirects can cause a noticeable slowdown in server performance. With this in mind, keep your use of 301 redirects to a minimum for simple tasks like correcting changed directory structures, file names, or domain names.

A little forward thinking can help you avoid having to do a lot of redirecting when your site changes. If a number of files in your site have to switch from HTML to PHP, rather than changing the file extensions, you could force Apache to run all HTML files through the PHP engine. By taking this approach all of your existing file extensions can stay the same. You can change Apache’s configuration in your .htaccess file as follows:

AddType application/x-httpd-php .php .html

This simple line of code tells the server that all files with a .php or .html extension need to be run through the PHP parsing engine.

Alternatively you could just not include an extension on your files so you have the flexibility in the future to have the server deliver them as plain HTML or run them through the PHP parsing engine. This can be a little more tedious to manage, as you’d need to tell Apache what file type is to be used on your anonymous files. You can do this using ForceType in your .htaccess file as illustrated earlier in this chapter in the section about creating search engine friendly URLs, entitled “A Simple Solution.” Extension-less files can be troublesome when you are building them in your development application because they will probably not show color-coding, code hints, and other features the software offers that are dependant upon knowing the file type.

Getting Users Back on Track with Custom 404 Pages

The default 404 error pages displayed by servers are cryptic to novice Web users, provide no explanation of why the error occured, and offer no means of getting back on track. They are a dead end that can frustrate users, and cause them to leave your site. The solution to the problem is to create a custom error page that can provide a clearer description of the problem, and offer solutions that could get your visitors back on track. Triggering the display of custom error pages is as simple as modifying your .htaccess file to tell the Apache server what to do when a URL can’t be resolved:

ErrorDocument 404 http://www.example.com/404.php

This single line of code is all it takes to trigger the display of 404.php in the Web root directory when a requested file is not found. Many shared hosting environments actually have 404 page handling options in their control panels, and may even have a template to get you started. Should your server not support the method shown above, you could instead use mod_rewrite to determine when a request is neither a file nor a directory, then trigger the display of the custom 404 page. This is a little less desirable because it is slightly more complex, but is still a great solution if your options are limited:

#Request is not a file
RewriteCond %{REQUEST_FILENAME} !-f
# Request is not a directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .? /404.php [L]

Windows users can download and install Xenu (http://home.snafu.de/tilman/xenulink.html), which also helps you identify broken links but from your local machine. If you’re running a Windows IIS server you can set custom 404 error pages just by changing a few properties for your site. 404 error handling is built right into IIS. Microsoft offers a helpful tutorial on the subject at http://www.microsoft.com/technet/prodtechnol/windows2000serv/technologies/iis/tips/custerr.mspx.

The Elements of a Successful 404 Page

Beyond knowing how to trigger a custom 404 page, it’s even more important know how to design one. A well-designed 404 page should help users recover from wrong turns or broken links without technical jargon or finger-pointing messages admonishing your users for their ignorant navigation choices. It’s a great opportunity to turn an otherwise frustrating user experience into a positive one, and keep users on the site longer. To get the best results, keep these simple guidelines in mind:

The design should be consistent with the rest of the site.

Include a clear message acknowledging the problem, indicating the possible cause (don’t admonish your users), and offering solutions.

Avoid technical jargon as many users may not know what a 404 error means.

Include the search box so users can find what they are looking for.

Include a link to the home page.

Include a link to the site map that provides links to all of the main sections of the site.

Include an option to report a broken link (contact form link).

Optionally, suggest popular destinations on the site.

Never auto-redirect users to another page, such as the home page, as users will have no idea that an error occurred and will be confused about what happened.

Don’t be afraid to inject some humor into your copy for your 404 pages. No one likes getting lost on a site, but finding something funny on the page makes the experience like finding an Easter egg. FIGURES 3.2–3.5 show a variety of examples of successful 404 error pages.

Figure 3.2 Jamie Huskisson (http://www.jhuskisson.com) uses humor to make the experience of getting lost on his site a little more like finding an Easter egg.

Figure 3.3 Garrett Dimon’s (http://garrettdimon.com) 404 page points lost users to areas in the previous version of his site which may contain the content they seek.

Figure 3.4 Dan Cederholm’s site http://simplebits.com provides a number of options to get the user back on track.

Figure 3.5 The popular designer T-shirt site Threadless (http://threadless.com) uses server-side scripting to look at the keywords in the URL so it can suggest links to the pages the user may have been searching for.

Optimizing Performance for Efficient Indexing

The loading speed of pages is a significant issue for search engine spiders, which have the mandate to crawl and index the Web as fast as possible. Optimizing content delivery can help ensure that spiders can crawl pages quickly, efficiently, and create more complete listings. As an added bonus, your users will appreciate the performance boost too, as they’ll spend less time waiting for content, and more time enjoying your site!

The speed at which your content is served can be affected primarily by the way the server is set up to send responses and by the files that are being served. Achieving optimal performance is a bit like alchemy, requiring a number of smaller tasks that can add up to a big increase in response and rendering speeds. Let’s start our performance optimization with some simple server configuration that will make a significant difference in indexing speed.

Cache and Dash: Getting Clients To Cache Files

The temporary memory of a client, known as the cache, can significantly speed up the browsing of a site. By the way, the term client refers to any browser, search engine spider, or other entity making a request of the server. As a client views a site, it stores common files in the cache so it doesn’t have to continually make redundant trips to the server for the files it’s already requested. Clients will often skip redundant requests if the file they have stored in memory is still current. Usually servers are not by default set up to define the expiration date of files, which prevents clients from taking full advantage of caching and thus significantly slows the crawling/viewing of a site.

In order to really wrap our heads around the idea of caching files, and understand how to configure Apache to give us the speed boost we want, we’ll need a working knowledge of HTTP, the protocol used when servers and clients communicate on the Web. Here’s a quick primer to set the stage for our server-side optimizations.

The back-and-forth dialog between a client and a server is delivered via HTTP or Hypertext Transfer Protocol, a standardized communication protocol that is split into requests and responses. All HTTP communication includes some simple bits of text at the top of the message, called headers, that provide lots of information about the communiqué including date and time information, which are central to our optimization goals. The HTTP headers in a request from a client are different from those in a response from a server.

HTTP headers are normally invisible to the average Web user, but you can take a peek behind the scenes using a shareware HTTP monitoring program called Charles (http://www.xk72.com/charles/). Charles acts as a proxy through which all HTTP request and response headers pass as you navigate sites in a browser. It can be used on Mac OS X, Windows, and Linux and will work with any browser. If you’re using Firefox you’ll need to download the free add-on from the Charles site to route all request and response traffic through Charles. FIGURE 3.6 shows Charles revealing some typical HTTP request headers sent by the client to the server, the last of which is centrally important for search engines.

TIP
If you’re doing any work with Ajax, Charles can be an exceptionally useful tool, as it will show your XMLHttpRequests (XHR) as each call to the server is made.

The very last request header shown in FIGURE 3.6 is If-Last-Modified, one that search engine spiders often use to determine if they have to re-index a file or if they can instead rely on their cached content. As its name suggests, If-Last-Modified compares the modification/expiration date of a file to the date of the cached file. If the file has changed then the spider will re-index it, but if it hasn’t the server sends out a 304 response, meaning there is no need to request this file.

Figure 3.6 Using Charles, a shareware HTTP monitor/proxy, you can observe HTTP requests and responses for any website in your browser. The If-Last-Modified request header shown here sent by the client to determine if the file requested has changed since it was last downloaded.

It may, at first, sound like a negative thing to encourage spiders to skip the indexing of certain content, but it’s actually a huge benefit. If spiders are able to rely more on their cached files from a previous crawl of your site then they will be able to cruise through at a much faster speed, grabbing only the new stuff. Files like your logo, master style sheet, or common JavaScript libraries can take a long time to download, but do not get updated regularly. The files you want search engines to re-index regularly are likely to be your HTML or PHP files where you’ve added new content or made changes. Using an .htaccess file, you can configure your server to tell clients to regularly index the important file types, and only periodically download others that rarely change.

Our .htaccess file will configure the Apache module called mod_expires, which is typically installed with most Apache servers, to set the expiration headers in HTTP responses. Here’s a simplified example of how it works:

<IfModule mod_expires.c>
  ExpiresActive On
  # All Files: 1 month from access date
  ExpiresDefault A2592000
</IfModule>

To be cautious we first check to see if mod_expires is available on the server; if it is, the script fires it up and then defines the expiration date to one month from when the client accessed the file. The ExpiresDefault is doing all of the heavy lifting for us by creating a universal expiration date for all files downloaded regardless of file type. The A indicates the time the client accessed the file, and the number following it is the number of seconds in one month. Although handy, this is not quite as practical as setting unique expiration dates for each file type as certain files don’t need frequent indexing but others certainly do. Following the same idea, here’s an example that segments the expiration dates of all files so some can be indexed regularly and others can be passed over:

<IfModule mod_expires.c>
  ExpiresActive On

   # 1 Year: Ico, PDF, FLV
  <FilesMatch ".(ico|pdf|flv)$">
    ExpiresDefault A2419200
  </FilesMatch>
   #1 Month: Jpg, PNG, GIF, SWF
  <FilesMatch ".(jpg|jpeg|png|gif|swf)$">
    ExpiresDefault A2419200
  </FilesMatch>

   #1 Month: XML, TXT, CSS, JS
  <FilesMatch ".(xml|txt|css|js)$">
    ExpiresDefault A2419200
  </FilesMatch>

   #2 Days: HTML, PHP
  <FilesMatch ".(html|htm|php)$">
    ExpiresDefault A172800
  </FilesMatch>
</IfModule>

The FilesMatch condition can be used to group certain file types to share a common expiration date. Files such as PDF and FLV, which are less likely to change and can significantly slow down the indexing of your site, will be cached for an entire year. HTML and PHP, on the other hand, are cached for just two days to maintain freshness. With the expiration dates logically assigned, you remove a significant burden from search engines, which will have to download fewer files when indexing your site. Some added benefits of this optimization are that you will conserve server bandwidth, thus saving on the costs associated with it, and your human users will also see much faster load times when their cache is primed with your site files.

Another way to set the expiration date of files is by using the Apache module called mod_headers to define the cache-control. The ideas are very similar, as is the code, and both do essentially the same thing. The difference between using mod_headers and mod_expires is that mod_headers will actually override all other cache settings, but will not work with the older HTTP 1.0 specification. Since most browsers from the late 1990s and on have supported HTTP 1.1 or better, this caveat seems less compelling. Let’s take a look at the mod_headers approach, which would also make use of an .htaccess file to modify Apache’s configuration:

<IfModule mod_headers.c>
  # Default: 1 Week
  Header set Cache-Control "max-age=2592000, public"

  # 1 Year: Ico, PDF, FLV
  <FilesMatch ".(ico|pdf|flv)$">
    Header set Cache-Control "max-age=29030400, public"
  </FilesMatch>

  # 1 Month: Jpg, PNG, GIF, SWF
  <FilesMatch ".(jpg|jpeg|png|gif|swf)$">
    Header set Cache-Control "max-age=2592000, public"
  </FilesMatch>

  # 1 Month: XML, TXT, CSS, JS
  <FilesMatch ".(xml|txt|css|js)$">
    Header set Cache-Control "max-age=2592000, public"
  </FilesMatch>

# 2 Days: HTML, PHP
  <FilesMatch ".(html|htm|php)$">
    Header set Cache-Control "max-age=172800, public"
  </FilesMatch>
</IfModule>

Again, we start by determining if the module is installed, and if it is we execute the header modifications. The first cache-control sets a default expiration of one month for all files unless specified otherwise. Again, the times are set in seconds, so you’ll need to break out your calculator to modify the expiration times to fit your needs.

With the cache-control or ExpiresDefault set, clients like search spiders can take advantage of caching, relying more on their archives and spending less time making redundant requests to the server. If you have content that changes regularly you can certainly adjust the expiration and caching threshold to ensure your audience is finding your latest content.

Setting Expiration Date Headers for Dynamic Content

If your content is coming from a database it’s likely to have a different modification date than the file itself. The expiration date you’ll want to deliver will need to be calculated from the date the content in the database was modified, rather than by the file modification date. To do this you’ll need to have some sort of date/time stamp associated with each record entered when it is created or updated. When you pull the content from the database, you can manually set the expiration headers using the built in PHP function header(). Be sure to set these headers before outputting any content to the page, as the HTTP headers get sent to the client immediately when anything is written to the page. Once the HTTP headers ship has sailed, you can’t make changes as needed.

The W3C provides further explanation of this issue along with a couple of simple PHP examples at http://www.w3.org/QA/2007/07/the_way_of_web_standards.html.

Managing File Size

Obviously, the size of the files the server is sending to the client can really slow down the download speed of a page. There are many techniques to keep the size of your files down, all of which can and should work in concert to shave off download time wherever possible.

A very general rule of thumb to keep in mind is that the collective weight of a page including HTML, images, CSS, JavaScript, and other accoutrements should stay below 100k if possible to facilitate speedy downloads. Of course, life isn’t always so utopian, and there are situations where larger file sizes can’t be avoided.

Inside of your HTML files you can trim file size by simply avoiding excessive white space, and overly verbose commenting. Compact your code and trim comments as much as possible without compromising legibility for you and your teammates who have to manage the files.

Naming utility folders such as your images, JavaScript, and CSS folders with shorter names like i, js, and c eliminates text in your files where frequent links are made to content in these directories. Since this type of content is not what we want search engines to index, short, keyword-poor names like this will have no bearing on the keyword density that we desire.

Externalizing all CSS and JavaScript and avoiding inline or local style sheets and obtrusive in-page embedding of JavaScript allows clients to cache this content, which is often shared between pages on a site. Internalizing this type of code defeats the server-side caching efforts discussed above. Although we’ll be discussing some JavaScript best practices in Chapter 7, consult Jeremy Keith’s book DOM Scripting (http://domscripting.com/), published by Friends of Ed, for further explanation of the concepts and benefits of unobtrusive JavaScript.

JavaScript files, such as the popular frameworks and libraries used in many Web applications today, are notoriously bloated with white space, comments, and functionality that may not even be desired. Purist JavaScript developers like Peter Paul Koch (aka PPK, http://quirksmode.com) argue that the best solution to this bloat is to build your own JavaScript libraries to include just the essential functionality that you need. Other developers argue that this is unnecessarily reinventing the wheel, and is not always the most practical solution. Whichever side of the argument you come down on, there are some useful methods for keeping your JavaScript files smaller.

Minifying is the process of removing white space (such as tabs, new lines, and spaces), and comments, leaving a lean file that is not very practical while you’re making updates, but it’s great when development and testing are complete. Amazingly, minifying your JavaScript can reduce file size significantly, sometimes by 50 percent or more. It would be painfully tedious to do this by hand, but there are a number of great, free tools on the Web that will do it for you. Some minification tools go even further to shrink your files by renaming variables and all references to them with shorter names, shaving even more off the file size.

Dean Edwards has created an exceptionally useful JavaScript minifier called Packer (http://dean.edwards.name/packer/). Packer (see FIGURE 3.7) is a simple little Web app in which you paste your code, click “pack,” and out comes a minified version of the script without comments or white space. It provides the option of renaming variables in the packing process to further decrease the files size of your JavaScript. Packer is also available in .Net, PHP, and Perl scripts that could be integrated into your Web applications to automate file compression.

In order for Packer to produce a working, compressed version of your JavaScript you’ll need to make sure your code is well formed with proper syntax.

Figure 3.7 Dean Edwards’ JavaScript minification tool Packer can significantly decrease the file size of your external JavaScript. The file size of this script has been decreased 44 percent.

All of your semicolons and curly braces need to be in order for it to work properly. Be sure to test your JavaScript files in your browser after minification, as the process can sometimes introduce errors.

TIP
The popular JavaScript framework Prototype (http://www.prototypejs.org) doesn’t take well to minification because it uses a slightly less formal coding approach. Running Prototype through Dean Edwards’ Packer will produce an unusable file. Instead of minifying Prototype yourself, use one of Steve Kallestad’s working compressed versions (http://www.stevekallestad.com/blog/prototype_and_scriptaculous_compressed.html) or run Prototype through jscompact (http://jscompact.sourceforge.net/) before running it through Packer.

Some JavaScript libraries like MooTools (http://mootools.net/) offer the option of building a custom version a la carte, selecting just the functionality you need. Once you’ve selected the features you want, MooTools also offers compression options, including running the code through Dean Edwards’ Packer automatically. Choose compression when you are ready to deploy your site, but avoid it while you are still in development so you can explore or change the file if needed.

CSS files can also be minified in much the same way as JavaScript, removing comments and white space. CSS files written with each declaration on its own line can increase file size exponentially because of the huge amount of added white space. CSS Drive (see FIGURE 3.8) has a useful CSS minification tool that can crunch your files down to their most efficient sizes (http://www.cssdrive.com/index.php/main/csscompressor/).

Figure 3.8 The CSS compression tool from CSS Drive can create further file size savings to help your site load efficiently for search engines and users.

Compressing Files with Gzip

Yet another way to save file size and download time is by using a standard HTTP compression called Gzip. Gzip can compress common text files such as CSS, JavaScript, and XML, making significant file size reductions. Most modern browsers now support HTTP 1.1 or greater, which is required to decompress Gzip files once they’ve been downloaded, making Gzip compression a safe option for most any project. Manually Gzipping your files before posting them would get old quickly; an automated approach would be much more convenient.

Niels Leenheer of http://rakaz.nl/ has developed a brilliant yet simple solution¹ that fits the bill, compressing each file on the fly when requested from the server. Anytime a client requests a CSS or JavaScript file, the Leenheer technique routes them through a PHP script that works its compression magic. Leenheer’s script does more than simple Gzip compression, though; it actually combines multiple JavaScript and CSS files into one larger Gzip file so only one HTTP request is necessary from the server, improving download times even further. Each HTTP request to a server slows down the load time of a page, so combining many requests into one can be a huge time saver.

1. http://rakaz.nl/item/make_your_pages_load_faster_by_combining_and_compressing_javascript_and_css_files

It actually ends up saving HTML code too. Rather than listing multiple script tags in your HTML to connect to JavaScript files, you instead make just a single call, separating the file names with commas:

One drawback of Leenheer’s approach is that the time required to compress many files on the fly can be rather long, negatively offsetting a good portion of the speed increase. Leenheer solves this problem by creating a writable

Using Apache’s mod_deflate and mod_gzip

There are other options for on-the-fly compression of your files. Apache 2.0 supports a module called mod_deflate that can be configured to automatically compress your files for you. If your server is running Apache 1.3 you’ll need to use mod_gzip for your compression. Stephen Pierzchala’s articles on SitePoint.com do a good job of walking you through the configuration of each module.

http://www.sitepoint.com/article/web-output-mod_gzip-apache

http://www.sitepoint.com/article/mod_deflate-apache-2-0-x

cache folder on the server where the combined and compressed file can be stored for immediate access, circumventing the need to continually perform processor intensive compression.

According to Leenheer, with his combine and compress approach eight external JavaScript files collectively weighing in at around 168kb and taking 1905ms to download can be compressed down to 37kb, downloading in just 400ms. That’s an 88 percent decrease in file size and an 80 percent decrease in load time!

Leenheer’s solution uses mod_rewrite to run all CSS and JavaScript through the PHP script automatically, making setup and maintenance a snap. Place the following code in your .htaccess file to set up the automatic routing through the PHP script:

RewriteRule ^css/(.*.css) /combine.php?type=css&files=$1
RewriteRule ^js/(.*.js) /combine.php?type=javascript&files=$1

Assuming you have folders called “js” and “css” in the root directory of your server, Apache will send any request for files within them through combine.php appending two GET variables indicating the file type and file name so the script can correctly fetch, combine, compress, and serve the files. You can download the combine.php file at http://rakaz.nl/projects/combine/combine.phps. You’ll need to make a few simple configuration changes at the top of the script to fit the directory structure of your server, upload it to the root directory on your server along with your newly modified .htaccess file, and you’re site is ready for super speedy deliveries.

Reducing HTTP Requests

As mentioned above, the number of HTTP requests a page has to make to the server can really slow down the load and render time of your pages. Only two or four HTTP requests can be processed from a single domain at a time by browsers, depending on the version of HTTP being used and the browser itself. This means that regardless of file size, when the browser encounters multiple JavaScript file requests in the head of an HTML page, it has to wait for two to load before loading two more, or continuing to the images, CSS, and other elements on the page. It’s like a planned traffic jam! Each HTTP request requires some set up and tear down to execute, and can be taxing on the server especially on high-traffic sites. Reducing the number of trips to the server is one of the most effective ways to speed up the delivery of your content to clients.

Already we’ve seen some solutions to the problem. By setting the HTTP expires headers to future dates we avoid unnecessary requests to the server, as clients will not request files that are still fresh in their cache. Leenheer’s combine script cuts out a number of requests to the server by concatenating multiple files into one compressed file.

Some Web designers and developers opt to split their style sheets into logically organized, separate files creating one document for fonts, one for colors, one for layout, one to reset the browser to default values, etc. This practice unnecessarily increases the number of server requests required for a page to render. It’s a better idea to keep your core style rules in one document, only adding additional documents when absolutely necessary such as when working with browser specific or print style sheets.

Keeping images in one larger file rather than in many separate files can save many server requests. Dave Shea’s CSS sprites technique described in issue 173 of A List Apart (http://alistapart.com/articles/sprites) is a great way to conserve HTTP requests. The basic idea is to combine the main images in your design into one file, then with a little CSS kung fu you can position this image in the background of elements, revealing just the subsection of the larger image you want. Although originally intended for rollover and image map effects, CSS sprites could be used as a simple HTTP request elimination technique.

Be a miser with your HTTP requests wherever possible, and you will notice significant speed increases that users and search spiders will appreciate.

Diagnosing Performance Problems with YSlow

As you complete the various optimization recommendations outlined so far, it’s a good idea to check your work with YSlow (see FIGURE 3.9) to see how much you’ve decreased the load time of your pages. Developed by Yahoo!, YSlow is a handy Firefox add-on that extends Firebug (http://getfirebug.com)—another Firefox add-on with a number of Web developer tools—to provide a 13-point checklist and scoring system for evaluating the various factors that can negatively impact the load speed of your site. To use YSlow, you’ll need to first install Firebug, restart Firefox, and then install YSlow. YSlow evaluates your page as it loads, checking it against the Yahoo! page performance rules (http://developer.yahoo.com/performance/rules.html).

Figure 3.9 YSlow grades you on each of Yahoo!’s page performance rules, then assigns an overall performance grade.

When you score poorly on a performance rule, YSlow provides a more detailed explanation of what specifically is affecting the rating (see FIGURE 3.10). In addition to performance testing, it also provides stats about your page and information about each component associated such as images, CSS, and JavaScript.

Figure 3.10 YSlow lets you know why your page scores poorly, so you can fix it.

If you’ve followed the performance optimization tips we’ve covered so far, chances are your page performance score will be reasonably good. Whether you are optimizing an existing site or developing a new one, YSlow is a great way to pinpoint problem areas that will slow down search engine indexing.

Controlling Search Engine Indexing with Robots.txt

There are situations when you don’t want search engines digging through some files and directories that need to remain private. By creating a file called robots.txt in your Web root directory, you can prevent spiders from indexing certain content on your site such as dynamic search results pages that may display improperly without user input, 404 pages, image directories, login pages, or general content to which you don’t want to direct search traffic. All search spiders automatically look for this file in your root directory, so all you need to do is create it, upload it, and wait for the spiders to read it. The robots.txt file does not secure your content in any way; it simply prevents search engine indexing. In fact, anyone can read your robots.txt file by simply going to the domain name /robots.txt. For an entertaining example, try out this much-viewed robots.txt file:

http://whitehouse.gov/robots.txt

Writing the file is a piece of cake. Here’s a short example of the important elements:

User-agent: *
# My private folder
Disallow: /private-folder/
Disallow: /404.php

Start your robots.txt file by defining the search spider user-agent you want to receive the message. The asterisk indicates that you are universally communicating

to all spiders. Preventing spiders from indexing content is done with the keyword Disallow: followed by the path to the private content. This example hides both the 404 error page, and a private directory. The # is used to create comments in your file so you can keep track of what you’ve done.

Each user-agent has a unique name you’ll need to know if you wish to segment your commands. Here’s a quick reference of the popular user-agents you can target with your robots.txt file:

Google: “googlebot”

Google’s Image Search: “Googlebot-Image”

MSN Live Search: “msnbot”

Inktomi: “Slurp”

AllTheWeb: “fast”

AskJeeves: “teomaagent1” or “directhit”

Lycos: “lycos”

Let’s take a look at another example, this time preventing the Google Image Search spider from indexing a folder with photos in it:

User-agent: Googlebot-Image
Disallow: /photos/

You can find the robots.txt mother ship at http://www.robotstxt.org/ along with plenty of explanation and more examples.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 3: Server-Side Strategies

Create new playlist

Sign In

Sign Up

3 Server-Side Strategies