While the content of your website matters the most, making sure everything else supports better visibility of the content on search engines is important too. The following sections explain some ways you can do this.
Site maps inform search engines of the existence of pages within your site that are otherwise not discoverable; perhaps they are not linked to from other pages on your site, or from external sites.
Some CMSs provide plugins to generate site maps, listed at code.google.com/p/sitemap-generators/wiki/SitemapGenerators, or you can write one yourself using the guidelines at www.sitemaps.org/protocol.html.
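If you choose to write the site map yourself, a small script can produce it. The following is a minimal sketch in Python, assuming you already have a list of page URLs; the function name build_sitemap and the example URLs are hypothetical, and only the required urlset, url, and loc elements from the sitemaps.org protocol are emitted.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Serialize the sitemap namespace as the default namespace (no prefix).
ET.register_namespace("", SITEMAP_NS)


def build_sitemap(urls):
    """Return a sitemap document (as a string) with one <url><loc> per URL."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # ElementTree escapes characters such as "&" in the URL text.
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages; replace with the URLs of your own site.
    pages = ["http://example.com/", "http://example.com/about/"]
    print(build_sitemap(pages))
```

Saving the printed output as sitemap.xml in the root of your site gives you a file you can link to or submit.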
Once you have written your site map, you can let search engine spiders discover it when they crawl your website if you add a link to the sitemap by using the following:
<link rel="sitemap" type="application/xml" title="Sitemap" href="/sitemap.xml">
You can also submit the site map to individual search engines instead of linking to the site map within the HTML page, if you would like to make your page as small as possible.
You will likely have a staging server at some point, such as staging.example.com for your site example.com. If an external site links to files on the staging server (say you asked a question on a forum about a feature that was not working and linked to the staging server), search engines are likely to index it, even if the staging domain is not mentioned in any robots.txt file or does not serve a robots.txt file at all.
To prevent this, you can send the X-Robots-Tag HTTP header by appending the following code snippet to the .htaccess file on the staging server and uncommenting the directives at the end:
# ------------------------------------------------------------
# Disable URL indexing by crawlers (FOR DEVELOPMENT/STAGE)
# ------------------------------------------------------------
# Avoid search engines (Google, Yahoo, etc) indexing website's content
# http://yoast.com/prevent-site-being-indexed/
# http://code.google.com/web/controlcrawlindex/docs/robots_meta_tag.html
# Matt Cutts (from Google Webmaster Central) on this topic:
# http://www.youtube.com/watch?v=KBdEwpRQRD0
# IMPORTANT: serving this header is recommended only for
# development/stage websites (or for live websites that don't
# want to be indexed). This will avoid the website
# being indexed in SERPs (search engine results pages).
# This is a better approach than using robots.txt
# to disallow the SE robots crawling your website,
# because disallowing the robots doesn't exactly
# mean that your website won't get indexed (read links above).
# <IfModule mod_headers.c>
#   Header set X-Robots-Tag "noindex, nofollow, noarchive"
#   <FilesMatch "\.(doc|pdf|png|jpe?g|gif)$">
#     Header set X-Robots-Tag "noindex, noarchive, nosnippet"
#   </FilesMatch>
# </IfModule>
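After uncommenting the directives, it is worth confirming that the staging server actually sends the header. The following is a rough sketch in Python using only the standard library; the function names are hypothetical, and the parser assumes a plain comma-separated directive list without a user-agent prefix such as "googlebot:".

```python
import urllib.request


def blocks_indexing(header_value):
    """Return True if an X-Robots-Tag value asks crawlers not to index.

    Assumes a comma-separated list such as "noindex, nofollow, noarchive".
    The "none" directive is treated as equivalent to "noindex, nofollow".
    """
    directives = {part.strip().lower() for part in header_value.split(",")}
    return "noindex" in directives or "none" in directives


def fetch_x_robots_tag(url):
    """Send a HEAD request and return all X-Robots-Tag values joined by commas."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        values = response.headers.get_all("X-Robots-Tag") or []
    return ", ".join(values)


if __name__ == "__main__":
    # Hypothetical staging host taken from the example in the text.
    value = fetch_x_robots_tag("http://staging.example.com/")
    print("indexing blocked" if blocks_indexing(value) else "indexing allowed")
```

Running the script against a live URL of the production site should report that indexing is allowed, because the header is meant for the staging server only.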
Search engines treat the folder URLs http://example.com/foo and http://example.com/foo/ as two different URLs and would therefore consider their content to be duplicates of each other. To prevent this, rewrite the URLs either to change http://example.com/foo to http://example.com/foo/ or http://example.com/foo/ to http://example.com/foo.
To do this, edit the .htaccess file for the Apache server and add the following rewrite rules (see Chapter 5, Customizing the Apache Server, for details on how we edit .htaccess files).
The following code snippet rewrites example.com/foo to example.com/foo/:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}|/|#(.*))$
RewriteRule ^(.*)$ $1/ [R=301,L]
The following code snippet rewrites example.com/foo/ to example.com/foo:
RewriteRule ^(.*)/$ $1 [R=301,L]
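To see which request paths each set of rules would redirect, the logic can be modeled outside Apache. The following is a simplified sketch in Python, not a replacement for testing on the server: the function names and the is_file flag are hypothetical, the extension check is assumed to match a literal dot, and the model ignores other modules (for example, mod_dir) that may add their own redirects.

```python
import re

# Mirrors the second RewriteCond: skip URIs that end in a short file
# extension, already end in "/", or contain a "#" part.
SKIP_PATTERN = re.compile(r"(\.[a-zA-Z0-9]{1,5}|/|#(.*))$")


def add_trailing_slash(uri, is_file=False):
    """Return the redirect target for uri, or None if no redirect applies."""
    if is_file:                    # RewriteCond %{REQUEST_FILENAME} !-f
        return None
    if SKIP_PATTERN.search(uri):   # RewriteCond %{REQUEST_URI} !(...)$
        return None
    return uri + "/"               # RewriteRule ^(.*)$ $1/ [R=301,L]


def remove_trailing_slash(uri):
    """Return the redirect target for uri, or None if no redirect applies."""
    # In .htaccess, the RewriteRule pattern sees the path without its leading "/".
    path = uri.lstrip("/")
    match = re.match(r"^(.*)/$", path)   # RewriteRule ^(.*)/$ $1 [R=301,L]
    return "/" + match.group(1) if match else None


if __name__ == "__main__":
    for uri in ["/foo", "/foo/", "/style.css", "/"]:
        print(uri, add_trailing_slash(uri), remove_trailing_slash(uri))
```

Only one of the two functions corresponds to your site at a time, since you pick a single canonical form for folder URLs.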
If you have existing rewrite rules, perform the following steps to make sure you set up your rewrite rules correctly. Not doing so can cause incorrect redirects and 404 errors.
Back up the .htaccess file you are going to add redirects to before you start adding them. This way you can quickly go back to the backup file if you are unable to access your site because of an error in the .htaccess file.
RewriteBase path: If your website is in a subfolder, ensure you have set the right RewriteBase path for your rewrite rules. If you have a working RewriteBase path, do not remove it.
Finally, consider implementing guidelines from Google's SEO Starter Guide at http://googlewebmastercentral.blogspot.com/2008/11/googles-seo-starter-guide.html.