Truth 1. Getting noticed by spiders, robots, and crawlers

Spiders, robots, and crawlers are your friends. In the name of search engine optimization, you’ll not only learn to love them but actually go out of your way to attract them to your site.

In SEO, the terms spider, robot, and crawler are more or less synonymous, but don’t worry unduly—none has legs or feelers. So let’s consolidate and just use the term crawler, shall we? Just bear in mind that you’ll sometimes want to attract robots, or lace your site with “spider bait.” All refer to the same thing.

So, what’s a crawler, and why would I want one on my website, anyway?

A crawler is a program or automated script (often called a ‘bot, short for robot) that scuttles around the Web visiting URLs. Crawlers navigate from URL to URL by following links on the pages of the websites that they visit.

The major search engines continuously send their crawlers across the vast expanse of the Internet. Crawlers find web pages and copy the text and code on them—a process called spidering—and the search engines store those copies in their vast indexes. An index, essentially a database of all the pages on all the websites a search engine’s crawler can successfully visit, is what the search engine uses to provide lightning-fast results when you search. When you enter a query into a search engine such as Google, what you’re really querying is the search engine’s index, not the Internet as it exists at that very instant in time.
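The crawl loop described above—fetch a page, record its words, and note the links to follow next—can be sketched in a few lines of Python. This is an illustrative sketch only, not any search engine’s actual code; the sample page, “Widgets Inc.,” and its URLs are invented for the example.

```python
from html.parser import HTMLParser

class CrawlerView(HTMLParser):
    """Collects the two things a crawler cares most about:
    the text on the page and the outgoing links to visit next."""

    def __init__(self):
        super().__init__()
        self.links = []  # URLs the crawler would queue up
        self.text = []   # words and phrases to index

    def handle_starttag(self, tag, attrs):
        # A crawler discovers new URLs through <a href="..."> links
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Plain text is what gets copied into the index
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)

# A made-up page standing in for one the crawler just fetched
page = """<html><head><title>Widgets Inc.</title></head>
<body><h1>Our widgets</h1>
<p>See our <a href="/catalog">catalog</a> or <a href="/contact">contact us</a>.</p>
</body></html>"""

parser = CrawlerView()
parser.feed(page)
print(parser.text)   # the copy a search engine would index
print(parser.links)  # → ['/catalog', '/contact']
```

A real crawler wraps this in a loop: each URL in `links` is fetched in turn, its own text and links are extracted, and so on outward across the Web—which is why pages with no inbound links are so hard for crawlers to discover.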

Of course, web pages change. Sometimes, pages and sites change with great frequency. On top of that, new pages and sites are going up all the time. That’s why the crawlers are always out there, visiting and revisiting pages to build, grow, update, and refresh the search engines’ indexes.

What’s in a search engine’s index is what the crawler “sees” when it visits a web page, and this can differ quite considerably from what a human visitor to the page sees. If you want to see what a crawler sees when it looks at a specific web page, visit that page and view its source code (View > Source in Internet Explorer; Ctrl-U in Firefox, or Cmd-U on a Mac). Or, if you’re in Google, click the Cached link at the bottom of any search result to see the crawler’s most recent snapshot of the page.

First and foremost, crawlers are attuned to the words and phrases on each page they crawl. They crawl text and links. When you type a query into a search box, the search engine tries to match your search term with the web pages most likely to match the words in that query. In this sense, search engines echo the opening of the Gospel of John: “In the beginning was the Word.”

Different search engines have their own individual crawlers, and as you might expect, they don’t all behave exactly the same way. Some spiders fetch entire pages; others are easily bored and look at only some of the content. Most treat the copy in the page’s title and near the top as more important than what’s farther down the page. Crawlers can also hit roadblocks and bail out of entire websites—for example, when there are no links for them to follow, when they come up against some sort of weird (to them, at least) technology or code, or when they encounter technology that traps them so they cannot easily complete their work.

So, a very significant part of any SEO initiative is to make it as easy as possible for search spiders to find and to crawl your website. If your pages don’t get crawled, they won’t get indexed by the search engines. If a page isn’t in the index, searchers can’t find it because as far as the search engine is concerned, it simply doesn’t exist. Instead, the searcher will find other web pages, very possibly your competitors’. Links and intelligent site architecture are very much a part of building bridges from individual pages and sections of a website to other sections and pages, effectively paving a clear path for crawlers to follow.

Create a sitemap

One very basic way to help out the search engine crawlers is by creating a sitemap. A sitemap is a file (often in XML—Extensible Markup Language) that provides the crawler with a listing of all URLs on the site—at least the ones the site creator wants the crawler to see. Additional information about individual URLs can also be contained in the sitemap, such as when a given page was most recently updated, how often it changes, and how important each URL on the site is relative to the other URLs. (A homepage is usually more important than a “Contact Us” page, for example.)
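For reference, a minimal sitemap under the sitemaps.org protocol looks like the following; the URLs and dates are invented for illustration, and the lastmod, changefreq, and priority entries are the optional per-URL details described above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-06-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/contact.html</loc>
    <changefreq>yearly</changefreq>
    <priority>0.3</priority>
  </url>
</urlset>
```

Note how the homepage is marked both as changing more often and as more important (priority 1.0) than the “Contact Us” page (priority 0.3), giving the crawler a hint about where to spend its time.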

This information helps search engines crawl the site more intelligently. Google, MSN, Yahoo!, and Ask all accept sitemap submissions; note, however, that none guarantees that all submitted URLs will be crawled or indexed. (A list of tools for creating sitemaps is in Appendix A, “Resources,” located online at www.informit.com/title/9780789738318.)

Sitemaps are particularly useful for websites whose content cannot be reached through a browsable interface—for example, sites that house a large archive or database of information accessible only via internal site search. Remember, crawlers follow links, and often such information isn’t linked to.
