HOUR 14
Improving Searches

What You’ll Learn in This Hour

• Three main aspects of searching

• Adding IFilters

• Customizing Search Server 2010 Express

• Improving all your searches

This hour delves deeper into SharePoint searching with the emphasis on the additional functionality that Search Server 2010 Express provides.

Searching Aspects

Following are three main aspects to searching:

• Crawling: Looking for documents, files, and data. A site such as Google, which lives by providing information from everywhere, wants to crawl as many locations as possible. If you have a site on the Internet that you don’t want crawled, you need to say so explicitly (a robots.txt file is the usual way).

In the case of a crawl of a company site, the intention is usually to crawl only meaningful locations for data, so you restrict what is crawled. As discussed in this hour, you define both what should be crawled and what shouldn’t be.

• Indexing: Before you can index all the files that the crawl process finds, you need to “translate” the contents of file formats into words that the indexer can understand.

• Searching: Making sense of what you have indexed so that you have quality links high on the results listing.
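To make the three aspects concrete, here is a minimal Python sketch of the same pipeline. It is a toy, not how SharePoint works internally: the DOCUMENTS dictionary, the excluded prefix, and all the function names are made up for the illustration.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "site": path -> text (a stand-in for real files).
DOCUMENTS = {
    "/docs/budget.txt": "Budget figures for the sales team",
    "/docs/plan.txt": "Project plan for the sales launch",
    "/private/notes.txt": "Private notes about the budget",
}


def crawl(documents, excluded_prefixes=()):
    """Crawling: decide which locations are visited at all."""
    return [path for path in sorted(documents)
            if not any(path.startswith(p) for p in excluded_prefixes)]


def build_index(documents, paths):
    """Indexing: turn each document's contents into words the index can store."""
    index = defaultdict(set)
    for path in paths:
        for word in re.findall(r"[a-z0-9]+", documents[path].lower()):
            index[word].add(path)
    return index


def search(index, query):
    """Searching: return the documents that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


paths = crawl(DOCUMENTS, excluded_prefixes=("/private/",))
index = build_index(DOCUMENTS, paths)
print(search(index, "sales budget"))
```

Because /private/ is excluded at the crawl stage, its file never reaches the index, so no search can return it. That is the reason for restricting crawls rather than trying to filter results afterward.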

Using IFilters to Translate the Contents of Files

This section looks at IFilters—how they are used and how new IFilters can be added.

What Are IFilters?

Microsoft uses pieces of code, called IFilters, to make sense of the contents of files that its search routines find. Some IFilters are built in to Microsoft’s search products. For instance, SPF 2010 and Search Server 2010 Express both include IFilters for many common file formats out-of-the-box.

The situation with WSS 3.0 and Search Server 2008 Express was that Search Server 2008 Express contained several more IFilters than WSS 3.0 out-of-the-box (because it came out more than a year later), but you could add all those missing IFilters (and more) by installing a free Filter Pack from Microsoft.

SPF 2010 and Search Server 2010 Express were released at the same time, so they contain the same basic set of IFilters. In addition, you may have noticed (Hour 2, “Installing SharePoint Foundation 2010,” Figure 2.6) that one of the things the Prerequisites function installs is Filter Pack 2.0, the latest version of the Filter Pack that could be added to WSS 3.0. Even so, there will be cases where you have files of types for which no IFilter is installed. In that case, although the crawl process finds the files, the index process does not index their contents.

Files created by most main Microsoft products already have IFilters included either in the product or in the Filter Pack, but there are some that are not covered (such as Office template files). Similarly although standard file types such as .zip and .txt also have IFilters included, the file types of most non-Microsoft products are not supported by built-in (or Filter Pack) IFilters; so if you have files of such types, you need to find IFilters for them if you want to search the contents of those files.
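The idea of one filter per file format can be sketched in a few lines of Python. This is only an analogy under my own assumptions: real IFilters are COM components registered with Windows, and the FILTERS registry and function names below are invented for the example.

```python
import csv
import io
import os


def filter_txt(data: bytes) -> str:
    """Plain text needs no translation beyond decoding."""
    return data.decode("utf-8", errors="replace")


def filter_csv(data: bytes) -> str:
    """Flatten the cells of a CSV file into one stream of words."""
    rows = csv.reader(io.StringIO(data.decode("utf-8", errors="replace")))
    return " ".join(cell for row in rows for cell in row)


# Hypothetical registry: file extension -> function that extracts text.
FILTERS = {".txt": filter_txt, ".csv": filter_csv}


def extract_text(filename: str, data: bytes):
    """Return indexable text, or None when no filter handles the file type."""
    extension = os.path.splitext(filename)[1].lower()
    extractor = FILTERS.get(extension)
    if extractor is None:
        return None  # the file was found, but its contents stay unindexed
    return extractor(data)


print(extract_text("notes.txt", b"hello world"))
print(extract_text("table.csv", b"name,city\nAnn,Oslo\n"))
print(extract_text("manual.pdf", b"%PDF-1.4 ..."))
```

The last call returns None because nothing is registered for .pdf. Installing an IFilter corresponds to adding one more entry to the registry.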

The usual place to go is the manufacturer of the product that you use because it has an incentive to provide such a filter, but sometimes it either can’t be bothered or provides a poorly working IFilter; then you need to look for third-party companies specializing in creating IFilters. By far the most common file format for which an IFilter needs to be found is the .pdf format used by what used to be called Adobe Acrobat.

Microsoft seems to delight in making working with Adobe products difficult; you may remember that it does not provide icons in the SharePoint products for the .pdf file type even though you can save documents from Word 2007 and 2010 in the .pdf format. The same is true for IFilters. Microsoft has never provided an IFilter for the .pdf file type and instead suggests you go to Adobe or to a third party for one. Until 2009, Adobe repaid the compliment by making it impossible to have a .pdf IFilter in a SharePoint system without installing the entire Adobe Reader product on the server. Luckily this has now changed.

The following section gives some information about Adobe’s own 64-bit IFilter support for its .pdf format, before mentioning where a third-party IFilter for .pdf support is available. Test both if you have the time. If you don’t have the time and can afford its small cost, use the third-party product. It is generally considered to be faster and more accurate than the free version from Adobe, and it is likely that if it hadn’t existed, Adobe would never have bothered making its own (separate product) version available.

Even if you intend to use the chargeable product, read the following section because it gives details of how in an installation of SPF 2010 + Search Server 2010 Express you get to the Administration Pages for Search. It’s by no means as simple as it was in the previous version (WSS 3.0 + Search Server 2008 Express).

Adding an IFilter for the Adobe Acrobat (PDF) File Type

Thanks to customer pressure over many years, Adobe finally in 2009 made available a separate 64-bit IFilter for .pdf files. Earlier there had been a 32-bit IFilter for versions of .pdf up to and including version 6.0 of Adobe Acrobat that could also be used for 64-bit systems, but for files of version 7, 8, and 9 to be indexed, you needed to install the Adobe Reader product. You still need to do this for older versions of SharePoint running on 32-bit systems if you don’t want to use the chargeable product.

This new (and separate) Adobe IFilter for file types up to version 9 is here: http://www.adobe.com/support/downloads/detail.jsp?ftpID=4025. Because the instructions on that page were written in 2009, they do not say that it is for SPF 2010 or SPS 2010. It is!

The installation procedure is straightforward. (Just click the file and let it go.) Following the installation of the IFilter, PDF needs to be added to SPF 2010 as a file type that will be included when files are crawled. (Without doing this, you have the ability to index a file type that isn’t included in the set of files you are indexing!)

To find where file types are listed, we first need to find the main page for Search Administration. Following are the steps you need to take:

1. Open Central Administration (see Figure 14.1). (If you are on the server, click the menu item; if you are on a client, enter http://spf1:portnumber in your browser.)

FIGURE 14.1 Central Administration concentrating on the left column


The reason I have bothered with a screen here is that you will quickly notice something when you use Central Administration: the options listed under each heading in the main section of the page are not all the options available. To see all the options, you first need to select the item in the left column.

Here we will be using a function of General Application Settings that is not the single item listed under the General Application Settings header in the main section of the page.

2. Click General Application Settings in the left column (see Figure 14.2).

FIGURE 14.2 The full options of General Application Settings


The screen in the previous version only listed Configure Send to Connections. (It’s a confusing page design, in my opinion.)

3. Click Farm-Wide Search Administration. (Ignore that our “farm” is a single server; it’s still a farm.) There’s not much here yet (see Figure 14.3), but the next screen will satisfy all our search configuration demands.

FIGURE 14.3 The Farm-Wide page


4. Click Search Service Application. (What you then see is shown in Figures 14.4, 14.5, and 14.6.)

FIGURE 14.4 The top half of the Search Administration page


FIGURE 14.5 Most of the lower half of the Search Administration page


FIGURE 14.6 The I Want To section of the Search Administration page


The Modify Topology page confirms that Microsoft Search Server 2010 Express restricts the topology of a Search Service Application to one server with one crawl component and one query component.

There’s a lot on this page, which is why I created three figures and even then didn’t cover everything. It is actually the Search Administration page, and if you click Search Administration at the top-left part, you’ll load the same page. If you click Farm-Wide Search Administration at the top of the left column in Figure 14.4, you’ll be back at Figure 14.3. How’s that for confusing?

We look at this page later when we do further investigation into the search possibilities we now have after installing Search Server 2010 Express, but here we want the list of file types. That is listed in Figure 14.4 under the Crawling section of the left column.

5. Click File Types (see Figure 14.7).

FIGURE 14.7 The top of the File Types page


To avoid needing to have a large image so that you can see all the file types mentioned here—and remember these are only the file types that will be crawled; it is not necessarily a list of the file types that will be indexed—I’ll list them:

ascx, asp, aspx, csv, doc, docm, docx, dot, eml, exch, htm, html, jhtml, jsp, mht, mhtml, msg, mspx, nsf, nws, odc, odp, ods, odt, one, php, ppt, pptm, pptx, pub, tif, tiff, txt, url, vdw, vdx, vsd, vss, vst, vsx, vtx, xls, xlsb, xlsm, xlsx, xml, zip

Now we just need to add pdf to the list here, and we’re almost finished.

6. As shown in Figure 14.7, click New File Type.

7. On the Add File Type screen (not shown here), type pdf as the file extension and click OK.

8. Check the list in Figure 14.7 to see that pdf is now listed (without an icon).
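The point that a file type must be both in the crawl list and covered by an IFilter can be checked with a small Python sketch. The crawl list below is the one given earlier in this section; INSTALLED_FILTERS is a made-up subset for the illustration and says nothing about what a real server has installed.

```python
# File types in the default crawl list, as listed earlier in this section.
CRAWLED_TYPES = set(
    "ascx asp aspx csv doc docm docx dot eml exch htm html jhtml jsp mht "
    "mhtml msg mspx nsf nws odc odp ods odt one php ppt pptm pptx pub tif "
    "tiff txt url vdw vdx vsd vss vst vsx vtx xls xlsb xlsm xlsx xml zip".split()
)

# Hypothetical: pretend filters exist only for these types (pdf just installed).
INSTALLED_FILTERS = {"docx", "txt", "pdf"}


def status(filename, crawled_types, filter_types):
    """Say what happens to a file, based on its extension alone."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if extension not in crawled_types:
        return "skipped by the crawl"
    if extension not in filter_types:
        return "found, contents not indexed"
    return "found, contents indexed"


print(status("report.pdf", CRAWLED_TYPES, INSTALLED_FILTERS))

CRAWLED_TYPES.add("pdf")  # the equivalent of steps 6 to 8
print(status("report.pdf", CRAWLED_TYPES, INSTALLED_FILTERS))
print(status("drawing.vsd", CRAWLED_TYPES, INSTALLED_FILTERS))
```

Before pdf is added to the list, the installed filter is never used because the crawl skips the file. After it is added, the file passes both checks.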

Another alternative is to use the commercial 64-bit (pdf) IFilter from Foxit Software. You can get this at http://www.foxitsoftware.com/pdf/ifilter/. These Foxit IFilters cost (early 2010) $330 per server. As noted they are reputed to be considerably faster (and better) than the Adobe ones so are well worth considering. There is a download link for the Foxit Software 64-bit IFilter here: http://mirrors.foxitsoftware.com/pub/foxit/ifilter/desktop/win/1.x/1.0/enu/FoxitPDFIFilter10_X64_enu.msi.

Adding an IFilter for Other File Types

IFilters of varying quality for other file types are available from various commercial and noncommercial companies. Just search for them.

The best-quality ones tend to come from companies whose main product uses their own proprietary file formats. If these companies want their applications to be used, they need to provide working IFilters for them, so they usually do.

Did you Know?

Follow the (Microsoft) Filter Central blog; it occasionally mentions newly available IFilters. The RSS feed for it is http://blogs.msdn.com/ifilter/rss.xml.

Actions Needed After Installing IFilters

After installing one or more IFilters and carrying out the additional steps, restart the server and do a completely new crawl for documents (and new indexing); this ensures that these document types are included in the indexes.

An alternative approach is to do the following:

1. net stop osearch14

2. net start osearch14

3. iisreset

I prefer doing completely new crawls in such situations.

Crawling and Indexing in SPF 2010 and Search Server 2010 Express

This section compares the methods used to crawl when using SPF 2010 and when using Search Server 2010 Express.

Crawling SPF 2010

Crawling in SPF 2010 can be done at the command line. Open a command prompt and go to C:\Program Files\Common Files\Microsoft Shared\web server extensions\14\BIN. Run the following command:

stsadm -o spsearch -action fullcrawlstart
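If you would rather start the crawl from a script, a small Python wrapper along the following lines is one possibility. Treat it as a sketch: BIN_DIR assumes the default installation folder, the function names are my own, and by default the script only prints the command (dry_run=True) instead of running it.

```python
import ntpath
import subprocess

# Assumed default location of the SharePoint 2010 command-line tools.
BIN_DIR = r"C:\Program Files\Common Files\Microsoft Shared\web server extensions\14\BIN"


def full_crawl_command(bin_dir=BIN_DIR):
    """Build the stsadm command from the text as an argument list."""
    # ntpath joins with backslashes whatever system this script runs on.
    return [ntpath.join(bin_dir, "stsadm.exe"),
            "-o", "spsearch", "-action", "fullcrawlstart"]


def start_full_crawl(dry_run=True):
    """Print the command, or run it on the server when dry_run is False."""
    command = full_crawl_command()
    if dry_run:
        print(" ".join(command))
        return None
    # check=True raises CalledProcessError if stsadm reports a failure.
    return subprocess.run(command, check=True)


start_full_crawl()
```

Calling start_full_crawl(dry_run=False) only makes sense on the SharePoint server itself, from an account that is allowed to run stsadm.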

Crawling Search Server 2010 Express

To crawl in Search Server 2010 Express, do the following:

Work your way down from Central Administration (via Figures 14.1, 14.2, and 14.3) until you reach Figure 14.4.

1. As also shown in Figure 14.4, click Content Sources (see Figure 14.8).

2. Click Start All Crawls.

FIGURE 14.8 Starting a new crawl after adding an IFilter


Using Other Search Server 2010 Express Options

Search Server 2010 Express provides more flexibility than SPF 2010 in determining what exactly we want to search. Let’s look at some of those additional functions.

By the Way

For both SPF 2010 and SPS 2010, these (manual) crawls are typically done only after installing a new IFilter to create a completely new index. The first crawls are done automatically when the Search function starts.

By the Way

Things are sometimes not what they seem to be. The Search Administration page has a Crawl History section (see Figure 14.5), which shows the latest six crawl results. It looks as if there are further pages you can go to for progressively older crawl results. In fact you can use the right-arrow in that section to go to only a single extra page (that is, the second page, which also shows six crawl results). If you want to see all the crawl results, you need to select Crawl Log in the Search Administration page’s Quick Launch section (see Figure 14.4) and then select Crawl History there.

After those two important notes, another thing to watch out for is how to do a new Full Crawl. It looks as if this ought to be done using the Reset All Crawled Content link (again in the Quick Launch section of the Search Administration page, Figure 14.4). In fact that link does something else. In the (slightly revised) words of Woody Windischman, who kindly answered a question I had on this:

Essentially, it will return the Search indexes to a “known clean” state by

1. Deleting the existing file-system indexes

2. Resetting the search database index tables to a clean state (including resetting unique ID counters)

It does not:

1. Delete content source definitions or schedules

2. Start new crawls (though defined crawls from #1 should start on schedule)

3. Stop any crawl that might already be underway (though it may wait until it is done before it actually does anything)

The way to start a Full Crawl is to select Content Sources; click the single Content Source (“Local SharePoint sites”), which is provided as part of the installation, and then go to the bottom of the page where there is a Start Full Crawl section. Select the check box and click OK.

The Reset All Crawled Content option should be used only when the search indexes are in a mess and search is clearly not working properly.

Let’s now look at how to define which locations we can search using Search Server 2010 Express. Here are the steps required:

1. As shown in Figure 14.9, select Content Sources. This time, look at what we can add as locations to be searched.

FIGURE 14.9 The Content Source page


2. Click Local SharePoint sites, which is the only content source that has been installed by the Search Server 2010 Express installation routine. You can either amend the existing content source or create new ones using New Content Source, or do both. Here we look at the existing source to see what has been specified for it.

Figure 14.10 shows two key sections that are needed when saying what is to be crawled.

FIGURE 14.10 Where to search


The Start Address section includes all the top-level sites that were in existence when Search Server 2010 Express was installed with the exception of the Central Administration site. Addresses are not removed from this box if one of those top-level sites is later deleted (as was the case here with http://spf1:44465). In such cases, you can avoid unnecessary error messages by removing the addresses of deleted top-level sites from the table in Figure 14.10. You need to do this manually. The system will not do this for you when you delete a top-level site.

The Crawl Settings section enables you to decide if you want to crawl all sites (that is, subsites) below the top-level site (site collection) or whether you want to just crawl the top-level site.

Finally, Crawl Schedules does not set a schedule for Full Crawls by default. When Search Server 2010 Express was installed, a single Full Crawl was automatically done using the above Start Addresses. Since then only Incremental Crawls have been done, according to a default schedule that can be changed on the same screen (Figure 14.10 shows only part of it).

In production situations it is wise to create a new schedule for Full Crawls so that they occur at a suitable time, during the weekend for instance. When you create a schedule, the default repeat value is every 20 minutes. That value is suitable only for Incremental Crawls (where it is also the default) and should not be selected for Full Crawls.
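To see why the two schedules differ so much, it helps to work out when the crawls would actually run. The Python sketch below is only an illustration: the Saturday 02:00 slot for Full Crawls is my own example of a "suitable time," and the function names are made up.

```python
from datetime import datetime, timedelta


def next_full_crawl(now, weekday=5, hour=2):
    """Next weekly slot strictly after now (weekday: Monday=0 ... Saturday=5)."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def incremental_crawls(start, end, every_minutes=20):
    """Start times of incremental crawls in the half-open interval [start, end)."""
    times = []
    current = start
    while current < end:
        times.append(current)
        current += timedelta(minutes=every_minutes)
    return times


now = datetime(2010, 6, 16, 9, 0)  # a Wednesday morning
print(next_full_crawl(now))
print(len(incremental_crawls(now, now + timedelta(days=1))))
```

A 20-minute interval produces 72 crawls a day. That is fine for Incremental Crawls, which only pick up changes, but a Full Crawl on the same interval would start again long before the previous one could reasonably finish.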

Normally, three main problems exist when you crawl “foreign” websites:

• You might need rights to access them.

• The sites’ administrators might have blocked crawling.

• You have no control over the amount of data to be crawled.
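On the second of these problems: the usual way for a site’s administrators to block crawling is a robots.txt file at the root of the site. You can check what such a file allows with Python’s standard urllib.robotparser module. In this sketch the robots.txt content, the crawler name ExampleCrawler, and the example.com URLs are all made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in ("http://www.example.com/docs/page.html",
            "http://www.example.com/private/report.html"):
    allowed = parser.can_fetch("ExampleCrawler", url)
    print(url, "->", "may be crawled" if allowed else "blocked")
```

Against a live site you would call set_url() with the address of its robots.txt file and then read() instead of parse().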

The first of these (provided you know an appropriate name and password) is handled by selecting crawl rules (see Figure 14.4). Here again Microsoft managed to confuse us because the first screen you see (Figure 14.11) gives only the option of choosing paths and has no box for specifying how to access these locations.

FIGURE 14.11 A misleading Crawl Rules page


At first glance, this screen seems to indicate that the only Crawl Rule you can set is which locations should be crawled. This is just an intermediate page that enables you to test various URLs. To actually specify, for example, the Administrator account and password for crawls, we need to create a New Crawl Rule using the link in Figure 14.11. (This leads to Figure 14.12.)

In addition to selecting paths that are not to be crawled, there is now also the possibility of giving name and password access information (or alternative access information) for paths you want to be crawled. The items in the Specify Authentication section are grayed out (and thus inaccessible) until the radio button in the Crawl Configuration section is changed to Include All Items in the Path.
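The way a crawl rule combines a path with an include or exclude decision, and optionally with access information, can be modeled in a short Python sketch. The assumptions here are mine, not confirmed behavior of the product: rules are tried in order and the first match wins, * is the only wildcard, and the host names and the account name are invented.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class CrawlRule:
    path: str                      # URL pattern; * matches any run of characters
    include: bool                  # True = crawl matching URLs, False = skip them
    account: Optional[str] = None  # alternative crawl account (include rules only)

    def __post_init__(self):
        # Mirrors the page: authentication can be set only for include rules.
        if self.account is not None and not self.include:
            raise ValueError("authentication applies only to include rules")


def first_matching_rule(url: str, rules: List[CrawlRule]) -> Optional[CrawlRule]:
    """Return the first rule whose pattern matches the URL, or None."""
    for rule in rules:
        if fnmatchcase(url.lower(), rule.path.lower()):
            return rule
    return None


# Hypothetical rules; the partner host and the account are made up.
RULES = [
    CrawlRule("http://spf1/archive/*", include=False),
    CrawlRule("http://partner.example.com/*", include=True,
              account="EXAMPLE\\crawler"),
]

for url in ("http://spf1/archive/2009/old.docx",
            "http://partner.example.com/docs/spec.docx",
            "http://spf1/sites/team/plan.docx"):
    rule = first_matching_rule(url, RULES)
    if rule is None:
        print(url, "-> no rule, default behavior")
    elif rule.include:
        print(url, "-> crawl as", rule.account or "default account")
    else:
        print(url, "-> excluded")
```

The check in __post_init__ plays the same role as the grayed-out Specify Authentication section: an account makes sense only when the rule includes the path.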

I won’t go into the other options here. If you are serious about search, there are specialist books devoted entirely to it. Here the intention is to introduce you to the options available rather than go through each option in detail.

FIGURE 14.12 Full Crawl Rules


Next, we look at how to do searches using the data we will have indexed in the crawling/indexing phases.

Using the Search Server 2010 Search Function in the SPF 2010 Site

Here it seems logical to do our search from the newly created Search page—which is the default page of the new site we created in Hour 13, “Using SPF 2010 Search and Installing Search Server 2010 Express,” (see Figure 13.16). When we go to that page, we see a simple entry form for searches.

Entering a search term there gives us a different (and probably slightly better) search than we would get from the standard SPF 2010 search. However, it’s messy to always need to go to this page when we already have search boxes in our existing sites, so let’s consider what happens if we search in the default site that SPF 2010 created.

What actually happens is that Search in the default SPF 2010 site uses the Search Server 2010 search function by default. This is unlike the situation with the v3 products, where you had to fiddle with the system to get this to work.

You can check this (or turn it off if you want to keep the two kinds of searches strictly separated) with the following steps:

1. Go to Central Administration.

2. Select Manage Web Applications.

3. Select http://spf1:80 (see Figure 14.13).

FIGURE 14.13 Starting to check Search in the SPF default site


4. Click Manage Features (Figure 14.14).

FIGURE 14.14 Two activated searches


As can be seen (lower option), using the Search Server Service is now (by default after installing Search Server 2010 Express) the active option for searching the sites we created in SPF 2010. Because Search Server 2010 Express has been installed, we also have the possibility to search many other sources of enterprise data even from within the SPF 2010 sites. Now that we’ve checked that, here’s a quick look at the key additional search functions available to us in our searches.

The Queries and Results section enables us to improve the quality of our searches. Apart from again restricting searches and also specifying on which page the search results appear (using Scopes), there are a couple of useful new functions:

• Authoritative Pages enable you to specify which parts of the areas being searched contain the best quality information.

• Federated Search is a useful function that enables you to use ready-made search functions (available on the Internet; search for them in Bing or Google using Federated Search Locations or Federated Search Connectors) for searching particular locations. These ready-made search functions are fully configurable so that, for instance, you can have a web part on a page that uses a Federated Search Connector for Bing. You can then make a couple of special versions of this web part that are further restricted; perhaps the first could search only http://social.technet.microsoft.com (TechNet Forums) and the second could only search www.microsoft.com.
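Restricting a federated search to one site comes down to what is sent to the remote search engine. Federated locations are built around an OpenSearch-style query template in which {searchTerms} stands for whatever the user typed. The Python sketch below fills in such a template; the Bing template shown is a simplified illustration rather than the exact one a connector ships with, and the use of the site: operator for the restriction is my assumption.

```python
from urllib.parse import quote_plus

# Illustrative template only; a real connector defines its own.
BING_TEMPLATE = "http://www.bing.com/search?q={searchTerms}"


def build_query_url(template, terms, site=None):
    """Fill in the template, optionally restricting results to one site."""
    query = terms if site is None else f"{terms} site:{site}"
    return template.replace("{searchTerms}", quote_plus(query))


print(build_query_url(BING_TEMPLATE, "crawl rules"))
print(build_query_url(BING_TEMPLATE, "crawl rules",
                      site="social.technet.microsoft.com"))
```

The two restricted web parts described above would differ only in the value used for site.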

Summary

In this hour, you learned that searching actually means deciding where to look for data; grabbing, interpreting, and indexing all the files within the scope we set; and finally having a search function to find suitable data for our needs from the mass of base data provided by the indexing stage.

You also learned both how to restrict your search by specifying only particular places to look for information and how to extend your search by adding IFilters to enable your indexing software to extract sensible data from more file formats.

Finally, you learned how adding Search Server 2010 Express means that its extended search functions are also available for your searches in the SPF 2010 sites.

Q&A

Q. Is SPF 2010 plus Search Server 2010 Express better than the search provided by the much more expensive SPS 2010?

A. No. But the functionality provided is virtually the same.

Q. Search Server 2010 Express is free and seems to be powerful. Why should I bother paying for Search Server 2010?

A. Search Server 2010 and Search Server 2010 Express are identical in functionality, but Express is limited to running on a single machine.

Workshop

Quiz

1. What are the three main aspects of searching?

2. What is the purpose of an IFilter?

Answers

1. Crawling, indexing, and searching.

2. An IFilter is needed so that contents of a file in a particular file format can be understood by the indexing program. Without it, the indexing program usually can’t extract meaningful data from a file.
