Configuring the Alfresco search engine

The Alfresco search engine is configurable and highly scalable. This section provides information about the underlying search engine and the process for configuring it.

The theory behind the search engine

Alfresco supports full-text search capabilities, using Apache's powerful Lucene search engine (http://lucene.apache.org). Lucene is an open source, highly scalable, and fast search engine. Lucene powers search in the discussion groups at Fortune 100 companies, in commercial issue trackers, email search from Microsoft, and the Nutch web search engine (which scales to billions of pages).

Lucene searches a document based on its text content, which makes it independent of the file format. Any kind of file (PDF, HTML, Microsoft Word documents, and so on) can be indexed, as long as its textual information can be extracted.

Lucene stores the search indexes and related data in a back-end file system, similar to Alfresco's binary files. You can find the search index files in your <alfresco_installation>\alf_data\lucene-indexes folder. Lucene also supports federated searches by combining various data sources.

At the time of writing, Alfresco supports two query languages (Lucene and XPath). These are used to search the content in the Alfresco repository.

Limit search results

By default, a search returns all of the results that match the search criteria. Suppose you have millions of documents in your repository. If a particular search returns thousands of documents, the web client uses pagination to display the results across multiple pages. In practice, the later pages are rarely viewed. Can you recall ever clicking on page 10 (or later) of a search results page to locate content? Retrieving all of the search results only to display them in pages is very inefficient.

You can limit the search results by customizing your web-client configuration file web-client-config-custom.xml in the extension (<alfresco_install_folder>\tomcat\shared\classes\alfresco\extension) folder.

Edit the web-client-config-custom.xml file and add the following XML text after the first line (which is <alfresco-config>). If you have already created this XML block in your web-client-config-custom.xml file, then you only need to insert the lines that are highlighted.

<config>
  <client>
    <!-- Override the from email address -->
    <from-email-address>[email protected]</from-email-address>
    <!-- the minimum number of characters required for a valid search string -->
    <search-minimum>3</search-minimum>
    <!-- set this value to true to enable AND text terms for simple/advanced search by default -->
    <search-and-terms>false</search-and-terms>
    <!-- Limit search results. -1 for unlimited. -->
    <search-max-results>100</search-max-results>
  </client>
</config>

This configuration limits the search engine to a maximum of 100 results. It also sets the minimum search string length to 3 characters and disables the Boolean AND search option, in order to improve search performance.

Restart Alfresco for the changes to take effect.
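
If you later need to remove the cap, the comment in the configuration above indicates that a value of -1 means unlimited. A minimal sketch of such an override, assuming the same web-client-config-custom.xml file and the <alfresco-config> root element mentioned earlier, might look like this:

```xml
<alfresco-config>
  <config>
    <client>
      <!-- -1 removes the limit on the number of search results -->
      <search-max-results>-1</search-max-results>
    </client>
  </config>
</alfresco-config>
```

Unlimited results bring back the pagination cost described at the start of this section, so a finite limit is generally the safer choice for large repositories.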

Indexing properties

In the Alfresco content model, the data dictionary settings for properties determine how individual properties are indexed in the search engine.

Refer to the custom aspect called Customer Details in Chapter 7. In the earlier sections of this chapter, we configured the advanced search form to search for the Customer Name property of this custom aspect. It is advisable to index the values of the Customer Name property in order to improve the search performance.

Edit the customModel.xml file in your <extension> folder, where you declared the Customer Details aspect. Add the index element shown in the following property declaration to the aspect declaration, in order to index the property.

<property name="custom:CustomerName">
    <title>Customer Name</title>
    <type>d:text</type>
    <protected>false</protected>
    <mandatory>false</mandatory>
    <multiple>false</multiple>
    <index enabled="true">
        <atomic>false</atomic>
        <stored>false</stored>
        <tokenised>true</tokenised>
    </index>
    <constraints>
        <constraint ref="custom:name_length"/>
    </constraints>
</property>

If the enabled option for the index is set to true, then this property will be indexed in the search engine. If this is false, there will be no entry for this property in the index.

If the Atomic option is set to true, then the property is indexed in the transaction. If this is set to false, the property is indexed in the background.

If the Stored option is set to true, then the property value is stored in the index and may be obtained through the Lucene low-level query API.

If the Tokenized option (the tokenised element) is set to true, then the string value of the property is tokenized before indexing. If it is set to false, then it is indexed as is, that is, as a single string. The tokenizer is determined by the property type in the data dictionary. This is locale-sensitive as supported by the data dictionary. Therefore, you could choose to tokenize all of your content in German, if you wish to do so.

If you have not specified any indexing values for your custom properties, then Alfresco gives default values to your properties. By default, the properties are indexed atomically. The property value is not stored in the index, and the property is tokenized when it is indexed.
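
As a sketch of these defaults, leaving the index element out of a property declaration should behave like spelling the defaults out explicitly, as in the fragment below. The property name custom:ExampleProperty is hypothetical and used only for illustration; the element names follow the Customer Name declaration shown earlier.

```xml
<!-- Hypothetical property, shown only to illustrate the default index settings -->
<property name="custom:ExampleProperty">
    <type>d:text</type>
    <!-- Assumed equivalent to omitting the index element altogether -->
    <index enabled="true">
        <atomic>true</atomic>        <!-- indexed in the transaction -->
        <stored>false</stored>       <!-- value not stored in the index -->
        <tokenised>true</tokenised>  <!-- value tokenized before indexing -->
    </index>
</property>
```

Compare this with the Customer Name declaration, which differs from the defaults only in setting atomic to false so that the property is indexed in the background.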

Configuring Lucene in Alfresco

The repository.properties file in your config folder defines a number of properties that influence how all indexes behave. You can improve the search performance by setting appropriate values in the properties file.

Note

We advise you NOT to change the values in the repository.properties file. Instead, we recommend that you override the settings in the custom-repository.properties file in the /extension folder in the Alfresco classpath.

The following are the default search-index properties:

  • lucene.query.maxClauses=10000
  • lucene.indexer.batchSize=10000
  • lucene.indexer.minMergeDocs=1000
  • lucene.indexer.mergeFactor=10
  • lucene.indexer.maxMergeDocs=100000

Max Clauses (Lucene standard parameter): Lucene queries limit the number of clauses in a Boolean query to this value. Some queries are expanded under the covers into a whole set of Boolean queries with many clauses. For example, searching for luc.* will expand to a Boolean query containing an OR for every token that the index knows about that matches luc.*.

Batch size (Alfresco indexing parameter): The indexer stores a list of what it has to do as the changes are made using the node service API. Typically, there are many events that would cause a node to be re-indexed. By keeping an event list, we can optimize these actions. The algorithm limits re-indexes to one per batch size, and it will not index if a delete is pending. When the list of events reaches this size, the whole event list is processed and the documents are added to the delta index.

Min Merge Docs (Lucene standard parameter): This determines the size of the in-memory Lucene index that is used for each delta index. A higher value of Min Merge Docs uses more memory but requires less I/O when writing to the index delta. The in-memory information is flushed and written to disk at the start of the next batch of index events. As the process progresses, the event list requires reading against the delta index. This does not affect the way information is stored on disk, only how it is buffered before it gets there.

Merge Factor (Lucene standard parameter): This determines the number of index segments that are created on disk. When there are more segments than this value, then some segments will be combined.

Max Merge Docs (Lucene standard parameter): This value determines the maximum number of documents that can be stored in an index segment. When this value is reached, the segment will not grow any larger. As a result, there may be more segments than expected by looking at the merge factor.
