The token filters are declared in the <filter> element and consume one stream of tokens, known as a TokenStream, and generate another. Hence, they can be chained one after another indefinitely. A token filter may perform complex analysis by processing multiple tokens in the stream at once, but in most cases it processes each token sequentially and decides whether to keep, replace, or ignore it.
There may only be one official tokenizer in an analyzer; however, the token filter named WordDelimiterFilter is, in effect, a tokenizer too:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
(Not all options are shown above.) The purpose of this filter is to both split and join compound words, with various means of defining what constitutes a compound word. It is typically used with WhitespaceTokenizer, not StandardTokenizer, because StandardTokenizer removes the punctuation-based intra-word delimiters, thereby defeating some of this processing. The options take the value 1 to enable and 0 to disable.
The WordDelimiterFilter will first tokenize the input word according to the configured options. Note that the commas on the right-hand side of the following examples denote separate terms, and that the options are all true by default:
- Wi-Fi to Wi, Fi
- SD500 to SD, 500 (if splitOnNumerics)
- /hello--there, dude to hello, there, dude
- David's to David (if stemEnglishPossessive)
- WiFi to Wi, Fi (if splitOnCaseChange)

At this point, the resulting terms are all filtered out unless some of the following options are enabled. You should always enable at least one of them:
- generateWordParts, generateNumberParts: If either is enabled, all-alphabetic terms or all-number terms, respectively, pass through (that is, they are not filtered out). Either way, they are still considered for the concatenation options.
- catenateWords: This concatenates a consecutive series of alphabetic terms into one (for example, wi-fi to wifi). If generateWordParts is also enabled, this example would generate wi and fi as well, but not otherwise. This works even if there is just one term in the series, thereby generating a term that disabling generateWordParts would have omitted.
- catenateNumbers: This works similarly, but for numeric terms.
- catenateAll: This concatenates all of the terms together. The concatenation process takes care not to emit duplicate terms.
- preserveOriginal: This emits the original word itself, unmodified.

Here is an example exercising all the aforementioned options: WiFi-802.11b to Wi, Fi, WiFi, 802, 11, 80211, b, WiFi80211b, WiFi-802.11b.
Internally, this filter assigns a type to each character (such as letter or number) before looking for word boundaries. The types are determined by Unicode character categories. If you want to customize how the filter determines the type of each character, you can provide one or more mapping files with the types option. An example use case would be indexing Twitter tweets in which you want # and @ treated as type ALPHA.
For more details on this esoteric feature, see SOLR-2059. You can find a sample configuration showing how to customize WordDelimiterFilter's tokenization rules at https://issues.apache.org/jira/browse/SOLR-2059.
Lastly, if there is a certain limited number of known input words that you want this filter to pass through untouched, they can be listed in a file referred to with the protected option. Some other filters share this same feature.
Solr's out-of-the-box configuration for the text_en_splitting field type is a reasonable way to use the WordDelimiterFilter: generation of word and number parts at both index and query time, but concatenation only at index time, since doing so at query time too would be redundant.
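As a rough sketch of that arrangement (the exact shipped definition of text_en_splitting varies between Solr versions, so treat these attribute values as illustrative only):

```xml
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- index time: generate parts AND catenate -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- query time: generate parts only; catenating here would be redundant -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```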
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form; for example, a stemming algorithm might reduce Riding and Rides to just Ride. Stemming is done to improve search result recall, but at the expense of some precision. If you are processing general text, you will improve your search results with stemming. However, if you have text that is mostly proper nouns, such as an artist's name in MusicBrainz, then anything more than light stemming will hurt the results. If you want to improve the precision of search results but retain the recall benefits, you should consider indexing the data in two fields, one stemmed and the other not stemmed. The DisMax query parser, described in Chapter 5, Searching, and Chapter 6, Search Relevancy, can then be configured to search the stemmed field and boost by the unstemmed one via its bq or pf options.
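A minimal sketch of that two-field arrangement (the field names a_name and a_name_stemmed and the handler name are hypothetical, not from the MusicBrainz schema as shipped):

```xml
<!-- copy the raw name into a stemmed sibling field -->
<field name="a_name" type="string" indexed="true" stored="true"/>
<field name="a_name_stemmed" type="text_en" indexed="true" stored="false"/>
<copyField source="a_name" dest="a_name_stemmed"/>

<!-- a DisMax handler searching the stemmed field, boosted by the unstemmed one -->
<requestHandler name="/artists" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">a_name_stemmed</str>
    <str name="pf">a_name^2</str>
  </lst>
</requestHandler>
```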
Many stemmers will generate stemmed tokens that are not correctly spelled words, such as Bunnies becoming Bunni instead of Bunny or stemming Quote to Quot; you'll see this in Solr's Analysis screen. This is harmless since stemming is applied at both index and search times; however, it does mean that a field that is stemmed like this cannot also be used for query spellcheck, wildcard searches, or search term autocomplete—features described in later chapters. These features directly use the indexed terms.
A stemming algorithm is very language specific compared to other text analysis components; remember to visit https://cwiki.apache.org/confluence/display/solr/Language+Analysis as advised earlier for non-English text. It includes information on a Solr token filter that performs decompounding, which is useful for certain languages (not English).
Here are stemmers suitable for the English language:
- SnowballPorterFilterFactory: This one lets you choose among many stemmers that were generated by the so-called Snowball program, hence the name. It has a language attribute in which you make the implementation choice from a list. Specifying English uses the Porter2 algorithm, regarded as a slight improvement over the original. Specifying Lovins uses the Lovins algorithm for English, regarded as an improvement on Porter but too slow in its current form.
- PorterStemFilterFactory: This is the original English Porter algorithm. It is said to be twice as fast as using Snowball English.
- KStemFilterFactory: This English stemmer is less aggressive than Porter's algorithm. This means it will not stem in as many cases as Porter will, in an effort to reduce false positives at the expense of missing stemming opportunities. We recommend this as the default English stemmer.
- EnglishMinimalStemFilterFactory: This is a simple stemmer that only stems on typical pluralization patterns. Unlike most other stemmers, the stemmed tokens that are generated are correctly spelled words; they are the singular form. A benefit of this is that a single Solr field with this stemmer is usable for both general searches and for query term autocomplete simultaneously, thereby saving index size and making indexing faster.

These stemmers are algorithmic instead of being based on a vetted thesaurus for the target language. Languages have so many spelling idiosyncrasies that algorithmic stemmers are imperfect; they sometimes stem incorrectly or don't stem when they should.
If there are particularly troublesome words that get stemmed, you can prevent it by preceding the stemmer with a KeywordMarkerFilter whose protected attribute refers to a file of newline-separated tokens that should not be stemmed. An ignoreCase Boolean option is available too. Some stemmers have, or used to have, a protected attribute that worked similarly, but that old approach isn't advised any more.
If you need to augment the stemming algorithm so that you can tell it how to stem some specific words, precede the stemmer with StemmerOverrideFilter. It takes a dictionary attribute referring to a UTF-8 encoded file in the conf directory of token pairs, one pair per line, with a tab separating the input token from the output token (the desired stemmed form of the input). An ignoreCase Boolean option is available too. This filter will skip tokens already marked by KeywordMarkerFilter, and it will keyword-mark all the tokens it replaces itself, so that the stemmer will skip them.
Here is a sample excerpt of an analyzer chain showing three filters in support of stemming:
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
<filter class="solr.PorterStemFilterFactory" />
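The two referenced files are plain text; the entries below are purely illustrative, not from the book's configuration. protwords.txt lists one token per line, while stemdict.txt pairs an input token with its desired stem, separated by a tab:

```text
# protwords.txt -- tokens the stemmer must leave alone
maven
iodine

# stemdict.txt -- input token <TAB> desired stem
monkeys	monkey
oxen	ox
```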
The purpose of synonym processing is straightforward. Someone searches using a word that wasn't in the original document but is synonymous with a word that is indexed, so you want that document to match the query. Of course, the synonyms need not be strictly those identified by a thesaurus; they can be whatever you want, including terminology specific to your application's domain.

The most widely known free thesaurus is WordNet (http://wordnet.princeton.edu/). From Solr 3.4, we have the ability to read WordNet's "prolog" formatted file via a format="wordnet" attribute on the synonym filter. However, don't be surprised if you lose precision in the search results; it's not a clear win. For example, "Craftsman" in context might be a proper noun referring to a brand, but WordNet would make it synonymous with "artisan". Synonym processing doesn't know about context; it's simple and dumb.
Here is a sample analyzer configuration line for synonym processing:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
The synonym reference is to a file in the conf directory. Set ignoreCase to true for a case-insensitive lookup of synonyms.
Before describing the expand option, let's consider an example. The synonyms file is processed line by line. Here is a sample line with an explicit mapping that uses the arrow =>:

i-pod, i pod => ipod
This means that if either i-pod (one token) or i then pod (two tokens) is found in the incoming token stream to this filter, it is replaced with ipod. There could have been multiple replacement synonyms, each of which might contain multiple tokens. Also notice that commas separate each synonym, which is then split on whitespace for multiple tokens. To customize the tokenization to be something more sophisticated than whitespace, there is a tokenizerFactory attribute, but it's rarely used.
Alternatively, you may have lines that look like this:
ipod, i-pod, i pod
These lines don't have => and are interpreted differently according to the expand parameter. If expand is true, the line will be translated to the following explicit mapping:

ipod, i-pod, i pod => ipod, i-pod, i pod
If expand is false, the aforementioned line will become this explicit mapping, in which the first source synonym is the replacement synonym:

ipod, i-pod, i pod => ipod
It's okay to have multiple lines that reference the same synonyms. If a source synonym in a new rule is already found to have replacement synonyms from another rule, then those replacements are merged.
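For example, with these two hypothetical rules in the same file:

```text
i-pod => ipod
i-pod => mp3 player
```

the source synonym i-pod ends up with both replacements, as if you had written i-pod => ipod, mp3 player on a single line.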
Multiword (also known as Phrase) synonyms
For multiword synonyms to work, the analysis must be applied at index time and with expansion, so that both the original words and the combined word get indexed. The next section elaborates on why this is so. Also, be aware that the tokenizer and previous filters can affect the tokens that the SynonymFilter sees. So, depending on the configuration, hyphens and other punctuation may or may not be stripped out.
If you are doing synonym expansion (that is, you have any source synonyms that map to multiple replacement synonyms or tokens), do synonym processing at either index time or query time, but not both. Doing it in both places will yield correct results but will perform slower. We recommend doing it at index time because of the following problems that occur when doing it at query time:

- A multiword synonym (for example, i pod) isn't recognized at query time, because the query parser tokenizes on whitespace before the analyzer gets it.
- Expanding a query with rare synonyms skews scoring, because the IDF component of Lucene's scoring algorithm weighs rarer terms more heavily, yielding poor scores.

However, any analysis at index time is less flexible, because any changes to the synonyms will require a complete re-index to take effect. Moreover, the index will get larger if you do index-time expansion, perhaps too large if you have a large set of synonyms such as with WordNet. It's plausible to imagine the aforementioned issues being rectified at some point. In spite of this, we usually recommend index time.
Alternatively, you could choose not to do synonym expansion. This means for a given synonym token, there is just one token that should replace it. This requires processing at both index time and query time to effectively normalize the synonymous tokens. However, since there is query-time processing, it suffers from the problems mentioned earlier (with the exception of poor scores, which isn't applicable). The benefit to this approach is that the index size would be smaller, because the number of indexed tokens is reduced.
You might also choose a blended approach to meet different goals, for example, if you have a huge index that you don't want to re-index often, but you need to respond rapidly to new synonyms, then you can put new synonyms into both a query-time synonym file and an index-time one. When a re-index finishes, you empty the query-time synonym file. You might also be fond of the query-time benefits, but due to the multiple word token issue, you decide to handle those particular synonyms at index time.
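The blended setup might be wired like this (a sketch; the two file names are hypothetical, and new synonyms would be appended to both files until the next re-index):

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- the full synonym set, expanded into the index -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- only synonyms added since the last re-index; emptied after re-indexing -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms-new.txt"
          ignoreCase="true" expand="true"/>
</analyzer>
```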
There is a simple filter called StopFilterFactory that filters out certain so-called stop words specified in a file in the conf directory, optionally ignoring case. An example usage is as follows:
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
When used, it is present in both index and query analyzer chains.
For indexes with lots of text, common uninteresting words such as "the", "a", and so on make the index large and slow down phrase queries that use them. A simple solution to this problem is to filter them out of the fields in which they often show up. Fields likely to contain more than a sentence are ideal candidates. Our MusicBrainz schema does not have content like this. The trade-off when omitting stop words from the index is that those words are no longer queryable. This is usually fine, but in some circumstances, like searching for To be or not to be, it is obviously a problem.
The ideal solution to the common word problem is not to remove them. Chapter 10, Scaling Solr, discusses an approach called common-grams, implemented with CommonGramsFilterFactory, that can be used to improve phrase search performance while keeping these words. It is highly recommended.
Solr comes with a decent set of stop words for the English language. You may want to supplement it or use a different list altogether, if you're indexing non-English text. In order to determine which words appear commonly in your index, access the Schema Browser menu option in Solr's admin interface. All of your fields will appear in a drop-down list on the form. In case the list does not appear at once, be patient. For large indexes, there is a considerable delay before the field list appears because Solr is analyzing the data in your index. Now, choose a field that you know contains a lot of text. In the main viewing area, you'll see a variety of statistics about the field, including the top 10 terms appearing most frequently. If you can't see the term info by default, click on the Load Term Info button and select the Autoload checkbox.
You can also manage synonyms and stop words via a REST API using ManagedSynonymFilterFactory and ManagedStopFilterFactory, respectively. You can read more and find sample configurations at https://cwiki.apache.org/confluence/display/solr/Managed+Resources.
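A minimal sketch of the managed variant (the managed set name "english" is an arbitrary label you choose; consult the Managed Resources page above for the authoritative details):

```xml
<filter class="solr.ManagedStopFilterFactory" managed="english"/>
```

The word list then lives server-side and is edited through the schema REST API rather than in a file under conf.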
Another useful text analysis option is phonetic translation, which enables searches for words that sound like a queried word. A filter is used at both index and query time that phonetically encodes each word into a phoneme-based word. There are many phonetic encoding algorithms to choose from: BeiderMorse, Caverphone, Cologne, DoubleMetaphone, Metaphone, RefinedSoundex, and Soundex. We suggest using DoubleMetaphone for most text, and definitely BeiderMorse for names. However, you might want to experiment in order to make your own choice.
The following code shows how to configure text analysis for phonetic matching using the DoubleMetaphone encoding in the schema.xml file:
<!-- for phonetic (sounds-like) indexing -->
<fieldType name="phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false" maxCodeLength="8"/>
  </analyzer>
</fieldType>
The previous example uses the DoubleMetaphoneFilterFactory analysis filter, which has the following two options:

- inject: This is a Boolean, defaulting to true, that causes the original words to pass through the filter. It might interfere with other filter options, querying, and potentially scoring. Therefore, it is preferable to disable this and use a separate field dedicated to phonetic indexing.
- maxCodeLength: This is the maximum phoneme code (that is, phonetic character or syllable) length. It defaults to 4. Longer codes are truncated. Only DoubleMetaphone supports this option.

Note that the phonetic encoders internally handle both uppercase and lowercase, so there's no need to add a lowercase filter.
In the MusicBrainz schema that is supplied with the book, a field named a_phonetic is declared to use BeiderMorse, because that encoding is best for names. The field has the artist name copied into it through a copyField directive. In Chapter 5, Searching, you will read about the DisMax query parser, which can conveniently search across multiple fields with different scoring boosts. It can be configured to search not only the artist name (a_name) field, but also a_phonetic with a low boost, so that regular exact matches will come above those that match phonetically.
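That wiring amounts to something like the following (a sketch; the attribute values in the book's actual schema may differ):

```xml
<!-- phonetic shadow of the artist name; searched, never displayed -->
<field name="a_phonetic" type="phonetic" indexed="true" stored="false"/>
<copyField source="a_name" dest="a_phonetic"/>
```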
Here is how BeiderMorse is configured:
<fieldType name="phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ... potentially others ... -->
    <filter class="solr.BeiderMorseFilterFactory" ruleType="APPROX"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.BeiderMorseFilterFactory" ruleType="EXACT"/>
  </analyzer>
</fieldType>
Notice the difference in ruleType between the query and index analyzers. In order to use most of the phonetic encoding algorithms, you must use the following filter:
<filter class="solr.PhoneticFilterFactory" encoder="RefinedSoundex" inject="false"/>
The encoder attribute must be one of the algorithms listed in the first paragraph of this section, with the exception of DoubleMetaphone and BeiderMorse, which have dedicated filter factories.
Usually, text indexing technology is employed to search entire words. Occasionally, however, there arises a need for a query to match an arbitrary substring of an indexed word, or across words. Solr supports wildcards in queries (for example, mus*ainz), but some consideration is needed in the way data is indexed.
It's useful to first get a sense of how Lucene handles a wildcard query at the index level. Lucene internally scans the sorted terms list on disk, starting with the non-wildcard prefix (mus in the previous example). One thing to note is that the query takes exponentially longer as the prefix gets shorter; in fact, Solr configures Lucene to not accept a leading wildcard to ameliorate the problem. Another thing to note is that stemming, phonetic, and other non-trivial text analysis will interfere with these kinds of searches. For example, if running is stemmed to run, then runni* would not match.
Before employing these approaches, consider whether you really just need better tokenization for special codes. For example, if you have a long string code that internally has different parts that users might search on separately, then you can use a PatternReplaceFilterFactory with some other analyzers to split them up.
Solr doesn't permit a leading wildcard in a query unless you index the text in a reverse direction in addition to the forward direction. Doing this will also improve query performance when the wildcard is very close to the front. The following example configuration should appear at the end of the index analyzer chain:
<filter class="solr.ReversedWildcardFilterFactory" />
It has several performance-tuning options you can investigate further at its Javadocs, available at http://lucene.apache.org/solr/api/org/apache/solr/analysis/ReversedWildcardFilterFactory.html, but the defaults are reasonable.
Solr does not support a query with both a leading and trailing wildcard for performance reasons. Given our explanation of the internals, we hope you understand why.
N-gram analysis slices text into many smaller substrings, ranging between a minimum and maximum configured size. For example, consider the word "Tonight". An NGramFilterFactory configured with minGramSize of 2 and maxGramSize of 5 would yield all of the following indexed terms: (2-grams:) To, on, ni, ig, gh, ht, (3-grams:) Ton, oni, nig, igh, ght, (4-grams:) Toni, onig, nigh, ight, (5-grams:) Tonig, onigh, night. Note that "Tonight" itself does not pass through, because it has more characters than maxGramSize. N-gram analysis can be used as a token filter, and it can also be used as a tokenizer with NGramTokenizerFactory, which will emit n-grams spanning across the words of the entire source text.
The term n-gram can be ambiguous. Outside of Lucene, it is more commonly defined as word-based substrings, not character based. Lucene calls this shingling and you'll learn how to use that in Chapter 10, Scaling Solr.
The following is a suggested analyzer configuration using n-grams to match substrings:
<fieldType name="nGram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- potentially word delimiter, synonym filter, stop words, NOT stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- potentially word delimiter, synonym filter, stop words, NOT stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Notice that the n-gramming only happens at index time. The range of gram sizes goes from the smallest number of characters you wish to enable substring searches on (2 in this example), to the maximum size permitted for substring searches (15 in this example).
Apply this analysis to a field created solely for the purpose of matching substrings. Another field should exist for typical searches; then configure the DisMax query parser, described in Chapter 5, Searching, to search both fields, with a smaller boost for this one.
Another variation is EdgeNGramTokenizerFactory and EdgeNGramFilterFactory, which emit n-grams that are adjacent to either the start or the end of the input text. For the filter factory, this input text is a token, and for the tokenizer, it is the entire input. In addition to minGramSize and maxGramSize, these analyzers take a side argument that is either front or back. If only prefix or suffix matching is needed instead of both, then an EdgeNGram analyzer is for you.
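For instance, a prefix-matching variant of the earlier nGram type might swap in the edge filter like this (a sketch; note that newer Solr releases dropped the side attribute and always take grams from the front):

```xml
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
```

With this, Tonight would index To, Ton, Toni, and so on, which suits search-as-you-type prefix matching.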
There is a high price to be paid for n-gramming. Recall that in the earlier example, Tonight was split into 18 substring terms, whereas typical analysis would probably leave only one. This translates to a greater index size, and thus a longer time to index. Let's look at the effects of this in the MusicBrainz schema. The a_name field, which contains the artist name, is indexed in a typical fashion and is a stored field. The a_ngram field is fed by the artist name and is indexed with n-grams ranging from 2 to 15 characters in length. It is not a stored field, because the artist name is already stored in a_name.
| | a_name | a_name + a_ngram | Increase |
|---|---|---|---|
| Indexing Time | 46 seconds | 479 seconds | > 10x |
| Disk Size | 11.7 MB | 59.7 MB | > 5x |
| Distinct Terms | 203,431 | 1,288,720 | > 6x |
The preceding table shows a comparison of index statistics for an index with just a_name versus one with both a_name and a_ngram. Note the ten-fold increase in indexing time for the artist name, and the five-fold increase in disk space. Remember that this is just one field!
The costs of n-gramming are lower if minGramSize is raised and, to a lesser extent, if maxGramSize is lowered. Edge n-gramming costs less too, because it is only based on one side. It definitely costs more to use the tokenizer-based n-grammers instead of the term-based filters used in the example before, because terms are generated that include and span whitespace. However, with such indexing, it is possible to match a substring spanning words.
Usually, search results are sorted by relevancy via the score pseudo-field, but it is common to need to support conventional sorting by field values too. In addition to sorting search results, this discussion also has ramifications for range queries and for showing facet results in a sorted order.
It just so happens that MusicBrainz already supplies alternative artist and label names for sorting. When different from the original name, these sortable versions move words like "The" from the beginning to the end after a comma. We've marked the sort names as indexed but not stored, since we're going to sort on them but not display them, deviating from what MusicBrainz does. Remember that indexed and stored are true by default. Because of the special text analysis restrictions on fields used for sorting, text fields in your schema that need to be sortable will usually be copied to another field and analyzed differently. The copyField directive in the schema facilitates this task. The string type has no text analysis, so it's perfect for our MusicBrainz case. As we're getting a sort-specific value from MusicBrainz, we don't need to derive something ourselves.
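That copy-then-sort pattern looks roughly like this in schema.xml (field names here are illustrative, not the book's exact MusicBrainz schema):

```xml
<field name="a_name" type="text_general" indexed="true" stored="true"/>
<!-- unanalyzed copy, used only for sorting -->
<field name="a_name_sort" type="string" indexed="true" stored="false"/>
<copyField source="a_name" dest="a_name_sort"/>
```

A query can then use sort=a_name_sort asc while still searching and displaying a_name.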
However, note that the MusicBrainz schema has no sort-specific release names, so let's add sorting support. One option is to use the string type again. That's fine, but you may want to lowercase the text, remove punctuation, and collapse multiple spaces into one (if the data isn't clean). You can even use PatternReplaceFilterFactory to move words like "The" to the end. It's up to you. For the sake of variety in our example, we'll take the latter route; we're using a type title_sort that does these kinds of things.
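A plausible title_sort definition along those lines (a sketch, not the book's exact supplied type) keeps the whole value as one token and normalizes it:

```xml
<fieldType name="title_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- the entire field value becomes a single token, as sorting requires -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- strip punctuation, then collapse runs of whitespace -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="[^a-z0-9 ]" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
  </analyzer>
</fieldType>
```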
By the way, Lucene sorts text by the internal Unicode code point. You probably won't notice any problem with the sort order. If you want sorting that is more faithful to the finer rules of various languages (English included), you should try CollationKeyFilterFactory. Since it isn't commonly used and is already well documented, we'll refer you to the wiki page at https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-UnicodeCollation.
Solr includes many more token filters:
- ClassicFilterFactory: (formerly named StandardFilter, prior to Solr 3.1) This filter works in conjunction with ClassicTokenizer. It will remove periods in between acronyms and 's at the end of terms: "I.B.M. cat's" => "IBM", "cat"
- EnglishPossessiveFilterFactory: This removes the trailing 's.
- TrimFilterFactory: This removes leading and trailing whitespace. We recommend doing this sort of thing before text analysis instead, such as with TrimFieldUpdateProcessorFactory (see Chapter 4, Indexing Data).
- LowerCaseFilterFactory: This lowercases all text. Don't put this before WordDelimiterFilterFactory if you want to split on case transitions.
- KeepWordFilterFactory: This filter omits all of the words except those in the specified file:

<filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>

If you want to ensure a certain vocabulary of words in a special field, you might enforce it with this.

- LengthFilterFactory: This filters out terms whose length falls outside an inclusive range. The following is an example:

<filter class="solr.LengthFilterFactory" min="2" max="5" />

- LimitTokenCountFilterFactory: This filter caps the number of tokens passing through to the number specified in the maxTokenCount attribute. Even without any hard limit, you are effectively limited by the memory allocated to Java; reach that and Solr will throw an error.
- RemoveDuplicatesTokenFilterFactory: This ensures that no duplicate terms appear at the same position. This can happen, for example, when synonyms stem to a common root. It's a good idea to add this as your last analysis step if you are doing a fair amount of other analysis.
- ASCIIFoldingFilterFactory: See MappingCharFilterFactory in the earlier Character filters section for more information on this filter.
- CapitalizationFilterFactory: This filter capitalizes each word according to rules that you specify. For more information, see the Javadocs at http://lucene.apache.org/core/4_10_4/analyzers-common/org/apache/lucene/analysis/miscellaneous/CapitalizationFilterFactory.html.
- PatternReplaceFilterFactory: This takes a regular expression and replaces the matches. Take a look at the following example:

<filter class="solr.PatternReplaceFilterFactory" pattern=".*@(.*)" replacement="$1" replace="first" />

This example is for processing an e-mail address field to get only the domain of the address. The replacement here happens to be a reference to a regular expression group, but it might be any old string. If the replace attribute is set to first, then only the first match is replaced; if replace is all, the default, then all matches are replaced.

It's surprising what a little creativity with PatternReplaceFilterFactory and some of the others can offer you. For starters, check out the rType field type in the schema.xml that is supplied online with this book.

There are some other miscellaneous Solr filters we didn't mention for various reasons. For common-grams or shingling, see Chapter 10, Scaling Solr. See the All Known Implementing Classes section at the top of http://lucene.apache.org/core/4_10_4/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html for a complete list of token filter factories, including documentation.