Linguistic annotation applies syntactic and grammatical rules to identify sentence boundaries despite ambiguous punctuation, and to determine each token's role in a sentence for POS tagging and dependency parsing. It also identifies common root forms through stemming and lemmatization so that related words can be grouped:
- POS annotations: These help disambiguate tokens based on their function (for example, when a verb and a noun share the same form). Distinguishing tokens by their tag increases the size of the vocabulary but may result in better accuracy.
- Dependency parsing: It identifies hierarchical relationships among tokens, is commonly used for translation, and is important for interactive applications that require more advanced language understanding, such as chatbots.
- Stemming: It uses simple rules to remove common endings, such as s, ly, ing, and ed, from a token and reduce it to its stem or root form.
- Lemmatization: It uses more sophisticated rules to derive the canonical root (lemma) of a word. It can resolve irregular forms, such as better and best, to their common root and condenses the vocabulary more effectively, but it is slower than stemming. Both stemming and lemmatization simplify the vocabulary at the expense of semantic nuances.
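
The difference between the approaches in the list above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how a production stemmer or lemmatizer works: `toy_stem`, `toy_lemmatize`, the `IRREGULAR_LEMMAS` lookup, and the hand-assigned POS tags are all hypothetical names and data introduced here for illustration only.

```python
# Toy sketch only: real stemmers and lemmatizers use far richer rules
# and dictionaries than the handful of cases shown here.

# Endings named in the text, checked in this order (longest first).
SUFFIXES = ("ing", "ed", "ly", "s")

# Hypothetical hand-made lookup standing in for a lemmatizer's dictionary.
IRREGULAR_LEMMAS = {"better": "good", "best": "good"}


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(token: str) -> str:
    """Resolve known irregular forms first, then fall back to suffix stripping."""
    word = token.lower()
    return IRREGULAR_LEMMAS.get(word, toy_stem(word))


tokens = ["walks", "walked", "walking", "quickly", "better", "best"]
stems = [toy_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]

# Vocabulary size shrinks as forms are merged: 6 raw tokens, 4 stems, 3 lemmas.
vocab_sizes = (len(set(tokens)), len(set(stems)), len(set(lemmas)))

# POS tags have the opposite effect: the same surface form becomes two terms.
# The tags below are assigned by hand purely for illustration.
tagged = [("book", "NOUN"), ("book", "VERB")]
pos_terms = {f"{word}_{tag}" for word, tag in tagged}
```

In this sketch, suffix stripping alone leaves better and best untouched, whereas the dictionary lookup maps both to a single lemma. The tagged tokens show the opposite trade-off: one surface form turns into two distinct vocabulary entries.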