Chapter 7. Entity Recognition

Extracting information from unstructured text is a tedious process because of the complex nature of natural language. Even with advances in the field of natural language processing (NLP), we are far from the point where any unrestricted text can be analyzed and its meaning extracted for general purposes. However, if we focus on a specific set of questions, we can extract a significant amount of information from text data. Named entity recognition (NER) helps identify the important entities in a text so that meaning can be derived from the unstructured data. It is a vital component of NLP applications such as question-answering systems and product discovery on e-commerce websites.

In this chapter, we will cover the following topics:

  • Entity extraction
  • Coreference and relationship extraction
  • Sentence boundary detection
  • Named entity recognition

Entity extraction

The process of extracting information from unstructured documents is called information extraction. In today's world, most of the data produced on the internet is semi-structured or unstructured; this data is mostly in a human-understandable format, what we call natural language, so natural language processing usually comes into play during information extraction. Named entity recognition (NER), sometimes also called entity extraction or entity chunking, is one of the vital subprocesses in the information extraction chain. The main job of NER is to locate the rigid designators in a document and classify these elements into predefined categories. A named entity extractor typically uses a set of predefined categories such as the following:

  • persons
  • organizations
  • locations
  • time
  • money
  • percentages
  • dates

Given an unstructured document, NER will annotate the relevant blocks of text and extract the entities. Consider the sample text shown here:

"IBM developed AI software named Watson in 2006. It was named after IBM's first CEO and industrialist, Thomas J. Watson. Watson won $1 million in the Jeopardy game."

After processing through the NER algorithm, the text is annotated with various predefined categories as shown here:

"IBM(Company) developed AI software named Watson in 2006(Date). It was named after IBM's(Company) first CEO and industrialist, Thomas J. Watson(Person). Watson won $1 million(Money) in the Jeopardy game."

While annotating a given document, some words may have different meanings depending on the context; for example, in the preceding text, Watson is the name of a piece of software, but it can also be the name of a person. Let's consider one more simple example:

Tim Cook is the CEO of Apple.

In the preceding sentence, Apple is the name of a company, not a fruit. NER algorithms use contextual information to determine the correct tag for a given entity.

NER is a challenging task; it is typically divided into two subtasks:

  • Identification of named entities
  • Classification of named entities

Various approaches have been followed to solve the NER challenge; the following are the main ones used to develop NER systems.

The rule-based approach

These methods are based on linguistic rules and grammar-based techniques. Grammar-based systems are more accurate but require a lot of manual work by expert computational linguists. Some of the rule-based NER systems are:

  • GATE
  • CIMPLE
  • SystemT
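
To make the idea concrete, here is a toy rule-based tagger in R. It is illustrative only: the regular expressions and category names are our own assumptions, and real systems such as GATE rely on much richer grammars and gazetteers.

```r
# Toy rule-based entity tagger: each category is defined by a
# hand-written regular expression, as in a (greatly simplified)
# grammar-based system. The patterns here are illustrative assumptions.
text <- "Thomas J. Watson won $1 million in 2011."

patterns <- list(
  money  = "\\$[0-9]+([.,][0-9]+)?( million| billion)?",
  date   = "\\b(19|20)[0-9]{2}\\b",
  person = "\\b[A-Z][a-z]+( [A-Z]\\.)?( [A-Z][a-z]+)+\\b"
)

for (category in names(patterns)) {
  matches <- regmatches(text, gregexpr(patterns[[category]],
                                       text, perl = TRUE))[[1]]
  if (length(matches) > 0) {
    cat(category, "->", matches, "\n")
  }
}
```

Even this tiny example shows the weakness of the approach: the person pattern fires on any capitalized multi-word sequence, so covering real text requires ever more hand-crafted rules and exceptions.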

Machine learning

This type of solution is based on statistical models. Such NER systems need a large annotated corpus, called training data. The algorithm processes the annotated corpus, learns patterns or rules, and builds a model that can identify entities in new documents. Supervised learning is one approach followed to build NER systems; annotating a large corpus, however, is a challenging and time-consuming task. Some of the statistical algorithms used for supervised learning are:

  • Maximum entropy models
  • Hidden Markov models
  • Support vector machines
  • Conditional random fields

Semi-supervised learning is gaining momentum in building NER systems, since it requires less human intervention and reduces the dependence on a large annotated corpus. Such a system basically has a sentence boundary detector, a tokenizer, and a part-of-speech tagger.

In unsupervised learning, we try to infer the named entity by clustering words that occur in similar contexts. This approach can use lexical databases such as WordNet to identify the named entity types.

There are various NLP frameworks available to perform various tasks in natural language processing:

  • Apache OpenNLP
  • Stanford NLP
  • LingPipe

In this chapter, we will learn how to invoke the Apache OpenNLP library through R using various R libraries such as openNLP, rJava, and NLP.

Apache OpenNLP is a Java-based framework. It provides implementations of machine learning algorithms that can be used for natural language processing, along with supporting APIs for some of the important steps in natural language processing, such as:

  • Part of speech tagging
  • Sentence boundary detection
  • Tokenization
  • Chunking
  • Parsing
  • Named entity extraction
  • Coreference resolution

It includes implementations of algorithms such as maximum entropy and perceptron-based machine learning. Using various components from Apache OpenNLP, we can build an end-to-end language-processing pipeline. All the functionality is exposed as application programming interfaces (APIs), and each component generally has a training module and a testing/predicting module. Let's get an understanding of the functionality provided by OpenNLP.
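
As a preview of how these components chain together, the following R sketch runs sentence detection, tokenization, and named entity extraction over our earlier Watson example. It assumes the openNLP, NLP, and openNLPmodels.en packages are installed; the annotator names come from the openNLP R interface to Apache OpenNLP.

```r
library(NLP)
library(openNLP)   # the entity models come from openNLPmodels.en

text <- as.String("IBM developed AI software named Watson in 2006.
It was named after IBM's first CEO and industrialist, Thomas J. Watson.")

# Each annotator wraps one OpenNLP component; order matters, since the
# entity annotators consume the sentence and token annotations.
pipeline <- list(
  Maxent_Sent_Token_Annotator(),
  Maxent_Word_Token_Annotator(),
  Maxent_Entity_Annotator(kind = "organization"),
  Maxent_Entity_Annotator(kind = "person"),
  Maxent_Entity_Annotator(kind = "date")
)

annotations <- annotate(text, pipeline)

# Keep only the entity annotations and print each matched substring
# together with its predicted kind.
entities <- annotations[annotations$type == "entity"]
kinds <- sapply(entities$features, `[[`, "kind")
data.frame(entity = as.character(text[entities]), kind = kinds)
```

The exact entities found depend on the pre-trained models, so treat the output as indicative rather than guaranteed; we will work through each of these components in detail in the following sections.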
