Why is NER difficult?

Like many NLP tasks, NER is not always simple. Tokenizing a text reveals its components, but understanding what those components are can be difficult. Relying on proper nouns alone will not always work because language is ambiguous. For example, Penny and Faith, while valid names, can also denote a unit of currency and a belief, respectively. Words such as Georgia can name a country, a state, or a person. Nor can we make a list of all people, places, or other entities, as they are not predefined. Consider the following two simple sentences:

  • Jobs are harder to find nowadays
  • Jobs said dots will always connect

In these two sentences, Jobs looks like the entity, but the two uses are unrelated, and in the first sentence it is not an entity at all. We need more complex techniques that check how a candidate entity is used in its context. Text may also refer to the same entity in different ways. For example, IBM and International Business Machines both refer to the same entity, but recognizing this is challenging for NER. Conversely, Suzuki and Nissan may be interpreted by NER as names of people instead of names of companies.

Some phrases can be challenging. The phrase "Metropolitan Convention and Exhibit Hall", for example, contains words that are valid entities in themselves. When the domain is well known, however, a list of its entities can be compiled very easily, and a list-based approach is also easy to implement.
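
As a rough illustration of the list-based approach, the following Python sketch looks up phrases from a small entity list and prefers the longest match, so that the full phrase wins over the shorter entries contained in it. The list contents, the labels, and the find_entities function are assumptions made for this example; they are not taken from any particular NER library.

```python
import re

# Hypothetical entity list for a well-known domain; entries and labels are illustrative.
GAZETTEER = {
    "Metropolitan Convention and Exhibit Hall": "FACILITY",
    "Metropolitan": "ORGANIZATION",
    "Exhibit Hall": "FACILITY",
}

def find_entities(text, gazetteer=GAZETTEER):
    """Return (phrase, label) pairs, preferring the longest entry at each position."""
    # Longer entries come first in the alternation, so a full phrase is tried
    # before the shorter entries that appear inside it.
    entries = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(r"\b" + re.escape(e) + r"\b" for e in entries))
    return [(m.group(0), gazetteer[m.group(0)]) for m in pattern.finditer(text)]

print(find_entities("The meeting is at the Metropolitan Convention and Exhibit Hall."))
# [('Metropolitan Convention and Exhibit Hall', 'FACILITY')]
```

Without the longest-first ordering, the same text could yield "Metropolitan" and "Exhibit Hall" as two separate entities.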

NER is typically applied at the sentence level; otherwise, a phrase can easily bridge sentences, leading to the incorrect identification of an entity. For example, take the following two sentences:

"Bob went south. Dakota went west."

If we ignored the sentence boundaries, then we could inadvertently find the location entity South Dakota.
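
The effect can be reproduced with a minimal Python sketch. It assumes a one-entry list of known locations, a case-insensitive lookup, and a naive sentence splitter based on punctuation; the names LOCATIONS, find_locations, and split_sentences are hypothetical and exist only for this example.

```python
import re

# Hypothetical list of known locations, used only for illustration.
LOCATIONS = ["South Dakota"]

def find_locations(text):
    """Case-insensitive lookup of known locations in a run of words."""
    words = re.findall(r"[A-Za-z]+", text)  # drops punctuation, including periods
    joined = " ".join(words).lower()
    return [loc for loc in LOCATIONS
            if re.search(r"\b" + re.escape(loc.lower()) + r"\b", joined)]

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Bob went south. Dakota went west."

# Ignoring sentence boundaries lets the phrase bridge the two sentences.
whole_text = find_locations(text)

# Searching each sentence separately avoids the false match.
per_sentence = [loc for s in split_sentences(text) for loc in find_locations(s)]

print(whole_text)    # ['South Dakota']
print(per_sentence)  # []
```

A real system would use a proper sentence detector rather than this punctuation-based split, which fails on abbreviations such as "Dr." or "U.S.".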

Specialized text such as URLs, email addresses, and specialized numbers can be difficult to isolate. Identification becomes even more difficult if we have to take into account variations of the entity's form. For example, are parentheses used with phone numbers? Are dashes, periods, or some other character used to separate their parts? Do we need to consider international phone numbers?
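
To make the phone number questions concrete, here is a hedged Python sketch of a regular expression that tolerates parentheses around the area code and spaces, dashes, or periods as separators. The pattern assumes North American style ten-digit numbers with an optional leading 1; it is an illustration, not a complete validator, and the names PHONE and find_phone_numbers are invented for this example.

```python
import re

# Assumed pattern for North American style numbers; illustrative only.
PHONE = re.compile(
    r"""
    (?<!\d)                        # not preceded by a digit
    (?:\+?1[\s.-]?)?               # optional country code 1
    (?:\(\d{3}\)\s?|\d{3}[\s.-]?)  # area code, with or without parentheses
    \d{3}[\s.-]?                   # next three digits and optional separator
    \d{4}                          # last four digits
    (?!\d)                         # not followed by a digit
    """,
    re.VERBOSE,
)

def find_phone_numbers(text):
    """Return every substring of text that matches the assumed phone pattern."""
    return [m.group(0) for m in PHONE.finditer(text)]

print(find_phone_numbers("Call (555) 123-4567 or 555.123.4567."))
# ['(555) 123-4567', '555.123.4567']

print(find_phone_numbers("London office: +44 20 7946 0958"))
# []
```

The second call shows the limitation raised above: a number in an international format does not fit the assumed grouping and is missed, so each additional format needs its own handling.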

These factors contribute to the need for good NER techniques.
