What makes POS difficult?

There are many aspects of a language that can make POS tagging difficult. Many common English words can take two or more tags, so a dictionary alone is not always sufficient to determine a word's POS. For example, the POS of words such as bill and force depends on their context. In the following sentence, force appears as both a noun and a verb, and bill appears as both a proper noun and a common noun.

"Bill used the force to force the manger to tear the bill in two."

Using the OpenNLP tagger with this sentence produces the following output:

    Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$
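To see why a dictionary lookup alone cannot settle the tag, consider the following minimal sketch. The lexicon is a hypothetical toy map written for this illustration (it is not part of OpenNLP), and it assumes Java 9 or later for `Map.of` and `List.of`. It only reports which words have more than one candidate tag; choosing among the candidates still requires context.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy lexicon for illustration only; not an OpenNLP class.
final class TagLexiconSketch {
    // Each word maps to every tag it may take, using Penn Treebank tag names.
    static final Map<String, List<String>> LEXICON = Map.of(
        "bill", List.of("NN", "VB", "NNP"),
        "force", List.of("NN", "VB"),
        "tear", List.of("NN", "VB"),
        "the", List.of("DT"));

    // Returns the candidate tags for a word, or an empty list if the word is unknown.
    static List<String> candidateTags(String word) {
        return LEXICON.getOrDefault(word.toLowerCase(Locale.ROOT), List.of());
    }

    // A word is ambiguous when the lexicon offers more than one tag for it.
    static boolean isAmbiguous(String word) {
        return candidateTags(word).size() > 1;
    }

    public static void main(String[] args) {
        String sentence = "Bill used the force to force the manger to tear the bill";
        for (String word : sentence.split(" ")) {
            System.out.println(word + " -> " + candidateTags(word)
                + (isAmbiguous(word) ? " (ambiguous)" : ""));
        }
    }
}
```

Both occurrences of force receive the same candidate list here, which is exactly the limitation: only the surrounding words tell a tagger that the first is a noun and the second a verb.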

The use of textese, a combination of different forms of text including abbreviations, hashtags, emoticons, and slang, in communication media such as tweets and text messages makes sentences more difficult to tag. For example, the following message is difficult to tag:

"AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."

Its equivalent is:

"As far as I know, she hates cleaning the house! By the way, had a great time at the party. Be back in a minute."

Using the OpenNLP tagger, we will get the following output:

    AFAIK/NNS she/PRP H8/CD cth!/.
    BTW/NNP had/VBD a/DT GR8/CD tym/NN at/IN the/DT party/NN BBIAM./.

In the Using the MaxentTagger class to tag textese section later in this chapter, we will demonstrate how the Stanford MaxentTagger class can handle textese. A short list of common textese terms is given in the following table:

Phrase	Textese	Phrase	Textese
As far as I know	AFAIK	By the way	BTW
Away from keyboard	AFK	You're on your own	YOYO
Thanks	THNX or THX	As soon as possible	ASAP
Today	2day	What do you mean by that	WDYMBT
Before	B4	Be back in a minute	BBIAM
See you	C U	Can't	CNT
Haha	hh	Later	L8R
Laughing out loud	LOL	On the other hand	OTOH
Rolling on the floor laughing	ROFL or ROTFL	I don't know	IDK
Great	GR8	Cleaning the house	CTH
At the moment	ATM	In my humble opinion	IMHO

There are several lists of textese; a large list can be found at http://www.ukrainecalling.com/textspeak.aspx.
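One possible way to make such messages easier to tag is to expand known textese terms before the text reaches the tagger. The following is a minimal sketch of that idea, not the approach used later in this chapter. The class name `TexteseExpander` is hypothetical, most entries come from the table above, and the mappings for H8 and tym are assumptions taken from the equivalent sentence shown earlier. It assumes Java 9 or later for `Map.of`.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of OpenNLP, LingPipe, or Stanford NLP.
final class TexteseExpander {
    // Keys are stored in uppercase so that the lookup is case-insensitive.
    private static final Map<String, String> EXPANSIONS = Map.of(
        "AFAIK", "As far as I know",
        "BTW", "By the way",
        "GR8", "great",
        "BBIAM", "Be back in a minute",
        "CTH", "cleaning the house",
        "H8", "hates",   // assumption: taken from the example's longhand form
        "TYM", "time");  // assumption: taken from the example's longhand form

    // Splits a whitespace-delimited token into its core and any trailing punctuation.
    private static final Pattern TOKEN = Pattern.compile("^(.*?)([.!?,]*)$");

    static String expand(String message) {
        StringBuilder out = new StringBuilder();
        for (String token : message.trim().split("\\s+")) {
            String core = token;
            String punctuation = "";
            Matcher m = TOKEN.matcher(token);
            if (m.matches()) {
                core = m.group(1);
                punctuation = m.group(2);
            }
            // Unknown terms are passed through unchanged.
            String replacement = EXPANSIONS.getOrDefault(core.toUpperCase(Locale.ROOT), core);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(replacement).append(punctuation);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."));
    }
}
```

A lookup like this only covers terms it has seen, does not restore commas or sentence boundaries, and cannot resolve terms with more than one meaning, so it is a preprocessing aid rather than a substitute for a tagger trained on this kind of text.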

Tokenization is an important step in the POS tagging process. If the tokens are not split properly, we can get erroneous results. There are several other potential problems, including the following:

If we convert text to lowercase, then a word such as sam can no longer be distinguished from the name Sam or the acronym SAM, the System for Award Management (www.sam.gov)
We have to take into account contractions such as can't, and recognize that different characters may be used for the apostrophe
Although a phrase such as vice versa can be treated as a unit, the same phrase has also been used as the name of a band in England, the title of a novel, and the title of a magazine
We can't ignore hyphenated words such as first-cut and prime-cut, whose meanings differ from those of their individual parts
Some terms have embedded numbers, such as iPhone 5S
Special character sequences such as URLs and email addresses also need to be handled
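Several of the problems listed above can be reduced with a tokenizer that protects certain patterns before splitting. The following is a rough sketch under stated assumptions, not the tokenizer of any particular library: the class name `ProtectiveTokenizer` and its regular expression are hypothetical, the URL and email patterns are deliberately simplified, and `\w` matches ASCII word characters only. It preserves case, maps typographic apostrophes to the ASCII apostrophe, and keeps contractions, hyphenated words, URLs, and email addresses as single tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer for illustration; the patterns are simplified on purpose.
final class ProtectiveTokenizer {
    // Alternatives are tried left to right, so the most specific patterns come first.
    private static final Pattern TOKEN = Pattern.compile(
        "(?:https?://|www\\.)[^\\s<>()]*[^\\s<>().,;:!?'\"]" // URL without trailing punctuation
        + "|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"                // simplified email address
        + "|\\w+(?:['-]\\w+)*"                               // word, contraction, or hyphenated word
        + "|[^\\s\\w]");                                     // any other single symbol

    // Maps common typographic apostrophes to the ASCII apostrophe.
    static String normalizeApostrophes(String text) {
        return text.replaceAll("[\u2018\u2019\u02BC]", "'");
    }

    // Returns the tokens in order; the original letter case is preserved.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(normalizeApostrophes(text));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Sam can\u2019t email sam@example.com or visit www.sam.gov about the first-cut.";
        System.out.println(tokenize(text));
    }
}
```

The sketch does not address multiword units such as vice versa or terms such as iPhone 5S, which are still split at the space; handling those would need a phrase list or a later merging step.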

Some words are found embedded in quotes or parentheses, which can make their meaning confusing. Consider the following example:

"Whether "Blue" was correct or not (it's not) is debatable."

"Blue" could refer to the color blue or conceivably the nickname of a person.
The output of the tagger for this sentence is as follows:

    Whether/IN "Blue"/NNP was/VBD correct/JJ or/CC not/RB (it's/JJ not)/NN is/VBZ debatable/VBG
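Part of the problem in this output is that quotes and parentheses remain attached to the words, as in (it's and not). A minimal sketch of one remedy is to peel punctuation off the edges of each whitespace-delimited chunk before tagging. The class name `EdgePunctuationSplitter` and its punctuation set are assumptions made for this illustration. Apostrophes are deliberately left out of the set so that contractions such as it's stay intact.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any tagging library.
final class EdgePunctuationSplitter {
    // Characters that are split off when they appear at the start or end of a chunk.
    private static final String EDGE = "\"()[]{}.,;:!?";

    static List<String> split(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : sentence.trim().split("\\s+")) {
            int start = 0;
            int end = chunk.length();
            // Emit each leading punctuation character as its own token.
            while (start < end && EDGE.indexOf(chunk.charAt(start)) >= 0) {
                tokens.add(String.valueOf(chunk.charAt(start)));
                start++;
            }
            // Find where the trailing punctuation begins.
            int coreEnd = end;
            while (coreEnd > start && EDGE.indexOf(chunk.charAt(coreEnd - 1)) >= 0) {
                coreEnd--;
            }
            if (coreEnd > start) {
                tokens.add(chunk.substring(start, coreEnd));
            }
            // Emit each trailing punctuation character as its own token.
            for (int i = coreEnd; i < end; i++) {
                tokens.add(String.valueOf(chunk.charAt(i)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("Whether \"Blue\" was correct or not (it's not) is debatable."));
    }
}
```

Separating the punctuation gives the tagger clean word tokens, but it does not resolve whether Blue is a color or a nickname; that still depends on context.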
