Many aspects of a language can make POS tagging difficult. Many English words can take two or more tags, so a dictionary alone is not always sufficient to determine a word's POS. For example, the part of speech of words such as bill and force depends on their context. The following sentence uses force as both a noun and a verb, and bill as both a proper noun and a common noun:
"Bill used the force to force the manger to tear the bill in two."
Using the OpenNLP tagger with this sentence produces the following output:
Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$
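The token/TAG format of this output is easy to post-process. As a minimal sketch, assuming only the Java standard library (the class and method names are hypothetical and not part of OpenNLP), the following code splits the output into token and tag pairs and flags tokens that still carry sentence punctuation, such as two., which was tagged PRP$:

```java
import java.util.ArrayList;
import java.util.List;

final class TaggedTokens {
    /** Splits "token/TAG token/TAG ..." into {token, tag} pairs, using the last '/' in each item. */
    static List<String[]> parse(String taggerOutput) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : taggerOutput.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // not a token/TAG pair
            }
            pairs.add(new String[] {item.substring(0, slash), item.substring(slash + 1)});
        }
        return pairs;
    }

    /** True when a token longer than one character ends in '.', '!' or '?'. */
    static boolean hasTrailingPunctuation(String token) {
        return token.length() > 1 && ".!?".indexOf(token.charAt(token.length() - 1)) >= 0;
    }

    public static void main(String[] args) {
        String output = "Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT "
                + "manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$";
        for (String[] pair : parse(output)) {
            String note = hasTrailingPunctuation(pair[0]) ? "  <- punctuation still attached" : "";
            System.out.println(pair[0] + "\t" + pair[1] + note);
        }
    }
}
```

A check like this only highlights suspicious tokens. It does not correct the tag, which would require retokenizing the sentence and tagging it again.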
Textese, a mix of abbreviations, hashtags, emoticons, and slang used in communication media such as tweets and text messages, makes sentences more difficult to tag. For example, the following message is difficult to tag:
"AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."
Its equivalent is:
"As far as I know, she hates cleaning the house! By the way, had a great time at the party. Be back in a minute."
Using the OpenNLP tagger, we will get the following output:
AFAIK/NNS she/PRP H8/CD cth!/. BTW/NNP had/VBD a/DT GR8/CD tym/NN at/IN the/DT party/NN BBIAM./.
In the Using the MaxentTagger class to tag textese section later in this chapter, we will demonstrate how the Stanford tagger can handle textese. A short list of common textese terms is given in the following table:
| Phrase | Textese | Phrase | Textese |
| --- | --- | --- | --- |
| As far as I know | AFAIK | By the way | BTW |
| Away from keyboard | AFK | You're on your own | YOYO |
| Thanks | THNX or THX | As soon as possible | ASAP |
| Today | 2day | What do you mean by that | WDYMBT |
| Before | B4 | Be back in a minute | BBIAM |
| See you | C U | Can't | CNT |
| Haha | hh | Later | l8R |
| Laughing out loud | LOL | On the other hand | OTOH |
| Rolling on the floor laughing | ROFL or ROTFL | I don't know | IDK |
| Great | GR8 | Cleaning the house | CTH |
| At the moment | ATM | In my humble opinion | IMHO |
A longer list of textese terms can be found at http://www.ukrainecalling.com/textspeak.aspx.
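One simple way to reduce the impact of textese is to expand known abbreviations before the text reaches the tagger. The following is a minimal sketch of that idea using a dictionary lookup. It is not the approach demonstrated later in the chapter, the class and method names are hypothetical, and the lexicon holds only a handful of entries taken from the table:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TexteseExpander {
    // A few entries from the table; a usable lexicon would be far larger.
    // Ambiguous entries such as ATM are left out on purpose.
    private static final Map<String, String> LEXICON = Map.of(
            "AFAIK", "as far as I know",
            "BTW", "by the way",
            "GR8", "great",
            "CTH", "cleaning the house",
            "BBIAM", "be back in a minute");

    /** Replaces known textese terms, ignoring case and any trailing punctuation. */
    static String expand(String message) {
        List<String> words = new ArrayList<>();
        for (String word : message.trim().split("\\s+")) {
            // Set trailing punctuation aside so that "cth!" still matches "CTH".
            int end = word.length();
            while (end > 0 && !Character.isLetterOrDigit(word.charAt(end - 1))) {
                end--;
            }
            String core = word.substring(0, end);
            String trailing = word.substring(end);
            String expansion = LEXICON.get(core.toUpperCase(Locale.ROOT));
            words.add((expansion != null ? expansion : core) + trailing);
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(expand("AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."));
    }
}
```

Terms that are missing from the lexicon, such as H8 and tym, pass through unchanged, so a lookup like this can only be as good as its word list. It also ignores context, which matters for abbreviations that are ordinary words or acronyms in other settings.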
Tokenization is an important step in the POS tagging process. If the tokens are not split properly, we can get erroneous results. There are several other potential problems, including the following:
- If we convert the text to lowercase, a word such as sam becomes ambiguous between the name Sam and the acronym SAM, the System for Award Management (www.sam.gov)
- We have to take into account contractions such as can't and recognize that different characters may be used for the apostrophe
- Although a phrase such as vice versa can be treated as a unit, it has also been used as the name of a band in England, the title of a novel, and the title of a magazine
- We can't ignore hyphenated words such as first-cut and prime-cut, whose meanings differ from those of their component words
- Some words have embedded numbers, such as iPhone 5S
- Special character sequences such as a URL or email address also need to be handled
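Several of the issues listed above can be illustrated with a small regular-expression tokenizer. The following is a minimal sketch using only java.util.regex, not the tokenizer of any particular NLP library. The class name and the patterns are assumptions chosen for illustration, and they are far from complete:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SketchTokenizer {
    // Alternatives are tried from left to right, so the more specific patterns come first.
    private static final Pattern TOKEN = Pattern.compile(
            "https?://\\S+|www\\.\\S+"                 // URLs
            + "|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"      // email addresses
            + "|\\w+(?:[-']\\w+)*"                     // words, keeping inner hyphens and apostrophes
            + "|\\S");                                 // any other single symbol

    /** Maps typographic apostrophes to the ASCII apostrophe, then extracts tokens. */
    static List<String> tokenize(String text) {
        String normalized = text.replace('\u2019', '\'').replace('\u2018', '\'');
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(normalized);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "She can\u2019t use the first-cut iPhone 5S, "
                + "see www.sam.gov or mail info@example.com now.";
        System.out.println(tokenize(text));
    }
}
```

This sketch keeps can't and first-cut as single tokens and leaves the URL and email address intact, but it still splits iPhone 5S into two tokens and keeps any punctuation that directly follows a URL attached to it. Multiword names and phrases such as vice versa would need a separate lookup step.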
Some words are found embedded in quotes or parentheses, which can make their meaning confusing. Consider the following example:
"Whether "Blue" was correct or not (it's not) is debatable."
"Blue" could refer to the color blue or conceivably the nickname of a person.
The output of the tagger for this sentence is as follows:
Whether/IN "Blue"/NNP was/VBD correct/JJ or/CC not/RB (it's/JJ not)/NN is/VBZ debatable/VBG
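In this output, the quotation marks and parentheses remain attached to the words they surround, so the tagger sees tokens such as (it's and not). As a minimal sketch, assuming that whitespace splitting is the only tokenization applied, the following code peels leading and trailing punctuation off each chunk before tagging. The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

final class WrapperSplitter {
    /** Splits on whitespace, then emits leading and trailing punctuation as separate tokens. */
    static List<String> split(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : sentence.trim().split("\\s+")) {
            int start = 0;
            int end = chunk.length();
            // Leading symbols such as an opening quote or parenthesis.
            while (start < end && !Character.isLetterOrDigit(chunk.charAt(start))) {
                tokens.add(String.valueOf(chunk.charAt(start)));
                start++;
            }
            // Find where the trailing symbols begin.
            while (end > start && !Character.isLetterOrDigit(chunk.charAt(end - 1))) {
                end--;
            }
            if (start < end) {
                tokens.add(chunk.substring(start, end)); // the word itself, e.g. it's
            }
            for (int i = end; i < chunk.length(); i++) {
                tokens.add(String.valueOf(chunk.charAt(i)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("Whether \"Blue\" was correct or not (it's not) is debatable."));
    }
}
```

Separating the wrappers gives the tagger clean words to work with, but it does not settle whether Blue is a color or a nickname. That still depends on context.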