What makes POS tagging difficult?

Several aspects of a language can make POS tagging difficult. Many common English words can take two or more tags, so a dictionary alone is not always sufficient to determine a word's POS. For example, the POS of words such as bill and force depends on their context. The following sentence uses force as both a noun and a verb, and bill as both a proper noun and a common noun:

"Bill used the force to force the manger to tear the bill in two."

Using the OpenNLP tagger with this sentence produces the following output:

    Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$
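The limits of a dictionary lookup can be illustrated with a minimal sketch. The AmbiguousLexicon class and its five entries are hypothetical and exist only for this illustration; they are not part of OpenNLP. A lookup can list the candidate tags for a word, but it cannot choose between them without looking at the surrounding words:

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

class AmbiguousLexicon {
    // Hypothetical toy lexicon: each word maps to every tag it may take.
    static final Map<String, List<String>> LEXICON = Map.of(
        "bill", List.of("NN", "VB"),
        "force", List.of("NN", "VB"),
        "tear", List.of("NN", "VB"),
        "the", List.of("DT"),
        "to", List.of("TO"));

    // Returns the candidate tags for a word, or an empty list if it is unknown.
    static List<String> candidateTags(String word) {
        return LEXICON.getOrDefault(word.toLowerCase(Locale.ROOT), List.of());
    }

    // A word is ambiguous when the lexicon alone cannot pick a single tag.
    static boolean isAmbiguous(String word) {
        return candidateTags(word).size() > 1;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"the", "force", "to", "force", "bill"}) {
            System.out.println(word + " -> " + candidateTags(word)
                + (isAmbiguous(word) ? " (needs context)" : ""));
        }
    }
}
```

Both occurrences of force return the same two candidates, which is why a tagger has to use context, such as the preceding the or to, to decide between NN and VB.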

Textese is a combination of different forms of text, including abbreviations, hashtags, emoticons, and slang. Its use in communication media such as tweets and text messages makes sentences more difficult to tag. For example, the following message is difficult to tag:

"AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."

Its equivalent is:

"As far as I know, she hates cleaning the house! By the way, had a great time at the party. Be back in a minute."

Using the OpenNLP tagger, we will get the following output:

    AFAIK/NNS she/PRP H8/CD cth!/.
    BTW/NNP had/VBD a/DT GR8/CD tym/NN at/IN the/DT party/NN BBIAM./.

In the Using the MaxentTagger class to tag textese section later in this chapter, we will demonstrate how the MaxentTagger class can handle textese. A short list of common textese terms is given in the following table:

    Phrase                         Textese        Phrase                    Textese
    -----------------------------  -------------  ------------------------  -------
    As far as I know               AFAIK          By the way                BTW
    Away from keyboard             AFK            You're on your own        YOYO
    Thanks                         THNX or THX    As soon as possible       ASAP
    Today                          2day           What do you mean by that  WDYMBT
    Before                         B4             Be back in a minute       BBIAM
    See you                        C U            Can't                     CNT
    Haha                           hh             Later                     l8R
    Laughing out loud              LOL            On the other hand         OTOH
    Rolling on the floor laughing  ROFL or ROTFL  I don't know              IDK
    Great                          GR8            Cleaning the house        CTH
    At the moment                  ATM            In my humble opinion      IMHO

There are several lists of textese; a large list can be found at http://www.ukrainecalling.com/textspeak.aspx.
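One simple way to prepare textese for a tagger is to expand known abbreviations before tagging. The following is a minimal sketch of that idea, not the approach used later in this chapter. The TexteseNormalizer class is hypothetical; most of its entries come from the preceding table, while H8 and tym are assumptions taken from the expanded example sentence:

```java
import java.util.Locale;
import java.util.Map;

class TexteseNormalizer {
    // Expansions keyed by the uppercase form of the abbreviation.
    static final Map<String, String> EXPANSIONS = Map.ofEntries(
        Map.entry("AFAIK", "as far as I know"),
        Map.entry("BTW", "by the way"),
        Map.entry("GR8", "great"),
        Map.entry("CTH", "cleaning the house"),
        Map.entry("BBIAM", "be back in a minute"),
        Map.entry("B4", "before"),
        Map.entry("2DAY", "today"),
        Map.entry("IDK", "I don't know"),
        // Assumed entries, inferred from the expanded example sentence.
        Map.entry("H8", "hates"),
        Map.entry("TYM", "time"));

    // Replaces each known abbreviation and keeps trailing punctuation in place.
    static String normalize(String message) {
        StringBuilder out = new StringBuilder();
        for (String token : message.trim().split("\\s+")) {
            // Peel off trailing punctuation so that "cth!" still matches "CTH".
            int end = token.length();
            while (end > 0 && !Character.isLetterOrDigit(token.charAt(end - 1))) {
                end--;
            }
            String core = token.substring(0, end);
            String trailing = token.substring(end);
            String expanded =
                EXPANSIONS.getOrDefault(core.toUpperCase(Locale.ROOT), core);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(expanded).append(trailing);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize(
            "AFAIK she H8 cth! BTW had a GR8 tym at the party BBIAM."));
    }
}
```

A lookup of this kind does not restore capitalization or missing punctuation, and it cannot resolve abbreviations with more than one reading, so it is only a preprocessing aid.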

Tokenization is an important step in the POS tagging process. If the tokens are not split properly, we can get erroneous results. There are several other potential problems, including the following:

  • If we convert text to lowercase, then a word such as sam can be confused with the person Sam or with SAM, the System for Award Management (www.sam.gov)
  • We have to take into account contractions such as can't and recognize that different characters may be used for the apostrophe
  • Although a phrase such as vice versa can be treated as a unit, it has also been used as the name of a band in England, the title of a novel, and the title of a magazine
  • We can't ignore hyphenated words such as first-cut and prime-cut, whose meanings differ from those of their component words
  • Some words have embedded numbers, such as iPhone 5S
  • Special character sequences such as a URL or email address also need to be handled
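Some of the problems in this list can be addressed before tagging. The following sketch, which uses only java.util.regex, normalizes apostrophe variants and then keeps contractions, hyphenated words, URLs, and email addresses together as single tokens. The PitfallAwareTokenizer class and its patterns are assumptions made for illustration; they are deliberately simple and will not cover every URL or address format. Case is left untouched so that a name such as Sam is not folded into sam:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PitfallAwareTokenizer {
    // Alternatives are tried in order, so the more specific patterns come first.
    private static final Pattern TOKEN = Pattern.compile(
        "https?://\\S+"                        // URLs
        + "|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"  // email addresses
        + "|\\w+(?:[-']\\w+)*"                 // words, hyphenated words, contractions
        + "|\\S");                             // any other non-space character

    // Maps common apostrophe variants to the ASCII apostrophe.
    static String normalizeApostrophes(String text) {
        return text.replace('\u2019', '\'')
                   .replace('\u2018', '\'')
                   .replace('`', '\'');
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(normalizeApostrophes(text));
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
            "I can\u2019t find the first-cut at http://example.com or bob@example.com"));
    }
}
```

A name with an embedded number, such as iPhone 5S, is still split into two tokens here; joining such names would need a list of known multiword terms.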

Some words are found embedded in quotes or parentheses, which can make their meaning confusing. Consider the following example:

"Whether "Blue" was correct or not (it's not) is debatable."

"Blue" could refer to the color blue or conceivably the nickname of a person.
The output of the tagger for this sentence is as follows:

    Whether/IN "Blue"/NNP was/VBD correct/JJ or/CC not/RB (it's/JJ not)/NN is/VBZ debatable/VBG
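In this output, the quotation marks and parentheses remain attached to the neighboring words, which appears to contribute to tags such as (it's/JJ and not)/NN. A minimal sketch of one remedy is to peel such characters off each whitespace-delimited chunk before tagging. The PunctuationSplitter class and its two character sets are hypothetical, and the sketch does not decide whether Blue is a color or a nickname; it only gives the tagger cleaner tokens:

```java
import java.util.ArrayList;
import java.util.List;

class PunctuationSplitter {
    // Characters treated as separate tokens at the start or end of a chunk.
    private static final String OPENERS = "\"([{";
    private static final String CLOSERS = "\")]}.,;:!?";

    static List<String> split(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : sentence.trim().split("\\s+")) {
            int start = 0;
            int end = chunk.length();
            // Emit leading quotes and brackets as their own tokens.
            while (start < end && OPENERS.indexOf(chunk.charAt(start)) >= 0) {
                tokens.add(String.valueOf(chunk.charAt(start++)));
            }
            // Collect trailing punctuation, preserving its original order.
            List<String> trailing = new ArrayList<>();
            while (end > start && CLOSERS.indexOf(chunk.charAt(end - 1)) >= 0) {
                trailing.add(0, String.valueOf(chunk.charAt(--end)));
            }
            if (start < end) {
                tokens.add(chunk.substring(start, end));
            }
            tokens.addAll(trailing);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split(
            "Whether \"Blue\" was correct or not (it's not) is debatable."));
    }
}
```

The apostrophe is left out of both character sets on purpose so that contractions such as it's stay intact.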