Hidden Markov Models for POS tagging

Hidden Markov Models (HMMs) are well suited to classification problems over generative sequences. In natural language processing, HMMs can be used for a variety of tasks, such as phrase chunking, parts of speech tagging, and information extraction from documents. If we treat the words of a sentence as the observed symbols and the POS tags as the hidden states, with conditional probabilities estimated from data, then POS tagging becomes a typical sequence classification problem that can be solved with an HMM.

Basic definitions and notations

According to Rabiner, there are five elements needed to define an HMM:

  • N denotes the number of states (which are hidden) in the model. For parts of speech tagging, N is the number of tags that can be used by the system; each possible tag corresponds to one state of the HMM. The set of individual states is denoted by S = {S1, S2, ..., SN}. Let qt denote the state at time t.
  • Let M denote the number of distinct output symbols in the alphabet of the HMM. For parts of speech tagging, M is the number of words in the lexicon of the system. Let V = {v1, v2, ..., vM} denote the set of observation symbols.
  • The state transition probability distribution, also called the transition matrix A = {aij}, gives the probability of moving from state Si to state Sj. For parts of speech tagging, the states represent the tags, so aij is the probability that the model will move from tag ti to tag tj. This probability can be estimated from counts in a training corpus (a small R sketch of this estimation appears after this list):
    a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i) \approx \frac{C(t_i, t_j)}{C(t_i)}

    Where qt denotes the current state, the transition probabilities should also satisfy the normal stochastic constraints:

    a_{ij} \ge 0, \quad 1 \le i, j \le N

    And:

    \sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N
  • The observation symbol probability distribution, also called the emission matrix B = {bj(k)}, gives the probability of emitting the symbol vk when the system is in state Sj:
    b_j(k) = P(o_t = v_k \mid q_t = S_j, \lambda)

    Where vk denotes the kth observation symbol, ot the observation at time t, and λ = (A, B, π) the current parameter vector, the following conditions must be satisfied:

    b_j(k) \ge 0, \quad 1 \le j \le N, \; 1 \le k \le M

    And:

    \sum_{k=1}^{M} b_j(k) = 1, \quad 1 \le j \le N
  • The initial state probability distribution π = {πi} gives the probability of each state being the first state. For parts of speech tagging, this is the probability that a sentence begins with a particular tag:
    \pi_i = P(q_1 = S_i), \quad 1 \le i \le N
    For an n-gram model, if we need to estimate the probability of a word in a word sequence, we can estimate its conditional probability given the preceding words in the sequence; the probability of the whole sequence is then the product of these conditional probabilities. With a trigram model, for example, each word is conditioned on at most the two preceding words:

    Note

    P(w_1 \ldots w_n) = P_e(w_1) \cdot P_e(w_2 \mid w_1) \cdot P_e(w_3 \mid w_1, w_2) \cdots P_e(w_n \mid w_{n-2}, w_{n-1})
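
To make these estimates concrete, here is a minimal R sketch that derives A, B, and π from a tiny hand-made tagged corpus. The corpus, the sentences, and the variable names are purely illustrative; a real tagger would compute the same counts from a large annotated corpus.

# A minimal sketch of estimating HMM parameters from a tiny, hand-made
# tagged corpus. Each sentence is a data frame of (word, tag) pairs;
# a real system would read these from a large annotated corpus.
corpus <- list(
  data.frame(word = c("the", "dog", "barks"),
             tag  = c("DT", "NN", "VBZ"), stringsAsFactors = FALSE),
  data.frame(word = c("the", "cat", "sleeps"),
             tag  = c("DT", "NN", "VBZ"), stringsAsFactors = FALSE)
)

tags  <- sort(unique(unlist(lapply(corpus, `[[`, "tag"))))
words <- sort(unique(unlist(lapply(corpus, `[[`, "word"))))

# Count transitions C(ti, tj), emissions C(tj, vk), and initial tags
A  <- matrix(0, length(tags), length(tags),  dimnames = list(tags, tags))
B  <- matrix(0, length(tags), length(words), dimnames = list(tags, words))
pi <- setNames(numeric(length(tags)), tags)

for (sent in corpus) {
  pi[sent$tag[1]] <- pi[sent$tag[1]] + 1
  for (i in seq_len(nrow(sent))) {
    B[sent$tag[i], sent$word[i]] <- B[sent$tag[i], sent$word[i]] + 1
    if (i < nrow(sent))
      A[sent$tag[i], sent$tag[i + 1]] <- A[sent$tag[i], sent$tag[i + 1]] + 1
  }
}

# Normalize counts into probabilities; rows of A with no outgoing
# transitions (for example a sentence-final tag) are simply left as zeros
A  <- A / pmax(rowSums(A), 1)
B  <- B / rowSums(B)
pi <- pi / sum(pi)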

Implementing HMMs

When implementing an HMM, floating-point underflow is a significant problem. When we apply the Viterbi or forward algorithms to long sequences, the resultant probability values are very small, which can underflow on most machines. We solve this problem differently for each algorithm, as shown in the following examples.

Viterbi underflow

As the Viterbi algorithm only multiplies probabilities, a simple solution to underflow is to log all the probability values and then add values instead of multiplying. In fact, if all the values in the model matrices (A, B, π) are logged, then at runtime only addition operations are needed.
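
As an illustration, the following is a minimal sketch of a log-space Viterbi decoder in R. It assumes the A, B, and pi matrices estimated in the earlier sketch, an observation sequence of at least two words, and words that all appear in the lexicon; the function name and the example call are illustrative.

# Viterbi in log space: probabilities are logged once, then only added
log_viterbi <- function(obs, A, B, pi) {
  n <- length(obs)
  tags <- rownames(A)
  logA <- log(A); logB <- log(B); logPi <- log(pi)

  delta <- matrix(-Inf, length(tags), n, dimnames = list(tags, NULL))
  back  <- matrix(NA_integer_, length(tags), n)

  delta[, 1] <- logPi + logB[, obs[1]]
  for (t in 2:n) {
    for (j in seq_along(tags)) {
      scores      <- delta[, t - 1] + logA[, j]  # addition replaces multiplication
      back[j, t]  <- which.max(scores)
      delta[j, t] <- max(scores) + logB[j, obs[t]]
    }
  }

  # Trace the most probable tag sequence back from the final column
  path <- integer(n)
  path[n] <- which.max(delta[, n])
  for (t in (n - 1):1) path[t] <- back[path[t + 1], t + 1]
  tags[path]
}

log_viterbi(c("the", "dog", "barks"), A, B, pi)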

Forward algorithm underflow

The forward algorithm sums probability values, so taking logarithms is not a viable way to avoid underflow. The most common solution is to use scaling coefficients that keep the probability values within the dynamic range of the machine and that depend only on t. The coefficient ct is defined as:

c_t = \frac{1}{\sum_{i=1}^{N} \alpha_t(i)}

The scaled value for α then becomes:

\hat{\alpha}_t(i) = c_t \, \alpha_t(i) = \frac{\alpha_t(i)}{\sum_{j=1}^{N} \alpha_t(j)}
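
The following is a minimal R sketch of the scaled forward recursion, again assuming the A, B, and pi matrices estimated earlier and an in-lexicon observation sequence of at least two words; the function name is illustrative.

# Forward algorithm with scaling: each column of alpha is renormalized
# by its coefficient c_t, so values stay within machine range
scaled_forward <- function(obs, A, B, pi) {
  n <- length(obs)
  N <- nrow(A)
  alpha <- matrix(0, N, n, dimnames = list(rownames(A), NULL))
  scaleCoef <- numeric(n)                     # the coefficients c_t

  alpha[, 1] <- pi * B[, obs[1]]
  scaleCoef[1] <- 1 / sum(alpha[, 1])
  alpha[, 1] <- alpha[, 1] * scaleCoef[1]     # column now sums to 1

  for (t in 2:n) {
    alpha[, t] <- as.vector(t(A) %*% alpha[, t - 1]) * B[, obs[t]]
    scaleCoef[t] <- 1 / sum(alpha[, t])
    alpha[, t] <- alpha[, t] * scaleCoef[t]
  }

  # The sequence likelihood is recovered from the coefficients:
  # log P(obs) = -sum(log(c_t))
  list(alpha = alpha, c = scaleCoef, logLik = -sum(log(scaleCoef)))
}

scaled_forward(c("the", "cat", "sleeps"), A, B, pi)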

OpenNLP chunking

The process of chunking consists of dividing a text into syntactically correlated groups of words, such as noun groups and verb groups, without specifying their internal structure or their role in the main sentence. We can use Maxent_Chunk_Annotator() from the openNLP R package to accomplish this.

Before we can use this method, we have to POS tag the sentence, using the same steps performed previously for POS tagging.
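
As a reminder, those earlier steps look roughly like the following sketch, which assumes the NLP and openNLP packages and uses the Pierre Vinken sentence referred to later in this section as the example input:

library(NLP)
library(openNLP)

# Example sentence to be tagged and chunked
s <- as.String(paste("Pierre Vinken, 61 years old, will join the board",
                     "as a nonexecutive director Nov. 29."))

# Sentence and word token annotations are needed before POS tagging
sentAnnotator <- Maxent_Sent_Token_Annotator()
wordAnnotator <- Maxent_Word_Token_Annotator()
posAnnotator  <- Maxent_POS_Tag_Annotator()

posTaggedSentence <- annotate(s, posAnnotator,
                              annotate(s, list(sentAnnotator, wordAnnotator)))

With posTaggedSentence available, we can create the chunk annotator and apply it to the annotated sentence: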

# Create the chunk annotator and apply it to the POS-tagged sentence
chunkAnnotator <- Maxent_Chunk_Annotator(language = "en", probs = FALSE, model = NULL)
annotate(s, chunkAnnotator, posTaggedSentence)
head(annotate(s, chunkAnnotator, posTaggedSentence))

The chunk tag provides some more useful information:

(Figure: annotation output showing the chunk tag assigned to each word token, for example B-NP or I-NP.)
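
To list the chunk tags explicitly, we can read them off the word annotations. The following sketch assumes that the chunk annotator attaches a chunk_tag feature to each word token, in the same way the POS annotator attaches a POS feature:

# Collect the annotations produced by the chunk annotator
chunkAnnotations <- annotate(s, chunkAnnotator, posTaggedSentence)
wordAnnotations  <- subset(chunkAnnotations, type == "word")

# Pair each token with its chunk tag (assumed feature name: chunk_tag)
data.frame(token = s[wordAnnotations],
           chunk = sapply(wordAnnotations$features, `[[`, "chunk_tag"))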

Chunk tags

The chunk tags contain the name of the chunk type, for example, I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have two types of chunk tags: B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk. A chunk tag like B-NP is made up of two parts:

  1. First part:
    • B: Marks the beginning of a chunk
    • I: Marks the continuation of a chunk
    • E: Marks the end of a chunk

A chunk may be only one word long or may contain multiple words (like Pierre Vinken in the preceding example). The O chunk tag is used for tokens that are not part of any chunk.

  2. Second part:
    • NP: Noun chunk
    • VP: Verb chunk

You can find pre-trained models for different languages at:

http://opennlp.sourceforge.net/models-1.5/
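
If you download one of these model files, it can be passed to the annotator through the model argument; the local path below is purely illustrative:

# Hypothetical local path to a downloaded chunker model file
chunkAnnotator <- Maxent_Chunk_Annotator(language = "en",
                                         model = "models/en-chunker.bin")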
