Collocation and contingency tables

When we look into a corpus, some words tend to appear in combination; for example, I need a strong coffee, John kicked the bucket, He is a heavy smoker. J. R. Firth drew attention to the fact that such words do not combine randomly into phrases or sentences. Firth coined the term collocations for such word combinations; the meaning of a word is in part determined by its characteristic collocations. In the field of natural language processing (NLP), such word combinations play an important role.

Word combinations that are considered collocations can be compound nouns, idiomatic expressions, or combinations that are lexically restricted. This variability in definition is reflected in related terms such as multi-word expressions (MWEs), multi-word units (MWUs), bigrams, and idioms.

Collocations can be observed in corpora and can be quantified. Multi-word expressions have to be stored as units in order to understand their complete meaning. Three characteristic properties emerge as a common theme in the linguistic treatment of collocations: semantic non-compositionality, syntactic non-modifiability, and the non-substitutability of components by semantically similar words (Manning and Schütze 1999, 184). Collocations are words that show mutual expectancy, in other words, a tendency to occur near each other. Collocations can also be understood as statistically salient patterns that can be exploited by language learners.

Extracting co-occurrences

There are basically three types of co-occurrence found in lexical data. The attraction, or statistical association, between words that co-occur is quantified by their co-occurrence frequency.

Surface co-occurrence

While extracting co-occurrences using this methodology, the criterion is that surface distance is measured in word tokens. Let's consider the following sentence:

The wind blew heavily; the umbrella went rolling.

L2 L1 node R1 R2

In the preceding span, umbrella is the node word, L stands for left, R stands for right, and the numbers stand for distance. The collocational span around the node word can be symmetric (L2, R2) or asymmetric (L2, R1). This is the traditional approach in corpus linguistics and lexicography.

For example, let's consider a short text: A great degree of judgment is required to catch a cricket ball. A player must practice day in day out to gain agility and improve fielding techniques. When the batsman strokes, the ball rolls at a high speed. Sometimes the fielder has to dive or slide to stop the rolling ball. If the fielder is lazy, it rolls and rolls to the boundary.

Let's consider two words:

w1 = ball

w2 = roll

Let the span window be symmetric with two words (L2, R2), limited by sentence boundaries.

We can see that the co-occurrence frequency is f = 2 and that the words occur independently f1 = 3 and f2 = 4 times.

f1 and f2 are called the marginal frequencies, that is, how many times the words w1 and w2 occur independently of each other in the text.
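To make the span-based counting concrete, the following is a minimal R sketch (not part of the code used later in this chapter) that reproduces the co-occurrence frequency f = 2 for ball and roll with a symmetric (L2, R2) window limited by sentence boundaries; the crude normalisation of rolls and rolling to roll is only for this illustration:

# the five sentences of the example text
sentences <- c("A great degree of judgment is required to catch a cricket ball",
               "A player must practice day in day out to gain agility and improve fielding techniques",
               "When the batsman strokes, the ball rolls at a high speed",
               "Sometimes the fielder has to dive or slide to stop the rolling ball",
               "If the fielder is lazy, it rolls and rolls to the boundary")
w1 <- "ball"; w2 <- "roll"; span <- 2
f <- 0
for (s in sentences) {
  tokens <- tolower(unlist(strsplit(gsub("[[:punct:]]", "", s), "\\s+")))
  tokens <- sub("^roll(s|ing)?$", "roll", tokens)   # crude normalisation for this example
  pos1 <- which(tokens == w1)
  pos2 <- which(tokens == w2)
  # count node-collocate pairs whose surface distance is within the span
  for (p in pos1) f <- f + sum(abs(pos2 - p) <= span)
}
f   # surface co-occurrence frequency: 2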

Textual co-occurrence

While extracting co-occurrences using this methodology, the criterion is that two words co-occur if they appear in the same textual segment, for example, in the same sentence, paragraph, or document.

Syntactic co-occurrence

In this type of co-occurrence, words have a specific syntactic relation. Usually, they are word combinations of a chosen syntactic pattern, like adjective + noun, or verb + preposition, depending on the preferred multi-word expression structural type. To identify such patterns, the corpus has to be lemmatized, POS-tagged, and parsed; note that these preprocessing steps are language-dependent. Words w1 and w2 are said to co-occur only if there is a syntactic relation between them. This type of co-occurrence can help to cluster nouns that are used as objects of the same verb, such as tea, water, and cola, which are all used with the verb drink. Consider the following example:

It was an open invitation to a party with an open bar. There was a mixed crowd. Young men were in the center drinking cola, young ladies to the right drinking wine. The old gentlemen were drinking tea.

The syntactic co-occurrences in this text include the following word pairs:

  • open bar
  • mixed crowd
  • young men
  • drinking cola
  • young ladies
  • drinking wine
  • drinking tea

If we calculate the co-occurrence frequency of young men, it is f = 1, and the marginal frequencies, that is, the number of times each word occurs independently, are f1 = 2 (young) and f2 = 1 (men).
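Identifying such syntactically related pairs automatically requires parsing. The following is a rough sketch (assuming the udpipe package and its downloadable English model, which are not used elsewhere in this chapter) of how verb + object pairs such as drinking cola and drinking wine can be read off the dependency relations:

library(udpipe)
# download and load a pretrained English model (requires an internet connection)
m  <- udpipe_download_model(language = "english")
ud <- udpipe_load_model(m$file_model)
x  <- as.data.frame(udpipe_annotate(ud,
        x = "Young men were drinking cola and young ladies were drinking wine."))
# verb + object pairs: look up the head token of every word labelled as an object
objs  <- x[x$dep_rel == "obj", ]
verbs <- x$lemma[match(objs$head_token_id, x$token_id)]
data.frame(verb = verbs, object = objs$lemma)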

Co-occurrence frequency is a quantitative measure of the affinity between words, based on their recurrence. This frequency alone is not sufficient, because a lot of bigrams, such as is to and and the, occur in a corpus very frequently without being interesting collocations.

Co-occurrence in a document

If two words w1 and w2 are seen in the same document, they are usually related by topic. In this form of co-occurrence, how near or far from each other the words are in the document, or the order of their appearance, is irrelevant. Document-wise co-occurrence has been used successfully in NLP, for example in topic modelling.

A single document may cover multiple topics, so we can also investigate word co-occurrence in a smaller segment of text, such as a sentence. In contrast to the document-wise model, sentence-wise co-occurrence does not consider the whole document and only counts the number of times the two words occur in the same sentence.

We consider each textual unit to be a sentence; multiple occurrences of a word in the same sentence are counted only once. Let's consider the same text as before:

  • A great degree of judgment is required to catch a cricket ball
  • A player must practice day in day out to gain agility and improve fielding techniques
  • When the batsman strokes, the ball rolls at a high speed
  • Sometimes the fielder has to dive or slide to stop the rolling ball
  • If the fielder is lazy, it rolls and rolls to the boundary

We can see that the sentence-wise co-occurrence frequency is f = 2, and the marginal frequencies are f1 = 3 and f2 = 3.
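A minimal sketch of the sentence-wise counts (again treating rolls and rolling as forms of roll, and counting each sentence at most once per word):

sentences <- c("A great degree of judgment is required to catch a cricket ball",
               "A player must practice day in day out to gain agility and improve fielding techniques",
               "When the batsman strokes, the ball rolls at a high speed",
               "Sometimes the fielder has to dive or slide to stop the rolling ball",
               "If the fielder is lazy, it rolls and rolls to the boundary")
has_ball <- grepl("\\bball\\b", tolower(sentences))   # sentences containing w1
has_roll <- grepl("\\broll",    tolower(sentences))   # sentences containing a form of w2
c(f = sum(has_ball & has_roll), f1 = sum(has_ball), f2 = sum(has_roll))
#  f f1 f2
#  2  3  3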

Quantifying the relation between words

In corpus linguistics, the statistical association or attraction between words is expressed in the form of a contingency table. Significance testing is applied to estimate the degree of association or difference between two words. An independence model is hypothesized between the words and is tested for a good fit. The worse the fit, the more associated the words are.

Contingency tables

Contingency tables are used to display the relationship between categorical variables; they can be thought of as the categorical equivalent of scatterplots.

For measuring association in contingency tables, we can apply a statistical hypothesis test with the null hypothesis H0: independence of rows and columns. H0 implies that there is no association between w1 and w2, and the association score is then the test statistic or its p-value.

We compare the observed frequencies Oij to the expected frequencies Eij under H0, or estimate conditional probabilities such as Pr(w2|w1) and Pr(w1|w2) using maximum-likelihood estimates or confidence intervals. The following is a simple contingency table for the words w1 and w2:
 

              *|w2       *|-w2
 w1|*          O11        O12       = f1
 -w1|*         O21        O22
              = f2                  = N

For the drinking/cola example from the earlier text, the contingency table looks as follows:

               *|cola     *|-cola
 drinking|*       1          2      = 3
 -drinking|*      1          4
                = 2                 = 8

The chisq.test and fisher.test functions in base R can be used to compute these association scores. Contingency tables in R are represented in matrix format:

Note

Use chisq.test(M) for the chi-squared test of independence and fisher.test(M) for Fisher's exact test on a contingency table stored as the matrix M.
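As a small illustration using the drinking/cola counts from the preceding table: under H0, the expected frequency of the drinking/cola cell is E11 = 3 × 2 / 8 = 0.75, slightly below the observed O11 = 1. The two tests can be run on the corresponding matrix; with counts this small, the chi-squared approximation is unreliable and Fisher's exact test is preferable:

# observed frequencies for drinking/cola as a 2-by-2 matrix
M <- matrix(c(1, 2,
              1, 4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("drinking", "-drinking"), c("cola", "-cola")))
chisq.test(M)    # chi-squared test of independence (warns that the counts are very small)
fisher.test(M)   # Fisher's exact test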

Let's apply this to bigrams from the Brown corpus, stored in the file brown_bigrams.tbl.txt:

Brown <- read.table("brown_bigrams.tbl.txt", header = TRUE)
names(Brown)
# The output is as below
[1] "id"    "word1" "pos1"  "word2" "pos2"  "O11"   "O12"   "O21"   "O22"
# make sure the observed frequencies are numeric
cols <- c(6:9)
Brown[, cols] <- apply(Brown[, cols], 2, as.numeric)
# calculate the row sums A1, A2, the column sums B1, B2, and the sample size Z
Brown <- transform(Brown, A1=O11+O12, A2=O21+O22, B1=O11+O21, B2=O12+O22, Z=O11+O12+O21+O22)
# the expected frequencies can then be estimated as
Brown <- transform(Brown, E11=(A1*B1)/Z, E12=(A1*B2)/Z, E21=(A2*B1)/Z, E22=(A2*B2)/Z)
# Significance measures:
# chi-squared statistic (with Yates' continuity correction)
Brown$chisq <- with(Brown, Z * (abs(O11*O22 - O12*O21) - Z/2)^2 / (A1 * A2 * B1 * B2))
# log-likelihood ratio statistic
Brown$log <- with(Brown, 2 * (ifelse(O11>0, O11*log(O11/E11), 0) + ifelse(O12>0, O12*log(O12/E12), 0) +
                              ifelse(O21>0, O21*log(O21/E21), 0) + ifelse(O22>0, O22*log(O22/E22), 0)))
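Assuming the Brown bigram table has been processed as shown above, the bigrams can now be ranked by an association score, for example by sorting on the chi-squared column:

# show the ten bigrams with the highest chi-squared association score
ranked <- Brown[order(Brown$chisq, decreasing = TRUE), ]
head(ranked[, c("word1", "word2", "O11", "chisq")], 10)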

Detailed analysis of textual collocations

text <- "Customer value proposition has become one of the most widely used terms in business markets in recent years. Yet our management-practice research reveals that there is no agreement as to what constitutes a customer value proposition—or what makes one persuasive. Moreover, we find that most value propositions make claims of savings and benefits to the customer without backing them up. An offering may actually provide superior value—but if the supplier doesn't demonstrate and document that claim, a customer manager will likely dismiss it as marketing puffery. Customer managers, increasingly held accountable for reducing costs, don't have the luxury of simply believing suppliers' assertions.
Customer managers, increasingly held accountable for reducing costs, don't have the luxury of simply believing suppliers' assertions. Take the case of a company that makes integrated circuits (ICs). It hoped to supply 5 million units to an electronic device manufacturer for its next-generation product. In the course of negotiations, the supplier's salesperson learned that he was competing against a company whose price was 10 cents lower per unit. The customer asked each salesperson why his company's offering was superior. This salesperson based his value proposition on the service that he, personally, would provide. Unbeknownst to the salesperson, the customer had built a customer value model, which found that the company's offering, though 10 cents higher in price per IC, was actually worth 15.9 cents more. The electronics engineer who was leading the development project had recommended that the purchasing manager buy those ICs, even at the higher price. The service was, indeed, worth something in the model—but just 0.2 cents! Unfortunately, the salesperson had overlooked the two elements of his company's IC offering that were most valuable to the customer, evidently unaware how much they were worth to that customer and, objectively, how superior they made his company's offering to that of the competitor. Not surprisingly, when push came to shove, perhaps suspecting that his service was not worth the difference in price, the salesperson offered a 10-cent price concession to win the business—consequently leaving at least a half million dollars on the table. Some managers view the customer value proposition as a form of spin their marketing departments develop for advertising and promotional copy. This shortsighted view neglects the very real contribution of value propositions to superior business performance. Properly constructed, they force companies to rigorously focus on what their offerings are really worth to their customers. Once companies become disciplined about understanding customers, they can make smarter choices about where to allocate scarce company resources in developing new offerings."
library(tm) 
library(SnowballC)
text.corpus <- Corpus(VectorSource(text)) 

A few standard text preprocessing steps:

text.corpus <- tm_map(text.corpus, stripWhitespace) 
text.corpus <- tm_map(text.corpus, content_transformer(tolower))
text.corpus <- tm_map(text.corpus, removePunctuation) 
text.corpus <- tm_map(text.corpus, removeWords, stopwords("english")) 
text.corpus <- tm_map(text.corpus, stemDocument) 
text.corpus <- tm_map(text.corpus, removeNumbers) 

Define a tokenizer for n-grams and pass it to the term-document matrix constructor:

library(RWeka) 
length <- 2 # how many words either side of word of interest 
length1 <- 1 + length * 2  
ngramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = length, max = length1 )) 
text.corpus <- tm_map(text.corpus, PlainTextDocument)
dtm <- TermDocumentMatrix(text.corpus, control = list(tokenize = ngramTokenizer)) 
inspect(dtm) 

When we inspect the TermDocumentMatrix, the output is as shown in the following data:

<<TermDocumentMatrix (terms: 188, documents: 1)>>
Non-/sparse entries: 188/0
Sparsity           : 0%
Maximal term length: 98
Weighting          : term frequency (tf)

Terms                                                                                       character(0)
  ability communicate mindset makes easy successful conversations clients understand                   1
  absence opportunity form emotional bond technology falls short frontline                             1
  advisers consider guardians customers wellbeing therefore full confidence sell                       1
  advisers may mindset makes reluctant helpa hesitancy stem several                                    1
  advisers positive feelings values individual needsin short emotional intelligencerequired            1
  approaches complex protocols example may smooth simple customer interactions                         1

Explore the term document matrix for all ngrams that have the node word value in them:

word <- 'value' 
part_ngrams <- dtm$dimnames$Terms[grep(word, dtm$dimnames$Terms)]
part_ngrams

The following data shows the output:

[1] "advisers positive feelings values individual needsin short emotional intelligencerequired"
 [2] "behavior thoughts feelings values beliefs personal emotional needs met"                   
 [3] "belief products services offer values beliefs fear rejection stemming"                    
 [4] "elements largely govern human behavior thoughts feelings values beliefs"                  
 [5] "feelings lack belief products services offer values beliefs fear"                         
 [6] "feelings values beliefs personal emotional needs met unmet consider"                      
 [7] "feelings values individual needsin short emotional intelligencerequired connect help"     
 [8] "govern human behavior thoughts feelings values beliefs personal emotional"                
 [9] "human behavior thoughts feelings values beliefs personal emotional needs"                 
[10] "lack belief products services offer values beliefs fear rejection"

Keep only the n-grams in which the node word occupies the expected position within the span:

part_ngrams <- part_ngrams[sapply(part_ngrams, function(i) {
  tmp <- unlist(strsplit(i, split = " "))
  # keep the n-gram only if the node word is 'length' positions from the end
  tmp[length(tmp) - length] == word
})]

Find the collocated word in the ngrams:

col_word <- "customer" 
part_ngrams <- part_ngrams[grep(col_word, part_ngrams)] 

Count the collocations:

length(part_ngrams)
part_ngrams

Find collocations on both sides of the node word within the span specified:

alwords <- paste(part_ngrams, collapse = " ")
uniques <- unique(unlist(strsplit(alwords, split=" ")))

Left hand side collocations:

LHS <- data.frame(matrix(nrow = length(uniques), ncol = length(part_ngrams))) 
for(i in 1:length(part_ngrams)){ 

Estimate the position of unique words along the ngram vector:

position1 <- sapply(uniques, function(x) which(x == unlist(strsplit(part_ngrams[[i]], split=" "))))

Position of word of interest along ngram:

position2 <- which(word == unlist(strsplit(part_ngrams[[i]], split=" ")) )

Estimate distance of all collocations to the word of interest:

dist <- lapply(position1, function(i) position2 - i ) 

Keep only positive values, that is, words to the left of the node word:

dist <- lapply(dist, function(i) i[i > 0][1])
tmp <- rep(NA, length(uniques))
tmp[1:length(unlist(unname(dist)))] <- unlist(unname(dist))
LHS[,i] <- tmp
}
row.names(LHS) <- uniques

Estimate the mean distance between the two words:

LHS_means <- rowMeans(LHS, na.rm = TRUE) 
countN <- function ( v ) sum( !is.na( v ) )  
LHS_freqs <- apply(LHS, 1, countN ) 
LHS_means <- data.frame(word = names(LHS_means), 
                        mean_dist = unname(LHS_means), 
                        freq = unname(LHS_freqs)) 

Sort by average distance:

LHS_means <- LHS_means[with(LHS_means, order(mean_dist)), ] 

Sort based on frequency:

LHS_means <- LHS_means[with(LHS_means, order(-freq)), ]

Right hand side collocations:

RHS <- data.frame(matrix(nrow = length(uniques), ncol = length(part_ngrams)))
for(i in 1:length(part_ngrams)){

Find the position of unique words along the ngram vector:

  pos1 <- sapply(uniques, function(x) which(x == unlist(strsplit(part_ngrams[[i]], split=" ")))) 

Find the position of the word of interest along the ngram vector:

  pos2 <- which(word == unlist(strsplit(part_ngrams[[i]], split=" ")) ) 

Compute the distance of all collocates to the word of interest:

  dist <- lapply(pos1, function(i) pos2 - i ) 

Keep only negative values, that is, words to the right of the node word:

  dist <- lapply(dist, function(i) i[i < 0][1])

Insert the distance values into a vector and append it to the data frame:

  tmp <- rep(NA, length(uniques))
  tmp[1:length(unlist(unname(dist)))] <- unlist(unname(dist))
  RHS[,i] <- tmp
} 
row.names(RHS) <- uniques 

Compute the mean distance between the two words:

RHS_means <- rowMeans(RHS, na.rm = TRUE) 

Also get the collocate frequencies within the spans, using the countN function to count non-NA values:

countN <- function(v) sum(!is.na(v))
RHS_freqs <- apply(RHS, 1, countN)
RHS_means <- data.frame(word = names(RHS_means), 
                        mean_dist = unname(RHS_means), 
                        freq = unname(RHS_freqs)) 

Sort by mean distance:

RHS_means <- RHS_means[with(RHS_means, order(-mean_dist)), ]

Sort by frequency:

RHS_means <- RHS_means[with(RHS_means, order(-freq)), ] 

Compute the mutual information for all words in the span.

Mutual information is a measure of collocational strength: a collocation is a sequence of words that co-occur more often than would be expected by chance, and the higher the mutual information, the stronger the relation between the words. The span-based form used here is MI = log2((O × N) / (A × B × S)), where O is the number of co-occurrences within the span, A and B are the individual word frequencies, N is the corpus size, and S is the span size. For example, if the node word occurs 3 times, the collocate occurs 4 times, and they co-occur twice within a span of 4 in a 50-word corpus, then MI = log2((2 × 50) / (3 × 4 × 4)) ≈ 1.06:

# Note: 'examp1' below is assumed to hold the preprocessed corpus text as a
# single character string; it is not defined in this excerpt
MI <- numeric(length(uniques))
for(i in 1:length(uniques)){

A = frequency of node word:

A <- length(grep(word, unlist(strsplit(examp1, split=" ")))) 

B = frequency of collocation:

B <- length(grep(uniques[i], unlist(strsplit(examp1, split=" "))))

Size of corpus = number of words in total:

sizeCorpus <- length(unlist(strsplit(examp1, split=" ")))

Span = the number of word positions analysed to the left and right of the node word:

span <- length * 2   # 'length' words on each side of the node word

Compute MI, using the number of within-span co-occurrences counted earlier (LHS_freqs and RHS_freqs) as the observed frequency O:

O <- sum(LHS_freqs[uniques[i]], RHS_freqs[uniques[i]], na.rm = TRUE)
MI[i] <- log((O * sizeCorpus) / (A * B * span)) / log(2)
}