Feature extraction

Feature extraction is an important step in text mining, and a system that can extract features from text has potential applications in many domains. The first step in feature extraction is tagging the document; the tagged document is then processed to extract the meaningful entities it contains.
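
As an illustrative sketch of the tagging step, the openNLP package (with its bundled English maxent models) can annotate a document with sentence, word, and part-of-speech tags; the example sentence below is made up:

library(NLP)
library(openNLP)

# Illustrative input; any character vector works
s <- as.String("Barack Obama visited Paris in 2016.")

# Annotate sentences, then words, then POS tags (the order matters)
anns <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                         Maxent_Word_Token_Annotator(),
                         Maxent_POS_Tag_Annotator()))

# Pair each token with its POS tag
words <- anns[anns$type == "word"]
data.frame(token = s[words],
           POS   = sapply(words$features, `[[`, "POS"))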

The elements that can be extracted from the text are:

  • Entities: Pieces of meaningful information found in the document, for example, locations, companies, people, and so on
  • Attributes: Features of the extracted entities, for example, the title of a person, the type of an organization, and so on
  • Events: Activities in which the entities participate, for example, dates

Textual entailment

Human communication is diverse in terms of the use of different expressions to convey the same message. This proves to be quite a challenging problem in natural language processing. Consider, for example, two text fragments s and s'. If the meaning of s' can be inferred from the meaning of s, we conclude that s entails s'. s is termed the 'text', while s' is its 'hypothesis'; usually s' is a short statement, whereas s is a longer span of text. Textual entailment is directional in nature, which implies that if s entails s', it does not mean s' necessarily entails s.

Textual entailment aims to determine whether the meaning of one text fragment can be inferred from another. From a language perspective it can be of several types, for example, syntactic entailment (I am going home entails I am going), semantic entailment (He is not happy entails He is unhappy), and implicative entailment (He fell down the stairs entails He is injured).

Entailment is a key capability for enhancing performance in a wide variety of NLP tasks, such as question-answering systems, information retrieval, enhanced search engines, multi-document summarization, and machine translation.

There are various approaches to estimating textual entailment:

  • Semantic similarity based methods
  • Syntactic similarity based methods
  • Logic based methods
  • Vector space model based methods
  • Surface string similarity based methods
  • Machine learning based methods

Among the various approaches to textual entailment, semantic similarity is the most common. This method uses the similarity between concepts/senses to decide whether the hypothesis can truly be inferred from the text. Similarity measures quantify how alike two concepts are. An auto can be considered more like a car than a house, since auto and car share vehicle as a common ancestor in an is-a hierarchy.

Lexical databases like WordNet are extremely effective for similarity measures, since they map nouns and verbs into hierarchies of is-a relationships.

Synonymy and similarity

The lexical unit S entails S' if the two are synonyms as per WordNet, or if there is a similarity association between them. For example, rise and lift, reveal and discover, grant and allow.

Let the entailment rule be designed as:

entails(S, S') IF synonymy(S, S') OR WN_similarity(S, S')
synonymy(rise,lift)= TRUE => entails(rise,lift)= TRUE
WN_similarity(reveal,discover)= TRUE => entails(reveal,discover) = TRUE

First, install WordNet by downloading the .exe from:

http://wordnetcode.princeton.edu/2.1/WordNet-2.1.exe

Then set the WNHOME environment variable to the installation directory, load the wordnet package, and specify the dict folder with the setDict command. On a Windows operating system, this looks as follows:

library(wordnet)
setDict("C:/Program Files (x86)/WordNet/2.1/dict")

# Return the WordNet noun synonyms of the term x
Find_synonyms <- function(x){
    filter <- getTermFilter("ExactMatchFilter", x, TRUE)
    terms <- getIndexTerms("NOUN", 1, filter)
    getSynonyms(terms[[1]])
}
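
As a usage sketch, the synonymy branch of the entailment rule above can be checked with Find_synonyms (the entails wrapper below is illustrative and not part of the wordnet package):

# S entails S' under the synonymy rule if S' appears among the synonyms of S
entails <- function(s, s_prime){
    s_prime %in% Find_synonyms(s)
}

entails("rise", "lift")    # expected TRUE if WordNet lists lift as a noun synonym of rise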

Multiwords, negation, and antonymy

WordNet contains many multi-word terms that have useful semantic relations with other words, but they may require additional processing to be normalized before they can be used effectively. Variation can arise from lemmatization, acronym punctuation, or different spellings. In order to measure similarity accurately, fuzzy matching is implemented using the Levenshtein distance between the WordNet words and the candidate word. Matching is allowed if the two compared words differ by less than 10%, as sketched below.
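
A minimal sketch of this fuzzy matching rule, using base R's adist function (which computes the Levenshtein distance); the 10% threshold comes from the rule above:

# TRUE if the Levenshtein distance is less than 10% of the longer word
fuzzy_match <- function(a, b, threshold = 0.10){
    d <- adist(tolower(a), tolower(b))[1, 1]
    d / max(nchar(a), nchar(b)) < threshold
}

fuzzy_match("organisation", "organization")    # TRUE: 1 edit over 12 characters
fuzzy_match("house", "horse")                  # FALSE: 1 edit over 5 characters is 20%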

In the dependency tree, if there is a negation relation between a leaf and its parent, the negation is propagated up the tree to the root node. Entailment is not possible in this case:

# Return the direct antonyms (the "!" pointer in WordNet) of the adjective x
Find_antonyms <- function(x){
    filter <- getTermFilter("ExactMatchFilter", x, TRUE)
    terms <- getIndexTerms("ADJECTIVE", 5, filter)
    synsets <- getSynsets(terms[[1]])
    # getRelatedSynsets() throws an error when no antonym pointer exists
    related <- tryCatch(
        getRelatedSynsets(synsets[[1]], "!"),
        error = function(condition) {
            if (condition$message == "RcallMethod: invalid object parameter")
                message("There is no exact antonym")
            else
                stop(condition)
            return(NULL)
        }
    )
    if (is.null(related))
        return(NULL)
    # Extract the words from the antonym synsets
    return(sapply(related, getWord))
}
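
A quick usage check (assuming WordNet is installed and setDict() has been called as above; the word choice is illustrative):

Find_antonyms("happy")    # expected to include "unhappy"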

Hypernymy and WordNet entailment

Concept similarity

In concept similarity, we measure the similarity between two concepts; for example, if we consider car as one concept, then it is more related to the concept vehicle than to some other concept such as food. Similarity is measured by the information contained in an is-a hierarchy. WordNet is a lexical database that is well suited for this purpose, since its nouns and verbs are organized in an is-a hierarchy.

Path length

The elementary idea of estimating similarity based on path length is that the similarity between concepts can be expressed as a function of the path length between them and of their position in the hierarchy. There are different variants of path length calculation that quantify concept similarity as a function of path length:

  • Shortest path length: The shorter the path between two words/senses in the hierarchy graph, the more similar the words are:

    len(C1, C2) = number of edges in the shortest path between C1 and C2

  • Shortest path length with depth: The path length is scaled by the maximum depth of the taxonomy (the Leacock-Chodorow measure):

    $$\mathrm{sim}(C_1, C_2) = -\log\frac{\mathrm{len}(C_1, C_2)}{2 \times \mathrm{deep\_max}}$$

      • C1, C2 are the concepts
      • len(C1, C2) is the shortest path function between the two concepts C1 and C2
      • deep_max is the maximum depth of the taxonomy, a fixed value for the specific version of WordNet

This measure expresses the similarity between two words as a combination of the shortest path length and the depth of the sub-tree that subsumes them, which holds very useful information about the features of the words. The lower the sub-tree is in the hierarchy, the less abstract the meaning shared by the two words.
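
A minimal sketch of path-based similarity on a toy is-a hierarchy (the taxonomy, the deep_max value, and all names below are illustrative, not taken from WordNet):

# Toy is-a hierarchy: each child maps to its parent
parent <- c(car = "vehicle", auto = "vehicle", vehicle = "entity",
            house = "building", building = "entity")

# Walk from a concept up to the root of the hierarchy
path_to_root <- function(node){
    path <- node
    while (node %in% names(parent)) {
        node <- parent[[node]]
        path <- c(path, node)
    }
    path
}

# Shortest path length: edges from C1 and C2 to their lowest common subsumer
len <- function(c1, c2){
    p1 <- path_to_root(c1)
    p2 <- path_to_root(c2)
    lcs <- intersect(p1, p2)[1]    # first shared ancestor is the lowest
    (match(lcs, p1) - 1) + (match(lcs, p2) - 1)
}

# Leacock-Chodorow style similarity with a fixed maximum depth
deep_max <- 3    # illustrative value for this toy taxonomy
sim_lch <- function(c1, c2) -log(len(c1, c2) / (2 * deep_max))

len("car", "auto")     # 2: the paths meet at vehicle
len("car", "house")    # 4: the paths meet at entity
sim_lch("car", "auto") > sim_lch("car", "house")    # TRUE: auto is closer to car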

Resnik similarity

Resnik similarity estimates the relatedness of words in terms of information content. The amount of information content shared by two concepts determines the semantic similarity between them. Resnik considers the position of the nouns in the is-a hierarchy. Let C be the set of concepts in a taxonomy that allows multiple inheritance. The key to finding the similarity between concepts lies in the edge counts of the hierarchy graph and in the proportion of information shared between the concepts through the most specific concept that subsumes both of them.

If P(concept) is the probability of encountering a concept, and entity A belongs to the is-a hierarchy under B, then P(A)<= P(B):

$$\mathrm{sim}_{\mathrm{Resnik}}(C_1, C_2) = \mathrm{IC}(\mathrm{lso}(C_1, C_2))$$

  • C1, C2 are the concepts
  • IC(C1) is the information content based measure of concept C1
  • IC(C2) is the information content based measure of concept C2
  • lso(C1, C2) is the lowest common subsumer of C1 and C2
  • The information content is defined as $\mathrm{IC}(C) = -\log P(C)$
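
Reusing the toy hierarchy and the path_to_root helper from the earlier sketch, a minimal Resnik similarity with illustrative concept probabilities:

# Illustrative probabilities of encountering each concept in a corpus
p <- c(entity = 1.0, vehicle = 0.2, building = 0.1,
       car = 0.05, auto = 0.05, house = 0.08)

ic <- function(concept) -log(p[[concept]])    # IC(C) = -log P(C)

# Lowest common subsumer: first shared ancestor on the paths to the root
lso <- function(c1, c2) intersect(path_to_root(c1), path_to_root(c2))[1]

sim_resnik <- function(c1, c2) ic(lso(c1, c2))

sim_resnik("car", "auto")     # IC(vehicle) = -log(0.2), about 1.61
sim_resnik("car", "house")    # IC(entity) = -log(1.0) = 0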

Lin similarity

Lin similarity estimates the semantic association between two concepts/senses as the ratio of the amount of information shared by the two concepts to the total amount of information held by the two concepts. It uses both the information required to describe the association between the concepts and the information required to completely describe each of them:

$$\mathrm{sim}_{\mathrm{Lin}}(C_1, C_2) = \frac{2 \times \mathrm{IC}(\mathrm{lso}(C_1, C_2))}{\mathrm{IC}(C_1) + \mathrm{IC}(C_2)}$$

  • C1, C2 are the concepts
  • IC(C1) is the information content based measure of concept C1
  • IC(C2) is the information content based measure of concept C2
  • lso(C1, C2) is the lowest common subsumer of C1 and C2
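
Reusing ic and lso from the Resnik sketch above, Lin similarity is a one-liner:

# Ratio of shared information to the total information in both concepts
sim_lin <- function(c1, c2){
    2 * ic(lso(c1, c2)) / (ic(c1) + ic(c2))
}

sim_lin("car", "auto")    # 2*IC(vehicle) / (IC(car) + IC(auto)), about 0.54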

Jiang-Conrath distance

To calculate the distance between two concepts, the Jiang-Conrath distance considers the information content of the concepts, along with the information content of their most specific subsumer:

$$\mathrm{dist}_{\mathrm{JC}}(C_1, C_2) = \mathrm{IC}(C_1) + \mathrm{IC}(C_2) - 2 \times \mathrm{IC}(\mathrm{lso}(C_1, C_2))$$

  • C1, C2 are the concepts
  • IC(C1) is the information content based measure of concept C1
  • IC(C2) is the information content based measure of concept C2
  • lso(C1, C2) is the lowest common subsumer of C1 and C2

As this is a distance measure, the higher the score, the lower the similarity.
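
Again reusing ic and lso from the Resnik sketch, a minimal Jiang-Conrath distance:

# The distance grows with the information in C1 and C2 not shared via their subsumer
dist_jc <- function(c1, c2){
    ic(c1) + ic(c2) - 2 * ic(lso(c1, c2))
}

dist_jc("car", "auto")     # about 2.77: relatively close concepts
dist_jc("car", "house")    # about 5.52: farther apart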
