Feature extraction is a key step in text mining: a system that can extract meaningful features from text can support a wide range of applications. The first step in feature extraction is to tag the document; the tagged document is then processed to extract the entities of interest.
The elements that can be extracted from the text are:
Textual entailment

Human communication is diverse in that the same message can be conveyed with many different expressions, which makes this a challenging problem in natural language processing. Consider two text fragments, s and s'. If the meaning of s' can be inferred from the meaning of s, we say that s entails s'. s is termed the 'text' and s' its 'hypothesis'; usually s' is a short statement, whereas s is a longer span of text. Textual entailment is directional in nature: if s entails s', it does not follow that s' entails s.
Textual entailment aims to infer whether one text fragment follows from another. From a language perspective it comes in several types: syntactic entailment ('I am going home' entails 'I am going'), semantic entailment ('He is not happy' entails 'He is unhappy'), and implicative entailment ('He fell down the stairs' entails 'He is injured').
Entailment is a key capability for improving performance in a wide variety of NLP tasks, such as question answering, information retrieval, enhanced search engines, multi-document summarization, and machine translation.
There are various approaches to estimating textual entailment.
Among the various approaches to textual entailment, semantic similarity is the most common. This method uses the similarity between concepts/senses to decide whether the hypothesis can truly be inferred from the text. Similarity measures quantify how alike two concepts are. An auto can be considered more like a car than a house, since auto and car share vehicle as a common ancestor in an is-a hierarchy.
Lexical databases like WordNet are very effective for similarity measures, since WordNet maps nouns and verbs into hierarchies of is-a relationships.
The lexical unit S entails S' if the two are synonyms according to WordNet, or if there is a similarity relation between them; for example, rise and lift, reveal and discover, grant and allow.
Let the entailment rule be designed as:

entails(S, S') IF synonymy(S, S') OR WN_similarity(S, S')

synonymy(rise, lift) = TRUE => entails(rise, lift) = TRUE
WN_similarity(reveal, discover) = TRUE => entails(reveal, discover) = TRUE
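As a toy sketch, this rule can be expressed in R with hand-coded synonym and similarity tables standing in for the WordNet lookups (the tables and helper names are illustrative only):

```r
# Hand-coded stand-ins for WordNet's synonymy and similarity relations.
synonym_pairs    <- list(c("rise", "lift"), c("grant", "allow"))
wn_similar_pairs <- list(c("reveal", "discover"))

# Check whether an unordered pair of words appears in a pair list.
pair_in <- function(pairs, s1, s2) {
  any(vapply(pairs, function(p) all(sort(p) == sort(c(s1, s2))), logical(1)))
}

# entails(S, S') IF synonymy(S, S') OR WN_similarity(S, S')
entails <- function(s1, s2) {
  pair_in(synonym_pairs, s1, s2) || pair_in(wn_similar_pairs, s1, s2)
}

entails("rise", "lift")        # TRUE
entails("reveal", "discover")  # TRUE
entails("rise", "house")       # FALSE
```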
First, install WordNet by downloading the .exe from http://wordnetcode.princeton.edu/2.1/WordNet-2.1.exe and set the WNHOME environment variable to the installation directory. In R, load the wordnet package with library(wordnet) and specify the dict folder with the setDict command. On Windows:
library(wordnet)
setDict("C:/Program Files (x86)/WordNet/2.1/dict")

Find_synonyms <- function(x) {
  filter <- getTermFilter("ExactMatchFilter", x, TRUE)
  terms <- getIndexTerms("NOUN", 1, filter)
  getSynonyms(terms[[1]])
}
WordNet contains many multi-word entries that have useful semantic relations with other words, but additional processing may be required to normalize them before they can be used effectively. There can be variation due to lemmatization, acronym punctuation, or spelling differences. To measure similarity accurately, fuzzy matching is implemented using the Levenshtein distance between the WordNet words and the candidate word; a match is allowed if the two compared words differ by less than 10%.
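A minimal sketch of this fuzzy matching can be written with base R's adist() (generalized Levenshtein distance); the 10% threshold follows the text, while the helper name and word list are assumptions for illustration:

```r
# Match a candidate word against WordNet entries, allowing a
# Levenshtein distance of less than 10% of the longer word's length.
fuzzy_match <- function(candidate, wn_words, threshold = 0.10) {
  d <- drop(adist(candidate, wn_words))            # edit distances
  rel <- d / pmax(nchar(candidate), nchar(wn_words))
  wn_words[rel < threshold]
}

fuzzy_match("organization", c("organisation", "organ", "vehicle"))
# "organisation": 1 edit over 12 characters (about 8%) -> matched
```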
In the dependency tree, if there is a negation relation between a leaf and its parent, the negation is propagated up the tree until the root node. Entailment is not possible in this case:
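A minimal sketch of this upward propagation, over a dependency tree stored as a parent vector (the representation and helper name are assumptions, not the book's code):

```r
# Spread negation upward: repeatedly mark the parent of every negated
# node until nothing changes; the root then reflects any negation below.
propagate_negation <- function(parent, negated) {
  repeat {
    changed <- FALSE
    for (i in seq_along(parent)) {
      if (negated[i] && !is.na(parent[i]) && !negated[parent[i]]) {
        negated[parent[i]] <- TRUE
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  negated
}

parent  <- c(NA, 1, 1, 2, 2)                    # node 1 is the root
negated <- c(FALSE, FALSE, FALSE, TRUE, FALSE)  # negation on node 4
propagate_negation(parent, negated)[1]          # TRUE: root is negated,
                                                # so entailment is blocked
```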
Find_antonyms <- function(x) {
  filter <- getTermFilter("ExactMatchFilter", x, TRUE)
  terms <- getIndexTerms("ADJECTIVE", 5, filter)
  synsets <- getSynsets(terms[[1]])
  # "!" is the WordNet pointer symbol for direct antonyms
  related <- tryCatch(
    getRelatedSynsets(synsets[[1]], "!"),
    error = function(condition) {
      if (condition$message == "RcallMethod: invalid object parameter")
        message("No direct antonym found")
      else
        stop(condition)
      NULL
    }
  )
  if (is.null(related)) return(NULL)
  sapply(related, getWord)
}

Hypernymy and WordNet entailment
In concept similarity, we measure the similarity between two concepts; for example, the concept car is more related to the concept vehicle than to a concept such as food. Similarity is measured using the information contained in an is-a hierarchy. WordNet is a lexical database well suited for this purpose, since its nouns and verbs are organized in is-a hierarchies.
The elementary idea of estimating similarity from path length is that the similarity between two concepts can be expressed as a function of the length of the path between them and of their position in the hierarchy. There are different variants of path-length calculation that quantify concept similarity:
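In its simplest form (one common convention; the exact variant used may differ), path-based similarity is an inverse function of the number of edges on the shortest path between the two concepts:

```latex
\mathrm{sim}_{\text{path}}(c_1, c_2) = \frac{1}{1 + \mathrm{len}(c_1, c_2)}
```

where len(c1, c2) is the length of the shortest path connecting the two concepts in the is-a hierarchy.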
This measure expresses the similarity between two words as a function of the shortest path length and the depth of the subsuming concept, both of which carry very useful information about the words. The lower the subsumer sits in the hierarchy, the less abstract is the meaning shared by the two words:
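This description (shortest path combined with the depth of the subsumer) matches the measure of Li et al.; if that is the intended formula, it is commonly written with l the shortest path length, h the subsumer depth, and alpha, beta as scaling parameters:

```latex
\mathrm{sim}(w_1, w_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}
```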
Resnik similarity estimates the relatedness of two words in terms of information content. The amount of information content shared by two concepts determines the semantic similarity between them. Resnik considers the position of the nouns in the is-a hierarchy. Let C be the set of concepts in a taxonomy that allows multiple inheritance. The key to finding the similarity between concepts lies in the edge counts of the hierarchy graph and in the proportion of information shared between the concepts, captured by the most specific concept that is higher in the hierarchy and subsumes both of them.
If P(c) is the probability of encountering concept c, and concept A lies under B in the is-a hierarchy, then P(A) <= P(B):
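Writing lso(c1, c2) for the lowest super-ordinate (the most specific common subsumer) of the two concepts, the Resnik similarity is standardly given by the information content of that subsumer:

```latex
\mathrm{sim}_{\text{Resnik}}(c_1, c_2) = -\log P\bigl(\mathrm{lso}(c_1, c_2)\bigr) = \mathrm{IC}\bigl(\mathrm{lso}(c_1, c_2)\bigr)
```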
Lin similarity estimates the semantic association between two concepts/senses as the ratio of the amount of information shared by the two concepts to the total amount of information contained in them. It uses both the information needed to describe the commonality of the concepts and the information needed to fully describe each of them:
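With lso(c1, c2) again denoting the most specific common subsumer, the Lin measure is standardly written as:

```latex
\mathrm{sim}_{\text{Lin}}(c_1, c_2) = \frac{2 \cdot \mathrm{IC}\bigl(\mathrm{lso}(c_1, c_2)\bigr)}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}
```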
To calculate the distance between two concepts, the Jiang-Conrath distance considers the information content of each concept, along with the information content of their most specific subsumer:
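In the same notation, the Jiang-Conrath distance is standardly written as:

```latex
\mathrm{dist}_{\text{JC}}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2 \cdot \mathrm{IC}\bigl(\mathrm{lso}(c_1, c_2)\bigr)
```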
As this is a distance measure, the higher the score, the lower the similarity.
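A small numeric sketch of the three information-content measures, using made-up IC values for two concepts and their most specific common subsumer (the numbers are illustrative, not derived from a corpus):

```r
# Made-up information-content values (in bits) for two concepts
# and their lowest super-ordinate (lso).
ic_c1  <- 8.1
ic_c2  <- 7.4
ic_lso <- 5.0

resnik <- ic_lso                          # IC of the shared subsumer
lin    <- (2 * ic_lso) / (ic_c1 + ic_c2)  # shared / total information
jc     <- ic_c1 + ic_c2 - 2 * ic_lso      # a distance, not a similarity

c(resnik = resnik, lin = round(lin, 3), jc = jc)
# jc grows as the concepts share less information:
# the higher the score, the lower the similarity
```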