8

SIMILARITY, CATEGORIZATION AND CLUSTERING

Automatic text categorization, known as ATC, has become one of the most significant commercial uses of content analysis techniques. At its core, automatic text categorization allows the computer to organize documents based on their content, offering a filtering capability orders of magnitude more sophisticated than simple keyword searching. The vector space model underlying ATC can be used for other types of similarity analysis, from comparing pairs of documents to the automated grouping of entire archives. Clustering is an especially powerful technique that allows the machine to determine the set of categories that works best for a particular text collection, rather than placing documents into predefined categories, in many cases exposing previously undetected patterns.

Categorization

Simple keyword queries can offer sufficient classification accuracy for many projects. The George Mason University Syllabus Finder (Cohen, 2006) was developed by collecting a sample of history syllabi and examining word frequencies. Words like syllabus, reading(s), assignment(s), and exam were, of course, found to be very good indicators (syllabus appeared on more than 90 percent of the sample syllabi), but more subtle terms were also found to be highly suggestive, such as office hours, or Spring or Fall followed by a year, as in Fall 2002. Using just keyword searching with a set of highly suggestive terms, this filter was able to achieve fairly high accuracy at recognizing syllabi.

Unfortunately, most document collections do not provide the same sure-fire indicators as history syllabi and must rely on more sophisticated techniques that can take into account subtle statistical variations in word usage across an archive that may contain hundreds of millions of words. Automatic text categorization (ATC) refers to the use of computer algorithms able to identify patterns in collections of documents and use them to classify new documents without human intervention. Such systems are called supervised learners because a human provides pre-categorized training data that the machine uses to learn and place new documents into the same categories.

The Vector Space Model

ATC techniques work by converting the textual format of a document into a numeric representation that can be more easily processed by computer. First, the system compiles a master list of all unique words across the entire collection of documents. Then, for each document, it counts how many times each of those words appears in that document, ultimately creating a table where the rows are the complete list of unique words appearing across all documents and each column records how many times each of those words appeared in a given document. Each document is thus translated from a blob of text to a vocabulary histogram that can be directly compared between documents in a collection. Such a list may be thought of as a vector space representation, in which each word defines a dimension in a very high-dimensional space and each document becomes a vector within it. The goal of ATC is for the computer to use these vector representations and autonomously discern patterns among the members of each category. These patterns are used to develop a categorization model that can be used to classify novel text into a category based on the previously seen training data. In essence, ATC techniques allow the analyst to hand the machine a set of documents and ask it to find more like these.
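As a concrete illustration, the sketch below builds such vocabulary histograms by hand for a tiny invented collection; real systems operate on far larger vocabularies, and the variable names and toy documents here are purely illustrative.

```python
from collections import Counter

# A toy collection; in practice these would be full documents.
docs = [
    "the dog chased the cat",
    "the cat sat on the mat",
    "students submit the assignment before the exam",
]

# Master list of every unique word across the entire collection (the term space).
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})

# One vocabulary histogram per document: how many times each term appears in it.
vectors = []
for doc in docs:
    counts = Counter(doc.lower().split())
    vectors.append([counts.get(term, 0) for term in vocabulary])

# Rows are the unique words, columns are the documents, as described above.
for term, row in zip(vocabulary, zip(*vectors)):
    print(f"{term:>12}: {list(row)}")
```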

Feature Selection

The first step in ATC is the conversion of a textual document into a vector representation, which raises the question of exactly what should be measured. Nearly every imaginable possibility has been explored in the literature over the years, but the most common in contemporary systems are noun phrase extraction, ordered words, and bag of words techniques. Noun phrase extraction identifies the list of noun phrases appearing in a document and uses them as a representation of its core concepts. In theory, ATC based on noun phrase extraction will identify trends based only on topics, eliminating extraneous words. In practice, however, even fairly small collections tend to have so much variation in noun phrase use (such as different words for the same concept), that this technique tends to be less accurate than relying on the entire corpus of words. Indeed, as was pointed out earlier with author gender identification, sometimes it is the most unexpected document attributes (such as pronoun use) that prove the most insightful.

The ordered words technique extracts a list of all words appearing in a document, but further enhances them through information on where each word appeared in the document and possibly what words it was near. Preserving word order would theoretically preserve more of the semantic content of a document, but, again, in practice this tends to offer only a negligible accuracy improvement. The most accurate technique to date seems to also be the simplest: the venerable bag of words. In bag of words feature selection, the document is treated as an unordered list of words. Under this approach, words are ranked solely by their frequencies and the word the holds just as much weight as the word attack, as far as the computer is concerned. In some cases (as illustrated in gender detection), the use of so-called noise words like pronouns and articles may actually hold significant predictive power, while in others they will simply be normalized away if they are consistent across categories.

Even further removed from semantic meaning, n-grams have also been used as document features to varying degrees of success. In the n-gram approach, the document is decomposed into strings of characters of a certain length (such as N=3), and these are used in place of words as the input to the ATC system. In a 3-gram, the phrase the dog would become the, he_, e_d, _do, dog, og_, and g__ (spaces and preceding and succeeding characters are replaced with underscores). To a human, of course, this collection of text holds little meaning, but it has the crucial benefit of making the classifier more robust against isolated typographical errors. This can be especially important when working with text derived from optical character recognition (OCR), in which the resulting text often contains various types of missing or incorrect characters. Even a single character out of place (such as computer versus ccmputer) causes the machine to treat them as different words, skewing the model's evaluation of the word.
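A minimal sketch of character 3-gram extraction follows, using the padding convention from the example above (underscores standing in for spaces and for the positions before and after the text).

```python
def char_ngrams(text, n=3):
    # Replace spaces with underscores and pad both ends, as in the example above.
    padded = "_" * (n - 1) + text.replace(" ", "_") + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("the dog"))
# ['__t', '_th', 'the', 'he_', 'e_d', '_do', 'dog', 'og_', 'g__']
```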

Feature Reduction

Of course, constructing a vector space of dimensionality equal to the total number of distinct words in a large document archive will result in a breathtakingly large mathematical space requiring significant memory and processing time. Computing relationships among documents in a vector space of tens of millions of dimensions is simply computationally intractable, even with the large-memory, massively parallel supercomputers available today. Instead, feature reduction algorithms are used to select just the most meaningful features to be used in the vector construction. A word that appears in just five of 100,000 documents may uniquely characterize those five documents, but holds little value in characterizing the corpus as a whole. The term space of an ATC system is the set of all unique words found in the entire corpus. If a given document contains only 100 distinct words, it is still compared against the overall list of distinct words from the corpus, which might contain 100,000 distinct words. Such a document would have scores along just 100 of the 100,000 dimensions (the others would be zeros), leading to substantial computational and storage waste. Beyond the computational expense, presenting too many features to the classification algorithm may cause important patterns to be buried in an avalanche of noise. To correct for this, feature reduction algorithms score each word's presence in each document based on its frequency in the document, its frequency across the collection, and other possible factors.

One of the most common feature reduction algorithms is known as TFIDF (term frequency inverse document frequency). Three values are multiplied together to determine a final score for each word in the corpus based on its occurrence in each document. The first term is the frequency of the word in the given document, known as the term weight. This can be a binary value (just a 1 if the word is present and a 0 if it is not), a raw frequency (the exact number of occurrences), or a normalized frequency (frequency adjusted to fall within a certain range). The second term essentially adjusts the first by the overall frequency of the word in the corpus. The word the may appear with considerable frequency within a particular document, but may not hold significant meaning if the word also appears with similar frequency in every document in the corpus. Corpus weighting terms tend to be either inverse document frequency (the logarithm of the number of documents in the collection divided by the number containing this word) or probabilistic inverse document frequency (the total number of documents in the collection minus the number containing the term, divided by the number of documents containing the term, with the final value being the logarithm of the result). Finally, a normalization weight is applied, frequently cosine normalization, which is the inverse of the length of the document. The specific weighting features used in any given ATC application are highly dependent on the source text, with different combinations of parameters working better on some text collections than others. In fact, some ATC toolkits use a default TFIDF weighting algorithm that measures only frequency within the current document and does not normalize it by a corpus or normalization component.

Once each word has been assigned a weight, or score, the list of words is ranked by those scores, and the ones with the highest scores selected. The selection cutoff varies by application, with some systems using a fixed count (the top 1,000 words) and others using a percentage (the top 20 percent), or specialized hybrids. The final output of this stage is a list of words that the system has determined to have the highest degree of separating power in distinguishing category-specific patterns of word use. Only these words are used to construct the vector representation for each document.
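The sketch below illustrates one way this weighting and selection might look, assuming raw term frequency for the term weight, inverse document frequency for the corpus weight, and cosine (length) normalization; ranking words by their highest score in any document is just one simple way to pick the cutoff, and the function name, toy documents, and top_n value are invented for illustration.

```python
import math
from collections import Counter

def tfidf_features(docs, top_n=1000):
    """Score every word with TFIDF and keep only the highest-scoring features."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Inverse document frequency: log of (documents in collection / documents containing the word).
    idf = {word: math.log(n_docs / count) for word, count in doc_freq.items()}

    weighted = []
    for tokens in tokenized:
        term_freq = Counter(tokens)                                  # raw frequency term weight
        scores = {w: freq * idf[w] for w, freq in term_freq.items()}
        norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
        weighted.append({w: s / norm for w, s in scores.items()})    # cosine normalization

    # Rank words by their highest score anywhere in the collection and keep the top N.
    best = Counter()
    for scores in weighted:
        for w, s in scores.items():
            best[w] = max(best[w], s)
    features = [w for w, _ in best.most_common(top_n)]

    # Each document's final vector contains only the selected features.
    vectors = [[scores.get(w, 0.0) for w in features] for scores in weighted]
    return features, vectors

features, vectors = tfidf_features([
    "the dog chased the cat",
    "the cat sat on the mat",
    "students submit the assignment before the exam",
], top_n=10)
print(features)
```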

Learning Algorithm

Once each document has been converted to a vector model of features, it is time for the machine learning system to take over and learn the major patterns that define each category. There are numerous machine learning algorithms in use today, but perhaps the five most popular for automatic text categorization are: naïve Bayes, support vector machine, decision tree, neural networks, and k-nearest-neighbor.

The algorithms behind text classification are extremely complex, steeped heavily in statistical and machine learning theory, but a full understanding is not necessary to use them and so only a brief description of each will be provided here. Perhaps one of the best-known of these is the naïve Bayes classifier, which is based on Bayesian probability. Bayes classifiers are known as naïve because they assume each dimension (word) in a document appears with a probability that is entirely independent of any other word. In other words, ran has no greater likelihood of appearing with we or you than it does with onion. Of course, such independence does not hold true for human languages, but, surprisingly, naïve Bayes classifiers are exceedingly accurate on many collections, and train and classify very fast. Support vector machines treat documents as points in space and attempt to find a dividing plane that best separates the points as either belonging or not belonging to a given category. Neural networks attempt to create a complex network of interrelated binary neurons that can act in concert to make complex decisions, much like the human brain. Decision trees decompose the classification process into a set of probabilistic branching decisions, but have the significant benefit of allowing clear human understanding of the patterns extracted by the algorithm. Finally, the k-nearest-neighbor algorithm acts very similarly to a clustering algorithm in that it measures the distance between every pair of documents. By measuring the dot product, or angle between their vectors, it is able to compute the level of similarity in their word use. Using this, the top k most similar documents (where k is a tunable parameter) are examined and the most common category among those documents is assigned to the target document.

There is no hard and fast rule as to which learning algorithm will work best on which type of text, but, in general, Bayesian models tend to work reasonably well across a wide cross section of genres and offer a good starting point. Each algorithm comes with a long list of parameters and configurations that can be adjusted, and the content analyst will wish to take the time to experiment with different models to see which offers the best accuracy for a given project.
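As one possible starting point of the kind described above, the sketch below trains a naïve Bayes model on a handful of pre-categorized documents using the open-source scikit-learn toolkit (assumed to be installed) and classifies two novel documents; the texts and category labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical pre-categorized training data supplied by a human coder.
train_texts = [
    "office hours tuesday readings and assignments for the exam",
    "syllabus fall 2002 weekly readings final exam schedule",
    "quarterly earnings rose as the company expanded overseas",
    "the merger was approved by shareholders last quarter",
]
train_labels = ["syllabus", "syllabus", "business", "business"]

# Feature extraction/reduction (TFIDF) and the learning algorithm in one pipeline.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Classify novel documents the model has never seen.
print(model.predict(["course readings and office hours posted on the syllabus"]))
print(model.predict(["shareholders debated the proposed merger"]))
```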

Evaluating ATC Results

Of course, once an ATC model has been trained, the critical question becomes how accurately it is able to classify novel documents into its trained categories. In most cases, the training set is divvied up, with two-thirds of the documents used for training and one-third set aside for testing. Once the model has been trained on the training set, it is tested using the same set of documents, on which it should perform with very high accuracy. This is called its resubstitution accuracy, since the training data is being substituted back into the model for testing, essentially measuring the degree to which the model captured the nuances of the training data (while ignoring its ability to categorize novel documents). The model is then tested on the remaining one-third that was set aside at the beginning. To the ATC model, these are novel documents that it has never seen, but since they have been categorized by a human, it is possible to measure how closely the ATC algorithm matches the assigned categories. The most common method of measuring ATC accuracy is the so-called confusion matrix, which essentially charts the categorization choices of the ATC model against the human categorizations, allowing several key measures of performance to be determined.

TABLE 8.1 Basic confusion matrix

                             Computer
                     Negative              Positive
Human    Negative    True negatives        False positives
         Positive    False negatives       True positives

Table 8.1 illustrates the basic confusion matrix, showing the results of a single binary classifier (one that only categorizes an article as belonging to the given category or not). The rows give the human categorization of the article (either negative – not in the category, or positive – in the category) while the columns give the ATC model's classification. True negatives and true positives indicate the number of documents in which the machine's categorization matched what the human assigned for that document, while false positives are documents the machine assigned to the category, but the human did not, and false negatives are ones excluded by the machine but included by the human. Each of these is a simple integer representing the number of documents falling into that cell of the matrix. Using this matrix, a set of key evaluative measures may be derived:

•   Accuracy The overall percentage of predictions that was correct. This is simply the number of true negatives plus true positives divided by the total number of documents.

•   False negative rate The percentage of positive documents that were incorrectly labeled as negative. (Number of false negatives divided by the total number of actual positive documents.)

•   False positive rate The percentage of negative documents that were incorrectly labeled as positive. (Number of false positives divided by the total number of actual negative documents.)

•   Rejection rate (precision) The percentage of positive predictions that was correct. This is essentially the precision value of the model, indicating how well it rejects documents that do not belong in the category. (Number of true positives divided by the total number of positive predictions.)

•   True negative rate The percentage of negative documents that were correctly labeled as negative. (Number of true negatives divided by the total number of actual negative documents.)

•   True positive rate (recall) The percentage of positive documents that were correctly labeled as positive. This is essentially the recall value of the model, or how many of the documents that belong in a category are caught by the algorithm. (Number of true positives divided by the total number of actual positive documents.)
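Once the four cell counts of the confusion matrix are known, each of these measures reduces to a line of arithmetic, as in the sketch below; the counts themselves are invented for illustration.

```python
# Hypothetical cell counts from a single binary classifier's confusion matrix.
true_negatives, false_positives = 850, 30
false_negatives, true_positives = 20, 100

actual_positives = true_positives + false_negatives
actual_negatives = true_negatives + false_positives
total = actual_positives + actual_negatives

accuracy            = (true_positives + true_negatives) / total
false_negative_rate = false_negatives / actual_positives
false_positive_rate = false_positives / actual_negatives
precision           = true_positives / (true_positives + false_positives)
true_negative_rate  = true_negatives / actual_negatives
recall              = true_positives / actual_positives   # true positive rate

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```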

Confusion matrices assume that testing data has a relatively uniform distribution among the trained categories. If, for example, category A has 990 documents and category B has only ten, a misclassification of ten documents in A would lead to an error rate of 1 percent, while the rate for B would be 100 percent, an obvious overemphasis. One solution is to use random sampling to reduce the size of the larger category to that of the smaller one. However, a better solution may be the use of a receiver operating characteristic (ROC) graph. A ROC curve plots the false positive rate on the X axis and the true positive rate on the Y axis. The point (0,1) (or 0,100, depending on the scale) is known as a perfect classifier in that its false positive rate is 0 and its true positive rate is 100 percent, meaning that it correctly classifies every document that belongs in the category and never puts an incorrect one in. As noted earlier, ATC systems have a large number of tunable parameters, from the feature selection and reduction methods, to the learning algorithm that builds the model. Each combination can be tested and its true and false positive rates plotted onto the ROC curve to determine the best possible configuration.

Benefits of ATC over Human Categorization

Automatic text categorization has a number of important benefits compared with human coding. There are the obvious ones of performance and scalability, with a single machine able to process hundreds or even thousands of documents per second, compared with minutes or more per article with human coding. More important than scalability, however, is consistency. Even under the most rigid of environments, human coders bring background experiences and pre-existing knowledge to bear when coding documents, leading to significant variation in their results. Intercoder reliability tests must be used to determine the degree of variance between coders. These tests require each coder to categorize the same randomly selected set of articles, and the results are compared to see how closely they match. If each coder categorizes articles slightly differently, this leads to fragmentation in the categorized corpus and severe validity problems in the resulting data. Further complicating matters, coders may occasionally differ on the categorization for the same document if asked to categorize it multiple times, a problem captured by intracoder reliability.

In some cases, the categorization schema may be sufficiently nuanced, or coders may have insufficient expertise in its subject areas, that extreme variation is seen among the entire body of coders. In other cases, specific coders may exhibit significantly different categorization behavior than their peers, requiring them either to undergo further training or be replaced. ATC techniques remove this source of variability and instability and offer instead perfect reliability in that the algorithm will produce the exact same categorizations for a document no matter how many times it is run. A single ATC model can operate faster than an entire team of human coders, but even if an ATC model is distributed across a cluster of machines to operate in parallel on a large document collection, every instance of the model will perform exactly the same.

Limitations of ATC

Like any automated technique, there are tradeoffs to the substantial performance and scalability increases offered by automatic text categorization. Since they bring no external knowledge to bear on the categorization task, ATC systems are entirely reliant on the quality of the training data to determine the accuracy of their resulting models. The number of training documents in each category should be as equal as possible to avoid creating an imbalance emphasizing words in the larger category. This is especially true when using a feature reduction weight that incorporates word frequency in the overall corpus, since those weights assume equal distribution of training data among the available categories. If one category has more documents than another, a random sample may be conducted of the larger category to reduce its size to that of the smaller category. Simply providing an ATC system with a corpus of categorized documents (or a random sample thereof) will usually result in fairly high accuracy in many applications, but occasionally the specific type of word usage in a category or set of categories may require more guidance than can be gleaned from the corpus alone. In these cases, human experts may hand-select examples of articles that obviously belong in the category, articles that obviously don't belong in the category, and fringe articles (both inclusive and exclusive). These hand-picked examples help reinforce the aspects of the category that the expert, drawing from his or her domain knowledge, believes are the most important.

Learning algorithms used in ATC work best when the text as a whole contributes substantially to the specific categories, as opposed to comprising only a small portion of text in each. An ATC model may not be the best choice to identify documents with brief mentions of global warming when the corpus consists of 100-page speeches and the topic in question comprises only a single paragraph in each. If the speeches vary enough in nature that broader patterns (which might skew the learner into finding the wrong patterns) are unlikely to be present, such a model might still work. More likely, however, is that the specific patterns suggesting global warming in the documents will be lost in the landfill of unrelated content. Significantly greater volumes of training data become necessary to help the machine sift out the small amounts of trigger text, and much greater care must be given to the selection of training data. In these cases, a preparation stage may be required, which might target a broad set of keywords related to global warming and extract any paragraph containing one of the keywords. This reduced set of text would be sent through the ATC procedure, resulting in a much higher density of text likely to be predictive of global warming references.

It is important to understand that ATC models are built exclusively from the training data provided to the learning algorithm. If the human coded training data has poor reliability (exhibits strong variation across coders) this can have an impact on the accuracy of the resulting model. However, due to their probabilistic nature, models are robust against a certain degree of error, so a few incorrectly categorized documents won't usually have a significant impact on overall accuracy. In fact, current-generation learning algorithms will self-adjust to randomized error (such as non-systematic coder error) as these documents appear as noise in the general probability tables. Systematic coder error (bias) is more problematic, but machine categorization relies on vastly different processes than human coders that can accommodate certain types of systematic bias. For example, two human coders that are entirely disjoint in the documents they place into a particular category (perhaps due to different background knowledge they each have about the category) would be expected to yield a dataset unusable for analysis. However, computer techniques examine only the wording used in the documents, not their upper-level semantic meaning. Unless one coder engages in a vastly different operational model (i.e. putting sports articles in a human rights category) there is a sufficient likelihood that a union of the word frequency distributions of the two document sets would yield a model representing documents halfway between the two extremes, embodying the expected categorization.

One of the benefits of machine learning is that the training environment may be precisely defined. The algorithms do not bring any outside knowledge or experience into the coding process, and the same codes are applied to the same documents no matter how many times the algorithm is run (yielding perfect intracoder reliability). Hence, if, over time, it is discovered that the algorithm performs less accurately on certain documents, the models can be adjusted by giving the system a few additional cases to consider. The model is precisely adjusted by only the amount of information in these additional documents, whereas additional training of a human coder might lead to a wider variance in coding style.

ATC systems offer two primary categorization models that must be understood when using multiple categories: best match and all matches. Under the best match model, the likelihood of the given document belonging to each category is computed and the one with the highest probability is assigned as the match. Even if a document has a strong probability of belonging to several categories, only the one with the highest rank will be assigned. This behavior may be acceptable or even desired in certain contexts, while in others it is preferable to have every likely category assigned to the document. One popular technique is to use a cascaded set of binary classifiers, each of which determines membership in a single category. The document is run through each classifier to determine whether it belongs to that particular category, and a list compiled of every category the document was deemed a member of. Such a technique can even be used with systems that offer only a best match, by using each ATC model to determine membership in only a single category.
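One way to realize the cascaded approach is to train a separate yes/no classifier for each category, run every document through all of them, and compile the list of categories whose classifiers answer yes. The sketch below does this with scikit-learn (assumed installed); all training texts, category labels, and the example document are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: a document may legitimately belong to several categories.
train_texts = [
    "emissions treaty negotiations stalled over carbon targets",
    "the striker scored twice in the championship final",
    "lawmakers debated carbon taxes before the championship broadcast",
]
train_category_sets = [{"environment"}, {"sports"}, {"environment", "sports"}]
categories = ["environment", "sports"]

# Train one binary (yes/no) classifier per category: the cascade.
cascade = {}
for category in categories:
    labels = [category in cats for cats in train_category_sets]
    cascade[category] = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(train_texts, labels)

# A novel document is assigned every category whose classifier says yes.
new_doc = "carbon emissions fell ahead of the treaty deadline"
assigned = [c for c, clf in cascade.items() if clf.predict([new_doc])[0]]
print(assigned)
```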

Applications of ATC

Automatic text categorization has found application across nearly every field that must make sense of large volumes of textual material. One of the most common commercial applications lies in sorting news feeds and web sites into predefined categories, like Google News. Researchers often have a need to sift through large document collections to find just the small set of articles relevant to a search. Keyword queries are extremely limited when searching for more complex topics, and place the burden on the human searcher to discover every possible keyword that might define the topic at hand. Many topics may not have a set of signature keywords and instead are defined by subtle combinations of keywords in selected contexts. Only ATC techniques offer the ability to automate such nuanced searching.

Given the ability of ATC techniques to discover even slight patterns in document collections, they have also been used to compare tone and bias in collections by training on archives of known bias (such as speakers on either side of an issue). While some tonal and bias constructs are immediately obvious, overall tone or bias is often subtle and the ability to identify and operate upon collections of small word usage patterns can be a tremendous win for ATC in these situations.

Collections that have considerable amounts of metadata assigned to each document can benefit from ATC by discovering relationships between word usage and the values of particular metadata fields. For example, if a set of informal emails is collected from individuals across a wide range of ages and tagged by age range, ATC can be used to expose possible word-use patterns within each age group and estimate the author age of novel emails.

Finally, ATC can be used as an integral part of a survey coding workflow, translating open-ended or textual responses into quantitative data and partially automating the coding of multiple choice questions. Depending on the type of closed response question, sample documents that were coded into each possible value of the question can be used to develop models to evaluate new documents for the given question. The accuracy of such techniques depends on the amount of the document text that is used in answering the question, with obscure snippets of text being more difficult for the ATC system to recognize than larger whole-document patterns.

Clustering

Categorization requires a pre-existing taxonomy and training data, while clustering is a completely automatic technique, requiring no human intervention in the learning process and discovering its own similarity patterns in the data. Some systems allow the user to specify the number of groups that documents should be clustered into (essentially a resolution threshold), while others simply form as many groups as necessary to satisfy maximal similarity constraints within each group. Fully automated approaches may not yield the results a human would expect, but can offer important insights into previously undetected patterns in the source data. A clustering technique, for instance, may find that two groups of documents by supposedly very different sources share very strong word-use similarities (suggesting further investigation): something that would be lost when using human-defined categories that the machine simply matches. Clustering algorithms are known as unsupervised learners since the machine is responsible for coming up with its own clusters, rather than relying on human-provided ones.

Automated Clustering

Automated clustering techniques compute the similarity between every pair of documents and use these distances to successively group or ungroup a collection into maximally similar clusters. Like automatic text categorization, there are many different clustering algorithms, each of which is optimized for specific types of data input. However, techniques vary so widely in their methodology that this section will address just a few of the most common.

The best possible clustering of a document collection, known as its optimal clustering, can only be derived by exhaustively trying every permutation of starting documents to find the specific arrangement that leads to the greatest similarity among clusters. This is computationally intractable at large document counts, so real-world clustering algorithms rely on a variety of clever approximations. At the most basic level of functionality two major types of clustering approaches are used: hierarchical and partitional. Hierarchical clustering techniques break the documents into a hierarchy of clusters, represented as a dendrogram, preserving the relationship between clusters, while partitional techniques break the collection into a simple list of clusters, with no information on the similarity of each.

To form their clusters, algorithms use either agglomerative or divisive grouping. In the agglomerative approach, each document is initially considered to be its own cluster and clusters are successively merged together until the desired number of clusters is achieved or the similarity distance between each cluster has been optimized. In the divisive approach, all documents are considered to form a single massive cluster, which is successively split apart until either the desired number of clusters is reached or the similarity distance between each is optimized.

Hard clustering techniques assign each document to a single cluster, regardless of whether its content overlaps into other clusters. This is similar to best match ATC algorithms that only assign a document into the single highest-ranked category. Fuzzy clustering algorithms, on the other hand, allow a document to belong to multiple clusters simultaneously, assigning a score to the percentage of its content matching each cluster. While computationally more expensive, fuzzy clustering allows the greatest resolution in clustering documents, accounting for documents with multiple core themes.

Hierarchical Clustering

Hierarchical clustering is designed to cluster a set of documents while preserving the similarity relationship between each cluster. The resulting dendrogram visualization is essentially a tree structure illustrating the taxonomic relationship between each document at every level of clustering. Agglomerative clustering is the most common hierarchical method used and begins by measuring the pair-wise similarity of every document and selecting the two documents that are the most similar (in the case of a tie, the winner is selected at random). These are merged together and their vectors averaged to become the first cluster. The new cluster and the remaining documents are then compared and the two most similar merged together, their vectors averaged, to become the second cluster. This process repeats itself with exactly two documents/clusters merged at each step, leading ultimately to a single cluster containing all documents.
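The sketch below illustrates agglomerative hierarchical clustering over TFIDF vectors using the SciPy and scikit-learn libraries (assumed installed). Note that SciPy's "average" linkage merges the closest pair at each step based on average pairwise distance, a common stand-in for the vector-averaging step described above; the documents are invented.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the senate passed the budget bill",
    "lawmakers debated the spending bill",
    "the team won the championship game",
    "fans celebrated the championship victory",
]

# Convert each document to a TFIDF vector.
vectors = TfidfVectorizer().fit_transform(docs).toarray()

# Pairwise cosine distances, then successive merging of the closest pair into a tree.
distances = pdist(vectors, metric="cosine")
tree = linkage(distances, method="average")

# Cut the dendrogram into two clusters for inspection.
print(fcluster(tree, t=2, criterion="maxclust"))
```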

Partitional Clustering

Partitional clustering is less computationally intensive compared with hierarchical clustering and thus is better suited for very large document collections. Rather than clustering into a tree of relationships, it generates a single one-dimensional list of clusters and their member documents, with no information about the closeness of individual clusters.

As with hierarchical clustering, agglomerative grouping is the most common partitional clustering scheme. Unlike hierarchical clustering, which simply proceeds through the archive, grouping two documents/clusters at each stage, partitional clustering usually begins with a user-specified target number of clusters. All documents are converted to vector space representations, and every pair of documents is compared to find the two with the closest similarity, which become the first cluster (if two documents tie, one is selected at random). These two are dropped from the main list of documents and their individual document vectors averaged to form a representational vector for the cluster. The remaining list of documents is then compared and the next two most similar selected, and the process repeats until the desired number of clusters has been formed. These become the starting clusters, representing the centers of the primary groups in the corpus. The remaining documents are then each compared against these clusters and assigned to the one they share the greatest similarity with. Essentially, the corpus of documents becomes agglomerated together into successively larger masses. Some variations do not require the user to specify a starting number of clusters and instead use some form of thresholding to determine how many starting clusters to create based on whether any of the remaining documents are similar enough to warrant becoming their own cluster.

Another form of agglomerative clustering is k-means clustering, in which k documents are selected at random from the corpus (a buckshot selection) to become the starting clusters. The remaining documents are assigned to one of these clusters based on similarity. Within each resulting cluster, the average of all documents is computed to arrive at the centroid that sits between those documents. All documents are then compared against these new center points and reassigned to the closest center. This process is repeated until similarity within each cluster is maximized to the point that the centroids do not move and no reassignments are made.
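A sketch of k-means clustering of TFIDF vectors using scikit-learn (assumed installed) follows. Its KMeans implementation uses a smarter default seeding than the purely random buckshot selection described above, but the reassign-and-recompute loop is the same; the documents and the choice of two clusters are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the court upheld the ruling on appeal",
    "judges questioned the lawyers during the appeal",
    "the rover transmitted new images from mars",
    "engineers reviewed telemetry from the mars rover",
]

vectors = TfidfVectorizer().fit_transform(docs)

# k starting centers, then repeated reassignment and centroid recomputation until convergence.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)   # cluster assignment for each document
```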

Finally, divisive clustering begins with all documents considered to be a single cluster and successively breaks them apart based on maximal dissimilarity. The process begins by measuring the pair-wise distance between all documents and selecting the two with the greatest dissimilarity. These are split off from the group and form the two starting clusters. All remaining documents are compared against the two starting clusters and assigned to the one each is closest to. Each of the two starting clusters is then examined and the distance between each pair of documents in that cluster measured. The distance between the two documents in the cluster that are the farthest apart is compared against a predefined threshold, and if it is greater, those two documents are split off to form two new clusters. This is done for each of the clusters. The rest of the documents (that were not used as the starting seed documents of any of the clusters) are then compared against these clusters and assigned to the closest cluster. The process repeats itself until the maximal distance between documents in each cluster is below the specified threshold, at which point the clustering algorithm has reached convergence.

Document Similarity

Automatic text categorization explores patterns across a collection of documents, identifying nuanced word frequency differences among the collection as a whole and applying those to categorize new documents. Document clustering answers a similar question, relying on inferences to group documents together, rather than into predefined categories. Both techniques address the notion of comparing collections of documents against other collections and extracting patterns in the aggregate. In some cases, however, similarity must be computed against a single document, rather than a collection. There are two primary approaches to this: the vector space model, used in automatic text categorization, and contingency tables. The vector space model operates on the document as a whole, establishing similarity based on overall word usage, while contingency tables examine differences in the frequency of particular words, allowing quick isolation of the words which contribute the most significant differences between the documents.

Vector Space Model

Under the vector space model, the two documents to be compared are converted into vector representations. Similar to the automatic text categorization process, both undergo feature extraction and selection processes to produce a high-dimensionality vector representation of overall word use in the document. To compare the two, the cosine angular distance between their vectors is computed, which is simply their vector dot product divided by the product of their magnitudes. This measurement essentially treats the two document vectors as simple geometric vectors in space, and computes the angle between them. While it may be hard to picture the angle between two 10,000-dimension vectors, the basic concept is the same whether it is a two-dimensional vector on a paper graph, or one of much higher dimensionality used in document representation. The resulting angle is the degree to which the two document vectors are dissimilar, or, put another way, the degree to which their word usage differs. A value of 0 would indicate the two documents are exactly the same, while a higher value would indicate a greater and greater difference in wording. The careful reader will note that this process is actually a component of the k-nearest-neighbor learning algorithm used in ATC. While cosine similarity is one of the most common whole-document similarity measures, other variants include binary cosine similarity (same process, but uses only a 1 or 0 to represent whether a word occurred or not, instead of its frequency of occurrence), and Jaccard similarity (compares the number of similar and dissimilar words).
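The calculation itself is short. The sketch below converts two toy documents into word-frequency vectors, computes the dot product divided by the product of the vector magnitudes, and reports the resulting angle in degrees; the helper name and example texts are invented.

```python
import math
from collections import Counter

def cosine_angle(text_a, text_b):
    counts_a, counts_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocabulary = set(counts_a) | set(counts_b)

    # Dot product of the two word-frequency vectors, and the magnitude of each.
    dot = sum(counts_a[w] * counts_b[w] for w in vocabulary)
    mag_a = math.sqrt(sum(c * c for c in counts_a.values()))
    mag_b = math.sqrt(sum(c * c for c in counts_b.values()))

    cosine = dot / (mag_a * mag_b)
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))  # angle between the vectors

print(cosine_angle("the cat sat on the mat", "the cat sat on the mat"))   # 0.0: identical wording
print(cosine_angle("the cat sat on the mat", "quarterly earnings rose"))  # larger angle: no overlap
```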

Contingency Tables

Cosine similarity compares two documents based on their word usage as a whole, yet in some cases it is important to determine which individual words exhibit the most profound differences in usage. To examine differences in individual word use, a contingency table is used, as seen below. A list of all unique words occurring in the two documents is computed and a contingency table compiled for each. The rows contain the number of occurrences of the given word in documents A and B, along with the number of occurrences of other words, while the columns separate the occurrences by document.

Once the contingency table has been constructed, it may be used with one of many statistical significance tests, such as the Pearson's chi-square test. These tests essentially measure the significance of the difference in word frequency use between the two texts. If the two documents used the word an identical number of times, there would be no difference, while the greater the difference in the word's frequency between the two documents, the greater the deviation and possible significance of its usage.

TABLE 8.2 Contingency table

                        Text A     Text B     TOTALS
Count of this word      Word_A     Word_B     Word_Total
Count of other words    Other_A    Other_B    Other_Total
Total words             Total_A    Total_B    Total_ALL
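As a sketch of how the table above might be used, the code below fills it with invented counts for a single word in two hypothetical texts and passes it to SciPy's chi-square test of independence (SciPy assumed installed).

```python
from scipy.stats import chi2_contingency

# Hypothetical counts for one word in two texts.
word_a, total_a = 42, 10_000    # occurrences of the word and total words in Text A
word_b, total_b = 7, 12_000     # occurrences of the word and total words in Text B

# Rows: this word vs. all other words; columns: Text A vs. Text B (as in Table 8.2).
table = [
    [word_a, word_b],
    [total_a - word_a, total_b - word_b],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square={chi2:.2f}  p={p_value:.4f}")
```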

In addition to comparing the use of individual words, many other attributes of the two documents can be compared using contingency tables and the log-likelihood test. Deviations in part of speech use, correlated words of a particular target word, and occurrences of words from a selected lexicon, among others, can all be used as analytical points to compare or contrast two documents.

Chapter in Summary

While earlier chapters introduced techniques for characterizing or extracting information from documents, this chapter surveys techniques for comparing them, including categorization and clustering. Automatic text categorization (ATC) accepts a set of documents as input and builds a statistical model that it uses to search through the larger collection to identify other documents similar in topical focus. ATC works by compiling a list of all unique words in the document collection (features) and identifying those words that have the greatest distinguishing power (feature reduction). Each document is represented using a vector space model in which each word is a dimension, and a learning algorithm then builds a statistical model to estimate the probability that other documents in the collection belong to the same category. ATC relies on predefined categories determined by a person, while clustering requires no human input: it compares the similarity of each document to automatically create groupings that bring together the most similar documents. Finally, these same techniques can be used to compare the similarity of any pair of documents, while contingency tables can be used to identify the words that most contribute to the uniqueness of a particular category.
