6

TOPIC EXTRACTION

Vocabulary analysis examines the surface of a document, measuring only its use of words without considering the meaning of those words. A far more powerful technique known as semantic analysis uses sophisticated language models to allow the computer to disambiguate whether a word is a name or an action and to make inferences about its meaning. While not as accurate as trained human analysts, modern semantic analysis methods achieve accuracy sufficient for many types of text, opening the door to performing deeper analyses at an otherwise impossible scale.

How Machines Process Text

Unlike numeric computation, processing human-generated text presents a number of unique challenges to automation. Computers are designed to work within unchanging environments, where a given input holds constant meaning no matter what its source. The number 15 refers to the same number of items regardless of the context or culture in which it originated. Language, on the other hand, reflects the full extent of the individuality and creativity of the human mind. Unlike the universal system of numbers, there are over 6,900 known active human languages, each with its own underlying cultural context and shared background knowledge. Designing computer systems that can convert a collection of words into their deeper semantic meaning and derive understanding from that knowledge is an extraordinarily difficult task. Computer understanding of text is still a far cry from human comprehension, but machine accuracy is often good enough for many tasks, and the speed and cost efficiency of machine processing make it indispensable for any project working with large volumes of text.

Unstructured Text

Textual data is often described as a dichotomy of so-called structured and unstructured forms. Structured data refers to text that has been placed into a formal framework that offers rich secondary information on its meaning. For example, on the World Wide Web, a computer language known as eXtensible Markup Language (XML) underlies a movement called the Semantic Web, in which web page authors annotate their text with special tags. Each tag has a meaning associated with it, indicating that the enclosed text is a particular type of content. For example, a contact page for a business might contain tags around the business name, phone number, address, and operating hours that convey to the machine the type of content contained within each block. A phone number might appear in a page as 1-234-567-8901, while the company's name would appear as ACME Distributing. The tags are not displayed by web browsers, but are embedded in the page source code so that search engines and other computer agents can rapidly understand that a particular piece of text is a phone number or the name of the company operating the site, rather than just any string of numbers or letters. Other pages might have tags identifying each occurrence of a person's name or flagging the page as being a résumé or a lesson plan. No matter what types of tags are used, the underlying idea of structured text is to provide the equivalent of a cheat sheet to the computer, telling it the deeper semantic meaning of key sections of a page's content.

Unstructured text is simply a document with no additional computer markup: a typical email or news article is an example. Such content is referred to as unstructured to denote its opaqueness to traditional analysis techniques that require an understanding of what each piece of text means. Far from a freeform jumble, however, human language is actually governed by a complex set of structural rules ranging from surface syntactical guidelines to more complex grammatical constructs. The rigid rules of grammar are precisely the kinds of patterns that computers thrive on, and it is possible to develop programs that can reliably parse well-written text. The field of text mining refers to the use of computer techniques to exploit these syntactic and semantic rules of human language to generate an understanding of text. Such techniques don't actually structure content; rather, they “[extract structure] by applying linguistic models to [the] documents … to discover and exploit their inherent structure” (Grimes, 2005). Through the use of such techniques, automated processes can conform human text to a schema that allows rich analytical techniques to be utilized in ways similar to structured data.

Extracting Meaning from Text

The very first step in machine language comprehension is to identify the semantic role each word plays in the document. In a process known as part-of-speech (POS) tagging, statistical techniques are used to annotate each word in the document with its appropriate part of speech (Brill, 1992). To create a POS tagging system for a language, hundreds or thousands of documents written in that language are painstakingly annotated by an army of human editors, assigning each word to its appropriate part-of-speech category. After the entire collection has been annotated, statistical models are generated that offer the probability of a word belonging to a particular part of speech given its surrounding words. Some words may always appear as a particular part of speech, while others are highly ambiguous (such as “I opened the can of soup” versus “He can run the marathon”) and depend entirely on context. The first major training corpus was the Brown Corpus, published in 1964, containing 500 tagged documents totaling more than one million words (Francis and Kucera, 1964). Using these models, a computer program can assign a POS tag to each word in a document by examining the probability of its belonging to each part-of-speech class based on the surrounding words.
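
To make this concrete, the following is a minimal sketch of statistical POS tagging, assuming the freely available NLTK library for Python and its pretrained English tagger; the library, the example sentence, and the download steps are illustrative assumptions rather than the specific systems discussed in this chapter.

```python
# A minimal POS tagging sketch, assuming the NLTK library and its
# pretrained English models (an illustrative choice, not the specific
# systems described in this chapter).
import nltk

# The tokenizer and tagger models must be downloaded once before first use.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "I opened the can of soup before he said he can run the marathon."
tokens = nltk.word_tokenize(sentence)

# Each token is annotated with its most probable part of speech given its
# surrounding words; the two uses of "can" should receive different tags
# (noun versus modal verb).
for word, tag in nltk.pos_tag(tokens):
    print(word, tag)
```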

Once all of the document words have been tagged with their probable part of speech, spans of words having the same part of speech are grouped together to create noun phrases and verb phrases. Nouns (and noun phrases) constitute the primary semantic tokens of a document, conveying the actors of the text. Verb phrases provide the linking information that describes what happened to those actors or what actions they took. In the sentence “John and Sara drove to the grocery store,” the noun phrases are John, Sara, and grocery store, while the verb phrase is drove. Using just this information, the computer immediately understands that John and Sara and a grocery store are key elements of this sentence, and that they are related in that the first two drove to the third.
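
The grouping of tagged words into phrases can be sketched with a simple chunking grammar. The example below again assumes the NLTK library; the two chunk rules are a deliberately simplified illustration of how runs of noun-related and verb-related tags can be collapsed into noun phrases and verb phrases.

```python
# A minimal phrase chunking sketch, assuming NLTK; the chunk grammar is a
# simplified illustration, not a production-quality rule set.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "John and Sara drove to the grocery store."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Group runs of determiner/adjective/noun tags into noun phrases (NP) and
# runs of verb tags into verb phrases (VP); <NN.*> also matches proper nouns.
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}
  VP: {<VB.*>+}
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Print each extracted phrase and its type, e.g. "NP John", "VP drove",
# "NP the grocery store".
for subtree in tree.subtrees(filter=lambda t: t.label() in ("NP", "VP")):
    print(subtree.label(), " ".join(word for word, _ in subtree.leaves()))
```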

Noun phrases and verb phrases essentially become semantic surrogates for the document's content, giving the machine a basic list of its key concepts. However, at no time does the computer actually have a deeper understanding of the meaning of the text it is processing; it is simply applying statistical models to identify the grammatical structure of the document and the insights that structure offers into its actors and actions. As with any purely statistical technique, there is a wide margin of error. In particular, differences in language use can cause significant problems with such systems. If one author prefers the term war while another continually uses armed conflict, the computer has no way of knowing that these two terms refer to the same core concept. Tools such as WordNet (http://wordnet.princeton.edu/), which applies a semantic taxonomy over the English language, offer promise in resolving the kinds of language use differences found in real-world environments. WordNet, for example, encodes the relationship between war and armed conflict, and through the proper traversal of such a taxonomy it is possible to bridge at least some of the issues posed by differing language use.
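
The following minimal sketch, again assuming NLTK's interface to WordNet, shows how the taxonomy can be traversed to relate two different surface terms; the choice of the first noun sense of each word is a simplification, since real systems must first disambiguate which sense is intended.

```python
# A minimal WordNet sketch, assuming NLTK's interface to the database.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# Take the first noun sense of each term; a real system would need to
# disambiguate which sense of each word is actually intended.
war = wn.synsets("war", pos=wn.NOUN)[0]
conflict = wn.synsets("conflict", pos=wn.NOUN)[0]

# Path similarity traverses the hypernym taxonomy: higher scores mean the
# two senses sit closer together in the WordNet hierarchy.
print(war.path_similarity(conflict))

# The lowest common hypernyms are the taxonomy nodes that bridge the terms.
print(war.lowest_common_hypernyms(conflict))
```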

Applications of Topic Extraction

The ability of computer techniques to reduce the human labor required for higher-level semantic comparisons of large document collections has led to a variety of innovative applications of topic extraction technology. The examples below relate several general application areas with relevance to content analysis.

Comparing/Clustering Documents

One key application of topic extraction is conceptual document clustering. Traditional clustering algorithms rely on the full text of each document, but some projects may find greater accuracy by focusing only on core concepts. Using topic extraction, a ranked list of top concepts within each document can be generated and the similarity among a group of documents determined by the degree of overlap between just those concepts. A clustering tool grouping documents by their full text might notice that one set of documents uses the word “the” much more frequently than another. While that may indeed be the strongest distinguishing factor among the documents, it is likely that the more desirable outcome would be groupings based on topical focus. Thus, topic extraction can be very useful in generating a topical proxy of each document. Clustering plays such an important role in content analysis that the next chapter is dedicated entirely to the various techniques for clustering and categorization.
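
As a minimal sketch of this idea, the function below measures the overlap between two documents' top extracted concepts using the Jaccard coefficient; the concept sets are hypothetical output from an earlier topic extraction step, and a real system would typically weight concepts by rank or frequency rather than treating them equally.

```python
# A minimal concept-overlap sketch; the concept sets are hypothetical
# output from a topic extraction step.
def concept_similarity(concepts_a, concepts_b):
    """Jaccard overlap between two documents' top extracted concepts."""
    if not concepts_a or not concepts_b:
        return 0.0
    return len(concepts_a & concepts_b) / len(concepts_a | concepts_b)

doc1 = {"war", "ceasefire", "negotiations", "casualties"}
doc2 = {"armed conflict", "ceasefire", "negotiations", "treaty"}
doc3 = {"grocery store", "produce", "prices"}

print(concept_similarity(doc1, doc2))  # topically related -> higher score
print(concept_similarity(doc1, doc3))  # unrelated -> 0.0
```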

Automatic Summarization

Topic extraction also finds significant application in the production of automated document summaries. When a human summarizes a document into a short set of paragraphs, he or she brings significant background knowledge to bear on the interpretation of the text, determining the most important aspects of the document and the bare minimum amount of text required to accurately represent it. Summaries are also often context-dependent, with the selection of what constitutes important information based heavily on the intended audience.

Automatic summary generation is not yet capable of interpreting documents in the same way as humans and must instead rely on statistical measures based on the wording of the text. While there are myriad summarization techniques in common use, one particularly simple approach relies on the use of topic extraction to identify the core fragments of text that carry the most information. In a typical document, the content revolves around a set of entities taking action or being acted upon. By extracting a list of noun phrases from the text and ranking them by number of occurrences, the most frequently appearing concepts (in theory, the most important aspects of the document) can be identified. Each sentence is then decomposed into fragments, and the fragments containing one of the top concepts are combined together. The final summary is created by starting with the list of fragments containing the top-ranked concept, combining those together, and, if the length of the summary is not yet at the specified limit, proceeding to the next concept and its list of fragments, and so on. The resulting summary is only a rough approximation of the text, but in the average case it will capture the core gist of the document while excluding outlier information that was only rarely mentioned.
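
A minimal sketch of this frequency-based approach is shown below; it assumes the document has already been split into sentences and that a list of candidate concepts (noun phrases) has been extracted, and it uses a deliberately crude fragmentation rule based on the delineation tokens discussed in the next paragraph.

```python
# A minimal frequency-based summarization sketch; sentences and candidate
# concepts (noun phrases) are assumed to have been extracted already.
import re
from collections import Counter

def summarize(sentences, concepts, max_fragments=5):
    """Build a rough summary from fragments containing top-ranked concepts."""
    # Rank concepts by how often each appears across the document.
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        for concept in concepts:
            counts[concept] += lowered.count(concept.lower())

    # Split each sentence into fragments at commas, semicolons, and
    # coordinating conjunctions.
    fragments = []
    for sentence in sentences:
        parts = re.split(r"[,;]|\band\b|\bbut\b|\bor\b", sentence)
        fragments.extend(p.strip() for p in parts if p.strip())

    # Collect fragments mentioning each concept, starting with the most
    # frequent concept, until the summary reaches the requested length.
    summary, seen = [], set()
    for concept, _ in counts.most_common():
        for fragment in fragments:
            if concept.lower() in fragment.lower() and fragment not in seen:
                summary.append(fragment)
                seen.add(fragment)
                if len(summary) >= max_fragments:
                    return summary
    return summary
```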

Such an approach is obviously highly dependent on two critical components of its operation: noun phrase consistency and fragment selection. Once the list of all noun phrases in the document has been compiled, it is ranked by the number of occurrences of each phrase in the text. Different noun phrases that describe the same concept will, unfortunately, be missed by a system that compares noun phrases using exact matching. Some systems rely on secondary semantic taxonomies like WordNet to relate similar noun phrases (e.g., car and vehicle) to ensure that specific word choices do not negatively impact the system's ability to rank core concepts. The process used to convert complete sentences into fragments also plays a critical role in the accuracy of the resulting summary. A single sentence may convey several different concepts, and so fragment generation is used to separate it into logical blocks, each on one primary topic. Common delineation tokens include punctuation (such as commas or semicolons), coordinating conjunctions (and, but, or), and certain types of pronouns. Poor fragmentation can lead to overly large blocks of text being selected for each concept, reducing the amount of text on the core concepts that can be included in the final summary.

Depending on the level of polish required of the final summary, some systems apply a final pass of grammatical rules to the generated text to ensure it has proper grammatical form and punctuation. Such a final layer can smooth out inconsistencies and jarring transitions between fragments.

Automatic Keyword Generation

Most scholarly journals require their authors to submit a set of descriptive keywords to make it easier to find and index articles. Keyword selection is traditionally performed by humans with subject expertise in the area being described. Automated topic extraction can be used to generate descriptive keywords in much the same way that document summaries are generated. Once a ranked list of the most frequent noun phrases from a document has been compiled, the top five to ten phrases may be used as descriptive keywords, accurately capturing the most common concepts described in the document. As with automatic summarization, tools like WordNet can be used to equate equivalent noun phrases and maximize the accuracy of the ranking.

Most automatic keyword generation systems output a set of keywords drawn from the text of the document itself, meaning that a sizable document corpus will have a largely disjoint set of descriptive keywords. One of the key benefits of keyword-based indexing, however, is the normalization it offers in collecting together equivalent descriptions under a common taxonomy. Two primary techniques may be used to translate the document-specific keywords generated by the system into a predefined keyword taxonomy (which is what most journals use). In the first technique, a thesaurus like WordNet is used to find the taxonomy heading most closely related to each extracted noun phrase, potentially discarding high-ranked document keywords that don't have direct equivalents in the predefined taxonomy – a situation that can be common for fringe documents.
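
The sketch below illustrates this first technique, assuming NLTK's WordNet interface; the taxonomy headings, extracted keywords, and similarity threshold are all illustrative assumptions, and keywords that fall below the threshold (or have no WordNet entry at all) are simply discarded.

```python
# A minimal sketch of mapping extracted keywords onto a predefined
# taxonomy via WordNet; the taxonomy, keywords, and threshold are
# illustrative assumptions.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

taxonomy = ["war", "economy", "health", "education"]
extracted_keywords = ["warfare", "inflation", "vaccine"]

def best_taxonomy_heading(keyword, headings, threshold=0.1):
    """Return the taxonomy heading semantically closest to a keyword."""
    keyword_synsets = wn.synsets(keyword.replace(" ", "_"), pos=wn.NOUN)
    if not keyword_synsets:
        return None  # keyword has no WordNet entry, so discard it
    best, best_score = None, threshold
    for heading in headings:
        for heading_synset in wn.synsets(heading, pos=wn.NOUN):
            score = keyword_synsets[0].path_similarity(heading_synset) or 0.0
            if score > best_score:
                best, best_score = heading, score
    return best

for keyword in extracted_keywords:
    print(keyword, "->", best_taxonomy_heading(keyword, taxonomy))
```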

A second technique, widely used in web-based news aggregators, is to use automatic text categorization (described in more detail in the next chapter) to build models of what documents in each category look like in terms of word use. In these systems, a set of example documents that have previously been indexed into each category is used to build a model of word usage for that category. To categorize a new document, the model for each category is executed against the new document, and the categories with the highest scores are assigned as its keywords. News aggregators like Google News rely extensively on these techniques to automatically sort incoming news articles into a set of broad basic categories.
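
A minimal sketch of this second technique is shown below, assuming the scikit-learn library; the training documents, category labels, and model choice (a simple naive Bayes classifier over word counts) are illustrative assumptions rather than the actual systems used by any particular news aggregator.

```python
# A minimal text categorization sketch, assuming scikit-learn; the example
# documents, labels, and model choice are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Previously indexed example documents for each category.
training_texts = [
    "The central bank raised interest rates amid inflation fears.",
    "Stock markets fell sharply after the quarterly earnings reports.",
    "The team won the championship after a dramatic final match.",
    "The injured striker will miss the rest of the season.",
]
training_labels = ["business", "business", "sports", "sports"]

# Build a word-usage model for each category from the example documents.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(training_texts, training_labels)

# Score a new article against each category; the highest-scoring
# categories become its keywords.
new_article = "Investors reacted nervously to the new trade tariffs."
for category, score in zip(model.classes_, model.predict_proba([new_article])[0]):
    print(category, round(score, 3))
```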

Multilingual Analysis: Topic Extraction with Multiple Languages

Vocabulary analysis treats a text as a collection of objects (words) and measures various characteristics of those objects, such as frequency, correlated objects, etc. Such techniques can be applied across most natural languages without modification. Semantic analysis, on the other hand, relies on prebuilt statistical models of word use and grammatical structures within a given language, often compiled from massive hand-annotated archives. Part-of-speech tagging, for instance, relies on the use of a lookup table of word forms and context for a given language, derived from thousands of documents whose words were individually tagged into their parts of speech by humans. Each model is trained only for a specific language and must be retrained to operate on a different language, requiring substantial human effort.

Comparing multiple documents using semantic measures requires that all of them be written in the same language and adhere to a common set of grammatical constructs. If one document is written in a formal tone while another is written in the vernacular, concepts and their linking verbs may be substantially different, complicating semantic comparison (though the use of thesauri like WordNet can partially mitigate this). If documents are written in entirely separate languages, semantic models (such as part-of-speech databases) must be available for each source language to enable concept extraction in each language. The extracted concepts will still be in their source languages, making it impossible to compare or relate documents without some form of specialized cross-language dictionary.

Another approach is to use machine translation technology to translate all source documents into a single language, enabling the use of a single semantic model. While there are numerous machine translation offerings commercially available today, even the most advanced suffer from significant accuracy problems. Many contemporary tools use a rules-based translation engine, in which human experts have hand-coded a large knowledge base of grammatical rules and context-aware word translations. These systems require considerable time and effort to develop each language pair (a pair of languages that can be translated between, such as English and French) and are limited to the rules and translations preprogrammed by their developers. Despite their developmental complexity, such systems offer very fast translations with very low computational requirements.

As computer processing power and memory have increased, a new breed of machine translation systems has evolved, known as statistical translation models. Rather than relying on an army of linguists to define a set of translation rules, these systems are entirely automatic, requiring only a large corpus of documents that are available in both languages. They compare the text of each pair of documents, looking for statistical word correlations to develop context-aware translations of words and phrases between the two languages. In many cases, training such systems requires little work, as researchers can simply use digital versions of famous works (such as Alice in Wonderland) that have been widely translated into the languages of the world. However, these systems require substantial memory and processing power to operate, increasing their operational cost. Hybrid systems combine statistical translation with rules-based post-processing to increase the grammatical accuracy of the resulting output.

Machine translation tools can work well for relatively simple content that uses basic grammatical constructions and common wording. For more complex documents, such as nuanced legal contracts, editorial material, or literature that makes heavy use of language-specific constructs (such as rhyming words), machine translation may be far less useful. Part-of-speech tagging algorithms and other semantic tools are very susceptible to poor grammar, as they are not as able as humans to decipher the intended meaning of a text. Language conversion can also be more difficult in one direction than the other, such that translating a text from English to French may be more difficult than from French to English. Before attempting any content analysis project that involves multiple languages, it is important to determine the set of languages that must be translated to and from and to run extensive tests on the accuracy of each language pair and software package. For high-precision translation, it may be necessary to use human translators or, at the very least, to hand-correct the machine output.

Chapter in Summary

While vocabulary analysis looks only at the surface-level word choices made by an author, topic extraction considers the actual meaning of text, mapping strings of characters into a high-level semantic understanding. The first step in semantic processing is to annotate each word in the text with its part of speech, and group together spans of words into noun phrases and verb phrases, representing the text's concepts and actions. Topic extraction has many applications in content analysis, including topical document comparison, summarization, and keyword generation. Unlike vocabulary analysis, which does not require an understanding of the underlying language, the elaborate language models of topic extraction tools pose complications to projects involving texts in multiple languages. Automatic translation tools can help with this process, but the need for high-quality grammatical structure may require human editorial assistance.
