3

VOCABULARY ANALYSIS

Vocabulary analysis involves the examination of a document's wording, conducting a surface analysis of language selection, rather than exploring the concepts those words represent. Language use can be a very powerful descriptive tool for characterizing texts: differing topics, authors, and time periods typically exhibit significant stratification in their vocabularies. Frequency histograms can assist in authorship attribution, judging the likelihood a document was written by a particular individual based on similarities in the use of certain words or word classes. Author gender can be explored through patterns in word class use, while readability indexes can estimate the ease with which a passage will be understood. Term frequencies can suggest the progressiveness of a historical text, while normative comparison provides a benchmark against which to compare an isolated text. In essence, vocabulary analysis is a foundational method that underlies many other content analysis approaches, treating a compilation of documents as a collection or “bag” of words and exploring the patterns in the way those words are used.

The Basics

Word Histograms

Documents are constructed through the selection and ordering of the basic words that comprise a language, much as complex molecules are formed from the assembly of base chemical elements. Examining a corpus as a collection of its underlying words, rather than as whole documents, allows greater exploration of the patterns that underlie the assembly of complex meaning from discrete linguistic elements and how they differ across comparison groups.

Perhaps the most basic form of vocabulary analysis is therefore the word histogram, in which the set of unique words across a collection of documents is ranked by their relative frequencies. Most word histogram programs split a document up by spaces and punctuation, and any remaining string of letters is considered to be a word. It is important to keep in mind that a word histogram does not check to see whether each word is a valid word in that language: all strings of characters are treated as words as-is. This is actually of benefit to the analyst, as misspellings, personal names, and other words that are not found in a traditional dictionary are all included, helping to identify common typographical and other errors.
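
As an illustration, the splitting-and-counting approach described above can be sketched in a few lines of Python; the sample sentence below is invented purely for demonstration.

```python
import re
from collections import Counter

def word_histogram(text):
    """Split on anything that is not a letter and count what remains.

    Every surviving string of letters is treated as a word, whether or not
    it appears in a dictionary, mirroring the simple approach described above.
    """
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return Counter(words)

sample = "The dog and I went to the park, and played in the field."
for word, count in word_histogram(sample).most_common(5):
    print(word, count)
```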

In most cases, a vocabulary histogram counts individual words, but phrases can also be used. Some histograms may also contain additional descriptive attributes of each word, including length, pronunciation and reading complexity (how difficult the word is to pronounce or read), document frequency (number of documents it appeared one or more times in), global frequency (number of times it appeared in the entire collection), and part of speech.

Readability Indexes

Readability indexes are another common form of vocabulary analysis and measure the ease with which a native speaker of average intelligence can read and understand a passage of text. Scores are usually represented as the expected age or grade level of formal education that would be required of a reader to understand the material. An article in a small-town newspaper might have a score of eight, suggesting a reader should have at least an eighth-grade education, while a formal legal contract might have a score of sixteen or higher, reflecting its orientation toward those with advanced degrees. The goal of such indexes is to offer a quick estimate of how broad the readership audience of a particular document may be.

Readability algorithms function by measuring selected characteristics of textual material, including the number of characters and words, average word length, average syllables per word, words per sentence, and so on. The Gunning-Fog Index (also known simply as the Fog Index) is an example of a simple readability measure. It computes the average number of words per sentence, adds in the percentage of complex words (those that have three or more syllables), and multiplies the result by 0.4. The result is a score estimating the minimum grade level required of a reader to understand the material. The Flesch-Kincaid Index is similar, measuring the average number of syllables per word and average words per sentence. The Coleman-Liau Index does not measure syllables, relying instead on the average number of letters and sentences per 100 words. Numerous other algorithms exist, each with their own strengths and weaknesses.
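
As a rough sketch, the Gunning-Fog calculation described above can be implemented as follows; the syllable counter is a crude vowel-group heuristic rather than a pronunciation dictionary, so its complex-word counts are only approximate.

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Gunning-Fog Index: 0.4 * (average words per sentence + percent complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z]+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))

print(round(gunning_fog("The cat sat on the mat. It was a comfortable, luxurious cushion."), 1))
```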

Readability scores can play an important role in content analysis works by providing insights into the intended audience for a passage of text. Was a speech written for the general public, or was it tailored for the intellectual elite? Are there noticeable stylistic patterns of a particular author (perhaps a tendency toward the use of high-syllable words) that can be used alongside additional indicators to identify other works by that author?

Normative Comparison

Of particular interest in some research areas is comparing documents not against each other, but rather to some standard of normal text. By measuring the deviation of the target document from others of its class, isolated documents may be examined, and specific authors or sources may be compared to the entirety of their peers. For example, a vocabulary list might identify that a specific author uses a large number of adverbs in their writing, but this information is of limited use unless it can be compared against other contemporary authors to determine if high adverb use was the norm among writers of the time.

Of course, the representation of normal will vary by the type of text being analyzed: a sample compiled from news media reports will yield little insight into the departures of classical fiction from colloquial English use. In general, normative texts are created by selecting example documents from a wide range of works in a particular document class (such as news reports, or English fiction) and generating average statistics for that collection. The well-known DICTION toolkit offers a set of normative profiles compiled by analyzing over 20,000 documents, ranging from “public speeches to poetry, from newspaper editorials to music lyrics, from business reports and scientific documents to television scripts and informal telephone conversations.” (“DICTION,” n.d.) These subcollections were used to create profiles under such headings as Corporate Public Relations, Financial News, Legal Documents, and even Magazine Advertising, allowing normative comparison of new documents against those profiles (“DICTION,” n.d.). More recently, Google Books released several normative datasets offering vocabulary histograms by year from a sample of 5.2 million books: nearly four percent of all books ever published (http://ngrams.googlelabs.com/datasets).

Non-word Analysis

Some types of vocabulary cues come not from words, but from punctuation and punctuation-based expressions that shape the emotional interpretation of a text. In email and instant messaging, writing in all-capital letters is often associated with shouting, while exclamation points and question marks associate their preceding sentences with specific emotional reactions. Indeed, punctuation can provide many insights into a document's emotional content and help characterize a speaker's latent reaction. Emoticons, such as so-called smiley faces, represented as :-), impart further emotional cues. Many vocabulary analysis programs, however, remove punctuation before analysis or do not recognize emoticons as linguistic objects, so preprocessing is often required to replace emoticons or other punctuation cues with textual proxies. For example, all smiley face emoticons in the text could be replaced with the made-up word emoticon_smileyface, which will then be counted like any other word by vocabulary analysis tools. In addition, some works may quote text from other authors, and the ability to isolate such content from an author's own words may be important for author characterization.
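
A minimal preprocessing sketch along these lines might map a handful of emoticons to proxy words before the text reaches a word-counting tool; the emoticon list and proxy names below are illustrative only.

```python
# Illustrative mapping from emoticons to textual proxies so that downstream
# word-counting tools treat them like ordinary vocabulary items.
EMOTICON_PROXIES = {
    ":-)": "emoticon_smileyface",
    ":)": "emoticon_smileyface",
    ":-(": "emoticon_frownyface",
    ":(": "emoticon_frownyface",
}

def replace_emoticons(text):
    """Replace each known emoticon with its made-up proxy word."""
    for emoticon, proxy in EMOTICON_PROXIES.items():
        text = text.replace(emoticon, " " + proxy + " ")
    return text

print(replace_emoticons("Great news :-) see you soon :("))
```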

Colloquialisms: Abbreviations and Slang

Some types of discourse, especially informal conversation such as instant messaging and social media, tend to make heavy use of colloquialisms like abbreviations and slang. Vocabulary analyses that do not rely on predefined dictionaries will have no problem examining such content. Word histograms simply count the unique words in a document without considering whether they exist in any standard dictionary, and so will automatically work with non-traditional language content. Readability indexes measure syllables and words per sentence and so will similarly have little problem with non-dictionary terms. Other techniques, such as part-of-speech tagging, topic extraction, and sentiment analysis, will encounter challenges, however, since they must look up each word in a predefined word database. For example, most sentiment dictionaries include the word “happy” under their list of positive terms, but few have an entry for “lol” or a smiley emoticon. To integrate colloquialisms into a tone dictionary, one approach is to measure their correlation with known tone words across the entire collection. If “lol” correlates strongly with positive terms such as “happy,” that suggests it should be added alongside the other positive words in the dictionary.
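
One simple way to estimate such a correlation, sketched below with a handful of invented documents, is to build per-document presence vectors for the colloquialism and a known tone word and compute the Pearson correlation between them; a real study would use the full collection and a complete tone dictionary.

```python
from math import sqrt

def presence(word, documents):
    """1/0 vector indicating whether the word appears in each document."""
    return [1 if word in doc.lower().split() else 0 for doc in documents]

def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

docs = [
    "so happy today lol",
    "that was hilarious lol happy days",
    "terrible news today",
    "really sad about this",
]
print(pearson(presence("lol", docs), presence("happy", docs)))
```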

A common measure to identify the density of colloquial language is to compare the vocabulary list of a collection against a dictionary for the target language. Words not found in the dictionary can be highlighted and ranked by their frequency. Computing the relative frequency of non-dictionary words can be a useful start to identify those appearing commonly enough to warrant further analysis. These often include personal names, misspellings, and colloquialisms that have not yet made it into standard use.
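
A sketch of this dictionary comparison might look like the following; the tiny word list here is a stand-in, and in practice the dictionary would be loaded from a full word list for the target language.

```python
import re
from collections import Counter

# Tiny stand-in for a full dictionary of the target language; in practice this
# would be loaded from a word list such as /usr/share/dict/words.
DICTIONARY = {"that", "was", "so", "funny", "i", "cried", "my", "friend"}

def non_dictionary_words(text, dictionary):
    """Rank words not found in the dictionary by how often they occur."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return Counter(w for w in words if w not in dictionary).most_common()

print(non_dictionary_words("That was so funny lol, omg I cried lol", DICTIONARY))
```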

Restricting the Analytical Window

Most applications of vocabulary analysis operate at the document level, yet some research questions may benefit from exploring content at a finer resolution, examining only content surrounding specific keywords or from specific document locations. For example, qualitative evidence may suggest two authors differ markedly in their use of descriptive language, suggesting a vocabulary analysis limited to adjectives might offer greater characterization power than one including all word types. A comparison of views toward two political candidates might limit itself to text appearing beside references to the candidates’ names (known as KeyWord In Context indexing) for greater precision. Certain portions of documents may yield richer contextual information, such as the lead paragraph in a news article or the introduction and conclusion sections of an academic paper.
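
A minimal KeyWord In Context sketch, assuming plain text and a fixed window of words on either side of each keyword occurrence, might look like this:

```python
import re

def keyword_in_context(text, keyword, window=5):
    """Collect the words appearing within `window` words of each keyword occurrence."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    hits = []
    for i, w in enumerate(words):
        if w == keyword.lower():
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            hits.append((left, right))
    return hits

speech = "Senator Smith praised the budget while Senator Jones attacked the plan"
for left, right in keyword_in_context(speech, "smith", window=3):
    print(left, "| smith |", right)
```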

Content may also contain embedded metadata identifying the speaker of each passage, such as theater plays, interviews, and other multi-speaker material. This information can be used to restrict the analytical window to just a particular speaker or set of speakers. Document-level metadata may also be used to create sub-collections that restrict documents based on certain metadata fields. In a collection with authors of different genders and many different age ranges, it may be useful to isolate only the documents written by females aged 20–35, to explore patterns in their vocabulary and contrast those with males of the same age.

Vocabulary Comparison and Evolution/Chronemics

A common task in content analysis is to contrast the vocabularies of two document collections or to track vocabulary use over time in order to study the evolution of language use or topical differences between speakers or sources. In the case of speaker or collection comparison, groups of documents from two authors or sources are compared against each other, while for evolutionary comparison, documents are grouped together by time period and compared. In some cases, analysis may be performed on a single speaker or collection over time, while in others it may be useful to excerpt a month of documents from several authors or collections (such as news media in different countries) and compare them. While there are many types of comparisons that can be performed, the list below presents several of the most common:

•   Word frequency Word frequencies change over time as some topics increase in importance, only to be replaced later as other topics gain prominence. Even if the topics discussed remain similar, word choices themselves can undergo considerable change to reflect different audiences and changing language styles. Ranking words by their frequency of occurrence and comparing across time spans allows temporal comparison of frequency changes, illustrating sustained downward or upward use trends of specific words or topics. Comparisons between authors or sources can expose notable usage signatures. Changes in frequency of use can be measured for any subclass of words, including part of speech (to measure an increase or decrease in the use of adjectives, for instance), quoted material (is the speaker increasing or decreasing the use of quoted material?), or even certain types of punctuation. A word's changing use over time can be represented as a graph, showing its frequency of use by time period, and words can be grouped together by similar trajectories of use. For example, terms that gradually enter the popular vocabulary and then slowly lose popularity can be contrasted against terms which become popular overnight and then just as quickly fade from usage.

•   Document frequency While word frequency measures the relative frequency of a word's occurrence across all text in a collection, document frequency measures the related metric of how many of those documents the word appears in (a sketch contrasting the two measures follows this list). This is especially important when the lengths of documents in a collection are not uniform. For example, in a collection of 20 documents, two may be 100 pages long, while the others are just two pages each. If a particular word appears with high frequency in the two long documents, its word frequency may place it at the top of the collection, even though it appeared in just 10 percent of the documents. A document frequency measure, on the other hand, ranks words by the percentage of documents in the collection they appeared in, regardless of the number of times they appeared in each of those documents. It therefore can offer a better representation of which words hold the greatest weight across the collection as a whole.

•   Unique versus shared words In some situations it can be useful to know if a word appears in multiple document collections or is used in only one. This can suggest a topic or word choice that could be a signature of that particular author, source, or time period. In particular, hapax legomena, or words that appear only once in a work or collection, can provide useful comparative points to contrast collections.

•   Word births In evolutionary comparison, words which appear for the first time in a given time period can suggest emerging topics or shifting language patterns by an author. If a political leader continually speaks negatively of a neighboring country, but suddenly adopts more conciliatory wording, it could signal an impending shift in that diplomatic relationship.

•   Word deaths Words which occur with some degree of regularity for a time, only to taper off and disappear later on can signal topics which no longer hold prominence in the collection, or words which may have found newer replacements. For example, in the popular news media, armed nongovernmental groups have been alternately called guerrillas, insurgents, or terrorists over time. Word births and deaths are especially important concepts for keyword searching of collections. A search for “guerrillas” in news data will likely show a decline in its usage over time, which is correct for that specific word. However, the concept it represents has actually likely increased in usage, but has simply been replaced with a new word like “terrorists.”

•   Context changes Word correlations explore which words tend to co-occur with a given word, yielding insights into related topics within the collection. Exploring differences in correlations between authors/sources, or across time periods, can yield significant insights into the changing associations of a word/topic.

•   Spread Vocabulary spread is defined as the total number of words divided by the number of unique words. For example, the sentence “the dog and I went to the park and played in the field” has 13 words, but “the” appears three times and “and” appears twice, meaning there are only ten unique words, yielding a spread of 1.3. The smaller the spread, the fewer words are reused, meaning the vocabulary load, or the number of words that a reader must know to understand the passage, becomes greater.

•   Readability Rather than comparing words individually, measuring document-level comparative readability scores may act as a surrogate for broader environmental indicators, like average number of syllables per word, average word length, average words per sentence, etc. While the other measures in this list only compare the use of words, readability measures actually describe how difficult those words and their grammatical context are to read.
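
The sketch below, using a few invented documents, contrasts word frequency with document frequency and computes vocabulary spread as defined above.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-zA-Z]+", text.lower())

def word_frequency(documents):
    """Total occurrences of each word across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def document_frequency(documents):
    """Number of documents in which each word appears at least once."""
    counts = Counter()
    for doc in documents:
        counts.update(set(tokenize(doc)))
    return counts

def spread(text):
    """Total words divided by unique words (1.0 means no word is reused)."""
    words = tokenize(text)
    return len(words) / len(set(words))

docs = [
    "the dog and I went to the park and played in the field",
    "the election results surprised the analysts",
    "budget budget cuts dominate the budget debate",
]
print(word_frequency(docs).most_common(3))       # words ranked by total occurrences
print(document_frequency(docs).most_common(3))   # words ranked by how many documents contain them
print(round(spread(docs[0]), 2))                 # 1.3 for the example sentence above
```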

Each of these measures is usually applied to the vocabulary portfolio as a whole, but many of the measures can also be used to explore the evolving use of a particular word over time. For example, a histogram showing the use frequency of a word over a period of months or years can show the ebbs and flows of interest in that topic. Over longer periods of time, shifts in the prominence of a particular term (such as terrorist over guerrilla) can signal changing attitudes toward a topic. Finally, changing correlations can offer insights into the immediate context of a topic and its changes through time.

The exploration of temporal communicative patterns and vocabulary evolution is known as chronemics and represents a broad field of communicative study.

Advanced Topics

Syllables, Rhyming, and “Sounds Like”

Traditional content analysis focuses on the written use of language, but in some cases the spoken environment may yield important contextual insights. In literary works, words may be selected to link certain lines in rhyme, as in poetry. Examining the syllabic structure of a document can reveal whether spoken patterns, such as meter, underlie its construction. Rhyming and pronunciation dictionaries, such as the public domain Moby Pronunciator database that contains phonetic transcriptions of more than 175,000 English words (Ward, 1997), can assist with identifying these kinds of auditory patterns.
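
As a sketch of how a pronunciation dictionary can expose such auditory patterns, the example below uses the CMU Pronouncing Dictionary bundled with NLTK (a stand-in for the Moby database mentioned above; it requires downloading the cmudict corpus) and treats two words as rhyming when their phonemes match from the last stressed vowel onward.

```python
# Requires: pip install nltk, then nltk.download('cmudict')
from nltk.corpus import cmudict

PRONUNCIATIONS = cmudict.dict()

def rhyme_part(word):
    """Phonemes from the last stressed vowel to the end (first listed pronunciation)."""
    entries = PRONUNCIATIONS.get(word.lower())
    if not entries:
        return None
    phones = entries[0]
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":     # vowel carrying primary or secondary stress
            return tuple(phones[i:])
    return tuple(phones)

def rhymes(word_a, word_b):
    a, b = rhyme_part(word_a), rhyme_part(word_b)
    return a is not None and a == b

print(rhymes("night", "delight"))    # both end in the AY1 T sound
print(rhymes("night", "morning"))
```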

Word pronunciation can also normalize typographical errors and alternative spellings. Algorithms such as Soundex analyze the vowel and/or consonant structure of a word, converting it to a coded form representing the way it is spoken, rather than the character combination of its written spelling. This can be especially useful when working with names, which are often misspelled or transliterated differently. For example, a 2004 Turkish editorial in the aftermath of the Abu Ghurayb prison scandal in Iraq noted there were at least four different spellings of the name in wide use in the Turkish press at the time (Eksi, 2004). The alternative spellings gurab, ghuraib, ghurayb, and ghurab would all result in the same Soundex code of G610, allowing a search to uncover references to the city regardless of spelling.
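
A compact Soundex sketch (simplified, and skipping a few edge cases of the official algorithm) illustrates how these alternative spellings collapse to the same code:

```python
# Consonant classes used by Soundex; vowels, h, w, and y receive no code.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6",
}

def soundex(word):
    """Simplified Soundex: first letter plus up to three consonant-class digits."""
    word = word.lower()
    digits, prev = [], SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":            # h and w do not break a run of identical codes
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

for spelling in ["gurab", "ghuraib", "ghurayb", "ghurab"]:
    print(spelling, soundex(spelling))
```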

Gender and Language

Language use can be strongly influenced by gender, with men emphasizing concrete references to objects, while women prioritize descriptive language connecting objects. Using vocabulary analysis techniques, Koppel, Argamon, and Shimoni (2002) were able to discern significant differences in the use of certain word classes by male versus female authors. Men emphasize noun phrases, including numbers, modifiers, and determiners, while women rely heavily on pronouns, selected prepositions, and negation terms. These gender differences are so pronounced that the resulting algorithm developed by the researchers is able to estimate the gender of a formal text's author with nearly 80 percent accuracy and judge whether it is a work of fiction or non-fiction with 98 percent accuracy.

Authorship Attribution

Human language offers a nearly infinite number of ways to express the same information, and much like a fingerprint, every author has distinct nuances in the way he or she uses language. Vocabulary analysis can be paired with the historical documentary record to determine the likely writer of documents with uncertain provenance. In an authorship attribution study, the writing style of a document whose author is unknown is compared against writing samples from a set of likely candidates. Small nuances, such as higher densities of particular word endings, tenses, or grammatical constructions, will often suggest greater similarity with one author than the others. In 1996, elevated use of the word endings –y and –ish, coupled with unusual adjective-derived adverbs, was used to unmask Joe Klein as the author of Primary Colors (Liptak, 2000).
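
As a toy illustration of the general idea (not the actual Primary Colors analysis), the sketch below compares the density of a few suffixes in an unknown passage against invented writing samples from two hypothetical candidates; real attribution studies rely on far larger samples and much richer feature sets.

```python
import re

SUFFIXES = ["y", "ish", "ly", "ing", "ed"]   # illustrative stylistic markers

def suffix_profile(text):
    """Fraction of words ending in each suffix of interest."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return [sum(w.endswith(s) for w in words) / len(words) for s in SUFFIXES]

def profile_distance(p, q):
    """Euclidean distance between two stylistic profiles (smaller = more similar)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

candidates = {
    "author_a": "He walked briskly, talking endlessly and cheerfully about nothing.",
    "author_b": "The report was filed. The facts were checked. The case was closed.",
}
unknown = "She moved quickly, speaking warmly and breezily of distant, dreamy things."

target = suffix_profile(unknown)
for name, sample in candidates.items():
    print(name, round(profile_distance(suffix_profile(sample), target), 3))
```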

One downside of authorship attribution is that it requires very large sample archives of text for each known author in order to generate an accurate picture of that writer's language use over time. If there is not enough text, the resulting models can lead to inaccurate conclusions, as in a seven-year battle over whether a new Shakespearean poem had been discovered (Niederkorn, 2002).

Word Morphology, Stemming, and Lemmatization

Word morphology plays an important role in content analysis, isolating words from their sentential conjugations. In morphology, a lexeme refers to a root word, or lemma, and its conjugational forms, such as modification by tense. For example, the lemma walk is the root word of the lexeme that includes walk, walks, walking, and walked. When machines compare words, they do so on a character-by-character basis, such that the words walk and walks are treated as independent words with no relationship to each other. Since most uses of words in human language involve the conjugation of the lemma to match tense, number, gender, or other attributes, such exact matches lead automated systems to have very poor comparative ability. To address this, lemmatization and stemming refer to the conversion of a word back into its original lemma, such as dropping the s from walks to transform it back to walk.

Lemmatization uses the complete set of transformative rules in the target language, handling both regular words, like walk, and irregular words, like ran versus run, as well as taking sentence context into account as necessary. The considerable expense of maintaining such detailed lists of every possible irregular word and transformation rules for all possible conjugations makes this process impractical for the high-volume conversion required of text processing algorithms. Instead, a process known as stemming approximates lemmatization by using a set of simple rules that achieve fairly high accuracy. A common technique is to simply drop certain suffixes, such as ed, s, or ing, converting walked to walk. The resulting stems may or may not be actual words in the target language, but are usable for machine processing tasks. For example, many stemming algorithms convert the word happy to happi using a rule that maps y to i. This nonsense word is still decipherable by a human reader, but, most importantly, is usable by the machine to match against other normalized conjugated forms. One of the more venerable stemmers is the Porter stemmer for English (Porter, 1980), which is still widely used in text processing systems.
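
For example, the Porter stemmer is available through the NLTK library (assuming it is installed), and a few lines are enough to show both its normalizing behavior and its limits with irregular forms:

```python
# Requires: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["walk", "walks", "walking", "walked", "happy", "ran"]:
    # Regular conjugations collapse to a shared stem (e.g., "happy" becomes "happi"),
    # while an irregular form like "ran" is left untouched rather than mapped to "run".
    print(word, "->", stemmer.stem(word))
```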

While it is most often used to normalize conjugative effects on vocabulary, word morphology also plays a significant role in vocabulary analysis, especially for historical texts. Document collections or authors whose works span substantial periods of time are prime candidates for morphological analysis, exposing underlying trends in language use. The morphology of the famed author Chaucer has been extensively studied in the literature in this way. One analysis found Chaucer used the word love in five different part of speech contexts and in seven different spellings. A frequency distribution graph, however, showed he preferred the more modern spellings of love and loved to the Middle English variants of loveden and yloved, classing him as progressive in his language use (“WordHoard,” n.d.).

Chapter in Summary

Vocabulary analysis is a collection of foundational methods that examine documents purely as a collection of words and look at patterns in how words are used, rather than exploring the concepts those words represent. Word histograms, readability indexes, and normative collections are basic building blocks for profiling a collection's language use. Collections of documents can be directly compared based on their language use, or grouped by date to measure change over time. Advanced techniques, like pronunciation dictionaries and morphology tools, can help normalize word selection or add additional dimensions to a vocabulary analysis. Even though vocabulary analysis uses very simple methods, it can be used in a variety of advanced analyses, such as authorship attribution and author gender detection.
