2

OBTAINING AND PREPARING DATA

The first step in any content analysis project is to obtain the data to be analyzed and perform any necessary preprocessing to ready it for analysis. Data collection is often regarded as the easiest step of the entire analytical process, yet it is actually one of the most complex, with the quality of the collection and preparation processes impacting every other stage of the project. Issues such as ensuring the data sample is meaningful, along with data preparation tasks like cleaning and filtering, are critical concerns that are often overlooked. This chapter provides a basic introduction to the issues surrounding data collection and preparation, with a few words on advanced topics, like the integration of multimedia sources and random sampling.

Collecting Data from Digital Text Repositories

One of the most popular sources of material for content analysis is the online searchable digital text repository. However, while these collections have substantial advantages over their print brethren, one must be careful to understand their limitations. Of foremost danger is the all-too-common belief that a keyword query on a searchable database will return every single matching document. As discussed in this section, numerous complexities ranging from OCR errors in historical collections to content blackouts stemming from licensing restrictions may partially or entirely exclude works from the online edition of a print publication. Humans are also able to rely on a tremendous array of background knowledge when searching for relevant documents, giving them more flexibility in locating documents based on what they mean. Computers, on the other hand, rely on precise keyword matches and automated searches must therefore be carefully designed around their more limited capabilities.

Are the Data Meaningful?

One of the first issues to consider is whether the data available from a given source will be meaningful in the context of the research questions asked of it. Several common pitfalls are discussed below:

•   Result counts and surface statistics Rather than examining the full text of each document, some studies explore their domain through surface statistics like result counts from search queries. A report examining the attention given to immigration in the news media over time might measure the number of returns from a keyword query on immigration each month. Such studies must be extremely careful that the result counts returned by the search service they are relying on are exact counts and not predictive estimates. Premium data services, like ProQuest Historical Newspapers™ and Lexis Academic™, tend to use search algorithms which return the exact number of matching results. However, when relying on open web tools, such as Google, Yahoo, or other web search engines, one must be more careful about relying on result counts. Google, for example, uses a variety of predictive estimations to increase the performance of search queries across such a large index. Even though many result counts will be displayed as exact numbers rather than rounded values, these numbers are estimations from general index statistics, rather than precise counts of the number of matching web pages. The so-called Google Horizon, which limits users to viewing the first 1,000 search results, prevents manual counting of results to determine the actual number of matching pages. Since its result counts are predictions based on the global index, they may not necessarily be statistically meaningful as measures of the overall popularity of any given term over the Web at large except at a very gross scale.

•   Source stability Large aggregators like LexisNexis continually add and remove sources from their database over time and one must be careful when using default settings like “Search All Sources” that results are not driven by the set of available sources instead of actual publication trends. A keyword query run two months apart might search 8,000 sources the first time and 8,100 sources the second, showing an increase in hits that may reflect genuinely greater use of the term or simply the larger number of publications being searched. In addition, even though a source has been available in print form for a long time, its availability in a particular aggregator's database may be more recent. LexisNexis has archives of some sources back to 1979, while other sources that were in publication in 1979 were not added to the service until the 1990s or 2000s. A typical historical query for a keyword using the “Search All Sources” option on LexisNexis will often show an exponential ramp-up of uses of that term from 1979 to present. It is difficult to determine whether such an increase is due to greater use of the term, or merely a greater number of sources in the LexisNexis database each year to be searched. One simple safeguard is to normalize hit counts by the number of sources searched in each period (see the sketch following this list).

•   Using datasets in unintended ways Another source of possible error is the use of a dataset in a way it was not originally designed for. Some datasets may have nuances in the way they are compiled that may skew certain uses of that data. This is an important concept and is detailed in its own section below.

•   Completeness/accuracy of the data Is the online archive an exact duplicate of the print edition, or are there sections of content that have been left out due to licensing or other restrictions? Has the data been willfully manipulated to enhance/mask certain content? These issues are surprisingly common even in some of the most popular data sources and are addressed in more detail later in this chapter.

•   Completeness of the query Assuming the data archive is complete, will the query being used return all relevant documents? This too is a complicated topic and is given its own section later in this chapter.
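
As a simple guard against the source-count effects described above, raw hit counts can be normalized by the number of sources searched in each period. The Python sketch below illustrates the idea with purely hypothetical figures.

```python
# A minimal sketch (with purely hypothetical figures) of normalizing keyword
# hit counts by the number of sources searched, so that growth in an
# aggregator's source list is not mistaken for growth in actual coverage.

monthly_results = {
    "2009-01": {"hits": 412, "sources_searched": 8000},
    "2009-03": {"hits": 431, "sources_searched": 8100},
}

for month, stats in sorted(monthly_results.items()):
    hits_per_source = stats["hits"] / stats["sources_searched"]
    print(f"{month}: {stats['hits']} hits across "
          f"{stats['sources_searched']} sources = "
          f"{hits_per_source:.4f} hits per source")
```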

Using Data in Unintended Ways

At first glance, it may seem counterintuitive that data can have intended and unintended uses. After all, news coverage is written to inform the public about current events, but is used years later to answer a wide range of research questions across many different disciplines. Yet, some datasets can have complex provenances or nuances in the way they are collected and compiled that can complicate their use with some research questions. For example, some studies have attempted to compile global bibliographies of books published on specific subjects by searching the WorldCat union library catalog (http://www.worldcat.org/). Unlike national or topical bibliographies, which attempt to inventory a country's entire printed output by surveying publishers and authors, union library catalogs merely reflect the partial holdings of subscribing libraries. Libraries upload their master electronic catalogs of the books they hold to WorldCat, which in turn helps them facilitate interlibrary loans by locating other libraries holding a given work. At first glance, it might appear that a central library catalog inventorying the holdings of major libraries would be an ideal database from which to compile topically oriented lists of books. The problem comes when one considers the source data for WorldCat and just what precisely a typical library's catalog contains.

Libraries based in the United States have been the largest contributors to WorldCat, and just three percent of WorldCat's 1.6 billion holdings are from national libraries outside the US (“National Libraries,” n.d.). The majority of WorldCat's records on foreign publications are therefore dictated by the collection policies of US libraries. Subject headings are used as-is from the submitting institutions, meaning that US-based libraries will use English-language Library of Congress headings, while foreign libraries may elect to use their own national catalog headings in their own languages. Searching by English Library of Congress subject headings will therefore miss a large portion of foreign library holdings. Complicating matters, US libraries will often catalog foreign works based on rough translations of their titles, and without a language specialist trained in the source language, a foreign work will end up with multiple duplicate entries from various contributing libraries. Duplication as a whole is a significant problem in WorldCat, reflecting the unfortunate lack of rigid quality control in the online catalogs of the libraries that feed it.

Most importantly, however, WorldCat reflects only the entries in each institution's online catalog, which at a typical library is far from an exhaustive list of its holdings. Larger libraries do not have the staff or funding to comb their holdings and proactively inventory every work on their shelves, and thus older books are entered into their electronic catalogs only as they are checked out or removed to remote storage. Hence, online catalogs tend to reflect only recent works and the most popular historical ones, skewing WorldCat's collection heavily toward works published in the last 30 years and held by libraries in the United States. From the standpoint of WorldCat's mission of facilitating interlibrary loan, this is not a significant concern, since the most frequently requested works are well represented in their database. However, a researcher attempting to compile a list of every book ever published globally about clock making will encounter difficulty in drawing concrete conclusions from these results. This is not to say that such studies cannot contribute considerably to the understanding of publication trends in a particular subject area, but considerable care must be given to preparing and cleaning the data before analysis.

Analytical Resolution

A content analysis project must begin by defining its analytical resolution: will it need to examine collections of documents, individual documents, or individual words? The intended analytical resolution is a critical factor in determining the resolution at which content must be collected for analysis. Analysis methods operate on three fundamental levels: collections, documents, and entities.

•   Comparing and contrasting collections One of the most common applications of content analysis procedures is in the study of large document collections, such as European newspaper coverage of World War II. The unit of analysis becomes the collection as a whole, with its member documents serving as the sample points. For example, if one were exploring sentiment in the global reaction to a particular event (such as a large corporate scandal), the collection of all news coverage on that event would be the source corpus, with a sentiment score generated for each document and used to compare documents within that collection. Documents are compared not against each other as individuals, but as discrete representatives of the overall collection. Collections may also be compared against each other, such as dividing the New York Times newspaper into one-year timespans and comparing the average number of unique words and new words introduced in each period to explore vocabulary evolution in that periodical (see the sketch following this list). Through the grouping of documents into collections, the individual nuances of any single document become normalized away, with broad aggregate data providing a more accurate picture of overall trends.

•   Document-level examination While collections offer broader insights into groupings of documents, the focus in some projects is on the comparison of individual documents. Authorship attribution compares a document of unknown provenance against others whose authorship is known, to find the author with the most similar writing style. Clustering groups documents together based on their topic or language similarity. Document-level examination offers the ability to exploit variances in individual documents, rather than normalizing across an author or source's entire corpus of work, as is done in collection-level analysis.

•   Entity-level examination Whether part of a corpus, or examined individually, documents may be thought of simply as containers for semantic objects or entities like names, dates, locations, or emotionally charged words. A researcher might want to know about the emotional state, vocabulary, and other characteristics of a famous speech or declaration: attributes which are descriptive rather than comparative in this case. The full text of a lengthy play might be replaced with a simple list of characters mentioned by scene and used to examine patterns in their co-appearances.
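
To make the collection-level vocabulary comparison described above concrete, the following Python sketch groups a few illustrative documents into one-year collections and counts the unique and newly introduced words in each year; a real analysis would use the full text of each issue and a more careful tokenizer.

```python
# A minimal sketch of collection-level analysis: documents are grouped into
# one-year collections and vocabulary size is compared across years. The
# sample documents and the crude tokenizer are illustrative only.
import re
from collections import defaultdict

documents = [
    ("1910", "The committee met to discuss the harbor improvements."),
    ("1910", "Harbor traffic increased sharply during the spring."),
    ("1911", "Aviators demonstrated the new monoplane to large crowds."),
]

vocab_by_year = defaultdict(set)
for year, text in documents:
    vocab_by_year[year].update(re.findall(r"[a-z]+", text.lower()))

seen_so_far = set()
for year in sorted(vocab_by_year):
    vocab = vocab_by_year[year]
    new_words = vocab - seen_so_far          # words not seen in earlier years
    print(f"{year}: {len(vocab)} unique words, {len(new_words)} newly introduced")
    seen_so_far |= vocab
```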

Types of Data Sources

There are many different classes of databases available for content analysis, far too numerous to list. In general, however, sources tend to fall into the following broad categories:

•   Digitized historical archives Products such as ProQuest Historical Newspapers typify this category. These databases provide an invaluable mechanism for exploring past media or tracking longitudinal trends, but the text tends to be generated through automated optical character recognition (OCR) and therefore usually contains significant errors, such as misspellings and typographical errors, which must be accounted for in the data preparation stage. These errors can also impact search accuracy, causing some relevant works not to be returned for keyword queries. For example, the letter l, the number 1, and an exclamation point can appear similar enough to each other in certain typefaces that OCR software might record the word “happily” as “happi1y.” A keyword search for documents containing “happily” would not return these entries due to the computer's literal matching criteria, while a human conducting a manual search would usually be able to recognize the intended word.

•   Digital aggregators There are numerous database vendors that specialize in licensing content from multiple sources and aggregating it together into comprehensive content files. Perhaps the two best-known vendors are ProQuest Dialog, producer of many market-specific files, and LexisNexis, whose News Search aggregates numerous news media from throughout the world. These large databases provide a considerable benefit in that users need only use a single interface to search across countless sources, rather than having to search each news outlet or information provider individually.

•   Specialty databases Beyond the major vendors with their broad product lineups are many smaller organizations with very narrow databases that are critically useful within certain fields of research. For example, the Vanderbilt Television News Archive (http://tvnews.vanderbilt.edu/) holds the most complete archive of historical television news in the United States. Over 30,000 evening news broadcasts and 9,000 hours of special event news coverage are held in its collection, dating back to 1968. The Internet Archive (http://www.archive.org/) is another specialty source, which has archived the World Wide Web since 1996, storing some 85 billion pages as of 2007. The Archive continually crawls the Internet at large, saving a copy of every page it finds and archiving those copies over time so that one can enter a URL and see every update to that page since the birth of the web. As standalone entities, these databases must be searched and integrated individually into a content analysis project, and each may have specific nuances and limitations to its search results.

•   Citation indexes Some kinds of content contain links that connect documents to each other. For example, web pages use hyperlinks to connect to other pages, while academic papers feature references and citations that link them to other papers. Examining the structure of these connections can yield significant information about the importance of a given document. Google's PageRank algorithm uses a measure of how many pages link to a page as one indicator of how highly regarded it is on the Web at large (a simplified sketch of this idea follows this list). Academic prestige of a journal paper is similarly measured by how many subsequent papers cite it.

•   Access indexes In addition to the textual content of web sites and other archives, a wealth of secondary data is collected regarding the users of that data. Every time a web page is requested from an Internet web server, a log entry records the unique identifier of the computer requesting the page, the computer's operating system and web browser version, and possibly the page the user visited immediately before requesting the current page. These logs weave a rich tapestry of context describing visitors, and can be used to understand which content is the most popular. Many search engines monitor which search results are clicked the most often for a given query and rank those results higher over time.
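
To illustrate the link-analysis idea behind citation indexes, the sketch below implements a simplified PageRank-style power iteration over a tiny hypothetical link graph; it conveys the principle only and is not Google's actual implementation.

```python
# A simplified sketch of link analysis: pages (or papers) that receive links
# from many well-linked pages score higher. The link graph is hypothetical.

links = {               # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                                  # power iteration
    new_rank = {}
    for page in pages:
        inbound = sum(rank[src] / len(links[src])
                      for src in pages if page in links[src])
        new_rank[page] = (1 - damping) / len(pages) + damping * inbound
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```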

Finding Sources

With so many possible sources available, how is one to make sense of all of the available databases? To address this issue of information overload, there are many fee-based services that publish master directories of sources for content analysis projects. For example, the Fulltext Sources Online directory (http://www.fso-online.com/) compiles information from 30 aggregator services to list sources for over 46,000 newspapers, newsletters, periodicals, newswires, and TV/radio transcripts. Such directories allow quick location of a source in electronic format, ready for machine content analysis. While one can find the same information by searching through the source listings of individual vendors, having it all in one place makes it easy to tell which vendors cover a specific source.

Specialty directories offer assistance in acquiring niche data. The Gale Directory of Databases (“Gale,” n.d.) lists nearly 20,000 datasets available in electronic format, ranging from addresses of foreign-owned firms operating in Hawaii, to a compilation of financial regulations in the United Kingdom. The Sourcebook to Public Record Information (“Sourcebook,” n.d.) offers information on over 20,000 agencies that handle requests for various types of public record information in the United States, providing telephone numbers and mailing addresses to begin Freedom of Information Act (FOIA) requests with these organizations. More recently, Data.gov was released as a service of the United States government, offering computer-friendly datasets from across the federal government.

Many more specialty datasets are available directly from the academic institutions or research organizations that produce them. The World Bank offers a selection of its datasets for free on its web site (http://data.worldbank.org/) but charges for a premium CD-ROM that contains more comprehensive data. For highly specific datasets, one of the best options is simply to search online and in the academic literature of that discipline to locate other papers that have published on the topic and identify the datasets they used or created.

Searching Text Collections

Once a dataset has been located, the most common method of searching a textual archive is through keyword queries. However, while keyword queries are simple to construct and use, they suffer from limited ability to precisely define complex or nuanced queries. How exactly does one define a keyword query capable of returning documents on human rights violations? One common approach is to use a long list of search terms combined together using Boolean logic such as “human rights OR torture OR false imprisonment.” However, the more keywords that are added to increase coverage, the more likely unrelated results will also be returned. Chapter 8 introduces automatic text categorization (ATC), which uses machine learning technology to convert a collection of relevant articles into a sophisticated statistical model that the machine can later use to find similar articles. Using ATC, a collection of news articles on human rights violations can be used to train the machine to sift through a large collection of documents and pick out ones that are related in topic. Given sufficient tuning, ATC techniques can yield extremely high accuracy, even across large text archives. Unfortunately, most commercial database vendors do not yet provide interfaces through which ATC techniques can be used to search their collections. In some cases it may be possible to partner with a vendor in order to obtain a dump of a subsection of their database to run ATC searching across, so it is worth discussing your project with database vendors.
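
As a rough illustration of how ATC works in practice, the following sketch uses the scikit-learn library (assumed to be installed) to train a classifier on a handful of labeled example sentences and apply it to a new document; a real project would use a large labeled training set and evaluate accuracy on held-out data, as discussed in Chapter 8.

```python
# A minimal sketch of automatic text categorization: a few labeled examples
# train a classifier that then flags new documents as relevant or not. The
# example texts and labels are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "Detainees reported torture and arbitrary imprisonment by security forces.",
    "The rights group documented abuses against political prisoners.",
    "The central bank raised interest rates by a quarter point.",
    "The team won the championship after a dramatic overtime goal.",
]
train_labels = [1, 1, 0, 0]          # 1 = human rights related, 0 = unrelated

vectorizer = TfidfVectorizer()
classifier = LogisticRegression()
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

new_docs = ["Observers allege prisoners were beaten during interrogation."]
print(classifier.predict(vectorizer.transform(new_docs)))   # e.g. [1]
```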

One final complication that may arise is the presence of OCR errors when searching digitized historical collections. To a machine, a scanned page image is just a grid of colored dots, with optical character recognition (OCR) technology required to translate the image into machine-understandable text. OCR can process vast repositories of imagery with considerable speed, making it possible to produce full-text searchable archives of millions of pages of scanned imagery. Yet, like most automated techniques, it is far from flawless. The poor quality of most historical material produces artifacts, like page creases and smeared or faded ink, that introduce misspellings and other typographical errors into the output. ProQuest used human editors to hand-correct the title and first paragraph of each story in their historical New York Times archive (Helm, 2006), but the sheer volume of material meant even they could not afford to correct the complete text of every article. Journalistic convention suggests that important terms will be mentioned multiple times in an article, including at least one mention in the opening paragraph, and so ProQuest's approach is likely to return most relevant matches. Indeed, searches of the ProQuest New York Times historical archive usually yield very reasonable results, but it is important to understand that these recognition errors will likely prevent at least some relevant articles from being returned, meaning result sets are rarely exhaustive. Great care should therefore be taken with any digitized historical collection to take into consideration the impact of OCR error.
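
One pragmatic workaround when querying OCR'd collections is to expand each search term with plausible character confusions, as in the Python sketch below; the confusion table shown is illustrative and would need to be tuned to the typefaces and OCR engine of the particular collection.

```python
# A minimal sketch of expanding a keyword with plausible OCR confusions
# (e.g. "l" read as "1" or "!") so that digitization errors such as
# "happi1y" are still matched. The confusion table is illustrative only.
from itertools import product

CONFUSIONS = {"l": ["l", "1", "!"], "o": ["o", "0"]}

def ocr_variants(word):
    """Generate spelling variants by substituting commonly confused characters."""
    options = [CONFUSIONS.get(ch, [ch]) for ch in word]
    return {"".join(combo) for combo in product(*options)}

print(sorted(ocr_variants("happily")))
# ['happi!y', 'happi1y', 'happily']
```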

Sources of Incompleteness

Sources of incompleteness abound in external data, and one must be careful to evaluate the possibility that such issues may introduce bias into a study. The following are several common issues with electronic databases:

•   Licensing restrictions Most sources of incompleteness stem from licensing restrictions imposed by the content producer. For example, the print edition of the New York Times includes content from the Associated Press (AP) wire feed. However, AP publication rules require that content from its wire feed published online must be removed within 30 days. Hence, the searchable electronic archive of the New York Times differs substantially from the print edition in its lack of AP material. This introduces a substantial bias into electronic search results: they will not be exhaustive searches of New York Times coverage – only the material that contractual agreements permitted to be included.

•   Different editions Some outlets may have different electronic and print editions of their publication. Newspapers, for example, may limit the content they publish on their sites to force users to purchase their print edition. Conversely, many larger newspapers now have blogs and other online-only content not available to print readers. Due to space constraints in the print edition, the print and electronic editions of a given article may not even match, with the online version having more details than could fit in print. Some media outlets may even intentionally alter the content they provide based on the location of the reader. A well-known Middle Eastern news outlet determines the country of origin of visitors to its web site and offers a different assortment of content to readers from the United States than those from the rest of the world. Outlets that publish in multiple languages may also carefully select the articles they publish in the online language-specific editions of their web site to achieve strategic goals.

•   Real-time information Historically, news took so long to travel from the location where an event took place to its final audience that reporters were required to package the event, its context, and any necessary background information, into a polished report to be sent for publication. Over time, news reporting shifted from such carefully packaged presentations to so-called helicopter journalism, where reporters quickly converge on an event and broadcast whatever information they can gather in near real-time. Increasingly, even major newspapers are heading toward this model, with newspapers like the Los Angeles Times reporting breaking news stories on their web site as they happen and using their print edition to fill in the context of the story the following day (“Los Angeles Times,” 2007). Such reporting is complicated by the often-conflicting accounts that emerge in the immediate aftermath of a major event, with details being constantly revised over the hours and days that follow. Rather than one single story written after the majority of the details have reached equilibrium, reporting based on real-time information will manifest itself as a stream of short reports, each presenting new information and correcting details of previous ones. Processing this information in a content analysis project requires increased technical capability to group these individual reports together and reason within the corresponding environment of uncertainty.

•   Duplication Large news aggregators like LexisNexis simply republish the content they receive from each news source as-is, and any underlying data quality issues are passed along to users. One of the most common problems is duplication, where multiple copies of an article may appear, potentially with different date stamps or identifying information. For example, the Summary of World Broadcasts (SWB) newswire in LexisNexis has duplication rates as high as 65 percent for some time periods (Leetaru, 2010). In SWB's case, foreign news articles are translated into English, and the duplicate articles contain slightly different translations of the same original source article, making automated duplicate detection much harder.

•   Understanding a source While it might seem obvious, it is important to understand the purpose of a news outlet before relying on its information. The Pacific island nation of Kiribati learned this fact the hard way in the 2002 lead-up to the US invasion of Iraq. A satirical news site in New Zealand called Spinner published an article claiming the United States had shifted its attention from invading Iraq to invading Kiribati and that it was sending the US Seventh Fleet to install a new government. The article held particular relevance in Kiribati at the time because of a fierce election campaign that involved an opposition commitment to shutter a satellite tracking system that was believed to monitor US missile tests in the region. After the article came to the attention of the citizens of Kiribati, it caused near panic in the country, with the president's office ordering broadcasters to reassure the public that no invasion was imminent (Dorney, 2002).

Licensing Restrictions and Content Blackouts

While incompleteness is a very real threat to any content analysis project, it is extremely difficult to detect and account for without detailed background information on the source or author. Content blackouts, on the other hand, are more easily discovered, and determining the extent of any coverage holes in an online archive is a crucial first step toward determining how accurately it will be able to answer a particular research question.

Perhaps the two best-known examples of content blackouts involve the Associated Press (AP) newswire service and the New York Times newspaper. Organizations subscribing to the Associated Press are permitted to use AP content in their print publications, but any online republications must expire within 30 days of publication. A newspaper which offers an electronic edition of its printed paper may include all of the stories from its print version immediately upon publication, but after 30 days it must remove any stories which originally came from the Associated Press. This presents a grave obstacle to the use of online archives as exact surrogates of print editions. While one can simply access an AP newswire archive in a service like LexisNexis to search through all AP content, there is no way to know which AP articles appeared in a particular issue of a newspaper and were subsequently removed from its online archive. Searching the LexisNexis archive of the New York Times will only return content produced by the paper itself, and if an issue of the paper included an AP story as its only coverage of a particular event, there will be no record of the paper covering that event. Larger newspapers like the New York Times rely more on in-house content than wire services, and so are less affected by this problem than smaller papers, but it is problematic nonetheless. The printed New York Times index, however, does include these wire stories in its bibliography of articles and is an example of why not all print indexes are obsolete.

A further complication with the New York Times electronic archive stems from a copyright dispute with its freelance authors. In 2001, the United States Supreme Court ruled in Tasini et al. v. The New York Times Company et al. that the New York Times had not secured the necessary rights to distribute articles by its freelance contributors as part of its electronic archive. In response to this ruling, the New York Times was forced to remove all freelance content from its archive, making those articles and images no longer accessible, except by cross-reference from the printed index to the original paper copy.

Measuring Viewership

In the days of broadcast media it was easy to determine listenership/viewership: anyone within receiving radius of a broadcast was a potential audience member. Segregating by viewership made it possible to examine differences in the media message reaching specific geographies. However, in the Internet era, many of the assumptions of traditional one-to-many broadcasting have disappeared. Internet radio makes it possible to access broadcasts from anywhere in the world, with a user in the corner of South Africa listening to a streaming radio station from the center of Germany. Spatial targeting changes the content that is presented based on the geographic location of the visitor, while the increasing reliance on self-provided English translations of foreign news provides an opportunity for the original and translated texts to carry different messages. In the globalized information world of the Internet, it is nearly impossible for the content analyst to ascertain the possible audience of any given news product without cooperation from that outlet. Through the examination of web server logs, it is possible to compile basic statistics on the geographic localities of the viewers of a media outlet, but the necessity of convincing each outlet to hand over its server logs means this technique is not practical at a large scale.

Accuracy and Convenience Samples

The ideal search is one which returns all articles relevant to the query and none that are irrelevant. Few concepts can be represented by a single keyword, and so typical searches involve long strings of search terms cobbled together, with some terms designed to broaden the set of results to include additional concepts, and others designed to narrow it by excluding irrelevant related concepts. The accuracy of the resulting search is measured by two indicators: false positives (irrelevant results returned as relevant) and false negatives (relevant results that were not returned). While false positives may be eliminated through manual review of the results, there is no easy way to determine the false negative rate, as that would require manually reviewing the entire archive for missed articles, defeating the purpose of using a keyword query in the first place. Since the searcher is aware only of the false positive rate, many users try to minimize it through the use of narrow keywords that reduce the number of irrelevant results, but often at a cost of discarding many relevant results.

Given that the entire purpose of electronic database searches is to use computational power to replace laborious manual review of each article, few content analysts spend significant time attempting to create the ideal search criteria. Instead, a typical search will involve an iterative process of adding and removing search terms until the majority of the returned results appear relevant. Most searches of digital repositories are therefore convenience samples, in which the scope of the archive to be analyzed is defined to be the set of results returned from a search query. This is usually much smaller than the entire universe of all relevant articles, which could be determined only through an extensive manual review of the entire archive. Further complications can also arise through the use of secondary subject indexes and summary text in place of full-text searches (Althaus, Edy, and Phalen, 2001).

Random Samples

Computational techniques enable rapid processing of very large collections, but time constraints, processing limitations, or data availability may necessitate using a smaller randomly selected sample of a document collection, rather than processing it in its entirety. Random selection is a well-understood area of statistical study, with a wide range of techniques available to fit any type of data.

Most randomization algorithms allow for the same number to be selected multiple times, a method known as randomization with replacement. Depending on the situation, this may be acceptable, or it may be necessary to use a random selection routine that guarantees an item will not be picked more than once, known as randomization without replacement. In some cases, it may also be necessary to provide an exclusion list of documents that should not be selected. The same effect may be achieved by simply removing those documents from the input list to the random selection system, but exclusion lists can make this process less error-prone than maintaining multiple selection lists.
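
The sketch below shows how both forms of selection, and an exclusion list, might be handled with Python's standard library; the document identifiers are hypothetical.

```python
# A minimal sketch of random selection with and without replacement, plus an
# exclusion list, using Python's standard library. The document identifiers
# are hypothetical.
import random

document_ids = [f"doc{i:04d}" for i in range(1, 10001)]
excluded = {"doc0042", "doc0099"}            # documents that must not be chosen

eligible = [d for d in document_ids if d not in excluded]
with_replacement = random.choices(eligible, k=500)     # items may repeat
without_replacement = random.sample(eligible, k=500)   # each item at most once

print(len(set(with_replacement)), len(set(without_replacement)))
```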

Computer-based random selection is a bit more nuanced than many analysts may suspect. A specific point of caution is the use of random number generators for very large datasets. The default random number generators provided by most programming languages (and hence used by many content analysis packages) are called pseudorandom number generators and are designed to favor performance over entropy (randomness). These algorithms essentially calculate a medium-sized list of numbers that do not fall into a general pattern. For most content analysis projects, the level of randomization provided by these built-in routines will be more than adequate, but when selecting large sequences of random numbers, the random sequence may repeat itself as the list of numbers is exhausted. In addition, some algorithms may generate numbers with poor distributions that overemphasize some numeric ranges. When working with large collections, it may be worth exploring more robust random number generators that have been designed for cryptographic (encryption) purposes, as they often incorporate true entropy in the form of hardware feedback or other non-computed seed values.
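
When the built-in pseudorandom generator is a concern, one option in Python is to draw the sample through a generator backed by the operating system's entropy source, as sketched below.

```python
# A minimal sketch of drawing a large sample through a generator backed by
# the operating system's entropy pool rather than the default pseudorandom
# generator.
import random

secure_rng = random.SystemRandom()        # reads from the OS entropy source

document_ids = list(range(1_000_000))
sample = secure_rng.sample(document_ids, k=10_000)     # without replacement
print(len(sample), sample[:5])
```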

Multimedia Content

Most content analysis techniques are designed for textual documents, but not all content is readily available in textual format. Television and radio broadcasts, image archives, and other types of still and moving media present unique challenges to the content analyst.

Converting to Textual Format

One of the most straightforward mechanisms of integrating multimedia material into a content analysis is to convert its audio stream to text if it includes spoken language. Many commercial television shows and movies already contain embedded transcripts in the form of closed captioning streams that contain the dialog in computer-readable format. These are often entered by trained human editors and are precisely synchronized to the dialog, enabling high-resolution capture of audio. For content that does not have available closed captioning, speech recognition systems convert spoken audio into computer text. Most commercially available systems today are continuous speech recognition systems designed to recognize human speech at a conversational pace, as opposed to previous generations that required long pauses between each spoken word. Nuance's Dragon NaturallySpeaking™ is marketed for standalone use as an automated transcriptionist, allowing a user to narrate a document to the computer by voice rather than typing it out. IBM's ViaScribe™ is positioned for closed captioning of real-time audio streams such as live broadcasts.

In carefully controlled settings these software packages can achieve impressive accuracy, but when applied to field-collected audio, background noise, speed of speech, and the presence of multiple and non-native speakers can severely degrade recognition accuracy. Many systems require a training process to tune their recognition for a particular speaker, meaning that they perform well for that single speaker, but less accurately for other speakers in a shared-speaker environment (such as a conference call or panel discussion). While far from perfect, many projects will find the tools offer quite usable accuracy.

Prosody

A significant amount of the meaning of spoken communication is conveyed through prosody: the intonation and stress of each word. A simple hello could convey happiness at the encounter, sadness, attraction, annoyance, or a range of other emotional states, all through the way in which the word is spoken. Speech recognition software, like most transcription processes, removes this tonal information and converts rich speech into a body of text devoid of emotional cues. If prosody is especially important to a project, the resulting transcripts can be annotated by hand with additional metadata indicating spoken emotional charge.

Example Data Sources

As this chapter has alluded to, there are myriad databases available offering full-text archives on nearly every topic. However, sometimes it can be difficult to determine just how these databases may be applied to answer a specific research question. The following examples each briefly examine a particular research question and how an available source can be used to explore it.

Patterns in Historical War Coverage

With the United States once again a nation at war, the question of how war coverage has changed over the years has become a particularly relevant one. How has coverage differed between each war based on the United States’ status in the war at that time? How were military successes and losses covered in the war and how were casualties reported? While evening news broadcasts comprise a significant avenue for news distribution, the scarcity and incompleteness of news transcripts make newspaper articles the best source of material on this topic. Within the United States, the New York Times is considered the major paper of record, the most authoritative general news national newspaper in the country. Even though it offers only a narrow glimpse into overall national news coverage, as the paper of record it reflects the general sentiments of the day and thus may be used as a surrogate for coverage across the nation.

ProQuest has digitized the complete run of the New York Times from 1851 to present as part of its Historical Newspapers™ product, allowing full-text searching across the entire paper. Using this archive, one can search for all coverage mentioning the words dead, killed, or casualties, from the Civil War to the present, to generate a rough sample of casualty reporting during each war. The historical nature of the ProQuest archive permits true temporal examination of coverage patterns.

Competitive Intelligence

In the realm of competitive intelligence, it is critically important to track a competitor's web site, examining its latest product lineup, the offerings it gives the most emphasis to, which products it tends to link together, and so on. However, while much of this analysis is performed on a proactive basis, there are many occasions where postmortem analysis is required to examine what may have caused a particular event, such as a spike in the competitor's sales. The Internet Archive (http://www.archive.org/) is a non-profit organization that has maintained an archive of a large portion of the Internet since 1996. Unlike Google's Cached Copy feature, which maintains only the most recent version of each page, the Internet Archive keeps a copy of every version of each page it has indexed. While full-text search is not currently supported in the archive, entering a URL allows every indexed copy of the page to be viewed. In essence, the Internet Archive allows one to time warp back to the state of the Web at any point in time. In fact, the United States Internal Revenue Service recommends the use of the service in its August 2005 guide for online retail auditing. As the guide describes (“Retail Industry,” 2005):

[The Internet Archive] will allow the examiner to determine what the website contained during the year of examination as well as historical information. Did the online business really start in 2004 or was there activity prior to that year? What were the product lines that were being sold online during the year of audit? Are those product sales included in income?

Any research that involves historical exploration of web content will find this unique resource extremely useful.

Global News Coverage

The news media offers a unique proxy into local events and popular reaction around the world. Differences in how countries cover major events yield important insights into their cultural contexts and are powerful external proxies for the prevailing public perceptions within each nation (Gerbner and Marvanyi, 1977). Since World War II, the US Central Intelligence Agency (CIA) has maintained a global news monitoring service called the Foreign Broadcast Information Service (FBIS) (http://wnc.fedworld.gov/), while the British government has operated its counterpart Summary of World Broadcasts (SWB) (http://www.monitor.bbc.co.uk/). Both services monitor newspapers, radio, television, trade publications, and all other available news and public information within each country of the world, selecting and translating into English a representative sample of each day's material. Known as open source intelligence products, FBIS and SWB transcribe and translate more than 30 million words a month and maintain even coverage across the world over time. The unclassified editions of the two collections are available from numerous services, such as ProQuest Dialog and LexisNexis, for both commercial and academic use (Leetaru, 2010).

Cross-national comparisons of media reactions have historically been complicated by the difficulty of acquiring and translating content from multiple countries, especially those with a higher density of broadcast media or less-developed press distribution systems. Access to open source products like FBIS and SWB makes it possible to trace evolving global reaction to major events, such as the 2002 collapse of WorldCom. Europe laid blame on the pro-business policies of US politicians, while South America grappled with whether its new capitalist democracies would survive and Southern Asia considered the impact on its domestic telecommunications industry (Leetaru, 2008).

Downloading Content

Most of the computational content analysis techniques described in this book require direct access to the underlying text of the document archive. A vocabulary inventory, for example, must examine the full text of each article to identify the words it contains. Thus, once a content source has been identified and a query developed to select the desired content, the content analyst must devise a method of actually downloading the full content to a local computer for analysis. For example, a web keyword search on Google might list 500 matching web pages, but in order to actually examine the language used by those pages, they must all be downloaded to the analyst's computer.

Digital Content

Projects involving the search and bulk download of material from online sources such as web sites or databases should investigate the use of site mirroring tools. These tools are designed to accept one or more URLs as input and then follow all links on the given page to traverse through the entire web site, downloading a local copy of each page as it goes. A mirroring tool works by reading the initial URL and following all of its links that point to other pages on the site, and then repeating the process for those pages and their links and so on, until it finds no further links to other pages on the given site. The choice of starting URL is important when attempting to download an entire subsection of a site, as the initial page must have links that point to enough other pages that the mirroring tool will be able to find all of the pages on that site. An about page that describes a site, but has no links to other pages on the site, would therefore be a poor choice as a starting URL, while a table of contents page that includes links to all of the major content sections of a site, would be a much better selection.

Site mirroring tools can be used to copy sites with just a small handful of pages through to those with hundreds of thousands of pages. Most offer a variety of throttling features that allow you to specify the maximum number of files to request per minute, set a bandwidth quota, or enable other rate-limiting methods. These help reduce the burden of your download on the remote server, as many smaller servers will collapse under the heavy sustained load of a site mirroring tool running from a high-speed university network.
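
A minimal mirroring crawler along these lines might look like the following Python sketch, which uses the requests and BeautifulSoup libraries (assumed to be installed), restricts itself to a single site, and sleeps between requests as a crude throttle; the starting URL is a placeholder.

```python
# A minimal sketch of a throttled site-mirroring crawler. It follows only
# links on the same site, keeps a copy of each page, and sleeps between
# requests to limit its load on the remote server.
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def mirror(start_url, delay_seconds=2.0, max_pages=100):
    domain = urlparse(start_url).netloc
    queue, seen, pages = [start_url], {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        response = requests.get(url, timeout=30)
        pages[url] = response.text                     # local copy of the page
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == domain and target not in seen:
                seen.add(target)
                queue.append(target)
        time.sleep(delay_seconds)                      # simple rate limiting
    return pages

# Check the site's terms of use (discussed below) before running, e.g.:
# pages = mirror("http://www.example.com/")
```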

When performing large downloads of remote sites, it is important to look closely at their terms of use. Many web sites specifically forbid automated downloading of their content and take active technological steps to deter it in order to reduce the load on their servers. An increasing number of sites, especially subscription-based services, will actually temporarily disable access to a computer or even an entire network if they detect unauthorized bulk downloading. In some cases they will levy severe legal penalties and/or monetary fees against violators. It is therefore highly recommended that projects needing to perform a large number of searches or download a high volume of material contact the vendor first and explain their project's needs. Some vendors will decline to allow such automated querying, which is much better to know up front than to face legal sanctions and fines after doing a large download. In the realm of data downloading, the old adage that it is better to beg for forgiveness than ask for permission certainly does not hold true. Furthermore, some vendors may actually offer special interfaces specifically designed for bulk downloading, making interaction with their data much easier.

Print Content

Projects involving large volumes of printed material must find a way to quickly and cost-effectively digitize it, converting it into digital text that a computer can process. The first step is to scan the print content, which creates a digital picture of each page. For particularly fragile works, this can be done by hand using a flatbed scanner or even a digital camera. If the print material is in loose-leaf form, or if the binding can be destroyed, the material can be fed through a bulk high-speed scanning machine. Dedicated document scanners are relatively expensive, but many modern digital copiers offer a scan-to-PDF option that will scan and create a PDF file from a document. Upper-end flatbed scanners can also be purchased with automatic document feeder options that can achieve the same effect. To scan an entire book, purchase a cheap used copy and take it to a local copy center, most of which can chop off the binding for a few dollars.

To convert the scanned page image from a grid of colored dots into actual text that the computer can understand, optical character recognition (OCR) software must be used, such as ABBYY FineReader or Nuance OmniPage. Run on a large desktop computer, these programs can convert upwards of 100,000 scanned pages a day into digital text, quickly digitizing even a large print archive.
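
As an alternative illustration using open source tools rather than the commercial packages named above, the sketch below runs the Tesseract OCR engine over a directory of scanned page images via the pytesseract wrapper; the directory names are placeholders and both Tesseract and pytesseract are assumed to be installed.

```python
# A minimal sketch of batch OCR with the open source Tesseract engine via the
# pytesseract wrapper. Directory names are placeholders.
from pathlib import Path

import pytesseract
from PIL import Image

Path("text").mkdir(exist_ok=True)
for image_path in sorted(Path("scans").glob("*.png")):
    text = pytesseract.image_to_string(Image.open(image_path))
    Path("text", image_path.stem + ".txt").write_text(text, encoding="utf-8")
```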

Preparing Content

Obtaining data for analysis is only the first half of the collection process; the data must next undergo a series of preparative stages before it can be subjected to analytical procedures. The specific set of preparation steps required depends on the type of data, as outlined below.

Document Extraction

Some data sources provide documents in a format ready for analysis, with header information (such as publication date) separated from the rest of the text. Most sources, however, will require a certain amount of preprocessing to extract the body text from the rest of the document. For example, an email message consists of a number of header fields, including the sender and recipient addresses, date, and subject line. These header fields must be separated from the body text and broken into their own fields to allow for searches by sender, by date, and so on.
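
For email, Python's standard email package can perform this separation directly, as in the sketch below; the raw message is an illustrative, non-multipart example.

```python
# A minimal sketch of separating an email's header fields from its body text
# using Python's standard email package.
from email import message_from_string

raw_message = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 4 Jan 2010 09:30:00 -0500
Subject: Quarterly report

The attached figures cover the fourth quarter.
"""

msg = message_from_string(raw_message)
record = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "date": msg["Date"],
    "subject": msg["Subject"],
    "body": msg.get_payload(),
}
print(record["sender"], "->", record["subject"])
```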

Content downloaded from the Web presents a much more complex situation. An article on an online news site such as CNN.com is embedded in a web page that surrounds it with advertisements, navigation menus, headers, footers, and other extraneous content. To process a series of web pages downloaded from a web site, a content extraction tool must process the raw HTML of each page and identify the core article body, extracting and saving it to a separate file for processing. Alternatively, for a small number of web pages, a human analyst can manually view each page in a browser and copy-paste the body sections into a spreadsheet.
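
For simple pages, a rough extraction can be performed with the BeautifulSoup library (assumed installed) by deleting boilerplate elements and keeping the remaining text, as sketched below; real news sites usually require site-specific rules or a dedicated article-extraction tool.

```python
# A minimal sketch of stripping boilerplate from a downloaded news page:
# navigation, scripts, and other chrome are removed and the remaining text
# is kept as the approximate article body. The sample HTML is illustrative.
from bs4 import BeautifulSoup

def extract_body(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()                      # delete the boilerplate element
    article = soup.find("article") or soup.body or soup
    return " ".join(article.get_text(separator=" ").split())

sample_html = ("<html><body><nav>Home | World</nav><article><p>Flooding "
               "displaced thousands of residents.</p></article></body></html>")
print(extract_body(sample_html))   # Flooding displaced thousands of residents.
```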

Cleaning

Many longitudinal content analysis projects, especially in the humanities, involve material which has been digitized, or scanned and converted into machine-readable text. A well-known commercial example is the ProQuest Historical Newspapers™ product. The scale of these services necessitates a high degree of automation in the processing of their digitized imagery and OCR software is used to convert each scanned page image into text. OCR output often suffers from typographical errors, especially when the quality of the scans or original source material was sub-par. These errors usually manifest themselves as misspellings and random characters appearing throughout the text (noise). A manual cleaning process is traditionally used to correct these misspellings and remove spurious characters. For maximum accuracy, human editors must read through each document and hand-correct any errors, as ProQuest did with the title and lead paragraph information for each of their digitized New York Times articles.

The cleaning process is critically important when using word-based measures like vocabulary analysis. A word in which the letter “l” has been mistaken by the OCR software for the number one or an exclamation point will be counted as a unique word by a vocabulary analysis. It is not uncommon in a large archive of digitized content to have hundreds of thousands or even millions of error words that result from OCR transcription problems. These can significantly impact density and other measures used by many vocabulary analyses.
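
A simple first pass at such cleaning is to flag tokens that mix letters with digits or stray punctuation before running a vocabulary analysis, as in the sketch below; the pattern shown is an illustrative heuristic rather than a complete cleaning pipeline.

```python
# A minimal sketch of flagging likely OCR error words so they are not counted
# as distinct vocabulary items. The pattern and threshold are illustrative.
import re

SUSPICIOUS = re.compile(r"[A-Za-z]+[0-9!|]+[A-Za-z]*|[0-9!|]+[A-Za-z]+")

tokens = ["happily", "happi1y", "government", "g0vernment", "1867"]
clean, suspect = [], []
for token in tokens:
    (suspect if SUSPICIOUS.fullmatch(token) else clean).append(token)

print("kept:", clean)        # kept: ['happily', 'government', '1867']
print("flagged:", suspect)   # flagged: ['happi1y', 'g0vernment']
```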

Post Filtering

Sometimes it may be desirable to further filter the results downloaded from an online search service through mechanisms more advanced than the basic keyword searching they support. A project examining human rights violations around the world would be hard-pressed to generate a set of keywords that precisely select only news articles relating to human rights violations. Such projects may therefore use a set of broad keywords to ensure all relevant documents are selected and then filter the resulting data through a more advanced mechanism, like an automatic text categorization system, to eliminate false positives and narrow the final result set used for analysis.

Reforming/Reshaping

Content may occasionally contain formatting cues that add additional layers of meaning to the document. For example, a set of Shakespearean plays might contain a standardized header format that denotes a scene change. A reforming utility could be written that splits each play into individual files at each occurrence of a scene change, allowing the play to be analyzed as individual scenes rather than as one single block of text.
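
Such a reforming utility can be quite small; the Python sketch below splits an illustrative play text at a standardized scene header using a regular expression.

```python
# A minimal sketch of reshaping a play into one unit per scene by splitting
# on a standardized scene header. The header format and text are illustrative.
import re

play_text = """SCENE I. A public square.
Two citizens discuss the news.
SCENE II. The palace.
The king receives a messenger.
"""

scenes = re.split(r"(?m)^SCENE\s+[IVXLC]+\..*$", play_text)
scenes = [s.strip() for s in scenes if s.strip()]

for number, scene in enumerate(scenes, start=1):
    print(f"--- scene {number} ---")
    print(scene)
```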

Content Proxy Extraction

Some content analyses are more concerned with specific elements of the document than the document as a whole. An analysis of political speeches might only care about mentions of political leaders, while a geographic analysis might wish to examine just references to cities. Preprocessing tools can be used to extract such information from each document and use it in place of the full text for later analysis.
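
The sketch below illustrates the idea with a tiny, hypothetical gazetteer of city names; a production system would instead use a full named-entity recognizer or geocoder.

```python
# A minimal sketch of content proxy extraction: the document's full text is
# replaced by just the city names it mentions, matched against a small
# illustrative gazetteer.
import re

GAZETTEER = {"london", "paris", "chicago", "cairo"}

def extract_cities(text):
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [t for t in tokens if t in GAZETTEER]

document = "The delegation traveled from Cairo to Paris before returning to Chicago."
print(extract_cities(document))   # ['cairo', 'paris', 'chicago']
```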

Chapter in Summary

The first step in any content analysis project is the collection and preparation of the data to be analyzed, a process that is more complex in the digital realm than most analysts may expect. When searching content aggregators, there are many complicating factors that must be taken into consideration, such as changes in the number of sources archived, estimated versus real result counts, completeness, accuracy, suitability of the data, and the completeness of the search criteria. Nuances in how a dataset was collected may affect its suitability for some research questions, while the type of database and analytical resolution desired play key factors in the kinds of methods that can be used with the data. Digital editions are also not always mirror images of their print counterparts: licensing restrictions, publication space, and duplication can all impact the completeness of a digital archive. Projects downloading content from the web or digitizing print material will need to carefully consider the sources and workflows they use to acquire that content. Finally, once content has been acquired, a variety of preparation processes may be needed before it can be analyzed, such as extracting body content from web pages, cleaning digitization errors, filtering and content reshaping.
