5

LEXICONS, ENTITY EXTRACTION, AND GEOCODING

This book introduces many advanced topics in content analysis, ranging from sentiment analysis to document clustering, but at the end of the day most projects rely on one of the simplest forms of analysis: the lookup file. Geocoding systems rely on massive databases of geographic locations and their latitude and longitude coordinates, known as gazetteers; lexicons use lists of keywords to match text into predefined categories; and entity extraction systems use a combination of context and word lists to compile lists of named entities appearing in a document. The theory behind such systems is simple: take each word or token and look for a match in the corresponding word list. The complexities of real-world applications, however, necessitate many specialized processes to ensure a high level of accuracy, and this chapter introduces concepts and practices for enhancing their use.

Lexicons

A lexicon is just another name for a list of related words grouped together under a single heading. Lexicons can be created on any topic, ranging from the concrete (war) to the abstract (cognitive processes), and the words within can be synonyms (war, armed conflict) or only loosely related (believe, thought process). Content analysis using lexicons is similar to the basic vocabulary analysis techniques introduced earlier, but rather than examining patterns in the words themselves, the words are compared against word lists (lexicons) to determine presence/absence, counts, and other targeted statistics. A war lexicon might contain military titles, terms for violence, types of military equipment, and other vocabulary commonly used in descriptions of war. Each document is checked for words that match terms in the lexicon and flagged with the topical heading of the lexicon upon a positive match. Grouping large numbers of related terms together allows complex concepts to be treated as a single topic, measuring, for instance, the prevalence of war terminology in a set of political speeches rather than measuring thousands of individual word frequencies.

Lexicons and Categorization

There are two primary applications of lexicons in content analysis: presence/absence and value-mapped. Presence/absence lexical analysis is the most common and merely records what percent of a document's words appear in a lexicon. Value-mapped lexicons assign a numeric value to each lexicon entry that is averaged to yield a score for the document along that dimension. While occurrence frequency can be used as a scoring factor for presence/absence lexicons (the number of words in the document that matched the lexicon), this has limitations when words in the lexicon hold different semantic significance. Sentiment mining algorithms, for example, rely on value-mapped lexicons to encode the differing emotional content of each word. The word attack conveys greater emotional energy than dissuade, just as adore reflects greater affinity than like. A value-mapped lexicon stores a numeric value with each entry and computes a lexical score based on the average value of all matching entries. If, for instance, one document contains words like adore, love, enjoy, and enthusiastic, its average score on the emotional scale would be very high, while a document with adore, loathe, like, and dislike would have an average emotional score near zero, reflecting that an equal number of positive and negative terms of equivalent weight cancel each other's impact. In a presence/absence lexical model, these two documents would have identical scores if they were the same length, masking the strongly differing contribution each word makes to that dimension.
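
To make the distinction concrete, the following minimal Python sketch scores a document under both approaches; the emotion lexicon, its values, and the sample documents are hypothetical placeholders chosen to mirror the example above.

    # A minimal sketch of presence/absence versus value-mapped lexicon scoring.
    # The lexicon values and sample documents are hypothetical placeholders.
    emotion_lexicon = {"adore": 5.0, "love": 4.5, "enjoy": 3.5, "enthusiastic": 4.0,
                       "like": 2.0, "dislike": -2.0, "loathe": -4.5}

    def presence_absence_score(tokens, lexicon):
        """Fraction of a document's tokens that appear in the lexicon."""
        matches = [t for t in tokens if t in lexicon]
        return len(matches) / len(tokens) if tokens else 0.0

    def value_mapped_score(tokens, lexicon):
        """Average value of all tokens that match the lexicon."""
        values = [lexicon[t] for t in tokens if t in lexicon]
        return sum(values) / len(values) if values else 0.0

    doc1 = ["adore", "love", "enjoy", "enthusiastic"]
    doc2 = ["adore", "loathe", "like", "dislike"]
    print(presence_absence_score(doc1, emotion_lexicon),
          presence_absence_score(doc2, emotion_lexicon))  # identical: 1.0 1.0
    print(value_mapped_score(doc1, emotion_lexicon))      # strongly positive (4.25)
    print(value_mapped_score(doc2, emotion_lexicon))      # near zero (0.125)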

Customized lexicons can be compiled from many sources and numerous commercial lexicon collections exist, tailored for various applications. The DICTION toolkit (http://www.dictionsoftware.com/) includes over 10,000 words in five primary categories: certainty, activity, optimism, realism, and commonality, along with 35 subcategory lexicons. The Minnesota Contextual Content Analysis (MCCA) database contains over 11,000 words placed in 116 idea categories, such as structural roles, ideals, merchandise, and scholarly nouns (McTavish and Pirro, 1990). These prebuilt lexicons make it easy to drop lexical analysis into a content analysis project without the extensive research required to build a lexicon from scratch. Existing lexicons, especially widely used ones, have the added benefit that they have often undergone testing in multiple applications with relative consensus on the terms included in each category.

When testing a new lexicon with a document collection, it may be useful to view the words being matched by the lexicon to validate whether the results it generates are meaningful in that context. A social life lexicon might contain the word party, which would be a valid trigger word in most contexts, but when applied to a set of political speeches, might give them a very high social life score due to the appearance of party in democratic party and republican party. Interactive validation, which displays all words in a document that matched into a specific lexicon, can help reduce the impact of false positives by allowing quick prescreening of results and removal of words from the lexicon as needed. Many commercial lexicons, however, come with bundled software to run the lexicons and do not permit viewing of the underlying word list.

Lexical Correlation

Once a lexicon has been used to assign a document into a set of matching categories, these lexical assignments may be used as dimensions in which to examine document and topical correlations. A collection of campaign speeches by presidential candidates, for example, might be analyzed using lexicons representing core themes of that campaign cycle. The density of matches along each lexical dimension can be used to compare candidates’ stances and the topics they have chosen to emphasize as part of their campaign platform. Correlation tables can be used to cluster candidates by the similarity of their stances, or to show which topics are being treated the most similarly in this campaign cycle. Lexical frequencies may also be correlated against document metadata, such as author gender or political party affiliation.
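
As a rough illustration of how lexical densities become dimensions for correlation, the hypothetical sketch below scores a handful of speeches against two toy lexicons and computes the Pearson correlation between the resulting dimensions; in practice a statistics library would be used to build full correlation tables across many lexicons and metadata fields.

    # A minimal sketch of using lexicon match densities as correlation
    # dimensions. The lexicons and speeches are hypothetical placeholders.
    war_lexicon = {"war", "troops", "invasion"}
    economy_lexicon = {"jobs", "taxes", "economy"}

    speeches = [
        "the war requires more troops and a larger invasion force".split(),
        "jobs and taxes dominate the economy this year".split(),
        "troops returning home need jobs and a strong economy".split(),
    ]

    def density(tokens, lexicon):
        """Fraction of a speech's tokens that match the lexicon."""
        return sum(1 for t in tokens if t in lexicon) / len(tokens)

    war_scores = [density(s, war_lexicon) for s in speeches]
    econ_scores = [density(s, economy_lexicon) for s in speeches]

    def pearson(x, y):
        """Pearson correlation between two equal-length score lists."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    print(pearson(war_scores, econ_scores))  # negative here: the two themes diverge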

Lexicon Consistency Checks

Many lexicons are pieced together from multiple sources, often over time, becoming large hand-built assemblages of words and phrases with a high risk of duplicate entries and overlapping terms. Consistency checks perform a variety of validations on a lexicon to optimize its execution speed and minimize the chance of overlapping terms creating problems for analysis. Duplicate removal is the most basic type of check, which can dramatically improve the execution performance of a lexicon being run against a large document collection. Overlapping term removal is a more subtle operation that has the potential for significant performance gains when optimizing large vocabulary files that are composed of a mixture of phrases of different lengths or phrases and words. Term overlap occurs when a smaller term appears as part of a larger one, such as war and act of war or United States and United States of America. In both cases, if the larger phrase appeared in a document it would also be matched by the smaller term (act of war appearing in the text would also be matched by war), so checking for both the smaller and larger terms is redundant. These types of overlaps are very time-consuming and error-prone to remove by hand, but are trivial for automated consistency checking routines to identify and remove.
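
A minimal sketch of such a consistency check is shown below, using a tiny hypothetical lexicon; it removes duplicates and flags entries that appear as contiguous word sequences inside longer entries, leaving the final decision to the user.

    # A minimal sketch of lexicon consistency checking: duplicate removal and
    # detection of entries contained within longer entries.
    # The example lexicon is a hypothetical placeholder.
    lexicon = ["war", "act of war", "united states", "united states of america",
               "war"]

    # Duplicate removal, preserving the original ordering.
    deduped = list(dict.fromkeys(lexicon))

    def contains_phrase(longer, shorter):
        """True if the shorter entry appears as a contiguous word sequence
        inside the longer entry."""
        lw, sw = longer.split(), shorter.split()
        return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))

    # Flag overlapping pairs for interactive confirmation rather than
    # silently deleting them.
    overlaps = [(s, l) for s in deduped for l in deduped
                if s != l and contains_phrase(l, s)]
    print(overlaps)  # [('war', 'act of war'), ('united states', 'united states of america')]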

Identifying overlapping terms can also have implications beyond execution speed when working with categorization lexicons. One of the purposes of using phrases rather than individual words, or longer phrases in place of shorter ones, is to place additional restrictions on matching in order to specify a narrower concept. When building a lexicon to search for mentions of digitization technology (converting print materials to digital form) one might initially include the term scanner. A colleague might find a reference work on digitization that lists different types of scanners and add flatbed scanner, drum scanner, film scanner, and microfilm scanner as terms in the list. Another collaborator might add terminology for other types of more specialized scanning equipment and related terminology, like linear scanners. If these terms were part of a large list that included a wide diversity of digitization terminology, it would be easy to lose track of the original inclusion of the word scanner on its own. After applying the lexicon to a collection of open web documents, the output might include mentions of police scanners (radios that receive the police band) and remote sensing multispectral scanners (used in aerial surveys and other applications). With a large lexicon, it might be difficult to determine why these particular documents matched, especially on a larger project or with a lexicon built over time, where the user may only remember the very narrow terms, like film scanner, and not the original inclusion of the root word scanner. Such over-inclusion would significantly skew the pool of documents flagged by this lexicon. Overlap detection is designed to catch the fact that one of the terms in the list, scanner, appears as a component of several larger terms and therefore essentially short-circuits the restrictiveness implied by those larger terms.

A third form of overlap analysis known as partial overlap analysis can help with situations where terms are too specialized and several phrases are being used to represent a simpler term. For example, a lexicon for identifying lesson plans might include social studies lesson plan, history lesson plan, and geography lesson plan. While such specialty terms will locate many lesson plans on those topics, the lexicon would have to include an exhaustive list of all possible topics on which a lesson plan might be written. In essence, this lexicon has become overspecialized or over-restricted. Partial overlap analysis recognizes that these phrases all share the common root phrase lesson plan and would recommend it as a replacement for the set. The primary difference between partial overlap detection and standard overlap detection is that the subterm (lesson plan in this case) does not need to be present on its own in partial overlap detection. The algorithm examines entries as groupings of words and looks for patterns among those groupings, rather than just checking for the occurrence of each entry as a part of every larger entry (which only catches overlaps where the smaller phrase appears on its own in the lexicon). This can not only yield significant improvements in query performance, but, in this case, would likely yield significantly better results.
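
The sketch below illustrates the idea with a hypothetical set of entries, counting every multi-word subphrase across the lexicon and reporting those shared by more than one entry as candidate replacements.

    # A minimal sketch of partial overlap analysis: finding a common root
    # phrase shared by several entries even when that phrase is not itself
    # in the lexicon. The entries are hypothetical placeholders.
    from collections import Counter

    entries = ["social studies lesson plan", "history lesson plan",
               "geography lesson plan"]

    def word_ngrams(text, n):
        """All contiguous n-word subphrases of a lexicon entry."""
        words = text.split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    # Count every multi-word subphrase across all entries.
    counts = Counter(ng for e in entries
                     for n in range(2, len(e.split()) + 1)
                     for ng in word_ngrams(e, n))

    # Subphrases shared by several entries are candidate replacements,
    # to be confirmed interactively by the user.
    shared = [(ng, c) for ng, c in counts.items() if c > 1]
    print(shared)  # [('lesson plan', 3)]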

Of course, in many cases, the use of a set of narrow overlapping phrases is an intentional and necessary method of identifying specific topics. A lexicon might be built to identify mentions of European wars occurring in the nineteenth century, in which case many names might overlap, both in the mention of specific country names and in the use of the word war, but it would not be desirable to reduce them to these common terms. This is one of the reasons that overlap analysis routines are traditionally interactive processes that request confirmation from the user before altering the lexicon.

Thesauri and Vocabulary Expanders

The accuracy of any search query (or wordlist in a lexicon) is scored by the conflicting values of recall and precision. Recall refers to the percent of relevant documents in the collection retrieved by the query, while precision refers to the percent of retrieved documents that are actually relevant. A query which returns every document in the corpus will have a recall score of 100 percent, but its precision score will be very low, while a query that is so narrow in scope that it returns only relevant documents must often trade recall, leaving out some peripherally relevant documents. Recall and precision are important concepts when searching electronic archives, as search criteria must be restrictive enough so that most of the returned documents are relevant, while being broad enough to catch articles using slightly different terminology.
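
In terms of document sets, the two measures reduce to simple ratios, as in the hypothetical sketch below.

    # A minimal sketch of recall and precision computed from document sets.
    # The document identifiers are hypothetical placeholders.
    relevant = {"doc1", "doc2", "doc3", "doc4"}   # all relevant documents in the corpus
    retrieved = {"doc2", "doc3", "doc5"}          # documents returned by the query

    true_positives = relevant & retrieved
    recall = len(true_positives) / len(relevant)      # fraction of relevant documents retrieved
    precision = len(true_positives) / len(retrieved)  # fraction of retrieved documents that are relevant
    print(recall, precision)  # 0.5 0.666...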

For example, searching for coverage of the Iraq War in a news archive using the keyword Iraq is likely to bring back nearly every document mentioning the war (high recall), but will also bring back many documents unrelated to the war (low precision). A search for Iraq war civilian casualties will most likely have fairly high precision, but would not return documents that used the phrase women and children casualties instead of the word civilian, or other wording to tally the dead, yielding poor recall. Recall and precision are therefore traditionally inversely related, with an increase in one invariably causing a decrease in the other.

There are several techniques that can be used to help expand the wordlist used in a particular lexicon, providing options for expanding recall or narrowing precision. One of the most popular is the venerable thesaurus, offering possible synonyms for each word. However, regular thesauri are fairly limited: they can only offer words which are semantically related, as opposed to ones that co-occur only in the content being analyzed, and they cannot offer guidance on the degree of similarity between suggested terms. The words conflict and war will be found as synonyms in most thesauri, as the two words have similar definitions in the English language. On the other hand, Saddam Hussein and weapons of mass destruction are not related by definition, yet they appeared together with a high degree of frequency in media reports in 2003. Word correlation can be used to find such contextually related words and add them to a lexicon.
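
The toy sketch below illustrates co-occurrence-based expansion; a real implementation would compute correlation or another association statistic over the full document collection rather than raw counts over three hypothetical sentences.

    # A minimal sketch of finding contextually related words by co-occurrence.
    # The corpus and seed term are hypothetical placeholders.
    from collections import Counter

    corpus = [
        "saddam hussein accused of hiding weapons of mass destruction",
        "inspectors searched for weapons of mass destruction in iraq",
        "saddam hussein denied the weapons program",
    ]

    seed = "hussein"
    cooccur = Counter()
    for doc in corpus:
        words = set(doc.split())
        if seed in words:
            cooccur.update(words - {seed})

    # Words appearing most often alongside the seed term are candidates for
    # adding to the lexicon, subject to human review.
    print(cooccur.most_common(5))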

On the other hand, there may not be a sufficient number of sample documents available from which to check for correlated words, or the user may wish to build a broader lexicon that contains terms not found in the current corpus. A database called WordNet (http://wordnet.princeton.edu/) can be of substantial assistance in these cases. Developed as a semantic taxonomy over the English language, WordNet essentially relates every English word to every other, allowing one, for example, to see that a car is a type of vehicle, as is a truck, and that both are usually driven. One of the most powerful aspects of WordNet is that it even ties together words that are not direct synonyms (such as cars and trucks), along with the path of concepts that relate them and the strength of those relationships, making it possible to find non-obvious terms with which to expand a lexicon.
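
A minimal sketch of querying WordNet through the NLTK interface is shown below; it assumes the nltk package and its wordnet corpus have been installed, and the printed relationships are examples rather than guaranteed output for every WordNet release.

    # A minimal sketch of exploring WordNet relationships with NLTK.
    # Assumes: pip install nltk, then nltk.download("wordnet") has been run.
    from nltk.corpus import wordnet as wn

    car = wn.synset("car.n.01")
    truck = wn.synset("truck.n.01")

    # Both concepts are reachable through their hypernym (is-a) paths.
    print(car.hypernyms())    # e.g. [Synset('motor_vehicle.n.01')]
    print(truck.hypernyms())
    # Path similarity gives a rough 0-1 measure of relationship strength.
    print(car.path_similarity(truck))
    # Lemmas (synonyms) of a synset can seed lexicon expansion.
    print([l.name() for l in car.lemmas()])  # e.g. ['car', 'auto', 'automobile', ...]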

When searching informal text archives, such as emails or social media posts, typographical errors can reduce the matching rate of lexicons. To counter this, some programs will suggest common misspellings of each word using a statistical model of the standard computer keyboard layout. For example, typographical permutations of information might include ibformation, informatoin, and iformation.
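
A minimal sketch of generating such permutations is shown below; the keyboard adjacency map covers only a few keys for illustration and would need to be filled out for a full QWERTY layout.

    # A minimal sketch of generating single-keystroke typographical variants
    # of a lexicon term using keyboard adjacency, deletions, and transpositions.
    adjacent = {"n": "bm", "o": "ip", "f": "dg"}  # hypothetical partial QWERTY map

    def typo_variants(word):
        variants = set()
        for i, ch in enumerate(word):
            # Deletion (e.g. "iformation").
            variants.add(word[:i] + word[i + 1:])
            # Adjacent-key substitution (e.g. "ibformation").
            for near in adjacent.get(ch, ""):
                variants.add(word[:i] + near + word[i + 1:])
            # Transposition of neighboring letters (e.g. "informatoin").
            if i + 1 < len(word):
                variants.add(word[:i] + word[i + 1] + ch + word[i + 2:])
        variants.discard(word)
        return variants

    print(sorted(typo_variants("information")))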

Named Entity Extraction

Lexicon files are most frequently used for categorization tasks, in which the specific list of lexicon words found in a document matters less than the overall density of matching words. In some applications, however, the mere presence or absence of terms from a lexicon is less important than the actual set of terms that were matched. Named entity extraction (NER) tools return the list of matching entries from each lexicon as their output. Common entity types include persons, companies, organizations, projects, courses, acronyms, email addresses, and even bibliographic references.

Lexicons and Processing

The first step in building an entity extraction system is the compilation of one or more source lexicons that contain seed terms for the type of entity to be extracted. A person name lexicon might be built by merging together baby name books from around the world (it is important to include names from many different parts of the world, since names are so culturally variable). Company or organization name lexicons can be compiled from various professional directories. Some directories further break companies down by industry, allowing a lexicon to encode not only a company's name, but also sector information that can be used for secondary classifications. Numerous specialty lexicons also exist, including the US Food and Drug Administration's Orange Book (http://www.fda.gov/cder/ob/), which lists all FDA-approved drug products, their active ingredients, alternative names, administration method, relevant patents, and holding companies. This information could be used to scan for prescription drug references in news releases, automatically grouping articles by drug class or active ingredient: information that is not listed in the article, but is encoded within the FDA's lexicon.

Once a source lexicon has been compiled, the entity extraction algorithm parses through the document text and checks each word and phrase for a match from one of the lexicons. If an exact match is not found, most systems will attempt a piece-wise match, in which it will see if any subset of the name is contained in the lexicon. In most cultures, first names tend to be more regular than last names, leading to a higher likelihood of matching the first word of the name in the lexicon. If a match is found, the entire phrase is extracted and tagged with the appropriate heading of the lexicon it matched into. This kind of matching is known as direct lexical matching and is only applicable to situations where candidate matches are phrases and matching is performed through a word-based lookup on the lexicon. More complex types of entities, like acronyms, email addresses, and bibliographic references, require a hybrid approach that combines lexical matching with rules-based matching, or, in some cases, may rely exclusively on rules matching.
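
The hypothetical sketch below illustrates direct lexical matching with a piece-wise fallback for person names: any capitalized two-word phrase whose first word appears in a first-name lexicon is extracted as a candidate person name.

    # A minimal sketch of direct lexical matching with a piece-wise fallback
    # for person names. The first-name lexicon is a hypothetical placeholder.
    first_names = {"maria", "john", "wei", "fatima"}

    def extract_person_names(text):
        """Tag capitalized two-word phrases whose first word matches the
        first-name lexicon."""
        words = text.split()
        entities = []
        for i in range(len(words) - 1):
            w1, w2 = words[i], words[i + 1]
            if w1[:1].isupper() and w2[:1].isupper() and w1.lower() in first_names:
                entities.append(w1 + " " + w2)
        return entities

    print(extract_person_names("The award went to Maria Okonkwo of Lagos."))
    # ['Maria Okonkwo']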

A rules-based system defines a pattern of characters that must appear in the entity and is most commonly expressed in a language called a regular expression, or regex. A regex that would find most US telephone numbers (of the form 123-456-7890) might be [0-9]{3}[-][0-9]{3}[-][0-9]{4}, which says that a string of three numeric characters (0-9 is shorthand for 0,1,2,3,4,5,6,7,8,9) must be followed by a dash, another set of three numeric characters, another dash, and then a set of four numeric characters. A slightly more complex regex that matches many email addresses might be [a-z0-9.]+@[a-z0-9]+\.[a-z]{2,4}. This states that an email address is composed of a block of characters that are letters, numbers, and periods (to handle addresses whose username portion contains periods), followed by the @ symbol, another set of letters and numbers, a literal period, and then a set of letters between 2 and 4 characters in length (to handle country codes like .it and new domains like .info).

Regular expressions may seem complex at first, but are among the most powerful languages for expressing patterns in a form that computers can understand. A regex pattern will return a matching piece of text no matter where it appears in a document, meaning that a program implementing the two patterns above would instantly return phone numbers and emails found in any text they scanned. Email and phone number extraction are examples of rules-only matching, in that no lexicon of predefined terms is used to identify them, only a set of pattern rules. Bibliographic entries are often extracted using a hybrid approach that looks for certain lexical terms, like the phrase references or reference list, in the text and only applies its pattern matching to text below that line to reduce false matches. Traditionally, these rules have been developed manually by subject matter experts, but there are numerous tools today that can generate rules files interactively by having a human select text of interest (such as person names) across a large number of files and having the system learn the context those entities appear in.
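
The sketch below applies the two patterns discussed above using Python's re module; the sample text, phone number, and address are hypothetical placeholders.

    # A minimal sketch applying the phone and email patterns with Python's re module.
    import re

    phone_pattern = r"[0-9]{3}[-][0-9]{3}[-][0-9]{4}"
    email_pattern = r"[a-z0-9.]+@[a-z0-9]+\.[a-z]{2,4}"

    text = "Contact the office at 217-555-0142 or press.office@example.org."
    print(re.findall(phone_pattern, text))  # ['217-555-0142']
    print(re.findall(email_pattern, text))  # ['press.office@example.org']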

Applications

Entity extraction has become a part of numerous commercial products today, with major premium search providers like LexisNexis offering basic entity extraction (such as people and company names) as part of their search result displays. IBM's WebFountain and the NCSA VIAS project were among early wide-scale deployments of entity extraction to real-world data. VIAS was launched at the National Center for Supercomputing Applications (NCSA) in mid-2000. It was originally developed as a large-scale aggregation system that monitored mailing lists related to high-end virtual reality and scientific visualization, but expanded to include USENET (an early form of online community bulletin board) and web content. Rather than crawl and index the entire Web like other search engines, VIAS was designed to create many smaller, topically directed databases called Dynamic Knowledge Repositories (DKRs).

A DKR was created for a particular topic (such as virtual reality or flat screen display technology) and assigned a list of mailing lists and USENET groups to monitor. Using directed crawlers, it would begin searching the Web for pages mentioning any of the target keywords, archiving and indexing only matching content. To enhance the utility of these large topically oriented data stores, VIAS extracted a set of 16 different entity classes and assigned documents into 16 different categories. A resulting document entry would note whether a page was a résumé, lesson plan, table of contents, or news release, and list all person and company names, bibliographic references, grants, and email addresses found within. Searching for lcd projector companies would result not in a list of web sites (as with a traditional search engine), but instead would offer a bulleted list of companies manufacturing or involved with LCD projectors. A search for bibliographic references on pages about haptics (recreating the sense of touch in virtual reality) would yield a concise bulleted list of papers found on those pages, providing a literature-review-in-a-box.

With the explosive popularity of entity extraction in commercial applications today, there are numerous open source and commercial toolkits that provide identification and extraction of different types of entities. One of the best known is the General Architecture for Text Engineering (GATE) (http://gate.ac.uk/), first developed in 1995, that provides several classes of entity extraction. Frameworks like GATE are highly useful for prototype projects, but the type of data being processed determines the level of sophistication required for entity extraction. Production systems like VIAS and WebFountain have been designed to be extremely robust to the content found on the so-called open web, where grammatical rules and basic assumptions on language use are often severely violated, requiring highly adaptive matching algorithms. Many toolkits achieve impressive accuracy levels in the carefully controlled environment of the lab but experience diminished performance levels when applied to real-world data. It is important that the content analyst keep in mind issues such as grammatical quality, editorial consistency, and other factors when selecting an analysis toolkit.

Entity extraction technology adds a new kind of analytical dimension to large content analysis projects. A 2006 study (Leetaru and Leetaru) explored public opinion of the emerging field of carbon sequestration through international news media coverage and web search results. One of the underlying research questions was whether specific institutions or individuals received more prominent placement in media coverage and online content than others. An analysis of the top 100 Google search results for carbon sequestration revealed the US Department of Energy as having three and a half times more mentions than any other organization and that the majority of institutions with the most mentions were in the United States. An analysis of international news media showed the Department of Energy again as the top organization, but followed closely by the FutureGen Industrial Alliance, a corporate umbrella overseeing a major billion-dollar carbon sequestration initiative in the United States. Three individuals, David Hawkins, Mike Burnett, and William Purvis, were identified as having very high visibility in news coverage of the topic, reflecting the prominent positions each held in the field at the time. Entity extraction of this kind can either reinforce expected patterns (someone in the field might expect that these three would feature prominently in news coverage) or suggest unexpected ones. For example, William Purvis was a spokesperson for the Department of Energy, not a scientist, suggesting that the DOE, like many federal agencies, relies more heavily on its professional media relations staff for news reports than on making its scientists available directly for comment.

Geocoding, Gazetteers, and Spatial Analysis

Information is not created in a spatial vacuum. Every piece of information is created by someone or something in a specific geographic location, often intended for an audience in another geographic location, and may contain mentions of yet other locations. Document archives can be clustered based on similar geographic focus, characters in a book associated with the cities they live in or visit, and organizations or events analyzed by geographic diffusion over time. Yet, despite all these benefits, spatial analysis is a relatively untouched topic in the discipline of content analysis. Identifying and converting textual location references into latitude and longitude coordinates requires sophisticated algorithms known as geocoders and the availability of specialty (and often massive) geographic databases called gazetteers. The size and complexity of many of these systems have limited their primary users to geographers and others who work with geographic information systems (GIS), but with the right tools, geocoding and spatial analysis can play an integral role in any content analysis project, bringing to bear a rich new dimension of locative information.

Geocoding

Geocoding involves the conversion of textual location references into geographic ones. These can range from precise street-level addresses, like 1600 Pennsylvania Avenue NW, Washington DC, 20500, to vague landmark-based instructions, such as around two days’ journey north of the river's mouth or four houses past the third stop sign on the left. A geocoder's job is to convert all three of these to a more universal geographic representation: a latitude and longitude that can be displayed on a map or integrated with other spatial data. As the next section will discuss, this involves accessing databases known as gazetteers that pair known locations with an approximate latitude/longitude coordinate pair that give its location on planet Earth.

Gazetteers and the Geocoding Process

For a geocoding system to work, it must have access to a database of predefined locations and their latitude/longitude mappings, known as a gazetteer. In practice, most databases contain centroid coordinates, meaning that the coordinates given for a particular location are those of the centroid of its perimeter (roughly its midpoint). One can imagine that a city the size of Beijing, China, does not have a single coordinate that defines its location on a map, yet for many types of analysis, having a rough approximation of its “center” can be useful. Centroid locations allow approximate distances and other measurements to be computed, even for large regions.

Within the United States, a typical street-level geocoder relies on the United States Census Bureau's TIGER/Line data files. As part of its census operations, the Bureau must be able to reliably associate each census response with a specific geographic location in the United States, to assign it into a tract, block group, and block. It compiles extremely detailed maps of every street in the country, with each street represented as a line segment with the first and last house numbers and latitude/longitude of those houses saved to a massive database. Since 1989 this database has been made available to the public as part of its TIGER (Topologically Integrated Geographic Encoding and Referencing) system. A street address is decomposed into house number, street name, city, state, and zip code components, which are used to locate the appropriate street line segment and interpolate the location of the house along the segment. Since this database requires interpolation from segment endpoints, it is not strictly a gazetteer, which would imply simple lookups that return only exact matches.
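
The hypothetical sketch below shows the interpolation step in isolation: given a street segment's endpoint coordinates and house-number range, a house's position is estimated proportionally along the segment.

    # A minimal sketch of address interpolation along a street segment, as a
    # TIGER/Line-style geocoder might perform it. The segment data below is a
    # hypothetical placeholder.
    segment = {
        "from_house": 100, "to_house": 198,
        "from_lat": 40.1000, "from_lon": -88.2400,
        "to_lat": 40.1020, "to_lon": -88.2400,
    }

    def interpolate(house_number, seg):
        """Estimate a house's coordinates by its position along the segment."""
        span = seg["to_house"] - seg["from_house"]
        fraction = (house_number - seg["from_house"]) / span if span else 0.0
        lat = seg["from_lat"] + fraction * (seg["to_lat"] - seg["from_lat"])
        lon = seg["from_lon"] + fraction * (seg["to_lon"] - seg["from_lon"])
        return lat, lon

    print(interpolate(150, segment))  # roughly the middle of the block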

Of course, cities change, and TIGER/Line files may not reflect the newest subdivision or latest road construction, so a robust geocoding system must be able to fall back to a less-accurate geographic representation as needed. If an exact street match cannot be located, the next best possible match would be the city or zip code components of the address. In the United States, most zip codes cover a smaller geographic region than a city (in 1999, there were 78 zip codes in Chicago, IL, alone), so if the address contains a zip code, it is often used for the next match attempt. In November 1999, the US Census Bureau released a database of centroids for every zip code in the country, while numerous commercial vendors produce more recent tabulations. This database may be compiled into a gazetteer to enable direct geocoding of US zip codes into geographic locations.

If no zip code was provided, or if no match was found, the city specified in the address provides a final matching possibility. The Geographic Names Information System (GNIS) (http://geonames.usgs.gov/domestic/index.html) maintained by the United States Geological Survey on behalf of the US Board on Geographic Names, is the “Federal standard for geographic nomenclature” and the “official repository of domestic geographic names data.” The GNIS is a massive gazetteer of almost two million named places in the United States, ranging from cities to hills, streams, schools, university buildings, and major landmarks. Each location is provided with an approximate centroid latitude and longitude.

The foreign gazetteer counterpart to the GNIS is the GEOnet Names Server (GNS) (http://earth-info.nga.mil/gns/html/index.html) maintained by the National Geospatial-Intelligence Agency. Like its GNIS counterpart, the GNS is the official federal repository of place names, with nearly six million named entities outside the US, ranging from cities to hills, streams, regions, and military installations. In 2006, the GNS database defined 642 different entity classes, differentiating, for example, between a locality, a locality with permanent structures, an abandoned locality, a religious locality, a first-order administrative seat, and a capital.

Geocoders are designed to process individual geographic references, such as a street address or city/landmark name, and output a single geographic coordinate. Many datasets already separate geographic information from the rest of the content, so this presents little difficulty. However, a typical application of geocoding in the humanities and social sciences involves extracting geographic references from the full text of a document archive. Geocoding systems that must identify and extract geographic references from freeform text and convert them into latitude/longitude coordinates are extremely complex. A set of preprocessing algorithms must analyze the text to identify possible geographic references and test multiple permutations in the event of a non-match. In some cases, contextual wording, such as in or at, can suggest a location reference, but in many cases, location may be integrated as part of a larger phrase, such as the United States Olympic Team or New York Times.

To provide the highest possible match rate, many preprocessing systems extract all capitalized phrases from a document and try matching successively smaller portions of each against a gazetteer. For example, a matching system might first check to see if there is a place called New York Times, falling back to New York as its second attempt, which would result in matches in the GNIS database for cities in New York, Florida, Iowa, Kentucky, Missouri, New Mexico, and Texas. The GNIS database also provides estimated population, which is listed only for the New York state entry, suggesting it as the most likely match, at a latitude of 40.7142 and a longitude of -74.0064. A search for New York's Central Park could be processed in several ways in that both New York and Central Park have entries in the GNIS database. Once it finds a match for New York, an exhaustive geocoder might also try the remaining portion of the address, Central Park, finding 156 entries in 33 states, but only one match with New York as the city. Of course, most city and landmark names will appear in the first one to two words of a phrase, and so some geocoders will try to minimize the number of gazetteer lookups by trying the first word, then the first two words, and so on.
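
The sketch below illustrates the successive-trimming approach against a tiny hypothetical gazetteer; a production system would consult a database such as GNIS, along with feature class and population information, rather than an in-memory dictionary.

    # A minimal sketch of matching a capitalized phrase against a gazetteer by
    # trying successively shorter prefixes. The two-entry gazetteer below is a
    # hypothetical placeholder.
    gazetteer = {
        "new york": (40.7142, -74.0064),
        "central park": (40.7825, -73.9656),
    }

    def geocode_phrase(phrase):
        """Try the full phrase, then successively shorter word prefixes."""
        words = phrase.lower().split()
        for end in range(len(words), 0, -1):
            candidate = " ".join(words[:end])
            if candidate in gazetteer:
                return candidate, gazetteer[candidate]
        return None, None

    print(geocode_phrase("New York Times"))  # falls back to the "new york" entry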

Some highly specialized systems have been developed that attempt to make fuzzy matches based on vague landmark-referenced descriptions. For example, a historical text might make reference to a camp around two days’ journey north of the river's mouth. To convert this to a set of geographic coordinates requires tying river's mouth to a mention of a particular river earlier in the text and the location of the river's mouth via a gazetteer to establish a starting landmark. The age of the text and other information may be used to estimate travel speed at the time (the distance embodied by two days’ journey) and used to compute a rough endpoint north of the starting landmark. Of course, key landmarks referenced may have been lost or destroyed (especially in areas that have undergone significant conflict), such that even human historical researchers struggle with such vague locative information. Several gazetteers have been produced for historical work, including the Getty Thesaurus of Geographic Names® (http://www.getty.edu/research/tools/vocabularies/obtain/index.html), which contains numerous historical name spellings and locations for cities that no longer exist.

Operating Under Uncertainty

At first glance, geocoding city names would seem rather straightforward: look the city name up in the gazetteer, find its coordinates, and return them. Yet, most individuals spend their lives with highly localized information that has been pre-filtered for where they live. A weather report in a local newspaper doesn't list whether it will rain or snow in every city in the world: it lists only the local forecast. Similarly, a local newspaper article mentioning a robbery that happened at a bank downtown yesterday might mention a particular street, but would rarely specify a state and country. Unfortunately for geocoding systems, one of the byproducts of human communication is the notion of shared background information, in which the author makes certain assumptions about the shared knowledge of their readers. One aspect of this knowledge expectation is that a news article printed in the local paper of Chicago, Illinois assumes that readers know that Chicago is in Illinois and that it is part of the United States, omitting this information from any location references in the story. A reader viewing that story in Belize, however, might make the same assumption about location, but believe that Chicago refers instead to the city named Chicago located in Belize. As another example, there are 32 cities named Paris in the world outside the US, including cities in Turkey, Panama, the Philippines, Togo, Sao Tome and Principe, Ukraine, Mexico, Haiti, Bolivia, Canada, the Congo, Gabon, an unpopulated locality in Kiribati, farmsteads in South Africa and Zimbabwe, a hill in Colombia, and even a capital in a country named France.

These two cities are not anomalies, and in fact, of the 5,613,871 named entities in the December 12, 2005 GNS gazetteer, there are only 3,901,700 unique names (69.5 percent), meaning that more than two million places in the world recorded in GNS share identical names. The implications of this high degree of name overlap are significant when geocoding material of international scope. If all documents in a corpus were published in the same country, it is likely that references to cities outside that country will include country information. Yet, major international cities, like Paris, France; Washington DC, USA; and Moscow, Russia, rarely appear with their country designators, no matter what country's news they are mentioned in. Thankfully, such cities are rare, and the vast majority of the world's places appear in some form of qualified context when mentioned in the foreign press.

The first step in disambiguating a geographic reference is to search for clarifying information in the immediate vicinity of the reference. For instance, a reference to Cairo would return possible matches in Peru, Italy, Costa Rica, Colombia, Cuba, and Egypt. Searching the document for any of these country names appearing in context with the reference would provide immediate clarification. In the absence of such information (such as a news article published in the Middle East), the document's country of publication can yield potential insights. If the document was published in one of these six countries, or one of their immediate neighbors, it might lend additional support for a match in that country. If neither of these alternatives yields a result with a sufficient confidence level, the GNS feature class information can be used to discriminate between the possibilities with a very high degree of accuracy.

Relying on the observation that major cities tend to be mentioned without a country name more often than smaller ones, a typical disambiguation routine would first check if one of the matches is a capital city. In the case of the example above, such a check would reveal that Cairo is in fact the capital city of Egypt and so an unqualified reference to the city would most likely be referring to the Egyptian capital. If the city was not a capital, then a second match might be to the seat of a first-order administrative division, then to a populated locality, and finally to an unpopulated locality. Of course, while city names are obvious geographic indicators, many other geographic features can be used to discern geographic location. GNS includes a large number of landmarks, including major buildings, parks, schools, rivers, streams, and hills. Each of these is assigned an entry in the database and may be used as a reference point such that a document mentioning a stroll on Parkonkangas Hill would immediately be identified as having occurred in Finland at a particular set of coordinates. Of course, if references point to geographic features like streams and hills instead of (or together with) city names, a more complex ranking scheme must be used that encodes the expected ordering behavior of that particular collection.
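
A minimal sketch of such a ranking heuristic is given below; the candidate list and feature classes are hypothetical placeholders loosely modeled on the GNS feature codes described above.

    # A minimal sketch of ranking candidate gazetteer matches for an ambiguous
    # city name: context country first, then publication country, then feature class.
    FEATURE_RANK = {"capital": 0, "admin_seat": 1, "populated": 2, "unpopulated": 3}

    candidates = [
        {"name": "Cairo", "country": "Egypt", "feature": "capital"},
        {"name": "Cairo", "country": "Peru", "feature": "populated"},
        {"name": "Cairo", "country": "Colombia", "feature": "unpopulated"},
    ]

    def disambiguate(candidates, document_text, publication_country=None):
        # 1. Prefer a candidate whose country is named in the document.
        for c in candidates:
            if c["country"].lower() in document_text.lower():
                return c
        # 2. Prefer a candidate in the country of publication.
        for c in candidates:
            if publication_country and c["country"] == publication_country:
                return c
        # 3. Otherwise fall back to feature class: capitals first, then
        #    administrative seats, populated places, unpopulated localities.
        return min(candidates, key=lambda c: FEATURE_RANK[c["feature"]])

    print(disambiguate(candidates, "Protesters gathered in Cairo yesterday."))
    # -> the Egyptian capital, chosen by feature class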

Spatial Analysis

Geocoding enables an incredible diversity of applications under the heading of spatial analysis that exploit locative information to unveil patterns mediated through position and distance. Geographic clustering is a simple example, where documents can be grouped not by word usage, but by the locations they mention. In a collection of documents, if one mentions Paris numerous times, another Madrid, a third Berlin, a fourth Cairo and a fifth Baghdad, geographic clustering would be able to divide these into two clusters, one for European-focused documents and the other for Middle Eastern-focused, even if the rest of the documents’ contents did not otherwise suggest such a grouping.

The Google Books service provides a tantalizing view into the potential of geocoding as a content analysis methodology. Every public domain book added to the Google Books service is scanned for mentions of geographic locations and a map displayed with points for each location. This map provides immediate feedback on the geographic emphasis of the work, allowing one, for example, to see whether a book on European colonization covers a particular region of the world. Of course, taken in aggregate, this geographic information offers unprecedented insights into the expansion of the known world as seen through books. In fact, Google itself used this data in 2007 to produce a series of maps showing the summation of all geographic references across all books in their collection by decade for the nineteenth century, illustrating such trends as the Westward expansion of the United States (Gray, 2007).

Chapter in Summary

Lexicons are lists of related words grouped under a single heading and underlie many of the more advanced content analysis methodologies. Densities of word matches from a lexicon can be used for rudimentary document categorization, while the grouping of large numbers of related terms under a single lexicon can improve the results of correlation analysis. Many optimizations, like consistency checks, can help improve both the execution speed and accuracy of results for larger lexicons, while thesauri and vocabulary expanding techniques can be used to broaden the terms in a lexicon. Named entity extraction compiles a list of matching names from a document, such as all person names. Entity extraction can be lexically based, comparing candidate phrases against a lexicon of known person names, or rules-based, using regular expression patterns to identify email addresses or phone numbers. Geocoding combines geographic lexicons known as gazetteers with multiple layers of reasoning algorithms to intelligently interpret full text documents, identifying geographic references, disambiguating them, and converting them to geographic coordinates for mapping. Advanced techniques like entity extraction and geocoding allow content analyses to go beyond the text, treating a document as a collection of higher-level information, like person names or locations, to permit more sophisticated interpretation.
