Chapter 3: SAS Predefined Concepts: Enamex
3.1. Introduction to SAS Predefined Concepts
3.2.2. Suffixes as Part of a Personal Name
3.2.6. Locations as Part of Name
3.2.8. Historical Figures, Saints, and Deities
3.2.9. Animals, Fictional Characters, Artificial Intelligence, and Aliens
3.2.10. Businesses Named after People
3.2.11. Laws, Diseases, Prizes, and Works of Art
3.3.1. Common Nouns and Determiners
3.3.2. Subnational Regions and Other Descriptors
3.3.8. Conjoined Location Names
3.3.9. Special Cases for Nonmatches
3.4.1. Corporate Designators or Suffixes
3.4.2. Determiners before Proper Names
3.4.3. Facility Names Associated with an Organization
3.4.6. Conjoined Organization Names
3.4.8. Special Cases for Nonmatches
3.5. Disambiguation of Matches
3.5.2. Organization or Product
3.1. Introduction to SAS Predefined Concepts
As you will recall from the previous chapter, a named entity is one or more words or numeric expressions in sequence which name a single individual or specify an instance of a type in the real world (or an imaginary world).
SAS provides a set of seven predefined entities called predefined concepts, spanning the three types of entities described in chapter 2:
All fully supported languages also provide a predefined grammatical pattern to aid in the recognition of multiwords and complex concepts:
Although the rules that are used for the predefined concepts are proprietary and not displayed in the products, you can learn more about the principles and assumptions that form the basis for the rules for each of the predefined concepts in the sections that follow. Knowing what matches are expected for predefined concepts can help you predict and modify the behavior of the concepts more accurately. It can also help you identify the areas where custom concepts would be most useful for your particular extraction task.
In addition, this information can help you measure the effectiveness of an information extraction system by acting as a standards manual for setting up and annotating a gold standard corpus, as well as for data collection, with all targeted named entities marked in a consistent manner. Measuring the value of information extraction without first defining the targeted entities is like using a yardstick with no numbers or lines. The information in this chapter and in chapter 4 defines the numbers and lines on that yardstick.
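As one illustration of measuring against a gold standard corpus, the sketch below scores a set of extracted matches against a set of annotated matches. It assumes exact agreement on span and concept label, which is only one common scoring convention and not necessarily the one used by any SAS product. The function name `score_exact` and the sample offsets are hypothetical.

```python
def score_exact(gold, predicted):
    """Score predicted matches against gold-standard matches.

    Each match is a (start, end, concept) tuple: character offsets plus label.
    A prediction counts as correct only if all three values agree exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical annotations for a single document.
gold = {(12, 15, "Organization"), (30, 40, "Person")}
predicted = {(12, 15, "Organization"), (50, 55, "Place")}
print(score_exact(gold, predicted))
```

Without a shared definition of what counts as a match, the two sets being compared here would not be comparable, which is the point of the yardstick analogy.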
Referencing these standards can also be a useful step in troubleshooting matches. It can help you align expectations regarding the existence and disambiguation of matches and their scope in various contexts.
This chapter and chapter 4 are a reference that you can return to as you work with named entities. Because these chapters serve as a set of annotation guidelines for typical named entities, you can use them whether you are using SAS Text Analytics or implementing your own set of entity rules with other approaches or software. The content is based on extensive research, historical definitions, and best practice guidelines that the SAS linguists have prepared during the development of cross-linguistic standards for predefined concept extraction for more than 30 languages.
In this chapter and chapter 4, matches that meet the definition of each predefined concept type are denoted in square brackets. For example, in the phrase: “the company [SAS],” only “SAS” is an extracted match (for Organization).
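If you store annotated examples in this bracket notation, a small helper can recover the matches and the plain text. The following is a minimal sketch that assumes brackets are never nested and never appear in the text for any other reason. The function names are hypothetical.

```python
import re

# A bracketed match such as "[SAS]": capture everything between the brackets.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def extract_bracketed(annotated):
    """Return the bracketed match strings in order of appearance."""
    return BRACKETED.findall(annotated)


def strip_brackets(annotated):
    """Return the plain text with the annotation brackets removed."""
    return BRACKETED.sub(r"\1", annotated)


print(extract_bracketed("the company [SAS]"))  # ['SAS']
print(strip_brackets("the company [SAS]"))     # the company SAS
```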
Person is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpPerson or another similar name. The generic “Person” label is used in this book, because it aligns with industry standard practices and is similar to any concept name used in the SAS Text Analytics products in the past.
Person includes any proper name used to designate a specific individual in the real or in an imaginary world. Individual includes any intelligent agent: any real or fictional human, alien, deity, artificial intelligence, or animal.
The matches for Person include two or more of the following:
See section 3.2.3 for a discussion of when single-word names are considered Person matches. References with only an initial or initials and no other name must also have a title captured as part of the match, as in “Mr. T.” References to people that are not proper names, as well as common nouns or pronouns, are not matches for the Person concept. The match is always the longest possible combination of allowed elements.
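The longest-match principle can be sketched as a filter over candidate spans: any candidate that lies entirely inside another candidate is dropped. This is only an illustration of the principle, not the way the predefined concepts are implemented. The function name, the sample sentence, and the candidate offsets are hypothetical.

```python
def drop_contained(spans):
    """Keep only candidates that are not contained in a different candidate.

    Each span is a (start, end) pair of character offsets, end exclusive.
    """
    unique = set(spans)
    return sorted(
        s for s in unique
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
    )


text = "Mr. John Smith spoke"
# Candidates: "John", "John Smith", and "Mr. John Smith".
candidates = [(4, 8), (4, 14), (0, 14)]
print([text[s:e] for s, e in drop_contained(candidates)])  # ['Mr. John Smith']
```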
Words that are leveraged to identify a potential match for Person include job titles and verbal constructions indicating agents of human-like actions, such as “exclaim.” These markers are not retained in the matched string; they are leveraged only as contextual cues.
Remember: Person includes any proper name designating a specific individual in the real or imaginary world.
Special cases that govern whether certain words are included in the match are described in the following sections.
The matches for Person include the following titles of address:
In the contexts where a person can be addressed in spoken communication with the title and first and last name, only first name, or only last name, that title is included as part of the tagged match for Person. However, job titles or descriptions are not matches for Person.
Consider the following examples of strings referencing persons:
Pause and think: Can you identify the Person matches in the above examples?
Matches include only the following:
Titles like “Secretary of Health and Human Services,” “CEO of SAS,” “Pope,” and “Duke of York” are job or professional titles that can refer to more than one individual throughout history. Such relative references, including phrases such as “Miss Know-It-All,” are not specific enough to be considered a match for the Person concept. In addition, an initial alone is not enough context for a match to the Person concept.
3.2.2. Suffixes as Part of a Personal Name
Suffixes on names that are part of the specific designation of an individual and not simply related to education or career are included in the match, together with the name or names. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Only the first names, last names, and initials are matched in the following:
The suffixes that follow the last names in these examples are referring to professional designations in the medical and business fields. Therefore, they are not included in the match.
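The distinction between suffixes that designate the individual and suffixes that designate a profession can be sketched as a trimming step on a candidate span. The suffix list and the sample names below are hypothetical illustrations and are not taken from the SAS rules.

```python
# Hypothetical professional designations; these are trimmed from the match.
# Suffixes such as "Jr." or "III" are not in this set, so they are kept.
PROFESSIONAL_SUFFIXES = {"MD", "PhD", "CPA", "MBA"}


def trim_professional_suffixes(candidate):
    """Remove trailing professional designations from a candidate Person span."""
    tokens = candidate.split()
    # Normalize a token such as "M.D." or "CPA," before the lookup.
    while tokens and tokens[-1].replace(".", "").strip(",") in PROFESSIONAL_SUFFIXES:
        tokens.pop()
    # Drop the comma that introduced the removed suffix, if any.
    return " ".join(tokens).rstrip(",")


print(trim_professional_suffixes("Jane Doe, M.D."))     # Jane Doe
print(trim_professional_suffixes("John Smith Jr."))     # John Smith Jr.
print(trim_professional_suffixes("Sam Lee, CPA, MBA"))  # Sam Lee
```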
Single-word names are included only when the context (person suffixes, job names, birthdays, or other person-related information) indicates a probable match. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
In the remaining examples, the proper nouns are ambiguous because there is not enough context to infer that the reference is to a person. For example, Kent could be a person, company, product, or place name. Similarly, Gary is a common English name for persons but could also refer to a town in Indiana.
References to a body part, remains, or corpse of a person are not considered a part of the match. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
Note that the remaining words in the matches above provide reasonably unambiguous context that the proper nouns are referring to persons.
Quotes around a descriptive nickname are included within the name match if they appear within or overlap the boundaries of a person’s name. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
The nickname is not included in the match for the final example because it does not appear within the boundaries of the person’s name.
3.2.6. Locations as Part of Name
Locations that are part of the name are included in the match and not matched separately as Place. However, mentions that consist of only a title and a location are not included as matches. In addition, locations named for people are not included as matches to the Person concept.
Consider the following examples:
Pause and think: Can you identify the Person matches in the examples above?
Matches include only the following:
The first example is not considered a person match because it is a title that could refer to different people throughout history. In the second example, the title contains a location name, so only the first and last names are parts of the match. In the third and fifth examples, the reference is to a location, even though the place name contains a person name. Therefore, they are not considered matches for Person. In the fourth example, the location is included in the Person match because it helps specify which Princess Anna is being referred to.
Groups of individuals such as national, geographic, religious, or ethnic groups; family or dynasty names; or blended names of two individuals are not a match for Person.
Nonmatches include the following:
Some groups of individuals match as Organization:
See more about organizations in 3.4.
Terms referring to groups of two or more people are not included as matches to the Person concept. However, conjoined or listed names with elision are included as matches. The listed names are considered one single reference if part of the name is elided. The listed names are considered two or more matches if the names on either side of the conjunction are complete.
Matches include the following:
Consider the following examples:
Pause and think: Can you identify the matches for the Person concept in the examples above?
Matches for the Person concept include only the following:
The first few examples are referring to ethnic, religious, and political groups of people, as well as family names, conjoined names, and elided names. None of these examples match the Person concept.
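The rule for conjoined names with elision can be sketched as follows. The sketch makes a simplifying assumption that a name with two or more words is complete and that a single-word name next to a conjunction has an elided part. Real text needs more context than that. The function name and the sample names are hypothetical.

```python
def conjoined_person_matches(left, right, conjunction="and"):
    """Return the Person matches for two names joined by a conjunction.

    Assumption: a name with at least two words is complete. If both sides
    are complete, they are two matches. Otherwise part of a name is treated
    as elided and the whole conjoined string is one match.
    """
    left_complete = len(left.split()) >= 2
    right_complete = len(right.split()) >= 2
    if left_complete and right_complete:
        return [left, right]
    return [f"{left} {conjunction} {right}"]


print(conjoined_person_matches("John", "Mary Smith"))
# ['John and Mary Smith']
print(conjoined_person_matches("John Smith", "Mary Jones"))
# ['John Smith', 'Mary Jones']
```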
3.2.8. Historical Figures, Saints, and Deities
Names of saints and other historical figures are included, unless the context indicates that they appear as a part of the name of another predefined concept type. Proper names for deities are a match, but not references to deities generally, descriptive references, or exclamations.
Consider the following examples:
Pause and think: Can you identify the matches for the Person concept in the examples above?
Matches for the Person concept include only the following:
The second and third examples are not matches for the Person concept because they are referring to locations, namely a bridge and a cathedral. The sixth example is an exclamation, whereas the remaining nonmatches are not specific enough to refer to one particular deity.
3.2.9. Animals, Fictional Characters, Artificial Intelligence, and Aliens
The proper names of animals, fictional characters, artificial intelligence, and aliens are matches.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
Matches do not include species, such as Eevee, Martians, or Vulcans, because they are groups.
3.2.10. Businesses Named after People
Names of humans, any of which could also be the name of a business, are included as matches to Person unless there is a contextual cue that the name applies to the business, not to the individual. Organization names with embedded person names are not included as matches.
Consider the following examples:
Pause and think: Can you identify the matches for the Person concept in the examples above?
Matches for the Person concept include only the following:
Note that the third and fourth examples are not matches because context, such as “& Associates” and “Podiatrists,” identifies a business even though part of the company name may be a person name.
3.2.11. Laws, Diseases, Prizes, and Works of Art
Place is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpPlace or another similar name. The generic “Place” label is used in this book, noting that “nlpPlace” and any concepts found in SAS products that have Location within their name are equivalent.
Place includes any proper name or defined expression commonly used to designate a specific site in the real or in an imaginary world, as well as any geo-political entity (GPE). Site includes any geographical point or area in physical space, on earth or elsewhere, including imaginary worlds. GPE is a composite of the following:
For example, GPE includes province, state, county, city, town, and others.
Remember: Place includes any proper name or expression designating a specific site or geo-political entity in the real or imaginary world.
In addition to site names and GPE names, matches for Place include location expressions.
For example, matches include the following:
Words that are leveraged to indicate a potential match for Place are the following:
Special cases that govern whether certain words are included in the match are described in the following sections.
3.3.1. Common Nouns and Determiners
Common nouns may be included in the name if they help clarify the concept or are truly treated in language and by societal conventions as a predefined concept, whether capitalized or not. Determiners like English “a” or “the” may also be included if they are considered a part of the name. For example, the determiner is included in the match of “[Democratic Republic of the Congo]” but not in “the [Southeastern United States].”
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
The first and fifth examples do not produce a match because they do not include a proper noun. Note that the determiner is included as part of the match only in the final example.
3.3.2. Subnational Regions and Other Descriptors
Subnational regions are not included when referenced by only compass-point modifiers. Generally, the text must contain enough explicit information that the location could be plotted, or an area drawn, on a map. Historic modifiers and other descriptors are included only if they are part of the official name.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Examples like “the South” and “the Southwest region” are not specific enough to be pinpointed on a map, because they could refer to locations in various countries. Note that adjectives such as “former” or nouns such as “coast” are not included in the match when they are a historical or geographical reference, but are included if they are part of the official name of a country.
Street addresses are included if they contain enough information to identify a specific point on a street or to zero in on a specific building or multi-structure facility, with some background information about country and city, town, or province taken as assumed knowledge. For the match to be a Place, it must be possible to find it on a map without guesswork.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
The remaining two examples are not matches for the Place concept, because the context is not specific enough. The references could be to an organization rather than a place.
Monuments that are not aliases for the organizations running them are included as matches. All other facilities or buildings are excluded unless they are airports or they fit the criteria for an address.
Consider the following examples:
Pause and think: Can you identify the matches for the Place concept in the examples above?
Matches for the Place concept include only the following:
The remaining three examples contain matches for the Organization concept.
Names of heavenly bodies and locations are matches so long as the reference is to a specific heavenly body. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
In the remaining examples, the references are not to specific celestial objects; therefore, no matches are extracted to the Place concept.
Names of neighborhoods are included, but generic references to parts of cities or towns are not matches. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Note that the cardinal points are not included in the matches in these examples.
Fictional and nonphysical places with names are considered a match so long as the reference is to a specific place. If the reference is generic, it is not a match.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
The first two examples are not proper nouns and therefore not matches. The remaining examples are matches because they name specific locations.
3.3.8. Conjoined Location Names
When more than one location name in a row is encountered, they are considered one Place match if the relationship between them is hierarchical and they are adjacent or separated by punctuation or prepositions that establish the hierarchical relationship. They are also considered one Place match if the location names are conjoined or listed with elision. Leading prepositions are not included in the match.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Note that the leading prepositions are not included in the match and that only the final example produces two matches because of intervening text.
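The merging of adjacent location names into one Place match can be sketched as a pass over candidate spans. The sketch joins two spans when the text between them is one of a small set of separators, and it does not check that the relationship is truly hierarchical, so it would wrongly merge a plain list such as two unrelated cities separated by a comma. The separator set, the function names, and the sample sentence are hypothetical.

```python
# Hypothetical separators that may link a smaller location to a larger one.
HIERARCHY_SEPARATORS = {", ", " in "}


def span_of(text, name):
    """Return the (start, end) offsets of the first occurrence of name."""
    start = text.index(name)
    return (start, start + len(name))


def merge_hierarchical(text, spans):
    """Merge location spans that are separated only by a hierarchy separator."""
    merged = []
    for start, end in sorted(spans):
        if merged and text[merged[-1][1]:start] in HIERARCHY_SEPARATORS:
            merged[-1] = (merged[-1][0], end)  # extend the previous match
        else:
            merged.append((start, end))        # intervening text: new match
    return merged


text = "She moved from Cary, North Carolina to Boston."
spans = [span_of(text, n) for n in ("Cary", "North Carolina", "Boston")]
print([text[s:e] for s, e in merge_hierarchical(text, spans)])
# ['Cary, North Carolina', 'Boston']
```

The leading preposition “from” is never part of a candidate span here, so it stays outside the match.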
3.3.9. Special Cases for Nonmatches
Special cases that are excluded from matches as Place are as follows:
Organization is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpOrganization or another similar name. The generic “Organization” label is used in this book because it is an industry standard term and reflects previous names used in SAS products for this concept.
Organization means a formally established association. The matches for Organization include the proper names, common aliases, nicknames, or stock ticker symbols of businesses, government units, sports teams, clubs, and formally organized artistic groups. Common types of organizations are as follows:
Examples of aliases, nicknames, and pseudonyms include the following:
Matches also include stock ticker symbols, such as [MSFT] and [CSCO].
The proper names for groups of individuals closely associated with a specific organization are also considered matches. For example, [Girl Scouts] is a proper name associated with [Girl Scouts of America] and [Democrats] is a proper name associated with the [Democratic Party]. Generic names for a type of group or organization, like Latinos, feminists, police, or army, are not considered matches. But a specific proper name is a match; for example, the [Los Angeles Police Department] or [U.S. Congress].
In addition, organization names embedded in locations, such as AT&T Stadium, are not matches for Organization because they are referring to a location.
Remember: Organization means the name of a formally established association.
Words that are leveraged to indicate a potential match for Organization include prefixes and suffixes indicative of organizations, verbs associated with businesses or organizations acting like individuals, some prepositions (at, for, with, within, outside of), nouns for associated groups (team, division, chapter, orchestra, club), and facility words as part of the name.
3.4.1. Corporate Designators or Suffixes
Corporate designators or suffixes are included in the match. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
In all the examples, the various corporate designators are included in the matches.
3.4.2. Determiners before Proper Names
Determiners in front of proper names are included only if they are expected as part of the name. In the example of “The Ohio State University,” that university dictates that its name includes “the,” so the entire string ([The Ohio State University]) is the match. In contrast, in the text “the United Nations,” the determiner is not a part of the match (the [United Nations]).
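The determiner rule can be sketched with a lookup of names whose official form includes the determiner. The lookup set and the function name are hypothetical; only the two examples come from the text above.

```python
# Hypothetical lookup of names whose official form includes the determiner.
NAMES_WITH_DETERMINER = {"The Ohio State University"}
DETERMINERS = ("the ", "a ", "an ")


def bracket_organization(mention):
    """Bracket an organization mention, leaving an optional determiner outside."""
    if mention in NAMES_WITH_DETERMINER:
        return f"[{mention}]"
    lowered = mention.lower()
    for det in DETERMINERS:
        if lowered.startswith(det):
            # Keep the determiner outside the bracketed match.
            return f"{mention[:len(det)]}[{mention[len(det):]}]"
    return f"[{mention}]"


print(bracket_organization("The Ohio State University"))
# [The Ohio State University]
print(bracket_organization("the United Nations"))
# the [United Nations]
```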
3.4.3. Facility Names Associated with an Organization
Proper names referring to facilities which are closely associated with an organization that runs or owns the facility are included in the match, even if the facility itself is being referenced in a locative context. One exception is airports, which are not considered organizations.
Consider the following examples:
Pause and think: Can you identify the matches for Organization in the examples above?
Matches for the Organization concept include only the following:
Airports and organization names embedded in locations are not matches, which disqualifies the remaining examples from matching for the Organization concept.
Named groups of individuals with a codified and widely accepted set of criteria for membership in the group are included if they are closely associated with a single specific named organization. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
The remaining examples denote religious groups and therefore are not matches for Organization.
A city, state, or district name is included when it is used to refer to a sports team. This is a common example of metonymy, a type of alias.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
When an organization name and an alias are both present, they are considered two separate matches. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
An organization name or alias that is an explicit reference to a product or brand is included. However, the reverse is not true: References to the products or brands themselves are not automatically matched as organizations. Ambiguous references to products or brands are also not included when the context does not show that the organization itself is meant.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
The remaining examples are referring to products rather than organizations and are therefore not matches to Organization.
3.4.6. Conjoined Organization Names
Two or more conjoined or listed organization names are considered separate predefined concept matches, even if it looks like they may share elided material. In this case, the shortened name is considered an alias.
Consider the following examples:
Pause and think: Can you identify the matches for the Organization concept in the examples above?
Matches include the following:
In these examples, although the organization names are conjoined, they are separate matches.
Event names are not considered organizations, but the committees and organizations that run the events are. Consider the following examples:
Pause and think: Can you identify the matches for the Organization concept in the examples above?
Matches include only the following:
The remaining examples are not matches, because they are names of events.
3.4.8. Special Cases for Nonmatches
Special cases that are excluded from matches as Organization are as follows:
3.5. Disambiguation of Matches
Accounting for situations in which one single predefined concept match or pattern could fall into multiple categories is one of the key challenges of named entity recognition. There are ambiguities between enamex entities because many proper nouns could be names of persons, organizations, or locations. Some examples are listed below.
“Duke” could be part of a Person match or an Organization match:
“Washington” could be referring to a person or place, so it could be part of a Person or Place match:
“Chelsea” could be a part of a Person match, Place match, or Organization match:
Ambiguities are also encountered between enamex and numex entities, as mentioned in chapter 4. In addition, the same text string could be a predefined concept match or not. For example, the acronym “NER” could stand for nucleotide excision repair (nonmatch) or the North-East Railway (Organization).
The SAS predefined concepts account for these types of ambiguity by leveraging contextual cues like common titles, professions, abbreviations, prefixes or suffixes, appositives, and nominal and verbal constructions.
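The use of contextual cues can be illustrated with a toy disambiguation function that looks only at the word before and the word after an ambiguous name. The cue lists and the function name are hypothetical, and the actual predefined concepts rely on far richer context than this sketch does.

```python
# Hypothetical cue lists for illustration only.
PERSON_TITLES = {"Mr.", "Mrs.", "Ms.", "Dr."}
ORGANIZATION_SUFFIXES = {"University", "Inc.", "Corp."}


def disambiguate(before, name, after):
    """Guess a concept for an ambiguous name from its neighboring words."""
    if before in PERSON_TITLES:
        return "Person"          # a title of address precedes the name
    if after in ORGANIZATION_SUFFIXES:
        return "Organization"    # an organizational suffix follows the name
    return "ambiguous"           # no cue found; more context is needed


print(disambiguate("Dr.", "Washington", "said"))  # Person
print(disambiguate("at", "Duke", "University"))   # Organization
print(disambiguate("", "Chelsea", ""))            # ambiguous
```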
Sometimes it is difficult to distinguish from context whether the reference is to a place or an organization, because of metonymy, meaning the use of one term as a stand-in for another. For example, sports teams (organizations) from a particular location are often referred to as that location, as in “Buffalo’s win over New York.” Similarly, the work of government officials or departments is sometimes referred to by the name of the location, as in “Germany unveils new law.” In these and other similar cases, the following predefined concept guidelines offer some direction.
The following situations describe matches for the Organization concept:
The following situations describe matches for the Place concept:
Consider the following examples:
Pause and think: Can you identify which of the examples above contain matches for the Organization concept and which ones for Place?
Matches for the Organization concept include the following:
Matches for the Place concept include the following:
3.5.2. Organization or Product
An organization name or alias that is an explicit reference to a product or brand is a match for Organization. However, references to the products or brands themselves are not matches. Ambiguous references to products or brands are also not matches when the context does not show that the organization itself is meant.
Consider the following examples:
Pause and think: Can you identify which of the examples above contain matches for the Organization concept?
Matches include the following:
Groups of individuals belonging to an organization match as Organization, such as [Democrats], [Girl Scouts], and [Marines]. However, groups of individuals who do not belong to a formally established association are not considered a match for Organization or Person. Thus, for example, members of a particular religion are not considered matches for Organization, but members of a particular formally established religious denomination or church may be.
Consider the following examples:
Pause and think: Can you identify which of the examples above contain matches for the Organization concept?
Matches include the following:
Groups of individuals belonging to a particular industrial sector, industry, or job are not considered matches because they are not proper nouns. For example, the job description “financial advisors” is not a match for Person, but “[Bank of America] financial advisors” contains an Organization predefined concept match—the company where that group of individuals works.