Chapter 4: SAS Predefined Concepts: Timex, Numex, and Noun Group
4.1. Introduction to Other SAS Predefined Concepts
4.2.1. Extended ISO 8601 Format
4.3.1. Extended ISO 8601 Format
4.3.2. Named Times and Time Zones
4.4.6. Expressions and Metaphors
4.5.1. Acronyms, Initialisms, and Abbreviations
4.5.3. Quotation Marks and Parentheses
4.5.7. Special Cases for Nonmatches
4.7. Disambiguation of Matches
4.8. Supplementing Predefined Concepts
4.1. Introduction to Other SAS Predefined Concepts
As you will recall from chapter 3, SAS provides a set of seven predefined concepts, spanning the three types of entities described in chapter 2:
This chapter also includes a description of the predefined grammatical pattern, Noun Group, which aids in the recognition of multiwords and complex concepts. This pattern is detailed in section 4.6.
The rules that comprise the predefined concepts are proprietary and not displayed in the products. But, when you learn more about the principles and assumptions that form the basis for the predefined concept rules, as you do in this chapter, you can more accurately identify when you can leverage them and when custom concepts are a better choice.
Date is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpDate or another similar name. The generic “Date” label is used in this book because it is an industry standard term and reflects previous names used in SAS products for this concept.
Date matches include patterns that indicate a specific point in time at any granularity from a full day to a larger amount of time. Matches can also be a range of points with the following characteristics:
A reference date is either the date that the text was written, or the date that the events in the text occurred. In interpreting a possible Date match, the assumption is that the reference date is known, even if it is not explicitly contained in the text. The granularity of that known point extends only to the full day, not to smaller units of time. However, a word like “now” may serve as a reference point in relationships so long as there is another legitimate time match in the phrase.
The point or points in time modeled by a Date match must be specific enough to be able to be plotted on a timeline. A timeline is a graph of time at any level of specificity:
The smallest unit that can be a Date is a full day.
Remember: Date includes expressions of time that can be plotted on a timeline and span at least a full day.
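The role of the reference date can be illustrated with a short Python sketch. It is not the SAS implementation, whose rules are proprietary; the function name `resolve_relative_day` and the small offset table are hypothetical and cover only three expressions. The sketch shows why a word like "tomorrow" can be plotted on a timeline once a reference date is assumed, whereas an expression with no resolvable offset cannot.

```python
from datetime import date, timedelta

# Hypothetical offsets (in days) for a few relative day names.
RELATIVE_DAY_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}

def resolve_relative_day(expression, reference_date):
    """Return the calendar day a relative expression points to,
    or None when the expression is not in the illustrative table."""
    offset = RELATIVE_DAY_OFFSETS.get(expression.lower())
    if offset is None:
        return None
    return reference_date + timedelta(days=offset)

# The reference date is assumed to be known, for example the day the text was written.
reference = date(2017, 5, 5)
print(resolve_relative_day("tomorrow", reference))  # 2017-05-06
print(resolve_relative_day("someday", reference))   # None
```

The granularity of the result is a full day, which mirrors the assumption that the reference date is known only to the day.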
Date matches include formal or informal references to dates, usually composed of a named unit or a numerical value combined with at least some unit of time. Named units include the following names and common expressions for time:
The match usually encompasses one of the following grammatical categories:
Note that the match does not encompass clauses or prepositional phrases. The match is as short as possible without losing meaning. Punctuation is considered part of the Date match only if it is a lexical part of the tokens. Some examples include the following:
Special cases that govern whether certain words are included in the match are described in the following subsections.
4.2.1. Extended ISO 8601 Format
At least one element of the extended ISO 8601 format, the international standard covering the exchange of date- and time-related data, should be explicit. Units larger than a year are also included. In all cases, at least one point in time should be possible to plot on a timeline from the information given in the text plus the assumption of a known reference date.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
The remaining examples do not contain enough information to plot on a timeline. The fourth example is not a match because there is not a known reference date for the start or end of the “2 months” period.
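For readers who want to see what an explicit extended ISO 8601 calendar date looks like in practice, the following Python sketch finds strings of the form YYYY-MM-DD and keeps only those that are valid dates. It is an illustration under stated assumptions, not the proprietary SAS rule set: the helper name `find_iso_dates` is hypothetical, and the sketch covers only the full calendar-date form, not named dates or partial dates.

```python
import re
from datetime import date

# Extended ISO 8601 calendar dates use hyphens: YYYY-MM-DD.
EXTENDED_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def find_iso_dates(text):
    """Return every YYYY-MM-DD string in text that is a valid calendar date."""
    found = []
    for candidate in EXTENDED_DATE.findall(text):
        try:
            found.append(date.fromisoformat(candidate))
        except ValueError:
            # Has the right shape but is not a real date (for example, month 13).
            continue
    return found

print(find_iso_dates("Shipped 2019-02-15, returned 2019-13-01."))
# [datetime.date(2019, 2, 15)]
```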
Named dates are included unless they are clearly a standalone set or nonspecific reference to a type or class of item, and this can be determined by the immediate context. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
Commonly understood slang or cultural references to dates, as well as references in titles, are included so long as they can be plotted on a timeline with an assumed reference date.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
The first two examples cannot be plotted on a calendar, so they are not matches for the Date concept.
Common nouns signifying events are excluded from matches unless a date stands for an event. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include only the following:
Note that the remaining examples are not matches because they include only common nouns.
Leading or trailing modifiers that bring a more accurate understanding of how to plot the time expression on a timeline are included. This principle applies particularly to modifiers that express that the date is no later than, no earlier than, approximate to, after, or before a given date, or is a specified subset of a given date. However, leading prepositions or phrasal post-modifiers are not generally included unless they help clarify a relationship between multiple points. A vague term like “now” may be part of a range if the other part is a true Date, but not if both are vague.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
In the remaining examples, the references to “week” and “holiday” are not specific enough to be plotted on a calendar and therefore are not matches for Date.
Two or more separate date expressions are considered one match for the Date predefined concept if they are adjacent (or separated only by text that relates them) and the relationship is hierarchical. If overlapping or elided material exists between two expressions, then they are related and should always be identified as one match. They are also considered as one match if each point contributes to the understanding of the span of time under discussion, unless there are more than several words of intervening, unrelated material. This applies to range relationships and conjoined dates that could be interpreted as a range, where the ordering of the points is relevant and cannot be reversed without impacting the meaning. In a possessive construction, if both the possessive phrase and the phrase that it modifies are temporal expressions, then they are identified together as a single match. In all these cases, the Date expressions indicate one point in time. Comparative examples are provided in Table 4.1.
Table 4.1. One or More Matches for Date
One Match for Date | Multiple Matches for Date
The test was given [last week on Monday and Wednesday, but not Friday] | The test will be given on [Monday], [Wednesday], [Sunday], and [Tuesday]
[Every Thursday in October] | [Yesterday], [today] and [tomorrow] the stock rose a point
Consider the following examples:
Pause and think: Can you identify the matches for Date in the examples above?
Matches include the following:
Note that the second example produces a single match and the final example produces multiple matches. The former is a range, whereas the latter is a series of separate dates.
If the time expression is a better answer for the questions “How long” or “How often” rather than “When,” it is not a match for Date. However, duration can be included in the Date concept if it is directly adjacent to a Date and helps in plotting the Date on a timeline.
Consider the following examples:
Pause and think: Can you identify the matches for Date in the examples above?
Matches of duration that are included in Date include the following:
Portions of the examples above contain references to duration; these are marked in italics below and are not matches for the Date concept:
Expressions that cannot be plotted on a timeline explicitly because they are underspecified or referring to implicit time are excluded from matches as Date.
Nonmatches include the following:
Similarly, words like “now,” “today,” or “tomorrow” are excluded from matches as Date when they have the generic meanings of “these days,” “nowadays,” or “in the future.”
Time is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpTime or another similar name. The generic “Time” label is used in this book because it is an industry standard term and reflects previous names used in SAS products for this concept.
Time expressions include patterns that indicate a point in time at any granularity smaller than a full day. Matches can also be a range of points with the following characteristics:
A reference date is either the date that the text was written, or the date that the events in the text occurred. In interpreting a possible Time match, the assumption is that the reference date is known, even if it is not explicitly contained in the text. The granularity of that known point extends only to the full day, not to smaller units of time. However, a word like “now” may serve as a reference point in relationships so long as there is another legitimate time match in the phrase.
The point or points in time must be able to be plotted on a timeline, which is a graph of time at any level of specificity smaller than a full day. The largest unit that can be a Time match is part of a day.
The matches for Time include formal or informal references to times, usually comprising a named unit, or a numerical value combined with at least some unit of time, which may be implicit from context. Named units of time include the following:
The reference could also be a pattern of numbers and punctuation. Punctuation is considered part of the Time match only if it is a lexical part of the tokens. For example, consider the following:
Remember: Time includes expressions of time that can be plotted on a timeline and are shorter than a full day.
Special cases that govern whether certain words are included in the match are described in the following subsections.
4.3.1. Extended ISO 8601 Format
At least one element of the extended ISO 8601 format, the international standard covering the exchange of date and time-related data, should be explicit enough to plot on a timeline from the information given in the text plus the assumption of a known reference date. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
The remaining two examples are not specific enough to plot on a timeline, because there is no known reference point for “now” and “15 minutes late.”
4.3.2. Named Times and Time Zones
Time zones, when present, are included in the scope of the match. Names of times are included unless they are clearly a standalone set or nonspecific reference to a type or class of item, and this can be determined by the immediate context. Commonly understood slang or cultural references to time periods, as well as references in titles, are included so long as they can be plotted on a timeline.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Note that cultural references to a specific time of day such as “rush hour” and “happy hour” are included in the matches, but phrases such as “good morning” and “eleventh hour decision” are not, because they cannot be plotted on a timeline.
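The principle that a time zone belongs inside the match, rather than next to it, can be sketched with a regular expression in Python. The abbreviation list and the function name `find_times` are hypothetical and deliberately small; the SAS rules are proprietary and cover far more patterns, including named times such as "rush hour" that this sketch does not attempt.

```python
import re

# Hypothetical, deliberately small list of time zone abbreviations.
TIME_ZONES = r"(?:UTC|GMT|EST|CET|PST)"

CLOCK_TIME = re.compile(
    r"\b\d{1,2}:\d{2}"         # hours and minutes, e.g. 9:00
    r"(?:\s?[AP]M)?"           # optional AM/PM marker
    rf"(?:\s{TIME_ZONES})?\b"  # optional time zone, kept inside the match
)

def find_times(text):
    """Return clock-time strings, including a trailing time zone when present."""
    return [m.group(0) for m in CLOCK_TIME.finditer(text)]

print(find_times("The webinar runs from 9:00 AM EST until 10:30 AM."))
# ['9:00 AM EST', '10:30 AM']
```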
Leading or trailing modifiers that bring a more accurate understanding of how to plot the time expression on a timeline are included. This principle applies particularly to modifiers that express that the time is no later than, no earlier than, approximate to, after, or before a given time—or are a specified subset of a given time. However, leading prepositions or phrasal post-modifiers are not generally included unless they help clarify a relationship between multiple points. A vague term like “now” may be part of a range if the other part is a true Time, but not if both are vague.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
The remaining examples are too vague to be plotted on a timeline.
Two or more separate Time expressions are considered one match if they are adjacent (or separated only by text that relates them) and the relationship is hierarchical. If overlapping or elided material exists between two entities, then they are related and should always be identified as one match. They are also considered as one match if each point contributes to the understanding of the span of time under discussion, unless there are more than several words of intervening, unrelated material. This applies to range relationships and conjoined times that could be interpreted as a range; in other words, the ordering of the points is relevant and cannot be reversed without impacting the meaning. In a possessive construction, if both the possessive phrase and the phrase that it modifies are temporal expressions, then they are identified together as a single match. In all these cases, the Time expressions indicate one point in time. Some illustrative examples are presented in Table 4.2.
Table 4.2. One or More Matches for Time
One Match for Time | Multiple Matches for Time
We had tests on [Monday at 9:00 AM, at 10:00 AM, and at 11:00 AM] | We had tests on [Monday at 9:00 AM], [Tuesday at 10:00 AM], and [Wednesday at 11:00 AM]
. . . on [Friday morning] | There were doughnuts at the [8:00] meeting [this morning]
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
The only example that contains multiple matches is the fifth one because it refers to two distinct times on two different days.
If the time expression denotes duration and is a better answer for the questions “How long” or “How often” rather than “When,” it is not a match for Time. However, duration can be included in the Time predefined concept match if it is directly adjacent to a Time and helps in plotting the Time on a timeline.
Consider the following examples:
Pause and think: Can you identify the matches for Time in the examples above?
Matches include the following:
Portions of the examples above contain references to duration; these are marked in italics below and are not matches for the Time concept:
Like vague expressions of dates, expressions containing time references that cannot be plotted on a timeline explicitly because they are underspecified or referring to implicit time are excluded from Time matches. Some examples of nonmatches include “1 second later” and “a few hours earlier.”
Money is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpMoney or another similar name using the term “Currency.” The generic “Money” label is used in this book.
Money expressions include any explicit or implied numeric value with a monetary denomination or monetary unit symbol. Explicit or implied numeric values can be any of the following:
Numeric quantifiers include determiners and other quantifiers for which a number could be substituted grammatically (implied numeric amount) with the same or very similar meaning: “one,” “a,” “a few,” and so on. Monetary denominations include any official term or abbreviation for currency in any country (“dollar,” “quarter,” “dime,” “peso”), but not slang terms for money or amounts of money (“quid,” “bucks,” “dough,” “clams,” “Benjamins,” “five-spots,” “fivers,” “moolah,” “greenbacks,” “grand,” “large”).
The match includes the entire string expressing the monetary value: all tokens between the value and denomination or symbol, inclusive within the bounds of a single phrase. For example, matches include the following:
If the match of the monetary value and the currency is separated by more than a phrase or short clause, then the matched string may include only the monetary value, and the currency may play the role of context only.
However, generic or implied references to money are not specific enough, so the following examples are not matches:
Remember: Money includes expressions of numeric value with a denomination or monetary unit symbol.
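The core Money pattern, a numeric value paired with a symbol or an official denomination, can be approximated with a regular expression. The following Python sketch is only an illustration: the symbol and denomination inventories and the name `find_money` are hypothetical and far smaller than what a production rule set would need, and the sketch handles digits only, not number words or numeric quantifiers. It does show how slang such as "bucks" falls outside the match because it is not in the denomination list.

```python
import re

# Hypothetical, deliberately small inventories for illustration only.
SYMBOLS = r"[$€£¥]"
DENOMINATIONS = r"(?:dollars?|cents?|euros?|pesos?|pounds?)"
NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"

MONEY = re.compile(
    rf"{SYMBOLS}\s?{NUMBER}"           # symbol then value: $5, $1,200.50
    rf"|{NUMBER}\s{DENOMINATIONS}\b",  # value then denomination: 10 cents
    re.IGNORECASE,
)

def find_money(text):
    """Return strings that pair a numeric value with a symbol or denomination."""
    return [m.group(0) for m in MONEY.finditer(text)]

print(find_money("She paid $1,200.50 plus 10 cents, or about 20 bucks."))
# ['$1,200.50', '10 cents']
```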
Special cases that govern whether certain words are included in the match are described in the following subsections.
Modifiers that indicate the multiplied value of a unit should be included when the expression remains grammatical and keeps a similar meaning if a digit is substituted for the word or words. In other words, some quantifiers may take the place of the numerical value. A minus sign or words like “minus” and “negative” should also be included in the expression.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Modifying words that indicate the approximate value of a number or relative position, as well as verbs and prepositions outside the boundaries of a value and monetary denomination or symbol, are not included. However, modifiers that indicate that the value is a maximum or minimum of a range of values (inclusive or exclusive of the given value) are included in the match. Some examples of such modifiers include the following:
If a modifier occurs in the middle of an expression within the same phrase or sentence as the value and currency marker, then the modifier is included in the match. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Note that in the fifth example, the modifier “about” is not included in the match, because it does not provide any information beyond the sum itself that could be plotted on a number line.
In rate expressions, the unit is included in the matched string.
Ratios of currencies to each other are excluded from Money matches. These ratios do not indicate exact or approximate amounts of money, but only a relationship between types of money.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
The remaining examples do not produce matches, because the ratios are comparing currencies rather than expressing an amount of money.
A quoted or parenthesized number or other information is included in the match when it is in the same phrase with a numerical value and a denomination or monetary unit symbol.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Note that in all three cases, the information between the amount and currency is included in the match.
Two or more Money expressions that are adjacent (or separated only by text that relates them) are considered one match if any of the following conditions are satisfied:
In these cases, leading prepositions or modifiers that clarify the relationship between the expressions are included in the match, as shown in the left column of Table 4.3. But if the expressions describe moving from one value to another or if there are more than several intervening, unrelated words, then each point is considered a separate Money match, as shown in the right column of Table 4.3. Money matches that do not have relating or elided material are also considered separate matches when each can stand alone and retains its meaning.
Table 4.3. One or More Matches for Money
One Match for Money | Multiple Matches for Money
[Seventeen and then almost eighteen dollars] | I had [$5] and then later [$2] in my wallet
[Nine dollars and ten cents] more | [eleven cents] and [twelve cents]
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Note that the first example contains multiple matches for Money because each can stand alone and retain its meaning. Each of the remaining examples contains a single match.
A value + currency adjectival construction or other construction that leaves part of the value open-ended is included, even if the exact amount is not clear, so long as the approximate amount can be inferred. An imprecise value is still counted as a value if it contains a numeric reference.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
The remaining examples are not matches because an approximate amount cannot be inferred.
4.4.6. Expressions and Metaphors
References to money in standard expressions or metaphors should be analyzed to determine whether an amount of money is really stated explicitly, and whether the meaning has stayed close enough to the literal sense that it is still valid to treat the value as a Money match. Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
For the remaining examples, the meaning has drifted from an explicit reference to an amount to a more general metaphorical meaning.
Percent is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpPercent or another similar name. The generic “Percent” label is used in this book because it is an industry standard term and reflects previous names used in SAS products for this concept.
Percent expressions include an explicit or implied numeric amount and a percentage reference. A numeric amount can be expressed with a number, word, or phrase; numeric quantifier; digit; fraction; or decimal. A percentage reference includes words and symbols with the meaning of “percent,” including the following:
A numeric quantifier includes determiners and other quantifiers for which a number could be substituted grammatically (implied numeric amount) with the same or very similar meaning: “one,” “a,” “a few,” and the like.
The match includes the entire string expressing the percentage value: all tokens between the value and percent reference, inclusive within the bounds of a single phrase. If the match of the numeric amount and the percent marker is separated by more than a phrase or short clause, then the matched string may include only the numeric amount, and the percent marker may play the role of context only. For example, matches include the following:
If there is no explicit percentage term within the scope of the same sentence as the numeric value, there is no match for Percent. Compare the preceding matches to the following nonmatches:
Similarly, if there is no numeric value or numeric quantifier within the scope of the same phrase or sentence as the percentage term, then there is no match for Percent. If the quantifier cannot be easily substituted for a number without further context, it is too subjective to be a numeric quantifier. Therefore, compare the following matches and nonmatches:
Remember: Percent includes expressions of numeric value with a percent reference.
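A minimal Python sketch of the basic Percent pattern follows: a numeric value immediately tied to a percent reference. The set of percent references used here (the symbol, the word "percent," and the abbreviation "pct") is an assumption made for illustration, as is the function name `find_percent`; number words and numeric quantifiers are not handled. A bare number with no percent reference, such as "40 stores," produces no match.

```python
import re

NUMBER = r"-?\d+(?:\.\d+)?"  # digits with an optional minus sign and decimal part
# Assumed percent references: the symbol, the word, and the abbreviation "pct".
PERCENT_REF = r"(?:%|\s?percent\b|\s?pct\b)"
PERCENT = re.compile(rf"{NUMBER}{PERCENT_REF}", re.IGNORECASE)

def find_percent(text):
    """Return strings that pair a numeric value with a percent reference."""
    return [m.group(0) for m in PERCENT.finditer(text)]

print(find_percent("Sales rose 6 PCT, margins fell -2.5%, and 40 stores closed."))
# ['6 PCT', '-2.5%']
```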
Special cases that govern whether certain words are included in the match are described in the following subsections.
4.5.1. Acronyms, Initialisms, and Abbreviations
Acronyms and initialisms are not included as matches unless spelled out. However, abbreviations are included. Matches include “[zero annual percentage rate]” and “[6 PCT] higher than last year.” Nonmatches include “zero APR.”
Modifying words that indicate the approximate value of a number or relative position, as well as verbs and prepositions outside the boundaries of a value and percent reference, are not included. However, modifiers that indicate that the value is a maximum or minimum of a range of values (inclusive or exclusive of the given value) are included in the match. Some examples of such modifiers include the following:
If a modifier occurs in the middle of an expression within the same phrase or sentence as the value and percent reference, then the modifier is included in the match. A minus sign or words like “minus” or “negative” are included in the match.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Note that the preposition “about” in the second example is not included in the match because it adds no further specification to the percentage amount that could be plotted on a number line.
4.5.3. Quotation Marks and Parentheses
A quoted or parenthesized number or other information is included in the match when it is in the same phrase with a numerical value and a percent reference.
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
Note that in all three cases, the information between the amount and percent is included in the match.
Two or more Percent expressions that are adjacent (or separated only by text that relates them) are considered one match if overlapping or elided material exists between two entities, or if in the context, each point contributes to the understanding of the span of percentage points under discussion (as in ranges or in conjoined expressions that can be interpreted as ranges), as shown in the left column of Table 4.4.
In this case, leading prepositions or modifiers that contribute to clarification of the relationship between two amounts are included in the match. But if the expressions describe moving from one value to another, or if there are more than several intervening, unrelated words, then each point is considered a separate Percent match. Percent matches that do not have relating or elided material are also considered separate matches when each can stand alone and retains its meaning, as shown in the right column of Table 4.4.
Table 4.4. One or More Matches for Percent
One Match for Percent | Multiple Matches for Percent
[5–9%] | [5%], [112%], [18%] or [22%] respectively
[5% through 9%] | The twins got [87%] and [89%] on their tests
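The contrast in Table 4.4 between a range (one match) and conjoined stand-alone values (separate matches) can be sketched in Python by trying a range pattern before a single-value pattern. This is a simplified illustration, not the SAS logic: the connector list (hyphen, en dash, "through") is an assumption drawn from the table's examples, and the name `find_percent_spans` is hypothetical.

```python
import re

SINGLE = r"\d+(?:\.\d+)?%"
# A range relates its two points: "5-9%", "5–9%", "5% through 9%".
RANGE = rf"\d+(?:\.\d+)?%?\s?(?:-|–|through)\s?{SINGLE}"
# The range alternative comes first so that it wins over two single values.
PERCENT_SPAN = re.compile(rf"{RANGE}|{SINGLE}")

def find_percent_spans(text):
    """Return ranges as one span and stand-alone percentages as separate spans."""
    return [m.group(0) for m in PERCENT_SPAN.finditer(text)]

print(find_percent_spans("Rates of 5–9% are typical."))
# ['5–9%']
print(find_percent_spans("The twins got 87% and 89% on their tests."))
# ['87%', '89%']
```

The word "to" is left out of the connector list on purpose, because an expression that describes moving from one value to another yields separate matches.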
Consider the following examples:
Pause and think: Can you identify the matches in the examples above?
Matches include the following:
The second and fourth examples each contain multiple matches because each match can stand alone without loss of meaning. The remaining examples contain one match each.
Multiword expressions that include percent references, such as “percent growth,” “percent yield,” or “percent margin,” and are used in the proximity of numeric values are included as matches in some languages, but not in others; in any case, they should be treated consistently. In the context of broader mathematical or other values or representations, only the percent reference and numeric value it describes are considered a match for Percent.
Consider the following examples:
Pause and think: Can you identify the potential matches in the examples above?
Potential matches include the following:
The first and second examples contain two possible spans for the matches, depending on how multiword expressions are treated. In the SAS predefined concepts, the narrower match has been implemented.
Derivative or related mathematical items, like fractions, ratios, or other parts-per-N expressions, where the percentage relationship is not explicit, are not included as matches.
Nonmatches include the following:
4.5.7. Special Cases for Nonmatches
The percent symbol, when used in the encoding of characters, as a modulus, or as substitution for a white space character as in a path or URL, is not considered a match, even if it is adjacent to a number.
Nonmatches include the following:
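One of these special cases, the percent sign used for character encoding in a path or URL, lends itself to a simple heuristic check. The Python sketch below is an assumption-laden illustration rather than the SAS approach: it treats a percent sign followed by two hexadecimal digits as an escape sequence (such as %20 for a space) and skips it. The name `find_percent_outside_escapes` is hypothetical, and the heuristic does not address the modulus case.

```python
import re

PERCENT = re.compile(r"\d+(?:\.\d+)?%")
# Percent-encoding in a URL or path: "%" followed by two hexadecimal digits.
URL_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def find_percent_outside_escapes(text):
    """Return number-plus-% strings, skipping those where % starts an escape."""
    matches = []
    for m in PERCENT.finditer(text):
        percent_sign_at = m.end() - 1
        if URL_ESCAPE.match(text, percent_sign_at):
            continue  # the "%" begins an escape such as %20, so it is not a Percent
        matches.append(m.group(0))
    return matches

print(find_percent_outside_escapes("Open /reports/q1%20summary.pdf to see the 12% gain."))
# ['12%']
```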
Noun Group consists of a head noun and closely tied modifiers: nominal modifiers, most adjectival modifiers, and some adverbial modifiers. A head noun can be only a common noun, not a pronoun, number, proper noun, or another predefined concept type.
This approach differs from the way that a noun phrase is defined in grammatical theories, natural language processing, and text analytics systems, which have different purposes for noun phrase identification. The goal for Noun Group matches in the SAS processing approach is to identify complex concepts that consist of multiple words or tokens, which can then be used for topic generation and other text analytics tasks. Therefore, unlike noun phrases, Noun Groups do not include pre-determiners, determiners, numerical determiners (quantifiers), or negation adverbials, whether they are words, phrases, or clauses. In some languages, like English, post-head modifiers are also excluded. Furthermore, a bare head noun is not a Noun Group match. For example, only parts of the noun phrases in the following sentence are matches for Noun Group:
The dog’s [speedy recovery] from the five [long days] spent wandering was due to a [kind-hearted old lady], who found him at the [main gate] of her community.
Special constraints that govern whether certain words are included in the Noun Group match serve to prevent the match from becoming too specific (too long) to be useful. Different languages vary in their use of these constraints, but in general, Noun Group matches have no more than two or three modifiers of different part-of-speech tag types. In addition, they do not include conjunctions.
Modifiers joined with conjunctions, as well as conjoined nouns, are not combined into a conjoined phrase.
Consider the following examples:
Pause and think: Can you identify the potential matches in the examples above?
Matches include the following:
The first, third, and fifth examples do not contain modifiers to the nouns and therefore do not produce Noun Group matches.
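The Noun Group constraints described in this section can be approximated over part-of-speech tags. The following Python sketch is illustrative only and makes several assumptions: the tokens are an abridged sequence based on the example sentence earlier in this section, the simplified tags are assigned by hand rather than by a tagger, and the function name `noun_groups` is hypothetical. It collects runs of adjectives and nouns that end in a head noun, stops at determiners, numbers, prepositions, and conjunctions, and discards bare head nouns.

```python
# Each token is paired with a hand-assigned, simplified part-of-speech tag (an assumption).
TAGGED = [
    ("the", "DET"), ("five", "NUM"), ("long", "ADJ"), ("days", "NOUN"),
    ("and", "CONJ"), ("a", "DET"), ("kind-hearted", "ADJ"), ("old", "ADJ"),
    ("lady", "NOUN"), ("at", "ADP"), ("the", "DET"), ("main", "ADJ"),
    ("gate", "NOUN"), ("of", "ADP"), ("her", "PRON"), ("community", "NOUN"),
]

MODIFIER_OR_HEAD = {"ADJ", "NOUN"}

def noun_groups(tagged):
    """Return modifier-plus-head-noun spans; bare nouns are not returned."""
    groups, run = [], []
    # A sentinel token closes the final run.
    for word, tag in tagged + [("", "END")]:
        if tag in MODIFIER_OR_HEAD:
            run.append((word, tag))
            continue
        # The run has ended: trim trailing non-nouns so it ends in a head noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if len(run) >= 2:  # a bare head noun is not a Noun Group
            groups.append(" ".join(w for w, _ in run))
        run = []
    return groups

print(noun_groups(TAGGED))
# ['long days', 'kind-hearted old lady', 'main gate']
```

Because conjunctions end a run, conjoined modifiers or nouns are never merged into a single span in this sketch.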
4.7. Disambiguation of Matches
Accounting for situations in which one single predefined concept match or pattern could fall into multiple categories is one of the key challenges of named entity recognition. Ambiguities between enamex entities were detailed in chapter 3, but there are also ambiguities between enamex and timex entities. Some examples are included below.
“May” can be part of a person’s name or a date:
“April” can be part of a person’s first name, an organization name, or a date:
In addition, the same text string could be a predefined concept match or not. Consider the following sentence (from https://www.bbc.com/sport/cricket/47273785):
Adil Rashid claimed 2-21, Chris Woakes 2-28 and . . .
The numbers in this sentence could be referring to dates in the month of February in the context of, for example, claiming days off from work. In this context, the numbers should be extracted as dates. However, the sentence above comes from a sports context, and in this case extracting dates would be inaccurate, because the rest of the sentence includes “Mark Wood 2-35.” The numbers are referring to cricket players’ statistics and are not timex entities. Similarly, in European data sources, soccer scores are often represented in a format that may match a time, such as “4:10.” It would be inappropriate to extract the final score of a soccer match as a time.
The SAS predefined concepts account for these types of ambiguity by leveraging contextual cues. To give a simple example, when a personal title is encountered in front of a proper noun, it is likely that the proper noun is a person, as in the example “Ms. May.” If, on the other hand, there is a numeral before or after “May,” then it is more likely to be a date, as in “May 5, 2017.”
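The two contextual cues in this simple example can be expressed as a short Python sketch. It is a toy illustration of the idea, not the way the SAS predefined concepts are implemented: the title list, the numeral patterns, and the function name `classify_may` are all hypothetical, and real disambiguation draws on much richer context.

```python
import re

# Cue 1: a personal title directly before "May" suggests a person.
TITLE_BEFORE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+May\b")
# Cue 2: a one- or two-digit numeral directly before or after "May" suggests a date.
NUMERAL_NEARBY = re.compile(r"\bMay\s+\d{1,2}\b|\b\d{1,2}\s+May\b")

def classify_may(text):
    """Guess whether 'May' in text is a person or a date, using two cues."""
    if TITLE_BEFORE.search(text):
        return "person"
    if NUMERAL_NEARBY.search(text):
        return "date"
    return "unknown"

print(classify_may("Ms. May spoke first."))   # person
print(classify_may("Filed on May 5, 2017."))  # date
```

When neither cue is present, the sketch returns "unknown" rather than guessing, which leaves room for additional context-based rules.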
4.8. Supplementing Predefined Concepts
The information about named entities in this chapter may have inspired you to think about augmenting the set of provided concepts with applications specific to your own area of interest. You may have realized that there is information that would be useful to extract but that is not matched in the predefined concepts. To assist you with those tasks, the focus of the next several chapters is creating your own custom concepts using some of the same best practices that are reflected in the predefined concepts.