Chapter 4: SAS Predefined Concepts: Timex, Numex, and Noun Group

4.1. Introduction to Other SAS Predefined Concepts

4.2. Date

4.2.1. Extended ISO 8601 Format

4.2.2. Named Dates

4.2.3. Modifiers

4.2.4. Conjoined Dates

4.2.5. Duration

4.2.6. Vague Expressions

4.3. Time

4.3.1. Extended ISO 8601 Format

4.3.2. Named Times and Time Zones

4.3.3. Modifiers

4.3.4. Conjoined Times

4.3.5. Duration

4.3.6. Vague Expressions

4.4. Money

4.4.1. Modifiers

4.4.2. Rates and Ratios

4.4.3. Quotes and Parentheses

4.4.4. Conjoined Expressions

4.4.5. Approximate Amount

4.4.6. Expressions and Metaphors

4.5. Percent

4.5.1. Acronyms, Initialisms, and Abbreviations

4.5.2. Modifiers

4.5.3. Quotation Marks and Parentheses

4.5.4. Conjoined Expressions

4.5.5. Multiword Expressions

4.5.6. Fractions and Ratios

4.5.7. Special Cases for Nonmatches

4.6. Noun Group

4.7. Disambiguation of Matches

4.8. Supplementing Predefined Concepts

4.1. Introduction to Other SAS Predefined Concepts

As you will recall from chapter 3, SAS provides a set of seven predefined concepts, spanning the three types of entities described in chapter 2:

  • Enamex (Person, Location, Organization), detailed in chapter 3
  • Timex (Date, Time), detailed in sections 4.2 and 4.3
  • Numex (Money, Percent), detailed in sections 4.4 and 4.5

This chapter also includes a description of the predefined grammatical pattern, Noun Group, which aids in the recognition of multiwords and complex concepts. This pattern is detailed in section 4.6.

The rules that comprise the predefined concepts are proprietary and not displayed in the products. But, when you learn more about the principles and assumptions that form the basis for the predefined concept rules, as you do in this chapter, you can more accurately identify when you can leverage them and when custom concepts are a better choice.

4.2. Date

Date is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpDate or another similar name. The generic “Date” label is used in this book because it is an industry standard term and reflects previous names used in SAS products for this concept.

Date matches include patterns that indicate a specific point in time at any granularity from a full day to a larger span of time. Matches can also be a range of points with the following characteristics:

  • Known beginning and ending points
  • Known beginning and ending points plus a frequency of units within the range
  • Known beginning or ending point and the other point is an explicit date
  • Known beginning or ending point plus duration (anchored duration)
  • An explicit or strongly implied reference date plus duration (anchored duration)

A reference date is either the date that the text was written, or the date that the events in the text occurred. In interpreting a possible Date match, the assumption is that the reference date is known, even if it is not explicitly contained in the text. The granularity of that known point extends only to the full day, not to smaller units of time. However, a word like “now” may serve as a reference point in relationships so long as there is another legitimate time match in the phrase.

The point or points in time modeled by a Date match must be specific enough to be able to be plotted on a timeline. A timeline is a graph of time at any level of specificity:

  • Day
  • Week
  • Month
  • Year
  • Decade

The smallest unit that can be a Date is a full day.

Remember: Date includes expressions of time that can be plotted on a timeline and span at least a full day.

Date matches include formal or informal references to dates, usually composed of a named unit or a numerical value combined with at least some unit of time. Named units include the following names and common expressions for time:

  • Days
  • Months
  • Seasons
  • Decade
  • Year
  • Quarter
  • Semester

The match usually encompasses one of the following grammatical categories:

  • Noun
  • Proper noun
  • Noun phrase
  • Adjective or adverbial phrase

Note that the match does not encompass clauses or prepositional phrases. The match is as short as possible without losing meaning. Punctuation is considered part of the Date match only if it is a lexical part of the tokens. Some examples include the following:

  • [6 Oct.]
  • [Aug. 1st]
  • [Dec. 31, 2016]
  • It increased [last May].

Special cases that govern whether certain words are included in the match are described in the following subsections.

4.2.1. Extended ISO 8601 Format

At least one element of the extended ISO 8601 format, the international standard covering the exchange of date- and time-related data, should be explicit. Units larger than a year are also included. In all cases, at least one point in time should be possible to plot on a timeline from the information given in the text plus the assumption of a known reference date.

Consider the following examples:

  • I went home yesterday
  • I recently went home to visit my parents
  • You stayed at my home Friday
  • . . . you stayed at my home for 2 months
  • I want to go now
  • I have great hopes for the future of my grandchildren

Pause and think: Can you identify the matches in the examples above?

Matches include only the following:

  • I went home [yesterday]
  • You stayed at my home [Friday]

The remaining examples do not contain enough information to plot on a timeline. The fourth example is not a match because there is not a known reference date for the start or end of the “2 months” period.
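
The role of the assumed reference date can be illustrated with a small Python sketch. This is not how the proprietary predefined rules are implemented; the function name `plot_on_timeline`, the lookup table `RELATIVE_DAYS`, and the chosen reference date are all hypothetical. The sketch only shows why "yesterday" can be plotted once a reference date is assumed, whereas "recently" cannot.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical lookup of relative day names; not part of any SAS product.
RELATIVE_DAYS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def plot_on_timeline(expression: str, reference: date) -> Optional[str]:
    """Return an extended ISO 8601 date (YYYY-MM-DD) if the expression can be
    plotted from the assumed reference date; otherwise return None."""
    offset = RELATIVE_DAYS.get(expression.lower())
    if offset is None:
        # Expressions such as "recently" or "now" are too vague to plot.
        return None
    return (reference + timedelta(days=offset)).isoformat()


reference_date = date(2016, 12, 31)  # assumed date on which the text was written
print(plot_on_timeline("yesterday", reference_date))  # 2016-12-30
print(plot_on_timeline("recently", reference_date))   # None
```

A weekday name such as "Friday" would need an extra convention (the previous or the next Friday) before it could be resolved the same way, so it is left out of this sketch.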

4.2.2. Named Dates

Named dates are included unless they are clearly a standalone set or nonspecific reference to a type or class of item, and this can be determined by the immediate context. Consider the following examples:

  • We decorate every Christmas
  • We decorated for Christmas
  • We vacationed as we do every October
  • Next year we will vacation in October
  • October is my favorite
  • Yearly in October, we plan a vacation
  • . . . in May last year
  • . . . through the Fourth Quarter
  • The New Year’s Day tradition

Pause and think: Can you identify the matches in the examples above?

Matches include only the following:

  • We decorated for [Christmas]
  • [Next year] we will vacation in [October]
  • [October] is my favorite
  • . . . in [May last year]
  • . . . through [the Fourth Quarter]

Commonly understood slang or cultural references to dates, as well as references in titles, are included so long as they can be plotted on a timeline with an assumed reference date.

Consider the following examples:

  • Wear your Sunday best
  • The dog days of summer are here
  • During the previous school year
  • Next weekend
  • “Summer of ’69” is one of Bryan Adams’ most popular songs

Pause and think: Can you identify the matches in the examples above?

Matches include only the following:

  • During the [previous school year]
  • [Next weekend]
  • “[Summer of ’69]” is one of Bryan Adams’ most popular songs

The first two examples cannot be plotted on a calendar, so they are not matches for the Date concept.

Common nouns signifying events are excluded from matches unless a date stands for an event. Consider the following examples:

  • My birthday
  • Her September 2 birthday
  • . . . on September 11
  • Your wedding
  • The June 4 problem

Pause and think: Can you identify the matches in the examples above?

Matches include only the following:

  • Her [September 2] birthday
  • . . . on [September 11]
  • The [June 4] problem

Note that the remaining examples are not matches because they include only common nouns.

4.2.3. Modifiers

Leading or trailing modifiers that bring a more accurate understanding of how to plot the time expression on a timeline are included. This principle applies particularly to modifiers that express that the date is no later than, no earlier than, approximate to, after, or before a given date, or is a specified subset of a given date. However, leading prepositions or phrasal post-modifiers are not generally included unless they help clarify a relationship between multiple points. A vague term like “now” may be part of a range if the other part is a true Date, but not if both are vague.

Consider the following examples:

  • On approximately May 1st
  • Before the summer of ’69
  • In the fall of 1992
  • In the first 5 days of April
  • Less than a year ago
  • No less than a year ago
  • We travelled most of the week
  • We travelled much of last week
  • Both now and in the future
  • He left after the holiday
  • It will get fixed between now and Monday morning

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • On [approximately May 1st]
  • [Before the summer of ’69]
  • In [the fall of 1992]
  • In [the first 5 days of April]
  • [Less than a year ago]
  • [No less than a year ago]
  • We travelled much of [last week]
  • It will get fixed [between now and Monday morning]

In the remaining examples, the references to “week” and “holiday” are not specific enough to be plotted on a calendar and therefore are not matches for Date.

4.2.4. Conjoined Dates

Two or more separate date expressions are considered one match for the Date predefined concept if they are adjacent (or separated only by text that relates them) and the relationship is hierarchical. If overlapping or elided material exists between two expressions, then they are related and should always be identified as one match. They are also considered as one match if each point contributes to the understanding of the span of time under discussion, unless there are more than several words of intervening, unrelated material. This applies to range relationships and conjoined dates that could be interpreted as a range, where the ordering of the points is relevant and cannot be reversed without impacting the meaning. In a possessive construction, if both the possessive phrase and the phrase that it modifies are temporal expressions, then they are identified together as a single match. In all these cases, the Date expressions indicate one point in time. Comparative examples are provided in Table 4.1.

Table 4.1. One or More Matches for Date

One Match for Date | Multiple Matches for Date
The test was given [last week on Monday and Wednesday, but not Friday] | The test will be given on [Monday], [Wednesday], [Sunday], and [Tuesday]
[Every Thursday in October] | [Yesterday], [today] and [tomorrow] the stock rose a point

Consider the following examples:

  • . . . in March of this year
  • We will be on break from July 1-5 this year
  • . . . in the fall of 1992
  • This year’s summer was unusually hot
  • My birthday is on August 8 and October 27th is my brother’s birthday

Pause and think: Can you identify the matches for Date in the examples above?

Matches include the following:

  • . . . in [March of this year]
  • We will be on break [from July 1-5 this year]
  • . . . in [the fall of 1992]
  • [This year’s summer] was unusually hot
  • My birthday is on [August 8] and [October 27th] is my brother’s birthday

Note that the second example produces a single match and the final example produces multiple matches. The former is a range, whereas the latter is a series of separate dates.

4.2.5. Duration

If the time expression is a better answer for the questions “How long” or “How often” rather than “When,” it is not a match for Date. However, duration can be included in the Date concept if it is directly adjacent to a Date and helps in plotting the Date on a timeline.

Consider the following examples:

  • I’m leaving on vacation two weeks from next Tuesday
  • In September, we finally went to the show, after a three-month wait for tickets
  • Every Tuesday this year, we went to the zoo
  • His application was being processed for 10 years and he finally became a citizen on July 4th

Pause and think: Can you identify the matches for Date in the examples above?

Matches for Date, including any adjacent duration that helps in plotting the Date, are the following:

  • I’m leaving on vacation [two weeks from next Tuesday]
  • In [September], we finally went to the show . . .
  • [Every Tuesday this year], we went to the zoo
  • . . . he finally became a citizen on [July 4th]

The following portions of the examples above refer to duration and are therefore not matches for the Date concept:

  • . . . after a three-month wait for tickets
  • His application was being processed for 10 years . . .

4.2.6. Vague Expressions

Expressions that cannot be plotted on a timeline explicitly, because they are underspecified or refer to implicit time, are excluded from matches as Date.

Nonmatches include the following:

  • For 4 months
  • During two entire days
  • Every two days
  • In recent decades
  • In the past
  • Now
  • For at least the next year or two
  • Over the coming months
  • A few months ago
  • Recently
  • The last 4 days of the festival
  • On a Tuesday
  • Not long ago

Similarly, words like “now,” “today,” or “tomorrow” are excluded from matches as Date when they have the generic meanings of “these days,” “nowadays,” or “in the future.”

4.3. Time

Time is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpTime or another similar name. The generic “Time” label is used in this book because it is an industry standard term and reflects previous names used in SAS products for this concept.

Time expressions include patterns that indicate a point in time at any granularity smaller than a full day. Matches can also be a range of points with the following characteristics:

  • Known beginning and ending points
  • Known beginning and ending points plus a frequency of units within the range
  • Known beginning or ending point and the other point is an explicit time reference
  • Known beginning or ending point plus duration (anchored duration)
  • Explicit or strongly implied reference date plus duration (anchored duration)

A reference date is either the date that the text was written, or the date that the events in the text occurred. In interpreting a possible Time match, the assumption is that the reference date is known, even if it is not explicitly contained in the text. The granularity of that known point extends only to the full day, not to smaller units of time. However, a word like “now” may serve as a reference point in relationships so long as there is another legitimate time match in the phrase.

The point or points in time must be able to be plotted on a timeline, which is a graph of time at any level of specificity smaller than a full day. The largest unit that can be a Time match is part of a day.

The matches for Time include formal or informal references to times, usually comprising a named unit, or a numerical value combined with at least some unit of time, which may be implicit from context. Named units of time include the following:

  • Morning
  • Night
  • Hour
  • Minute
  • Second
  • Noon
  • Midday
  • Midnight

The reference could also be a pattern of numbers and punctuation. Punctuation is considered part of the Time match only if it is a lexical part of the tokens. For example, consider the following:

  • [5 a.m.]
  • [12:00]
  • She arrived at [8pm]

Remember: Time includes expressions of time that can be plotted on a timeline and are shorter than a full day.
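
The granularity boundary between the two Timex concepts can be summarized in a minimal Python sketch. The unit list and the function name `concept_for_unit` are illustrative assumptions, not part of the SAS rules, and the sketch covers only granularity: it ignores the further requirement that an expression be anchored well enough to plot on a timeline.

```python
# Units ordered from smallest to largest; the list itself is illustrative.
UNITS = ["second", "minute", "hour", "day", "week", "month", "year", "decade"]


def concept_for_unit(unit: str) -> str:
    """Map a unit of time to the concept whose granularity covers it."""
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit}")
    # A full day or anything larger belongs to Date; anything smaller to Time.
    return "Date" if UNITS.index(unit) >= UNITS.index("day") else "Time"


print(concept_for_unit("hour"))    # Time
print(concept_for_unit("day"))     # Date
print(concept_for_unit("decade"))  # Date
```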

Special cases that govern whether certain words are included in the match are described in the following subsections.

4.3.1. Extended ISO 8601 Format

At least one element of the extended ISO 8601 format, the international standard covering the exchange of date- and time-related data, should be explicit enough to plot on a timeline from the information given in the text plus the assumption of a known reference date. Consider the following examples:

  • I want to go right now
  • He was 15 minutes late
  • He will arrive at 2:00
  • I leave at 16:00

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • He will arrive at [2:00]
  • I leave at [16:00]

The remaining two examples are not specific enough to plot on a timeline, because there is no known reference point for “now” and “15 minutes late.”
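
As a rough illustration, the clock times in these examples follow the hh:mm pattern of the extended ISO 8601 format, which a simple regular expression can detect. The pattern and the function name `find_clock_times` below are assumptions made for this sketch; the predefined Time concept covers far more than this one pattern.

```python
import re
from typing import List

# Hours 0-23 and minutes 00-59 in the colon-separated (extended) form.
CLOCK_TIME = re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b")


def find_clock_times(text: str) -> List[str]:
    """Return every hh:mm clock time found in the text."""
    return CLOCK_TIME.findall(text)


print(find_clock_times("He will arrive at 2:00"))  # ['2:00']
print(find_clock_times("I leave at 16:00"))        # ['16:00']
print(find_clock_times("He was 15 minutes late"))  # []
```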

4.3.2. Named Times and Time Zones

Time zones, when present, are included in the scope of the match. Names of times are included unless they are clearly a standalone set or nonspecific reference to a type or class of item, and this can be determined by the immediate context. Commonly understood slang or cultural references to time periods, as well as references in titles, are included so long as they can be plotted on a timeline.

Consider the following examples:

  • It starts at 8 ET
  • She arrives at 1 pm CST
  • Rush hour
  • We will be done by noon
  • It ended at midnight
  • Good morning
  • He naps every afternoon
  • 24-hour gym
  • Primetime
  • The mail arrives every morning
  • The wee hours of the morning
  • Happy hour
  • The bottom of the hour
  • Eleventh hour decision
  • At the last minute
  • I saw the film “Last night”
  • “Minute to win it”
  • “60 minutes”
  • “Midnight in the Garden of Good and Evil”

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • It starts at [8 ET]
  • She arrives at [1 pm CST]
  • [Rush hour]
  • We will be done by [noon]
  • It ended at [midnight]
  • [Primetime]
  • [The wee hours of the morning]
  • [Happy hour]
  • [The bottom of the hour]
  • I saw the film “[Last night]”
  • “[Midnight] in the Garden of Good and Evil”

Note that cultural references to a specific time of day such as “rush hour” and “happy hour” are included in the matches, but phrases such as “good morning” and “eleventh hour decision” are not, because they cannot be plotted on a timeline.

4.3.3. Modifiers

Leading or trailing modifiers that bring a more accurate understanding of how to plot the time expression on a timeline are included. This principle applies particularly to modifiers that express that the time is no later than, no earlier than, approximate to, after, or before a given time, or is a specified subset of a given time. However, leading prepositions or phrasal post-modifiers are not generally included unless they help clarify a relationship between multiple points. A vague term like “now” may be part of a range if the other part is a true Time, but not if both are vague.

Consider the following examples:

  • It may last from a few minutes to a few hours
  • At half past three
  • From 2:00 onwards
  • From now on
  • Between 6:00 and 8:00
  • From now until 1pm
  • It was about 5 hours yesterday afternoon
  • By around 5:00

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • At [half past three]
  • From [2:00] onwards
  • [Between 6:00 and 8:00]
  • [From now until 1pm]
  • It was [about 5 hours yesterday afternoon]
  • By [around 5:00]

The remaining examples are too vague to be plotted on a timeline.

4.3.4. Conjoined Times

Two or more separate Time expressions are considered one match if they are adjacent (or separated only by text that relates them) and the relationship is hierarchical. If overlapping or elided material exists between two entities, then they are related and should always be identified as one match. They are also considered as one match if each point contributes to the understanding of the span of time under discussion, unless there are more than several words of intervening, unrelated material. This applies to range relationships and conjoined times that could be interpreted as a range; in other words, the ordering of the points is relevant and cannot be reversed without impacting the meaning. In a possessive construction, if both the possessive phrase and the phrase that it modifies are temporal expressions, then they are identified together as a single match. In all these cases, the Time expressions indicate one point in time. Some illustrative examples are presented in Table 4.2.

Table 4.2. One or More Matches for Time

One Match for Time | Multiple Matches for Time
We had tests on [Monday at 9:00 AM, at 10:00 AM, and at 11:00 AM] | We had tests on [Monday at 9:00 AM], [Tuesday at 10:00 AM], and [Wednesday at 11:00 AM]
. . . on [Friday morning] | There were doughnuts at the [8:00] meeting [this morning]

Consider the following examples:

  • It was about 5 hours yesterday afternoon
  • Twelve o’clock January 3, 1984
  • He left between 6:00 p.m. and 8:00 p.m.
  • After 9PM and before 2AM
  • At 5:15 PM on Tuesday and 5 PM on Thursday
  • At eleven in the morning

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • It was [about 5 hours yesterday afternoon]
  • [Twelve o’clock January 3, 1984]
  • He left [between 6:00 p.m. and 8:00 p.m.]
  • [After 9PM and before 2AM]
  • At [5:15 PM on Tuesday] and [5 PM on Thursday]
  • At [eleven in the morning]

The only example that contains multiple matches is the fifth one because it refers to two distinct times on two different days.

4.3.5. Duration

If the time expression denotes duration and is a better answer for the questions “How long” or “How often” rather than “When,” it is not a match for Time. However, duration can be included in the Time predefined concept match if it is directly adjacent to a Time and helps in plotting the Time on a timeline.

Consider the following examples:

  • On Monday, we had to wait 20 minutes for the professor
  • For 20 minutes last Monday, we waited for the professor
  • Two minutes
  • 5 hours
  • Dinner is from five to six pm tomorrow
  • The class is 3-6 pm today

Pause and think: Can you identify the matches for Time in the examples above?

Matches include the following:

  • On [Monday], we had to wait 20 minutes for the professor
  • Dinner is [from five to six pm tomorrow]
  • The class is [3-6 pm today]

The following portions of the examples above refer to duration and are therefore not matches for the Time concept:

  • . . . we had to wait 20 minutes for the professor
  • Two minutes
  • 5 hours

4.3.6. Vague Expressions

Like vague expressions of dates, expressions containing time references that cannot be plotted on a timeline explicitly, because they are underspecified or refer to implicit time, are excluded from Time matches. Some examples of nonmatches include “1 second later” and “a few hours earlier.”

4.4. Money

Money is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpMoney or another similar name using the term “Currency.” The generic “Money” label is used in this book.

Money expressions include any explicit or implied numeric value with a monetary denomination or monetary unit symbol. Explicit or implied numeric values can be any of the following:

  • Digits
  • Number words
  • Fractions
  • Decimals
  • Numeric quantifiers

Numeric quantifiers include determiners and other quantifiers for which a number could be substituted grammatically (implied numeric amount) with the same or very similar meaning: “one,” “a,” “a few,” and so on. Monetary denominations include any official term or abbreviation for currency in any country (“dollar,” “quarter,” “dime,” “peso”), but not slang terms for money or amounts of money (“quid,” “bucks,” “dough,” “clams,” “Benjamins,” “five-spots,” “fivers,” “moolah,” “greenbacks,” “grand,” “large”).

The match includes the entire string expressing the monetary value: all tokens between the value and denomination or symbol, inclusive within the bounds of a single phrase. For example, matches include the following:

  • [One and a half million dollars]
  • [$10]
  • [0.1 cent]
  • [Twenty-something dollars]

If the match of the monetary value and the currency is separated by more than a phrase or short clause, then the matched string may include only the monetary value, and the currency may play the role of context only.

However, generic or implied references to money are not specific enough, so the following examples are not matches:

  • There was a lot of pesos on the table
  • There were many dollars at risk
  • The dollar fell against the yen

Remember: Money includes expressions of numeric value with a denomination or monetary unit symbol.
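
The core requirement (a numeric value plus an official denomination or currency symbol) can be sketched with a regular expression in Python. This is only a toy approximation under stated assumptions: the names `DENOMINATIONS`, `NUMBER`, and `find_money` are hypothetical, the denomination list is a tiny subset, and number words and numeric quantifiers such as "a few" are not handled.

```python
import re
from typing import List

# Small illustrative subset of official denominations; slang such as
# "bucks" or "quid" is deliberately absent.
DENOMINATIONS = r"(?:dollars?|cents?|pesos?|dimes?|quarters?)"
NUMBER = r"\d+(?:[.,]\d+)*"

# Either a dollar sign followed by digits, or digits followed by a denomination.
MONEY = re.compile(rf"\${NUMBER}|{NUMBER}\s+{DENOMINATIONS}\b")


def find_money(text: str) -> List[str]:
    """Return digit-based money expressions found in the text."""
    return MONEY.findall(text)


print(find_money("It costs $10"))                     # ['$10']
print(find_money("It is worth 0.1 cent"))             # ['0.1 cent']
print(find_money("He owes me 20 bucks"))              # []
print(find_money("The dollar fell against the yen"))  # []
```

The last two calls return nothing because one uses a slang term and the other has no numeric value, which mirrors the exclusions described above.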

Special cases that govern whether certain words are included in the match are described in the following subsections.

4.4.1. Modifiers

Modifiers that indicate the multiplied value of a unit should be included when the expression remains grammatical and keeps a similar meaning if a digit is substituted for the word or words. In other words, some quantifiers may take the place of the numerical value. A minus sign or words like “minus” and “negative” should also be included in the expression.

Consider the following examples:

  • There were several 10-dollar bills in my wallet
  • There were several bills in my wallet
  • A few million dollars fell
  • There were no dollars left of my paycheck
  • I received a million dollars
  • There was minus 15 dollars in the account
  • −12 billion dollars

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • There were [several 10-dollar bills] in my wallet
  • [A few million dollars] fell
  • There were [no dollars] left of my paycheck
  • I received [a million dollars]
  • There was [minus 15 dollars] in the account
  • [−12 billion dollars]

Modifying words that indicate the approximate value of a number or relative position, as well as verbs and prepositions outside the boundaries of a value and monetary denomination or symbol, are not included. However, modifiers which indicate the value is a maximum or minimum of a range of values (inclusive or exclusive of given value) are included in the match. Some examples of such modifiers include the following:

  • Over
  • Above
  • More than
  • Below
  • Under
  • Less than
  • Maximum of

If a modifier occurs in the middle of an expression within the same phrase or sentence as the value and currency marker, then the modifier is included in the match. Consider the following examples:

  • Over $5 were lost
  • Just under $24 million
  • Raised more than five million dollars
  • He had barely $6 to his name
  • The cost was about $20 too high
  • She lost almost 50 dollars in chips
  • 12 big bad million dollars
  • 11 stinking cents

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • [Over $5] were lost
  • Just [under $24 million]
  • Raised [more than five million dollars]
  • He had [barely $6] to his name
  • The cost was about [$20] too high
  • She lost [almost 50 dollars] in chips
  • [12 big bad million dollars]
  • [11 stinking cents]

Note that in the fifth example, the modifier “about” is not included in the match, because it does not provide any information beyond the sum itself that could be plotted on a number line.

4.4.2. Rates and Ratios

In rate expressions, the unit is included in the matched string.

Ratios of currencies to each other are excluded from Money matches. These ratios do not indicate exact or approximate amounts of money, but only a relationship between types of money.

Consider the following examples:

  • $3 per share
  • 11 cents/unit
  • From highs above 0.7700 on Thursday, AUD/USD has fallen sharply to 0.7500
  • US$2-per-day
  • USD/CAD has strengthened from lows below 1.2850 to trade above 1.3100
  • $12 per person
  • NZD/USD has moved from testing 15-month highs at 0.7500 to test the 0.7300 support area

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • [$3 per share]
  • [11 cents/unit]
  • [US$2-per-day]
  • [$12 per person]

The remaining examples do not produce matches, because the ratios are comparing currencies rather than expressing an amount of money.
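
The distinction between a rate (an amount of money per unit) and a currency-pair ratio can be sketched in Python as follows. The pattern names and `find_money_rates` are hypothetical, and the pattern handles only digit-based amounts in dollars and cents; it is an illustration of the principle, not the proprietary rule set.

```python
import re
from typing import List

NUMBER = r"\d+(?:[.,]\d+)*"
# An amount needs a currency symbol or a denomination next to the number.
AMOUNT = rf"(?:(?:US)?\${NUMBER}|{NUMBER}\s+(?:dollars?|cents?))"
# Optional rate unit: " per share", "-per-day", or "/unit".
RATE_UNIT = r"(?:(?:\s+per\s+|-per-|/)[A-Za-z]+)?"
MONEY_RATE = re.compile(AMOUNT + RATE_UNIT)


def find_money_rates(text: str) -> List[str]:
    """Return money amounts, including any rate unit attached to them."""
    return MONEY_RATE.findall(text)


print(find_money_rates("$3 per share"))   # ['$3 per share']
print(find_money_rates("11 cents/unit"))  # ['11 cents/unit']
print(find_money_rates("US$2-per-day"))   # ['US$2-per-day']
print(find_money_rates("AUD/USD has fallen sharply to 0.7500"))  # []
```

The currency-pair sentence yields no match because its numbers carry neither a currency symbol nor a denomination; they describe a relationship between currencies rather than an amount.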

4.4.3. Quotes and Parentheses

A quoted or parenthesized number or other information is included in the match when it is in the same phrase with a numerical value and a denomination or monetary unit symbol.

Consider the following examples:

  • Above eighty (80) dollars
  • 6 “six” cents
  • After zero (that means none) dollars in fines

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • Above [eighty (80) dollars]
  • [6 “six” cents]
  • After [zero (that means none) dollars] in fines

Note that in all three cases, the information between the amount and currency is included in the match.

4.4.4. Conjoined Expressions

Two or more Money expressions that are adjacent (or separated only by text that relates them) are considered one match if any of the following conditions are satisfied:

  • They are hierarchically related
  • Overlapping or elided material exists between two entities
  • They express a relationship between values in two different currencies or the same value in digits and words
  • Their order is relevant and impacts the meaning

In these cases, leading prepositions or modifiers that clarify the relationship between the expressions are included in the match, as shown in the left column of Table 4.3. But if the expressions describe moving from one value to another or if there are more than several intervening, unrelated words, then each point is considered a separate Money match, as shown in the right column of Table 4.3. Money matches that do not have relating or elided material are also considered separate matches when each can stand alone and retains its meaning.

Table 4.3. One or More Matches for Money

One Match for Money | Multiple Matches for Money
[Seventeen and then almost eighteen dollars] | I had [$5] and then later [$2] in my wallet
[Nine dollars and ten cents] more | [eleven cents] and [twelve cents]

Consider the following examples:

  • We made $700, $1200, and $600 for each of the three jobs
  • The cost can be anywhere from $12 through $20
  • 5–600 dollars
  • Spent 15.6 billion pesos (over US $900 million)
  • #26 million ($43.6 million)
  • Above eighty (80) dollars
  • 6 “six” cents
  • $2,000–$3,000 million dollars
  • 7–10 dollars

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • We made [$700], [$1200], and [$600] for each of the three jobs
  • The cost can be anywhere [from $12 through $20]
  • [5–600 dollars]
  • Spent [15.6 billion pesos (over US $900 million)]
  • [#26 million ($43.6 million)]
  • Above [eighty (80) dollars]
  • [6 “six” cents]
  • [$2,000–$3,000 million dollars]
  • [7–10 dollars]

Note that the first example contains multiple matches for Money because each can stand alone and retain its meaning. Each of the remaining examples contains a single match.

4.4.5. Approximate Amount

A value + currency adjectival construction or other construction that leaves part of the value open-ended is included, even if the exact amount is not clear, so long as the approximate amount can be inferred. An imprecise value is still counted as a value if it contains a numeric reference.

Consider the following examples:

  • He lost the team millions
  • Many dollars were lost
  • A million-dollar conference party was offered
  • The whole budget
  • Tens of billions of dollars were donated
  • Every dollar I had
  • Fortunes were lost
  • Many millions of dollars were lost
  • All the cash

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • A [million-dollar] conference party was offered
  • [Tens of billions of dollars] were donated
  • Many [millions of dollars] were lost

The remaining examples are not matches because an approximate amount cannot be inferred.

4.4.6. Expressions and Metaphors

References to money in standard expressions or metaphors should be analyzed to determine whether an amount of money is really stated explicitly and whether the meaning has drifted so far from that amount that it is no longer valid to treat the value as a Money match. Consider the following examples:

  • In for a penny, in for a pound
  • Penny whistle
  • Penny candy
  • A penny for your thoughts
  • Penny pincher
  • A pretty penny
  • A penny saved is a penny earned
  • Pennyweight
  • On a dime
  • A day late and a dollar short
  • The almighty dollar
  • A dime a dozen
  • To nickel and dime someone
  • Be two a penny
  • Phony as a three-dollar bill
  • Feel like a million dollars
  • I wouldn’t give 2 cents/pennies for that

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • [A penny] for your thoughts
  • [A penny] saved is [a penny] earned
  • [A dime a dozen]
  • Be [two a penny]
  • Feel like [a million dollars]
  • I wouldn’t give [2 cents/pennies] for that

For the remaining examples, the meaning has drifted from an explicit reference to an amount to a more general metaphorical meaning.

4.5. Percent

Percent is a predefined concept provided by the SAS linguists. Note that the name of this concept in your product may be nlpPercent or another similar name. The generic “Percent” label is used in this book because it is an industry standard term and reflects previous names used in SAS products for this concept.

Percent expressions include an explicit or implied numeric amount and a percentage reference. A numeric amount can be expressed with a number, word, or phrase; numeric quantifier; digit; fraction; or decimal. A percentage reference includes words and symbols with the meaning of “percent,” including the following:

  • Percentage point
  • Percentile
  • Quantile
  • Centile
  • Percentile rank
  • %

A numeric quantifier includes determiners and other quantifiers for which a number could be substituted grammatically (implied numeric amount) with the same or very similar meaning: “one,” “a,” “a few,” and the like.

The match includes the entire string expressing the percentage value: all tokens from the value through the percent reference, inclusive, within the bounds of a single phrase. If the numeric amount and the percent marker are separated by more than a phrase or short clause, then the matched string may include only the numeric amount, and the percent marker may serve as context only. For example, matches include the following:

  • [12 percentage points]
  • [1 ¾ percent]
  • A fixed [106 7/8%]
  • [50-something percent]
  • [Eighty-eight percent]
  • [One and a half percent]
  • [10%]
  • [.9%]
  • A [percentage rate of 0.51]
  • The [75th percentile] of the wage distribution

If there is no explicit percentage term within the scope of the same sentence as the numeric value, there is no match for Percent. Compare the preceding matches to the following nonmatches:

  • 12 points
  • 1.5 times
  • About one-third of
  • Fees 1 ¾
  • A fixed 106 7/8
  • Priced at 99 ¼

Similarly, if there is no numeric value or numeric quantifier within the scope of the same phrase or sentence as the percentage term, then there is no match for Percent. If the quantifier cannot be easily substituted for a number without further context, it is too subjective to be a numeric quantifier. Therefore, compare the following matches and nonmatches:

  • [A percentage point]
  • [Several percentage points] down
  • [A few percent] higher
  • Up [several tenths of a percent]
  • The rate goes up many percentage points
  • All percentage discussions

Remember: Percent includes expressions of numeric value with a percent reference.
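
The core requirement (a numeric amount or numeric quantifier plus a percentage reference in the same phrase) can be sketched in Python as follows. This is only an approximation under stated assumptions: the NUMBER, QUANTIFIER, and PERCENT_REF lists and the percent_matches function are hypothetical, cover only a few of the forms shown above, and do not reproduce the proprietary SAS rules.

```python
import re

# Hypothetical, hand-picked vocabulary for illustration only.
NUMBER = r"(?:\d+(?:\.\d+)?|\.\d+)"
QUANTIFIER = r"(?:a few|several|one|a)"
PERCENT_REF = r"(?:percentage points?|percentile|percent|%)"

PERCENT = re.compile(
    rf"(?<![\w.])(?:{NUMBER}(?:th)?|{QUANTIFIER})\s*{PERCENT_REF}(?!\w)",
    re.IGNORECASE,
)


def percent_matches(text):
    """Return spans that contain both a value and a percentage reference."""
    return [m.group(0) for m in PERCENT.finditer(text)]


print(percent_matches("Several percentage points down"))
print(percent_matches("The rate goes up many percentage points"))
```

Because "many" is not in the quantifier list, the second call returns no match, which reflects the point that a quantifier must be easily replaceable by a number. Spelled-out numbers and fractions, such as "Eighty-eight percent" or "1 ¾ percent," would need additional patterns that this sketch omits.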

Special cases that govern whether certain words are included in the match are described in the following subsections.

4.5.1. Acronyms, Initialisms, and Abbreviations

Acronyms and initialisms are not included as matches unless spelled out. However, abbreviations are included. Matches include “[zero annual percentage rate]” and “[6 PCT] higher than last year.” Nonmatches include “zero APR.”

4.5.2. Modifiers

Modifying words that indicate the approximate value of a number or relative position, as well as verbs and prepositions outside the boundaries of a value and percent reference, are not included. However, modifiers that indicate that the value is a maximum or minimum of a range of values (inclusive or exclusive of the given value) are included in the match. Some examples of such modifiers include the following:

  • Over
  • Above
  • More than
  • Below
  • Under
  • Less than
  • Maximum of

If a modifier occurs in the middle of an expression within the same phrase or sentence as the value and percent reference, then the modifier is included in the match. A minus sign or words like “minus” or “negative” are included in the match.

Consider the following examples:

  • At least 5% of the students passed
  • About a percent
  • It was over 10% of what we earned last year
  • Up 6 PCT from last year
  • Barely 8% over predicted value
  • Almost 9/10th of a percent
  • One half of one tiny percent difference
  • Almost ½ a percent
  • Nearly 40 percent of Americans
  • At minus 15 percent
  • Generated a negative 1% return

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • [At least 5%] of the students passed
  • About [a percent]
  • It was [over 10%] of what we earned last year
  • Up [6 PCT] from last year
  • [Barely 8%] over predicted value
  • [Almost 9/10th of a percent]
  • [One half of one tiny percent] difference
  • [Almost ½ a percent]
  • [Nearly 40 percent] of Americans
  • At [minus 15 percent]
  • Generated a [negative 1%] return

Note that the preposition “about” in the second example is not included in the match because it does not add any additional specification to the percentage amount that could be plotted on a number line.

4.5.3. Quotation Marks and Parentheses

A quoted or parenthesized number or other information is included in the match when it is in the same phrase with a numerical value and a percent reference.

Consider the following examples:

  • After zero (that means none) % growth
  • Above eighty (80) percent
  • Six “6” percent

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • After [zero (that means none) %] growth
  • [Above eighty (80) percent]
  • [Six “6” percent]

Note that in all three cases, the information between the amount and percent is included in the match.

4.5.4. Conjoined Expressions

Two or more adjacent Percent expressions (or expressions separated only by text that relates them) are considered one match if overlapping or elided material exists between them, or if, in context, each point contributes to the understanding of the span of percentage points under discussion (as in ranges or in conjoined expressions that can be interpreted as ranges), as shown in the left column of Table 4.4.

In this case, leading prepositions or modifiers that clarify the relationship between the two amounts are included in the match. But if the expressions describe moving from one value to another, or if more than a few unrelated words intervene, then each point is considered a separate Percent match. Percent matches that do not have relating or elided material are also considered separate matches when each can stand alone and retain its meaning, as shown in the right column of Table 4.4.

Table 4.4. One or More Matches for Percent

One Match for Percent    Multiple Matches for Percent
---------------------    --------------------------------------------
[5–9%]                   [5%], [112%], [18%] or [22%] respectively
[5% through 9%]          The twins got [87%] and [89%] on their tests

Consider the following examples:

  • Between 6% and 17% higher than yesterday
  • By a factor of maybe 200% or 250%
  • Up almost 5–6%
  • 20%, 25%, 30% tint pictures
  • 11 and then almost 12 percentage points
  • Every 10 or 20 mole percent KCl

Pause and think: Can you identify the matches in the examples above?

Matches include the following:

  • [Between 6% and 17%] higher than yesterday
  • By a factor of maybe [200%] or [250%]
  • Up [almost 5–6%]
  • [20%], [25%], [30%] tint pictures
  • [11 and then almost 12 percentage points]
  • Every [10 or 20 mole percent] KCl

The second and fourth examples contain multiple matches in each example because each of the matches can stand alone and meaning is not lost. The remaining examples contain one match per example.
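
The distinction between one conjoined match and several stand-alone matches can be sketched in Python. The sketch assumes that only a dash, "through," and the "between ... and ..." frame signal a range; that connector list, the SPAN pattern, and the percent_spans function are illustrative assumptions rather than the SAS rules, and leading modifiers such as "almost" are ignored.

```python
import re

NUM = r"\d+(?:\.\d+)?"
PCT = rf"{NUM}\s?%"

# Alternatives are tried in order: range forms first, then a single value.
SPAN = re.compile(
    rf"between\s+{PCT}\s+and\s+{PCT}"           # between 6% and 17%
    rf"|{NUM}\s?%?\s?(?:–|-|through)\s?{PCT}"   # 5–9%, 5% through 9%
    rf"|{PCT}",                                 # stand-alone value
    re.IGNORECASE,
)


def percent_spans(text):
    """Return range expressions as one span and listed values as separate spans."""
    return [m.group(0) for m in SPAN.finditer(text)]


print(percent_spans("Between 6% and 17% higher than yesterday"))
print(percent_spans("The twins got 87% and 89% on their tests"))
```

Placing the range alternatives before the single-value alternative is the key design choice: the same "and" that joins two separate values in the second call is absorbed into one span in the first call only because "between" frames it as a range.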

4.5.5. Multiword Expressions

Multiword expressions that include percent references, such as “percent growth,” “percent yield,” or “percent margin,” and are used in the proximity of numeric values are included as matches in some languages, but not in others; in any case, they should be treated consistently. In the context of broader mathematical or other values or representations, only the percent reference and numeric value it describes are considered a match for Percent.

Consider the following examples:

  • By a 35 percent growth
  • With 85 percent yield in a gas recycle
  • 5.2 ± 5.4%
  • Standard deviation is 2.3% of the mean of 4.4

Pause and think: Can you identify the potential matches in the examples above?

Potential matches include the following:

  • By a [35 percent] growth
  • By a [35 percent growth]
  • With [85 percent] yield in a gas recycle
  • With [85 percent yield] in a gas recycle
  • 5.2 ± [5.4%]
  • Standard deviation is [2.3%] of the mean of 4.4

The first and second examples contain two possible spans for the matches, depending on how multiword expressions are treated. In the SAS predefined concepts, the narrower match has been implemented.

4.5.6. Fractions and Ratios

Derivative or related mathematical items, like fractions, ratios, or other parts-per-N expressions, where the percentage relationship is not explicit, are not included as matches.

Nonmatches include the following:

  • 5 out of 10 children
  • 2/5 of the pieces of fruit are oranges
  • The amount of orange juice concentrate is 1/5 of the total liquid
  • The presence of two molar proportions
  • 2‰ (per mille)

4.5.7. Special Cases for Nonmatches

The percent symbol is not considered a match when it is used in the encoding of characters, as a modulus operator, or as a substitution for a whitespace character in a path or URL, even if it is adjacent to a number.

Nonmatches include the following:

  • Fran%c3%a7ois
  • http://www.edg.com/true&width=80%&height=80%
  • http://call.co/app/?q=php20%
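
One way to approximate this exclusion in your own preprocessing is to discard tokens that look like URLs or percent-encoded text before looking for percentages. The Python sketch below assumes whitespace tokenization and a digit-plus-symbol pattern; the function and pattern names are hypothetical, and the modulus case is not handled.

```python
import re

URL = re.compile(r"https?://\S+", re.IGNORECASE)
PERCENT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")  # for example, %c3 or %20
SIMPLE_PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def percent_candidates(text):
    """Return digit-plus-% spans that are not in URLs or encoded tokens."""
    results = []
    for token in text.split():
        # Skip the whole token if it is a URL or contains an encoded character.
        if URL.match(token) or PERCENT_ENCODED.search(token):
            continue
        results.extend(SIMPLE_PERCENT.findall(token))
    return results


print(percent_candidates("Sales rose 10% today"))
print(percent_candidates("See http://www.edg.com/true&width=80%&height=80%"))
```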

4.6. Noun Group

Noun Group consists of a head noun and closely tied modifiers: nominal modifiers, most adjectival modifiers, and some adverbial modifiers. A head noun can be only a common noun, not a pronoun, number, proper noun, or another predefined concept type.

This approach differs from the way that a noun phrase is defined in grammatical theories, natural language processing, and text analytics systems, which have different purposes for noun phrase identification. The goal for Noun Group matches in the SAS processing approach is to identify complex concepts that consist of multiple words or tokens, which can then be used for topic generation and other text analytics tasks. Therefore, unlike noun phrases, Noun Groups do not include pre-determiners, determiners, numerical determiners (quantifiers), or negation adverbials, whether they are words, phrases, or clauses. In some languages, like English, post-head modifiers are also excluded. Furthermore, a bare head noun is not a Noun Group match. For example, only parts of the noun phrases in the following sentence are matches for Noun Group:

The dog’s [speedy recovery] from the five [long days] spent wandering was due to a [kind-hearted old lady], who found him at the [main gate] of her community.

Special constraints that govern whether certain words are included in the Noun Group match serve to prevent the match from becoming too specific (too long) to be useful. Different languages vary in their use of these constraints, but in general, Noun Group matches have no more than two or three modifiers of different part-of-speech tag types. In addition, they do not include conjunctions.

Modifiers joined with conjunctions, as well as conjoined nouns, are not combined into a conjoined phrase.

Consider the following examples:

  • Boys and girls
  • The five unruly boys and girls
  • Cookies and milk
  • Very delicious cookies and milk
  • Bangers and mash
  • Large but fixed amount of money
  • Her considered and well-articulated opinion

Pause and think: Can you identify the potential matches in the examples above?

Matches include the following:

  • The five [unruly boys] and girls
  • Very [delicious cookies] and milk
  • Large but [fixed amount] of money
  • Her considered and [well-articulated opinion]

The first, third, and fifth examples do not contain modifiers to the nouns and therefore do not produce Noun Group matches.
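
The behavior around conjunctions and bare head nouns can be illustrated with a small Python sketch that operates on tokens that are already part-of-speech tagged. The tag names (ADJ, NOUN, and so on) and the noun_groups function are assumptions made for this illustration; the sketch handles only adjectival modifiers and is not how the SAS Noun Group pattern is implemented.

```python
def noun_groups(tagged_tokens):
    """Collect adjective(s) + head noun sequences from (word, tag) pairs."""
    groups = []
    modifiers = []
    for word, tag in tagged_tokens:
        if tag == "ADJ":
            modifiers.append(word)
        elif tag == "NOUN":
            # A bare head noun is not a match, so require at least one modifier.
            if modifiers:
                groups.append(" ".join(modifiers + [word]))
            modifiers = []
        else:
            # Determiners, numbers, adverbs, and conjunctions break the group.
            modifiers = []
    return groups


example = [("The", "DET"), ("five", "NUM"), ("unruly", "ADJ"),
           ("boys", "NOUN"), ("and", "CONJ"), ("girls", "NOUN")]
print(noun_groups(example))
```

Because the conjunction resets the list of pending modifiers, "unruly" attaches only to "boys," and "girls" is left as a bare head noun with no match.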

4.7. Disambiguation of Matches

Accounting for situations in which a single predefined concept match or pattern could fall into multiple categories is one of the key challenges of named entity recognition. Ambiguities between enamex entities were detailed in chapter 3, but there are also ambiguities between enamex and timex entities. Some examples are included below.

“May” can be part of a person’s name or a date:

  • [Prime Minister Theresa May] arrived yesterday.
  • It happened in [May] this year.

“April” can be part of a person’s first name, an organization name, or a date:

  • [Mayor April O’Neil] was elected last Monday.
  • He works at [April Group].
  • In [April], she went to a conference.

In addition, the same text string could be a predefined concept match or not. Consider the following sentence (from https://www.bbc.com/sport/cricket/47273785):

Adil Rashid claimed 2-21, Chris Woakes 2-28 and . . .

The numbers in this sentence could refer to dates in the month of February in the context of, for example, claiming days off from work. In that context, the numbers should be extracted as dates. However, the sentence above comes from a sports context, and in this case extracting dates would be inaccurate: the rest of the sentence includes “Mark Wood 2-35,” which cannot be a calendar date. The numbers refer to cricket players’ statistics and are not timex entities. Similarly, in European data sources, soccer scores are often represented in a format that may match a time, such as “4:10.” It would be inappropriate to extract the final score of a soccer match as a time.

The SAS predefined concepts account for these types of ambiguity by leveraging contextual cues. To give a simple example, when a personal title is encountered in front of a proper noun, it is likely that the proper noun is a person, as in the example “Ms. May.” If, on the other hand, there is a numeral before or after “May,” then it is more likely to be a date, as in “May 5, 2017.”
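
The two cues described above can be sketched in Python as follows. The title list, the date patterns, and the classify_may function are hypothetical and far simpler than the contextual rules in the predefined concepts; the sketch only shows how neighboring tokens can tip the decision one way or the other.

```python
import re

# Hypothetical list of personal titles used as a cue for Person.
TITLES = r"(?:Ms|Mrs|Mr|Dr|Prime Minister|Mayor)\.?"

PERSON_CUE = re.compile(rf"\b{TITLES}\s+(?:[A-Z][a-z]+\s+)?May\b")
DATE_CUE = re.compile(r"\bMay\s+\d{1,2}\b|\b\d{1,2}\s+May\b|\bMay,?\s+\d{4}\b")


def classify_may(text):
    """Label 'May' as 'person', 'date', or 'unknown' based on adjacent cues."""
    if PERSON_CUE.search(text):
        return "person"
    if DATE_CUE.search(text):
        return "date"
    return "unknown"


print(classify_may("Ms. May arrived"))
print(classify_may("May 5, 2017"))
```

A sentence such as "It happened in May this year" has neither cue, so this sketch returns "unknown"; resolving it requires broader context, such as the preposition "in" or the phrase "this year."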

4.8. Supplementing Predefined Concepts

The information about named entities in this chapter may have inspired you to think about augmenting the set of provided concepts with applications specific to your own area of interest. You may have realized that there is information that would be useful to extract but that is not matched in the predefined concepts. To assist you with those tasks, the focus of the next several chapters is creating your own custom concepts using some of the same best practices that are reflected in the predefined concepts.
