Chapter 7: CONCEPT_RULE Type

7.1. Introduction to the CONCEPT_RULE Type

7.2. Basic Use

7.3. Advanced Use: Multiple and Embedded Operators

7.4. Advanced Use: Negation Using NOT

7.5. Advanced Use: Negation Using UNLESS

7.6. Advanced Use: Coreference and Aliases

7.7. Troubleshooting

7.8. Best Practices

7.9. Summary

7.1. Introduction to the CONCEPT_RULE Type

In chapter 6, you learned about three of the concept rule types: CLASSIFIER, CONCEPT, and C_CONCEPT. The focus of chapter 7 is the fourth rule type in this group: CONCEPT_RULE.

This rule type is unique among the concept rule types because it enables the use of operators that specify the distance that your elements can be from one another and still trigger a match. These operators include standard Boolean operators, such as AND, OR, and NOT, and special proximity operators, such as SENT, which constrains a match to within a single sentence. Notice that all the operators are in uppercase, which is a requirement for their use in rules. Table 7.1 and chapter 11 describe the types of operators allowed in a CONCEPT_RULE and provide details about how they work and how to select the right one. In addition, advanced use examples in the following sections may help you understand specific applications of these operators.

The CONCEPT_RULE rule type should be used more sparingly than other rule types. In addition, adding a CONCEPT_RULE should trigger additional careful testing of your model.

After reading this chapter, you will be able to do the following tasks:

  • Use the LITI syntax to write efficient and effective CONCEPT_RULE type of rules
  • Avoid common pitfalls and use best practices to create better rule sets
  • Troubleshoot common rule-writing errors

7.2. Basic Use

A CONCEPT or C_CONCEPT rule may not do everything you need if the context in your text is not predictable or the distance between elements is large. In that case, you may want to use a CONCEPT_RULE instead.

In the basic syntax of the rule definition, each operator is enclosed within parentheses together with its arguments in a comma-separated list. Each argument is one or more elements enclosed within double quotation marks. As in C_CONCEPT rules, the _c{} extraction label encloses in curly braces the element, or elements, whose match should be extracted. The following template shows the structure of this rule, with “OPERATOR” and “element” as placeholders:

CONCEPT_RULE:(OPERATOR, “element1”, “_c{element2}”)

CONCEPT_RULE:(OPERATOR, “_c{element1 element2}”, “element3”)

For example, the following rule finds a date in the same sentence as a percentage and extracts the match for the date. This rule might be useful in business or news documents to extract the dates on which some stock price or revenue target changed by some percentage:

CONCEPT_RULE:(SENT, “_c{nlpDate}”, “nlpPercent”)

This rule contains the rule type in all-caps, followed by a colon. Then, within parentheses, the SENT operator is listed along with its arguments in a comma-separated list. Each of the two arguments is enclosed in double quotation marks. The nlpDate element is marked with the _c{} extraction label, so the nlpDate span of text is extracted if there is a match.

Remember: Whenever an operator is used in a rule, the operator is always enclosed in parentheses in a comma-separated list with its arguments.

As another example situation, imagine that you are an executive at a large bank and you want to use some data found online to figure out how much cash other banks are offering to customers as an incentive to open a new account. The data for this example is modified from customer complaints to the U.S. Consumer Financial Protection Bureau (https://www.consumerfinance.gov/data-research/consumer-complaints/).

Consider the following input documents:

1. opened an account using the $300.00 bonus offer promotion.

2. opened a Premier Everyday Checking account on March 31, 2017 online and was told that I am eligible to receive a $250.00 bonus once I complete a set of activities within 60 days of opening the account.

3. They promised to pay $400.00 to the new users of an opened VIP account package.

Pause and think: Considering the input documents, can you write a CONCEPT_RULE to extract the promotion amount near the mention, event, or action of opening an account?

One way to extract matches for the amount associated with a promotion is to use the following rule in a concept named, for example, promotionAmount:

CONCEPT_RULE:(DIST_18, “_c{nlpMoney}”, “open@ _w account”)

This rule uses the DIST_n operator with n set to 18, so the arguments must match within 18 tokens of one another. Otherwise, the syntax is the same as in the first example, except that the second argument contains more than one element; in fact, it contains three: open@, _w, and account. Because the extraction label encloses the nlpMoney concept, this rule extracts the currency value associated with the new account.

The matches with the above input documents are in Figure 7.1.

Figure 7.1. Extracted Matches for the promotionAmount Concept

Doc ID | Concept         | Match Text
1      | promotionAmount | $300.00
2      | promotionAmount | $250.00
3      | promotionAmount | $400.00
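
To build intuition for how the DIST_18 rule above arrives at these matches, the following Python sketch imitates it outside the software. It is only a toy approximation under stated assumptions: the regular expression stands in for the predefined nlpMoney concept, the OPEN_FORMS set stands in for the open@ expansion, and the tokenizer and the way distance is counted are simplifications rather than the product's actual behavior. All function names are hypothetical.

```python
import re

# Assumption: a simple pattern standing in for the predefined nlpMoney concept.
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")
# Assumption: hand-listed forms standing in for the open@ expansion.
OPEN_FORMS = {"open", "opens", "opened", "opening"}


def tokenize(text):
    # Keep amounts such as $300.00 whole; other punctuation becomes its own token.
    return re.findall(r"\$\d+(?:,\d{3})*(?:\.\d+)?|\w+|[^\w\s]", text)


def open_account_spans(tokens):
    # "open@ _w account": a form of "open", any single token, then "account".
    return [(i, i + 2) for i in range(len(tokens) - 2)
            if tokens[i].lower() in OPEN_FORMS and tokens[i + 2].lower() == "account"]


def promotion_amounts(text, n=18):
    tokens = tokenize(text)
    spans = open_account_spans(tokens)
    found = []
    for i, tok in enumerate(tokens):
        if not MONEY.fullmatch(tok):
            continue
        # Simplified distance: first and last matched tokens at most n positions apart.
        if any(max(end, i) - min(start, i) <= n for start, end in spans):
            found.append(tok)
    return found


print(promotion_amounts("They promised to pay $400.00 to the new users of an opened VIP account package."))
```

Lowering n in the call (for example, promotion_amounts(text, n=5)) is a quick way to see how a tighter distance drops matches whose context is farther away.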

As you further investigate the real data available from the U.S. Consumer Financial Protection Bureau, you observe that another amount is sometimes mentioned alongside the promotion amount, and that is the amount used to open the bank account. To remove those matches from the promotionAmount concept, you can create another concept that specifies the context within which the amount used for opening the account is encountered and then remove those matches from matches to the above concept. This is a more complex approach and involves the REMOVE_ITEM filtering rule type, discussed in chapter 9.

A third scenario illustrating the basic use of operators in a CONCEPT_RULE involves the operator DIST_n with a value of 0 for n, which can be used to match a token and its part of speech (POS) at run-time. In this example, the information technology (IT) department of a company wants to extract information from reports about equipment issues and outages. In the text data, there are mentions of the token “monitor” with two different parts of speech: a noun referring to the computer peripheral, and a verb referring to the action of observing a situation. A rule needs to capture only the instances referring to the part so that these instances can be routed to the IT department that handles computer peripherals. This rule is in the concept partIssue.

CONCEPT_RULE:(DIST_0, “_c{monitor@}”, “:N”)

Consider the following input documents:

1. ITS continues to monitor the issue.

2. The monitors were flickering.

Pause and think: Considering these input documents, can you predict the matches for the partIssue concept?

The token “monitor” in the first document is a verb, whereas “monitors” in the second document is a noun. The @ morphological expansion symbol lets the rule match either “monitor” or “monitors,” but the :N part-of-speech tag requires the same token to be a noun, so a match is returned only for the second document.
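
The logic of pairing a token with its part of speech through DIST_0 can be sketched in a few lines of Python. This is an illustration only: the part-of-speech tags are supplied by hand rather than by a tagger, the tag labels other than N are invented for the example, and MONITOR_FORMS is an assumed stand-in for the monitor@ expansion.

```python
# Assumption: hand-listed forms standing in for the monitor@ expansion.
MONITOR_FORMS = {"monitor", "monitors", "monitored", "monitoring"}


def part_issue(tagged_tokens):
    # Keep a token only when it is a form of "monitor" AND that same token is a noun.
    return [tok for tok, pos in tagged_tokens
            if tok.lower() in MONITOR_FORMS and pos == "N"]


# Hand-tagged versions of the two input documents (tags are illustrative).
doc1 = [("ITS", "N"), ("continues", "V"), ("to", "PART"), ("monitor", "V"),
        ("the", "DET"), ("issue", "N"), (".", "PUNCT")]
doc2 = [("The", "DET"), ("monitors", "N"), ("were", "V"),
        ("flickering", "V"), (".", "PUNCT")]

print(part_issue(doc1))
print(part_issue(doc2))
```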

Note that you can use the same operators in a CONCEPT_RULE type of rule as in the PREDICATE_RULE type, which is discussed in chapter 8. Some of the same CONCEPT_RULE goals described above and in the advanced sections that follow can be achieved with PREDICATE_RULE rules, but because the latter are more computationally expensive, using the CONCEPT_RULE type, if possible, is recommended. The PREDICATE_RULE type should be reserved for scenarios in which the CONCEPT_RULE cannot achieve the same results.

7.3. Advanced Use: Multiple and Embedded Operators

What makes the CONCEPT_RULE type very powerful is the ability to embed an operator and its arguments as an argument of another operator. This nesting of operators allows for interactions between the types of operators to help you specify the exact conditions under which the meaning in the text will be a match for the desired information.

Remember: You can embed an operator and its arguments as an argument of another operator.

Using this technique of embedded operators, you can do many things to control how the text is interpreted by the rule. One of the most common patterns in rule-writing involves limiting matches to within a sentence, with each argument of SENT being a list of arguments under an OR operator:

CONCEPT_RULE:(SENT, (OR, “element1”, “element2”), (OR, “_c{element3 element4}”, “_c{element5}”))

Note that, in the rule template above, the operators are filled in, but the elements are just placeholders. You can plug in your own elements to use the rule in the software.

It is very important that if the _c{} extraction label encloses all or part of an argument of an OR operator, all the other child arguments of that operator must also include the extraction label. Some SAS Text Analytics products do not give a compilation warning in this situation, but matches will not work properly without all of the necessary _c{} labels. In all other contexts, only one or more consecutive elements in the same argument may have the _c{} label, because for each rule match, only one match string can be extracted by a match on this type of rule.

Tip: If the _c{} extraction label encloses all or part of an argument of an OR operator, all the other child arguments of that operator must also include the extraction label.
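
Because some products do not warn about a missing label under OR, it can help to check rules yourself before compiling. The following Python sketch is one hypothetical way to do so; it is not part of any SAS product. It parses a rule into a nested structure and reports any argument of an OR operator that lacks the _c{} label when at least one sibling has it. The parser assumes well-formed parentheses and quotation marks.

```python
import re

# Parentheses, commas, quoted arguments, and uppercase operator names.
TOKEN = re.compile(r'\(|\)|,|["“][^"”]*["”]|[A-Z][A-Z_0-9]*')


def parse(rule):
    """Turn 'TYPE:(OP, "arg", (OP, ...))' into nested (operator, children) tuples."""
    tokens = TOKEN.findall(rule.split(":", 1)[1])
    pos = 0

    def node():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok.strip('"“”')      # a quoted argument
        op = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] == ",":
            pos += 1
            children.append(node())
        pos += 1                          # skip the closing parenthesis
        return (op, children)

    return node()


def has_label(n):
    if isinstance(n, str):
        return "_c{" in n
    return any(has_label(child) for child in n[1])


def unlabeled_or_arguments(n):
    """List OR arguments that lack _c{} while a sibling argument has it."""
    if isinstance(n, str):
        return []
    op, children = n
    problems = []
    if op == "OR" and any(has_label(c) for c in children):
        problems += [c for c in children if not has_label(c)]
    for child in children:
        problems += unlabeled_or_arguments(child)
    return problems


rule = 'CONCEPT_RULE:(DIST_18, "nlpMoney", "open@ _w _w account", (OR, "_c{bonus}", "promotion"))'
print(unlabeled_or_arguments(parse(rule)))
```

An empty list means that every OR operator in the rule is labeled consistently.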

Recall that one of the rules in section 7.2 finds specific currency amounts near information about opening an account. During testing, you may discover that you need to constrain this rule further with a third argument to capture the context of the bonus payment. You can handle this situation by adding under the DIST_n operator a third argument that lists the possibilities under an OR operator, as shown in the rule below. You can also move the extraction label to these elements, if you want to know how often the offer is a bonus versus a promotion:

CONCEPT_RULE:(DIST_18, “nlpMoney”, “open@ _w _w account”, (OR, “_c{bonus}”, “_c{promotion}”))

This more advanced rule contains a new set of parentheses, enclosing the new operator OR with its arguments, after the second argument of DIST_n, “open@ _w _w account.”

You can read the new rule this way: First match the existence of one of the following: the strings “bonus” or “promotion,” the predefined concept nlpMoney, or the string in the second argument. Then scan 18 tokens in either direction to find matches for the remaining arguments. Because of the placement of the extraction label, the match returned is now either the string “bonus” or the string “promotion.”

Consider the following input documents, modified from the U.S. Consumer Financial Protection Bureau data:

1. Open a checking account and earn $300.00 promotion.

2. To receive the $300.00 bonus, you must open an interest account and set up and receive 10 Qualifying Direct Deposits . . .

3. I opened an express account with the promotion of $300.00 for premier checking and $200.00 for premier savings accounts.

4. I said I was interested in opening the savings account as well preferably at the 1.49 % rate but if not then 1.34 % rate. Also, I was told that if I were to answer a few questions related to my finances, they will give me a $25.00 gift card.

5. I met with the manager at my local branch and signed up for a promotion that would give me $1000.00 bonus after opening a savings account . . .

Pause and think: Assuming that the rule above is in a concept named promotionStrategy, can you predict the matches for the input documents above?

The matches in the input documents above are listed in Figure 7.2.

Figure 7.2. Extracted Matches for the promotionStrategy Concept

Doc ID | Concept           | Match Text
1      | promotionStrategy | promotion
2      | promotionStrategy | bonus
3      | promotionStrategy | promotion
5      | promotionStrategy | promotion
5      | promotionStrategy | bonus

Note that there are no matches for the fourth document, because it does not contain either “bonus” or “promotion.” But even if “gift card” were added to the rule as an additional argument of the OR operator, the distance between the amount and the match to the second argument would be too great for a match to be produced using this rule. The rule could additionally be modified by increasing the distance to 45, for example, and then the string “gift card” would be extracted as a match.

Another common pattern in advanced CONCEPT_RULE rules is use of the SENT operator to bound the scope of DIST_n, ORD, or ORDDIST_n to within a single sentence:

CONCEPT_RULE:(SENT, (DIST_4, “_c{element1}”, “element2”, “element3”))

This rule template extracts the first element, if all three elements are found within a distance of four tokens of each other in the same sentence. As above, elements in this rule template are placeholders for you to substitute with your own content before using in the software.

Caution should be exercised with the AND operator. This operator can be very useful when applied to short documents. However, for mid-sized or long documents, an AND operator that is not bounded by another operator may match in situations you do not expect. In those cases, the use of SENT, DIST_n, or SENT_n instead of AND will usually give you the more targeted behavior you are looking for.

Tip: The AND operator is most useful for short documents. For longer documents, use an operator with more restricted scope, such as SENT, DIST_n or SENT_n. Do not embed AND under one of these operators.

7.4. Advanced Use: Negation Using NOT

The CONCEPT_RULE type is the first type of LITI rule covered so far that can accomplish some types of negation and filtering of matches on its own. In other words, you can use this rule type to specify both what you want to find, and what should not be present within a given scope of the matched elements.

There are two operators that help you specify what to exclude from matching: the NOT operator and the UNLESS operator. This section will describe the behavior of NOT and provide a few examples of when this approach could be useful.

The NOT operator specifies, alongside the other criteria for a match, what must not be found in the document. It must be one of the arguments of an AND operator. In other words, as one argument of the AND operator, you specify what you want to find in the document, and as another argument you specify the NOT operator and its argument.

For example, perhaps you want to find documents that mention aircraft, but not catch documents that talk about American football, because you are collecting information for a report on the use of airspace over American cities. In this way, you can eliminate documents about the New York Jets, which would otherwise be false positives in your result set. Here is how you might write that rule:

CONCEPT_RULE:(AND, “_c{aircraftConcept}”, (NOT, “footballConcept”))

This rule references two other concepts not shown in detail here: aircraftConcept, defined with keywords that describe different types of aircraft, and footballConcept, populated with keywords that are common football-specific terms. The rule aims to return matches for aircraftConcept only if the document does not discuss football, in accordance with the terms defined in footballConcept.

Some of the rules in aircraftConcept include the following:

CLASSIFIER:aircraft

CLASSIFIER:jet@

Some of the rules in footballConcept include the following:

CONCEPT_RULE:(SENT, “_c{fly@V}”, “ball@N”)

CLASSIFIER:the Jets, New York Jets

Keep in mind that the NOT operator always has document-level scope, so it cannot be limited to a sentence by putting an operator like SENT in front of it in the structure of the rule. Therefore, a rule like the one below will not limit the scope of NOT to the sentence because SENT is not able to control the matches to NOT. You will not get the results that the structure of the rule implies; therefore, this is an error:

ERROR -> CONCEPT_RULE:(SENT, (AND, “_c{aircraftConcept}”, (NOT, “footballConcept”)))

Tip: The NOT operator has document-level scope and cannot be limited, for example to a sentence, by putting another operator higher in the structure of the rule.

A similar type of example would include extracting instances of weapons mentioned in text, but not wanting to extract matches on documents that were discussing video games. This might be the focus of government analysts tracking the purchase and ownership of weapons in online forums. Using NOT is a way of filtering out matches, assuming you are certain that the filtered items are ones you do not want. Here is an example rule:

CONCEPT_RULE:(AND, (OR, “_c{firearmsList}”, “_c{amunitionList}”, “_c{bombList}”), (NOT, “videoGameTerm”))

This rule references four other concepts not shown in detail here that have been defined in a variety of ways and used to separate out different types of weapons references in order to use them in different combinations within the model. For example, no large-scale weapons (like tanks) are represented, but those usually used by individuals are included. The goal of this rule is to find mentions of such weapons, but not in the context of a document that has terms commonly used when describing or discussing video games.

Just as with AND, you must be careful when using NOT, because both have document-level scope. This means they may not behave as you want if your documents are very long. If you are familiar with SAS Categorization models, then you may be tempted to try using the operators NOTINDIST or NOTINSENT to get around this limitation on NOT. These operators are not supported in LITI rules; they will result in a compilation error.

Another type of example for using the NOT operator involves the use of a key phrase or marker to indicate a specific document type. For example, if you want to find all person names in the documents, but you know that the document collection includes a form used for registering voters, and you want to exclude those documents from matching, then you can build a rule like the one here:

CONCEPT_RULE:(AND, “_c{nlpPerson}”, (NOT, “Voter Registration Form”), (NOT, “Voter ID Number”))

This rule will match on person names, but not in any documents that contain the phrase “Voter Registration Form” or “Voter ID Number.” If either of the phrases under a NOT operator is present, then it is enough to block the match to nlpPerson from appearing as a match to the concept where this rule is written. You cannot easily specify that both items are required to eliminate the match, unless you build another concept to reference in this rule, and that other concept requires both phrases in order to match.
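
The blocking behavior described above can be pictured with a small Python sketch. It is a simplification for illustration: the list of person names is passed in by hand as a stand-in for the predefined nlpPerson concept, the sample names are invented, and plain substring tests replace real token matching.

```python
BLOCKERS = ("Voter Registration Form", "Voter ID Number")


def person_matches(document, person_names, blockers=BLOCKERS):
    # NOT has document-level scope: one blocker anywhere in the document
    # suppresses every match, no matter how far away the names are.
    if any(phrase in document for phrase in blockers):
        return []
    return [name for name in person_names if name in document]


names = ["Ana Ruiz", "Tom Lee"]   # hypothetical names for the example
memo = "Ana Ruiz met Tom Lee to review the budget."
form = "Voter ID Number: 12345. Applicant: Ana Ruiz."

print(person_matches(memo, names))
print(person_matches(form, names))
```

Note that the check uses any(), which mirrors the point that either phrase alone is enough to block the match.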

7.5. Advanced Use: Negation Using UNLESS

Another way to exclude matches that you do not want is to use the UNLESS operator. This operator has some specific limitations that you should know. First, it takes just two arguments, and the second one must be one of the following operators: AND, SENT, DIST_n, ORD, or ORDDIST_n. Each of these operators may take two or more arguments. The UNLESS operator blocks a match if the first argument appears between the arguments under the latter operator.

Let us use the example of tracking specific events. You have a basic rule that you want to find situations where a particular sports team wins a game. This rule, for example in a concept named trackingWins, says that when a match to the baseballTeam concept is followed by “win” or its variants, you want to extract the date that occurs in the same sentence:

CONCEPT_RULE:(SENT, (ORD, “baseballTeam”, “win@”), “_c{nlpDate}”)

The baseballTeam concept includes the following rules:

CLASSIFIER:Cleveland Indians

CLASSIFIER:Indians

Because the definitions in the baseballTeam concept include references to the Cleveland Indians baseball team, the rule in the trackingWins concept outputs matches in sentences like the following. Note that highlighted tokens signify matches for each of the arguments, which are required in order for the rule to return a match for the _c{} label:

1. Brantley singled two home runs on the first pitch of his first at-bat and Carlos Carrasco worked out of a bases-loaded jam in the sixth inning, leading the Cleveland Indians to a 3-2 win in their chilly home opener over the Kansas City Royals on Friday.

2. It wasn’t pretty, but the Cleveland Indians found a way to win the first home series of 2018 with a wild 3-1 win over the Kansas City Royals on Sunday.

3. Coming into the season, the Indians were expected to win somewhere near 100 games this year, win the division convincingly, and contend for a World Series title.

The matches returned to the trackingWins concept due to the _c{} label are listed in Figure 7.3.

Figure 7.3. Extracted Matches for the trackingWins Concept

Doc ID | Concept      | Match Text
1      | trackingWins | this season
1      | trackingWins | on Friday
2      | trackingWins | of 2018
2      | trackingWins | on Sunday
3      | trackingWins | this year

Note: Because of how the nlpDate concept is predefined, matches include both “on Friday” and “Friday,” as well as “on Sunday” and “Sunday.” You can use postprocessing code to retain the most specific date for each document ID. Alternatively, the near-duplicate matches can be cleaned up by using a REMOVE_ITEM rule that removes the match containing a preposition if the same match without the preposition has already been found. See chapter 9 for more information about this rule type:

REMOVE_ITEM:(ALIGNED, “_c{nlpDate}”, “_w nlpDate”)

You are doing some postprocessing on this data to get the results aligned with the news article date and interpreting the results, but hits like the third document are throwing your statistics off. It is a false positive match, because it references wins that have not yet happened. You are counting more wins than the team actually has. You can remove the hypothetical wins while retaining matches for confirmed wins by using the UNLESS operator:

CONCEPT_RULE:(UNLESS, “expect@”, (SENT, (ORD, “baseballTeam”, “win@”), “_c{nlpDate}”))

This rule allows matches only if a form of the word “expect” does not occur between the two arguments of the SENT operator. This modification using UNLESS will exclude the third sentence above from matching the rule.

Another restriction on the UNLESS operator is more of a safety recommendation, and therefore has exceptions. The recommendation is to reference a concept with UNLESS only if that concept contains only CLASSIFIER or REGEX rules.

Tip: When you are using UNLESS and the first argument is a concept name, that concept should contain only CLASSIFIER or REGEX rules.

In another example, perhaps you have a rule by which you want to find mentions of your product or service with positive adjectives, like “happy,” “useful,” “best,” and the like. You can use UNLESS to help you exclude situations in which that adjective is modified with negation adverbs like “not” and “never.” For example, see the following rule:

CONCEPT_RULE:(UNLESS, “negList”, (DIST_7, “custServiceRep”, “_c{posAdj}”))

The project containing this rule in a concept named posMention also includes three other concepts that are partially shown below: negList, a list of negative adverbs as CLASSIFIER rules; custServiceRep, a list of terms that describe customer service representatives in an airline; and posAdj, a list of positive adjectives that may be used to describe the quality of the customer service by the representative.

The concept negList contains the following rule:

CLASSIFIER:not

The concept custServiceRep contains the following rules:

CONCEPT:attendant@
CONCEPT:agent@
CLASSIFIER:help desk
CLASSIFIER:personnel

The concept posAdj contains the following rules:

CLASSIFIER:helpful
CLASSIFIER:kind

Consider the following input documents, which simulate airline feedback data.

1. The ladies at the help desk were not helpful at all.

2. Some rather unpleasant personnel were rude or not helpful.

3. The attendants were kind but not helpful.

4. There was one agent in particular, Mr. Jim Wilsey, who was very helpful.

5. In any event, one of the flight attendants was extremely helpful and apologetic.

Pause and think: Assuming the input documents above, can you predict the matches for the posMention concept?

The matches are represented in Figure 7.4.

Figure 7.4. Extracted Matches for the posMention Concept

Doc ID | Concept    | Match Text
3      | posMention | kind
5      | posMention | helpful

Note that there are no matches for the first and second documents because the “not” in these sentences is a match for the negList concept and prevents matches to the posMention concept through the UNLESS operator. The “helpful” match in the third document is also filtered by UNLESS, but the “kind” match to the posAdj concept is passed on to the posMention concept. The fourth document has no matches because the distance is greater than 7 tokens between “agent” from the custServiceRep concept and “helpful” from the posAdj concept.
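
The reasoning in the preceding paragraph can be traced with a toy Python simulation of the UNLESS and DIST_7 combination. It rests on several assumptions: the word lists below stand in for the negList, custServiceRep, and posAdj concepts (with the @ expansions written out by hand), punctuation is counted as tokens, and distance is measured as the difference in token positions. The software may tokenize and count differently.

```python
import re

NEG_LIST = {"not"}
# Assumption: expansions of attendant@ and agent@ written out by hand.
CUST_SERVICE_REP = [["attendant"], ["attendants"], ["agent"], ["agents"],
                    ["help", "desk"], ["personnel"]]
POS_ADJ = {"helpful", "kind"}


def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text.lower())


def find_phrases(tokens, phrases):
    spans = []
    for phrase in phrases:
        k = len(phrase)
        spans += [(i, i + k - 1) for i in range(len(tokens) - k + 1)
                  if tokens[i:i + k] == phrase]
    return spans


def pos_mention(text, n=7):
    tokens = tokenize(text)
    reps = find_phrases(tokens, CUST_SERVICE_REP)
    matches = []
    for j, tok in enumerate(tokens):
        if tok not in POS_ADJ:
            continue
        for start, end in reps:
            lo, hi = min(start, j), max(end, j)
            if hi - lo > n:
                continue                  # DIST_7 not satisfied
            if any(t in NEG_LIST for t in tokens[lo:hi + 1]):
                continue                  # UNLESS blocks this pairing
            matches.append(tok)
            break
    return matches


print(pos_mention("The attendants were kind but not helpful."))
```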

7.6. Advanced Use: Coreference and Aliases

The coreference symbol _ref{} can be used in CONCEPT_RULE rules to tie a reference back to a lemma (canonical form). See chapter 1 for an explanation of lemmas. This approach can be useful when you are trying to establish relationships between items, where some of the relationships may involve pronouns, common nouns, or aliases. For example, perhaps you want to find each reference to a company, whether the full name is used or not. You may want to do so to tie other information that you find back to the company in your analysis. You may start with a rule like the following:

CONCEPT_RULE:(AND, “_c{SAS Institute}”, (OR, “_ref{SAS}”, “_ref{they}”, “_ref{company}”))

In this rule, the _c{} extraction label encloses the string element “SAS Institute.” The string elements “SAS,” “they,” and “company” are also considered to be company references in this rule. All of the arguments of the OR operator are marked with a _ref{} symbol. This shows that they are the references that should be tied to the primary return string, marked with the _c{} label. In other words, this rule says that if you find a match for “SAS Institute,” then also look anywhere in the document for any of the possible defined coreferents, and link them to the canonical form returned by the _c{} label.

Consider the following input document:

I work for SAS Institute. SAS is a large private software company. They make software for various business purposes centered around the idea of analytics. The company puts customers first and has recently celebrated their 40th anniversary.

Pause and think: Assuming the rule above is in a concept named sasAlias, can you predict the matches with the input document above?

The matches for the sasAlias concept containing the above rule with the input document are in Figure 7.5.

Figure 7.5. Extracted Matches for the sasAlias Concept

Doc ID | Concept  | Match Text    | Canonical Form
1      | sasAlias | SAS Institute | SAS Institute
1      | sasAlias | SAS           | SAS Institute
1      | sasAlias | SAS           | SAS Institute
1      | sasAlias | company       | SAS Institute
1      | sasAlias | They          | SAS Institute
1      | sasAlias | company       | SAS Institute

Alert! The coreference functionality works properly only in a subset of the SAS Text Analytics products. To use it, you should confirm that you have the output shown in Figures 7.6 and 7.7.

Figure 7.6 shows the relationship between the highlighted word in the text and the canonical form elsewhere in the text. This information is used during the rule-building process to confirm that the correct results are found by a particular rule or concept. In Figure 7.6, the highlighted word is “company” and the pop-up window shows that it is connected to the canonical form of “SAS Institute.”

Figure 7.6. Canonical Form Representation in SAS Enterprise Content Categorization

Figure 7.7 shows matches in the scoring output in SAS Studio that includes the relationship between the coreference matches in the term column and the canonical form in the canonical_form column. Note that, based on the offsets, the first and second matches overlap. The concept name is evident in the name column. This information is accessible in a production context when you are scoring many documents with a completed model.

Figure 7.7. Canonical Form Representation in SAS Studio

In the rule that extracted the strings shown above, matches to the coreference terms may appear anywhere in the document because of the AND operator. If you want to control this matching behavior more closely, use a different operator. For example, use of ORD will limit the coreference matches to after the first match of the primary reference. ORDDIST_n will limit the matches to some distance from the primary reference. SENT and ORD used together will restrict the scope of the match to within the bounds of the same sentence as the primary reference match, but require the primary reference to occur first. To illustrate, the following rule is very similar to the previous rule, except that it limits the order and distance of the matches:

CONCEPT_RULE:(ORDDIST_15, “_c{SAS Institute}”, (OR, “_ref{SAS}”, “_ref{they}”, “_ref{company}”))

The matches for this rule are similar to the matches shown above, with one difference. The last match on “company” is now too far away from the primary reference, so it no longer matches.

Assuming that this rule is in the concept named sasAlias, the matches for the input text in the previous example are in Figure 7.8.

Figure 7.8. Extracted Matches for the sasAlias Concept

Doc ID | Concept  | Match Text    | Canonical Form
1      | sasAlias | SAS Institute | SAS Institute
1      | sasAlias | SAS           | SAS Institute
1      | sasAlias | company       | SAS Institute
1      | sasAlias | They          | SAS Institute
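
To see why the final “company” drops out under ORDDIST_15, the following Python sketch imitates the order and distance constraints on the same input document. It is a rough model under stated assumptions, not the product's algorithm: punctuation counts as tokens, distance is measured from the last token of the most recent primary reference, and coreference terms are compared without regard to case.

```python
import re


def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)


def coreference_matches(text, primary=("SAS", "Institute"),
                        corefs=("sas", "they", "company"), n=15):
    tokens = tokenize(text)
    k = len(primary)
    canonical = " ".join(primary)
    rows = []              # (match text, canonical form)
    anchor_end = None      # position of the last primary reference seen
    i = 0
    while i < len(tokens):
        if tuple(tokens[i:i + k]) == primary:
            rows.append((canonical, canonical))
            anchor_end = i + k - 1
            i += k
            continue
        # A coreferent must follow the primary reference and lie within n tokens.
        if (anchor_end is not None and tokens[i].lower() in corefs
                and i - anchor_end <= n):
            rows.append((tokens[i], canonical))
        i += 1
    return rows


doc = ("I work for SAS Institute. SAS is a large private software company. "
       "They make software for various business purposes centered around the "
       "idea of analytics. The company puts customers first and has recently "
       "celebrated their 40th anniversary.")
for row in coreference_matches(doc):
    print(row)
```

In this model, the second “company” sits 25 tokens after “Institute,” which exceeds the limit of 15, so it is not returned.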

In general, unless the documents are very short or the coreference variants are not ambiguous, a best practice recommendation is to start with the ORDDIST operator. For a very conservative approach, use SENT with ORD together, but always first verify the approach with your data.

Table 7.1 summarizes the behavior you can expect from each operator that may be used in this type of rule.

Table 7.1. Behavior of Operators

Operator | Behavior
AND | Matches any occurrence of primary reference and matches any coreference, whether it follows or precedes the primary reference in the document. It ties all coreference instances to the first primary reference found in the document, not the closest one.
ORD | Matches any occurrence of primary reference and then matches any coreference that follows the first primary reference match. It ties all coreference instances to the first primary reference found in the document, not the closest one.
SENT | Matches only when the primary reference and the coreference occur in the same sentence but does not require the primary reference to come first. Govern with the ORD operator to require the primary reference to be matched first.
DIST_n | Matches only when the primary reference and the coreference occur within a specified number of tokens from each other but does not require the primary reference to come first. Use ORDDIST instead to require the primary reference to be matched first.
SENT_n | Matches only when the primary reference and the coreference occur within the specified number of sentences but does not require the primary reference to come first. Govern with the ORD operator to require the primary reference to be matched first.
PARA | Matches only when the primary reference and the coreference occur in the same paragraph but does not require the primary reference to come first. Govern with the ORD operator to require the primary reference to be matched first.
ORDDIST_n | Matches any occurrence of the primary reference, then matches any coreference that both follows the first primary reference match, and appears within the specified number of tokens of that match. After the maximum match distance is reached, a match must first be a primary reference to trigger more coreference matches again.

Note that rules that result in coreference or canonical form matches must be at the top level of the model to generate such information in the output. In other words, the concept that houses them will not pass along this information to any calling concept. Keep this in mind when you design your models, and consider using multiple models, if necessary.

7.7. Troubleshooting

If a rule is not matching as you expected, the cause could be one of the pitfalls outlined in section 5.4: general syntax errors, comments, misspelling or mistyping, tokenization mismatch, or filtered matches. There are also errors specific to the CONCEPT_RULE type of rule that you can check for, such as the following:

  • White space
  • Syntax errors
  • Missing extraction label
  • Extra extraction label
  • Tagging mismatch
  • Expansion mismatch
  • Concept references
  • Predefined concept references
  • Using nonexistent operators
  • Logical error with operators
  • Cyclic dependencies

White space between the pieces of a CONCEPT_RULE generally does not matter, because parentheses, commas, and double quotation marks set off the pieces of the rule. However, within an argument (that is, inside double quotation marks), white space separates the elements in a list and is not counted as an element itself.

A common syntax error that is specific to CONCEPT_RULE is forgetting the extraction label or its curly braces, or misplacing them: the braces must always be inside the double quotation marks that define an argument. Remember also that the operator and its arguments inside a set of parentheses form a comma-separated list, so do not forget the commas. Finally, parentheses and quotation marks must come in pairs.
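The checks described in this paragraph can be sketched as a small pre-compile script. The following Python sketch is only an illustration and is not part of any SAS product; the function name check_rule_syntax and the wording of its messages are assumptions made for this example. It looks for unpaired parentheses or quotation marks, curly braces outside quotation marks, and missing commas, and it does not attempt to validate operators or elements.

```python
def check_rule_syntax(rule: str) -> list:
    """Return a sorted list of problems found in a CONCEPT_RULE string."""
    problems = set()
    if not rule.startswith("CONCEPT_RULE:"):
        problems.add("missing CONCEPT_RULE: declaration")

    depth = 0          # current parenthesis nesting level
    in_quotes = False  # True while inside a double-quoted argument
    prev = ""          # last non-space character seen outside quotes

    for ch in rule:
        if ch == '"':
            # An opening quote must follow a comma, "(" or ":".
            if not in_quotes and prev not in ",(:":
                problems.add("missing comma before an argument")
            in_quotes = not in_quotes
            prev = '"'
        elif in_quotes or ch.isspace():
            continue  # argument content and white space are not checked
        else:
            if ch == "(":
                if prev not in ",(:":
                    problems.add("missing comma before an argument")
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    problems.add("unpaired parenthesis")
                    depth = 0
            elif ch in "{}":
                problems.add("curly brace outside quotation marks")
            prev = ch

    if depth > 0:
        problems.add("unpaired parenthesis")
    if in_quotes:
        problems.add("unpaired quotation mark")
    return sorted(problems)


good = 'CONCEPT_RULE:(SENT, (ORD, "_c{MonthName}", (OR, ":digit", ":NUM")))'
print(check_rule_syntax(good))
```

A well-formed rule such as the one assigned to good yields an empty list. Running the same function on a rule with a dropped comma or a missing closing parenthesis returns the corresponding message, which can be a quick first check before compiling a project.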

In a CONCEPT_RULE, there can be only a single extraction label: _c{}. However, if you have marked all or part of one argument of an OR operator with the _c{} label, then you must also place the label somewhere in each of the sister arguments under the same OR. Otherwise, you will not see the matching behavior that you expect. If you use multiple _c{} extraction labels in any other context, your rule will compile but will not match anything.
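Because a rule with misplaced labels compiles without complaint, it can help to check label placement separately. The following Python sketch shows one way to encode the guidance above: sister arguments under an OR share one label slot, and a rule needs exactly one slot overall. The helper names (parse, label_slots, has_valid_label) and the concept name myBrands are hypothetical, and the parser assumes that the rule is already well formed.

```python
import re

# Tokens: parentheses, commas, quoted arguments, and uppercase operator names.
TOKEN = re.compile(r'\s*(\(|\)|,|"[^"]*"|[A-Z_0-9]+)')


def parse(rule: str):
    """Parse 'CONCEPT_RULE:(OP, arg, ...)' into nested lists [OP, child, ...]."""
    tokens = TOKEN.findall(rule.split(":", 1)[1])
    pos = 0

    def node():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok                # a quoted argument
        op = tokens[pos]              # operator name, such as SENT or OR
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == ",":
                pos += 1
            else:
                children.append(node())
        pos += 1                      # consume the closing parenthesis
        return [op] + children

    return node()


def label_slots(node) -> int:
    """Count _c{} label slots; labeled sisters under OR count as one slot."""
    if isinstance(node, str):
        return node.count("_c{")
    op, children = node[0], node[1:]
    counts = [label_slots(child) for child in children]
    if op == "OR" and any(counts):
        if not all(counts):
            raise ValueError("every argument under OR needs a _c{} label")
        return max(counts)
    return sum(counts)


def has_valid_label(rule: str) -> bool:
    """True when the rule has exactly one extraction-label slot."""
    try:
        return label_slots(parse(rule)) == 1
    except ValueError:
        return False


print(has_valid_label('CONCEPT_RULE:(SENT, (OR, "_c{myProducts}", "_c{myBrands}"), "PosAdj")'))
print(has_valid_label('CONCEPT_RULE:(SENT, (OR, "_c{myProducts}", "myBrands"), "PosAdj")'))
```

The first call prints True because both sister arguments under OR carry the label. The second prints False because only one of them does. A rule with no label at all, or with labels on two arguments outside an OR, is also reported as invalid.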

It is possible that the POS tag you think a particular word may have is not the tag assigned to that word by the software in that particular context. The best way to prevent this error is to test your expectations with targeted examples in context, before applying the rule to a sample of documents that is like the data you will process with the model.

In addition, it is possible that the POS tag is misspelled or does not exist. Different languages, versions and products may use different POS tags. Consult your product documentation for lists of acceptable tags for rule-building. The spelling and case of the tags in the rules must be exactly as documented. Because writing a rule with a nonexistent tag like “:abc” is not a syntax error but a logical error, the syntax checking protocols will not catch it as an error, but there will not be any of the expected matches.

Another potential error when you are writing rules that contain a POS tag is forgetting to include the colon before specifying the tag. Without the colon, the system considers the rule to refer to a concept by that name or a string match, which may produce unexpected or no results. Syntax checking protocols will not return an error in this case.

When using the expansion symbols (e.g., @, @N, @V, @A), note that the expansion includes only related dictionary forms, not any misspellings that may have been identified by the misspelling algorithm or other variants associated with that lemma through use of a synonym list. To review what a lemma is, consult chapter 1. Also, remember that the forms of the words are looked up before processing, and when matching happens, the associated POS assignment of the word in the text is not considered. You can work around this issue, if you want to, using a CONCEPT_RULE; see section 7.2 for more information. Examining your output from rules that contain expansion symbols is recommended.

Referencing concepts by name without ensuring that you have used the correct name, including both case and spelling accuracy, can also reduce the number of expected matches. If you reference predefined concepts, be sure that they are loaded into your project, and always check the names because they may be different across different products.

Even though the form of a CONCEPT_RULE looks similar to rules used in SAS Categorization, there are some important differences. If you are used to writing categorization rules, you may make special types of errors in LITI rules. For example, you cannot use the following symbols in LITI rules:

  • * as a wildcard match on beginning or end of a word
  • ^ to tie a match to the beginning of the document
  • $ to tie a match to the end of the document
  • _L to match a literal string
  • _C to specify case-sensitivity

Another difference is that, in categorization rules, you do not need the _c{} extraction label, because you are not extracting anything; the rule either matches or does not. In a CONCEPT_RULE, the _c{} extraction label is required for the output that should be extracted. Finally, there are operators that you can use in categorization rules that are not available in LITI, including the following: NOTIN, NOTINSENT, NOTINPAR, NOTINDIST, START_n, PARPOS_n, PAR, MAXSENT_n, MAXPAR_n, MAXOC_n, MINOC_n, MIN_n, and END_n.
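If you are porting rules from a categorization project, a quick scan for the operators named above can catch this class of error early. The following Python sketch is one possible approach, not a feature of the software. The set contains only the operators listed in this section, written as they appear there, and the function name categorization_operators is an assumption made for this example.

```python
import re

# Categorization-only operators named in the text; "_n" stands for a number.
CATEGORIZATION_ONLY = {
    "NOTIN", "NOTINSENT", "NOTINPAR", "NOTINDIST", "START_n", "PARPOS_n",
    "PAR", "MAXSENT_n", "MAXPAR_n", "MAXOC_n", "MINOC_n", "MIN_n", "END_n",
}


def categorization_operators(rule: str) -> list:
    """Return operators in the rule that appear in the categorization-only set."""
    unquoted = re.sub(r'"[^"]*"', "", rule)             # drop quoted arguments
    found = re.findall(r"\(\s*([A-Z_0-9]+)", unquoted)  # name after each "("
    flagged = []
    for op in found:
        generic = re.sub(r"_\d+$", "_n", op)            # MAXSENT_3 -> MAXSENT_n
        if generic in CATEGORIZATION_ONLY:
            flagged.append(op)
    return flagged


print(categorization_operators('CONCEPT_RULE:(PAR, "_c{myProducts}", "PosAdj")'))
print(categorization_operators('CONCEPT_RULE:(PARA, "_c{myProducts}", "PosAdj")'))
```

The first rule is flagged because PAR is on the categorization-only list, whereas the second uses PARA, which Table 7.1 lists for this rule type, and is not flagged. The check only compares operator names; it says nothing about whether the remaining operators are combined logically.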

Any rule that can reference a concept and returns matches (for example, not REMOVE_ITEM or NO_BREAK) can participate in a cyclic dependency error. A cyclic dependency occurs when two or more concepts refer to each other in a circle of concept references. For example, if the concept myConceptA has rules that reference myConceptB, and myConceptB has rules that reference myConceptA, there is a cycle of references between them. This type of error will prevent your whole project from compiling, which is another reason to test your project often as you add concepts and rules: you will know that the most recently added rules created the cyclic dependency. Another strategy for avoiding this error is careful design of your taxonomy and model. Refer to chapter 13 to learn more about taxonomy design best practices.
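In a large model, a cycle can run through several concepts and be hard to spot by reading rules. The following Python sketch shows one way to find such a cycle with a depth-first search, assuming you have already collected, by hand or by script, a mapping from each concept name to the concepts that its rules reference. The function name find_cycle and the mapping format are assumptions made for this example and are not part of the software.

```python
from typing import Dict, List, Optional, Set


def find_cycle(references: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one cycle of concept names, or None if there is no cycle."""
    UNVISITED, ON_PATH, DONE = 0, 1, 2
    state = {name: UNVISITED for name in references}
    path = []  # concepts on the current chain of references

    def visit(name):
        state[name] = ON_PATH
        path.append(name)
        for target in sorted(references.get(name, ())):
            target_state = state.get(target, UNVISITED)
            if target_state == ON_PATH:
                # The chain has looped back onto itself.
                return path[path.index(target):] + [target]
            if target_state == UNVISITED:
                cycle = visit(target)
                if cycle:
                    return cycle
        path.pop()
        state[name] = DONE
        return None

    for name in sorted(references):
        if state[name] == UNVISITED:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


refs = {"myConceptA": {"myConceptB"}, "myConceptB": {"myConceptA"}}
print(find_cycle(refs))
```

For the two concepts from the example above, the function returns the chain myConceptA, myConceptB, myConceptA. For a mapping without circular references, it returns None.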

If you have checked all the above and are still having problems with your rules, then you should look at the logic defined by your combination of operators. A full understanding of operators is recommended if you are combining them together in a single rule. Consult chapter 11 to learn more about operators and how they interact. If you need more help with troubleshooting this rule type, see the discussion of match algorithms in section 13.4.1.

Finally, if you can use a simpler rule type to extract the information that you are trying to extract with a CONCEPT_RULE type of rule, always use the simpler rule type instead. Although the CONCEPT_RULE type is very powerful, it can be more difficult to maintain and troubleshoot in larger models. If you use it, make sure you use it correctly.

7.8. Best Practices

The best time to use a CONCEPT_RULE is when you have some complexity in the elements’ relationship to one another. Then it is useful to be able to specify the relationship between the elements, using a combination of operators. Another reason to use a CONCEPT_RULE is that you have more distance between targeted textual elements, such that predicting the intervening text is tricky or impossible.

Because the CONCEPT_RULE type is very versatile, some beginners are tempted to do most of the extraction that is possible with the previously discussed rule types, using only CONCEPT_RULE rules. However, because of the higher complexity level of this rule, it is not recommended to take such a “shortcut” in larger projects. One reason is that the computational load may be higher, which will mean the project will run more slowly. The second reason is that CONCEPT_RULE rules will be more difficult to read, so more comments will be needed to remember and record what each rule is intended to do. The result is that the project will be somewhat more difficult to maintain than if the beginner rule-writer had used more easily read rules instead.

Beginners should avoid this rule type, if possible, until some experience with the matching process is gained through practice of building and testing the simpler rule types. Table 7.2 contains some examples of situations in which you may try to use a CONCEPT_RULE when you should be using a different rule type.

Table 7.2. Examples of Situations in Which CONCEPT_RULE Should Not Be Used

| Situation | Suggested resolution |
| --- | --- |
| List of items | Instead of a list of arguments under an OR operator, use a series of CLASSIFIER or CONCEPT rules. |
| Predictable context | Instead of using DIST_n, use :sep or _w in a C_CONCEPT rule. |
| Return multiple matched elements | Use a CONCEPT rule, if possible. Otherwise, use the SEQUENCE rule type or, if that fails, the PREDICATE_RULE type. |

Another best practice is to reference concepts that contain simpler rule types, such as CLASSIFIER, CONCEPT, or C_CONCEPT, in CONCEPT_RULE rules. Avoid other rule types, such as REGEX, unless they are necessary. Additionally, avoid stacking CONCEPT_RULE rules so that one references another, which references another, creating layers, although this may sometimes be necessary to achieve certain goals. In short, keep rule types simple in concepts that are referenced in other rules, when possible.

Tip: Whenever possible, use simpler rule types for concepts that are referenced in other concepts higher in the taxonomy.

Build rules that are generalized to capture different types of patterns, while keeping them specific to the type of meaning they target. In other words, try to do only one task with each rule that you build; do not combine multiple tasks together into a single rule, unless that single rule is doing a specific and describable task itself with the pieces. For example, if you have a rule that targets finding a piece of information, like blood pressure, when it occurs in a context with person names, you should keep that as a separate rule from a rule that targets blood pressure, but in the context of drug names. This approach is recommended for the purposes of testing and maintainability. Even though the information you are finding and extracting in each situation is the same, the strategy, test data, and the types of language patterns will be different.

Keep in mind that CONCEPT_RULE rules have a lot of power, but that also means that they can sometimes match in contexts that you had no intention of matching. A powerful, general rule can be useful for some purposes, but if the rule leads to matches in the wrong type of situations, then it may need to be constrained further. A typical example is when you are finding dates, such as with the rule here:

CONCEPT_RULE:(SENT, (ORD, "_c{MonthName}", (OR, ":digit", ":NUM")))

This rule is intended to capture any month name like “January,” followed in the same sentence by a reference to a number. This approach may seem good and bring good results back in your initial tests. However, if you do not realize that this rule really assumes that the month name will always be referencing a month, then you may find yourself matching the wrong thing in situations where what you assume is the month could actually be a person name (June, April) or a regular word at the beginning of a sentence (May). In these cases, you should probably keep those ambiguous names separate from the unambiguous ones and put them only into more constrained rules. This strategy assumes that you are trying to maximize precision, as well as recall, in your testing.

Tip: To maximize precision, as well as recall, keep rules with ambiguous elements separate from rules with unambiguous elements.

Another key best practice for complex rules like CONCEPT_RULE rules is to build and test the pieces first and then combine them into the complex rule. You can compare the results from the pieces with the results from the full rule to verify that the rule is doing what you intend. For more information on good testing practices, see chapter 14.

A key best practice for all rules and rule types is to comment your rules or sections of rules with the intent of the rule, special considerations, decisions, and any other information that will make assessing or editing the rule later more efficient. Commented lines look like this:

# This rule should find a product name in the context of a marker
# that shows a positive assessment and return the marker -> put this
# marker into new data column called Positives in post-processing.
CONCEPT_RULE:(UNLESS, "negList", (DIST_5, "myProducts", "_c{PosAdj}"))

When you design your project and your rules, keep in mind that as you identify the pieces that you need, keeping those pieces meaningful and giving them useful names will help you to trace through your project later. You will be able to diagnose problems more easily because your assumptions will be clear through the project design and concept names, through comments, or both. Also, make concepts only as large as they need to be; smaller concepts with fewer rules are easier to troubleshoot and to understand than very large concepts with numerous rules. See chapter 13 for more information on designing projects.

7.9. Summary

Requirements for a CONCEPT_RULE include the following:

  • A rule type declaration in all-caps and followed by a colon
  • One or more Boolean or proximity operators, in a comma-separated list with arguments enclosed in parentheses
  • Each argument comprises one or more elements enclosed in double quotation marks
  • One _c{} extraction label on an element or multiple elements within the same argument that indicates what information to extract (If under an OR operator, put the _c{} extraction label somewhere in each of the arguments under the OR operator; otherwise, use the _c{} extraction label only once per rule)

Types of elements allowed include the following:

  • A string, a token, or sequence of tokens to match literally (“#” character must still be escaped for a literal match to occur)
  • A reference to another concept name, including predefined concepts
  • A POS or special tag preceded by a colon
  • A word symbol (_w), representing any single token
  • A cap symbol (_cap), representing any capitalized word

Allowed options for the rule type include the following:

  • Comments using the “#” modifier
  • Coreference symbols, including _ref{}, _P{}, and _F{}
  • Morphological expansion symbols, including @, @N, @A, and @V