Chapter 11: Best Practices for Custom Concepts

11.1. Introduction to Boolean and Proximity Operators

11.2. Best Practices for Using Operators

11.2.1. Behavior of Groupings of Single Operators

11.2.2. SAS Categorization Operators

11.2.3. Combinations of Operators and Restrictions

11.3. Best Practices for Selecting Rule Types

11.3.1. Rule Types and Associated Computational Costs

11.3.2. Use of the Least Costly Rule Type for Best Performance

11.3.3. When Not to Use Certain Rule Types

11.4. Concept Rules in Models

11.1. Introduction to Boolean and Proximity Operators

In chapters 7 and 8, it was mentioned that the CONCEPT_RULE and PREDICATE_RULE rule types can contain Boolean operators, such as AND, OR, and NOT, and distance operators, such as SENT_n, DIST_n, and ORDDIST_n. Many operators are available in LITI, and because they can be used in isolation as well as together, it can be difficult to know which operator or operators to choose for a particular purpose. Likewise, it can be challenging to know which rule type is best for a particular situation. After reading this chapter, you will be able to do the following tasks:

  • Rely on best practices when choosing operators in isolation and in combination
  • Choose the most appropriate rule type for your project on the basis of computational cost and performance factors

11.2. Best Practices for Using Operators

Each operator is a logical command over a set of arguments. The command controls the requirements for a match to be found in text data. In other words, the operator over a set of arguments defines how many and in what relationships the arguments may occur to make the rule “true” in the data. The operators include both the standard Boolean operators AND, OR, and NOT, as well as additional proximity operators that add constraints about the context the arguments must appear in.

Operators are used in CONCEPT_RULE, PREDICATE_RULE, and REMOVE_ITEM rule types. In the first two rule types, they may be used in almost infinite combinations to control the conditions for a match, because all the operators except ALIGNED allow other operators to be arguments in addition to any other elements that may appear in the rule type. In other words, an operator may govern another operator in those two rule types.

Remember: Arguments are always one or more elements between double quotation marks. Elements may also have modifiers such as @ or an extraction label like _c{}.

11.2.1. Behavior of Groupings of Single Operators

The operators have a basic behavior that spans sets of operators, and this behavior is useful to know for choosing the right operator for your purposes. This common behavior of operators is summarized in Table 11.1 and described in more detail in the next sections.

Table 11.1. Operator Groupings

  Operators: OR
  Description: Any argument found in the text triggers a match; if one argument has a _c{} modifier or a fact label, then all must.

  Operators: AND, DIST_n, ORD, ORDDIST_n
  Description: All arguments are required to be found in the text to trigger a match; arguments’ order and distance constraints apply as well. When used, the n is replaced by a digit.

  Operators: SENTEND_n, SENTSTART_n
  Description: All arguments are required to be found in the text to trigger a match; distance from the start or end of a sentence constrains the match. When used, the n is replaced by a digit.

  Operators: SENT, SENT_n, PARA
  Description: All arguments are required to be found in the text to trigger a match; document structure criteria constrain the match as well. When used, the n is replaced by a digit.

  Special operators (each requires a specific context to work):
    ALIGNED ► allowed only in the REMOVE_ITEM rule type
    UNLESS ► second argument should be headed by one of the following operators: AND, SENT, DIST_n, ORD, ORDDIST_n
    NOT ► must be an argument of AND; cannot stand alone

The OR Operator

First, the OR operator requires only one argument, but is generally used to govern a list of items. At least one of the arguments in the list must match to satisfy the requirements of the OR. If there is only one argument under an OR, then there is probably a simpler way to write the rule, such as using a CLASSIFIER or CONCEPT rule type. Usually an OR is applied in combination with other operators in the same rule. Here is a simple rule with only an OR operator as an example:

CONCEPT_RULE:(OR, "_c{love}", "_c{joy}", "_c{peace}")

The OR governs three arguments, each in double quotation marks. If at least one of the three words is present in data, then the requirements of the OR operator are satisfied, and because it is the only operator in the rule, the rule matches. In other words, OR is “true,” and the rule is therefore also “true.” This same result could also be achieved with three CLASSIFIER rules, and would be easier to read and maintain:

CLASSIFIER:love

CLASSIFIER:joy

CLASSIFIER:peace

The reason that the _c{} extraction label is required in each argument of an OR operator in a CONCEPT_RULE, if it appears in any of them, is that only one of the arguments has to match to satisfy the conditions, and the others are not required. For example, if the _c{} extraction label were not present on the argument “joy,” and the text matched only that argument, then there would be no return command in the part of the rule that matched, and no match would be returned; it would be as if the rule did not match. This type of error is difficult to catch during syntax validation because of potentially embedded operators, so the logic in the rule must be verified manually.

Operators Related to AND

The second group of operators in Table 11.1 governs two or more arguments and requires all to match in order to satisfy the operator requirements. The AND operator works this way: All arguments are required to match to make AND “true.” The scope of AND is the entire document, so arguments may appear anywhere in the document and in any order. The other arguments in this group work like the AND operator but have a second test that makes each of them different from the others.

First, the DIST_n operator works like AND, except it specifies a restricted scope. All the arguments of DIST_n must match within a distance of n tokens, where n is a digit. Next, the ORD operator works just like AND, except it requires that the arguments appear in the document in the same order that they appear in the rule; the scope is still the entire document. Finally, the ORDDIST_n operator both limits the scope to n tokens, and requires that the arguments appear in the same order in both the rule and the document text.
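The distinctions among AND, ORD, DIST_n, and ORDDIST_n can be made concrete with a small sketch. The Python code below is an illustration of the logic only, not the SAS implementation; the representation of argument matches as lists of token positions, the function names, and the exact window arithmetic are assumptions for demonstration.

```python
import itertools

def and_ok(positions):
    """AND: every argument matches somewhere in the document."""
    return all(positions)

def ord_ok(positions):
    """ORD: some choice of one position per argument is strictly increasing."""
    return any(all(a < b for a, b in zip(c, c[1:]))
               for c in itertools.product(*positions))

def dist_ok(positions, n):
    """DIST_n: some choice of positions fits within a window of about n tokens."""
    return any(max(c) - min(c) <= n for c in itertools.product(*positions))

def orddist_ok(positions, n):
    """ORDDIST_n: some choice of positions is in order and within the window."""
    return any(all(a < b for a, b in zip(c, c[1:])) and c[-1] - c[0] <= n
               for c in itertools.product(*positions))

# "I have a good job, so I will not quit."
# 0-based token positions: good=3, job=4, not=8, quit=9
pos = [[3], [4], [8], [9]]
print(and_ok(pos), ord_ok(pos), dist_ok(pos, 10), orddist_ok(pos, 10))
```

The key point the sketch captures is that ORD and ORDDIST_n add an ordering test on top of AND, while DIST_n and ORDDIST_n add a window test; the scope of plain AND remains the whole document.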

The next example uses data from a city government’s records of 311 service requests that include a text field describing the resolution of each citizen request (available online at https://data.cityofnewyork.us/dataset/311-Service-Requests-From-2015/57g5-etyj). Imagine that you are doing an audit of the resolution of the complaints for a period of time, and you want to specifically look at any complaint that would have been resolved without a specific action being taken to fix or address the issue. You can then compare the results of your search with the department responsible for each request. Two rules using the operators discussed above are provided below in a concept named noAction, which targets resolutions in which no government action was taken on a 311 service request:

CONCEPT_RULE:(ORDDIST_2, "no", (OR, "_c{_w action}", "_c{_w evidence}"))

CONCEPT_RULE:(ORDDIST_2, "not", (OR, "_c{_w violate@}", "_c{_w necessary}", "_c{_w found}"))

Consider the following input documents:

1. The Police Department responded to the complaint and determined that police action was not necessary.

2. The Police Department responded and upon arrival those responsible for the condition were gone.

3. Unfortunately, the behavior that you complained about does not violate any law or rule. As a result, no city agency has the jurisdiction to act on the matter.

4. The Police Department responded to the complaint and with the information available observed no evidence of the violation at that time.

Pause and think: Assuming the rules above, can you predict the matches for the input documents above?

The matches for the noAction concept with the input documents above are in Figure 11.1.

Figure 11.1. Extracted Matches for the noAction Concept

  Doc ID   Concept    Matched Text
  1        noAction   not necessary
  3        noAction   not violate
  4        noAction   no evidence

The two rules look for indications that a complaint was resolved without any direct action being taken to correct a given condition or situation. Each requires a narrowly defined relationship between a negation word and a closely following marker of intention to act. Each of the input documents describes a situation in which the government found that it could or should take no action, but the second document does not match either rule. Another rule is needed to capture the finding that “those responsible” were “gone.” Note that in order to extract extra context as a part of the extracted match, an extra _w was placed before each of the extracted terms in this rule. Variations on this trick are useful ways to work around optional components in a match when you need a particular one to be there. As you collect examples and analyze patterns, you can continue adding rules to your model until you are satisfied by your testing that you have found a good sample of reports to review more closely.

Note that in the group of operators described in this section, there should be two or more arguments under each operator type. Be cautious: The software will validate and run if the operators are used with only one argument, but that is a logical error. Avoid such errors, because they make your rules more difficult to read, maintain, and troubleshoot. For example, you should avoid rules like the following:

ERROR! -> CONCEPT_RULE:(ORD, (AND, "_c{go@}"), (AND, "stay@"))

In this rule, the ORD operator has two arguments, which is correct. However, each of the AND operators has only one argument, which contributes nothing to the rule, as if it were not there. The correct way to write this rule is as follows:

CONCEPT_RULE:(ORD, "_c{go@}", "stay@")

This version looks cleaner and is much easier to understand; there is a match if the terms “go” and “stay” appear in the document in that order, and it returns the match for “go.”

Operators Related to Sentence Start and End

The third group of operators in Table 11.1 governs one or more arguments. SENTEND_n and SENTSTART_n each rely on the structure of sentences to bound the distance between arguments. They work much like DIST_n in that they consider token count, but they add the criterion of a sentence boundary. SENTEND_n matches one or more arguments that occur within n tokens of the end (last token) of a sentence. Counting backward from the end of a sentence by the number of tokens specified defines the scope of a possible match; all arguments must then be found within that distance. SENTSTART_n works the same way but starts the count at the beginning (first token) of each sentence.
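To make the counting explicit, here is a small Python sketch of the SENTSTART_n and SENTEND_n windows. It is an illustration under stated assumptions (0-based token offsets and a window of exactly n tokens), not the SAS implementation; the function names are invented.

```python
def sentstart_ok(match_positions, sent_start, n):
    """SENTSTART_n: all matches fall within the first n tokens of the sentence."""
    return all(sent_start <= p < sent_start + n for p in match_positions)

def sentend_ok(match_positions, sent_end, n):
    """SENTEND_n: all matches fall within the last n tokens of the sentence."""
    return all(sent_end - n < p <= sent_end for p in match_positions)

# A noun phrase starting at token 0 of a sentence that begins at token 0
# satisfies SENTSTART_5:
print(sentstart_ok([0], sent_start=0, n=5))  # True
```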

Here is an example of using the SENTSTART_n operator in a rule. The data consists of reports on restaurant inspection results in a large city (available at https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j). Each record is usually short, but some are a few sentences long. The goal of this rule in the mainTopic concept is to find the best summary in the form of a noun phrase that will be used to categorize each report:

CONCEPT_RULE:(SENTSTART_5, "_c{nounPhrase}")

This rule looks for a noun phrase within 5 tokens of the beginning of the sentence, because you observed that the first noun phrase in your data usually indicates the main topic of the report. The nounPhrase concept is a custom concept that has rules that handle singular nouns, plural nouns, proper nouns, pronouns, possessive nouns, and adjectival modifiers. Some of these rules are provided for you in the supplementary code that accompanies this book and is available online.

Here are a few records from the data set and the results of running the rule set described above on each input document. The matches are shaded gray:

1. Hot food item not held at or above 140º F.

2. Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.

3. Proper sanitization not provided for utensil ware washing operation.

4. Bulb not shielded or shatterproof, in areas where there is extreme heat, temperature changes, or where accidental contact may occur.

5. Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.

After reviewing the results, you decide that the first noun phrase found will be used as your summary unless it is only a single word in length; in that case, it will be combined with the second noun phrase, if another one is found. You derive your final summaries by postprocessing the matches shown above. Figure 11.2 shows the final summary report, based on running your own postprocessing code with the algorithm to select the first noun phrase or first pair of nouns.

Figure 11.2. Postprocessed Matches for the mainTopic Concept

  Doc ID   Concept     Matched Text
  1        mainTopic   Hot food item
  2        mainTopic   Food contact surface
  3        mainTopic   Proper sanitization
  4        mainTopic   Bulb
  5        mainTopic   Facility vermin

You have successfully extracted the most important noun phrases from the restaurant inspections data set, creating a new, structured data set that can be used for counts and reporting.

Operators Related to Sentence or Paragraph Structure

The fourth group of operators uses the structure of the document to manage scope constraints, overriding the default document-level scope. If governing one argument, that argument is usually another operator. The SENT and PARA operators define the scope of the match as within one sentence or one paragraph respectively. For SENT_n, you can specify the number of sentences that will scope the match—all arguments must appear within the bounds of n sentences, where n is some digit. These three operators are useful for matching items that are in grammatical, topical, or discourse relationships. They are also useful for helping to constrain matches in longer documents instead of using AND. They frequently govern the operators discussed in the first three groups above in combination.

The goal in the example below is to find mention of health issues near the discussion of senior citizen needs in political speeches. The concept seniorHealth contains the rule below, which uses the PARA operator to look in each paragraph to find sentences that mention health-related topics, such as “healthcare,” “healthy,” “drug,” “drugs,” or “medicine,” within three sentences of the discussion of senior citizen issues, as defined in the seniorCitizen concept. The PARA operator governs the SENT_n operator, which in turn governs an OR operator.

PREDICATE_RULE:(health, senior):(PARA, (SENT_3, (OR, "_health{healthcare}", "_health{healthy}", "_health{drug@}", "_health{medicine}"), "_senior{seniorCitizen}"))

The concept seniorCitizen includes the following rules relevant to the example:

CLASSIFIER:elderly

CONCEPT:senior@

Consider the following input documents:

1. 450,000 of our citizens will lose access to healthcare because of the lack of funding for our Medicaid programs and 800,000 meals for the elderly will be eliminated.

2. We increased funding to help Alabama seniors get free prescription drugs. More citizens than ever can get help buying the medicine they need. Now they won’t have to choose between eating and taking their prescriptions.

3. Congressional leaders promised help, but they failed to deliver on a prescription drug benefit program. I’m not waiting any longer. During this session, we will create a prescription drug program that will lower the cost of drugs for Alabama seniors.

Pause and think: Assuming the model above, can you predict the matches for the seniorHealth concept and the input documents above?

The matches for the seniorHealth concept and the input documents above are in Figure 11.3.

Figure 11.3. Extracted Matches for the seniorHealth Concept

  Doc ID   Concept        Extraction Label   Extracted Match
  1        seniorHealth   (none)             healthcare because of the lack of funding for our Medicaid programs and 800,000 meals for the elderly
  1        seniorHealth   senior             elderly
  1        seniorHealth   health             healthcare
  2        seniorHealth   (none)             seniors get free prescription drugs. More citizens than ever can get help buying the medicine
  2        seniorHealth   senior             seniors
  2        seniorHealth   health             medicine
  2        seniorHealth   (none)             seniors get free prescription drugs
  2        seniorHealth   senior             seniors
  2        seniorHealth   health             drugs
  3        seniorHealth   (none)             drug benefit program. I'm not waiting any longer. During this session, we will create a prescription drug program that will lower the cost of drugs for Alabama seniors
  3        seniorHealth   senior             seniors
  3        seniorHealth   health             drug
  3        seniorHealth   (none)             drug program that will lower the cost of drugs for Alabama seniors
  3        seniorHealth   senior             seniors
  3        seniorHealth   health             drug
  3        seniorHealth   (none)             drugs for Alabama seniors
  3        seniorHealth   senior             seniors
  3        seniorHealth   health             drugs

Note that the unlabeled matched strings for the second and third input documents overlap. If you change the “all matches” algorithm to “longest match,” the duplicate matches without a label and the corresponding extracted matches with labels will be removed automatically, resulting in the output in Figure 11.4.

Figure 11.4. Extracted Matches for the seniorHealth Concept

  Doc ID   Concept        Extraction Label   Extracted Match
  1        seniorHealth   (none)             healthcare because of the lack of funding for our Medicaid programs and 800,000 meals for the elderly
  1        seniorHealth   senior             elderly
  1        seniorHealth   health             healthcare
  2        seniorHealth   (none)             seniors get free prescription drugs. More citizens than ever can get help buying the medicine
  2        seniorHealth   senior             seniors
  2        seniorHealth   health             medicine
  3        seniorHealth   (none)             drug benefit program. I'm not waiting any longer. During this session, we will create a prescription drug program that will lower the cost of drugs for Alabama seniors
  3        seniorHealth   senior             seniors
  3        seniorHealth   health             drug

For each input document in these results, there is only one set of extracted matches, comprising one match for the “senior” label, one for the “health” label, and the text span between them.

Special Operators

The final set of operators is special, because they are less universal than the ones described above. To work properly, each of these operators requires a special context or structure of a rule.

First, the ALIGNED operator matches two arguments when both arguments match the same text in the document. In other words, you define the two arguments to extract the exact same span of text. For example, suppose that you want to match the string “love” only when it is also a noun: the two arguments could be “love” and “:N.” Theoretically, these two arguments would satisfy the requirements of the ALIGNED operator if used in a rule.

The second restriction on ALIGNED is that it must be used only in a REMOVE_ITEM rule. If you want this behavior to apply to a single token in CONCEPT_RULE or PREDICATE_RULE rules, then you can use DIST_0 to get two criteria applied to the same token. See the example in section 7.2. The REMOVE_ITEM rule allows two arguments governed by ALIGNED: The first argument must contain a _c{} extraction label on a concept name, specifying what match to remove. For more information and examples of this rule type, see section 9.2.
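The same-span requirement of ALIGNED can be sketched as a simple intersection over matched spans. The span representation as (start, end) token offsets and the function name below are invented for illustration; this is not SAS internals.

```python
def aligned_ok(spans_a, spans_b):
    """ALIGNED: some span matched by both arguments is exactly the same."""
    # Spans are (start, end) token offsets produced by each argument's matcher.
    return bool(set(spans_a) & set(spans_b))

# The string "love" matched tokens (2, 2); a hypothetical noun matcher for
# :N matched (2, 2) and also (7, 7) elsewhere in the document:
print(aligned_ok([(2, 2)], [(2, 2), (7, 7)]))  # True
print(aligned_ok([(2, 2)], [(2, 3)]))          # False: spans differ
```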

The next special operator is UNLESS, which governs two arguments, the second of which can be AND, DIST_n, ORD, ORDDIST_n, or SENT. Except for SENT, the remaining operators were all described in the second set in Table 11.1. The UNLESS operator requires that the first argument not be present within the match scope of the second. It is a way of filtering matches and restricting a rule that is capturing false positive matches. For example, you can use the UNLESS operator to eliminate sentences containing negation, as illustrated in section 7.5.

The final operator is NOT, which takes only one argument. It is a basic operator that seems simple at first, but there are some special restrictions that make this operator tricky. First, this operator must be under the AND operator in the hierarchy and, in this case, the AND operator must be at the top of the hierarchy.

Alert! The only operator that can govern a NOT operator is AND. Do not put any other operator above NOT in the hierarchy. The part of the rule containing the NOT operator is applied to the entire document.

If you do not follow this best practice, then the rule may validate, but not work the way you would expect. For example, in the rule below, it is erroneous to put NOT under a SENT operator (two levels up):

ERROR! -> CONCEPT_RULE:(SENT, (AND, "drive@", (NOT, "crazy")), "_c{nlpMeasure}")

The intention of this rule is to find mentions of driving and some measure amount, like “200 miles,” in the same sentence as long as the word “crazy” is not also in the sentence. This approach avoids idioms, such as “drives me crazy,” when matching literal driving events and extracting the distance driven. However, because the NOT operator cannot be governed by the SENT operator, what really happens is that the word “crazy” found anywhere in the document will cause the rule to fail to match. This formulation of the rule better matches its behavior:

CONCEPT_RULE:(AND, (SENT, "drive@", "_c{nlpMeasure}"), (NOT, "crazy"))

This version makes explicit that NOT acts independently of the SENT operator restriction. The preceding rule works the same way, but its form obscures the expected results.

You may ask why this restriction exists on the NOT operator, because operators like SENT and DIST_n are really a type of AND plus scope restrictions, as are all the operators that can serve as the second argument of UNLESS above. The answer is that this additional capability within the LITI syntax is possible, and if SAS customers request it, then it will likely be provided.

11.2.2. SAS Categorization Operators

If you are familiar with SAS categorization models, then you will recognize the use and syntax of the CONCEPT_RULE rule type as similar to categorization rules. The main differences are that a categorization rule has no rule type declaration (the rule just starts with an open parenthesis) and no extraction labels, because in categorization no information is truly extracted. However, this boundary is blurred by the match string information that may be used as output in categorization.

The syntax of LITI is different from categorization in some unexpected ways, and you might be tempted to use shortcuts from categorization that are not supported by LITI. For example, there is no support for using the following symbols in a rule in LITI:

  • * as a wildcard match on the beginning or end of a word
  • ^ to tie a match to the beginning of a document
  • $ to tie a match to the end of a document
  • _L to match a literal string
  • _C to specify case sensitivity

The set of operators that you can use in categorization rules includes several that are currently not available in LITI, including NOTIN, NOTINSENT, NOTINPAR, NOTINDIST, START_n, PARPOS_n, PAR, MAXSENT_n, MAXPAR_n, MAXOC_n, MINOC_n, MIN_n, and END_n. Do not attempt to use these operators, because they will only give you a compilation error. If you need any of these operators, consider whether you could combine a concept model and a categorization model together. Concepts can be referenced in categorization models in the same way that concepts are referenced in LITI rules, but with a slightly different syntax. If users request implementation of any of these categorization-only operators for LITI rules, then these operators could be added in the future.

Alert! There are differences between the syntax of rules for information extraction and those for categorization. In addition, the set of operators available for information extraction is narrower than the set available for categorization.

11.2.3. Combinations of Operators and Restrictions

Earlier, it was mentioned that the operators may be combined in almost infinite ways to control the characteristics of matches to CONCEPT_RULE and PREDICATE_RULE types of rules. Some exceptions have been described in the previous sections, and next you can learn about some of the most and least useful types of combinations. Note that these are general tips and guidelines, but there are situations in which it may be fine to ignore them. Some combinations will compile, but not work the way that you might expect. This section will clarify those situations.

Rules with multiple layers of embedded operators are evaluated in the system via a bottom-up approach. At each layer, the governed operator passes on true or false information to the governing operator, and that one passes it to its governing operator and so forth through the layers until the entire rule is evaluated. As you are writing custom rules with multiple layers of embedded operators, consider this approach and test to make sure that the results meet your needs.
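This bottom-up evaluation can be sketched in a few lines of Python. The tree encoding and the restriction to OR, AND, and NOT are assumptions for illustration only; the real LITI evaluator also applies scope and distance constraints at each layer.

```python
def evaluate(node, matches):
    """Bottom-up evaluation: leaves report whether they matched, and each
    operator combines its children's results and passes the answer up."""
    if isinstance(node, str):                 # leaf argument
        return bool(matches.get(node))        # did this argument match anywhere?
    op, *args = node
    results = [evaluate(a, matches) for a in args]
    if op == "OR":
        return any(results)
    if op == "AND":
        return all(results)
    if op == "NOT":
        return not results[0]
    raise ValueError("unsupported operator: " + op)

# (AND, (OR, "love", "joy"), (NOT, "hate")) with "joy" found at token 4:
rule = ("AND", ("OR", "love", "joy"), ("NOT", "hate"))
print(evaluate(rule, {"joy": [4]}))               # True: joy present, hate absent
print(evaluate(rule, {"joy": [4], "hate": [9]}))  # False: hate blocks the match
```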

Tips for Use of OR, NOT, and UNLESS

The first tips are about the use of OR. As a general rule, do not use an OR operator at the top level of your rules. If OR is at the top of your rule structure, then you could likely write two rules that would be easier to read and maintain. Remember that if OR is the only operator in your rule, then you should be using a different rule type. The example below, although perfectly valid, could be rewritten as four CLASSIFIER rules with much less complicated syntax:

Avoid this! -> CONCEPT_RULE:(OR, (OR, "_c{love}", "_c{kindness}"), (OR, "_c{joy}", "_c{happiness}"))

The approach with CLASSIFIER rules follows:

CLASSIFIER:love

CLASSIFIER:kindness

CLASSIFIER:joy

CLASSIFIER:happiness

Keep in mind the restrictions on the use of NOT and UNLESS. When using NOT, be sure to connect it to a top-level AND operator, and do not artificially embed it under other scope-restricting operators. Keep the NOT sections at the top level of your rule, so that the rule itself reminds you that NOT cannot be limited to less than document scope.

Use UNLESS carefully and be mindful of its restrictions; see section 7.5 for details. At the time of this writing, it is the newest and potentially the most brittle of all the operators, so test such rules carefully at every stage of your project if you use them.

Basic Combinations and Pitfalls

One good basic guideline is that the items in the second and third groups of Table 11.1 usefully govern each other and OR. The operators in these groups include AND, DIST_n, ORD, ORDDIST_n, SENTEND_n, and SENTSTART_n. However, there is a caveat to this guideline, which is discussed next.

A useful situation where this guideline works well is when an AND operator governs an ORD and a DIST_n, each of which has arguments of its own. Here is an example:

CONCEPT_RULE:(AND, (ORD, "arg1", "arg2"), (DIST_n, "arg3", "_c{arg4}"))

This rule reads that, if arg1 and arg2 are in that order in the document AND arg3 and arg4 are within n tokens of each other in the document, then the rule will match and extract arg4. It does not matter whether the matches for the ORD and the DIST_n operators overlap, because the AND operator has no restrictions other than both pairs of arguments appearing in the document scope.

However, there are some combinations that may not work the way you would expect. One example involves ORD and ORDDIST_n. It is redundant for ORD to govern ORDDIST, because the ordering command exists for the arguments of ORD and applies to the arguments of any operators that it governs. Consider each of the following two rules. The first one can be interpreted this way: Find “good” within five tokens preceding “job,” both of which should precede “not” within seven tokens preceding “quit.” The second one could be interpreted this way: Find “good” within five tokens of “job,” both of which should precede “not” within seven tokens of “quit.”

CONCEPT_RULE:(ORD, (ORDDIST_5, "good", "_c{job}"), (ORDDIST_7, "not", "quit"))

CONCEPT_RULE:(ORD, (DIST_5, "good", "_c{job}"), (DIST_7, "not", "quit"))

For distinguishing between matches, the first rule is in a concept named jobEval1, and the second is in jobEval2. Consider the following input documents:

1. I have a good job, so I will not quit.

2. I have a good job, and if I quit, I will not be happy.

Pause and think: Can you predict the matches for jobEval1 and jobEval2 with the input documents above?

Both jobEval1 and jobEval2 extract the same match, as seen in Figure 11.5.

Figure 11.5. Comparison of Extracted Matches for the jobEval1 and jobEval2 Concepts

  Doc ID   Concept    Matched Text
  1        jobEval1   job
  1        jobEval2   job

Although the ordering is not explicit in the (DIST_7, "not", "quit") part of the rule in the jobEval2 concept, there is no match for the second document because ORD applies to the arguments of the DIST_7 operator that it governs.

Just as the ORD operator’s governing ORDDIST is redundant, so is ORDDIST’s governing ORD. Therefore, the following two rules in the jobEval concept produce the same matches with the two input documents above as the previous two rules:

CONCEPT_RULE:(ORDDIST_10, (ORD, "good", "_c{job}"), (ORD, "not", "quit"))

CONCEPT_RULE:(ORDDIST_10, (AND, "good", "_c{job}"), (AND, "not", "quit"))

In fact, the first rule is an error, because its formulation gives the impression that, if you find the pairs of words under the ORD operators in the right order, then the pairs could even overlap and the rule would still match. However, that is not the case. If you run the first rule above on the following two documents, the only match extracted is “job” in the second document:

1. Good, you did not quit your job.

2. Good job, you did not quit.

To go even further, the second rule above also contains redundant operators, which is an additional error. The rule will behave the same way if you write it without the AND operators, so the right way to write this rule is to remove the redundant operators completely, as in the following rule:

CONCEPT_RULE:(ORDDIST_10, "good", "_c{job}", "not", "quit")

This rule is also much easier to read and to maintain.

Semantic Hierarchy of Operators

All the information presented so far points to a semantic hierarchy between these operators. When you understand the operators and their hierarchy, you can write better rules.

Remember that ORD is like AND, but with an added ordering constraint. You can interpret that to mean that ORD means [and] + [order], where each of the items in the square brackets is a part of the meaning of the ORD operator. The square brackets are used in the rest of this section to denote a component of meaning or characteristic of each operator. This approach means that the most useful rule of operator combination will be a heuristic one, as described in this tip:

Tip: Use operators that are governed by other operators where the governing operator does not already imply the same characteristics as the governed one. In other words, the lower-level operator should add elements of meaning or constraints in order to be useful.

The exception to this rule is where the two related operators share the [distance] constraint. In that case, the higher operator should have the same or larger digit on the distance operator.

Putting this tip into practice implies certain recommendations for the first three groups of operators from Table 11.1. Each operator has specific meaning components. Table 11.2 shows the list of operators each can effectively govern.

Table 11.2. Operator Governance for OR, AND, ORD, DIST_n, SENTEND_n, SENTSTART_n, and ORDDIST_n

Meaning Components             Operators ► What They Can Govern
[or]                           OR ► any but OR (unless using groupings for enhanced readability or maintenance)
[and]                          AND ► any but AND
[and] + [order]                ORD ► any but AND, ORD, ORDDIST_n
[and] + [distance]             DIST_n ► any but AND
[and] + [distance]             SENTEND_n ► any but AND
[and] + [distance]             SENTSTART_n ► any but AND
[and] + [order] + [distance]   ORDDIST_n ► any but ORD, AND

Table 11.2 shows that OR can govern any of the other operators, but keep in mind the caveat about using it at the top of your rule structure. If you put multiple OR operators in a hierarchical relationship with each other, then it will work to organize arguments into sets, but will not make the rule behave differently. For example, suppose you want to capture specific drink types and mention of sugar content with the following rule:

PREDICATE_RULE:(drink, sugar):(DIST_8, “_sugar{nlpMeasure}”, (OR, (OR, “_drink{grape juice}”, “_drink{apple juice}”, “_drink{orange juice}”), (OR, “_drink{vodka}”, “_drink{beer}”, “_drink{wine}”)))

You can see two OR lists under the OR that is the second argument of DIST_8. This higher OR does all the work of creating a list of drink types; the lower-level OR operators do nothing other than allow the rule builder to group types of drinks together. A better place to do this type of organization is in a separate concept for drinks, but this approach may be useful during the exploration phase of rule-building.

The meaning of AND, which is by default the same as [document scope], is included in all the other operators’ constraint set, and all the other operators add their own constraints. Therefore, putting an AND operator under any of the others is redundant, as is putting an [order] operator under another [order] operator.
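For example (a hypothetical sketch applying the principle above), the inner ORD in the first rule below adds no meaning beyond what the outer ORD already imposes, so the flatter second form is preferred:

```
CONCEPT_RULE:(ORD, (ORD, “good”, “_c{job}”), “quit”)

CONCEPT_RULE:(ORD, “good”, “_c{job}”, “quit”)
```

Both forms require “good,” “job,” and “quit” to appear in that order, so the nested ORD only makes the rule harder to read.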

As the caveat implies, the [distance] operators work differently, because distance is always defined by a number. It is possible and logical to put a [distance] operator under another [distance] operator, assuming that the lower-level operators are more constrained by their number than the governing operator. This rule illustrates that approach:

CONCEPT_RULE:(DIST_18, (DIST_5, “good”, “_c{job}”), (DIST_5, “not”, “quit”))

Each of the items that are most closely related is constrained to within 5 tokens of its partner; the entire rule, however, can match across a total of 18 tokens. Assuming the four elements do not overlap (for example, “not” appearing between “good” and “job”), the top-level operator allows up to 8 more tokens between the two subordinate matches. If that number were 10 instead, the rule would allow no intervening tokens between the subordinate matches and no overlap between them. An even smaller number, like 8, would constrain the matches further, never allowing both DIST_5 operators to reach their full distance in the same match. If the upper-level operator goes as low as 5, then the lower operators become redundant and should be removed.
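The arithmetic above can be checked with a small sketch. Assuming, for illustration, that DIST_n bounds the number of tokens spanned by its arguments (the exact token-counting semantics are product-specific), the slack the top-level operator leaves between the two subordinate matches is:

```python
def top_level_slack(outer_n, inner_ns):
    """Tokens left over between subordinate matches when every
    inner DIST_n match stretches to its maximum width."""
    return outer_n - sum(inner_ns)

# DIST_18 over two DIST_5 groups leaves 8 tokens of slack.
print(top_level_slack(18, [5, 5]))  # 8

# With DIST_10 on top, no intervening tokens are allowed.
print(top_level_slack(10, [5, 5]))  # 0

# A negative value means the inner operators can never both
# reach their full distance at the same time.
print(top_level_slack(8, [5, 5]))   # -2
```

When the slack reaches zero or below, consider whether the inner distance operators still add a useful constraint.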

DIST_n and ORDDIST_n operators can also be used together, with the same caveat in mind, plus the basic rule that says [order] will constrain all the arguments if ORDDIST_n is used as a top operator, but only its own arguments if it is used under a DIST_n operator. So, the following rules produce different matches:

CONCEPT_RULE:(DIST_18, (ORDDIST_5, “good”, “_c{job}”), (DIST_5, “not”, “quit”))

CONCEPT_RULE:(ORDDIST_18, (DIST_5, “good”, “_c{job}”), (DIST_5, “not”, “quit”))

For the purpose of distinguishing between matches, the first rule is in a concept named jobEval1, and the second is in jobEval2. Consider the following input documents:

1. She does not have a good job, so she will quit.

2. He has a good job, but he will still quit, though not right away.

Pause and think: Can you tell which of the above rules match the input documents?

The extracted match is represented in Figure 11.6.

Figure 11.6. Extracted Match for the jobEval1 Concept

Doc ID   Concept    Match Text
2        jobEval1   job

The first rule above matches the second input document, because there is no operator that requires that “not” and “quit” be in that order. In the first input document, “not” and “quit” are just too far apart to match, because both rules require that they be no more than 5 tokens apart.

The second rule does not match either document. It cannot match the first document because DIST_5 is too small a distance to capture “not” and “quit” in that sentence. It also cannot match the second document because “quit” comes before “not,” and the top-level ORDDIST_18 imposes its ordering constraint on all the arguments, requiring the reverse order.

Sentence Start and End Combinations

The two operators SENTSTART_n and SENTEND_n are a little more complicated when used together. If you want the match to be on the same token, then you can construct rules like the following, encapsulating one of the operators within the other:

CONCEPT_RULE:(SENTSTART_10, (SENTEND_10, “_c{job}”))

CONCEPT_RULE:(SENTEND_10, (SENTSTART_10, “_c{job}”))

Either operator can come first, and you will see the same match pattern: both rules above match the second document below but not the first. In the first document, the word “job” appears within 10 tokens of the start of the sentence and again, as a separate token, within 10 tokens of the end. Because one operator governs the other in each rule, the match must be a single token that satisfies both constraints, not two separate identical tokens, so the first document does not match. The input documents are as follows:

1. He has a good job, but he will still quit, though not until he finds another job.

2. I have a good job, so I will not quit.

If you want to match the first document, then use the AND operator in your rule to put each of the other operators on the same level, as in this example:

CONCEPT_RULE:(AND, (SENTSTART_10, “job”), (SENTEND_10, “_c{job}”))

Assuming the rule above is in a concept named jobEval, the matches are shown in Figure 11.6.

Figure 11.6. Extracted Matches for the jobEval Concept

Doc ID   Concept   Match Text
1        jobEval   job
2        jobEval   job

Note that only the second occurrence of the string “job” in the first document is returned as a match.

Scope Override Operators

Turning to the fourth set of operators, SENT, SENT_n, and PARA, as shown in Table 11.3, you see that they all override the default scope constraint of AND and limit the scope of the match. Because of this, they are not usually used to govern the AND operator. They are often used to constrain the other groups of operators discussed earlier in this chapter. They also interact with each other in a way similar to the [distance] operators above: in general, the [scope] operator at the higher level should specify a larger scope. SENT is always smaller than SENT_n and PARA and should not govern them. PARA and SENT_n may each have larger scope than the other, depending on the value of n and the type of documents, so you must decide which should govern the other. Usually, PARA is considered to have greater scope than SENT_n unless n is larger than 6.

Table 11.3. Operator Governance for SENT, SENT_n, and PARA

Meaning Components             Operators ► What They Can Govern
[and] + [scope]                SENT ► any but AND, SENT_n, or PARA
[and] + [scope] + [distance]   SENT_n ► any but AND or, if n < 6, PARA
[and] + [scope]                PARA ► any but AND or SENT_n where n > 6
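For example, consistent with the governance rules above (a hypothetical sketch), PARA can usefully govern two SENT groups, requiring each pair of words to co-occur within a single sentence and both pairs to appear within the same paragraph:

```
CONCEPT_RULE:(PARA, (SENT, “good”, “_c{job}”), (SENT, “not”, “quit”))
```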

Some examples of using the [scope] operators with the other operators include the grammatical and topical strategies described in this section and the advanced use sections for CONCEPT_RULE and PREDICATE_RULE.

Use SENT to constrain matches to within a sentence, while using ORD or ORDDIST_n to specify the order of items. This approach can be used to explore grammatical relationships, like the one between subjects and verbs, once you have defined some basic concepts for the head noun of a phrase and an active verb using part-of-speech tags:

PREDICATE_RULE:(subj, verb):(SENT, (ORDDIST_3, (OR, “_subj{headNoun}”, “_subj{:Pro}”), “_verb{activeVerb}”))

This rule requires that, within the scope of a sentence, a subject head noun or pronoun appear in the text within three tokens and be ordered before an active verb. You can move elements around and focus on passive verbs as well:

PREDICATE_RULE:(subj, verb):(SENT, (ORDDIST_5, “_verb{passiveVerb}”, (DIST_2, “by”, (OR, “_subj{headNoun}”, “_subj{:Pro}”))))

You can read this rule as follows: Within the scope of a sentence, match a passive verb that precedes (within the span of five tokens) the word “by” which itself is within two tokens of either a head noun or a pronoun. The passive verb is returned as a match for the label “verb,” and the head noun or pronoun match is returned for the label “subj.” This type of rule will become much more effective when a new operator called CLAUS_n is released. This operator is on the product roadmap for SAS Visual Text Analytics. The use of CLAUS_0 restricts the scope of a match to within any single main clause in a sentence. The use of CLAUS_1 restricts the scope of a match to within any single clause, either main or subordinate, in a sentence. This type of grammatical scope will make rules like the one above or rules for negation much easier to write and test.

SENT_n is useful when you are looking for a relationship but suspect a high potential for anaphora (pronouns and general nouns used in place of more specific nouns) that could obscure the relationships you are looking for. In this rule, you are looking beyond single-sentence matches to find a birth location for an individual:

PREDICATE_RULE:(per, loc):(SENT_4, (ORD, “_per{nlpPerson}”, (OR, “she”, “he”), “born”, “in _loc{nlpPlace}”))

This rule says that you will look in a scope of four sentences for a Person predefined concept match first, then either “she” or “he,” then the word “born,” and then a combination of the word “in” with a Location predefined concept match. The matches to the labels “per” and “loc” represent the fact that the extracted match for person was born in the extracted match for location.

PARA may be used in some products to identify the first head noun in a paragraph, which may be a good indicator of the topic of that paragraph, depending on how your data is structured.

CONCEPT_RULE:(PARA, (SENTSTART_5, “_c{headNoun}”))
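Because a rule like the one above can match the first head noun of every sentence in the paragraph, one postprocessing option is to keep only the lowest-offset match per document. A minimal Python sketch follows; the match-tuple format here is a hypothetical stand-in for your actual scoring output:

```python
# Hypothetical scoring output: (doc_id, start_offset, match_text)
matches = [
    (1, 42, "budget"),
    (1, 7, "report"),    # lowest offset in doc 1
    (2, 3, "meeting"),   # lowest offset in doc 2
    (2, 55, "agenda"),
]

# Keep the match with the lowest offset for each document.
first_per_doc = {}
for doc_id, offset, text in matches:
    if doc_id not in first_per_doc or offset < first_per_doc[doc_id][0]:
        first_per_doc[doc_id] = (offset, text)

print(first_per_doc)  # {1: (7, 'report'), 2: (3, 'meeting')}
```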

This CONCEPT_RULE will find the first head noun from each of the sentences in the paragraph, but you can filter the results in postprocessing by selecting the matches with the lowest offset values to carry forward into your analysis.

What about governing [scope] operators with the other operators described above? You can combine them in some cases. For example, if you want to match two items within a sentence and you want one of the matches to come before the other, then the following rule will work to some extent:

CONCEPT_RULE:(ORD, (SENT, “bank”, “_c{fee@}”), (SENT, “close@”, “account@”))

This rule will find the two items governed by the first SENT operator within the same sentence and then will find the other two items governed by the second SENT operator. The matches for the first pair must come before the matches of the second pair, because of the ORD operator; however, the matches could appear in the same sentence or different sentences at the beginning and end of the document, because ORD has document-level scope.

If you want to constrain the distance of the two matches, then you might try to use ORDDIST or DIST operators instead of ORD. The rule might look like the following with a large value of n to try to allow for some sentence variation:

CONCEPT_RULE:(DIST_50, (SENT, “bank”, “_c{fee@}”), (SENT, “close@”, “account@”))

The discussion of bank fees and closing the account can appear within a scope of 50 tokens, in either order. Even though the individual arguments produce matches in the document below, the document as a whole would not match this rule, because the relevant element pairs are simply too far apart.

Consider the following input document, modeled after public data from the U.S. Consumer Financial Protection Bureau (https://www.consumerfinance.gov/data-research/consumer-complaints):

A month ago I commented that closing my account at Bank Y was really easy. A week or two later, I found all this mail from Bank Y—overdraft notices for my checking account, which was supposedly closed. The notices are for two debit card transactions and two auto-pay electronic checks. Instead of the payments being rejected by my bank, like you would expect, all four were paid by Bank Y, which then added an overdraft fee of $34 to each one, meaning $136 in overdraft fees.

A better approach might be to use PARA or SENT_6 instead of the DIST operator. You can also provide a higher value of n for the SENT_n operator. These operators give you more control over the number of sentences used to relate the issue of closed accounts to bank fees. This rule would match the text above, providing more control over matches than either ORD or DIST:

CONCEPT_RULE:(SENT_6, (SENT, “bank”, “_c{fee@}”), (SENT, “close@”, “account@”))

Keep in mind, however, that this rule will also allow the matches to all be in the same sentence. To try to specify that they must be at least in two separate sentences, you will need to add ORD and some marker of the sentence division like the following. Note that in this version of the rule, the ordering constraint also applies to each of the arguments of the SENT operators, so you may need more variations of the rule in your model:

CONCEPT_RULE:(SENT_6, (ORD, (SENT, “bank”, “_c{fee@}”), “sentBreak”, (SENT, “close@”, “account@”)))

The concept sentBreak used in the rule above could contain a REGEX rule that looks for sentence-ending punctuation, or an even better option would be a concept containing a rule using the SENTSTART_n operator like this:

CONCEPT_RULE:(SENTSTART_1, “_c{_w}”)
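For reference, the REGEX alternative mentioned above might be sketched as follows. This is a deliberately simplified pattern for sentence-ending punctuation; see chapter 10 for the caveats around special characters in regular expression rules:

```
REGEX:[.!?]+
```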

Best Practices for Operator Combinations

In summary, the combinations that work best across the operators are the following:

  • OR can govern any other operator but should not be the top-level or only operator in the rule.
  • NOT and UNLESS may appear only in very constrained contexts.
  • The variants of AND with document-level scope can usefully govern each other so long as the lower-level operators in the rule add meaning or have smaller distance constraints than the higher-level operators in the rule.
  • The variants of AND that change scope can usefully govern the other operators.
  • The variants of AND that change scope usefully govern each other, if the scope is greater for the higher-level operators in the rule and is smaller for the lower-level ones.
  • The variants of AND with document-level scope can be used to govern the variants of AND that change scope but may not be as effective as you want; be careful and aware of how the operators will interact.

11.3. Best Practices for Selecting Rule Types

Each rule type has its own processing requirements, which means that, by selecting different rule types, you have control over how efficiently your model processes data. This section will help you make such decisions.

11.3.1. Rule Types and Associated Computational Costs

To better inform the selection of rule types for your models, the following list ranks the rule types from least to most computationally expensive:

  • CLASSIFIER is the least costly rule type because it includes only tokens, and because the found text is the extracted match.
  • CONCEPT is the second least expensive rule type because it works with token literals, refers to a set of sequential elements and rule modifiers, and the found text is the extracted match.
  • C_CONCEPT works like the CONCEPT rule type at first by matching all the defined elements and modifiers. Additional processing then returns a part of the rule match using the _c{} extraction label, making the rule type slightly more expensive.
  • CONCEPT_RULE is the most expensive of the concept rule types because it allows for Boolean and proximity operators. The number of operators in a rule can increase its overall cost. It is possible to create a CONCEPT_RULE that is more expensive than even some of the other rule types below.
  • SEQUENCE is the less expensive of the two fact-matching types of rules because elements and modifiers are sequential, which parallels the C_CONCEPT rule type. Additional processing then extracts all matches for each label. More labels in the rule can further increase the cost of the rule.
  • PREDICATE_RULE is more flexible, but more expensive, than the SEQUENCE rule type because of the use of Boolean and proximity operators. It is similar in cost to the CONCEPT_RULE type plus extra processing for extracting matches for multiple labels. More operators or more labels can contribute to increases in the overall cost of this type of rule.
  • REMOVE_ITEM is generally the less costly filtering rule type of the two; it operates over only matches of a specified concept. It depends on the number of matches for that concept to determine the cost, so keep that factor in mind as you apply the rule.
  • NO_BREAK is the costlier of the two filtering rule types because it operates over all matches in the model across all concepts.
  • REGEX rule cost can vary widely because of seemingly endless combinations and because it depends on the makeup of the regular expression rule. It must be used with caution. Although this rule can have minimal cost for certain definitions, it can potentially be the most expensive. See chapter 10 for additional advice on special characters and strategies to avoid.

Each of the available rule types has been designed for specific purposes, and you are advised to use them for those purposes. Although some rule types can be used in place of others, this approach can lead to inefficiencies that may not surface until later, when you are scoring documents at scale. It is thus best to use the appropriate rule type from the start.

If you are relatively new to writing rules, you may fall into the trap of always using a rule type that has worked for you in the past. This approach can lead to a misunderstanding of when to use certain rule types and can develop into a habit of using the wrong rule type when authoring larger sets of rules. Instead, select the rule type by always keeping in mind the goals of the model, the type of data you are working with, and the tips about each rule type in this book.

11.3.2. Use of the Least Costly Rule Type for Best Performance

Take, for example, the CLASSIFIER and CONCEPT rule types. Both produce a match on the same input token or tokens, so the result appears to be the same. Like the CLASSIFIER rule type, the CONCEPT rule type matches against everything to the right of the colon in its definition, in order. However, the CONCEPT rule type can do more than the CLASSIFIER rule type, including referring to rule modifiers and other element types. Because of these additional capabilities and the various ways in which it can be expanded, the CONCEPT rule type is more expensive in terms of complexity and run-time cost. In contrast, the CLASSIFIER rule type allows only literal strings in the rule definition. Therefore, CONCEPT rules that contain only literal strings should be converted to CLASSIFIER rules.

In the example below, a series of CONCEPT rules each contain a single named symptom:

CONCEPT:burning

CONCEPT:itching

CONCEPT:redness

Given that the rules above define only literal strings, they should be rewritten as CLASSIFIER rules:

CLASSIFIER:burning

CLASSIFIER:itching

CLASSIFIER:redness

Alternatively, two of them can be kept as CONCEPT rules that match in a broader fashion by converting the rule definition to include a lemma form and @ expansion symbol. There is no need for the expansion symbol in the third rule, so it could be converted to a CLASSIFIER rule:

CONCEPT:burn@

CONCEPT:itch@

CLASSIFIER:redness

Remember that you cannot put an @ modifier on an adjective like “red” to get matches for the noun “redness,” because the @ modifier expands only to nominal, adjectival, or verbal forms that are inflectionally derived. In other words, the adjective “red” is not the parent of the nominal form “redness.”

The same conservative approach just described for CLASSIFIER and CONCEPT rule types is also recommended for the CONCEPT and C_CONCEPT rule types in comparison to one another. Matches extracted because of CONCEPT rules correspond to the full found text based on the rule definition. In other words, when a CONCEPT rule matches input text, the extracted match will be the entire rule definition body. The C_CONCEPT rule will likewise match input text in accordance with its full rule definition, but the extracted match is only the part of the found text specified in the _c{} extraction label. When this extraction label includes the entire definition of the rule, it is better to use the CONCEPT rule type instead of the C_CONCEPT rule type, because the latter is designed for extracting only a portion of the found text. Even though technically it is allowed, _c{} should never be used to match against an entire rule definition, because it is a misuse of the C_CONCEPT rule type. Instead, consider whether the CONCEPT rule type can be used in its place.

For the example below, assume a concept named skinSymptom has been defined, containing a list of known symptoms that can impact the skin. The first rule set is incorrect and should be replaced by the second or third set:

Avoid this! -> C_CONCEPT:_c{skinSymptom sensation}

Avoid this! -> C_CONCEPT:_c{skinSymptom feeling}

Avoid this! -> C_CONCEPT:_c{skinSymptom}

In the case where both the match from the concept skinSymptom and the token after it should be returned, use the following set of rules:

CONCEPT:skinSymptom sensation

CONCEPT:skinSymptom feeling

CONCEPT:skinSymptom

In the case where only the match from the skinSymptom concept should be extracted, use the following set of rules:

C_CONCEPT:_c{skinSymptom} sensation

C_CONCEPT:_c{skinSymptom} feeling

CONCEPT:skinSymptom

11.3.3. When Not to Use Certain Rule Types

Although using one rule type in place of a less costly rule type is one form of misuse, another is improperly using the results of one rule in another rule. For instance, using the SEQUENCE or PREDICATE_RULE rule types to define a concept that is referenced in the rules of another concept is generally not a proper use, because the fact aspect of the matches, namely the labels and associated extracted matches, will be lost. Only the matched string will be passed along.

If you have a reason to extract the full match string from the first to the last match element, and your elements are in a known pattern, then you can use a CONCEPT rule instead of a SEQUENCE rule. On the other hand, if you need operators to find all the elements required for your match, then PREDICATE_RULE is the only rule type that enables both operators and the extraction of the full matched string. This use case may have a place in your model but must be applied with caution.

If you have no purpose for either of the output results of the rule match, and you are throwing away both types of extracted data, then you are using the existence of the match as a binary decision. This is a misuse of the SEQUENCE and PREDICATE_RULE rule types, because they are meant to produce information about relationships between labeled items or between those items and the extracted match string. In such a situation, you should use the CONCEPT, C_CONCEPT, or CONCEPT_RULE rule types instead, because your model will run faster. For example, the CONCEPT_RULE type is used in place of the PREDICATE_RULE type when fact matches are not needed, but operators are required to define the rule and the result of finding a relationship among the elements is still desired.

The example below shows two concepts: one defined with a PREDICATE_RULE and one with a CONCEPT_RULE. The first concept, named reportedIssue, contains a PREDICATE_RULE with two labels: part, defined as a match on the partList concept, and mal, a malfunction often connected to vehicle air bags. The partList concept contains the following rule:

CLASSIFIER:air bags

The concept named reportedIssue contains the following rule:

PREDICATE_RULE:(part, mal):(SENT, “_part{partList}”, “_mal{deploy@ on :DET own}”)

The second concept, named legalClaim, contains a CONCEPT_RULE with two keywords, “insurance” and “claim,” and a reference to the reportedIssue concept:

CONCEPT_RULE:(SENT_2, “insurance”, “_c{reportedIssue}”, “claim”)

The rule defined in the legalClaim concept is attempting to match the token “insurance,” a match on the reportedIssue concept, and the token “claim,” and to return the reportedIssue as the match for the CONCEPT_RULE rule definition. The extracted matches from the PREDICATE_RULE are being used only to define the bounds of the matched string, which is being passed forward to the CONCEPT_RULE. That matched string then leads to output via the result of the CONCEPT_RULE, which is a legitimate way to use the rules. One other legitimate reason for choosing this combination of rules is that you need both legalClaim and reportedIssue output from scoring your data in production.

Consider the following input document:

It was foggy. We were going about 30 mph when the air bags deployed on their own, broke the windshield and caught on fire. The damage to the car was $5100.00 and when the insurance put a claim into the manufacturer, they replied that they would have to examine the “alleged faulty parts” before honoring the claim or taking responsibility. How frustrating!

Pause and think: Assuming the model with reportedIssue and legalClaim concepts, can you predict the matches for the reportedIssue concept with the input document above? What if you also output the legalClaim concept?

With the input document above, the legalClaim concept produces the match in Figure 11.7.

Figure 11.7. Extracted Match for the legalClaim Concept

Doc ID   Concept      Match Text
1        legalClaim   air bags deployed on their own

With the input document above, the reportedIssue concept produces the matches in Figure 11.8.

Figure 11.8. Extracted Matches for the reportedIssue Concept

Doc ID   Concept         Extraction Label   Extracted Match
1        reportedIssue                      air bags deployed on their own
1        reportedIssue   mal                deployed on their own
1        reportedIssue   part               air bags

If you placed the _c{} extraction label not on the reportedIssue concept but somewhere else in the rule, then the use would be incorrect (unless you were generating output from both concepts above), because all the information passed from the PREDICATE_RULE would be lost. Be careful not to make this error: fact rules buried in your model with no purpose can contribute to slow run-time speeds when you are scoring data with your model.

Avoid this! -> CONCEPT_RULE:(SENT_2, “_c{insurance}”, “reportedIssue”, “claim”)

To avoid losing all the extracted information from the PREDICATE_RULE in higher levels of your model, convert the original PREDICATE_RULE to a CONCEPT_RULE. The CONCEPT_RULE returns a match on only one element, either the malfunction or the part; returning the latter is probably better here. After this modification, the concept named reportedIssue contains the following rule:

CONCEPT_RULE:(SENT, “_c{partList}”, “deploy@ on :DET own”)

The output for the reportedIssue concept has changed as detailed in Figure 11.9.

Figure 11.9. Extracted Match for the reportedIssue Concept

Doc ID   Concept         Match Text
1        reportedIssue   air bags

How models are constructed and how they pass information forward through the layers of concepts is covered in more detail in chapter 13. Refer to that chapter before designing and setting up your taxonomy and before building your model.

11.4. Concept Rules in Models

Custom concepts are useful when you know what information you are trying to extract from your data and you need a way to target that information. You can use a single rule or a series of rules to accomplish your goals. The information in chapters 5–10 introduced each of the rule types and showed you how they relate to one another in terms of complexity and usage scenarios. With the addition of the important details and best practices in the current chapter, you are well equipped to start building your own custom rules successfully, using the LITI syntax.

Chapters 12–14 take advantage of what you have learned in the previous chapters and equip you to build a full information extraction model. Chapters 12 and 13 include tips on designing and setting up a model, taking into consideration data characteristics, whereas chapter 14 focuses on testing and maintenance of models.
