Chapter 8: Fact Rule Types

8.1. Introduction to Fact Rule Types

8.2. SEQUENCE Rule Type

8.2.1. Basic Use

8.2.2. Advanced Use with Other Elements

8.2.3. Troubleshooting

8.2.4. Best Practices

8.2.5. Summary

8.3. PREDICATE_RULE Rule Type

8.3.1. Basic Use

8.3.2. Advanced Use: Capture of a Sentence

8.3.3. Advanced Use: More Complex Rules

8.3.4. Advanced Use: Single Label, Multiple Extracted Matches

8.3.5. Advanced Use: More Than Two Returned Arguments

8.3.6. Advanced Use: Discovery of Terms to Add to a Model

8.3.7. Troubleshooting

8.3.8. Best Practices

8.3.9. Summary

8.1. Introduction to Fact Rule Types

In chapter 5, you learned about four groupings of LITI rule types:

  • Concept rule types (including CLASSIFIER, CONCEPT, C_CONCEPT and CONCEPT_RULE)
  • Fact rule types (including SEQUENCE and PREDICATE_RULE)
  • Filter rule types (including REMOVE_ITEM and NO_BREAK)
  • REGEX rule type

The SEQUENCE and PREDICATE_RULE types of rules are grouped together because they are used for extracting facts. Fact matches involve identifying multiple items in a relationship, as well as identifying events or slots within a template structure. For example, some types of relationships between items are listed in Table 8.1.

Table 8.1. Fact Matching Relationships, Slots, and Examples

Relationship: X is a type of Y.
Slots: X, Y
Example: Cash back is a type of promotion.

Relationship: X was born in Y.
Slots: X, Y
Example: The CEO was born in Texas.

Relationship: X can cause the allergic reactions of A, B, or C.
Slots: X, A, B, C
Example: A vaccine can cause the allergic reactions of itchy skin, difficulty breathing, and a reduction in blood pressure.

The relationships shown in the table include a “typeOf” relationship between “cash back” and “promotion,” a “bornIn” relationship between “CEO” and “Texas,” and a “causeReaction” relationship between “vaccines” and various types of reactions or symptoms. These are just examples to help you start thinking of all the things you can model with fact rule types. In these sentences, there is an explicit mention of the relationship, but in some cases the relationship might be indicated more subtly or even implied by context. You can still build fact rules to capture these contexts effectively in many cases.
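To preview the syntax covered later in this chapter, a relationship like “bornIn” could be modeled with a fact rule along the following lines. This is only a sketch; the concept and label names are illustrative, and the rule type itself is explained in the sections that follow.

SEQUENCE:(who, where):_who{_cap} was born in _where{_cap}

Applied to the sentence “The CEO was born in Texas,” a rule like this would be expected to extract “CEO” as a match to the “who” label and “Texas” as a match to the “where” label.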

A template structure is just a more complex form of this two-way relationship. In other words, the relationship between multiple things can be modeled together. Types of events or template structures that a fact rule can be used for are listed in Table 8.2.

Table 8.2. Templates, Slots, and Examples for Fact Rules

Template: Payment event type happened on Y date for Z amount.
Slots: Y, Z
Example: The bill was paid on 12.12.2012 for $12.

Template: A marriage event occurred with these attributes:

  • Location = A
  • Date = B
  • Bride = C
  • Groom = D
  • Number of people attending = E

Slots: A, B, C, D, E
Example: Mary Smith married John Brown on August 11, 2018, in Little Rock, AR with 150 people in attendance.

Template: A verb type has been found with the following slots:

  • Subject = A
  • Direct object = B
  • Time adverb = C

Slots: A, B, C
Example: The tenants caused damages to the sewer pipes yesterday.

An event is usually modeled as a set of slots to be filled or left empty, depending on the specificity of the information about the event in the text. For example, as shown in Table 8.2, a marriage event could include all the slots listed, but the text might not contain information about the number of people attending. All the other slots might be filled by information in the text, in which case you would have an event template with an incomplete instantiation. This might still be useful, because it results in a similar situation to when there are missing values in structured data. Even though some data points are missing, the ones that are known can still be useful.

Both relationships and events can be modeled with the use of fact rule types in the SAS IE toolkit by leveraging either or both of the two rule types illustrated in this chapter. But keep in mind that fact rules produce intrinsically different results than concept rules: For example, when considering output tables, remember that fact rule matches are found in the factOut table, whereas concept rule matches are found in the conceptOut table.

Although these rule types are described briefly in the SAS Text Analytics product documentation, there are intricacies of usage that you need to know to use them effectively and efficiently. This chapter will extend your understanding of the SEQUENCE and PREDICATE_RULE rule types through tips, potential pitfalls, and examples that show both basic and advanced uses of each. The requirements and optional elements for each rule type are summarized at the end of each section so that you can keep coming back to that section as a quick reference when you are building your models.

After reading this chapter, you will be able to do the following tasks:

  • Use the LITI syntax to write efficient and effective SEQUENCE and PREDICATE_RULE types of rules
  • Understand how the output of fact rules is different from the output of concept rule types
  • Avoid common pitfalls and use best practices to create better rule sets
  • Troubleshoot common rule-writing errors

8.2. SEQUENCE Rule Type

The SEQUENCE rule type works like the C_CONCEPT type except that it enables you to extract more than one part of the match. When you have ordered elements and need to extract more than one matched element, use a SEQUENCE rule type to model the fact and the surrounding context. The SEQUENCE rule type is designed to exploit the inherent sequential order of elements in text while focusing its attention on matching facts and extracting multiple arguments.

8.2.1. Basic Use

The basic syntax comprises a rule with three sections:

  • Rule type declaration
  • Label declaration
  • Rule definition

A colon separates each of the three sections. The label declaration section includes one or more user-defined extraction labels in a comma-separated list enclosed in parentheses. The rule definition contains one or more elements marked for extraction with the declared label or labels, plus zero or more additional elements. Here is the basic rule syntax:

SEQUENCE:(label1, label2):_label1{element1} _label2{element2}

SEQUENCE:(label1, label2):elementA _label1{element1} elementB _label2{element2} elementC

For descriptive convenience, some of the elements have been labeled with numbers; others, with letters. You can read the first rule this way: If element1 and element2 are both in the text in sequential order, extract element1 as a match to label1 and extract element2 as a match to label2 in the concept in which this rule is written. In addition, the entire span of text between element1 and element2 is returned as a match to provide insight into the context of the matches.

You can read the second rule this way: If element1 and element2 are both in the text in the sequential context specified by elementA, elementB, and elementC, then extract element1 as a match to the label named label1 and extract element2 as a match to the label named label2 in the concept in which this rule is written. In this example, there are five total elements, two of which are marked as targets for extraction with user-defined extraction labels. Again, note that the entire span of text between element1 and element2 is extracted as an additional match.

In some SAS Text Analytics products, you can append additional tokens onto the beginning and end of the extracted match bounded by the two matches with labels. This approach provides you even more context through a bigger window of content. Creating output with three sections (preceding context, extracted match, and following context) is called concordancing.

Notice that in the label declaration section of the rule, the labels are listed within parentheses separated by a comma, but in the rule definition, each of the extraction labels is directly preceded by a single underscore and followed by curly braces surrounding the elements to be matched. This is similar to the _c{} extraction label in other rule types, but keep in mind that you cannot name the label “c,” because it is reserved by the system. It is recommended to also avoid the following names: “Q,” “F,” “P,” and “ref,” although these names might not cause any problems in fact rules.

The names of extraction labels must start with a letter, followed by any series of letters, underscores, or digits. Note that, in some older products, using an uppercase letter in an extraction label name could cause compilation errors.

Remember: Extraction label names must start with a letter, followed by any series of letters, underscores, or digits.
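For example, label names like “part”, “mal2”, and “part_name” all follow this convention, whereas a name like “2part” (starting with a digit) does not. The following sketch, a variant of the vehicle example used in this section with renamed labels, shows valid label names in use:

SEQUENCE:(part_name, mal2):_part_name{engine} is _w _mal2{overheating}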

Here is an example rule used to find problems in vehicles in a concept named reportedDefect:

SEQUENCE:(part, mal):_part{engine} is _w _mal{overheating} 

The extraction labels in this rule are “part” and “mal” (malfunction). For matches to be extracted to these labels, those matches must be in the same order in the text as in the rule definition and include the word “is” and another token between them.

Consider the following input documents.

1. The engine is quickly overheating whenever the water pump does not engage.

2. Because the engine is always overheating, the thermostat also quit working.

Pause and think: Assuming the rule and input documents above, can you predict the output?

Fact rule matches are usually returned in a different output table or format in the graphical user interface (GUI) than concept rule matches because fact matches carry extra information. For example, the extracted matches for the reportedDefect concept and the input documents above include the following: the token “engine” for the label “part” and the token “overheating” for the label “mal,” as well as the entire span of text between these two matches in each of the input documents, as shown in Figure 8.1.

Figure 8.1. Extracted Matches for the reportedDefect Concept

Doc ID | Concept | Extraction Label | Extracted Match
1 | reportedDefect | | engine is quickly overheating
1 | reportedDefect | mal | overheating
1 | reportedDefect | part | engine
2 | reportedDefect | | engine is always overheating
2 | reportedDefect | mal | overheating
2 | reportedDefect | part | engine

In this example, the literal strings (“engine” and “overheating”) that are explicitly defined in the rule are assigned as matches to the extraction labels (“part” and “mal”) when the fact is found. This rule works well for counting occurrences of specific strings. Note that the string between the first and last match is also returned for both documents, providing context. In fact matches, there is always at least one match per defined extraction label, plus one extra match to show the span between the first extracted string and the last extracted string.

Although at least one extraction label is required in the SEQUENCE rule type, it is recommended to specify two or more extraction labels because the intended use of this rule type is to model a relationship among multiple extracted matches. If you want to specify a single label, a C_CONCEPT rule type is more appropriate because it is less computationally expensive. You can read more about C_CONCEPT rules in section 6.4. However, an exception to the guideline is when you need to match multiple pieces of text with the same label in the same rule definition. This is not possible with a C_CONCEPT rule type.
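For example, if only the part name in the vehicle example were needed, a C_CONCEPT rule could extract it in the same context with the reserved _c{} extraction label. This is a sketch of the simpler alternative:

C_CONCEPT:_c{engine} is _w overheating

This rule extracts only “engine,” with no second labeled match and no additional span match, and its matches appear with the concept rule matches rather than the fact rule matches.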

Returning to the previous example, if more than one match should be extracted for the “part” label, then a single extraction label can be used to capture all the parts mentioned:  

SEQUENCE:(part):The _part{engine}, _part{transmission}, and _part{suspension} have been replaced

Consider the following input documents:

1. The engine, transmission, and suspension have been replaced.

2. The engine and transmission have been replaced.

Pause and think: Assuming that the rule above is in a concept named replacedPart, can you predict the matches for the input documents above?

The matches for the replacedPart concept and the input documents above are shown in Figure 8.2.

Figure 8.2. Extracted Matches for the replacedPart Concept

Doc ID | Concept | Extraction Label | Extracted Match
1 | replacedPart | | engine, transmission, and suspension
1 | replacedPart | part | suspension
1 | replacedPart | part | transmission
1 | replacedPart | part | engine

Note that there are four matches for the first document: three matches for the label “part” and one match that extracts the text between the first and last of the three matches. There are no matches for the second document, because only two of the three required elements are found in the text.

The basic use of the SEQUENCE rule is useful in highly structured text, where the extracted matches and context are very predictable. In the next section, you will see how to extend the usefulness of this rule type by generalizing and replacing string literals with other elements.

8.2.2. Advanced Use with Other Elements

To capture a sequence of unknown terms or a larger set of previously defined terms, you can replace each string with another element such as a part-of-speech (POS) tag or a concept name. For example, the rule from the previous section could be rewritten by replacing the strings with “_w,” which represents any single token, including any word that has not been previously specified in any rules or has not been previously extracted. Using “_w” is a good strategy for exploring your data and finding unknown part names or abbreviations.

SEQUENCE:(part):The _part{_w}, _part{_w}, and _part{_w} have been replaced 

In this rule, any words found in sequence in the context specified would be extracted as matches to the part argument. If this rule were to replace the previous one in that same concept named replacedPart, and the input sentences were the same as before, the output would also be the same. However, if the input text included different words, the former rule would extract no matches, whereas this modified rule would extract the new automotive parts.

Consider the following input documents:

1. The engine, transmission, and suspension have been replaced.

2. The windshield, wipers, and mirrors have been replaced.

Pause and think: Assuming that the rule above is in a concept named replacedPart, can you predict the matches with the input documents above?

The matches for the replacedPart concept with the above documents as input are represented in Figure 8.3.

Figure 8.3. Extracted Matches for the replacedPart Concept

Doc ID | Concept | Extraction Label | Extracted Match
1 | replacedPart | | engine, transmission, and suspension
1 | replacedPart | part | suspension
1 | replacedPart | part | transmission
1 | replacedPart | part | engine
2 | replacedPart | | windshield, wipers, and mirrors
2 | replacedPart | part | mirrors
2 | replacedPart | part | wipers
2 | replacedPart | part | windshield

Note that for each input document, the extracted matches include those for the “part” label as well as the entire span of text from the first to the last matched automotive part. This context is helpful in this scenario for determining which parts are often replaced together.

Now imagine that you might need to extract names only of malfunctioning parts that a particular company manufactures, rather than all the possible parts that were replaced. In this case, you would put a concept name between the curly braces in the extraction label. This concept could be named, for example, madeByMalCo, where MalCo is an abbreviation for a fictitious company. The madeByMalCo concept could contain only names of manufactured parts that were produced by MalCo.

CLASSIFIER:engine

CLASSIFIER:transmission

CLASSIFIER:suspension

CLASSIFIER:wipers

Then, the SEQUENCE rule in the replacedPart concept could refer to the matches that would potentially be extracted from the madeByMalCo concept.

SEQUENCE:(part):The _part{madeByMalCo}, _part{madeByMalCo}, and _part{madeByMalCo} have been replaced

In this example, with the same input text sentences as above, the SEQUENCE rule would extract matches only if the potential matches were listed in the madeByMalCo concept. Consider the following input documents:

1. The windshield, wipers, and mirrors have been replaced.

2. The engine, transmission, and suspension have been replaced.

Pause and think: Can you predict the matches for the concept replacedPart with the input documents above?

The matches for the replacedPart concept and the input documents above are in Figure 8.4.

Figure 8.4. Extracted Matches for the replacedPart Concept

Doc ID | Concept | Extraction Label | Extracted Match
2 | replacedPart | | engine, transmission, and suspension
2 | replacedPart | part | suspension
2 | replacedPart | part | transmission
2 | replacedPart | part | engine

Notice that no matches are extracted from the first document, although one of the potential matches, “wipers,” is listed in the madeByMalCo concept. The rule, as written, requires that all three arguments produce matches for the concept madeByMalCo. Remember that in SEQUENCE rule types, all the elements are required to appear in the order specified before a match is returned for the concept. The second document produces three matches with the label “part” and a fourth match that extends from the first to the last of those three matches in the text.

Although the ordering of the elements in the rule definition must parallel the input text for extraction to occur, the ordering of extraction labels in the rule definition does not. Thus, the extraction label “_mal{}” could be referenced after the extraction label “_part{}” in the rule definition even if they are declared in the opposite order in the label declaration as “part” and “mal.” The flexible order between two or more extraction labels in the declaration and definition is illustrated with this rule in a concept named, for example, reportedDefect.

SEQUENCE:(part, mal):_mal{overheating} of the _part{engine} is the problem 

Consider the following input documents:

1. The overheating of the engine is the problem.

2. The engine overheating is the problem.

Pause and think: Can you predict the matches for the reportedDefect concept and the input documents above?

The matches for the reportedDefect concept and the input documents above are in Figure 8.5.

Figure 8.5. Extracted Matches for the reportedDefect Concept

Doc ID | Concept | Extraction Label | Extracted Match
1 | reportedDefect | | overheating of the engine
1 | reportedDefect | part | engine
1 | reportedDefect | mal | overheating

As you can see in this output, the “_part{}” extraction label matches the term “engine,” and the “_mal{}” extraction label matches the term “overheating” for the first input document. The order of the elements in the rule definition corresponds to the text of this sentence, even though the label order does not match the label declaration. There is no match for the second document because the criteria for the found text are not met: The word “overheating” is not followed by the sequence of other words defined in the rule. Simply put, the order of the extraction labels in the declaration section is irrelevant.

8.2.3. Troubleshooting

If you discover that a rule is not matching as you expected, potential causes for this could be one of the pitfalls outlined in section 5.4: namely, general syntax errors, comments, misspelling/mistyping, tokenization mismatch, or filtered matches. In addition, there are also errors that you can check for that are specific to the SEQUENCE rule type, such as the following:

  • White space
  • Syntax errors
  • Missing extraction label
  • Extra extraction label
  • Tagging mismatch
  • Expansion mismatch
  • Concept references
  • Predefined concept references
  • Cyclic dependencies

In the SEQUENCE rule type, white space is reduced to a separator for a list of elements and not counted as an element itself. You cannot specify how many white space characters or what type can appear between elements. For that type of matching, where you specify white space characters, you will need to use a REGEX rule type.

The SEQUENCE rule type requires use of the output declaration, which is located between the rule type declaration and the rule definition. Make sure that you put the extraction label declaration between the two colons that delimit the section and that you format the labels as a comma-separated list between parentheses.

Other errors specific to SEQUENCE rules include forgetting the extraction labels and their curly braces or putting them on the wrong element or elements. You must use every label that you define in the declaration section of the rule at least once, spelled exactly as declared, or you will get an error. However, if you use a label in the rule definition that you did not declare, then you will simply get no matches. Remember the underscore and both curly braces for every label in use in the rule definition.
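For example, both of the following rules are problematic. These are illustrative sketches based on the vehicle example from earlier in this chapter:

# Error: the declared label "mal" is never used in the definition
SEQUENCE:(part, mal):_part{engine} is _w overheating

# No error, but no matches: the label "mal" is used but never declared
SEQUENCE:(part):_part{engine} is _w _mal{overheating}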

Every element defined in the rule must be present in the text that you are trying to match exactly as you have defined it. Mismatches in order or spelling can eliminate expected matches. If your concept is case-sensitive, then check for alignment of case, as well.

It is possible that the POS tag that you think a particular word might have is not the tag assigned to that word by the software in that particular context. The best way to prevent this error is to test your expectations with targeted examples in context before you apply the rule to a sample of documents that is like the data that you will process with the model.

In addition, it is possible that the POS tag is misspelled or does not exist. Different languages, versions, and products might use different POS tags. Consult your product documentation for lists of acceptable tags for rule-building. The spelling and case of the tags in the rules must be exactly as documented. Because writing a rule with a nonexistent tag like “:abc” is not a syntax error but a logical error, the syntax checking protocols will not catch it as an error, but there will not be any of the expected matches.

Another potential error when you are writing rules that contain a POS tag is forgetting to include the colon before specifying the tag. Without the colon, the system considers the rule to refer to a concept by that name or a string match, which might produce unexpected or no results. Syntax checking protocols will not return an error in this case.
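For example, assuming your product's tag set includes a tag such as “:Adv” for adverbs, the first rule below matches any adverb between “is” and the malfunction, whereas the second, missing the colon, looks instead for a concept named “Adv” or the literal string “Adv.” Both rules are illustrative:

SEQUENCE:(part, mal):_part{engine} is :Adv _mal{overheating}

SEQUENCE:(part, mal):_part{engine} is Adv _mal{overheating}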

When using the expansion symbols (e.g., @, @N, @V, @A), note that the expansion includes only related dictionary forms, not any misspellings that might have been identified by the misspelling algorithm or other variants associated with that lemma through the use of a synonym list. To review what a lemma is, consult chapter 1. Also, remember that the forms of the words are looked up before processing, and when matching happens, the associated POS assignment of the word in the text is not considered. You can work around this issue, if you want to, by using a CONCEPT_RULE; see section 7.2 for more information. Examining the output from rules that contain an expansion symbol is recommended.

Referencing concepts by name without ensuring that you have used the correct name, including both case and spelling accuracy, can also reduce the number of expected matches. If you reference predefined concepts, be sure they are loaded into your project and always check the names because they might be different across different products.

Any rule type that can reference a concept and returns matches (that is, any except REMOVE_ITEM and NO_BREAK) can participate in a cyclic dependency error. A cyclic dependency occurs when two or more concepts refer to each other in a circle of concept references. For example, if the concept myConceptA has rules that reference myConceptB, and myConceptB has rules that reference myConceptA, then there is a cycle of references between them. This type of error will prevent your whole project from compiling successfully. This is one reason that it is a best practice to test your project often as you add concepts and rules: that way, you will know that the cyclic dependency was introduced by the concepts added most recently. Another strategy for avoiding this error is careful design of your taxonomy and model. Refer to chapter 13 to learn more about taxonomy design best practices.
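For example, the following pair of rules, placed in the two hypothetical concepts named in their comments, creates a cyclic dependency that prevents compilation:

# In the concept myConceptA
CONCEPT:myConceptB

# In the concept myConceptB
CONCEPT:myConceptA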

Finally, if you can use a simpler rule type to do the work that you are trying to do with SEQUENCE rules, then always use the simpler rule type instead. In this case, the most likely alternative rule type is the C_CONCEPT rule. Although fact rule types are very powerful, they can be more difficult to maintain and troubleshoot in larger models because the outputs are more complex. If you use them, make sure you use them correctly.

8.2.4. Best Practices

Use the SEQUENCE rule type when fact matching is required and the order of elements is known, but only when you cannot extract enough information using a CONCEPT or C_CONCEPT rule.

When naming the extraction labels, keep in mind that they should start with a letter, followed by any series of letters, underscores, or digits. Note that in some older products, using an uppercase letter in an extraction label name could cause compilation errors. Do not use the extraction label _c{} in a fact rule, and avoid the labels _ref{}, _F{}, _P{}, and _Q{}.

To check whether your labeled elements are what you meant to extract, you will do well to complete some preliminary scoring before spending a lot of time building rules. This guideline aligns with the practice of creating a set of method stubs in programming to check that the end-to-end design is sound. You can put a few rules in each of your concepts to test how the input documents are transformed into new columns of structured data and plan any post-processing that might be required.

Finally, effective label names are descriptive and show why you are extracting each item. Keeping the names short is good for readability and to help avoid typographical errors, but descriptive and informative names are important for maintainability and making the rules understandable. Also, be sure to use comments to document your rules and the labels that you are using, for future troubleshooting and maintainability.

8.2.5. Summary

Requirements for SEQUENCE include the following:

  • A rule type declaration in all caps and followed by a colon
  • One or more comma-separated user-defined extraction labels enclosed in parentheses and followed by a colon
  • Repetition of the user-defined extraction label preceded by an underscore and followed by curly braces that enclose an element or elements to be extracted somewhere within the rule definition
  • Any combination of two or more elements

Types of elements allowed include the following:

  • A token or sequence of tokens to match literally (“#” character must still be escaped for a literal match to occur)
  • A reference to another concept name, including predefined concepts
  • A POS or special tag preceded by a colon
  • A word symbol (_w), representing any single token
  • A cap symbol (_cap), representing any capitalized word

Allowed options for the rule type include the following:

  • Comments using “#” modifier
  • Morphological expansion symbols, including @, @N, @A, and @V

8.3. PREDICATE_RULE Rule Type

When you need to extract facts but cannot use a SEQUENCE rule, a PREDICATE_RULE might be effective. The SEQUENCE rule type defines a series of ordered elements, whereas the PREDICATE_RULE rule type defines a pattern of elements using Boolean and proximity operators. As in CONCEPT_RULE rules, Boolean and proximity operators allow for conjunctions, disjunctions, negations, distance, and order-free (left or right direction) constraints that specify conditions for matching. However, this flexibility of the order of elements in text in relation to one another can increase rule complexity, as well as the time required for matching, maintenance, and troubleshooting.

8.3.1. Basic Use

The basic syntax of a PREDICATE_RULE is similar to the CONCEPT_RULE type, and the set of allowed operators is the same. The primary differences are as follows:

  • The addition of an extraction label declaration in the output declaration section between the rule type declaration and the rule definition
  • The application of those labels in the rule to mark extracted matches, instead of using the _c{} extraction label

Here are two examples of the basic syntax:

PREDICATE_RULE:(label1, label2):(operator, “_label1{element1}”, “_label2{element2}”)

PREDICATE_RULE:(label1, label2):(operator, “_label1{element1}”, “element2”, “_label2{element3}”)

As in the SEQUENCE rule type, the output declaration section holds the extraction label declaration, which consists of a comma-separated list of one or more extraction labels between parentheses. Remember that extraction labels must start with a letter, followed by any series of letters, underscores, or digits. Note that in some older SAS Text Analytics products, using an uppercase letter in an extraction label name could cause compilation errors.

As in the CONCEPT_RULE type, an operator and its arguments are placed in a comma-delimited list between parentheses, and every argument is fully enclosed within double quotation marks. Elements to be extracted as matches have a corresponding extraction label that is also within the double quotation marks that delimit an argument, and the curly braces enclose the element or elements to be extracted.

For example, if you want to extract two pieces of text that correspond to a vehicle part (“engine”) and malfunction (“overheating”) within the scope of a sentence, ignoring any other tokens in the sentence, you can use the following rule:

PREDICATE_RULE:(part, mal):(SENT, “_part{engine}”, “_mal{overheating}”)

This rule can be read as follows: If the strings “engine” and “overheating” are found in a sentence, then extract the match “engine” to the “part” label and extract the match “overheating” to the “mal” (malfunction) label. Note that the order of the extracted matches does not matter, as long as they are in the same sentence, because they are governed by the SENT operator, which is unordered. Moreover, the entire span starting with the first element extracted from the sentence and ending with the last element extracted from the sentence is also returned to provide insight into the context of the matches. Some SAS Text Analytics products also enable you to concatenate additional tokens onto the beginning and end of that matched string to provide even more context via a bigger window of content. The process of concatenating context to both ends of a match creates a concordance view of the match.

Consider the following input documents:

1. The report indicated overheating, which means we need to focus on the engine.

2. The customer said that the engine was frequently overheating.

Pause and think: Assuming that the rule above is in a concept named reportedDefect, can you predict the matches for each extraction label and the entire matched string for the input documents above?

Assuming that the rule above is in a concept named reportedDefect, the matches for the input documents above are included in Figure 8.6.

Figure 8.6. Extracted Matches for the reportedDefect Concept

Doc ID | Concept | Extraction Label | Extracted Match
1 | reportedDefect | | overheating, which means we need to focus on the engine
1 | reportedDefect | mal | overheating
1 | reportedDefect | part | engine
2 | reportedDefect | | engine was frequently overheating
2 | reportedDefect | mal | overheating
2 | reportedDefect | part | engine

Both input documents produced matches despite the varied order of the extracted matches in the input text. This output shows that the order of the arguments in the definition of a PREDICATE_RULE with the SENT operator is irrelevant. Finally, if there are elements found by the rule (and not extracted) that are before the first or after the last extracted match, then they will not be included in the matched string.

Operators might have other operators as their arguments; this is called nesting of operators. The operator that is higher in the nesting hierarchy is the governing operator. If you want to restrict the order of the arguments, then you can insert an ORD operator into the rule above, governed by SENT. For more information about operators and their behavior, as well as how to select the right one, please consult both your product documentation and chapter 11. Advanced use examples in the following sections also illustrate specific applications, including nesting.

8.3.2. Advanced Use: Capture of a Sentence

Remember that, in addition to the extracted elements and associated labels, the results of a match to a PREDICATE_RULE also include a matched string that spans from the first extracted element to the final extracted element, including all the tokens between them. Potentially useful information might be contained within this span of tokens, information that can be analyzed to inform further rules.

For example, you might want to split each document into sentences. Then you can apply another model to each sentence or examine sentences with particular characteristics, such as those with mention of a vehicle part, as a smaller data set. Here is the rule to identify the first word and last word of a sentence; it assumes that your data is well formed and grammatical:

PREDICATE_RULE:(first, last):(SENT, (SENTSTART_1, “_first{_w}”), (SENTEND_2, “_last{_w} :sep”))

This rule looks within the scope of a sentence, as defined by the SENT operator, to find its two arguments, each of which is an operator governing its own arguments. The first word of the sentence is defined as such by using the SENTSTART_n operator with n defined as 1. The other operator used is SENTEND_n with n defined as 2, which enables you to identify the last word without extracting just the sentence-ending punctuation. Keep in mind, though, that the span extracted by this rule will not include the final punctuation, because that element is not inside an extraction label’s curly braces. To change the output to extract the final punctuation instead, use SENTEND_1 instead of SENTEND_2, and remove the final “:sep” element from the rule.

Consider the following input document:

The provider stopped sending me bills and therefore, I am delinquent. They sent me to a collection agency. Then they closed my account and I’ve been paying them all this time!

Pause and think: Assuming that the previous rule is in a concept named singleSentence and the input document above, can you predict the matches for each extraction label and the entire matched string?

Assuming that the PREDICATE_RULE is in the singleSentence concept and the input document is above, the matches are in Figure 8.7.

Figure 8.7. Extracted Matches for the singleSentence Concept

Doc ID | Concept | Extraction Label | Extracted Match
1 | singleSentence |  | Then they closed my account and I've been paying them all this time
1 | singleSentence | last | time
1 | singleSentence | first | Then
1 | singleSentence |  | They sent me to a collection agency
1 | singleSentence | last | agency
1 | singleSentence | first | They
1 | singleSentence |  | The provider stopped sending me bills and therefore, I am delinquent
1 | singleSentence | last | delinquent
1 | singleSentence | first | The

In the results above, when the extraction label column is blank, the extracted match is each sentence from the original document but without sentence-ending punctuation. This data could be the text field that you analyze in another project.
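The singleSentence behavior can be approximated outside of LITI as well. Here is a minimal Python sketch (a rough stand-in, not SAS code; real sentence boundary detection is considerably more robust) that returns the first word, last word, and punctuation-free span for each sentence:

```python
import re

def single_sentences(text):
    """Approximate the singleSentence rule: for each sentence, return
    (first_word, last_word, span), where span excludes the final
    sentence-ending punctuation."""
    results = []
    # Naive sentence split: any run of text ending in ., !, or ?.
    for sent in re.findall(r"[^.!?]+[.!?]+", text):
        words = re.findall(r"[\w']+", sent)
        if words:
            # The span runs from the first word to the last word.
            span = sent.strip().rstrip(".!?").strip()
            results.append((words[0], words[-1], span))
    return results
```

Each returned span could then serve as the text field that you analyze in another project.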

To put this type of rule into a practical situation, suppose you are running a hotel and have reviews from customers that talk about what they liked and did not like about their stay. These reviews are written for others that might be considering staying in your hotel. They include advice about what to do or see while visiting, as well as what to avoid. You want to identify such advice to use for honing your services and experiences, as well as to encourage your visitors to take advantage of great activities to engage in nearby. Your goal is for your guests to have the best time possible. You have a separate model or concept that talks about likes, dislikes, and complaints.

You can modify the rule in the previous example to get to the two types of advice when given in command form like the following:

PREDICATE_RULE:(pos, end):(SENT, (SENTSTART_1, “_pos{:V}”), (SENTEND_1, “_end{_w}”))

PREDICATE_RULE:(neg, end):(SENT, (SENTSTART_3, “_neg{ Do not :V}”), (SENTEND_1, “_end{_w}”))

You can see the similarity to the previous rule. This time, though, the item defined in the first position, or positions, of the sentence is meant to capture commands that start with verbs. Also, the ending punctuation is included in the match, instead of returning the final word in the sentence as the end of the matched string. If you use these two rules together—for example, in two concepts named posAdvice and negAdvice—then you might need to also use a REMOVE_ITEM rule to remove the negative comments (“do not”) from the positive advice (“do”) concept to disambiguate the two. Here is an example of such a REMOVE_ITEM rule:

REMOVE_ITEM:(ALIGNED, “_c{posAdvice}”, “negAdvice”)

See section 9.2 for more information about the REMOVE_ITEM rule type.
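The effect of the ALIGNED argument can be sketched as a span comparison. In the hypothetical Python fragment below (an approximation of the REMOVE_ITEM behavior, not the actual implementation), a match from the kept concept is dropped whenever its span coincides exactly with a match from the removal concept:

```python
def remove_aligned(keep_matches, remove_matches):
    """Sketch of REMOVE_ITEM with ALIGNED: drop any match from
    keep_matches whose (start, end) span coincides exactly with
    the span of a match in remove_matches."""
    remove_spans = {(m["start"], m["end"]) for m in remove_matches}
    return [m for m in keep_matches
            if (m["start"], m["end"]) not in remove_spans]
```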

Consider the following input documents:

1. Visit the hotel restaurant and you will be amazed!!!

2. Do not attend the show as it is a waste of time.

Pause and think: Can you predict the matches for each extraction label and the entire matched strings for the posAdvice and negAdvice concepts with the input documents above, assuming that the REMOVE_ITEM rule removed false positives?

The matches are included in Figure 8.8.

Figure 8.8. Extracted Matches for the posAdvice and negAdvice Concepts

Doc ID | Concept | Extraction Label | Extracted Match
1 | posAdvice |  | Visit the hotel restaurant and you will be amazed!!!
1 | posAdvice | end | !
1 | posAdvice | pos | Visit
2 | negAdvice |  | Do not attend the show as it is a waste of time.
2 | negAdvice | end | .
2 | negAdvice | neg | Do not attend

In the first document, the posAdvice concept matches on a verb that starts the sentence (“Visit”) and spans to match the punctuation ending the sentence (the right-most exclamation point of the three). In the second document, the negAdvice concept matches on the string “Do not attend” as the first three tokens that start a sentence and spans to include the sentence-ending punctuation (a period). For each of these cases, the extracted match is the entire document. The REMOVE_ITEM rule removed the matches from the posAdvice concept that also match the negAdvice concept, so the matches in Figure 8.9 were removed from the final output.

Figure 8.9. Removed Matches for the posAdvice Concept

Doc ID | Concept | Extraction Label | Extracted Match
2 | posAdvice |  | Do not attend the show as it is a waste of time.
2 | posAdvice | end | .
2 | posAdvice | pos | Do

The two rules are defined in separate concepts, named posAdvice and negAdvice respectively, so that each set of positive or negative rules is semantically associated with its own dedicated concept.

You can later restrict the first rule to limit the verbs to the ones you expect reviewers to use, because POS tagging can be error-prone. This type of exploratory approach can be used to learn many things in your data when you already know some things or can use the structure of the text to focus on what you are interested in. You can also explore your data with SEQUENCE rules, but they are less flexible: You can fill only gaps that are modeled with elements like _w, POS tags, or _cap.

For more information about sentence boundary detection with CAS, consult Gao (2018).

8.3.3. Advanced Use: More Complex Rules

Multiple Boolean and proximity-based operators can be used within a PREDICATE_RULE. As mentioned in section 8.3.1, a feature that makes this rule type as powerful as the CONCEPT_RULE type is nesting, the ability to embed an operator and its arguments as an argument of another operator. This allows for interactions between the operators to help you specify the exact conditions under which the pattern in the text will be a match for the desired information. The number of nesting levels is not constrained, although you should keep your rules logical and readable to the extent possible.

As an example, you might find that several variations of an argument are needed as part of your rule. You can achieve this result by listing each of the variations as an argument to a single OR operator, in a comma-delimited list:

PREDICATE_RULE:(part, malfunction):(SENT, (OR, “_part{fender}”, “_part{wing}”, “_part{mudguard}”), (OR, “_malfunction{shaking}”, “_malfunction{vibrating}”))

In this example, any one of the three variants of the idea of a vehicle part (“fender,” “wing,” or “mudguard”) can match and will evaluate to a value of true for the OR operator. Similarly, when either of the ways that something can move back and forth (“shaking” or “vibrating”) match, the second OR operator will evaluate to true, as well. If both OR operators are true within a sentence, then the SENT operator is also made true, and the match is returned for the entire rule.
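The SENT-plus-OR logic can be mimicked in a short sketch. The Python below is purely illustrative (invented function name, naive sentence splitting, exact-word matching only): within each sentence, both OR groups must find at least one variant for the rule to fire.

```python
import re

def part_malfunction_matches(text,
                             parts=("fender", "wing", "mudguard"),
                             malfunctions=("shaking", "vibrating")):
    """Sketch of SENT governing two OR groups: within each sentence,
    both groups must match; return (part, malfunction) per sentence."""
    results = []
    for sent in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[a-z']+", sent.lower()))
        found_parts = [p for p in parts if p in words]
        found_mals = [m for m in malfunctions if m in words]
        if found_parts and found_mals:  # both ORs true -> SENT true
            results.append((found_parts[0], found_mals[0]))
    return results
```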

Tip: When the list of arguments under an OR operator grows beyond 4–5 items, consider adding them as a list of CLASSIFIER or CONCEPT rules in a new concept and referencing that concept for better readability.

Another aspect of PREDICATE_RULE rules to be aware of is that not all elements require a corresponding extraction label. Elements used in the rule definition can serve as additional conditions that must hold true for the rule to match. Such elements specify the context of the rule match, much as with the elements without the _c{} extraction label in a CONCEPT_RULE rule type.

For example, in medical records, drug names are often followed by text that explains the means for delivery or administration of drugs and then by a date of application. Extracting matches for the drug name and date of application in the context of the string that represents the drug delivery method can be done with a PREDICATE_RULE that leverages two other concepts: nlpDate, which is an enabled predefined concept, and drugName, which includes the following rules:

CLASSIFIER:prednisone

CLASSIFIER:methylprednisolone

CLASSIFIER:fluticasone

The PREDICATE_RULE given here is in the drugDelivery concept:

PREDICATE_RULE:(drug, date):(SENT, (ORDDIST_10, “_drug{drugName}”, “administer@”, “_date{nlpDate}”))

This rule can be read as follows: Within a sentence, within a span of 10 tokens or less, first find a drug as defined in the drugName concept, then find a delivery string that contains variants of the token “administer,” and, finally, find a date of application as defined in the predefined concept nlpDate. The rule returns a match of the drugName concept with the “drug” label, a match for the nlpDate concept for the “date” label, and the entire span of text between the drugName match and the nlpDate match.

In this case, each argument found in the text must be in the same order as specified in the rule because of the use of the ORDDIST_n operator. If specifying the ordering is not required for your data, you could use DIST_n instead. One additional benefit of this specific rule is that the drug names have been collected in a separate concept called drugName to avoid writing a separate rule for each drug. Also, leveraging the predefined concept nlpDate gives you the flexibility to match on many types of date variants without writing explicit rules. By using the @ modifier with the “administer” element, you extend the rule beyond the usefulness of a simple string to cover forms such as “administered.”

Consider the following input documents:

1. Due to inflammation, we are prescribing Prednisone to be administered starting today.

2. I took half a dose of fluticasone nasal spray yesterday.

3. Resulted because IV fluid and methylprednisolone was administered on 30.11.04 and swelling was observed in forearm on 4.12.04—that is, 4 days after fluid administration.

Pause and think: Can you predict the matches for each extraction label and the entire matched string for the drugDelivery concept and the input documents above?

The matches for the drugDelivery concept and input documents above are in Figure 8.10.

Figure 8.10. Extracted Matches for the drugDelivery Concept

Doc ID | Concept | Extraction Label | Extracted Match
1 | drugDelivery |  | Prednisone to be administered starting today
1 | drugDelivery | date | today
1 | drugDelivery | drug | Prednisone
3 | drugDelivery |  | methylprednisolone was administered on 30.11.04
3 | drugDelivery | date | 30.11.04
3 | drugDelivery | drug | methylprednisolone

The output shows matches for the first and third document and not for the second one because it did not contain a morphological variation of the token “administer,” as the rule definition required. In addition, the extracted matches for the two labels include only the drug name (“Prednisone” and “methylprednisolone”) and the date of application (“today” and “30.11.04”), and not the strings related to the token “administer.” Because morphological variations of that token were defined in the rule, they must be present and reside between the other two arguments in order for matches to be returned at all. But because the rule did not specify that the token should be extracted with a label, the variations of “administer” are not part of the labeled matches. The entire string from the match for the drug to the match for the date is also included in the output.
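The ordered-distance logic can be sketched in plain Python. The fragment below is a rough stand-in only: the function name is invented, the date regex is a crude numeric approximation of nlpDate, and `startswith("administer")` approximates the @ expansion.

```python
import re

# Crude numeric stand-in for the nlpDate predefined concept.
DATE = re.compile(r"\d{1,2}[./]\d{1,2}[./]\d{2,4}")

def drug_delivery(sentence,
                  drugs=("prednisone", "methylprednisolone", "fluticasone"),
                  max_dist=10):
    """Rough stand-in for SENT + ORDDIST_10: a drug name, then a token
    starting with 'administer', then a numeric date, in that order,
    all within max_dist tokens of the drug mention."""
    tokens = re.findall(r"[\w./]+", sentence)
    lower = [t.lower().rstrip(".") for t in tokens]
    for i, tok in enumerate(lower):
        if tok in drugs:
            for j in range(i + 1, min(i + max_dist, len(tokens))):
                if lower[j].startswith("administer"):
                    for k in range(j + 1, min(i + max_dist, len(tokens))):
                        if DATE.fullmatch(lower[k]):
                            return (tokens[i], tokens[k])
    return None
```

As in the rule, reordering the three conditions in the text (or exceeding the distance) would yield no match.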

8.3.4. Advanced Use: Single Label, Multiple Extracted Matches

As with the SEQUENCE rule type, you can also match against the same extraction label more than once in a PREDICATE_RULE. For example, there might be several vehicle parts mentioned within the space of a couple of sentences, and you want to know when multiple parts are mentioned in close proximity to see whether those parts are interacting poorly. For this purpose, you can use a single PREDICATE_RULE extraction label, named “part,” and use it more than once in the same rule definition:

PREDICATE_RULE:(part):(SENT, “_part{partListA}”, “_part{partListB}”)

You can read this rule this way: Within the span of a sentence, extract any match to the rules in the concept partListA and the concept partListB as matches for the label “part.” Each of the two referenced concepts contains a list of parts which are of a certain type. In other words, this rule will find situations where one type of part is mentioned by a customer in the same sentence as another type of part. This approach enables you to explore the relationships between two types of parts (from partListA and partListB) and to produce both sets of matches as outputs to a third concept, named, for example, partInteraction.

The concept partListA includes rules such as the following:

CLASSIFIER:rear defrost

The concept partListB includes rules such as the following:

CLASSIFIER:back windshield

Consider the following input document:

Immediately after turning on my rear defrost I heard that oh too familiar cracking noise coming from the passenger side of my back windshield—the same sound I heard the first time I came outside to find my windshield shattering on its own.

Pause and think: Can you predict the fact matches and matched string for the partInteraction concept with the input document above?

The matches for the partInteraction concept with the input document above are in Figure 8.11.

Figure 8.11. Extracted Matches for the partInteraction Concept

Doc ID | Concept | Extraction Label | Extracted Match
1 | partInteraction |  | rear defrost I heard that oh too familiar cracking noise coming from the passenger side of my back windshield
1 | partInteraction | part | back windshield
1 | partInteraction | part | rear defrost

The PREDICATE_RULE in this example would also work if the parts were all together in a single list; however, in that case, you would get more false positive matches of items you already knew were related or variant ways of referring to the same thing. For example, you would get extracted matches for both “back windshield” and “windshield,” which is much less useful.
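For illustration, the one-label, two-argument behavior can be mimicked in a small Python sketch (hypothetical helper name and tiny part lists; the real product handles spans and overlap far more carefully). Both lists must contribute a match for the rule to fire, and all matches land under the single “part” label:

```python
import re

def part_interaction(sentence,
                     list_a=("rear defrost",),
                     list_b=("back windshield",)):
    """Sketch of one extraction label used for two arguments: collect
    every match from either part list under the single label 'part'."""
    parts = []
    for phrase in list_a + list_b:
        if re.search(re.escape(phrase), sentence, re.IGNORECASE):
            parts.append(phrase)
    # Both lists must contribute at least one match for the rule to fire.
    if any(p in list_a for p in parts) and any(p in list_b for p in parts):
        return {"part": parts}
    return None
```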

At this point, you might ask why you would use a PREDICATE_RULE over a CONCEPT_RULE when attempting to match multiple items using a single argument. Although a CONCEPT_RULE type does accept Boolean and proximity operators, it does not allow for capturing multiple matches at the same time because the _c{} extraction label is used to yield a single result each time text is found. Therefore, you could not show the relationship in the example above between different parts in a CONCEPT_RULE.

You might remember that the _c{} extraction label can be used more than once in a CONCEPT_RULE but only when used with an OR operator to mark multiple sister arguments. Even then, only one match will be returned. The PREDICATE_RULE will enable you to capture multiple values for the same argument and should be used when you want to associate the same items in some relationship. If instead you do want to extract only one of the items, a CONCEPT_RULE should be used because it is the less computationally expensive rule type.

8.3.5. Advanced Use: More Than Two Returned Arguments

Remember that fact rule types (SEQUENCE and PREDICATE_RULE) are intended to capture relationships between elements. As you have already seen, elements do not all have to be extracted. You can extract more than two elements with one PREDICATE_RULE rule; the number of labels should correspond to the elements that you are attempting to extract. For example, consider the following rule in a concept named checkInfo:

PREDICATE_RULE:(checkNo, amount, date):(SENT_2, “_checkNo{checkNumber}”, “_amount{nlpMoney}”, “_date{nlpDate}”)

You can read the rule this way: Within a span of 2 sentences, find a match for the rules in the custom concept checkNumber and return it as a match for the label “checkNo”; find a match for the rules in the predefined concept nlpMoney and return it as a match for the label “amount”; and find a match for the rules in the predefined concept nlpDate and return it as a match for the label “date.” Note that the rule declaration consists of three extraction labels, corresponding to the information that you want to extract: the check number, amount, and date. The concept checkNumber is a supporting concept that defines how a check number is expected to appear: a single hash symbol, followed by one or more digits (that is, the check number itself). The concept contains the following rule:

REGEX:#\d+

As in SEQUENCE rules, the order of the extraction labels in the declaration does not matter, and you can change it if you want to. The order in the rule definition depends on whether you are using operators that require a particular order to match, like ORD or ORDDIST.

Consider the following input documents:

1. I have written a personal check in the amount of $125.00 and it is dated from Monday. The check number is #2501. Why is this check not shown on my current statement?

2. My accountant noticed that our check #3889, in the amount of $889.23 from 3/24/2016 bounced.

Pause and think: Can you predict the matches for the checkInfo concept and the input text above?

The fact matches for the checkInfo concept and the input documents above are shown in Figure 8.12.

Figure 8.12. Extracted Matches for the checkInfo Concept

Doc ID | Concept | Extraction Label | Extracted Match
1 | checkInfo |  | $125.00 and it is dated from Monday. The check number is #2501
1 | checkInfo | date | Monday
1 | checkInfo | amount | $125.00
1 | checkInfo | checkNo | #2501
2 | checkInfo |  | #3889, in the amount of $889.23 from 3/24/2016
2 | checkInfo | date | 3/24/2016
2 | checkInfo | amount | $889.23
2 | checkInfo | checkNo | #3889

Note that there are matches for both documents, even though the order of the three elements in the two documents is different.
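As a rough illustration of the order-independent, three-label extraction, the following Python sketch approximates checkNumber, nlpMoney, and nlpDate with simple regular expressions. These stand-ins are far cruder than the real predefined concepts, and the function name is invented:

```python
import re

# Hypothetical stand-ins for checkNumber, nlpMoney, and nlpDate.
CHECK_NO = re.compile(r"#\d+")
MONEY = re.compile(r"\$\d+(?:\.\d{2})?")
DATE = re.compile(
    r"\d{1,2}/\d{1,2}/\d{4}"
    r"|\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b")

def check_info(text):
    """Sketch of the three-label extraction; the order in which the
    three elements appear in the text does not matter, as with SENT_2."""
    check_no = CHECK_NO.search(text)
    amount = MONEY.search(text)
    date = DATE.search(text)
    if check_no and amount and date:
        return {"checkNo": check_no.group(),
                "amount": amount.group(),
                "date": date.group()}
    return None
```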

8.3.6. Advanced Use: Discovery of Terms to Add to a Model

Imagine that you are building a list of adjectives used to describe your product in reviews. There are several approaches you could take. One way to find adjectives in your data is to use the POS tag :A in a CONCEPT rule. You will, however, get a lot of adjectives that are not used to describe your product. A better approach would be to use a PREDICATE_RULE with a SENT operator like the following:

PREDICATE_RULE:(prod, adj):(SENT, “_prod{productList}”, “_adj{:A}”)

But, using this rule, you are still likely to get some matches that are not references to your product, and you will miss some that are correct. If you want to adjust your results further, one option is to use context to target words that are likely to be adjectives.

SEQUENCE:(prod, adj)::DET _adj{:A} _prod{productList}

PREDICATE_RULE:(prod, adj):(SENT, (ORDDIST_5, “:DET”, “_adj{:A}”, “_prod{productList}”))

Comparing the output of the two rules above, the second rule is broader in that it allows for other words to come between the adjective and the product mention. It is also more constrained in scope: It is limited to within a sentence. Both rules look for a determiner like “the” or “a,” followed by an adjective, and then followed by a mention of the product. Taking into consideration your specific data, you can use the rule that returns better results. You can also model additional contexts where adjectives are likely to occur grammatically:

SEQUENCE:(prod, adj): _prod{productList} be@ _adj{:A to :V}

PREDICATE_RULE:(prod, adj):(SENT, (ORDDIST_5, “_prod{productList}”, “be@”, “_adj{:A to :V}”))

These rules look for mention of your products, followed by some form of the word “be,” and then an adjective followed by an infinitive verb construction. Again, try both to see which one is most suitable for your data. Then you can expand or add similar rules to continue your investigation. As you find examples of useful adjectives in your data, you can add those adjectives to a list and put them in their own concept to be referenced by other rules.
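The determiner–adjective–product pattern can be prototyped outside the product, too. The sketch below uses a toy word list in place of real POS tagging (the lexicons, product list, and function name are all invented for illustration):

```python
import re

# Toy lexicons standing in for real POS tagging (:DET, :A).
DETERMINERS = {"the", "a", "an", "this", "that"}
ADJECTIVES = {"great", "terrible", "slow", "reliable", "noisy"}

def adjectives_near_product(text, products=("blender",), max_gap=5):
    """Sketch of the ORDDIST_5 discovery rule: a determiner, then an
    adjective, then a product mention, within max_gap tokens."""
    tokens = [t.lower() for t in re.findall(r"[\w']+", text)]
    hits = []
    for i, tok in enumerate(tokens):
        if tok in DETERMINERS:
            for j in range(i + 1, min(i + max_gap, len(tokens))):
                if tokens[j] in ADJECTIVES:
                    for k in range(j + 1, min(i + max_gap, len(tokens))):
                        if tokens[k] in products:
                            hits.append((tokens[k], tokens[j]))
    return hits
```

Adjectives surfaced this way could then be collected into their own concept for other rules to reference.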

8.3.7. Troubleshooting

If you discover that a rule is not matching as you expected, potential causes for this could be one of the pitfalls outlined in section 5.4: namely, general syntax errors, comments, misspelling/mistyping, tokenization mismatch, or filtered matches. In addition, there are also errors that you can check for that are specific to the PREDICATE_RULE type of rule, such as the following:

  • White space
  • Syntax errors
  • Missing extraction label
  • Extra extraction label
  • Logical error with operators
  • Tagging mismatch
  • Expansion mismatch
  • Concept references
  • Predefined concept references
  • Cyclic dependencies

White space in a PREDICATE_RULE is not very important because of the use of the parentheses, commas, and double quotation marks to set off pieces of the rule. However, within an argument (double quotation marks), white space is a separator for a list of elements and not counted as an element itself.

Every PREDICATE_RULE includes an output declaration section between the rule type declaration and the rule definition. Make sure that there is a colon on either side of this section and that the declaration of the names of labels is a comma-separated list between parentheses.

One of the common syntax errors that are specific to PREDICATE_RULE is forgetting or misplacing the extraction label (or labels), underscore (or underscores), or curly braces of the extraction label: The braces must always be inside the double quotation marks defining an argument. Remember also that the elements inside a set of parentheses are a comma-separated list. Do not forget the commas. Finally, parentheses and quotation marks must come in pairs.

You must use all the labels you define in the declaration section of the rule at least once, and they must be spelled just as you declared them, or you will get an error. If you use a label in a rule that you did not declare, you will simply get no matches.
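A check like this is easy to prototype outside the product. The hypothetical Python function below compares the labels declared in a fact rule’s declaration section against the labels actually used in its definition, returning any that are declared but unused or used but undeclared:

```python
import re

def check_labels(rule):
    """Compare declared vs. used extraction labels in a fact rule string.
    Returns (declared_but_unused, used_but_undeclared), both sorted."""
    # Declaration section: RULE_TYPE:(label1, label2):...
    header = re.match(r"\w+:\(([^)]*)\):", rule)
    declared = ({lab.strip() for lab in header.group(1).split(",")}
                if header else set())
    # Usage: every _label{...} occurrence in the definition.
    used = set(re.findall(r"_(\w+)\{", rule))
    return sorted(declared - used), sorted(used - declared)
```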

Do not forget that if you have marked all or part of an argument of an OR operator with a user-defined label, then you will also have to place the label somewhere on all of the sister arguments, as well. Otherwise, you will not see the matching behavior that you expect. Avoid using _ref{}, _F{}, _P{}, or _Q{} as an extraction label, as well, because there might be unexpected behavior if you do use them.

It is possible that the POS tag that you think a particular word might have is not the tag assigned to that word by the software in that particular context. The best way to prevent this error is to test your expectations with targeted examples in context before applying the rule to a sample of documents that is like the data that you will process with the model.

In addition, it is possible that the POS tag is misspelled or does not exist. Different languages, versions, and products might use different POS tags. Consult your product documentation for lists of acceptable tags for rule-building. The spelling and case of the tags in the rules must be exactly as documented. Because writing a rule with a nonexistent tag like “:abc” is not a syntax error but a logical error, the syntax checking protocols will not catch it as an error, but there will not be any of the expected matches.

Another potential error when you are writing rules that contain a POS tag is forgetting to include the colon before specifying the tag. Without the colon, the system considers the rule to refer to a concept by that name or a string match, which might produce unexpected or no results. Syntax checking protocols will not return an error in this case.

When using the expansion symbols (e.g., @, @N, @V, @A), note that the expansion includes only related dictionary forms, not any misspellings that might have been identified by the misspelling algorithm or other variants associated with that lemma through use of a synonym list. To review what a lemma is, consult chapter 1. Also, remember that the forms of the words are looked up before processing, and when matching happens, the associated POS assignment of the word in the text is not considered. You can work around this issue using a CONCEPT_RULE; see section 7.2 for more information. Examining your output from rules that contain expansion symbols is recommended.

Referencing concepts by name without ensuring that you have used the correct name, including both case and spelling accuracy, can also reduce the number of expected matches. If you reference predefined concepts, be sure they are loaded into your project, and always check the names because they might be different across different products. Concept names are always case-sensitive.

Any rule that can reference a concept and returns matches (e.g., not REMOVE_ITEM or NO_BREAK) has the capacity to participate in a cyclic dependency error. A cyclic dependency is when two or more concepts refer to each other in a circle of concept references. For example, if the concept myConceptA has rules that reference myConceptB, and myConceptB has rules that reference myConceptA, then there is a cycle of references between them. This type of error will prevent your whole project from compiling. This is why you should test your project often as you are adding concepts and rules. In this way, you will know that the latest concepts added to the model created the cyclic dependency. Another strategy to use to avoid this error is careful design for your taxonomy and model. Refer to chapter 13 to learn more about taxonomy design best practices.
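Conceptually, detecting such a cycle is a graph problem. The sketch below (plain Python, unrelated to the product’s internal compiler) walks a concept-reference graph depth-first and reports the first cycle it finds:

```python
def find_cycle(references):
    """Detect a cyclic dependency in a dict mapping each concept name
    to the list of concepts its rules reference. Returns the cycle as
    a path of names, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {name: WHITE for name in references}

    def visit(node, path):
        color[node] = GRAY
        for ref in references.get(node, []):
            if color.get(ref, WHITE) == GRAY:
                return path + [ref]          # back edge: cycle found
            if color.get(ref, WHITE) == WHITE and ref in references:
                found = visit(ref, path + [ref])
                if found:
                    return found
        color[node] = BLACK
        return None

    for name in references:
        if color[name] == WHITE:
            found = visit(name, [name])
            if found:
                return found
    return None
```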

One common error reported by users who write rules programmatically is to add an additional colon between the name of the concept and the names of the labels. Although this unnecessary colon will not produce a syntax checking error, the rule will produce no matches. To avoid this situation, ensure that there are only two colons in the fact rule written programmatically: one between the rule type declaration and the concept name, and the other between the extraction labels and the rule definition. See the programmatically formatted example rule that follows, which defines the “part” and “mal” labels in the replacedPart concept:

PREDICATE_RULE:replacedPart(part, mal):(SENT, “_part{engine}”, “_mal{overheating}”)

Be careful: This rule is correct when you are using the programmatic ways of building a model, but it will not work as expected in the GUI environment. In the GUI, you should not use the concept name as a part of the rule, because the GUI interprets the name from your taxonomy structure and location of your rule in the editor associated with a specific concept.

If you have checked all the above and are still having problems with your rules, then you should look at the logic defined by your combination of operators. A full understanding of operators is recommended if you are combining them together in a single rule. Consult chapter 11 to learn more about operators and how they interact. Review the project design and match algorithm sections in chapter 13 if you need more help with troubleshooting these rule types.

Finally, if you can use a simpler rule type to do the work that you are trying to do with these rules, always use the simpler rule type instead. Although these rule types are very powerful, they can be more difficult to maintain and troubleshoot in larger models. If you use them, make sure you use them correctly.

8.3.8. Best Practices

As rule complexity grows, the potential exists for increasing compilation and run-time costs. Because of the flexible nature of elements and operators that can be used in a PREDICATE_RULE, it is advised to keep each rule as lightweight as possible. You can write less computationally intensive rules by opting for the following:

  • Minimizing the number of rule arguments
  • Limiting the number of nested Boolean and proximity operators
  • Referencing less computationally expensive concepts

Remember, PREDICATE_RULE arguments are used to match against specific parts of a text. If you are extracting elements from many arguments in a PREDICATE_RULE type of rule, then you should evaluate whether all extracted elements require extraction as a set. Can the rule be split into smaller, less expensive rule types with fewer restrictions? If so, select those rule types over the more computationally expensive PREDICATE_RULE.

PREDICATE_RULE arguments contain two types of elements: the ones with an extraction label specify what part of the match to return, and the ones without a label specify context. You can think of the latter in the same way you do the unmarked contextual elements in a CONCEPT_RULE rule type. Minimize the number of these elements needed in each rule in order to reduce processing time, but leverage them where useful. Remember that extraction labels must start with a letter, followed by any series of letters, underscores, or digits. Note that in some older products, using an uppercase letter in an extraction label name could cause compilation errors. Do not use the extraction label _c{} in a fact rule, and avoid the labels _ref{}, _F{}, _P{}, and _Q{}.

When PREDICATE_RULE rules refer to concepts that are themselves potentially computationally expensive, the costs associated with compilation and run-time processes are compounded. It is recommended to have PREDICATE_RULE rules depend on concepts with only more simplistic matching or comprising less computationally expensive rule types, like CLASSIFIER, CONCEPT, or C_CONCEPT rules. 

Having PREDICATE_RULE definitions depend on concepts that themselves contain PREDICATE_RULE definitions, also known as scaffolding, is not recommended. A referring PREDICATE_RULE might depend only on whether another concept contains matches, without using the results of the matches themselves (e.g., the matching fact arguments). In that case, it is recommended to have the rule in the referenced concept use a less computationally expensive rule type. If operators are required, use a CONCEPT_RULE type to feed one level of PREDICATE_RULE at the very top.

To check whether your labeled elements extract what you intend, it is a good idea to do some preliminary scoring before spending a lot of time building rules. This guideline parallels the practice of writing method stubs in programming to confirm that the end-to-end design is sound. Put a few rules in each of your concepts to test how the input documents are transformed into new columns of structured data and to plan any post-processing that might be required.

Finally, use descriptive label names that show why you are extracting each item. Short names aid readability and help avoid typographical errors, but descriptive, informative names are what make rules maintainable and understandable. For ease of troubleshooting and maintenance, document your rules and labels with comments.

8.3.9. Summary

Requirements for a PREDICATE_RULE include the following:

  • A rule type declaration in all caps and followed by a colon
  • One or more comma-separated user-defined extraction labels enclosed in parentheses and followed by a colon
  • Repetition of the user-defined extraction label preceded by an underscore and followed by curly braces that enclose an element or elements to be extracted somewhere within the rule definition
  • One or more Boolean or proximity operators, each followed by a comma-separated list of arguments enclosed in parentheses
  • One or more elements enclosed in double quotation marks in each argument

Types of elements allowed include the following:

  • A token or sequence of tokens to match literally (# character must still be escaped for a literal match to occur)
  • A reference to other concept names, including predefined concepts
  • A POS or special tag preceded by a colon
  • A word symbol (_w), representing any single token
  • A cap symbol (_cap), representing any capitalized word

Allowed options for the rule type include the following:

  • Comments using # modifier
  • Morphological expansion symbols, including @, @N, @A, and @V
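
As a sketch that ties these requirements together, the following hypothetical rule models the "typeOf" relationship from the chapter introduction. It nests a proximity operator inside SENT and uses a literal string as context; the concept names promotionTerm and categoryTerm are illustrative assumptions:

```
# typeOf relationship: X is a type of Y, within 10 tokens in one sentence
PREDICATE_RULE:(x, y):(SENT, (DIST_10, "_x{promotionTerm}", "type of", "_y{categoryTerm}"))
```

Checking the rule against the requirements list: the rule type declaration is in all caps and followed by a colon; the extraction labels x and y are comma-separated in parentheses and followed by a colon; each label is repeated with an underscore and curly braces around the element to extract; the SENT and DIST_10 operators each take a comma-separated, parenthesized argument list; and every element is enclosed in double quotation marks.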