Chapter 6: Concept Rule Types

6.1. Introduction to the Concept Rule Types

6.2. CLASSIFIER Rule Type

6.2.1. Basic Use

6.2.2. Advanced Use: Coreference Command

6.2.3. Advanced Use: Information Field

6.2.4. Troubleshooting

6.2.5. Best Practices

6.2.6. Summary

6.3. CONCEPT Rule Type

6.3.1. Basic Use

6.3.2. Advanced Use: Combination of Various Elements

6.3.3. Advanced Use: Combination of Elements and Modifiers

6.3.4. Troubleshooting

6.3.5. Best Practices

6.3.6. Summary

6.4. C_CONCEPT Rule Type

6.4.1. Basic Use

6.4.2. Advanced Use: Multiple Strings as Matches

6.4.3. Advanced Use: Coreference

6.4.4. Troubleshooting

6.4.5. Best Practices

6.4.6. Summary

6.1. Introduction to the Concept Rule Types

In chapter 5, you learned about four groupings of LITI rule types:

  • Concept rule types (including CLASSIFIER, CONCEPT, C_CONCEPT, and CONCEPT_RULE)
  • Fact rule types (including SEQUENCE and PREDICATE_RULE)
  • Filter rule types (including REMOVE_ITEM and NO_BREAK)
  • REGEX rule type

Each one of these rule types is described briefly in the SAS Text Analytics product documentation. But it is often difficult to grasp the full power of each rule type in the context of a project. Therefore, the current chapter focuses on three of the concept types in the first group above: CLASSIFIER, CONCEPT, and C_CONCEPT rules. These rule types are used when you need to extract one contiguous string of information.

In this chapter, you will find basic and advanced uses for each of these three concept rule types, with examples. You should focus first on mastering the basic use cases and then extend your knowledge to the more advanced use cases.

To aid with troubleshooting unexpected behavior, each rule type section includes a checklist of possible errors specific to that rule type. To help you make the most out of each rule type in your models, this chapter also contains best practices for using that rule type. Finally, the requirements and optional elements for each rule type are summarized at the end of each section so you can keep coming back to that section as a quick reference when you are building your models.

After reading this chapter, you will be able to do the following tasks:

  • Use the LITI syntax to write efficient and effective CLASSIFIER, CONCEPT, and C_CONCEPT types of rules
  • Avoid common pitfalls and use best practices to create better rule sets
  • Troubleshoot common rule-writing errors

6.2. CLASSIFIER Rule Type

CLASSIFIER rules match literal strings that represent a token or sequence of tokens. The full span of text found by the rule is returned as the extracted match in the output.

6.2.1. Basic Use

The basic syntax is as follows:

CLASSIFIER:token

CLASSIFIER:token token

This rule type extracts the specified token, which can be any character sequence consisting of letters, numbers, and punctuation. A rule can also specify multiple tokens separated by spaces. So the following examples are valid rules:

CLASSIFIER:Mets

CLASSIFIER:Red Sox!
CLASSIFIER:2-3 teams including the Astros

Consider applying the rules above to the following input documents:

1. I like the Mets but she roots for the Red Sox!

2. They love 2-3 teams including the Astros.

You can try this and other examples in this chapter yourself with the code provided in the supplemental materials for the book (for instructions on downloading the supplemental materials, see About this Book).

Pause and think: Assuming that the rules above are in a concept named baseballTeams, can you predict the extracted matches for the input documents above?

The extracted matches are presented in Figure 6.1.

Figure 6.1. Extracted Matches for the baseballTeams Concept

Doc ID   Concept         Match Text
1        baseballTeams   Mets
1        baseballTeams   Red Sox!
2        baseballTeams   2-3 teams including the Astros

Be careful not to include commas in the definition portion of the rule, because commas have a special role in CLASSIFIER rules as cues for the information field (see section 6.2.3 for details). If you want to include a comma in your extracted match, use the special escape sequence “\c”. For example, see the following instance:

CLASSIFIER:The Red Sox\c Inc.

This rule would match “The Red Sox, Inc.” but not “The Red Sox Inc.” (without the comma) or “The Red Sox, Inc” (without a trailing period on the abbreviation).

Many rule-based information extraction (IE) systems take advantage of dictionaries or lists of specialized terms to be extracted. The SAS IE system performs the same task through the CLASSIFIER rule type.
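Because CLASSIFIER rules are literal strings, a term list can be converted into rules mechanically. The following Python sketch shows one possible way to do that outside the SAS software. The function names are hypothetical, and the sketch assumes that the comma (escaped as \c) and the hash (escaped as \#) are the only characters that need special handling, as described in sections 6.2.1 and 6.2.4.

```python
def to_classifier_rule(term):
    """Build one CLASSIFIER rule from a literal term (illustrative helper)."""
    # Assumption: commas and hashes are the only characters that need escaping.
    escaped = term.replace(",", "\\c").replace("#", "\\#")
    # Collapse runs of white space, because white space only separates tokens.
    escaped = " ".join(escaped.split())
    return "CLASSIFIER:" + escaped


def to_classifier_rules(terms):
    """Build rules for a list of terms, skipping blank and duplicate entries."""
    seen = set()
    rules = []
    for term in terms:
        rule = to_classifier_rule(term)
        if rule != "CLASSIFIER:" and rule not in seen:
            seen.add(rule)
            rules.append(rule)
    return rules


if __name__ == "__main__":
    for rule in to_classifier_rules(["Mets", "Red Sox!", "The Red Sox, Inc."]):
        print(rule)
```

Generated rules should still be tested against sample text, because the sketch does not check how the software tokenizes each term.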

6.2.2. Advanced Use: Coreference Command

A special form of the CLASSIFIER rule type includes a section in square brackets between the rule type and the rule definition. This special section is the coreference command, which can enable you to extract a match when an alias is used to refer to a string that is referenced with a full name in the same document. To capture co-occurrence of the terms, you can define the coreference in a CLASSIFIER rule.

The basic syntax is as follows:

CLASSIFIER:[coref=our company]:SAS Institute

You can read this rule this way: When the terms “SAS Institute” and “our company” appear in the input document, “our company” should be a match in the same concept as “SAS Institute.” If only “SAS Institute” appears in the document, it is still extracted as a match, but if “our company” appears without “SAS Institute,” it is not extracted as a match for that concept. The prerequisite for the coref part of the rule definition to produce a match is that the primary rule definition is found in the text. This approach adds an if-then condition to the match logic.

Consider the following examples. Assume each numbered item is a separate observation in the input data set:

1. SAS Institute is a great company. Our company has a recreation center and health care center for employees.

2. Our company has won many awards.

3. SAS Institute was founded in 1976.

Pause and think: Assuming that the rule above is in a concept named bestEmployer, can you predict the matches with the input documents above?

The scoring output matches are in Figure 6.2; note that the document ID associated with each match aligns with the number before the input document where the match was found.

Figure 6.2. Extracted Matches for the bestEmployer Concept

Doc ID   Concept        Match Text      Canonical Form
1        bestEmployer   SAS Institute   SAS Institute
1        bestEmployer   Our company     SAS Institute
3        bestEmployer   SAS Institute   SAS Institute
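The if-then condition of the coreference command can be emulated in a few lines of Python. This is only a sketch of the logic described above, not the way the SAS software implements it. The function names are hypothetical, and the sketch assumes case-insensitive matching on whole words, which is consistent with “Our company” being matched in Figure 6.2.

```python
import re


def find_spans(text, phrase):
    """Return (offset, matched text) pairs for a literal phrase."""
    pattern = re.compile(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE)
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


def coref_matches(text, primary, alias):
    """Emulate CLASSIFIER:[coref=alias]:primary for one document."""
    primary_hits = find_spans(text, primary)
    if not primary_hits:
        # Without the primary definition in the text, the alias is not extracted.
        return []
    hits = primary_hits + find_spans(text, alias)
    # Every match, alias or not, carries the primary string as canonical form.
    return [(found, primary) for _, found in sorted(hits)]


if __name__ == "__main__":
    docs = [
        "SAS Institute is a great company. Our company has a recreation center "
        "and health care center for employees.",
        "Our company has won many awards.",
        "SAS Institute was founded in 1976.",
    ]
    for doc_id, doc in enumerate(docs, start=1):
        print(doc_id, coref_matches(doc, "SAS Institute", "our company"))
```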

In the user interface of some SAS Text Analytics products, the canonical form is visible in the terms list in parsing. For example, Figure 6.3, a view of SAS Visual Text Analytics, shows that “sas institute” is the lemma (parent or canonical form) for “our company” in those instances where they both appear in the same input document. To review what a lemma is, consult section 1.5.3.

To get the same output in your SAS Visual Text Analytics product, you need a project with a Concepts node and a Parsing node after it. First, open the Concepts node and create a new concept named “bestEmployer.” Put the rule above into the rule editor window, and run your entire pipeline. Then, open the Parsing node and expand the term “sas institute” in the Term column.

Figure 6.3: SAS Visual Text Analytics Output of CLASSIFIER Rules with the Coreference Command

As shown in Figure 6.3, company name aliases are a good reason for using the coreference capability. In relatively short documents or with less ambiguous aliases, all instances of the alias in the document that mentions the full company name at least once are probably also referring to that company. Note that you cannot use this rule type with more than two aliases per rule.

This command is not the best way to handle resolution of pronouns, such as “we” or “our,” because pronouns are not always tied to one noun. The recommended best practice in those cases is to use C_CONCEPT and CONCEPT_RULE types instead. Read more about these rule types in section 6.4 and in chapter 7, respectively.

The best time to use the coreference command in a CLASSIFIER rule is in cases where you may want term or phrase A to always be associated with term or phrase B that is present in the text. However, if term or phrase B is not in the text and you still want to extract term or phrase A and associate it with term or phrase B, use the information field feature instead. This feature of the CLASSIFIER rule type is described in section 6.2.3.

6.2.3. Advanced Use: Information Field

Another special form of the CLASSIFIER rule includes a comma in the rule definition, which signifies the beginning of the information field. Some versions of the SAS Text Analytics products can use this information field as a means for specifying the lemma of the match. The lemma acts as an umbrella term under which various forms of the same matched term are aggregated. For example, in SAS Visual Text Analytics, the information field allows you to set up a parent-child relationship between two terms or sets of terms in contexts where the child term (or set of terms) appears in the text, but the parent does not necessarily appear. In this sense, the information field is similar to the coreference command. The difference is that the parent term or terms are not required to be in the text when you use the information field, but the parent term must be matched in the text for a rule with the coreference command to be applied.

However, in some software versions, the information field is not displayed or provided as output, and in others this information is lost if the concept containing the CLASSIFIER rule is referenced by another concept. Therefore, you should use this option with caution and always consult the documentation for your specific product and version before using this feature. Before using the information field in the design of your model, you should build a brief test to confirm the outputs of scoring will be as you expect.

The basic syntax is as follows:

CLASSIFIER:token,information field

Two rule examples follow:

CLASSIFIER:United States,USA
CLASSIFIER:U.S.,USA

Remember that, with the coreference command, both spans of text in the rule must be found in the input text. With the information field, only the span of text in the rule definition (appearing before the comma) must be found in the input text.

Consider the following input text documents. As before, each numbered item is a separate row in the input data set:

1. I live in the United States.

2. The U.S. is their home country.

3. They chanted: USA! USA!

Pause and think: Assuming that the rules above are in a concept named usAlias, can you predict the matches for the input documents above?

Assuming the rules and input documents above, the matches are in Figure 6.4.

Figure 6.4. Extracted Matches for the usAlias Concept

Doc ID   Concept   Match Text      Canonical Form
1        usAlias   United States   USA
2        usAlias   U.S.            USA

Notice that only the first two input documents produce a match. The third one does not, because there is no rule definition that matches the term “USA”; it is mentioned only in the information field of the two rules. In short, the information field does not extract matches; it only adds a canonical form to the already extracted match.
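The same behavior can be emulated with a short Python sketch. It is an illustration of the logic only, under the assumption of case-insensitive whole-word matching; the names RULES and info_field_matches are hypothetical and are not part of any SAS product.

```python
import re
from collections import Counter

# Rule definition (text before the comma) -> information field (text after it).
RULES = {"United States": "USA", "U.S.": "USA"}


def info_field_matches(text, rules=RULES):
    """Return (match text, canonical form) pairs for one document."""
    hits = []
    for definition, info in rules.items():
        pattern = re.compile(
            r"(?<!\w)" + re.escape(definition) + r"(?!\w)", re.IGNORECASE
        )
        for m in pattern.finditer(text):
            # Only the definition is searched for; the information field is
            # attached to the match afterward and is never matched itself.
            hits.append((m.start(), m.group(0), info))
    return [(found, info) for _, found, info in sorted(hits)]


if __name__ == "__main__":
    docs = [
        "I live in the United States.",
        "The U.S. is their home country.",
        "They chanted: USA! USA!",
    ]
    totals = Counter()
    for doc_id, doc in enumerate(docs, start=1):
        matches = info_field_matches(doc)
        totals.update(info for _, info in matches)
        print(doc_id, matches)
    print(dict(totals))  # matches aggregated under the shared canonical form
```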

In software versions that support using the information field, when both rules are processed with the input text, the two matches will be aggregated in the terms list in the parsing node under the lemma “usa” and will contain the role of the concept name, in this case usAlias. The lemma is displayed in the Canonical Form column in Figure 6.4.

An example from SAS Visual Text Analytics is provided in Figure 6.5. To replicate this result, create the concept usAlias, containing the rules above in a Concepts node. Make sure a Parsing node follows in your pipeline. Then run the pipeline, open the Parsing node, and expand the term “usa” with the role usAlias.

Figure 6.5. SAS Visual Text Analytics Output of CLASSIFIER Rules with the Information Field

In Figure 6.5, note that the term “usa” as a proper noun is separate from the term “usa” as the lemma of the usAlias concept. This example illustrates that the term in the information field of the rules above is not being matched in the rules. The information field term or terms do not need to be present in the document for the rules to produce matches and for the matches to be aggregated.

The term or terms in the information field unify the extracted matches like a parent. As the figure illustrates, this observation includes the frequency of the different aliases. This behavior is especially useful with text processing of terms that may have different forms in the text but are not in the dictionary and therefore not automatically grouped together by the software.

6.2.4. Troubleshooting

Even though CLASSIFIER rules in their basic form are relatively simple to write, you may discover that a particular rule is not matching as you expected. Potential causes for this could be one of the pitfalls outlined in section 5.4—namely, general syntax errors, comments, misspelling/mistyping, tokenization mismatch, or filtered matches. In addition, there are also errors that you can check for that are specific to the CLASSIFIER rule type, such as the following:

  • White space
  • Comma use in CLASSIFIER rules
  • Syntax error

In a CLASSIFIER rule type, white space is reduced to a separator for a list of elements and not counted as an element itself. You cannot specify, for example, that you want to match the tokens “blue,” space character, space character, and “dinosaur” in sequence. For that type of specific character matching, you need to use the REGEX rule type. However, if you want to match two adjacent tokens in text, you can eliminate the white space between the elements.

For example, if you want to match the string “Go!” and you want to match these two tokens side-by-side, then the rule will work the same way if you define it as either of these rules:

CLASSIFIER:Go!

CLASSIFIER:Go !

Remember that a comma signifies an advanced use (information field) of the CLASSIFIER rule type. So, a CLASSIFIER rule containing a comma will not match a comma in the text. To match a comma in the text, replace the comma in the CLASSIFIER rule with “\c” instead. The other character that must be escaped to match literally is the hash “#,” because it acts as a comment marker. Escape the hash with a backslash when you want to match it, like so: \#.

In addition to checking for common syntax errors that are possible with any rule type, if you are writing the advanced CLASSIFIER rules, then check for proper use of square brackets, colons, and the equal sign.

6.2.5. Best Practices

The best time to use CLASSIFIER rules is when you have a list of tokens or token sequences that you either want to extract or want to use as context for extraction. You cannot use CLASSIFIER rules to reference elements other than tokens, such as part-of-speech (POS) tags, other concepts, or regular expressions.

The benefit of using CLASSIFIER rules is that they are relatively low-cost computationally and simple to generate from lists. However, because they can result in many individual rules, perhaps thousands, they can be difficult to maintain. One way to improve maintainability is to group smaller sets of CLASSIFIER rules into concepts that can then be referenced by other rules but are still short enough to review for comprehensiveness and to troubleshoot for errors. For an example, see section 6.3.1. The CLASSIFIER rule type is useful for beginners and for the fundamental rules of a project, but be careful not to over-rely on it when a smaller set of patterns would be easier to maintain.

Always test that each rule matches as you would expect. Be especially careful with the advanced uses of CLASSIFIER rules. In addition, consult your product-specific documentation before using the information field, to confirm that the behavior you need is supported.

6.2.6. Summary

Requirements for a CLASSIFIER include the following:

  • A rule type declaration in all-caps and followed by a colon
  • A token or sequence of tokens to match literally (a comma or a hash character must be escaped with backslash)

Allowed options for the rule type include the following:

  • Comments using the “#” modifier
  • Coreference command
  • Information field (which is set off by a comma from the string to be extracted)

6.3. CONCEPT Rule Type

When a CLASSIFIER rule cannot do everything that you need, consider the use of a CONCEPT rule instead. CONCEPT rules return the entire found text as the extracted match just as CLASSIFIER rules do. The rule can include tokens, punctuation, references to POS tags, and references to other concepts, as well as special elements used to identify any token (_w), capitalized word (_cap), or modifiers (listed in Table 5.2).

6.3.1. Basic Use

The basic syntax is one element (from the ones listed in Table 5.1), such as a string, POS tag, or concept name, following the rule type declaration and colon. Regular expressions in the rule definition are not allowed for this rule type.

CONCEPT:element

Referencing Other Concepts

The most basic use of a CONCEPT rule type is to refer to another concept, pulling the matches from that other concept into the one containing the CONCEPT rule type. For example, you can combine two lists of strings by referencing two other concepts that each contain lists of CLASSIFIER rules, without repeating all the possible combinations.

Consider a concept named targetCity containing these rules:

CONCEPT:capitalCity
CONCEPT:companyCity

The first rule references the concept named capitalCity, which contains a series of classifier rules defining matches for capital cities in the United States:

CLASSIFIER:Nashville
CLASSIFIER:Raleigh
CLASSIFIER:Springfield

The second one references the concept named companyCity, which contains a series of classifier rules defining the set of cities where your company has offices:

CLASSIFIER:Memphis
CLASSIFIER:Charlotte

There are many reasons for keeping two or more different lists of city names, as in this example. Some reasons may be for organizational purposes or for ease of maintenance. For example, having separate lists for each state or each country of interest will provide shorter lists. Redundancy does not matter much, and the flexibility gained by having separate lists will offset the drawback of having the same item appear in multiple lists. In addition, different lists may come from different sources or represent different subcategories of a larger category.

Assume here that your marketing department is building a model to use for finding mentions of particular cities involved in a promotional offer. They leverage the two lists already available to create the concept targetCity.

Consider the following input text document:

Best Health Systems Inc is headquartered in Nashville, TN with local offices in Memphis, TN, Raleigh, NC, Charlotte, NC, New York, NY and Springfield, IL.

Pause and think: Assuming the model above and settings that allow for the examination of the matches from all three concepts (capitalCity, companyCity, and targetCity), can you predict the output for the document above?

Assuming the model and input document above, as well as the “all matches” algorithm, the matches are listed in Figure 6.6.

Figure 6.6. Extracted Matches for the targetCity, companyCity, and capitalCity Concepts

Doc ID   Concept       Match Text
1        targetCity    Nashville
1        capitalCity   Nashville
1        companyCity   Memphis
1        targetCity    Memphis
1        capitalCity   Raleigh
1        targetCity    Raleigh
1        companyCity   Charlotte
1        targetCity    Charlotte
1        capitalCity   Springfield
1        targetCity    Springfield

There are no matches for “New York,” because there are no rule definitions for that string. All of the matches in the output represent pairs of matches for each of the defined strings: Each pair contains one match for the concept, with the CLASSIFIER rule containing that string, and a second match for the concept with the CONCEPT rule. As you may remember from section 1.4, some concepts can be marked as helper concepts in some products so that they do not contribute to the final result set directly, but only through other concepts that reference them. Using this approach and designating the capitalCity and companyCity concepts as helper concepts can eliminate one of the sets of duplicate matches shown in Figure 6.6. To learn more about the role that helper concepts play in IE models, see section 13.3.2.

Because a concept name can contain letters, numbers, and underscores, and therefore can look like a regular word token, it is important to name concepts using strings that would not be encountered in the text. In this way, you can avoid inadvertently matching the name as a string literal. To illustrate, consider the following example project. Despite the suggested best practice, one concept is named “protein,” containing the following rules:

CLASSIFIER:keratin

CLASSIFIER:collagen

Another concept is named macroMolecule and contains rules defining types of macromolecules:

CONCEPT:protein

CONCEPT:lipid

CONCEPT:nucleic acid

Now consider the following input sentence:

Collagen works in conjunction with another important protein, keratin.

Pause and think: Taking into consideration the concepts and input document above, can you predict the output?

The rules in the protein concept will return matches for “collagen” and “keratin,” which are expected. These two matches will also be returned to the macroMolecule concept, which is expected as well. However, what may be unexpected is that a match for the string “protein” is also returned to the macroMolecule concept. If you had intended to reference only the concept named “protein,” not the literal string “protein,” then the resulting match may be surprising. To avoid this situation, always name concepts as different from string literals you may find in the data.

Referencing POS Tags

In the CONCEPT rule type, you can also write grammatical rules by using POS tags and special tags such as “:sep,” “:digit,” and “:time.” These tags are all preceded by a colon and are case-sensitive:

CONCEPT::A
CONCEPT::N

The first rule will match any adjective in the input document, whereas the second rule will match any noun. Note that because the rule type declaration ends in a colon and the POS tag begins with one, there are two colons next to each other when a POS tag or special tag is the first element in a rule.

The list of POS and special tags that can be used in CONCEPT rules is available in your product documentation. The tags may be different in different versions of the software, so you should always consult the documentation for the appropriate version.

In addition, exercise caution when using POS tags, because it is possible that the tag you think a particular word may have is not the same as the tag assigned to that word by the software in the context in which it appears. Always test your expectations with a small sample of text. Keep in mind that the same word can have different POS assignments (if that is a possibility for your language) in different contexts. In your tests, ensure that the grammatical structure of your sample text is parallel to the structure of the data that you want to process.

One example in which POS tags are useful is if your goal is to extract all proper nouns from a text as part of data exploration. For this purpose, you could have a concept named properNoun, containing the following rule:

CONCEPT::PN

Consider the following input document:

The company Best Health Systems Inc is headquartered in Nashville, TN.

Pause and think: Assuming the rule and input document above, can you predict the output?

Assuming the rule and input document above, the matches are outlined in Figure 6.7.

Figure 6.7. Extracted Matches for the properNoun Concept

Doc ID   Concept      Match Text
1        properNoun   Best
1        properNoun   Health
1        properNoun   Systems
1        properNoun   Inc
1        properNoun   Nashville
1        properNoun   TN

The next exploratory step may be to write a rule containing a sequence of several POS tags. Using sequences of elements in CONCEPT rules is discussed in sections 6.3.2 and 6.3.3.

Referencing Special Elements and Modifiers

In CONCEPT rules, you can also use special elements, such as _w and _cap, as well as modifiers such as the expansion symbol @. See the example rule here:

CONCEPT:_w

This rule extracts every token in a corpus. It is useful for creating a unigram model.

Tip: Although _w is called a “word symbol,” it actually represents any token in the text. This means that it will match a single punctuation mark, as well as any word.

The expansion modifiers, when used after a lemma, allow matches of inflectional variants in the same POS class of a particular word, on the basis of the variants in the underlying dictionary. To review what a lemma is, consult section 1.5.3. Here is one example:

CONCEPT:part@N

You can read this rule as follows: Expand the matches to any entries in the underlying dictionary that stem to the word “part” and have the POS “noun.” This rule will match any instances in the text of the words “part” or “parts,” because each of these variants is listed in the dictionary as a singular and plural noun, respectively. However, be cautious in interpreting this rule. It does not mean that the words with the POS tag of “:N” will be located, but only the strings “part” and “parts.” Each of these strings may also be tagged as verbs in the input text. In that case, the rule above will match them as well.

Another example of a rule containing the expansion symbol is as follows:

CONCEPT:part@V

This rule will match any instances in the text of the words “part,” “parts,” “parted,” or “parting,” no matter which role the words are actually playing in the document itself. The reason for this behavior is that the rule is expanded in the background during compilation of the model based on the dictionary, but the POS tag for the word in the text is not known at that time. The run-time processing of data has not begun.

Tip: When you are using the expansion modifiers, such as @N, @V, and @A, the variants are generated in accordance with the POS tags in the underlying dictionary. Any of the variants are then matched as strings in the text, regardless of the role that the words are playing in the input text.
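The two-step behavior described above (expansion at compile time, plain string matching at run time) can be sketched in Python as follows. The dictionary here is a toy stand-in with only the forms of “part” mentioned in this section, and the function names are hypothetical; the real dictionary and expansion logic belong to the software. Returning only the lemma for an unknown entry is an assumption of this sketch.

```python
import re

# Toy stand-in for the underlying dictionary: (lemma, POS) -> inflected forms.
TOY_DICTIONARY = {
    ("part", "N"): ["part", "parts"],
    ("part", "V"): ["part", "parts", "parted", "parting"],
}


def expand(lemma, pos):
    """Compile-time step: look up the variants listed for the lemma and POS."""
    # Assumption of this sketch: an unknown entry expands to the lemma alone.
    return TOY_DICTIONARY.get((lemma, pos), [lemma])


def match_expanded(text, lemma, pos):
    """Run-time step: match the variants as strings, ignoring POS in the text."""
    variants = sorted(expand(lemma, pos), key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, variants)) + r")\b", re.IGNORECASE
    )
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    # "part" is used as a verb here, yet the @N expansion still matches it.
    print(match_expanded("They part ways before the parts arrive.", "part", "N"))
    # "parted" is not a noun form, so only the @V expansion finds it.
    print(match_expanded("She parted the curtains.", "part", "N"))
    print(match_expanded("She parted the curtains.", "part", "V"))
```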

6.3.2. Advanced Use: Combination of Various Elements

A more complex example of a CONCEPT type of rule combines various elements to extract longer matches. This rule type is useful for matching text using patterned sequences of elements. For example, you can capture “an important decision” and “a valuable resource” with a sequence of the POS tags: determiner adjective noun. The syntax is several elements separated by a space, where each element may be any item in Table 5.1 except regular expression.

CONCEPT:element1 element2 … elementN

One example is to combine references to concept names with strings and punctuation. For example, if you know that a city name that you want to extract will always be followed by a comma and a string signifying a U.S. state, then you can write the following rule in a concept named modelCity:

CONCEPT:capitalCity, usState

In this example, the concept named usState contains the following rules:

CLASSIFIER:TN
CLASSIFIER:NC

As before, the concept named capitalCity contains the following rules:

CLASSIFIER:Nashville

CLASSIFIER:Raleigh

CLASSIFIER:Springfield

Consider the following input document.

Best Health Systems Inc is headquartered in Nashville, TN with local offices in Memphis, TN, Raleigh, NC, Charlotte, NC, New York, NY and Springfield, IL.

Pause and think: Taking into consideration the concepts and input document above, can you predict the output if the “all matches” algorithm is specified?

The matches for the concepts and input document above are presented in Figure 6.8.

Figure 6.8. Extracted Matches for the capitalCity, modelCity, and usState Concepts

Doc ID   Concept       Match Text
1        capitalCity   Nashville
1        modelCity     Nashville, TN
1        usState       TN
1        usState       TN
1        capitalCity   Raleigh
1        modelCity     Raleigh, NC
1        usState       NC
1        usState       NC
1        capitalCity   Springfield

The other cities in the input document do not match, because no rules have been written to capture those city names.

6.3.3. Advanced Use: Combination of Elements and Modifiers

Consider the following rule, which combines a concept rule reference and POS tags, as well as the expansion modifier @:

CONCEPT:partName be@ :V

You can read this rule this way: When a match from the partName concept is followed by any form of “be” that is in the dictionary and then by any token with the POS tag of “verb,” extract the entire match to the concept where this rule appears. This concept could be named, for example, reportedIssue. The concept referenced in the rule definition, partName, includes the following rules:

CLASSIFIER:damper frame
CLASSIFIER:damper wheel
CLASSIFIER:thumb piece
CLASSIFIER:rubber foot
CLASSIFIER:dashboard
CLASSIFIER:coil spring

Some input documents that this very simple model could be applied to are as follows:

1. Some of the issues that I noticed are that the damper frame is twisted, the damper wheel was installed wrong, and the thumb piece was blown.

2. The rubber foot is broken.

3. I saw that the dashboard is fractured and the coil spring was worn.

4. The thumb pieces were broken.

5. The dashboard must have been cracked previously.

Pause and think: Assuming the model above, can you predict the matches for only the reportedIssue concept with the documents above?

Assuming the partName concept is marked as a helper concept, the rule matches for the reportedIssue concept with the input documents above are in Figure 6.9.

Figure 6.9. Extracted Matches for the reportedIssue Concept

Doc ID   Concept         Match Text
1        reportedIssue   damper frame is twisted
1        reportedIssue   damper wheel was installed
1        reportedIssue   thumb piece was blown
2        reportedIssue   rubber foot is broken
3        reportedIssue   dashboard is fractured
3        reportedIssue   coil spring was worn

Note that there are no matches for the fourth input document, because the concept partName did not return a match for “thumb pieces.” The rule in the partName concept was written to match the string “thumb piece,” and the document contains “thumb pieces.” To extract this additional match, you can add a CLASSIFIER rule for the plural string or replace the existing rule with the following CONCEPT rule, which uses the expansion symbol:

CONCEPT:thumb piece@

There are also no matches for the fifth input document because the string “must have been” does not match “be@” in the rule. In this example, the grammatical structure of the sentence did not match the grammatical structure of the rule in the reportedIssue concept.
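To see why the fourth and fifth documents produce no matches, it can help to walk through the rule as a three-step sequence check. The Python sketch below is an illustration only: the POS tags are assigned by hand for the example (the software's tagger may assign different tags), the list of forms of “be” is a stand-in for the dictionary expansion, and all names are hypothetical.

```python
# Literal token sequences from the partName concept.
PART_NAMES = [
    ("damper", "frame"), ("damper", "wheel"), ("thumb", "piece"),
    ("rubber", "foot"), ("dashboard",), ("coil", "spring"),
]
# Stand-in for the dictionary expansion of be@.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def reported_issues(tagged):
    """Emulate CONCEPT:partName be@ :V over a list of (token, POS) pairs."""
    tokens = [token.lower() for token, _ in tagged]
    matches = []
    for i in range(len(tokens)):
        for name in PART_NAMES:
            end = i + len(name)
            if tuple(tokens[i:end]) != name:
                continue  # step 1 failed: no partName match starts here
            if (end + 1 < len(tokens)
                    and tokens[end] in BE_FORMS        # step 2: a form of "be"
                    and tagged[end + 1][1] == "V"):    # step 3: token tagged :V
                matches.append(" ".join(t for t, _ in tagged[i:end + 2]))
    return matches


if __name__ == "__main__":
    doc2 = [("The", "DET"), ("rubber", "N"), ("foot", "N"),
            ("is", "V"), ("broken", "V"), (".", "PUNCT")]
    doc4 = [("The", "DET"), ("thumb", "N"), ("pieces", "N"),
            ("were", "V"), ("broken", "V"), (".", "PUNCT")]
    doc5 = [("The", "DET"), ("dashboard", "N"), ("must", "V"), ("have", "V"),
            ("been", "V"), ("cracked", "V"), ("previously", "ADV"), (".", "PUNCT")]
    for doc in (doc2, doc4, doc5):
        print(reported_issues(doc))
```

In this sketch, document 4 fails at the first step (“pieces” is not “piece”), and document 5 fails at the second step (“must” is not a form of “be”).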

6.3.4. Troubleshooting

If you discover that a particular rule is not matching as you expected, potential causes for this could be one of the pitfalls outlined in section 5.4—namely, general syntax errors, comments, misspelling/mistyping, tokenization mismatch, or filtered matches. In addition, there are also errors that you can check for that are specific to the CONCEPT rule type, such as the following:

  • White space
  • Misspelling/mistyping
  • Tagging mismatch
  • Expansion mismatch
  • Concept references
  • Predefined concept references
  • Cyclic dependencies

White space is reduced in a CONCEPT rule to a separator for a list of tokens and not counted as a token itself. You cannot specify, for example, that you want to match the tokens “blue,” space character, space character, and “dinosaur” in sequence in this rule type. For doing that type of specific sequence matching, you need to use the REGEX rule type.

Misspelling can occur either in the rule or in the text. For the CONCEPT rule type, beware of mistyping concept names, because concept names are case-sensitive.

It is possible that the POS tag you expect a particular word to have is not the tag assigned to that word by the software in that particular context. The best way to prevent this error is to test your expectations with targeted examples in context, before applying the rule to a sample of documents that is like the data you will process with the model. Also, be aware that even the best natural language processing system will make errors in POS tagging, even in perfectly grammatical text. Take that error rate into account as you design your model.

In addition, it is possible that the POS tag is misspelled or does not exist. Different languages, versions, and products may use different POS tags. Consult your product documentation for lists of acceptable tags for rule-building. The spelling and case of the tags in the rules must be exactly as documented. Because writing a rule with a nonexistent tag like “:abc” is not a syntax error, but a logical error, the syntax checking protocols will not catch it as an error, but there will not be any matches.

Another potential error when you are writing rules that contain a POS tag is forgetting to include the colon before specifying the tag. Without the colon, the system considers the rule to refer to a concept by that name or a string match, which may produce unexpected or no results. Syntax checking protocols will not return an error in this case.

When using the expansion symbols (e.g., @, @N, @V, @A), note that the expansion includes only related dictionary forms, not any misspellings that may have been identified by the misspelling algorithm or other variants associated with that lemma through use of a synonym list. To review what a lemma is, consult chapter 1. Also, remember that the forms of the words are looked up before processing, and when matching happens, the associated POS assignment of the word in the text is not considered. You can work around this issue, if you want to, using a CONCEPT_RULE type of rule; see chapter 7 for more information. Examining your output from rules that contain expansion symbols is recommended.

Note that if the word before the @ is looked up and not found in the dictionary, then the word is treated as an unknown word, and only that specific string up to the @ sign will be matched without variants. No error is generated in this situation. Another common type of error is accidentally adding an @ modifier to a CLASSIFIER rule without changing the rule type to CONCEPT. Because the CLASSIFIER rule will treat the characters literally, you will likely see no matches to that rule.
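The expansion behavior described above can be approximated with a short Python sketch. This is only an illustration, not the product's implementation: the dictionary `TOY_DICTIONARY`, the functions `expand` and `find_matches`, and the whitespace tokenization are all assumptions made for the example, and the part-of-speech variants (@N, @V, @A) are not modeled.

```python
# Toy dictionary of lemma -> inflected forms. This table is hypothetical;
# the real product consults a full language dictionary.
TOY_DICTIONARY = {
    "piece": {"piece", "pieces"},
    "break": {"break", "breaks", "broke", "broken", "breaking"},
}


def expand(element):
    """Return the set of surface forms that one rule element may match."""
    if not element.endswith("@"):
        return {element}  # plain literal token
    word = element[:-1]
    # A word that is not in the dictionary is matched only as typed.
    return TOY_DICTIONARY.get(word, {word})


def find_matches(rule_body, text):
    """Return every token span in text that satisfies all elements in order."""
    elements = [expand(e) for e in rule_body.split()]
    tokens = text.split()  # simplification: split on white space only
    width = len(elements)
    return [
        " ".join(tokens[i:i + width])
        for i in range(len(tokens) - width + 1)
        if all(tokens[i + j] in elements[j] for j in range(width))
    ]


text = "both thumb pieces were blown"
print(find_matches("thumb piece", text))   # literal tokens only
print(find_matches("thumb piece@", text))  # expanded through the dictionary
print(find_matches("thumb peice@", text))  # mistyped word: no variants added
```

In this sketch, the mistyped element "peice@" produces no error and simply falls back to the literal string, which mirrors why this kind of mistake is easy to overlook.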

Referencing concepts by name without ensuring that you have used the correct name, including both case and spelling accuracy, can also reduce the number of expected matches. If you reference predefined concepts, be sure that they are loaded into your project, and always check the names, because they may be different across different products.

Any rule that can reference a concept and returns matches (e.g., not REMOVE_ITEM or NO_BREAK) has the capacity to participate in a cyclic dependency error. A cyclic dependency is when two or more concepts refer to each other in a circle of concept references. For example, if the concept myConceptA has rules that reference myConceptB, and myConceptB has rules that reference myConceptA, then there is a cycle of references between them. This type of error will prevent your whole project from compiling. This error is another reason to test your project often as you are adding concepts and rules. In this way, you will know that the latest rules added to the model created the cyclic dependency. Another strategy to use to avoid this error is careful design for your taxonomy and model. Refer to chapter 13 to learn more about taxonomy design best practices.
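If you keep a list of which concepts each concept references, you can check for cycles yourself before compiling. The following Python sketch is one possible way to do that with a depth-first search. The function name `find_cycle` and the dictionary layout are assumptions made for this example; they are not part of any SAS product interface.

```python
def find_cycle(references):
    """Return one cycle of concept names as a list, or None if there is none.

    `references` maps each concept name to the names its rules reference.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, in progress, finished
    color = {name: WHITE for name in references}
    path = []

    def visit(name):
        color[name] = GRAY
        path.append(name)
        for target in references.get(name, ()):
            state = color.get(target, WHITE)
            if state == GRAY:
                # The target is already on the current path: a cycle.
                return path[path.index(target):] + [target]
            if state == WHITE:
                cycle = visit(target)
                if cycle:
                    return cycle
        path.pop()
        color[name] = BLACK
        return None

    for name in list(references):
        if color[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


cyclic = {"myConceptA": ["myConceptB"], "myConceptB": ["myConceptA"]}
print(find_cycle(cyclic))  # ['myConceptA', 'myConceptB', 'myConceptA']
```

Running a check like this each time you add a concept reference makes it easier to see which new reference closed the circle.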

6.3.5. Best Practices

Use a CONCEPT rule when a CLASSIFIER rule does not have enough power to capture the types of patterns and combinations that you need to model in your rule. A CONCEPT rule is the best choice when you need access to elements other than the literal token, or series of literal tokens, and still want to extract the found span of text in its entirety.

The benefits of using a CONCEPT rule include a powerful ability to match both literal tokens and a variety of other pattern types, such as a series of POS tags or intervening tokens. The syntax of the rule is otherwise very straightforward because each element of the rule must be found in the text in order to match. The rule type should be used frequently in most models. However, do not use a CONCEPT rule when you are extracting only a literal token, or set of literal tokens, because using many CONCEPT rules instead of CLASSIFIER rules for this purpose will have a negative impact on the performance of your model.

Test your rules frequently while building to ensure that each rule is doing what you expect. In particular, it is easy to misspell or mistype concept names and tag names, so be sure your rule is working on a snippet of data before testing a set of documents. Also, do not expect POS tagging to be 100% accurate. Instead, use testing to determine how accurate your rule needs to be, and either swap out your use of POS tag with another method after using the tag for exploration, or build into your rules compensation for tagging errors.

Coreference symbols may be used in CONCEPT rules but are not recommended. Instead, for better syntax checking and consistency of using the _c{} extraction label, use a C_CONCEPT rule when writing rules with coreference. See section 6.4 for more information on the C_CONCEPT rule type.

When naming your concepts, be sure to follow all the naming conventions introduced in section 5.3.1 (or create your own standard guidelines while still adhering to the name requirements). Keep your model logical, with clear and well-designed names, to save time when testing, troubleshooting, and maintaining your models.

6.3.6. Summary

Requirements for a CONCEPT rule include the following:

  • A rule type declaration in all caps and followed by a colon
  • One or more elements

Types of elements allowed include the following:

  • A token or sequence of tokens to match literally (“#” character must still be escaped for a literal match to occur)
  • A reference to another concept name or a series of concept names, including predefined concepts
  • A POS or special tag preceded by a colon
  • A word symbol (_w), representing any single token
  • A cap symbol (_cap), representing any capitalized word

Allowed options for the rule type include the following:

  • Comments using the “#” modifier
  • Morphological expansion symbols, including @, @N, @V, and @A

6.4. C_CONCEPT Rule Type

The benefits of the C_CONCEPT rule type include the ability to control the portion of the found text that is extracted as a match and returned as output, allowing for fine-grained specification of the structured data that your rules will create. Also, coreference matching is available and compatible with C_CONCEPT rules.

In the two rule types described in the previous sections, the found text was the same as the extracted match. However, sometimes you need to specify context that determines whether the text should be extracted, especially when the match itself is ambiguous. In this case, the found text is not the same as the extracted match, and you should use the C_CONCEPT rule type.

6.4.1. Basic Use

The basic syntax is as follows:

C_CONCEPT:_c{element} element
C_CONCEPT:element _c{element}

Note that only the part between the curly braces is the extracted match. You can also use additional elements on either side of the braces and use multiple elements inside the braces.

For example, the following rules use strings of text to extract the various contexts in which “Congo” is referring to a country name as opposed to the name of a river:

C_CONCEPT:republic of the _c{Congo}
C_CONCEPT:_c{Congo}, republic of the
C_CONCEPT:the _c{Congo} republic
C_CONCEPT:west _c{Congo}
C_CONCEPT:former French _c{Congo}

In each of the rules above, only the string “Congo” is extracted as a match, but only if preceded or followed by the specified strings, which disambiguate the string as a country name.

Consider the following input documents:

1. Africa :: CONGO, REPUBLIC OF THE. Flag Description . . .

2. The Democratic Republic of the Congo (DRC) is located in central sub-Saharan Africa . . .

3. The Republic of the Congo (French: République du Congo), also known as the Congo Republic, West Congo, the former French Congo . . .

Pause and think: Assuming that the rules above are in a concept named africanCountry, can you predict the matches from the input documents above?

The matches for the rules and input documents above are in Figure 6.10.

Figure 6.10. Extracted Matches for the africanCountry Concept

Doc ID   Concept          Match Text
1        africanCountry   CONGO
2        africanCountry   Congo
3        africanCountry   Congo
3        africanCountry   Congo
3        africanCountry   Congo
3        africanCountry   Congo

Note that, although the last four matches come from the same document, each match is from a different context: the first one from “republic of the Congo,” the second from “the Congo republic,” the third one from “West Congo,” and the fourth one from “former French Congo.”
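To make the distinction between found text and extracted match concrete, the five rules can be imitated with regular expressions in Python. This is a rough sketch under stated assumptions, not how the LITI engine works: the named group `c` stands in for the _c{} extraction label, the names `RULES` and `extract` are invented for the example, and case-insensitive matching is chosen here because Figure 6.10 shows a match for the all-caps text in document 1.

```python
import re

# Each pattern mirrors one C_CONCEPT rule: the named group "c" plays the
# role of the _c{} extraction label, and the rest is required context.
RULES = [
    r"\brepublic of the (?P<c>Congo)\b",
    r"\b(?P<c>Congo), republic of the\b",
    r"\bthe (?P<c>Congo) republic\b",
    r"\bwest (?P<c>Congo)\b",
    r"\bformer French (?P<c>Congo)\b",
]
# Assumption: matching ignores case (see the lead-in above).
COMPILED = [re.compile(p, re.IGNORECASE) for p in RULES]

DOCS = [
    "Africa :: CONGO, REPUBLIC OF THE. Flag Description . . .",
    "The Democratic Republic of the Congo (DRC) is located in central "
    "sub-Saharan Africa . . .",
    "The Republic of the Congo (French: République du Congo), also known "
    "as the Congo Republic, West Congo, the former French Congo . . .",
]


def extract(text):
    """Return the extracted strings in document order, one per distinct span."""
    spans = set()
    for pattern in COMPILED:
        for m in pattern.finditer(text):
            spans.add(m.span("c"))  # keep only the labeled part
    return [text[start:end] for start, end in sorted(spans)]


for doc_id, doc in enumerate(DOCS, start=1):
    print(doc_id, extract(doc))
```

In the third document, the occurrence in “République du Congo” is not returned because no rule supplies that context, so only four of the five occurrences are extracted.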

C_CONCEPT rules can also include other types of elements, such as POS tags, special symbols such as _w or _cap, or names of other concepts. For example, in a project requiring extraction of adjectives that appear in front of the names of certain famous hotels, the rule below takes into consideration the context of the hotel name being mentioned, but extracts only the adjectives as matches. In this example, the rule is in a concept named hotelDescriptor:

C_CONCEPT:_c{:A} hotelName

The hotelName concept, referenced here, contains CLASSIFIER rules that match certain hotel names. Because only the adjectives need to be extracted, the hotelName is a helper concept.

CLASSIFIER:Four Seasons hotel

CLASSIFIER:Taj Mahal Palace hotel

CLASSIFIER:Plaza Hotel

Consider the following input documents:

1. The renowned Four Seasons hotel . . .

2. The famous Taj Mahal Palace hotel . . .

3. The owners of the famed Plaza Hotel . . .

Pause and think: Assuming that the rule above is in a concept named hotelDescriptor, can you predict the matches for the input documents above?

The matches for the hotelDescriptor concept with the input documents above are in Figure 6.11.

Figure 6.11. Extracted Matches for the hotelDescriptor Concept

Doc ID   Concept           Match Text
1        hotelDescriptor   renowned
2        hotelDescriptor   famous
3        hotelDescriptor   famed

Alternatively, the rule above could be rewritten to extract only the names of the hotels if the project required their extraction after certain adjectives. The list of adjectives could be specified with CLASSIFIER rules in one or more concepts named, for example, positiveAdjective or negativeAdjective. One rule that could extract only the hotel name when it follows an adjective in the positiveAdjective concept could be written as follows:

C_CONCEPT:positiveAdjective _c{hotelName}

6.4.2. Advanced Use: Multiple Strings as Matches

In all the examples in the previous section, the match that was returned corresponded to one element in the rule definition. Advanced C_CONCEPT rule uses can return multiple elements in the match string and use more than two elements for specifying the context.

For example, in a project containing U.S. customers’ addresses, a company may want to extract only the city and state of the address. The project contains a concept named firstLineAddress, with rules that define the first line of the address, such as “123 Main Str.” or “4004 Oak Blvd Ste #300.” Some of the rules in this concept are included here:

CONCEPT:anyDigit _cap Str.

CONCEPT:anyDigit _cap St poundDigit

CONCEPT:anyDigit _cap Ave NE

CONCEPT:anyDigit _cap Ave SW
CONCEPT:anyDigit _cap _cap Ste poundDigit

CONCEPT:anyDigit _cap _cap Dr.

The rules above leverage the anyDigit and poundDigit helper concepts. You can read more about helper concepts in section 15.2. The anyDigit concept contains a REGEX rule that captures one or more adjacent digits. You can learn more about REGEX rules in chapter 10.

REGEX:[0-9]+

The poundDigit helper concept contains a REGEX rule that matches one or more digits following the pound sign:

REGEX:#[0-9]+

In addition, the model contains the helper concept customerCity, which includes CLASSIFIER rules of city names.

CLASSIFIER:Lansing

CLASSIFIER:Boston
CLASSIFIER:Rockford
CLASSIFIER:Cary

The helper concept customerState contains CLASSIFIER rules of two-letter state codes:

CLASSIFIER:MI
CLASSIFIER:MA
CLASSIFIER:NC

The concept customerCityState has a C_CONCEPT rule that extracts the city and state of the customer’s address on the basis of expectations that the match to the firstLineAddress will be followed by a comma (modeled as any punctuation by use of the “:sep” tag), and then the match to the customerCity concept, another comma, and finally a match to the customerState concept:

C_CONCEPT:firstLineAddress :sep _c{customerCity :sep customerState}

Consider the following input documents:

1. 11300 Center Str., Lansing, MI 48906

2. 256 E St #1, Boston, MA 02127

3. 8200 Peachtree Ave NE, Rockford, MI 49341

4. 100 SAS Campus Dr., Cary NC 27513

Pause and think: Assuming the model described above, can you predict the matches for the customerCityState concept on the basis of the input documents above?

The matches for the customerCityState concept are in Figure 6.12.

Figure 6.12. Extracted Matches for the customerCityState Concept

Doc ID   Concept             Match Text
1        customerCityState   Lansing, MI
2        customerCityState   Boston, MA
3        customerCityState   Rockford, MI

Note that there is no match for the fourth document because there is no comma between the city and state. In all the other cases, the comma is present and extracted as part of the match because in the rule definition, the second “:sep” tag is inside the curly braces.
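The same outcome can be reproduced with a small Python sketch that imitates the helper concepts and the C_CONCEPT rule using regular expressions. It is an approximation under several assumptions: the “:sep” tag is modeled as a literal comma, the _cap symbol as a word that starts with an uppercase letter, and all names (`FIRST_LINE`, `RULE`, `extract_city_state`) are invented for the example.

```python
import re

DIGITS = r"[0-9]+"        # stands in for the anyDigit helper concept
POUND = r"#[0-9]+"        # stands in for the poundDigit helper concept
CAP = r"[A-Z][A-Za-z]*"   # rough stand-in for the _cap symbol

# One alternative per CONCEPT rule in firstLineAddress.
FIRST_LINE = "(?:" + "|".join([
    rf"{DIGITS} {CAP} Str\.",
    rf"{DIGITS} {CAP} St {POUND}",
    rf"{DIGITS} {CAP} Ave NE",
    rf"{DIGITS} {CAP} Ave SW",
    rf"{DIGITS} {CAP} {CAP} Ste {POUND}",
    rf"{DIGITS} {CAP} {CAP} Dr\.",
]) + ")"
CITY = r"(?:Lansing|Boston|Rockford|Cary)"   # customerCity
STATE = r"(?:MI|MA|NC)"                      # customerState

# The group "c" plays the role of _c{}; the second comma sits inside it,
# so it is returned as part of the match. Assumption: ":sep" is a comma.
RULE = re.compile(rf"{FIRST_LINE}, (?P<c>{CITY}, {STATE})\b")


def extract_city_state(text):
    """Return the extracted city and state, or None when the rule fails."""
    m = RULE.search(text)
    return m.group("c") if m else None


DOCS = [
    "11300 Center Str., Lansing, MI 48906",
    "256 E St #1, Boston, MA 02127",
    "8200 Peachtree Ave NE, Rockford, MI 49341",
    "100 SAS Campus Dr., Cary NC 27513",
]
for doc_id, doc in enumerate(DOCS, start=1):
    print(doc_id, extract_city_state(doc))
```

The fourth document returns no match in this sketch for the same reason as in the model: the element between the city and the state is required, and the text does not contain it.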

6.4.3. Advanced Use: Coreference

The C_CONCEPT rule type can also be used for coreference, which applies to any situation in which you need to tie variant references to a standard reference. For example, coreference often applies to the use of pronouns in language, where the pronouns refer to some other noun in the text. In comparison with the CLASSIFIER rule type approach to coreference, the C_CONCEPT rule type has the benefit of being able to use elements other than strings in the rule definition.

For example, imagine that you are trying to extract drug side effects from medical notes explaining patients’ complaints. You may have rules that extract patients’ reactions, such as “severe pain,” in a statement such as “Patient reported severe pain.” You could do so with a C_CONCEPT rule such as the following one, in a concept named, for example, sideEffect:

C_CONCEPT:patient _c{:V :A :N}

With the example sentence above as input, the rule would return a match for “reported severe pain.” But in some cases, the medical notes may use pronouns to refer to the patient. Several examples follow:

1. The patient stated that she had severe reactions to the medicine.

2. Patient complained that he experienced painful headaches.

Because the patients’ reported reactions do not follow the word “patient,” the rule above would not produce any matches. In these cases, it may be useful to resolve the pronouns “she” and “he” in a rule such as the following one in a concept named patientReport:

C_CONCEPT: _c{patient} :V that _ref{:PRO}

In both documents above, “patient” and the pronouns “she” and “he” would be matches in patientReport. Now, in the sideEffect concept an additional rule can be written that includes cases in which a pronoun is referring to the patient, by referencing the patientReport concept:

C_CONCEPT:patientReport _c{:V :A :N}

Consider the following input documents, adapted from the Vaccine Adverse Event Reporting System 2016 data (https://vaers.hhs.gov/):

1. The patient claimed that she had abdominal pain and vomiting for 3 months after vaccination.

2. On 14 Oct 2016, same day after the vaccination, the patient reported that he has red bumps on both arms (Rash papular).

3. Patient reports that she had excruciating pain in the back of her head.

4. On the same day, the patient complained that she had swelling at the base of her shoulder.

Pause and think: Assuming the model described above, can you predict the matches for the input documents above?

The matches to the sideEffect concept for the input documents above are in Figure 6.13.

Figure 6.13. Extracted Matches for the sideEffect Concept

Doc ID   Concept      Match Text
1        sideEffect   had abdominal pain
2        sideEffect   has red bumps
3        sideEffect   had excruciating pain

Note that in the first three cases, the concept named patientReport is identifying the pronouns “she” and “he” and passing them as matches to the concept sideEffect. However, not all notes about patient side effects are written in the same pattern of verb, followed by adjective and noun, as demonstrated by the fourth input document. In this case, although there would be a match for “patient” and “she” in the concept patientReport, there would be no match for the sideEffect concept. To capture this type of sentence structure, another C_CONCEPT rule could be defined in the sideEffect concept. An example is provided here:

C_CONCEPT:patientReport _c{:V :N}

Because of this rule, the fourth document would also produce a match if the model is rerun.

If the data has even more variability in how symptoms are described, then more rules would be required. This example illustrates that the best time to use the C_CONCEPT rule type for modeling coreference is when the sentence structure is somewhat predictable, without many different variations of patterns. Otherwise, a better choice may be the CONCEPT_RULE type, as described in chapter 7.

6.4.4. Troubleshooting

If you discover that a particular rule is not matching as you expected, potential causes for this could be one of the pitfalls outlined in section 5.4—namely, general syntax errors, comments, misspelling/mistyping, tokenization mismatch, or filtered matches. In addition, there are also errors that you can check for that are specific to the C_CONCEPT rule type, such as the following:

  • White space
  • Syntax errors
  • Missing extraction label
  • Tagging mismatch
  • Expansion mismatch
  • Concept references
  • Predefined concept references
  • Cyclic dependencies

In C_CONCEPT rule types, white space is reduced to a separator for a list of elements and not counted as an element itself. You cannot specify how many white space characters or what type can appear between elements. For specifying white space characters in your match, you need to use a REGEX rule type. In short, white space works the same way in this rule type as in the CLASSIFIER and CONCEPT rule types.

Another common error in C_CONCEPT rules is forgetting the extraction label and its curly braces, or placing the label around the wrong element or elements. A C_CONCEPT rule may contain only a single extraction label, _c{}.

It is possible that the POS tag you think a particular word may have is not the tag assigned to that word by the software in that particular context. The best way to prevent this error is to test your expectations with targeted examples in context, before applying the rule to a sample of documents that is like the data you will process with the model.

In addition, it is possible that the POS tag is misspelled or does not exist. Different languages, versions, and products may use different POS tags. Consult your product documentation for lists of acceptable tags for rule-building. The spelling and case of the tags in the rules must be exactly as documented. Because writing a rule with a nonexistent tag like “:abc” is not a syntax error, but a logical error, the syntax checking protocols will not catch it as an error, but there will not be any of the expected matches.

Another potential error when you are writing rules that contain a POS tag is forgetting to include the colon before specifying the tag. Without the colon, the system considers the rule to refer to a concept by that name or a string match, which may produce unexpected or no results. Syntax checking protocols will not return an error in this case.

When using the expansion symbols (e.g., @, @N, @V, @A), note that the expansion includes only related dictionary forms, not any misspellings that may have been identified by the misspelling algorithm or other variants associated with that lemma through use of a synonym list. To review what a lemma is, consult chapter 1. Also, remember that the forms of the words are looked up before processing, and when matching happens, the associated POS assignment of the word in the text is not considered. You can work around this issue, if you want to, using a CONCEPT_RULE; see chapter 7 for more information. Examining your output from rules that contain expansion symbols is recommended.

Referencing concepts by name without ensuring that you have used the correct name, including both case and spelling accuracy, can also reduce the number of expected matches. If you reference predefined concepts, be sure that they are loaded into your project, and always check the names because they may be different across different products.

Any rule that can reference a concept and returns matches (e.g., not REMOVE_ITEM or NO_BREAK) has the capacity to participate in a cyclic dependency error. A cyclic dependency is when two or more concepts refer to each other in a circle of concept references. For example, if the concept myConceptA has rules that reference myConceptB, and myConceptB has rules that reference myConceptA, then there is a cycle of references between them. This type of error will prevent your whole project from compiling. This error is another reason to test your project often as you are adding concepts and rules. In this way, you will know that the latest rules added to the model created the cyclic dependency. Another strategy to use to avoid this error is careful design for your taxonomy and model. Refer to chapter 13 to learn more about taxonomy design best practices.

6.4.5. Best Practices

The C_CONCEPT rule type should be used when you want context to constrain or trigger a match, but the context itself should not be part of the extracted match. Like the other rule types described in this chapter, this rule type lists each piece of the rule in the order in which it should appear in the text. Therefore, its syntax is as simple as a CONCEPT rule plus the addition of the _c{} extraction label. This rule type is fundamental to most good models, but should be used only after you fully understand the application and purpose of the _c{} extraction label.

Be sure that when you are using this rule, all the elements that you specify appear in order in the targeted text. Ensure this through frequent testing on sample data from the data sources that you will process with your model. The more types of documents you have in your source data, the more complex your model will probably have to be to account for the increased variation. People use language differently, depending on their purposes: writing an email, writing a report, creating a form, preparing a presentation, and so forth. Be aware of these sources of variation in your data, and if the variation is too extreme, consider winnowing down the data you will process with your model, or developing multiple models (perhaps with shared concepts) for different data sources. See section 12.2 for a further discussion of understanding your data.

A key best practice for all rules and rule types is to comment your rules or sections of rules with the intent of the rule, special considerations, decisions, and any other information that will make assessing or editing the rule later more efficient. Because C_CONCEPT rules do not return the whole match, this practice is even more important with this rule type and with the more complex ones that follow this chapter. Commented lines should start with the hash symbol, like so:

# The rule below extracts the city and state of the customer’s
# address when the match to the firstLineAddress is followed by a
# comma and then the match to the customerCity concept, another
# comma and finally a match to the customerState concept.

C_CONCEPT:firstLineAddress :sep _c{customerCity :sep customerState}

As you design your project and your rules, keep the pieces that you identify meaningful and give them useful names; doing so will help you trace through your project later. You will be able to diagnose problems more easily because your assumptions will be clear either through the project design and concept names, through comments, or through both. Also, make concepts only as large as they need to be; smaller concepts with fewer rules are easier to troubleshoot and to understand than very large concepts with many rules.

Tip: Use comments to allow you to assess and edit rules more efficiently.

When you are using coreference symbols, be sure that you are using the correct one. Generally, you will want to use only the _ref{} modifier, because it matches only what you specify in your rule. However, if you have very short documents that stay focused on one topic or person, you may be able to use _F{}, which matches what you specify plus every instance of the coreferences for the rest of the document after your initial match. Least recommended is _P{}, which matches what you specify and any preceding matches.

6.4.6. Summary

Requirements for a C_CONCEPT rule include the following:

  • A rule type declaration in all caps and followed by a colon
  • The extraction label, _c{}, with one or more elements to be extracted specified between the curly braces
  • At least one element outside of the curly braces

Types of elements allowed include the following:

  • A token or sequence of tokens to match (# character must still be escaped for a literal match to occur)
  • A reference to another concept name, including predefined concepts
  • A POS or special tag preceded by a colon
  • A word symbol (_w), representing any single token
  • A cap symbol (_cap), representing any capitalized word

Allowed options for the rule type include the following:

  • Comments using the # modifier
  • Coreference symbols, including _ref{}, _P{}, or _F{}
  • Morphological expansion symbols, including @, @N, @V, and @A