Chapter 5: Fundamentals of Creating Custom Concepts

5.1. Introduction to Custom Concepts

5.2. LITI Rule Fundamentals

5.2.1. Required Parts of LITI Rules

5.2.2. Optional Parts of LITI Rules

5.2.3. Rule Definition

5.3. Custom Concept Fundamentals

5.3.1. Best Practices for Naming Custom Concepts

5.3.2. Best Practices for Referencing Custom Concepts

5.3.3. Concepts versus CONCEPT and CONCEPT_RULE Rule Types

5.3.4. Programmatic Rule Writing and Model Compilation

5.3.5. Programmatic Model Application

5.4. Troubleshooting All Rule Types

5.1. Introduction to Custom Concepts

In chapters 2–4, you learned that, for the purposes of information extraction, you can leverage the predefined concepts. This chapter will introduce you to the fundamentals of custom concepts and writing your own rules. Why might you want to create your own concepts and rules?

Perhaps you have information in your documents that you want to extract but that is not covered by the predefined set of named entities. For example, maybe you want to extract names of medicines, treatment options, vehicle parts, body parts, grocery items, and the like. One way to extract custom information is by relying on automated approaches, such as statistical or machine learning models. Some drawbacks of these models include difficulties in optimization and in explanation of results. Instead, by writing custom IE rules, you can take a deterministic approach with increased control over the quality and interpretability of your results.

After reading this chapter, you will be able to do the following tasks:

  • Recognize the required and optional parts of LITI rules, including elements, modifiers, and punctuation
  • Use best practices for concept naming and referencing
  • Troubleshoot common rule-writing errors for all rule types

5.2. LITI Rule Fundamentals

This chapter focuses on concepts and rules for information extraction using LITI syntax. LITI is an acronym for language interpretation for textual information. It is a proprietary programming language for extracting specific pieces of information or relationships between specific pieces of information from text. SAS documentation already provides the basics of writing LITI rules. Building on that information, this chapter and the next five chapters provide technical details on required and optional elements of each rule type, usage through examples, and information about run-time complexity and computational cost, as well as pitfalls, guidelines, and tips on writing rules with LITI syntax.

The types of rules that you can write in LITI include the following:

  • Concept rule types, detailed in chapters 6 and 7
    • CLASSIFIER
    • CONCEPT
    • C_CONCEPT
    • CONCEPT_RULE
  • Fact rule types, detailed in chapter 8
    • SEQUENCE
    • PREDICATE_RULE
  • Filter rule types, detailed in chapter 9
    • REMOVE_ITEM
    • NO_BREAK
  • REGEX rule type, detailed in chapter 10

The final section of this chapter includes troubleshooting tips that apply to all the rule types. Each subsequent chapter will highlight any troubleshooting tips specific to the rule types presented in that chapter.

In the examples in chapters 6–11, the matching algorithm that is assumed is “all matches,” meaning each rule that defines content found in the text returns a match. Furthermore, the project setting is “case insensitive matching,” unless otherwise noted. For more information on project settings and other matching algorithms and their uses, see chapter 12. Unless otherwise noted, the data used in examples is constructed from the authors’ experiences to resemble real business data.

5.2.1. Required Parts of LITI Rules

As shown in Figure 5.1, each LITI rule has at least three parts:

  • A declaration of the rule type, which is written in ALL CAPS
  • A colon, which is the separator between the rule type declaration and the rule definition
  • A rule definition, which varies according to the rule type

Figure 5.1. Simple LITI Rule Example


In Figure 5.1, the rule type is CLASSIFIER, which is the most basic type of rule in the LITI syntax. The rule definition specifies that the string “Beatles” is extracted when it is found in the text. The rule type declaration and rule definition are separated by a colon.

5.2.2. Optional Parts of LITI Rules

Although all rule types in the LITI syntax include the parts listed above, you can write more complex rules by adding a section called the output declaration. This section sits between two colons, after the rule type and before the rule definition, and specifies how the rule output should appear. The following rule types allow extra information or commands in this position:

  • CLASSIFIER, for the coreference command
  • SEQUENCE and PREDICATE_RULE, for extraction label declaration

The extraction label declaration lists the user-defined extraction labels that will be used in the rule definition. Figure 5.2 shows a more complex example.

Figure 5.2. Complex LITI Rule Example


The rule shown in Figure 5.2 extracts the string “Beatles” as an extracted match for the “bands” extraction label and extracts the string “Ringo” as an extracted match for the “bandmembers” extraction label. The rule returns the two strings, as well as the text between them, as one extracted match.

The extraction labels are enclosed in parentheses and separated from each other by a comma. Similar to what appears in Figure 5.1, where the colon is a separator between the two parts of the rule, in this example two colons separate the three sections: rule type declaration, output declaration, and rule definition. Note that the output declaration section of the rule can include not only an extraction label declaration, but also a concept name declaration (in programmatic rule-writing) or coreference command (in the CLASSIFIER rule type).
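
The three sections described above can be made concrete with a short Python sketch. It is not part of any SAS product: the helper name split_rule and the regular expression are illustrative assumptions, and the PREDICATE_RULE text is reconstructed from the description of Figure 5.2 and the programmatic form in section 5.3.4. The sketch handles only the extraction label declaration of fact rule types, not the coreference command of CLASSIFIER rules.

```python
import re

RULE_TYPES = {
    "CLASSIFIER", "CONCEPT", "C_CONCEPT", "CONCEPT_RULE",
    "SEQUENCE", "PREDICATE_RULE", "REMOVE_ITEM", "NO_BREAK", "REGEX",
}
FACT_TYPES = {"SEQUENCE", "PREDICATE_RULE"}

# Assumed shape of an extraction label declaration, e.g. "(bands, bandmembers):"
LABEL_DECLARATION = re.compile(r"\(([^()]*)\):")


def split_rule(line):
    """Return (rule_type, labels, definition) for one GUI-style rule line."""
    # Only the first colon separates the rule type; later colons (e.g. :N) stay in the definition.
    rule_type, colon, rest = line.partition(":")
    if not colon or rule_type not in RULE_TYPES:
        raise ValueError(f"not a rule type declaration: {rule_type!r}")
    labels = None
    if rule_type in FACT_TYPES:
        found = LABEL_DECLARATION.match(rest)
        if found:
            labels = [label.strip() for label in found.group(1).split(",")]
            rest = rest[found.end():]
    return rule_type, labels, rest


print(split_rule("CLASSIFIER:Beatles"))
print(split_rule('PREDICATE_RULE:(bands, bandmembers):(AND, "_bands{Beatles}", "_bandmembers{Ringo}")'))
```

Because the rule type must be written in all caps, a line such as classifier:Beatles raises an error in this sketch instead of being parsed.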

5.2.3. Rule Definition

Rule definitions may include the following components:

  • Elements
  • Modifiers
  • Punctuation

Rule elements are the essential parts of the rule definition that can stand alone, be modified, and define arguments. They represent some piece of text that may be found in a document. Table 5.1 describes possible rule elements.

Table 5.1. Elements in Rule Definitions

| Elements | Description | Examples | Rule Types |
|---|---|---|---|
| String | One or more literal alphabetic, numeric, or alphanumeric characters without newlines | Rolling Stones; band; \c (for comma); \# | All; comma (in CLASSIFIER) and hash characters must be escaped |
| Concept name | Can be a predefined or custom name; represents a set of rules | nlpPerson; bandMembers | All except CLASSIFIER and REGEX |
| Part-of-speech tag and special tags | A part-of-speech or special tag preceded by a colon; represents the set of all words filling a given role in context | :ADV; :CONJ; :N; :sep (for punctuation) | All except CLASSIFIER and REGEX |
| Word symbol | Represents a single token, including single punctuation marks in some contexts | _w | All except CLASSIFIER and REGEX |
| Cap symbol | Represents any single token that begins with an uppercase letter | _cap | All except CLASSIFIER and REGEX |
| Regular expression | Special expression combining strings and operators in a PERL-like syntax* that represents a span of text | [Bb]and(?:’s)?[ ] music | Only REGEX |

*To learn more about PERL syntax, consult chapter 10, which focuses on regular expressions.

Rule modifiers are used to modify or relate the elements to each other in some way. Table 5.2 provides examples of the contexts in which modifiers are used. The modifiers themselves are shaded gray.

Table 5.2. Modifiers in Rule Definitions

| Modifiers | Description | Examples | Rule Types |
|---|---|---|---|
| Comment character | Marks the remainder of a line as a comment to be ignored in processing; # | # This is a comment. Tip: To match # as a literal, escape it like this: \# | All except REGEX; but you can put # before a REGEX rule |
| Morphological expansion symbol | Add to the end of a string when you want to match inflectional variations; @, @N, @V, @A | go@ = go, going, goes, gone; bottle@N = bottle, bottles | All except CLASSIFIER and REGEX |
| Extraction label | Precede with _c and enclose an element or series of elements in curly braces to mark as the section of the match to extract; _c{} | The rule C_CONCEPT:said _c{_cap _cap} on dayOfWeek can produce the result: Jane Wu | C_CONCEPT, CONCEPT_RULE, REMOVE_ITEM, NO_BREAK |
| User-defined extraction label | Precede with an underscore and any word, and enclose an element or series of elements in curly braces to mark the section of the match to target in fact rules; ties label to extracted match; _name{} | The rule SEQUENCE:(name, day):said _name{_cap _cap} on _day{dayOfWeek} can produce the result: name=Jane Wu, day=Tuesday | SEQUENCE, PREDICATE_RULE |
| Coreference symbol | For tying extracted matches together and enabling additional matches based upon preceding or successive text | See the sections for the rule types for examples of the use of _ref{}, >, _P{}, and _F{} | C_CONCEPT, CONCEPT_RULE |
| Argument | Any element or set of elements inside explicit quotation marks and governed by an operator or marked with an extraction label | CONCEPT_RULE:(OR, "_c{love@V}", "_c{like@V}", "_c{enjoy@V} driving") | CONCEPT_RULE, SEQUENCE, PREDICATE_RULE |
| Operator | Used to combine arguments in Boolean and proximity relationships | CONCEPT_RULE:(SENT, (DIST_5, "broken", "_c{partVehicle}")) | CONCEPT_RULE, PREDICATE_RULE |

Punctuation, such as backslashes, colons, commas, quotation marks, and different types of brackets are also used to separate or relate the elements and their modifiers to each other. Please refer to your product documentation for how different punctuation is used in each rule type.

White space is not explicitly encoded as part of a rule, except for REGEX rules. In all other rule types, for languages in which white space is used for tokenization, white space is used to separate elements from one another. In general, do not put two elements together without white space intervening when you are working with such languages.

Tip: Do not put two rule elements together in a rule definition without white space intervening in languages in which white space is used to delimit tokens or words.

Now that you are familiar with the terminology for parts of the rule definition, you can see some of the parts combined in Figure 5.3, which is an example of the PREDICATE_RULE type. This example rule is the same as the one in Figure 5.2, but with detailed labels for parts of the rule definition.

Figure 5.3. PREDICATE_RULE Example


Notice in Figure 5.3 that there are two arguments of the operator “AND” and they are separated by commas. Each argument is enclosed in quotation marks and consists of an extraction label and an element enclosed in curly brackets.
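
To illustrate how the AND operator relates its two arguments, here is a minimal Python sketch. It is a toy model and not the LITI matcher: the function name apply_and_rule and the sample sentence are invented for illustration, the arguments may contain only literal strings, matching is case-sensitive, only the first occurrence of each string is used, and tokenization is ignored.

```python
import re

# Assumed shape of one quoted argument: "_label{literal}"
ARGUMENT = re.compile(r'"_(\w+)\{([^}]*)\}"')


def apply_and_rule(definition, text):
    """Return labeled matches and the overall span, or None if any argument is missing."""
    spans = {}
    for label, literal in ARGUMENT.findall(definition):
        position = text.find(literal)
        if position < 0:
            return None  # AND requires every argument to be found
        spans[label] = (position, position + len(literal))
    start = min(begin for begin, _ in spans.values())
    end = max(finish for _, finish in spans.values())
    return {
        "labels": {label: text[b:e] for label, (b, e) in spans.items()},
        "match": text[start:end],  # both strings plus the text between them
    }


definition = '(AND, "_bands{Beatles}", "_bandmembers{Ringo}")'
print(apply_and_rule(definition, "The Beatles recorded with Ringo on drums."))
print(apply_and_rule(definition, "The Beatles recorded in London."))
```

The first call returns both labeled strings together with the span that covers them; the second returns None because only one of the two arguments is present.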

There are two ways to write LITI rules: in the graphical user interface (GUI) and programmatically. The discussion of custom rules in this and following chapters applies to both approaches. However, if you choose to write rules programmatically, you need to be aware of a few additional rule conventions, which are detailed in section 5.3.4.

5.3. Custom Concept Fundamentals

A concept is a grouping of one or more LITI rules. Each concept has a name, and the name can be used to reference its group of rules from other concepts.

Sometimes a concept can contain a long list of rules, so it is recommended that you use the comment character and a descriptive comment to break up the list into different sections for ease of maintenance. For example, take a long list of rules of painkiller drug names in a concept named drugTypeA. As shown in the excerpt below, the comment character and a short description separate different types of painkiller drugs in the list.

#Over the counter

CLASSIFIER:aleve

CLASSIFIER:tylenol

CLASSIFIER:ibuprofen

CLASSIFIER:advil

CLASSIFIER:motrin

#Prescription

CLASSIFIER:vicodin

CLASSIFIER:percocet

CLASSIFIER:oxycontin

In this book, some of the example rules are too long to be represented on a single line; therefore, long rules are wrapped. However, in the SAS Text Analytics products, each LITI rule is always constrained to one line. Any new line interrupting a LITI rule causes compilation errors.

Tip: Make sure each LITI rule is constrained to one line.
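
One way to catch a rule that was accidentally wrapped is to check that every line is blank, a comment, or the start of a rule. The following Python sketch does that; the function name find_suspect_lines is an assumption made for illustration, and the check is only a heuristic, because a wrapped fragment could by coincidence begin with a rule type followed by a colon.

```python
RULE_TYPES = {
    "CLASSIFIER", "CONCEPT", "C_CONCEPT", "CONCEPT_RULE",
    "SEQUENCE", "PREDICATE_RULE", "REMOVE_ITEM", "NO_BREAK", "REGEX",
}


def find_suspect_lines(concept_text):
    """Return 1-based numbers of lines that are not blank, comments, or rule starts."""
    suspects = []
    for number, line in enumerate(concept_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # Text before the first colon must be a known rule type in all caps.
        if stripped.partition(":")[0] not in RULE_TYPES:
            suspects.append(number)
    return suspects


wrapped = """#Over the counter
CLASSIFIER:aleve
CONCEPT_RULE:(SENT, (DIST_5, "broken",
"_c{partVehicle}"))
"""
print(find_suspect_lines(wrapped))  # [4]
```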

5.3.1. Best Practices for Naming Custom Concepts

Because concept names can be used in the same positions in LITI rules as strings are, it is important that you follow some guidelines for naming the concepts so that you can distinguish them from strings and other rule elements.

Avoid naming concepts with a single word that may occur in your text. For an example and explanation of this best practice, see section 6.3.1.

In addition, it is recommended that concept names use “camel” case without spaces between words and start with a lowercase letter. Some example concept names that follow these guidelines include “lossAmount,” “posSentiment,” and “loanOrigin.”

In English projects in products released starting in 2017, you can also use numbers and underscores in the name, but if you want to name the concept with a leading underscore, make sure you follow it with an alphabetic character, not a number. In addition, if you want to use a leading or trailing underscore, balance it with an underscore on the other end of the concept name. Be careful not to include “_Q” anywhere in a concept name, because the name will not work properly.

For projects in English before 2017 and in other languages in products released before the summer of 2019, the recommendation is to use only ASCII letters in concept names. For products released after this date, the guidelines just given can be applied across all supported languages. Additionally, all alphabetic characters may be used.

Consider the following concept names:

  • _Companies1
  • Companies1_
  • manufacturers
  • companyList
  • _Quarter_
  • COMPANY_LIST
  • mylist
  • _234123_
  • the1stQuarter
  • _1stQuarter_

Pause and think: Which of the concept names above follow the suggested guidelines?

As you may have realized, only the concept names companyList and the1stQuarter follow the guidelines, because they are not a single word that could appear in the text, they do not have unbalanced underscores or underscores followed by a number or Q, and they are written in camel case. Although COMPANY_LIST and mylist could also be used as concept names, it would be easier to distinguish them from other rule elements if their casing were more distinctive and consistent.

To summarize, adhere to the following guidelines about concept naming to prevent loss of extracted matches or unintended matches:

  • Avoid naming concepts with a single word that may occur in the text.
  • Use only ASCII letters in product releases before 2017 for English and before 2019 for other languages.
  • Use all letters, numbers, and underscores in product releases after 2017 for English and after 2019 for other languages, so long as you follow these guidelines:
    • Do not use _Q anywhere in the name.
    • If you use an underscore as the first character, use a letter for the second character and an underscore for the final character.

Another best practice in naming concepts is that names should be descriptive of the content you will be extracting with the rules in a given concept. For example, if you are extracting information about a vehicle part, then name the concept that contains those rules something like “vehiclePart.” If you are extracting something grammatical, include the grammatical element in the name. For example, use “posAdj” for extracting positive adjectives. You should make the name long enough to be descriptive and informative, but short enough to be easily typed in new rules without introducing errors. Concept names are case-sensitive and must be consistently spelled whenever they are used in rules.

Tip: When naming your concept, use “camel” case with no spaces between words. Make the concept name singular if you will be extracting one instance of the item that you define in each rule within the concept. Pay attention to case and consistency when referencing concepts in rules.
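
The naming guidelines in this section can be expressed as a small checker. The Python sketch below is one possible reading of them, not a SAS utility: the function name naming_problems is invented, and the test for "a single word that may occur in the text" is approximated as "letters only, all lowercase," which is an assumption.

```python
import re


def naming_problems(name):
    """Return a list of guideline violations for a proposed concept name."""
    problems = []
    if not re.fullmatch(r"\w+", name):
        problems.append("use only letters, numbers, and underscores")
    if "_Q" in name:
        problems.append("do not use _Q anywhere in the name")
    if name.startswith("_") and (len(name) < 2 or not name[1].isalpha()):
        problems.append("follow a leading underscore with a letter")
    if name.startswith("_") != name.endswith("_"):
        problems.append("balance leading and trailing underscores")
    core = name.strip("_")
    if not core or not core[0].islower():
        problems.append("start camel case with a lowercase letter")
    elif core.isalpha() and core.islower():
        # Approximation: an all-lowercase word could also appear in the text.
        problems.append("avoid a single word that may occur in the text")
    return problems


names = [
    "_Companies1", "Companies1_", "manufacturers", "companyList", "_Quarter_",
    "COMPANY_LIST", "mylist", "_234123_", "the1stQuarter", "_1stQuarter_",
]
print([name for name in names if not naming_problems(name)])
```

Run against the ten names listed earlier, the sketch accepts only companyList and the1stQuarter, which agrees with the discussion above.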

5.3.2. Best Practices for Referencing Custom Concepts

As mentioned in chapter 1, the taxonomy for your project contains concepts that are sets of rules. Each of those sets of rules can be referenced in another concept. To do so, include the referenced concept’s name as an element in a rule in another concept. In this way, concepts are like code objects, so it is good to treat each one as a component of a whole. This approach follows a common way to build things in general. For example, vehicles, houses, and watches are all made up of a set of parts that are made for a specific purpose. Your concepts should work the same way.

Concept names with simpler rules will be used in the rules of more complex concepts, and the readability of such rules depends on how well you name and design each concept. Also, your ability to test and determine quality of a concept depends on how well your design reflects the types of data you will process with the model. Use singular names, for example, when extracting one item with the rules in your concept, because this way you will be able to read the more complex rules more accurately. This means that, if you are extracting a part name, then use the concept name “vehiclePart,” not “vehicleParts.”

5.3.3. Concepts versus CONCEPT and CONCEPT_RULE Rule Types

It is important to distinguish between concepts as groupings of rules, and the CONCEPT and CONCEPT_RULE rule types. A concept is represented by a node in the taxonomy tree. It can contain one or more rules of any type, including but not limited to CONCEPT and CONCEPT_RULE. This meaning is represented in the phrase “concept rules” in the title of this book. When concepts are mentioned in general, the word “concept” is written in lowercase letters.

The CONCEPT rule type refers to a rule that starts with the declaration “CONCEPT:” and a CONCEPT_RULE type refers to a rule that starts with the declaration “CONCEPT_RULE:”. When the rule types are mentioned, the word “CONCEPT” is written in all-caps so that these types can be easily distinguished from concepts in general.

An additional phrase that you will encounter in some versions of the SAS software and documentation is “Concepts node.” This phrase is referring to the pipeline node, which contains the predefined and custom concepts and their rules—in other words, the concept model. You can see the relationship between the concept model, the concepts themselves, and the rules in Figure 5.4.

Figure 5.4. Concept Model, Concepts, and Rules


The concept drugTypeB has two CLASSIFIER rules. The concept drugTypeA has two CLASSIFIER rules and a CONCEPT rule referencing the concept drugTypeB. Because of that CONCEPT rule, extracted matches for the concept drugTypeB will also be extracted matches for the concept named drugTypeA.
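
The effect of the referencing CONCEPT rule can be sketched in Python as follows. The drug names assigned to each concept are placeholders, because the contents of Figure 5.4 are not reproduced here, and the function name literals is invented. The sketch follows only CONCEPT rules whose definition is a single concept name, and it does not guard against circular references.

```python
# Placeholder model: which drug names belong to which concept is an assumption.
MODEL = {
    "drugTypeB": ["CLASSIFIER:vicodin", "CLASSIFIER:percocet"],
    "drugTypeA": ["CLASSIFIER:aleve", "CLASSIFIER:tylenol", "CONCEPT:drugTypeB"],
}


def literals(concept, model):
    """Collect the strings a concept can match, following CONCEPT references."""
    found = []
    for rule in model[concept]:
        rule_type, _, definition = rule.partition(":")
        if rule_type == "CLASSIFIER":
            found.append(definition)
        elif rule_type == "CONCEPT" and definition in model:
            # Matches of the referenced concept also become matches of this one.
            found.extend(literals(definition, model))
    return found


print(literals("drugTypeB", MODEL))  # ['vicodin', 'percocet']
print(literals("drugTypeA", MODEL))  # ['aleve', 'tylenol', 'vicodin', 'percocet']
```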

Remember: In this book, “concept” refers to a node in a taxonomy that contains LITI rules, not an idea in general.

In this book, references to rule types (e.g., CONCEPT, CONCEPT_RULE) will always be in all capital letters, as in the rule type declaration itself.

SAS documentation and products may refer to a “Concepts node,” meaning a node within a pipeline that houses the model; this book will always refer to a “Concepts node” as a concepts model to avoid confusion.

5.3.4. Programmatic Rule Writing and Model Compilation

If you are writing custom concept rules programmatically, rather than using the GUI in products such as SAS Contextual Analysis or SAS Visual Text Analytics, you should know about additional requirements regarding the underlying configuration syntax. When you write rules in the GUI, it interprets the syntax of each rule and converts it to the underlying configuration syntax.

In this underlying syntax, a rule type declaration is followed by a required output declaration for every rule. The output declaration must contain the concept name with which the rule is associated. This concept name precedes any extraction label declaration (in fact rule types) or coreference command (in the CLASSIFIER rule type). Be careful not to put an additional colon between these two parts of the output declaration section. See the examples immediately below.

CLASSIFIER:musicBand:Rolling Stones

PREDICATE_RULE:musicBand(bands,bandMembers):(AND, "_bands{Beatles}", "_bandMembers{Ringo}")

In addition to the rule type declaration, there are two additional declaration types: ENABLE and CASE_INSENSITIVE_MATCH. You need to explicitly call out with the ENABLE declaration each concept that is enabled, which allows the given concept to provide output from the model. Any concept that is in the model, but not enabled, may still find text spans, but will only pass extracted matches along to referencing concepts, not provide output from the model. In the example below, the concept named musicBand is enabled.

ENABLE:musicBand

The CASE_INSENSITIVE_MATCH declaration specifies that any string in any rule in that concept should be interpreted in a case-insensitive manner, extending the possible matches to both uppercase and lowercase alphabetic characters. All concepts are case-sensitive by default.

CASE_INSENSITIVE_MATCH:musicBand

The following example puts all these pieces together in a configuration file.

ENABLE:musicBand

CASE_INSENSITIVE_MATCH:musicBand

CLASSIFIER:musicBand:Beatles

CLASSIFIER:musicBand:Rolling Stones

PREDICATE_RULE:musicBand(bands,bandMembers):(AND, "_bands{Beatles}", "_bandMembers{Ringo}")

This configuration file, saved in a .txt format, is also provided as part of a larger code example in the supplementary materials for this chapter, accessible online as mentioned in About This Book.
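
As a rough illustration of the conversion described in this section, the following Python sketch turns GUI-style rules for one concept into configuration lines. It is not the converter used by any SAS product: the function name to_config is hypothetical, the GUI form of the PREDICATE_RULE is inferred from the programmatic form shown above, and comments and the CLASSIFIER coreference command are not handled.

```python
FACT_TYPES = {"SEQUENCE", "PREDICATE_RULE"}


def to_config(concept, gui_rules, case_insensitive=False):
    """Return configuration lines for one enabled concept."""
    lines = [f"ENABLE:{concept}"]
    if case_insensitive:
        lines.append(f"CASE_INSENSITIVE_MATCH:{concept}")
    for rule in gui_rules:
        rule_type, _, rest = rule.partition(":")
        if rule_type in FACT_TYPES and rest.startswith("("):
            # The label declaration follows the concept name with no extra colon.
            lines.append(f"{rule_type}:{concept}{rest}")
        else:
            lines.append(f"{rule_type}:{concept}:{rest}")
    return lines


gui_rules = [
    "CLASSIFIER:Beatles",
    "CLASSIFIER:Rolling Stones",
    'PREDICATE_RULE:(bands,bandMembers):(AND, "_bands{Beatles}", "_bandMembers{Ringo}")',
]
print("\n".join(to_config("musicBand", gui_rules, case_insensitive=True)))
```

With these inputs, the printed lines correspond to the configuration file shown above.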

The underlying configuration syntax just explained is used by DS2 code or Cloud Analytic Services (CAS) actions in the SAS Text Analytics Rule Development action set for compiling an IE model.

There are several ways to compile a model containing custom concepts. The supplementary materials for this chapter contain two such examples. The first one uses macros and DS2 code to compile the configuration file (in text format) into a concepts model binary file and then to apply it to score a data set. The other example uses the INFILE statement in a DATA step to build a data set from the same configuration file. This data set can then be used to compile the model binary file with the compileConcept CAS action and apply it with the applyConcept CAS action.

As an alternative to using the INFILE statement with a text file or macros, you can also write the content of the configuration file as a SAS data set, using datalines. As in the example with the INFILE statement, you can then compile the data set into a model binary file by using the compileConcept action. This method is used for the supplementary materials in the examples for the remainder of the book.

Another method to compile the model is to write the rules from the configuration file into a CAS table that can be used to compile the model binary file with the use of compileConcept. This approach requires that each rule have a ruleId and that the rule itself be enclosed in single quotes. See the example below.

/* Each output row holds one configuration line; ruleId identifies the rule. */
data sascas1.concept_rule;
   length rule $ 200;
   ruleId=1;
   rule='CASE_INSENSITIVE_MATCH:musicBand';
   output;
   ruleId=2;
   rule='ENABLE:musicBand';
   output;
   ruleId=3;
   rule='CLASSIFIER:musicBand:Beatles';
   output;
   ruleId=4;
   rule='CLASSIFIER:musicBand:Rolling Stones';
   output;
   ruleId=5;
   rule='PREDICATE_RULE:musicBand(bands,bandMembers):(AND, "_bands{Beatles}", "_bandMembers{Ringo}")';
   output;
run;

Note that the remaining chapters of this book present the examples in the format of the rules used in the product GUIs. The programmatic format is used in supplemental materials so that the code can be run “out of the box” in DS2 code or with CAS actions.
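
If you maintain many rules, you might generate a DATA step like the one above rather than type it by hand. The Python sketch below shows one way to do so; the function name to_data_step is hypothetical and the approach is an assumption, not a method from the supplementary materials. It sizes the rule variable to the longest rule in bytes (assuming a UTF-8 session) so that rules longer than a fixed length are not truncated, and it doubles any single quote inside a rule, as SAS requires within single-quoted strings.

```python
def to_data_step(table, config_lines):
    """Build DATA step text with one (ruleId, rule) row per configuration line."""
    # Size the character variable to the longest rule, measured in UTF-8 bytes.
    width = max(len(line.encode("utf-8")) for line in config_lines)
    step = [f"data {table};", f"   length rule $ {width};"]
    for rule_id, line in enumerate(config_lines, start=1):
        quoted = line.replace("'", "''")  # escape single quotes for SAS
        step += [f"   ruleId={rule_id};", f"   rule='{quoted}';", "   output;"]
    step.append("run;")
    return "\n".join(step)


print(to_data_step("sascas1.concept_rule", [
    "ENABLE:musicBand",
    "CLASSIFIER:musicBand:Beatles",
]))
```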

5.3.5. Programmatic Model Application

Once your model has been built, you can use the SAS IE procedures or CAS actions in the Text Analytics Rule Score action set to run the model against a data set, using a SAS programming interface or SAS Studio. This method may work best when you are stringing together many different SAS analytic and visualization processes. Sample code for applying a SAS IE model by using both DS2 and CAS actions is provided in the supplementary materials for this chapter.

SAS Enterprise Content Categorization Server offers application of models using Java, Python, C#, and Perl. The Java and Python client interfaces are bundled as part of the server download, whereas the other types are standalone bundles.

5.4. Troubleshooting All Rule Types

Now that you are familiar with the required and optional parts of the different rule types, you can make sure that you avoid unexpected matches by following some general troubleshooting tips. There is no tracing mechanism in the LITI matcher that will tell you which rules or concepts match a particular string of text, so you will need to design your taxonomy, name your concepts, and write your rules and comments with this in mind.

This section is intended to help you identify why your model may be extracting spans that you did not intend or failing to extract spans as you expected. The pitfalls presented in this section are common to all the rule types. But you should also consult the troubleshooting sections specific to each rule type for additional errors to guard against.

Some of the possible reasons for unexpected matches include the following:

  • General syntax errors
  • Comments
  • Misspellings or typographical errors
  • Tokenization mismatch
  • Filtered or removed matches

Syntax errors that are possible for all rule types include failing to use all-caps for the rule type or misspelling the name of the type. Be sure that you have a colon after the rule type and after any special section—that is, before the main rule definition. For example, the PREDICATE_RULE type, as shown in Figure 5.2, has a label declaration section between the rule type and the rule definition; there should be a colon both before and after such a section.

You can comment out a line or a part of a line by using the hash character (#); however, if you intend to match the hash character as part of the rule definition, you must escape it with a backslash like so: \#. See the rules below that match hashtags expressing positive sentiment.

#Hashtags
CLASSIFIER:\#bestproducts

CLASSIFIER:\#bestgifts

These rules could be applied to the following input documents:

I don’t normally use hashtags, but I love my new phone! #bestproducts

Thanks for my new phone! #bestgifts

Pause and think: Assuming the rules above are in the posSentiment concept, can you predict the extracted matches for the input documents above?

The extracted matches are shown in Figure 5.5.

Figure 5.5. Extracted Matches

| Doc ID | Concept | Matched Text |
|---|---|---|
| 1 | posSentiment | #bestproducts |
| 2 | posSentiment | #bestgifts |

You can see that there is no match for “hashtags” because “#Hashtags” is just a comment. The extracted matches for both CLASSIFIER rules include the hashtag symbol because it was escaped with a backslash in the rules.
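
The difference between a comment and an escaped hash character can be sketched in Python. This is a simplified model, not the LITI compiler: the function names are invented, only CLASSIFIER rules are handled, and matching is a case-insensitive substring test that ignores tokenization.

```python
import re

UNESCAPED_HASH = re.compile(r"(?<!\\)#")  # a hash not preceded by a backslash


def strip_comment(line):
    """Drop everything from the first unescaped hash character onward."""
    found = UNESCAPED_HASH.search(line)
    return line if found is None else line[:found.start()]


def classifier_literal(line):
    """Return the string a CLASSIFIER rule matches, or None if nothing remains."""
    body = strip_comment(line).strip()
    prefix = "CLASSIFIER:"
    if not body.startswith(prefix) or body == prefix:
        return None
    return body[len(prefix):].replace("\\#", "#")


rules = ["#Hashtags", r"CLASSIFIER:\#bestproducts", r"CLASSIFIER:\#bestgifts"]
documents = {
    1: "I don't normally use hashtags, but I love my new phone! #bestproducts",
    2: "Thanks for my new phone! #bestgifts",
}
strings = [s for s in map(classifier_literal, rules) if s is not None]
matches = [(doc_id, s) for doc_id, text in documents.items()
           for s in strings if s.lower() in text.lower()]
print(matches)  # [(1, '#bestproducts'), (2, '#bestgifts')]
```

Without the backslash, the sketch treats everything after the hash character as a comment, so a rule written as CLASSIFIER:#bestproducts is left with an empty definition and yields no string to match.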

Misspellings can occur either in the rule or in the text. If the spelling is not exactly the same in both the text and the rule, then there will be no match. Also, beware of mistyping concept names, because concept names are always case-sensitive.

Another common cause of missing matches is that the rule contains a part of a token instead of the entire token. You can review what a token is in section 1.5.1. For example, consider the following rule aiming to capture a unit of measurement.

CLASSIFIER:ft

An input document that this rule could be applied to is as follows:

It was a 3ft drop to the bottom of the hill.

Pause and think: Assuming the rule above, can you predict whether a match will be extracted from the input document above?

If the rule contains only letters, but the input text contains an alphanumeric token, as in the example just shown, there will be no match. The rule will not match “ft” because the “3” is also a part of that alphanumeric token.

Note that the Measure predefined concept will see this token as a measurement, matching the full token “3ft” from the text because it contains a REGEX rule that captures the entire alphanumeric string.
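
The token-level behavior just described can be imitated with a few lines of Python. The tokenizer here is a deliberate simplification and an assumption (runs of letters and digits form one token, and each punctuation mark is its own token); the actual SAS tokenizer is more sophisticated. The function names are invented, and the matcher supports only single-token strings.

```python
import re


def tokenize(text):
    """Simplified tokenizer: alphanumeric runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def classifier_matches(literal, text):
    """True if the single-token literal equals a whole token (case-insensitive)."""
    return literal.lower() in [token.lower() for token in tokenize(text)]


document = "It was a 3ft drop to the bottom of the hill."
print(tokenize(document))
print(classifier_matches("ft", document))   # False: "ft" is only part of the token "3ft"
print(classifier_matches("3ft", document))  # True: the whole token matches
```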

One issue that you might observe is extracted matches without obvious rules aligning with the match. This can happen if you are using a predefined concept or other concept created by SAS. In this case, the rules may be hidden but operating in the background. You are given the opportunity to modify the behavior of such concepts through the addition of rules that contribute extracted matches or the addition of rules that filter extracted matches, or both.

Another issue that you might observe is that expected matches may be missing. In addition to the errors just explained that cause a rule not to match properly, there may be effects of rule-specific pitfalls, which will be covered in the following chapters. The primary one that you should be aware of, when you are using the “all matches” algorithm, is filtering done by global rule types, such as the filter rule types addressed in chapter 9.

The other reasons that extracted matches may be missing involve the alternate matching algorithm. If you are missing matches and your algorithm is set to either “best match” or “longest match,” then try resetting your project to “all matches” and testing your rule again to see if this is the problem. If so, look at section 13.4.1 for advice on working through the issue, using your chosen match algorithm.
