Book Description

Extract actionable insights from text and unstructured data.

Information extraction is the task of automatically extracting structured information from unstructured or semi-structured text. SAS Text Analytics for Business Applications: Concept Rules for Information Extraction Models focuses on this key element of natural language processing (NLP) and provides real-world guidance on the effective application of text analytics.

Drawing on scenarios and data from business cases across many domains and industries, the book includes tips and best practices from SAS text analytics experts to help you gain fast, valuable insight from your textual data.

Written for a broad audience of beginning, intermediate, and advanced users of SAS text analytics products, including SAS Visual Text Analytics, SAS Contextual Analysis, and SAS Enterprise Content Categorization, this book provides a solid technical reference. You will learn the SAS information extraction toolkit, broaden your knowledge of rule-based methods, and answer new business questions. As your practical experience grows, this book will serve as a reference to deepen your expertise.

Table of Contents

  1. About This Book
    1. What Does This Book Cover?
    2. Is This Book for You?
    3. What Should You Know about the Examples?
      1. Software Used to Develop the Book’s Content
      2. Example Code and Data
    4. We Want to Hear from You
  2. Acknowledgments
  3. Chapter 1: Fundamentals of Information Extraction with SAS
    1. 1.1. Introduction to Information Extraction
      1. 1.1.1. History
      2. 1.1.2. Evaluation
      3. 1.1.3. Information Extraction versus Data Extraction versus Information Retrieval
      4. 1.1.4. Situations in Which to Use IE for Business Problems
    2. 1.2. The SAS IE Toolkit
      1. 1.2.1. NLP Foundation for IE
      2. 1.2.2. LITI Rule Syntax
      3. 1.2.3. Predefined Concepts
      4. 1.2.4. Taxonomy of Concepts
      5. 1.2.5. Algorithms for Matching
      6. 1.2.6. Interfaces for Building and Applying Models
    3. 1.3. Reasons for Using SAS IE
    4. 1.4. When You Should Use Other Approaches instead of SAS IE
    5. 1.5. Important Terms in the Book
      1. 1.5.1. Strings versus Tokens
      2. 1.5.2. Named Entities and Predefined Concepts
      3. 1.5.3. Parent Forms and Other Variants
      4. 1.5.4. Found Text and Extracted Match
    6. 1.6. Suggested Reading
  4. Chapter 2: Fundamentals of Named Entities
    1. 2.1. Introduction to Named Entities
    2. 2.2. Business Scenarios
      1. 2.2.1. Example: Pinpointing Location Information
      2. 2.2.2. Example: Identifying Supporters and Competitors
      3. 2.2.3. Example: Estimating Loss, Gain, and Risk
      4. 2.2.4. Example: Detecting Personally Identifiable Information
    3. 2.3. The SAS Approach
      1. 2.3.1. Understanding Standard Predefined Concepts
      2. 2.3.2. Understanding Underlying Principles
      3. 2.3.3. Accessing the Predefined Concepts
  5. Chapter 3: SAS Predefined Concepts: Enamex
    1. 3.1. Introduction to SAS Predefined Concepts
    2. 3.2. Person
      1. 3.2.1. Titles in Person Names
      2. 3.2.2. Suffixes as Part of a Personal Name
      3. 3.2.3. Single-Word Names
      4. 3.2.4. Body References
      5. 3.2.5. Quotes
      6. 3.2.6. Locations as Part of Name
      7. 3.2.7. Groups of Individuals
      8. 3.2.8. Historical Figures, Saints, and Deities
      9. 3.2.9. Animals, Fictional Characters, Artificial Intelligence, and Aliens
      10. 3.2.10. Businesses Named after People
      11. 3.2.11. Laws, Diseases, Prizes, and Works of Art
    3. 3.3. Place
      1. 3.3.1. Common Nouns and Determiners
      2. 3.3.2. Subnational Regions and Other Descriptors
      3. 3.3.3. Street Addresses
      4. 3.3.4. Monuments
      5. 3.3.5. Celestial Bodies
      6. 3.3.6. Neighborhoods
      7. 3.3.7. Fictional Place Names
      8. 3.3.8. Conjoined Location Names
      9. 3.3.9. Special Cases for Nonmatches
    4. 3.4. Organization
      1. 3.4.1. Corporate Designators or Suffixes
      2. 3.4.2. Determiners before Proper Names
      3. 3.4.3. Facility Names Associated with an Organization
      4. 3.4.4. Groups of Individuals
      5. 3.4.5. Aliases
      6. 3.4.6. Conjoined Organization Names
      7. 3.4.7. Event Names
      8. 3.4.8. Special Cases for Nonmatches
    5. 3.5. Disambiguation of Matches
      1. 3.5.1. Organization or Place
      2. 3.5.2. Organization or Product
      3. 3.5.3. Organization or Person
      1. 3.5.1. Organization or Place
      2. 3.5.2. Organization or Product
      3. 3.5.3. Organization or Person
  6. Chapter 4: SAS Predefined Concepts: Timex, Numex, and Noun Group
    1. 4.1. Introduction to Other SAS Predefined Concepts
    2. 4.2. Date
      1. 4.2.1. Extended ISO 8601 Format
      2. 4.2.2. Named Dates
      3. 4.2.3. Modifiers
      4. 4.2.4. Conjoined Dates
      5. 4.2.5. Duration
      6. 4.2.6. Vague Expressions
    3. 4.3. Time
      1. 4.3.1. Extended ISO 8601 Format
      2. 4.3.2. Named Times and Time Zones
      3. 4.3.3. Modifiers
      4. 4.3.4. Conjoined Times
      5. 4.3.5. Duration
      6. 4.3.6. Vague Expressions
    4. 4.4. Money
      1. 4.4.1. Modifiers
      2. 4.4.2. Rates and Ratios
      3. 4.4.3. Quotes and Parentheses
      4. 4.4.4. Conjoined Expressions
      5. 4.4.5. Approximate Amount
      6. 4.4.6. Expressions and Metaphors
    5. 4.5. Percent
      1. 4.5.1. Acronyms, Initialisms, and Abbreviations
      2. 4.5.2. Modifiers
      3. 4.5.3. Quotation Marks and Parentheses
      4. 4.5.4. Conjoined Expressions
      5. 4.5.5. Multiword Expressions
      6. 4.5.6. Fractions and Ratios
      7. 4.5.7. Special Cases for Nonmatches
    6. 4.6. Noun Group
    7. 4.7. Disambiguation of Matches
    8. 4.8. Supplementing Predefined Concepts
  7. Chapter 5: Fundamentals of Creating Custom Concepts
    1. 5.1. Introduction to Custom Concepts
    2. 5.2. LITI Rule Fundamentals
      1. 5.2.1. Required Parts of LITI Rules
      2. 5.2.2. Optional Parts of LITI Rules
      3. 5.2.3. Rule Definition
    3. 5.3. Custom Concept Fundamentals
      1. 5.3.1. Best Practices for Naming Custom Concepts
      2. 5.3.2. Best Practices for Referencing Custom Concepts
      3. 5.3.3. Concepts versus CONCEPT and CONCEPT_RULE Rule Types
      4. 5.3.4. Programmatic Rule Writing and Model Compilation
      5. 5.3.5. Programmatic Model Application
    4. 5.4. Troubleshooting All Rule Types
  8. Chapter 6: Concept Rule Types
    1. 6.1. Introduction to the Concept Rule Types
    2. 6.2. CLASSIFIER Rule Type
      1. 6.2.1. Basic Use
      2. 6.2.2. Advanced Use: Coreference Command
      3. 6.2.3. Advanced Use: Information Field
      4. 6.2.4. Troubleshooting
      5. 6.2.5. Best Practices
      6. 6.2.6. Summary
    3. 6.3. CONCEPT Rule Type
      1. 6.3.1. Basic Use
      2. 6.3.2. Advanced Use: Combination of Various Elements
      3. 6.3.3. Advanced Use: Combination of Elements and Modifiers
      4. 6.3.4. Troubleshooting
      5. 6.3.5. Best Practices
      6. 6.3.6. Summary
    4. 6.4. C_CONCEPT Rule Type
      1. 6.4.1. Basic Use
      2. 6.4.2. Advanced Use: Multiple Strings as Matches
      3. 6.4.3. Advanced Use: Coreference
      4. 6.4.4. Troubleshooting
      5. 6.4.5. Best Practices
      6. 6.4.6. Summary
  9. Chapter 7: CONCEPT_RULE Type
    1. 7.1. Introduction to the CONCEPT_RULE Type
    2. 7.2. Basic Use
    3. 7.3. Advanced Use: Multiple and Embedded Operators
    4. 7.4. Advanced Use: Negation Using NOT
    5. 7.5. Advanced Use: Negation Using UNLESS
    6. 7.6. Advanced Use: Coreference and Aliases
    7. 7.7. Troubleshooting
    8. 7.8. Best Practices
    9. 7.9. Summary
  10. Chapter 8: Fact Rule Types
    1. 8.1. Introduction to Fact Rule Types
    2. 8.2. SEQUENCE Rule Type
      1. 8.2.1. Basic Use
      2. 8.2.2. Advanced Use with Other Elements
      3. 8.2.3. Troubleshooting
      4. 8.2.4. Best Practices
      5. 8.2.5. Summary
    3. 8.3. PREDICATE_RULE Rule Type
      1. 8.3.1. Basic Use
      2. 8.3.2. Advanced Use: Capture of a Sentence
      3. 8.3.3. Advanced Use: More Complex Rules
      4. 8.3.4. Advanced Use: Single Label, Multiple Extracted Matches
      5. 8.3.5. Advanced Use: More Than Two Returned Arguments
      6. 8.3.6. Advanced Use: Discovery of Terms to Add to a Model
      7. 8.3.7. Troubleshooting
      8. 8.3.8. Best Practices
      9. 8.3.9. Summary
  11. Chapter 9: Filter Rule Types
    1. 9.1. Introduction to Filter Rule Types
    2. 9.2. REMOVE_ITEM Rule Type
      1. 9.2.1. Basic Use of the REMOVE_ITEM Rule Type
      2. 9.2.2. Advanced Use of REMOVE_ITEM: Additional Elements
      3. 9.2.3. Advanced Use of REMOVE_ITEM: Negation
      4. 9.2.4. REMOVE_ITEM Troubleshooting
      5. 9.2.5. REMOVE_ITEM Best Practices
      6. 9.2.6. REMOVE_ITEM Summary
    3. 9.3. NO_BREAK Rule Type
      1. 9.3.1. Basic Use of the NO_BREAK Rule Type
      2. 9.3.2. Advanced Use of NO_BREAK: Specifying a Concept Name
      3. 9.3.3. NO_BREAK Troubleshooting
      4. 9.3.4. NO_BREAK Best Practices
      5. 9.3.5. NO_BREAK Summary
  12. Chapter 10: REGEX Rule Type
    1. 10.1. Introduction to the REGEX Rule Type
    2. 10.2. Basic Use
    3. 10.3. Advanced Use: Discovery of Patterns
    4. 10.4. Advanced Use: Exploration
    5. 10.5. Advanced Use: Identification of Tokens for Splitting in Post-processing
    6. 10.6. Advanced Use: Information Field
    7. 10.7. Troubleshooting REGEX
    8. 10.8. Best Practices for Using REGEX
    9. 10.9. Summary of REGEX
  13. Chapter 11: Best Practices for Custom Concepts
    1. 11.1. Introduction to Boolean and Proximity Operators
    2. 11.2. Best Practices for Using Operators
      1. 11.2.1. Behavior of Groupings of Single Operators
      2. 11.2.2. SAS Categorization Operators
      3. 11.2.3. Combinations of Operators and Restrictions
    3. 11.3. Best Practices for Selecting Rule Types
      1. 11.3.1. Rule Types and Associated Computational Costs
      2. 11.3.2. Use of the Least Costly Rule Type for Best Performance
      3. 11.3.3. When Not to Use Certain Rule Types
    4. 11.4. Concept Rules in Models
  14. Chapter 12: Fundamentals of Data Considerations
    1. 12.1. Introduction to Projects
    2. 12.2. Data Considerations
    3. 12.3. Data Evaluation
    4. 12.4. Data Exploration
    5. 12.5. Data Analysis
      1. 12.5.1. Vocabulary Diversity
      2. 12.5.2. Information Density
      3. 12.5.3. Language Formality
      4. 12.5.4. Information Complexity
      5. 12.5.5. Domain Specificity
    6. 12.6. Business Goals and Targeted Information
    7. 12.7. Suggested Reading
  15. Chapter 13: Fundamentals of Project Design
    1. 13.1. Introduction to Project Design
    2. 13.2. Definition of Targeted Information
    3. 13.3. Taxonomy Design
      1. 13.3.1. Decomposition
      2. 13.3.2. Concept Types
    4. 13.4. Project Settings
      1. 13.4.1. Match Algorithm and Priority
      2. 13.4.2. Case Sensitivity
    5. 13.5. Suggested Reading
  16. Chapter 14: Fundamentals of Model Measurement
    1. 14.1. Introduction to Model Measurement
    2. 14.2. Use of a Gold Standard Corpus
    3. 14.3. Setup of a Gold Standard Corpus
    4. 14.4. Setup of Approximate Annotations
    5. 14.5. Creation of Samples for Development and Testing
    6. 14.6. Model Quality and Decisions
      1. 14.6.1. Strategies for Overcoming Low Recall
      2. 14.6.2. Strategies for Overcoming Low Precision
    7. 14.7. Model Monitoring
    8. 14.8. Suggested Reading
  17. References