In the previous chapter, we covered how you can use Amazon Textract for Optical Character Recognition (OCR) and took a deep dive into its features and specific API implementations. In this chapter, you will get a detailed introduction to Amazon Comprehend and Amazon Comprehend Medical: what they do, what business challenges they were created to solve, what features they offer, what types of user requirements they suit, and how easily Comprehend integrates with other AWS services, such as AWS Lambda, to build business applications.
In this chapter, we will go through the following sections:
For this chapter, you will need access to an AWS account. Before getting started, we recommend that you create an AWS account by referring to the AWS account setup and Jupyter notebook creation steps in Technical requirements in Chapter 2, Introducing Amazon Textract. While creating an Amazon SageMaker Jupyter notebook, make sure you attach AmazonComprehendFullAccess to the IAM role associated with your notebook instance, and follow these steps:
Check out the following video to see the Code in Action at https://bit.ly/3Gkd1Oi.
In this section, we will talk about the challenges associated with setting up machine learning (ML) preprocessing for natural language processing (NLP). Then, we will talk about how Amazon Comprehend and Amazon Comprehend Medical can help solve these pain points. Finally, we will talk about how you can use Amazon Comprehend to analyze text that Amazon Textract has extracted from documents.
Some of the key challenges while setting up NLP preprocessing are that documents can be semi-structured or unstructured, and can be in various languages. Once you have a large amount of unstructured data, you will probably want to extract insights from it using NLP techniques for the most common use cases, such as sentiment analysis, text classification, named entity recognition (NER), machine translation, and topic modeling.
The challenge with applying these techniques is that the majority of the time is spent on data preprocessing. This applies whether you are using classical ML (for example, for sentiment analysis) or deep learning to find insights. If you are doing ML, some of the key preprocessing techniques you would use include the following:
Even after these preprocessing steps, you would still need to apply advanced NLP techniques if you are doing deep learning on top of the preprocessed steps. Some popular techniques are the following:
Some of the challenges with setting up these NLP models include the following:
To overcome these challenges, we have Amazon SageMaker, which removes the heavy infrastructure lifting involved in building, training, tuning, and deploying NLP models, so you can move from idea to execution quickly.
Amazon SageMaker
You can learn more about how to get started with Amazon SageMaker NLP techniques in the book Learn Amazon SageMaker by Julien Simon.
Moreover, when it comes to implementing transformers in your NLP models, Amazon SageMaker also supports transformer implementations in PyTorch, TensorFlow, and Hugging Face.
The Hugging Face transformers package is an immensely popular Python library providing pretrained models that are useful for a variety of NLP tasks. Refer to this blog to learn more: https://aws.amazon.com/blogs/machine-learning/aws-and-hugging-face-collaborate-to-simplify-and-accelerate-adoption-of-natural-language-processing-models/.
So, we have covered some of the key challenges with preprocessing NLP techniques and modeling. With AWS AI services such as Amazon Comprehend, you don't need to worry about spinning up servers or setting up complex infrastructure for NLP training. You also don't need to worry about all the preprocessing techniques we've covered, for example, tokenization, PoS tagging, and so on.
You also don't need to think about implementing transformers to set up deep learning models to accomplish some of the key NLP tasks, such as text classification, topic modeling, NER, key phrase detection, and a lot more.
Amazon Comprehend and Comprehend Medical give you APIs to accomplish some key NLP tasks (such as sentiment analysis, text classification, or topic modeling) on a variety of unstructured texts (such as emails, chats, social media feeds, or healthcare notes).
In the next section, we will cover how Comprehend and Comprehend Medical can detect insights in text with no preprocessing.
In this section, we will cover some of the key benefits of Amazon Comprehend and Comprehend Medical by discussing the following examples:
Amazon Comprehend and Amazon Comprehend Medical are AWS AI services, similar to Amazon Textract (which we covered in Chapter 2, Introducing Amazon Textract), where you do not need to set up complex models. You call the Amazon Comprehend and Amazon Comprehend Medical APIs and send a text request, and you will get a response back with the detected confidence score. The difference between Amazon Comprehend and Amazon Comprehend Medical is that Comprehend Medical is specific to healthcare NLP use cases. Comprehend Medical uses ML to extract health-related, meaningful insights from unstructured medical text, while Amazon Comprehend uses NLP to extract meaningful information about the content of unstructured text by recognizing the entities, key phrases, language, sentiments, and other common elements in the text.
Some of the key use cases of Amazon Comprehend are as follows:
You can bring a sample of your data and use Amazon Comprehend Custom entity recognition to get started without needing to worry about writing a complex model. You also do not need to worry about labeling large datasets to get started, as Amazon Comprehend Custom uses transfer learning under the hood. You can get started with a small set of labeled data to create custom entities specific to your use case. Similarly, you can bring your own data and use custom classification for multi-class and multi-label classification.
In the case of healthcare records, you can use Amazon Comprehend Medical. You can use Comprehend Medical for the following healthcare applications:
We will cover Amazon Comprehend Medical use cases in detail in Chapter 12, AI and NLP in Healthcare.
If you have documents in the form of scanned images or PDFs, you can use Amazon Textract to extract data quickly from these documents and then use Amazon Comprehend to gain meaningful insights from the extracted text, such as entities, key phrases, and sentiment. You can further classify these documents using Amazon Comprehend text classification, and also perform topic modeling to identify key topics within the documents. We will cover how you can use Amazon Textract with Amazon Comprehend together in an architecture in Chapter 4, Automating Document Processing Workflows for Financial Institutions, and in Chapter 5, Creating NLP Search in the section Creating NLP-powered smart search indexes. Moreover, for the healthcare industry, if you have lots of scanned documents such as medical intake forms, patient notes, and so on, you can use Amazon Textract to extract data from these documents and then use Amazon Comprehend Medical to extract key insights from this unstructured text data.
In this section, we first covered the challenges associated with setting up NLP modeling. Then we discussed how Amazon Comprehend and Comprehend Medical can address the pain points associated with setting up NLP models, such as scalability, preprocessing steps, and infrastructure setup. Lastly, we covered how you can automate your documents and enrich them with NLP by combining Amazon Textract and Amazon Comprehend. We have covered how Comprehend and Comprehend Medical can provide rich APIs for building intelligent NLP applications, which are also scalable to process large numbers of documents or unstructured data. In the next section, we will talk about some of the product features of these services using an AWS Console demo.
In this section, we will talk about Amazon Comprehend and Amazon Comprehend Medical product features using an AWS Console demo. We will start with Amazon Comprehend, and then move to Amazon Comprehend Medical.
Amazon Comprehend enables you to examine your unstructured data, for example, social media feeds, posts, emails, web pages, data extracted from Amazon Textract, phone transcripts, call center records, or really any kind of unstructured textual data. It can help you gain various insights about its content by using a number of pretrained models. Figure 3.2 is a diagram of how Amazon Comprehend actually works:
With Amazon Comprehend, you can perform the following on your input unstructured textual data by using the following text analysis APIs:
These text analysis APIs can be used both in real-time and in a batch manner, while topic modeling is a batch job or asynchronous process and cannot be used for real-time use cases.
There are two modes in which you can use these APIs:
Note
For synchronous APIs, your text has to be UTF-8 encoded and no larger than 5,000 bytes.
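The size limit in the note is expressed in bytes, not characters, so non-ASCII text reaches it sooner than its character count suggests. The following is a minimal sketch of a pre-flight check; the helper names (`utf8_size`, `fits_sync_limit`, `truncate_to_limit`) are hypothetical, and the 5,000-byte constant is simply the figure quoted above, so check the current service quotas before relying on it.

```python
MAX_SYNC_BYTES = 5000  # figure quoted in the note above; an assumption, not read from the service


def utf8_size(text):
    """Return the size of the text in bytes once it is UTF-8 encoded."""
    return len(text.encode("utf-8"))


def fits_sync_limit(text, limit=MAX_SYNC_BYTES):
    """True if the UTF-8 encoded text is within the byte limit."""
    return utf8_size(text) <= limit


def truncate_to_limit(text, limit=MAX_SYNC_BYTES):
    """Trim the text so its UTF-8 encoding fits, without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    # errors="ignore" drops a multi-byte character that was cut in half
    return encoded[:limit].decode("utf-8", errors="ignore")


print(utf8_size("Packt est une société"))  # accented letters take two bytes each
```

Truncation is the crudest option; splitting a long document into several requests usually preserves more information.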
Let's take a quick look at some Amazon Comprehend features on the AWS Console. Please refer to the Technical requirements section if you have not already set up your AWS account.
Since we all forget to set up autopay messages to pay our credit card bills, in this demo we will show you a quick analysis of a sample autopay message to extract some key insights using Amazon Comprehend:
Hi Alex. Your NoNameCompany Financial Services, LLC credit card account 1111-0000-1111-0010 has a minimum payment of $25.00 that is due by Sunday, June 19th. Based on your autopay settings, we are going to withdraw your payment on the due date from your bank account XXXXXX1121 with the routing number XXXXX0000.
Your latest statement was mailed to 100 XYZ Street, Anytown, WA 98121.
After your payment is received, you will receive a confirmation text message at 555-0100-0000.
If you have questions about your bill, NoNameCompany Customer Service is available by phone at 206-555-0199 or email at [email protected].
Now, we will walk through each Insights API by changing each tab.
You can see from the screenshot in Figure 3.5 that Amazon Comprehend was able to detect the highlighted entities from the text you entered:
In the following screenshot in Figure 3.6, you can see that Alex has been identified as a Person, and NoNameCompany, the sender of the autopay message, has been identified as an Organization. The date by which Alex's payment is due (June 19th) has been identified as a Date entity. Each entity comes with its own confidence score, which indicates how confident the ML model is that the detection is correct. The API returns this score as a value between 0 and 1; the higher the score, the greater the confidence in the answer:
Note
Out of the box, Amazon Comprehend's built-in APIs can detect Person, Location, Quantity, Organization, Date, Commercial Item, Event, and Title from any text.
In the last section of this chapter, we will see how to call these APIs using the Python Boto3 SDK and integrate them into your applications.
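Ahead of that walkthrough, here is a minimal sketch of how the confidence scores might be used once you have a response. It assumes the dictionary shape returned by the Boto3 `detect_entities` call (an `Entities` list whose items carry `Text`, `Type`, and a `Score` between 0 and 1). The `confident_entities` helper, the 0.9 threshold, and the sample response are hypothetical and written by hand for illustration, not captured from the service.

```python
def confident_entities(response, min_score=0.9):
    """Return (text, type, score) tuples for entities at or above min_score."""
    return [
        (entity["Text"], entity["Type"], round(entity["Score"], 2))
        for entity in response.get("Entities", [])
        if entity["Score"] >= min_score
    ]


# Hand-written sample in the shape of a detect_entities response (illustrative values)
sample_response = {
    "Entities": [
        {"Text": "Alex", "Type": "PERSON", "Score": 0.99},
        {"Text": "June 19th", "Type": "DATE", "Score": 0.98},
        {"Text": "autopay", "Type": "OTHER", "Score": 0.42},
    ]
}

for text, entity_type, score in confident_entities(sample_response):
    print(f"{entity_type}: {text} ({score})")
```

The right threshold depends on the use case: a search index can tolerate low-confidence entities, while an automated payment workflow probably cannot.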
Change the tab to key phrases to understand what the key phrases are and what Amazon Comprehend has predicted:
In English, a key phrase consists of a noun phrase (noun plus modifier) that describes a particular thing. For example, in the text in Figure 3.8, "Hi Alex", "Your NoNameCompany Financial Services", and "minimum payment" are some of the key phrases identified by the Amazon Comprehend API. Without reading the text, just by looking at these key phrases, a person can tell that it is about a finance company and has something to do with a payment, which is really useful when you have large amounts of unstructured text.
Change the tab to see the dominant language identified by Amazon Comprehend, as shown in Figure 3.9:
Similar to other Comprehend APIs, Amazon Comprehend detects the language of the given text and provides a confidence score along with it. You can use this feature for a book written in multiple languages, such as French and Hindi. Using the language detection APIs, you can detect each language and estimate what percentage of the book is written in it, and then use Amazon Translate, an AWS service that translates text from one language to another, to translate the text. We will see an example of this in later chapters.
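The multilingual book scenario can be sketched as follows, assuming the book has already been split into passages and sent through the Boto3 `batch_detect_dominant_language` call, whose response contains a `ResultList` with one `Languages` list per passage. The `language_share` helper and the sample response are hypothetical and hand-written for illustration.

```python
from collections import Counter


def language_share(batch_response):
    """Map each language code to the percentage of passages it dominates."""
    counts = Counter()
    for result in batch_response.get("ResultList", []):
        languages = result.get("Languages", [])
        if not languages:
            continue
        # Keep only the highest-scoring language for each passage
        top = max(languages, key=lambda language: language["Score"])
        counts[top["LanguageCode"]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: 100.0 * count / total for code, count in counts.items()}


# Hand-written sample in the shape of a batch_detect_dominant_language response
sample_batch = {
    "ResultList": [
        {"Index": 0, "Languages": [{"LanguageCode": "fr", "Score": 0.99}]},
        {"Index": 1, "Languages": [{"LanguageCode": "fr", "Score": 0.97}]},
        {"Index": 2, "Languages": [{"LanguageCode": "hi", "Score": 0.98}]},
        {"Index": 3, "Languages": [{"LanguageCode": "fr", "Score": 0.95},
                                   {"LanguageCode": "en", "Score": 0.03}]},
    ],
    "ErrorList": [],
}

print(language_share(sample_batch))
```

This counts passages rather than words, so it is only a rough estimate when passages differ a lot in length.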
Change the tab to PII to see what you will get using the Amazon Comprehend out-of-the-box PII detection API as follows:
As you can see in Figure 3.10, Amazon Comprehend provides you with Offsets and Labels with its real-time or sync PII APIs. If you want to redact the PII data from your text, you can use an asynchronous job. Amazon Comprehend can detect these PII entities: age, address, AWS access key, AWS secret key, bank-related details (such as bank account and bank routing number), credit card details (such as credit card number and expiry date), identification details (such as driving license ID and passport number), network-related details (such as emails, IP address, and MAC address), URLs, passwords, and usernames.
With this understanding of types of PII entities detected by Amazon Comprehend, let's scroll down to see the entities or results of Offsets detected by PII for the text you entered:
You also get a confidence score along with the entity and the type of PII entity.
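Because the Offsets results identify where each piece of PII sits in the text, they can also drive a simple client-side redaction. The sketch below assumes the response shape of the Boto3 `detect_pii_entities` call (each entity has `Type`, `Score`, `BeginOffset`, and `EndOffset`, with no `Text` field). The `redact_pii` helper, the threshold, and the sample values are hypothetical; this is not the service's asynchronous redaction job.

```python
def redact_pii(text, response, min_score=0.5):
    """Replace every detected PII span with its entity type in square brackets."""
    entities = [
        entity for entity in response.get("Entities", [])
        if entity["Score"] >= min_score
    ]
    # Replace from right to left so that earlier offsets remain valid
    for entity in sorted(entities, key=lambda item: item["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        text = text[:start] + "[" + entity["Type"] + "]" + text[end:]
    return text


message = "Hi Alex. Call 206-555-0199."

# Hand-written sample in the shape of a detect_pii_entities response
sample_pii = {
    "Entities": [
        {"Type": "NAME", "Score": 0.99, "BeginOffset": 3, "EndOffset": 7},
        {"Type": "PHONE", "Score": 0.99, "BeginOffset": 14, "EndOffset": 26},
    ]
}

print(redact_pii(message, sample_pii))
```

Working from the last span to the first avoids recalculating offsets after each replacement changes the length of the string.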
In case you do not want to identify the specific entities and just want to know what type of PII your documents have, you can use the Labels PII feature.
Select the Labels button to see this feature in action:
From the results shown in Figure 3.12, you can clearly see that Date time, Email, Name, Address, and Phone are some of the pieces of PII related to a person in the text you entered.
Change to the Sentiment tab to understand the sentiment of the text you have entered:
Since the text was related to an autopay message, a neutral sentiment was detected by Amazon Comprehend's Detect Sentiment real-time API. The Amazon Comprehend sentiment analysis feature helps determine whether the sentiment is positive, negative, neutral, or mixed. You can use this feature for various use cases, such as determining the sentiment of online book reviews, tweets, or posts and reviews on other platforms, such as Reddit or Yelp.
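For the review use cases above, you would typically score many texts and then summarize the labels. The sketch below assumes the response shape of the Boto3 `batch_detect_sentiment` call, where each item in `ResultList` has a `Sentiment` label (`POSITIVE`, `NEGATIVE`, `NEUTRAL`, or `MIXED`). The `summarize_sentiment` helper and the sample response are hypothetical and hand-written.

```python
from collections import Counter


def summarize_sentiment(batch_response):
    """Return the most frequent sentiment label and the count of each label."""
    labels = [result["Sentiment"] for result in batch_response.get("ResultList", [])]
    counts = Counter(labels)
    overall = counts.most_common(1)[0][0] if counts else None
    return overall, dict(counts)


# Hand-written sample in the shape of a batch_detect_sentiment response
sample_reviews = {
    "ResultList": [
        {"Index": 0, "Sentiment": "POSITIVE"},
        {"Index": 1, "Sentiment": "POSITIVE"},
        {"Index": 2, "Sentiment": "NEUTRAL"},
        {"Index": 3, "Sentiment": "POSITIVE"},
    ],
    "ErrorList": [],
}

overall, counts = summarize_sentiment(sample_reviews)
print(overall, counts)
```

A majority vote ignores how strong each prediction is; averaging the per-label values in `SentimentScore` would be a reasonable alternative.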
Click on the last tab, Syntax, to see what type of responses you can get with Amazon Comprehend's Detect Syntax feature:
Amazon Comprehend is able to identify nouns, verbs, and adjectives, and can identify 17 types of parts of speech overall. This feature can be really useful for data preprocessing for NLP models that require PoS tagging.
Note
We covered all the Amazon Comprehend real-time text analysis APIs in detail. You can also perform batch real-time operations with these APIs by sending up to 25 UTF-8 text documents, each of up to 5,000 bytes, in a single call to get real-time results (https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html). Comprehend Custom can now also accept PDF documents directly for analysis and custom training.
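If you have more documents than one batch call accepts, you need to split them into groups first. The following is a minimal sketch, assuming a limit of 25 documents per batch call as listed on the guidelines page linked above; the `batches` helper is hypothetical, and the limit should be checked against the current quotas.

```python
MAX_BATCH_DOCUMENTS = 25  # assumed per-call document limit; verify against current quotas


def batches(documents, size=MAX_BATCH_DOCUMENTS):
    """Yield consecutive slices containing at most `size` documents."""
    for start in range(0, len(documents), size):
        yield documents[start:start + size]


reviews = [f"review number {i}" for i in range(60)]
for group in batches(reviews):
    # Each group could be passed as TextList to a batch_detect_* call
    print(len(group))
```

Each group still has to respect the per-document size limit, so this pairs naturally with a byte-size check on every document.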
With Amazon Comprehend Custom you can bring your own datasets, quickly create custom entities, and perform custom classification. This feature is a batch or asynchronous feature that involves two steps:
We will cover Comprehend custom entity features in Chapter 14, Auditing Named Entity Recognition Workflows, and Comprehend custom classification features in Chapter 15, Classifying Documents and Setting up Human in the Loop for Active Learning.
We will cover topic modeling product features in detail in Chapter 6, Using NLP to Improve Customer Service Efficiency.
Amazon Comprehend Events has a specific use case for financial organizations, where you can use this API to see the relationships between various entities extracted through Amazon Comprehend in the case of any important financial events such as press releases, mergers, and acquisitions. You can use this Events batch API to detect events over large documents to answer who, what, when, and where the event happened. To learn more about Comprehend Events, refer to this blog: https://aws.amazon.com/blogs/machine-learning/announcing-the-launch-of-amazon-comprehend-events/.
Amazon Comprehend Medical provides two types of analysis:
Now, we will quickly cover Amazon Comprehend Medical features through the AWS Console again:
Pt is 35 yo woman, IT professional with past medical history that includes
- status post cardiac catheterization in may 2019.
She has palpitations and chest pressure today.
HPI : Sleeping trouble for present dosage of Catapres. Severe rash on thighs, slightly itchy
Meds : Xanax100 mgs po at lunch daily,
Catapres 0.2 mgs -- 1 and 1 / 2 tabs po qhs
Lungs : clear
Heart : Regular rhythm
Next follow up as scheduled on 06/19/2021
Note
With Comprehend Medical real-time APIs, you can analyze up to 20,000 characters.
You can see that Comprehend Medical also provides relationships within these entities, such as Catapres dosage, and the frequency at which the drug should be administered. Amazon Comprehend Medical detects Entity, Type, and Category, such as whether the entity is PHI or treatment or time expression and traits, along with a confidence score.
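The relationships described above appear in the API response as an `Attributes` list nested under each entity. The sketch below assumes the response shape of the Boto3 `comprehendmedical` client's `detect_entities_v2` call. The `medications_with_attributes` helper and the sample response are hypothetical and hand-written for illustration, not actual service output for the note above.

```python
def medications_with_attributes(response):
    """Map each medication name to its related attributes (dosage, frequency, and so on)."""
    medications = {}
    for entity in response.get("Entities", []):
        if entity.get("Category") != "MEDICATION":
            continue
        medications[entity["Text"]] = {
            attribute["Type"]: attribute["Text"]
            for attribute in entity.get("Attributes", [])
        }
    return medications


# Hand-written sample in the shape of a detect_entities_v2 response
sample_medical = {
    "Entities": [
        {
            "Text": "Catapres",
            "Category": "MEDICATION",
            "Type": "BRAND_NAME",
            "Score": 0.99,
            "Attributes": [
                {"Type": "DOSAGE", "Text": "0.2 mgs", "Score": 0.98},
                {"Type": "ROUTE_OR_MODE", "Text": "po", "Score": 0.97},
                {"Type": "FREQUENCY", "Text": "qhs", "Score": 0.96},
            ],
        },
        {"Text": "Lungs", "Category": "ANATOMY", "Type": "SYSTEM_ORGAN_SITE", "Score": 0.95},
    ]
}

print(medications_with_attributes(sample_medical))
```

If the same medication appears more than once in a note, this sketch keeps only the last occurrence; keying on the entity `Id` instead would preserve all of them.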
Use this feature to identify medication as entities:
If you scroll down to Results, Comprehend Medical shows the RXCUIs for each medication, along with a confidence score. An RXCUI is a machine-readable code that refers to a unique name for a particular drug, and drugs having the same RXCUI are considered to be the same drug. This Comprehend Medical feature provides RxNorm information such as strength, frequency, dose, dose form, and route of administration. You can use this RxNorm feature for scenarios such as the following:
Let's change the tab to ICD-10-CM concepts and you will get the following analysis:
The InferICD10CM API detects possible medical conditions as entities and links them to codes from the ICD-10-CM, along with a confidence score. In healthcare, these codes are standard medical transaction codes, set for diagnostic purposes to comply with the Health Insurance Portability and Accountability Act (HIPAA), and used for classifying and reporting diseases. You can use these ICD-10-CM codes, together with the detected signs, symptoms, traits, and attributes, for downstream analysis.
InferICD10CM is well-suited to scenarios such as professional medical coding assistance for patient records, clinical trials and studies, integration with an existing medical software system, early detection and diagnosis, and population health management.
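For a coding-assistance scenario, you would usually keep the best-scoring code for each detected condition. The sketch below assumes the response shape of the Boto3 `comprehendmedical` client's `infer_icd10_cm` call, where each entity carries an `ICD10CMConcepts` list of `Code`, `Description`, and `Score`. The `top_icd10_codes` helper, the threshold, and the sample response are hypothetical and hand-written.

```python
def top_icd10_codes(response, min_score=0.5):
    """Return (condition, code, description) for the best concept of each entity."""
    results = []
    for entity in response.get("Entities", []):
        concepts = entity.get("ICD10CMConcepts", [])
        if not concepts:
            continue
        best = max(concepts, key=lambda concept: concept["Score"])
        if best["Score"] >= min_score:
            results.append((entity["Text"], best["Code"], best["Description"]))
    return results


# Hand-written sample in the shape of an infer_icd10_cm response (illustrative scores)
sample_icd = {
    "Entities": [
        {
            "Text": "palpitations",
            "Category": "MEDICAL_CONDITION",
            "ICD10CMConcepts": [
                {"Code": "R00.2", "Description": "Palpitations", "Score": 0.91},
                {"Code": "R00.0", "Description": "Tachycardia, unspecified", "Score": 0.40},
            ],
        },
        {
            "Text": "rash",
            "Category": "MEDICAL_CONDITION",
            "ICD10CMConcepts": [
                {"Code": "R21", "Description": "Rash and other nonspecific skin eruption", "Score": 0.78},
            ],
        },
    ]
}

for condition, code, description in top_icd10_codes(sample_icd):
    print(f"{condition}: {code} ({description})")
```

In practice, suggestions like these are a starting point for a professional medical coder to review rather than a final coding decision.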
In the next section, we will see these APIs in action by performing a walkthrough of a Jupyter notebook.
In this section, you will see a detailed walkthrough of broad categories of APIs available for Amazon Comprehend and Comprehend Medical through a Jupyter notebook example, which you can run in your AWS account. To set up the notebook, refer to the Technical requirements section of this chapter.
We will be showing you a subset of key APIs in Comprehend along with their functions, and then will talk about how you can build applications integrating with AWS Lambda, API Gateway, and Comprehend.
Note
We will cover Amazon Comprehend Medical APIs in Chapter 12, AI and NLP in Healthcare.
Let's start with the Amazon Comprehend APIs first. Amazon Comprehend provides three types of API.
In the notebook https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2003/Chapter%203%20Introduction%20to%20Amazon%20Comprehend.ipynb, we will cover real-time APIs and batch real-time APIs.
Note
You can implement the same features with other supported SDKs and tools, such as Java, Ruby, .NET, AWS CLI, Go, C++, JavaScript, and PHP. For more information on Comprehend APIs, refer to the Amazon documentation: https://docs.aws.amazon.com/comprehend/latest/dg/API_Reference.html.
import json
import boto3

# Create the Amazon Comprehend client
comprehend = boto3.client('comprehend')

SampleText="Packt is a publishing company founded in 2003 headquartered in Birmingham, UK, with offices in Mumbai, India. Packt primarily publishes print and electronic books and videos relating to information technology, including programming, web design, data analysis and hardware."

# Detect named entities in the English sample text
response = comprehend.detect_entities(
    Text=SampleText,
    LanguageCode='en')

# Pretty-print the JSON response
print(json.dumps(response, indent=4, sort_keys=True))
This gives us the following output:
Comprehend was able to successfully detect entities along with their type, and give a response that Packt Publications is an organization located in Birmingham, UK, and Mumbai, India.
Note
Amazon Comprehend supports analysis in multiple languages using these APIs. Besides English, the supported languages include French, Japanese, Korean, Hindi, Arabic, and Chinese.
Here is a sample text about Packt Publications (https://en.wikipedia.org/wiki/Packt), translated into French using Amazon Translate:
SampleText="Packt est une société d'édition fondée en 2003 dont le siège est à Birmingham, au Royaume-Uni, avec des bureaux à Mumbai, en Inde. Packt publie principalement des livres et des vidéos imprimés et électroniques relatifs aux technologies de l'information, y compris la programmation, la conception Web, l'analyse de données et le matériel"
response = comprehend.detect_key_phrases(
    Text=SampleText,
    LanguageCode='fr'
)
print(json.dumps(response, indent=4, sort_keys=True))
Amazon Comprehend is able to identify key phrases along with their location (character offsets) in the text.
Now, if you wanted to buy a book from Packt Publications, you might want to read the reviews and determine whether they are positive or not.
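The location information can be used to map each phrase back to the source string. The sketch below assumes the response shape of the Boto3 `detect_key_phrases` call, where each item in `KeyPhrases` has `Text`, `Score`, `BeginOffset`, and `EndOffset`. The `phrases_in_order` helper and the sample response are hypothetical and hand-written for a shortened version of the French text.

```python
def phrases_in_order(text, response):
    """Return key phrases sorted by position, sliced from the text by offset."""
    phrases = sorted(response.get("KeyPhrases", []), key=lambda phrase: phrase["BeginOffset"])
    return [text[phrase["BeginOffset"]:phrase["EndOffset"]] for phrase in phrases]


short_text = "Packt est une société d'édition"

# Hand-written sample in the shape of a detect_key_phrases response
sample_phrases = {
    "KeyPhrases": [
        {"Text": "une société d'édition", "Score": 0.99, "BeginOffset": 10, "EndOffset": 31},
        {"Text": "Packt", "Score": 0.98, "BeginOffset": 0, "EndOffset": 5},
    ]
}

print(phrases_in_order(short_text, sample_phrases))
```

Slicing by offset rather than searching for the phrase text avoids ambiguity when the same phrase occurs more than once.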
Using the batch_detect_sentiment API, we will show you how you can analyze multiple reviews at once. For this demo, we will pick some sample reviews from the book 40 Algorithms Every Programmer Should Know (https://www.packtpub.com/product/40-algorithms-every-programmer-should-know/9781789801217):
response = comprehend.batch_detect_sentiment(
    TextList=[
        'Well this is an area of my interest and this book is packed with essential knowledge',
        'kinda all in one With good examples and rather easy to follow',
        'There are good examples and samples in the book.',
        '40 Algorithms every Programmer should know is a good start to a vast topic about algorithms'
    ],
    LanguageCode='en'
)
print(json.dumps(response, indent=4, sort_keys=True))
This produces the following output:
Out of these four reviews analyzed, we can definitely see that, overall, it's a positive review for this book. Now, while reading the reviews, there were some reviews in different languages which, being an English reader, I did not understand. Unfortunately, I don't know which languages these reviews use, and therefore what to choose for translation.
response = comprehend.batch_detect_dominant_language(
    TextList=[
        'It include recenet algorithm trend. it is very helpful.',
        'Je ne lai pas encore lu entièrement mais le livre semble expliquer de façon suffisamment claire lensemble de ces algorithmes.'
    ]
)
print(json.dumps(response, indent=4, sort_keys=True))
This gives us the following output:
Interestingly, of the two reviews sent to the batch_detect_dominant_language API, one is in English and one is in French.
We have now covered some of the key APIs, such as detect_entities, detect_key_phrases, batch_detect_sentiment, and batch_detect_dominant_language.
Now, we will see how we can use these APIs in building an application.
In a previous section, we covered Amazon Comprehend's text analysis APIs. You can easily call these APIs in a serverless manner using a Lambda function. AWS Lambda is a serverless, event-driven compute service that can be integrated with Amazon API Gateway and triggered by GET and POST requests. Amazon API Gateway is a fully managed service that lets you build REST APIs (for example, with GET and POST methods) that integrate easily with any application, be it a mobile or web app.
You can create an API to be embedded in your application, where you send the text to be analyzed to API Gateway; API Gateway then calls the AWS Lambda function, based on the type of request it receives. The Lambda function in turn calls the Amazon Comprehend APIs (real-time or batch real-time APIs) and passes the Comprehend response back to API Gateway, as shown in the architecture diagram in Figure 3.25:
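The Lambda part of this architecture can be sketched as follows. This is a minimal illustration, not the chapter's implementation: it assumes an API Gateway proxy integration (the request arrives as a JSON string in `event["body"]`), a hypothetical request field named `text`, and English input. The optional `client` parameter is an addition made here so that the handler can be exercised without AWS credentials.

```python
import json


def analyze_text(client, text, language_code="en"):
    """Call Comprehend for entities and sentiment and return a compact summary."""
    entities = client.detect_entities(Text=text, LanguageCode=language_code)
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)
    return {
        "entities": [
            {"text": entity["Text"], "type": entity["Type"]}
            for entity in entities["Entities"]
        ],
        "sentiment": sentiment["Sentiment"],
    }


def lambda_handler(event, context, client=None):
    """Handle an API Gateway proxy request carrying {"text": "..."} in its body."""
    if client is None:
        import boto3  # imported lazily so the handler can be tested offline
        client = boto3.client("comprehend")
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        body = {}
    if not isinstance(body, dict):
        body = {}
    text = body.get("text", "")  # "text" is a hypothetical field name
    if not text:
        return {"statusCode": 400,
                "body": json.dumps({"error": "Request body must include 'text'."})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(analyze_text(client, text)),
    }
```

Passing the client in as a parameter keeps the Comprehend calls easy to replace with a stub during testing; when deployed, Lambda calls the handler with only `event` and `context`, and the real client is created on first use.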
In this chapter, we covered why you would need to use Amazon Comprehend and Amazon Comprehend Medical. We also discussed the challenges associated with setting up NLP pipelines.
Then, we introduced these services, and covered some key benefits they provide, for example, not needing ML skills, or easily using the APIs to build scalable NLP solutions. After that, we showed some key product features of Amazon Comprehend and Amazon Comprehend Medical through a Console demo. Some of Amazon Comprehend's features are identifying entities, key phrases, and sentiment, as well as detecting dominant language, topic modeling, and so on. For Amazon Comprehend Medical, we covered how you can use both text analysis APIs and ontology APIs to enrich and extract key information from medical notes. Then we gave you a quick walkthrough of these APIs using a Jupyter notebook and covered sync and batch sync APIs. We gained a basic theoretical understanding of creating a serverless application using these APIs.
In the next chapter, we will talk about how you can integrate Amazon Textract with Amazon Comprehend for automating financial documents.