In the previous chapter, we introduced an approach for improving the accuracy of the results we extract from documents using Amazon Augmented AI (Amazon A2I). We saw that Amazon A2I can be added to a document processing workflow to review model prediction accuracy, which enabled us to include human reviews in LiveRight's check processing system.
In this chapter, we will extend the previous approach by including Amazon Comprehend for text-based insights, thereby demonstrating an end-to-end process for setting up an auditing workflow for your custom named entity recognition use cases. We put this solution together based on our collective experience and the usage trends we have observed in our careers. This chapter is hands-on throughout, but all the code samples we need are provided.
With machine learning (ML), companies can set up automated document processing solutions that can be trained to recognize and extract custom entities from documents, helping them derive unique insights from their text corpus. These insights can drive strategic decisions. However, there are challenges to navigate first. Companies typically receive large volumes of incoming documents with different templates, varying contents, and multiple languages. As a business grows, the type and volume of its documents evolve, and keeping the various templates, formats, and rules synchronized with operational needs quickly becomes a maintenance overhead. Furthermore, you have to ensure that your infrastructure can scale to support your processing needs.
To solve these challenges, we will show you how to use the ready-made ML capabilities of Amazon Textract for text extraction, leverage transfer learning to create a custom entity recognition model with Amazon Comprehend, and audit the predictions with a human review loop using Amazon A2I. We introduced Amazon A2I in detail in Chapter 13, Improving the Accuracy of Document Processing Workflows. In this chapter, we will navigate through the following sections:
For this chapter, you will need access to an AWS account at https://aws.amazon.com/console/. Please refer to the Signing up for an AWS account subsection within the Setting up your AWS environment section in Chapter 2, Introducing Amazon Textract, for detailed instructions on how you can sign up for an AWS account and sign in to the AWS Management Console.
The Python code and sample datasets for the solution discussed in this chapter can be found at the following link: https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/tree/main/Chapter%2014.
Check out the following video to see the Code in Action at https://bit.ly/3GoBh1B.
Financial organizations receive significant volumes of loan applications every day. While the major organizations have switched to fully digital processing, there are still many banks and institutions across the world that rely on paper documents. To illustrate our example, let's go back to our fictitious banking corporation, LiveRight Holdings Private Limited, and review the requirements for this use case:
As the enterprise architect for the project, you decide to use Amazon Textract for its pre-trained ML model for text extraction; the Custom Entity Recognizer feature of Amazon Comprehend to incrementally build your own entity recognizer for loan applications without needing to write complex natural language processing (NLP) algorithms; and Amazon A2I to set up a human review workflow that monitors predictions from your entity recognizer and sends feedback to improve its detection of entities unique to the use case.
You plan to have the private human workflow available for the first 2 to 3 months and subsequently disable it, at which point the document processing workflow will become fully automated. Based on the entity labels that the human team checks and updates, you need to determine an authenticity check decision of either APPROVE, SUMMARY APPROVE, or REJECT. This decision, along with the relevant content from the loan application, should be stored in an Amazon DynamoDB table (a fully managed, low-latency NoSQL database service) for loan processors to access the content and enable pre-approval qualification. The components of the solution we will build are shown in the following figure:
We will be walking through our solution using an Amazon SageMaker Jupyter notebook that will allow us to review the code and results as we execute it step by step. The solution build includes the following tasks:
Now that we've got the context for the exercise and gone over our intended process, let's start building the solution.
In the previous section, we introduced the loan application approval use case, covered the architecture of the solution we will be building, and briefly walked through the solution components and workflow steps. In this section, we will get right down to work and start executing the tasks to build our solution. But first, there are prerequisites we have to take care of.
If you have not done so in the previous chapters, you will first have to create a Jupyter notebook and set up Identity and Access Management (IAM) permissions for that notebook role to access the AWS services we will use in this notebook. After that, you will need to clone the GitHub repository (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services), create an Amazon S3 bucket (https://aws.amazon.com/s3/), and provide the bucket name in the notebook to start execution. Please follow the next steps to complete these tasks before we can execute the cells from our notebook:
Note:
Please ensure you have completed the tasks mentioned in the Technical requirements section.
IAM role permissions while creating Amazon SageMaker Jupyter notebooks
Accept the default for the IAM role at the notebook creation time to allow access to any S3 bucket.
Work through the steps in this notebook that correspond to the next few subheadings in this section by executing one cell at a time, and please read the descriptions provided above each notebook cell.
To train the Comprehend custom entity recognizer and set up real-time endpoints, we have to enable additional policies and also update the trust relationship for our SageMaker notebook role. Please refer to Changing IAM permissions and trust relationships for the Amazon SageMaker Notebook execution role in the Setting up your AWS environment section in Chapter 2, Introducing Amazon Textract, for detailed instructions on how to execute the following steps:
{ "Version": "2012-10-17", "Statement": [ {
"Action": [
"iam:PassRole"
],
"Effect": "Allow",
"Resource": "<your sagemaker notebook execution role ARN">
}
]
}
{ "Version": "2012-10-17", "Statement": [
{ "Effect": "Allow",
"Principal":
{ "Service":
[ "sagemaker.amazonaws.com",
"s3.amazonaws.com",
"comprehend.amazonaws.com" ]
},
"Action": "sts:AssumeRole" }
]
}
Now that we have created our notebook and configured the IAM role to run the walkthrough notebook, in the next section we will train an Amazon Comprehend entity recognizer.
Let's begin by training a custom entity recognizer to detect entities unique to this solution. Amazon Comprehend offers pre-trained entity recognition features, which we learned about in the previous chapter. For this solution, we will use the Custom Entity Recognition feature of Amazon Comprehend, which lets you train a recognizer for your own needs using incremental training. All we have to do is provide a list of the entities we want it to recognize and a raw dataset containing lines of text that provide the context in which those entities appear. Open the notebook and execute the steps as follows:
bucket = '<bucket-name>'
a) First, import boto3 and initialize the handle for Amazon Comprehend:
import boto3
comprehend = boto3.client('comprehend')
b) Then, define the variables for the S3 prefixes and upload the training dataset and the entity list to the S3 bucket:
s3 = boto3.client('s3')
prefix = 'chapter4'  # S3 prefix for this chapter's objects, as seen in the outputs later in this chapter
s3_raw_key = prefix + "/train/raw_txt.csv"
s3_entity_key = prefix + "/train/entitylist.csv"
s3.upload_file('train/raw_txt.csv', bucket, s3_raw_key)
s3.upload_file('train/entitylist.csv', bucket, s3_entity_key)
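The entity list CSV follows Amazon Comprehend's documented two-column format, with a Text column for the entity text and a Type column for its label. As a minimal illustrative sketch (these rows mirror entries we will see appended to the file during retraining, not the full dataset):

Text,Type
Years:18,PERSON
Country:US,GHOST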
c) Continue executing the rest of the cells in the notebook to declare the variables with the full S3 URIs for our input documents, define the input object for the entity recognizer, and finally, call the Comprehend API to create the custom entity recognizer. This will start the training job:
import datetime
cer_name = "loan-app-recognizer" + str(datetime.datetime.now().strftime("%s"))
cer_response = comprehend.create_entity_recognizer(
    RecognizerName=cer_name,
    DataAccessRoleArn=role,
    InputDataConfig=cer_input_object,
    LanguageCode="en"
)
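For reference, the cer_input_object passed above follows the InputDataConfig structure of the CreateEntityRecognizer API. A minimal sketch, assuming the S3 keys we uploaded earlier and the PERSON and GHOST entity types that appear in the training output later in this chapter:

cer_input_object = {
    'DataFormat': 'COMPREHEND_CSV',
    'EntityTypes': [{'Type': 'PERSON'}, {'Type': 'GHOST'}],
    'Documents': {'S3Uri': 's3://' + bucket + '/' + s3_raw_key},
    'EntityList': {'S3Uri': 's3://' + bucket + '/' + s3_entity_key}
}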
d) Print the results of the custom entity recognizer training job:
import pprint
pp = pprint.PrettyPrinter(indent=4)
response = comprehend.describe_entity_recognizer(
    EntityRecognizerArn=cer_response['EntityRecognizerArn']
)
pp.pprint(response)
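Training takes a while to complete, so you may want to poll the recognizer status before moving on. A simple polling sketch built on the same describe_entity_recognizer call; the status values come from the Comprehend API:

import time
while True:
    response = comprehend.describe_entity_recognizer(
        EntityRecognizerArn=cer_response['EntityRecognizerArn'])
    status = response['EntityRecognizerProperties']['Status']
    print('Recognizer status:', status)
    if status in ('TRAINED', 'IN_ERROR'):
        break
    time.sleep(60)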
Refer to Step 2 in the notebook (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2014/chapter14-auditing-workflows-named-entity-detection-forGitHub.ipynb) for the instructions we will execute now.
In this step, we will create a private team using the Amazon SageMaker labeling workforce console, and we will add ourselves to the private team as a worker. This is required so we can log in to the labeling task UI when we reach the Amazon A2I step in this solution. Please execute the following tasks:
WORKTEAM_ARN = '<workteam-arn>'
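We create the private team through the console in this chapter, but if you prefer to script it, the SageMaker CreateWorkteam API offers an equivalent path. A hedged sketch, with a hypothetical team name and Cognito identifiers you would replace with your own:

sagemaker = boto3.client('sagemaker')
workteam_response = sagemaker.create_workteam(
    WorkteamName='nlp-chapter14-private-team',  # hypothetical name
    MemberDefinitions=[{
        'CognitoMemberDefinition': {
            'UserPool': '<cognito-user-pool-id>',
            'UserGroup': '<cognito-user-group>',
            'ClientId': '<cognito-app-client-id>'
        }
    }],
    Description='Private review team for the loan application workflow'
)
WORKTEAM_ARN = workteam_response['WorkteamArn']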
Now that we have added the private team, let's review our loan application by extracting the contents using Amazon Textract.
This section corresponds to Step 3 in the notebook: https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2014/chapter14-auditing-workflows-named-entity-detection-forGitHub.ipynb.
In this step, we will review the sample loan application, and then use Amazon Textract to extract the key-value pairs or form data that is of interest to our solution, creating an inference request CSV file to pass as an input to our Comprehend custom entity recognizer for detecting entities. Please follow through using the notebook and execute the cells to perform the tasks required for this step:
documentName = "input/sample-loan-application.png"
display(Image(filename=documentName))
s3.upload_file(documentName,bucket,prefix+'/'+documentName)
!pip install amazon-textract-response-parser
textract = boto3.client('textract')
response = textract.analyze_document(
    Document={'S3Object': {
        'Bucket': bucket,
        'Name': prefix + '/' + documentName
    }},
    FeatureTypes=['FORMS'])
from trp import Document
import pandas as pd

doc = Document(response)
df = pd.DataFrame()
# Iterate over the form fields on each page, keeping key-value pairs
# and skipping checkbox selections
x = 0
for page in doc.pages:
    for field in page.form.fields:
        if field.key is not None and field.value is not None:
            if field.value.text not in ('SELECTED', 'NOT_SELECTED'):
                df.at[x, 'key'] = field.key.text
                df.at[x, 'value'] = field.value.text
                x += 1
df
Now, let's cover detecting entities using the Amazon Comprehend custom entity recognizer.
Now that we have what we need from the loan application, let's construct a string that will become our inference request to the Comprehend custom entity recognizer we trained at the beginning of this walkthrough (Step 1 in the notebook). Before we can detect the entities, we need to create a real-time endpoint and associate it with our entity recognizer. When you deploy this solution in batch mode or use it to process multiple documents, you will use the Amazon Comprehend StartEntitiesDetectionJob API: https://docs.aws.amazon.com/comprehend/latest/dg/API_StartEntitiesDetectionJob.html.
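As a hedged sketch of that batch alternative, with hypothetical S3 input and output prefixes (the real-time path we actually use in this chapter follows in the notebook cells):

batch_response = comprehend.start_entities_detection_job(
    JobName='loan-app-entities-batch',  # hypothetical job name
    EntityRecognizerArn=cer_response['EntityRecognizerArn'],
    LanguageCode='en',
    DataAccessRoleArn=role,
    InputDataConfig={
        'S3Uri': 's3://' + bucket + '/' + prefix + '/inference/',
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={'S3Uri': 's3://' + bucket + '/' + prefix + '/inference-output/'}
)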
Please follow the instructions in this section by executing the cells in Step 4 in the notebook: https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2014/chapter14-auditing-workflows-named-entity-detection-forGitHub.ipynb:
# df_T is built in the preceding notebook cell as the transpose of df,
# with the extracted keys as columns (for example, df_T = df.set_index('key').T)
df_T.columns = df_T.columns.str.rstrip()
df_T['doc'] = 1
df_T
for idx, row in df_T.iterrows():
    entry = ('Country' + ':' + str(row['Country']).strip() + " "
             + 'Years' + ':' + str(row['Years']).strip() + " "
             + 'Cell Phone' + ':' + str(row['Cell Phone']).strip() + " "
             + 'Name' + ':' + str(row['Name']).strip() + " "
             + 'Social Security Number' + ':' + str(row['Social Security Number']).strip() + " "
             + 'TOTAL $' + ':' + str(row['TOTAL $']).strip() + " "
             + 'Date of Birth' + ':' + str(row['Date of Birth']).strip())
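Based on the entity offsets in the detection output shown later in this step, the constructed entry string looks like this:

Country:US Years:18 Cell Phone:(555 ) 0200 1234 Name:Kwaku Mensah Social Security Number:123 - 45 - 6789 TOTAL $:8000.00/month Date of Birth:01 / 01 / 1953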
custom_recognizer_arn = cer_response['EntityRecognizerArn']
endpoint_response = comprehend.create_endpoint(
    EndpointName='nlp-chapter4-cer-endpoint',
    ModelArn=custom_recognizer_arn,
    DesiredInferenceUnits=2,
    DataAccessRoleArn=role
)
endpoint_response['EndpointArn']
arn:aws:comprehend:us-east-1:<aws-account-nr>:entity-recognizer-endpoint/nlp-chapter4-cer-endpoint
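Endpoint creation takes several minutes. A minimal polling sketch using the Comprehend DescribeEndpoint API, assuming the endpoint ARN printed above:

import time
while True:
    ep = comprehend.describe_endpoint(EndpointArn=endpoint_response['EndpointArn'])
    status = ep['EndpointProperties']['Status']
    print('Endpoint status:', status)
    if status in ('IN_SERVICE', 'FAILED'):
        break
    time.sleep(60)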
response = comprehend.detect_entities(
    Text=entry,
    LanguageCode='en',
    EndpointArn=endpoint_response['EndpointArn']
)
print(response)
{'Entities': [{'Score': 0.9999999403953552, 'Type': 'PERSON', 'Text': 'Years:18', 'BeginOffset': 11, 'EndOffset': 19}, {'Score': 0.9999998211860657, 'Type': 'PERSON', 'Text': 'Cell Phone:(555 ) 0200 1234', 'BeginOffset': 20, 'EndOffset': 47}, {'Score': 1.0, 'Type': 'PERSON', 'Text': 'Name:Kwaku Mensah', 'BeginOffset': 48, 'EndOffset': 65}, {'Score': 1.0, 'Type': 'PERSON', 'Text': 'Social Security Number:123 - 45 - 6789', 'BeginOffset': 66, 'EndOffset': 104}, {'Score': 1.0, 'Type': 'PERSON', 'Text': 'TOTAL $:8000.00/month', 'BeginOffset': 105, 'EndOffset': 126}, {'Score': 1.0, 'Type': 'PERSON', 'Text': 'Date of Birth:01 / 01 / 1953', 'BeginOffset': 127, 'EndOffset': 155}], 'ResponseMetadata': {'RequestId': 'ecbd75fd-22bc-4dca-9aa0-73f58f6784e4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'ecbd75fd-22bc-4dca-9aa0-73f58f6784e4', 'content-type': 'application/x-amz-json-1.1', 'content-length': '620', 'date': 'Tue, 06 Jul 2021 22:26:11 GMT'}, 'RetryAttempts': 0}}
import json

human_loop_input = []
data = {}
ent = response['Entities']
existing_entities = []
if ent is not None and len(ent) > 0:
    for entity in ent:
        current_entity = {}
        current_entity['label'] = entity['Type']
        current_entity['text'] = entity['Text']
        current_entity['startOffset'] = entity['BeginOffset']
        current_entity['endOffset'] = entity['EndOffset']
        existing_entities.append(current_entity)
data['ORIGINAL_TEXT'] = entry
data['ENTITIES'] = existing_entities
human_loop_input.append(data)
print(human_loop_input)
...126}, {'label': 'PERSON', 'text': 'Date of Birth:01 / 01 / 1953', 'startOffset': 127, 'endOffset': 155}]}]
In this section, we detected entities with the Amazon Comprehend custom entity recognizer. In the next section, we will walk through how you can use Amazon A2I to review the predictions and correct any entities where the prediction differs from the actual value.
For the code blocks discussed here, refer to Step 5 in the notebook: https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2014/chapter14-auditing-workflows-named-entity-detection-forGitHub.ipynb.
Now that we have the detected entities from our Comprehend custom entity recognizer, it's time to set up a human workflow using the private team we created in Step 2 and send the results to the Amazon A2I human loop for review and any required modifications. Subsequently, we will update the entitylist.csv file that we originally used to train our Comprehend custom entity recognizer, preparing it for retraining based on the human feedback:
timestamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
# Amazon SageMaker client
sagemaker = boto3.client('sagemaker')
# Amazon Augment AI (A2I) client
a2i = boto3.client('sagemaker-a2i-runtime')
# Flow definition name
flowDefinition = 'fd-nlp-chapter14-' + timestamp
# Task UI name - this value is unique per account and region. You can also provide your own value here.
taskUIName = 'ui-nlp-chapter14-' + timestamp
# Flow definition outputs
OUTPUT_PATH = f's3://{bucket}/{prefix}/a2i-results'
def create_task_ui():
    '''
    Creates a Human Task UI resource.

    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response
# Create task UI
humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)
arn:aws:sagemaker:us-east-1:<aws-account-nr>:human-task-ui/ui-nlp-chapter14-<timestamp>
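The notebook next creates the flow definition and starts the human loop. A sketch of those steps using the standard SageMaker and A2I APIs, assuming the names defined above (the task title and description strings are illustrative):

create_workflow_definition_response = sagemaker.create_flow_definition(
    FlowDefinitionName=flowDefinition,
    RoleArn=role,
    HumanLoopConfig={
        'WorkteamArn': WORKTEAM_ARN,
        'HumanTaskUiArn': humanTaskUiArn,
        'TaskCount': 1,
        'TaskDescription': 'Review and correct the entities detected by Comprehend',
        'TaskTitle': 'Loan application entity review'
    },
    OutputConfig={'S3OutputPath': OUTPUT_PATH}
)
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn']

import uuid
humanLoopName = str(uuid.uuid4())
start_loop_response = a2i.start_human_loop(
    HumanLoopName=humanLoopName,
    FlowDefinitionArn=flowDefinitionArn,
    HumanLoopInput={'InputContent': json.dumps(human_loop_input[0])}
)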
completed_human_loops = []
a2i_resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print(f'HumanLoop Name: {humanLoopName}')
print(f'HumanLoop Status: {a2i_resp["HumanLoopStatus"]}')
print(f'HumanLoop Output Destination: {a2i_resp["HumanLoopOutput"]}')
print(' ')
if a2i_resp["HumanLoopStatus"] == "Completed":
    completed_human_loops.append(a2i_resp)
HumanLoop Name: 0fe076a4-b6eb-49ea-83bf-78f953a71c89
HumanLoop Status: InProgress
HumanLoop Output Destination: {'OutputS3Uri': 's3://<your-bucket-name>/chapter4/a2i-results/fd-nlp-chapter4-2021-07-06-22-32-21/2021/07/06/22/33/08/<hashnr>/output.json'}
In the next section, we will walk through how your private reviewers can log in to the console and review the entities detected by Amazon Comprehend.
Now, we will log in to the Amazon A2I task UI to review, change, and re-label the detected entities from our Comprehend custom entity recognizer. Execute the cells in the notebook based on the instructions discussed in this section:
workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])
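Once the review task is completed, the notebook reads the review results back from S3. The a2i_entities and original_text variables used in the next code block come from the saved output.json; a sketch of how that file could be fetched and parsed, assuming the standard crowd-entity-annotation output shape (the exact JSON paths depend on your task template):

output_uri = a2i_resp['HumanLoopOutput']['OutputS3Uri']
output_bucket, output_key = output_uri.replace('s3://', '').split('/', 1)
obj = s3.get_object(Bucket=output_bucket, Key=output_key)
a2i_output = json.loads(obj['Body'].read())
original_text = a2i_output['inputContent']['ORIGINAL_TEXT']
a2i_entities = a2i_output['humanAnswers'][0]['answerContent']['crowd-entity-annotation']['entities']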
retrain = 'N'
el = open('train/entitylist.csv', 'r').read()
# Append any human-annotated entity that is not already in the entity list
for annotated_entity in a2i_entities:
    if original_text[annotated_entity['startOffset']:annotated_entity['endOffset']] not in el:
        retrain = 'Y'
        word = ' ' + original_text[annotated_entity['startOffset']:annotated_entity['endOffset']] + ',' + annotated_entity['label'].upper()
        print("Updating Entity List with: " + word)
        open('train/entitylist.csv', 'a').write(word)
if retrain == 'Y':
    print("Entity list updated, model to be retrained")
Updating Entity List with:  Country:US,GHOST
Updating Entity List with:  Years:18,PERSON
Updating Entity List with:  Cell Phone:(555 ) 0200 1234,PERSON
Entity list updated, model to be retrained
The human review response is automatically saved as a JSON file of labels in the Amazon S3 bucket. In the next section, we will use these modified or reviewed labels to retrain our custom entity recognizer model.
We will now retrain our Comprehend custom entity recognizer. The cells to be executed are similar to what we did when we originally trained our recognizer:
After declaring variables, we execute the following code block to start the training:
import datetime
cer_name = "retrain-loan-recognizer" + str(datetime.datetime.now().strftime("%s"))
cer_response = comprehend.create_entity_recognizer(
    RecognizerName=cer_name,
    DataAccessRoleArn=role,
    InputDataConfig=cer_input_object,
    LanguageCode="en"
)
{   'EntityRecognizerProperties': {
        'DataAccessRoleArn': 'arn:aws:iam::<aws-account-nr>:role/service-role/<execution-role>',
        'EntityRecognizerArn': 'arn:aws:comprehend:us-east-1:<aws-account-nr>:entity-recognizer/retrain-loan-recognizer1625612436',
        'InputDataConfig': {
            'DataFormat': 'COMPREHEND_CSV',
            'Documents': {'S3Uri': 's3://<s3-bucket>/chapter4/train/raw_txt.csv'},
            'EntityList': {'S3Uri': 's3://<s3-bucket>/chapter4/train/entitylist.csv'},
            'EntityTypes': [{'Type': 'PERSON'}, {'Type': 'GHOST'}]},
        'LanguageCode': 'en',
        'Status': 'SUBMITTED',
        'SubmitTime': datetime.datetime(2021, 7, 6, 23, 0, 36, 759000, tzinfo=tzlocal())}}
Let's now execute the steps to store the results of the authenticity check for access by downstream applications.
Now that we understand how to set up an auditing workflow, let's execute the steps needed to persist the results from our entity detection so we can send them to a downstream application. If half or more of the entities (including an even split) are of the GHOST type, we will send a REJECT decision; if PERSON entities form the majority but less than 80% of the total, we will send SUMMARY APPROVE; and if PERSON entities make up more than 80%, we will send APPROVE:
[{'endOffset': 10, 'label': 'GHOST', 'startOffset': 0},
{'endOffset': 19, 'label': 'PERSON', 'startOffset': 11},
{'endOffset': 47, 'label': 'PERSON', 'startOffset': 20},
{'endOffset': 65, 'label': 'PERSON', 'startOffset': 48},
{'endOffset': 104, 'label': 'PERSON', 'startOffset': 66},
{'endOffset': 126, 'label': 'PERSON', 'startOffset': 105},
{'endOffset': 155, 'label': 'PERSON', 'startOffset': 127}]
from collections import Counter

# labellist holds the entity labels from the reviewed A2I output (built in a prior cell)
docstatus = ''
ghost = float(Counter(labellist)['GHOST'])
person = float(Counter(labellist)['PERSON'])
if ghost >= len(labellist) * .5:
    docstatus = 'REJECT'
elif len(labellist) * .5 < person < len(labellist) * .8:
    docstatus = 'SUMMARY APPROVE'
elif person > len(labellist) * .8:
    docstatus = 'APPROVE'
print(docstatus)
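Finally, the decision and the extracted loan application content are persisted to DynamoDB for downstream access. A minimal sketch, assuming a pre-created table with the hypothetical name loan-application-decisions and a string partition key named application_id:

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('loan-application-decisions')  # hypothetical table name
item = {
    'application_id': humanLoopName,  # reuse the human loop name as a unique key
    'decision': docstatus
}
# Add the key-value pairs extracted from the loan application by Textract
for _, row in df.iterrows():
    item[row['key'].strip()] = row['value']
table.put_item(Item=item)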
That concludes the solution build. Please refer to the Further reading section for more examples of approaches for this use case, as well as the code sample for building a similar solution using AWS Lambda and CloudFormation.
In this chapter, we learned how to build an auditing workflow for named entity recognition, using Amazon Textract, Amazon Comprehend, and Amazon A2I, to solve real-world document processing challenges that many organizations face today. We reviewed the loan authentication use case for validating documents before they are passed to a loan processor. We considered an architecture driven by requirements such as reducing the validation time from 2-4 weeks to 24 hours within the first 3 months of solution implementation. We assumed that you, the reader, are the solution architect assigned to this project, and we reviewed an overview of the solution components along with an architectural illustration in Figure 14.1.
We then went through the prerequisites for the solution build, set up an Amazon SageMaker notebook instance, cloned our GitHub repository, and started executing the code in the notebook based on the instructions in this chapter. We covered training an Amazon Comprehend custom entity recognizer, setting up our private work team using Amazon SageMaker labeling workforces, extracting the relevant content from the loan application using Amazon Textract, and sending it to the Comprehend custom entity recognizer for detecting entities. We then forwarded the detection results to an Amazon A2I human review loop, completed the human task steps using the UI, reviewed the results, updated the entity list to retrain the custom entity recognizer, and finally, stored the document contents and the loan validation decision in an Amazon DynamoDB table for downstream processing.
In the next chapter, we will build a classic use case that's tailor-made for NLP, namely an active learning workflow for text classification. We will train a text classification model using Amazon Comprehend custom classification to label documents into classes, review predictions using Amazon A2I, and retrain the classifier based on feedback from the Amazon A2I human review loop. We will demonstrate how the solution becomes more intelligent over time, improving classification accuracy thanks to the feedback loop.