In the previous chapter, we were introduced to Amazon Comprehend and Amazon Comprehend Medical, and we covered how to use these services to derive insights from text. We also spent some time understanding how Natural Language Processing algorithms work and the different types of insights you can uncover, and ran code samples to try out the Amazon Comprehend APIs.
In this chapter, we will walk through our first real-world use case: automating a document management workflow that many organizations struggle with today. We put together this solution based on our collective experience and the usage trends we have observed in our careers. Fasten your seat belts and get ready to architect an end-to-end AI solution one building block at a time, and watch it take shape in front of you. This chapter is hands-on throughout, and all the code samples we need to get going are provided.
We will dive deep into how you can automate document processing with Amazon Textract and then we will cover how you can set up compliance and control in the documents using Amazon Comprehend. Lastly, we will talk about architecture best practices while designing real-time document processing workflows versus batch processing. We will provide detailed code samples, designs, and development approaches, and a step-by-step guide on how to set up and run these examples along with access to GitHub repositories.
In this chapter, we will cover the following topics:
For this chapter, you will need access to an AWS account. Please make sure to follow the instructions specified in the Technical requirements section in Chapter 2, Introducing Amazon Textract, to create your AWS account, and log in to the AWS Management Console before trying the steps in this chapter.
The Python code and sample datasets for a walk-through of this chapter's code are provided at the following link: https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/tree/main/Chapter%2004. Please use the instructions in the following sections along with the code in the repository to build the solution.
Check out the following video to see the Code in Action at https://bit.ly/3GlcCet.
We discussed in the previous chapter how Amazon Textract can help us digitize scanned documents such as PDFs and images by extracting text from any document. We also covered how Amazon Comprehend can help us extract insights from these documents, including entities, Personally Identifiable Information (PII), and sentiment.
Now, these services can be used together in an architecture to automate document processing workflows for most organizations, whether in finance or in healthcare (which we cover in Chapter 12, AI and NLP in Healthcare).
Let's start with a fictitious bank, LiveRight Pvt Ltd., whose customers are applying for home loans. The loan origination process can involve more than 400 documents that must be submitted and reviewed by the bank before a home loan is approved. Automating this process makes getting a loan easier for banks and customers alike. The challenge with automating these workflows is that there are more than 1,000 templates in the loan origination process, and a template-based Optical Character Recognition (OCR) system would require managing all of them. Moreover, these template-based OCR approaches do not scale and break when formats change. That's why we use Amazon Textract: it extracts text from virtually any document, enabling these documents to be processed in hours rather than weeks or months.
Once you have extracted the data from these forms or semi-structured documents, you will want to set up compliance and control on it; for example, making sure that any PII is masked before further processing. You will also want to extract the entities relevant to loan approval, such as the loan amount or the details of the requester. This is where Amazon Comprehend can help. In fact, you can perform custom classification of the submitted documents and detect custom entities based on your requirements with Amazon Comprehend; for example, text extracted by Amazon Textract can be sent to Amazon Comprehend for custom classification to determine whether the submitted document is a driving license or a W2 form.
The following is the architecture of how you can use Amazon Textract and Amazon Comprehend together to automate your existing document flow:
In this architecture, you have documents coming in, and these documents may be financial documents, legal documents, mortgage applications, and so on. You send these documents to Amazon Textract to extract text. Once you have the extracted text, you can send it to Amazon Comprehend to extract insights. These insights can include classifying the documents by type, identifying PII, or performing named entity recognition (NER) using custom entity recognition. We cover custom entities in Chapter 14, Auditing Named Entity Recognition Workflows, and document classification in Chapter 15, Classifying Documents and Setting up Human in the Loop for Active Learning.
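The Textract-then-Comprehend flow just described can be sketched in a few lines of boto3. This is our own hedged sketch, not the book's notebook code: the helper that joins LINE blocks is pure Python, while the AWS calls assume valid credentials and a default region.

```python
def lines_from_textract(response):
    """Join the text of all LINE blocks in a DetectDocumentText response."""
    return "\n".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    )

def process_document(image_bytes, region="us-east-1"):
    """Textract -> Comprehend pipeline sketch (requires AWS credentials)."""
    import boto3  # imported here so lines_from_textract works without boto3
    textract = boto3.client("textract", region_name=region)
    comprehend = boto3.client("comprehend", region_name=region)
    text = lines_from_textract(
        textract.detect_document_text(Document={"Bytes": image_bytes}))
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")["Entities"]
    pii = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    return text, entities, pii
```

You would call `process_document(open("bankstatement.png", "rb").read())` to run the whole pipeline on the sample document we use later in this chapter.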
In this section, we covered how you can easily and quickly set up an automated document processing workflow with Amazon Textract and Amazon Comprehend by using these services together. In the next section, we will talk about how you can use these services together to set up compliance and control for LiveRight Pvt Ltd., especially by means of masking or redacting the PII data in their forms.
In this section, we will talk about how LiveRight Pvt Ltd. can set up compliance and control as well as automate their loan origination process using Amazon Textract and Amazon Comprehend. We will walk you through the following architecture using code samples in a Jupyter notebook:
We will walk you through this architecture using a single document and sample code. However, this architecture can be automated to process a large number of documents using Step Functions and Lambda functions in a serverless manner. In this architecture, we will show you the following:
So, let's get started with setting up the notebook.
If you have not done so in the previous chapters, you will first have to create an Amazon SageMaker Jupyter notebook and set up Identity and Access Management (IAM) permissions for that notebook role to access the AWS services we will use in this notebook. After that, you will need to clone the GitHub repository (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services). Please perform the following steps to complete these tasks before we can execute the cells from our notebook:
IAM Role Permission while Creating Amazon SageMaker Jupyter Notebooks
Accept the default option for the IAM role at notebook creation time to allow access to any S3 bucket.
This will take you to the home folder of your notebook instance.
Next, we will cover the additional IAM prerequisites.
To train the Comprehend custom entity recognizer and to set up real-time endpoints, we have to enable additional policies and update the trust relationships for our SageMaker notebook role. To do this, attach AmazonS3FullAccess, TextractFullAccess, and ComprehendFullAccess policies to your Amazon SageMaker Notebook IAM Role. To execute this step, please refer to Changing IAM permissions and trust relationships for the Amazon SageMaker notebook execution role in the Setting up your AWS environment section in Chapter 2, Introducing Amazon Textract.
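If you prefer to script the policy attachment rather than click through the console, a sketch along these lines works. The managed policy names below (for example, AmazonTextractFullAccess for Textract access) are our assumptions and should be verified in the IAM console, and role_name is your notebook's execution role name.

```python
def policy_arns(names):
    """Build the ARNs of AWS-managed policies from their names."""
    return ["arn:aws:iam::aws:policy/" + name for name in names]

def attach_notebook_policies(role_name):
    """Attach the policies this chapter needs to the SageMaker notebook role.
    Requires IAM permissions; the policy names are assumed, verify them."""
    import boto3  # lazy import so policy_arns stays usable without boto3
    iam = boto3.client("iam")
    for arn in policy_arns(["AmazonS3FullAccess",
                            "AmazonTextractFullAccess",
                            "ComprehendFullAccess"]):
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
```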
Now that we have the necessary IAM roles and notebook set up in the Amazon SageMaker notebook instance, let's jump to the code walk-through.
In this section, we will walk through the code for the architecture we discussed in Figure 14.2: automating document processing with Amazon Textract and setting up compliance and control with PII masking using Amazon Comprehend. We will use this notebook:
from IPython.display import Image, display

documentName = "bankstatement.png"
display(Image(filename=documentName))
You will get the following response:
client = boto3.client(service_name='textract',
region_name= 'us-east-1',
endpoint_url='https://textract.us-east-1.amazonaws.com')
with open(documentName, 'rb') as file:
img_test = file.read()
bytes_test = bytearray(img_test)
print('Image loaded', documentName)
response = client.detect_document_text(Document={'Bytes': bytes_test})
print(response)
You get a JSON response from Amazon Textract using the Detect Document Text Sync API.
# trp is the amazon-textract-response-parser helper module
from trp import Document

doc = Document(response)
page_string = ''
for page in doc.pages:
for line in page.lines:
page_string += str(line.text)
print(page_string)
Now that we have the extracted text from the Textract JSON response, let's move on to the next step.
a) First, initialize the boto3 handle for Amazon Comprehend:
comprehend = boto3.client('comprehend')
b) Then, call the Amazon Comprehend detect_pii_entities API (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/comprehend.html#Comprehend.Client.detect_pii_entities), passing it the aggregated text from our sample bank statement image:
piilist=comprehend.detect_pii_entities(Text = page_string, LanguageCode='en')
redacted_box_color='red'
dpi = 72
pii_detection_threshold = 0.00
print ('Finding PII text...')
not_redacted=0
redacted=0
for pii in piilist['Entities']:
print(pii['Type'])
if pii['Score'] > pii_detection_threshold:
print ("detected as type '"+pii['Type']+"' and will be redacted.")
redacted+=1
else:
print (" was detected as type '"+pii['Type']+"', but did not meet the confidence score threshold and will not be redacted.")
not_redacted+=1
print ("Found", redacted, "text boxes to redact.")
print (not_redacted, "additional text boxes were detected, but did not meet the confidence score threshold.")
You will get a response identifying the PII in the text, which will be redacted in the next step using an Amazon Comprehend PII analysis job.
We will mask/redact these 15 PII entities we found in the sample bank statement.
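The batch job below does the masking for us, but it is worth seeing what that masking amounts to: each entity in the detect_pii_entities response carries BeginOffset, EndOffset, and Score fields, so an in-memory sketch of the redaction (ours, not the book's notebook code) looks like this:

```python
def mask_pii(text, entities, mask_char="*", threshold=0.5):
    """Replace each detected PII span in text with mask characters.
    `entities` is the Entities list from detect_pii_entities; each item
    carries BeginOffset/EndOffset (character positions) and Score."""
    chars = list(text)
    for entity in entities:
        if entity["Score"] >= threshold:
            for i in range(entity["BeginOffset"], entity["EndOffset"]):
                chars[i] = mask_char
    return "".join(chars)
```

For example, `mask_pii("Account 1234 active", [{"BeginOffset": 8, "EndOffset": 12, "Score": 0.99}])` returns `"Account **** active"`.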
a) The job requires the S3 location of the documents to be redacted and the S3 location where you want the redacted output. Run the following cell to specify the location of the S3 text file we want redacted:
import uuid
InputS3URI= "s3://"+bucket+ "/pii-detection-redaction/pii_data.txt"
print(InputS3URI)
OutputS3URI="s3://"+bucket+"/pii-detection-redaction"
print(OutputS3URI)
b) Now we will call comprehend.start_pii_entities_detection_job by setting parameters for redaction and passing the input S3 location where data is stored by running the following notebook cell:
response = comprehend.start_pii_entities_detection_job(
InputDataConfig={
'S3Uri': InputS3URI,
'InputFormat': 'ONE_DOC_PER_FILE'
},
OutputDataConfig={
'S3Uri': OutputS3URI
},
Mode='ONLY_REDACTION',
RedactionConfig={
'PiiEntityTypes': [
'ALL',
],
'MaskMode': 'MASK',
'MaskCharacter': '*'
},
DataAccessRoleArn = role,
JobName=job_name,
LanguageCode='en',
)
Note
With this API or batch job, you can specify the mode, redaction configuration, and language.
Here are the parameters that can be modified as shown in the following code block:
Mode='ONLY_REDACTION'|'ONLY_OFFSETS',
RedactionConfig={
'PiiEntityTypes': [
'BANK_ACCOUNT_NUMBER'|'BANK_ROUTING'|'CREDIT_DEBIT_NUMBER'|'CREDIT_DEBIT_CVV'|'CREDIT_DEBIT_EXPIRY'|'PIN'|'EMAIL'|'ADDRESS'|'NAME'|'PHONE'|'SSN'|'DATE_TIME'|'PASSPORT_NUMBER'|'DRIVER_ID'|'URL'|'AGE'|'USERNAME'|'PASSWORD'|'AWS_ACCESS_KEY'|'AWS_SECRET_KEY'|'IP_ADDRESS'|'MAC_ADDRESS'|'ALL',
],
'MaskMode': 'MASK'|'REPLACE_WITH_PII_ENTITY_TYPE',
'MaskCharacter': 'string'
Refer to the API documentation for more details: https://docs.aws.amazon.com/comprehend/latest/dg/API_StartPiiEntitiesDetectionJob.html.
c) The job will take roughly 6-7 minutes. The following code checks the status of the job; the cell finishes executing once the job is complete:
from time import sleep

# The job ID comes from the start_pii_entities_detection_job response
events_job_id = response['JobId']
job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)
print(job)
waited = 0
timeout_minutes = 10
while job['PiiEntitiesDetectionJobProperties']['JobStatus'] != 'COMPLETED':
sleep(60)
waited += 60
assert waited//60 < timeout_minutes, "Job timed out after %d seconds." % waited
job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)
You will get a JSON response once the job completes. Feel free to go and grab a coffee while the notebook cell is running.
filename="pii_data.txt"
s3_client = boto3.client(service_name='s3')
output_data_s3_file = job['PiiEntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'
print(output_data_s3_file)
# Strip the "s3://" scheme and bucket name to recover the S3 object key
output_data_s3_filepath = "/".join(output_data_s3_file.split("//")[1].split("/")[1:5])
print(output_data_s3_filepath)
from io import BytesIO

f = BytesIO()
s3_client.download_fileobj(bucket, output_data_s3_filepath, f)
f.seek(0)
print(f.getvalue())
In the output, you can see that the Amazon Comprehend PII job has masked the PII data, such as the address, name, SSN, and bank account number identified earlier using the Amazon Comprehend detect PII entities API.
In this section, we walked you through an end-to-end conceptual architecture for automating documents for compliance and control. In the next section, we will talk about best practices for real-time document processing workflows versus batch processing workflows.
In this section, we will talk about some best practices while architecting solutions using Amazon Textract for real-time workflows versus batch processing document workflows.
Let's compare the Textract real-time APIs against the batch APIs we discussed in Chapter 2, Introducing Amazon Textract, with the help of the following table:
Note
The pricing of Textract is based on which of the three APIs you use: Analyze Document (forms, tables), Detect Text (text extraction), or Analyze Expense (invoices and receipts). Pricing is the same irrespective of whether you use the sync or async (batch) implementation of these, so feel free to design your architecture based on your need for real-time versus batch processing; you pay based on the number of documents processed with one of the three APIs. Check prices here: https://aws.amazon.com/textract/pricing/.
For example, LiveRight Pvt Ltd. can use either the batch or real-time implementation of the Detect Text API to extract text from millions of bank statements.
We covered the architecture in Figure 14.2, which implemented the Amazon Textract Detect Text Sync API in the code walk-through. Now, let's see how we can automate that architecture with Lambda functions so that it scales to process multiple documents:
In the preceding architecture, we walked you through how you can process scanned images with a synchronous document processing workflow built on the sync APIs of Amazon Textract. Here are the steps for this architecture:
You control the throughput of your pipeline by controlling the batch size and Lambda concurrency.
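A minimal sketch of the Lambda function at the heart of this sync pipeline might look as follows. This is an assumption-laden sketch rather than a reference implementation: the .txt output key convention is ours, while the event shape is the standard one Lambda receives from an S3 trigger.

```python
def records(event):
    """Yield (bucket, key) pairs from a standard S3 put event."""
    for record in event.get("Records", []):
        yield record["s3"]["bucket"]["name"], record["s3"]["object"]["key"]

def handler(event, context):
    """Hypothetical Lambda: run DetectDocumentText on each uploaded image
    and store the extracted text next to it under an assumed .txt suffix."""
    import boto3  # lazy import so records() is testable without boto3
    textract = boto3.client("textract")
    s3 = boto3.client("s3")
    for bucket, key in records(event):
        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}})
        text = "\n".join(block["Text"] for block in response["Blocks"]
                         if block["BlockType"] == "LINE")
        s3.put_object(Bucket=bucket, Key=key + ".txt",
                      Body=text.encode("utf-8"))
```

Each concurrent invocation handles one event batch, which is why batch size and Lambda concurrency together determine throughput.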
Now we will walk you through the following architecture best practices for scaling to multi-page scanned documents, which can be PDFs or images, using the batch APIs of Amazon Textract:
In the preceding diagram, we have an architecture that walks through how the batch processing workflow works with Amazon Textract batch jobs:
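To make the batch flow concrete, here is a hedged sketch of starting an async Textract job and walking its paginated results. The collect_lines helper takes the paged getter as a callable so the pagination logic stands on its own; in a real pipeline you would wait for the SNS job-completion notification (or poll JobStatus) before collecting.

```python
def start_text_detection(bucket, document):
    """Kick off an async Textract text-detection job on a document in S3.
    Returns the JobId to query later (requires AWS credentials)."""
    import boto3  # lazy import so collect_lines is testable without boto3
    textract = boto3.client("textract")
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": document}})
    return response["JobId"]

def collect_lines(get_page, job_id):
    """Walk the paginated results of an async job, gathering LINE text.
    `get_page` is a callable such as textract.get_document_text_detection."""
    lines, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = get_page(**kwargs)
        lines += [block["Text"] for block in page.get("Blocks", [])
                  if block["BlockType"] == "LINE"]
        token = page.get("NextToken")
        if not token:
            return lines
```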
This GitHub link, https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing, has code samples implementing both of the suggested architectures, along with some additional components to backfill documents that already exist in the Amazon S3 bucket. Please feel free to set this up and use it if you have a large number of documents to experiment with.
You can also use the following GitHub solution, https://github.com/aws-samples/amazon-textract-textractor, to implement large-scale document processing with Amazon Comprehend insights.
In this section, we covered architecture best practices for using real-time processing or batch processing with Amazon Textract. We also presented some already-existing GitHub implementations for large-scale document processing with Amazon Textract. Now, let's summarize what we have covered in this chapter.
In this chapter, we covered how you can use Amazon Textract to automate your existing document workflows. We introduced a fictional banking use case with LiveRight Pvt Ltd. and showed how this architecture can help banks automate their loan origination process and set up compliance and control with Amazon Comprehend. We also walked through code samples using a sample bank statement, showing how you can extract data from the scanned statement and save it as a CSV/text file in Amazon S3 for further analysis. Then, we showed you how to detect PII with an Amazon Comprehend sync API and how to redact that sample bank data text/CSV in Amazon S3 using an Amazon Comprehend batch PII redaction job.
We then covered some architecture patterns for using real-time processing document workflows versus batch processing workflows. We also provided some GitHub implementations that can be used to process large-scale documents.
In this chapter, you learned when and how to use real-time APIs versus batch APIs for document automation. You also learned how to set up PII redaction with Amazon Comprehend PII jobs.
In the next chapter, we will look at a different use case, but one that's equally popular among enterprises looking to leverage NLP to maximize their business value by building smart search indexes. We will cover how you can use Amazon Textract and Amazon Comprehend along with Amazon Elasticsearch and Amazon Kendra to create a quick NLP-based search. We will introduce the use case, discuss how to design the architecture, establish the prerequisites, and walk through in detail the various steps required to build the solution.