In the previous chapters, you were introduced to Amazon Textract for extracting text from documents and to Amazon Comprehend for extracting insights, with no prior Machine Learning (ML) experience required. In the last chapter, we showed you how to combine these services to solve a real-world document automation use case, using loan processing as an example.
In this chapter, we will use the Amazon Textract and Amazon Comprehend services to show you how to quickly set up an intelligent search solution by integrating powerful components: Amazon Elasticsearch, a managed service for search and log analytics, and Amazon Kendra, an intelligent managed search service powered by ML for natural language search.
We will cover the following topics in this chapter:
For this chapter, you will need access to an AWS account. Before getting started, we recommend that you create an AWS account by going through the following steps:
The Python code and sample datasets for the Amazon Textract examples are provided on the book's GitHub repo at https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/tree/main/Chapter%2005.
Check out the following video to see the Code in Action at https://bit.ly/3nygP5S.
Every organization has a large number of documents, both on paper and in digital archives. The challenge is that these documents mostly sit in separate silos rather than in one place, which makes it extremely difficult for these organizations to base business decisions on the information hidden in their siloed documents. One approach organizations take to make their documents searchable is to put them in a data lake. However, extracting meaningful information from these documents is another challenge, as it requires considerable NLP expertise, ML skills, and infrastructure. Even if you manage to extract insights from these documents, you then face the further challenge of setting up a scalable search solution.
In this section, we will address these challenges by using the AWS AI services we introduced in previous chapters and then talk about how they can be used to set up a centralized document store.
Once all the documents are in a centralized storage service such as Amazon S3, which is a scalable and durable object store similar to Dropbox, we can use Amazon Textract, as covered in Chapter 2, Introducing Amazon Textract, to extract text from these documents, and Amazon Comprehend, as covered in Chapter 3, Introducing Amazon Comprehend, to extract NLP-based insights such as entities, key phrases, sentiment, and more. We can then quickly index the text and the insights and send them to Amazon Elasticsearch or Amazon Kendra to set up a smart search solution.
The following diagram shows the architecture we will cover in this section:
In Figure 5.1, you can see the two options we have to build a search index. The options are as follows:
If you are looking for a natural language-based search solution powered by ML, where you can ask human-like questions rather than searching for keywords, you can choose Amazon Kendra, an AWS AI service that offers natural language search functionality and provides NLP-based, human-like contextual answers. For example, imagine you are setting up search on your IT support documents in Salesforce. Using Amazon Kendra, you can ask a direct question such as "where is the IT desk located?" and Amazon Kendra will give you an exact response, such as "the sixth floor," whereas with Amazon Elasticsearch you can only perform keyword-based searches.
Moreover, you can integrate Amazon Kendra with Amazon Lex, a service for creating chatbots, and deploy a smart search chatbot on your website powered by Amazon Lex and Amazon Kendra. Amazon Kendra also comes with many connectors to discover and index your data for search, including Amazon S3, OneDrive, Google Drive, Salesforce, relational databases such as RDS, and many more supported by third-party vendors.
You can set up search for many different interesting use cases. For example, financial analysts searching for financial events have to scroll through tons of SEC filing reports looking for meaningful financial entities such as mergers and acquisitions. Using the proposed pipeline along with Amazon Comprehend Events can greatly reduce the time and noise involved in scrolling through these documents, allowing analysts to update their financial models whenever financial events such as mergers or acquisitions occur.
Healthcare companies can use the services and options offered by Amazon Comprehend Medical to create a smart search for healthcare data, where a doctor can log in and search for relevant keywords or information across centralized patient data in Amazon HealthLake. We will cover more on this use case in this chapter.
We all know that finding a job is extremely difficult, and it's even harder for talent acquisition companies hunting for good candidates to search for relevant skills across thousands of resumes. You can use the proposed solution to set up a resume processing pipeline, where you upload the resumes of various candidates to Amazon S3 and search for relevant skills based on the jobs you are hiring for.
In this section, we covered two options with which to set up smart search indexes. In the next section, we will show you how you can set up this architecture to create an NLP-powered search application where Human Resources (HR) admin users can quickly upload candidates' scanned resumes and other folks can log in and search for relevant skill sets based on open job positions.
In the previous chapters, we spoke about how you can use AWS Lambda functions to create a serverless application. In this section, we will walk you through the following architecture to set up a scanned image-based search solution by calling the Amazon Textract and Amazon Comprehend APIs from an AWS Lambda function. We are going to use Amazon Elasticsearch for this use case; however, you can also replace Amazon Elasticsearch with Amazon Kendra to create an ML-based search solution where you can ask questions in natural language.
In the preceding architecture, Amazon Cognito is used to set up the login for your backend users.
Amazon S3 is used for centralized storage. AWS Lambda functions act as serverless event triggers when scanned resumes are uploaded to Amazon S3; we then use Amazon Textract and Amazon Comprehend to extract the text and insights such as key phrases and entities, and index everything into Amazon Elasticsearch. Your end users log in through Cognito and access Amazon Elasticsearch through a Kibana dashboard, which comes integrated with Amazon Elasticsearch for visualization.
We will use an AWS CloudFormation template to spin up the resources needed for this chapter. CloudFormation templates are scripts written in YAML or JSON that provision resources as Infrastructure as Code (IaC); the template we use sets up all the necessary resources and permissions for you:
Note:
You will receive an email with the Cognito login details while your stack is being created, so make sure you check the email address you provided when creating the stack. An admin can add more users' email addresses through the Amazon Cognito console once the stack is deployed; those users can then log in to the system once the resume data has been uploaded to Amazon S3.
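If you'd rather add those end users with the AWS SDK than through the console, here is a minimal sketch using Boto3 (the user pool ID and email addresses are placeholders; the user pool ID can be found in the Amazon Cognito console or your stack outputs):

import boto3

cognito = boto3.client('cognito-idp')

# Invite each end user by email; Cognito emails them a temporary password
for email in ['user1@example.com', 'user2@example.com']:
    cognito.admin_create_user(
        UserPoolId='<your user pool id>',
        Username=email,
        UserAttributes=[{'Name': 'email', 'Value': email}],
        DesiredDeliveryMediums=['EMAIL'])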
Now you have set up the infrastructure, including an Amazon S3 bucket, Lambda functions, the Cognito login, Kibana, and the Amazon Elasticsearch cluster, using CloudFormation. You have the CloudFormation outputs for your S3 bucket and the Kibana dashboard login URL. In the next section, we will walk you through how to upload scanned images to interact with this application as an admin user.
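Before moving on, note that you can also retrieve those stack outputs programmatically rather than from the console. A minimal sketch, assuming a hypothetical stack name (replace it with whatever you named your stack):

import boto3

cfn = boto3.client('cloudformation')

# 'document-search-stack' is a placeholder stack name
stack = cfn.describe_stacks(StackName='document-search-stack')['Stacks'][0]

# Print every output the template exports, such as the S3 bucket name
# and the Kibana login URL
for output in stack['Outputs']:
    print(output['OutputKey'], '=', output['OutputValue'])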
We'll start with the following steps for uploading documents to Amazon S3:
We have uploaded the sample scanned resume to Amazon S3 and shown you where to find the S3 event notification that triggers a Lambda function. In the next section, let's explore what happens inside that Lambda function.
In this section, we will inspect the code blocks of AWS Lambda and the API calls made to Amazon Textract and Amazon Comprehend along with Amazon Elasticsearch.
The deployment package for this function is too large for its code to show up in the AWS Lambda console. You can instead access the code through the following GitHub repo, at https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2005/lambda/index.py:
import boto3
from urllib.parse import unquote_plus
from trp import Document  # Textract response parser (trp) provides the Document helper

s3 = boto3.resource('s3')
textract = boto3.client('textract')

def handler(event, context):
    # Get the bucket and object key from the S3 event notification
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    # Download the scanned resume to the Lambda /tmp scratch space
    filename = '/tmp/{}'.format(key.split('/')[-1])
    s3.Bucket(bucket).download_file(Key=key, Filename=filename)
    with open(filename, 'rb') as document:
        imageBytes = bytearray(document.read())
    print("Object downloaded")
    # Extract text, tables, and forms from the scanned image
    response = textract.analyze_document(Document={'Bytes': imageBytes},
                                         FeatureTypes=["TABLES", "FORMS"])
    document = Document(response)
    # Concatenate all LINE blocks into a single text string
    blocks = response['Blocks']
    text = ""
    for block in blocks:
        if block['BlockType'] == 'LINE':
            text += block['Text'] + " "
    print(text)
Note:
The Comprehend sync APIs allow up to 5,000 characters as input, so make sure your text is not more than 5,000 characters long.
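If your documents produce more text than that, one option is to split the text into chunks before calling the API. Here is a minimal sketch; the fixed-size chunking is an assumption, and a production version would split on sentence or word boundaries to avoid cutting phrases in half:

MAX_LEN = 5000  # Comprehend sync API input limit

def chunk_text(text, max_len=MAX_LEN):
    # Naive fixed-size chunking
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

# Call the sync API once per chunk and merge the results
for chunk in chunk_text(text):
    keyphrase_response = comprehend.detect_key_phrases(Text=chunk, LanguageCode='en')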
textvalues = []
keyphrase_response = comprehend.detect_key_phrases(Text=text, LanguageCode='en')
KeyPhraseList = keyphrase_response.get("KeyPhrases")
# Collect every detected key phrase for indexing
for s in KeyPhraseList:
    textvalues.append(s.get("Text"))
textvalues_entity = {}
detect_entity = comprehend.detect_entities(Text=text, LanguageCode='en')
EntityList = detect_entity.get("Entities")
# Map each entity type (PERSON, ORGANIZATION, and so on) to its text
for s in EntityList:
    textvalues_entity.update([(s.get("Type").strip(' '), s.get("Text").strip(' '))])
# Build the document to index, with a deep link back to the object in S3;
# region, table, and forms are defined elsewhere in the Lambda function
s3url = 'https://s3.console.aws.amazon.com/s3/object/' + bucket + '/' + key + '?region=' + region
searchdata = {'s3link': s3url, 'KeyPhrases': textvalues, 'Entity': textvalues_entity,
              'text': text, 'table': table, 'forms': forms}
print(searchdata)
print("connecting to ES")
es = connectES()
es.index(index="document", doc_type="_doc", body=searchdata)
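The connectES() helper is not shown in the snippet above. Here is a minimal sketch of what such a helper might look like, assuming the elasticsearch and requests-aws4auth packages are bundled with the Lambda deployment package and that the domain endpoint is passed in as an environment variable (the variable name ES_ENDPOINT is an assumption):

import os
import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

def connectES():
    host = os.environ['ES_ENDPOINT']   # Amazon Elasticsearch domain endpoint
    region = os.environ['AWS_REGION']  # set automatically by Lambda
    # Sign requests with the Lambda execution role's credentials (SigV4)
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                       region, 'es', session_token=credentials.token)
    return Elasticsearch(hosts=[{'host': host, 'port': 443}],
                         http_auth=awsauth, use_ssl=True, verify_certs=True,
                         connection_class=RequestsHttpConnection)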
Note:
If the resumes contain tables or forms, the code is prepared to index those as well, which means this solution can also be used for invoice search.
In this section, we walked you through how to extract text and insights from the documents uploaded to Amazon S3 and index that data into Amazon Elasticsearch. In the next section, we will walk you through how to log in to Kibana using the admin email you set up when creating the CloudFormation stack and visualize the data in the Kibana dashboard.
In this section, we will cover how to sign up to Kibana through Amazon Cognito using the email address you entered as the admin while deploying the resources through AWS CloudFormation. Then we will walk you through setting up your index in Kibana and show how to discover and search the data in the Kibana dashboard based on the entity, key phrase, and table filters from Amazon Comprehend. Lastly, you can follow the Amazon S3 link of a matching resume to download it.
We will cover walkthroughs including signing up to the Kibana console, making the index discoverable for the search functionality, and searching for insights in Kibana.
In these steps, we will walk you through how you can log in to Kibana using the CloudFormation-generated output link:
Note:
You can sign up additional end users using the Sign up button shown in the previous screenshot.
We have covered how to sign up for Kibana. In the next section, we will walk you through setting up the index in Kibana.
In this section, we will walk you through setting up an index in Kibana for searching:
We have created an index. Now we will start searching for insights.
In this section, we will walk you through searching for insights in Kibana:
Let's look at another output shown in the following screenshot:
In this section, we gave you an architecture overview of the search solution for scanned images where an admin user uploads the scanned documents in Amazon S3, and then showed how to sign up for the Kibana dashboard and search for keywords to gain meaningful insights from the scanned documents.
We walked you through the steps to set up the architecture using the AWS CloudFormation template's one-click deploy; you can check the Further reading section to learn more about how to create such templates. We also showed how you can interact with this application by uploading sample documents, and guided you through setting up the Kibana dashboard, with sample queries to gain insights using the key phrases and entities as filters.
In the next section, we will explore a Kendra-powered search solution. Let's get started exploring Amazon Kendra and what you can uncover by combining it with Textract and Comprehend in your document processing workflows.
In this section, we will cover how you can quickly create an end-to-end serverless document search application using Amazon Kendra.
We will walk through the steps to clone the notebook with git and show code samples to set up the Kendra-based search architecture using simple boto3 APIs.
Note:
Please add Kendra IAM access to the SageMaker notebook IAM role so that you can call Kendra APIs through this notebook. In previous chapters, you already added IAM access to Amazon Comprehend and Textract APIs from the SageMaker notebook.
We will show you how to create an Amazon S3 bucket. We will use this bucket as a Kendra data source and also to store the data extracted by Amazon Textract.
import sagemaker
from sagemaker import get_execution_role

# Define the IAM role attached to this notebook
role = get_execution_role()
print("RoleArn: {}".format(role))
sess = sagemaker.Session()
s3BucketName = '<your s3 bucket name>'
prefix = 'chapter5'
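The snippet above only names the bucket. If the bucket does not exist yet, here is a minimal sketch to create it with Boto3 (the special-casing of us-east-1 is needed because that region does not accept a LocationConstraint):

import boto3

s3 = boto3.client('s3')
region = boto3.session.Session().region_name

if region == 'us-east-1':
    s3.create_bucket(Bucket=s3BucketName)
else:
    s3.create_bucket(Bucket=s3BucketName,
                     CreateBucketConfiguration={'LocationConstraint': region})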
We have the notebook ready and the Amazon S3 bucket created for this section's solution. Let's see a quick architecture walkthrough in the next section to understand the key components and then we will walk you through the code in the notebook you have set up.
Setting up an enterprise-level search can be hard. That's why we have Amazon Kendra, which can crawl data from various data connectors to create a quick and easy search solution. In the following architecture, we will walk you through how you can set up a document search when you have your PDF documents in Amazon S3. We will extract the data using Amazon Textract from these PDF documents and send it to Amazon Comprehend to extract some key entities such as ORGANIZATION, TITLE, DATE, and so on. These entities will be used as filters while we sync the documents directly into Amazon Kendra for search.
We gave you a high-level implementation architecture in the previous diagram. In the next section, we will walk you through how you can build this out with a few lines of code using the Python Boto3 APIs.
In this section, we will walk you through how you can quickly set up the proposed architecture:
import boto3

comprehend = boto3.client('comprehend')
textract = boto3.client('textract')
kendra = boto3.client('kendra')
Note:
You can upload as many documents for search as you wish. For this demonstration, we are providing just one sample. Please feel free to play around by uploading your documents to Amazon S3 and generating metadata files before you start syncing your documents to Amazon Kendra.
For extracting text from the PDF uploaded to Amazon S3, we will use the same code as we used for the asynchronous processing covered in Chapter 2, Introducing Amazon Textract.
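As a quick refresher, that asynchronous flow starts a text detection job on the S3 object and then pages through the results. The following condensed sketch shows how the response list of result pages consumed below might be built; the document name is a placeholder, and the simple polling loop stands in for the SNS-based notification covered in Chapter 2:

import time

documentName = 'resume_Sample.pdf'  # placeholder object key in s3BucketName

# Start the asynchronous Textract job on the PDF in S3
job = textract.start_document_text_detection(
    DocumentLocation={'S3Object': {'Bucket': s3BucketName,
                                   'Name': documentName}})
jobId = job['JobId']

# Wait for the job to finish
while textract.get_document_text_detection(JobId=jobId)['JobStatus'] == 'IN_PROGRESS':
    time.sleep(5)

# Collect every page of results into a single list
response = []
nextToken = None
while True:
    kwargs = {'JobId': jobId}
    if nextToken:
        kwargs['NextToken'] = nextToken
    resultPage = textract.get_document_text_detection(**kwargs)
    response.append(resultPage)
    nextToken = resultPage.get('NextToken')
    if not nextToken:
        break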
text=""
for resultPage in response:
for item in resultPage["Blocks"]:
if item["BlockType"] == "LINE":
#print (' 33[94m' + item["Text"] + ' 33[0m')
text += item['Text']+" "
print(text)
The sample results shown in the following screenshot contain the text from the PDF:
entities = comprehend.detect_entities(Text=text, LanguageCode='en')
Note:
If you created the index using the console, please skip the programmatic creation and avoid running the following notebook cell to create the index.
response = kendra.create_index(
    Name='Search',
    Edition='DEVELOPER_EDITION',
    RoleArn='<enter the ARN of an IAM role created in the IAM console>')
print(response)
Note:
Index creation can take up to 30 minutes.
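If you want the notebook to wait for the index rather than checking the console, here is a minimal polling sketch using the DescribeIndex API (the sleep interval is arbitrary):

import time

index_id = response['Id']  # returned by create_index

# Poll until the new index becomes ACTIVE
while kendra.describe_index(Id=index_id)['Status'] == 'CREATING':
    time.sleep(60)
print(kendra.describe_index(Id=index_id)['Status'])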
Alternatively, if you created the index programmatically using the CreateIndex API, its response will contain a 36-character index ID that you need to copy and paste into the next piece of code, which updates the search filters based on the Comprehend entities.
response = kendra.update_index(
    Id="<paste Index Id from Create Index response>",
    DocumentMetadataConfigurationUpdates=[
        {
            'Name': 'ORGANIZATION',
            'Type': 'STRING_LIST_VALUE',
            'Search': {
                'Facetable': True,
                'Searchable': True,
                'Displayable': True
            }
        }
        # Repeat this configuration block for each of the other entity types
    ])
categories = ["ORGANIZATION", "PERSON", "DATE", "COMMERCIAL_ITEM", "OTHER", "TITLE", "QUANTITY"]
for e in entities["Entities"]:
if (e["Text"].isprintable()) and (not """ in e["Text"]) and (not e["Text"].upper() in category_text[e["Type"]]):
#Append the text to entity data to be used for a Kendra custom attribute
entity_data[e["Type"]].add(e["Text"])
#Keep track of text in upper case so that we don't treat the same text written in different cases differently
category_text[e["Type"]].append(e["Text"].upper())
#Keep track of the frequency of the text so that we can take the text with highest frequency of occurrance
text_frequency[e["Type"]][e["Text"].upper()] = 1
elif (e["Text"].upper() in category_text[e["Type"]]):
#Keep track of the frequency of the text so that we can take the text with highest frequency of occurrance
text_frequency[e["Type"]][e["Text"].upper()] += 1
print(entity_data)
elimit = 10  # keep at most the 10 most frequent values per entity type
metadata = {}
attributes = {}
for et in categories:
    # Sort the values by frequency and keep the top elimit entries
    el = [pair[0] for pair in sorted(text_frequency[et].items(),
                                     key=lambda item: item[1], reverse=True)][0:elimit]
    metadata[et] = [d for d in entity_data[et] if d.upper() in el]
metadata["_source_uri"] = documentName
attributes["Attributes"] = metadata
import json

s3 = boto3.client('s3')
prefix = 'meta/'
# Write the Kendra metadata file locally, then upload it under the meta/ prefix
with open("metadata.json", "w") as f:
    json.dump(attributes, f)
s3.upload_file("metadata.json", s3BucketName,
               '%s/%s' % ("meta", "resume_Sample.pdf.metadata.json"))
We gave you a code walkthrough of how to upload a PDF document, extract data from it using Amazon Textract, and then extract entities with Amazon Comprehend. We then created a metadata file from the filters, or entities, extracted by Comprehend and uploaded it to Amazon S3. In the next section, we will walk you through how to set up an Amazon Kendra sync with the S3 documents you uploaded, and how to create a meta folder holding your metadata files so that Amazon Kendra picks them up as metadata filters during the sync.
In this section, we will walk you through how you can sync the documents to the index you have created, along with the filters in the metadata file:
Once the sync is successful, all your documents in Amazon S3 will be synced and the Kendra filters will be populated with the metadata attributes extracted by Amazon Comprehend.
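If you prefer to create the S3 data source and trigger the sync programmatically instead of through the console, here is a minimal sketch (the data source name is a placeholder, and the IAM role is an assumption; it needs read access to the bucket):

# Create an S3 data source that also reads the metadata files under meta/
ds = kendra.create_data_source(
    Name='resume-s3-source',  # placeholder name
    IndexId=index_id,
    Type='S3',
    RoleArn='<data source IAM role ARN>',
    Configuration={
        'S3Configuration': {
            'BucketName': s3BucketName,
            'DocumentsMetadataConfiguration': {'S3Prefix': 'meta/'}
        }
    })

# Start the sync job that crawls the bucket and applies the metadata filters
kendra.start_data_source_sync_job(Id=ds['Id'], IndexId=index_id)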
In the next section, we will walk you through how you can navigate to the Amazon Kendra console to search.
Amazon Kendra comes with a built-in search UI that can be used for testing the search functionality.
You can also deploy this UI in a React app after testing. The page at https://docs.aws.amazon.com/kendra/latest/dg/deploying.html has the deployment UI code available, which can be integrated with any serverless application using API Gateway and Lambda.
You can also use the Kendra.query() API to retrieve results from the index you created in Kendra.
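For example, here is a minimal query sketch (the question string is just an illustration):

result = kendra.query(
    IndexId=index_id,
    QueryText='who is the big data architect?')

# Print the result type (ANSWER or DOCUMENT) and the matching excerpt
for item in result['ResultItems']:
    print(item['Type'], ':', item['DocumentExcerpt']['Text'])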
In this section, we will walk you through using the built-in Kendra search console:
Amazon Kendra is able to give you a contextual answer containing Jane Doe, whose resume we indexed.
It also provides you with filters based on Comprehend entities on the left-hand side to quickly sort individuals based on entities such as ORGANIZATION, TITLE, DATE, and their word count frequencies.
You can also create Comprehend custom entities, as we covered in Chapter 4, Automated Document Processing Workflows, to enrich your metadata filters based on your business needs.
Amazon Kendra is able to provide you with the exact contextual answer. You can also boost the response in Kendra based on relevance and provide feedback using the thumbs-up and thumbs-down buttons to improve your Kendra model.
Note:
Amazon Kendra supports the use of PDF, Word, JSON, TXT, PPT, and HTML documents for the search functionality. Feel free to add more documents through this pipeline for better search results and accuracy.
In this chapter, we covered two options for setting up an intelligent search solution for your document processing workflows. The first option showed how to quickly set up an NLP-based search for scanned resume analysis using Amazon Textract, Amazon Comprehend, and Amazon Elasticsearch, with a Lambda function deployed through a CloudFormation template; the same pipeline can be used with anything scanned, such as images, invoices, or receipts. For the second option, we covered how to set up an enterprise-level, serverless, scalable search solution with Amazon Kendra for your PDF documents, and walked you through enriching the Amazon Kendra search with additional attributes or metadata generated from Amazon Comprehend named entities.
In the next chapter, we will talk about how you can use AI to improve customer service in your contact center.