In the previous chapter, you read how businesses can harness the benefits of applying NLP to derive insights from text, and you were briefly introduced to the AWS ML stack. We will now provide a detailed introduction to Amazon Textract, along with do-it-yourself code samples and instructions. Amazon Textract is an AWS AI service that can be used to extract text from documents and images with little to no prior ML skills. But before we get to what Textract can do, we will first cover some of the challenges with document processing. Then we will cover how Textract can help in overcoming the challenges. We will also talk about the benefits of using Amazon Textract, along with its product features. Lastly, we will cover how you can integrate Amazon Textract quickly into your applications.
We will navigate through the following sections in this chapter:
For this chapter, you will need access to an AWS account at https://aws.amazon.com/console/. Please refer to the Signing up for an AWS account sub-section within the Setting up your AWS environment section for detailed instructions on how you can sign up for an AWS account and sign in to the AWS Management Console.
The Python code and sample datasets for the solution discussed in this chapter are available at https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/tree/main/Chapter%2002.
Check out the following video to see the Code in Action at https://bit.ly/3be9eUh.
Important Note
Please do not execute the instructions in this section on their own. This section is a reference for all the basic setup tasks needed throughout the book. You will be guided to this section when building your solution in this chapter and the rest of the chapters in this book. Only execute these tasks when so guided.
Depending on the chapter you are in, you will be running tasks using the AWS Management Console, an Amazon SageMaker Jupyter notebook, from your command line, or a combination of any of these. Either way, you need the right AWS Identity and Access Management (IAM) permissions, resources, and, in most cases, one or more Amazon Simple Storage Service (S3) buckets, as prerequisites for your solution builds. This section provides instructions for setting up these basic tasks. We will be referring to this section throughout the rest of the chapters in the book as needed.
In this chapter and all subsequent chapters in which we run code examples, you will need access to an AWS account. Before getting started, we recommend that you create an AWS account by going through the following steps:
Note
Please use the AWS Free Tier, which enables you to try services free of charge based on certain time limits or service usage limits. For more details, please see https://aws.amazon.com/free.
You now have access to the AWS Management Console (https://aws.amazon.com/console/). In the next section, we will show how to create an S3 bucket and upload your documents.
In this book, we will use Amazon S3 as the storage option for our solutions. So, we will need to create an S3 bucket, create folders within the bucket, and upload documents for use within the solution. Please follow these instructions to learn how to do this:
Please note that the AWS Management Console is not the only option to upload objects to S3. You can do it using the AWS Command-Line Interface (CLI) (for more details, see https://docs.aws.amazon.com/cli/latest/reference/s3/) or you can also upload files programmatically using the Python SDK, for example (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html). AWS provides SDKs for programming in several languages (https://aws.amazon.com/tools/).
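As a minimal sketch of the programmatic option, the following helper uploads local files to a bucket using the boto3 `upload_file` call. The helper name `upload_documents` and the default `textract-samples` prefix are assumptions made for this illustration; they are not part of the book's repository.

```python
import os


def upload_documents(s3_client, bucket, local_paths, prefix="textract-samples"):
    """Upload local files to s3://<bucket>/<prefix>/ and return the object keys.

    s3_client is expected to behave like boto3.client("s3"). The helper name
    and the default prefix are hypothetical, chosen only for this sketch.
    """
    keys = []
    for path in local_paths:
        # Use the file name as the object name under the chosen prefix.
        key = "{}/{}".format(prefix.strip("/"), os.path.basename(path))
        s3_client.upload_file(Filename=path, Bucket=bucket, Key=key)
        keys.append(key)
    return keys
```

In a notebook, you could call it as `upload_documents(boto3.client("s3"), "<your amazon s3 bucket>", ["sample-invoice.png"])`. Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real bucket.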
And that concludes the instructions for creating an S3 bucket, creating a folder, and uploading objects to the bucket. In the next sections, we will create an Amazon SageMaker Jupyter notebook instance and then add IAM permissions policies to its role.
In this section, we will see how to create a notebook instance in Amazon SageMaker. This is an important step, as most of our solution examples are run using notebooks. After the notebook is created, please follow the instructions to use the notebook in the specific chapters based on the solution being built. Please follow these steps to create an Amazon SageMaker Jupyter notebook instance:
Note
By default, each notebook instance is provided internet access by SageMaker. If you want to disable internet access for this notebook instance, you can attach it to your Virtual Private Cloud (VPC), a highly secure virtual network in the cloud for launching AWS resources (https://docs.aws.amazon.com/sagemaker/latest/dg/appendix-notebook-and-internet-access.html), and select to disable internet access. We need internet access for this notebook instance, so if you are planning to attach a VPC and disable internet access through SageMaker, please either configure a Network Address Translation (NAT) gateway, which allows instances in a subnet within the VPC to communicate with resources outside the VPC but not the other way around (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html), or an interface VPC endpoint (https://docs.aws.amazon.com/sagemaker/latest/dg/interface-vpc-endpoint.html), which allows a private connection through the AWS backbone between the notebook instance and your VPC. This allows you to manage access to the internet for your notebook instance through the controls you have implemented within your VPC.
Your notebook instance will take a few minutes to be provisioned; once it's ready, the status will change to InService. Please follow the instructions in the Using Amazon Textract with your applications section to find out how you can use your notebook instance. In the next few sections, we will walk through the steps required to modify the IAM role we attached to the notebook.
Note
You cannot attach more than 10 managed policies to an IAM role. If your IAM role already has a managed policy from a previous chapter, please detach this policy before adding a new policy as per the requirements of your current chapter.
When we create an Amazon SageMaker Jupyter notebook instance (like we did in the previous section), the default role creation step includes permissions to either an S3 bucket you specify or any S3 bucket in your AWS account. But often, we need the notebook to have more permissions than that. For example, we may need permission to use Amazon Textract or Amazon Comprehend APIs, and/or other services as required.
In this section, we will walk through the steps needed to add additional permissions policies to our Amazon SageMaker Jupyter notebook role:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "iam:PassRole"
            ],
            "Effect": "Allow",
            "Resource": "<IAM ARN of your current SageMaker notebook execution role>"
        }
    ]
}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com",
                    "s3.amazonaws.com",
                    "comprehend.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
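If you prefer to script these changes rather than use the console, the same three updates can be sketched with the IAM client from the Python SDK. This is only an illustration under stated assumptions: the function names and the inline policy name `notebook-pass-role` are hypothetical, and the managed policy attached here is AmazonTextractFullAccess, which this chapter uses later.

```python
import json


def build_pass_role_policy(role_arn):
    """Inline policy allowing iam:PassRole on the notebook execution role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Action": ["iam:PassRole"], "Effect": "Allow", "Resource": role_arn}
        ],
    }


def build_trust_policy(services):
    """Trust policy letting the listed service principals assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": list(services)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def update_notebook_role(iam_client, role_name, role_arn, services):
    """Apply the three changes; iam_client should behave like boto3.client("iam")."""
    # 1. Attach an AWS managed policy to the role.
    iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/AmazonTextractFullAccess",
    )
    # 2. Add the custom inline policy (the policy name is a hypothetical choice).
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="notebook-pass-role",
        PolicyDocument=json.dumps(build_pass_role_policy(role_arn)),
    )
    # 3. Replace the trust relationship with the required trusted entities.
    iam_client.update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(build_trust_policy(services)),
    )
```

Building the policy documents as Python dictionaries and serializing them with `json.dumps` avoids hand-editing JSON strings, which is where most syntax mistakes in policies tend to creep in.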
And you are all set. In this section, you learned how to update the IAM role for your Amazon SageMaker notebook instances to add permissions policies, add a custom inline policy, and, finally, edit the trust relationships to add the trusted entities you needed for your solution build. You may now go back to the chapter you navigated to here from and continue your solution build task.
Automating operational activities is very important for organizations looking to minimize costs, increase productivity, and enable faster go-to-market cycles. Typically, operations that are at the core of these businesses are prioritized for automation. Back-office support processes, including administrative tasks, are often relegated to the bottom of the priority list because they may not be deemed mission critical. According to this Industry Analysts report (https://www.industryanalysts.com/111015_konica/, written in 2015, with data collected from sources such as Gartner Group, AIIM, the US Department of Labor, Imaging Magazine, and Coopers and Lybrand, and accessed on March 30, 2021), organizations continue to be reliant on paper-based documents, and the effort required to maintain these documents poses significant challenges due to the lack of automation and inefficiencies in the document workflow.
Many organizations, such as financial institutions, healthcare, manufacturing, and other small-to-medium-sized enterprises, have a large number of scanned and handwritten documents. These documents can be in various formats, such as invoices, receipts, resumes, application forms, and so on. Moreover, these documents are not kept in one place; instead, they are in silos, which makes it really difficult to uncover useful insights from these documents. Suppose that you have an archive of documents that you would like to extract data from. And let's say we build an application that makes it easy for you to search across the vast collection of documents in these archives. Extracting data from these documents is really important for you as they contain a lot of useful information that is relevant for your organization. Once you extract the information you need (of course, we first have to determine what is useful and what is not), you can do so many things, such as discover business context, set up compliance, design search and discovery for important keywords, and automate your existing business processes.
As time progresses, we see more organizations embracing digital media for their business processes due to the ease of integration with their operational systems, but paper-based documents are not going away anytime soon. According to this article (https://medium.com/high-peak-ai/real-time-applications-of-intelligent-document-processing-993e314360f9, accessed on March 30, 2021), there is in fact an increase in the usage of paper documents in organizations. And that's why it's really important to automate document processing workflows.
So, what is the problem with paper documents? The problem is the cost and time required to extract the data from documents using traditional approaches. One of the most common approaches is manual processing of these documents. What is manual processing? A human will read the documents and then key all the values into an application or copy and paste them into another document. This approach is highly inefficient and expensive: not only do you need to invest time and effort to train the human workforce to understand the data domain they are working with, but also there may be errors in data entry due to human nature. For example, when working with tax forms and financial forms, you would need an experienced Certified Public Accountant (CPA) to do that manual entry, as this would require accounting knowledge to extract the details needed. So, we can see that a traditional approach with manual processing of documents is time consuming, error prone, and expensive.
Another approach that we have seen organizations use is rule-based formatting templates along with Optical Character Recognition (OCR) systems to extract data from these documents. The challenge with this method is that these rule-based systems are not intelligent enough to adapt to evolving document formats, and often break with even minor template changes. As businesses grow and expand, their underlying processes need the flexibility to adapt, and this often leads to working with multiple document structures, often running to hundreds or even thousands of formats. Trying to set up and manage these formats for each document type can turn into a huge maintenance overhead pretty quickly and it can become challenging to update these formats in rule-based systems once the document format changes. Another challenge to consider is the provisioning of infrastructure and the scaling required to handle millions of such documents and the associated costs.
That's why we have Amazon Textract, a fully managed ML and AI service, built with out-of-the-box features to extract handwritten and printed text in forms, tables, and pages from images and PDF documents. Textract provides Application Programming Interfaces (APIs) behind which run powerful ML models trained on millions of documents to provide a highly effective solution for intelligent text extraction.
So, we covered the challenges with processing documents in this section and why we need Amazon Textract. In the next section, we will talk about how Amazon Textract can quickly help organizations solve this pain point.
We covered AWS AI Services briefly in Chapter 1, NLP in the Business Context and Introduction to AWS AI Services, when introducing the business context for NLP. Amazon Textract is an OCR-based service in the AWS AI Services stack that comes with ready-made intelligence, enabling you to use it without any prior ML experience for your document processing workflows. It is interesting to note that Amazon Textract has its origins in the deep learning ML models built for Amazon.com. It comes with a pre-trained model and provides APIs where you can send your documents in PDF or image format and get a response as text/tables and key/value pairs along with a confidence score.
Note
Amazon Textract currently supports PNG, JPEG, and PDF formats.
Amazon Textract provides serverless APIs without you needing to manage any kind of infrastructure, enabling you to quickly automate document management and scale to process millions of documents. Once the document content is extracted, you can leverage it within your business applications for a variety of document processing use cases for your industry and operational requirements. Amazon Textract models learn as they go, so they become more intelligent in understanding your documents as you continue to use them. Please refer to the following list for a subset of Amazon Textract usage examples we will be covering in the upcoming chapters:
As you can see, Amazon Textract can be used for various types of document processing use cases and provides several advanced benefits that you would not find in traditional rule-based systems or OCR solutions. You can read some of these benefits here:
To understand the security and compliance features of Amazon Textract, please refer to https://docs.aws.amazon.com/textract/latest/dg/security.html. Amazon Textract is covered in multiple AWS compliance programs, including System and Organizational Control (SOC), International Organization for Standardization (ISO), as well as PCI and HIPAA. For more details, please refer to https://docs.aws.amazon.com/textract/latest/dg/SERVICENAME-compliance.html.
In this section, we briefly listed some interesting document-processing use cases that Amazon Textract can help solve and reviewed some of the key benefits of Amazon Textract, such as pre-built intelligence, cost effectiveness, scalability, and ease of use. In the next section, we will use the AWS Management Console (https://console.aws.amazon.com/) to walk through Amazon Textract's product features, such as table detection, form detection, handwriting detection, text detection, and multi-language support.
Alright, it's time to start exploring the cool features we have been talking about so far. We will start by seeing how you can quickly upload the sample documents provided in our GitHub repository (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services) to your Amazon Textract AWS console. Then, we will walk through the key features of Amazon Textract, along with multi-language support by using a French COVID-19 form. We will also cover, at a high level, Amazon Textract's integration with Amazon A2I (https://aws.amazon.com/augmented-ai/), which helps you quickly set up a human review workflow for text that needs to be highly accurate, such as an invoice amount. We will cover the following:
As a first step, please refer to the Technical requirements section to sign up for an AWS account and sign in to get started.
Now, let's see how to upload a document to Textract:
This will upload the document to Amazon Textract:
The following analysis is displayed in the Amazon Textract console:
Click on the Raw text tab to see the extracted text:
Note
Amazon Textract provides support for rotated documents. Please refer to https://docs.aws.amazon.com/textract/latest/dg/limits.html for more details on Textract service limits.
Amazon Textract has the intelligence to recognize that some documents have multiple formats in them and is able to extract content accordingly. For example, you may be working with reports or a request for proposal document with multiple segments. Please download the image shown in Figure 2.4 (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2002/two-column-image.jpeg) and upload it to the Amazon Textract console to try this out:
Amazon Textract will extract the pages and the paragraphs, along with the lines and the words. Also, it will give you the exact positions of these words and paragraphs in the document, which is very important for context. See the following screenshot to understand the bounding box or geometry derived using Textract:
Here is a screenshot of this document in the AWS console:
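The geometry that Textract returns expresses each bounding box as ratios of the page width and height, so you usually need to scale it before drawing on the image. The following is a minimal sketch of that conversion; the function name `bounding_box_to_pixels` and the image dimensions in the usage note are assumptions for illustration.

```python
def bounding_box_to_pixels(bounding_box, image_width, image_height):
    """Convert a Textract BoundingBox (ratios of the page size) to pixel corners.

    Returns (left, top, right, bottom) in pixels, rounded to whole pixels.
    """
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    width = round(bounding_box["Width"] * image_width)
    height = round(bounding_box["Height"] * image_height)
    return left, top, left + width, top + height
```

For example, on a hypothetical 1000 x 800 pixel image, a box with `Left` 0.5, `Top` 0.25, `Width` 0.1, and `Height` 0.05 maps to the pixel corners (500, 200, 600, 240), which you could pass to a drawing library to highlight the detected text.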
Amazon Textract segments documents to identify forms so it can return your key/value pairs from these forms. We will use the employment application sample document template (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2002/emp_app_printed.png), which you downloaded from the GitHub repository:
Amazon Textract can recognize if your document has content structured in tables, for example, receipts, or a listing of technical specifications, pharmacy prescription data, and so on. Textract provides you with the ability to specify whether it should look for tables in your documents when using the API. Along with the table and its contents, Textract returns metadata and indexing information of the table contents, which you can find out more about in the API walk-through later. For this demo, you can download this sample receipt (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2002/receipt-image.png) and upload it into the Amazon Textract console. You will get the extracted table shown in the following screenshot:
Amazon Textract provides support for extracting text in multiple languages. For the latest list of languages supported, please refer to this link: https://aws.amazon.com/textract/faqs/.
Note
Handwriting support is available only in English at the time of writing (April 2021).
During the COVID lockdown in France, anyone wishing to leave their house had to fill in a declaration form to explain why they were outside. We will use this sample form to demo the Amazon Textract language detection feature for the French language. The form is available at https://www.connexionfrance.com/French-news/Covid-19-in-France-Your-questions-on-declaration-form-needed-to-leave-the-house.
You can also download this form from https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2002/form-derogatoire.jpg and upload it to the Amazon Textract console. Click on the Raw text tab, then Forms:
Amazon Textract is able to detect both key/value pairs and raw text from this form in French.
Another very common challenge customers face with data extraction is when you have mixed content documents, such as handwritten text along with printed text. This could be, for example, a prescription form that doctors write for their patients on paper printed with the doctor's name and address. This brings us to another key feature of Amazon Textract: detecting handwritten content from documents:
Amazon Textract provides in-built integration with Amazon A2I (https://aws.amazon.com/augmented-ai/). Using Amazon A2I, you can build human workflows to manage certain documents that require further review by a human for auditing purposes, or just to review the ML predictions. For example, social security numbers or monetary amounts may need to be highly accurate. It is similar to having a first pass of getting text from AI and then using human teams to double-check what the AI has predicted for you.
We will cover handwriting and human in the loop in detail when we get to Chapter 17, Visualizing Insights from Handwritten Content.
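To make the idea of a first AI pass followed by a human double-check more concrete, here is a minimal sketch in plain Python. It does not call Amazon A2I; it only illustrates the kind of confidence check that could decide which extracted words are sent to a reviewer. The function name `words_needing_review` and the threshold of 90 are assumptions for illustration.

```python
def words_needing_review(response, threshold=90.0):
    """Return (text, confidence) for WORD blocks scored below the threshold.

    response is a Textract-style dictionary with a "Blocks" list. The
    threshold value is an illustrative assumption, not a recommended setting.
    """
    flagged = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence < threshold:
            flagged.append((block.get("Text", ""), confidence))
    return flagged
```

In a real workflow, the review conditions would live in the human review workflow you configure rather than in your own code, but the principle is the same: high-confidence values pass straight through and the rest are queued for a person to verify.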
Lastly, the Textract console provides you the option to download and review the JSON documents that are the result of the API responses that were invoked for the various Textract features we walked through:
In this section, we walked through Amazon Textract's key product features to extract text, forms, tables, and handwritten content from PDF and image documents, including support for documents in multiple languages. In the next section, we will review how to use Amazon Textract APIs, walk through the JSON responses in detail, and understand how to use Textract with your applications.
In this section, we will introduce and walk through the Amazon Textract APIs for real-time analysis and batch processing of documents. We will show these APIs in action using Amazon SageMaker Jupyter notebooks. For this section, you will need to create an Amazon SageMaker Jupyter notebook and set up IAM permissions for that notebook role to access Amazon Textract. After that you will need to clone the notebook from our GitHub repository (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services), download the sample images, create an Amazon S3 (https://aws.amazon.com/s3/) bucket, upload these images to the S3 bucket, and then refer to this location in the notebook for processing.
Let's get started:
IAM role permissions while creating Amazon SageMaker Jupyter notebooks
First, accept the default for the IAM role at notebook creation time to allow access to any S3 bucket. After the notebook instance is created, follow the instructions in the sub-section Changing IAM permissions and trust relationships for the Amazon SageMaker notebook execution role under the section, Setting up your AWS environment at the beginning of this chapter to add AmazonTextractFullAccess as a permissions policy to the notebook's IAM role.
This will take you to the home folder of your notebook instance.
Before jumping into a notebook demo of how you can use the Textract APIs, we will explore the APIs and their features. Amazon Textract APIs can be classified into synchronous APIs for real-time processing and asynchronous APIs for batch processing. Let's now examine the functions of these APIs.
These APIs take single-page scanned images (JPG or PNG) from your existing filesystem, which is local to your computer, or in an Amazon S3 bucket. There are two APIs for real-time analysis:
Key/value pairs
In a form, a key/value pair links a field label (the key) to the content entered for that field (the value). For example, the key will be the name and the value will be "Jane Doe."
These APIs accept single-page or multi-page images (JPG/PNG) and PDFs that are uploaded to an Amazon S3 bucket. They run a batch analysis to extract content from these images and documents:
Note
Batch APIs can be used with JPEG, PNG, and PDF documents stored in an Amazon S3 bucket.
In this section, we covered batch and real-time APIs of Amazon Textract. In the next section, we will see the implementation of these APIs through the Jupyter notebook you set up in the previous section.
In this section, we will implement the Textract APIs in a Jupyter notebook. We will execute the code cells in the notebook at https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2002/Amazon%20Textract%20API%20Sample.ipynb, which you cloned into your Jupyter notebook environment in a previous step. The notebook contains the prerequisite steps and the complete code for all the APIs; we provide only the important code snippets in the book, as follows:
Let's begin:
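import boto3
from IPython.display import Image, display
# Create the S3 and Textract clients used throughout the notebook
s3 = boto3.client('s3')
textract = boto3.client('textract')
s3BucketName = "<your amazon s3 bucket>"
documentName = "sample-invoice.png"
# Render the local sample invoice inline in the notebook
display(Image(filename=documentName))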
That displays the following image:
The following code will read the document's content in the form of image bytes:
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())
response = textract.detect_document_text(Document={'Bytes': imageBytes})
You are passing the image bytes directly to this API and getting a JSON response. This JSON response has a structure that contains blocks of identified text, pages, lines, a bounding box, form key values, and tables. In order to understand the Amazon Textract JSON structure and data types, refer to this link: https://docs.aws.amazon.com/textract/latest/dg/API_Block.html.
import json
print(json.dumps(response, indent=4, sort_keys=True))
{
    "BlockType": "LINE",
    "Confidence": 99.96764373779297,
    "Geometry": {
        "BoundingBox": {
            "Height": 0.013190358877182007,
            "Left": 0.5149770379066467,
            "Top": 0.16227620840072632,
            "Width": 0.06892169266939163
        },
Note
This API will not give you forms and tables. It gives only lines, words, and corresponding bounding boxes. This API will be helpful for use cases such as paragraph detection in audit documents and extracting text from scanned books.
Execute the rest of the cells in the notebook to explore the JSON response in detail.
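Printing the full JSON is useful once, but it quickly becomes hard to read. As a minimal sketch, the following helper summarizes a DetectDocumentText-style response by counting the block types and listing each detected line with its confidence score. The function name `summarize_blocks` is an assumption for illustration.

```python
from collections import Counter


def summarize_blocks(response):
    """Return (block type counts, [(line text, confidence), ...]) for a response."""
    blocks = response.get("Blocks", [])
    # Count how many PAGE, LINE, and WORD blocks came back.
    counts = Counter(block["BlockType"] for block in blocks)
    # Keep only LINE blocks, which carry the text of a full line.
    lines = [
        (block["Text"], round(block["Confidence"], 2))
        for block in blocks
        if block["BlockType"] == "LINE"
    ]
    return counts, lines
```

In the notebook, you could run `counts, lines = summarize_blocks(response)` right after the `detect_document_text` call and print both values to get a quick overview before digging into individual blocks.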
Now, we will show you how you can use the DetectDocumentText API to detect text in two-column documents in reading order, with your data stored in an Amazon S3 bucket:
documentName = "textract-samples/two-column-image.jpg"
display(Image(url=s3.generate_presigned_url('get_object', Params={'Bucket': s3BucketName, 'Key': documentName})))
# Assign to lowercase "response" so the print call below can find it
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })
print(response)
Note
For more details about the DetectDocumentText API, refer to this link: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.detect_document_text.
python -m pip install amazon-textract-response-parser
from trp import Document

doc = Document(response)
for page in doc.pages:
    # Each entry pairs a column index with the line text; print the text
    for line in page.getLinesInReadingOrder():
        print(line[1])
You get the following response:
Now we will analyze invoices with the AnalyzeDocument API to extract forms and tables:
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["FORMS", "TABLES"])
doc = Document(response)
for page in doc.pages:
    # Print every key/value pair detected on the page
    print("Fields:")
    for field in page.form.fields:
        print("Key: {}, Value: {}".format(field.key, field.value))
    # Look up a single field by its exact key
    print(" Get Field by Key:")
    key = "Phone Number:"
    field = page.form.getFieldByKey(key)
    if field:
        print("Key: {}, Value: {}".format(field.key, field.value))
    # Search for fields whose key contains the given text
    print(" Search Fields:")
    key = "address"
    fields = page.form.searchFieldsByKey(key)
    for field in fields:
        print("Key: {}, Value: {}".format(field.key, field.value))
You will get the following output:
Fields:
Key: Phone:, Value: 206-555-1234
Key: Phone:, Value: None
Key: Phone:, Value: None
Key: COMMENTS OR SPECIAL INSTRUCTIONS:, Value: loading lack locatal in alley
Key: SALES TAX, Value: 41.21
Key: SHIPPING and HANDLING, Value: 50.00
Key: REQUISITIONER, Value: None
Key: SUBTOTAL, Value: 457.9n
Key: TOTAL DUE, Value: 549.15
Key: SALESPERSON, Value: John SMITH
Key: SHIP TO:, Value: Jane Doe Doe Street Press 987 Doe St. #800 Seattle, WA 98108 206-555-9876
Key: P.O. NUMBER, Value: 0000145678
Key: TO:, Value: Jane Doe Doe Street Press 987 Doe St. #800 Seattle, WA 98108 206-555-9876
Key: DATE:, Value: 01/10/2021
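The response parser hides how forms are represented in the raw JSON. As a rough sketch of what happens underneath, form data arrives as KEY_VALUE_SET blocks: a KEY block points to its VALUE block through a VALUE relationship, and both point to their WORD blocks through CHILD relationships. The helper names below are assumptions for illustration and are not part of the parser library.

```python
def _text_for(block, block_map):
    """Join the text of the WORD (or selection) children of a block."""
    parts = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = block_map[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                # Checkboxes report SELECTED or NOT_SELECTED instead of text.
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def key_value_pairs(response):
    """Return a list of (key text, value text) tuples from an AnalyzeDocument response."""
    block_map = {block["Id"]: block for block in response["Blocks"]}
    pairs = []
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_text = ""
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                # Follow the link from the KEY block to its VALUE block(s).
                value_text = " ".join(
                    _text_for(block_map[value_id], block_map)
                    for value_id in relationship["Ids"]
                )
        pairs.append((_text_for(block, block_map), value_text))
    return pairs
```

A list of tuples is used rather than a dictionary because, as the output above shows, the same key (such as Phone:) can appear more than once on a page.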
doc = Document(response)
for page in doc.pages:
    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))
Table[0][0] = QUANTITY
Table[0][1] = DESCRIPTION
Table[0][2] = UNIT PRICE
Table[0][3] = TOTAL
Table[1][0] = 4
Table[1][1] = OFFILE GARS
Table[1][2] = 64.99
Table[1][3] = 25996
Table[2][0] = 2
Table[2][1] = OFFICE DESX
Table[2][2] = 98.99
Table[2][3] = 197.98
Note
You can convert these values into a pandas DataFrame, which we will cover in Chapter 16, Improving the Accuracy of PDF Batch Processing.
To find out more about the API JSON responses, refer to this link: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html.
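To illustrate the table metadata and indexing information in the raw response, here is a minimal sketch that rebuilds each table as a list of rows without the parser library. It relies on TABLE blocks linking to CELL blocks, which carry 1-based `RowIndex` and `ColumnIndex` values and link in turn to their WORD blocks. The function names are assumptions for illustration.

```python
def table_rows(table_block, block_map):
    """Rebuild one TABLE block as a list of rows, each a list of cell strings."""
    cells = {}
    for relationship in table_block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for cell_id in relationship["Ids"]:
            cell = block_map[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = []
            for cell_relationship in cell.get("Relationships", []):
                if cell_relationship["Type"] == "CHILD":
                    words.extend(
                        block_map[word_id]["Text"]
                        for word_id in cell_relationship["Ids"]
                        if block_map[word_id]["BlockType"] == "WORD"
                    )
            # RowIndex and ColumnIndex start at 1 in the Textract response.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    row_count = max(row for row, _ in cells)
    column_count = max(column for _, column in cells)
    return [
        [cells.get((row, column), "") for column in range(1, column_count + 1)]
        for row in range(1, row_count + 1)
    ]


def all_tables(response):
    """Return every table in an AnalyzeDocument response as a list of rows."""
    block_map = {block["Id"]: block for block in response["Blocks"]}
    return [
        table_rows(block, block_map)
        for block in response["Blocks"]
        if block["BlockType"] == "TABLE"
    ]
```

Because the result is a plain list of lists, it can be written out with the standard `csv` module or handed to a DataFrame constructor later on.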
In this section, you will see how to analyze PDF documents using Textract async APIs for a sample job application form:
jobID = startTextAnalysis(s3Bucket, docName)
print("Started text analysis for: {}".format(jobID))
# isAnalysisComplete returns the final job status, so compare it explicitly
if isAnalysisComplete(jobID) == "SUCCEEDED":
    response = getAnalysisResults(jobID)
def startTextAnalysis(s3Bucket, doc):
    response = None
    response = textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3Bucket,
                'Name': doc
            }
        })
    # The response key is "JobId" (not "JobID")
    return response["JobId"]
import time

def isAnalysisComplete(jobID):
    response = textract.get_document_text_detection(JobId=jobID)
    status = response["JobStatus"]
    print("Text Analysis status: {}".format(status))
    # Poll every 2 seconds until the job leaves the IN_PROGRESS state
    while status == "IN_PROGRESS":
        time.sleep(2)
        response = textract.get_document_text_detection(JobId=jobID)
        status = response["JobStatus"]
        print("Status of Text Analysis is: {}".format(status))
    return status
def getAnalysisResults(jobID):
    pages = []
    response = textract.get_document_text_detection(JobId=jobID)
    pages.append(response)
    print("We received results for: {}".format(len(pages)))
    nextToken = None
    if 'NextToken' in response:
        nextToken = response['NextToken']
    # Keep requesting result pages until no NextToken is returned
    while nextToken:
        response = textract.get_document_text_detection(JobId=jobID, NextToken=nextToken)
        pages.append(response)
        print("We got the results for: {}".format(len(pages)))
        nextToken = None
        if 'NextToken' in response:
            nextToken = response['NextToken']
    return pages
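The polling loop above waits for as long as the job stays in progress. As a hedged variation, the following sketch caps the number of polling attempts and then flattens the list of paginated results into plain lines of text. The function names, the default delay, and the attempt limit are assumptions for illustration; the client is passed in as an argument and is expected to behave like `boto3.client('textract')`.

```python
import time


def wait_for_job(textract_client, job_id, delay_seconds=2, max_attempts=150):
    """Poll a text detection job until it finishes or the attempts run out.

    Returns the final JobStatus (for example SUCCEEDED or FAILED) and raises
    TimeoutError if the job is still IN_PROGRESS after max_attempts polls.
    """
    for _ in range(max_attempts):
        response = textract_client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return status
        time.sleep(delay_seconds)
    raise TimeoutError("Job {} did not finish in time".format(job_id))


def collect_lines(pages):
    """Flatten a list of paginated responses into the text of every LINE block."""
    return [
        block["Text"]
        for page in pages
        for block in page.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]
```

With the functions defined earlier in this section, you could call `collect_lines(getAnalysisResults(jobID))` once `wait_for_job(textract, jobID)` returns `"SUCCEEDED"`.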
And we are done with the demo. Hopefully, you have had an opportunity to review and work with the different Textract APIs for real-time and batch processing and have successfully completed your notebook. In the next section, we will see how you can use these APIs to build serverless applications.
We have covered both the synchronous (real-time) APIs and the asynchronous (batch) APIs. Now, the question is how to integrate these APIs into an application. You can quickly integrate them into a web application or a batch processing system by using AWS Lambda, which runs your code, written in a language such as Java or Python, without you having to provision or manage servers. Lambda follows an event-driven model, in which a function is invoked in response to an event. For example, uploading a document to Amazon S3 can trigger a Lambda function. In that Lambda function, you can call the Amazon Textract APIs and save the results in Amazon S3:
We will cover the architecture in detail in upcoming chapters, where we will talk about how you can build applications using the synchronous versus asynchronous APIs of Amazon Textract. We will also talk about using Amazon API Gateway to create RESTful APIs to integrate into your web applications or mobile applications.
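Until then, the S3-to-Lambda-to-Textract flow described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture built in later chapters: the helper name `process_record`, the `textract-output` prefix, and the shape of the saved JSON are all hypothetical. It uses the synchronous DetectDocumentText API, so it suits single-page images.

```python
import json
from urllib.parse import unquote_plus


def process_record(record, textract_client, s3_client, output_prefix="textract-output"):
    """Run text detection on one S3 event record and save the lines back to S3."""
    bucket = record["s3"]["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded.
    key = unquote_plus(record["s3"]["object"]["key"])
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]
    # The output prefix is a hypothetical choice for this sketch.
    output_key = "{}/{}.json".format(output_prefix, key)
    s3_client.put_object(
        Bucket=bucket,
        Key=output_key,
        Body=json.dumps({"source": key, "lines": lines}).encode("utf-8"),
    )
    return output_key


def lambda_handler(event, context):
    """Entry point invoked by Lambda for each S3 upload event."""
    # Imported here so process_record can be exercised without the AWS SDK.
    import boto3

    textract_client = boto3.client("textract")
    s3_client = boto3.client("s3")
    outputs = [
        process_record(record, textract_client, s3_client)
        for record in event.get("Records", [])
    ]
    return {"outputs": outputs}
```

Because this sketch writes its output to the same bucket that triggers it, the S3 trigger should be limited to the input prefix (or the results written to a different bucket) so that saving the JSON file does not invoke the function again.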
In this chapter, we saw a detailed introduction to Amazon Textract and its product features, along with a console walk-through, as well as running code samples using Textract APIs for different types of documents using both real-time and batch analysis.
We started by introducing the ready-made intelligence that Amazon Textract offers with powerful pre-trained ML models, and the ability to use its capabilities in your applications with just an API call. We also read about some popular use cases that Textract can be used for, along with references to some of the following chapters, where we will review those use cases in greater detail. We also read about Textract's benefits as compared to traditional OCR applications and rule-based document processing.
We covered various examples on how you can use Amazon Textract with different types of scanned images and forms. We reviewed different functions of Textract, such as detecting raw text, detecting form values that are stored as key/value pairs, detecting text in tables, detecting pages of text, detecting lines and words, detecting handwritten text and printed text, and detecting text in multiple languages, as well as detecting text that is written in two-column styles in documents. We covered both synchronous processing and asynchronous processing using Textract APIs. We also saw how to set up an Amazon SageMaker Jupyter notebook, clone the GitHub repository, and get started with running a Jupyter notebook. We were able to use an Amazon S3 bucket to store input documents and use them with Textract, and we were able to extract data from unstructured documents and store them in an Amazon S3 bucket.
In this chapter, we also covered Amazon Textract real-time APIs such as the AnalyzeDocument API and the DetectDocumentText API. We discussed the expected input document formats for these APIs and their limitations. We then spoke about how you can scale document processing for use cases where you need to extract data in batches. We read about batch processing APIs along with a Python SDK demo. Finally, we introduced an architecture to integrate Textract into your applications using AWS Lambda.
In the next chapter, you will be introduced to Amazon Comprehend, an AI service that uses ML to uncover insights in text. You will learn about different NLP techniques, review the features for Amazon Comprehend, read about its APIs, learn how you can set up a custom NLP model using Comprehend to detect entities unique to your business, and, like we did in this chapter, you will see Comprehend in action for different use cases.