Extract Data from an Image Using AWS Textract

November 18, 2022

TABLE OF CONTENTS

1. Overview
2. About AWS Textract
3. Use Cases
4. Architecture Diagram
5. Steps to Setup AWS S3
6. Steps to Setup Amazon Lambda
7. Conclusion
8. About CloudThat
9. FAQs

 

Overview

Modern technology has largely solved the problem of extracting data from structured forms, which can now be processed without human involvement. In many other cases, however, data arrives in a wide variety of unstructured documents with no consistent layout. Many businesses and government organizations still extract data manually from scanned documents such as PDFs, tables, and forms, a process that is slow, expensive, and prone to errors. Amazon Textract uses machine learning to read and process a wide range of documents quickly, accurately extracting text, forms, and tables without manual effort or custom extraction code.

About AWS Textract

Amazon Textract is a highly scalable machine learning (ML) service that automatically extracts text, handwriting, and data from documents such as scanned images and PDFs. Beyond detecting text, it can also analyze a document for relationships between detected items, such as tables, key-value pairs, and selection elements.

When an Amazon Textract operation processes a document, it returns the results as an array of Block objects or an array of ExpenseDocument objects. Both object types describe the items that were detected, including their location in the document and their relationship to other items in the document.
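
To make the Block structure more concrete, here is a minimal sketch that walks a response and picks out the detected lines. The sample_response dictionary is a hypothetical, heavily trimmed stand-in for a real response (real blocks also carry fields such as Geometry and Relationships), and the helper names are ours, not part of the Textract API.

```python
from collections import Counter

# Hypothetical, trimmed response. A real Textract response has more fields per block.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 1001", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Text": "Invoice", "Confidence": 99.3},
        {"BlockType": "WORD", "Id": "w2", "Text": "1001", "Confidence": 98.9},
    ]
}


def count_block_types(response):
    """Count how many blocks of each BlockType the response contains."""
    return Counter(block["BlockType"] for block in response.get("Blocks", []))


def lines_of_text(response):
    """Return the text of every LINE block, in the order Textract returned them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


print(count_block_types(sample_response))
print(lines_of_text(sample_response))
```

LINE blocks are usually the most convenient level for plain text extraction, while WORD blocks are useful when per-word confidence matters.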

Use Cases

  • Importing documents and forms into business applications
  • Creating smart search indexes
  • Creating automated workflows for document processing
  • Maintaining compliance in document archives
  • Extracting text for Natural Language Processing (NLP)
  • Extracting text for document classification

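Architecture Diagram

[Architecture diagram: an image uploaded to Amazon S3 triggers an AWS Lambda function, which calls Amazon Textract and writes the extracted text to Amazon CloudWatch Logs]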

Steps to Setup AWS S3

Step 1: Open the Amazon S3 console.

Step 2: Click Create bucket. Enter a bucket name (e.g., data-extract-from-image; bucket names must be globally unique) and select the AWS Region you want to work in.

Step 3: Click Create bucket at the bottom of the page to finish.
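
If you prefer to script this step, the same bucket can be created with boto3. The following is a minimal sketch, not part of the original walkthrough: the create_bucket helper and the Region in the usage comment are our own assumptions, and the client is passed in so the function can be exercised without AWS credentials.

```python
def create_bucket(s3_client, bucket_name, region):
    """Create an S3 bucket in the given Region using a boto3 S3 client."""
    if region == "us-east-1":
        # us-east-1 is the default location and must not be sent as a LocationConstraint.
        return s3_client.create_bucket(Bucket=bucket_name)
    return s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   region = "ap-south-1"  # assumed Region, pick the one you selected in Step 2
#   create_bucket(boto3.client("s3", region_name=region), "data-extract-from-image", region)
```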

Steps to Setup Amazon Lambda

Step 1: Open the AWS Lambda console.

Step 2: Click Create function and enter a function name (e.g., textract-lambda). Then select Python 3.9 as the runtime.

Step 3: Choose an execution role, which defines the permissions of your Lambda function. Select the option to create a new role with basic Lambda permissions and click Create function.

Step 4: On the function page, open the Configuration tab and select Permissions. Then click the role name.

Step 5: Attach the AmazonTextractFullAccess and AWSLambdaExecute policies to the Lambda execution role. AWSLambdaExecute lets the function read objects from S3 and write to CloudWatch Logs.
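
The policy attachment in Step 5 can also be done with the boto3 IAM client. This is a sketch under assumptions: attach_policies is our own helper, and the role name in the usage comment is a placeholder, since the console generates the real name for you.

```python
# ARNs of the two AWS managed policies named in Step 5.
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonTextractFullAccess",
    "arn:aws:iam::aws:policy/AWSLambdaExecute",
]


def attach_policies(iam_client, role_name, policy_arns=POLICY_ARNS):
    """Attach each managed policy to the Lambda execution role."""
    for arn in policy_arns:
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return list(policy_arns)


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   attach_policies(boto3.client("iam"), "<your-lambda-role-name>")  # placeholder role name
```

AmazonTextractFullAccess is convenient for a demo. For production use, a narrower policy that allows only the Textract actions you call is usually a better fit.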

Step 6: Add the S3 bucket as a trigger for the Lambda function, so that every image uploaded to the bucket invokes it.
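
Once the trigger is in place, S3 passes the function an event describing the uploaded object. The sketch below shows how the bucket name and object key can be read from that event. The sample_event is a hypothetical, trimmed example of an S3 notification, and uploaded_objects is our own helper name.

```python
from urllib.parse import unquote_plus

# Hypothetical, trimmed S3 notification event. Real events carry more metadata.
sample_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "data-extract-from-image"},
                "object": {"key": "invoices/sample+invoice.png"},
            },
        }
    ]
}


def uploaded_objects(event):
    """Return a (bucket, key) pair for every record in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (a space becomes "+"), so decode them.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


print(uploaded_objects(sample_event))
```

Decoding the key matters in practice: passing the encoded key to another service fails for file names that contain spaces or special characters.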

Step 7: Add the code to the Lambda function. The code uses the boto3 detect_document_text API, which detects text in the input document and returns it as machine-readable text. After adding the code, save it and click the Deploy button. (GitHub Link)
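
For readers who want to see the shape of such a function, here is a minimal sketch of a handler built around detect_document_text. It is our own illustration and not the code from the linked repository. The extract_lines helper and the optional textract_client parameter are assumptions, added so the handler can be tried locally with a stub client.

```python
from urllib.parse import unquote_plus


def extract_lines(textract_client, bucket, key):
    """Run synchronous text detection on an S3 object and return its lines."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]


def lambda_handler(event, context, textract_client=None):
    """Entry point: extract text from every object named in the S3 event."""
    if textract_client is None:
        import boto3  # provided by the Lambda Python runtime

        textract_client = boto3.client("textract")

    results = {}
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        lines = extract_lines(textract_client, bucket, key)
        # Anything printed here ends up in the function's CloudWatch log stream.
        print(f"Extracted from s3://{bucket}/{key}:")
        for line in lines:
            print(line)
        results[key] = lines
    return results
```

Lambda calls the handler with only event and context, so the default client is created on first use. The third parameter exists purely for local experiments.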

Step 8: Upload an invoice image to the data-extract-from-image bucket.
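
The upload can be scripted as well. This is a small sketch using the boto3 S3 client's upload_file method. The upload_image helper and the local file name in the usage comment are hypothetical.

```python
import os


def upload_image(s3_client, local_path, bucket_name, key=None):
    """Upload a local file to the bucket, using the file name as the default key."""
    key = key or os.path.basename(local_path)
    s3_client.upload_file(local_path, bucket_name, key)
    return key


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   upload_image(boto3.client("s3"), "invoice.png", "data-extract-from-image")  # hypothetical file
```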

Step 9: Open the function's log group in Amazon CloudWatch Logs. The log events contain the text extracted from the image.
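
The same log events can be read programmatically. The sketch below assumes the default log group naming for Lambda, /aws/lambda/ followed by the function name. The recent_log_messages helper is our own, and it reads only the first page of results.

```python
def recent_log_messages(logs_client, function_name, limit=50):
    """Return log messages from the Lambda function's default log group."""
    response = logs_client.filter_log_events(
        logGroupName=f"/aws/lambda/{function_name}",  # default Lambda log group name
        limit=limit,
    )
    return [event["message"].rstrip() for event in response.get("events", [])]


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   for message in recent_log_messages(boto3.client("logs"), "textract-lambda"):
#       print(message)
```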

Conclusion

In this blog, we learned how to use the Amazon Textract API to extract data from an image without any ML experience. This solution can improve decision-making efficiency and can be applied in any industry that works with physical or scanned documents, such as legal documents, purchase receipts, inventory reports, invoices, and purchase orders. We will discuss more use cases for other AWS services in our upcoming blogs.

About CloudThat

CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping their businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers support all the stakeholders in the cloud computing sphere.

Drop a query if you have any questions regarding AWS Textract and I will get back to you quickly.

To get started, explore CloudThat’s offerings on our Consultancy page and Managed Services Package.

FAQs

Q1: What document formats does Amazon Textract support?

A. Amazon Textract currently supports PNG, JPEG, TIFF, and PDF formats. With the synchronous APIs, you can send images either as an S3 object or as a byte array. With the asynchronous APIs, you can send S3 objects. If your document is already in one of the supported file formats, do not convert or resample it before uploading it to Amazon Textract.
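
The two input styles map onto the Document parameter of the boto3 calls. The sketch below shows both forms. The helper names and the extension check are our own assumptions, with the extension list taken from the formats above.

```python
import os

# File extensions for the formats listed above (PNG, JPEG, TIFF, PDF).
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}


def is_supported_format(filename):
    """Check the file extension against the supported formats."""
    return os.path.splitext(filename)[1].lower() in SUPPORTED_EXTENSIONS


def document_from_bytes(image_bytes):
    """Document argument for sending the file content directly (synchronous APIs)."""
    return {"Bytes": image_bytes}


def document_from_s3(bucket, key):
    """Document argument for pointing Textract at an object stored in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   textract = boto3.client("textract")
#   # Synchronous call with an S3 object:
#   textract.detect_document_text(Document=document_from_s3("data-extract-from-image", "invoice.png"))
#   # Asynchronous call, which accepts S3 objects only:
#   textract.start_document_text_detection(
#       DocumentLocation=document_from_s3("data-extract-from-image", "invoice.pdf"))
```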

Q2: In which AWS Regions is Amazon Textract available?

A. Amazon Textract is currently available in US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), AWS GovCloud (US-West), AWS GovCloud (US-East), Canada (Central), EU (Ireland), EU (London), EU (Frankfurt), EU (Paris), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Seoul), and Asia Pacific (Mumbai).

Q3: Are there any limits on the number of queries I can ask per document?

A. Queries are a feature of the AnalyzeDocument API that lets you ask natural-language questions about a document. They are processed on a per-page basis, and information can be extracted using queries through synchronous or asynchronous operations. A maximum of 15 queries per page is supported for synchronous operations, and a maximum of 30 queries per page is supported for asynchronous operations.
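
As an illustration, queries are passed to analyze_document through the QueriesConfig parameter together with the QUERIES feature type. The sketch below builds that parameter and enforces the per-page limits quoted above. The build_queries_config helper and the sample question are our own, and the check assumes that all questions target the same page.

```python
MAX_SYNC_QUERIES_PER_PAGE = 15   # limit quoted above for synchronous operations
MAX_ASYNC_QUERIES_PER_PAGE = 30  # limit quoted above for asynchronous operations


def build_queries_config(questions, max_queries=MAX_SYNC_QUERIES_PER_PAGE):
    """Build a QueriesConfig value, rejecting more questions than the limit allows."""
    if len(questions) > max_queries:
        raise ValueError(
            f"{len(questions)} queries given, but at most {max_queries} are allowed per page"
        )
    return {"Queries": [{"Text": question} for question in questions]}


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   boto3.client("textract").analyze_document(
#       Document={"S3Object": {"Bucket": "data-extract-from-image", "Name": "invoice.png"}},
#       FeatureTypes=["QUERIES"],
#       QueriesConfig=build_queries_config(["What is the invoice number?"]),
#   )
```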

