Developing OCR Applications with Amazon Textract

DEVELOPING OCR APPLICATIONS WITH AMAZON TEXTRACT • CHARLES RAJENDRAN

About me
• Charles Rajendran
• Works at Ascentic
• Mainly a JS developer
• Blogger / YouTuber

Agenda
• What is OCR?
• Overview of Amazon Textract
• Single Page Document Extraction
• Understanding Textract Response JSON
• Multi Page Document Extraction
• Textract Limitations
• Q & A

What is OCR?
• OCR is the conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo, or another source.
• Use cases of OCR:
  • Automating data entry, extraction, and processing.
  • Scanning printed documents into versions that can be edited with word processors.
  • Indexing print material for search engines.
  • Placing important, signed legal documents into an electronic database.
• OCR can be performed in different ways:
  • Hardware + software
  • AI techniques
• Software-level OCR:
  • Using libraries: Tesseract
  • Cloud providers: Google Vision OCR, Amazon Textract

Amazon Textract
• Amazon Textract is a service that automatically extracts text and data from scanned documents.
• Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.
• Amazon Textract is currently available in US East (Northern Virginia), US East (Ohio), US West (Oregon), and EU (Ireland), and very recently in London and Singapore.
• Pricing:
  • Free Tier: Analyze up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API, for the first three months.
  • First 1 million pages: $0.0015 per page
  • Over 1 million pages: $0.0006 per page
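As a rough illustration of the two price tiers listed above, the arithmetic can be sketched as follows. The function name is hypothetical, the free tier is ignored, and the sketch assumes the lower price applies only to pages beyond the first million in a month.

```python
def estimate_monthly_cost(pages: int) -> float:
    """Estimate the monthly charge in USD for `pages` pages (free tier ignored)."""
    first_tier_limit = 1_000_000
    first_tier_price = 0.0015  # USD per page, first 1 million pages
    over_tier_price = 0.0006   # USD per page, pages beyond 1 million

    first_tier_pages = min(pages, first_tier_limit)
    over_tier_pages = max(pages - first_tier_limit, 0)
    return first_tier_pages * first_tier_price + over_tier_pages * over_tier_price
```

Under these assumptions, 10,000 pages come to roughly $15, and 1.5 million pages to roughly $1,800.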

Textract APIs
• Detecting Text
  • Detects lines and words
  • Relationship between lines and words
  • Location of the words and lines
• Analyzing Text
  • Analyzes documents and forms for relationships between detected text
  • Returns 3 categories of text extraction: text, forms, and tables
• Sync API
  • For single page documents
  • DetectDocumentText
  • AnalyzeDocument
• Async API
  • For multi page documents
  • StartDocumentTextDetection
  • GetDocumentTextDetection
  • StartDocumentAnalysis
  • GetDocumentAnalysis
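The choice between the four entry points can be summarised in a small helper. This is only a sketch: the function name and its two flags are hypothetical, and the returned strings are the operation names from the slide.

```python
def choose_operation(multi_page: bool, needs_forms_or_tables: bool) -> str:
    """Pick the Textract operation that starts the work for a document."""
    if multi_page:
        # Async API: the Start* call is later paired with the matching Get* call.
        if needs_forms_or_tables:
            return "StartDocumentAnalysis"
        return "StartDocumentTextDetection"
    # Sync API: the result comes back in the same call.
    if needs_forms_or_tables:
        return "AnalyzeDocument"
    return "DetectDocumentText"
```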

DEMO: Single Page Document Extraction

DEMO: Single Page Document Extraction
• Create a serverless project.
• Import the AWS SDK and configure the Textract object.
• Provide the image you want to extract from as input to the Textract API, along with other options.
• Call the API (AnalyzeDocument).
• Create a key-value extraction function.
• Test the function.
• Code: https://github.com/CharlesRajendran/aws-meetup
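The demo itself uses the JavaScript SDK (see the repository above). As a minimal sketch of the same call in Python, assuming a boto3 Textract client is passed in, the synchronous request looks roughly like this. The function name is hypothetical; `analyze_document`, `Document`, and `FeatureTypes` are the boto3 names for the AnalyzeDocument operation.

```python
def analyze_single_page(client, image_bytes: bytes) -> list:
    """Send one JPEG/PNG image to AnalyzeDocument and return its Blocks array.

    `client` is assumed to be a boto3 Textract client, for example
    boto3.client("textract", region_name="us-east-1").
    """
    response = client.analyze_document(
        Document={"Bytes": image_bytes},   # raw image bytes; an S3Object also works
        FeatureTypes=["FORMS", "TABLES"],  # request key-value pairs and tables
    )
    return response["Blocks"]
```

Passing the client in as an argument keeps the function easy to test with a stub and leaves region and credential configuration to the caller.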

Textract Response

BLOCKS ARRAY
The most important part of the Textract response is the Blocks array, which contains all the text and other detected elements as block objects.

BLOCK OBJECTS
• BlockType: Type of element
  • PAGE
  • LINE
  • WORD
  • TABLE
  • CELL
  • KEY_VALUE_SET
  • SELECTION_ELEMENT
• Geometry: Location of the element on the page
• Id: Unique identifier of the block object
• Relationships
  • CHILD
  • VALUE
• Page: Number of the page on which the block is located
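Because blocks refer to each other by Id, a lookup table is usually the first thing to build. The helpers below are a minimal sketch over the Blocks array as plain Python dictionaries; the function names are hypothetical, while the field names (`Id`, `BlockType`, `Relationships`, `Type`, `Ids`) follow the response structure described above.

```python
from collections import Counter


def index_blocks(blocks: list) -> dict:
    """Map each block's Id to the block, so relationships can be followed."""
    return {block["Id"]: block for block in blocks}


def count_block_types(blocks: list) -> Counter:
    """Count how many blocks of each BlockType the response contains."""
    return Counter(block["BlockType"] for block in blocks)


def related_ids(block: dict, relationship_type: str = "CHILD") -> list:
    """Return the Ids a block points to through one relationship type."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == relationship_type:
            ids.extend(relationship["Ids"])
    return ids
```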

BLOCK OBJECTS – LINE and WORD
• Confidence: The model's confidence score for the detected text
• Text: The actual text
• Geometry: Location of the text relative to the page
  • Bounding box, expressed as ratios of the page size
  • Polygon points, expressed as ratios of the page size
• Id: Identifier of the block object
• Relationships
  • CHILD: the WORD blocks that make up a LINE
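Since the geometry is given as ratios, it has to be scaled by the image size before it can be drawn or cropped. The sketch below assumes the bounding box fields are `Left`, `Top`, `Width`, and `Height`, and that `Confidence` is a percentage; the function names and the 90 percent default threshold are illustrative only.

```python
def bounding_box_to_pixels(bounding_box: dict, image_width: int, image_height: int) -> dict:
    """Scale a ratio-based bounding box to pixel coordinates."""
    return {
        "left": round(bounding_box["Left"] * image_width),
        "top": round(bounding_box["Top"] * image_height),
        "width": round(bounding_box["Width"] * image_width),
        "height": round(bounding_box["Height"] * image_height),
    }


def confident_lines(blocks: list, min_confidence: float = 90.0) -> list:
    """Return the text of LINE blocks at or above a confidence threshold."""
    return [
        block["Text"]
        for block in blocks
        if block["BlockType"] == "LINE" and block["Confidence"] >= min_confidence
    ]
```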

BLOCK OBJECTS – TABLE and CELL
• Relationships:
  • Page > Table > Cell > Word
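Following the Table > Cell > Word chain, a table can be rebuilt as a list of rows. This is a sketch under the assumption that each CELL block carries 1-based `RowIndex` and `ColumnIndex` fields; the function names are hypothetical.

```python
def related_ids(block: dict, relationship_type: str = "CHILD") -> list:
    """Return the Ids a block points to through one relationship type."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == relationship_type:
            ids.extend(relationship["Ids"])
    return ids


def table_to_rows(table_block: dict, blocks_by_id: dict) -> list:
    """Rebuild a TABLE block as a list of rows, each a list of cell strings."""
    cells = {}
    for cell_id in related_ids(table_block):
        cell = blocks_by_id[cell_id]
        if cell["BlockType"] != "CELL":
            continue
        words = [
            blocks_by_id[word_id]["Text"]
            for word_id in related_ids(cell)
            if blocks_by_id[word_id]["BlockType"] == "WORD"
        ]
        cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    row_count = max((row for row, _ in cells), default=0)
    column_count = max((column for _, column in cells), default=0)
    # Cells that Textract did not return are filled with empty strings.
    return [
        [cells.get((row, column), "") for column in range(1, column_count + 1)]
        for row in range(1, row_count + 1)
    ]
```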

BLOCK OBJECTS – KEY VALUE SET
• EntityTypes: Denotes whether the block object is a key or a value.
• Key relationships
  • VALUE: a reference to the value block of the key
  • CHILD: the words that form the key text
• Value relationships
  • CHILD: the words that form the value text
• Be careful with the confidence percentage.
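The key-value extraction function from the demo can be approximated in Python as below. It is a sketch, not the repository's implementation: the function names and the `min_confidence` parameter are hypothetical, and it assumes checkbox values appear as SELECTION_ELEMENT children with a `SelectionStatus` field.

```python
def related_ids(block: dict, relationship_type: str) -> list:
    """Return the Ids a block points to through one relationship type."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == relationship_type:
            ids.extend(relationship["Ids"])
    return ids


def text_of(block: dict, blocks_by_id: dict) -> str:
    """Join the text of a block's CHILD words (or checkbox statuses)."""
    parts = []
    for child_id in related_ids(block, "CHILD"):
        child = blocks_by_id[child_id]
        if child["BlockType"] == "WORD":
            parts.append(child["Text"])
        elif child["BlockType"] == "SELECTION_ELEMENT":
            parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(blocks: list, min_confidence: float = 0.0) -> dict:
    """Build {key text: value text} from KEY_VALUE_SET blocks."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        if block.get("Confidence", 0.0) < min_confidence:
            continue  # skip keys the model is unsure about
        values = [
            text_of(blocks_by_id[value_id], blocks_by_id)
            for value_id in related_ids(block, "VALUE")
        ]
        pairs[text_of(block, blocks_by_id)] = " ".join(values)
    return pairs
```

Raising `min_confidence` is one way to act on the warning about confidence percentages: low-confidence keys are dropped instead of being written into downstream data.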

MULTI PAGE DOCUMENT EXTRACTION

ASYNC API EXAMPLE (Analyze Document)
• The async operation involves a few more steps and configurations.

• Create an Amazon SNS topic and prefix the topic name with AmazonTextract.
• Create an Amazon SQS standard queue.

• Subscribe the queue to the topic

• Give the Amazon SNS topic permission to send messages to the SQS queue (queue access policy).
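The permission in this step is normally expressed as a policy document attached to the queue. A minimal sketch of such a document, built in Python, is shown below; the function name is hypothetical and the ARNs in the usage note are placeholders.

```python
import json


def build_queue_policy(queue_arn: str, topic_arn: str) -> str:
    """Return a JSON queue policy letting one SNS topic send to one SQS queue."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Only messages published through this topic are accepted.
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }
    return json.dumps(policy)
```

For example, `build_queue_policy("arn:aws:sqs:us-east-1:111122223333:textract-results", "arn:aws:sns:us-east-1:111122223333:AmazonTextractDemo")` produces a string that could be set as the queue's `Policy` attribute.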

• Create a service role and give Amazon Textract permission to access the SNS topic (trust policy).
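The service role needs two documents: a trust policy that lets Textract assume the role, and a permission policy that lets the role publish to the topic. The sketch below builds both as Python dictionaries; the function names are hypothetical, and a managed policy could be attached in place of the hand-written permission policy.

```python
def build_textract_trust_policy() -> dict:
    """Trust policy allowing the Textract service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "textract.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_publish_policy(topic_arn: str) -> dict:
    """Permission policy allowing the role to publish to one SNS topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "sns:Publish", "Resource": topic_arn}
        ],
    }
```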

• Give the relevant permissions (inline or managed policies) to the user.

• The input parameters should include information about the file, the feature types, the SNS topic, and the role.

• Call startDocumentAnalysis with the parameters to start the Textract job.
• It returns a JobId, and Textract performs the rest in the background.
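The input parameters and the start call can be sketched in Python as follows. The slides use the JavaScript SDK method name startDocumentAnalysis; this sketch assumes the boto3 equivalent, `start_document_analysis`, on a client passed in by the caller. The function names and all bucket, key, and ARN values are placeholders.

```python
def build_start_params(bucket: str, key: str, topic_arn: str, role_arn: str) -> dict:
    """Assemble the request: file location, feature types, SNS topic, and role."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["FORMS", "TABLES"],
        "NotificationChannel": {"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    }


def start_analysis(client, bucket: str, key: str, topic_arn: str, role_arn: str) -> str:
    """Start an asynchronous analysis job and return its JobId.

    `client` is assumed to be a boto3 Textract client.
    """
    params = build_start_params(bucket, key, topic_arn, role_arn)
    response = client.start_document_analysis(**params)
    return response["JobId"]
```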

• Once Textract finishes the work in the background, it sends a message to the SNS topic with the status and JobId information.

• Once the message is received by the subscribed queue, SQS triggers a Lambda function that retrieves the final Textract response with the getDocumentAnalysis method.
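A sketch of the receiving side in Python is shown below. It assumes the queue delivers the standard SNS envelope (raw message delivery disabled), so each SQS record body is JSON whose `Message` field is itself a JSON string containing `JobId` and `Status`. It also assumes results are paged with `NextToken`. The function names are hypothetical, and `get_document_analysis` is the boto3 counterpart of getDocumentAnalysis.

```python
import json


def parse_job_notification(sqs_record: dict) -> tuple:
    """Extract (JobId, Status) from one SQS record carrying an SNS message."""
    envelope = json.loads(sqs_record["body"])
    message = json.loads(envelope["Message"])
    return message["JobId"], message["Status"]


def fetch_all_blocks(client, job_id: str) -> list:
    """Collect the Blocks from every result page of a finished job."""
    blocks = []
    request = {"JobId": job_id}
    while True:
        page = client.get_document_analysis(**request)
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks
        request["NextToken"] = next_token


def handle_event(event: dict, client) -> dict:
    """Process an SQS-triggered event; return {JobId: blocks} for succeeded jobs."""
    results = {}
    for record in event.get("Records", []):
        job_id, status = parse_job_notification(record)
        if status == "SUCCEEDED":
            results[job_id] = fetch_all_blocks(client, job_id)
    return results
```

In a real Lambda function, `handle_event` would be called from the handler with a client created once at module load.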

LIMITATIONS OF AMAZON TEXTRACT
• File size limitations
  • The maximum document image (JPEG/PNG) size is 5 MB.
  • The maximum PDF file size is 500 MB.
  • The maximum number of pages in a PDF file is 3000.
  • The maximum PDF media size for the height and width dimensions is 40 inches or 2880 points.
  • The minimum height for text to be detected is 15 pixels. At 150 DPI, this is equivalent to 8-pt font.
• Rotated images
  • Documents can be rotated a maximum of +/- 10% from the vertical axis. Text can be aligned horizontally within the document.
• Language
  • Amazon Textract only supports English text detection.
  • Amazon Textract doesn't support the detection of handwriting.
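Some of these limits can be checked before a file is uploaded. The sketch below covers only the size and page-count limits listed above; the function name is hypothetical, and it assumes 1 MB means 1024 × 1024 bytes and that the file type can be judged from its extension.

```python
import os

MEGABYTE = 1024 * 1024
MAX_IMAGE_BYTES = 5 * MEGABYTE   # JPEG/PNG limit from the slide
MAX_PDF_BYTES = 500 * MEGABYTE   # PDF limit from the slide
MAX_PDF_PAGES = 3000


def check_textract_limits(file_name: str, size_bytes: int, page_count: int = 1) -> list:
    """Return a list of limit violations; an empty list means none were found."""
    problems = []
    extension = os.path.splitext(file_name)[1].lower()
    if extension in (".jpg", ".jpeg", ".png"):
        if size_bytes > MAX_IMAGE_BYTES:
            problems.append("image larger than 5 MB")
    elif extension == ".pdf":
        if size_bytes > MAX_PDF_BYTES:
            problems.append("PDF larger than 500 MB")
        if page_count > MAX_PDF_PAGES:
            problems.append("PDF has more than 3000 pages")
    else:
        problems.append("unsupported file type: " + (extension or "none"))
    return problems
```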

REFERENCES
• AWS Textract Developer Documentation: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
• AWS Textract Documentation: https://aws.amazon.com/textract/
• AWS SDK: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
• Code Sample: https://github.com/CharlesRajendran/aws-meetup

ANY QUESTIONS?
