Developing OCR Applications with Amazon Textract

Slide 1

Slide 1 text

DEVELOPING OCR APPLICATIONS WITH AMAZON TEXTRACT
CHARLES RAJENDRAN

Slide 2

Slide 2 text

About me
• CHARLES RAJENDRAN
• WORKS AT ASCENTIC
• MAINLY A JS DEVELOPER
• BLOGGER / YOUTUBER

Slide 3

Slide 3 text

Agenda
• What is OCR?
• Overview of Amazon Textract
• Single Page Document Extraction
• Understanding Textract Response JSON
• Multi Page Document Extraction
• Textract Limitations
• Q & A

Slide 4

Slide 4 text

What is OCR?
• OCR (optical character recognition) is the conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo or another image source.
• Use cases of OCR:
  • Automating data entry, extraction and processing.
  • Scanning printed documents into versions that can be edited with word processors.
  • Indexing print material for search engines.
  • Placing important, signed legal documents into an electronic database.
• OCR can be performed in different ways:
  • Hardware + software
  • AI techniques
• Software-level OCR:
  • Using libraries: Tesseract
  • Cloud providers: Google Vision OCR, Amazon Textract

Slide 5

Slide 5 text

Amazon Textract
• Amazon Textract is a service that automatically extracts text and data from scanned documents.
• Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.
• Amazon Textract is currently available in US East (Northern Virginia), US East (Ohio), US West (Oregon) and EU (Ireland), and very recently in London and Singapore.
• Pricing:
  • Free Tier: Analyze up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API, for the first three months.
  • First 1 million pages: $0.0015 per page
  • Over 1 million pages: $0.0006 per page
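To make the tiered pricing concrete, here is a minimal sketch that estimates a monthly bill from a page count. The function name `estimateMonthlyCost` is hypothetical, the rates are the two per-page prices quoted on this slide, and the free tier is ignored for simplicity.

```javascript
// Per-page rates as quoted on the slide (USD). Free tier is not modelled.
const FIRST_TIER_PAGES = 1000000;
const FIRST_TIER_PRICE = 0.0015;
const OVER_TIER_PRICE = 0.0006;

// Hypothetical helper: estimated cost in USD for `pages` pages in one month.
function estimateMonthlyCost(pages) {
  const firstTierPages = Math.min(pages, FIRST_TIER_PAGES);
  const overTierPages = Math.max(pages - FIRST_TIER_PAGES, 0);
  return firstTierPages * FIRST_TIER_PRICE + overTierPages * OVER_TIER_PRICE;
}
```

For example, 1.5 million pages would be billed as 1,000,000 pages at the first rate plus 500,000 pages at the second.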

Slide 6

Slide 6 text

Textract APIs
• Detecting Text
  • Detects lines and words
  • Relationship between lines and words
  • Location of the words and lines
• Analyzing Text
  • Analyzes documents and forms for relationships between detected text.
  • Returns 3 categories of text extraction: text, forms, and tables.
• Sync API (for single page documents)
  • DetectDocumentText
  • AnalyzeDocument
• Async API (for multi page documents)
  • StartDocumentTextDetection
  • GetDocumentTextDetection
  • StartDocumentAnalysis
  • GetDocumentAnalysis

Slide 7

Slide 7 text

DEMO: Single Page Document Extraction

Slide 8

Slide 8 text

DEMO: Single Page Document Extraction
• Create a serverless project.
• Import the AWS SDK and configure the Textract object.
• Provide the information about the image you want to extract as input for the Textract API, along with other options.
• Call the API (AnalyzeDocument).
• Create a key-value extraction function.
• Test the function.
• Code: https://github.com/CharlesRajendran/aws-meetup
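The input and call steps above can be sketched as follows. This is not the code from the linked repository; it is a minimal illustration that assumes the image is already in S3 and that `textract` is an AWS SDK for JavaScript (v2) Textract client. The helper names `buildAnalyzeParams` and `analyzeSinglePage` are hypothetical.

```javascript
// Build the AnalyzeDocument request for an image stored in S3.
// FeatureTypes selects which analyses to run: forms (key-value pairs) and tables.
function buildAnalyzeParams(bucket, key) {
  return {
    Document: { S3Object: { Bucket: bucket, Name: key } },
    FeatureTypes: ["FORMS", "TABLES"],
  };
}

// `textract` is assumed to be a v2 SDK client, created elsewhere with
// something like `new AWS.Textract({ region: "us-east-1" })`.
async function analyzeSinglePage(textract, bucket, key) {
  const response = await textract
    .analyzeDocument(buildAnalyzeParams(bucket, key))
    .promise();
  return response.Blocks; // all detected elements as block objects
}
```

Passing the client in as an argument keeps the function easy to test with a stub instead of a real AWS call.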

Slide 9

Slide 9 text

Textract Response

Slide 10

Slide 10 text

BLOCKS ARRAY
The most important part of the Textract response is the Blocks array, which contains all the text and other detected elements as block objects.
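A quick way to get a feel for a response is to count its blocks by type. The sketch below assumes `response` is the parsed JSON returned by Textract; `countBlocksByType` is a hypothetical helper name.

```javascript
// Count how many blocks of each BlockType a Textract response contains.
function countBlocksByType(response) {
  const counts = {};
  for (const block of response.Blocks || []) {
    counts[block.BlockType] = (counts[block.BlockType] || 0) + 1;
  }
  return counts;
}
```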

Slide 11

Slide 11 text

BLOCK OBJECTS
• BlockType: type of element
  • PAGE
  • LINE
  • WORD
  • TABLE
  • CELL
  • KEY_VALUE_SET
  • SELECTION_ELEMENT
• Geometry: location of the element on the page
• Id: unique identifier of the block
• Relationships: links to related blocks
  • CHILD
  • VALUE
• Page: page number on which the block is located
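Because relationships refer to other blocks by Id, it usually helps to index the Blocks array first. The following sketch shows one way to do that; `indexBlocks` and `relatedBlocks` are hypothetical helper names, while `Relationships`, `Type` and `Ids` are the field names used in the response.

```javascript
// Index blocks by Id so that relationship Ids can be resolved quickly.
function indexBlocks(blocks) {
  return new Map(blocks.map((block) => [block.Id, block]));
}

// Resolve the blocks a given block points to through one relationship
// type ("CHILD" or "VALUE"). Unknown Ids are skipped.
function relatedBlocks(block, type, blocksById) {
  return (block.Relationships || [])
    .filter((rel) => rel.Type === type)
    .flatMap((rel) => rel.Ids)
    .map((id) => blocksById.get(id))
    .filter(Boolean);
}
```

For a LINE block, `relatedBlocks(line, "CHILD", blocksById)` would return its WORD blocks.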

Slide 12

Slide 12 text

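BLOCK OBJECTS – LINE and WORD
• Confidence: the model's confidence score for the detected text.
• Text: the actual text.
• Geometry: location of the text relative to the page
  • Bounding box, expressed as ratios of the page width and height
  • Polygon points, expressed as ratios of the page width and height
• Id: identifier of the block object
• Relationships
  • CHILD: the words that make up a line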
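Since the geometry is given as ratios rather than pixels, drawing a box on the original image needs a conversion. The sketch below shows that conversion together with a simple confidence filter. Both function names are hypothetical, and the page dimensions are assumed to be known from the image itself.

```javascript
// Convert a ratio-based BoundingBox (Left, Top, Width, Height between 0 and 1)
// into pixel coordinates for an image of the given size.
function toPixelBox(boundingBox, pageWidth, pageHeight) {
  return {
    left: Math.round(boundingBox.Left * pageWidth),
    top: Math.round(boundingBox.Top * pageHeight),
    width: Math.round(boundingBox.Width * pageWidth),
    height: Math.round(boundingBox.Height * pageHeight),
  };
}

// Keep only the text of LINE blocks whose Confidence meets a threshold.
function confidentLines(blocks, minConfidence) {
  return blocks
    .filter((b) => b.BlockType === "LINE" && b.Confidence >= minConfidence)
    .map((b) => b.Text);
}
```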

Slide 13

Slide 13 text

BLOCK OBJECTS – TABLE and CELL
• Relationships:
  • Page > Table > Cell > Word
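Following the Table > Cell > Word chain, a table can be rebuilt as rows of strings. This is a minimal sketch that relies on the 1-based `RowIndex` and `ColumnIndex` fields of CELL blocks; it ignores merged cells, and the helper names are hypothetical.

```javascript
// Collect the Ids a block points to through one relationship type.
function relatedIds(block, type) {
  return (block.Relationships || [])
    .filter((rel) => rel.Type === type)
    .flatMap((rel) => rel.Ids);
}

// Rebuild a TABLE block as a two-dimensional array of cell texts.
function tableToRows(table, blocks) {
  const byId = new Map(blocks.map((b) => [b.Id, b]));
  const rows = [];
  for (const cellId of relatedIds(table, "CHILD")) {
    const cell = byId.get(cellId);
    if (!cell || cell.BlockType !== "CELL") continue;
    // A cell's text is the concatenation of its child WORD blocks.
    const text = relatedIds(cell, "CHILD")
      .map((id) => byId.get(id))
      .filter((b) => b && b.BlockType === "WORD")
      .map((b) => b.Text)
      .join(" ");
    const r = cell.RowIndex - 1; // RowIndex and ColumnIndex start at 1
    const c = cell.ColumnIndex - 1;
    rows[r] = rows[r] || [];
    rows[r][c] = text;
  }
  return rows;
}
```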

Slide 14

Slide 14 text

BLOCK OBJECTS – KEY VALUE SET
• EntityTypes: denotes whether the block object is a key or a value.
• Key relationships
  • VALUE: a reference to the value block of the key
  • CHILD: the words that form the key text
• Value relationships
  • CHILD: the words that form the value text
• Be careful with the confidence percentage.
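Putting the key and value relationships together, a key-value extraction function might look like the sketch below. It is an illustration rather than the function from the demo repository; `extractKeyValues` is a hypothetical name, and the key's `Confidence` is returned alongside each value so callers can decide what to trust.

```javascript
// Extract form fields from a Blocks array as { keyText: { value, confidence } }.
function extractKeyValues(blocks) {
  const byId = new Map(blocks.map((b) => [b.Id, b]));
  const relatedIds = (block, type) =>
    (block.Relationships || [])
      .filter((rel) => rel.Type === type)
      .flatMap((rel) => rel.Ids);
  // Text of a key or value block = its child WORD blocks joined by spaces.
  const textOf = (block) =>
    relatedIds(block, "CHILD")
      .map((id) => byId.get(id))
      .filter((b) => b && b.BlockType === "WORD")
      .map((b) => b.Text)
      .join(" ");

  const pairs = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === "KEY_VALUE_SET" &&
      (block.EntityTypes || []).includes("KEY");
    if (!isKey) continue;
    const value = relatedIds(block, "VALUE")
      .map((id) => byId.get(id))
      .filter(Boolean)
      .map(textOf)
      .join(" ");
    pairs[textOf(block)] = { value, confidence: block.Confidence };
  }
  return pairs;
}
```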

Slide 16

Slide 16 text

MULTI PAGE DOCUMENT EXTRACTION

Slide 17

Slide 17 text

ASYNC API EXAMPLE (Analyze Document)
• The async operation involves a few more steps and some extra configuration.

Slide 18

Slide 18 text

• Create an Amazon SNS topic and prepend the topic name with AmazonTextract.
• Create an Amazon SQS standard queue.

Slide 19

Slide 19 text

• Subscribe the queue to the topic

Slide 20

Slide 20 text

• Give the Amazon SNS topic permission to send messages to the SQS queue (queue access policy).
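As an illustration, the queue's access policy could be generated like this. The function name and both ARNs are placeholders; the statement grants only `sqs:SendMessage` and only to messages that originate from the given topic.

```javascript
// Build an SQS access policy that lets one SNS topic send messages to one queue.
function buildQueuePolicy(queueArn, topicArn) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: { Service: "sns.amazonaws.com" },
        Action: "sqs:SendMessage",
        Resource: queueArn,
        // Restrict the permission to messages coming from this topic.
        Condition: { ArnEquals: { "aws:SourceArn": topicArn } },
      },
    ],
  };
}
```

The resulting object would be serialized with `JSON.stringify` and set as the queue's policy, in the console or through the SDK.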

Slide 21

Slide 21 text

• Create an IAM service role that Amazon Textract can assume (trust policy), and give that role permission to publish to the SNS topic.
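The role needs two documents: a trust policy naming Textract as the service allowed to assume it, and a permissions policy allowing it to publish to the topic. The sketch below builds both as plain objects; the function names are hypothetical, and the topic ARN is a placeholder.

```javascript
// Trust policy: allows the Amazon Textract service to assume the role.
function buildTextractTrustPolicy() {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: { Service: "textract.amazonaws.com" },
        Action: "sts:AssumeRole",
      },
    ],
  };
}

// Permissions policy: allows the role to publish job status to one SNS topic.
function buildPublishPolicy(topicArn) {
  return {
    Version: "2012-10-17",
    Statement: [{ Effect: "Allow", Action: "sns:Publish", Resource: topicArn }],
  };
}
```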

Slide 22

Slide 22 text

• Give the relevant permissions (inline or managed policies) to the user.

Slide 23

Slide 23 text

• The input parameters should contain information about the file, the feature types, the SNS topic and the role.
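A minimal sketch of such an input object is shown below. The field names follow the StartDocumentAnalysis request shape; the function name and all argument values are placeholders.

```javascript
// Build the StartDocumentAnalysis request for a document stored in S3.
function buildStartAnalysisParams({ bucket, key, topicArn, roleArn }) {
  return {
    // The file to analyze
    DocumentLocation: { S3Object: { Bucket: bucket, Name: key } },
    // Which analyses to run
    FeatureTypes: ["FORMS", "TABLES"],
    // Where Textract publishes the completion status, and the role it assumes to do so
    NotificationChannel: { SNSTopicArn: topicArn, RoleArn: roleArn },
  };
}
```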

Slide 24

Slide 24 text

• Call startDocumentAnalysis with the parameters to start the Textract job.
• It returns a JobId, and Textract performs the rest in the background.
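In code, starting the job is a single call. The sketch below assumes `textract` is an AWS SDK for JavaScript (v2) Textract client and `params` is the request object described earlier; `startAnalysisJob` is a hypothetical wrapper name.

```javascript
// Start an asynchronous analysis job and return its JobId.
// The call returns as soon as the job is queued, not when it finishes.
async function startAnalysisJob(textract, params) {
  const { JobId } = await textract.startDocumentAnalysis(params).promise();
  return JobId;
}
```

The JobId should be kept (or logged), because it is needed later to fetch the results.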

Slide 25

Slide 25 text

• Once Textract finishes the work in the background, it sends a message to the SNS topic with the status and JobId information.
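When the topic delivers to an SQS queue without raw message delivery, the queue message body is an SNS envelope whose `Message` field holds the Textract notification as a JSON string. A minimal sketch of unpacking it, with a hypothetical function name:

```javascript
// Extract the job id and status from an SQS message body that wraps
// an SNS notification sent by Textract.
function parseCompletionMessage(sqsBody) {
  const envelope = JSON.parse(sqsBody); // SNS envelope
  const message = JSON.parse(envelope.Message); // Textract notification
  return { jobId: message.JobId, status: message.Status };
}
```

A status of `SUCCEEDED` means the results are ready to be fetched.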

Slide 26

Slide 26 text

• Once the message is received by the subscribed queue, SQS triggers a Lambda function that retrieves the final Textract response with the getDocumentAnalysis method.
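Results for large documents come back in pages, so the Lambda usually has to call getDocumentAnalysis repeatedly until no `NextToken` is returned. The following is a sketch of that loop, again assuming a v2 SDK client passed in as `textract`; `fetchAllBlocks` is a hypothetical name.

```javascript
// Fetch every block of a finished analysis job, following NextToken pagination.
async function fetchAllBlocks(textract, jobId) {
  const blocks = [];
  let nextToken;
  do {
    const params = { JobId: jobId };
    if (nextToken) params.NextToken = nextToken;
    const page = await textract.getDocumentAnalysis(params).promise();
    blocks.push(...(page.Blocks || []));
    nextToken = page.NextToken; // undefined on the last page
  } while (nextToken);
  return blocks;
}
```

Inside the Lambda handler this would be called once per SQS record, after reading the JobId from the record body and checking that the job succeeded.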

Slide 27

Slide 27 text

LIMITATIONS OF AMAZON TEXTRACT
• File Size Limitations
  • The maximum document image (JPEG/PNG) size is 5 MB.
  • The maximum PDF file size is 500 MB.
  • The maximum number of pages in a PDF file is 3000.
  • The maximum PDF media size for the height and width dimensions is 40 inches or 2880 points.
  • The minimum height for text to be detected is 15 pixels. At 150 DPI, this would be equivalent to 8-pt font.
• Rotated Images
  • Documents can be rotated a maximum of +/- 10% from the vertical axis. Text can be aligned horizontally within the document.
• Language
  • Amazon Textract only supports English text detection.
  • Amazon Textract doesn't support the detection of handwriting.
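Checking the size limits before uploading avoids a failed request. The sketch below encodes only the file size and page count limits listed on this slide; the function name, the input shape and the "unsupported file type" branch are assumptions made for illustration.

```javascript
const MB = 1024 * 1024;
// Limits as listed on the slide.
const LIMITS = { imageBytes: 5 * MB, pdfBytes: 500 * MB, pdfPages: 3000 };

// Return a list of reasons why a document would be rejected (empty if none).
// `type` is assumed to be "jpeg", "png" or "pdf".
function checkDocument({ type, sizeBytes, pages = 1 }) {
  const problems = [];
  if (type === "pdf") {
    if (sizeBytes > LIMITS.pdfBytes) problems.push("PDF larger than 500 MB");
    if (pages > LIMITS.pdfPages) problems.push("PDF has more than 3000 pages");
  } else if (type === "jpeg" || type === "png") {
    if (sizeBytes > LIMITS.imageBytes) problems.push("image larger than 5 MB");
  } else {
    problems.push("unsupported file type: " + type);
  }
  return problems;
}
```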

Slide 28

Slide 28 text

REFERENCES
• AWS Textract Developer Documentation: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
• AWS Textract Documentation: https://aws.amazon.com/textract/
• AWS SDK: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
• Code Sample: https://github.com/CharlesRajendran/aws-meetup