Developing OCR Applications with Amazon Textract

Many companies today extract data from documents and forms through manual data entry, which is slow and expensive. Amazon Textract is built for this problem: it is a dedicated machine learning service from AWS that performs OCR on images and PDFs. Compared with competing services, Textract adds features such as identifying the relationships between text items in a document.

Charles Rajendran

November 14, 2019

Transcript

  1. About me • CHARLES RAJENDRAN • WORKS AT ASCENTIC • MAINLY A JS DEVELOPER • BLOGGER/YOUTUBER
  2. Agenda • What is OCR? • Overview of Amazon Textract • Single Page Document Extraction • Understanding Textract Response JSON • Multi Page Document Extraction • Textract Limitations • Q&A
  3. What is OCR?
    • OCR is the conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo, or another source.
    • Use cases of OCR:
      • Automating data entry, extraction, and processing.
      • Scanning printed documents into versions that can be edited with word processors.
      • Indexing print material for search engines.
      • Placing important, signed legal documents into an electronic database.
    • OCR can be performed in different ways:
      • Hardware + software
      • AI techniques
    • Software-level OCR:
      • Using libraries: Tesseract
      • Cloud providers: Google Vision OCR, Amazon Textract
  4. Amazon Textract
    • Amazon Textract is a service that automatically extracts text and data from scanned documents.
    • Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.
    • Amazon Textract is currently available in US East (Northern Virginia), US East (Ohio), US West (Oregon), and EU (Ireland), and was very recently added in London and Singapore.
    • Pricing:
      • Free Tier: analyze up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API, for the first three months.
      • First 1 million pages: $0.0015 per page
      • Over 1 million pages: $0.0006 per page
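
The tiered pricing above can be worked through as a small calculation. This is only a sketch: it uses the two per-page prices quoted on the slide, ignores the free tier, and the function name `estimate_monthly_cost` is hypothetical. Current AWS pricing may differ.

```python
# Per-page prices as quoted on the slide; treat them as illustrative only.
TIER_LIMIT = 1_000_000       # pages billed at the first-tier price
FIRST_TIER_PRICE = 0.0015    # USD per page, first 1 million pages
SECOND_TIER_PRICE = 0.0006   # USD per page, beyond 1 million pages


def estimate_monthly_cost(pages: int) -> float:
    """Estimate the monthly cost in USD for `pages` pages (free tier ignored)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    first_tier_pages = min(pages, TIER_LIMIT)
    second_tier_pages = max(pages - TIER_LIMIT, 0)
    return (first_tier_pages * FIRST_TIER_PRICE
            + second_tier_pages * SECOND_TIER_PRICE)
```

For example, 1.5 million pages would cost roughly 1,000,000 × 0.0015 + 500,000 × 0.0006 = 1,800 USD under these assumptions.
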
  5. Textract APIs
    • Detecting Text: detects lines and words, the relationship between lines and words, and the location of the words and lines.
    • Analyzing Text: analyzes documents and forms for relationships between detected text. Returns 3 categories of text extraction: text, forms, and tables.
    • Sync API (for single-page documents)
      • Detecting Text: DetectDocumentText
      • Analyzing Text: AnalyzeDocument
    • Async API (for multi-page documents)
      • Detecting Text: StartDocumentTextDetection, GetDocumentTextDetection
      • Analyzing Text: StartDocumentAnalysis, GetDocumentAnalysis
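
As a rough illustration of the synchronous text-detection path, the sketch below calls `DetectDocumentText` through boto3 (the AWS SDK for Python, used here as an assumption; the talk's demo uses the JavaScript SDK) and pulls the LINE blocks out of the response. The function names `detect_text` and `extract_lines` and the default region are hypothetical.

```python
from typing import Any, Dict, List


def detect_text(image_bytes: bytes, region: str = "us-east-1") -> Dict[str, Any]:
    """Call the synchronous DetectDocumentText API on a single-page image."""
    import boto3  # imported lazily so extract_lines works without AWS access

    client = boto3.client("textract", region_name=region)
    return client.detect_document_text(Document={"Bytes": image_bytes})


def extract_lines(response: Dict[str, Any]) -> List[str]:
    """Return the text of every LINE block, in the order Textract returned them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
```

Keeping the parsing separate from the API call means the parsing can be exercised against a saved response without touching AWS.
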
  6. DEMO: Single Page Document Extraction
    • Create a serverless project.
    • Import the AWS SDK and configure the Textract object.
    • Provide the image you want to extract data from as input to the Textract API, along with other options.
    • Call the API (Analyze Document).
    • Create a key-value extraction function.
    • Test the function.
    • Code: https://github.com/CharlesRajendran/aws-meetup
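
The "configure, provide input, call the API" steps above might look like the following in Python. This is a sketch rather than the demo's actual code (which lives in the linked repository): it assumes boto3, an image already stored in S3, and hypothetical helper names and bucket/key values.

```python
from typing import Any, Dict, List, Optional


def build_analyze_request(bucket: str, key: str,
                          features: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build the keyword arguments for AnalyzeDocument on an S3-hosted image."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        # FORMS enables key-value pairs, TABLES enables table extraction.
        "FeatureTypes": features or ["FORMS", "TABLES"],
    }


def analyze_document(bucket: str, key: str,
                     region: str = "us-east-1") -> Dict[str, Any]:
    """Call the synchronous AnalyzeDocument API (requires AWS credentials)."""
    import boto3  # imported lazily; only needed for the real call

    client = boto3.client("textract", region_name=region)
    return client.analyze_document(**build_analyze_request(bucket, key))
```
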
  7. BLOCKS ARRAY • The most important part of the Textract response is the Blocks array, which contains all the text and other elements as block objects.
  8. BLOCK OBJECTS
    • BlockType: type of element
      • PAGE • LINE • WORD • TABLE • CELL • KEY_VALUE_SET • SELECTION_ELEMENT
    • Geometry: location of the element on the page
    • Id: unique identifier for the block
    • Relationships
      • CHILD • VALUE
    • Page: number of the page on which the block is located
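
To make the block structure concrete, here is a small hand-written Blocks array (heavily trimmed, not real Textract output) together with helpers that index blocks by `Id`, follow CHILD relationships, and count block types. The helper names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List

Block = Dict[str, Any]

# Hand-written, trimmed example; real blocks also carry Geometry, Confidence, etc.
SAMPLE_BLOCKS: List[Block] = [
    {"Id": "p1", "BlockType": "PAGE",
     "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
    {"Id": "l1", "BlockType": "LINE", "Text": "Hello world", "Page": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Hello", "Page": 1},
    {"Id": "w2", "BlockType": "WORD", "Text": "world", "Page": 1},
]


def index_blocks(blocks: List[Block]) -> Dict[str, Block]:
    """Map each block's Id to the block so relationships can be resolved."""
    return {block["Id"]: block for block in blocks}


def child_ids(block: Block) -> List[str]:
    """Return the Ids listed under the block's CHILD relationships."""
    ids: List[str] = []
    for relationship in block.get("Relationships") or []:
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def count_block_types(blocks: List[Block]) -> Counter:
    """Count how many blocks of each BlockType the response contains."""
    return Counter(block["BlockType"] for block in blocks)
```
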
  9. BLOCK OBJECTS – LINE and WORD
    • Confidence: the model's confidence in the detected text
    • Text: the actual text
    • Geometry: location of the text relative to the page
      • Bounding box, as ratios of the page width and height
      • Polygon points, as ratios of the page width and height
    • Id: identifier of the block object
    • Relationships
      • CHILD: the words that make up the line
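
Because the geometry is expressed as ratios of the page size, it has to be scaled by the image dimensions before drawing or cropping. A minimal sketch, assuming the `BoundingBox` keys `Left`, `Top`, `Width`, and `Height` and a hypothetical function name:

```python
from typing import Dict, Tuple


def bounding_box_to_pixels(bounding_box: Dict[str, float],
                           image_width: int,
                           image_height: int) -> Tuple[int, int, int, int]:
    """Convert a ratio-based BoundingBox to (left, top, width, height) in pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    width = round(bounding_box["Width"] * image_width)
    height = round(bounding_box["Height"] * image_height)
    return left, top, width, height
```

For a 1000 × 2000 pixel image, a box with `Left` 0.1, `Top` 0.2, `Width` 0.5, and `Height` 0.05 maps to a 500 × 100 pixel rectangle whose top-left corner is at (100, 400).
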
  10. BLOCK OBJECTS – KEY VALUE SET
    • Entity Type: denotes whether the block object is a key or a value.
    • Key relationships
      • VALUE: references the value block for the key
      • CHILD: the words that form the key text
    • Value relationships
      • CHILD: the words that form the value text
    • Be careful with the confidence percentage.
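
The key/value relationships described above can be walked with a short function. This is a sketch of one possible key-value extraction, not the demo's implementation: it assumes the `EntityTypes` field holds `"KEY"` or `"VALUE"`, handles only WORD children, and uses hypothetical names, including the `min_confidence` parameter for dropping low-confidence keys.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def get_text(block: Block, block_map: Dict[str, Block]) -> str:
    """Join the text of the WORD blocks referenced by CHILD relationships."""
    words: List[str] = []
    for relationship in block.get("Relationships") or []:
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = block_map[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(blocks: List[Block],
                       min_confidence: float = 0.0) -> Dict[str, str]:
    """Pair each KEY block's text with the text of its VALUE block(s)."""
    block_map = {block["Id"]: block for block in blocks}
    pairs: Dict[str, str] = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key or block.get("Confidence", 100.0) < min_confidence:
            continue
        value_parts: List[str] = []
        for relationship in block.get("Relationships") or []:
            if relationship["Type"] == "VALUE":
                for value_id in relationship["Ids"]:
                    value_parts.append(get_text(block_map[value_id], block_map))
        pairs[get_text(block, block_map)] = " ".join(value_parts)
    return pairs
```
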
  11. • Create an Amazon SNS topic and prepend AmazonTextract to the topic name. • Create an Amazon SQS standard queue.
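
A possible Python version of this setup step is sketched below, assuming boto3. The helper names, the `-results` queue suffix, and the default region are hypothetical; only the `AmazonTextract` prefix comes from the slide.

```python
from typing import Tuple

TOPIC_PREFIX = "AmazonTextract"  # prefix required by the slide's instructions


def textract_topic_name(base_name: str) -> str:
    """Prepend the AmazonTextract prefix unless the name already has it."""
    if base_name.startswith(TOPIC_PREFIX):
        return base_name
    return TOPIC_PREFIX + base_name


def create_topic_and_queue(base_name: str,
                           region: str = "us-east-1") -> Tuple[str, str]:
    """Create the SNS topic and a standard SQS queue (requires AWS credentials)."""
    import boto3  # imported lazily; only needed for the real calls

    sns = boto3.client("sns", region_name=region)
    sqs = boto3.client("sqs", region_name=region)
    topic_arn = sns.create_topic(Name=textract_topic_name(base_name))["TopicArn"]
    # A queue created without FIFO attributes is a standard queue.
    queue_url = sqs.create_queue(QueueName=base_name + "-results")["QueueUrl"]
    return topic_arn, queue_url
```
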
  12. • Give the Amazon SNS topic permission to send messages to the SQS queue (queue access policy).
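
Permission for an SNS topic to deliver to an SQS queue is granted through a policy attached to the queue. The sketch below builds such a policy document; the function name is hypothetical and the ARNs are placeholders supplied by the caller.

```python
import json


def build_queue_policy(queue_arn: str, topic_arn: str) -> str:
    """Return a JSON policy letting one SNS topic send messages to one queue."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            # Restrict delivery to messages coming from this specific topic.
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    }
    return json.dumps(policy)
```

With boto3, the resulting string could be applied with `sqs.set_queue_attributes(QueueUrl=..., Attributes={"Policy": policy_json})`.
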
  13. • Create a service role and give Amazon Textract permission to access the SNS topic (Trust Policy).
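
The service role needs two pieces: a trust policy that lets Textract assume the role, and a permission policy that lets the role publish to the topic. The sketch below builds both documents as plain dictionaries; the function names are hypothetical.

```python
from typing import Any, Dict


def build_textract_trust_policy() -> Dict[str, Any]:
    """Trust policy allowing the Textract service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "textract.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def build_publish_policy(topic_arn: str) -> Dict[str, Any]:
    """Permission policy allowing the role to publish to one SNS topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "sns:Publish",
            "Resource": topic_arn,
        }],
    }
```

With boto3, these could be passed (as JSON strings) to `iam.create_role(..., AssumeRolePolicyDocument=...)` and `iam.put_role_policy(..., PolicyDocument=...)` respectively.
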
  14. • Call startDocumentAnalysis with the required parameters to start the Textract job. • It returns a JobId, and Textract performs the rest in the background.
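
In Python, the equivalent of `startDocumentAnalysis` is boto3's `start_document_analysis`. The sketch below assumes the document is in S3 and that the topic and role from the earlier steps already exist; the helper names and default region are hypothetical.

```python
from typing import Any, Dict, List, Optional


def build_start_analysis_request(bucket: str, key: str,
                                 topic_arn: str, role_arn: str,
                                 features: Optional[List[str]] = None
                                 ) -> Dict[str, Any]:
    """Build the keyword arguments for the asynchronous StartDocumentAnalysis call."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": features or ["FORMS", "TABLES"],
        # Where Textract publishes the completion message, and the role it uses.
        "NotificationChannel": {"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    }


def start_analysis(bucket: str, key: str, topic_arn: str, role_arn: str,
                   region: str = "us-east-1") -> str:
    """Start the job and return its JobId (requires AWS credentials)."""
    import boto3  # imported lazily; only needed for the real call

    client = boto3.client("textract", region_name=region)
    response = client.start_document_analysis(
        **build_start_analysis_request(bucket, key, topic_arn, role_arn))
    return response["JobId"]
```
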
  15. • Once Textract finishes the work in the background, it sends a message to the SNS topic with the status and JobId information.
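
When the notification reaches the queue, the SQS message body is typically the SNS envelope, whose `Message` field is itself a JSON string. The sketch below unwraps both layers; it assumes the Textract message carries `JobId` and `Status` fields and that raw message delivery is not enabled on the subscription. The function name is hypothetical.

```python
import json
from typing import Tuple


def parse_textract_notification(sqs_body: str) -> Tuple[str, str]:
    """Return (job_id, status) from an SQS body wrapping an SNS notification."""
    envelope = json.loads(sqs_body)            # outer SNS envelope
    message = json.loads(envelope["Message"])  # inner Textract message
    return message["JobId"], message["Status"]
```
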
  16. • Once the message is received by the subscribed queue, SQS triggers a Lambda function that retrieves the final Textract response with the getDocumentAnalysis method.
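
Results for multi-page documents come back in pages, so the retrieval step usually loops on `NextToken`. A minimal sketch, assuming a boto3 Textract client is passed in (any object with a compatible `get_document_analysis` method works, which keeps the function easy to test); the function name is hypothetical.

```python
from typing import Any, Dict, List


def collect_analysis_blocks(client: Any, job_id: str) -> List[Dict[str, Any]]:
    """Fetch every page of results for a finished job and return all blocks."""
    blocks: List[Dict[str, Any]] = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        if response.get("JobStatus") != "SUCCEEDED":
            raise RuntimeError(
                "job %s is not complete: %s" % (job_id, response.get("JobStatus")))
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:  # no more result pages
            return blocks
```

Inside a Lambda function triggered by SQS, this would be called with `boto3.client("textract")` and the JobId taken from the notification.
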
  17. LIMITATIONS OF AMAZON TEXTRACT
    • File size limitations
      • The maximum document image (JPEG/PNG) size is 5 MB.
      • The maximum PDF file size is 500 MB.
      • The maximum number of pages in a PDF file is 3000.
      • The maximum PDF media size for the height and width dimensions is 40 inches or 2880 points.
      • The minimum height for text to be detected is 15 pixels. At 150 DPI, this would be equivalent to 8-pt font.
    • Rotated images
      • Documents can be rotated a maximum of +/- 10% from the vertical axis. Text can be aligned horizontally within the document.
    • Language
      • Amazon Textract only supports English text detection.
      • Amazon Textract doesn't support the detection of handwriting.
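
The file-size and page-count limits above lend themselves to a pre-flight check before a document is sent to Textract. The sketch below encodes only the limits listed on the slide; it assumes 1 MB means 1024 × 1024 bytes, and the function name is hypothetical.

```python
from typing import List

# Limits as listed on the slide; 1 MB is assumed to be 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 5 * 1024 * 1024
MAX_PDF_BYTES = 500 * 1024 * 1024
MAX_PDF_PAGES = 3000


def check_document_limits(file_name: str, size_bytes: int,
                          page_count: int = 1) -> List[str]:
    """Return a list of limit violations; an empty list means none were found."""
    problems: List[str] = []
    extension = file_name.lower().rsplit(".", 1)[-1]
    if extension in ("jpg", "jpeg", "png"):
        if size_bytes > MAX_IMAGE_BYTES:
            problems.append("image exceeds 5 MB")
    elif extension == "pdf":
        if size_bytes > MAX_PDF_BYTES:
            problems.append("PDF exceeds 500 MB")
        if page_count > MAX_PDF_PAGES:
            problems.append("PDF exceeds 3000 pages")
    else:
        problems.append("unsupported file type: " + extension)
    return problems
```
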
  18. REFERENCES
    • AWS Textract Developer Documentation: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
    • AWS Textract Documentation: https://aws.amazon.com/textract/
    • AWS SDK: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
    • Code Sample: https://github.com/CharlesRajendran/aws-meetup