OCR Optimized for Images Created by Print Typesetting

SAWA, Norihiko

June 10, 2022

Transcript

  1. OCR Optimized for Images Created by Print Typesetting: Achieving highly accurate recognition of text characters in print replica digital pages

    Norihiko SAWA, 06-10-2022, 2022 Computation+Journalism Conference
  2. Who I am: SAWA, Norihiko - Principal Research Engineer

    - Researching the media and mediatech environment in the US and building relationships for Nikkei, based in New York
    - Launched in-house development and grew the team
    - Made an algorithm and built a platform for article recommendations
    - AI-generated headlines and articles
    - AI-generated videos from articles
    - Developing business strategies based on sophisticated data simulations and forecasts
    - Implementing strategies to grow subscriber LTV
    - [email protected]
  3. Who we are: AOTA, Masaki - Data scientist

    - Implementation of this algorithm
    - Effect measurement for new features
    - LTV calculation
  4. What we build: OCR specially optimized to recognize images created by typesetting

    - Highly accurate image recognition in Nikkei print replica editions
    - The accuracy of detection of character rectangles is defined as the number of detected rectangles that do not cross over the bodies of characters divided by the total number of detected rectangles
    - Accuracy: Google Cloud Vision API 0.21322, our algorithm 0.97222
  5. Nikkei’s print replica viewer apps

    - Web for PC
    - Apps for Tablet and Smartphone
  6. Challenges and Problems: Vertical writing

    - Top to bottom and right to left
    - Vertical headlines
    - Vertical sub-headlines
    - Vertical text
    - Paragraph blocks
    - Flexible layouts with images
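The reading order on this slide (top to bottom within a line, lines running right to left) can be sketched as a sort over character boxes. This is only an illustration: the `Box` type, the `vertical_reading_order` function and the `column_tolerance` parameter are hypothetical names, not taken from the talk.

```python
from typing import NamedTuple

class Box(NamedTuple):
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels

def centre_x(box):
    return box.x + box.w / 2

def vertical_reading_order(boxes, column_tolerance=5):
    """Order boxes top-to-bottom within a column, columns right-to-left."""
    columns = []  # each entry is a list of boxes sharing roughly one x centre
    for box in sorted(boxes, key=lambda b: -centre_x(b)):
        if columns and abs(centre_x(columns[-1][0]) - centre_x(box)) <= column_tolerance:
            columns[-1].append(box)
        else:
            columns.append([box])
    return [b for column in columns for b in sorted(column, key=lambda b: b.y)]

# Two vertical lines of two characters each, given in scrambled order.
boxes = [Box(50, 20, 20, 20), Box(100, 0, 20, 20), Box(50, 0, 20, 20), Box(100, 20, 20, 20)]
ordered = vertical_reading_order(boxes)
```

Real pages with headlines, images and flexible layouts would need more than a single sort, which is the difficulty this slide points at.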
  7. For what: Demo

    - Use images as “text”
    - Highlight any part of replica images
    - [In the near future] Select and copy text to the clipboard
  8. Why we started developing this technology in-house

    - OCRs didn’t work well on images in Nikkei print replica editions
    - By assuming that characters will be arranged in blocks, detecting rectangular blocks becomes easier
    - Idea: detect rectangular blocks in the following order: (1) paragraphs, (2) lines, and (3) characters
  9. How we build: Dilation by Morphological Transformations

    - Removes other articles and headlines at preprocessing
    - Dilation to recognize areas of each group of characters
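To make the dilation step on this slide concrete, here is a minimal sketch of binary dilation with a rectangular kernel. The talk does not say which library or kernel was used. OpenCV's `cv2.dilate` is a common choice for this, but the sketch uses plain NumPy to stay self-contained, and the function name and kernel sizes are assumptions.

```python
import numpy as np

def dilate(binary, kernel_h=3, kernel_w=3, iterations=1):
    """Binary dilation with a rectangular kernel of odd size (illustrative only)."""
    out = binary.astype(bool)
    pad_y, pad_x = kernel_h // 2, kernel_w // 2
    for _ in range(iterations):
        padded = np.pad(out, ((pad_y, pad_y), (pad_x, pad_x)), constant_values=False)
        grown = np.zeros_like(out)
        # A pixel becomes ink if any pixel under the kernel window is ink.
        for dy in range(kernel_h):
            for dx in range(kernel_w):
                grown |= padded[dy:dy + out.shape[0], dx:dx + out.shape[1]]
        out = grown
    return out

# Two ink pixels separated by one blank pixel merge into one blob.
page = np.zeros((5, 7), dtype=bool)
page[2, 2] = True
page[2, 4] = True
merged = dilate(page, kernel_h=1, kernel_w=3)
```

After dilation, nearby characters fuse into connected regions, so the bounding box of each region approximates the area of a group of characters.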
  10. How we build: Finding the positions of characters

    - Define a loss function to find the positions of characters
    - Gaps should be between characters
    - Minimize the loss function to draw horizontal lines
    - Figure: any Y coordinate (pixel row) that contains no black pixel can be a break between characters
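The slide does not spell out the loss function, so the sketch below illustrates only the observation in the figure caption: within one vertical line of text, rows that contain no ink are candidate breaks between characters. The function name and the choice of the middle row of each blank run are assumptions.

```python
import numpy as np

def candidate_breaks(ink):
    """Return the middle row of every run of rows that contain no ink."""
    empty = ~ink.any(axis=1)  # True where a row has no black pixel
    breaks, start = [], None
    for y, is_empty in enumerate(empty):
        if is_empty and start is None:
            start = y                              # a blank run begins
        elif not is_empty and start is not None:
            breaks.append((start + y - 1) // 2)    # the blank run ended at y - 1
            start = None
    if start is not None:                          # blank run reaching the bottom edge
        breaks.append((start + len(empty) - 1) // 2)
    return breaks

# One vertical line of text: two characters with blank rows around them.
column = np.zeros((9, 3), dtype=bool)
column[1:3] = True   # first character occupies rows 1-2
column[5:8] = True   # second character occupies rows 5-7
breaks = candidate_breaks(column)
```

A loss-based version would presumably also have to handle characters that themselves contain blank rows, which this simple scan would wrongly split.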
  11. How we build: Paragraph Order

    - Both reverse Z and N patterns are used to order paragraphs in Japanese print newspapers
    - Each character in a sentence is matched to templates (2000 kanji characters in total)
    - Low accuracy tolerance; focus on the positions of characters and the order of paragraphs
    - Paragraph order by dynamic programming with Levenshtein distance between predicted text and text for digital products
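One way to read the last bullet is sketched below: compute the Levenshtein distance by dynamic programming, then keep the candidate paragraph ordering whose concatenated text is closest to the text for digital products. How the candidate orderings are generated is not described in the talk, so `best_order` and its arguments are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings via dynamic programming over prefixes."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

def best_order(paragraphs, candidate_orders, reference):
    """Pick the ordering whose joined text has the smallest distance to reference."""
    def distance(order):
        return levenshtein("".join(paragraphs[i] for i in order), reference)
    return min(candidate_orders, key=distance)

# Recognized paragraphs (with one recognition error) and two candidate orders.
paragraphs = ["cd", "ab", "ef"]
candidates = [(0, 1, 2), (1, 0, 2)]
chosen = best_order(paragraphs, candidates, reference="abXdef")
```

Because only the relative distances matter, the recognized text does not need to be accurate for the correct order to win.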
  12. How we build: Positions of each character

    - Special characters like “30” as 1 character, and “㌽” containing 4 characters in 1 character space, are translated to text for digital products
    - Match detected characters to text for digital products by dynamic programming
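A minimal sketch of this matching step follows, under two assumptions that the talk does not confirm: single-cell glyphs are expanded with Unicode NFKC normalization, and the alignment is a longest-common-subsequence dynamic program. The function names are hypothetical.

```python
import unicodedata

def expand_cells(cells):
    """Expand each detected cell to plain text, remembering its source cell."""
    chars, owners = [], []
    for index, cell in enumerate(cells):
        for ch in unicodedata.normalize("NFKC", cell):
            chars.append(ch)
            owners.append(index)
    return chars, owners

def align_to_reference(cells, reference):
    """Map each reference character to a detected cell index, or None if unmatched."""
    chars, owners = expand_cells(cells)
    n, m = len(chars), len(reference)
    # lcs[i][j] = length of the longest common subsequence of chars[i:] and reference[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if chars[i] == reference[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    mapping, i, j = [None] * m, 0, 0
    while i < n and j < m:  # walk the table to recover one optimal alignment
        if chars[i] == reference[j]:
            mapping[j] = owners[i]
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return mapping

# "30" set in one cell, the single-cell glyph U+333D, then one ordinary kanji.
cells = ["30", "\u333d", "高"]
# Normalize the reference too, so both sides use the same code points.
reference = unicodedata.normalize("NFKC", "30ポイント高")
mapping = align_to_reference(cells, reference)
```

Each character of the digital text ends up attached to the rectangle of the cell it came from, which is what highlighting and copying need.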
  13. Results: Defining the accuracy of detection

    - The accuracy rate is defined as the number of rectangles that do not cross over the bodies of characters out of the total number of detected rectangles
    - Accuracy: Google Cloud Vision API 0.21322, our algorithm 0.97222
    - Figure panels: Our algorithm, Google Cloud Vision API
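The accuracy rate defined on this slide can be sketched as follows. The talk does not define "crossing over the body of a character" precisely. This sketch assumes it means that a detected rectangle overlaps a character's bounding box without fully enclosing it. The `Rect` type and function names are hypothetical.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    left: int
    top: int
    right: int    # exclusive
    bottom: int   # exclusive

def contains(outer, inner):
    return (outer.left <= inner.left and outer.top <= inner.top
            and outer.right >= inner.right and outer.bottom >= inner.bottom)

def overlaps(a, b):
    return a.left < b.right and b.left < a.right and a.top < b.bottom and b.top < a.bottom

def crosses_body(rect, body):
    """Assumed test: the rectangle cuts through the body without enclosing it."""
    return overlaps(rect, body) and not contains(rect, body)

def detection_accuracy(detected, bodies):
    """Share of detected rectangles that cross no character body."""
    if not detected:
        return 0.0
    clean = sum(1 for r in detected if not any(crosses_body(r, b) for b in bodies))
    return clean / len(detected)

# Two stacked characters and four detected rectangles, one of which cuts both.
bodies = [Rect(0, 0, 10, 10), Rect(0, 12, 10, 22)]
detected = [Rect(0, 0, 10, 11), Rect(0, 11, 10, 23), Rect(0, 5, 10, 15), Rect(-1, -1, 11, 23)]
accuracy = detection_accuracy(detected, bodies)
```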
  14. Application: Highlight and copy “text”

    - This technology makes it possible to highlight text and copy it to the clipboard in the print replica viewer
    - Use images of the replica as “text” in the app
    - Select, highlight and copy are helpful in understanding users’ interests
    - Highlighting can help people with reading disabilities such as dyslexia by combining with audio
    - Figure: Next release User Interface