OCR Optimized for Images Created by Print Typesetting

Achieving highly accurate recognition of text characters in print replica digital pages
Norihiko SAWA
06-10-2022
2022 Computation+Journalism Conference

Who I am
SAWA, Norihiko - Principal Research Engineer
• Researching the media and mediatech environment in the US and building relationships for Nikkei, based in New York
• Launched in-house development and grew the team
• Made an algorithm and built a platform for article recommendations
• AI-generated headlines and articles
• AI-generated videos from articles
• Developing business strategies based on sophisticated data simulations and forecasts
• Implementing strategies to grow subscriber LTV
[email protected]

Who we are
AOTA, Masaki - Data scientist
• Implementation of this algorithm
• Effect measurement for new features
• LTV calculation

What we build
OCR specially optimized to recognize images created by typesetting
- Highly accurate image recognition in Nikkei print replica editions
- The accuracy of detection of character rectangles is defined as the number of detected rectangles that do not cross over the bodies of characters divided by the total number of detected rectangles
- Google Cloud Vision API: 0.21322
- Our algorithm: 0.97222

Nikkei’s print replica viewer apps
- Web for PC
- Apps for Tablet and Smartphone

Challenges and Problems
Vertical writing
- Top to bottom and right to left
- Vertical headlines
- Vertical sub-headlines
- Vertical text
- Paragraph blocks
- Flexible layouts with images

For what
Demo
- Use images as “text”
- Highlight any part of replica images
- [In the near future] Select and copy text to the clipboard

Why we started developing this technology in-house
- OCRs didn’t work well on images in Nikkei print replica editions
- By assuming that characters are arranged in blocks, detecting rectangular blocks becomes easier
Idea
- Detect rectangular blocks in the following order: (1) paragraphs, (2) lines, and (3) characters
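The block-first idea above can be illustrated with projection profiles. This is a minimal sketch and not the authors' implementation: it assumes a paragraph block has already been isolated as a binary image (1 = ink), that lines are pixel columns read right to left, and that blank rows or columns separate units. The names `runs` and `split_vertical_block` are hypothetical.

```python
def runs(profile):
    """Return (start, end) pairs of consecutive non-zero entries (end exclusive)."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def split_vertical_block(block):
    """Split one paragraph block into lines, then into character boxes.

    Assumption: vertical writing, so a "line" is a band of pixel columns
    (read right to left) and characters inside it run top to bottom.
    Returns a list of lines; each line is a list of (x0, y0, x1, y1) boxes.
    """
    height = len(block)
    width = len(block[0]) if block else 0
    # Ink per pixel column: empty columns separate lines.
    col_ink = [sum(block[y][x] for y in range(height)) for x in range(width)]
    lines = []
    for x0, x1 in reversed(runs(col_ink)):  # rightmost line first
        # Ink per pixel row inside this line: empty rows separate characters.
        row_ink = [sum(block[y][x0:x1]) for y in range(height)]
        lines.append([(x0, y0, x1, y1) for y0, y1 in runs(row_ink)])
    return lines
```

Splitting only on fully blank rows would break characters that contain internal gaps, which is presumably why the deck goes on to define a loss function for the character step.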

How we build
Dilation by Morphological Transformations
- Removes other articles and headlines at preprocessing
- Dilation to recognize areas of each group of characters
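To show why dilation helps group characters, here is a dependency-free sketch of binary dilation followed by a count of connected groups. It is an illustration under assumptions: the square structuring element, the `radius` parameter, and the helper names `dilate` and `count_groups` are not from the deck. In practice a library routine such as OpenCV's `cv2.dilate` would likely be used; the deck does not say which library was chosen.

```python
def dilate(image, radius=1):
    """Binary dilation with a (2 * radius + 1) square structuring element."""
    height = len(image)
    width = len(image[0]) if image else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                # Paint the whole neighbourhood of every ink pixel.
                for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def count_groups(image):
    """Count 4-connected groups of ink pixels using an iterative flood fill."""
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    groups = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] and not seen[y][x]:
                groups += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return groups
```

Nearby strokes merge into one group after dilation while distant ones stay separate, so the bounding box of each merged group approximates the area of a group of characters.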

How we build
Finding the positions of characters
- Define a loss function to find the positions of characters
- Gaps should be between characters
- Minimize the loss function to draw horizontal lines
Figure: A Y position where there is not a single black pixel can be a break between characters.
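The deck does not give the loss function itself, so the following is only a sketch of one plausible form. It assumes a single line of vertical text as a binary image (1 = ink) and a known character pitch in pixels. The hypothetical loss for a cut at row y near an expected row e is `ink[y] + alpha * |y - e|`: crossing ink is penalised, and so is drifting away from the regular pitch. The names `row_ink`, `place_cuts`, `pitch`, and `alpha` are all assumptions.

```python
def row_ink(line_image):
    """Number of ink pixels (1s) in each pixel row of a single text line."""
    return [sum(row) for row in line_image]


def place_cuts(line_image, pitch, alpha=0.1):
    """Place one horizontal cut near every expected character boundary.

    Hypothetical loss (not from the deck) for a cut at row y near the
    expected row e:  loss(y) = ink[y] + alpha * abs(y - e)
    A row without any ink costs almost nothing, so cuts fall into gaps.
    `pitch` must be a positive number of pixels.
    """
    ink = row_ink(line_image)
    height = len(ink)
    cuts = []
    expected = pitch
    while expected < height:
        # Search half a pitch above and below the expected boundary.
        low = max(0, expected - pitch // 2)
        high = min(height, expected + pitch // 2 + 1)
        best = min(range(low, high),
                   key=lambda y: ink[y] + alpha * abs(y - expected))
        cuts.append(best)
        expected = best + pitch  # the next boundary is one pitch further down
    return cuts
```

Because the search window is tied to the pitch, a blank row inside a character (as in a glyph made of separate horizontal strokes) is not mistaken for a character boundary.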

How we build
Paragraph Order
- Both reverse Z and N patterns are used to order paragraphs in Japanese print newspapers
- Matching of each character in sentences to templates
- Total of 2000 kanji characters
- Low recognition accuracy is tolerated; the focus is on the positions of characters and the order of paragraphs
- Paragraph order by dynamic programming with Levenshtein distance between predicted text and text for digital products
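As a rough illustration of ordering paragraphs by Levenshtein distance, the sketch below builds two candidate reading orders and keeps the one whose concatenated OCR text is closest to the digital article text. Several things here are assumptions: paragraphs are simplified to a rectangular grid of text snippets, "reverse Z" is read as right to left within a row and rows top to bottom, "N" is read as top to bottom within a column and columns right to left, and the helper names are hypothetical. The authors' actual dynamic program may differ.

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def reverse_z_order(grid):
    """Right to left within each row, rows from top to bottom."""
    return [cell for row in grid for cell in reversed(row)]


def n_order(grid):
    """Top to bottom within each column, columns from right to left."""
    columns = len(grid[0])
    return [row[c] for c in reversed(range(columns)) for row in grid]


def best_order(grid, reference):
    """Pick the candidate order whose text is closest to the reference text."""
    candidates = {"reverse_z": reverse_z_order(grid), "n": n_order(grid)}
    scores = {name: levenshtein("".join(order), reference)
              for name, order in candidates.items()}
    name = min(scores, key=scores.get)
    return name, candidates[name]
```

Because the comparison uses edit distance rather than exact equality, a few misrecognised characters in the OCR text do not change which order wins.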

How we build
Positions of each character
- Special characters like “30” as 1 character, and “㌽” containing 4 characters in 1 character space, are translated to text for digital products
- Match detected characters to text for digital products by dynamic programming
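The matching step can be sketched as an edit-distance alignment with a backtrace. This is an illustration under assumptions, not the authors' code: each detected cell carries a rough OCR guess, a cell may hold several characters (for example two digits set side by side), and Unicode NFKC normalization is used to expand a squared glyph such as “㌽” into its four katakana. The names `expand_cells` and `align_cells` are hypothetical.

```python
import unicodedata


def expand_cells(cells):
    """Expand each detected cell into (cell_index, character) pairs.

    NFKC turns a squared glyph such as U+333D into its component katakana,
    so one cell can correspond to several characters of the digital text.
    """
    pairs = []
    for index, glyph in enumerate(cells):
        for ch in unicodedata.normalize("NFKC", glyph):
            pairs.append((index, ch))
    return pairs


def align_cells(cells, reference):
    """Map each cell index to the reference positions it aligns with."""
    seq = expand_cells(cells)
    n, m = len(seq), len(reference)
    # cost[i][j] = edit distance between seq[:i] and reference[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = cost[i - 1][j - 1] + (seq[i - 1][1] != reference[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the end, recording matches and substitutions.
    mapping = {}
    i, j = n, m
    while i > 0 and j > 0:
        diagonal = cost[i - 1][j - 1] + (seq[i - 1][1] != reference[j - 1])
        if cost[i][j] == diagonal:
            mapping.setdefault(seq[i - 1][0], []).append(j - 1)
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1  # detected character with no counterpart in the reference
        else:
            j -= 1  # reference character that was not detected
    return {cell: sorted(positions) for cell, positions in mapping.items()}
```

A misrecognised character is aligned as a substitution, so its rectangle still receives the correct character from the digital text.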

Results
Defining the accuracy of detection
- The accuracy rate is defined as the number of rectangles that do not cross over the bodies of characters out of the total number of detected rectangles
- Google Cloud Vision API: 0.21322
- Our algorithm: 0.97222
Figure: Detection results for our algorithm and Google Cloud Vision API
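The accuracy rate defined above can be computed as follows. The deck does not say how "crossing over the body of a character" was tested, so this sketch assumes one reading: a rectangle crosses a character when an ink pixel just inside one of its edges is directly adjacent to an ink pixel just outside it. Rectangles are half-open `(x0, y0, x1, y1)` tuples on a binary image (1 = ink), and both function names are hypothetical.

```python
def crosses_ink(image, rect):
    """True if a stroke continues across any edge of the rectangle."""
    x0, y0, x1, y1 = rect
    height = len(image)
    width = len(image[0]) if image else 0

    def ink(y, x):
        return 0 <= y < height and 0 <= x < width and image[y][x] == 1

    for x in range(x0, x1):
        # Top edge, then bottom edge.
        if (ink(y0, x) and ink(y0 - 1, x)) or (ink(y1 - 1, x) and ink(y1, x)):
            return True
    for y in range(y0, y1):
        # Left edge, then right edge.
        if (ink(y, x0) and ink(y, x0 - 1)) or (ink(y, x1 - 1) and ink(y, x1)):
            return True
    return False


def detection_accuracy(image, rects):
    """Share of detected rectangles that do not cross a character body."""
    if not rects:
        return 0.0
    clean = sum(1 for rect in rects if not crosses_ink(image, rect))
    return clean / len(rects)
```

Under this definition a detector is rewarded for rectangles that enclose whole glyphs and penalised for rectangles that slice through them.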

Application
Highlight and copy “text”
- This technology makes highlight and copy-to-clipboard functions possible in the print replica viewer
- Use images of the replica as “text” in the app
- Select, highlight and copy are helpful in understanding users’ interests
- Combined with audio, highlighting can help people with reading disabilities such as dyslexia
Figure: User interface of the next release

Thank you! ありがとうございます。
SAWA, Norihiko
[email protected]
Feel free to contact me
https://www.ft.com/search?q=denshiba
