
OCR Optimized for Images Created by Print Typesetting

SAWA, Norihiko

June 10, 2022

Transcript

  1. OCR Optimized for Images
    Created by Print
    Typesetting
    Achieving highly accurate recognition of text
    characters in print replica digital pages
    Norihiko SAWA 06-10-2022
    2022 Computation+Journalism Conference

  2. SAWA, Norihiko - Principal Research Engineer
    ● Researching the media and mediatech environment in the US and building
    relationships for Nikkei, based in New York
    ● Launched in-house development and grew team
    ● Made an algorithm and built a platform for article recommendations
    ● AI-Generated headlines and articles
    ● AI-Generated videos from articles
    ● Developing business strategies based on sophisticated data simulations and
    forecasts
    ● Implementing strategies to grow subscriber LTV
    [email protected]
    Who I am

  3. AOTA, Masaki
    ● Implementation of this algorithm
    ● Effect measurement for new features
    ● LTV calculation
    Data scientist
    Who we are

  4. - Highly accurate image
    recognition in Nikkei print
    replica editions
    - The accuracy of detection of
    character rectangles is
    defined as the number of
    detected rectangles that do
    not cross over the bodies of
    characters divided by the
    total number of detected
    rectangles
    OCR specially optimized to recognize images
    created by typesetting
    What we build
    Google Cloud Vision API: 0.21322
    Our algorithm: 0.97222

  5. Nikkei’s print replica viewer apps
    - Web for PC
    - Apps for Tablet and Smartphone

  6. Vertical writing
    - Top to bottom and right to left
    - Vertical headlines
    - Vertical sub-headlines
    - Vertical text
    - Paragraph blocks
    - Flexible layouts with images
    Challenges and Problems

  7. Demo
    - Use Images as “Text”
    - Highlight any part of replica
    images
    - [In the near future]
    Select and copy text to the
    clipboard
    For what

  8. - Existing OCR engines did not work well on images
    in Nikkei print replica editions
    - Assuming that characters are arranged
    in blocks makes detecting
    rectangular blocks easier
    Why we started developing this technology in-house
    Idea
    - Detects rectangular blocks in the
    following order: (1) paragraphs,
    (2) lines, and (3) characters
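
The coarse-to-fine idea above can be illustrated with a small sketch. It assumes a binarized paragraph image stored as a list of rows where 1 marks ink, and it splits wherever a row or column contains no ink at all. The function names `ink_spans` and `segment_vertical_text` are hypothetical, and this blank-gap splitting is an illustration, not the authors' implementation.

```python
def ink_spans(profile):
    """Return (start, end) ranges of consecutive non-zero entries in a profile."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_vertical_text(img):
    """Split a paragraph into lines, then characters.

    Returns boxes (x0, y0, x1, y1) in vertical reading order:
    lines from right to left, characters from top to bottom.
    """
    h, w = len(img), len(img[0])
    # Ink per pixel column: blank columns separate vertical lines of text.
    col_profile = [sum(img[y][x] for y in range(h)) for x in range(w)]
    boxes = []
    for x0, x1 in reversed(ink_spans(col_profile)):  # right to left
        # Ink per pixel row inside this line: blank rows separate characters.
        row_profile = [sum(img[y][x0:x1]) for y in range(h)]
        for y0, y1 in ink_spans(row_profile):  # top to bottom
            boxes.append((x0, y0, x1, y1))
    return boxes


# Toy paragraph: two vertical lines, each holding two "characters".
paragraph = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
print(segment_vertical_text(paragraph))
```

The right-hand line is emitted first, which matches the top-to-bottom, right-to-left reading order of vertical Japanese text.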

  9. Dilation by Morphological Transformations
    - Remove other articles
    and headlines during
    preprocessing
    - Dilate to find the
    area of each group of
    characters
    How we build
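
A minimal sketch of the dilation step follows, assuming a binarized page where 1 marks ink and 0 marks background. Each ink pixel is grown by a rectangular structuring element so that neighboring characters merge into one connected area. The names `dilate` and `count_components` and the half-sizes `ry` and `rx` are illustrative assumptions. A production pipeline would more likely call a library routine such as OpenCV's `cv2.dilate`.

```python
def dilate(img, ry=1, rx=1):
    """Binary dilation with a (2*ry+1) x (2*rx+1) rectangular structuring element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                # Paint the whole neighborhood of every ink pixel.
                for yy in range(max(0, y - ry), min(h, y + ry + 1)):
                    for xx in range(max(0, x - rx), min(w, x + rx + 1)):
                        out[yy][xx] = 1
    return out


def count_components(img):
    """Count 4-connected groups of ink pixels with an iterative flood fill."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                count += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


# Toy page: three marks stacked in the left column, one isolated mark on the right.
page = [
    [1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
]
print(count_components(page))                       # 4 separate marks
print(count_components(dilate(page, ry=1, rx=0)))   # 2 areas after vertical dilation
```

Dilating only in the vertical direction (`rx=0`) merges the marks within a column while keeping the two columns apart, which is the effect wanted for vertical text.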

  10. Finding the positions of characters
    - Define a loss function to find the positions of characters
    - Gaps should fall between characters
    - Minimize the loss function to place the horizontal lines that separate characters
    How we build
    Figure: A Y coordinate whose row contains no black pixel can be
    a break between characters.
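
The slides do not state the loss function, so the sketch below uses an assumed one for illustration: each horizontal cut is penalized by the amount of ink in the row it passes through, and each resulting segment is penalized by the squared difference between its height and an expected character height. The names `row_ink` and `find_cuts` and the parameters `pitch` and `alpha` are hypothetical. The loss is minimized with dynamic programming over candidate cut rows.

```python
def row_ink(column):
    """Number of ink pixels in each row of a single vertical line of text."""
    return [sum(row) for row in column]


def find_cuts(ink, pitch, alpha=10.0):
    """Choose cut rows that minimize an assumed loss.

    loss = sum over cuts of alpha * ink[cut]
         + sum over segments of (segment_height - pitch) ** 2
    Virtual cuts sit at row -1 (above the image) and row len(ink) (below it).
    """
    h = len(ink)
    best = {-1: 0.0}   # best[y]: minimal loss with a cut placed at row y
    back = {}          # back[y]: previous cut on the optimal path to y
    for y in range(h + 1):
        cut_cost = alpha * ink[y] if y < h else 0.0
        best[y], back[y] = float("inf"), None
        for prev in range(-1, y):
            height = y - prev - 1  # rows strictly between the two cuts
            cost = best[prev] + (height - pitch) ** 2 + cut_cost
            if cost < best[y]:
                best[y], back[y] = cost, prev
    # Walk back from the virtual bottom cut to collect the real cuts.
    cuts, y = [], back[h]
    while y != -1:
        cuts.append(y)
        y = back[y]
    return cuts[::-1]


# Toy line: three 3-row characters separated by single blank rows.
column = [
    [1, 1, 0], [1, 1, 1], [0, 1, 1],
    [0, 0, 0],
    [1, 1, 1], [0, 1, 0], [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1], [1, 1, 0], [0, 1, 1],
]
print(find_cuts(row_ink(column), pitch=3))  # [3, 7]
```

With this loss, rows without any black pixel are free to cut through, and the height term discourages cutting inside a character that happens to contain a blank row.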

  11. Paragraph Order
    - Both reverse Z and N patterns are used to order
    paragraphs in Japanese print newspapers
    - Each character in a sentence is matched against templates
    (2,000 kanji characters in total)
    - Low recognition accuracy is tolerated; the focus is on the
    positions of characters and the order of paragraphs
    - Paragraph Order by Dynamic Programming with
    Levenshtein distance between predicted text and text for
    digital products
    How we build
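
As a sketch of the ordering step, the code below computes the Levenshtein distance with the standard dynamic-programming recurrence and then picks, from a set of candidate paragraph orders, the one whose concatenated predicted text is closest to the digital article text. The function `best_order`, the candidate orders, and the sample strings are illustrative assumptions; how the authors generate and search candidate orders is not described in the slides.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # delete ca
                cur[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb)  # match or substitute
            ))
        prev = cur
    return prev[-1]


def best_order(paragraphs, candidate_orders, reference):
    """Return the candidate order whose joined text is closest to the reference."""
    return min(
        candidate_orders,
        key=lambda order: levenshtein("".join(paragraphs[i] for i in order), reference),
    )


# Predicted text per paragraph block, with one recognition error in the last block.
predicted = ["東京は晴れ", "大阪は雨", "札幌は雷"]
# Text of the same article as published in the digital product.
digital_text = "大阪は雨東京は晴れ札幌は雪"

candidates = [(0, 1, 2), (1, 0, 2)]
print(best_order(predicted, candidates, digital_text))  # (1, 0, 2)
```

Because the comparison uses edit distance rather than exact equality, a few misrecognized characters do not prevent the correct order from being chosen.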

  12. Positions of each character
    - Special characters, such as “30” set as 1 character and
    “㌽”, which packs 4 characters into 1 character space,
    are converted to the text used in
    digital products
    - Match detected characters to text for digital
    products by dynamic programming
    How we build
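
The matching step can be sketched as follows, under two assumptions that are not stated in the slides: squared characters are expanded with Unicode NFKC normalization (which maps “㌽” to “ポイント”), and detected characters are paired with the digital text by an edit-distance alignment with backtracking. The names `align` and `assign_text` are hypothetical.

```python
import unicodedata


def align(pred, ref):
    """For each index in pred, return the index of its paired ref character or None."""
    n, m = len(pred), len(ref)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + (pred[i - 1] != ref[j - 1]),
            )
    match = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if d[i][j] == d[i - 1][j - 1] + (pred[i - 1] != ref[j - 1]):
            match[i - 1] = j - 1  # match or substitution: pair the characters
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1                # detected character with no counterpart
        else:
            j -= 1                # reference character with no detection
    return match


def assign_text(boxes, ref):
    """Give each detected character box its text from the reference string."""
    flat, owner = [], []
    for k, glyph in enumerate(boxes):
        # NFKC expands a squared glyph into the characters it stands for.
        for ch in unicodedata.normalize("NFKC", glyph):
            flat.append(ch)
            owner.append(k)
    out = [""] * len(boxes)
    for idx, j in enumerate(align(flat, ref)):
        if j is not None:
            out[owner[idx]] += ref[j]
    return out


# Four detected boxes; "\u333d" is the squared character, and the third box is misread.
detected = ["5", "\u333d", "土", "昇"]
print(assign_text(detected, "5ポイント上昇"))  # ['5', 'ポイント', '上', '昇']
```

The misread third box still receives the correct character because the text comes from the digital product and the recognition result is used only to locate it.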

  13. Defining the accuracy of detection
    - The accuracy rate is defined as the number of rectangles that do not cross
    over the bodies of characters out of the total number of detected rectangles
    Results
    Google Cloud Vision API: 0.21322
    Our algorithm: 0.97222
    Figure panels: Our algorithm; Google Cloud Vision API
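
The accuracy rate defined above can be computed as in the sketch below. It assumes axis-aligned boxes given as (x0, y0, x1, y1) and interprets "crossing over the body of a character" as overlapping a character body without fully containing it. That interpretation, the function names, and the sample boxes are assumptions for illustration.

```python
def intersects(a, b):
    """True if boxes a and b overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def contains(a, b):
    """True if box a fully contains box b."""
    return a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]


def crosses(rect, body):
    """A detected rectangle crosses a character body if it cuts through it."""
    return intersects(rect, body) and not contains(rect, body)


def accuracy(detected, bodies):
    """Share of detected rectangles that do not cross any character body."""
    if not detected:
        return 0.0
    clean = sum(1 for r in detected if not any(crosses(r, b) for b in bodies))
    return clean / len(detected)


# Two character bodies stacked vertically, and three detected rectangles.
bodies = [(1, 1, 9, 9), (1, 11, 9, 19)]
detected = [
    (0, 0, 10, 10),   # encloses the first character
    (0, 10, 10, 20),  # encloses the second character
    (0, 5, 10, 15),   # cuts through both characters
]
print(accuracy(detected, bodies))  # 2 of 3 rectangles are clean
```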

  14. Highlight and copy “text”
    - This technology makes it possible to highlight text and
    copy it to the clipboard in the print
    replica viewer
    - Use replica images as “text” in the app
    - Select, highlight, and copy actions help us
    understand users’ interests
    - Highlighting can help people with reading
    disabilities such as dyslexia when combined with
    audio
    Application
    Figure: Next release User Interface

  15. Thank you! ありがとうございます。
    SAWA, Norihiko
    [email protected]
    Feel free to contact me
    https://www.ft.com/search?q=denshiba