An Efficient Method to Extract Units of Manchu Characters

This was my first poster presentation at a Korean Academic Technology conference.

Aaron Snowberger

May 21, 2021

Transcript

  1. An Efficient Method to Extract
    Units of Manchu Characters
    1. Aaron Daniel Snowberger ([email protected])
    2. Choong Ho Lee ([email protected])

  2. CONTENTS
    01 Problem Overview
    02 Pre-processing Method
      a. Invert binary image
      b. Scan for pixel density
      c. Simplify data with binarization
      d. Find cut points
      e. Cut image
    03 Program Process
      a. Find Lines
      b. Find Words
      c. Find Letters
    04 Future Research Plan
    05 Conclusion
    NATURAL LANGUAGE PROCESSING
    Extracting the characters present in an image of Manchu script is a form of
    Optical Character Recognition (OCR), a common first step toward natural
    language processing. This can lead to future research on the Manchu script,
    such as stemming, lemmatisation, and morphological segmentation, among others.

  3. Problem Overview
    Extracting characters from Manchu Script

  4. Manchu Script Character Extractor
    Abstract
    Because Manchu characters are written vertically and are connected without spaces within a
    word, pre-processing is required to isolate the character area and separate the units that make
    up each character before the characters can be recognized. In this paper, we describe
    a pre-processing method that extracts the character area and cuts out each character unit.
    Existing research presupposes recognition of each word or character unit, or recognizes
    the part that remains after removing the stem of a continuous character. In contrast,
    this method cuts the script into individually recognizable units, which can then be
    combined, so it can be applied to letter-level recognition. An experiment verified the
    effectiveness of this method.
    Keywords
    Manchu Characters, Character Recognition, Preprocessing, Pattern Recognition

  5. Pre-Processing Method
    Preparing (standardizing) a script image for analysis

  6. Load dataset & visualize input
    Read in the image of the Manchu script
    in grayscale.
    Visualize it, and find its shape.

  7. Binarize Image & Invert It
    Binarizing the image sets every pixel
    in the white background to 1 and
    every pixel that contains the script to 0.
    By then inverting the binary image
    (swapping the 1s and 0s), we can easily
    scan the image for non-zero values, which
    now mark the (white) script.
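
A minimal sketch of this step, using plain Python lists in place of the image array. The helper names `binarize` and `invert` and the threshold of 128 are assumptions made for illustration; they are not taken from the project notebook.

```python
# Hypothetical helpers; the threshold of 128 is an assumed value.
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 1 (light background) or 0 (dark script)."""
    return [[1 if px >= threshold else 0 for px in row] for row in gray]

def invert(binary):
    """Swap 1s and 0s so that script pixels become 1 and background becomes 0."""
    return [[1 - px for px in row] for row in binary]

# A tiny 3x4 "image": 255 is white background, 0 is black ink.
gray = [
    [255,   0, 255, 255],
    [255,   0, 255,   0],
    [255, 255, 255,   0],
]
inverted = invert(binarize(gray))
for row in inverted:
    print(row)
# [0, 1, 0, 0]
# [0, 1, 0, 1]
# [0, 0, 0, 1]
```

After inversion, any non-zero value belongs to the script, which is what the later scanning steps rely on.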

  8. Find font areas
    We will begin with lines of the script.
    Because Manchu is written vertically, each
    line occupies a band of columns, so we scan
    every column of the data looking for non-zero
    values. We sum the non-zero values for each
    column, store the sums in an array, and plot a graph.
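
The column scan can be sketched as follows, assuming the inverted image is a list of rows holding 0s and 1s. The name `column_profile` is hypothetical, and plotting is left out.

```python
def column_profile(inverted):
    """Sum each column; with 0/1 pixels this counts the script pixels per column."""
    # zip(*rows) transposes the image so that each tuple is one column.
    return [sum(column) for column in zip(*inverted)]

inverted = [
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
profile = column_profile(inverted)
print(profile)  # [0, 2, 0, 2]
```

Columns with a sum of 0 contain no script, so they are candidates for the gaps between lines.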

  9. Visualize font areas
    Notice how the graph detailing non-zero
    values matches up with the script image.

  10. Simplify data with binarization
    Next, we binarize the array of pixel density
    information. This binary array will inform
    the program’s decisions about where to
    set cut points for the image.
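
One way to sketch this simplification, assuming that any value above a threshold counts as "script present". The name `binarize_profile` and the default threshold of 0 are illustrative assumptions.

```python
def binarize_profile(profile, threshold=0):
    """Reduce a pixel-density array to 1 (script present) or 0 (empty)."""
    return [1 if value > threshold else 0 for value in profile]

profile = [0, 2, 5, 3, 0, 0, 4, 1]
mask = binarize_profile(profile)
print(mask)  # [0, 1, 1, 1, 0, 0, 1, 1]
```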

  11. Find Cut Points
    Next, we run through the binary array
    looking for cut points.
    ⬥ Startpoint: when 0s change to 1s
    ⬥ Endpoint: when 1s change to 0s
    ⬥ Edge case: 1s at the end of the image
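
The three rules above can be sketched as a single pass over the binary array. The function name `find_cut_points` is hypothetical, and treating the end index as exclusive is a design choice made here so the pairs can be used directly as slice bounds.

```python
def find_cut_points(mask):
    """Return (start, end) index pairs for each run of 1s; end is exclusive."""
    cuts = []
    start = None
    for i, value in enumerate(mask):
        if value == 1 and start is None:
            start = i                    # 0s change to 1s: start point
        elif value == 0 and start is not None:
            cuts.append((start, i))      # 1s change to 0s: end point
            start = None
    if start is not None:
        # Edge case: the run of 1s reaches the end of the image,
        # so there is no 1-to-0 transition to close it.
        cuts.append((start, len(mask)))
    return cuts

mask = [0, 1, 1, 1, 0, 0, 1, 1]
print(find_cut_points(mask))  # [(1, 4), (6, 8)]
```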

  12. Cut Image & Visualize it
    Let’s take a look at the 13 pieces cut from the image based on the cut points we found.
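
Cutting can be sketched as slicing every row of the image at each pair of cut points. The name `cut_columns` is hypothetical, and the cut points are assumed to be (start, end) column pairs with an exclusive end.

```python
def cut_columns(image, cuts):
    """Slice the image into pieces, one per (start, end) column range."""
    return [[row[start:end] for row in image] for start, end in cuts]

image = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
]
pieces = cut_columns(image, [(1, 3), (4, 5)])
print(len(pieces))  # 2
print(pieces[0])    # [[1, 1], [1, 0]]
print(pieces[1])    # [[1], [1]]
```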

  13. Program Process
    Full image → Lines → Words → Letters

  14. Find Lines
    The above process found lines in the image. It can be extended to find words and letters.

  15. Find Words
    Take the first line for
    example. Here, we visualize
    the pixel density and every
    word contained within it.
    We also count the number
    of words (font areas)
    contained within each of
    the other lines in our
    source image.
    Then using the same
    method described above,
    we cut and save each word
    from each line.
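
A sketch of the same method applied to a single line, assuming that words within a vertical line are separated by blank rows, so the scan runs over rows rather than columns. The names `find_runs` and `cut_words` are hypothetical.

```python
def find_runs(mask):
    """Return (start, end) pairs for each run of non-zero values; end is exclusive."""
    runs, start = [], None
    for i, value in enumerate(mask):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(mask)))
    return runs

def cut_words(line_image):
    """Cut one vertical line of script into words wherever rows are blank."""
    row_profile = [sum(row) for row in line_image]    # script pixels per row
    mask = [1 if v > 0 else 0 for v in row_profile]   # binarize the profile
    return [line_image[start:end] for start, end in find_runs(mask)]

line_image = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],  # blank row: gap between two words
    [0, 1, 1],
]
words = cut_words(line_image)
print(len(words))  # 2
```

Counting the words in a line is then just `len(cut_words(line_image))`.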

  16. Find Words (Visualize it)
    Now, let’s visualize the words that have been cut out of the first line.

  17. Find Letters
    Once again, using the method described,
    we can now look for letters in each word.
    But, notice that this time, we have a bit of
    a problem. Because each word of the
    script is one continuous line, there are no
    clear “zero” points where we can cut it.
    Therefore, we need to add a new function
    that will check for the deepest “valley”
    points in the non-binary array.
    These valleys typically range between 2
    and 5 pixels in depth, so we find them
    using a threshold value and set them
    to 0.
    Then, we binarize the array as before.

  18. Find Letters (by finding valleys)
    This time, we search for valleys in the
    original array that represents the pixel
    density of the script. We set any valley
    to 0, and binarize the rest of the array.
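
A minimal sketch of the valley search, assuming a valley is any position whose pixel density is at or below a threshold. The name `valley_mask`, the sample values, and the threshold of 5 are illustrative assumptions.

```python
def valley_mask(profile, threshold):
    """Set valleys (values at or below threshold) to 0, then binarize the rest."""
    flattened = [0 if value <= threshold else value for value in profile]
    return [1 if value > 0 else 0 for value in flattened]

# Pixel density along one word; the thin connecting stroke shows up as 2s and 3s.
profile = [9, 12, 8, 3, 2, 10, 11, 3, 7, 9]
mask = valley_mask(profile, threshold=5)
print(mask)  # [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
```

The resulting mask has the same form as the one used for lines and words, so the same cut-point search applies to it unchanged.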

  19. Find Letters (Visualize it)
    Now, we are able to plot cut points for any word and cut it into letters.

  20. Find Letters (Visualize more)
    As a final step, let’s
    visualize the results of a
    few more words being cut
    into letters.
    [Figure labels: Line 1, numbered 1 to 15; Word 4; Word 7; Word 10]

  21. Future Research Plan
    Template Matching & Algorithm Improvement

  22. Template Matching & Improvement
    For future research, the letters that this
    program finds and cuts from a
    given script image need to be matched
    to the existing letters in the Manchu
    alphabet (which is included in the
    Unicode block for Mongolian).
    After that, we will be able to measure
    how accurate the letter cutting is, and use
    that information to improve the
    letter-cutting algorithm.

  23. Conclusion
    A simple & viable solution

  24. Summary
    This project illustrated how an image of Manchu script can be cut into
    recognizable units through a pre-processing method. The method first
    standardizes the image data it reads in and then, step by step, divides the image
    into (1) lines of the script, (2) words of the script, and finally (3) letters of the script.
    Future research is needed to confirm whether (and how
    accurately) each unit (letter) cut from the image matches the actual Manchu
    alphabet. Some margin of error is expected because the program cuts along
    perfectly horizontal lines at the narrowest point of each word.
    Thus, after an accuracy check, the algorithm may be improved by
    adjusting the cutting threshold values for each word, or by rotating the cutting line (or
    the image itself) at certain locations.
    In conclusion, we hope to have illustrated that this method of extracting Manchu characters
    from an image is a (relatively) simple and viable solution, even though some
    improvements may still be needed.

  25. THANKS!
    The Jupyter Notebook used to build this project,
    including code examples and output, can be
    found at:
    https://github.com/jekkilekki/learning-opencv/blob/main/project-manchu/Manchu%20Script%20Reader.ipynb
    *File size may require download to view*