
An Efficient Method to Extract Units of Manchu Characters


This was my first poster presentation at a Korean Academic Technology conference.

Aaron Snowberger

May 21, 2021


Transcript

  1. CONTENTS NATURAL LANGUAGE PROCESSING Extracting the characters present in an
    image of Manchu script (right) is a simple form of NLP known as Optical Character Recognition (OCR). This can lead to future research related to the Manchu script, such as stemming, lemmatisation, and morphological segmentation, among others.
    01 Problem Overview
    02 Pre-processing Method
      a. Inverse Binary Image
      b. Scan for pixel depth
      c. Simplify data with binarization
      d. Find cut points
      e. Cut image
    03 Program Process
      a. Find Lines
      b. Find Words
      c. Find Letters
    04 Future Research Plan
    05 Conclusion
  2. Manchu Script Character Extractor Abstract Since Manchu characters are written
    vertically and are connected without spaces within a word, pre-processing is required to separate the character area and the units that make up the characters before the characters can be recognized. In this paper, we describe a pre-processing method that extracts the character area and cuts out the unit of each character. Unlike existing research that presupposes recognizing each word or character unit, or recognizing what remains after removing the stem of a connected character, this method cuts the script into individually recognizable units that can later be combined, so it can be applied to unit-based letter recognition. Through an experiment, the effectiveness of this method was verified. Keywords: Manchu Characters, Character Recognition, Preprocessing, Pattern Recognition
  3. Load dataset & visualize input Read in the image of
    the Manchu script in grayscale. Visualize it, and find its shape.
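The load step can be sketched in a few lines. The deck uses OpenCV in a Jupyter notebook; here a small synthetic grayscale page (an assumption, standing in for the real scan) replaces the `cv2.imread` call so the shapes are easy to follow.

```python
import numpy as np

# In practice the scan would be loaded with OpenCV, e.g.:
#   img = cv2.imread("manchu_script.png", cv2.IMREAD_GRAYSCALE)
# (the filename is hypothetical). A tiny synthetic page stands in:
# white background (255) with two vertical strokes of script (0).
img = np.full((8, 10), 255, dtype=np.uint8)
img[1:7, 2] = 0   # first vertical "line" of script
img[1:7, 6] = 0   # second vertical "line"

print(img.shape)  # (rows, columns) -> (8, 10)
```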
  4. Binarize Image & Inverse it By binarizing the image, every
    pixel value in the white background becomes 1, and every pixel that contains the script becomes 0. Therefore, by inverting the binary image (reversing the 1s and 0s), we can easily scan the image for the non-zero values that contain the script.
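A minimal sketch of the binarize-and-invert step, continuing the synthetic page from above; the midpoint threshold of 128 is an assumption, not a value from the deck.

```python
import numpy as np

# Synthetic grayscale page: white background (255), script strokes (0)
img = np.full((8, 10), 255, dtype=np.uint8)
img[1:7, 2] = 0
img[1:7, 6] = 0

# Binarize: background -> 1, script -> 0 (128 is an assumed threshold)
binary = (img >= 128).astype(np.uint8)

# Invert: script pixels become 1, so scanning for non-zero
# values now finds the script directly
inverse = 1 - binary
print(inverse.sum())  # total number of script pixels -> 12
```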
  5. Find font areas We will begin with lines of the
    script. This means we need to scan every column of the data looking for non-zero values. We sum the non-zero values for each column, store them in an array, and plot a graph.
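The column scan reduces to a single sum along the row axis. A sketch on the same synthetic inverted image (an assumption); since Manchu runs vertically, each image column corresponds to a line of script, and peaks in the profile mark the font areas.

```python
import numpy as np

# Inverted binary image: 1 = script pixel, 0 = background
inverse = np.zeros((8, 10), dtype=np.uint8)
inverse[1:7, 2] = 1
inverse[1:7, 6] = 1

# Sum script pixels in every column; non-zero runs are lines of script.
density = inverse.sum(axis=0)
print(density)  # [0 0 6 0 0 0 6 0 0 0]
# Plotting this array (e.g. with matplotlib) shows the line positions.
```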
  6. Simplify data with binarization Next, we binarize the array of
    pixel density information. This binary array will inform the program’s decisions about where to set cut points for the image.
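Binarizing the density array is a one-line thresholding step; the example profile below is made up for illustration.

```python
import numpy as np

# Hypothetical pixel-density profile (script pixels per column)
density = np.array([0, 0, 6, 4, 0, 0, 6, 5, 0, 0])

# Any column containing script becomes 1, empty columns stay 0;
# the runs of 1s are the candidate font areas.
binary_density = (density > 0).astype(np.uint8)
print(binary_density)  # [0 0 1 1 0 0 1 1 0 0]
```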
  7. Find Cut Points Next, we run through the binary array
    looking for cut points.
    ⬥ Start point: when 0s change to 1s
    ⬥ End point: when 1s change to 0s
    ⬥ Edge case: 1s at the end of the image
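The three rules above can be sketched as a single pass over the binary array; the function name `find_cut_points` is mine, not from the deck.

```python
import numpy as np

def find_cut_points(arr):
    """Return (start, end) index pairs for each run of 1s."""
    cuts = []
    start = None
    for i, v in enumerate(arr):
        if v == 1 and start is None:        # 0 -> 1: start point
            start = i
        elif v == 0 and start is not None:  # 1 -> 0: end point
            cuts.append((start, i))
            start = None
    if start is not None:                   # edge case: 1s at the end
        cuts.append((start, len(arr)))
    return cuts

binary_density = np.array([0, 0, 1, 1, 0, 0, 1, 1])
print(find_cut_points(binary_density))  # [(2, 4), (6, 8)]
```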
  8. Cut Image & Visualize it Let’s take a look at
    the 13 pieces cut from the image based on the cut points we found.
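Cutting the image is then just column slicing at the found points. A sketch on the synthetic image used earlier (the cut points here are assumed, matching its two strokes; the real scan yields 13 pieces).

```python
import numpy as np

# Inverted binary image: 1 = script pixel, 0 = background
inverse = np.zeros((8, 10), dtype=np.uint8)
inverse[1:7, 2] = 1
inverse[1:7, 6] = 1

cut_points = [(2, 3), (6, 7)]  # (start, end) columns found earlier
# Slice each detected line of script out of the image
pieces = [inverse[:, start:end] for start, end in cut_points]
print(len(pieces), pieces[0].shape)  # 2 pieces, each 8 rows tall
```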
  9. Find Lines The above process found lines in the image.
    It can be extended to find words and letters.
  10. Find Words Take the first line, for example. Here, we
    visualize the pixel density and every word contained within it. We also count the number of words (font areas) contained within each of the other lines in our source image. Then, using the same method described above, we cut and save each word from each line.
  11. Find Words (Visualize it) Now, let’s visualize the words that
    have been cut out of the first line.
  12. Find Letters Once again, using the method described, we can
    now look for letters in each word. But notice that this time we have a bit of a problem. Because each word of the script is one continuous line, there are no clear “zero” points where we can cut it. Therefore, we need to add a new function that checks for the deepest “valley” points in the non-binary array. These locations typically range between 2 and 5 pixels in depth, so they can be found using a threshold value. We set these values to 0 when we find them, then binarize the array as before.
  13. Find Letters (by finding valleys) This time, we search for
    valleys in the original array that represents the pixel density of the script. We set any valley to 0, and binarize the rest of the array.
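The valley step can be sketched as a thresholded zeroing of the density profile; both the example profile and the threshold value of 5 are assumptions (the deck says valleys are typically 2 to 5 pixels deep).

```python
import numpy as np

# Hypothetical density profile along one word: the connected stem
# never reaches zero, but the joints between letters are shallow
# "valleys" in the profile.
density = np.array([8, 7, 3, 7, 9, 2, 8, 8])

THRESHOLD = 5                                 # assumed valley depth
valleys = density <= THRESHOLD                # locate the joints
density = np.where(valleys, 0, density)       # force valleys to zero
binary_density = (density > 0).astype(np.uint8)
print(binary_density)  # [1 1 0 1 1 0 1 1] -> letters can now be cut
```

With the valleys zeroed out, the same cut-point search used for lines and words applies unchanged.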
  14. Find Letters (Visualize it) Now, we are able to plot
    cut points for any word and cut it into letters.
  15. Find Letters (Visualize more) As a final step, let’s visualize
    the results of a few more words being cut into letters. [Figure: cut points and letter pieces for Words 4, 7, and 10 of Line 1]
  16. Template Matching & Improvement MANCHU ALPHABET The Manchu alphabet is
    included in the Unicode block for Mongolian. For future research, the letters that this program is able to find and cut from a given script image need to be matched to the existing letters of the Manchu alphabet. After that, we will be able to measure how accurate or inaccurate the program is, and use that information to help improve the letter-cutting algorithm.
  17. Summary This project illustrated how an image of Manchu script
    could be cut into recognizable units through a pre-processing method. The pre-processing method first standardizes the image data it reads in, then, in a step-by-step manner, divides the image into (1) lines of the script, then (2) words of the script, and finally (3) letters of the script. Future research needs to confirm whether (and how accurately) each unit (letter) cut from the image matches the actual Manchu alphabet. Some margin of error is expected because the cuts are perfectly horizontal lines made at the narrowest point of each word. Thus, after performing an accuracy check, the algorithm may be improved by adjusting the cutting threshold values for each word, or by rotating the cutting line (or the image itself) at certain locations. In conclusion, we hope to show that this method of extracting Manchu characters from an image is a (relatively) simple and viable solution, even though some improvements may need to be made.
  18. THANKS! The Jupyter Notebook used to build this project, including
    code examples and output, can be found at: https://github.com/jekkilekki/learning-opencv/blob/main/project-manchu/Manchu%20Script%20Reader.ipynb (*Filesize may require download to view*)