Slide 1

Slide 1 text

An Efficient Method to Extract Units of Manchu Characters
1. Aaron Daniel Snowberger ([email protected])
2. Choong Ho Lee ([email protected])

Slide 2

Slide 2 text

CONTENTS
NATURAL LANGUAGE PROCESSING: Extracting the characters present in an image of Manchu script is a simple form of NLP known as Optical Character Recognition (OCR). This can lead to future research related to the Manchu script, such as Stemming, Lemmatisation, and Morphological Segmentation, among others.
01 Problem Overview
02 Pre-processing Method
  a. Inverse binary image
  b. Scan for pixel depth
  c. Simplify data with binarization
  d. Find cut points
  e. Cut image
03 Program Process
  a. Find Lines
  b. Find Words
  c. Find Letters
04 Future Research Plan
05 Conclusion

Slide 3

Slide 3 text

Problem Overview: Extracting characters from Manchu script

Slide 4

Slide 4 text

Manchu Script Character Extractor
Abstract: Since Manchu characters are written vertically and are connected without spaces within a word, pre-processing is required to separate the character area and the units that make up the characters before the characters can be recognized. In this paper, we describe a pre-processing method that extracts the character area and cuts out the unit of each character. Unlike existing research that presupposes recognizing each word or character unit, or recognizing what remains after removing the stem of a continuous character, this method cuts the characters into recognizable units and then combines the units, so it can be applied to letter-recognition methods. The effectiveness of this method was verified through an experiment.
Keywords: Manchu Characters, Character Recognition, Preprocessing, Pattern Recognition

Slide 5

Slide 5 text

Pre-Processing Method: Preparing (standardizing) a script image for analysis

Slide 6

Slide 6 text

Load dataset & visualize input: Read in the image of the Manchu script in grayscale, visualize it, and find its shape.
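A minimal sketch of this loading step, using OpenCV and Matplotlib as in the project's notebook; the filename 'manchu_script.png' is a hypothetical placeholder, not the actual file in the repository.

```python
# Sketch of step 00: load the script image in grayscale and inspect it.
# 'manchu_script.png' is a placeholder filename.
import cv2
import matplotlib.pyplot as plt

img = cv2.imread('manchu_script.png', cv2.IMREAD_GRAYSCALE)  # 2-D array of 0-255 values
print(img.shape)  # (height, width) of the scanned page

plt.imshow(img, cmap='gray')
plt.title('Input Manchu script')
plt.show()
```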

Slide 7

Slide 7 text

Binarize Image & Invert It: Binarizing the image sets every pixel in the white background to 1 and every pixel that contains script to 0. By inverting the binary image (swapping the 1s and 0s), we can then easily scan the image for non-zero values, which now mark the (white) script.
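As a hedged sketch, the binarize-and-invert step can be written as a single OpenCV call: the slides describe binarizing first and inverting second, but cv2.THRESH_BINARY_INV performs both at once. The threshold value 127 and the variable name binary_inv are assumptions.

```python
# Sketch of step 01: dark script pixels -> 1, white background -> 0.
# THRESH_BINARY_INV combines binarization and inversion in one call;
# the 127 threshold is an assumed midpoint for an 8-bit grayscale image.
import cv2

_, binary_inv = cv2.threshold(img, 127, 1, cv2.THRESH_BINARY_INV)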

Slide 8

Slide 8 text

Find font areas: We begin with the lines of the script. Because the lines run vertically, we scan every column of the data for non-zero values, sum the non-zero values of each column, store the sums in an array, and plot a graph.
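One way this column scan might look, continuing from the binary_inv array in the previous sketch (variable names are assumptions):

```python
# Sketch of step 02: count the script (non-zero) pixels in every column.
# Manchu lines run vertically, so each line appears as a band of non-zero sums.
import numpy as np
import matplotlib.pyplot as plt

col_density = np.count_nonzero(binary_inv, axis=0)  # one count per column

plt.plot(col_density)
plt.xlabel('column index (x)')
plt.ylabel('script pixels per column')
plt.show()
```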

Slide 9

Slide 9 text

Visualize font areas: Notice how the graph detailing the non-zero values matches up with the script image.

Slide 10

Slide 10 text

Simplify data with binarization: Next, we binarize the array of pixel-density information. This binary array will inform the program's decisions about where to set cut points for the image.
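A one-line sketch of this binarization, continuing from the col_density array above:

```python
# Sketch of step 03: 1 marks a column containing script, 0 marks background.
import numpy as np

col_binary = (col_density > 0).astype(np.uint8)
```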

Slide 11

Slide 11 text

Find Cut Points: Next, we run through the binary array looking for cut points, as sketched below.
⬥ Start point: where 0s change to 1s
⬥ End point: where 1s change to 0s
⬥ Edge case: 1s that continue to the end of the image
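A sketch of the cut-point search described by these three rules; the function name find_cut_points is an assumption, not necessarily the name used in the notebook.

```python
# Sketch of step 04: find (start, end) pairs for each run of 1s in the profile.
def find_cut_points(binary_profile):
    """Return (start, end) index pairs for each run of 1s in a 0/1 profile."""
    cuts, start = [], None
    for i, value in enumerate(binary_profile):
        if value == 1 and start is None:          # 0 -> 1: a font area begins
            start = i
        elif value == 0 and start is not None:    # 1 -> 0: the font area ends
            cuts.append((start, i))
            start = None
    if start is not None:                         # edge case: 1s at the end
        cuts.append((start, len(binary_profile)))
    return cuts

line_cuts = find_cut_points(col_binary)  # list of (start, end) column pairs
```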

Slide 12

Slide 12 text

Cut Image & Visualize It: Let's take a look at the 13 pieces cut from the image based on the cut points we found.
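A sketch of the cutting and plotting, assuming the line_cuts list from the previous sketch:

```python
# Sketch of step 05: slice one sub-image per vertical line of script and show them.
import matplotlib.pyplot as plt

lines = [binary_inv[:, start:end] for start, end in line_cuts]

fig, axes = plt.subplots(1, len(lines), figsize=(2 * len(lines), 8))
for ax, piece in zip(axes, lines):
    ax.imshow(piece, cmap='gray')
    ax.axis('off')
plt.show()
```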

Slide 13

Slide 13 text

Program Process: Full image → Lines → Words → Letters

Slide 14

Slide 14 text

Find Lines: The process above found the lines in the image. It can be extended to find words and letters (see the sketch below).
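One way to package the line-finding steps so they can be reused along either axis; this helper is a sketch, and its name and its reuse of the find_cut_points function from the earlier sketch are assumptions.

```python
import numpy as np

def find_font_areas(piece, axis=0):
    """Density profile -> binarize -> cut points along the chosen axis.

    axis=0 counts script pixels per column (finds the vertical lines);
    axis=1 counts script pixels per row (finds the words within one line).
    """
    density = np.count_nonzero(piece, axis=axis)
    return find_cut_points((density > 0).astype(np.uint8))
```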

Slide 15

Slide 15 text

Find Words: Take the first line as an example. Here, we visualize its pixel density and every word it contains. We also count the number of words (font areas) contained within each of the other lines in our source image. Then, using the same method described above, we cut and save each word from each line.
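A sketch of the word step for the first line, using the find_font_areas helper sketched above (variable names are assumptions):

```python
# Words in a vertical line are separated by gaps along the rows, so the same
# procedure is applied to the row counts of a single line image.
first_line = lines[0]
word_cuts = find_font_areas(first_line, axis=1)   # (top, bottom) row pairs
words = [first_line[top:bottom, :] for top, bottom in word_cuts]
print(f'Line 1 contains {len(words)} words')
```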

Slide 16

Slide 16 text

Find Words (Visualize It): Now, let's visualize the words that have been cut out of the first line.

Slide 17

Slide 17 text

Find Letters: Once again, using the method described above, we can now look for letters within each word. This time, however, there is a problem: because each word of the script is one continuous, connected stroke, there are no clear zero points at which to cut it. We therefore add a new function that checks for the deepest "valley" points in the non-binary density array. These valleys typically range between 2 and 5 pixels in depth and are found using a threshold value. We set these values to 0 wherever we find them, and then binarize the array as before.
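A sketch of this valley-based profile for one word; the function name and the default threshold of 5 pixels (taken from the 2-5 pixel range mentioned above) are assumptions.

```python
import numpy as np

def letter_profile(word_img, valley_threshold=5):
    """Binarized row-density profile of a word, with shallow valleys forced to 0.

    A word has no true zero rows, so rows whose script-pixel count falls at or
    below the threshold are treated as the gaps between letters.
    """
    density = np.count_nonzero(word_img, axis=1)  # script pixels per row
    density[density <= valley_threshold] = 0      # open up the valleys
    return (density > 0).astype(np.uint8)

letter_cuts = find_cut_points(letter_profile(words[0]))
letters = [words[0][top:bottom, :] for top, bottom in letter_cuts]
```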

Slide 18

Slide 18 text

Find Letters (by Finding Valleys): This time, we search for valleys in the original array that represents the pixel density of the script. We set every valley to 0 and binarize the rest of the array.

Slide 19

Slide 19 text

Find Letters (Visualize It): Now we are able to plot cut points for any word and cut it into letters.

Slide 20

Slide 20 text

Find Letters (Visualize More): As a final step, let's visualize the results of a few more words being cut into letters. [Figure: Words 4, 7, and 10 of Line 1, each cut into numbered letter units.]

Slide 21

Slide 21 text

Future Research Plan: Template Matching & Algorithm Improvement

Slide 22

Slide 22 text

Template Matching & Improvement (MANCHU ALPHABET): The Manchu alphabet is included in the Unicode block for Mongolian. For future research, the letters that this program finds and cuts from a given script image need to be matched against the existing letters of the Manchu alphabet. After that, we will be able to see how accurate or inaccurate the cutting is, and use that information to help improve the letter-cutting algorithm.
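As an illustration only, this matching could be prototyped with OpenCV's cv2.matchTemplate; the 'templates' folder, its file layout, and the function below are hypothetical and not part of the current project.

```python
# Speculative sketch of the future template-matching step.
# Assumes a hypothetical 'templates/' folder holding one grayscale reference
# image per Manchu letter.
import glob
import cv2

def best_match(letter_img, template_dir='templates'):
    best_name, best_score = None, -1.0
    for path in sorted(glob.glob(f'{template_dir}/*.png')):
        template = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        # Resize the cut letter to the template's size so the two can be compared.
        resized = cv2.resize(letter_img, (template.shape[1], template.shape[0]))
        score = cv2.matchTemplate(resized.astype('float32'),
                                  template.astype('float32'),
                                  cv2.TM_CCOEFF_NORMED).max()
        if score > best_score:
            best_name, best_score = path, score
    return best_name, best_score
```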

Slide 23

Slide 23 text

Conclusion: A simple & viable solution

Slide 24

Slide 24 text

Summary: This project illustrated how an image of Manchu script can be cut into recognizable units through a pre-processing method. The pre-processing method first standardizes the image data it reads in, then, step by step, divides the image into (1) lines of the script, (2) words of the script, and finally (3) letters of the script.
Future research is needed to confirm whether (and how accurately) each unit (letter) cut from the image matches the actual Manchu alphabet. Some margin of error is expected because the script is cut along perfectly horizontal lines at the narrowest point of each word. Thus, after performing an accuracy check, the algorithm may be improved by adjusting the cutting threshold values for each word, or by rotating the cutting line (or the image itself) at certain locations.
In conclusion, we hope to have illustrated that this method of extracting Manchu characters from an image is a (relatively) simple and viable solution, even though some improvements may still need to be made.

Slide 25

Slide 25 text

THANKS! The Jupyter Notebook used to build this project, including code examples and output, can be found at: https://github.com/jekkilekki/learning-opencv/blob/main/project-manchu/Manchu%20Script%20Reader.ipynb (the file size may require downloading the notebook to view it).