This was my first poster presentation at a Korean Academic Technology conference.
An Efficient Method to Extract
Units of Manchu Characters
1. Aaron Daniel Snowberger ([email protected])
2. Choong Ho Lee ([email protected])
Extracting the characters
present in an image of
Manchu script (right) is a
basic form of
Optical Character Recognition
(OCR). This can lead to future
research related to the
Manchu script, such as
segmentation, among other topics.
01 Problem Overview
02 Pre-processing Method
a. Inverse Binary Image
b. Scan for pixel depth
c. Simplify data with binarization
d. Find cut points
e. Cut image
03 Program Process
a. Find Lines
b. Find Words
c. Find Letters
04 Future Research Plan
Extracting characters from Manchu Script
Manchu Script Character Extractor
Since Manchu characters are written vertically and are connected without spaces within a
word, pre-processing is required to separate the character area and the units that make
up the characters before the characters can be recognized. In this paper, we describe a
pre-processing method that extracts the character area and cuts off each unit of the
characters. Unlike existing research, which presupposes recognizing each word or
character as a unit, or recognizing the part that remains after removing the stem of a
continuous character, this method cuts the characters into recognizable units and then
combines those units, so it can be applied to letter-recognition methods. The
effectiveness of this method was verified through an experiment.
Manchu Characters, Character Recognition, Preprocessing, Pattern Recognition
Preparing (standardizing) a script image for analysis
Load dataset & visualize input
Read in the image of the Manchu script
Visualize it, and ﬁnd its shape.
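In the notebook, this loading step might look like the sketch below. The filename `manchu.png` and the use of OpenCV are assumptions; a small synthetic grayscale array stands in for the scanned image so the snippet is self-contained.

```python
import numpy as np

# In the actual notebook this would be something like:
#   img = cv2.imread("manchu.png", cv2.IMREAD_GRAYSCALE)  # hypothetical filename
#   plt.imshow(img, cmap="gray")
# A small synthetic "scan" stands in here: a white page with one dark stroke.
img = np.full((40, 30), 255, dtype=np.uint8)  # 255 = white background
img[5:35, 10:12] = 0                          # 0 = dark vertical stroke of script

print(img.shape)  # (rows, columns) of the image
```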
Binarize Image & Inverse it
By binarizing the image, every pixel value
in the white background becomes 1, and
every pixel that contains the script is 0.
Therefore, by inverting the binary image
(swapping the 1s and 0s), we can easily
scan the image for non-zero values, which
contain the (white) script.
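A minimal sketch of this step, assuming a simple fixed threshold of 127 (the exact threshold used in the notebook is not stated) and the same synthetic stand-in image:

```python
import numpy as np

# Synthetic grayscale scan: white page (255) with one dark stroke (0).
img = np.full((40, 30), 255, dtype=np.uint8)
img[5:35, 10:12] = 0

# Binarize: background pixels -> 1, script pixels -> 0 (threshold is an assumption).
binary = (img > 127).astype(np.uint8)

# Invert so the script becomes the non-zero values we scan for.
inverse = 1 - binary

print(inverse.sum())  # number of script pixels → 60
```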
Find font areas
We will begin with lines of the script. This
means we need to scan every column of
the data looking for non-zero values. We
sum the non-zero values for each column,
store them in an array, and plot a graph.
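The column scan described above can be sketched as a single axis-wise sum; the two-stroke test image below is a made-up stand-in for the real scan:

```python
import numpy as np

# Inverted binary image: 1 = script pixel, 0 = background.
inverse = np.zeros((40, 30), dtype=np.uint8)
inverse[5:35, 10:12] = 1   # one vertical line of script
inverse[5:35, 20:22] = 1   # a second vertical line

# Manchu is written vertically, so each column sum measures how much
# script that column contains; plotting this array reveals line positions.
col_density = inverse.sum(axis=0)

print(col_density)
```

Columns crossed by a stroke sum to 30 here; empty columns sum to 0, which is what the plot in the next panel visualizes.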
Visualize font areas
Notice how the graph detailing non-zero
values matches up with the script image.
Simplify data with binarization
Next, we binarize the array of pixel density
information. This binary array will inform
the program’s decisions about where to
set cut points for the image.
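This binarization step amounts to a single comparison; the toy array of column sums below is illustrative, not data from the actual scan:

```python
import numpy as np

col_density = np.array([0, 0, 30, 28, 30, 0, 0, 25, 30, 0])  # toy column sums

# Any column containing script becomes 1; empty columns stay 0.
density_binary = (col_density > 0).astype(np.uint8)

print(density_binary)  # → [0 0 1 1 1 0 0 1 1 0]
```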
Find Cut Points
Next, we run through the binary array
looking for cut points.
⬥ Startpoint: when 0s change to 1s
⬥ Endpoint: when 1s change to 0s
⬥ Edge case: 1s at the end of the image
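The three rules above can be sketched as one pass over the binary array; the function name and the toy input are assumptions for illustration:

```python
import numpy as np

density_binary = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1])  # ends in 1s (edge case)

def find_cut_points(bits):
    """Return (start, end) index pairs for each run of 1s."""
    points = []
    start = None
    for i, b in enumerate(bits):
        if b == 1 and start is None:        # 0 -> 1: a font area begins
            start = i
        elif b == 0 and start is not None:  # 1 -> 0: the font area ends
            points.append((start, i))
            start = None
    if start is not None:                   # edge case: 1s run to the end
        points.append((start, len(bits)))
    return points

print(find_cut_points(density_binary))  # → [(2, 5), (7, 9)]
```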
Cut Image & Visualize it
Let’s take a look at the 13 pieces cut from the image based on the cut points we found.
Full image → Lines → Words → Letters
The above process found lines in the image. It can be extended to ﬁnd words and letters.
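Cutting itself reduces to array slicing at the cut points. The sketch below uses a two-line synthetic image (the real scan produced 13 pieces); the same slicing, applied again inside each piece, yields words and then letters:

```python
import numpy as np

# Inverted binary image with two vertical "lines" of script.
inverse = np.zeros((40, 30), dtype=np.uint8)
inverse[5:35, 10:12] = 1
inverse[5:35, 20:22] = 1

# Cut points along the columns, as found in the previous step.
cuts = [(10, 12), (20, 22)]

# Slice the image at each (start, end) pair: one line of script per piece.
lines = [inverse[:, start:end] for start, end in cuts]

print(len(lines), lines[0].shape)  # → 2 (40, 2)
```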
Take the ﬁrst line for
example. Here, we visualize
the pixel density and every
word contained within it.
We also count the number
of words (font areas)
contained within each of
the other lines in our image.
Then using the same
method described above,
we cut and save each word
from each line.
Find Words (Visualize it)
Now, let’s visualize the words that have been cut out of the ﬁrst line.
Once again, using the method described,
we can now look for letters in each word.
But, notice that this time, we have a bit of
a problem. Because each word of the
script is one continuous line, there are no
clear “zero” points where we can cut it.
Therefore, we need to add a new function
that will check for the deepest “valley”
points in the non-binary array.
These valleys typically range between 2
and 5 pixels in depth, so they can be
found using a threshold value. We'll set
these values to 0 when we ﬁnd them.
Then, we'll binarize the array as before.
Find Letters (by ﬁnding valleys)
This time, we search for valleys in the
original array that represents the pixel
density of the script. We set any valley
to 0, and binarize the rest of the array.
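The valley step can be sketched as below. The threshold of 5 is an assumed value consistent with the 2-to-5-pixel depths mentioned above, and the density array is a made-up example of a word whose stroke never reaches zero:

```python
import numpy as np

# Pixel density along one word: the stroke is continuous (never zero),
# but thin "valleys" (depth 2-3 here) mark the joints between letters.
density = np.array([8, 9, 10, 3, 2, 9, 11, 10, 2, 3, 9, 8])

THRESHOLD = 5  # assumed; the poster cites valley depths of 2-5 pixels

# Zero out the valley positions, then binarize as before.
cleaned = np.where(density <= THRESHOLD, 0, density)
density_binary = (cleaned > 0).astype(np.uint8)

print(density_binary)  # → [1 1 1 0 0 1 1 1 0 0 1 1]
```

With the valleys forced to zero, the same cut-point search used for lines and words now separates the letters.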
Find Letters (Visualize it)
Now, we are able to plot cut points for any word and cut it into letters.
Find Letters (Visualize more)
As a ﬁnal step, let’s
visualize the results of a
few more words being cut into letters.
Future Research Plan
Template Matching & Algorithm Improvement
The Manchu alphabet is included in the Unicode
block for Mongolian.
For future research, the letters that this
program is able to ﬁnd and cut from a
given script image need to be matched
to the existing letters in the Manchu
alphabet (which is included in the
Unicode block for Mongolian).
After that, we will be able to understand
how accurate or inaccurate the program is,
and use that information to help improve
the algorithm.
A simple & viable solution
This project illustrated how an image of Manchu script could be cut into each
recognizable unit through a pre-processing method. The pre-processing method ﬁrst
standardizes the image data it reads in, then, in a step-by-step manner, divides the image
into (1) Lines of the script, then (2) Words of the script, and ﬁnally (3) Letters of the script.
Future research needs to be conducted to conﬁrm whether or not (and how
accurately) each unit (letter) that was cut from the image matches the actual Manchu
alphabet. Some margin of error is expected, because the algorithm cuts
perfectly horizontal lines at the narrowest point of each word.
Thus, after performing an accuracy check, the algorithm may be improved upon by
adjusting the cutting threshold values for each word, or by rotation of the cutting line (or
the image itself) at certain locations.
In conclusion, we hope to illustrate that this method of extracting Manchu characters
from an image is a (relatively) simple and viable solution, even though some
improvements may need to be made.
The Jupyter Notebook used to build this project,
including code examples and output, is available for download.
*Filesize may require download to view*