Deep Learning for Classical Japanese Literature

Deep Learning for Classical Japanese Literature Kurian Benoy CS-7 A
34

Contents • Abstract • Introduction • Kuzushiji Dataset • Classification Baselines • Domain Transfer from Kuzushiji-Kanji to Modern Kanji • Similar work in Chinese • Why don’t we need domain transfer for Malayalam • Summary

Abstract • To encourage ML researchers to produce models of social or cultural relevance that transcribe Kuzushiji into contemporary Japanese characters. • To release the Kuzushiji-MNIST, Kuzushiji-49 and Kuzushiji-Kanji datasets to the general public. • Written by Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, David Ha. https://arxiv.org/abs/1812.01718

Introduction Land of the Rising Sun: Japan

Introduction • Historically, Japan and its culture were isolated from the West for a long period, until the Meiji Restoration in 1868, when a 15-year-old emperor unified Japan, which had previously been divided among small regional rulers. • This caused massive changes in the Japanese language and in its writing and printing systems. Even though Kuzushiji had been used for over 1000 years, there are very few fluent readers of Kuzushiji today (only 0.01% of modern Japanese natives).

Introduction As a result, most Japanese natives today cannot read books written and published over 150 years ago. The General Catalog of National Books lists over 1.7 million books, and about 3 million unregistered books are yet to be found. It is estimated that around a billion historical documents were written in Kuzushiji, a cursive writing style, over a span of centuries. Most of this knowledge is now inaccessible to the general public.

Kuzushiji Dataset Fig on left: Kuzushiji (old Japanese) Fig on right: Contemporary Japanese

Kuzushiji Dataset The Japanese writing system can be divided into two types of systems: • Logographic systems, where each character represents a word or a phrase (with thousands of characters). A prominent logographic system is Kanji, which is based on the Chinese system. • Syllabary symbol systems, where words are constructed from syllables (similar to an alphabet). A prominent syllabary system is Hiragana with 49 characters (Kuzushiji-49), which prior to the Kuzushiji standardization had several representations for each Hiragana character.

a) Kuzushiji-MNIST: • MNIST for handwritten digits is one of the most popular datasets to date and is usually the "hello world" of deep learning. • Yet its 10 classes are far fewer than the 49 needed to fully represent Kuzushiji Hiragana.

• Since MNIST restricts us to 10 classes, the authors chose one character to represent each of the 10 rows of Hiragana when creating Kuzushiji-MNIST. • Kuzushiji-MNIST is more difficult than MNIST: when the images are small and stacked together in 5 rows, the chance that a human identifies every character correctly is very low.

b) Kuzushiji-49 • As the name suggests, it is a much larger, imbalanced dataset containing 49 Hiragana characters with about 266,407 images. • Both Kuzushiji-49 and Kuzushiji-MNIST consist of greyscale images of 28x28 pixel resolution. • The data is split into training and test sets in a ratio of 6/7 to 1/7 for each class. • There are several rare characters with a small number of samples; for example, (e) in Hiragana has only 456 images.
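The per-class 6/7 to 1/7 split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_per_class`, the rounding rule and the toy label vector are hypothetical and are not taken from the paper or the dataset tooling.

```python
import numpy as np


def split_per_class(labels, train_fraction=6 / 7, seed=0):
    """Return (train_idx, test_idx), splitting every class separately."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_parts, test_parts = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)  # positions of this class
        rng.shuffle(idx)                     # shuffle within the class
        n_train = int(round(len(idx) * train_fraction))
        train_parts.append(idx[:n_train])
        test_parts.append(idx[n_train:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


# Made-up imbalanced labels: one common class and one rare class.
toy_labels = np.array([0] * 70 + [1] * 14)
train_idx, test_idx = split_per_class(toy_labels)
print(len(train_idx), len(test_idx))
```

Splitting inside each class keeps rare characters present in both sets, which a plain random split over an imbalanced dataset does not guarantee.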

c) Kuzushiji-Kanji: • Kuzushiji-Kanji has a total of 3832 character classes with about 140,426 images. • Kuzushiji-Kanji images have a larger 64x64 pixel resolution, and the number of samples per class ranges from over a thousand to only one.

To download the dataset: https://github.com/rois-codh/kmnist
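A minimal sketch of reading one of the downloaded image archives, assuming the files are NumPy `.npz` archives that each hold a single array. The helper `load_npz_array` and the file name `demo-imgs.npz` are hypothetical; the demo file is created locally with made-up data so the snippet runs without a download.

```python
import os
import tempfile

import numpy as np


def load_npz_array(path):
    """Load the first (assumed only) array stored in a .npz archive."""
    with np.load(path) as archive:
        return archive[archive.files[0]]


# Stand-in archive with made-up data shaped like 28x28 greyscale images.
demo_path = os.path.join(tempfile.mkdtemp(), "demo-imgs.npz")
np.savez_compressed(demo_path, np.zeros((5, 28, 28), dtype=np.uint8))

images = load_npz_array(demo_path)
scaled = images.astype(np.float32) / 255.0  # scale pixels to [0, 1]
print(images.shape, scaled.dtype)
```

To use it on real data, point `load_npz_array` at a file downloaded by following the repository instructions.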

Classification Baselines The research paper focuses on measuring recognition accuracy on the Kuzushiji datasets, which cover both Kanji and Hiragana and are based on pre-processed images of characters from 35 books from the 18th century.

You can improve on these results too. According to ROIS-CODH, the current state-of-the-art model is a ResNet18 + VGG ensemble combined with capsule networks.
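The slides do not say how the ensemble members are combined. One common option, shown here purely as an assumption, is to average the softmax probabilities of the individual models; the function names and the logits are made up for illustration.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def ensemble_predict(logits_per_model):
    """Average class probabilities over models, then take the argmax."""
    probs = np.mean([softmax(l) for l in logits_per_model], axis=0)
    return probs.argmax(axis=1), probs


# Made-up logits from two hypothetical models: 2 samples, 3 classes.
logits_a = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
logits_b = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
pred_labels, avg_probs = ensemble_predict([logits_a, logits_b])
print(pred_labels)
```

On the second sample the two models disagree, and the more confident one decides the averaged prediction.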

PreAct ResNet with Manifold Mixup • A method for learning better representations that acts as a regularizer and, despite adding no significant computation cost, achieves improvements over strong baselines on supervised and semi-supervised learning tasks. • Manifold Mixup requires that the dimensionality of the hidden states exceeds the number of classes, which is often the case in practice.

ResNet Ensembled over Capsule Networks • Ensemble of ResNet and VGG • Ensembling ResNets with capsule networks
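The mixing step behind Manifold Mixup can be sketched as below, assuming the usual mixup recipe: a coefficient drawn from a Beta(alpha, alpha) distribution interpolates the hidden states of two examples and their one-hot labels in the same way. The function name, the value of `alpha` and the random "hidden states" are illustrative assumptions; a real model would apply this inside the network at a hidden layer.

```python
import numpy as np


def mix_hidden_and_labels(hidden, onehot, alpha=2.0, rng=None):
    """Interpolate hidden states and labels with one shared coefficient."""
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = rng.permutation(len(hidden))   # partner example for each row
    mixed_hidden = lam * hidden + (1.0 - lam) * hidden[perm]
    mixed_labels = lam * onehot + (1.0 - lam) * onehot[perm]
    return mixed_hidden, mixed_labels, lam


# Made-up batch: 4 examples, 16 hidden units, 10 classes (16 > 10).
rng = np.random.default_rng(42)
hidden = rng.normal(size=(4, 16))
onehot = np.eye(10)[[0, 1, 2, 3]]
mixed_hidden, mixed_labels, lam = mix_hidden_and_labels(hidden, onehot, rng=rng)
print(mixed_hidden.shape, mixed_labels.sum(axis=1))
```

The mixed labels remain valid probability vectors, so the usual cross-entropy loss can be applied to them.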

My intuition • EfficientNet coupled with capsule networks

Domain Transfer • The proposed model should convert the pixel image of a given Kuzushiji-Kanji input into a vector image of its modern Kanji version.

Algorithm 1. Train two separate variational autoencoders on the pixel versions of KanjiVG and Kuzushiji-Kanji at 64x64px resolution. 2. Train a mixture density network to model P(Znew | Zold) as a mixture of Gaussians. 3. Train Sketch-RNN to generate KanjiVG strokes conditioned on either znew or z~new ~ P(Znew | Zold).
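Step 2 of the algorithm above models a conditional density as a mixture of Gaussians. The sketch below shows how such a mixture can be evaluated and sampled once its parameters are known. It assumes diagonal Gaussians whose components are shared across all latent dimensions, which may differ from the paper's exact parameterization, and the parameters are made up here rather than produced by a trained network from zold.

```python
import numpy as np


def log_mixture_weights(logit_pi):
    """Turn unnormalized logits into log mixture weights (log-softmax)."""
    shifted = logit_pi - logit_pi.max()
    return shifted - np.log(np.exp(shifted).sum())


def mixture_log_likelihood(z, logit_pi, mu, log_sigma):
    """log p(z) for a mixture of K diagonal Gaussians; mu, log_sigma: (K, D)."""
    log_norm = -0.5 * (
        ((z - mu) / np.exp(log_sigma)) ** 2 + 2.0 * log_sigma + np.log(2.0 * np.pi)
    ).sum(axis=1)
    joint = log_mixture_weights(logit_pi) + log_norm
    m = joint.max()
    return m + np.log(np.exp(joint - m).sum())  # log-sum-exp over components


def sample_from_mixture(logit_pi, mu, log_sigma, rng):
    """Pick a component by its weight, then sample from that Gaussian."""
    pi = np.exp(log_mixture_weights(logit_pi))
    k = rng.choice(len(pi), p=pi)
    return mu[k] + np.exp(log_sigma[k]) * rng.standard_normal(mu.shape[1])


# Made-up parameters: K=3 components, D=4 latent dimensions.
rng = np.random.default_rng(0)
logit_pi = np.array([0.5, 0.0, -0.5])
mu = rng.normal(size=(3, 4))
log_sigma = np.full((3, 4), -1.0)
z_new_sample = sample_from_mixture(logit_pi, mu, log_sigma, rng)
print(z_new_sample.shape, mixture_log_likelihood(z_new_sample, logit_pi, mu, log_sigma))
```

Training such a network would minimize the negative of `mixture_log_likelihood` over pairs of latent vectors; sampling corresponds to drawing z~new in step 3.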

Components of this network • Autoencoders and decoders: a widely used unsupervised application of neural networks whose original purpose is to find latent lower-dimensional state spaces of datasets, but they are also capable of solving other problems, such as image denoising, enhancement or colourization. • Variational autoencoders are used to provide the latent spaces for KanjiVG and Kuzushiji-Kanji. They are used in the architecture to fine-tune the input and provide better colourization and enhancement. They are also used in complex generative models.
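Two ingredients that a variational autoencoder adds to a plain autoencoder are the reparameterized sampling of the latent vector and a KL penalty that pulls the latent distribution toward a standard normal. The sketch below shows only these two pieces, with made-up encoder outputs; it is not the architecture used in the paper.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps drawn from N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)


# Made-up encoder outputs for a batch of 2 images and 2 latent dimensions.
rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [1.0, 1.0]])
log_var = np.zeros((2, 2))
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print(z.shape, kl)
```

The KL term is zero when the encoder output already matches the standard normal and grows as the mean moves away from zero.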

• Mixture Density Network: used to model the density function of the new domain. It lets the network translate from Kuzushiji-Kanji to the KanjiVG format in pixels. • Sketch-RNN: a decoder network that is conditioned on the new latent vector.

Comparison with Chinese Kanji • Training two VAE encoders, as in this algorithm, gives better performance than using a single VAE encoder. • Sketch-RNN gives better accuracy than char-RNN. • Adversarial losses, as used in other approaches, are not necessary. http://otoro.net/kanji-rnn/

Why not such a system for Malayalam? • We also have a similar domain transfer problem. • It arose from a government rule in 1956 that limited Malayalam typography to only 56 characters.

• The free software community Swathanthra Malayalam Computing has already created Unicode mappings for 1200 Malayalam characters.

Summary • Explored deep learning techniques for classifying Classical Japanese (Kuzushiji) and for domain transfer to contemporary Japanese. • Looked at the various Kuzushiji datasets.

Thanks • Slides: bit.ly/japanslides • Brief summary: https://kurianbenoy.github.io/ • Research
paper: https://arxiv.org/pdf/1812.01718.pdf
