▼Reference links
・Service | VIVEN, Inc.
https://www.viven.co.jp/ja/service/
・Mission | VIVEN, Inc.
https://www.viven.co.jp/ja/company/
If you would like to talk with us, please contact us by email or through the following SNS.
Email, Twitter, Facebook, Instagram, Linkedin
[Company website]
https://www.viven.inc
OCR Survey
Tapas Dutta, Deep Learning Engineer
OCR Survey of
Foreign Language
Tapas Dutta, Deep Learning Engineer
TextScanner: Reading Characters in Order for Robust Scene Text Recognition
Summary
Current OCR technologies miss genuine characters or produce spurious ones. This work therefore generates pixel-wise maps for character class, position and order in parallel, with an RNN for context modelling.
Related Works
Cheng (2017) used character class and localization labels to adjust the attention positions. Bai (2018) used a novel loss function to improve the attention decoder. Lyu (2018) and Liao (2019) used segmentation for OCR; however, this is not effective for languages with closely spaced characters.
Proposed Methodology
A CNN architecture is used for feature extraction. The extracted features are fed to class and geometry branches. The class branch consists of 2 stacked convolutions followed by soft normalization. The output feature maps (character segmentation maps) have dimensions h*w*c
(c = number of characters + background). To produce the localization map, a sigmoid activation is applied to the same input. For the order segmentation map, a small U-Net architecture is used, with GRU layers in the middle. After upsampling, two convolution layers are used to generate feature maps of size h*w*N (N is the sequence length). The order map, in which the k-th character is indicated by its k-th feature map, is generated by multiplying the order segmentation and character localization maps. The classification scores are obtained by multiplying the character segmentation maps and the order maps.
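A minimal NumPy sketch of how per-character class scores could be read out from TextScanner-style maps; shapes, the background index and the charset ordering are my assumptions, not the authors' code.

import numpy as np

def decode_textscanner(char_seg: np.ndarray, order_map: np.ndarray, charset: str) -> str:
    # char_seg:  (h, w, c) pixel-wise class probabilities (c includes background at index 0, assumed)
    # order_map: (h, w, N) the k-th map highlights the k-th character's region
    N = order_map.shape[-1]
    chars = []
    for k in range(N):
        weights = order_map[..., k]                       # (h, w) weight for the k-th character
        if weights.sum() < 1e-3:                          # no k-th character present
            break
        scores = (char_seg * weights[..., None]).sum(axis=(0, 1))  # (c,) weighted class scores
        cls = int(scores.argmax())
        if cls == 0:                                      # background class ends the sequence
            break
        chars.append(charset[cls - 1])
    return "".join(chars)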
Result
Different datasets were used to validate the effectiveness of the model, such as IIIT (50, 1K, 0 lexicons), SVT (50, 0 lexicons), IC13, IC15, SVTP and CT, achieving 99.8, 99.5, 95.7, 99.4, 92.7, 94.9, 83.5, 84.8 and 91.6% accuracy, respectively.
Next must-read paper: “Textscanner: Reading characters in order
for robust scene text recognition”
Persian Optical Character Recognition Using Deep Bidirectional Long Short-Term Memory
Summary
OCR for the Persian language needs to address its particularities, such as right-to-left text and the interpretation of the semicolon, dot, oblique, etc. Increasing the number of layers/filters/kernel size for the LSTM did not improve results. BiLSTM improved results compared to LSTM. Increasing the dimension of the extracted vector improved results for both BiLSTM and LSTM, and increasing the number of BiLSTM layers improved performance.
Related Works
Khosravi (2006) achieved 99.02% and 98.8% accuracy using an improved gradient and a gradient histogram, respectively, on the HODA dataset. Alizadehashraf (2017) achieved 97% accuracy using a small CNN architecture. Bonyani (2021) compared the performance of standard CNN architectures (DenseNet, ResNet, VGG) for recognizing Persian text. LeNet (LeCun 1998) was used with weights optimized using firefly, ant colony, chimp and particle swarm optimization techniques, with chimp optimization performing the best. Smith (2007) used a CNN followed by an LSTM, achieving an accuracy of 93%.
$\mathrm{Correctness} = 100 \times \frac{\#AllWrds - (\#DelWrd + \#SubWrd)}{\#AllWrds}$
$\mathrm{Accuracy} = 100 \times \frac{\#AllWrds - (\#InsWrd + \#DelWrd + \#SubWrd)}{\#AllWrds}$
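A minimal sketch (not from the paper) of the two word-level metrics above; the counts are assumed to come from an alignment between the reference and the OCR output.

def correctness(all_words: int, deleted: int, substituted: int) -> float:
    return 100.0 * (all_words - (deleted + substituted)) / all_words

def accuracy(all_words: int, inserted: int, deleted: int, substituted: int) -> float:
    return 100.0 * (all_words - (inserted + deleted + substituted)) / all_words

print(accuracy(all_words=200, inserted=3, deleted=2, substituted=5))  # 95.0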
Proposed Methodology
The proposed algorithm consists of 3 modules for segmentation, feature extraction and recognition. For segmentation, the heights of the images are normalized and a sliding-window algorithm is used. For feature extraction, a small CNN architecture of 1 convolution and 1 max pooling is used. The recognition module uses 4 BiLSTM layers, the first two with tanh activation and the other two with sigmoid activation, trained with the connectionist temporal classification loss.
Result
20 pages of English and Persian text from different, randomly chosen books were typed in MS Office. The captured images were height-normalized. The metrics used for evaluation are: AllWrds: number of words in the text; InsWrds: wrongly inserted words; DelWrds: wrongly deleted words; SubWrds: wrongly substituted words. The Tesseract model achieved 91.73% accuracy on Persian-only text and 73.56% when trained on the dataset and tested on a set containing both Persian and English text, compared to Bina, which achieved 96% on both test sets. Tesseract and Bina achieved 71% and 91% correctness, respectively, on the test sets containing Persian and English text.
Next must-read paper: “An overview of the Tesseract OCR engine”
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
Summary
Existing OCR algorithms require a CNN architecture for feature extraction from the image, sequence modelling layers for text generation, and a language model to improve performance. This work includes all these steps in a single end-to-end trainable model. Extensive experiments on different combinations of encoder and decoder validate the superiority of using BEiT as the encoder and RoBERTa as the decoder. Ablation studies validate the effectiveness of the different strategies used.
Related Works
Diaz (2021) and Vaswani (2017) incorporated transformers into CNN architectures, observing significant performance improvements. Bao (2021) used self-supervised, image-pretrained transformers to replace CNN architectures.
Proposed Methodology
The image is divided into P*P patches, which are flattened, and a linear layer is used to change the dimension to a predefined number. The encoder consists of an image transformer
where DeiT (Touvron 2021) and ViT (Dosovitskiy 2021) are used for encoder initialization. The "[CLS]" token is used to represent the entire image. A text transformer is used as the decoder; it is initialized with RoBERTa models, so the output is wordpieces instead of characters. The model is first trained on hundreds of millions of synthetic printed text-line images. These weights are used to initialize a second-stage pretraining on task-specific synthetic and real-world datasets. Augmentations such as gaussian blur, image erosion, rotation, image dilation, downscaling and underlining are used to help the model's generalization.
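A minimal inference sketch using the Hugging Face implementation of TrOCR; the checkpoint name "microsoft/trocr-base-printed" and the image path are assumptions for illustration, not the authors' original training code.

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("text_line.png").convert("RGB")            # a cropped text-line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)                   # wordpiece decoding
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])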
Result
SROIE dataset is used to evaluate the model’s performance according
to precision, recall, F1-score where the model achieved 95.76, 95.91,
95.84 respectively.
Next must-read paper: “Scene Text Recognition with Permuted
Autoregressive Sequence Models”
An End-to-End Khmer Optical Character Recognition using Sequence-to-Sequence with Attention
Summary
The work proposes an encoder-decoder architecture with GRUs and an attention mechanism, evaluated on Khmer text in different fonts. Tesseract produced characters such as @, #, / that are not present in the texts. The proposed model sometimes outputs repeated characters before reaching the EOS character, due to the encoder-decoder architecture. Khmer characters have similar structures, resulting in errors for both models.
Related Works
Ilya (2014) employed RNN based encoder-decoder for English to French
machine translation. Dzmitry (2014) modified the previous work to
include attention mechanism and improved the performance. Devendra
(2015) employed LSTM based encoder-decoder mechanism for English
text recognition. Farisa (2021) trained standard CNN
architectures (ResNet, DenseNet) with LSTM/GRU layers and trained the
entire model using Connectionist Temporal Classification loss in an end-
to-end manner.
Proposed Methodology
The encoder consists of a convolution, batch normalization and ReLU activation followed by 3 residual modules. There are two types of residual blocks (illustrated in the original figure). The first residual module contains 2 res
block 0, while the second and third modules contain res block 0 followed by res block 1. This is followed by a 2D average pooling and a 2D dropout layer. For an intermediate output of size h*w*c, the feature maps are reshaped to w*hc ($O_{dropout2D}$) and encoded as $H, h_T = \mathrm{EncoderGRU}(O_{dropout2D})$. Here $h_T$ is passed through dropout, a linear layer and a tanh activation and used as the decoder's initial hidden state, while $H$ is the input to the decoder. The context vector is a weighted average of the encoder hidden states; the weights are calculated by applying a softmax to energies computed from the encoder hidden states and the decoder's previous hidden state as $e_{ij} = V_a^{T}\tanh(W_a s_{i-1} + U_a h_j)$. The one-hot encoded previous decoder output and the context vector from the attention are concatenated and passed through a GRU layer, together with the decoder's previous hidden state, to calculate the current hidden state. The current hidden state, the context vector and the one-hot encoded previous output are then concatenated and passed through a linear layer for next-character prediction.
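A minimal PyTorch sketch of the additive attention described above ($e_{ij} = V_a^{T}\tanh(W_a s_{i-1} + U_a h_j)$, softmax, then a weighted sum of encoder states); layer sizes are assumptions, not the paper's code.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)   # projects s_{i-1}
        self.U_a = nn.Linear(enc_dim, attn_dim, bias=False)   # projects h_j
        self.v_a = nn.Linear(attn_dim, 1, bias=False)         # V_a^T

    def forward(self, s_prev, enc_outputs):
        # s_prev: (B, dec_dim); enc_outputs: (B, T, enc_dim)
        energies = self.v_a(torch.tanh(
            self.W_a(s_prev).unsqueeze(1) + self.U_a(enc_outputs)))     # (B, T, 1)
        weights = torch.softmax(energies.squeeze(-1), dim=1)            # attention weights
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)  # context vector
        return context, weights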
Result
Text2image is used to generate 92,213 text images for different Khmer fonts and sizes. The test set contains 3,000 images. The proposed model achieved a character error rate (ratio of unrecognized characters to the total number of characters) of 1%, compared to Tesseract's error rate of 3% on this dataset.
Next must-read paper: “Khmer OCR fine tune engine for Unicode
and legacy fonts using Tesseract 4.0 with Deep Neural Network”
$CER = \frac{S + I + D}{N}$
$WER = \frac{\text{count of incorrect samples}}{\text{count of samples}}$
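A small sketch (an illustration, not the paper's code) of CER = (S + I + D) / N computed with a standard Levenshtein edit distance between reference and prediction.

def edit_distance(ref: str, hyp: str) -> int:
    # One-row dynamic programming over substitutions, insertions and deletions
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("kampuchea", "kampuchae"))  # 2 edits over 9 characters ~= 0.22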
ASTER: An Attentional Scene Text Recognizer with Flexible Rectification
Summary
Text in the real world is often curved or stylised; ASTER is therefore equipped with a rectification module that rectifies the input image (using a thin plate spline) and a recognition module that uses a sequence-to-sequence model with attention to predict characters from the rectified image.
Related Works
Wang(2012) used two separate CNN modules to localize and recognize
texts. Jaderberg(2014) used one CNN for both localization and
recognition. Su and Lu (2014,2017) used RNN for sequence prediction.
He(2016), Shi(2017) used a combination of CNN and RNN for text
prediction. Wang(2017) employed gated recurrent CNN for text
recognition. Yang (2017) employed a character detection model optimized with an alignment loss for character localization. End-to-end text recognition models (Jaderberg 2016, Weinman 2014) use text proposals followed by a word recognizer. Busta (2017) combined an FCN detector with connectionist temporal classification (CTC) for recognition.
Proposed Methodology
The algorithm contains two parts, for rectification and recognition. Rectification is done using a Thin Plate Spline transformation; it contains 3 parts: a localization network, a grid generator and a sampler.
The localization network predicts K control-point coordinates from the original image using a CNN architecture with an FC layer. Given a pixel location p on the original image I, the grid generator computes the corresponding pixel location on the rectified image. Here $\Delta C$ is calculated as
$\Delta C = \begin{bmatrix} 1_{1\times K} & 0 & 0 \\ C & 0 & 0 \\ \mathbb{C} & 1_{K\times 1} & C^{T} \end{bmatrix}$ and $T = \begin{bmatrix} C & 0_{2\times 3} \end{bmatrix} \Delta C^{-1}$. $\mathbb{C}$ is a square matrix such that $\mathbb{C}_{ij} = F(\lVert C_i - C_j \rVert)$ and $F(r) = r^{2}\log(r)$.
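A small NumPy sketch (an illustration under my own assumptions, not ASTER's implementation) of the radial basis kernel above: $\mathbb{C}_{ij} = F(\lVert c_i - c_j \rVert)$ with $F(r) = r^{2}\log r$.

import numpy as np

def tps_kernel(control_points: np.ndarray) -> np.ndarray:
    # control_points: (K, 2) array of (x, y) control points
    diff = control_points[:, None, :] - control_points[None, :, :]
    r = np.linalg.norm(diff, axis=-1)                   # (K, K) pairwise distances
    with np.errstate(divide="ignore", invalid="ignore"):
        k = np.where(r > 0, (r ** 2) * np.log(r), 0.0)  # F(0) is taken as 0
    return k

points = np.array([[0.0, 0.0], [0.5, 0.1], [1.0, 0.0]])
print(tps_kernel(points))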
Differentiable image sampling is used to clip pixel coordinates so that they stay within the image and to interpolate neighbouring pixels in a differentiable manner. The recognition module consists of an encoder-decoder architecture with a ConvNet acting as encoder and a BLSTM with attention as decoder. The attention weights are calculated from the encoder outputs and the previous hidden state as $e_{ij} = w^{T}\tanh(W s_{t-1} + V h_i + b)$. After a softmax, the glimpse vector g is the weighted sum of the encoder outputs using the attention weights. g is then concatenated with the one-hot encoded previous output and passed through a recurrent unit. This output is fed to a linear layer with softmax to predict the current character.
Result
Multiple datasets are used to evaluate the model's performance, such as IIIT5K (0), SVT (0), IC03 (0), IC13, IC15, SVTP and CUTE, obtaining 93.4, 93.6, 94.5, 91.8, 76.1, 78.5 and 79.5%, respectively.
Next must-read paper: “Focusing attention: Towards accurate text
recognition in natural images.”
Focal CTC Loss for Chinese Optical Character Recognition on Unbalanced Datasets
Summary
This work integrates the connectionist temporal classification (CTC) loss with the focal loss to help the model with unbalanced language datasets. Empirically, for both synthetic and real images, performance can be improved using alpha = 0.25 and gamma = 0.5.
Related Works
A. Graves (2008) was the first to combine the CTC loss with RNNs for text recognition. A. Ul-Hasan (2013) used BiLSTM with CTC loss for Urdu text recognition. M. Busta (2017) combined recognition and detection in an end-to-end model. J. Ba (2014) used reinforcement learning to concentrate on the part of the image useful for prediction. C.-Y. Lee (2016) used an RNN with attention for optical character recognition. M. Jaderberg used a spatial transformer for spatial manipulation of data within the module, combined with focal loss for text recognition.
Proposed Methodology
With ResNet as the backbone, the algorithm extracts feature maps from the last convolution layer, which are cut into multiple slices, each containing information about a small area of the image. This is followed by a BiLSTM and a fully connected layer with softmax for the final
output $p(\pi|x) = \prod_{t=1}^{T} y_{\pi_t}^{t}$. Here $y_{\pi_t}^{t}$ represents the probability of observing element $\pi_t$ (from the set of all possible characters plus the blank character) at slice t, for T total slices. The CTC probability is calculated as $p(l|y) = \sum_{\pi:\beta(\pi)=l} p(\pi|y)$, i.e., the sum of the probabilities of all paths that collapse to the label l. For hyperparameters $\alpha$ and $\gamma$, the focal loss is calculated as $FL(p_t) = -\alpha_t (1-p_t)^{\gamma}\log(p_t)$. Here $\alpha$ is used to overcome data imbalance and $\gamma$ helps the model focus more on hard samples. The CTC loss can thus be modified as $FCTC(l|y) = -\alpha_t (1 - p(l|y))^{\gamma}\log(p(l|y))$, so the model can focus more on hard samples.
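A PyTorch sketch of the focal CTC idea above (my own illustration, not the paper's code): wrap the per-sample CTC loss, treat p = exp(-CTC) as the sequence probability, and down-weight easy samples with alpha and gamma.

import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, reduction="none", zero_infinity=True)

def focal_ctc_loss(log_probs, targets, input_lengths, target_lengths,
                   alpha: float = 0.25, gamma: float = 0.5):
    # log_probs: (T, batch, num_classes) log-softmax outputs of the BiLSTM head
    per_sample = ctc(log_probs, targets, input_lengths, target_lengths)  # (batch,)
    p = torch.exp(-per_sample)                      # p(l|y) for each sample
    return (alpha * (1.0 - p) ** gamma * per_sample).mean()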
Result
A synthetic dataset is generated from MNIST-style character images by concatenating 5 images, drawn from two groups ('0-9, a-h' and 'i-z'): one split with 1M and 100K images (10:1 imbalance) and another with 1M and 10K images (100:1 imbalance), each with a 10K test set, plus a Chinese OCR dataset of 3.6M training and 5K testing images. The highest accuracies obtained were 62.8%, 72.4% and 76.4%, respectively.
Next must-read paper: “Deep TextSpotter: An End-to-End Trainable
Scene Text Localization and Recognition Framework”
PP-OCR: A Practical Ultra Lightweight OCR System
Summary
This work proposes a lightweight model for text recognition. Various strategies to improve the model's performance or reduce its number of parameters are also discussed, and ablation studies verify the effectiveness of each strategy.
Proposed Methodology
The system uses 3 modules: text detection to output bounding boxes for text, a direction classifier, which is needed when the bounding box contains reversed text, and text recognition. For detection, a light backbone (MobileNetV3_large_x0.5) is used. Empirically it was observed that removing the squeeze-and-excitation blocks from the model resulted in no loss of accuracy while reducing the number of parameters and the inference time. A Feature Pyramid Network (FPN) is used in the head to detect small text using large-resolution feature maps. A cosine learning-rate schedule is used so that the learning rate (LR) is large at the beginning and small at later stages. A large LR at the beginning may lead to instability, thus LR warmup is used. The FPGM pruner is used to dynamically calculate a compression ratio for each layer and remove similar filters, improving inference efficiency. MobileNetV3_small_x0.35 is used as the backbone for the direction classifier, with an input resolution of 48*192.
Augmentations such as rotation, gaussian blur, perspective distortion and motion blur (Base Data Augmentation, BDA), along with RandAugment, are used to improve the classifier's generalization. Modified PACT quantization with an L2 regularization coefficient of 1e-3 is used. For the recognizer, MobileNetV3_small_x0.35 is used as the backbone, pretrained on synthetic images, with modified strides to preserve horizontal and vertical information, and with BDA and TIA (Luo 2020) augmentation. A fully connected layer of 48 dimensions is used as the head, along with L2 regularization. A cosine learning rate with LR warmup is used for training. PACT quantization is applied to every layer except the LSTM layers.
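A short PyTorch sketch (an illustration, not PP-OCR's training code; step counts are placeholders) of the "LR warmup + cosine decay" schedule described above, via LambdaLR.

import math
import torch

model = torch.nn.Linear(10, 10)                        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_steps, total_steps = 500, 10000

def lr_lambda(step: int) -> float:
    if step < warmup_steps:                            # linear warmup
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# each training step: optimizer.step(); scheduler.step()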
Result
Multiple synthetic as well as public datasets such as LSVT, RCTW-17,
MTWI 2018, CASIA-10K, etc. are combined for training and validation
of the different modules. When using all the strategies previously
mentioned the model achieved an accuracy of 69%
Next must-read paper: “Towards end-to-end license plate detection and recognition: A large dataset and baseline”
On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention
$\mathrm{Att\text{-}out}_{hw} = \sum_{h'w'} \mathrm{softmax}(rel_{h'w' \to hw})\, V_{h'w'}$. $V_{h'w'}$ is calculated by multiplying the feature maps of the shallow CNN with trainable weights, and the attention weights $rel_{h'w' \to hw}$ are calculated as
$rel_{h'w' \to hw} \propto (e_{hw} + P_{hw})\, W_q W_k^{T}\, (e_{h'w'} + P_{h'w'})^{T}$.
Here $e_{h'w'}$ and $P_{h'w'}$ represent the extracted feature maps and the positional embedding, respectively. $P_{hw}$ can be calculated as $P_{hw} = \alpha(E)\, P_h^{sinu} + \beta(E)\, P_w^{sinu}$, where $P_h^{sinu}$ and $P_w^{sinu}$ are sinusoidal positional encodings along the height and width, while $\alpha(E)$ and $\beta(E)$ are calculated as
$\alpha(E) = \mathrm{sigmoid}(\max(0, g(E) W_h^{1}) W_h^{2})$ and $\beta(E) = \mathrm{sigmoid}(\max(0, g(E) W_w^{1}) W_w^{2})$,
for E representing the CNN-extracted features. To capture short-term dependencies, the 1x1 convolutions are replaced with 3x3 convolutions, forming a locality-aware feedforward layer.
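A PyTorch sketch of the adaptive 2D positional encoding described above (my paraphrase under stated assumptions, not the authors' code): sinusoidal encodings along height and width mixed by content-dependent gates alpha(E) and beta(E). The gating network and shapes are assumptions; dim is assumed even.

import torch
import torch.nn as nn

def sinusoidal(length: int, dim: int) -> torch.Tensor:
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / dim)
    enc = torch.zeros(length, dim)
    enc[:, 0::2], enc[:, 1::2] = torch.sin(angles), torch.cos(angles)
    return enc                                          # (length, dim)

class Adaptive2DPositionalEncoding(nn.Module):
    def __init__(self, dim: int, h: int, w: int):
        super().__init__()
        self.register_buffer("p_h", sinusoidal(h, dim))  # P_h^sinu
        self.register_buffer("p_w", sinusoidal(w, dim))  # P_w^sinu
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2 * dim), nn.Sigmoid())

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, dim, H, W) features E; H, W must match the sizes given at init
        alpha, beta = self.gate(feat).chunk(2, dim=-1)   # (B, dim) each
        pos = (alpha[:, :, None, None] * self.p_h.T[None, :, :, None] +
               beta[:, :, None, None] * self.p_w.T[None, :, None, :])
        return feat + pos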
Result
Datasets having horizontal texts as well as randomly aligned texts were
used to validate the performance of model such as IC13(94.1%),
IC03(96.7%), SVT(91.3%), IIIT5K(92.8%), IC15(79%), SVTP(86.5%),
CT80(87.8%)
Next must-read paper: “Aster: An attentional scene text recognizer with
flexible rectification”
Summary
Current OCR technologies are unable to recognize rotated, curved, vertically aligned or arbitrarily shaped text. This work uses a 2D self-attention mechanism to tackle these challenges.
Related Works
Cheng (2018) employed a selection module to select features in 4 directions by projecting an intermediate feature map. Yang (2017) used an attention module requiring extensive character-level supervision. Hui (2019) employed attention but is biased towards horizontal text due to height pooling and RNN layers. Fenfen (2018) used a 1D transformer for recognition. Pengyuan (2019) employed self-attention in the decoder for text recognition.
Proposed Methodology
A shallow CNN module is used to suppress background information while reducing computational cost for the subsequent layers. The output is passed through self-attention blocks with a novel 2D positional embedding; this is formulated in the $\mathrm{Att\text{-}out}_{hw}$ equation given above.
Towards Accurate Scene Text Recognition with Semantic Reasoning Networks
step is needed; this causes a bottleneck in the model's semantic reasoning. Thus, this work uses approximated information from step t-1 to calculate the vectors at step t. The output of the attention (g) is used to predict the target character (FC with softmax) and is optimized using cross-entropy (CE). The most likely character is passed through an embedding layer to calculate an approximate embedding. The extracted features are passed through several transformer units to output a global context (s), also optimized with a CE loss. Features g and s are dynamically weighted as $z_t = \sigma(W_z \cdot [g_t, s_t])$ and $f_t = z_t * g_t + (1 - z_t) * s_t$. The entire model is optimized with $Loss = \alpha_e L_e + \alpha_r L_r + \alpha_f L_f$.
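A minimal PyTorch sketch (illustrative, not the paper's code) of the gated fusion above: $z_t = \sigma(W_z[g_t, s_t])$, $f_t = z_t g_t + (1 - z_t) s_t$.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w_z = nn.Linear(2 * dim, dim)

    def forward(self, g: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # g: visual/attention features, s: global semantic context, both (B, T, dim)
        z = torch.sigmoid(self.w_z(torch.cat([g, s], dim=-1)))
        return z * g + (1.0 - z) * s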
Result
Various datasets such as IC13, IC15, IIIT5K, SVT, SVTP, CUTE, TRW-T,
TRW-L are used for evaluation achieving 95.5, 82.7, 94.8, 91.5, 85.1,
87.8, 85.5 and 84.3, respectively.
Conclusion
The model could be incorporated with CTC loss to improve
performance.
Next must-read paper: “An end-to-end trainable neural network for spotting text with arbitrary shapes”
Summary
This work attempts to overcome the shortcomings of RNNs, such as their time dependency and, most importantly, the one-way transmission of context, which greatly limits the model's ability to learn semantic information.
Related Works
Baoguang (2016) combined CNN and RNN with connectionist temporal
classification(CTC) loss function for recognition. Minghui (2019)
formulated the problem as a pixel level classification task. Chen (2016)
extracted the visual features in 1D and used the semantic information of
last time step for recognition. Mingkun (2019) used a rectification network
based on local features to improve performance. Zhanzhan (2018)
extracted features along 4 directions and used a filter gate to calculate the
contribution of each. Zbigniew (2017) encoded spatial coordinates on 2D
feature maps to increase sequential information extracted.
Proposed Methodology
ResNet50 is used as the backbone, with a feature pyramid extracting features from the 3rd, 4th and 5th residual blocks. The extracted features are passed to a transformer along with positional embeddings. A novel parallel visual attention module is used that computes weights as
$e_{t,ij} = W_e^{T}\tanh(W_o f_o(O_t) + W_v v_{ij})$,
where v are the transformer-extracted features, O represents the character reading order (1…N-1) and $f_o$ is the embedding function. After a softmax, a weighted sum with v is used to compute the attention outputs. When using an RNN to calculate the vectors at time step t, information from the t-1 time
Multi-Lingual Optical Character Recognition System Using the Reinforcement Learning of Character Segmenter
First, L and R are trained with a cross-entropy loss function. The trained modules are then used in conjunction with S in reinforcement learning. S outputs the partition map as a probabilistic vector, which is processed into a binary map; the processing involves non-max suppression, clipping probabilities at 0.99 and thresholding. The action is based on the model's output, and the reward is based on the distance from the true value as
$r(X, a) = 1 - \frac{d(Y(X,a),\, Y)}{\max(|a| + 1,\, N_y)}$.
For input X and action a, Y(X, a) is the processed partition map and Y is the ground truth. The denominator is used for length normalization, $N_y$ being the total number of characters in Y.
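A small sketch of my reading of the reward above (not the paper's code): one minus a length-normalized edit distance between the processed partition map and the ground truth.

def reward(dist: int, truth_len: int, num_actions: int) -> float:
    # dist: edit distance d(Y(X,a), Y); truth_len: N_y; num_actions: |a|
    return 1.0 - dist / max(num_actions + 1, truth_len)

print(reward(dist=1, truth_len=5, num_actions=4))  # 1 - 1/5 = 0.8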
Result
Texts of various languages are used for evaluation, such as Chinese, English and Korean, as well as mixed texts such as Chinese with English, Chinese with Korean, English with Korean, and Chinese with English and Korean, achieving 94.74, 77.01, 97.07, 87.23, 97.1, 87.46 and 90.87%, respectively.
Next must-read paper: “Tesseract Blends Old and New OCR
Technology”
Summary
The performance of OCR systems decreases when working on text containing multiple languages. To tackle this problem, this work uses a segmenter (trained with a reinforcement algorithm), a switcher and a recognizer trained in a supervised manner.
Related Works
Zheng(2016) formulated character segmentation as a binary
segmentation. Chernyshova(2020) proposed a word image
segmentation model using dynamic programming to select most
probable boundaries in images. B.shi(2017) used convolution layers for
feature extraction, LSTM layers to predict character class and CTC loss
to ignore the repeated characters produced due to multiple slices. D.
Kumar(2015) used encoder-decoder architecture with attention to scan
image along horizontal direction followed by decoding the feature vector
using attention.
Proposed Methodology
The segmenter is used to partition a word image into n sub-images; the switcher then assigns a recognizer to each sub-image, and the recognizer assigns a label. The architecture of the word recognizer (R) is shown in the original figure.
OCR Survey of
Japanese Language
Tapas Dutta, Deep Learning Engineer
An attention-based row-column encoder-decoder model for text recognition in Japanese historical documents
The two BiLSTMs run along rows and along columns to extract sequential information in the horizontal and vertical directions, respectively. The attention scores are calculated as $\mathrm{score}(h_t^{Attn}, e_i) = \tanh(W_h h_t^{Attn} + W_e e_i)$. Here e is the feature map extracted by the CNN module and h is the current LSTM hidden state, calculated as $h_t^{Attn} = \mathrm{LSTM}^{Attn}(h_{t-1}^{Attn}, \mathrm{Embed}(y_{t-1}), a_{t-1})$. Here y is the previously predicted character converted to a feature vector by the Embed layer, $\mathrm{LSTM}^{Attn}$ is a stack of 2 LSTMs with 512 nodes each, and a is the attention vector calculated as $a_t = \tanh(W_c [c_t; h_t^{Attn}])$. Here c is the context vector calculated as $c_t = \sum_{i=1}^{n} p_i^{t} e_i$, i.e. the weighted average of the extracted features using the attention weights. The attention outputs are then calculated as $h_t^{Res}, O_t = \mathrm{LSTM}^{Res}(h_{t-1}^{Res}, a_t)$, where $\mathrm{LSTM}^{Res}$ is a single LSTM layer of 512 nodes with a skip connection to the input attention vector. The decoder multiplies the attention outputs with trainable weights followed by a softmax: $y_t = \mathrm{softmax}(W_a \cdot O_t)$.
Result
The PRMU dataset is used to validate the performance of the model. It has 3 tasks: (1) single-character recognition, (2) recognition of 3 characters written vertically, (3) recognition of 3 or more characters written in multiple lines. The model achieved character and sequence error rates of 4.15 and 11.43 on level 2, and 12.69 and 58.58 on level 3, respectively.
Next must-read paper: “Recognition of anomalously deformed kana
sequences in Japanese historical documents”
Summary
The work proposes a novel algorithm to recognize text without the need
for segmentation. This is achieved by incorporating BiLSTM along rows
and columns in the encoder and residual LSTM in the decoder. Language
model could be incorporated in the algorithm to further improve
performance
Related Works
A. Graves (2009) used BiLSTM with Connectionist Temporal Classification (CTC) for English text recognition. A. Graves (2008) employed Multi-Dimensional LSTM with CTC loss for Arabic text recognition. H. Yang (2018) used a CNN with CTC for Chinese text recognition. D. Valy (2018) used a CNN with 1D or 2D LSTM for Khmer text recognition. C. Wang (2018) used attention for text recognition. Y. Deng (2017) used attention for mathematical expression recognition. T. Bluche (2017) incorporated Multi-Dimensional LSTM with attention for handwritten paragraph recognition.
Proposed Methodology
The algorithm has 3 modules: a feature extractor, a row-column encoder and a decoder. A standard CNN module is used for feature extraction. Two modified BiLSTMs are used, one along the rows and another along the columns (described above).
Deep Convolutional Recurrent Network for Segmentation-free Offline Handwritten Japanese Text Recognition
Result
The Kondate dataset is used to fine-tune and validate the model. Label Error Rate and Sequence Error Rate are used to measure the model's performance; these are calculated as
$LER(h, S') = \frac{1}{Z} \sum_{(x,z)\in S'} ED(h(x), z)$ and $SER(h, S') = \frac{100}{|S'|} \sum_{(x,z)\in S'} [h(x) \neq z]$.
The model achieved LER and SER of 6.44 and 28.9 using DCRNs, and 6.95 and 28.04 using DCRN-f&s, respectively.
Next must-read paper: “Text-Line Character Segmentation for
Offline Recognition of Handwritten Japanese Text ”
Summary
The work proposes a segmentation-free algorithm for recognition. It has three components: a CNN feature extractor applied with a sliding window, a BLSTM for prediction, and optimization using the connectionist temporal classification (CTC) loss.
Related Works
Graves(2009) combined BLSTM and CTC for text recognition. Messina
(2015) combined Multi Dimensional LSTM and CTC for end-to-end
trainable chinese text recognition. Suryani(2016) combined pretrained
CNN with LSTM and hidden markov model for alignment.
Proposed Methodology
The CNN architecture used for feature extraction is shown in the original figure. The model is pretrained with Japanese handwritten character datasets (Nakayosi, Kuchibue). After training, the softmax (DCRNs) or both the fully connected layers and the softmax (DCRN-fs) are removed, and the remainder is used for feature extraction. LSTM layers are used instead of vanilla RNNs to address the vanishing gradient problem. An LSTM, however, extracts information in one direction only, so 2 LSTMs are used to extract information in both directions. The model is optimized using the CTC loss function.
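A compact PyTorch sketch of the generic CRNN-style pipeline described above (CNN features, bidirectional LSTM over the width dimension, CTC-ready log-probabilities); layer sizes are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes: int, cnn_out: int = 256, hidden: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, cnn_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.rnn = nn.LSTM(cnn_out, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)     # num_classes includes the CTC blank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, H, W) grayscale text-line images
        f = self.cnn(x)                                  # (B, C, H/4, W/4)
        f = f.mean(dim=2).permute(0, 2, 1)               # collapse height -> (B, W', C)
        seq, _ = self.rnn(f)
        log_probs = self.fc(seq).log_softmax(-1)         # (B, W', num_classes)
        return log_probs.permute(1, 0, 2)                # (T, B, C) as expected by nn.CTCLoss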
Attention Augmented Convolutional Recurrent Network for Handwritten Japanese Text Recognition
Proposed Methodology
A CNN architecture without fully connected layers is used for feature extraction. The extracted feature maps are unfolded from left to right. These features are fed to a self-attention module, where they are projected to query, key and value, followed by scaled dot-product attention. For n heads, this process is repeated n times. The outputs are then concatenated and added to the input of the self-attention module.
Result
Multiple datasets are used to evaluate the model’s performance as
IIIT5K, SVT, IC03, IC13, SVTP, CUTE obtaining 92.67, 91.16, 93.72,
90.74, 78.76 and 76.39 respectively.
Next must-read paper: “Focusing attention: Towards accurate text
recognition in natural images”
Summary
Japanese OCR is difficult due to the vast character set, multiple writing styles and multiple touch points between characters. The work proposes an attention-augmented Convolutional Recurrent Network (AACRN) consisting of three modules: a convolutional feature extractor, a self-attention-based encoder and a CTC decoder.
Related Works
Feng (2012) used segmentation to segment each character, followed by recognizing each character. Segmentation-free methods involve using Connectionist Temporal Classification (CTC) and attention mechanisms. Graves (2009) was the first to use BLSTM with CTC for recognition. Shi (2017) and Ly (2017) used CNN with BLSTM and CTC for recognition. Deng (2017) used an attention-based model to convert mathematical expressions to LaTeX. Chowdhury (2020) used an attention model with a beam-search decoder for English and French text recognition. Vaswani (2017) used self-attention along with positional information for recognition.
Recognition of Anomalously Deformed Kana Sequences in Japanese Historical Documents
The end-to-end approach consists of a similar model without pretraining, with BatchNormalization after each convolution layer in the feature extractor, ReLU replaced by LeakyReLU, and dropout layers after every 2 LSTM layers in the frame predictor. For text spanning multiple lines, one approach is to segment the vertical lines, join them vertically and apply the previous algorithms. Another approach is to use the image feature extractor from the previous algorithm and a frame predictor of 2 levels of 2DBLSTM, the first level having 4 LSTMs with 64 nodes each and the second level having 4 LSTMs with 128 nodes each. The third approach replaces the CNN module with a 2DBLSTM, so the entire structure has a total of 3 2DBLSTMs, each having 4 LSTM layers, with 2, 10 and 50 nodes respectively. A limited window size and fully connected layers are used to reduce the number of weights.
Result
For level 2 (single vertical texts) the end-to-end trained model achieved
the best result of 10.9 LER and 27.7 SER. For level 3(multiple vertical
texts) segmentation with end-to-end trained model achieved best result
of 12.3 LER and 54.9 SER.
Conclusion
Context-preprocessing, language statistics , augmentation could be
applied to further improve the performance.
Next must-read paper: “Character and Text Recognition of Khmer
Historical Palm Leaf Manuscripts”
Summary
The work proposes a segmentation-free algorithm for Kana character recognition across multiple lines. It proposes a Deep Convolutional Recurrent Neural Network (DCRN), which uses a CNN architecture for feature extraction and a BiLSTM with the connectionist temporal classification (CTC) loss for recognition.
Related Works
Kitadai (2008) restores the image, performs similar-pattern matching and returns similar characters already decoded. Phan (2016) segmented a document into characters and recognized them using a modified quadratic discriminant function. Nguyen (2017) used a segmentation-based approach for handwritten Japanese text recognition. Graves (2008) combined multidimensional LSTM with the CTC loss in an end-to-end trainable model for handwritten Arabic recognition. Shi (2015) used CNN with LSTM for scene text recognition. Rawls (2017) used CNN with LSTM for handwritten English and Arabic text recognition. Ly (2017) used a combination of CNN and LSTM optimized with the CTC loss for Japanese handwritten text recognition.
Proposed Methodology
The pretrained model has 5 CNN blocks with max pooling, followed by 2 FC layers with ReLU after every layer and a softmax layer for prediction. This architecture is pretrained (to recognise a single character), and features are extracted in 3 ways:
• Extract features from the model without the softmax, using a sliding window (stride 12 or 16) applied on the text
• Sliding window of 64*32 with a stride of 32 for non-overlapping regions, with features extracted from the last convolution layer
• Features extracted for the full text image from the last convolution layer
A frame predictor module of 3 BLSTMs (each having 2 LSTMs of 128 nodes) followed by a dense layer predicts a character for each frame, and CTC produces the final prediction.
A semantic Segmentation-based method for Handwritten Japanese Text Recognition
Result
The Kondate dataset is used for training, and the Kuchibue and Nakayoshi datasets are used for testing. Pixel-level accuracy and IoU% are calculated for character segmentation, while error rates are calculated for recognition; the results are presented in the original slides.
Conclusion
Improved segmentation algorithms such as R-CNN and Mask R-CNN could be used to improve results, as could a Conditional Random Field for post-processing or integrated directly.
Next must-read paper:”Progress and results of Kaggle Machine
Learning Competition for Kuzushiji Recognition”
Summary
The work proposes a segmentation-based Japanese handwritten text recognition algorithm: semantic segmentation using a U-Net architecture for pixel-level classification, followed by CNN-based OCR, whose outputs are then combined using a language model.
Related Works
Xu (2001) used skeleton and contour analysis to detect cut points; a filter trained with geometric features prunes implausible cutting candidate points, and an OCR model is used for recognition followed by a language model for refinement. Shi (2015) combined CNN and LSTM with CTC loss for text recognition. Ly (2019) used a CNN for feature extraction with an RNN to encode, followed by attention for the output. Asanobu (2019) first segmented the characters using a two-class Faster R-CNN, followed by recognition using ResNet152 as backbone with FPN and ROI Align. Baek (2019).
Proposed Methodology
ResNet101 is used as the encoder for the U-Net. Dilated convolution is used to extract features without the need to down-sample multiple times. The encoder features are passed through dilated convolutions with dilations of 2, 4, 8 and 16, concatenated, and fed to the upsampling layers. The model is tasked to predict the center of each character along with convex hulls to reduce touching between characters. A watershed algorithm is applied to the output to separate the convex hulls. Recognition is done via an Inception-ResNet-v2 network. The outputs are represented as unigrams, bigrams and trigrams, which are then selected using the Viterbi algorithm to obtain the result.
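A short scikit-image sketch (an illustration, not the paper's pipeline; the min_distance parameter is an assumption) of using the watershed transform to split touching character regions, as described above.

import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed
from skimage.feature import peak_local_max

def split_touching_characters(binary_mask: np.ndarray) -> np.ndarray:
    # binary_mask: (H, W) boolean map of predicted character (foreground) pixels
    distance = ndi.distance_transform_edt(binary_mask)
    peaks = peak_local_max(distance, min_distance=5, labels=binary_mask)
    markers = np.zeros_like(distance, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)   # one marker per character center
    return watershed(-distance, markers, mask=binary_mask)   # labeled character regions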
A unified method for augmented incremental recognition of online handwritten Japanese and English text
The pointer is decided based on previous results. After segmentation, if the classification of a stroke has changed, the recognition scope runs from that stroke (the earliest classification-changed off-stroke, EccOs) to the latest stroke. Thus the previous and current recognition scopes overlap, and this can be used to reduce processing time. SP is used to split the words into preceding and succeeding characters. After splitting, if a character was present in the previous scope or is out of scope, it is reused; otherwise recognition is done. Characters/words containing the last segment are treated as partial characters, and recognition is postponed until complete patterns are received. A newly added stroke is treated as a delayed stroke if it is close to a previous segmentation result. The character/word with and without the delayed stroke is considered and the best path is searched.
Result
The TUAT-Kondate dataset is used for Japanese text evaluation. The dataset contains text from 100 people, divided into 4 sets (each having text from 25 people), with 3 for training and 1 for testing. IAM-OnDB-t2 is used for English text evaluation.
Next must-read paper:”An approach for real time recognition of
online Chinese handwritten sentences”
Summary
The proposed algorithm can be used for recognition while writing or for delayed recognition. Segmentation is followed by recognition, done after a fixed interval of strokes. This relies on three techniques: the structures used in the previous step are reused, strokes that were not recognized in the previous step are attempted again, and incomplete characters are skipped.
Related Works
Zhu(2010) used geometric features extracted from previous and next
strokes for japanese text recognition. Nakagawa(2006) used geometric
and linguistic information to improve performance. Graves(2009) used
bi-directional recurrent networks for feature extraction. Tanaka(2002)
reported online recognition decreases performance by 0.3.
Proposed Methodology
The offline segmentation is done in two steps, i.e. segmentation into lines and segmentation of each character. A classifier is used to classify each stroke as a segmentation point (SP), non-segmentation point (NSP) or undecided point (UP). Here SP separates two characters or words, NSP indicates the stroke is within a character, and UP is used for low-confidence predictions. An SVM is used for Japanese text and a BLSTM for English text to classify each stroke. Strokes separated by SP or UP are considered as a character/word or part of a character/word. These are then recognized and assigned a confidence score. The Viterbi algorithm is used to search for the optimal combination that gives the best score. For online recognition, the algorithm starts after some strokes have been added; when resuming, segmentation is done from the segmentation pointer to the current stroke.
Training an End-to-End Model for Offline handwritten Japanese Text Recognition by Generated Synthetic Patterns
To create synthetic data, a random sentence and a different writer are chosen. A new sentence image is formed from the characters of the sentence, using the handwritten image of each character written by that writer. Each handwritten character undergoes local distortion (shearing, rotation, scaling, translation), and the entire sentence undergoes global distortion (rotation and scaling).
Result
Kondate dataset is used to evaluate the model’s performance and
effectiveness of the augmentation used. Training the model in an end-
to-end manner with augmentation achieved the best performance of
1.87 Label Error Rate and 13.81 Sequence Error Rate
Next must-read paper:”Are Multidimensional Recurrent Layers
Really Necessary for Handwritten Text Recognition”
Summary
The algorithm proposed uses Deep Convolution Neural Network for
feature extraction from text line image, deep BLSTM as recurrent
network and trained using Connectionist Temporal Classification(CTC)
loss. Elastic distortions are used to synthesize images.
Related works
Graves(2006) first introduced CTC loss for handwritten text recognition.
Graves(2009) combined BLSTM with CTC loss to improve performance.
Messina(2015) used Multi-Dimensional LSTM with CTC loss for Chinese
text recognition. Puigcerver(2017) proved multi-dimensional LSTM is
not necessary for text recognition.
Proposed Methodology
The images are first scaled to a fixed size and Otsu's method is used to obtain binary images. The CNN architecture (shown in the original figure) is used for feature extraction. A bidirectional LSTM is used to pass information in both directions over the features extracted by the CNN. The entire architecture is trained using the CTC loss.
Attempts to recognize anomalously deformed Kana in Japanese historical documents
Another method could be to use a 2DBLSTM with a CNN pretrained on single-character recognition, or object detection followed by a CNN pretrained on single-character recognition, without segmentation.
Result
The IEICE PRMU contest on recognizing anomalously deformed Kana in Japanese historical documents, which has 3 tasks (recognizing a single character, one line of text, and multiple lines of text), is used to validate the approaches presented.
Next must-read paper:”Deep Convolutional Recurrent Network for
Segmentation-free Offline Handwritten Japanese Text Recognition”
Summary
The work proposes three different algorithms: for recognizing a single character, a single line of characters using a CNN with BLSTM, and multiple lines of characters using a segmentation-based method.
Proposed Methodology
For recognizing single characters, Otsu's method is used to reduce background noise while enhancing the foreground. The result is padded so that the character is in the centre, followed by linear spatial normalization, resizing and standardization. Rotation and shearing are used for augmentation. Experiments are conducted on multiple CNN backbones as well as LSTM architectures for character classification. For detecting character sequences, a combined architecture of CNN, BLSTM and CTC loss is used. The CNN architecture is first trained on single-character recognition; features can then be extracted using a sliding window with overlap (DCRN-o), without overlap (DCRN-wo), or using the features extracted from the last convolution layer (DCRN-ws). For the recurrent layer, a 1DBLSTM is trained along with the CTC loss. For recognizing texts in multiple vertical lines, multiple approaches are presented, such as vertical line segmentation and concatenation of the lines, followed by the previous algorithm.
A Multiplexed Network for End-to-End, Multilingual OCR
Proposed Methodology
For text detection the algorithm uses a ResNet50 backbone with a U-Net structure. The Vatti clipping algorithm is used to shrink text regions, and RoI masking is used to suppress background and neighboring text instances. For recognition, a character segmentation module and a spatial attention module are used. The output of the detection and segmentation modules is used as input to a language prediction module, which uses a small CNN architecture (2 conv layers, 1 FC) for classification. The language prediction and recognition heads are trained using
$L_{hard\text{-}integrated} = \alpha_{seq} L_{seq}^{r}$ with $r = \arg\max_{1 \le l \le N_{rec}} p(l)$,
when $N_{rec}$ different languages are supported. Thus, for each word during training, the recognition head with the highest confidence is selected.
$L_{seq}^{r} = -\frac{1}{T}\sum_{t=1}^{T}\left[ I(c_t \in C_r)\,\log p(y_t = c_t) + I(c_t \notin C_r)\,\beta \right]$. Here $C_r$ is the character set supported by head r and $c_t$ is the ground truth. $\beta$ is a hyperparameter used as a penalty for unsupported characters.
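A simplified PyTorch-style sketch of my interpretation of the losses above (not the paper's code): pick the recognition head with the highest predicted language probability, and penalize ground-truth characters outside the head's character set with a fixed beta.

import torch

def hard_integrated_loss(lang_probs, seq_losses):
    # lang_probs: (N_rec,) predicted probability per language head
    # seq_losses: (N_rec,) per-head sequence losses L_seq^r
    r = torch.argmax(lang_probs)                 # head with the highest confidence
    return seq_losses[r]

def seq_loss_for_head(char_log_probs, target_ids, supported, beta: float = 5.0):
    # char_log_probs: (T, vocab) log-probabilities from one recognition head
    # target_ids: (T,) ground-truth character ids; supported: set of ids in C_r
    losses = []
    for t, c in enumerate(target_ids.tolist()):
        if c in supported:
            losses.append(-char_log_probs[t, c])         # standard cross-entropy term
        else:
            losses.append(torch.tensor(beta))            # penalty for unsupported characters
    return torch.stack(losses).mean()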
Results
ICDAR 2019 MLT dataset (MLT19) is used to evaluate the model’s
recognition performance.
Next must-read paper:” Mask TextSpotter v3: Segmentation
Proposal Network for Robust Scene Text Spotting”
Summary
This work proposes an end-to-end trainable pipeline which includes text
recognition and detection. The algorithm uses multiple heads for
recognizing different languages.
Related works
The task of multilingual text recognition can be divided into 3 sub-tasks: text detection, script identification and text recognition. Prior to the use of deep learning, hand-crafted features were used (Anil 1998, Lukas 2012, Cheng 2013). The success of deep learning methods led to them being used in conjunction with hand-crafted features, and recent approaches use deep learning methods almost exclusively. Most recognition algorithms either use the Connectionist Temporal Classification (CTC) loss (Graves 2006) to convert features to a language sequence, or a Seq2Seq encoder-decoder framework with attention (Bahdanau 2014) for classification. Script identification is necessary for determining which language recognizer to use. Shi (2015) used a CNN with multi-stage pooling for classification. Fujii (2015) reformulated the task as a sequence-to-label problem. Some recognition algorithms, like Michal (2018), Youngmin (2020) and Pengyuan (2018), do not incorporate these components and have one recognition head for characters from all languages.
E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text
The entire algorithm is trained using $L_{final} = L_{geo} + \lambda_1 L_{angle} + \lambda_2 L_{dice} + \lambda_3 L_{CTC}$. Here $L_{angle}$ is the mean squared error over $\sin(\phi)$ and $\cos(\phi)$, $L_{geo}$ is an IoU loss, $L_{CTC}$ is the word-level recognition loss and $L_{dice}$ is a dice loss calculated for predictions with more than 90% confidence.
Results
The ICDAR 2017 MLT dataset (MLT17) is used to evaluate the model's performance.
Next must-read paper:” Fots: Fast oriented text spotting with a unified network”
Summary
This work proposes end-to-end method for multi-language scene text
localization and recognition.
Related works
For scene text recognition, localization is an important step to obtain word-level bounding boxes or segmentation maps. Jaderberg (2016) used the output of edge boxes and channel features to obtain bounding boxes; random forests were used to filter the predictions and a CNN regressor for post-processing. Gupta (2016) used a CNN to detect objects at multiple scales. Tian (2016) used a CNN-RNN architecture to predict the presence of characters. For recognition, Jaderberg (2016) used VGG16 for classification of 90K words. Shi (2017) generates one word per image using a CNN with BLSTM and the Connectionist Temporal Classification loss. Lee (2016) used CNN with RNN and soft attention for recognition. Li (2017) used a convolutional recurrent network for text localization as well as text recognition. Liu (2018) used a shared CNN architecture for text localization and recognition.
Proposed Methodology
The algorithm uses ResNet34 with an FPN object detector. For an input, the localization module produces 7 outputs: the presence or absence of text, the 4 coordinates of the bounding box, and the orientation angle ∅.
OCR Survey of
Dataset
Tapas Dutta, Deep Learning Engineer
Generating Synthetic Data for Text Recognition
Summary
The work generates 9M synthetic handwritten word image corpus using
open-source fonts and data augmentation schemes.
Related Works
Jaderber(2014), Rozantsev(2015), Ros(2016) used synthetic mechanism
for data generation and annotation. Sankar(2010) used rendered images
for annotating large scale datasets.
Proposed Methodology
Synthetic words can be generated by rendering words in available fonts, or by learning separate parameters for style and content and modifying the style parameters of a deep learning model. This work uses the first technique, with publicly available fonts (750) and a vocabulary chosen from a dictionary (90K unique words). After randomly selecting a word and a style, the inter-character space and stroke width are varied, with random augmentation within (-5 to +5) and shear (+/- 0.5) in the horizontal direction.
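A small Pillow sketch of the first technique above (render a dictionary word in a font, then apply a small rotation and a horizontal shear); the font path, canvas size and parameter values are placeholders, not the paper's generation pipeline.

import random
from PIL import Image, ImageDraw, ImageFont

def render_word(word: str, font_path: str = "fonts/SomeFont.ttf") -> Image.Image:
    font = ImageFont.truetype(font_path, size=48)
    img = Image.new("L", (16 * len(word) + 64, 96), color=255)
    ImageDraw.Draw(img).text((16, 24), word, font=font, fill=0)

    angle = random.uniform(-5, 5)                        # small random rotation
    img = img.rotate(angle, expand=True, fillcolor=255)

    shear = random.uniform(-0.5, 0.5)                    # horizontal shear
    w, h = img.size
    return img.transform((w + int(abs(shear) * h), h), Image.AFFINE,
                         (1, shear, 0, 0, 1, 0), fillcolor=255)

render_word("kondate").save("synthetic_word.png")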
Conclusion
To simulate cursive writing elastic distortion could be used in future works.
Next must-read paper: “The synthia dataset: A large collection of
synthetic images for semantic segmentation of urban scenes”
A Database of On-line Handwritten Mixed Objects named “Kondate”
Summary
The work presents a database containing online handwriting of text, figures, tables, maps, diagrams, etc. The database has 100 Japanese, 25 English and 45 Thai writers.
Proposed Methodology
Two writing strategies were used: copy writing, where the participant receives both the pattern and the content, and free writing, where neither is provided, so that real patterns can be obtained. Attributes of the participant, the environment and each stroke are collected. The X/Y coordinates of each stroke from pen down to pen up are reported along with the stroke id, timeOffset (from pen down of the first stroke to pen down of the current stroke) and duration.
Conclusion
The work presents a dataset with online text as well as figures, tables,
maps, diagrams.
Next must-read paper: “Arabic Handwriting Data Base for Text
Recognition”
ICDAR2017 Robust Reading Challenge on Multi-lingual Scene Text Detection and Script Identification – RRC-MLT
Summary
The RRC-MLT challenge has tasks such as text detection and script classification, covering the tasks necessary for multilingual text recognition. The dataset contains 18K images with text from 9 different languages.
Proposed Methodology
The dataset contains natural images with embedded text, such as street signs, advertisement boards, shop names, passing vehicles and images from internet search. It contains at least 2K images for each of Arabic, Bangla, Chinese, English, French, German, Italian, Japanese and Korean, and text containing special characters from multiple languages is also available. Task 1 is text detection, for which 9K images were used for training and the remaining 9K for testing. For a set of don't-care regions $D = \{d_1, d_2, \ldots, d_k\}$, ground truth regions $G = \{g_1, g_2, \ldots, g_m\}$ and predictions $T = \{t_1, t_2, \ldots, t_n\}$, predictions are matched against the don't-care regions and discarded as noise if $\frac{\mathrm{area}(d_k \cap t_j)}{\mathrm{area}(d_k)} > 0.5$. The filtered predictions T' are considered positive if $\frac{\mathrm{area}(g_i \cap t'_j)}{\mathrm{area}(g_i \cup t'_j)} > 0.5$. The set of positive matches M is used to calculate the F-score, with $\mathrm{precision} = \frac{|M|}{|T'|}$ and $\mathrm{recall} = \frac{|M|}{|G|}$, thus $F\text{-}score = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$.
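A compact sketch of the matching protocol above (my illustration, not the official evaluation script): IoU-based matching of predictions to ground truth, then precision, recall and F-score.

def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def f_score(ground_truth, predictions, thr: float = 0.5):
    matched, used = 0, set()
    for g in ground_truth:
        for j, p in enumerate(predictions):
            if j not in used and iou(g, p) > thr:      # each prediction matches at most once
                matched += 1
                used.add(j)
                break
    precision = matched / max(len(predictions), 1)
    recall = matched / max(len(ground_truth), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)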
Team SCUT-DLVC lab used one model for rough
detection followed by another for finely adjusting bounding box and
postprocessing. CAS and Fudan university designed rotation region
proposal network for inclined region proposals with angle information
which is used for bounding box regression. This is followed by text region
classifier. The SenseTime group proposed an FCN-based model with deformable convolutions, which predicts whether a pixel is text or
not along with location if it is in a text box. TH-DL group used an FCN
with residual connection to predict whether a pixel belongs to text,
followed by binarizing at multiple thresholds. The connected components
thus extracted along with their bounding boxes are used as region
proposals. The features of region proposals after Rotation ROI pooling are
fed to Fast R-CNN and non-max suppression for results. Linkage group
extracted external regions from each channel followed by calculating
linkage between components and pruning to get non-overlapping
components, candidate lines are extracted and merged. Alibaba group
used a CNN to detect text regions and predict their relation which are
then grouped and used by a CNN+RNN to divide them into words. For script identification, 84K and 97K images were used for training and testing respectively, with text in Arabic, Bangla, Chinese, Japanese, Korean, Latin
and Symbols. SCUT-DLVC labs used CNN with random crop for training
and sliding windows for test images. Team CNN-based method used
VGG16 initialized with ImageNet weights and cross-entropy loss with
training images resized to fixed height and variable width for training.
TNet enhanced the features of dataset followed by deep network for
training and majority vote for classification. Team BLCT extracted patches
of variable sizes to train a 6-layer CNN model. Features from the penultimate layer are randomly combined and used to create a bag of visual words of 1024 codewords; these are used as image representations, aggregated into a histogram, which in turn is classified by 2 dense layers and 1 dropout layer. Teams TH-DL and TH-CNN used a GoogLeNet-like structure, while Synthetic-ECN used an ECN-based structure.
Team 8 used VGG16 with an SVM classifier. Task 3 is a combination of text detection using bounding boxes and script classification, evaluated as
$\frac{\mathrm{area}(g_i \cap t'_i)}{\mathrm{area}(g_i \cup t'_i)} > 0.5$ and $\mathrm{script\_id}(t'_i) = \mathrm{script\_id}(g_i)$.
Team TH-DL used a combination of the methods from the previous tasks. Team SCUT-DLVCLab trained 2 models, one for classification and one for detection; the classification model predicts background for high-confidence background boxes generated by the detector.
Results
The results of Task-1, Task-2 and Task-3 are presented.
Conclusion
Future works are expected to tackle larger datasets and more languages.
Next must-read paper: “Script identification in the wild via distinctive
convolutional network”
Recognizing Text with Perspective Distortion in Natural Scenes
Summary
The work proposes an algorithm for recognizing texts of arbitrary
orientations. A new dataset StreetViewText-Perspective is also
introduced.
Related works
Smith(2011) and Weinman(2009) proposed similarity constraint so that
visually similar characters have similar labels. Wang(2011) used an object
recognition framework requiring all character to be correctly recognized.
Gandhi(2000) rectified texts using motion information. Li(2010)
successfully recognized characters, but word recognition was not
addressed.
Proposed Methodology
The Matas (2002) algorithm is used to detect potential character locations, and non-maximal suppression is used to select one bounding box per character. The output is classified into text vs non-text based on relative height, aspect ratio, horizontal crossings and holes, and the resulting bounding boxes are used as character candidates. Each character patch is normalized to 48 x 48 and a grid with 2-pixel spacing is used to extract dense SIFT features at each grid point at multiple scales. A bag-of-keypoints representation is used for matching since it ignores spatial information, allowing more distortion between training and testing samples. An SVM with a histogram intersection kernel is used as the classifier. For word recognition, an alignment score measuring how well a character matches a word is used. This score is calculated as
$\mathrm{Score}(c, l) = \begin{cases} P(l \mid c) & \text{if } l \neq \varepsilon \\ 1 - \mathrm{Confidence}(c) & \text{if } l = \varepsilon \end{cases}$
Thus, the alignment score of an entire word is the sum of the alignment scores of its characters: $\mathrm{AlignScore}(a_w) = \sum_{i=1}^{n} \mathrm{Score}(c_i, w(a_w(i)))$. Here $c_i$ is the i-th character and $w(a_w(i))$ is the label it is assigned (selected by taking the prediction with maximum confidence). StreetViewText-Perspective contains the same images as StreetViewText, but side views are included, selected so that they remain readable to humans.
Results
The accuracy of various state-of-the-art methods is compared on SVT as well as SVT-Perspective.
Conclusion
Most datasets assume horizontal text, which may not hold in the real world; thus SVT-Perspective is a crucial addition.
Next Must-Read Paper:” Top-Down and Bottom-up Cues for Scene
Text Recognition “
Deep Learning for Classical Japanese Literature
Summary
The work introduces a dataset for cursive Japanese (Kuzushiji), along with the larger and more challenging datasets Kuzushiji-49 and Kuzushiji-Kanji.
Proposed Methodology
The work pre-processes characters scanned from 35 books written in the 18th century and organizes them into 3 parts: Kuzushiji-MNIST as a drop-in replacement for the MNIST dataset; Kuzushiji-49, an imbalanced dataset of 48 Hiragana characters and one Hiragana iteration mark; and Kuzushiji-Kanji, an imbalanced dataset of 3832 Kanji characters, including some rare characters with few examples. Kuzushiji-49 has 266,407 images at 28 x 28 resolution (the same as Kuzushiji-MNIST), while Kuzushiji-Kanji has 140,426 images at 64 x 64 resolution.
Results
A baseline using nearest neighbor, small CNN and resnet architectures for
Kuzushiji-MNIST, Kuzushiji-49 is reported.
Next Must-Read Paper: “Unsupervised image-to-image translation
networks”
Towards End-to-End License Plate Detection and Recognition: A Large Dataset and Baseline
Summary
The work proposes a dataset of license plate from 250K cars. A novel
lightweight algorithm is introduced to achieve state of the art
performance in real time.
Related works
The Caltech and Zemris datasets collected fewer than 700 high-resolution images. SSIG(2016) and UFPR(2018) collected images using cameras on the road. Hsieh(2002) used a morphological method to reduce the number of candidates, thus speeding up detection. Yu(2015) used a wavelet transform with empirical mode decomposition analysis to locate license plates. Ren(2015) used Faster R-CNN for effective detection, Liu(2016) used SSD, and Redmon(2016) used YOLO, approaching the task as a regression problem. Ho(2009) used license plate features directly, without segmentation, for recognition, while Duan(2005) used an OCR system. Spanhel(2017) used a CNN model for recognition, Ciresan(2012) used a CNN after segmentation, and Abdel(2006) used SIFT features near the license plate for recognition.
Proposed methodology
The license plate images were collected from a city parking management company. There are more than 250K unique images of size 720 × 1160 × 3. Each LP number contains one Chinese character, a letter, and five letters or numbers. A Roadside Parking Net is introduced for detection and recognition. A CNN of 10 convolutional layers extracts feature maps at different levels (second, fourth and sixth), which are fed to a detection module of three fully connected layers in parallel. The recognition module uses region-of-interest (ROI) pooling layers to extract the feature maps crucial for classification: the feature maps are fed to ROI pooling to obtain fixed-size feature maps, which are then concatenated and used for recognition.
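A minimal PyTorch sketch of the multi-level ROI pooling and concatenation idea described above follows; all tensor sizes, strides, the pooled grid size and the number of character classes are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch (assumed shapes, not the paper's implementation): feature maps from
# several backbone depths are ROI-pooled around the predicted plate box, concatenated,
# and fed to a classifier that outputs per-character scores.
import torch
from torchvision.ops import roi_pool

# Pretend feature maps from three backbone depths for a 512x512 input (strides 4, 8, 16).
f_low = torch.randn(1, 64, 128, 128)
f_mid = torch.randn(1, 128, 64, 64)
f_high = torch.randn(1, 192, 32, 32)

# One predicted plate box in input-image coordinates: (batch_index, x1, y1, x2, y2).
box = torch.tensor([[0.0, 100.0, 300.0, 420.0, 380.0]])

pooled = []
for fmap, scale in [(f_low, 1 / 4), (f_mid, 1 / 8), (f_high, 1 / 16)]:
    # Pool the plate region from each level to a fixed 8x16 grid.
    pooled.append(roi_pool(fmap, box, output_size=(8, 16), spatial_scale=scale))

features = torch.cat(pooled, dim=1).flatten(1)            # (1, (64+128+192)*8*16)
classifier = torch.nn.Linear(features.shape[1], 7 * 68)   # e.g. 7 characters, 68 classes (assumed)
char_logits = classifier(features).view(1, 7, 68)         # per-character class scores
print(char_logits.shape)
```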
Results
A prediction is considered correct if IoU is more than 0.6 and all
characters are correctly recognized.
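A small sketch of this evaluation criterion, assuming axis-aligned boxes given as (x1, y1, x2, y2) and plate strings for prediction and ground truth:

```python
# Sketch of the correctness criterion stated above: IoU above 0.6 AND an exact string match.
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def prediction_correct(pred_box, pred_text, gt_box, gt_text, iou_thresh=0.6):
    return iou(pred_box, gt_box) > iou_thresh and pred_text == gt_text

print(prediction_correct((10, 10, 110, 40), "皖A12345",
                         (12, 12, 108, 42), "皖A12345"))  # True
```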
Next must-read paper: "SSD: Single Shot MultiBox Detector"
32
Chinese Street View Text: Large-scale Chinese Text Reading with Partially Supervised Learning
Summary
The work introduces a large Chinese street view text dataset of 430K images, of which only 30K are fully annotated and about 400K are weakly annotated. A text reading network trained in a partially supervised learning framework is proposed to exploit both the fully and the weakly annotated data.
Related works
ICDAR 2013 and ICDAR 2015, containing horizontal and multi-oriented texts respectively, were first used for training text reading models. Total-Text and SCUT-CTW1500 are used for curved texts. Most text reading algorithms (Bartz 2017, 2018, Busta 2017) first localize and then recognize text in an end-to-end manner. For the detection branch, a region proposal network (Li 2017, Lyu 2018) or a convolutional layer that directly predicts locations (Zhou 2017) can be used. For recognition, Connectionist Temporal Classification (Busta 2017, Liu 2018) or attention/LSTM layers (Li 2017, He 2018) can be used.
Proposed methodology
The 430K images were obtained from real street signs in China, with 29,966 images fully annotated (all text locations and regions marked) and about 400K images weakly annotated, where text is given only for roughly marked regions. ResNet-50 with a feature pyramid network is used as the backbone, producing feature maps F. Text/non-text classification is performed at each spatial location, and for localization quadrangle offsets are predicted. The detection branch is optimized with L_det = L_loc + λ L_cls, where L_loc and L_cls are the smooth-L1 and dice losses, respectively. A perspective RoI transform aligns the feature maps F to F_p. An encoder-decoder framework is used for recognition, with a CNN plus GRU encoder and an attention-based GRU decoder. Online Proposal Matching (OPM) is used to locate text regions given only weak annotations: the image is fed through the whole network, the decoded character states are passed through an embedding layer, and they are compared with the annotated text using the distance d_i = (1/T) Σ_{t=1}^{T} ||f(h_t, W) − f(e_t, W)|| for a word of length T, where h_t is the prediction, e_t the ground truth and W the embedding weights.
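The following is a minimal numpy sketch of the OPM distance above, under the simplifying (assumed) choice of a single shared linear embedding f(., W); in the paper the decoder states and ground-truth characters may be embedded differently.

```python
# Hedged sketch of the Online Proposal Matching distance d_i described above.
# Assumption: one linear embedding f(x, W) = x @ W is applied to both the decoded
# character states h_t and the ground-truth character vectors e_t.
import numpy as np

rng = np.random.default_rng(0)
T, d_state, d_embed = 6, 256, 128          # word length and (assumed) dimensions
W = rng.normal(size=(d_state, d_embed))    # shared embedding weights

def f(x, W):
    return x @ W                           # the embedding f(., W)

h = rng.normal(size=(T, d_state))          # decoder states for one text proposal
e = rng.normal(size=(T, d_state))          # vectors of the weakly annotated characters

d_i = np.linalg.norm(f(h, W) - f(e, W), axis=1).mean()  # (1/T) * sum_t ||f(h_t,W) - f(e_t,W)||
# Proposals with a small d_i are matched to the weak annotation and used as extra
# training signal in the partially supervised setting.
print(d_i)
```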
Results
The proposed method is compared against state-of-the-art methods on the proposed dataset.
Next must-read paper: "Accurate scene text detection through border semantics awareness and bootstrapping"
33
A robust arbitrary text detection system for natural scene images
Summary
The work explores different properties of pixels to identify text pixels and uses SIFT for feature extraction. A new dataset, CUTE80, is introduced for evaluation on curved text.
Related works
Chen(2004) extracted 79 features for a classifier and used adaptive binarization to classify text and non-text pixels. Yi(2011) used gradient and color information for partitioning. Shivakumara(2013) used a combination of wavelet and median moments for detection. Epshtein(2010) used the stroke width transform on a Canny edge image for text detection.
Proposed methodology
The proposed dataset, CUTE80, has 80 indoor and outdoor images, captured with a camera or taken from the internet. The work defines three novel features, Mutual Direction Symmetry (MDS), Mutual Magnitude Symmetry (MMS) and Gradient Vector Symmetry (GVS), computed on Sobel and Canny images, which are used to separate text from the background. SIFT features are then used to eliminate false text candidates. The algorithm works for any text orientation, since it is based on the ellipse property of text and is implemented using a nearest-neighbour criterion, and because it does not rely on language-specific features it can also handle multilingual text.
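The exact MDS/MMS/GVS definitions are in the paper; the hedged sketch below only shows the gradient magnitude, gradient direction and Canny edge maps they are built on, using OpenCV (the image path is a placeholder).

```python
# Sketch of the ingredients behind the symmetry features described above (not the full method).
import cv2
import numpy as np

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path

gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)       # horizontal gradient (Sobel)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)       # vertical gradient (Sobel)
magnitude = np.hypot(gx, gy)                          # gradient magnitude per pixel
direction = np.arctan2(gy, gx)                        # gradient direction per pixel
edges = cv2.Canny(img, 100, 200)                      # Canny edge map

# Symmetry-style cues compare the magnitude/direction of pixel pairs across a stroke,
# e.g. the two edges of a character stroke having roughly opposite gradient directions.
print(magnitude.shape, direction.shape, edges.shape)
```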
Results
The effectiveness of the proposed method is shown.
Next must-read paper: "Detecting texts of arbitrary orientations in natural images"
34
Scene Text Recognition using Higher Order Language Priors
Summary
The work introduces an algorithm that uses higher-order statistical language priors for recognition, as well as a large recognition dataset.
Related works
The text recognition task can be divided into the sub-tasks of text detection, character recognition and word recognition. These tasks are worked on individually by Campos(2009), Epshtein(2010) and Chen(2004), or jointly by Neumann(2012) and Wang(2011). The works of Mishra(2012), Smith(2011) and Wang(2011) are successful in recognizing text in limited settings.
Proposed methodology
The proposed algorithm uses a conditional random field (CRF) model for recognition, over random variables x = {x_i | i ∈ V}, where each x_i represents a potential character and can take a label from the label set L. The most likely word is found by minimizing the energy function E(x) = Σ_{c ∈ C} φ_c(x_c), where each c is a subset of V and x_c is the set of random variables in c. A sliding window with the aspect ratios of Mishra(2012) is used, with one node per sliding window, and the characters are ordered sequentially from left to right. An auxiliary variable x_c^a is used for each c ∈ C, which takes an h-gram combination over L for the higher order h used in the CRF, so the model can capture a larger context. Each non-auxiliary node takes a label from L with an associated unary cost φ(x_i = l_j) = 1 − p(l_j | x_i), where p(l_j | x_i) is the SVM score of label l_j for node x_i. The pairwise cost for neighbouring characters is φ(x_i, x_j) = λ (1 − p(l_i, l_j)), where λ weights the penalty for the pair occurring together. The higher-order cost for an auxiliary node x_i taking label L_k and the leftmost non-auxiliary node x_j taking label l_l is φ^a(x_i = L_k, x_j = l_l) = 0 if l_l is consistent with L_k, and λ_a otherwise. The IIIT-5K dataset contains scene text and born-digital images, often of low resolution and with a large variety of styles. The images were annotated with bounding boxes and ground-truth text, and are divided into easy and hard categories based on visual appearance.
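As an illustration of the energy being minimized (not the paper's implementation), the sketch below evaluates unary and pairwise costs for a short chain of sliding-window nodes with made-up probabilities and finds the best labelling by exhaustive search; the higher-order auxiliary nodes and the graph inference used in practice are omitted.

```python
# Toy sketch of the CRF energy described above: unary costs 1 - p(l | x_i) from an
# SVM-style classifier and pairwise costs lam * (1 - p(l_i, l_j)) from bigram statistics.
import itertools

LAMBDA = 2.0  # assumed pairwise weight

def energy(labels, unary_probs, bigram_probs, lam=LAMBDA):
    """labels: one label per node; unary_probs[i][l] = p(l | x_i);
    bigram_probs[(a, b)] = prior probability of labels a, b occurring together."""
    e = sum(1.0 - unary_probs[i][l] for i, l in enumerate(labels))   # unary terms
    e += sum(lam * (1.0 - bigram_probs.get(pair, 0.0))               # pairwise terms
             for pair in zip(labels, labels[1:]))
    return e

unary = [{"O": 0.7, "D": 0.2}, {"P": 0.6, "R": 0.3}, {"E": 0.8, "F": 0.1}]  # made-up scores
bigrams = {("O", "P"): 0.4, ("P", "E"): 0.5}                                # made-up priors

# Exhaustive search over labellings stands in for the message-passing inference used in practice.
best = min(itertools.product(*[d.keys() for d in unary]),
           key=lambda lab: energy(lab, unary, bigrams))
print(best, energy(best, unary, bigrams))
```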
Results
The performance of the proposed algorithm on multiple datasets, with pairwise and higher-order costs, is reported.
Next must-read paper: "Top-down and bottom-up cues for scene text"
35
Collection of on-line handwritten Japanese character pattern databases and their analyses
Summary
The work collected two online handwritten Japanese character pattern databases with more than 3 million patterns in total: one from 120 people contributing about 12K patterns each, and the other from 163 people contributing about 10K patterns each.
Related works
The UNIPEN dataset (Guyon 1994) is popular for online character recognition but does not include oriental characters.
Proposed methodology
Online handwritten dataset collection requires specialized tools such as pen PCs or PCs with a tablet interface, which makes collection difficult; thus display-integrated tablets (DITs) are used for data collection. For effective training, a large amount of data from each individual needs to be collected so that the individual's handwriting pattern can be learned. Boxes are displayed so that characters are written within them, and pen-tip coordinates relative to the page are recorded. People are asked to write according to given sequences. It is generally recognized that people write unnaturally neatly when writing characters without any meaningful context but write casually when writing sentences; thus participants are asked to write sentences covering frequently used characters, followed by writing frequently used characters one by one. Some characters cannot be written without being seen, and a Kanji character might otherwise be written in Kana, so each character is displayed before it is written. Participant details are recorded for later use. The sentences used for the text portion are collected from newspapers. A collection tool was designed on a DOS/V machine with a display-integrated tablet running MS Windows. By specifying a text file, the corresponding character and a writing box (1.7 cm × 1.7 cm or 1.43 cm × 1.43 cm) are displayed, and the character is written inside the box. A verification tool is included that identifies erroneous characters, which are verified by humans and reported back to the participant. The resulting Kuchibue dataset has 120 participants each contributing 11,962 character patterns, while Nakayosi has 163 participants with 10,403 patterns per person.
Next must-read paper: "A new warping technique for normalizing likelihood of multiple classifiers and its effectiveness in combined on-line/off-line Japanese character recognition"
36
Company Overview
Company name: 株式会社 微分 (VIVEN, Inc.)
Representative: Shintaro Yoshida (吉田 慎太郎)
Address: JustCo Shinjuku, JR Shinjuku Miraina Tower 18F, 4-1-16 Shinjuku, Shinjuku-ku, Tokyo
Established: October 2020
Capital: JPY 7,000,000 (as of October 2022)
Employees: 20 (all employment types included)
Business: development of "School DX" software for educational institutions; web application development; R&D in image recognition and natural language processing
37