Slide 1

Slide 1 text

1 OCR Survey Tapas Dutta, Deep Learning Engineer

Slide 2

Slide 2 text

2 OCR Survey of Foreign Language Tapas Dutta, Deep Learning Engineer

Slide 3

Slide 3 text

TextScanner: Reading Characters in Order for Robust Scene Text Recognition
Copyright © 2022 VIVEN Inc. All Rights Reserved
 Summary Current OCR technologies miss genuine characters or produce spurious ones. This work therefore generates pixel-wise maps for character class, position and order in parallel, with an RNN for context modelling.
 Related Works Cheng (2017) used character class and localization labels to adjust the attention positions. Bai (2018) used a novel loss function to improve the attention decoder. Lyu (2018) and Liao (2019) used segmentation for OCR; however, this is not effective for languages with closely spaced characters.
 Proposed Methodology A CNN architecture is used for feature extraction. The extracted features are fed to the class and geometry branches. The class branch consists of 2 stacked convolutions followed by softmax normalization. The output feature maps (character segmentation maps) have dimension h*w*c (c = number of characters + background). The localization map is produced by applying a sigmoid activation to the input. For the order segmentation map, a small U-Net architecture is used, with GRU layers in the middle. After upsampling, two convolution layers generate feature maps of size h*w*N (N is the sequence length). The order maps, in which the kth character is indicated by the kth feature map, are generated by multiplying the order segmentation and character localization maps. The classification scores are obtained by multiplying the character segmentation maps with the order maps.
 Result Different datasets were used to validate the effectiveness of the model: IIIT (50, 1K and 0 lexicons), SVT (50 and 0 lexicons), IC13, IC15, SVTP and CT, achieving 99.8, 99.5, 95.7, 99.4, 92.7, 94.9, 83.5, 84.8 and 91.6% accuracy respectively.
 Next must-read paper: “Textscanner: Reading characters in order for robust scene text recognition”
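The map arithmetic in the methodology above can be sketched in a few lines. This is a minimal illustration on flattened pixel lists, not the authors' implementation: the function names, the toy values and the final per-pixel summation are assumptions made for the example.

```python
def order_maps(order_seg, localization):
    """Multiply each of the N order segmentation maps by the localization map."""
    return [[s * q for s, q in zip(seg_k, localization)] for seg_k in order_seg]


def class_scores(char_seg, order_map_k):
    """Weight every pixel's class probabilities by the kth order map and sum over pixels."""
    num_classes = len(char_seg[0])
    return [
        sum(pixel[c] * weight for pixel, weight in zip(char_seg, order_map_k))
        for c in range(num_classes)
    ]


def read_text(char_seg, order_seg, localization, alphabet):
    """Pick the highest-scoring class for each position in reading order."""
    text = []
    for order_map_k in order_maps(order_seg, localization):
        scores = class_scores(char_seg, order_map_k)
        best = max(range(len(scores)), key=scores.__getitem__)
        text.append(alphabet[best])
    return "".join(text)


# Toy input: 3 pixels, classes = (background, 'a', 'b'), sequence length N = 2.
char_seg = [[0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
localization = [1.0, 0.0, 1.0]
order_seg = [[0.9, 0.5, 0.1], [0.1, 0.5, 0.9]]
decoded = read_text(char_seg, order_seg, localization, ["_", "a", "b"])
```

Because each order map is close to zero everywhere except around one character, the sum over pixels acts as a soft selection of that character's class probabilities.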

Slide 4

Slide 4 text

Persian Optical Character Recognition Using Deep Bidirectional Long Short-Term Memory
 Summary OCR for the Persian language needs to address Persian-specific differences such as right-to-left text and the interpretation of semicolons, dots, obliques, etc. Increasing the number of layers, filters or kernel size for the LSTM did not improve results. BiLSTM improved results compared with LSTM. Increasing the dimension of the extracted vector improved results for both BiLSTM and LSTM. Increasing the number of BiLSTM layers improved performance.
 Related Works Khosravi (2006) achieved 99.02% and 98.8% accuracy on the HODA dataset using improved gradient and gradient histogram features respectively. Alizadehashraf (2017) achieved 97% accuracy using a small CNN architecture. Bonyani (2021) compared the performance of standard CNN architectures (DenseNet, ResNet, VGG) for recognizing Persian text. LeCun (1998) used LeNet whose weights were optimized using firefly, ant colony, chimp and particle swarm optimization techniques, with chimp optimization performing the best. Smith (2007) used a CNN followed by an LSTM, achieving an accuracy of 93%.
 Proposed Methodology The proposed algorithm consists of 3 modules for segmentation, feature extraction and recognition. For segmentation, the height of the images is normalized and a sliding-window algorithm is used. For feature extraction, a small CNN architecture of 1 convolution and 1 max-pooling layer is used. The recognition module uses 4 BiLSTM layers, the first two with tanh activation and the middle two with sigmoid activation, and is trained with connectionist temporal classification loss.
 Result 20 pages of English and Persian texts from different, randomly chosen books were typed in MS-Office. The images taken were height-normalized. The metrics used for evaluation are: AllWrds: number of words in the text; InsWrds: wrongly inserted words; DelWrds: wrongly deleted words; SubWrds: wrongly substituted words.
$\mathrm{Correctness} = 100 \times \frac{\#\mathrm{AllWrds} - (\#\mathrm{DelWrds} + \#\mathrm{SubWrds})}{\#\mathrm{AllWrds}}$
$\mathrm{Accuracy} = 100 \times \frac{\#\mathrm{AllWrds} - (\#\mathrm{InsWrds} + \#\mathrm{DelWrds} + \#\mathrm{SubWrds})}{\#\mathrm{AllWrds}}$
The Tesseract model achieved 91.73% accuracy on Persian-only texts and 73.56% accuracy on the test set containing both Persian and English texts, compared with Bina achieving 96% on both test sets. Tesseract and Bina achieved 71% and 91% correctness respectively on the test set containing Persian and English texts.
 Next must-read paper: “An overview of the Tesseract OCR engine”
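The two word-level metrics on this slide translate directly into code. The sketch below only restates the formulas; the function and argument names are chosen for the example, and the counts in the usage lines are made up.

```python
def correctness(all_wrds, del_wrds, sub_wrds):
    """Percentage of reference words that were neither deleted nor substituted."""
    return 100.0 * (all_wrds - (del_wrds + sub_wrds)) / all_wrds


def accuracy(all_wrds, ins_wrds, del_wrds, sub_wrds):
    """Like correctness, but wrongly inserted words are also penalised."""
    return 100.0 * (all_wrds - (ins_wrds + del_wrds + sub_wrds)) / all_wrds


# Hypothetical counts for a 200-word page.
page_correctness = correctness(200, del_wrds=4, sub_wrds=6)
page_accuracy = accuracy(200, ins_wrds=10, del_wrds=4, sub_wrds=6)
```

Accuracy can never exceed correctness, since it subtracts the same errors plus the insertions.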

Slide 5

Slide 5 text

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
 Summary Existing OCR algorithms require a CNN architecture for feature extraction from the image, sequence modelling layers for text generation and a language model to improve performance. This work includes all these steps in an end-to-end trainable model. Extensive experiments on different combinations of encoder and decoder validate the superiority of using BEiT as encoder and RoBERTa as decoder. Ablation studies validate the effectiveness of the different strategies used.
 Related Works Diaz (2021) and Vaswani (2017) incorporated Transformers into CNN architectures, observing significant performance improvement. Bao (2021) used self-supervised pre-trained image Transformers to replace CNN architectures.
 Proposed Methodology The image is divided into P*P patches, which are flattened, and a linear layer is used to change the dimension to a predefined number. The encoder consists of an image Transformer, where DeiT (Touvron 2021) and ViT (Dosovitskiy 2021) are used for encoder initialization. The “[CLS]” token is used to represent the entire image. A text Transformer is used as the decoder; it is initialized with the RoBERTa model, so the output is wordpieces instead of characters. The model is first trained on hundreds of millions of synthetic printed text-line images. These weights are used to initialize the second-stage pretraining on task-specific synthetic and real-world datasets. Augmentations such as Gaussian blur, image erosion, rotation, image dilation, downscaling and underlining are used to improve the model's generalization.
 Result The SROIE dataset is used to evaluate the model's performance in terms of precision, recall and F1-score, where the model achieved 95.76, 95.91 and 95.84 respectively.
 Next must-read paper: “Scene Text Recognition with Permuted Autoregressive Sequence Models”
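The patch step that opens the methodology above can be illustrated on a nested-list image. This is a sketch under simplifying assumptions (single channel, height and width divisible by P); the function name is hypothetical and the linear projection that follows in the model is omitted.

```python
def split_into_patches(image, p):
    """Cut an h*w image into non-overlapping p*p patches, each flattened row by row."""
    height, width = len(image), len(image[0])
    if height % p or width % p:
        raise ValueError("image height and width must be multiples of the patch size")
    patches = []
    for top in range(0, height, p):
        for left in range(0, width, p):
            patches.append(
                [image[r][c] for r in range(top, top + p) for c in range(left, left + p)]
            )
    return patches


# A 4*4 toy image with pixel values 0..15, split into four 2*2 patches.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = split_into_patches(toy_image, 2)
```

Each flattened patch then plays the role of one input token for the image Transformer.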

Slide 6

Slide 6 text

An End-to-End Khmer Optical Character Recognition using Sequence-to-Sequence with Attention
 Summary The work proposes an encoder-decoder architecture with GRU and an attention mechanism, evaluated on Khmer texts in different fonts. Tesseract produced characters such as @, # and / that are not present in the texts. The proposed model outputs repeated characters before reaching the EOS character, due to the encoder-decoder architecture. Khmer characters have similar structures, resulting in errors in both models.
 Related Works Ilya (2014) employed an RNN-based encoder-decoder for English-to-French machine translation. Dzmitry (2014) modified the previous work to include an attention mechanism and improved the performance. Devendra (2015) employed an LSTM-based encoder-decoder mechanism for English text recognition. Farisa (2021) combined standard CNN architectures (ResNet, DenseNet) with LSTM and GRU layers and trained the entire model end-to-end using Connectionist Temporal Classification loss.
 Proposed Methodology The encoder consists of a convolution, batch normalization and ReLU activation followed by 3 residual modules. There are two types of residual blocks, as shown in the slide's figure. The first residual module contains 2 instances of res block 0, while the second and third modules contain res block 0 followed by res block 1. This is followed by 2D average pooling and 2D dropout layers. For an intermediate output of h*w*c, the feature maps are reshaped to w*hc, denoted $O_{dropout2D}$, and encoded as $H, h_T = \mathrm{EncoderGRU}(O_{dropout2D})$. Here $h_T$ is passed through dropout, a linear layer and tanh activation and used as the decoder's initial hidden state, while $H$ is the input to the decoder. The context vector is a weighted average of the encoder hidden states; the weights are obtained by applying softmax to scores computed from the encoder's hidden state and the decoder's previous hidden state, $e_{ij} = V_a^T \tanh(W_a S_{i-1} + U_a h_j)$. The one-hot encoded previous decoder output and the context vector from attention are concatenated and passed through a GRU layer along with the decoder's previous hidden state to calculate the current hidden state. The current hidden state, context vector and one-hot encoded previous output are concatenated and passed through a linear layer for next-character prediction.
 Result Text2image is used to generate 92,213 texts for different Khmer fonts and sizes. The test set contains 3000 images. The proposed model achieved a character error rate (ratio of unrecognized to total number of characters) of 1%, compared with Tesseract's error rate of 3% on this dataset.
$CER = \frac{S + I + D}{N}$
$WER = \frac{\text{count of incorrect samples}}{\text{count of samples}}$
 Next must-read paper: “Khmer OCR fine tune engine for Unicode and legacy fonts using Tesseract 4.0 with Deep Neural Network”
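The attention step in this slide (additive scores, softmax, weighted average of encoder states) can be sketched as follows. The weights here are plain nested lists with toy values; the names `W_a`, `U_a` and `v_a` mirror the slide's symbols, the helper functions are assumptions made for the example, and nothing is trained.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def attention_context(s_prev, encoder_states, W_a, U_a, v_a):
    """Return (attention weights, context vector) for one decoder step."""
    projected_state = matvec(W_a, s_prev)
    scores = []
    for h_j in encoder_states:
        projected_h = matvec(U_a, h_j)
        # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
        scores.append(
            sum(v * math.tanh(a + b) for v, a, b in zip(v_a, projected_state, projected_h))
        )
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context


# With all-zero projections every encoder state gets the same weight.
uniform_weights, mean_context = attention_context(
    [0.0], [[1.0], [3.0]], W_a=[[0.0]], U_a=[[0.0]], v_a=[1.0]
)
```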

Slide 7

Slide 7 text

ASTER: An Attentional Scene Text Recognizer with Flexible Rectification
 Summary Texts in the real world are often curved or stylised, so ASTER is equipped with a rectification module that rectifies the input image (using a thin plate spline) and a recognition module that uses a sequence-to-sequence model with attention to predict characters from the rectified image.
 Related Works Wang (2012) used two separate CNN modules to localize and recognize texts. Jaderberg (2014) used one CNN for both localization and recognition. Su and Lu (2014, 2017) used an RNN for sequence prediction. He (2016) and Shi (2017) used a combination of CNN and RNN for text prediction. Wang (2017) employed a gated recurrent CNN for text recognition. Yang (2017) employed a character detection model optimized with an alignment loss for character localization. The end-to-end text recognition models of Jaderberg (2016) and Weinman (2014) use text proposals followed by a word recognizer. Busta (2017) combined an FCN detector with connectionist temporal classification (CTC) for recognition.
 Proposed Methodology The algorithm contains two parts, rectification and recognition. Rectification is done using a Thin Plate Spline transformation and contains 3 parts: localization network, grid generator and sampler. The localization network predicts K coordinates from the original image using a CNN architecture with an FC layer. Given a pixel location p on the original image I, the grid generator computes the pixel location on the rectified image. Here $\Delta_C$ is calculated as $\Delta_C = \begin{bmatrix} 1^{1\times K} & 0 & 0 \\ C & 0 & 0 \\ \mathbb{C} & 1^{K\times 1} & C^T \end{bmatrix}$ and $T = [c, 0^{2\times 3}]\,\Delta_C^{-1}$. $\mathbb{C}$ is a square matrix such that $\mathbb{C}_{ij} = F(\lVert C_i - C_j \rVert)$ with $F(r) = r^2 \log(r)$. Differentiable image sampling clips sampling locations so that they stay within the image and interpolates neighbouring pixels in a differentiable manner. The recognition module consists of an encoder-decoder architecture with a ConvNet acting as encoder and a BLSTM with attention as decoder. The attention scores are calculated from the encoder output and the previous hidden state as $e_{t,i} = w^T \tanh(W s_{t-1} + V h_i + b)$. After softmax, the glimpse vector g is the weighted sum of the encoder outputs using the attention weights. g is then concatenated with the one-hot encoded previous output and passed through a recurrent unit. This output is fed to a linear layer with softmax to predict the current character.
 Result Multiple datasets are used to evaluate the model's performance: IIIT5K(0), SVT(0), IC03(0), IC13, SVTP and CUTE, obtaining 93.4, 93.6, 94.5, 91.8 and 79.5.
 Next must-read paper: “Focusing attention: Towards accurate text recognition in natural images.”
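The square matrix ℂ used by the thin plate spline step can be built directly from the radial basis function F(r) = r² log(r) given on the slide. The sketch below covers only that kernel matrix, not the full transformation; the function names and the three control points are assumptions, and F(0) is taken as 0 (its limit) to avoid evaluating log(0).

```python
import math


def tps_kernel(r):
    """F(r) = r^2 * log(r), with F(0) defined as its limit, 0."""
    return 0.0 if r == 0 else r * r * math.log(r)


def kernel_matrix(control_points):
    """Entry (i, j) is F applied to the Euclidean distance between points i and j."""
    return [
        [tps_kernel(math.dist(p, q)) for q in control_points]
        for p in control_points
    ]


# Three hypothetical control points in image coordinates.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
kernel = kernel_matrix(points)
```

The matrix is symmetric with a zero diagonal, which is a quick sanity check before it is placed into the larger block matrix.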

Slide 8

Slide 8 text

Focal CTC Loss for Chinese Optical Character Recognition on Unbalanced Datasets
 Summary The work integrates connectionist temporal classification (CTC) loss with focal loss to help the model cope with unbalanced language datasets. Empirically, for both synthetic and real images, performance can be improved using α = 0.25 and γ = 0.5.
 Related Works A. Graves (2008) was the first to combine CTC loss with an RNN for text recognition. A. Ul-Hasan (2013) used BiLSTM with CTC loss for Urdu text recognition. M. Busta (2017) combined recognition and detection in an end-to-end model. J. Ba (2014) used reinforcement learning to concentrate on the part of the image useful for prediction. C.Y. Lee (2016) used an RNN with attention for optical character recognition. M. Jaderberg used a spatial transformer for spatial manipulation of data within the module, combined with focal loss for text recognition.
 Proposed Methodology With ResNet as the backbone, the algorithm extracts feature maps from the last convolution layer, which are cut into multiple slices, each containing information about a small area within the image. This is followed by a BiLSTM and a fully connected layer with softmax for the final output $p(\pi|x) = \prod_{t=1}^{T} y^t_{\pi_t}$. Here $y^t_{\pi_t}$ represents the probability of observing element $\pi_t$ (from the set of all possible characters plus the blank character) at slice t, for T total slices. The CTC probability is then calculated as $p(l|y) = \sum_{\pi:\beta(\pi)=l} p(\pi|y)$, i.e., the sum of the probabilities of all paths that map to l. For hyperparameters α and γ, focal loss is calculated as $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$. Here α is used to overcome data imbalance and γ helps the model focus more on hard samples. The CTC loss can therefore be modified as $FCTC(l|y) = -\alpha_t (1 - p(l|y))^{\gamma} \log(p(l|y))$, so that the model focuses more on hard samples.
 Result A synthetic dataset was generated from MNIST by concatenating 5 images drawn from two groups of characters, ‘0-9, a-h’ and ‘i-z’: one set with 1M and 100K images (10:1 imbalance) and another with 1M and 10K images (100:1 imbalance), each with a 10K test set. A Chinese OCR dataset of 3.6M training and 5K testing images was also used. The highest accuracies obtained were 62.8%, 72.4% and 76.4% respectively.
 Next must-read paper: “Deep TextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework”
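The focal re-weighting of the CTC loss described on this slide is a one-line formula once the sequence probability p(l|y) is known. The sketch below takes that probability as an input (computing it requires the CTC forward algorithm, which is not shown); the function name is hypothetical and the defaults are the α = 0.25 and γ = 0.5 quoted in the summary.

```python
import math


def focal_ctc_loss(p_label, alpha=0.25, gamma=0.5):
    """Focal CTC: -alpha * (1 - p)^gamma * log(p), where p is the CTC probability p(l|y)."""
    return -alpha * (1.0 - p_label) ** gamma * math.log(p_label)


def plain_ctc_loss(p_label):
    """Standard CTC loss, -log(p), for comparison."""
    return -math.log(p_label)


# Ratio of focal to plain loss for an easy sample (p = 0.9) and a hard one (p = 0.1).
easy_ratio = focal_ctc_loss(0.9) / plain_ctc_loss(0.9)
hard_ratio = focal_ctc_loss(0.1) / plain_ctc_loss(0.1)
```

The easy sample is scaled down more strongly than the hard one, which is the intended effect of the (1 - p)^γ factor.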

Slide 9

Slide 9 text

PP-OCR: A Practical Ultra Lightweight OCR System
 Summary This work proposes a lightweight model for text recognition. Various strategies to improve the model's performance or reduce its parameter count are also discussed. Ablation studies verify the effectiveness of each strategy.
 Proposed Methodology The algorithm uses 3 modules: text detection, which outputs bounding boxes for text; a direction classifier, which is needed when the text in a bounding box is reversed; and text recognition. For detection, the light backbone MobileNetV3_large_x0.5 is used. Empirically, removing the squeeze-and-excitation blocks from the model resulted in no loss of accuracy while reducing both the number of parameters and the inference time. A Feature Pyramid Network (FPN) is used in the head to detect small texts using high-resolution feature maps. A cosine learning rate schedule is used so that the learning rate (LR) is large at the beginning and small at later stages. A large LR at the beginning may lead to instability, so LR warmup is used. The FPGM pruner is used to dynamically calculate the compression ratio for each layer and remove similar filters, improving inference efficiency. MobileNetV3_small_x0.35 is used as the backbone for the direction classifier, with an input resolution of 48*192. Augmentations such as rotation, Gaussian blur, perspective distortion and motion blur (Base Data Augmentation, BDA), along with RandAugment, are used to improve the model's generalization. Modified PACT quantization is used along with an L2 regularization coefficient of 1e-3. For the recognizer, MobileNetV3_small_x0.35 is used as the backbone, pretrained on synthetic images and with modified strides to preserve horizontal and vertical information, with BDA and TIA (Luo 2020) augmentation. A fully connected layer of dimension 48 is used as the head, along with L2 regularization. Cosine learning rate with LR warmup is used for training. PACT quantization is applied to all layers except the LSTM layers.
 Result Multiple synthetic and public datasets such as LSVT, RCTW-17, MTWI 2018 and CASIA-10K are combined for training and validation of the different modules. Using all the strategies mentioned above, the model achieved an accuracy of 69%.
 Next must-read paper: “Towards end-to-end license plate detection and recognition: A large dataset and baseline”
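The learning-rate strategy on this slide (warmup followed by cosine decay) can be written as a small schedule function. This is a generic sketch, not PP-OCR's configuration: the linear warmup shape, the function name and all numbers in the usage lines are assumptions.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine decay towards zero at total_steps."""
    if step < warmup_steps:
        # Ramp up so the first updates are small and training stays stable.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 100 steps, 10 of them warmup, peak learning rate 0.001.
schedule = [lr_at(step, 100, 0.001, 10) for step in range(101)]
```

The rate peaks right after warmup and falls smoothly afterwards, so large steps happen early and fine adjustments late.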

Slide 10

Slide 10 text

On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention
 Summary Current OCR technologies are unable to recognize rotated, curved, vertically aligned or arbitrarily shaped texts. This work uses an attention mechanism to tackle these challenges.
 Related Works A. Cheng (2018) employed a selection module to select features in 4 directions by projecting an intermediate feature map. Yang (2017) used an attention module requiring extensive character-level supervision. Hui (2019) employed attention, but the model is biased towards horizontal texts due to height pooling and RNN layers. Fenfen (2018) used a 1D Transformer for recognition. Pengyuan (2019) employed self-attention in the decoder for text recognition.
 Proposed Methodology A shallow CNN module is used to suppress background information while reducing computational costs for subsequent layers. The output is passed through self-attention blocks with a novel 2D positional embedding. This can be formulated as $\text{att-out}_{hw} = \sum_{h'w'} \mathrm{softmax}(\mathrm{rel}_{(h'w')\to(hw)})\, v_{h'w'}$. Here $v_{h'w'}$ is calculated by multiplying the feature maps of the shallow CNN with trainable weights, and the attention weights $\mathrm{rel}_{(h'w')\to(hw)}$ are calculated as $\mathrm{rel}_{(h'w')\to(hw)} \propto (e_{hw} + p_{hw})\, W^q W^{kT} (e_{h'w'} + p_{h'w'})^T$, where $e_{h'w'}$ and $p_{h'w'}$ represent the extracted feature maps and the positional embedding, respectively. The positional embedding is calculated as $p_{hw} = \alpha(E)\, p^{sinu}_h + \beta(E)\, p^{sinu}_w$, where $p^{sinu}_h$ and $p^{sinu}_w$ are sinusoidal positional encodings along height and width, and $\alpha(E)$ and $\beta(E)$ are calculated as $\alpha(E) = \mathrm{sigmoid}(\max(0, g(E) W^h_1) W^h_2)$ and $\beta(E) = \mathrm{sigmoid}(\max(0, g(E) W^w_1) W^w_2)$, with E representing the CNN-extracted features. To capture short-term dependency, the 1x1 convolutions are replaced with 3x3 convolutions; this is referred to as the locality-aware feedforward layer.
 Result Datasets with horizontal texts as well as randomly aligned texts were used to validate the performance of the model: IC13 (94.1%), IC03 (96.7%), SVT (91.3%), IIIT5K (92.8%), IC15 (79%), SVTP (86.5%) and CT80 (87.8%).
 Next must-read paper: “Aster: An attentional scene text recognizer with flexible rectification”
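The 2D positional embedding on this slide mixes a height encoding and a width encoding with two gates. A minimal sketch is given below under a strong simplification: α and β are passed in as fixed scalars, whereas the slide computes them from the CNN features E. The function names and the standard sine/cosine layout of the encoding are also assumptions.

```python
import math


def sinusoid(position, dim):
    """1D sinusoidal encoding: sine on even channels, cosine on odd channels."""
    encoding = []
    for i in range(dim):
        pair_index = i - (i % 2)  # channels 2k and 2k+1 share one frequency
        angle = position / (10000 ** (pair_index / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def positional_encoding_2d(height, width, dim, alpha, beta):
    """p[h][w] = alpha * p_sinu(h) + beta * p_sinu(w), one vector per grid cell."""
    return [
        [
            [alpha * a + beta * b for a, b in zip(sinusoid(h, dim), sinusoid(w, dim))]
            for w in range(width)
        ]
        for h in range(height)
    ]


# With beta = 0 the encoding only depends on the row; with alpha = 0 only on the column.
rows_only = positional_encoding_2d(2, 3, 4, alpha=1.0, beta=0.0)
cols_only = positional_encoding_2d(2, 3, 4, alpha=0.0, beta=1.0)
```

Letting the gates depend on the image allows the model to stress the vertical or horizontal direction depending on how the text is laid out.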

Slide 11

Slide 11 text

Towards Accurate Scene Text Recognition with Semantic Reasoning Networks
 Summary This work attempts to overcome the shortcomings of RNNs, such as their time dependency and, most importantly, their one-way transmission of context, which greatly limits the model's ability to learn semantic information.
 Related Works Baoguang (2016) combined CNN and RNN with the connectionist temporal classification (CTC) loss function for recognition. Minghui (2019) formulated the problem as a pixel-level classification task. Chen (2016) extracted the visual features in 1D and used the semantic information of the last time step for recognition. Mingkun (2019) used a rectification network based on local features to improve performance. Zhanzhan (2018) extracted features along 4 directions and used a filter gate to calculate the contribution of each. Zbigniew (2017) encoded spatial coordinates on 2D feature maps to increase the sequential information extracted.
 Proposed Methodology ResNet50 is used as the backbone, with a feature pyramid extracting features from the 3rd, 4th and 5th residual blocks. The extracted features are passed to a Transformer along with positional embedding. A novel parallel visual attention module is used that computes weights as $e_{t,ij} = W_e^T \tanh(W_o f_o(O_t) + W_v v_{ij})$, with v the Transformer-extracted features, O the character reading order (1…N-1) and f the embedding function. After softmax, the weighted sum with v is used to compute the attention outputs. When using an RNN to calculate vectors at time step t, information from time step t-1 is needed, which creates a bottleneck in the model's semantic reasoning. This work therefore uses approximated information at step t-1 to calculate vectors at step t. The attention output (g) is used to predict the target character (FC with softmax) and is optimized using cross-entropy (CE). The most likely character is passed through an embedding layer to calculate an approximate embedding. The extracted features are passed through several Transformer units to output the global context (s), which is optimized using CE loss. The features g and s are dynamically weighted as $z_t = \sigma(W_z \cdot [g_t, s_t])$ and $f_t = z_t * g_t + (1 - z_t) * s_t$. The entire model is optimized as $Loss = \alpha_e L_e + \alpha_r L_r + \alpha_f L_f$.
 Result Various datasets such as IC13, IC15, IIIT5K, SVT, SVTP, CUTE, TRW-T and TRW-L are used for evaluation, achieving 95.5, 82.7, 94.8, 91.5, 85.1, 87.8, 85.5 and 84.3, respectively.
 Conclusion The model could be combined with CTC loss to improve performance.
 Next must-read paper: “An end-to-end trainable neural network for spotting text with arbitrary shapes”
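The gated fusion of the visual feature g and the semantic feature s in this slide can be sketched for a single time step. Plain lists stand in for tensors; the gate is assumed to be a sigmoid over a linear map of the concatenation [g, s], and the function name and toy weights are made up for the example.

```python
import math


def fuse(g, s, w_z):
    """z = sigmoid(W_z [g, s]); output f = z * g + (1 - z) * s, element-wise."""
    concatenated = g + s
    gate = [
        1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, concatenated))))
        for row in w_z
    ]
    return [z * g_i + (1.0 - z) * s_i for z, g_i, s_i in zip(gate, g, s)]


# Zero gate weights give z = 0.5, i.e. a plain average of the two features.
averaged = fuse([1.0, 3.0], [3.0, 5.0], [[0.0] * 4, [0.0] * 4])

# Large positive gate weights on g push the output towards the visual feature.
visual_biased = fuse([1.0, 3.0], [3.0, 5.0], [[50.0, 0.0, 0.0, 0.0], [50.0, 0.0, 0.0, 0.0]])
```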

Slide 12

Slide 12 text

Multi-Lingual Optical Character Recognition System Using the Reinforcement Learning of Character Segmenter
 Summary The performance of OCR systems decreases when working on texts containing multiple languages. To tackle this problem, this work uses a segmenter (trained with reinforcement learning) together with a switcher and a recognizer trained in a supervised manner.
 Related Works Zheng (2016) formulated character segmentation as binary segmentation. Chernyshova (2020) proposed a word image segmentation model using dynamic programming to select the most probable boundaries in images. B. Shi (2017) used convolution layers for feature extraction, LSTM layers to predict the character class and CTC loss to ignore the repeated characters produced by multiple slices. D. Kumar (2015) used an encoder-decoder architecture with attention to scan the image along the horizontal direction, followed by decoding the feature vector using attention.
 Proposed Methodology The segmenter partitions a word image into n sub-images; the switcher then assigns a recognizer to each sub-image, and the recognizer assigns a label. The architecture of the word recognizers (R) is shown in the slide's figure. First, L and R are trained with the cross-entropy loss function. The trained modules are then used in conjunction with S in reinforcement learning. S outputs the partition map as a probabilistic vector, which is then processed to output a binary map. The processing involves non-max suppression, clipping probabilities at 0.99 and thresholding. The action is thus based on the model's output, and the reward is based on the distance from the true value as $r(X, a) = 1 - \frac{d(Y_{X,a}, Y)}{\max(|a| + 1, N_y)}$. For input X and action a, $Y_{X,a}$ is the processed partition map and Y is the ground truth. The denominator is used for length normalization, with $N_y$ being the total number of characters in Y.
 Result Texts in various languages are used for evaluation: Chinese, English and Korean, as well as mixed texts (Chinese with English, Chinese with Korean, English with Korean, and Chinese with English and Korean), achieving 94.74, 77.01, 97.07, 87.23, 97.1, 87.46 and 90.87, respectively.
 Next must-read paper: “Tesseract Blends Old and New OCR Technology”
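The length-normalised reward used to train the segmenter can be sketched as a small function. The slide does not specify the distance d, so it is passed in as a callable; the position-wise mismatch count used in the usage lines, the reading of |a| as a number of cuts, and all names are assumptions for illustration only.

```python
def reward(predicted, truth, num_cuts, distance):
    """r = 1 - d(predicted, truth) / max(num_cuts + 1, len(truth))."""
    return 1.0 - distance(predicted, truth) / max(num_cuts + 1, len(truth))


def mismatch_count(a, b):
    """Stand-in distance: differing positions plus the difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# A perfect segmentation earns the full reward; one wrong character lowers it.
perfect = reward("abc", "abc", num_cuts=2, distance=mismatch_count)
one_error = reward("abd", "abc", num_cuts=3, distance=mismatch_count)
```

Dividing by the larger of the two lengths keeps the reward comparable between short and long words and discourages over-segmentation.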

Slide 13

Slide 13 text

13 OCR Survey of Japanese Language Tapas Dutta, Deep Learning Engineer

Slide 14

Slide 14 text

An attention-based row-column encoder-decoder model for text recognition in Japanese historical documents
 Summary The work proposes a novel algorithm to recognize text without the need for segmentation. This is achieved by incorporating BiLSTMs along rows and columns in the encoder and a residual LSTM in the decoder. A language model could be incorporated into the algorithm to further improve performance.
 Related Works A. Graves (2009) used BiLSTM with Connectionist Temporal Classification (CTC) for English text recognition. A. Graves (2008) employed Multi-Dimensional LSTM with CTC loss for Arabic text recognition. H. Yang (2018) used CNN with CTC for Chinese text recognition. D. Valy (2018) used CNN with 1D or 2D LSTM for Khmer text recognition. C. Wang (2018) used attention for text recognition. Y. Deng (2017) used attention for mathematical expression recognition. T. Bluche (2017) incorporated Multi-Dimensional LSTM with attention for handwritten paragraph recognition.
 Proposed Methodology The algorithm has 3 modules: feature extraction, row-column encoder and decoder. A standard CNN module is used for feature extraction. Two modified BiLSTMs are used, one along rows and another along columns, to extract sequential information along the horizontal and vertical directions, respectively. The attention scores are calculated as $\mathrm{score}(h_t^{Attn}, e_i) = \tanh(W_h h_t^{Attn} + W_e e_i)$. Here e denotes the feature maps extracted by the CNN module and h is the current LSTM hidden state, calculated as $h_t^{Attn} = \mathrm{LSTM}^{Attn}(h_{t-1}^{Attn}, \mathrm{Embed}(y_{t-1}), a_{t-1})$. Here $y_{t-1}$ is the previously predicted character, converted to a feature vector by the Embed layer, $\mathrm{LSTM}^{Attn}$ consists of 2 LSTM layers with 512 nodes each, and a is the attention vector calculated as $a_t = \tanh(W_c [c_t; h_t^{Attn}])$. Here c is the context vector calculated as $c_t = \sum_{i=1}^{n} p_i^t e_i$, i.e. the weighted average of the extracted features using the attention weights. The attention outputs are then calculated as $h_t^{Res}, O_t = \mathrm{LSTM}^{Res}(h_{t-1}^{Res}, a_t)$, where $\mathrm{LSTM}^{Res}$ is 1 LSTM layer of 512 nodes with a skip connection from the input attention vector. The decoder multiplies the attention outputs with trainable weights followed by softmax, $y_t = \mathrm{softmax}(W_a O_t)$.
 Result The PRMU dataset is used to validate the performance of the model. It has 3 tasks: (1) single-character recognition, (2) recognition of 3 characters written vertically, and (3) recognition of 3 or more characters written in multiple lines. The model achieved a character error rate and sequence error rate of 4.15 and 11.43 on level 2, and 12.69 and 58.58 on level 3.
 Next must-read paper: “Recognition of anomalously deformed kana sequences in Japanese historical documents”

Slide 15

Slide 15 text

Deep Convolutional Recurrent Network for Segmentation-free Offline Handwritten Japanese Text Recognition
 Summary The work proposes a segmentation-free algorithm for recognition. It has three components: a CNN feature extractor using a sliding window, a BLSTM for prediction, and optimization using connectionist temporal classification (CTC) loss.
 Related Works Graves (2009) combined BLSTM and CTC for text recognition. Messina (2015) combined Multi-Dimensional LSTM and CTC for end-to-end trainable Chinese text recognition. Suryani (2016) combined a pretrained CNN with LSTM and a hidden Markov model for alignment.
 Proposed Methodology The CNN architecture used for feature extraction is shown in the slide's figure. The model is pretrained on Japanese handwritten character datasets (Nakayosi, Kuchibue). After training, either the softmax layer (DCRN-s) or both the fully connected layers and the softmax layer (DCRN-f&s) are removed, and the remainder is used for feature extraction. LSTM layers are used instead of a plain RNN to address the vanishing gradient problem. An LSTM, however, extracts information in one direction only, so 2 LSTMs are used to extract information in both directions. The model is optimized using the CTC loss function.
 Result The Kondate dataset is used to fine-tune and validate the model. Label Error Rate (LER) and Sequence Error Rate (SER) are used to measure the model's performance. They are calculated as $LER(h, S') = \frac{1}{Z} \sum_{(x,z) \in S'} ED(h(x), z)$ and $SER(h, S') = \frac{100}{|S'|} \sum_{(x,z) \in S'} \begin{cases} 0 & \text{if } h(x) = z \\ 1 & \text{otherwise} \end{cases}$, where ED is the edit distance and Z is the total number of target labels in S'. The model achieved LER and SER of 6.44 and 28.9 using DCRN-s, and 6.95 and 28.04 using DCRN-f&s, respectively.
 Next must-read paper: “Text-Line Character Segmentation for Offline Recognition of Handwritten Japanese Text”
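The two error rates on this slide can be computed from (prediction, target) pairs with a standard Levenshtein edit distance. This is a sketch of the formulas as written, so LER is returned as a fraction (multiply by 100 for a percentage) while SER already includes the factor 100; the function names and the sample strings are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        current = [i]
        for j, ch_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                  # delete ch_a
                current[j - 1] + 1,               # insert ch_b
                previous[j - 1] + (ch_a != ch_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


def label_error_rate(pairs):
    """Total edit distance divided by Z, the total number of target labels."""
    total_labels = sum(len(target) for _, target in pairs)
    return sum(edit_distance(pred, target) for pred, target in pairs) / total_labels


def sequence_error_rate(pairs):
    """Percentage of sequences that are not recognised exactly."""
    return 100.0 * sum(pred != target for pred, target in pairs) / len(pairs)


# Two hypothetical text lines: one read correctly, one with a single wrong label.
samples = [("abc", "abc"), ("abd", "abc")]
ler = label_error_rate(samples)
ser = sequence_error_rate(samples)
```

One wrong label out of six gives a small LER but a SER of 50, which is why SER values on the slides are so much higher than the label-level errors.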

Slide 16

Slide 16 text

Attention Augmented Convolutional Recurrent Network for Handwritten Japanese Text Recognition
 Summary Japanese OCR is difficult due to the vast character set, multiple writing styles and multiple touch-points between characters. The work proposes the attention augmented Convolutional Recurrent Network (AACRN), consisting of three modules: a convolutional feature extractor, a self-attention-based encoder and a CTC decoder.
 Related Works Feng (2012) used segmentation to separate each character, followed by recognition of each character. Segmentation-free methods use Connectionist Temporal Classification (CTC) and attention mechanisms. Graves (2009) was the first to use BLSTM with CTC for recognition. Shi (2017) and Ly (2017) used CNN with BLSTM and CTC for recognition. Deng (2017) used an attention-based model to convert mathematical expressions to LaTeX. Chowdhury (2020) used an attention model with a beam search decoder for English and French text recognition. Vaswani (2017) used self-attention along with positional information for recognition.
 Proposed Methodology A CNN architecture without fully connected layers is used for feature extraction. The extracted feature maps are unfolded from left to right. These features are fed to the self-attention module, where they are projected to query, key and value, followed by scaled dot-product attention. For n heads this process is repeated n times. The outputs are then concatenated and added to the input of the self-attention module.
 Result Multiple datasets are used to evaluate the model's performance: IIIT5K, SVT, IC03, IC13, SVTP and CUTE, obtaining 92.67, 91.16, 93.72, 90.74, 78.76 and 76.39 respectively.
 Next must-read paper: “Focusing attention: Towards accurate text recognition in natural images”
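The scaled dot-product attention at the core of the self-attention module above can be sketched for a single head. The learned query, key and value projections, the multi-head concatenation and the residual addition are left out; the inputs are plain lists and the function names are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query: softmax(q . k / sqrt(d_k)) weighted sum of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0], [4.0]]
# A zero query attends equally to both positions; a query aligned with the first
# key attends almost entirely to the first value.
neutral = scaled_dot_product_attention([[0.0, 0.0]], keys, values)
focused = scaled_dot_product_attention([[100.0, 0.0]], keys, values)
```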

Slide 17

Slide 17 text

17 Recognition of Anomalously Deformed Kana Sequences in Japanese Historical Documents
 Summary The work proposes a segmentation-free algorithm for Kana character recognition across multiple lines. It proposes a Deep Convolutional Recurrent Network (DCRN), which uses a CNN for feature extraction and a BiLSTM with Connectionist Temporal Classification (CTC) loss for recognition.
 Related Works Kitadai (2008) restores the image, performs similar-pattern matching and returns similar characters that have already been decoded. Phan (2016) segmented a document into characters and recognized them using a modified quadratic discriminant function. Nguyen (2017) used a segmentation-based approach for handwritten Japanese text recognition. Graves (2008) combined multidimensional LSTM with CTC loss into an end-to-end trainable model for handwritten Arabic recognition. Shi (2015) used a CNN with LSTM for scene text recognition. Rawls (2017) used a CNN with LSTM for handwritten English and Arabic text recognition. Ly (2017) used a combination of CNN and LSTM optimized with CTC loss for Japanese handwritten text recognition.
 Proposed Methodology The pretrained model has 5 CNN blocks with max pooling, followed by 2 FC layers, with ReLU after every layer and a softmax layer for prediction. This architecture is pretrained to recognize a single character, and features are extracted in 3 ways:
• Features are extracted from the model without the softmax layer for a sliding window (stride 12 or 16) applied to the text.
• A sliding window of 64*32 with a stride of 32 covers non-overlapping regions, and features are extracted from the last convolution layer.
• Features are extracted for the full text image from the last convolution layer.
A frame predictor module of 3 BLSTM layers (each having 2 LSTMs of 128 nodes) is followed by a dense layer that predicts a character for each frame, and CTC produces the final prediction. The end-to-end approach uses a similar model without pretraining, with batch normalization after each convolution layer in the feature extractor, ReLU replaced by leaky ReLU, and dropout after every 2 LSTM layers in the frame predictor. For text spanning multiple lines, one approach is to segment the vertical lines, join them vertically and apply the previous algorithms. Another approach uses the image feature extractor from the previous algorithm and a frame predictor with 2 levels of 2DBLSTM, the first level having 4 LSTMs of 64 nodes each and the second having 4 LSTMs of 128 nodes each. The third approach replaces the CNN module with a 2DBLSTM, so the entire structure has 3 2DBLSTM levels, each having 4 LSTM layers, with 2, 10 and 50 nodes respectively. A limited window size and fully connected layers are used to reduce the number of weights.
 Result For level 2 (single vertical text lines) the end-to-end trained model achieved the best result of 10.9 LER and 27.7 SER. For level 3 (multiple vertical text lines) segmentation with the end-to-end trained model achieved the best result of 12.3 LER and 54.9 SER.
 Conclusion Context pre-processing, language statistics and augmentation could be applied to further improve performance.
 Next must-read paper: “Character and Text Recognition of Khmer Historical Palm Leaf Manuscripts”
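To make the last stage of the frame predictor concrete, the sketch below turns per-frame class scores into a label sequence with greedy (best-path) CTC decoding. The slide only says that CTC produces the final prediction, so the greedy strategy, the blank index of 0 and the function name are assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label

def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Best-path CTC decoding.

    frame_scores: one list of class scores per frame (output of the dense layer).
    Returns the label indices after merging repeats and dropping blanks.
    """
    # 1. Pick the highest-scoring class for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Merge consecutive duplicates (a label repeated over adjacent frames).
    merged = [label for label, _ in groupby(best)]
    # 3. Remove the blank label, which separates genuine repeats.
    return [label for label in merged if label != blank]
```

A blank between two identical labels keeps them as two characters, which is why the merge step runs before the blank removal.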

Slide 18

Slide 18 text

18 A semantic Segmentation-based method for Handwritten Japanese Text Recognition
 Summary The work proposes a segmentation-based Japanese handwritten text recognition algorithm. Semantic segmentation with a U-Net architecture performs pixel-level classification and is followed by a CNN-based OCR; the two are then combined using a language model.
 Related Works Xu (2001) used skeleton and contour analysis to detect cut points. A filter trained on geometric features prunes the implausible candidate cut points. An OCR model was used for recognition, followed by a language model for refinement. Shi (2015) combined CNN and LSTM with CTC loss for text recognition. Ly (2019) used a CNN for feature extraction and an RNN as encoder, followed by attention for the output. Asanobu (2019) first segmented the characters using Faster R-CNN with two classes, followed by recognition using ResNet152 as backbone with FPN and RoI Align. Baek (2019).
 Proposed Methodology ResNet101 is used as the encoder of the U-Net. Dilated convolution is used to extract features without down-sampling multiple times. The encoder features are passed through dilated convolutions with dilation rates of 2, 4, 8 and 16, concatenated and fed to the upsampling layers. The model is tasked with predicting the center of each character along with its convex hull, to reduce touching between characters. The watershed algorithm is applied to the output to separate the convex hulls. Recognition is done with an Inception-ResNet-v2 network. The outputs are represented as unigrams, bigrams and trigrams, from which the final result is selected using the Viterbi algorithm. The Kondate dataset is used for training and the Kuchibue and Nakayoshi datasets are used for testing. Pixel-level accuracy and IoU (%) are calculated for character segmentation, while error rates are calculated for recognition; the results are presented on the slide.
 Result Improved segmentation algorithms such as R-CNN and Mask R-CNN could be used to improve the results, and a Conditional Random Field could be applied as post-processing or integrated directly into the model.
 Next must-read paper: “Progress and results of Kaggle Machine Learning Competition for Kuzushiji Recognition”
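The Viterbi selection step can be illustrated with a simplified search that combines per-segment recognition scores with a bigram language model. The slide mentions unigrams, bigrams and trigrams; this sketch keeps only the bigram term, and the data layout, the fallback log-probability and the function name are assumptions made for illustration.

```python
import math

def viterbi_bigram(candidates, bigram_logp, unseen=-10.0):
    """Pick the best character sequence.

    candidates: one dict {char: recognition log-score} per segmented character.
    bigram_logp: dict {(previous_char, char): log-probability}.
    unseen: assumed log-probability for bigrams missing from the table.
    Returns (best_string, best_log_score).
    """
    # best[c] = (score of the best path ending in c, that path)
    best = {c: (s, [c]) for c, s in candidates[0].items()}
    for segment in candidates[1:]:
        nxt = {}
        for c, rec_score in segment.items():
            # Best predecessor for c under recognition + language-model score.
            prev, (prev_score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + bigram_logp.get((kv[0], c), unseen),
            )
            lm = bigram_logp.get((prev, c), unseen)
            nxt[c] = (prev_score + lm + rec_score, path + [c])
        best = nxt
    score, path = max(best.values(), key=lambda v: v[0])
    return "".join(path), score
```

In the example used for testing, the recognizer cannot decide between "h" and "b" for the second segment, and the language model resolves the tie in favour of "the".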

Slide 19

Slide 19 text

19 A unified method for augmented incremental recognition of online handwritten Japanese and English text
 Summary The proposed algorithm can be used for recognition while writing or for delayed recognition. Segmentation followed by recognition is performed after a fixed interval of strokes. Three techniques are used: the structures built in the previous step are reused, strokes that were not recognized in the previous step are attempted again, and incomplete characters are skipped.
 Related Works Zhu (2010) used geometric features extracted from the previous and next strokes for Japanese text recognition. Nakagawa (2006) used geometric and linguistic information to improve performance. Graves (2009) used bidirectional recurrent networks for feature extraction. Tanaka (2002) reported that online recognition decreases performance by 0.3.
 Proposed Methodology Offline segmentation is done in two steps: segmentation into lines and segmentation into characters. A classifier assigns each stroke to one of three classes: segmentation point (SP), non-segmentation point (NSP) or undecided point (UP). SP separates two characters or words, NSP indicates that the stroke is within a character and UP is used for low-confidence predictions. An SVM is used for Japanese text and a BLSTM for English text to classify each stroke. Strokes separated by SP or UP are considered a character/word or part of a character/word. These are then recognized and assigned a confidence score. The Viterbi algorithm searches for the combination that gives the best score. For online recognition the algorithm starts after some strokes have been added. When segmentation resumes, it runs from the segmentation pointer to the current stroke. The pointer is decided based on previous results. After segmentation, if the classification of a stroke has changed, the recognition scope runs from that stroke (the earliest classification-changed off-stroke, EccOs) to the latest stroke. Thus the previous and current recognition scopes overlap, which can be used to reduce processing time. SP is used to split the words into preceding and succeeding characters. After splitting, a character that was present in the previous scope or is out of scope is reused; otherwise recognition is done. Characters/words containing the last segment are treated as partial characters and their recognition is postponed until complete patterns are received. A newly added stroke is treated as a delayed stroke if it is close to a previous segmentation result. The character/word with and without the delayed stroke is considered and the best path is searched.
 Result The TUAT-Kondate dataset is used for Japanese text evaluation. The dataset contains text from 100 people, divided into 4 sets (each with text from 25 people), with 3 sets for training and 1 for testing. IAM-OnDBt2 is used for English text evaluation.
 Next must-read paper: “An approach for real time recognition of online Chinese handwritten sentences”

Slide 20

Slide 20 text

20 Training an End-to-End Model for Offline handwritten Japanese Text Recognition by Generated Synthetic Patterns
 Summary The proposed algorithm uses a Deep Convolutional Neural Network for feature extraction from the text line image and a deep BLSTM as the recurrent network, trained with Connectionist Temporal Classification (CTC) loss. Elastic distortions are used to synthesize images.
 Related works Graves (2006) first introduced CTC loss for handwritten text recognition. Graves (2009) combined BLSTM with CTC loss to improve performance. Messina (2015) used Multi-Dimensional LSTM with CTC loss for Chinese text recognition. Puigcerver (2017) showed that multi-dimensional LSTM is not necessary for text recognition.
 Proposed Methodology The images are first scaled to a fixed size and Otsu’s method is used to obtain binary images. The CNN architecture shown on the slide is used for feature extraction. A bidirectional LSTM passes information in both directions over the features extracted by the CNN. The entire architecture is trained using CTC loss. To create synthetic data, a random sentence and a writer are chosen. A new text line is formed from the characters of the sentence, using the handwritten image of each character written by that writer. Each handwritten character undergoes local distortion (shearing, rotation, scaling, translation), and the entire sentence undergoes global distortion (rotation and scaling).
 Result The Kondate dataset is used to evaluate the model’s performance and the effectiveness of the augmentation. Training the model end-to-end with augmentation achieved the best performance of 1.87 Label Error Rate and 13.81 Sequence Error Rate.
 Next must-read paper: “Are Multidimensional Recurrent Layers Really Necessary for Handwritten Text Recognition”
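The two metrics reported here can be computed as in the sketch below. It assumes the commonly used definitions, which the slide does not spell out: Label Error Rate is the total edit distance between predictions and references divided by the total number of reference labels, and Sequence Error Rate is the share of sequences that contain at least one error. Both are returned as percentages; function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # deletion
                cur[j - 1] + 1,                    # insertion
                prev[j - 1] + (item_a != item_b),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def error_rates(predictions, references):
    """Return (label_error_rate, sequence_error_rate) in percent."""
    edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    labels = sum(len(r) for r in references)
    wrong = sum(p != r for p, r in zip(predictions, references))
    return 100.0 * edits / labels, 100.0 * wrong / len(references)
```

A single wrong character therefore counts once towards LER but marks the whole line as wrong for SER, which is why SER is much higher than LER in the results above.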

Slide 21

Slide 21 text

21 Attempts to recognize anomalously deformed Kana in Japanese historical documents
 Summary The work proposes three different algorithms: one for recognizing single characters, one for single-line character sequences using a CNN with BLSTM, and one for multiple-line character sequences using a segmentation-based method.
 Proposed Methodology For recognizing single characters, Otsu's method is used to reduce background noise while enhancing the foreground. The result is padded so that the character is in the centre. This is followed by linear spatial normalization, resizing and standardization. Rotation and shearing are used for augmentation. Experiments are conducted with multiple CNN backbones as well as LSTM architectures for character classification. For recognizing a character sequence, a combined architecture of CNN and BLSTM with CTC loss is used. The CNN is first trained on single-character recognition; features can then be extracted using a sliding window with overlapping (DCRN-o) or without overlapping (DCRN-wo), or taken from the last convolution layer (DCRN-ws). For the recurrent layer, a 1DBLSTM is trained with CTC loss. For recognizing text in multiple vertical lines, several approaches are presented, such as segmenting the vertical lines and concatenating them, followed by the previous algorithm. Another method uses a 2DBLSTM with a CNN pretrained on single-character recognition, or object detection followed by a CNN pretrained on single-character recognition without segmentation.
 Result The IEICE PRMU contest on recognizing anomalously deformed Kana in Japanese historical documents, which has 3 tasks (recognizing a single character, one line of text and multiple lines of text), is used to validate the approaches presented.
 Next must-read paper: “Deep Convolutional Recurrent Network for Segmentation-free Offline Handwritten Japanese Text Recognition”

Slide 22

Slide 22 text

22 A Multiplexed Network for End-to-End, Multilingual OCR
 Summary This work proposes an end-to-end trainable pipeline that includes text detection and recognition. The algorithm uses multiple heads for recognizing different languages.
 Related works The task of multilingual text recognition can be divided into 3 sub-tasks: text detection, script identification and text recognition. Before deep learning, hand-crafted features were used: Anil (1998), Lukas (2012), Cheng (2013). The success of deep learning methods led to their use in conjunction with hand-crafted features. Recent approaches, however, rely extensively on deep learning. Most recognition algorithms use either Connectionist Temporal Classification (CTC) loss, Graves (2006), to convert features to a language sequence, or a Seq2Seq encoder-decoder framework with attention, Bahdanau (2014), for classification. Script identification is necessary for determining which language recognizer to use. Shi (2015) used a CNN with multi-stage pooling for classification. Fujii (2015) recast the task as a sequence-to-label problem. Some recognition algorithms, such as Michal (2018), Youngmin (2020) and Pengyuan (2018), do not incorporate these components and have a single recognition head for characters from different languages.
 Proposed Methodology For text detection the algorithm uses a ResNet50 backbone with a U-Net structure. The Vatti clipping algorithm is used to shrink text regions, and RoI masking suppresses background and neighboring text instances. For recognition, a character segmentation module and a spatial attention module are used. The output of the detection and segmentation modules is the input to the language prediction module, which uses a small CNN (2 convolution layers, 1 FC layer) for classification. The language prediction and recognition heads are trained using $L_{hard-integrated} = \alpha_{seq(r)} L_{seq(r)}$ with $r = \arg\max_{1 \le l \le N_{rec}} p(l)$, when $N_{rec}$ different languages are supported. Thus, for each word during training, only the recognition head with the highest confidence is selected. The loss of head $r$ is $L_{seq(r)} = -\frac{1}{T} \sum_{t=1}^{T} \left[ I(c_t \in C_r) \cdot \log p(y_t = c_t) + I(c_t \notin C_r) \cdot \beta \right]$. Here $C_r$ is the character set supported by head $r$ and $c_t$ is the ground truth. $\beta$ is a hyperparameter that acts as a penalty for unsupported characters.
 Results The ICDAR 2019 MLT dataset (MLT19) is used to evaluate the model’s recognition performance.
 Next must-read paper: “Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting”
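The head-selection loss on this slide can be sketched as follows. The sketch follows the formula as written: the head with the highest language-prediction probability is chosen, and its sequence loss averages the log-probability of supported ground-truth characters and the constant beta for unsupported ones. With this sign convention beta has to be negative to act as a penalty; the default value used here, the data layout and the function names are assumptions for illustration only.

```python
import math

def seq_loss(char_probs, target, charset, beta):
    """L_seq(r) for one head.

    char_probs: per time step, a dict {char: p(y_t = char)} from the head.
    target: ground-truth characters c_1..c_T.
    charset: characters supported by this head (C_r).
    beta: constant added for unsupported characters (negative = penalty).
    """
    total = 0.0
    for probs, c in zip(char_probs, target):
        total += math.log(probs[c]) if c in charset else beta
    return -total / len(target)

def hard_integrated_loss(lang_probs, head_outputs, target, charsets, alphas, beta=-12.0):
    """Select head r = argmax p(l) and return (r, alpha_r * L_seq(r))."""
    r = max(range(len(lang_probs)), key=lang_probs.__getitem__)
    return r, alphas[r] * seq_loss(head_outputs[r], target, charsets[r], beta)
```

Because only the selected head contributes to the loss, a word routed to a head that cannot represent some of its characters is penalized through the beta term rather than through a log-probability.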

Slide 23

Slide 23 text

23 E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text
 Summary This work proposes an end-to-end method for multi-language scene text localization and recognition.
 Related works For scene text recognition, localization is an important step to obtain word-level bounding boxes or segmentation maps. Jaderberg (2016) used the output of edge boxes and channel features to obtain bounding boxes, with a random forest to filter the predictions and a CNN regressor for post-processing. Gupta (2016) used a CNN to detect objects at multiple scales. Tian (2016) used a CNN-RNN architecture to predict the presence of characters. For recognition, Jaderberg (2016) used VGG16 for classification over 90K words. Shi (2017) generates one word per image using a CNN with BLSTM and Connectionist Temporal Classification loss. Lee (2016) used a CNN with RNN and soft attention for recognition. Li (2017) used a convolutional recurrent network for text localization as well as text recognition. Liu (2018) used a shared CNN architecture for text localization and recognition.
 Proposed Methodology The algorithm uses ResNet34 with an FPN object detector. For an input, the localization module produces 7 outputs: the presence or absence of text, the 4 coordinates of the bounding box and the orientation angle $\phi$ (which, with the angle loss defined over $\sin(\phi)$ and $\cos(\phi)$, accounts for the remaining two outputs). The entire algorithm is trained using $L_{final} = L_{geo} + \lambda_1 L_{angle} + \lambda_2 L_{dice} + \lambda_3 L_{CTC}$. Here $L_{angle}$ is the mean squared error over $\sin(\phi)$ and $\cos(\phi)$, $L_{geo}$ is the IoU loss, $L_{CTC}$ is the word-level recognition loss and $L_{dice}$ is the dice loss calculated for predictions with more than 90% confidence.
 Results The ICDAR 2017 MLT dataset (MLT17) is used to evaluate the model’s performance.
 Next must-read paper: “Fots: Fast oriented text spotting with a unified network”

Slide 24

Slide 24 text

24 OCR Survey of Dataset Tapas Dutta, Deep Learning Engineer

Slide 25

Slide 25 text

25 Generating Synthetic Data for Text Recognition
 Summary The work generates a corpus of 9M synthetic handwritten word images using open-source fonts and data augmentation schemes.
 Related Works Jaderberg (2014), Rozantsev (2015) and Ros (2016) used synthetic mechanisms for data generation and annotation. Sankar (2010) used rendered images for annotating large-scale datasets.
 Proposed Methodology Synthetic words can be generated by rendering words in available fonts, or by learning different parameters for style and content and modifying the style parameter for a deep learning model. This work uses the first technique, with 750 publicly available fonts and a vocabulary of 90K unique words chosen from a dictionary. After a word and a style are selected at random, the inter-character space and stroke width are varied, and a random augmentation within (-5 to +5) and a shear (+/- 0.5) in the horizontal direction are applied.
 Conclusion Elastic distortion could be used in future work to simulate cursive writing.
 Next must-read paper: “The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes”

Slide 26

Slide 26 text

26 A Database of On-line Handwritten Mixed Objects named “Kondate”
 Summary The work presents a database of online handwritten patterns containing text, figures, tables, maps, diagrams, etc. The database has 100 Japanese, 25 English and 45 Thai writers.
 Proposed Methodology Two writing strategies were used: copy writing, where the participant receives both pattern and content, and free writing, where neither is provided so that realistic patterns can be obtained. Attributes of the participant, the environment and each stroke are collected. The X/Y coordinates of each stroke from pen down to pen up are recorded along with the stroke-id, the timeOffset (from pen down of the first stroke to pen down of the current stroke) and the duration.
 Conclusion The work presents a dataset with online text as well as figures, tables, maps and diagrams.
 Next must-read paper: “Arabic Handwriting Data Base for Text Recognition”
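The per-stroke record described above can be pictured as a small data structure. This is only a sketch of the listed fields (coordinates, stroke-id, timeOffset, duration); the Python attribute names, the time unit and the raw input layout are assumptions and do not reflect the actual Kondate file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Stroke:
    stroke_id: int                      # running index of the stroke
    time_offset: float                  # pen-down of first stroke -> pen-down of this stroke
    duration: float                     # pen-down -> pen-up of this stroke
    points: List[Tuple[float, float]]   # X/Y coordinates sampled during the stroke

def build_strokes(raw):
    """Convert raw (pen_down_time, pen_up_time, points) tuples into Stroke records."""
    first_pen_down = raw[0][0]
    return [
        Stroke(
            stroke_id=i,
            time_offset=pen_down - first_pen_down,
            duration=pen_up - pen_down,
            points=list(points),
        )
        for i, (pen_down, pen_up, points) in enumerate(raw)
    ]
```

Storing the offset relative to the first stroke, rather than absolute timestamps, keeps the timing information while making recordings from different sessions comparable.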

Slide 27

Slide 27 text

27 ICDAR2017 Robust Reading Challenge on Multi-lingual Scene Text Detection and Script Identification – RRC-MLT
 Summary The RRC-MLT challenge comprises tasks such as text detection and script classification, all of which are necessary for multilingual text recognition. The dataset contains 18K images with text in 9 different languages.
 Proposed Methodology The dataset contains natural images with embedded text, such as street signs, advertisement boards, shop names, passing vehicles and images from internet searches. It contains at least 2K images for each of Arabic, Bangla, Chinese, English, French, German, Italian, Japanese and Korean. Text containing special characters from multiple languages is also included.
Task 1 is text detection, for which 9K images were used for training and the remaining 9K for testing. Consider a set of don’t-care regions $D = \{d_1, d_2, \ldots, d_k\}$, ground-truth regions $G = \{g_1, g_2, \ldots, g_m\}$ and predictions $T = \{t_1, t_2, \ldots, t_n\}$. To discard noise, the predictions are matched against the don’t-care regions using the tests $area(d_k) = 0$ and $\frac{area(d_k) \cap area(t_k)}{area(d_k)} > 0.5$. A filtered prediction in $T'$ is considered positive if $\frac{area(g_i) \cap area(t'_j)}{area(g_i) \cup area(t'_j)} > 0.5$. The set of positive matches $M$ is used to calculate $precision = \frac{|M|}{|T'|}$ and $recall = \frac{|M|}{|G|}$, and thus $F\text{-}score = \frac{2 \cdot precision \cdot recall}{precision + recall}$.
Team SCUT-DLVC Lab used one model for rough detection followed by another for fine adjustment of the bounding boxes and post-processing. CAS and Fudan University designed a rotation region proposal network that generates inclined region proposals with angle information, which is used for bounding-box regression; this is followed by a text region classifier. The Sensetime group proposed an FCN-based model with deformable convolution, which predicts whether a pixel is text and, if it lies in a text box, its location. The TH-DL group used an FCN with residual connections to predict whether a pixel belongs to text, followed by binarization at multiple thresholds. The connected components thus extracted, along with their bounding boxes, are used as region proposals. The features of the region proposals after rotation RoI pooling are fed to Fast R-CNN, followed by non-maximum suppression. The Linkage group extracted extremal regions from each channel, calculated the linkage between components and pruned them to obtain non-overlapping components; candidate lines are then extracted and merged. The Alibaba group used a CNN to detect text regions and predict their relations; the regions are then grouped and divided into words by a CNN+RNN.
For script identification (Task 2), 84K and 97K images were used for training and testing respectively, with text in Arabic, Bangla, Chinese, Japanese, Korean, Latin and Symbols. SCUT-DLVC Lab used a CNN with random crops for training and sliding windows for test images. Team CNN-based method used VGG16 initialized with ImageNet weights and cross-entropy loss, with training images resized to a fixed height and variable width. TNet enhanced the features of the dataset, followed by a deep network for training and a majority vote for classification. Team BLCT extracted patches of variable sizes to train a 6-layer CNN. Features from the penultimate layer are randomly combined and used to create a bag of visual words with 1024 codewords; these image representations are aggregated into a histogram, which is classified by a network of 2 dense layers and 1 dropout layer. Teams TH-DL and TH-CNN used a GoogLeNet-like structure, while Synthetic-ECN used an ECN-based structure.
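The Task 1 evaluation rule (filter against don't-care regions, match by IoU, then compute the F-score) can be sketched as below. Several simplifications are assumptions of this sketch and not part of the challenge definition: boxes are axis-aligned `(x1, y1, x2, y2)` tuples, each ground-truth region is matched at most once in a greedy pass, and don't-care regions with zero area are simply skipped.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); empty boxes give 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def intersection(a, b):
    """Area of the overlap between two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))

def iou(a, b):
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def detection_f_score(predictions, ground_truth, dont_care, threshold=0.5):
    """Return (precision, recall, f_score) for one image."""
    # Discard predictions that cover more than half of a don't-care region.
    kept = [
        t for t in predictions
        if not any(area(d) > 0 and intersection(d, t) / area(d) > threshold
                   for d in dont_care)
    ]
    matched, used = 0, set()
    for t in kept:
        for i, g in enumerate(ground_truth):
            if i not in used and iou(g, t) > threshold:
                used.add(i)
                matched += 1
                break
    precision = matched / len(kept) if kept else 0.0
    recall = matched / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)
```

For the joint detection and script task, the same match would additionally require the predicted script label to equal the ground-truth script.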

Slide 28

Slide 28 text

28 ICDAR2017 Robust Reading Challenge on Multi-lingual Scene Text Detection and Script Identification – RRC-MLT
Team 8 used VGG16 with an SVM classifier. Task 3 combines text detection using bounding boxes with script classification. A prediction is counted as correct if $\frac{area(g_i) \cap area(t'_i)}{area(g_i) \cup area(t'_i)} > 0.5$ and $scriptid(t'_i) = scriptid(g_i)$. Team TH-DL used a combination of its methods from the previous tasks. Team SCUT-DLVCLab trained 2 models, one for classification and one for detection. The classification model predicts background for boxes generated by the detector that are background with high confidence.
 Results The results of Task-1, Task-2 and Task-3 are presented.
 Conclusion Future work is expected to tackle larger datasets and more languages.
 Next must-read paper: “Script identification in the wild via distinctive convolutional network”

Slide 29

Slide 29 text

29 Recognizing Text with Perspective Distortion in Natural Scenes
 Summary The work proposes an algorithm for recognizing text of arbitrary orientation. A new dataset, StreetViewText-Perspective, is also introduced.
 Related works Smith (2011) and Weinman (2009) proposed a similarity constraint so that visually similar characters have similar labels. Wang (2011) used an object recognition framework that requires all characters to be correctly recognized. Gandhi (2000) rectified text using motion information. Li (2010) successfully recognized characters, but word recognition was not addressed.
 Proposed Methodology The algorithm of Matas (2002) (MSER) is used to detect potential character locations. Non-maximal suppression is used to select one bounding box per character. The output is classified into text vs non-text based on relative height, aspect ratio, horizontal crossings and holes. The remaining bounding boxes are used as character candidates. Each character patch is normalized to 48 X 48 and a grid with 2-pixel spacing is used to extract dense SIFT features at each grid point at multiple scales. Bag-of-keywords is used for matching since it ignores spatial information, allowing for more distortion between training and testing samples. An SVM with histogram intersection kernel is used as classifier. For word recognition an alignment score is used, which measures how well the characters match a word. For a character $c$ and a label $l$ it is calculated as $Score(c, l) = \begin{cases} P(l \mid c) & \text{if } l \neq \varepsilon \\ 1 - Confidence(c) & \text{if } l = \varepsilon \end{cases}$. The alignment score of an entire word is the sum of the scores of its characters, $AlignScore(a_w) = \sum_{i=1}^{n} Score(c_i, w(a_w(i)))$. Here $c_i$ is the $i$th character and $w(a_w(i))$ is the label it is assigned (selected by taking the prediction with maximum confidence). StreetViewText-Perspective contains images of the same scenes as StreetViewText, but taken from side views chosen so that the text remains readable to humans.
 Results The accuracy of various state-of-the-art methods is compared on SVT as well as SVT-Perspective.
 Conclusion Most datasets assume horizontal text, which may not hold in the real world, so SVT-Perspective is crucial.
 Next Must-Read Paper: “Top-Down and Bottom-up Cues for Scene Text Recognition”
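The alignment score used for word recognition can be illustrated with a short sketch. A detected character aligned to a letter of the word contributes the classifier probability of that letter; a character aligned to the empty label contributes one minus its confidence. How `Confidence(c)` is obtained is not stated on the slide, so it is treated here as a given number per character, and the data layout and names are assumptions.

```python
EPSILON = None  # stands for the empty label: the character matches no letter

def score(char_probs, confidence, label):
    """Score(c, l): P(l | c) for a real label, 1 - Confidence(c) for the empty one."""
    if label is EPSILON:
        return 1.0 - confidence
    return char_probs.get(label, 0.0)

def align_score(characters, word, alignment):
    """Sum of per-character scores for one alignment of detections to a word.

    characters: list of (char_probs, confidence), one entry per detected character.
    alignment:  for each detected character, an index into `word` or None (empty label).
    """
    total = 0.0
    for (probs, confidence), position in zip(characters, alignment):
        label = EPSILON if position is None else word[position]
        total += score(probs, confidence, label)
    return total
```

A low-confidence false detection aligned to the empty label therefore adds a high score, so spurious character candidates do not prevent the correct word from winning.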

Slide 30

Slide 30 text

30 Deep Learning for Classical Japanese Literature
 Summary The work introduces a dataset for cursive Japanese (Kuzushiji) along with the larger and more challenging datasets Kuzushiji-49 and Kuzushiji-Kanji.
 Proposed Methodology The work pre-processed characters scanned from 35 books written in the 18th century and organized them in 3 parts: Kuzushiji-MNIST, a replacement for the MNIST dataset; Kuzushiji-49, an imbalanced dataset of 48 Hiragana characters and one Hiragana iteration mark; and Kuzushiji-Kanji, an imbalanced dataset of 3832 Kanji characters, including some rare characters with few examples. Kuzushiji-49 has 266407 images of 28 X 28 resolution, the same as Kuzushiji-MNIST, while Kuzushiji-Kanji has 140426 images of 64 X 64 resolution.
 Results Baselines using nearest neighbor, a small CNN and ResNet architectures are reported for Kuzushiji-MNIST and Kuzushiji-49.
 Next Must-Read Paper: “Unsupervised image-to-image translation networks”

Slide 31

Slide 31 text

31 Towards End-to-End License Plate Detection and Recognition: A Large Dataset and Baseline
 Summary The work proposes a dataset of license plates from 250K cars. A novel lightweight algorithm is introduced to achieve state-of-the-art performance in real time.
 Related works Caltech and Zemris collected fewer than 700 high-resolution images. SSIG (2016) and UFPR (2018) collected images using cameras on the road. Hsieh (2002) used a morphological method to reduce the number of candidates, thus speeding up detection. Yu (2015) used the wavelet transform with empirical mode decomposition analysis to locate license plates. Ren (2015) used Faster R-CNN for effective detection. Liu (2016) used SSD for detection. Redmon (2016) used YOLO and approached the task as a regression problem. Ho (2009) used license plate features directly, without segmentation, for recognition. Duan (2005) used an OCR system for recognition. Spanhel (2017) used a CNN model for recognition. Ciregan (2012) used a CNN after segmentation for recognition. Abdel (2006) used SIFT features near the license plate for recognition.
 Proposed methodology The license plate images were collected from a city parking management company. There are more than 250K unique images of size 720*1160*3. The LP number contains one Chinese character, a letter and five letters or numbers. Roadside Parking Net is introduced for detection and recognition. Ten convolutional layers extract feature maps at different levels (second, fourth and sixth), which are fed to the detection module of three fully connected layers in parallel. The recognition module uses region-of-interest pooling layers to extract the feature maps that are crucial for classification. The feature maps are fed to RoI pooling to extract fixed-size feature maps, which are then concatenated and used for recognition.
 Results A prediction is considered correct if the IoU is more than 0.6 and all characters are correctly recognized.
 Next must-read paper: “Ssd: Single shot multibox detector”

Slide 32

Slide 32 text

32 Chinese Street View Text: Large-scale Chinese Text Reading with Partially Supervised Learning
 Summary The work introduces a large Chinese street view dataset of 430K images, of which only 30K are fully annotated and 400K are weakly annotated. A text reading network in a partially supervised learning framework is proposed to exploit both the fully and the weakly annotated data.
 Related works ICDAR 2013 and ICDAR 2015, which contain horizontal and multi-oriented text, were the first datasets used for training text reading models. Total-Text and SCUT-CTW1500 are used for curved text. Most text reading algorithms (Bartz 2017, 2018, Busta 2017) first localize and then recognize text in an end-to-end manner. For the detection branch, a region proposal network (Li 2017, Lyu 2018) or a CNN layer that predicts locations (Zhou 2017) could be used. For recognition, Connectionist Temporal Classification (Busta 2017, Liu 2018) or attention/LSTM layers (Li 2017, He 2018) could be used.
 Proposed methodology The dataset has 430K images of real street signs in China: 29,966 images with all locations and text regions marked, and 400K images with text annotated for roughly marked regions. ResNet50 with a feature pyramid network is used as backbone. With $F$ denoting the extracted feature maps, text/non-text classification is done for each spatial location. For localization, quadrangle offsets are predicted. The entire branch is optimized with $L_{det} = L_{loc} + \lambda L_{cls}$, where $L_{loc}$ and $L_{cls}$ are the smooth L1 loss and the dice loss, respectively. A perspective RoI transform is used to align the feature maps $F$ to $F_p$. An encoder-decoder framework is used for recognition, with a CNN and GRU as encoder and an attention-based GRU decoder. Online Proposal Matching (OPM) is used to locate text regions given weakly annotated text. Given an image with weakly annotated text, the image is fed to the entire network; the characters extracted are passed through an embedding layer and compared with the annotated text using the Euclidean distance $d(i) = \frac{1}{T} \sum_{t=1}^{T} \| f(h_t, W) - f(e_t, W) \|$ for a word of length $T$, prediction $h$, ground truth $e$ and embedding weights $W$.
 Results The proposed method is compared against state-of-the-art methods on the given dataset.
 Next must-read paper: “Accurate scene text detection through border semantics awareness and bootstrapping”
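The matching distance used by Online Proposal Matching can be sketched as follows. The sketch makes strong simplifying assumptions that are not taken from the paper: predictions and ground truth are both given as character indices, the embedding function `f(., W)` is a plain row lookup in the matrix `W`, and every proposal yields exactly `T` characters. It needs Python 3.8+ for `math.dist`; all names are hypothetical.

```python
import math

def opm_distance(predicted, truth, W):
    """Mean Euclidean distance between embedded predictions and ground truth.

    predicted, truth: sequences of character indices of equal length T.
    W: embedding matrix as a list of rows (one row per character).
    """
    T = len(truth)
    return sum(math.dist(W[h], W[e]) for h, e in zip(predicted, truth)) / T

def best_proposal(proposals, truth, W):
    """Return the index of the proposal whose decoded text is closest to the annotation."""
    return min(range(len(proposals)),
               key=lambda i: opm_distance(proposals[i], truth, W))
```

The proposal with the smallest distance is taken as the location of the weakly annotated text, which turns an annotation without coordinates into a usable training sample.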

Slide 33

Slide 33 text

33 A robust arbitrary text detection system for natural scene images
 Summary The work explores different properties of pixels to identify text pixels and uses SIFT for feature extraction. A new dataset, CUTE80, is introduced for curved text evaluation.
 Related works Chen (2004) extracted 79 features from a classifier and used adaptive binarization to classify text and non-text pixels. Yi (2011) used gradient and color information for partitioning. Shivakumara (2013) used a combination of wavelet and median moments for detection. Epshtein (2010) used the stroke width transform on the Canny image for text detection.
 Proposed methodology The proposed dataset CUTE80 has 80 indoor and outdoor images, captured with a camera or taken from the internet. The work creates three novel features that are present in the Sobel and Canny images of text: Mutual Direction Symmetry (MDS), Mutual Magnitude Symmetry (MMS) and Gradient Vector Symmetry (GVS). These features are used to separate text from background. SIFT features are used to eliminate false text candidates. The algorithm works for any text orientation since it is based on the ellipse property of text, and it is implemented with a nearest-neighbour algorithm. Since the method does not involve language-specific features, it can work with multilingual text.
 Results The effectiveness of the proposed model is shown.
 Next must-read paper: “Detecting texts of arbitrary orientations in natural images”

Slide 34

Slide 34 text

Scene Text Recognition using Higher Order Language Priors

Summary: The work introduces a recognition algorithm that uses higher-order statistical features, together with a large recognition dataset.

Related works: Text recognition can be divided into three sub-tasks: text detection, character recognition, and word recognition. These tasks are addressed individually by Campos (2009), Epshtein (2010), and Chen (2004), or jointly by Neumann (2012) and Wang (2011). The works of Mishra (2012), Smith (2011), and Wang (2011) recognize text successfully only in limited settings.

Proposed methodology: The proposed algorithm uses a conditional random field (CRF) model for recognition. The random variables are $x = \{x_i \mid i \in V\}$, where each $x_i$ represents a potential character and takes a label from the label set $L$. The most likely labelling is found by minimizing the energy function

$$E(x) = \sum_{c \in C} \varphi_c(x_c),$$

where $C$ is a set of subsets of $V$ and $x_c$ is the set of random variables in $c$. Sliding windows with aspect ratios as in Mishra (2012) are used. An auxiliary variable $x_c^a$ is used for each $c \in C$; it takes an $h$-gram combination of labels in $L$, where $h$ is the order of the CRF, so the model can capture a larger context. The characters are ordered sequentially from left to right, with one node for every sliding window.

Each non-auxiliary node takes a label from $L$ with an associated unary cost $\varphi(x_i = l_j) = 1 - p(l_j \mid x_i)$, where $p(l_j \mid x_i)$ is the SVM score of label $l_j$ for node $x_i$. The pairwise cost for neighbouring characters is $\varphi(x_i, x_j) = \lambda \, (1 - p(l_i, l_j))$, where $\lambda$ is the penalty and $p(l_i, l_j)$ scores the two characters occurring together. The higher-order cost for an auxiliary node $x_i$ taking label $L_k$ and the leftmost non-auxiliary node $x_j$ taking label $l_l$ is

$$\varphi_2^a(x_i = L_k,\ x_j = l_l) = \begin{cases} 0 & \text{if } l = i, \\ \lambda_a' & \text{otherwise.} \end{cases}$$

The IIIT-5K dataset contains scene-text and born-digital images, including low-resolution images, in a variety of resolutions and styles. The images were annotated with bounding boxes and ground-truth text and divided into easy and hard categories based on visual appearance.

Results: The performance of the proposed algorithm is reported on multiple datasets for both the pairwise and the higher-order cost.

Next must-read paper: “Top-down and bottom-up cues for scene text”
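
To make the unary and pairwise terms concrete, here is a minimal Python sketch of the energy for a left-to-right chain of sliding-window nodes. It covers only the unary cost 1 − p(l_j | x_i) and the pairwise cost λ(1 − p(l_i, l_j)); the auxiliary higher-order nodes are left out. The slide does not say how the energy is minimized, so exhaustive search is used here as a stand-in, and the scores, pair probabilities, and function names are made-up assumptions for illustration.

```python
from itertools import product


def unary_cost(score):
    # phi(x_i = l_j) = 1 - p(l_j | x_i), with p the classifier score.
    return 1.0 - score


def pairwise_cost(pair_prob, lam):
    # phi(x_i, x_j) = lambda * (1 - p(l_i, l_j)).
    return lam * (1.0 - pair_prob)


def energy(labels, node_scores, pair_probs, lam):
    """Sum of unary costs plus pairwise costs over neighbouring nodes."""
    total = sum(unary_cost(node_scores[i][l]) for i, l in enumerate(labels))
    for left, right in zip(labels, labels[1:]):
        # Pairs missing from the table are treated as probability 0 (assumption).
        total += pairwise_cost(pair_probs.get((left, right), 0.0), lam)
    return total


def minimize_energy(node_scores, pair_probs, lam):
    """Exhaustive search over all labelings; only feasible for tiny examples."""
    candidates = product(*(sorted(scores) for scores in node_scores))
    return min(candidates, key=lambda ls: energy(ls, node_scores, pair_probs, lam))


# Hypothetical scores for two windows and a tiny bigram table.
node_scores = [{"T": 0.45, "I": 0.55}, {"O": 0.9}]
pair_probs = {("T", "O"): 0.8, ("I", "O"): 0.1}

print(minimize_energy(node_scores, pair_probs, lam=1.0))  # ('T', 'O')
print(minimize_energy(node_scores, pair_probs, lam=0.0))  # ('I', 'O')
```

With the language prior switched off (`lam=0.0`) the slightly higher classifier score for "I" wins; with the pairwise term active, the more plausible pair "TO" has the lower energy.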

Slide 35

Slide 35 text

Collection of on-line handwritten Japanese character pattern databases and their analyses

Summary: The work collected two online handwritten Japanese character pattern databases with more than 3 million patterns in total: one from 120 people contributing about 12K patterns each, the other from 163 people contributing about 10K patterns each.

Related works: The UNIPEN dataset (Guyon, 1994) is popular for online character recognition but does not include oriental characters.

Proposed methodology: Collecting online handwriting requires specialized tools such as pen PCs or PCs with a tablet interface, which makes collection difficult. Display-integrated tablets (DITs) are therefore used for data collection. For effective training, a large amount of data must be collected from each individual so that the individual's handwriting pattern can be learned. Boxes are displayed so that each character is written within a box, and pen-tip coordinates relative to the page are recorded. Participants are asked to write according to given sequences. It is generally recognized that people write unnaturally neatly when the text has no meaningful context but write casually when writing sentences. Participants are therefore asked first to write sentences that cover frequently used characters and then to write further characters one by one. Some characters cannot be written without being seen, and a character intended in kanji could be written in kana instead, so each character is displayed before it is written. Participant details are recorded for later use. The sentences used as text are collected from newspapers.

A collection tool was built on a DOS/V machine with a display-integrated tablet running MS Windows. Given a text file, the tool displays each character together with a writing box (1.7 cm × 1.7 cm or 1.43 cm × 1.43 cm) in which to write it. A verification tool is included that flags erroneous characters, which are then checked by humans and reported to the participant. The resulting Kuchibue dataset has 120 participants, each contributing 11,962 character patterns. Nakayosi has 163 participants with 10,403 patterns per person.

Next must-read paper: “A new warping technique for normalizing likelihood of multiple classifiers and its effectiveness in combined on-line/off-line Japanese character recognition”
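
As a quick check of the "more than 3 million patterns" figure, the short Python sketch below multiplies the participant counts by the per-person pattern counts given above. It also shows one possible way to convert page-relative pen-tip points into box-relative ones; the `Database` class, the `to_box_coordinates` helper, and the sample coordinates are hypothetical and are not part of the described collection tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    name: str
    participants: int
    patterns_per_person: int

    @property
    def total_patterns(self) -> int:
        return self.participants * self.patterns_per_person


# Figures taken from the slide text.
KUCHIBUE = Database("Kuchibue", 120, 11_962)
NAKAYOSI = Database("Nakayosi", 163, 10_403)


def to_box_coordinates(stroke, box_origin):
    """Shift page-relative pen-tip points so they are relative to a box corner."""
    ox, oy = box_origin
    return [(x - ox, y - oy) for x, y in stroke]


print(KUCHIBUE.total_patterns)  # 1435440
print(NAKAYOSI.total_patterns)  # 1695689
print(KUCHIBUE.total_patterns + NAKAYOSI.total_patterns)  # 3131129

# Hypothetical stroke recorded in page coordinates, box corner at (10, 20).
print(to_box_coordinates([(12, 25), (14, 30)], (10, 20)))  # [(2, 5), (4, 10)]
```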

Slide 36

Slide 36 text

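Company Profile

Company name: VIVEN Inc. (株式会社 微分)
Representative: Shintaro Yoshida (吉田 慎太郎)
Address: JustCo Shinjuku, JR Shinjuku Miraina Tower 18F, 4-1-16 Shinjuku, Shinjuku-ku, Tokyo
Established: October 2020
Capital: 7,000,000 (as of October 2022)
Employees: 20 (including all employment types)
Business: Development of "School DX", software for educational institutions; web application development; research and development in image recognition and natural language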
