
OCR Survey by VIVEN Inc

VIVEN, Inc.
January 21, 2023


▼ Reference links
・Service | 株式会社 微分 (VIVEN, Inc.)
 https://www.viven.co.jp/ja/service/

・Mission | 株式会社 微分 (VIVEN, Inc.)
 https://www.viven.co.jp/ja/company/

If you would like to hear more, please contact us by email or through one of the social media accounts below.

Email, Twitter, Facebook, Instagram, LinkedIn

[Company website]
https://www.viven.inc




Transcript

1. OCR Survey
    Tapas Dutta, Deep Learning Engineer


2. OCR Survey of Foreign Language
    Tapas Dutta, Deep Learning Engineer


3. TextScanner: Reading Characters in Order for Robust Scene Text Recognition

■ Summary
Current OCR technologies miss genuine characters or produce spurious ones; this work therefore generates pixel-wise maps for character class, position, and order, in parallel with an RNN for context modelling.

■ Related Works
Cheng (2017) used character class and localization labels to adjust the attention positions. Bai (2018) used a novel loss function to improve the attention decoder. Lyu (2018) and Liao (2019) used segmentation for OCR, which is however not effective for languages with closely spaced characters.
■ Proposed Methodology
A CNN architecture is used for feature extraction. The extracted features are fed to class and geometry branches. The class branch consists of 2 stacked convolutions followed by soft normalization; its output feature maps (the character segmentation maps) have dimension h × w × c (c = number of characters + background). The localization map is produced by applying a sigmoid activation to the input features. For the order segmentation map, a small U-Net architecture is used, with GRU layers in the middle. After upsampling, two convolution layers generate feature maps of size h × w × N (N is the sequence length). The order map, in which the k-th character is indicated by the k-th feature map, is generated by multiplying the order segmentation and character localization maps. The classification scores are obtained by multiplying the character segmentation maps and the order maps.
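A minimal PyTorch sketch of the map multiplication just described; the shapes and the normalization step are assumptions, not the paper's exact implementation.

```python
import torch

# char_seg:  (B, C, H, W) soft-normalized character segmentation maps
# order_seg: (B, N, H, W) order segmentation maps, N = max sequence length
# loc_map:   (B, 1, H, W) sigmoid character localization map
def decode_scores(char_seg, order_seg, loc_map):
    order_maps = order_seg * loc_map  # k-th map highlights the k-th character
    # per-step class scores: spatially weighted sum of the class maps
    scores = torch.einsum('bnhw,bchw->bnc', order_maps, char_seg)
    return scores / (order_maps.sum(dim=(2, 3)).unsqueeze(-1) + 1e-6)  # (B, N, C)
```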
■ Result
Different datasets were used to validate the effectiveness of the model, such as IIIT (50, 1K, 0 lexicons), SVT (50, 0 lexicons), IC13, IC15, SVTP, and CT, achieving accuracies of 99.8, 99.5, 95.7, 99.4, 92.7, 94.9, 83.5, 84.8, and 91.6%, respectively.
■ Next must-read paper: “TextScanner: Reading characters in order for robust scene text recognition”


4. Persian Optical Character Recognition Using Deep Bidirectional Long Short-Term Memory

■ Summary
OCR for Persian needs to address the particularities of the language, such as right-to-left text and the interpretation of semicolons, dots, obliques, etc. Increasing the number of layers, filters, or kernel size for the LSTM did not improve results. BiLSTM improved results compared to LSTM. Increasing the dimension of the extracted vector improved results for both BiLSTM and LSTM, and increasing the number of BiLSTM layers improved performance.

■ Related Works
Khosravi (2006) achieved 99.02% and 98.8% accuracy using an improved gradient and a gradient histogram, respectively, on the HODA dataset. Alizadehashraf (2017) achieved 97% accuracy using a small CNN architecture. Bonyani (2021) compared the performance of standard CNN architectures (DenseNet, ResNet, VGG) for recognizing Persian text. LeCun (1998)'s LeNet was used with weights optimized by firefly, ant colony, chimp, and particle swarm optimization techniques, with chimp optimization performing the best. Smith (2007) used a CNN followed by an LSTM, achieving an accuracy of 93%.
Correctness = 100 × (#AllWrds − (#DelWrds + #SubWrds)) / #AllWrds

Accuracy = 100 × (#AllWrds − (#InsWrds + #DelWrds + #SubWrds)) / #AllWrds
■ Proposed Methodology
The proposed algorithm consists of 3 modules for segmentation, feature extraction, and recognition. For segmentation, the height of the images is normalized and a sliding-window algorithm is used. For feature extraction, a small CNN architecture of 1 convolution and 1 max-pooling layer is used. The recognition module uses 4 BiLSTM layers, the first two with tanh activation and the middle two with sigmoid activation, trained with the connectionist temporal classification loss.
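A minimal PyTorch sketch of such a conv + BiLSTM + CTC recognizer; layer sizes are assumptions, and nn.LSTM keeps its standard activations rather than the per-layer tanh/sigmoid variants described above.

```python
import torch
import torch.nn as nn

class PersianOCR(nn.Module):
    def __init__(self, n_classes, hidden=128):
        super().__init__()
        # 1 convolution + 1 max-pooling for feature extraction
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # 4 bidirectional LSTM layers for recognition
        self.rnn = nn.LSTM(32 * 16, hidden, num_layers=4,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes + 1)  # +1 for the CTC blank

    def forward(self, x):                      # x: (B, 1, 32, W) normalized strips
        f = self.features(x)                   # (B, 32, 16, W/2)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (B, W/2, 512) time steps
        h, _ = self.rnn(f)
        return self.fc(h).log_softmax(-1)      # per-step log-probs for nn.CTCLoss
```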
■ Result
20 pages of English and Persian text from different, randomly chosen books were typed in MS Office, and the resulting images were height-normalized. The quantities used for evaluation are AllWrds (number of words in the text), InsWrds (wrongly inserted words), DelWrds (wrongly deleted words), and SubWrds (wrongly substituted words). The Tesseract model achieved 91.73% accuracy on Persian-only text and 73.56% when trained on the dataset and tested on a set containing both Persian and English text, compared to Bina's 96% on both test sets. Tesseract and Bina achieved 71% and 91% correctness, respectively, on the test sets containing Persian and English text.
■ Next must-read paper: “An overview of the Tesseract OCR engine”


5. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

■ Summary
Existing OCR algorithms require a CNN architecture for feature extraction from the image, sequence modelling layers for text generation, and a language model to improve performance. This work includes all of these steps in a single end-to-end trainable model. Extensive experiments on different combinations of encoder and decoder validate the superiority of using BEiT as the encoder and RoBERTa as the decoder. Ablation studies validate the effectiveness of the different strategies used.

■ Related Works
Diaz (2021) and Vaswani (2017) incorporated transformers into CNN architectures, observing significant performance improvements. Bao (2021) used self-supervised pretrained image transformers to replace CNN architectures.
■ Proposed Methodology
The image is divided into P × P patches, flattened, and a linear layer is used to change the dimension to a predefined number. The encoder consists of an image transformer, where DeiT (Touvron 2021) and ViT (Dosovitskiy 2021) are used as encoder initializations. The “[CLS]” token is used to represent the entire image. A text transformer is used as the decoder; it is initialized with the RoBERTa model, so the output is a wordpiece instead of a character. The model is first trained on hundreds of millions of synthetic printed text-line images. These weights are used to initialize a second-stage pretraining on task-specific synthetic and real-world datasets. Augmentations such as Gaussian blur, image erosion, rotation, image dilation, downscaling, and underlining are used to help the model's generalizing capabilities.
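A minimal PyTorch sketch of this patch-embedding step; the patch size P = 16 and model dimension 768 are assumptions.

```python
import torch
import torch.nn as nn

# Split an image into P*P patches, flatten each, and project linearly.
def embed_patches(img, proj, P=16):
    B, C, H, W = img.shape                         # e.g. (B, 3, 384, 384)
    patches = img.unfold(2, P, P).unfold(3, P, P)  # (B, C, H/P, W/P, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
    return proj(patches)                           # (B, num_patches, d_model)

proj = nn.Linear(3 * 16 * 16, 768)                 # d_model = 768 assumed
tokens = embed_patches(torch.rand(2, 3, 384, 384), proj)
```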
■ Result
The SROIE dataset is used to evaluate the model's performance in terms of precision, recall, and F1-score, where the model achieved 95.76, 95.91, and 95.84, respectively.

■ Next must-read paper: “Scene Text Recognition with Permuted Autoregressive Sequence Models”


6. An End-to-End Khmer Optical Character Recognition using Sequence-to-Sequence with Attention

■ Summary
The work proposes an encoder-decoder architecture with GRUs and an attention mechanism, evaluated on Khmer texts in different fonts. Tesseract produced characters like @, #, and / that are not present in the texts. The proposed model outputs repeated characters before reaching the EOS character, a consequence of the encoder-decoder architecture. Khmer characters have similar structures, resulting in errors in both models.

■ Related Works
Ilya (2014) employed an RNN-based encoder-decoder for English-to-French machine translation. Dzmitry (2014) modified the previous work to include an attention mechanism and improved the performance. Devendra (2015) employed an LSTM-based encoder-decoder mechanism for English text recognition. Farisa (2021) trained standard CNN architectures (ResNet, DenseNet) with LSTM and GRU layers, training the entire model end-to-end using the Connectionist Temporal Classification loss.
■ Proposed Methodology
The encoder consists of a convolution, batch normalization, and ReLU activation followed by 3 residual modules. There are two types of residual blocks, as shown in the figure above: the first residual module contains 2 of res block 0, while the second and third modules contain res block 0 followed by res block 1. This is followed by a 2D average pooling and a 2D dropout layer. For an intermediate output of h × w × c, the feature maps are reshaped to w × hc (O_dropout2D) and encoded as (H, h_T) = EncoderGRU(O_dropout2D). Here h_T is passed through a dropout, a linear layer, and a tanh activation and used as the decoder's initial hidden state, while H is the input to the decoder. The context vector is a weighted average of the encoder hidden states; the weights are calculated with a softmax over scores computed from the encoder's hidden states and the decoder's previous hidden state as

e_ij = v_a^T · tanh(W_a · s_{i−1} + U_a · h_j)

The one-hot-encoded previous decoder output and the context vector from the attention are concatenated and passed through a GRU layer along with the decoder's previous hidden state to calculate the current hidden state. The current hidden state, context vector, and one-hot-encoded previous output are concatenated and passed through a linear layer for next-character prediction.
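A minimal PyTorch sketch of one decoder step as described; dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, n_chars, enc_dim, dec_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, dec_dim, bias=False)
        self.U_a = nn.Linear(enc_dim, dec_dim, bias=False)
        self.v_a = nn.Linear(dec_dim, 1, bias=False)
        self.gru = nn.GRU(n_chars + enc_dim, dec_dim, batch_first=True)
        self.out = nn.Linear(dec_dim + enc_dim + n_chars, n_chars)

    def forward(self, y_prev, s_prev, H):
        # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
        e = self.v_a(torch.tanh(self.W_a(s_prev)[:, None] + self.U_a(H))).squeeze(-1)
        a = torch.softmax(e, dim=1)                        # attention weights
        ctx = (a[:, :, None] * H).sum(1)                   # context vector
        _, s = self.gru(torch.cat([y_prev, ctx], -1)[:, None], s_prev[None])
        s = s[0]                                           # current hidden state
        return self.out(torch.cat([s, ctx, y_prev], -1)), s
```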
■ Result
Text2image is used to generate 92,213 text images in different Khmer fonts and sizes; the test set contains 3,000 images. The proposed model achieved a character error rate (the ratio of unrecognized characters to the total number of characters) of 1%, compared to Tesseract's error rate of 3% on this dataset.

CER = (S + I + D) / N

WER = (count of incorrect samples) / (count of samples)

■ Next must-read paper: “Khmer OCR fine tune engine for Unicode and legacy fonts using Tesseract 4.0 with Deep Neural Network”


7. ASTER: An Attentional Scene Text Recognizer with Flexible Rectification

■ Summary
Texts in the real world are mostly curved or stylised, thus ASTER is equipped with a rectification module that rectifies the input image (using a thin plate spline) and a recognition module that uses a sequence-to-sequence model with attention to predict characters from the rectified images.

■ Related Works
Wang (2012) used two separate CNN modules to localize and recognize texts. Jaderberg (2014) used one CNN for both localization and recognition. Su and Lu (2014, 2017) used RNNs for sequence prediction. He (2016) and Shi (2017) used a combination of CNN and RNN for text prediction. Wang (2017) employed a gated recurrent CNN for text recognition. Yang (2017) employed a character detection model optimized with an alignment loss for character localization. End-to-end text recognition models, Jaderberg (2016) and Weinman (2014), use text proposals followed by a word recognizer. Busta (2017) combined an FCN detector with connectionist temporal classification (CTC) for recognition.
■ Proposed Methodology
The algorithm contains two parts, for rectification and recognition. Rectification is done using the Thin Plate Spline transformation; it contains 3 parts: a localization network, a grid generator, and a sampler. The localization network predicts K control-point coordinates from the original image using a CNN with an FC layer. Given a pixel location p on the rectified image, the grid generator computes the corresponding pixel location on the original image I via the transformation T = [C, 0^{2×3}] ΔC^{-1}, where ΔC is a block matrix assembled from 1^{1×K}, the control points C, their transpose C^T, 1^{K×1}, and Ĉ; Ĉ is a square matrix such that Ĉ_ij = F(||C_i − C_j||) with F(r) = r² log r. Differentiable image sampling is used to clip neighbouring pixels so they stay within the image and to interpolate neighbourhood pixels in a differentiable manner. The recognition module consists of an encoder-decoder architecture with a ConvNet acting as the encoder and a BLSTM with attention as the decoder. The attention weights are calculated from the encoder output and the previous hidden state as

e_{t,i} = w^T · tanh(W s_{t−1} + V h_i + b)

After a softmax, the glimpse vector g is the weighted sum of the encoder outputs using the attention weights. g is then concatenated with the one-hot-encoded previous output and passed through a recurrent unit. This output is fed to a linear layer with softmax to predict the current character.
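A minimal PyTorch sketch of the thin-plate-spline radial basis F(r) = r² log r used by the grid generator; the control points and the surrounding solve for T are assumed.

```python
import torch

# points: (N, 2) pixel coords on the rectified image, ctrl: (K, 2) control points
def tps_rbf(points, ctrl):
    r = torch.cdist(points, ctrl)            # pairwise distances ||p - c_k||
    return r.pow(2) * torch.log(r + 1e-6)    # F(r) = r^2 log r, eps avoids log 0

# Each rectified pixel p is lifted to [1, p_x, p_y, F(||p-c_1||), ..., F(||p-c_K||)]
# and mapped to the input image by the learned transformation T.
```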
■ Result
Multiple datasets are used to evaluate the model's performance, namely IIIT5K (0), SVT (0), IC03 (0), IC13, SVTP, and CUTE, obtaining 93.4, 93.6, 94.5, 91.8, 76.1, 78.5, and 79.5, respectively.

■ Next must-read paper: “Focusing attention: Towards accurate text recognition in natural images”


8. Focal CTC Loss for Chinese Optical Character Recognition on Unbalanced Datasets

■ Summary
This work integrates the connectionist temporal classification (CTC) loss with the focal loss to help the model with unbalanced language datasets. Empirically, for both synthetic and real images, performance can be improved using α = 0.25 and γ = 0.5.

■ Related Works
A. Graves (2008) was the first to combine the CTC loss with RNNs for text recognition. A. Ul-Hasan (2013) used a BiLSTM with the CTC loss for Urdu text recognition. M. Busta (2017) combined recognition and detection in an end-to-end model. J. Ba (2014) used reinforcement learning to concentrate on the part of the image useful for prediction. C. Y. Lee (2016) used an RNN with attention for optical character recognition. M. Jaderberg used a spatial transformer for spatial manipulation of data within the module, combined with focal loss for text recognition.
■ Proposed Methodology
With ResNet as the backbone, the algorithm extracts feature maps from the last convolution layer, which are cut into multiple slices, each containing information about a small area of the image. This is followed by a BiLSTM and a fully connected layer with softmax for the final output

p(π|x) = ∏_{t=1}^{T} y_{π_t}^t

Here y_{π_t}^t represents the probability of observing element π_t (from the set of all possible characters plus the blank character) at slice t, for T total slices. The CTC loss is thus calculated from

p(l|y) = Σ_{π : β(π) = l} p(π|y)

i.e., the sum of the probabilities of all paths collapsing to the label. For hyperparameters α and γ, the focal loss is calculated as

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)

Here α is used to overcome data imbalance and γ to help the model focus more on hard samples. The CTC loss can thus be modified as

FCTC(l|y) = −α (1 − p(l|y))^γ log p(l|y)

so that the model focuses more on hard samples.
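A minimal PyTorch sketch of this focal CTC idea, recovering p(l|y) from the per-sample CTC loss; α and γ follow the values reported above.

```python
import torch
import torch.nn as nn

def focal_ctc_loss(log_probs, targets, in_lens, tgt_lens, alpha=0.25, gamma=0.5):
    # log_probs: (T, B, C) per-frame log-softmax outputs, as required by CTCLoss
    ctc = nn.CTCLoss(reduction='none', zero_infinity=True)
    nll = ctc(log_probs, targets, in_lens, tgt_lens)   # -log p(l|y) per sample
    p = torch.exp(-nll)                                # p(l|y)
    return (alpha * (1 - p) ** gamma * nll).mean()     # down-weights easy samples
```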
■ Result
A synthetic dataset was generated from MNIST by concatenating 5 images, drawn from two character groups, '0-9, a-h' and 'i-z': one version with 1M and 100K samples (10:1 imbalance), the other with 1M and 10K samples (100:1 imbalance), each with a 10K test set; plus a Chinese OCR dataset of 3.6M training and 5K testing images. The highest accuracies obtained were 62.8%, 72.4%, and 76.4%, respectively.

■ Next must-read paper: “Deep TextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework”


9. PP-OCR: A Practical Ultra Lightweight OCR System

■ Summary
This work proposes a light-weight model for text recognition. Various strategies to improve the model's performance or decrease its parameter count are also discussed, and ablation studies verify the effectiveness of each strategy.
■ Proposed Methodology
The algorithm uses 3 modules for recognition: text detection, which outputs bounding boxes for text; a direction classifier, which is necessary when the bounding box reverses the text; and text recognition. For detection, a light backbone, MobileNetV3_large_x0.5, is used. Empirically it was observed that removing the squeeze-excitation blocks from the model resulted in no loss of accuracy while reducing the number of parameters and the inference time. A Feature Pyramid Network (FPN) is used in the head to detect small texts using large-resolution feature maps. A cosine learning-rate schedule is used to apply a large learning rate (LR) at the beginning and a small LR at later stages; since a large initial LR may lead to instability, LR warmup is used. The FPGM pruner is used to dynamically calculate a compression ratio for each layer and remove similar filters, improving inference efficiency. MobileNetV3_small_x0.35 is used as the backbone of the direction classifier, with an input resolution of 48×192. Augmentations such as rotation, Gaussian blur, perspective distortion, and motion blur (Base Data Augmentation, BDA), along with RandAugment, are used to improve the model's generalizing capabilities. Modified PACT quantization is used with an L2 regularization coefficient of 1e-3. For the recognizer, MobileNetV3_small_x0.35 is used as the backbone, pretrained on synthetic images and with modified strides to preserve horizontal and vertical information, with BDA and TIA (Luo 2020) augmentation. A fully connected layer of dimension 48 is used as the head, along with L2 regularization. The cosine learning rate with LR warmup is used for training, and PACT quantization is applied to every layer except the LSTM layers.
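A minimal sketch of the cosine schedule with linear warmup described above; the base LR, warmup length, and step counts are assumptions.

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=500):
    if step < warmup_steps:                  # linear warmup avoids early instability
        return base_lr * step / warmup_steps
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))   # cosine decay to 0

# Example: lr_at(0, 10000) == 0.0, lr_at(500, 10000) == 1e-3
```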
■ Result
Multiple synthetic as well as public datasets, such as LSVT, RCTW-17, MTWI 2018, and CASIA-10K, are combined for training and validation of the different modules. When using all of the strategies above, the model achieved an accuracy of 69%.

■ Next must-read paper: “Towards end-to-end license plate detection and recognition: A large dataset and baseline”


10. On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention

■ Summary
Current OCR technologies are unable to recognize rotated, curved, vertically aligned, or arbitrarily shaped texts. This work uses an attention mechanism to tackle these challenges.

■ Related Works
Cheng (2018) employed a selection module to select features in 4 directions by projecting an intermediate feature map. Yang (2017) used an attention module requiring extensive character-level supervision. Hui (2019) employed attention but is biased towards horizontal texts due to height pooling and RNN layers. Fenfen (2018) used a 1D transformer for recognition. Pengyuan (2019) employed self-attention in the decoder.

■ Proposed Methodology
A shallow CNN module is used to suppress background information while reducing computational costs for subsequent layers. The output is passed through self-attention blocks with a novel 2D positional embedding. This can be formulated as

att-out_hw = Σ_{h'w'} softmax(rel_{(h'w') → (hw)}) V_{h'w'}

V_{h'w'} is calculated by multiplying the feature maps of the shallow CNN with trainable weights, and the attention weights rel_{(h'w') → (hw)} are calculated as

rel_{(h'w') → (hw)} ∝ (e_{hw} + p_{hw}) W_q W_k^T (e_{h'w'} + p_{h'w'})^T

Here e_{h'w'} and p_{h'w'} represent the extracted feature maps and the positional embedding, respectively. p_{h'w'} can be calculated as

p_{h'w'} = α(E) p_{h'}^{sinu} + β(E) p_{w'}^{sinu}

where p_h^{sinu} and p_w^{sinu} are sinusoidal positional encodings along the height and width, while α(E) and β(E) are calculated as

α(E) = sigmoid(max(0, g(E) W_h^1) W_h^2) and β(E) = sigmoid(max(0, g(E) W_w^1) W_w^2)

for E representing the CNN-extracted features. To capture short-term dependencies, the 1×1 convolutions in the feedforward layers are replaced with 3×3 convolutions, forming a locality-aware feedforward layer.
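A minimal PyTorch sketch of the adaptive 2D positional encoding above; g(E) is taken here as global average pooling and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

def sinusoid(n, d):                          # standard sinusoidal encoding, d even
    pos = torch.arange(n, dtype=torch.float)[:, None]
    i = torch.arange(d // 2, dtype=torch.float)[None]
    ang = pos / torch.pow(10000.0, 2 * i / d)
    return torch.stack([ang.sin(), ang.cos()], -1).reshape(n, d)

class Adaptive2DPosEnc(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.alpha = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d), nn.Sigmoid())
        self.beta = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d), nn.Sigmoid())

    def forward(self, E):                    # E: (B, d, H, W) shallow-CNN features
        B, d, H, W = E.shape
        g = E.mean(dim=(2, 3))               # g(E): global context vector
        ph, pw = sinusoid(H, d).to(E), sinusoid(W, d).to(E)
        a, b = self.alpha(g), self.beta(g)   # (B, d) height/width scale gates
        return (a[:, None, None] * ph[None, :, None]
                + b[:, None, None] * pw[None, None, :])   # (B, H, W, d)
```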
■ Result
Datasets with horizontal as well as arbitrarily aligned texts were used to validate the performance of the model: IC13 (94.1%), IC03 (96.7%), SVT (91.3%), IIIT5K (92.8%), IC15 (79%), SVTP (86.5%), and CT80 (87.8%).

■ Next must-read paper: “ASTER: An attentional scene text recognizer with flexible rectification”

11. Towards Accurate Scene Text Recognition with Semantic Reasoning Networks

■ Summary
This work attempts to overcome the shortcomings of RNNs, such as their time dependency and, most importantly, the one-way transmission of context, which greatly limits a model's ability to learn semantic information.

■ Related Works
Baoguang (2016) combined CNN and RNN with the connectionist temporal classification (CTC) loss for recognition. Minghui (2019) formulated the problem as a pixel-level classification task. Chen (2016) extracted the visual features in 1D and used the semantic information of the last time step for recognition. Mingkun (2019) used a rectification network based on local features to improve performance. Zhanzhan (2018) extracted features along 4 directions and used a filter gate to calculate the contribution of each. Zbigniew (2017) encoded spatial coordinates on 2D feature maps to increase the sequential information extracted.

■ Proposed Methodology
ResNet50 is used as the backbone, with a feature pyramid extracting features from the 3rd, 4th, and 5th residual blocks. The extracted features are passed to a transformer along with positional embeddings. A novel parallel visual attention module computes attention weights as

e_{t,ij} = W_e^T · tanh(W_o f_o(O_t) + W_v v_{ij})

for transformer-extracted features v, character reading order O (1 … N−1), and embedding function f_o. After a softmax, a weighted sum with v gives the attention outputs. When using an RNN, calculating the vectors at time step t requires information from step t−1, which causes a bottleneck in the model's semantic reasoning. This work therefore uses approximated information from step t−1 to calculate the vectors at step t. The output of the attention (g) is used to predict the target character (FC with softmax), optimized using cross-entropy (CE). The most likely character is passed through an embedding layer to calculate an approximate embedding. The extracted features are passed through several transformer units to output a global context (s), also optimized with a CE loss. The features g and s are dynamically weighted as

z_t = σ(W_z · [g_t, s_t]) and f_t = z_t · g_t + (1 − z_t) · s_t

The entire model is optimized with Loss = α_e L_e + α_r L_r + α_f L_f.
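A minimal PyTorch sketch of this gated fusion of visual (g) and semantic (s) features; the feature dimension d is an assumption.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W_z = nn.Linear(2 * d, d)

    def forward(self, g, s):                 # g, s: (B, N, d) aligned features
        z = torch.sigmoid(self.W_z(torch.cat([g, s], dim=-1)))
        return z * g + (1 - z) * s           # f_t = z_t*g_t + (1-z_t)*s_t
```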
■ Result
Various datasets, namely IC13, IC15, IIIT5K, SVT, SVTP, CUTE, TRW-T, and TRW-L, are used for evaluation, achieving 95.5, 82.7, 94.8, 91.5, 85.1, 87.8, 85.5, and 84.3, respectively.

■ Conclusion
The model could be combined with a CTC loss to improve performance.

■ Next must-read paper: “An end-to-end trainable neural network for spotting text with arbitrary shapes”

12. Multi-Lingual Optical Character Recognition System Using the Reinforcement Learning of Character Segmenter

■ Summary
The performance of OCR systems decreases when working on texts containing multiple languages. To tackle this problem, this work uses a segmenter (trained with a reinforcement learning algorithm), a switcher, and recognizers trained in a supervised manner.

■ Related Works
Zheng (2016) formulated character segmentation as a binary segmentation problem. Chernyshova (2020) proposed a word-image segmentation model using dynamic programming to select the most probable boundaries in images. B. Shi (2017) used convolution layers for feature extraction, LSTM layers to predict character classes, and the CTC loss to ignore the repeated characters produced by multiple slices. D. Kumar (2015) used an encoder-decoder architecture with attention, scanning the image along the horizontal direction and then decoding the feature vector using attention.

■ Proposed Methodology
The segmenter is used to partition a word image into n sub-images, the switcher then assigns a recognizer to each sub-image, and the recognizer assigns a label. The architecture of the word recognizers (R) is shown in the figure. First, L and R are trained with a cross-entropy loss function. The trained modules are then used in conjunction with the segmenter S in reinforcement learning. S outputs the partition map as a probabilistic vector, which is then processed into a binary map; the processing involves non-max suppression, clipping probabilities at 0.99, and thresholding. The action is based on the model's output, and the reward is based on the distance from the true value:

r(X, a) = 1 − d(Y(X, a), Y) / max(|a| + 1, N_y)

For input X and action a, Y(X, a) is the processed partition map and Y is the ground truth. The denominator is used for length normalization, N_y being the total number of characters in Y.
■ Result
Texts in various languages are used for evaluation: Chinese, English, and Korean, as well as mixed texts (Chinese with English, Chinese with Korean, English with Korean, and Chinese with English and Korean), achieving 94.74, 77.01, 97.07, 87.23, 97.1, 87.46, and 90.87, respectively.

■ Next must-read paper: “Tesseract Blends Old and New OCR Technology”


13. OCR Survey of Japanese Language
    Tapas Dutta, Deep Learning Engineer


14. An attention-based row-column encoder-decoder model for text recognition in Japanese historical documents

■ Summary
The work proposes a novel algorithm to recognize text without the need for segmentation. This is achieved by incorporating BiLSTMs along rows and columns in the encoder and a residual LSTM in the decoder. A language model could be incorporated into the algorithm to further improve performance.

■ Related Works
A. Graves (2009) used a BiLSTM with Connectionist Temporal Classification (CTC) for English text recognition. A. Graves (2008) employed a Multi-Dimensional LSTM with the CTC loss for Arabic text recognition. H. Yang (2018) used a CNN with CTC for Chinese text recognition. D. Valy (2018) used a CNN with 1D or 2D LSTMs for Khmer text recognition. C. Wang (2018) used attention for text recognition. Y. Deng (2017) used attention for mathematical expression recognition. T. Bluche (2017) incorporated a Multi-Dimensional LSTM with attention for handwritten paragraph recognition.

■ Proposed Methodology
The algorithm has 3 modules: feature extraction, a row-column encoder, and a decoder. A standard CNN module is used for feature extraction. Two modified BiLSTMs are used, one along the rows and another along the columns, to extract sequential information in the horizontal and vertical directions, respectively. The attention scores are calculated as

score(h_t^Attn, e_i) = tanh(W_h h_t^Attn + W_e e_i)

Here e are the feature maps extracted by the CNN module and h_t^Attn is the current LSTM hidden state, calculated as

h_t^Attn = LSTM^Attn(h_{t−1}^Attn, [Embed(y_{t−1}), a_{t−1}])

Here y_{t−1} is the previously predicted character converted to a feature vector by the Embed layer, LSTM^Attn is a stack of 2 LSTMs with 512 nodes each, and a is the attention vector, calculated as a_t = tanh(W_c [c_t; h_t^Attn]) with context vector c_t = Σ_{i=1}^{n} p_i^t e_i, i.e., the weighted average of the extracted features using the attention weights. The attention outputs are then calculated as

(h_t^Res, O_t) = LSTM^Res(h_{t−1}^Res, a_t)

where LSTM^Res is 1 LSTM layer of 512 nodes with a skip connection to the input attention vector. The decoder multiplies the attention outputs with trainable weights followed by a softmax: y_t = softmax(W_a · O_t).
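A minimal PyTorch sketch of the row/column BiLSTM encoding described above; feature and hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class RowColEncoder(nn.Module):
    def __init__(self, c_in, hidden):
        super().__init__()
        self.row = nn.LSTM(c_in, hidden, bidirectional=True, batch_first=True)
        self.col = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, f):                    # f: (B, C, H, W) CNN feature maps
        B, C, H, W = f.shape
        x = f.permute(0, 2, 3, 1).reshape(B * H, W, C)
        x, _ = self.row(x)                   # horizontal context, one row at a time
        x = x.reshape(B, H, W, -1).permute(0, 2, 1, 3).reshape(B * W, H, -1)
        x, _ = self.col(x)                   # vertical context, one column at a time
        return x.reshape(B, W, H, -1).permute(0, 2, 1, 3)   # (B, H, W, 2*hidden)
```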
■ Result
The PRMU dataset is used to validate the performance of the model. It has 3 tasks: 1) single-character recognition, 2) recognition of 3 characters written vertically, 3) recognition of 3 or more characters written in multiple lines. The model achieved character error rates and sequence error rates of 4.15 and 11.43 on level 2, and 12.69 and 58.58 on level 3, respectively.

■ Next must-read paper: “Recognition of anomalously deformed kana sequences in Japanese historical documents”


15. Deep Convolutional Recurrent Network for Segmentation-free Offline Handwritten Japanese Text Recognition

■ Summary
The work proposes a segmentation-free algorithm for recognition. It has three components: a CNN feature extractor using a sliding window, a BLSTM for prediction, and optimization using the connectionist temporal classification (CTC) loss.

■ Related Works
Graves (2009) combined BLSTM and CTC for text recognition. Messina (2015) combined a Multi-Dimensional LSTM and CTC for end-to-end trainable Chinese text recognition. Suryani (2016) combined a pretrained CNN with an LSTM and a hidden Markov model for alignment.

■ Proposed Methodology
The CNN architecture used for feature extraction is shown below. The model is pretrained on Japanese handwritten character datasets (Nakayosi, Kuchibue). After training, the softmax (DCRN-s) or both the fully connected layers and the softmax (DCRN-f&s) are removed, and the remainder is used for feature extraction. LSTM layers are used instead of plain RNNs to address the vanishing gradient problem. An LSTM, however, extracts information in one direction only, so 2 LSTMs are used to extract information in both directions. The model is optimized using the CTC loss function.

■ Result
The Kondate dataset is used to finetune and validate the model. The Label Error Rate and Sequence Error Rate are used to measure the model's performance, calculated as

LER(h, S′) = (1/Z) Σ_{(x,z)∈S′} ED(h(x), z)

SER(h, S′) = (100/|S′|) Σ_{(x,z)∈S′} [0 if h(x) = z, 1 otherwise]

The model achieved an LER and SER of 6.44 and 28.9 using DCRN-s, and 6.95 and 28.04 using DCRN-f&s, respectively.
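A minimal Python sketch of these two metrics, assuming (hypothesis, reference) string pairs and character-level edit distance.

```python
# Levenshtein edit distance with a rolling DP row.
def edit_distance(a, b):
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

def ler_ser(pairs):                  # pairs: [(hypothesis, reference), ...]
    z = sum(len(ref) for _, ref in pairs)
    ler = sum(edit_distance(hyp, ref) for hyp, ref in pairs) / z
    ser = 100 * sum(hyp != ref for hyp, ref in pairs) / len(pairs)
    return ler, ser
```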
■ Next must-read paper: “Text-Line Character Segmentation for Offline Recognition of Handwritten Japanese Text”


16. Attention Augmented Convolutional Recurrent Network for Handwritten Japanese Text Recognition

■ Summary
Japanese OCR is difficult due to the vast character set, multiple writing styles, and multiple touch-points between characters. The work proposes the Attention Augmented Convolutional Recurrent Network (AACRN), consisting of three modules: a convolutional feature extractor, a self-attention-based encoder, and a CTC decoder.

■ Related Works
Feng (2012) used segmentation to separate each character, followed by recognition of each character. Segmentation-free methods involve using Connectionist Temporal Classification (CTC) and attention mechanisms. Graves (2009) was the first to use a BLSTM with CTC for recognition. Shi (2017) and Ly (2017) used a CNN with BLSTM and CTC for recognition. Deng (2017) used an attention-based model to convert mathematical expressions to LaTeX. Chowdhury (2020) used an attention model with a beam-search decoder for English and French text recognition. Vaswani (2017) used self-attention along with positional information for recognition.

■ Proposed Methodology
A CNN architecture without fully connected layers is used for feature extraction. The extracted feature maps are unfolded from left to right. These features are fed to a self-attention module, where they are projected to queries, keys, and values, followed by scaled dot-product attention. For n heads this process is repeated n times. The outputs are then concatenated and added to the input of the self-attention module.
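A minimal PyTorch sketch of the self-attention step described above, using the built-in multi-head attention; d_model = 256 and 4 heads are assumptions.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

def encode(seq):                     # seq: (B, T, 256) unfolded CNN features
    out, _ = attn(seq, seq, seq)     # Q = K = V = seq, scaled dot-product heads
    return seq + out                 # residual add back to the module input
```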
■ Result
Multiple datasets are used to evaluate the model's performance, namely IIIT5K, SVT, IC03, IC13, SVTP, and CUTE, obtaining 92.67, 91.16, 93.72, 90.74, 78.76, and 76.39, respectively.

■ Next must-read paper: “Focusing attention: Towards accurate text recognition in natural images”


17. Recognition of Anomalously Deformed Kana Sequences in Japanese Historical Documents

■ Summary
The work proposes a segmentation-free algorithm for Kana character recognition across multiple lines. It proposes the Deep Convolutional Recurrent Neural Network (DCRN), which uses a CNN architecture for feature extraction and a BiLSTM with the connectionist temporal classification (CTC) loss for recognition.

■ Related Works
Kitadai (2008) restores the image, performs similar-pattern matching, and returns similar characters already decoded. Phan (2016) segmented a document into characters and recognized them using a modified quadratic discriminant function. Nguyen (2017) used a segmentation-based approach for handwritten Japanese text recognition. Graves (2008) combined a multidimensional LSTM with the CTC loss in an end-to-end trainable model for handwritten Arabic recognition. Shi (2015) used a CNN with LSTM for scene text recognition. Rawls (2017) used a CNN with LSTM for handwritten English and Arabic text recognition. Ly (2017) used a combination of CNN and LSTM optimized with the CTC loss for Japanese handwritten text recognition.

■ Proposed Methodology
The pretrained model has 5 blocks of CNN with max-pooling, 2 FC layers with ReLU after every layer, and a softmax layer for prediction. This architecture is pretrained (to recognize 1 character), and features are extracted in 3 ways:
• Extract features from the model without the softmax, with a sliding window (stride 12 or 16) applied to the text
• Use a sliding window of 64×32 with a stride of 32 for non-overlapping regions, with features extracted from the last convolution layer
• Extract features for the full text image from the last convolution layer
A frame-predictor module of 3 BLSTMs (each having 2 LSTMs of 128 nodes) is followed by a dense layer to predict the character for each frame and CTC for the final prediction. The end-to-end approach consists of a similar model without pretraining, with BatchNormalization after each convolution layer in the feature extractor, ReLU replaced by LeakyReLU, and dropout layers after every 2 LSTM layers in the frame predictor. For text spanning multiple lines, one approach is to segment the vertical lines, join them, and apply the previous algorithms. Another approach is to use the image feature extractor from the previous algorithm and a frame predictor of 2 levels of 2D-BLSTM, the first layer having 4 LSTMs of 64 nodes each and the second layer having 4 LSTMs of 128 nodes each. A third approach replaces the CNN module with a 2D-BLSTM, so the entire structure has a total of 3 2D-BLSTMs, each having 4 LSTM layers, with 2, 10, and 50 nodes, respectively. A limited window size and fully connected layers are used to reduce the number of weights.

■ Result
For level 2 (single vertical lines of text), the end-to-end trained model achieved the best result of 10.9 LER and 27.7 SER. For level 3 (multiple vertical lines), segmentation with the end-to-end trained model achieved the best result of 12.3 LER and 54.9 SER.

■ Conclusion
Context preprocessing, language statistics, and augmentation could be applied to further improve the performance.

■ Next must-read paper: “Character and Text Recognition of Khmer Historical Palm Leaf Manuscripts”

18. A Semantic Segmentation-based Method for Handwritten Japanese Text Recognition

■ Summary
The work proposes a segmentation-based Japanese handwritten text recognition algorithm: semantic segmentation using a U-Net architecture for pixel-level classification, followed by a CNN-based OCR, combined using a language model.

■ Related Works
Xu (2001) used skeleton and contour analysis to detect cut points; a filter trained with geometric features is used to prune implausible cutting candidate points, and an OCR model is used for recognition followed by a language model for refinement. Shi (2015) combined CNN and LSTM with the CTC loss for text recognition. Ly (2019) used a CNN for feature extraction with an RNN to encode, followed by attention for the output. Asanobu (2019) first segmented the characters using a Faster R-CNN with two classes, followed by recognition using a ResNet152 backbone, FPN, and ROI align. Baek (2019).

■ Proposed Methodology
ResNet101 is used as the encoder of the U-Net. Dilated convolutions are used to extract features without down-sampling multiple times: the encoder features are passed through dilated convolutions with dilations of 2, 4, 8, and 16, concatenated, and fed to the upsampling layers. The model is tasked to predict the center of each character along with convex hulls, to reduce touching between characters. The watershed algorithm is applied to the output to separate the convex hulls. Recognition is done via an Inception-ResNet-v2 network. The outputs are represented as unigrams, bigrams, and trigrams, which are then selected using the Viterbi algorithm to obtain the result.

■ Result
The Kondate dataset is used for training, and the Kuchibue and Nakayoshi datasets are used for testing. Pixel-level accuracy and IoU% are calculated for character segmentation, while error rates are calculated for recognition; the results are presented above.

■ Conclusion
Improved segmentation algorithms such as R-CNN and Mask R-CNN could be used to improve results, as could a Conditional Random Field, either for post-processing or added directly.

■ Next must-read paper: “Progress and results of Kaggle Machine Learning Competition for Kuzushiji Recognition”


19. A unified method for augmented incremental recognition of online handwritten Japanese and English text

■ Summary
The proposed algorithm can be used for recognition while writing or for delayed recognition. Segmentation is followed by recognition after a fixed interval of strokes. This is done using three techniques: the structures built in the previous step are reused, strokes that were not recognized in the previous step are attempted again, and incomplete characters are skipped.

■ Related Works
Zhu (2010) used geometric features extracted from previous and next strokes for Japanese text recognition. Nakagawa (2006) used geometric and linguistic information to improve performance. Graves (2009) used bi-directional recurrent networks for feature extraction. Tanaka (2002) reported that online recognition decreases performance by 0.3.

■ Proposed Methodology
The offline segmentation is done in two steps: segmentation into lines and segmentation of each character. A classifier is used to classify each off-stroke into segmentation point (SP), non-segmentation point (NSP), or undecided point (UP). An SP separates two characters or words, an NSP indicates the stroke is within a character, and a UP is used for low-confidence predictions. An SVM is used for Japanese text and a BLSTM for English text to classify each stroke. Strokes separated by an SP or UP are considered a character/word or part of a character/word. These are then recognized and assigned a confidence score, and the Viterbi algorithm is used to search for the optimal combination giving the best score. For online recognition, the algorithm starts after some strokes are added. When resuming, segmentation is done from the segmentation pointer to the current stroke; the pointer is decided based on previous results. After segmentation, if the classification of a stroke has changed, the recognition scope runs from that stroke (the earliest classification-changed off-stroke, EccOs) to the latest stroke. Thus the previous and current recognition scopes overlap, which can be used to reduce processing time. SPs are used to split words into preceding and succeeding characters. After splitting, if a character was present in the previous scope or out of scope it is reused, otherwise recognition is done. Characters/words containing the last segment are treated as partial characters, and their recognition is postponed until complete patterns are received. A newly added stroke is treated as a delayed stroke if it is close to a previous segmentation result; the character/word with and without the delayed stroke is considered and the best path is searched.

■ Result
The TUAT Kondate dataset is used for Japanese text evaluation. The dataset contains text from 100 people, divided into 4 sets (each having text from 25 people), with 3 for training and 1 for testing. IAM-OnDB t2 is used for English text evaluation.

■ Next must-read paper: “An approach for real time recognition of online Chinese handwritten sentences”


20. Training an End-to-End Model for Offline Handwritten Japanese Text Recognition by Generated Synthetic Patterns

■ Summary
The proposed algorithm uses a deep convolutional neural network for feature extraction from the text-line image and a deep BLSTM as the recurrent network, trained using the Connectionist Temporal Classification (CTC) loss. Elastic distortions are used to synthesize images.

■ Related Works
Graves (2006) first introduced the CTC loss for handwritten text recognition. Graves (2009) combined BLSTM with the CTC loss to improve performance. Messina (2015) used a Multi-Dimensional LSTM with the CTC loss for Chinese text recognition. Puigcerver (2017) showed that multi-dimensional LSTMs are not necessary for text recognition.

■ Proposed Methodology
The images are first scaled to a fixed size, and Otsu's method is used to obtain binary images. The CNN architecture shown above is used for feature extraction. A bidirectional LSTM is used to pass information in both directions over the features extracted by the CNN, and the entire architecture is trained using the CTC loss. To create synthetic data, a random sentence and a different writer are chosen; a new sentence is formed using the characters of the sentence, but with the handwritten image of each character written by that writer. Each handwritten character undergoes local distortion (shearing, rotation, scaling, translation), and the entire sentence undergoes global distortion (rotation and scaling).

■ Result
The Kondate dataset is used to evaluate the model's performance and the effectiveness of the augmentation. Training the model end-to-end with augmentation achieved the best performance of 1.87 Label Error Rate and 13.81 Sequence Error Rate.

■ Next must-read paper: “Are Multidimensional Recurrent Layers Really Necessary for Handwritten Text Recognition?”


21. Attempts to recognize anomalously deformed Kana in Japanese historical documents

■ Summary
The work proposes three different algorithms: for recognizing single characters, for single lines of text using a CNN with BLSTM, and for multiple lines of text using a segmentation-based method.

■ Proposed Methodology
For recognizing single characters, Otsu's method is used to reduce background noise while enhancing the foreground. The result is padded so that the character is in the centre, followed by linear spatial normalization, resizing, and standardization. Rotation and shearing are used for augmentation. Experiments are conducted on multiple CNN backbones as well as LSTM architectures for character classification. For recognition of character sequences, a combined architecture of CNN, BLSTM, and CTC loss is used. The CNN architecture is first trained on single-character recognition; features can then be extracted using a sliding window with overlapping (DCRN-o), without overlapping (DCRN-wo), or using the features extracted from the last convolution layer (DCRN-ws). For the recurrent layer, a 1D-BLSTM is trained along with the CTC loss. For recognizing texts in multiple vertical lines, multiple approaches are presented, such as segmenting the vertical lines and concatenating them, followed by the previous algorithm. Another method uses a 2D-BLSTM with a CNN pretrained on single-character recognition, or object detection followed by a CNN pretrained on single-character recognition, without segmentation.

■ Result
The IEICE PRMU contest on recognizing anomalously deformed Kana in Japanese historical documents, with 3 tasks (recognizing a single character, one line of text, and multiple lines of text), is used to validate the approaches presented.

■ Next must-read paper: “Deep Convolutional Recurrent Network for Segmentation-free Offline Handwritten Japanese Text Recognition”


22. A Multiplexed Network for End-to-End, Multilingual OCR

■ Summary
This work proposes an end-to-end trainable pipeline that includes text detection and recognition. The algorithm uses multiple heads for recognizing different languages.

■ Related Works
The task of multilingual text recognition can be divided into 3 sub-tasks: text detection, script identification, and text recognition. Prior to the use of deep learning, hand-crafted features were used: Anil (1998), Lukas (2012), Cheng (2013). The success of deep learning methods first led to them being used in conjunction with hand-crafted features, while recent approaches use deep learning methods exclusively. Most recognition algorithms either use the Connectionist Temporal Classification (CTC) loss, Graves (2006), to convert features to a language sequence, or a Seq2Seq encoder-decoder framework with attention, Bahdanau (2014), for classification. Script identification is necessary for determining which language recognizer to use: Shi (2015) used a CNN with multi-stage pooling for classification, and Fujii (2015) reformulated the task as a sequence-to-label problem. Some recognition algorithms, such as Michal (2018), Youngmin (2020), and Pengyuan (2018), do not incorporate these components and have one recognition head for characters from all languages.

■ Proposed Methodology
For text detection, the algorithm uses a ResNet50 backbone with a U-Net structure. The Vatti clipping algorithm is used to shrink text regions, and RoI masking is used to suppress background and neighboring text instances. For recognition, a character segmentation module and a spatial attention module are used. The output of the detection and segmentation modules is the input of a language prediction module, which uses a small CNN architecture (2 conv layers, 1 FC) for classification. The language prediction and recognition heads are trained with

L_hard-integrated = α_seq · L_seq^{r*}, where r* = argmax_{1 ≤ l ≤ N_rec} p(l)

when N_rec different languages are supported; thus, for each word during training, the one recognition head having the highest confidence is selected. The per-head sequence loss is

L_seq^r = −(1/T) Σ_{t=1}^{T} [ I(c_t ∈ C_r) · log p(y_t = c_t) + I(c_t ∉ C_r) · β ]

Here C_r is the character set supported by head r and c_t is the ground truth; β is a hyperparameter penalizing unsupported characters.

■ Results
The ICDAR 2019 MLT dataset (MLT19) is used to evaluate the model's recognition performance.

■ Next must-read paper: “Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting”


23. E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text

■ Summary
This work proposes an end-to-end method for multi-language scene text localization and recognition.

■ Related Works
For scene text recognition, localization is an important step to obtain word-level bounding boxes or segmentation maps. Jaderberg (2016) used the output of edge boxes and channel features to obtain bounding boxes, with random forests used to filter the predictions and a CNN regressor for post-processing. Gupta (2016) used a CNN to detect objects at multiple scales. Tian (2016) used a CNN-RNN architecture to predict the presence of characters. For recognition, Jaderberg (2016) used VGG16 for classification over 90K words. Shi (2017) generates one word per image using a CNN with BLSTM and the Connectionist Temporal Classification loss. Lee (2016) used a CNN with RNN and soft-attention for recognition. Li (2017) used a convolutional recurrent network for text localization as well as text recognition. Liu (2018) used a shared CNN architecture for text localization and recognition.

■ Proposed Methodology
The algorithm uses ResNet34 with an FPN object detector. For an input, the localization module produces 7 outputs: the presence or absence of text, the 4 coordinates of the bounding box, and the orientation angle φ. The entire algorithm is trained using

L_final = L_geo + λ1 · L_angle + λ2 · L_dice + λ3 · L_CTC

Here L_angle is the mean squared error over sin(φ) and cos(φ), L_geo is an IoU loss, L_CTC is the word-level recognition loss, and L_dice is the dice loss calculated for predictions with more than 90% confidence.

■ Results
The ICDAR 2017 MLT dataset (MLT17) is used to evaluate the model's performance.

■ Next must-read paper: “FOTS: Fast oriented text spotting with a unified network”


24. OCR Survey of Datasets
    Tapas Dutta, Deep Learning Engineer


25. Generating Synthetic Data for Text Recognition

■ Summary
The work generates a 9M-image synthetic handwritten word corpus using open-source fonts and data augmentation schemes.

■ Related Works
Jaderberg (2014), Rozantsev (2015), and Ros (2016) used synthetic mechanisms for data generation and annotation. Sankar (2010) used rendered images for annotating large-scale datasets.

■ Proposed Methodology
Synthetic words can be generated either by rendering words in available fonts, or by learning separate parameters for style and content and modifying the style parameter with a deep learning model. This work uses the first technique, with publicly available fonts (750) and a vocabulary chosen from a dictionary (90K unique words). After randomly selecting a word and style, the inter-character space and stroke width are varied, with random augmentation within (−5 to +5) and shear (±0.5) in the horizontal direction.

■ Conclusion
To simulate cursive writing, elastic distortion could be used in future work.

■ Next must-read paper: “The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes”


26. A Database of On-line Handwritten Mixed Objects named “Kondate”

■ Summary
The work presents a database containing online handwritten text, figures, tables, maps, diagrams, etc. The database covers 100 Japanese, 25 English, and 45 Thai writers.

■ Proposed Methodology
Two writing strategies were used: copy writing, where the participant receives the pattern and content, and free writing, where neither is provided, so that realistic patterns can be obtained. Attributes of the participant and the environment, as well as of each stroke, are collected. The X/Y coordinates of each stroke from pen-down to pen-up are recorded along with the stroke id, time offset (from the pen-down of the first stroke to the pen-down of the current stroke), and duration.

■ Conclusion
The work presents a dataset with online text as well as figures, tables, maps, and diagrams.

■ Next must-read paper: “Arabic Handwriting Data Base for Text Recognition”


27. ICDAR2017 Robust Reading Challenge on Multi-lingual Scene Text Detection and Script Identification – RRC-MLT

■ Summary
The RRC-MLT challenge has tasks covering text detection, script classification, and the combination of all tasks necessary for multilingual text recognition. The dataset contains 18K images with text from 9 different languages.

■ Proposed Methodology
The dataset contains natural images with embedded text, such as street signs, advertisement boards, shop names, passing vehicles, and images from internet searches. It contains at least 2K images each for Arabic, Bangla, Chinese, English, French, German, Italian, Japanese, and Korean; text containing special characters from multiple languages is also available. Task 1 is text detection, for which 9K images were used for training
and the remaining 9K for testing. For a set of don't-care regions D = {d1, d2, … dk}, ground-truth regions G = {g1, g2, … gm}, and predictions T = {t1, t2, … tn}, a prediction tk is first matched against the don't-care regions and discarded as noise if area(dk) = 0 or area(dk ∩ tk) / area(dk) > 0.5. A filtered prediction t′j is considered positive if

area(gi ∩ t′j) / area(gi ∪ t′j) > 0.5

The set of positive matches M is used to calculate the F-score, with precision = |M| / |T′| and recall = |M| / |G|, thus

F-score = 2 · precision · recall / (precision + recall)
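A minimal sketch of this matching and F-score computation, simplified to axis-aligned boxes and a one-sided greedy match (the real MLT boxes are quadrilaterals).

```python
# Boxes as (x1, y1, x2, y2); a simplified stand-in for the polygon IoU used by MLT.
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def f_score(gt, preds, thr=0.5):
    matched = sum(any(iou(g, t) > thr for t in preds) for g in gt)  # |M|, greedy
    precision, recall = matched / len(preds), matched / len(gt)
    return 2 * precision * recall / (precision + recall + 1e-9)
```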
    . Team SCUT-DLVC lab used one model for rough
    detection followed by another for finely adjusting bounding box and
    postprocessing. CAS and Fudan university designed rotation region
    proposal network for inclined region proposals with angle information
    which is used for bounding box regression. This is followed by text region
    classifier. Sensetime group proposed an FCN based model with
    with deformable convolution which predicts whether a pixel is a text or
    not along with location if it is in a text box. TH-DL group used an FCN
    with residual connection to predict whether a pixel belongs to text,
    followed by binarizing at multiple thresholds. The connected components
    thus extracted along with their bounding boxes are used as region
    proposals. The features of region proposals after Rotation ROI pooling are
    fed to Fast R-CNN and non-max suppression for results. Linkage group
    extracted external regions from each channel followed by calculating
    linkage between components and pruning to get non-overlapping
    components, candidate lines are extracted and merged. Alibaba group
    used a CNN to detect text regions and predict their relation which are
    then grouped and used by a CNN+RNN to divide into words. For script
    identification 84K and 97K images were used for training andtesting
    respectively, with text Arabic, Bangla, Chinese, Japanese, Korean, Latin
    and Symbols. SCUT-DLVC labs used CNN with random crop for training
    and sliding windows for test images. Team CNN-based method used
    VGG16 initialized with ImageNet weights and cross-entropy loss with
    training images resized to fixed height and variable width for training.
Team TNet enhanced the dataset features, trained a deep network and used a majority vote for classification. Team BLCT extracted patches of variable sizes to train a 6-layer CNN model; features from the penultimate layer are randomly combined to build a bag of visual words with 1024 codewords, which serve as image representations and are aggregated into a histogram that is fed to two dense layers and one dropout layer for classification. Teams TH-DL and TH-CNN used a GoogLeNet-like structure, while Synthetic-ECN used an ECN-based structure.
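As a concrete illustration of the Task 1 protocol above, the following is a minimal Python sketch of the don't-care filtering and F-score computation. It assumes axis-aligned boxes (x1, y1, x2, y2) for simplicity, whereas the challenge evaluates quadrilaterals, and all function names are ours rather than the challenge toolkit's.

```python
# Minimal sketch of the RRC-MLT Task 1 evaluation, assuming axis-aligned
# boxes (x1, y1, x2, y2); the real challenge uses quadrilaterals.

def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersect(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]),
            min(a[2], b[2]), min(a[3], b[3]))

def iou(a, b):
    inter = area(intersect(a, b))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, dont_cares):
    # Discard predictions overlapping a don't-care region by more than 0.5.
    filtered = [t for t in preds
                if not any(area(d) > 0 and
                           area(intersect(d, t)) / area(d) > 0.5
                           for d in dont_cares)]
    # A filtered prediction is positive if IoU with an unmatched GT > 0.5.
    matched, positives = set(), 0
    for t in filtered:
        for i, g in enumerate(gts):
            if i not in matched and iou(g, t) > 0.5:
                matched.add(i)
                positives += 1
                break
    precision = positives / len(filtered) if filtered else 0.0
    recall = positives / len(gts) if gts else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```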

  28. 28
    ICDAR2017 Robust Reading Challenge on Multi-lingual Scene Text Detection and Script Identification – RRC-MLT
    Copyright © 2022 VIVEN Inc. All Rights Reserved
Team 8 used VGG16 with an SVM classifier. Task 3 combines text detection with bounding boxes and script classification; a prediction $t'_i$ is correct if $\frac{\mathrm{area}(g_i \cap t'_i)}{\mathrm{area}(g_i \cup t'_i)} > 0.5$ and $\mathrm{scriptid}(t'_i) = \mathrm{scriptid}(g_i)$ (see the sketch below). Team TH-DL used a combination of its methods from the previous tasks. Team SCUT-DLVC Lab trained two models, one for detection and one for classification; the classification model predicts background for high-confidence background boxes generated by the detector.
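A minimal sketch of the Task 3 criterion, reusing the `iou` helper from the Task 1 sketch above; the `box`/`script` field names are our assumption, not the challenge's data format.

```python
def task3_match(pred, gt):
    # Correct only if the boxes overlap (IoU > 0.5) AND the scripts agree.
    return iou(gt['box'], pred['box']) > 0.5 and pred['script'] == gt['script']
```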
     Results
    The results of Task-1, Task-2 and Task-3 are presented.
     Conclusion
    Future works are expected to tackle larger datasets and more languages.
     Next must-read paper: “Script identification in the wild via distinctive
    convolutional network”

  29. 29
    Recognizing Text with Perspective Distortion in Natural Scenes
    Copyright © 2022 VIVEN Inc. All Rights Reserved
     Summary
    The work proposes an algorithm for recognizing texts of arbitrary
    orientations. A new dataset StreetViewText-Perspective is also
    introduced.
     Related works
Smith(2011) and Weinman(2009) proposed a similarity constraint so that visually similar characters receive similar labels. Wang(2011) used an object recognition framework requiring all characters to be correctly recognized.
    Gandhi(2000) rectified texts using motion information. Li(2010)
    successfully recognized characters, but word recognition was not
    addressed.
     Proposed Methodology
The Matas(2002) algorithm is used to detect potential character locations, and non-maximal suppression selects one bounding box per character. The output is classified into text vs. non-text based on relative height, aspect ratio, horizontal crossings and holes, and the surviving bounding boxes are used as character candidates. Each character patch is normalized to 48 x 48, and a grid with 2-pixel spacing is used to extract dense SIFT features at each grid point at multiple scales. A bag-of-keypoints representation is used for matching since it ignores spatial information, allowing for more distortion between training and testing samples; an SVM with a histogram intersection kernel is used as the classifier. For word recognition, an alignment score measuring how well a character candidate matches a word is used, calculated as
$$\mathrm{Score}(c, l) = \begin{cases} P(l \mid c) & \text{if } l \neq \varepsilon \\ 1 - \mathrm{Confidence}(c) & \text{if } l = \varepsilon \end{cases}$$
The alignment score of an entire word is the sum of the alignment scores of its characters, $\mathrm{AlignScore}(a_w) = \sum_{i=1}^{n} \mathrm{Score}(c_i, w(a_w(i)))$, where $c_i$ is the $i$-th character and $w(a_w(i))$ is the label assigned to it (selected by taking the prediction with maximum confidence); a small sketch follows below. StreetViewText-Perspective contains the same images as StreetViewText, but side views are included such that they remain readable to humans.
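To make the alignment score concrete, here is a minimal Python sketch under our own assumptions: each candidate is a dict with per-label confidences (`probs`) and an overall `confidence`, and `EPSILON` marks a candidate aligned to no letter of the word; none of these names come from the paper.

```python
EPSILON = None  # marks a character candidate aligned to no letter

def score(char_probs, confidence, label):
    # Score(c, l) = P(l | c) if l != epsilon, else 1 - Confidence(c)
    if label is not EPSILON:
        return char_probs.get(label, 0.0)
    return 1.0 - confidence

def align_score(candidates, assigned_labels):
    # AlignScore(a_w) = sum_i Score(c_i, w(a_w(i)))
    return sum(score(c['probs'], c['confidence'], lab)
               for c, lab in zip(candidates, assigned_labels))
```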
     Results
The accuracies of various state-of-the-art methods are compared on SVT as well as SVT-Perspective.
     Conclusion
Most datasets assume horizontal text, which may not hold in the real world; thus SVT-Perspective is crucial.
 Next must-read paper: “Top-Down and Bottom-Up Cues for Scene Text Recognition”

  30. 30
    Deep Learning for Classical Japanese Literature
    Copyright © 2022 VIVEN Inc. All Rights Reserved
     Summary
The work introduces a dataset for cursive Japanese (Kuzushiji), along with the larger and more challenging datasets Kuzushiji-49 and Kuzushiji-Kanji.
     Proposed Methodology
The work pre-processed characters scanned from 35 books written in the 18th century and organized them into three parts: Kuzushiji-MNIST, a drop-in replacement for the MNIST dataset; Kuzushiji-49, an imbalanced dataset of 48 Hiragana characters plus one Hiragana iteration mark; and Kuzushiji-Kanji, an imbalanced dataset of 3,832 Kanji characters including some rare characters with few examples. Kuzushiji-49 has 266,407 images at 28 x 28 resolution (the same as Kuzushiji-MNIST), while Kuzushiji-Kanji has 140,426 images at 64 x 64 resolution (a small baseline sketch follows below).
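As a rough illustration of how such a baseline can be run, here is a minimal 1-NN sketch over Kuzushiji-MNIST. The .npz file names follow the rois-codh/kmnist distribution as we understand it; adjust the paths if your copy differs.

```python
import numpy as np

# Load Kuzushiji-MNIST (assumed filenames from the rois-codh/kmnist repo).
x_train = np.load('kmnist-train-imgs.npz')['arr_0']    # (60000, 28, 28)
y_train = np.load('kmnist-train-labels.npz')['arr_0']
x_test = np.load('kmnist-test-imgs.npz')['arr_0']      # (10000, 28, 28)
y_test = np.load('kmnist-test-labels.npz')['arr_0']

x_train = x_train.reshape(len(x_train), -1).astype(np.float32)
x_test = x_test.reshape(len(x_test), -1).astype(np.float32)

correct = 0
for i in range(len(x_test)):
    # 1-NN by Euclidean distance (simple but slow; fine for a baseline).
    d = np.linalg.norm(x_train - x_test[i], axis=1)
    correct += int(y_train[np.argmin(d)] == y_test[i])
print('1-NN accuracy:', correct / len(x_test))
```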
     Results
Baselines using nearest neighbour, a small CNN and ResNet architectures are reported for Kuzushiji-MNIST and Kuzushiji-49.
     Next Must-Read Paper: “Unsupervised image-to-image translation
    networks”

  31. 31
    Towards End-to-End License Plate Detection and Recognition: A Large Dataset and Baseline
    Copyright © 2022 VIVEN Inc. All Rights Reserved
     Summary
The work proposes a dataset of license plates from 250K cars. A novel lightweight algorithm is introduced that achieves state-of-the-art performance in real time.
     Related works
Caltech and Zemris collected fewer than 700 high-resolution images each. SSIG(2016) and UFPR(2018) collected images using road cameras. Hsieh(2002) used a morphological method to reduce the number of candidates, thus speeding up detection. Yu(2015) used a wavelet transform with empirical mode decomposition analysis to locate license plates. Ren(2015) used Faster R-CNN for effective detection, Liu(2016) used SSD, and Redmon(2016) used YOLO, approaching the task as a regression problem. Ho(2009) used license plate features directly without segmentation for recognition, Duan(2005) used an OCR system, Spanhel(2017) used a CNN model, Ciregan(2012) used a CNN after segmentation, and Abdel(2006) used SIFT features near the license plate for recognition.
     Proposed methodology
The license plate images were collected from a city parking management company. There are more than 250K unique images of size 720*1160*3, and each LP number contains one Chinese character, one letter, and five letters or digits. A Roadside Parking Net is introduced for joint detection and recognition: a 10-layer CNN extracts feature maps at different levels (the second, fourth and sixth layers), which are fed in parallel to a detection module of three fully connected layers. The recognition module applies region-of-interest (ROI) pooling to these feature maps to extract fixed-size features crucial for classification, which are then concatenated and used for recognition (a simplified sketch follows below).
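The following is a highly simplified PyTorch-style sketch of that multi-level-feature plus ROI-pooling design, not the paper's exact architecture; the channel sizes, the pooled output size and the seven per-character heads are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class TinyRPNetSketch(nn.Module):
    """Sketch: a 6-block CNN; features from blocks 2, 4 and 6 feed a box
    regressor, and ROI-pooled crops of the same three maps are
    concatenated for per-character classification."""

    def __init__(self, num_chars=7, num_classes=68):
        super().__init__()
        chs = [3, 16, 32, 64, 64, 128, 128]
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 3, padding=1),
                          nn.ReLU(), nn.MaxPool2d(2))
            for i in range(6)])
        self.box_head = nn.Sequential(nn.Flatten(), nn.LazyLinear(256),
                                      nn.ReLU(), nn.Linear(256, 4))
        self.char_heads = nn.ModuleList(
            [nn.LazyLinear(num_classes) for _ in range(num_chars)])

    def forward(self, x):
        n, _, H, W = x.shape
        feats = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i in (1, 3, 5):            # 2nd, 4th and 6th blocks
                feats.append(x)
        box = torch.sigmoid(self.box_head(feats[-1]))  # (n, 4): cx, cy, w, h
        x1 = (box[:, 0] - box[:, 2] / 2) * W
        y1 = (box[:, 1] - box[:, 3] / 2) * H
        x2 = (box[:, 0] + box[:, 2] / 2) * W
        y2 = (box[:, 1] + box[:, 3] / 2) * H
        idx = torch.arange(n, device=x.device, dtype=box.dtype)
        rois = torch.stack([idx, x1, y1, x2, y2], dim=1)   # (n, 5)
        # Pool the predicted plate region from each feature level.
        pooled = torch.cat([
            roi_pool(f, rois, output_size=(8, 16),
                     spatial_scale=f.shape[-1] / W).flatten(1)
            for f in feats], dim=1)
        return box, [head(pooled) for head in self.char_heads]

# box, char_logits = TinyRPNetSketch()(torch.randn(2, 3, 480, 768))
```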
     Results
    A prediction is considered correct if IoU is more than 0.6 and all
    characters are correctly recognized.
 Next must-read paper: “SSD: Single Shot MultiBox Detector”

  32. 32
    Chinese Street View Text: Large-scale Chinese Text Reading with Partially Supervised Learning
    Copyright © 2022 VIVEN Inc. All Rights Reserved
     Summary
The work introduces a large Chinese street view dataset of 430K images; however, only 30K are fully annotated while 400K are weakly annotated. A text reading network in a partially supervised learning framework is proposed to exploit both the fully and the weakly annotated data.
     Related works
ICDAR 2013 and ICDAR 2015, containing horizontal and multi-oriented text, were the first datasets used for training text reading models; Total-Text and SCUT-CTW1500 are used for curved text. Most text reading algorithms (Bartz 2017, 2018; Busta 2017) first localize and then recognize text in an end-to-end manner. The detection branch can use a region proposal network (Li 2017, Lyu 2018) or a CNN layer that directly predicts locations (Zhou 2017); recognition can use Connectionist Temporal Classification (Busta 2017, Liu 2018) or attention/LSTM layers (Li 2017, He 2018).
     Proposed methodology
430K images were obtained from real street signs in China: 29,966 images have all text locations and transcriptions marked, while 400K images have text with only roughly marked regions. ResNet50 with a feature pyramid network is used as the backbone; let F denote the extracted feature maps. Text/non-text classification is performed at each spatial location, and quadrangle offsets are predicted for localization. The detection branch is optimized with $L_{det} = L_{loc} + \lambda L_{cls}$, where $L_{loc}$ and $L_{cls}$ are the smooth-L1 and dice losses, respectively. A perspective RoI transform aligns the feature maps F into $F_p$. An encoder-decoder framework is used for recognition, with a CNN plus GRU as the encoder and an attention-based GRU as the decoder. Online Proposal Matching (OPM) locates text regions given only weakly annotated text: the image is fed through the whole network, the extracted characters are passed through an embedding layer, and they are compared with the annotated text via the Euclidean distance $d(i) = \frac{1}{T} \sum_{t=1}^{T} \lVert f(h_t, W) - f(e_t, W) \rVert$ for a word of length T, where $h_t$ is the prediction, $e_t$ the ground truth and W the embedding weights (a small sketch follows below).
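A minimal NumPy sketch of the OPM distance, where `embed` stands in for the learned embedding f(·, W); the function and argument names are ours, not the paper's.

```python
import numpy as np

def opm_distance(pred_states, gt_chars, embed):
    # d(i) = (1/T) * sum_t || f(h_t, W) - f(e_t, W) ||
    # pred_states: decoder states h_t; gt_chars: ground-truth chars e_t
    T = len(pred_states)
    return sum(np.linalg.norm(embed(h) - embed(e))
               for h, e in zip(pred_states, gt_chars)) / T

# Proposals whose distance to the annotated word is smallest are kept
# as the matched text regions for the weak labels.
```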
     Results
    The proposed method is compared against state-of-the-art methods in
    the given dataset.
 Next must-read paper: “Accurate scene text detection through border semantics awareness and bootstrapping”

  33. 33
    A robust arbitrary text detection system for natural scene images
    Copyright © 2022 VIVEN Inc. All Rights Reserved
     Summary
The work explores different pixel properties to identify text pixels and uses SIFT features to filter false candidates. A new dataset, CUTE80, is introduced for curved text evaluation.
     Related works
Chen(2004) extracted 79 features and used adaptive binarization to classify text and non-text pixels. Yi(2011) used gradient and color information for partitioning. Shivakumara(2013) used a combination of wavelet and median moments for detection. Epshtein(2010) used the stroke width transform on a Canny edge image for text detection.
     Proposed methodology
The proposed dataset CUTE80 has 80 indoor and outdoor images captured with a camera or taken from the internet. The work introduces three novel features, Mutual Direction Symmetry (MDS), Mutual Magnitude Symmetry (MMS) and Gradient Vector Symmetry (GVS), computed on Sobel and Canny images, which are used to separate text from background; SIFT features then eliminate false text candidates. The algorithm works for any text orientation since it is based on the ellipse property of text and is implemented with a nearest neighbour criterion, and since it involves no language-specific features it can handle multilingual text (a sketch of the gradient ingredients is given below).
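As a rough illustration of the raw ingredients only (not the paper's exact MDS/MMS/GVS definitions, which pair pixels across opposite stroke edges), the gradient direction, gradient magnitude and Canny edge maps can be obtained as follows.

```python
import cv2
import numpy as np

# Raw ingredients for gradient-symmetry features on a grayscale image.
img = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
magnitude = np.hypot(gx, gy)
direction = np.arctan2(gy, gx)          # radians in [-pi, pi]
edges = cv2.Canny(img, 100, 200) > 0    # boolean edge mask
# The symmetry features compare direction/magnitude of pixel pairs along
# opposite edges of a stroke; that pairing logic is paper-specific.
```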
     Results
The effectiveness of the proposed model is demonstrated.
 Next must-read paper: “Detecting texts of arbitrary orientations in natural images”

  34. 34
    Scene Text Recognition using Higher Order Language Priors
    Copyright © 2022 VIVEN Inc. All Rights Reserved
     Summary
    The work introduces an algorithm that uses higher order statistical
    features for recognition as well as a large recognition dataset.
     Related works
The text recognition task can be divided into sub-tasks: text detection, character recognition and word recognition. These tasks are tackled either individually, by Campos(2009), Epshtein(2010) and Chen(2004), or jointly, by Neumann(2012) and Wang(2011). The works of Mishra(2012), Smith(2011) and Wang(2011) succeed at recognizing text in limited settings.
     Proposed methodology
The proposed algorithm uses a conditional random field (CRF) based model for recognition. Random variables $x = \{x_i \mid i \in V\}$ are defined, where each $x_i$ represents a potential character and can take a label from the label set L. The most likely labeling is found by minimizing the energy function $E(x) = \sum_{c \in C} \varphi_c(x_c)$, where each $c \in C$ is a subset (clique) of V and $x_c$ is the set of random variables in c. Sliding windows with aspect ratios as in Mishra(2012) are used, with one node per sliding window and the characters ordered sequentially from left to right. An auxiliary variable $x_c^a$ is used for each $c \in C$, which takes h-gram combinations over L for the higher order h used in the CRF, so the model can capture a large context. A non-auxiliary node takes a label from L with an associated unary cost $\varphi(x_i = y_j) = 1 - p(l_j \mid x_i)$, where $p(l_j \mid x_i)$ is the SVM score of label $l_j$ for node $x_i$. The pairwise cost for neighbouring characters is $\varphi(x_i, x_j) = \lambda (1 - p(l_i, l_j))$, where $\lambda$ is the penalty for the pair occurring together. The higher-order cost for an auxiliary node $x_i$ taking label $L_k$ and the leftmost non-auxiliary node $x_j$ taking label $L_l$ is
$$\varphi_2^a(x_i = L_k, x_j = L_l) = \begin{cases} 0 & \text{if } l = i \\ \lambda_{a'} & \text{otherwise.} \end{cases}$$
A small sketch of the energy evaluation follows below. The IIIT-5K dataset contains scene text and born-digital images of low resolution with a variety of styles. The images were annotated with bounding boxes and ground truth, and divided into easy and hard categories based on visual appearance.
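To make the objective concrete, here is a minimal sketch that evaluates E(x) for a given labeling using the unary and pairwise terms (the higher-order auxiliary terms follow the same pattern); `unary_svm` and `pair_prob` stand in for the SVM scores and pair priors and are our naming, not the paper's.

```python
def energy(labels, unary_svm, pair_prob, lam=1.0):
    # labels:    sequence of labels l_i, one per sliding-window node
    # unary_svm: unary_svm[i][l] = p(l | x_i), the SVM score
    # pair_prob: pair_prob[(l_i, l_j)] = prior of the pair co-occurring
    e = sum(1.0 - unary_svm[i][l] for i, l in enumerate(labels))
    e += sum(lam * (1.0 - pair_prob.get((a, b), 0.0))
             for a, b in zip(labels, labels[1:]))
    return e
```

CRF inference then searches over labelings to minimize this energy rather than evaluating a single candidate as here.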
     Results
The performance of the proposed algorithm on multiple datasets, with pairwise and higher-order costs, is reported.
 Next must-read paper: “Top-down and bottom-up cues for scene text”

  35. 35
    Collection of on-line handwritten Japanese character pattern databases and their analyses
    Copyright © 2022 VIVEN Inc. All Rights Reserved
     Summary
The work collected two online handwritten Japanese character pattern databases totalling more than 3 million patterns: one from 120 people contributing about 12K patterns each, the other from 163 people contributing about 10K patterns each.
     Related works
The UNIPEN dataset of Guyon(1994) is popular for online character recognition but does not include oriental characters.
     Proposed methodology
Collecting online handwritten data requires specialized tools (pen PCs, PCs with tablet interfaces), which makes collection difficult; thus display-integrated tablets (DITs) are used. For effective training, a large amount of data must be collected from each individual in order to learn that individual's handwriting pattern. Boxes are displayed so that characters are written within them, and pen-tip coordinates relative to the page are recorded; people are asked to write according to given sequences. It is generally recognized that people write unnaturally neatly when writing text without meaningful context but write casually when writing sentences, so participants are asked to write sentences covering frequently used characters, followed by writing frequently used characters one by one. Some characters cannot be written without being seen, and some Kanji characters could otherwise be written in Kana, so each character is displayed before being written. Participant details are recorded for later use, and the sentences used are collected from newspapers. A collection tool was built on a DOS/V machine with a display-integrated tablet running MS Windows: given a text file, the corresponding character and a writing box (1.7cm*1.7cm or 1.43cm*1.43cm) are displayed for the participant to write in. A verification tool identifies erroneous characters, which are checked by humans and reported back to the participant. The resulting Kuchibue dataset has 120 participants, each contributing 11,962 character patterns; Nakayosi has 163 participants with 10,403 patterns per person.
 Next must-read paper: “A new warping technique for normalizing likelihood of multiple classifiers and its effectiveness in combined on-line/off-line Japanese character recognition”

  36. 36
    Copyright © 2022 VIVEN Inc. All Rights Reserved
Company Overview
Company name: 株式会社 微分 (VIVEN, Inc.)
Representative: Shintaro Yoshida (吉田 慎太郎)
Location: JustCo Shinjuku, 18F JR Shinjuku Miraina Tower, 4-1-16 Shinjuku, Shinjuku-ku, Tokyo
Founded: October 2020
Capital: JPY 7,000,000 (as of October 2022)
Employees: 20 (all employment types included)
Business: development of "School DX" software for educational institutions; web application development; R&D in image recognition and natural language processing

  37. 37
    Copyright © 2022 VIVEN Inc. All Rights Reserved
