Slide 27
ICDAR2017 Robust Reading Challenge on Multi-lingual Scene Text Detection and Script Identification – RRC-MLT
Copyright © 2022 VIVEN Inc. All Rights Reserved
Summary
The RRC-MLT challenge comprises text detection and script identification, tasks that are necessary for multilingual text recognition. The dataset contains 18K images with text from 9 different languages.
Proposed Methodology
The dataset contains natural images with embedded text, such as street
signs, advertisement boards, shop names, passing vehicles and images from
internet search. It includes at least 2K images for each of Arabic,
Bangla, Chinese, English, French, German, Italian, Japanese and Korean.
Text containing special characters from multiple languages is also
present. Task 1 is text detection, with 9K images for training and the
remaining 9K for testing. Given a set of don't-care regions
D = {d1, d2, …, dk}, ground-truth regions G = {g1, g2, …, gm} and
predictions T = {t1, t2, …, tn}, a prediction tj is first matched against
the don't-care regions and discarded as noise if
area(dk ∩ tj) / area(tj) > 0.5. A filtered prediction t'j is considered a
positive match for a ground-truth region gi if
area(gi ∩ t'j) / area(gi ∪ t'j) > 0.5. The set of positive matches M
gives precision = |M| / |T'| and recall = |M| / |G|, thus
F-score = (2 · precision · recall) / (precision + recall).
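The matching and scoring protocol can be sketched as follows. This is a minimal illustration using axis-aligned boxes (x1, y1, x2, y2) for simplicity, whereas the challenge evaluates quadrilateral regions; all function names are illustrative, not from the challenge toolkit.

```python
# Sketch of the RRC-MLT Task 1 evaluation: filter predictions against
# don't-care regions, match the rest to ground truth by IoU > 0.5,
# then compute precision, recall and F-score.

def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); 0 if degenerate."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection(a, b):
    """Area of the overlap between two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def evaluate(preds, gts, dont_cares, thr=0.5):
    """Return (precision, recall, f_score) under the rules above."""
    # Discard predictions mostly covered by a don't-care region:
    # area(d ∩ t) / area(t) > 0.5
    filtered = [t for t in preds
                if not any(intersection(d, t) / area(t) > thr
                           for d in dont_cares)]
    # A filtered prediction is a positive match if its IoU with an
    # as-yet-unmatched ground-truth region exceeds the threshold.
    matched, matches = set(), 0
    for t in filtered:
        for i, g in enumerate(gts):
            iou = intersection(g, t) / (area(g) + area(t)
                                        - intersection(g, t))
            if i not in matched and iou > thr:
                matched.add(i)
                matches += 1
                break
    precision = matches / len(filtered) if filtered else 0.0
    recall = matches / len(gts) if gts else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```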
Team SCUT-DLVC lab used one model for rough
detection, followed by another that finely adjusts the bounding boxes,
plus postprocessing. CAS and Fudan University designed a rotation region
proposal network for inclined region proposals with angle information
which is used for bounding-box regression; this is followed by a
text-region classifier. Sensetime group proposed an FCN-based model with
deformable convolutions that predicts whether each pixel is text and, if
it lies in a text box, regresses the box location. TH-DL group used an FCN
with residual connections to predict whether a pixel belongs to text,
followed by binarizing at multiple thresholds. The connected components
thus extracted along with their bounding boxes are used as region
proposals. The features of region proposals after Rotation ROI pooling are
fed to Fast R-CNN and non-max suppression for results. Linkage group
extracted extremal regions from each channel, then computed linkages
between components and pruned them to obtain non-overlapping components;
candidate text lines were then extracted and merged. Alibaba group
used a CNN to detect text regions and predict their relations; the
regions are then grouped and divided into words by a CNN+RNN. For script
identification, 84K and 97K images were used for training and testing
respectively, with text in Arabic, Bangla, Chinese, Japanese, Korean,
Latin and Symbols. SCUT-DLVC labs used a CNN with random crops for training
and sliding windows over test images. Team CNN-based method used VGG16
initialized with ImageNet weights and trained with cross-entropy loss on
images resized to a fixed height and variable width.
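The sliding-window inference over test images can be sketched as follows; `classify_window` is a hypothetical stand-in for the trained CNN, and the window size and stride are illustrative assumptions, not values from the paper.

```python
# Sketch of sliding-window script classification with a majority vote
# over per-window predictions.
from collections import Counter

def sliding_windows(width, win, stride):
    """Left/right x-coordinates of windows covering [0, width]."""
    xs = list(range(0, max(width - win, 0) + 1, stride))
    if xs and xs[-1] != width - win:
        xs.append(width - win)  # make sure the right edge is covered
    return [(x, x + win) for x in xs]

def predict_script(image_width, classify_window, win=32, stride=16):
    """Classify every window, then take the majority vote."""
    votes = [classify_window(x0, x1)
             for x0, x1 in sliding_windows(image_width, win, stride)]
    return Counter(votes).most_common(1)[0][0]
```

For a 128-pixel-wide word image this yields seven overlapping windows whose per-window predictions are pooled by voting.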
TNet enhanced the dataset's features, then trained a deep network and
classified by majority vote. Team BLCT extracted patches
of variable sizes to train a 6-layer CNN model. Features from the
penultimate layer are randomly combined and used to build a
bag-of-visual-words vocabulary of 1024 codewords; each image is
represented as a histogram over these codewords, which is fed to two
dense layers and one dropout layer for classification. Team Th-DL's
TH-CNN used a GoogLeNet-like structure, while Synthetic-ECN used an
ECN-based structure.
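The bag-of-visual-words representation used by Team BLCT can be sketched as follows. The codebook here is a hypothetical stand-in: the actual 1024-codeword vocabulary would be learned by clustering CNN features from the training set.

```python
# Sketch of a bag-of-visual-words image representation: assign each
# feature to its nearest codeword, then build a normalized histogram.
import math

def nearest_codeword(feature, codebook):
    """Index of the codeword closest to `feature` (Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(feature, codebook[i]))

def bovw_histogram(features, codebook):
    """Histogram of codeword assignments, normalized to sum to 1."""
    hist = [0.0] * len(codebook)
    for f in features:
        hist[nearest_codeword(f, codebook)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]
```

The resulting fixed-length histogram is what the dense layers would consume for classification, regardless of how many patches an image produced.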