
Fully Convolutional Networks with Variants of Softmax Cross Entropy Loss

Keigo Kubo (LINE Corporation, Data Labs Machine Learning Team)
Presentation slides from LINE AI Talk #01.
https://line.connpass.com/event/122457/

LINE Developers

March 26, 2019

Transcript

  1. Self-Introduction
     - Name: Keigo Kubo
     - Affiliation: LINE Data Labs Machine Learning Team
     - Work at LINE:
       - Recommendation for Themeshop, LINE LIVE, and LINE Delima
       - Near-duplicate search for fraud detection in LINE Creators Studio
       - Technical assistance in Clova Japanese speech recognition
       - Text area estimation from creative images for the AD auto review system
  2. Fully Convolutional Networks with Variants of Softmax Cross Entropy Loss (Keigo Kubo, 2019-03-26)
  3. Table of Contents
     - Introduction
       - Softmax Cross Entropy Loss and the motivation behind its variants
       - Fully Convolutional Networks
     - Problems in machine learning for classification tasks
     - Variants of Softmax Loss
     - How to apply the variants in Fully Convolutional Networks (FCN)
     - Experiment in AD text area estimation using FCN
     - Conclusion and future work
  4. Softmax Cross Entropy Loss [Bridle, John S. 1990]
     - Used as the cost function for multi-class classification tasks in machine learning
       - Also referred to as Softmax Loss
     - Defined as $L_i = -\log \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$, where generally $f_j = W_j^T x_i + b_j$ (a runnable sketch follows below).
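As a minimal sketch of this loss, assuming the standard definition above (the function name and example values are mine, not from the slides):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Softmax Cross Entropy Loss for one example.

    logits: shape (K,) vector of scores f_j = W_j^T x + b_j
    label:  integer index of the correct class
    """
    shifted = logits - logits.max()          # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]                 # -log p(correct class)

# Example: three classes, class 0 is the correct label
loss = softmax_cross_entropy(np.array([2.0, 1.0, 0.1]), 0)
print(round(float(loss), 4))                 # ~0.417
```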
  5. Motivation behind Variants of Softmax Loss
     - There is demand for improving performance without extra inference time:
       - Real-time applications
       - Daily batches that perform a lot of inference
       - Power-consumption reduction
     - Most of the variants were proposed for the face verification task:
       - The task requires real-time inference.
     Fig. 1: Pipeline of face verification model training and testing using a classification loss function, from Wang, Feng, et al. 2017
  6. Fully Convolutional Networks (FCN) [Long, Jonathan, et al. 2015]
     - A kind of neural network that consists only of convolutional layers
       - Performs pixel-wise predictions in the output layer, e.g. by softmax
       - Mainly used in tasks such as semantic segmentation
     Fig. 2: Fully convolutional networks for semantic segmentation, from Long, Jonathan, et al. 2015
  7. Table of Contents
     - Introduction
     - Problems in machine learning for classification tasks
       - Class Imbalance Problem and Within-class Imbalance Problem
     - Variants of Softmax Loss
     - How to apply the variants in Fully Convolutional Networks (FCN)
     - Experiment in AD text area estimation using FCN
     - Conclusion and future work
  8. Problems in machine learning for classification tasks
     - Class Imbalance Problem
       - A class that occurs frequently tends to be overestimated.
     - Within-class Imbalance Problem
       - Also referred to as the Small Disjuncts Problem [Ali, Aida., et al. 2015].
       - When there are implicit major/minor sub-classes within a class, the minor sub-classes tend not to be estimated correctly.
     Fig. 3: Small disjuncts degrade classification learning and feature extraction in a neural network (input, hidden, and output layers).
     Examples of implicit sub-classes:
       Implicit sub-class | Major      | Minor
       user               | heavy user | light user
       voice tone         | ordinary   | high/low
       text color         | black      | rainbow
       background         | sky, house | fence
  9. Explicit Solutions over Small Disjuncts
     - Clustering per class is performed to obtain implicit sub-classes as clusters.
       - Uniformly select a class, then uniformly sample data from the clusters within that class [Jo, Taeho, et al. 2004]
       - Optimize the embeddings to satisfy constraints stronger than the triplet constraint, using the quintuplets in (b) [Huang, Chen, et al. 2016]
     Fig. 4: Embeddings by (a) triplet vs. (b) quintuplet, adapted from Huang, Chen, et al. 2016, where D(·,·) is the Euclidean distance.
  10. Table of Contents
     - Introduction
     - Problems in machine learning for classification tasks
     - Variants of Softmax Loss
       - Normalization
       - Mining
       - Margin
       - Ensemble (only in training)
     - How to apply the variants in Fully Convolutional Networks (FCN)
     - Experiment in AD text area estimation using FCN
     - Conclusion and future work
  11. Normalization of Weights and Embedding in Softmax Loss
     - The inner product in softmax is transformed as $W_j^T x = \|W_j\|_2 \|x\|_2 \cos\theta_j$, where $\theta_j$ is the angle between $W_j$ and the embedding $x$.
     - $\|W_j\|_2$ and/or $\|x\|_2$ are normalized to be $1$ and/or $s$ in training (see the sketch below):
       - Weight-Normalized Softmax (W-Softmax) Loss [Liu, Weiyang, et al. 2017]: $\|W_j\|_2 = 1$
       - L2-Constrained Softmax (L2-Softmax) Loss [Ranjan, Rajeev, et al. 2017]: $\|x\|_2 = s$
         - The scale factor $s$ is empirically determined to be between 30 and 50.
       - Congenerous Cosine (COCO) Loss [Liu, Yu, et al. 2017]: both are normalized.
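A minimal sketch of these normalizations under the cosine form above (the function name and flags are mine, not from the papers):

```python
import numpy as np

def normalized_logits(W, x, s=30.0, norm_weights=True, norm_embedding=True):
    """Cosine-based logits for the normalization variants.

    W: (K, d) class weight matrix; x: (d,) embedding.
    norm_weights=True   -> ||W_j|| = 1 (W-Softmax style)
    norm_embedding=True -> ||x|| fixed to the scale s (L2-Softmax style)
    With both True, the logits become s * cos(theta_j) (COCO style).
    """
    if norm_weights:
        W = W / np.linalg.norm(W, axis=1, keepdims=True)
    if norm_embedding:
        x = s * x / np.linalg.norm(x)
    return W @ x

# Example with 3 classes and 4-dimensional embeddings
rng = np.random.default_rng(0)
print(normalized_logits(rng.normal(size=(3, 4)), rng.normal(size=4)))
```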
  12. Effect of Normalizing the Embedding
     - Ranjan, R., et al. report that the L2-norm of the embedding is related to image quality in face recognition, as in Fig. 5.
       - Face direction can be considered an implicit sub-class.
       - Frontal faces are the major sub-class because they occur frequently.
       - Downward faces are minor sub-classes.
     - By normalizing the embedding, the amount of update depends on the L2-norm of the embedding through the gradient: for $y = x / \|x\|_2$, $\frac{\partial y}{\partial x} = \frac{1}{\|x\|_2}\left(I - \frac{x x^T}{\|x\|_2^2}\right)$, so low-norm embeddings receive larger updates (numeric check below).
       - The minor sub-classes are not ignored.
     Fig. 5: Examples at each level of L2-norm, adapted from Ranjan, Rajeev, et al. 2017
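A small numeric check of the Jacobian above (my illustration, assuming the standard normalization gradient; the slide only shows the formula):

```python
import numpy as np

def normalize_jacobian(x):
    """Jacobian of y = x / ||x||: (I - y y^T) / ||x||.
    Its magnitude is inversely proportional to ||x||, so low-norm
    (minor sub-class) embeddings receive larger updates."""
    n = np.linalg.norm(x)
    y = x / n
    return (np.eye(x.size) - np.outer(y, y)) / n

x = np.array([3.0, 4.0])                            # ||x|| = 5
print(np.linalg.norm(normalize_jacobian(x)))        # 0.2  = 1/5
print(np.linalg.norm(normalize_jacobian(10 * x)))   # 0.02 = 1/50
```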
  13. Table of Contents
     - Introduction
     - Problems in machine learning for classification tasks
     - Variants of Softmax Loss
       - Normalization
       - Mining
       - Margin
       - Ensemble (only in training)
     - How to apply the variants in Fully Convolutional Networks (FCN)
     - Experiment in AD text area estimation using FCN
     - Conclusion and future work
  14. Mining in Softmax Loss
     - Softmax Loss is multiplied by a factor to focus on high-loss examples, which in most cases belong to minor classes and minor sub-classes.
       - Online Hard Example Mining (OHEM) [Shrivastava, Abhinav, et al. 2016]
         - The factor is 1 for hard examples and 0 for non-hard examples.
         - Hard examples are selected online based on the top-k loss.
       - Focal Loss (FL) [Lin, Tsung-Yi, et al. 2018]
         - $FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$ relatively increases the amount of error on high-loss examples, where $p_t$ is the probability of the correct label and $\gamma$ is a hyper-parameter (sketch below).
     Fig. 6: The amount of error in Softmax Loss (CE) and FL
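A minimal NumPy sketch of Focal Loss under the definition above (names and example values are mine):

```python
import numpy as np

def focal_loss(logits, label, gamma=2.0):
    """Focal Loss: softmax cross entropy scaled by (1 - p_t)^gamma.

    Down-weights easy examples (p_t near 1) so that training focuses
    on hard, often minority, examples.
    """
    shifted = logits - logits.max()              # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    p_t = probs[label]                           # probability of the correct label
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

logits = np.array([2.0, 1.0, 0.1])
print(focal_loss(logits, 0))   # fairly easy example: much smaller than plain CE (~0.417)
```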
  15. Support Vector Guided Softmax (SV-Softmax) Loss for Face Recognition [Wang, Xiaobo, et al. 2018]
     - Enhances the loss when scores of incorrect labels are higher than that of the correct label:
       $L = -\log \frac{e^{s\cos\theta_y}}{e^{s\cos\theta_y} + \sum_{k \neq y} h(t, \theta_k, I_k)\, e^{s\cos\theta_k}}$
       - where, following Wang, Xiaobo, et al. 2018, $I_k = 1$ if $\cos\theta_k > \cos\theta_y$ (a "support vector" class) and $0$ otherwise, and $h(t, \theta_k, I_k) = e^{s(t-1)(\cos\theta_k + 1)\,I_k}$ with $t \geq 1$.
       - $\cos\theta_k + 1$ has range 0 to 2, so an example is penalized when the correct label's score is less than an incorrect label's score (sketch below).
     Comparison with Focal Loss:
       Data      | Probabilities of each label (the correct label, shown in red on the slide, is 0.4) | Focal Loss | SV-Softmax
       Example 1 | (0.4, 0.5, 0.01, …, 0.001)                                                         | enhance    | enhance
       Example 2 | (0.4, 0.01, 0.01, …, 0.01)                                                         | enhance    | not enhance
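A sketch of this scheme (my simplification of the paper's loss; the values of s and t are illustrative, not from the slides):

```python
import numpy as np

def sv_softmax_loss(cos_theta, label, s=30.0, t=1.2):
    """SV-Softmax: extra penalty on "support vector" classes, i.e.
    incorrect classes whose cosine score exceeds the correct one.

    cos_theta: (K,) cosine similarities to each class weight.
    """
    mask = cos_theta > cos_theta[label]            # indicator I_k
    mask[label] = False                            # never penalize the correct class
    h = np.where(mask, np.exp(s * (t - 1.0) * (cos_theta + 1.0)), 1.0)
    weighted = np.exp(s * cos_theta) * h           # h = 1 everywhere except support vectors
    return -np.log(np.exp(s * cos_theta[label]) / weighted.sum())

cos_theta = np.array([0.4, 0.5, 0.01])   # class 0 correct but outscored by class 1
print(sv_softmax_loss(cos_theta, 0))      # larger than with t = 1.0 (plain softmax)
```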
  16. Table of Contents
     - Introduction
     - Problems in machine learning for classification tasks
     - Variants of Softmax Loss
       - Normalization
       - Mining
       - Margin
       - Ensemble (only in training)
     - How to apply the variants in Fully Convolutional Networks (FCN)
     - Experiment in AD text area estimation using FCN
     - Conclusion and future work
  17. Margin in Softmax Loss
     - Large-Margin Softmax (L-Softmax) Loss [Liu, Weiyang, et al. 2016]
       - The angle to the correct class is multiplied by a margin $m$ as a penalty:
         $L_i = -\log \frac{e^{\|W_{y_i}\| \|x_i\| \cos(m\theta_{y_i})}}{e^{\|W_{y_i}\| \|x_i\| \cos(m\theta_{y_i})} + \sum_{j \neq y_i} e^{\|W_j\| \|x_i\| \cos\theta_j}}$
     Fig. 7: Examples of the geometric interpretation, adapted from Liu, Weiyang, et al. 2016
  18. Modification of $\cos(m\theta)$ in L-Softmax
     - $\cos(m\theta)$ is modified as below because it does not decrease monotonically over $[0, \pi]$ (sketch below):
       $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$, for $\theta \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$ and $k \in [0, m-1]$.
     Fig. 8: Graph of $\cos(3\theta)$ and its modification $(-1)^k \cos(3\theta) - 2k$ when $m = 3$. The gradient is 0 at the segment boundaries.
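A small sketch of the modified function (the vectorized computation of k is mine):

```python
import numpy as np

def psi(theta, m):
    """Monotonically decreasing replacement for cos(m * theta) on [0, pi].

    psi(theta) = (-1)**k * cos(m * theta) - 2*k
    for theta in [k*pi/m, (k+1)*pi/m], k = 0, ..., m-1.
    """
    k = np.minimum(np.floor(theta * m / np.pi), m - 1).astype(int)
    return ((-1.0) ** k) * np.cos(m * theta) - 2.0 * k

theta = np.linspace(0.0, np.pi, 7)
print(psi(theta, 3))   # decreases monotonically from 1 to -5
```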
  19. Differential of $\cos(m\theta)$ in L-Softmax
     - $m$ must be an integer due to the use of the multiple-angle formula to differentiate with respect to $W_{y_i}$ and $x_i$.
       - E.g., the double-angle formula: $\cos(2\theta) = 2\cos^2\theta - 1$
     - However, $m$ can be a real number by using $\arccos$, i.e. computing $\theta = \arccos\left(\frac{W_{y_i}^T x_i}{\|W_{y_i}\|\|x_i\|}\right)$ directly.
  20. Variants of Margin in Softmax Loss
     - Angular Softmax (A-Softmax, SphereFace) [Liu, Weiyang, et al. 2017]
       - Normalizes weights ($\|W_j\|_2 = 1$); target logit $\|x\|\,\psi(\theta_y)$
       - Angular Focus Softmax (AF-Softmax) also applies Focal Loss [Li, Jian, et al. 2018].
     - Soft-Margin Softmax (SM-Softmax) [Liang, Xuezhi, et al. 2017] [Wang, Xiaobo, et al. 2018]
       - Subtracts a margin in Euclidean space: target logit $W_y^T x - m$
     - Additive Margin Softmax (AM-Softmax, CosineFace) [Wang, Feng, et al. 2018]
       - Normalization plus a margin subtracted in Euclidean space: target logit $s(\cos\theta_y - m)$
     - Additive Angular Margin Softmax (ArcFace) [Deng, Jiankang, et al. 2018]
       - Normalization plus an additive angular margin: target logit $s\cos(\theta_y + m)$ (comparison sketch below)
     Fig. 9: Target logit curves for each variant, adapted from Deng, Jiankang, et al. 2018. The "Reverse gradient" annotation marks the region where a logit curve becomes non-monotonic in $\theta$.
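A sketch comparing the target-class logits of these variants (my simplification: a shared scale s replaces the ||x|| factor of A-Softmax, and the monotonic ψ fix from slide 18 is omitted; the margin values are illustrative):

```python
import numpy as np

def margin_logit(cos_theta, variant, m, s=30.0):
    """Target-class logit for each margin variant, given the cosine
    computed from normalized weights/embeddings where required."""
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if variant == "multiplicative":   # L-Softmax / A-Softmax style
        return s * np.cos(m * theta)
    if variant == "additive_cosine":  # AM-Softmax / CosineFace
        return s * (cos_theta - m)
    if variant == "additive_angular": # ArcFace
        return s * np.cos(theta + m)
    raise ValueError(variant)

for variant, m in [("multiplicative", 4), ("additive_cosine", 0.35), ("additive_angular", 0.5)]:
    print(variant, margin_logit(0.8, variant, m))   # all below the plain s * 0.8 = 24.0
```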
  21. Table of Contents
     - Introduction
     - Problems in machine learning for classification tasks
     - Variants of Softmax Loss
       - Normalization
       - Mining
       - Margin
       - Ensemble (only in training)
     - How to apply the variants in Fully Convolutional Networks (FCN)
     - Experiment in AD text area estimation using FCN
     - Conclusion and future work
  22. Ensemble Soft-Margin Softmax (EM-Softmax) Loss [Wang, Xiaobo, et al. 2018]
     - In training, multiple softmax classifiers are trained with SM-Softmax plus the Hilbert-Schmidt Independence Criterion (HSIC) to obtain model diversity.
       - The total loss combines each classifier's SM-Softmax loss with an HSIC penalty between the classifiers' weights, where the penalty weight is a hyper-parameter.
     - In prediction, the averaged weights $W = \frac{1}{V}\sum_{v=1}^{V} W^{(v)}$ of the $V$ classifiers are used.
     - HSIC penalizes dependency between classifiers: a high HSIC value means dependent classifiers, a low value means diverse ones (sketch below).
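A minimal sketch of an empirical HSIC with linear kernels as a diversity penalty (an assumption on my part; the paper's exact kernel and normalization may differ):

```python
import numpy as np

def hsic(X, Y):
    """Empirical HSIC with linear kernels: tr(K H L H) / (n - 1)^2.
    Near zero when the rows of X and Y are statistically independent,
    large when they are strongly dependent."""
    n = X.shape[0]
    K, L = X @ X.T, Y @ Y.T                 # linear-kernel Gram matrices
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(10, 5)), rng.normal(size=(10, 5))
print(hsic(W1, W2), hsic(W1, W1))   # the self term dominates the cross term
```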
  23. Effect of the Ensemble
     - Accuracy on CIFAR10 and CIFAR100 improves as the ensemble size increases,
       - although the improvement from increasing the ensemble is limited by the number of classes.
     Fig. 10: The effect of the ensemble on CIFAR10 and CIFAR100, respectively, adapted from Wang, Xiaobo, et al. 2018
  24. Table of Contents
     - Introduction
     - Problems in machine learning for classification tasks
     - Variants of Softmax Loss
     - How to apply the variants in Fully Convolutional Networks (FCN)
     - Experiment in AD text area estimation using FCN
     - Conclusion and future work
  25. How to Apply the Variants in Fully Convolutional Networks (FCN)
     - Normalization and Margin require $\|W_j\|_2$ and $\|x\|_2$, and $\|x\|_2$ is required at every pixel in FCN.
     - Per-pixel $\|x\|_2$ is obtained by squaring the feature map at each pixel, convolving it with a filter whose values are all 1 and whose shape is the same as $W_j$, and taking the square root of the result (see the sketch below).
     Diagram (slide 25): the feature map is squared per pixel, convolved with the all-ones filter, and square-rooted to give $\|x\|_2$ per pixel; $\cos\theta_j$ then follows per pixel for each class, where K is the number of classes.
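A sketch of this trick in NumPy (naive loops for clarity; the shapes and names are mine):

```python
import numpy as np

def perpixel_embedding_norm(feat, kh, kw):
    """||x||_2 at every output pixel of a conv layer.

    feat: (C, H, W) input feature map; the class weight W_j has shape
    (C, kh, kw). Squaring feat and convolving with an all-ones filter
    of that shape sums x^2 over each receptive field, so the square
    root of the result is the per-pixel embedding norm.
    """
    sq = feat ** 2
    C, H, W = feat.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):            # "valid" convolution with the
        for j in range(out.shape[1]):        # all-ones (C, kh, kw) filter
            out[i, j] = sq[:, i:i + kh, j:j + kw].sum()
    return np.sqrt(out)

feat = np.random.default_rng(0).normal(size=(8, 16, 16))
norms = perpixel_embedding_norm(feat, 3, 3)   # (14, 14) map of ||x||_2
print(norms.shape, norms[0, 0])
```

Dividing the ordinary convolution output for class j by $\|W_j\|_2$ times this norm map then yields $\cos\theta_j$ at every pixel.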
  26. Table of Contents
     - Introduction
     - Problems in machine learning for classification tasks
     - Variants of Softmax Loss
     - How to apply the variants in Fully Convolutional Networks (FCN)
     - Experiment in AD text area estimation using FCN
     - Conclusion and future work
  27. Experiment in Text Area Estimation from Creative Images for the AD Auto Review System
     - A large amount of text tends to lower CTR, so we detect images whose text area is greater than 20%.
       - There are multiple classes: Text, Logo Text, and Text in Product.
       - Only the Text class should be counted as text area.
     - The evaluation measure is the F-measure between Recall and (1 - False Alarm Rate), written out below.
     Fig. 11: Example of text area estimation from an AD creative image from LINE Corporation. The blue boxes are ground truths (gt) of the Text class. The green boxes are gts of the Logo Text and Text in Product classes. The red segmentations are predictions of the Text class. The blue segmentations are predictions of the other classes.
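Reading "F-measure between Recall and (1 - False Alarm Rate)" as the usual harmonic mean (my interpretation; the slide does not spell the formula out):

```latex
F = \frac{2 \cdot \mathrm{Recall} \cdot (1 - \mathrm{FAR})}{\mathrm{Recall} + (1 - \mathrm{FAR})}
```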
  28. Experimental Result (1/2)
     Table 1: Experimental results on the detection task of images whose text area is greater than 20%. Candidate methods: Online Hard Example Mining (OHEM), Focal Loss (γ=2), embedding normalization, weight normalization, multiplicative angular (MA) margin, Euclid margin, additive angular (AA) margin. Each ◦ marks one applied method.

     Applied methods   F-measure (%)
     ◦                 90.47
     ◦ ◦               90.52
     ◦ ◦ ◦             90.47
     ◦ ◦ ◦             88.39
     ◦ ◦ ◦ ◦ ◦         91.78
     ◦ ◦ ◦             92.56
  29. Experimental Result (2/2)
     Fig. 12: Results of the text area estimation for each method: only OHEM, OHEM + Focal Loss, and OHEM + Focal Loss + Margin
  30. Table of Contents
     - Introduction
     - Problems in machine learning for classification tasks
     - Variants of Softmax Loss
     - How to apply the variants in Fully Convolutional Networks (FCN)
     - Experiment in AD text area estimation using FCN
     - Conclusion and future work
  31. Conclusion and Future Work
     - Conclusion
       - Variants of Softmax Loss incorporate concepts, such as margin and ensemble, found in other areas of machine learning.
       - Some of the variants improved performance in text area estimation using FCN.
     - Future work
       - Search hyper-parameters for Normalization and Margin in softmax
       - Ensemble learning for binary classification and multi-class classification with few classes
  32. References (1/3)
     [Bridle, John S. 1990] Bridle, John S. "Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters." Advances in Neural Information Processing Systems (1990): 211-217.
     [Wang, Feng, et al. 2017] Wang, Feng, et al. "NormFace: L2 hypersphere embedding for face verification." Proceedings of the 25th ACM International Conference on Multimedia (2017): 1041-1049.
     [Long, Jonathan, et al. 2015] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
     [Ali, Aida., et al. 2015] Ali, Aida, Siti Mariyam Shamsuddin, and Anca L. Ralescu. "Classification with class imbalance problem: a review." Int J Adv Soft Comput Appl 7.3 (2015): 176-204.
     [Jo, Taeho, et al. 2004] Jo, Taeho, and Nathalie Japkowicz. "Class imbalances versus small disjuncts." ACM SIGKDD Explorations Newsletter 6.1 (2004): 40-49.
     [Huang, Chen, et al. 2016] Huang, Chen, et al. "Learning deep representation for imbalanced classification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
     [Shrivastava, Abhinav, et al. 2016] Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
     [Lin, Tsung-Yi, et al. 2018] Lin, Tsung-Yi, et al. "Focal loss for dense object detection." IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
  33. References (2/3)
     [Liu, Weiyang, et al. 2017] Liu, Weiyang, et al. "Deep hyperspherical learning." Advances in Neural Information Processing Systems. 2017.
     [Ranjan, Rajeev, et al. 2017] Ranjan, Rajeev, Carlos D. Castillo, and Rama Chellappa. "L2-constrained softmax loss for discriminative face verification." arXiv preprint arXiv:1703.09507 (2017).
     [Liu, Yu, et al. 2017] Liu, Yu, Hongyang Li, and Xiaogang Wang. "Rethinking feature discrimination and polymerization for large-scale recognition." arXiv preprint arXiv:1710.00870 (2017).
     [Liu, Weiyang, et al. 2016] Liu, Weiyang, et al. "Large-margin softmax loss for convolutional neural networks." ICML. 2016.
     [Liu, Weiyang, et al. 2017] Liu, Weiyang, et al. "SphereFace: Deep hypersphere embedding for face recognition." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 1. 2017.
     [Li, Jian, et al. 2018] Li, Jian, et al. "AF-Softmax for face recognition." 2018 International Conference on Network Infrastructure and Digital Content (IC-NIDC). IEEE, 2018.
     [Liang, Xuezhi, et al. 2017] Liang, Xuezhi, et al. "Soft-margin softmax for deep classification." International Conference on Neural Information Processing. Springer, Cham, 2017.
     [Wang, Feng, et al. 2018] Wang, Feng, et al. "Additive margin softmax for face verification." IEEE Signal Processing Letters 25.7 (2018): 926-930.
  34. References (3/3)
     [Deng, Jiankang, et al. 2018] Deng, Jiankang, Jia Guo, and Stefanos Zafeiriou. "ArcFace: Additive angular margin loss for deep face recognition." arXiv preprint arXiv:1801.07698 (2018).
     [Wang, Xiaobo, et al. 2018] Wang, Xiaobo, et al. "Support vector guided softmax loss for face recognition." arXiv preprint arXiv:1812.11317 (2018).
     [Wang, Xiaobo, et al. 2018] Wang, Xiaobo, et al. "Ensemble soft-margin softmax loss for image classification." arXiv preprint arXiv:1805.03922 (2018).