Slide 1

Slide 1 text

Large-scale Landmark Retrieval/Recognition under a Noisy and Diverse Dataset (arXiv:1906.04087)
(1st place solution, retrieval)
Team smlyaka: Kohei Ozaki* (Recruit Technologies), Shuhei Yokoo* (University of Tsukuba)
* Equal contribution.

Slide 2

Slide 2 text

Final Results
Our pipeline is based on a standard approach: CNN-based global descriptor + Euclidean search + re-ranking.
Single model (0.318) + Ensemble (0.330) + Re-ranking.
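A minimal sketch of the first two stages of this pipeline (global descriptor extraction and Euclidean search), assuming PyTorch/torchvision for the backbone and faiss for the search; the backbone choice and helper names are placeholders, not the team's actual code.

```python
# Minimal sketch: CNN global descriptor + Euclidean search (assumed libraries: torch, faiss).
import numpy as np
import faiss          # brute-force similarity search over dense vectors
import torch
import torchvision

# Placeholder descriptor extractor: a pretrained CNN with its classifier removed.
backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_descriptors(images: torch.Tensor) -> np.ndarray:
    """images: (N, 3, H, W) preprocessed batch -> L2-normalized (N, 2048) descriptors."""
    feats = backbone(images)
    feats = torch.nn.functional.normalize(feats, dim=1)
    return feats.cpu().numpy().astype("float32")

def build_index(index_descs: np.ndarray) -> faiss.IndexFlatL2:
    """Euclidean (L2) index over the index-set descriptors."""
    index = faiss.IndexFlatL2(index_descs.shape[1])
    index.add(index_descs)
    return index

# Usage: distances/ids of the top-100 neighbors for each query descriptor.
# index = build_index(extract_descriptors(index_images))
# dists, ids = index.search(extract_descriptors(query_images), k=100)
```

The re-ranking stage is covered later in the deck (Discriminative Reranking).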

Slide 3

Slide 3 text

Two important things to improve landmark retrieval in 2019
1. Cosine-based softmax loss with a “cleaned subset”
   Related topic: (Arandjelović & Zisserman, CVPR’12) “Three things everyone should know to improve object retrieval”
   Related topic: (Wang+, ECCV’18) “The Devil of Face Recognition is in the Noise”
2. Rediscover the idea of the “Discriminative QE” technique

Slide 4

Slide 4 text

Two important things to improve landmark retrieval in 2019
1. Cosine-based softmax loss with a “cleaned subset”
   Related topic: (Arandjelović & Zisserman, CVPR’12) “Three things everyone should know to improve object retrieval”
   Related topic: (Wang+, ECCV’18) “The Devil of Face Recognition is in the Noise”
2. Rediscover the idea of the “Discriminative QE” technique

Slide 5

Slide 5 text

Cleaning the Google-Landmarks-v2
The Google-Landmarks-v2 is a quite noisy and diverse dataset. Metric learning methods are usually sensitive to noise, so it is essential to clean the dataset before applying them.
[Figure: example images for landmark_id=140690]

Slide 6

Slide 6 text

Cleaning the Google-Landmarks-v2
To address the noise issue, we developed an automated data cleaning system and applied it to the Google-Landmarks-v2.
[Figure: example images for landmark_id=140690, with noisy images marked ✗]

Slide 7

Slide 7 text

Automated Data Cleaning
With local feature matching & spatial verification (inlier-count). For each train image xi:
1. kNN (k=1000) from the train set (the image representation is learned from the Google-Landmarks-v1).
2. Select up to the 100-NN assigned to the same label as xi.
3. Spatial verification (w/ DELFv2) is performed on the up-to-100 NN.
4. Add xi into our clean train set when the count of verified images is greater than the threshold (=2).

Slide 8

Slide 8 text

Automated Data Cleaning
With local feature matching & spatial verification (inlier-count). For each train image xi:
1. kNN (k=1000) from the train set (the image representation is learned from the Google-Landmarks-v1).
2. Select up to the 100-NN assigned to the same label as xi.
3. Spatial verification (w/ DELFv2) is performed on the up-to-100 NN.
4. Add xi into our clean train set when the count of verified images is greater than the threshold (=2).

Slide 9

Slide 9 text

Automated Data Cleaning
With local feature matching & spatial verification (inlier-count). For each train image xi:
1. kNN (k=1000) from the train set (the image representation is learned from the Google-Landmarks-v1).
2. Select up to the 100-NN assigned to the same label as xi.
3. Spatial verification (w/ DELFv2) is performed on the up-to-100 NN.
4. Add xi into our clean train set when the count of verified images is greater than the threshold (=2).

Slide 10

Slide 10 text

Automated Data Cleaning
With local feature matching & spatial verification (inlier-count). For each train image xi:
1. kNN (k=1000) from the train set (the image representation is learned from the Google-Landmarks-v1).
2. Select up to the 100-NN assigned to the same label as xi.
3. Spatial verification (w/ DELFv2) is performed on the up-to-100 NN.
4. Add xi into our clean train set when the count of verified images is greater than the threshold (=2).
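A minimal sketch of this cleaning loop, assuming precomputed global descriptors, a kNN search helper, and a DELF-based inlier-count function; `knn_search`, `count_inliers`, and the inlier cutoff are hypothetical placeholders, not the authors' implementation.

```python
# Sketch of the automated cleaning procedure described above (hypothetical helpers).
import numpy as np

def clean_train_set(descs, labels, knn_search, count_inliers,
                    k=1000, max_verify=100, threshold=2):
    """Return indices of train images kept in the cleaned subset.

    descs:         (N, d) global descriptors learned on the Google-Landmarks-v1.
    labels:        (N,) landmark_id per train image.
    knn_search:    fn(query_desc, k) -> indices of the k nearest train images.
    count_inliers: fn(i, j) -> RANSAC inlier count between images i and j (e.g. DELF matches).
    """
    keep = []
    for i in range(len(descs)):
        # 1. kNN (k=1000) from the train set.
        neighbors = knn_search(descs[i], k)
        # 2. Keep up to 100 neighbors that share x_i's label (excluding x_i itself).
        same_label = [j for j in neighbors if j != i and labels[j] == labels[i]][:max_verify]
        # 3. Spatial verification against each candidate (inlier cutoff of 30 is an assumption).
        verified = sum(1 for j in same_label if count_inliers(i, j) >= 30)
        # 4. Accept x_i if the number of verified neighbors exceeds the threshold (=2).
        if verified > threshold:
            keep.append(i)
    return np.array(keep)
```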

Slide 11

Slide 11 text

Two important things to improve landmark retrieval in 2019
1. Cosine-based softmax loss with a “cleaned subset”
   Related topic: (Arandjelović & Zisserman, CVPR’12) “Three things everyone should know to improve object retrieval”
   Related topic: (Wang+, ECCV’18) “The Devil of Face Recognition is in the Noise”
2. Rediscover the idea of the “Discriminative QE” technique

Slide 12

Slide 12 text

Discriminative Reranking
Predict a landmark_id for each sample in the test set and the index set using the recognition pipeline.
[Figure: recognition pipeline predicting landmark_ids (e.g. id=1, id=2) for test and index images]

Slide 13

Slide 13 text

Discriminative Reranking
Predict a landmark_id for each sample in the test set and the index set using the recognition pipeline.
[Figure: recognition pipeline predicting landmark_ids (e.g. id=1, id=2) for test and index images]

Slide 14

Slide 14 text

Discriminative Reranking
Append positive samples from the entire index set which are not retrieved by the similarity search.
Positive samples are moved ahead of the negative samples in the ranking.
[Figure: ranked list for a query before and after re-ranking, with un-retrieved positives inserted ahead of the negatives]

Slide 15

Slide 15 text

Discriminative Reranking
Append positive samples from the entire index set which are not retrieved by the similarity search.
Positive samples are moved ahead of the negative samples in the ranking.
[Figure: ranked list for a query before and after re-ranking, with un-retrieved positives inserted ahead of the negatives]

Slide 16

Slide 16 text

Discriminative Reranking
Append positive samples from the entire index set which are not retrieved by the similarity search.
Positive samples are moved ahead of the negative samples in the ranking.
[Figure: ranked list for a query before and after re-ranking, with un-retrieved positives inserted ahead of the negatives]
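A minimal sketch of this re-ranking step, assuming every test and index image already has a landmark_id predicted by the recognition pipeline; the data layout and the ordering within each group are assumptions, not the team's exact code.

```python
# Sketch of discriminative re-ranking for one query (hypothetical data layout).
def rerank(retrieved, query_pred, index_pred, max_len=100):
    """Re-rank one query's retrieved list using predicted landmark_ids.

    retrieved:  list of index-image ids, ordered by descriptor similarity.
    query_pred: landmark_id predicted for the query by the recognition pipeline.
    index_pred: dict of index-image id -> predicted landmark_id.
    """
    positives = [i for i in retrieved if index_pred.get(i) == query_pred]
    negatives = [i for i in retrieved if index_pred.get(i) != query_pred]
    # Positives from the entire index set that the similarity search missed.
    retrieved_set = set(retrieved)
    missed = [i for i, p in index_pred.items()
              if p == query_pred and i not in retrieved_set]
    # Retrieved positives first, then appended positives, then negatives (ordering assumed).
    return (positives + missed + negatives)[:max_len]
```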

Slide 17

Slide 17 text

Key takeaways: Two important things to improve landmark retrieval in 2019
1. Cosine-based softmax loss with a “cleaned subset”
   Related topic: (Arandjelović & Zisserman, CVPR’12) “Three things everyone should know to improve object retrieval”
   Related topic: (Wang+, ECCV’18) “The Devil of Face Recognition is in the Noise”
2. Rediscover the idea of the “Discriminative QE” technique

Slide 18

Slide 18 text

Appendix

Slide 19

Slide 19 text

Soft-voting with spatial verification
Our recognition method is based on accumulating the top-k (k=3) nearest neighbors of the query q in the train set, found by Euclidean search.
Confidence scoring: the score s_l of label l combines a similarity term and an inlier-count term, summed over the set of q's top-k neighbors assigned to l; the prediction is ŷ = argmax_l s_l.
[Figure: a query scored against its top-3 train-set neighbors (The New Town Hall in Hanover vs. Hamburg City Hall) with per-neighbor similarity and inlier-count values; predicted label: Hamburg City Hall]
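A rough sketch of this soft-voting scheme; the equal weighting of the similarity term and the capped, normalized inlier-count term is an assumption here, and the exact formulation is given in arXiv:1906.04087.

```python
# Sketch of soft-voting with spatial verification (term weighting assumed).
from collections import defaultdict

def predict_label(neighbors, k=3, max_inliers=30.0):
    """neighbors: list of (landmark_id, cosine_similarity, inlier_count) for the
    query's nearest train-set neighbors, most similar first.
    Returns (predicted label, confidence)."""
    scores = defaultdict(float)
    for label, sim, inliers in neighbors[:k]:
        similarity_term = sim                                   # descriptor similarity
        inlier_term = min(inliers, max_inliers) / max_inliers   # spatial-verification evidence
        scores[label] += similarity_term + inlier_term
    y_hat = max(scores, key=scores.get)   # y_hat = argmax_l s_l
    return y_hat, scores[y_hat]
```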

Slide 20

Slide 20 text

Cosine-based Softmax Loss
• We employ ArcFace [1] and CosFace [2] for metric learning in our solution.
• These are successful methods in face recognition.
• Also in landmark retrieval/recognition, we found that cosine-based softmax losses are very effective.
• Hyperparameters: m=0.3 and s=30 were used for both.
• There are many winning solutions using cosine-based softmax losses:
  • Humpback Whale Identification - 1st place
  • Protein Classification - 1st place
[1] J. Deng, J. Guo, and S. Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. arXiv:1801.07698, 2018.
[2] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. CosFace: Large margin cosine loss for deep face recognition. In CVPR, pages 5265–5274, 2018.
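A minimal PyTorch-style sketch of an ArcFace head with the hyperparameters from this slide (m=0.3, s=30); this is a generic implementation of the loss family, not the team's training code.

```python
# Generic ArcFace head sketch (m=0.3, s=30 as on this slide).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, features, labels):
        # Cosine similarity between L2-normalized embeddings and class centers.
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m to the target-class angle only.
        target = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        # Scale by s; the result is fed into a standard cross-entropy loss.
        return self.s * logits

# Usage: loss = nn.CrossEntropyLoss()(ArcFaceHead(512, n_landmarks)(emb, ids), ids)
```

CosFace differs only in how the margin is applied: it subtracts m from the target-class cosine (cos θ − m) instead of adding m to the angle.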

Slide 21

Slide 21 text

Modeling | Overview
• Backbones:
  • FishNet-150
  • ResNet-101
  • SE-ResNeXt-101
• Data augmentation: “soft” and “hard” strategies.
  • “Soft”: 5 epochs with random cropping and scaling.
  • “Hard”: 7 epochs with random brightness shift, random shear, random cropping, and scaling.
• Combined with various techniques:
  • Aspect-preserving resizing of input images.
  • Cosine-annealing LR scheduler.
  • GeM pooling (generalized mean pooling); see the sketch below.
  • Fine-tuning at full resolution in the last epoch with frozen BN.
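A short PyTorch sketch of the GeM pooling component mentioned in the list above; p=3 is a common default and an assumption here, not a value stated on the slide.

```python
# Generalized mean (GeM) pooling sketch; p=3 is a common default (assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable pooling exponent
        self.eps = eps

    def forward(self, x):
        # x: (N, C, H, W) feature map -> (N, C) global descriptor.
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.avg_pool2d(x, kernel_size=(x.size(-2), x.size(-1)))
        return x.pow(1.0 / self.p).flatten(1)
```

With p=1 this reduces to average pooling and with p→∞ it approaches max pooling, which is why it is a popular drop-in replacement for both in retrieval backbones.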

Slide 22

Slide 22 text

Modeling | Ensemble
Ensemble: Concat + L2N (3072d), Pub/Priv=30.95/33.01
• FishNet-150, ArcFace, Soft (512d): Pub/Priv=28.66/30.76
• FishNet-150, CosFace, Soft (512d): Pub/Priv=29.04/31.56
• FishNet-150, ArcFace, Hard (512d): Pub/Priv=29.17/31.26
• ResNet-101, ArcFace, Hard (512d): Pub/Priv=28.57/31.07
• SE-ResNeXt-101, ArcFace, Hard (512d): Pub/Priv=29.60/31.52
• SE-ResNeXt-101, ArcFace, Hard (512d): Pub/Priv=29.42/31.80
Pub/Priv: Public/Private LB score. L2N: L2-normalization.
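A small NumPy sketch of the “Concat + L2N” step: the six per-model 512-d descriptors are concatenated into a 3072-d vector and L2-normalized. Normalizing each model's descriptors before concatenation is an assumption here, not something the slide states.

```python
# Sketch of the "Concat + L2N" ensemble of per-model descriptors.
import numpy as np

def concat_l2n(per_model_descs):
    """per_model_descs: list of (N, 512) arrays, one per model (6 models -> 3072-d)."""
    # Per-model L2-normalization so every model contributes on the same scale (assumed).
    normalized = [d / np.linalg.norm(d, axis=1, keepdims=True) for d in per_model_descs]
    fused = np.concatenate(normalized, axis=1)
    # Final L2-normalization of the concatenated descriptor ("L2N" on the slide).
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)
```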

Slide 23

Slide 23 text

Appendix: Another case
[Figure: example images for landmark_id=29]