

Paper reading party (ICCV 2023): End-to-End Semi-Supervised Object Detection with Soft Teacher

A paper on semi-supervised object detection


Kazuya Nishimura

April 23, 2025


Transcript

  1. Introduce paper: End-to-End Semi-Supervised Object Detection with Soft Teacher

    [Xu+, ICCV 2021] Setting of semi-supervised object detection: a small amount of labeled data (image + bounding box) and unlabeled data (image only). Goal: an object detector that outputs a bounding box + class (e.g. "dog"). How can we obtain a good object detector by effectively using the unlabeled data?
  2. Introduce paper: End-to-End Semi-Supervised Object Detection with Soft Teacher

    [Xu+, ICCV 2021] Same setting: a small amount of labeled data (image + bounding box) and unlabeled data (image only). How can we obtain a good object detector by effectively using the unlabeled data? Use Soft Teacher!
  3. Why did I select this paper?

    There are various relationships between teachers and students:
     Knowledge distillation: deep teacher model, shallow student model; the student imitates the teacher to transfer knowledge [Wang+, TPAMI 2021]
     Classification with noisy labels (weakly supervised): the teacher generates noisy labels and the student (same depth or deeper) can outperform the teacher [Xia+, CVPR 2020]
     Semi-supervised object detection (this presentation): teacher and student share the same architecture, and the teacher's weights are updated using the student's weights. This paper: [Xu+, ICCV 2021]; Honda will introduce [Peng+, ICCV 2021]
  4. Background: object detection

    e.g. Faster R-CNN, a two-stage object detector: input image → feature extractor → feature map → region proposal network (RPN) → proposals of bounding-box locations (overlap is OK).
  5. Background: object detection

    e.g. Faster R-CNN, a two-stage object detector: each RPN proposal is cropped from the feature map and fed to two heads, a regression head predicting ŷ_reg (object position: x, y, w, h) and a classification head predicting ŷ_cls (class).
  6. Background: object detection

    e.g. Faster R-CNN, a two-stage object detector: the two heads are trained with

    Loss = L_cls(y_cls, ŷ_cls) + L_reg(y_reg, ŷ_reg)

    i.e. a classification loss plus a regression loss.
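As a reading aid, the two-term detection loss above can be sketched in a few lines. This is a toy illustration with hypothetical helper names (`cls_loss` as cross-entropy, `reg_loss` as smooth L1), not the actual Faster R-CNN implementation:

```python
import math

def cls_loss(y_cls, y_cls_hat):
    # Cross-entropy for one proposal: y_cls is the true class index,
    # y_cls_hat is a list of predicted class probabilities.
    return -math.log(y_cls_hat[y_cls])

def reg_loss(y_reg, y_reg_hat):
    # Smooth L1 over the box parameters (x, y, w, h).
    total = 0.0
    for t, p in zip(y_reg, y_reg_hat):
        d = abs(t - p)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def detection_loss(y_cls, y_cls_hat, y_reg, y_reg_hat):
    # Loss = L_cls(y_cls, ŷ_cls) + L_reg(y_reg, ŷ_reg)
    return cls_loss(y_cls, y_cls_hat) + reg_loss(y_reg, y_reg_hat)
```

A perfect prediction gives zero for both terms; each head's error contributes additively.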
  7. Related work: Semi-supervised learning

     Consistency-based methods: a consistency loss between predictions on differently perturbed inputs [Jeong+, NeurIPS 2019]
     Pseudo-labeling-based methods: a model pretrained with labeled data generates pseudo labels for the unlabeled data [Tang+, WACV 2021]
    Soft Teacher is pseudo-labeling-based!
  8. Related work: Mean Teacher (EMA teacher)

    Teacher and student models share the same architecture. The student is trained with Loss_s + Loss_u, and the teacher's weights are an exponential moving average of the student's:

    θ_t^tea = α θ_t^stu + (1 − α) θ_{t−1}^tea

    [Tarvainen+, NeurIPS 2017]
  9. Related work: Mean Teacher (EMA teacher)

    Same framework: the teacher is updated as θ_t^tea = α θ_t^stu + (1 − α) θ_{t−1}^tea, and the loss is calculated based on consistency or pseudo labels. [Tarvainen+, NeurIPS 2017]
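The EMA update can be sketched as follows. This is a minimal illustration treating model parameters as plain dicts; note that the slide's formula puts α on the student, so a small α makes the teacher track the student slowly:

```python
def ema_update(teacher, student, alpha):
    # EMA teacher update as written on the slide:
    #   theta_t^tea = alpha * theta_t^stu + (1 - alpha) * theta_{t-1}^tea
    # (many Mean Teacher implementations put the large coefficient on
    #  the teacher instead; alpha here follows the slide's convention)
    return {k: alpha * student[k] + (1 - alpha) * teacher[k] for k in teacher}
```

For example, with teacher weight 0.0, student weight 1.0, and alpha = 0.1, the updated teacher weight is 0.1: the teacher drifts toward the student a little each step.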
  10. Proposed method: overview

    The Mean Teacher framework is used: student and teacher share the same architecture, the teacher is updated by θ_t^tea = α θ_t^stu + (1 − α) θ_{t−1}^tea, and the student is trained with Loss_s + Loss_u, where Loss_u = L_cls^u + L_reg^u (unsupervised classification loss + regression loss).
  11. Proposed method: overview

    Same diagram, with the unsupervised loss expanded: the student minimizes Loss_s + L_cls^u + L_reg^u.
  12. Proposed method: overview

    Weak augmentation is applied to the teacher's input and strong augmentation to the student's input.
  13. Proposed method: overview

    The teacher's detections on the weakly augmented image are filtered by a confidence threshold (keep boxes with threshold < score) to produce pseudo labels for the student.
  14. Proposed method: overview

    (Same diagram as the previous slide: the pseudo labels passing threshold < score supervise the strongly augmented student.)
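The "threshold < score" filtering step can be sketched as below. This is a hypothetical helper; representing each teacher detection as a (box, score, class) tuple is an assumption for illustration:

```python
def filter_pseudo_labels(detections, threshold=0.9):
    """Keep teacher detections whose confidence exceeds the threshold.

    detections: list of (box, score, class_name) tuples produced by the
    teacher on the weakly augmented image. The survivors are used as
    pseudo ground truth for the strongly augmented student pass.
    """
    return [d for d in detections if d[1] > threshold]
```

With a high threshold the pseudo labels are precise but sparse, which is exactly the imbalance the soft teacher (next slides) addresses.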
  15. Proposed method: overview

    Contribution 1, the soft teacher: background boxes are weighted with the score s_j / Σ_{k=1}^{N_bg} s_k.
  16. Proposed method: overview

    Contribution 2, box jittering: pseudo boxes are filtered by their regression variance.
  17. 1. Soft teacher

    The recall and precision of the selected boxes are 33% and 89%: many true foreground boxes are missed, so the model tends to be pushed toward predicting background. The teacher's detections are split by threshold < score into a pseudo-foreground box list G^fg and a pseudo-background box list G^bg, and the unsupervised classification loss averages over both:

    L_cls^u = (1/N_b^fg) Σ_{i=1}^{N_b^fg} l_cls(b_i^fg, G^fg) + (1/N_b^bg) Σ_{j=1}^{N_b^bg} l_cls(b_j^bg, G^bg)
  18. 1. Soft teacher

    Because the recall and precision of the selected boxes are 33% and 89%, the uniform background average is replaced by a per-box weighting:

    L_cls^u = (1/N_b^fg) Σ_{i=1}^{N_b^fg} l_cls(b_i^fg, G^fg) + Σ_{j=1}^{N_b^bg} w_j l_cls(b_j^bg, G^bg),   w_j = r_j / Σ_{k=1}^{N_b^bg} r_k

    The weight is calculated from the teacher's background score r_j.
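The reweighted loss above can be sketched as follows. This is a toy version assuming the per-box scalar losses and the teacher background scores r_j are already computed (both names are assumptions):

```python
def soft_cls_loss(fg_losses, bg_losses, bg_scores):
    """Soft-teacher unsupervised classification loss (scalar sketch).

    fg_losses: l_cls values for pseudo-foreground boxes (plain mean).
    bg_losses: l_cls values for pseudo-background boxes.
    bg_scores: teacher background scores r_j, one per background box.
    """
    # Foreground term: uniform average over pseudo-foreground boxes.
    fg_term = sum(fg_losses) / len(fg_losses)
    # Background term: each box j weighted by w_j = r_j / sum_k r_k,
    # so boxes the teacher is confident are background count more.
    total_r = sum(bg_scores)
    bg_term = sum((r / total_r) * l for r, l in zip(bg_scores, bg_losses))
    return fg_term + bg_term
```

With uniform scores the weighted term reduces to the plain mean; skewed scores shift the loss toward confidently-background boxes, which is the point of the soft teacher.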
  19. 2. Box Jittering

    Localization accuracy does not correlate well with the classification score: even if we select only high-score samples, their boxes are not necessarily accurate. (Figure: classification score vs. localization accuracy, with the ideal correlation shown.)
  20. 2. Box Jittering

    (Same figure as the previous slide: localization accuracy does not correlate well with the classification score, so selecting high-score samples does not guarantee accurate boxes.)
  21. 2. Box Jittering

    Localization accuracy does not correlate well with the classification score, so we want an index that correlates better with localization accuracy ➢ introduce the box regression variance!
  22. How to calculate box regression variance?

    In the Faster R-CNN pipeline (input image → feature extractor → feature map → RPN → crop → regression head), jittering is added to each bounding-box candidate before it is refined by the regression head.
  23. How to calculate box regression variance?

    (Same diagram as the previous slide: jittered bounding-box candidates are passed through the regression head.)
  24. How to calculate box regression variance?

    A candidate box b_0 from the RPN is jittered to obtain b_1, ..., b_4, and each jittered box is refined by the regression CNN. Each σ_k measures the spread of the refined boxes (terms of the form (b_k − b̄)^2), and the averaged variance

    σ̄ = (1/4) Σ_{k=1}^{4} σ_k

    is thresholded: boxes with σ̄ > 0.5 are filtered out.
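The jitter-and-measure procedure can be sketched as below. The jitter magnitude, the toy regression head, and reading the slide's 0.5 as a keep/discard threshold are assumptions:

```python
import random
import statistics

def box_regression_variance(box, regress, n_jitter=4, scale=0.5, seed=0):
    """Jitter a candidate box n_jitter times, refine each jittered box
    with the regression head `regress`, and return sigma-bar: the
    average, over the four box coordinates, of the standard deviation
    of the refined boxes."""
    rng = random.Random(seed)
    refined = []
    for _ in range(n_jitter):
        jittered = tuple(c + rng.uniform(-scale, scale) for c in box)
        refined.append(regress(jittered))
    # sigma_k: spread of coordinate k across refined boxes; average them.
    sigmas = [statistics.pstdev(coords) for coords in zip(*refined)]
    return sum(sigmas) / len(sigmas)

def keep_for_regression(box, regress, threshold=0.5):
    # Only pseudo boxes whose regression is stable under jittering are
    # used to supervise the student's regression head.
    return box_regression_variance(box, regress) < threshold
```

A regression head that snaps every jittered box back to the same location gives variance 0, i.e. a maximally reliable pseudo box.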
  25. Experiments: dataset

     MS COCO semi-supervised benchmark
    ➢ 118k training images + 123k unlabeled images
    ➢ 1%, 5%, and 10% of the images are randomly sampled from the training data as the labeled set
    ➢ 5-fold cross validation, evaluated with mAP

    Method                          | 1%           | 5%           | 10%
    Supervised (labeled data only)  | 10.00 ± 0.26 | 20.92 ± 0.15 | 26.94 ± 0.11
    [Sohn+, arXiv 2020]             | 13.97 ± 0.35 | 24.38 ± 0.12 | 28.64 ± 0.21
    Proposed                        | 20.46 ± 0.39 | 30.74 ± 0.08 | 34.04 ± 0.14
  26. Ablation study

     MS COCO, 10% labeled data
     End-to-end (E2E) training

    Soft Teacher | Box jittering | mAP
                 |               | 31.2
    ✓            |               | 33.6
    ✓            | ✓             | 34.2

    Method      | mAP
    Supervised  | 27.1
    Multi-stage | 28.7
    E2E         | 31.2
  27. Summary

     Proposed end-to-end (E2E) semi-supervised object detection
     Two techniques are used for training:
    ➢ Soft teacher, which effectively transfers the teacher's information
    ➢ Box jittering, which filters out inaccurate pseudo labels
     Outperforms SoTA methods on the MS COCO benchmark
     Take-home message:
    ➢ The teacher-student framework is used for various tasks