What to Distill? Fast Knowledge Distillation with Adaptive Sampling

Byungchul Chae
Kyung Hee University; SqueezeBits Inc.
[email protected]

Seonyeong Heo
Kyung Hee University
[email protected]

Abstract

Knowledge Distillation (KD) has been established as an effective technique for reducing the resource requirements of models when tackling computer vision tasks. Prior work has studied how to better distill the knowledge of a teacher model, but it overlooks how data affect the distillation result. This work examines the impact of data in knowledge distillation from two perspectives: (i) quantity of knowledge and (ii) quality of knowledge. Our examination finds that faster knowledge distillation can be achieved by using data adaptively.

[Figure 1. Average training time (relative to vanilla KD [9]) vs. top-1 accuracy on CIFAR-100 for KD, FitNet, RKD, CRD, OFD, ReviewKD, DKD, LogitSTD, KD+Ours, and LogitSTD+Ours. We set ResNet110 as the teacher.]

Experimental Settings. For quality-based loss weighting, we use percentile = 80 to define our optimal LTG distribution, along with a minimum weight and a penalty term. The temperature parameter is set to 4 unless otherwise specified.
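As an illustration of how such percentile-based loss weighting could look in practice, the sketch below assigns each sample a weight from its teacher-student KL divergence. The function names, the hard one-sided cutoff, and the default `w_min` value are our assumptions for illustration only, not the exact KDAS formulation (which also involves the hyperparameters ω, ε, ϑlow, and ϑhigh).

```python
import torch
import torch.nn.functional as F

def quality_weights(kl_per_sample: torch.Tensor,
                    percentile: float = 0.80,   # "percentile = 80" from the settings
                    w_min: float = 0.1) -> torch.Tensor:
    """Down-weight samples whose KL divergence falls past the percentile cutoff."""
    cutoff = torch.quantile(kl_per_sample, percentile)
    weights = torch.ones_like(kl_per_sample)
    weights[kl_per_sample > cutoff] = w_min  # penalized (low-quality) samples
    return weights

def weighted_kd_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     T: float = 4.0) -> torch.Tensor:
    """Per-sample KD loss (KL divergence at temperature T), then reweighted."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1) * (T * T)
    w = quality_weights(kl.detach())  # weights must not receive gradients
    return (w * kl).mean()
```

A two-sided band (penalizing both very low and very high divergence, in the spirit of ϑlow and ϑhigh) would follow the same pattern with a second quantile cutoff.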
Results with CIFAR-100. Table 2 presents the top-1 accuracy of distillation methods with and without our approach on CIFAR-100 across diverse teacher-student pairs. KDAS consistently improves accuracy while reducing training time across all configurations. With vanilla KD [9], our method achieves approximately 28% faster distillation depending on the teacher-student pair. It seems that the performance gain is partially affected by whether the teacher and student are in the same architecture family, and there is no apparent relationship between the accuracy gain and the training time reduction.

With advanced methods such as DKD [33] and LogitSTD [24], KDAS provides consistent performance enhancement together with training time reduction. As KDAS applies the subsampling method more conservatively for these methods, the reduction is smaller compared with vanilla KD; because each method modifies the loss function differently, subsampling may give a different impact. These results show that our approach is not only effective with the standard KD method but also complementary to advanced distillation methods, highlighting the generality of our adaptive sampling method.

[Table 2. Top-1 accuracy (%) and training time reduction of distillation methods with and without KDAS on CIFAR-100.]

Table 3. Top-1 accuracy (%) on the ImageNet validation dataset. Improvements over the corresponding baseline are shown as Δ.

Teacher              ResNet34 (73.31)   ResNet50 (76.16)
Student              ResNet18 (69.75)   MobileNetV1 (68.87)
KD [9]               71.03              70.50
KD + Ours            71.35              70.86
Δ Acc., T.T. (%)     +0.22, -36.23      +0.36, -35.81
DKD [33]             71.70              72.05
DKD + Ours           71.81              72.49
Δ Acc., T.T. (%)     +0.11, -15.01      +0.44, -14.74
LogitSTD [24]        71.42              72.18
LogitSTD + Ours      71.48              72.21
Δ Acc., T.T. (%)     +0.06, -15.27      +0.03, -15.01

Results with ImageNet. On the large-scale ImageNet dataset, as shown in Table 3, KDAS maintains its effectiveness for practical teacher-student combinations. For both pairs, our approach improves the accuracy of the student models while reducing the distillation time by over 35%. With the advanced methods, KDAS continues to enhance performance with approximately 15% faster training. The training time reduction is larger than on CIFAR-100 because KDAS can subsample the bigger dataset more efficiently. These results show that our approach scales well to a larger dataset, where computational efficiency is particularly valuable.

Results with Vision Transformers. To verify the effectiveness of KDAS on modern architectures, we apply the approach to various vision transformer models on CIFAR-100. Table 4 summarizes the accuracy and training time improvements with vanilla KD, compared with other knowledge distillation methods. These results highlight that KDAS effectively bridges the architectural gap between the CNN-based teacher (ResNet56) and the transformer-based students while providing accuracy gains with improved distillation efficiency.

Table 4. Top-1 accuracy (%) and training time reduction on CIFAR-100 with transformer-based students (teacher: ResNet56).

KD [9]        73.25     74.15     75.47     73.60
KD + Ours     77.81     76.39     76.87     76.31
Δ Acc. (%)    +4.56     +2.24     +1.40     +2.71
Δ T.T. (%)    -25.51    -25.42    -24.98    -25.16

6.3. Ablation Study

Table 5 presents a systematic analysis of the contribution of each KDAS component. The quantity-based subsampling alone yields a 0.68% accuracy improvement while reducing training time by 12.45%. Incorporating curriculum sampling enhances accuracy further by 0.94%, demonstrating the value of adaptive sample prioritization. The quality-based penalization raises the improvement to 0.98%, and adding our optimization techniques trades a small amount of accuracy (+0.77% over the baseline) for a 30.22% training time reduction.

Table 5. Impact of different KDAS components on CIFAR-100 with VGG13 → VGG8. (CS: Curriculum Sampling; PN: Quality-based Penalization; Opt.: our optimization techniques.)

Vanilla   Subsampling   CS   PN   Opt.   Acc. (%)         Training time
  ✓                                      73.14            100%
  ✓          ✓                           73.82 (+0.68)
  ✓          ✓          ✓                74.08 (+0.94)    -12.45%
  ✓          ✓          ✓    ✓           74.12 (+0.98)
  ✓          ✓          ✓    ✓    ✓      73.91 (+0.77)    -30.22%
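For intuition on the quantity-based curriculum sampling (the CS component ablated above), the following sketch ranks samples by an informativeness score and shrinks the kept fraction over training. The linear schedule, the keep fractions, and the scoring rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def curriculum_subsample(scores: torch.Tensor, epoch: int, total_epochs: int,
                         keep_start: float = 0.9,
                         keep_end: float = 0.6) -> torch.Tensor:
    """Indices of the training samples kept for this epoch.

    `scores` holds one informativeness value per sample (e.g., the previous
    epoch's per-sample KL divergence). The kept fraction shrinks linearly
    from keep_start to keep_end, prioritizing high-score samples later on.
    """
    progress = epoch / max(1, total_epochs - 1)
    keep_frac = keep_start + (keep_end - keep_start) * progress
    k = max(1, int(keep_frac * scores.numel()))
    return torch.topk(scores, k).indices  # keep the k most informative samples

# Hypothetical usage with a standard DataLoader:
#   kept = curriculum_subsample(kl_scores, epoch, num_epochs)
#   loader = DataLoader(Subset(train_set, kept.tolist()), batch_size=64, shuffle=True)
```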
[email protected] Seonyeong Heo Kyung Hee University
[email protected] Abstract Knowledge Distillation (KD) has been established as an effective technique for reducing the resource requirements of models when tackling computer vision tasks. Prior work has studied how to distill the knowledge of a teacher model better, but it overlooks how data affects the distillation re- sult. This work examines the impact of data in knowledge distillation from two perspectives: (i) quantity of knowledge and (ii) quality of knowledge. Our examination finds that faster knowledge distillation can be achieved by using data KD FitNet RKD CRD OFD ReviewKD DKD LogitSTD KD+Ours LogitSTD+Ours 70.5 71 71.5 72 72.5 73 73.5 74 74.5 75 0 1 2 3 4 5 6 Accuracy (%) Relative Training Time Figure 1. Average training time (relative to vanilla KD [9]) vs. top-1 accuracy on CIFAR-100. We set ResNet110 as the teacher