[Journal club] Dynamic DETR: End-to-End Object Detection with Dynamic Attention

Dynamic DETR: End-to-End Object Detection with Dynamic Attention
Xiyang Dai, Yinpeng Chen, Jianwei Yang, Pengchuan Zhang, Lu Yuan, Lei Zhang (Microsoft), ICCV 2021
Komei Sugiura Laboratory, Motonari Kambara
Dai, X., Chen, Y., Yang, J., Zhang, P., Yuan, L., & Zhang, L. (2021). Dynamic DETR: End-to-end object detection with dynamic attention. In ICCV (pp. 2988-2997).

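Background: the arrival of the Detection Transformer (DETR [Carion+ ECCV20])
• Uses a Transformer as the encoder-decoder (the backbone is ResNet)
• Predicts a set of (class label, position, size) tuples
• Proposes a bipartite matching loss that pairs each ground truth $y_i$ with a prediction $\hat{y}_i$ in the predicted set
• On MS COCO, detection performance is on par with or better than existing methods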
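The bipartite matching step can be illustrated with a minimal sketch. It assumes a precomputed square cost matrix (the values below are made up for illustration) and uses a brute-force search over permutations, which is only practical for tiny sets. DETR itself uses the Hungarian algorithm for this step; the function name `best_matching` is hypothetical.

```python
from itertools import permutations


def best_matching(cost):
    """Return the lowest-cost one-to-one assignment.

    cost[i][j] is the cost of pairing ground truth i with prediction j.
    Brute force over all permutations: fine for a toy example only.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost


# Illustrative (made-up) matching costs for 3 ground truths x 3 predictions.
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
assignment, total = best_matching(cost)
print(assignment, total)
```

Here `assignment[i]` is the index of the prediction paired with ground truth `i`; the loss is then computed only over those pairs.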

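Background: DETR has problems that must be overcome before it can be widely adopted
(1) It is difficult to raise the resolution of the input feature map
• In the encoder, self-attention costs $O(d(HW)^2)$, so enlarging the feature map makes the computation grow explosively
• As a result, detection performance on small objects is comparatively low
(2) Training needs more epochs to converge than existing methods
• In the decoder, the attention weights ($HW \times N$ of them) are trained from dense to sparse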
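A small arithmetic sketch of the $O(d(HW)^2)$ term shows why resolution is the bottleneck. The feature dimension and map sizes below are illustrative assumptions, and constant factors are dropped.

```python
def self_attention_cost(d, h, w):
    """Order-of-magnitude cost d * (H*W)^2 of self-attention (constants dropped)."""
    return d * (h * w) ** 2


# Illustrative sizes: doubling both H and W of the feature map.
base = self_attention_cost(d=256, h=25, w=25)
doubled = self_attention_cost(d=256, h=50, w=50)
ratio = doubled / base
print(ratio)  # 16.0
```

Doubling the side length quadruples the number of positions $HW$, and the quadratic term turns that into a 16-fold increase.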

Related work: DETR variants
Some variants overcome DETR's problems; others build on DETR and extend it.
• Up-DETR [Dai+ CVPR21]
• Deformable DETR [Zhu+ ICLR21]
• Spatially Modulated Co-Attention [Gao+ ICCV21]
• Conditional DETR [Meng+ ICCV21]
• MDETR [Kamath+ ICCV21]
• Meta-DETR [Zhang+ 21]
• DA-DETR [Zhang+ 21]
Figures: [Dai+ CVPR21], [Kamath+ ICCV21]

Related work: Deformable DETR [Zhu+ ICLR21]
To solve the two problems of DETR:
• Introduces deformable attention, based on a feature pyramid + deformable convolution [Dai+ ICCV17]
• Predicts bbox coordinates from the inference results using offsets
Compared with DETR: 1.6x the FLOPs, convergence several times faster, and higher performance on every metric

Proposed method: Dynamic DETR
• Replaces DETR's encoder and decoder with a Dynamic Encoder and a Dynamic Decoder, respectively

Dynamic Encoder
• Computes spatial attention over the input image
• What it does:
1. Convolution over hierarchical feature maps (pyramid features), which creates the value
2. Creation of the attention weights
3. Activation function & concat

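Dynamic Encoder: Pyramid Convolution [Wang+ CVPR20]
Pyramid levels: $P_1, \ldots, P_{l-1}, P_l, P_{l+1}$; output $P'_l$
• $P'_l = \mathrm{Upsample}(\mathrm{Conv}(P_{l-1})) + \mathrm{Conv}(P_l) + \mathrm{Downsample}(\mathrm{Conv}(P_{l+1}))$
• Convolving feature maps of different resolutions yields features at various granularities
• How much spatial attention is captured depends on the kernel size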
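The pyramid convolution formula can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: it assumes level $l-1$ has half the resolution of level $l$ and level $l+1$ has double (which is what the Upsample/Downsample placement implies), uses 1x1 convolutions instead of larger kernels, nearest-neighbour upsampling, and 2x2 average pooling. All function names are hypothetical.

```python
import numpy as np


def conv1x1(x, w):
    """1x1 convolution. x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)


def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 along H and W."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def downsample2x(x):
    """2x2 average pooling (H and W must be even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def pyramid_conv(p_prev, p_cur, p_next, w_prev, w_cur, w_next):
    """P'_l = Upsample(Conv(P_{l-1})) + Conv(P_l) + Downsample(Conv(P_{l+1}))."""
    return (
        upsample2x(conv1x1(p_prev, w_prev))
        + conv1x1(p_cur, w_cur)
        + downsample2x(conv1x1(p_next, w_next))
    )


rng = np.random.default_rng(0)
C = 4
p_prev = rng.normal(size=(C, 4, 4))    # assumed coarser level
p_cur = rng.normal(size=(C, 8, 8))     # level l
p_next = rng.normal(size=(C, 16, 16))  # assumed finer level
weights = [rng.normal(size=(C, C)) for _ in range(3)]
p_out = pyramid_conv(p_prev, p_cur, p_next, *weights)
print(p_out.shape)  # (4, 8, 8)
```

The output keeps the resolution of level $l$ while mixing in information from both neighbouring scales.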

Dynamic Encoder: Deformable Convolution [Dai+ ICCV17]
• Deformable convolution is used to capture spatial attention
$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$
$R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$ (for a 3×3 kernel)
$\Delta p_n$: offset; the layer learns which pixels to use
• $P_l^{+} = \mathrm{Upsample}(\mathrm{DeformConv}(P_{l-1})) + \mathrm{DeformConv}(P_l) + \mathrm{Downsample}(\mathrm{DeformConv}(P_{l+1}))$
The same offsets are used in all three terms
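The sampling formula $y(p_0)$ can be sketched for a single channel and a single output position. In a real deformable convolution the offsets $\Delta p_n$ are predicted by a learned layer; here they are passed in as plain arrays, and fractional positions are read with bilinear interpolation (zero outside the map). The helper names are hypothetical.

```python
import numpy as np


def bilinear(x, r, c):
    """Bilinearly sample 2-D array x at fractional (row, col); zero outside."""
    h, w = x.shape
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    value = 0.0
    for dr in (0, 1):
        for dc in (0, 1):
            rr, cc = r0 + dr, c0 + dc
            if 0 <= rr < h and 0 <= cc < w:
                value += (1 - abs(r - rr)) * (1 - abs(c - cc)) * x[rr, cc]
    return value


# Regular sampling grid R of a 3x3 kernel.
R = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]


def deform_conv_at(x, weights, offsets, p0):
    """y(p0) = sum_n w(p_n) * x(p0 + p_n + dp_n).

    weights: (9,) kernel weights; offsets: (9, 2) per-tap (row, col) offsets.
    """
    total = 0.0
    for n, (pr, pc) in enumerate(R):
        r = p0[0] + pr + offsets[n, 0]
        c = p0[1] + pc + offsets[n, 1]
        total += weights[n] * bilinear(x, r, c)
    return total


x = np.arange(25, dtype=float).reshape(5, 5)  # x[r, c] = 5r + c
w = np.full(9, 1 / 9)                         # averaging kernel

# Zero offsets reduce to an ordinary 3x3 convolution.
y_regular = deform_conv_at(x, w, np.zeros((9, 2)), (2, 2))

# Shifting every tap half a pixel to the right moves the sampled window.
shift = np.tile([0.0, 0.5], (9, 1))
y_shifted = deform_conv_at(x, w, shift, (2, 2))
print(y_regular, y_shifted)
```

With zero offsets the result is the plain 3x3 mean around the centre; non-zero offsets let each tap read from a different, possibly fractional, location.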

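Dynamic Encoder: Squeeze and Excitation [Hu+ CVPR18]
• A Squeeze and Excitation module is used to produce the attention weight for $P_l^{+}$
• It models the interdependencies between channels
• Concrete steps:
1. Average over all pixels of the map, reducing each channel to a single value
2. Obtain the weights with a nonlinear transform, $\mathrm{sigmoid}(W_d\,\mathrm{ReLU}(W_e h))$; the number of channels is first reduced by a factor of $1/r$ to limit complexity
3. Multiply each channel by its weight
This yields the attention weight $w_{P_l}$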
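The three Squeeze and Excitation steps can be sketched in NumPy. This sketch assumes $W_e$ is the matrix that reduces $C$ channels to $C/r$ and $W_d$ restores them to $C$; the weights are random placeholders rather than trained values, and the function name is hypothetical.

```python
import numpy as np


def squeeze_excite(x, w_e, w_d):
    """x: (C, H, W); w_e: (C // r, C); w_d: (C, C // r).

    Returns the re-weighted feature map and the per-channel weights.
    """
    h = x.mean(axis=(1, 2))                     # 1. squeeze: one value per channel
    z = np.maximum(w_e @ h, 0.0)                # 2a. ReLU(W_e h), C -> C / r
    weights = 1.0 / (1.0 + np.exp(-(w_d @ z)))  # 2b. sigmoid(W_d z), C / r -> C
    return weights[:, None, None] * x, weights  # 3. scale each channel


rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.normal(size=(C, 6, 6))
w_e = rng.normal(size=(C // r, C))
w_d = rng.normal(size=(C, C // r))
scaled, weights = squeeze_excite(x, w_e, w_d)
print(weights.shape, scaled.shape)  # (8,) (8, 6, 6)
```

Each weight lies in (0, 1), so the module can only attenuate channels relative to one another, never amplify them.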

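Dynamic Encoder: Dynamic ReLU [Chen+ ECCV20]
The multi-scale attention $h_{msa}$ is obtained as
$h_{msa} = \{\mathrm{DyReLU}(w_{P_1} P_1^{+}); \ldots; \mathrm{DyReLU}(w_{P_k} P_k^{+})\}$
Dynamic ReLU:
$\mathrm{DyReLU}(x_c) = \max(a_c^{(1)} x_c + b_c^{(1)},\ a_c^{(2)} x_c + b_c^{(2)})$
• A generalization of ReLU
• The coefficients $[a_1^{(1)}, \ldots, a_C^{(1)}, a_1^{(2)}, \ldots, a_C^{(2)}, b_1^{(1)}, \ldots, b_C^{(1)}, b_1^{(2)}, \ldots, b_C^{(2)}]$ are computed with the same kind of processing as Squeeze and Excitation
• This introduces into ReLU a weighting that takes the interdependencies between channels into account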
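The DyReLU formula can be sketched as a per-channel maximum of two linear functions. In the actual module the coefficients are produced from the input by a Squeeze-and-Excitation-like branch; in this sketch they are simply passed in, so only the activation itself is illustrated. The function name `dy_relu` is hypothetical.

```python
import numpy as np


def dy_relu(x, a1, b1, a2, b2):
    """DyReLU(x_c) = max(a1_c * x_c + b1_c, a2_c * x_c + b2_c).

    x: (C, H, W); a1, b1, a2, b2: (C,) per-channel coefficients.
    """
    def expand(v):
        return v[:, None, None]  # broadcast (C,) over H and W

    return np.maximum(expand(a1) * x + expand(b1), expand(a2) * x + expand(b2))


x = np.array([[[-2.0, 3.0]], [[1.0, -4.0]]])  # shape (2, 1, 2)
ones, zeros = np.ones(2), np.zeros(2)

# a1 = 1, a2 = 0, b1 = b2 = 0 recovers the ordinary ReLU.
as_relu = dy_relu(x, ones, zeros, zeros, zeros)

# A different second slope per channel gives a channel-wise leaky ReLU.
leaky = dy_relu(x, ones, zeros, np.array([0.1, 0.5]), zeros)
print(as_relu)
print(leaky)
```

Because the slopes and intercepts differ per channel, the same activation can behave like ReLU on one channel and like a leaky or shifted variant on another.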

Dynamic Decoder
• Introduces the RoI pooling of Faster R-CNN into the Transformer
$F = f_{\mathrm{ROI}}(P_{enc}, B, r)$
$B$: box encoding (here, the region proposals), $r$: pooling size
• A 1×1 conv layer convolves the pooled features together with the self-attention
$Q_F = \mathrm{Conv}(F, h_q)$
$h_q$: self-attention computed from the object embeddings
• Outputs: object embedding, box encoding, object class
$\hat{Q} = f_{\mathrm{FFN}}(Q_F)$
$\hat{B} = \mathrm{ReLU}(f_{\mathrm{LN}}(f_{\mathrm{FC}}(Q_F)))$
$\hat{C} = \mathrm{softmax}(f_{\mathrm{FC}}(Q_F))$
• Boxes start at the image size and are made progressively smaller (coarse-to-fine)
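The box and class output heads, $\hat{B}$ and $\hat{C}$, can be sketched in NumPy. This is an illustration under assumptions: $Q_F$ is treated as an $(N, D)$ matrix of per-query features, the two FC layers are bias-free with random placeholder weights, layer normalization has no learnable scale or shift, and the FFN that produces $\hat{Q}$ is omitted. All names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no affine terms)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def decoder_heads(q_f, w_box, w_cls):
    """q_f: (N, D); w_box: (D, 4); w_cls: (D, K).

    B_hat = ReLU(LN(FC(Q_F))), C_hat = softmax(FC(Q_F)).
    """
    boxes = np.maximum(layer_norm(q_f @ w_box), 0.0)
    classes = softmax(q_f @ w_cls)
    return boxes, classes


rng = np.random.default_rng(0)
N, D, K = 5, 16, 3  # queries, feature dim, classes (illustrative sizes)
q_f = rng.normal(size=(N, D))
boxes, classes = decoder_heads(q_f, rng.normal(size=(D, 4)), rng.normal(size=(D, K)))
print(boxes.shape, classes.shape)  # (5, 4) (5, 3)
```

Each query yields one 4-value box encoding and one class distribution; stacking several decoder layers lets the boxes be refined step by step.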

Quantitative results
Quantitative results on the MS COCO test set:
Method                          AP    APS   APM   APL
DETR [Carion+ ECCV20]           43.3  22.5  47.3  61.1
Deformable DETR [Zhu+ ICLR21]   43.8  26.4  47.1  58.0
Dynamic DETR                    47.2  28.6  49.3  59.1
• The proposed method converges faster in training and also reaches higher performance (slide annotation: roughly 1/4)

Qualitative results
• Rather than enclosing every candidate, the boxes are narrowed down
• The dynamic decoder succeeds in learning coarse-to-fine cross-attention

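Ablation study
Ablation study on the MS COCO test set:
• DETR + Dynamic Encoder: performance improves
  This is the result of the structural changes, including the feature pyramid
• Deformable DETR + Dynamic Encoder: performance degrades
  The Dynamic Encoder fits poorly with the "anchor points" used in the Deformable Transformer's attention computation, where the attention weights are given initial values to speed up convergence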

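Summary
• DETR suffers from a very high computational cost and needs many epochs to converge
• Dynamic DETR, an object detection model that introduces a dynamic attention mechanism, is proposed
• It reduces both the computation and the number of epochs needed to converge, while exceeding the detection performance of existing methods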
