[CVPR 2022 Reading Group] Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for Autonomous Driving

Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for Autonomous Driving
TURING Inc. Inoue Yuichi

Self-introduction
❏ Inoue Yuichi
Autonomous driving development at TURING Inc.
Ph.D. (Pharmaceutical Sciences), Kyoto University
Kaggle Competitions Grandmaster
Twitter: https://twitter.com/inoichan
Github: https://github.com/Ino-Ichan
Kaggle: https://www.kaggle.com/inoueu1
Linkedin: https://www.linkedin.com/in/inoichan
TURING Wantedly → https://www.wantedly.com/companies/turing-motors

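What is today's paper about? → paper link
- It proposes a framework that performs 3D object detection and tracking end-to-end for autonomous driving.
- By making good use of a Transformer, it exploits temporal, appearance, and position features and achieves accurate 3D object tracking in real time.
* Unless otherwise noted, figures are quoted from the paper being presented.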

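3D object detection
- 3D object detection, which predicts an object's position and size in real space, is a key technology for autonomous driving.
- Cameras are cheaper than LiDAR, but their weakness is that they carry little depth information.
- To compensate for the lack of depth information, earlier work tried tracking objects using the previous and current images.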

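Object tracking
Deep-learning-based methods such as CenterTrack and Deep Affinity Network have been proposed, but they still have several weaknesses in the autonomous driving setting.
- Object detection and association (linking IDs) are performed separately, so the detector cannot learn the uncertainty of 3D object detection well.
- Objects of the same category have similar appearance features. In addition, in driving scenes objects frequently disappear from the image and their speeds vary widely.
- Appearance features and position information are not used directly as constraints, so the motion of a tracked object is not smooth.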

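What this paper achieved
1. Made 3D object detection and 3D tracking trainable end-to-end in a single framework.
2. Proposed an embedding extractor that makes geometric and appearance information compatible by converting 2D and 3D boxes into a unified representation.
3. Proposed a temporal-consistency loss that makes trajectories smoother by constraining the temporal topology.
4. Achieved the best tracking accuracy on nuScenes 3D tracking while staying real-time.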

Proposed approach

Overview

Monocular 3D Object Detection
KM3D-Net is used to output the following:
• 2D bbox
• 3D bbox
• Category
• ReID embedding

Notes on the detector:
• An anchor-free method that predicts keypoints and estimates the 3D bbox from them.
• Proposes a feature pyramid network that also works well with keypoints.
• Adds a differentiable projective-geometry constraint to compute the position attribute and builds it into the network.
• The Re-ID embedding adopts the Joint Detection and Embedding framework: the embedding at the center of each detected object is used.

References:
• Li, Peixuan. 2020. “Monocular 3D Detection with Geometric Constraints Embedding and Semi-Supervised Training.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/2009.00764.
• Li, Peixuan, Huaici Zhao, Pengfei Liu, and Feidao Cao. 2020. “RTM3D: Real-Time Monocular 3D Detection from Object Keypoints for Autonomous Driving.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/2001.03343.
• Wang, Zhongdao, Liang Zheng, Yixuan Liu, Yali Li, and Shengjin Wang. 2019. “Towards Real-Time Multi-Object Tracking.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1909.12605.
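To make the "embedding at the object center" idea concrete, here is a minimal sketch that reads a Re-ID vector out of a dense embedding map at a detected center. The map layout (H x W x C nested lists), the stride of 4, and the function name `embedding_at_center` are assumptions for illustration and are not taken from the paper.

```python
def embedding_at_center(embedding_map, center_xy, stride=4):
    """Return the Re-ID vector stored in the feature-map cell under a detected center.

    embedding_map: H x W x C nested lists (hypothetical layout).
    center_xy: (x, y) object center in input-image pixels.
    stride: assumed downsampling factor between the image and the feature map.
    """
    height, width = len(embedding_map), len(embedding_map[0])
    # Map image coordinates to feature-map indices and clamp them to the map bounds.
    col = min(max(int(center_xy[0] // stride), 0), width - 1)
    row = min(max(int(center_xy[1] // stride), 0), height - 1)
    return embedding_map[row][col]


# Toy 2 x 3 map with 2-dimensional embeddings.
toy_map = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[3.0, 3.1], [4.0, 4.1], [5.0, 5.1]],
]
center_embedding = embedding_at_center(toy_map, center_xy=(9.0, 5.0))
```

The point is that no separate Re-ID crop or second network pass is needed: one lookup per detection is enough once the detector has produced the embedding map.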

Heterogeneous Cues Embedding
❖ Appearance features (Re-ID feature): vector space
❖ Position, dimensions, and orientation (geometric feature): Euclidean space
Combining these two kinds of features well has been difficult.

Let's just fuse everything with a neural network!
[Figure: 2D box corner, 3D box corner → PointNet → Geometric feature; Re-ID feature ＋ One-hot Class → Appearance feature; Feature extractor]
Qi, Charles R., Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2016. “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1612.00593.
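A rough sketch of this fusion, under stated assumptions: box corners go through a PointNet-style encoder (the same linear layer plus ReLU on every corner, then max pooling over corners), and the result is concatenated with the Re-ID feature and a one-hot class vector. Padding the 2D corners with a zero third coordinate, the single-layer encoder, the random weights, and all function names are illustrative choices, not the paper's actual design.

```python
import random


def shared_mlp_max_pool(points, weights, biases):
    """PointNet-style encoder: one shared linear layer + ReLU per point, then max over points."""
    per_point = []
    for point in points:
        hidden = [
            max(0.0, sum(w * x for w, x in zip(row, point)) + b)
            for row, b in zip(weights, biases)
        ]
        per_point.append(hidden)
    # Max pooling over points makes the result independent of corner order.
    return [max(column) for column in zip(*per_point)]


def one_hot(class_index, num_classes):
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]


def build_object_embedding(box_corners, reid_feature, class_index, num_classes,
                           weights, biases):
    geometric = shared_mlp_max_pool(box_corners, weights, biases)
    appearance = list(reid_feature) + one_hot(class_index, num_classes)
    return geometric + appearance


rng = random.Random(0)
hidden_dim = 4
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden_dim)]
biases = [0.0] * hidden_dim

corners_3d = [[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 2.0)]
# Assumption: 2D corners are padded with z = 0 so all points share one format.
corners_2d = [[10.0, 20.0, 0.0], [30.0, 20.0, 0.0], [30.0, 40.0, 0.0], [10.0, 40.0, 0.0]]
all_corners = corners_3d + corners_2d

embedding = build_object_embedding(all_corners, [0.5, -0.5], 1, 3, weights, biases)
```

Because of the max pooling, feeding the corners in a different order gives the same geometric feature, which is the property that makes a PointNet-style encoder a natural fit for an unordered set of box corners.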

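Spatial-Temporal Information Flow
❖ Matching objects in object tracking is close to attention. With a Transformer, temporal and spatial information can be extended nicely!
★ Self-attention: propagates object information within a single time step
★ Cross-attention: propagates object information along the time axis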

Spatial-Temporal Information Flow
- Input: Geometry & Appearance feature.
- Self-attention is applied at each time step. Because the geometry feature is already included, no positional encoding is used.
- How far back each time point lies is fed in here.
- Cross-attention, with the features of past time points as Key and Value and the current features as Query, learns the temporal information.
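The Query/Key/Value roles described above can be sketched with plain single-head scaled dot-product attention. This is a simplified illustration only: it omits the learned projections and multi-layer stacking, and the way the time offset is injected (appended as one extra scalar on the Value side by the hypothetical helper `tag_time_offset`) is an assumption, since the slides do not spell out the mechanism.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention, without learned projections."""
    scale = math.sqrt(len(keys[0]))
    attended = []
    for query in queries:
        weights = softmax(
            [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        )
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, values))
             for i in range(len(values[0]))]
        )
    return attended


def tag_time_offset(features, frames_back):
    # Assumption: the time offset is appended as one extra scalar per object.
    return [list(f) + [float(frames_back)] for f in features]


current = [[1.0, 0.0], [0.0, 1.0]]  # objects in the current frame
past = [[0.0, 4.0], [4.0, 0.0]]     # objects one frame earlier

# Self-attention: objects of the same frame attend to each other.
spatial = attention(current, current, current)
# Cross-attention: Query = current frame, Key/Value = past frame.
temporal = attention(current, past, tag_time_offset(past, frames_back=1))
```

In this toy case the first current object attends mostly to the second past object, whose feature points in the same direction, which is the matching-like behavior the slide refers to.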

Spatial-Temporal Information Flow
The affinity matrix is built with the learnable W_q and W_k. Finally, IDs are assigned with the Hungarian algorithm.
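A small sketch of this step: project current and past object features with W_q and W_k, take dot products to get the affinity matrix, then pick a one-to-one assignment. Note that the assignment below is an exhaustive search over permutations, swapped in for the Hungarian algorithm to keep the example dependency-free; it reaches the same optimal total affinity but is only practical for a handful of objects. The identity weights and all function names are illustrative assumptions.

```python
from itertools import permutations


def project(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def affinity_matrix(current, past, w_q, w_k):
    """affinity[i][j] = (W_q * current_i) dot (W_k * past_j)."""
    queries = [project(w_q, feature) for feature in current]
    keys = [project(w_k, feature) for feature in past]
    return [[sum(q * k for q, k in zip(query, key)) for key in keys]
            for query in queries]


def best_assignment(affinity):
    """Brute-force one-to-one matching that maximizes total affinity.

    Stand-in for the Hungarian algorithm; requires rows <= columns.
    Returns, for each current object, the index of the matched past object.
    """
    n_rows, n_cols = len(affinity), len(affinity[0])
    if n_rows > n_cols:
        raise ValueError("expected at most as many current objects as past objects")
    best = max(
        permutations(range(n_cols), n_rows),
        key=lambda cols: sum(affinity[r][c] for r, c in enumerate(cols)),
    )
    return list(best)


identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for the learned W_q and W_k
current = [[1.0, 0.0], [0.0, 1.0]]
past = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]

affinity = affinity_matrix(current, past, identity, identity)
matches = best_assignment(affinity)
```

For realistic object counts, `scipy.optimize.linear_sum_assignment` with `maximize=True` is the usual way to solve the same assignment problem.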

Training Loss
❏ Monocular 3D object detection loss
See the original paper → Link
❏ Tracking loss
Both appearance and position features are explicitly built into the model, so a cross-entropy loss is simply computed on the affinity matrix.
❏ Temporal-consistency loss
Newly proposed!
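One way to read "cross-entropy on the affinity matrix" is sketched below: each row (one current object) is treated as logits over the past objects, and the loss is the negative log-probability of the ground-truth match. Skipping objects that have no match (target `None`) and averaging over rows are assumptions of this sketch; the paper may handle newly appearing objects differently.

```python
import math


def tracking_cross_entropy(affinity, target_columns):
    """Mean row-wise cross-entropy between affinity logits and ground-truth matches.

    affinity: rows = current objects, columns = past objects.
    target_columns: ground-truth past index per row, or None if unmatched.
    """
    losses = []
    for row, target in zip(affinity, target_columns):
        if target is None:  # assumption: unmatched objects are skipped
            continue
        peak = max(row)
        # log-sum-exp computed stably, then -log softmax(row)[target].
        log_norm = peak + math.log(sum(math.exp(a - peak) for a in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses) if losses else 0.0


uniform_loss = tracking_cross_entropy([[0.0, 0.0]], [0])
good_loss = tracking_cross_entropy([[5.0, 0.0]], [0])
bad_loss = tracking_cross_entropy([[5.0, 0.0]], [1])
```

With two equally likely candidates the loss is log 2, and it shrinks as the affinity for the correct match grows relative to the others.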

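Training Loss
❏ Temporal-consistency loss
Previous methods produced detections independently for each frame, so the results were not very consistent across frames. To compensate, the loss is designed so that the model learns how each object moves between frames.
[Equation figure labels: ground-truth offset, 3D box corners, 3D box, 3D box refinement value]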
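The exact formula is not reproduced in these notes, so the following is only a guess at the general shape of such a loss: compare the frame-to-frame displacement of the predicted 3D box corners with the ground-truth displacement using a mean L1 distance. The function names and the L1 choice are assumptions, and the paper's actual loss (which also involves a 3D box refinement value) may differ.

```python
def corner_offsets(corners_prev, corners_curr):
    """Per-corner displacement of one object between two consecutive frames."""
    return [[c - p for p, c in zip(prev_pt, curr_pt)]
            for prev_pt, curr_pt in zip(corners_prev, corners_curr)]


def temporal_consistency_l1(pred_prev, pred_curr, gt_prev, gt_curr):
    """Mean absolute difference between predicted and ground-truth corner offsets."""
    pred = corner_offsets(pred_prev, pred_curr)
    truth = corner_offsets(gt_prev, gt_curr)
    diffs = [abs(p - t)
             for pred_pt, truth_pt in zip(pred, truth)
             for p, t in zip(pred_pt, truth_pt)]
    return sum(diffs) / len(diffs)


# Two corners of a box that moves +1 along x between frames.
gt_prev = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
gt_curr = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]

# Same motion as the ground truth: no penalty.
smooth_loss = temporal_consistency_l1(gt_prev, gt_curr, gt_prev, gt_curr)

# The current-frame prediction jitters by 0.3 along x: penalized.
jitter_curr = [[1.3, 0.0, 0.0], [2.3, 0.0, 0.0]]
jitter_loss = temporal_consistency_l1(gt_prev, jitter_curr, gt_prev, gt_curr)
```

A loss of this shape penalizes frame-to-frame jitter rather than absolute position error, which is why it would complement the per-frame detection loss instead of replacing it.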

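Real-time inference
By keeping the spatial features in memory, the heavy parts (3D detection, embedding extractor, spatial information flow) only need to run once per image. What remains is just the lightweight temporal information flow, so it runs in real time! ...or so the paper says.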
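The caching idea can be sketched with a fixed-length buffer: the heavy per-image computation runs exactly once per frame, and its result is kept so later frames can attend to it. The class name `SpatialFeatureMemory`, the buffer length, and the stand-in `heavy_stage` function are hypothetical.

```python
from collections import deque


class SpatialFeatureMemory:
    """Keeps the spatial features of the most recent frames."""

    def __init__(self, max_frames, heavy_fn):
        self._frames = deque(maxlen=max_frames)  # oldest entries drop out automatically
        self._heavy_fn = heavy_fn

    def step(self, image):
        current = self._heavy_fn(image)  # heavy part: runs once per image
        past = list(self._frames)        # cached features, oldest first
        self._frames.append(current)
        return current, past


heavy_calls = []


def heavy_stage(image):
    # Stand-in for detection + embedding extraction + spatial information flow.
    heavy_calls.append(image)
    return f"features({image})"


memory = SpatialFeatureMemory(max_frames=3, heavy_fn=heavy_stage)
for frame_id in range(5):
    current, past = memory.step(frame_id)
```

Only the temporal stage would then need to run over `current` and `past` for each new frame, which is the part described as lightweight.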

Results

Training setup
★ Backbone: DLA-34 (ImageNet pretrained weights)
★ Spatial information flow: 3 self-attention layers
★ Temporal information flow: 4 cross-attention layers
★ The affinity matrix is taken from the 2nd layer, without softmax
★ Augmentation: shift and scale
★ Images are resized from (900, 1600) to (448, 800)
★ 10 images per 2080Ti × 8 GPUs → batch size 80
★ 200 epochs (1.25e-4 for 90 epochs → 1.25e-5 for 30 epochs → 1.25e-6 for 80 epochs)
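The schedule and batch size above can be written out as a quick sanity check. Zero-based epoch indexing and switching exactly at the epoch boundaries are assumptions; the slide gives only the durations and rates.

```python
def learning_rate(epoch):
    """Step schedule: 90 epochs at 1.25e-4, then 30 at 1.25e-5, then 80 at 1.25e-6."""
    if not 0 <= epoch < 200:
        raise ValueError("epoch must be in [0, 200)")
    if epoch < 90:
        return 1.25e-4
    if epoch < 120:
        return 1.25e-5
    return 1.25e-6


images_per_gpu = 10
num_gpus = 8
batch_size = images_per_gpu * num_gpus

# Resize factors from (900, 1600) to (448, 800), as (height, width).
scale_h = 448 / 900
scale_w = 800 / 1600
```

The height and width factors differ slightly (about 0.498 versus 0.5), so the resize does not preserve the aspect ratio exactly.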

Results: Qualitative Result
➢ Trajectories over the past 15 frames are shown.
➢ The trajectories are relatively smooth.
➢ It seems to handle occlusion and high-speed cars too?!

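Results: nuScenes test set
➢ Among methods that run in real time, it wins on tracking by a wide margin!
➢ Object detection falls short of LiDAR-based methods, but on multi-object tracking the results are better than those of methods that use LiDAR!
➢ Time3D‡ trained the 3D detector, the Re-ID extractor, and the spatial-temporal module separately (not end-to-end). Training detection and tracking end-to-end is better!
For the evaluation metrics, this blog post is easy to follow: Multi-Object Trackingの精度評価指標 (accuracy evaluation metrics for multi-object tracking)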

Results: Ablation of Heterogeneous Cues Embedding
➢ The Re-ID feature matters most, but adding the box features on top of it steadily improves tracking accuracy.

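Results: Ablation of the Re-ID feature
➢ The Re-ID feature slightly degrades 3D object detection accuracy. The cause is probably the conflict between the identity invariance that Re-ID requires and the variance that object detection requires.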

Results: Ablation of Spatial-Temporal Information Flow
➢ This is the core of the paper. In a comparison against replacing it with a 6-layer neural network, the Spatial-Temporal Information Flow clearly helps.

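Summary
• This work proposed a new framework that learns 3D object detection and 3D multi-object tracking end-to-end from monocular video alone and runs in real time.
• The framework showed how to encode heterogeneous cues such as category, 2D box, 3D box, and Re-ID features into compatible embeddings.
• The Transformer-based architecture proved to be a good trajectory estimator for the spatial-temporal information flow. Using the temporal-consistency loss yielded smoother trajectories.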

Thank you for listening!!
