A Survey of Transformers for Point Cloud Segmentation

2023/05/23 takmin

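About the Speaker

Takuya Minagawa, Ph.D. (Engineering)
• CEO, Vision & IT Lab Inc. (株式会社ビジョン＆ITラボ)
• Organizer of the Kanto Computer Vision Study Group (コンピュータビジョン勉強会＠関東)
• Technical advisor, Future Standard Co., Ltd. (株式会社フューチャースタンダード)

Career:
• 1999-2003: IT engineer at HP Japan (later spun off into Agilent Technologies), working on system integration, pre-sales, project management, and support.
• 2004-2009: Developed systems, applications, and services using computer vision.
• 2007-2010: Studied computer vision in the doctoral program at Keio University Graduate School; left after completing the coursework and received the Ph.D. in 2014.
• 2009-present: Freelance consulting, research, and development in computer vision (incorporated in 2018).

http://visitlab.jp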

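Vision & IT Lab Inc. is a company that uses computer vision and AI to help solve your problems.

A "town doctor" for vision technology: someone you can casually consult about AI business.

Services
1. R&D consulting
2. Contract research and development
3. Development management
4. Development consulting
5. Business development consulting

Solutions and Products
Representative solutions and products from Vision & IT Lab:
• Deep Learning
• Virtual / Augmented Reality
• Number plate recognition

Deep Learning
Consulting and development support for deep learning:
• Image classification
• Object detection
• Segmentation
• Human pose estimation
• Image translation
• Image generation
• etc.

Virtual Reality / Augmented Reality
Comprehensive technical consulting, development, and products for businesses built on Virtual Reality or Augmented Reality:
• Specific object recognition
• Visual SLAM
• 3D scanning
• Face Tracking

Number Plate Recognition: Number Plate Recognizer
Reads number plates from images and video (input: image or video; output: characters and coordinates, e.g. 札幌000 (み) 0000).
• Available as a Web API or an SDK
• SDK: Linux or Windows, C++ or Python
• Supports alphabetic classification numbers and plates with graphic designs
• Robust, fast recognition without a GPU

Contact
https://visitlab.jp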

Introduction

Background
"The idea behind MetaFormer has much in common with PointNet and point cloud convolutions. In particular, the idea used in PointNet of concatenating the global feature (obtained by global pooling) with the per-point features and transforming them with a shared MLP closely resembles the purpose of the MetaFormer structure."
From Naoya Chiba, "ニュウモン 点群深層学習 Deepで挑む３Dへの第一歩" (an introduction to deep learning on point clouds), コンピュータビジョン最前線 Winter2022.

Purpose of This Deck
• Survey methods that apply Transformers to point clouds, mainly for semantic segmentation.
• How was the Transformer applied?
• How do these methods differ from Vision Transformer, MLP-Mixer, PoolFormer, and so on?
• How do they differ from PointNet / PointNet++?

Contents
• PointNet review: PointNet, PointNet++, PointNeXt
• Transformer review: Transformer, Vision Transformer, MLP-Mixer, MetaFormer (PoolFormer)
• Point clouds + Transformer: Point Transformer, Point Transformer V2, PointMixer, Point Cloud Transformer, Point Voxel Transformer, Dual Transformer, Fast Point Transformer, Point-BERT, Stratified Transformer, OctFormer, Self-positioning Point-based Transformer
• Summary

PointNet Review

PointNet Review: Sources
• PointNet: Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
• PointNet++: Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017). PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. Conference on Neural Information Processing Systems (NeurIPS).
• PointNeXt: Qian, G., Li, Y., Peng, H., Mai, J., Hammoud, H. A. A. K., Elhoseiny, M., & Ghanem, B. (2022). PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. Conference on Neural Information Processing Systems (NeurIPS).

PointNet
• Learns features for each point of the point cloud independently with an MLP (without referring to neighboring points).
• Obtains a feature for the whole point cloud with Global Max Pooling.

PointNet: Processing Steps
1. Learn an orthogonal matrix (roughly a rotation matrix) and transform the coordinates.
2. Transform the coordinates (3 dimensions) into features (64 dimensions).
3. Learn a 64-dimensional orthogonal matrix and transform the features.
4. Transform the features (per point).
5. Aggregate the features of all points with Max Pooling to compute the global feature.
6. Classification task: compute the classification score from the global feature.
7. Segmentation task: append the global feature to each point's feature, transform the features (per point), and compute the label scores of each point from the resulting features.
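
As a rough illustration of the flow above, the following Python sketch applies one shared linear layer with ReLU to every point, max-pools across points to obtain the global feature, and appends that global feature to each point feature as in the segmentation branch. The function names, the single-layer MLP, and the toy weights are assumptions made for illustration, and the learned coordinate and feature transforms are left out.

```python
def shared_mlp(points, weight, bias):
    """Apply the same linear layer + ReLU to every point independently."""
    return [
        [max(0.0, sum(w * x for w, x in zip(row, p)) + b)
         for row, b in zip(weight, bias)]
        for p in points
    ]


def global_max_pool(features):
    """Channel-wise max over all points (independent of point order)."""
    return [max(channel) for channel in zip(*features)]


def segmentation_features(points, weight, bias):
    """Return per-point features with the global feature appended."""
    local = shared_mlp(points, weight, bias)
    global_feature = global_max_pool(local)
    return [f + global_feature for f in local], global_feature


# Hypothetical toy input: 3 points and a 3 -> 2 linear layer.
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.0]
per_point, global_feature = segmentation_features(points, weight, bias)
```

Because max pooling is a symmetric operation, reordering `points` leaves `global_feature` unchanged, which is what lets PointNet consume an unordered point set.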

PointNet++
• Applies PointNet hierarchically.
• Repeats the cycle: partition the point cloud into clusters → apply PointNet → aggregate within each cluster.

PointNet++: Encoder (downsampling)
1. Sample points with Farthest Point Sampling and group the points within radius r around each sampled point (groups may overlap).
2. Apply PointNet to each group.
3. Repeat sampling + grouping + PointNet.

PointNet++: Decoder for segmentation (upsampling)
1. Interpolate the features of the upsampled points from their K nearest neighbors using a distance-based weighted sum.
2. Apply PointNet to each point on its own.
3. Repeat upsampling and PointNet.
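
The sampling and grouping step can be sketched in plain Python as below. This is a minimal, unoptimized illustration rather than the PointNet++ implementation: starting the sampling from the first point and the function names are assumptions.

```python
import math


def farthest_point_sampling(points, m):
    """Return indices of m points, each chosen farthest from those already chosen."""
    selected = [0]  # assumption: start from the first point
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(selected) < m:
        idx = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(idx)
        # Keep, for every point, its distance to the nearest selected point.
        min_dist = [min(d, math.dist(p, points[idx]))
                    for d, p in zip(min_dist, points)]
    return selected


def ball_query(points, center, radius):
    """Return indices of all points within `radius` of `center`."""
    return [i for i, p in enumerate(points) if math.dist(p, center) <= radius]


# Hypothetical toy cloud: two clusters along the x axis.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0),
          (5.0, 0.0, 0.0), (5.1, 0.0, 0.0)]
centers = farthest_point_sampling(points, 3)
groups = [ball_query(points, points[c], radius=1.5) for c in centers]
```

In this toy input the first and third groups contain the same points, which shows how groups formed by radius search can overlap.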

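PointNeXt
Substantially improves the performance of PointNet++ through the following:
• Re-tunes data augmentation, optimization techniques, and hyperparameters based on findings from recent research.
• Normalizes the relative distances to neighbors during grouping in order to widen the receptive field.
• Introduces the Inverted Residual MLP (InvResMLP) block in order to make the network deeper.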

Transformer Review

Transformer Review: Sources
• Transformer: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
• Vision Transformer: Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale. International Conference on Learning Representations (ICLR).
• MLP-Mixer: Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., Lucic, M., & Dosovitskiy, A. (2021). MLP-Mixer: An all-MLP Architecture for Vision. Advances in Neural Information Processing Systems (NeurIPS).
• MetaFormer (PoolFormer): Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., & Yan, S. (2022). MetaFormer is Actually What You Need for Vision. Conference on Computer Vision and Pattern Recognition (CVPR).

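Transformer
• A method proposed in natural language processing, consisting of an Encoder and a Decoder.
• The Encoder takes a sequence such as a sentence (a sequence of words) or a time-series signal and converts it into a sequence of feature vectors.
• The Decoder receives the sequence of feature vectors and outputs either a reconstruction of the input sequence or a different sequence (for example, a translation).
• Through a mechanism called Attention, information such as the importance of the relationships between words is embedded in the feature vectors.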

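Attention
• A mechanism that uses a Query to selectively retrieve the needed information from a memory (Key-Value).
• Example: translation
  • Query: a Japanese word (feature vector)
  • Key, Value: an English sentence (a set of English word feature vectors)
  • Output: a weighted sum of the English word vectors, where the weight is larger for words more related to the Query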

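Attention
• Self-Attention: Q, K, and V are all computed from the same input through the projections $W^Q$, $W^K$, $W^V$.
• Cross-Attention: Q is computed from one input, and K and V from another, through the projections $W^Q$, $W^K$, $W^V$.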

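Attention
1. Compute the dot product of Query and Key to obtain the similarity of each Key to the Query: $QK^{T}$
2. Scale the result ($d_k$: input dimension): $\frac{QK^{T}}{\sqrt{d_k}}$
3. Convert the similarities into weights: $\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$
4. Take the weighted sum of the Values: $\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$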
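
The four steps above can be written out directly. The sketch below uses plain Python lists instead of a tensor library so that it stays dependency-free; the helper names and the toy inputs are assumptions for illustration, and the learned projections that produce Q, K, and V are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of every key to this query, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


# Hypothetical toy input: one query that is equally similar to both keys.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 4.0]]
result = attention(Q, K, V)
```

With a query that matches both keys equally, the output is simply the average of the two value vectors; a query aligned with the first key would pull the output toward the first value.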

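Multi-Head Attention
• Running several Single-Head Attentions in parallel yields multiple attention representations.
• To keep the computational cost down, the input dimension of each head is the original dimension divided by the number of heads h.

Vision Transformer
• Splits an image into 16x16-pixel patches and applies a Transformer Encoder with the patches as tokens, achieving performance comparable to state-of-the-art CNNs.

MLP-Mixer
• Replaces Self-Attention with a transpose of the feature vectors followed by an MLP, achieving performance comparable to Vision Transformer.
• The block alternates between mixing across patches (tokens) and mixing across channels.

MetaFormer
• Generalizes Vision Transformer and MLP-Mixer as an architecture of Token Mixing + Channel Mixing.

PoolFormer
• Even when Token Mixing is implemented with a simple operation such as average pooling, performance is comparable to Vision Transformer and MLP-Mixer.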
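
The Token Mixing + Channel Mixing structure can be sketched as a block whose token mixer is a plug-in function. The code below is a simplified illustration, not the reference implementation: it treats the tokens as a 1D sequence, omits the normalization layers, and uses hypothetical function names. Subtracting the input inside the pooling mixer follows PoolFormer, where the residual connection adds the input back.

```python
def average_pool_mixer(tokens, radius=1):
    """Token mixing by averaging each token with its neighbors in a 1D sequence."""
    mixed = []
    for i, tok in enumerate(tokens):
        window = tokens[max(0, i - radius): i + radius + 1]
        mean = [sum(col) / len(window) for col in zip(*window)]
        # Subtract the input; the block's residual connection adds it back.
        mixed.append([m - t for m, t in zip(mean, tok)])
    return mixed


def metaformer_block(tokens, token_mixer, channel_mlp):
    """Token mixing, then channel mixing, each with a residual connection."""
    mixed = token_mixer(tokens)
    x = [[a + b for a, b in zip(t, m)] for t, m in zip(tokens, mixed)]
    # channel_mlp acts on one token at a time, so it only mixes channels.
    return [[a + b for a, b in zip(t, channel_mlp(t))] for t in x]


# Hypothetical toy input: three 1-channel tokens and a channel MLP that outputs zeros.
tokens = [[1.0], [3.0], [5.0]]
out = metaformer_block(tokens, average_pool_mixer, lambda t: [0.0 for _ in t])
```

Swapping `average_pool_mixer` for a self-attention function or a token-wise MLP would give the Vision Transformer or MLP-Mixer flavor of the same block.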

Point Clouds + Transformer: Comparison with Vision

Point Transformer
• Zhao, H., Jiang, L., Jia, J., Torr, P., & Koltun, V. (2021). Point Transformer. International Conference on Computer Vision (ICCV).
• One of the earliest papers to apply a Transformer to point clouds.
• Introduces various devices for applying a Transformer to point clouds, such as Vector Attention and using relative coordinates for the Positional Embedding.
• Achieved the state of the art of its time on segmentation and classification tasks.

Point Transformer: Network Architecture
(Figure: networks for Semantic Segmentation and Classification.)
• MLP block: transforms coordinates into features.
• Point Transformer block: takes input features and coordinates, and transforms the features using neighboring points.

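Point Transformer Block
Inputs: the feature and coordinates $(x_i, p_i)$ of point $i$, and $(x_j, p_j)$ for each of its K nearest neighbors.
• Relation between Query and Key: $\varphi(x_i) - \psi(x_j)$
• Positional Embedding (MLP over the relative coordinates): $\delta = \theta(p_i - p_j)$
• Value (MLP): $\alpha(x_j)$
• Attention weights, one vector per neighbor (Query, Key, and Positional Embedding passed through an MLP): $\gamma(\varphi(x_i) - \psi(x_j) + \delta)$
• Value with Positional Embedding: $\alpha(x_j) + \delta$
• Output (Vector Attention):
$$y_i = \sum_{x_j \in \chi(i)} \sigma\left(\gamma\left(\varphi(x_i) - \psi(x_j) + \delta\right)\right) \odot \left(\alpha(x_j) + \delta\right)$$
Here $\sigma$ is a channel-wise softmax, $\odot$ is the element-wise product, and the sum runs over the K nearest neighbors $\chi(i)$.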
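
A minimal sketch of Vector Attention for a single point follows. The mappings φ, ψ, α, γ, θ are passed in as plain functions, and the toy example uses identity functions, which is purely an assumption for illustration. The softmax is normalized over the K neighbors separately for each channel, which is my reading of "channel-wise softmax" here.

```python
import math


def vector_attention(x_i, p_i, neighbors, phi, psi, alpha, gamma, theta):
    """Compute y_i from the K neighbors, given as (x_j, p_j) pairs."""
    logits, values = [], []
    for x_j, p_j in neighbors:
        delta = theta([a - b for a, b in zip(p_i, p_j)])  # positional embedding
        relation = [q - k + d for q, k, d in zip(phi(x_i), psi(x_j), delta)]
        logits.append(gamma(relation))
        values.append([v + d for v, d in zip(alpha(x_j), delta)])

    y = []
    for c in range(len(values[0])):
        # Softmax over the neighbors, computed independently for channel c.
        column = [l[c] for l in logits]
        peak = max(column)
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        y.append(sum(e / total * val[c] for e, val in zip(exps, values)))
    return y


# Hypothetical toy setup: identity mappings and no positional information.
identity = lambda v: list(v)
no_position = lambda d: [0.0, 0.0]
neighbors = [([1.0, 10.0], (0.0, 0.0, 0.0)), ([3.0, 10.0], (1.0, 0.0, 0.0))]
y = vector_attention([0.0, 0.0], (0.0, 0.0, 0.0), neighbors,
                     identity, identity, identity, identity, no_position)
```

Each output channel gets its own set of neighbor weights: in the toy example the first channel favors the first neighbor while the second channel weights both neighbors equally, which a single scalar attention weight per neighbor could not express.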

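Point Transformer: Transition Down Block (downsampling)
1. Sample points.
2. Transform the features of the K nearest neighbors of each sampled point with an MLP.
3. Apply Max Pooling over the K features.

Point Transformer: Transition Up Block (upsampling)
• Interpolates features from the three nearest points and combines them with the Skip Connection.

Point Transformer: Experiments
• S3DIS dataset
• ModelNet40, ShapeNetPart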

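Differences Between Vision Transformer and Point Transformer

Item | Vision Transformer | Point Transformer
Relation between Query and Key | Dot product | Subtraction + MLP
Attention | Scalar; Multi-Head | Vector (also weighted along the channel dimension); Single-Head (Multi-Head in Point Transformer V2)
Positional Embedding | Learned from random initial values | Relative point coordinates + MLP
Token Mixing | Whole image | K nearest neighbors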

Point Transformer V2
• Wu, X., Lao, Y., Jiang, L., Liu, X., & Zhao, H. (2022). Point Transformer V2: Grouped Vector Attention and Partition-based Pooling. Advances in Neural Information Processing Systems (NeurIPS).
• Improves on Point Transformer by introducing the following:
  • Grouped Vector Attention: a Multi-Head version of Vector Attention
  • A stronger Positional Embedding
  • Partition-based Pooling: instead of Pooling/Unpooling over K nearest neighbors, space is divided into partitions and Pooling/Unpooling is done within each partition
• Experiments: ScanNet v2, S3DIS dataset
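
Partition-based Pooling can be illustrated with a uniform grid. In the sketch below, points that fall into the same cell are merged, and the stored cluster index lets features be copied back when unpooling. Averaging the coordinates and max-pooling the features are assumptions chosen for illustration, as are the function names.

```python
from collections import defaultdict


def grid_pool(points, feats, cell):
    """Pool points that share a grid cell; return pooled data and cluster ids."""
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple(int(c // cell) for c in p)].append(i)

    pooled_points, pooled_feats, cluster = [], [], [0] * len(points)
    for k, members in enumerate(cells.values()):
        # Assumption: mean of coordinates, channel-wise max of features.
        pooled_points.append([sum(points[i][d] for i in members) / len(members)
                              for d in range(len(points[0]))])
        pooled_feats.append([max(feats[i][c] for i in members)
                             for c in range(len(feats[0]))])
        for i in members:
            cluster[i] = k
    return pooled_points, pooled_feats, cluster


def grid_unpool(pooled_feats, cluster):
    """Copy each pooled feature back to every point of its partition."""
    return [pooled_feats[k] for k in cluster]


# Hypothetical toy input: two points share a cell, one is alone.
points = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.1), (1.5, 0.0, 0.0)]
feats = [[1.0, 5.0], [2.0, 0.0], [7.0, 7.0]]
pooled_points, pooled_feats, cluster = grid_pool(points, feats, cell=1.0)
restored = grid_unpool(pooled_feats, cluster)
```

Unlike K-nearest-neighbor pooling, the partitions do not overlap, so every input point belongs to exactly one pooled point.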

PointMixer
• Choe, J., Park, C., Rameau, F., Park, J., & Kweon, I. S. (2022). PointMixer: MLP-Mixer for Point Cloud Understanding. European Conference on Computer Vision (ECCV).
• To apply MLP-Mixer to sparse, unordered data such as point clouds, replaces the Token-Mixing part with a combination of Channel-Mixing and Softmax.
• Performs mixing in three patterns: Inter-Set, Intra-Set, and Hierarchical-Set.
• Highly efficient.

PointMixer: Architecture
• The basic structure is the same as Point Transformer.

PointMixer: Mixer Block
Inputs: features and coordinates.
• The output $y_i$ is obtained by applying a channel-wise softmax $\sigma$ to the output of $g_2$, taking the element-wise product $\odot$ with the output of $g_3$, and summing ($\Sigma$).
• A point (★) is updated using the features of its k nearest neighbors (•).
• A point (★) that is one of the k nearest neighbors of another point (★) is updated using the features of that other point (the two stars were distinguished by color on the slide).

PointMixer: Downsampling and Upsampling
• Downsampling: sample points, then update the features of the sampled points from their K nearest neighbors.
• Upsampling: upsample to the point coordinates carried by the skip connection, updating the features in the direction symmetric to downsampling.

PointMixer: Experiments
• S3DIS, ModelNet40

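Differences Between MLP-Mixer and PointMixer

Item | MLP-Mixer | PointMixer
MLP Mixing | Transpose of the tokens | Weighted sum using a channel-wise Softmax
Positional Embedding | None (implied by the order of the tokens) | Relative point coordinates + MLP
Token Mixing | Whole image | K nearest neighbors

Impression: PointMixer is an entirely different thing from MLP-Mixer.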

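Differences Between Point Transformer and PointMixer
• Point Transformer:
$$y_i = \sum_{x_j \in \chi(i)} \sigma\left(\gamma\left(\varphi(x_i) - \psi(x_j) + \delta\right)\right) \odot \left(\alpha(x_j) + \delta\right)$$
  The softmax argument is the difference between Key and Query plus the Positional Embedding; the second factor is the Value plus the Positional Embedding.
• PointMixer:
$$y_i = \sum_{x_j \in \chi(i)} \sigma\left(g_2\left(\left[g_1(x_j); \delta\right]\right)\right) \odot g_3(x_j)$$
  The softmax argument is the Key with the Positional Embedding concatenated; the second factor is the Value alone.
• In both, $\sigma$ is the Softmax and the sum is a channel-wise weighted sum. The Token Mixing of PointMixer is simply a channel-wise weighted sum using Softmax and nothing more.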

Comparison of PointNet++ / Point Transformer / PointMixer
Structural comparison of the blocks (SOP = Symmetric Operation):
• PointNet++: the SOP is Max Pooling.
• Point Transformer: the SOP is Softmax + Summation.
• PointMixer: the SOP is Softmax + Summation.

Comparison of Images and Point Clouds (Vision vs. Point Cloud)
• Adding Channel Mixing to PointNet++ gives the counterpart of PoolFormer.

Point Clouds + Transformer: Other Methods

PCT: Point Cloud Transformer
• Guo, M. H., Cai, J. X., Liu, Z. N., Mu, T. J., Martin, R. R., & Hu, S. M. (2021). PCT: Point cloud transformer. Computational Visual Media, 7(2), 187-199.
• Transforms the point coordinates into features and, as in a standard Transformer, generates the attention from the dot product of Key and Query and uses it to weight the Values.
• Computes Self-Attention between all pairs of points.
• Implements order-invariant attention by introducing Offset Attention, which uses the Laplacian matrix from graph theory.

PCT: Network Architecture (figure annotations)
• Transform the point cloud into features.
• Self-Attention.
• Linear + Batch Normalization + ReLU.
• Concatenation of Max Pooling and Average Pooling.
• Linear + Batch Normalization + ReLU + Dropout.

PCT: Attention
• Attention map computed from $Q \cdot K^{T}$.
• Standard Self-Attention: $F_{sa} = \sigma\left(Q \cdot K^{T}\right) \cdot V$
• Offset-Attention: regarding the attention map $A$ as an adjacency matrix, $F_{out} = (I - A)F_{in}$, where $I - A$ plays the role of a Laplacian matrix.
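
The relation between standard Self-Attention and the offset $(I - A)F_{in}$ can be checked numerically. The sketch below is a simplification under stated assumptions: Q, K, and V are taken to be the input itself (identity projections), the softmax is applied row-wise, and the layers PCT applies after the offset are omitted. Function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def offset_attention(F_in):
    """Return the attention map A and the offset F_in - A F_in = (I - A) F_in."""
    F_t = [list(col) for col in zip(*F_in)]
    A = [softmax(row) for row in matmul(F_in, F_t)]  # softmax(Q K^T), Q = K = F_in
    F_sa = matmul(A, F_in)                           # A V, V = F_in
    offset = [[x - y for x, y in zip(fi, fs)] for fi, fs in zip(F_in, F_sa)]
    return A, offset


# Hypothetical toy input: two identical points.
A, offset = offset_attention([[1.0, 0.0], [1.0, 0.0]])
```

When all points carry the same feature, the attention output equals the input and the offset vanishes, so the offset measures how much each point differs from its attention-weighted surroundings. Reordering the input rows reorders the output rows in the same way.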

PCT: Experiments
• ModelNet40 classification
• S3DIS

Differences Between Vision Transformer and PCT

Item | Vision Transformer | PCT
Relation between Query and Key | Dot product | Dot product
Attention | Multi-Head | Offset-Attention
Positional Embedding | Learned from random initial values | Features computed from the surrounding region by Sampling + Grouping
Token Mixing | Whole image | Whole point cloud

PVT: Point Voxel Transformer
• Zhang, C., Wan, H., Shen, X., & Wu, Z. (2022). PVT: Point-voxel transformer for point cloud learning. International Journal of Intelligent Systems.
• Combines point-based Attention with voxel-based Attention (Sparse Window Attention) to obtain a fast, high-performance model.
• The voxel-based Attention uses only the voxels that contain points and computes Self-Attention inside voxelized windows, which reduces computation and lessens the influence of point density.

PVT: Point Voxel Transformer Block
• Voxel Branch: voxelizes the point cloud and applies Self-Attention in local regions. Features are assigned onto the voxels, and inside each window Self-Attention over the sparse points is computed with the help of a hash table.
• Point Branch: applies Self-Attention over the whole region while also taking the relative coordinates between points into account, $\mathrm{softmax}\left(QK^{T} + B\right) \cdot V$, where $B$ encodes the relative positions between points. For very large point clouds a simpler External Attention is used.

PVT: Experiments
• ShapeNetPart, S3DIS

Dual Transformer
• Han, X. F., Jin, Y. F., Cheng, H. X., & Xiao, G. Q. (2022). Dual Transformer for Point Cloud Analysis. IEEE Transactions on Multimedia.
• Introduces the Dual Point Cloud Transformer Block, which applies Self-Attention both across points and across channels.
• The Multi-Head Self-Attention across points and the one across channels are computed independently and then summed.
  • Self-Attention across points: $\mathrm{softmax}\left(QK^{T}\right) \cdot V$
  • Self-Attention across channels: $\mathrm{softmax}\left(Q^{T}K\right) \cdot V$
• Experiments: ModelNet40, ShapeNet
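
The difference between the two branches is only which axis the attention map is built over. The single-head sketch below makes the shapes explicit for N points with C channels. It is an illustration under assumptions: scaling and the learned projections are omitted, the names are hypothetical, and in the channel branch the C x C attention map is applied from the right of V so that the shapes match.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(M):
    return [list(row) for row in zip(*M)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def dual_attention(Q, K, V):
    """Q, K, V are N x C. Returns (point branch, channel branch, their sum)."""
    point_map = [softmax(r) for r in matmul(Q, transpose(K))]    # N x N
    point_out = matmul(point_map, V)                             # N x C
    channel_map = [softmax(r) for r in matmul(transpose(Q), K)]  # C x C
    channel_out = matmul(V, channel_map)                         # N x C (assumed order)
    total = [[a + b for a, b in zip(p, c)] for p, c in zip(point_out, channel_out)]
    return point_out, channel_out, total


# Hypothetical toy input: 3 points with 2 channels, identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
point_out, channel_out, total = dual_attention(X, X, X)
```

The point branch mixes information between points for every channel, while the channel branch mixes channels within every point, so each branch can be computed without the other.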

Fast Point Transformer
• Park, C., Jeong, Y., Cho, M., & Park, J. (2022). Fast Point Transformer. Conference on Computer Vision and Pattern Recognition (CVPR).
• Introduces a lightweight Self-Attention block that operates on local regions.
• With its voxel-hashing-based architecture, inference is 129 times faster than Point Transformer.

Fast Point Transformer: Pipeline
• Divide the point cloud into voxels. Input: $\mathcal{P}_{in} = \{(\mathbf{p}_n, \mathbf{i}_n)\}$ (coordinates, feature vector).
• Compute the features inside each voxel: $\mathcal{V} = \{(\mathbf{v}_i, \mathbf{f}_i, \mathbf{c}_i)\}$ (voxel coordinates, feature, centroid coordinates).

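Fast Point Transformer: Light-Weight Self-Attention
• Add the Positional Embedding of the centroid relative to its voxel (relative coordinates + MLP) to the feature:
$$\mathbf{g}_i = \mathbf{f}_i + \delta_{abs}(\mathbf{c}_i - \mathbf{v}_i)$$
• Aggregate over the neighboring voxels $\mathcal{N}(i)$:
$$\mathbf{f}_i' = \sum_{j \in \mathcal{N}(i)} a\left(\mathbf{g}_i, \delta_{abs}(\mathbf{v}_i - \mathbf{v}_j)\right) \psi(\mathbf{g}_j)$$
  • Query: $\mathbf{g}_i$
  • Key: $\delta_{abs}(\mathbf{v}_i - \mathbf{v}_j)$, the Positional Embedding of the relative coordinates between neighboring voxels (relative coordinates + MLP)
  • Value: $\psi(\mathbf{g}_j)$
  • $a(\cdot, \cdot)$: cosine similarity
• Over all combinations of $(i, j)$, the relative coordinates $\mathbf{v}_i - \mathbf{v}_j$ take only K distinct patterns.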
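
Because the voxel offsets take only a small number of patterns, their embeddings can be stored once and shared by every voxel. The sketch below illustrates that idea with a dictionary keyed by voxel coordinates standing in for voxel hashing. It is a sketch under assumptions rather than the paper's implementation: the names are hypothetical, the embeddings are fixed toy values instead of learned ones, and the Value is taken from the neighboring voxel j.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def lightweight_self_attention(voxel_index, g, offset_embedding, psi):
    """voxel_index maps voxel coordinates to a row of g; returns updated features."""
    out = [None] * len(g)
    for v_i, i in voxel_index.items():
        acc = [0.0] * len(psi(g[i]))
        for offset, embedding in offset_embedding.items():
            v_j = tuple(a - o for a, o in zip(v_i, offset))  # so v_i - v_j == offset
            j = voxel_index.get(v_j)  # hash lookup; None if that voxel is empty
            if j is None:
                continue
            weight = cosine(g[i], embedding)  # a(g_i, delta_abs(v_i - v_j))
            acc = [a + weight * x for a, x in zip(acc, psi(g[j]))]
        out[i] = acc
    return out


# Hypothetical toy input: two occupied voxels and three offset patterns.
voxel_index = {(0, 0, 0): 0, (1, 0, 0): 1}
g = [[1.0, 0.0], [0.0, 1.0]]
offset_embedding = {(0, 0, 0): [1.0, 0.0], (1, 0, 0): [0.0, 1.0], (-1, 0, 0): [1.0, 1.0]}
out = lightweight_self_attention(voxel_index, g, offset_embedding, lambda v: list(v))
```

The attention weight depends only on the query feature and the offset pattern, so no per-pair positional embedding needs to be computed, and empty neighbors are skipped with a single dictionary lookup.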

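Fast Point Transformer: Output
• Restore the point cloud from the voxel features: $\mathcal{P}_{out} = \{(\mathbf{p}_n, \mathbf{i}_n)\}$ (coordinates, feature vector).

Fast Point Transformer: Experiments
• S3DIS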

Point-BERT
• Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., & Lu, J. (2022). Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling. Conference on Computer Vision and Pattern Recognition (CVPR).
• Builds a pre-trained model for point cloud analysis.
• For classification, a two-layer MLP is added to make the prediction.
• For object part segmentation, the label of each point is computed from the features of several intermediate layers and the final layer of the Transformer.

Point-BERT: Pipeline
1. Divide the point cloud into patches.
2. Compute features from the point cloud patches.
3. Using a dVAE, learn discrete tokens from the patch features such that the original point cloud can be reconstructed.
4. Mask part of the sequence of patch features.
5. Train the Transformer to predict the tokens, including those of the masked parts.
6. Learn representations with Contrastive Learning, using data augmentation (a point cloud version of CutMix).

Point-BERT: Experiments
• ModelNet40, ShapeNetPart

Stratified Transformer
• Lai, X., Liu, J., Jiang, L., Wang, L., Zhao, H., Liu, S., Qi, X., & Jia, J. (2022). Stratified Transformer for 3D Point Cloud Segmentation. Conference on Computer Vision and Pattern Recognition (CVPR).
• Proposes a model that aggregates both local and wide-range features by sampling densely from nearby points and sparsely from distant points.

Stratified Transformer: Block (figure annotations)
• A learnable Look Up Table converts the relative coordinates between points into features for Query, Key, and Value, which are then embedded.
• Layer Normalization, followed by Self-Attention inside windows of different sizes.
• Layer Normalization, followed by a Feed Forward Network.
• As in a standard Transformer, the dot product of Key and Query is used (Multi-Head Self-Attention):
$$y_i = \sum_{j} \mathrm{softmax}\left(Query_i \cdot Key_j\right) \cdot Value_j$$
• In the following block, the windows are shifted by half a window and Self-Attention is computed again.

Stratified Transformer: Downsampling
• Farthest Point Sampling + k-nn.
• Pooling over the k nearest neighbors of each sampled point.

Stratified Transformer: Upsampling
• Inputs: the features and coordinates from the Skip Connection, and the output of the previous layer.
• The features of the points before downsampling are obtained by interpolation.

Stratified Transformer: Experiments
• S3DIS, ShapeNetPart

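OctFormer
• Wang, P.-S. (2023). OctFormer: Octree-based Transformers for 3D Point Clouds. ACM Transactions on Graphics (SIGGRAPH), 42(4), 1-11.
• Reduces computation by dividing the point cloud into windows and computing Self-Attention inside each window.
• Solves the problem that the number of points differs from window to window by letting the shape of the windows change flexibly.
• Enlarges the receptive field by shifting the window positions and recomputing (Dilated Partition).

OctFormer: What Is an Octree?
An octree is a kind of tree structure in which each node has up to eight child nodes. It is often used to recursively subdivide three-dimensional space into eight octants. From Wikipedia (https://ja.wikipedia.org/wiki/%E5%85%AB%E5%88%86%E6%9C%A8)

OctFormer: Window Partitioning
• Build an octree from the point cloud (explained here in 2D). The points are red, and nodes that contain points are gray.
• Arrange the octree nodes in a single sequence using the Z-Order Curve. Only nodes that contain points, and nodes that share a parent with them, are included.
• Split the node sequence into non-overlapping windows (same color = same window). The number of nodes per window is constant (7 in this example).
• Compute Self-Attention inside each window.
• Widen the receptive field by shifting the window positions. Example with Dilation=2: every second node along the Z-Order Curve (not counting empty nodes) is assigned to the same window.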
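
The ordering and window assignment can be sketched in 2D, matching the 2D explanation above. The code below is a simplified illustration with hypothetical names: it orders only occupied cells (sibling nodes are not added), leaves the last window shorter instead of padding it, and realizes dilation by taking every `dilation`-th node inside a block of `window_size * dilation` nodes, which is my assumption about the layout.

```python
def morton2d(x, y, bits=8):
    """Interleave the bits of x and y to get the position on the Z-order curve."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code


def partition(cells, window_size, dilation=1):
    """Sort occupied cells along the Z-order curve and split them into windows."""
    order = sorted(cells, key=lambda c: morton2d(*c))
    block = window_size * dilation
    windows = []
    for start in range(0, len(order), block):
        chunk = order[start:start + block]
        for d in range(dilation):
            window = chunk[d::dilation]  # every `dilation`-th node of the block
            if window:
                windows.append(window)
    return windows


# Hypothetical toy input: eight occupied cells of a 4 x 2 grid, in arbitrary order.
cells = [(3, 1), (0, 0), (2, 0), (1, 1), (0, 1), (3, 0), (1, 0), (2, 1)]
plain = partition(cells, window_size=4)
dilated = partition(cells, window_size=4, dilation=2)
```

Every window holds the same number of nodes regardless of how the points are distributed in space. With `dilation=2`, each window spans twice the extent along the curve, which is how the receptive field grows without increasing the cost per window.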

OctFormer: Experiments
• ScanNet

Self-Positioning Point-based Transformer (SPoTr)
• Park, J., Lee, S., Kim, S., Xiong, Y., & Kim, H. J. (2023). Self-positioning Point-based Transformer for Point Cloud Understanding. Conference on Computer Vision and Pattern Recognition (CVPR).
• To reduce resource usage, instead of computing Self-Attention between all pairs of points, uses self-positioning points (SP points) that capture global and local features.
• By computing local and global Cross-Attention with the SP points, achieves SOTA on three benchmarks (SONN, SN-Part, and S3DIS).

SPoTr: Computing the SP Points
Inputs: the input point coordinates $x_i$ and the feature vector $\boldsymbol{f}_i$ of each point. $\boldsymbol{z}_s$ are latent variables.
• Coordinates of the SP point computed from each latent variable:
$$\delta_s = \sum_i \mathrm{Softmax}\left(\boldsymbol{f}_i^{T}\boldsymbol{z}_s\right) x_i$$
• Points closer to the SP point get a larger weight:
$$g(\delta_s, x_i) = \exp\left(-\gamma \left\|\delta_s - x_i\right\|^{2}\right)$$
• Features closer to the latent variable get a larger weight:
$$h(\boldsymbol{z}_s, \boldsymbol{f}_i) = \frac{\exp\left(\boldsymbol{f}_i^{T}\boldsymbol{z}_s\right)}{\sum_j \exp\left(\boldsymbol{f}_j^{T}\boldsymbol{z}_s\right)}$$
• Feature vector of each SP point:
$$\boldsymbol{\psi}_s = \sum_i g(\delta_s, x_i) \cdot h(\boldsymbol{z}_s, \boldsymbol{f}_i) \cdot \boldsymbol{f}_i$$
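
The SP point computation can be followed step by step in code. The sketch below evaluates the position $\delta_s$, the two weights $g$ and $h$, and the feature $\psi_s$ for each latent variable. It is a minimal illustration with hypothetical names and toy values; the latent variables are fixed here rather than learned, and the spatial weight uses the squared distance, which is how I read the formula for $g$.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sp_points(coords, feats, latents, gamma=1.0):
    """Return a (delta_s, psi_s) pair for every latent variable z_s."""
    result = []
    for z in latents:
        # h: features closer to the latent variable get a larger weight.
        h = softmax([sum(a * b for a, b in zip(f, z)) for f in feats])
        # delta_s: weighted average of the input coordinates.
        delta = [sum(w * x[d] for w, x in zip(h, coords))
                 for d in range(len(coords[0]))]
        # g: points closer to the SP point get a larger weight (squared distance assumed).
        g = [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(delta, x)))
             for x in coords]
        psi = [sum(gi * hi * f[c] for gi, hi, f in zip(g, h, feats))
               for c in range(len(feats[0]))]
        result.append((delta, psi))
    return result


# Hypothetical toy input: two points, one neutral latent and one that favors the first point.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
feats = [[1.0, 0.0], [0.0, 1.0]]
(delta_a, psi_a), (delta_b, psi_b) = sp_points(coords, feats, [[0.0, 0.0], [50.0, 0.0]])
```

A neutral latent variable places its SP point at the center of the cloud, while a latent variable aligned with one point's feature moves the SP point onto that point, which is the sense in which the points position themselves.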

SPoTr: Channel-wise Point Attention (CWPA)
• Compute the relative coordinates between the SP points and the input points (Positional Embedding), from the input point coordinates and the SP point coordinates.
• Take the difference between the feature vectors of the input points (Query) and the feature vectors of the SP points (Key).
• Transform the feature vectors with MLPs.
• Apply Vector Attention (along the channel direction).

SPoTr: Overall Network Architecture
(Figure: networks for Segmentation and Classification.)
• Farthest Point Sampling
• CWPA using the SP points
• CWPA using only the neighboring input points

SPoTr: Experiments
• S3DIS

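Summary: Point Clouds + Transformer

Method | Range of Attention | How Attention is computed | Positional Embedding
Point Transformer | Local regions only | Subtraction + Vector Attention | Relative coordinates + MLP
PointMixer | Local regions only | Subtraction + Vector Attention | Relative coordinates + MLP
PCT | Whole point cloud (small point clouds) | Dot product + Offset Attention | Based on the view that the features already contain the coordinate information
PVT | Local regions + whole point cloud (simplified processing for large point clouds) | Dot product + Scalar Attention | Relative coordinates
Dual Transformer | Whole point cloud (small point clouds) | Dot product + Scalar Attention | Not stated
Fast Point Transformer | Local regions only | Light-Weight Self-Attention | Relative coordinates (between voxels, or between voxel and centroid) + MLP
Point-BERT | Whole point cloud (with local regions as tokens) | Dot product + Scalar Attention | Cluster center coordinates + MLP
Stratified Transformer | Local regions (multi-scale) + Shifted Window | Dot product + Scalar Attention | Look Up Table over quantized relative coordinates
OctFormer | Local regions (variable shape) + Dilated Window | Dot product + Scalar Attention | Conditional Positional Encoding (Depth Wise Conv + Batch Norm)
SPoTr | Self-Positioning Points | Subtraction + Vector Attention | Relative coordinates + MLP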
