Image Retrieval: Past and Present

Slide 1

Slide 1 text

Image Retrieval: Past and Present
DeNA Co., Ltd.
Yusuke Uchida

Slide 2

Slide 2 text

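Specific Object Recognition
• Similar image retrieval
• Generic object recognition (classification)
• Specific object recognition: detect images that contain the same object (instance)
[Figure: Query/Result examples for each task; the classification example returns the labels "sky, clouds"]
Reference: 大規模特定物体認識の最新動向 ("Recent Trends in Large-Scale Specific Object Recognition") https://sites.google.com/site/yu4uchida/uchida_ieice2013.pdf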

Slide 3

Slide 3 text

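Global Features vs. Local Features
• Global feature based
  – Extract a single feature from the image (e.g., a color histogram)
  – Works well for similar image retrieval but poorly for specific object recognition
• Local feature based
  – Extract many local features from the image (e.g., SIFT)
  – Define image similarity from the result of matching them
  – Thanks to strong features such as SIFT, this area held out against deep learning until the very end (though it seems to have fallen recently)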

Slide 4

Slide 4 text

Global Features vs. Local Features
• Search with global features
• Search with local features
• Aggregate local features into a global feature, then search
  – FV, VLAD

Slide 5

Slide 5 text

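Local-Feature-Based Specific Object Recognition
• Detection: detect local feature regions
• Description: describe local feature regions
• Indexing & Search: (approximate) nearest neighbor search
• Post process
  – Geometric verification
  – Query expansion
Note: these stages are often used as a bundled set, but in principle each can be chosen independently.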

Slide 6

Slide 6 text

Specific Object Recognition Using Local Features
① Extract local regions (patches) from images
② Describe the patches by d-dimensional vectors
③ Make correspondences between similar patches
④ Calculate similarity between the images
Local feature: position (x, y), orientation θ, scale σ, feature vector f (e.g., 128-dim SIFT)
[Figure: two images with three matched patches; Similarity: 3]

Slide 7

Slide 7 text

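Local Feature Region Detection Methods
• Two families: blob type and corner type
• Classified as rotation invariant, scale invariant, or affine invariant
• Basic idea: detect local maxima of the response of a convolution filter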

Slide 8

Slide 8 text

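Intuition Behind Multi-Scale Detection
• The response is largest when the center of the blob coincides with the center of the kernel
[Figure: signal (image) and convolution kernel (e.g., LoG)]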

Slide 9

Slide 9 text

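Intuition Behind Multi-Scale Detection
• The response is largest when the kernel size matches the scale of the blob
• A point where the filter response is a local maximum in scale space = a local feature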
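The two observations above can be checked numerically. The following is a minimal 1-D sketch, assuming a scale-normalized (multiplied by σ²) and sign-flipped LoG kernel; the function names and the synthetic box-shaped blob are illustrative and not from the slides. For a 1-D box of half-width a, the strongest response is expected at the blob center with σ close to a.

```python
import numpy as np

def normalized_log_kernel(sigma):
    """Negated, scale-normalized 1-D Laplacian-of-Gaussian kernel."""
    radius = int(4 * sigma)
    x = np.arange(-radius, radius + 1, dtype=float)
    gauss = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    second_derivative = (x**2 / sigma**4 - 1 / sigma**2) * gauss
    # sigma^2 makes responses comparable across scales; the sign flip
    # makes bright blobs produce positive responses.
    return -sigma**2 * second_derivative

def detect_blob(signal, sigmas):
    """Return (position, sigma) of the strongest response in scale space."""
    responses = np.stack(
        [np.convolve(signal, normalized_log_kernel(s), mode="same") for s in sigmas]
    )
    scale_idx, position = np.unravel_index(np.argmax(responses), responses.shape)
    return int(position), float(sigmas[scale_idx])

# Synthetic bright blob centered at index 100 with half-width of about 8.5.
signal = np.zeros(201)
signal[92:109] = 1.0
position, sigma = detect_blob(signal, np.arange(2.0, 21.0))
print(position, sigma)
```

In 2-D the same idea applies with a 2-D LoG (or its DoG approximation), and detections are taken at every local maximum rather than only the global one.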

Slide 10

Slide 10 text

Local Feature Region Detection Methods
[Figure: genealogy of detectors, grouped as corner-like / blob-like and as rotation-invariant / scale-invariant / affine-invariant]
• Detectors shown: Hessian (Beaudet'78), Harris (Harris'88), SUSAN (Smith'97), LoG (Lindeberg'98), DoG (Lowe'99; used in SIFT), Harris-Laplace (Mikolajczyk'01), Hessian-Laplace (Mikolajczyk'01), Harris-Affine (Mikolajczyk'02), Hessian-Affine (Mikolajczyk'04), FAST (Rosten'05), SURF (Bay'06), Oriented FAST (Rublee'11; used in ORB)
• Relations labeled in the figure: LoG scale selection, affine adaptation, multi-scale + box filter acceleration, LoG approximation, simplification + tree acceleration, orientation

Slide 11

Slide 11 text

Local Feature Region Description Methods
• There are real-valued descriptors and binary descriptors
  – Real-valued, e.g. (0.56, 0.22, -0.10, …, 0.96): SIFT (Lowe'99), GLOH (Mikolajczyk'05), SURF (Bay'06), RootSIFT (Arandjelovic'12)
  – Binary, e.g. (1, 0, 0, …, 1): BRIEF (Calonder'10), ORB (Rublee'11), BRISK (Leutenegger'11), FREAK (Alahi'12), LDB (Yang'12), A-KAZE (Alcantarilla'13), LATCH (Levi'16)

Slide 12

Slide 12 text

Which One Should I Use?
• Accuracy first
  – SIFT or Hessian-Affine detector + RootSIFT descriptor
• Speed first
  – ORB detector + ORB descriptor
• Survey: Local Feature Detectors, Descriptors, and Image Representations: A Survey https://arxiv.org/abs/1607.08368

Slide 13

Slide 13 text

RootSIFT [Arandjelovic+, CVPR'12]
• Hellinger kernel works better than Euclidean distance in comparing histograms such as SIFT
• Hellinger kernel (Bhattacharyya's coefficient) for L1 normalized histograms x and y: H(x, y) = Σ_i √(x_i y_i)
• Explicit feature map of x into x':
  – L1 normalize x
  – element-wise square root x to give x'
  – then x' is L2 normalized
• Computing Euclidean distance in the feature map space is equivalent to Hellinger distance in the original space: ‖x' − y'‖² = 2 − 2 H(x, y)
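The feature map above is small enough to sketch directly. This is a minimal NumPy version, assuming non-negative SIFT-like histograms as input; the function names and the `eps` guard against division by zero are illustrative choices, not part of the paper.

```python
import numpy as np

def root_sift(desc, eps=1e-12):
    """Map non-negative descriptors (..., d) to RootSIFT."""
    desc = np.asarray(desc, dtype=float)
    desc = desc / (desc.sum(axis=-1, keepdims=True) + eps)  # L1 normalize
    return np.sqrt(desc)  # element-wise square root; result has unit L2 norm

def hellinger_kernel(x, y, eps=1e-12):
    """Hellinger kernel between two non-negative histograms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x / (x.sum() + eps)
    y = y / (y.sum() + eps)
    return float(np.sqrt(x * y).sum())

# Check the stated identity on random non-negative 128-dim vectors.
rng = np.random.default_rng(0)
x, y = rng.random(128), rng.random(128)
squared_dist = float(np.sum((root_sift(x) - root_sift(y)) ** 2))
print(squared_dist, 2 - 2 * hellinger_kernel(x, y))
```

Because the mapping is applied per descriptor, it can be dropped into an existing SIFT pipeline without changing the indexing or matching code.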

Slide 14

Slide 14 text

Large-scale Object Recognition
[Figure: features of the query image are matched against features of many reference images by distance calculation]
• Explicit feature matching requires high computational cost and memory footprint
→ Bag-of-visual words!

Slide 15

Slide 15 text

Bag-of-Visual Words [Sivic+, ICCV'03]
• Offline:
  – Collect a large number of training vectors
  – Perform clustering algorithm (e.g., k-means)
  – Centroids of clusters = visual words (VWs): V = { v_i | 1 ≤ i ≤ N }
• Online:
  – All features are assigned to their nearest visual words
  – An image is represented by the frequency histogram of VWs
  – (Dis)similarity is defined by the distance between histograms
[Figure: frequency histogram over visual words VW1, VW2, …, VWn]
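The online step can be sketched as follows, assuming the visual words (centroids) have already been learned offline, for example with k-means. The function names and the toy 2-D data are illustrative only; real descriptors would be, e.g., 128-dimensional.

```python
import numpy as np

def assign_visual_words(descriptors, centroids):
    """Return the index of the nearest centroid for each descriptor."""
    diffs = descriptors[:, None, :] - centroids[None, :, :]
    squared_dists = (diffs ** 2).sum(axis=2)  # shape: (n_descriptors, n_words)
    return squared_dists.argmin(axis=1)

def bovw_histogram(descriptors, centroids):
    """Represent an image as a normalized frequency histogram of visual words."""
    words = assign_visual_words(descriptors, centroids)
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / max(hist.sum(), 1.0)

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])            # two toy visual words
descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]])
hist = bovw_histogram(descriptors, centroids)
print(hist)
```

The brute-force distance matrix is fine for a sketch; with large vocabularies the assignment itself is usually done with an approximate nearest neighbor index.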

Slide 16

Slide 16 text

Bag-of-Visual Words [Sivic+, ICCV'03]
[Figure: features of the reference images (indexing step) and of the query image (search step) are each quantized to their nearest VW among VW1, VW2, …, VWk, …, VWn; features assigned to the same VW are treated as matches]
• Matching can be performed in O(1) with an inverted index

Slide 17

Slide 17 text

Overall Pipeline
[Figure: end-to-end processing flow]
• Offline step (reference images): (1) Feature detection → (2) Feature description → (3) Quantization to visual words v1, …, vw, …, vN → store image IDs in the inverted index under each VW ID
• Online step (query image): (1) Feature detection, giving (x, y), σ, θ → (2) Feature description → (3) Quantization → (4) Voting: obtain image IDs from the inverted index and accumulate scores per image ID → get images with the top-K scores → (5) Geometric verification (inliers vs. outliers) → Results
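The indexing and voting steps can be sketched with a plain dictionary. This is a simplified illustration that uses raw vote counts as scores; weighting schemes such as tf-idf are omitted, and the function names and toy data are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_inverted_index(reference_words):
    """reference_words: {image_id: iterable of visual word IDs}."""
    index = defaultdict(list)
    for image_id, words in reference_words.items():
        for word in words:
            index[word].append(image_id)  # posting list per visual word
    return index

def vote(index, query_words, top_k=10):
    """Accumulate one vote per (query feature, posting) pair and rank images."""
    scores = Counter()
    for word in query_words:
        for image_id in index.get(word, ()):
            scores[image_id] += 1
    return scores.most_common(top_k)

reference_words = {1: [3, 5, 7], 2: [5, 5, 9], 3: [1, 2]}
index = build_inverted_index(reference_words)
ranking = vote(index, query_words=[5, 7, 3])
print(ranking)
```

Only the posting lists of visual words that occur in the query are touched, which is why the cost does not grow with the number of unrelated reference images.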

Slide 18

Slide 18 text

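Geometric (Spatial) Verification
• Matching results contain false matches
  – Correct matches (inliers) should be consistent with some geometric transformation model, so the model is estimated and the inliers are identified at the same time → RANSAC
  – Computing image similarity from the inliers only improves accuracy
[Figure: matches between two images, labeled inlier / outlier]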

Slide 19

Slide 19 text

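Models: p' = Mp
[Figure: transformation models and their degrees of freedom. Models shown: rotation, scaling, translation, similarity transformation, affine transformation, perspective transformation, Fundamental Matrix. DoF labels shown: 1DoF, 2DoF, 1DoF, 4DoF, 5DoF, 6DoF, 7DoF]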

Slide 20

Slide 20 text

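RANSAC
1. Randomly sample just enough point correspondences to compute the model parameters
2. Compute the model parameters
3. Over all point correspondences, treat those consistent with these model parameters as inliers
4. Repeat the above a fixed number of times and adopt the model parameters with the most inliers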
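The four steps can be sketched as follows for an affine model, which needs three correspondences per sample. This is a minimal illustration in NumPy; the function names, the fixed iteration count, and the pixel threshold are assumptions, and a production implementation would typically refit the model on all inliers at the end.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix A such that dst ≈ A @ [x, y, 1]."""
    homogeneous = np.hstack([src, np.ones((len(src), 1))])
    solution, *_ = np.linalg.lstsq(homogeneous, dst, rcond=None)
    return solution.T

def ransac_affine(src, dst, n_iters=200, threshold=3.0, seed=0):
    rng = np.random.default_rng(seed)
    homogeneous = np.hstack([src, np.ones((len(src), 1))])
    best_model, best_inliers = None, np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(src), size=3, replace=False)       # step 1
        model = fit_affine(src[sample], dst[sample])               # step 2
        errors = np.linalg.norm(homogeneous @ model.T - dst, axis=1)
        inliers = errors < threshold                               # step 3
        if inliers.sum() > best_inliers.sum():                     # step 4
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Toy data: 20 correct correspondences and 10 random outliers.
rng = np.random.default_rng(1)
src = rng.uniform(0, 100, size=(30, 2))
true_affine = np.array([[1.2, -0.3, 5.0], [0.4, 0.9, -2.0]])
dst = src @ true_affine[:, :2].T + true_affine[:, 2]
dst[:10] = rng.uniform(0, 100, size=(10, 2))
model, inliers = ransac_affine(src, dst)
print(int(inliers.sum()))
```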

Slide 21

Slide 21 text

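Weak Geometric Consistency [Jegou+, ECCV'08]
• For correct matches the scale ratio and the angle difference are consistent, so matches vote, Hough-transform style, in the space of angle difference and scale ratio
  – Scores of correct image pairs do not drop, while scores of incorrect pairs drop substantially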
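A rough sketch of this voting idea follows. It assumes each match carries the orientation and scale of both features, builds one histogram over angle differences and one over log scale ratios, and takes the smaller of the two histogram peaks as the image score. The bin counts and function name are illustrative, and the weighting and smoothing used in the paper are left out.

```python
import math
from collections import Counter

def wgc_score(matches, n_angle_bins=8, scale_bins_per_octave=2):
    """matches: list of (theta_query, sigma_query, theta_ref, sigma_ref)."""
    if not matches:
        return 0
    angle_votes, scale_votes = Counter(), Counter()
    for theta_q, sigma_q, theta_r, sigma_r in matches:
        angle_diff = (theta_r - theta_q) % (2 * math.pi)
        angle_bin = int(angle_diff / (2 * math.pi) * n_angle_bins) % n_angle_bins
        angle_votes[angle_bin] += 1
        log_ratio = math.log2(sigma_r / sigma_q)
        scale_votes[math.floor(log_ratio * scale_bins_per_octave)] += 1
    # A correct image pair produces a strong peak in both histograms.
    return min(max(angle_votes.values()), max(scale_votes.values()))

# Five matches sharing the same rotation (0.3 rad) and scale ratio (1.19).
consistent = [(0.1 * i, 1.0 + i, 0.1 * i + 0.3, (1.0 + i) * 1.19) for i in range(5)]
# Five matches whose angle differences disagree with each other.
inconsistent = [(0.0, 1.0, angle, 1.0) for angle in (0.3, 1.2, 2.0, 3.0, 4.0)]
print(wgc_score(consistent), wgc_score(inconsistent))
```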

Slide 22

Slide 22 text

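Which Model Should I Use?
• Start with a similarity transformation or an affine transformation
  ← If the camera is far from the object, a similarity transformation is usually a good enough approximation
• With feature regions that carry scale and orientation, a similarity transformation can be computed from a single correspondence!
  → Estimate a model and count inliers for every matched pair
• After that, a model with more degrees of freedom may be fitted
J. Philbin et al., "Object retrieval with large vocabularies and fast spatial matching," CVPR'07.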
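The single-correspondence idea can be sketched as follows: the scale ratio gives the scaling, the orientation difference gives the rotation, and the positions give the translation. Each match then yields one hypothesis, so all hypotheses can be enumerated deterministically instead of sampled. The feature layout (x, y, sigma, theta), the function names, and the threshold are assumptions made for this example.

```python
import numpy as np

def similarity_from_match(feat_q, feat_r):
    """Build a 2x3 similarity transform from one matched feature pair."""
    xq, yq, sigma_q, theta_q = feat_q
    xr, yr, sigma_r, theta_r = feat_r
    scale = sigma_r / sigma_q
    angle = theta_r - theta_q
    linear = scale * np.array([[np.cos(angle), -np.sin(angle)],
                               [np.sin(angle), np.cos(angle)]])
    translation = np.array([xr, yr]) - linear @ np.array([xq, yq])
    return np.hstack([linear, translation[:, None]])

def best_similarity(query_feats, ref_feats, threshold=5.0):
    """Try the hypothesis of every match and keep the one with most inliers."""
    q = np.asarray(query_feats, dtype=float)
    r = np.asarray(ref_feats, dtype=float)
    q_h = np.hstack([q[:, :2], np.ones((len(q), 1))])
    best_model, best_inliers = None, np.zeros(len(q), dtype=bool)
    for feat_q, feat_r in zip(q, r):
        model = similarity_from_match(feat_q, feat_r)
        errors = np.linalg.norm(q_h @ model.T - r[:, :2], axis=1)
        inliers = errors < threshold
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

The cost is quadratic in the number of matches per image pair, which is usually acceptable because only the top-ranked candidates are verified.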

Slide 23

Slide 23 text

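Query Expansion
• Based on the initial search results, construct a new search query artificially, aiming to improve the results in a chain, one find leading to the next
[Figure: query → search results → expanded query → new search results]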

Slide 24

Slide 24 text

Average Query Expansion [Chum+, ICCV'07]
• Obtain top (m < 50) verified results of original query
• Construct new query using average of these results
• Without geometric verification, QE degrades accuracy!
[Figure: query image → verified results → new query]
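A minimal sketch of the averaging step, assuming images are represented as BoVW vectors and that the original query is included in the average. The final L2 normalization and the function name are assumptions made for this example.

```python
import numpy as np

def average_query_expansion(query_vec, verified_vecs):
    """Average the query with its spatially verified results."""
    stacked = np.vstack([query_vec] + list(verified_vecs))
    expanded = stacked.mean(axis=0)
    norm = np.linalg.norm(expanded)
    return expanded / norm if norm > 0 else expanded

query = np.array([1.0, 0.0, 0.0])
verified = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])]
expanded = average_query_expansion(query, verified)
print(expanded)
```

The expanded query now has non-zero weight on visual words that were absent from the original query, which is how results that the first search missed can be reached.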

Slide 25

Slide 25 text

Multiple Image Resolution Expansion [Chum+, ICCV'07]
• Calculate relative change in resolution
• Construct average query for each resolution
[Figure: ROI of the query image and ROIs of the first verified results, combined into New query1, New query2, New query3]

Slide 26

Slide 26 text

Query Expansion Results
• ori = original query
• qeb = query expansion baseline
• trc = transitive closure expansion
• avg = average query expansion
• rec = recursive average query expansion
• sca = multiple image resolution expansion

Slide 27

Slide 27 text

Discriminative Query Expansion [Arandjelovic+, CVPR'12]
• Train a linear SVM classifier
  – Use verified results as positive training data
  – Use low ranked images as negative training data
  – Rank images on their signed distance from the decision boundary
  – Reranking can be efficient with an inverted index!

Slide 28

Slide 28 text

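Aggregation Methods
• Around 1,000 local features are extracted from a single image
• With many images the index becomes bloated
• In image recognition especially, it is convenient to treat an image as a single vector
  – Fisher Vector (FV)
  – VLAD
• Not used when accuracy is the priority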
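As an illustration of aggregation, the following is a minimal VLAD sketch: each descriptor is assigned to its nearest centroid, the residuals are summed per centroid, and the concatenation is L2 normalized. The centroids are assumed to come from k-means, and refinements such as power normalization are left out; the function name and toy data are illustrative.

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate local descriptors (n, d) into one vector of length k * d."""
    diffs = descriptors[:, None, :] - centroids[None, :, :]
    nearest = (diffs ** 2).sum(axis=2).argmin(axis=1)
    aggregated = np.zeros_like(centroids, dtype=float)
    for k in range(len(centroids)):
        members = descriptors[nearest == k]
        if len(members):
            aggregated[k] = (members - centroids[k]).sum(axis=0)  # residual sum
    flat = aggregated.ravel()
    norm = np.linalg.norm(flat)
    return flat / norm if norm > 0 else flat

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]])
vector = vlad(descriptors, centroids)
print(vector)
```

However many local features an image has, the output length is fixed at k × d, which is what keeps the index compact.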

Slide 29

Slide 29 text

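Nearest Neighbor Search (NNS)
• Given a set of points S in a metric space M and a query point q ∈ M, find the point in S that is closest to q
  – Variants: k-nearest neighbor search / range search
• In most cases the problem is nearest neighbor search in Euclidean space
• Indexing structures such as kd-tree and SR-tree speed this up (but they fall to the curse of dimensionality in high dimensions, from a few tens of dimensions?)
[Figure: input: point set S and query q; output: the point of S nearest to q]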
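For reference, the exact problem can be solved by brute force in a few lines. This sketch assumes Euclidean distance and a point set stored as an (n, d) NumPy array; the function name is illustrative. Index structures aim to return the same answer without scanning every point.

```python
import numpy as np

def knn_search(points, query, k=1):
    """Return indices and distances of the k points closest to the query."""
    squared_dists = ((points - query) ** 2).sum(axis=1)
    order = np.argsort(squared_dists)[:k]
    return order, np.sqrt(squared_dists[order])

S = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
q = np.array([0.9, 0.9])
indices, distances = knn_search(S, q, k=2)
print(indices, distances)
```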

Slide 30

Slide 30 text

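Approximate Nearest Neighbor Search
• Gain speed by tolerating errors; speed trades off against the error rate
  – Speed, accuracy, and memory usage trade off against one another
• Tree structure + priority search
  – kd-tree, randomized kd-trees, hierarchical kd-tree
  – A safe choice if memory is not a concern
• Locality Sensitive Hashing (LSH) family
  – There are lots of ***LSH variants. Personally I dislike them
• Product quantization family
  – Survey → https://www.jstage.jst.go.jp/article/mta/6/1/6_2/_article/-char/ja/
  – Compress the data and search it while it stays compressed
• Binary compression family
  – There are lots of them: https://www.slideshare.net/ren4yu/k-means-hashing-up (by He)
  – Because the data become binary codes, distances can be computed with the popcnt instruction (but as is, it is still a linear search)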
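The popcount-based distance for binary codes can be sketched in pure Python, assuming each code is stored as an integer. `bin(...).count("1")` stands in for the hardware popcnt instruction, and the scan over all codes is exactly the linear search mentioned above; the function names are illustrative.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary codes (XOR + popcount)."""
    return bin(a ^ b).count("1")

def linear_search(codes, query, k=1):
    """Return indices of the k codes closest to the query in Hamming distance."""
    order = sorted(range(len(codes)), key=lambda i: hamming_distance(codes[i], query))
    return order[:k]

codes = [0b1100, 0b0101, 0b1010]
query = 0b1010
print(linear_search(codes, query, k=3))
```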

Slide 31

Slide 31 text

Wait, but aren't CNNs better anyway?

Slide 32

Slide 32 text

CNN-Based Methods (global feature)
• CNN Features off-the-shelf: an Astounding Baseline for Recognition https://arxiv.org/abs/1403.6382
  – Using the FC layer of a classification CNN (OverFeat) as is already works fairly well
• Neural Codes for Image Retrieval https://arxiv.org/pdf/1404.1777.pdf
  – Findings such as: the FC layer before the final layer works better, and fine-tuning on the domain of the search target helps
• CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples https://arxiv.org/abs/1604.02426
  – Trained with a Siamese network
• Even global features perform quite well (vs. FV/VLAD)
• Note that they are basically not rotation or scale invariant

Slide 33

Slide 33 text

CNN-Based Methods (local feature)
• LIFT: Learned Invariant Feature Transform https://arxiv.org/abs/1603.09114
  – Learns detection, orientation estimation, and description end-to-end
  – Slow, and it does not reach good accuracy in retrieval
• Large-Scale Image Retrieval with Attentive Deep Local Features https://arxiv.org/abs/1612.06321
  – Defines local features with an FCN + attention (applied at multiple scales)
  – Looks promising: https://github.com/tensorflow/models/tree/master/research/delf
  – Rotation invariance is not guaranteed

Slide 34

Slide 34 text

DELF

Slide 35

Slide 35 text

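Comparative Study
• Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking https://arxiv.org/abs/1803.11285
  – Local vs. global and CNN vs. non-CNN methods are compared comprehensively (although the comparison may be biased toward the authors' own team)
[Figure: results grouped as Local / Global and non-CNN / CNN]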

Slide 36

Slide 36 text

Best Practice ①
• Global → https://arxiv.org/abs/1711.02512
  – Use a strong base network (ResNet or better) and fine-tune it (Siamese?)
  – Use generalized mean-pooling (Lp, p=3)
  – Use multiple scales (regions)
  – Diffusion-based query expansion at the region level https://arxiv.org/abs/1611.05113
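Generalized mean pooling itself is a one-line formula and can be sketched in NumPy. The sketch assumes a (C, H, W) map of non-negative activations, such as the output of a ReLU layer; the function name, the `eps` clamp, and the optional L2 normalization are illustrative choices. With p = 1 it reduces to average pooling, and as p grows it approaches max pooling.

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6, normalize=True):
    """Generalized mean pooling of a (C, H, W) activation map into (C,)."""
    clamped = np.clip(feature_map, eps, None)  # keep the p-th root well defined
    pooled = (clamped ** p).mean(axis=(1, 2)) ** (1.0 / p)
    if normalize:
        pooled = pooled / np.linalg.norm(pooled)
    return pooled

rng = np.random.default_rng(0)
feature_map = rng.random((4, 5, 5))
descriptor = gem_pool(feature_map)                  # L2-normalized global descriptor
raw = gem_pool(feature_map, normalize=False)
print(descriptor.shape, raw)
```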

Slide 37

Slide 37 text

Best Practice ②
• Local → https://hal.inria.fr/hal-01131898/document
  – Use DELF as the feature
  – Indexing, matching, and scoring are complicated (ASMK)
  – Geometric verification is a must
  – Do query expansion as well