② Describe the patches by d-dimensional vectors
③ Make correspondences between similar patches
④ Calculate similarity between the images
[Figure: two images with matched patches; Similarity: 3 correspondences]
• Local feature: position (x, y), orientation θ, scale σ, and a feature vector f (e.g., 128-dim SIFT)
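As a concrete illustration of steps ②–④, here is a minimal sketch using OpenCV's SIFT implementation, with Lowe's ratio test for the correspondences and the number of matches as the similarity score (file names and the 0.75 threshold are placeholders, not part of the slides):

```python
import cv2

# Load two grayscale images (file names are placeholders)
img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)

# Steps 1-2: detect keypoints and describe each patch by a 128-dim SIFT vector;
# each keypoint carries position kp.pt, orientation kp.angle, and scale kp.size
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Step 3: correspondences between similar patches (Lowe's ratio test)
matcher = cv2.BFMatcher(cv2.NORM_L2)
pairs = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in pairs if m.distance < 0.75 * n.distance]

# Step 4: similarity between the images = number of correspondences
print("similarity =", len(good))
```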
• Hellinger distance performs better than Euclidean distance in comparing histograms such as SIFT
• Hellinger kernel (Bhattacharyya's coefficient) for L1-normalized histograms x and y:
  $H(x, y) = \sum_{i=1}^{n} \sqrt{x_i y_i}$
• Explicit feature map of x into x′ (RootSIFT):
  – L1 normalize x
  – element-wise square root of x to give x′
  – then x′ is L2 normalized, since $\sum_i x_i'^2 = \sum_i x_i = 1$
• Computing Euclidean distance in the feature-map space is equivalent to computing the Hellinger distance in the original space:
  $\|x' - y'\|_2^2 = \|x'\|^2 + \|y'\|^2 - 2\,x'^\top y' = 2 - 2H(x, y)$
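A minimal NumPy sketch of this feature map (the function name `rootsift` is ours), with a numerical check that Euclidean distance on the mapped vectors satisfies the identity above:

```python
import numpy as np

def rootsift(desc, eps=1e-12):
    """Explicit feature map: L1 normalize, then element-wise square root.
    The result is automatically L2-normalized."""
    desc = np.array(desc, dtype=np.float64)          # copy so the input is untouched
    desc /= desc.sum(axis=-1, keepdims=True) + eps   # L1 normalize (SIFT is non-negative)
    return np.sqrt(desc)

rng = np.random.default_rng(0)
x, y = rng.random(128), rng.random(128)              # stand-ins for SIFT descriptors
xp, yp = rootsift(x), rootsift(y)

# Hellinger kernel on the L1-normalized originals
H = np.sum(np.sqrt((x / x.sum()) * (y / y.sum())))

# ||x' - y'||^2 = 2 - 2 H(x, y)
assert np.isclose(np.sum((xp - yp) ** 2), 2 - 2 * H)
```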
• Offline:
  – Collect a large number of training vectors
  – Perform a clustering algorithm (e.g., k-means)
  – Centroids of the clusters = visual words (VWs): $V = \{v_i \mid 1 \le i \le N\}$
• Online:
  – All features are assigned to their nearest visual words
  – An image is represented by the frequency histogram of VWs
  – (Dis)similarity is defined by the distance between histograms
[Figure: frequency histogram of an image over visual words VW1, VW2, …, VWn]
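A minimal sketch of both phases using scikit-learn's k-means (random vectors stand in for SIFT descriptors; the vocabulary size N = 1000 is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Offline: cluster training descriptors; the N centroids are the visual words
train_desc = rng.random((10000, 128))   # stand-in for pooled SIFT descriptors
N = 1000
kmeans = KMeans(n_clusters=N, n_init=1, random_state=0).fit(train_desc)

# Online: an image = frequency histogram of its features' nearest visual words
def bovw_histogram(image_desc):
    words = kmeans.predict(image_desc)                  # nearest-VW assignment
    hist = np.bincount(words, minlength=N).astype(float)
    return hist / (hist.sum() + 1e-12)                  # L1 normalize

# (Dis)similarity between two images = distance between their histograms
h1 = bovw_histogram(rng.random((500, 128)))
h2 = bovw_histogram(rng.random((400, 128)))
dist = np.linalg.norm(h1 - h2)
```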
[Figure: indexing step (quantization) assigns each reference-image feature to its nearest VW and records the image in that VW's list; the search step (quantization) assigns each query-image feature to its nearest VW and reads off the matching reference images]
• Matching can be performed in O(1) with an inverted index
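A toy inverted index in plain Python, sketching the two steps above: each visual word maps to a posting list of the reference images containing it, so a query only touches the lists of its own VWs (one O(1) dictionary lookup per word) rather than scanning all reference images:

```python
from collections import defaultdict

# Inverted index: visual word id -> list of (image id, term frequency)
inverted_index = defaultdict(list)

def index_image(image_id, words):
    """Indexing step: record which VWs occur in this reference image."""
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    for w, tf in counts.items():
        inverted_index[w].append((image_id, tf))

def search(query_words):
    """Search step: score only the images sharing at least one VW with the query."""
    scores = defaultdict(float)
    for w in set(query_words):
        for image_id, tf in inverted_index[w]:   # O(1) posting-list lookup per VW
            scores[image_id] += tf
    return sorted(scores.items(), key=lambda kv: -kv[1])
```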
• Take the top m (m < 50) verified results of the original query
• Construct a new query using the average of these results
• Without geometric verification, QE degrades accuracy!
[Figure: query image → verified results → new query]
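A sketch of average query expansion on BoVW histograms, assuming (as in Chum et al.'s formulation) that the original query is averaged together with its top verified results; the L2 renormalization at the end is our choice:

```python
import numpy as np

def average_query_expansion(query_hist, verified_hists):
    """New query = average of the original query histogram and the
    histograms of its top m spatially verified results."""
    stacked = np.vstack([query_hist] + list(verified_hists))
    new_query = stacked.mean(axis=0)
    return new_query / (np.linalg.norm(new_query) + 1e-12)
```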
[Figure: ROI of the query image and the ROIs of the first verified results]
• Calculate the relative change in resolution between the query ROI and each verified result's ROI
• Construct an average query for each resolution band → new query 1, new query 2, new query 3
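A rough sketch of this resolution expansion under simplifying assumptions: each verified result arrives with a precomputed relative resolution (e.g., derived from the verified transformation between the two ROIs), and the three band thresholds are arbitrary placeholders, not values from the slides:

```python
import numpy as np

def resolution_expansion(query_hist, results,
                         bands=((0.0, 0.5), (0.5, 2.0), (2.0, np.inf))):
    """Group verified results by relative resolution change and build one
    average query per band. Each result is (histogram, relative_resolution),
    a hypothetical layout assumed for this sketch."""
    new_queries = []
    for lo, hi in bands:
        members = [h for h, r in results if lo <= r < hi]
        if members:
            q = np.vstack([query_hist] + members).mean(axis=0)
            new_queries.append(q / (np.linalg.norm(q) + 1e-12))
    return new_queries
```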
• Train a linear SVM classifier:
  – Use verified results as positive training data
  – Use low-ranked images as negative training data
• Rank images by their signed distance from the decision boundary
• Reranking can be efficient with an inverted index!
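A minimal sketch of this discriminative reranking with scikit-learn's LinearSVC (the function name and C = 1.0 are our choices; the decision function w·x + b is proportional to the signed distance, which preserves the ranking):

```python
import numpy as np
from sklearn.svm import LinearSVC

def discriminative_qe_rerank(pos_hists, neg_hists, database_hists):
    """Fit a linear SVM on verified (positive) vs. low-ranked (negative)
    histograms, then rerank the database by signed distance to the boundary."""
    X = np.vstack([pos_hists, neg_hists])
    y = np.r_[np.ones(len(pos_hists)), -np.ones(len(neg_hists))]
    svm = LinearSVC(C=1.0).fit(X, y)
    scores = svm.decision_function(database_hists)   # proportional to signed distance
    return np.argsort(-scores)                       # best-ranked first

# Since the weight vector w lives in visual-word space, the score w.x for every
# database image can be accumulated by walking the inverted-index posting lists
# of the non-zero components of w, which is what makes the reranking efficient.
```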