Image Search for Recipes

Image Search for Recipes
Cookpad Inc., 三條智史
52nd Computer Vision Study Group @ Kanto (第52回コンピュータビジョン勉強会＠関東), 2019/04/14

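Self-introduction
• 三條智史 [ @johshisha ]
• Doshisha University Graduate School, master's degree
• Cookpad Inc.
  ◦ Joined as a new graduate in 2018 (second year)
  ◦ Research and Development Department
  ◦ Ad recommendation, estimating how appealing a dish photo looks
• Likes Calpis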

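Why image search?
> Cookpad's work on image search for recipes
  > "レシピの画像検索に必要な技術" (Technologies needed for recipe image search) by ayemos, Cookpad Developer Blog https://techlife.cookpad.com/entry/2018/08/28/162000
> You can look up how to make a dish whose name you do not know
  > A dish you ate at a restaurant or saw on social media
> A photo you took can be linked to a recipe
  > The recipe a user actually cooked can be identified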

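Image search vs. recipe image search
> Specific object recognition
  > Retrieve images that show the same object
> Recipe image search
  > For the use case we have in mind
  > Retrieve recipes that can reproduce the same dish
(Figure: a query image and its results for each of the two cases)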

Flow of image retrieval
(Figure quoted from [Zhou+, 2017]. I will mainly talk about this part of the pipeline.)

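Image retrieval methods
> Based on hand-crafted global features
  > One feature vector is obtained per image
  > Color histograms, GIST, etc. [Douze+, 2009]
  > Tends to fail when, for example, the background is cluttered
> Based on hand-crafted local features
  > Many local feature vectors are obtained per image
  > SIFT, SURF, etc. [Zheng+, 2018]
  > Many vectors are generated → aggregate them (BoVW, VLAD)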
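To make "one feature vector per image" concrete, here is a minimal sketch of a color-histogram global feature, written in plain Python. The function names, the 4-bins-per-channel setting, and the use of histogram intersection as the similarity are illustrative assumptions, not something taken from the slides or from [Douze+, 2009].

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each value in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return an L1-normalized joint RGB histogram: one global vector per image."""
    hist = [0.0] * (bins_per_channel ** 3)
    step = 256 / bins_per_channel
    for r, g, b in pixels:
        # Quantize each channel, clamping so that 255 falls in the last bin.
        rb = min(int(r / step), bins_per_channel - 1)
        gb = min(int(g / step), bins_per_channel - 1)
        bb = min(int(b / step), bins_per_channel - 1)
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1] for two L1-normalized histograms (assumed measure)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical toy "images": flat lists of pixels.
red_image = [(255, 0, 0)] * 4
blue_image = [(0, 0, 255)] * 4
similarity = histogram_intersection(color_histogram(red_image), color_histogram(blue_image))
```

Because every pixel contributes to the same vector, background pixels are counted exactly like pixels of the dish, which is one way to see why a cluttered background hurts this kind of feature.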

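Image retrieval methods
> Based on CNN global features
  > Inspired by the performance gains of CNNs in generic object recognition and many other fields
  > Use the features of a pretrained model as they are, or fine-tune it [Razavian+, 2014]
> Based on CNN local features
  > Global features struggle when the background is cluttered
  > Extract local features from the CNN [Ng+, 2015]
  > Split the image into patches, or use only the regions that seem important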

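Methods for recipe image search
> Basically follows the same path as general image retrieval (trailing behind it)
  > Methods using hand-crafted features [Farinella+, 2016]
  > Methods using CNN global features
  > (Will methods using CNN local features come next?)
> Recipes have text in addition to images (ingredients, steps, etc.)
  > Cross-modal recipe retrieval methods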

Methods for recipe image search
> Recipes have text in addition to images (ingredients, steps, etc.)
  > Cross-modal recipe retrieval methods
These methods are explained in detail.

Methods using CNN global features
> Learning CNN-based Features for Retrieval of Food Images
  > Ciocca+, ICIAP 2017
  > Classification-based method
> Learning Food Image Similarity for Food Image Retrieval
  > Shimoda+, BigMM 2017
  > Method based on distances in feature space

Learning CNN-based Features for Retrieval of Food Images [Ciocca+, ICIAP 2017]
> Verifies whether CNN-based features are also useful in the food domain
> Fine-tunes ResNet-50 on a 524-class food classification task
  > Classification accuracy: 69.52% Top-1 and 89.61% Top-5
> Features are extracted from the last FC layer

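Experimental results
> Test dataset
  > 4,754 images of 1,200 dishes
  > Dishes from various countries
> Task
  > Retrieve photos of the same dish as the query
  > The search target contains one image for each of the 1,200 dishes
http://iplab.dmi.unict.it/UNICT-FD1200/VisualAnalysis_1.htm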
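The retrieval task above reduces to a nearest-neighbor search: extract a feature vector for the query and rank the gallery (one image per dish) by distance. The sketch below assumes features have already been extracted by some model. The `search` helper, the Euclidean distance, and the toy vectors are illustrative assumptions, not the evaluation code of [Ciocca+, ICIAP 2017].

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query: Sequence[float], gallery: Dict[str, Sequence[float]], top_k: int = 5) -> List[str]:
    """Return the ids of the top_k gallery items closest to the query."""
    ranked = sorted(gallery.items(), key=lambda item: euclidean(query, item[1]))
    return [dish_id for dish_id, _ in ranked[:top_k]]


# Hypothetical 2-D features; real CNN features have far more dimensions.
gallery = {"curry": [1.0, 0.0], "ramen": [0.0, 1.0], "salad": [0.9, 0.2]}
results = search([1.0, 0.1], gallery, top_k=2)
```

A linear scan like this is enough for a gallery of 1,200 images. Larger collections usually need an approximate nearest-neighbor index, which is outside the scope of this talk.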

Learning Food Image Similarity for Food Image Retrieval [Shimoda+, BigMM 2017]
> Images similar to the query should be placed close to it in feature space
  > Siamese Network
  > Triplet Network

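Two networks are examined
> Siamese Network [Bromley+, 1994]
  > Takes a query and a target sample
  > If they are similar, they are pulled closer together
  > If they are not similar, they are pushed apart
> Triplet Network [Wang+, 2014]
  > Takes a query and two samples
  > The similar sample should end up closer than the dissimilar one
  > Learns the relationship among the selected samples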

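Loss function of the Siamese Network (the index i is omitted)
> Inputs: a pair of sample images
> Y: 1 if the pair is similar, 0 otherwise
> C: margin
> D: Euclidean distance
> f: feature extraction function
What we want:
> When the two images are similar: the distance should become small
> When the two images are not similar: the distance should become large
  > Once the distance exceeds C, the loss is 0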
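The equation on this slide is not part of the transcript, so the sketch below uses the widely used contrastive form, L = Y · D² + (1 − Y) · max(0, C − D)², as an assumption. The paper's exact formula may differ, for instance in constant factors or in whether the distance is squared. It does reproduce the behavior described above: similar pairs are pulled together, and dissimilar pairs stop contributing once their distance exceeds the margin C.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """D: Euclidean distance between two feature vectors f(x1) and f(x2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(feat_a: Sequence[float], feat_b: Sequence[float], y: int, margin: float = 1.0) -> float:
    """Assumed contrastive loss: y = 1 for a similar pair, y = 0 otherwise.

    margin plays the role of C on the slide; its default here is arbitrary.
    """
    d = euclidean(feat_a, feat_b)
    if y == 1:
        return d ** 2  # similar pair: any distance is penalized
    return max(0.0, margin - d) ** 2  # dissimilar pair: zero once d >= margin


# Hypothetical feature vectors standing in for f(image).
far_dissimilar = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0, margin=1.0)
near_dissimilar = contrastive_loss([0.0, 0.0], [0.0, 0.5], y=0, margin=1.0)
```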

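Loss function of the Triplet Network (the index i is omitted)
> p: query image
> Positive sample: a sample image similar to the query
> Negative sample: a sample image not similar to the query
> D: Euclidean distance
> f: feature extraction function
What we want: define the relationship between the query and the two selected images
> The loss compares the distance to the similar image with the distance to the dissimilar image
> If the similar image is closer than the dissimilar image, the loss is 0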
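As a rough sketch of the idea, a hinge-style triplet loss can be written as max(0, margin + D(query, positive) − D(query, negative)). The slide's own equation is not in the transcript, so this form, the `margin` parameter, and its default of 0 (which matches "the loss is 0 if the similar image is closer") are assumptions.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """D: Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(query: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.0) -> float:
    """Assumed hinge triplet loss on feature vectors f(query), f(positive), f(negative)."""
    d_pos = euclidean(query, positive)  # distance to the similar image
    d_neg = euclidean(query, negative)  # distance to the dissimilar image
    return max(0.0, margin + d_pos - d_neg)


# Hypothetical 2-D features.
satisfied = triplet_loss([0.0, 0.0], [1.0, 0.0], [3.0, 0.0])  # positive is closer
violated = triplet_loss([0.0, 0.0], [3.0, 0.0], [1.0, 0.0])   # negative is closer
```

Unlike the pairwise loss, nothing here fixes an absolute distance. Only the ordering of the two distances relative to the query matters.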

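Classification Loss
> Even when images look alike, retrieval fails if their categories differ
> Make the model take the category into account as well
> Similarity loss: the distance in feature space
> Classification loss: the loss of the classification task
> λ: parameter (1 in the paper)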
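One way to read this slide is as a weighted sum, total = similarity loss + λ · classification loss. The sketch below assumes that form, assumes softmax cross-entropy as the classification loss, and assumes λ multiplies the classification term. With λ = 1, as in the paper, the placement of λ makes no difference.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one sample, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multitask_loss(similarity_loss: float, logits: Sequence[float],
                   label: int, lam: float = 1.0) -> float:
    """Assumed combination: similarity loss + lam * classification loss."""
    return similarity_loss + lam * cross_entropy(logits, label)


# Hypothetical values: a siamese or triplet loss of 0.5 and 3-class logits.
total = multitask_loss(0.5, logits=[2.0, 0.1, -1.0], label=0, lam=1.0)
```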

Experimental results
> Dataset: UEC-FOOD256 [Kawano+, 2014]
  > 256 classes, 100 images/class, 25,600 images
> Task
  > Can images of the same class as the query be retrieved?
Legend: FT: fine-tuning, SN: siamese network, TN: triplet network, MT: multi-task (combined with the Classification Loss)

Cross-modal recipe retrieval methods
> Learning Cross-modal Embeddings for Cooking Recipes and Food Images
  > Salvador+, CVPR 2017
  > Roughly: Siamese Network + Classification Loss + Cross-modal
> Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings
  > Carvalho+, SIGIR 2018
  > Roughly: Triplet Network + Classification Loss + Cross-modal

Learning Cross-modal Embeddings for Cooking Recipes and Food Images [Salvador+, CVPR 2017]
> Built Recipe1M, a large-scale dataset about cooking
> Shows what becomes possible with it
  > Surpassed humans at image-to-recipe retrieval

Network overview
(Figure labels: Word2Vec + Bi-directional LSTM; skip-thoughts + LSTM; pretrained ResNet-50)

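Loss function
> Label: 1 if the image and the text belong to the same recipe
> Cosine similarity
> Margin (0.1 in the paper)
> Same as the Siamese Network loss function
  > One case for a similar pair, one case for a dissimilar pair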
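A sketch of a cosine-similarity pair loss with a margin, in the spirit of what this slide describes: matching image/text pairs are pushed toward cosine similarity 1, and non-matching pairs are penalized only while their similarity is above the margin. The exact formula is not in the transcript, so the form below is an assumption. The default margin of 0.1 is the value the slide attributes to the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def cosine_pair_loss(image_emb: Sequence[float], text_emb: Sequence[float],
                     same_recipe: bool, margin: float = 0.1) -> float:
    """Assumed form: 1 - cos for matching pairs, max(0, cos - margin) otherwise."""
    cos = cosine_similarity(image_emb, text_emb)
    if same_recipe:
        return 1.0 - cos
    return max(0.0, cos - margin)


# Hypothetical embeddings of an image and a recipe text.
matching = cosine_pair_loss([1.0, 0.0], [1.0, 0.0], same_recipe=True)
mismatched = cosine_pair_loss([1.0, 0.0], [1.0, 0.0], same_recipe=False)
```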

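Classification loss (Semantic Regularization Loss)
(Equation annotations: Cosine Similarity Loss; Semantic Regularization Loss; parameter (0.02 in the paper); class labels)
> Make the model take the category into account as well
> Classification uses the same weights for both text and images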

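Retrieval performance: vs. related methods
> Build a subset of 1,000 random samples
  > Strictly speaking, 999 random samples + the text of the same recipe as the query image
> Find the text of the same recipe within the subset
> Report the average over 10 re-drawings of the 1,000 samples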
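The protocol above can be sketched as follows: draw a random subset of image/text pairs, use each image as a query against the subset's texts, record the rank of the true text, and average the summary statistics over several draws. The metrics used here (median rank and recall@K), the cosine ranking, and all function names are assumptions for illustration. The slide only says that the matching text must be found inside the subset.

```python
import math
import random
import statistics
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else sum(x * y for x, y in zip(a, b)) / (na * nb)


def rank_of_true_text(image: Sequence[float], texts: List[Sequence[float]], true_idx: int) -> int:
    """1-based rank of the paired text when texts are sorted by similarity."""
    true_score = cosine(image, texts[true_idx])
    better = sum(1 for i, t in enumerate(texts) if i != true_idx and cosine(image, t) > true_score)
    return better + 1


def evaluate(images: List[Sequence[float]], texts: List[Sequence[float]],
             subset_size: int = 1000, repeats: int = 10,
             ks: Tuple[int, ...] = (1, 5, 10), seed: int = 0) -> Tuple[float, Dict[int, float]]:
    """Return (mean median rank, mean recall@K) over `repeats` random subsets."""
    rng = random.Random(seed)
    median_ranks: List[float] = []
    recalls: Dict[int, List[float]] = {k: [] for k in ks}
    for _ in range(repeats):
        ids = rng.sample(range(len(images)), min(subset_size, len(images)))
        subset_texts = [texts[i] for i in ids]
        ranks = [rank_of_true_text(images[i], subset_texts, pos) for pos, i in enumerate(ids)]
        median_ranks.append(statistics.median(ranks))
        for k in ks:
            recalls[k].append(sum(r <= k for r in ranks) / len(ranks))
    return statistics.mean(median_ranks), {k: statistics.mean(v) for k, v in recalls.items()}
```

Using every image in the subset as a query in turn gives each query 999 distractor texts plus its own text, which matches the "999 + 1" description above.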

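Matching-recipe experiment: vs. humans
> Choose, from 10 candidates, the text that is paired with the given image
> Metric: accuracy
> Experiment per cuisine genre (easy)
> Experiment per dish (hard)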

Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings [Carvalho+, SIGIR 2018]
> A classification loss cannot guarantee that each class forms a cluster in feature space
  > It lacks robustness to unseen samples
> Introduces a triplet loss

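Overview
> Guarantee that each class forms a cluster in feature space
  > Place the image and the text of the same recipe close together in feature space
  > Place similar (same-category) recipes close together
> Propose a new loss function
  > A classification loss does not match what we actually want to achieve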

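Double Triplet Loss function
> instance-based triplets
  > Items from the same recipe should be closer than items from a different recipe
> semantic-based triplets
  > Items from the same category should be closer than items from a different category
(Equation annotations: same recipe, different recipe, same category, different category. The variable definitions are mostly the same as before, so they are omitted.)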
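A minimal sketch of how the two kinds of triplets could be combined for one query: an instance-level hinge term plus a weighted semantic-level hinge term. The equation itself is not in the transcript, so the combination rule, the Euclidean distance, and the `margin` and `lam` defaults are placeholder assumptions that are not taken from the slides.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance between two embeddings (assumed; other distances would also work)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hinge_triplet(d_pos: float, d_neg: float, margin: float) -> float:
    """Zero when the positive is closer than the negative by at least `margin`."""
    return max(0.0, d_pos + margin - d_neg)


def double_triplet_loss(query: Sequence[float],
                        same_recipe: Sequence[float], other_recipe: Sequence[float],
                        same_category: Sequence[float], other_category: Sequence[float],
                        margin: float = 0.3, lam: float = 0.1) -> float:
    """Assumed form: instance-based triplet + lam * semantic-based triplet."""
    instance = hinge_triplet(euclidean(query, same_recipe), euclidean(query, other_recipe), margin)
    semantic = hinge_triplet(euclidean(query, same_category), euclidean(query, other_category), margin)
    return instance + lam * semantic


# Hypothetical 2-D embeddings: the query is an image, the others are recipe texts.
loss = double_triplet_loss([0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [2.0, 0.0], [2.0, 0.0])
```

In this toy call the instance term is already satisfied, so only the semantic term contributes. That is the situation where category structure is still being shaped after image/text pairs have been matched.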

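Adaptive Learning Schema
> Once training has progressed to some extent, the loss of a triplet is often 0
  > (Annotation on the loss: when the inequality is satisfied, the loss becomes 0)
> If the loss is averaged over all samples, it is diluted by the samples whose loss is 0
> Countermeasure
  > Instead of taking the mean, divide by the number of samples whose loss is nonzero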
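The difference between the two averaging rules is easy to see on a toy batch. The sketch below only illustrates the normalization described above; the function names and the sample values are made up.

```python
from typing import Sequence


def plain_mean(losses: Sequence[float]) -> float:
    """Average over all triplets, including those whose loss is already 0."""
    return sum(losses) / len(losses) if losses else 0.0


def adaptive_mean(losses: Sequence[float]) -> float:
    """Average over the triplets with nonzero loss only."""
    active = [value for value in losses if value > 0.0]
    return sum(active) / len(active) if active else 0.0


# Hypothetical per-triplet losses late in training: most are already satisfied.
batch = [0.0, 0.0, 0.0, 0.4]
diluted = plain_mean(batch)      # the single active triplet is divided by 4
undiluted = adaptive_mean(batch)  # the single active triplet is divided by 1
```

With the plain mean, the update signal shrinks as more triplets become satisfied, even though the remaining violations are just as large as before.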

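Retrieval performance: comparison with related methods
> Find the paired sample among 10,000 randomly selected samples
> Report the average over 5 re-drawings of the 10,000 samples
(Table annotations: an improved version of [Salvador+, CVPR 2017]; a variant where category information is handled with a Classification Loss; a variant without the Adaptive Learning Schema)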

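Visualizing how categories cluster
> Visualize the distribution of categories
  > Five categories are picked
> When the category-related loss is added
  > Categories are placed in tighter clusters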

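Summary
> Image search for recipes
  > Conceptually the same as general image retrieval
  > Distinctive point: text can be used in addition to images
> Impressions
  > Methods that work well in general image retrieval are basically carried over to recipes, so keeping up with the latest methods matters
  > For the intended use case, robustness to unseen images matters
    > Double Triplet Loss looks useful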

References
• [Douze+, 2009]
  ◦ Douze, Matthijs, et al. "Evaluation of gist descriptors for web-scale image search." Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 2009.
• [Zheng+, 2018]
  ◦ Zheng, Liang, Yi Yang, and Qi Tian. "SIFT meets CNN: A decade survey of instance retrieval." IEEE Transactions on Pattern Analysis and Machine Intelligence 40.5 (2018): 1224-1244.
• [Razavian+, 2014]
  ◦ Sharif Razavian, Ali, et al. "CNN features off-the-shelf: an astounding baseline for recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2014.
• [Ng+, 2015]
  ◦ Yue-Hei Ng, Joe, Fan Yang, and Larry S. Davis. "Exploiting local features from deep networks for image retrieval." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2015.

• [Farinella+, 2016]
  ◦ Farinella, Giovanni Maria, et al. "Retrieval and classification of food images." Computers in Biology and Medicine 77 (2016): 23-39.
• [Ciocca+, ICIAP 2017]
  ◦ Ciocca, Gianluigi, Paolo Napoletano, and Raimondo Schettini. "Learning CNN-based features for retrieval of food images." International Conference on Image Analysis and Processing. Springer, Cham, 2017.
• [Shimoda+, BigMM 2017]
  ◦ Shimoda, Wataru, and Keiji Yanai. "Learning food image similarity for food image retrieval." 2017 IEEE Third International Conference on Multimedia Big Data (BigMM). IEEE, 2017.
• [Salvador+, CVPR 2017]
  ◦ Salvador, Amaia, et al. "Learning cross-modal embeddings for cooking recipes and food images." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
• [Carvalho+, SIGIR 2018]
  ◦ Carvalho, Micael, et al. "Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings." The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2018.

• [Zhou+, 2017]
  ◦ Zhou, Wengang, Houqiang Li, and Qi Tian. "Recent advance in content-based image retrieval: A literature survey." arXiv preprint arXiv:1706.06064 (2017).
• [Bromley+, 1994]
  ◦ Bromley, Jane, et al. "Signature verification using a "siamese" time delay neural network." Advances in Neural Information Processing Systems. 1994.
• [Wang+, 2014]
  ◦ Wang, Jiang, et al. "Learning fine-grained image similarity with deep ranking." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
• [Kawano+, 2014]
  ◦ Kawano, Yoshiyuki, and Keiji Yanai. "Automatic expansion of a food image dataset leveraging existing categories with domain adaptation." European Conference on Computer Vision. Springer, Cham, 2014.
