Harvesting Image Databases from the Web

Dan Chen
June 19, 2013

Florian Schroff, Antonio Criminisi, and Andrew Zisserman, “Harvesting Image Databases from the Web”, PAMI, 2011

Class presentation, course on Machine Learning, NTNU, June 19, 2013.

CC BY-NC 3.0

Transcript

  1. Harvesting Image Databases from the Web. Florian Schroff, Antonio Criminisi, and Andrew Zisserman, Microsoft Research. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 4, April 2011. Presented by Shao-Chung Chen, “Machine Learning” course, June 18, 2013
  2. Introduction • existing image databases are not sufficient • search engines provide an effortless route • but poor precision (32% for one class, avg. 39%, with Google) • restricted # of downloads (1,000 with Google) • goal: automatically harvest image databases • from the web • with the help of search engines • precision above 55% on average
  8. Related Works • direct download from an image search engine • probabilistic Latent Semantic Analysis (pLSA) • Hierarchical Dirichlet Process • + text on the original page (combined with image search) • the above approaches suffer from • poor precision • restriction on the # of downloads
  14. Try Web Search Instead • use web search instead of image search • eliminates the download restriction • phase #1 • topics — based on the words on the pages • using Latent Dirichlet Allocation (LDA) on the text • images — near the topical text — top ranked • labeling — positive/negative image clusters
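As a rough sketch of the phase #1 topic step (not the authors' code; the library, toy corpus, and parameters are illustrative), Latent Dirichlet Allocation can be run on the text of the returned pages and each page assigned its dominant topic:

```python
# Hypothetical sketch: discover topics in web-page text with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus for the query "penguin": two animal pages, two publisher pages.
pages = [
    "penguin antarctic colony ice penguin chicks",
    "penguin book publisher paperback classics",
    "antarctic ice colony emperor penguin breeding",
    "publisher paperback books classics reading",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(pages)

# Two topics; ideally one "animal" topic and one "publisher" topic emerge.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-page topic proportions

# Each page gets its dominant topic; images near the text of pages from
# the selected topic would be kept as top-ranked candidates.
dominant = doc_topics.argmax(axis=1)
print(dominant)
```

A user would then label the image clusters attached to each topic as positive or negative for the class.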
  20. Try Web Search Instead (cont.) • phase #2 • train a classifier on image + associated text • voting on visual features (shape, color, texture) • plus text features • rerank — with the above classifier • user labeling of clusters avoids polysemy
  25. Objective & Challenge • harvest a large # of images automatically • of a particular class, with high precision • provide a training DB for new object models • combine text, metadata, and visual information
  28. Contribution • text attributes + metadata → P(image in class) • this probability provides (noisy) training data for the visual classifier • reranking superior to that produced by text alone
  30. The Database • 18 predefined classes initially • airplane (ap), beaver (bv), bikes (bk), boat (bt), camel (cm), car (cr), dolphin (dp), elephant (ep), giraffe (gf), guitar (gr), horse (hs), kangaroo (kg), motorbikes (mb), penguin (pg), shark (sk), tiger (tr), wristwatch (ww), and zebra (zb) • annotated manually • as in-class-good, in-class-ok, or nonclass • good & ok further split into abstract / nonabstract
  34. Data Collection • three approaches • WebSearch — Google web search • ImageSearch — Google image search • plus images on the same (original) page • GoogleImages — Google image search only • collected data consist of text & metadata (e.g. image filename) • statistics
  40. Filtering • remove symbolic images • removing abstract images is too challenging • comics, graphs, plots, maps, charts, drawings, sketches • improves the resulting precision • symbolic images are characterized by visual features
  44. Learning the Filter • SVM (radial basis function kernel) • 3 visual features • color histogram • histogram of the L2-norm of the gradient • histogram of the gradient angles (0…π), weighted by the L2-norm of the corresponding gradient • 1,000 equally spaced bins in all cases • ~90% classification accuracy (two-fold cross-validation)
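A minimal sketch of the filter features and classifier, assuming scikit-learn; the bin count is scaled down from the 1,000-bin histograms in the deck, and the "symbolic vs. photo" training images are synthetic stand-ins:

```python
# Illustrative sketch of the three filter features + RBF-kernel SVM.
import numpy as np
from sklearn.svm import SVC

def filter_features(img, bins=32):
    """Color histogram + gradient-magnitude histogram + gradient-angle
    histogram (angles in [0, pi), weighted by gradient magnitude)."""
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)              # L2-norm of the gradient
    ang = np.arctan2(gy, gx) % np.pi    # fold angles into [0, pi)
    h_color = np.histogram(img, bins=bins, range=(0, 255))[0] / img.size
    h_mag = np.histogram(mag, bins=bins)[0] / mag.size
    h_ang = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)[0]
    h_ang = h_ang / (h_ang.sum() + 1e-9)
    return np.concatenate([h_color, h_mag, h_ang])

# Toy training set: flat "symbolic-looking" images vs. noisy "photos".
rng = np.random.default_rng(0)
symbolic = [np.full((16, 16, 3), 255.0) for _ in range(10)]
photos = [rng.uniform(0, 255, (16, 16, 3)) for _ in range(10)]
X = np.array([filter_features(im) for im in symbolic + photos])
y = np.array([1] * 10 + [0] * 10)  # 1 = symbolic/drawing, 0 = photo

clf = SVC(kernel="rbf").fit(X, y)  # RBF-kernel SVM, as in the deck
```

Images the SVM labels symbolic would be dropped before ranking.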
  49. Ranking by Textual Features • 7 textual features (from plain text and HTML tags) • Context10 — 10 words (on each side) around the image • ContextR — 11–50 words away from the image • ImageAlt, ImageTitle, FileDir, FileName, WebsiteTitle — from HTML tags (<img src alt title>, <title>) • other features (e.g. MIME type) didn’t help
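The seven features are binary indicators; a simplified, hypothetical extraction (real parsing of the HTML contexts is more involved) might check whether the query occurs in each context:

```python
# Sketch: binary textual features from the seven contexts in the deck.
# Feature names follow the slides; the parsing itself is illustrative.
import re

FEATURES = ["Context10", "ContextR", "ImageAlt", "ImageTitle",
            "FileDir", "FileName", "WebsiteTitle"]

def binary_features(query, contexts):
    """contexts: dict mapping a feature name to its text snippet.
    Returns the tuple a = (a1, ..., a7) of 0/1 indicators."""
    q = re.escape(query.lower())
    return tuple(
        1 if re.search(q, contexts.get(name, "").lower()) else 0
        for name in FEATURES
    )

contexts = {
    "Context10": "a wild penguin standing on the ice",
    "ImageAlt": "emperor penguin",
    "FileName": "penguin_01.jpg",
    "WebsiteTitle": "Antarctic wildlife",
}
a = binary_features("penguin", contexts)
print(a)  # → (1, 0, 1, 0, 0, 1, 0)
```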
  53. Image Reranking • binary (textual) feature vector a = (a1, …, a7) • ranking based on the posterior probability • P(y=in-class|a), where y ∈ {in-class, nonclass} • a class-independent ranker • to rank one particular class (with P(y|a)) • the ground truth of that class is not used • Bayes classifier — learn P(a|y), P(y), P(a) • so for any new class — no ground truth is needed
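A minimal naive-Bayes ranker over the binary vector a, with Laplace smoothing and toy data (the "mixed" variant the deck later selects and the class-independent training protocol are not reproduced here):

```python
# Sketch: learn P(a_i|y) and P(y) from labeled examples, then score
# new images by the posterior P(y = in-class | a).
import numpy as np

def fit_nb(A, y, eps=1.0):
    """A: (n, 7) binary features; y: 1 = in-class, 0 = nonclass.
    Returns the prior and Laplace-smoothed per-feature likelihoods."""
    p_y = y.mean()
    p_a_pos = (A[y == 1].sum(axis=0) + eps) / (y.sum() + 2 * eps)
    p_a_neg = (A[y == 0].sum(axis=0) + eps) / ((1 - y).sum() + 2 * eps)
    return p_y, p_a_pos, p_a_neg

def posterior(a, p_y, p_a_pos, p_a_neg):
    """P(in-class | a) under the naive independence assumption."""
    lik_pos = np.prod(np.where(a == 1, p_a_pos, 1 - p_a_pos)) * p_y
    lik_neg = np.prod(np.where(a == 1, p_a_neg, 1 - p_a_neg)) * (1 - p_y)
    return lik_pos / (lik_pos + lik_neg)

# Toy data: in-class images fire each feature w.p. 0.7, nonclass w.p. 0.2.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
A = (rng.random((200, 7)) < np.where(y[:, None] == 1, 0.7, 0.2)).astype(int)

p_y, p_pos, p_neg = fit_nb(A, y)
score_hi = posterior(np.ones(7, dtype=int), p_y, p_pos, p_neg)
score_lo = posterior(np.zeros(7, dtype=int), p_y, p_pos, p_neg)
```

Images are then sorted by this posterior; because only P(a|y), P(y), and P(a) are learned, the same ranker transfers to a new class without ground truth.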
  60. Image Reranking (cont.) • (table of per-class precision; the last column gives the average over all classes) • mixed naive Bayes performs best and was picked
  61. Ranking by Visual Features • text reranking gives p(y=in-class|a) for each image • a variety of region detectors + bag of visual words (BOW) • difference of Gaussians, Multiscale-Harris, Kadir’s saliency operator, Canny edge points • 72D SIFT descriptors • a vocabulary (of 100 words) learned for each detector using K-means • histogram of oriented gradients (HOG) • 8 pixels/cell, 9 contrast-invariant gradient bins • 900D feature vector overall
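The bag-of-visual-words step can be sketched as follows, with random stand-ins for the SIFT descriptors and a vocabulary scaled down from the 100 words per detector in the deck:

```python
# Sketch: learn a visual vocabulary with K-means over local descriptors,
# then histogram each image's descriptors against it (BOW).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Stand-in for descriptors pooled from many images (8-D, not 72-D SIFT).
descriptors = rng.random((500, 8))

k = 10  # vocabulary size (100 in the deck)
vocab = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

def bow_histogram(img_descriptors):
    """Assign each descriptor to its nearest visual word and return
    the normalized word-count histogram for the image."""
    words = vocab.predict(img_descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

h = bow_histogram(rng.random((40, 8)))
```

Concatenating such histograms across detectors, plus the HOG features, yields the per-image feature vector.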
  68. Ranking by Visual (cont.) • n+ positive images (top 150 after text reranking) from the specific class; n- negative images (random 1,000) from all classes • an SVM is used since the subset of positive images is noisy • the noise comes from the textual reranking • an SVM can still be trained in such a case • training minimizes the regularized sum shown on the slide
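The minimized sum is presumably the standard soft-margin SVM objective, (1/2)‖w‖² + C Σᵢ max(0, 1 − yᵢ w·xᵢ). A sketch of the phase #2 training regime with synthetic features (a linear SVM stands in for the paper's classifier, and the noise level is invented for illustration):

```python
# Sketch: top-ranked images as noisy positives, a random pool as
# negatives; rerank candidates by the trained SVM's decision value.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
dim = 20

# Toy features: "in-class" images cluster around +mu, others around 0.
mu = np.ones(dim)
positives = mu + rng.normal(0, 1.0, (150, dim))   # top 150 after text rerank
positives[:30] = rng.normal(0, 1.0, (30, dim))    # ~20% label noise (assumed)
negatives = rng.normal(0, 1.0, (1000, dim))       # random 1,000 from all classes

X = np.vstack([positives, negatives])
y = np.array([1] * 150 + [0] * 1000)

# Minimizes the regularized hinge-loss sum above.
clf = LinearSVC(C=1.0).fit(X, y)

# Rerank a candidate set: higher decision value = more likely in-class.
candidates = np.vstack([mu + rng.normal(0, 1.0, (5, dim)),
                        rng.normal(0, 1.0, (5, dim))])
scores = clf.decision_function(candidates)
order = np.argsort(-scores)
```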
  76. Conclusion • an automatic algorithm to harvest the Web • for hundreds of images of a given query class • polysemy and diffuseness remain difficult to handle • future work • leverage multimodal visual models • separate clusters for polysemous meanings • “tiger” — the animal vs. Tiger Woods • divide diffuse categories • “airplane” — airports, airplane interiors, ...