
Harvesting Image Databases from the Web

Dan Chen
June 19, 2013


Florian Schroff, Antonio Criminisi, and Andrew Zisserman, “Harvesting Image Databases from the Web”, PAMI, 2011

June 19, 2013, class presentation, course on Machine Learning, NTNU.

CC BY-NC 3.0


Transcript

  1. Harvesting Image Databases from the Web
    Florian Schroff, Antonio Criminisi, and Andrew Zisserman // Microsoft Research
    IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 4, April 2011
    Shao-Chung Chen, presentation on “Machine Learning”, June 18, 2013

  2. Introduction
    • existing image databases are not large enough
    • image search engines provide an effortless route
    • but precision is poor (32% for one class, avg. 39%; with Google)
    • and the number of downloads is restricted (1,000 with Google)
    • so: automatically harvest image databases
    • from the web
    • with the help of search engines
    • precision above 55% on average

  11. Related Work
    • direct download from an image search engine
    • probabilistic Latent Semantic Analysis (pLSA)
    • Hierarchical Dirichlet Process
    • + text on the original page (combined with image search)
    • the above approaches suffer from
    • poor precision
    • a restricted number of downloads

  19. Try Web Search Instead
    • use web search instead of image search
    • eliminates the download restriction
    • phase #1
    • topics — based on the words on the pages
    • discovered using Latent Dirichlet Allocation on the text
    • images near the text are top ranked
    • labeling — positive/negative image clusters

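
Phase 1's topic discovery can be sketched as follows. This is a minimal sketch assuming scikit-learn, with hypothetical page snippets; the paper's actual implementation, vocabulary, and topic count are not specified here.

```python
# Sketch of phase-1 topic discovery with LDA on text near images.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical snippets of text found near images on downloaded pages.
pages = [
    "penguin colony antarctica penguin chicks on the ice",
    "emperor penguin diving for fish in cold water",
    "buy cheap hosting penguin logo web design services",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(pages)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)        # per-page topic mixture

# Rank pages (and hence nearby images) by the weight of the topic
# a user labeled as positive during the clustering step.
positive_topic = 0                       # chosen after inspecting the clusters
ranking = doc_topics[:, positive_topic].argsort()[::-1]
```

The user labeling of positive/negative clusters then corresponds to picking which discovered topic counts as `positive_topic`.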
  27. Try Web Search Instead (cont.)
    • phase #2
    • train a classifier — image + associated text
    • voting on visual features (shape, color, texture)
    • and on text features
    • rerank with the above classifier
    • user labeling avoids polysemy

  34. Objective & Challenge
    • harvest a large number of images automatically
    • of a particular class, with high precision
    • provide a training DB for new object models
    • combine text, metadata, and visual information

  39. Contribution
    • text attributes + metadata — P(image in class)
    • this probability yields noisy training data for the visual classifier
    • which produces reranking superior to that obtained from text alone

  44. Contribution (cont.) (example: “shark” query)

  45. The Database
    • 18 initially predefined classes
    • airplane (ap), beaver (bv), bikes (bk), boat (bt), camel (cm), car (cr), dolphin (dp), elephant (ep), giraffe (gf), guitar (gr), horse (hs), kangaroo (kg), motorbikes (mb), penguin (pg), shark (sk), tiger (tr), wristwatch (ww), and zebra (zb)
    • annotated manually
    • labels: in-class-good, in-class-ok, nonclass
    • good & ok images further split into abstract vs. nonabstract

  51. The Database (cont.)

  52. Data Collection
    • three approaches
    • WebSearch — Google web search
    • ImageSearch — Google image search, together with images on the same (original) page
    • GoogleImages — Google image search only
    • collected data can include text & metadata (e.g., the image filename)
    • statistics

  61. Filtering
    • remove symbolic images
    • removing all abstract images is too challenging
    • symbolic: comics, graphs, plots, maps, charts, drawings, sketches
    • filtering improves the resulting precision
    • symbolic images are characterized by visual features

  68. Filtering (cont.)

  69. Learning the Filter
    • SVM (radial basis function kernel)
    • three visual features
    • color histogram
    • histogram of the L2-norm of the gradient
    • histogram of the angles (0…π) weighted by the L2-norm of the corresponding gradient
    • 1,000 equally spaced bins in all cases
    • ~90% classification accuracy (two-fold cross-validation)

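
The three filter features and the RBF-kernel SVM can be sketched as follows, assuming NumPy and scikit-learn. The toy images, grayscale conversion, and histogram ranges are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch of the symbolic-image filter: three 1,000-bin histograms + RBF SVM.
import numpy as np
from sklearn.svm import SVC

def filter_features(img, bins=1000):
    """img: HxWx3 float array with values in [0, 1]."""
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)                    # L2-norm of the gradient
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # angles folded into [0, pi)
    h_color, _ = np.histogram(img, bins=bins, range=(0, 1))
    h_mag, _ = np.histogram(mag, bins=bins, range=(0, mag.max() + 1e-8))
    h_ang, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    feats = np.concatenate([h_color, h_mag, h_ang]).astype(float)
    return feats / (feats.sum() + 1e-8)

# Toy training set: flat "symbolic" images vs. noisy "natural" ones.
rng = np.random.default_rng(0)
symbolic = [np.full((32, 32, 3), 0.9) for _ in range(5)]
natural = [rng.random((32, 32, 3)) for _ in range(5)]
X = np.stack([filter_features(i) for i in symbolic + natural])
y = [1] * 5 + [0] * 5                         # 1 = symbolic, 0 = natural

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
```

Symbolic images tend to have peaky color histograms and little gradient energy, which is what these features expose to the SVM.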
  77. Ranking by Textual Features
    • 7 textual features (plain text and HTML tags)
    • Context10 — 10 words (on each side) around the image
    • ContextR — 11–50 words away from the image
    • ImageAlt, ImageTitle, FileDir, FileName, WebsiteTitle — from HTML tags (<img src alt title>, <title>)
    • other features (e.g., MIME types) didn't help

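
Turning one downloaded image's context into the seven binary features can be sketched as below; extracting the fields from raw HTML is assumed to happen upstream, and the example field values are hypothetical.

```python
# Each feature fires if the query term occurs in the corresponding field.
FIELDS = ["context10", "contextR", "imageAlt", "imageTitle",
          "fileDir", "fileName", "websiteTitle"]

def binary_features(query, fields):
    """fields: dict mapping each feature name to its extracted text."""
    q = query.lower()
    return tuple(int(q in fields.get(f, "").lower()) for f in FIELDS)

# Hypothetical example for the query "penguin".
fields = {
    "context10": "a group of emperor penguin chicks on the ice",
    "imageAlt": "penguin colony",
    "fileName": "penguin_01.jpg",
    "fileDir": "/photos/antarctica/",
    "websiteTitle": "Antarctic wildlife",
}
a = binary_features("penguin", fields)   # → (1, 0, 1, 0, 0, 1, 0)
```

The resulting 7-tuple is the feature vector a used by the reranker on the next slides.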
  83. Image Reranking
    • binary (textual) feature vector a = (a1, …, a7)
    • ranking based on the posterior probability
    • P(y = in-class | a), where y ∈ {in-class, nonclass}
    • class-independent ranker
    • to rank one particular class (with P(y|a)), the ground truth of that class is not used
    • Bayes classifier — learn P(a|y), P(y), P(a)
    • so for any new class, no ground truth is needed

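
A minimal naive-Bayes version of this reranker is sketched below; the conditional probabilities are illustrative numbers, not values learned in the paper.

```python
# Naive-Bayes posterior P(y = in-class | a) over seven binary features.
import numpy as np

# P(a_i = 1 | y), learned from labeled training classes (hypothetical numbers).
p_a_in = np.array([0.6, 0.3, 0.7, 0.4, 0.2, 0.8, 0.3])   # y = in-class
p_a_non = np.array([0.3, 0.2, 0.2, 0.2, 0.1, 0.3, 0.2])  # y = nonclass
p_in = 0.4                                               # prior P(y = in-class)

def posterior(a):
    """P(y = in-class | a) under the naive independence assumption."""
    a = np.asarray(a)
    lik_in = np.prod(np.where(a, p_a_in, 1 - p_a_in))
    lik_non = np.prod(np.where(a, p_a_non, 1 - p_a_non))
    return p_in * lik_in / (p_in * lik_in + (1 - p_in) * lik_non)

images = [(1, 0, 1, 0, 0, 1, 0), (0, 0, 0, 0, 0, 0, 0)]
ranked = sorted(images, key=posterior, reverse=True)
```

Because only P(a|y), P(y), and P(a) are learned, and from other classes, a new query class can be ranked without any ground truth of its own.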
  92. Image Reranking (cont.)

  93. Image Reranking (cont.)
    The last column gives the average over all classes. Mixed naive Bayes performs best and was picked.
  94. Ranking by Visual Features
    • text reranking gives p(y = in-class | a) for each image
    • a variety of region detectors + bag of visual words (BOW)
    • difference of Gaussians, Multiscale-Harris, Kadir's saliency operator, Canny edge points
    • 72-D SIFT descriptors
    • a vocabulary of 100 words learned for each detector using K-means
    • histogram of oriented gradients (HOG)
    • 8 pixels/cell, 9 contrast-invariant gradient bins
    • 900-D feature vector in total

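
The bag-of-visual-words step can be sketched as follows, assuming NumPy and scikit-learn; the random arrays stand in for the 72-D SIFT descriptors, and the detector stage is omitted.

```python
# Sketch of the BOW step: cluster descriptors, then histogram per image.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_desc = rng.random((500, 72))        # descriptors pooled from many images

# One 100-word vocabulary per detector; shown here for a single detector.
kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(train_desc)

def bow_histogram(descriptors):
    """Map one image's descriptors to a normalized 100-word histogram."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=100).astype(float)
    return hist / hist.sum()

h = bow_histogram(rng.random((80, 72)))   # one image's 80 descriptors
```

Concatenating such histograms across detectors, plus the HOG block, yields the final feature vector fed to the visual classifier.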
  103. Ranking by Visual Features (cont.)
    • n+ positive images: the top 150 for the specific class; n− negative images: 1,000 drawn at random from all classes
    • an SVM is used because the subset of positive images is noisy
    • the noise comes from the text-based reranking
    • SVMs can still be trained in this setting
    • training minimizes a regularized hinge-loss sum (equation shown on the slide)

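
The phase-2 visual classifier can be sketched as below, assuming NumPy and scikit-learn; the synthetic features and noise level are illustrative, and scikit-learn's LinearSVC minimizes a regularized (squared) hinge-loss sum rather than the paper's exact objective.

```python
# Sketch: train on noisy top-ranked positives vs. random negatives.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_pos, n_neg, dim = 150, 1000, 900        # top 150 positives, 1,000 negatives

# Noisy positives: class signal, with ~20% mislabeled samples mixed in
# (the noise introduced by the text-based reranking).
X_pos = rng.normal(1.0, 1.0, (n_pos, dim))
X_pos[:30] = rng.normal(0.0, 1.0, (30, dim))
X_neg = rng.normal(0.0, 1.0, (n_neg, dim))

X = np.vstack([X_pos, X_neg])
y = np.array([1] * n_pos + [0] * n_neg)

clf = LinearSVC(C=1.0, dual=False).fit(X, y)
scores = clf.decision_function(X_neg)     # used to rerank candidate images
```

The soft margin is what lets the classifier tolerate the mislabeled positives instead of fitting them exactly.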
  112. Reranking Google Image Search — 14 vs. 6 nonclass images (false positives)
  113. Conclusion
    • an automatic algorithm to harvest the web
    • for hundreds of images of a given query class
    • polysemy and diffuseness remain difficult to handle
    • future work
    • leverage multimodal visual models
    • separate clusters for polysemous meanings
    • “tiger” — the animal, Tiger Woods
    • divide diffuse categories
    • “airplane” — airports, airplane interiors, ...

  123. Thanks & Questions?