Harvesting Image Databases from the Web

Dan Chen
June 19, 2013

Florian Schroff, Antonio Criminisi, and Andrew Zisserman, “Harvesting Image Databases from the Web”, PAMI, 2011

Class presentation, course on Machine Learning, NTNU, June 19, 2013.

CC BY-NC 3.0

Transcript

  1. Harvesting Image Databases from the Web. Florian Schroff, Antonio Criminisi, and Andrew Zisserman, Microsoft Research. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 4, April 2011. Presented by Shao-Chung Chen, “Machine Learning” course, June 18, 2013
  2. Introduction • existing image databases are not sufficient • search engines provide an effortless route • but poor precision (32% for one class, avg. 39%, with Google) • restricted # of downloads (1,000 with Google) • goal: automatically harvest image databases • from the web • with the help of search engines • precision above 55% on average
  8. Related Works • direct download from an image search engine • probabilistic Latent Semantic Analysis (pLSA) • Hierarchical Dirichlet Process • + text on the original page (combined with image search) • the above approaches suffer from • poor precision • restriction on the # of downloads
  14. Try Web Search Instead • use web search instead of image search • eliminates the download restriction • phase #1 • topics — based on the words on the pages • using Latent Dirichlet Allocation (LDA) on the text • images — near the topical text — top ranked • labeling — positive/negative image clusters
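As a rough sketch of the phase #1 topic step (not the authors' code; the library, toy corpus, and parameters are illustrative), Latent Dirichlet Allocation can be run on the text of the returned pages and each page assigned its dominant topic:

```python
# Hypothetical sketch: discover topics in web-page text with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus for the query "penguin": two animal pages, two publisher pages.
pages = [
    "penguin antarctic colony ice penguin chicks",
    "penguin book publisher paperback classics",
    "antarctic ice colony emperor penguin breeding",
    "publisher paperback books classics reading",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(pages)

# Two topics; ideally one "animal" topic and one "publisher" topic emerge.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-page topic proportions

# Each page gets its dominant topic; images near the text of pages from
# the selected topic would be kept as top-ranked candidates.
dominant = doc_topics.argmax(axis=1)
print(dominant)
```

A user would then label the image clusters attached to each topic as positive or negative for the class.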
  20. Try Web Search Instead (cont.) • phase #2 • train a classifier on image + associated text • voting on visual features (shape, color, texture) • plus text features • rerank — with the above classifier • user labeling of clusters avoids polysemy
  25. Objective & Challenge • harvest a large # of images automatically • of a particular class, with high precision • provide a training DB for new object models • combine text, metadata, and visual information
  28. Contribution • text attributes + metadata → P(image in class) • this probability provides (noisy) training data for the visual classifier • reranking superior to that produced by text alone
  30. The Database • 18 predefined classes initially • airplane (ap), beaver (bv), bikes (bk), boat (bt), camel (cm), car (cr), dolphin (dp), elephant (ep), giraffe (gf), guitar (gr), horse (hs), kangaroo (kg), motorbikes (mb), penguin (pg), shark (sk), tiger (tr), wristwatch (ww), and zebra (zb) • annotated manually • as in-class-good, in-class-ok, or nonclass • good & ok further split into abstract / nonabstract
  34. Data Collection • three approaches • WebSearch — Google web search • ImageSearch — Google image search • plus images on the same (original) page • GoogleImages — Google image search only • collected data consist of text & metadata (e.g. image filename) • statistics
  40. Filtering • remove symbolic images • removing abstract images is too challenging • comics, graphs, plots, maps, charts, drawings, sketches • improves the resulting precision • symbolic images are characterized by visual features
  44. Learning the Filter • SVM (radial basis function kernel) • 3 visual features • color histogram • histogram of the L2-norm of the gradient • histogram of the gradient angles (0…π), weighted by the L2-norm of the corresponding gradient • 1,000 equally spaced bins in all cases • ~90% classification accuracy (two-fold cross-validation)
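A minimal sketch of the filter features and classifier, assuming scikit-learn; the bin count is scaled down from the 1,000-bin histograms in the deck, and the "symbolic vs. photo" training images are synthetic stand-ins:

```python
# Illustrative sketch of the three filter features + RBF-kernel SVM.
import numpy as np
from sklearn.svm import SVC

def filter_features(img, bins=32):
    """Color histogram + gradient-magnitude histogram + gradient-angle
    histogram (angles in [0, pi), weighted by gradient magnitude)."""
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)              # L2-norm of the gradient
    ang = np.arctan2(gy, gx) % np.pi    # fold angles into [0, pi)
    h_color = np.histogram(img, bins=bins, range=(0, 255))[0] / img.size
    h_mag = np.histogram(mag, bins=bins)[0] / mag.size
    h_ang = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)[0]
    h_ang = h_ang / (h_ang.sum() + 1e-9)
    return np.concatenate([h_color, h_mag, h_ang])

# Toy training set: flat "symbolic-looking" images vs. noisy "photos".
rng = np.random.default_rng(0)
symbolic = [np.full((16, 16, 3), 255.0) for _ in range(10)]
photos = [rng.uniform(0, 255, (16, 16, 3)) for _ in range(10)]
X = np.array([filter_features(im) for im in symbolic + photos])
y = np.array([1] * 10 + [0] * 10)  # 1 = symbolic/drawing, 0 = photo

clf = SVC(kernel="rbf").fit(X, y)  # RBF-kernel SVM, as in the deck
```

Images the SVM labels symbolic would be dropped before ranking.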
  49. Ranking by Textual Features • 7 textual features (from plain text and HTML tags) • Context10 — 10 words (on each side) around the image • ContextR — 11–50 words away from the image • ImageAlt, ImageTitle, FileDir, FileName, WebsiteTitle — from HTML tags (<img src alt title>, <title>) • other features (e.g. MIME type) didn’t help
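The seven features are binary indicators; a simplified, hypothetical extraction (real parsing of the HTML contexts is more involved) might check whether the query occurs in each context:

```python
# Sketch: binary textual features from the seven contexts in the deck.
# Feature names follow the slides; the parsing itself is illustrative.
import re

FEATURES = ["Context10", "ContextR", "ImageAlt", "ImageTitle",
            "FileDir", "FileName", "WebsiteTitle"]

def binary_features(query, contexts):
    """contexts: dict mapping a feature name to its text snippet.
    Returns the tuple a = (a1, ..., a7) of 0/1 indicators."""
    q = re.escape(query.lower())
    return tuple(
        1 if re.search(q, contexts.get(name, "").lower()) else 0
        for name in FEATURES
    )

contexts = {
    "Context10": "a wild penguin standing on the ice",
    "ImageAlt": "emperor penguin",
    "FileName": "penguin_01.jpg",
    "WebsiteTitle": "Antarctic wildlife",
}
a = binary_features("penguin", contexts)
print(a)  # → (1, 0, 1, 0, 0, 1, 0)
```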
  53. Image Reranking • binary (textual) feature vector a = (a1, …, a7) • ranking based on the posterior probability • P(y=in-class|a), where y ∈ {in-class, nonclass} • a class-independent ranker • to rank one particular class (with P(y|a)) • the ground truth of that class is not used • Bayes classifier — learn P(a|y), P(y), P(a) • so for any new class — no ground truth is needed
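A minimal naive-Bayes ranker over the binary vector a, with Laplace smoothing and toy data (the "mixed" variant the deck later selects and the class-independent training protocol are not reproduced here):

```python
# Sketch: learn P(a_i|y) and P(y) from labeled examples, then score
# new images by the posterior P(y = in-class | a).
import numpy as np

def fit_nb(A, y, eps=1.0):
    """A: (n, 7) binary features; y: 1 = in-class, 0 = nonclass.
    Returns the prior and Laplace-smoothed per-feature likelihoods."""
    p_y = y.mean()
    p_a_pos = (A[y == 1].sum(axis=0) + eps) / (y.sum() + 2 * eps)
    p_a_neg = (A[y == 0].sum(axis=0) + eps) / ((1 - y).sum() + 2 * eps)
    return p_y, p_a_pos, p_a_neg

def posterior(a, p_y, p_a_pos, p_a_neg):
    """P(in-class | a) under the naive independence assumption."""
    lik_pos = np.prod(np.where(a == 1, p_a_pos, 1 - p_a_pos)) * p_y
    lik_neg = np.prod(np.where(a == 1, p_a_neg, 1 - p_a_neg)) * (1 - p_y)
    return lik_pos / (lik_pos + lik_neg)

# Toy data: in-class images fire each feature w.p. 0.7, nonclass w.p. 0.2.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
A = (rng.random((200, 7)) < np.where(y[:, None] == 1, 0.7, 0.2)).astype(int)

p_y, p_pos, p_neg = fit_nb(A, y)
score_hi = posterior(np.ones(7, dtype=int), p_y, p_pos, p_neg)
score_lo = posterior(np.zeros(7, dtype=int), p_y, p_pos, p_neg)
```

Images are then sorted by this posterior; because only P(a|y), P(y), and P(a) are learned, the same ranker transfers to a new class without ground truth.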
  60. Image Reranking (cont.) • (table of per-class precision; the last column gives the average over all classes) • mixed naive Bayes performs best and was picked
  61. Ranking by Visual Features • text reranking gives p(y=in-class|a) for each image • a variety of region detectors + bag of visual words (BOW) • difference of Gaussians, Multiscale-Harris, Kadir’s saliency operator, Canny edge points • 72D SIFT descriptors • a vocabulary (of 100 words) learned for each detector using K-means • histogram of oriented gradients (HOG) • 8 pixels/cell, 9 contrast-invariant gradient bins • 900D feature vector overall
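The bag-of-visual-words step can be sketched as follows, with random stand-ins for the SIFT descriptors and a vocabulary scaled down from the 100 words per detector in the deck:

```python
# Sketch: learn a visual vocabulary with K-means over local descriptors,
# then histogram each image's descriptors against it (BOW).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Stand-in for descriptors pooled from many images (8-D, not 72-D SIFT).
descriptors = rng.random((500, 8))

k = 10  # vocabulary size (100 in the deck)
vocab = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

def bow_histogram(img_descriptors):
    """Assign each descriptor to its nearest visual word and return
    the normalized word-count histogram for the image."""
    words = vocab.predict(img_descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

h = bow_histogram(rng.random((40, 8)))
```

Concatenating such histograms across detectors, plus the HOG features, yields the per-image feature vector.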
  68. Ranking by Visual (cont.) • n+ positive images (top 150 after text reranking) from the specific class; n- negative images (random 1,000) from all classes • an SVM is used since the subset of positive images is noisy • the noise comes from the textual reranking • an SVM can still be trained in such a case • training minimizes the regularized sum shown on the slide
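The minimized sum is presumably the standard soft-margin SVM objective, (1/2)‖w‖² + C Σᵢ max(0, 1 − yᵢ w·xᵢ). A sketch of the phase #2 training regime with synthetic features (a linear SVM stands in for the paper's classifier, and the noise level is invented for illustration):

```python
# Sketch: top-ranked images as noisy positives, a random pool as
# negatives; rerank candidates by the trained SVM's decision value.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
dim = 20

# Toy features: "in-class" images cluster around +mu, others around 0.
mu = np.ones(dim)
positives = mu + rng.normal(0, 1.0, (150, dim))   # top 150 after text rerank
positives[:30] = rng.normal(0, 1.0, (30, dim))    # ~20% label noise (assumed)
negatives = rng.normal(0, 1.0, (1000, dim))       # random 1,000 from all classes

X = np.vstack([positives, negatives])
y = np.array([1] * 150 + [0] * 1000)

# Minimizes the regularized hinge-loss sum above.
clf = LinearSVC(C=1.0).fit(X, y)

# Rerank a candidate set: higher decision value = more likely in-class.
candidates = np.vstack([mu + rng.normal(0, 1.0, (5, dim)),
                        rng.normal(0, 1.0, (5, dim))])
scores = clf.decision_function(candidates)
order = np.argsort(-scores)
```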
  76. Conclusion • an automatic algorithm to harvest the Web • for hundreds of images of a given query class • polysemy and diffuseness remain difficult to handle • future work • leverage multimodal visual models • separate clusters for polysemous meanings • “tiger” — the animal vs. Tiger Woods • divide diffuse categories • “airplane” — airports, airplane interiors, ...