Slide 1

Slide 1 text

Harvesting Image Databases from the Web
Florian Schroff, Antonio Criminisi, and Andrew Zisserman // Microsoft Research Sponsored
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 4, April 2011
Shao-Chung Chen
Presentation on “Machine Learning”, June 18, 2013

Slide 10

Slide 10 text

Introduction
• existing image databases are not sufficient
• search engines provide an effortless route
  • poor precision (as low as 32% for one class, 39% on average, with Google)
  • restricted number of downloads (1,000 with Google)
• automatically harvest image databases
  • from the web
  • with the help of search engines
  • precision above 55% on average

Slide 18

Slide 18 text

Related Work
• direct download from an image search engine
• probabilistic Latent Semantic Analysis (pLSA)
• Hierarchical Dirichlet Process
• image search plus the text on the original page
• the above approaches suffer from
  • poor precision
  • a restricted number of downloads

Slide 26

Slide 26 text

Try Web Search Instead
• use web search instead of image search
• eliminates the download restriction
• phase #1
  • topics: based on the words on the pages
    • using Latent Dirichlet Allocation on the text
  • images: those whose nearby text is top ranked
  • labeling: positive/negative image clusters

Slide 33

Slide 33 text

Try Web Search Instead (cont.)
• phase #2
  • train a classifier on the images and their associated text
    • voting on visual features (shape, color, texture)
    • text features
  • rerank with the above classifier
• user labeling avoids polysemy

Slide 38

Slide 38 text

Objective & Challenge
• harvest a large number of images automatically
  • of a particular class, with high precision
• provide a training database for a new object model
• combine text, metadata, and visual information

Slide 43

Slide 43 text

Contribution
• text attributes + metadata: estimate P(image in class)
• the above probability
  • provides noisy training data for a visual classifier
  • the classifier gives a reranking superior to that produced by text alone

Slide 44

Slide 44 text

Contribution (cont.) (“shark” query)

Slide 50

Slide 50 text

The Database
• 18 initial predefined classes
  • airplane (ap), beaver (bv), bikes (bk), boat (bt), camel (cm), car (cr), dolphin (dp), elephant (ep), giraffe (gf), guitar (gr), horse (hs), kangaroo (kg), motorbikes (mb), penguin (pg), shark (sk), tiger (tr), wristwatch (ww), and zebra (zb)
• annotated manually
  • in-class-good, in-class-ok, nonclass
  • good and ok images are further split into abstract and nonabstract

Slide 51

Slide 51 text

The Database (cont.)

Slide 60

Slide 60 text

Data Collection
• three approaches
  • WebSearch: Google web search
  • ImageSearch: Google image search
    • together with the images on the same (original) page
  • GoogleImages: Google image search only
• collected data can consist of text and metadata (e.g. image filename)
• statistics

Slide 67

Slide 67 text

Filtering
• remove symbolic images
  • removing all abstract images is too challenging
  • comics, graphs, plots, maps, charts, drawings, sketches
• improves the resulting precision
• symbolic images are characterized by their visual features

Slide 68

Slide 68 text

Filtering (cont.)

Slide 76

Slide 76 text

Learning the Filter
• SVM (radial basis function kernel)
• 3 visual features
  • color histogram
  • histogram of the L2-norm of the gradient
  • histogram of the gradient angles (0…π), weighted by the L2-norm of the corresponding gradient
• 1000 equally spaced bins in all cases
• ~90% classification accuracy (two-fold cross-validation)
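
The two gradient-based features listed on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the image is a small 2-D list of gray values (at least 3 × 3), gradients are central differences, and the function name `gradient_histograms` and the default of 8 bins are made up here (the slide uses 1000 bins). The color histogram and the SVM itself are left out.

```python
import math

def gradient_histograms(img, bins=8):
    """Return (magnitude_hist, angle_hist) for a 2-D list of gray values.

    magnitude_hist counts pixels by the L2-norm of their gradient.
    angle_hist bins the gradient angle in [0, pi) and adds the gradient
    norm as the weight, so strong edges count more than weak ones.
    Assumes the image is at least 3 x 3 (border pixels are skipped).
    """
    mags, angs = [], []
    height, width = len(img), len(img[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (img[y][x + 1] - img[y][x - 1]) / 2.0  # central differences
            gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
            mags.append(math.hypot(gx, gy))
            angs.append(math.atan2(gy, gx) % math.pi)   # fold into [0, pi)
    max_mag = max(mags) or 1.0  # avoid dividing by zero on flat images
    mag_hist = [0.0] * bins
    ang_hist = [0.0] * bins
    for m, a in zip(mags, angs):
        mag_hist[min(int(m / max_mag * bins), bins - 1)] += 1.0
        ang_hist[min(int(a / math.pi * bins), bins - 1)] += m
    return mag_hist, ang_hist

# A 4 x 4 image with one vertical edge: every interior gradient points along x.
edge = [[0, 0, 1, 1] for _ in range(4)]
mag_hist, ang_hist = gradient_histograms(edge)
```

Drawings and plots tend to have a few sharp edges on flat backgrounds, which is the kind of difference these histograms are meant to expose to the classifier.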

Slide 82

Slide 82 text

Ranking by Textual Features
• 7 textual features (plain text and HTML tags)
  • Context10: the 10 words on each side of the image
  • ContextR: the words 11–50 words away from the image
  • ImageAlt, ImageTitle, FileDir, FileName, WebsiteTitle: from HTML tags
• other features (e.g. MIME types) didn’t help
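
A rough sketch of the two context windows, assuming the page has already been reduced to a list of words and that each feature is "on" when the query word appears in its window (that binary reading is an assumption here, as are the names `context_features`, `tokens`, and `image_pos`).

```python
def context_features(tokens, image_pos, query):
    """Binary Context10 / ContextR features for one image.

    tokens    : page words in reading order
    image_pos : the image sits between tokens[image_pos - 1] and tokens[image_pos]
    query     : the class name being harvested (hypothetical parameter)
    """
    query = query.lower()
    words = [t.lower() for t in tokens]

    def window(lo, hi):
        # Words whose distance from the image is between lo and hi, both sides.
        left = words[max(0, image_pos - hi): max(0, image_pos - lo + 1)]
        right = words[image_pos + lo - 1: image_pos + hi]
        return left + right

    return {
        "Context10": query in window(1, 10),
        "ContextR": query in window(11, 50),
    }

# "shark" is 5 words to the left of the image, so only Context10 fires.
page = ["w"] * 30
page[5] = "shark"
features = context_features(page, image_pos=10, query="shark")
```

The remaining five features would be filled in the same way from the image tag, file path, and page title.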

Slide 91

Slide 91 text

Image Reranking
• binary (textual) feature vector a = (a1, …, a7)
• ranking based on the posterior probability
  • P(y = in-class | a), where y ∈ {in-class, nonclass}
• class-independent ranker
  • to rank one particular class (with P(y | a))
  • the ground truth of that class is not used
• Bayes classifier: learn P(a | y), P(y), P(a)
  • for any new class, no ground truth is needed
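
The posterior ranking can be sketched with a plain naive Bayes model over the binary features. This is an assumption-laden illustration, not the exact model from the talk: all features are treated as independent given y, Laplace smoothing is added, and the function names are invented. The training samples would come from classes other than the one being ranked, which is what makes the ranker class independent.

```python
def train_naive_bayes(samples):
    """samples: list of (a, y); a is a tuple of 0/1 features, y is True if in-class."""
    n_feat = len(samples[0][0])
    counts = {True: [0] * n_feat, False: [0] * n_feat}
    totals = {True: 0, False: 0}
    for a, y in samples:
        totals[y] += 1
        for i, ai in enumerate(a):
            counts[y][i] += ai
    n = len(samples)
    # Laplace-smoothed estimates of P(y) and P(a_i = 1 | y).
    prior = {y: (totals[y] + 1) / (n + 2) for y in (True, False)}
    cond = {y: [(counts[y][i] + 1) / (totals[y] + 2) for i in range(n_feat)]
            for y in (True, False)}
    return prior, cond

def posterior_in_class(a, prior, cond):
    """P(y = in-class | a) by Bayes' rule; the denominator plays the role of P(a)."""
    joint = {}
    for y in (True, False):
        p = prior[y]
        for ai, p_on in zip(a, cond[y]):
            p *= p_on if ai else (1.0 - p_on)
        joint[y] = p
    return joint[True] / (joint[True] + joint[False])

def rerank(images, prior, cond):
    """images: list of (name, a). Highest posterior first."""
    return sorted(images,
                  key=lambda im: posterior_in_class(im[1], prior, cond),
                  reverse=True)

# Toy data with 2 features instead of 7: feature 0 is informative, feature 1 is not.
train = [((1, 0), True), ((1, 1), True), ((0, 0), False), ((0, 1), False)]
prior, cond = train_naive_bayes(train)
ranked = rerank([("x", (0, 0)), ("y", (1, 0))], prior, cond)
```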

Slide 92

Slide 92 text

Image Reranking (cont.)

Slide 93

Slide 93 text

Image Reranking (cont.)
The last column gives the average over all classes. Mixed naive Bayes performs best and was picked.

Slide 102

Slide 102 text

Ranking by Visual Features
• text reranking gives p(y = in-class | a) for each image
• a variety of region detectors + bag-of-words (BOW)
  • difference of Gaussians, Multiscale-Harris, Kadir’s saliency operator, Canny edge points
  • 72D SIFT descriptor
  • a vocabulary of 100 words learned for each detector using k-means
• histogram of oriented gradients (HOG)
  • 8 pixels/cell, 9 contrast-invariant gradient bins
  • 900D feature vector
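
Once a vocabulary exists, the bag-of-words step reduces to assigning each region descriptor to its nearest visual word and counting. The sketch below assumes the vocabulary (for example, k-means centroids) is already given; `bow_histogram` and the 2-D toy descriptors are illustrative only, and real descriptors would be 72-dimensional.

```python
def bow_histogram(descriptors, vocabulary):
    """Normalized count of descriptors per nearest visual word.

    descriptors : list of equal-length numeric tuples (one per detected region)
    vocabulary  : list of visual words (centroids), same dimensionality
    """
    hist = [0] * len(vocabulary)
    for d in descriptors:
        # Squared Euclidean distance to every visual word.
        dists = [sum((x - c) ** 2 for x, c in zip(d, word)) for word in vocabulary]
        hist[dists.index(min(dists))] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Two visual words; two descriptors fall near each of them.
vocab = [(0, 0), (10, 10)]
regions = [(1, 0), (9, 9), (0, 1), (11, 10)]
hist = bow_histogram(regions, vocab)
```

On the HOG side, 900 dimensions with 9 bins per cell means 100 cells, which would fit a 10 × 10 grid of 8-pixel cells; that grid size is an inference from the numbers on the slide, not something it states.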

Slide 109

Slide 109 text

Ranking by Visual Features (cont.)
• n+ images (top 150) from the specific class; n- images (1000 at random) from all classes
• SVM, since the set of positive images is noisy
  • the noise comes from ranking the images by textual features
  • an SVM has the potential to train despite such noise
• training minimizes the following sum:
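
The formula itself did not survive in the slide text. A likely candidate, given the noisy positives and the separate n+ and n- sets, is the usual soft-margin SVM objective with different misclassification costs for the two classes. The block below is that standard form written out as an assumption; the symbols C+, C-, ξ, and Φ are chosen here and are not taken from the slide.

```latex
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\, w^{\top} w
  \;+\; C_{+} \sum_{i:\, y_i = +1} \xi_i
  \;+\; C_{-} \sum_{j:\, y_j = -1} \xi_j
\quad \text{subject to} \quad
y_l \left( w^{\top} \Phi(x_l) + b \right) \ge 1 - \xi_l, \qquad \xi_l \ge 0 .
```

The slack variables ξ let some training images sit on the wrong side of the margin, which is what makes training tolerable when part of the top-150 set is mislabeled.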

Slide 112

Slide 112 text

Reranking Google Image Search
14 vs. 6 nonclass images (false positives)

Slide 122

Slide 122 text

Conclusion
• automatic algorithm to harvest the Web
  • for hundreds of images of a given query class
• polysemy and diffuseness are difficult to handle
• future work
  • leverage multimodal visual models
    • different clusters for polysemous meanings
    • “tiger”: the animal, Tiger Woods
  • divide diffuse categories
    • “airplane”: airports, airplane interiors, ...

Slide 123

Slide 123 text

Thanks & Questions?