Slide 1

Slide 1 text

imagededup 😎
Finding duplicate images made easy!
Tanuj Jain (Senior ML Engineer @ Axel Springer AI)
Dat Tran (Co-Founder & CTO @ Priceloop)
08 December 2020 ~ Berlin

Slide 2

Slide 2 text

echo $(whoami)

Slide 3

Slide 3 text

https://ai.axelspringer.com/

Slide 4

Slide 4 text

Agenda
1. Motivation
2. Components of an image deduplication system
   ▪ Component 1: Feature generator
   ▪ Component 2: Search
   ▪ Component 3: Evaluator
3. Demo
4. Summary

Slide 5

Slide 5 text

Motivation

Slide 6

Slide 6 text

Motivation
Why is an image deduplication system required?
E-Commerce: https://www.idealo.de/preisvergleich/OffersOfProduct/200840890_-macbook-air-13-2020-m1-mgnd3d-a-apple.html
Hotel comparison portals: https://hotel.idealo.de/berlin/list/happygolucky-hotel-hostel-2365681#/d:10-12-2020+3/a:2/r:1

Slide 7

Slide 7 text

Motivation
Why is an image deduplication system required?

Slide 8

Slide 8 text

Motivation
Why is an image deduplication system required?
Machine Learning = LEAKAGE
Interesting paper: “Do we train on test data? Purging CIFAR of near-duplicates” - Björn Barz, Joachim Denzler, 2019 - https://arxiv.org/abs/1902.00423

Slide 9

Slide 9 text

Motivation
Team organization within the company
◼ Different teams:
◼ Varying tech stack
◼ Communication challenges
◼ Different definitions of duplicates
◼ Varying areas of expertise
Need a “standardized” solution

Slide 10

Slide 10 text

Motivation
Epiphany
◼ We are not the only e-commerce company!
◼ We are not the only hotel comparison portal!
◼ There are so many other businesses that need this solution!
◼ An average Joe, because of his/her Instagram addiction, takes a gazillion pictures a day, but doesn’t want to keep them all! So, it’s more than a business need!
◼ We believe in community efforts
Need an Open-Source solution

Slide 11

Slide 11 text

“I” am one of those average Joes

Slide 12

Slide 12 text

Motivation
Minimum requirements
◼ Technical
   1. Feature generation (basic + slightly advanced)
   2. Search mechanism
   3. Evaluation of deduplication quality
◼ User
   1. Usable out-of-the-box
   2. Simple API design
   3. Doesn’t require in-depth understanding of the chosen programming language
Need to build a Python package ourselves
Nothing pre-existing!

Slide 13

Slide 13 text

Components

Slide 14

Slide 14 text

Components
3 basic components
1. Feature Generator
2. Search/comparison mechanism
3. Evaluation framework

Slide 15

Slide 15 text

Component 1
Feature generators: imagededup
[Diagram: the feature generator outputs either a Hash (bae7c83d8358e4c3) or a Vector ([1.2, 0.43, . . ., 0.91])]

Slide 16

Slide 16 text

Component 1
Feature generators: imagededup
[Diagram: the feature generator outputs either a Hash (bae7c83d8358e4c3) or a Vector ([1.2, 0.43, . . ., 0.91])]
Hash:
1. Perceptual
2. Average
3. Wavelet
4. Difference
Vector:
Convolutional Neural Network Embeddings

Slide 17

Slide 17 text

Hashing
Difference hashing
Concept: Gradients of similar images are similar

Slide 18

Slide 18 text

Hashing
Difference hashing: How it works
Step 1: Resize

Slide 19

Slide 19 text

Hashing
Difference hashing: How it works
Step 2: Convert to Grayscale

Slide 20

Slide 20 text

Hashing
Difference hashing: How it works
Step 3: Is the pixel to the left dimmer?

Slide 21

Slide 21 text

Hashing
Difference hashing: How it works
Step 4: Convert boolean array to hash
bae7c83d8358e4c3
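
The four steps above can be sketched in plain Python. This is a minimal illustration, not imagededup's implementation. The function name `dhash`, the nearest-neighbour resize, the luma weights, and the 8 x 9 working size are all assumptions chosen so that the result is a 64-bit hash printed as 16 hex characters.

```python
def dhash(pixels, hash_size=8):
    """Difference hash of an image given as rows of (r, g, b) tuples.

    Hypothetical helper for illustration; assumes a non-empty rectangular image.
    """
    height, width = len(pixels), len(pixels[0])
    rows, cols = hash_size, hash_size + 1  # one extra column for the comparisons

    # Step 1: resize to rows x cols (nearest-neighbour sampling, an assumption).
    small = [
        [pixels[r * height // rows][c * width // cols] for c in range(cols)]
        for r in range(rows)
    ]

    # Step 2: convert to grayscale (common luma weights, an assumption).
    gray = [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in small]

    # Step 3: for each pair of horizontal neighbours, is the left pixel dimmer?
    bits = [row[c] < row[c + 1] for row in gray for c in range(cols - 1)]

    # Step 4: pack the boolean array into an integer and print it as hex.
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return f"{value:0{hash_size * hash_size // 4}x}"
```

An image that gets brighter from left to right sets every bit, and a uniformly coloured image sets none, so the hash captures the direction of brightness changes rather than absolute pixel values.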

Slide 22

Slide 22 text

Convolutional Neural Network
Image embeddings

Slide 23

Slide 23 text

Convolutional Neural Network
Image embeddings

Slide 24

Slide 24 text

Convolutional Neural Network
Image embeddings
Embeddings

Slide 25

Slide 25 text

Component 2
Search/comparison mechanism
Feature Extractor
Hashing:
Image 1: bae7c83d8358e4c3
Image 2: a5e4c83d8758e5e1
Image 3: 2a37c83d8358e3f6
Image 4: c777c83d8358eee4
CNN:
Image 1: [1.2, 0.43, . . ., 0.91]
Image 2: [0.3, 0.2, . . ., 0.11]
Image 3: [0.22, 0.65, . . ., 0.12]
Image 4: [0.43, 0.21, . . ., 0.34]

Slide 26

Slide 26 text

Hashing
Hamming distance (HD)
Given two binary numbers, how many positions do they differ in?
1101 0010
1001 1010
The numbers differ in the 2nd and 5th positions, so HD = 2

Slide 27

Slide 27 text

Hashing
Hamming distance (HD)
Given two binary numbers, how many positions do they differ in?
1101 0010
1001 1010
The numbers differ in the 2nd and 5th positions, so HD = 2
HD ≤ Threshold?
Y: Duplicate!
N: Not Duplicate!
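
As a rough sketch of the comparison above, the Hamming distance between two hex hashes can be computed by XOR-ing them and counting the set bits. The names `hamming_distance`, `is_duplicate`, and `max_distance` are hypothetical, and the default threshold of 10 is only an illustrative value. The sketch assumes a pair counts as a duplicate when the distance is at or below the threshold, since a smaller distance means more similar hashes.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 exactly where the two hashes differ.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_duplicate(hash_a: str, hash_b: str, max_distance: int = 10) -> bool:
    """Assumed decision rule: duplicates are at most max_distance bits apart."""
    return hamming_distance(hash_a, hash_b) <= max_distance


# The slide's example, 1101 0010 vs 1001 1010, written in hex.
print(hamming_distance("d2", "9a"))  # 2
```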

Slide 28

Slide 28 text

Hashing
Efficiency
Brute force: Compare every image with every other image
[Figure: example image pairs with HD = 44, HD = 31, HD = 10]

Slide 29

Slide 29 text

Hashing
Efficiency
Brute force: Compare every image with every other image
[Figure: example image pairs with HD = 44, HD = 31, HD = 10]
INEFFICIENT !! (more so in pure Python!)

Slide 30

Slide 30 text

Hashing
Improving search efficiency
1. Instead of pure Python, use Cython
   ~ Orders of magnitude faster
   ~ Default for Linux and macOS
2. Use a better search method: BK-tree
   ~ Creates index using hashes
   ~ Reduces the search space and hence the number of comparisons
   ~ Default for Windows
3. Do comparisons in parallel to increase speed (via multiprocessing)
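
To illustrate how a BK-tree reduces the number of comparisons, here is a minimal sketch over integer hashes. It is not imagededup's implementation; the class and method names are hypothetical. Each child hangs off its parent under the edge label "distance to parent", and the triangle inequality lets a search skip every subtree whose label is further than the threshold from the query's distance to that parent.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two integer hashes."""
    return bin(a ^ b).count("1")


class BKTree:
    """Minimal BK-tree; a node is a [value, {edge_distance: child_node}] pair."""

    def __init__(self, distance=hamming):
        self.distance = distance
        self.root = None

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = [value, {}]
                return
            node = child  # descend along the edge with the same distance

    def search(self, query, max_distance):
        """Return (value, distance) for all stored values within max_distance."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            if d <= max_distance:
                results.append((value, d))
            # Only subtrees with edge labels in [d - t, d + t] can hold matches.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return results
```

Building the index costs one pass over the hashes. Each query then visits only part of the tree, whereas brute force computes the distance to every stored hash.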

Slide 31

Slide 31 text

Convolutional Neural Network (CNN)
Cosine similarity (CS)
Given two vectors, calculate the cosine similarity
cos([1.2, 0.43, . . ., 0.91], [0.3, 0.2, . . ., 0.11])

Slide 32

Slide 32 text

Convolutional Neural Network (CNN)
Decision
Given two vectors, calculate the cosine similarity
cos([1.2, 0.43, . . ., 0.91], [0.3, 0.2, . . ., 0.11]) = 0.8
0.8 > Threshold?
Y: Duplicate!
N: Not Duplicate!
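
The decision above can be sketched with the standard definition of cosine similarity: the dot product of the two vectors divided by the product of their lengths. The function names and the default `min_similarity` of 0.9 are hypothetical choices for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def is_near_duplicate(u, v, min_similarity=0.9):
    """Assumed decision rule: duplicate when similarity exceeds the threshold."""
    return cosine_similarity(u, v) > min_similarity
```

Unlike the Hamming distance, a larger value here means more similar images, so the comparison against the threshold points the other way.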

Slide 33

Slide 33 text

Convolutional Neural Network (CNN)
Improving search efficiency
1. Utilize matrix multiplication
2. Perform cosine similarity calculations in chunks for memory optimizations
3. Do comparisons in parallel to increase speed (via multiprocessing)
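
Points 1 and 2 can be sketched as follows. Once every embedding is scaled to unit length, the product of the embedding matrix with its own transpose is the full cosine similarity matrix, and computing it a few rows at a time keeps only a small block in memory. This is a plain-Python illustration with hypothetical names and an illustrative threshold. A real implementation would hand the products to an optimized array library, and the multiprocessing step is not shown.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def duplicate_pairs(embeddings, min_similarity=0.9, chunk_size=2):
    """Return index pairs (i, j), i < j, whose cosine similarity exceeds the threshold."""
    unit = [normalize(v) for v in embeddings]
    pairs = []
    for start in range(0, len(unit), chunk_size):
        chunk = unit[start:start + chunk_size]
        # chunk times unit-transposed: one (chunk_size x n) block of the
        # similarity matrix, so the full n x n matrix is never held at once.
        block = [
            [sum(a * b for a, b in zip(row, other)) for other in unit]
            for row in chunk
        ]
        for offset, scores in enumerate(block):
            i = start + offset
            pairs.extend(
                (i, j) for j, s in enumerate(scores) if j > i and s > min_similarity
            )
    return pairs
```

The chunk size trades memory for the number of passes. The result is the same for any chunk size.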

Slide 34

Slide 34 text

Component 3
Evaluation
Custom definition of “duplicate”

Slide 35

Slide 35 text

Evaluation
Factors
5 different deduplication methods
What is a duplicate?
Different threshold values

Slide 36

Slide 36 text

Evaluation
Steps
1. Provide ground truth
   ▪ Pairs of images marked as duplicates
2. Run a deduplication method with the corresponding threshold
3. Pass the result to the evaluation function of imagededup
   ▪ Precision
   ▪ Recall
   ▪ Mean Average Precision
   ▪ . . .
4. Select method + threshold with best metrics
Benchmarks available: https://idealo.github.io/imagededup/user_guide/benchmarks/
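
To make steps 1 to 3 concrete, here is a small pair-based sketch of precision and recall. It does not call imagededup's evaluation function and may not match how the library computes its metrics. The dictionary format (image name mapped to a list of its duplicates) and the function names are assumptions for illustration.

```python
def to_pairs(duplicate_map):
    """Flatten {image: [duplicates]} into a set of unordered image pairs."""
    return {
        frozenset((image, dup))
        for image, dups in duplicate_map.items()
        for dup in dups
        if dup != image
    }


def precision_recall(ground_truth, retrieved):
    """Pair-level precision and recall of retrieved duplicates."""
    truth, found = to_pairs(ground_truth), to_pairs(retrieved)
    correct = len(truth & found)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Ground truth: a and b are duplicates. The method also (wrongly) pairs a with c.
truth = {"a": ["b"], "b": ["a"], "c": []}
retrieved = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
print(precision_recall(truth, retrieved))  # (0.5, 1.0)
```

Step 4 then amounts to repeating this for each method and threshold and keeping the combination with the best scores.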

Slide 37

Slide 37 text


Slide 38

Slide 38 text

Summary
◼ 5 deduplication methods
◼ Out-of-the-box
◼ Simple API design/installation
◼ Evaluation framework
◼ Exact and Near duplicates

Slide 39

Slide 39 text

Repo: https://github.com/idealo/imagededup
Documentation: https://idealo.github.io/imagededup/

Slide 40

Slide 40 text

Thanks to all contributors
https://github.com/idealo/imagededup/issues

Slide 41

Slide 41 text

Questions?

Slide 42

Slide 42 text

Backup
CNN
Image 1: [1.2, 0.43, . . ., 0.91]
Image 2: [0.3, 0.2, . . ., 0.11]
Image 3: [0.22, 0.65, . . ., 0.12]
Image 4: [0.43, 0.21, . . ., 0.34]
[Figure: a 4 × d matrix with one embedding per row (row 1: 1.2, 0.43, . . ., 0.91) and a 4 × 4 similarity matrix whose entry for rows 1 and 3 is CS(image1, image3) = 0.8]