
imagededup

Tanuj
January 09, 2025

imagededup is a Python package that simplifies the task of finding exact and near duplicates in an image collection.

https://github.com/idealo/imagededup

Transcript

  1. imagededup 😎 Finding duplicate images made easy!

    Tanuj Jain (Senior ML Engineer @ Axel Springer AI), Dat Tran (Co-Founder & CTO @ Priceloop). 08 December 2020, Berlin.
  2. Agenda

    1. Motivation
    2. Components of an image deduplication system
       ▪ Component 1: Feature generator
       ▪ Component 2: Search
       ▪ Component 3: Evaluator
    3. Demo
    4. Summary
  3. Motivation: Why is an image deduplication system required?

    E-Commerce: https://www.idealo.de/preisvergleich/OffersOfProduct/200840890_-macbook-air-13-2020-m1-mgnd3d-a-apple.html
    Hotel comparison portals: https://hotel.idealo.de/berlin/list/happygolucky-hotel-hostel-2365681#/d:10-12-2020+3/a:2/r:1
  4. Motivation: Why is an image deduplication system required?

    Machine Learning = LEAKAGE
    Interesting paper: “Do we train on test data? Purging CIFAR of near-duplicates”, Björn Barz and Joachim Denzler, 2019, https://arxiv.org/abs/1902.00423
  5. Motivation: Teams organization within the company

    ◼ Different teams:
      ◼ Varying tech stack
      ◼ Communication challenges
      ◼ Different definition of duplicates
      ◼ Varying areas of expertise
    Need a “standardized” solution
  6. Motivation: Epiphany

    ◼ We are not the only e-commerce company!
    ◼ We are not the only hotel comparison portal!
    ◼ There are so many other businesses that need this solution!
    ◼ An average Joe, because of his/her Instagram addiction, takes a gazillion pictures a day, but doesn’t want to keep them all! So, it’s more than a business need!
    ◼ We believe in community efforts
    Need an Open-Source solution
  7. Motivation: Minimum requirements

    ◼ Technical
      1. Feature generation (basic + slightly advanced)
      2. Search mechanism
      3. Evaluation of deduplication quality
    ◼ User
      1. Usable out-of-the-box
      2. Simple API design
      3. Doesn’t require in-depth understanding of the chosen programming language
    Need to build a Python package ourselves. Nothing pre-existing!
  8. Component 1: Feature generators in imagededup

    Hash (e.g. bae7c83d8358e4c3): 1. Perceptual 2. Average 3. Wavelet 4. Difference
    Vector (e.g. [1.2, 0.43, . . ., 0.91]): Convolutional Neural Network embeddings
  9. Hashing: Difference hashing, how it works

    Step 4: Convert boolean array to hash, e.g. bae7c83d8358e4c3
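
The transcript only preserves step 4 of difference hashing. The sketch below assumes the earlier steps already produced a grayscale image resized to 8 rows by 9 columns, and that each bit records whether a pixel is brighter than its right-hand neighbour. Both the grid size and the comparison direction are assumptions made for illustration, and `dhash_from_gray` is a hypothetical helper, not the package's implementation.

```python
def dhash_from_gray(gray):
    """Difference hash of a grayscale image given as rows of pixel intensities.

    Assumption: the image was already converted to grayscale and resized
    to 8 rows x 9 columns (earlier steps are not shown in the transcript).
    """
    if len(gray) != 8 or any(len(row) != 9 for row in gray):
        raise ValueError("expected an 8x9 grid of grayscale values")

    # Compare each pixel with its right-hand neighbour: 8 x 8 = 64 booleans.
    bits = [row[col] > row[col + 1] for row in gray for col in range(8)]

    # Step 4: pack the boolean array into an integer and format it as hex.
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return f"{value:016x}"  # 64 bits -> 16 hexadecimal characters


# Hypothetical input: every row gets darker from left to right,
# so every comparison is True and all 64 bits are set.
darkening = [[90, 80, 70, 60, 50, 40, 30, 20, 10] for _ in range(8)]
print(dhash_from_gray(darkening))  # ffffffffffffffff
```

A 64-bit result matches the length of the 16-character hashes shown on the slides.
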
  10. Component 2: Search/comparison mechanism

    Feature extractor output, Hashing:
      Image 1: bae7c83d8358e4c3
      Image 2: a5e4c83d8758e5e1
      Image 3: 2a37c83d8358e3f6
      Image 4: c777c83d8358eee4
    Feature extractor output, CNN:
      Image 1: [1.2, 0.43, . . ., 0.91]
      Image 2: [0.3, 0.2, . . ., 0.11]
      Image 3: [0.22, 0.65, . . ., 0.12]
      Image 4: [0.43, 0.21, . . ., 0.34]
  11. Hashing: Hamming distance (HD)

    Given two binary numbers, how many positions do they differ in?
    1101 0010 vs 1001 1010: HD = 2
  12. Hashing: Hamming distance (HD)

    Given two binary numbers, how many positions do they differ in?
    1101 0010 vs 1001 1010: HD = 2
    Decision: the HD is compared with a threshold (“> Threshold?”, Y / N), giving Duplicate! or Not Duplicate!
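
A minimal sketch of the Hamming distance check, assuming hashes are stored as hexadecimal strings like those on the slides. The helper names and the default threshold of 10 are illustrative assumptions, as is the rule that a pair counts as a duplicate when its distance does not exceed the threshold.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hexadecimal hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 in every bit position where the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_hash_duplicate(hash_a: str, hash_b: str, max_distance: int = 10) -> bool:
    """Assumed rule: duplicate when the distance is within the threshold."""
    return hamming_distance(hash_a, hash_b) <= max_distance


# The slide's example written as hex: 1101 0010 = d2 and 1001 1010 = 9a.
print(hamming_distance("d2", "9a"))  # 2
```
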
  13. Hashing: Efficiency

    Brute force: Compare every image with every other image (e.g. HD = 44, HD = 31, HD = 10)
    INEFFICIENT !! (more so in pure Python!)
  14. Hashing: Improving search efficiency

    1. Instead of pure Python, use Cython
       ~ Orders of magnitude faster
       ~ Default for Linux and macOS
    2. Use a better search method: BK-tree
       ~ Creates index using hashes
       ~ Reduces the search space and hence number of comparisons
       ~ Default for Windows
    3. Do comparisons in parallel to increase speed (via multiprocessing)
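
To illustrate how a BK-tree reduces the number of comparisons, here is a small pure-Python sketch. It is not the package's implementation: the `BKTree` class, its node layout, and the `comparisons` counter are assumptions chosen for readability. The tree indexes hashes by their Hamming distance to a parent node and uses the triangle inequality to skip subtrees that cannot contain a match.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two integer hashes."""
    return bin(a ^ b).count("1")


class BKTree:
    """Minimal BK-tree over integer hashes with Hamming distance as the metric."""

    def __init__(self):
        self.root = None      # each node is (hash_value, {edge_distance: child})
        self.comparisons = 0  # distance computations done while searching

    def add(self, value: int) -> None:
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = hamming(value, node[0])
            if d not in node[1]:
                node[1][d] = (value, {})
                return
            node = node[1][d]  # descend along the edge labelled d

    def search(self, query: int, max_distance: int) -> list:
        """Return (hash, distance) pairs within max_distance of the query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = hamming(query, value)
            self.comparisons += 1
            if d <= max_distance:
                results.append((value, d))
            # Triangle inequality: only edges in [d - max, d + max] can match.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return results


# The four example hashes from the slides, indexed and queried with image 1.
hashes = ["bae7c83d8358e4c3", "a5e4c83d8758e5e1",
          "2a37c83d8358e3f6", "c777c83d8358eee4"]
tree = BKTree()
for h in hashes:
    tree.add(int(h, 16))
matches = tree.search(int(hashes[0], 16), max_distance=10)
print([(f"{value:016x}", d) for value, d in matches])
```

On a collection this small the saving is negligible; the pruning step only pays off once the index holds many hashes.
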
  15. Convolutional Neural Network (CNN): Cosine similarity (CS)

    Given two vectors, calculate the cosine similarity:
    cos([0.3, 0.2, . . ., 0.11], [1.2, 0.43, . . ., 0.91])
  16. Convolutional Neural Network (CNN): Decision

    Given two vectors, calculate the cosine similarity:
    cos([0.3, 0.2, . . ., 0.11], [1.2, 0.43, . . ., 0.91]) = 0.8
    Decision: > Threshold? Y / N, giving Duplicate! or Not Duplicate!
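
The cosine similarity decision can be sketched in a few lines. The vectors on the slides are elided, so the example uses short made-up vectors; the function names and the 0.9 default threshold are likewise illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def is_embedding_duplicate(u, v, min_similarity=0.9):
    """Assumed rule: duplicate when the similarity reaches the threshold."""
    return cosine_similarity(u, v) >= min_similarity


# Hypothetical 3-dimensional embeddings (real CNN embeddings are much longer).
image_a = [1.0, 2.0, 3.0]
image_b = [2.0, 4.0, 6.0]   # same direction as image_a
image_c = [3.0, -1.0, 0.0]  # a different direction
print(is_embedding_duplicate(image_a, image_b))  # True
print(is_embedding_duplicate(image_a, image_c))  # False
```
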
  17. Convolutional Neural Network (CNN): Improving search efficiency

    1. Utilize matrix multiplication
    2. Perform cosine similarity calculations in chunks for memory optimizations
    3. Do comparisons in parallel to increase speed (via multiprocessing)
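
Points 1 and 2 can be combined as in the following sketch, which assumes NumPy is available. Rows are normalised to unit length so that one matrix product yields cosine similarities, and the product is computed one chunk of rows at a time so that only a chunk-by-n block is held in memory. `find_similar_pairs`, its parameters, and the tiny feature matrix are assumptions for illustration; the multiprocessing step is not shown.

```python
import numpy as np


def find_similar_pairs(features, min_similarity=0.9, chunk_size=1024):
    """Return index pairs (i, j), i < j, whose cosine similarity >= threshold."""
    features = np.asarray(features, dtype=float)
    # Unit-length rows: a dot product of two rows is their cosine similarity.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)

    pairs = []
    for start in range(0, len(unit), chunk_size):
        # Similarities of this chunk of rows against every row: (chunk, n).
        block = unit[start:start + chunk_size] @ unit.T
        rows, cols = np.nonzero(block >= min_similarity)
        for r, c in zip(rows, cols):
            i, j = start + int(r), int(c)
            if i < j:  # skip self-matches and mirrored pairs
                pairs.append((i, j))
    return pairs


# Hypothetical 2-dimensional features: rows 0/1 and rows 2/3 point the same way.
features = [[1, 0], [2, 0], [0, 1], [0, 3], [1, 1]]
print(find_similar_pairs(features, min_similarity=0.9, chunk_size=2))
# [(0, 1), (2, 3)]
```

The chunk size trades memory for the number of matrix products; the result does not depend on it.
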
  18. Evaluation: Factors

    ◼ 5 different deduplication methods
    ◼ What is a duplicate?
    ◼ Different threshold values
  19. Evaluation: Steps

    1. Provide ground truth
       ▪ Pairs of images marked as duplicates
    2. Run through a deduplication method with corresponding threshold
    3. Pass it to evaluation function of imagededup
       ▪ Precision
       ▪ Recall
       ▪ Mean Average Precision
       ▪ . . .
    4. Select method + threshold with best metrics
    ⏳ Benchmarks available: https://idealo.github.io/imagededup/user_guide/benchmarks/
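
As a rough illustration of steps 1 to 3, the sketch below computes precision and recall over unordered duplicate pairs. It is a simplified stand-in, not the package's evaluation function: the dictionary layout (`{image: [duplicates]}`), the helper names, and the file names are assumptions.

```python
def pair_set(duplicate_map):
    """Flatten {image: [duplicates]} into a set of unordered image pairs."""
    return {
        frozenset((image, dup))
        for image, dups in duplicate_map.items()
        for dup in dups
        if image != dup
    }


def precision_recall(ground_truth, retrieved):
    """Pair-level precision and recall of retrieved duplicates."""
    truth, found = pair_set(ground_truth), pair_set(retrieved)
    correct = len(truth & found)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth: (a, b) and (c, d) are duplicate pairs.
ground_truth = {"a.jpg": ["b.jpg"], "b.jpg": ["a.jpg"],
                "c.jpg": ["d.jpg"], "d.jpg": ["c.jpg"], "e.jpg": []}
# Hypothetical output of one method + threshold: finds (a, b), wrongly adds (c, e).
retrieved = {"a.jpg": ["b.jpg"], "b.jpg": ["a.jpg"],
             "c.jpg": ["e.jpg"], "d.jpg": [], "e.jpg": ["c.jpg"]}
print(precision_recall(ground_truth, retrieved))  # (0.5, 0.5)
```

Step 4 then amounts to repeating this for each method and threshold and keeping the combination with the best scores.
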
  20. Summary

    ◼ 5 deduplication methods
    ◼ Out-of-the-box
    ◼ Simple API design/installation
    ◼ Evaluation framework
    ◼ Exact and Near duplicates
  21. Backup: CNN

    Image 1: [1.2, 0.43, . . ., 0.91]
    Image 2: [0.3, 0.2, . . ., 0.11]
    Image 3: [0.22, 0.65, . . ., 0.12]
    Image 4: [0.43, 0.21, . . ., 0.34]
    [Figure: the four image vectors (rows 1 to 4, dimension d) arranged as a matrix; highlighted entry 0.8 = CS(image1, image3)]