Learning = LEAKAGE
Interesting paper: “Do we train on test data? Purging CIFAR of near-duplicates” - Björn Barz, Joachim Denzler, 2019 - https://arxiv.org/abs/1902.00423
company!
◼ We are not the only hotel comparison portal!
◼ There are so many other businesses that need this solution!
◼ An average Joe, because of his/her Instagram addiction, takes a gazillion pictures a day, but doesn’t want to keep them all! So, it’s more than a business need!
◼ We believe in community efforts
Need an Open-Source solution
+ slightly advanced)
2. Search mechanism
3. Evaluation of deduplication quality
◼ User
1. Usable out-of-the-box
2. Simple API design
3. Doesn’t require in-depth understanding of the chosen programming language
Need to build a Python package ourselves
Nothing pre-existing!
Orders of magnitude faster
~ Default for Linux and macOS
2. Use a better search method: BK-tree
~ Creates index using hashes
~ Reduces the search space and hence number of comparisons
~ Default for Windows
3. Do comparisons in parallel to increase speed (via multiprocessing)
Improving search efficiency
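To make the BK-tree idea concrete, here is a minimal sketch, assuming image hashes are stored as plain integers and compared by Hamming distance. The class `BKTree` and its methods are hypothetical names for illustration, not the package's internal implementation. The point it shows is how the triangle inequality lets a query skip whole subtrees, which is where the reduction in comparisons comes from.

```python
from typing import Dict, List, Optional, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


# A node is (hash_value, [file names with this hash], {edge_distance: child_node}).
Node = Tuple[int, List[str], Dict[int, tuple]]


class BKTree:
    """Toy BK-tree over integer hashes (illustrative only)."""

    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def add(self, name: str, hash_value: int) -> None:
        if self.root is None:
            self.root = (hash_value, [name], {})
            return
        node = self.root
        while True:
            d = hamming_distance(hash_value, node[0])
            if d == 0:
                # Identical hash: keep the name on the existing node.
                node[1].append(name)
                return
            children = node[2]
            if d not in children:
                children[d] = (hash_value, [name], {})
                return
            node = children[d]

    def search(self, hash_value: int, max_distance: int) -> List[Tuple[str, int]]:
        """Return (name, distance) for every stored hash within max_distance."""
        results: List[Tuple[str, int]] = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node_hash, names, children = stack.pop()
            d = hamming_distance(hash_value, node_hash)
            if d <= max_distance:
                results.extend((n, d) for n in names)
            # Triangle inequality: only subtrees whose edge distance lies in
            # [d - max_distance, d + max_distance] can contain a match.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return results


tree = BKTree()
example_hashes = {
    "a.jpg": 0b11110000,
    "b.jpg": 0b11110001,
    "c.jpg": 0b00001111,
    "d.jpg": 0b11110000,
}
for file_name, h in example_hashes.items():
    tree.add(file_name, h)

found = sorted(tree.search(0b11110000, max_distance=1))
print(found)
```

In this toy example the subtree holding `c.jpg` is never visited for the query, because its edge distance from the root falls outside the allowed window.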
matrix multiplication
2. Perform cosine similarity calculations in chunks for memory optimizations
3. Do comparisons in parallel to increase speed (via multiprocessing)
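The chunking step can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the package's code: the function name `chunked_cosine_duplicates`, the `chunk_size` parameter, and the `0.9` default threshold are all chosen for the example. Rows are L2-normalised once, so each chunk's cosine similarities reduce to a single matrix multiplication, and only a `chunk_size × n` block of the similarity matrix is in memory at any time.

```python
import numpy as np


def chunked_cosine_duplicates(features, threshold=0.9, chunk_size=1024):
    """Return (i, j, similarity) for all pairs i < j with cosine similarity >= threshold."""
    features = np.asarray(features, dtype=np.float64)
    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized = features / np.clip(norms, 1e-12, None)

    pairs = []
    n = normalized.shape[0]
    for start in range(0, n, chunk_size):
        chunk = normalized[start:start + chunk_size]
        # Similarities of this chunk against all rows: shape (len(chunk), n).
        sims = chunk @ normalized.T
        rows, cols = np.nonzero(sims >= threshold)
        for r, c in zip(rows, cols):
            i, j = start + int(r), int(c)
            if i < j:  # skip self-matches and mirrored pairs
                pairs.append((i, j, float(sims[r, c])))
    return pairs


example_features = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pairs = chunked_cosine_duplicates(example_features, threshold=0.99, chunk_size=2)
print(pairs)
```

Each chunk is independent of the others, so the loop body is also a natural unit of work to hand to a process pool.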
images marked as duplicates
2. Run through a deduplication method with corresponding threshold
3. Pass it to evaluation function of imagededup
▪ Precision
▪ Recall
▪ Mean Average Precision
▪ . . .
4. Select method + threshold with best metrics
⏳ Benchmarks available: https://idealo.github.io/imagededup/user_guide/benchmarks/
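What precision and recall mean for a deduplication result can be sketched in plain Python. This is a hand-rolled illustration, not imagededup's own evaluation function: `precision_recall` is a hypothetical helper, it assumes both inputs are dictionaries mapping a file name to a list of duplicate file names, and it micro-averages over all images (a choice made for the example).

```python
def precision_recall(ground_truth, retrieved):
    """Micro-averaged precision and recall of retrieved duplicates.

    ground_truth: {file name: [file names marked as true duplicates]}
    retrieved:    {file name: [file names returned by a deduplication method]}
    """
    true_positive = retrieved_total = relevant_total = 0
    for image, true_duplicates in ground_truth.items():
        true_set = set(true_duplicates)
        found_set = set(retrieved.get(image, []))
        true_positive += len(true_set & found_set)
        retrieved_total += len(found_set)
        relevant_total += len(true_set)
    # Assumption: report 0.0 when nothing was retrieved or nothing is relevant.
    precision = true_positive / retrieved_total if retrieved_total else 0.0
    recall = true_positive / relevant_total if relevant_total else 0.0
    return precision, recall


ground_truth = {"a.jpg": ["b.jpg"], "b.jpg": ["a.jpg"], "c.jpg": []}
retrieved = {"a.jpg": ["b.jpg", "c.jpg"], "b.jpg": [], "c.jpg": []}
precision, recall = precision_recall(ground_truth, retrieved)
print(precision, recall)
```

Running this once per method and threshold gives the numbers to compare in step 4.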