
Image matching presentation during cubonacci meetup

Marijn
February 27, 2020


Presented during: https://www.meetup.com/Applied-Machine-Learning-by-Cubonacci/events/267873547/

Marijn Lems (iam.io) will take you through the process of training the deep learning model that they use for image matching.

We discuss some fun and interesting challenges during data collection, hand labeling, cleaning, preparation, modelling and operationalization. Then we dig deeper into which CNN architectures proved to work for image matching and how we optimized hyperparameters. We will look into deep representation learning and cover sampling procedures that boosted image matching performance.

ImageLink (a product of iam.io) replaces QR codes using image matching technology. Instead of scanning a QR code, you scan images. A new and exciting way of activating your (printed) assets.


Transcript

  1. IAM • E-commerce startup founded in 2017 • I joined in 2019 • Online platform that offers micro-retailers their own shopping venue • Launch April 2020
  2. ImageLink Focus • Create an app that can detect predefined planar objects like paintings, images, billboards, logos • QR-code scanner experience • Publisher uploads content, end-user scans it like a QR code
  3. Make labeling easier • Object detection algorithm to detect the planar object(s) in this query • VOC data format
  4. Examine what we got • References: 20K • Queries: ~5K • Domains: Artworks, Magazines, Billboards • [charts: query dimensions, planar dimensions]
  5. Data labeling • After eyeballing the data, it seemed trivial to associate queries and references • [screenshot: a query (top left) and 5 possible references]
  6. Simple model • Mean RGB pixel difference (sketched below) • WHAT? Only 9 matches • [table: mean pixel value vs. top-K distance — 4, 10, 11, 55, 221]
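A minimal sketch of a mean-RGB baseline like this (the helper names and the NumPy/Pillow usage are assumptions, not from the deck): compute one mean-RGB vector per image and rank references by distance to the query.

```python
import numpy as np
from PIL import Image

def mean_rgb(path):
    """Mean R, G, B value of an image as a 3-vector."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    return img.reshape(-1, 3).mean(axis=0)

def top_k_matches(query_path, reference_paths, k=5):
    """Rank references by Euclidean distance between mean-RGB vectors."""
    q = mean_rgb(query_path)
    refs = np.stack([mean_rgb(p) for p in reference_paths])
    dists = np.linalg.norm(refs - q, axis=1)
    order = np.argsort(dists)[:k]
    return [(reference_paths[i], float(dists[i])) for i in order]
```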
  7. Experimentation so far • Local feature descriptors (ORB; sketched below) • Pre-processing • Starting with a simple model • Augmentation • Transformer networks • Fine-tuning a pretrained network • Multimodal networks • Text • MPEG7 feature descriptors • [images: MPEG7 descriptors, augmentations, ORB with RANSAC]
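The ORB experiment might look roughly like this OpenCV sketch: match binary ORB descriptors between query and reference, then count the inliers of a RANSAC-fitted homography (the function name and thresholds are illustrative, not the deck's actual code).

```python
import cv2
import numpy as np

def orb_ransac_inliers(query_path, reference_path, n_features=1000):
    """Match ORB descriptors and count RANSAC homography inliers."""
    orb = cv2.ORB_create(nfeatures=n_features)
    img1 = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return 0
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    if len(matches) < 4:  # a homography needs at least 4 correspondences
        return 0
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return 0 if mask is None else int(mask.sum())
```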
  8. Siamese neural network • Triplet loss (code sketch below): $\mathcal{L}(q, p, n) = \max\big(\lVert f(q) - f(p)\rVert^2 - \lVert f(q) - f(n)\rVert^2 + \alpha,\ 0\big)$ • $\alpha$ is a margin: the threshold that determines the preferred distance between images • $f(\cdot)$ produces n-dimensional embeddings, one for each input: query $q$, positive $p$, negative $n$
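The loss translates almost directly into code; a sketch assuming TensorFlow (the deck does not name its framework):

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(q,p) - d(q,n) + margin, 0), with squared Euclidean distances.

    anchor, positive, negative: (batch, n_dims) embedding tensors;
    margin is the threshold that sets the preferred gap between the
    positive and negative distances.
    """
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
```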
  9. Embedding space • Where does my embedding go? • The embedding space should not have too many dimensions
  10. Baseline experiment • Pretrained VGG16 on ImageNet (https://arxiv.org/pdf/1409.1556.pdf; sketch below) • Baseline performance on validation set: .75 for free • [picture from the ImageNet dataset]
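A sketch of such a baseline, assuming Keras: take the ImageNet-pretrained VGG16 without its classifier head and use the pooled convolutional features as embeddings.

```python
import numpy as np
import tensorflow as tf

# Pretrained VGG16 (ImageNet weights) as a frozen feature extractor:
# global-average-pooled convolutional features serve as embeddings.
base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False, pooling="avg")

def embed(images):
    """images: float array of shape (batch, 224, 224, 3), RGB."""
    x = tf.keras.applications.vgg16.preprocess_input(np.array(images))
    return base.predict(x)  # (batch, 512) embeddings
```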
  11. Finetuning my baseline • Freeze the first n blocks, tune the weights of the last n layers (sketched below) • epoch@100, best hyperparams, 4-fold • Here we jump 23% to .74 • [diagram: VGG16 — freeze these layers / train these layers]
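Freezing the first blocks while training the rest could look like this in Keras (training from block 5 onward is an illustrative choice, not necessarily the deck's setting):

```python
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False, pooling="avg")

# Freeze everything up to block 4; train from block 5 onward.
trainable = False
for layer in base.layers:
    if layer.name.startswith("block5"):
        trainable = True
    layer.trainable = trainable
```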
  12. Triplet sampling • How you select your triplets affects performance and convergence • 5K queries and 20K references make around 4e+16 possible triplets • Uniform sampling (sketched below) • [diagram: anchor, positive, negative]
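Uniform sampling can be sketched as: draw a random (query, positive) pair and a random non-matching reference as the negative (names are hypothetical).

```python
import random

def sample_uniform_triplets(pairs, references, n_triplets):
    """pairs: (query, matching_reference) tuples; references: all references."""
    triplets = []
    for _ in range(n_triplets):
        query, positive = random.choice(pairs)
        negative = random.choice(references)
        while negative == positive:  # resample until it is a true negative
            negative = random.choice(references)
        triplets.append((query, positive, negative))
    return triplets
```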
  13. Challenges with batch construction • As it turns out, it's easy to sample easy negatives • It's expensive to sample "useful" negatives because you need pairwise similarities • https://arxiv.org/abs/1706.07567 • https://omoindrot.github.io/triplet-loss • [diagram: A, P]
  14. Sampling experiment • Pretrained VGG • Randomly sampled hyperparameters (sketched below): {batchsize} • {margin} • {trainable_blocks} • … • Evaluate at epoch 100 • 30 trials • [chart: recall@1]
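The random search might be sketched as follows; only the searched parameters come from the slide, the value ranges are assumptions.

```python
import random

def sample_hyperparams():
    """Draw one random configuration from a hypothetical search space."""
    return {
        "batchsize": random.choice([16, 32, 64, 128]),
        "margin": random.uniform(0.05, 1.0),
        "trainable_blocks": random.randint(0, 5),
    }

trials = [sample_hyperparams() for _ in range(30)]
# For each trial: train to epoch 100 and record recall@1 on the validation set.
```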
  15. Visually our negatives weren't so hard • Sorry, no example • Collect a sample of difficult negatives from Azure similar-image search • Hand labeling • https://omoindrot.github.io/triplet-loss • [figure: reference/negative pairs — .78]
  16. TODO: Online triplet mining • Uniform vs semi-hard triplet mining • 60 sec vs 6 min per epoch • Trade-off between computation time and memory footprint vs. the quality of the solution • Online sampling [during batch construction]: per (query, positive) pair, sample n references, calculate the triplet loss, keep the semi-hard references (see the sketch below)
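A sketch of the semi-hard selection step described above, assuming squared Euclidean distances in embedding space: a negative is semi-hard when it is farther from the anchor than the positive, but still within the margin.

```python
import numpy as np

def semi_hard_negatives(f_q, f_p, candidate_embs, margin=0.2, keep=1):
    """Pick semi-hard negatives for one (query, positive) pair.

    Semi-hard: d(q,p) < d(q,n) < d(q,p) + margin.
    f_q, f_p: (n_dims,) embeddings; candidate_embs: (n, n_dims).
    """
    d_pos = np.sum((f_q - f_p) ** 2)
    d_neg = np.sum((candidate_embs - f_q) ** 2, axis=1)
    semi_hard = np.where((d_neg > d_pos) & (d_neg < d_pos + margin))[0]
    if len(semi_hard) == 0:
        return [int(np.argmin(d_neg))]  # fall back to the hardest negative
    return list(semi_hard[np.argsort(d_neg[semi_hard])][:keep])
```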
  17. FAILED: Augmentation • Stop collecting new data • Train on augmented references as queries (augmentation sketched below) • Validate on real queries • https://imgaug.readthedocs.io • [pipeline: 1. reference → 2. random perspective transform → 3. inverse homography → 4. biggest crop from 2. = TRAINING INPUT] • [chart: recall@1]
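Step 2 of the pipeline, the random perspective transform, might look like this with imgaug (the scale range is an assumption; the inverse-homography and cropping steps are omitted here):

```python
import imgaug.augmenters as iaa

# Random perspective transform that warps a reference into a fake "scan".
aug = iaa.PerspectiveTransform(scale=(0.05, 0.15))

def make_training_query(reference_img):
    """reference_img: uint8 HxWx3 array; returns the augmented image."""
    return aug(image=reference_img)
```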
  18. How we use it • [architecture diagram: publisher → content → content embedding storage; end-user → query → index embedding or query storage → FAISS → serving] • Never-seen query and content • It's also a matter of choice: one-shot vs zero-shot • FAISS retrieval sketched below
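A minimal FAISS sketch of that flow: index the publisher-side content embeddings once, then search them with the end-user's query embedding at serving time (dimensions and data below are placeholders).

```python
import faiss
import numpy as np

n_dims = 512  # embedding size; matches the VGG16 sketch above

# Publisher side: index all content embeddings once.
index = faiss.IndexFlatL2(n_dims)
content_embeddings = np.random.rand(20000, n_dims).astype("float32")  # placeholder
index.add(content_embeddings)

# Serving side: look up the nearest content for an end-user query.
query_embedding = np.random.rand(1, n_dims).astype("float32")  # placeholder
distances, content_ids = index.search(query_embedding, 5)
```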