Large-scale image similarity search at LINE

Why are you here?
› You wonder how to search by image.
› You want to develop a system that can search for similar images accurately and quickly at large scale.
› Then, hopefully, this talk will benefit you!

Project members: Kao Wenchun, Hiep V. Le, Ueno Eidi (Manager)

Agenda
› Image similarity search
› Methodology
› A use case: Large-scale sticker search at LINE

Image Similarity Search

Definition
› Content-based image retrieval: Given an image, retrieve a list of similar images by the content of the image itself rather than by keywords, tags, or descriptions associated with the image.
https://en.wikipedia.org/wiki/Content-based_image_retrieval

Approaches
Our chosen approach: Represent images by binary codes (hashing)
› Pros: Fast computation, memory-friendly
› Cons: Accuracy trade-off
SOTA approach: Represent images by float embeddings and retrieve similar images by ranking similarity scores under a distance metric
› Pros: Retrieval is accurate
› Cons: Float embeddings are computationally expensive and memory-inefficient
Traditional approach: Color histogram, texture, and shape as features
› Pros: The idea is simple and fairly easy to implement
› Cons: Not accurate
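To make the speed and memory argument for binary codes concrete, here is a minimal sketch, not the talk's implementation. It stores each code as a Python integer and ranks images by Hamming distance (XOR followed by a bit count). The image names, the 8-bit codes, and the sizes used in the memory comparison (a 64-bit code versus a 512-dimensional float32 embedding) are illustrative assumptions; the talk does not state its code length.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as ints."""
    return bin(a ^ b).count("1")

# Hypothetical 8-bit codes, for illustration only.
query = 0b10110010
database = {
    "img_a": 0b10110011,  # differs from the query in 1 bit
    "img_b": 0b01001101,  # differs from the query in 8 bits
    "img_c": 0b10010000,  # differs from the query in 2 bits
}

# Rank database images from most to least similar to the query.
ranked = sorted(database, key=lambda name: hamming(query, database[name]))
print(ranked)  # ['img_a', 'img_c', 'img_b']

# Memory per image under the assumed sizes (not figures from the talk).
CODE_BITS = 64            # assumed binary code length
FLOAT_DIM = 512           # assumed float embedding dimension
code_bytes = CODE_BITS // 8
float_bytes = FLOAT_DIM * 4   # float32 takes 4 bytes per value
print(code_bytes, float_bytes, float_bytes // code_bytes)  # 8 2048 256
```

Under these assumed sizes the binary code is 256 times smaller than the float embedding, and comparing two codes needs only an XOR and a bit count instead of hundreds of floating-point operations.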

Goal and System design
[Figure: System design]
Goals
› Learning highly accurate binary codes
› Searching effectively using binary codes

Methodology

Binary embedding model
Popular approaches using neural networks
› Similarity-preserving: pairwise, multi-wise
› Classification-oriented
Our approach: hybrid
› Obtain the binary code by classifying whether a pair of images is similar or not

Binary embedding model
[Figure: Model architecture]
Model settings
› ConvNet: Shared weights
› Loss function: Cross-entropy loss
› Binary mapping: Sign function
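The settings above can be illustrated with a small sketch, which is an assumption-laden toy rather than the model from the talk. The embeddings stand in for the outputs of the shared-weight ConvNet. The pair classifier `similarity_probability` (a sigmoid of the inner product) is hypothetical, since the slides do not say how a pair is scored. The convention that the sign function maps values >= 0 to bit 1 is also an assumption.

```python
import math

def sign_code(embedding):
    """Binary mapping via the sign function: bit 1 where value >= 0, else bit 0."""
    return [1 if x >= 0 else 0 for x in embedding]

def similarity_probability(u, v):
    """Hypothetical pair classifier: sigmoid of the inner product of two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 / (1.0 + math.exp(-dot))

def pair_cross_entropy(p_similar, label):
    """Cross-entropy loss for one pair; label is 1 (similar) or 0 (dissimilar)."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p_similar + eps)
             + (1 - label) * math.log(1.0 - p_similar + eps))

# Stand-ins for the outputs of the shared-weight ConvNet on three images.
emb_a = [0.9, -1.2, 0.3, -0.4]
emb_b = [0.7, -0.8, 0.5, -0.1]   # assumed to be similar to image a
emb_c = [-0.6, 1.1, -0.2, 0.9]   # assumed to be dissimilar to image a

print(sign_code(emb_a), sign_code(emb_b), sign_code(emb_c))

# The loss is low when the predicted probability agrees with the pair label.
loss_similar = pair_cross_entropy(similarity_probability(emb_a, emb_b), 1)
loss_dissimilar = pair_cross_entropy(similarity_probability(emb_a, emb_c), 0)
print(loss_similar, loss_dissimilar)
```

In this toy example the similar pair ends up with identical codes and the dissimilar pair with complementary ones. In training, minimizing the pairwise loss is what would push the network toward that behavior.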

Binary code search system
[Figure: System architecture]
Large-scale problem
› Exhaustive search is slow at large scale, even with binary codes
› Need to reduce the search space: Approximate Nearest Neighbor Search
› nprobe: The number of closest centroids selected in the centroid search
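The role of nprobe can be sketched with a toy inverted-list index, a minimal illustration rather than the production system. Each database code is assigned to its nearest centroid, and a query scans only the lists of its nprobe closest centroids. The centroids, sticker IDs, and 8-bit codes below are hypothetical, and the function names `build_index` and `search` are invented for this example. In a real system the centroids would typically be learned by clustering.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of differing bits between two binary codes stored as ints."""
    return bin(a ^ b).count("1")

def build_index(codes, centroids):
    """Assign each database code to the inverted list of its nearest centroid."""
    lists = defaultdict(list)
    for item_id, code in codes.items():
        nearest = min(range(len(centroids)),
                      key=lambda c: hamming(code, centroids[c]))
        lists[nearest].append(item_id)
    return lists

def search(query, codes, centroids, lists, nprobe, k):
    """Scan only the lists of the nprobe centroids closest to the query."""
    probed = sorted(range(len(centroids)),
                    key=lambda c: hamming(query, centroids[c]))[:nprobe]
    candidates = [i for c in probed for i in lists.get(c, [])]
    return sorted(candidates, key=lambda i: hamming(query, codes[i]))[:k]

# Hypothetical centroids and database codes (8 bits each).
centroids = [0b00000000, 0b11110000, 0b00001111, 0b11111111]
codes = {"s1": 0b00000001, "s2": 0b11110001, "s3": 0b00001110,
         "s4": 0b11111110, "s5": 0b01110000}
lists = build_index(codes, centroids)

query = 0b11010001
print(search(query, codes, centroids, lists, nprobe=1, k=2))  # ['s2', 's5']
print(search(query, codes, centroids, lists, nprobe=4, k=1))  # ['s2']
```

With nprobe=1 only two of the five codes are compared against the query, yet the best match is the same as with an exhaustive scan (nprobe=4). A larger nprobe trades speed for a lower chance of missing a true neighbor that sits in a list that was not probed.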

A Use Case: Large-Scale Sticker Search at LINE
Database statistics
› 10M sticker packages
› Each package has 8 to 40 stickers
› Current search system: 20 CPUs, 256 GB memory
› Performance: 0.01 second per sticker
[Figure: Search time]
Sticker search system
› Number of centroids N: 2^17 = 131072 centroids
› nprobe: 1000
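A rough back-of-the-envelope calculation shows how much these settings shrink the search space. It is a sketch under stated assumptions: the database size of 300M stickers is the "over 300M" figure from the talk's conclusion, and stickers are assumed to be spread evenly across centroids, which a real index will only approximate.

```python
N_CENTROIDS = 2 ** 17        # 131072, from the slide
NPROBE = 1000                # from the slide
DB_SIZE = 300_000_000        # "over 300M stickers" (conclusion slide)

# Bounds implied by 10M packages with 8 to 40 stickers each.
PACKAGES = 10_000_000
min_stickers = PACKAGES * 8      # 80M
max_stickers = PACKAGES * 40     # 400M

# Fraction of centroids (and, if lists are balanced, of the database) probed.
fraction = NPROBE / N_CENTROIDS

# Assumption: stickers are evenly distributed across the inverted lists.
avg_list_size = DB_SIZE / N_CENTROIDS
candidates = NPROBE * avg_list_size

print(f"probed fraction: {fraction:.2%}")                  # about 0.76%
print(f"average list size: {avg_list_size:.0f}")           # about 2289
print(f"candidates per query: {candidates:,.0f}")          # about 2.29M
```

Under these assumptions each query compares against roughly 2.3M codes instead of 300M, which is about 0.76% of the database.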

Conclusion
› A large-scale image similarity search system has been developed at LINE.
› The system needs only 0.01 second per sticker search on a database of over 300M stickers.
› The system supports the sticker review process and saves each reviewer up to 3 hours of review time per day.

Thank you for listening!

LINE DevDay 2020
