Large-scale image similarity search at LINE

Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Why are you here?
› You wonder how to search by images.
› You want to develop a system that can search for similar images accurately and quickly at large scale.
› Then hopefully this talk will benefit you!

Slide 3

Slide 3 text

Project members: Kao Wenchun, Hiep V. Le, Ueno Eidi (Manager)

Slide 4

Slide 4 text

Agenda › Image similarity search › Methodology › A use case: Large-scale sticker search at LINE

Slide 5

Slide 5 text

Image Similarity Search

Slide 6

Slide 6 text

Definition › Content-based image retrieval: Given an image, retrieve a list of similar images by the content of the image itself rather than keywords, tags, or descriptions associated with the image. https://en.wikipedia.org/wiki/Content-based_image_retrieval

Slide 7

Slide 7 text

Approaches
Our chosen approach: represent images by binary codes (hashing)
› Pros: Fast computation, memory-friendly
› Cons: Accuracy trade-off
SOTA approach: represent images by float embeddings and retrieve similar images by ranking similarity scores computed with a distance metric
› Pros: Retrieval is accurate
› Cons: Float embeddings are computationally expensive and memory-inefficient
Traditional approach: color histogram, texture, and shape as features
› Pros: The idea is simple and fairly easy to implement
› Cons: Not accurate
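To make the "fast computation" point concrete, here is a minimal sketch of ranking by Hamming distance over binary codes. It assumes codes are stored as plain Python integers and uses made-up 8-bit codes and item names. The slides do not state the code length or storage format used at LINE, so treat every value here as illustrative.

```python
from typing import Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two codes differ (XOR, then count ones)."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: int, database: Dict[str, int], top_k: int) -> List[Tuple[str, int]]:
    """Exhaustively rank database items by Hamming distance to the query."""
    scored = [(name, hamming_distance(query, code)) for name, code in database.items()]
    scored.sort(key=lambda item: (item[1], item[0]))  # distance first, name as tie-break
    return scored[:top_k]


# Hypothetical 8-bit codes, for illustration only.
database = {
    "cat_a": 0b10110010,
    "cat_b": 0b10110011,
    "dog_a": 0b01001101,
}
query = 0b10110110
results = rank_by_hamming(query, database, top_k=2)
print(results)  # [('cat_a', 1), ('cat_b', 2)]
```

The memory argument follows from the representation: one bit per dimension instead of a 32-bit float per dimension.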

Slide 8

Slide 8 text

Goals and system design
Goals
› Learning highly accurate binary codes
› Searching effectively using binary codes
System design (diagram)

Slide 9

Slide 9 text

Methodology

Slide 10

Slide 10 text

Binary embedding model
Popular approaches using neural networks
› Similarity-preserving: pairwise, multi-wise
› Classification-oriented
Our approach: hybrid
› Obtain the binary code by classifying whether a pair of images is similar or not

Slide 11

Slide 11 text

Binary embedding model
Model architecture (diagram)
Model settings
› ConvNet: shared weights
› Loss function: cross-entropy loss
› Binary mapping: sign function
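The settings above can be illustrated with a small sketch. The slide text names the loss (cross-entropy) and the binary mapping (sign function) but does not describe the pair head. The sigmoid-of-dot-product used here is therefore an assumption, and the feature vectors are made-up stand-ins for the outputs of the shared-weight ConvNet. The slides also do not say how the non-differentiable sign function is handled during training, so this sketch applies it only when producing codes.

```python
import math
from typing import List


def sign_binarize(features: List[float]) -> List[int]:
    """Binary mapping with the sign function: non-negative -> 1, negative -> 0."""
    return [1 if x >= 0.0 else 0 for x in features]


def pair_similarity_probability(u: List[float], v: List[float]) -> float:
    """Assumed pair head (not from the slides): sigmoid of the dot product."""
    logit = sum(a * b for a, b in zip(u, v))
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(p: float, label: int) -> float:
    """Cross-entropy loss for one pair; label is 1 (similar) or 0 (not similar)."""
    eps = 1e-12  # keep log() away from zero
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical features for two images, both produced by the same (shared-weight) network.
feat_a = [0.9, -1.2, 0.4, -0.3]
feat_b = [1.1, -0.8, 0.2, -0.5]

p_similar = pair_similarity_probability(feat_a, feat_b)
loss_if_similar = binary_cross_entropy(p_similar, 1)
loss_if_not_similar = binary_cross_entropy(p_similar, 0)

code_a = sign_binarize(feat_a)  # [1, 0, 1, 0]
code_b = sign_binarize(feat_b)  # [1, 0, 1, 0]
```

In this toy pair the features agree in sign on every dimension, so the two images map to the same binary code and the loss is lower when the pair is labeled similar.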

Slide 12

Slide 12 text

Binary code search system
System architecture (diagram)
Large-scale problem
› Exhaustive search is slow at large scale, even with binary codes
› Need to reduce the search space: approximate nearest neighbor search
› nprobe: the number of closest centroids visited during centroid search
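The role of nprobe can be sketched with a toy inverted-index search over binary codes. This is an assumption about the general shape of a centroid-based approximate search, not a description of the LINE system, whose architecture diagram is not reproduced in the text. The centroids, codes, and names below are hypothetical, and the centroids are given by hand rather than learned.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def hamming(a: int, b: int) -> int:
    """Hamming distance between two codes stored as integers."""
    return bin(a ^ b).count("1")


def build_index(codes: Dict[str, int], centroids: List[int]) -> Dict[int, List[str]]:
    """Assign every database code to the bucket of its nearest centroid."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for name, code in codes.items():
        nearest = min(range(len(centroids)), key=lambda i: hamming(code, centroids[i]))
        buckets[nearest].append(name)
    return buckets


def search(query: int, codes: Dict[str, int], centroids: List[int],
           buckets: Dict[int, List[str]], nprobe: int, top_k: int) -> List[Tuple[str, int]]:
    """Scan only the buckets of the nprobe centroids closest to the query."""
    # Centroid search: order centroids by distance to the query, keep the closest nprobe.
    probe = sorted(range(len(centroids)), key=lambda i: hamming(query, centroids[i]))[:nprobe]
    # Candidate scan: exact Hamming ranking inside the selected buckets only.
    candidates = [name for i in probe for name in buckets.get(i, [])]
    scored = sorted((hamming(query, codes[name]), name) for name in candidates)
    return [(name, dist) for dist, name in scored[:top_k]]


# Hypothetical 8-bit centroids and database codes.
centroids = [0b00000000, 0b11111111, 0b11110000]
codes = {"s1": 0b00000001, "s2": 0b00000011, "s3": 0b11111110,
         "s4": 0b11110001, "s5": 0b11100000}
buckets = build_index(codes, centroids)
query = 0b11110011

print(search(query, codes, centroids, buckets, nprobe=1, top_k=2))  # [('s3', 3)]
print(search(query, codes, centroids, buckets, nprobe=2, top_k=2))  # [('s4', 1), ('s3', 3)]
```

With nprobe=1 the search misses the true nearest neighbor (s4), which sits in a bucket that was not probed. Raising nprobe recovers it at the cost of scanning more candidates, which is the accuracy/speed trade-off this parameter controls.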

Slide 13

Slide 13 text

A Use Case: Large-Scale Sticker Search at LINE
Database statistics
› 10M sticker packages
› Each package has 8 to 40 stickers
› Current search system: 20 CPUs, 256 GB memory
› Performance: 0.01 second per sticker
Search time (chart)
Sticker search system
› Number of centroids N: 2^17 = 131072 centroids
› nprobe: 1000
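A quick back-of-the-envelope check of these figures is sketched below. The package counts, N, and nprobe come from the slide. The 300M database size is the rounded "over 300M" figure from the conclusion. The even spread of stickers across centroid cells is purely an assumption, made to get a rough estimate of how much of the database each query touches.

```python
# Figures taken from the slides.
packages = 10_000_000
min_stickers, max_stickers = 8, 40
num_centroids = 2 ** 17          # 131072
nprobe = 1000

# Bounds on the total number of stickers implied by the package statistics.
total_low = packages * min_stickers     # 80,000,000
total_high = packages * max_stickers    # 400,000,000

# Fraction of centroid cells visited per query.
probed_fraction = nprobe / num_centroids    # about 0.0076, i.e. under 1% of cells

# Assumption: about 300M stickers spread evenly over the cells.
database_size = 300_000_000
avg_per_cell = database_size / num_centroids        # about 2,289 stickers per cell
candidates_per_query = avg_per_cell * nprobe        # about 2.29M codes scanned per query

print(f"{total_low:,} to {total_high:,} stickers")
print(f"probed fraction: {probed_fraction:.4f}")
print(f"candidates per query: {candidates_per_query:,.0f}")
```

Under that even-spread assumption, each query compares against roughly 2.3M codes instead of the full database, which is where the reduction in search space comes from.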

Slide 14

Slide 14 text

Conclusion
› The system needs only 0.01 second per sticker search on a database of over 300M stickers.
› The system supports the sticker review process and saves each reviewer up to 3 hours of review time per day.
› A large-scale image similarity search system has been developed at LINE.

Slide 15

Slide 15 text

Thank you for listening!