Large-scale image similarity search at LINE

Vinh Hiep Le
LINE Fukuoka Machine Learning Team Machine Learning Engineer
https://linedevday.linecorp.com/2020/jp/sessions/5320
https://linedevday.linecorp.com/2020/en/sessions/5320

LINE DevDay 2020

November 27, 2020

Transcript

  1. Why are you here? › You wonder how to search by images. › You want to develop a system that can search for similar images accurately and quickly at large scale. › Then hopefully, this talk may benefit you!
  2. Agenda › Image similarity search › Methodology › A use case: Large-scale sticker search at LINE
  3. Definition › Content-based image retrieval: Given an image, retrieve a list of similar images by the content of the image itself rather than by keywords, tags, or descriptions associated with the image.
    https://en.wikipedia.org/wiki/Content-based_image_retrieval
  4. Approaches
    Traditional approach: Color histogram, texture, and shape as features › Pros: The idea is simple and fairly easy to implement › Cons: Not accurate
    SOTA approach: Represent images by float embeddings and retrieve similar images by ranking similarity scores under a distance metric › Pros: Retrieval is accurate › Cons: Float embeddings are computationally expensive and memory-inefficient
    Our chosen approach: Represent images by binary codes (hashing) › Pros: Fast computation, memory-friendly › Cons: Accuracy trade-off
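The "fast computation, memory-friendly" claim for binary codes can be illustrated with a minimal sketch. The 64-bit code length, the random data, and the helper names `pack_codes` and `hamming_distances` are assumptions made for illustration; the talk does not state the code length used at LINE.

```python
import numpy as np

def pack_codes(bits: np.ndarray) -> np.ndarray:
    """Pack an (n, n_bits) array of 0/1 values into (n, n_bits // 8) bytes."""
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_distances(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed query and every packed database code."""
    xor = np.bitwise_xor(database, query)          # differing bits become 1
    return np.unpackbits(xor, axis=1).sum(axis=1)  # count differing bits per row

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(1000, 64))         # hypothetical 64-bit codes
codes = pack_codes(bits)

dists = hamming_distances(codes[0], codes)
top5 = np.argsort(dists, kind="stable")[:5]        # indices of the 5 closest codes

float_bytes = 1000 * 64 * 4                        # same dimension stored as float32
print(codes.nbytes, float_bytes)                   # bytes for binary vs float storage
```

Under these assumptions each image costs 8 bytes instead of 256, and the distance is an XOR followed by a bit count rather than a floating-point dot product.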
  5. Goal and system design
    Goals › Learning highly accurate binary codes › Searching effectively using binary codes
    System design (diagram on the slide)
  6. Binary embedding model
    Popular approaches using neural networks › Similarity-preserving: pairwise, multi-wise › Classification-oriented
    Our approach: hybrid › Obtain the binary code by classifying whether a pair of images is similar or not
  7. Binary embedding model
    Model architecture (diagram on the slide)
    Model settings › ConvNet: Shared weights › Loss function: Cross-entropy loss › Binary mapping: Sign function
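One possible reading of these settings is sketched below. The slides do not describe the network or how the pair is scored, so everything concrete here is an assumption: a random linear projection `W` stands in for the shared-weights ConvNet, and the dot product of the two embeddings is used as the "similar or not" logit.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 32))  # hypothetical stand-in for the shared ConvNet

def embed(x: np.ndarray) -> np.ndarray:
    """Shared branch: the same weights W process both images of a pair."""
    return np.tanh(x @ W)

def pair_cross_entropy(xa, xb, label) -> float:
    """Cross-entropy loss on a 'similar or not' prediction for each pair."""
    ua, ub = embed(xa), embed(xb)
    logit = (ua * ub).sum(axis=1)        # assumed pair score, not from the talk
    prob = 1.0 / (1.0 + np.exp(-logit))
    eps = 1e-12                          # avoids log(0)
    return float(-np.mean(label * np.log(prob + eps)
                          + (1 - label) * np.log(1 - prob + eps)))

def binarize(x: np.ndarray) -> np.ndarray:
    """Binary mapping with the sign function; zero is mapped to +1 here."""
    return np.where(embed(x) >= 0, 1, -1).astype(np.int8)

xa = rng.normal(size=(4, 128))           # 4 hypothetical image feature vectors
xb = rng.normal(size=(4, 128))
labels = np.array([1, 0, 1, 0])          # 1 = similar pair, 0 = dissimilar pair

loss = pair_cross_entropy(xa, xb, labels)
codes = binarize(xa)                     # one 32-bit code per image, values in {-1, +1}
```

The sketch only evaluates the loss; training, and how gradients are passed through the sign function, are outside what the slides state.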
  8. Binary code search system
    System architecture (diagram on the slide)
    Large-scale problem › Exhaustive search is slow at large scale, even with binary codes › Need to reduce the search space: Approximate Nearest Neighbor Search › nprobe: The number of closest centroids visited in the centroid search
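The role of centroids and `nprobe` can be sketched with a toy inverted-file index over binary codes. This is an illustration of the general technique, not the system described in the talk: the class name `TinyIVF` is hypothetical, and centroids are simply sampled from the data instead of being trained.

```python
import numpy as np

def hamming(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hamming distances between one code a (n_bits,) and rows of b (n, n_bits)."""
    return np.count_nonzero(a != b, axis=1)

class TinyIVF:
    """Toy inverted-file index: each code is stored in the list of its nearest centroid."""

    def __init__(self, codes, n_centroids, rng):
        self.codes = codes
        # Simplification: sample centroids from the data instead of training them.
        picks = rng.choice(len(codes), n_centroids, replace=False)
        self.centroids = codes[picks]
        assign = np.array([np.argmin(hamming(c, self.centroids)) for c in codes])
        self.lists = [np.flatnonzero(assign == k) for k in range(n_centroids)]

    def search(self, query, k, nprobe):
        # Centroid search: keep only the nprobe closest centroids.
        probe = np.argsort(hamming(query, self.centroids), kind="stable")[:nprobe]
        cand = np.concatenate([self.lists[p] for p in probe])
        # Exhaustive search restricted to the candidates in the probed lists.
        d = hamming(query, self.codes[cand])
        order = np.argsort(d, kind="stable")[:k]
        return cand[order], d[order]

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(2000, 64), dtype=np.uint8)
index = TinyIVF(codes, n_centroids=32, rng=rng)

ids, dists = index.search(codes[7], k=5, nprobe=4)            # scans 4 of 32 lists
full_ids, full_dists = index.search(codes[7], k=5, nprobe=32)  # scans every list
```

A smaller `nprobe` scans fewer lists and is faster but may miss true neighbors; setting `nprobe` to the number of centroids falls back to exhaustive search.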
  9. A use case: Large-scale sticker search at LINE
    Database statistics › 10M sticker packages › Each package has 8 to 40 stickers
    Sticker search system › Number of centroids N: 2^17 = 131072 centroids › nprobe: 1000
    Search time › Current search system: 20 CPUs, 256 GB memory › Performance: 0.01 second per sticker
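A rough back-of-the-envelope calculation shows how much these settings shrink the search space. It assumes the stickers are spread evenly across the centroids, which the slides do not state, and uses the "over 300M stickers" figure from the conclusion as the database size.

```python
n_centroids = 2 ** 17          # N from the slide
nprobe = 1000                  # centroids visited per query
n_stickers = 300_000_000       # "over 300M stickers" (conclusion slide)

fraction_probed = nprobe / n_centroids          # share of lists scanned per query
avg_list_size = n_stickers / n_centroids        # assumes evenly filled lists
candidates_per_query = nprobe * avg_list_size   # codes compared per query

print(f"lists scanned per query: {fraction_probed:.2%}")
print(f"average list size:       {avg_list_size:,.0f} stickers")
print(f"candidates per query:    {candidates_per_query:,.0f} stickers")
```

Under that even-spread assumption, each query compares against roughly 2.3M codes instead of all 300M, or about 0.76% of the database.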
  10. Conclusion › A large-scale image similarity search system has been developed at LINE. › The system needs only 0.01 second for each sticker search on a database of over 300M stickers. › The system supports the sticker review process and saves each reviewer up to 3 hours of review time a day.