
Binary Embeddings For Efficient Ranking

Maciej Kula
August 31, 2017


Talk given during the Large Scale Recommender System workshop at the ACM RecSys 2017 conference.




  1. Binary Latent Representations for Efficient Ranking Maciej Kula Ravelin

  2. Item set size is a challenge in LSRS. When you have tens or hundreds of millions of items in your system, embeddings are challenging to • estimate • store • and compute predictions with. This is true in both offline and online systems, but the problem is especially severe online.
  3. Online settings are increasingly important. We need online predictions for • contextual models • incorporating new interactions.
  4. Online settings have hard constraints. In an online setting, you have ~100ms to • update models • retrieve candidates • perform scoring. Can you carry out 100 million dot products in under 100ms? You still need to fit in business logic, network latency, and so on.
  5. Solutions: • Heuristics • ANN search • More compact representations. Smaller embedding dimensions require less storage and computation, at the expense of accuracy.
  6. Use binary representations instead

  7. Binary dot product. Scaled XNOR as the binary analogue of a dot product. Successfully used for binary CNNs.
  8. Benefits. Space: • real-valued representations require 4 bytes per dimension • 32 binary dimensions fit in 4 bytes. Speed: • two floating-point operations per dimension • XNOR of all 32 dimensions in two clock cycles.
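To make the space claim concrete, here is a sketch of how 32 binary dimensions end up in 4 bytes: the sign of each real-valued coordinate becomes one bit of a `uint32_t`. (The packing convention — bit set iff the coordinate is non-negative — is an illustrative assumption, not taken from the talk.)

```c
#include <stdint.h>

/* Pack the signs of 32 floats into one 32-bit word:
 * bit i is set iff x[i] >= 0. A 1024-dim binary embedding is
 * then 32 such words (128 bytes) instead of 4096 bytes of
 * floats -- a 32x reduction. */
uint32_t pack_signs(const float *x) {
    uint32_t word = 0;
    for (int i = 0; i < 32; i++) {
        if (x[i] >= 0.0f) {
            word |= (1u << i);
        }
    }
    return word;
}
```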
  9. Does this offer a better tradeoff than simply reducing the latent dimensionality?
  10. Experiments. Setup: • standard learning-to-rank matrix factorization model • evaluated on MovieLens 1M. Key metrics: • MRR • predictions per millisecond.
  11. Models. Baseline: • 2 embedding layers, for users and items • negative sampling • BPR loss with tied weights • adaptive hinge loss with tied weights. Binary model: • embeddings followed by a sign function • trained by backpropagation.
  12. Backpropagation. The sign function is not differentiable, so we use a continuous version for the backward pass.
  13. Backpropagation. Normal forward pass. In the backward pass, gradients are applied to the real-valued embedding layers. We can discard those once the model has been estimated.
  14. Predictions. Implemented in C. The baseline is a standard dot product using SIMD intrinsics.
  15. Aside: SIMD

  16. XNOR: • 8-float-wide XOR • 8-float-wide negation • count one bits • scaling.
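The four steps above map directly onto a scoring kernel. The talk's version works on 8-float-wide (256-bit) registers, presumably with SIMD intrinsics; this portable scalar sketch performs the same steps one 32-bit word at a time:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of the XNOR scoring kernel. Each uint32_t word
 * holds 32 binary dimensions; a 1024-dim embedding is 32 words.
 * Returns the {-1, +1} dot product: 2 * matches - total_dims. */
int xnor_score(const uint32_t *user, const uint32_t *item,
               size_t n_words) {
    int matches = 0;
    for (size_t i = 0; i < n_words; i++) {
        uint32_t x = user[i] ^ item[i];   /* step 1: XOR            */
        x = ~x;                           /* step 2: negation       */
        matches += __builtin_popcount(x); /* step 3: count one bits */
    }
    return 2 * matches - (int)(32 * n_words); /* step 4: scaling */
}
```

The SIMD version replaces the XOR and negation with 256-bit-wide instructions and the bit count with a vectorised popcount, but the arithmetic is identical.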
  17. Results

  18. Results

  19. Bottom line. Moving from 1024 to 32 dimensions in the continuous model implies a 29-times increase in prediction speed at the expense of a modest 4% decrease in accuracy. Moving from a float representation to a 1024-dimensional binary representation implies a sharper accuracy drop of 6% in exchange for a smaller 20-times increase in prediction speed.
  20. More promising approaches • Maximum inner product search • Bloom

  21. Thanks! @Maciej_Kula Source code: