Slide 1

Binary Latent Representations for Efficient Ranking
Maciej Kula, Ravelin

Slide 2

Item set size a challenge in LSRS

When you have 10s or 100s of millions of items in your system, embeddings are challenging to
● estimate,
● store,
● and compute predictions with.

True in both offline and online systems, but the problem is especially severe online.

Slide 3

Online settings increasingly important

Need online predictions for
● contextual models
● incorporating new interactions

Slide 4

Online settings have hard constraints

In online settings, we have ~100ms to
● update models,
● retrieve candidates,
● and perform scoring.

Can you carry out 100 million dot products in under 100ms? At 64 dimensions, for example, that is roughly 12.8 billion floating-point operations, or 128 GFLOPS sustained. We still need to fit in business logic, network latency, and so on.

Slide 5

Solutions

● Heuristics
● ANN search
● More compact representations

Smaller embedding dimensions require less storage and computation, at the expense of accuracy.

Slide 6

Use binary representations instead

Slide 7

Binary dot product

Scaled XNOR as the binary analogue of a dot product. Successfully used for binary CNNs.
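
As background, the standard identity behind this (the usual form in the binarized-networks literature, not necessarily the exact formula shown on the slide): for $\mathbf{u}, \mathbf{i} \in \{-1, +1\}^d$ stored as bit vectors,

$$\mathbf{u} \cdot \mathbf{i} = 2\,\mathrm{popcount}\big(\mathrm{XNOR}(\mathbf{u}, \mathbf{i})\big) - d,$$

optionally rescaled, e.g. by $1/d$ or by learned per-vector scaling factors as in XNOR-Net.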

Slide 8

Benefits

Space
● Real-valued representations require 4 bytes per dimension
● 32 binary dimensions in 4 bytes

Speed
● Two floating point operations per dimension
● XNOR all 32 dimensions in two clock cycles
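
To make the space claim concrete, here is a minimal sketch (not from the talk; pack_bits is a hypothetical helper) of packing the signs of a real-valued embedding into 32-bit words:

```c
#include <stdint.h>

/* Pack the signs of `dim` floats into dim/32 32-bit words:
   bit j of word i is set iff embedding[32 * i + j] >= 0.
   Assumes dim is a multiple of 32. */
void pack_bits(const float *embedding, int dim, uint32_t *packed)
{
    for (int i = 0; i < dim / 32; i++) {
        uint32_t word = 0;
        for (int j = 0; j < 32; j++)
            if (embedding[32 * i + j] >= 0.0f)
                word |= (uint32_t)1 << j;
        packed[i] = word;
    }
}
```

A 1024-dimensional embedding then takes 128 bytes instead of 4096.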

Slide 9

Does this offer a better tradeoff than simply reducing the latent dimensionality?

Slide 10

Experiments

Set-up:
● Standard learning-to-rank matrix factorization model
● Evaluated on MovieLens 1M

Key metrics:
● MRR (mean reciprocal rank)
● Predictions per millisecond

Slide 11

Models

Baseline:
● 2 embedding layers, for users and items
● Negative sampling
● BPR loss with tied weights
● Adaptive hinge loss with tied weights

Binary model:
● Embeddings followed by a sign function
● Trained by backpropagation
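
For reference, the standard definitions of these losses (stated here as background, not quoted from the slides), with $s^{+}$ the predicted score of the observed item and $s^{-}$ that of a sampled negative:

$$L_{\mathrm{BPR}} = -\log \sigma\big(s^{+} - s^{-}\big), \qquad L_{\mathrm{hinge}} = \max\big(0,\, 1 - s^{+} + s^{-}\big),$$

where the adaptive hinge variant takes $s^{-}$ to be the highest-scoring of several sampled negatives.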

Slide 12

Backpropagation

The sign function is not differentiable. Use a continuous version:
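
A common relaxation (an assumption here; the slide's own formula may differ) is tanh, or its hard variant:

$$\mathrm{sign}(x) \approx \tanh(x), \qquad \mathrm{htanh}(x) = \max\big(-1,\, \min(1,\, x)\big),$$

with gradients passed through unchanged wherever $|x| \le 1$ (the straight-through estimator).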

Slide 13

Backpropagation

Normal forward pass. In the backward pass, gradients are applied to the real-valued embedding layers. We can discard those once the model has been estimated.

Slide 14

Predictions

Implemented in C. The baseline is a standard dot product using SIMD intrinsics.
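
The exact kernels are in the linked repository; the following is a minimal sketch of what such a baseline typically looks like, assuming AVX, a dimension that is a multiple of 8, and 32-byte-aligned inputs:

```c
#include <immintrin.h>

/* Baseline: real-valued dot product, 8 floats per AVX register.
   Assumes dim % 8 == 0 and 32-byte-aligned a and b. */
float float_dot(const float *a, const float *b, int dim)
{
    __m256 acc = _mm256_setzero_ps();
    for (int j = 0; j < dim; j += 8) {
        __m256 va = _mm256_load_ps(a + j);
        __m256 vb = _mm256_load_ps(b + j);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    /* Horizontal sum of the eight accumulator lanes. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3]
         + lanes[4] + lanes[5] + lanes[6] + lanes[7];
}
```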

Slide 15

Aside: SIMD

Slide 16

XNOR

● 8-float wide XOR
● 8-float wide negation
● count one bits
● scaling
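
Putting those steps together, a portable sketch of the scaled-XNOR prediction (using 64-bit words and a compiler popcount builtin rather than the 8-float-wide intrinsics listed above; the 1/d scaling is an assumption):

```c
#include <stdint.h>

/* Scaled XNOR dot product over d = 64 * n_words binary dimensions.
   XOR followed by bitwise NOT is the XNOR; popcount counts the
   matching bits; 2 * matches - d recovers the {-1, +1} dot product. */
float xnor_dot(const uint64_t *a, const uint64_t *b, int n_words)
{
    int d = 64 * n_words;
    int matches = 0;
    for (int i = 0; i < n_words; i++)
        matches += __builtin_popcountll(~(a[i] ^ b[i]));
    return (float)(2 * matches - d) / (float)d;
}
```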

Slide 17

Results

Slide 18

Results

Slide 19

Bottom line

Moving from 1024 to 32 dimensions in the continuous model implies a 29 times increase in prediction speed at the expense of a modest 4% decrease in accuracy. Moving from a float representation to a 1024-dimensional binary representation implies a sharper accuracy drop of 6% in exchange for a smaller 20 times increase in prediction speed.

Slide 20

More promising approaches

● Maximum inner product search
● Bloom embeddings!

Slide 21

Thanks!

@Maciej_Kula
Source code: github.com/maciejkula/binge