Billion-scale similarity search with GPUs

Takuya Asano
November 14, 2018

An introduction to "Billion-scale similarity search with GPUs" (https://arxiv.org/abs/1702.08734), the original paper behind Faiss, a library for approximate nearest neighbor search.

Transcript

  1. Billion-scale similarity search with GPUs
    1st Joint Paper Reading Group (2018-11-14)
    Presenter: Takuya Asano, Hatena (id:takuya-a)

  2. Self-introduction
    • Accounts
      • id:takuya-a
      • @takuya_b fka @takuya_a
      • @takuyaa
    • Interests
      • Search, machine learning, NLP, etc.
    • Work
      • Hatena Bookmark (2015-2018)
      • Game-related contract development (2018-)
      • Ad tech (2018-)

  3. Billion-scale similarity search with GPUs
    • Authors
      • Jeff Johnson (Facebook AI Research)
      • Matthijs Douze (Facebook AI Research)
      • Hervé Jégou (Facebook AI Research)
    • URL
      • https://arxiv.org/abs/1702.08734

  4. 1. INTRODUCTION

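  5. Background
    • We want to run similarity search over high-dimensional vectors numbering on the order of one billion (image and video features, for example)
      • Anything works as long as it is a vector
    • The curse of dimensionality
      • As dimensionality grows, exact search methods take about as long as a linear scan
      • They also need a lot of memory, which makes them unsuitable at billion scale
    • Hence the goal: approximate nearest neighbor (ANN) search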

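  6. Idea behind the method
    • Use GPUs, which excel at parallel processing, to speed things up
      • However, GPU memory is limited
    • Encode vectors with product quantization (PQ) [Jégou, TPAMI 2011]
      • Compresses high-dimensional data and cuts memory consumption
      • The approximation also makes processing faster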

  7. Scale of the target datasets
    • SIFT1M [1]
      • 128 dimensions, 1 million vectors
    • SIFT1B [1]
      • 128 dimensions, 1 billion vectors
    • DEEP1B [2]
      • 96 dimensions, 1 billion vectors
    [1] http://corpus-texmex.irisa.fr/
    [2] http://sites.skoltech.ru/compvision/noimi/
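To make the memory pressure concrete, the following sketch estimates the uncompressed size of each dataset. It assumes every component is stored as a 4-byte float32, which is an assumption made here for illustration (the slide lists only dimensions and vector counts).

```python
# Rough uncompressed footprint of each dataset.
# ASSUMPTION: 4 bytes (float32) per component; the slide does not state the storage type.
BYTES_PER_COMPONENT = 4

DATASETS = {
    "SIFT1M": (128, 1_000_000),
    "SIFT1B": (128, 1_000_000_000),
    "DEEP1B": (96, 1_000_000_000),
}


def raw_size_gb(dims, count, bytes_per_component=BYTES_PER_COMPONENT):
    """Uncompressed size in gigabytes (1 GB = 10**9 bytes)."""
    return dims * count * bytes_per_component / 1e9


for name, (dims, count) in DATASETS.items():
    print(f"{name}: {raw_size_gb(dims, count):.1f} GB")
```

Under that assumption the two billion-scale datasets come to hundreds of gigabytes each, far beyond the 16 GB of device memory on a Tesla P100, which is why compression with PQ matters.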

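  8. Applications of similarity search
    • Search over high-dimensional data
      • Images remain high-dimensional even after feature extraction with SIFT, SURF, etc.
      • Distributed representations of words, documents, etc.
    • Clustering
      • Nearest neighbor search is the computational bottleneck of k-means
      • In fact, Faiss also includes a very fast k-means implementation
    • Recommendation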

  9. 2. PROBLEM STATEMENT

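  10. Problem setting for similarity search
    • A query vector $x$
    • A collection of $\ell$ vectors, $[y_i]_{i=0:\ell}$
    • From this collection, find the set $L$ of the $k$ vectors "closest" to $x$:
      $L = k\text{-}\operatorname{argmin}_{i=0:\ell} \|x - y_i\|_2$
    • Similarity search uses the Euclidean distance (L2 norm) as its measure of closeness
    • Only the nearest points are needed, so it is enough to compare squared distances $\|x - y\|_2^2$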
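The problem statement above can be sketched as a brute-force k-argmin in plain Python. The function names `squared_l2` and `k_argmin` are illustrative, not taken from the paper or from Faiss. Squared distances are compared directly, since skipping the square root does not change the ranking.

```python
import heapq


def squared_l2(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_argmin(x, ys, k):
    """Indices of the k vectors in ys closest to x, nearest first."""
    return heapq.nsmallest(k, range(len(ys)), key=lambda i: squared_l2(x, ys[i]))


collection = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
print(k_argmin([1.2, 1.2], collection, k=2))
```

This scans all vectors for every query, which is exactly the cost that becomes prohibitive at billion scale.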

  11. 3. GPU: OVERVIEW AND K-SELECTION
    GPUs and similarity search

  12. Evolution of NVIDIA GPUs
    • Fermi (GF100, GF110, ...)
    • Kepler (GK110, GK104, ...)
    • Maxwell (GM200, GM204, ...)
    • Pascal (GP100, GP104)
    • Volta (GV100) <- New!
    • Turing (TU102, TU104, TU106) <- New!

  13. GPU architecture
    • Example: GP100, a Pascal-architecture GPU
      • Used in the Tesla P100

  14. NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

  15. Structure of GP100
    • GPC (Graphics Processing Cluster) x6
      • TPC (Texture Processing Cluster) x5 per GPC
        • SM (Streaming Multiprocessor) x2 per TPC
    • GigaThread Engine x1
    • Device memory (HBM2, 16 GB on Tesla P100) x1
    • L2 cache (4 MB) x1
    • The design has 60 SMs, but the product specification lists 56
      • To improve yield, 4 are disabled in shipped products
    NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

  16. Structure of an SM (Streaming Multiprocessor)
    • Processing unit x2
      • CUDA Core (FP32) x32 + DP Unit (FP64) x16 + LD/ST Unit x8 + SFU x8
      • Warp scheduler x1
      • Dispatch unit x2
      • Register file x1
    • L1 cache (24 KB) x1
    • Shared memory (64 KB) x1
    NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
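The unit counts on the GP100 and SM slides multiply out as follows. This is plain arithmetic on the figures quoted above; the constant names are made up for this sketch.

```python
# Unit counts as quoted on the GP100 and SM slides.
GPCS = 6
TPCS_PER_GPC = 5
SMS_PER_TPC = 2
PROCESSING_UNITS_PER_SM = 2
FP32_CORES_PER_PROCESSING_UNIT = 32
ENABLED_SMS = 56  # 4 of the designed SMs are disabled in shipped products

designed_sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC
fp32_cores_per_sm = PROCESSING_UNITS_PER_SM * FP32_CORES_PER_PROCESSING_UNIT
enabled_fp32_cores = ENABLED_SMS * fp32_cores_per_sm

print(designed_sms, fp32_cores_per_sm, enabled_fp32_cores)
```

The product of 56 enabled SMs and 64 FP32 cores per SM matches the 3,584 CUDA cores listed for the Tesla P100.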

  17. Threads
    • The smallest unit of execution on a GPU
    • Dispatched to individual execution units such as CUDA cores
    • A GPU can run thousands to tens of thousands of threads at the same time

  18. Thread blocks
    • Up to 1,024 threads per thread block
    • Thread blocks are assigned to SMs, one SM per block
    • The GigaThread Engine handles the assignment of thread blocks to SMs
    Hisa Ando (2017), 『GPUを支える技術』, Gijutsu-Hyohron

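  19. Threads and warps
    • A warp is the unit of execution on the GPU
      • 1 warp = 32 threads
      • The threads of a warp execute together in a single step
    • The GigaThread Engine takes warps out of thread blocks and places them in the warp pool
    • The warp pool can buffer up to 64 warps
    Hisa Ando (2017), 『GPUを支える技術』, Gijutsu-Hyohron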

  20. Warp scheduler
    • Selects one warp from the warp pool and executes it
    • Loads, stores, and similar operations take from a few to a few dozen cycles
      • Other warps are scheduled to run during that wait
      • This maximizes the occupancy of GPU resources
    • Each processing unit has two instruction dispatch units, so two instructions are issued at the same time
    NVIDIA Kepler GK110 white paper: https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf

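  21. GP100 memory system
    • L1 cache
      • One per SM (24 KB)
      • (Stores constants and the like)
      • Use shared memory when data must be written
    • Shared memory
      • One per SM (64 KB)
      • Accessible from threads in the same thread block (and therefore within the same warp)
    Hisa Ando (2017), 『GPUを支える技術』, Gijutsu-Hyohron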

  22. GP100 memory system
    • L2 cache
      • One per GPU (4 MB)
    • Device memory
      • Physically GDDR, HBM2, etc.
      • One per GPU (16 GB)
    Hisa Ando (2017), 『GPUを支える技術』, Gijutsu-Hyohron

  23. GP100 memory system
    • Register file
      • 65,536 entries x 32 bits per SM (256 KB)
      • Partitioned by lane (thread)
        • 2,048 entries per lane (8 KB)
      • Each thread has the bandwidth of its own lane to itself
    Hisa Ando (2017), 『GPUを支える技術』, Gijutsu-Hyohron
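The register file figures on this slide follow from simple arithmetic, sketched below. The constant names are illustrative, and KB is taken to mean 1,024 bytes.

```python
ENTRIES_PER_SM = 65_536
BYTES_PER_ENTRY = 4   # one 32-bit register
LANES = 32            # one lane per thread of a warp

register_file_kb = ENTRIES_PER_SM * BYTES_PER_ENTRY // 1024
entries_per_lane = ENTRIES_PER_SM // LANES
lane_kb = entries_per_lane * BYTES_PER_ENTRY // 1024

print(register_file_kb, entries_per_lane, lane_kb)
```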

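  24. Shuffle instructions
    • Shuffle instructions let register values be exchanged across lanes
    • They complete in a single step, so they are faster than going through shared memory
    • Sorting can be implemented with these instructions (described later)
    NVIDIA Kepler GK110 white paper: https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf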
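The cross-lane exchange can be modeled on the CPU as follows. This is only a simulation: `shfl_xor` loosely mimics the butterfly pattern of CUDA's `__shfl_xor_sync` intrinsic with a Python list standing in for one register per lane, and `warp_min` is a hypothetical helper showing a 5-step reduction across 32 lanes.

```python
WARP_SIZE = 32


def shfl_xor(registers, lane_mask):
    """Every lane reads the register held by the lane whose index is (own index XOR lane_mask)."""
    return [registers[lane ^ lane_mask] for lane in range(len(registers))]


def warp_min(registers):
    """Butterfly reduction: after log2(32) = 5 exchanges every lane holds the minimum."""
    if len(registers) != WARP_SIZE:
        raise ValueError("expected one value per lane")
    values = list(registers)
    offset = WARP_SIZE // 2
    while offset >= 1:
        partner = shfl_xor(values, offset)
        values = [min(own, other) for own, other in zip(values, partner)]
        offset //= 2
    return values


lanes = [(7 * i + 3) % 32 for i in range(WARP_SIZE)]
print(warp_min(lanes)[:4])
```

No shared memory is involved in the model: each step only reads another lane's register, which is the property the slide highlights.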

  25. 4. FAST K-SELECTION ON THE GPU
    Algorithms that speed up similarity search on the GPU

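  26. Roofline model
    • A model of performance
    • Whether an algorithm can fully exploit the hardware depends on:
      • Memory bandwidth
      • Peak performance
    • Reaching peak performance hinges on whether the memory bandwidth can be saturated
    • -> Use the register file, the fastest storage available
    Roofline model - Wikipedia: https://en.wikipedia.org/wiki/Roofline_model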
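The roofline bound itself is a one-line formula: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The sketch below uses round, hypothetical hardware numbers (10,000 GFLOP/s peak, 500 GB/s bandwidth) that are not taken from the slides or from any specific GPU.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: limited by compute peak or by memory traffic, whichever is lower."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP per byte) above which a kernel becomes compute-bound."""
    return peak_gflops / bandwidth_gb_s


# HYPOTHETICAL hardware figures, chosen only to make the arithmetic easy to follow.
PEAK = 10_000.0      # GFLOP/s
BANDWIDTH = 500.0    # GB/s

print(attainable_gflops(PEAK, BANDWIDTH, 0.25))   # memory-bound kernel
print(attainable_gflops(PEAK, BANDWIDTH, 100.0))  # compute-bound kernel
print(ridge_point(PEAK, BANDWIDTH))
```

k-selection does very little arithmetic per element read, so it sits on the bandwidth-limited side of the ridge; keeping the working set in registers raises the bandwidth term.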

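  27. 4.1 In-register sorting
    • Fast parallel sorting built on the register file and shuffle instructions
    • Needed to select the k vectors closest to the query (k-selection) once the (approximate) distances have been computed
    • Roughly a GPU counterpart of a priority queue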

  28. Odd-size merging and sorting networks
    • Uses a variant of Batcher's bitonic sorting network
      • Works on monotonic sequences rather than bitonic ones
      • Arrays with different numbers of elements can also be merged and sorted
    • See [Batcher68] for bitonic sorting networks
    • Bitonic mergesort
      • https://en.wikipedia.org/wiki/Bitonic_sorter
    • Bitonic sort
      • https://t-pot.com/program/90_BitonicSort/index.html

  29. Bitonic sort
    Bitonic sorter - Wikipedia: https://en.wikipedia.org/wiki/Bitonic_sorter

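For reference, a classic bitonic sort for power-of-two lengths can be written as the following CPU sketch. The inner loop over `i` is what a GPU would run in parallel, one compare-exchange per lane; here it is a plain sequential loop, and the function name is illustrative.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic sorting network."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    size = 2
    while size <= n:              # size of the bitonic sequences being merged
        stride = size // 2
        while stride >= 1:        # distance between compared elements
            for i in range(n):    # every compare-exchange in this loop is independent
                j = i ^ stride
                if j > i:
                    ascending = (i & size) == 0
                    out_of_order = a[i] > a[j] if ascending else a[i] < a[j]
                    if out_of_order:
                        a[i], a[j] = a[j], a[i]
            stride //= 2
        size *= 2
    return a


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

The fixed, data-independent comparison pattern is what makes the network a good fit for lockstep execution within a warp.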

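  30. Odd-size network merging
    MERGE-ODD: an algorithm that takes two arrays, as in the figure, and sorts them
    [Figure: MERGE-ODD network, with the labels "step 1" and "step 2-4" and the annotation "can be executed with shuffle instructions"]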

  31. Odd-size network merging
    SORT-ODD: an algorithm that sorts a single array
    Uses MERGE-ODD (described above)
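The recursion behind MERGE-ODD and SORT-ODD can be sketched on the CPU as below. This is modeled on the idea of treating the left array as if it were padded at its start and the right array as if it were padded at its end, so that arbitrary lengths work. It is not a transcription of the paper's listing: the function names, the `side` argument, and the sequential loops (standing in for parallel compare-swaps) are choices made for this sketch.

```python
def compare_swap(a, i, j):
    """Put the smaller of a[i], a[j] at index i and the larger at index j."""
    if a[i] > a[j]:
        a[i], a[j] = a[j], a[i]


def merge_odd_continue(a, lo, n, side):
    """Finish merging the n elements starting at a[lo]; side says where virtual padding sits."""
    if n <= 1:
        return
    h = 1 << ((n - 1).bit_length() - 1)  # largest power of two strictly below n
    for i in range(n - h):               # independent compare-swaps (parallel on a GPU)
        compare_swap(a, lo + i, lo + i + h)
    if side == "left":
        merge_odd_continue(a, lo, n - h, "left")
        merge_odd_continue(a, lo + n - h, h, "right")
    else:
        merge_odd_continue(a, lo, h, "left")
        merge_odd_continue(a, lo + h, n - h, "right")


def merge_odd(left, right):
    """Merge two already sorted lists of arbitrary lengths."""
    a = list(left) + list(right)
    nl, nr = len(left), len(right)
    for i in range(min(nl, nr)):         # first stage pairs the end of left with the start of right
        compare_swap(a, nl - 1 - i, nl + i)
    merge_odd_continue(a, 0, nl, "left")
    merge_odd_continue(a, nl, nr, "right")
    return a


def sort_odd(values):
    """Sort a list of any length by recursive halving and merge_odd."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    return merge_odd(sort_odd(values[:mid]), sort_odd(values[mid:]))


print(merge_odd([1, 4, 6, 7, 9], [2, 3, 8]))
```

Unlike the bitonic network, neither input needs a power-of-two length, which is what allows a warp queue of length k to be merged with thread-queue data of a different length.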

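  32. 4.2 WarpSelect
    • An algorithm that runs k-selection in parallel on the GPU
    • Uses data structures partitioned by lane
      • Everything is kept in registers
      • At most (k + 32t + 32) elements
    • Thread queue
      • A queue owned by each thread
      • Holds the t smallest values
    • Warp queue
      • A queue shared by the threads of a warp
      • Holds the k smallest values
    [Figure: a queue of length t, labeled from large to small]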

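  33. 4.2 WarpSelect
    Code executed by each lane
    [Figure: code listing with the following annotations]
    • Sort the values held in the thread queues in parallel
    • Merge the sorted thread queues with the warp queue
    • Repack the t values of each thread queue in order with shuffle instructions, so that all 32 threads can execute at the same time
    [Figure: a queue of length t, labeled from large to small]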
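The two-level queue idea can be illustrated with a heavily simplified CPU model. This is not the paper's WarpSelect: it uses Python's `sorted` where the GPU version would use in-register odd-size sorting and merging, it processes elements one at a time instead of 32 per step, and the name `warp_select_sketch` is hypothetical. It only shows how small per-lane buffers filter candidates before an occasional merge into the k-element warp queue.

```python
WARP_SIZE = 32


def warp_select_sketch(values, k, t):
    """Return the k smallest values in ascending order (simplified two-level queue model)."""
    if k < 1 or t < 1:
        raise ValueError("k and t must be positive")
    warp_queue = []                                   # ascending, at most k items
    thread_queues = [[] for _ in range(WARP_SIZE)]    # one small buffer per lane

    def flush():
        nonlocal warp_queue
        pending = [v for queue in thread_queues for v in queue]
        warp_queue = sorted(warp_queue + pending)[:k]  # stand-in for sort + merge in registers
        for queue in thread_queues:
            queue.clear()

    for index, value in enumerate(values):
        lane = index % WARP_SIZE
        # Largest value currently kept; anything at or above it cannot enter the top k.
        threshold = warp_queue[-1] if len(warp_queue) == k else float("inf")
        if value < threshold:
            thread_queues[lane].append(value)
            if len(thread_queues[lane]) == t:          # a full buffer forces a merge
                flush()
    flush()
    return warp_queue


data = [(37 * i) % 101 for i in range(100)]
print(warp_select_sketch(data, k=5, t=2))
```

Most incoming values are rejected by the threshold test alone, so the expensive merge runs rarely; that is the intuition behind keeping a thread queue in front of the warp queue.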

  34. 5. COMPUTATION LAYOUT

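  35. 5.1 Exact search
    • Exact search by brute force
    • In the expansion $\|x_j - y_i\|_2^2 = \|x_j\|^2 + \|y_i\|^2 - 2\langle x_j, y_i \rangle$, the inner product in the third term is the expensive part
      • It is computed for every pair of x and y, so it amounts to the matrix product $XY^\top$
      • The cuBLAS GEMM (GEneral Matrix to Matrix Multiplication) routine is used
    • Practical for small datasets
      • Used later for the coarse quantizer of IVFADC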
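The decomposition into two norm terms and one matrix product can be sketched in plain Python as follows. On the GPU the `matmul_transposed` step is the part that would be handed to a GEMM routine; here it is a naive nested loop, and all function names are illustrative.

```python
def squared_norms(rows):
    """Squared L2 norm of every row."""
    return [sum(v * v for v in row) for row in rows]


def matmul_transposed(xs, ys):
    """Inner products of every query with every database vector (X times Y transposed)."""
    return [[sum(a * b for a, b in zip(x, y)) for y in ys] for x in xs]


def pairwise_squared_l2(xs, ys):
    """Squared distances via the norm terms minus twice the inner products."""
    x_norms = squared_norms(xs)
    y_norms = squared_norms(ys)
    dots = matmul_transposed(xs, ys)
    return [
        [x_norms[j] + y_norms[i] - 2 * dots[j][i] for i in range(len(ys))]
        for j in range(len(xs))
    ]


queries = [[1, 2], [0, 0]]
database = [[1, 2], [3, 4], [0, 1]]
print(pairwise_squared_l2(queries, database))
```

The norm terms cost one pass over each matrix, so nearly all of the work lands in the matrix product, which is why an optimized GEMM dominates the running time.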

  36. 5.2 IVFADC indexing
    • A data structure built on product quantization (PQ)
      • An array of PQ codes, plus an array holding the ID tied to each code
    • The following material explains the details well
      • Yusuke Matsui (2018), "Billion-scale approximate nearest neighbor search"
        http://yusukematsui.me/project/survey_pq/doc/ann_billion_2018.pdf

  37. IVFADC (Inverted File system with Asymmetric Distance Computation)
    • A hash map built by coarse quantization
      • IDs and PQ codes are stored in inverted lists
    • The inverted file part has few elements
      • Scanning it on the CPU costs little
    • Approximate nearest neighbor search runs over the inverted lists selected by coarse quantization
      • Computed in parallel on the GPU (described above)
    [Jegou+11] https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf
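A toy version of the IVFADC structure is sketched below. Everything concrete in it is an assumption made for illustration: the 4-dimensional vectors, the two hand-picked coarse centroids in `COARSE`, and the tiny PQ codebooks in `PQ_BOOKS` (a real index would learn these with k-means). It shows the three ingredients named on the slides: coarse quantization into inverted lists, PQ codes of the residuals, and asymmetric distances computed from lookup tables.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], v))


# HYPOTHETICAL codebooks, hand-picked for this sketch rather than learned.
COARSE = [[0.0] * 4, [10.0] * 4]                # coarse quantizer centroids
PQ_BOOKS = [[[-1.0, -1.0], [1.0, 1.0]],         # sub-quantizer for dimensions 0-1
            [[-1.0, -1.0], [1.0, 1.0]]]         # sub-quantizer for dimensions 2-3
SUB_DIMS = 2


def split(v):
    return [v[i:i + SUB_DIMS] for i in range(0, len(v), SUB_DIMS)]


def build_index(vectors):
    """Inverted lists: coarse centroid id -> list of (vector id, PQ code of the residual)."""
    lists = {c: [] for c in range(len(COARSE))}
    for vid, v in enumerate(vectors):
        c = nearest(COARSE, v)
        residual = [a - b for a, b in zip(v, COARSE[c])]
        code = tuple(nearest(book, sub) for book, sub in zip(PQ_BOOKS, split(residual)))
        lists[c].append((vid, code))
    return lists


def search(lists, query, k, nprobe=1):
    """Asymmetric search: the query stays unquantized, database vectors are PQ codes."""
    probes = sorted(range(len(COARSE)), key=lambda c: sq_dist(COARSE[c], query))[:nprobe]
    candidates = []
    for c in probes:
        residual = [a - b for a, b in zip(query, COARSE[c])]
        # One lookup table per sub-quantizer: distance from the query part to each centroid.
        tables = [[sq_dist(sub, cen) for cen in book]
                  for book, sub in zip(PQ_BOOKS, split(residual))]
        for vid, code in lists[c]:
            approx = sum(tables[s][code[s]] for s in range(len(code)))
            candidates.append((approx, vid))
    return [vid for _, vid in sorted(candidates)[:k]]


vectors = [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, 1.0, 1.0],
           [11.0, 11.0, 9.0, 9.0], [9.0, 9.0, 9.0, 9.0]]
index = build_index(vectors)
print(search(index, [0.9, 1.2, 0.8, 1.1], k=1))
```

Each stored vector costs only its ID plus one small integer per sub-quantizer, and the scan over a list reduces to table lookups and additions, which is the part that parallelizes well.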

  38. 5.3 GPU implementation
    [Jegou+11] https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf

  39. 6. EXPERIMENTS & APPLICATIONS

  40. 6.1 k-selection performance
    • k-selection performance on small data
    • Hardware
      • 2x 2.8GHz Xeon E5-2680v2
      • 4x Maxwell Titan X on CUDA 8.0
    • Parameters
      • Number of queries per batch: nq = 10,000
      • k = 100 or 1,000
    • Libraries compared
      • Truncated Bitonic Sort (TBiS)
      • fgknn

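  41. 6.2 k-means clustering
    • k-means clustering with k = 1 search (each point is assigned to its single nearest centroid)
    • Data: MNIST8m
      • 8.1 million 28 x 28 images (784 dimensions)
    • Compared with BIDMach, which can use GPUs
      • Both use cuBLAS
      • More than 2x faster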

  42. 6.3 Exact nearest neighbor search
    • SIFT1M
      • 1 million image descriptors
      • 128 dimensions
      • nq = 10,000

  43. 6.4 Billion-scale approximate search
    • DEEP1B (1 billion vectors, 96 dimensions, nq = 10,000)
      • [Babenko&Lempitsky16]: R@1 = 0.45, 20 msec
      • Proposed method: R@1 = 0.4517, 0.0133 msec
    • SIFT1M (1 million vectors, 128 dimensions)
      • Compared with the method of [Wieschollek+16]
      • At the same query time (0.02 msec):
        • [Wieschollek+16]: R@1 = 0.51, R@100 = 0.86
        • Proposed method: R@1 = 0.80, R@100 = 0.95
    • SIFT1B (1 billion vectors, 128 dimensions, nq = 10,000)
      • [Wieschollek+16]: R@10 = 0.35, 150 μsec
      • Proposed method: R@10 = 0.376, 17.7 μsec
      • 8.5x faster while reaching comparable recall
    https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors#deep1b
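The speedup figures follow directly from the timings quoted on this slide, taken at face value. The helper name `speedup` is illustrative.

```python
def speedup(baseline_time, proposed_time):
    """How many times faster the proposed method is; both times must share a unit."""
    return baseline_time / proposed_time


sift1b = speedup(150.0, 17.7)     # microseconds, [Wieschollek+16] vs proposed
deep1b = speedup(20.0, 0.0133)    # milliseconds, [Babenko&Lempitsky16] vs proposed

print(f"SIFT1B: {sift1b:.1f}x, DEEP1B: {deep1b:.0f}x")
```

The SIFT1B ratio rounds to the 8.5x stated on the slide.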

  44. References
    • NVIDIA Tesla P100 white paper
      https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
    • NVIDIA Kepler GK110 white paper
      https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf
    • Hisa Ando (2017), 『GPUを支える技術 超並列ハードウェアの快進撃［技術基礎］』, Gijutsu-Hyohron
      http://gihyo.jp/book/2017/978-4-7741-9056-3
    • [Batcher68] K. E. Batcher. 1968. Sorting networks and their applications. In Proceedings of the April 30-May 2, 1968, spring joint computer conference (AFIPS '68 (Spring)). ACM, New York, NY, USA, 307-314. DOI: http://dx.doi.org/10.1145/1468075.1468121
    • [Jegou+11] Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1 (January 2011), 117-128. DOI: https://doi.org/10.1109/TPAMI.2010.57
    • [Wieschollek+16] P. Wieschollek, O. Wang, A. Sorkine-Hornung, and H. P. A. Lensch. Efficient large-scale approximate nearest neighbor search on the GPU. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2027-2035, June 2016.
    • [Babenko&Lempitsky16] A. Babenko and V. Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2055-2063, June 2016.