Billion-scale similarity search with GPUs

Takuya Asano
November 14, 2018

An introduction to "Billion-scale similarity search with GPUs" (https://arxiv.org/abs/1702.08734), the original paper behind the approximate nearest neighbor search library Faiss.


Transcript

  1. Billion-scale similarity search with GPUs • 1st Joint Paper Reading Group (2018-11-14) • Presenter: Takuya Asano, Hatena (id:takuya-a)
  2. About me • Accounts • id:takuya-a • @takuya_b fka @takuya_a • @takuyaa • Interests • Search, machine learning, NLP, etc. • Work • Hatena Bookmark (2015-2018) • Game-related contract development (2018-) • Ad tech (2018-)
  3. Billion-scale similarity search with GPUs • Authors • Jeff Johnson (Facebook AI Research) • Matthijs Douze (Facebook AI Research) • Hervé Jégou (Facebook AI Research) • URL • https://arxiv.org/abs/1702.08734
  4. 1. INTRODUCTION

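  5. Background • We want to run similarity search over on the order of one billion high-dimensional vectors (image or video features, for example) • Any kind of vector will do • The curse of dimensionality • As the dimensionality grows, exact search methods take about as long as a linear scan • They also need a lot of memory, which makes them unsuitable at billion scale • So we want approximate nearest neighbor search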
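  6. Idea of the method • Speed things up with GPUs, which excel at parallel processing • However, GPU memory is limited • Encode the vectors with product quantization (PQ) [Jégou, TPAMI 2011] • Compressing the high-dimensional data reduces memory consumption • The approximation also makes processing faster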
  7. Scale of the target datasets • SIFT1M [1] • 128 dimensions, 1 million vectors • SIFT1B [1] • 128 dimensions, 1 billion vectors • DEEP1B [2] • 96 dimensions, 1 billion vectors • [1] http://corpus-texmex.irisa.fr/ • [2] http://sites.skoltech.ru/compvision/noimi/
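To make the memory pressure concrete, the following is a rough back-of-the-envelope sketch. It assumes every vector component is stored as a 4-byte float32 (an assumption for illustration; the actual dataset files may use a different storage type) and compares the raw size with the 16 GB of device memory that the talk lists for a Tesla P100.

```python
# Back-of-the-envelope size of the uncompressed datasets.
# Assumption: each component is a 4-byte float32.
BYTES_PER_FLOAT32 = 4
GPU_MEMORY_GB = 16  # Tesla P100 device memory, as listed in the talk

datasets = {
    "SIFT1M": (1_000_000, 128),
    "SIFT1B": (1_000_000_000, 128),
    "DEEP1B": (1_000_000_000, 96),
}


def raw_size_gb(num_vectors, dim):
    """Size of the raw vectors in GB (10^9 bytes)."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1e9


sizes = {name: raw_size_gb(n, d) for name, (n, d) in datasets.items()}

for name, gb in sizes.items():
    verdict = "fits" if gb <= GPU_MEMORY_GB else "does not fit"
    print(f"{name}: {gb:,.3f} GB raw, {verdict} in {GPU_MEMORY_GB} GB")
```

Under this assumption only SIFT1M fits on a single GPU uncompressed, which is the motivation for compressing the billion-scale sets with PQ.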
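  8. Applications of similarity search • Searching high-dimensional data • Images remain high-dimensional even after feature extraction such as SIFT or SURF • Distributed representations of words, documents, and so on • Clustering • The bottleneck in k-means is nearest neighbor search • In fact, Faiss also ships a very fast k-means implementation • Recommendation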
  9. 2. PROBLEM STATEMENT

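  10. Problem setting for similarity search • A query vector $x$ • A data set of $\ell$ vectors, $[y_i]_{i=0:\ell}$ • From this set we want to find the subset $L$ of the $k$ vectors that are "close" to $x$: $L = k\text{-}\operatorname{argmin}_{i=0:\ell} \lVert x - y_i \rVert_2$ • In similarity search the Euclidean distance (L2 norm) is used as the measure of closeness • Because we only need the nearest points, not the distances themselves, we may compare by the squared distance $\lVert x - y \rVert_2^2$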
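The problem statement above can be sketched as a brute-force k-selection. This is only a minimal illustration in plain Python, not how Faiss implements it; the helper names `squared_l2` and `k_argmin` and the sample vectors are made up for the example. It ranks by squared distance, which gives the same ordering as the true L2 distance.

```python
import heapq


def squared_l2(x, y):
    """Squared Euclidean distance; skipping the sqrt does not change the ranking."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_argmin(x, ys, k):
    """Indices of the k vectors in ys closest to x, nearest first."""
    return heapq.nsmallest(k, range(len(ys)), key=lambda i: squared_l2(x, ys[i]))


# Hypothetical toy data set of five 2-d vectors.
ys = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (0.5, 0.0), (-2.0, 0.0)]
x = (0.4, 0.1)

nearest = k_argmin(x, ys, k=2)
print(nearest)  # indices of the two closest vectors
```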
  11. 3. GPU: OVERVIEW AND K-SELECTION • On GPUs and similarity search

  12. Evolution of NVIDIA GPUs • Fermi (GF100, GF110, ...) • Kepler (GK110, GK104, ...) • Maxwell (GM200, GM204, ...) • Pascal (GP100, GP104) • Volta (GV100) <- New! • Turing (TU102, TU104, TU106) <- New!
  13. GPU architecture • Using the Pascal-architecture GP100 as the example • It is the chip used in the Tesla P100
  14. NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

  15. Structure of the GP100 • GPC (Graphics Processing Cluster) x6 • TPC (Texture Processing Cluster) x5 per GPC • SM (Streaming Multiprocessor) x2 per TPC • GigaThread engine x1 • Device memory (HBM2, 16 GB @ Tesla P100) x1 • L2 cache (4 MB) x1 • The design has 60 SMs (6 x 5 x 2), but the product spec lists 56 • To improve yield, 4 SMs are disabled when the product ships • NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
  16. Structure of an SM (Streaming Multiprocessor) • Processing Unit x2 • CUDA Core (FP32) x32 + DP Unit (FP64) x16 + LD/ST Unit x8 + SFU x8 • Warp scheduler x1 • Dispatch unit x2 • Register file x1 • L1 cache (24 KB) x1 • Shared memory (64 KB) x1 • NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
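  17. Threads • The smallest unit of execution on a GPU • Dispatched to individual execution units such as CUDA cores • A GPU can run thousands to tens of thousands of threads at the same time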
  18. Thread blocks • Up to 1,024 threads per thread block • Thread blocks are assigned to SMs as whole units • The GigaThread engine assigns thread blocks to SMs • Hisa Ando (2017), 『GPUを支える技術』, 技術評論社
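  19. Threads and warps • A warp is the unit of execution on the GPU • 1 warp = 32 threads • The threads of a warp execute together in a single step • The GigaThread engine takes warps out of a thread block and places them in the warp pool • The warp pool can buffer up to 64 warps • Hisa Ando (2017), 『GPUを支える技術』, 技術評論社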
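  20. Warp scheduler • Picks one warp from the warp pool and executes it • Loads, stores, and similar operations take from a few to a few dozen cycles • The scheduler runs other warps during that waiting time • This maximizes the occupancy of GPU resources • Each Processing Unit has two instruction dispatch units and issues two instructions at the same time • NVIDIA Kepler GK110 white paper: https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf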
  21. GP100 memory system • L1 cache • One per SM (24 KB) • (stores constants and the like) • Use shared memory when data has to be written • Shared memory • One per SM (64 KB) • Accessible from the threads of the same warp (same thread block) • Hisa Ando (2017), 『GPUを支える技術』, 技術評論社
  22. GP100 memory system • L2 cache • One per GPU (4 MB) • Device memory • Physically GDDR, HBM2, etc. • One per GPU (16 GB) • Hisa Ando (2017), 『GPUを支える技術』, 技術評論社
  23. GP100 memory system • Register file • 65,536 entries x 32 bit per SM (256 KB) • Partitioned per lane (thread) • 2,048 entries per lane (8 KB) • Each thread has the bandwidth to its own registers to itself • Hisa Ando (2017), 『GPUを支える技術』, 技術評論社
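The register file figures can be checked with a few lines of arithmetic. This is just a sanity check of the numbers quoted in the talk (65,536 entries of 32 bits per SM, 32 lanes, 56 enabled SMs); the constant names are chosen for this example.

```python
# Sanity check of the GP100 register file numbers quoted in the talk.
ENTRIES_PER_SM = 65_536
BYTES_PER_ENTRY = 4        # one 32-bit register
LANES = 32                 # lanes the register file is partitioned into
ENABLED_SMS = 56           # 60 in the design, 4 disabled at shipping

register_file_kb = ENTRIES_PER_SM * BYTES_PER_ENTRY // 1024
entries_per_lane = ENTRIES_PER_SM // LANES
kb_per_lane = entries_per_lane * BYTES_PER_ENTRY // 1024
total_register_mb = ENABLED_SMS * register_file_kb // 1024

print(register_file_kb, "KB per SM")
print(entries_per_lane, "entries per lane,", kb_per_lane, "KB per lane")
print(total_register_mb, "MB of registers across the chip")
```

Summed over the chip, the register file is several times larger than the 4 MB L2 cache, which is part of why keeping the k-selection state in registers is attractive.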
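  24. Shuffle instructions • Shuffle instructions make it possible to permute register values across lanes • They complete in a single step, so they are faster than going through shared memory • They can be used to implement sorting (described later) • NVIDIA Kepler GK110 white paper: https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf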
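The cross-lane exchange described here can be imitated on the CPU. The sketch below is a plain-Python simulation, loosely modelled on the behaviour of CUDA's `__shfl_xor_sync` intrinsic (each lane reads the register of the lane whose index differs by an XOR mask); `shfl_xor` and `warp_min` are illustrative names, not real GPU code. It uses the butterfly pattern to give every lane the minimum of the warp in five exchange steps.

```python
WARP_SIZE = 32


def shfl_xor(values, lane_mask):
    """Simulated shuffle: lane i reads the register held by lane i XOR lane_mask."""
    return [values[lane ^ lane_mask] for lane in range(WARP_SIZE)]


def warp_min(values):
    """Butterfly reduction: after log2(32) = 5 exchanges every lane holds the minimum."""
    offset = WARP_SIZE // 2
    while offset >= 1:
        partner = shfl_xor(values, offset)
        values = [min(own, other) for own, other in zip(values, partner)]
        offset //= 2
    return values


# One register value per lane (a permutation of 0..31).
registers = [(lane * 7 + 3) % 32 for lane in range(WARP_SIZE)]
result = warp_min(registers)
print(result[0], len(set(result)))
```

On real hardware each exchange is a single instruction that never touches shared memory, which is the property the in-register sorting networks rely on.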
  25. 4. FAST K-SELECTION ON THE GPU • Algorithms that speed up similarity search on the GPU

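  26. Roofline model • A model of performance • Whether an algorithm can fully use the hardware depends on • Memory bandwidth • Peak performance • To reach peak performance, what matters is whether the memory bandwidth can be used to the full • -> Use the register file, the fastest storage available • Roofline model - Wikipedia: https://en.wikipedia.org/wiki/Roofline_model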
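The roofline bound itself is a one-line formula: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The sketch below uses hypothetical round numbers for the peak and bandwidth (they are not measurements of any particular GPU) purely to show the two regimes.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: min(compute peak, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical machine, chosen only for illustration.
PEAK_GFLOPS = 10_000.0
BANDWIDTH_GB_S = 700.0

# Intensity at which the kernel stops being memory bound.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GB_S

memory_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, 0.25)
compute_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, 100.0)

print(f"ridge point: {ridge_point:.1f} FLOP/byte")
print(f"low intensity kernel:  {memory_bound:.0f} GFLOP/s")
print(f"high intensity kernel: {compute_bound:.0f} GFLOP/s")
```

A k-selection pass does very little arithmetic per element, so it sits on the bandwidth-limited side of the ridge; serving it from registers raises the effective bandwidth term.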
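  27. 4.1 In-register sorting • A fast parallel sort built on the register file and shuffle instructions • Needed because, after the (approximate) distances are computed, we select the k vectors closest to the query (k-selection) • Something like a GPU version of a priority queue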
  28. Odd-size merging and sorting networks • Uses a variant of Batcher's bitonic sorting network • The inputs are monotonic rather than bitonic • Arrays with different numbers of elements can also be sorted together • For bitonic sorting networks, see [Batcher68] • Bitonic mergesort • https://en.wikipedia.org/wiki/Bitonic_sorter • Bitonic sort • https://t-pot.com/program/90_BitonicSort/index.html
  29. Bitonic sort Bitonic sorter - Wikipedia: https://en.wikipedia.org/wiki/Bitonic_sorter
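For reference, the classic bitonic sorting network can be written as a short data-independent loop. The sketch below is the textbook power-of-two version that the paper's networks are derived from, not the paper's odd-size variant; `bitonic_sort` is an illustrative name. Every compare-exchange pairs index `i` with `i XOR j`, which is the same partner pattern a warp shuffle provides.

```python
def bitonic_sort(values):
    """Textbook bitonic sorting network; the length must be a power of two."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # distance between compared elements
            for i in range(n):    # on a GPU, these iterations run in parallel
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([5, 1, 4, 2, 8, 7, 3, 6]))
```

The sequence of comparisons does not depend on the data, which is what makes the network suitable for lock-step execution across the lanes of a warp.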

  30. Odd-size network merging • MERGE-ODD: an algorithm that takes two arrays and merges them into sorted order, as shown in the figure (step 1, steps 2-4) • Can be executed with shuffle instructions
  31. Odd-size network merging • SORT-ODD: an algorithm that sorts a single array • Uses MERGE-ODD (described above)

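  32. 4.2 WarpSelect • An algorithm that runs k-selection in parallel on the GPU • Uses data structures that are partitioned per lane • Everything is kept in registers • At most (k + 32t + 32) elements • Thread queue • A queue owned by each thread • Holds the t smallest values • Warp queue • A queue shared within the warp, across its lanes • Holds the k smallest values • [Figure: thread queues of length t, ordered from large to small; warp queue ordered from small to large]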
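  33. 4.2 WarpSelect • Code executed by each lane • [Figure: thread queues of length t, ordered from large to small; warp queue ordered from small to large] • Sort the values held in the thread queues in parallel • Merge the sorted thread queues with the warp queue • So that all 32 threads can run at the same time, the t values of each thread queue are repacked in order using shuffle instructions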
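The interplay between thread queues and the warp queue can be illustrated with a sequential simulation. The sketch below is a deliberately simplified model, not the paper's register-level WarpSelect: it uses Python lists instead of registers, `heapq.nsmallest` instead of the odd-size merge network, and the name `warp_select_sketch` is invented for this example. It keeps the two key ideas: a lane only buffers a value that beats the current k-th best, and the whole warp merges when any lane's queue is full.

```python
import heapq
import math

WARP_SIZE = 32


def warp_select_sketch(values, k, t):
    """Return the k smallest values in ascending order (sequential simulation)."""
    warp_queue = [math.inf] * k                      # best k so far, ascending
    thread_queues = [[] for _ in range(WARP_SIZE)]   # up to t candidates per lane

    def merge():
        nonlocal warp_queue
        candidates = [v for queue in thread_queues for v in queue]
        # Stand-in for the sort + odd-size merge done with shuffles on the GPU.
        warp_queue = heapq.nsmallest(k, warp_queue + candidates)
        for queue in thread_queues:
            queue.clear()

    for start in range(0, len(values), WARP_SIZE):
        # One step: lane j inspects element start + j.
        for lane, value in enumerate(values[start:start + WARP_SIZE]):
            if value < warp_queue[-1]:               # may still belong to the top k
                thread_queues[lane].append(value)
        # If any lane has filled its queue, the whole warp merges together.
        if any(len(queue) >= t for queue in thread_queues):
            merge()

    merge()                                          # flush remaining candidates
    return warp_queue


data = [(i * 7919) % 1009 for i in range(500)]
print(warp_select_sketch(data, k=10, t=4))
```

Most values fail the threshold test once the warp queue has warmed up, so merges become rare; that is the reason the design pays off on long input streams.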
  34. 5. COMPUTATION LAYOUT

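  35. 5.1 Exact search • Exact search by brute force • In the expansion $\lVert x - y \rVert_2^2 = \lVert x \rVert^2 + \lVert y \rVert^2 - 2 \langle x, y \rangle$, the inner product in the third term is the expensive part • Because it is computed for every pair of x and y, it is the same as the matrix product $XY^\top$ • Uses the GEMM (GEneral Matrix to Matrix Multiplication) routine of cuBLAS • Practical for small datasets • Used later for the coarse quantizer of IVFADC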
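The decomposition behind the brute-force search can be sketched in plain Python. This is only an illustration of the algebra, with nested loops standing in for the GEMM call; `pairwise_sq_l2` and the sample vectors are made up for the example.

```python
def pairwise_sq_l2(xs, ys):
    """Squared L2 distances via ||x||^2 + ||y||^2 - 2<x, y>."""
    x_norms = [sum(a * a for a in x) for x in xs]
    y_norms = [sum(b * b for b in y) for y in ys]
    # The matrix of inner products X Y^T: the part a GEMM routine would compute.
    inner = [[sum(a * b for a, b in zip(x, y)) for y in ys] for x in xs]
    return [
        [x_norms[i] + y_norms[j] - 2.0 * inner[i][j] for j in range(len(ys))]
        for i in range(len(xs))
    ]


def direct_sq_l2(x, y):
    """Reference implementation that subtracts component by component."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Hypothetical queries and database vectors.
xs = [(1.0, 2.0), (0.0, -1.0)]
ys = [(1.0, 0.0), (3.0, 2.0), (0.0, 0.0)]

distances = pairwise_sq_l2(xs, ys)
print(distances)
```

The two norm terms cost one pass over each set, so nearly all of the work lands in the inner-product matrix, which is exactly the shape of problem GEMM is tuned for.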
  36. 5.2 IVFADC indexing • A data structure built on product quantization (PQ) • An array of PQ codes, plus an array holding the ID associated with each code • The following material covers the details well • Yusuke Matsui (2018), 『billion-scaleの近似最近傍探索』 http://yusukematsui.me/project/survey_pq/doc/ann_billion_2018.pdf
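A toy version of PQ encoding and asymmetric distance computation (ADC) may help make the data structure concrete. In the sketch below the codebooks are hand-written hypothetical values (real ones are learned with k-means), the vectors are 4-dimensional and split into two sub-vectors, and all function names are invented for the example.

```python
def sq_l2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


# Hypothetical codebooks: 2 sub-quantizers, each with 4 centroids of dimension 2.
CODEBOOKS = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)],
]
SUB_DIM = 2


def split(vector):
    """Cut a vector into one sub-vector per sub-quantizer."""
    return [vector[i * SUB_DIM:(i + 1) * SUB_DIM] for i in range(len(CODEBOOKS))]


def pq_encode(vector):
    """Replace each sub-vector by the index of its nearest centroid."""
    return tuple(
        min(range(len(book)), key=lambda c: sq_l2(sub, book[c]))
        for sub, book in zip(split(vector), CODEBOOKS)
    )


def adc_table(query):
    """Squared distances from each query sub-vector to every centroid."""
    return [[sq_l2(sub, c) for c in book] for sub, book in zip(split(query), CODEBOOKS)]


def adc_distance(table, code):
    """Approximate distance: one table lookup per sub-quantizer, then a sum."""
    return sum(table[j][c] for j, c in enumerate(code))


ids = [10, 11, 12]
vectors = [(0.1, 0.9, 1.8, 0.2), (0.9, 0.1, 0.1, 1.9), (1.0, 1.0, 2.0, 2.0)]
codes = [pq_encode(v) for v in vectors]          # the array of PQ codes

table = adc_table((0.0, 1.0, 2.0, 0.0))          # the query itself is not quantized
approx = {i: adc_distance(table, c) for i, c in zip(ids, codes)}
print(codes, approx)
```

The search is "asymmetric" because only the database side is quantized; the table is built once per query and then each stored code costs just a couple of lookups.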
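  37. IVFADC (Inverted File system with Asymmetric Distance Computation) • A hash map keyed by the coarse quantizer • The inverted lists store IDs and PQ codes • The inverted file part has few entries • Scanning it on the CPU is cheap • Approximate nearest neighbor search is run over the inverted lists selected by the coarse quantizer • Computed in parallel on the GPU (described above) • [Jegou+11] https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf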
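The inverted-file layout can be sketched as a dictionary from coarse centroid to a list of (ID, code) pairs. The example below is a toy under stated assumptions: the coarse centroids and residual levels are hypothetical hand-picked values, the residual is encoded with one tiny scalar codebook per dimension as a stand-in for a learned PQ, and `nprobe` (the number of lists visited) is a parameter name borrowed for illustration.

```python
def sq_l2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


# Hypothetical coarse centroids and residual codebook (shared by both dimensions).
COARSE_CENTROIDS = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
RESIDUAL_LEVELS = [-1.0, 0.0, 1.0]


def coarse_assign(vector):
    """Index of the nearest coarse centroid (brute force, as in exact search)."""
    return min(range(len(COARSE_CENTROIDS)), key=lambda c: sq_l2(vector, COARSE_CENTROIDS[c]))


def encode_residual(vector, c):
    """Quantize each component of (vector - centroid) to its nearest level."""
    return tuple(
        min(range(len(RESIDUAL_LEVELS)), key=lambda lv: (x - m - RESIDUAL_LEVELS[lv]) ** 2)
        for x, m in zip(vector, COARSE_CENTROIDS[c])
    )


def build_index(vectors):
    """Inverted lists: coarse centroid index -> [(vector id, residual code), ...]."""
    lists = {c: [] for c in range(len(COARSE_CENTROIDS))}
    for vid, vector in enumerate(vectors):
        c = coarse_assign(vector)
        lists[c].append((vid, encode_residual(vector, c)))
    return lists


def search(lists, query, k, nprobe):
    """Scan only the nprobe closest lists, scoring codes with lookup tables."""
    order = sorted(range(len(COARSE_CENTROIDS)), key=lambda c: sq_l2(query, COARSE_CENTROIDS[c]))
    scored = []
    for c in order[:nprobe]:
        residual = [q - m for q, m in zip(query, COARSE_CENTROIDS[c])]
        table = [[(r - level) ** 2 for level in RESIDUAL_LEVELS] for r in residual]
        for vid, code in lists[c]:
            scored.append((sum(table[j][lv] for j, lv in enumerate(code)), vid))
    return [vid for _, vid in sorted(scored)[:k]]


vectors = [(0.9, 0.1), (-1.0, 1.1), (10.2, -0.8), (9.0, 0.0), (0.1, 11.0), (0.0, 9.2)]
index = build_index(vectors)
hits = search(index, (10.4, -0.9), k=2, nprobe=1)
print(index, hits)
```

With `nprobe=1` only one of the three lists is scanned, which is where the speed-up over exhaustive search comes from; raising `nprobe` trades time for recall.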
  38. 5.3 GPU implementation [Jegou+11] https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf

  39. 6. EXPERIMENTS & APPLICATIONS

  40. 6.1 k-selection performance • Performance of k-selection on small data • Machine • 2x 2.8GHz Xeon E5-2680v2 • 4x Maxwell Titan X on CUDA 8.0 • Parameters • Number of queries per batch nq = 10,000 • k = 100 or 1,000 • Libraries compared • Truncated Bitonic Sort (TBiS) • fgknn
  41. 6.2 k-means clustering • k-means clustering, whose assignment step is exact search with k = 1 • The data is MNIST8m • 8.1 million 28 x 28 images (784 dimensions) • Compared with BIDMach, which can use GPUs • Both use cuBLAS • More than 2x faster
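The link between k-means and nearest neighbor search is that the assignment step is a 1-nearest-neighbor query of every point against the centroids. The sketch below shows one Lloyd iteration in plain Python; the function names and the toy points are illustrative, and nothing here reflects how Faiss or BIDMach implement it.

```python
def sq_l2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def assign(points, centroids):
    """Assignment step: a k = 1 nearest neighbor search for every point."""
    return [
        min(range(len(centroids)), key=lambda c: sq_l2(point, centroids[c]))
        for point in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)        # keep an empty cluster where it was
            continue
        new_centroids.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]

labels = assign(points, centroids)
centroids = update(points, labels, centroids)
print(labels, centroids)
```

The assignment step touches every (point, centroid) pair, so it dominates the running time, and it is the step a fast exact search accelerates.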
  42. 6.3 Exact nearest neighbor search • SIFT1M • 1 million image descriptors • 128 dimensions • nq = 10,000
  43. 6.4 Billion-scale approximate search • DEEP1B (1 billion vectors, 96 dimensions, nq = 10,000) • [Babenko&Lempitsky16]: R@1 = 0.45, 20 msec • Proposed method: R@1 = 0.4517, 0.0133 msec • SIFT1M (1 million vectors, 128 dimensions) • Compared with the method of [Wieschollek+16] • For the same time budget (0.02 msec) • [Wieschollek+16]: R@1 = 0.51, R@100 = 0.86 • Proposed method: R@1 = 0.80, R@100 = 0.95 • SIFT1B (1 billion vectors, 128 dimensions, nq = 10,000) • [Wieschollek+16]: R@10 = 0.35, 150 μsec • Proposed method: R@10 = 0.376, 17.7 μsec • 8.5x faster while achieving comparable recall • https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors#deep1b
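The R@R figures and the speed-up can be reproduced with a small helper. The sketch below assumes the usual definition of recall@R in this line of work (the fraction of queries whose true nearest neighbor appears among the first R results); the ground truth and result lists are hypothetical, while the timings are the ones quoted on the slide.

```python
def recall_at_r(true_nn, results, r):
    """Fraction of queries whose true nearest neighbor is in the top r results."""
    hits = sum(1 for truth, ranked in zip(true_nn, results) if truth in ranked[:r])
    return hits / len(true_nn)


# Hypothetical ground truth and ranked results for four queries.
true_nn = [3, 7, 1, 4]
results = [[3, 9], [2, 7], [5, 6], [4, 0]]

r_at_1 = recall_at_r(true_nn, results, 1)
r_at_2 = recall_at_r(true_nn, results, 2)

# Speed-ups implied by the per-query timings quoted in the talk.
speedup_sift1b = 150 / 17.7     # microseconds, [Wieschollek+16] vs proposed
speedup_deep1b = 20 / 0.0133    # milliseconds, [Babenko&Lempitsky16] vs proposed

print(r_at_1, r_at_2)
print(f"SIFT1B: {speedup_sift1b:.1f}x, DEEP1B: {speedup_deep1b:.0f}x")
```

The ratios only compare the quoted timings; the hardware behind each number differs, so they should be read as reported speed-ups rather than like-for-like measurements.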
  44. References • NVIDIA Tesla P100 white paper https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf • NVIDIA Kepler GK110 white paper https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf • Hisa Ando (2017), 『GPUを支える技術――超並列ハードウェアの快進撃[技術基礎]』, 技術評論社 http://gihyo.jp/book/2017/978-4-7741-9056-3 • [Batcher68] K. E. Batcher. 1968. Sorting networks and their applications. In Proceedings of the April 30--May 2, 1968, spring joint computer conference (AFIPS '68 (Spring)). ACM, New York, NY, USA, 307-314. DOI=http://dx.doi.org/10.1145/1468075.1468121 • [Jegou+11] Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1 (January 2011), 117-128. DOI: https://doi.org/10.1109/TPAMI.2010.57 • [Wieschollek+16] P. Wieschollek, O. Wang, A. Sorkine-Hornung, and H. P. A. Lensch. Efficient large-scale approximate nearest neighbor search on the GPU. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2027-2035, June 2016. • [Babenko&Lempitsky16] A. Babenko and V. Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2055-2063, June 2016.