
Billion-scale similarity search with GPUs

Takuya Asano
November 14, 2018


An introduction to "Billion-scale similarity search with GPUs" (https://arxiv.org/abs/1702.08734), the paper behind the approximate nearest neighbor search library Faiss.


Transcript

  1. About me • Accounts • id:takuya-a • @takuya_b fka @takuya_a • @takuyaa • Interests • Search, machine learning, NLP, etc. • Work • Hatena Bookmark (2015-2018) • Contract work for games (2018-) • Ad tech (2018-)
  2. Billion-scale similarity search with GPUs • Authors • Jeff Johnson (Facebook AI Research) • Matthijs Douze (Facebook AI Research) • Hervé Jégou (Facebook AI Research) • URL • https://arxiv.org/abs/1702.08734
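  3. Background • Goal: similarity search over high-dimensional vectors on the order of one billion (image features, video features, etc.) • Any kind of vector will do • Curse of dimensionality • As dimensionality grows, exact search methods take about as long as a linear scan • They also need a lot of memory, so they are unsuitable at billion scale • Hence approximate nearest neighbor search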
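  4. Key ideas of the method • Speed things up with GPUs, which excel at parallel processing • However, GPU memory is limited • Encode vectors with product quantization (PQ) [Jégou, TPAMI 2011] • Compressing high-dimensional data reduces memory consumption • The approximation also speeds up processing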
  5. Scale of the target datasets • SIFT1M [1] • 128 dimensions, 1 million vectors • SIFT1B [1] • 128 dimensions, 1 billion vectors • DEEP1B [2] • 96 dimensions, 1 billion vectors
 [1] http://corpus-texmex.irisa.fr/
 [2] http://sites.skoltech.ru/compvision/noimi/
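
To make these sizes concrete, here is a rough back-of-the-envelope sketch. It assumes every vector component is stored as a 4-byte float32; that byte width is an assumption for illustration, since the slide only gives dimensions and vector counts. The names `DATASETS` and `raw_size_gb` are made up for this example.

```python
# Raw (uncompressed) storage per dataset.
# Assumption: each component is a 4-byte float32; the slide does not state this.
DATASETS = {
    "SIFT1M": (128, 1_000_000),
    "SIFT1B": (128, 1_000_000_000),
    "DEEP1B": (96, 1_000_000_000),
}
BYTES_PER_COMPONENT = 4


def raw_size_gb(dims: int, count: int) -> float:
    """Size in GB (10**9 bytes) of `count` uncompressed vectors of `dims` components."""
    return dims * count * BYTES_PER_COMPONENT / 1e9


sizes = {name: raw_size_gb(d, n) for name, (d, n) in DATASETS.items()}
for name, gb in sizes.items():
    print(f"{name}: {gb:,.3f} GB")
```

Under that assumption the two billion-vector datasets come to hundreds of gigabytes, far beyond the 16 GB of device memory quoted for the Tesla P100 later in the deck, which is why compression matters.
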
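  6. Applications of similarity search • Searching high-dimensional data • Images remain high-dimensional even after feature extraction such as SIFT or SURF • Distributed representations of words, documents, etc. • Clustering • The computational bottleneck of k-means is nearest neighbor search • In fact, Faiss also ships a very fast k-means implementation • Recommendation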
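  7. Problem setup for similarity search • A query vector $x$ • A dataset $[y_i]_{i=0:\ell}$ of $\ell$ vectors • From this set, find the subset $L$ of the $k$ vectors "closest" to $x$: $L = k\text{-}\operatorname{argmin}_{i=0:\ell} \|x - y_i\|_2$ • Similarity search uses the Euclidean distance (L2 norm) as the measure of closeness • Since only the nearby points themselves are needed, it is fine to compare the squared distance $\|x - y\|_2^2$ instead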
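
The problem setup above can be sketched as a brute-force search. This is a minimal illustration only, not how Faiss implements it; `squared_l2` and `k_argmin` are hypothetical helper names, and the data points are made up.

```python
import heapq


def squared_l2(x, y):
    """Squared Euclidean distance; it ranks points the same way as the L2 norm."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_argmin(x, ys, k):
    """Indices of the k vectors in ys closest to x, nearest first."""
    return heapq.nsmallest(k, range(len(ys)), key=lambda i: squared_l2(x, ys[i]))


# Made-up toy data: four 2-D vectors and one query.
ys = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.5, 0.0]]
x = [0.0, 0.1]
nearest = k_argmin(x, ys, k=2)
print(nearest)
```

Because the square root is monotonic, skipping it changes none of the comparisons, so the selected set is the same.
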
  8. NVIDIA GPU generations • Fermi (GF100, GF110, ...) • Kepler (GK110, GK104, ...) • Maxwell (GM200, GM204, ...) • Pascal (GP100, GP104) • Volta (GV100) <- New! • Turing (TU102, TU104, TU106) <- New!
  9. GP100 architecture • GPC (Graphics Processing Cluster) x6 • TPC (Texture Processing Cluster) x5 per GPC • SM (Streaming Multiprocessor) x2 per TPC • GigaThread Engine x1 • Device memory (HBM2, 16 GB @ Tesla P100) x1 • L2 cache (4 MB) x1 • The full design has 60 SMs (6 x 5 x 2), but the product spec lists 56 • To improve yield, 4 SMs are disabled when the product ships NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
  10. SM (Streaming Multiprocessor) architecture • Processing Unit x2 • CUDA Core (FP32) x32 + DP Unit (FP64) x16 + LD/ST Unit x8 + SFU x8 • Warp scheduler x1 • Dispatch unit x2 • Register file x1 • L1 cache (24 KB) x1 • Shared memory (64 KB) x1 NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
  11. Thread blocks • A thread block holds at most 1,024 threads • Thread blocks are assigned to SMs as whole units • The GigaThread Engine performs the assignment of thread blocks to SMs Hisa Ando (2017) 『GPUを支える技術』, 技術評論社
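  12. Threads and warps • A warp is the unit of execution on the GPU • 1 warp = 32 threads • The threads of a warp are executed together in a single step • The GigaThread Engine takes warps out of thread blocks and puts them into the warp pool • The warp pool can buffer up to 64 warps Hisa Ando (2017) 『GPUを支える技術』, 技術評論社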
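
The figures on these slides (32 threads per warp, at most 1,024 threads per thread block, up to 64 warps in the warp pool) combine into some simple arithmetic. The sketch below assumes the warp pool is the only limit on how many blocks can be resident at once, which is a simplification; real hardware also limits residency by registers and shared memory. The function names are hypothetical.

```python
WARP_SIZE = 32                 # threads per warp
MAX_THREADS_PER_BLOCK = 1024   # upper bound on threads per thread block
WARP_POOL_CAPACITY = 64        # warps the warp pool can buffer


def warps_per_block(threads_per_block: int) -> int:
    """Number of warps needed to cover one thread block (rounded up)."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be in 1..1024")
    return -(-threads_per_block // WARP_SIZE)  # ceiling division


def blocks_fitting_in_pool(threads_per_block: int) -> int:
    """Blocks whose warps fit in the pool together, ignoring other limits (assumption)."""
    return WARP_POOL_CAPACITY // warps_per_block(threads_per_block)


print(warps_per_block(1024), blocks_fitting_in_pool(1024))
print(warps_per_block(100), blocks_fitting_in_pool(256))
```

Note that a block of 100 threads still occupies 4 warps, so the last warp runs with some of its lanes idle.
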
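  13. Warp scheduler • Picks one warp from the warp pool and executes it • Operations such as loads and stores take from a few to a few tens of cycles • Other warps are scheduled to run during that wait time • This maximizes the occupancy of GPU resources • Each Processing Unit has two instruction dispatch units and executes two instructions at the same time NVIDIA Kepler GK110 white paper:
 https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf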
  14. GP100 memory system • L1 cache • One per SM (24 KB) • (stores constants and the like) • Use shared memory when writes are needed • Shared memory • One per SM (64 KB) • Accessible from threads in the same warp (same thread block) Hisa Ando (2017) 『GPUを支える技術』, 技術評論社
  15. GP100 memory system • L2 cache • One per GPU (4 MB) • Device memory • Physically GDDR, HBM2, etc. • One per GPU (16 GB) Hisa Ando (2017) 『GPUを支える技術』, 技術評論社
  16. GP100 memory system • Register file • 65,536 entries x 32 bit per SM (256 KB) • Partitioned per lane (thread) • 2,048 entries per lane (8 KB) • Each thread gets exclusive use of that memory's bandwidth Hisa Ando (2017) 『GPUを支える技術』, 技術評論社
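  17. Roofline model • A model of performance • Whether an algorithm can make full use of the hardware depends on: • Memory bandwidth • Peak performance • Reaching peak performance hinges on whether the memory bandwidth can be used to the full • -> Use the register file, which is the fastest storage Roofline model - Wikipedia: https://en.wikipedia.org/wiki/Roofline_model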
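
The roofline model is usually written as: attainable performance = min(peak performance, memory bandwidth x arithmetic intensity). A minimal sketch follows; the peak and bandwidth numbers are made-up placeholders, not measurements of any GPU, and `attainable_gflops` is a hypothetical helper name.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline bound: compute-limited or bandwidth-limited, whichever is lower."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


# Hypothetical machine: 10,000 GFLOP/s peak, 700 GB/s memory bandwidth.
PEAK = 10_000.0
BANDWIDTH = 700.0

memory_bound = attainable_gflops(PEAK, BANDWIDTH, 1.0)     # 1 FLOP per byte moved
compute_bound = attainable_gflops(PEAK, BANDWIDTH, 100.0)  # 100 FLOPs per byte moved
ridge_point = PEAK / BANDWIDTH  # intensity at which the two limits meet

print(memory_bound, compute_bound, ridge_point)
```

Kernels with low arithmetic intensity sit on the sloped part of the roof, so their speed is set by how fast data can be fed, which is the motivation for keeping working data in the fastest storage.
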
  18. Odd-size merging and sorting networks • Uses a variant of Batcher's bitonic sorting network • Monotonic rather than bitonic • Can sort even when the arrays have different numbers of elements • For bitonic sorting networks, see [Batcher68] • Bitonic mergesort • https://en.wikipedia.org/wiki/Bitonic_sorter • Bitonic sort • https://t-pot.com/program/90_BitonicSort/index.html
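
For reference, the classic bitonic sorter that the paper's networks are derived from can be sketched as below. This is the textbook power-of-two version, written sequentially; it is not the odd-size variant from the paper, and the function names are chosen for this example.

```python
def compare_exchange(a, i, j, ascending):
    """Swap a[i] and a[j] if they are out of order for the requested direction."""
    if (a[i] > a[j]) == ascending:
        a[i], a[j] = a[j], a[i]


def bitonic_merge(a, lo, n, ascending):
    """Turn the bitonic run a[lo:lo+n] into a monotonic run."""
    if n > 1:
        half = n // 2
        for i in range(lo, lo + half):
            compare_exchange(a, i, i + half, ascending)
        bitonic_merge(a, lo, half, ascending)
        bitonic_merge(a, lo + half, half, ascending)


def bitonic_sort(a, lo=0, n=None, ascending=True):
    """Sort a[lo:lo+n] in place; the length must be a power of two."""
    if n is None:
        n = len(a)
        if n & (n - 1):
            raise ValueError("length must be a power of two")
    if n > 1:
        half = n // 2
        bitonic_sort(a, lo, half, True)           # first half ascending
        bitonic_sort(a, lo + half, half, False)   # second half descending
        bitonic_merge(a, lo, n, ascending)        # merge the bitonic sequence


data = [3, 7, 4, 8, 6, 2, 1, 5]
bitonic_sort(data)
print(data)
```

The sequence of compare-exchange positions does not depend on the data, which is what makes such networks a good fit for lockstep execution on a GPU.
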
  19. Odd-size network merging • MERGE-ODD: an algorithm that takes two arrays, as in the figure, and sorts them • Can be executed with shuffle instructions • [Figure: MERGE-ODD network, step 1 and steps 2-4]
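  20. 4.2 WarpSelect • An algorithm that performs k-selection in parallel on the GPU • Uses data structures partitioned per lane • Everything is kept in registers • At most (k + 32t + 32) elements • Thread queue • A queue owned by each thread • Holds the t smallest values • Warp queue • A queue shared by the lanes of the warp • Holds the k smallest values • [Figure: thread queues of length t ordered large to small; warp queue ordered small to large]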
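  21. 4.2 WarpSelect • Code executed by each lane • The values held in the thread queues are sorted in parallel • The sorted thread queues and the warp queue are merge-sorted • So that all 32 threads can execute at the same time, the t values of each thread queue are repacked into order with shuffle instructions • [Figure: thread queue of length t ordered large to small; warp queue ordered small to large]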
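
The interplay of thread queues and the warp queue can be illustrated with a sequential simulation. This is a deliberately simplified sketch, not the paper's algorithm: it uses ordinary Python lists instead of registers, a library sort instead of the odd-size merging network, and its own made-up rule for when to merge (as soon as any lane's queue holds t values). `warp_select_sim` is a hypothetical name.

```python
import heapq

WARP_SIZE = 32


def warp_select_sim(values, k, t):
    """Return the k smallest values in ascending order (simplified simulation)."""
    warp_queue = [float("inf")] * k                 # k smallest so far, ascending
    thread_queues = [[] for _ in range(WARP_SIZE)]  # one small queue per lane

    def merge():
        # Sort everything buffered in the thread queues, merge it with the
        # warp queue, and keep only the k smallest values.
        pending = sorted(v for q in thread_queues for v in q)
        warp_queue[:] = list(heapq.merge(warp_queue, pending))[:k]
        for q in thread_queues:
            q.clear()

    for start in range(0, len(values), WARP_SIZE):
        threshold = warp_queue[-1]                  # current k-th smallest value
        # Each lane looks at one element per step.
        for lane, v in enumerate(values[start:start + WARP_SIZE]):
            if v < threshold:                       # otherwise it cannot be in the top k
                thread_queues[lane].append(v)
        if any(len(q) == t for q in thread_queues):
            merge()

    merge()                                         # flush what is still buffered
    return [v for v in warp_queue if v != float("inf")]


values = [(i * 37) % 101 for i in range(200)]
smallest = warp_select_sim(values, k=5, t=2)
print(smallest)
```

The point of the threshold test is that most incoming values are rejected with a single comparison, so the more expensive sort-and-merge step runs only occasionally.
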
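  22. 5.1 Exact search • Exact search by brute force • Computing the third term, the inner product, is the expensive part • Because it is computed for every pair of x and y, it amounts to the matrix product $XY^\top$ • Uses the cuBLAS GEMM (GEneral Matrix to Matrix Multiplication) routine • Practical for small datasets • Used later for the coarse quantizer of IVFADC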
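
The formula this slide refers to is not part of the transcript. The "third term" most likely refers to the standard expansion of the squared L2 distance, shown here for a query $x_j$ and a database vector $y_i$ (the indices are notation chosen for this note):

```latex
\|x_j - y_i\|_2^2 \;=\; \|x_j\|_2^2 \;+\; \|y_i\|_2^2 \;-\; 2\,\langle x_j, y_i \rangle
```

If the queries are stacked as the rows of $X$ and the database vectors as the rows of $Y$, then entry $(j, i)$ of $XY^\top$ is exactly $\langle x_j, y_i \rangle$, so all inner products come out of a single matrix multiplication, while the two norm terms need only one pass over each set of vectors.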
  23. 5.2 IVFADC indexing • A data structure built on product quantization (PQ) • An array of PQ codes, plus an array holding the IDs associated with them • The following material covers the details • Yusuke Matsui (2018) 『billion-scaleの近似最近傍探索』
 http://yusukematsui.me/project/survey_pq/doc/ann_billion_2018.pdf
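  24. IVFADC (Inverted File system with Asymmetric Distance Computation) • A hash map built on coarse quantization • Each inverted list stores IDs and PQ codes • The inverted-file part has few elements • Scanning it even on the CPU is cheap • Approximate nearest neighbor search over the inverted lists selected by coarse quantization • Computed in parallel on the GPU (as described earlier) [Jegou+11] https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf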
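
The structure described above can be sketched in a few lines. This is a toy, pure-Python illustration under stated assumptions, not the Faiss implementation: the coarse centroids and PQ codebooks are hand-written placeholders (in practice both are learned with k-means), residuals relative to the coarse centroid are what gets PQ-encoded, and the class and method names are hypothetical.

```python
import heapq


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(v, centroids, n=1):
    """Indices of the n centroids closest to v."""
    return heapq.nsmallest(n, range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))


class ToyIVFADC:
    def __init__(self, coarse, codebooks):
        self.coarse = coarse          # coarse quantizer centroids
        self.codebooks = codebooks    # codebooks[j]: centroids for sub-vector j
        self.m = len(codebooks)
        self.sub = len(coarse[0]) // self.m
        self.lists = {i: [] for i in range(len(coarse))}  # inverted lists

    def _split(self, v):
        return [v[j * self.sub:(j + 1) * self.sub] for j in range(self.m)]

    def add(self, vec_id, v):
        c = nearest(v, self.coarse)[0]
        residual = [a - b for a, b in zip(v, self.coarse[c])]
        code = tuple(nearest(s, cb)[0] for s, cb in zip(self._split(residual), self.codebooks))
        self.lists[c].append((vec_id, code))  # store the ID and its PQ code

    def search(self, x, k, nprobe=1):
        candidates = []
        for c in nearest(x, self.coarse, nprobe):
            residual = [a - b for a, b in zip(x, self.coarse[c])]
            # Asymmetric distance: the query stays uncompressed; one lookup
            # table per sub-vector gives its distance to every codebook entry.
            tables = [[sq_dist(s, cent) for cent in cb]
                      for s, cb in zip(self._split(residual), self.codebooks)]
            for vec_id, code in self.lists[c]:
                d = sum(tables[j][code[j]] for j in range(self.m))
                candidates.append((d, vec_id))
        return [vec_id for _, vec_id in heapq.nsmallest(k, candidates)]


# Hand-written placeholders: 2 coarse cells, 2 sub-quantizers with 3 entries each.
coarse = [[0.0] * 4, [10.0] * 4]
codebooks = [[[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]] * 2
index = ToyIVFADC(coarse, codebooks)
for vec_id, v in enumerate([[0, 0, 0, 0], [1, 1, 1, 1], [10, 10, 10, 10],
                            [9, 9, 11, 11], [-1, -1, 0, 0]]):
    index.add(vec_id, [float(c) for c in v])

hits = index.search([0.9, 1.1, 1.0, 0.8], k=2)
print(hits)
```

Only the inverted lists of the `nprobe` closest coarse cells are scanned, and each stored vector costs `m` table lookups rather than a full distance computation.
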
  25. 6.1 k-selection performance • k-selection performance on small data • Hardware • 2x 2.8GHz Xeon E5-2680v2 • 4x Maxwell Titan X on CUDA 8.0 • Parameters • Queries per batch nq = 10,000 • k = 100 or 1,000 • Libraries compared • Truncated Bitonic Sort (TBiS) • fgknn
  26. 6.2 k-means clustering • k-means clustering with k = 1 • Data: MNIST8m • 8.1 million 28 x 28 images (784 dimensions) • Compared with BIDMach, which can use GPUs • Both use cuBLAS • More than 2x faster
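
The link between k-means and similarity search is the assignment step: each point looks up its single nearest centroid, which is presumably what "k = 1" refers to here. A minimal sketch of Lloyd's iteration with that step made explicit is below; it is plain Python for illustration, not the GPU implementation, and the function names and data are made up.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def assign(points, centroids):
    """Nearest neighbor search with k = 1: the closest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members; keep it if it has none."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        labels = assign(points, centroids)   # the expensive step on real data
        centroids = update(points, labels, centroids)
    return centroids, assign(points, centroids)


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, labels = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(centroids, labels)
```

The assignment step compares every point with every centroid, so it is a batched exact search and can reuse the same matrix-multiplication machinery as brute-force search.
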
  27. 6.4 Billion-scale approximate search • DEEP1B (1 billion vectors, 96 dimensions, nq = 10,000) • [Babenko&Lempitsky16]: R@1 = 0.45, 20 msec • Proposed method: R@1 = 0.4517, 0.0133 msec • SIFT1M (1 million vectors, 128 dimensions) • Compared with the method of [Wieschollek+16] • For the same time (0.02 msec) • [Wieschollek+16]: R@1 = 0.51, R@100 = 0.86 • Proposed method: R@1 = 0.80, R@100 = 0.95 • SIFT1B (1 billion vectors, 128 dimensions, nq = 10,000) • [Wieschollek+16]: R@10 = 0.35, 150 μsec • Proposed method: R@10 = 0.376, 17.7 μsec • 8.5x faster while reaching comparable recall https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors#deep1b
  28. References • NVIDIA Tesla P100 white paper
 https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf • NVIDIA Kepler GK110 white paper
 https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf • Hisa Ando (2017) 『GPUを支える技術――超並列ハードウェアの快進撃［技術基礎］』, 技術評論社
 http://gihyo.jp/book/2017/978-4-7741-9056-3 • [Batcher68] K. E. Batcher. 1968. Sorting networks and their applications. In Proceedings of the April 30--May 2, 1968, spring joint computer conference (AFIPS '68 (Spring)). ACM, New York, NY, USA, 307-314. DOI=http://dx.doi.org/10.1145/1468075.1468121 • [Jegou+11] Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1 (January 2011), 117-128. DOI: https://doi.org/10.1109/TPAMI.2010.57 • [Wieschollek+16] P. Wieschollek, O. Wang, A. Sorkine-Hornung, and H. P. A. Lensch. Efficient large-scale approximate nearest neighbor search on the GPU. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2027-2035, June 2016. • [Babenko&Lempitsky16] A. Babenko and V. Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2055-2063, June 2016.