Slide 1

Slide 1 text

Billion-scale similarity search with GPUs
1st Joint Paper Reading Group (2018-11-14)
Presenter: Takuya Asano, Hatena (id:takuya-a)

Slide 2

Slide 2 text

Self-introduction
• Accounts
  • id:takuya-a
  • @takuya_b fka @takuya_a
  • @takuyaa
• Interests
  • Search, machine learning, NLP, etc.
• Work
  • Hatena Bookmark (2015-2018)
  • Contract work for game-related projects (2018-)
  • Ad tech (2018-)

Slide 3

Slide 3 text

Billion-scale similarity search with GPUs
• Authors
  • Jeff Johnson (Facebook AI Research)
  • Matthijs Douze (Facebook AI Research)
  • Hervé Jégou (Facebook AI Research)
• URL
  • https://arxiv.org/abs/1702.08734

Slide 4

Slide 4 text

1. INTRODUCTION

Slide 5

Slide 5 text

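Background
• We want to run similarity search over on the order of one billion high-dimensional vectors (image or video features, etc.)
  • Any kind of vector will do
• Curse of dimensionality
  • As dimensionality grows, exact search methods end up taking as long as a linear scan
  • They also need a lot of memory, so they are unsuitable at billion scale
• What we want is approximate nearest neighbor search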

Slide 6

Slide 6 text

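Idea of the method
• Speed things up with GPUs, which excel at parallel processing
  • However, GPU memory is limited
• Encode vectors with product quantization (PQ) [Jégou, TPAMI 2011]
  • Compressing high-dimensional data reduces memory consumption
  • The approximation also speeds up processing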
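
To make the compression idea concrete, here is a minimal sketch of PQ encoding in plain Python. The codebooks are toy values chosen for illustration (real PQ learns them, typically with k-means per sub-space), and the names `pq_encode` and `pq_decode` are hypothetical helpers, not part of Faiss.

```python
def squared_l2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def pq_encode(vector, codebooks):
    """Split `vector` into len(codebooks) sub-vectors and return, for each,
    the index of the nearest centroid in the matching codebook."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for j, codebook in enumerate(codebooks):
        sub = vector[j * sub_dim:(j + 1) * sub_dim]
        codes.append(min(range(len(codebook)),
                         key=lambda c: squared_l2(sub, codebook[c])))
    return codes

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    out = []
    for code, codebook in zip(codes, codebooks):
        out.extend(codebook[code])
    return out

# Toy codebooks (assumed values): 2 sub-spaces, 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # sub-space 0
    [[0.0, 1.0], [1.0, 0.0]],   # sub-space 1
]
codes = pq_encode([0.9, 1.2, 0.1, 0.8], codebooks)   # one small integer per sub-space
approx = pq_decode(codes, codebooks)                 # lossy reconstruction
```

Only the small integer codes need to be stored per vector, which is where the memory saving comes from; the price is that distances computed from the codes are approximate.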

Slide 7

Slide 7 text

Scale of the target datasets
• SIFT1M [1]
  • 128 dimensions, 1 million vectors
• SIFT1B [1]
  • 128 dimensions, 1 billion vectors
• DEEP1B [2]
  • 96 dimensions, 1 billion vectors
[1] http://corpus-texmex.irisa.fr/
[2] http://sites.skoltech.ru/compvision/noimi/

Slide 8

Slide 8 text

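Applications of similarity search
• Search over high-dimensional data
  • Images remain high-dimensional even after feature extraction such as SIFT or SURF
  • Distributed representations of words, documents, etc.
• Clustering
  • The bottleneck of k-means is nearest neighbor search
  • In fact, Faiss also implements a very fast k-means
• Recommendation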

Slide 9

Slide 9 text

2. PROBLEM STATEMENT

Slide 10

Slide 10 text

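Problem setting for similarity search
• Query vector $x$
• A data collection of $\ell$ vectors, $[y_i]_{i=0:\ell}$
• From this collection, find the subset $L$ of the $k$ vectors "closest" to $x$:
  $L = k\text{-}\operatorname{argmin}_{i=0:\ell} \|x - y_i\|_2$
• Similarity search uses the Euclidean distance (L2 norm) as the closeness measure
• Since only the nearest points are needed, it is fine to compare squared distances $\|x - y\|_2^2$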
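
The problem statement above maps directly onto a brute-force k-argmin. The following is a minimal sketch in plain Python; the helper name `k_argmin` and the sample data are assumptions made for illustration.

```python
import heapq

def squared_l2(x, y):
    # Squared Euclidean distance; it preserves the ordering of the true L2 distance.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_argmin(x, ys, k):
    """Indices of the k vectors in `ys` closest to `x`, nearest first."""
    return heapq.nsmallest(k, range(len(ys)), key=lambda i: squared_l2(x, ys[i]))

ys = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [0.5, 0.0]]
nearest = k_argmin([0.0, 0.0], ys, 2)   # indices of the two closest vectors
```

Skipping the square root is safe because it is monotonic: the ranking of candidates is identical for $\|x - y\|_2$ and $\|x - y\|_2^2$.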

Slide 11

Slide 11 text

3. GPU: OVERVIEW AND K-SELECTION
About GPUs and similarity search

Slide 12

Slide 12 text

Evolution of NVIDIA GPU architectures
• Fermi (GF100, GF110, ...)
• Kepler (GK110, GK104, ...)
• Maxwell (GM200, GM204, ...)
• Pascal (GP100, GP104)
• Volta (GV100) <- New!
• Turing (TU102, TU104, TU106) <- New!

Slide 13

Slide 13 text

GPU architecture
• Using the Pascal-architecture GP100 as the example
• It is the chip used in the Tesla P100

Slide 14

Slide 14 text

NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

Slide 15

Slide 15 text

Structure of GP100
• GPC (Graphics Processing Cluster) x6
  • TPC (Texture Processing Cluster) x5 per GPC
    • SM (Streaming Multiprocessor) x2 per TPC
• GigaThread Engine x1
• Device memory (HBM2, 16 GB @ Tesla P100) x1
• L2 cache (4 MB) x1
• By design there are 60 SMs (6 x 5 x 2), but the product specification lists 56
  • To improve yield, 4 SMs are disabled when the product ships
NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

Slide 16

Slide 16 text

Structure of an SM (Streaming Multiprocessor)
• Processing Unit x2
  • CUDA Core (FP32) x32 + DP Unit (FP64) x16 + LD/ST Unit x8 + SFU x8
  • Warp scheduler x1
  • Dispatch unit x2
  • Register file x1
• L1 cache (24 KB) x1
• Shared memory (64 KB) x1
NVIDIA Tesla P100 white paper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

Slide 17

Slide 17 text

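Threads
• The smallest unit of execution on a GPU
• Dispatched to individual execution units such as CUDA cores
• A GPU can run thousands to tens of thousands of threads at the same time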

Slide 18

Slide 18 text

Thread blocks
• A thread block holds at most 1,024 threads
• Thread blocks are assigned per SM
• The GigaThread Engine assigns thread blocks to SMs
Hisa Ando (2017), 『GPUを支える技術』, 技術評論社

Slide 19

Slide 19 text

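Threads and warps
• A warp is the unit of execution on the GPU
  • 1 warp = 32 threads
  • The threads of a warp are executed together in a single step
• The GigaThread Engine takes warps out of thread blocks and puts them into the Warp Pool
• The Warp Pool can buffer up to 64 warps
Hisa Ando (2017), 『GPUを支える技術』, 技術評論社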

Slide 20

Slide 20 text

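Warp scheduler
• Picks one warp from the Warp Pool and executes it
• Loads, stores, and similar operations take several to several tens of cycles
  • Scheduling runs other warps during that wait time
  • This maximizes the occupancy of GPU resources
• Each Processing Unit has two instruction dispatch units and issues two instructions at the same time
NVIDIA Kepler GK110 white paper: https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf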

Slide 21

Slide 21 text

GP100 memory system
• L1 cache
  • One per SM (24 KB)
  • (Stores constants and the like)
  • When writes are needed, use shared memory instead
• Shared memory
  • One per SM (64 KB)
  • Accessible from the threads in the same warp (same thread block)
Hisa Ando (2017), 『GPUを支える技術』, 技術評論社

Slide 22

Slide 22 text

GP100 memory system
• L2 cache
  • One per GPU (4 MB)
• Device memory
  • Physically GDDR, HBM2, etc.
  • One per GPU (16 GB)
Hisa Ando (2017), 『GPUを支える技術』, 技術評論社

Slide 23

Slide 23 text

GP100 memory system
• Register file
  • 65,536 entries x 32 bit per SM (256 KB)
  • Partitioned by lane (thread)
    • 2,048 entries (8 KB) per lane
    • Each thread has its lane's memory bandwidth to itself
Hisa Ando (2017), 『GPUを支える技術』, 技術評論社

Slide 24

Slide 24 text

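Shuffle instructions
• Shuffle instructions make it possible to exchange register values across lanes
• They complete in one step, so they are faster than going through shared memory
• Sorting can be built on top of these instructions (described later)
NVIDIA Kepler GK110 white paper: https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf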
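
As a rough illustration of cross-lane exchange, the sketch below models a butterfly (XOR) shuffle in plain Python and uses it for a warp-wide minimum. It is a CPU simulation under the assumption that each list element stands for one lane's register; `shfl_xor` is a hypothetical stand-in for the behavior of the CUDA `__shfl_xor_sync` intrinsic, not a call into any GPU library.

```python
WARP_SIZE = 32

def shfl_xor(registers, lane_mask):
    """Model of a butterfly shuffle: lane i reads the register of lane i ^ lane_mask."""
    return [registers[lane ^ lane_mask] for lane in range(len(registers))]

def warp_min(registers):
    """Minimum over all lanes in log2(32) = 5 exchange steps, without shared memory."""
    values = list(registers)
    offset = WARP_SIZE // 2
    while offset >= 1:
        other = shfl_xor(values, offset)             # every lane reads its partner
        values = [min(a, b) for a, b in zip(values, other)]
        offset //= 2
    return values                                    # every lane ends with the same value

lanes = [(7 * i + 3) % 32 for i in range(WARP_SIZE)]  # a permutation of 0..31
result = warp_min(lanes)
```

The same compare-and-exchange pattern, with a swap instead of `min`, is the building block of the in-register sorting networks discussed in section 4.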

Slide 25

Slide 25 text

4. FAST K-SELECTION ON THE GPU
Algorithms that speed up similarity search on the GPU

Slide 26

Slide 26 text

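Roofline model
• A model of performance
• Whether an algorithm can fully use the hardware depends on:
  • Memory bandwidth
  • Peak performance
• To reach peak performance, what matters is whether the memory bandwidth can be used up
  • -> Use the register file, the fastest storage available
Roofline model - Wikipedia: https://en.wikipedia.org/wiki/Roofline_model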
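
The roofline bound can be written as a one-line formula: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The sketch below evaluates it in Python; the peak and bandwidth figures are made-up example values, not measurements of any particular GPU.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: min(compute peak, memory bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# Hypothetical device: 10,000 GFLOP/s peak compute, 700 GB/s memory bandwidth.
memory_bound = attainable_gflops(10000.0, 700.0, 0.25)    # low intensity: bandwidth-limited
compute_bound = attainable_gflops(10000.0, 700.0, 100.0)  # high intensity: peak-limited
```

k-selection does very little arithmetic per byte read, so it sits on the bandwidth-limited side of the roofline, which is why keeping its working set in the fastest storage matters.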

Slide 27

Slide 27 text

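4.1 In-register sorting
• A fast parallel sort built on the register file and shuffle instructions
• Needed to select the k vectors closest to the query (k-selection) after the (approximate) distances have been computed
• Roughly a GPU version of a priority queue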

Slide 28

Slide 28 text

Odd-size merging and sorting networks
• Uses a variant of Batcher's bitonic sorting network
  • Monotonic rather than bitonic
  • Can also handle arrays with different numbers of elements
• For bitonic sorting networks, see [Batcher68]
  • Bitonic mergesort
    • https://en.wikipedia.org/wiki/Bitonic_sorter
  • Bitonic sort
    • https://t-pot.com/program/90_BitonicSort/index.html

Slide 29

Slide 29 text

Bitonic sort Bitonic sorter - Wikipedia: https://en.wikipedia.org/wiki/Bitonic_sorter
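
For reference, a standard bitonic sorting network can be sketched in a few lines of Python. This is the classic power-of-two version from [Batcher68], not the odd-size variant used in the paper; the function name `bitonic_sort` is chosen here for illustration.

```python
def bitonic_sort(values):
    """Sort with a bitonic network; the input length must be a power of two."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for this block's direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

sorted_values = bitonic_sort([5, 1, 4, 2, 8, 7, 3, 6])
```

The comparison pattern depends only on the indices, never on the data, so all compare-exchanges of one inner step can run in parallel; the `i ^ j` partner is also exactly what a butterfly shuffle provides.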

Slide 30

Slide 30 text

Odd-size network merging
• MERGE-ODD: an algorithm that takes two arrays, as in the figure, and sorts them
• Can be executed with shuffle instructions
[Figure: MERGE-ODD network, step 1 and steps 2-4]

Slide 31

Slide 31 text

Odd-size network merging
• SORT-ODD: an algorithm that sorts a single array
• Uses MERGE-ODD (described above)

Slide 32

Slide 32 text

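4.2 WarpSelect
• An algorithm that runs k-selection in parallel on the GPU
• Uses data structures partitioned by lane
  • Everything is kept in registers
  • At most (k + 32t + 32) elements
• Thread queue
  • A queue owned by each thread
  • Holds the t smallest values
• Warp queue
  • A queue shared across the lanes of a warp
  • Holds the k smallest values
[Figure: thread queue (length t, ordered large to small) and warp queue (ordered small to large)]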

Slide 33

Slide 33 text

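4.2 WarpSelect
• Code executed by each lane
[Figure: WarpSelect pseudocode with a thread queue (length t, ordered large to small) and a warp queue (ordered small to large)]
• Figure annotations:
  • Sort the values held in the thread queues in parallel
  • Merge the sorted thread queues with the warp queue
  • So that all 32 threads can run at once, repack the t values of each thread queue in order using shuffle instructions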
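
The interplay between the thread queues and the warp queue can be imitated on the CPU. The sketch below is a deliberately simplified simulation, not the paper's algorithm: it assumes lane i reads every 32nd input element, keeps unsorted per-lane buffers of capacity t, and uses Python's `sorted` where the GPU version uses the odd-size merging network. The name `warp_select_sketch` is hypothetical.

```python
WARP_SIZE = 32

def warp_select_sketch(values, k, t):
    """Return the k smallest values in ascending order (simplified WarpSelect model)."""
    sentinel = float("inf")
    warp_queue = [sentinel] * k                   # k smallest values seen so far
    thread_queues = [[] for _ in range(WARP_SIZE)]

    def flush():
        # Merge every lane's pending candidates into the warp queue.
        nonlocal warp_queue
        pending = [v for queue in thread_queues for v in queue]
        warp_queue = sorted(warp_queue + pending)[:k]
        for queue in thread_queues:
            queue.clear()

    for start in range(0, len(values), WARP_SIZE):
        # One "step": each lane inspects one element.
        for lane, v in enumerate(values[start:start + WARP_SIZE]):
            if v < warp_queue[-1]:                # can it still enter the top k?
                thread_queues[lane].append(v)
        # If any lane's queue is full, the whole warp merges together.
        if any(len(queue) == t for queue in thread_queues):
            flush()
    flush()
    return [v for v in warp_queue if v != sentinel]

data = [(37 * i + 11) % 101 for i in range(200)]
top5 = warp_select_sketch(data, k=5, t=2)
```

The point of the design is that most elements are rejected by a single comparison against the current k-th smallest value, so the expensive merge runs only occasionally.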

Slide 34

Slide 34 text

5. COMPUTATION LAYOUT

Slide 35

Slide 35 text

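5.1 Exact search
• Exact search by brute force
• Expanding the squared distance as $\|x - y\|_2^2 = \|x\|^2 + \|y\|^2 - 2\langle x, y \rangle$, the third term (the inner product) is the expensive part
  • It is computed for every pair of x and y, so it is the same as the matrix product $XY^T$
  • The cuBLAS GEMM (GEneral Matrix-Matrix multiplication) routine is used
• Practical for small datasets
• Used later for the coarse quantizer of IVFADC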
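
The decomposition of the squared distance into norms and an inner product can be checked with a small plain-Python sketch. The nested list comprehension marked below plays the role of the $XY^T$ product that a GEMM routine would compute; the function name and sample vectors are assumptions for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def pairwise_squared_l2(queries, database):
    """Distance matrix D[j][i] = ||x_j||^2 + ||y_i||^2 - 2 <x_j, y_i>."""
    x_norms = [dot(x, x) for x in queries]
    y_norms = [dot(y, y) for y in database]
    # All pairwise inner products: this is the X Y^T matrix product.
    inner = [[dot(x, y) for y in database] for x in queries]
    return [[x_norms[j] + y_norms[i] - 2.0 * inner[j][i]
             for i in range(len(database))]
            for j in range(len(queries))]

queries = [[1.0, 2.0], [0.0, 0.0]]
database = [[1.0, 0.0], [4.0, 6.0]]
distances = pairwise_squared_l2(queries, database)
```

The norms cost one pass over each set, while the inner products cost one operation per (query, database) pair per dimension, which is why that term dominates and is worth handing to an optimized matrix-multiply routine.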

Slide 36

Slide 36 text

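5.2 IVFADC indexing
• A data structure built on product quantization (PQ)
• An array of PQ codes, plus an array holding the IDs associated with them
• The following material covers the details:
  • Yusuke Matsui (2018), 『billion-scaleの近似最近傍探索』 (Billion-scale approximate nearest neighbor search)
    http://yusukematsui.me/project/survey_pq/doc/ann_billion_2018.pdf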

Slide 37

Slide 37 text

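IVFADC (Inverted File system with Asymmetric Distance Computation)
• A hash map keyed by the coarse quantizer
  • Each inverted list stores IDs and PQ codes
• The inverted-file part has few elements
  • Scanning it on the CPU is cheap
• Approximate neighbor search runs over the inverted lists selected by the coarse quantizer
  • Computed in parallel on the GPU (described earlier)
[Jegou+11] https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf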
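
The index layout and the asymmetric distance computation can be sketched end to end in plain Python. Everything here is a toy under stated assumptions: the coarse centroids and PQ codebooks are hand-picked rather than trained, the PQ codes encode the residual to the coarse centroid, and `build_index`, `search`, and `nprobe` are illustrative names, not the Faiss API.

```python
def squared_l2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(vec, centroids):
    return min(range(len(centroids)), key=lambda c: squared_l2(vec, centroids[c]))

def pq_encode(vec, codebooks):
    sub_dim = len(vec) // len(codebooks)
    return [nearest(vec[j * sub_dim:(j + 1) * sub_dim], cb)
            for j, cb in enumerate(codebooks)]

def build_index(vectors, coarse, codebooks):
    """Inverted lists: coarse cell -> [(vector id, PQ code of the residual)]."""
    lists = {c: [] for c in range(len(coarse))}
    for vec_id, vec in enumerate(vectors):
        cell = nearest(vec, coarse)
        residual = [v - c for v, c in zip(vec, coarse[cell])]
        lists[cell].append((vec_id, pq_encode(residual, codebooks)))
    return lists

def search(query, lists, coarse, codebooks, k, nprobe):
    sub_dim = len(query) // len(codebooks)
    # Coarse step: pick the nprobe closest cells.
    cells = sorted(range(len(coarse)),
                   key=lambda c: squared_l2(query, coarse[c]))[:nprobe]
    candidates = []
    for cell in cells:
        residual = [q - c for q, c in zip(query, coarse[cell])]
        # Asymmetric distance: the query stays uncompressed; one lookup table per sub-space.
        tables = [[squared_l2(residual[j * sub_dim:(j + 1) * sub_dim], centroid)
                   for centroid in cb]
                  for j, cb in enumerate(codebooks)]
        for vec_id, code in lists[cell]:
            dist = sum(tables[j][c] for j, c in enumerate(code))
            candidates.append((dist, vec_id))
    return [vec_id for _, vec_id in sorted(candidates)[:k]]

coarse = [[0.0, 0.0], [10.0, 10.0]]                  # toy coarse centroids
codebooks = [[[-1.0], [1.0]], [[-1.0], [1.0]]]       # toy PQ codebooks, 1-D sub-spaces
vectors = [[1.0, 1.0], [-1.0, 1.0], [9.0, 11.0], [11.0, 11.0]]
index = build_index(vectors, coarse, codebooks)
hits = search([10.8, 10.9], index, coarse, codebooks, k=2, nprobe=1)
```

Only the lists of the probed cells are scanned, and each candidate costs a handful of table lookups instead of a full distance computation; the k-selection over `candidates` is the step that WarpSelect accelerates on the GPU.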

Slide 38

Slide 38 text

5.3 GPU implementation [Jegou+11] https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf

Slide 39

Slide 39 text

6. EXPERIMENTS & APPLICATIONS

Slide 40

Slide 40 text

6.1 k-selection performance
• Performance of k-selection on small data
• Machine environment
  • 2x 2.8GHz Xeon E5-2680v2
  • 4x Maxwell Titan X on CUDA 8.0
• Parameters
  • Number of queries per batch nq = 10,000
  • k = 100 or 1,000
• Libraries compared
  • Truncated Bitonic Sort (TBiS)
  • fgknn

Slide 41

Slide 41 text

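6.2 k-means clustering
• k-means clustering, which uses nearest neighbor search with k = 1 to assign points to centroids
• The data is MNIST8m
  • 8.1 million 28 x 28 images (784 dimensions)
• Compared with BIDMach, which can also use GPUs
  • Both use cuBLAS
  • More than 2x faster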
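
The connection between k-means and nearest neighbor search is the assignment step, which is a k = 1 search of every point against the centroids. A minimal Lloyd-style sketch in plain Python follows; it is not the Faiss implementation, and the data and initial centroids are made-up values.

```python
def squared_l2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, centroids, iterations):
    """Plain Lloyd iterations; returns (centroids, assignment from the last iteration)."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: a k = 1 nearest neighbor search against the centroids.
        assignment = [min(range(len(centroids)),
                          key=lambda c: squared_l2(p, centroids[c]))
                      for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
final_centroids, labels = kmeans(points, [[0.0, 0.0], [10.0, 10.0]], iterations=3)
```

With millions of points and many centroids, the assignment step is the exact search of section 5.1, so a faster GEMM-plus-k-selection pipeline translates directly into faster clustering.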

Slide 42

Slide 42 text

6.3 Exact nearest neighbor search
• SIFT1M
  • 1 million vectors
  • 128 dimensions
• nq = 10,000

Slide 43

Slide 43 text

6.4 Billion-scale approximate search
• DEEP1B (1 billion vectors, 96 dimensions, nq = 10,000)
  • [Babenko&Lempitsky16]: R@1 = 0.45, 20 msec
  • Proposed method: R@1 = 0.4517, 0.0133 msec
• SIFT1M (1 million vectors, 128 dimensions)
  • Compared with the method of [Wieschollek+16]
  • At the same query time (0.02 msec):
    • [Wieschollek+16]: R@1 = 0.51, R@100 = 0.86
    • Proposed method: R@1 = 0.80, R@100 = 0.95
• SIFT1B (1 billion vectors, 128 dimensions, nq = 10,000)
  • [Wieschollek+16]: R@10 = 0.35, 150 μsec
  • Proposed method: R@10 = 0.376, 17.7 μsec
  • 8.5x faster while achieving comparable recall
https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors#deep1b
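
For readers unfamiliar with the R@R notation, the sketch below computes it under the usual definition in the PQ literature (the fraction of queries whose true nearest neighbor appears among the first R results) and rechecks the SIFT1B speedup from the timings quoted above. The ground truth and result lists are made-up sample data.

```python
def recall_at_r(true_nearest, results, r):
    """Fraction of queries whose true nearest neighbor appears in the first r results."""
    hits = sum(1 for truth, ranked in zip(true_nearest, results) if truth in ranked[:r])
    return hits / len(true_nearest)

# Hypothetical ground truth and ranked results for four queries.
true_nearest = [4, 7, 1, 9]
results = [[4, 2, 8], [3, 7, 5], [6, 0, 2], [9, 1, 4]]
r1 = recall_at_r(true_nearest, results, 1)
r3 = recall_at_r(true_nearest, results, 3)

# Speedup on SIFT1B from the per-query times on the slide (150 vs 17.7 microseconds).
speedup_sift1b = 150 / 17.7
```

Recall and time have to be read together: a method can always be made faster by searching fewer inverted lists, so the comparisons above fix one of the two and report the other.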

Slide 44

Slide 44 text

References
• NVIDIA Tesla P100 white paper
  https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
• NVIDIA Kepler GK110 white paper
  https://www.nvidia.co.jp/content/apac/pdf/tesla/nvidia-kepler-gk110-architecture-whitepaper-jp.pdf
• Hisa Ando (2017), 『GPUを支える技術: 超並列ハードウェアの快進撃［技術基礎］』, 技術評論社
  http://gihyo.jp/book/2017/978-4-7741-9056-3
• [Batcher68] K. E. Batcher. 1968. Sorting networks and their applications. In Proceedings of the April 30-May 2, 1968, spring joint computer conference (AFIPS '68 (Spring)). ACM, New York, NY, USA, 307-314. DOI: http://dx.doi.org/10.1145/1468075.1468121
• [Jegou+11] Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1 (January 2011), 117-128. DOI: https://doi.org/10.1109/TPAMI.2010.57
• [Wieschollek+16] P. Wieschollek, O. Wang, A. Sorkine-Hornung, and H. P. A. Lensch. Efficient large-scale approximate nearest neighbor search on the GPU. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2027-2035, June 2016.
• [Babenko&Lempitsky16] A. Babenko and V. Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2055-2063, June 2016.