Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Online Tracking on GPUs at PANDA

AndiH
May 23, 2014

Online Tracking on GPUs at PANDA

Talk at FIAS Tracking Workshop

AndiH

May 23, 2014
Tweet

More Decks by AndiH

Other Decks in Science

Transcript

  1. Mitglied der Helmholtz-Gemeinschaft 1 5th International Workshop for Future Challenges

    in Tracking and Trigger Concepts, FIAS Frankfurt 13 May 2014, Andreas Herten Online Tracking on GPUs at PANDA
  2. Mitglied der Helmholtz-Gemeinschaft Outline • GPUs & PANDA • Algorithms

    – Hough Transform – Riemann Track Finder – Triplet Finder 2
  3. Mitglied der Helmholtz-Gemeinschaft Graphics Processing Units 3 GPU CPU a1

    → b1 → c1; a2 → b2 → c2; a3 → … a1 → b1 → c1 a2 → b2 → c2 a3 → …
  4. Mitglied der Helmholtz-Gemeinschaft PANDA — Event Reconstruction • Continuous read

    out – Background & signal similar – Novel feature • Event Rate: 2 • 107/s 5 Raw Data Rate: 200 GB/s Disk Storage Space for Offline Analysis: 3 PB/y Reduce by ~1/1000 (Reject background events, save interesting physics events)
  5. Mitglied der Helmholtz-Gemeinschaft PANDA — Event Reconstruction • Continuous read

    out – Background & signal similar – Novel feature • Event Rate: 2 • 107/s 5 Raw Data Rate: 200 GB/s Disk Storage Space for Offline Analysis: 3 PB/y Reduce by ~1/1000 (Reject background events, save interesting physics events) GPUs
  6. Mitglied der Helmholtz-Gemeinschaft Algorithm: Hough Transform • Idea: Transform (x,y)i

    → (α,r)ij, find lines via (α,r) space • Solve rij line equation for – Lots of hits (x,y,ρ)i and – Many αj ∈ [0°,360°) each • Fill histogram • Extract track parameters 7 x y x y Mitglied der Helmholtz-Gemeinschaft Hough Transform — Princip → Bin giv r α
  7. Mitglied der Helmholtz-Gemeinschaft Algorithm: Hough Transform • Idea: Transform (x,y)i

    → (α,r)ij, find lines via (α,r) space • Solve rij line equation for – Lots of hits (x,y,ρ)i and – Many αj ∈ [0°,360°) each • Fill histogram • Extract track parameters 7 rij = cos ↵j · xi + sin ↵j · yi + ⇢i i: ~100 hits/event (STT) j: every 0.2° rij: 180 000 x y x y Mitglied der Helmholtz-Gemeinschaft Hough Transform — Princip → Bin giv r α
  8. ° Angle / 0 20 40 60 80 100 120

    140 160 180 Hough transformed -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0 Entries 2.2356e+08 Mean x 90 Mean y 0.02905 RMS x 51.96 RMS y 0.1063 0 5 10 15 20 25 0 Entries 2.2356e+08 Mean x 90 Mean y 0.02905 RMS x 51.96 RMS y 0.1063 1800 x 1800 Grid PANDA STT+MVD Mitglied der Helmholtz-Gemeinschaft 8 68 (x,y) points r α Algorithm: Hough Transform
  9. ° Angle / 0 20 40 60 80 100 120

    140 160 180 Hough transformed -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0 Entries 2.2356e+08 Mean x 90 Mean y 0.02905 RMS x 51.96 RMS y 0.1063 0 5 10 15 20 25 0 Entries 2.2356e+08 Mean x 90 Mean y 0.02905 RMS x 51.96 RMS y 0.1063 1800 x 1800 Grid PANDA STT+MVD Mitglied der Helmholtz-Gemeinschaft 8 68 (x,y) points r α Algorithm: Hough Transform
  10. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Remarks 9 Thrust (CUDA‘s

    STL) Plain CUDA • Performance: 3 ms/event – Reduced to set of standard routines • Fast (uses Thrust‘s optimized algorithms) • Inflexible (has it‘s limits, hard to customize) – Not yet at performance maximum – No peakfinding included • Even possible? • Adds to time! • Ideas in exploration • Performance: 0.5 ms/event – Built completely for this task • Fitting to this problem • Customizable • A bit more complicated at parts – Simple peakfinder implemented (threshold) • Using: Dynamic Parallelism, Shared Memory Two Implementations
  11. Mitglied der Helmholtz-Gemeinschaft 11 Riemann Track Finder — Method •

    Idea: Don‘t fit lines (in 2D), fit planes (in 3D)! • Create seeds – All possible three hit combinations • Grow seeds to tracks Continuously test next hit if it fits – Use mapping to Riemann paraboloid (+ s-z fit, det. layer) x x x x y z‘ x x x y x x x x y x More on: Seeds; Growing
  12. Mitglied der Helmholtz-Gemeinschaft 11 Riemann Track Finder — Method •

    Idea: Don‘t fit lines (in 2D), fit planes (in 3D)! • Create seeds – All possible three hit combinations • Grow seeds to tracks Continuously test next hit if it fits – Use mapping to Riemann paraboloid (+ s-z fit, det. layer) x x x x y z‘ x x x y x x x x y x More on: Seeds; Growing 1 2
  13. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 1

    2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  14. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 1

    2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  15. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 1

    2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  16. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 21

    11 31 1 2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  17. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 21

    11 31 31 11 41 1 2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  18. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 21

    11 31 31 11 41 31 11 32 1 2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  19. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 21

    11 31 31 11 41 31 11 32 1 2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  20. Mitglied der Helmholtz-Gemeinschaft 13 Riemann Track Finder — GPU Adaptations

    CPU GPU 3 loops to generate seeds serially for (int i = 0; i < hitsInLayerOne.size(); i++) { for (int j = 0; j < hitsInLayerTwo.size(); j++) { for (int k = 0; k < hitsInLayerThree.size(); k++) { /* Triplet Generation */ } } } Needed: Mapping of inherent GPU indexing variable to triplet index int ijk = threadIdx.x + blockIdx.x * blockDim.x; nLayerx = 1 2 ⇣p 8x + 1 1 ⌘ pos ( nLayerx ) = 3 pp 3 p 243x2 1 + 27x 32 / 3 + 1 3 p 3 3 pp 3 p 243x2 1 + 27x 1 1
  21. Mitglied der Helmholtz-Gemeinschaft 13 Riemann Track Finder — GPU Adaptations

    CPU GPU 3 loops to generate seeds serially for (int i = 0; i < hitsInLayerOne.size(); i++) { for (int j = 0; j < hitsInLayerTwo.size(); j++) { for (int k = 0; k < hitsInLayerThree.size(); k++) { /* Triplet Generation */ } } } Needed: Mapping of inherent GPU indexing variable to triplet index int ijk = threadIdx.x + blockIdx.x * blockDim.x; nLayerx = 1 2 ⇣p 8x + 1 1 ⌘ pos ( nLayerx ) = 3 pp 3 p 243x2 1 + 27x 32 / 3 + 1 3 p 3 3 pp 3 p 243x2 1 + 27x 1 1 2 Port of CPU code; parallelism on seed base Only easy computations; e.g. 3x3 matrices
  22. Mitglied der Helmholtz-Gemeinschaft 13 Riemann Track Finder — GPU Adaptations

    CPU GPU → 100 × faster than CPU version: ~0.6 ms/event 3 loops to generate seeds serially for (int i = 0; i < hitsInLayerOne.size(); i++) { for (int j = 0; j < hitsInLayerTwo.size(); j++) { for (int k = 0; k < hitsInLayerThree.size(); k++) { /* Triplet Generation */ } } } Needed: Mapping of inherent GPU indexing variable to triplet index int ijk = threadIdx.x + blockIdx.x * blockDim.x; nLayerx = 1 2 ⇣p 8x + 1 1 ⌘ pos ( nLayerx ) = 3 pp 3 p 243x2 1 + 27x 32 / 3 + 1 3 p 3 3 pp 3 p 243x2 1 + 27x 1 1 2 Port of CPU code; parallelism on seed base Only easy computations; e.g. 3x3 matrices
  23. Mitglied der Helmholtz-Gemeinschaft 15 Triplet Finder • Algorithm specifically designed

    for the PANDA Straw Tube Tracker (STT) • Ported to GPU by Andrew Adinetz – CUDA, Dynamic Parallelism, Thrust – Quality of tracks comparable to CPU http://www.fz-juelich.de/ias/jsc/ Original algorithm by Marius Mertens et al 1.5 m
  24. Mitglied der Helmholtz-Gemeinschaft 16 Triplet Finder • Idea: Use only

    subset of detector as seed – Don‘t use STT isochrones (drift times) – Calculate circle from 3 points (no fit) • Features – Fast & robust algorithm, no t0 – Many tuning possibilities More
  25. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) 17 STT More
  26. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 17 STT More
  27. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 17 STT More
  28. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 17 STT More
  29. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 17 STT More
  30. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 2.Interaction point 17 Interaction Point STT More
  31. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 2.Interaction point • Calculate circle through three points 17 Interaction Point STT More
  32. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 2.Interaction point • Calculate circle through three points → Track Candidate 17 Interaction Point STT More
  33. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Animation 18 Triplet Isochrone

    early Isochrone early & skewed Isochrone close Isochrone late MVD hit Track timed out Track current
  34. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20
  35. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20 Hit
  36. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20 Hit Event
  37. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20 Hit Event
  38. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20 Hit Event Bunch
  39. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20 Hit Event Bunch (N2) → (N)
  40. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations 22 • Sector

    Row testing – After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row) More
  41. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations 22 • Sector

    Row testing – After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row) More
  42. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations 22 • Sector

    Row testing – After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row) More
  43. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations 22 • Sector

    Row testing – After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row) More
  44. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations 22 • Sector

    Row testing – After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row) More
  45. Dynamic Parallelism Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations •

    Compare kernel launch strategies 24 1 thread/bunch Calling kernel 1 thread/bunch Calling kernel Triplet Finder 1 thread/bunch Calling kernel 1 block/bunch Joined kernel 1 block/bunch Joined kernel 1 block/bunch Joined kernel TF Stage #1 TF Stage #2 TF Stage #3 TF Stage #4 1 stream/bunch Combining stream 1 stream/bunch Combining stream 1 stream/bunch Calling stream Joined Kernel Host Streams Triplet Finder Triplet Finder CPU GPU TF Stage #1 TF Stage #2 TF Stage #3 TF Stage #4 TF Stage #1 TF Stage #2 TF Stage #3 TF Stage #4
  46. Tesla K40 Tesla K20X Peak double performance Peak single performance

    GPU Chipset # CUDA Cores Memory size Memory bandwidth 1.46 TFLOPS 1.31 TFLOPS 4.29 TFLOPS 3.95 TFLOPS GK110B GK110 2880 2688 12 GB 6 GB 288 GByte/s 250 GByte/s Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Impact of chipset 26 Source: http://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kepler-Family-Datasheet.pdf
  47. Mitglied der Helmholtz-Gemeinschaft 27 Triplet Finder — Clock Speed /

    GPU Preliminary (in publication) K40 3004 MHz, 745 MHz / 875 MHz K20X 2600 MHz, 732 MHz / 784 MHz Memory Clock Core Clock GPU Boost
  48. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Summary • Best performance:

    20 µs/event – 20⋅10-6 s/event * 2⋅107 event/s 㱺 400 GPUs2014 – PANDA2019: Multi GPU system – (100) GPUs • Optimizations possible & needed – ε needs to be improved – Speed, €: More float less double-cards a la K10 28
  49. Mitglied der Helmholtz-Gemeinschaft Summary • PANDA researches in using GPUs

    as part of online event reconstruction scheme • Algorithms in active evaluation and optimization – Triplet Finder performance-optimized 29
  50. Thank you! Andreas Herten [email protected] Mitglied der Helmholtz-Gemeinschaft Summary •

    PANDA researches in using GPUs as part of online event reconstruction scheme • Algorithms in active evaluation and optimization – Triplet Finder performance-optimized 29
  51. Mitglied der Helmholtz-Gemeinschaft List of Resources Used • #4: Earth

    icon by Francesco Paleari from The Noun Project • #4: Einstein icon by Roman Rusinov from The Noun Project • #6: FAIR vector logo from official FAIR website • #6: FAIR rendering from official website • #11: Flare Gun icon by Jop van der Kroef from The Noun Project • #27: STT event animation by Marius C. Mertens • #35: Graphics cards images by NVIDIA promotion • #35: GPU Specifications – Tesla K20X Specifications: http://www.nvidia.com/content/PDF/kepler/Tesla- K20X-BD-06397-001-v07.pdf – Tesla K40 Specifications: http://www.nvidia.com/content/PDF/kepler/Tesla-K40- Active-Board-Spec-BD-06949-001_v03.pdf – Tesla Familiy Overview: http://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla- Kepler-Family-Datasheet.pdf 30
  52. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * (r, α)1 rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  53. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α (r, α)1 rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  54. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α (r, α)1 rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  55. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α (r, α)1 (r, α)2 rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  56. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  57. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  58. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  59. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  60. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  61. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    → Bin with highest multiplicity gives track parameters * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  62. Mitglied der Helmholtz-Gemeinschaft 33 Riemann Algorithm — Procedure • Create

    triplet of hit points – All possible three hit combinations need to become triplets 1
  63. Mitglied der Helmholtz-Gemeinschaft 33 Riemann Algorithm — Procedure • Create

    triplet of hit points – All possible three hit combinations need to become triplets • Grow triplets to tracks: Continuously test next hit if it fits to triplet track – Use Riemann paraboloid to circle fit track • Test closeness of new hit: good → add hit; bad → dismiss hit • Continue with next hit – Helix fit: arc length s vs. z position 1 2
  64. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) Back
  65. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) Back
  66. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) Back
  67. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) Back
  68. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) Back
  69. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) x Back
  70. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) x Back
  71. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) x Back
  72. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) x Back
  73. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) x Back
  74. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Sector Row

    testing – Thicken track; shrink sector row layer to line – Find intersection 35 Sector-Row Testing Track Sector-Row Track Sector-Row Back
  75. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Kernel Launch Strategies •

    Joined Kernel (JK): slowest – High # registers → low occupancy • Dynamic Parallelism (DP) / Host Streams (HS): comparable performance – Performance • HS faster for small # processed hits, DP faster for > 45000 hits • HS stagnates there, while DP continues rising – Limiting factor • High # of required kernel calls • Kernel launch latency • Memcopy – HS more affected by this, because • More PCI-E transfers (launch configurations for kernels) • Less launch throughput, kernel launch latency gets more important • False dependencies of launched kernels – Single CPU thread handles all CUDA streams (Multi-thread possible, but synchronization overhead too high for good performance) – Grid scheduling done on hardware (Grid Management Unit) (DP: software) » False dependencies when N(streams) > N(device connections)=323.5 36 Back