Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Online Tracking on GPUs at PANDA

AndiH
May 23, 2014

Online Tracking on GPUs at PANDA

Talk at FIAS Tracking Workshop

AndiH

May 23, 2014
Tweet

More Decks by AndiH

Other Decks in Science

Transcript

  1. Mitglied der Helmholtz-Gemeinschaft 1 5th International Workshop for Future Challenges

    in Tracking and Trigger Concepts, FIAS Frankfurt 13 May 2014, Andreas Herten Online Tracking on GPUs at PANDA
  2. Mitglied der Helmholtz-Gemeinschaft Outline • GPUs & PANDA • Algorithms

    – Hough Transform – Riemann Track Finder – Triplet Finder 2
  3. Mitglied der Helmholtz-Gemeinschaft Graphics Processing Units 3 GPU CPU

  4. Mitglied der Helmholtz-Gemeinschaft Graphics Processing Units 3 GPU CPU a1

    → b1 → c1; a2 → b2 → c2; a3 → … a1 → b1 → c1 a2 → b2 → c2 a3 → …
  5. Mitglied der Helmholtz-Gemeinschaft PANDA — The Experiment 4 13 m

    p p Magnet STT MVD
  6. Mitglied der Helmholtz-Gemeinschaft PANDA — Event Reconstruction • Continuous read

    out – Background & signal similar – Novel feature • Event Rate: 2 • 107/s 5 Raw Data Rate: 200 GB/s Disk Storage Space for Offline Analysis: 3 PB/y Reduce by ~1/1000 (Reject background events, save interesting physics events)
  7. Mitglied der Helmholtz-Gemeinschaft PANDA — Event Reconstruction • Continuous read

    out – Background & signal similar – Novel feature • Event Rate: 2 • 107/s 5 Raw Data Rate: 200 GB/s Disk Storage Space for Offline Analysis: 3 PB/y Reduce by ~1/1000 (Reject background events, save interesting physics events) GPUs
  8. Mitglied der Helmholtz-Gemeinschaft ALGORITHMS #1 6 Hough Transform Riemann Track

    Finder Triplet Finder
  9. Mitglied der Helmholtz-Gemeinschaft Algorithm: Hough Transform • Idea: Transform (x,y)i

    → (α,r)ij, find lines via (α,r) space • Solve rij line equation for – Lots of hits (x,y,ρ)i and – Many αj ∈ [0°,360°) each • Fill histogram • Extract track parameters 7 x y x y Mitglied der Helmholtz-Gemeinschaft Hough Transform — Princip → Bin giv r α
  10. Mitglied der Helmholtz-Gemeinschaft Algorithm: Hough Transform • Idea: Transform (x,y)i

    → (α,r)ij, find lines via (α,r) space • Solve rij line equation for – Lots of hits (x,y,ρ)i and – Many αj ∈ [0°,360°) each • Fill histogram • Extract track parameters 7 rij = cos ↵j · xi + sin ↵j · yi + ⇢i i: ~100 hits/event (STT) j: every 0.2° rij: 180 000 x y x y Mitglied der Helmholtz-Gemeinschaft Hough Transform — Princip → Bin giv r α
  11. ° Angle / 0 20 40 60 80 100 120

    140 160 180 Hough transformed -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0 Entries 2.2356e+08 Mean x 90 Mean y 0.02905 RMS x 51.96 RMS y 0.1063 0 5 10 15 20 25 0 Entries 2.2356e+08 Mean x 90 Mean y 0.02905 RMS x 51.96 RMS y 0.1063 1800 x 1800 Grid PANDA STT+MVD Mitglied der Helmholtz-Gemeinschaft 8 68 (x,y) points r α Algorithm: Hough Transform
  12. ° Angle / 0 20 40 60 80 100 120

    140 160 180 Hough transformed -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0 Entries 2.2356e+08 Mean x 90 Mean y 0.02905 RMS x 51.96 RMS y 0.1063 0 5 10 15 20 25 0 Entries 2.2356e+08 Mean x 90 Mean y 0.02905 RMS x 51.96 RMS y 0.1063 1800 x 1800 Grid PANDA STT+MVD Mitglied der Helmholtz-Gemeinschaft 8 68 (x,y) points r α Algorithm: Hough Transform
  13. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Remarks 9 Thrust (CUDA‘s

    STL) Plain CUDA • Performance: 3 ms/event – Reduced to set of standard routines • Fast (uses Thrust‘s optimized algorithms) • Inflexible (has it‘s limits, hard to customize) – Not yet at performance maximum – No peakfinding included • Even possible? • Adds to time! • Ideas in exploration • Performance: 0.5 ms/event – Built completely for this task • Fitting to this problem • Customizable • A bit more complicated at parts – Simple peakfinder implemented (threshold) • Using: Dynamic Parallelism, Shared Memory Two Implementations
  14. Mitglied der Helmholtz-Gemeinschaft 10 ALGORITHMS #2 Hough Transform Riemann Track

    Finder Triplet Finder
  15. Mitglied der Helmholtz-Gemeinschaft 11 Riemann Track Finder — Method •

    Idea: Don‘t fit lines (in 2D), fit planes (in 3D)! • Create seeds – All possible three hit combinations • Grow seeds to tracks Continuously test next hit if it fits – Use mapping to Riemann paraboloid (+ s-z fit, det. layer) x x x x y z‘ x x x y x x x x y x More on: Seeds; Growing
  16. Mitglied der Helmholtz-Gemeinschaft 11 Riemann Track Finder — Method •

    Idea: Don‘t fit lines (in 2D), fit planes (in 3D)! • Create seeds – All possible three hit combinations • Grow seeds to tracks Continuously test next hit if it fits – Use mapping to Riemann paraboloid (+ s-z fit, det. layer) x x x x y z‘ x x x y x x x x y x More on: Seeds; Growing 1 2
  17. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 1

    2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  18. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 1

    2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  19. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 1

    2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  20. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 21

    11 31 1 2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  21. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 21

    11 31 31 11 41 1 2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  22. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 21

    11 31 31 11 41 31 11 32 1 2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  23. Mitglied der Helmholtz-Gemeinschaft 12 1 2 3 4 5 21

    11 31 31 11 41 31 11 32 1 2 3 4 5 Riemann Track Finder — 1 Seeds 1 Layer number Back
  24. Mitglied der Helmholtz-Gemeinschaft 13 Riemann Track Finder — GPU Adaptations

    CPU GPU
  25. Mitglied der Helmholtz-Gemeinschaft 13 Riemann Track Finder — GPU Adaptations

    CPU GPU 3 loops to generate seeds serially for (int i = 0; i < hitsInLayerOne.size(); i++) { for (int j = 0; j < hitsInLayerTwo.size(); j++) { for (int k = 0; k < hitsInLayerThree.size(); k++) { /* Triplet Generation */ } } } Needed: Mapping of inherent GPU indexing variable to triplet index int ijk = threadIdx.x + blockIdx.x * blockDim.x; nLayerx = 1 2 ⇣p 8x + 1 1 ⌘ pos ( nLayerx ) = 3 pp 3 p 243x2 1 + 27x 32 / 3 + 1 3 p 3 3 pp 3 p 243x2 1 + 27x 1 1
  26. Mitglied der Helmholtz-Gemeinschaft 13 Riemann Track Finder — GPU Adaptations

    CPU GPU 3 loops to generate seeds serially for (int i = 0; i < hitsInLayerOne.size(); i++) { for (int j = 0; j < hitsInLayerTwo.size(); j++) { for (int k = 0; k < hitsInLayerThree.size(); k++) { /* Triplet Generation */ } } } Needed: Mapping of inherent GPU indexing variable to triplet index int ijk = threadIdx.x + blockIdx.x * blockDim.x; nLayerx = 1 2 ⇣p 8x + 1 1 ⌘ pos ( nLayerx ) = 3 pp 3 p 243x2 1 + 27x 32 / 3 + 1 3 p 3 3 pp 3 p 243x2 1 + 27x 1 1 2 Port of CPU code; parallelism on seed base Only easy computations; e.g. 3x3 matrices
  27. Mitglied der Helmholtz-Gemeinschaft 13 Riemann Track Finder — GPU Adaptations

    CPU GPU → 100 × faster than CPU version: ~0.6 ms/event 3 loops to generate seeds serially for (int i = 0; i < hitsInLayerOne.size(); i++) { for (int j = 0; j < hitsInLayerTwo.size(); j++) { for (int k = 0; k < hitsInLayerThree.size(); k++) { /* Triplet Generation */ } } } Needed: Mapping of inherent GPU indexing variable to triplet index int ijk = threadIdx.x + blockIdx.x * blockDim.x; nLayerx = 1 2 ⇣p 8x + 1 1 ⌘ pos ( nLayerx ) = 3 pp 3 p 243x2 1 + 27x 32 / 3 + 1 3 p 3 3 pp 3 p 243x2 1 + 27x 1 1 2 Port of CPU code; parallelism on seed base Only easy computations; e.g. 3x3 matrices
  28. Mitglied der Helmholtz-Gemeinschaft 14 ALGORITHMS #3 Hough Transform Riemann Track

    Finder Triplet Finder
  29. Mitglied der Helmholtz-Gemeinschaft 15 Triplet Finder • Algorithm specifically designed

    for the PANDA Straw Tube Tracker (STT) • Ported to GPU by Andrew Adinetz – CUDA, Dynamic Parallelism, Thrust – Quality of tracks comparable to CPU http://www.fz-juelich.de/ias/jsc/ Original algorithm by Marius Mertens et al 1.5 m
  30. Mitglied der Helmholtz-Gemeinschaft 16 Triplet Finder • Idea: Use only

    subset of detector as seed – Don‘t use STT isochrones (drift times) – Calculate circle from 3 points (no fit) • Features – Fast & robust algorithm, no t0 – Many tuning possibilities More
  31. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method 17 STT More

  32. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method 17 STT More

  33. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method 17 STT More

  34. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method 17 STT More

  35. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw 17 STT More
  36. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) 17 STT More
  37. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 17 STT More
  38. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 17 STT More
  39. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 17 STT More
  40. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 17 STT More
  41. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 2.Interaction point 17 Interaction Point STT More
  42. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 2.Interaction point • Calculate circle through three points 17 Interaction Point STT More
  43. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Method • STT hit

    in pivot straw • Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog) • Combine with 1.Second STT pivot-cog virtual hit 2.Interaction point • Calculate circle through three points → Track Candidate 17 Interaction Point STT More
  44. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Animation 18 Triplet Isochrone

    early Isochrone early & skewed Isochrone close Isochrone late MVD hit Track timed out Track current
  45. Mitglied der Helmholtz-Gemeinschaft 19 Triplet Finder — Times

  46. Mitglied der Helmholtz-Gemeinschaft 19 Triplet Finder — Times

  47. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20
  48. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20 Hit
  49. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20 Hit Event
  50. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20 Hit Event
  51. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20 Hit Event Bunch
  52. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Bunching Wrapper

    – Hits from one event have similar timestamp – Combine hits to sets (bunches) which occupy GPU best 20 Hit Event Bunch (N2) → (N)
  53. Mitglied der Helmholtz-Gemeinschaft 21 Triplet Finder — Bunching Performance

  54. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations 22 • Sector

    Row testing – After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row) More
  55. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations 22 • Sector

    Row testing – After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row) More
  56. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations 22 • Sector

    Row testing – After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row) More
  57. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations 22 • Sector

    Row testing – After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row) More
  58. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations 22 • Sector

    Row testing – After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row) More
  59. Mitglied der Helmholtz-Gemeinschaft 23 Triplet Finder — Sector Rows Preliminary

    (in publication)
  60. Dynamic Parallelism Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations •

    Compare kernel launch strategies 24 1 thread/bunch Calling kernel 1 thread/bunch Calling kernel Triplet Finder 1 thread/bunch Calling kernel 1 block/bunch Joined kernel 1 block/bunch Joined kernel 1 block/bunch Joined kernel TF Stage #1 TF Stage #2 TF Stage #3 TF Stage #4 1 stream/bunch Combining stream 1 stream/bunch Combining stream 1 stream/bunch Calling stream Joined Kernel Host Streams Triplet Finder Triplet Finder CPU GPU TF Stage #1 TF Stage #2 TF Stage #3 TF Stage #4 TF Stage #1 TF Stage #2 TF Stage #3 TF Stage #4
  61. Mitglied der Helmholtz-Gemeinschaft 25 Triplet Finder — Kernel Launches Preliminary

    (in publication) Explanation
  62. Tesla K40 Tesla K20X Peak double performance Peak single performance

    GPU Chipset # CUDA Cores Memory size Memory bandwidth 1.46 TFLOPS 1.31 TFLOPS 4.29 TFLOPS 3.95 TFLOPS GK110B GK110 2880 2688 12 GB 6 GB 288 GByte/s 250 GByte/s Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Impact of chipset 26 Source: http://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kepler-Family-Datasheet.pdf
  63. Mitglied der Helmholtz-Gemeinschaft 27 Triplet Finder — Clock Speed /

    GPU Preliminary (in publication) K40 3004 MHz, 745 MHz / 875 MHz K20X 2600 MHz, 732 MHz / 784 MHz Memory Clock Core Clock GPU Boost
  64. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Summary • Best performance:

    20 µs/event – 20⋅10-6 s/event * 2⋅107 event/s 㱺 400 GPUs2014 – PANDA2019: Multi GPU system – (100) GPUs • Optimizations possible & needed – ε needs to be improved – Speed, €: More float less double-cards a la K10 28
  65. Mitglied der Helmholtz-Gemeinschaft Summary • PANDA researches in using GPUs

    as part of online event reconstruction scheme • Algorithms in active evaluation and optimization – Triplet Finder performance-optimized 29
  66. Thank you! Andreas Herten a.herten@fz-juelich.de Mitglied der Helmholtz-Gemeinschaft Summary •

    PANDA researches in using GPUs as part of online event reconstruction scheme • Algorithms in active evaluation and optimization – Triplet Finder performance-optimized 29
  67. Mitglied der Helmholtz-Gemeinschaft List of Resources Used • #4: Earth

    icon by Francesco Paleari from The Noun Project • #4: Einstein icon by Roman Rusinov from The Noun Project • #6: FAIR vector logo from official FAIR website • #6: FAIR rendering from official website • #11: Flare Gun icon by Jop van der Kroef from The Noun Project • #27: STT event animation by Marius C. Mertens • #35: Graphics cards images by NVIDIA promotion • #35: GPU Specifications – Tesla K20X Specifications: http://www.nvidia.com/content/PDF/kepler/Tesla- K20X-BD-06397-001-v07.pdf – Tesla K40 Specifications: http://www.nvidia.com/content/PDF/kepler/Tesla-K40- Active-Board-Spec-BD-06949-001_v03.pdf – Tesla Familiy Overview: http://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla- Kepler-Family-Datasheet.pdf 30
  68. Mitglied der Helmholtz-Gemeinschaft BACKUP 31

  69. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 Back

  70. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    Back
  71. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    Back
  72. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * (r, α)1 rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  73. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α (r, α)1 rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  74. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α (r, α)1 rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  75. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α (r, α)1 (r, α)2 rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  76. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  77. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  78. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  79. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  80. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  81. Mitglied der Helmholtz-Gemeinschaft Hough Transform — Principle 32 x y

    → Bin with highest multiplicity gives track parameters * * r α rij = cos ↵j · xi + sin ↵j · yi + ⇢i Back
  82. Mitglied der Helmholtz-Gemeinschaft 33 Riemann Algorithm — Procedure

  83. Mitglied der Helmholtz-Gemeinschaft 33 Riemann Algorithm — Procedure • Create

    triplet of hit points – All possible three hit combinations need to become triplets 1
  84. Mitglied der Helmholtz-Gemeinschaft 33 Riemann Algorithm — Procedure • Create

    triplet of hit points – All possible three hit combinations need to become triplets • Grow triplets to tracks: Continuously test next hit if it fits to triplet track – Use Riemann paraboloid to circle fit track • Test closeness of new hit: good → add hit; bad → dismiss hit • Continue with next hit – Helix fit: arc length s vs. z position 1 2
  85. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    Back
  86. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ Back
  87. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) Back
  88. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) Back
  89. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) Back
  90. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) Back
  91. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) Back
  92. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) x Back
  93. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) x Back
  94. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) x Back
  95. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) x Back
  96. Mitglied der Helmholtz-Gemeinschaft 34 Riemann Algorithm — 1 Expansion 2

    x x x x y z‘ Expand to z‘ x x x y x Riemann Surface (paraboloid) x Back
  97. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Optimizations • Sector Row

    testing – Thicken track; shrink sector row layer to line – Find intersection 35 Sector-Row Testing Track Sector-Row Track Sector-Row Back
  98. Mitglied der Helmholtz-Gemeinschaft Triplet Finder — Kernel Launch Strategies •

    Joined Kernel (JK): slowest – High # registers → low occupancy • Dynamic Parallelism (DP) / Host Streams (HS): comparable performance – Performance • HS faster for small # processed hits, DP faster for > 45000 hits • HS stagnates there, while DP continues rising – Limiting factor • High # of required kernel calls • Kernel launch latency • Memcopy – HS more affected by this, because • More PCI-E transfers (launch configurations for kernels) • Less launch throughput, kernel launch latency gets more important • False dependencies of launched kernels – Single CPU thread handles all CUDA streams (Multi-thread possible, but synchronization overhead too high for good performance) – Grid scheduling done on hardware (Grid Management Unit) (DP: software) » False dependencies when N(streams) > N(device connections)=323.5 36 Back
  99. Mitglied der Helmholtz-Gemeinschaft 37 Triplet Finder — Host Stream Connections

    Preliminary (in publication)
  100. Mitglied der Helmholtz-Gemeinschaft 38 Triplet Finder — Bunch Sizes Preliminary

    (in publication)