
GPU Implementations of Online Track Finding Algorithms at PANDA

AndiH
March 21, 2014

A 12-minute talk I gave at the 2014 spring meeting of the German Physical Society (DPG) in Frankfurt. The status of my PhD thesis, more or less.

Transcript

  1. Mitglied der Helmholtz-Gemeinschaft
    GPU Implementations of Online Track Finding Algorithms at PANDA
    HK 57.2, DPG-Frühjahrstagung 2014, Frankfurt
    21 March 2014, Andreas Herten (Institut für Kernphysik, Forschungszentrum Jülich) for the PANDA Collaboration


  3. Andreas Herten, DPG Frühjahrstagung 2014, HK 57.2
    PANDA — The Experiment
    13 m
    Magnet
    STT
    MVD


  5. PANDA — Event Reconstruction
    • Triggerless readout
    – Many benchmark channels
    – Background & signal similar
    • Event rate: 2 × 10⁷/s
    Raw data rate: 200 GB/s
    Disk storage space for offline analysis: 3 PB/y
    Reduce by ~1/1000
    (reject background events, save interesting physics events)
    GPUs


  12. PANDA — Tracking, Online Tracking
    (Figure: trigger vs. detector layers — usual HEP experiment vs. PANDA)
    • PANDA: No hardware-based trigger
    • But computationally intensive software trigger
    → Online Tracking


  16. GPUs @ PANDA — Online Tracking
    • Port tracking algorithms to GPU
    – Serial → parallel
    – C++ → CUDA
    • Investigate suitability for online use
    • But also: Find & invent tracking algorithms…
    • Under investigation:
    – Hough Transform
    – Riemann Track Finder
    – Triplet Finder


  18. Algorithm: Hough Transform
    • Idea: Transform (x,y)_i → (α,r)_ij, find lines via (α,r) space
    • Solve the line equation
    r_ij = cos(α_j) · x_i + sin(α_j) · y_i + ρ_i
    for
    – Lots of hits (x,y,ρ)_i and
    – Many α_j ∈ [0°,360°) each
    • Fill histogram
    • Extract track parameters
    i: ~100 hits/event (STT); j: every 0.2° → 180 000 values r_ij per event
    (Figure: Hough Transform — Principle, hits in (x,y) mapped to curves in (α,r))

  19. Algorithm: Hough Transform
    (Figure: Hough-transformed histogram — r vs. angle α on a 1800 × 1800 grid,
    68 (x,y) points, PANDA STT+MVD, 2.2356e+08 entries)


  21. Algorithm: Hough Transform — Two Implementations
    Thrust
    • Performance: 3 ms/event
    – Independent of α granularity
    – Reduced to a set of standard routines
    • Fast (uses Thrust's optimized algorithms)
    • Inflexible (has its limits, hard to customize)
    – No peak finding included
    • Even possible?
    • Adds to the time!
    Plain CUDA
    • Performance: 0.5 ms/event
    – Built completely for this task
    • Adapts to the problem
    • Customizable
    • A bit more complicated in parts
    – Simple peak finder implemented (threshold)
    • Uses Dynamic Parallelism, Shared Memory

  22. Algorithm: Riemann Track Finder
    • Idea: Don't fit lines (in 2D), fit planes (in 3D)!
    • Create seeds
    – All possible three-hit combinations
    • Grow seeds to tracks
    – Continuously test whether the next hit fits
    – Use mapping to Riemann paraboloid
    • Summer student project (J. Timcheck)

  23. Algorithm: Riemann Track Finder
    nLayer(x) = (√(8x + 1) − 1) / 2
    pos(nLayer_x) = ∛(√3 · √(243x² − 1) + 27x) / 3^(2/3)
    + 1 / (3^(1/3) · ∛(√3 · √(243x² − 1) + 27x)) − 1
    int ijk = threadIdx.x + blockIdx.x * blockDim.x;
    // replaces: for () { for () { for () { … } } }
    • GPU optimization: Unrolling loops
    → 100 × faster than the CPU version
    • Time for one event (Tesla K20X): ~0.6 ms

  24. Algorithm: Triplet Finder
    • Idea: Use only a sub-set of the detector as seed
    – Combine 3 hits to a Triplet
    – Calculate circle from 3 Triplets (no fit)
    • Features
    – Tailored for PANDA
    – Fast & robust algorithm, no t₀ needed
    • Ported to GPU together with the NVIDIA Application Lab

  25. Triplet Finder — Time


  31. Triplet Finder — Optimizations
    • Bunching Wrapper
    – Hits from one event have similar timestamps
    – Combine hits into sets (bunches) that best fill up the GPU
    (Figure: hits grouped into events, events into bunches)
    O(N²) → O(N)

  32. Triplet Finder — Bunching Performance

  33. Triplet Finder — Optimizations
    • Compare kernel launch strategies
    – Dynamic Parallelism: 1 thread/bunch, a calling kernel on the GPU
    launches the Triplet Finder stages (TF Stage #1–#4) per bunch
    – Host Streams: 1 stream/bunch, CPU-side streams drive the
    Triplet Finder stages
    – Joined Kernel: 1 block/bunch, all TF stages #1–#4 run in one kernel

  34. Triplet Finder — Kernel Launches
    Preliminary (in publication)

  35. Triplet Finder — Clock Speed / Chipset
    Preliminary (in publication)
    GPU    Memory Clock   Core Clock   GPU Boost
    K40    3004 MHz       745 MHz      875 MHz
    K20X   2600 MHz       732 MHz      784 MHz


  37. Summary
    • Investigated different tracking algorithms
    – Best performance: 20 µs/event
    → Online tracking is a feasible technique for PANDA
    • Multi-GPU system needed – O(100) GPUs
    • Still much optimization necessary (efficiency)
    • Collaboration with NVIDIA Application Lab
    Thank you!
    Andreas Herten
    [email protected]