
GPU Implementations of Online Track Finding Algorithms at PANDA

AndiH
March 21, 2014

A 12-minute talk I gave at the 2014 spring meeting of the German Physical Society (DPG) in Frankfurt. The status of my PhD thesis, more or less.

Transcript

  1. Mitglied der Helmholtz-Gemeinschaft
    GPU Implementations of Online Track Finding Algorithms at PANDA
    HK 57.2, DPG-Frühjahrstagung 2014, Frankfurt
    21 March 2014, Andreas Herten (Institut für Kernphysik, Forschungszentrum Jülich) for the PANDA Collaboration


  3. Andreas Herten, DPG Frühjahrstagung 2014, HK 57.2
    PANDA — The Experiment
    13 m
    Magnet
    STT
    MVD


  5. PANDA — Event Reconstruction
    • Triggerless readout
    – Many benchmark channels
    – Background & signal similar
    • Event rate: 2 × 10⁷/s
    Raw data rate: 200 GB/s
    Disk storage space for offline analysis: 3 PB/y
    Reduce by ~1/1000
    (reject background events, save interesting physics events)
    GPUs


  12. PANDA — Tracking, Online Tracking
    (Figure: trigger vs. detector layers — usual HEP experiment vs. PANDA)
    • PANDA: No hardware-based trigger
    • But computationally intensive software trigger
    → Online Tracking


  16. GPUs @ PANDA — Online Tracking
    • Port tracking algorithms to GPU
    – Serial → parallel
    – C++ → CUDA
    • Investigate suitability for online use
    • But also: Find & invent tracking algorithms…
    • Under investigation:
    – Hough Transform
    – Riemann Track Finder
    – Triplet Finder


  18. Algorithm: Hough Transform
    • Idea: Transform (x,y)_i → (α,r)_ij, find lines via (α,r) space
    • Solve the line equation
    r_ij = cos(α_j) · x_i + sin(α_j) · y_i + ρ_i
    for
    – Lots of hits (x,y,ρ)_i and
    – Many α_j ∈ [0°,360°) each
    • Fill histogram
    • Extract track parameters
    i: ~100 hits/event (STT); j: every 0.2° → 180 000 values r_ij per event
    (Figure: Hough Transform — Principle, hits in (x,y) mapped to curves in (α,r))

  19. Algorithm: Hough Transform
    (Figure: Hough-transformed histogram — r vs. angle α on a 1800 × 1800 grid,
    68 (x,y) points, PANDA STT+MVD, 2.2356e+08 entries)


  21. Algorithm: Hough Transform — Two Implementations
    Thrust
    • Performance: 3 ms/event
    – Independent of α granularity
    – Reduced to a set of standard routines
    • Fast (uses Thrust's optimized algorithms)
    • Inflexible (has its limits, hard to customize)
    – No peak finding included
    • Even possible?
    • Adds to the time!
    Plain CUDA
    • Performance: 0.5 ms/event
    – Built completely for this task
    • Adapts to the problem
    • Customizable
    • A bit more complicated in parts
    – Simple peak finder implemented (threshold)
    • Uses Dynamic Parallelism, Shared Memory

  22. Algorithm: Riemann Track Finder
    • Idea: Don't fit lines (in 2D), fit planes (in 3D)!
    • Create seeds
    – All possible three-hit combinations
    • Grow seeds to tracks
    – Continuously test whether the next hit fits
    – Use mapping to Riemann paraboloid
    • Summer student project (J. Timcheck)

  23. Algorithm: Riemann Track Finder
    nLayer(x) = (√(8x + 1) − 1) / 2
    pos(nLayer_x) = ∛(√3 · √(243x² − 1) + 27x) / 3^(2/3)
    + 1 / (3^(1/3) · ∛(√3 · √(243x² − 1) + 27x)) − 1
    int ijk = threadIdx.x + blockIdx.x * blockDim.x;
    // replaces: for () { for () { for () { … } } }
    • GPU optimization: Unrolling loops
    → 100 × faster than the CPU version
    • Time for one event (Tesla K20X): ~0.6 ms

  24. Algorithm: Triplet Finder
    • Idea: Use only a sub-set of the detector as seed
    – Combine 3 hits to a Triplet
    – Calculate circle from 3 Triplets (no fit)
    • Features
    – Tailored for PANDA
    – Fast & robust algorithm, no t₀ needed
    • Ported to GPU together with the NVIDIA Application Lab

  25. Triplet Finder — Time


  31. Triplet Finder — Optimizations
    • Bunching Wrapper
    – Hits from one event have similar timestamps
    – Combine hits into sets (bunches) that best fill up the GPU
    (Figure: hits grouped into events, events into bunches)
    O(N²) → O(N)

  32. Triplet Finder — Bunching Performance

  33. Triplet Finder — Optimizations
    • Compare kernel launch strategies
    – Dynamic Parallelism: 1 thread/bunch, a calling kernel on the GPU
    launches the Triplet Finder stages (TF Stage #1–#4) per bunch
    – Host Streams: 1 stream/bunch, CPU-side streams drive the
    Triplet Finder stages
    – Joined Kernel: 1 block/bunch, all TF stages #1–#4 run in one kernel

  34. Triplet Finder — Kernel Launches
    Preliminary (in publication)

  35. Triplet Finder — Clock Speed / Chipset
    Preliminary (in publication)
    GPU    Memory Clock   Core Clock   GPU Boost
    K40    3004 MHz       745 MHz      875 MHz
    K20X   2600 MHz       732 MHz      784 MHz


  37. Summary
    • Investigated different tracking algorithms
    – Best performance: 20 µs/event
    → Online tracking is a feasible technique for PANDA
    • Multi-GPU system needed – O(100) GPUs
    • Still much optimization necessary (efficiency)
    • Collaboration with NVIDIA Application Lab
    Thank you!
    Andreas Herten
    [email protected]