Slide 1

A Fast and Scalable Low Dimensional Solver for Charged Particle Dynamics in Large Particle Accelerators Yves Ineichen1,2,3, Andreas Adelmann2, Costas Bekas3, Alessandro Curioni3, Peter Arbenz1 1ETH Zürich Department of Computer Science CH-8092 Zürich, Switzerland 2Paul Scherrer Institut Accelerator Modelling and Advanced Simulations CH-5234 Villigen, Switzerland 3IBM Research-Zurich CH-8803 Rüschlikon, Switzerland 17th June 2012 1 / 51

Slide 2

Physics • Particle accelerators • Applications • Design and Opt. Problem
HPC • “History” of code • Parallelization • Scaling
Optimization • Framework • Algorithms • Results
Complex Problem · Framework · Feasibility of Approach 2 / 51

Slide 3

Highly complex machines 3 / 51

Slide 4

Applications • particle physics • medical treatment • material science • life science 4 / 51

Slide 5

design, commissioning, and operation → non-trivial task [Figure: design space (d1, d2) mapped to objective space (o1, o2)] 5 / 51

Slide 6

Accelerator Principles (acceleration / guiding / focusing) [Figure: particle trajectory inside the beam pipe] 6 / 51

Slide 7

Focusing and guiding magnets 7 / 51

Slide 8

Time Evolution [Two scatter plots of the z-x-y particle distribution (FinPhase3.h5, 10004 entries) at two time steps: step 5 with z ≈ 0.083–0.087 m and a later step with z ≈ 3.084–3.088 m; x and y within ±0.0015 m] 8 / 51

Slide 9

Example SwissFEL @ PSI (machine / complexity / objectives and physical meaning) 9 / 51

Slide 10

SwissFEL¹: Switzerland’s X-ray free-electron laser project at PSI • big project: > 100 people, expensive, 700 m long • 1 Ångström wavelength • to reach this target it is crucial to attain “good” beam properties (e.g. a narrow beam / small phase space volume) ¹ http://www.psi.ch/swissfel/ 10 / 51

Slide 11

SwissFEL¹: Switzerland’s X-ray free-electron laser project at PSI • big project: > 100 people, expensive, 700 m long • 1 Ångström wavelength • to reach this target it is crucial to attain “good” beam properties (e.g. a narrow beam / small phase space volume) Calls for optimization of the injector • several conflicting objectives • key technology: multi-objective optimization ¹ http://www.psi.ch/swissfel/ 10 / 51

Slide 12

A Selection of Objectives • transverse dimension • bunch length • 6D phase space • energy spread • emittance [Figure: transverse beam cross-section, radial extent in x and in y] 11 / 51

Slide 13

A Selection of Objectives • transverse dimension • bunch length • 6D phase space • energy spread • emittance [Figure: longitudinal z-x view indicating the bunch length] 12 / 51

Slide 14

A Selection of Objectives • transverse dimension • bunch length • 6D phase space • energy spread • emittance [Figure: x-px phase space] 13 / 51

Slide 15

A Selection of Objectives • transverse dimension • bunch length • 6D phase space • energy spread • emittance [Figure: dE/E (%) energy distribution in the beam] small spread → particles have similar energy 14 / 51

Slide 16

Emittance: volume of phase space. Small emittance: • particles confined to a small volume in configuration space • particles have a small spread in momenta 15 / 51
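For reference, emittance is usually quantified via the rms phase-space area; the definition below is the standard one and is not stated on the slide itself:

```latex
\varepsilon_x \;=\; \sqrt{\;\langle x^2\rangle\,\langle p_x^2\rangle \;-\; \langle x\,p_x\rangle^2\;}
```

A small ε_x therefore means the bunch is both narrow in x and has little spread in p_x, matching the two bullet points above.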

Slide 17

Design Variables Lots of elements with “knobs” to turn/tune: • EM field strength • acceleration voltage • radio frequency • radio frequency phase • ... Typically > 10 design variables 16 / 51

Slide 18

Maxwell’s Equations in the electrostatic approximation [Diagram: electro-magneto optics and N-body dynamics; 1, 2 or 3D field maps & analytic models give (E, B)ext; the Poisson problem subject to BCs, Δφ′sc = −ρ′/ε0, gives (E, B)SC; H = Hext + Hsc] If E(x, t) and B(x) are known, the equation of motion can be integrated with: • Boris pusher (adaptive version soon!) • Leap-Frog • RK-4 17 / 51
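To make the last point concrete, below is a minimal, non-relativistic sketch of the Boris pusher named above; the Vec3 type, the helpers, and the way fields are passed in are illustrative assumptions, not OPAL's actual integrator interface.

```cpp
// Minimal non-relativistic Boris push: half electric kick, magnetic rotation,
// half electric kick, then a position drift. Illustrative only.
struct Vec3 { double x, y, z; };

static Vec3 add(Vec3 a, Vec3 b)     { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static double dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Advance position x and velocity v by one step dt in fields E and B.
void borisPush(Vec3& x, Vec3& v, Vec3 E, Vec3 B, double q, double m, double dt) {
    const double qmdt2 = 0.5 * q * dt / m;
    const Vec3 vMinus  = add(v, scale(E, qmdt2));        // first half electric kick
    const Vec3 t       = scale(B, qmdt2);                // rotation vector
    const Vec3 s       = scale(t, 2.0 / (1.0 + dot(t, t)));
    const Vec3 vPrime  = add(vMinus, cross(vMinus, t));  // magnetic rotation
    const Vec3 vPlus   = add(vMinus, cross(vPrime, s));
    v = add(vPlus, scale(E, qmdt2));                     // second half electric kick
    x = add(x, scale(v, dt));                            // drift
}
```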

Slide 19

Forward solver is expensive to run... 18 / 51

Slide 20

Forward solver is expensive to run... and we require a massive number of forward solves per optimization 18 / 51

Slide 21

Multi-Objective Optimization Framework • large multi-objective optimization problems • conflicting goals • expensive models • ... Applications: • structural analysis • scheduling • design • ... Common settings for many applications 19 / 51
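As a reminder of the problem class, the general form of such a multi-objective problem (standard textbook notation, not taken from the slides) is:

```latex
\min_{x \in \Omega} \; F(x) = \bigl(f_1(x), \dots, f_k(x)\bigr)
\quad \text{s.t.} \quad g_j(x) \le 0, \; j = 1, \dots, m,
```

where no single x minimizes all conflicting objectives at once; the solution is the set of Pareto-optimal trade-offs.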

Slide 22

Multi-Objective Optimization Framework [Diagram: Optimizer ↔ Pilot ↔ OPAL, exchanging input, objectives (obj), constraints (constr), and simulation results (sims); the optimizer side comprises convex optimization algorithms and heuristic algorithms] 20 / 51

Slide 23

Master/Worker Model [Diagram: the computational domain is split over cores; Optimizer_i [coarse]: multiple starting points, multiple opt. problems; Workers [coarse]: evaluate the forward problem; Forward Solver_j [fine]: parallel E-T; Optimizer [coarse]: parallel optimizer] 21 / 51
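A stripped-down MPI sketch of this master/worker dispatch is shown below; the message tags, the stub evaluate(), and the fixed task list are assumptions for illustration, whereas the actual Pilot coordinates several optimizers and multi-rank forward-solver groups.

```cpp
#include <mpi.h>
#include <vector>

static const int TAG_WORK = 1, TAG_RESULT = 2, TAG_STOP = 3;

// Stand-in for one (expensive) forward solve returning an objective value.
double evaluate(double designVariable) { return designVariable * designVariable; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }   // need a master and >= 1 worker

    if (rank == 0) {                              // master: dispatch and collect
        std::vector<double> tasks = {0.1, 0.2, 0.3, 0.4};
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            int worker = 1 + static_cast<int>(i % (size - 1));
            MPI_Send(&tasks[i], 1, MPI_DOUBLE, worker, TAG_WORK, MPI_COMM_WORLD);
        }
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            double objective;
            MPI_Recv(&objective, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        for (int w = 1; w < size; ++w)            // tell workers to shut down
            MPI_Send(nullptr, 0, MPI_DOUBLE, w, TAG_STOP, MPI_COMM_WORLD);
    } else {                                      // worker: evaluate until stopped
        while (true) {
            MPI_Status status;
            double dv = 0.0;
            MPI_Recv(&dv, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP) break;
            double objective = evaluate(dv);
            MPI_Send(&objective, 1, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```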

Slide 24

Issues with the M/W Communication Model • Exponential increase in hot-spot latency with increasing # cores • Increasing congestion on links surrounding the master • Gets worse with increasing message sizes Consequences: • flat network topologies are not optimal • adapt the number of masters to the # of cores • place the master taking the network topology into consideration Towards Message Passing for a Million Processes: Characterizing MPI on a Massive Scale Blue Gene/P, P. Balaji et al. 22 / 51

Slide 25

Optimization Algorithms [Diagram as on slide 22: Optimizer ↔ Pilot ↔ OPAL; here the focus is on the optimizer side, i.e. convex optimization algorithms and heuristic algorithms] 23 / 51

Slide 26

Genetic Algorithms [Diagram: the NSGA-II selector and the variator exchange individuals (“dispatch individuals”, “2 individuals ready”)] • PISA [1] • Finite state machine • NSGA-II selector • “Continuous generations” • Independent bit mutations • Various crossover policies [1] A Platform and Programming Language Independent Interface for Search Algorithms: http://www.tik.ee.ethz.ch/pisa/ 24 / 51
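The two variation operators named in the list (crossover and independent bit mutation) can be sketched as below; the bit-string genome and the parameters are illustrative assumptions, not the PISA variator actually used.

```cpp
#include <random>
#include <utility>
#include <vector>

using Genome = std::vector<bool>;   // illustrative bit-string encoding

// One-point crossover; assumes both genomes have the same length >= 2.
std::pair<Genome, Genome> onePointCrossover(const Genome& a, const Genome& b,
                                            std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> cut(1, a.size() - 1);
    const std::size_t c = cut(rng);
    Genome childA(a.begin(), a.begin() + c);
    Genome childB(b.begin(), b.begin() + c);
    childA.insert(childA.end(), b.begin() + c, b.end());
    childB.insert(childB.end(), a.begin() + c, a.end());
    return {childA, childB};
}

// Independent bit mutation: every bit flips with probability pMut.
void independentBitMutation(Genome& g, double pMut, std::mt19937& rng) {
    std::bernoulli_distribution flip(pMut);
    for (std::size_t i = 0; i < g.size(); ++i)
        if (flip(rng)) g[i] = !g[i];
}
```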

Slide 27

First Population [Plot of the objective space: dE [MeV] (0.20–0.30) vs. emitx [mm mrad] (0–20)] 25 / 51

Slide 28

30th Population [Plot of the objective space: dE [MeV] (0.20–0.30) vs. emitx [mm mrad] (0–20)] 26 / 51

Slide 29

649th Population [Plot of the objective space: dE [MeV] (0.20–0.30) vs. emitx [mm mrad] (0–20)] Now the scientist/operator can specify a preference 27 / 51
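The plotted fronts consist of the non-dominated individuals of each population; a simple O(N²) filter for two minimized objectives (illustrative only, the framework uses NSGA-II's ranking instead) could look like this:

```cpp
#include <vector>

struct Individual { double dE, emitx; };   // the two plotted objectives, both minimized

// Return the non-dominated individuals: drop p if some q is no worse in both
// objectives and strictly better in at least one. O(N^2), fine for small populations.
std::vector<Individual> paretoFront(const std::vector<Individual>& population) {
    std::vector<Individual> front;
    for (const Individual& p : population) {
        bool dominated = false;
        for (const Individual& q : population) {
            if (q.dE <= p.dE && q.emitx <= p.emitx &&
                (q.dE < p.dE || q.emitx < p.emitx)) {
                dominated = true;
                break;
            }
        }
        if (!dominated) front.push_back(p);
    }
    return front;
}
```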

Slide 30

Forward Solver [Diagram as on slide 22: Optimizer ↔ Pilot ↔ OPAL; here the focus is on the forward solver, OPAL] 28 / 51

Slide 31

Object Oriented Parallel Accelerator Library (OPAL) OPAL is a tool for precise charged-particle optics in large accelerator structures and beam lines, including 3D space charge. • built from the ground up as a parallel application • runs on your laptop as well as on the largest HPC clusters • uses the MAD language with extensions • written in C++ using OO techniques, hence OPAL is easy to extend • nightly regression tests track the code quality OPAL: https://amas.psi.ch/OPAL 29 / 51

Slide 32

Architecture [Diagram: OPAL (MAD parser; flavors: t, Cycl, env; optimization; space-charge solvers; integrators; distributions) built on IP2L (FFT, D-operators, NGP/CIC/TSI interpolation, fields, mesh, particles, load balancing, domain decomposition, message passing, particle cache, PETE, polymorphism), CLASSIC, H5hut, and Trilinos & GSL] • OPAL: Object Oriented Parallel Accelerator Library • IP2L: Independent Parallel Particle Layer • CLASSIC: Class Library for Accelerator Simulation System and Control • H5hut for parallel particle and field I/O (HDF5) • Trilinos: http://trilinos.sandia.gov/ 30 / 51

Slide 33

3D Tracker [Figure: two electrons e−, illustrating the repulsive force between charged particles] • Huge # of macro particles (100’000 – 100’000’000) • Computing space charge is expensive • Load balancing is difficult • Lots of synchronization points 31 / 51

Slide 34

Envelope Tracker [Figure: the bunch modelled as longitudinal slices along z with transverse extent in x and y] • # slices ≪ # macro particles • Slices distributed in contiguous blocks • Load imbalance of at most 1 slice • Low number of synchronization points 32 / 51

Slide 35

Envelope Equations Using a 5th order Runge-Kutta time stepper to update • Slice radius: d²Ri/dt² = ... • Slice energy: E = ... • Slice position: dzi/dt = ... 33 / 51
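Since GSL is part of the software stack (see the architecture slide), such a time stepper can be driven as sketched below; the two-component toy state and right-hand side stand in for the real per-slice envelope equations, and the embedded RKF45 stepper is used here only as a generic example of a Runge-Kutta driver, not necessarily the exact 5th-order scheme used in the envelope tracker.

```cpp
#include <gsl/gsl_errno.h>
#include <gsl/gsl_odeiv2.h>

// Toy per-slice state: y[0] = z_i (position), y[1] = beta_i. The real envelope
// equations couple slice radius, energy and position; this is a placeholder RHS.
static int sliceRhs(double /*t*/, const double y[], double dydt[], void* /*params*/) {
    const double c = 299792458.0;   // speed of light [m/s]
    dydt[0] = c * y[1];             // dz/dt = c * beta
    dydt[1] = 0.0;                  // no acceleration in this toy example
    return GSL_SUCCESS;
}

int main() {
    double y[2] = {0.0, 0.5};       // initial position and beta
    gsl_odeiv2_system sys = {sliceRhs, nullptr, 2, nullptr};
    gsl_odeiv2_driver* drv = gsl_odeiv2_driver_alloc_y_new(
        &sys, gsl_odeiv2_step_rkf45, 1e-12, 1e-8, 1e-8);
    double t = 0.0;
    gsl_odeiv2_driver_apply(drv, &t, 1.0e-9, y);   // integrate to t = 1 ns
    gsl_odeiv2_driver_free(drv);
    return 0;
}
```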

Slide 36

Model Implementation
procedure ExecuteSliceTracker(tn, slices)
    for all ti ∈ t1, · · ·, tn−1, tn do
        for all s ∈ slices do
            getExternalFields(s)
        end for
        synchronizeSlices()
        calcSpaceCharge()
        for all s ∈ slices do
            timeIntegration(s)
        end for
    end for
end procedure
34 / 51
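In C++-like form the loop structure above reads roughly as follows; Slice and the four helpers are stubs standing in for the actual OPAL-e routines named in the listing:

```cpp
#include <cstddef>
#include <vector>

struct Slice { double R = 0.0, E = 0.0, z = 0.0; };   // radius, energy, position

// Stubs standing in for the routines named in the listing above.
void getExternalFields(Slice&)              { /* sample field maps at the slice */ }
void synchronizeSlices(std::vector<Slice>&) { /* exchange slice data across ranks */ }
void calcSpaceCharge(std::vector<Slice>&)   { /* analytical or O(n^2) model */ }
void timeIntegration(Slice&)                { /* advance the envelope ODEs one step */ }

void executeSliceTracker(std::size_t nSteps, std::vector<Slice>& mySlices) {
    for (std::size_t step = 0; step < nSteps; ++step) {
        for (Slice& s : mySlices)        // independent per slice
            getExternalFields(s);
        synchronizeSlices(mySlices);     // the one synchronization point per time step
        calcSpaceCharge(mySlices);
        for (Slice& s : mySlices)        // independent per slice
            timeIntegration(s);
    }
}
```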

Slide 38

N-Body Space Charge
procedure ComputeSpaceCharge(slices)
    for all i ∈ my_slices do
        for all j ∈ all_slices do
            sm ← interaction(sj, si)
        end for
        F_l,i ← longitudinalSpaceCharge(sm)
        F_t,i ← transversalSpaceCharge()
    end for
end procedure
35 / 51
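A sketch of how this O(n²) phase maps onto MPI, with each rank owning a contiguous block of slices; the interaction kernel is a toy placeholder rather than the physical model, and the reduction stands in for the Allreduce calls counted on the complexity slide further down:

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Each rank owns the contiguous slice range [myBegin, myEnd) and sees all n
// slice positions in allZ (gathered beforehand). Fl holds one value per owned slice.
double computeSpaceCharge(const std::vector<double>& allZ,
                          std::size_t myBegin, std::size_t myEnd,
                          std::vector<double>& Fl) {
    double localSum = 0.0;
    for (std::size_t i = myBegin; i < myEnd; ++i) {
        double f = 0.0;
        for (std::size_t j = 0; j < allZ.size(); ++j) {   // O(n^2) pair loop
            if (j == i) continue;
            const double d = allZ[i] - allZ[j];
            f += (d > 0.0 ? 1.0 : -1.0) / (d * d + 1e-12); // toy 1/d^2 kernel
        }
        Fl[i - myBegin] = f;
        localSum += f;
    }
    // Small global reduction (bunch-wide quantity); the complexity table counts
    // two Allreduce calls of size 1 in this phase.
    double globalSum = 0.0;
    MPI_Allreduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return globalSum;
}
```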

Slide 39

Analytical Space Charge [Figure: cylindrical bunch of length L and radius R, on-axis coordinate z] Electrical field: E_z(z) = Q / (2π ε0 R²) · [ √((1 − z/L)² + (R/L)²) − √((z/L)² + (R/L)²) − (1 − z/L) + z/L ] see Homdyn, M. Ferrario, http://nicadd.niu.edu/fnpl/homdyn/manual.pdf 36 / 51
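A direct transcription of this on-axis field into code (SI units; the function name and argument layout are ours, not OPAL's):

```cpp
#include <cmath>

// On-axis longitudinal field of a uniformly charged cylinder of total charge Q,
// radius R and length L, evaluated at 0 <= z <= L, exactly as in the formula above.
double analyticalLongitudinalField(double z, double Q, double R, double L) {
    const double pi   = 3.14159265358979323846;
    const double eps0 = 8.8541878128e-12;   // vacuum permittivity [F/m]
    const double a = R / L;
    const double u = z / L;
    const double bracket = std::sqrt((1.0 - u) * (1.0 - u) + a * a)
                         - std::sqrt(u * u + a * a)
                         - (1.0 - u) + u;
    return Q / (2.0 * pi * eps0 * R * R) * bracket;   // E_z in V/m
}
```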

Slide 40

Complexity
Phase                 FLOPs   #Allreduce   size
getExternalFields()   O(m)    –            –
synchronizeSlices()   O(1)    2            n
calcSpaceCharge()     O(n²)   2            1
calcSpaceCharge()∗    O(1)    2            1
timeIntegration()     O(n)    –            –
(∗ analytical space-charge model) Typical n = 10000, m = 150 37 / 51

Slide 41

Important quantities are within a 5% margin w.r.t. the 3D tracker 38 / 51

Slide 42

Computational Platform Blue Gene/P²: • VN mode • quad-core PowerPC 450 (850 MHz) • peak performance of 13.6 GFLOP/s per node • 3D torus network with a bandwidth of 6 GB/s • collective (tree) network with a bandwidth of 2 GB/s • round-trip worst-case latency of 2.5 µs Running the first 2000 steps of SwissFEL... ² Blue Gene/P is a trademark of the International Business Machines Corporation in the United States, other countries, or both. 39 / 51

Slide 43

“History” A stony road to scalability...
Cores   Wall [s]   Comm [s]
1       321.46     –
2       167.93     3.67 (2.19%)
4       92.9       6.86 (7.43%)
8       57.1       9.08 (15.91%)
Run on the FELSIM cluster at PSI.
• proper load balancing • profiling and benchmarking: serial parts (Amdahl’s law) • one fixed synchronization point per timestep 40 / 51
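Amdahl's law, referenced above, bounds the speedup on p cores when a fraction s of the runtime remains serial (standard form, not stated on the slide):

```latex
S(p) \;=\; \frac{1}{\,s + (1 - s)/p\,}
```

which is why eliminating the serial parts found by profiling mattered as much as load balancing.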

Slide 44

500’000 Slices [Strong-scaling plot, analytical space charge: timings [s] (log scale, 10⁻¹–10³) vs. number of cores (128–4096); series: total, SC and I, ODE solve, field evaluation] 41 / 51

Slide 45

10’000 Slices: analytical Space-Charge [Strong-scaling plot: timings [s] (log scale, 10⁻¹–10³) vs. number of cores (8–4096); series: total, SC and I, ODE solve, field evaluation] 42 / 51

Slide 46

10’000 Slices: n² Space-Charge [Strong-scaling plot: timings [s] (log scale, 10⁻¹–10⁴) vs. number of cores (8–4096); series: total, SC and I, ODE solve, field evaluation] 43 / 51

Slide 47

Analytical Weak Scaling [Weak-scaling plot: timings [s] (0–250) vs. number of cores (1–4096); series: SC and I, field evaluation, ODE solve, total, MPI; problem scaling: n] Annotation: excessively long MPI shutdown 44 / 51

Slide 48

n² Weak Scaling [Weak-scaling plot: timings [s] (0–250) vs. number of cores (1–4096); series: SC and I, field evaluation, ODE solve, total, MPI; problem scaling: √n] 45 / 51

Slide 49

Conclusions Fast & scalable forward solver: • introduced a new low-dimensional model • almost 2 orders of magnitude reduction in runtime Crucial cornerstone of the framework: • becomes feasible with the presented forward solver • vary “resolution” (3D, n² s.c., analytical s.c.) • achieve an acceptable parallel efficiency for one forward solve • run as many parallel forward solves as possible Helps in the design/operation of particle accelerators 46 / 51

Slide 50

Outlook • Investigate other multi-objective optimization algorithms • Run a real optimization problem on a massive number of cores • Visualization of results 47 / 51

Slide 51

This project is funded by: IBM Research – Zurich, Paul Scherrer Institut. Acknowledgements • OPAL developer team • SwissFEL team 48 / 51

Slide 52

Backup 49 / 51

Slide 53

Optimization Problem
min [energy spread, emittance]
s.t.  ∂t f(x, v, t) + v · ∇x f(x, v, t) + (q/m0)(Etot + v × Btot) · ∇v f(x, v, t) = 0
∇ × B = μ0 J + μ0 ε0 ∂E/∂t,   ∇ × E = −∂B/∂t
∇ · E = ρ/ε0,   ∇ · B = 0
ρ = −e ∫ f(x, v, t) d³p,   J = −e ∫ f(x, v, t) v d³p
E = Eext + Eself,   B = Bext + Bself
50 / 51

Slide 54

Envelope Equations
d²Ri/dt² + βi γi² (dβi/dt)(dRi/dt) + Ri Σj Ki^j = (2 c² kp)/(Ri βi) · [ G(Δi, Ar)/γi³ − (1 − βi²) G(δi, Ar)/γi ] + (4 εn^th c / γi)² · 1/Ri³
dβi/dt = e0/(m0 c γi³) · ( Ez^ext(zi, t) + Ez^sc(zi, t) )
dzi/dt = c βi
51 / 51