Distributed computing in UQLab using the HPC Dispatcher module
This is a presentation given on 28.10.2020 at the seminar organized by the Chair of Risk, Safety and Uncertainty Quantification, ETH Zurich. It covers an upcoming feature of UQLab for dispatching uncertainty quantification tasks to distributed computing resources.
D. Wicaksono (RSUQ, ETH Zürich), 28.10.2020

• You need to run long computations; you only have your laptop → freeing up local computing resources
• You need to run computations with large memory or CPU requirements → exceptional resource (CPU, memory) requirements
• Your simulation code only runs in Linux with, possibly, special licensing → compatibility and licensing issues
(Figure: distributed computing resources of various flavors. Credits: Wikipedia, GPL v3; Argonne National Lab., CC BY-SA 2.0; Zamurovic Brothers, Noun Project)

Distributed computing resources might help:
• They are highly available round the clock (maybe)
• They have a massive amount of high-performance CPUs, memory, and storage (maybe)
• They come with expensive (shared) licensed software installed (maybe)
EULER: the large HPC infrastructure of ETH Zurich (image credit: Olivier Byrde, ETH Zurich, 2015)

Your institution provides distributed computing resources, so you:
1. ask or look around how to get access
2. get access (username, password, computing time, storage space)
3. read the Wiki
4. are ready for some distributed computing (right?)
The steps:
1. Log in to the remote machine
2. Create an analysis script to run on the remote machine
3. Create a job script
4. Submit the job to the queues
5. Check the queues until the job is finished
6. Transfer the output files back (scp, rsync)
7. Analyze the outputs on the local machine
Local client (laptops, desktops) versus remote machine (computing servers, HPC clusters, etc.): the gap is real

Users are used to:
• synchronous execution: immediate execution
• local storage: results are available on completion
• thinking of serial programs: easier to debug

Now, users have to get used to:
• asynchronous execution: schedule and submit an execution
• remote storage: results must be retrieved
• thinking of parallel programs: harder to debug (e.g., race conditions)
(Figure: the HPC Dispatcher module sits between a user on a local client, such as a laptop or desktop, and a remote machine, such as a computing server or an HPC cluster.)

The HPC Dispatcher module is an attempt to bridge the gap:
• It allows users to dispatch some computations to remote distributed computing resources and to retrieve the results
• All from a local UQLab session
• The emphasis is on some computations, i.e., certain computations that are relevant in uncertainty quantification (UQ) with UQLab
This presentation is about:
• The features of the HPC Dispatcher module
• Its basic usage and a couple of more advanced use cases

...and less about (if at all):
• Distributed computing systems and their organization
• Parallel algorithms and programming
Create a Model object from an m-file:

    ModelOpts.mFile = 'uq_ishigami';

    myModel = uq_createModel(ModelOpts);

Evaluate the Model on a single input point:

    >> uq_evalModel([0.5*pi 0.5*pi 0.5*pi])
    ans =
        8.6088
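The value above can be checked by hand without UQLab. The following is a minimal sketch that assumes uq_ishigami implements the standard Ishigami function with the common parameter choice a = 7 and b = 0.1; the variable names are only illustrative.

```matlab
% Standard Ishigami function with a = 7 and b = 0.1
% (assumption: this matches what uq_ishigami computes)
a = 7;
b = 0.1;
ishigami = @(X) sin(X(:,1)) + a*sin(X(:,2)).^2 + b*X(:,3).^4.*sin(X(:,1));

% Same input point as in the slide: every component equals pi/2
x = [0.5*pi 0.5*pi 0.5*pi];
y = ishigami(x);

% At this point both sines equal 1, so y = 1 + a + b*(pi/2)^4
yClosedForm = 1 + a + b*(0.5*pi)^4;

fprintf('y = %.4f\n', y);   % prints y = 8.6088
```

Under this assumption the result agrees with the value returned by the local Model evaluation.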
Let's assume a Dispatcher object has been created and stored in a variable myDispatcher (more on this later).

Dispatch the same Model evaluation to the remote machine:

    >> uq_evalModel([0.5*pi 0.5*pi 0.5*pi], 'HPC')
    ans =
        []

Whoa, what happened?
Get the status of a dispatched computation:

    >> uq_getStatus(myDispatcher)
    ans =
        running

The local client asks the remote machine for the status over SSH and receives the answer (here, 'running') the same way. The possible statuses are:
• 'submitted': is in the queuing system
• 'running': is being executed
• 'complete': has been successfully finished
• 'canceled': has been deliberately canceled
• 'failed': has exited with errors
Once the remote execution is finished:

    >> uq_getStatus(myDispatcher)
    ans =
        complete

At this point, the results are still stored on the remote machine.
Fetch the results:

    >> results = uq_fetchResults(myDispatcher)
    results =
        8.6088

The results are transferred from the remote machine to the local client over SSH.
What you need to set up and use a Dispatcher object:
• Access to a remote machine; it must run a Linux OS and have an MPI implementation
• A directory with write permission on the remote machine
• A passwordless SSH connection to the remote machine
• A profile file that stores the required information about the remote machine

An example of a (minimum) remote machine profile file (myProfile.m):

    Hostname = 'euler.ethz.ch';
    Username = 'wdamar';
    PrivateKey = '~/.ssh/id_rsa_dispatcher';
    RemoteFolder = '/home/wdamar/temp';

In a UQLab session:

    DispatcherOpts.Profile = 'myProfile';
    myDispatcher = uq_createDispatcher(DispatcherOpts);
Note that neither MATLAB nor UQLab is required on the remote machine for UQLink model evaluation.
Additional settings in the remote machine profile file:

• For generic UQLab computations, you'd need MATLAB and UQLab:

    ...
    MATLABCommand = '/usr/local/bin/matlab';
    RemoteUQLabPath = '~/uqlab';

• The remote machine might also employ a job scheduler:

    ...
    Scheduler = 'slurm'; % or 'lsf', 'pbs', 'torque'

• It might also employ a module system to load software and set up its environment:

    ...
    EnvSetup = 'module load open_mpi'; % only on the login node
    PrevCommands = 'module load matlab'; % also on the compute nodes

• And some other settings (e.g., custom scheduler, MPI settings); see the Reference List of the Dispatcher module user manual.
uq_map(fun, inputs)

(Figure: uq_map(@fun, inputs) maps each element of inputs to one cell, {fun(element)}, of the output.)

In plain English: evaluate fun for each element of inputs; the output is always a cell array.
It is similar to arrayfun, cellfun, or structfun, but it focuses on the inputs as a sequence or collection rather than on their data type.
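To illustrate what "focus on the data type" means for the built-in functions, here is a small sketch that uses only plain MATLAB (no UQLab call is made). It shows that arrayfun and cellfun are each tied to one container type, although the mapping itself is the same; the variable names are illustrative.

```matlab
% A numeric vector and a cell array holding the same three values
v = [1 4 9];
c = {1, 4, 9};

% arrayfun iterates over array elements, cellfun over cell contents;
% with 'UniformOutput' set to false both return a cell array
outFromArray = arrayfun(@sqrt, v, 'UniformOutput', false);
outFromCell  = cellfun(@sqrt, c, 'UniformOutput', false);

% The mapping is the same; only the container type of the input differs
sameResult = isequal(outFromArray, outFromCell);
```

A function that treats its inputs simply as a sequence, as described for uq_map, spares the user from choosing between these variants.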
uq_map(fun, inputs)

Supported functions as the mapping function:
• All built-in MATLAB functions
• User-defined functions (stored in m-files)
• Anonymous functions
• System commands

Many MATLAB functions are vectorized. Consider using uq_map if there is no such function for your purpose and you're thinking of using a for-loop.
Supported inputs:
• uq_map(fun,S)
• uq_map(fun,C)

Structure and cell arrays can contain most types of data users need. Matrices and vectors are supported as a shortcut.
uq_map(fun, inputs, DispatcherObj)

uq_map is a dispatcher-aware function:
• Its computation can be dispatched to a remote machine
• It only requires UQLab on the remote machine if UQLab's functionalities are used
• It supports advanced functionalities, e.g., attached files, remote sequence generator, custom error handler for a specific computation
• It follows the same workflow as a dispatched uq_evalModel
Notes on synchronization

Users can wait for a dispatched computation to finish:

    >> uq_waitForJob(myDispatcher) % this will block the session
    Checking the status of the remote execution ...
    Checking the status of the remote execution ...
    Job Status: complete reached.

(Flowchart: the local client repeatedly gets the status from the remote machine over SSH. If the job is finished ('complete' or 'failed') or a time-out is reached, the session is unblocked and control returns to the MATLAB prompt; otherwise, the status is checked again.)
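The waiting logic in the flowchart can be sketched as a simple polling loop. This is not the implementation of uq_waitForJob; it is an illustration in plain MATLAB in which a hypothetical, pre-defined sequence of replies stands in for repeated calls to uq_getStatus. The polling interval, the time-out value, and the 'timeout' label are assumptions.

```matlab
% Hypothetical sequence of status replies from the remote machine
statusSequence = {'submitted', 'running', 'running', 'complete'};

pollInterval = 0.01;   % seconds between two status checks (assumed value)
timeout = 5;           % maximum waiting time in seconds (assumed value)

tStart = tic;
k = 0;
finalStatus = 'timeout';   % illustrative label, kept if the time-out is reached
while toc(tStart) < timeout
    % Stand-in for uq_getStatus(myDispatcher)
    k = min(k + 1, numel(statusSequence));
    status = statusSequence{k};
    % The job counts as finished when it is complete or has failed
    if any(strcmp(status, {'complete', 'failed'}))
        finalStatus = status;
        break
    end
    pause(pollInterval);
end

fprintf('Stopped waiting with status: %s\n', finalStatus);
```

The loop blocks the session until one of the two exit conditions is met, which is the behavior the slide describes for a blocking wait.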
Users can also dispatch a computation in a synchronized mode, e.g.:

    >> Y = uq_map(fun, inputs, DispatcherObj, ...
           'Synchronized', true)
    Y =
        ...

(Flowchart: after dispatching, the local client waits for the job as uq_waitForJob does. Once the job is finished, the results are fetched over SSH and returned if the job has completed; otherwise, an error is thrown.)
Notes on parallel execution

uq_evalModel and uq_map are parallelizable:
• They are of the naively parallel type (data parallelism)
• Data are chunked and processes are spawned to deal with each chunk
• When fetched, the chunked results are automatically merged

(Figure: the local client dispatches a computation over SSH; on the remote machine, the data are chunked and parallel processes are spawned, one per chunk. On fetching, the partial results, e.g., results 1, results 2, and results 3, are merged into a single set of results.)
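The chunk-and-merge idea can be sketched in plain MATLAB. The sketch below runs serially on one machine and makes no UQLab call; the row-wise splitting rule and the test function are assumptions chosen for illustration, not the module's actual chunking strategy.

```matlab
% Row-wise function standing in for a model evaluation (illustrative)
f = @(X) sum(X.^2, 2);

% Deterministic N-by-3 input sample
N = 10;
nChunks = 3;
X = linspace(-1, 1, N)' * [1 0.5 0.25];

% Split the N rows as evenly as possible; here the sizes are [4 3 3]
chunkSizes = floor(N/nChunks) * ones(1, nChunks);
nExtra = mod(N, nChunks);
chunkSizes(1:nExtra) = chunkSizes(1:nExtra) + 1;
XChunks = mat2cell(X, chunkSizes, size(X, 2));

% Each chunk would be handled by its own process; here they run in turn
YChunks = cellfun(f, XChunks, 'UniformOutput', false);

% Merge the partial results in the original row order
Y = vertcat(YChunks{:});
```

Because each row is evaluated independently of the others, the merged result matches a single evaluation of the whole sample, which is what makes this kind of computation naively parallel.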
A parallel execution does not necessarily mean a faster execution:
• Spawning MATLAB processes has some overhead
• The size of the data, coupled with I/O and network performance, may become a bottleneck
Notes on multiple dispatched computations

Users may dispatch multiple computations to the remote machine before any of the executions has finished:

    uq_evalModel(myModel, X1, 'HPC');
    uq_evalModel(myModel, X2, 'HPC');
    uq_evalModel(myModel, X3, 'HPC');
    uq_map(@sum, X4, myDispatcher);

To list all the dispatched computations associated with a Dispatcher object:

    >> uq_listJobs(myDispatcher)
    No.  Job ID  Status     Tag ...
    -----------------------------------------------------------
    1    2574    complete   uq_evalModel of <Model 1> on <25...
    2    2576    submitted  uq_evalModel of <Model 1> on <26...
    3    2577    submitted  uq_evalModel of <Model 1> on <26...
    4    2578    submitted  uq_map of <sum> on <26-Oct-2020 ...
Without a job scheduler, the remote machine can be flooded!
Notes on retrieving remote computations

Users may exit the current UQLab session and retrieve the results later.

Option 1: Save (before exiting) and load the Dispatcher object
• Use uq_saveDispatcher to save the object to a file
• Use uq_loadDispatcher to load the object from a file

Option 2: Recreate the object and retrieve remote computations
• Use the same remote machine profile file to create a new object
• Use uq_retrieveJobs to search through a remote directory and re-attach any remote computations to the current object

This works as long as the directory on the remote machine remains intact.
UQLink model evaluation in a cloud cluster

An example of distributed computing on a (not high-performance) cluster:
• Input files are created locally
• 3rd-party code is executed on the remote machine (it must be installed there)
• Output files are parsed locally

(Figure: the local client, a laptop or desktop, connects over SSH to a Virtual Private Cloud that consists of a master node (VM instance), worker nodes (VM instances), and shared storage (NFS).)
A minor modification to the UQLink model options:

    % Location of the external executable in the remote machine
    EXECPATH = '/home/cluster/code/simply-supported-beam';
    ModelOpts.ExecutablePath = EXECPATH;
    % No MATLAB in the remote
    ModelOpts.RemoteMATLAB = false;

Transferring input/output files of 3rd-party code over the network can be time-consuming!
Create PCE metamodels on the 10-dimensional Truss data set:
• Methods: 'OLS', 'LARS', 'OMP'
• Degrees: 1, 2, 3, 4, 5
• QoI: LOO and validation errors

There are multiple ways to do this with uq_map. The following is just one possibility.
The call:

    uq_map(@myWrapper, ParamSets, myDispatcher, ...
        'Parameters', Params, ...
        'UQLab', true);

The components:
• myWrapper: an ad-hoc wrapper function to get the QoIs and provide basic error handling
• ParamSets: the set of all possible configuration options
• myDispatcher: the Dispatcher object
• 'Parameters': a named argument to specify parameters in order to avoid duplication of data
• 'UQLab': a flag to load UQLab on the remote machine
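The slides do not show how ParamSets is built. One possible way to enumerate all combinations of methods and degrees is sketched below in plain MATLAB; the structure array layout and the field names Method and Degree are hypothetical, and how myWrapper would consume each element is likewise an assumption.

```matlab
% Configuration options taken from the study description
pceMethods = {'OLS', 'LARS', 'OMP'};
degrees = 1:5;

% Index grids covering every (method, degree) combination
[iMethod, iDegree] = ndgrid(1:numel(pceMethods), 1:numel(degrees));

% 1-by-15 structure array; the field names are hypothetical
ParamSets = struct( ...
    'Method', pceMethods(iMethod(:)'), ...
    'Degree', num2cell(degrees(iDegree(:)')));

fprintf('%d configurations\n', numel(ParamSets));   % prints 15 configurations
```

Each element of such an array describes one metamodel to construct, so the whole study becomes a single mapping over 15 independent configurations.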
The HPC Dispatcher module has been introduced:
• The module can help bridge the gap in using distributed computing resources (HPC or otherwise) for some computations in UQ with UQLab
• The module's main goal is to dispatch a computation; whether it makes sense to do that depends on the case
• A new workflow for a dispatched computation (dispatch-and-fetch)
• The dispatcher-aware uq_evalModel and uq_map are introduced

A couple of use cases where dispatching makes sense have been presented:
• UQLink model evaluation
• Large-scale parametric studies of the construction of UQLab metamodel (or analysis) objects

Thank you very much for your attention!