
A Benchmarking Result of MEGADOCK on Summit@ORNL

metaVariable

May 24, 2021

Transcript

  1. MEGADOCK Benchmark Result on Summit @ ORNL
     Kento Aoyama
     Akiyama Laboratory, Dept. of Computer Science, School of Computing, Tokyo Institute of Technology
     2021.05.24

  2. MEGADOCK: FFT-grid-based protein-protein docking application for heterogeneous supercomputers
     • MEGADOCK [Ohue, et al. 2014] is protein-protein interaction (PPI) prediction software implemented with hybrid parallelization (MPI/OpenMP/GPU).
     • Task processing follows a Master-Worker model: a pair of proteins is assigned to each worker process sequentially.
     • Each thread is assigned a ligand rotation (sampling angle θ) and runs the FFT-based docking calculation for all ligand translations (X, Y, Z).
     • The calculation order of a pose evaluation, O(N^6), is reduced to O(N^3 log N) by using FFT algorithms (see the sketch after this slide).
     [Figure: internals of the docking calculation. The master process hands each task (a pair of proteins) from the docking list to a worker process; within a worker, OpenMP threads with CUDA streams loop over ligand rotation angles and, for each docking pair, evaluate docking poses for all ligand angles (θ) and translations (X, Y, Z): P1 Initialization, P2 Receptor voxelization, P3 Forward FFT of a receptor, P4 Ligand rotation & voxelization, P5 Forward FFT of a ligand, P6 Modulation, P7 Inverse FFT (score calculation), P8 Finding the best solutions, P9 Post processes.]
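To make the FFT speed-up above concrete, here is a minimal sketch (not MEGADOCK's actual implementation; the grid size and random grid contents are illustrative): for one ligand rotation, the scores of all ligand translations form a cyclic correlation of the receptor and ligand voxel grids, which a 3-D FFT evaluates in O(N^3 log N) instead of the naive O(N^6).

```python
import numpy as np

def docking_scores(receptor_grid, ligand_grid):
    """Scores for every ligand translation of one rotation (cyclic correlation).

    Naively this costs O(N^3) translations x O(N^3) voxels = O(N^6);
    with 3-D FFTs it costs O(N^3 log N).
    """
    R = np.fft.fftn(receptor_grid)       # forward FFT of the receptor (P3)
    L = np.fft.fftn(ligand_grid)         # forward FFT of the ligand   (P5)
    S = R * np.conj(L)                   # modulation in Fourier space (P6)
    return np.real(np.fft.ifftn(S))      # inverse FFT -> score per translation (P7)

# toy example: random 64^3 grids standing in for voxelized proteins
N = 64
rng = np.random.default_rng(0)
scores = docking_scores(rng.random((N, N, N)), rng.random((N, N, N)))
best = np.unravel_index(np.argmax(scores), scores.shape)   # best translation (P8)
print(best, scores[best])
```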
  3. Improving the availability of MEGADOCK for a wider variety of HPC systems
     • Scientific applications are required to be more available, portable, reproducible, and interoperable on a wider variety of HPC systems; however, this is still a challenging task.
     • We have been working to improve the availability of our application:
       - performance evaluation of virtual machines and Docker containers on a public cloud (2019), ✓
       - large-scale PPI prediction with Singularity containers on ABCI and TSUBAME 3.0, using a container configuration workflow based on HPCCM (2020), ✓
       - supporting a wider range of supercomputers (Summit, Fugaku) with Singularity containers (ongoing),
       - supporting the Spack package manager (ongoing).
     • We are now working on deploying MEGADOCK on Summit with a Singularity container (see the HPCCM sketch after this slide), but:
       - Summit compute nodes have POWER9 CPUs, so the container image must support the ppc64le architecture,
       - Summit (currently) does not support the '--fakeroot' option and does not provide any environment for building ppc64le containers.
     • We therefore ran a small benchmark to measure the performance of MEGADOCK in the native (bare-metal) environment of a Summit compute node.
     [Target systems: Public Clouds, ABCI / TSUBAME 3.0, Fugaku, Summit]
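As a rough illustration of the HPCCM-based container configuration workflow mentioned on this slide (a minimal sketch, not the actual MEGADOCK recipe; the base image, library versions, and chosen building blocks are assumptions), hpccm can be used as a Python library to generate a Singularity definition file:

```python
import hpccm
from hpccm.building_blocks import gnu, openmpi, fftw
from hpccm.primitives import baseimage

# emit a Singularity definition (Docker is the other supported output format)
hpccm.config.g_ctype = hpccm.container_type.SINGULARITY

stage = hpccm.Stage()
stage += baseimage(image='nvidia/cuda:10.2-devel-ubuntu18.04')  # illustrative base image
stage += gnu()                                 # GNU compilers
stage += openmpi(version='3.1.6', cuda=True)   # CUDA-aware Open MPI (version is illustrative)
stage += fftw(version='3.3.8')                 # FFTW for the docking FFTs

print(stage)   # prints the generated Singularity definition file
```

Switching `hpccm.config.g_ctype` lets the same recipe emit a Dockerfile instead, which is what makes one configuration reusable across systems.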
  4. Performance comparison of MEGADOCK benchmark results on Summit / ABCI* / TSUBAME 3.0*
     • MEGADOCK calculation time for ZDOCK Benchmark 5.0 (all-to-all docking of the unbound PDB structures, 230 x 230 = 52,900 PPIs).
     • The benchmark results are clearly related to the total GPU performance of a compute node on each system:
       - TSUBAME 3.0: P100 GPU, 4 per node
       - ABCI: V100 GPU, 4 per node
       - Summit: V100 GPU, 6 per node
     • All results showed good strong-scaling efficiency (> 0.95 at 64 compute nodes, relative to 4 nodes); see the efficiency sketch after this slide.
     • Parallel performance degradation is mainly caused by load imbalance, because the calculation time depends on the input data.
     *Results on ABCI and TSUBAME 3.0 were obtained with Singularity containers.
     [Chart: elapsed time [sec] vs. # of compute nodes, with strong-scaling efficiency on Summit]
        # of compute nodes       4       8      16      32      64
        Summit               5,326   2,697   1,369     688     350
        ABCI*                8,259   4,164   2,086   1,055     532
        TSUBAME 3.0*        13,640   6,861   3,444   1,724     896
        Efficiency (Summit)  1.000   0.987   0.972   0.967   0.952
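For reference, the strong-scaling efficiency shown above follows the usual definition E(n) = (T(n0) * n0) / (T(n) * n) with the 4-node run as baseline; a small sketch using the Summit elapsed times from this slide:

```python
# strong-scaling efficiency: E(n) = (t_base * n_base) / (t_n * n)
base_nodes, base_time = 4, 5326          # Summit 4-node baseline from the table
runs = {4: 5326, 8: 2697, 16: 1369, 32: 688, 64: 350}

for nodes, elapsed in runs.items():
    efficiency = (base_time * base_nodes) / (elapsed * nodes)
    print(f"{nodes:3d} nodes: {elapsed:5d} s, efficiency = {efficiency:.3f}")
# -> 1.000, 0.987, 0.973, 0.968, 0.951: these match the reported values up to
#    rounding, since the published figures were computed from unrounded times
```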
  5. MEGADOCK benchmark results on Summit (cont.)
     • When increasing the number of compute nodes with the same dataset and parameters, the scalability gradually decreased (efficiency 0.766 at 256 compute nodes).
     • The benchmark dataset and parameters were not large enough for hundreds of compute nodes on Summit (see the sketch after this slide).
     • A larger dataset is needed for a large-scale experiment on Summit; we used a dataset of over 1 million PPIs (about 25x larger) in the Grand Challenge program on ABCI and TSUBAME 3.0 (2019).
     [Chart: elapsed time [sec] and strong-scaling efficiency vs. # of compute nodes, megadock (master.log)]
        # of compute nodes       4       8      16      32      64     128     256
        Elapsed time [sec]   5,326   2,697   1,369     688     350     187     109
        Efficiency           1.000   0.987   0.972   0.967   0.952   0.888   0.766
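A back-of-the-envelope view of why the dataset becomes too small at this scale (a sketch assuming roughly one worker per GPU and one docking pair as the unit of work, which is an approximation of the actual master-worker setup):

```python
# docking tasks available per GPU worker as the node count grows
tasks = 230 * 230            # ZDOCK Benchmark 5.0, all-to-all = 52,900 pairs
gpus_per_node = 6            # Summit: 6x V100 per compute node

for nodes in (4, 8, 16, 32, 64, 128, 256):
    workers = nodes * gpus_per_node
    print(f"{nodes:3d} nodes: {workers:4d} GPU workers, "
          f"~{tasks / workers:6.1f} tasks per worker")
# at 256 nodes each worker gets only ~34 pairs, so a few long-running pairs
# dominate the elapsed time and the strong-scaling efficiency drops
```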
  6. Obstacles to building HPC application containers on multiple HPC systems
     • Deploying HPC containers for multiple HPC environments is still not easy. HPCCM from NVIDIA is helpful, but the building process cannot be made entirely simple (see the sketch after this slide).
     • Differences in CPU architectures:
       - x86_64 (ABCI, TSUBAME 3.0)
       - aarch64 (Fugaku)
       - ppc64le (Summit)
       Sometimes we do not have a good build environment for the target CPU architecture; emulators may solve this, but they have performance problems.
     • Differences in GPU architectures:
       - P100 (TSUBAME 3.0)
       - V100 (ABCI, Summit)
       - A100 (ABCI 2.0)
     • MPI library and compatibility problems:
       - OpenMPI
       - MPICH
       - Fujitsu MPI
       - IBM Spectrum MPI
       A recently released version of the same library is usually fine; however, proprietary libraries (binaries) must be provided by the vendors.
     • Other difficulties:
       - differences in the supported Singularity versions and options; a rootless option (fakeroot) is necessary for efficient debugging,
       - differences in job scheduler options, storage architectures, etc.,
       - no online user manual available for building containers on the system; at least one successful build procedure on the system is needed as a reference (especially for MPI containers), otherwise debugging takes a long time.
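A sketch of how the system matrix above might be handled in a single parameterized HPCCM recipe (the per-system table, base image, versions, and the substitution of OpenMPI/MPICH for the vendor MPI libraries are all illustrative assumptions, not the actual build configuration):

```python
import hpccm
from hpccm.building_blocks import gnu, openmpi, mpich
from hpccm.primitives import baseimage

# illustrative per-system settings; real deployments also differ in GPU arch,
# Singularity version, scheduler, and vendor MPI (Fujitsu MPI, IBM Spectrum MPI)
systems = {
    'abci':   {'arch': 'x86_64',  'mpi': 'openmpi'},
    'fugaku': {'arch': 'aarch64', 'mpi': 'mpich'},     # stand-in for Fujitsu MPI
    'summit': {'arch': 'ppc64le', 'mpi': 'openmpi'},   # stand-in for Spectrum MPI
}

def recipe(target):
    cfg = systems[target]
    hpccm.config.g_ctype = hpccm.container_type.SINGULARITY
    stage = hpccm.Stage()
    stage += baseimage(image='ubuntu:18.04', _arch=cfg['arch'])  # illustrative image
    stage += gnu()
    if cfg['mpi'] == 'openmpi':
        stage += openmpi(version='4.0.5')   # placeholder for the system MPI
    else:
        stage += mpich(version='3.3.2')
    return str(stage)

print(recipe('summit'))   # Singularity definition targeting ppc64le
```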