Research Computing at ILRI

James Oguya
November 29, 2016

A use-case presentation on Research Computing at the International Livestock Research Institute (ILRI).
Presented at the Science Gateways & Grid Infrastructures workshop at KENET, Nairobi.

Transcript

  1. Research Computing at ILRI
     Science Gateways & Grid Infrastructures Workshop, KENET
     James Oguya
     Nairobi, Kenya, November 2016
  2. What we do
     • Agricultural research at ILRI & its partners is aimed at producing healthier crops & livestock to alleviate poverty & hunger in the developing world through exploitation of the latest genome technologies
       ◦ such technologies require state-of-the-art high-performance computing infrastructure for downstream data analysis & storage
     • ILRI HPC serves the bioinformatics, statistics & geo-spatial computational needs of ILRI & its partners
     • By sharing the computational power of the HPC cluster, researchers have been & will be able to conduct more extensive & large-scale genomic research quickly & cost-effectively
  3. Where we came from (2003)
     • 32 dual-core compute nodes, a.k.a. 'thin' nodes
     • data storage over NFS to the 'master' node
     • compute jobs had to use MPI
       ◦ writing & debugging MPI code is not easy!
     • ran on the Rocks Cluster distro
     • all of this was revolutionary at that time!
  4. Where we are (2016)
     • Got rid of the original cluster
     • 5 compute nodes, 5 storage nodes, 1 virtualization server & 3 database servers
       ◦ compute nodes are a mixture of 'thick' & 'thin' nodes
       ◦ 184 CPU cores, 1.5TB RAM, ~80TB disk space
     • Compute & storage nodes run CentOS 6; the rest run Ubuntu
     • Fast 10GbE interconnects
     • World-class cluster setup & system administration
  5. Future plans
     • New GPU node(s)
     • Double the storage to ~160TB
     • 40GbE optical interconnects
     Only possible with a good source of funds...
  6. HPC infrastructure organization
     • The cluster is divided into 2 sections:
       ◦ Compute:
         ▪ lots of CPU cores & RAM
         ▪ minimal disk space
         ▪ job scheduling is done by the SLURM workload manager
       ◦ Storage:
         ▪ lots of disk space
         ▪ at least 8 CPU cores
         ▪ data is distributed & replicated across all storage servers
         ▪ GlusterFS distributed file system
     • User IDs, applications & data are available everywhere; a user's view of the compute section is sketched below.
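
     A minimal sketch of how the compute section appears to a user through SLURM; the partition and node names shown are assumptions for illustration, not ILRI's actual ones:

        $ sinfo                      # list partitions & node states on the cluster
        PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
        batch*       up   infinite      5   idle compute[01-05]
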
  7. How we use SLURM
     • The SLURM workload manager is an open-source, highly scalable cluster management & job scheduling system for Linux clusters
       ◦ it was conceived at Lawrence Livermore National Laboratory (LLNL)
       ◦ SLURM manages & accounts for computing resources
         ▪ users request CPU cores, memory & node(s)
       ◦ it queues & prioritizes jobs, logs resource usage, etc.
     • Users can submit 'batch' and/or 'interactive' jobs; a batch-job sketch follows below
       ◦ 'batch' jobs can be long-running jobs, invoke a program multiple times with different variables/options/arguments, etc.
       ◦ running an 'interactive' job is as easy as typing the interactive command!
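
     A minimal sketch of a 'batch' job script under SLURM; the program (blastn), the file names & the resource sizes are illustrative assumptions, not an actual ILRI workflow:

        #!/usr/bin/env bash
        #SBATCH --job-name=blast-demo      # name shown in the queue
        #SBATCH --cpus-per-task=4          # request 4 CPU cores on one node
        #SBATCH --mem=8G                   # request 8GB of RAM
        #SBATCH --output=blast-demo.log    # where stdout/stderr are written

        # run the analysis with as many threads as SLURM allocated
        blastn -num_threads "$SLURM_CPUS_PER_TASK" -query input.fa -db nt -out results.txt

     Such a script would be queued with 'sbatch blast-demo.sbatch'; the generic SLURM way to start an interactive shell is 'srun --pty bash', which is roughly what the cluster's interactive command provides.
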
  8. How we use GlusterFS
     • GlusterFS is an open-source, scalable, distributed network file system developed by Red Hat
     • It can do replicated, distributed, replicated+distributed, geo-replicated (off-site!) volumes, etc.
     • In the ILRI HPC we have 3 replicated GlusterFS volumes (see the mount sketch below):
       ◦ homes volume: contains users' home folders; mounted at /home
       ◦ apps volume: contains all applications; mounted at /export/apps
       ◦ data volume: contains data (databases, genomes, etc.) shared amongst users/groups; mounted at /export/data
     • Persistent directory paths: users can access data in GlusterFS volumes from any compute node/server within the cluster; the distribution is transparent
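
     A minimal sketch of how such replicated volumes are mounted on a client node with the native GlusterFS (FUSE) client; the storage server hostname 'storage01' is an assumption, not a real ILRI host:

        $ mount -t glusterfs storage01:/homes /home            # users' home folders
        $ mount -t glusterfs storage01:/apps  /export/apps     # shared applications
        $ mount -t glusterfs storage01:/data  /export/data     # shared datasets
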
  9. Managing applications
     • Applications are loaded & unloaded using environment modules (http://modules.sourceforge.net); see the session sketch below
     • Environment modules make it easy to support multiple application versions, libraries, dependencies, shell environment variables, etc.
     • Install once, use everywhere…
     • The list of applications installed on the cluster: http://hpc.ilri.cgiar.org/list-of-software
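
     A minimal sketch of a typical environment-modules session; the module name & version are assumptions for illustration:

        $ module avail                   # list every application/version installed on the cluster
        $ module load samtools/1.3       # put samtools 1.3 on PATH & set its environment variables
        $ module list                    # show currently loaded modules
        $ module unload samtools/1.3     # remove it from the environment again
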
  10. Users & Groups
     • User accounts are managed by 389 Directory Server (LDAP) & authentication is done by the System Security Services Daemon (SSSD)
       ◦ SSSD also caches logins; faster logins than using pam_ldap
     • Consistent UIDs/GIDs across all nodes; you only need to log in once, on the head node (an identity-lookup sketch follows below)
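
     A minimal sketch of what those consistent, LDAP-backed identities look like from any node; the username & numeric IDs are made-up examples:

        $ getent passwd jdoe        # resolved through SSSD's LDAP backend (& its cache)
        jdoe:*:20345:20345:Jane Doe:/home/jdoe:/bin/bash
        $ id jdoe                   # same UID/GID on every compute & storage node
        uid=20345(jdoe) gid=20345(jdoe) groups=20345(jdoe),20400(bioinformatics)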