Research Computing at ILRI

James Oguya
November 29, 2016

A use-case presentation on Research Computing at the International Livestock Research Institute (ILRI).
Presented at the Science Gateways & Grid Infrastructures workshop at KENET, Nairobi.

Transcript

  1. Research Computing at ILRI
     Science Gateways & Grid Infrastructures Workshop, KENET
     James Oguya
     Nairobi, Kenya, November 2016
  2. What we do
     • Agricultural research at ILRI & its partners is aimed at producing healthier crops & livestock to alleviate poverty & hunger in the developing world through exploitation of the latest genome technologies
       ◦ such technologies require state-of-the-art high-performance computing infrastructure for downstream data analysis & storage
     • The ILRI HPC serves the bioinformatics, statistics & geo-spatial computational needs of ILRI & its partners
     • By sharing the computational power of the HPC, researchers have been & will be able to conduct more extensive & large-scale genomic research quickly & cost-effectively
  3. Where we came from (2003)
     • 32 dual-core compute nodes, a.k.a. 'thin' nodes
     • Data storage served over NFS from the 'master' node
     • Compute jobs had to use MPI
       ◦ writing & debugging MPI code is not easy!
     • Ran on the Rocks Cluster distro
     • All this was revolutionary at that time!
  4. Where we are (2016)
     • Got rid of the original cluster
     • 5 compute nodes, 5 storage nodes, 1 virtualization server & 3 database servers
       ◦ compute nodes are a mixture of 'thick' & 'thin' nodes
       ◦ 184 CPU cores, 1.5TB RAM, ~80TB disk space
     • Compute & storage nodes run CentOS 6; the rest run Ubuntu
     • Fast 10GbE interconnects
     • World-class cluster setup & system administration
  5. Future plans
     • New GPU node(s)
     • Double the storage to ~160TB
     • 40GbE optical interconnects
     Only possible with a good source of funds...
  6. HPC infrastructure organization
     • The cluster is divided into 2 sections:
       ◦ Compute:
         ▪ lots of CPU cores & RAM
         ▪ minimal disk space
         ▪ job scheduling done by the SLURM workload manager
       ◦ Storage:
         ▪ lots of disk space
         ▪ at least 8 CPU cores
         ▪ data is distributed & replicated across all storage servers
         ▪ GlusterFS distributed file system
     • User IDs, applications & data are available everywhere (see the sketch below).
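
     From the head node, a user can inspect both sections with a few standard commands; the node name below is illustrative, and the output depends on the cluster's actual partition & volume layout:

        sinfo                                  # SLURM partitions & compute node states
        scontrol show node compute01           # cores & memory of one compute node (name is illustrative)
        df -h /home /export/apps /export/data  # the GlusterFS-backed mounts
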
  7. How we use SLURM
     • SLURM is an open-source, highly scalable cluster management & job scheduling system for Linux clusters
       ◦ it was conceived at Lawrence Livermore National Laboratory (LLNL)
       ◦ SLURM manages & accounts for computing resources
         ▪ users request CPU cores, memory & node(s)
       ◦ it queues & prioritizes jobs, logs resource usage, etc.
     • Users can submit 'batch' and/or 'interactive' jobs (see the sketch below)
       ◦ 'batch' jobs can be long-running & can invoke a program multiple times with different variables/options/arguments, etc.
       ◦ running an 'interactive' job is as easy as typing the interactive command!
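
     A minimal sketch of a 'batch' job script; the partition name, module version & BLAST database are illustrative, not the actual ILRI settings:

        #!/bin/bash
        #SBATCH --job-name=blast-demo        # name shown in the queue
        #SBATCH --partition=batch            # hypothetical partition name
        #SBATCH --cpus-per-task=4            # request 4 CPU cores
        #SBATCH --mem=8G                     # request 8 GB of RAM
        #SBATCH --output=blast-demo.%j.log   # %j expands to the job ID

        # load the application via environment modules (version is illustrative)
        module load blast/2.2.30

        blastn -num_threads "${SLURM_CPUS_PER_TASK}" -query query.fa -db nt -out results.txt

     Submit it with 'sbatch blast-demo.sbatch' and monitor it with 'squeue'; a site's interactive command is typically a thin wrapper around something like 'srun --pty bash'.
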
  8. How we use GlusterFS
     • GlusterFS is an open-source, scalable distributed network file system developed by Red Hat
     • Supports replicated, distributed, distributed+replicated, geo-replicated (off-site!) volumes, etc.
     • In the ILRI HPC, we have 3 GlusterFS replicated volumes:
       ◦ homes volume: contains users' home folders; mounted on /home
       ◦ apps volume: contains all applications; mounted on /export/apps
       ◦ data volume: contains data (databases, genomes, etc.) shared amongst users/groups; mounted on /export/data
     • Persistent directory paths: users can access data in GlusterFS volumes from any compute node/server within the cluster; distribution transparency (see the sketch below)
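
     A minimal sketch of how one such replicated volume could be created & mounted; the storage hostnames (store01..store03) & brick paths are placeholders, not the actual ILRI layout:

        # on a storage node: create & start a 3-way replicated volume,
        # one brick per storage server
        gluster volume create homes replica 3 \
            store01:/bricks/homes store02:/bricks/homes store03:/bricks/homes
        gluster volume start homes

        # on a compute node: mount the volume where users expect their home folders
        mount -t glusterfs store01:/homes /home
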
  9. Managing applications
     • Applications are loaded & unloaded using Environment Modules (http://modules.sourceforge.net)
     • Environment Modules makes it easy to support multiple application versions, libraries, dependencies, shell environment variables, etc. (see the sketch below)
     • Install once, use everywhere...
     • List of applications installed on the cluster: http://hpc.ilri.cgiar.org/list-of-software
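
     Typical day-to-day usage looks like this; the module name & version are illustrative:

        module avail                # list every application installed on the cluster
        module load samtools/1.3    # put that version of samtools on the PATH
        module list                 # show what is currently loaded
        module unload samtools/1.3  # remove it again
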
  10. Users & Groups
     • User accounts are managed in 389 Directory Server (LDAP) & authentication is done by the System Security Services Daemon (SSSD); see the sketch below
       ◦ SSSD also caches logins; faster logins than using pam_ldap
     • Consistent UIDs/GIDs across all nodes; you only need to log in once, on the head node
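
     A minimal sketch of the relevant part of /etc/sssd/sssd.conf for an LDAP domain with login caching; the domain name, server URI & search base are placeholders, not the actual ILRI values:

        [sssd]
        services = nss, pam
        domains = ilri

        [domain/ilri]
        # resolve users/groups & authenticate against the LDAP server
        id_provider = ldap
        auth_provider = ldap
        # placeholder server & search base
        ldap_uri = ldap://ldap.example.org
        ldap_search_base = dc=example,dc=org
        # cache credentials so repeat logins stay fast
        cache_credentials = True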