Research Computing at ILRI

James Oguya
November 29, 2016

A use-case presentation on Research Computing at the International Livestock Research Institute (ILRI).
Presented at the Science Gateways & Grid Infrastructures workshop at KENET, Nairobi.

Transcript

  1. Research Computing at ILRI
     Science Gateways & Grid Infrastructures Workshop, KENET
     James Oguya
     Nairobi, Kenya, November 2016
  2. What we do
     • Agricultural research at ILRI & its partners is aimed at producing healthier crops & livestock to alleviate poverty & hunger in the developing world through exploitation of the latest genome technologies
       ◦ such technologies require state-of-the-art high-performance computing infrastructure for downstream data analysis & storage
     • ILRI HPC serves the bioinformatics, statistics & geo-spatial computational needs of ILRI & its partners
     • By sharing the computational power of the HPC cluster, researchers have been & will be able to conduct more extensive & large-scale genomic research quickly & cost-effectively
  3. Where we came from (2003)
     • 32 dual-core compute nodes, a.k.a. 'thin' nodes
     • data storage over NFS to the 'master' node
     • compute jobs had to use MPI
       ◦ writing & debugging MPI code is not easy!
     • ran on the Rocks Cluster distro
     • all of this was revolutionary at that time!
  4. Where we are (2016)
     • Got rid of the original cluster
     • 5 compute nodes, 5 storage nodes, 1 virtualization server & 3 database servers
       ◦ compute nodes are a mixture of 'thick' & 'thin' nodes
       ◦ 184 CPU cores, 1.5TB RAM, ~80TB disk space
     • Compute & storage nodes run CentOS 6; the rest run Ubuntu
     • Fast 10GbE interconnects
     • World-class cluster setup & system administration
  5. Future plans
     • New GPU node(s)
     • Double the storage to ~160TB
     • 40GbE optical interconnects
     Only possible with a good source of funds...
  6. HPC infrastructure organization
     • The cluster is divided into 2 sections:
       ◦ Compute:
         ▪ lots of CPU cores & RAM
         ▪ minimal disk space
         ▪ job scheduling is done by the SLURM workload manager
       ◦ Storage:
         ▪ lots of disk space
         ▪ at least 8 CPU cores
         ▪ data is distributed & replicated across all storage servers
         ▪ GlusterFS distributed file system
     • User IDs, applications & data are available everywhere; a user's view of the compute section is sketched below.
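
     A minimal sketch of how the compute section appears to a user through SLURM; the partition and node names shown are assumptions for illustration, not ILRI's actual ones:

        $ sinfo                      # list partitions & node states on the cluster
        PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
        batch*       up   infinite      5   idle compute[01-05]
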
  7. How we use SLURM
     • The SLURM workload manager is an open-source, highly scalable cluster management & job scheduling system for Linux clusters
       ◦ it was conceived at Lawrence Livermore National Laboratory (LLNL)
       ◦ SLURM manages & accounts for computing resources
         ▪ users request CPU cores, memory & node(s)
       ◦ it queues & prioritizes jobs, logs resource usage, etc.
     • Users can submit 'batch' and/or 'interactive' jobs; a batch-job sketch follows below
       ◦ 'batch' jobs can be long-running jobs, invoke a program multiple times with different variables/options/arguments, etc.
       ◦ running an 'interactive' job is as easy as typing the interactive command!
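
     A minimal sketch of a 'batch' job script under SLURM; the program (blastn), the file names & the resource sizes are illustrative assumptions, not an actual ILRI workflow:

        #!/usr/bin/env bash
        #SBATCH --job-name=blast-demo      # name shown in the queue
        #SBATCH --cpus-per-task=4          # request 4 CPU cores on one node
        #SBATCH --mem=8G                   # request 8GB of RAM
        #SBATCH --output=blast-demo.log    # where stdout/stderr are written

        # run the analysis with as many threads as SLURM allocated
        blastn -num_threads "$SLURM_CPUS_PER_TASK" -query input.fa -db nt -out results.txt

     Such a script would be queued with 'sbatch blast-demo.sbatch'; the generic SLURM way to start an interactive shell is 'srun --pty bash', which is roughly what the cluster's interactive command provides.
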
  8. How we use GlusterFS
     • GlusterFS is an open-source, scalable, distributed network file system developed by Red Hat
     • It can do replicated, distributed, replicated+distributed, geo-replicated (off-site!) volumes, etc.
     • In the ILRI HPC we have 3 replicated GlusterFS volumes (see the mount sketch below):
       ◦ homes volume: contains users' home folders; mounted at /home
       ◦ apps volume: contains all applications; mounted at /export/apps
       ◦ data volume: contains data (databases, genomes, etc.) shared amongst users/groups; mounted at /export/data
     • Persistent directory paths: users can access data in GlusterFS volumes from any compute node/server within the cluster; the distribution is transparent
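
     A minimal sketch of how such replicated volumes are mounted on a client node with the native GlusterFS (FUSE) client; the storage server hostname 'storage01' is an assumption, not a real ILRI host:

        $ mount -t glusterfs storage01:/homes /home            # users' home folders
        $ mount -t glusterfs storage01:/apps  /export/apps     # shared applications
        $ mount -t glusterfs storage01:/data  /export/data     # shared datasets
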
  9. Managing applications
     • Applications are loaded & unloaded using environment modules (http://modules.sourceforge.net); see the session sketch below
     • Environment modules make it easy to support multiple application versions, libraries, dependencies, shell environment variables, etc.
     • Install once, use everywhere…
     • The list of applications installed on the cluster: http://hpc.ilri.cgiar.org/list-of-software
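
     A minimal sketch of a typical environment-modules session; the module name & version are assumptions for illustration:

        $ module avail                   # list every application/version installed on the cluster
        $ module load samtools/1.3       # put samtools 1.3 on PATH & set its environment variables
        $ module list                    # show currently loaded modules
        $ module unload samtools/1.3     # remove it from the environment again
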
  10. Users & Groups
     • User accounts are managed by 389 Directory Server (LDAP) & authentication is done by the System Security Services Daemon (SSSD)
       ◦ SSSD also caches logins; faster logins than using pam_ldap
     • Consistent UIDs/GIDs across all nodes; you only need to log in once, on the head node (an identity-lookup sketch follows below)
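
     A minimal sketch of what those consistent, LDAP-backed identities look like from any node; the username & numeric IDs are made-up examples:

        $ getent passwd jdoe        # resolved through SSSD's LDAP backend (& its cache)
        jdoe:*:20345:20345:Jane Doe:/home/jdoe:/bin/bash
        $ id jdoe                   # same UID/GID on every compute & storage node
        uid=20345(jdoe) gid=20345(jdoe) groups=20345(jdoe),20400(bioinformatics)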