
Research Computing at ILRI

Presentation about ILRI's research-computing platform, given at the CGIAR ICT Managers Meeting in March 2014.

http://hpc.ilri.cgiar.org

Alan Orth

March 05, 2014

Transcript

  1. Research Computing at ILRI
    Alan Orth, Sys Admin, ILRI, Kenya
    March 5, 2014
  2. Where we came from (2003)
    - 32 dual-core compute nodes
    - 32 * 2 != 64
    - Writing MPI code is hard!
    - Data storage over NFS to “master” node
    - “Rocks” cluster distro
    - Revolutionary at the time!
  3. Where we came from (2010)
    - Most of the original cluster removed
    - Replaced with a single Dell PowerEdge R910
    - 64 cores, 8 TB storage, 128 GB RAM
    - Threading is easier* than MPI!
    - Data is local
    - Easier to manage!
  4. To infinity and beyond (2013)
    - A little bit back to the “old” model
    - Mixture of “thin” and “thick” nodes
    - Networked storage
    - Pure CentOS
    - Supermicro boxen
    - Pretty exciting!
  5. Primary characteristics
    - Computational capacity
    - Data storage
  6. Platform
    - 152 compute cores
    - 32* TB storage
    - 700 GB RAM
    - 10 GbE interconnects
    - LTO-4 tape backups (LOL?)
  7. Homogeneous computing environment
    User IDs, applications, and data are available everywhere.
  8. Scaling out storage with GlusterFS
    - Developed by Red Hat
    - Abstracts backend storage (file systems, technology, etc.)
    - Can do replicate, distribute, replicate+distribute, geo-replication (off site!), etc.
    - Scales “out”, not “up”
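As context for the slide above: a replicated GlusterFS volume is created from "bricks" (directories) on two or more servers using the standard gluster CLI. A minimal sketch, not from the slides; the hostnames and brick paths are assumptions:

```shell
# Create a two-way replicated volume from one brick on each server,
# then start it (hostnames and brick paths are hypothetical).
gluster volume create homes replica 2 \
    wingu0:/bricks/homes wingu1:/bricks/homes
gluster volume start homes

# Clients mount the volume with the native FUSE client:
mount -t glusterfs wingu1:/homes /home
```

Because the replica set is defined at the volume layer, any client sees the same data regardless of which server it mounts from.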
  9. How we use GlusterFS
    [aorth@hpc: ~]$ df -h
    Filesystem     Size  Used  Avail  Use%  Mounted on
    ...
    wingu1:/homes  31T   9.5T  21T    32%   /home
    wingu0:/apps   31T   9.5T  21T    32%   /export/apps
    wingu1:/data   31T   9.5T  21T    32%   /export/data
    - Persistent paths for homes, data, and applications across the cluster.
    - These volumes are replicated, so essentially application-layer RAID1.
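Mounts like the ones in the df output above would typically be made persistent in /etc/fstab on each node; a sketch using the server:volume pairs shown, with mount options that are assumptions:

```
wingu1:/homes  /home         glusterfs  defaults,_netdev  0 0
wingu0:/apps   /export/apps  glusterfs  defaults,_netdev  0 0
wingu1:/data   /export/data  glusterfs  defaults,_netdev  0 0
```

The `_netdev` option delays mounting until the network is up, which matters for network file systems at boot.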
  10. GlusterFS <3 10GbE

  11. SLURM
    - Project from Lawrence Livermore National Labs (LLNL)
    - Manages resources
    - Users request CPU, memory, and node allocations
    - Queues / prioritizes jobs, logs usage, etc.
    - More like an accountant than a bouncer
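Requesting resources from the scheduler looks like this in practice; a hedged sketch (the resource figures and the command being run are assumptions, not from the slides):

```shell
# Ask SLURM for one node, 8 CPUs, and 16000 MB of memory,
# and run a command inside that allocation:
srun --nodes=1 --cpus-per-task=8 --mem=16000 blastn -version
```

SLURM records the allocation against the user's account either way, which is the "accountant, not bouncer" point: it tracks and prioritizes rather than walling users off.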
  12. Topology

  13. How we use SLURM
    - Can submit “batch” jobs (long-running jobs, invoke a program many times with different variables, etc.)
    - Can run “interactively” (something that needs keyboard interaction)
    Make it easy for users to do the “right thing”:
    [aorth@hpc: ~]$ interactive -c 10
    salloc: Granted job allocation 1080
    [aorth@compute0: ~]$
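A "batch" submission as described above is just a shell script with #SBATCH directives; a minimal sketch (the job name, resource figures, and BLAST invocation are assumptions):

```shell
#!/bin/bash
#SBATCH --job-name=blastn-test
#SBATCH --cpus-per-task=4
#SBATCH --mem=8000
#SBATCH --output=blastn-%j.out

# Load the application from the shared modules tree, then run it
module load blast/2.2.28+
blastn -query query.fa -db nt -num_threads 4 -out results.txt
```

Submitted with `sbatch script.sh`. The `interactive -c 10` wrapper shown on the slide is presumably a convenience around the equivalent `salloc`/`srun` invocation, dropping the user into a shell on a compute node.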
  14. Managing applications
    - Environment modules - http://modules.sourceforge.net
    - Dynamically load support for packages in a user’s environment
    - Makes it easy to support multiple versions, complicated packages with $PERL5LIB, package dependencies, etc.
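Under Environment Modules, each application version gets its own modulefile (a small Tcl script) that adjusts the user's environment on `module load`. A minimal sketch of what one might look like here; the install prefix follows the /export/apps layout used elsewhere in the deck, and everything else is an assumption:

```tcl
#%Module1.0
## Hypothetical modulefile for blast/2.2.28+
set root /export/apps/blast/2.2.28+
prepend-path PATH $root/bin
```

More involved packages would also prepend to variables like $PERL5LIB or $LD_LIBRARY_PATH, which is exactly the "complicated packages" case the slide mentions.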
  15. Managing applications
    Install once, use everywhere...
    [aorth@hpc: ~]$ module avail
    blast  blast/2.2.25+  blast/2.2.26  blast/2.2.26+  blast/2.2.28+
    [aorth@hpc: ~]$ module load blast/2.2.28+
    [aorth@hpc: ~]$ which blastn
    /export/apps/blast/2.2.28+/bin/blastn
    Works anywhere on the cluster!
  16. Users and Groups
    - Consistent UIDs/GIDs across systems
    - LDAP + SSSD (also from Red Hat) is a great match
    - 389 LDAP works great with CentOS
    - SSSD is simpler than pam_ldap and does caching
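For a setup like this, a minimal /etc/sssd/sssd.conf on each node might look roughly as follows; the server URI and search base are invented placeholders, not the cluster's real values:

```ini
[sssd]
services = nss, pam
domains = default

[domain/default]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://ldap.example.org
ldap_search_base = dc=example,dc=org
cache_credentials = true
```

`cache_credentials` is the caching advantage over pam_ldap mentioned above: users can still authenticate on a node even if the LDAP server is briefly unreachable.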
  17. More information and contact
    a.orth@cgiar.org
    http://hpc.ilri.cgiar.org/