Building DevOps into HPC System Administration

Ian Lee
June 28, 2023

Over the past five years, Livermore Computing at LLNL has made significant strides in how we manage and deploy clusters and infrastructure in support of our HPC systems. We've migrated to newer technologies such as Git and GitLab for managing code repositories (e.g. configuration management), added CI/CD processes for the first time, and adopted new deployment approaches such as using GitLab CI + Ansible + Containers to deploy backend services. The end result is a significant improvement in the robustness of our cluster management processes, as well as quicker detection and remediation of actual and potential issues. This talk provides an overview of these improvements, the challenges and opportunities they've brought, and our plans going forward.


  1. Building DevOps into HPC System Administration

    NLIT 2023
    Ian Lee, HPC Security Architect
    2023-06-28
    LLNL-PRES-850669. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
  2. Livermore Computing

    Platforms, software, and consulting to enable world-class scientific simulation
    Mission: Enable national security missions, support groundbreaking computational science, and advance High Performance Computing
    Platforms: Sierra #6, Lassen #36, rzVernal #116, Tioga #132, Ruby #171, Tenaya #194, Magma #203, Jade #296, Quartz #297, and many smaller compute clusters and infrastructure systems
  3. User Centric View of LC

    • Each bubble represents a cluster
    • Size reflects theoretical peak performance, color indicates computing zone
    • Only large user-facing compute clusters are represented
  4. More Complete View of LC

    • Each bubble represents a managed cluster or system
    • Size reflects number of nodes, color indicates type of system
    • All production systems are shown, including non-user-facing systems
    [Figure labels: SNSI, RZ, SCF, CZ, Green]
  5. El Capitan vs the Rest

    • Each bubble represents a cluster
    • Size reflects theoretical peak performance, color indicates computing zone
    • Only large user-facing compute clusters are represented
    El Capitan ~ 2 EF (~ 2,000,000 TF)
  6. HPC Zones – Wider View

    [Diagram labels: Open Compute Facility, Collaboration Zone, Internet, LLNL Enterprise, Restricted Zone, Infrastructure Zone, GDO / ATTB (DMZ), Secure Compute Facility, FIS]
  7. Tri-Lab Operating System Stack (TOSS)

    § A common operating system and computing environment for HPC clusters
      - Based on RHEL operating system
      - Modified RHEL Kernel
    § Methodology for building, quality assurance, integration, and configuration management
    § Add in customization for HPC specific needs
      - Consistent source and software across architectures: Intel, PowerPC, and ARM
      - High speed interconnect
      - Very large filesystems
  8. LC Example: Elastic Cluster Deployment

    § GitLab CI + Ansible configuration (separate from the TOSS Ansible repo)
    § Very fast to destroy and rebuild the Elastic clusters
    § Straightforward to scale up the service to meet demand
    § Hopefully you caught James’ talk this morning for all the details!
  9. Tools / Techniques for Configuration Management

    § Cfengine3 -> Ansible
    § SVN -> Git
    § Branch and merge workflow
      - Code review not required
      - Automated tests must pass
    Challenges / Quirks
    § HPC Clusters are traditionally pets, not cattle
      - Implementations differ by Sys Admin
    § Multiple network zones / enclaves
      - Airgaps / one way links
      - Use git branching to store local changes
  10. GitLab CI: Administrative Controls into Technical Controls

    § Historically, practices are passed down verbally, rules are enforced by individuals
    § CI brings rigor and repeatability to software development practices
    § New developers learn from automated pipelines, as they are committing
      - GitLab CI in LC configuration management
  11. Configuration Management CI/CD - Today

    § Whitespace check
    § Generate managed files
      - Genders
      - Sudoers files
    § Linting
      - YAML
      - Python
    § Logic Checks
      - TOSS 3 -> 4 SSH Configs
      - Ansible inventory logic
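
A whitespace check like the first item on this slide can be a very small CI job. The following is a minimal sketch, not LC's actual pipeline: the function names and the specific rules (trailing whitespace, missing final newline) are assumptions chosen for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def find_whitespace_problems(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for one file's contents."""
    problems = []
    lines = text.split("\n")
    for number, line in enumerate(lines, start=1):
        # rstrip() also removes a stray "\r", so CRLF endings are flagged too.
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
    if text and not text.endswith("\n"):
        problems.append((len(lines), "no newline at end of file"))
    return problems


def check_files(paths: Iterable[str]) -> int:
    """Print every problem found and return the total count.

    A CI job could fail the pipeline whenever this count is non-zero.
    """
    total = 0
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for number, problem in find_whitespace_problems(text):
            print(f"{path}:{number}: {problem}")
            total += 1
    return total
```

In a pipeline, a wrapper script would call `check_files` on the changed files and exit non-zero if the returned count is greater than zero.
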
  12. Configuration Management CI/CD - Future

    Process Changes
    § Automated deployment to systems
    § Mandatory Code Review
    § Deploy center Vault to manage secrets
    § Configuration validation (Network rules)
    Challenges
    § Testing ALL the combinations of systems
    § What about changes at 02:00?
    § How to manage unlocking the Vault?
    § Where is the source of truth?
  13. Security Baseline

    § Security requirements often quite prescriptive
      - STIG > CIS Benchmark > Vendor Guideline > generic NIST 800-53 controls
    § Developed a STIG for the TOSS operating system with DISA
      - Inspired by RHEL 8 STIG, which TOSS 4 is derived from
      - Small tweaks: adjust some DoD specific language to make it compatible with other Gov agencies
      - Larger requests: no explicit allow-listing of software on TOSS, being a software development OS
      - HPC specific: RHEL STIG says 10 concurrent sessions for DoS reasons, TOSS STIG allows 256
    § Need to regularly check and validate configuration
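
The concurrent-session limit is a good example of a setting that can be checked automatically. The sketch below assumes the limit is expressed as a `maxlogins` entry in a `limits.conf`-style file, which is how the RHEL STIG requirement is commonly implemented; the function names are hypothetical, and taking the last matching entry is a simplification of this sketch rather than a statement about how PAM resolves conflicts.

```python
from typing import Optional


def effective_maxlogins(limits_text: str) -> Optional[int]:
    """Return the hard maxlogins value set for all users ('*'), if any."""
    value = None
    for raw in limits_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        # Entries have four fields: <domain> <type> <item> <value>
        if len(fields) != 4:
            continue
        domain, kind, item, setting = fields
        if domain == "*" and kind in ("hard", "-") and item == "maxlogins":
            if setting.isdigit():
                value = int(setting)  # this sketch keeps the last match
    return value


def session_limit_ok(limits_text: str, ceiling: int) -> bool:
    """True when a limit is configured and does not exceed the ceiling."""
    value = effective_maxlogins(limits_text)
    return value is not None and value <= ceiling
```

With `ceiling=256` this accepts the TOSS value from the slide, while `ceiling=10` would apply the stricter RHEL value to the same file.
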
  14. Continuous Monitoring

    § LC HPC is the gold standard for continuous monitoring at LLNL
    § Aligns with federal trends towards continuous monitoring
    § Reduce burden of manual processes on sys admins, shifting those efforts to automation and alerting
      - Let SysAdmins focus on the engineering work
  15. “Automate the Boring Stuff”

    § Single Source of Truth
      - Bring key information into Ansible CM
      - Validate it!
    § Automatic update of CMDB, and other databases
    § Drive actions from system data
    § Testing and Verification of configurations and business rules
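
Validating a single source of truth usually means running a set of business rules over the inventory data before anything is driven from it. The following is a minimal sketch under assumed conventions: the record fields (`name`, `zone`, `os`), the rules, and the function name are all hypothetical and are not taken from LC's Ansible configuration.

```python
from typing import Dict, Iterable, List, Set

# Fields every host record must carry (an assumption for this sketch).
REQUIRED_FIELDS = ("name", "zone", "os")


def validate_inventory(hosts: Iterable[Dict[str, str]],
                       allowed_zones: Set[str]) -> List[str]:
    """Return a list of human-readable rule violations (empty if clean)."""
    errors = []
    seen = set()
    for index, host in enumerate(hosts):
        label = host.get("name") or f"<entry {index}>"
        for field in REQUIRED_FIELDS:
            if not host.get(field):
                errors.append(f"{label}: missing {field}")
        zone = host.get("zone")
        if zone and zone not in allowed_zones:
            errors.append(f"{label}: unknown zone {zone!r}")
        name = host.get("name")
        if name:
            if name in seen:
                errors.append(f"{label}: duplicate host name")
            seen.add(name)
    return errors
```

A CI job could run this over the inventory on every commit and fail when the returned list is non-empty, so that downstream consumers such as a CMDB only ever see validated data.
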
  16. What is Elastic?

    § Elastic Stack is a collection of software for logging, reporting, and visualization of data
      - Formerly “ELK stack” or just “ELK”
    § Consists of Elasticsearch, Logstash, Kibana, Beats, and more
    § Open source components, commercial support, similar idea to GitLab
  17. Current Elastic Hardware

    ~ 112TB NVMe (total), ~ 2PB HDD (total)
    [Diagram of three groups of nodes: Myelin, Axon, Centrebrain, plus an F5]
    Myelin
      - Myelin1 (mgmt)
      - Myelin2, Myelin3: 32GB / 8 core; Master, data_hot, data_ingest; ~ 27TB NVMe each
      - Myelin4, Myelin5: 32GB / 8 core; data_warm
      - 90x 16TB HDD JBOD: 45x HDD ~ 512TB, 45x HDD ~ 512TB
    Axon
      - Axon1 (mgmt)
      - Axon2, Axon3: 32GB / 8 core; Master, data_hot, data_ingest; ~ 27TB NVMe each
      - Axon4, Axon5: 32GB / 8 core; data_warm
      - 90x 16TB HDD JBOD: 45x HDD ~ 512TB, 45x HDD ~ 512TB
    Centrebrain
      - Centrebrain1 (mgmt)
      - Centrebrain2 (Dedicated Master Node): Master, voting_only
      - Centrebrain3 (Monitoring “cluster”)
    Other size labels in the diagram whose placement is not recoverable from the transcript: 16-32GB / 8-12 core, 4-8GB / 2 core, 32GB / 8 core (Master, data_hot, data_ingest), 16GB / 8 core, 4GB / 2 core, 8GB / 8 core
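
The "~ 2PB HDD (total)" figure can be cross-checked against the per-group numbers on this slide. This is a rough sketch that assumes the diagram shows four 45-drive groups in total (two per JBOD, one JBOD each for Myelin and Axon); that reading of the layout is an assumption, and the function name is hypothetical.

```python
# Figures taken from the slide; the group count is an assumed reading of the diagram.
HDD_GROUPS = 4              # two 45-drive groups on each of two JBODs (assumption)
DRIVES_PER_GROUP = 45
DRIVE_TB = 16
USABLE_TB_PER_GROUP = 512   # "~ 512TB" per 45-drive group


def hdd_totals() -> dict:
    """Return raw and usable HDD capacity in TB, plus their ratio."""
    raw_tb = HDD_GROUPS * DRIVES_PER_GROUP * DRIVE_TB
    usable_tb = HDD_GROUPS * USABLE_TB_PER_GROUP
    return {
        "raw_tb": raw_tb,
        "usable_tb": usable_tb,
        "usable_fraction": usable_tb / raw_tb,
    }


totals = hdd_totals()
print(totals["raw_tb"])     # 2880
print(totals["usable_tb"])  # 2048
```

Under that assumption the usable total is 2048 TB, which is consistent with the slide's "~ 2PB", against 2880 TB of raw drive capacity.
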
  18. Coming From

    § Various instances of Atlassian tools
      - Programmatic (LC, NIF, etc; classified and unclassified)
      - Institutional (MyConfluence, MyJira, MyBitbucket)
    § End users responsible for their own CI/CD
      - Mostly Hudson / Jenkins servers
    § LC Bamboo Service (~ 2017 – 2022)
  19. Today

    § Various instances of Atlassian tools
      - Programmatic (LC, NIF, etc; classified and unclassified)
      - Institutional (MyConfluence, MyJira, MyBitbucket)
    § End users responsible for their own CI/CD
      - Mostly Hudson / Jenkins servers
    § Programmatic GitLab Servers
      - LC, WCI, NIF, GS, NARAC, SO, LivIT
  20. Going Forward

    § Various instances of Atlassian tools
      - Programmatic (LC, NIF, etc; classified and unclassified)
      - Institutional (MyConfluence, MyJira, MyBitbucket)
    § Programmatic and Institutional GitLab Servers
      - LC, WCI, NIF, GS, NARAC, SO, LivIT
      - Institutional GitLab Server?
        • Projects configure runners for use with their projects