DevOps in HPC

Over the past 5 years, Livermore Computing at LLNL has made significant strides in how we manage and deploy clusters and infrastructure in support of our HPC systems. We've migrated to newer technologies such as Git and GitLab for managing code repositories (e.g., configuration management), added CI/CD processes for the first time, and integrated new deployment approaches such as using GitLab CI + Ansible + containers to deploy backend services. The result is a significant improvement in the robustness of our cluster management processes, as well as quicker detection and remediation of issues and potential issues. This talk will provide an overview of these improvements, discuss the challenges and opportunities they've presented, and outline our plans going forward.

Ian Lee

May 21, 2024

Transcript

  1. DevOps in HPC
     4th NIST HPC Security Workshop
     Ian Lee, HPC Security Architect
     2024-05-21
     LLNL-PRES-850669. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
  2. More Complete View of LC
     [Figure: bubble chart of LC systems; labels: CZ, RZ, SCF, SNSI, Green]
     • Each bubble represents a managed cluster or system
     • Size reflects number of nodes; color indicates type of system
     • All production systems are shown, including non-user-facing systems
  3. Tri-Lab Operating System Stack (TOSS)
     § A common operating system and computing environment for HPC clusters
       - Based on RHEL operating system
       - Modified RHEL Kernel
     § Methodology for building, quality assurance, integration, and configuration management
     § Add in customization for HPC specific needs
       - Consistent source and software across architectures: Intel, PowerPC, and ARM
       - High speed interconnect
       - Very large filesystems
  4. LC Example: Elastic Cluster Deployment
     § GitLab CI + Ansible configuration
       - Separate from the TOSS Ansible repo
     § Very fast to destroy and rebuild the Elastic clusters
     § Straightforward to scale up the service to meet demand
       - Now running 6 production Elastic deployments
  5. Tools / Techniques for Configuration Management
     § Cfengine3 -> Ansible
     § SVN -> Git + GitLab
     § Branch and merge workflow
       - Automated tests must pass
       - CODEOWNERS (in progress)
       - Code review not required (yet)
     Challenges / Quirks
     § HPC Clusters are traditionally pets, not cattle
       - Implementations differ by Sys Admin
     § Multiple network zones / enclaves
       - Airgaps / one way links
       - Use git branching to store local changes
  6. GitLab CI: Administrative Controls into Technical Controls
     § Historically, practices are passed down verbally, rules are enforced by individuals
     § CI brings rigor and repeatability to software development practices
     § New developers learn from automated pipelines, as they are committing
       - GitLab CI in LC configuration management
  7. Configuration Management CI/CD - Today
     § Whitespace check
     § Generate managed files
       - Genders
       - Sudoers files
     § Linting
       - Markdown
       - Python
       - YAML
     § Logic / Quality Checks
       - TOSS 3 -> 4 SSH Configs
       - Ansible inventory logic
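The "Whitespace check" stage listed on this slide can be sketched in a few lines of Python. This is only an illustration of what such a CI job might test, not the pipeline LC actually runs; the function name `find_whitespace_problems` and the three specific checks are assumptions.

```python
from typing import List, Tuple


def find_whitespace_problems(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for common whitespace issues.

    Hypothetical helper: the checks chosen here (CRLF endings, trailing
    whitespace, missing final newline) are illustrative assumptions.
    """
    problems: List[Tuple[int, str]] = []
    lines = text.split("\n")
    for number, line in enumerate(lines, start=1):
        if line.endswith("\r"):
            problems.append((number, "CRLF line ending"))
            line = line[:-1]  # judge trailing whitespace without the CR
        if line != line.rstrip(" \t"):
            problems.append((number, "trailing whitespace"))
    # A non-empty file should end with exactly a newline character.
    if text and not text.endswith("\n"):
        problems.append((len(lines), "missing newline at end of file"))
    return problems


if __name__ == "__main__":
    sample = "ok\nbad \nlast"
    for number, problem in find_whitespace_problems(sample):
        print(f"line {number}: {problem}")
```

A CI job could run a check like this over every changed file and exit non-zero when the returned list is non-empty, which is what turns a style convention into an enforced technical control.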
  8. Configuration Management CI/CD - Future
     Process Changes
     § Automated deployment to systems
     § Mandatory Code Review
     § Deploy center Vault to manage secrets
       - Deployed and then moved away from Vault
     § Configuration validation (Network rules)
     Challenges
     § Testing ALL the combinations of systems
     § What about changes at 02:00?
     § How to manage unlocking the Vault?
     § Where is the source of truth?
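As one possible reading of "Configuration validation (Network rules)", the sketch below checks that each rule's source network is well formed and falls inside the address range of the zone it claims to belong to. Everything specific here is an assumption made for illustration: the zone names, the address ranges (RFC 5737 documentation ranges), the rule layout, and the function name `check_rules`. Only the standard-library `ipaddress` module is used.

```python
import ipaddress
from typing import Dict, List

# Hypothetical zone-to-network map; the ranges are documentation addresses.
ZONE_NETWORKS = {
    "zone-a": ipaddress.ip_network("192.0.2.0/24"),
    "zone-b": ipaddress.ip_network("198.51.100.0/24"),
}


def check_rules(rules: List[Dict[str, str]]) -> List[str]:
    """Return one error string per rule that fails validation."""
    errors: List[str] = []
    for index, rule in enumerate(rules):
        zone = rule.get("zone", "")
        allowed = ZONE_NETWORKS.get(zone)
        if allowed is None:
            errors.append(f"rule {index}: unknown zone {zone!r}")
            continue
        raw = rule.get("source", "")
        try:
            # strict=True rejects networks with host bits set.
            source = ipaddress.ip_network(raw, strict=True)
        except ValueError:
            errors.append(f"rule {index}: invalid source {raw!r}")
            continue
        # subnet_of() requires both networks to share an IP version.
        if source.version != allowed.version or not source.subnet_of(allowed):
            errors.append(f"rule {index}: {source} is outside {zone}")
    return errors


if __name__ == "__main__":
    rules = [
        {"zone": "zone-a", "source": "192.0.2.16/28"},
        {"zone": "zone-a", "source": "198.51.100.0/25"},
    ]
    print(check_rules(rules))
```

Running a check of this kind in CI before deployment would catch a mistyped or misplaced rule at merge time rather than on a production system.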
  9. "Automate the Boring Stuff"
     § Single Source of Truth
       - Bring key information into Ansible CM
       - Validate it!
     § Automatic update of CMDB, and other databases
     § Drive actions from system data
       - pulse.py
     § Testing and Verification of configurations and business rules
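The "Validate it!" step for a single source of truth might look something like the following sketch, which checks a list of host records against a few simple business rules. The record schema (`hostname`, `zone`, `toss_version`), the zone subset, and the function name `validate_inventory` are hypothetical; the slides do not describe the real inventory format or what `pulse.py` does.

```python
from typing import Dict, List

# Hypothetical schema and zone list, chosen only for illustration.
REQUIRED_FIELDS = ("hostname", "zone", "toss_version")
KNOWN_ZONES = {"CZ", "RZ", "SCF"}


def validate_inventory(hosts: List[Dict[str, str]]) -> List[str]:
    """Return human-readable errors for records that break the rules."""
    errors: List[str] = []
    seen = set()
    for index, host in enumerate(hosts):
        name = host.get("hostname")
        label = name or f"entry {index}"
        # Rule 1: every required field must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if not host.get(field):
                errors.append(f"{label}: missing {field}")
        # Rule 2: the zone must be one we know about.
        zone = host.get("zone")
        if zone and zone not in KNOWN_ZONES:
            errors.append(f"{label}: unknown zone {zone!r}")
        # Rule 3: hostnames must be unique across the inventory.
        if name:
            if name in seen:
                errors.append(f"{label}: duplicate hostname")
            seen.add(name)
    return errors


if __name__ == "__main__":
    hosts = [
        {"hostname": "a1", "zone": "CZ", "toss_version": "4"},
        {"hostname": "a1", "zone": "XX", "toss_version": "4"},
    ]
    for error in validate_inventory(hosts):
        print(error)
```

Returning a list of errors rather than raising on the first failure lets a pipeline report every problem in one run, which matters when the data feeds downstream systems such as a CMDB.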
  10. Community Work
      § https://github.com/llnl/cmvl (WIP)
        - Repository of Elastic, Splunk, etc. queries, dashboards, and visualizations
      § https://github.com/LLNL/elastic-stacker
        - Export saved objects from Kibana for sharing
      § https://github.com/LLNL/toss-configs (WIP)
        - Configuration files and scripts for setting up and maintaining TOSS HPC systems
      § https://github.com/llnl/toss-stig
        - Ansible implementation of the TOSS STIG
  11. Livermore Computing
      Platforms, software, and consulting to enable world-class scientific simulation
      Mission: Enable national security missions, support groundbreaking computational science, and advance High Performance Computing
      Platforms: Sierra #12, El Cap (early delivery) #46, rzAdams #47, Tuolumne #48, Lassen #57, Dane #130, Bengal #151, rzVernal #166, Tioga #187, Ruby #232, Tenaya #259, Magma #269, Jade #368, Quartz #369, and many smaller compute clusters and infrastructure systems
  12. HPC Zones – Wider View
      [Figure labels: Open Compute Facility, Collaboration Zone, Internet, LLNL Enterprise, Restricted Zone, Infrastructure Zone, GDO / ATTB (DMZ), Secure Compute Facility, FIS]
  13. Coming From
      § Various instances of Atlassian tools
        - Programmatic (LC, NIF, etc; classified and unclassified)
        - Institutional (MyConfluence, MyJira, MyBitbucket)
      § End users responsible for their own CI/CD
        - Mostly Hudson / Jenkins servers
      § LC Bamboo Service (~ 2017 – 2022)
  14. Today
      § Various instances of Atlassian tools
        - Programmatic (LC, NIF, etc; classified and unclassified)
        - Institutional (MyConfluence, MyJira, MyBitbucket)
      § End users responsible for their own CI/CD
        - Mostly Hudson / Jenkins servers
      § Programmatic GitLab Servers
        - LC, WCI, NIF, GS, NARAC, SO, LivIT
  15. Going Forward
      § Various instances of Atlassian tools
        - Programmatic (LC, NIF, etc; classified and unclassified)
        - Institutional (MyConfluence, MyJira, MyBitbucket)
      § Programmatic and Institutional GitLab Servers
        - LC, WCI, NIF, GS, NARAC, SO, LivIT
        - Institutional GitLab Server?
          • Projects configure runners for use with their projects