
Building DevOps into HPC System Administration


Over the past 5 years, Livermore Computing at LLNL has made significant strides in how we manage and deploy clusters and infrastructure in support of our HPC systems. We've seen migrations to newer technologies like Git and GitLab for managing code repositories (e.g. configuration management), the addition of CI/CD processes for the first time, and integration with new deployment approaches such as using GitLab CI + Ansible + Containers to deploy backend services. The end result is a significant improvement in the robustness of our cluster management processes, as well as quicker detection and remediation of both actual and potential issues. This talk provides an overview of these improvements, the challenges and opportunities they've presented, and our plans going forward.

Ian Lee

June 28, 2023



Transcript

  1. LLNL-PRES-850669
    This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
    Building DevOps into HPC System Administration
    NLIT 2023
    Ian Lee
    HPC Security Architect
    2023-06-28


  2. 2
    LLNL-PRES-850669
    Livermore Computing
    Platforms, software, and consulting to enable world-class scientific simulation
    Mission: Enable national security missions, support groundbreaking computational science, and advance High Performance Computing
    Platforms: Sierra #6, Lassen #36, rzVernal #116, Tioga #132, Ruby #171, Tenaya #194, Magma #203, Jade #296, Quartz #297, and many smaller compute clusters and infrastructure systems


  3. 3
    LLNL-PRES-850669
    What’s in an HPC Center?


  4. 4
    LLNL-PRES-850669
    What Makes Up an HPC Cluster
    https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-223.ipd.pdf
    NIST 800-223 IPD


  5. 5
    LLNL-PRES-850669
    User Centric View of LC
    • Each bubble represents a cluster
    • Size reflects theoretical peak performance, color indicates computing zone
    • Only large user-facing compute clusters are represented


  6. 6
    LLNL-PRES-850669
    More Complete View of LC
    • Each bubble represents a managed cluster or system
    • Size reflects number of nodes, color indicates type of system
    • All production systems are shown, including non-user-facing systems
    (Chart zones: CZ, RZ, SCF, SNSI, Green)


  7. 7
    LLNL-PRES-850669
    © 2016 Ian Lee


  8. 8
    LLNL-PRES-850669
    El Capitan
    https://www.llnl.gov/news/llnl-and-hpe-partner-amd-el-capitan-projected-worlds-fastest-supercomputer


  9. 9
    LLNL-PRES-850669
    El Capitan vs the Rest
    • Each bubble represents a cluster
    • Size reflects theoretical peak performance, color indicates computing zone
    • Only large user-facing compute clusters are represented
    El Capitan: ~ 2 EF (~ 2,000,000 TF)


  10. 10
    LLNL-PRES-850669
    HPC Center Architecture


  11. 11
    LLNL-PRES-850669
    HPC Zones – User View
    (Diagram: the HPC Collaboration Zone and Restricted Zone, connected to the Internet and the LLNL Enterprise network)


  12. 12
    LLNL-PRES-850669
    HPC Zones – Wider View
    (Diagram: zones across the Open Compute Facility and the Secure Compute Facility, including the Collaboration Zone, Restricted Zone, Infrastructure Zone, GDO / ATTB (DMZ), and FIS, plus the Internet and the LLNL Enterprise network)


  13. 13
    LLNL-PRES-850669
    § A common operating system and computing environment for HPC clusters
    — Based on RHEL operating system
    — Modified RHEL Kernel
    § Methodology for building, quality assurance, integration,
    and configuration management
    § Add in customizations for HPC-specific needs
    — Consistent source and software across architectures: Intel, PowerPC, and ARM
    — High speed interconnect
    — Very large filesystems
    Tri-Lab Operating System Stack (TOSS)


  14. 14
    LLNL-PRES-850669
    Tech Stack…
    YES


  15. 15
    LLNL-PRES-850669
    GitLab (Version Control)


  16. 16
    LLNL-PRES-850669
    Example .gitlab-ci.yml file
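    The original slide shows a screenshot of a pipeline file that does not survive in this transcript. As a stand-in, here is a minimal sketch of what a .gitlab-ci.yml along these lines could look like; the stage names, job names, and commands are illustrative assumptions, not LC's actual pipeline.

```yaml
# Minimal illustrative .gitlab-ci.yml (job names and commands are hypothetical)
stages:
  - lint
  - test
  - deploy

yaml-lint:
  stage: lint
  script:
    - yamllint .

unit-tests:
  stage: test
  script:
    - python -m pytest tests/

deploy-service:
  stage: deploy
  script:
    - ./deploy.sh          # placeholder for the real deployment step
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```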


  17. 17
    LLNL-PRES-850669
    LC Example: GitLab Server Deployment


  18. 18
    LLNL-PRES-850669
    § GitLab CI + Ansible configuration (separate from the TOSS Ansible repo)
    § Very fast to destroy and rebuild the Elastic clusters
    § Straightforward to scale up the service to meet demand
    § Hopefully you caught James’ talk this morning for all the details!
    LC Example: Elastic Cluster Deployment
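    As a rough illustration of the GitLab CI + Ansible pattern described on this slide (the playbook, inventory, and group names are assumptions, not the actual LC repository layout), a deploy job might look like:

```yaml
# Hypothetical CI job that rebuilds the Elastic clusters with Ansible
deploy-elastic:
  stage: deploy
  script:
    # Playbook and inventory paths are illustrative placeholders
    - ansible-playbook -i inventories/elastic site.yml --limit elastic_nodes
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual   # a human still triggers the rebuild or scale-up
```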


  19. 19
    LLNL-PRES-850669
    Configuration Management


  20. 20
    LLNL-PRES-850669
    Tools / Techniques for Configuration Management
    § Cfengine3 -> Ansible
    § SVN -> Git
    § Branch and merge workflow
    — Code review not required
    — Automated tests must pass
    Challenges / Quirks
    § HPC Clusters are traditionally pets, not
    cattle
    — Implementations differ by Sys Admin
    § Multiple network zones / enclaves
    — Airgaps / one way links
    — Use git branching to store local changes
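    To illustrate the branch-and-merge workflow where automated tests, rather than mandatory review, gate changes, a GitLab pipeline can be limited to merge requests and the default branch. This is only a sketch; the actual rules in the LC configuration management repository are not shown in the deck.

```yaml
# Illustrative workflow rules: run pipelines for merge requests and the default branch only
workflow:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```

    Combined with GitLab's "pipelines must succeed" merge check, this enforces "automated tests must pass" without requiring human code review.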


  21. 21
    LLNL-PRES-850669
    GitLab CI: Administrative Controls into Technical Controls
    § Historically, practices have been passed down verbally and rules enforced by individuals
    § CI brings rigor and repeatability to software development practices
    § New developers learn from the automated pipelines as they commit
    — GitLab CI in LC configuration management


  22. 22
    LLNL-PRES-850669
    Configuration Management CI/CD - Today
    § Whitespace check
    § Generate managed files
    — Genders
    — Sudoers files
    § Linting
    — YAML
    — Python
    § Logic Checks
    — TOSS 3 -> 4 SSH Configs
    — Ansible inventory logic
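    A hedged sketch of what a few of these checks could look like as CI jobs; the specific tools (yamllint, flake8) and the whitespace check implementation are assumptions, since the slide does not name them.

```yaml
# Illustrative lint jobs; the actual tools and targets in LC's pipeline may differ
stages:
  - lint

whitespace-check:
  stage: lint
  script:
    - git diff --check HEAD~1   # flag whitespace errors; assumes clone depth >= 2

yaml-lint:
  stage: lint
  script:
    - yamllint -s .

python-lint:
  stage: lint
  script:
    - flake8 .
```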


  23. 23
    LLNL-PRES-850669
    Configuration Management CI/CD - Future
    Process Changes
    § Automated deployment to systems
    § Mandatory Code Review
    § Deploy center Vault to manage secrets
    § Configuration validation (Network rules)
    Challenges
    § Testing ALL the combinations of systems
    § What about changes at 02:00 ?
    § How to manage unlocking the Vault?
    § Where is the source of truth?
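    One way the automated-deployment and Vault pieces could eventually fit into the pipeline; this is purely a sketch, and the environment name, Vault path, and playbook are assumptions.

```yaml
# Hypothetical gated deployment job for the future state described above
deploy-config:
  stage: deploy
  environment: production
  script:
    # Fetch a deployment secret from a center Vault instance (path is a placeholder)
    - export DEPLOY_TOKEN=$(vault kv get -field=token secret/lc/config-mgmt)
    - ansible-playbook -i inventory site.yml
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual   # keeps a human in the loop until trust in the automation grows
```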


  24. 24
    LLNL-PRES-850669
    Monitoring


  25. 25
    LLNL-PRES-850669
    § Security requirements are often quite prescriptive
    — STIG > CIS Benchmark > Vendor Guideline > generic NIST 800-53 controls
    § Developed a STIG for the TOSS operating system with DISA
    — Inspired by the RHEL 8 STIG, which TOSS 4 is derived from
    — Small tweaks: adjust some DoD-specific language to make it compatible with other government agencies
    — Larger requests: no explicit allow-listing of software on TOSS, since it is a software development OS
    — HPC specific: the RHEL STIG limits users to 10 concurrent sessions for DoS reasons; the TOSS STIG allows 256
    § Need to regularly check and validate configuration (see the sketch below)
    Security Baseline
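    For the "regularly check and validate configuration" point above, one way to automate a compliance scan is to drive OpenSCAP from Ansible. This is a minimal sketch: the datastream path and profile ID follow the naming used by scap-security-guide for RHEL 8 content and are assumptions, not the actual TOSS STIG content location.

```yaml
# Illustrative Ansible play: run an OpenSCAP scan against a STIG profile
- name: Run STIG compliance scan
  hosts: all
  become: true
  tasks:
    - name: Evaluate the STIG profile with oscap
      ansible.builtin.command: >
        oscap xccdf eval
        --profile xccdf_org.ssgproject.content_profile_stig
        --results /tmp/stig-results.xml
        /usr/share/xml/scap/ssg/content/ssg-rhel8-ds.xml
      register: scan
      changed_when: false
      failed_when: scan.rc not in [0, 2]   # oscap exits 2 when some rules fail
```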


  26. 26
    LLNL-PRES-850669
    Security Dashboards – Operational and Compliance


  27. 27
    LLNL-PRES-850669
    Continuous Monitoring
    § LC HPC is the gold standard for
    continuous monitoring at LLNL
    § Aligns with federal trends towards
    continuous monitoring
    § Reduce burden of manual processes on
    sys admins, shifting those efforts
    to automation and alerting
    — Let SysAdmins focus on the
    engineering work


  28. 28
    LLNL-PRES-850669
    Automation with Data


  29. 29
    LLNL-PRES-850669
    § Single Source of Truth
    — Bring key information into Ansible CM
    — Validate it! (see the sketch below)
    § Automatic update of CMDB and other databases
    § Drive actions from system data
    § Testing and verification of configurations and business rules
    "Automate the Boring Stuff"
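    A sketch of the "bring key information into Ansible and validate it" idea; the variable names (lc_zone, system_owner) and allowed values are hypothetical stand-ins for data that would come from the CMDB or other systems of record.

```yaml
# Illustrative validation play for inventory data pulled into Ansible
- name: Validate system-of-record data
  hosts: all
  gather_facts: false
  tasks:
    - name: Ensure every host declares a zone and an owner
      ansible.builtin.assert:
        that:
          - lc_zone is defined
          - lc_zone in ['CZ', 'RZ', 'SCF']        # hypothetical zone list
          - system_owner is defined
        fail_msg: "{{ inventory_hostname }} is missing required CMDB fields"
```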


  30. 30
    LLNL-PRES-850669
    More, Better Targeted Alerts


  31. Thank you!
    Happy to chat and answer questions!
    [email protected]
    @IanLee1521


  32. 32
    LLNL-PRES-850669
    Elastic Stack


  33. 33
    LLNL-PRES-850669
    § Elastic Stack is a collection of software for logging, reporting, and visualization of data
    — Formerly “ELK stack” or just “ELK”
    § Consists of Elasticsearch, Logstash, Kibana, Beats, and more
    § Open source components with commercial support, a model similar to GitLab's
    What is Elastic?
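    As a small illustration of how the Beats piece fits in, a Filebeat configuration shipping a log file to Elasticsearch might look like the following; the endpoint, credentials, and log path are placeholders.

```yaml
# Minimal illustrative filebeat.yml (host, credentials, and paths are placeholders)
filebeat.inputs:
  - type: filestream
    id: auth-logs
    paths:
      - /var/log/secure

output.elasticsearch:
  hosts: ["https://elastic.example.llnl.gov:9200"]
  username: "${ES_USER}"
  password: "${ES_PASS}"
```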


  34. 34
    LLNL-PRES-850669
    Kibana UI


  35. 35
    LLNL-PRES-850669
    Elastic Logging Architecture
    (Diagram: a cluster of compute nodes feeding into the Elastic logging pipeline)


  36. 36
    LLNL-PRES-850669
    Current Elastic Hardware
    ~ 112TB NVMe (total)
    ~ 2PB HDD (total)
    (Diagram: three Elastic clusters: Myelin and Axon, each with a small management node (4-8GB / 2 cores), master / data_hot / data_ingest nodes (16-32GB / 8-12 cores, ~27TB NVMe each), and data_warm nodes (32GB / 8 cores) attached to a 90x 16TB HDD JBOD (2x ~512TB); plus Centrebrain, a small monitoring cluster with a dedicated voting_only master node; all fronted by an F5)
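    To make the node roles in that diagram concrete, here is a minimal sketch of how roles are assigned in elasticsearch.yml; the cluster and node names echo the diagram, but the paths and exact settings are assumptions.

```yaml
# Hypothetical elasticsearch.yml for a hot/ingest/master-eligible node
cluster.name: myelin
node.name: myelin2
node.roles: [ master, data_hot, ingest ]
path.data: /nvme/elasticsearch    # fast NVMe storage for the hot tier

# A warm-tier node would instead carry:
# node.roles: [ data_warm ]
# path.data: /jbod/elasticsearch  # large HDD JBOD storage
```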


  37. 37
    LLNL-PRES-850669
    Continuous Monitoring – Auth Failures


  38. 38
    LLNL-PRES-850669
    Continuous Monitoring – GitLab


  39. 39
    LLNL-PRES-850669
    Tech Stack History


  40. 40
    LLNL-PRES-850669
    Coming From
    § Various instances of Atlassian tools
    — Programmatic (LC, NIF, etc; classified and unclassified)
    — Institutional (MyConfluence, MyJira, MyBitbucket)
    § End users responsible for their own CI/CD
    — Mostly Hudson / Jenkins servers
    § LC Bamboo Service (~ 2017 – 2022)


  41. 41
    LLNL-PRES-850669
    Today
    § Various instances of Atlassian tools
    — Programmatic (LC, NIF, etc; classified and unclassified)
    — Institutional (MyConfluence, MyJira, MyBitbucket)
    § End users responsible for their own CI/CD
    — Mostly Hudson / Jenkins servers
    § Programmatic GitLab Servers
    — LC, WCI, NIF, GS, NARAC, SO, LivIT


  42. 42
    LLNL-PRES-850669
    Going Forward
    § Various instances of Atlassian tools
    — Programmatic (LC, NIF, etc; classified and unclassified)
    — Institutional (MyConfluence, MyJira, MyBitbucket)
    § Programmatic and Institutional GitLab Servers
    — LC, WCI, NIF, GS, NARAC, SO, LivIT
    — Institutional GitLab Server?
    • Projects configure their own runners for use with their projects
