
Building DevOps into HPC System Administration


Over the past 5 years, Livermore Computing at LLNL has made significant strides in how we manage and deploy clusters and infrastructure in support of our HPC systems. We've seen migrations to newer technologies like Git and GitLab for managing code repositories (e.g. configuration management), the addition of CI/CD processes for the first time, and integration with new deployment approaches such as using GitLab CI + Ansible + Containers to deploy backend services. The end result is a significant improvement in the robustness of our cluster management processes, as well as quicker detection and remediation of both actual and potential issues. This talk provides an overview of these improvements, the challenges and opportunities they've presented, and our plans going forward.

Ian Lee

June 28, 2023



Transcript

  1. LLNL-PRES-850669
    This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
    Building DevOps into HPC System Administration
    NLIT 2023
    Ian Lee
    HPC Security Architect
    2023-06-28


  2. 2
    LLNL-PRES-850669
    Livermore Computing
    Platforms, software, and consulting to enable world-class scientific simulation
    Mission: Enable national security missions, support groundbreaking computational science, and advance High Performance Computing
    Platforms: Sierra #6, Lassen #36, rzVernal #116, Tioga #132, Ruby #171, Tenaya #194, Magma #203, Jade #296, Quartz #297, and many smaller compute clusters and infrastructure systems


  3. 3
    LLNL-PRES-850669
    What’s in an HPC Center?


  4. 4
    LLNL-PRES-850669
    What Makes Up an HPC Cluster
    https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-223.ipd.pdf
    NIST 800-223 IPD


  5. 5
    LLNL-PRES-850669
    User Centric View of LC
    • Each bubble represents a cluster
    • Size reflects theoretical peak performance, color indicates computing zone
    • Only large user-facing compute clusters are represented


  6. 6
    LLNL-PRES-850669
    More Complete View of LC
    • Each bubble represents a managed cluster or system
    • Size reflects number of nodes, color indicates type of system
    • All production systems are shown, including non-user-facing systems
    (Chart zones: CZ, RZ, SCF, SNSI, Green)


  7. 7
    LLNL-PRES-850669
    © 2016 Ian Lee


  8. 8
    LLNL-PRES-850669
    El Capitan
    https://www.llnl.gov/news/llnl-and-hpe-partner-amd-el-capitan-projected-worlds-fastest-supercomputer


  9. 9
    LLNL-PRES-850669
    El Capitan vs the Rest
    • Each bubble represents a cluster
    • Size reflects theoretical peak performance, color indicates computing zone
    • Only large user-facing compute clusters are represented
    El Capitan: ~ 2 EF (~ 2,000,000 TF)


  10. 10
    LLNL-PRES-850669
    HPC Center Architecture


  11. 11
    LLNL-PRES-850669
    HPC Zones – User View
    (Diagram: the HPC Collaboration Zone and Restricted Zone, connected to the Internet and the LLNL Enterprise network)


  12. 12
    LLNL-PRES-850669
    HPC Zones – Wider View
    (Diagram: zones across the Open Compute Facility and the Secure Compute Facility, including the Collaboration Zone, Restricted Zone, Infrastructure Zone, GDO / ATTB (DMZ), and FIS, plus the Internet and the LLNL Enterprise network)


  13. 13
    LLNL-PRES-850669
    § A common operating system and computing environment for HPC clusters
    — Based on RHEL operating system
    — Modified RHEL Kernel
    § Methodology for building, quality assurance, integration,
    and configuration management
    § Add in customizations for HPC-specific needs
    — Consistent source and software across architectures: Intel, PowerPC, and ARM
    — High speed interconnect
    — Very large filesystems
    Tri-Lab Operating System Stack (TOSS)


  14. 14
    LLNL-PRES-850669
    Tech Stack…
    YES


  15. 15
    LLNL-PRES-850669
    GitLab (Version Control)


  16. 16
    LLNL-PRES-850669
    Example .gitlab-ci.yml file
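    The original slide shows a screenshot of a pipeline file that does not survive in this transcript. As a stand-in, here is a minimal sketch of what a .gitlab-ci.yml along these lines could look like; the stage names, job names, and commands are illustrative assumptions, not LC's actual pipeline.

```yaml
# Minimal illustrative .gitlab-ci.yml (job names and commands are hypothetical)
stages:
  - lint
  - test
  - deploy

yaml-lint:
  stage: lint
  script:
    - yamllint .

unit-tests:
  stage: test
  script:
    - python -m pytest tests/

deploy-service:
  stage: deploy
  script:
    - ./deploy.sh          # placeholder for the real deployment step
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```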


  17. 17
    LLNL-PRES-850669
    LC Example: GitLab Server Deployment


  18. 18
    LLNL-PRES-850669
    § GitLab CI + Ansible configuration (separate from the TOSS Ansible repo)
    § Very fast to destroy and rebuild the Elastic clusters
    § Straightforward to scale up the service to meet demand
    § Hopefully you caught James’ talk this morning for all the details!
    LC Example: Elastic Cluster Deployment
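    As a rough illustration of the GitLab CI + Ansible pattern described on this slide (the playbook, inventory, and group names are assumptions, not the actual LC repository layout), a deploy job might look like:

```yaml
# Hypothetical CI job that rebuilds the Elastic clusters with Ansible
deploy-elastic:
  stage: deploy
  script:
    # Playbook and inventory paths are illustrative placeholders
    - ansible-playbook -i inventories/elastic site.yml --limit elastic_nodes
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual   # a human still triggers the rebuild or scale-up
```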


  19. 19
    LLNL-PRES-850669
    Configuration Management


  20. 20
    LLNL-PRES-850669
    Tools / Techniques for Configuration Management
    § Cfengine3 -> Ansible
    § SVN -> Git
    § Branch and merge workflow
    — Code review not required
    — Automated tests must pass
    Challenges / Quirks
    § HPC Clusters are traditionally pets, not
    cattle
    — Implementations differ by Sys Admin
    § Multiple network zones / enclaves
    — Airgaps / one way links
    — Use git branching to store local changes
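    To illustrate the branch-and-merge workflow where automated tests, rather than mandatory review, gate changes, a GitLab pipeline can be limited to merge requests and the default branch. This is only a sketch; the actual rules in the LC configuration management repository are not shown in the deck.

```yaml
# Illustrative workflow rules: run pipelines for merge requests and the default branch only
workflow:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```

    Combined with GitLab's "pipelines must succeed" merge check, this enforces "automated tests must pass" without requiring human code review.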


  21. 21
    LLNL-PRES-850669
    GitLab CI: Administrative Controls into Technical Controls
    § Historically, practices have been passed down verbally and rules enforced by individuals
    § CI brings rigor and repeatability to software development practices
    § New developers learn from the automated pipelines as they commit
    — GitLab CI in LC configuration management


  22. 22
    LLNL-PRES-850669
    Configuration Management CI/CD - Today
    § Whitespace check
    § Generate managed files
    — Genders
    — Sudoers files
    § Linting
    — YAML
    — Python
    § Logic Checks
    — TOSS 3 -> 4 SSH Configs
    — Ansible inventory logic
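    A hedged sketch of what a few of these checks could look like as CI jobs; the specific tools (yamllint, flake8) and the whitespace check implementation are assumptions, since the slide does not name them.

```yaml
# Illustrative lint jobs; the actual tools and targets in LC's pipeline may differ
stages:
  - lint

whitespace-check:
  stage: lint
  script:
    - git diff --check HEAD~1   # flag whitespace errors; assumes clone depth >= 2

yaml-lint:
  stage: lint
  script:
    - yamllint -s .

python-lint:
  stage: lint
  script:
    - flake8 .
```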


  23. 23
    LLNL-PRES-850669
    Configuration Management CI/CD - Future
    Process Changes
    § Automated deployment to systems
    § Mandatory Code Review
    § Deploy center Vault to manage secrets
    § Configuration validation (Network rules)
    Challenges
    § Testing ALL the combinations of systems
    § What about changes at 02:00 ?
    § How to manage unlocking the Vault?
    § Where is the source of truth?
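    One way the automated-deployment and Vault pieces could eventually fit into the pipeline; this is purely a sketch, and the environment name, Vault path, and playbook are assumptions.

```yaml
# Hypothetical gated deployment job for the future state described above
deploy-config:
  stage: deploy
  environment: production
  script:
    # Fetch a deployment secret from a center Vault instance (path is a placeholder)
    - export DEPLOY_TOKEN=$(vault kv get -field=token secret/lc/config-mgmt)
    - ansible-playbook -i inventory site.yml
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual   # keeps a human in the loop until trust in the automation grows
```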


  24. 24
    LLNL-PRES-850669
    Monitoring


  25. 25
    LLNL-PRES-850669
    § Security requirements are often quite prescriptive
    — STIG > CIS Benchmark > Vendor Guideline > generic NIST 800-53 controls
    § Developed a STIG for the TOSS operating system with DISA
    — Inspired by the RHEL 8 STIG, which TOSS 4 is derived from
    — Small tweaks: adjust some DoD-specific language to make it compatible with other government agencies
    — Larger requests: no explicit allow-listing of software on TOSS, since it is a software development OS
    — HPC specific: the RHEL STIG limits users to 10 concurrent sessions for DoS reasons; the TOSS STIG allows 256
    § Need to regularly check and validate configuration (see the sketch below)
    Security Baseline
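    For the "regularly check and validate configuration" point above, one way to automate a compliance scan is to drive OpenSCAP from Ansible. This is a minimal sketch: the datastream path and profile ID follow the naming used by scap-security-guide for RHEL 8 content and are assumptions, not the actual TOSS STIG content location.

```yaml
# Illustrative Ansible play: run an OpenSCAP scan against a STIG profile
- name: Run STIG compliance scan
  hosts: all
  become: true
  tasks:
    - name: Evaluate the STIG profile with oscap
      ansible.builtin.command: >
        oscap xccdf eval
        --profile xccdf_org.ssgproject.content_profile_stig
        --results /tmp/stig-results.xml
        /usr/share/xml/scap/ssg/content/ssg-rhel8-ds.xml
      register: scan
      changed_when: false
      failed_when: scan.rc not in [0, 2]   # oscap exits 2 when some rules fail
```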


  26. 26
    LLNL-PRES-850669
    Security Dashboards – Operational and Compliance


  27. 27
    LLNL-PRES-850669
    Continuous Monitoring
    § LC HPC is the gold standard for
    continuous monitoring at LLNL
    § Aligns with federal trends towards
    continuous monitoring
    § Reduce burden of manual processes on
    sys admins, shifting those efforts
    to automation and alerting
    — Let SysAdmins focus on the
    engineering work


  28. 28
    LLNL-PRES-850669
    Automation with Data


  29. 29
    LLNL-PRES-850669
    § Single Source of Truth
    — Bring key information into Ansible CM
    — Validate it! (see the sketch below)
    § Automatic update of CMDB and other databases
    § Drive actions from system data
    § Testing and verification of configurations and business rules
    "Automate the Boring Stuff"
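    A sketch of the "bring key information into Ansible and validate it" idea; the variable names (lc_zone, system_owner) and allowed values are hypothetical stand-ins for data that would come from the CMDB or other systems of record.

```yaml
# Illustrative validation play for inventory data pulled into Ansible
- name: Validate system-of-record data
  hosts: all
  gather_facts: false
  tasks:
    - name: Ensure every host declares a zone and an owner
      ansible.builtin.assert:
        that:
          - lc_zone is defined
          - lc_zone in ['CZ', 'RZ', 'SCF']        # hypothetical zone list
          - system_owner is defined
        fail_msg: "{{ inventory_hostname }} is missing required CMDB fields"
```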


  30. 30
    LLNL-PRES-850669
    More, Better Targeted Alerts


  31. Thank you!
    Happy to chat and answer questions!
    [email protected]
    @IanLee1521


  32. 32
    LLNL-PRES-850669
    Elastic Stack


  33. 33
    LLNL-PRES-850669
    § Elastic Stack is a collection of software for logging, reporting, and visualization of data
    — Formerly “ELK stack” or just “ELK”
    § Consists of Elasticsearch, Logstash, Kibana, Beats, and more
    § Open source components with commercial support, a model similar to GitLab's
    What is Elastic?
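    As a small illustration of how the Beats piece fits in, a Filebeat configuration shipping a log file to Elasticsearch might look like the following; the endpoint, credentials, and log path are placeholders.

```yaml
# Minimal illustrative filebeat.yml (host, credentials, and paths are placeholders)
filebeat.inputs:
  - type: filestream
    id: auth-logs
    paths:
      - /var/log/secure

output.elasticsearch:
  hosts: ["https://elastic.example.llnl.gov:9200"]
  username: "${ES_USER}"
  password: "${ES_PASS}"
```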


  34. 34
    LLNL-PRES-850669
    Kibana UI


  35. 35
    LLNL-PRES-850669
    Elastic Logging Architecture
    (Diagram: a cluster of compute nodes feeding into the Elastic logging pipeline)


  36. 36
    LLNL-PRES-850669
    Current Elastic Hardware
    ~ 112TB NVMe (total)
    ~ 2PB HDD (total)
    (Diagram: three Elastic clusters: Myelin and Axon, each with a small management node (4-8GB / 2 cores), master / data_hot / data_ingest nodes (16-32GB / 8-12 cores, ~27TB NVMe each), and data_warm nodes (32GB / 8 cores) attached to a 90x 16TB HDD JBOD (2x ~512TB); plus Centrebrain, a small monitoring cluster with a dedicated voting_only master node; all fronted by an F5)
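    To make the node roles in that diagram concrete, here is a minimal sketch of how roles are assigned in elasticsearch.yml; the cluster and node names echo the diagram, but the paths and exact settings are assumptions.

```yaml
# Hypothetical elasticsearch.yml for a hot/ingest/master-eligible node
cluster.name: myelin
node.name: myelin2
node.roles: [ master, data_hot, ingest ]
path.data: /nvme/elasticsearch    # fast NVMe storage for the hot tier

# A warm-tier node would instead carry:
# node.roles: [ data_warm ]
# path.data: /jbod/elasticsearch  # large HDD JBOD storage
```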


  37. 37
    LLNL-PRES-850669
    Continuous Monitoring – Auth Failures


  38. 38
    LLNL-PRES-850669
    Continuous Monitoring – GitLab


  39. 39
    LLNL-PRES-850669
    Tech Stack History


  40. 40
    LLNL-PRES-850669
    Coming From
    § Various instances of Atlassian tools
    — Programmatic (LC, NIF, etc; classified and unclassified)
    — Institutional (MyConfluence, MyJira, MyBitbucket)
    § End users responsible for their own CI/CD
    — Mostly Hudson / Jenkins servers
    § LC Bamboo Service (~ 2017 – 2022)


  41. 41
    LLNL-PRES-850669
    Today
    § Various instances of Atlassian tools
    — Programmatic (LC, NIF, etc; classified and unclassified)
    — Institutional (MyConfluence, MyJira, MyBitbucket)
    § End users responsible for their own CI/CD
    — Mostly Hudson / Jenkins servers
    § Programmatic GitLab Servers
    — LC, WCI, NIF, GS, NARAC, SO, LivIT


  42. 42
    LLNL-PRES-850669
    Going Forward
    § Various instances of Atlassian tools
    — Programmatic (LC, NIF, etc; classified and unclassified)
    — Institutional (MyConfluence, MyJira, MyBitbucket)
    § Programmatic and Institutional GitLab Servers
    — LC, WCI, NIF, GS, NARAC, SO, LivIT
    — Institutional GitLab Server?
    • Projects configure their own runners for use with their projects
