Slide 1

Building DevOps into HPC System Administration
NLIT 2023
Ian Lee, HPC Security Architect
2023-06-28
LLNL-PRES-850669. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Slide 2

2 LLNL-PRES-850669 Livermore Computing
Platforms, software, and consulting to enable world-class scientific simulation
Mission: Enable national security missions, support groundbreaking computational science, and advance High Performance Computing
Platforms: Sierra #6, Lassen #36, rzVernal #116, Tioga #132, Ruby #171, Tenaya #194, Magma #203, Jade #296, Quartz #297, and many smaller compute clusters and infrastructure systems

Slide 3

3 LLNL-PRES-850669 What’s in an HPC Center?

Slide 4

4 LLNL-PRES-850669 What Makes Up an HPC Cluster
NIST SP 800-223 IPD: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-223.ipd.pdf

Slide 5

5 LLNL-PRES-850669 User Centric View of LC
• Each bubble represents a cluster
• Size reflects theoretical peak performance, color indicates computing zone
• Only large user-facing compute clusters are represented

Slide 6

6 LLNL-PRES-850669 More Complete View of LC (zones: Green, CZ, RZ, SCF, SNSI)
• Each bubble represents a managed cluster or system
• Size reflects number of nodes, color indicates type of system
• All production systems are shown, including non-user-facing systems

Slide 7

7 LLNL-PRES-850669 © 2016 Ian Lee

Slide 8

8 LLNL-PRES-850669 El Capitan https://www.llnl.gov/news/llnl-and-hpe-partner-amd-el-capitan-projected-worlds-fastest-supercomputer

Slide 9

9 LLNL-PRES-850669 El Capitan vs the Rest
El Capitan ~ 2 EF (~ 2,000,000 TF)
• Each bubble represents a cluster
• Size reflects theoretical peak performance, color indicates computing zone
• Only large user-facing compute clusters are represented

Slide 10

10 LLNL-PRES-850669 HPC Center Architecture

Slide 11

11 LLNL-PRES-850669 HPC Zones – User View
Zones shown: Internet, LLNL Enterprise, Restricted Zone, HPC Collaboration Zone

Slide 12

12 LLNL-PRES-850669 HPC Zones – Wider View
Zones shown: Internet, LLNL Enterprise, Restricted Zone, Open Compute Facility, Collaboration Zone, Infrastructure Zone, GDO / ATTB (DMZ), Secure Compute Facility, FIS

Slide 13

13 LLNL-PRES-850669 Tri-Lab Operating System Stack (TOSS)
§ A common operating system and computing environment for HPC clusters
— Based on RHEL operating system
— Modified RHEL kernel
§ Methodology for building, quality assurance, integration, and configuration management
§ Add in customization for HPC-specific needs
— Consistent source and software across architectures: Intel, PowerPC, and ARM
— High speed interconnect
— Very large filesystems

Slide 14

14 LLNL-PRES-850669 Tech Stack… YES

Slide 15

15 LLNL-PRES-850669 GitLab (Version Control)

Slide 16

16 LLNL-PRES-850669 Example .gitlab-ci.yml file
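
The file from this slide is not reproduced in the transcript; as a stand-in, a minimal generic .gitlab-ci.yml might look like the following (stage and job names are illustrative, and site.yml is a hypothetical playbook, not LC's actual pipeline):

stages:
  - lint
  - test
  - deploy

lint-job:
  stage: lint
  script:
    - yamllint .                                # style-check all YAML in the repo

test-job:
  stage: test
  script:
    - ansible-playbook --syntax-check site.yml  # catch playbook errors before deploy

deploy-job:
  stage: deploy
  script:
    - echo "deployment step runs here"
  when: manual                                  # gated behind a human click in the GitLab UI

Jobs run stage by stage in the order listed under stages, and a failing job stops the pipeline before later stages run.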

Slide 17

17 LLNL-PRES-850669 LC Example: GitLab Server Deployment

Slide 18

18 LLNL-PRES-850669 LC Example: Elastic Cluster Deployment
§ GitLab CI + Ansible configuration (separate from the TOSS Ansible repo)
§ Very fast to destroy and rebuild the Elastic clusters
§ Straightforward to scale up the service to meet demand
§ Hopefully you caught James’ talk this morning for all the details!
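
A rough sketch of a GitLab CI job driving such an Ansible-based rebuild (the job name, inventory path, and playbook name are hypothetical, not the actual LC repository layout):

deploy-elastic:
  stage: deploy
  script:
    # Rebuild / reconfigure the Elastic cluster from the Ansible configuration
    - ansible-playbook -i inventory/elastic elastic-cluster.yml
  when: manual  # a destroy-and-rebuild is triggered deliberately, not on every push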

Slide 19

19 LLNL-PRES-850669 Configuration Management

Slide 20

20 LLNL-PRES-850669 Tools / Techniques for Configuration Management
§ Cfengine3 -> Ansible
§ SVN -> Git
§ Branch and merge workflow
— Code review not required
— Automated tests must pass
Challenges / Quirks
§ HPC clusters are traditionally pets, not cattle
— Implementations differ by sys admin
§ Multiple network zones / enclaves
— Air gaps / one-way links
— Use git branching to store local changes

Slide 21

21 LLNL-PRES-850669 GitLab CI: Administrative Controls into Technical Controls
§ Historically, practices are passed down verbally, rules are enforced by individuals
§ CI brings rigor and repeatability to software development practices
§ New developers learn from automated pipelines, as they are committing
— GitLab CI in LC configuration management

Slide 22

22 LLNL-PRES-850669 Configuration Management CI/CD - Today
§ Whitespace check
§ Generate managed files
— Genders
— Sudoers files
§ Linting
— YAML
— Python
§ Logic checks
— TOSS 3 -> 4 SSH configs
— Ansible inventory logic
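
A hedged sketch of how checks like these could be expressed as GitLab CI jobs (job names, the main branch name, and lint configuration are assumptions, not the actual LC pipeline):

stages:
  - lint

whitespace:
  stage: lint
  script:
    # git diff --check exits non-zero if the branch introduces whitespace errors
    - git fetch origin main
    - git diff --check origin/main...HEAD

yaml-lint:
  stage: lint
  script:
    - yamllint .

python-lint:
  stage: lint
  script:
    - flake8 .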

Slide 23

23 LLNL-PRES-850669 Configuration Management CI/CD - Future
Process Changes
§ Automated deployment to systems
§ Mandatory code review
§ Deploy a center-wide Vault to manage secrets
§ Configuration validation (network rules)
Challenges
§ Testing ALL the combinations of systems
§ What about changes at 02:00?
§ How to manage unlocking the Vault?
§ Where is the source of truth?
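
For the secrets piece, one possible shape of the Vault integration on the Ansible side is sketched below; this assumes the community.hashi_vault collection, and the secret path and Vault URL are made up for illustration:

- name: Read a service credential from Vault (illustrative only)
  ansible.builtin.set_fact:
    # Authentication (e.g., a token in VAULT_TOKEN) must be provided separately,
    # which is exactly the "how to manage unlocking the Vault?" question above
    service_password: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=secret/data/myservice:password url=https://vault.example.gov:8200') }}"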

Slide 24

24 LLNL-PRES-850669 Monitoring

Slide 25

25 LLNL-PRES-850669 Security Baseline
§ Security requirements are often quite prescriptive
— STIG > CIS Benchmark > Vendor Guideline > generic NIST 800-53 controls
§ Developed a STIG for the TOSS operating system with DISA
— Inspired by the RHEL 8 STIG, since TOSS 4 is derived from RHEL 8
— Small tweaks: adjust some DoD-specific language to make it compatible for other government agencies
— Larger requests: no explicit allow-listing of software on TOSS, since it is a software development OS
— HPC specific: the RHEL STIG limits users to 10 concurrent sessions for DoS reasons; the TOSS STIG allows 256
§ Need to regularly check and validate configuration
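
As one concrete example, the session-limit deviation above can be applied and re-checked from configuration management; a minimal Ansible sketch, assuming the community.general collection is available:

- name: Allow up to 256 concurrent sessions per the TOSS STIG (illustrative)
  community.general.pam_limits:
    # Writes the limit into /etc/security/limits.conf by default
    domain: '*'
    limit_type: hard
    limit_item: maxlogins
    value: '256'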

Slide 26

26 LLNL-PRES-850669 Security Dashboards – Operational and Compliance

Slide 27

27 LLNL-PRES-850669 Continuous Monitoring
§ LC HPC is the gold standard for continuous monitoring at LLNL
§ Aligns with federal trends towards continuous monitoring
§ Reduce burden of manual processes on sys admins, shifting those efforts to automation and alerting
— Let SysAdmins focus on the engineering work

Slide 28

28 LLNL-PRES-850669 Automation with Data

Slide 29

29 LLNL-PRES-850669 “Automate the Boring Stuff”
§ Single source of truth
— Bring key information into Ansible CM
— Validate it!
§ Automatic update of CMDB and other databases
§ Drive actions from system data
§ Testing and verification of configurations and business rules
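
A small sketch of the “validate it!” idea as an Ansible assertion; the variable names and allowed zone values are hypothetical:

- name: Check that each host carries the data the CMDB sync depends on (illustrative)
  ansible.builtin.assert:
    that:
      - cluster_name is defined
      - network_zone is defined
      - network_zone in ['CZ', 'RZ', 'SCF']
    fail_msg: "{{ inventory_hostname }} is missing or has invalid source-of-truth data"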

Slide 30

30 LLNL-PRES-850669 More, Better Targeted Alerts

Slide 31

Thank you! Happy to chat and answer questions! [email protected] @IanLee1521

Slide 32

32 LLNL-PRES-850669 Elastic Stack

Slide 33

33 LLNL-PRES-850669 What is Elastic?
§ Elastic Stack is a collection of software for logging, reporting, and visualization of data
— Formerly “ELK stack” or just “ELK”
§ Consists of Elasticsearch, Logstash, Kibana, Beats, and more
§ Open source components, commercial support; similar idea to GitLab
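
For reference, a minimal Beats configuration (Filebeat) that ships a system log to Elasticsearch looks roughly like this; the Elasticsearch hostname is a placeholder:

filebeat.inputs:
  - type: filestream        # read log files from disk
    id: system-logs
    paths:
      - /var/log/messages

output.elasticsearch:
  hosts: ["https://elasticsearch.example.gov:9200"]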

Slide 34

34 LLNL-PRES-850669 Kibana UI

Slide 35

35 LLNL-PRES-850669 Elastic Logging Architecture
(architecture diagram: cluster and nodes)

Slide 36

36 LLNL-PRES-850669 Current Elastic Hardware
~ 112TB NVMe (total), ~ 2PB HDD (total)
(diagram: Axon and Myelin Elasticsearch clusters, each with a mgmt node, master / data_hot / data_ingest nodes on ~27TB NVMe, data_warm nodes, and 90x 16TB HDD JBOD split into two ~512TB groups; plus the Centrebrain cluster with a mgmt node, a dedicated voting_only master node, and a monitoring node; and an F5)

Slide 37

37 LLNL-PRES-850669 Continuous Monitoring – Auth Failures

Slide 38

38 LLNL-PRES-850669 Continuous Monitoring – GitLab

Slide 39

39 LLNL-PRES-850669 Tech Stack History

Slide 40

40 LLNL-PRES-850669 Coming From
§ Various instances of Atlassian tools
— Programmatic (LC, NIF, etc.; classified and unclassified)
— Institutional (MyConfluence, MyJira, MyBitbucket)
§ End users responsible for their own CI/CD
— Mostly Hudson / Jenkins servers
§ LC Bamboo Service (~ 2017 – 2022)

Slide 41

41 LLNL-PRES-850669 Today
§ Various instances of Atlassian tools
— Programmatic (LC, NIF, etc.; classified and unclassified)
— Institutional (MyConfluence, MyJira, MyBitbucket)
§ End users responsible for their own CI/CD
— Mostly Hudson / Jenkins servers
§ Programmatic GitLab Servers
— LC, WCI, NIF, GS, NARAC, SO, LivIT

Slide 42

42 LLNL-PRES-850669 Going Forward
§ Various instances of Atlassian tools
— Programmatic (LC, NIF, etc.; classified and unclassified)
— Institutional (MyConfluence, MyJira, MyBitbucket)
§ Programmatic and Institutional GitLab Servers
— LC, WCI, NIF, GS, NARAC, SO, LivIT
— Institutional GitLab Server?
• Projects configure runners for use with their projects