Slide 1

Slide 1 text

DevOps in HPC
4th NIST HPC Security Workshop
Ian Lee, HPC Security Architect
2024-05-21

LLNL-PRES-850669
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Slide 2

Slide 2 text

More Complete View of LC
[Figure: bubble chart of LC systems; labels in the figure: CZ, RZ, SCF, SNSI, Green]
• Each bubble represents a managed cluster or system
• Size reflects number of nodes, color indicates type of system
• All production systems are shown, including non-user-facing systems

Slide 3

Slide 3 text

Tech Stack… YES

Slide 4

Slide 4 text

Tri-Lab Operating System Stack (TOSS)
§ A common operating system and computing environment for HPC clusters
  — Based on RHEL operating system
  — Modified RHEL Kernel
§ Methodology for building, quality assurance, integration, and configuration management
§ Add in customization for HPC specific needs
  — Consistent source and software across architectures: Intel, PowerPC, and ARM
  — High speed interconnect
  — Very large filesystems

Slide 5

Slide 5 text

GitLab (Version Control)

Slide 6

Slide 6 text

Example .gitlab-ci.yml file

Slide 7

Slide 7 text

LC Example: Elastic Cluster Deployment
§ GitLab CI + Ansible configuration
  — Separate from the TOSS Ansible repo
§ Very fast to destroy and rebuild the Elastic clusters
§ Straightforward to scale up the service to meet demand
  — Now running 6 production Elastic deployments

Slide 8

Slide 8 text

LC Example: Elastic Cluster Deployment

Slide 9

Slide 9 text

LC Example: GitLab Server Deployment

Slide 10

Slide 10 text

Custom Logstash Image

Slide 11

Slide 11 text

Configuration Management

Slide 12

Slide 12 text

Tools / Techniques for Configuration Management
§ Cfengine3 -> Ansible
§ SVN -> Git + GitLab
§ Branch and merge workflow
  — Automated tests must pass
  — CODEOWNERS (in progress)
  — Code review not required (yet)

Challenges / Quirks
§ HPC Clusters are traditionally pets, not cattle
  — Implementations differ by Sys Admin
§ Multiple network zones / enclaves
  — Airgaps / one way links
  — Use git branching to store local changes

Slide 13

Slide 13 text

GitLab CI: Administrative Controls into Technical Controls
§ Historically, practices are passed down verbally, rules are enforced by individuals
§ CI brings rigor and repeatability to software development practices
§ New developers learn from automated pipelines, as they are committing
  — GitLab CI in LC configuration management

Slide 14

Slide 14 text

Configuration Management CI/CD - Today
§ Whitespace check
§ Generate managed files
  — Genders
  — Sudoers files
§ Linting
  — Markdown
  — Python
  — YAML
§ Logic / Quality Checks
  — TOSS 3 -> 4 SSH Configs
  — Ansible inventory logic
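The whitespace check named on this slide is the simplest of these jobs to picture. The sketch below is not LC's pipeline job; it only illustrates the kind of script a CI stage could run. The function names `find_trailing_whitespace` and `check_files` and the report format are assumptions made for this example.

```python
from typing import List, Tuple


def find_trailing_whitespace(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines ending in spaces or tabs."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # A line has trailing whitespace if stripping spaces/tabs changes it.
        if line != line.rstrip(" \t"):
            problems.append((number, line))
    return problems


def check_files(paths: List[str]) -> int:
    """Print offending lines and return a CI-style exit code (0 means clean)."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number, _ in find_trailing_whitespace(handle.read()):
                print(f"{path}:{number}: trailing whitespace")
                failed = True
    return 1 if failed else 0
```

A CI job could call `check_files` on the files changed in a merge request and fail the pipeline when it returns a non-zero value.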

Slide 15

Slide 15 text

Linting Standards

Slide 16

Slide 16 text

Configuration Management CI/CD - Future

Process Changes
§ Automated deployment to systems
§ Mandatory Code Review
§ Deploy center Vault to manage secrets
  — Deployed and then moved away from Vault
§ Configuration validation (Network rules)

Challenges
§ Testing ALL the combinations of systems
§ What about changes at 02:00?
§ How to manage unlocking the Vault?
§ Where is the source of truth?
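The slide does not say what "Configuration validation (Network rules)" would check. One plausible reading is that each declared rule may only reference addresses inside its own zone. The sketch below shows that idea with Python's standard `ipaddress` module. The zone names, the rule dictionary layout, and the address ranges (documentation-only ranges from RFC 5737) are all hypothetical.

```python
import ipaddress
from typing import Dict, List

# Hypothetical zones mapped to documentation-only address ranges (RFC 5737).
ZONES = {
    "collaboration": ipaddress.ip_network("192.0.2.0/24"),
    "restricted": ipaddress.ip_network("198.51.100.0/24"),
}


def validate_rule(rule: Dict[str, str]) -> List[str]:
    """Return the problems found in one rule; an empty list means valid."""
    zone = ZONES.get(rule.get("zone", ""))
    if zone is None:
        return [f"unknown zone: {rule.get('zone')!r}"]
    try:
        # strict=True rejects networks written with host bits set.
        source = ipaddress.ip_network(rule["source"], strict=True)
    except (KeyError, ValueError) as exc:
        return [f"bad source: {exc}"]
    # subnet_of() requires matching IP versions, so compare versions first.
    if source.version != zone.version or not source.subnet_of(zone):
        return [f"{source} is outside zone {rule['zone']}"]
    return []
```

Running a check like this in CI would surface a rule that crosses a zone boundary before it is deployed, rather than after.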

Slide 17

Slide 17 text

“Automate the Boring Stuff”
§ Single Source of Truth
  — Bring key information into Ansible CM
  — Validate it!
§ Automatic update of CMDB, and other databases
§ Drive actions from system data
  — pulse.py
§ Testing and Verification of configurations and business rules
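The "Validate it!" step can be pictured as a schema check over host records before anything downstream consumes them. The sketch below is only an illustration and says nothing about what pulse.py does. The field names (`cluster`, `zone`, `os`) and the record layout are hypothetical. The zone labels CZ, RZ, and SCF appear on slide 2, but treating them as the complete set of valid zones is an assumption.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("cluster", "zone", "os")  # hypothetical schema
ALLOWED_ZONES = {"CZ", "RZ", "SCF"}          # assumed set of valid zone labels


def validate_inventory(hosts: Dict[str, Dict[str, str]]) -> List[str]:
    """Return one message per problem found; an empty list means valid."""
    errors = []
    for name, record in sorted(hosts.items()):
        # Every host must carry a non-empty value for each required field.
        for field in REQUIRED_FIELDS:
            if not record.get(field):
                errors.append(f"{name}: missing {field}")
        # A zone that is present must be one of the known labels.
        zone = record.get("zone")
        if zone and zone not in ALLOWED_ZONES:
            errors.append(f"{name}: unknown zone {zone}")
    return errors
```

If a check like this runs in the same pipeline as the linting jobs, an incomplete record fails the merge request instead of propagating into a CMDB update.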

Slide 18

Slide 18 text

Working Across An Airgap

Slide 19

Slide 19 text

Low-side Branch

Slide 20

Slide 20 text

High Side Branch

Slide 21

Slide 21 text

High Side moves ahead uniquely

Slide 22

Slide 22 text

Low Side moves ahead uniquely

Slide 23

Slide 23 text

Low Side merges into High Side

Slide 24

Slide 24 text

Repeat…

Slide 25

Slide 25 text

Repeat…
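The sequence on slides 19 through 25 can be summarized with a small model. The sketch below is a simplification, not a description of LC's tooling: each branch is reduced to the set of commit IDs reachable from its tip, the merge commit that git would create is ignored, and all names (`Branch`, `merge_low_into_high`, the commit IDs) are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class Branch:
    """A branch reduced to the set of commit IDs reachable from its tip."""
    name: str
    commits: Set[str] = field(default_factory=set)

    def commit(self, commit_id: str) -> None:
        self.commits.add(commit_id)


def merge_low_into_high(low: Branch, high: Branch) -> None:
    """One-way merge: high gains every low-side commit; low is untouched."""
    high.commits |= low.commits


def simulate() -> Tuple[Branch, Branch]:
    low = Branch("low-side", {"base"})
    high = Branch("high-side", {"base"})
    high.commit("H1")               # high side moves ahead uniquely
    low.commit("L1")                # low side moves ahead uniquely
    merge_low_into_high(low, high)  # low side merges into high side
    low.commit("L2")                # repeat
    high.commit("H2")
    merge_low_into_high(low, high)
    return low, high
```

In this model the high side always ends up as a superset of the low side, and high-only commits never flow back, which is the property a one-way link enforces.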

Slide 26

Slide 26 text

Open Source Contributions

Slide 27

Slide 27 text


Slide 28

Slide 28 text

LLNL Open Source Repositories

Slide 29

Slide 29 text

Community Work
§ https://github.com/llnl/cmvl (WIP)
  — Repository of Elastic, Splunk, etc. queries, dashboards, and visualizations
§ https://github.com/LLNL/elastic-stacker
  — Export saved objects from Kibana for sharing
§ https://github.com/LLNL/toss-configs (WIP)
  — Configuration files and scripts for setting up and maintaining TOSS HPC systems
§ https://github.com/llnl/toss-stig
  — Ansible implementation of the TOSS STIG

Slide 30

Slide 30 text

Thank you! Happy to chat and answer questions! [email protected] @IanLee1521

Slide 31

Slide 31 text

Livermore Computing
Platforms, software, and consulting to enable world-class scientific simulation

Mission
Enable national security missions, support groundbreaking computational science, and advance High Performance Computing

Platforms
Sierra #12, El Cap (early delivery) #46, rzAdams #47, Tuolumne #48, Lassen #57, Dane #130, Bengal #151, rzVernal #166, Tioga #187, Ruby #232, Tenaya #259, Magma #269, Jade #368, Quartz #369, and many smaller compute clusters and infrastructure systems

Slide 32

Slide 32 text

HPC Center Architecture

Slide 33

Slide 33 text

What Makes Up an HPC Cluster
NIST 800-223 IPD
https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-223.ipd.pdf

Slide 34

Slide 34 text

HPC Zones – User View
[Diagram labels: HPC, Collaboration Zone, Internet, LLNL Enterprise, Restricted Zone]

Slide 35

Slide 35 text

HPC Zones – Wider View
[Diagram labels: Open Compute Facility, Collaboration Zone, Internet, LLNL Enterprise, Restricted Zone, Infrastructure Zone, GDO / ATTB (DMZ), Secure Compute Facility, FIS]

Slide 36

Slide 36 text

Tech Stack History

Slide 37

Slide 37 text

Coming From
§ Various instances of Atlassian tools
  — Programmatic (LC, NIF, etc.; classified and unclassified)
  — Institutional (MyConfluence, MyJira, MyBitbucket)
§ End users responsible for their own CI/CD
  — Mostly Hudson / Jenkins servers
§ LC Bamboo Service (~ 2017 – 2022)

Slide 38

Slide 38 text

Today
§ Various instances of Atlassian tools
  — Programmatic (LC, NIF, etc.; classified and unclassified)
  — Institutional (MyConfluence, MyJira, MyBitbucket)
§ End users responsible for their own CI/CD
  — Mostly Hudson / Jenkins servers
§ Programmatic GitLab Servers
  — LC, WCI, NIF, GS, NARAC, SO, LivIT

Slide 39

Slide 39 text

Going Forward
§ Various instances of Atlassian tools
  — Programmatic (LC, NIF, etc.; classified and unclassified)
  — Institutional (MyConfluence, MyJira, MyBitbucket)
§ Programmatic and Institutional GitLab Servers
  — LC, WCI, NIF, GS, NARAC, SO, LivIT
  — Institutional GitLab Server?
    • Projects configure runners for use with their projects