
Monitoring HPC Security at LLNL


A discussion of best practices for both security and operational monitoring of HPC systems. This includes security monitoring areas such as baseline configurations, configuration management standards, user activity, and network behaviors, as well as operational and facility monitoring efforts underway at LLNL. Additionally, we highlight several open source efforts we are kicking off to foster collaboration within the HPC community around visualizing and monitoring these systems.

Ian Lee

May 20, 2024

Transcript

  1. Monitoring HPC Security at LLNL. 4th NIST HPC Security Workshop. Ian Lee, HPC Security Architect. 2024-05-20. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC. LLNL-PRES-864588
  2. User Centric View of LC
     • Each bubble represents a cluster
     • Size reflects theoretical peak performance, color indicates computing zone
     • Only large user-facing compute clusters are represented
  3. More Complete View of LC (zones shown: CZ, RZ, SCF, SNSI, Green)
     • Each bubble represents a managed cluster or system
     • Size reflects number of nodes, color indicates type of system
     • All production systems are shown, including non-user-facing systems
  4. HPC Zones – Wider View. Zones shown in the diagram: Open Compute Facility, Collaboration Zone, Internet, LLNL Enterprise, Restricted Zone, Infrastructure Zone, GDO / ATTB (DMZ), Secure Compute Facility, FIS
  5. El Capitan vs the Rest
     • Each bubble represents a cluster
     • Size reflects theoretical peak performance, color indicates computing zone
     • Only large user-facing compute clusters are represented
     • El Capitan ~ 2 EF (~ 2,000,000 TF)
  6. Current Service to Hardware Allocations (Elastic Architecture Diagram, current: 2023-09-18). Two production Elastic clusters, Axon and Myelin, plus a Centrebrain monitoring cluster (with a dedicated voting-only master node), fronted by an F5 load balancer; roughly 112 TB NVMe and 2 PB HDD in total. Each production cluster combines master / data_hot / data_ingest nodes on 28 TB NVMe with data_warm nodes backed by a 90x 16 TB HDD JBOD. axon[2-5] are 2U boxes, each with 2-socket 16-core 2.9 GHz CPUs, 192 GB RAM, a 500 GB M.2 SSD (boot), 5x 7.68 TB NVMe (app), and 10 Gb Ethernet; axon[1] (mgmt) is a 1U box with the same CPUs and RAM, 2x 600 GB HDD (boot), 2x 2 TB HDD (other), and 10 Gb Ethernet. RAM allocations in the diagram correspond to jvm_heap, which is what Elastic licenses are based on.
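The data_hot / data_warm split in that diagram corresponds to Elastic's hot/warm data tiers. As a hedged illustration only (the policy name, sizes, and ages below are invented, not LLNL's actual configuration), an index lifecycle management (ILM) policy along these lines is what typically rolls indices over on NVMe-backed hot nodes and later migrates them to HDD-backed warm nodes:

```python
# Illustrative only: a minimal ILM policy sketch for a hot/warm Elastic
# architecture like the one diagrammed above. Names, sizes, and ages are
# assumptions, not LLNL's configuration.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # assumed endpoint

es.ilm.put_lifecycle(
    name="logs-hot-warm",  # hypothetical policy name
    policy={
        "phases": {
            "hot": {
                "actions": {
                    # Roll over while the index lives on data_hot (NVMe) nodes.
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"}
                }
            },
            "warm": {
                "min_age": "30d",
                "actions": {
                    # After 30 days, ILM migrates the index to data_warm
                    # (HDD JBOD) nodes and compacts it.
                    "allocate": {"number_of_replicas": 1},
                    "shrink": {"number_of_shards": 1},
                },
            },
        }
    },
)
```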
  7. Monitoring Vision Going Forward
     § Explore other offerings
     — ML / Anomaly Detection
     — Enterprise Search (unified search across web + confluence + gitlab)
     — Elastic Defend
     § More automated alerts / processes
  8. Security Baseline
     § Security requirements are often quite prescriptive: STIG > CIS Benchmark > Vendor Guideline > generic NIST 800-53 controls
     § Developed a STIG for the TOSS operating system with DISA
     — Inspired by the RHEL 8 STIG, which TOSS 4 is derived from
     — Small tweaks: adjusted some DoD-specific language for compatibility with other government agencies
     — Larger requests: no explicit allow-listing of software on TOSS, it being a software development OS
     — HPC specific: the RHEL STIG limits concurrent sessions to 10 for denial-of-service reasons; the TOSS STIG allows 256
     § Need to regularly check and validate configuration (a check sketch follows this slide)
     — https://github.com/llnl/toss-stig
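As one concrete example of "regularly check and validate configuration": a minimal compliance-check sketch, assuming the session limit is enforced via maxlogins in /etc/security/limits.conf as in the RHEL 8 STIG. The toss-stig repo's Ansible is the real implementation; this standalone check is illustrative only, and the 256 value follows the TOSS STIG deviation noted above.

```python
# Minimal sketch of a standalone STIG configuration check, assuming the
# session limit is set via "maxlogins" in /etc/security/limits.conf
# (as in the RHEL 8 STIG). Not the toss-stig Ansible implementation.
import re
import sys

EXPECTED_MAXLOGINS = 256  # TOSS STIG allowance; the RHEL 8 STIG expects 10
LIMITS_CONF = "/etc/security/limits.conf"

def check_maxlogins(path: str = LIMITS_CONF) -> bool:
    """Return True if a hard maxlogins limit <= EXPECTED_MAXLOGINS is set."""
    pattern = re.compile(r"^\s*\*\s+hard\s+maxlogins\s+(\d+)", re.MULTILINE)
    with open(path) as f:
        matches = pattern.findall(f.read())
    return any(int(value) <= EXPECTED_MAXLOGINS for value in matches)

if __name__ == "__main__":
    ok = check_maxlogins()
    print(f"maxlogins check: {'PASS' if ok else 'FAIL'}")
    sys.exit(0 if ok else 1)
```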
  9. Community Work
     § https://github.com/llnl/cmvl (WIP)
     — Repository of Elastic, Splunk, etc. queries, dashboards, and visualizations
     § https://github.com/LLNL/elastic-stacker
     — Export saved objects from Kibana for sharing (the underlying API is sketched below)
     § https://github.com/LLNL/toss-configs (WIP)
     — Configuration files and scripts for setting up and maintaining TOSS HPC systems
     § https://github.com/llnl/toss-stig
     — Ansible implementation of the TOSS STIG
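A tool like elastic-stacker sits on top of Kibana's saved objects export API. The sketch below shows the underlying call, not elastic-stacker's actual code; the endpoint and credentials are placeholders:

```python
# Illustrative sketch of Kibana's saved-objects export API, the mechanism
# a tool like elastic-stacker builds on. The URL and credentials are
# placeholders, not a real deployment.
import requests

KIBANA_URL = "https://kibana.example.gov:5601"  # hypothetical endpoint

resp = requests.post(
    f"{KIBANA_URL}/api/saved_objects/_export",
    headers={
        "kbn-xsrf": "true",  # required header for Kibana API writes
        "Content-Type": "application/json",
    },
    json={"type": "dashboard", "includeReferencesDeep": True},
    auth=("elastic", "changeme"),  # placeholder credentials
    timeout=30,
)
resp.raise_for_status()

# The response is NDJSON, one saved object per line, suitable for
# committing to a shared repository.
with open("dashboards.ndjson", "w") as f:
    f.write(resp.text)
```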
  10. HPC Security Technical Exchange
      § August 5-8, 2024 at Lawrence Livermore National Laboratory, CA
      § Government focus; CUI up to TS//SCI
      § Registration opening imminently, including Call For Topics / Prompts
      — Contact [email protected] for details
      — “HPC Security Technical Exchange” on Intelink
      § Compliance in HPC systems
      § Incident Handling
      § Use of containers / virtualization
      § Assessments, Penetration Testing, Red Teaming
      § Hardware procurement challenges
      § Staffing challenges
      § Configuration Management / Secure Baselines of HPC systems
      § Logging and Monitoring
      § Building and deploying custom software
      § Ongoing programmatic work
  11. What is Elastic?
      § Elastic Stack is a collection of software for logging, reporting, and visualization of data
      — Formerly the “ELK stack” or just “ELK”
      § Consists of Elasticsearch, Logstash, Kibana, Beats, and more
      § Open source components with commercial support, a similar idea to GitLab
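For a feel of the Elasticsearch piece, here is a minimal sketch using the official Python client; the host, index name, and document are invented for illustration:

```python
# Minimal sketch of indexing and searching with Elasticsearch via the
# official Python client. Host, index name, and document are invented.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # assumed endpoint

# Index a log-like document.
es.index(
    index="hpc-auth-logs",  # hypothetical index
    document={"host": "cluster-login1", "user": "alice", "event": "ssh_login"},
)
es.indices.refresh(index="hpc-auth-logs")  # make it searchable immediately

# Search it back with a simple match query.
result = es.search(
    index="hpc-auth-logs",
    query={"match": {"event": "ssh_login"}},
)
for hit in result["hits"]["hits"]:
    print(hit["_source"])
```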
  12. Why Elastic?
      § We were already using parts of the Elastic stack (Logstash, *Beats)
      § Better integration / extension support
      — Enterprise Search, Machine Learning tools, Elastic Integrations via Agent
      § Performance claims (searches should be significantly faster than Splunk)
      — Reality has been a bit mixed here, and there is definitely room to continue tuning our deployment
      § Compared notes with ORNL folks who moved (at least partially) from Splunk to Elastic
  13. Deployment
      § GitLab CI + Ansible configuration (separate from the TOSS Ansible repo)
      — James Taliaferro gave a talk at S3C at NLIT going into detail on this
      § Very fast to destroy and rebuild the Elastic clusters (a smoke-test sketch follows)
      § Straightforward to scale up the service to meet demand
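Because the clusters are routinely destroyed and rebuilt, a post-rebuild smoke test fits naturally at the end of such a pipeline. A hedged sketch (not the actual LLNL pipeline; the endpoint, node count, and timeout are assumptions) that polls the cluster health API until the rebuilt cluster reports green:

```python
# Illustrative post-rebuild smoke test: poll the cluster health API until
# the rebuilt Elastic cluster reports green. Endpoint, node count, and
# timeout are assumptions, not the actual LLNL pipeline.
import sys
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # assumed endpoint
EXPECTED_NODES = 5  # hypothetical cluster size
TIMEOUT_S = 300

deadline = time.time() + TIMEOUT_S
while time.time() < deadline:
    health = es.cluster.health()
    if health["status"] == "green" and health["number_of_nodes"] >= EXPECTED_NODES:
        print(f"cluster healthy: {health['number_of_nodes']} nodes, status green")
        sys.exit(0)
    time.sleep(10)  # wait before re-polling

print("cluster failed to reach green before timeout", file=sys.stderr)
sys.exit(1)
```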