Slide 1

Monitoring HPC Security at LLNL
4th NIST HPC Security Workshop
Ian Lee, HPC Security Architect
2024-05-20

LLNL-PRES-864588. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Slide 2

User-Centric View of LC
• Each bubble represents a cluster
• Size reflects theoretical peak performance; color indicates computing zone
• Only large user-facing compute clusters are represented

Slide 3

HPC Zones – User View
Diagram of the user-facing zones: Internet, Collaboration Zone, LLNL Enterprise, Restricted Zone, and HPC.

Slide 4

More Complete View of LC
• Each bubble represents a managed cluster or system
• Size reflects number of nodes; color indicates type of system
• All production systems are shown, including non-user-facing systems
(Zones shown: CZ, RZ, SCF, SNSI, Green)

Slide 5

HPC Zones – Wider View
Diagram of the wider zone layout: Internet, Collaboration Zone, LLNL Enterprise, Restricted Zone, Open Compute Facility, Infrastructure Zone, GDO / ATTB (DMZ), Secure Compute Facility, and FIS.

Slide 6

© 2016 Ian Lee

Slide 7

El Capitan
https://www.llnl.gov/news/llnl-and-hpe-partner-amd-el-capitan-projected-worlds-fastest-supercomputer

Slide 8

El Capitan vs the Rest
• Each bubble represents a cluster
• Size reflects theoretical peak performance; color indicates computing zone
• Only large user-facing compute clusters are represented
El Capitan ~ 2 EF (~ 2,000,000 TF)

Slide 9

system.syslog per hour

Slide 10

Logging Infrastructure

Slide 11

Logging Architecture
Diagram: cluster nodes → CSP Splunk → LC Splunk
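As an illustration of the node-to-collector hop in this diagram, here is a minimal sketch of forwarding a node's logs to a central syslog collector with Python's standard library. The hostname, port, and facility are placeholders (not LC's actual collector), and production nodes would more likely use rsyslog or a Splunk forwarder than application-level Python logging.

```python
# Minimal sketch of node-level log forwarding to a central collector.
# "loghost.example.llnl.gov" and port 514 are placeholders, not the real
# LC collector; production nodes would typically use rsyslog or a Splunk
# forwarder rather than application-level Python logging.
import logging
import logging.handlers
import socket

def make_forwarding_logger(name: str = "hpc-node") -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Send syslog messages over UDP to the aggregation point.
    handler = logging.handlers.SysLogHandler(
        address=("loghost.example.llnl.gov", 514),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    )
    handler.setFormatter(
        logging.Formatter(f"{socket.gethostname()} %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    log = make_forwarding_logger()
    log.info("node heartbeat: sshd session opened for user example")
```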

Slide 12

Current Service to Hardware Allocations
Elastic architecture diagram (current as of 2023-09-18); totals: ~112 TB NVMe, ~2 PB HDD
• Two production Elastic clusters, Myelin and Axon, each with a management node (myelin1 / axon1, 8 GB / 2 core)
• Hot-tier instances (master, data_hot, data_ingest roles): 64 GB RAM / 16 core with 28 TB NVMe or 5 TB allocations on myelin[2-3] and axon[2-3]
• Warm-tier instances (data_warm role): 64 GB RAM / 16 core on myelin[4-5] and axon[4-5], backed by 90x 16 TB HDD JBODs (45 HDDs / ~500 TB per side), plus smaller 16 GB (8 GB JVM) / 8 core instances with 10 TB each
• Centrebrain (monitoring "cluster"): Centrebrain1 (mgmt), Centrebrain2 (dedicated master node, master / voting_only), Centrebrain3; F5
• axon[2-3] and axon[4-5] are in 2U boxes, each with: 2-socket 16-core 2.9 GHz CPUs, 192 GB RAM, 500 GB M.2 SSD (boot), 5x 7.68 TB NVMe (app), 10 Gb Ethernet
• axon1 is in a 1U box with: 2-socket 16-core 2.9 GHz CPUs, 192 GB RAM, 2x 600 GB HDD (boot), 2x 2 TB HDD (other), 10 Gb Ethernet
• RAM allocations in the diagram correspond to the amount of jvm_heap, which is what Elastic licensing is based on
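The hot/warm split above (NVMe-backed data_hot/data_ingest nodes, HDD-backed data_warm nodes) is the kind of tiering typically driven by an index lifecycle management (ILM) policy. Below is a hedged sketch of such a policy pushed over the Elasticsearch REST API; the endpoint URL, credentials, policy name, node attribute, and thresholds are illustrative assumptions, not LC's actual configuration.

```python
# Sketch: hot/warm index lifecycle policy matching the tiering in the
# diagram (NVMe-backed data_hot nodes, HDD-backed data_warm nodes).
# URL, credentials, policy name, and thresholds are placeholders.
import requests

ES_URL = "https://localhost:9200"
AUTH = ("elastic", "changeme")  # placeholder credentials

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    # Relocate shards onto the data_warm (HDD JBOD) nodes.
                    "allocate": {"require": {"data": "warm"}},
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(
    f"{ES_URL}/_ilm/policy/syslog-hot-warm",
    json=policy,
    auth=AUTH,
    verify=False,  # placeholder; use the cluster CA in practice
)
resp.raise_for_status()
print(resp.json())
```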

Slide 13

Cluster Stats Today

Slide 14

Log Sources Breakdown

Slide 15

Example: Auditd log issue

Slide 16

Jan 1 – Jan 2 (1.25 seconds)

Slide 17

Jan 1 – Jan 8 (20 seconds)

Slide 18

Nov 1 – Jan 8 (42 seconds)
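Slides 15-18 re-run the same auditd search over progressively wider time windows. As a rough illustration of that kind of investigation on the Elastic side, the sketch below counts auditd events per day with a date_histogram aggregation; the index pattern, field names, URL, and credentials are assumptions, and the actual searches in the slides may well have been run in Splunk or Kibana instead.

```python
# Sketch: count auditd events per day over a widening time range, in the
# spirit of the Jan 1 - Jan 8 / Nov 1 - Jan 8 searches above.
# Index pattern, field names, URL, and credentials are placeholders.
import requests

ES_URL = "https://localhost:9200"
AUTH = ("elastic", "changeme")  # placeholder credentials

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"event.module": "auditd"}},
                {"range": {"@timestamp": {"gte": "now-60d", "lte": "now"}}},
            ]
        }
    },
    "aggs": {
        "events_per_day": {
            "date_histogram": {"field": "@timestamp", "calendar_interval": "1d"}
        }
    },
}

resp = requests.post(
    f"{ES_URL}/logs-*/_search", json=query, auth=AUTH, verify=False
)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["events_per_day"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```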

Slide 19

Monitoring

Slide 20

Slide 21

Slide 22

Operating System Versions

Slide 23

PAN Firewall Traffic Flows

Slide 24

SSH Authentications

Slide 25

Kibana Security Dashboards

Slide 26

Monitoring Vision Going Forward
§ Explore other offerings
— ML / Anomaly Detection
— Enterprise Search (unified search across web + Confluence + GitLab)
— Elastic Defend
§ More automated alerts / processes
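As one concrete example of the "ML / Anomaly Detection" item above, the sketch below creates an Elastic machine-learning job that models overall log volume and attaches a datafeed to it via the _ml REST endpoints. The job ID, index pattern, URL, and credentials are placeholders, and it assumes an appropriately licensed Elastic deployment; it is not a description of what LC has deployed.

```python
# Sketch: an Elastic ML anomaly detection job that watches overall log
# volume, as one possible "ML / Anomaly Detection" starting point.
# Job ID, index pattern, URL, and credentials are placeholders; this
# also assumes an ML-licensed Elastic deployment.
import requests

ES_URL = "https://localhost:9200"
AUTH = ("elastic", "changeme")  # placeholder credentials

job = {
    "description": "Anomalous syslog event rate",
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [{"function": "count", "detector_description": "event rate"}],
    },
    "data_description": {"time_field": "@timestamp"},
}

resp = requests.put(
    f"{ES_URL}/_ml/anomaly_detectors/syslog-event-rate",
    json=job,
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()

# Attach a datafeed so the job reads from the logging indices.
datafeed = {"job_id": "syslog-event-rate", "indices": ["logs-*"]}
resp = requests.put(
    f"{ES_URL}/_ml/datafeeds/datafeed-syslog-event-rate",
    json=datafeed,
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()
```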

Slide 27

Security Baseline
§ Security requirements are often quite prescriptive
— STIG > CIS Benchmark > Vendor Guideline > generic NIST 800-53 controls
§ Developed a STIG for the TOSS operating system with DISA
— Inspired by the RHEL 8 STIG, which TOSS 4 is derived from
— Small tweaks: adjusted some DoD-specific language to make it compatible for other government agencies
— Larger requests: no explicit allow-listing of software on TOSS, it being a software development OS
— HPC specific: the RHEL STIG allows 10 concurrent sessions for DoS reasons; the TOSS STIG allows 256
§ Need to regularly check and validate configuration (see the sketch below)
— https://github.com/llnl/toss-stig
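To make the "regularly check and validate configuration" point concrete, here is a hedged sketch of one such check: verifying the concurrent-session (maxlogins) limit that the slide contrasts between the RHEL 8 STIG (10) and the TOSS STIG (256). File paths follow standard pam_limits conventions; the expected value and pass/fail logic are illustrative, not taken verbatim from the published STIG or the toss-stig Ansible content.

```python
# Sketch: validate one TOSS STIG-style setting -- the concurrent session
# limit ("maxlogins"), which the slide notes is 256 for TOSS vs 10 in the
# RHEL 8 STIG. The expected value and pass/fail logic here are
# illustrative, not quoted from the published STIG.
import glob
import re
from pathlib import Path

EXPECTED_MAXLOGINS = 256
LIMIT_FILES = ["/etc/security/limits.conf", *glob.glob("/etc/security/limits.d/*.conf")]
PATTERN = re.compile(r"^\s*\*\s+hard\s+maxlogins\s+(\d+)", re.MULTILINE)

def check_maxlogins() -> bool:
    found = []
    for path in LIMIT_FILES:
        text = Path(path).read_text(errors="ignore") if Path(path).exists() else ""
        found += [int(m) for m in PATTERN.findall(text)]
    # Pass if a limit is set and it does not exceed the expected value.
    return bool(found) and max(found) <= EXPECTED_MAXLOGINS

if __name__ == "__main__":
    print("maxlogins check:", "PASS" if check_maxlogins() else "FAIL")
```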

Slide 28

Community Work
§ https://github.com/llnl/cmvl (WIP)
— Repository of Elastic, Splunk, etc. queries, dashboards, and visualizations
§ https://github.com/LLNL/elastic-stacker
— Export saved objects from Kibana for sharing (see the sketch below)
§ https://github.com/LLNL/toss-configs (WIP)
— Configuration files and scripts for setting up and maintaining TOSS HPC systems
§ https://github.com/llnl/toss-stig
— Ansible implementation of the TOSS STIG
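For context on elastic-stacker, the sketch below shows the underlying Kibana saved objects export API that such a tool can build on, dumping dashboards to NDJSON for sharing. The URL, credentials, and output path are placeholders, and this is not elastic-stacker's actual implementation.

```python
# Sketch: exporting Kibana dashboards as NDJSON via the saved objects
# export API -- the kind of operation elastic-stacker wraps for sharing
# dashboards between sites. URL, credentials, and output path are
# placeholders; this is not elastic-stacker's actual code.
import requests

KIBANA_URL = "https://localhost:5601"
AUTH = ("elastic", "changeme")  # placeholder credentials

resp = requests.post(
    f"{KIBANA_URL}/api/saved_objects/_export",
    json={"type": ["dashboard"], "includeReferencesDeep": True},
    headers={"kbn-xsrf": "true"},
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()

# The export is returned as NDJSON, one saved object per line.
with open("dashboards.ndjson", "w") as fh:
    fh.write(resp.text)
```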

Slide 29

HPC Security Technical Exchange – August 5–8, 2024
§ August 5-8
— Lawrence Livermore National Laboratory, CA
§ Government focus; CUI up to TS//SCI
§ Registration opens imminently, including a Call for Topics / Prompts
— Contact [email protected] for details
— "HPC Security Technical Exchange" on Intelink
Topic areas:
§ Compliance in HPC systems
§ Incident Handling
§ Use of containers / virtualization
§ Assessments, Penetration Testing, Red Teaming
§ Hardware procurement challenges
§ Staffing challenges
§ Configuration Management / Secure Baselines of HPC systems
§ Logging and Monitoring
§ Building and deploying custom software
§ Ongoing programmatic work

Slide 30

Thank you! Happy to chat and answer questions!
[email protected]
@IanLee1521

Slide 31

What is Elastic?
§ The Elastic Stack is a collection of software for logging, reporting, and visualization of data
— Formerly the "ELK stack" or just "ELK"
§ Consists of Elasticsearch, Logstash, Kibana, Beats, and more
§ Open source components with commercial support; a similar model to GitLab

Slide 32

Why Elastic?
§ Were already using parts of the Elastic Stack before (Logstash, *Beats)
§ Better integration / extension support
— Enterprise Search, Machine Learning tools, Elastic Integrations via Agent
§ Performance claims (should be significantly faster searches compared to Splunk)
— Reality has been a bit mixed here, and there is definitely room to continue tuning our deployment
§ Compared notes with ORNL folks who moved (at least partially) from Splunk to Elastic

Slide 33

Deployment
§ GitLab CI + Ansible configuration (separate from the TOSS Ansible repo)
— James Taliaferro did a talk at S3C at NLIT going into detail on this
§ Very fast to destroy and rebuild the Elastic clusters
§ Straightforward to scale up the service to meet demand
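As a rough sketch of the rebuild flow described above, the snippet below shows the kind of thin wrapper a CI job might use to run an Ansible playbook in check mode before applying it. The playbook and inventory names are hypothetical; the actual GitLab CI pipeline and playbooks are not shown in this deck.

```python
# Sketch: the kind of thin wrapper a GitLab CI job might call to rebuild
# an Elastic cluster with Ansible. Playbook and inventory names are
# placeholders; the actual LC pipeline and playbooks are not public here.
import subprocess
import sys

def run_playbook(check_only: bool = False) -> int:
    cmd = ["ansible-playbook", "-i", "inventory/elastic.yml", "site.yml"]
    if check_only:
        cmd.append("--check")  # dry run before a real rebuild
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    # Dry run first; only apply if the check pass succeeds.
    sys.exit(run_playbook(check_only=True) or run_playbook())
```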