
You Must Unlearn What You Have Learned

Ian Lee
February 01, 2023

High Performance Computing (HPC) systems generate massive amounts of data and logs, and retention requirements keep increasing to ensure data remains available for incident response, audits, and other business needs. Ingesting and making sense of all that data takes a correspondingly large amount of computing power and storage. With El Capitan, a 2 exaflop system, being deployed at LLNL in 2023, our processing needs will only grow. Over the past year, Livermore Computing at LLNL has therefore been migrating its logging infrastructure to Elasticsearch and Kibana to handle the increasing volume of data even faster than before. This talk covers the changes we've made, why we decided to go with Elastic, and some of the bumps we've hit along the way.

Transcript

  1. LLNL-PRES-844466
    This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
    You Must Unlearn What You Have Learned
    ElasticON Public Sector 2023
    Ian Lee
    HPC Security Architect
    2023-02-01

  2. 2
    LLNL-PRES-844466
    https://upload.wikimedia.org/wikipedia/commons/a/a8/U.S._National_labs_map.jpg

  3. 3
    LLNL-PRES-844466
    Livermore Computing
    Platforms, software, and consulting to enable world-class scientific simulation
    Mission: Enable national security missions, support groundbreaking computational science, and advance High Performance Computing
    Platforms: Sierra #6, Lassen #34, rzVernal #107, Tioga #121, Ruby #145, Tenaya #165, Magma #173, Jade #258, Quartz #259, and many smaller compute clusters and infrastructure systems

  4. 4
    LLNL-PRES-844466
    The user-centric view of the Livermore Computing environment looks deceptively simple (as it should)
    • Each bubble represents a cluster
    • Size reflects theoretical peak performance, color indicates computing zone
    • Only large user-facing compute clusters are represented
    Most of the complex underlying infrastructure is hidden from users so that they can focus on their work

  5. 5
    LLNL-PRES-844466
    The operational view reflects the actual complexity of the computing environment that we manage
    • Each bubble represents a managed cluster or system
    • Size reflects number of nodes, color indicates type of system
    • All production systems are shown, including non-user-facing systems
    • This is before arrival of CTS-2 and El Capitan
    • Complexity is increasing over time while staffing levels stay flat
    • Older systems are staying in service longer
    • It is imperative that we continually reevaluate and improve our tools and processes to keep complexity from getting out of control
    (Diagram zones: CZ, RZ, SCF, SNSI, Green)

  6. 6
    LLNL-PRES-844466
    The Old Ways

  7. 7
    LLNL-PRES-844466
    Log Flow to Splunk
    (Diagram: a cluster of nodes shipping logs to Splunk)

  8. 8
    LLNL-PRES-844466
    Security Dashboards – Operational and Compliance

  9. 9
    LLNL-PRES-844466
    © 2016 Ian Lee

  10. 10
    LLNL-PRES-844466
    § October 2021: Division re-organization, architecture re-evaluation
    § November 2021: Proposal to migrate to Elasticsearch and Kibana
    § January 2022: Decision to migrate to Elasticsearch and Kibana
    Rough Timeline (approx.)

  11. 11
    LLNL-PRES-844466
    § October 2021: Division re-organization, architecture re-evaluation
    § November 2021: Proposal to migrate to Elasticsearch and Kibana
    § January 2022: Decision to migrate to Elasticsearch and Kibana
    Rough Timeline (approx.)
    All of the things. All of them. (Original from Hyperbole and a Half) (https://www.gamespot.com/articles/the-new-gamespot-faq/1100-6414631/)

  12. 12
    LLNL-PRES-844466
    § October 2021: Division re-organization, architecture re-evaluation
    § November 2021: Proposal to migrate to Elasticsearch and Kibana
    § January 2022: Decision to migrate to Elasticsearch and Kibana
    § February / March 2022: Fleshing out architecture ideas, finding hardware
    — Elastic 8.0 (Feb 10)
    § April – May 2022: Elastic Cluster Architecture v1
    Rough Timeline (approx.)

  13. 13
    LLNL-PRES-844466

  14. 14
    LLNL-PRES-844466
    § October 2021: Division re-organization, architecture re-evaluation
    § November 2021: Proposal to migrate to Elasticsearch and Kibana
    § January 2022: Decision to migrate to Elasticsearch and Kibana
    § February / March 2022: Fleshing out architecture ideas, finding hardware
    — Elastic 8.0 (Feb 10)
    § April – May 2022: Elastic Cluster Architecture v1
    § May – August 2022: Elastic Cluster Architecture v2
    § September – December 2022: Deployment into pre-production, production
    § January 2023 (Last week): LLNL x Elastic Onsite
    Rough Timeline (approx.)

  15. 15
    LLNL-PRES-844466
    Current Architecture
    (Diagram: a cluster of nodes shipping logs into the new Elastic-based pipeline)
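
    As a rough illustration of what this flow amounts to at the API level, the sketch below indexes a few syslog-style events into Elasticsearch through the REST _bulk endpoint. The endpoint URL, data stream name, and credentials are placeholders for the example, and the real pipeline uses log shippers (e.g. Filebeat) rather than a hand-written client.

```python
# Minimal sketch (not LLNL's actual pipeline): push a few syslog-style events
# into an Elasticsearch data stream via the REST _bulk API. All names and
# credentials below are placeholders.
import json

import requests

ES_URL = "https://elastic.example.llnl.gov:9200"  # placeholder endpoint
DATA_STREAM = "logs-syslog-hpc"                   # placeholder data stream

events = [
    {"@timestamp": "2023-02-01T12:00:00Z", "host": {"name": "node001"},
     "message": "sshd[1234]: Accepted publickey for user1"},
    {"@timestamp": "2023-02-01T12:00:05Z", "host": {"name": "node002"},
     "message": "slurmd: launched job 987654"},
]

# The bulk API takes newline-delimited JSON: an action line, then the document.
# Data streams only accept the "create" action.
lines = []
for doc in events:
    lines.append(json.dumps({"create": {}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"

resp = requests.post(
    f"{ES_URL}/{DATA_STREAM}/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
    auth=("elastic", "changeme"),  # placeholder credentials
    verify=False,                  # a real deployment would verify TLS
)
resp.raise_for_status()
print("bulk errors:", resp.json()["errors"])
```

    The action line is {"create": {}} rather than {"index": {}} because data streams are append-only and only accept the create operation.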

  16. 16
    LLNL-PRES-844466
    © 2016 Ian Lee

  17. 17
    LLNL-PRES-844466
    El Capitan
    https://www.llnl.gov/news/llnl-and-hpe-partner-amd-el-capitan-projected-worlds-fastest-supercomputer

  18. 18
    LLNL-PRES-844466
    El Capitan vs the Rest
    • Each bubble represents a cluster
    • Size reflects theoretical peak performance, color indicates computing zone
    • Only large user-facing compute clusters are represented
    El Capitan: ~ 2 EF (~ 2,000,000 TF)

  19. 19
    LLNL-PRES-844466
    Today
    © 2018 Heather Lee

  20. Disclaimer
    This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United
    States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or
    implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus,
    product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific
    commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or
    imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC.
    The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or
    Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.
    Using or Thinking about Elastic?
    I would love to chat!
    [email protected]
    @IanLee1521

  21. 21
    LLNL-PRES-844466
    Previous Architecture (Splunk)
    (Diagram: across the datacenters, each cluster's nodes ship logs via Filebeat and syslog through an F5 to a bank of four Logstash instances, which forward to Splunk; additional clusters elided)

  22. 22
    LLNL-PRES-844466
    Current Elastic Hardware
    ~ 112TB NVMe (total), ~ 2PB HDD (total)
    Myelin
    • Myelin1 (mgmt): 4-8GB / 2 core
    • Myelin2, Myelin3: 16-32GB / 8-12 core, ~ 27TB NVMe each; Master, data_hot, data_ingest
    • Myelin4, Myelin5: 32GB / 8 core; data_warm
    • 90x 16TB HDD JBOD (2x 45 HDD, ~ 512TB each)
    Axon
    • Axon1 (mgmt): 4-8GB / 2 core
    • Axon2, Axon3: 16-32GB / 8-12 core, ~ 27TB NVMe each; Master, data_hot, data_ingest
    • Axon4, Axon5: 32GB / 8 core; data_warm
    • 90x 16TB HDD JBOD (2x 45 HDD, ~ 512TB each)
    Centrebrain
    • Centrebrain1 (mgmt): 4GB / 2 core
    • Centrebrain2 (Dedicated Master Node): 8GB / 8 core; Master, voting_only
    • Centrebrain3 (Monitoring “cluster”): 16GB / 8 core
    F5
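
    The data_hot / data_warm split above (NVMe behind the hot nodes, HDD JBODs behind the warm nodes) is the standard hot/warm tiering pattern driven by index lifecycle management (ILM). The sketch below registers a hypothetical hot/warm ILM policy over the REST API; the policy name, rollover thresholds, and retention ages are assumptions for illustration, not LLNL's actual settings.

```python
# Hedged sketch of a hot/warm ILM policy matching the node roles shown above.
# The policy name, thresholds, and retention are examples, not LLNL's settings.
import requests

ES_URL = "https://elastic.example.llnl.gov:9200"  # placeholder endpoint

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over to a new backing index on size or age.
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    # ILM's implicit migrate action shifts shards from the
                    # data_hot (NVMe) nodes to the data_warm (HDD) nodes;
                    # force-merge reduces segment overhead on the warm tier.
                    "forcemerge": {"max_num_segments": 1},
                    "allocate": {"number_of_replicas": 1},
                },
            },
            "delete": {
                "min_age": "180d",
                "actions": {"delete": {}},
            },
        }
    }
}

resp = requests.put(
    f"{ES_URL}/_ilm/policy/hpc-logs-hot-warm",  # placeholder policy name
    json=policy,
    auth=("elastic", "changeme"),               # placeholder credentials
    verify=False,
)
resp.raise_for_status()
print(resp.json())
```

    With data tiers configured, the warm phase's default migrate behavior moves indices off the data_hot nodes onto data_warm nodes automatically once they roll over and pass min_age.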

  23. 23
    LLNL-PRES-844466
    Continuous Monitoring
    § LC HPC is the gold standard for continuous monitoring at LLNL
    § Aligns with federal trends towards continuous monitoring
    § Reduce the burden of manual processes on sys admins, shifting those efforts to automation and alerting (see the sketch below)
    — Let SysAdmins focus on the engineering work
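
    To make the automation-and-alerting point concrete, here is a sketch of the kind of scheduled check that can replace a manual log review: count failed SSH logins across the fleet over a short window and flag anything unusual. The index pattern, message text, field names, and threshold are invented for the example; in practice this would more likely live in Kibana alerting rules or SIEM detections than a standalone script.

```python
# Hedged sketch of an automated check standing in for a manual log review:
# count recent failed SSH logins and alert past a threshold. Index name,
# field names, and threshold are assumptions for the example.
import requests

ES_URL = "https://elastic.example.llnl.gov:9200"  # placeholder endpoint
THRESHOLD = 50                                    # arbitrary example threshold

query = {
    "query": {
        "bool": {
            "filter": [
                {"match_phrase": {"message": "Failed password"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    }
}

resp = requests.post(
    f"{ES_URL}/logs-syslog-hpc/_count",  # placeholder data stream
    json=query,
    auth=("elastic", "changeme"),        # placeholder credentials
    verify=False,
)
resp.raise_for_status()
count = resp.json()["count"]
if count > THRESHOLD:
    print(f"ALERT: {count} failed SSH logins in the last 15 minutes")
else:
    print(f"OK: {count} failed SSH logins in the last 15 minutes")
```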
