
You Must Unlearn What You Have Learned

Ian Lee
February 01, 2023


High Performance Computing (HPC) systems generate massive amounts of data and logs, and retention requirements are only increasing to ensure data remains available for incident response, audits, and other business needs. Ingesting and making sense of all that data takes a correspondingly large amount of computing power and storage, and with El Capitan, a 2-exaflop computer arriving at LLNL in 2023, our processing needs will only grow. Over the past year, therefore, Livermore Computing at LLNL has been migrating our logging infrastructure to Elasticsearch and Kibana to handle the increasing volume of data even faster than before. This talk focuses on the changes we've made, why we decided to go with Elastic, and some of the bumps we've hit along the way.


Transcript

  1. LLNL-PRES-844466. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.
     You Must Unlearn What You Have Learned. ElasticON Public Sector 2023. Ian Lee, HPC Security Architect. 2023-02-01
  2. Livermore Computing: platforms, software, and consulting to enable world-class scientific simulation.
     Mission: enable national security missions, support groundbreaking computational science, and advance High Performance Computing.
     Platforms: Sierra #6, Lassen #34, rzVernal #107, Tioga #121, Ruby #145, Tenaya #165, Magma #173, Jade #258, Quartz #259, and many smaller compute clusters and infrastructure systems.
  3. The user-centric view of the Livermore Computing environment looks deceptively simple (as it should).
     • Each bubble represents a cluster
     • Size reflects theoretical peak performance, color indicates computing zone
     • Only large user-facing compute clusters are represented
     Most of the complex underlying infrastructure is hidden from users so that they can focus on their work.
  4. The operational view reflects the actual complexity of the computing environment that we manage. (Chart zones shown: SNSI, RZ, SCF, CZ, Green.)
     • Each bubble represents a managed cluster or system
     • Size reflects number of nodes, color indicates type of system
     • All production systems are shown, including non-user-facing systems
     • This is before arrival of CTS-2 and El Capitan
     • Complexity is increasing over time while staffing levels stay flat
     • Older systems are staying in service longer
     • It is imperative that we continually reevaluate and improve our tools and processes to keep complexity from getting out of control
  5. Rough Timeline (approx.)
     § October 2021: Division re-organization, architecture re-evaluation
     § November 2021: Proposal to migrate to Elasticsearch and Kibana
     § January 2022: Decision to migrate to Elasticsearch and Kibana
  6. Rough Timeline (approx.)
     § October 2021: Division re-organization, architecture re-evaluation
     § November 2021: Proposal to migrate to Elasticsearch and Kibana
     § January 2022: Decision to migrate to Elasticsearch and Kibana
     All of the things. All of them. (Original from Hyperbole and a Half) (https://www.gamespot.com/articles/the-new-gamespot-faq/1100-6414631/)
  7. Rough Timeline (approx.)
     § October 2021: Division re-organization, architecture re-evaluation
     § November 2021: Proposal to migrate to Elasticsearch and Kibana
     § January 2022: Decision to migrate to Elasticsearch and Kibana
     § February / March 2022: Fleshing out architecture ideas, finding hardware — Elastic 8.0 (Feb 10)
     § April – May 2022: Elastic Cluster Architecture v1
  8. Rough Timeline (approx.)
     § October 2021: Division re-organization, architecture re-evaluation
     § November 2021: Proposal to migrate to Elasticsearch and Kibana
     § January 2022: Decision to migrate to Elasticsearch and Kibana
     § February / March 2022: Fleshing out architecture ideas, finding hardware — Elastic 8.0 (Feb 10)
     § April – May 2022: Elastic Cluster Architecture v1
     § May – August 2022: Elastic Cluster Architecture v2
     § September – December 2022: Deployment into pre-production, production
     § January 2023 (last week): LLNL x Elastic Onsite
  9. El Capitan vs the Rest
     • Each bubble represents a cluster
     • Size reflects theoretical peak performance, color indicates computing zone
     • Only large user-facing compute clusters are represented
     El Capitan ~ 2 EF (~ 2,000,000 TF)
  10. Disclaimer: This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.
     Using or thinking about Elastic? I would love to chat! [email protected] @IanLee1521
  11. Previous Architecture (Splunk)
     [Diagram: Filebeat agents on cluster nodes across the datacenters, plus syslog sources, feed through an F5 load balancer into a pool of Logstash instances, which forward into Splunk. A minimal sketch of this flow follows below.]
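To make that flow concrete: log traffic from compute nodes reached Logstash (and ultimately Splunk) through an F5, either as Filebeat output or as plain syslog. Below is a minimal sketch of a node emitting one syslog event toward such a pipeline, assuming a hypothetical F5 address, the standard syslog UDP port, and RFC 3164 framing (none of which are specified on the slide):

```python
# Minimal sketch: one node sends a syslog line toward the F5 VIP that
# fronts the Logstash pool (hypothetical address and port).
import socket
from datetime import datetime

F5_VIP = "logs.example.llnl.gov"  # hypothetical load-balancer address
SYSLOG_PORT = 514                 # standard syslog UDP port

def send_syslog(message: str, hostname: str = "node001") -> None:
    """Send one RFC 3164-style syslog line over UDP."""
    pri = "<134>"  # facility local0 (16), severity info (6): 16*8 + 6
    timestamp = datetime.now().strftime("%b %d %H:%M:%S")
    line = f"{pri}{timestamp} {hostname} hpc-job: {message}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (F5_VIP, SYSLOG_PORT))

send_syslog("job 12345 completed on 128 nodes")
```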
  12. Current Elastic Hardware: ~112TB NVMe (total), ~2PB HDD (total), fronted by an F5.
     Myelin cluster:
     • Myelin1 (mgmt): 4-8GB / 2 core
     • Myelin2: 32GB / 8 core, master / data_hot / data_ingest, ~27TB NVMe
     • Myelin3: 32GB / 8 core, master / data_hot / data_ingest, ~27TB NVMe
     • Myelin4: 32GB / 8 core, data_warm
     • Myelin5: 32GB / 8 core, data_warm
     • 90x 16TB HDD JBOD (two shelves of 45 HDDs, ~512TB each)
     Axon cluster (mirrors Myelin):
     • Axon1 (mgmt): 4-8GB / 2 core
     • Axon2: 32GB / 8 core, master / data_hot / data_ingest, ~27TB NVMe
     • Axon3: 32GB / 8 core, master / data_hot / data_ingest, ~27TB NVMe
     • Axon4: 32GB / 8 core, data_warm
     • Axon5: 32GB / 8 core, data_warm
     • 90x 16TB HDD JBOD (two shelves of 45 HDDs, ~512TB each)
     Centrebrain cluster:
     • Centrebrain1 (mgmt): 4GB / 2 core
     • Centrebrain2 (dedicated master node): 8GB / 8 core, master / voting_only
     • Centrebrain3 (monitoring "cluster"): 16GB / 8 core
     (The hot/warm node roles above are sketched as an ILM policy below.)
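The data_hot / data_warm split in this layout maps directly onto an Elasticsearch index lifecycle management (ILM) policy: new indices are written on the NVMe-backed hot nodes, then migrate to the HDD-backed warm nodes as they age. Here is a minimal sketch using the elasticsearch-py 8.x client; the policy name, rollover limits, ages, endpoint, and credentials are illustrative assumptions, not the values LLNL actually uses:

```python
# Sketch of an ILM policy for a hot/warm tiered cluster (elasticsearch-py 8.x).
from elasticsearch import Elasticsearch

# Hypothetical endpoint and credentials.
es = Elasticsearch("https://elastic.example.llnl.gov:9200", api_key="...")

es.ilm.put_lifecycle(
    name="hpc-logs",  # hypothetical policy name
    policy={
        "phases": {
            # Written on the NVMe data_hot nodes; roll over as indices grow.
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"}
                }
            },
            # After 7 days, ILM's automatic data-tier migration moves shards
            # to nodes with the data_warm role (the HDD-backed JBODs).
            "warm": {
                "min_age": "7d",
                "actions": {"forcemerge": {"max_num_segments": 1}},
            },
        }
    },
)
```

Note that no explicit shard-allocation action is needed here: when an index enters the warm phase, ILM migrates its shards to data_warm nodes by default.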
  13. Continuous Monitoring
     § LC HPC is the gold standard for continuous monitoring at LLNL
     § Aligns with federal trends toward continuous monitoring
     § Reduces the burden of manual processes on sysadmins, shifting those efforts to automation and alerting (an alerting sketch follows below) — let sysadmins focus on the engineering work
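As one example of what shifting manual log review to automation and alerting might look like, here is a hedged sketch that polls Elasticsearch for recent failure events and raises an alert when they spike. The index pattern, field names (ECS-style), and threshold are hypothetical; in practice this could equally be a Kibana alerting rule rather than a script:

```python
# Sketch: count failure events from the last 15 minutes and alert on a spike.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://elastic.example.llnl.gov:9200", api_key="...")

resp = es.search(
    index="hpc-logs-*",  # hypothetical index pattern
    query={
        "bool": {
            "must": [{"match": {"event.outcome": "failure"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-15m"}}}],
        }
    },
    size=0,  # only the hit count is needed, not the documents themselves
)

failures = resp["hits"]["total"]["value"]
if failures > 100:  # hypothetical threshold
    print(f"ALERT: {failures} failure events in the last 15 minutes")
```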