
You Must Unlearn What You Have Learned

Ian Lee
February 01, 2023

High Performance Computing (HPC) systems generate massive amounts of data and logs, and retention requirements keep increasing to ensure data remains available for incident response, audits, and other business needs. Ingesting and making sense of all that data takes a correspondingly large amount of computing power and storage. With El Capitan, a 2 exaflop system, being deployed at LLNL in 2023, our processing needs will only grow. Over the past year, Livermore Computing at LLNL has therefore been migrating its logging infrastructure to Elasticsearch and Kibana to handle the increasing volume of data even faster than before. This talk covers the changes we've made, why we decided to go with Elastic, and some of the bumps we've hit along the way.

Transcript

  1. LLNL-PRES-844466
    This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
    You Must Unlearn What You Have Learned
    ElasticON Public Sector 2023
    Ian Lee
    HPC Security Architect
    2023-02-01

  2. 2
    LLNL-PRES-844466
    https://upload.wikimedia.org/wikipedia/commons/a/a8/U.S._National_labs_map.jpg

  3. 3
    LLNL-PRES-844466
    Livermore Computing
    Platforms, software, and consulting to enable world-class scientific simulation
    Mission: Enable national security missions, support groundbreaking computational science, and advance High Performance Computing
    Platforms: Sierra #6, Lassen #34, rzVernal #107, Tioga #121, Ruby #145, Tenaya #165, Magma #173, Jade #258, Quartz #259, and many smaller compute clusters and infrastructure systems

  4. 4
    LLNL-PRES-844466
    The user-centric view of the Livermore Computing environment looks deceptively simple (as it should)
    • Each bubble represents a cluster
    • Size reflects theoretical peak performance, color indicates computing zone
    • Only large user-facing compute clusters are represented
    Most of the complex underlying infrastructure is hidden from users so that they can focus on their work

  5. 5
    LLNL-PRES-844466
    The operational view reflects the actual complexity of the computing environment that we manage
    • Each bubble represents a managed cluster or system
    • Size reflects number of nodes, color indicates type of system
    • All production systems are shown, including non-user-facing systems
    • This is before arrival of CTS-2 and El Capitan
    • Complexity is increasing over time while staffing levels stay flat
    • Older systems are staying in service longer
    • It is imperative that we continually reevaluate and improve our tools and processes to keep complexity from getting out of control
    (Diagram zones: CZ, RZ, SCF, SNSI, Green)

  6. 6
    LLNL-PRES-844466
    The Old Ways

  7. 7
    LLNL-PRES-844466
    Log Flow to Splunk
    (Diagram: a cluster of nodes shipping logs to Splunk)

  8. 8
    LLNL-PRES-844466
    Security Dashboards – Operational and Compliance

  9. 9
    LLNL-PRES-844466
    © 2016 Ian Lee

  10. 10
    LLNL-PRES-844466
    § October 2021: Division re-organization, architecture re-evaluation
    § November 2021: Proposal to migrate to Elasticsearch and Kibana
    § January 2022: Decision to migrate to Elasticsearch and Kibana
    Rough Timeline (approx.)

  11. 11
    LLNL-PRES-844466
    § October 2021: Division re-organization, architecture re-evaluation
    § November 2021: Proposal to migrate to Elasticsearch and Kibana
    § January 2022: Decision to migrate to Elasticsearch and Kibana
    Rough Timeline (approx.)
    All of the things. All of them. (Original from Hyperbole and a Half) (https://www.gamespot.com/articles/the-new-gamespot-faq/1100-6414631/)

  12. 12
    LLNL-PRES-844466
    § October 2021: Division re-organization, architecture re-evaluation
    § November 2021: Proposal to migrate to Elasticsearch and Kibana
    § January 2022: Decision to migrate to Elasticsearch and Kibana
    § February / March 2022: Fleshing out architecture ideas, finding hardware
    — Elastic 8.0 (Feb 10)
    § April – May 2022: Elastic Cluster Architecture v1
    Rough Timeline (approx.)

  13. 13
    LLNL-PRES-844466

  14. 14
    LLNL-PRES-844466
    § October 2021: Division re-organization, architecture re-evaluation
    § November 2021: Proposal to migrate to Elasticsearch and Kibana
    § January 2022: Decision to migrate to Elasticsearch and Kibana
    § February / March 2022: Fleshing out architecture ideas, finding hardware
    — Elastic 8.0 (Feb 10)
    § April – May 2022: Elastic Cluster Architecture v1
    § May – August 2022: Elastic Cluster Architecture v2
    § September – December 2022: Deployment into pre-production, production
    § January 2023 (Last week): LLNL x Elastic Onsite
    Rough Timeline (approx.)

  15. 15
    LLNL-PRES-844466
    Current Architecture
    (Diagram: a cluster of nodes shipping logs into the new Elastic-based pipeline)
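
    As a rough illustration of what this flow amounts to at the API level, the sketch below indexes a few syslog-style events into Elasticsearch through the REST _bulk endpoint. The endpoint URL, data stream name, and credentials are placeholders for the example, and the real pipeline uses log shippers (e.g. Filebeat) rather than a hand-written client.

```python
# Minimal sketch (not LLNL's actual pipeline): push a few syslog-style events
# into an Elasticsearch data stream via the REST _bulk API. All names and
# credentials below are placeholders.
import json

import requests

ES_URL = "https://elastic.example.llnl.gov:9200"  # placeholder endpoint
DATA_STREAM = "logs-syslog-hpc"                   # placeholder data stream

events = [
    {"@timestamp": "2023-02-01T12:00:00Z", "host": {"name": "node001"},
     "message": "sshd[1234]: Accepted publickey for user1"},
    {"@timestamp": "2023-02-01T12:00:05Z", "host": {"name": "node002"},
     "message": "slurmd: launched job 987654"},
]

# The bulk API takes newline-delimited JSON: an action line, then the document.
# Data streams only accept the "create" action.
lines = []
for doc in events:
    lines.append(json.dumps({"create": {}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"

resp = requests.post(
    f"{ES_URL}/{DATA_STREAM}/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
    auth=("elastic", "changeme"),  # placeholder credentials
    verify=False,                  # a real deployment would verify TLS
)
resp.raise_for_status()
print("bulk errors:", resp.json()["errors"])
```

    The action line is {"create": {}} rather than {"index": {}} because data streams are append-only and only accept the create operation.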

  16. 16
    LLNL-PRES-844466
    © 2016 Ian Lee

  17. 17
    LLNL-PRES-844466
    El Capitan
    https://www.llnl.gov/news/llnl-and-hpe-partner-amd-el-capitan-projected-worlds-fastest-supercomputer

  18. 18
    LLNL-PRES-844466
    El Capitan vs the Rest
    • Each bubble represents a cluster
    • Size reflects theoretical peak performance, color indicates computing zone
    • Only large user-facing compute clusters are represented
    El Capitan: ~ 2 EF (~ 2,000,000 TF)

  19. 19
    LLNL-PRES-844466
    Today
    © 2018 Heather Lee

  20. Disclaimer
    This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United
    States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or
    implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus,
    product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific
    commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or
    imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC.
    The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or
    Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.
    Using or Thinking about Elastic?
    I would love to chat!
    [email protected]
    @IanLee1521

  21. 21
    LLNL-PRES-844466
    Previous Architecture (Splunk)
    (Diagram: across the datacenters, each cluster's nodes ship logs via Filebeat and syslog through an F5 to a bank of four Logstash instances, which forward to Splunk; additional clusters elided)

  22. 22
    LLNL-PRES-844466
    Current Elastic Hardware
    ~ 112TB NVMe (total), ~ 2PB HDD (total)
    Myelin
    • Myelin1 (mgmt): 4-8GB / 2 core
    • Myelin2, Myelin3: 16-32GB / 8-12 core, ~ 27TB NVMe each; Master, data_hot, data_ingest
    • Myelin4, Myelin5: 32GB / 8 core; data_warm
    • 90x 16TB HDD JBOD (2x 45 HDD, ~ 512TB each)
    Axon
    • Axon1 (mgmt): 4-8GB / 2 core
    • Axon2, Axon3: 16-32GB / 8-12 core, ~ 27TB NVMe each; Master, data_hot, data_ingest
    • Axon4, Axon5: 32GB / 8 core; data_warm
    • 90x 16TB HDD JBOD (2x 45 HDD, ~ 512TB each)
    Centrebrain
    • Centrebrain1 (mgmt): 4GB / 2 core
    • Centrebrain2 (Dedicated Master Node): 8GB / 8 core; Master, voting_only
    • Centrebrain3 (Monitoring “cluster”): 16GB / 8 core
    F5
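
    The data_hot / data_warm split above (NVMe behind the hot nodes, HDD JBODs behind the warm nodes) is the standard hot/warm tiering pattern driven by index lifecycle management (ILM). The sketch below registers a hypothetical hot/warm ILM policy over the REST API; the policy name, rollover thresholds, and retention ages are assumptions for illustration, not LLNL's actual settings.

```python
# Hedged sketch of a hot/warm ILM policy matching the node roles shown above.
# The policy name, thresholds, and retention are examples, not LLNL's settings.
import requests

ES_URL = "https://elastic.example.llnl.gov:9200"  # placeholder endpoint

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over to a new backing index on size or age.
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    # ILM's implicit migrate action shifts shards from the
                    # data_hot (NVMe) nodes to the data_warm (HDD) nodes;
                    # force-merge reduces segment overhead on the warm tier.
                    "forcemerge": {"max_num_segments": 1},
                    "allocate": {"number_of_replicas": 1},
                },
            },
            "delete": {
                "min_age": "180d",
                "actions": {"delete": {}},
            },
        }
    }
}

resp = requests.put(
    f"{ES_URL}/_ilm/policy/hpc-logs-hot-warm",  # placeholder policy name
    json=policy,
    auth=("elastic", "changeme"),               # placeholder credentials
    verify=False,
)
resp.raise_for_status()
print(resp.json())
```

    With data tiers configured, the warm phase's default migrate behavior moves indices off the data_hot nodes onto data_warm nodes automatically once they roll over and pass min_age.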

  23. 23
    LLNL-PRES-844466
    Continuous Monitoring
    § LC HPC is the gold standard for continuous monitoring at LLNL
    § Aligns with federal trends towards continuous monitoring
    § Reduce the burden of manual processes on sys admins, shifting those efforts to automation and alerting (see the sketch below)
    — Let SysAdmins focus on the engineering work
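
    To make the automation-and-alerting point concrete, here is a sketch of the kind of scheduled check that can replace a manual log review: count failed SSH logins across the fleet over a short window and flag anything unusual. The index pattern, message text, field names, and threshold are invented for the example; in practice this would more likely live in Kibana alerting rules or SIEM detections than a standalone script.

```python
# Hedged sketch of an automated check standing in for a manual log review:
# count recent failed SSH logins and alert past a threshold. Index name,
# field names, and threshold are assumptions for the example.
import requests

ES_URL = "https://elastic.example.llnl.gov:9200"  # placeholder endpoint
THRESHOLD = 50                                    # arbitrary example threshold

query = {
    "query": {
        "bool": {
            "filter": [
                {"match_phrase": {"message": "Failed password"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    }
}

resp = requests.post(
    f"{ES_URL}/logs-syslog-hpc/_count",  # placeholder data stream
    json=query,
    auth=("elastic", "changeme"),        # placeholder credentials
    verify=False,
)
resp.raise_for_status()
count = resp.json()["count"]
if count > THRESHOLD:
    print(f"ALERT: {count} failed SSH logins in the last 15 minutes")
else:
    print(f"OK: {count} failed SSH logins in the last 15 minutes")
```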
