
Ian Lee
March 13, 2024

Pass On What You Have Learned: Deploying to Production

A presentation on LLNL's experience deploying Elastic into production over the past year, covering both the good things we've seen and a variety of warts we've found along the way.

Originally presented at: https://elasticpublicsectorsummit.upgather.com/


Transcript

  1. Pass On What You Have Learned: Deploying to Production
     Elastic Public Sector Summit 2024
     Ian Lee, HPC Security Architect, Security Operations Team Lead
     2024-03-13
     LLNL-PRES-861410. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.
  2. El Capitan vs the Rest
     - Each bubble represents a cluster
     - Size reflects theoretical peak performance, color indicates computing zone
     - Only large user-facing compute clusters are represented
     - El Capitan ~ 2 EF (~ 2,000,000 TF)
  3. Performance is Noticeably Better
     - Significantly better performance → Opens new doors for analysis of log data
     - Splunk (fast mode)
       - index=lc | stats count by source | sort -count
     - Elastic (ESQL)
       - from logs-* | stats count = count() by log.file.path | limit 10000 | sort count desc

     Lookback     # documents   Splunk (fast mode)   Elastic (ESQL)
     60 minutes   50 M          133 sec              ~ 2 sec
     24 hours     1.8 B         2,294 sec            ~ 10 sec
     7 days       14 B          10,440 sec           ~ 20 sec
  4. Monitoring Vision Going Forward
     - Explore other offerings
       - ML / Anomaly Detection
       - Enterprise Search (unified search across web + confluence + gitlab)
       - Elastic Defend
     - More automated alerts / processes
  5. "Difficult to see; always in motion is the future" - Yoda
     - We've taken a significant hit to capabilities we've enjoyed for years.
     - Operational monitoring of our HPC systems is not as good today as it was with Splunk.
       - User Experience in particular is not as polished.
       - Enrich/Transform/Watcher system has been a significant pain point.
     - Looking forward to partnering with the Elastic and Federal communities further
       - https://github.com/LLNL/elastic-stacker
       - Upcoming Continuous Monitoring Dashboard repository
       - ESQL
  6. Ongoing Work We're Excited About
     - Kibana
       - [Dashboard][Research] Refactor Grid and Layout Systems #88710 **
       - Filter only the relevant panels in a dashboard #170395 **
       - Sparklines #3395
       - Allow dynamic naming of file attachments to watcher emails #169891
       - [Fleet] Allow KQL queries with no field specified in fleet endpoints #171425
     - Elastic Agent
       - Reusable integration policies #2227 **
       - Fleet Server configuration does not contain all the hosts available in the Elasticsearch cluster #2784
     - Elasticsearch
       - GPU accelerated Machine learning #61690
     - Integrations
       - Gitlab #1741
  7. Product Roadmaps?
     - LLNL product decisions and timelines are decades long
       - We are expected to deliver on our timelines and roadmaps
     - Will issues that we care about receive meaningful attention?
     - https://github.com/elastic/kibana/issues/17888