Slide 1

Slide 1 text

Pass On What You Have Learned: Deploying to Production
Elastic Public Sector Summit 2024
Ian Lee, HPC Security Architect, Security Operations Team Lead
2024-03-13
Lawrence Livermore National Security, LLC
LLNL-PRES-861410
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344.

Slide 2

Slide 2 text

Cluster Stats Today

Slide 3

Slide 3 text

El Capitan
https://www.llnl.gov/news/llnl-and-hpe-partner-amd-el-capitan-projected-worlds-fastest-supercomputer

Slide 4

Slide 4 text

El Capitan vs the Rest
• Each bubble represents a cluster
• Size reflects theoretical peak performance, color indicates computing zone
• Only large user-facing compute clusters are represented
El Capitan ~ 2 EF (~ 2,000,000 TF)

Slide 5

Slide 5 text

system.syslog per hour

Slide 6

Slide 6 text

Performance is Noticeably Better
• Significantly better performance → Opens new doors for analysis of log data
• Splunk (fast mode)
  - index=lc | stats count by source | sort -count
• Elastic (ESQL)
  - from logs-* | stats count = count() by log.file.path | limit 10000 | sort count desc

Lookback     # documents   Splunk (fast mode)   Elastic (ESQL)
60 minutes   50 M          133 sec              ~ 2 sec
24 hours     1.8 B         2,294 sec            ~ 10 sec
7 days       14 B          10,440 sec           ~ 20 sec
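
The timings in the table above can be turned into speedup ratios with a few lines of Python. This is only an illustrative sketch and not part of the original slides: the variable and function names are hypothetical, and the Elastic figures are the approximate ("~") values from the table, so the ratios are rough as well.

```python
# Timings copied from the slide's table. The Elastic values are approximate
# ("~" on the slide), so the resulting ratios are rough estimates.
timings = [
    # (lookback, splunk_seconds, elastic_seconds)
    ("60 minutes", 133, 2),
    ("24 hours", 2294, 10),
    ("7 days", 10440, 20),
]


def speedup(splunk_seconds, elastic_seconds):
    """Return how many times faster the Elastic query ran than the Splunk one."""
    return splunk_seconds / elastic_seconds


# Map each lookback window to its Splunk-to-Elastic speedup ratio.
speedups = {lookback: speedup(s, e) for lookback, s, e in timings}

for lookback, ratio in speedups.items():
    print(f"{lookback}: roughly {ratio:.1f}x faster")
```

Under these numbers the ratio grows with the lookback window, from about 66.5x at 60 minutes to about 229x at 24 hours and 522x at 7 days.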

Slide 7

Slide 7 text

Example: Auditd log issue

Slide 8

Slide 8 text

Jan 1 – Jan 2 (1.25 seconds)

Slide 9

Slide 9 text

Jan 1 – Jan 8 (20 seconds)

Slide 10

Slide 10 text

Nov 1 – Jan 8 (42 seconds)

Slide 11

Slide 11 text

Monitoring Vision Going Forward
• Explore other offerings
  - ML / Anomaly Detection
  - Enterprise Search (unified search across web + confluence + gitlab)
  - Elastic Defend
• More automated alerts / processes

Slide 12

Slide 12 text


Slide 13

Slide 13 text


Slide 14

Slide 14 text

• We’ve taken a significant hit to capabilities we’ve enjoyed for years.
• Operational monitoring of our HPC systems is not as good today as it was with Splunk.
  - User Experience in particular is not as polished.
  - Enrich/Transform/Watcher system has been a significant pain point.
• Looking forward to partnering with the Elastic and Federal communities further
  - https://github.com/LLNL/elastic-stacker
  - Upcoming Continuous Monitoring Dashboard repository
  - ESQL
“Difficult to see; always in motion is the future” - Yoda

Slide 15

Slide 15 text

Ongoing Work We’re Excited About
• Kibana
  - [Dashboard][Research] Refactor Grid and Layout Systems #88710 **
  - Filter only the relevant panels in a dashboard #170395 **
  - Sparklines #3395
  - Allow dynamic naming of file attachments to watcher emails #169891
  - [Fleet] Allow KQL queries with no field specified in fleet endpoints #171425
• Elastic Agent
  - Reusable integration policies #2227 **
  - Fleet Server configuration does not contain all the hosts available in the Elasticsearch cluster #2784
• Elasticsearch
  - GPU accelerated Machine learning #61690
• Integrations
  - Gitlab #1741

Slide 16

Slide 16 text

Product Roadmaps?
• LLNL product decisions and timelines are decades long
  - We are expected to deliver on our timelines and roadmaps
• Will issues that we care about receive meaningful attention?
https://github.com/elastic/kibana/issues/17888

Slide 17

Slide 17 text

Working on similar problems? I would love to chat! [email protected] @IanLee1521