Elastic{ON} 2018 - Bigger, Faster, Stronger - Leveling Up Enterprise Logging

Elastic Co
March 01, 2018

Transcript

  1. February 28, 2018 Bigger, Faster, Stronger: Leveling Up Enterprise Logging

    David Sarmanian (General Dynamics IT, Elastic Solutions Lead) and Jared McQueen (McQueen Solutions, Principal Systems Engineer)
  2. None
  3. What problems are we really trying to solve? Too much data!!!
  4. Problems:

    • Slow Data Ingest
    • Slow Query Times
    • Single Points of Failure
    • Visualizations
    • Very Costly to Scale
  5. What is the next solution?

  6. Customer Requirements/Wants

    • 5+ Years Searchable
    • 18 Months Online (Hot)
    • Petabyte Search
    • Simultaneous Queries
    • High Sustained EPS (events per second) Ingest
    • Automation
    • Fast Data Queries
    • Redundancy
    • Visualization and Metrics Dashboards
    • Actionable Data
  7. Time to pull our heads out of the sand

    What else is out there?

    • Magic Quadrant
    • Current Vendors
    • Recommended Vendors
    • Other
  8. Here is the plan, by the phases

    The slide numbers the phases 1 to 5. Labels, in slide order:

    • Requirements, Usability and Security
    • Order/Buy, Build, Test
    • Migrate Historical Data and Build a Streaming Capability
    • Regression and User Testing
    • Deploy
  9. None
  10. Key Takeaways

    • Now able to find the data in milliseconds instead of hours
    • Orchestration increases uptime and lowers O&S costs
    • Hundreds of users using Elastic without any performance issues
    • We have actionable visualizations through Kibana
    • Control cloud costs by keeping 6 months hot and 5 years searchable
  11. Let’s Get Technical

  12. Quick Stats

    • 2.6 million indexed docs / second
    • Delivered a long-term search capability: 5+ years in seconds, via IOI (Index of Index)
    • 10+ PB of logs indexed
  13. Elastic Engineering Core Principles (works 60% of the time, every time)

    Work Backwards
    • START by defining the end goal. Be specific.
    • Engineer in reverse

    Understand Constraints
    • Chase bottlenecks
    • “Instrument everything”
    • Don’t be afraid to go deep

    Scale Up
    • Only after optimization
    • Get parallel: Network, Compute, Storage
  14. GETTING DATA OUT (Using the 3 principles)

  15. Getting the data OUT: working backwards, define the end goal

    End goal, as specific as possible: compressed, multi-line JSON files saved to on-prem SAN.

    Diagram labels: ???, JSON { }
  16. Getting the data OUT: bottlenecks everywhere

    PL/SQL:
    • Disk I/O contention
    • Lacked valid JSON output
    • Slow cursors
    • Slow concurrency

    Result: ~700 EPS. Let’s try logstash.
  17. Getting the data OUT: more bottlenecks

    Logstash plugins used:
    • logstash-input-jdbc
    • logstash-codec-json
    • logstash-output-file
    • logstash-filter-metrics

    Result: ~1500 EPS. Can we do better?
  18. Getting the data OUT: more bottlenecks

    Use the tools at your disposal:
    - Node Stats
    - Hot Threads
    - htop, nload, iostat
    - VisualVM to profile heap

    Findings:
    • Slow JDBC transfers
    • JSON serialization was expensive

    Let’s write our own app.
  19. Getting the data OUT: custom Python app

    Results were impressive:
    • 1 core: ~10k EPS
    • 40 cores, multiprocessed and threaded: ~100,000 EPS

    Why it worked:
    • cx_Oracle
    • great JSON support
    • complete, granular control

    Now scale up!
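The export pattern on this slide (one worker process per slice of the source table, each writing its own compressed JSON file) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the presenters' application: `fetch_rows` is a hypothetical stand-in for a cx_Oracle cursor, the partition scheme and file naming are invented for the example, and the output is assumed to be one JSON document per line.

```python
import gzip
import json
import os
from multiprocessing import Pool


def fetch_rows(partition):
    """Hypothetical stand-in for a database cursor over one slice of the table.

    A real exporter would open its own connection here (for example with
    cx_Oracle) and yield one dict per row.
    """
    for i in range(3):
        yield {"partition": partition, "seq": i, "message": "example event"}


def export_partition(args):
    """Write one partition to its own gzip file, one JSON document per line."""
    partition, out_dir = args
    path = os.path.join(out_dir, "part-%05d.json.gz" % partition)
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row in fetch_rows(partition):
            # Compact separators keep the serialized documents small.
            fh.write(json.dumps(row, separators=(",", ":")) + "\n")
            count += 1
    return path, count


def export_all(partitions, out_dir, workers=4):
    """Fan partitions out across worker processes; each writes its own file."""
    os.makedirs(out_dir, exist_ok=True)
    with Pool(processes=workers) as pool:
        return pool.map(export_partition, [(p, out_dir) for p in partitions])
```

Calling `export_all(range(8), "export_out")` from a script's main guard would produce eight files. Giving every worker its own connection and its own output file avoids shared state between processes, which is one plausible way to get the near-linear per-core scaling the slide reports.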
  20. Getting the data OUT: open the flood gates

    Result across 10 boxes: ~1 million EPS. A quick word on scaling:
  21. Getting the data OUT: a quick word on scaling

    Be careful! You can:
    - impact infrastructure
    - degrade networks
    - disrupt storage
    - cascading compute
    - stress hardware*

    *Trust us: pay attention to high/critical temperature warning sensors on bare metal boxes.
  22. GETTING DATA IN (early test speed run: ~1.25M EPS)

  23. Getting the data IN: cloud architecture

    Diagram labels: JSON { }, bulk ingestors, Elasticsearch coordinating nodes

    High-level challenges in the cloud:
    • Network bottlenecks
    • Reliability not guaranteed, many outages
    • Getting data *into* cloud storage

    Absolutely use an orchestration tool!
  24. Getting the data IN: cloud architecture

    Many years’ worth of compressed JSON files sitting in S3 buckets.
  25. Getting the data IN: cloud architecture

    Python app to pull down compressed JSON and ship it to Elastic.

    • Bulk ingestors: (32) m4.xl instances (4 vCPU / 16 GiB RAM / high network performance)
    • Libraries: boto3, elasticsearch-py
    • Built for speed: multiprocessing, threading
    • 2:1 ratio of ingest instances to coordinating nodes

    Diagram labels: bulk ingestors, dedicated coordinating node, Elasticsearch data nodes
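A rough sketch of the "pull down compressed JSON and ship it" step, assuming each S3 object is a gzip file with one JSON document per line. The function names, the `doc` mapping type, and the index name are illustrative assumptions; the only boto3 call used is `get_object`, whose `Body` is a readable stream.

```python
import gzip
import io
import json


def actions_from_gzip(fileobj, index_name, doc_type="doc"):
    """Yield one bulk action per JSON line in a gzip-compressed stream.

    Being a generator, it never holds a whole file in memory, which is the
    shape bulk helpers such as elasticsearch-py's expect for `actions`.
    """
    with gzip.GzipFile(fileobj=fileobj) as gz:
        for raw in io.TextIOWrapper(gz, encoding="utf-8"):
            line = raw.strip()
            if not line:
                continue  # tolerate blank lines between documents
            yield {
                "_index": index_name,
                "_type": doc_type,  # mapping types exist in ES 5.x; drop on 7+
                "_source": json.loads(line),
            }


def actions_from_s3(bucket, key, index_name):
    """Stream one S3 object through the parser above (needs boto3 installed)."""
    import boto3  # imported lazily so the parser can be used without AWS access

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return actions_from_gzip(body, index_name)
```

Keeping the parser separate from the S3 call makes it easy to exercise with an in-memory `io.BytesIO` before pointing it at real buckets.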
  26. Getting the data IN: optimizing elasticsearch-py

        from elasticsearch import Elasticsearch, helpers

        es = Elasticsearch(maxsize=4)    # limit connections

        # parallel_bulk is lazy: iterate over results to actually send the data
        results = helpers.parallel_bulk(
            client=es,                   # client pointed at a dedicated coord node
            actions=getDocs(),           # use a generator
            thread_count=8,              # size of threadpool
            chunk_size=2500,             # batch size
        )

    See the elasticsearch-py docs for more info.
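One detail worth spelling out for the call on this slide: `helpers.parallel_bulk` returns a lazy generator of `(ok, item)` tuples, so nothing is indexed until the results are iterated. A minimal sketch of consuming it follows. `drain` and `run_ingest` are hypothetical helper names, `hosts` is whatever list of coordinating-node addresses you use, and `maxsize` is the connection-pool keyword of the pre-8.x Python clients.

```python
def drain(results):
    """Consume a stream of (ok, item) tuples and tally the outcome."""
    indexed, failed = 0, []
    for ok, item in results:
        if ok:
            indexed += 1
        else:
            failed.append(item)
    return indexed, failed


def run_ingest(hosts, actions):
    """Index `actions` through the given coordinating nodes and return totals."""
    # Imported here so drain() stays usable without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(hosts, maxsize=4)  # maxsize as on the slide (pre-8.x clients)
    results = helpers.parallel_bulk(
        client=es,
        actions=actions,
        thread_count=8,
        chunk_size=2500,
        raise_on_error=False,  # report failures as (False, item) instead of raising
    )
    return drain(results)
```

Collecting failures instead of raising on the first one is a design choice for long-running backfills; for small jobs the default behavior of raising may be simpler.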
  27. Getting the data IN: cloud architecture

    Data Nodes:
    • hundreds of d2.2xl instances
    • 12TB storage per box
    • Orchestration: SALT
    • CentOS 7
    • ES 5.6, X-Pack

    Coordinating Nodes:
    • (16) m4.xl instances (4/16/high)
    • Unpack and route bulk requests
    • moderately CPU intensive
    • highly network bound
  28. Getting the data IN: data retention strategies

    • Archive: Index historical data > snapshots > IOI > delete (optimized for ingest)
    • Recent (within X days): Index with replicas. Keep online (balanced between ingest, usability, reliability)
  29. Getting the data IN: getting fast ingest speeds

    High indexing rates require optimizing settings and mappings. Index settings for archive data:

        "settings": {
          "index": {
            "codec": "best_compression",
            "refresh_interval": "-1",                // disable refresh
            "number_of_shards": "10",                // shoot for 40-60gb shards
            "number_of_replicas": "0",               // danger! danger!
            "translog.durability": "async",          // fsync commit in the background
            "merge.scheduler.max_thread_count": "1"  // spinning disks
          }
        }

    Caution! These settings are for short-term indexing! Don’t use them for 30, 60, 90 day production data.
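The shard-size guideline and the "short-term only" caution can be sketched as a pair of helpers. This is an illustration under assumptions, not the presenters' tooling: the function names are hypothetical, 50 GB is simply the midpoint of the 40-60 GB range on the slide, and the post-load values (1 replica, 30s refresh) are placeholder choices. `refresh_interval` and `number_of_replicas` are dynamic index settings, so they can be changed after the load; codec and shard count are fixed at index creation.

```python
import math


def shard_count(expected_index_gb, target_shard_gb=50):
    """Pick a primary shard count so each shard lands near the target size."""
    return max(1, math.ceil(expected_index_gb / target_shard_gb))


def archive_settings(expected_index_gb):
    """Ingest-optimized settings for a short-lived archive index (see slide)."""
    return {
        "index": {
            "codec": "best_compression",
            "refresh_interval": "-1",       # no refresh while bulk loading
            "number_of_shards": str(shard_count(expected_index_gb)),
            "number_of_replicas": "0",      # no redundancy during the load
            "translog.durability": "async",
        }
    }


def post_load_settings(replicas=1, refresh_interval="30s"):
    """Dynamic settings to apply once loading ends, if the index stays online."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": str(replicas),
        }
    }
```

For example, an index expected to reach roughly 500 GB would get 10 primary shards under these assumptions, which matches the value shown on the slide.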
  30. Get the data in: enable case insensitivity

    Use templates to define your index settings/mappings. Always normalize keywords: jmcqueen != jMcQueen. For fast ingest, limit analyzed fields.

    Index settings:

        "normalizer": {
          "to_lower": {
            "filter": [ "lowercase", "asciifolding" ],
            "type": "custom",
            "char_filter": []
          }
        }

    Index mappings:

        "properties": {
          "sourceUserName": {
            "type": "keyword",
            "normalizer": "to_lower"
          }
        }
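To show where the two fragments on this slide sit inside a complete template, here is a minimal sketch that assembles a template body in Python. The helper name, the `logs-*` pattern in the usage note, and the `doc` mapping type are assumptions for illustration. The `template` key and per-type `mappings` follow the 5.x template format as far as I know; later versions use `index_patterns` and drop mapping types, so adjust for your cluster.

```python
def keyword_template(index_pattern, keyword_fields, doc_type="doc"):
    """Build an index template body with a lowercase normalizer on each field."""
    properties = {
        field: {"type": "keyword", "normalizer": "to_lower"}
        for field in keyword_fields
    }
    return {
        "template": index_pattern,  # 5.x key; 6.0+ uses "index_patterns"
        "settings": {
            "analysis": {
                # Normalizers are declared under the index "analysis" settings.
                "normalizer": {
                    "to_lower": {
                        "type": "custom",
                        "char_filter": [],
                        "filter": ["lowercase", "asciifolding"],
                    }
                }
            }
        },
        "mappings": {doc_type: {"properties": properties}},
    }
```

A call such as `keyword_template("logs-*", ["sourceUserName"])` yields a body that could be registered as an index template, so every new daily index picks up the same normalizer without per-index setup.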
  31. INDEX OF INDEX (AKA metadata AKA rollup, but IOI just sounds cooler)
  32. IOI: Enterprise logging invites interesting questions

    When your system contains such a rich and wonderful data set:

    Cyber Analysts:
    • When was the first time we ever saw communication with an IP address?
    • Approx time and frequency of unusual port activity?

    Network Operations:
    • When was the last time a network device was updated, or patched?
    • When was a box last seen on the network?

    Secret Squirrels / Information Assurance:
    • Request for user attributable data
    • Validate security controls for systems

    The usual response: We can only go back X days.
  33. IOI: Enables historical search

    • It’s expensive to keep 5+ years of enterprise logging online and searchable (“hot”/“warm”).
    • After X days, daily indices are snapshotted to “cold” storage.
    • Most often, analysts want to search since the beginning of time.
    • IOI allows us to quickly search metadata about documents stored in offline indices.
    • Serves as a starting point for further investigation and rehydration.
  34. IOI: General guidelines

    Engage with your analytical groups. Establish a list of valuable, low-cardinality fields:
    • source and dest IP
    • source and dest ports
    • user names
    • host, network (enclave) names
    • email address
    • event IDs

    Note: ‘event name’ is a poor choice, high cardinality.
  35. Get the data in: high level overview

    Flow: daily index > run IOI aggs > snapshot old index > alias new index

    • Iterate through your list of “high-value” fields:
      - run a terms aggregation
      - get the values and counts
      - store them back into a new index for that day, with a note in the doc that the data is archived
    • Add the newly-created summary index to the enterprise logging alias
  36. Get the data in: high level overview

        "_source": {
          "@timestamp": "2013-11-13",
          "userName": "jmcqueen",
          "count": 3,
          "note": "stored in archive"
        }
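The IOI steps (run a terms aggregation per high-value field, then store values and counts in a summary index) can be sketched as two small functions. This is a hedged illustration, not the presenters' code: the function names, the `values` aggregation name, the `size` default, and the `doc` mapping type are assumptions. The output document shape follows the example on this slide.

```python
def terms_agg_body(field, size=10000):
    """Search body that returns only the distinct values and counts of a field."""
    return {
        "size": 0,  # no hits, aggregation results only
        "aggs": {"values": {"terms": {"field": field, "size": size}}},
    }


def summary_docs(field, day, response, summary_index, doc_type="doc"):
    """Turn a terms-aggregation response into bulk actions for the summary index."""
    for bucket in response["aggregations"]["values"]["buckets"]:
        yield {
            "_index": summary_index,
            "_type": doc_type,  # mapping types exist in ES 5.x; drop on 7+
            "_source": {
                "@timestamp": day,
                field: bucket["key"],
                "count": bucket["doc_count"],
                "note": "stored in archive",
            },
        }
```

In use, the body from `terms_agg_body("userName")` would be sent as a search against the daily index, and the response passed to `summary_docs` to feed a bulk helper. A terms aggregation returns at most `size` buckets, which is one practical reason to restrict IOI to low-cardinality fields.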
  37. More Questions? Visit us at the AMA

  38. www.elastic.co

  39. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/

    Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders. Please attribute Elastic with a link to elastic.co