Elastic{ON} 2018 - Bigger, Faster, Stronger - Leveling Up Enterprise Logging

February 28, 2018 Bigger, Faster, Stronger: Leveling Up Enterprise Logging
David Sarmanian, General Dynamics IT, Elastic Solutions Lead
Jared McQueen, McQueen Solutions, Principal Systems Engineer

What problems are we really trying to solve?
Too much data!!!

Problems:
• Slow Data Ingest
• Slow Query Times
• Single Points of Failure
• Visualizations
• Very Costly to Scale

What is the next solution?

Customer Requirements/Wants
• 5+ Years Searchable
• 18 Months Online (Hot)
• Petabyte Search
• Simultaneous Queries
• High Sustained EPS (events per second) Ingest
• Automation
• Fast Data Queries
• Redundancy
• Visualization and Metrics Dashboards
• Actionable Data

Time to pull our heads out of the sand
What else is out there?
• Magic Quadrant
• Current Vendors
• Recommended Vendors
• Other

Here is the plan, by the phases
• Requirements, Usability and Security
• Order/Buy
• Build
• Test
• Migrate Historical Data and Build a Streaming Capability
• Regression and User Testing
• Deploy

Key Takeaways
• Now able to find the data in milliseconds instead of hours
• Orchestration increases uptime and lowers O&S costs
• Hundreds of users using Elastic without any performance issues
• We have actionable visualizations through the use of Kibana
• Control cloud costs by keeping 6 months hot and 5 years searchable

Let’s Get Technical

Quick Stats
• 2.6 million indexed docs / second
• Delivered a long-term search capability: 5+ years in seconds (IOI)
• 10+ PB of logs indexed

Elastic Engineering Core Principles
(Works 60% of the time, every time)
Work Backwards
• START by defining the end goal. Be specific.
• Engineer in reverse
Understand Constraints
• Chase bottlenecks
• “Instrument everything”
• Don’t be afraid to go deep
Scale Up
• Only after optimization
• Get parallel: Network, Compute, Storage

GETTING DATA OUT (Using the 3 principles)

Getting the data OUT
Working backwards: define the end goal
End Goal (be as specific as possible): compressed, multi-line JSON files saved to on-prem SAN

Getting the data OUT
Bottlenecks everywhere
PL/SQL:
• Disk I/O contention
• Lacked valid JSON output
• Slow cursors
• Slow concurrency
Result: ~700 EPS
Let’s try logstash

Getting the data OUT
More bottlenecks
logstash:
• logstash-input-jdbc
• logstash-codec-json
• logstash-output-file
• logstash-filter-metrics
Result: ~1500 EPS
Can we do better?

Getting the data OUT
More bottlenecks
Use tools at your disposal:
• Node Stats
• Hot Threads
• htop, nload, iostat
• VisualVM to profile heap
What they showed:
• Slow JDBC transfers
• JSON serialization was expensive
Let’s write our own app

Getting the data OUT
Custom Python App
• cx_Oracle
• great JSON support
• complete, granular control
Results were impressive:
• 1 Core: ~10k EPS
• 40 Cores, multiproc’d, threaded: ~100,000 EPS
now scale up!
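The slide names cx_Oracle, multiprocessing and threading but shows no code. As a rough sketch of the general shape only (not the presenters' app), the following splits an export into ID ranges and lets a pool of workers write gzip-compressed files with one JSON document per line. `fetch_rows` is a hypothetical stand-in for a database cursor, and the file naming and range partitioning are assumptions.

```python
import gzip
import json
import os
from concurrent.futures import ThreadPoolExecutor


def fetch_rows(start, stop):
    """Hypothetical stand-in for a database cursor over IDs [start, stop)."""
    for i in range(start, stop):
        yield {"id": i, "message": "event %d" % i}


def export_range(out_dir, start, stop):
    """Write one ID range to a gzip'd file, one JSON document per line."""
    path = os.path.join(out_dir, "events-%d-%d.json.gz" % (start, stop))
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row in fetch_rows(start, stop):
            fh.write(json.dumps(row) + "\n")
            count += 1
    return path, count


def export_all(out_dir, total, step, workers=4, executor_cls=ThreadPoolExecutor):
    """Partition [0, total) into ranges of `step` rows and export them in parallel."""
    ranges = [(s, min(s + step, total)) for s in range(0, total, step)]
    with executor_cls(max_workers=workers) as pool:
        futures = [pool.submit(export_range, out_dir, s, e) for s, e in ranges]
        # Results come back in submission order: (path, rows_written) per range
        return [f.result() for f in futures]
```

The sketch defaults to a thread pool for simplicity; passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` would spread the JSON serialization across cores, which is closer to the "multiproc’d, threaded" setup the slide describes.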

Getting the data OUT
Open the flood gates
Result across 10 boxes: ~1 Million EPS
a quick word on scaling:

Getting the data OUT
A quick word on scaling
Be careful!
• impact infrastructure
• degrade networks
• disrupt storage
• cascading compute
• stress hardware*
*Trust us, pay attention to high/critical temperature warning sensors on bare metal boxes

GETTING DATA IN
Early ~1.25M EPS test speed run

Getting the data IN
Cloud Architecture (diagram): JSON files, bulk ingestors, Elasticsearch coordinating nodes
High level challenges in the cloud:
• Network bottlenecks
• Reliability not guaranteed - many outages
• Getting data *into* cloud storage
Absolutely use an orchestration tool!

Getting the data IN
Cloud Architecture: many years’ worth of compressed JSON files sitting in S3 buckets

Getting the data IN
Cloud Architecture: bulk ingestors, a dedicated coordinating node, Elasticsearch data nodes
Bulk ingestors:
• Python app to pull down compressed JSON and ship to Elastic
• (32) m4.xl instances (4 vCPU / 16 GB RAM / high network performance)
• boto3, elasticsearch-py
• Built for speed: multiprocessing, threading
• 2:1 ratio of ingest instances to coordinating nodes

Getting the data IN
Optimizing elasticsearch-py

    from elasticsearch import Elasticsearch, helpers

    # No hosts are given on the slide, so this defaults to localhost:9200;
    # in practice, point it at a dedicated coordinating node.
    coord_node = Elasticsearch(maxsize=4)  # limit connections

    results = helpers.parallel_bulk(
        client=coord_node,   # dedicated coord node
        actions=getDocs(),   # use a generator (getDocs is your own function)
        thread_count=8,      # size of threadpool
        chunk_size=2500      # batch size
    )

    # parallel_bulk is lazy: nothing is sent until the results are iterated
    for ok, item in results:
        if not ok:
            print(item)

see elasticsearch-py docs for more info
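The slide passes `getDocs()` to `parallel_bulk` without showing it. A minimal sketch of such a generator, assuming the compressed files hold one JSON document per line and have already been pulled down to local disk; the function name, the index name argument and the `doc` type name are illustrative, not taken from the talk.

```python
import gzip
import json


def get_docs(paths, index_name, doc_type="doc"):
    """Yield one bulk action per non-empty line of each gzip'd JSON file."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                yield {
                    "_index": index_name,
                    "_type": doc_type,  # mapping types are still required on ES 5.x
                    "_source": json.loads(line),
                }
```

Because it is a generator, documents are decompressed and parsed only as the bulk helper asks for them, so memory use stays flat regardless of file size.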

Getting the data IN
Cloud Architecture: coordinating nodes and data nodes
Coordinating Nodes: (16) m4.xl instances (4/16/high)
• Unpack and route bulk requests
• moderately CPU intensive
• highly network bound
Data Nodes: hundreds of d2.2xl instances
• 12TB storage per box
Orchestration: SALT
• CentOS 7
• ES 5.6, X-Pack

Getting the data IN
Data Retention Strategies:
• Archive: Index historical data > snapshots > IOI > delete (optimized for ingest)
• Recent (within X days): Index with replicas. Keep online (balanced between ingest, usability, reliability)

Getting the data IN
Getting fast ingest speeds
High indexing rates require optimizing settings and mappings
index settings for archive data:

    "settings": {
      "index": {
        "codec": "best_compression",
        "refresh_interval": "-1",                 // disable refresh
        "number_of_shards": "10",                 // shoot for 40-60gb shards
        "number_of_replicas": "0",                // danger! danger!
        "translog.durability": "async",           // fsync commit in the background
        "merge.scheduler.max_thread_count": "1"   // spinning disks
      }
    }

Caution! These settings are for short-term indexing! Don’t use for 30, 60, 90 day production data.

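The "shoot for 40-60gb shards" note implies simple arithmetic: divide the expected index size by a target shard size and round up. A small sketch of that calculation; the 50 GB default is just the midpoint of the slide's range, and the function name is illustrative.

```python
import math


def shard_count(expected_index_gb, target_shard_gb=50):
    """Primary shards needed so that each holds roughly target_shard_gb."""
    if expected_index_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(expected_index_gb / target_shard_gb))
```

Under that assumption, the slide's `"number_of_shards": "10"` corresponds to an index of roughly 400 to 600 GB on disk.
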
Get the data in
Enable case insensitivity
Always normalize keywords: jmcqueen != jMcQueen
For fast ingest, limit analyzed fields
Use templates to define your index settings/mappings
index settings:

    "normalizer": {
      "to_lower": {
        "filter": [ "lowercase", "asciifolding" ],
        "type": "custom",
        "char_filter": []
      }
    }

index mappings:

    "properties": {
      "sourceUserName": {
        "type": "keyword",
        "normalizer": "to_lower"
      }
    }

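The slide shows the normalizer and the mapping as separate fragments. One way they might be combined into a single index template body for ES 5.x is sketched below. The function name, the index pattern and the mapping type name are assumptions; the `to_lower` normalizer and the keyword mapping are taken from the slide.

```python
def build_template(pattern, doc_type, keyword_fields):
    """Build an ES 5.x index template body with lowercased keyword fields."""
    normalizer = {
        "to_lower": {
            "type": "custom",
            "char_filter": [],
            "filter": ["lowercase", "asciifolding"],
        }
    }
    properties = {
        field: {"type": "keyword", "normalizer": "to_lower"}
        for field in keyword_fields
    }
    return {
        "template": pattern,  # ES 5.x key; 6.0 and later use "index_patterns"
        "settings": {"analysis": {"normalizer": normalizer}},
        "mappings": {doc_type: {"properties": properties}},
    }
```

With elasticsearch-py, a body like this could be registered through `es.indices.put_template(name=..., body=...)`, after which every new index matching the pattern picks up the same settings and mappings.
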
INDEX OF INDEX (AKA metadata, AKA rollup, but IOI just sounds cooler)

IOI
Enterprise Logging invites interesting questions when your system contains such a rich and wonderful data set
Cyber Analysts:
• When was the first time we ever saw communication with an IP address?
• Approx time and frequency of unusual port activity?
Network Operations:
• When was the last time a network device was updated, or patched?
• When was a box last seen on the network?
Secret Squirrels / Information Assurance:
• Request for user attributable data
• Validate security controls for systems
The usual response: We can only go back X days

IOI
Enables historical search
• It’s expensive to keep 5+ years of enterprise logging online and searchable (“hot”/“warm”)
• After X days, daily indices are snapshotted to “cold” storage
• Most often, analysts want to search since the beginning of time
• IOI allows us to quickly search metadata about documents stored in offline indices
• Serves as a starting point for further investigation, rehydration

IOI
General guidelines
Engage with your analytical groups. Establish a list of valuable, low-cardinality fields:
• source and dest IP
• source and dest ports
• user names
• host, network (enclave) names
• email address
• event IDs
Note: ‘event name’ is a poor choice, high cardinality

Get the data in
High level overview
Flow: daily index > run IOI aggs > snapshot old index > alias new index
Iterate through your list of “high-value” fields:
• run terms aggregation
• get the values and counts
• store them back into a new index for that day, with a note in the doc that the data is archived
Add newly-created summary index to enterprise logging alias
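The per-field loop described above can be sketched as two small helpers: one builds the terms-aggregation request body, the other turns the response buckets into summary documents. The function names, the aggregation name `values` and the `max_values` cap are assumptions; the summary document fields follow the example document shown in the deck.

```python
def terms_query(field, max_values=10000):
    """Search body asking for the distinct values of `field` and their counts."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {"values": {"terms": {"field": field, "size": max_values}}},
    }


def summary_docs(field, day, response):
    """Turn a terms-aggregation response into one summary doc per value."""
    for bucket in response["aggregations"]["values"]["buckets"]:
        yield {
            "@timestamp": day,
            field: bucket["key"],
            "count": bucket["doc_count"],
            "note": "stored in archive",
        }
```

A terms aggregation returns at most `size` buckets, so a field with more distinct values than `max_values` would lose its tail. That is one reason the guidelines favor low-cardinality fields.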

Get the data in
High level overview

    "_source": {
      "@timestamp": "2013-11-13",
      "userName": "jmcqueen",
      "count": 3,
      "note": "stored in archive"
    }

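Summary documents of this shape are what make "when did we first see X?" answerable without restoring snapshots. A hedged sketch of a search body for that question, assuming the field is mapped as a keyword, `@timestamp` is mapped as a date, and the value is given in its stored form; the function name is illustrative.

```python
def first_seen_query(field, value):
    """Search body returning the earliest summary doc where field == value."""
    return {
        "size": 1,  # only the earliest match is needed
        "query": {"term": {field: value}},
        "sort": [{"@timestamp": {"order": "asc"}}],
    }
```

Flipping the sort order to `desc` gives the matching "last seen" lookup.
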
More Questions? Visit us at the AMA

www.elastic.co

Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/
Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.
Please attribute Elastic with a link to elastic.co
