Elastic{ON} 2018 - Bigger, Faster, Stronger - Leveling Up Enterprise Logging

Elastic Co
March 01, 2018

Transcript

  1. February 28, 2018
    Bigger, Faster, Stronger: Leveling Up Enterprise Logging
    David Sarmanian
    General Dynamics IT, Elastic Solutions Lead
    Jared McQueen
    McQueen Solutions, Principal Systems Engineer

  3. What problems are we really trying to solve?
    Too much data!!!

  4. Problems:
    • Slow Data Ingest
    • Slow Query Times
    • Single Points of Failure
    • Visualizations
    • Very Costly to Scale

  5. What is the next solution?


  6. Customer Requirements/Wants
    5+ Years Searchable
    18 Months Online (Hot)
    Petabyte Search
    Simultaneous Queries
    High Sustained EPS Ingest
    Automation
    Fast Data Queries
    Redundancy
    Visualization and Metrics
    Dashboards
    Actionable Data


  7. Time to pull our heads out of the sand
    What else is out there?
    • Magic Quadrant
    • Current Vendors
    • Recommended Vendors
    • Other


  8. By the phases
    Here is the plan (phases 1 to 5):
    Requirements
    Usability and Security
    Order/Buy
    Build
    Test
    Migrate Historical Data and Build a Streaming Capability
    Regression and User Testing
    Deploy

  10. Key Takeaways
    • Now able to find the data in milliseconds instead of hours
    • Orchestration increases uptime and lowers O&S costs
    • Hundreds of users using Elastic without any performance issues
    • We have actionable visualizations through the use of Kibana
    • Control cloud costs by keeping 6 months hot and 5 years searchable

  11. Let’s Get Technical

  12. Quick Stats
    2.6 Million indexed docs / second
    Delivered a long-term search capability
    5+ years in seconds - IOI
    10+ PB of logs indexed

  13. Elastic Engineering Core Principles
    Works 60% of the time, every time
    Work Backwards
    • START by defining the end goal. Be specific.
    • Engineer in reverse
    Understand Constraints
    • Chase bottlenecks
    • “Instrument everything”
    • Don’t be afraid to go deep
    Scale Up
    • Only after optimization
    • Get parallel: Network, Compute, Storage

  14. GETTING DATA OUT
    (Using the 3 principles)

  15. Getting the data OUT
    Working backwards - define the end goal
    End Goal
    Be as specific as possible:
    Compressed, multi-line JSON files saved to on-prem SAN
    [Diagram: ??? → JSON files]

  16. Getting the data OUT
    Bottlenecks everywhere
    PL/SQL
    Disk I/O contention
    Lacked valid JSON output
    Slow cursors
    Slow concurrency
    Result: ~700 EPS
    Let’s try Logstash

  17. Getting the data OUT
    More bottlenecks
    logstash-input-jdbc
    logstash-codec-json
    logstash-output-file
    logstash-filter-metrics
    Result: ~1500 EPS
    Can we do better?

  18. Getting the data OUT
    More bottlenecks
    Use tools at your disposal:
    - Node Stats
    - Hot Threads
    - htop, nload, iostat
    - VisualVM to profile heap
    Slow JDBC transfers
    JSON serialization was expensive
    Let’s write our own app

  19. Getting the data OUT
    Custom Python App
    cx_Oracle
    great JSON support
    complete, granular control
    Results were impressive:
    1 Core: ~10k EPS
    40 Cores, multiproc’d, threaded: ~100,000 EPS
    now scale up!
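
The export step described on this slide could look roughly like the sketch below. It is an illustration under stated assumptions, not the speakers' application: `export_rows`, its parameters, and the one-document-per-line output layout are hypothetical. It accepts any Python DB-API cursor (the slide names cx_Oracle as the driver), so it can be tried without a database server.

```python
import gzip
import json


def export_rows(cursor, path, batch_size=1000):
    """Write every row of an already-executed DB-API cursor to a gzip file,
    one JSON object per line. Returns the number of rows written."""
    # The first element of each description entry is the column name (DB-API 2.0).
    columns = [col[0] for col in cursor.description]
    written = 0
    with gzip.open(path, "wt", encoding="utf-8") as out:
        while True:
            rows = cursor.fetchmany(batch_size)  # fetch in batches, not row by row
            if not rows:
                break
            for row in rows:
                # default=str is an assumption: it stringifies dates and decimals.
                out.write(json.dumps(dict(zip(columns, row)), default=str))
                out.write("\n")
                written += 1
    return written
```

One such function per worker process, each given its own slice of the source table, is one way to approach the multi-core numbers quoted above; how the speakers partitioned the work is not stated in the slides.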


  20. Getting the data OUT
    Open the flood gates
    Result across 10 boxes:
    ~1 Million EPS
    a quick word on scaling:

  21. Getting the data OUT
    A quick word on scaling
    Be careful!
    - impact infrastructure
    - degrade networks
    - disrupt storage
    - cascading compute
    - stress hardware*
    *Trust us: pay attention to high/critical temperature warning sensors on bare metal boxes

  22. GETTING DATA IN
    Early ~1.25M EPS test speed run


  23. Getting the data IN
    Cloud Architecture
    [Diagram: JSON files, bulk ingestors, Elasticsearch coordinating nodes]
    High level challenges in the cloud:
    Network bottlenecks
    Reliability not guaranteed - many outages
    Getting data *into* cloud storage
    Absolutely use an orchestration tool!

  24. Getting the data IN
    Cloud Architecture
    many years’ worth of compressed JSON files sitting in S3 buckets

  25. Getting the data IN
    Cloud Architecture
    Python app to pull down compressed JSON and ship to Elasticsearch.
    bulk ingestors: (32) m4.xl instances (4/16/high)
    boto3
    elasticsearch-py
    Built for speed: multiprocessing, threading
    2:1 ratio ingest instances to coordinating nodes
    [Diagram: bulk ingestors, dedicated coordinating node, Elasticsearch data nodes]
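
A generator of the kind passed as `actions=getDocs()` on the next slide might be sketched as follows. The names `iter_actions` and `doc_type`, and the assumption that each file holds one JSON document per line, are illustrative and not taken from the talk. In the real pipeline the compressed bytes would presumably come from S3 (for example via boto3's `get_object(...)["Body"].read()`); here they are passed in directly so the sketch stays self-contained.

```python
import gzip
import io
import json


def iter_actions(gz_bytes, index, doc_type="doc"):
    """Yield one bulk action per non-empty line of a gzip-compressed
    JSON-lines payload, in the dict shape the elasticsearch-py bulk
    helpers accept."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines between documents
            # "_type" is included because ES 5.x requires a document type;
            # the value "doc" is an assumption.
            yield {"_index": index, "_type": doc_type, "_source": json.loads(line)}
```

Because it is a generator, documents are decoded only as the bulk helper asks for them, which keeps memory use flat regardless of file size.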


  26. Getting the data IN
    optimizing elasticsearch-py
    from elasticsearch import Elasticsearch, helpers

    coord_node = Elasticsearch(maxsize=4)  # dedicated coord node; limit connections
    results = helpers.parallel_bulk(
        client=coord_node,
        actions=getDocs(),   # use a generator
        thread_count=8,      # size of threadpool
        chunk_size=2500,     # batch size
    )
    for ok, item in results:  # parallel_bulk is lazy; iterate to send the batches
        pass
    see elasticsearch-py docs for more info

  27. Getting the data IN
    Cloud Architecture
    Data Nodes:
    hundreds of d2.2xl instances
    12TB storage per box
    Orchestration: Salt
    CentOS 7
    ES 5.6, X-Pack
    Coordinating Nodes:
    (16) m4.xl instances (4/16/high)
    Unpack and route bulk requests
    moderately CPU intensive
    highly network bound

  28. Getting the data IN
    Cloud Architecture
    Data Retention Strategies:
    Archive:
    Index historical data > snapshots > IOI > delete
    (optimized for ingest)
    Recent (within X days):
    Index with replicas. Keep online
    (balanced between ingest, usability, reliability)

  29. Getting the data IN
    Getting fast ingest speeds
    High indexing rates require optimizing settings and mappings
    index settings for archive data:
    "settings": {
      "index": {
        "codec": "best_compression",
        "refresh_interval": "-1", // disable refresh
        "number_of_shards": "10", // shoot for 40-60 GB shards
        "number_of_replicas": "0", // danger! danger!
        "translog.durability": "async", // fsync commit in the background
        "merge.scheduler.max_thread_count": "1" // spinning disks
      }
    }
    Caution! These settings are for short-term indexing!
    Don’t use for 30, 60, 90 day production data.
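
The "shoot for 40-60 GB shards" guidance is simple arithmetic, sketched below. The function name `shard_count` and the 50 GB default target (the midpoint of the stated range) are assumptions for illustration.

```python
import math


def shard_count(expected_index_gb, target_shard_gb=50):
    """Primary shards needed so that each shard holds roughly
    target_shard_gb of data."""
    if expected_index_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(expected_index_gb / target_shard_gb))
```

For example, an index expected to reach about 500 GB (an illustrative figure, not one from the talk) works out to `shard_count(500) == 10`, matching the `number_of_shards` value shown on the slide.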


  30. Get the data in
    Enable case insensitivity
    Use templates to define your index settings/mappings
    index settings:
    "normalizer": {
      "to_lower": {
        "filter": [
          "lowercase",
          "asciifolding"
        ],
        "type": "custom",
        "char_filter": []
      }
    }
    index mappings:
    "properties": {
      "sourceUserName": {
        "type": "keyword",
        "normalizer": "to_lower"
      }
    }
    Always normalize keywords:
    jmcqueen != jMcQueen
    for fast ingest, limit analyzed fields
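
To see why the normalizer matters, the effect of the `lowercase` and `asciifolding` filters can be approximated in plain Python. This is only a rough stand-in: the real `asciifolding` filter handles many characters that this sketch does not, and the function name simply mirrors the normalizer name on the slide.

```python
import unicodedata


def to_lower(value):
    """Approximate the lowercase + asciifolding filters: lowercase the
    text, decompose accented characters, and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this applied at index and query time, `jMcQueen` and `jmcqueen` map to the same keyword, so a terms query or aggregation treats them as one user.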


  31. INDEX OF INDEX
    (AKA metadata AKA rollup, but IOI just sounds cooler)


  32. IOI
    Enterprise Logging invites interesting questions
    When your system contains such a rich and wonderful data set
    Cyber Analysts:
    When was the first time we ever saw communication with an IP address?
    Approx time and frequency of unusual port activity?
    Network Operations:
    When was the last time a network device was updated, or patched?
    When was a box last seen on the network?
    Secret Squirrels / Information Assurance:
    Requests for user-attributable data
    Validate security controls for systems
    The usual response: We can only go back X days

  33. IOI
    Enables historical search
    It’s expensive to keep 5+ years of enterprise logging online and searchable (“hot”/“warm”).
    After X days, daily indices are snapshotted to “cold” storage.
    Most often, analysts want to search since the beginning of time.
    IOI allows us to quickly search metadata about documents stored in offline indices.
    Serves as a starting point for further investigation, rehydration

  34. IOI
    General guidelines
    Engage with your analytical groups.
    Establish a list of valuable, low-cardinality fields.
    source and dest IP
    source and dest ports
    user names
    host, network (enclave) names
    email address
    event IDs
    Note: ‘event name’ is a poor choice, high cardinality

  35. Get the data in
    High level overview
    [Diagram: daily index, run IOI aggs, snapshot old index, alias new index]
    Iterate through your list of “high-value” fields:
    run terms aggregation
    get the values and counts
    store them back into a new index for that day
    with a note in the doc that the data is archived
    Add newly-created summary index to enterprise logging alias

  36. Get the data in
    High level overview
    "_source": {
    "@timestamp": "2013-11-13",
    "userName": "jmcqueen",
    "count": 3,
    "note": "stored in archive"
    }
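
The IOI step (run a terms aggregation per high-value field, then store the values and counts as summary documents) might be sketched like this. The helper names, the aggregation name `values`, and the bucket limit of 10000 are assumptions; the summary document shape follows the example on this slide.

```python
def terms_agg_body(field, size=10000):
    """Search body for a terms aggregation on one high-value field.
    "size": 0 suppresses the hits themselves; only buckets are returned."""
    return {"size": 0, "aggs": {"values": {"terms": {"field": field, "size": size}}}}


def summary_docs(field, day, buckets):
    """Turn terms-aggregation buckets (each with "key" and "doc_count")
    into IOI summary documents for one daily index."""
    for bucket in buckets:
        yield {
            "@timestamp": day,
            field: bucket["key"],
            "count": bucket["doc_count"],
            "note": "stored in archive",
        }
```

The buckets would come from the search response under `aggregations.values.buckets`, and the resulting documents would be bulk-indexed into that day's summary index before the original index is snapshotted and removed.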


  37. More Questions?
    Visit us at the AMA

  38. www.elastic.co


  39. Except where otherwise noted, this work is licensed under
    http://creativecommons.org/licenses/by-nd/4.0/
    Creative Commons and the double C in a circle are
    registered trademarks of Creative Commons in the United States and other countries.
    Third party marks and brands are the property of their respective holders.
    Please attribute Elastic with a link to elastic.co
