
Quit Yammer(ing) Away & Start Analyzing Your Log Data!

Elastic Co
February 17, 2016

Hear how Yammer ships logs from over 4,000 servers to Elasticsearch, Logstash, and Kibana for production on-call support, monitoring, and analytics, accelerating incident detection and helping to identify trends over time.

Transcript

  1. Hi. (WHO ARE YOU???)
     • About Me
       ‒ Recently graduated from UC Berkeley
       ‒ Software Engineer at Microsoft
       ‒ Tech Lead for Log Aggregation at Yammer
     • About Yammer
       ‒ Enterprise social network
       ‒ Global engineering teams committing code
       ‒ Millions of users

  2. The Situation with Numbers
     • 4K machines, each with multiple log files to ingest from variable locations
     • 35K events processed per second by Logstash
     • 1 TB of log data produced per day and stored in Elasticsearch

  3. Yammer’s Log Situation (Pre-Elasticsearch)
     • No tooling to efficiently analyze data
     • Manual search for on-call responses
     • Tail, Grep, Pray
     • Wants:
       ‒ 30-day log retention
       ‒ Visualization of data
       ‒ Faster on-call response time

  4. Yammer & Elasticsearch: Agenda
     1. First iteration: backpressure
     2. Second iteration: scaling out & incorporating Kafka
     3. Notable Issues & Workarounds
     4. Use Cases

  5. First Pipeline Iteration: Attempt and Result
     • First attempt at using ELK for log aggregation at Yammer
     • Demonstrated functionality of Logstash in extracting fields from logs
     • Demonstrated scalability of ES & effective visualizations in Kibana

  6. First Pipeline Iteration: Attempt and Result
     • First attempt at using ELK for log aggregation at Yammer
     • Demonstrated functionality of Logstash in extracting fields from logs
     • Demonstrated scalability of ES & effective visualizations in Kibana
     • Poorly received; not well planned out
       ‒ Improperly scaled (disk space); unaware of the size of incoming data
       ‒ Pipeline held up at many points, resulting in backpressure and long delays

  7. Yammer & Elasticsearch: Agenda
     1. First iteration: backpressure
     2. Second iteration: scaling out & incorporating Kafka
     3. Notable Issues & Workarounds
     4. Use Cases

  8. Second Pipeline Iteration: Kafka & Scale
     • Apache Kafka introduced into the pipeline
     • Backpressure relieved from applications
     • Reduced cascading failures
     • Split log collection from indexing (see the sketch below)

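     The deck does not show the pipeline's actual configuration; purely as a sketch of the collection/indexing split, the snippet below uses the kafka-python and elasticsearch Python clients (host names, topic, and index are made-up placeholders, client keyword arguments vary slightly across client versions, and Yammer's real pipeline uses Logstash components rather than custom code):

        # Sketch only: the shipper publishes raw log lines to Kafka and returns
        # immediately, so a slow indexer can no longer back-pressure the app host.
        from kafka import KafkaProducer, KafkaConsumer
        from elasticsearch import Elasticsearch

        TOPIC = "logs-nginx"  # hypothetical per-type topic

        producer = KafkaProducer(bootstrap_servers=["kafka:9092"])

        def ship(log_line: str) -> None:
            """Collection side: fire-and-forget into Kafka."""
            producer.send(TOPIC, log_line.encode("utf-8"))

        def index_forever() -> None:
            """Indexing side: consumes at its own pace, writes to a daily index."""
            es = Elasticsearch(["http://elasticsearch:9200"])
            consumer = KafkaConsumer(TOPIC,
                                     bootstrap_servers=["kafka:9092"],
                                     group_id="es-indexers")
            for msg in consumer:
                es.index(index="logs-nginx-2016.02.17",   # daily index (placeholder date)
                         body={"message": msg.value.decode("utf-8")})
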
  9. Second Pipeline Iteration: Kafka & Scale
     • Scale revamp: made informed calculations, determined the scale needed
     • Scaled out each component, allowing plenty of headroom
     • Rule of thumb: 30% headroom on all components (worked example below)

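     A back-of-the-envelope reading of the 30% rule against the deck's own numbers (35K events/s, 1 TB/day); interpreting "headroom" as keeping peak load at or below 70% of capacity is an assumption:

        HEADROOM = 0.30

        def required_capacity(peak_load: float) -> float:
            # Size the component so the observed peak uses only (1 - HEADROOM) of it.
            return peak_load / (1.0 - HEADROOM)

        print(required_capacity(35_000))      # Logstash throughput: ~50,000 events/s
        print(required_capacity(1.0) * 30)    # ~43 TB of disk for 30 days of open indices,
                                              # before replicas and closed-index retention
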
  10. Second Pipeline Iteration: Kafka & Scale
      • Elasticsearch cluster
        ‒ Hot (smaller) nodes, cold (larger) nodes
      • Daily Curator runs (sketched below)
        ‒ Allocate to cold; close @ 30 days
        ‒ Delete @ 75 days

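     Yammer drives retention with Curator; the sketch below expresses the same policy directly with the elasticsearch Python client purely for illustration (the index naming, the "box_type" allocation attribute, and the 2-day hot window are assumptions; the 30- and 75-day thresholds are from the slide):

        # Illustrative daily retention pass over indices named like "logs-nginx-2016.02.17".
        from datetime import datetime
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        today = datetime.utcnow().date()

        for name in es.indices.get(index="logs-*"):
            day = datetime.strptime(name.rsplit("-", 1)[-1], "%Y.%m.%d").date()
            age_days = (today - day).days
            if age_days > 75:
                es.indices.delete(index=name)        # Delete @ 75 days
            elif age_days > 30:
                es.indices.close(index=name)         # Close @ 30 days
            elif age_days > 2:
                # Move settled indices from hot to cold nodes via allocation filtering.
                es.indices.put_settings(
                    index=name,
                    body={"index.routing.allocation.require.box_type": "cold"})
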
  11. Second Pipeline Iteration: Kafka & Scale
      • Daily indices varying by type (os, nginx, etc.)
        ‒ Easy separation for our log types
        ‒ Speeds up queries by specifying index patterns (example below)
        ‒ Separate grok filter per type; less confusion

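     A small illustration of why the naming scheme helps (the names below are examples, not Yammer's actual scheme): a query that targets one type's pattern only has to touch that type's daily indices.

        from datetime import date
        from elasticsearch import Elasticsearch

        def daily_index(log_type: str, day: date) -> str:
            # e.g. daily_index("nginx", date(2016, 2, 17)) -> "logs-nginx-2016.02.17"
            return f"logs-{log_type}-{day:%Y.%m.%d}"

        es = Elasticsearch(["http://elasticsearch:9200"])
        # Searching "logs-nginx-*" skips the os/app/... indices entirely.
        resp = es.search(index="logs-nginx-*",
                         body={"query": {"term": {"status": 404}}})
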
  12. Yammer & Elasticsearch: Agenda
      1. First iteration: backpressure
      2. Second iteration: scaling out & incorporating Kafka
      3. Notable Issues & Workarounds
      4. Use Cases

  13. Notable Issues & Workarounds: Disk Space Capacity
      • Problem: running out of disk space; resource efficiency
        ‒ Pipeline backpressure
        ‒ Constant shard reallocation (if unevenly distributed) until disk space is balanced
        ‒ Ingest-to-index time delay

  14. Notable Issues & Workarounds: Disk Space Capacity
      • Problem: running out of disk space; resource efficiency
        ‒ Pipeline backpressure
        ‒ Constant shard reallocation (if unevenly distributed) until disk space is balanced
        ‒ Ingest-to-index time delay
      • Solution: scale out, use cloud-based solutions with larger storage, or go hybrid
        ‒ Yammer uses a hybrid solution
        ‒ Reduce replica usage where applicable (example below)
        ‒ Ensure log data is not abnormally (and unnecessarily) large
        ‒ Consider allocating for headroom, if possible

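     Reducing replicas, for example, is a single settings call; a hedged sketch with the elasticsearch Python client (the index pattern and replica count are examples, not values from the deck):

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        # Trade some redundancy for disk on older log indices.
        es.indices.put_settings(index="logs-*-2016.01.*",
                                body={"index": {"number_of_replicas": 0}})
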
  15. Notable Issues & Workarounds: Multiline Logs
      • Problem: the multiline filter was not thread-safe
        ‒ Want to record stack traces

  16. Notable Issues & Workarounds: Multiline Logs
      • Problem: the multiline filter was not thread-safe
        ‒ Want to record stack traces
      • Solution: wrap them in JSON! :) (illustrated below)
        ‒ Engineers @ Yammer developed a JSON logger for Dropwizard applications
        ‒ Multiline logs are handled before ingestion by Logstash Forwarders
        ‒ Exceptions logged to separate files

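     Yammer's logger is a Java/Dropwizard component; the Python sketch below only illustrates the idea of wrapping a multiline stack trace into a single-line JSON record (field names and the file name are made up):

        import json, logging, traceback

        logging.basicConfig(filename="exceptions.log", level=logging.ERROR)  # separate file
        log = logging.getLogger("app")

        def log_exception(message: str) -> None:
            record = {
                "level": "ERROR",
                "message": message,
                # The newlines live inside one JSON string, so the shipper sees one event.
                "stack_trace": traceback.format_exc(),
            }
            log.error(json.dumps(record))

        try:
            1 / 0
        except ZeroDivisionError:
            log_exception("division failed")
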
  17. Notable Issues & Workarounds: Fields on Fields on Fields
      • Problem: large numbers of fields take a long time to load in Kibana
        ‒ Unintended/uninformed use: fields constructed without aggregation in mind
        ‒ Fields that contain unique identifiers, e.g. ‘user.{id}.activated’
        ‒ Could reduce to just user.activated and aggregate across {id}
        ‒ Leads to failure to load Discover in Kibana 4.1

  18. Notable Issues & Workarounds: Fields on Fields on Fields
      • Problem: large numbers of fields take a long time to load in Kibana
        ‒ Unintended/uninformed use: fields constructed without aggregation in mind
        ‒ Fields that contain unique identifiers, e.g. ‘user.{id}.activated’
        ‒ Could reduce to just user.activated and aggregate across {id}
        ‒ Leads to failure to load Discover in Kibana 4.1
      • Solution: uniform mapping across an index; reduce usage of dynamic mappings (sketched below)
        ‒ If developing for an engineer base, document usage instructions
        ‒ Generalize structured logging as much as possible
        ‒ Try to keep fewer than 50-100 fields per index

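     One hedged sketch of that generalization: keep the id as a field value instead of baking it into the field name, and pin the index mapping so stray fields cannot multiply. The template name, field names, and the strict-dynamic choice are illustrative, and the typeless mapping format assumes a 7.x-era cluster:

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        es.indices.put_template(
            name="logs-app",
            body={
                "index_patterns": ["logs-app-*"],
                "mappings": {
                    "dynamic": "strict",                   # reject unexpected fields
                    "properties": {
                        "@timestamp": {"type": "date"},
                        "event":      {"type": "keyword"}, # e.g. "user.activated"
                        "user_id":    {"type": "keyword"}, # id as a value, not a field name
                        "message":    {"type": "text"},
                    },
                },
            },
        )
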
  19. Yammer & Elasticsearch: Agenda
      1. First iteration: backpressure
      2. Second iteration: scaling out & incorporating Kafka
      3. Notable Issues & Workarounds
      4. Use Cases

  20. Use Cases: Diagnosing Failure Points
      • Locating & isolating failed machines (query sketch below)
        ‒ On-call gets alerted for a machine going down... WAT???
        ‒ Metrics can only give part of the picture

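     The deck does this through Kibana; a rough query-API equivalent of the first step (pull the most recent log lines for the host that paged) might look like the sketch below, where the "host" field name and the 15-minute window are assumptions:

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        resp = es.search(
            index="logs-*",
            body={
                "size": 100,
                "sort": [{"@timestamp": "desc"}],
                "query": {"bool": {"filter": [
                    {"term": {"host": "web-042"}},                  # hypothetical host
                    {"range": {"@timestamp": {"gte": "now-15m"}}},
                ]}},
            },
        )
        for hit in resp["hits"]["hits"]:
            print(hit["_source"].get("message"))
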
  21. Use Cases: Evaluating Deployments
      • Watching logs for exception spikes generated post-deploy (query sketch below)
      • Provides confidence & assurance in successful deploys

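     A query-API sketch of the same check (the "level" field name, index pattern, and one-hour window are assumptions): count ERROR events per minute after the deploy and look for a spike.

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        resp = es.search(
            index="logs-app-*",
            body={
                "size": 0,
                "query": {"bool": {"filter": [
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ]}},
                "aggs": {"errors_per_minute": {
                    "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"}}},
            },
        )
        for bucket in resp["aggregations"]["errors_per_minute"]["buckets"]:
            print(bucket["key_as_string"], bucket["doc_count"])
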
  22. Use Cases: Trend Analysis
      • Visualizations can give insight into potentially invisible points of concern
        ‒ Are all networks healthy (i.e. returning healthy response codes)?
        ‒ Ex. lots of 4xx responses coming from a particular network (query sketch below)

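     As a sketch of the underlying aggregation (the "network" and "status" field names are assumptions about the log schema), count 4xx responses per network and see whether one dominates:

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        resp = es.search(
            index="logs-nginx-*",
            body={
                "size": 0,
                "query": {"range": {"status": {"gte": 400, "lt": 500}}},
                "aggs": {"by_network": {"terms": {"field": "network", "size": 20}}},
            },
        )
        for bucket in resp["aggregations"]["by_network"]["buckets"]:
            print(bucket["key"], bucket["doc_count"])
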
  23. Use Cases: Trend Analysis
      • Percentile aggregations (query sketch below)
        ‒ Ex. aggregating over API endpoints for request times
        ‒ Evaluating slow points as points of concern and project targets

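     A sketch of that aggregation shape (the "endpoint" and "request_time_ms" field names are assumptions): p50/p95/p99 request time per endpoint.

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        resp = es.search(
            index="logs-api-*",
            body={
                "size": 0,
                "aggs": {"by_endpoint": {
                    "terms": {"field": "endpoint", "size": 25},
                    "aggs": {"latency": {"percentiles": {
                        "field": "request_time_ms", "percents": [50, 95, 99]}}},
                }},
            },
        )
        for bucket in resp["aggregations"]["by_endpoint"]["buckets"]:
            print(bucket["key"], bucket["latency"]["values"])
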
  24. Conclusion: What Elastic & Log Aggregation Has Done for Yammer
      • Faster on-call incident response
      • Confidence in deploys
      • Identify suspicious trends
      • Determine metrics of concern