
Quit Yammer(ing) Away & Start Analyzing Your Log Data!

Elastic Co
February 17, 2016

Hear how Yammer ships logs from over 4,000 servers to Elasticsearch, Logstash, and Kibana for production on-call support, monitoring, and analytics, accelerating incident detection and helping to identify trends over time.

Transcript

  1. Hi. (WHO ARE YOU???)
     • About Me
       ‒ Recently graduated from UC Berkeley
       ‒ Software Engineer at Microsoft
       ‒ Tech Lead for Log Aggregation at Yammer
     • About Yammer
       ‒ Enterprise social network
       ‒ Global engineering teams committing code
       ‒ Millions of users

  2. The Situation with Numbers
     • 4K machines, each with multiple log files to ingest from variable locations
     • 35K events processed per second by Logstash
     • 1 TB of log data produced per day and stored in Elasticsearch

  3. Yammer’s Log Situation (Pre-Elasticsearch)
     • No tooling to efficiently analyze data
     • Manual search for on-call responses
     • Tail, Grep, Pray
     • Wants:
       ‒ 30-day log retention
       ‒ Visualization of data
       ‒ Faster on-call response time

  4. Yammer & Elasticsearch: Agenda
     1. First iteration: backpressure
     2. Second iteration: scaling out & incorporating Kafka
     3. Notable Issues & Workarounds
     4. Use Cases

  5. First Pipeline Iteration: Attempt and Result
     • First attempt at using ELK for log aggregation at Yammer
     • Demonstrated functionality of Logstash in extracting fields from logs
     • Demonstrated scalability of ES & effective visualizations in Kibana

  6. First Pipeline Iteration: Attempt and Result
     • First attempt at using ELK for log aggregation at Yammer
     • Demonstrated functionality of Logstash in extracting fields from logs
     • Demonstrated scalability of ES & effective visualizations in Kibana
     • Poorly received; not well planned out
       ‒ Improperly scaled (disk space); unaware of the size of incoming data
       ‒ Pipeline held up at many points, resulting in backpressure and long delays

  7. Yammer & Elasticsearch: Agenda
     1. First iteration: backpressure
     2. Second iteration: scaling out & incorporating Kafka
     3. Notable Issues & Workarounds
     4. Use Cases

  8. Second Pipeline Iteration: Kafka & Scale
     • Apache Kafka introduced into the pipeline
     • Backpressure relieved from applications
     • Reduced cascading failures
     • Split log collection from indexing (see the sketch below)

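     The deck does not show the pipeline's actual configuration; purely as a sketch of the collection/indexing split, the snippet below uses the kafka-python and elasticsearch Python clients (host names, topic, and index are made-up placeholders, client keyword arguments vary slightly across client versions, and Yammer's real pipeline uses Logstash components rather than custom code):

        # Sketch only: the shipper publishes raw log lines to Kafka and returns
        # immediately, so a slow indexer can no longer back-pressure the app host.
        from kafka import KafkaProducer, KafkaConsumer
        from elasticsearch import Elasticsearch

        TOPIC = "logs-nginx"  # hypothetical per-type topic

        producer = KafkaProducer(bootstrap_servers=["kafka:9092"])

        def ship(log_line: str) -> None:
            """Collection side: fire-and-forget into Kafka."""
            producer.send(TOPIC, log_line.encode("utf-8"))

        def index_forever() -> None:
            """Indexing side: consumes at its own pace, writes to a daily index."""
            es = Elasticsearch(["http://elasticsearch:9200"])
            consumer = KafkaConsumer(TOPIC,
                                     bootstrap_servers=["kafka:9092"],
                                     group_id="es-indexers")
            for msg in consumer:
                es.index(index="logs-nginx-2016.02.17",   # daily index (placeholder date)
                         body={"message": msg.value.decode("utf-8")})
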
  9. Second Pipeline Iteration: Kafka & Scale
     • Scale revamp: made informed calculations, determined the scale needed
     • Scaled out each component, allowing plenty of headroom
     • Rule of thumb: 30% headroom on all components (worked example below)

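     A back-of-the-envelope reading of the 30% rule against the deck's own numbers (35K events/s, 1 TB/day); interpreting "headroom" as keeping peak load at or below 70% of capacity is an assumption:

        HEADROOM = 0.30

        def required_capacity(peak_load: float) -> float:
            # Size the component so the observed peak uses only (1 - HEADROOM) of it.
            return peak_load / (1.0 - HEADROOM)

        print(required_capacity(35_000))      # Logstash throughput: ~50,000 events/s
        print(required_capacity(1.0) * 30)    # ~43 TB of disk for 30 days of open indices,
                                              # before replicas and closed-index retention
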
  10. Second Pipeline Iteration: Kafka & Scale
      • Elasticsearch cluster
        ‒ Hot (smaller) nodes, cold (larger) nodes
      • Daily Curator runs (sketched below)
        ‒ Allocate to cold; close @ 30 days
        ‒ Delete @ 75 days

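     Yammer drives retention with Curator; the sketch below expresses the same policy directly with the elasticsearch Python client purely for illustration (the index naming, the "box_type" allocation attribute, and the 2-day hot window are assumptions; the 30- and 75-day thresholds are from the slide):

        # Illustrative daily retention pass over indices named like "logs-nginx-2016.02.17".
        from datetime import datetime
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        today = datetime.utcnow().date()

        for name in es.indices.get(index="logs-*"):
            day = datetime.strptime(name.rsplit("-", 1)[-1], "%Y.%m.%d").date()
            age_days = (today - day).days
            if age_days > 75:
                es.indices.delete(index=name)        # Delete @ 75 days
            elif age_days > 30:
                es.indices.close(index=name)         # Close @ 30 days
            elif age_days > 2:
                # Move settled indices from hot to cold nodes via allocation filtering.
                es.indices.put_settings(
                    index=name,
                    body={"index.routing.allocation.require.box_type": "cold"})
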
  11. Second Pipeline Iteration: Kafka & Scale
      • Daily indices varying by type (os, nginx, etc.)
        ‒ Easy separation for our log types
        ‒ Speeds up queries by specifying index patterns (example below)
        ‒ Separate grok filter per type; less confusion

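     A small illustration of why the naming scheme helps (the names below are examples, not Yammer's actual scheme): a query that targets one type's pattern only has to touch that type's daily indices.

        from datetime import date
        from elasticsearch import Elasticsearch

        def daily_index(log_type: str, day: date) -> str:
            # e.g. daily_index("nginx", date(2016, 2, 17)) -> "logs-nginx-2016.02.17"
            return f"logs-{log_type}-{day:%Y.%m.%d}"

        es = Elasticsearch(["http://elasticsearch:9200"])
        # Searching "logs-nginx-*" skips the os/app/... indices entirely.
        resp = es.search(index="logs-nginx-*",
                         body={"query": {"term": {"status": 404}}})
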
  12. Yammer & Elasticsearch: Agenda
      1. First iteration: backpressure
      2. Second iteration: scaling out & incorporating Kafka
      3. Notable Issues & Workarounds
      4. Use Cases

  13. Notable Issues & Workarounds: Disk Space Capacity
      • Problem: running out of disk space; resource efficiency
        ‒ Pipeline backpressure
        ‒ Constant shard reallocation (if unevenly distributed) until disk space is balanced
        ‒ Ingest-to-index time delay

  14. Notable Issues & Workarounds: Disk Space Capacity
      • Problem: running out of disk space; resource efficiency
        ‒ Pipeline backpressure
        ‒ Constant shard reallocation (if unevenly distributed) until disk space is balanced
        ‒ Ingest-to-index time delay
      • Solution: scale out, use cloud-based solutions with larger storage, or go hybrid
        ‒ Yammer uses a hybrid solution
        ‒ Reduce replica usage where applicable (example below)
        ‒ Ensure log data is not abnormally (and unnecessarily) large
        ‒ Consider allocating for headroom, if possible

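     Reducing replicas, for example, is a single settings call; a hedged sketch with the elasticsearch Python client (the index pattern and replica count are examples, not values from the deck):

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        # Trade some redundancy for disk on older log indices.
        es.indices.put_settings(index="logs-*-2016.01.*",
                                body={"index": {"number_of_replicas": 0}})
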
  15. Notable Issues & Workarounds: Multiline Logs
      • Problem: the multiline filter was not thread-safe
        ‒ Want to record stack traces

  16. Notable Issues & Workarounds: Multiline Logs
      • Problem: the multiline filter was not thread-safe
        ‒ Want to record stack traces
      • Solution: wrap them in JSON! :) (illustrated below)
        ‒ Engineers @ Yammer developed a JSON logger for Dropwizard applications
        ‒ Multiline logs are handled before ingestion by Logstash Forwarders
        ‒ Exceptions logged to separate files

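     Yammer's logger is a Java/Dropwizard component; the Python sketch below only illustrates the idea of wrapping a multiline stack trace into a single-line JSON record (field names and the file name are made up):

        import json, logging, traceback

        logging.basicConfig(filename="exceptions.log", level=logging.ERROR)  # separate file
        log = logging.getLogger("app")

        def log_exception(message: str) -> None:
            record = {
                "level": "ERROR",
                "message": message,
                # The newlines live inside one JSON string, so the shipper sees one event.
                "stack_trace": traceback.format_exc(),
            }
            log.error(json.dumps(record))

        try:
            1 / 0
        except ZeroDivisionError:
            log_exception("division failed")
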
  17. Notable Issues & Workarounds: Fields on Fields on Fields
      • Problem: large numbers of fields take a long time to load in Kibana
        ‒ Unintended/uninformed use: fields constructed without aggregation in mind
        ‒ Fields that contain unique identifiers, e.g. ‘user.{id}.activated’
        ‒ Could reduce to just user.activated and aggregate across {id}
        ‒ Leads to failure to load Discover in Kibana 4.1

  18. Notable Issues & Workarounds: Fields on Fields on Fields
      • Problem: large numbers of fields take a long time to load in Kibana
        ‒ Unintended/uninformed use: fields constructed without aggregation in mind
        ‒ Fields that contain unique identifiers, e.g. ‘user.{id}.activated’
        ‒ Could reduce to just user.activated and aggregate across {id}
        ‒ Leads to failure to load Discover in Kibana 4.1
      • Solution: uniform mapping across an index; reduce usage of dynamic mappings (sketched below)
        ‒ If developing for an engineer base, document usage instructions
        ‒ Generalize structured logging as much as possible
        ‒ Try to keep fewer than 50-100 fields per index

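     One hedged sketch of that generalization: keep the id as a field value instead of baking it into the field name, and pin the index mapping so stray fields cannot multiply. The template name, field names, and the strict-dynamic choice are illustrative, and the typeless mapping format assumes a 7.x-era cluster:

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        es.indices.put_template(
            name="logs-app",
            body={
                "index_patterns": ["logs-app-*"],
                "mappings": {
                    "dynamic": "strict",                   # reject unexpected fields
                    "properties": {
                        "@timestamp": {"type": "date"},
                        "event":      {"type": "keyword"}, # e.g. "user.activated"
                        "user_id":    {"type": "keyword"}, # id as a value, not a field name
                        "message":    {"type": "text"},
                    },
                },
            },
        )
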
  19. Yammer & Elasticsearch: Agenda
      1. First iteration: backpressure
      2. Second iteration: scaling out & incorporating Kafka
      3. Notable Issues & Workarounds
      4. Use Cases

  20. Use Cases: Diagnosing Failure Points
      • Locating & isolating failed machines (query sketch below)
        ‒ On-call gets alerted for a machine going down... WAT???
        ‒ Metrics can only give part of the picture

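     The deck does this through Kibana; a rough query-API equivalent of the first step (pull the most recent log lines for the host that paged) might look like the sketch below, where the "host" field name and the 15-minute window are assumptions:

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        resp = es.search(
            index="logs-*",
            body={
                "size": 100,
                "sort": [{"@timestamp": "desc"}],
                "query": {"bool": {"filter": [
                    {"term": {"host": "web-042"}},                  # hypothetical host
                    {"range": {"@timestamp": {"gte": "now-15m"}}},
                ]}},
            },
        )
        for hit in resp["hits"]["hits"]:
            print(hit["_source"].get("message"))
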
  21. Use Cases: Evaluating Deployments
      • Watching logs for exception spikes generated post-deploy (query sketch below)
      • Provides confidence & assurance in successful deploys

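     A query-API sketch of the same check (the "level" field name, index pattern, and one-hour window are assumptions): count ERROR events per minute after the deploy and look for a spike.

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        resp = es.search(
            index="logs-app-*",
            body={
                "size": 0,
                "query": {"bool": {"filter": [
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ]}},
                "aggs": {"errors_per_minute": {
                    "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"}}},
            },
        )
        for bucket in resp["aggregations"]["errors_per_minute"]["buckets"]:
            print(bucket["key_as_string"], bucket["doc_count"])
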
  22. Use Cases: Trend Analysis
      • Visualizations can give insight into potentially invisible points of concern
        ‒ Are all networks healthy (i.e. returning healthy response codes)?
        ‒ Ex. lots of 4xx responses coming from a particular network (query sketch below)

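     As a sketch of the underlying aggregation (the "network" and "status" field names are assumptions about the log schema), count 4xx responses per network and see whether one dominates:

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        resp = es.search(
            index="logs-nginx-*",
            body={
                "size": 0,
                "query": {"range": {"status": {"gte": 400, "lt": 500}}},
                "aggs": {"by_network": {"terms": {"field": "network", "size": 20}}},
            },
        )
        for bucket in resp["aggregations"]["by_network"]["buckets"]:
            print(bucket["key"], bucket["doc_count"])
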
  23. Use Cases: Trend Analysis
      • Percentile aggregations (query sketch below)
        ‒ Ex. aggregating over API endpoints for request times
        ‒ Evaluating slow points as points of concern and project targets

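     A sketch of that aggregation shape (the "endpoint" and "request_time_ms" field names are assumptions): p50/p95/p99 request time per endpoint.

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://elasticsearch:9200"])
        resp = es.search(
            index="logs-api-*",
            body={
                "size": 0,
                "aggs": {"by_endpoint": {
                    "terms": {"field": "endpoint", "size": 25},
                    "aggs": {"latency": {"percentiles": {
                        "field": "request_time_ms", "percents": [50, 95, 99]}}},
                }},
            },
        )
        for bucket in resp["aggregations"]["by_endpoint"]["buckets"]:
            print(bucket["key"], bucket["latency"]["values"])
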
  24. Conclusion: What Elastic & Log Aggregation Has Done for Yammer
      • Faster on-call incident response
      • Confidence in deploys
      • Identify suspicious trends
      • Determine metrics of concern