Scaling Elasticsearch, Logstash and Kibana - Angad Singh

Slide 1

Slide 1 text

ELK Log processing at Scale
DevOpsDays 2015, Singapore
Angad Singh

Slide 2

Slide 2 text

About me
DevOps at Viki, Inc - a global video streaming site with subtitles.
Previously a Twitter SRE; National University of Singapore.
Twitter @angadsg, GitHub @angad

Slide 3

Slide 3 text

Elasticsearch - Log indexing and searching
Logstash - Log ingestion plumbing
Kibana - Frontend

Slide 4

Slide 4 text

Metrics vs Logging
Metrics
● Numeric timeseries data
● Actionable
● Counts, statistical (p90, p99 etc.)
● Scalable, cost-effective solutions already available

Slide 5

Slide 5 text

Metrics vs Logging
Logging
● Useful for debugging
● Catch-all
● Full text searching
● Computationally intensive, harder to scale

Slide 6

Slide 6 text

Alerting and Monitoring at Viki
Deeper level debugging with application logs
Success Rate Alert for service X

Slide 7

Slide 7 text

Logs
● Application logs - stack traces, handled exceptions
● Access logs - status codes, URI, HTTP method at all levels of the stack
● Client logs - direct HTTP requests containing log events from client-side JavaScript or mobile applications (Android/iOS)
● Log format standardized to JSON - easy to add / remove fields.
● Request tracing through the various services using a unique ID assigned at the load balancer
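A minimal sketch of what a standardized JSON log event carrying a request-tracing ID could look like. The field names (`service`, `request_id`, `timestamp`) and the helper functions are assumptions made for illustration; the talk does not show the actual schema, and in the real setup the ID is generated at the load balancer rather than in application code.

```python
import json
import uuid
from datetime import datetime, timezone


def new_request_id():
    """Stand-in for the unique ID the load balancer attaches to a request."""
    return uuid.uuid4().hex


def log_event(service, request_id, **fields):
    """Serialize one log event as a single JSON line.

    Extra keyword arguments become top-level fields, so adding or removing
    a field needs no change to any downstream parser.
    """
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,        # hypothetical field name
        "request_id": request_id,  # hypothetical field name
    }
    event.update(fields)
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    rid = new_request_id()
    # Every service logs the same ID for this request, so one search on
    # request_id retrieves the whole path through the stack.
    print(log_event("api-gateway", rid, method="GET", uri="/v1/videos", status=200))
    print(log_event("video-service", rid, level="warn", message="cache miss"))
```

Because every event is a flat JSON object, the filter stage can parse it without per-service patterns.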

Slide 8

Slide 8 text

Logstash
● Log aggregator
● Log preprocessing (filtering etc.)
● 3 stage pipeline
● Input > Filter > Output

Slide 9

Slide 9 text

Elasticsearch
● Full text searching and indexing
● Built on top of Apache Lucene
● RESTful web interface
● Horizontally scalable

Slide 10

Slide 10 text

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Logstash
Input - any stream
● local file
● queue
● tcp, udp
● twitter
● etc.
Filter - mutation
● add/remove field
● parse as json
● ruby code
● parse geoip
● etc.
Output
● elasticsearch
● redis
● queue
● file
● pagerduty
● etc.
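The Input > Filter > Output flow above can be sketched as three chained generators. This is not Logstash code or its configuration syntax, only a Python illustration of the staging; the function names, the `pipeline` field, the `debug` field and the `parse_failure` tag are all hypothetical.

```python
import json


def read_input(lines):
    """Input stage: yield raw events from any iterable (file, socket, queue)."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def apply_filters(raw_events):
    """Filter stage: parse as JSON, then add and remove fields."""
    for raw in raw_events:
        try:
            event = json.loads(raw)
            if not isinstance(event, dict):
                raise ValueError("not a JSON object")
        except ValueError:
            # Keep unparseable lines, tagged, instead of dropping them.
            event = {"message": raw, "tags": ["parse_failure"]}
        event["pipeline"] = "sketch"  # add a field (hypothetical)
        event.pop("debug", None)      # remove a field (hypothetical)
        yield event


def write_output(events, sink):
    """Output stage: deliver each event to a sink (here a plain list)."""
    for event in events:
        sink.append(event)


if __name__ == "__main__":
    source = ['{"status": 200, "debug": "x"}', "", "not json"]
    sink = []
    write_output(apply_filters(read_input(source)), sink)
    for event in sink:
        print(event)
```

Each stage only consumes the previous one's iterator, which is why inputs, filters and outputs can be swapped independently.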

Slide 13

Slide 13 text

Logstash Forwarder
● Golang program that sits next to the log files and speaks the lumberjack protocol
● Forwards logs from a file to a logstash server
● Removes the need for a buffer (such as redis, or a queue) for logs pending ingestion to logstash.
● Runs as a Docker container with /var/log volume-mounted. Configuration stored in Consul.
● Application containers with volume mounted /var/log to /var/log/docker//application.log

Slide 14

Slide 14 text

Logstash pool with HAProxy
4 x logstash machines, each with 8 cores and 16 GB RAM.
7 x logstash processes per machine: 5 for application logs, 2 for HTTP client logs.
Fronted by HAProxy for both the lumberjack protocol and HTTP.
Easily scalable by adding more machines and spinning up more logstash processes.

Slide 15

Slide 15 text

Application Service Container 1
Application Service Container 2
Logstash-Forwarder Container
Mounted /var/log to /var/log/docker/ on host

Slide 16

Slide 16 text

Elasticsearch Hardware
12 cores, 64 GB RAM, RAID 0 over 2 x 3 TB 7200 rpm disks per node.
20 nodes, 20 shards, 3 replicas (plus 1 primary).
Each day ~300 GB x 4 copies (3 + 1), so ~3 months of data fits on 120 TB.
Average 6k-8k logs per second, peak 25k logs per second.
https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
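The capacity figures above can be checked with a few lines of arithmetic. The numbers are the ones quoted on the slide; the function names are illustrative, and the sketch assumes decimal units (1 TB = 1000 GB) and ignores filesystem overhead and free-space headroom.

```python
NODES = 20
DISKS_PER_NODE = 2
DISK_TB = 3
DAILY_PRIMARY_GB = 300  # ~300 GB of new index data per day
COPIES = 1 + 3          # 1 primary + 3 replicas


def raw_capacity_tb(nodes, disks_per_node, disk_tb):
    """Total raw disk across the cluster (RAID 0 adds no redundancy cost)."""
    return nodes * disks_per_node * disk_tb


def daily_footprint_gb(daily_primary_gb, copies):
    """Disk consumed per day once every copy of the data is counted."""
    return daily_primary_gb * copies


def retention_days(capacity_tb, daily_gb):
    """Whole days of data that fit, using 1 TB = 1000 GB."""
    return (capacity_tb * 1000) // daily_gb


if __name__ == "__main__":
    capacity = raw_capacity_tb(NODES, DISKS_PER_NODE, DISK_TB)
    daily = daily_footprint_gb(DAILY_PRIMARY_GB, COPIES)
    print(capacity, "TB raw capacity")
    print(daily, "GB written per day")
    print(retention_days(capacity, daily), "days of retention")
```

With these inputs the cluster has 120 TB of raw disk, writes 1200 GB per day, and holds 100 days of data, which matches the "~3 months" on the slide.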

Slide 17

Slide 17 text

Elasticsearch Hardware

Slide 18

Slide 18 text

Hardware Tuning
● Heap < 30.5 GB - the JVM uses compressed object pointers only below roughly 30.5 GB of heap.
● Sweet spot - 64 GB of RAM, with half left available for Lucene file buffers.
● SSD or RAID 0 (or multiple path directories, which behave similarly to RAID 0).
● If SSD, set the I/O scheduler to deadline instead of cfq.
● RAID 0 - no need to worry about disks failing, as machines can easily be replaced thanks to multiple copies of the data.
● Disable swap.
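The heap guidance above amounts to a simple rule: give the JVM half of RAM, but stay under the compressed-pointer threshold. A small sketch of that rule follows; the 30 GB cap is an assumed round value chosen to sit safely below the 30.5 GB quoted on the slide, and the function name is illustrative.

```python
HEAP_CAP_GB = 30  # assumption: a round value below the ~30.5 GB threshold


def recommended_heap_gb(ram_gb):
    """Half of RAM for the JVM heap, capped below the compressed-pointer limit."""
    return min(ram_gb / 2, HEAP_CAP_GB)


def left_for_file_buffers_gb(ram_gb):
    """RAM left to the OS page cache, which Lucene relies on for file buffers."""
    return ram_gb - recommended_heap_gb(ram_gb)


if __name__ == "__main__":
    for ram in (16, 64, 128):
        print(ram, "GB RAM ->", recommended_heap_gb(ram), "GB heap,",
              left_for_file_buffers_gb(ram), "GB for file buffers")
```

On the 64 GB machines described in the talk this gives a 30 GB heap and leaves about 34 GB for file buffers; adding more RAM than that only grows the file-buffer share.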

Slide 19

Slide 19 text

Elasticsearch Configuration
● 20 days of indexes kept open (based on available memory), the rest closed and opened on demand
● Field data - cache used while sorting and aggregating data.
● Circuit breaker - cancels requests that would require too much memory, to prevent OOM. Clear the cache via http://elasticsearch:9200/_cache/clear if field data is very close to the memory limit.
● Shards >= number of nodes
● Lucene forceMerge - minor performance improvements for older indexes (https://www.elastic.co/guide/en/elasticsearch/client/curator/current/optimize.html)

Slide 20

Slide 20 text

And also...
Prevent a split brain situation to avoid losing data - set the minimum number of master-eligible nodes to (n/2 + 1).
Set a higher ulimit for the elasticsearch process.
Daily cronjob which deletes data older than 90 days, closes indices older than 20 days, and optimizes (forceMerge) indices older than 2 days.
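A sketch of the two rules above: the quorum size, and the decision the daily cronjob makes per index. It only computes what to do and does not call Elasticsearch or Curator. It assumes daily indices named in the `logstash-YYYY.MM.DD` style (the talk does not state the index naming), and the function names are illustrative. How many of the 20 nodes are master-eligible is also not stated, so the quorum example uses a hypothetical count.

```python
from datetime import date, datetime


def minimum_master_nodes(master_eligible):
    """Quorum size n/2 + 1 (integer division) that prevents split brain."""
    return master_eligible // 2 + 1


def index_day(index_name, prefix="logstash-"):
    """Parse the day out of a daily index name such as logstash-2015.10.01."""
    return datetime.strptime(index_name[len(prefix):], "%Y.%m.%d").date()


def plan_action(index_name, today, delete_after=90, close_after=20, merge_after=2):
    """Return the strongest action that applies to an index of this age."""
    age_days = (today - index_day(index_name)).days
    if age_days > delete_after:
        return "delete"
    if age_days > close_after:
        return "close"
    if age_days > merge_after:
        return "forcemerge"
    return "keep"


if __name__ == "__main__":
    today = date(2015, 10, 10)
    for name in ["logstash-2015.10.09", "logstash-2015.10.01",
                 "logstash-2015.09.01", "logstash-2015.06.01"]:
        print(name, "->", plan_action(name, today))
    # Hypothetical: 3 master-eligible nodes.
    print("quorum:", minimum_master_nodes(3))
```

A real job would also skip indices that are already merged or closed; this sketch reports the action by age alone.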

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Monitoring
Marvel - Official plugin from Elasticsearch
KOPF - Index management plugin
CAT APIs - REST APIs to view cluster information
Curator - Data management

Slide 23

Slide 23 text

Thanks
email: [email protected]
twitter: @angadsg