How to scale a Logging Infrastructure

How do you scale a logging infrastructure to accept a billion messages a day?
Paul Stack http://twitter.com/stack72 mail: [email protected]

About Me • Infrastructure Engineer for a cool startup :) • Reformed
ASP.NET / C# Developer • DevOps Extremist • Conference Junkie

Background Project was to replace the legacy ‘logging solution’

Iteration 0: A developer created a single box running the
all-in-one ELK jar

Time to make it production ready now

Iteration 1: Using Redis as the input mechanism for Logstash

Enter Apache Kafka

“Kafka is a distributed publish-subscribe messaging system that is
designed to be fast, scalable, and durable” Source: Cloudera Blog

Introduction to Kafka • Kafka is made up of ‘topics’,
‘producers’, ‘consumers’ and ‘brokers’ • Communication is via TCP • Backed by ZooKeeper

Kafka Topics Source: http://kafka.apache.org/documentation.html

Kafka Producers • Producers are responsible for choosing which topic
to publish data to • The producer is also responsible for choosing a partition to write to • Partition choice can be handled round-robin or by a partition function
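
The two partition-selection strategies named above can be sketched as follows. This is a minimal illustration in Python, not Kafka's actual client code: the class names `RoundRobinPartitioner` and `KeyHashPartitioner` are hypothetical, and CRC32 is only a stand-in for whatever hash a real client uses.

```python
import itertools
import zlib


class RoundRobinPartitioner:
    """Cycle through partitions so messages spread evenly (hypothetical helper)."""

    def __init__(self, num_partitions):
        self._cycle = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        # The key is ignored: each call simply takes the next partition in turn.
        return next(self._cycle)


class KeyHashPartitioner:
    """Map a key to a fixed partition, so one key always lands in one place."""

    def __init__(self, num_partitions):
        self._num_partitions = num_partitions

    def partition(self, key):
        # CRC32 is an assumption for illustration; it is stable across runs,
        # unlike Python's built-in hash() for strings.
        return zlib.crc32(key.encode("utf-8")) % self._num_partitions


rr = RoundRobinPartitioner(3)
round_robin_choices = [rr.partition() for _ in range(6)]

kh = KeyHashPartitioner(3)
same_host_twice = (kh.partition("web-01"), kh.partition("web-01"))
```

Round robin balances load across partitions; a key-based function keeps every message for one key (for example one host) in a single partition, which matters once per-partition ordering is relied on.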

Kafka Consumers • Consumption can be done via: • queuing
• pub-sub

Kafka Consumers • Kafka consumer group • Strong ordering

Kafka Consumers • Strong ordering
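
How a consumer group gives both queuing and pub-sub behaviour can be sketched with a small in-memory model. This is a toy simulation with hypothetical names (`assign_partitions`, `deliver`), not the Kafka client API. It assumes each partition is read by exactly one consumer per group, so consumers inside a group share the work (queuing), while each separate group sees every message (pub-sub). Ordering is preserved within a partition only.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round robin)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


def deliver(log, groups):
    """Replay a partitioned log to every group; return what each consumer saw."""
    seen = {}
    for group, consumers in groups.items():
        assignment = assign_partitions(log.keys(), consumers)
        for consumer, parts in assignment.items():
            # Messages are read partition by partition, in offset order.
            seen[(group, consumer)] = [m for p in parts for m in log[p]]
    return seen


# Toy log: two partitions, messages already in offset order within each.
log = {0: ["a1", "a2"], 1: ["b1", "b2"]}
groups = {"indexers": ["c1", "c2"], "archiver": ["c3"]}
seen = deliver(log, groups)
```

In this model the two `indexers` consumers split the partitions between them, while the single `archiver` consumer receives both partitions in full.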

https://github.com/opentable/puppet-exhibitor

Iteration 2 Introduction of Kafka

Iteration 3 Further ‘Improvements’ to the cluster layout

The Numbers • Logs kept in ES for 30 days,
then archived • 12 billion documents active in ES • ES space was about 25-30 TB in EBS volumes • Average doc size ~1.2 KB • V-Day 2015: ~750M docs collected without failure
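
A quick back-of-the-envelope check on these figures, using only the numbers on the slide plus the billion-a-day target from the title. Decimal units (1 KB = 1000 bytes) and an even spread of traffic over the day are assumptions; real log traffic is bursty.

```python
SECONDS_PER_DAY = 24 * 60 * 60

docs_active = 12_000_000_000      # from the slide
retention_days = 30               # from the slide
avg_doc_bytes = 1.2 * 1000        # ~1.2 KB, assuming decimal kilobytes
target_per_day = 1_000_000_000    # "a billion messages a day"

# Average ingest implied by 12 billion docs held over a 30-day window.
avg_docs_per_day = docs_active / retention_days
avg_docs_per_second = avg_docs_per_day / SECONDS_PER_DAY

# Sustained rate needed to absorb a billion messages in one day.
target_per_second = target_per_day / SECONDS_PER_DAY

# Raw payload size, before any replicas or index overhead.
raw_terabytes = docs_active * avg_doc_bytes / 1e12

print(f"average docs/day:    {avg_docs_per_day:,.0f}")
print(f"average docs/second: {avg_docs_per_second:,.0f}")
print(f"target docs/second:  {target_per_second:,.0f}")
print(f"raw payload (TB):    {raw_terabytes:.1f}")
```

The raw payload works out to roughly half of the 25-30 TB reported on disk. The slide does not say where the rest goes; replicas and index overhead would be a plausible guess.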

What about metrics and monitoring?

Monitoring - Nagios • Alerts on: • ES cluster •
ZooKeeper and Kafka nodes • Logstash / Redis nodes
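
An alert on the ES cluster can be as small as a plugin that maps cluster health onto Nagios exit codes. The sketch below is an assumption about how such a check might look, not the code of any particular plugin. It takes the JSON body returned by Elasticsearch's `_cluster/health` endpoint and translates its `status` field (green, yellow, red) into the standard Nagios plugin return codes. The function name `check_cluster_health` is hypothetical, and fetching the JSON over HTTP is left out so the example stays self-contained.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

STATUS_TO_EXIT = {"green": OK, "yellow": WARNING, "red": CRITICAL}
EXIT_LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_cluster_health(body):
    """Return (exit_code, message) for a _cluster/health JSON response body."""
    try:
        status = json.loads(body)["status"]
    except (ValueError, KeyError, TypeError):
        # Unparseable or unexpected responses are reported as UNKNOWN.
        return UNKNOWN, "UNKNOWN - could not parse cluster health"
    code = STATUS_TO_EXIT.get(status, UNKNOWN)
    label = EXIT_LABELS.get(code, "UNKNOWN")
    return code, f"{label} - cluster status is {status}"


# Sample response body with made-up values.
sample = '{"cluster_name": "logs", "status": "yellow", "number_of_nodes": 3}'
exit_code, message = check_cluster_health(sample)
```

A real plugin would print `message` and call `sys.exit(exit_code)` so that Nagios picks up the state.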

https://github.com/stack72/nagios-elasticsearch

Metrics - Kafka Offset Monitor

https://github.com/opentable/KafkaOffsetMonitor
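
The key number an offset monitor surfaces is consumer lag: how far a consumer group's committed offset trails the newest offset in each partition. A minimal sketch of that calculation, with made-up offsets and a hypothetical `consumer_lag` helper (this is not KafkaOffsetMonitor's code):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: newest offset in the log minus the committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Made-up offsets for a three-partition topic.
log_end = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200, 2: 1505}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
```

A lag that keeps growing over time means the consumers are not keeping up with the producers.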

Metrics - ElasticSearch

Visibility Rocks!

So what would I do differently?

Questions?

Paul Stack @stack72
