Slide 1

How do you scale a logging infrastructure to accept a billion messages a day?
Paul Stack
Twitter: http://twitter.com/stack72
Mail: [email protected]

Slide 2

About Me
• Infrastructure Engineer for a cool startup :)
• Reformed ASP.NET / C# Developer
• DevOps Extremist
• Conference Junkie

Slide 3

Background Project was to replace the legacy ‘logging solution’

Slide 4

Iteration 0: a developer created a single box running the entire ELK stack from an all-in-one jar

Slide 5

Time to make it production-ready now

Slide 6

No content

Slide 7

Iteration 1: Using Redis as the input mechanism for Logstash
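
As a rough sketch of what Iteration 1 looks like in practice, a minimal Logstash pipeline reading from Redis and writing to Elasticsearch is shown below. The host names, the Redis list key and the index pattern are illustrative, and option names vary between Logstash versions (older releases used host rather than hosts on the elasticsearch output).

```
# Sketch only: shippers LPUSH JSON events onto a Redis list and Logstash
# pops them off and indexes them into Elasticsearch.
input {
  redis {
    host      => "redis.example.internal"   # assumed Redis host
    data_type => "list"                     # consume from a list
    key       => "logstash"                 # the list the shippers push to
    codec     => "json"
  }
}
output {
  elasticsearch {
    hosts => ["es.example.internal:9200"]   # 'host' (singular) on older versions
    index => "logstash-%{+YYYY.MM.dd}"
  }
}
```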

Slide 8

No content

Slide 9

No content

Slide 10

Enter Apache Kafka

Slide 11

“Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable” Source: Cloudera Blog

Slide 12

Introduction to Kafka
• Kafka is made up of ‘topics’, ‘producers’, ‘consumers’ and ‘brokers’
• Communication is via TCP
• Backed by Zookeeper

Slide 13

Kafka Topics Source: http://kafka.apache.org/documentation.html

Slide 14

Kafka Producers
• Producers are responsible for choosing which topic to publish data to
• The producer is also responsible for choosing which partition to write to
• This can be done round-robin or via a partitioning function (e.g. keyed on a field)
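
To make the partition-choosing point concrete, here is a minimal sketch using the standard Kafka Java client; the broker addresses, the "logs" topic and the choice of the source host name as the key are assumptions, not details from the talk.

```
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");  // assumed brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: the default partitioner hashes the key, so every event
            // from "web-01" lands on the same partition and keeps its order.
            producer.send(new ProducerRecord<>("logs", "web-01",
                    "{\"level\":\"INFO\",\"msg\":\"request served\"}"));

            // Unkeyed record: the client spreads these across partitions for balance
            // (round robin, or sticky batching on newer client versions).
            producer.send(new ProducerRecord<>("logs",
                    "{\"level\":\"INFO\",\"msg\":\"another event\"}"));
        }
    }
}
```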

Slide 15

Kafka Consumers
• Consumption can be done via:
  • queuing
  • pub-sub

Slide 16

Kafka Consumers
• Kafka consumer groups generalise the queuing and pub-sub models
• Strong ordering (within a partition)

Slide 17

Kafka Consumers • Strong ordering
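
A consumer group is what ties the two previous slides together: within one group the topic's partitions are shared out among the members (queue semantics), while separate groups each receive every message (pub-sub), and ordering is guaranteed per partition because each partition is read by exactly one member of a group. A minimal sketch with the standard Kafka Java client follows; the broker address, topic and group id are assumptions.

```
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");   // assumed broker
        props.put("group.id", "logstash-indexers");      // members of one group share the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("logs"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Offsets are strictly increasing within a partition: that is the
                    // "strong ordering" guarantee, per partition rather than per topic.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```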

Slide 18

Puppet module for deploying Netflix’s Exhibitor (a supervisor and UI for Zookeeper): https://github.com/opentable/puppet-exhibitor

Slide 19

No content

Slide 20

Iteration 2: Introduction of Kafka
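
With Kafka in place, the Redis input from Iteration 1 is swapped for a Kafka input, so several Logstash indexers can join one consumer group and share the topic's partitions. The sketch below is illustrative only: host names, topic and group id are assumptions, and the option names depend on the logstash-input-kafka plugin version (the Kafka 0.8-era plugin used zk_connect and topic_id instead of bootstrap_servers and topics).

```
# Sketch only: Logstash indexers in the "logstash-indexers" consumer group
# share the partitions of the "logs" topic between them.
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
    topics            => ["logs"]
    group_id          => "logstash-indexers"
    codec             => "json"
    # Kafka 0.8-era plugin: zk_connect => "zk1:2181", topic_id => "logs"
  }
}
output {
  elasticsearch {
    hosts => ["es.example.internal:9200"]
    index => "logstash-%{+YYYY.MM.dd}"
  }
}
```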

Slide 21

No content

Slide 22

No content

Slide 23

Iteration 3: Further ‘Improvements’ to the cluster layout

Slide 24

No content

Slide 25

The Numbers
• Logs kept in ES for 30 days, then archived
• 12 billion documents active in ES
• ES disk usage was about 25–30 TB in EBS volumes
• Average document size ~1.2 KB
• V-Day 2015: ~750M docs collected without failure
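
As a rough sanity check on those figures (assuming Elasticsearch's default of one replica per shard, which the talk does not state): 12 billion documents × ~1.2 KB ≈ 14.4 TB of primary data, or roughly 29 TB once replicas are counted, which lines up with the quoted 25–30 TB of EBS storage.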

Slide 26

What about metrics and monitoring?

Slide 27

Monitoring - Nagios
• Alerts on:
  • ES cluster
  • ZooKeeper and Kafka nodes
  • Logstash / Redis nodes

Slide 28

No content

Slide 29

A set of Nagios plugins for checking Elasticsearch: https://github.com/stack72/nagios-elasticsearch

Slide 30

Metrics - Kafka Offset Monitor

Slide 31

https://github.com/opentable/KafkaOffsetMonitor

Slide 32

Metrics - ElasticSearch
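
The following slides are image-only in this transcript; if you want the raw numbers such dashboards are built from, Elasticsearch exposes them over plain HTTP. The host and port below are assumptions; the endpoints themselves are standard:

```
curl -s 'http://localhost:9200/_cluster/health?pretty'   # cluster status, node and shard counts
curl -s 'http://localhost:9200/_nodes/stats?pretty'      # per-node JVM, indexing and search stats
curl -s 'http://localhost:9200/_cat/indices?v'           # per-index document counts and size on disk
```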

Slide 33

No content

Slide 34

No content

Slide 35

No content

Slide 36

Visibility Rocks!

Slide 37

No content

Slide 38

So what would I do differently?

Slide 39

Questions?

Slide 40

Paul Stack @stack72