How to scale a Logging Infrastructure
Paul Stack
June 03, 2015
Logging infrastructure using ELK + Kafka
Transcript
How do you scale a logging infrastructure to accept a billion messages a day?
Paul Stack
http://twitter.com/stack72
mail: [email protected]
About Me
• Infrastructure Engineer for a cool startup :)
• Reformed ASP.NET / C# Developer
• DevOps Extremist
• Conference Junkie
Background
• Project was to replace the legacy ‘logging solution’
Iteration 0: A developer created a single box running ELK all in one jar
Time to make it production ready now
Iteration 1: Using Redis as the input mechanism for Logstash
Enter Apache Kafka
“Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable”
Source: Cloudera Blog
Introduction to Kafka
• Kafka is made up of ‘topics’, ‘producers’, ‘consumers’ and ‘brokers’
• Communication is via TCP
• Backed by ZooKeeper
Kafka Topics Source: http://kafka.apache.org/documentation.html
Kafka Producers
• Producers are responsible for choosing which topic to publish data to
• The producer is also responsible for choosing which partition to write to
• Partition selection can be handled round-robin or by a partition function
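The two partition-selection strategies named on the Kafka Producers slide can be sketched in plain Python. This is only an illustration of the idea, not Kafka client code: the function names are hypothetical and `zlib.crc32` is a stand-in hash chosen because it is stable across runs.

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions):
    """Return a chooser that cycles through partitions and ignores the key."""
    counter = itertools.cycle(range(num_partitions))
    return lambda key: next(counter)


def hash_partitioner(num_partitions):
    """Return a chooser that always maps the same key to the same partition."""
    def choose(key):
        # crc32 is an illustrative stand-in for a real partition function.
        return zlib.crc32(key.encode("utf-8")) % num_partitions
    return choose


rr = round_robin_partitioner(3)
print([rr("web-01") for _ in range(4)])        # [0, 1, 2, 0]

by_host = hash_partitioner(3)
print(by_host("web-01") == by_host("web-01"))  # True
```

Round-robin spreads load evenly; a keyed partition function keeps all messages for one key (here a hypothetical host name) in a single partition, which matters for ordering.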
Kafka Consumers
• Consumption can be done via:
  • queuing
  • pub-sub
Kafka Consumers
• Kafka consumer group
• Strong ordering
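How consumer groups give both queuing and pub-sub can be shown with a small in-memory simulation. It assumes the usual Kafka model, in which each partition is read by exactly one consumer within a group, every group sees every message, and ordering holds within a partition. All group and consumer names below are made up.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)


def deliver(log, groups):
    """log: {partition: [messages in offset order]}; groups: {group: [consumers]}."""
    received = defaultdict(list)
    for group, consumers in groups.items():
        for consumer, parts in assign_partitions(sorted(log), consumers).items():
            for p in parts:
                # Messages from one partition arrive in their original order.
                received[(group, consumer)].extend(log[p])
    return dict(received)


log = {0: ["a1", "a2"], 1: ["b1", "b2"]}
groups = {"indexers": ["logstash-1", "logstash-2"], "archiver": ["s3-writer"]}
for (group, consumer), msgs in deliver(log, groups).items():
    print(group, consumer, msgs)
```

Inside the `indexers` group the partitions are split between the two consumers (queuing), while the separate `archiver` group receives the full stream again (pub-sub).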
https://github.com/opentable/puppet-exhibitor
Iteration 2: Introduction of Kafka
Iteration 3: Further ‘Improvements’ to the cluster layout
The Numbers
• Logs kept in ES for 30 days, then archived
• 12 billion documents active in ES
• ES space was about 25-30 TB in EBS volumes
• Average doc size ~1.2 KB
• V-Day 2015: ~750M docs collected without failure
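A quick back-of-the-envelope check of the figures on the Numbers slide, using only the values stated there plus the billion-messages-a-day target from the title. Decimal units (1 TB = 10^9 KB) are an assumption.

```python
DOCS_ACTIVE = 12_000_000_000      # from the slide
AVG_DOC_KB = 1.2                  # from the slide
RETENTION_DAYS = 30               # from the slide
TARGET_PER_DAY = 1_000_000_000    # from the talk title

raw_tb = DOCS_ACTIVE * AVG_DOC_KB / 1e9          # KB -> TB, decimal units
avg_docs_per_day = DOCS_ACTIVE / RETENTION_DAYS
target_per_second = TARGET_PER_DAY / 86_400      # seconds in a day

print(f"raw document volume: {raw_tb:.1f} TB")
print(f"average ingest: {avg_docs_per_day / 1e6:.0f}M docs/day")
print(f"1B/day sustained: {target_per_second:,.0f} msgs/sec")
```

The raw volume works out to about 14.4 TB, so the reported 25-30 TB on disk is roughly 1.7x to 2.1x that. The slides do not say where the difference comes from; replica shards and index overhead are plausible guesses, not stated facts.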
What about metrics and monitoring?
Monitoring - Nagios
• Alerts on:
  • ES cluster
  • ZooKeeper (zK) and Kafka nodes
  • Logstash / Redis nodes
https://github.com/stack72/nagios-elasticsearch
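As a rough idea of what a Nagios alert on the ES cluster involves, the sketch below maps the `status` field of an Elasticsearch cluster health response onto the standard Nagios plugin exit codes. It is not taken from the repository linked above; the function name and the sample response body are hypothetical.

```python
import json

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_cluster_health(health_json):
    """Map a cluster health response body to (exit_code, message)."""
    try:
        status = json.loads(health_json)["status"]
    except (ValueError, KeyError, TypeError):
        return UNKNOWN, "UNKNOWN - could not parse cluster health"
    code = {"green": OK, "yellow": WARNING, "red": CRITICAL}.get(status, UNKNOWN)
    return code, f"{LABELS[code]} - cluster status is {status}"


# In a real check the body would come from GET /_cluster/health on an ES node.
sample = '{"cluster_name": "logging", "status": "yellow"}'
code, message = check_cluster_health(sample)
print(code, message)  # 1 WARNING - cluster status is yellow
```

A real plugin would print the message and finish with `sys.exit(code)` so that Nagios picks up the state.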
Metrics - Kafka Offset Monitor
https://github.com/opentable/KafkaOffsetMonitor
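The key number an offset monitor surfaces is consumer lag: how far a consumer group's committed offset trails the end of each partition's log. A minimal sketch of that calculation, with made-up offsets and a hypothetical function name:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag: messages written but not yet consumed."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


log_end = {0: 1_500, 1: 1_480, 2: 1_510}      # made-up offsets
committed = {0: 1_500, 1: 1_200, 2: 1_505}

lag = consumer_lag(log_end, committed)
print(lag)                 # {0: 0, 1: 280, 2: 5}
print(sum(lag.values()))   # 285
```

Lag that keeps growing on a partition suggests the consumers (here, presumably the Logstash indexers) are not keeping up with the producers.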
Metrics - ElasticSearch
Visibility Rocks!
So what would I do differently?
Questions?
Paul Stack @stack72