Diving In The Deep End: Logging and Metrics at Digital Ocean

Elastic Co
November 17, 2015

From server health checks to network monitoring to customer activity events -- logs are everywhere at DigitalOcean. In a single day, we collect more than a terabyte of real-time log data over our entire operations infrastructure. Buried in that non-stop stream of data is everything we need to know to keep DigitalOcean's cloud services up and running. This talk covers how we collect, parse, route, store, and make this data available to operations and engineers while keeping things simple enough for a small team to manage.

Elastic{ON} Tour | New York City | November 17, 2015


Transcript

  1. Diving in the Deep End: Logging & Metrics @ DigitalOcean

    Brian Knox, Tech Lead - Metrics & Logging Team | DigitalOcean 1
  2. 2 Who Am I? What Do I Do?

  3. Who Is this Person? 3 Brian Knox Things I Am:

    •Tech Lead, Metrics Team •Open Source Contributor ▪Rsyslog ▪ZeroMQ
  4. DigitalOcean 4

  5. Who Is this Person? 5 Brian Knox Things I Am

    Not: •Frequent Speaker •Comfortable •Head Shot Model •Actually A Captain
  6. The Shallows – Where We Came From 6

  7. The Scope of The Problem 7 •10,000+ systems and devices

    •Multiple Datacenters •Dozens of Critical Services •No log aggregation.
  8. How We Spent Most Of Our Time 8

  9. How Did We Diagnose and Solve Problems? 9

  10. SSH, Tail, Cat, and Grep 10

  11. Impressive But Not Scalable. 11

  12. The Crew 12

  13. Metrics Team Mission 13

  14. That Was A Lot Of Words We Help People To:

    •Know what is happening now. •Reason about what will happen in the future. 14
  15. Putting A Toe In The Water 15

  16. Solving One Problem At A Time You can't design a

    correct architecture when you don't understand the scope of the problem. 16
  17. Aggregation Problem: We could not view anything at an aggregate

    level. 17
  18. Aggregation Solution: Forward all logs in each region to a

    regional rsyslog aggregator. 18
  19. Aggregation 19

  20. Aggregation 20 •Rsyslog aggregator per region •Forward all logs for

    each region to the local regional aggregator •Write the logs to local disk, organized by host and program name ▪Easy to do with Rsyslog, it’s what it was made for ▪In-house expertise (me!)
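
The aggregator setup this slide describes maps almost directly onto stock rsyslog. A minimal sketch, with illustrative hostnames, port, and paths rather than DigitalOcean's actual values:

    # regional aggregator: accept TCP syslog from every host in the region
    module(load="imtcp")
    input(type="imtcp" port="514")

    # write each message to local disk, organized by host and program name
    template(name="PerHostFile" type="string"
             string="/var/log/remote/%hostname%/%programname%.log")
    action(type="omfile" dynaFile="PerHostFile")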
  21. Aggregation 21 •Immediate Benefits: ▪Could begin analysis on log volume

    per day ▪Could now SSH to a central host to tail, grep, etc
  22. Aggregation •We were receiving around 100,000 log lines a second

    total. •That's more than we knew before. •Started doing some aggregate analysis of logs with simple scripts and learned... 22
  23. Aggregation •~ 70% of our log traffic was a single

    program that ran on every hypervisor, essentially saying “I'M STILL NOT DOING ANYTHING” as fast as it could. •Easy win: make it stop. 23
  24. Elasticsearch Problem: We could not easily query the aggregated logs.

    24
  25. Elasticsearch Solution: Index the logs in Elasticsearch 25

  26. Elasticsearch 26 •Get all logs loaded into Elasticsearch ▪More detailed

    analysis on log volume broken out by: oRegions oHosts oPrograms oLog Levels ▪Begin analysis of log content (thanks to full text indexing)
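
Getting the logs into Elasticsearch can be done entirely from rsyslog with the omelasticsearch output module. A minimal sketch, assuming a small cluster and an illustrative JSON document template (the field names and server address are not from the talk):

    module(load="omelasticsearch")

    # illustrative JSON document template for indexing
    template(name="es_doc" type="list") {
        constant(value="{\"timestamp\":\"")
        property(name="timereported" dateFormat="rfc3339")
        constant(value="\",\"host\":\"")
        property(name="hostname")
        constant(value="\",\"program\":\"")
        property(name="programname")
        constant(value="\",\"severity\":\"")
        property(name="syslogseverity-text")
        constant(value="\",\"message\":\"")
        property(name="msg" format="json")
        constant(value="\"}")
    }

    action(type="omelasticsearch" server="es01.internal" serverport="9200"
           searchIndex="logs" template="es_doc" bulkmode="on")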
  27. Elasticsearch 27 •Small cluster from repurposed hardware •Did not

    have to be (and could not possibly be) perfect •Just needed to serve its purpose: ▪Learn what we could about our logs ▪Learn what we could about Elasticsearch from an operational perspective ▪Use what we learned to design the next iteration
  28. Elasticsearch – What Did We Learn? 28 •Learned who our

    loggers were: ▪Perl services ▪Golang services ▪Ruby services ▪Third-party services ▪Linux services ▪Linux kernel ▪Network devices (routers, switches, firewalls) •Learned there was a lot of data in our logs that could be utilized if we structured our logs better
  29. Normalization Problem: Most of our logs were unstructured, making them

    difficult to analyze 29
  30. Normalization Solution: Structure our logs. 30

  31. Normalization – CEE 31

  32. Normalization – CEE – The Vision (TM) “Common Event Expression

    (CEE™) improves the audit process and the ability of users to effectively interpret and analyze event log and audit data. This is accomplished by defining an extensible unified event structure, which users and developers can leverage to describe, encode, and exchange their CEE Event Records.” 32
  33. Normalization – CEE – Oops 33

  34. Normalization – CEE – The Good News 34

  35. Normalization - CEE <190>2015-03-25T16:57:40.945788-04:00 prod-imageindexer01 indexer[13813]: @cee:{"action":"image_delete", "controller":"images", "count":0, "egid":0,

    "eid":0, "env":"production", "host":"prod-imageindexer01.nyc3.internal.digitalocean.com", "level":"info", "msg":"deleting images/kernels", "pid":13813, "pname":"/opt/apps/imagemanagement/bin/indexer", "request.id":"14234b67-3dd6-4926-bfdc-3cb74219c512", "time":"2015-03-25T16:57:40-04:00", "version":"bc304e26752d81ba9c6530076a94d4f5f512d0bd"} 35
  36. Normalization - CEE 36

  37. Normalization - CEE 37

  38. Diving A Little Deeper 38

  39. Kibana What We Now Had: •All logs forwarded to regional

    aggregators •Most logs from our own systems structured •Logs stored on disk on aggregators for 3 days •Logs forwarded from aggregators to Elasticsearch 39
  40. Kibana Problem: It was difficult to know what was happening

    at a glance. 40
  41. Kibana Solution: Kibana 41

  42. Kibana 42

  43. Kibana 43

  44. Kibana 44

  45. Kibana 45

  46. Ummon Problem: It was difficult for support to examine event

    logs the way they were accustomed to. 46
  47. Ummon Solution: Ummon, a command line tool for searching event

    logs in Elasticsearch. 47
  48. Ummon 48

  49. Logtalez Problem: We want to “tail” logs from remote services

    in real-time in a safe, secure, convenient manner. 49
  50. Logtalez Solution: Logtalez – ephemeral, encrypted, topic-based log subscriptions.

    50
  51. LogTalez 51

  52. Atlantis Integration Problem: Too many steps to see event logs

    from the in-house support system. 52
  53. Atlantis Integration Solution: Integrate Elasticsearch queries into our support system.

    53
  54. Atlantis Integration 54

  55. Architecture In Depth 55

  56. Logging Pipeline Components •Rsyslog for log shipping, parsing, and routing.

    •ZeroMQ for ephemeral real-time log stream subscriptions. •HAProxy for load balancing syslog traffic. •Elasticsearch for log indexing, storage and search. •Kibana for dashboards and exploration. 56
  57. Logging Architecture 57

  58. Rsyslog – Log Shipper on All Systems 58

  59. Rsyslog – Log Shipper on All Systems - Configs 59
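
The shipper configs themselves are not reproduced in this transcript. As a sketch of the general shape, every host forwards everything to its regional aggregator and buffers locally if the aggregator is unreachable (the target name and queue sizing are illustrative):

    # forward all logs to the regional aggregator over TCP,
    # with an in-memory queue so brief outages do not drop messages
    action(type="omfwd" target="aggregator.nyc3.internal" port="514" protocol="tcp"
           queue.type="LinkedList" queue.size="100000"
           action.resumeRetryCount="-1")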

  60. Rsyslog – Log Aggregators 60

  61. Rsyslog – Log Aggregators – File Template 61

  62. Rsyslog – Log Aggregators – Publish Template 62

  63. Rsyslog – Log Aggregators – ZeroMQ Output 63
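
Slides 62-63 showed the aggregator's publish template and ZeroMQ output. A rough sketch of the idea using the omczmq module mentioned later in the deck; the parameter names are recalled from the module's documentation and may differ between versions, and the endpoint and field names are illustrative:

    module(load="omczmq")

    # compact JSON rendering of each message for subscribers
    template(name="pub_json" type="list") {
        constant(value="{\"host\":\"")
        property(name="hostname")
        constant(value="\",\"program\":\"")
        property(name="programname")
        constant(value="\",\"msg\":\"")
        property(name="msg" format="json")
        constant(value="\"}")
    }

    # "@" in a czmq-style endpoint means bind; clients such as logtalez connect in
    action(type="omczmq" socktype="PUB" endpoints="@tcp://*:24444" template="pub_json")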

  64. Rsyslog – Log Aggregators – HAProxy Out 64
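
Forwarding from the aggregators toward Elasticsearch goes through HAProxy, so from rsyslog's point of view it is just another TCP destination. A sketch with an illustrative load balancer address and a disk-assisted queue, so a loader outage does not back up the aggregator:

    # send the stream to the HAProxy frontend that balances across the ES loaders
    action(type="omfwd" target="syslog-lb.nyc3.internal" port="5140" protocol="tcp"
           queue.type="LinkedList" queue.filename="to_es_loaders"
           queue.maxdiskspace="1g" queue.saveonshutdown="on"
           action.resumeRetryCount="-1")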

  65. Rsyslog – Elasticsearch Index Loaders 65

  66. Rsyslog – Elasticsearch Loaders - Input 66
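
On the loader side, the input is again plain TCP syslog arriving via HAProxy. A sketch (port and ruleset name are illustrative):

    module(load="imtcp")
    input(type="imtcp" port="5140" ruleset="to_elasticsearch")

    ruleset(name="to_elasticsearch") {
        # index selection and the omelasticsearch output go here (see the next sketch)
    }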

  67. Rsyslog – Elasticsearch Loaders – Set Index 67

  68. Rsyslog – Elasticsearch Loaders - Output 68
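
The index-setting and output configs from slides 67-68 aren't in the transcript either. The usual pattern with omelasticsearch is a date-based index template plus dynSearchIndex; a sketch, reusing the illustrative es_doc template from the earlier Elasticsearch sketch:

    # daily index name, e.g. "logs-2015.11.17" (pattern is illustrative)
    template(name="es_index" type="string" string="logs-%$year%.%$month%.%$day%")

    # dynSearchIndex="on" makes searchIndex refer to the template above;
    # errorfile keeps rejected bulk items around for inspection
    action(type="omelasticsearch" server="es01.internal" serverport="9200"
           searchIndex="es_index" dynSearchIndex="on"
           template="es_doc" bulkmode="on"
           errorfile="/var/log/rsyslog/es_errors.json")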

  69. Rsyslog – Elasticsearch Loaders – Create Indexes 69

  70. Rsyslog – Elasticsearch Loaders – Structure Data 70

  71. Rsyslog – Elasticsearch Loaders 71

  72. Rsyslog – Elasticsearch Loaders - Billions 72

  73. Where We’re Going 73

  74. New Elasticsearch Cluster 74 Problem: Internal “droplets” weren’t available at

    the time, so we went with the hardware we had available. This gave us what we needed in the short term, but we couldn't scale horizontally.
  75. New Elasticsearch Cluster 75 Solution: A new Elasticsearch Cluster.

  76. New Elasticsearch Cluster - Planning 76 What We Knew: •Our

    total daily ingest rate •Our ingest rate per index •How fast a single droplet can index data What We Needed To Know: •The right droplet size to pick for the most benefit •How many of them we would need
  77. New Elasticsearch Cluster - Platform 77

  78. New Elasticsearch Cluster - Topology 78 •108 Total Shards on

    43 16GB Droplets ▪344 Cores ▪6.8 Terabytes Max Storage (5.1 Terabytes Usable @ 75%) ▪688 Gigs of Memory ▪2 to 3 shards per droplet per day ▪28-42 shards for 14 days of retention
  79. Liblognorm 79 Problem: Some logs are still semi-structured, making

    it difficult to extract useful information from them.
  80. Liblognorm •Solution: Write a collection of liblognorm rules for normalizing

    the most valuable logs. 80
  81. Liblognorm •Liblognorm is a log normalization library that builds log

    parsers from rulesets and uses them to extract field data from messages. •Liblognorm parse rules can be loaded into rsyslog using the mmnormalize module. 81
  82. Liblognorm – Field Extractors •Number •Float •Kernel-timestamp •Word •String-to •Char-to

    •Quoted-string •Date-rfc3164 •Date-rfc5424 •Ipv4 •Mac48 82 •Tokenized •Recursive •Regex •Iptables •Time-24h •Time-12hr •Duration •named_suffixed •Json •Cee-syslog
  83. Liblognorm – Field Extractors rule=: %-:word% IN=%-:word% OUT=%-:word% PHYSIN=%-:word%

    PHYSOUT=%-:word% SRC=%src-ip:ipv4% DST=%dst-ip:ipv4% LEN=%-:number% TOS=%-:word% PREC=%-:word% TTL=%-:number% ID=%-:number% %-:word% PROTO=%proto:word% SPT=%src-port:number% DPT=%dst-port:number% %-:rest% 83
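
A rule like the one above lives in a liblognorm rulebase file; wiring it into the pipeline is a one-line mmnormalize action. A sketch (the rulebase path is illustrative):

    module(load="mmnormalize")

    # the rulebase file contains rule= lines such as the iptables rule above
    action(type="mmnormalize" rulebase="/etc/rsyslog.d/rules/iptables.rb")

    # on a successful parse, the extracted fields (src-ip, dst-port, proto, ...)
    # are added to the message's structured data and indexed along with it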
  84. Watcher for Real Time Alerting •Problem: While it's easier to

    see what is going on in our infrastructure, we still aren't as proactive as we need to be. 84
  85. Watcher for Real Time Alerting •Solution: Watcher (?) 85

  86. ZeroMQ Log Transport •Problem: Our log stream topology is too

    rigid. 86
  87. ZeroMQ Log Transport •Solution: ZeroMQ end to end. 87

  88. ZeroMQ Log Transport •Omczmq – Rsyslog ZeroMQ Output •Imczmq –

    Rsyslog ZeroMQ Input 88
  89. ZeroMQ Log Transport •Stateless connections •Encryption (libsodium) •Certificate

    Auth (CurveZMQ) •Load Balancing •Publish Subscribe •Application Layer Routing •Batch Acknowledgement •Credit based flow control 89
  90. ZeroMQ Log Transport - Stateless •Rsyslog on the Elasticsearch indexers

    can connect back to bound endpoints on the aggregators. The aggregators do not need to know about the indexing endpoints. Traffic will automatically be load balanced across all Elasticsearch indexer endpoints. 90
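
A sketch of that stateless pattern with the omczmq/imczmq pair: the aggregator binds a PUSH socket, each indexer connects a PULL socket, and ZeroMQ fair-queues messages across whichever indexers are currently connected. Parameter names are recalled from the modules' documentation and may differ by version; endpoints are illustrative:

    # aggregator side: bind a PUSH socket and stream JSON out
    module(load="omczmq")
    action(type="omczmq" socktype="PUSH" endpoints="@tcp://*:24445" template="pub_json")

    # indexer side: connect a PULL socket back to every aggregator
    module(load="imczmq")
    input(type="imczmq" socktype="PULL"
          endpoints=">tcp://aggregator01.nyc3.internal:24445,>tcp://aggregator02.nyc3.internal:24445")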
  91. ZeroMQ Log Transport – Pub / Sub •Each branch in

    each rsyslog routing rule will have a ZeroMQ publish port where authorized subscribers can connect and receive topic-based streams. This allows for: ▪Ad-hoc analytics ▪Easy tracing and debugging of log flow end to end 91
  92. ZeroMQ Log Transport – Microservices •Creating log flows through a

    series of microservices providing various filters and rules in an on demand fashion. Spin up, analyze in real-time, spin down. 92
  93. ZeroMQ Log Transport – Efficient Security •Current throughput tests of

    plugins with “typical” DO logs shows an upper capacity of ~ 150,000 encrypted log lines a second with simple RFC3164 parsing 93
  94. ZeroMQ Log Transport 94

  95. Questions 95