Un-broken logging - the foundation of software operability - Operability.io - 2015 - Matthew Skelton

Un-Broken Logging: the foundation of software operability
Operability.io conference #OIO15
Friday 25th September 2015
Matthew Skelton, Skelton Thatcher Consulting
@matthewpskelton

The way we use logging is (often) broken
How to make our logging more awesome
Why we should care


@Operability #operability WhoOwnsMyOperability.com

confession: I am a big fan of logging

exceptional situations, edge cases, metrics, analytics, ‘audits’, … (@evanphx)

execution trace

BAD STUFF

Logging is often unloved
1. Discontinuous
2. Errors only, or arbitrary
3. ‘Bolted on’
4. No aggregation & search
5. Specify severity up front

GOOD STUFF

How to make logging awesome
1. Continuous event IDs
2. Transaction tracing
3. Log aggregation & search tools
4. Design for logging
5. Decoupled severity

reduce time-to-detect
increase team engagement
increase configurability
enhance DevOps collaboration
#operability

Background

Autonomous weather station

MRI brain scan imaging

Oil well monitoring

Web-scale systems

logging makes things work

(event sourcing) (structured logging) (CQRS)

How is logging usually broken?

using logging mainly for errors

inconsistent use of logging

logging slows down the software

logging ‘pollutes’ my precious domain model

logging is just for those weird Ops people

logging assumed to be free ($0) to implement
no budget for aggregating logs across machines

log aggregation happens only in Production
logs not available to Devs

fights over log severity levels

poor time synchronisation

Some history, with pirates

weather, course, sightings, latitude, longitude, … (even when quiet)

John Harrison

Why log?

verification traceability accountability charting the waters

- June 13th – Pirates!!!!
- Weds – Sharks!!!
- 19th Jun – BIGGER sharks!!!!

How to make logging awesome

Storage I/O Worker Job Queue Upload

Continuous event IDs

How many distinct event types (state transitions) in your application?

represent distinct states

enum
Human-readable sets: unique values, sparse, immutable
C#, Java, Python, node (Ruby, PHP, …)

public enum EventID
{
    // Badly-initialised logging data
    NotSet = 0,

    // An unrecognised event has occurred
    UnexpectedError = 10000,

    ApplicationStarted = 20000,
    ApplicationShutdownNoticeReceived = 20001,

    PageGenerationStarted = 30000,
    PageGenerationCompleted = 30001,

    MessageQueued = 40000,
    MessagePeeked = 40001,

    BasketItemAdded = 60001,
    BasketItemRemoved = 60002,

    CreditCardDetailsSubmitted = 70001,

    // ...
}
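The same idea can be sketched in Python, one of the languages listed above. This is a minimal illustration, not the talk's own code: the member names are copied from the slide, while `format_event` and `log_event` are hypothetical helpers added here to show how an event ID might reach a log line.

```python
import logging
from enum import IntEnum, unique


@unique  # reject accidental duplicate values when the class is defined
class EventID(IntEnum):
    # Member names mirror the slide so that log searches match across languages.
    NotSet = 0                 # badly-initialised logging data
    UnexpectedError = 10000    # an unrecognised event has occurred
    ApplicationStarted = 20000
    ApplicationShutdownNoticeReceived = 20001
    BasketItemAdded = 60001
    BasketItemRemoved = 60002


def format_event(event: EventID, message: str) -> str:
    """Render an event as 'Name(id) message' so name and number are both searchable."""
    return f"{event.name}({event.value}) {message}"


def log_event(logger: logging.Logger, event: EventID, message: str) -> None:
    # Hypothetical helper: severity is fixed at INFO only to keep the sketch short.
    logger.info(format_event(event, message))
```

Sparse numbering (20000, 30000, ...) leaves room to add related events later without renumbering, and `@unique` keeps each value tied to exactly one event.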

Technical and Domain events: the same EventID enum as on the previous slide, with its values labelled as either Technical or Domain.

BasketItemAdded = 60001

BasketItemAdded = 60001 BasketItemRemoved = 60002

represent distinct states

OrderSvc_BasketItemAdded

Monolith to microservices: debugger does not have the full view

Even with remote debugger, it’s boring to attach and detach

Storage I/O Worker Job Queue Upload

Transaction tracing

‘Unique-ish’ identifier for each request
Passed through downstream layers

Unique-ish ID
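One way to carry a ‘unique-ish’ ID through the layers, sketched with Python's standard library only. Everything named here (`request_id_var`, `new_request_id`, `RequestIdFilter`, `handle_request`) is an assumption made for illustration; the talk does not prescribe an implementation, and the 8-character length is an arbitrary choice.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled; "-" means none is active.
request_id_var = contextvars.ContextVar("request_id", default="-")


def new_request_id() -> str:
    """Return a short 'unique-ish' ID: 8 hex characters taken from a random UUID."""
    return uuid.uuid4().hex[:8]


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it


def handle_request(logger: logging.Logger, incoming_id=None) -> str:
    """Reuse the caller's ID if one was passed in, otherwise mint a new one."""
    request_id = incoming_id or new_request_id()
    token = request_id_var.set(request_id)
    try:
        logger.info("PageGenerationStarted")
        logger.info("PageGenerationCompleted")
    finally:
        request_id_var.reset(token)  # restore whatever ID was active before
    return request_id
```

Attaching `RequestIdFilter()` to a handler whose format string includes `%(request_id)s` puts the ID on every line, so a single search in the aggregation tool returns the whole request. A downstream service would receive the ID (for example in a request header) and pass it as `incoming_id`.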

What about APM?

APM gives us application insight
BUT
How much do we learn?
Is APM available on the Dev box?
It’s not just ‘an Ops problem’!

Helps us to understand how the software really works
Small overhead is worth it

Configurable severity levels

Which log level is right?

DEBUG, INFO, WARNING, ERROR, CRITICAL

Log level should *not* be fixed at compile or build time!

Tune log levels

{
  "eventmappings": {
    "events": {
      "event": [
        {
          "id": "CacheServiceStarted",
          "severity": { "level": "Information" }
        },
        {
          "id": "PageCachePurged",
          "severity": { "level": "Debug" },
          "state": { "enabled": false }
        },
        {
          "id": "DatabaseConnectionTimeOut",
          "severity": { "level": "Error" }
        }
      ]
    }
  }
}

Tune severity levels of specific event IDs
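A rough sketch of how such a mapping could be applied at runtime, using Python's standard `json` and `logging` modules. The JSON shape follows the slide; the `LEVELS` translation table, the function names, and the treatment of a missing `state` as enabled are assumptions made for this example.

```python
import json
import logging

# Same shape as the config on the slide.
CONFIG_TEXT = """
{ "eventmappings": { "events": { "event": [
  { "id": "CacheServiceStarted", "severity": { "level": "Information" } },
  { "id": "PageCachePurged", "severity": { "level": "Debug" },
    "state": { "enabled": false } },
  { "id": "DatabaseConnectionTimeOut", "severity": { "level": "Error" } }
] } } }
"""

# Assumed translation from the slide's level names to Python's logging levels.
LEVELS = {
    "Debug": logging.DEBUG,
    "Information": logging.INFO,
    "Warning": logging.WARNING,
    "Error": logging.ERROR,
    "Critical": logging.CRITICAL,
}


def load_event_mappings(text: str) -> dict:
    """Return {event id: (logging level, enabled flag)} parsed from the config text."""
    mappings = {}
    for entry in json.loads(text)["eventmappings"]["events"]["event"]:
        level = LEVELS[entry["severity"]["level"]]
        enabled = entry.get("state", {}).get("enabled", True)
        mappings[entry["id"]] = (level, enabled)
    return mappings


def log_event(logger, mappings, event_id, message, default_level=logging.INFO):
    """Log event_id at whatever severity the config currently assigns to it."""
    level, enabled = mappings.get(event_id, (default_level, True))
    if enabled:
        logger.log(level, "%s %s", event_id, message)
    return level, enabled
```

Because the calling code names only the event, promoting `PageCachePurged` from Debug to Warning during an incident is a config edit and a reload rather than a rebuild.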

Event tracing
Use enumerations (or closest thing)
Technical and Domain event types
Distributed systems: debuggers less useful
Trace calls with ‘unique-enough’ handles
Tune log levels via config

Log aggregation & search tools

Design for log aggregation

develop the software using log aggregation as a first-class thing

stories for testing logging
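A story for testing logging can end in an ordinary unit test. The sketch below uses `assertLogs` from Python's standard `unittest` module; `add_item` and the `basket` logger name are hypothetical stand-ins for real domain code.

```python
import logging
import unittest

logger = logging.getLogger("basket")


def add_item(basket: list, sku: str) -> None:
    """Hypothetical domain function that emits a BasketItemAdded event."""
    basket.append(sku)
    logger.info("BasketItemAdded sku=%s", sku)


class BasketLoggingTest(unittest.TestCase):
    def test_adding_an_item_logs_the_event(self):
        basket = []
        # assertLogs fails the test if nothing is logged at INFO or above.
        with self.assertLogs("basket", level="INFO") as captured:
            add_item(basket, "A-100")
        self.assertEqual(
            captured.output, ["INFO:basket:BasketItemAdded sku=A-100"]
        )
```

If someone later removes or renames the event, the test fails, which treats the log line as part of the feature rather than a side effect.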

BasketItemAdded grep BasketItem

logging is (‘just’) another system component

Dev and Ops collaboration*
* and testers too!

Where?

auditing, compliance, pre-emptive fault diagnosis, performance metrics, …

logging makes things work

“There is no thought behind aspect-oriented programming”

MINDFUL LOGGING (?!)

database transaction logs

‘Structured Logging’
TW: “Adopt” (May 2015) https://www.thoughtworks.com/radar/techniques/structured-logging
http://gregoryszorc.com/
.NET: http://serilog.net/
Java: https://github.com/fluent/fluent-logger-java
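For comparison, a minimal structured-logging sketch using only Python's standard library rather than the libraries linked above. The `JsonFormatter` class and the `fields` convention for extra data are assumptions made for this example, not part of any of those tools.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}
        # to attach structured data to the record.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)
```

With this formatter set on a handler, a call such as `logger.info("BasketItemAdded", extra={"fields": {"sku": "A-100"}})` produces a line that an aggregation tool can index by key instead of parsing free text.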

sanity

More
Ditch the Debugger and Use Log Analysis Instead
Matthew Skelton
https://blog.logentries.com/2015/07/ditch-the-debugger-and-use-log-analysis-instead/

More
Using Log Aggregation Across Dev & Ops: The Pricing Advantage
Rob Thatcher
https://blog.logentries.com/2015/08/using-log-aggregation-across-dev-ops-the-pricing-advantage/

Evan Phoenix (@evanphx) youtube.com/watch?v=Z-JskKlIBOA

Books operabilitybook.com operationalfeatures.com

Thank you
http://skeltonthatcher.com/
@SkeltonThatcher
+44 (0)20 8242 4103
@matthewpskelton
