Slide 1

Slide 1 text

Metrics & Monitoring
Sleman, January 9th 2016

Slide 2

Slide 2 text

“You cannot manage what you do not measure” - W. Edwards Deming

Slide 3

Slide 3 text

Metrics: indicators that show what is on track and what needs to change before it is too late

Slide 4

Slide 4 text

Metrics Categories
● Business performance
● Usage
● Health

Slide 5

Slide 5 text

What product guys want to see

Slide 6

Slide 6 text

What backend guys want to see

Slide 7

Slide 7 text

What mobile guys want to see

Slide 8

Slide 8 text

What we DON’T WANT to see

Slide 9

Slide 9 text

Monitoring Core Principles
1. Identify as many important topics as possible
2. Identify all topics as early as possible
3. Generate as few alarms as possible
4. Do it with as little work as possible

Slide 10

Slide 10 text

Tools
• Alert: alarms that wake us up when something happens
• Graph: summarizes all trends. We are visual beings.
• Logs: the base source of truth, containing all the details

Slide 11

Slide 11 text

DEVELOPMENT ENV

Slide 12

Slide 12 text

Development Practices
1. Naive Implementation
2. Measure Implementation
3. Optimize (if needed)
● Measure Everything
● Logging is Cheap
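The "naive implementation, then measure, then optimize" loop above could look roughly like the sketch below. It is only an illustration: the `measured` decorator, the in-memory `timings` store, and the `naive_sum` function are hypothetical names, not part of the original talk.

```python
import functools
import logging
import time

logger = logging.getLogger("metrics")

# Hypothetical in-memory store: function name -> list of durations in seconds.
timings = {}


def measured(func):
    """Record how long each call to func takes, and log it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even when func raises, so failures are measured too.
            elapsed = time.perf_counter() - start
            timings.setdefault(func.__name__, []).append(elapsed)
            logger.info("%s took %.6f s", func.__name__, elapsed)
    return wrapper


@measured
def naive_sum(n):
    # Step 1: the naive implementation, measured before any optimization.
    total = 0
    for i in range(n):
        total += i
    return total
```

Calling `naive_sum(1_000_000)` a few times and inspecting `timings["naive_sum"]` gives the numbers needed to decide whether step 3 (optimize) is worth doing at all.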

Slide 13

Slide 13 text

Continuous Integration Pipeline

Slide 14

Slide 14 text

Insights
● Potential thresholds
● Consequences of failures
● A filtered list of important resources to monitor
● Where to improve or optimize

Slide 15

Slide 15 text

PRODUCTION ENV

Slide 16

Slide 16 text

Where to set our eyes?
Layer 0: Application
● Active connections
● Slow processing
● Throughput
● Warning, Error, Fatal logs, etc
Layer 1: Process
● Changes in process status e.g. terminated, stopped, restarted
● Uptime
● Consumed resources, etc

Slide 17

Slide 17 text

Where to set our eyes?
Layer 2: Server
● Number of running processes
● System resources (CPU, network, IO, memory, etc)
● Hardware health, etc
Layer 3: Hosting
● Latency
● Availability
● Maintenance schedules, etc

Slide 18

Slide 18 text

Where to set our eyes?
Layer 4: External Dependencies
● API changes
● SSL-APNS certificate renewals
● Policy changes, etc
Layer 5: USERS!!
● Behaviours
● Crashes
● Device types
● Social media OAuth logins
● Successful responses *
● Sessions, etc

Slide 19

Slide 19 text

Four Monitoring Steps
• Monitor potential bad things
• Monitor actual bad things
• Monitor good things
• Tune and Improve

Slide 20

Slide 20 text

Monitor potential bad things
• Identify the resource
• Understand the threshold value and its consequences
• Set an alert before the threshold is reached
● Daily active users reached 70% of the PubNub threshold
● Increase in social login failures within 30 minutes
● Increase in timeouts within 30 minutes
● Increase in HTTP status codes >= 400
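A minimal sketch of "alert before the threshold is reached", assuming we already have the current value and the hard limit as plain numbers. The function names, the 70% default, and the `factor` / `min_events` parameters are illustrative assumptions, not values from any real PubNub or HTTP API.

```python
def should_warn(current, limit, warn_ratio=0.7):
    """True once usage reaches warn_ratio of the hard limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return current / limit >= warn_ratio


def rate_increased(previous_count, current_count, factor=2.0, min_events=5):
    """True when the current window has at least `factor` times the events
    of the previous window. Very small counts are ignored to avoid noise."""
    if current_count < min_events:
        return False
    # max(..., 1) keeps an empty previous window from triggering on anything.
    return current_count >= factor * max(previous_count, 1)
```

For example, `should_warn(7500, 10000)` would flag daily active users at 75% of a hypothetical 10,000 limit, and `rate_increased(3, 10)` would flag login failures that more than tripled between two 30-minute windows.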

Slide 21

Slide 21 text

Monitor actual bad things
• INEVITABLE
• Identify the resource
• Understand the failure effects
• Ensure the alert is triggered
• Ensure all sources of truth exist
● Application server restarted
● Fatal errors or exceptions in apps *
● Mobile apps crashed
● Chats aren’t delivered
● Twilio failed to send SMS
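One way to make sure a fatal log line always triggers an alert is to hang the alert on the logger itself. The sketch below uses Python's standard `logging.Handler`; the `AlertHandler` class, the `alerts` list, and the `"app"` logger name are hypothetical, and in practice `notify` would page someone instead of appending to a list.

```python
import logging


class AlertHandler(logging.Handler):
    """Forward CRITICAL (fatal) log records to a notification callback."""

    def __init__(self, notify):
        # Records below CRITICAL are ignored by this handler.
        super().__init__(level=logging.CRITICAL)
        self.notify = notify

    def emit(self, record):
        self.notify(self.format(record))


# Stand-in for a real pager or chat integration (assumption).
alerts = []

app_log = logging.getLogger("app")
app_log.addHandler(AlertHandler(alerts.append))
```

With this in place, `app_log.critical("Application server restarted")` adds one entry to `alerts`, while `app_log.error(...)` stays in the logs only.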

Slide 22

Slide 22 text

Monitor Good Things (before they turn into disaster)
• Identify the resource
• Set an alert when changes happen
• It’s BETTER to alert on sudden drops/spikes than on gradual changes or a threshold being reached *
● Stores created every hour
● Transactions created every hour
● Successful payments every hour
● Chats delivered every hour, etc
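A sketch of the "sudden drop/spike" idea for hourly counts such as transactions created: compare the latest hour against the average of the hours before it. The function name and the 50% `max_change` default are assumptions chosen for illustration.

```python
from statistics import mean


def sudden_change(hourly_counts, max_change=0.5):
    """Return "drop", "spike", or None for the latest hour,
    compared with the mean of all earlier hours."""
    if len(hourly_counts) < 2:
        return None  # Not enough history to compare against.
    *history, latest = hourly_counts
    baseline = mean(history)
    if baseline == 0:
        return "spike" if latest > 0 else None
    change = (latest - baseline) / baseline
    if change <= -max_change:
        return "drop"
    if change >= max_change:
        return "spike"
    return None
```

A series like `[100, 110, 90, 30]` is reported as a drop because the last hour is 70% below the average of the earlier hours, while `[100, 110, 90, 105]` stays quiet.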

Slide 23

Slide 23 text

Tune and Improve
● Add metrics as part of our retrospectives
● Ask our teams whether any metrics need to be added / changed
● Add / remove logs if necessary
● Remove noisy alerts
● Pay close attention to our tools *
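One common way to cut alert noise is a cooldown per alert key, so a repeating problem notifies once and then stays silent for a while. This is a sketch under that assumption; `AlertThrottle`, the key names, and the 300-second default are hypothetical and not taken from the talk.

```python
class AlertThrottle:
    """Suppress repeats of the same alert within a cooldown window."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown = cooldown_seconds
        self._last_sent = {}  # alert key -> timestamp of last notification

    def should_send(self, key, now):
        """Return True if an alert for `key` may be sent at time `now`
        (seconds), and remember that it was sent."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False
        self._last_sent[key] = now
        return True
```

Passing `now` explicitly (for example `time.time()`) keeps the class easy to test; different keys such as `"timeouts"` and `"crashes"` are throttled independently.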

Slide 24

Slide 24 text

A metric will tell you that something is happening, while an analysis will tell you why something is happening. - Vince Law

Slide 25

Slide 25 text

Matur nuwun (thank you) :)

Slide 26

Slide 26 text

Sources:
● Scalyr.com
● Fabric.io
● Newrelic.com