System Operations - Speaker Deck

Slide 1

Slide 1 text

Alex Xandra Albert Sim System Operations How to operate high-scale software systems

Slide 2

Slide 2 text

Introduction Alex Xandra Albert Sim • Foodpanda / Delivery Hero APAC (2021-now) • Backend engineer - Consumer Wallet Team • blibli.com (2017-2021) • Backend engineer - R&D + Engagement Team • tokopedia.com (2016-2017) • Frontend engineer - Mobile Web Team

Slide 3

Slide 3 text

Operations?

Slide 4

Slide 4 text

1st Stage of Software Systems Maintenance Just after launch • Stuff just got into production, everything runs fine • Some bugs here and there might be found, you fix them quickly • Not many people use it yet, things seem to work well • Everyone is happy!

Slide 5

Slide 5 text

2nd Stage of Software Systems Maintenance Users start to come • Traffic is high, things start to break • You try to upgrade your servers, but things are still breaking • e.g. update the DB? DOWN FOR 6 HOURS • Bugs, bugs, bugs • Questions raised: • How do people get 99.99%+ uptime? That’s only ~52m of downtime per year! • How can I know something is going wrong before my users do?
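
The "~52m downtime per year" figure follows directly from the availability target. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is made up for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return how many minutes per year a service may be down for a given uptime target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Print the yearly downtime budget for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows {budget:.2f} minutes of downtime per year")
```

For 99.99% this gives roughly 52.56 minutes, which is why a single 6-hour database upgrade already rules out that target for the year.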

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

3rd Stage of Software Systems Maintenance Learn to fear, learn to walk, learn to run • Planned, maintained systems run well and with minimum disruption • People with titles like “SRE (Site Reliability Engineer)”, “DevOps Engineer”, “System Engineer”, “Software Engineer” start to come in • You start seeing less and less downtime and fewer error pages • Question: How to get there?

Slide 8

Slide 8 text

Operations Meaning and Definition • Planning, execution, and maintenance of software systems • How do you keep your systems running? • How do you make your system run well (fewer bugs, errors, etc.)?

Slide 9

Slide 9 text

References

Slide 10

Slide 10 text

Smooth Systems Operations

Slide 11

Slide 11 text

First: Ownership Who owns what? • A service owner is responsible for managing the service throughout its entire lifecycle • From requirement gathering to handling production issues • There might be an infra team to help with the platform, but ultimately the team is responsible for the system • No “we don’t understand Kubernetes / Networking / Linux” excuse • This is why DevOps is Development and Operations

Slide 12

Slide 12 text

First: Ownership (cont.) Who owns what? • Usually ownership means the team is also responsible for the product as a whole • This means that the team has its own product, business, QA, etc. people • The team also defines its own successes • e.g. success for the wallet team is if 10% of payments use the wallet • The team operates locally, within its own domain

Slide 13

Slide 13 text

Locality Teams should operate locally, within their own domain • The team should only need to make local changes to deliver a change • Want to change 1 feature? Update only 1 repo / directory (if you use a monorepo) • Changes should impact others as little as possible • If you often need to coordinate with multiple teams to deploy, it’s not local • If you often need to work with too many teams to deliver a feature, it’s not local • How do you achieve locality?

Slide 14

Slide 14 text

Simplicity How to achieve locality? By being simple • How much can we decouple applications / services? • What architectural / code patterns to use? • e.g. (design patterns) Mediator, Bridge, Factory • How do we know we’re simple? • e.g. (software metrics) Connascence, Coupling • Your process has to be simple as well • e.g. do you need to talk to engineers to ask something? Or does your manager have to talk to their manager to get anything done at all?

Slide 15

Slide 15 text

Second: Focus, Flow and Joy Developers Should be Productive • As a developer, you have to be able to focus on your work • Coding should be a joy: with minimum dependencies, delays, and impediments • If it’s not like that, you should make it so • This is part of “ownership”

Slide 16

Slide 16 text

Third: Improvement of Daily Work Make it happen, make it better • Start from architecture: good architecture is one of the most important parts of reducing coupling • Sadly, this is a huge topic. The next slide shows some references. • Pay down your technical debt constantly • Each iteration (sprint, release, etc.) should have tech debt tickets • The team should roughly know what their pain points are, and when they will be addressed

Slide 17

Slide 17 text

Software Architecture References

Slide 18

Slide 18 text

Fourth: Psychological Safety It’s okay not to be okay • It should be safe to talk about problems • It should be safe to make mistakes • There should never be castigation, ridicule, or blame, even for big mistakes • Personally, I’ve seen a mistake amounting to ~15m+ USD in a day where leadership’s focus was “what can we learn from it?”. That is good leadership. • It’s very important to have a blameless culture because people learn when they make mistakes

Slide 19

Slide 19 text

Fifth: Focus on Customer Can’t live without customers • Find and iterate on what the customer wants from your product • Every change in your system should answer the question: “what problem does this change solve for my customer?” • Understand core and context: • Core: how much the customer is willing to pay • Context: what the customer doesn’t care about

Slide 20

Slide 20 text

Fifth: Focus on Customer (cont.) Can’t live without customers [Diagram: 2x2 matrix of Core (differentiation) / Context (everything else) against Mission critical (must not fail) / Non-mission critical (everything else), with quadrants labelled 1. Invent, 2. Deploy, 3. Manage, 4. Offload]

Slide 21

Slide 21 text

To Reiterate Key points for “Smooth System Operations” • Own your systems - you are responsible for their entire lifecycle • Make things local and simple - simple == maintainable • Improve on things daily - one step at a time • Failure should be safe - humans are flawed, we learn from mistakes • Focus on the customer - a great product without customers is a dead product

Slide 22

Slide 22 text

Metrics + Monitoring

Slide 23

Slide 23 text

Metric Driven Design Measure what matters • A good team should know what “success” means • How do you know if you have achieved success? Track it via metrics! • Define your success criteria upfront, and track them • Your success criteria must be measurable, hence trackable as metrics • Have a simple way for everyone in your team to see if the system is meeting its success criteria • Usually this would be a (soft) real-time dashboard

Slide 24

Slide 24 text

Metrics? What should we measure? • 2 kinds of metrics: 1. Business Metrics • Track what makes a business successful • Example: • e-commerce: no. of new customers, no. of transactions, Gross Merchandise Value • e-wallet: total transaction amount, no. of daily transactions • Usually discussed with business people • Sometimes these are simply your KPIs / OKRs, etc.

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Metrics? What should we measure? • 2 kinds of metrics: 2. Service Metrics • Track if your service is healthy • Example: • API server: response time, error rates, requests per second • Server: CPU utilization, free storage, RAM usage • Usually defined by the tech team, to make sure the system can handle users’ requests

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Four Golden Signals For service metrics in distributed systems • Latency • Time it takes to service a request. Successful and failed requests should be tracked separately • e.g. error latency might be related to timeouts, while successes taking too long indicate a slow system somewhere • Traffic • How much demand is being placed on your system? • e.g. requests / second, network I/O, concurrent sessions

Slide 29

Slide 29 text

Four Golden Signals For service metrics in distributed systems • Errors • Rate of request failures • e.g. HTTP 5xx, cache miss • Saturation • How much of your system is “utilized” • e.g. memory utilization, 99th percentile response time
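
To make the four signals concrete, here is a rough sketch that derives them from a window of request records. The `Request` type, the field names, and the choice of memory as the saturation signal are assumptions for illustration. A real setup would usually let the monitoring tool compute these.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code of the response


def percentile(values: list, pct: int) -> Optional[float]:
    """Nearest-rank percentile; returns None when there is no data."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests: list, window_seconds: float,
                   memory_used_mb: float, memory_total_mb: float) -> dict:
    # Latency: successes and failures are tracked separately.
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        "latency_ok_p99_ms": percentile(ok, 99),
        "latency_error_p99_ms": percentile(failed, 99),
        # Traffic: demand placed on the system, here requests per second.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: share of requests that failed with HTTP 5xx.
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Saturation: how "full" the system is, here memory utilization.
        "saturation_memory": memory_used_mb / memory_total_mb,
    }


window = [Request(100, 200), Request(200, 200), Request(300, 200), Request(5000, 503)]
print(golden_signals(window, window_seconds=2, memory_used_mb=6144, memory_total_mb=8192))
```

Keeping error latency apart from success latency matters here: the single 5000 ms failure would otherwise dominate the p99 and hide that successful requests are fast.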

Slide 30

Slide 30 text

Metric Collection How do you send values to dashboards? • Usually your metrics & monitoring tool will provide an SDK • Use the SDK to send data to your monitoring system, then build a dashboard out of the data you sent • If you are afraid of vendor lock-in, there are projects like OpenTelemetry (https://opentelemetry.io) that you can use • You should build your system such that every feature you build answers the question “how will I track the success of this feature?”

Slide 31

Slide 31 text

OpenTelemetry Sample Code Food for thought: How to abstract builders so it’s convenient?
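
One possible answer to the "food for thought" question is to hide the SDK behind a small facade, so feature code only ever calls one convenient method. The sketch below is not the OpenTelemetry API. `MetricsBackend`, `InMemoryBackend`, and `Metrics` are hypothetical names, and the in-memory backend stands in for whatever SDK is actually used.

```python
from collections import defaultdict
from typing import Protocol


class MetricsBackend(Protocol):
    """What the facade needs from any vendor SDK wrapper (hypothetical interface)."""

    def add(self, name: str, value: float, tags: dict) -> None: ...


class InMemoryBackend:
    """Stand-in backend for this sketch; a real one would forward to the SDK."""

    def __init__(self) -> None:
        self.counters = defaultdict(float)

    def add(self, name: str, value: float, tags: dict) -> None:
        # Key each counter by its name plus its sorted tags.
        key = (name, tuple(sorted(tags.items())))
        self.counters[key] += value


class Metrics:
    """Facade used by feature code; attaches default tags to every metric."""

    def __init__(self, backend: MetricsBackend, service: str) -> None:
        self._backend = backend
        self._default_tags = {"service": service}

    def count(self, name: str, value: float = 1, **tags: str) -> None:
        self._backend.add(name, value, {**self._default_tags, **tags})


backend = InMemoryBackend()
metrics = Metrics(backend, service="wallet")
metrics.count("payment.started", method="wallet")
metrics.count("payment.started", method="wallet")
print(dict(backend.counters))
```

Because feature code depends only on `Metrics`, swapping the backend for another vendor touches one class instead of every call site.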

Slide 32

Slide 32 text

Metrics What to do with it now that we have it? • Metrics should be tracked. A metric that no one looks at is a waste of resources. • A monitoring system could give you visualization of your metrics • This could help with things like analyzing long term trends • A monitoring system could also keep an eye on your metrics and alert you when things don’t go as expected • e.g. too many errors? Alert the team!

Slide 33

Slide 33 text

Monitoring System Approximation of how it works [Diagram: Flight Service → Metrics Instrumentation (open source example: Prometheus, Metricbeat) → Metrics Datastore (open source example: Elasticsearch) → Monitoring Systems (open source example: Grafana; detect anomalies) → alert / notif → Communication Systems (open source example: Mattermost; free: Slack, Telegram) → Developer]

Slide 34

Slide 34 text

What to monitor? When do we need to act? • Too many monitors and alerts lead to apathy • “Oh the alarm is ringing all the time, just ignore it” <- we don’t want this • Rule of thumb for what to monitor: Can your customer still use your product if this metric is out of the ordinary? • Example: a payment system that can only process 10% of transactions is useless. Alert when the success rate is < 90% • This means that we need at least 3 metrics: transaction start, success, and failure
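
The payment example can be sketched from the three counters. The function name and the idea of reporting "unfinished" transactions are assumptions added for illustration. The 90% threshold is the one from the slide.

```python
def transaction_health(started: int, succeeded: int, failed: int,
                       min_success_rate: float = 0.90) -> dict:
    """Summarize payment health from the start, success, and failure counters."""
    # With no traffic there is nothing to alert on, so treat the rate as healthy.
    success_rate = succeeded / started if started else 1.0
    return {
        "success_rate": success_rate,
        # Transactions that neither succeeded nor failed yet (e.g. stuck or in flight).
        "unfinished": started - succeeded - failed,
        "should_alert": success_rate < min_success_rate,
    }


print(transaction_health(started=1000, succeeded=850, failed=100))
```

Tracking the start counter separately is what makes the "unfinished" number visible: transactions that silently hang show up even though they never produce a failure event.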

Slide 35

Slide 35 text

What makes a good monitor / alert Show, notify, warn, alert • A monitor should have a clear threshold before alerting developers • A good monitor should show the data being monitored • Usually we’ll start with a warning, then an alert • This is because sometimes spikes / glitches happen, and we don’t want to be alerted every time • Example: • a timeout due to a network glitch for 1s might not need an alert, just a notification • If the timeout persists for 30s, we should be worried and check
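
The notify, warn, alert escalation can be sketched as a small monitor that looks at how long a threshold has been breached. The class name, the 10-second warning delay, and the example threshold are assumptions. Only the 30-second alert delay comes from the example above.

```python
from typing import Optional


class EscalatingMonitor:
    """Escalates from notify to warning to alert the longer a breach persists."""

    def __init__(self, threshold: float, warn_after_s: float = 10.0,
                 alert_after_s: float = 30.0) -> None:
        self.threshold = threshold
        self.warn_after_s = warn_after_s
        self.alert_after_s = alert_after_s
        self._breach_started_at: Optional[float] = None

    def observe(self, value: float, now: float) -> str:
        """Record one sample taken at time `now` (seconds) and return the severity."""
        if value <= self.threshold:
            self._breach_started_at = None  # back to normal, reset the clock
            return "ok"
        if self._breach_started_at is None:
            self._breach_started_at = now
        breach_duration = now - self._breach_started_at
        if breach_duration >= self.alert_after_s:
            return "alert"    # e.g. call or alarm
        if breach_duration >= self.warn_after_s:
            return "warning"  # e.g. chat message
        return "notify"       # short glitch, just make it visible


monitor = EscalatingMonitor(threshold=0.05)  # e.g. a 5% timeout rate
for second, timeout_rate in [(0, 0.01), (1, 0.20), (12, 0.20), (31, 0.20), (32, 0.01)]:
    print(second, monitor.observe(timeout_rate, now=second))
```

Passing the time in explicitly, rather than reading the system clock inside `observe`, keeps the escalation logic easy to test.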

Slide 36

Slide 36 text

Good Monitor Example Should have normal, warning, and alert thresholds • Warning (chat message) • Alert (call, alarm)

Slide 37

Slide 37 text

What makes a good monitor / alert All actions, less talking • A monitor must be actionable • The receiver of an alert should know what happened and why it happened • The receiver of an alert must be able to act on it • A good alert has a message that tells the receiver what happened and what to do about it • We could also write a runbook: a guide on how to handle a situation
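
One way to enforce "actionable" is to make the alert structure refuse incomplete messages. The sketch below is an illustration only. The `Alert` class, its field names, the sample wording, and the runbook URL are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    title: str
    what_is_happening: str
    what_to_do: str
    runbook_url: str

    def __post_init__(self) -> None:
        # An alert missing any of these parts leaves the responder helpless.
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"alert field '{name}' must not be empty")

    def render(self) -> str:
        """Format the alert as the message the on-call person will read."""
        return "\n".join([
            f"[ALERT] {self.title}",
            f"What is happening: {self.what_is_happening}",
            f"What to do: {self.what_to_do}",
            f"How to do it (runbook): {self.runbook_url}",
        ])


alert = Alert(
    title="Payment success rate below 90%",
    what_is_happening="Success rate has been under 90% for the last 30 seconds.",
    what_to_do="Check the payment provider status, then roll back the latest deploy if needed.",
    runbook_url="https://runbooks.example.com/payment-success-rate",  # placeholder URL
)
print(alert.render())
```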

Slide 38

Slide 38 text

Good Alert Example Actionable and clear

Slide 39

Slide 39 text

Good Alert Example Actionable and clear What is happening

Slide 40

Slide 40 text

Good Alert Example Actionable and clear What to do

Slide 41

Slide 41 text

Good Alert Example Actionable and clear How to do things

Slide 42

Slide 42 text

To Reiterate Key points for “Monitoring + Alerting” • Business metrics to define your success • System metrics to make sure you can be successful • Metrics must be measurable above all • Collect your metrics, and act on them. Monitor them. • Notify, warn, then alert. Too many alerts lead to apathy. • Make sure your alert is actionable, otherwise the responder will be helpless.

Slide 43

Slide 43 text

Continuous Delivery

Slide 44

Slide 44 text

We measured. Then? What comes after measurement? • Now that we measure and monitor our goals, what comes next? • The best software has 1 key: continuous improvement • Getting it right the first time is a rare event • It’s better to have something, then improve it as you go • You can try to understand a user, but it’s difficult to understand millions of users

Slide 45

Slide 45 text

Continuous Delivery What? • In the software (as a service) world, the ability to change your system easily, reliably, and fast is very rare • Those that can do it usually achieve it by doing it continuously • The core of continuous delivery is that you deliver your changes: • In short cycles (fast and small) • At any time (safe and reliable) • Automatically (minimum human intervention)

Slide 46

Slide 46 text

Continuous Delivery Pipeline [Diagram labels: GitHub / GitLab (code hosting) • JUnit / Mocha • Test Rails • Docker / k8s] Could all be triggered in a CI/CD system, e.g. drone.io / Jenkins

Slide 47

Slide 47 text

Continuous Delivery Why? • It’s very important that developers can do fast deployment, anytime • A small bug fix would take minutes to hours instead of days • Easy deployment incentivizes daily improvement - it’s motivating if your mini-changes go to production fast and directly improve a user’s life • Having a continuous release means that changes are usually small - when something goes wrong, it’s easier to roll back a small change than a huge change • Small changes are also cheap to throw away - it’s easy to pivot before you have invested too much in a feature

Slide 48

Slide 48 text

Future Topics

Slide 49

Slide 49 text

What’s Next? We only have 80 mins.. • Logging and Debugging • Managing Dependencies - both in tech (library, framework, tool) and people (teams) • Scaling Operations - what works for a 100-person team might not work for a 1000-person team • Toil - How to find and eliminate it

Slide 50

Slide 50 text

Q&A