Upgrade to Pro — share decks privately, control downloads, hide ads and more …

System Operations

System Operations

Alex Xandra Albert Sim

September 20, 2022
Tweet

More Decks by Alex Xandra Albert Sim

Other Decks in Technology

Transcript

  1. Introduction Alex Xandra Albert Sim • Foodpanda / Delivery Hero

    APAC (2021-now) • Backend engineer - Consumer Wallet Team • blibli.com (2017-2021) • Backend engineer - R&D + Engagement Team • tokopedia.com (2016-2017) • Frontend engineer - Mobile Web Team
  2. 1st Stage of Software Systems Maintenance Just after launch •

    Stu ff just got into production, everything runs fi ne • Some bugs here and there might be found, you fi x it quick • Not many people uses it yet, things seem to work well • Everyone is happy!
  3. 2nd Stage of Software Systems Maintenance Users start to come

    • Tra ff i c is high, things start to break • You tried to upgrade your servers, things are still breaking • e.g. update DB? DOWN FOR 6 HOURS • Bugs, bugs, bugs • Questions raised: • How do people get 99.99%+ uptime? That’s ~52m downtime per year! • How can I know something is going wrong before my users does?
  4. 3rd Stage of Software Systems Maintenance Learn to fear, learn

    to walk, learn to run • Planned, maintained systems running well and with minimum disruption • People with titles like “SRE (Site Reliability Engineer)”, “Devops Engineer”, “System Engineer”, “Software Engineer” starts to come in • You start seeing less and less downtime or error pages • Question: How to get there?
  5. Operations Meaning and De fi nition • Planning, execution, maintenance

    of software systems • How do you keep your systems running? • How do you make your system runs well (less bugs, errors, etc)?
  6. First: Ownership Who owns what? • A service owner is

    responsible for managing the service throughout the entire lifecycle • From requirement gathering to handling production issue • There might be infra team to help on platform, but ultimately the team is responsible for the system • No “we don’t understand Kubernetes / Networking / Linux” excuse • This is why DevOps is Development and Operations
  7. First: Ownership (cont.) Who owns what? • Usually ownership means

    the team are also responsible for the product as a whole • This means that the team has its own product, business, qa, etc person • The team would also de fi nes their own successes • i.e. Success for wallet team is if 10% of payment uses wallet • The team operates locally, within their own domain
  8. Locality Team should operates locally, within their own domain •

    The team should only need to do local changes for their change • Want to change 1 feature? Update only 1 repo / directory (if you use monorepo) • Changes should impact others as little as possible • If to deploy you often need to coordinate with multiple teams, its not local • If you often need to work with too many teams to deliver a feature, it’s not local • How do you achieve locality?
  9. Simplicty How to achieve locality? By being simple • How

    much can we decouple applications / services? • What architectural / code pattern to use? • i.e. (design patterns) Mediator, Bridge, Factory • How do we know we’re simple? • i.e. (software metrics) Connascene, Coupling • Your process has to be simple as well • i.e. do you need to talk to engineers to ask something? Or do your manager have to talk to their manager to do something at all?
  10. Second: Focus, Flow and Joy Developers Should be Productive •

    As a developer, you have to be able to focus on your work • Coding should be a joy: with minium dependencies, delays, and impediments • If it’s not like that, you should make it so • This is part of “ownership”
  11. Third: Improvement of Daily Work Make it happen, make it

    better • Start from architecture: good architecture is one of the most important part of reducing coupling • Sadly huge topic. Next slide will show some references. • Pay down your technical debt constantly • Each iteration (sprint, release, etc) should have tech debt tickets • The team should roughly know what their painpoints are, and when it will be addressed
  12. Fourth: Psychological Safety It’s okay not to be okay •

    It should be safe to talk about problems • It should be safe to make mistakes • There should never be castigation, ridicule, blame even for big mistakes • Personally, I’ve seen mistake amounts to ~15m+ USD in a day and leadership’s focus is “what can we learn from it?”. That is good leadership. • It’s very important to have blameless culture because people learn when they make mistake
  13. Fifth: Focus on Customer Can’t live without customers • Find

    and iterate on what the customer wants from your product • Every change in your system should answer the question: “what problem does this change solve for my customer?” • Understand core and context: • Core: how much the customer is willing to pay • Context: what the customer don’t care about
  14. Fifth: Focus on Customer (cont.) Can’t live without customers 2.

    Deploy 3. Manage 1. Invent 4. O ff l oad Core (di ff erentiation) Context (everything else) Mission critical (must not fail) Non-mission critical (Everything else)
  15. To Reiterate Key points for “Smooth System Operations” • Own

    your systems - you are responsible for its entire lifecycle • Make things local and simple - simple == maintainable • Improve on things daily - one step at a time • Failure should be safe - humans are fl awed, we learn from mistakes • Focus on customer - a great product without customers is a dead product
  16. Metric Driven Design Measure what matters • A good team

    should know what “success” means • How do you know if you have achieve success? Track it via metrics! • De fi ne your success criteria upfront, and track it • Your sucecss criteria must be measurable, hence trackable metrics • Have a simple way for everyone in your team to see if the system is meeting its success criteria • Usually this would be a (soft) real-time dashboard
  17. Metrics? What should we measure? • 2 kinds of metrics:

    1. Business Metrics • Tracks what makes a business successful • Example: • e-commerce: no. of new customers, no. of transactions, Gross Merchandise Value • e-wallet: total transaction amount, no. of daily transactions • Usually discussed with business people • Sometimes is simply your KPI / OKR etc
  18. Metrics? What should we measure? • 2 kinds of metrics:

    2. Service Metrics • Tracks if your service is healthy • Example: • API server: response time, error rates, request per second • Server: CPU utilization, free storage, RAM usage • Usually de fi ned by tech team, to make sure the system could handle users’ request
  19. Four Golden Signals For service metrics in distributed systems •

    Latency • Time it takes to service a request. Successful and failure should be tracked separately • i.e. error latency might be related to timeouts, success taking too long is indication of slow system somewhere • Tra ffi c • How much demand is being placed in your system? • i.e. request / second, network I/O, concurrent sessions
  20. Four Golden Signals For service metrics in distributed systems •

    Errors • Rate of request failure • i.e. HTTP 5xx, cache miss • Saturation • How much of your system is “utilized” • i.e. memory utilization, 99th percentile response time
  21. Metric Collection How do you send values to dashboards? •

    Usually your metrics & monitoring tool will provide SDK • Use the SDK to send data to your monitoring system, then build a dashboard out of the data you sent • If you are afraid of vendor locking, there’s projects like Open Telemetry (https://opentelemetry.io) that you can use • You should build your system such as any feature build should answer the question “how will I track the success of this feature?”
  22. Metrics What to do with it now that we have

    it? • Metrics should be tracked. A metrics that no one looks at is a waste of resources. • A monitoring system could give you visualization of your metrics • This could help with things like analyzing long term trends • A monitoring system could also keep an eye of your metrics and alerts you when things don’t go as expected • i.e. too many errors? Alert the team!
  23. Monitoring System Approximation of how it works Flight Service Metrics

    Instrumentation Metrics Datastore Monitoring Systems Communication Systems Developer Open source example: Prometheus, metricbeat Open source example: Elasticsearch Open source example: Grafana Open source example: MatterMost Free: Slack, Telegram Alert / notif Detect anomalies
  24. What to monitor? When do we need to act? •

    Too many monitors and alerts leads to apathy • “Oh the alarm is ringing all the time, just ingore it” <- we don’t want this • Rule of thumb of what to monitor: Can your customer still use your product if this metrics is out of ordinary? • Example: a payment system that can only process 10% of transaction is useless. Alert when success rate is < 90% • This means that we at least need 3 metrics: transaction start, success, and failure
  25. What makes a good monitor / alert Show, notify, warn,

    alert • A monitor should have a clear threshold before alerting developers • A good monitor sholud show the data being monitored • Usually we’ll start with warning, then alert • This is because sometimes spikes / glitch happen, and we don’t want to be alerted every time • Example: • timeout due to network glitch for 1s might not need an alert, just noti fi cation • If the timeout persist for 30s, we should be worried and check
  26. What makes a good monitor / alert All actions, less

    talking • A monitor must be actionable • The receiver of an alert should know what happened and why it happened • The receiver of an alert must be able to act on it • A good alert would have a message to tell the receiver what happen and what to do about it • We could also write a runbook: a guide on how to handle a situation
  27. To Reiterate Key points for “Monitoring + Alerting” • Business

    metric to de fi ne your success • System metrics to make sure you can be successful • Metrics must be measurable above all • Collect your metric, and acts on it. Monitor it. • Notify, warn, then alert. Too many alerts leads to apathy. • Make sure your alert is actionable, otherwise the responder would be helpless.
  28. We measured. Then? What comes after measurement? • Now that

    we measure and monitor our goals, what comes next? • The best of the best softwares have 1 key: continuous improvement • Getting it right the fi rst time is a rare event • It’s better to have something, then improve it as you go • You can try to understand a user, but it’s di ffi cult to understand millions of user
  29. Continuous Delivery What? • In the software (as a service)

    world, the ability to change your system easily, reliably, and fast is very rare • The one that can do that usually achieve it by doing it continuously • The core of continious delivery is that you deliver your changes: • In short cycle (fast and small) • In any time (safe and reliable) • Automatically (minimum human intervention)
  30. Continuous Delivery Pipeline Github / Gitlab Code hosting JUnit /

    Mocha Test Rails Docker / k8s Could all be triggered in CI/CD system, i.e. drone.io / Jenkins
  31. Continuous Delivery Why? • It’s very important that developers can

    do fast deployment, anytime • Small bug fi x would take minutes-hours instead of days • Easy deployment incentivize daily improvement - it’s motivating if your mini- changes go to production fast and directly improve a user’s life • Having a continuous release means that changes are usually small - when something goes wrong, it’s easier to rollback a small change vs huge change • Small changes also means it’s cheap to throw away - easy to pivot before you are investing in a feature too much
  32. What’s Next? We only have 80 mins.. • Logging and

    Debugging • Managing Dependencies - both in tech (library, framework, tool) and people (teams) • Scaling Operations - what works for 100 person team might not work for 1000 person team • Toil - How to fi nd and eliminate them
  33. Q&A