Troubleshooting in a distributed system

Komodor <> Epsagon | May 2021
Tracking changes in a distributed system
The dark side of changes

Cloud native | March 2021
Who am I?
• CTO and co-founder of Komodor, a startup building the first k8s-native troubleshooting platform.
• A big believer in dev empowerment and moving fast.
• Worked at eBay, Forter, and Rookout (first developer); a lot of backend and infra development experience (“DevOps”).
• K8s fan 😃

Agenda
1. Why should you care what changed?
2. What is a change?
3. Why is it so hard to find what changed?
4. The future of change tracking
5. What can you do?

Why should you care what changed?
• Issues happen on an hourly basis
• They range from complete system downtime to a small bug in staging
• 85% of incidents can be traced to system changes!
• Most troubleshooting time is spent identifying the issue

What is a change?
Any action that alters the system state. For example:
• Code deployment
• Infra changes (cloud/on-prem)
• Config changes
• Feature flags
• Job changes
• DB migrations
• Third-party changes
• Customer usage or data*
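To make the definition concrete, here is a minimal sketch of how a single change could be recorded in one common shape, whatever its origin. The `ChangeKind` members mirror the categories listed above; the `ChangeEvent` type and its field names (`service`, `author`, `summary`, `at`) are assumptions made for illustration, not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class ChangeKind(Enum):
    # One member per change category listed on the slide.
    CODE_DEPLOYMENT = "code_deployment"
    INFRA = "infra"
    CONFIG = "config"
    FEATURE_FLAG = "feature_flag"
    JOB = "job"
    DB_MIGRATION = "db_migration"
    THIRD_PARTY = "third_party"
    CUSTOMER_USAGE = "customer_usage"


@dataclass(frozen=True)
class ChangeEvent:
    # Hypothetical record shape: what changed, where, by whom, and when.
    kind: ChangeKind
    service: str
    author: str
    summary: str
    # Timestamps default to "now" in UTC so events from different systems compare cleanly.
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


event = ChangeEvent(ChangeKind.FEATURE_FLAG, "checkout", "alice", "enabled new-pricing flag")
print(event.kind.value, event.service, event.at.isoformat())
```

Recording every kind of change in the same shape is what later makes it possible to put deploys, flag flips, and config edits on a single timeline.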

Why is it so hard to find what changed?

Modern Haystack
1. Relies heavily on third parties (cloud, APIs, etc.)
2. Includes dozens of microservices
3. Changes rapidly (the more the better)
4. Everyone can make a change (shift left)
TL;DR: Modern systems are basically a super complex puzzle that changes rapidly.

What makes it extra hard?
1. Everything is connected: a ripple effect can cause an “unrelated” change to crash the system
2. Dark data: unaudited changes are happening all day long (cloud changes, deploys to production, third-party changes, etc.)
3. Scattered data: tracking changes efficiently requires opening several different systems and querying each one individually
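The "scattered data" problem can be sketched in a few lines: pull the change lists from each system, merge them, and keep only what happened shortly before the alert, across all services rather than just the alerting one. The `Change` tuple, the `changes_before` helper, and the sample data below are hypothetical; a real setup would fill the lists from CI, feature-flag, and cloud audit APIs.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple


class Change(NamedTuple):
    at: datetime
    source: str   # which system reported the change (hypothetical labels)
    service: str
    summary: str


def changes_before(alert_at: datetime,
                   sources: Iterable[Iterable[Change]],
                   window: timedelta = timedelta(hours=1)) -> List[Change]:
    """Merge changes from every source and keep those inside the window before the alert."""
    merged = [c for src in sources for c in src
              if alert_at - window <= c.at <= alert_at]
    # Newest first: the most recent change is usually the first suspect.
    return sorted(merged, key=lambda c: c.at, reverse=True)


alert = datetime(2021, 5, 1, 12, 0)
deploys = [
    Change(datetime(2021, 5, 1, 11, 40), "ci", "payments", "deploy v42"),
    Change(datetime(2021, 5, 1, 9, 0), "ci", "checkout", "deploy v17"),
]
flags = [Change(datetime(2021, 5, 1, 11, 55), "flags", "checkout", "enabled new-pricing")]

recent = changes_before(alert, [deploys, flags])
for c in recent:
    print(c.at.time(), c.source, c.service, c.summary)
```

The function deliberately does not filter by service, since a change to an "unrelated" service can still be the root cause.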

What does it look like?
[Figure: the #alerts-production channel during an incident. Labels: original alert; current status; find last job; what code changed; “who changed what”; other “unrelated” service change was the root cause.]

All indicators of change tracking & troubleshooting are moving in the same direction
• Velocity is ever growing
• More people can make changes
• Systems are becoming more complex

So, what can you do? 10 quick tips
1. Admit you have a problem
2. Automate change notifications to Slack (or monitoring tools)
3. Use IaC as much as possible
4. Create a change process (even if just for reporting)
5. Improve cross-team communication while troubleshooting
6. Eliminate unaudited changes: use a process or a tool
7. Use distributed tracing to better understand system topology
8. Use tags, annotations, and metadata with the relevant version
9. GitOps can eliminate some of the issues
10. Create playbooks with links to relevant tools and changes
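As an illustration of tip 2, a deploy script could post a one-line message for every change. The sketch below assumes a Slack incoming webhook, which accepts a JSON body with a `text` field; the helper names, the message format, and the `SLACK_WEBHOOK_URL` environment variable are hypothetical choices for this example.

```python
import json
import os
import urllib.request


def build_change_message(service: str, version: str, author: str, summary: str) -> bytes:
    """Build the JSON body for a change notification (message format is an assumption)."""
    text = f"[change] {service} {version} by {author}: {summary}"
    return json.dumps({"text": text}).encode("utf-8")


def notify(webhook_url: str, body: bytes) -> int:
    """POST the message to the webhook and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


body = build_change_message("checkout", "v1.4.2", "alice", "deployed to production")

# Only send when a webhook is configured; SLACK_WEBHOOK_URL is a hypothetical variable name.
webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
if webhook_url:
    notify(webhook_url, body)
```

Including the version in the message also covers part of tip 8: whoever reads the alert channel can see which version was live without opening another tool.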

Troubleshooting can be easy 😎
BTW, we are hiring!
