Defensive Programming & Resilient systems in Real World (TM)

Slide 1

Slide 1 text

Defensive programming & resilient systems
Don’t trust anyone, not even yourself
A “not just testing” talk
@kinisoftware

Slide 2

Slide 2 text

Self-promotion

Slide 3

Slide 3 text

Why this talk?
• 100% test coverage (as you told us, Kini)
• Code reviews
• Manual testing
• So, after release, my job is done. Right?

Slide 4

Slide 4 text

Why this talk? No, it isn’t 

Slide 5

Slide 5 text

Why this talk?
Inside every enterprise today is a mesh of interconnected, interdependent systems. They cannot, and must not, allow bugs to cause a chain of failures.
Bugs will happen. They cannot be eliminated, so they must be survived instead.
Production is the only place to learn how the software will respond to real-world stimuli.
Release 1.0 is the beginning of your software’s life, not the end of the project.

Slide 6

Slide 6 text

Why this talk?
• Early detection
• Reduce impact on customers
• Know why an issue happened
• Don’t depend on somebody looking at the error log, a daily email, ...
• Prevent different conditions in dev/testing & production

Slide 7

Slide 7 text

Glossary
Defensive programming is a form of defensive design intended to ensure the continuing function of a piece of software in spite of unforeseeable usage of said software. The idea can be viewed as reducing or eliminating the prospect of Murphy's Law having effect.
A resilient system stays responsive in the face of failure; any system that is not resilient will be unresponsive after a failure. Resilience is achieved by replication, containment, isolation and delegation.

Slide 8

Slide 8 text

State of the art @Tuenti

Slide 9

Slide 9 text

State of the art @Tuenti

Slide 10

Slide 10 text

State of the art @Tuenti

Slide 11

Slide 11 text

State of the art @Tuenti
Service Api - Monolith PHP: ChargingApi, ProvisioningApi, SubscriptionsApi, EventHistoryApi
µServices - Java: Charging, Provisioning, Subscriptions, Notifications
BSS GW
Providers: WS, Notifications / Files
Mobile Apps, Web, Admin Tools

Slide 12

Slide 12 text

Real World(TM)
• Avoid wrong dependencies
• Feature disabling
• Detect unfinished processes
• Go async (and retry)
• Log inputs & outputs
• Monitoring
• Alarms

Slide 13

Slide 13 text

Real World(TM)
Avoid wrong dependencies
• Integration points are the number-one killer of systems
• A subsystem should be as isolated as possible
• Consider health checks
• Design and architecture decisions are also financial decisions
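The "consider health checks" bullet could look roughly like the sketch below. It is a minimal illustration, not the setup used in the talk: the function name check_health and the dependency names in the usage note are hypothetical, and each check is assumed to be a callable that raises when its dependency is unreachable.

```python
def check_health(checks):
    """Runs each named check; a check passes if it returns without raising."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as error:
            # One broken dependency must not hide the state of the others.
            results[name] = f"failed: {error}"
    status = "ok" if all(value == "ok" for value in results.values()) else "degraded"
    return {"status": status, "dependencies": results}
```

A service could expose this report on an internal endpoint, for example check_health({"database": ping_database, "provider_ws": ping_provider}), where both ping functions are placeholders for whatever cheap probe each dependency offers.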

Slide 14

Slide 14 text

Real World(TM)
Feature disabling
• Allow teams to modify system behavior without changing code (no release needed)
• Do “dark launches” when possible
• When replacing old code, keep it until you know the new code works fine
• Configuration files (overriding & hot reloading)
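One way to picture configuration-driven feature disabling with hot reloading is the sketch below. It assumes a JSON file of flags that is re-read whenever it changes on disk; the class FeatureFlags, the flag name new_charging_flow and the function charge are all hypothetical names chosen for illustration.

```python
import json
import os


class FeatureFlags:
    """Reads flags from a JSON file and re-reads it when the file changes."""

    def __init__(self, path, defaults=None):
        self._path = path
        self._defaults = dict(defaults or {})
        self._flags = {}
        self._loaded_stamp = None

    def _reload_if_changed(self):
        try:
            stat = os.stat(self._path)
            stamp = (stat.st_mtime_ns, stat.st_size)
            if stamp != self._loaded_stamp:
                with open(self._path, encoding="utf-8") as handle:
                    self._flags = json.load(handle)
                self._loaded_stamp = stamp
        except (OSError, ValueError):
            # Missing or broken config: keep the last known good flags.
            pass

    def is_enabled(self, name):
        self._reload_if_changed()
        return bool(self._flags.get(name, self._defaults.get(name, False)))


def charge(flags, amount):
    """Keeps the old code path alive until the new one has proven itself."""
    if flags.is_enabled("new_charging_flow"):
        return ("new", amount)
    return ("old", amount)
```

Editing the file switches the behavior without a release, and a malformed file leaves the previous flags in place instead of taking the service down.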

Slide 15

Slide 15 text

Real World(TM)
Detect unfinished processes
• Your business logic is composed of many methods
• Each one of them can fail for many reasons
• Depending on the underlying tech, not all of those failures may be catchable
• How do you detect that something is failing?

Slide 16

Slide 16 text

Real World(TM)
Detect unfinished processes
• Just detecting differences between events started and events ended tells you something is wrong
• Integration points without timeouts are a surefire way to create cascading failures
• Consider fail fast
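The idea of comparing started and ended events can be sketched as follows. This is an in-memory illustration only: ProcessTracker and the process ids are hypothetical, and a real system would probably persist the start and end events so that a crash of the process itself is also detected.

```python
import time


class ProcessTracker:
    """Records when each process starts and ends so stuck ones can be found."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started = {}

    def start(self, process_id):
        self._started[process_id] = self._clock()

    def end(self, process_id):
        self._started.pop(process_id, None)

    def unfinished(self, older_than_seconds):
        """Returns ids that started more than older_than_seconds ago and never ended."""
        now = self._clock()
        return sorted(
            process_id
            for process_id, started_at in self._started.items()
            if now - started_at > older_than_seconds
        )
```

A periodic job could call unfinished() and raise an alarm when the list is not empty, which catches failures that never surface as an exception.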

Slide 17

Slide 17 text

Real World(TM)
Go async (and retry)
• Each system is protected against other systems’ failures or service degradation
• Be careful with operations that make changes
• Don’t make requests too quickly
• Check if the operation is pending, even if the previous call failed
• Circuit breaker
• Limit the number of retries & log them
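The retry and circuit breaker bullets can be sketched as below. This is a simplified illustration under stated assumptions, not Hystrix and not the code behind the talk: the names retry, CircuitBreaker and CircuitOpenError are hypothetical, the backoff simply doubles on each attempt, and the breaker counts consecutive failures only.

```python
import logging
import time

log = logging.getLogger("retry")


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the remote system."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def retry(operation, max_attempts=3, base_delay_seconds=0.5, sleep=time.sleep):
    """Calls operation up to max_attempts times, doubling the wait each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except CircuitOpenError:
            raise  # The breaker already decided; do not hammer the dependency.
        except Exception as error:
            log.warning("attempt %d/%d failed: %s", attempt, max_attempts, error)
            if attempt == max_attempts:
                raise
            sleep(base_delay_seconds * 2 ** (attempt - 1))
```

Retrying is only safe as written for operations that do not make changes, or that are idempotent; for anything else, check first whether the previous attempt is still pending before sending it again.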

Slide 18

Slide 18 text

Real World(TM) Log inputs & outputs If you don’t log…

Slide 19

Slide 19 text

Real World(TM) Log inputs & outputs If you don’t log…

Slide 20

Slide 20 text

Real World(TM)
Log inputs & outputs
• Log service & third-party inputs & outputs
• As humans read (or even just scan) log files for a new system, they are learning what “normal” means for that system
• Reserve “ERROR” for a serious system problem
• Don’t leave log files on production systems. Copy them to a staging area for analysis
• Log file rotation
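Logging inputs and outputs at a service boundary can be as small as a decorator. The sketch below is one possible shape, with the hypothetical names log_io and the logger "io"; it logs a failed call at WARNING rather than ERROR, on the assumption that a single failed call is not yet a serious system problem.

```python
import functools
import logging

log = logging.getLogger("io")


def log_io(func):
    """Logs the arguments, the result, and any exception of each call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.info("%s input args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # exc_info=True attaches the traceback to the log record.
            log.warning("%s failed", func.__name__, exc_info=True)
            raise
        log.info("%s output %r", func.__name__, result)
        return result

    return wrapper
```

Applied to the functions that call a third party, this records what was sent and what came back, which is usually the first thing needed to know why an issue happened.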

Slide 21

Slide 21 text

Real World(TM)
Monitoring
• A system without transparency cannot survive long in production
• Good data enables good decision making
• Logging and monitoring are both good for exposing and understanding the immediate behavior of an application or system

Slide 22

Slide 22 text

Real World(TM)
Monitoring
• Transparency: historical trending, predictive forecasting, present status, and instantaneous behavior
• Dashboards
• Messages should include an identifier that can be used to trace the steps of a transaction
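The last bullet, an identifier carried by every message, might be implemented along these lines. The sketch uses Python's contextvars and a logging filter; the names transaction_id, TransactionIdFilter and handle_request are hypothetical, and the two log messages stand in for real processing steps.

```python
import contextvars
import logging
import uuid

transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Copies the current transaction id onto every log record."""

    def filter(self, record):
        record.transaction_id = transaction_id.get()
        return True


def handle_request(logger, incoming_id=None):
    """Reuses the caller's id when present so the trace spans services."""
    token = transaction_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("charging started")
        logger.info("charging finished")
        return transaction_id.get()
    finally:
        # Restore the previous value so ids never leak between requests.
        transaction_id.reset(token)
```

With the filter attached to a handler whose format includes %(transaction_id)s, searching the logs for one id returns every step of that transaction, across services if the id is forwarded on outgoing calls.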

Slide 23

Slide 23 text

Real World(TM)
Alarms
• Independent alarms system
• Log: event happens, event data, error detected, …
• Priorities: Critical, Error and Warning
• Predicting the future
• Document alarms
• Reporting system
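As a rough illustration of alarm priorities and of "predicting the future", the sketch below maps a metric value to Critical, Error or Warning and extrapolates a trend linearly. The names AlarmRule and hours_until, the thresholds and the runbook field are assumptions made for this example; a real alarm system would run independently of the service it watches.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlarmRule:
    name: str
    warning: float
    error: float
    critical: float
    runbook: str  # "Document alarms": where to find the procedure to follow.

    def evaluate(self, value: float) -> Optional[str]:
        """Returns the highest priority whose threshold the value reaches."""
        if value >= self.critical:
            return "CRITICAL"
        if value >= self.error:
            return "ERROR"
        if value >= self.warning:
            return "WARNING"
        return None


def hours_until(limit, samples):
    """Linear extrapolation from the first and last (hour, value) samples.

    Needs at least two samples taken at different times. Returns None when
    the value is not growing.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    rate = (v1 - v0) / (t1 - t0)
    if rate <= 0:
        return None
    return max(0.0, (limit - v1) / rate)
```

For instance, a disk that went from 40% to 60% full in ten hours would be forecast to reach 100% in another twenty, which leaves time to act before a Critical alarm fires.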

Slide 24

Slide 24 text

Q&A
@kinisoftware

Slide 25

Slide 25 text

Thanks!!
@kinisoftware

Slide 26

Slide 26 text

Extra Ball
• Tools:
  • Graphite & Grafana (metrics & monitoring)
  • Cabot (alarms)
  • Elasticsearch, Logstash and Kibana (logging & monitoring)
  • Hystrix (circuit breaker)

Slide 27

Slide 27 text

Extra Ball
• Resources:
  • Release It!
  • Reactive Manifesto
  • Feature Toggles
  • Microservices Guide
  • Publish-subscribe pattern