Defensive programming & resilient systems
Don’t trust anyone, not even yourself
A “not just testing” talk
kini@tuenti.com
@kinisoftware
Slide 2
Self-promotion
Slide 3
Why this talk?
• 100% test coverage (as you told us, Kini)
• Code reviews
• Manual testing
• So, after release, my job is done
Right?
Slide 4
Why this talk?
No, it isn’t
Slide 5
Why this talk?
Inside every enterprise today is a mesh of interconnected, interdependent systems.
They cannot—must not—allow bugs to cause a chain of failures.
Bugs will happen. They cannot be eliminated, so they must be survived instead.
Production is the only place to learn how the software will respond to real-world stimuli.
Release 1.0 is the beginning of your software’s life, not the end of the project.
Slide 6
Why this talk?
• Early detection
• Reduce impact to customers
• Know why an issue happened
• Don’t depend on somebody looking at an error log, a daily email, ...
• Prevent different conditions in dev/testing & production
Slide 7
Glossary
Defensive Programming is a form of defensive design intended to ensure the
continuing function of a piece of software in spite of unforeseeable usage of
said software. The idea can be viewed as reducing or eliminating the prospect
of Murphy's Law having effect.
A resilient system stays responsive in the face of failure; any system that is
not resilient will be unresponsive after a failure. Resilience is achieved by
replication, containment, isolation and delegation.
Slide 8
State of the art @Tuenti
Slide 11
State of the art @Tuenti
Service Api (PHP monolith): ChargingApi, ProvisioningApi, SubscriptionsApi, EventHistoryApi
µServices (Java): Charging, Provisioning, Subscriptions, Notifications
BSS GW
Providers: WS, Notifications / Files
Mobile Apps, Web, Admin Tools
Real World(TM)
Avoid wrong dependencies
• Integration points are the number-one killer of systems.
• A subsystem should be as isolated as possible
• Consider health checks
• Design and architecture decisions are also financial decisions.
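A minimal sketch of the "consider health checks" bullet, in Python. The dependency names and probe functions are hypothetical and not taken from the talk; the point is only that each integration point gets its own probe, and that a probe which raises is reported as down instead of crashing the check.

```python
from typing import Callable, Dict


def check_health(probes: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run every probe and report 'up' or 'down' per dependency."""
    report = {}
    for name, probe in probes.items():
        try:
            report[name] = "up" if probe() else "down"
        except Exception:
            # A probe that blows up counts as a failed dependency.
            report[name] = "down"
    return report


def is_healthy(report: Dict[str, str]) -> bool:
    """The subsystem is healthy only if every dependency is up."""
    return all(status == "up" for status in report.values())


def broken_probe() -> bool:
    # Simulated outage of a hypothetical provider web service.
    raise ConnectionError("provider unreachable")


# Hypothetical dependencies, for illustration only.
report = check_health({"database": lambda: True, "provider_ws": broken_probe})
```

Reporting per dependency, rather than a single yes/no, shows which integration point is failing.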
Slide 14
Real World(TM)
Feature Disabling
• Allow teams to modify system behavior without
changing code (no release needed)
• Do “dark launches” when possible
• When replacing old code, always keep it until you know
the new code works fine
• Configuration files (overriding & hot reloading)
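One possible shape for feature disabling with a hot-reloaded configuration file, sketched in Python. The file name, the flag name `new_charging` and the `charge` function are assumptions made up for this example. The flags file is re-read whenever its modification time or size changes, so behavior can be switched without a release, and the old code path stays in place next to the new one.

```python
import json
import os
import tempfile


class FeatureFlags:
    """Reads flags from a JSON file and reloads it when the file changes."""

    def __init__(self, path):
        self._path = path
        self._signature = None  # (mtime, size) of the last file we loaded
        self._flags = {}

    def is_enabled(self, name, default=False):
        self._reload_if_changed()
        return bool(self._flags.get(name, default))

    def _reload_if_changed(self):
        try:
            stat = os.stat(self._path)
            signature = (stat.st_mtime_ns, stat.st_size)
            if signature != self._signature:
                with open(self._path) as handle:
                    self._flags = json.load(handle)
                self._signature = signature
        except (OSError, ValueError):
            # Missing or half-written file: keep the last known good flags.
            pass


def charge(amount, flags):
    """Hypothetical operation with the old path kept next to the new one."""
    if flags.is_enabled("new_charging"):
        return f"new:{amount}"
    return f"old:{amount}"


# Hypothetical flags file, created here only so the sketch is runnable.
flags_path = os.path.join(tempfile.mkdtemp(), "flags.json")
with open(flags_path, "w") as handle:
    json.dump({"new_charging": False}, handle)
flags = FeatureFlags(flags_path)
```

Keeping the last known good flags when the file is unreadable means a broken configuration edit does not change behavior.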
Slide 15
Real World(TM)
Detect unfinished processes
• Your business logic is composed of many methods
• Each one of them can fail for many reasons
• Depending on the underlying tech, not all failures may
be catchable
• How do you detect that something is failing?
Slide 16
Real World(TM)
Detect unfinished processes
• Just detecting differences between events started and
ended tells you something is wrong
• Integration points without timeouts are a surefire way to
create cascading failures
• Consider fail fast
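The "events started versus events ended" idea can be sketched as a small tracker, in Python. The class name, the process ids and the 30-second limit are assumptions for illustration. Anything that was started but has not ended within the allowed duration is reported as unfinished, whether or not the failure was catchable in the code that ran it.

```python
import time
from typing import Dict, List, Optional


class ProcessTracker:
    """Remembers when each process started and reports the ones never ended."""

    def __init__(self, max_duration_s: float):
        self._max_duration_s = max_duration_s
        self._started_at: Dict[str, float] = {}

    def started(self, process_id: str, now: Optional[float] = None) -> None:
        self._started_at[process_id] = time.monotonic() if now is None else now

    def ended(self, process_id: str) -> None:
        self._started_at.pop(process_id, None)

    def unfinished(self, now: Optional[float] = None) -> List[str]:
        """Ids of processes that started but exceeded the allowed duration."""
        now = time.monotonic() if now is None else now
        return sorted(
            process_id
            for process_id, started in self._started_at.items()
            if now - started > self._max_duration_s
        )


# Hypothetical process ids; explicit timestamps keep the example deterministic.
tracker = ProcessTracker(max_duration_s=30)
tracker.started("charge-1", now=0)
tracker.started("charge-2", now=0)
tracker.ended("charge-1")
```

A periodic job could call `unfinished()` and feed the result into monitoring or alarms.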
Slide 17
Real World(TM)
Go async (and retry)
• Each system is protected against other systems' failures or
service degradation
• Be careful with operations that make changes
• Don’t retry requests too quickly
• Check if the operation is pending, even if the previous call failed
• Circuit Breaker
• Limit the number of retries and log them
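A rough Python sketch of two of the bullets above: a retry helper that limits attempts, waits longer between them and logs each failure, and a simple circuit breaker that fails fast once a dependency looks down. The class names, thresholds and delays are illustrative assumptions, not the implementation used at Tuenti.

```python
import logging
import time

log = logging.getLogger("retry")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset window elapsed: let one trial call through (half-open).
            self._opened_at = None
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def retry(func, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call func up to max_attempts times, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker says stop; retrying would defeat it
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Blind retries like this are only safe for operations that can be repeated without side effects. For an operation that makes changes, check whether the previous attempt is still pending or already applied before sending it again.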
Slide 18
Real World(TM)
Log inputs & outputs
If you don’t log…
Slide 20
Real World(TM)
Log inputs & outputs
• Log the inputs & outputs of your services and of third parties
• As humans read (or even just scan) log files for a new system, they
are learning what “normal” means for that system
• Reserve “ERROR” for a serious system problem
• Don’t leave log files on production systems. Copy them to a
staging area for analysis
• Log file rotation
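A small Python sketch of logging inputs and outputs with file rotation, using the standard `logging` module. The logger name, file name, size limits and the `charge` function are assumptions for the example. Failures of a single call are logged as WARNING here, which keeps ERROR free for serious system problems.

```python
import functools
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_io_logger(path, max_bytes=1_000_000, backups=5):
    """Logger that rotates its file once it reaches max_bytes."""
    logger = logging.getLogger("service.io")
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_io(logger):
    """Decorator that logs the arguments and the result of every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.info("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                logger.warning("fail %s", func.__name__, exc_info=True)
                raise
            logger.info("return %s result=%r", func.__name__, result)
            return result
        return wrapper
    return decorator


# Hypothetical log location and service call, for illustration only.
log_path = os.path.join(tempfile.mkdtemp(), "service.log")
io_logger = build_io_logger(log_path)


@log_io(io_logger)
def charge(msisdn, amount):
    return {"status": "OK", "amount": amount}
```

The same decorator can wrap the client functions that call third parties, so both sides of every integration point end up in the log.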
Slide 21
Real World(TM)
Monitoring
• A system without transparency cannot survive long in
production
• Good data enables good decision making
• Logging and monitoring are both good for exposing
and understanding the immediate behavior of an
application or system
Slide 22
Real World(TM)
Monitoring
• Transparency: historical trending, predictive forecasting,
present status, and instantaneous behavior
• Dashboards
• Messages should include an identifier that can be used
to trace the steps of a transaction
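One way to put a transaction identifier on every message, sketched in Python with `contextvars` and a `logging.Filter`. The logger name, the message texts and the id `tx-42` are made up for the example. The id is set once at the start of a request and the filter stamps it on each log record, so all steps of a transaction can be found with a single search.

```python
import contextvars
import io
import logging
import uuid

# Holds the id of the transaction currently being processed.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Copies the current transaction id onto every log record."""

    def filter(self, record):
        record.transaction_id = transaction_id.get()
        return True


def build_traced_logger(stream):
    logger = logging.getLogger("service.traced")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(transaction_id)s] %(message)s"))
    handler.addFilter(TransactionIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, tx_id=None):
    """Hypothetical request handler: every step logs under the same id."""
    token = transaction_id.set(tx_id or uuid.uuid4().hex)
    try:
        logger.info("charging started")
        logger.info("charging finished")
    finally:
        transaction_id.reset(token)


stream = io.StringIO()
traced_logger = build_traced_logger(stream)
handle_request(traced_logger, tx_id="tx-42")
```

Passing the same id along in calls to other services would let the trace continue across system boundaries.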
Slide 23
Real World(TM)
Alarms
• Independent alarm system
• Log: event happens, event data, error detected, …
• Priorities: Critical, Error and Warning
• Predicting the future
• Document alarms
• Reporting system
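The priorities and the "predicting the future" bullet could look roughly like this in Python. The rule name, thresholds and runbook text are hypothetical, and the linear extrapolation is only one crude way to raise an alarm before a threshold is actually reached.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Priority(Enum):
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class AlarmRule:
    name: str
    warning_at: float
    error_at: float
    critical_at: float
    runbook: str  # documentation for whoever receives the alarm

    def evaluate(self, value: float) -> Optional[Priority]:
        """Return the highest priority whose threshold the value reaches."""
        if value >= self.critical_at:
            return Priority.CRITICAL
        if value >= self.error_at:
            return Priority.ERROR
        if value >= self.warning_at:
            return Priority.WARNING
        return None


def predict(previous: float, current: float, steps_ahead: int) -> float:
    """Linear extrapolation of the last two samples, steps_ahead into the future."""
    return current + (current - previous) * steps_ahead


# Hypothetical rule: number of unfinished charging processes.
rule = AlarmRule(
    name="unfinished_charges",
    warning_at=5,
    error_at=20,
    critical_at=50,
    runbook="hypothetical: check the provider status, then the retry queue",
)
```

Evaluating `rule.evaluate(predict(previous, current, steps_ahead))` next to the current value gives an early warning while the metric is still below the threshold.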