Defensive Programming & Resilient systems in Real World (TM)

Tuenti
November 18, 2016

Transcript

  1. Why this talk?

    • 100% test coverage (as you told us, Kini)
    • Code reviews
    • Manual testing
    • So, after release, my job is done. Right?

  2. Why this talk?

    Inside every enterprise today is a mesh of interconnected, interdependent systems. They cannot, and must not, allow bugs to cause a chain of failures.
    Bugs will happen. They cannot be eliminated, so they must be survived instead.
    Production is the only place to learn how the software will respond to real-world [...]
    Release 1.0 is the beginning of your software’s life, not the end of the project.
  3. Why this talk?

    • Early detection
    • Reduce impact on customers
    • Know why an issue happened
    • Don’t depend on somebody looking at the error log, a daily email, ...
    • Avoid differences between dev/testing and production conditions
  4. Glossary

    Defensive programming is a form of defensive design intended to ensure the continuing function of a piece of software in spite of unforeseeable usage of said software. The idea can be viewed as reducing or eliminating the prospect of Murphy's Law having effect.
    A resilient system stays responsive in the face of failure; any system that is not resilient will be unresponsive after a failure. Resilience is achieved by replication, containment, isolation and delegation.
  5. State of the art @Tuenti

    Architecture diagram; the recoverable labels are:
    • Service API (PHP monolith): ChargingApi, ProvisioningApi, SubscriptionsApi, EventHistoryApi
    • µServices (Java): Charging, Provisioning, Subscriptions, Notifications
    • Other labels: BSS GW, Providers WS, Notifications / Files, Mobile Apps, Web, Admin Tools
  6. Real World(TM)

    • Avoid wrong dependencies
    • Feature disabling
    • Detect unfinished processes
    • Go async (and retry)
    • Log inputs & outputs
    • Monitoring
    • Alarms
  7. Real World(TM): Avoid wrong dependencies

    • Integration points are the number-one killer of systems
    • A subsystem should be as isolated as possible
    • Consider health checks
    • Design and architecture decisions are also financial decisions
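The health-check bullet can be sketched as below. This is a minimal illustration, not the setup used at Tuenti: `check_health` and the probe names are hypothetical, and each probe is assumed to be a cheap callable that returns a truthy value when its dependency is reachable.

```python
def check_health(checks):
    """Run every probe and report per-dependency status.

    `checks` maps a dependency name to a zero-argument callable (hypothetical
    probes, e.g. a ping to a database). A probe that raises counts as down,
    so one broken dependency cannot crash the health endpoint itself.
    """
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    status = "ok" if all(results.values()) else "degraded"
    return {"status": status, "checks": results}
```

Reporting each dependency separately makes it visible which integration point is failing, rather than only that something is.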
  8. Real World(TM): Feature disabling

    • Allow teams to modify system behavior without changing code (no release needed)
    • Do “dark launches” when possible
    • When replacing old code, always keep it until you know the new code works fine
    • Configuration files (overriding & hot reloading)
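A hot-reloaded toggle file could look roughly like this. It is a sketch under assumptions: the JSON file format, the `FeatureFlags` class and the `new_charging` flag name are all invented for illustration and do not come from the talk.

```python
import json
import os


class FeatureFlags:
    """Reads boolean flags from a JSON file and reloads it when the file changes."""

    def __init__(self, path):
        self._path = path
        self._mtime = None
        self._flags = {}

    def _reload_if_changed(self):
        try:
            mtime = os.stat(self._path).st_mtime_ns
        except OSError:
            return  # file unavailable: keep the last known flags
        if mtime == self._mtime:
            return
        try:
            with open(self._path, encoding="utf-8") as fh:
                flags = json.load(fh)
        except (OSError, ValueError):
            return  # a broken config file must not take the service down
        self._flags = flags
        self._mtime = mtime

    def is_enabled(self, name, default=False):
        self._reload_if_changed()
        return bool(self._flags.get(name, default))


def charge(amount, flags, new_impl, old_impl):
    """Use the new code path only while its flag is on; the old path is kept."""
    if flags.is_enabled("new_charging"):  # hypothetical flag name
        return new_impl(amount)
    return old_impl(amount)
```

Editing the file flips the behavior on the next call, with no release, and the old implementation stays available as the fallback.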
  9. Real World(TM): Detect unfinished processes

    • Your business logic is composed of many methods
    • Each one of them can fail for many reasons
    • Depending on the underlying tech, not all of those failures may be catchable
    • How do you detect that something is failing?
  10. Real World(TM): Detect unfinished processes

    • Just detecting differences between events started and events ended tells you something is wrong
    • Integration points without timeouts are a surefire way to create cascading failures
    • Consider fail fast
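Comparing started and ended events can be sketched as follows. `ProcessTracker` is a hypothetical in-memory helper; a real system would more likely derive the same difference from persisted events or logs.

```python
import time


class ProcessTracker:
    """Records start/end events and reports processes that never finished."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the age check can be tested
        self._started = {}

    def start(self, process_id):
        self._started[process_id] = self._clock()

    def end(self, process_id):
        self._started.pop(process_id, None)

    def unfinished(self, older_than_seconds):
        """Ids that started more than `older_than_seconds` ago and never ended."""
        now = self._clock()
        return sorted(
            pid for pid, started_at in self._started.items()
            if now - started_at > older_than_seconds
        )
```

This catches failures that never raise a catchable error: the process simply has a start event without a matching end event.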
  11. Real World(TM): Go async (and retry)

    • Each system is protected from other systems' failures or service degradation
    • Be careful with operations that make changes
    • Don’t retry too quickly
    • Check whether the operation is pending, even if the previous call failed
    • Circuit breaker
    • Limit the number of retries and log them
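The retry and circuit-breaker bullets can be combined as in this sketch. It is hand-rolled for illustration and is not the Hystrix API; the class names, thresholds and delays are assumptions.

```python
import logging
import time

log = logging.getLogger("retry")


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func up to `attempts` times, doubling the delay between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not hammer a dependency the breaker has isolated
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A caller would wrap the remote call as `retry(lambda: breaker.call(remote_call))`. Retrying like this is only safe when the operation is idempotent or when the caller first checks whether the previous attempt is still pending.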
  12. Real World(TM): Log inputs & outputs

    • Log the inputs & outputs of your services and of third parties
    • As humans read (or even just scan) log files for a new system, they are learning what “normal” means for that system
    • Reserve “ERROR” for a serious system problem
    • Don’t leave log files on production systems. Copy them to a staging area for analysis
    • Log file rotation
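One way to log inputs and outputs with rotation, using only the standard `logging` module. The logger name, file size limit and the `log_io` decorator are illustrative assumptions.

```python
import functools
import logging
from logging.handlers import RotatingFileHandler


def make_logger(path, name="service.io"):
    """Logger that rotates at about 10 MB and keeps 5 old files (assumed limits)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=10_000_000, backupCount=5)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def log_io(logger):
    """Decorator that logs a function's arguments and its result or failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.info("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                # WARNING, not ERROR: one failed call is not a system-wide problem.
                logger.warning("fail %s", func.__name__, exc_info=True)
                raise
            logger.info("return %s result=%r", func.__name__, result)
            return result
        return wrapper
    return decorator
```

Applying `@log_io(logger)` to the functions that call third parties records both sides of each exchange.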
  13. Real World(TM): Monitoring

    • A system without transparency cannot survive long in production
    • Good data enables good decision making
    • Logging and monitoring are both good for exposing and understanding the immediate behavior of an application or system
  14. Real World(TM): Monitoring

    • Transparency: historical trending, predictive forecasting, present status, and instantaneous behavior
    • Dashboards
    • Messages should include an identifier that can be used to trace the steps of a transaction
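A transaction identifier can be attached to every log line roughly as follows. This sketch assumes Python's `contextvars` and a logging filter; `handle_request` and the message texts are hypothetical.

```python
import contextvars
import logging
import uuid

# Holds the identifier of the transaction currently being processed.
trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the current trace id onto each record so formatters can print it."""

    def filter(self, record):
        record.trace_id = trace_id.get()
        return True


def handle_request(logger, incoming_trace_id=None):
    """Reuse the caller's id when one is given, otherwise create a new one."""
    token = trace_id.set(incoming_trace_id or uuid.uuid4().hex)
    try:
        logger.info("charging started")
        logger.info("charging finished")
    finally:
        trace_id.reset(token)
```

With the filter on a handler whose format includes `%(trace_id)s`, all lines of one transaction share an id that can be searched across services.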
  15. Real World(TM): Alarms

    • Independent alarm system
    • Log: event happens, event data, error detected, …
    • Priorities: Critical, Error and Warning
    • Predicting the future
    • Document alarms
    • Reporting system
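The three priorities could be mapped to thresholds on a metric as in this sketch. The `evaluate` function and the idea of per-priority limits are assumptions made for illustration, not a description of Cabot or of the alarm system used in the talk.

```python
from enum import IntEnum


class Priority(IntEnum):
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


def evaluate(value, thresholds):
    """Return the highest priority whose limit `value` has reached, else None.

    `thresholds` maps a Priority to the metric value at which it fires,
    for example a count of unfinished processes (hypothetical limits).
    """
    reached = [prio for prio, limit in thresholds.items() if value >= limit]
    return max(reached) if reached else None
```

For example, with limits of 5, 20 and 50 unfinished processes, a reading of 25 yields `Priority.ERROR`.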
  16. Extra Ball: Tools

    • Graphite & Grafana (metrics & monitoring)
    • Cabot (alarms)
    • Elasticsearch, Logstash and Kibana (logging & monitoring)
    • Hystrix (circuit breaker)
  17. Extra Ball: Resources

    • Release It!
    • Reactive Manifesto
    • Feature Toggles
    • Microservices Guide
    • Publish-subscribe pattern