Defensive Programming & Resilient systems in Real World (TM)

Tuenti
November 18, 2016


Transcript

  1. Defensive programming & resilient systems
    Don’t trust anyone, not even yourself
    A “not just testing” talk
    [email protected]
    @kinisoftware

  2. Self-promotion

  3. Why this talk?
    • 100% test coverage (as you told us, Kini)
    • Code reviews
    • Manual testing
    • So, after release, my job is done
    Right? 


  4. Why this talk?
    No, it isn’t


  5. Why this talk?
    Inside every enterprise today is a mesh of interconnected, interdependent systems.
    They cannot—must not—allow bugs to cause a chain of failures.
    Bugs will happen. They cannot be eliminated, so they must be survived instead.
    Production is the only place to learn how the software will respond to real-world stimuli.
    Release 1.0 is the beginning of your software’s life, not the end of the project.

  6. Why this talk?
    • Early detection
    • Reduce impact to customers
    • Know why an issue happened
    • Don’t depend on somebody looking at an error log, a daily email, ...
    • Prevent different conditions in dev/testing & production

  7. Glossary
    Defensive Programming is a form of defensive design intended to ensure the
    continuing function of a piece of software in spite of unforeseeable usage of
    said software. The idea can be viewed as reducing or eliminating the prospect
    of Murphy's Law having effect.
    A resilient system stays responsive in the face of failure; any system that is
    not resilient will be unresponsive after a failure. Resilience is achieved by
    replication, containment, isolation and delegation.

  8. State of the art @Tuenti

  9. State of the art @Tuenti

  10. State of the art @Tuenti

  11. State of the art @Tuenti
    Service Api - Monolith PHP
    ChargingApi ProvisioningApi SubscriptionsApi EventHistoryApi
    µServices - Java
    Charging Provisioning Subscriptions Notifications
    BSS GW
    Providers WS Notifications / Files
    Mobile Apps Web Admin Tools

  12. Real World(TM)
    • Avoid wrong dependencies
    • Feature disabling
    • Detect unfinished processes
    • Go async (and retry)
    • Log inputs & outputs
    • Monitoring
    • Alarms

  13. Real World(TM)
    Avoid wrong dependencies
    • Integration points are the number-one killer of systems.
    • A subsystem should be as isolated as possible
    • Consider health checks
    • Design and architecture decisions are also financial decisions.
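    The "consider health checks" bullet could look roughly like the sketch below. It is a minimal illustration, not Tuenti's implementation: the function names, the dependency names ("database", "provider_ws") and the idea of marking only some dependencies as critical are assumptions made for the example.

```python
from typing import Callable, Dict, Set


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run every dependency probe and report 'up' or 'down' per dependency."""
    report = {}
    for name, probe in checks.items():
        try:
            report[name] = "up" if probe() else "down"
        except Exception:
            # A probe that raises counts as a failed dependency, not a crash.
            report[name] = "down"
    return report


def is_healthy(report: Dict[str, str], critical: Set[str]) -> bool:
    """The service is healthy only if every critical dependency is up."""
    return all(report.get(name) == "up" for name in critical)


def _broken_probe() -> bool:
    # Hypothetical probe standing in for an unreachable third party.
    raise ConnectionError("provider unreachable")


report = run_health_checks({
    "database": lambda: True,
    "provider_ws": _broken_probe,
})
```

    Treating only some dependencies as critical keeps an optional integration from taking the whole service out of rotation.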

  14. Real World(TM)
    Feature Disabling
    • Allow teams to modify system behavior without
    changing code (no release needed)
    • Do “dark launches” when possible
    • When replacing old code, always keep it until you know
    the new one works fine
    • Configuration files (overriding & hot reloading)
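    A hot-reloaded configuration file is one way to disable a feature without a release. The sketch below assumes a JSON file of boolean flags; the class name, the flag name "new_charging_flow" and the charge function are hypothetical, chosen only to show the old code path staying in place behind the flag.

```python
import json
import os
import tempfile


class FeatureFlags:
    """Reads flags from a JSON file and reloads it when the file changes."""

    def __init__(self, path, defaults=None):
        self._path = path
        self._defaults = dict(defaults or {})
        self._flags = {}
        self._stamp = None

    def _reload_if_changed(self):
        try:
            stat = os.stat(self._path)
            stamp = (stat.st_mtime_ns, stat.st_size)
            if stamp != self._stamp:
                with open(self._path, encoding="utf-8") as handle:
                    self._flags = json.load(handle)
                self._stamp = stamp
        except (OSError, ValueError):
            # Missing or malformed file: keep the last known good flags.
            pass

    def is_enabled(self, name):
        self._reload_if_changed()
        return bool(self._flags.get(name, self._defaults.get(name, False)))


flag_file = os.path.join(tempfile.mkdtemp(), "flags.json")
flags = FeatureFlags(flag_file)


def charge(amount):
    if flags.is_enabled("new_charging_flow"):
        return "new:%d" % amount
    return "old:%d" % amount  # old path kept until the new one is trusted


with open(flag_file, "w", encoding="utf-8") as handle:
    json.dump({"new_charging_flow": False}, handle)
before = charge(5)

# Flipping the flag in the file changes behavior with no release or restart.
with open(flag_file, "w", encoding="utf-8") as handle:
    json.dump({"new_charging_flow": True}, handle)
after = charge(5)
```

    Falling back to the last known good flags when the file is unreadable means a bad edit to the configuration does not itself become an outage.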

  15. Real World(TM)
    Detect unfinished processes
    • Your business logic is composed of many methods
    • Each one of them can fail for many reasons
    • Depending on the underlying tech, not all of them may
    be catchable
    • How do you detect something is failing?

  16. Real World(TM)
    Detect unfinished processes
    • Just detecting differences between events started and
    ended tells you something is wrong
    • Integration points without timeouts are a surefire way to
    create cascading failures
    • Consider fail fast
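    Comparing started and ended events can be sketched as below. This is an illustration under assumptions: the ProcessTracker class, the 30-second deadline and the process ids are invented for the example, and a real system would likely persist these events rather than keep them in memory.

```python
import time


class ProcessTracker:
    """Records start/end events and reports processes that never ended."""

    def __init__(self, deadline_seconds, clock=time.monotonic):
        self._deadline = deadline_seconds
        self._clock = clock  # injectable so the example can fake time
        self._started = {}
        self.started_count = 0
        self.ended_count = 0

    def start(self, process_id):
        self._started[process_id] = self._clock()
        self.started_count += 1

    def end(self, process_id):
        if self._started.pop(process_id, None) is not None:
            self.ended_count += 1

    def unfinished(self):
        """Ids of processes started more than `deadline_seconds` ago."""
        now = self._clock()
        return sorted(
            pid for pid, began in self._started.items()
            if now - began > self._deadline
        )


fake_now = [0.0]
tracker = ProcessTracker(deadline_seconds=30, clock=lambda: fake_now[0])
tracker.start("charge-1")
tracker.start("charge-2")
tracker.end("charge-1")
fake_now[0] = 45.0  # 45 seconds later, charge-2 still has not ended
stuck = tracker.unfinished()
```

    This catches failures that never raise a catchable error: the process simply starts and is never seen finishing.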

  17. Real World(TM)
    Go async (and retry)
    • Each system is protected from other systems’ failures or
    service degradation
    • Be careful with operations that make changes
    • Don’t make requests too quickly
    • Check if the operation is pending, even if the previous call failed
    • Circuit Breaker
    • Limit the number of retries & Log them
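    The last two bullets can be combined in a small sketch: a bounded retry with exponential backoff that logs each failed attempt, wrapped around a circuit breaker. This is a hand-rolled stand-in for illustration, not the API of Hystrix or any other library; all names, thresholds and delays are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset window elapsed: allow a single trial call (half-open).
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep, log=print):
    """Retry with exponential backoff, a retry limit, and a log per failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except CircuitOpenError:
            raise  # do not hammer a dependency the breaker has given up on
        except Exception as error:
            log("attempt %d/%d failed: %s" % (attempt, attempts, error))
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


calls = []


def flaky():
    # Hypothetical provider call that times out twice, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("provider timed out")
    return "ok"


delays, messages = [], []
breaker = CircuitBreaker(max_failures=5)
result = retry(lambda: breaker.call(flaky), attempts=3,
               sleep=delays.append, log=messages.append)
```

    Blind retries like this are only safe for operations that can be repeated without side effects; for operations that make changes, checking whether the previous attempt is still pending comes first.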

  18. Real World(TM)
    Log inputs & outputs
    If you don’t log…

  19. Real World(TM)
    Log inputs & outputs
    If you don’t log…

  20. Real World(TM)
    Log inputs & outputs
    • Log service & third-party inputs & outputs
    • As humans read (or even just scan) log files for a new system, they
    are learning what “normal” means for that system
    • Reserve “ERROR” for a serious system problem
    • Don’t leave log files on production systems. Copy them to a
    staging area for analysis
    • Log file rotation
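    One way to log inputs and outputs with rotation is a decorator on top of the standard library's RotatingFileHandler, sketched below. The logger name, file size limit, backup count and the get_balance function are assumptions for the example. A failed call is logged at WARNING, in keeping with reserving ERROR for serious system problems.

```python
import functools
import logging
import logging.handlers
import os
import tempfile


def build_logger(path):
    """Create a logger that writes to a size-rotated file."""
    logger = logging.getLogger("io_audit")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=1_000_000, backupCount=5, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_io(logger):
    """Decorator that logs a function's arguments and its result."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.info("%s input args=%r kwargs=%r",
                        func.__name__, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                logger.warning("%s raised", func.__name__, exc_info=True)
                raise
            logger.info("%s output %r", func.__name__, result)
            return result
        return wrapper
    return decorator


log_path = os.path.join(tempfile.mkdtemp(), "service.log")
audit = build_logger(log_path)


@log_io(audit)
def get_balance(subscriber_id):
    # Hypothetical service call used only to produce log lines.
    return {"subscriber_id": subscriber_id, "balance": 10}


get_balance(42)
for handler in audit.handlers:
    handler.flush()
with open(log_path, encoding="utf-8") as handle:
    log_lines = handle.read().splitlines()
```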

  21. Real World(TM)
    Monitoring
    • A system without transparency cannot survive long in
    production
    • Good data enables good decision making
    • Logging and monitoring are both good for exposing
    and understanding the immediate behavior of an
    application or system

  22. Real World(TM)
    Monitoring
    • Transparency: historical trending, predictive forecasting,
    present status, and instantaneous behavior
    • Dashboards
    • Messages should include an identifier that can be used
    to trace the steps of a transaction
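    The transaction identifier from the last bullet can be attached to every log message automatically. The sketch below uses a context variable and a logging filter from the Python standard library; the variable name, the message format and the handle_request function are assumptions for illustration.

```python
import contextvars
import io
import logging
import uuid

# Holds the id of the transaction currently being processed.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionFilter(logging.Filter):
    """Copies the current transaction id onto every log record."""

    def filter(self, record):
        record.transaction_id = transaction_id.get()
        return True


stream = io.StringIO()  # stands in for a real log destination
handler = logging.StreamHandler(stream)
handler.addFilter(TransactionFilter())
handler.setFormatter(logging.Formatter("[%(transaction_id)s] %(message)s"))
tracer = logging.getLogger("tracing_demo")
tracer.setLevel(logging.INFO)
tracer.propagate = False
tracer.addHandler(handler)


def handle_request(subscriber_id, request_id=None):
    # Reuse the caller's id when one arrives, otherwise mint a new one.
    token = transaction_id.set(request_id or uuid.uuid4().hex)
    try:
        tracer.info("charging subscriber %s", subscriber_id)
        tracer.info("notifying subscriber %s", subscriber_id)
    finally:
        transaction_id.reset(token)


handle_request(42, request_id="tx-123")
traced_lines = stream.getvalue().splitlines()
```

    Passing the same identifier on to downstream services lets one search reconstruct the whole transaction across systems.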

  23. Real World(TM)
    Alarms
    • Independent alarm system
    • Log: event happens, event data, error detected, …
    • Priorities: Critical, Error and Warning
    • Predicting the future
    • Document alarms
    • Reporting system
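    The three priorities and the "predicting the future" bullet could be sketched as threshold checks plus a simple linear projection. Everything here is an assumption for illustration: the threshold values, the metric being watched and the straight-line extrapolation are not taken from the talk.

```python
from enum import IntEnum


class Priority(IntEnum):
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


def evaluate(value, thresholds):
    """Return the highest priority whose threshold the value reaches."""
    reached = [priority for priority, limit in thresholds.items()
               if value >= limit]
    return max(reached) if reached else None


def seconds_until(limit, samples):
    """Project when a rising metric hits `limit`.

    Uses a straight line through the first and last (timestamp, value)
    samples. Returns None when the metric is flat or falling.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0 or v1 <= v0:
        return None
    if v1 >= limit:
        return 0.0
    return (limit - v1) * (t1 - t0) / (v1 - v0)


# Hypothetical thresholds for a count of unfinished charging processes.
thresholds = {Priority.WARNING: 10, Priority.ERROR: 50, Priority.CRITICAL: 200}
current = evaluate(75, thresholds)
quiet = evaluate(3, thresholds)

# Disk usage went from 40% to 70% in 600 seconds; when does it reach 100%?
eta = seconds_until(100, [(0, 40), (600, 70)])
```

    Alarming on the projected value rather than the current one gives people time to act before the limit is actually reached.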

  24. Extra Ball
    • Tools:
    • Graphite & Grafana (metrics & monitoring)
    • Cabot (alarms)
    • Elasticsearch, Logstash and Kibana (logging &
    monitoring)
    • Hystrix (circuit breaker)

  25. Extra Ball
    • Resources:
    • Release it!
    • Reactive Manifesto
    • Feature Toggles
    • Microservices Guide
    • Publish-Subscribe pattern
