Plan for Success

What happens to your system when you get lucky and become successful?

Monica Giambitto

May 13, 2020

Transcript

  1. PLAN FOR SUCCESS

  2. RIGHT PLACE

  3. RIGHT TIME
    27TH JANUARY - First COVID-19 case in Germany

    24TH FEBRUARY - Pandemic plan activated

    9TH MARCH - First COVID-19 death in Germany

    21ST MARCH - Official lockdown called in Bavaria, more Länder to follow

  4. REGULAR MONTH ~10K REGISTRATIONS / DAY

  5. REGULAR MONTH ~25K TRAINING USERS / DAY

  6. PEAK MONTH ~20K REGISTRATIONS / DAY

  7. PEAK MONTH ~40K TRAINING USERS / DAY

  9. ~50K REGISTRATIONS / DAY
    5X A REGULAR MONTH (+400%)

  10. ~50K REGISTRATIONS / DAY
    2.5X A PEAK MONTH (+150%)

  11. 4X A REGULAR MONTH (+300%)
    ~100K TRAINING USERS / DAY

  12. 2.5X A PEAK MONTH (+150%)
    ~100K TRAINING USERS / DAY
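
The volumes on slides 4 to 12 can be compared both as a multiple of each baseline and as a percentage increase. The two are easy to mix up, since 5x a baseline is an increase of 400%. The sketch below only reuses the approximate figures from the slides; the dictionary and function names are illustrative, not from the talk.

```python
# Approximate daily volumes as stated on slides 4-7 (baselines) and 9-12 (new volume).
BASELINES = {
    "registrations": {"regular month": 10_000, "peak month": 20_000},
    "training users": {"regular month": 25_000, "peak month": 40_000},
}
NEW_VOLUME = {"registrations": 50_000, "training users": 100_000}


def multiple(new: int, baseline: int) -> float:
    """How many times the baseline the new volume is (5.0 means 5x)."""
    return new / baseline


def percent_increase(new: int, baseline: int) -> float:
    """Growth over the baseline in percent (5x the baseline is +400%)."""
    return (new - baseline) / baseline * 100


if __name__ == "__main__":
    for metric, baselines in BASELINES.items():
        for label, baseline in baselines.items():
            new = NEW_VOLUME[metric]
            print(
                f"{metric} vs {label}: {multiple(new, baseline):.1f}x, "
                f"{percent_increase(new, baseline):+.0f}%"
            )
```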

  14. DNS HELL
    PROBLEM #1

  15. DNS HELL
    PROBLEM #1

    • KubeDNS doesn’t cache name
      resolution on internal calls ->
      moved to CoreDNS, which does

    • Short DNS names for internal
      calls: we used bodyweight.api
      instead of
      bodyweight.api.svc.cluster.local,
      requiring 2 DNS resolution
      requests for each internal call

    • Sidekiqs and internal calls

    • KubeDNS asks AWS for internal
      calls as well, so we used up our
      quota for external DNS requests
      very fast. A high error rate on
      anything is likely to pile up
      Sidekiq delayed jobs, and those
      jobs make DNS calls too, so we
      increased the pressure here.
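
The extra lookups for short names can be illustrated with a minimal sketch of how a resolv.conf-style stub resolver expands a name through its search list. The slides do not show the pods' resolver settings, so the namespace `web`, the search list, and `ndots=5` (the usual Kubernetes default) are assumptions, and both helper functions are hypothetical names written for this example.

```python
# Assumed resolver settings for a pod in a hypothetical namespace "web".
SEARCH = ["web.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NDOTS = 5

# The only record assumed to exist in the cluster DNS for this example.
EXISTING = {"bodyweight.api.svc.cluster.local."}


def candidate_queries(name: str, search: list[str], ndots: int) -> list[str]:
    """Return the names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: the search list is skipped.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = [f"{name}."]
    # Names with fewer dots than ndots go through the search list first.
    if name.count(".") < ndots:
        return expanded + as_is
    return as_is + expanded


def lookups_until_hit(name: str, search: list[str], ndots: int, existing: set[str]):
    """Count the queries sent until one candidate matches an existing record."""
    for count, candidate in enumerate(candidate_queries(name, search, ndots), 1):
        if candidate in existing:
            return count
    return None


if __name__ == "__main__":
    # Short name: first tries the pod's own namespace, then finds the service.
    print(lookups_until_hit("bodyweight.api", SEARCH, NDOTS, EXISTING))
    # Absolute name with a trailing dot: resolved on the first query.
    print(lookups_until_hit("bodyweight.api.svc.cluster.local.", SEARCH, NDOTS, EXISTING))
```

Under these assumed settings the short name costs two queries and the absolute name one. The trailing dot matters in this model: without it, the long name still has fewer dots than `ndots` and would be run through the search list first.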

  16. AUTOSCALER MAX TOO SMALL
    PROBLEM #2
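
A rough sketch of why a low autoscaler maximum hurts under a traffic spike. The slides do not say which autoscaler was involved or how it was configured, so this assumes a Kubernetes HPA-style rule (desired replicas = ceil(current replicas x current metric / target metric)), and every number below is hypothetical.

```python
import math


def desired_replicas(current: int, metric: float, target: float) -> int:
    """HPA-style scaling rule: ceil(current * metric / target)."""
    return math.ceil(current * metric / target)


def clamp(replicas: int, minimum: int, maximum: int) -> int:
    """Apply the autoscaler's configured bounds to a replica count."""
    return max(minimum, min(replicas, maximum))


if __name__ == "__main__":
    # Hypothetical numbers: 10 pods sized for 100 req/s each, load jumps to 250 req/s per pod.
    wanted = desired_replicas(current=10, metric=250.0, target=100.0)
    # Hypothetical bounds that were sized for an ordinary peak.
    granted = clamp(wanted, minimum=2, maximum=15)
    print(f"wanted={wanted} granted={granted} shortfall={wanted - granted}")
```

With a maximum sized for an ordinary peak, the autoscaler stops adding capacity well before the load is covered, and the remaining pods stay overloaded.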

  17. PROBLEM #3
    CIRCULAR
    DEPENDENCIES

  18. PROBLEM #4
    SYNCHRONOUS CALLS

  19. INCIDENT MANAGEMENT

  20. • Task Force

    • Calendar & Meetings

    • Confluence

    • Miro Board

    • Tools

    • NR

    • Grafana

    • AWS Dashboard

    • Statuspage

    • Slack

    • Hangout

  21. countermeasures

  22. some things you’ll never figure out

  23. note down everything as it happens

  24. interdisciplinary task force

  25. communicate towards company

  26. use the momentum

  27. feedback

  28. Thank you
    [email protected]
    @KFMOLLI - TWITTER
    @NIRNAETH - GITHUB
