Scale Summit: Microservices and Scale

Microservices can help with scale issues, but *operating* them causes scale issues of its own. Sarah discusses the impact of having 150+ microservices in a system. Spoiler: Everything needs to be automated :)

Sarah Wells

March 24, 2017

Transcript

  1. Microservices and scale
    Sarah Wells
    Principal Engineer, Financial Times
    @sarahjwells

  2. BUT - I’m not here to talk about that

  3. I’m not dealing with this kind of scaling challenge

  4. The FT’s Universal Publishing Platform

  5. Not publishing at huge scale: around 7000 publishes a day

  6. 14 million concepts in our graph database or ~20GB

  7. Around 180,000 API requests an hour

  8. We’re not doing microservices to help with scale

  9. They let us move faster

  10. Deploys to production last year

  11. Deploys to production of the monolith

  12. Releasing nearly 190 times as often

  13. So - why am I talking about scale at all?

  14. *Operating* microservices is a scale challenge

  15. 150+ microservices: we need to automate things

  16. The challenges:
    1. Provisioning and deployment
    2. Monitoring and alerting
    3. Logging
    4. Service documentation

  17. Provisioning and deployment

  18. Provisioning time scale

  19. Provisioning needs to take minutes

  20. Deployment must be (almost entirely) automated

  21. Our old process was very manual…

  22. Setting up new deployment pipelines has to be quick

  23. Need to be able to make global changes to them

  24. Monitoring and alerting can be very noisy

  25. With resilience, we have 568 instances

  26. If we checked each service every minute…

  27. 817,920 checks per day

  28. One service per VM, 20 system checks, running every minute…

  29. 16,358,400 checks per day

  30. “One-in-a-million” issues would hit us 16 times every day

  31. Which is why we don’t have one service per VM…

  32. Running containers on shared VMs reduces this to 92,160 system checks per day

  33. Still a total of 910,080 checks per day
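
The check counts on slides 25 to 33 follow from simple multiplication. The sketch below reproduces the arithmetic using only the figures quoted on the slides. The variable names are illustrative, and the 92,160 figure is taken as given, because the slides do not say how many shared VMs produce it.

```python
# Figures come from the slides; names are illustrative only.
MINUTES_PER_DAY = 24 * 60  # 1,440

instances = 568            # service instances, including resilience
system_checks_per_vm = 20  # system-level checks on each VM

# One service-level check per instance, every minute.
service_checks = instances * MINUTES_PER_DAY

# One service per VM: each instance also brings 20 system checks per minute.
system_checks_one_per_vm = instances * system_checks_per_vm * MINUTES_PER_DAY

# Expected daily hits if each check fails with probability one in a million.
one_in_a_million_hits = system_checks_one_per_vm / 1_000_000

# Containers on shared VMs: the slides state the resulting system-check total.
system_checks_shared = 92_160
total_shared = service_checks + system_checks_shared

print(service_checks)                   # 817920
print(system_checks_one_per_vm)         # 16358400
print(round(one_in_a_million_hits, 1))  # 16.4
print(total_shared)                     # 910080
```

Even with shared VMs the service-level checks dominate the total, which is why reducing the number of VMs alone does not make the alerting quiet.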

  34. Logging

  35. ~50,000 log lines per minute

  36. Service documentation

  37. The service registry… who owns what

  38. Lots of information per service

  39. Our GDPR process meant receiving 150 Google Forms…

  40. How can you solve the operational scale issues?

  41. 1. Provisioning and deployment
    2. Monitoring and alerting
    3. Logging
    4. Service documentation

  42. Provisioning and deployment: automation and tooling

  43. Invest in automation of provisioning

  44. Deployment: move away from Jenkins

  45. To set up deployment for a new service…

  46. 1. Configure CircleCI
    2. Configure Docker Hub
    3. Add service files to a services repo

  47. Not perfect…

  48. Looking at templated pipelines…

  49. Monitoring and alerting: focus on what matters

  50. It’s the business functionality you should care about

  51. Logging: log aggregation and transaction ids

  52. Effective log aggregation needs a way to find all related logs

  53. Transaction ids tie all microservices together
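
One way to picture how a transaction id ties services together: each service reuses the id it receives, passes it on with every downstream call, and stamps it on every log line, so the aggregated logs can be filtered by that one value. The following is a minimal sketch and not the FT's implementation. The header name `X-Request-Id`, the `tid_` prefix, and all function names are assumptions made for illustration.

```python
import logging
import uuid

# Hypothetical header name; the slides do not say which header carries the id.
TRANSACTION_HEADER = "X-Request-Id"


def ensure_transaction_id(headers: dict) -> str:
    """Reuse the caller's transaction id, or create one at the edge of the system.

    The lookup is case-sensitive, which is a simplification for this sketch.
    """
    tid = headers.get(TRANSACTION_HEADER)
    if not tid:
        tid = "tid_" + uuid.uuid4().hex[:10]  # illustrative id format
    return tid


def outgoing_headers(tid: str) -> dict:
    """Headers to attach to every downstream call so the id propagates."""
    return {TRANSACTION_HEADER: tid}


def logger_for(tid: str) -> logging.LoggerAdapter:
    """Logger that attaches the transaction id to every record it emits."""
    return logging.LoggerAdapter(logging.getLogger("service"), {"transaction_id": tid})


logging.basicConfig(
    format="%(levelname)s transaction_id=%(transaction_id)s %(message)s",
    level=logging.INFO,
)

tid = ensure_transaction_id({})  # no incoming id, so one is created
log = logger_for(tid)
log.info("publish received")
log.info("calling downstream service with headers %s", outgoing_headers(tid))
```

Searching the aggregated logs for a single `transaction_id` value then returns the lines from every service that handled the request.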

  54. @sarahjwells

  55. Documentation: standards, templates, automation, tooling

  56. Executable documentation

  57. Healthchecks

  58. The FT healthcheck standard
    GET http://{service}/__health

  59. The FT healthcheck standard
    GET http://{service}/__health
    returns 200 if the service can run the healthcheck

  60. The FT healthcheck standard
    GET http://{service}/__health
    returns 200 if the service can run the healthcheck
    each check will return "ok": true or "ok": false
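
The two rules on this slide can be sketched as a small function: the HTTP status says only that the healthcheck itself ran, while each individual check reports its own `"ok"` value. This is a rough illustration and not the FT's code. Only the `__health` path and the `"ok"` field come from the slides; the `checks` and `name` fields, the function name, and the example checks are assumptions.

```python
import json
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def run_healthcheck(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Build a (status, body) pair for GET /__health.

    The status is 200 whenever the checks could be executed. Failures are
    reported per check through "ok": false, not through the HTTP status.
    """
    results = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises is reported as failing
        results.append({"name": name, "ok": ok})
    return 200, json.dumps({"checks": results})


# Hypothetical checks, purely for illustration.
status, body = run_healthcheck({
    "database reachable": lambda: True,
    "queue reachable": lambda: False,
})
print(status, body)
```

Because the function takes its checks as plain callables, it can be exercised in a unit test with stubbed checks and no network access.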

  61. Healthchecks are unit tested

  62. Keeping information near to the code

  63. Update automatically on deploy

  64. Other teams need to adapt too

  65. Change and release management

  66. 2256 releases = 53 working days doing CRs

  67. Automation, again

  68. GitHub webhook for our CRs

  69. First line support

  70. There are many different technologies for them to understand now

  71. Our development teams don’t know the whole system either…

  72. Operating microservices *is* a challenge

  73. The benefits can be worth it…

  74. Deploys to production last year

  75. Deploys to production of the monolith

  76. But you have to be prepared to pay the cost

  77. Thank you!