Scale Summit: Microservices and Scale

Microservices can help with scale issues, but *operating* them causes scale issues of its own. Sarah discusses the impact of having 150+ microservices in a system. Spoiler: everything needs to be automated :)

Sarah Wells

March 24, 2017

Transcript

  1. Microservices and scale
    Sarah Wells
    Principal Engineer, Financial Times
    @sarahjwells

  2. @sarahjwells
    BUT - I’m not here to talk about that

  3. @sarahjwells
    I’m not dealing with this kind of scaling challenge

  4. The FT’s Universal Publishing Platform

  8. Not publishing at huge scale: around 7000 publishes a day

  10. @sarahjwells
    14 million concepts in our graph database or ~20GB

  12. Around 180,000 API requests an hour

  13. @sarahjwells
    We’re not doing microservices to help with scale

  14. @sarahjwells
    They let us move faster

  15. Deploys to production last year

  16. Deploys to production of the monolith

  17. @sarahjwells
    Releasing nearly 190 times as often

  18. @sarahjwells
    So - why am I talking about scale at all?

  19. @sarahjwells
    *Operating* microservices is a
    scale challenge

  21. @sarahjwells
    150+ microservices: we need to automate things

  22. @sarahjwells
    The challenges:
    1. Provisioning and deployment
    2. Monitoring and alerting
    3. Logging
    4. Service documentation

  23. @sarahjwells
    Provisioning and deployment

  24. Provisioning time scale

  25. @sarahjwells
    Provisioning needs to take minutes

  26. @sarahjwells
    Deployment must be (almost entirely) automated

  27. @sarahjwells
    Our old process was very manual…

  30. @sarahjwells
    Setting up new deployment pipelines has to be quick

  31. @sarahjwells
    Need to be able to make global changes to them

  32. @sarahjwells
    Monitoring and alerting can be very noisy

  33. @sarahjwells
    With resilience, we have 568 instances

  34. @sarahjwells
    If we checked each service every minute…

  35. @sarahjwells
    817,920 checks per day

  36. @sarahjwells
    One service per VM, 20 system checks, running every
    minute…

  37. @sarahjwells
    16,358,400 checks per day

  38. @sarahjwells
    “One-in-a-million” issues would hit us 16 times every
    day

  39. @sarahjwells
    Which is why we don’t have one service per VM…

  40. @sarahjwells
    Running containers on shared VMs reduces this to
    92,160 system checks per day

  41. @sarahjwells
    Still a total of 910,080 checks per day
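The check counts on the preceding slides follow from simple multiplication. A minimal sketch that reproduces them, using only figures quoted in the talk; the variable names are illustrative, and the per-minute rate for shared VMs is derived here from the slide's daily total rather than stated in the talk:

```python
MINUTES_PER_DAY = 24 * 60  # 1,440

instances = 568            # service instances, from the slides
system_checks_per_vm = 20  # system checks per VM, from the slides

# One healthcheck per service instance, run every minute.
service_checks = instances * MINUTES_PER_DAY

# One service per VM: each instance also carries 20 system checks per minute.
system_checks_one_per_vm = instances * system_checks_per_vm * MINUTES_PER_DAY

# How often a "one-in-a-million" failure would show up per day at that volume.
one_in_a_million_hits = system_checks_one_per_vm / 1_000_000

# Containers on shared VMs: the slide gives the daily total directly.
system_checks_shared = 92_160
shared_checks_per_minute = system_checks_shared // MINUTES_PER_DAY  # derived

# Service checks are unchanged; only the system checks shrink.
total_with_shared_vms = service_checks + system_checks_shared

print(f"service checks per day:            {service_checks:,}")
print(f"system checks, one service per VM: {system_checks_one_per_vm:,}")
print(f"one-in-a-million hits per day:     {one_in_a_million_hits:.1f}")
print(f"system checks, shared VMs:         {system_checks_shared:,}")
print(f"total with shared VMs:             {total_with_shared_vms:,}")
```

Moving to shared VMs cuts the system checks sharply, but the per-service checks still dominate the total, which is why the talk goes on to argue for monitoring business functionality rather than every individual check.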

  42. @sarahjwells
    Logging

  43. @sarahjwells
    ~50,000 log lines per minute

  44. @sarahjwells
    Service documentation

  45. @sarahjwells
    The service registry… who owns what

  46. @sarahjwells
    Lots of information per service

  50. @sarahjwells
    Our GDPR process meant receiving 150 google
    forms…

  51. @sarahjwells
    How can you solve the
    operational scale issues?

  52. @sarahjwells
    1. Provisioning and deployment
    2. Monitoring and alerting
    3. Logging
    4. Service documentation

  53. @sarahjwells
    Provisioning and deployment: automation and
    tooling

  54. @sarahjwells
    Invest in automation of provisioning

  56. @sarahjwells
    Deployment: move away from Jenkins

  58. @sarahjwells
    To set up deployment for a new service…

  59. @sarahjwells
    1. Configure CircleCI
    2. Configure Docker hub
    3. Add service files to a services repo

  60. @sarahjwells
    Not perfect…

  61. @sarahjwells
    Looking at templated pipelines…

  62. @sarahjwells
    Monitoring and alerting: focus on what matters

  63. @sarahjwells
    It’s the business functionality you should care about

  65. @sarahjwells
    Logging: log aggregation and transaction ids

  67. @sarahjwells
    Effective log aggregation needs a way to find all
    related logs

  68. Transaction ids tie all microservices together
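One way transaction ids can tie logs together is for each service to reuse the id it receives, pass it on with every downstream call, and include it in every log line. The sketch below illustrates that idea only; the header name `X-Request-Id`, the `tid_` prefix, and all function names are assumptions for illustration, not details taken from the talk:

```python
import logging
import uuid

# Hypothetical header name; the talk does not say which header is used.
TID_HEADER = "X-Request-Id"


def get_or_create_tid(headers: dict) -> str:
    """Reuse the caller's transaction id, or mint one at the edge of the system."""
    tid = headers.get(TID_HEADER)
    if not tid:
        tid = "tid_" + uuid.uuid4().hex[:10]  # assumed id format
    return tid


def outgoing_headers(tid: str) -> dict:
    """Headers to attach to every downstream call so the id keeps travelling."""
    return {TID_HEADER: tid}


def logger_for(service: str, tid: str) -> logging.LoggerAdapter:
    """A logger that stamps the transaction id onto every record it emits."""
    return logging.LoggerAdapter(logging.getLogger(service), {"transaction_id": tid})


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(name)s transaction_id=%(transaction_id)s %(message)s",
    )
    # First service: no incoming id, so one is created.
    tid = get_or_create_tid({})
    logger_for("ingester", tid).info("received publish request")
    # Second service: receives the id in the forwarded headers and reuses it.
    downstream_tid = get_or_create_tid(outgoing_headers(tid))
    logger_for("transformer", downstream_tid).info("transformed content")
```

With both services logging the same id, a single search for that id in the log aggregator returns every line related to the request.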

  70. @sarahjwells
    Documentation: standards, templates,
    automation, tooling

  71. @sarahjwells
    Executable documentation

  72. @sarahjwells
    Healthchecks

  73. The FT healthcheck standard
    GET http://{service}/__health

  74. The FT healthcheck standard
    GET http://{service}/__health
    returns 200 if the service can run the healthcheck

  75. The FT healthcheck standard
    GET http://{service}/__health
    returns 200 if the service can run the healthcheck
    each check will return "ok": true or "ok": false
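The two rules on this slide can be sketched as a small function: the endpoint answers 200 whenever the healthcheck itself could run, and each individual check reports `"ok": true` or `"ok": false`. This is a minimal illustration under assumptions; the `checks` and `name` fields and the function names are hypothetical, since the slides only show the `"ok"` field:

```python
import json
from typing import Callable, Dict, Tuple


def run_healthcheck(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Build the response for GET /__health as (status code, JSON body)."""
    results = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that blows up is reported as failing, not as an HTTP error.
            ok = False
        results.append({"name": name, "ok": ok})
    # 200 means "the healthcheck ran", even if some checks are not ok.
    return 200, json.dumps({"checks": results})


def database_reachable() -> bool:
    """Stand-in for a real dependency check; always fails in this sketch."""
    raise ConnectionError("database is down")


if __name__ == "__main__":
    status, body = run_healthcheck(
        {"disk space": lambda: True, "database": database_reachable}
    )
    print(status, body)
```

Because the status code stays 200, a monitoring tool has to read the body to tell a healthy service from a degraded one; a non-200 response then means the healthcheck itself could not run.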

  78. @sarahjwells
    Healthchecks are unit tested

  79. @sarahjwells
    Keeping information near to the code

  80. @sarahjwells
    Update automatically on deploy

  81. @sarahjwells
    Other teams need to adapt too

  82. @sarahjwells
    Change and release management

  83. @sarahjwells
    2256 releases = 53 working days doing CRs
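The slide gives the totals but not the time spent on each change request (CR). A rough sketch that works backwards from the two quoted figures; the length of a working day is an assumption, so two plausible values are tried:

```python
RELEASES = 2256    # releases, from the slide
WORKING_DAYS = 53  # working days spent on CRs, from the slide


def implied_minutes_per_cr(hours_per_day: float) -> float:
    """Minutes per change request implied by the slide's totals."""
    total_minutes = WORKING_DAYS * hours_per_day * 60
    return total_minutes / RELEASES


# Assumed working-day lengths; the talk does not state one.
for hours in (7, 8):
    print(f"{hours}h day: {implied_minutes_per_cr(hours):.1f} minutes per CR")
```

Under either assumption each CR costs on the order of ten minutes, which is small for one release and substantial across thousands.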

  84. @sarahjwells
    Automation, again

  87. @sarahjwells
    Github web hook for our CRs

  89. @sarahjwells
    First line support

  90. @sarahjwells
    There are many different technologies for them to
    understand now

  91. @sarahjwells
    Our development teams don’t know the whole
    system either…

  93. @sarahjwells
    Operating microservices *is* a
    challenge

  94. @sarahjwells
    The benefits can be worth it…

  95. Deploys to production last year

  96. Deploys to production of the monolith

  97. @sarahjwells
    But you have to be prepared to pay the cost

  98. @sarahjwells
    Thank you!
