Scale Summit: Microservices and Scale

Microservices can help with scale issues, but *operating* them *causes* scale issues of its own. Sarah discusses the impact of having 150+ microservices in a system. Spoiler: everything needs to be automated :)

Sarah Wells

March 24, 2017

Transcript

  1. Microservices and scale. Sarah Wells, Principal Engineer, Financial Times. @sarahjwells

  2. @sarahjwells BUT - I’m not here to talk about that

  3. @sarahjwells I’m not dealing with this kind of scaling challenge

  4. The FT’s Universal Publishing Platform

  5. None
  6. 1

  7. 1 2

  8. Not publishing at huge scale: around 7000 publishes a day

  9. 1 2 3

  10. @sarahjwells 14 million concepts in our graph database or ~20GB

  11. 1 2 3 4

  12. Around 180,000 API requests an hour

  13. @sarahjwells We’re not doing microservices to help with scale

  14. @sarahjwells They let us move faster

  15. Deploys to production last year

  16. Deploys to production of the monolith

  17. @sarahjwells Releasing nearly 190 times as often

  18. @sarahjwells So - why am I talking about scale at all?

  19. @sarahjwells *Operating* microservices is a scale challenge

  20. None
  21. @sarahjwells 150+ microservices: we need to automate things

  22. @sarahjwells The challenges: 1. Provisioning and deployment 2. Monitoring and alerting 3. Logging 4. Service documentation

  23. @sarahjwells Provisioning and deployment

  24. Provisioning time scale

  25. @sarahjwells Provisioning needs to take minutes

  26. @sarahjwells Deployment must be (almost entirely) automated

  27. @sarahjwells Our old process was very manual…

  28. None
  29. None
  30. @sarahjwells Setting up new deployment pipelines has to be quick

  31. @sarahjwells Need to be able to make global changes to them

  32. @sarahjwells Monitoring and alerting can be very noisy

  33. @sarahjwells With resilience, we have 568 instances

  34. @sarahjwells If we checked each service every minute…

  35. @sarahjwells 817,920 checks per day

  36. @sarahjwells One service per VM, 20 system checks, running every minute…

  37. @sarahjwells 16,358,400 checks per day

  38. @sarahjwells “One-in-a-million” issues would hit us 16 times every day

  39. @sarahjwells Which is why we don’t have one service per VM…

  40. @sarahjwells Running containers on shared VMs reduces this to 92,160 system checks per day

  41. @sarahjwells Still a total of 910,080 checks per day
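The check counts on the monitoring slides can be reproduced with a few lines of arithmetic. This is a minimal sketch: the instance count, the 20 system checks per VM, and the 92,160 figure come from the slides, while the variable names are only illustrative.

```python
MINUTES_PER_DAY = 24 * 60  # one check per minute means 1,440 checks per day

instances = 568             # service instances, including resilience copies
system_checks_per_vm = 20   # system checks per VM

# One healthcheck per service instance per minute.
service_checks = instances * MINUTES_PER_DAY

# One service per VM: every instance also carries 20 system checks.
one_vm_per_service_checks = instances * system_checks_per_vm * MINUTES_PER_DAY

# Shared VMs: the slides quote 92,160 system checks per day,
# which is 64 system checks per minute across the estate.
shared_vm_system_checks = 92_160
shared_vm_checks_per_minute = shared_vm_system_checks // MINUTES_PER_DAY

# Service checks plus the reduced set of system checks.
total_checks = service_checks + shared_vm_system_checks

# Expected daily occurrences of a "one-in-a-million" failure
# in the one-service-per-VM setup.
one_in_a_million_hits = one_vm_per_service_checks / 1_000_000

print(service_checks, one_vm_per_service_checks, total_checks)
print(shared_vm_checks_per_minute, one_in_a_million_hits)
```

Even with containers on shared VMs, the total stays close to a million checks a day, which is why noise in alerting becomes the dominant problem.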

  42. @sarahjwells Logging

  43. @sarahjwells ~50,000 log lines per minute

  44. @sarahjwells Service documentation

  45. @sarahjwells The service registry… who owns what

  46. @sarahjwells Lots of information per service

  47. None
  48. None
  49. None
  50. @sarahjwells Our GDPR process meant receiving 150 Google Forms…

  51. @sarahjwells How can you solve the operational scale issues?

  52. @sarahjwells 1. Provisioning and deployment 2. Monitoring and alerting 3. Logging 4. Service documentation

  53. @sarahjwells Provisioning and deployment: automation and tooling

  54. @sarahjwells Invest in automation of provisioning

  55. None
  56. @sarahjwells Deployment: move away from Jenkins

  57. None
  58. @sarahjwells To set up deployment for a new service…

  59. @sarahjwells 1. Configure CircleCI 2. Configure Docker Hub 3. Add service files to a services repo

  60. @sarahjwells Not perfect…

  61. @sarahjwells Looking at templated pipelines…

  62. @sarahjwells Monitoring and alerting: focus on what matters

  63. @sarahjwells It’s the business functionality you should care about

  64. None
  65. @sarahjwells Logging: log aggregation and transaction ids

  66. None
  67. @sarahjwells Effective log aggregation needs a way to find all related logs

  68. Transaction ids tie all microservices together
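One way to picture how a transaction id ties services together: each service reuses the id it receives, stamps it on every log line, and forwards it on outgoing calls. The sketch below uses only the Python standard library. The header name `X-Request-Id`, the log format, and the service names are assumptions for illustration, not details taken from the talk.

```python
import io
import logging
import uuid

TID_HEADER = "X-Request-Id"  # hypothetical header name, not from the slides


def ensure_transaction_id(headers: dict) -> str:
    """Reuse the caller's transaction id, or mint one at the edge of the system."""
    return headers.get(TID_HEADER) or uuid.uuid4().hex


def outgoing_headers(tid: str) -> dict:
    """Headers to attach to any downstream call so the id travels with the request."""
    return {TID_HEADER: tid}


def logger_for(service: str, tid: str) -> logging.LoggerAdapter:
    """A logger that adds the transaction id to every record it emits."""
    return logging.LoggerAdapter(logging.getLogger(service), {"transaction_id": tid})


def configure_logging(stream) -> None:
    """Send all logs to one stream, standing in for a log aggregator."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s transaction_id=%(transaction_id)s %(message)s")
    )
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)


aggregated = io.StringIO()
configure_logging(aggregated)

# The first service sees no id on the incoming request and creates one.
tid = ensure_transaction_id({})
logger_for("publisher", tid).info("content received")

# A downstream service receives the forwarded header and reuses the same id.
downstream_tid = ensure_transaction_id(outgoing_headers(tid))
logger_for("indexer", downstream_tid).info("content indexed")

# Searching the aggregated logs for one id returns lines from both services.
related = [
    line for line in aggregated.getvalue().splitlines()
    if f"transaction_id={tid}" in line
]
print("\n".join(related))
```

With the id on every line, a single search in the log aggregation tool returns the full path of one request across all the services it touched.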

  69. @sarahjwells

  70. @sarahjwells Documentation: standards, templates, automation, tooling

  71. @sarahjwells Executable documentation

  72. @sarahjwells Healthchecks

  73. The FT healthcheck standard: GET http://{service}/__health

  74. The FT healthcheck standard: GET http://{service}/__health returns 200 if the service can run the healthcheck

  75. The FT healthcheck standard: GET http://{service}/__health returns 200 if the service can run the healthcheck; each check will return "ok": true or "ok": false

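The shape of such a healthcheck can be sketched as a function that would sit behind `GET /__health`. Only the 200 status and the per-check `"ok"` flag come from the slides. The `checks` and `name` fields, the function names, and the example dependencies are assumptions for illustration.

```python
import json
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def run_healthcheck(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Run every check and report each result with an "ok" flag."""
    results = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a failing dependency is reported, not raised
        results.append({"name": name, "ok": ok})
    # 200 means "the healthcheck itself ran", not "every check passed".
    return 200, json.dumps({"checks": results})


def database_reachable() -> bool:
    return True  # hypothetical check that succeeds


def queue_reachable() -> bool:
    raise ConnectionError("queue unavailable")  # hypothetical check that fails


status, body = run_healthcheck(
    {"database": database_reachable, "queue": queue_reachable}
)
print(status, body)
```

Keeping the check logic in a plain function like this is also what makes it straightforward to unit test without starting an HTTP server.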
  76. None
  77. None
  78. @sarahjwells Healthchecks are unit tested

  79. @sarahjwells Keeping information near to the code

  80. @sarahjwells Update automatically on deploy

  81. @sarahjwells Other teams need to adapt too

  82. @sarahjwells Change and release management

  83. @sarahjwells 2256 releases = 53 working days doing CRs
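The slide does not say how long one change request (CR) took, but the two figures it gives imply a per-CR cost once a working-day length is assumed. In this rough sketch the 7-hour and 8-hour days are assumptions; only 2256 and 53 come from the slide.

```python
releases = 2256      # releases in the year, from the slide
working_days = 53    # working days spent on CRs, from the slide


def implied_minutes_per_cr(hours_per_day: float) -> float:
    """Minutes per change request implied by the slide for a given day length."""
    return working_days * hours_per_day * 60 / releases


for hours in (7, 8):  # assumed working-day lengths
    print(f"{hours}h day: {implied_minutes_per_cr(hours):.1f} minutes per CR")
```

Either way the implied cost is on the order of ten minutes per release, which is small per CR but adds up to months of effort at this release rate.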

  84. @sarahjwells Automation, again

  85. None
  86. None
  87. @sarahjwells GitHub webhook for our CRs

  88. None
  89. @sarahjwells First line support

  90. @sarahjwells There are many different technologies for them to understand now

  91. @sarahjwells Our development teams don’t know the whole system either…

  92. None
  93. @sarahjwells Operating microservices *is* a challenge

  94. @sarahjwells The benefits can be worth it…

  95. Deploys to production last year

  96. Deploys to production of the monolith

  97. @sarahjwells But you have to be prepared to pay the cost

  98. @sarahjwells Thank you!