Tech Exeter Conference: Scaling clusters to thousands of servers in the cloud

Jacob Tomlinson
September 21, 2017

To analyse the petabytes of data we hold at the Met Office, we need very large clusters of servers. However, procuring this infrastructure takes months or even years of planning and a large up-front capital expense.

In the Informatics Lab we have been exploring scalable cloud infrastructure for building next-generation data analysis clusters. In our latest prototype we combined scalable resources from AWS with Dask, a Python computation scheduler, to create clusters with thousands of CPU cores on demand. The cluster exists only for as long as we need it; then we shut it down, so we pay only for what we use.
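
The create, compute, tear down pattern can be sketched locally with the Python standard library. This is only a stand-in: a thread pool plays the role of the cluster, and `chunk_mean`, `run_on_transient_pool` and the sample data are hypothetical names and values invented for illustration, not code from the talk and not Dask's API.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_mean(chunk):
    """Reduce one chunk of values to its mean (stand-in for real analysis)."""
    return sum(chunk) / len(chunk)


def run_on_transient_pool(chunks, workers=4):
    """Start a worker pool, process every chunk, then shut the pool down."""
    # The pool only exists inside the `with` block, mirroring a cluster
    # that is created for one job and torn down when the job finishes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(chunk_mean, chunks))


# Hypothetical data: four small chunks standing in for slices of a dataset.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
means = run_on_transient_pool(chunks)
print(means)
```

On a real cluster the workers would be remote machines rather than local threads, but the shape of the job is the same: nothing is running before the work starts or after it ends.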

Scaling to these levels takes careful thought. For everything to scale linearly, you also need to scale your data access, monitoring, system configuration and everything else; otherwise one of them becomes the bottleneck.
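
A rough way to see why one unscaled component caps the whole system is a toy model in which compute time shrinks as cores are added but storage throughput stays fixed. Everything here is an assumption made for illustration: the function `wall_time_hours`, the perfect overlap of compute and reads, and all of the numbers.

```python
def wall_time_hours(total_core_hours, cores, data_tb, storage_tb_per_hour):
    """Estimate job time when compute scales with cores but storage does not."""
    compute_time = total_core_hours / cores
    read_time = data_tb / storage_tb_per_hour
    # Assume compute and data reads overlap perfectly, so the slower of
    # the two determines how long the job takes.
    return max(compute_time, read_time)


# Hypothetical job: 1000 core-hours of work over 10 TB of data,
# read from storage that delivers a fixed 5 TB per hour.
for cores in (100, 1000, 4000):
    hours = wall_time_hours(1000, cores, data_tb=10, storage_tb_per_hour=5)
    print(f"{cores} cores: {hours} hours")
```

In this model, going from 100 to 1000 cores helps a great deal, but going from 1000 to 4000 changes nothing, because the job is now waiting on storage rather than on CPU.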

This talk will cover the practicalities of building these clusters, the pitfalls we found when crossing certain thresholds, and the new challenges we face when working in this new paradigm.


  1. Scaling clusters to thousands of servers in the cloud. Jacob Tomlinson, Met Office Informatics Lab, @_jacobtomlinson
  2. @_jacobtomlinson: linux, bash, cloud, sysadmin, web development, docker, javascript, git, public speaking, CI/CD, automation, python
  3. None
  4. None
  5. The challenge

  6. None
  7. Getting to the cloud

  8. None
  9. HDF5 & netCDF

  10. Producing and analysing data

  11. None
  12. Currently storing 40PB. Archiving 200TB per day. Will store 300PB by 2020. (*Approx. figures, early 2017)
  13. Data is heavy

  14. None
  15. None
  16. Storytime

  17. None
  18. Processing data

  19. Laziness

  20. Iris

  21. None
  22. None
  23. None
  24. None
  25. None
  26. None
  27. None
  28. None
  29. Scaling on AWS

  30. Terraform

  31. None
  32. Value for money

  33. Spot pricing

  34. Storytime

  35. None
  36. Packing workloads

  37. Vertical scaling

  38. Horizontal scaling

  39. Spot fleet scaling

  40. Monitoring

  41. None
  42. http://inlb.co/alerting-guidance

  43. Statelessness

  44. None
  45. External storage

  46. Distributed queues

  47. Wrap up

  48. None
  49. Thank you. Jacob Tomlinson, www.informaticslab.co.uk, @_jacobtomlinson