Tech Exeter Conference: Scaling clusters to thousands of servers in the cloud

Jacob Tomlinson
September 21, 2017

To analyse the petabytes of data we hold at the Met Office, we need very large clusters of servers. However, procuring this kind of infrastructure takes months or even years of planning and a large up-front capital expense.

In the Informatics Lab we have been exploring scalable cloud infrastructure as a way to build next-generation data analysis clusters. In our latest prototype, we used scalable resources from AWS along with a Python computation scheduler called Dask to create clusters with thousands of CPU cores on demand. The cluster exists only for as long as we need it; then we shut it down again, so we pay only for what we use.
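As a rough illustration of the pay-for-what-you-use point, the sketch below compares a cluster that runs only while work is being done with one of the same size left running all month. It is not taken from the talk: the function names, the node count, the busy hours and the price per node-hour are all hypothetical placeholders, and the model ignores storage, data transfer and any capital costs.

```python
HOURS_IN_MONTH = 24 * 30  # simplifying assumption: a 30-day month


def cluster_cost(nodes: int, hours: float, price_per_node_hour: float) -> float:
    """Cost of running `nodes` identical servers for `hours` hours."""
    return nodes * hours * price_per_node_hour


def monthly_comparison(nodes: int, busy_hours: float, price_per_node_hour: float):
    """Return (ephemeral_cost, always_on_cost) for one month.

    ephemeral_cost: the cluster exists only during the busy hours.
    always_on_cost: the same cluster is left running for the whole month.
    """
    ephemeral = cluster_cost(nodes, busy_hours, price_per_node_hour)
    always_on = cluster_cost(nodes, HOURS_IN_MONTH, price_per_node_hour)
    return ephemeral, always_on


if __name__ == "__main__":
    # All three inputs are made-up values, chosen only to show the shape
    # of the comparison; they are not real AWS prices or Met Office usage.
    ephemeral, always_on = monthly_comparison(
        nodes=1000, busy_hours=40, price_per_node_hour=0.10
    )
    print(f"ephemeral cluster: {ephemeral:.2f}")
    print(f"always-on cluster: {always_on:.2f}")
```

The gap between the two figures grows with the amount of idle time, which is why short-lived clusters suit bursty analysis workloads.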

Scaling to these levels takes careful thought. For everything to scale linearly, you also need to scale your data access, monitoring, system configuration and everything else to avoid bottlenecks.
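One way to see why a single unscaled component matters is Amdahl's law, which is not mentioned in the talk but is a standard way to reason about this. If a fraction p of the work scales across n workers and the rest does not, the best possible speedup is 1 / ((1 - p) + p / n). The sketch below assumes that simple model; the function name and the example fractions are illustrative only.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work scales with workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative assumption: 1% of the job (for example a shared data
    # source or a central configuration step) does not scale at all.
    for workers in (10, 100, 1000):
        print(workers, round(amdahl_speedup(0.99, workers), 1))
```

Under this model, even a 1% serial bottleneck keeps a 1000-worker cluster below a 100x speedup, which is the reason data access, monitoring and configuration have to scale along with the compute.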

This talk will cover the practicalities of building these clusters, the pitfalls we found when crossing certain thresholds and the new challenges we face when working in this new paradigm.

Transcript

  1. Scaling clusters to
    thousands of servers in
    the cloud
    Jacob Tomlinson
    Met Office Informatics Lab
    @_jacobtomlinson

  2. @_jacobtomlinson
    linux, bash, cloud, sysadmin,
    web development, docker,
    javascript, git, public speaking,
    CI/CD, automation, python

  5. The challenge

  7. Getting to the cloud

  9. HDF5 & netCDF

  10. Producing and analysing data

  12. Currently storing 40 PB
    Archiving 200 TB per day
    Will store 300 PB by 2020
    *Approx. figures, early 2017

  13. Data is heavy

  16. Storytime

  18. Processing data

  19. Laziness

  20. Iris

  29. Scaling on AWS

  30. Terraform

  32. Value for money

  33. Spot pricing

  34. Storytime

  36. Packing workloads

  37. Vertical scaling

  38. Horizontal scaling

  39. Spot fleet scaling

  40. Monitoring

  42. http://inlb.co/alerting-guidance

  43. Statelessness

  45. External storage

  46. Distributed queues

  47. Wrap up

  49. Thank you
    Jacob Tomlinson
    www.informaticslab.co.uk
    @_jacobtomlinson
