Lessons Learned from Five Years of Multi-Cloud at PagerDuty

PagerDuty has been running a multi-cloud infrastructure for the past five years. In that time, we have tested multiple providers, learned about fun networking routes, seen what traffic filtering looks like, and run into many other horrors.

In this talk, I will go over the decisions and events that led up to PagerDuty's multi-cloud environment and how we managed it. I will go through the benefits and problems of our setup and the assumptions we made that turned out to be completely wrong. By the end of this talk, you will be better able to answer the question of whether a multi-cloud setup is the right thing for your team or company.

Arup Chakrabarti

March 28, 2018

Transcript

  1. Arup Chakrabarti Director of Engineering, PagerDuty Five Years of Multi-Cloud at PagerDuty A ROMANTIC AND COMPLICATED LOVE STORY SRECON AMERICAS 2018 @arupchak

  2. @arupchak Disclaimers and Context

  3. @arupchak What is PagerDuty?

  4. @arupchak I work with smrt smart people

  5. @arupchak You are not PagerDuty

  6. @arupchak We get this wrong sometimes

  7. @arupchak You will not get an easy answer

  8. @arupchak Not a vendor endorsement

  9. @arupchak Slides will be posted online afterward

  10. @arupchak Terminology

  11. @arupchak Multi-Cloud

  12. @arupchak Having Active or Passive Infrastructure in Multiple Cloud Providers

  13. @arupchak Having the same product or service spread across Multiple Cloud Providers

  14. @arupchak What every Procurement Manager thinks they want

  15. @arupchak Active / Active

  16. @arupchak Running the same workload across multiple datacenters

  17. @arupchak “Distributed Systems”

  18. @arupchak Active / Passive

  19. @arupchak Running a workload in one datacenter with a standby datacenter

  20. @arupchak History Lesson

  21. @arupchak PagerDuty Early 2012 • Cloud Native • Used Failover for High Availability • MySQL Master/Slave Topology based on DRBD • Stateless Rails app behind Load Balancers • AWS us-east-1 and failover site in New Jersey

  22. @arupchak

  23. @arupchak

  24. @arupchak

  25. @arupchak

  26. @arupchak 2012: Cloud is Unreliable

  27. @arupchak Minutes of downtime is unacceptable

  28. @arupchak Only way to achieve Reliability is through distinct Regions

  29. @arupchak PagerDuty Late 2012 • Started teasing apart PagerDuty into separate Services • Started using Quorum-based systems • Cassandra and Zookeeper • Favored Durability over Performance • Still needed Regions or Datacenters within 50ms • Tried AWS us-east-1, us-west-1, us-west-2

  30. @arupchak

  31. @arupchak Remember that 50ms requirement?

  32. @arupchak 20ms 75ms 100ms

  33. @arupchak Had to go Multi-Cloud due to latency requirement

  34. @arupchak 20ms 5ms 20ms

  35. @arupchak PagerDuty Early 2018 • Software deployed to AWS us-west-1, us-west-2 and Azure Fresno • ~50 Services across ~10 Engineering teams • Each team owns the entire vertical stack

  36. @arupchak What went well

  37. @arupchak Reliability Benefits

  38. @arupchak

  39. @arupchak Reliability: Hard to measure

  40. @arupchak Portability Benefits

  41. @arupchak Portability Benefits • Everything is treated as Compute • If there is a base Ubuntu image, we can secure and use it • Actually helped in pricing

  42. @arupchak Engineering Culture Benefits

  43. @arupchak Engineering Culture Benefits • Teams built for Reliability early in the SDLC • Teams had deep expertise in their technical stacks (double-edged sword) • Failure Injection / Chaos Engineering

  44. @arupchak What did not go well

  45. @arupchak Right sizing is hard

  46. @arupchak Pinned to limiting system resource

  47. @arupchak AWS m3.large = Azure Standard F4

  48. @arupchak 8GB / 2 Cores ≠ 8GB / 4 Cores

  49. @arupchak $112 ≠ $182

  50. @arupchak Deep Technical Expertise Required

  51. @arupchak Deep Technical Expertise Required • Forced to only use common Compute across providers • Every engineer needs to know how to run their own: • Load Balancers • Databases • Applications • HA systems

  52. @arupchak Complexity Overhead

  53. @arupchak Abstract away providers via Chef

  54. @arupchak

  55. @arupchak Even Less Control Over Network

  56. @arupchak

  57. @arupchak The farther apart your datacenters, the less control you have

  58. @arupchak Cannot use hosted services

  59. @arupchak The Big Question

  60. @arupchak Should you go Multi-Cloud?

  61. @arupchak “It Depends” -Arup on almost everything

  62. @arupchak What to consider • Business requirements first, technical requirements second • Company buy-in • Engineering staff capabilities • What do your customers care about?

  63. @arupchak “Understand your customer’s problems better than they do” -Andrew Miklas, PagerDuty Co-Founder

  64. Arup Chakrabarti Director of Engineering, PagerDuty Thank You WE ARE HIRING! PAGERDUTY.COM/CAREERS @arupchak
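
The latency argument on slides 29 to 34 can be sketched in a few lines of Python. This is a reader's illustration, not code from the talk. It assumes the three figures on slides 32 and 34 are pairwise latencies between the three sites, and that the 50ms requirement applies to every pair; the transcript does not say which pair each figure belongs to, so they are left unlabeled. The name `within_budget` is hypothetical.

```python
from typing import Iterable

# Latency budget stated on slide 29 ("within 50ms").
LATENCY_BUDGET_MS = 50.0


def within_budget(pairwise_latencies_ms: Iterable[float],
                  budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """Return True only if every pairwise latency fits the budget.

    Assumption: the requirement is applied to every pair of sites.
    """
    return all(latency <= budget_ms for latency in pairwise_latencies_ms)


# Figures as they appear on slides 32 and 34. Which pair of sites each
# number belongs to is not stated in the transcript, so none is assumed.
slide_32_latencies_ms = [20, 75, 100]
slide_34_latencies_ms = [20, 5, 20]

print(within_budget(slide_32_latencies_ms))  # False: 75 and 100 exceed 50
print(within_budget(slide_34_latencies_ms))  # True: all three are under 50
```

Under those assumptions, the first set of figures fails the check and the second passes, which matches the conclusion on slide 33 that the latency requirement is what pushed the setup to multi-cloud.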