Distributed Systems at PagerDuty

This is a talk that I gave at the San Francisco Distributed Computing Meetup in December 2013.

http://www.meetup.com/San-Francisco-Distributed-Computing/events/153886592/?a=gs1.1_l

Arup Chakrabarti

December 11, 2013

Transcript

  1. Distributed Systems at PagerDuty
    Arup Chakrabarti, Operations Engineering, @arupchak, arup@pagerduty.com
  2. Disclaimer: I did not come up with everything

  3. What we will cover
    • What is PagerDuty?
    • Distributed Hardware
    • Distributed Software Design
    • Design Challenges
    • How we cheat by using Chef
    • Failure Fridays
    • Q and A
  4. What is PagerDuty? Ops Guys know all too well...
    • Alert and Incident Tracking
    • On-Call Management
    • Integrates with monitoring tools
    • Notify the right person about problems
  5. What is PagerDuty?

  6. Distributed Systems at PagerDuty: Higher Distribution ~ Higher Availability
    • PagerDuty is spread across multiple providers, regions, DCs
    • AWS US-West-1
    • AWS US-West-2
    • Linode Fremont
    • Adding another provider soon
  7. It’s not just about the Hardware: Software has to be Distributed too
    • Failover does not work
    • Services have multiple end points
    • Clients prefer local calls, but can tolerate remote calls
    • Service itself handles coordination
  8. What a typical service looks like at PagerDuty
    [Diagram labels: US-West-1, US-West-2, Linode; Client Requests, Local Request, Remote Request]
  9. What makes this hard? And how do we get around it?
    • Failover does not work
    • Easy for stateless app servers
    • Harder for stateful data stores
    • Multi-Master Datastores
    • Percona MySQL XtraDB Cluster
    • Replaced DRBD pair
    • Cassandra
    • ZooKeeper
  10. What makes this hard? And how do we get around it?
    • Multiple end-points and local vs. remote calls
    • Service is backed by HAProxy locally
    • Even if the service is failing on the host, HAProxy can re-route
    • Eventually host is marked as bad
    • HAProxy prefers local calls
  11. What makes this hard? And how do we get around it?
    • Cannot use vendor-specific tools
    • No AWS-specific services
    • All servers have to be treated as bare metal
    • Network routes need to be distributed
    • AWS regions share common peering points
  12. Chef at PagerDuty, or how you can cheat with automation
    • No AWS Security Groups
    • Completely automated iptables chain build-outs
    • No vendor-specific monitoring
    • Deploy StatsD everywhere with Datadog as our backend
  13. Chef at PagerDuty, or how you can cheat with automation
    • Configuration file management
    • HAProxy configs
  14. Failure Fridays at PagerDuty: how we reinforce a culture of failure handling
    • Attack our own services
    • Based on Netflix’s Simian Army
    • Service shutdown
    • Network blocks
    • Network slowness
    • Single Host Failure
    • Regional and Datacenter Failure
  15. Thank you. We are hiring! http://pagerduty.com/jobs
    Arup Chakrabarti, Operations Engineering, @arupchak, arup@pagerduty.com
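
Slides 7 and 10 say that clients prefer local calls but can tolerate remote ones, and that traffic is re-routed when a host is failing. The talk attributes this routing to HAProxy, so the following is only a minimal Python sketch of the idea, not PagerDuty's implementation. The names `Endpoint`, `order_endpoints`, and `call_with_fallback`, and the region labels in the usage lines, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Endpoint:
    name: str    # hypothetical label for one service instance
    region: str  # region or datacenter the instance runs in


def order_endpoints(endpoints: Iterable[Endpoint], local_region: str) -> List[Endpoint]:
    """Put endpoints in the caller's region first; keep the given order otherwise."""
    # sorted() is stable and False sorts before True, so local endpoints lead.
    return sorted(endpoints, key=lambda e: e.region != local_region)


def call_with_fallback(
    endpoints: Iterable[Endpoint],
    local_region: str,
    send: Callable[[Endpoint], str],
) -> Tuple[Endpoint, str]:
    """Try local endpoints first, then remote ones; return the first success."""
    last_error = None
    for endpoint in order_endpoints(endpoints, local_region):
        try:
            return endpoint, send(endpoint)
        except ConnectionError as exc:
            # Assumption: a failed call surfaces as ConnectionError.
            last_error = exc
    raise ConnectionError("all endpoints failed") from last_error


if __name__ == "__main__":
    fleet = [
        Endpoint("svc-linode", "linode-fremont"),
        Endpoint("svc-west-1", "us-west-1"),
        Endpoint("svc-west-2", "us-west-2"),
    ]

    def send(endpoint: Endpoint) -> str:
        # Simulate the local instance being down.
        if endpoint.region == "us-west-1":
            raise ConnectionError("local instance is down")
        return "handled by " + endpoint.name

    used, reply = call_with_fallback(fleet, "us-west-1", send)
    print(used.name, reply)
```

The sketch retries on every request; marking a host as bad after repeated failures, as slide 10 describes, would need health-check state that is left out here.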