
Distributed Systems at PagerDuty


This is a talk that I gave at the San Francisco Distributed Computing Meetup in December 2014.

http://www.meetup.com/San-Francisco-Distributed-Computing/events/153886592/?a=gs1.1_l

Arup Chakrabarti

December 11, 2013

Transcript

  1. PagerDuty: What we will cover
     • What is PagerDuty?
     • Distributed Hardware
     • Distributed Software Design
     • Design Challenges
     • How we cheat by using Chef
     • Failure Fridays
     • Q and A
  2. What is PagerDuty? (Ops Guys know all too well...)
     • Alert and Incident Tracking
     • On-Call Management
     • Integrates with monitoring tools
     • Notify the right person about problems
  3. Distributed Systems at PagerDuty (Higher Distribution ~ Higher Availability)
     • PagerDuty is spread across multiple providers, regions, and DCs
       • AWS US-West-1
       • AWS US-West-2
       • Linode Fremont
     • Adding another provider soon
  4. Software has to be Distributed too (It’s not just about the Hardware)
     • Failover does not work
     • Services have multiple endpoints
     • Clients prefer local calls, but can tolerate remote calls
     • The service itself handles coordination
  5. What a typical service looks like (at PagerDuty)
     [Diagram: a service with endpoints in US-West-1, US-West-2, and Linode; clients make local requests where possible, with remote requests across regions]
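The local-first call pattern in the diagram can be sketched as a client that tries the endpoint in its own region first and falls back to remote regions. This is an illustrative sketch only, not PagerDuty's client code; the endpoint URLs and the `send` callable are invented for the example.

```python
# Hypothetical sketch of local-first request routing: try the endpoint
# in the client's own region first, then fall back to remote regions.
# Region names mirror the deck; everything else is illustrative.

ENDPOINTS = {
    "us-west-1": "https://svc.us-west-1.internal",
    "us-west-2": "https://svc.us-west-2.internal",
    "linode-fremont": "https://svc.fremont.internal",
}

def call_service(local_region, send):
    """Try the local endpoint, then each remote endpoint in turn.

    `send` is any callable that takes an endpoint URL and either
    returns a response or raises on failure.
    """
    ordered = [ENDPOINTS[local_region]] + [
        url for region, url in ENDPOINTS.items() if region != local_region
    ]
    last_error = None
    for url in ordered:
        try:
            return send(url)
        except Exception as exc:  # in practice: connection/timeout errors
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error
```

If the local endpoint is down, the client degrades to a remote call instead of failing outright, which is the "prefer local, tolerate remote" behavior from the previous slide.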
  6. What makes this hard? (and how do we get around it?)
     • Failover does not work
       • Easy for stateless app servers
       • Harder for stateful data stores
     • Multi-Master Datastores
       • Percona MySQL XtraDB Cluster (replaced a DRBD pair)
       • Cassandra
       • ZooKeeper
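In a Percona XtraDB (Galera) cluster, every node accepts writes, which is what removes the failover step that a DRBD primary/secondary pair required. A minimal `my.cnf` fragment for such a cluster looks roughly like this; hostnames are invented to match the deck's regions, and the provider library path varies by distribution and version, so treat this as a sketch rather than PagerDuty's actual configuration:

```ini
# /etc/my.cnf (fragment) -- illustrative Galera/wsrep settings
[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so   ; path varies by distro/version
wsrep_cluster_name=pd_example_cluster
wsrep_cluster_address=gcomm://db1.us-west-1,db2.us-west-2,db3.fremont
wsrep_sst_method=xtrabackup-v2
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
```

Row-based binlogging and interleaved auto-increment locking are required by Galera's synchronous multi-master replication.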
  7. What makes this hard? (and how do we get around it?)
     • Multiple endpoints and local vs. remote calls
     • Each service is backed by HAProxy locally
     • Even if the service is failing on a host, HAProxy can re-route
     • Eventually the host is marked as bad
     • HAProxy prefers local calls
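The "prefer local, tolerate remote" behavior maps naturally onto HAProxy health checks plus `backup` servers, which only receive traffic once every non-backup server has failed its checks. A hedged sketch of such a backend follows; the service name, addresses, and health-check path are invented for illustration:

```
# haproxy.cfg (fragment) -- local servers first, remote peers as backups
backend notification_service
    option httpchk GET /health
    server local-a 10.0.1.10:8080 check
    server local-b 10.0.1.11:8080 check
    # remote endpoints are used only after all local servers fail checks
    server remote-usw2 10.1.1.10:8080 check backup
    server remote-fremont 10.2.1.10:8080 check backup
```

A failing host keeps getting re-routed around by the health checks until it is marked down entirely, matching the bullet points above.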
  8. What makes this hard? (and how do we get around it?)
     • Cannot use vendor-specific tools
       • No AWS-specific services
     • All servers have to be treated as bare metal
     • Network routes need to be distributed
       • AWS regions share common peering points
  9. Chef at PagerDuty (or how you can cheat with automation)
     • No AWS Security Groups
       • Completely automated iptables chain build-outs
     • No vendor-specific monitoring
       • Deploy StatsD everywhere, with Datadog as our backend
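Part of why StatsD can be deployed everywhere is that its wire format is a tiny plain-text UDP protocol: `name:value|type`. A minimal emitter might look like the sketch below; the host, port, and metric names are illustrative, not PagerDuty's.

```python
import socket

def statsd_format(name, value, metric_type):
    """Render a metric in the StatsD wire format: name:value|type
    ('c' for counters, 'ms' for timers, 'g' for gauges)."""
    return f"{name}:{value}|{metric_type}"

def send_metric(name, value, metric_type="c",
                host="127.0.0.1", port=8125):
    """Fire-and-forget a metric over UDP to a StatsD daemon.

    UDP means instrumentation never blocks or fails the app,
    even if the local daemon is down.
    """
    payload = statsd_format(name, value, metric_type).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

Because the protocol is this simple, the same Chef-deployed daemon works identically on AWS and Linode hosts, with Datadog as the shared backend.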
  10. Chef at PagerDuty (or how you can cheat with automation)
      • Configuration file management
        • HAProxy configs
  11. Failure Fridays at PagerDuty (how we reinforce a culture of failure handling)
      • Attack our own services
      • Based on Netflix’s Simian Army
      • Service shutdown
      • Network blocks
      • Network slowness
      • Single host failure
      • Regional and datacenter failure
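The attack types listed above can be simulated with ordinary Linux tooling. A runbook-style sketch follows; the commands require root, the service, interface, and host names are illustrative, and this is not PagerDuty's actual Failure Friday tooling:

```shell
# Service shutdown: stop the process outright
sudo service cassandra stop

# Network block: drop all inbound traffic from a peer host
sudo iptables -A INPUT -s 10.1.1.10 -j DROP

# Network slowness: inject latency with tc/netem
sudo tc qdisc add dev eth0 root netem delay 500ms

# Clean up the injected latency afterwards
sudo tc qdisc del dev eth0 root netem
```

Running drills like these on a schedule is what turns the multi-region design from earlier slides into behavior the team has actually observed under failure.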