Distributed Systems at PagerDuty

Arup Chakrabarti, Operations Engineering, @arupchak

Disclaimer: I did not come up with everything

What we will cover
• What is PagerDuty?
• Distributed Hardware
• Distributed Software Design
• Design Challenges
• How we cheat by using Chef
• Failure Fridays
• Q and A

What is PagerDuty?
Ops Guys know all too well...
• Alert and Incident Tracking
• On-Call Management
• Integrates with monitoring tools
• Notify the right person about problems


Distributed Systems at PagerDuty
Higher Distribution ~ Higher Availability
• PagerDuty is spread across multiple providers, regions, and DCs
  • AWS US-West-1
  • AWS US-West-2
  • Linode Fremont
• Adding another provider soon

Software has to be Distributed too
It’s not just about the Hardware
• Failover does not work
• Services have multiple end points
• Clients prefer local calls, but can tolerate remote calls
• Service itself handles coordination
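The "prefer local, tolerate remote" behaviour can be sketched in a few lines. This is a minimal illustration, not PagerDuty's client code: the function name, the `(region, address)` tuples, and the `send` callable are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def call_with_local_preference(
    endpoints: Sequence[Tuple[str, str]],
    local_region: str,
    send: Callable[[str], str],
) -> str:
    """Try endpoints in the caller's region first, then the remote ones.

    `endpoints` holds hypothetical (region, address) pairs and `send` is any
    callable that raises ConnectionError when an endpoint cannot be reached.
    """
    # Stable sort: local endpoints (key False) are ordered before remote ones.
    ordered = sorted(endpoints, key=lambda ep: ep[0] != local_region)
    last_error: Optional[Exception] = None
    for _region, address in ordered:
        try:
            return send(address)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise ConnectionError("all endpoints failed") from last_error
```

A remote endpoint is only contacted after every local one has failed, so the extra latency of a cross-region call is paid only during a local outage.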

What a typical service looks like at PagerDuty
[Diagram. Labels: US-West-1, US-West-2, Linode, Client Requests, Local Request, Remote Request]

What makes this hard?
and how do we get around it?
• Failover does not work
  • Easy for stateless app servers
  • Harder for stateful data stores
• Multi-Master Datastores
  • Percona MySQL XtraDB Cluster
    • Replaced DRBD pair
  • Cassandra
  • Zookeeper
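The slide does not spell out the arithmetic, but ZooKeeper and Galera-based clusters such as Percona XtraDB Cluster only keep accepting writes while a majority of members can reach each other. Assuming one member per site, the sketch below shows why three sites survive the loss of one and a two-node pair survives none. The function names are illustrative.

```python
def quorum_size(members: int) -> int:
    """Smallest group that is a strict majority of `members`."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum_size(members)


for n in (2, 3, 5):
    print(n, "members: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```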

What makes this hard?
and how do we get around it?
• Multiple end-points and local vs. remote calls
• Service is backed by HAProxy locally
• Even if the service is failing on the host, HAProxy can re-route
• Eventually host is marked as bad
• HAProxy prefers local calls
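"Eventually host is marked as bad" is typically implemented with consecutive-check thresholds. HAProxy exposes these as the `fall` and `rise` server options. The class below is a minimal Python model of that idea, not HAProxy's actual implementation; the class name and default thresholds are assumptions.

```python
class BackendHealth:
    """Flip a backend's state only after enough consecutive contrary checks."""

    def __init__(self, fall: int = 3, rise: int = 2) -> None:
        self.fall = fall  # consecutive failures needed to mark a healthy host bad
        self.rise = rise  # consecutive successes needed to mark a bad host healthy
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health-check result and return the resulting state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state, reset the streak
        else:
            self._streak += 1
            threshold = self.fall if self.healthy else self.rise
            if self._streak >= threshold:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy
```

Requiring several consecutive failures keeps one slow response from ejecting a host, while the proxy can still route requests elsewhere in the meantime.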

What makes this hard?
and how do we get around it?
• Cannot use vendor-specific tools
  • No AWS-specific services
  • All servers have to be treated as bare metal
• Network routes need to be distributed
  • AWS regions share common peering points

Chef at PagerDuty
or how you can cheat with automation
• No AWS Security Groups
  • Completely automated iptables chain build-outs
• No vendor-specific monitoring
  • Deploy StatsD everywhere with Datadog as our backend
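The deck does not show the Chef recipes, so the following is only a sketch of the idea behind an automated chain build-out: derive the firewall rules for a service from an inventory of peer hosts instead of maintaining them by hand. The function name, chain name, and inventory shape are hypothetical. The addresses in the usage below come from the RFC 5737 documentation ranges.

```python
import shlex
from typing import Dict, List


def build_chain_commands(chain: str, port: int, peers: Dict[str, str]) -> List[str]:
    """Return iptables commands that allow only known peers to reach `port`.

    `peers` maps a hypothetical host name to its IP address. In a real setup
    this inventory would come from the configuration management server.
    """
    commands = [
        f"iptables -N {chain}",                                 # create the per-service chain
        f"iptables -A INPUT -p tcp --dport {port} -j {chain}",  # send service traffic to it
    ]
    for name in sorted(peers):  # sorted so repeated runs produce identical output
        commands.append(
            f"iptables -A {chain} -s {peers[name]} "
            f"-m comment --comment {shlex.quote(name)} -j ACCEPT"
        )
    commands.append(f"iptables -A {chain} -j DROP")             # everything else is dropped
    return commands


for line in build_chain_commands(
    "SVC_EXAMPLE", 9042, {"west1-a": "192.0.2.10", "west2-a": "198.51.100.20"}
):
    print(line)
```

The function only builds the command strings. Applying them, and doing so atomically, is left to whatever tool runs on the host.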

Chef at PagerDuty
or how you can cheat with automation
• Configuration file management
  • HAProxy configs
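As an illustration of generated HAProxy configs, the sketch below renders a `listen` section from a server inventory. The deck does not say how local preference is configured. Marking remote servers with HAProxy's `backup` keyword is one way to do it and is an assumption here, as are the function name, service name, and addresses.

```python
from typing import List, Tuple


def render_haproxy_listen(
    service: str,
    bind_port: int,
    local_region: str,
    servers: List[Tuple[str, str, str]],
) -> str:
    """Render a `listen` section from (name, region, host:port) tuples."""
    lines = [
        f"listen {service}",
        f"    bind 127.0.0.1:{bind_port}",  # clients on this host talk to the local proxy
        "    mode tcp",
    ]
    for name, region, address in sorted(servers):
        # Servers in other regions are marked `backup`, so HAProxy only uses
        # them once every non-backup (local) server has failed its checks.
        options = "check" if region == local_region else "check backup"
        lines.append(f"    server {name} {address} {options}")
    return "\n".join(lines) + "\n"


print(render_haproxy_listen(
    "example_svc", 8080, "us-west-1",
    [("w1a", "us-west-1", "192.0.2.10:8080"), ("w2a", "us-west-2", "198.51.100.20:8080")],
))
```

Because the file is rendered from the inventory, adding a host or a region changes the configuration on every node at the next run, with no hand edits.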

Failure Fridays at PagerDuty
how we reinforce a culture of failure handling
• Attack our own services
• Based on Netflix’s Simian Army
• Service shutdown
• Network blocks
• Network slowness
• Single Host Failure
• Regional and Datacenter Failure
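Two of the attacks above, network blocks and network slowness, can be expressed with standard Linux tools (`iptables` and `tc` with the `netem` qdisc). The deck does not say which commands PagerDuty runs, so the helper below is a hypothetical sketch. It only builds the command strings, paired with the commands that undo them, and executes nothing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaultPlan:
    """Commands that inject a fault, and the commands that undo it."""
    inject: List[str]
    revert: List[str]


def network_block(peer_ip: str) -> FaultPlan:
    """Drop all traffic to and from one peer."""
    rules = [f"INPUT -s {peer_ip} -j DROP", f"OUTPUT -d {peer_ip} -j DROP"]
    return FaultPlan(
        inject=[f"iptables -I {rule}" for rule in rules],  # -I inserts at the top of the chain
        revert=[f"iptables -D {rule}" for rule in rules],  # -D deletes the matching rule
    )


def network_slowness(interface: str, delay_ms: int) -> FaultPlan:
    """Add a fixed delay to every packet leaving `interface`."""
    return FaultPlan(
        inject=[f"tc qdisc add dev {interface} root netem delay {delay_ms}ms"],
        revert=[f"tc qdisc del dev {interface} root"],
    )
```

Keeping the revert commands next to the inject commands makes it harder to end an exercise with a fault still in place.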

Thank you. We are hiring!
http://pagerduty.com/jobs
Arup Chakrabarti, Operations Engineering, @arupchak
