Slide 1

Slide 1 text

Building Healthy Distributed Systems Erlang Factory SF March 29, 2012 basho Thursday, March 29, 12

Slide 2

Slide 2 text

Mark Phillips
basho
@pharkmillups
themarkphillips.com
[email protected]

Slide 3

Slide 3 text

What is a distributed system?

“A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.” [1]

Slide 4

Slide 4 text

Distributed, Scalable, Fault Tolerant
No central coordinator; easy to set up and operate

Slide 5

Slide 5 text

Distributed, Scalable, Fault Tolerant
Horizontally scalable: add commodity hardware to get more [throughput | processing | storage].

Slide 6

Slide 6 text

Distributed, Scalable, Fault Tolerant
• Always Available
• No Single Point of Failure
• Self-healing

Slide 8

Slide 8 text

basho
• Founded in 2007
• Collapsed in 2008
• “Pivoted” in 2009
• Commercial sponsor of Riak, an open source NoSQL database
• Sells closed source extensions to Riak in the form of licenses

Slide 15

Slide 15 text

Year on Year Growth
2009: 14
2010: 25
2011: 60

Slide 16

Slide 16 text

Office Locations

Slide 20

Slide 20 text

Actual Employee Distribution

Slide 21

Slide 21 text

What is a distributed [company]?

“A distributed [company] consists of multiple autonomous [team members] that communicate [and collaborate] through various [channels]. The [team members] interact with each other in order to achieve a common goal.”

Slide 22

Slide 22 text

Hiring where the talent is means we don’t sacrifice great hires for location, but it also presents various hurdles when attempting to build culture and community.

Slide 23

Slide 23 text

Common Goals for Basho
1. Make Basho into a Powerhouse
2. Professional Development
3. Employee Happiness
4. Deliver Exceptional Product

Slide 24

Slide 24 text

Internal Communication and Collaboration
• Real-time Chat (Jabber, Campfire)
• Skype (or some form of video chat)
• Yammer
• GitHub
• AgileZen
• Email (sort of)
• Documentation

Slide 25

Slide 25 text

Culture: Be Corny and Childish

Slide 32

Slide 32 text

Good Meetings
• Quarterly In-person “Summits”
• Bi-Monthly, Non-Mandatory Company All Hands
• Stand-ups, Scrum

Slide 33

Slide 33 text

Make Documentation Part of Your Culture
• Inside Jokes
• Internal Talks
• Design Documents
• Product Ideas
• Product Feedback
• New Hire Processes
• Everything Else

Slide 34

Slide 34 text

Open Source Your Code. And Use GitHub.
• Contributes Directly to Developer Happiness
• Makes Your Company’s Product Better
• Great Marketing
• Use a Permissive License
“Open Source Almost Everything” and “Why Your Company Should Have a Permissive Open Source Policy” (http://bit.ly/clJyDO, http://bit.ly/v3OMEf)

Slide 36

Slide 36 text

Hiring Should Not Happen In A Vacuum

Slide 37

Slide 37 text

Poor Culture Rots a Company from Within and Lessens its Resiliency

Slide 38

Slide 38 text

Company Fault Tolerance
• New CEO + Massive Growth = New Challenges
• Our System is Constantly Improving

Slide 39

Slide 39 text

Planned Growth
2012: 1**

Slide 40

Slide 40 text

Community

Slide 41

Slide 41 text

What is a distributed [community]?

“A distributed [community] consists of multiple autonomous [members] that communicate [and collaborate] through various [channels]. The [members] interact with each other in order to achieve a common goal.”

Slide 42

Slide 42 text

Why Build A Community?

Slide 43

Slide 43 text

Grassroots Marketing, Branding, Awareness:

Slide 44

Slide 44 text

Code Contributions and Bug Fixes:
• 180 names in our THANKS file
• 1600 hours contributed from Oct 2010 - Sept 2011

Slide 45

Slide 45 text

Support:

Slide 46

Slide 46 text

Revenue:
75% of new customers in 2011 came from the Open Source Community

Slide 47

Slide 47 text

Importance of Community for Community Members
• Working, Quality Code
• Recognition and Praise
• Desire to Contribute
• Jobs (whether they like it or not)
• Skills Acquisition

Slide 48

Slide 48 text

Communication and Collaboration in a Distributed [Community]
• IRC
• Mailing List
• Twitter
• Riak Recap
• Meetups
• Q & A Sites
• Blogs
• Books
• Conferences
• Actual Meetings
• GitHub
• Drinking

Slide 49

Slide 49 text

Riak Recap

Slide 50

Slide 50 text

Books
http://riakhandbook.com/

Slide 51

Slide 51 text

Meetups and Drinking

Slide 52

Slide 52 text

GitHub

Slide 53

Slide 53 text

Give Things Away

Slide 54

Slide 54 text

Build Communities Regardless

Slide 55

Slide 55 text

Community Fault Tolerance

Slide 57

Slide 57 text

What is a distributed system?

“A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.”

Slide 58

Slide 58 text

Riak is:
• a database
• a key/value store
• distributed
• fault-tolerant
• scalable
• Dynamo-inspired
• used by startups
• used by FORTUNE 100 companies
• written (primarily) in Erlang
• pronounced “REE-awk”
• not the right fit for every project and app
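The “distributed”, “scalable”, and “Dynamo-inspired” bullets can be made concrete with consistent hashing, the partitioning technique described in the Dynamo paper. The sketch below is a toy illustration under stated assumptions, not Riak's actual ring implementation: the `HashRing` class, the node names, the key format, and the count of 64 virtual nodes per machine are all hypothetical. It shows why adding a machine to such a system only relocates a small share of the keys, and needs no central coordinator to decide where a key lives.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a point on the ring (a 160-bit integer via SHA-1)."""
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring (hypothetical; not Riak's implementation)."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes   # virtual nodes per machine (assumed value)
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place `vnodes` points for this node on the ring."""
        for i in range(self.vnodes):
            point = ring_position(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def lookup(self, key):
        """Return the node owning the first ring point at or after the key."""
        idx = bisect.bisect_left(self._points, ring_position(key))
        idx %= len(self._points)  # wrap around past the last point
        return self._owner[self._points[idx]]


# Any client can compute a key's owner locally from the ring alone.
keys = [f"user:{n}" for n in range(10_000)]
ring = HashRing([f"node{n}" for n in range(11)])
before = {k: ring.lookup(k) for k in keys}

# Grow the cluster by one machine and see how many keys changed owner.
ring.add_node("node11")
after = {k: ring.lookup(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.1%} of keys moved after adding one node")
```

With 12 nodes, roughly one twelfth of the keys would be expected to move, and every key that moves lands on the new node; the rest of the data stays where it was.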

Slide 59

Slide 59 text

1000s of Deployments

Slide 61

Slide 61 text

Common Goals for Voxer’s System
1. Serve and Receive App Traffic
2. Perform Queries When Needed
3. Don’t Go Down
4. Scale Out to Meet Demand
5. Low, Consistent Response Times

Slide 62

Slide 62 text

Voxer’s Initial Riak Cluster Stats (Oct 2011)
• 11 Riak Nodes
• Modest Data Set Size (100s of GB)
• ~20,000 Peak Concurrent Users
• ~4,000,000 Daily Total Requests
Then something happened...

Slide 71

Slide 71 text

Voxer’s Current Riak Cluster Stats
• >40 Node Cluster for User Data
• >40 Node Cluster to serve app traffic
• ~1 TB of user data added daily
• 100,000s of concurrent users at peak
• Went from 11 to about 80 nodes in a month
• At one point adding three nodes/day
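The growth figures above lend themselves to some back-of-the-envelope arithmetic. The sketch below is illustrative only: the 30-day month, the 30-day retention window, the replication factor of 3, the 2 TB of usable disk per node, and the 70% target utilization are all assumptions made for the example, not numbers from Voxer or Basho. Only the 11-to-80 node growth and the ~1 TB/day ingest rate come from the slides.

```python
import math


def average_nodes_added_per_day(start_nodes: int, end_nodes: int, days: int) -> float:
    """Average cluster growth rate over a period."""
    return (end_nodes - start_nodes) / days


def nodes_needed(ingest_tb_per_day: float,
                 retention_days: int,
                 replicas: int,
                 usable_tb_per_node: float,
                 target_utilization: float = 0.7) -> int:
    """Nodes required to hold the retained data while keeping disk headroom."""
    raw_tb = ingest_tb_per_day * retention_days * replicas
    return math.ceil(raw_tb / (usable_tb_per_node * target_utilization))


# From the slides: 11 nodes to about 80 nodes in a month (30 days assumed).
rate = average_nodes_added_per_day(start_nodes=11, end_nodes=80, days=30)
print(f"average growth: {rate:.1f} nodes/day")  # prints "average growth: 2.3 nodes/day"

# Hypothetical sizing: ~1 TB/day ingest (from the slides), everything else assumed.
print(nodes_needed(ingest_tb_per_day=1.0, retention_days=30,
                   replicas=3, usable_tb_per_node=2.0))
```

An average of about 2.3 nodes per day is consistent with the peak of three nodes per day mentioned on the slide.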

Slide 72

Slide 72 text

Voxer’s Fault Tolerance
• Have lost a lot of nodes in production
• TCP Incast Problem [2]
• LevelDB merge issues
• Lots of other shit went wrong but it’s still running :)

Slide 73

Slide 73 text

“Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth.” [3]

Slide 74

Slide 74 text

Present System Health Dictates Future Ability to Scale

Slide 75

Slide 75 text

Distributed [ Companies | Communities | Systems ] are all susceptible to downtime.
credit: http://blogs.ajc.com/jeff-schultz-blog/files/2009/06/closedsign.png

Slide 76

Slide 76 text

Capacity Plan or Perish

Slide 77

Slide 77 text

Everything Is Distributed Now

Slide 78

Slide 78 text

Questions?
Mark Phillips
basho
@pharkmillups
themarkphillips.com
[email protected]

Slide 79

Slide 79 text

References
1. http://en.wikipedia.org/wiki/Distributed_computing
2. http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/
3. http://en.wikipedia.org/wiki/Scalability