Building Healthy
Distributed Systems
Erlang Factory SF
March 29, 2012
basho
Thursday, March 29, 12
Slide 2
Mark Phillips
@pharkmillups
themarkphillips.com
[email protected]
Slide 3
What is a distributed system?
“A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.” [1]
Slide 4
Distributed, Scalable, Fault Tolerant
No central coordinator;
Easy to set up and operate
Slide 5
Distributed, Scalable, Fault Tolerant
Horizontally Scalable;
Add commodity hardware to get more
[throughput | processing | storage].
Slide 6
Distributed, Scalable, Fault Tolerant
Always Available
No Single Point of Failure
Self-healing
Slide 8
basho
• Founded in 2007
• Collapsed in 2008
• “Pivoted” in 2009
• Commercial sponsor of Riak, an open source NoSQL database
• Sells closed source extensions to Riak in the form of licenses
Slide 15
Year on Year Growth
2009: 14
2010: 25
2011: 60
Slide 16
Office Locations
Slide 20
Actual Employee Distribution
Slide 21
What is a distributed [company]?
“A distributed [company] consists of multiple autonomous [team members] that communicate [and collaborate] through various [channels]. The [team members] interact with each other in order to achieve a common goal.”
Slide 22
Hiring where the talent is means we don’t sacrifice great hires for location, but it also presents various hurdles when attempting to build culture and community.
Slide 23
Common Goals for Basho
1. Make Basho into a Powerhouse
2. Professional Development
3. Employee Happiness
4. Deliver Exceptional Product
Slide 24
Internal Communication and Collaboration
• Real-time Chat (Jabber, Campfire)
• Skype (or some form of video chat)
• Yammer
• GitHub
• AgileZen
• Email (sort of)
• Documentation
Slide 25
Culture:
Be Corny and Childish
Slide 32
Good Meetings
• Quarterly In-person “Summits”
• Bi-Monthly, Non-Mandatory Company All Hands
• Stand-ups, Scrum
Slide 33
Make Documentation Part of Your Culture
• Inside Jokes
• Internal Talks
• Design Documents
• Product Ideas
• Product Feedback
• New Hire Processes
• Everything Else
Slide 34
Open Source Your Code. And Use GitHub.
• Contributes Directly to Developer Happiness
• Makes Your Company’s Product Better
• Great Marketing
• Use a Permissive License
(http://bit.ly/clJyDO)
(http://bit.ly/v3OMEf)
“Open Source Almost Everything”
“Why Your Company Should Have a Permissive Open Source Policy”
Slide 36
Hiring Should Not Happen In A Vacuum
Slide 37
Poor Culture Rots a Company from Within and Lessens Its Resiliency
Slide 38
Company Fault Tolerance
• New CEO + Massive Growth = New Challenges
• Our System is Constantly Improving
Slide 39
Planned Growth
2012: 1**
Slide 40
Community
Slide 41
What is a distributed [community]?
“A distributed [community] consists of multiple autonomous [members] that communicate [and collaborate] through various [channels]. The [members] interact with each other in order to achieve a common goal.”
Slide 42
Why Build A Community?
Slide 43
Grassroots Marketing, Branding, Awareness:
Slide 44
Code Contributions and Bug Fixes:
• 180 names in our THANKS file
• 1600 hours contributed from Oct 2010 - Sept 2011
Slide 45
Support:
Slide 46
Revenue:
75% of new customers in 2011 came from the Open Source Community
Slide 47
Importance of Community for Community Members
• Working, Quality Code
• Recognition and Praise
• Desire to Contribute
• Jobs (whether they like it or not)
• Skills Acquisition
Slide 48
Communication and Collaboration in a Distributed [Community]
• IRC
• Mailing List
• Twitter
• Riak Recap
• Meetups
• Q&A Sites
• Blogs
• Books
• Conferences
• Actual Meetings
• GitHub
• Drinking
Slide 49
Riak Recap
Slide 50
Books
http://riakhandbook.com/
Slide 51
Meetups and Drinking
Slide 52
GitHub
Slide 53
Give Things Away
Slide 54
Build Communities Regardless
Slide 55
Community Fault Tolerance
Slide 57
What is a distributed system?
“A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.”
Slide 58
Riak is:
• a database
• a key/value store
• distributed
• fault-tolerant
• scalable
• Dynamo-inspired
• used by startups
• used by FORTUNE 100 companies
• written (primarily) in Erlang
• pronounced “REE-awk”
• not the right fit for every project and app
Slide 59
1000s of Deployments
Slide 61
Common Goals for Voxer’s System
1. Serve and Receive App Traffic
2. Perform Queries When Needed
3. Don’t Go Down
4. Scale Out to Meet Demand
5. Low, Consistent Response Times
Slide 62
Voxer’s Initial Riak Cluster Stats (Oct 2011)
• 11 Riak Nodes
• Modest Data Set Size (100s of GBs)
• ~20,000 Peak Concurrent Users
• ~4,000,000 Daily Total Requests
Then something happened...
Slide 71
Voxer’s Current Riak Cluster Stats
• >40 Node Cluster for User Data
• >40 Node Cluster to serve app traffic
• ~1 TB of user data added daily
• 100,000s of concurrent users at peak
• Went from 11 to about 80 nodes in a month
• At one point adding three nodes/day
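To put these growth figures in perspective, here is a rough back-of-the-envelope sketch in Python. It is not from the talk. The 30-day month, the exact node count of 40, the 1 TB = 1000 GB conversion, the three replicas per object, and the 1000 GB of free disk per node are all assumptions made for illustration, and the function names are hypothetical.

```python
def average_nodes_added_per_day(start_nodes, end_nodes, days):
    """Average number of nodes added per day over the growth window."""
    return (end_nodes - start_nodes) / days


def daily_ingest_per_node_gb(ingest_gb_per_day, nodes, replicas):
    """Raw GB written to each node per day, assuming replicas spread evenly."""
    return ingest_gb_per_day * replicas / nodes


def days_until_full(free_gb_per_node, ingest_per_node_gb):
    """Days before a node's free disk runs out at a constant ingest rate."""
    return free_gb_per_node / ingest_per_node_gb


# From the slide: 11 -> about 80 nodes in a month, ~1 TB/day, >40 nodes.
# Assumed (not from the talk): 30-day month, exactly 40 nodes,
# 1 TB = 1000 GB, 3 replicas per object, 1000 GB free disk per node.
growth = average_nodes_added_per_day(11, 80, 30)
per_node = daily_ingest_per_node_gb(1000, 40, 3)
runway = days_until_full(1000, per_node)

print(f"average nodes added per day: {growth:.1f}")
print(f"raw ingest per node: {per_node:.0f} GB/day")
print(f"days until a node fills up: {runway:.1f}")
```

Under these assumptions the average works out to a little over two nodes per day, which is in line with the peak of three nodes per day mentioned on the slide. The disk runway number depends entirely on the assumed free space and is only there to show the shape of the calculation.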
Slide 72
Voxer’s Fault Tolerance
• Have lost a lot of nodes in production
• TCP Incast Problem [2]
• LevelDB merge issues
• Lots of other shit went wrong, but it’s still running :)
Slide 73
“Scalability is the ability of a system, network, or process, to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth.” [3]
Slide 74
Present System Health Dictates Future Ability to Scale
Slide 75
Distributed [ Companies | Communities | Systems ] are all susceptible to downtime.
credit: http://blogs.ajc.com/jeff-schultz-blog/files/2009/06/closedsign.png
Slide 76
Capacity Plan or Perish
Slide 77
Everything Is Distributed Now
Slide 78
Questions?
Mark Phillips
@pharkmillups
themarkphillips.com
[email protected]