Slide 1

Slide 1 text

Building Reliable Cloud Services
Andy Gross, @argv0
Chief Architect, Basho Technologies
Tuesday, June 5, 12

Slide 2

Slide 2 text

About Basho
Sponsor of Riak
Sells subscription licenses for Riak EDS, support
Founded in 2008
Open Sourced Riak in 2009
Released Riak 1.0 in 2011
Released Riak CS in 2012

Slide 3

Slide 3 text

What is a cloud service?
Horizontally scalable
Operationally “simple”
Globally distributed
Multi-tenant
...network service.

Slide 4

Slide 4 text

How did we get here?

Slide 5

Slide 5 text

Some History
1970-1997: Foundational distributed systems research
1998-2002: Inktomi, Akamai, Google, LinkedIn founded
2003-2005: Virtualization, AJAX, Web 2.0
2006-2007: S3, EC2, Chubby, Bigtable, Dynamo
2008-Present: NoSQL, Big Data, Cloud, DevOps

Slide 6

Slide 6 text

Horizontal Scale
Vertical scale becomes too costly
Embrace failure, plan (and code) for recovery
Enables “elastic” scaling/pricing models

Slide 7

Slide 7 text

Operational Simplicity
Horizontal scale == more machines == more complexity
Homogeneity pays dividends
  Simpler deployment
  Simpler monitoring
  Simpler recovery

Slide 8

Slide 8 text

Global Distribution
Latency matters - ask Amazon
2,3 datacenter deployments now common
Impossible to guarantee strong consistency

Slide 9

Slide 9 text

Free Lunches 1,2,3: Over
Single-core
Single-machine
Single-datacenter

Slide 10

Slide 10 text

NoSQL
Absence of SQL?
Presence of Tradeoffs?

Slide 11

Slide 11 text

Review of CAP
Consistency (C)
Availability (A)
Partition tolerance (P)

Slide 12

Slide 12 text

CAP variations
CA ~= single-site db
CP ~= multi-site db, traditional
AP ~= multi-site db w/ sloppy consistency

Slide 13

Slide 13 text

What is Riak?
Eventually consistent
Fault tolerant
Distributed
Highly Available
...database.

Slide 14

Slide 14 text

What’s it not good for?
Rich, consistent queries
Strongly typed storage
Single/low-multiple speed
Low-developer overhead

Slide 15

Slide 15 text

What’s it good for?
Elastic, highly-available storage
Simple, fast queries (k/v storage + minor variations)
Storing BLOBs
Large-scale speed

Slide 16

Slide 16 text

Riak as Foundation
Horizontally Scalable ✓
Operationally “simple” ✓
Globally Distributed ✓ (with EDS)
Multi-tenant ✗

Slide 17

Slide 17 text

Dynamo-style Complexity
Sibling management
Restricted query model
Quorum controls: R, W, DW, PW, PR
Backend choices
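
The quorum controls listed on this slide can be illustrated with the standard Dynamo-style overlap rule: with N replicas per key, a read quorum of R and a write quorum of W always share at least one replica when R + W > N. The sketch below is only an illustration of that arithmetic. The function names are hypothetical, not Riak API, and it ignores sloppy quorums and hinted handoff, which weaken the guarantee in practice.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Return True if any r replicas must share a member with any w replicas out of n."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    # Two subsets of an n-element set are forced to intersect
    # exactly when their sizes add up to more than n.
    return r + w > n


def tolerated_failures(n: int, quorum: int) -> int:
    """Replicas that may be unreachable while a quorum of this size can still be met."""
    if not 1 <= quorum <= n:
        raise ValueError("quorum must be between 1 and n")
    return n - quorum
```

With N = 3, choosing R = W = 2 gives overlapping quorums and tolerates one unreachable replica per request; R = W = 1 tolerates two but gives up the overlap.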

Slide 18

Slide 18 text

Challenges
Multi-tenancy
  Authn/z
  Usage billing/accounting
Large file storage
  Needs chunking for latency/throughput
  Chunking == pain w/ eventual consistency
Standardized protocol (S3-compatible)
  Global bucket namespace
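
A minimal sketch of the chunking idea from this slide, assuming a plain dict stands in for the key/value store. The manifest layout, key scheme, and function names are hypothetical and are not Riak CS's actual storage format. Each upload gets a fresh id so concurrent writers never overwrite each other's chunks, and the manifest is returned only after every chunk has been stored, so a reader never follows a manifest to a missing chunk.

```python
import hashlib
import uuid


def chunk_object(data: bytes, chunk_size: int, store: dict) -> dict:
    """Split data into fixed-size chunks, store them, and return a manifest."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    upload_id = uuid.uuid4().hex  # unique per upload; avoids chunk-key collisions
    chunk_keys = []
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        key = f"{upload_id}:{index}"
        store[key] = data[offset:offset + chunk_size]
        chunk_keys.append(key)
    # The manifest is built last, after all chunks are in the store.
    return {
        "upload_id": upload_id,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "chunks": chunk_keys,
    }


def read_object(manifest: dict, store: dict) -> bytes:
    """Reassemble an object from its manifest and verify its checksum."""
    data = b"".join(store[key] for key in manifest["chunks"])
    if hashlib.sha256(data).hexdigest() != manifest["sha256"]:
        raise ValueError("checksum mismatch while reassembling object")
    return data
```

The pain point the slide mentions shows up as soon as two uploads of the same object race: each produces its own manifest, and the store has to decide which one wins and when the losing chunks can be garbage collected.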

Slide 19

Slide 19 text

Architecture
[Diagram labels: Boto (Python), s3cmd, fog (Ruby); HTTP/S (S3); Riak PB API; Riak, Riak, Riak]

Slide 20

Slide 20 text

Riak CS
Vertical service atop Riak
S3 compatible interface
Files up to 5G (for now)
Billing/auditing REST interfaces

Slide 21

Slide 21 text

What worked well?

Slide 22

Slide 22 text

Riak!
No code modifications required to build Riak CS
Resulting service inherits all of Riak’s operational properties

Slide 23

Slide 23 text

Tools
Erlang
Rebar
QuickCheck
Webmachine
Other Basho Open Source projects
http://github.com/basho

Slide 24

Slide 24 text

Process
Started as a prototype
Iterated quickly with a beta customer
Shipped frequently
Small, close team, grown slowly

Slide 25

Slide 25 text

What was hard?

Slide 26

Slide 26 text

Connection Pooling
Just as hard as caching and naming
# incoming connections > # connection capacity of cluster
Started with naive approach
Outsourced to proxy software
Wrote proper connection pool
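
The mismatch on this slide (more incoming connections than the cluster can accept) is what a bounded pool addresses: callers borrow one of a fixed number of backend connections and wait, or time out, when none is free. The sketch below is a generic illustration in Python, not the pool Basho wrote. `ConnectionPool` and the `connect` factory are hypothetical names.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool; callers block (or time out) when all connections are in use."""

    def __init__(self, connect, size: int):
        if size <= 0:
            raise ValueError("size must be positive")
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # Open every backend connection up front so the cap is never exceeded.
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        """Borrow a connection; timeout=None waits indefinitely."""
        try:
            conn = self._idle.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no pooled connection available") from None
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)
```

A caller would wrap each backend request in `with pool.connection(timeout=1.0) as conn:`. A production pool would also need health checks and reconnection, which this sketch leaves out.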

Slide 27

Slide 27 text

Conflict Resolution Is Hard
Implementation of conflict-handling code can be very tricky
Required for high availability
CRDTs may help
QuickCheck saves the day, as always
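
To make "CRDTs may help" concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter). It is a textbook example, not code from Riak or Riak CS, and the class and actor names are illustrative. Each replica increments only its own slot, and merging takes the per-actor maximum, so divergent copies can be merged in any order, any number of times, with the same result.

```python
class GCounter:
    """Grow-only counter: one non-decreasing slot per actor, merged by per-actor max."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, actor: str, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[actor] = self.counts.get(actor, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> "GCounter":
        """Return a new counter; merge is commutative, associative, and idempotent."""
        actors = self.counts.keys() | other.counts.keys()
        return GCounter(
            {a: max(self.counts.get(a, 0), other.counts.get(a, 0)) for a in actors}
        )
```

Properties such as commutativity and idempotence of `merge` are the kind of invariant that property-based testing tools like QuickCheck are well suited to checking.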

Slide 28

Slide 28 text

Lack Of Strong Consistency
Some S3 operations need to be atomic
Riak can’t do this
Implemented a stopgap solution with less-than-ideal availability properties

Slide 29

Slide 29 text

Customer Environments
Everything besides Riak and Riak CS
Software != Service
Planning
Provisioning
Deployment
Monitoring

Slide 30

Slide 30 text

What is a cloud service?
Horizontally scalable
Operationally “simple”
Globally distributed
Multi-tenant
...network service.

Slide 31

Slide 31 text

Questions?