Building Reliable Cloud Services

Presentation at OSBC 2012

Andy Gross

June 05, 2012

  1. Building Reliable Cloud Services. Andy Gross, @argv0, Chief Architect, Basho Technologies. Tuesday, June 5, 12
  2. About Basho: sponsor of Riak; sells subscription licenses for Riak EDS, support; founded in 2008; open sourced Riak in 2009; released Riak 1.0 in 2011; released Riak CS in 2012.
  3. What is a cloud service? Horizontally scalable, operationally “simple”, globally distributed, multi-tenant ...network service.
  4. How did we get here?
  5. Some History. 1970-1997: foundational distributed systems research; 1998-2002: Inktomi, Akamai, Google, LinkedIn founded; 2003-2005: virtualization, AJAX, Web 2.0; 2006-2007: S3, EC2, Chubby, Bigtable, Dynamo; 2008-present: NoSQL, Big Data, Cloud, DevOps.
  6. Horizontal Scale: vertical scale becomes too costly; embrace failure, plan (and code) for recovery; enables “elastic” scaling/pricing models.
  7. Operational Simplicity: horizontal scale == more machines == more complexity; homogeneity pays dividends; simpler deployment; simpler monitoring; simpler recovery.
  8. Global Distribution: latency matters - ask Amazon; 2,3 datacenter deployments now common; impossible to guarantee strong consistency.
  9. Free Lunches 1,2,3: Over. Single-core; single-machine; single-datacenter.
  10. NoSQL: Absence of SQL? Presence of tradeoffs?
  11. Review of CAP: Consistency (C); Availability (A); Partition tolerance (P).
  12. CAP variations: CA ~= single-site db; CP ~= multi-site db, traditional; AP ~= multi-site db w/ sloppy consistency.
  13. What is Riak? Eventually consistent, fault tolerant, distributed, highly available ...database.
  14. What’s it not good for? Rich, consistent queries; strongly typed storage; single/low-multiple speed; low-developer overhead.
  15. What’s it good for? Elastic, highly-available storage; simple, fast queries (k/v storage + minor variations); storing BLOBs; large-scale speed.
  16. Riak as Foundation: horizontally scalable ✓; operationally “simple” ✓; globally distributed ✓ (with EDS); multi-tenant ✗.
  17. Dynamo-style Complexity: sibling management; restricted query model; quorum controls: R, W, DW, PW, PR; backend choices.
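
A note on the quorum controls in slide 17: in a Dynamo-style store with N replicas per key, a read quorum R and a write quorum W are guaranteed to share at least one replica only when R + W > N. The sketch below checks that condition under a strict-quorum model. The function names and the values of N are illustrative assumptions, not part of any Riak API, and it does not model sloppy quorums or the DW, PW, and PR variants.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Return True when any R replicas and any W replicas out of N must intersect."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    # Pigeonhole argument: two subsets of sizes r and w drawn from n
    # replicas cannot be disjoint when r + w exceeds n.
    return r + w > n


def majority(n: int) -> int:
    """Smallest quorum size that is more than half of n replicas."""
    return n // 2 + 1
```

For example, with N = 3, choosing R = W = `majority(3)` gives overlapping quorums, while R = W = 1 trades that overlap for lower latency and higher availability.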
  18. Challenges: multi-tenancy; authn/z; usage billing/accounting; large file storage; needs chunking for latency/throughput; chunking == pain w/ eventual consistency; standardized protocol (S3-compatible); global bucket namespace.
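
A note on the chunking challenge in slide 18: the idea of storing a large file as many small values plus a manifest can be sketched as follows. This is a minimal illustration and not Riak CS's actual storage layout; the key naming scheme, the manifest fields, and the default chunk size are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Tuple


def chunk_blob(name: str, data: bytes,
               chunk_size: int = 1024 * 1024) -> Tuple[dict, Dict[str, bytes]]:
    """Split data into fixed-size chunks and build a manifest describing them."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: Dict[str, bytes] = {}
    keys = []
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        key = f"{name}:{index}"  # hypothetical key scheme
        chunks[key] = data[offset:offset + chunk_size]
        keys.append(key)
    manifest = {
        "name": name,
        "size": len(data),
        "chunk_keys": keys,
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    return manifest, chunks


def reassemble(manifest: dict, chunks: Dict[str, bytes]) -> bytes:
    """Rebuild the blob from its chunks and verify it against the manifest."""
    data = b"".join(chunks[key] for key in manifest["chunk_keys"])
    if hashlib.sha256(data).hexdigest() != manifest["sha256"]:
        raise ValueError("reassembled data does not match manifest checksum")
    return data
```

One reason chunking is painful under eventual consistency is that the manifest and its chunks are separate keys that can become visible at different times; the checksum in this sketch only detects a mismatch, it does not resolve one.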
  19. Architecture [diagram: Boto (Python), s3cmd, fog (Ruby); HTTP/S (S3); Riak PB API; Riak, Riak, Riak]
  20. Riak CS: vertical service atop Riak; S3 compatible interface; files up to 5G (for now); billing/auditing REST interfaces.
  21. What worked well?
  22. Riak! No code modifications required to build Riak CS; resulting service inherits all of Riak’s operational properties.
  23. Tools: Erlang; Rebar; QuickCheck; Webmachine; other Basho open source projects: http://github.com/basho
  24. Process: started as a prototype; iterated quickly with a beta customer; shipped frequently; small, close team, grown slowly.
  25. What was hard?
  26. Connection Pooling: just as hard as caching and naming; # incoming connections > # connection capacity of cluster; started with naive approach; outsourced to proxy software; wrote proper connection pool.
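
A note on slide 26: when incoming connections outnumber what the backing cluster can accept, a bounded pool makes callers wait for, or fail to get, a backend connection instead of opening a new one. The sketch below is a generic illustration in Python, not the pool written for Riak CS; the `ConnectionPool` class and its `factory` argument are hypothetical names.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """A fixed-size pool: at most `size` connections are ever handed out at once."""

    def __init__(self, factory, size: int):
        if size <= 0:
            raise ValueError("size must be positive")
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # create all connections up front

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty if the
        # timeout expires first (timeout=None waits indefinitely).
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool
```

A caller would write `with pool.connection(timeout=1.0) as conn:` and treat `queue.Empty` as a signal to shed load or retry. A production pool would also need health checks and reconnection, which this sketch omits.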
  27. Conflict Resolution Is Hard: implementation of conflict-handling code can be very tricky; required for high availability; CRDTs may help; QuickCheck saves the day, as always.
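
A note on the CRDT mention in slide 27: a CRDT defines a merge that is commutative, associative, and idempotent, so replicas that diverged can be reconciled without custom conflict-handling code. The grow-only counter below is one of the simplest examples. It is a generic sketch, not code from Riak or Riak CS, and the class and node names are assumptions.

```python
from typing import Dict, Optional


class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, counts: Optional[Dict[str, int]] = None):
        self.counts: Dict[str, int] = dict(counts or {})

    def increment(self, node: str, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[node] = self.counts.get(node, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> "GCounter":
        # Taking the per-node maximum makes merge order-independent
        # and safe to apply more than once.
        nodes = self.counts.keys() | other.counts.keys()
        return GCounter({
            node: max(self.counts.get(node, 0), other.counts.get(node, 0))
            for node in nodes
        })
```

Two replicas that each accepted increments during a partition can be merged in either order and reach the same total, which is the kind of property a tool such as QuickCheck is well suited to testing.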
  28. Lack Of Strong Consistency: some S3 operations need to be atomic; Riak can’t do this; implemented a stopgap solution with less-than-ideal availability properties.
  29. Customer Environments: everything besides Riak and Riak CS; software != service; planning; provisioning; deployment; monitoring.
  30. What is a cloud service? Horizontally scalable, operationally “simple”, globally distributed, multi-tenant ...network service.
  31. Questions?