Building Reliable Cloud Services

Presentation at OSBC 2012

Andy Gross

June 05, 2012

Transcript

  1. Building Reliable Cloud Services
    Andy Gross, @argv0
    Chief Architect, Basho Technologies
    Tuesday, June 5, 12

  2. About Basho
    Sponsor of Riak
    Sells Riak EDS subscription licenses and support
    Founded in 2008
    Open Sourced Riak in 2009
    Released Riak 1.0 in 2011
    Released Riak CS in 2012

  3. What is a cloud service?
    Horizontally scalable
    Operationally “simple”
    Globally distributed
    Multi-tenant
    ...network service.

  4. How did we get here?

  5. Some History
    1970-1997: Foundational distributed systems research
    1998-2002: Inktomi, Akamai, Google, LinkedIn founded
    2003-2005: Virtualization, AJAX, Web 2.0
    2006-2007: S3, EC2, Chubby, Bigtable, Dynamo
    2008-Present: NoSQL, Big Data, Cloud, DevOps

  6. Horizontal Scale
    Vertical scale becomes too costly
    Embrace failure, plan (and code) for recovery
    Enables “elastic” scaling/pricing models
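
A minimal sketch of what "code for recovery" can look like on the client side: retry a failed request with exponential backoff and jitter. The function name, the defaults, and the `FlakyNode` stand-in are illustrative assumptions, not something shown in the talk.

```python
import random
import time


def call_with_retries(operation, attempts=4, base_delay=0.05, sleep=time.sleep):
    """Run `operation`, retrying on ConnectionError with exponential backoff.

    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Full jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


class FlakyNode:
    """Hypothetical stand-in for a node that fails its first two requests."""

    def __init__(self):
        self.calls = 0

    def get(self):
        self.calls += 1
        if self.calls <= 2:
            raise ConnectionError("node unavailable")
        return "value"
```

Passing `sleep` as a parameter lets the backoff be observed without actually waiting.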

  7. Operational Simplicity
    Horizontal scale == more machines == more complexity
    Homogeneity pays dividends
    Simpler deployment
    Simpler monitoring
    Simpler recovery

  8. Global Distribution
    Latency matters - ask Amazon
    2- and 3-datacenter deployments now common
    Impossible to guarantee strong consistency

  9. Free Lunches 1, 2, 3: Over
    Single-core
    Single-machine
    Single-datacenter

  10. NoSQL
    Absence of SQL?
    Presence of Tradeoffs?

  11. Review of CAP
    Consistency (C)
    Availability (A)
    Partition tolerance (P)

  12. CAP variations
    CA ~= single-site db
    CP ~= multi-site db, traditional
    AP ~= multi-site db w/ sloppy consistency

  13. What is Riak?
    Eventually consistent
    Fault tolerant
    Distributed
    Highly Available
    ...database.

  14. What’s it not good for?
    Rich, consistent queries
    Strongly typed storage
    Single/low-multiple speed
    Low developer overhead

  15. What’s it good for?
    Elastic, highly-available storage
    Simple, fast queries (k/v storage + minor variations)
    Storing BLOBs
    Large-scale speed

  16. Riak as Foundation
    Horizontally Scalable ✓
    Operationally “simple” ✓
    Globally Distributed (with EDS)
    Multi-tenant ✗

  17. Dynamo-style Complexity
    Sibling management
    Restricted query model
    Quorum controls: R, W, DW, PW, PR
    Backend choices
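
To make the R and W controls concrete: in a Dynamo-style store, each key is kept on N replicas, a read waits for R replies and a write for W acknowledgements. When R + W > N, every read quorum shares at least one replica with every write quorum. The sketch below only checks that arithmetic; the function names are illustrative assumptions, and the DW, PW, and PR controls are not modeled. With sloppy quorums and node failures, overlap alone does not amount to strong consistency.

```python
def quorums_overlap(n, r, w):
    """True when every read quorum shares a replica with every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def majority(n):
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1
```

For example, with N = 3, choosing R = W = 2 gives overlapping quorums, while R = W = 1 trades that overlap for lower latency and higher availability.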

  18. Challenges
    Multi-tenancy
    Authn/z
    Usage billing/accounting
    Large file storage
    Needs chunking for latency/throughput
    Chunking == pain w/ eventual consistency
    Standardized protocol (S3-compatible)
    Global bucket namespace
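
One common way to take some of the pain out of chunking on an eventually consistent store is to write immutable chunks first and a small manifest last, so a reader only sees an object once all of its chunks exist. The sketch below illustrates that idea with plain dictionaries. It is an assumption for illustration, not a description of how Riak CS is implemented; the key format, manifest fields, and chunk size are all hypothetical.

```python
import hashlib


def split_into_chunks(data, chunk_size):
    """Yield consecutive slices of `data`, each at most `chunk_size` bytes."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def build_manifest(key, data, chunk_size=1024 * 1024):
    """Return (manifest, chunks): chunks are stored first, the manifest last."""
    chunks = {}
    order = []
    for index, chunk in enumerate(split_into_chunks(data, chunk_size)):
        chunk_key = f"{key}:{index}"  # hypothetical chunk key format
        chunks[chunk_key] = chunk
        order.append(chunk_key)
    manifest = {
        "key": key,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "chunks": order,
    }
    return manifest, chunks


def reassemble(manifest, chunks):
    """Rebuild the object and verify it against the manifest checksum."""
    data = b"".join(chunks[chunk_key] for chunk_key in manifest["chunks"])
    if hashlib.sha256(data).hexdigest() != manifest["sha256"]:
        raise ValueError("checksum mismatch: chunks do not match manifest")
    return data
```

Because chunks are never overwritten, the only value that can end up with conflicting versions under concurrent writes is the manifest.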

  19. Architecture
    [Diagram labels: Boto (Python), s3cmd, fog (Ruby); HTTP/S (S3); Riak PB API; Riak, Riak, Riak]

  20. Riak CS
    Vertical service atop Riak
    S3 compatible interface
    Files up to 5 GB (for now)
    Billing/auditing REST interfaces

  21. What worked well?

  22. Riak!
    No code modifications required to build Riak CS
    Resulting service inherits all of Riak’s operational properties

  23. Tools
    Erlang
    Rebar
    QuickCheck
    Webmachine
    Other Basho Open Source projects
    http://github.com/basho

  24. Process
    Started as a prototype
    Iterated quickly with a beta customer
    Shipped frequently
    Small, close team, grown slowly

  25. What was hard?

  26. Connection Pooling
    Just as hard as caching and naming
    # incoming connections > # connection capacity of cluster
    Started with naive approach
    Outsourced to proxy software
    Wrote proper connection pool
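
As a rough illustration of the problem, a bounded pool caps the number of backend connections no matter how many requests arrive, and makes callers wait (or time out) when every connection is busy. The sketch below is a generic Python example built on the standard library `queue` module. The class and method names are assumptions, and it says nothing about how the pool in Riak CS, which is written in Erlang, actually works.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: at most `size` backend connections are ever opened."""

    def __init__(self, connect, size):
        # `connect` is a caller-supplied factory (hypothetical here).
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Simplification: a real pool would replace broken connections.
            self._idle.put(conn)
```

A caller would write `with pool.connection(timeout=1.0) as conn:` and treat `queue.Empty` as a signal to shed load rather than open more connections.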

  27. Conflict Resolution Is Hard
    Implementation of conflict-handling code can be very tricky
    Required for high availability
    CRDTs may help
    QuickCheck saves the day, as always
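
To show why CRDTs may help, here is a minimal sketch of one of the simplest, a grow-only counter (G-Counter). Its merge is commutative, associative, and idempotent, so conflicting sibling values can be merged in any order and still converge. The talk does not say which CRDTs were considered; the function names and the dictionary representation are illustrative assumptions.

```python
from functools import reduce


def increment(state, actor, amount=1):
    """Return a new state with `actor`'s own slot increased."""
    new_state = dict(state)
    new_state[actor] = new_state.get(actor, 0) + amount
    return new_state


def merge(a, b):
    """Merge two states by taking the per-actor maximum."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}


def resolve_siblings(siblings):
    """Collapse any number of conflicting sibling states into one."""
    return reduce(merge, siblings, {})


def value(state):
    """The counter's value is the sum over all actors."""
    return sum(state.values())
```

The three merge laws are exactly the kind of property a QuickCheck-style tool can test against randomly generated states.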

  28. Lack Of Strong Consistency
    Some S3 operations need to be atomic
    Riak can’t do this
    Implemented a stopgap solution with less-than-ideal availability properties

  29. Customer Environments
    Everything besides Riak and Riak CS
    Software != Service
    Planning
    Provisioning
    Deployment
    Monitoring

  30. What is a cloud service?
    Horizontally scalable
    Operationally “simple”
    Globally distributed
    Multi-tenant
    ...network service.

  31. Questions?