Building Reliable Cloud Storage Services with Riak

Presented to the Silicon Valley Cloud Computing Group, April 2, 2013 @ Citrix HQ, Santa Clara, CA

Andy Gross

April 02, 2013

Transcript

  1. Riak and Riak CS
    Andy Gross <@argv0>
    Chief Architect, Basho Technologies
    Silicon Valley Cloud Computing Group
    April 2, 2013
    Tuesday, April 2, 13

  2. Basho
    120+ employees, offices in SF, MA, London,
    Japan
    Founded in 2008, open sourced Riak in 2009
    Sponsors of the Riak open source database
    (Apache 2)
    Sell Enterprise features (multi-DC replication),
    support, training.
    Riak CS (S3-compat storage) released in March
    2012

  3. Now Open Source
    (Apache 2)
    Cloud storage software
    backed by Riak
    S3 API
    Formerly closed-source
    Per-tenant reporting
    Pluggable authentication
    Detailed stats
    DTrace support
    Multi-datacenter
    replication (Enterprise)
    Preliminary integration
    with CloudStack

  4. what is a cloud service?

  5. what is a cloud service?
    fault tolerant

  6. what is a cloud service?
    horizontally scalable
    fault tolerant

  7. what is a cloud service?
    operationally simple
    horizontally scalable
    fault tolerant

  8. what is a cloud service?
    operationally simple
    horizontally scalable
    no SPOFs
    fault tolerant

  9. what is a cloud service?
    operationally simple
    horizontally scalable
    highly available
    no SPOFs
    fault tolerant

  10. what is a cloud service?
    operationally simple
    horizontally scalable
    globally distributed
    highly available
    no SPOFs
    fault tolerant

  11. you can’t outsource these
    properties
    operationally simple
    horizontally scalable
    globally distributed
    highly available
    no SPOFs
    fault tolerant

  12. “use pacemaker” =
    wrong answer

  13. “use mysql best practices
    for redundancy” = wrong
    answer

  14. “just plug it into a SAN” =
    wrong answer

  15. all cloud services need
    reliable, distributed state
    storage

  16. storage is the most
    important and hardest
    part

  17. Riak CS uses Riak

  18. What is Riak?

  19. Key-Value store (plus extras)
    Distributed, horizontally scalable
    Eventually consistent
    Fault-tolerant
    Highly-available
    Inspired by Amazon’s Dynamo

  20. Simple operations - get, put, delete
    Value is mostly opaque (some metadata)
    Extras
    MapReduce
    Secondary Indexes
    Full-text search (optional)
    Key-Value

  21. Distributed & Horizontally
    Scalable
    Default configuration is in a cluster
    Load and data are spread evenly via consistent hashing
    Scalable: Add more nodes to get more X

  22. Fault-Tolerant
    Symmetry: All nodes participate equally
    Decentralized: no central control, no SPOF
    All data is replicated 3x by default
    Cluster transparently survives...
    node failure
    network partitions

  23. Highly-Available
    Any node can serve client requests
    Fallbacks (sloppy quorums) are used when nodes
    are down
    Always accepts write requests
    Accepts read requests as long as R of N nodes are
    alive
    Per-request quorums
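
A minimal sketch of the per-request read quorum described above, assuming each key has N replicas and the caller picks R for each request. The function names and the list-of-booleans representation are hypothetical illustrations, not Riak's API.

```python
def live_replicas(replica_up):
    """Count how many of a key's N replicas are currently reachable."""
    return sum(1 for up in replica_up if up)


def can_serve_read(replica_up, r):
    """Return True if at least r of the N replicas can answer a read.

    replica_up is a list of booleans, one per replica (hypothetical model).
    r is the read quorum chosen for this particular request.
    """
    n = len(replica_up)
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= N")
    return live_replicas(replica_up) >= r
```

With N=3 and one replica down, a request with R=2 can still be served; with two replicas down, only a request with R=1 can.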

  24. Inspired by Amazon’s
    Dynamo
    Masterless, peer-coordinated replication
    Consistent hashing
    Eventually consistent
    Quorum reads and writes
    Anti-entropy: read repair, hinted handoff

  25. [Diagram: Large Object; five Riak CS nodes, each with an S3 API and a Reporting API; five Riak nodes]

  26. [Diagram: Large Object; five Riak CS nodes, each with an S3 API and a Reporting API; five Riak nodes]
    1. user uploads an object

  27. [Diagram: Large Object; five Riak CS nodes, each with an S3 API and a Reporting API; five Riak nodes]

  28. [Diagram: Large Object split into eighteen 1 MB chunks; five Riak CS nodes, each with an S3 API and a Reporting API; five Riak nodes]
    2. Riak CS breaks object into 1 MB chunks

  29. [Diagram: Large Object split into eighteen 1 MB chunks; five Riak CS nodes, each with an S3 API and a Reporting API; five Riak nodes]

  30. [Diagram: Large Object split into eighteen 1 MB chunks; five Riak CS nodes, each with an S3 API and a Reporting API; five Riak nodes]
    3. Riak CS streams chunks to Riak nodes

  31. [Diagram: Large Object; five Riak CS nodes, each with an S3 API and a Reporting API; five Riak nodes]

  32. [Diagram: Large Object; five Riak CS nodes, each with an S3 API and a Reporting API; five Riak nodes]
    4. Riak replicates and stores chunks
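
The upload steps on these slides (split into 1 MB chunks, stream chunks to storage, store them) can be sketched roughly as follows. This is an illustration under stated assumptions, not Riak CS's implementation: the dict-like `store` stands in for a Riak cluster, replication is assumed to happen inside it, and the manifest (an ordered list of block keys) and the one-UUID-per-block key scheme are hypothetical. Only the 1 MB chunk size and the `blocks/` key prefix come from the slides.

```python
import uuid

BLOCK_SIZE = 1024 * 1024  # 1 MB, the chunk size named on the slide


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Yield consecutive block_size slices of data; the last may be shorter."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]


def upload(store, data, block_size=BLOCK_SIZE):
    """Store each block under its own key and return the keys in order.

    store is any dict-like object standing in for the Riak cluster
    (assumption); the returned list acts as a hypothetical manifest.
    """
    manifest = []
    for block in split_into_blocks(data, block_size):
        key = "blocks/" + str(uuid.uuid4()).upper()
        store[key] = block
        manifest.append(key)
    return manifest


def download(store, manifest):
    """Reassemble the object by reading its blocks back in manifest order."""
    return b"".join(store[key] for key in manifest)
```

Because every block is an ordinary key-value pair, the key-value layer can place and replicate each one independently of the others.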

  33. Principles
    Always-writable
    Incrementally scalable
    Symmetrical
    Decentralized
    Focus on SLAs, tail latency

  34. Techniques
    Consistent Hashing
    Vector Clocks
    Read Repair
    Anti-Entropy
    Hinted Handoff
    Gossip Protocol

  35. Consistent Hashing
    Invented by Danny Lewin and others @ MIT/Akamai
    Minimizes remapping of keys when number of hash
    slots changes
    Originally applied to CDNs, used in Dynamo for replica
    placement
    Enables incremental scalability, even spread
    Minimizes hot spots
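
A small sketch of consistent hashing as summarized above. It uses the generic variant in which each node owns several pseudo-random points on the ring. That is an assumption made for brevity: Riak itself divides the ring into a fixed number of equal partitions. The class name `HashRing`, its methods, and the node names are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with several points per node."""

    def __init__(self, nodes=(), points_per_node=64):
        self.points_per_node = points_per_node
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest, "big")

    def add_node(self, node):
        """Place points_per_node points for this node on the ring."""
        for i in range(self.points_per_node):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def preference_list(self, key, n=3):
        """Return the first n distinct nodes clockwise from hash(key)."""
        if not self._ring:
            return []
        start = bisect.bisect_right(self._ring, (self._hash(key), ""))
        owners = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == n:
                    break
        return owners
```

When a node is added, the only keys whose first owner changes are those that now fall to the new node, which is the "minimizes remapping" property.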

  36. [Figure; no text captured]

  37. Vector Clocks
    Introduced independently by Fidge and Mattern in 1988
    Extends Lamport’s timestamps (1978)
    Each value in Dynamo tagged with vector clock
    Allows detection of stale values, logical siblings
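
To make "detection of stale values, logical siblings" concrete, here is a minimal vector clock sketch. Clocks are modeled as plain dicts mapping an actor name to a counter; the function names and actor names are hypothetical, not Riak's API.

```python
def increment(clock, actor):
    """Return a copy of clock with actor's counter advanced by one."""
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new


def descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    """Classify two clocks: equal, one newer (other is stale), or siblings."""
    a_ge_b = descends(a, b)
    b_ge_a = descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a is newer"
    if b_ge_a:
        return "b is newer"
    return "siblings"


def merge(a, b):
    """Combine two clocks by taking the per-actor maximum."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in set(a) | set(b)}
```

Two writes that both start from the same clock but are made by different actors compare as siblings: neither descends from the other, so neither can be silently discarded.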

  38. Read Repair
    Update stale versions opportunistically on reads
    (instead of writes)
    Pushes system toward consistency, after returning
    value to client
    Reflects focus on a cheap, always-available write path
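
A rough sketch of read repair as described above. Assumptions: each replica is a plain dict mapping a key to a `(version, value)` pair, and versions are simple integers rather than vector clocks. `read` and `repair` are hypothetical names.

```python
def read(replicas, key):
    """Return the newest (version, value) plus the replicas that are behind.

    A replica counts as stale if it lacks the key or holds an older version.
    """
    replies = [(replica, replica.get(key)) for replica in replicas]
    found = [entry for _, entry in replies if entry is not None]
    if not found:
        return None, []
    newest = max(found, key=lambda entry: entry[0])
    stale = [replica for replica, entry in replies
             if entry is None or entry[0] < newest[0]]
    return newest, stale


def repair(stale, key, newest):
    """Push the newest version to every stale replica."""
    for replica in stale:
        replica[key] = newest
```

The split into two functions mirrors the slide: the caller can hand `newest` back to the client first and run `repair` afterwards, so the write path stays cheap and the read path does the cleanup.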

  39. Hinted Handoff
    Any node can accept writes for other nodes if they’re
    down
    All messages include a destination
    Data accepted by node other than destination is
    handed off when node recovers
    As long as a single node is alive the cluster can accept
    a write
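
The hinted handoff behavior above could be sketched like this. It is a toy model under assumptions: the `Node` class, the `put` and `handoff` functions, and the rule "first live node acts as fallback" are all hypothetical simplifications.

```python
class Node:
    """Toy node: holds its own data plus hinted writes for other nodes."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (intended destination name, key, value)


def put(nodes, owner, key, value):
    """Write to owner if it is up, else park the write on a live fallback.

    The hint records the intended destination so it can be returned later.
    """
    if owner.up:
        owner.data[key] = value
        return owner
    for fallback in nodes:
        if fallback.up and fallback is not owner:
            fallback.hints.append((owner.name, key, value))
            return fallback
    raise RuntimeError("no live node to accept the write")


def handoff(nodes, recovered):
    """Deliver every hint addressed to the recovered node, then drop it."""
    for node in nodes:
        if node is recovered:
            continue
        keep = []
        for dest, key, value in node.hints:
            if dest == recovered.name:
                recovered.data[key] = value
            else:
                keep.append((dest, key, value))
        node.hints = keep
```

As long as one node in `nodes` is up, `put` succeeds, which matches the last bullet on the slide.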

  40. Anti-Entropy
    Replicas maintain a Merkle Tree of keys and their
    versions/hashes
    Trees periodically exchanged with peer vnodes
    Merkle tree enables cheap comparison
    Only values with different hashes are exchanged
    Pushes system toward consistency
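
A simplified sketch of the hash-tree comparison described above. It assumes a two-level tree (one root over a fixed number of leaf buckets) rather than a full Merkle tree, and replicas modeled as dicts of key to bytes. `BUCKETS`, `build_tree`, and `differing_keys` are hypothetical names.

```python
import hashlib

BUCKETS = 16  # number of leaf buckets (arbitrary choice for the sketch)


def _h(data):
    return hashlib.sha1(data).hexdigest()


def bucket_of(key):
    """Assign a key to a leaf bucket by hashing it."""
    return int(_h(key.encode("utf-8")), 16) % BUCKETS


def build_tree(store):
    """Return (root_hash, leaf_hashes) for a dict of key -> bytes value."""
    leaves = [[] for _ in range(BUCKETS)]
    for key in sorted(store):
        leaves[bucket_of(key)].append(key + ":" + _h(store[key]))
    leaf_hashes = [_h("\n".join(items).encode("utf-8")) for items in leaves]
    root = _h("".join(leaf_hashes).encode("utf-8"))
    return root, leaf_hashes


def differing_keys(store_a, store_b):
    """Find keys that differ, looking only inside buckets whose hashes differ."""
    root_a, leaves_a = build_tree(store_a)
    root_b, leaves_b = build_tree(store_b)
    if root_a == root_b:
        return set()  # one comparison proves the replicas agree
    dirty = {i for i in range(BUCKETS) if leaves_a[i] != leaves_b[i]}
    candidates = {k for k in set(store_a) | set(store_b)
                  if bucket_of(k) in dirty}
    return {k for k in candidates if store_a.get(k) != store_b.get(k)}
```

When the replicas already agree, the whole comparison costs a single root-hash check; otherwise only the keys in mismatched buckets are examined.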

  41. Gossip Protocol
    Decentralized approach to managing global state
    Trades off atomicity of state changes for a
    decentralized approach
    Volume of gossip can overwhelm networks without
    care
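
A toy simulation of gossip-style state propagation, under assumptions: each node keeps a dict of key to `(version, value)`, and in every round each node does a two-way exchange with one randomly chosen peer. All names here are hypothetical and the model ignores network cost entirely.

```python
import random


def merge(mine, theirs):
    """Adopt, per key, whichever entry carries the higher version number."""
    for key, (version, value) in theirs.items():
        if key not in mine or mine[key][0] < version:
            mine[key] = (version, value)


def gossip_round(states, rng):
    """Every node exchanges state with one randomly chosen peer."""
    names = list(states)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        merge(states[peer], states[name])
        merge(states[name], states[peer])


def rounds_until_converged(states, rng, max_rounds=100):
    """Run rounds until all nodes hold identical state; None if never."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(states, rng)
        first = next(iter(states.values()))
        if all(state == first for state in states.values()):
            return round_no
    return None
```

No node coordinates the update, yet every node eventually sees it. The price is the one named on the slide: between rounds, different nodes can hold different views of the state.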

  42. Hinted Handoff

  43. Hinted Handoff
    • Node fails
    [Diagram: eight positions marked X]

  44. Hinted Handoff
    • Node fails
    • Requests go to fallback
    hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    [Diagram: eight positions marked X]

  45. Hinted Handoff
    • Node fails
    • Requests go to fallback
    • Node comes back
    hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)

  46. Hinted Handoff
    • Node fails
    • Requests go to fallback
    • Node comes back
    • “Handoff” - data returns
    to recovered node
    hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)

  47. Hinted Handoff
    • Node fails
    • Requests go to fallback
    • Node comes back
    • “Handoff” - data returns
    to recovered node
    • Normal operations
    resume
    hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)

  48. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)

  49. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    client
    Riak

  50. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak

  51. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    hash(“blocks/
    6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    == 10, 11, 12

  52. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    hash(“blocks/
    6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    == 10, 11, 12
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    The Ring

  53. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    The Ring

  54. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    The Ring
    R=2

  55. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    The Ring
    R=2 v1

  56. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    R=2 v1 v2

  57. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    R=2 v2
    v2

  58. Anatomy of a Request
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    v2

  59. Read Repair
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    R=2 v1 v2
    v2
    v2
    v1

  60. Read Repair
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    R=2 v2
    v2
    v2
    v1

  61. Read Repair
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    R=2 v2
    v2
    v2
    v1
    v1

  62. Read Repair
    v2
    v2
    get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
    Get Handler (FSM)
    client
    Riak
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    R=2 v2
    v2
    v2
    v2
    v2

  63. Erlang/OTP Runtime
    Riak Architecture

  64. Erlang/OTP Runtime
    Riak KV
    Riak Architecture

  65. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs

  66. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs HTTP

  67. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs HTTP Protocol Buffers

  68. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs HTTP Protocol Buffers
    Erlang local client

  69. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    HTTP Protocol Buffers
    Erlang local client

  70. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client

  71. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client

  72. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client
    consistent hashing

  73. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client
    membership
    consistent hashing

  74. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client
    membership
    consistent hashing handoff

  75. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client
    membership
    consistent hashing handoff
    node-liveness

  76. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client
    membership
    consistent hashing handoff
    node-liveness
    gossip

  77. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client
    membership
    consistent hashing handoff
    node-liveness
    gossip
    buckets

  78. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client
    membership
    consistent hashing handoff
    node-liveness
    gossip
    buckets
    vnode master

  79. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client
    membership
    consistent hashing handoff
    node-liveness
    gossip
    buckets
    vnodes
    vnode master

  80. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client
    membership
    consistent hashing handoff
    node-liveness
    gossip
    buckets
    vnodes
    storage backend
    vnode master

  81. Erlang/OTP Runtime
    Riak KV
    Riak Architecture
    Client APIs
    Request Coordination
    Riak Core
    get put delete map-reduce
    HTTP Protocol Buffers
    Erlang local client
    membership
    consistent hashing handoff
    node-liveness
    gossip
    buckets
    vnodes
    storage backend
    JS Runtime
    vnode master

  82. riak is a solid foundation
    for building cloud
    services

  83. Coming Soon:
    Riak CS 1.4 (Q2)
    Swift API
    Keystone Integration
    S3 Features
    COPY Object
    Object Versioning
    Riak CS 1.5 (Q3)
    Server side encryption
    More S3 features
    Enhanced CloudStack
    and OpenStack
    integration

  84. Coming Later (2014)
    Erasure coding
    Reduced redundancy storage
    Native indexing/search

  85. RICON East - May 13-14,
    NYC
    A distributed systems conference for developers
    Speakers from Comcast, State Farm, UC Berkeley,
    Harvard, and many more
    Use discount code SVCloud20 for 20% off tickets
    http://ricon.io/east.html

  86. thanks!/questions?
    download riakcs:
    http://docs.basho.com/riakcs/latest/riakcs-downloads/
    hack riakcs:
    http://github.com/basho/riak_cs
    work at basho:
    http://bashojobs.theresumator.com
    follow basho on twitter:
    http://twitter.com/basho