Riak Enterprise Revisited (RICON East 2013)

Presented by Chris Tilt at RICON East 2013.

Riak Enterprise has undergone an overhaul since its 1.2 days, mostly around Multi Data Center Replication. We'll talk about the "Brave New World" of replication in depth: how it manages concurrent TCP/IP connections, Realtime Sync, and the technology preview of Active Anti-Entropy Fullsync. Finally, we'll peek over the horizon at new features such as chaining of Realtime Sync messages across multiple clusters.

About Chris

Chris has 25 years in the high technology industry as a software developer, CTO, designer, and startup co-founder. He discovered Erlang indirectly through development of telecommunications test equipment at Tektronix, which launched a new passion in functional programming. During the Dot Com days, he co-founded a startup using OCaml as the core language for graph-theoretic analysis of web sites. A fear of compilers led to an intense study and an eventual job as the lead on a Java-to-native assembly language compiler for a Massively Parallel Processor Array at Ambric, also written in OCaml. Thinking about how to scale software concurrently led right back to Erlang and Basho, where he works on the Enterprise project team. Chris develops iPhone and Android applications as a hobby. He and his son, Geordie, co-designed a concurrent programming language called 'G' which compiles to C, and a LEGO-sized underwater ROV - both targeted for Arduino. When not programming, he enjoys rocket stoves, cob structures, remodeling, Minecraft, and Kendo.

Basho Technologies

May 13, 2013

Transcript

  1. Enterprise Reloaded: Replication in Record Time. Chris Tilt (ctilt@basho.com), Basho Technologies, RICON East 2013

  2. Talk • Riak Enterprise Overview • Focus: the “Brave New World” of replication • New Features in 1.3 (Released) • What’s coming in 1.4 • Futures

  3. Riak Enterprise • Built upon open-source Riak • 24 x 7 Legendary Basho Technical Support • Closed-source • Extended Monitoring • Multi Data Center Replication*

  4. Multi Data Center Replication Cluster A Cluster B

  5. Multi Data Center Replication Cluster A Cluster B source sink

  6. Multi Data Center Replication Cluster A Cluster B riak objects source sink

  7. Use Cases • Primary cluster with hot failover • Availability Zones: active-active clusters create data locality and reduce latency • Reporting/Analytics

  8. Primary with failover Cluster A Cluster B Uni-directional sync DNS Director

  9. Primary with failover Cluster A Cluster B Uni-directional sync DNS Director

  10. Primary with failover Cluster A Cluster B DNS Director

  11. Availability Zones DNS Director Cluster Cluster Cluster Cluster Cluster Cluster

  12. Availability Zones DNS Director Cluster Cluster Cluster Cluster Cluster Cluster

  13. Reporting/Analytics Cluster A Cluster B Uni-directional sync Report Generator

  14. Replication Protocols Cluster A Cluster B riak objects source sink Realtime

  15. Replication Protocols Cluster A Cluster B riak objects source sink Realtime Fullsync

  16. Replication Protocols Cluster A Cluster B riak objects source sink Realtime Fullsync Proxy GET (for Riak CS)

  17. The Old Way Cluster A Cluster B Realtime Fullsync Proxy GET ... ... single, multiplexed connection

  18. The Old Way Cluster A Cluster B Realtime Fullsync Proxy GET ... ... “listeners”

  19. The Old Way Cluster A Cluster B Realtime Fullsync Proxy GET ... ... “sites”

  20. Lessons Learned • A single shared TCP/IP connection, for all replication, doesn’t scale well.

  21. Lessons Learned • A single shared TCP/IP connection, for all replication, doesn’t scale well. • Make it easy to configure!

  22. Lessons Learned • A single shared TCP/IP connection, for all replication, doesn’t scale well. • Make it easy to configure! • Dropping realtime objects during intermittent connectivity is not OK.

  23. Lessons Learned • A single shared TCP/IP connection, for all replication, doesn’t scale well. • Make it easy to configure! • Dropping realtime objects during intermittent connectivity is not OK. • Networks are unreliable, even in high-end data centers. Connectivity is hard.

  24. Lessons Learned • A single shared TCP/IP connection, for all replication, doesn’t scale well. • Make it easy to configure! • Dropping realtime objects during intermittent connectivity is not OK. • Networks are unreliable, even in high-end data centers. Connectivity is hard. • Load balancing is critical for fullsync.

  25. Big Pain Motivates Big Ideas.

  26. The Brave New World: Riak 1.3 • Ground-up re-write of replication • Node to Node connections • All connections start at a single IP:Port • Each protocol has its own channel

  27. The Brave New World: Riak 1.3 • Realtime queues separate from connections • Fullsync coordinator controls workload • Connection Manager (now moved to Core) handles backoff and retry • Much simpler command and configuration

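For concreteness, the knobs behind this live in the riak_repl section of app.config. A minimal sketch, with key names as documented for Riak EE replication and purely illustrative values:

    %% app.config (illustrative values)
    {riak_repl, [
        {data_root, "/var/lib/riak/riak_repl"},
        %% bound on the realtime queue; oldest objects are dropped when full
        {rtq_max_bytes, 104857600},    %% 100 MB
        {fullsync_on_connect, true}    %% start a fullsync when a sink connects
    ]}
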
  28. Praise • Andy Gross (the epoch) • Andrew Thomson • Chris Tilt • Dave Parfitt • Jon Meredith • Micah Warren

  29. Node to Node Connections Cluster A Cluster B source sink Realtime 1:1

  30. Realtime Write Cluster A Cluster B source sink Client PUT

  31. Realtime Write Cluster A Cluster B source sink Client PUT

  32. Realtime Write Cluster A Cluster B source sink Client PUT

  33. Realtime Write Cluster A Cluster B source sink Client PUT

  34. Realtime Queues • Hook on post-commit • Push objects to ETS queue • RT connection pulls from queue • Bounded, drop objects when full • On shutdown, proxy objects to peer node

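A toy sketch of that bounded queue (hypothetical code, not riak_repl's actual implementation), assuming the drop-when-full policy discards the oldest entry:

    %% rt_queue_sketch.erl: bounded ETS-backed FIFO (illustrative only)
    -module(rt_queue_sketch).
    -export([new/1, push/2, pull/1]).

    new(Max) ->
        Tab = ets:new(rtq, [ordered_set, private]),
        {Tab, Max, 0}.                                  %% {table, bound, next seq}

    push({Tab, Max, Seq}, Obj) ->
        ets:insert(Tab, {Seq, Obj}),
        case ets:info(Tab, size) > Max of
            true  -> ets:delete(Tab, ets:first(Tab));   %% full: drop oldest
            false -> true
        end,
        {Tab, Max, Seq + 1}.

    pull({Tab, _Max, _Seq}) ->
        case ets:first(Tab) of
            '$end_of_table' -> empty;
            Key ->
                [{_, Obj}] = ets:lookup(Tab, Key),      %% dequeue oldest
                ets:delete(Tab, Key),
                Obj
        end.
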
  35. Fullsync to the Max = N:M Cluster A Cluster B source sink e.g. Fullsync 2:2

  36. Fullsync Coordinator workload balancing • Schedules each partition on its vnode • Reservation system on each sink node • Respects max_fssource_node and a “busy” response from a sink node • Caps connections at max_fssource_cluster

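These caps are app.config settings too; a hedged example (names per the Riak EE docs; max_fssink_node is the sink-side counterpart, not named on the slide; values illustrative):

    {riak_repl, [
        {max_fssource_cluster, 5},   %% total source-side fullsync workers per cluster
        {max_fssource_node, 1},      %% concurrent fullsync workers per source node
        {max_fssink_node, 1}         %% concurrent fullsync workers accepted per sink node
    ]}
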
  37. Fullsync to the Max = N:M Cluster A Cluster B source sink Fullsync 2:2, max_fssource_cluster = 10

  38. Fullsync Coordinator Cluster A Cluster B source sink Fullsync 2:2, max_fssource_cluster = 5

  39. Fullsync Write Cluster A P Cluster B P source sink

  40. Riak CS Replication (blocks not replicated) Cluster A Cluster B source sink Client GET block

  41. Riak CS Replication Cluster A Cluster B source sink Client GET block of manifest

  42. Riak CS Replication Cluster A Cluster B source sink Proxy GET Client GET block

  43. Riak CS Replication Cluster A Cluster B source sink Proxy GET Client GET block

  44. Riak CS Replication Cluster A Cluster B source sink Client GET block

  45. Riak CS Replication Cluster A Cluster B source sink Client GET block

  46. Riak Cloud Storage See Reid Draper’s Riak CS talk tomorrow!

  47. Cluster Manager For Easier Configuration • riak-repl clustername A • riak-repl connect 192.168.1.100 • riak-repl realtime enable B • riak-repl realtime start B

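The slide shows only the realtime pair; assuming the same pattern, the fullsync protocol is driven by the analogous commands from the Riak EE docs of the period:

    riak-repl fullsync enable B
    riak-repl fullsync start B
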
  48. Coming in 1.4 • Secure Sockets Layer • Network Address Translation • Proxied GET for Riak CS • Fine-tuned fullsync concurrency controls

  49. Coming in 1.4 • Better per-connection stats for RT and FS • Realtime chaining amongst multiple clusters • Technology preview of AAE Fullsync

  50. Realtime Chaining across Multiple Data Centers Cluster A Cluster B Cluster C

  51. Realtime Write Cluster A Cluster B Cluster C Client PUT

  52. Realtime Write Cluster A Cluster B Cluster C Client PUT

  53. Realtime Write Cluster A Cluster B Cluster C Client

  54. Realtime Write Cluster A Cluster B Cluster C Client

  55. Realtime Write Cluster A Cluster B Cluster C Client Oops!

  56. Realtime Chaining Cluster A Cluster B Cluster C Client

  57. Realtime Chaining Cluster A Cluster B Cluster C Client

  58. Realtime Chaining Cluster A Cluster B Cluster C Client

  59. Faster Fullsync • Current Keylist method is slow • Riak 1.3 has Active Anti-Entropy • Technology Preview: AAE Fullsync!

  60. Keylist Compare • Source and sink each create a keylist file • For all keys, write hash of key and object to file • Send the keylist file from sink to source • Source side compares its file to sink’s file

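A toy illustration of that compare (hypothetical code, not riak_repl's): each side builds a sorted list of {Key, Hash} pairs, and a single merge pass yields the keys needing repair; erlang:phash2/1 stands in for the real hash function:

    %% build a sorted keylist from {Key, Object} pairs
    keylist(Objects) ->
        lists:sort([{K, erlang:phash2({K, Obj})} || {K, Obj} <- Objects]).

    %% merge-compare two sorted keylists; returns keys that differ or exist
    %% on only one side
    diff(Src, Snk) -> diff(Src, Snk, []).

    diff([], Snk, Acc) -> lists:reverse(Acc, [K || {K, _} <- Snk]);
    diff(Src, [], Acc) -> lists:reverse(Acc, [K || {K, _} <- Src]);
    diff([{K, H} | Src], [{K, H} | Snk], Acc) -> diff(Src, Snk, Acc);         %% match
    diff([{K, _} | Src], [{K, _} | Snk], Acc) -> diff(Src, Snk, [K | Acc]);   %% differ
    diff([{K1, _} | _] = Src, [{K2, _} | Snk], Acc) when K1 > K2 ->
        diff(Src, Snk, [K2 | Acc]);                                           %% sink only
    diff([{K1, _} | Src], Snk, Acc) ->
        diff(Src, Snk, [K1 | Acc]).                                           %% source only

Building the two lists already touches every key on both sides, which is exactly the linear cost the next slide calls out.
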
  61. Keylist Compare • Each cluster has to fold over the entire key space • Time is linear with the number of keys, K • Network traffic is also linear with K • A fullsync with 0% updates still costs K. Ouch. • Discourages frequent synchronizing. Boo.

  62. Active Anti-Entropy • AAE maintains a hash tree • real-time updates • persistent • non-blocking

  63.–70. (Diagram slides: stepping through an AAE hash exchange between source and sink hash trees.)

  71. Fullsync AAE • Implement the AAE exchange over TCP/IP • Expect compare time to be linear with % differences • No additional read load from a fold :-)

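The reasoning behind that expectation (a gloss, not stated on the slides): an exchange only descends into subtrees whose hashes disagree, so for D differing keys in a tree over N segments with branching factor b, the compare cost is roughly O(D x log_b(N)) rather than the keylist's O(K), and a 0% update reduces to comparing a few root hashes.
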
  72. Fullsync Benchmarks: measure key compare time. Cluster A (source, 1M keys) Cluster B (sink, 1M − (%missing × 1M) keys)

  73. Keylist vs AAE (chart): key compare time in seconds plotted against % missing keys, for the Keylist and AAE strategies.

  74. What’s UP?

  75. What’s Up? • Fast local cloning of a data center • Per-bucket replication between multiple data centers • Support AAE fullsync over clusters of differing ring sizes • Replication of CRDTs across clusters • Strong consistency

  76. Thanks! ctilt@basho.com