Implementing Real-Time Geo-Replication with ElasticSearch

Slide 1

Slide 1 text

http://goo.gl/mfQcmo
Implementing Real-Time Geo-Replication with ElasticSearch
Sunny Gleason
Boston ElasticSearch Meetup
October 14, 2014

Slide 2

Slide 2 text

var self = this;
• Sunny Gleason, All-Stack Engineer
• Previous: Amazon (web services), Ning, Startup(1..N)
• Current: SunnyCloud
• Short Story: A-Team you can call on to help you build (or rescue) cloud services, web & mobile applications
• Longer Story: Network of developers aiming to change the way businesses build applications

Slide 3

Slide 3 text

take these…
• PubNub Account : http://goo.gl/oJnTpv
• Quickstart Guide : http://goo.gl/eFHYLc
• Quickstart Bundle : http://goo.gl/d05uwg
• Changes Plugin : http://goo.gl/tLVhhP
• River Plugin : http://goo.gl/WhqvAu

Slide 4

Slide 4 text

elasticsearch clustering
• NODE : process / unit (~JVM)
• PRIMARY : master of a shard *
• REPLICA : copies of a shard
* at a particular moment in time

Slide 5

Slide 5 text

global network model
• Availability zone : a single data center *
• Region : a collection of data centers within ~1ms
* for the purposes of fault-tolerance

Slide 6

Slide 6 text

geo-replication

Slide 7

Slide 7 text

how to geo-replicate in ElasticSearch?
• Create routing configuration for global index and shard placement
• Update each ElasticSearch cluster with its own version of the configuration
  • IAD: [me, SFO, DUB]
  • SFO: [IAD, me, DUB]
  • DUB: [IAD, SFO, me]
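The per-site lists above can be derived from one global site list. The following is a minimal sketch of that idea only; `site_views` is a hypothetical helper, not an ElasticSearch setting or API, and the real routing configuration would carry far more state than a list of names.

```python
def site_views(sites):
    """Return, for each site, the global site list with that site shown as 'me'."""
    return {site: ["me" if s == site else s for s in sites] for site in sites}


# Hypothetical helper applied to the three sites named on the slide.
views = site_views(["IAD", "SFO", "DUB"])
for site, view in views.items():
    print(f"{site}: [{', '.join(view)}]")
```

Every cluster still needs its own copy of this view, which is where the global configuration state comes from.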

Slide 8

Slide 8 text

issues w/ ElasticSearch geo-replication out-of-the-box
• lots of global configuration state
• geo-distant sites see each other’s internals (violates encapsulation)
• N:N networking - topologically inefficient
• requires network connectivity among all nodes
• reasoning about failure is extremely difficult

Slide 9

Slide 9 text

our vision
• what if each logical data store had a publish/subscribe channel?
• what if each primary cluster could publish changes to that channel?
• what if each replica cluster could simply listen on that channel and apply updates to its local index?
• what if there was smart routing so that global update propagation follows a minimal spanning tree?
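To make the "minimal spanning tree" idea concrete, here is a small sketch using Prim's algorithm over a table of inter-site latencies. The latency numbers are invented for illustration, and `minimum_spanning_tree` is a hypothetical helper; the talk does not say how the routing would actually be computed.

```python
import heapq


def minimum_spanning_tree(latency_ms, root):
    """Prim's algorithm: return tree edges (parent, child, cost) reaching every site."""
    visited = {root}
    frontier = [(cost, root, peer) for peer, cost in latency_ms[root].items()]
    heapq.heapify(frontier)
    edges = []
    while frontier and len(visited) < len(latency_ms):
        cost, parent, child = heapq.heappop(frontier)
        if child in visited:
            continue  # a cheaper edge already reached this site
        visited.add(child)
        edges.append((parent, child, cost))
        for peer, peer_cost in latency_ms[child].items():
            if peer not in visited:
                heapq.heappush(frontier, (peer_cost, child, peer))
    return edges


# Hypothetical latencies in milliseconds (assumed values, not measurements).
LATENCY_MS = {
    "IAD": {"SFO": 70, "DUB": 80},
    "SFO": {"IAD": 70, "DUB": 140},
    "DUB": {"IAD": 80, "SFO": 140},
}
tree = minimum_spanning_tree(LATENCY_MS, "SFO")
```

With these assumed numbers, an update originating in SFO would travel to IAD and from there to DUB, rather than crossing the expensive SFO-DUB link directly.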

Slide 10

Slide 10 text

what we’d do
• write an ElasticSearch Changes plugin so that updates are published to a channel
• write an ElasticSearch River plugin so that updates from channel(s) could be applied to the local index
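The division of labour between the two plugins can be sketched with an in-memory channel. This is a toy model under stated assumptions: `Channel`, `PrimaryIndex` and `ReplicaIndex` are hypothetical classes, the channel is a plain Python object rather than PubNub, and the "index" is a dictionary rather than ElasticSearch.

```python
class Channel:
    """In-memory stand-in for a publish/subscribe channel (not the PubNub API)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


class PrimaryIndex:
    """Plays the role of the Changes plugin: every local write is also published."""

    def __init__(self, channel):
        self.docs = {}
        self._channel = channel

    def index(self, doc_id, source):
        self.docs[doc_id] = source
        self._channel.publish({"op": "INDEX", "id": doc_id, "source": source})

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)
        self._channel.publish({"op": "DELETE", "id": doc_id})


class ReplicaIndex:
    """Plays the role of the River plugin: applies channel messages locally."""

    def __init__(self, channel):
        self.docs = {}
        channel.subscribe(self.apply)

    def apply(self, message):
        if message["op"] == "DELETE":
            self.docs.pop(message["id"], None)
        else:
            self.docs[message["id"]] = message["source"]


channel = Channel()
primary = PrimaryIndex(channel)
replica = ReplicaIndex(channel)
primary.index("1", {"title": "hello"})
primary.index("2", {"title": "world"})
primary.delete("1")
```

The primary and the replica only share the channel, so neither needs to know the other's cluster internals.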

Slide 11

Slide 11 text

doesn’t this already exist?
• Amazon Simple Queue Service (SQS): 1:1 messaging, not global, not easy to make M:N
• RabbitMQ / AMQP : probably more challenging than ElasticSearch to set up globally in a fault-tolerant manner
• It’s hard to find or create a global system with consistent availability & real-time performance

Slide 12

Slide 12 text

hello pubnub
OK, we have a Data Stream Network…

Slide 13

Slide 13 text

PubNub properties
• global data stream network
• 14 global data centers
• global update propagation in < 250ms
• publish/subscribe with presence & history

Slide 14

Slide 14 text

hello elasticsearch
and away we go…

Slide 15

Slide 15 text

hello elasticsearch

Slide 16

Slide 16 text

PubNub Changes Plugin
• extends IndexingOperationListener
• Attaches to PRIMARY indexes, publishes 3 types of events to PubNub channel
  • CREATE, INDEX, DELETE
• Create & Index require OpType to be set
• Feedback welcome!
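The slide names the three event types but not the message format. The sketch below shows one plausible JSON encoding; the field names (`op`, `index`, `type`, `id`, `version`, `source`) and the `encode_event` helper are assumptions made for illustration, not the plugin's actual wire format.

```python
import json

OPERATIONS = ("CREATE", "INDEX", "DELETE")


def encode_event(op, index, doc_type, doc_id, version, source=None):
    """Serialize one change event to JSON (hypothetical schema, not the plugin's)."""
    if op not in OPERATIONS:
        raise ValueError(f"unknown operation: {op}")
    if op != "DELETE" and source is None:
        raise ValueError(f"{op} events need a document source")
    event = {"op": op, "index": index, "type": doc_type, "id": doc_id, "version": version}
    if source is not None:
        event["source"] = source
    return json.dumps(event, sort_keys=True)


# Example: an INDEX event carries the document body, a DELETE event does not.
index_event = json.loads(encode_event("INDEX", "tweets", "tweet", "42", 3, {"user": "sunny"}))
delete_event = json.loads(encode_event("DELETE", "tweets", "tweet", "42", 4))
```

Carrying the operation type explicitly lets a subscriber distinguish a create from an overwrite, which is presumably why the OpType has to be set on the publishing side.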

Slide 17

Slide 17 text

PubNub River Plugin
• implements River
• Subscribes to PubNub channel
• Replays operations from channel against local index(es)
• Not quite happy when version conflicts occur
• Feedback welcome!
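One way to think about version conflicts during replay is a "highest version wins" policy, sketched below. This is an assumed policy chosen for illustration, not a description of what the River plugin does; `replay` and the event fields are hypothetical.

```python
def replay(local, events):
    """Apply events to `local` (doc_id -> (version, source)); return skipped conflicts.

    An event is applied only if its version is strictly greater than the
    version already stored; otherwise it is reported as a conflict.
    """
    conflicts = []
    for event in events:
        doc_id, version = event["id"], event["version"]
        current_version = local[doc_id][0] if doc_id in local else 0
        if version <= current_version:
            conflicts.append(event)
            continue
        if event["op"] == "DELETE":
            # Keep a tombstone so that later stale writes are still rejected.
            local[doc_id] = (version, None)
        else:
            local[doc_id] = (version, event["source"])
    return conflicts


local_index = {}
skipped = replay(local_index, [
    {"op": "INDEX", "id": "a", "version": 1, "source": {"n": 1}},
    {"op": "INDEX", "id": "a", "version": 3, "source": {"n": 3}},
    {"op": "INDEX", "id": "a", "version": 2, "source": {"n": 2}},  # arrives late
    {"op": "DELETE", "id": "a", "version": 4},
    {"op": "INDEX", "id": "a", "version": 4, "source": {"n": 4}},  # same version as delete
])
```

A policy like this makes replay idempotent, but it silently drops the losing write, so it does not by itself solve the multi-master case.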

Slide 18

Slide 18 text

interesting aspects
• presence support allows operational insight (similar to a chat room where “users” are the cluster members)
• history support allows messages to be replayed (configurable message retention)
• built-in transport & message-level encryption can provide a reasonable level of security for many use cases
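Replay from history with bounded retention can be modelled as a fixed-size log with sequence numbers. The sketch below is an in-memory analogy only: `HistoryLog` is a hypothetical class, and PubNub's history feature has its own API and retention semantics that are not reproduced here.

```python
from collections import deque


class HistoryLog:
    """Bounded message log: keeps only the most recent `retention` messages."""

    def __init__(self, retention):
        self._messages = deque(maxlen=retention)
        self._next_seq = 1

    def append(self, payload):
        seq = self._next_seq
        self._messages.append((seq, payload))
        self._next_seq += 1
        return seq

    def since(self, last_seen):
        """Return messages after `last_seen`, or None if some were already dropped."""
        oldest = self._messages[0][0] if self._messages else self._next_seq
        if last_seen + 1 < oldest:
            return None  # gap: retention expired, a full resync would be needed
        return [m for m in self._messages if m[0] > last_seen]


log = HistoryLog(retention=3)
for payload in ["a", "b", "c", "d", "e"]:
    log.append(payload)

caught_up = log.since(3)   # replica that last saw message 3
too_far_behind = log.since(1)  # replica that fell behind the retention window
```

A replica that reconnects within the retention window can catch up from history alone; one that falls further behind needs some other repair mechanism.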

Slide 19

Slide 19 text

related work
• PubNub MongoDB Plugin : http://goo.gl/4etuYK
• PubNub Redis Plugin : http://goo.gl/2Sf33N
Both allow updates to be propagated to, or replayed from, a PubNub channel

Slide 20

Slide 20 text

future work
• handling batch calls
• versioning in a multi-master world
• finding and fixing failures in a distributed model
• semantics and better ordering guarantees to support higher update rates
• anti-entropy, possibly using: ElasticSearch transaction log, PubNub history, checksum trees
• operational insight using presence features
• more polyglot persistence use cases
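As a rough illustration of the checksum-tree idea for anti-entropy, the sketch below hashes documents into buckets, computes one checksum per bucket plus a root over all buckets, and compares two sites. It is a two-level simplification written for this transcript; `checksums` and `differing_buckets` are hypothetical helpers, and the talk lists this only as possible future work.

```python
import hashlib


def bucket_of(doc_id, buckets):
    """Map a document id to a stable bucket number."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % buckets


def checksums(docs, buckets=8):
    """Return (root, leaves) for `docs`, a mapping of doc_id -> version."""
    grouped = [[] for _ in range(buckets)]
    for doc_id, version in docs.items():
        grouped[bucket_of(doc_id, buckets)].append(f"{doc_id}:{version}")
    leaves = [
        hashlib.sha256("\n".join(sorted(items)).encode("utf-8")).hexdigest()
        for items in grouped
    ]
    root = hashlib.sha256("".join(leaves).encode("utf-8")).hexdigest()
    return root, leaves


def differing_buckets(local_docs, remote_docs, buckets=8):
    """Compare roots first; only if they differ, find the buckets that disagree."""
    local_root, local_leaves = checksums(local_docs, buckets)
    remote_root, remote_leaves = checksums(remote_docs, buckets)
    if local_root == remote_root:
        return []
    return [i for i in range(buckets) if local_leaves[i] != remote_leaves[i]]


site_a = {"doc-1": 1, "doc-2": 5, "doc-3": 2}
site_b = {"doc-1": 1, "doc-2": 4, "doc-3": 2}  # doc-2 is one version behind
out_of_sync = differing_buckets(site_a, site_b)
```

Two sites that agree exchange a single root hash; when they disagree, only the documents in the differing buckets need to be compared or re-sent.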

Slide 21

Slide 21 text

… and you’re done! (for now)
questions/feedback?
thank you so much!