
#18 Big Data 101


Data is pouring in, and traditional systems are no longer suited to either storing or processing it... In short, welcome to a typical 21st-century project: you find yourself plunged into the world of Big Data...

To approach this strange and vast world, with all its jargon and concepts, with some serenity, Duy Hai, technical evangelist at Datastax, will lay down a few necessary foundations:

- the CAP theorem, or "there is no magic in life"

- master/slave and masterless architectures, with the advantages and drawbacks of each

- the criteria an algorithm must meet to be distributable, or "how not to get fooled by marketing talk"

- Q&A

--------------------------------------------------------------------------------------------------------

Bio :

Duy Hai is a technical evangelist for Datastax, the commercial company behind Apache Cassandra. He is also a committer on Apache Zeppelin.

He splits his time between giving presentations/meetups/talks on Cassandra, contributing to open-source projects for the community (Achilles, Zeppelin), and helping projects that use Cassandra.

Before Datastax, Duy Hai was a freelance Java/Cassandra developer.

contact: [email protected]

Toulouse Data Science

November 29, 2016



Transcript

  1. BIG DATA 101, FOUNDATIONAL
    KNOWLEDGE FOR A NEW PROJECT
    IN 2016
    @doanduyhai
    Apache Cassandra™ Evangelist
    Apache Zeppelin™ Committer
    @doanduyhai
    1


  2. Who Am I ?
    Duy Hai DOAN
    Apache Cassandra™ evangelist
    •  talks, meetups, confs
    •  open-source devs (Achilles, Zeppelin,…)
    •  OSS Cassandra point of contact
    [email protected]
    ☞ @doanduyhai
    Apache Zeppelin™ committer
    @doanduyhai
    2


  3. Agenda
    1) Distributed systems: abstractions & models
    2) CAP theorem
    3) Distributed systems architecture: master/slave vs masterless
    4) Distributed algorithms: tricks & traps
    @doanduyhai
    3


  4. Distributed systems
    @doanduyhai
    4
    Abstractions and models


  5. Processing model
    Single threaded
Multi-threaded ☞ requires thread coordination, sometimes shared state, shared storage …
    Concurrency: multiple tasks A, B, C ... need to be executed but their
    execution order is undefined
    Parallelism: an impl of concurrency, execute multiple tasks at the
    same time using multiple threads
    @doanduyhai
    5


  6. Communication model
    Synchronous ☞ the client request blocks until getting an answer
    Asynchronous ☞ the client request always returns (very fast), the
    answer will arrive later, eventually
Usually, asynchronous requests are handled by multiple threads
    @doanduyhai
    6


  7. Time
    There is no absolute time in theory (even with atomic clocks!)
    Time-drift is unavoidable
•  unless you provide an atomic clock to each server
    •  unless you’re Google
    NTP is your friend ☞ configure it properly !
    @doanduyhai
    7


  8. Ordering of operations
    How to order operations ?
    What does before/after mean ?
    •  when clock is not 100% reliable
    •  when operations occur on multiple machines …
    •  … that live in multiple continents (1000s km distance)
    @doanduyhai
    8


  9. Ordering of operations
    Local/relative ordering is possible
    Global ordering ?
    •  either execute all operations on single machine (☞ master)
    •  or ensure time is perfectly synchronized on all machines executing the
    operations (really feasible ?)
    @doanduyhai
    9


  10. Known algorithms
    Lamport clock
    •  algorithm for message sender
    •  algorithm for message receiver
    •  partial ordering between a pair of (sender, receiver) is possible
    @doanduyhai
    10
    time = time+1;
    time_stamp = time;
    send(Message, time_stamp);
    (message, time_stamp) = receive();
    time = max(time_stamp, time)+1;
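The sender/receiver rules above can be sketched as a small Python class (a minimal sketch; the class and method names are illustrative, not from the deck):

```python
class LamportClock:
    """Logical clock giving a partial ordering of events between processes."""

    def __init__(self):
        self.time = 0

    def send(self):
        # sender rule: time = time + 1; stamp the outgoing message
        self.time += 1
        return self.time

    def receive(self, time_stamp):
        # receiver rule: time = max(time_stamp, time) + 1
        self.time = max(time_stamp, self.time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
ts = a.send()       # a.time = 1, message stamped with 1
b.receive(ts)       # b.time = max(1, 0) + 1 = 2, so "receive" is after "send"
```

Any event that causally follows another thus gets a strictly larger timestamp, which is exactly the pairwise partial ordering the slide mentions.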


  11. Vector Clock
    @doanduyhai
    11
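The deck shows only a diagram here. As a hedged sketch, a vector clock keeps one Lamport-style counter per node and merges incoming clocks by element-wise max; the names below are illustrative:

```python
class VectorClock:
    """One counter per node; detects causality, not just ordering."""

    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        # local event: increment our own entry
        self.clock[self.node_id] += 1
        return dict(self.clock)

    def merge(self, received):
        # on message receive: element-wise max, then count the receive itself
        for n, t in received.items():
            self.clock[n] = max(self.clock[n], t)
        return self.tick()


def happened_before(c1, c2):
    # c1 causally precedes c2 iff every entry is <= and at least one is <
    return all(c1[n] <= c2[n] for n in c1) and any(c1[n] < c2[n] for n in c1)
```

Two clocks where neither `happened_before` the other reveal concurrent updates, which is how masterless stores can detect write conflicts instead of silently losing one side.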


  12. Latency
    Def: time interval between request & response.
    Latency is composed of
    •  network delay: router/switch delay + physical medium delay
    •  OS delay (negligible)
    •  time to process the query by the target (disk access, computation …)
    @doanduyhai
    12


  13. Latency
    Speed of light physics
    •  ≈ 300 000 km/s in the void
•  ≈ 197 000 km/s in fiber optic cable (due to the refractive index)
London – New York distance as the crow flies ≈ 5500 km → 28 ms for a
one-way trip
Conclusion: a ping between London and New York cannot take less
than 56 ms
    @doanduyhai
    13
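The slide's arithmetic, spelled out:

```python
# Physical lower bound on London–New York latency over fiber.
distance_km = 5_500          # great-circle distance, as on the slide
fiber_speed_km_s = 197_000   # ~2/3 of c, due to the refractive index

one_way_ms = distance_km / fiber_speed_km_s * 1000   # ≈ 27.9 ms
round_trip_ms = 2 * one_way_ms                       # ≈ 55.8 ms
```

No routing, OS, or disk cost is included: this is the floor imposed by physics alone, which is why sub-10 ms cross-continent latency claims should raise an eyebrow.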


  14. @doanduyhai
    14
    "The mean latency is
    below 10ms"
    Database vendor X
    ✔︎ ✘︎


  15. @doanduyhai
    15
    "The mean latency is
    below 10ms"
    Database vendor X
    ✔︎ ✘︎


  16. Failure modes
•  Byzantine failure: same input, different outputs → application bug !!!
    •  Performance failure: response correct but arrives too late
    •  Omission failure: special case of performance failure, no response (timeout?)
    •  Crash failure: self-explanatory, server stops responding
    Byzantine failure → value issue
    Other failures → timing issue
    @doanduyhai
    16


  17. Failure
    Root causes
    •  Hardware: disk, CPU, …
    •  Software: packet lost, process crash, OS crash …
•  Workload-specific: flushing a huge file to a SAN
    •  JVM-related: long GC pause
    Defining failure is hard
    @doanduyhai
    17


  18. @doanduyhai
    18
    "A server fails when it does
    not respond to one or
    multiple request(s) in a
    timely manner"
    Usual meaning of failure


  19. Failure detection
    Timely manner ☞ timeout!
    Failure detector:
•  heart beat: binary state (up/down), too simple
    •  exponential backoff with threshold: better model
    •  phi accrual detector: advanced model using statistics
    @doanduyhai
    19


  20. Distributed consensus protocols
    Major properties:
    •  validity: the agreed value must have been proposed by some process
    •  termination: at least one non-faulty process eventually decides
    •  agreement: all processes agree on the same value
    @doanduyhai
    20


  21. Distributed consensus protocols
2-phase commit
    •  termination KO: the protocol can be blocked if the coordinator fails
    3-phase commit
    •  agreement KO: in case of network partition, possibility of inconsistent state
    Paxos, RAFT & Zab (Zookeeper)
    •  OK: satisfies 3 requirements
    •  QUORUM-based: requires a strict majority of nodes/replicas to be alive
    @doanduyhai
    21
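The quorum requirement in the last bullet fits in one line; a minimal sketch (`has_quorum` is an illustrative name, not from any consensus library):

```python
def has_quorum(alive, total):
    """Paxos/RAFT/Zab can make progress only while a strict
    majority of the replicas is alive."""
    return alive >= total // 2 + 1


# 3 replicas tolerate 1 failure, 5 replicas tolerate 2:
# has_quorum(2, 3) -> True, has_quorum(2, 5) -> False
```

This is why such clusters are deployed with an odd number of nodes: going from 3 to 4 replicas raises the quorum from 2 to 3 without tolerating any additional failure.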


  22. High availability
    How ? By having multiple copies: e.g. replicas
    Type of replicas
•  symmetric: no role, each replica is similar to the others
    •  asymmetric: master/slave roles. Write on the master replica → dispatch to slave
    replicas. Read on the master replica (inconsistent reads on slave replicas possible)
    When failure occurs
    •  symmetric: no operation needed, only online replicas will respond to requests
    •  asymmetric: master ownership has to be transferred to one elected replica
    @doanduyhai
    22


  23. CAP Theorem
    @doanduyhai
    23
    Pick 2 out of 3


  24. CAP theorem
    @doanduyhai
    24
    Conjecture by Brewer, formalized later in a paper (2002):
    The CAP theorem states that any networked shared-data system can
    have at most two of three desirable properties
    •  consistency (C): equivalent to having a single up-to-date copy of the data
    •  high availability (A): of that data (for updates)
    •  and tolerance to network partitions (P)


  25. CAP triangle
    @doanduyhai
    25


  26. CAP theorem revised (2012)
    @doanduyhai
    26
You cannot choose not to be partition-tolerant
    Choice is not that binary:
    •  in the absence of partition, you can tend toward CA
    •  when a partition occurs, choose your side (C or A)
    ☞ tunable consistency


  27. What is Consistency ?
    @doanduyhai
    27
    Meaning is different from the C of ACID
Read Uncommitted
    Read Committed
    Cursor Stability
    Repeatable Read
    Eventual Consistency
    Read Your Write
    Pipelined RAM
    Causal
    Snapshot Isolation Linearizability
    Serializability
    Without coordination
    Requires coordination


28. Consistency with a supposedly CP system
    @doanduyhai
    28
    DB name
    censored


  29. Consistency with some AP system
    @doanduyhai
    29
    Cassandra tunable consistency
Read Uncommitted
    Read Committed
    Cursor Stability
    Repeatable Read
    Eventual Consistency
    Read Your Write
    Pipelined RAM
    Causal
    Snapshot Isolation Linearizability
    Serializability
    Without coordination
    Requires coordination
    Consistency Level
    ONE


  30. Consistency with some AP system
    @doanduyhai
    30
    Cassandra tunable consistency
Read Uncommitted
    Read Committed
    Cursor Stability
    Repeatable Read
    Eventual Consistency
    Read Your Write
    Pipelined RAM
    Causal
    Snapshot Isolation Linearizability
    Serializability
    Without coordination
    Requires coordination
    Consistency Level
    QUORUM


  31. Consistency with some AP system
    @doanduyhai
    31
    Cassandra tunable consistency
Read Uncommitted
    Read Committed
    Cursor Stability
    Repeatable Read
    Eventual Consistency
    Read Your Write
    Pipelined RAM
    Causal
    Snapshot Isolation Linearizability
    Serializability
    Without coordination
    Requires coordination
    LightWeight
    Transaction
    Single partition writes
    are linearizable


  32. What is availability ?
    @doanduyhai
    32
    Ability to:
    •  Read in the case of failure ?
    •  Write in the case of failure ?
    Brewer definition: high availability of the data (for updates)


  33. Real world example
    @doanduyhai
    33
    Cassandra claims to be highly available, is it true ?
Some marketing slides even claim continuous availability (100%
    uptime), is it true ?


  34. Network partition scenario with Cassandra
    @doanduyhai
    34
[diagram: 13-node Cassandra cluster]
    Read/Write at
    Consistency level ONE
    ✔︎


  35. Network partition scenario with Cassandra
    @doanduyhai
    35
[diagram: the same 13-node cluster, now split by a network partition]
    Read/Write at
    Consistency level ONE
    ✘︎


  36. So how can it be highly available ???
    @doanduyhai
    36
Read/Write at Consistency level ONE
    [diagram: two 13-node Cassandra clusters, US DataCenter and EU DataCenter]

    Datacenter-aware load balancing strategy at driver level


  37. Architecture
    @doanduyhai
    37
    Master/Slave vs Masterless


  38. Pure master/slave architecture
    @doanduyhai
    38
    Single server for all writes, read can be done on master or any slave
    Advantages
    •  operations can be serialized
    •  easy to reason about
    Drawbacks
    •  cannot scale on write (read can be scaled)
    •  single point of failure (SPOF)


  39. Master/slave SPOF
    @doanduyhai
    39
[diagram: a write request goes to MASTER, which replicates to SLAVE1–SLAVE3]


  40. Multi-master/slave layout
    @doanduyhai
    40
[diagram: a proxy layer routes write requests to Shard1 (MASTER1 + SLAVE11–13) and Shard2 (MASTER2 + SLAVE21–23)]


  41. @doanduyhai
    41
    "Failure of a shard-master is
    not a problem because it
    takes less than 10ms to elect
    a slave into a master"
    Wrong Objection Rhetoric


  42. The wrong objection rhetoric
    @doanduyhai
    42
    How long does it take to detect that a shard-master has failed ?
•  heart-beat alone is not used because it is too simple
    •  so detection usually happens after a timeout and some successive retries
    Timeout is usually in tens of seconds
    •  you cannot write during this time period


  43. Multi-master/slave architecture
    @doanduyhai
    43
    Distribute data between shards. One master per shard
    Advantages
    •  operations can still be serialized in a single shard
    •  easy to reason about in a single shard
    •  no more big SPOF
    Drawbacks
    •  consistent only in a single shard (unless global lock)
    •  multiple small points of failure (SPOF inside a shard)


  44. Fake masterless architecture
    @doanduyhai
    44
    In reality, multi-master architecture …
    … but branded as shared-nothing architecture


  45. @doanduyhai
    45
    "XXX has a shared-nothing architecture"

    "There is no concept of master nodes,
    slave nodes, config nodes, name nodes,
    head nodes, etc, and all the software
    loaded on each node is identical"
    Database Vendor XXX


  46. @doanduyhai
    46
    "Each node replicates separate slices of
    its active data to multiple other nodes »

    "As mentioned earlier, by default replicas
    are only for the purpose of high
    availability and are not used in the
    normal serving of data. This allows XXX to
    be strongly consistent and applications
    immediately read their own writes by not
    ever requesting data from a node that it
    is not active on"
    Database Vendor XXX


  47. @doanduyhai
    47
    XXX


  48. @doanduyhai
    48
    "XXX has a shared-nothing architecture »

    "There is no concept of master nodes,
    slave nodes, config nodes, name nodes,
    head nodes, etc, and all the software
    loaded on each node is identical"

    Beware of
    marketing!
    Database Vendor XXX


  49. Masterless architecture
    @doanduyhai
    49
    No master, every node has equal role
    ☞ how to manage consistency then if there is no master ?
    ☞ which replica has the right value ?
    Some data-structures to the rescue:
    •  vector clock
    •  CRDT (Convergent Replicated Data Type)


  50. CRDT
    @doanduyhai
    50
    Riak
    •  Registers
    •  Counters
    •  Sets
    •  Maps
    •  …
Cassandra only offers a LWW-register (Last Write Wins)
    •  based on write timestamps
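A minimal sketch of such a LWW-register (illustrative, not Cassandra's actual implementation): the register keeps whichever value carries the highest timestamp, and replica reconciliation reuses the same rule.

```python
class LWWRegister:
    """Last-write-wins register: highest timestamp wins, no coordination."""

    def __init__(self):
        self.value, self.timestamp = None, -1

    def write(self, value, timestamp):
        # a write only lands if its timestamp beats the stored one
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

    def merge(self, other):
        # merging two replicas is the same rule, so it is
        # associative, commutative and idempotent
        self.write(other.value, other.timestamp)


r = LWWRegister()
r.write(32, 1050)   # client A, whose clock runs slightly ahead
r.write(33, 1020)   # client B writes later, but with an earlier timestamp
# r.value is still 32: the second update silently loses
```

This replays the write-conflict scenario of the following slides: the physically later `age=33` update is discarded because its timestamp is lower.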


  51. Timestamp, again …
    @doanduyhai
    51
But didn’t we say that timestamps are not really reliable ?
    Why not implement pure CRDTs ?
    Why choose the LWW-register ?
    •  because last-write-wins is still the most "intuitive"
    •  because conflict resolution with other CRDTs is the user’s responsibility
    •  because one should not be required to have a PhD in CS to use Cassandra


  52. Example of write conflict with Cassandra
    @doanduyhai
    52
[diagram: Cassandra cluster, coordinator local time 10:00:01.050]
    UPDATE users SET age=32 WHERE id=1
    age=32 @ 10:00:01.050 stored on the three replicas


  53. Example of write conflict with Cassandra
    @doanduyhai
    53
[diagram: same cluster, a second coordinator whose local time is 10:00:01.020]
    UPDATE users SET age=33 WHERE id=1
    each replica now holds age=32 @ 10:00:01.050 and age=33 @ 10:00:01.020


  54. Example of write conflict with Cassandra
    @doanduyhai
    54
[diagram: same scenario on read — age=32 wins on every replica because its timestamp 10:00:01.050 is higher than 10:00:01.020]


  55. Example of write conflict
    @doanduyhai
    55
    How can we cope with this ?
•  it is functionally rare to have an update on the same column by different
    clients at almost the same time (a few milliseconds apart)
    •  can also force the timestamp at client-side (but the clients now need to be synchronized
    …)
    •  can always use LightWeight Transaction to guarantee linearizability
    UPDATE user SET age = 33 WHERE id = 1 IF age = 32


  56. Masterless architecture
    @doanduyhai
    56
    Advantages
    •  no SPOF
    •  no failover procedure
    •  can achieve 0 downtime with correct tuning
    Drawbacks
    •  hard to reason about
    •  require some knowledge about distributed systems


  57. Distributed computation
    @doanduyhai
    57
    Necessary & sufficient conditions


  58. Can all operations be distributed ?
    @doanduyhai
    58
    Simple example
•  sum → yes, partial sums → global sum
    •  min → yes, local mins → global min
    •  average → yes, partial averages (+ counts) → global average
    •  standard deviation …. Ehhh
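A sketch answering the slide's open question: standard deviation does distribute, provided each machine ships a partial (count, sum, sum of squares) instead of a partial deviation. The function names are illustrative.

```python
import math


def partial(chunk):
    """Per-machine aggregate: (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    # merging partials is associative & commutative, so it can run
    # on different machines and in any order
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def stddev(agg):
    # population std dev via E[X^2] - E[X]^2
    n, s, sq = agg
    mean = s / n
    return math.sqrt(sq / n - mean * mean)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
two_machines = combine(partial(data[:4]), partial(data[4:]))
# stddev(two_machines) == 2.0, same as computing on a single machine
```

The trick generalizes: an operation distributes when you can find *some* associative & commutative intermediate representation, even if the final formula itself is neither.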


  59. Back to school
    @doanduyhai
    59
    An operation (+) is associative iff
    •  (a + b) + c = a + (b + c)
    An operation (+) is commutative iff
    •  a + b = b + a
    An operation (f()) is idempotent iff
    •  f(a) = b, f(f(a)) = b, … Ex: min(a) = min(min(a))


  60. Associativity & Commutativity
    @doanduyhai
    60
    Associative & commutative operation can be applied on different
    machines (associative) and out of order (commutative)
So if your operation is associative & commutative, it can be
    distributed


  61. Associativity & Commutativity
    @doanduyhai
    61
    Associative & commutative operation can be applied on different
    machines (associative) and out of order (commutative)
So if your operation is associative & commutative, it can be
    distributed


  62. Idempotency
    @doanduyhai
    62
Idempotency is important for retry & failure resilience
    Applying the same operation 2x, 3x on the same data yields the
    same final value
    If idempotency is not available, perform regular checkpoints
    (intermediate results) to be able to re-compute the results after
    failure
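A toy illustration of the point above, with illustrative names: an idempotent fold like `min` safely absorbs a redelivered message, whereas a non-idempotent one like addition silently double-counts it.

```python
def apply_min(state, x):
    """Idempotent fold: re-applying the same x leaves state unchanged."""
    return x if state is None else min(state, x)


def apply_add(state, x):
    """Non-idempotent fold: a retried message corrupts the total."""
    return x if state is None else state + x


messages = [5, 3, 8]
m = a = None
for x in messages:
    m, a = apply_min(m, x), apply_add(a, x)

m_retry = apply_min(m, 3)   # duplicate delivery after a retry: still 3
a_retry = apply_add(a, 3)   # 16 becomes 19 — hence the need for checkpoints
```

This is exactly why the slide recommends regular checkpoints when idempotency is not available: a non-idempotent computation cannot simply be replayed after a failure.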


  63. @doanduyhai
    63
Q & A


  64. @doanduyhai
    64
    Thank You !
