#18 Big Data 101

Data is pouring in, and traditional systems are no longer suited to either storing or processing it... In short, welcome to a typical 21st-century project: you have just been thrown into the world of Big Data...

To approach this strange and vast world, with all its jargon and concepts, with some serenity, Duy Hai, technical evangelist at Datastax, will lay down a few necessary foundations:

- the CAP theorem, or "there is no magic in life"

- master/slave and masterless architectures, with the advantages and drawbacks of each

- the criteria for deciding whether an algorithm can be distributed, or "how not to be fooled by marketing talk"

- Q&A

--------------------------------------------------------------------------------------------------------

Bio:

Duy Hai is a technical evangelist for Datastax, the commercial company behind Apache Cassandra. He is also a committer on Apache Zeppelin.

He splits his time between giving presentations/meetups/talks on Cassandra, developing open-source projects for the community (Achilles, Zeppelin), and helping projects that use Cassandra.

Before Datastax, Duy Hai was a freelance Java/Cassandra developer.

Contact: duy_hai.doan@datastax.com


Toulouse Data Science

November 29, 2016

Transcript

  1. 1.

    BIG DATA 101, FOUNDATIONAL KNOWLEDGE FOR A NEW PROJECT IN 2016 @doanduyhai Apache Cassandra™ Evangelist Apache Zeppelin™ Committer @doanduyhai 1
  2. 2.

    Who Am I ? Duy Hai DOAN Apache Cassandra™ evangelist •  talks, meetups, confs •  open-source dev (Achilles, Zeppelin, …) •  OSS Cassandra point of contact ☞ duy_hai.doan@datastax.com ☞ @doanduyhai Apache Zeppelin™ committer
  3. 3.

    Agenda 1) Distributed systems: abstractions & models 2) CAP theorem 3) Distributed systems architecture: master/slave vs masterless 4) Distributed algorithms: tricks & traps
  4. 5.

    Processing model Single-threaded Multi-threaded ☞ requires thread coordination, sometimes shared state, shared storage … Concurrency: multiple tasks A, B, C … need to be executed but their execution order is undefined Parallelism: an implementation of concurrency, executing multiple tasks at the same time using multiple threads
  5. 6.

    Communication model Synchronous ☞ the client request blocks until it gets an answer Asynchronous ☞ the client request always returns (very fast), the answer will arrive later, eventually Usually, asynchronous requests are handled by multiple threads
  6. 7.

    Time There is no absolute time in theory (even with atomic clocks!) Time drift is unavoidable •  unless you provide an atomic clock to each server •  unless you're Google NTP is your friend ☞ configure it properly!
  7. 8.

    Ordering of operations How to order operations? What does before/after mean? •  when the clock is not 100% reliable •  when operations occur on multiple machines … •  … that live on multiple continents (1000s of km apart)
  8. 9.

    Ordering of operations Local/relative ordering is possible Global ordering? •  either execute all operations on a single machine (☞ master) •  or ensure time is perfectly synchronized on all machines executing the operations (really feasible?)
  9. 10.

    Known algorithms Lamport clock •  algorithm for the message sender: time = time+1; time_stamp = time; send(Message, time_stamp); •  algorithm for the message receiver: (message, time_stamp) = receive(); time = max(time_stamp, time)+1; •  partial ordering between a pair of (sender, receiver) is possible
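The sender/receiver rules on this slide can be sketched in a few lines of Python (a minimal illustration; the class and variable names are mine, not from the talk):

```python
class LamportClock:
    """Minimal Lamport logical clock, one instance per process."""

    def __init__(self):
        self.time = 0

    def send(self):
        # Sender rule: increment, then stamp the outgoing message.
        self.time += 1
        return self.time

    def receive(self, timestamp):
        # Receiver rule: take the max of local and received time, then +1.
        self.time = max(self.time, timestamp) + 1
        return self.time

# Two processes exchanging one message:
a, b = LamportClock(), LamportClock()
ts = a.send()      # a.time becomes 1, the message is stamped 1
b.receive(ts)      # b.time becomes max(0, 1) + 1 == 2
```

Because the receiver always ends up with a timestamp strictly greater than the sender's, a "happened-before" order is recoverable for that (sender, receiver) pair, which is exactly the partial ordering the slide mentions.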
  10. 12.

    Latency Def: the time interval between request & response. Latency is composed of •  network delay: router/switch delay + physical medium delay •  OS delay (negligible) •  time to process the query on the target (disk access, computation …)
  11. 13.

    Latency Speed-of-light physics •  ≈ 300 000 km/s in a vacuum •  ≈ 197 000 km/s in a fiber optic cable (due to the refractive index) London – New York distance as the crow flies ≈ 5500 km → 28ms for a one-way trip Conclusion: a ping between London and New York cannot take less than 56ms
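The slide's bound is simple arithmetic; here is a quick sanity check of those numbers (the values are taken directly from the slide):

```python
# Back-of-the-envelope check of the physical latency floor.
distance_km = 5500            # London–New York, roughly as the crow flies
speed_fiber_km_s = 197_000    # ~2/3 of c, due to fiber's refractive index

one_way_ms = distance_km / speed_fiber_km_s * 1000
round_trip_ms = 2 * one_way_ms

print(f"one-way ≈ {one_way_ms:.0f} ms, round trip ≥ {round_trip_ms:.0f} ms")
```

Real pings are higher still, since routers, the OS, and the target's processing time all add to this irreducible floor.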
  12. 16.

    Failure modes •  Byzantine failure: same input, different outputs → application bug!!! •  Performance failure: the response is correct but arrives too late •  Omission failure: special case of performance failure, no response (timeout?) •  Crash failure: self-explanatory, the server stops responding Byzantine failure → value issue Other failures → timing issue
  13. 17.

    Failure Root causes •  Hardware: disk, CPU, … •  Software: packet loss, process crash, OS crash … •  Workload-specific: flushing a huge file to a SAN •  JVM-related: long GC pause Defining failure is hard
  14. 18.

    "A server fails when it does not respond to one or multiple request(s) in a timely manner" Usual meaning of failure
  15. 19.

    Failure detection Timely manner ☞ timeout! Failure detectors: •  heartbeat: binary state (up/down), too simple •  exponential backoff with threshold: a better model •  phi accrual detector: an advanced model using statistics
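To illustrate the statistical approach, here is a heavily simplified phi-style accrual detector in Python. This is a sketch under a strong assumption (heartbeat inter-arrival times are exponentially distributed with the observed mean); the real phi accrual detector, as used in Cassandra, uses a more elaborate distribution. All names are mine:

```python
import math

class SimplePhiDetector:
    """Toy accrual failure detector: suspicion grows continuously with
    silence instead of flipping between a binary up/down state."""

    def __init__(self):
        self.intervals = []
        self.last_beat = None

    def heartbeat(self, now):
        # Record the inter-arrival time of each heartbeat.
        if self.last_beat is not None:
            self.intervals.append(now - self.last_beat)
        self.last_beat = now

    def phi(self, now):
        # phi = -log10(P(no heartbeat for this long)); under the
        # exponential assumption P = exp(-elapsed / mean_interval),
        # so phi = elapsed / (mean_interval * ln 10).
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_beat
        return elapsed / (mean * math.log(10))

d = SimplePhiDetector()
for t in (0, 1, 2, 3):        # steady heartbeats every second
    d.heartbeat(t)
print(d.phi(3.5))             # short silence → low suspicion
print(d.phi(8.0))             # long silence → high suspicion
```

The caller then picks a phi threshold (e.g. suspect the node when phi exceeds some value) instead of a hard timeout, which adapts to the observed heartbeat rhythm.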
  16. 20.

    Distributed consensus protocols Major properties: •  validity: the agreed value must have been proposed by some process •  termination: at least one non-faulty process eventually decides •  agreement: all processes agree on the same value
  17. 21.

    Distributed consensus protocols 2-phase commit •  termination KO: the protocol can be blocked if the coordinator fails 3-phase commit •  agreement KO: in case of a network partition, possibility of an inconsistent state Paxos, RAFT & Zab (Zookeeper) •  OK: satisfy all 3 requirements •  QUORUM-based: require a strict majority of nodes/replicas to be alive
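The "strict majority" requirement is worth spelling out, because it is what makes quorum-based protocols safe: any two majorities of the same cluster must share at least one node. A small sketch of that arithmetic:

```python
# Quorum arithmetic behind "strict majority of replicas must be alive".
# With N replicas, a quorum is floor(N/2) + 1. Two quorums always
# overlap (q + q > N), so two conflicting decisions can never both
# gather a quorum -- the overlap node would have to vote twice.

def quorum(n: int) -> int:
    return n // 2 + 1

for n in (3, 5, 7):
    q = quorum(n)
    assert q + q > n                    # any two quorums intersect
    print(f"N={n}: quorum={q}, tolerates {n - q} failed replicas")
```

This also shows the availability cost: a 5-node cluster keeps deciding with 2 nodes down, but loses the ability to decide (not its data) as soon as 3 are unreachable.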
  18. 22.

    High availability How? By having multiple copies: e.g. replicas Types of replicas •  symmetric: no role, each replica is similar to the others •  asymmetric: master/slave roles. Writes go to the master replica → dispatched to the slave replicas. Reads go to the master replica (inconsistent reads on slave replicas are possible) When a failure occurs •  symmetric: nothing to do, only the online replicas respond to requests •  asymmetric: master ownership has to be transferred to one elected replica
  19. 24.

    CAP theorem Conjecture by Brewer, formalized later in a paper (2002): the CAP theorem states that any networked shared-data system can have at most two of three desirable properties •  consistency (C): equivalent to having a single up-to-date copy of the data •  high availability (A): of that data (for updates) •  tolerance to network partitions (P)
  20. 26.

    CAP theorem revised (2012) You cannot choose not to be partition-tolerant The choice is not that binary: •  in the absence of partitions, you can tend toward CA •  when a partition occurs, choose your side (C or A) ☞ tunable consistency
  21. 27.

    What is Consistency? The meaning is different from the C of ACID [diagram: hierarchy of consistency models — Read Uncommitted, Read Committed, Cursor Stability, Repeatable Read, Eventual Consistency, Read Your Writes, Pipelined RAM, Causal, Snapshot Isolation, Linearizability, Serializability — split between those achievable without coordination and those requiring coordination]
  22. 29.

    Consistency with some AP system Cassandra tunable consistency [the same consistency-model hierarchy diagram] Consistency Level ONE
  23. 30.

    Consistency with some AP system Cassandra tunable consistency [the same consistency-model hierarchy diagram] Consistency Level QUORUM
  24. 31.

    Consistency with some AP system Cassandra tunable consistency [the same consistency-model hierarchy diagram] LightWeight Transaction: single-partition writes are linearizable
  25. 32.

    What is availability? Ability to: •  read in the case of failure? •  write in the case of failure? Brewer's definition: high availability of the data (for updates)
  26. 33.

    Real world example Cassandra claims to be highly available, is it true? Some marketing slides even claim continuous availability (100% uptime), is it true?
  27. 34.

    Network partition scenario with Cassandra [diagram: a 13-node Cassandra cluster] Read/Write at Consistency Level ONE ✔︎
  28. 35.

    Network partition scenario with Cassandra [diagram: the same cluster, now split by a network partition] Read/Write at Consistency Level ONE ✘︎
  29. 36.

    So how can it be highly available??? [diagram: two 13-node Cassandra clusters, US DataCenter and EU DataCenter, with the link between them cut] Read/Write at Consistency Level ONE Datacenter-aware load balancing strategy at the driver level
  30. 38.

    Pure master/slave architecture A single server handles all writes; reads can be done on the master or any slave Advantages •  operations can be serialized •  easy to reason about Drawbacks •  cannot scale on writes (reads can be scaled) •  single point of failure (SPOF)
  31. 40.

    Multi-master/slave layout [diagram: a proxy layer routes each write request to the master of its shard — MASTER1 with SLAVE11/SLAVE12/SLAVE13 in Shard1, MASTER2 with SLAVE21/SLAVE22/SLAVE23 in Shard2, …]
  32. 41.

    "Failure of a shard-master is not a problem because it takes less than 10ms to elect a slave into a master" Wrong Objection Rhetoric
  33. 42.

    The wrong objection rhetoric How long does it take to detect that a shard-master has failed? •  heartbeat is not used because it is too simple •  so usually after a timeout, after some successive retries The timeout is usually in the tens of seconds •  you cannot write during this time period
  34. 43.

    Multi-master/slave architecture Distribute data between shards, one master per shard Advantages •  operations can still be serialized within a single shard •  easy to reason about within a single shard •  no more big SPOF Drawbacks •  consistent only within a single shard (unless a global lock is used) •  multiple small points of failure (SPOF inside each shard)
  35. 45.

    "XXX has a shared-nothing architecture" … "There is no concept of master nodes, slave nodes, config nodes, name nodes, head nodes, etc, and all the software loaded on each node is identical" Database Vendor XXX
  36. 46.

    "Each node replicates separate slices of its active data to multiple other nodes" … "As mentioned earlier, by default replicas are only for the purpose of high availability and are not used in the normal serving of data. This allows XXX to be strongly consistent and applications immediately read their own writes by not ever requesting data from a node that it is not active on" Database Vendor XXX
  37. 48.

    "XXX has a shared-nothing architecture" … "There is no concept of master nodes, slave nodes, config nodes, name nodes, head nodes, etc, and all the software loaded on each node is identical" … Beware of marketing! Database Vendor XXX
  38. 49.

    Masterless architecture No master, every node has an equal role ☞ how to manage consistency then, if there is no master? ☞ which replica has the right value? Some data structures to the rescue: •  vector clocks •  CRDTs (Convergent Replicated Data Types)
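To make the first of those data structures concrete, here is a minimal vector clock in Python (an illustration only; the function names and node labels are mine). Each replica keeps one counter per node, and comparing two clocks tells us whether one write causally precedes the other or whether they genuinely conflict:

```python
# Minimal vector clock: a dict mapping node name -> event counter.

def increment(clock, node):
    # A node increments its own entry on every local event/write.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    # Component-wise max: used when a replica receives a remote clock.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def dominates(a, b):
    # a dominates b iff a >= b component-wise (b happened before a).
    return all(a.get(n, 0) >= c for n, c in b.items())

v1 = increment({}, "A")         # write on replica A
v2 = increment(v1, "B")         # a later write on B that saw v1
v3 = increment(v1, "C")         # a concurrent write on C that also saw v1

assert dominates(v2, v1)                              # causal descent
assert not dominates(v2, v3) and not dominates(v3, v2)  # true conflict
```

When neither clock dominates, the system knows the two values are concurrent and must surface or resolve the conflict, which is precisely the question "which replica has the right value?" that the slide raises.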
  39. 50.

    CRDT Riak offers •  Registers •  Counters •  Sets •  Maps •  … Cassandra only offers the LWW-register (Last Write Wins) •  based on the write timestamp
  40. 51.

    Timestamp, again … But didn't we say that timestamps are not really reliable? So why not implement pure CRDTs? Why choose the LWW-register? •  because last-write-wins is still the most "intuitive" rule •  because conflict resolution with other CRDTs is the user's responsibility •  because one should not be required to have a PhD in CS to use Cassandra
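The LWW-register itself fits in a few lines. This is a sketch of the rule, not Cassandra's actual code, and the timestamps here are arbitrary integers:

```python
# Last-write-wins register: on conflict, keep whichever write carries
# the highest timestamp, regardless of arrival order.

class LWWRegister:
    def __init__(self, value=None, timestamp=0):
        self.value, self.timestamp = value, timestamp

    def write(self, value, timestamp):
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

    def merge(self, other):
        # Merging two replicas is just a write with the other's pair,
        # which makes the merge associative, commutative and idempotent.
        self.write(other.value, other.timestamp)

r = LWWRegister()
r.write(32, timestamp=105)   # a write stamped by one coordinator
r.write(33, timestamp=102)   # a *later* write stamped by a drifted clock
assert r.value == 32         # the smaller timestamp loses: age=33 is gone
```

Note how the second write is silently discarded even though it happened later in real time; that is exactly the clock-drift hazard the conflict example on the next slides demonstrates.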
  41. 52.

    Example of write conflict with Cassandra [diagram: a client issues UPDATE users SET age=32 WHERE id=1 through a coordinator whose local time is 10:00:01.050; the replicas store age=32 @ 10:00:01.050]
  42. 53.

    Example of write conflict with Cassandra [diagram: another client then issues UPDATE users SET age=33 WHERE id=1 through a coordinator whose drifted local clock reads 10:00:01.020; the replicas now hold age=33 @ 10:00:01.020 alongside age=32 @ 10:00:01.050, so the later write loses because its timestamp is smaller]
  44. 55.

    Example of write conflict How can we cope with this? •  it is functionally rare for the same column to be updated by different clients at almost the same time (a few milliseconds apart) •  you can also force the timestamp client-side (but then you need to synchronize the clients …) •  you can always use a LightWeight Transaction to guarantee linearizability: UPDATE users SET age = 33 WHERE id = 1 IF age = 32
  45. 56.

    Masterless architecture Advantages •  no SPOF •  no failover procedure •  can achieve 0 downtime with correct tuning Drawbacks •  harder to reason about •  requires some knowledge of distributed systems
  46. 58.

    Can all operations be distributed? Simple examples •  sum → yes, partial sums → global sum •  min → yes, local mins → global min •  average → yes, partial averages (+ counts) → global average •  standard deviation …. Ehhh
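The average case is the instructive one: the average itself does not combine across machines, but the pair (sum, count) does. A sketch with in-process "shards" standing in for machines (the data values are made up for illustration):

```python
# Distributing an average: each machine returns a partial (sum, count);
# combining partials is associative & commutative, so the merge order
# does not matter.

shards = [[2, 4, 6], [8, 10], [12]]            # data split across machines

partials = [(sum(s), len(s)) for s in shards]  # computed independently

total, count = 0, 0
for s, c in partials:                          # merge in any order
    total, count = total + s, count + c

average = total / count
print(average)                                 # 42 / 6 = 7.0
```

Averaging the per-shard averages directly would be wrong here (the shards have different sizes), which is why the slide insists on shipping partial averages *plus counts*; standard deviation needs one more shipped term (the sum of squares) before it decomposes the same way.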
  47. 59.

    Back to school An operation (+) is associative iff •  (a + b) + c = a + (b + c) An operation (+) is commutative iff •  a + b = b + a An operation f() is idempotent iff •  f(a) = b, f(f(a)) = b, … Ex: min(a) = min(min(a))
  48. 60.

    Associativity & Commutativity An associative & commutative operation can be applied on different machines (associativity) and out of order (commutativity) So if your operation is associative & commutative, it can be distributed
  50. 62.

    Idempotency Idempotency is important for retry & failure resilience Applying the same operation 2x, 3x on the same data yields the same final value If idempotency is not available, perform regular checkpoints (intermediate results) to be able to re-compute the results after a failure
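The retry-resilience point is easy to demonstrate: replaying an idempotent operation after a timeout is harmless, while replaying a non-idempotent one corrupts the result. A small illustration (the dictionary and function names are mine):

```python
# Idempotent vs non-idempotent updates under replay.

data = {"min_seen": 10}

def record_min(d, x):
    # Idempotent: min(min(a)) == min(a), replays change nothing.
    d["min_seen"] = min(d["min_seen"], x)

def add_to_total(d, x):
    # NOT idempotent: each replay accumulates again.
    d["total"] = d.get("total", 0) + x

record_min(data, 7)
record_min(data, 7)          # replayed after a timeout: same result
assert data["min_seen"] == 7

add_to_total(data, 5)
add_to_total(data, 5)        # replayed: total is now 10 instead of 5
assert data["total"] == 10
```

This is why, when an operation cannot be made idempotent, the slide recommends checkpointing intermediate results: after a failure you restart from the last checkpoint instead of blindly replaying operations that may already have been applied.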