
#18 Big Data 101


Data is pouring in, and traditional systems are no longer suited to either storing or processing it... In short, welcome to a typical 21st-century project: you find yourself plunged into the world of Big Data...

To approach this strange and vast world, with all its jargon and concepts, with some serenity, Duy Hai, technical evangelist at Datastax, will lay down a few necessary foundations:

- the CAP theorem, or "there is no magic in life"

- master/slave and masterless architectures, with the advantages and drawbacks of each

- the criteria an algorithm must meet to be distributable, or "how not to get fooled by marketing talk"

- Q&A

--------------------------------------------------------------------------------------------------------

Bio :

Duy Hai is a technical evangelist for Datastax, the commercial company behind Apache Cassandra. He is also a committer on Apache Zeppelin.

He splits his time between giving presentations/meetups/talks on Cassandra, contributing to open-source projects for the community (Achilles, Zeppelin), and helping projects that use Cassandra.

Before Datastax, Duy Hai was a freelance Java/Cassandra developer.

contact: [email protected]

Toulouse Data Science

November 29, 2016



Transcript

  1. BIG DATA 101, FOUNDATIONAL
    KNOWLEDGE FOR A NEW PROJECT
    IN 2016
    @doanduyhai
    Apache Cassandra™ Evangelist
    Apache Zeppelin™ Committer
    @doanduyhai
    1


  2. Who Am I ?
    Duy Hai DOAN
    Apache Cassandra™ evangelist
    •  talks, meetups, confs
    •  open-source devs (Achilles, Zeppelin,…)
    •  OSS Cassandra point of contact
    [email protected]
    ☞ @doanduyhai
    Apache Zeppelin™ committer
    @doanduyhai
    2


  3. Agenda
    1) Distributed systems: abstractions & models
    2) CAP theorem
    3) Distributed systems architecture: master/slave vs masterless
    4) Distributed algorithms: tricks & traps
    @doanduyhai
    3


  4. Distributed systems
    @doanduyhai
    4
    Abstractions and models


  5. Processing model
    Single threaded
Multi-threaded ☞ requires thread coordination, sometimes shared state, shared storage …
    Concurrency: multiple tasks A, B, C ... need to be executed but their
    execution order is undefined
    Parallelism: an impl of concurrency, execute multiple tasks at the
    same time using multiple threads
    @doanduyhai
    5


  6. Communication model
    Synchronous ☞ the client request blocks until getting an answer
    Asynchronous ☞ the client request always returns (very fast), the
    answer will arrive later, eventually
Usually, asynchronous requests are handled by multiple threads
    @doanduyhai
    6


  7. Time
    There is no absolute time in theory (even with atomic clocks!)
    Time-drift is unavoidable
•  unless you provide an atomic clock to each server
    •  unless you’re Google
    NTP is your friend ☞ configure it properly !
    @doanduyhai
    7


  8. Ordering of operations
    How to order operations ?
    What does before/after mean ?
    •  when clock is not 100% reliable
    •  when operations occur on multiple machines …
    •  … that live in multiple continents (1000s km distance)
    @doanduyhai
    8


  9. Ordering of operations
    Local/relative ordering is possible
    Global ordering ?
    •  either execute all operations on single machine (☞ master)
    •  or ensure time is perfectly synchronized on all machines executing the
    operations (really feasible ?)
    @doanduyhai
    9


  10. Known algorithms
    Lamport clock
    •  algorithm for message sender
    •  algorithm for message receiver
    •  partial ordering between a pair of (sender, receiver) is possible
    @doanduyhai
    10
    time = time+1;
    time_stamp = time;
    send(Message, time_stamp);
    (message, time_stamp) = receive();
    time = max(time_stamp, time)+1;
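The sender/receiver rules above can be sketched as a small Python class (a minimal sketch; the class and method names are illustrative, not from the deck):

```python
class LamportClock:
    """Logical clock giving a partial ordering of events between processes."""

    def __init__(self):
        self.time = 0

    def send(self):
        # sender rule: time = time + 1; stamp the outgoing message
        self.time += 1
        return self.time

    def receive(self, time_stamp):
        # receiver rule: time = max(time_stamp, time) + 1
        self.time = max(time_stamp, self.time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
ts = a.send()       # a.time = 1, message stamped with 1
b.receive(ts)       # b.time = max(1, 0) + 1 = 2, so "receive" is after "send"
```

Any event that causally follows another thus gets a strictly larger timestamp, which is exactly the pairwise partial ordering the slide mentions.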


  11. Vector Clock
    @doanduyhai
    11
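The deck shows only a diagram here. As a hedged sketch, a vector clock keeps one Lamport-style counter per node and merges incoming clocks by element-wise max; the names below are illustrative:

```python
class VectorClock:
    """One counter per node; detects causality, not just ordering."""

    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        # local event: increment our own entry
        self.clock[self.node_id] += 1
        return dict(self.clock)

    def merge(self, received):
        # on message receive: element-wise max, then count the receive itself
        for n, t in received.items():
            self.clock[n] = max(self.clock[n], t)
        return self.tick()


def happened_before(c1, c2):
    # c1 causally precedes c2 iff every entry is <= and at least one is <
    return all(c1[n] <= c2[n] for n in c1) and any(c1[n] < c2[n] for n in c1)
```

Two clocks where neither `happened_before` the other reveal concurrent updates, which is how masterless stores can detect write conflicts instead of silently losing one side.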


  12. Latency
    Def: time interval between request & response.
    Latency is composed of
    •  network delay: router/switch delay + physical medium delay
    •  OS delay (negligible)
    •  time to process the query by the target (disk access, computation …)
    @doanduyhai
    12


  13. Latency
    Speed of light physics
    •  ≈ 300 000 km/s in the void
•  ≈ 197 000 km/s in fiber optic cable (due to the refractive index)
London – New York distance as the crow flies ≈ 5500 km → 28 ms for a
one-way trip
Conclusion: a ping between London and New York cannot take less
than 56 ms
    @doanduyhai
    13
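The slide's arithmetic, spelled out:

```python
# Physical lower bound on London–New York latency over fiber.
distance_km = 5_500          # great-circle distance, as on the slide
fiber_speed_km_s = 197_000   # ~2/3 of c, due to the refractive index

one_way_ms = distance_km / fiber_speed_km_s * 1000   # ≈ 27.9 ms
round_trip_ms = 2 * one_way_ms                       # ≈ 55.8 ms
```

No routing, OS, or disk cost is included: this is the floor imposed by physics alone, which is why sub-10 ms cross-continent latency claims should raise an eyebrow.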


  14. @doanduyhai
    14
    "The mean latency is
    below 10ms"
    Database vendor X
    ✔︎ ✘︎


  15. @doanduyhai
    15
    "The mean latency is
    below 10ms"
    Database vendor X
    ✔︎ ✘︎


  16. Failure modes
•  Byzantine failure: same input, different outputs → application bug !!!
    •  Performance failure: response correct but arrives too late
    •  Omission failure: special case of performance failure, no response (timeout?)
    •  Crash failure: self-explanatory, server stops responding
    Byzantine failure → value issue
    Other failures → timing issue
    @doanduyhai
    16


  17. Failure
    Root causes
    •  Hardware: disk, CPU, …
    •  Software: packet lost, process crash, OS crash …
•  Workload-specific: flushing a huge file to a SAN
    •  JVM-related: long GC pause
    Defining failure is hard
    @doanduyhai
    17


  18. @doanduyhai
    18
    "A server fails when it does
    not respond to one or
    multiple request(s) in a
    timely manner"
    Usual meaning of failure


  19. Failure detection
    Timely manner ☞ timeout!
    Failure detector:
•  heart beat: binary state (up/down), too simple
    •  exponential backoff with threshold: better model
    •  phi accrual detector: advanced model using statistics
    @doanduyhai
    19


  20. Distributed consensus protocols
    Major properties:
    •  validity: the agreed value must have been proposed by some process
    •  termination: at least one non-faulty process eventually decides
    •  agreement: all processes agree on the same value
    @doanduyhai
    20


  21. Distributed consensus protocols
2-phase commit
    •  termination KO: the protocol can be blocked if the coordinator fails
    3-phase commit
    •  agreement KO: in case of network partition, possibility of inconsistent state
    Paxos, RAFT & Zab (Zookeeper)
    •  OK: satisfies 3 requirements
    •  QUORUM-based: requires a strict majority of nodes/replicas to be alive
    @doanduyhai
    21
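The quorum requirement in the last bullet fits in one line; a minimal sketch (`has_quorum` is an illustrative name, not from any consensus library):

```python
def has_quorum(alive, total):
    """Paxos/RAFT/Zab can make progress only while a strict
    majority of the replicas is alive."""
    return alive >= total // 2 + 1


# 3 replicas tolerate 1 failure, 5 replicas tolerate 2:
# has_quorum(2, 3) -> True, has_quorum(2, 5) -> False
```

This is why such clusters are deployed with an odd number of nodes: going from 3 to 4 replicas raises the quorum from 2 to 3 without tolerating any additional failure.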


  22. High availability
    How ? By having multiple copies: e.g. replicas
    Type of replicas
•  symmetric: no role, each replica is similar to the others
    •  asymmetric: master/slave roles. Write on the master replica → dispatch to slave
    replicas. Read on the master replica (inconsistent reads on slave replicas possible)
    When failure occurs
    •  symmetric: no operation needed, only online replicas will respond to requests
    •  asymmetric: master ownership has to be transferred to one elected replica
    @doanduyhai
    22


  23. CAP Theorem
    @doanduyhai
    23
    Pick 2 out of 3


  24. CAP theorem
    @doanduyhai
    24
    Conjecture by Brewer, formalized later in a paper (2002):
    The CAP theorem states that any networked shared-data system can
    have at most two of three desirable properties
    •  consistency (C): equivalent to having a single up-to-date copy of the data
    •  high availability (A): of that data (for updates)
    •  and tolerance to network partitions (P)


  25. CAP triangle
    @doanduyhai
    25


  26. CAP theorem revised (2012)
    @doanduyhai
    26
You cannot choose not to be partition-tolerant
    Choice is not that binary:
    •  in the absence of partition, you can tend toward CA
    •  when a partition occurs, choose your side (C or A)
    ☞ tunable consistency


  27. What is Consistency ?
    @doanduyhai
    27
    Meaning is different from the C of ACID
Read Uncommitted
    Read Committed
    Cursor Stability
    Repeatable Read
    Eventual Consistency
    Read Your Write
    Pipelined RAM
    Causal
    Snapshot Isolation Linearizability
    Serializability
    Without coordination
    Requires coordination


28. Consistency with a supposedly CP system
    @doanduyhai
    28
    DB name
    censored


  29. Consistency with some AP system
    @doanduyhai
    29
    Cassandra tunable consistency
Read Uncommitted
    Read Committed
    Cursor Stability
    Repeatable Read
    Eventual Consistency
    Read Your Write
    Pipelined RAM
    Causal
    Snapshot Isolation Linearizability
    Serializability
    Without coordination
    Requires coordination
    Consistency Level
    ONE


  30. Consistency with some AP system
    @doanduyhai
    30
    Cassandra tunable consistency
Read Uncommitted
    Read Committed
    Cursor Stability
    Repeatable Read
    Eventual Consistency
    Read Your Write
    Pipelined RAM
    Causal
    Snapshot Isolation Linearizability
    Serializability
    Without coordination
    Requires coordination
    Consistency Level
    QUORUM


  31. Consistency with some AP system
    @doanduyhai
    31
    Cassandra tunable consistency
Read Uncommitted
    Read Committed
    Cursor Stability
    Repeatable Read
    Eventual Consistency
    Read Your Write
    Pipelined RAM
    Causal
    Snapshot Isolation Linearizability
    Serializability
    Without coordination
    Requires coordination
    LightWeight
    Transaction
    Single partition writes
    are linearizable


  32. What is availability ?
    @doanduyhai
    32
    Ability to:
    •  Read in the case of failure ?
    •  Write in the case of failure ?
    Brewer definition: high availability of the data (for updates)


  33. Real world example
    @doanduyhai
    33
    Cassandra claims to be highly available, is it true ?
Some marketing slides even claim continuous availability (100%
    uptime), is it true ?


  34. Network partition scenario with Cassandra
    @doanduyhai
    34
[diagram: 13-node Cassandra cluster]
    Read/Write at
    Consistency level ONE
    ✔︎


  35. Network partition scenario with Cassandra
    @doanduyhai
    35
[diagram: the same 13-node cluster, now split by a network partition]
    Read/Write at
    Consistency level ONE
    ✘︎


  36. So how can it be highly available ???
    @doanduyhai
    36
Read/Write at Consistency level ONE
    [diagram: two 13-node Cassandra clusters, US DataCenter and EU DataCenter]

    Datacenter-aware load balancing strategy at driver level


  37. Architecture
    @doanduyhai
    37
    Master/Slave vs Masterless


  38. Pure master/slave architecture
    @doanduyhai
    38
    Single server for all writes, read can be done on master or any slave
    Advantages
    •  operations can be serialized
    •  easy to reason about
    Drawbacks
    •  cannot scale on write (read can be scaled)
    •  single point of failure (SPOF)


  39. Master/slave SPOF
    @doanduyhai
    39
[diagram: a write request goes to MASTER, which replicates to SLAVE1–SLAVE3]


  40. Multi-master/slave layout
    @doanduyhai
    40
[diagram: a proxy layer routes write requests to Shard1 (MASTER1 + SLAVE11–13) and Shard2 (MASTER2 + SLAVE21–23)]


  41. @doanduyhai
    41
    "Failure of a shard-master is
    not a problem because it
    takes less than 10ms to elect
    a slave into a master"
    Wrong Objection Rhetoric


  42. The wrong objection rhetoric
    @doanduyhai
    42
    How long does it take to detect that a shard-master has failed ?
•  heart-beat alone is not used because it is too simple
    •  so detection usually happens after a timeout and some successive retries
    Timeout is usually in tens of seconds
    •  you cannot write during this time period


  43. Multi-master/slave architecture
    @doanduyhai
    43
    Distribute data between shards. One master per shard
    Advantages
    •  operations can still be serialized in a single shard
    •  easy to reason about in a single shard
    •  no more big SPOF
    Drawbacks
    •  consistent only in a single shard (unless global lock)
    •  multiple small points of failure (SPOF inside a shard)


  44. Fake masterless architecture
    @doanduyhai
    44
    In reality, multi-master architecture …
    … but branded as shared-nothing architecture


  45. @doanduyhai
    45
    "XXX has a shared-nothing architecture"

    "There is no concept of master nodes,
    slave nodes, config nodes, name nodes,
    head nodes, etc, and all the software
    loaded on each node is identical"
    Database Vendor XXX


  46. @doanduyhai
    46
    "Each node replicates separate slices of
    its active data to multiple other nodes »

    "As mentioned earlier, by default replicas
    are only for the purpose of high
    availability and are not used in the
    normal serving of data. This allows XXX to
    be strongly consistent and applications
    immediately read their own writes by not
    ever requesting data from a node that it
    is not active on"
    Database Vendor XXX


  47. @doanduyhai
    47
    XXX


  48. @doanduyhai
    48
    "XXX has a shared-nothing architecture »

    "There is no concept of master nodes,
    slave nodes, config nodes, name nodes,
    head nodes, etc, and all the software
    loaded on each node is identical"

    Beware of
    marketing!
    Database Vendor XXX


  49. Masterless architecture
    @doanduyhai
    49
    No master, every node has equal role
    ☞ how to manage consistency then if there is no master ?
    ☞ which replica has the right value ?
    Some data-structures to the rescue:
    •  vector clock
    •  CRDT (Convergent Replicated Data Type)


  50. CRDT
    @doanduyhai
    50
    Riak
    •  Registers
    •  Counters
    •  Sets
    •  Maps
    •  …
Cassandra only offers a LWW-register (Last Write Wins)
    •  based on write timestamps
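A minimal sketch of such a LWW-register (illustrative, not Cassandra's actual implementation): the register keeps whichever value carries the highest timestamp, and replica reconciliation reuses the same rule.

```python
class LWWRegister:
    """Last-write-wins register: highest timestamp wins, no coordination."""

    def __init__(self):
        self.value, self.timestamp = None, -1

    def write(self, value, timestamp):
        # a write only lands if its timestamp beats the stored one
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

    def merge(self, other):
        # merging two replicas is the same rule, so it is
        # associative, commutative and idempotent
        self.write(other.value, other.timestamp)


r = LWWRegister()
r.write(32, 1050)   # client A, whose clock runs slightly ahead
r.write(33, 1020)   # client B writes later, but with an earlier timestamp
# r.value is still 32: the second update silently loses
```

This replays the write-conflict scenario of the following slides: the physically later `age=33` update is discarded because its timestamp is lower.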


  51. Timestamp, again …
    @doanduyhai
    51
But didn’t we say that timestamps are not really reliable ?
    Why not implement pure CRDTs ?
    Why choose the LWW-register ?
    •  because last-write-wins is still the most "intuitive"
    •  because conflict resolution with other CRDTs is the user’s responsibility
    •  because one should not be required to have a PhD in CS to use Cassandra


  52. Example of write conflict with Cassandra
    @doanduyhai
    52
[diagram: Cassandra cluster, coordinator local time 10:00:01.050]
    UPDATE users SET age=32 WHERE id=1
    age=32 @ 10:00:01.050 stored on the three replicas


  53. Example of write conflict with Cassandra
    @doanduyhai
    53
[diagram: same cluster, a second coordinator whose local time is 10:00:01.020]
    UPDATE users SET age=33 WHERE id=1
    each replica now holds age=32 @ 10:00:01.050 and age=33 @ 10:00:01.020


  54. Example of write conflict with Cassandra
    @doanduyhai
    54
[diagram: same scenario on read — age=32 wins on every replica because its timestamp 10:00:01.050 is higher than 10:00:01.020]


  55. Example of write conflict
    @doanduyhai
    55
    How can we cope with this ?
•  it is functionally rare to have an update on the same column by different
    clients at almost the same time (a few milliseconds apart)
    •  can also force the timestamp at client-side (but the clients now need to be synchronized
    …)
    •  can always use LightWeight Transaction to guarantee linearizability
    UPDATE user SET age = 33 WHERE id = 1 IF age = 32


  56. Masterless architecture
    @doanduyhai
    56
    Advantages
    •  no SPOF
    •  no failover procedure
    •  can achieve 0 downtime with correct tuning
    Drawbacks
    •  hard to reason about
    •  require some knowledge about distributed systems


  57. Distributed computation
    @doanduyhai
    57
    Necessary & sufficient conditions


  58. Can all operations be distributed ?
    @doanduyhai
    58
    Simple example
•  sum → yes, partial sums → global sum
    •  min → yes, local mins → global min
    •  average → yes, partial averages (+ counts) → global average
    •  standard deviation …. Ehhh
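A sketch answering the slide's open question: standard deviation does distribute, provided each machine ships a partial (count, sum, sum of squares) instead of a partial deviation. The function names are illustrative.

```python
import math


def partial(chunk):
    """Per-machine aggregate: (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    # merging partials is associative & commutative, so it can run
    # on different machines and in any order
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def stddev(agg):
    # population std dev via E[X^2] - E[X]^2
    n, s, sq = agg
    mean = s / n
    return math.sqrt(sq / n - mean * mean)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
two_machines = combine(partial(data[:4]), partial(data[4:]))
# stddev(two_machines) == 2.0, same as computing on a single machine
```

The trick generalizes: an operation distributes when you can find *some* associative & commutative intermediate representation, even if the final formula itself is neither.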


  59. Back to school
    @doanduyhai
    59
    An operation (+) is associative iff
    •  (a + b) + c = a + (b + c)
    An operation (+) is commutative iff
    •  a + b = b + a
    An operation (f()) is idempotent iff
    •  f(a) = b, f(f(a)) = b, … Ex: min(a) = min(min(a))


  60. Associativity & Commutativity
    @doanduyhai
    60
    Associative & commutative operation can be applied on different
    machines (associative) and out of order (commutative)
So if your operation is associative & commutative, it can be
    distributed


  61. Associativity & Commutativity
    @doanduyhai
    61
    Associative & commutative operation can be applied on different
    machines (associative) and out of order (commutative)
So if your operation is associative & commutative, it can be
    distributed


  62. Idempotency
    @doanduyhai
    62
Idempotency is important for retry & failure resilience
    Applying the same operation 2x, 3x on the same data yields the
    same final value
    If idempotency is not available, perform regular checkpoints
    (intermediate results) to be able to re-compute the results after
    failure
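A toy illustration of the point above, with illustrative names: an idempotent fold like `min` safely absorbs a redelivered message, whereas a non-idempotent one like addition silently double-counts it.

```python
def apply_min(state, x):
    """Idempotent fold: re-applying the same x leaves state unchanged."""
    return x if state is None else min(state, x)


def apply_add(state, x):
    """Non-idempotent fold: a retried message corrupts the total."""
    return x if state is None else state + x


messages = [5, 3, 8]
m = a = None
for x in messages:
    m, a = apply_min(m, x), apply_add(a, x)

m_retry = apply_min(m, 3)   # duplicate delivery after a retry: still 3
a_retry = apply_add(a, 3)   # 16 becomes 19 — hence the need for checkpoints
```

This is exactly why the slide recommends regular checkpoints when idempotency is not available: a non-idempotent computation cannot simply be replayed after a failure.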


  63. @doanduyhai
    63
Q & A


  64. @doanduyhai
    64
    Thank You !
