Slide 1

Slide 1 text

Distributed Systems Introduction and Overview

Slide 2

Slide 2 text

Who am I? Emmanuel Bakare, but people call me “Bakman”. Senior DevOps Engineer at Twilio. Fair amount of distributed systems experience (Replex, AWS)

Slide 3

Slide 3 text

We will be discussing Distributed Systems

Starting off easy:
● Concurrency and parallelism
● Shared state
○ Inter Process Communication (IPC)
○ Distributed Memory
■ RDMA
● Networking (reliability)
○ Partitions and failures
● Cascading failures
○ Retry storms
○ Chained workflows
■ Sagas

Rounding off hard:
● Timing (realtime, logical and monotonic)
● Synchronisation
● Acknowledgements
○ Message Ordering
○ Delivery mechanisms
● Processing
○ Asynchronous
○ Event Driven
● Consensus
○ CAP
○ FLP Impossibility
○ Failure Modes

Slide 4

Slide 4 text

Notes ⛕ - Means there is a small detour to cover additional material. The material covered here will be relatively simplified, but the technical summary of the concepts will still be delivered. Many of these concepts are a mix of the theoretical and the practical. I hope you have fun going through this.

Slide 5

Slide 5 text

What is a distributed system?

Slide 6

Slide 6 text

What is a distributed system? As found in the Part II chapter titled “Distributed Systems” of Database Internals “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” - Leslie Lamport

Slide 7

Slide 7 text

Breakdown of how distributed systems help teams Star Networks Client Master

Slide 8

Slide 8 text

Breakdown of how distributed systems help teams Star Networks - Single - Point - Of - Failure Client Master 💀

Slide 9

Slide 9 text

Breakdown of how distributed systems help teams Mesh Networks - Multiple - Point - Of - Redundancy Client Master 😎

Slide 10

Slide 10 text

Breakdown of how distributed systems help teams Mesh Networks - Multiple - Point - Of - Redundancy Client Master 😎

Slide 11

Slide 11 text

Network Routing That’s a simple example of how it works

Slide 12

Slide 12 text

Failures happen with every system Distributed systems help with stability by finding smarter ways to have “redundant” paths, improving the ease of separating individual systems whilst keeping the whole working.

Slide 13

Slide 13 text

Distributed systems are made of nodes Nodes

Slide 14

Slide 14 text

Distributed systems are made of nodes Nodes Nodes are the individual components in the distributed system. Multiple nodes communicate with each other in the distributed system.

Slide 15

Slide 15 text

Distributed systems are made of nodes Nodes Nodes are the individual components in the distributed system. Multiple nodes communicate with each other in the distributed system. These nodes may also be called replicas; the terminology varies

Slide 16

Slide 16 text

Again, Failures happen with every system Distributed systems help with stability by finding smarter ways to have “redundant” paths, improving the ease of separating individual systems whilst keeping the whole working.

Slide 17

Slide 17 text

Again, Failures happen with every system Distributed systems help with stability by finding smarter ways to have “redundant” paths, improving the ease of separating individual systems whilst keeping the whole working. Despite helping improve stability, they can also increase complexity. Use these concepts sparingly.

Slide 18

Slide 18 text

We will be discussing Distributed Systems This is not about distributed algorithms; it is an overview of the more infrastructure-related and theoretical parts of the field.

Slide 19

Slide 19 text

We will be discussing Distributed Systems This is not about distributed algorithms; it is an overview of the more infrastructure-related and theoretical parts of the field. Algorithms handle the interactions between these systems at a lower level; this talk is very high level, and we will not be doing any coding.

Slide 20

Slide 20 text

We will be discussing Distributed Systems This is not about distributed algorithms; it is an overview of the more infrastructure-related and theoretical parts of the field. Algorithms handle the interactions between these systems at a lower level; this talk is very high level, and we will not be doing any coding. This discussion is based on the Database Internals book by Alex Petrov.

Slide 21

Slide 21 text

Starting off easy Let’s get the basics

Slide 22

Slide 22 text

Concurrency and Parallelism

Slide 23

Slide 23 text

Concurrency and Parallelism These are both ways to break down tasks so they can run faster. It is simply “division of labour” across various workers.

Slide 24

Slide 24 text

Concurrency and Parallelism These are both ways to break down tasks so they can run faster. It is simply “division of labour” across various workers. Having more workers does not always imply faster performance, and breaking a task into bits does not always make it more efficient

Slide 25

Slide 25 text

Concurrency and Parallelism Concurrency is the ability to run one or more independent tasks without each waiting on the others. Concurrency can mean taking a process A and creating a chain of processes to become: B -> C -> D -> A, where B, C and D are running independently to achieve A. Parallelism is the ability to run the same thing at the same time, multiple times. Parallelism means taking a process A and running it X times, where each run is distinctly independent but performs the same operation. X -> A X -> A X -> A
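The talk itself stays code-free, but a minimal sketch may make the distinction concrete. Below is an illustrative Python example (the step names and timings are made up, not from the talk): task A is composed concurrently from independent steps B, C and D using threads, and then the same task is run in parallel as independent copies using processes.

```python
import multiprocessing as mp
import threading
import time

def step(name, seconds=0.1):
    time.sleep(seconds)          # stand-in for real work
    print(f"{name} done")

# Concurrency: independent steps B, C and D interleave (context switch)
# on however many cores the OS gives us, composing into task A.
def concurrent_a():
    threads = [threading.Thread(target=step, args=(n,)) for n in "BCD"]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("A done")

# Parallelism: the same task A runs X times at once, each copy in its
# own process (and, ideally, on its own core).
def parallel_a(x=3):
    procs = [mp.Process(target=step, args=(f"A{i}",)) for i in range(x)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    concurrent_a()
    parallel_a()
```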

Slide 26

Slide 26 text

Concurrency and Parallelism Concurrency will usually involve context switching across multiple tasks and is not limited by hardware. Concurrent tasks can be context switched across hardware, so there is no “hard” reliance on the number of cores assigned. Parallelism implies each process is running on a single dedicated resource and will usually stay active from start to finish. Parallelism requires actual physical hardware to run, so if you have X cores, you can reasonably run X copies of a task before it becomes slower.

Slide 27

Slide 27 text

⛕ Hyperthreads are not cores

Slide 28

Slide 28 text

Intermittent break: Hyperthreads are not cores Hyperthreading is a hack in the physical setup of a core that uses a high speed register to emulate context switching with such performance that it can be likened to an actual physical core.

Slide 29

Slide 29 text

Intermittent break: Hyperthreads are not cores Hyperthreading is a hack in the physical setup of a core that uses a high speed register to emulate context switching with such performance that it can be likened to an actual physical core. Hyperthreaded v Real Core Thread 1 Thread 2 (Register) Core Thread 1

Slide 30

Slide 30 text

Intermittent break: Hyperthreads are not cores This means that if you schedule two of the same tasks on a single core, they will really be fighting for resources on that same core. This works for most things because cache access and other resources on a core are efficient, so it appears seamless.

Slide 31

Slide 31 text

Intermittent break: Hyperthreads are not cores This means that if you schedule two of the same tasks on a single core, they will really be fighting for resources on that same core. This works for most things because cache access and other resources on a core are efficient, so it appears seamless. The operating system will usually just treat threads as cores to simplify scheduling tasks onto them, but there is always a performance hit for such emulation.

Slide 32

Slide 32 text

Intermittent break: Hyperthreads are not cores This means that if you schedule two of the same tasks on a single core, they will really be fighting for resources on that same core. This works for most things because cache access and other resources on a core are efficient, so it appears seamless. The operating system will usually just treat threads as cores to simplify scheduling tasks onto them, but there is always a performance hit for such emulation. If you have latency-sensitive or contention-driven processes, like databases, that need performance, prefer physical cores over hyperthreaded ones.
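As a practical aside to the advice above: on Linux you can pin a latency-sensitive process to specific logical CPUs. This sketch assumes a Linux box and that logical CPU 0 is one hyperthread of a physical core; check lscpu or /proc/cpuinfo for your machine's actual sibling layout, since the topology here is an assumption.

```python
import os

print("logical CPUs visible:", os.cpu_count())

# Pin this process to logical CPU 0 only, so the latency-sensitive task
# is not co-scheduled with a sibling hyperthread fighting for the same
# physical core's resources. (Linux-only syscall wrapper.)
os.sched_setaffinity(0, {0})
print("now pinned to:", os.sched_getaffinity(0))
```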

Slide 33

Slide 33 text

Concurrency v parallelism When considering parallel tasks, you can have concurrency in parallelism but you cannot have parallelism in concurrent tasks.

Slide 34

Slide 34 text

Concurrency v parallelism When considering parallel tasks, you can have concurrency in parallelism but you cannot have parallelism in concurrent tasks. The reason is that parallelism implies the same process is executed at the same time in independent blobs.

Slide 35

Slide 35 text

Concurrency v parallelism When considering parallel tasks, you can have concurrency in parallelism but you cannot have parallelism in concurrent tasks. The reason is that parallelism implies the same process is executed at the same time in independent blobs. Concurrency implies composition of independent single tasks working together, so the timelines overlap.

Slide 36

Slide 36 text

Concurrency v parallelism Basically Concurrent tasks can be run in parallel BUT Parallel tasks cannot be run concurrently

Slide 37

Slide 37 text

Concurrency is not parallelism I recommend watching this video from Rob Pike at a later time for more details: https://go.dev/blog/waza-talk

Slide 38

Slide 38 text

Shared State

Slide 39

Slide 39 text

Shared State There are primarily two methods for sharing state in distributed systems. You either have

Slide 40

Slide 40 text

Shared State There are primarily two methods for sharing state in distributed systems. You either have ● Message passing (queues, signals)

Slide 41

Slide 41 text

Shared State There are primarily two methods for sharing state in distributed systems. You either have ● Message passing (queues, signals) ● Shared memory (databases, heaps)

Slide 42

Slide 42 text

Shared State There are primarily two methods for sharing state in distributed systems. You either have ● Message passing (queues, signals) ● Shared memory (databases, heaps) Unlike message passing, shared memory requires synchronization to allow for safe (serializable) updates. The concept of serializable consistency will be discussed later on.

Slide 43

Slide 43 text

Shared State - Message Passing Message passing is more about using pipes. Yeah, same concept as a | b PUBLISHER PIPE [/dev/mqueue] SUBSCRIBER

Slide 44

Slide 44 text

Shared State - Message Passing When you perform message passing, you ephemerally take some data from point A and pass it to point B through a pipe. That pipe is usually called a queue, but you can also pass data through I/O redirection etc.

Slide 45

Slide 45 text

Shared State - Message Passing When you perform message passing, you ephemerally take some data from point A and pass it to point B through a pipe. That pipe is usually called a queue, but you can also pass data through I/O redirection etc. In Linux, this can be done with the /dev/mqueue filesystem, which provides kernel support for message passing primitives. See here: https://man7.org/linux/man-pages/man7/mq_overview.7.html

Slide 46

Slide 46 text

Shared State - Message Passing When you perform message passing, you ephemerally take some data from point A and pass it to point B through a pipe. That pipe is usually called a queue, but you can also pass data through I/O redirection etc. In Linux, this can be emulated in code, but most queues will use the /dev/mqueue filesystem, which provides kernel support for message passing primitives. See here: https://man7.org/linux/man-pages/man7/mq_overview.7.html This can also be done in code using a basic queue data structure in user space.
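To make the user-space note concrete, here is a minimal Python sketch of the PUBLISHER -> PIPE -> SUBSCRIBER picture. Kernel-backed POSIX queues (/dev/mqueue) need a third-party binding from Python, so the standard library's multiprocessing.Queue stands in for the pipe here; the message contents are illustrative.

```python
import multiprocessing as mp

def publisher(q):
    for i in range(3):
        q.put(f"message {i}")   # data flows ephemerally through the pipe
    q.put(None)                 # sentinel: nothing more to send

def subscriber(q):
    while (msg := q.get()) is not None:
        print("received:", msg)

if __name__ == "__main__":
    q = mp.Queue()
    pub = mp.Process(target=publisher, args=(q,))
    sub = mp.Process(target=subscriber, args=(q,))
    pub.start(); sub.start()
    pub.join(); sub.join()
```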

Slide 47

Slide 47 text

Shared State - Shared Memory Shared Memory is what we understand as dictionaries, tuples etc. Things that persist data. Shared memory stores are basically databases, filesystems and, more trivially, files on filesystems. In Linux, even a file can be mounted as a filesystem. SHM Writer Reader Writer

Slide 48

Slide 48 text

Shared State - Shared Memory When you use shared memory, you are guaranteed that within the allocated block of memory, data persisted can be retrieved multiple times.

Slide 49

Slide 49 text

Shared State - Shared Memory When you use shared memory, you are guaranteed that within the allocated block of memory, data persisted can be retrieved multiple times. In Linux, the /dev/shm filesystem allows you to perform allocations with kernel support for shared memory stores. See https://man7.org/linux/man-pages/man7/shm_overview.7.html

Slide 50

Slide 50 text

Shared State - Shared Memory When you use shared memory, you are guaranteed that within the allocated block of memory, data persisted can be retrieved multiple times. In Linux, the /dev/shm filesystem allows you to perform allocations for shared memory stores. See https://man7.org/linux/man-pages/man7/shm_overview.7.html Memory allocators, arenas and various other primitives allow for allocation of shared memory blocks. Just like the message passing note, this can be implemented without this filesystem.
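A minimal sketch of the Writer/Reader picture, using Python's standard library wrapper over the same kernel facility (backed by /dev/shm on Linux); the block size and contents are arbitrary.

```python
from multiprocessing import shared_memory

# Writer: allocate a named block and persist some bytes into it.
shm = shared_memory.SharedMemory(create=True, size=64)
shm.buf[:5] = b"hello"

# Reader: attach to the same block by name and read it back --
# unlike a queue message, the data can be retrieved many times.
reader = shared_memory.SharedMemory(name=shm.name)
print(bytes(reader.buf[:5]))    # b'hello'
print(bytes(reader.buf[:5]))    # still there

reader.close()
shm.close()
shm.unlink()                    # free the block when done
```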

Slide 51

Slide 51 text

⛕ Distributed Memory

Slide 52

Slide 52 text

Intermittent break: Distributed Memory In addition to shared memory, you can also have distributed memory, which, in addition to local data, moves memory to a store outside the system where the process is running.

Slide 53

Slide 53 text

Intermittent break: Distributed Memory In addition to shared memory, you can also have distributed memory, which, in addition to local data, moves memory to a store outside the system where the process is running. The interconnect that allows the data transfer is usually some network (wired or wireless), and pages can be swapped between independent systems as required.

Slide 54

Slide 54 text

Intermittent break: Distributed Memory (Source: Distributed memory - Wikipedia)

Slide 55

Slide 55 text

Intermittent break: Distributed Memory A good example of distributed memory protocols is Remote DMA (RDMA), which allows direct access to memory pages, bypassing the CPU. You can find this in Infiniband NICs and GPUs, for example. GPUs have both local and remote memory regions that can be accessed over interconnects for fast processing and retrieval across many parallel executing cores.

Slide 56

Slide 56 text

Networking (reliability)

Slide 57

Slide 57 text

Networking In distributed systems, everything is separated physically in some sense. This means we need some way to communicate.

Slide 58

Slide 58 text

Networking In distributed systems, everything is separated physically in some sense. This means we need some way to communicate. Networking allows different systems to pass information over a protocol that defines the standard for how that information is sent and received.

Slide 59

Slide 59 text

Networking Client -> Server Communication Client Server

Slide 60

Slide 60 text

Networking Client -> Server Communication Client Server Both the client and server understand the protocol so they can communicate with each other

Slide 61

Slide 61 text

Networking Client -> Server Communication Client Server Both the client and server understand the protocol so they can communicate with each other. For databases, we have protocols also. MySQL, MongoDB, Redis etc. all implement a protocol for communication with their clients.
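To illustrate both sides understanding a protocol, here is a toy line-based protocol sketched in Python: the client sends PING, the server replies PONG. The port and framing are made up; real database protocols (MySQL, MongoDB, Redis etc.) are richer versions of the same idea.

```python
import socket
import threading

# Server side: bind and listen first, so the client below cannot race it.
srv = socket.create_server(("127.0.0.1", 9999))

def serve_one():
    conn, _ = srv.accept()
    with conn:
        if conn.recv(1024) == b"PING\n":   # parse per the shared protocol
            conn.sendall(b"PONG\n")        # reply per the shared protocol

t = threading.Thread(target=serve_one)
t.start()

# Client side: speaks the same protocol, so both ends understand each other.
with socket.create_connection(("127.0.0.1", 9999)) as client:
    client.sendall(b"PING\n")
    print(client.recv(1024))               # b'PONG\n'

t.join()
srv.close()
```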

Slide 62

Slide 62 text

Networking Reliability Sometimes, the network fails Client Server

Slide 63

Slide 63 text

Networking Reliability Sometimes, the network fails Client Server A client is isolated from the server due to networking issues. This can be a broken link or an intentional block (e.g. firewalls)

Slide 64

Slide 64 text

Networking Reliability Sometimes, the network fails Client Server A client is isolated from the server due to networking issues. This can be a broken link or an intentional block (e.g. firewalls) We call these scenarios network partitions and they can be good or BAD

Slide 65

Slide 65 text

Networking Reliability Networks fail! - Biggest - Single - Point - Of - Failure Client Server 💀

Slide 66

Slide 66 text

Networking Reliability Networks fail! - Biggest - Single - Point - Of - Failure Client Server 💀 Sometimes, the network can also be bad in general where nothing can communicate.

Slide 67

Slide 67 text

Networking Reliability Networks - Multiple - Point - Of - Redundancy Client Server 😎 In designing resilient distributed systems, there are usually notions of redundancy, where we provide multiple paths for communication.

Slide 68

Slide 68 text

Cascading Failures

Slide 69

Slide 69 text

Cascading Failures Usually, when networks, storage or most other systems fail, the delicate balance of distributed systems means the failure can cascade.

Slide 70

Slide 70 text

Cascading Failures Usually, when networks, storage or most other systems fail, the delicate balance of distributed systems means the failure can cascade. Cascading failures can occur for a number of reasons: for example, replication failures reach a limit due to network partitions, and the nodes fail over due to WAL build-up causing the allocated storage to fill up.

Slide 71

Slide 71 text

Cascading Failures Usually, when networks, storage or most other systems fail, the delicate balance of distributed systems means the failure can cascade. Cascading failures can occur for a number of reasons: for example, replication failures reach a limit due to network partitions, and the nodes fail over due to WAL build-up causing the allocated storage to fill up. That’s a mouthful of things in succession, but it happens

Slide 72

Slide 72 text

Cascading Failures In designing distributed systems, cascading failures are part of the process, and designing to handle them is more difficult than it might appear.

Slide 73

Slide 73 text

Cascading Failures In designing distributed systems, cascading failures are part of the process, and designing to handle them is more difficult than it might appear. The general goal is to fix them as they appear, because there’s no universal fix.

Slide 74

Slide 74 text

Cascading Failures In designing distributed systems, cascading failures are part of the process, and designing to handle them is more difficult than it might appear. The general goal is to fix them as they appear, because there’s no universal fix. In the words of Leslie Lamport: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”

Slide 75

Slide 75 text

Recovering from failure In recovering from failure, you can apply numerous techniques. I cover only two for simplicity: you can use retries in the event of short blips, and/or use transactions to roll back in the event of such failures so they can be retried later on.

Slide 76

Slide 76 text

Retries “If at first you don’t succeed, try try again” It’s as simple as that.

Slide 77

Slide 77 text

Retry Storms In distributed systems, there’s always more than one machine. Imagine if 1000 nodes retried (multiple) requests to a server during a momentary outage: the requests coming back would leave the entire system under significant load due to the now synchronised barrage of failed requests.

Slide 78

Slide 78 text

Retry Storms In distributed systems, there’s always more than one machine. Imagine if 1000 nodes retried (multiple) requests to a server during a momentary outage: the requests coming back would leave the entire system under significant load due to the now synchronised barrage of failed requests. To avoid this, apply some random jitter and backoff on startup, on retry etc. to save yourself from the loop of incoming requests that ends up causing cascading failures downstream.
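A minimal sketch of that advice: capped exponential backoff with “full jitter”, so a thousand clients retrying through the same outage do not wake up in lockstep. The parameters and the use of ConnectionError are illustrative.

```python
import random
import time

def call_with_retry(do_request, max_attempts=5, base=0.1, cap=10.0):
    for attempt in range(max_attempts):
        try:
            return do_request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise               # give up after the final attempt
            # Full jitter: sleep a random amount up to a capped
            # exponential window, de-synchronising the retry storm.
            window = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, window))
```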

Slide 79

Slide 79 text

Transactions “If I pay you, you owe me bitcoin” - Good transaction “If I don’t pay you, you owe me bitcoin” - Bad transaction Simple as that.

Slide 80

Slide 80 text

Transactions Transactions allow us to chain workflows together so we can prevent actions that partially succeed.

Slide 81

Slide 81 text

Transactions Transactions allow us to chain workflows together so we can prevent actions that partially succeed. In distributed systems, transactions are performed through different means. You can do so through consensus algorithms like Raft and Paxos, two (or more) phase commits, and various others.

Slide 82

Slide 82 text

⛕ Sagas

Slide 83

Slide 83 text

Intermittent break: Sagas (contrary to belief, this is not a Star Trek reference) Sagas are a way to chain transactions across many distributed systems. When you have multiple transactions occurring across multiple isolated data sources, it is impossible to synchronize them, because each place the transaction goes has a different view of the world.

Slide 84

Slide 84 text

Intermittent break: Sagas We want to change the value under D from 1 to 4. This can only be done through A -> B -> C -> D A B C D 1 1 1

Slide 85

Slide 85 text

Intermittent break: Sagas We want to change the value under D from 1 to 4. This can only be done through A -> B -> C -> D ✅ A B C D 1 1 4

Slide 86

Slide 86 text

Intermittent break: Sagas We want to change the value under D from 1 to 4. This can only be done through A -> B -> C -> D ✅ ✅ A B C D 1 4 4

Slide 87

Slide 87 text

Intermittent break: Sagas We want to change the value under D from 1 to 4. This can only be done through A -> B -> C -> D ✅ ✅ ❌ The transaction fails on C->D A B C D 1 4 4

Slide 88

Slide 88 text

Intermittent break: Sagas We want to change the value under D from 1 to 4. This can only be done through A -> B -> C -> D 🔙 We roll back all the changes made thus far A B C D 1 4 4

Slide 89

Slide 89 text

Intermittent break: Sagas We want to change the value under D from 1 to 4. This can only be done through A -> B -> C -> D 🔙 🔙 We roll back all the changes made thus far A B C D 1 1 4

Slide 90

Slide 90 text

Intermittent break: Sagas We want to change the value under D from 1 to 4. This can only be done through A -> B -> C -> D 🔙 🔙 🔙 We roll back all the changes made thus far A B C D 1 1 1

Slide 91

Slide 91 text

Intermittent break: Sagas We want to change the value under D from 1 to 4. This can only be done through A -> B -> C -> D We’re back to where we started. A B C D 1 1 1

Slide 92

Slide 92 text

Intermittent break: Sagas Sagas orchestrate a means of preventing a transaction failure on one of many services from leaving the system partially updated. If one system fails in the chain, all the previous changes are rolled back to prevent inconsistency.

Slide 93

Slide 93 text

Intermittent break: Sagas Sagas orchestrate a means of preventing a transaction failure on one of many services from leaving the system partially updated. If one system fails in the chain, all the previous changes are rolled back to prevent inconsistency. As you can imagine, it is quite difficult to achieve this pattern, but it is very useful in proper microservice designs for systems with independent transactions, e.g. finance

Slide 94

Slide 94 text

Intermittent break: Sagas More details on sagas can be found here: https://microservices.io/patterns/data/saga.html
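A minimal sketch of the walk-through above: each step in the saga carries a compensating action, and a failure part-way through triggers a rollback of everything applied so far. The node names and the failure point mirror the slides, but the code shape is my own illustration.

```python
def run_saga(steps):
    done = []
    try:
        for apply_fn, compensate in steps:
            apply_fn()               # forward step of the saga
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()             # roll back in reverse order
        raise

state = {"B": 1, "C": 1, "D": 1}

def set_to_4(node):
    def apply_fn():
        state[node] = 4
    def compensate():
        state[node] = 1
    return apply_fn, compensate

def failing_hop():                   # the failing hop from the slides
    raise RuntimeError("hop failed")

try:
    run_saga([set_to_4("D"), set_to_4("C"), (failing_hop, lambda: None)])
except RuntimeError:
    print("saga failed, rolled back")

print(state)   # back to {'B': 1, 'C': 1, 'D': 1}
```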

Slide 95

Slide 95 text

Easy stuff done Questions?

Slide 96

Slide 96 text

Rounding off hard Let’s get the beef cooking

Slide 97

Slide 97 text

Timing

Slide 98

Slide 98 text

Timing “It is impossible to know the position and momentum of an atomic particle at any given instant in time” - Heisenberg’s Uncertainty Principle So, it is also difficult to accurately state the time in distributed systems. Based on the frequency of ticks used to count the time, we will always have some minuscule amount of drift, due to the nature of particle physics and the delay in fetching the time to continue ticking with.

Slide 99

Slide 99 text

Timing In distributed systems, we can tell the time using three different clocks. These clocks are:

Slide 100

Slide 100 text

Timing In distributed systems, we can tell the time using three different clocks. These clocks are: ● Realtime (based on world time and time zones)

Slide 101

Slide 101 text

Timing In distributed systems, we can tell the time using three different clocks. These clocks are: ● Realtime (based on world time and time zones) ● Monotonic (based on the difference from when we started counting)

Slide 102

Slide 102 text

Timing In distributed systems, we can tell the time using three different clocks. These clocks are: ● Realtime (based on world time and time zones) ● Monotonic (based on the difference from when we started counting) ● Logical (based on an algorithm for timing events)

Slide 103

Slide 103 text

Timing: Realtime Clocks This is the usual time that we get from our digital time keepers syncing with NTP servers that provide the time. However, this fails because the time obtained is usually off by some minuscule fraction or more by the time we fetch it.

Slide 104

Slide 104 text

Timing: Realtime Clocks This is the usual time that we get from our digital time keepers syncing with NTP servers that provide the time. However, this fails because the time obtained is usually off by some minuscule fraction or more by the time we fetch it. This is not an issue unless you have lots of events happening at the same time in different parts of the world; the tiny differences in skew, and the fact that two timestamps can be globally the same (a collision), mean this system can cause issues in consensus and the ordering of events across multiple transactions.

Slide 105

Slide 105 text

Timing: Monotonic Clocks Monotonic clocks work from some starting point, like the big bang (theory?) or, more realistically, the instant the system booted. In Linux, this time can be accessed through syscalls; documentation is available here: https://man7.org/linux/man-pages/man2/clock_getres.2.html
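A quick sketch of the difference between the two clocks in Python; on Linux both map onto the clock_gettime() syscall documented at the link above.

```python
import time

start_real = time.time()        # realtime: wall clock, NTP-adjusted
start_mono = time.monotonic()   # monotonic: counts up from some start point

time.sleep(0.5)

# The realtime delta can be skewed if NTP steps the clock mid-measurement;
# the monotonic delta only ever moves forward, so it is safe for durations.
print("realtime elapsed :", time.time() - start_real)
print("monotonic elapsed:", time.monotonic() - start_mono)

# The underlying clocks are also exposed directly on Unix:
# time.clock_gettime(time.CLOCK_REALTIME)
# time.clock_gettime(time.CLOCK_MONOTONIC)
```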

Slide 106

Slide 106 text

Timing: Logical Clocks This is based on algorithms that timestamp events so as to alleviate the issues with realtime clocks. An example of this is Lamport’s Clock, where rather than a tick being defined by some frequency pinned to a quartz crystal, we tick a monotonically increasing counter for every event that happens.

Slide 107

Slide 107 text

Timing: Logical Clocks We start off with a logical time X X

Slide 108

Slide 108 text

Timing: Logical Clocks An event A happens, X is incremented by 1 and assigned as the time Event A happened X X + 1 Event A

Slide 109

Slide 109 text

Timing: Logical Clocks Another event B happens, X + 1 from Event A is incremented by 1 and assigned X X + 1 Event A X + 2 Event B

Slide 110

Slide 110 text

Timing: Logical Clocks Logical clocks are great because they have no skew, and we can accurately order events based on the counter tracking how many “events” have occurred since we started, assigning IDs of time based on that value.
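A minimal Lamport clock sketch: the counter ticks on every local event, and on receive it jumps past the sender's timestamp so that causally later events always carry larger times. The two-node usage at the bottom is illustrative.

```python
class LamportClock:
    def __init__(self):
        self.time = 0                 # the logical time X

    def tick(self):                   # local event: X becomes X + 1
        self.time += 1
        return self.time

    def send(self):                   # stamp an outgoing message
        return self.tick()

    def receive(self, msg_time):      # merge with the sender's clock
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()          # event A happens on node a at logical time 1
print(b.receive(t))   # node b jumps to 2: event B is ordered after A
```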

Slide 111

Slide 111 text

Why is timing important for databases? When resolving conflicting entries across multiple nodes, the time is very important. The time tells a story of when something was created and what was created after it. This allows for what database enthusiasts call serializable consistency, which defines that if event A < event B in time, then event B takes precedence over event A. If we updated a row with event A, it would be overwritten by event B.

Slide 112

Slide 112 text

Why is timing important for databases? When resolving conflicting entries across multiple nodes, the time is very important. The time tells a story of when something was created and what was created after it. This allows for what database enthusiasts call serializable consistency, which defines that if event A < event B in time, then event B takes precedence over event A. If we updated a row with event A, it would be overwritten by event B. If the timing is off, then serializable consistency fails. Very large databases cannot rely on realtime clocks alone, so they use a mix of logical and realtime clocks.

Slide 113

Slide 113 text

Timing Resources It is very important to have the time correct across database replicas, you can find some resources below on how this is accomplished across distributed systems: ● Spanner: TrueTime and external consistency | Google Cloud ● Consistency without Clocks: The Fauna Distributed Transaction Protocol ● Facebook did tons of research just to fix timing: NTP: Building a more accurate time service at Facebook scale

Slide 114

Slide 114 text

Synchronization

Slide 115

Slide 115 text

Synchronization Go to this link https://www.google.com/search?q=cha+cha+slide and tap the shiny microphone icon. One step before the other, that’s all there is to it.

Slide 116

Slide 116 text

Synchronization In earlier slides, we discussed shared memory systems that have readers and writers. When performing reads and writes, we have to ensure we do not perform writes at the same time, or else we get collisions as discussed in the timing chapter. SHM Writer Reader Writer

Slide 117

Slide 117 text

Synchronization Synchronization in distributed systems is very complex depending on the setup. We covered sagas, where transactions can happen in one far away system. What happens when we have two events at the same time? Which do we take, A1 or A2? A 1 B C D 1 1 1 A 2

Slide 118

Slide 118 text

Synchronization For databases, this system of consistency, ignoring all the timing and networking, comes down to who goes first and who is next. It will usually involve entering a critical section within which such updates are the only ones permitted.

Slide 119

Slide 119 text

Synchronization This is achieved through the following: ● Locks ● Semaphores ● Spinlocks ● Conditions ● Signals

Slide 120

Slide 120 text

Synchronization This is achieved through the following: ● Locks ● Semaphores ● Spinlocks ● Conditions ● Signals And a lot more primitives depending on the use case.

Slide 121

Slide 121 text

Synchronization This is achieved through the following: ● Locks ● Semaphores ● Spinlocks ● Conditions ● Signals And a lot more primitives depending on the use case. I heavily recommend reading Dijkstra’s original paper on this topic: Concurrent Programming, Mutual Exclusion (1965; Dijkstra)
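A sketch of the first primitive on that list: a lock guarding a critical section so two writers never update a shared value at the same time. The counter workload is made up for illustration.

```python
import threading

counter = 0
lock = threading.Lock()

def writer():
    global counter
    for _ in range(100_000):
        with lock:              # enter the critical section
            counter += 1        # the read-modify-write is now safe

threads = [threading.Thread(target=writer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # always 200000; without the lock it could come up short
```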

Slide 122

Slide 122 text

Synchronization in Databases Databases are complex systems with varying critical sections that rely on multiple parameters. In replicated storage, the WAL (Write Ahead Log) is a localized transaction log containing sequential entries of transactions. In the wider spectrum of replication, the master replica will apply these replicated updates based on another set of factors; usually this only happens during a master failover. There are tons and tons of consensus requirements in database design which go beyond mutexes. However, most will resolve as long as the system stays operational.

Slide 123

Slide 123 text

Acknowledgements

Slide 124

Slide 124 text

Acknowledgements “You, you, you” - How many times did I spell you? It is 3 times. However, I only intended to spell it once. Seems we have duplicates.

Slide 125

Slide 125 text

Acknowledgements Acknowledgements in Distributed Systems aim to guarantee message ordering and delivery requirements. These are specific agreements which are made on messages delivered.

Slide 126

Slide 126 text

Acknowledgements Acknowledgements in Distributed Systems aim to guarantee message ordering and delivery requirements. These are specific agreements which are made on messages delivered. Usually, you will find acknowledgements more in message passing systems than in shared memory stores like databases. However, shared memory systems also enforce deduplication and other forms of acknowledgement in their design.

Slide 127

Slide 127 text

Acknowledgements You can have quite simply the following types of acknowledgements:

Slide 128

Slide 128 text

Acknowledgements You can have quite simply the following types of acknowledgements: ● At least once delivery

Slide 129

Slide 129 text

Acknowledgements You can have quite simply the following types of acknowledgements: ● At least once delivery ● Exactly once delivery

Slide 130

Slide 130 text

Acknowledgements You can have quite simply the following types of acknowledgements: ● At least once delivery ● Exactly once delivery ● At most once delivery

Slide 131

Slide 131 text

At least once delivery In this system, messages will be retried indefinitely until at least one message is received on one end and an acknowledgement sent. It does not matter if the requests are duplicated; what matters is that at least one copy of the message exists.

Slide 132

Slide 132 text

At least once delivery In this system, messages will be retried indefinitely until at least one message is received on one end and an acknowledgement sent. It does not matter if the requests are duplicated; what matters is that at least one copy of the message exists. This is useful for things like heartbeats, where multiple events are a good signal that things are fine. No events implies some failure in the existing system.

Slide 133

Slide 133 text

Exactly once delivery In this system, messages will be retried until exactly one message is received on one end and an acknowledgement sent. This is needed for critical events where duplicates just cannot happen.

Slide 134

Slide 134 text

Exactly once delivery In this system, messages will be retried until exactly one message is received on one end and an acknowledgement sent. This is needed for critical events where duplicates just cannot happen. In cases like data transfer, we cannot have the same packet sent and acknowledged twice, or not acknowledged at all. That would imply data corruption and misordering of TCP sequences.

Slide 135

Slide 135 text

At most once delivery In this system, the system will send a message but there is no issue if it is never received or sent.

Slide 136

Slide 136 text

At most once delivery In this system, the system will send a message but there is no issue if it is never received or sent. Systems using UDP packets for data transfer, like torrents, are a great example. At most once delivery implies that acknowledgements are not required for the system to be operational; these can also be looked at as lossy delivery acknowledgements.
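A sketch tying the three modes together: the sender achieves at least once delivery by retrying until acknowledged, and the receiver deduplicates by message ID, which is how exactly once processing is commonly approximated in practice. The transport, IDs and payloads are illustrative.

```python
seen_ids = set()

def receive(msg_id, payload):
    if msg_id in seen_ids:          # duplicate from a retried send
        return "ack"                # re-acknowledge, but do not reprocess
    seen_ids.add(msg_id)
    print("processing:", payload)
    return "ack"

def send_at_least_once(msg_id, payload, transport, max_tries=10):
    for _ in range(max_tries):      # retry until an ack arrives
        try:
            if transport(msg_id, payload) == "ack":
                return True
        except ConnectionError:
            continue                # lost message or lost ack: try again
    return False                    # at most once would simply not retry

send_at_least_once(1, "hello", receive)
send_at_least_once(1, "hello", receive)   # the duplicate is absorbed
```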

Slide 137

Slide 137 text

Acknowledgements in Databases In databases, you will usually want exactly once delivery, although in practice it is usually at least once delivery with deduplication. Databases are high integrity systems requiring strict acknowledgement of packets and of data storage and retrieval.

Slide 138

Slide 138 text

Acknowledgements in Databases In databases, you will usually want exactly once delivery, although in practice it is usually at least once delivery with deduplication. Databases are high integrity systems requiring strict acknowledgement of packets and of data storage and retrieval. The WAL is implemented to keep data localised until it can be replicated, and the acknowledgements are strictly required, so database protocols are built on TCP as a result.

Slide 139

Slide 139 text

Processing

Slide 140

Slide 140 text

Processing Remember our discussion on concurrency and parallelism: we stated that it is possible to process things at the same time, speeding things up. In a similar fashion, we can employ the state sharing techniques above to apply those ideas in distributed systems.

Slide 141

Slide 141 text

Processing - Pipelines To achieve the benefits of concurrency in addition to parallel execution, we model the stages of processing as chained workflows. These chained workflows are called pipelines, and these pipelines may employ any number of transactions and rollbacks in processing information.

Slide 142

Slide 142 text

Processing - Pipelines To achieve the benefits of concurrency in addition to parallel execution, we model the stages of processing as chained workflows. These chained workflows are called pipelines, and these pipelines may employ any number of transactions and rollbacks in processing information. As a refresher for the diagrams that follow: MQ - Message Queue, SHM - Shared Memory (a database)

Slide 143

Slide 143 text

⛕ Message Passing Terminology

Slide 144

Slide 144 text

Intermittent break: Message Passing Terminology Message passing implementations can have varying names, some might say:

Slide 145

Slide 145 text

Intermittent break: Message Passing Terminology Message passing implementations can have varying names, some might say: ● Message Queue

Slide 146

Slide 146 text

Intermittent break: Message Passing Terminology Message passing implementations can have varying names, some might say: ● Message Queue ● Message Broker

Slide 147

Slide 147 text

Intermittent break: Message Passing Terminology Message passing implementations can have varying names, some might say: ● Message Queue ● Message Broker ● Service Bus

Slide 148

Slide 148 text

Intermittent break: Message Passing Terminology Message passing implementations can have varying names, some might say: ● Message Queue ● Message Broker ● Service Bus ● Streams Processor

Slide 149

Slide 149 text

Intermittent break: Message Passing Terminology Message passing implementations can have varying names, some might say: ● Message Queue ● Message Broker ● Service Bus ● Streams Processor And several others.

Slide 150

Slide 150 text

Intermittent break: Message Passing Terminology Message passing implementations can have varying names, some might say: ● Message Queue ● Message Broker ● Service Bus ● Streams Processor And several others. There are differences but the basic idea is all of them implement message passing.

Slide 151

Slide 151 text

Intermittent break: Message Passing Terminology Despite the terminology, you will always have a source (start of the message transfer) and a sink (end of the message transfer) in the process. The source and sink could also be termed:

Slide 152

Slide 152 text

Intermittent break: Message Passing Terminology Despite the terminology, you will always have a source (start of the message transfer) and a sink (end of the message transfer) in the process. The source and sink could also be termed: ● PUBlisher and SUBscriber (PUBSUB)

Slide 153

Slide 153 text

Intermittent break: Message Passing Terminology Despite the terminology, you will always have a source (start of the message transfer) and a sink (end of the message transfer) in the process. The source and sink could also be termed: ● PUBlisher and SUBscriber (PUBSUB) ● Producer and Consumer

Slide 154

Slide 154 text

Intermittent break: Message Passing Terminology Despite the terminology, you will always have a source (start of the message transfer) and a sink (end of the message transfer) in the process. The source and sink could also be termed: ● PUBlisher and SUBscriber (PUBSUB) ● Producer and Consumer And several others, it’s all still message passing.

Slide 155

Slide 155 text

Processing

Slide 156

Slide 156 text

Processing - Pipelines A B C Workers MQ 1 SHM A D SHM B Workers MQ 2

Slide 157

Slide 157 text

Processing - Pipelines A -> Sends the message through MQ 1 to B A B C Workers MQ 1 SHM A D SHM B Workers MQ 2

Slide 158

Slide 158 text

Processing - Pipelines B -> Writes to SHM A and MQ 2, persisted and async respectively A B C Workers MQ 1 SHM A D SHM B Workers MQ 2

Slide 159

Slide 159 text

Processing - Pipelines C -> Processes each event from MQ 2 and persists to SHM A B C Workers MQ 1 SHM A D SHM B Workers MQ 2

Slide 160

Slide 160 text

Processing - Pipelines D -> Batch processes from SHM A and writes to another SHM B A B C Workers MQ 1 SHM A D SHM B Workers MQ 2

Slide 161

Slide 161 text

Processing - Pipelines SHM B is the final path of our processing pipeline A B C Workers MQ 1 SHM A D SHM B Workers MQ 2

Slide 162

Slide 162 text

Processing - Pipelines In distributed systems, we will always have pipelines that employ various forms of techniques, these pipelines can be:

Slide 163

Slide 163 text

Processing - Pipelines In distributed systems, we will always have pipelines that employ various forms of techniques, these pipelines can be: ● Event Driven / Stream Processing

Slide 164

Slide 164 text

Processing - Pipelines In distributed systems, we will always have pipelines that employ various forms of techniques, these pipelines can be: ● Event Driven / Stream Processing ● Batch Processing

Slide 165

Slide 165 text

Processing - Pipelines In distributed systems, we will always have pipelines that employ various forms of techniques; these pipelines can be: ● Event Driven / Stream Processing ● Batch Processing ● Extract, Transform, Load (ETL)

Slide 166

Slide 166 text

Processing - Event Driven / Stream Processing In this instance, we just take events or streams of data as they come and process them to where they need to go.

Slide 167

Slide 167 text

Processing - Event Driven / Stream Processing In this instance, we just take events or streams of data as they come and process them to where they need to go. No persistence needed to make this work. Webhooks are a good example of this paradigm.

Slide 168

Slide 168 text

Processing - Event Driven / Stream Processing A -> B and B -> C are Event Driven / Stream Processed A B C Workers MQ 1 SHM A D SHM B Workers MQ 2

Slide 169

Slide 169 text

Processing - Batch Processing In batch processing, we require some persistent store of data to aggregate all the information over some interval. Batch processing is useful when you need to gather tons of data and make sense of it in a single view.

Slide 170

Slide 170 text

Processing - Batch Processing In batch processing, we require some persistent store of data to aggregate all the information over some interval. Batch processing is useful when you need to gather tons of data and make sense of it in a single view. An example of this is payroll, where all your expenses over a month need to be added together so you can get a final ledger for that period.

Slide 171

Slide 171 text

Processing - Batch Processing SHM A -> D -> SHM B can be Batch Processed A B C Workers MQ 1 SHM A D SHM B Workers MQ 2

Slide 172

Slide 172 text

Processing - Extract, Transform, Load (ETL) In this processing pattern, you usually run a mix of batch and stream processing. You take data aggregated over a period of time (Extract), run various input and output streams performing operations on it along the way (Transform), and save the final output in a warehouse for consumption (Load)
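A toy ETL run over in-memory stand-ins; the store names and rows are made up. Extract pulls the rows batched in a source store, Transform streams per-row operations over them, and Load writes the aggregate into a warehouse for consumption.

```python
source_store = [
    {"user": "a", "spend": 10},
    {"user": "a", "spend": 5},
    {"user": "b", "spend": 7},
]
warehouse = {}

def extract():
    yield from source_store            # batch gathered over a period

def transform(rows):
    for row in rows:                   # per-row stream operations
        yield row["user"], row["spend"]

def load(pairs):
    for user, spend in pairs:          # final output for consumption
        warehouse[user] = warehouse.get(user, 0) + spend

load(transform(extract()))
print(warehouse)   # {'a': 15, 'b': 7}
```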

Slide 173

Slide 173 text

Processing - Extract, Transform, Load (ETL) This entire pipeline is an ETL A B C Workers MQ 1 SHM A D SHM B Workers MQ 2

Slide 174

Slide 174 text

Processing - OLAP and OLTP Usually in database processing systems, you will always find OLAP and OLTP mentioned.

Slide 175

Slide 175 text

Processing - OLAP and OLTP Usually in database processing systems, you will always find OLAP and OLTP mentioned. OLAP - Online Analytical Processing

Slide 176

Slide 176 text

Processing - OLAP and OLTP Usually in database processing systems, you will always find OLAP and OLTP mentioned. OLAP - Online Analytical Processing OLTP - Online Transaction Processing

Slide 177

Slide 177 text

Processing - OLAP OLAP databases are very large and accumulate tons of information for identifying trends and performing complex data analysis. These processing systems have very long storage requirements and integrate multiple sources of information. They usually relax transaction guarantees in favour of analytical throughput, and they require significant compute to run.

Slide 178

Slide 178 text

Processing - OLAP OLAP databases are very large and accumulate tons of information for identifying trends and performing complex data analysis. These processing systems have very long storage requirements and integrate multiple sources of information. They usually relax transaction guarantees in favour of analytical throughput, and they require significant compute to run. OLAP databases are usually the likes of Snowflake, Redshift and most other data warehousing solutions you will find.

Slide 179

Slide 179 text

Processing - OLTP OLTPs are good for stream/event driven and short-term batch processing outcomes, where data needs to be processed in a relatively short time and the data sources to aggregate from are minimal. OLTP databases usually have very strong guarantees on short-lived transactions, which keeps individual operations fast.

Slide 180

Slide 180 text

Processing - OLTP OLTPs are good for stream/event driven and short-term batch processing outcomes, where data needs to be processed in a relatively short time and the data sources to aggregate from are minimal. OLTP databases usually have very strong guarantees on short-lived transactions, which keeps individual operations fast. Examples are the classic relational databases: RDS, Aurora, Spanner and others.

Slide 181

Slide 181 text

Processing - Conclusion In summary, several of the patterns noted here are used internally in the design of database engines. In building distributed systems with databases, these are similar concepts that can be applied to their design.

Slide 182

Slide 182 text

Consensus

Slide 183

Slide 183 text

Consensus: CAP Unlike the head wear, this is more about systems. CAP refers to:

Slide 184

Slide 184 text

Consensus: CAP Unlike the head wear, this is more about systems. CAP refers to: ● Consistency

Slide 185

Slide 185 text

Consensus: CAP Unlike the head wear, this is more about systems. CAP refers to: ● Consistency ● Availability

Slide 186

Slide 186 text

Consensus: CAP Unlike the head wear, this is more about systems. CAP refers to: ● Consistency ● Availability ● Partition Tolerance

Slide 187

Slide 187 text

Consensus: CAP Consistency means that knowledge of information in the distributed system is always the same at any given time.

Slide 188

Slide 188 text

Consensus: CAP Consistency means that knowledge of information in the distributed system is always the same at any given time. Availability means that every request to the system always receives a response, without error.

Slide 189

Slide 189 text

Consensus: CAP Consistency means that knowledge of information in the distributed system is always the same at any given time. Availability means that every request to the system always receives a response, without error. Partition Tolerance means that no matter the number of nodes, we will always keep operating despite the failure of other nodes.

Slide 190

Slide 190 text

Consensus: CAP Theorem The CAP theorem basically states that it is impossible to have Consistency, Availability and Partition Tolerance all at the same time in a distributed system.

Slide 191

Slide 191 text

Consensus: CAP Theorem The CAP theorem basically states that it is impossible to have Consistency, Availability and Partition Tolerance all at the same time in a distributed system. Partition Tolerance is more aligned with the “distributed” aspect of systems, as every node can independently perform actions.

Slide 192

Slide 192 text

Consensus: CAP Theorem The CAP theorem basically states that it is impossible to have Consistency, Availability and Partition Tolerance all at the same time in a distributed system. Partition Tolerance is more aligned with the “distributed” aspect of systems, as every node can independently perform actions; doing so in consensus means that we will either sacrifice availability of the system, since the now failed nodes will give errors.

Slide 193

Slide 193 text

Consensus: CAP Theorem The CAP theorem basically states that it is impossible to have Consistency, Availability and Partition Tolerance all at the same time in a distributed system. Partition Tolerance is more aligned with the “distributed” aspect of systems, as every node can independently perform actions; doing so in consensus means that we will either sacrifice availability of the system, since the now failed nodes will give errors, or consistency in the results, since failed nodes cannot get the most updated information.

Slide 194

Slide 194 text

Consensus: CAP Theorem The CAP Theorem holds that there is no perfect distributed system that can operate in all three states; only two are possible. Hence it is visualised as the CAP triangle.

Slide 195

Slide 195 text

CAP Theorem (Source: CAP theorem - Wikipedia)

Slide 196

Slide 196 text

Consensus: FLP Impossibility This paper was written by Fischer, Lynch and Paterson (hence the name). Source: Impossibility of Distributed Consensus with One Faulty Process It draws on three properties of the consensus process:

Slide 197

Slide 197 text

Consensus: FLP Impossibility This paper was written by Fischer, Lynch and Paterson (hence the name). Source: Impossibility of Distributed Consensus with One Faulty Process It draws on three properties of the consensus process: ● Agreement

Slide 198

Slide 198 text

Consensus: FLP Impossibility This paper was written by Fischer, Lynch and Paterson (hence the name). Source: Impossibility of Distributed Consensus with One Faulty Process It draws on three properties of the consensus process: ● Agreement ● Validity

Slide 199

Slide 199 text

Consensus: FLP Impossibility This paper was written by Fischer, Lynch and Paterson (hence the name). Source: Impossibility of Distributed Consensus with One Faulty Process It draws on three properties of the consensus process: ● Agreement ● Validity ● Termination

Slide 200

Slide 200 text

Consensus: FLP Impossibility Agreement defines that a consensus protocol should ensure decisions are defined and agreed upon by active non-failing members of the network.

Slide 201

Slide 201 text

Consensus: FLP Impossibility Agreement defines that a consensus protocol should ensure decisions are defined and agreed upon by active non-failing members of the network. Validity implies that the value proposed is made by members of the network participating in the voting process; it cannot be externally provided.

Slide 202

Slide 202 text

Consensus: FLP Impossibility Agreement defines that a consensus protocol should ensure decisions are defined and agreed upon by active non-failing members of the network. Validity implies that the value proposed is approved by members of the network participating in the voting process; it cannot be externally provided or defaulted. Termination means that every active non-faulty member eventually reaches a decision.

Slide 203

Slide 203 text

Consensus: FLP Impossibility In the paper, we assume all processes are asynchronous, implying there is no upper bound on processing time before a decision is made. The basis of the paper establishes that:

Slide 204

Slide 204 text

Consensus: FLP Impossibility In the paper, we assume all processes are asynchronous, implying there is no upper bound on processing time before a decision is made. The basis of the paper establishes that: 1. it is not possible to have a distributed system that can guarantee its current state is always up to date without failure

Slide 205

Slide 205 text

Consensus: FLP Impossibility In the paper, we assume all processes are asynchronous, implying there is no upper bound on processing time before a decision is made. The basis of the paper establishes that: 1. it is not possible to have a distributed system that can guarantee its current state is always up to date without failure 2. consensus cannot be said to occur within a bounded time; there will always be delays that violate a predefined interval for resolution

Slide 206

Slide 206 text

Consensus: FLP Impossibility and CAP These two argue different perspectives of the same problem: that distributed systems are a tradeoff in consensus.

Slide 207

Slide 207 text

Consensus: FLP Impossibility and CAP These two argue different perspectives of the same problem: that distributed systems are a tradeoff in consensus. FLP Impossibility makes it known that time in distributed systems cannot be bounded and the state of each interacting member is never perfectly known, hence consensus will always take an unbounded amount of time due to uncooperative members.

Slide 208

Slide 208 text

Consensus: FLP Impossibility and CAP These two argue different perspectives of the same problem: that distributed systems are a tradeoff in consensus. FLP Impossibility makes it known that time in distributed systems cannot be bounded and the state of each interacting member is never perfectly known, hence consensus will always take an unbounded amount of time due to uncooperative members. CAP argues that a distributed system cannot be perfectly designed to have all members achieve consensus without ignoring failing members or the known state of the system.

Slide 209

Slide 209 text

Conclusion

Slide 210

Slide 210 text

Conclusion Distributed systems incorporate asynchronous patterns of communication that make it impossible to reliably know the state of the network at any given time.

Slide 211

Slide 211 text

Conclusion Distributed systems incorporate asynchronous patterns of communication that make it impossible to reliably know the state of the network at any given time. The best workaround for a failure depends on the problem, and there is extensive literature that takes these failures into consideration when designing solutions.

Slide 212

Slide 212 text

Conclusion Distributed systems incorporate asynchronous patterns of communication that make it impossible to reliably know the state of the network at any given time. The best workaround for a failure depends on the problem, and there is extensive literature that takes these failures into consideration when designing solutions. In summary, there is no perfect distributed system.

Slide 213

Slide 213 text

Questions