cover additional material. The material covered here is relatively simplified, but a technical summary of the concepts will still be delivered. A lot of these concepts are a mix of theory and practice. I hope you have fun going through this.
II chapter titled “Distributed Systems” of Database Internals.
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” - Leslie Lamport
individual components in the distributed system. Multiple nodes communicate with each other in a distributed system; these nodes may also be called replicas, the terminology varies
stability by finding smarter ways to have “redundant” paths, improving the ease of separating individual systems while keeping the whole working. Despite helping improve stability, these concepts can also increase complexity. Use them sparingly.
to distributed algorithms; this is an overview of the more infrastructure-related and theoretical parts. Algorithms handle the interactions between these systems at a lower level; this overview is very high level, and we will not be doing any heavy coding beyond small illustrative sketches. This discussion is based on the Database Internals book by Alex Petrov.
tasks so they can run faster. It is simply “division of labour” across various workers. The more workers you have does not always imply faster performance And breaking a task into bits does not always make it more efficient
or more independent tasks independently. Concurrency can mean taking a process A and creating a chain of processes to become: B -> C -> D -> A Where B, C and D are running independently to achieve A. Parallelism is the ability to run the same thing at the same time multiple times. Parallelism means taking a process A and making it run in X times where each is distinctly independent but performing the same operation. X -> A X -> A X -> A
multiple tasks and is not limited by hardware Concurrency can be context switched across hardware so there is not a “hard” reliance on the number of cores assigned. Parallelism implies each process is running on a single dedicated resource and will usually just stay active from start to finish Parallelism requires actual physical hardware to run so if you have X cores, you can reasonably run X versions of a task before it becomes slower.
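To make the distinction concrete, here is a minimal Python sketch (not from the book; the work function and task labels are made up): the first block interleaves three independent tasks, the second runs copies of the same task on separate cores.

```python
import multiprocessing
import threading

def work(label):
    # stand-in for any unit of work
    print("running", label)

if __name__ == "__main__":
    # Concurrency: independent tasks B, C and D interleave on the scheduler,
    # regardless of how many cores exist.
    tasks = [threading.Thread(target=work, args=(name,)) for name in ("B", "C", "D")]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()

    # Parallelism: the same task A runs simultaneously, one copy per core.
    with multiprocessing.Pool(processes=3) as pool:
        pool.map(work, ["A", "A", "A"])
```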
in the physical setup of a core to use a high speed register to emulate context switching with such performance that it can be likened to an actual physical core. [Figure: a hyperthreaded core exposing Thread 1 and Thread 2 through shared registers vs. a real core running a single thread]
schedule two of the same tasks on a single core, they will really be fighting for resources on that core. This works for most things because cache access and other resources on a core are efficient, so it appears seamless. The operating system will usually just treat hyperthreads as cores to simplify scheduling tasks onto them, but there is always a performance hit for such emulation. If you have latency-sensitive or contention-driven processes like databases that need performance, use physical cores instead of hyperthreaded ones.
concurrency in parallelism but you cannot have parallelism in concurrent tasks. The reason is that parallelism implies the same process is executed at the same time in independent blobs. Concurrency implies a composition of independent single tasks working together, so their timelines overlap.
in distributed systems. You either have:
• Message passing (queues, signals)
• Shared memory (databases, heaps)
Unlike message passing, shared memory requires synchronization to allow for safe (serializable) updates. The concept of serializable consistency will be discussed later on.
you ephemerally take some data from point A and pass it to point B through a pipe. That pipe is usually called a queue, but you can also do this through I/O redirection, etc. In Linux, most queues will use the /dev/mqueue filesystem, which provides kernel support for message passing primitives. See here: https://man7.org/linux/man-pages/man7/mq_overview.7.html This can also be done in user space with your basic queue data structure.
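As a rough sketch of the user-space version (Python’s standard queue.Queue standing in for a kernel-backed POSIX queue; the producer/consumer names are mine):

```python
import queue
import threading

pipe = queue.Queue()          # the "pipe" between point A and point B

def producer():
    for i in range(3):
        pipe.put(f"message {i}")   # point A ephemerally hands data off
    pipe.put(None)                 # sentinel: nothing more to send

def consumer():
    while True:
        msg = pipe.get()
        if msg is None:            # sentinel received, stop consuming
            break
        print("point B received:", msg)

threading.Thread(target=producer).start()
consumer()
```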
understand as dictionaries, tuples etc. Things that persist data. Shared memory stores are basically databases, filesystems and, more trivially, files on filesystems. In Linux, even a file can be a filesystem. [Figure: writers and a reader sharing an SHM block]
you are guaranteed that, within the allocated block of memory, persisted data can be retrieved multiple times. In Linux, the /dev/shm filesystem allows you to perform allocations for shared memory stores with kernel support. See https://man7.org/linux/man-pages/man7/shm_overview.7.html Memory allocators, arenas and various other primitives allow for allocation of shared memory blocks. Just like the message passing note, this can be implemented without this filesystem.
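A small sketch of the same idea using Python’s standard shared memory wrapper, which is backed by /dev/shm on Linux; the block name demo_block is arbitrary:

```python
from multiprocessing import shared_memory

# Writer: allocate a named block and persist some bytes into it.
block = shared_memory.SharedMemory(create=True, size=64, name="demo_block")
block.buf[:5] = b"hello"

# Reader: attach to the same block by name and retrieve the data,
# which can be read as many times as needed while the block exists.
reader = shared_memory.SharedMemory(name="demo_block")
print(bytes(reader.buf[:5]))   # b'hello'

reader.close()
block.close()
block.unlink()                 # release the allocation when done
```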
can also have distributed memory, which, in addition to data locality, moves memory to a store outside the system where it is running. The interconnect allowing data transfer is usually some network (wired or wireless), and pages can be swapped between independent systems as required.
protocols is Remote DMA (RDMA), which allows direct access to memory pages while bypassing the CPU. You can find this in InfiniBand NICs and GPUs, for example. GPUs have both local and remote memory regions that can be accessed over interconnects for fast processing and retrieval across many parallel executing cores.
sense. This means we need some way to communicate. Networking allows different systems to pass information over a protocol that defines the standard for how this information is sent and received.
and server understand the protocol so they can communicate with each other. For databases, we have protocols too: MySQL, MongoDB, Redis, etc. all implement a protocol for communication with their clients.
is isolated from the server due to networking issues. This can be a broken link or an intentional block (e.g. firewalls). We call these scenarios network partitions, and they can be good or BAD
Redundancy
[Figure: a client communicating with a server over redundant paths]
In designing resilient distributed systems, there are usually notions of redundancy, where we apply multiple paths for communication.
fail; in the delicate balance of distributed systems, this can cause cascading failures. Cascading failures can occur for a number of reasons: for example, replication failures reach a limit due to network partitions, and the nodes fail over due to WAL build-up causing the allocated storage to fill up. That’s a mouthful of things in succession, but it happens
of the process, and designing to handle them is more difficult than it might appear. The general goal is to try to fix them as they appear because there’s no universal fix. In the words of Leslie Lamport: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
a number of techniques to do so. I cover only two for simplicity: you can use retries in the event of short blips, and/or use transactions to roll back in the event of such failures so they can be retried later on.
machine. Imagine if 1000 nodes retried (multiple) requests to a server during a momentary outage; the requests coming back would leave the entire system under significant load due to the now synchronised barrage of failed requests. To avoid this, apply some random jitter and backoff on startup, on retry, etc., to save yourself from the loop of incoming requests that ends up causing cascading failures downstream.
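A minimal sketch of that advice (the helper name and the flaky call are hypothetical): exponential backoff plus random jitter spreads retries out so recovering clients don’t hammer the server in lockstep.

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=0.1):
    """Retry a flaky call, backing off exponentially with random jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                     # give up after the final attempt
            delay = base_delay * (2 ** attempt)           # exponential backoff
            time.sleep(delay + random.uniform(0, delay))  # jitter de-synchronises clients

# e.g. retry_with_backoff(lambda: send_request())  # send_request is a hypothetical flaky call
```

In real deployments the delay is usually also capped at some maximum so a long outage does not push retries arbitrarily far apart.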
prevent actions that partially succeed. In distributed systems, transactions are performed through different means: consensus algorithms like Raft and Paxos, 2+ phase commits, and various others.
star trek reference) Sagas are a way to chain transactions across many distributed systems. When you have multiple transactions that occur across multiple isolated data sources, it is impossible to synchronize them because each place the transaction goes has a different view of the world.
a transaction failure on one of many services from being achieved. If one system fails in the chain, all the previous changes are rolled back downstream to prevent inconsistency. As you can imagine, it is quite difficult to achieve this pattern, but it is very useful in proper microservice designs for systems with independent transactions, e.g. finance
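Here is a sketch of the saga idea under the simplifying assumption that every step exposes a compensating action; the (do, undo) pairs are hypothetical placeholders, not a real service API.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        # roll back everything already done, newest first, to restore consistency
        for compensate in reversed(completed):
            compensate()
        raise

# Hypothetical chain across independent services:
run_saga([
    (lambda: print("debit account"), lambda: print("refund account")),
    (lambda: print("reserve stock"), lambda: print("release stock")),
    (lambda: print("ship order"),    lambda: print("cancel shipment")),
])
```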
of an atomic particle at any given instant in time” - Heisenberg’s Uncertainty Principle. So, it is also difficult to accurately state the time in distributed systems. Based on the frequency of ticks used to count time, we will always have some minuscule amount of drift due to the nature of particle physics and the delay in fetching the time to continue ticking with.
three different clocks. These clocks are:
• Realtime (based on world time and time zones)
• Monotonic (based on the difference of when we started counting)
• Logical (based on an algorithm for timing events)
get from our digital timekeepers syncing with NTP servers that provide the time. However, this fails because the time lost in fetching the current time is usually off by some minuscule fraction or larger. This is not an issue unless you have lots of events happening at the same time in different parts of the world; the tiny difference in skew, and the fact that the time can be globally the same (collision), mean this system can cause issues in consensus and ordering of events across multiple transactions.
point, like the big bang (theory?) or, to be more realistic, the instant the system booted. In Linux, this time can be accessed through syscalls; documentation is available here: https://man7.org/linux/man-pages/man2/clock_getres.2.html
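For illustration, both clocks can be read from Python, which wraps the clock_gettime(2) call referenced above (the constants below are available on Linux):

```python
import time

# Realtime clock: world time, synced (and occasionally stepped) by NTP.
wall = time.clock_gettime(time.CLOCK_REALTIME)

# Monotonic clock: seconds since some arbitrary starting point (e.g. boot);
# it never jumps backwards, which makes it suitable for measuring intervals.
steady = time.clock_gettime(time.CLOCK_MONOTONIC)

print(wall, steady)
```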
the transactions of events so as to better alleviate the issues with realtime clocks. An example of this is Lamport’s clock, where rather than a tick being defined by some frequency pinned to a quartz crystal, we tick a monotonically increasing counter for every event that happens.
no skew and we can accurately process the events based on the counter tracking how many “events” have occurred since we started and assigning IDs of time based on that value.
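A minimal sketch of such a logical clock (the class is my own illustration, following Lamport’s rule): every local event ticks the counter, and receiving a message merges the sender’s counter so the assigned IDs respect the order of events.

```python
class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self):
        """A local event happened; advance and return its logical timestamp."""
        self.counter += 1
        return self.counter

    def receive(self, sender_time):
        """Merge a timestamp carried on an incoming message, then tick."""
        self.counter = max(self.counter, sender_time) + 1
        return self.counter

node = LamportClock()
print(node.tick())        # 1: first local event
print(node.receive(10))   # 11: catch up past the sender's counter
```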
across multiple nodes, the time is very important. The time tells a story of when something was created and what was created after it. This allows for what database enthusiasts call serializable consistency, which defines that if event A < event B in time, then event B takes precedence over event A. If we updated a row with event A, it would be overwritten by event B. If the timing is off, then serializable consistency fails. Very large databases cannot rely on realtime clocks, so they use a mix of logical and realtime clocks.
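To make the “event B overwrites event A” rule concrete, here is a tiny last-writer-wins register (a toy of my own, not a real database API): each write carries a timestamp and the larger timestamp wins.

```python
class LWWRegister:
    """Keep whichever value carries the latest timestamp."""
    def __init__(self):
        self.timestamp = 0
        self.value = None

    def write(self, timestamp, value):
        if timestamp > self.timestamp:   # the later event takes precedence
            self.timestamp = timestamp
            self.value = value

row = LWWRegister()
row.write(1, "event A")
row.write(2, "event B")   # overwrites A because it happened later
print(row.value)          # "event B" -- but only if the timestamps can be trusted
```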
To see how timing is kept correct across database replicas, you can find some resources below on how this is accomplished across distributed systems:
• Spanner: TrueTime and external consistency | Google Cloud
• Consistency without Clocks: The Fauna Distributed Transaction Protocol
• Facebook did tons of research just to fix timing: NTP: Building a more accurate time service at Facebook scale
have readers and writers. When performing reads and writes, we have to ensure we do not perform writes at the same time, else we get collisions, as discussed in the timing section. [Figure: two writers and a reader contending on the same SHM block]
the setup. We covered sagas where transactions can happen in one far away system. What happens when we have two events at the same time? What do we take, A1 or A2? [Figure: two simultaneous events A1 and A2 propagating to nodes B, C and D]
timing and networking is based on who can go first and who’s next. It will usually involve entering a critical section, the only place where such updates are permitted to happen.
• Semaphores
• Spinlocks
• Conditions
• Signals
And a lot more primitives depending on the use case. I heavily recommend reading Dijkstra’s original paper on this topic: Concurrent Programming, Mutual Exclusion (1965; Dijkstra)
sections that rely on multiple parameters. In replicated storage, the WAL (Write Ahead Log) is a localized transaction log containing sequential entries of transactions. In the wider spectrum of replication, the master replica will apply these replicated updates based on another set of factors; usually this only happens during a master failover. There are tons and tons of consensus requirements in database design which go beyond mutexes. However, most will resolve as long as the system stays operational.
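As a basic illustration of the primitives above (a sketch, not how a database implements its critical sections), a mutex serialises writers so only one of them is inside the critical section at a time:

```python
import threading

lock = threading.Lock()   # the mutex guarding the critical section
counter = 0

def writer():
    global counter
    for _ in range(100_000):
        with lock:            # enter the critical section
            counter += 1      # only one writer may update at a time

threads = [threading.Thread(target=writer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)   # always 200000; without the lock the result could be lower
```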
and delivery requirements. These are specific agreements made about how messages are delivered. Usually, you will find acknowledgements more in message passing systems than in shared memory stores like databases. However, shared memory systems also enforce deduplication and other forms of acknowledgement in their design.
be retried indefinitely until there is at least one message received on one end and an acknowledgement sent. It does not matter if the requests are duplicated; what matters is that at least one instance of the message exists. This is useful for things like heartbeats, where multiple events are a good signal that things are fine. No events implies some failure in the existing system.
retried indefinitely until exactly one message is received on one end and an acknowledgement sent. This is needed for critical events where duplicates just cannot happen. In cases like data transfer, we cannot have the same packet, for example, sent and acknowledged twice, or not acknowledged at all. That would imply data corruption and misordering of TCP sequences.
send a message but there is no issue if it is never received or sent. Systems using UDP packets for data transfer, like torrents, are a great example. Having at most once delivery implies that acknowledgements are not required for the system to be operational; these can also be looked at as lossy deliveries.
exactly once delivery, although in practice this is usually achieved as at least once delivery combined with deduplication. Databases are high-integrity systems requiring strict acknowledgement of packets and of data storage and retrieval. The WAL is implemented to keep data localised until it can be replicated, and the acknowledgements are strictly required, so database protocols are built on TCP as a result.
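A sketch of how that usually plays out in practice (all names are mine): the sender retries because it never sees an acknowledgement, so duplicates land on the channel, and the receiver deduplicates by message ID before processing.

```python
import queue

channel = queue.Queue()
processed_ids = set()

def send(msg_id, payload, retries=3):
    # No acknowledgement ever arrives in this toy example, so the sender
    # resends: at least once delivery, duplicates included.
    for _ in range(retries):
        channel.put((msg_id, payload))

def receive():
    while not channel.empty():
        msg_id, payload = channel.get()
        if msg_id in processed_ids:
            continue                      # drop duplicates: idempotent processing
        processed_ids.add(msg_id)
        print("processed", msg_id, payload)

send(1, "write row 42")
receive()    # the payload is processed once despite three deliveries
```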
that it is possible to process things at the same time, speeding up the overall process. In a similar fashion, we can employ state sharing to apply those same techniques in distributed systems.
addition to parallel execution, we model the stages of processing as chained workflows. These chained workflows are called pipelines, and they may employ any number of transactions and rollbacks while processing information. As a refresher:
MQ - Message Queue
SHM - Shared Memory (database)
varying names, some might say:
• Message Queue
• Message Broker
• Service Bus
• Streams Processor
And several others. There are differences, but the basic idea is all of them implement message passing.
always have a source (start of the message transfer) and a sink (end of the message transfer) in the process. The source and sink could also be termed:
• PUBlisher and SUBscriber (PUBSUB)
• Producer and Consumer
And several others; it’s all still message passing.
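A toy publish/subscribe sketch (not any particular broker’s API): publishing to a topic fans the message out to every subscriber’s queue.

```python
import queue
from collections import defaultdict

topics = defaultdict(list)   # topic name -> list of subscriber queues

def subscribe(topic):
    q = queue.Queue()
    topics[topic].append(q)
    return q

def publish(topic, message):
    for subscriber in topics[topic]:   # fan out to every subscriber
        subscriber.put(message)

orders_a = subscribe("orders")
orders_b = subscribe("orders")
publish("orders", "order #1 created")

print(orders_a.get())   # both subscribers receive the same message
print(orders_b.get())
```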
pipelines that employ various forms of techniques; these pipelines can be:
• Event Driven / Stream Processing
• Batch Process
• Extract, Transform, Load (ETL)
we just take events or streams of data as they come and process them to where they need to go. No persistence needed to make this work. Webhooks are a good example of this paradigm.
persistent store of data to aggregate all the information over some interval. Batch processing is useful when you need to gather tons of data and make sense of it in a single outlook. An example of this is payroll, where all your expenses over a month need to be added together so you can get a final ledger for that period.
you are usually running a mix of batch and stream processing. You take data aggregated over a period of time (Extract), with various input and output streams performing operations on it in the process (Transform), only to save the final output in a warehouse for consumption (Load).
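A compressed sketch of that flow, with in-memory lists standing in for the real sources and the warehouse:

```python
# Hypothetical in-memory sources standing in for real systems.
sales_db = [{"item": "book", "price": "10"}, {"item": "pen", "price": "2"}]
returns_csv = [{"item": "book", "price": "-10"}]
warehouse = []   # the final store the Load step writes into

def extract():
    # pull from every input source over the chosen interval
    return sales_db + returns_csv

def transform(rows):
    # normalise types so the warehouse holds consistent records
    return [{"item": r["item"], "price": float(r["price"])} for r in rows]

def load(rows):
    warehouse.extend(rows)

load(transform(extract()))
print(sum(r["price"] for r in warehouse))   # aggregated ledger: 2.0
```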
tons of information for acquiring trends and performing complex data analysis. These processing systems have very long storage requirements and integrate multiple sources of information. They have very strong transaction guarantees and require significant compute to run. OLAP databases are usually the likes of Snowflake, Redshift and most other data warehousing solutions you will find.
and short term batch processing outcomes where data needs to be processed in a relatively short time and the data sources to aggregate from are minimal. OLTP databases usually do not enforce the strongest transaction guarantees by default, and this eases performance in the processing aspect. Examples are the classic relational databases: RDS, Aurora, Spanner and others.
here are used internally in the design of database internals. In building distributed systems with databases, similar concepts can be applied to their design.
distributed system is always the same at any given time. Availability means that every request for such information will always receive a response without error. Partition Tolerance means that, no matter the number of nodes, the system keeps operating despite the failure of other nodes.
is impossible to have Consistency, Availability and Partition Tolerance all at once in a distributed system. Partition Tolerance is most aligned with the “distributed” aspect of systems, as every node can independently perform actions; doing so in consensus means we will sacrifice either availability, since the now failed nodes will give errors, or consistency in the results, since failed nodes cannot get the most updated information.
and Paterson (hence the name). Source: Impossibility of Distributed Consensus with One Faulty Process
It draws on three properties of the consensus process:
• Agreement
• Validity
• Termination
ensure decisions are defined and agreed upon by the active, non-failing members of the network. Validity implies that the value decided on must have been proposed by a member of the network participating in the voting process; it cannot be externally provided or defaulted. Termination means that every active, non-faulty member eventually reaches a decision.
are asynchronous, implying there is no upper bound on processing time before a decision is made. The basis of the paper establishes that:
1. it is not possible to have a distributed system that can guarantee its current state is always up to date without failure
2. consensus cannot be said to occur within a bounded time; there will always be delays that violate a predefined interval for resolution
of the same problem: that distributed systems are a tradeoff in consensus. FLP impossibility makes it known that time in distributed systems cannot be bounded and the state of each interacting member is never perfectly known, hence consensus can take an unbounded amount of time due to uncooperative members. CAP argues that a distributed system cannot be perfectly designed to have all members achieve consensus without ignoring failing members or the known state of the system.
it impossible to reliably know the state of the network at any given time. Workarounds for failures depend on the problem at hand, and there is extensive literature that takes these failures into consideration when designing solutions. In summary, there is no perfect distributed system.