Scalability, Availability & Stability Patterns
Jonas Bonér, CTO Typesafe
Twitter: @jboner

Outline

Introduction

Scalability Patterns

Managing Overload

Scale up vs Scale out?

General recommendations
• Immutability as the default
• Referential Transparency (FP)
• Laziness
• Think about your data:
  • Different data need different guarantees

Scalability Trade-offs

Trade-offs
• Performance vs Scalability
• Latency vs Throughput
• Availability vs Consistency

Performance vs Scalability

How do I know if I have a performance problem? If your system is slow for a single user

How do I know if I have a scalability problem? If your system is fast for a single user but slow under heavy load

Latency vs Throughput

You should strive for maximal throughput with acceptable latency

Availability vs Consistency

Brewer’s CAP theorem

You can only pick 2 at a given point in time:
• Consistency
• Availability
• Partition tolerance

Centralized system
• In a centralized system (RDBMS etc.) we don’t have network partitions, i.e. no P in CAP
• So you get both:
  • Availability
  • Consistency

ACID: Atomic, Consistent, Isolated, Durable

Distributed system
• In a distributed system we (will) have network partitions, i.e. the P in CAP
• So you only get to pick one of:
  • Availability
  • Consistency

CAP in practice:
• ...there are only two types of systems:
  1. CP
  2. AP
• ...there is only one choice to make: in case of a network partition, what do you sacrifice?
  1. C: Consistency
  2. A: Availability

BASE: Basically Available, Soft state, Eventually consistent

Eventual Consistency ...is an interesting trade-off. But let’s get back to that later.

Availability Patterns
• Fail-over
• Replication
  • Master-Slave
  • Tree replication
  • Master-Master
  • Buddy Replication

What do we mean by Availability?

Fail-over (diagram copyright Michael Nygard)

Fail-over: but fail-over is not always this simple (diagram copyright Michael Nygard)

Fail-back (diagram copyright Michael Nygard)

Network fail-over

Replication
• Active replication - Push
• Passive replication - Pull (see the sketch below)
  • Data not available: read from peer, then store it locally
  • Works well with timeout-based caches

Replication
• Master-Slave replication
• Tree Replication
• Master-Master replication
• Buddy replication
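
A minimal Java sketch of the passive (pull) variant, assuming an illustrative Peer interface: on a local miss, read from a peer, store the value locally, and let a TTL age the copy out (which is why it pairs well with timeout-based caches).

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Passive (pull) replication sketch: fetch from a peer on miss, cache with a TTL. */
final class PullReplica {
    interface Peer { String read(String key); }          // illustrative peer interface

    private record Entry(String value, long expiresAt) {}
    private final Map<String, Entry> local = new ConcurrentHashMap<>();
    private final Peer peer;
    private final long ttlMillis;

    PullReplica(Peer peer, long ttlMillis) { this.peer = peer; this.ttlMillis = ttlMillis; }

    String read(String key) {
        Entry e = local.get(key);
        if (e != null && e.expiresAt() > System.currentTimeMillis())
            return e.value();                            // fresh local copy
        String value = peer.read(key);                   // data not available: read from peer
        local.put(key, new Entry(value, System.currentTimeMillis() + ttlMillis)); // store locally
        return value;
    }
}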

Master-Slave Replication

Tree Replication

Master-Master Replication

Buddy Replication

Scalability Patterns: State
• Partitioning
• HTTP Caching
• RDBMS Sharding
• NOSQL
• Distributed Caching
• Data Grids
• Concurrency

Partitioning

HTTP Caching
Reverse Proxy:
• Varnish
• Squid
• rack-cache
• Pound
• Nginx
• Apache mod_proxy
• Traffic Server

HTTP Caching: CDN, Akamai

Generate Static Content
Precompute content with:
• Homegrown + cron or Quartz
• Spring Batch
• Gearman
• Hadoop
• Google Data Protocol
• Amazon Elastic MapReduce

HTTP Caching: first request

HTTP Caching: subsequent request

Service of Record (SoR)

Service of Record
• Relational Databases (RDBMS)
• NOSQL Databases

How to scale out a RDBMS?

Sharding
• Partitioning
• Replication

Sharding: Partitioning

Sharding: Replication
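
A minimal sketch of the partitioning half: a shard router that hashes a key to one of N partitions (class and field names are invented for illustration).

import java.util.List;

/** Minimal sketch: route a key to one of N shards by hashing. */
final class ShardRouter {
    private final List<String> shardUrls;   // e.g. one JDBC URL per partition

    ShardRouter(List<String> shardUrls) { this.shardUrls = shardUrls; }

    /** Same key always maps to the same shard, so reads find what writes stored. */
    String shardFor(String key) {
        int bucket = Math.floorMod(key.hashCode(), shardUrls.size());
        return shardUrls.get(bucket);
    }
}

Plain modulo hashing reshuffles most keys whenever the number of shards changes; the consistent hashing ring shown later avoids that.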

ORM + rich domain model anti-pattern
• Attempt: read an object from the DB
• Result: you sit with your whole database in your lap

Think about your data. Then think again.
• When do you need ACID?
• When is Eventually Consistent a better fit?
• Different kinds of data have different needs

When is a RDBMS not good enough?

Scaling reads to a RDBMS is hard

Scaling writes to a RDBMS is impossible

Do we really need a RDBMS? Sometimes... but many times we don’t.

NOSQL (Not Only SQL)

NOSQL
• Key-Value databases
• Column databases
• Document databases
• Graph databases
• Datastructure databases

Who’s ACID?
• Relational DBs (MySQL, Oracle, Postgres)
• Object DBs (Gemstone, db4o)
• Clustering products (Coherence, Terracotta)
• Most caching products (ehcache)

Who’s BASE?
Distributed databases:
• Cassandra
• Riak
• Voldemort
• Dynomite
• SimpleDB
• etc.

NOSQL in the wild
• Google: Bigtable
• Amazon: Dynamo
• Amazon: SimpleDB
• Yahoo: HBase
• Facebook: Cassandra
• LinkedIn: Voldemort

But first some background...

Chord & Pastry
• Distributed Hash Tables (DHT)
  • Scalable
  • Partitioned
  • Fault-tolerant
  • Decentralized
  • Peer to peer
• Popularized:
  • Node ring
  • Consistent Hashing

Node ring with Consistent Hashing: find data in log(N) jumps
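
A minimal consistent-hashing ring in Java, assuming String node names and hashCode as the hash function (real systems use virtual nodes and stronger hashes):

import java.util.SortedMap;
import java.util.TreeMap;

/** Sketch of a consistent-hash node ring: a key maps to the first node clockwise. */
final class NodeRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node)    { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); }

    /** Only keys between a joining/leaving node and its predecessor move. */
    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private int hash(String s) { return s.hashCode() & 0x7fffffff; } // illustrative only
}

Note that Chord’s log(N) figure refers to routing hops between peers that each see only part of the ring; with a full local view like this TreeMap, the lookup is a single map query.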

Bigtable
“How can we build a DB on top of Google File System?”
• Paper: Bigtable: A distributed storage system for structured data, 2006
• Rich data-model, structured storage
• Clones: HBase, Hypertable, Neptune

Dynamo
“How can we build a distributed hash table for the data center?”
• Paper: Dynamo: Amazon’s highly available key-value store, 2007
• Focus: partitioning, replication and availability
• Eventually Consistent
• Clones: Voldemort, Dynomite

Types of NOSQL stores
• Key-Value databases (Voldemort, Dynomite)
• Column databases (Cassandra, Vertica, Sybase IQ)
• Document databases (MongoDB, CouchDB)
• Graph databases (Neo4J, AllegroGraph)
• Datastructure databases (Redis, Hazelcast)

Distributed Caching
• Write-through
• Write-behind
• Eviction Policies
• Replication
• Peer-To-Peer (P2P)

Write-through

Write-behind
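
A sketch contrasting the two strategies (names illustrative): a write-through cache writes to the backing store synchronously, while a write-behind cache queues the write and flushes it asynchronously.

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch of write-through vs write-behind caching. */
final class Caches {
    interface Store { void write(String key, String value); }   // backing SoR

    /** Write-through: the backing store is updated before put() returns. */
    static final class WriteThroughCache {
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final Store store;
        WriteThroughCache(Store store) { this.store = store; }
        void put(String key, String value) {
            store.write(key, value);     // synchronous: slower writes, nothing lost on crash
            cache.put(key, value);
        }
        String get(String key) { return cache.get(key); }
    }

    /** Write-behind: put() returns at once; a background thread flushes later. */
    static final class WriteBehindCache {
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final BlockingQueue<String> dirty = new LinkedBlockingQueue<>();
        WriteBehindCache(Store store) {
            Thread flusher = new Thread(() -> {
                try {
                    while (true) {
                        String key = dirty.take();          // wait for the next dirty key
                        store.write(key, cache.get(key));   // flush asynchronously
                    }
                } catch (InterruptedException e) { /* shut down */ }
            });
            flusher.setDaemon(true);
            flusher.start();
        }
        void put(String key, String value) {
            cache.put(key, value);       // fast local write...
            dirty.offer(key);            // ...flushed later: risk of loss on crash
        }
        String get(String key) { return cache.get(key); }
    }
}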

Eviction policies
• TTL (time to live)
• Bounded FIFO (first in first out) - see the sketch below
• Bounded LIFO (last in first out)
• Explicit cache invalidation
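
A bounded FIFO policy, for example, falls out of java.util.LinkedHashMap: insertion-order iteration plus removeEldestEntry evicts the oldest entry (a sketch, not thread-safe).

import java.util.LinkedHashMap;
import java.util.Map;

/** Bounded FIFO cache: evicts the oldest inserted entry once capacity is hit. */
final class FifoCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;
    FifoCache(int capacity) {
        super(16, 0.75f, false);   // false = insertion order (FIFO); true would give LRU
        this.capacity = capacity;
    }
    @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // checked after each put()
    }
}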

Peer-To-Peer
• Decentralized
• No “special” or “blessed” nodes
• Nodes can join and leave as they please

Distributed Caching Products
• EHCache
• JBoss Cache
• OSCache
• memcached

memcached
• Very fast
• Simple
• Key-Value (string -> binary)
• Clients for most languages
• Distributed
• Not replicated - so 1/N chance for local access in cluster
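
For a feel of the API, a hedged sketch using the spymemcached Java client (the client class and methods are spymemcached’s; the server address, key and value are illustrative):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

/** Illustrative memcached usage: the client hashes the key to pick a server. */
public class MemcachedExample {
    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(
                new InetSocketAddress("localhost", 11211));
        client.set("user:42", 3600, "Jonas");    // key, TTL in seconds, value
        Object value = client.get("user:42");    // null on miss: fall back to the SoR
        System.out.println(value);
        client.shutdown();
    }
}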

Data Grids / Clustering
Parallel data storage:
• Data replication
• Data partitioning
• Continuous availability
• Data invalidation
• Fail-over
• C + P in CAP

Data Grids/Clustering Products
• Coherence
• Terracotta
• GigaSpaces
• GemStone
• Tibco ActiveMatrix
• Hazelcast

Concurrency
• Shared-State Concurrency
• Message-Passing Concurrency
• Dataflow Concurrency
• Software Transactional Memory

Shared-State Concurrency
• Everyone can access anything anytime
• Totally nondeterministic
• Introduce determinism at well-defined places...
• ...using locks

Shared-State Concurrency
• Problems with locks:
  • Locks do not compose
  • Taking too few locks
  • Taking too many locks
  • Taking the wrong locks
  • Taking locks in the wrong order
  • Error recovery is hard

Shared-State Concurrency
Please use java.util.concurrent.*:
• ConcurrentHashMap
• BlockingQueue
• ConcurrentLinkedQueue
• ExecutorService
• ReentrantReadWriteLock
• CountDownLatch
• ParallelArray
• and much, much more...
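
A small sketch of the style these utilities encourage: a fixed thread pool, a concurrent map for results, and a latch for coordination instead of hand-rolled wait/notify (the task and helper names are invented):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class JucExample {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        ConcurrentHashMap<Integer, Long> results = new ConcurrentHashMap<>();
        CountDownLatch done = new CountDownLatch(10);

        for (int i = 0; i < 10; i++) {
            final int task = i;
            pool.execute(() -> {
                results.put(task, fib(task + 20)); // hypothetical work item
                done.countDown();
            });
        }
        done.await();                  // block until all 10 tasks have finished
        pool.shutdown();
        System.out.println(results);
    }

    static long fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }
}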

Message-Passing Concurrency

Actors
• Originates in a 1973 paper by Carl Hewitt
• Implemented in Erlang, Occam, Oz
• Encapsulates state and behavior
• Closer to the definition of OO than classes

Actors
• Share NOTHING
• Isolated lightweight processes
• Communicate through messages
• Asynchronous and non-blocking
• No shared state... hence, nothing to synchronize
• Each actor has a mailbox (message queue)
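
The mailbox idea, hand-rolled in plain Java as a sketch (deliberately not any particular actor library’s API): a single thread drains the queue, so the actor’s state needs no locks.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Minimal actor: private state, a mailbox, one thread processing messages. */
final class CounterActor {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private long count = 0;   // only ever touched by the actor's own thread

    CounterActor() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String msg = mailbox.take();     // blocks until a message arrives
                    if (msg.equals("inc")) count++;
                    else if (msg.equals("print")) System.out.println(count);
                }
            } catch (InterruptedException e) { /* actor stopped */ }
        });
        t.setDaemon(true);
        t.start();
    }

    /** Asynchronous, non-blocking send. */
    void send(String msg) { mailbox.offer(msg); }
}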

Actors
• Easier to reason about
• Raised abstraction level
• Easier to avoid:
  • Race conditions
  • Deadlocks
  • Starvation
  • Live locks

Actor libs for the JVM
• Akka (Java/Scala)
• scalaz actors (Scala)
• Lift Actors (Scala)
• Scala Actors (Scala)
• Kilim (Java)
• Jetlang (Java)
• Actor’s Guild (Java)
• Actorom (Java)
• FunctionalJava (Java)
• GPars (Groovy)

Dataflow Concurrency
• Declarative
• No observable non-determinism
• Data-driven - threads block until data is available
• On-demand, lazy
• No difference between concurrent and sequential code
• Limitations: can’t have side-effects
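
The core primitive is a single-assignment dataflow variable; a minimal Java sketch (libraries such as GPars offer richer versions): readers block until the value is bound, so results do not depend on thread timing.

import java.util.concurrent.CountDownLatch;

/** Single-assignment dataflow variable: bind once, every get() waits for it. */
final class DataflowVar<T> {
    private final CountDownLatch bound = new CountDownLatch(1);
    private volatile T value;

    void bind(T v) {
        if (bound.getCount() == 0) throw new IllegalStateException("already bound");
        value = v;
        bound.countDown();       // wake up all blocked readers
    }

    T get() throws InterruptedException {
        bound.await();           // block until some thread binds the value
        return value;
    }
}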

STM: Software Transactional Memory

STM: overview
• See the memory (heap and stack) as a transactional dataset
• Similar to a database:
  • begin
  • commit
  • abort/rollback
• Transactions are retried automatically upon collision
• Rolls back the memory on abort

STM: overview
• Transactions can nest
• Transactions compose (yippee!!)

atomic {
  ...
  atomic {
    ...
  }
}

STM: restrictions
All operations in the scope of a transaction:
• Need to be idempotent

STM libs for the JVM
• Akka (Java/Scala)
• Multiverse (Java)
• Clojure STM (Clojure)
• CCSTM (Scala)
• Deuce STM (Java)

Scalability Patterns: Behavior
• Event-Driven Architecture
• Compute Grids
• Load-balancing
• Parallel Computing

Event-Driven Architecture
“Four years from now, ‘mere mortals’ will begin to adopt an event-driven architecture (EDA) for the sort of complex event processing that has been attempted only by software gurus [until now]”
-- Roy Schulte (Gartner), 2003

Event-Driven Architecture
• Domain Events
• Event Sourcing
• Command and Query Responsibility Segregation (CQRS) pattern
• Event Stream Processing
• Messaging
• Enterprise Service Bus
• Actors
• Enterprise Integration Architecture (EIA)

Domain Events
“It's really become clear to me in the last couple of years that we need a new building block and that is the Domain Events”
-- Eric Evans, 2009

Domain Events
“Domain Events represent the state of entities at a given time when an important event occurred and decouple subsystems with event streams. Domain Events give us clearer, more expressive models in those cases.”
-- Eric Evans, 2009

Domain Events
“State transitions are an important part of our problem space and should be modeled within our domain.”
-- Greg Young, 2008

Event Sourcing
• Every state change is materialized in an Event
• All Events are sent to an EventProcessor
• The EventProcessor stores all events in an Event Log
• The system can be reset and the Event Log replayed
• No need for ORM, just persist the Events
• Many different EventListeners can be added to the EventProcessor (or listen directly on the Event Log)
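
A compact sketch of those mechanics (Account, Event and the field names are invented for illustration): events are appended to a log, and current state is rebuilt by replaying it.

import java.util.ArrayList;
import java.util.List;

/** Event-sourcing sketch: the log is the source of truth, state is derived. */
final class Account {
    record Event(String type, long amount) {}   // e.g. "credit" / "debit"

    private final List<Event> eventLog = new ArrayList<>();
    private long balance = 0;

    void credit(long amount) { apply(new Event("credit", amount)); }
    void debit(long amount)  { apply(new Event("debit", amount)); }

    private void apply(Event e) {
        eventLog.add(e);                        // persist the event, not the state
        balance += e.type().equals("credit") ? e.amount() : -e.amount();
    }

    /** Reset and replay: rebuild current state from the full event history. */
    static Account replay(List<Event> history) {
        Account a = new Account();
        history.forEach(a::apply);
        return a;
    }
}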

Command and Query Responsibility Segregation (CQRS) pattern
“A single model cannot be appropriate for reporting, searching and transactional behavior.”
-- Greg Young, 2008

(Diagrams contrasting bidirectional dependencies with unidirectional command and query flows)

CQRS in a nutshell
• All state changes are represented by Domain Events
• Aggregate roots receive Commands and publish Events
• Reporting (the query database) is updated as a result of the published Events
• All Queries from Presentation go directly to Reporting; the Domain is not involved
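
A sketch of that flow with invented names: the aggregate root handles commands and publishes events, a projection keeps the reporting model current, and queries never touch the domain.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** CQRS sketch: commands mutate the domain, queries read a separate model. */
final class Cqrs {
    record OrderPlaced(String orderId, long total) {}   // a Domain Event

    /** Write side: an aggregate root receives commands and publishes events. */
    static final class OrderAggregate {
        private final Consumer<OrderPlaced> publish;
        OrderAggregate(Consumer<OrderPlaced> publish) { this.publish = publish; }
        void placeOrder(String orderId, long total) {
            // ...validate business rules here...
            publish.accept(new OrderPlaced(orderId, total));
        }
    }

    /** Read side: a projection keeps a query-optimized view up to date. */
    static final class OrderReport {
        private final Map<String, Long> totalsByOrder = new ConcurrentHashMap<>();
        void on(OrderPlaced e) { totalsByOrder.put(e.orderId(), e.total()); }
        Long totalFor(String orderId) { return totalsByOrder.get(orderId); } // query
    }

    public static void main(String[] args) {
        OrderReport report = new OrderReport();
        OrderAggregate orders = new OrderAggregate(report::on);  // in-process "bus"
        orders.placeOrder("o-1", 250);
        System.out.println(report.totalFor("o-1"));  // reads never touch the domain
    }
}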

CQRS (diagram copyright Axon Framework)

CQRS: Benefits
• Fully encapsulated domain that only exposes behavior
• Queries do not use the domain model
• No object-relational impedance mismatch
• Bullet-proof auditing and historical tracing
• Easy integration with external systems
• Performance and scalability

Event Stream Processing

select * from Withdrawal(amount >= 200).win:length(5)

Event Stream Processing Products
• Esper (Open Source)
• StreamBase
• RuleCast

Messaging
• Publish-Subscribe
• Point-to-Point
• Store-Forward
• Request-Reply

Publish-Subscribe

Point-to-Point

Store-Forward: durability, event log, auditing etc.

Request-Reply: e.g. AMQP’s ‘replyTo’ header

Messaging
• Standards:
  • AMQP
  • JMS
• Products:
  • RabbitMQ (AMQP)
  • ActiveMQ (JMS)
  • Tibco
  • MQSeries
  • etc.

ESB

ESB products
• ServiceMix (Open Source)
• Mule (Open Source)
• Open ESB (Open Source)
• Sonic ESB
• WebSphere ESB
• Oracle ESB
• Tibco
• BizTalk Server

Actors
• Fire-forget: async send
• Fire-And-Receive-Eventually: async send + wait on Future for reply

Enterprise Integration Patterns

Enterprise Integration Patterns: Apache Camel
• More than 80 endpoints
• XML (Spring) DSL
• Scala DSL

Compute Grids
Parallel execution:
• Divide and conquer:
  1. Split up the job in independent tasks
  2. Execute the tasks in parallel
  3. Aggregate and return the result
• MapReduce - Master/Worker
• Automatic provisioning
• Load balancing
• Fail-over
• Topology resolution

Compute Grids Products
• Platform
• DataSynapse
• Google MapReduce
• Hadoop
• GigaSpaces
• GridGain

Load balancing
• Random allocation
• Round robin allocation
• Weighted allocation
• Dynamic load balancing:
  • Least connections
  • Least server CPU
  • etc.
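
Round robin allocation, for instance, is a few lines with java.util.concurrent.atomic (backend names illustrative):

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Round-robin load balancer: spreads requests evenly across backends. */
final class RoundRobinBalancer {
    private final List<String> backends;
    private final AtomicLong counter = new AtomicLong();

    RoundRobinBalancer(List<String> backends) { this.backends = backends; }

    String next() {
        // floorMod keeps the index non-negative even if the counter wraps around
        return backends.get((int) Math.floorMod(counter.getAndIncrement(),
                                                (long) backends.size()));
    }
}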

Load balancing
• DNS Round Robin (simplest):
  • Ask DNS for the IP for a host
  • Get a new IP every time
• Reverse Proxy (better)
• Hardware Load Balancing

Load balancing products
• Reverse Proxies:
  • Apache mod_proxy (OSS)
  • HAProxy (OSS)
  • Squid (OSS)
  • Nginx (OSS)
• Hardware Load Balancers:
  • BIG-IP
  • Cisco

Parallel Computing
• UE: Unit of Execution
  • Process
  • Thread
  • Coroutine
  • Actor
• SPMD Pattern
• Master/Worker Pattern
• Loop Parallelism Pattern
• Fork/Join Pattern
• MapReduce Pattern

SPMD Pattern
• Single Program Multiple Data
• Very generic pattern, used in many other patterns
• Use a single program for all the UEs
• Use the UE’s ID to select different pathways through the program, e.g.:
  • Branching on ID
  • Using the ID in the loop index to split loops
• Keep interactions between UEs explicit
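
A minimal SPMD sketch in Java: every UE runs the same program and uses its ID in the loop index to claim a slice of the iterations (the data and sizes are illustrative).

/** SPMD sketch: one program, many UEs, behavior selected by UE id. */
public class SpmdSum {
    public static void main(String[] args) throws InterruptedException {
        double[] data = new double[1_000_000];
        java.util.Arrays.fill(data, 1.0);
        int numUes = 4;
        double[] partial = new double[numUes];   // one slot per UE: explicit interaction
        Thread[] ues = new Thread[numUes];

        for (int id = 0; id < numUes; id++) {
            final int myId = id;
            ues[id] = new Thread(() -> {
                double sum = 0;
                // use the UE id in the loop index to split the iterations
                for (int i = myId; i < data.length; i += numUes) sum += data[i];
                partial[myId] = sum;
            });
            ues[id].start();
        }
        for (Thread t : ues) t.join();

        double total = 0;
        for (double p : partial) total += p;
        System.out.println(total);               // 1000000.0
    }
}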

Master/Worker

Master/Worker
• Good scalability
• Automatic load-balancing
• How to detect termination?
  • Bag of tasks is empty
  • Poison pill
• What if we bottleneck on a single queue?
  • Use multiple work queues
  • Work stealing
• What about fault tolerance?
  • Use an “in-progress” queue
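
A sketch of the bag-of-tasks variant with a poison pill for termination (the sentinel and task names are invented):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Master/Worker sketch: workers pull from a shared bag of tasks until poisoned. */
public class MasterWorker {
    private static final String POISON_PILL = "STOP";   // illustrative sentinel

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> bag = new LinkedBlockingQueue<>();
        int workers = 3;

        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        String task = bag.take();        // pull model = automatic load balancing
                        if (task == POISON_PILL) return; // same object we enqueued: terminate
                        System.out.println(Thread.currentThread().getName() + " did " + task);
                    }
                } catch (InterruptedException e) { /* stop */ }
            }).start();
        }

        for (int t = 0; t < 10; t++) bag.put("task-" + t);      // master fills the bag
        for (int w = 0; w < workers; w++) bag.put(POISON_PILL); // one pill per worker
    }
}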

Loop Parallelism
• Workflow:
  1. Find the loops that are the bottlenecks
  2. Eliminate coupling between loop iterations
  3. Parallelize the loop
• If there are too few iterations to pull its weight:
  • Merge loops
  • Coalesce nested loops
• OpenMP: omp parallel for
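
In Java the same idea is close to a one-liner with parallel streams, provided step 2 holds and the iterations really are independent (a sketch):

import java.util.stream.IntStream;

public class LoopParallelism {
    public static void main(String[] args) {
        double[] a = new double[1_000_000];
        // iterations are independent (no coupling), so the loop can be parallelized
        IntStream.range(0, a.length).parallel().forEach(i -> a[i] = Math.sqrt(i));
        System.out.println(a[42]);
    }
}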

What if task creation can’t be handled by:
• parallelizing loops (Loop Parallelism)
• putting them on work queues (Master/Worker)

Enter Fork/Join

Fork/Join
• Use when the relationship between tasks is simple
• Good for recursive data processing
• Can use work-stealing
  1. Fork: tasks are dynamically created
  2. Join: tasks are later terminated and data aggregated

Fork/Join
• Direct task/UE mapping
  • 1-1 mapping between Task/UE
  • Problem: dynamic UE creation is expensive
• Indirect task/UE mapping
  • Pool the UEs
  • Control (constrain) the resource allocation
  • Automatic load balancing
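
A sketch using JSR166’s fork/join framework (java.util.concurrent from Java 7): tasks fork recursively, and a work-stealing pool maps the many tasks onto a small pooled set of UEs.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Fork/Join sketch: recursively split a sum until the chunks are small enough. */
class SumTask extends RecursiveTask<Long> {
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override protected Long compute() {
        if (to - from <= 10_000) {              // small enough: compute directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) / 2;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                            // fork: run the left half asynchronously
        return right.compute() + left.join();   // join: aggregate the results
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 1);
        System.out.println(new ForkJoinPool().invoke(new SumTask(data, 0, data.length)));
    }
}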

Fork/Join: Java 7 ParallelArray (Fork/Join DSL)

ParallelArray students = new ParallelArray(fjPool, data);
double bestGpa = students.withFilter(isSenior)
                         .withMapping(selectGpa)
                         .max();

MapReduce
• Origin: Google paper, 2004
• Used internally @ Google
• Variation of Fork/Join
• Work divided upfront, not dynamically
• Usually distributed
• Normally used for massive data crunching

MapReduce Products
• Hadoop (OSS), used @ Yahoo
• Amazon Elastic MapReduce
• Many NOSQL DBs utilize it for searching/querying
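
The two phases in an in-memory sketch (word count, the canonical example; real MapReduce adds distribution, partitioning and fault tolerance around exactly this shape):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** MapReduce sketch: map each input to (word, 1), then reduce per key by summing. */
public class WordCount {
    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a rose is a rose", "war is war");

        Map<String, Long> counts = docs.stream()
                .flatMap(doc -> Arrays.stream(doc.split(" "))) // map phase: emit words
                .collect(Collectors.groupingBy(                 // shuffle: group by key
                        w -> w,
                        Collectors.counting()));                // reduce phase: sum per key

        System.out.println(counts);   // e.g. {a=2, rose=2, is=2, war=2}
    }
}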


Parallel Computing products
• MPI
• OpenMP
• JSR166 Fork/Join
• java.util.concurrent
  • ExecutorService, BlockingQueue etc.
• ProActive Parallel Suite
• CommonJ WorkManager (JEE)

Stability Patterns
• Timeouts
• Circuit Breaker
• Let-it-crash
• Fail fast
• Bulkheads
• Steady State
• Throttling

Timeouts
Always use timeouts (if possible):
• object.wait(timeout)
• reentrantLock.tryLock(timeout, timeUnit)
• blockingQueue.poll(timeout, timeUnit) / offer(e, timeout, timeUnit)
• futureTask.get(timeout, timeUnit)
• socket.setSoTimeout(timeout)
• etc.
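
For example, bounding a blocking call with futureTask.get (the executor setup and timings are illustrative):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(() -> slowCall());
        try {
            System.out.println(future.get(2, TimeUnit.SECONDS)); // bound the wait
        } catch (TimeoutException e) {
            future.cancel(true);       // don't let the slow call hold the thread
            System.out.println("gave up after 2s");
        } finally {
            pool.shutdown();
        }
    }

    static String slowCall() throws InterruptedException {
        Thread.sleep(5_000);           // simulates a slow downstream resource
        return "response";
    }
}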

Circuit Breaker
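
A minimal sketch of the state machine behind the pattern (closed, open, half-open; the thresholds are illustrative):

/** Circuit breaker sketch: stop calling a failing resource until it can recover. */
final class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;
    private static final int FAILURE_THRESHOLD = 5;       // illustrative
    private static final long RETRY_AFTER_MS = 10_000;    // illustrative

    synchronized <T> T call(java.util.concurrent.Callable<T> protectedCall) throws Exception {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < RETRY_AFTER_MS)
                throw new IllegalStateException("circuit open - failing fast");
            state = State.HALF_OPEN;                      // let one trial call through
        }
        try {
            T result = protectedCall.call();
            state = State.CLOSED;                         // trial succeeded: close again
            failures = 0;
            return result;
        } catch (Exception e) {
            if (++failures >= FAILURE_THRESHOLD || state == State.HALF_OPEN) {
                state = State.OPEN;                       // trip: subsequent calls fail fast
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }
}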

Let it crash
• Embrace failure as a natural state in the life-cycle of the application
• Instead of trying to prevent it: manage it
• Process supervision
• Supervisor hierarchies (from Erlang)

Restart Strategy: OneForOne

Restart Strategy: AllForOne

Supervisor Hierarchies

Fail fast
• Avoid “slow responses”
• Separate:
  • SystemError - resources not available
  • ApplicationError - bad user input etc.
• Verify resource availability before starting an expensive task
• Do input validation immediately

Bulkheads
• Partition and tolerate failure in one part
• Redundancy
• Applies to threads as well:
  • Keep one pool for admin tasks, to be able to perform them even though all other threads are blocked (see the sketch below)
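
The thread-pool bulkhead from the last bullet as a sketch (pool sizes illustrative): admin work gets its own small pool, so a flood of user requests cannot starve it.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Bulkhead sketch: separate pools so one overloaded partition can't sink the rest. */
public class Bulkheads {
    // user-facing work: may saturate under heavy load
    static final ExecutorService userPool = Executors.newFixedThreadPool(50);
    // admin tasks: a small dedicated pool that stays responsive regardless
    static final ExecutorService adminPool = Executors.newFixedThreadPool(2);

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++)
            userPool.execute(() -> { /* slow request handling */ });
        adminPool.execute(() -> System.out.println("health check still runs"));
    }
}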

Steady State
• Clean up after yourself
• Logging:
  • RollingFileAppender (log4j)
  • logrotate (Unix)
  • Scribe - server for aggregating streaming log data
• Always put logs on a separate disk

Throttling
• Maintain a steady pace
• Count requests
• If the limit is reached, back off (drop, raise error)
• Queue requests
• Used in, for example, Staged Event-Driven Architecture (SEDA)
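
A sketch of count-and-back-off throttling with a Semaphore (the limit and the error choice are illustrative):

import java.util.concurrent.Semaphore;

/** Throttling sketch: admit at most N concurrent requests, shed the rest. */
final class Throttle {
    private final Semaphore permits;

    Throttle(int maxConcurrent) { this.permits = new Semaphore(maxConcurrent); }

    <T> T call(java.util.concurrent.Callable<T> request) throws Exception {
        if (!permits.tryAcquire())                        // count requests...
            throw new IllegalStateException("throttled"); // ...back off at the limit
        try {
            return request.call();
        } finally {
            permits.release();
        }
    }
}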

?

thanks for listening

Extra material

Client-side consistency
• Strong consistency
• Weak consistency
• Eventually consistent
• Never consistent

Client-side Eventual Consistency levels
• Causal consistency
• Read-your-writes consistency (important)
• Session consistency
• Monotonic read consistency (important)
• Monotonic write consistency

Server-side consistency
• N = the number of nodes that store replicas of the data
• W = the number of replicas that need to acknowledge the receipt of the update before the update completes
• R = the number of replicas that are contacted when a data object is accessed through a read operation

Server-side consistency
• W + R > N: strong consistency
• W + R <= N: eventual consistency
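
For example, with N = 3 replicas, choosing W = 2 and R = 2 gives W + R = 4 > 3, so every read quorum overlaps the latest write quorum and reads are strongly consistent; choosing W = 1 and R = 1 gives W + R = 2 <= 3, so a read may hit a replica that has not yet seen the update and the system is only eventually consistent.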