Big Data - 03 - Object and Key-Value Storage
Lecture given on September 26 and 27 at ETH Zurich.
Ghislain Fourny
September 27, 2017
Transcript
Ghislain Fourny Big Data 3. Object and Key-Value Storage satina
/ 123RF Stock Photo
2 Where are we? Last lecture: Reminder on relational databases
3 Where are we? Relational databases fit on a single
machine
4 Where are we? Petabytes do not fit on a
single machine
5 The lecture journey Monolithic Relational Database Modular "Big Data"
Technology Stack
6 Not reinventing the wheel: 99% of what we learned from 46 years of SQL and the relational model can be reused
7 Important take-away points Relational algebra: Selection Projection Grouping Sorting
Joining
8 Important take-away points Language SQL Declarative languages Functional languages
Optimizations Query plans
9 Important take-away points What a table is made of
Table Columns Primary key Row
10 Important take-away points Denormalization 1NF vs. nesting 2NF/3NF vs.
pre-join
11 Important take-away points Transactions Atomicity Consistency Isolation Durability ACID
12 Important take-away points Transactions: ACID (Atomicity, Consistency, Isolation, Durability) vs. the new notions of CAP (Atomic Consistency, Availability, Partition tolerance) and Eventual Consistency
13 The stack Storage Encoding Syntax Data models Validation Processing
Indexing Data stores User interfaces Querying
14 The stack: Storage Storage Local filesystem NFS GFS HDFS
S3 Azure Blob Storage
15 The stack: Encoding Encoding ASCII ISO-8859-1 UTF-8 BSON
16 The stack: Syntax Syntax Text CSV XML JSON RDF/XML
Turtle XBRL
17 The stack: Data models Data models Tables: Relational model
Trees: XML Infoset, XDM Graphs: RDF Cubes: OLAP
18 The stack: Validation Validation XML Schema JSON Schema Relational
schemas XBRL taxonomies
19 The stack: Processing Processing Two-phase processing: MapReduce DAG-driven processing:
Tez, Spark Elastic computing: EC2
20 The stack: Indexing Indexing Key-value stores Hash indices B-Trees
Geographical indices Spatial indices
21 The stack: Data stores Data stores RDBMS (Oracle/IBM/Microsoft) MongoDB
CouchBase ElasticSearch Hive HBase MarkLogic Cassandra ...
22 The stack: Querying Querying SQL XQuery MDX SPARQL REST
APIs
23 The stack: User interfaces (UI) User interfaces Excel Access
Tableau Qlikview BI tools
24 Storage: from a single machine to a cluster boscorelli
/ 123RF Stock Photo
25 Storage needs to be stored Data somewhere Database Storage
26 Let's start from the 70s... Vitaly Korovin / 123RF
Stock Photo
27 File storage Lorem Ipsum Dolor sit amet Consectetur Adipiscing
Elit. In Imperdiet Ipsum ante Files organized in a hierarchy
28 What is a file made of? Content Metadata File
+
29 File Metadata
$ ls -l
total 48
drwxr-xr-x  5 gfourny staff  170 Jul 29 08:11 2009
drwxr-xr-x 16 gfourny staff  544 Aug 19 14:02 Exercises
drwxr-xr-x 11 gfourny staff  374 Aug 19 14:02 Learning Objectives
drwxr-xr-x 18 gfourny staff  612 Aug 19 14:52 Lectures
-rw-r--r--  1 gfourny staff 1788 Aug 19 14:04 README.md
Fixed "schema"
30 File Content: Block storage. File content is stored in blocks (1, 2, 3, ..., 8)
31 Local storage Local Machine
32 Local storage Local Machine LAN (NAS) LAN = local-area
network NAS = network-attached storage
33 Local storage Local Machine LAN (NAS) WAN LAN =
local-area network NAS = network-attached storage WAN = wide-area network
34 Scaling Issues Aleksandr Elesin / 123RF Stock Photo 1,000
files 1,000,000 files 1,000,000,000 files
35 Better performance: Explicit Block Storage 1 2 3 4
5 6 7 8 Application (Control over locality of blocks)
36 So how do we make this scale? 1. We
throw away the hierarchy!
37 So how do we make this scale? 2. We
make metadata flexible
38 So how do we make this scale? 3. We
make the data model trivial ID
39 So how do we make this scale? 4. We
use commodity hardware
40 ... and we get Object Storage
41 "Black-box" objects ... and we get Object Storage
42 "Black-box" objects Flat and global key-value model ... and
we get Object Storage
43 "Black-box" objects Flat and global key-value model Flexible metadata
... and we get Object Storage
44 "Black-box" objects Flat and global key-value model Flexible metadata
Commodity hardware ... and we get Object Storage
45 Scale boscorelli / 123RF Stock Photo
46 One machine's not good enough. How do we scale?
47 Approach 1: scaling up
48 Approach 1: scaling up
49 Approach 2: scaling out
50 Approach 2: scaling out
51 Approach 2: scaling out
52 Approach 2: scaling out
53 Hardware price comparison Scale out Scale up
54 Approach 3: be smart Viktorija Reuta / 123RF Stock
Photo
55 Approach 3: be smart Viktorija Reuta / 123RF Stock
Photo “You can have a second computer once you’ve shown you know how to use the first one.” Paul Barham
56 In this lecture Approach 2 Scale out
57 Data centers boscorelli / 123RF Stock Photo
58 Numbers - computing 1,000-100,000 machines in a data center
1-100 cores per server
59 Numbers - storage 1-12 TB local storage per server
16GB-4TB of RAM per server
60 Numbers - network 1 GB/s network bandwidth per server
61 Racks Height in "rack units" (e.g., 42 RU)
62 Racks Modular: - servers - storage - routers -
...
63 Rack servers Lenovo ThinkServer RD430 Rack Server 1-4 RU
64 Amazon S3
65 S3 Model
66 S3 Model
67 S3 Model Bucket ID
68 S3 Model Bucket ID
69 S3 Model Bucket ID + Object ID
70 Scalability Max. 5 TB per object
71 Scalability 100 buckets/account (more upon request)
72 Durability 99.999999999% (expected loss of 1 in 10^11 objects)
73 Availability 99.99% Down 1h / year
74 More about SLAs: counting the nines
75 More about SLA
SLA        Outage
99%        4 days/year
99.9%      9 hours/year
99.99%     53 minutes/year
99.999%    6 minutes/year
99.9999%   32 seconds/year
99.99999%  4 seconds/year
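As a sanity check, the outage budget for a given availability level can be computed directly; a minimal Python sketch (the SLA values are those of the table above; the exact figures differ slightly from the rounded ones on the slide):

# Yearly outage budget allowed by an availability SLA.
SECONDS_PER_YEAR = 365 * 24 * 3600

for sla in [0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999]:
    downtime = (1 - sla) * SECONDS_PER_YEAR        # seconds of allowed outage per year
    print(f"{sla * 100:.5f}% -> {downtime / 3600:.2f} hours/year ({downtime:.0f} s)")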
76 More about SLA Amazon's approach: Response time < 10
ms in 99.9% of the cases (rather than average or median)
77 API REST
78 REST APIs
79 HTTP protocol (Sir Tim Berners-Lee)
Version  RFC
1.0      RFC 1945
1.1      RFC 2616, later RFC 7230
2.0      RFC 7540
80 REST API: Method (GET, PUT, DELETE, POST, HEAD, OPTIONS, TRACE, CONNECT) + Resource (URI)
81 PUT (Idempotent)
82 GET (Side-effect free)
83 DELETE
84 POST: most generic, may have side effects
85 S3 Resources: Buckets http://bucket.s3.amazonaws.com http://bucket.s3-region.amazonaws.com
86 S3 Resources: Objects http://bucket.s3.amazonaws.com/object-name http://bucket.s3-region.amazonaws.com/object-name
87 S3 REST API PUT Bucket DELETE Bucket GET Bucket
PUT Object DELETE Object GET Object
88 Example
GET /my-image.jpg HTTP/1.1
Host: bucket.s3.amazonaws.com
Date: Tue, 26 Sep 2017 10:55:00 GMT
Authorization: authorization string
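For comparison, the same object operations issued through an SDK rather than hand-written HTTP; a minimal sketch with boto3, assuming AWS credentials are configured and using a placeholder bucket name ("my-bucket"):

import boto3

s3 = boto3.client("s3")

# Upload, download, and delete one object; each call maps to one signed HTTP request.
s3.put_object(Bucket="my-bucket", Key="my-image.jpg", Body=b"...image bytes...")   # HTTP PUT
response = s3.get_object(Bucket="my-bucket", Key="my-image.jpg")                   # HTTP GET
data = response["Body"].read()                 # the object is an opaque sequence of bytes
s3.delete_object(Bucket="my-bucket", Key="my-image.jpg")                           # HTTP DELETE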
89 Folders: is S3 a file system?
Physical (object keys): /food/fruits/orange, /food/fruits/strawberry, /food/vegetables/tomato, /food/vegetables/turnip, /food/vegetables/lettuce
Logical (browsing): food → fruits (orange, strawberry), vegetables (tomato, turnip, lettuce)
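A sketch of how that logical browsing is emulated over the flat key space, again with boto3, the placeholder bucket "my-bucket", and keys written without a leading slash (food/fruits/orange, ...):

import boto3

s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="my-bucket", Prefix="food/", Delimiter="/")

for prefix in listing.get("CommonPrefixes", []):   # the "subfolders": common key prefixes
    print(prefix["Prefix"])                        # e.g. food/fruits/, food/vegetables/
for obj in listing.get("Contents", []):            # objects directly under the prefix
    print(obj["Key"])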
90 Static website hosting: http://<bucket-name>.s3-website-us-east-1.amazonaws.com/
91 More on Storage
92 Replication Fault tolerance
93 Faults Local (node failure) versus Regional (natural catastrophe)
94 Regions Songkran Khamlue / 123RF Stock Photo
95 Regions Songkran Khamlue / 123RF Stock Photo
96 Regions Songkran Khamlue / 123RF Stock Photo
97 Storage Class
Standard: high availability
Standard – Infrequent Access: less availability, cheaper storage, cost for retrieving
Amazon Glacier: low-cost, hours to GET
98 Azure Blob Storage
99 Overall comparison: Azure vs. S3
Object ID:  S3: Bucket + Object;  Azure: Account + Partition + Object
Object API: S3: black box;        Azure: blocks or pages
Limit:      S3: 5 TB;             Azure: 195 GB (blocks), 1 TB (pages)
100 Azure Architecture: Storage Stamp = Front-Ends, Partition Layer, Stream Layer; addressed via Virtual IP address, Account name, Partition name, Object name
101 Azure Architecture: One storage stamp = 10-20 racks × 18 storage nodes/rack (30 PB)
102 Azure Architecture: Keep some buffer; storage is kept below 70-80% of capacity
103 Storage Replication: intra-stamp replication (synchronous), within one stamp's Front-Ends / Partition Layer / Stream Layer stack
104 Storage Replication: inter-stamp replication (asynchronous), between two stamps
105 Location Services: DNS; an account name is mapped to one Virtual IP (the primary stamp)
106 Location Services: DNS across regions (North America, Europe, Asia)
107 Location Services: DNS routing requests to the stamps
108 Key-value storage Sergey Nivens / 123RF Stock Photo
109 Can we consider object storage a database?
110 Issue: latency. S3: ~100-300 ms vs. typical database: 1-9 ms
111 Key-value stores: 1. Similar data model to object storage (ID)
112 Key-value stores: 2. Smaller objects: 5 TB (S3) vs. 400 KB (DynamoDB)
113 Key-value stores: 3. No metadata
114 Key-value stores: data model
115 Basic API: get(key) → value
116 Basic API: get(key) → value; put(key, other value)
117 DynamoDB API: get(key) → value
118 DynamoDB API: get(key) → value, context
119 DynamoDB API: get(key) → value, context; put(key, context, value) (more on context later)
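A minimal in-memory sketch of these two API shapes (not DynamoDB's actual client library); the context here is just an opaque token stored next to the value:

class KeyValueStore:
    def __init__(self):
        self._data = {}                       # associative array: key -> value

    def get(self, key):
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value

class ContextualStore:
    def __init__(self):
        self._data = {}                       # key -> (value, context)

    def get(self, key):
        value, context = self._data[key]
        return value, context                 # the context travels with the value

    def put(self, key, context, value):
        # A real store would use the context (e.g. a vector clock) to detect
        # conflicting concurrent writes; this sketch just stores a new token.
        self._data[key] = (value, (context or 0) + 1)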
120 Key-value stores: why do we simplify? Simplicity vs. more features
121 Key-value stores: why do we simplify? Simplicity vs. more features; eventual consistency vs. consistency
122 Key-value stores: why do we simplify? Simplicity vs. more features; eventual consistency vs. consistency; performance vs. overhead
123 Key-value stores: why do we simplify? Simplicity vs. more features; eventual consistency vs. consistency; performance vs. overhead; scalability vs. monolithic
124 Key-value stores: which is the most efficient data structure for querying this (key → value)?
125 Key-value stores: logical model: Associative Array (aka Map)
126 Enter nodes: physical level
127 Design principles: incremental stability
128 Design principles: incremental stability
129 Design principles: incremental stability
130 Design principles: symmetry
131 Design principles: decentralization
132 Design principles: heterogeneity
133 Physical layer: Peer to peer networks
134 Distributed Hash Tables: Chord. Hashed n-bit IDs (DynamoDB: 128 bits)
135 IDs are organized in a logical ring, modulo 2^n (2^n possible positions: 0000000000 ... 1111111111)
136 Each Node picks a 128-bit hash (randomly)
137 Nodes are (logically) placed on the ring (mod 2^n)
138 An ID is stored at the next node clockwise
139 An ID is stored at the next node clockwise: this is the node's domain of responsibility
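A minimal sketch of this assignment rule in Python, with hypothetical node names: keys and nodes are hashed onto a 2^128 ring and each key goes to the next node clockwise:

import hashlib
from bisect import bisect_right

N_BITS = 128                                  # DynamoDB-style ring size: 2^128 positions

def ring_position(name: str) -> int:
    # MD5 happens to produce exactly 128 bits; any hash of the right width works.
    return int.from_bytes(hashlib.md5(name.encode()).digest(), "big") % (2 ** N_BITS)

nodes = ["node-A", "node-B", "node-C"]        # hypothetical node names
ring = sorted((ring_position(n), n) for n in nodes)

def responsible_node(key: str) -> str:
    k = ring_position(key)
    idx = bisect_right([pos for pos, _ in ring], k)    # first node position after the key
    return ring[idx % len(ring)][1]                    # wrap around: the ring is circular

print(responsible_node("my-key"))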
140 Adding and removing nodes
141 Adding and removing nodes
142 Adding and removing nodes
143 Adding and removing nodes: part of a range needs to be transferred
144 Adding and removing nodes: part of a range needs to be transferred; the other nodes are not affected
145 Adding and removing nodes
146 Adding and removing nodes
147 Adding and removing nodes
148 Adding and removing nodes: part of a range needs to be transferred
149 Adding and removing nodes: part of a range needs to be transferred, but what if the node failed?
150 Initial design: one range
151 Duplication: 2 ranges
152 Duplication: 2 ranges
153 Duplication: N ranges
154 Finger tables Credits: Thomas Hofmann
155 Finger tables O(log(number of nodes)) Credits: Thomas Hofmann
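A sketch of what a finger table holds, on a tiny hypothetical ring (m = 10, four made-up node positions): entry i of node n points to the node responsible for (n + 2^i) mod 2^m, which is what makes lookups logarithmic in the number of nodes:

M = 10                                        # tiny ring with 2^10 positions, for illustration
node_positions = sorted([23, 200, 511, 780])  # made-up node positions on that ring

def successor(ring_id: int) -> int:
    for pos in node_positions:                # next node clockwise from ring_id
        if pos >= ring_id:
            return pos
    return node_positions[0]                  # wrap around past the end of the ring

def finger_table(n: int):
    # Entry i points to the node responsible for the ID 2^i steps ahead of n.
    return [successor((n + 2 ** i) % 2 ** M) for i in range(M)]

print(finger_table(23))                       # the fingers of the node at position 23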
156 Distributed Hash Tables: Pros Highly scalable Robust against failure
Self organizing Credits: Thomas Hofmann
157 Distributed Hash Tables: Cons Lookup, no search Data integrity
Security issues Credits: Thomas Hofmann
158 Issue 1
159 Issue 1: randomness, out of luck
160 Issue 2: heterogeneous performance
161 So... How can we • artificially increase the number of nodes? and • bring some elasticity to account for performance differences?
162 Virtual nodes are the answer: tokens
163 Virtual nodes: tokens
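A sketch of tokens, with hypothetical node names and token counts: each physical node places several virtual positions on the ring, and a more powerful machine simply gets more tokens:

import hashlib

def token(seed: str) -> int:
    return int.from_bytes(hashlib.md5(seed.encode()).digest(), "big")

TOKENS_PER_NODE = {"node-A": 8, "node-B": 8, "node-C": 16}   # node-C is a beefier machine

ring = sorted(
    (token(f"{node}#{i}"), node)              # one ring position (token) per virtual node
    for node, count in TOKENS_PER_NODE.items()
    for i in range(count)
)
# The owner of a key is still the next token clockwise; adding or removing a
# physical node now only moves its tokens, spreading the transfer over the ring.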
164 Deleting a node
165 Deleting a node
166 Deleting a node
167 Adding a node
168 Adding a node
169 Vector clocks: put by Node A
170 Vector clocks: put by Node A → ([A, 1])
171 Vector clocks: a second put by Node A → ([A, 2])
172 Vector clocks: then a put by Node B → ([A, 2], [B, 1])
173 Vector clocks: and a concurrent put by Node C → ([A, 2], [C, 1])
174 Vector clocks: reconcile and put by Node A
175 Vector clocks: reconcile and put by Node A → ([A, 3], [B, 1], [C, 1])
176 Context: the vector clock plays the role of the "context" in the put/get API: ([A, 1]), ([A, 2]), ([A, 2], [B, 1]), ([A, 2], [C, 1]), ([A, 3], [B, 1], [C, 1])
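The bookkeeping above can be written down in a few lines; a sketch where a clock is a dict from node name to counter:

def put(clock: dict, node: str) -> dict:
    new = dict(clock)
    new[node] = new.get(node, 0) + 1          # a put increments the writing node's counter
    return new

def reconcile(*clocks: dict) -> dict:
    merged = {}
    for c in clocks:
        for node, counter in c.items():
            merged[node] = max(merged.get(node, 0), counter)   # per-node maximum
    return merged

c = put({}, "A")                   # ([A, 1])
c = put(c, "A")                    # ([A, 2])
b = put(c, "B")                    # ([A, 2], [B, 1])  -- concurrent branch
d = put(c, "C")                    # ([A, 2], [C, 1])  -- concurrent branch
print(put(reconcile(b, d), "A"))   # ([A, 3], [B, 1], [C, 1])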
177 Preference lists: propagate changes to the first N healthy nodes (avoids hops in routing)
178 Preference lists: a key maps to nodes 1, 2, 3, ... (at least N nodes)
179 Preference lists: a key maps to nodes 1, 2, 3, ...; the first node is called the coordinator
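A sketch of building a preference list from a token ring like the one sketched earlier: walk clockwise from the key's position and keep the first N distinct physical nodes; the first one is the coordinator:

def preference_list(key_pos: int, ring, n: int):
    # Visit ring positions clockwise starting just after key_pos, wrapping around.
    ordered = sorted(ring, key=lambda entry: (entry[0] <= key_pos, entry[0]))
    nodes = []
    for _, node in ordered:
        if node not in nodes:                 # skip further tokens of nodes already chosen
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes                              # nodes[0] is the coordinator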
180 Initial request Load balancer
181 Initial request Load balancer Random node
182 Initial request Load balancer Random node Coordinator
183 Initial request Load balancer Random node Coordinator
184 Lower latency Partition-aware client Coordinator
185 Lower latency Partition-aware client Coordinator Hinted handoff
186 Merkle Trees What if hinted replicas get lost? What
if the complexity in replica deltas increases?
187 Merkle Trees What if hinted replicas get lost? What
if the complexity in replica deltas increases? Anti-entropy protocol
188 Merkle Trees Key range (one per virtual node)
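A minimal sketch of the Merkle-tree idea (the hash function and tree shape are illustrative choices, not Dynamo's exact scheme): each replica hashes the key-value pairs of a range into a tree, and two replicas only need to descend into subtrees whose hashes differ, so the work is proportional to the actual delta:

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """leaves: (key, value) pairs of one key range, sorted by key."""
    level = [h(f"{k}={v}".encode()) for k, v in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])            # duplicate the last hash on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica1 = [("k1", "a"), ("k2", "b")]
replica2 = [("k1", "a"), ("k2", "B")]          # one stale value
print(merkle_root(replica1) == merkle_root(replica2))   # False -> this range needs syncing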
189 Amazon mindset
190 Azure mindset Graphs Tables Trees Key-value paradigm
191 Take-away messages: how to scale out? § Simplify the model! § Buy cheap hardware! § Remove schemas!