Big Data - 03 - Object and Key-Value Storage

Lecture given on September 26 and 27 at ETH Zurich.

Ghislain Fourny

September 27, 2017

Transcript

  1. Ghislain Fourny Big Data 3. Object and Key-Value Storage satina

    / 123RF Stock Photo
  2. 2 Where are we? Last lecture: Reminder on relational databases

  3. 3 Where are we? Relational databases fit on a single

    machine
  4. 4 Where are we? Petabytes do not fit on a

    single machine
  5. 5 The lecture journey Monolithic Relational Database Modular "Big Data"

    Technology Stack
  6. 6 Not reinventing the wheel 99% of what we learned

    with 46 years of SQL and relational can be reused
  7. 7 Important take-away points Relational algebra: Selection Projection Grouping Sorting

    Joining
  8. 8 Important take-away points Language SQL Declarative languages Functional languages

    Optimizations Query plans
  9. 9 Important take-away points What a table is made of

    Table Columns Primary key Row
  10. 10 Important take-away points Denormalization 1NF vs. nesting 2NF/3NF vs.

    pre-join
  11. 11 Important take-away points Transactions Atomicity Consistency Isolation Durability ACID

  12. 12 Important take-away points Transactions: ACID (Atomicity, Consistency, Isolation, Durability) and the new notions of Atomic Consistency, Availability, Partition tolerance (CAP), and Eventual Consistency
  13. 13 The stack Storage Encoding Syntax Data models Validation Processing

    Indexing Data stores User interfaces Querying
  14. 14 The stack: Storage Storage Local filesystem NFS GFS HDFS

    S3 Azure Blob Storage
  15. 15 The stack: Encoding Encoding ASCII ISO-8859-1 UTF-8 BSON

  16. 16 The stack: Syntax Syntax Text CSV XML JSON RDF/XML

    Turtle XBRL
  17. 17 The stack: Data models Data models Tables: Relational model

    Trees: XML Infoset, XDM Graphs: RDF Cubes: OLAP
  18. 18 The stack: Validation Validation XML Schema JSON Schema Relational

    schemas XBRL taxonomies
  19. 19 The stack: Processing Processing Two-phase processing: MapReduce DAG-driven processing:

    Tez, Spark Elastic computing: EC2
  20. 20 The stack: Indexing Indexing Key-value stores Hash indices B-Trees

    Geographical indices Spatial indices
  21. 21 The stack: Data stores Data stores RDBMS (Oracle/IBM/Microsoft) MongoDB

    CouchBase ElasticSearch Hive HBase MarkLogic Cassandra ...
  22. 22 The stack: Querying Querying SQL XQuery MDX SPARQL REST

    APIs
  23. 23 The stack: User interfaces (UI) User interfaces Excel Access

    Tableau Qlikview BI tools
  24. 24 Storage: from a single machine to a cluster boscorelli

    / 123RF Stock Photo
  25. 25 Storage needs to be stored Data somewhere Database Storage

  26. 26 Let's start from the 70s... Vitaly Korovin / 123RF

    Stock Photo
  27. 27 File storage Lorem Ipsum Dolor sit amet Consectetur Adipiscing

    Elit. In Imperdiet Ipsum ante Files organized in a hierarchy
  28. 28 What is a file made of? Content Metadata File

    +
  29. 29 File Metadata

    $ ls -l
    total 48
    drwxr-xr-x  5 gfourny staff  170 Jul 29 08:11 2009
    drwxr-xr-x 16 gfourny staff  544 Aug 19 14:02 Exercises
    drwxr-xr-x 11 gfourny staff  374 Aug 19 14:02 Learning Objectives
    drwxr-xr-x 18 gfourny staff  612 Aug 19 14:52 Lectures
    -rw-r--r--  1 gfourny staff 1788 Aug 19 14:04 README.md

    Fixed "schema"
  30. 30 File Content: Block storage. File content is stored in blocks (1, 2, ..., 8).
  31. 31 Local storage Local Machine

  32. 32 Local storage Local Machine LAN (NAS) LAN = local-area

    network NAS = network-attached storage
  33. 33 Local storage Local Machine LAN (NAS) WAN LAN =

    local-area network NAS = network-attached storage WAN = wide-area network
  34. 34 Scaling Issues Aleksandr Elesin / 123RF Stock Photo 1,000

    files 1,000,000 files 1,000,000,000 files
  35. 35 Better performance: Explicit Block Storage. The application has control over the locality of blocks (1, ..., 8).
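
    As a small illustration (not from the slides), here is a sketch of how a file's content maps onto fixed-size blocks, the unit that block storage reads and writes; the 4 KB block size is an assumption.

      # Minimal sketch: split a file's content into fixed-size blocks.
      BLOCK_SIZE = 4096  # bytes; an assumed block size for illustration

      def split_into_blocks(content: bytes, block_size: int = BLOCK_SIZE):
          """Return a list of (block_number, block_bytes) pairs."""
          return [
              (i // block_size + 1, content[i:i + block_size])
              for i in range(0, len(content), block_size)
          ]

      blocks = split_into_blocks(b"x" * 10000)   # -> blocks 1, 2, 3
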
  36. 36 So how do we make this scale? 1. We

    throw away the hierarchy!
  37. 37 So how do we make this scale? 2. We

    make metadata flexible
  38. 38 So how do we make this scale? 3. We

    make the data model trivial ID
  39. 39 So how do we make this scale? 4. We

    use commodity hardware
  40. 40 ... and we get Object Storage

  41. 41 "Black-box" objects ... and we get Object Storage

  42. 42 "Black-box" objects Flat and global key-value model ... and

    we get Object Storage
  43. 43 "Black-box" objects Flat and global key-value model Flexible metadata

    ... and we get Object Storage
  44. 44 "Black-box" objects Flat and global key-value model Flexible metadata

    Commodity hardware ... and we get Object Storage
  45. 45 Scale boscorelli / 123RF Stock Photo

  46. 46 One machine's not good enough. How do we scale?

  47. 47 Approach 1: scaling up

  48. 48 Approach 1: scaling up

  49. 49 Approach 2: scaling out

  50. 50 Approach 2: scaling out

  51. 51 Approach 2: scaling out

  52. 52 Approach 2: scaling out

  53. 53 Hardware price comparison Scale out Scale up

  54. 54 Approach 3: be smart Viktorija Reuta / 123RF Stock

    Photo
  55. 55 Approach 3: be smart Viktorija Reuta / 123RF Stock

    Photo “You can have a second computer once you’ve shown you know how to use the first one.” Paul Barham
  56. 56 In this lecture Approach 2 Scale out

  57. 57 Data centers boscorelli / 123RF Stock Photo

  58. 58 Numbers - computing 1,000-100,000 machines in a data center

    1-100 cores per server
  59. 59 Numbers - storage 1-12 TB local storage per server

    16GB-4TB of RAM per server
  60. 60 Numbers - network 1 GB/s network bandwidth per server
  61. 61 Racks Height in "rack units" (e.g., 42 RU)

  62. 62 Racks Modular: - servers - storage - routers -

    ...
  63. 63 Rack servers Lenovo ThinkServer RD430 Rack Server 1-4 RU

  64. 64 Amazon S3

  65. 65 S3 Model

  66. 66 S3 Model

  67. 67 S3 Model Bucket ID

  68. 68 S3 Model Bucket ID

  69. 69 S3 Model Bucket ID + Object ID

  70. 70 Scalability Max. 5 TB per object

  71. 71 Scalability 100 buckets/account (more upon request)

  72. 72 Durability 99.999999999% Loss of 1 in 10^11 objects

  73. 73 Availability 99.99% Down 1h / year

  74. 74 More about SLAs: how many nines?
  75. 75 More about SLAs

    SLA        Outage
    99%        4 days/year
    99.9%      9 hours/year
    99.99%     53 minutes/year
    99.999%    6 minutes/year
    99.9999%   32 seconds/year
    99.99999%  4 seconds/year
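
    To see where these numbers come from, here is a small back-of-the-envelope Python sketch (not from the lecture) that converts an availability percentage into the allowed downtime per year.

      # Back-of-the-envelope: allowed yearly downtime for a given availability SLA.
      SECONDS_PER_YEAR = 365 * 24 * 3600

      def downtime_per_year(sla_percent):
          """Return the maximum yearly outage (in seconds) allowed by an SLA."""
          return (1 - sla_percent / 100) * SECONDS_PER_YEAR

      for sla in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
          print(f"{sla}% -> {downtime_per_year(sla):.0f} s/year")
      # 99%      -> ~3.65 days/year
      # 99.99%   -> ~53 minutes/year
      # 99.9999% -> ~32 seconds/year
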
  76. 76 More about SLA Amazon's approach: Response time < 10

    ms in 99.9% of the cases (rather than average or median)
  77. 77 API REST

  78. 78 REST APIs

  79. 79 HTTP protocol (Sir Tim Berners-Lee)

    Version  RFC
    1.0      RFC 1945
    1.1      RFC 2616, revised in RFC 7230
    2.0      RFC 7540
  80. 80 REST API GET PUT DELETE POST HEAD OPTIONS TRACE

    CONNECT + Resource (URI) Method
  81. 81 PUT (Idempotent)

  82. 82 GET (Side-effect free)

  83. 83 DELETE

  84. 84 POST: most generic, may have side effects

  85. 85 S3 Resources: Buckets http://bucket.s3.amazonaws.com http://bucket.s3-region.amazonaws.com

  86. 86 S3 Resources: Objects http://bucket.s3.amazonaws.com/object-name http://bucket.s3-region.amazonaws.com/object-name

  87. 87 S3 REST API PUT Bucket DELETE Bucket GET Bucket

    PUT Object DELETE Object GET Object
  88. 88 Example

    GET /my-image.jpg HTTP/1.1
    Host: bucket.s3.amazonaws.com
    Date: Tue, 26 Sep 2017 10:55:00 GMT
    Authorization: authorization string
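
    As a rough illustration (not part of the original slides), such a request could be issued from Python with the standard library. The bucket and object names below are made up, and a real request to a private object would additionally need the AWS-signed Authorization header shown above.

      # Hypothetical bucket/object names; assumes the object is publicly readable,
      # otherwise an AWS Signature (Authorization header) must be added.
      from urllib.request import Request, urlopen

      url = "http://my-bucket.s3.amazonaws.com/my-image.jpg"
      request = Request(url, method="GET")     # REST: method + resource (URI)
      with urlopen(request) as response:
          body = response.read()               # the object's content ("black-box" bytes)
          print(response.status, len(body), "bytes")
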
  89. 89 Folders: is S3 a file system?

    Physical (object keys): /food/fruits/orange, /food/fruits/strawberry, /food/vegetables/tomato, /food/vegetables/turnip, /food/vegetables/lettuce
    Logical (browsing): a folder hierarchy (food > fruits, vegetables) is only simulated from the key prefixes
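
    A sketch of how this logical hierarchy is typically browsed with the AWS SDK for Python (boto3), assuming the SDK is installed and credentials are configured; the bucket name and keys are placeholders.

      # S3 keys are flat strings; "folders" are only a prefix/delimiter convention.
      import boto3

      s3 = boto3.client("s3")
      response = s3.list_objects_v2(Bucket="my-bucket", Prefix="food/", Delimiter="/")
      for prefix in response.get("CommonPrefixes", []):
          print("folder:", prefix["Prefix"])   # e.g. food/fruits/, food/vegetables/
      for obj in response.get("Contents", []):
          print("object:", obj["Key"])         # objects stored directly under food/
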
  90. 90 http://<bucket-name>.s3-website-us-east-1.amazonaws.com/ Static website hosting

  91. 91 More on Storage

  92. 92 Replication Fault tolerance

  93. 93 Faults Local (node failure) versus Regional (natural catastrophe)

  94. 94 Regions Songkran Khamlue / 123RF Stock Photo

  95. 95 Regions Songkran Khamlue / 123RF Stock Photo

  96. 96 Regions Songkran Khamlue / 123RF Stock Photo

  97. 97 Storage Class

    Standard: high availability
    Standard – Infrequent Access: less availability, cheaper storage, cost for retrieving
    Amazon Glacier: low-cost, hours to GET
  98. 98 Azure Blob Storage

  99. 99 Overall comparison Azure vs. S3

                 S3                 Azure
    Object ID    Bucket + Object    Account + Partition + Object
    Object API   Black box          Blocks or pages
    Limit        5 TB               195 GB (blocks), 1 TB (pages)
  100. 100 Azure Architecture: Storage Stamp Front-Ends Partition Layer Stream Layer

    Virtual IP address Account name Partition name Object name
  101. 101 Azure Architecture: One storage stamp 10-20 racks * 18

    storage nodes/rack (30PB)
  102. 102 Azure Architecture: Keep some buffer. A stamp is kept below 70-80% of its storage capacity.
  103. 103 Storage Replication Front-Ends Partition Layer Stream Layer Intra-stamp replication

    (synchronous)
  104. 104 Storage Replication Front-Ends Partition Layer Stream Layer Front-Ends Partition

    Layer Stream Layer Inter-stamp replication (asynchronous)
  105. 105 Location Services Front-Ends Partition Layer Stream Layer Location Services

    DNS Virtual IP (primary) Virtual IP Account name mapped to one Virtual IP (primary stamp) Front-Ends Partition Layer Stream Layer
  106. 106 Location Services Location Services DNS North America Europe Asia

  107. 107 Location Services Front-Ends Partition Layer Stream Layer DNS Front-Ends

    Partition Layer Stream Layer
  108. 108 Key-value storage Sergey Nivens / 123RF Stock Photo

  109. 109 Can we consider object storage a database?

  110. 110 Issue: latency. S3: ~100-300 ms vs. a typical database: 1-9 ms

  111. 111 Key-value stores ID 1. Similar data model to object

    storage
  112. 112 Key-value stores 2. Smaller objects: 5 TB (S3) vs. 400 KB (DynamoDB)

  113. 113 Key-value stores 3. No Metadata

  114. 114 Key-value stores: data model

  115. 115 Basic API get(key) value

  116. 116 Basic API get(key) value put(key, other value)

  117. 117 DynamoDB API get(key) value

  118. 118 DynamoDB API get(key) value, context

  119. 119 DynamoDB API get(key) value, context put(key, context, value) (more

    on context later)
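
    A minimal in-memory sketch (not the real DynamoDB client) of this get/put-with-context interface, just to make the shape of the API concrete.

      # Toy sketch of a key-value store whose get returns (value, context)
      # and whose put hands the context back in.
      class KeyValueStore:
          def __init__(self):
              self._data = {}          # key -> (value, context)

          def get(self, key):
              """Return (value, context); the context is passed back on the next put."""
              return self._data[key]

          def put(self, key, context, value):
              # A real store would use the context (e.g. a vector clock) to detect
              # conflicting concurrent versions; here we simply store it alongside.
              self._data[key] = (value, context)

      store = KeyValueStore()
      store.put("user42", context=None, value={"name": "Alice"})
      value, context = store.get("user42")
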
  120. 120 Key-value stores: why do we simplify? Simplicity More features

  121. 121 Key-value stores: why do we simplify? Simplicity Consistency Eventual

    consistency More features
  122. 122 Key-value stores: why do we simplify? Simplicity Consistency Eventual

    consistency More features Performance Overhead
  123. 123 Key-value stores: why do we simplify? Simplicity Consistency Eventual

    consistency More features Performance Scalability Monolithic Overhead
  124. 124 Key-value stores: which is the most efficient data structure for querying key-value pairs?

  125. 125 Key-value stores: logical model: an Associative Array (aka Map)
  126. 126 Enter nodes: physical level

  127. 127 Design principles: incremental stability

  128. 128 Design principles: incremental stability

  129. 129 Design principles: incremental stability

  130. 130 Design principles: symmetry

  131. 131 Design principles: decentralization

  132. 132 Design principles: heterogeneity

  133. 133 Physical layer: Peer to peer networks

  134. 134 Distributed Hash Tables: Chord. Hashed n-bit IDs (DynamoDB: 128 bits)
  135. 135 IDs are organized in a logical ring (mod 2^n), with up to 2^n positions
  136. 136 Each Node picks a 128-bit hash (randomly)

  137. 137 Nodes are (logically) placed on the ring (mod 2^n)

  138. 138 An ID is stored at the next node clockwise (ring mod 2^n)

  139. 139 An ID is stored at the next node clockwise (ring mod 2^n): each node gets a domain of responsibility
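
    A minimal consistent-hashing sketch (not the lecture's code) of this placement rule, using a 2^16 ring and MD5 hashing for readability; the node names are made up.

      # Keys and nodes share one hashed ID space; a key is stored at the
      # next node clockwise on the ring (wrapping around at the end).
      import hashlib
      from bisect import bisect_right

      RING_BITS = 16   # DynamoDB-style systems use 128 bits

      def ring_hash(name: str) -> int:
          digest = hashlib.md5(name.encode()).digest()
          return int.from_bytes(digest, "big") % (2 ** RING_BITS)

      node_ids = sorted(ring_hash(n) for n in ["node-A", "node-B", "node-C"])

      def responsible_node(key: str) -> int:
          """First node ID clockwise from the key's position (wrapping around)."""
          position = ring_hash(key)
          index = bisect_right(node_ids, position)
          return node_ids[index % len(node_ids)]
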
  140. 140 Adding and removing nodes (ring mod 2^n)

  141. 141 Adding and removing nodes (ring mod 2^n)

  142. 142 Adding and removing nodes (ring mod 2^n)

  143. 143 Adding and removing nodes (ring mod 2^n): the affected IDs need to be transferred

  144. 144 Adding and removing nodes (ring mod 2^n): the affected IDs need to be transferred; the other nodes are not affected

  145. 145 Adding and removing nodes (ring mod 2^n)

  146. 146 Adding and removing nodes (ring mod 2^n)

  147. 147 Adding and removing nodes (ring mod 2^n)

  148. 148 Adding and removing nodes (ring mod 2^n): the affected IDs need to be transferred

  149. 149 Adding and removing nodes (ring mod 2^n): the affected IDs need to be transferred, but what if the node failed?

  150. 150 Initial design: one range (ring mod 2^n)

  151. 151 Duplication: 2 ranges (ring mod 2^n)

  152. 152 Duplication: 2 ranges (ring mod 2^n)

  153. 153 Duplication: N ranges (ring mod 2^n)

  154. 154 Finger tables Credits: Thomas Hofmann

  155. 155 Finger tables O(log(number of nodes)) Credits: Thomas Hofmann
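
    A short sketch (under the same 2^16 ring assumption as above) of a Chord-style finger table: entry i points to the successor of node_id + 2^i, which is what allows lookups in O(log N) hops.

      def finger_table(node_id, node_ids, m=16):
          """Chord-style finger table for one node on a 2^m ring."""
          ring = sorted(node_ids)

          def successor(position):
              for candidate in ring:
                  if candidate >= position:
                      return candidate
              return ring[0]                      # wrap around the ring

          # Entry i points to the first node clockwise from node_id + 2^i.
          return [successor((node_id + 2 ** i) % 2 ** m) for i in range(m)]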

  156. 156 Distributed Hash Tables: Pros Highly scalable Robust against failure

    Self organizing Credits: Thomas Hofmann
  157. 157 Distributed Hash Tables: Cons Lookup, no search Data integrity

    Security issues Credits: Thomas Hofmann
  158. 158 Issue 1 (ring mod 2^n)

  159. 159 Issue 1: randomness out of luck (ring mod 2^n)

  160. 160 Issue 2: heterogeneous performance (ring mod 2^n)

  161. 161 So... How can we • artificially increase the number of nodes? and • bring some elasticity to account for performance differences?

  162. 162 Virtual nodes are the answer: tokens (ring mod 2^n)

  163. 163 Virtual nodes: tokens (ring mod 2^n)
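
    A sketch of tokens under the same ring assumptions as above: each physical node claims several positions, and a more powerful machine (here the hypothetical node-C) simply claims more of them.

      # Each physical node places several "tokens" (virtual nodes) on the ring,
      # so load evens out and stronger hardware can take a larger share.
      import hashlib

      def ring_hash(name: str, bits: int = 16) -> int:
          digest = hashlib.md5(name.encode()).digest()
          return int.from_bytes(digest, "big") % (2 ** bits)

      TOKENS_PER_NODE = {"node-A": 8, "node-B": 8, "node-C": 16}  # heterogeneous hardware

      token_to_node = {
          ring_hash(f"{node}#{i}"): node
          for node, count in TOKENS_PER_NODE.items()
          for i in range(count)
      }
      ring = sorted(token_to_node)   # token positions on the ring, ascending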

  164. 164 Deleting a node (ring mod 2^n)

  165. 165 Deleting a node (ring mod 2^n)

  166. 166 Deleting a node (ring mod 2^n)

  167. 167 Adding a node (ring mod 2^n)

  168. 168 Adding a node (ring mod 2^n)

  169. 169 Vector clocks put by Node A

  170. 170 Vector clocks put by Node A ([A, 1])

  171. 171 Vector clocks put by Node A put by Node

    A ([A, 1]) ([A, 2])
  172. 172 Vector clocks put by Node A put by Node

    A put by Node B ([A, 1]) ([A, 2]) ([A, 2], [B, 1])
  173. 173 Vector clocks put by Node A put by Node

    A put by Node B put by Node C ([A, 1]) ([A, 2]) ([A, 2], [B, 1]) ([A, 2], [C, 1])
  174. 174 Vector clocks put by Node A put by Node

    A put by Node B put by Node C reconcile and put by Node A ([A, 1]) ([A, 2]) ([A, 2], [B, 1]) ([A, 2], [C, 1])
  175. 175 Vector clocks put by Node A put by Node

    A put by Node B put by Node C reconcile and put by Node A ([A, 1]) ([A, 2]) ([A, 2], [B, 1]) ([A, 2], [C, 1]) ([A, 3], [B, 1], [C, 1])
  176. 176 Context put by Node A put by Node A

    put by Node B put by Node C reconcile and put by Node A ([A, 1]) ([A, 2]) ([A, 2], [B, 1]) ([A, 2], [C, 1]) ([A, 3], [B, 1], [C, 1])
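
    A small sketch (mirroring the slide's scenario, not Dynamo's actual implementation) of vector clocks as per-node counters, with reconciliation as a component-wise maximum.

      # Vector clocks as {node: counter} maps.
      def put(clock, node):
          """Return the clock after a put coordinated by `node`."""
          updated = dict(clock)
          updated[node] = updated.get(node, 0) + 1
          return updated

      def reconcile(*clocks):
          """Component-wise maximum of diverged clocks (the writer then does a put)."""
          merged = {}
          for clock in clocks:
              for node, counter in clock.items():
                  merged[node] = max(merged.get(node, 0), counter)
          return merged

      d1 = put({}, "A")                 # {'A': 1}
      d2 = put(d1, "A")                 # {'A': 2}
      d3 = put(d2, "B")                 # {'A': 2, 'B': 1}
      d4 = put(d2, "C")                 # {'A': 2, 'C': 1}
      d5 = put(reconcile(d3, d4), "A")  # {'A': 3, 'B': 1, 'C': 1}
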
  177. 177 Preference lists (key → nodes): propagate changes to the first N healthy nodes (avoids hops in routing)

  178. 178 Preference lists (key → nodes 1, 2, 3, ...): at least N nodes

  179. 179 Preference lists (key → nodes 1, 2, 3, ...): the first node is called the coordinator
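
    Continuing the token-ring sketch above, a hedged sketch of how a preference list could be derived: walk clockwise from the key's position and keep the first N distinct physical nodes; the first one acts as coordinator.

      from bisect import bisect_right

      def preference_list(key_position, ring, token_to_node, n=3):
          """First n distinct physical nodes clockwise from key_position."""
          start = bisect_right(ring, key_position)
          nodes = []
          for step in range(len(ring)):
              node = token_to_node[ring[(start + step) % len(ring)]]
              if node not in nodes:          # skip tokens of already-chosen nodes
                  nodes.append(node)
              if len(nodes) == n:
                  break
          return nodes                       # nodes[0] is the coordinator
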
  180. 180 Initial request Load balancer

  181. 181 Initial request Load balancer Random node

  182. 182 Initial request Load balancer Random node Coordinator

  183. 183 Initial request Load balancer Random node Coordinator

  184. 184 Lower latency Partition-aware client Coordinator

  185. 185 Lower latency Partition-aware client Coordinator Hinted handoff

  186. 186 Merkle Trees What if hinted replicas get lost? What

    if the complexity in replica deltas increases?
  187. 187 Merkle Trees What if hinted replicas get lost? What

    if the complexity in replica deltas increases? Anti-entropy protocol
  188. 188 Merkle Trees Key range (one per virtual node)
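
    A sketch (one possible construction, not necessarily Dynamo's exact one) of a Merkle tree over a key range: leaves hash individual key-value pairs, inner nodes hash their children, and two replicas only need to compare roots to detect divergence.

      import hashlib

      def h(data: bytes) -> bytes:
          return hashlib.sha256(data).digest()

      def merkle_root(items):
          """items: sorted list of (key, value) pairs in one virtual node's key range."""
          level = [h(f"{k}={v}".encode()) for k, v in items] or [h(b"")]
          while len(level) > 1:
              if len(level) % 2:
                  level.append(level[-1])          # duplicate last hash if odd count
              level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
          return level[0]

      # Replicas agree iff their roots match; otherwise compare child hashes and recurse.
      in_sync = merkle_root([("k1", "a"), ("k2", "b")]) == merkle_root([("k1", "a"), ("k2", "b")])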

  189. 189 Amazon mindset

  190. 190 Azure mindset Graphs Tables Trees Key-value paradigm

  191. 191 Take-away messages: how to scale out? § Simplify the model! § Buy cheap hardware! § Remove schemas!