
PWLSF#11 => Alex Rasmussen on Flat Datacenter Storage

Alex Rasmussen presents the "Flat Datacenter Storage" paper by Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. http://css.csail.mit.edu/6.824/2014/papers/fds.pdf

Alex tells us: "Flat Datacenter Storage (FDS) is, as the intro describes, 'a high-performance, fault-tolerant, large-scale, locality-oblivious blob store'. It's also a great example of how carefully thought-out co-design of software and hardware for a target workload can yield really impressive performance results, even in the presence of heterogeneity and operating at scale. In my (admittedly biased) opinion, this style of system design doesn't get enough attention outside of academia, and has a lot to teach us about how data-intensive systems should be designed."

If you have any questions, thoughts, or related information, please visit our GitHub thread on the matter: https://github.com/papers-we-love/papers-we-love/issues/198

Alex's Bio:

Alex Rasmussen (@alexras) got his Ph.D. from UC San Diego in 2013. While at UCSD, he worked on really efficient large-scale data processing and set a few world sorting records, which makes him a hit at parties. He's currently working at Trifacta, helping build the industry's leading big data washing machine.

The video of this talk is up! https://www.youtube.com/watch?v=F7heZ9ZBWqI

Papers_We_Love

January 22, 2015

Transcript

  1. Flat Datacenter Storage. Presented by Alex Rasmussen, Papers We Love SF #11, 2015-01-22. Paper by Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue.
  2. [Figure: common data center interconnect topology (Core / Aggregation / Edge); host-to-switch links are GigE and links between switches are 10 GigE, alongside a cost comparison of hierarchical vs. fat-tree designs.] Aggregate bandwidth above is less than aggregate demand below, sometimes by 100x or more.
  3. Consequences • No local vs. remote disk distinction • Simpler work schedulers • Simpler programming models
  4. Blob 0xbadf00d is split into fixed-size tracts (Tract 0, Tract 1, Tract 2, ..., Tract n), each 8 MB. API: CreateBlob, OpenBlob, CloseBlob, DeleteBlob, GetBlobSize, ExtendBlob, ReadTract, WriteTract.
  5. API Guarantees • Tractserver writes are atomic • Calls are asynchronous, which allows deep pipelining • Weak consistency to clients
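
The blob/tract model in slides 4 and 5 is easy to mimic in miniature. Below is a toy, single-process Python sketch that follows the slide's call names; the real FDS client is networked and asynchronous, so this mock only illustrates the semantics of fixed 8 MB tracts and per-tract atomic writes.

    # Toy in-memory stand-in for the blob/tract API on slide 4 (not the
    # real FDS client library, which is distributed and asynchronous).
    TRACT_SIZE = 8 * 1024 * 1024  # every tract is 8 MB

    class ToyBlobStore:
        def __init__(self):
            self._blobs = {}                      # GUID -> list of tracts

        def create_blob(self, guid):
            self._blobs[guid] = []

        def extend_blob(self, guid, num_tracts):
            self._blobs[guid] += [bytes(TRACT_SIZE)] * num_tracts

        def get_blob_size(self, guid):
            return len(self._blobs[guid]) * TRACT_SIZE

        def write_tract(self, guid, tract, data):
            assert len(data) <= TRACT_SIZE        # a write never spans tracts,
            self._blobs[guid][tract] = data       # so each write is atomic

        def read_tract(self, guid, tract):
            return self._blobs[guid][tract]

        def delete_blob(self, guid):
            del self._blobs[guid]

    store = ToyBlobStore()
    store.create_blob("0xbadf00d")
    store.extend_blob("0xbadf00d", num_tracts=3)
    store.write_tract("0xbadf00d", 0, b"hello, tract 0")
    print(store.get_blob_size("0xbadf00d"), store.read_tract("0xbadf00d", 0))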
  6. Tract Locator Table:
       Tract Locator   Version   TS
       1               0         A
       2               0         B
       3               2         D
       4               0         A
       5               3         C
       6               0         F
       ...             ...       ...
  7. Randomize a blob’s tractserver placement, even if GUIDs aren’t random (uses SHA-1):
       Tract_Locator = TLT[(Hash(GUID) + Tract) % len(TLT)]
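
A minimal Python sketch of that lookup, assuming the client has already fetched the TLT from the metadata server and holds it as a plain list of tractserver names:

    import hashlib

    def tract_locator(tlt, blob_guid: bytes, tract_number: int) -> str:
        """Map (blob GUID, tract number) to a row of the Tract Locator Table.

        SHA-1 randomizes placement even when GUIDs themselves are not
        random; adding the tract number spreads consecutive tracts of one
        blob across different tractservers.
        """
        guid_hash = int.from_bytes(hashlib.sha1(blob_guid).digest(), "big")
        return tlt[(guid_hash + tract_number) % len(tlt)]

    # Consecutive tracts of one blob land on different rows/servers.
    tlt = ["A", "B", "D", "A", "C", "F"]
    print([tract_locator(tlt, b"0xbadf00d", t) for t in range(4)])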
  8. TLT Construction • m permutations of the tractserver list • Weighted by disk speed • Served by the metadata server to clients • Only updated when the cluster changes
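
A rough sketch of that construction for the non-replicated case, assuming the speed weighting simply repeats each tractserver in proportion to its disk speed (the weights, m, and the seed below are illustrative, not values from the paper):

    import random

    def build_tlt(tractservers, m=4, seed=0):
        """Concatenate m permutations of the (weighted) tractserver list.

        tractservers maps server name -> relative disk-speed weight, so
        faster disks own proportionally more rows.  Each row also carries
        a version number (0 here) that the metadata server bumps whenever
        ownership of that row changes.
        """
        rng = random.Random(seed)
        weighted = [s for s, w in tractservers.items() for _ in range(w)]
        tlt = []
        for _ in range(m):
            perm = weighted[:]
            rng.shuffle(perm)
            tlt.extend((0, server) for server in perm)   # (version, tractserver)
        return tlt

    # Example: server C's disks are twice as fast as A's and B's.
    print(build_tlt({"A": 1, "B": 1, "C": 2}, m=2))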
  9. Cluster Growth:
       Tract Locator   Version   TS
       1               0         A
       2               0         B
       3               2         D
       4               0         A
       5               3         C
       6               0         F
       ...             ...       ...
  10. Cluster Growth:
       Tract Locator   Version   TS
       1               1         NEW / A
       2               0         B
       3               2         D
       4               1         NEW / A
       5               4         NEW / C
       6               0         F
       ...             ...       ...
  11. Cluster Growth:
       Tract Locator   Version   TS
       1               2         NEW
       2               0         A
       3               2         A
       4               2         NEW
       5               5         NEW
       6               0         A
       ...             ...       ...
  12. Replication:
       Tract Locator   Version   Replica 1   Replica 2   Replica 3
       1               0         A           B           C
       2               0         A           C           Z
       3               0         A           D           H
       4               0         A           E           M
       5               0         A           F           G
       6               0         A           G           P
       ...             ...       ...         ...         ...
  13. Replication (same Tract Locator Table as the previous slide).
  14. Replication • Create, Delete, Extend: the client writes to the primary, and the primary runs 2PC to the replicas • Writes go to all replicas • Reads go to a random replica
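
A minimal sketch of the replicated data path on slide 14, assuming a TLT row is just a list of replica names and using an in-process stand-in for the network; the primary-coordinated two-phase commit for Create/Delete/Extend is not shown:

    import random

    # Toy transport: each "tractserver" is a dict mapping tract -> bytes.
    servers = {"A": {}, "B": {}, "C": {}}

    def send(ts, op, tract, data=None):
        if op == "WRITE":
            servers[ts][tract] = data
        return servers[ts].get(tract)

    def write_tract(row_replicas, tract, data):
        # Writes go to every replica in the row and are acknowledged only
        # once all replicas have applied them.
        for ts in row_replicas:
            send(ts, "WRITE", tract, data)

    def read_tract(row_replicas, tract):
        # Reads need only one copy, so pick a replica at random to spread
        # read load across the row.
        return send(random.choice(row_replicas), "READ", tract)

    row = ["A", "B", "C"]              # one TLT row with 3-way replication
    write_tract(row, 7, b"tract seven")
    print(read_tract(row, 7))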
  15. Recovery:
       Tract Locator   Version   Replica 1   Replica 2   Replica 3
       1               0         A           B           C
       2               0         A           C           Z
       3               0         A           D           H
       4               0         A           E           M
       5               0         A           F           G
       6               0         A           G           P
       ...             ...       ...         ...         ...
  16. Recovery (same table, highlighted to spell "HEAL ME"): recover 1 TB from 3,000 disks in < 20 seconds.
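
A simplified sketch of what those two slides imply, assuming recovery amounts to bumping the version of every TLT row that listed the failed server and choosing a random live replacement for it; because the failed disk's data was spread across many rows, every surviving replica in those rows can send its share in parallel, which is how 1 TB comes back in seconds on a large cluster:

    import random

    def recover(tlt, versions, failed, live_servers, rng=random):
        """Repair every TLT row that referenced the failed tractserver."""
        for i, row in enumerate(tlt):
            if failed in row:
                candidates = [s for s in live_servers if s not in row]
                row[row.index(failed)] = rng.choice(candidates)
                versions[i] += 1   # clients holding the old row must refetch

    tlt = [["A", "B", "C"], ["A", "C", "Z"], ["A", "D", "H"]]
    versions = [0, 0, 0]
    recover(tlt, versions, failed="A", live_servers=list("BCDEFGHZ"))
    print(tlt, versions)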
  17. [Figure: simple fat-tree topology with core, aggregation, and edge switches and two-level routing tables; per the caption, packets from source 10.0.1.2 to destination 10.2.0.3 take the dashed path.] CLOS topology: small switches + ECMP = full bisection bandwidth.
  18. Networking • Network bandwidth = disk bandwidth • Full bisection bandwidth is stochastic • Short flows are good for ECMP • TCP hates short flows • RTS/CTS to mitigate incast; see paper
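
The first bullet is provisioning arithmetic: give each machine as much network bandwidth as its disks can source. A back-of-the-envelope check, assuming roughly 125 MB/s of sequential throughput per disk and 10 disks per machine (illustrative numbers, not taken from the slide):

    # Match per-machine NIC bandwidth to aggregate disk bandwidth so the
    # network never bottlenecks sequential I/O.  Numbers are assumptions.
    DISK_MB_PER_S = 125          # ~sequential throughput of one disk
    DISKS_PER_MACHINE = 10
    NIC_GBPS = 10

    disk_gbps = DISKS_PER_MACHINE * DISK_MB_PER_S * 8 / 1000
    print(f"disks can source {disk_gbps:.0f} Gb/s vs. a {NIC_GBPS} Gb/s NIC")
    # -> ~10 Gb/s of disk throughput wants ~10 Gb/s of network per machine.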
  19. MinuteSort (Daytona):
       System (Nodes)   Year   Data Sorted   Speed per Disk
       Hadoop (1408)    2009   500 GB        3 MB/s
       FDS (256)        2012   1470 GB       46 MB/s
  20. MinuteSort (Indy):
       System (Nodes)    Year   Data Sorted   Speed per Disk
       TritonSort (66)   2012   1353 GB       43.3 MB/s
       FDS (256)         2012   1470 GB       47.9 MB/s
  21. FDS’ Lessons • A great example of a ground-up rethink: ambitious but implementable • Big wins are possible with co-design • Constantly re-examine assumptions
  22. TritonSort & Themis • Balanced hardware architecture • Full bisection-bandwidth network • Job-level fault tolerance • Huge wins possible: beat a 3000+ node cluster by 35% with 52 nodes • NSDI 2012, SoCC 2013