
PWLSF#11 => Alex Rasmussen on Flat Datacenter Storage

Alex Rasmussen presents the "Flat Datacenter Storage" paper by Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. http://css.csail.mit.edu/6.824/2014/papers/fds.pdf

Alex tells us: "Flat Datacenter Storage (FDS) is, as the intro describes, 'a high-performance, fault-tolerant, large-scale, locality-oblivious blob store'. It's also a great example of how carefully thought-out co-design of software and hardware for a target workload can yield really impressive performance results, even in the presence of heterogeneity and operating at scale. In my (admittedly biased) opinion, this style of system design doesn't get enough attention outside of academia, and has a lot to teach us about how data-intensive systems should be designed."

If you have any questions, thoughts, or related information, please visit our GitHub thread on the matter: https://github.com/papers-we-love/papers-we-love/issues/198

Alex's Bio:

Alex Rasmussen (@alexras) got his Ph.D. from UC San Diego in 2013. While at UCSD, he worked on really efficient large-scale data processing and set a few world sorting records, which makes him a hit at parties. He's currently working at Trifacta, helping build the industry's leading big data washing machine.

The video of this talk is up! https://www.youtube.com/watch?v=F7heZ9ZBWqI

Papers_We_Love

January 22, 2015

Transcript

  1. Flat Datacenter Storage. Presented by Alex Rasmussen, Papers We Love SF #11, 2015-01-22. Paper by Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue.
  2. [Figure: common data center interconnect topology (Core / Aggregation / Edge); host-to-switch links are GigE and links between switches are 10 GigE, alongside a cost comparison of hierarchical vs. fat-tree designs.] Aggregate bandwidth above is less than aggregate demand below, sometimes by 100x or more.
  3. Consequences • No local vs. remote disk distinction • Simpler work schedulers • Simpler programming models
  4. Blob 0xbadf00d is split into fixed-size tracts (Tract 0, Tract 1, Tract 2, ..., Tract n), each 8 MB. API: CreateBlob, OpenBlob, CloseBlob, DeleteBlob, GetBlobSize, ExtendBlob, ReadTract, WriteTract.
  5. API Guarantees • Tractserver writes are atomic • Calls are asynchronous, which allows deep pipelining • Weak consistency to clients
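
The blob/tract model in slides 4 and 5 is easy to mimic in miniature. Below is a toy, single-process Python sketch that follows the slide's call names; the real FDS client is networked and asynchronous, so this mock only illustrates the semantics of fixed 8 MB tracts and per-tract atomic writes.

    # Toy in-memory stand-in for the blob/tract API on slide 4 (not the
    # real FDS client library, which is distributed and asynchronous).
    TRACT_SIZE = 8 * 1024 * 1024  # every tract is 8 MB

    class ToyBlobStore:
        def __init__(self):
            self._blobs = {}                      # GUID -> list of tracts

        def create_blob(self, guid):
            self._blobs[guid] = []

        def extend_blob(self, guid, num_tracts):
            self._blobs[guid] += [bytes(TRACT_SIZE)] * num_tracts

        def get_blob_size(self, guid):
            return len(self._blobs[guid]) * TRACT_SIZE

        def write_tract(self, guid, tract, data):
            assert len(data) <= TRACT_SIZE        # a write never spans tracts,
            self._blobs[guid][tract] = data       # so each write is atomic

        def read_tract(self, guid, tract):
            return self._blobs[guid][tract]

        def delete_blob(self, guid):
            del self._blobs[guid]

    store = ToyBlobStore()
    store.create_blob("0xbadf00d")
    store.extend_blob("0xbadf00d", num_tracts=3)
    store.write_tract("0xbadf00d", 0, b"hello, tract 0")
    print(store.get_blob_size("0xbadf00d"), store.read_tract("0xbadf00d", 0))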
  6. Tract Locator Table:
       Tract Locator   Version   TS
       1               0         A
       2               0         B
       3               2         D
       4               0         A
       5               3         C
       6               0         F
       ...             ...       ...
  7. Randomize a blob’s tractserver placement, even if GUIDs aren’t random (uses SHA-1):
       Tract_Locator = TLT[(Hash(GUID) + Tract) % len(TLT)]
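
A minimal Python sketch of that lookup, assuming the client has already fetched the TLT from the metadata server and holds it as a plain list of tractserver names:

    import hashlib

    def tract_locator(tlt, blob_guid: bytes, tract_number: int) -> str:
        """Map (blob GUID, tract number) to a row of the Tract Locator Table.

        SHA-1 randomizes placement even when GUIDs themselves are not
        random; adding the tract number spreads consecutive tracts of one
        blob across different tractservers.
        """
        guid_hash = int.from_bytes(hashlib.sha1(blob_guid).digest(), "big")
        return tlt[(guid_hash + tract_number) % len(tlt)]

    # Consecutive tracts of one blob land on different rows/servers.
    tlt = ["A", "B", "D", "A", "C", "F"]
    print([tract_locator(tlt, b"0xbadf00d", t) for t in range(4)])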
  8. TLT Construction • m permutations of the tractserver list • Weighted by disk speed • Served by the metadata server to clients • Only updated when the cluster changes
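
A rough sketch of that construction for the non-replicated case, assuming the speed weighting simply repeats each tractserver in proportion to its disk speed (the weights, m, and the seed below are illustrative, not values from the paper):

    import random

    def build_tlt(tractservers, m=4, seed=0):
        """Concatenate m permutations of the (weighted) tractserver list.

        tractservers maps server name -> relative disk-speed weight, so
        faster disks own proportionally more rows.  Each row also carries
        a version number (0 here) that the metadata server bumps whenever
        ownership of that row changes.
        """
        rng = random.Random(seed)
        weighted = [s for s, w in tractservers.items() for _ in range(w)]
        tlt = []
        for _ in range(m):
            perm = weighted[:]
            rng.shuffle(perm)
            tlt.extend((0, server) for server in perm)   # (version, tractserver)
        return tlt

    # Example: server C's disks are twice as fast as A's and B's.
    print(build_tlt({"A": 1, "B": 1, "C": 2}, m=2))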
  9. Cluster Growth:
       Tract Locator   Version   TS
       1               0         A
       2               0         B
       3               2         D
       4               0         A
       5               3         C
       6               0         F
       ...             ...       ...
  10. Cluster Growth:
       Tract Locator   Version   TS
       1               1         NEW / A
       2               0         B
       3               2         D
       4               1         NEW / A
       5               4         NEW / C
       6               0         F
       ...             ...       ...
  11. Cluster Growth:
       Tract Locator   Version   TS
       1               2         NEW
       2               0         A
       3               2         A
       4               2         NEW
       5               5         NEW
       6               0         A
       ...             ...       ...
  12. Replication:
       Tract Locator   Version   Replica 1   Replica 2   Replica 3
       1               0         A           B           C
       2               0         A           C           Z
       3               0         A           D           H
       4               0         A           E           M
       5               0         A           F           G
       6               0         A           G           P
       ...             ...       ...         ...         ...
  13. Replication (same Tract Locator Table as the previous slide).
  14. Replication • Create, Delete, Extend: the client writes to the primary, and the primary runs 2PC to the replicas • Writes go to all replicas • Reads go to a random replica
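
A minimal sketch of the replicated data path on slide 14, assuming a TLT row is just a list of replica names and using an in-process stand-in for the network; the primary-coordinated two-phase commit for Create/Delete/Extend is not shown:

    import random

    # Toy transport: each "tractserver" is a dict mapping tract -> bytes.
    servers = {"A": {}, "B": {}, "C": {}}

    def send(ts, op, tract, data=None):
        if op == "WRITE":
            servers[ts][tract] = data
        return servers[ts].get(tract)

    def write_tract(row_replicas, tract, data):
        # Writes go to every replica in the row and are acknowledged only
        # once all replicas have applied them.
        for ts in row_replicas:
            send(ts, "WRITE", tract, data)

    def read_tract(row_replicas, tract):
        # Reads need only one copy, so pick a replica at random to spread
        # read load across the row.
        return send(random.choice(row_replicas), "READ", tract)

    row = ["A", "B", "C"]              # one TLT row with 3-way replication
    write_tract(row, 7, b"tract seven")
    print(read_tract(row, 7))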
  15. Recovery:
       Tract Locator   Version   Replica 1   Replica 2   Replica 3
       1               0         A           B           C
       2               0         A           C           Z
       3               0         A           D           H
       4               0         A           E           M
       5               0         A           F           G
       6               0         A           G           P
       ...             ...       ...         ...         ...
  16. Recovery (same table, highlighted to spell "HEAL ME"): recover 1 TB from 3,000 disks in < 20 seconds.
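
A simplified sketch of what those two slides imply, assuming recovery amounts to bumping the version of every TLT row that listed the failed server and choosing a random live replacement for it; because the failed disk's data was spread across many rows, every surviving replica in those rows can send its share in parallel, which is how 1 TB comes back in seconds on a large cluster:

    import random

    def recover(tlt, versions, failed, live_servers, rng=random):
        """Repair every TLT row that referenced the failed tractserver."""
        for i, row in enumerate(tlt):
            if failed in row:
                candidates = [s for s in live_servers if s not in row]
                row[row.index(failed)] = rng.choice(candidates)
                versions[i] += 1   # clients holding the old row must refetch

    tlt = [["A", "B", "C"], ["A", "C", "Z"], ["A", "D", "H"]]
    versions = [0, 0, 0]
    recover(tlt, versions, failed="A", live_servers=list("BCDEFGHZ"))
    print(tlt, versions)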
  17. [Figure: simple fat-tree topology with core, aggregation, and edge switches and two-level routing tables; per the caption, packets from source 10.0.1.2 to destination 10.2.0.3 take the dashed path.] CLOS topology: small switches + ECMP = full bisection bandwidth.
  18. Networking • Network bandwidth = disk bandwidth • Full bisection bandwidth is stochastic • Short flows are good for ECMP • TCP hates short flows • RTS/CTS to mitigate incast; see paper
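
The first bullet is provisioning arithmetic: give each machine as much network bandwidth as its disks can source. A back-of-the-envelope check, assuming roughly 125 MB/s of sequential throughput per disk and 10 disks per machine (illustrative numbers, not taken from the slide):

    # Match per-machine NIC bandwidth to aggregate disk bandwidth so the
    # network never bottlenecks sequential I/O.  Numbers are assumptions.
    DISK_MB_PER_S = 125          # ~sequential throughput of one disk
    DISKS_PER_MACHINE = 10
    NIC_GBPS = 10

    disk_gbps = DISKS_PER_MACHINE * DISK_MB_PER_S * 8 / 1000
    print(f"disks can source {disk_gbps:.0f} Gb/s vs. a {NIC_GBPS} Gb/s NIC")
    # -> ~10 Gb/s of disk throughput wants ~10 Gb/s of network per machine.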
  19. MinuteSort (Daytona):
       System (Nodes)   Year   Data Sorted   Speed per Disk
       Hadoop (1408)    2009   500 GB        3 MB/s
       FDS (256)        2012   1470 GB       46 MB/s
  20. MinuteSort (Indy):
       System (Nodes)    Year   Data Sorted   Speed per Disk
       TritonSort (66)   2012   1353 GB       43.3 MB/s
       FDS (256)         2012   1470 GB       47.9 MB/s
  21. FDS’ Lessons • A great example of a ground-up rethink: ambitious but implementable • Big wins are possible with co-design • Constantly re-examine assumptions
  22. TritonSort & Themis • Balanced hardware architecture • Full bisection-bandwidth network • Job-level fault tolerance • Huge wins possible: beat a 3000+ node cluster by 35% with 52 nodes • NSDI 2012, SoCC 2013