NSDI'16: Distributed Systems

Slides used to present the papers of the Distributed Systems session at the NSDI 2016 paper reading group (http://system-reading.connpass.com/event/31207/).

disktnk

May 29, 2016

Transcript

  1. NSDI2016 Technical Sessions Distributed Systems Session Overview TANAKA Daisuke@PFN

  2. CONTENTS: papers covered (https://www.usenix.org/conference/nsdi16/technical-sessions)
    • Consensus in a Box: Inexpensive Coordination in Hardware (ETH Zürich)
    • StreamScope: Continuous Reliable Distributed Processing of Big Data Streams (Microsoft, Microsoft Research)
    • Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks (Facebook)
    • The Design and Implementation of the Warp Transactional Filesystem (Cornell University)
    • BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores (University of California, Berkeley)
  3. How each paper is covered
    1. What problem is it trying to solve?
    2. How was it solved? (algorithm)
    3. Evaluation results and applications

  4. CONTENTS (agenda slide repeated; next paper: Consensus in a Box: Inexpensive Coordination in Hardware)
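  5. Consensus in a Box: Inexpensive Coordination in Hardware (what problem is it trying to solve?)
    • Consensus is expensive, both in implementation complexity and in processing time
    • A consensus protocol basically has to satisfy four properties: Termination (Liveness) / Validity / Integrity / Agreement
    • Examples: PAXOS, RAFT
    • Systems whose workloads demand strict guarantees need consensus, but it often becomes the limit on performance and scalability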
  6. Consensus in a Box: Inexpensive Coordination in Hardware (how was it solved?)
    • Implemented Zookeeper's atomic broadcast (ZAB) on an FPGA
  7. Consensus in a Box: Inexpensive Coordination in Hardware (Zookeeper's atomic broadcast, ZAB)
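
The atomic broadcast named on this slide can be pictured with a small in-memory sketch. It is not the FPGA design from the paper and not ZooKeeper's code: the `Leader` and `Follower` classes, the `alive` flag, and the synchronous method calls are all assumptions made for illustration, and leader election and recovery are left out. It only shows the core idea that a proposal is delivered once a majority of the ensemble has acknowledged it, in zxid order.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    name: str
    alive: bool = True
    log: list = field(default_factory=list)        # accepted proposals
    delivered: list = field(default_factory=list)  # committed, in zxid order

    def on_proposal(self, zxid, value):
        """Append the proposal and acknowledge it; a dead node stays silent."""
        if not self.alive:
            return False
        self.log.append((zxid, value))
        return True

    def on_commit(self, zxid):
        """Deliver the logged proposal that the leader declared committed."""
        if not self.alive:
            return
        for entry in self.log:
            if entry[0] == zxid:
                self.delivered.append(entry)


class Leader:
    def __init__(self, epoch, followers):
        self.epoch = epoch
        self.counter = 0
        self.followers = followers
        self.delivered = []

    def broadcast(self, value):
        """Propose value; commit only if a majority of the ensemble acks."""
        self.counter += 1
        zxid = (self.epoch, self.counter)  # (epoch, counter) orders proposals
        acks = 1  # the leader counts itself
        for follower in self.followers:
            if follower.on_proposal(zxid, value):
                acks += 1
        ensemble = len(self.followers) + 1
        if acks * 2 <= ensemble:
            return False  # no quorum, nothing is delivered
        self.delivered.append((zxid, value))
        for follower in self.followers:
            follower.on_commit(zxid)
        return True
```

With one leader and two followers, a broadcast still commits after one follower fails (two of three acks) and stops committing once both followers are down.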
  8. Consensus in a Box: Inexpensive Coordination in Hardware (TCP/IP hardware)

  9. Consensus in a Box: Inexpensive Coordination in Hardware (evaluation results)

  10. CONTENTS (agenda slide repeated; next paper: StreamScope: Continuous Reliable Distributed Processing of Big Data Streams)
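  11. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams (what problem is it trying to solve?)
    • Goal: stream processing over big data
    • Complexity, scalability, and fault tolerance all have to be addressed
    • Recovering the dataflow of a stream-processing job
      • Resend input events and restore state
      → the aim is to prevent both lost events and duplicate events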
  12. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams (how was it solved?)
    • Represents the dataflow as a DAG and abstracts the dependencies between nodes and edges as rStream / rVertex
    • Provides recovery
    • Extends SCOPE, a parallel MapReduce implementation (http://www.vldb.org/pvldb/1/1454166.pdf)
  13. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams (how was it solved?)
    • Execution in STREAM SCOPE (StreamS)
  14. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams (how was it solved?)
    • Execution in STREAM SCOPE (StreamS) (figure labels: rStream, rVertex)
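  15. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams (how was it solved?)
    • rStream
      • An asynchronous communication channel that separates out and abstracts the dependency on vertices (vertex: the general term for a node that processes events)
      • Events carry a sequence number (seq). A read does not complete until the write of the event with the same seq has succeeded.
      • A naive implementation would amount to a synchronous model, so a GC model is adopted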
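
The rStream behaviour described on this slide (sequence-numbered events, reads that wait for the matching write, garbage collection) can be sketched as follows. This is only an illustration under assumptions: the class `RStream` and its methods `write`, `read`, and `gc` are hypothetical names, not the StreamScope API, and a read that would wait is modelled by returning `None` instead of blocking.

```python
class RStream:
    """Toy sequence-numbered channel; not the StreamScope implementation."""

    def __init__(self):
        self._events = {}   # seq -> event
        self._gc_floor = 0  # every seq below this has been collected

    def write(self, seq, event):
        """Idempotent write: replaying a seq after recovery is a no-op."""
        if seq < self._gc_floor or seq in self._events:
            return False
        self._events[seq] = event
        return True

    def read(self, seq):
        """Return the event at seq, or None while it has not been written."""
        if seq < self._gc_floor:
            raise KeyError(f"seq {seq} was garbage-collected")
        return self._events.get(seq)

    def gc(self, up_to):
        """Drop events below up_to once downstream no longer needs them."""
        for seq in [s for s in self._events if s < up_to]:
            del self._events[seq]
        self._gc_floor = max(self._gc_floor, up_to)
```

Because a duplicate write with an already used seq is rejected, a recovering upstream vertex can resend events without creating duplicates downstream.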
  16. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams (how was it solved?)
    • rVertex
      • Takes a simple snapshot of the computation at a vertex
      • Implements state restart and failure recovery
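
A minimal sketch of the snapshot-and-recover idea for a vertex, assuming a simple counting operator. `CountingVertex`, `snapshot`, and `restore` are hypothetical names chosen for this example; the snapshot format of the real system is not described in the slides.

```python
import copy


class CountingVertex:
    """Toy stateful vertex that counts events per key."""

    def __init__(self):
        self.state = {}
        self.next_seq = 0  # first input seq not yet reflected in state

    def process(self, seq, key):
        if seq < self.next_seq:
            return  # replayed event already covered by the restored state
        self.state[key] = self.state.get(key, 0) + 1
        self.next_seq = seq + 1

    def snapshot(self):
        """Capture the state together with the input position it reflects."""
        return copy.deepcopy((self.next_seq, self.state))

    @classmethod
    def restore(cls, snap):
        vertex = cls()
        vertex.next_seq, vertex.state = copy.deepcopy(snap)
        return vertex
```

After a failure, restoring the snapshot and replaying the input from the beginning yields the same state as an uninterrupted run, because events below the recorded position are skipped.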
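  17. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams (evaluation)
    • By choosing the checkpoint interval and the timing of persistence, the cost of the several failure-recovery models can be minimized
      • strict model / relaxed model …
    • It is easy to debug where a failure occurred
    • Deployment, and everything in the implementation other than rStream / rVertex, can reuse existing assets (mainly SCOPE)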
  18. CONTENTS (agenda slide repeated; next paper: Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks)
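  19. Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks (what problem is it trying to solve?)
    • How should a huge volume of HTTP requests be handled, and how should caches be distributed across clusters? The goal is an efficient distribution
    • The assignment should be balanced / adaptive / stable / fast decision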
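  20. Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks (how was it solved?)
    • Explicitly separates the social graph into “object”s, which can be placed statically, and “group”s, which are placed dynamically
    • Uses Facebook's TAO (a distributed DB for the social graph)
    • static assignment
      • Collects objects with similar graphs. Data access patterns are represented as a graph, which is then partitioned. Slow when the data is large.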
  21. Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks (how was it solved?)
    • dynamic assignment
      • Balances dynamically against changing information. The evaluation uses bipartite graph partitioning.
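
The two-level split between a static and a dynamic assignment can be sketched as below. Both steps use deliberately simple stand-ins rather than the paper's methods: the static step uses a CRC32 hash instead of graph partitioning, and the dynamic step uses a greedy least-loaded placement. The names `static_group` and `dynamic_assign`, the server names, and the load numbers are all assumptions for illustration.

```python
import heapq
import zlib


def static_group(object_id: str, num_groups: int) -> int:
    """Static step: map an object to a group. A stable hash stands in
    for the graph-partitioning step described on the slides."""
    return zlib.crc32(object_id.encode()) % num_groups


def dynamic_assign(group_load: dict, servers: list) -> dict:
    """Dynamic step: place groups on servers, heaviest group first,
    always choosing the currently least-loaded server."""
    heap = [(0.0, server) for server in servers]
    heapq.heapify(heap)
    placement = {}
    for group, load in sorted(group_load.items(), key=lambda kv: -kv[1]):
        total, server = heapq.heappop(heap)
        placement[group] = server
        heapq.heappush(heap, (total + load, server))
    return placement
```

The point of the split is that the object-to-group mapping rarely changes, while `dynamic_assign` can be rerun cheaply whenever the measured load per group changes, since it only moves whole groups.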
  22. Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks (evaluation)
  23. Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks (evaluation)
  24. CONTENTS (agenda slide repeated; next paper: The Design and Implementation of the Warp Transactional Filesystem)
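  25. The Design and Implementation of the Warp Transactional Filesystem (what problem is it trying to solve?)
    • Existing distributed filesystems
      • Insufficient guarantees / restrictive interfaces / do not scale
    • An extension of the distributed filesystem protocol, abbreviated WTF
      • POSIX API + new zero-copy API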
  26. The Design and Implementation of the Warp Transactional Filesystem (how was it solved?)
    • file slicing API
  27. The Design and Implementation of the Warp Transactional Filesystem (how was it solved?)
    • file slicing API
      • Metadata storage holds the pointers to slices
      • Slices are addressed by file offset. An overwrite triggers compaction (metadata compaction).
      • Unused regions are garbage-collected
      • Because fragmentation occurs, locality-aware slice placement is applied continuously
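
The relationship between slice pointers, overwrites, and metadata compaction can be illustrated with a toy model. This is a sketch under assumptions, not the WTF design: `SliceStore`, `SlicedFile`, and the `(file_offset, blob_offset, length)` extent tuple are hypothetical, and a single in-memory bytearray stands in for the storage servers.

```python
class SliceStore:
    """Append-only data store; written bytes are never modified in place."""

    def __init__(self):
        self.blob = bytearray()

    def append(self, data: bytes):
        offset = len(self.blob)
        self.blob += data
        return offset, len(data)


class SlicedFile:
    """File metadata: a list of pointers into the store, oldest first."""

    def __init__(self, store):
        self.store = store
        self.extents = []  # (file_offset, blob_offset, length)

    def write(self, file_offset, data):
        blob_offset, length = self.store.append(data)
        self.extents.append((file_offset, blob_offset, length))

    def _byte_map(self):
        """Map each file byte to the blob byte that currently backs it."""
        size = max((fo + ln for fo, _, ln in self.extents), default=0)
        mapping = [None] * size
        for fo, bo, ln in self.extents:  # later extents overwrite earlier ones
            for i in range(ln):
                mapping[fo + i] = bo + i
        return mapping

    def read(self):
        blob = self.store.blob
        return bytes(0 if b is None else blob[b] for b in self._byte_map())

    def compact(self):
        """Rewrite metadata as non-overlapping extents; data is untouched."""
        compacted = []
        for fo, b in enumerate(self._byte_map()):
            if b is None:
                continue
            if compacted:
                last_fo, last_bo, last_ln = compacted[-1]
                if last_fo + last_ln == fo and last_bo + last_ln == b:
                    compacted[-1] = (last_fo, last_bo, last_ln + 1)
                    continue
            compacted.append((fo, b, 1))
        self.extents = compacted
```

After an overwrite, the shadowed bytes remain in the store but no extent points at them any more, which is what makes them candidates for garbage collection.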
  28. The Design and Implementation of the Warp Transactional Filesystem (applications)
    • Used as the store for a MapReduce sort → run time went from 70 min to 15 min
    • Video editing → time-series sorting became faster
    • Two other applications are also presented
  29. CONTENTS (agenda slide repeated; next paper: BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores)
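  30. BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores (what problem is it trying to solve?)
    • In a data store, random access and search are the two basic operations
    • NoSQL systems first focused on handling large data, and only then on how to make the store fast
    • Compression is used to cope with large data, but it trades off against throughput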
  31. BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores (what problem is it trying to solve?)

  32. BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores (how was it solved?)
    • Proposes a new data structure, the Layered Sampled Array (LSA), which allows the position on the trade-off curve to be changed dynamically
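
One way to picture a layered sampled array is the toy model below. It rests on assumptions that are not in the slides: each layer stores the samples for one sampling rate that coarser layers do not already hold, and an unsampled value is derived by walking forward from the nearest stored sample with a hypothetical recurrence `step`. Dropping a layer saves space and makes lookups slower; adding it back does the opposite.

```python
def step(x):
    """Hypothetical recurrence standing in for the real derivation step."""
    return (x * 31 + 7) % 1_000_003


class LayeredSampledArray:
    def __init__(self, values, rates=(8, 4, 2)):
        self.n = len(values)
        self.top = max(rates)
        self.layers = {}
        for rate in sorted(rates, reverse=True):  # coarsest layer first
            self.layers[rate] = {
                i: values[i]
                for i in range(0, self.n, rate)
                if not self._stored(i)
            }

    def _stored(self, i):
        return any(i in layer for layer in self.layers.values())

    def lookup(self, i):
        """Return (value, steps) where steps is the recomputation cost."""
        j = i
        while not self._stored(j):
            j -= 1  # index 0 always lives in the top layer
        value = next(layer[j] for layer in self.layers.values() if j in layer)
        for _ in range(i - j):
            value = step(value)
        return value, i - j

    def drop_layer(self, rate):
        """Free the samples of one layer: less storage, slower lookups."""
        if rate == self.top:
            raise ValueError("the top layer must stay")
        del self.layers[rate]

    def add_layer(self, rate):
        """Rebuild a layer from the samples that are still stored."""
        self.layers[rate] = {
            i: self.lookup(i)[0]
            for i in range(0, self.n, rate)
            if not self._stored(i)
        }

    def size(self):
        return sum(len(layer) for layer in self.layers.values())
```

In this model, storage and lookup cost move in opposite directions as layers are added or removed, without rebuilding the layers that stay.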
  33. BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores (how was it solved?)
    • Succinct store (https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/agarwal)
  34. BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores (how was it solved?)
    • Extends Succinct so that data can be stored at multiple sampling rates
  35. BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores (how was it solved?)
    • How should the sampling rate be adjusted when sharding?
      → Prior work: Back-pressure style scheduling (http://dl.acm.org/citation.cfm?id=1285032)
  36. BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores (evaluation)
    • In Facebook's clusters, 90% of failures are transient failures
    • Load recovery after the failure of one replica is 3x faster
  37. • [slide backup]