Slide 1

Slide 1 text

NSDI2016 Technical Sessions Distributed Systems Session Overview TANAKA Daisuke@PFN

Slide 2

Slide 2 text

CONTENTS: Papers covered
• Consensus in a Box: Inexpensive Coordination in Hardware (ETH Zürich)
• StreamScope: Continuous Reliable Distributed Processing of Big Data Streams (Microsoft, Microsoft Research)
• Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks (Facebook)
• The Design and Implementation of the Warp Transactional Filesystem (Cornell University)
• BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores (University of California, Berkeley)
https://www.usenix.org/conference/nsdi16/technical-sessions

Slide 3

Slide 3 text

How each paper is covered
1. What problem is being solved?
2. How was it solved? (algorithm)
3. Evaluation results and applications

Slide 4

Slide 4 text

CONTENTS: Papers covered
• Consensus in a Box: Inexpensive Coordination in Hardware
• StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
• The Design and Implementation of the Warp Transactional Filesystem
• BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores

Slide 5

Slide 5 text

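What problem is being solved?
Consensus in a Box: Inexpensive Coordination in Hardware
• Consensus is expensive, both in implementation complexity and in processing time
• A consensus protocol basically has to satisfy four properties: Termination (liveness) / Validity / Integrity / Agreement
• Examples: Paxos, Raft
• Systems whose workloads demand strict guarantees cannot do without consensus, but it often limits performance and scalability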

Slide 6

Slide 6 text

How was it solved?
Consensus in a Box: Inexpensive Coordination in Hardware
• ZooKeeper's atomic broadcast (ZAB) was implemented on an FPGA
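As a rough software illustration of what a ZAB-style atomic broadcast does (this is not the FPGA design from the paper), a leader can number each proposal and commit it only after a majority of nodes has acknowledged it, always in proposal order. The class and method names below are hypothetical.

```python
class Leader:
    """Toy leader for a ZAB-style atomic broadcast (illustrative only)."""

    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.next_zxid = 1      # monotonically increasing proposal id
        self.acks = {}          # zxid -> set of node ids that acknowledged
        self.committed = []     # zxids in commit order

    def propose(self, leader_id=0):
        zxid = self.next_zxid
        self.next_zxid += 1
        self.acks[zxid] = {leader_id}   # the leader counts as one ack
        return zxid

    def on_ack(self, zxid, node_id):
        self.acks[zxid].add(node_id)
        self._commit_ready()

    def _commit_ready(self):
        # Commit strictly in proposal order, and only with a majority of acks.
        majority = self.cluster_size // 2 + 1
        while True:
            nxt = self.committed[-1] + 1 if self.committed else 1
            if len(self.acks.get(nxt, ())) < majority:
                break
            self.committed.append(nxt)


leader = Leader(cluster_size=3)
first = leader.propose()
second = leader.propose()
leader.on_ack(second, node_id=1)   # second has a majority but must wait for first
leader.on_ack(first, node_id=2)    # now both commit, in order
```

Holding back the second proposal until the first commits is what gives every node the same total order of writes.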

Slide 7

Slide 7 text

Zookeeper’s atomic broadcast (ZAB) Consensus in a Box: Inexpensive Coordination in Hardware

Slide 8

Slide 8 text

TCP/IP hardware Consensus in a Box: Inexpensive Coordination in Hardware

Slide 9

Slide 9 text

Evaluation results
Consensus in a Box: Inexpensive Coordination in Hardware

Slide 10

Slide 10 text

CONTENTS: Papers covered
• Consensus in a Box: Inexpensive Coordination in Hardware
• StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
• The Design and Implementation of the Warp Transactional Filesystem
• BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores

Slide 11

Slide 11 text

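What problem is being solved?
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• The goal is stream processing over big data
• This requires handling complexity, scalability, and fault tolerance
• Recovering the dataflow of a stream computation means re-sending input events and restoring state
  → The aim is to prevent both lost events and duplicate events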

Slide 12

Slide 12 text

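How was it solved?
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• The dataflow is represented as a DAG, and the dependencies between nodes and edges are abstracted as rStream / rVertex
• These abstractions are what make recovery possible
• Built as an extension of SCOPE, a parallel MapReduce implementation
http://www.vldb.org/pvldb/1/1454166.pdf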

Slide 13

Slide 13 text

How was it solved?
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• Execution of StreamScope (StreamS)

Slide 14

Slide 14 text

How was it solved?
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• Execution of StreamScope (StreamS)
rStream rVertex

Slide 15

Slide 15 text

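How was it solved?
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• rStream
• An asynchronous communication channel that separates and abstracts away the dependency on vertices (the generic term for nodes that process events)
• Every event carries a seq. A read does not complete until the write of the event with the same seq has succeeded.
• A naive implementation would turn this into a synchronous model, so a GC model is adopted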
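The seq-based contract of rStream can be sketched as a small in-memory channel. This is a minimal sketch under assumptions: the method names `write`, `read`, and `garbage_collect` are hypothetical, reads return `None` instead of blocking, and reading "GC model" as discarding events up to a given seq is an interpretation of the slide.

```python
class RStream:
    """Toy sequence-numbered channel (illustrative, single process)."""

    def __init__(self):
        self.events = {}     # seq -> event
        self.gc_upto = 0     # events with seq <= gc_upto have been discarded

    def write(self, seq, event):
        # Idempotent: a replayed write with an already-known seq is ignored,
        # so re-sending after a failure cannot create duplicate events.
        if seq <= self.gc_upto or seq in self.events:
            return False
        self.events[seq] = event
        return True

    def read(self, seq):
        # Returns None until the event with this seq has been written.
        return self.events.get(seq)

    def garbage_collect(self, seq):
        # The reader declares it no longer needs events up to and including seq.
        for s in [s for s in self.events if s <= seq]:
            del self.events[s]
        self.gc_upto = max(self.gc_upto, seq)


stream = RStream()
pending = stream.read(1)               # nothing written yet
accepted = stream.write(1, "click")
duplicate = stream.write(1, "click")   # replay after a retry
```

Because the writer never waits for the reader, the channel stays asynchronous, and garbage collection keeps the buffered events bounded.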

Slide 16

Slide 16 text

How was it solved?
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• rVertex
• Takes simple snapshots of the computation running at a vertex
• Implements restart from saved state and failure recovery
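Snapshot-based recovery of a vertex can be illustrated as follows. This is a sketch, not StreamScope's implementation: `CountingVertex` and its methods are hypothetical, and the snapshot is simply a deep copy of the state together with the next expected input seq.

```python
import copy


class CountingVertex:
    """Toy vertex that counts words and can be restored from a snapshot."""

    def __init__(self):
        self.state = {}      # word -> count
        self.next_seq = 1    # next input sequence number to process

    def process(self, seq, word):
        # Replayed or out-of-order inputs are ignored, so replay is safe.
        if seq != self.next_seq:
            return
        self.state[word] = self.state.get(word, 0) + 1
        self.next_seq += 1

    def snapshot(self):
        return copy.deepcopy((self.state, self.next_seq))

    @classmethod
    def restore(cls, snap):
        vertex = cls()
        vertex.state, vertex.next_seq = copy.deepcopy(snap)
        return vertex


inputs = [(1, "a"), (2, "b"), (3, "a")]
vertex = CountingVertex()
vertex.process(*inputs[0])
vertex.process(*inputs[1])
snap = vertex.snapshot()
vertex.process(*inputs[2])

# Simulated failure: restart from the snapshot and replay the whole input.
recovered = CountingVertex.restore(snap)
for seq, word in inputs:
    recovered.process(seq, word)
```

Replaying all inputs is harmless here because the restored `next_seq` filters out everything the snapshot already covers.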

Slide 17

Slide 17 text

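Evaluation
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• The checkpoint interval and the timing of persistence can be tuned to minimize the cost of the several failure-recovery models
• strict model / relaxed model …
• It is easy to debug where a failure occurred
• Deployment, and everything other than rStream / rVertex, can reuse existing assets (mainly SCOPE)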

Slide 18

Slide 18 text

CONTENTS: Papers covered
• Consensus in a Box: Inexpensive Coordination in Hardware
• StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
• The Design and Implementation of the Warp Transactional Filesystem
• BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores

Slide 19

Slide 19 text

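What problem is being solved?
Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
• How should a very large volume of HTTP requests be handled, and how should caches be distributed across clusters? The goal is an efficient distribution.
• The assignment should be balanced / adaptive / stable / fast decision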

Slide 20

Slide 20 text

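How was it solved?
Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
• The social graph is explicitly split into "objects", which can be placed statically, and "groups", which are placed dynamically
• Uses Facebook's TAO (a distributed DB for the social graph)
• static assignment
• Collects objects from similar parts of the graph. Data access patterns are represented as a graph and then partitioned. This is slow when the data is large.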

Slide 21

Slide 21 text

How was it solved?
Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
• dynamic assignment
• Balances dynamically in response to changing information. The evaluation uses bipartite graph partitioning.
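The split between a static and a dynamic step can be sketched as two small lookups. This is a sketch under assumptions: a plain CRC32 hash stands in for the graph partitioning used in the static step, and `NUM_GROUPS`, `object_to_group`, and `GroupTable` are hypothetical names.

```python
import zlib

NUM_GROUPS = 8  # assumed, fixed number of groups


def object_to_group(object_id: str) -> int:
    # Static step: computed once per object and stable over time.
    # (CRC32 is a stand-in; the slides describe graph partitioning here.)
    return zlib.crc32(object_id.encode()) % NUM_GROUPS


class GroupTable:
    """Dynamic step: a small, mutable mapping from group to component."""

    def __init__(self, components):
        components = list(components)
        self.assignment = {
            g: components[g % len(components)] for g in range(NUM_GROUPS)
        }

    def route(self, object_id):
        return self.assignment[object_to_group(object_id)]

    def move_group(self, group, component):
        # Rebalancing touches one table entry, not every object in the group.
        self.assignment[group] = component


table = GroupTable(["cluster-0", "cluster-1"])
group = object_to_group("user:42")
before = table.route("user:42")
table.move_group(group, "cluster-2")
after = table.route("user:42")
```

Keeping the expensive decision static and the cheap one dynamic is one way to get assignments that are both stable and quick to adapt.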

Slide 22

Slide 22 text

Evaluation
Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks

Slide 23

Slide 23 text

Evaluation
Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks

Slide 24

Slide 24 text

CONTENTS: Papers covered
• Consensus in a Box: Inexpensive Coordination in Hardware
• StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
• The Design and Implementation of the Warp Transactional Filesystem
• BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores

Slide 25

Slide 25 text

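What problem is being solved?
The Design and Implementation of the Warp Transactional Filesystem
• Existing distributed filesystems
• Insufficient guarantees / restrictive interfaces / do not scale
• An extension of the distributed filesystem protocol, abbreviated WTF
• POSIX API + new zero-copy API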

Slide 26

Slide 26 text

How was it solved?
The Design and Implementation of the Warp Transactional Filesystem
• file slicing API

Slide 27

Slide 27 text

How was it solved?
The Design and Implementation of the Warp Transactional Filesystem
• file slicing API
• The metadata storage holds pointers to slices
• Slices are addressed by file offset. An overwrite triggers compaction (metadata compaction).
• Portions that are no longer used are garbage-collected
• Fragmentation occurs, so locality-aware slice placement is applied continuously
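The idea that a file is only a list of pointers into immutable slices can be sketched as below. This is a minimal in-memory sketch, not WTF's API: the names `SliceStore`, `SlicedFile`, `yank`, and `paste` are used here as assumptions, and overwrites, compaction, and garbage collection are left out.

```python
class SliceStore:
    """Stand-in for the storage servers: slices are immutable byte strings."""

    def __init__(self):
        self.slices = []

    def put(self, data: bytes) -> int:
        self.slices.append(data)
        return len(self.slices) - 1


class SlicedFile:
    """A file is only metadata: an ordered list of (slice_id, start, length)."""

    def __init__(self, store):
        self.store = store
        self.meta = []

    def append(self, data: bytes):
        self.meta.append((self.store.put(data), 0, len(data)))

    def yank(self, offset, length):
        # Return pointers covering [offset, offset + length) without copying data.
        out, pos = [], 0
        for slice_id, start, n in self.meta:
            lo = max(offset, pos)
            hi = min(offset + length, pos + n)
            if lo < hi:
                out.append((slice_id, start + lo - pos, hi - lo))
            pos += n
        return out

    def paste(self, pointers):
        # Appending pointers is a metadata-only operation.
        self.meta.extend(pointers)

    def read(self) -> bytes:
        return b"".join(
            self.store.slices[sid][s:s + n] for sid, s, n in self.meta
        )


store = SliceStore()
source = SlicedFile(store)
source.append(b"hello ")
source.append(b"world")

target = SlicedFile(store)
target.paste(source.yank(3, 5))   # bytes 3..7 of "hello world"
```

After the paste, the store still holds only the two original slices, which is the sense in which the operation is zero-copy.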

Slide 28

Slide 28 text

Applications
The Design and Implementation of the Warp Transactional Filesystem
• Used as the store for a MapReduce sort → execution time dropped from 70 min to 15 min
• Video editing → time-series sorting became faster
• Two further applications are presented

Slide 29

Slide 29 text

CONTENTS: Papers covered
• Consensus in a Box: Inexpensive Coordination in Hardware
• StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
• Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
• The Design and Implementation of the Warp Transactional Filesystem
• BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores

Slide 30

Slide 30 text

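What problem is being solved?
BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores
• In a data store, the two basic operations are random access and search
• NoSQL stores first focus on handling large data, and only then on how to make the store fast
• Compression is used to cope with large data, but it comes with a trade-off against throughput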

Slide 31

Slide 31 text

What problem is being solved?
BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores

Slide 32

Slide 32 text

How was it solved?
BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores
• Proposes a new data structure, the Layered Sampled Array (LSA), which allows the position on the trade-off curve to be changed dynamically

Slide 33

Slide 33 text

How was it solved?
BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores
• Succinct store
https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/agarwal

Slide 34

Slide 34 text

How was it solved?
BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores
• Extends Succinct so that data can be stored at multiple sampling rates
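Storing samples at multiple sampling rates can be illustrated with a toy layered array. This is a sketch under assumptions, not BlowFish's data structure: `step` stands in for the costly computation needed for unsampled entries, and the class and method names are hypothetical. Each layer holds only the indices not already covered by a coarser layer, so a layer can be dropped to save space at the price of slower lookups.

```python
def step(x):
    # Stand-in for the costly per-element computation on unsampled entries.
    return (31 * x + 7) % 1_000_003


class LayeredSampledArray:
    def __init__(self, seed, n, rates=(8, 4, 2)):
        values, x = [], seed
        for _ in range(n):
            values.append(x)
            x = step(x)
        self.layers, seen = {}, set()
        for rate in rates:                      # coarsest layer first
            idxs = [i for i in range(0, n, rate) if i not in seen]
            self.layers[rate] = {i: values[i] for i in idxs}
            seen.update(idxs)

    def drop_layer(self, rate):
        # Frees the samples of one layer. The coarsest layer must be kept,
        # because it is the only one that contains index 0.
        del self.layers[rate]

    def stored(self):
        return sum(len(layer) for layer in self.layers.values())

    def lookup(self, i):
        # Walk back to the nearest stored sample, then recompute forward.
        j = i
        while not any(j in layer for layer in self.layers.values()):
            j -= 1
        x = next(layer[j] for layer in self.layers.values() if j in layer)
        for _ in range(i - j):
            x = step(x)
        return x, i - j                         # value and recomputation cost


lsa = LayeredSampledArray(seed=5, n=32)
value_dense, cost_dense = lsa.lookup(7)
size_dense = lsa.stored()

lsa.drop_layer(2)                               # trade speed for space
value_sparse, cost_sparse = lsa.lookup(7)
size_sparse = lsa.stored()
```

Dropping the finest layer halves the stored samples in this example while the looked-up value stays the same and only the recomputation cost grows.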

Slide 35

Slide 35 text

How was it solved?
BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores
• How should the sampling rate be adjusted when sharding?
  → Prior work: Back-pressure style scheduling
http://dl.acm.org/citation.cfm?id=1285032

Slide 36

Slide 36 text

Evaluation
BlowFish: Dynamic Storage-Performance Tradeoff in Data Stores
• In Facebook's clusters, 90% of failures are transient failures
• Load recovery after the failure of one replica is 3x faster

Slide 37

Slide 37 text

• [slide backup]