Slide 1

Slide 1 text

Trends in Database Architecture and How to Choose Among Them 2015/4/21 QConTokyo Basho Japan, Kota Uenishi

Slide 2

Slide 2 text

About me
• @kuenishi
• Github, Twitter, etc
• 7 years in distributed systems
• I come from Basho Japan
• Development of Riak CS
• Today's talk gets a little spiritual

Slide 3

Slide 3 text

Sponsored by

Slide 4

Slide 4 text

ACID

Slide 5

Slide 5 text

Atomicity Consistency Isolation Durability

Slide 6

Slide 6 text

Atomicity Consistency Isolation Durability

Slide 7

Slide 7 text

Durability “The ACID property which guarantees that transactions that have committed will survive permanently.” http://en.wikipedia.org/wiki/Durability_(database_systems)

Slide 8

Slide 8 text

Permanently?

Slide 9

Slide 9 text

Persistence is hard (harder than you thought)

Slide 10

Slide 10 text

https://www.flickr.com/photos/bathyporeia/9086009348/

Slide 11

Slide 11 text

All things are impermanent (諸行無常)
सब्बे संखारा अनिच्चा (sabbe saṅkhārā aniccā)
Mahāparinirvāṇa Sūtra, the verse on impermanence

Slide 12

Slide 12 text

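"The law of arising and ceasing is said to be suffering, but it is not suffering because things arise and cease. Suffering arises because we regard as permanent something whose nature is to arise and cease." Wikipedia, "諸行無常"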

Slide 13

Slide 13 text

Storage media break

Slide 14

Slide 14 text

Storage media break, so just replicate

Slide 15

Slide 15 text

Ideal https://www.flickr.com/photos/rebeccalongworth/3445143739/

Slide 16

Slide 16 text

Reality https://www.flickr.com/photos/leeziet/3021612079/ https://www.flickr.com/photos/pranavbhasin/6109327813/

Slide 17

Slide 17 text

CAP theorem

Slide 18

Slide 18 text

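Replication is hard
• CAP theorem
  • Keep sending a sequence of changes to an object that several nodes hold and that starts out identical on all of them
  • When messages fail to arrive, the nodes cannot all hold the same sequence of changes
• The root of the difficulty is that we split the unit of failure
  • We have to guarantee that separate things are identical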

Slide 19

Slide 19 text

Legend

Slide 20

Slide 20 text

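Solution 0: Master-Slave
• Guarantees that there is exactly one sequence of update commands
• Slaves (replicas) only receive the decided sequence of update commands and apply it locally
• Nothing can be done while the master is absent
[Diagram: two nodes, both ending in state c; each applies w1: x=a, w2: x=b, r1: read x, w3: x=c]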
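The master-slave bullets above can be sketched as a single ordered log that replicas replay. This is a minimal in-memory illustration, not any particular database's protocol; the `Master` and `Replica` class names and the `catch_up` method are assumptions made for this example.

```python
class Master:
    """The only place where the order of writes is decided (hypothetical)."""

    def __init__(self):
        self.log = []  # ordered list of (key, value) writes

    def write(self, key, value):
        self.log.append((key, value))


class Replica:
    """Replays the master's log in order; never decides order itself."""

    def __init__(self):
        self.state = {}
        self.applied = 0  # index of the next log entry to apply

    def catch_up(self, log):
        for key, value in log[self.applied:]:
            self.state[key] = value
        self.applied = len(log)


master = Master()
replicas = [Replica(), Replica()]
for value in ["a", "b", "c"]:  # w1: x=a, w2: x=b, w3: x=c
    master.write("x", value)
for replica in replicas:
    replica.catch_up(master.log)

final_states = [replica.state for replica in replicas]
```

Because every replica replays the same log, both end with x = c. The sketch also shows the weakness named on the slide: without a `Master` object there is nowhere to send a write.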

Slide 21

Slide 21 text

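Solution 1: Consensus
• The goal is for distinct entities to hold the same state
• Replication is the so-called distributed agreement problem
• A communication scheme that solves it is called a consensus protocol
[Diagram: three nodes, all ending in state c; operations w1: x=a, w2: x=b, r1: read x, w3: x=c]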

Slide 22

Slide 22 text

Fools learn from experience; the wise learn from history

Slide 23

Slide 23 text

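1990s: RDBMS adoption period
• Replication happens at or below the kernel, without the OS being aware of it
• Synchronous at the disk level; if one side fails, the whole system fails
• Hard disks and network bandwidth are scarce
• Centralized management and operation with RAID and SAN
• The foundational technologies of transactions and query processing are established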

Slide 24

Slide 24 text

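2000s: the Web era (1/2)
• Replication at the application and middleware layer (TCP/IP) becomes common
• Synchronous at the network level; operation continues even if one side fails
• Some systems that can scale out reads appear
• Shipping deltas from Master to Slave (Replica) is the mainstream approach
• MySQL binlog, GFS (BigTable), HDFS (HBase)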

Slide 25

Slide 25 text

Master-Slave is hard
• Time lag when switching masters
• Split-brain tolerance
[Diagram: two nodes ending in different states, c and b; operations w1: x=a, r1: read x, w3: x=c, w2: x=b]

Slide 26

Slide 26 text

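2000s: the Web era (2/2)
• Web systems grow more complex and much larger
• Consensus-style replication becomes practical
• Things still work out somehow when the network partitions
• Dynamo, Chubby, ZooKeeper, SQL Server (2008?~)
• Paxos (1989), Quorum (1979), etc.: 2/3 Ack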

Slide 27

Slide 27 text

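Quorum: consensus is hard
• A naive protocol design that permits overwrites easily corrupts data
• It is not practical in the real world, where any node can fail, go silent, and come back at any time
[Diagram: three nodes with uncertain states a?, c?, b?; operations w1: x=a, w2: x=b, r1: read x, w3: x=c]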
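The data-corruption risk of a naive overwrite-based quorum can be sketched as follows. This is a toy model under stated assumptions: three in-memory replicas named `n1` to `n3`, a 2-of-3 acknowledgement rule, and no version numbers at all. The function names are made up for illustration.

```python
# Three replicas of one value; no versions, no ordering information.
replicas = {"n1": None, "n2": None, "n3": None}


def naive_write(value, reachable):
    """Overwrite every reachable replica; succeed if 2 of 3 acknowledged."""
    for node in reachable:
        replicas[node] = value
    return len(reachable) >= 2


def naive_read(reachable):
    """Return the set of distinct values seen on the reachable replicas."""
    return {replicas[node] for node in reachable}


# Two writers each reach a different majority.
ok_a = naive_write("a", ["n1", "n2"])
ok_b = naive_write("b", ["n2", "n3"])

# A reader that also reaches a majority sees both values.
read_result = naive_read(["n1", "n3"])
```

Both writes were acknowledged by a majority, yet the reader gets back two candidate values and nothing that says which one is newer. Sequence numbers, as used by Paxos, or vector clocks are the usual ways to add that missing information.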

Slide 28

Slide 28 text

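Paxos: consensus is hard
• A two-phase agreement protocol
  • The proposer (who proposes a value) is decided by majority vote
  • The proposed value is decided by majority vote
• Proposals are given sequence numbers to track which one is newer
• Even under the constraint that any node can fail, go silent, and come back at any time, agreement is reached given unbounded time and as long as a majority has not failed
• Hard to implement, but it can be done with enough effort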
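The two phases and the sequence numbers described above can be sketched as single-decree Paxos running in one process. This is a simplified illustration under strong assumptions: calls are synchronous, nothing is persisted, and there is no message loss or retry logic. The `Acceptor` class and `propose` function are names chosen for this example.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    granted = [accepted for ok, accepted in replies if ok]
    if len(granted) < majority:
        return None
    # A proposer must adopt the highest-numbered value already accepted.
    prior = [accepted for accepted in granted if accepted is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "a")   # chooses "a"
second = propose(acceptors, 2, "b")  # must re-propose the chosen "a"
stale = propose(acceptors, 1, "c")   # rejected: number 1 is too old
```

The second proposer wanted "b" but ends up confirming "a", which is how a value stays fixed once it has been chosen. A real implementation also needs durable acceptor state, unique proposal numbers per proposer, and retries.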

Slide 29

Slide 29 text

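Classifying consensus-style replication
• CP type
  • Guarantees that the replicas are identical
  • Uses algorithms such as Paxos and Raft
  • During a network partition, only nodes on the majority side of the network can be used
• AP type
  • Tolerates replicas that are not fully identical
  • Guarantees causal consistency with Vector Clocks or CRDTs (or just plain timestamps)
  • All nodes remain usable even during a network partition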
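How a vector clock tells causally ordered updates apart from concurrent ones can be sketched like this. It is a minimal illustration that represents a clock as a dict from node name to counter; the function names `vc_compare` and `vc_merge` are assumptions for this example and do not come from any library.

```python
def vc_compare(a, b):
    """Classify clock a relative to clock b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # a happened after b
    return "concurrent"      # neither saw the other: a real conflict


def vc_merge(a, b):
    """Element-wise maximum: the clock of a state that has seen both."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


ordered = vc_compare({"a": 1}, {"a": 2})
conflict = vc_compare({"a": 2, "b": 0}, {"a": 1, "b": 1})
merged = vc_merge({"a": 2, "b": 0}, {"a": 1, "b": 1})
```

An AP-type store can silently drop the older of two ordered versions, but for the "concurrent" case it has to keep both or resolve them with an application rule or a CRDT merge.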

Slide 30

Slide 30 text

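Classifying databases by replication
• Master-Slave type
  • Simple to implement, fast
• Hybrid of consensus and Master-Slave
  • Uses a consensus protocol for master election
  • Replication itself is Master-Slave
• Consensus type
  • Uses a consensus protocol for replication as well
  • No downtime caused by master failure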

Slide 31

Slide 31 text

Classifying databases by replication
• Master-Slave type
  • MySQL, PostgreSQL
• Hybrid of consensus and Master-Slave
  • MongoDB, HBase, Redis
• Consensus type
  • Riak, Cassandra (both have AP and CP modes)
  • CouchBase (CP type)

Slide 32

Slide 32 text

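2010s: the cloud era
• A category called NewSQL appears
  • FoundationDB, NuoDB
  • Also cases where existing NoSQL systems implement SQL (or something SQL-like)
  • Some NewSQL systems satisfy (?) ACID
• Replication across multiple data centers becomes mandatory
  • Network partitions and latency become more important problems
• MPP becomes practical for OLAP workloads (distributed query processing)
  • BigQuery, Impala, PrestoDB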

Slide 33

Slide 33 text

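Review: the core technologies of a database
• Query processing optimization
  • Parse the SQL, then build and execute the best query plan based on statistics
  • Data placement and indexing strategies that support this
• Transaction processing optimization
  • Update data as fast as possible while eliminating anomalies and guaranteeing consistency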

Slide 34

Slide 34 text

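Core database technologies in the cloud era
• Query processing optimization in a distributed environment
  • Parallel processing with MPP; speculative execution on failure
  • Localize data placement with nested columnar formats
• Transaction processing optimization in a distributed environment
  • Eliminate anomalies even though the system is distributed? Guarantee consistency?
  • Consistency across DCs, not just across nodes, is also a challenge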

Slide 35

Slide 35 text

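ACID in a distributed DB
• There is really only one realistic design
  • Master election by consensus plus M/S replication, or CP-type replication
  • A mechanism that guarantees timestamp synchronization
  • Optimistic concurrency control
• MegaStore (2011), Spanner (2012)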

Slide 36

Slide 36 text

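ACID in a distributed DB
• Each of these choices turns directly into a trade-off
  • Availability during a network partition
  • A timestamp management node? → SPOF
  • Apply TSO (timestamp ordering) as-is to an OLTP workload and you get a storm of aborts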
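Why timestamp ordering tends to abort under contention can be sketched with the textbook basic timestamp-ordering rules. This is a generic illustration, not the scheme of any system named in the slides; the `Item` class, the `to_read` and `to_write` functions, and the `Aborted` exception are names invented for this example.

```python
class Aborted(Exception):
    """Raised when an operation arrives too late for its timestamp."""


class Item:
    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0   # largest timestamp that has read this item
        self.write_ts = 0  # timestamp of the last write to this item


def to_read(item, ts):
    # A reader older than the last writer would see a value from its future.
    if ts < item.write_ts:
        raise Aborted(f"read at ts={ts} is older than write_ts={item.write_ts}")
    item.read_ts = max(item.read_ts, ts)
    return item.value


def to_write(item, ts, value):
    # A writer older than a later reader or writer would rewrite history.
    if ts < item.read_ts or ts < item.write_ts:
        raise Aborted(f"write at ts={ts} arrived too late")
    item.value = value
    item.write_ts = ts


x = Item("a")
to_write(x, 2, "b")      # transaction 2 writes first
try:
    to_read(x, 1)        # transaction 1 arrives late and must abort
    aborted = False
except Aborted:
    aborted = True
```

Every transaction that touches a hot item out of timestamp order is aborted and has to retry, which is the behavior the slide calls a storm of aborts on OLTP workloads.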

Slide 37

Slide 37 text

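Scaling writes
• Paxos and similar protocols require a fixed set of consensus members
• Optimistic replication (2005)
  • Mechanisms that do not satisfy strong consistency but guarantee a different kind of consistency under particular conditions (eventual consistency, etc.)
  • DNS, etc.
• Data management mechanisms that stay consistent even when the order of operators is swapped
  • Vector Clocks, CRDT, boom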

Slide 38

Slide 38 text

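CRDT
• One of the data types and replication techniques that make optimistic replication easier
• Conflict-Free Replicated Data Types
• A combination of data type / data structure and operators such that w1(w2(x)) == w2(w1(x)) holds
• Updates and reads remain possible during a network partition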
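The condition w1(w2(x)) == w2(w1(x)) can be checked directly for small cases. The sketch below is only an illustration; the helper names `add`, `assign`, and `commutes` are assumptions introduced here.

```python
def add(n):
    """An increment operator: order of application does not matter."""
    return lambda x: x + n


def assign(v):
    """A plain overwrite: the last one applied wins."""
    return lambda x: v


def commutes(w1, w2, x):
    """True if applying w1 and w2 in either order gives the same result."""
    return w1(w2(x)) == w2(w1(x))


increments_commute = commutes(add(1), add(3), 0)
overwrites_commute = commutes(assign("a"), assign("b"), None)
```

Increments commute, so replicas may apply them in any order and still converge. Plain overwrites do not, which is why x=a and x=b delivered in different orders leave replicas disagreeing.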

Slide 39

Slide 39 text

CRDT example: G-Counter
• merge
  • Data held by a: {a: 1, b: 1, c: 2}
  • Data held by b: {a: 0, b: 2, c: 0}
  • x => {a: 1, b: 2, c: 2} => 5
• update
  • When a receives {increment, 3}, its data becomes {a: 4, b: 1, c: 2}
• Can handle the conditional operation c < x
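The G-Counter merge and update above can be sketched as follows, reusing the numbers from the slide. Counters are plain dicts from node name to count; the function names `g_merge`, `g_value`, and `g_increment` are assumptions for this example.

```python
def g_merge(x, y):
    """Per-node maximum: safe to apply in any order, any number of times."""
    return {n: max(x.get(n, 0), y.get(n, 0)) for n in x.keys() | y.keys()}


def g_value(x):
    """The counter's value is the sum of every node's contribution."""
    return sum(x.values())


def g_increment(x, node, amount):
    """A node may only grow its own slot; decrements are not allowed."""
    if amount < 0:
        raise ValueError("a G-Counter can only grow")
    out = dict(x)
    out[node] = out.get(node, 0) + amount
    return out


held_by_a = {"a": 1, "b": 1, "c": 2}
held_by_b = {"a": 0, "b": 2, "c": 0}

merged = g_merge(held_by_a, held_by_b)     # {a: 1, b: 2, c: 2}
total = g_value(merged)                    # 5
updated = g_increment(held_by_a, "a", 3)   # {a: 4, b: 1, c: 2}
```

Because the value never decreases, a condition of the form c < x stays true once it becomes true on any replica, which is why the slide says a G-Counter can handle it.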

Slide 40

Slide 40 text

CRDT example: PN-Counter
• merge
  • {a: {1,-1}, b: {1,0}, c: {2,0}}
  • {a: {0,0}, b: {2,0}, c: {0,-2}}
  • => {a: {1,-1}, b: {2,0}, c: {2,-2}} => 2
• update
  • When a accepts {increment, 3}
  • {a: {4,-1}, b: {1,0}, c: {2,0}}
• Cannot handle the conditional operation c < x
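The PN-Counter example can be sketched the same way. Following the slide's notation, each node holds a pair (increments, decrements) with the decrements stored as a non-positive number, so merging takes the maximum of the first component and the minimum of the second. The function names are assumptions for this example.

```python
def pn_merge(x, y):
    """Per node: keep the larger increment total and the more negative
    decrement total."""
    out = {}
    for node in x.keys() | y.keys():
        px, nx = x.get(node, (0, 0))
        py, ny = y.get(node, (0, 0))
        out[node] = (max(px, py), min(nx, ny))
    return out


def pn_value(x):
    """Sum of all increments plus all (non-positive) decrements."""
    return sum(p + n for p, n in x.values())


def pn_update(x, node, delta):
    """Route a positive delta to the increment slot, a negative one to the
    decrement slot, of the node's own entry."""
    p, n = x.get(node, (0, 0))
    out = dict(x)
    out[node] = (p + delta, n) if delta >= 0 else (p, n + delta)
    return out


first = {"a": (1, -1), "b": (1, 0), "c": (2, 0)}
second = {"a": (0, 0), "b": (2, 0), "c": (0, -2)}

merged = pn_merge(first, second)    # {a: (1,-1), b: (2,0), c: (2,-2)}
total = pn_value(merged)            # 2
updated = pn_update(first, "a", 3)  # {a: (4,-1), b: (1,0), c: (2,0)}
```

Since the value can go down as well as up, a replica that observes c < x cannot be sure the condition still holds once remote decrements arrive, which matches the slide's note that this check cannot be handled.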

Slide 41

Slide 41 text

CRDT example: OR-Sets
• merge
  • a: {“foo”: false, “bar”: true, “baz”: true}
  • + b: {“bar”: true, “baz”: false}
  • => {“foo”: false, “bar”: true, “baz”: true}
  • => [“foo”]
• update
  • add: a: {} => +“foo” => a: {“foo”: false}
  • remove: a: {“foo”: false} => a: {“foo”: true}
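The set example can be sketched using the slide's simplified representation, in which each element maps to a "removed" flag and merging ORs the flags together. Note the assumption: this mirrors the slide, not a full OR-Set, which additionally tags every add with a unique identifier so that a removed element can be added again. The function names are made up for this example.

```python
def set_merge(x, y):
    """An element is removed in the result if either side has removed it."""
    return {e: x.get(e, False) or y.get(e, False) for e in x.keys() | y.keys()}


def set_value(x):
    """The visible members are the elements not flagged as removed."""
    return sorted(e for e, removed in x.items() if not removed)


def set_add(x, element):
    out = dict(x)
    out.setdefault(element, False)  # keeps an existing removed flag
    return out


def set_remove(x, element):
    out = dict(x)
    out[element] = True
    return out


held_by_a = {"foo": False, "bar": True, "baz": True}
held_by_b = {"bar": True, "baz": False}

merged = set_merge(held_by_a, held_by_b)
members = set_value(merged)               # ["foo"]
after_add = set_add({}, "foo")            # {"foo": False}
after_remove = set_remove(after_add, "foo")  # {"foo": True}
```

In this simplified form a removal is permanent: `set_add` on an already removed element leaves the flag set, which is the limitation that per-add tags in a full OR-Set are meant to lift.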

Slide 42

Slide 42 text

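CRDT
• Updates and reads remain possible during a network partition
• Data for which writes can be "processed concurrently"
• There are certain constraints on how the value is computed
• Efficient CRDT implementations are still at the research stage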

Slide 43

Slide 43 text

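Forecast: late 2010s
• There is still room for ingenuity on the implementation side, and distributed DBs that try to satisfy ACID will keep appearing
  • Concurrency control designed with distribution in mind
  • CP-type replication and transactions that span data centers
  • Operational know-how becomes widespread
• Adoption of NoSQL databases will continue for a while (and some will be weeded out)
• Hybrid databases that combine strong consistency with optimistic replication will appear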

Slide 44

Slide 44 text

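Hybrid databases
• OLTP databases will start to adopt optimistic replication for the sake of update availability and performance
• Not every part of a business process needs strong consistency
• Choosing between strong consistency and optimistic replication on the application side can be expected to yield better performance
• SQL and JDBC will probably remain as the interface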

Slide 45

Slide 45 text

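Forecast: 2020s
• I don't know
• Stable implementations of hybrid databases appear and become widespread
• SSDs and NVM become widespread, and IO stops being the bottleneck
  • DBs may no longer be shared-nothing, scale-out designs
  • Memory bandwidth or CPU becomes the bottleneck
• If new hardware appears, all bets are off again
  • Some technology X from the 2000s makes a comeback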

Slide 46

Slide 46 text

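Summary
• Since around 2000, distributed systems technology has gradually become an essential building block alongside the two major technical elements of databases
• A brief explanation of the database replication techniques that had appeared by 2015
• In the second half of 2015, distributed databases will appear that satisfy ACID while letting you choose between CP and AP replication through the same interface
• A rough daydream^H^H^H^H^H^H^H^Hforecast for 2020
※Disclaimer: The content of this material is Uenishi's personal forecast and does not guarantee any particular future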

Slide 47

Slide 47 text

References
• Seth Gilbert and Nancy Lynch. 2002. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services.
• James C. Corbett et al. 2012. Spanner: Google’s Globally Distributed Database.
• Yasushi Saito and Marc Shapiro. 2005. Optimistic Replication.
• Peter Bailis and Kyle Kingsbury. 2014. The Network is Reliable.
• Peter Bailis et al. 2014. Coordination Avoidance in Database Systems.
• Mihai Letia et al. 2010. Consistency without Concurrency Control in Large, Dynamic Systems.