筑波大学の2013年度の情報システム特別講義Dのスライド
σʔλϕʔεͷͳ͠Basho Japan KKKota UENISHI2014/1/31
View Slide
ͲͪΒ༷Ͱ͔͢ʁ• @kuenishi • NTTݚڀॴ ˠ Basho • Erlang/OTPΛॻ͍ͯࣄΛ͍ͯ͠·͢ • ࢄσʔλϕʔεɾࢄγεςϜͷ༻։ൃྺ 6 • ϛυϧΣΞతͳͷ͕ൺֱతಘҙͰ͢
• Riakͱ͍͏OSSͷࢄσʔλϕʔεΛ࡞ͬͯച͍ͬͯ·͢ • Riak CSͱ͍͏OSSͷΫϥυετϨʔδΛ࡞ͬͯଧ͍ͬͯ·͢ • ʮࢄਥʯΧϯϑΝϨϯεRicon.io • Basho.com 3
͖͔͚ͬ4 தུ ʮ༰ͱͯ͠riakͱ ɹNoSQLΛؚΊͯ ɹ͚ͨΒʯ
ࠓ͓΅͑ͯΒ͏͜ͱ• NoSQLόζϫʔυͰ͋Γɺఆٛͳ͍ • ͍Ζ͍ΖͳσʔλετΞͱͦͷྨɺ͍͚ͷίπ • ࢄγεςϜσʔλϕʔεपลͷɺ၆ᛌਤ • RiakૉΒ͍͠σʔλϕʔεͰ͋Δ
ͦͦσʔλϕʔεͱԿʁ• ΞϓϦέʔγϣϯͷσʔλΛίϯϐϡʔλʹอଘ͢ΔͨΊͷιϑτΣΞʢϥΠϒϥϦ or αʔόʔʣ 6 01010110100001001011110010101010101001010100101010100101010010101010010111111110000001011111001010100000001001010100100101001010010010100
อଘ͢Δͱ͖ͷཁٻ• σʔλΛਖ਼͘͠ॻ͖ࠐΉ͜ͱɺσʔλͷߋ৽͕த్ͳঢ়ଶͰࣦഊ͠ͳ͍͜ͱ • ॻ͖ࠐΜͩσʔλΛਖ਼͘͠ಡΈग़͢͜ͱ • ॻ͖ࠐΜͩσʔλΛผܗࣜʹมͯ͠ಡΈͩ͢͜ͱ • ϋʔυΣΞੑೳͷݶք·Ͱߴʹอଘ͢Δ͜ͱ • ϋʔυΣΞੑೳͷݶք·ͰߴʹಡΈग़͢͜ͱ • σʔλ͕ফ͑Δ͕݅ਖ਼֬ʹ໌͍ͯ͠Δ͜ͱ 7
ྺ্࢙ͷσʔλϕʔε• IDS (GE, 1963) • ֊ܕDB • IMS (IBM, 1966 -), • ωοτϫʔΫܕDB • CODASYL (1969, COBOL), • ࠓಈ͍͍ͯΔ͋Δ 8
1977 System R (IBM)• ؔͷཧʹج͍ͮͨσʔλૢ࡞ݴޠSQLΛ࣮ͨ͠ੈքॳͷσʔλϕʔε • Ұ؏ͨ͠σʔλʹର͢Δෳͷૢ࡞Λશͯޭɾશࣦͯഊʹ·ͱΊɺෳͷߋ৽ཁٻΛฒߦ੍ޚ͠ɺσʔλΛӬଓԽ͢ΔτϥϯβΫγϣϯॲཧΛ࣮ͨ͠ੈքॳͷσʔλϕʔε • ۀॲཧͷϞσϧʹඇৗʹΑ͘Ϛονͨͨ͠ΊɺരൃతʹීٴɹˠISOͰඪ४Խ • 1983 Oracle v3 • 1987 Sybase SQL Server (ݱMicrosoft SQLServer) 9
͡Ί͔ΒSQLͩͬͨΘ͚ͰϦϨʔγϣφϧͩͬͨΘ͚Ͱͳ͍ͭ·Γ… 10
RDBMSͷ࣌• ACIDಛੑʹج͍ͮͨτϥϯβΫγϣϯཧ• SQLͱ͍͏౷Ұ͞ΕͨΠϯλʔϑΣʔε • ੈքதͷΤϯλʔϓϥΠζͷσʔλཧΦϑΟεͷOAԽɾγεςϜͷΦʔϓϯԽʹ͍RDBMS͕ओྲྀʹ • εέʔϧΞοϓͷ࣌Ͱ͋Δ • 1995ͷHDDͷGB୯Ձ: ~7.5ສԁ/ GB 11
• 2003ʙɹITόϒϧ่յޙ • Ոఉ༻ίϯϐϡʔλɺϒϩʔυόϯυͷීٴ • Web 2.0ɺϒϩάɺWikipediaɺEϝʔϧɺWebݕࡧɺEC • 1998 Googleۀ • 1995 Amazonۀ • “Web Scale” ͷσʔλྔ 12 Webͷ࣌
• ٻΊΒΕΔ͜ͱ͕มԽ: ߏ؆୯ͰɺྔτϥϑΟοΫɺϨεϙϯεɺՄ༻ੑͳͲͷཁٻ͕ଟ༷ԽɾߴԽ • σʔλͷ୯Ձ͕͍҆etc • RDBMSͰղܾͰ͖ͳ͍Φʔμʔ·Ͱσʔλྔ͕૿Ճ͢ΔʹैͬͯɺͦΕΛղܾ͢Δٕज़͕ొ • ؔϞσϧɺSQLͰѻ͍ʹ͍͘σʔλϞσϧʢ୯७͕ͩΑ͘มԽ͢Δσʔλߏʣ • 2003ͷHDDͷGB୯Ձ: 100ԁ / GBɹ(120GB) 13 “Web Scale”
Web ScaleͷΞϓϩʔν• Ոఉ༻ίϯϐϡʔλͱಉ͡HW • LAMP + Memcached(2003) • Redis (2009) • Google: GFS(2003), BigTable(2006) • Hadoop (2006), HBase (2006) • Amazon: Dynamo(2006) • Cassandra (2008), Riak (2009) 14
ຊʹղ͖͍ͨ15
εέʔϦϯάͷ16
εέʔϦϯάͷํ๏• ෳͷϊʔυʹσʔλΛࢄอ࣋ͯ͠ྔΛՔ͙ • ʮͲͷσʔλ͕Ͳͷαʔόʔʹೖ͍ͬͯΔ͔ʁʯ 17
18
ϋογϡతϧʔςΟϯά• Riak, Cassie,DynamoͳͲ • Կ͕Ͳ͜ʹ͋Δ͔Θ͔Γʹ͍͘ • εέʔϧΞτɺෛՙࢄ͕؆୯ 19 node1node2node3node4
“ϋογϡత”• ΩʔͱϊʔυIDΛϋογϡԽͯ͠ಉ໊͡લۭؒʹஔ͘ • ϋογϡͷ୯ҐͰσʔλΛׂͯ͠ஔ • ϊʔυՃɾআͷͱ͖ͷσʔλ࠶ஔΛ࠷খݶʹͰ͖Δ 20 {foo, SomeData}!Hash(foo)%N!Hash(node1)%N!
21 Locality vs LoadBalancing - Use Cases• ઌಡΈΛޮ͔ͤͯόϧΫͰγʔέϯγϟϧΞΫηε • - HBase, BigTable, etc • ϋογϡͰࢄͤͯ͞ෛՙࢄͯ͠ϥϯμϜΞΫηε • - Riak, Cassie, etc
߹ੑͷ22
Q. ߹ੑͬͯͳΜͩͱ͓͍·͔͢ʁ23
߹ੑʹ2ͭͷจ຺͕͋Δ• ڞ௨͢Δͷʮ୭͕ݟͯಉ͡Α͏ʹݟ͑Δ͜ͱʯ • ෳछྨͷσʔλؒͷInvariant͕कΒΕ͍ͯΔ͜ͱ • “ACID” తͳAnomaly͕ൃੜ͠ͳ͍͜ͱ • Phantom Read, Write Skew, etc etc… • ෳͷσʔλͷίϐʔ͕ಉ͡Ͱ͋Δ͜ͱ • ෳͷίϐʔΛ҆શʹߋ৽Ͱ͖Δ͜ͱ • ผʑͷਓͰಉ͡σʔλΛݟΕ͍ͯΔ͜ͱ
25 Consistent Replicationis Difficult• ϨϓϦέʔγϣϯॱ൪͕ೖΕସΘΔ • CPUͷΞτΦϒΦʔμʔ࣮ߦͱಉ͡w1w1w1w2w2w2Actor 0Actor 1Actor 2w2w2w1
ࢄ߹ҙ• ʮෳ͕ͯ͢ಉ͡Ͱ͋Δʯ • ʹʮશһ͕ͻͱͭͷʹ߹ҙͰ͖͍ͯΔঢ়ଶʯ • ΞϠγ͍ಈ࡞Λ͢ΔՄೳੑ͕͋Δͷ • ωοτϫʔΫ • ߹ҙ͢Δ૬ख • ੍ݶ࣌ؒɿແݶ 26
·͡Ίʹॻ͘ͱ…• ͍ͭͰյΕΔՄೳੑͷ͋Δෳͷϊʔυ͕ɺ • յΕͯ·ͨ෮ؼ͢ΔՄೳੑ͋Δ • ฦࣄ͕͍͚ͩͷ߹ • ৴པੑͷ͍ϊʔυؒ௨৴Λͬͯ • ͷ͘͢͝Ԇ͍ͯ͠Δ͚Ͳ࣮ಧ͘ • ͻͱͭͷΛ߹ҙ͢ΔʢͷʹͲͷΑ͏ͳ݅ͱϓϩτίϧ͕͋ΕΑ͍͔ʁʣ 27
ͳ͍ͥ͠ͷ͔• ࢮ׆ࢹ͕͍͠ • ϦʔμʔΛఘΊΔɺϚελʔΛఘΊΔɺetc • ࢳిͰࢮ׆ࢹ͢Δ࿅श • ނোϞσϧͷͳ͕͍͠Ζ͍Ζ͋Δ 28
ࢮ׆ࢹ͍͠• ʮࢮΜͩ͜ͱʯΛਖ਼͘͠ݟ͚ͭΔͷ͍͠ • ࣮ݧʢ͕࣌ؒ͋Εʣ 29
ʮࢮΜͰ͍Δʯ is Կ30 © ʮేͷݓʯݪɺଚ
ނোతࠔΔ͜ͱͷྨCrash Failure ɹ (fail-stop, fail-safe) ނোͨ͠Βࢭ·Δɺࢭ·ͬͨ͜ͱ͕͙͢ʹ͔Δ Crash Failure (fail-silent) ͬͯࢮ͵ Crash Recovery ނোͨ͠ϑϦͯ͠ΒΜΓͯ͠ؼͬͯ͘Δ Omission Failure(receive / send) ϝοηʔδܽམ Timing Failure ·ͱͳͰಈ͔ͳ͘ͳΔ Response Failure Ϩεϙϯε͕͓͔͍͠ʢσʔλ͕յΕ͍ͯΔetcʣ Arbitrary Failure Ϗβϯνϯނোɺѱҙͷ͋ΔϠπ͕͍Δ 31
ࢄ߹ҙͷछྨ• ʮΞτϛοΫϒϩʔυΩϟετϓϩτίϧʯ • ͓͓·͔ʹ͍ͬͯࢄ߹ҙ͢ΔͨΊͷϓϩτίϧΛ૯শͯ͠ • ΞτϛοΫ͡Όͳ͍ϒϩʔυΩϟετࢁ͋ͬͯͦΕͳΓʹಈ͍͍ͯΔ • දతͳͷPaxos, Rafter, ZAB 32
33 Consensus BasedReplication• ϨϓϦέʔγϣϯͷϦʔμʔΛଟܾͰબग़ • or ϨϓϦέʔγϣϯຖʹଟܾw1w1w1 w2w2w2Actor 0Actor 1Actor 2w2w2w1
34 What is PAXOS? – Example Two phase election – phase 1 Larger n is prior21:328ß: ¼ÛËÙêîĝ21:36O2Ě21:35[_ÕðJK21:36O2ÙÐÄorz21:38O2ÙÐÞorzn=3n=1n=28
35 What is PAXOS? – Example Two phase election – phase 2 confirmation21:42Ãğ21:41O2ÙÀÀëÞĝĚ21:43a~Å…21:44Ãğ 21:43GJproposer21:43ÀêO2ÚyÁËÚÙ|N`çÜßÙn=49
Մ༻ੑͷ• σʔλΛ͍͟ॻ͖ࠐ͏ͱࢥͬͨॠؒʹσʔλ͕σʔλϕʔε͕མ͍ͪͯͨΓݻ·͍ͬͯͯϏδωενϟϯεΛಀ͢ 36
CAPఆཧͲΜͳނোʹରͯ͠ QBSUJUJPOUPMFSBODFσʔλৗʹ߹͓ͯ͠Γ DPOTJTUFODZγεςϜ͕ࢭ·Δ͜ͱͳ͍ BWBJMBCJMJUZ• ͜ͷ3ͭΛಉ࣌ʹຬͨ͢γεςϜଘࡏ͠ͳ͍ • ͜Ε·ͰͷRDBMSCAॏࢹ 37
CAPఆཧʢCॏࢹͷ߹ʣ• n1ͱn2ͷϨϓϦΧΛৗʹ߹͓ͤͯ͘͞ • ωοτϫʔΫ͕ΕͨΓނোͨ͠ΒࢭΊΔˠՄ༻ੑˣ 38 w1 w2 n1 n2
CAPఆཧʢAॏࢹͷ߹ʣ• n1ͱn2ͷϨϓϦΧΛৗʹ͑ΔΑ͏ʹ͢Δ • ωοτϫʔΫ͕ΕͨΓނোͯ͠ॻ͚Δˠ߹ੑˣ 39 w1 w2 n1 n2
CAPఆཧʢPॏࢹͷ߹ʣ• n1ͱn2ͷϨϓϦΧΛΤεύʔʹ͢Δ • ωοτϫʔΫ͕ΕͨΒਖ਼͍͠ํ͕Θ͔ΔˠՄ༻ੑͪΐͬͱˣ 40 w1 w2 n1 n2
Amazon’s Dynamo• ʮσʔλϕʔεϦϨʔγϣφϧ͚ͩ͡Όͳ͍ɺ߹ੑ͚ͩ͡Όͳ͍ɺՄ༻ੑ͕େࣄͳ߹͋ΔΜͩʯ • ΞϚκϯͷγϣοϐϯάΧʔτʹΘΕ͍ͯͨ • Vector Clocks • Handoff • (CRDT in Riak) 41
ʮͱΓ͋͑ͣॻ͘ʯͱ͍͏ߟ͑• “Hinted Handoff” Ϧϯάͷ࣍ͷਓʹͱΓ͓͋͑ͣͯ͘͠ • ނো͔Β͖ͬͯͨΒฦ͢ • ॻ͖ࠐΈ͕ॏෳͨ͠Βʁ 42 w1
ͱΓ͋͑ͣॻ͍͓͍ͯͯॻ͖ࠐΈ͕িಥͨ͠Βʁ• ྆ํ͓͍࣋ͬͯͯɺিಥͨ͠σʔλͷʮҼՌؔʯΛ໌Β͔ʹ͢Δ • “Vector Clock” • {a: 1, b:2, c:1} ͱ {a: 2, b:2, c:1} • {a: 1, b:2, c:1} ͱ {a: 1, c:1, d:1} • িಥͨ͠ͷΞϓϦͰղܾ 43 w1 w2 r
44 CRDT• CRDT (Conflict-Free Replicated Data Types • “AP” Λαϙʔτ • Counter, Register, Sets, Maps • →ผεϥΠυʁ
45 CRDTͰͷRead• ॱ൪͕ೖΕସΘͬͯ݁Ռ͕มΘΒͳ͍ܕ • update(w1, update(w2, Data0) =update(w2, update(w1, Data0) = Dataw1w1w1w2w2w2Actor 0Actor 1Actor 2w1(w2(Data0)) => Dataw1(w2(Data0)) => Dataw2(w1(Data0)) => Data
ACID vs BASEACID BASE ߹͍ͯ͠ͳͯ͘ৗʹσʔλʹΞΫηεͰ͖Δ Basically Avaiable Atomicity ෳͷૢ࡞ͷޭɾࣦഊΛ·ͱΊΔ Consistency Խ͞ΕͨσʔλෳͷϦϨʔγϣϯ͕ৗʹ߹͍ͯ͠Δ ࠷ऴతʹ߹ͨ͠ঢ়ଶʹͳΔ͜ͱ͕อূ͞Ε͍ͯΕΑ͍ Eventually Consistent Isolation ฒߦཧͯ͠ߋ৽్தͷঢ়ଶΛݟͤͳ͍ Durability σʔλΛӬଓԽࣦͯ͠Θͳ͍ ϨϓϦΧܾఆతͰͳ͘ɺ֬తͰ͋ͬͨΓɺάϩʔόϧʹҰ؏͍ͯ͠ͳͯ͘Α͍ Soft-state 46
σʔλετΞͷྨ࣠47
ΫΠζɿͲ͜·Ͱ͕DB?Ͳ͔͜Β͕NoSQL?ACIDͳτϥϯβΫγϣϯ τϥϯβΫγϣϯͳ͠ SQLΠϯλʔϑΣʔε Oracle MySQL PostgreSQL SQL Server, DB2 Cassandra (CQL) Hive, Presto Impala ಠࣗAPI InnoDB BerkeleyDB FoundationDB Riak MongoDB Redis 48
ςετʹग़ͳ͍ຊͷ͜ͱ49
CAPఆཧ͔ΒΈͨͷྨ• CAॏࢹ • RDBMS, MongoDB, HBase, etc… • ߹ੑҡ࣋ͷͨΊͷࢄ߹ҙʢਖ਼͘͠ϑΣΠϧΦʔόʔ͢ΔͨΊʣͷΈ͕ඞཁ • APॏࢹ • Cassandra, Riak, CouchDB • ࢄ߹ҙͷΈ͕ෆཁʹ࣮ӡ༻؆୯ 50
͜͜·Ͱͷ·ͱΊ:ͱͯᐆດͳදCAॏࢹ APॏࢹ RDBMS HBaseMongoDB Cassandra CouchDB Riak 51
ϦϨʔγϣφϧϞσϧΛఘΊΔ52 • ςʔϒϧߏ => KVS (Mapߏ) • PkeyΧϥϜͷͻͱ͔ͭΒબͿͷ=> ඞਢͷͷ • JOINΛͤ͞ͳ͍ • ͜ΕʹΑΓεέʔϧΞτܕͷࢄ͕Մೳʹ • εΩʔϚΛࣄલఆٛ͠ͳ͍
εΩʔϚ:σʔλϕʔεͷσʔλߏΛܾΊΔͷ• εΩʔϚΛࣄલʹఆٛ: ϦϨʔγϣφϧϞσϧʹԊ͍ͬͨํ • Pros: σʔλߏʹڧྗͳ੍Λ՝ͨ͢Ί࠷దԽ͍͢͠ˍΞϓϦΛ։ൃ͍͢͠ • Cons: ΞϓϦͷมߋίετ͕ߴ͍ˍ࠷ॳʹదʹઃܭ͠ͳ͍ͱ͔ͳΓมߋʹऑ͍ • εΩʔϚΛࣄલʹܾΊΔඞཁ͕ͳ͍߹ • Pro: σʔλߏΛޙ͔ΒͰ͔ͳΓॊೈʹมߋͰ͖Δ • Con: ςετ͕ͳ͍ͱόά͕ग़͍͢ɺσʔλߏ͕ෳࡶͩͱ࠷దԽ͠ʹ͍͘ 53
৽͍͠σʔλϞσϧΧϥϜࢦ • ΧϥϜΛ͍͘ΒͰ૿͢͜ͱ͕Ͱ͖Δ • Query LanguageΛ࡞Δ • ʮΧϥϜϑΝϛϦʔʯ • HBase, Cassandra υΩϡϝϯτࢦ • ݸʑͷϨίʔυ͕ࣗ༝ͳܗࣜΛ࣋ͭ • MapReduceͰΫΤϦ • XML, JSON, etc … • CouchDB, MongoDB 54
͜͜·Ͱͷ·ͱΊ:ͱͯᐆດͳද2CAॏࢹ APॏࢹ SQL RDBMS - ΧϥϜࢦ HBase Cassandra υΩϡϝϯτࢦ MongoDB CouchDB (blob) Riak 55
͏ͻͱͭσʔλϞσϧGraph• Semantic WebతͳσʔλߏΛѻ͏ͨΊͷDB • ;ͨͭͷํੑ • ϊʔυͱΤοδΛਂ͘୳ࡧ͍ͨ͠ • ϊʔυͱΤοδ͕ଟ͗͢Δ 56
σʔλϕʔε͕ಈ͘ॲཧܥ• ωΠςΟϒ • ͍ɺϝϞϦཧɺGC͕ͳ͍ • JVM • ͍ɺϝϞϦཧ͠ͳ͍͍ͯ͘ɺGC͕ى͖Δ • Erlang VM • ͍ͦͦ͜͜ɺϝϞϦཧ͠ͳ͍͍ͯ͘ɺGCͰࢭ·Βͳ͍ 57
58
͍··Ͱઆ໌͖ͯͨ͠ͷͰ؆୯ RiakͲ͏͍ͯ͠Δ͔ εέʔϦϯά ϋογϡతϧʔςΟϯάͰGossipͳਫฏࢄ ߹ੑͷཧ Vector Clocks, CRDTͳͲRead࣌ʹղܾ 2.0ͰPaxosೖΔ Մ༻ੑͷอূ Hinted HandoffΛͬͯৗʹॻ͖ࠐΈ͕Ͱ͖ΔΑ͏ʹ͍ͯ͠Δ σʔλϞσϧ Blob + Document Based ΠϯσοΫεΛΕΔ MapReduce͕Ͱ͖Δ ॲཧܥ Erlang VM 59
࣮໘ͰͷRiakͷಛ• ʮதʹΞϥʔτͰى͜͞Εͳ͍ʯ͜ͱΛࢦͨ͠ • Erlang VMΛ༻ • GCʹΑΔఀࢭ͕࣌ؒͳ͍ • ແఀࢭͰͷύονద༻ɺղੳɺૢ࡞͕Մೳ • ίϚϯυମܥɾϑΝΠϧߏɾσʔλߏΛͳΔ͘γϯϓϧʹઃܭ͠ӡ༻ෛ୲Λݮ • ҆ఆੑΛॏࢹͨ͠ઃܭ • ίϯύΫγϣϯɺ༧ଌՄೳੑ 60
݁ہɺͲ͏͍͏ͱ͖ʹͲΕΛ͑Α͍ͷ͔ʁ61
࣮໘ߟྀͨ͠બͷϙΠϯτ• JOINͳͲσʔλͷਖ਼نԽ͕ඞཁͳͱ͖: RDBMS • ΦϯϝϞϦͰ࠷৽ͷ࣌ܥྻσʔλΛݟ͍ͨͱ͖: MongoDB • ΧϥϜΛ͍͘ΒͰ૿͍ͨ͠ͱ͖ • SQL෩ͷΫΤϦΛ͍ͨ͠ͱ͖ or Մ༻ੑ: Cassandra • େྔσʔλʹޮతʹόονॲཧΛ͍ͨ͠ͱ͖: Hbase • σʔλΛͳ͘͞ͳ͍&μϯλΠϜΛͳ͍ͨ͘͠ͱ͖: Riak 62
·ͱΊ• σʔλϕʔεͷྺ࢙తʹNoSQL৽͘͠ͳ͍ • SQLͱACIDผͷͳ͠ɺNoSQLͱ͍͏ΑΓNoACID • εέʔϦϯάʢཧ۶ʣ͘͠ͳ͍ • ߹ੑཧ۶͔Β͍ͯͦͦ͠͠ • Մ༻ੑͦΜͳʹ͘͠ͳ͍ • RiakՄ༻ੑͱεέʔϥϏϦςΟ͕ಘҙ 63
Questions?64 ࢀߟจݙϦετ: https://gist.github.com/kuenishi/8296883#refs