Slide 1

Slide 1 text

Trying Out AWS Glue as a Replacement
Gunosy Inc., Development Division, 浜地 亮輔

Slide 2

Slide 2 text

Who are you?
• @aibou
• Doing "police work" on the SRE team
• No prior big data experience
• I like watching American football

Slide 3

Slide 3 text

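Job description
• Operating the Gunosy app and ad servers
• Day-to-day work aimed at automation and labor saving
• Infrastructure as code (codenize.tools, terraform)
• "Hello, this is the police"
• OpsWorks, the three Kinesis siblings
• I'd like to use X-Ray someday...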

Slide 4

Slide 4 text

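Agenda
• How we introduced Glue at Gunosy (as a side project, no less)
• Replacing the Athena partition-creation batch with Glue
• Replacing the in-house ETL batch with Glue
• Converting the DynamoDB format with Glue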

Slide 5

Slide 5 text

(1) Replacing the Athena partition-creation batch with Glue

Slide 6

Slide 6 text

Flow of Gunosy's ad logs (roughly)
[Diagram labels: API servers, Firehose, Athena, batch server, EMR (Spark) or Athena, delivery candidates]

Slide 7

Slide 7 text

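When to use Athena vs. Redshift
• Athena
  • For visualization
  • Used for aggregations over the distant past or over long periods
• Redshift
  • For delivery logic
  • Generating delivery candidates, etc. from recent data
  • When storage utilization fills up, data in Redshift is rotated
  • That is, old data is discarded (a backup remains in S3)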

Slide 8

Slide 8 text

Athena
• Logs are uploaded to S3 with fluent-plugin-s3
• Separated by tag (imp, click, etc.)
• Using hive-friendly keys ( /click/year=2017/... )
• A scheduled batch runs ADD PARTITION
• MSCK REPAIR is reliable but slow
• MSCK REPAIR ≒ 2h, ADD PARTITION < 1s

Slide 9

Slide 9 text

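Athena partition-creation batch
• For some reason it is known internally as "Fully Automatic Aibou"
• CloudWatch Events + Lambda
• The Lambda function is written in Java
• Defined as a DSL with codenize-tools/monosasi
• s3://bucket/path/to/year=2017/month=11/day=24/hour=18/
• I can publish the code if there is demand (though Glue is more convenient anyway...)
• Can also handle the Firehose format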
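As an illustration of what such a scheduled batch might issue each hour, here is a minimal Python sketch that builds the `ALTER TABLE ... ADD PARTITION` statement for the key layout above. The function name and the table name `click` are hypothetical, and the partition columns are assumed to be strings; the actual Lambda was written in Java and is not shown in the slides.

```python
from datetime import datetime


def add_partition_ddl(table: str, base_uri: str, ts: datetime) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for one hourly partition."""
    # Partition values follow the year=/month=/day=/hour= key layout.
    spec = {
        "year": f"{ts.year:04d}",
        "month": f"{ts.month:02d}",
        "day": f"{ts.day:02d}",
        "hour": f"{ts.hour:02d}",
    }
    partition = ", ".join(f"{k}='{v}'" for k, v in spec.items())
    path = "/".join(f"{k}={v}" for k, v in spec.items())
    location = f"{base_uri.rstrip('/')}/{path}/"
    # IF NOT EXISTS keeps the statement safe to re-run for the same hour.
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition}) LOCATION '{location}'"
    )


ddl = add_partition_ddl("click", "s3://bucket/path/to", datetime(2017, 11, 24, 18))
print(ddl)
```

Because only one partition is registered per call, the statement stays cheap compared with rescanning the whole prefix, which matches the timing difference noted on the previous slide.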

Slide 10

Slide 10 text

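Moving to the Glue DataCatalog
• Rate limits in Athena and Lambda
  • Unless schedules are deliberately staggered, executions cluster at the same time
• Managing schemas and CloudWatch Events is painful
  • Currently about 45 tables, and it is already a burden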
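One way to stagger schedules without hand-picking a time for each table is to derive a stable minute offset from the table name. The sketch below is only an illustration of that idea, not what the talk's batch actually did; the function names and the `conversion` table are hypothetical.

```python
import hashlib


def stagger_minute(table: str, window_minutes: int = 60) -> int:
    """Map a table name to a stable minute offset within the window."""
    digest = hashlib.sha256(table.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes


def cron_for(table: str) -> str:
    """Hourly CloudWatch Events cron expression with a per-table minute offset."""
    return f"cron({stagger_minute(table)} * * * ? *)"


for name in ("imp", "click", "conversion"):
    print(name, cron_for(name))
```

A hash does not guarantee that two tables never share a minute, but with a few dozen tables it spreads the load across the hour without any manual bookkeeping.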

Slide 11

Slide 11 text

to Glue Crawler & DataCatalog
[Diagram labels: API servers, ALTER TABLE ADD PARTITION]

Slide 12

Slide 12 text

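Impressions
• Lambda is no longer needed (hooray)
• A single crawler can handle multiple Source DataStores
  • It helps that crawlers do not proliferate
• Crawling takes a long time when the log volume is large
  • At the moment each run takes 6 hours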

Slide 13

Slide 13 text

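What we could not do with Glue
• Add table to Athena in a different region
  • Handled with the partition-adding batch described earlier
• Apply the DataCatalog to existing Athena tables
  • Requires recreating the database/table (not done yet because of the impact)
• Specify or change Glue table names
• Unusual formats are not supported
[Diagram label: next day]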

Slide 14

Slide 14 text

(2) Replacing the in-house ETL batch with Glue

Slide 15

Slide 15 text

JOINing logs with master data
• Log data: S3, Redshift
  • imp, click, etc.
• Master data: RDS
  • campaign, creative, etc.
• "We want to JOIN them in Athena and Redshift"
• => Previously handled with Digdag + Embulk

Slide 16

Slide 16 text

Embulk + digdag (+ docker)
[Diagram labels: ALTER TABLE ADD PARTITION]

Slide 17

Slide 17 text

to Glue Crawler & Glue ETL
[Diagram labels: Oregon region, Tokyo region, replica, (1), (2); "add table" using the metadata from (2); ETL using the metadata from (1)]

Slide 18

Slide 18 text

Additional
• stats tables: pre-aggregated daily data
• Upload only each day's data rather than every row of the table
• Use the Filter transform class in the ETL job
[Diagram labels: stats table, S3 path/to/year/month/day]

Slide 19

Slide 19 text

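Filter transformer class

    # `yesterday` (a date/datetime) and `datasource0` (a DynamicFrame) are
    # assumed to be defined earlier in the job script; Filter comes from
    # awsglue.transforms.
    def filter_function(dynamic_record):
        # Keep only the records whose "date" field falls on yesterday.
        return dynamic_record["date"].strftime("%Y-%m-%d") == yesterday.strftime("%Y-%m-%d")

    filtered0 = Filter.apply(frame=datasource0, f=filter_function, transformation_ctx="filtered0")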
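The slide does not show where `yesterday` comes from. The following sketch shows one possible way to build the same predicate so that it can be exercised locally; plain dicts stand in for Glue dynamic records, and `make_yesterday_filter` is a hypothetical helper, not part of the Glue API.

```python
from datetime import date, datetime, timedelta


def make_yesterday_filter(today: date):
    """Return a predicate that keeps records dated the day before `today`."""
    yesterday = today - timedelta(days=1)

    def filter_function(record) -> bool:
        value = record["date"]
        # Accept both datetime and date values; compare calendar dates only.
        if isinstance(value, datetime):
            value = value.date()
        return value == yesterday

    return filter_function


keep = make_yesterday_filter(date(2017, 11, 25))
rows = [
    {"date": datetime(2017, 11, 24, 9, 30), "imp": 10},
    {"date": datetime(2017, 11, 25, 0, 0), "imp": 20},
]
kept = [r for r in rows if keep(r)]
print(kept)
```

Passing the reference date in explicitly makes the predicate easy to test and to re-run for a past day, which a value computed once at import time would not allow.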

Slide 20

Slide 20 text

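What was good about Glue
• ETL is self-contained within Glue
  • No Lambda needed
  • No Digdag + Embulk needed
  • No servers (ECS) needed
• Humans no longer have to enumerate tables
  • Until now everything was specified in Lambda or Embulk configuration files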

Slide 21

Slide 21 text

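What was not so good about Glue
• Filtering happens on the ETL side rather than in the Crawler
  • The SQL issued is SELECT * FROM hoge_stats;
  • We would like to set a WHERE clause on the Crawler side
  • A job that does not finish in 18 hours (300 million records)
• Jobs cannot be cloned (we would like this fairly urgently)
  • Creating many similar jobs is tedious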

Slide 22

Slide 22 text

(3) Converting the DynamoDB format with Glue (not a replacement, though)

Slide 23

Slide 23 text

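We want to analyze DynamoDB data
• Gunosy's tab-ordering information
  • Which kinds of users follow which tabs
• Tabs that are added automatically under specific conditions
  • Are they working correctly?
  • How many users have them?
• We want daily aggregates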

Slide 24

Slide 24 text

DynamoDB full dump and format
• Full dump with DataPipeline
  • DynamoDB Streams: to be done later (wishful thinking)
• Data type descriptors -> JSON
  • Painful to use as-is
  • Convert it (somehow)

    {
      "Item": {
        "Age": {"N": "8"},
        "Colors": {
          "L": [ {"S": "White"}, {"S": "Brown"}, {"S": "Black"} ]
        },
        "Name": {"S": "Fido"},
        "Vaccinations": {
          "M": {
            "Rabies": {
              "L": [ {"S": "2009-03-17"}, {"S": "2011-09-21"}, {"S": "2014-07-08"} ]
            },
            "Distemper": {"S": "2015-10-13"}
          }
        },
        "Breed": {"S": "Beagle"},
        "AnimalType": {"S": "Dog"}
      }
    }
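The "convert it somehow" step can be sketched as a small recursive function that strips the type descriptors. This is an illustrative implementation under stated assumptions, not the code used in the talk: `from_dynamodb` and `flatten_item` are hypothetical names, binary types are left out, and descriptors are upper-cased before matching in case a dump writes them in lowercase.

```python
from decimal import Decimal


def from_dynamodb(attr):
    """Convert one DynamoDB-typed attribute value into a plain Python value."""
    (type_tag, value), = attr.items()  # each attribute is a single {type: value} pair
    type_tag = type_tag.upper()
    if type_tag == "S":
        return value
    if type_tag == "N":
        return Decimal(value)  # numbers are serialized as strings
    if type_tag == "BOOL":
        return bool(value)
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [from_dynamodb(v) for v in value]
    if type_tag == "M":
        return {k: from_dynamodb(v) for k, v in value.items()}
    if type_tag == "SS":
        return set(value)
    if type_tag == "NS":
        return {Decimal(v) for v in value}
    raise ValueError(f"unsupported DynamoDB type descriptor: {type_tag}")


def flatten_item(item):
    """Convert every attribute of an item (the value under "Item")."""
    return {name: from_dynamodb(attr) for name, attr in item.items()}


item = {
    "Age": {"N": "8"},
    "Colors": {"L": [{"S": "White"}, {"S": "Brown"}]},
    "Name": {"S": "Fido"},
    "Vaccinations": {"M": {"Distemper": {"S": "2015-10-13"}}},
}
plain = flatten_item(item)
print(plain)
```

A function of this shape could plausibly be applied per record inside a Glue job, for example through the built-in Map transform, though the slides do not show how the conversion was actually wired in.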

Slide 25

Slide 25 text

Data Pipeline + Glue ETL
[Diagram labels: Amazon DynamoDB, DataPipeline]
• Map.apply
• toDF / fromDF

Slide 26

Slide 26 text

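Implementing the ETL
• Sometimes the built-in Transformer classes alone are not enough
  • For example, more complex Filter or Map operations
  • Use toDF & fromDF (ordinary PySpark)
• try & except is mandatory when using job bookmarks
  • toDF fails (it looks like a partition-related problem)
  • We still do not understand the use cases for job bookmarks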

Slide 27

Slide 27 text

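Pitfalls
• Exceptions raised inside the callback function of a built-in Transformer class are swallowed
  • The dynamic record that raised the exception is discarded and processing moves on to the next record
  • To debug, wrap the entire callback function in try-except
• Watch the subnet when using RDS as a Source DataStore
  • The so-called "Lambda in VPC problem"
  • Always make S3 reachable through a NAT Gateway or a VPC Endpoint (needed to download the script)
  • For some reason network settings take about 30 minutes to take effect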
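The try-except advice can be packaged as a decorator so that every callback gets the same treatment. This is a minimal sketch in plain Python; `log_exceptions` and `keep_clicks` are hypothetical names, and where the log output ends up depends on how logging is configured for the job.

```python
import functools
import logging
import traceback

logger = logging.getLogger("etl_callback")


def log_exceptions(func):
    """Wrap a transform callback so failures are logged before propagating."""

    @functools.wraps(func)
    def wrapper(record):
        try:
            return func(record)
        except Exception:
            # Log the offending record and the full traceback, then re-raise so
            # the surrounding framework's own handling is unchanged.
            logger.error(
                "callback %s failed on %r\n%s",
                func.__name__, record, traceback.format_exc(),
            )
            raise

    return wrapper


@log_exceptions
def keep_clicks(record):
    return record["type"] == "click"  # raises KeyError when "type" is missing
```

Re-raising after logging keeps the record-dropping behavior described above intact while still leaving a trace of which record failed and why.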

Slide 28

Slide 28 text

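Overall impressions of Glue
• We want to replace all of our in-house ETL batches with Glue
  • The DataCatalog is great
• But as things stand, a lot of it is painful
  • Execution time, sudden job death, difficulty of development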

Slide 29

Slide 29 text

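Requests for Glue
• Speed improvements above all
• Support DynamoDB as a Source DataStore
• Version control for job scripts
• Easier job debugging (errors get swallowed)
• Better documentation for the built-in Transform classes
• Monitoring for Crawlers & ETL jobs
• Let us write jobs in Scala
  • We want to reuse our EMR (Spark) assets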