Trying Out a Replacement with AWS Glue / gunosy-use-glue

aibou
December 25, 2017

Transcript

  1. Trying Out a Replacement with AWS Glue • Gunosy Inc., Development Division • 浜地 亮輔

  2. Who are you? • @aibou • I do "police work" on the SRE team • No experience with big data • I like watching American football

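  3. What I do • Operating the Gunosy and ad servers • Working day to day on automation and labor saving • Managing infrastructure as code (codenize.tools, terraform) • "Hello, this is the police" • OpsWorks, the three Kinesis siblings • I would love to use X-Ray...
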
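  4. Agenda • How we introduced Glue at Gunosy (as a side project, no less) • Replacing the Athena partition-creation batch with Glue • Replacing our in-house ETL batch with Glue • Converting the DynamoDB format with Glue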

  5. ① Replacing the Athena partition-creation batch with Glue

  6. Flow of Gunosy's ad logs (rough overview) [diagram labels: batch server, EMR, Spark, API servers, "or", Athena, delivery candidates, Firehose]

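  7. When to use Athena vs. Redshift • Athena • For visualization • Used for aggregations over the distant past or over long periods • Redshift • For business logic • e.g. generating delivery candidates from recent data • When storage usage fills up, we respond by swapping out Redshift • Old data is discarded at that point (a backup exists in S3)
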
  8. Athena • Logs are uploaded to S3 with fluent-plugin-s3 • Split by tag (imp, click, etc.) • Stored under Hive-friendly keys ( /click/year=2017/... ) • A scheduled batch runs ADD PARTITION • MSCK REPAIR is reliable but slow • MSCK REPAIR ≒ 2h, ADD PARTITION < 1s

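The ADD PARTITION approach on this slide can be sketched as a small statement builder. This is only an illustration, not the batch described in the talk: the table name, the bucket name, and the `month=`/`day=` levels below `year=` are assumptions (the slide elides the rest of the key), and the partition columns are assumed to be strings.

```python
from datetime import date


def add_partition_sql(table: str, bucket: str, tag: str, day: date) -> str:
    """Build one ALTER TABLE ... ADD PARTITION statement for a single day.

    The key layout (tag/year=/month=/day=) is an assumed Hive-style layout.
    """
    year, month, dom = f"{day.year}", f"{day.month:02d}", f"{day.day:02d}"
    location = f"s3://{bucket}/{tag}/year={year}/month={month}/day={dom}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{dom}') "
        f"LOCATION '{location}'"
    )


# Hypothetical table and bucket names, for illustration only.
print(add_partition_sql("click", "example-bucket", "click", date(2017, 11, 24)))
```

Because the statement names exactly one partition, Athena does not have to scan the whole prefix, which is consistent with the large gap between the two timings quoted on the slide.
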
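  9. Athena partition-creation batch • For some reason it is known internally as "Fully Automatic Aibou" • CloudWatch Events + Lambda • The Lambda is written in Java • Managed as a DSL with codenize-tools/monosasi • s3://bucket/path/to/year=2017/month=11/day=24/hour=18/ • The code can be published on request (though Glue is more convenient anyway...) • Can also handle the Firehose format
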
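A batch like this has to translate between an S3 prefix and a partition specification. The slide's Lambda is written in Java and is not shown, so the following is a separate Python sketch of just that mapping, using the example path from the slide; the function name is hypothetical.

```python
from urllib.parse import urlparse


def parse_partitions(s3_url: str) -> dict:
    """Return the key=value path segments of a Hive-style S3 URL, in path order."""
    path = urlparse(s3_url).path
    partitions = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        # Only segments of the form key=value are partition levels.
        if sep and key and value:
            partitions[key] = value
    return partitions


print(parse_partitions("s3://bucket/path/to/year=2017/month=11/day=24/hour=18/"))
# {'year': '2017', 'month': '11', 'day': '24', 'hour': '18'}
```
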
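  10. Moving to the Glue DataCatalog • Rate limits on Athena and Lambda • Unless the schedules are deliberately staggered, the runs pile up at the same time • Managing schemas and CloudWatch Events is painful • Currently about 45 tables, and it is already a burden
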
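One way to stagger schedules, as the slide says is necessary, is to derive a stable minute offset from each table name. This is a hedged sketch rather than what the team actually did: the hashing scheme, the one-hour window, and the hourly cron shape are all assumptions.

```python
import hashlib


def stagger_minute(table_name: str, window_minutes: int = 60) -> int:
    """Map a table name to a stable minute offset inside the window."""
    digest = hashlib.sha256(table_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes


def cron_for(table_name: str) -> str:
    """CloudWatch Events cron expression that fires hourly at the table's offset."""
    return f"cron({stagger_minute(table_name)} * * * ? *)"


# Hypothetical table names, for illustration only.
for name in ("imp", "click", "conversion"):
    print(name, cron_for(name))
```

The offset depends only on the table name, so adding a table never shifts the schedule of the existing ones. Two tables can still hash to the same minute; this spreads the load, it does not guarantee distinct slots.
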
  11. to Glue Crawler & DataCatalog [diagram labels: API servers, ALTER TABLE ADD PARTITION]

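  12. Impressions • Lambda is no longer needed (yay) • One crawler can handle multiple Source DataStores • Crawlers do not proliferate, which helps • Crawling takes a long time when the log volume is large • Currently it takes about 6 hours per run
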
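  13. What we could not do with Glue • Add a table to Athena in a different region • Handled with the partition-adding batch described earlier • Apply the DataCatalog to existing Athena tables • Requires recreating the database/table (not done yet because of the impact) • Specify or change Glue table names • Unusual formats cannot be handled [figure label: next day]
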
  14. ② Replacing our in-house ETL batch with Glue

  15. Joining logs with master data • Log data: S3, Redshift • imp, click, etc. • Master data: RDS • campaign, creative, etc. • "We want to JOIN them in Athena and Redshift" • => This was handled with Digdag + Embulk

  16. Embulk + digdag (+ docker) [diagram labels: ALTER TABLE ADD PARTITION]

  17. to Glue Crawler & Glue ETL [diagram labels: Oregon region, Tokyo region, replica, ①, ②; add table using the metadata from ②; ETL using the metadata from ①]

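  18. Additional • stats tables: data already aggregated per day • Upload only each day's data rather than every row of the table • Use the Filter transform class in the ETL job [diagram labels: stats table; S3 path/to/year/month/day]
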
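  19. Filter transformer class

    from datetime import date, timedelta
    from awsglue.transforms import Filter

    # Assumed definition: the slide uses `yesterday` without showing where it is set.
    yesterday = date.today() - timedelta(days=1)

    def filter_function(dynamic_record):
        # Keep only records whose "date" field falls on yesterday's calendar day.
        return dynamic_record["date"].strftime("%Y-%m-%d") == yesterday.strftime("%Y-%m-%d")

    # datasource0 is the DynamicFrame created earlier in the Glue job script.
    filtered0 = Filter.apply(frame=datasource0, f=filter_function, transformation_ctx="filtered0")
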
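  20. What was good about Glue • ETL is completed within Glue alone • No Lambda needed • No Digdag + Embulk needed • No servers (ECS) needed • Humans no longer have to enumerate the tables • Until now everything was specified in Lambda or Embulk configuration files
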
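  21. What was not good about Glue • Filtering happens on the ETL side, not in the Crawler • The SQL that gets issued is SELECT * FROM hoge_stats; • We want to set a WHERE clause on the Crawler side • A job that does not finish in 18 hours (300 million records) • Jobs cannot be cloned (we want this fairly urgently) • Creating many similar jobs takes a lot of effort
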
  22. ③ Converting the DynamoDB format with Glue (not a replacement, though)

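  23. We want to analyze DynamoDB data • Gunosy's tab-order information • Which kinds of users follow which tabs • Tabs that are added automatically under specific conditions • Are they working correctly? • How many such users are there? • We want to aggregate per day
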
  24. DynamoDB full dump and format • Full dump with DataPipeline • DynamoDB Streams is something to do later (wishful thinking) • Data type descriptors -> JSON • Painful to use as is • Convert it (somehow)
    { "Item": { "Age": {"N": "8"}, "Colors": { "L": [ {"S": "White"}, {"S": "Brown"}, {"S": "Black"} ] }, "Name": {"S": "Fido"}, "Vaccinations": { "M": { "Rabies": { "L": [ {"S": "2009-03-17"}, {"S": "2011-09-21"}, {"S": "2014-07-08"} ] }, "Distemper": {"S": "2015-10-13"} } }, "Breed": {"S": "Beagle"}, "AnimalType": {"S": "Dog"} } }

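The "convert it (somehow)" step can be illustrated with a small recursive function that strips the type descriptors. This is a plain-Python sketch, not the Glue job from the talk: the function names are hypothetical, and only the descriptors S, N, BOOL, NULL, L and M are handled, on the assumption that the dump uses the uppercase form shown on the slide.

```python
def to_number(text: str):
    """DynamoDB serializes numbers as strings; prefer int, fall back to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def unmarshal(value: dict):
    """Convert one typed value such as {"N": "8"} into a plain Python value."""
    (type_tag, inner), = value.items()  # exactly one descriptor per value
    if type_tag == "S":
        return inner
    if type_tag == "N":
        return to_number(inner)
    if type_tag == "BOOL":
        return bool(inner)
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unmarshal(item) for item in inner]
    if type_tag == "M":
        return {key: unmarshal(item) for key, item in inner.items()}
    raise ValueError(f"unsupported type descriptor: {type_tag}")


def unmarshal_item(record: dict) -> dict:
    """Flatten a full-dump record of the form {"Item": {...}}."""
    return {key: unmarshal(value) for key, value in record["Item"].items()}


record = {"Item": {"Age": {"N": "8"}, "Name": {"S": "Fido"},
                   "Colors": {"L": [{"S": "White"}, {"S": "Brown"}]},
                   "Vaccinations": {"M": {"Distemper": {"S": "2015-10-13"}}}}}
print(unmarshal_item(record))
```

In a Glue job, a function of this shape is the kind of per-record logic that would be passed to a Map-style transform or applied after converting to a DataFrame.
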
  25. Data Pipeline + Glue ETL [diagram labels: Amazon DynamoDB, DataPipeline • Map.apply • toDF/fromDF]

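  26. Implementing the ETL • Sometimes the built-in Transformer classes alone are not enough • For more complex Filter or Map operations • Use toDF & fromDF (ordinary pyspark) • try & except is mandatory when using job bookmarks • toDF fails (it looks like a partition-related problem) • We still do not understand the use cases for job bookmarks
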
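  27. Gotchas • Exceptions raised inside the callback functions of the built-in Transformer classes are swallowed • The dynamic record that raised the exception is discarded and processing moves on to the next record • To debug, wrap the entire callback function in try-except • Watch the subnet when using RDS as a Source DataStore • The so-called "Lambda in VPC problem" • Always make S3 reachable through a NAT Gateway or a VPC Endpoint (needed to download the script) • For some reason network setting changes take about 30 minutes to take effect
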
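The advice to wrap the whole callback in try-except can be packaged as a decorator so that every callback gets the same treatment. This is a minimal sketch in plain Python with hypothetical names; it does not touch the Glue API, and it assumes that logging the failure before re-raising is enough to make the swallowed exception visible in the job log.

```python
import functools
import logging

logger = logging.getLogger("glue_callbacks")


def log_exceptions(callback):
    """Wrap a per-record callback so failures are logged before propagating."""
    @functools.wraps(callback)
    def wrapper(record):
        try:
            return callback(record)
        except Exception:
            # logger.exception records the message together with the traceback.
            logger.exception("%s failed on record %r", callback.__name__, record)
            raise
    return wrapper


@log_exceptions
def has_date(record):
    # Raises KeyError when the field is missing, which would otherwise go unnoticed.
    return record["date"] is not None


print(has_date({"date": "2017-12-25"}))
```

Even if the framework still drops the record after the re-raise, the log entry shows which record failed and why.
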
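  28. Impressions of Glue • We want to replace all of our in-house ETL batches with Glue • Things like the DataCatalog are great • But as it stands, a lot of it is painful • Run time, sudden job death, difficult development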

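  29. Requests for Glue • Speed, above all • Support DynamoDB as a Source DataStore • Version control for job scripts • Easier job debugging (errors get swallowed) • Better documentation for the built-in Transform classes • Monitoring for Crawlers & ETL jobs • Let us write jobs in Scala • We want to reuse our EMR (Spark) assets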