Trying Out a Replacement with AWS Glue/gunosy-use-glue

aibou
December 25, 2017

Transcript

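  1. Choosing between Athena and Redshift
     • Athena
     • For visualization
     • Used for aggregations over the distant past or over long periods
     • Redshift
     • For application logic
     • Generating delivery candidates and the like from recent data
     • When storage usage fills up, we respond by swapping out the Redshift cluster
     • Old data is discarded (a backup exists in S3)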
  2. Athena
     • Logs are uploaded to S3 with fluent-plugin-s3
     • Separated per tag (imp, click, etc.)
     • Stored under hive-friendly keys ( /click/year=2017/... )
     • A scheduled batch runs ADD PARTITION
     • MSCK REPAIR is reliable but slow
     • MSCK REPAIR ≒ 2h, ADD PARTITION < 1s
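The gap between MSCK REPAIR and ADD PARTITION comes from the latter naming exactly one partition instead of scanning the whole prefix. A minimal sketch of how a scheduled batch might build that statement follows. The helper `add_partition_sql`, the table name `click`, the bucket `example-bucket`, and the year/month/day/hour layout are assumptions for illustration, not the author's code.

```python
from datetime import datetime


def add_partition_sql(table: str, base_location: str, ts: datetime) -> str:
    """Build one Athena ADD PARTITION statement for the hour containing ts.

    Assumption: partitions are laid out as year=/month=/day=/hour= under
    base_location; adjust the list below if the real keys differ.
    """
    parts = [
        ("year", f"{ts.year:04d}"),
        ("month", f"{ts.month:02d}"),
        ("day", f"{ts.day:02d}"),
        ("hour", f"{ts.hour:02d}"),
    ]
    spec = ", ".join(f"{key}='{value}'" for key, value in parts)
    path = "/".join(f"{key}={value}" for key, value in parts)
    location = f"{base_location.rstrip('/')}/{path}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


# Hypothetical table and bucket names.
print(add_partition_sql("click", "s3://example-bucket/click", datetime(2017, 11, 24, 18)))
```

`IF NOT EXISTS` makes the statement safe to re-run, which suits a batch that may fire more than once for the same hour.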
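  3. Athena partition creation batch
     • Internally it is, for some reason, called "Fully Automatic Aibou" (全自動あいぼう)
     • CloudWatch Events + Lambda
     • The Lambda function is written in Java
     • Defined as a DSL with codenize-tools/monosasi
     • s3://bucket/path/to/year=2017/month=11/day=24/hour=18/
     • I can publish the code on request (though Glue is the more convenient option...)
     • Can also handle the Firehose format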
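A partition batch of this kind has to turn a hive-style prefix such as the one above into partition column values. The sketch below shows one way to do that in Python. The slide states that the real Lambda is written in Java and does not show its code, so the function `partition_values` is purely hypothetical.

```python
from urllib.parse import urlparse


def partition_values(s3_uri: str) -> dict:
    """Extract hive-style key=value path segments from an S3 URI.

    Segments without "=" (for example "path" or "to") are ignored.
    """
    values = {}
    for segment in urlparse(s3_uri).path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key and value:
            values[key] = value
    return values


print(partition_values("s3://bucket/path/to/year=2017/month=11/day=24/hour=18/"))
```

The values stay as strings so that zero-padded components such as `month=01` survive unchanged.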
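  4. Things we could not do with Glue
     • Add table to Athena in a different region
     • Handled with the partition-adding batch described earlier
     • Applying the Data Catalog to existing Athena tables
     • Requires recreating the database/table (not done yet because of the impact)
     • Specifying or changing Glue table names
     • Unusual formats cannot be handled
     [slide label: next day]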
  5. JOINing logs with master data
     • Log data: S3, Redshift
     • imp, click, etc.
     • Master data: RDS
     • campaign, creative, etc.
     • "We want to JOIN them in Athena and Redshift"
     • => Previously handled with Digdag + Embulk
  6. to Glue Crawler & Glue ETL
     [Diagram labels: oregon region, tokyo region, replica, ①, ②; add table using the metadata from ②; ETL using the metadata from ①]
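  7. Filter transformer class

         from datetime import date, timedelta
         from awsglue.transforms import Filter

         # Assumption: "yesterday" is not defined on the slide; here it is derived from today's date.
         yesterday = date.today() - timedelta(days=1)

         def filter_function(dynamic_record):
             # Keep only records whose "date" field falls on yesterday.
             return dynamic_record["date"].strftime("%Y-%m-%d") == yesterday.strftime("%Y-%m-%d")

         # datasource0 is the DynamicFrame created earlier in the Glue job script.
         filtered0 = Filter.apply(frame=datasource0, f=filter_function, transformation_ctx="filtered0")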
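  8. What was good about Glue
     • ETL is completed within Glue alone
     • No Lambda needed
     • No Digdag + Embulk needed
     • No servers (ECS) needed
     • People no longer have to enumerate the tables
     • Until now everything was specified in Lambda or Embulk configuration files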
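  9. What was not good about Glue
     • Filtering is done on the ETL side rather than in the Crawler
     • The SQL issued is SELECT * FROM hoge_stats;
     • We would like to set a WHERE clause on the Crawler side
     • A job that does not finish in 18 hours (300 million records)
     • Jobs cannot be cloned (we want this fairly urgently)
     • Creating several similar jobs takes effort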
  10. DynamoDB full dump and format
     • Full dump with DataPipeline
     • DynamoDB Streams is something we will do later (wishful thinking)
     • Data type descriptors -> JSON
     • Painful to use as is
     • Convert it (somehow)

         {
           "Item": {
             "Age": {"N": "8"},
             "Colors": {"L": [{"S": "White"}, {"S": "Brown"}, {"S": "Black"}]},
             "Name": {"S": "Fido"},
             "Vaccinations": {
               "M": {
                 "Rabies": {"L": [{"S": "2009-03-17"}, {"S": "2011-09-21"}, {"S": "2014-07-08"}]},
                 "Distemper": {"S": "2015-10-13"}
               }
             },
             "Breed": {"S": "Beagle"},
             "AnimalType": {"S": "Dog"}
           }
         }
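The slide leaves the conversion step as "somehow". One possible approach is a small recursive function that strips the type descriptors; the sketch below is an illustration, not the author's converter. The name `unmarshal` is hypothetical, and only the S, N, BOOL, NULL, L and M descriptors are handled (sets and binary values are left out). boto3 also ships a `TypeDeserializer` class in `boto3.dynamodb.types` that covers the full set.

```python
from decimal import Decimal


def unmarshal(attr):
    """Convert one DynamoDB attribute value such as {"N": "8"} to a plain value."""
    # Each attribute value is a single-entry dict: {type descriptor: payload}.
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        number = Decimal(value)
        # Whole numbers become int, everything else float.
        return int(number) if number == number.to_integral_value() else float(number)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unmarshal(v) for v in value]
    if type_tag == "M":
        return {k: unmarshal(v) for k, v in value.items()}
    raise ValueError(f"unsupported type descriptor: {type_tag}")


# Part of the "Item" shown on the slide, wrapped as a map.
item = {
    "Age": {"N": "8"},
    "Colors": {"L": [{"S": "White"}, {"S": "Brown"}, {"S": "Black"}]},
    "Name": {"S": "Fido"},
}
print(unmarshal({"M": item}))
```

Wrapping the item in `{"M": ...}` lets the same function handle the top level, because an item is itself a map of attribute names to typed values.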
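  11. On implementing ETL
     • Sometimes the built-in Transformer classes alone are not enough
     • For more complex Filter and Map operations
     • Use toDF & fromDF (ordinary pyspark)
     • try & except is required when using job bookmarks
     • toDF fails (apparently a partition-related problem)
     • We still do not understand the use cases for job bookmarks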
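  12. Pitfalls
     • Exceptions raised inside the callback functions of built-in Transformer classes are swallowed
     • The dynamic record that raised the exception is discarded and processing moves on to the next record
     • To debug, wrap the entire callback function in try-except
     • Watch the subnet when using RDS as the source data store
     • The so-called "Lambda in VPC problem"
     • Always make S3 reachable through a NAT Gateway or a VPC endpoint (needed to download the script)
     • Network setting changes take about 30 minutes to take effect, for some reason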
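The advice to wrap the whole callback in try-except can be packaged as a decorator so that every callback logs its failure instead of silently losing the record. This is a minimal sketch in plain Python. The decorator `log_exceptions` and the logger name `glue_callback` are hypothetical, and its behavior inside an actual Glue job (where the log output ends up, how the function is serialized to executors) has not been verified here.

```python
import functools
import logging


def log_exceptions(default=False):
    """Wrap a callback so any exception is logged and `default` is returned."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(record):
            try:
                return func(record)
            except Exception:
                # logger.exception records the message plus the full traceback.
                logging.getLogger("glue_callback").exception(
                    "callback %s failed for record %r", func.__name__, record
                )
                return default
        return wrapper
    return decorator


@log_exceptions(default=False)
def filter_function(record):
    # Raises KeyError when the record has no "date" field.
    return record["date"] == "2017-12-24"


print(filter_function({"date": "2017-12-24"}))
print(filter_function({"id": 1}))
```

Returning `False` as the default drops the failing record from a filter, which matches what the slide says happens anyway; the difference is that the traceback is now recorded.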
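  13. Requests for Glue
     • Above all, make it faster
     • Support DynamoDB as a source data store
     • Version control for job scripts
     • Easier job debugging (errors get swallowed)
     • Better documentation for the built-in Transform classes
     • Monitoring for Crawlers & ETL jobs
     • Let us write jobs in Scala
     • We want to reuse our EMR (Spark) assets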