Trying Out a Replacement with AWS Glue (AWS Glueでリプレースしてみた) / gunosy-use-glue
aibou
December 25, 2017
Programming
Transcript
Trying Out a Replacement with AWS Glue
Gunosy Inc., Development Division
Who am I?
• @aibou
• Working as the "security guard" on the SRE team
• No prior big-data experience
• I like watching American football
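What I do
• Operating Gunosy and its ad servers
• Working day to day toward automation and labor saving
• Infrastructure as code (codenize.tools, terraform)
• "Hello, I'm from security..."
• OpsWorks, the three Kinesis siblings
• I'd like to use X-Ray someday...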
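Agenda
• How Gunosy introduced Glue (as a side project, no less)
• Replacing the Athena partition-creation batch with Glue
• Replacing the in-house ETL batch with Glue
• DynamoDB format conversion with Glue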
(1) Replacing the Athena partition-creation batch with Glue
Flow of Gunosy's ad logs (rough overview)
[Diagram labels: API servers, Firehose, batch server, EMR (Spark) or Athena, delivery candidates]
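When to use Athena vs. Redshift
• Athena
  • For visualization
  • Used for aggregations over periods far in the past
• Redshift
  • For application logic
  • Generates delivery candidates from recent data
  • When storage fills up, the Redshift cluster has to be swapped out
  • Old data is discarded periodically (backups are kept in S3)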
Athena
• Logs are uploaded to S3 with fluent-plugin-s3
  • Per tag (imp, click)
  • Under Hive-friendly keys ( /click/year=2017/... )
• A scheduled batch runs ADD PARTITION
  • MSCK REPAIR is reliable but slow
  • MSCK REPAIR ≒ 2h, ADD PARTITION < 1s
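To illustrate what "Hive-friendly" means here: partition columns are encoded as name=value path segments, so the partition of any object can be read straight off its key. The sketch below is a minimal illustration only; the function name and the example key (including the file name) are hypothetical and not taken from the deck.

```python
def parse_partitions(key):
    """Extract hive-style partition columns (name=value segments) from an S3 key."""
    partitions = {}
    for segment in key.strip("/").split("/"):
        name, sep, value = segment.partition("=")
        # Only segments of the form name=value describe a partition column.
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key that follows the year=/month=/day=/hour= layout.
example_key = "click/year=2017/month=11/day=24/hour=18/log.gz"
print(parse_partitions(example_key))
```

Segments without an "=" (the tag prefix and the file name) are simply ignored.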
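Athena partition-creation batch
• For some reason known in-house as "Fully Automatic Aibou"
• CloudWatch Events + Lambda
  • The Lambda function is written in Java
  • Defined as a DSL with codenize-tools/monosasi
• s3://bucket/path/to/year=2017/month=11/day=24/hour=18/
• The code can be published on request (though Glue is the more convenient option...)
• Can also handle the Firehose format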
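The core of such a batch is building one ADD PARTITION statement per hour. The deck's own implementation is a Java Lambda that is not shown, so the following is only a Python sketch of the idea under assumptions: the function name and the table name `click` are hypothetical, and the statement would still have to be submitted to Athena by whatever client the batch uses.

```python
from datetime import datetime


def add_partition_sql(table, base_location, ts):
    """Build an Athena ALTER TABLE ... ADD PARTITION statement for one hour."""
    spec = [
        ("year", ts.strftime("%Y")),
        ("month", ts.strftime("%m")),
        ("day", ts.strftime("%d")),
        ("hour", ts.strftime("%H")),
    ]
    partition = ", ".join("{}='{}'".format(k, v) for k, v in spec)
    # Hive-style prefix: .../year=2017/month=11/day=24/hour=18/
    prefix = "/".join("{}={}".format(k, v) for k, v in spec)
    location = "{}/{}/".format(base_location.rstrip("/"), prefix)
    return "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({}) LOCATION '{}'".format(
        table, partition, location
    )


# `click` is a hypothetical table name; the bucket path is the one on the slide.
print(add_partition_sql("click", "s3://bucket/path/to", datetime(2017, 11, 24, 18)))
```

Using IF NOT EXISTS keeps the statement safe to re-run if the scheduled batch fires twice for the same hour.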
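Glue DataCatalog
• Athena and Lambda run into rate limits
  • Unless the schedules are deliberately staggered, executions pile up at the same time
• Managing schemas and CloudWatch Events rules is painful
  • About 45 tables at present, and it is already a burden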
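One way to picture the staggering mentioned above is to give each table its own minute within the hour. The sketch below is an assumption about how that could be done, not the deck's method; the function and table names are hypothetical, and the printed strings use the six-field CloudWatch Events cron format.

```python
def staggered_minutes(tables, window_minutes=60):
    """Spread one scheduled run per table evenly across a time window."""
    if not tables:
        return {}
    step = window_minutes / float(len(tables))
    # Sort so that the assignment is stable between runs.
    return {table: int(i * step) for i, table in enumerate(sorted(tables))}


tables = ["imp", "click", "conversion"]  # hypothetical table names
for table, minute in sorted(staggered_minutes(tables).items()):
    # Fields: minutes hours day-of-month month day-of-week year
    print("{}: cron({} * * * ? *)".format(table, minute))
```

With about 45 tables this still leaves more than one minute between runs, but the schedule has to be regenerated whenever a table is added, which is part of the maintenance burden the slide describes.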
to Glue Crawler & DataCatalog
[Diagram labels: API servers, ALTER TABLE ADD PARTITION]
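Impressions
• Lambda is no longer needed (yay)
• One crawler can cover multiple source data stores
  • No sprawl of crawlers, which helps
• Crawling takes a long time when log volume is large
  • Currently about 6 hours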
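What Glue could not do
• Add tables to Athena in a different region
  • Handled with the partition-creation batch described earlier
• Apply the Data Catalog to existing Athena tables
  • Requires recreating the database/table (not done yet because of the impact)
• Specify or change Glue table names
• Unusual formats are not supported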
(2) Replacing the in-house ETL batch with Glue
JOINing logs with master data
• Log data: S3 / Redshift
  • imp, click, etc.
• Master data: RDS
  • campaign, creative, etc.
• "We want to JOIN them in Athena and Redshift"
• => Previously handled with Digdag + Embulk
Embulk + digdag (+ docker)
[Diagram labels: ALTER TABLE ADD PARTITION]
to Glue Crawler & Glue ETL
[Diagram labels: oregon region, tokyo region, replica; (1) and (2) mark two sets of metadata: add table with the metadata from (2), ETL with the metadata from (1)]
Additional
• stats tables: pre-aggregated daily data
• Upload only each day's data, not every row in the table
• Use the Filter transform class in the ETL job
[Diagram labels: stats table, S3 path/to/year/month/day]
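Filter transformer class
    from datetime import date, timedelta

    from awsglue.transforms import Filter

    # Assumption: the slide does not show how `yesterday` is defined;
    # one day before today is assumed here.
    yesterday = date.today() - timedelta(days=1)

    def filter_function(dynamic_record):
        # Keep only records whose "date" field falls on yesterday.
        return dynamic_record["date"].strftime("%Y-%m-%d") == yesterday.strftime("%Y-%m-%d")

    # datasource0 is the DynamicFrame created earlier in the job script.
    filtered0 = Filter.apply(frame=datasource0, f=filter_function, transformation_ctx="filtered0")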
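What was good about Glue
• ETL is completed within Glue alone
  • No Lambda
  • No Digdag + Embulk
  • No servers (ECS)
• Nobody has to enumerate tables by hand
  • Until now, every table was listed in Lambda or Embulk configuration files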
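What was not good about Glue
• Filtering happens on the ETL side, not in the Crawler
  • The SQL issued is SELECT * FROM hoge_stats;
  • We would like to set a WHERE clause on the Crawler side
• A job that did not finish in 18 hours (300 million records)
• Jobs cannot be cloned (we want this fairly urgently)
  • Creating several similar jobs is tedious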
(3) DynamoDB format conversion with Glue (not a replacement, though)
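We want to analyze DynamoDB data
• Gunosy's tab-order information
  • Which kinds of users follow which tabs
• Tabs that are added automatically under specific conditions
  • Is the feature working correctly?
  • How many such users are there?
• We want to aggregate these per day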
DynamoDB full dump and format
• Full dump with DataPipeline
• DynamoDB Streams: to be done later (wishful)
• Data type descriptors -> JSON
  • Painful to use as-is
  • Convert it (somehow)
    {
      "Item": {
        "Age": {"N": "8"},
        "Colors": {
          "L": [
            {"S": "White"},
            {"S": "Brown"},
            {"S": "Black"}
          ]
        },
        "Name": {"S": "Fido"},
        "Vaccinations": {
          "M": {
            "Rabies": {
              "L": [
                {"S": "2009-03-17"},
                {"S": "2011-09-21"},
                {"S": "2014-07-08"}
              ]
            },
            "Distemper": {"S": "2015-10-13"}
          }
        },
        "Breed": {"S": "Beagle"},
        "AnimalType": {"S": "Dog"}
      }
    }
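The conversion itself amounts to stripping the type descriptors recursively. The following is a minimal sketch of that idea in plain Python, not the deck's Glue job; the function names are hypothetical, and only the descriptors S, N, BOOL, NULL, L and M are handled (sets and binary values raise an error).

```python
def unmarshal(attr):
    """Convert one typed attribute such as {"S": "Fido"} into a plain value."""
    type_tag, value = next(iter(attr.items()))
    if type_tag == "S":
        return value
    if type_tag == "N":
        # Numbers arrive as strings; prefer int and fall back to float.
        try:
            return int(value)
        except ValueError:
            return float(value)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unmarshal(v) for v in value]
    if type_tag == "M":
        return {k: unmarshal(v) for k, v in value.items()}
    raise ValueError("unsupported DynamoDB type: " + type_tag)


def unmarshal_item(record):
    """Convert a whole {"Item": {...}} record into a plain dict."""
    return {k: unmarshal(v) for k, v in record["Item"].items()}


record = {"Item": {"Age": {"N": "8"}, "Name": {"S": "Fido"},
                   "Colors": {"L": [{"S": "White"}, {"S": "Brown"}]}}}
print(unmarshal_item(record))
```

In a Glue job, a function of this shape could serve as the per-record callback for a Map transform, after adapting it to the record type Glue passes in.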
Data Pipeline + Glue ETL
[Diagram labels: Amazon DynamoDB, DataPipeline; Glue ETL steps: Map.apply, toDF / fromDF]
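Implementing the ETL
• Sometimes the built-in transform classes alone are not enough
  • For more complex Filter / Map operations
  • Use toDF & fromDF (plain PySpark)
• try & except is mandatory when using job bookmarks
  • toDF fails (apparently partition-related)
  • The use cases for job bookmarks are still unclear to us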
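Pitfalls
• Exceptions raised in the callback functions of built-in transform classes are swallowed
  • The dynamic record that raised the exception is discarded and processing moves on to the next record
  • To debug, wrap the entire callback function in try-except
• Watch the subnet when using RDS as a source data store
  • The so-called "Lambda in VPC" situation
  • Make sure S3 is reachable through a NAT Gateway or VPC endpoint (needed to download the script)
  • For some reason, network settings take about 30 minutes to take effect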
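The try-except advice can be packaged as a decorator so that every callback logs its own failures before the framework has a chance to discard them. This is a sketch in plain Python under assumptions: the decorator and logger names are hypothetical, and the example callback is a simplified stand-in for a Glue Filter callback.

```python
import functools
import logging
import traceback

logger = logging.getLogger("glue_callback")  # hypothetical logger name


def log_exceptions(func):
    """Wrap a per-record callback so that any exception is logged with the record."""
    @functools.wraps(func)
    def wrapper(record):
        try:
            return func(record)
        except Exception:
            logger.error("callback %s failed on record %r\n%s",
                         func.__name__, record, traceback.format_exc())
            raise  # the caller may still drop the record, but the cause is now logged
    return wrapper


@log_exceptions
def filter_function(record):
    # Simplified stand-in for a Filter callback: a missing "date" key raises KeyError.
    return record["date"] == "2017-11-24"


print(filter_function({"date": "2017-11-24"}))
```

Re-raising keeps the original behavior unchanged; the only difference is that the failing record and its traceback reach the job log.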
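Overall impressions of Glue
• We want to replace all of our in-house ETL batches with Glue
  • The Data Catalog is excellent
• But as things stand, a lot is painful
  • Execution time, sudden job death, difficulty of development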
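Requests for Glue
• Above all, make it faster
• Support DynamoDB as a source data store
• Version control for job scripts
• Easier job debugging (errors get swallowed)
• Better documentation for the built-in transform classes
• Monitoring for crawlers and ETL jobs
• Let us write jobs in Scala
  • We want to reuse our EMR (Spark) assets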