AWS Summit Amsterdam 2023 - SVS204
Pubudu
June 18, 2023
Large scale parallel data processing with AWS Step Functions Distributed Maps
Transcript
Large scale parallel data processing with Step Functions Distributed Map
SVS204
About Me
Pubudu Jayawardana (@pubudusj)
From Amsterdam, the Netherlands
Senior Backend Developer at starred.com
AWS Community Builder (Serverless)
AWS Certified - SA Pro
https://medium.com/@pubudusj
https://pubudu.dev
https://dev.to/pubudusj
AWS Community Builders
Step Functions Distributed Map
Map State
▪ To iterate over an array
▪ Limitations
  • 40 parallel iterations at a time
  • Max payload size - 256KB
  • Execution history - 25,000 events
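The payload limit above can be checked before choosing between an inline Map state and a Distributed Map. The following is a minimal sketch, assuming the limit applies to the UTF-8 JSON encoding of the input array; the function name `fits_inline_map` and the sample items are hypothetical and not from the talk.

```python
import json

MAX_PAYLOAD_BYTES = 256 * 1024  # payload limit quoted on the slide (256KB)


def fits_inline_map(items):
    """Return True if the JSON-encoded array stays within the payload limit."""
    encoded = json.dumps(items).encode("utf-8")
    return len(encoded) <= MAX_PAYLOAD_BYTES


# Hypothetical inputs: a short list of table names and a much larger one.
small = [{"table": f"t{i}"} for i in range(100)]
large = [{"table": "x" * 100} for _ in range(5000)]

print(fits_inline_map(small), fits_inline_map(large))
```

When the check fails, the array cannot be passed to an inline Map state as state input, which is where reading the items from S3 becomes relevant.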
Distributed Map
▪ Totally separated child executions
  • 25,000 events each
  • 10,000 executions at a time
  • S3 as a source
▪ Result output to S3
▪ Only applicable for Standard flows
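A Distributed Map is a Map state whose processor runs in distributed mode. The sketch below builds such a state as a Python dictionary in Amazon States Language form. The bucket, key, Lambda ARN and the state name `ProcessItem` are placeholders chosen for illustration, and the helper `distributed_map_state` is hypothetical.

```python
import json


def distributed_map_state(bucket, key, child_lambda_arn):
    """Build a minimal Distributed Map state that reads a JSON array from S3."""
    return {
        "Type": "Map",
        # The item source is an object in S3 instead of the state input.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:getObject",
            "ReaderConfig": {"InputType": "JSON"},
            "Parameters": {"Bucket": bucket, "Key": key},
        },
        # Mode DISTRIBUTED runs each item as a separate child execution.
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
            "StartAt": "ProcessItem",
            "States": {
                "ProcessItem": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": child_lambda_arn, "Payload.$": "$"},
                    "End": True,
                }
            },
        },
        "End": True,
    }


# Placeholder values, not taken from the talk.
state = distributed_map_state(
    "my-input-bucket",
    "tables.json",
    "arn:aws:lambda:eu-west-1:123456789012:function:process-item",
)
print(json.dumps(state, indent=2))
```

The parent state machine that contains this state has to be a Standard workflow, as noted on the slide.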
Source
▪ Source types:
  • S3 object list
  • JSON file in S3
  • CSV file in S3
  • S3 manifest file
▪ Limit the number of items
▪ ItemSelector
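The four source types map to different `ItemReader` settings. A minimal sketch follows, assuming the reader fields `InputType`, `CSVHeaderLocation` and `MaxItems`; the helper `item_reader`, the `S3_LIST` label and all bucket, key and field names are hypothetical.

```python
def item_reader(source_type, bucket, key_or_prefix, max_items=None):
    """Build an ItemReader block for one of the four source types."""
    if source_type == "S3_LIST":
        # Iterate over the objects under a prefix.
        reader = {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": bucket, "Prefix": key_or_prefix},
        }
    elif source_type in ("JSON", "CSV", "MANIFEST"):
        # Iterate over the contents of a single file in S3.
        config = {"InputType": source_type}
        if source_type == "CSV":
            config["CSVHeaderLocation"] = "FIRST_ROW"
        reader = {
            "Resource": "arn:aws:states:::s3:getObject",
            "ReaderConfig": config,
            "Parameters": {"Bucket": bucket, "Key": key_or_prefix},
        }
    else:
        raise ValueError(f"unknown source type: {source_type}")
    if max_items is not None:
        # Limit the number of items read from the source.
        reader.setdefault("ReaderConfig", {})["MaxItems"] = max_items
    return reader


# ItemSelector reshapes each item before it reaches a child execution.
# The field names "table" and "runId" are illustrative only.
item_selector = {
    "table.$": "$$.Map.Item.Value.table",
    "runId.$": "$$.Execution.Name",
}

csv_reader = item_reader("CSV", "my-input-bucket", "tables.csv", max_items=100)
list_reader = item_reader("S3_LIST", "my-input-bucket", "exports/")
```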
Item Batching
▪ Batching based on:
  • Number of items
  • Size
▪ Modify input with Batch input
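To make the batching options concrete, the sketch below pairs an `ItemBatcher` block with a local simulation of what each child execution would receive. The simulation covers count-based batching only; size-based batching would use `MaxInputBytesPerBatch` instead. The function `simulate_batches`, the table names and the `targetDb` field are hypothetical.

```python
# ItemBatcher block: at most 10 items per child execution, plus a fixed
# BatchInput object that is merged into every batch.
item_batcher = {
    "MaxItemsPerBatch": 10,
    "BatchInput": {"targetDb": "postgres"},
}


def simulate_batches(items, batcher):
    """Mimic locally how items are grouped for each child execution."""
    size = batcher["MaxItemsPerBatch"]
    batches = []
    for start in range(0, len(items), size):
        batch = {"Items": items[start:start + size]}
        if "BatchInput" in batcher:
            batch["BatchInput"] = batcher["BatchInput"]
        batches.append(batch)
    return batches


tables = [{"table": f"table_{i}"} for i in range(25)]
batches = simulate_batches(tables, item_batcher)
print([len(b["Items"]) for b in batches])  # [10, 10, 5]
```

With batching enabled, the child workflow has to loop over `Items` itself, so one child execution handles several items.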
Runtime Settings
▪ Concurrency limit
▪ Child execution types:
  • Standard
  • Express
▪ Error threshold:
  • Percentage
  • Number of items
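These runtime settings correspond to a handful of fields on the Map state. A minimal sketch, assuming the field names `MaxConcurrency`, `ToleratedFailurePercentage` and `ToleratedFailureCount`; the helper `apply_runtime_settings`, the state name `LoadTable` and the chosen values are hypothetical.

```python
import copy


def apply_runtime_settings(state, max_concurrency, execution_type="EXPRESS",
                           tolerated_failure_percentage=None,
                           tolerated_failure_count=None):
    """Return a copy of a Map state with the runtime settings applied."""
    if execution_type not in ("STANDARD", "EXPRESS"):
        raise ValueError(f"unknown execution type: {execution_type}")
    updated = copy.deepcopy(state)
    # Concurrency limit: how many child executions run at the same time.
    updated["MaxConcurrency"] = max_concurrency
    # Child execution type lives inside the ItemProcessor block.
    processor = updated.setdefault("ItemProcessor", {})
    processor["ProcessorConfig"] = {
        "Mode": "DISTRIBUTED",
        "ExecutionType": execution_type,
    }
    # Error threshold: as a percentage, as a number of items, or both.
    if tolerated_failure_percentage is not None:
        updated["ToleratedFailurePercentage"] = tolerated_failure_percentage
    if tolerated_failure_count is not None:
        updated["ToleratedFailureCount"] = tolerated_failure_count
    return updated


base = {"Type": "Map", "ItemProcessor": {"StartAt": "LoadTable", "States": {}}}
tuned = apply_runtime_settings(base, max_concurrency=50,
                               tolerated_failure_percentage=5)
```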
Export Result
▪ S3 location
▪ Logs
  • manifest.json
  • SUCCEEDED_n.json
  • FAILED_n.json
  • PENDING_n.json
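The export location is set with a `ResultWriter` block, and `manifest.json` then points at the per-status result files. The sketch below assumes the manifest groups files under a `ResultFiles` key by status; that layout, the sample manifest, the bucket names and the helper `summarize_manifest` are illustrative assumptions and should be checked against a real Map Run export.

```python
# ResultWriter block: where the Map Run writes its results.
result_writer = {
    "Resource": "arn:aws:states:::s3:putObject",
    "Parameters": {"Bucket": "my-results-bucket", "Prefix": "map-runs"},
}


def summarize_manifest(manifest):
    """List the result file keys per status from a parsed manifest.json."""
    files = manifest.get("ResultFiles", {})
    return {
        status: [entry["Key"] for entry in files.get(status, [])]
        for status in ("SUCCEEDED", "FAILED", "PENDING")
    }


# Illustrative manifest content (assumed shape, made-up values).
sample_manifest = {
    "DestinationBucket": "my-results-bucket",
    "ResultFiles": {
        "FAILED": [{"Key": "map-runs/run-1/FAILED_0.json", "Size": 512}],
        "PENDING": [],
        "SUCCEEDED": [
            {"Key": "map-runs/run-1/SUCCEEDED_0.json", "Size": 2048},
            {"Key": "map-runs/run-1/SUCCEEDED_1.json", "Size": 1024},
        ],
    },
}

summary = summarize_manifest(sample_manifest)
```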
Execution Details - Parent Event Log
Execution Details - Map Run
Execution Details - Single Child Execution
Process
▪ SaaS application to measure candidate experience
▪ Send surveys
▪ Record the feedback
▪ Visualize in a dashboard (benchmark, filter, comparison)
▪ Transform / Enrich data
Hourly ETL
▪ Source from MySQL → Transform → Save to S3 → Load to Postgres
▪ Amazon Managed Workflows for Apache Airflow
▪ Amazon EMR
Problem
Data Load Step
▪ Less visibility
▪ Cannot retry single table load
▪ Takes avg 20 minutes
▪ EC2 cost
Solution
Demo
Benefits
▪ Reduced time to avg 5 minutes
▪ Load data in parallel
▪ Better insights
▪ Retry individual table data load
▪ Cost effective
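One way to retry only the failed table loads is to rebuild the input list from the exported failure records. This is a sketch under assumptions, not the approach shown in the talk: it assumes each record in a `FAILED_n.json` file carries the child execution's original input as a JSON string under `Input`, and the helper `failed_inputs`, the table names and the ARNs are hypothetical.

```python
import json


def failed_inputs(failed_records):
    """Collect the original inputs of failed child executions."""
    return [json.loads(record["Input"]) for record in failed_records]


# Illustrative records in the assumed shape of a FAILED_n.json file.
records = [
    {"ExecutionArn": "arn:example:1", "Status": "FAILED",
     "Input": json.dumps({"table": "users"})},
    {"ExecutionArn": "arn:example:2", "Status": "FAILED",
     "Input": json.dumps({"table": "orders"})},
]

retry_items = failed_inputs(records)
print(retry_items)
```

The resulting list could be written back to S3 as a JSON file and used as the source for a follow-up run that covers only those tables.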
Tips / Lessons Learned
▪ Use batching
▪ Set concurrency
▪ Set error threshold
▪ Use express child executions
Read more about this: https://bit.ly/s3-to-postgres
Thank You! @pubudusj