Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
AWS Summit Amsterdam 2023 - SVS204
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Pubudu
June 18, 2023
Technology
1
22
AWS Summit Amsterdam 2023 - SVS204
Large scale parallel data processing with AWS Step Functions Distributed Maps
Pubudu
June 18, 2023
Tweet
Share
More Decks by Pubudu
See All by Pubudu
Moving from single tenant to multi tenant
pubudusj
0
42
COM202 Dev Chat at re:Invent 2022
pubudusj
1
84
Manage webhooks at scale with AWS Serverless
pubudusj
0
55
Smart Doorbell with AWS Serverless - AWS UG Coimbatore
pubudusj
0
66
Smart Doorbell with AWS Serverless - Serverless Summit 21
pubudusj
0
96
Other Decks in Technology
See All in Technology
AIエージェントに必要なのはデータではなく文脈だった/ai-agent-context-graph-mybest
jonnojun
0
130
コミュニティが変えるキャリアの地平線:コロナ禍新卒入社のエンジニアがAWSコミュニティで見つけた成長の羅針盤
kentosuzuki
0
130
All About Sansan – for New Global Engineers
sansan33
PRO
1
1.4k
Agent Skils
dip_tech
PRO
0
120
2026年、サーバーレスの現在地 -「制約と戦う技術」から「当たり前の実行基盤」へ- /serverless2026
slsops
2
260
プロポーザルに込める段取り八分
shoheimitani
1
500
AWS Network Firewall Proxyを触ってみた
nagisa53
1
240
量子クラウドサービスの裏側 〜Deep Dive into OQTOPUS〜
oqtopus
0
130
SchooでVue.js/Nuxtを技術選定している理由
yamanoku
3
100
Embedded SREの終わりを設計する 「なんとなく」から計画的な自立支援へ
sansantech
PRO
3
2.5k
今日から始めるAmazon Bedrock AgentCore
har1101
4
410
SREが向き合う大規模リアーキテクチャ 〜信頼性とアジリティの両立〜
zepprix
0
460
Featured
See All Featured
Skip the Path - Find Your Career Trail
mkilby
0
57
WENDY [Excerpt]
tessaabrams
9
36k
The Illustrated Children's Guide to Kubernetes
chrisshort
51
51k
sira's awesome portfolio website redesign presentation
elsirapls
0
150
Beyond borders and beyond the search box: How to win the global "messy middle" with AI-driven SEO
davidcarrasco
1
56
Speed Design
sergeychernyshev
33
1.5k
Raft: Consensus for Rubyists
vanstee
141
7.3k
VelocityConf: Rendering Performance Case Studies
addyosmani
333
24k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
9
1.2k
Typedesign – Prime Four
hannesfritz
42
2.9k
jQuery: Nuts, Bolts and Bling
dougneiner
65
8.4k
Ethics towards AI in product and experience design
skipperchong
2
200
Transcript
Large scale parallel data processing with Step Functions Distributed Map
SVS204
About Me Pubudu Jayawardana @pubudusj From Amsterdam, the Netherlands Senior
Backend Developer at starred.com AWS Community Builder (Serverless) AWS Certified - SA Pro https://medium.com/@pubudusj https://pubudu.dev https://dev.to/pubudusj
AWS Community Builders
Step Functions Distributed Map
▪ To iterate over an array ▪ Limitations • 40
parallel iterations at a time • Max payload size - 256KB • Execution history - 25,000 events Map State
▪ Totally separated child executions • 25,000 events each •
10,000 executions at a time • S3 as a source ▪ Result output to S3 ▪ Only applicable for Standard flows Distributed Map
Distributed Map
▪ Source types: • S3 object list • JSON file
in S3 • CSV file in S3 • S3 manifest file ▪ Limit no of items ▪ ItemSelector Source
▪ Batching based on: • No of items • Size
▪ Modify input with Batch input Item Batching
▪ Concurrency limit ▪ Child execution types: • Standard •
Express ▪ Error threshold: • Percentage • No of items Runtime Settings
▪ S3 location ▪ Logs • manifest.json • SUCCEEDED_n.json •
FAILED_n.json • PENDING_n.json Export Result
Execution Details - Parent Event Log
Execution Details - Map Run
Execution Details - Single Child Execution
Process
▪ SAAS application to measure candidate experience ▪ Send surveys
▪ Record the feedback ▪ Visualize in a dashboard (benchmark, filter, comparison) ▪ Transform / Enrich data Process
Source from MySQL Transform Save to S3 Load to Postgres
Hourly ETL
▪ Amazon Managed Workflows for Apache Airflow ▪ Amazon EMR
Hourly ETL
None
Problem
▪ Less visibility ▪ Cannot retry single table load ▪
Takes avg 20 minutes ▪ EC2 cost Data Load Step
Solution
None
None
Demo
▪ Reduced time to avg 5 minutes ▪ Load data
parallelly ▪ Better insights ▪ Retry individual table data load ▪ Cost effective Benefits
▪ Use batching ▪ Set concurrency ▪ Set error threshold
▪ Use express child executions Tips / Lesson Learned
https://bit.ly/s3-to-postgres Read more about this
Thank You! @pubudusj