AWS Summit Amsterdam 2023 - SVS204
Pubudu
June 18, 2023
Large scale parallel data processing with AWS Step Functions Distributed Maps
Transcript
Large scale parallel data processing with Step Functions Distributed Map
SVS204
About Me
▪ Pubudu Jayawardana (@pubudusj)
▪ From Amsterdam, the Netherlands
▪ Senior Backend Developer at starred.com
▪ AWS Community Builder (Serverless)
▪ AWS Certified - SA Pro
▪ https://medium.com/@pubudusj
▪ https://pubudu.dev
▪ https://dev.to/pubudusj
AWS Community Builders
Step Functions Distributed Map
Map State
▪ Iterates over an array
▪ Limitations
  • 40 parallel iterations at a time
  • Max payload size: 256 KB
  • Execution history: 25,000 events
Distributed Map
▪ Fully separate child executions
  • 25,000 events each
  • 10,000 executions at a time
  • S3 as a source
▪ Result output to S3
▪ Only applicable to Standard workflows
Distributed Map
Source
▪ Source types:
  • S3 object list
  • JSON file in S3
  • CSV file in S3
  • S3 manifest file
▪ Limit the number of items
▪ ItemSelector
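As a rough illustration of the source options above, the following Python sketch builds the `ItemReader` and `ItemSelector` fragments of a Distributed Map state for a CSV file in S3. The bucket, key, and the `table_name` column are hypothetical, and the field names reflect the Amazon States Language as documented at the time of writing, so check them against the current AWS documentation before relying on them.

```python
import json
from typing import Optional


def csv_item_reader(bucket: str, key: str, max_items: Optional[int] = None) -> dict:
    """Build an ItemReader fragment that reads rows from a CSV file in S3."""
    reader_config = {"InputType": "CSV", "CSVHeaderLocation": "FIRST_ROW"}
    if max_items is not None:
        # Corresponds to "limit the number of items" on the slide.
        reader_config["MaxItems"] = max_items
    return {
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": reader_config,
        "Parameters": {"Bucket": bucket, "Key": key},
    }


# Hypothetical bucket and key, used only for illustration.
item_reader = csv_item_reader("example-etl-bucket", "exports/tables.csv", max_items=1000)

# ItemSelector reshapes each item before it reaches a child execution.
# "table_name" is an assumed CSV column, not something taken from the talk.
item_selector = {
    "table.$": "$$.Map.Item.Value.table_name",
    "parent_execution.$": "$$.Execution.Name",
}

print(json.dumps({"ItemReader": item_reader, "ItemSelector": item_selector}, indent=2))
```

Leaving `max_items` unset omits `MaxItems` entirely, so the whole file is read.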
Item Batching
▪ Batching based on:
  • Number of items
  • Size
▪ Modify input with Batch input
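To make the two batching criteria concrete, here is a local Python simulation that groups items by a maximum item count and a maximum serialized size, and attaches an optional batch-level input. It is only a sketch of the idea: the real batching is performed by the Step Functions service, and `batch_items` and its parameters are hypothetical names.

```python
import json
from typing import Any, Iterable, Iterator, List, Optional


def batch_items(
    items: Iterable[Any],
    max_items: int,
    max_bytes: int,
    batch_input: Optional[dict] = None,
) -> Iterator[dict]:
    """Yield child-execution style payloads, closing a batch when either limit is hit."""

    def payload(batch: List[Any]) -> dict:
        result = {"Items": batch}
        if batch_input is not None:
            result["BatchInput"] = batch_input
        return result

    def size(batch: List[Any]) -> int:
        return len(json.dumps(payload(batch)).encode("utf-8"))

    current: List[Any] = []
    for item in items:
        candidate = current + [item]
        # Close the current batch if adding this item would break a limit.
        if current and (len(candidate) > max_items or size(candidate) > max_bytes):
            yield payload(current)
            current = [item]
        else:
            current = candidate
    if current:
        yield payload(current)


tables = [{"table": f"t{i}"} for i in range(7)]
batches = list(batch_items(tables, max_items=3, max_bytes=10_000, batch_input={"run": "hourly"}))
for batch in batches:
    print(len(batch["Items"]), batch["BatchInput"])
```

With seven items and a limit of three per batch, the simulation produces batches of three, three, and one item, each carrying the same `BatchInput`.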
Runtime Settings
▪ Concurrency limit
▪ Child execution types:
  • Standard
  • Express
▪ Error threshold:
  • Percentage
  • Number of items
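The error threshold can be reasoned about with a small helper. The sketch below assumes the Map Run fails once failures exceed a tolerated percentage or a tolerated item count, that any failure is fatal when neither is set, and that either limit can trip the failure when both are set. These semantics are an assumption based on how `ToleratedFailurePercentage` and `ToleratedFailureCount` are described, and `map_run_should_fail` is a hypothetical name.

```python
from typing import Optional


def map_run_should_fail(
    failed: int,
    total: int,
    tolerated_percentage: Optional[float] = None,
    tolerated_count: Optional[int] = None,
) -> bool:
    """Return True when the failures exceed the configured error threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    if tolerated_percentage is None and tolerated_count is None:
        # Assumed default: no tolerance, so a single failure fails the run.
        return failed > 0
    over_percentage = (
        tolerated_percentage is not None
        and (100.0 * failed / total) > tolerated_percentage
    )
    over_count = tolerated_count is not None and failed > tolerated_count
    return over_percentage or over_count


print(map_run_should_fail(4, 100, tolerated_percentage=5))
print(map_run_should_fail(6, 100, tolerated_percentage=5))
```

A percentage threshold suits runs whose size varies, while a count threshold caps the absolute number of failed items.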
Export Result
▪ S3 location
▪ Logs
  • manifest.json
  • SUCCEEDED_n.json
  • FAILED_n.json
  • PENDING_n.json
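Once results are exported, `manifest.json` can be used to locate the per-status result files. The sketch below parses a manifest that has already been loaded into a dict. The sample manifest is made up, and its shape (`ResultFiles` keyed by `SUCCEEDED`, `FAILED`, and `PENDING`, each holding entries with a `Key`) is an assumption based on the documented export format, so verify it against a real export.

```python
from typing import Dict, List

# Made-up manifest for illustration; the bucket, ARN, and keys are placeholders.
SAMPLE_MANIFEST = {
    "DestinationBucket": "example-results-bucket",
    "MapRunArn": "arn:aws:states:eu-west-1:123456789012:mapRun:example/run-1",
    "ResultFiles": {
        "FAILED": [{"Key": "results/run-1/FAILED_0.json", "Size": 512}],
        "PENDING": [],
        "SUCCEEDED": [
            {"Key": "results/run-1/SUCCEEDED_0.json", "Size": 2048},
            {"Key": "results/run-1/SUCCEEDED_1.json", "Size": 1024},
        ],
    },
}


def result_keys_by_status(manifest: dict) -> Dict[str, List[str]]:
    """Map each status to the S3 keys of its result files."""
    files = manifest.get("ResultFiles", {})
    return {
        status: [entry["Key"] for entry in files.get(status, [])]
        for status in ("SUCCEEDED", "FAILED", "PENDING")
    }


keys = result_keys_by_status(SAMPLE_MANIFEST)
for status, paths in keys.items():
    print(f"{status}: {len(paths)} file(s)")
```

The keys listed under `FAILED` are a natural starting point when deciding which items to retry.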
Execution Details - Parent Event Log
Execution Details - Map Run
Execution Details - Single Child Execution
Process
▪ SaaS application to measure candidate experience
▪ Send surveys
▪ Record the feedback
▪ Visualize in a dashboard (benchmark, filter, comparison)
▪ Transform / enrich data
Hourly ETL
▪ Source from MySQL
▪ Transform
▪ Save to S3
▪ Load to Postgres
Hourly ETL
▪ Amazon Managed Workflows for Apache Airflow
▪ Amazon EMR
Problem
Data Load Step
▪ Limited visibility
▪ Cannot retry a single table load
▪ Takes 20 minutes on average
▪ EC2 cost
Solution
Demo
Benefits
▪ Reduced time to 5 minutes on average
▪ Load data in parallel
▪ Better insights
▪ Retry individual table data loads
▪ Cost effective
Tips / Lessons Learned
▪ Use batching
▪ Set concurrency
▪ Set error threshold
▪ Use Express child executions
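The four tips can be combined in a single Distributed Map state. The Python sketch below assembles such a state as a dict and prints it as JSON. It is not the definition used in the talk: the bucket, prefixes, Lambda ARN, and the specific limits are placeholder assumptions, and the field names should be checked against the current Amazon States Language documentation.

```python
import json


def distributed_map_state(bucket: str, source_prefix: str, result_prefix: str, worker_arn: str) -> dict:
    """Build a Distributed Map state that applies batching, concurrency,
    an error threshold, and Express child executions."""
    return {
        "Type": "Map",
        # Source: list objects under an S3 prefix.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": bucket, "Prefix": source_prefix},
        },
        # Tip: use batching (placeholder batch size).
        "ItemBatcher": {"MaxItemsPerBatch": 10},
        # Tip: set concurrency (placeholder limit).
        "MaxConcurrency": 50,
        # Tip: set an error threshold (placeholder percentage).
        "ToleratedFailurePercentage": 5,
        "ItemProcessor": {
            # Tip: use Express child executions.
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
            "StartAt": "LoadBatch",
            "States": {
                "LoadBatch": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": worker_arn, "Payload.$": "$"},
                    "End": True,
                }
            },
        },
        # Export results to S3.
        "ResultWriter": {
            "Resource": "arn:aws:states:::s3:putObject",
            "Parameters": {"Bucket": bucket, "Prefix": result_prefix},
        },
        "End": True,
    }


state = distributed_map_state(
    bucket="example-etl-bucket",
    source_prefix="hourly/",
    result_prefix="map-results/",
    worker_arn="arn:aws:lambda:eu-west-1:123456789012:function:example-loader",
)
print(json.dumps(state, indent=2))
```

Keeping the limits as plain values in one place makes it easy to tune batch size and concurrency against what the downstream database can absorb.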
Read more about this: https://bit.ly/s3-to-postgres
Thank You! @pubudusj