AWS Summit Amsterdam 2023 - SVS204
Pubudu
June 18, 2023
Large scale parallel data processing with AWS Step Functions Distributed Maps
Transcript
Large scale parallel data processing with Step Functions Distributed Map
SVS204
About Me
Pubudu Jayawardana
@pubudusj
From Amsterdam, the Netherlands
Senior Backend Developer at starred.com
AWS Community Builder (Serverless)
AWS Certified - SA Pro
https://medium.com/@pubudusj
https://pubudu.dev
https://dev.to/pubudusj
AWS Community Builders
Step Functions Distributed Map
Map State
▪ To iterate over an array
▪ Limitations
  • 40 parallel iterations at a time
  • Max payload size - 256KB
  • Execution history - 25,000 events
Distributed Map
▪ Totally separate child executions
  • 25,000 events each
  • 10,000 executions at a time
  • S3 as a source
▪ Result output to S3
▪ Only applicable to Standard workflows
Distributed Map
Source
▪ Source types:
  • S3 object list
  • JSON file in S3
  • CSV file in S3
  • S3 manifest file
▪ Limit the number of items
▪ ItemSelector
Item Batching
▪ Batching based on:
  • Number of items
  • Size
▪ Modify input with Batch input
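When batching is enabled, each child execution receives a group of items instead of a single one. Below is a minimal sketch of a Lambda-style handler for such a batch, assuming the child input has the shape `{"BatchInput": {...}, "Items": [...]}`; the `table` and `target_schema` keys are hypothetical and stand in for whatever the real items carry.

```python
def handler(event, context=None):
    """Process one batch produced by a Distributed Map with item batching.

    Assumed input shape:
      {"BatchInput": {...shared values...}, "Items": [{...}, {...}]}
    """
    shared = event.get("BatchInput", {})  # same for every item in the batch
    items = event.get("Items", [])        # the items grouped into this batch

    results = []
    for item in items:
        # Placeholder for the real work, e.g. loading one table.
        results.append(
            {
                "table": item["table"],                 # hypothetical key
                "schema": shared.get("target_schema"),  # hypothetical key
                "status": "loaded",
            }
        )
    return {"processed": len(results), "results": results}


# Local usage example with an illustrative event.
sample_event = {
    "BatchInput": {"target_schema": "analytics"},
    "Items": [{"table": "surveys"}, {"table": "responses"}],
}
output = handler(sample_event)
print(output["processed"])
```

Handling a whole batch per invocation reduces the number of child executions and invocations compared with one item per execution.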
Runtime Settings
▪ Concurrency limit
▪ Child execution types:
  • Standard
  • Express
▪ Error threshold:
  • Percentage
  • Number of items
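These runtime settings correspond to a handful of fields on the Map state. The sketch below assembles them in Python, assuming the Amazon States Language names `MaxConcurrency`, `ToleratedFailurePercentage`, `ToleratedFailureCount`, and `ProcessorConfig.ExecutionType`; the helper `runtime_settings` and the numbers passed to it are illustrative, not values from the talk.

```python
def runtime_settings(
    max_concurrency,
    execution_type="EXPRESS",
    tolerated_failure_percentage=None,
    tolerated_failure_count=None,
):
    """Return a partial Distributed Map state with the runtime settings filled in."""
    if execution_type not in ("STANDARD", "EXPRESS"):
        raise ValueError("execution_type must be STANDARD or EXPRESS")

    state = {
        "Type": "Map",
        "MaxConcurrency": max_concurrency,  # concurrency limit
        "ItemProcessor": {
            "ProcessorConfig": {
                "Mode": "DISTRIBUTED",
                "ExecutionType": execution_type,  # child execution type
            }
        },
    }
    # Error threshold, expressed as a percentage and/or a number of items.
    if tolerated_failure_percentage is not None:
        if not 0 <= tolerated_failure_percentage <= 100:
            raise ValueError("percentage must be between 0 and 100")
        state["ToleratedFailurePercentage"] = tolerated_failure_percentage
    if tolerated_failure_count is not None:
        state["ToleratedFailureCount"] = tolerated_failure_count
    return state


settings = runtime_settings(50, "EXPRESS", tolerated_failure_percentage=5)
print(settings)
```

The result is deliberately incomplete: the `ItemProcessor` still needs its `StartAt` and `States` entries before it forms a valid state.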
Export Result
▪ S3 location
▪ Logs
  • manifest.json
  • SUCCEEDED_n.json
  • FAILED_n.json
  • PENDING_n.json
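Once the results are exported, `manifest.json` can serve as an index to the per-status result files. The following sketch lists the result files for a given status, assuming the manifest contains a `ResultFiles` object keyed by `SUCCEEDED`, `FAILED`, and `PENDING`, each holding entries with a `Key`; the sample manifest, bucket, ARN, and keys are made up for illustration, and the real layout should be verified against an actual export.

```python
import json

# Illustrative manifest only; all values here are hypothetical.
SAMPLE_MANIFEST = json.loads(
    """
    {
      "DestinationBucket": "example-results-bucket",
      "MapRunArn": "arn:aws:states:eu-west-1:123456789012:mapRun:example/run:1",
      "ResultFiles": {
        "FAILED": [{"Key": "out/run-1/FAILED_0.json", "Size": 512}],
        "PENDING": [],
        "SUCCEEDED": [
          {"Key": "out/run-1/SUCCEEDED_0.json", "Size": 2048},
          {"Key": "out/run-1/SUCCEEDED_1.json", "Size": 1024}
        ]
      }
    }
    """
)


def result_keys(manifest, status):
    """Return the S3 keys of the result files recorded for one status."""
    files = manifest.get("ResultFiles", {}).get(status, [])
    return [entry["Key"] for entry in files]


failed = result_keys(SAMPLE_MANIFEST, "FAILED")
succeeded = result_keys(SAMPLE_MANIFEST, "SUCCEEDED")
print(len(succeeded), "succeeded file(s);", len(failed), "failed file(s)")
```

In a real pipeline the manifest would be fetched from the configured S3 location first, and the `FAILED` keys could feed a retry of only the affected items.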
Execution Details - Parent Event Log
Execution Details - Map Run
Execution Details - Single Child Execution
Process
Process
▪ SaaS application to measure candidate experience
▪ Send surveys
▪ Record the feedback
▪ Visualize in a dashboard (benchmark, filter, comparison)
▪ Transform / Enrich data
Hourly ETL
▪ Source from MySQL
▪ Transform
▪ Save to S3
▪ Load to Postgres
Hourly ETL
▪ Amazon Managed Workflows for Apache Airflow
▪ Amazon EMR
Problem
Data Load Step
▪ Less visibility
▪ Cannot retry a single table load
▪ Takes 20 minutes on average
▪ EC2 cost
Solution
Demo
Benefits
▪ Reduced time to 5 minutes on average
▪ Load data in parallel
▪ Better insights
▪ Retry individual table data load
▪ Cost effective
Tips / Lessons Learned
▪ Use batching
▪ Set concurrency
▪ Set error threshold
▪ Use express child executions
Read more about this: https://bit.ly/s3-to-postgres
Thank You! @pubudusj