Slide 1

Luciano Mammino (@loige) Serverless for HPC πŸš€ fth.link/cm22

Slide 2

Is Serverless a good option for High Performance Computing? @loige

Slide 3

πŸ‘‹ Hello, I am Luciano Senior architect nodejsdesignpatterns.com Let’s connect: 🌎 loige.co 🐦 @loige πŸŽ₯ loige 🧳 lucianomammino

Slide 4

Middy Framework SLIC Starter - Serverless Accelerator SLIC Watch - Observability Plugin Business-focused technologists. Accelerated Serverless | AI as a Service | Platform Modernisation

Slide 5

We host a podcast about AWS and Cloud computing πŸ”— awsbites.com 🎬 YouTube Channel πŸŽ™ Podcast πŸ“… Episodes every week @loige

Slide 6

Get the slides: fth.link/cm22 @loige

Slide 7

Agenda ● The 6 Rs of Cloud Migration ● A serverless case study β—‹ The problem space and types of workflows β—‹ Original on premise implementation β—‹ The PoC β—‹ The final production version β—‹ The components of a serverless job scheduler β—‹ Challenges & Limits @loige fth.link/cm22

Slide 8

The 6 Rs of Cloud Migrations @loige πŸ—‘ πŸ•Έ 🚚 Retire Retain Rehost πŸ— πŸ“ πŸ’° Replatform Refactor Repurchase fth.link/cm22

Slide 9

A case study @loige Case study on AWS blog: fth.link/awshpc

Slide 10

The workloads - Risk Rollup 🏦 Financial modeling to understand the portfolio of risk 🧠 Internal, custom-built risk model on all reinsurance deals βš™ HPC (High-Performance Computing) workload πŸ—„ ~45TB of data processed ⏱ 2–3 rollups per day (6–8 hours each!) @loige

Slide 11

The workloads - Deal Analytics ⚑ Near real-time deal pricing using the same risk model πŸ—ƒ Lower data volumes πŸ” High frequency of execution – up to 1,000 per day @loige

Slide 12

Original on-prem implementation @loige

Slide 13

Challenges 🐒 Long execution times, constraining business agility πŸ₯Š Competing workloads πŸ“ˆ Limits our ability to support portfolio growth 😩 Can’t deliver new features 🧾 Very high total cost of ownership @loige

Slide 14

Thinking Big πŸ’­ Imagine a solution that would … 1. Offer a dramatic increase in performance 2. Provide consistent run times 3. Support more executions, more often 4. Support future portfolio growth and new capabilities – 15x data volumes @loige

Slide 15

The Goal ⚽ Run a Risk Rollup in 1 hour! @loige

Slide 16

Architecture Options for Compute/Orchestration @loige AWS Lambda Amazon SQS AWS Step Functions AWS Fargate

Slide 17

POC Architecture @loige AWS Batch S3 Step Functions Lambda SQS

Slide 18

Measure Everything! πŸ“ ⏱ Built metrics in from the start 󰀈 AWS metrics we wish existed out of the box: - Number of running containers - Success/failure counts 🎨 Custom metrics: - Scheduler overhead - Detailed timings (job duration, I/O time, algorithm steps) πŸ›  Using CloudWatch, EMF @loige

Slide 19

Measure Everything! πŸ“ πŸ‘ Rollup in 1 hour ☁ Running on AWS Batch πŸ‘Ž Cluster utilisation was <50% βœ… Goal success πŸ€” Understanding of what needs to be addressed @loige

Slide 20

Beyond the PoC Production: optimise for unique workload characteristics @loige

Slide 21

Job Plan @loige

Slide 22

In reality, not all jobs are alike! @loige

Slide 23

Horizontal scaling πŸš€ 1,000s of jobs Duration: 1 second – 45 minutes Scaling horizontally = splitting jobs Jobs split according to their complexity/duration Resulting in >1 million jobs @loige
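A toy illustration of the splitting idea, assuming a fixed per-chunk duration target (the real split follows the risk model's own structure, and the job names and target here are invented):

```python
import math

def split_job(job_id, estimated_minutes, target_minutes=5):
    """Split one long-running job into chunks of roughly target_minutes each.
    Short jobs stay whole; a 45-minute job becomes 9 five-minute chunks."""
    n_chunks = max(1, math.ceil(estimated_minutes / target_minutes))
    return [
        {"id": f"{job_id}-{i}", "estimated_minutes": estimated_minutes / n_chunks}
        for i in range(n_chunks)
    ]

# Hypothetical plan: durations range from ~1 second to 45 minutes
plan = [("deal-a", 45), ("deal-b", 1 / 60), ("deal-c", 12)]
jobs = [chunk for job_id, mins in plan for chunk in split_job(job_id, mins)]
```

Normalising job durations this way is what makes a fleet of uniform, short-lived workers (Lambda functions or Fargate tasks) an efficient fit for a very uneven workload.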

Slide 24

Moving to production 🚒 @loige

Slide 25

Scope @loige

Slide 26

Actual End to End overview @loige

Slide 27

Modelling Worker @loige

Slide 28

Compute Services AWS Fargate: Scales to 1,000s of tasks (containers) – Little management overhead – Up to 4 vCPUs and 30GB memory – Up to 200GB ephemeral storage AWS Lambda: Scales to 1,000s of function containers (in seconds!) – Very little management overhead – Up to 6 vCPUs and 10GB memory – Up to 10GB ephemeral storage It wasn’t always this way! @loige

Slide 29

Store all the things in S3! The source of truth for: ● Input Data (JSON, Parquet) ● Intermediate Data (Parquet) ● Results (Parquet) ● Aggregates (Parquet) Input data: 20GB Output data: ~1 TB Reads and writes: 10,000s of objects per second. @loige

Slide 30

Scheduling and Orchestration βœ… We have our cluster (Fargate or Lambda) βœ… We have a plan! (list of jobs, parameters and dependencies) πŸ€” How do we feed this plan to the cluster?! 🀨 Existing schedulers use traditional clusters – there is no serverless job scheduler for workloads like this! @loige

Slide 31

Lifecycle of a Job @loige A new job gets queued here πŸ‘‡ A worker picks up the job and executes it The worker emits the job state (success or failure)
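The lifecycle above can be sketched as a worker loop. In this sketch an in-memory queue and list stand in for the real SQS queue and job-state stream, and the job payloads are invented:

```python
import queue

job_queue = queue.Queue()  # stand-in for the SQS job queue
state_events = []          # stand-in for the emitted job-state stream

def worker(run_job):
    """Pull jobs until the queue is empty, run each one,
    and emit its final state (success or failure)."""
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            return
        try:
            run_job(job)
            state_events.append({"job": job["id"], "state": "SUCCEEDED"})
        except Exception as exc:
            state_events.append({"job": job["id"], "state": "FAILED", "error": str(exc)})

# Two hypothetical jobs, one of which fails
job_queue.put({"id": "job-1"})
job_queue.put({"id": "job-2", "fail": True})

def run_job(job):
    if job.get("fail"):
        raise RuntimeError("model error")

worker(run_job)
```

Emitting an explicit state event for every outcome, failures included, is what lets the downstream scheduler decide what can run next.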

Slide 32

Event-Driven Scheduler @loige Job states are pulled from a Kinesis Data Stream Redis stores: - Job states - Dependencies This scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next
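A minimal sketch of that dependency check, with plain dicts standing in for the Redis state/dependency store and a direct function call standing in for the Kinesis consumer (the three job names are invented):

```python
# Stand-ins for Redis: current job states and the dependency graph.
# "load" has no dependencies, so assume it was queued when the plan was submitted.
job_states = {"load": "PENDING", "model": "PENDING", "aggregate": "PENDING"}
dependencies = {"load": [], "model": ["load"], "aggregate": ["model"]}
queued = []  # stand-in for the job queue fed to the workers

def on_job_state_event(job_id, state):
    """React to a job-state event (delivered by Kinesis in the real system):
    record the new state, then queue any pending job whose dependencies
    have now all succeeded."""
    job_states[job_id] = state
    if state != "SUCCEEDED":
        return
    for candidate, deps in dependencies.items():
        if job_states[candidate] == "PENDING" and all(
            job_states[d] == "SUCCEEDED" for d in deps
        ):
            job_states[candidate] = "QUEUED"
            queued.append(candidate)

on_job_state_event("load", "SUCCEEDED")   # unlocks "model"
on_job_state_event("model", "SUCCEEDED")  # unlocks "aggregate"
```

The real scheduler would index dependents per job rather than scanning every candidate, but the core decision, "have all of this job's prerequisites succeeded?", is the same.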

Slide 33

Dynamic Runtime Handling @loige We also need to handle system failures!

Slide 34

Outcomes πŸ™Œ Business ● Rollup in 1 hour ● Removed limits on number of runs ● Faster, more consistent deal analytics ● Business spending more time on revenue-generating activities ● Support portfolio growth and deliver new capabilities @loige Technology ● Brought serverless to HPC financial modeling ● Reduced codebase by ~70% ● Lowered total cost of ownership ● Increased dev team agility ● Reduced carbon footprint

Slide 35

Hitting the limits 😰 @loige

Slide 36

S3 Throughput @loige

Slide 37

S3 Partitioning S3 cleverly detects high-throughput prefixes and creates partitions… normally. If this does not happen: 🚨 Please reduce your request rate; Status Code: 503; Error Code: SlowDown @loige

Slide 38

The Solution Explicit Partitioning: β—‹ Figure out how many partitions you need β—‹ Update code to create keys uniformly distributed over all partitions /part/0… /part/1… /part/2… /part/3… … /part/f… @loige 1. Talk (a lot) to AWS SAs, Support, Account Manager for special requirements like this! 2. Think ahead if you have multiple accounts for different environments!
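A sketch of explicit partitioning, assuming 16 partitions addressed by a single hex digit as in the /part/0… to /part/f… prefixes above (the hashing scheme and key layout are illustrative):

```python
import hashlib

N_PARTITIONS = 16  # assumption: one partition per hex digit, /part/0 … /part/f

def partitioned_key(logical_key):
    """Prefix a logical S3 key with a hash-derived partition so writes and
    reads spread uniformly over all partitions instead of piling onto one
    hot prefix (avoiding 503 SlowDown errors)."""
    digest = hashlib.md5(logical_key.encode()).hexdigest()
    partition = int(digest, 16) % N_PARTITIONS
    return f"part/{partition:x}/{logical_key}"

# Example: a hypothetical results object lands in a stable, hash-chosen partition
key = partitioned_key("results/deal-123.parquet")
```

Because the partition is derived from the key itself, readers can recompute the full path without a lookup table, and the same object always maps to the same partition.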

Slide 39

Fargate Scaling ● We want to run 3000 containers ASAP ● This took > 1 hour! ● We built a custom Fargate scaler β—‹ Using the RunTask API (no ECS Service) β—‹ Hidden quota increases β—‹ Step Functions + Lambda ● 3000 containers in ~20 minutes @loige The AWS ECS team has since made lots of improvements, making it possible to scale to 3,000 containers in under 5 minutes
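Since the ECS RunTask API starts at most 10 tasks per call, a custom scaler has to batch its launches. This sketch covers just the batching arithmetic; in the real system a Step Functions + Lambda loop would issue the actual `boto3` `ecs.run_task` calls and retry throttled ones:

```python
RUN_TASK_MAX_COUNT = 10  # ECS RunTask starts at most 10 tasks per call

def run_task_batches(total_tasks):
    """Plan the sequence of RunTask call sizes needed to reach
    total_tasks running containers."""
    full, rest = divmod(total_tasks, RUN_TASK_MAX_COUNT)
    return [RUN_TASK_MAX_COUNT] * full + ([rest] if rest else [])

# 3,000 containers → 300 RunTask calls of 10 tasks each
batches = run_task_batches(3000)
```

Fanning those 300 calls out across parallel Lambda invocations, rather than looping serially, is what turns an hour-plus ramp-up into minutes.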

Slide 40

How high can we go today? πŸš€ 10,000 concurrent Lambda functions in seconds 🎒 10,000 Fargate containers in 10 minutes πŸ’Έ No additional cost vladionescu.me/posts/scaling-containers-on-aws-in-2022 @loige

Slide 41

Wrapping up 🎁 ● "Serverless supercomputer" lets you do HPC with commodity AWS compute ● Plenty of challenges, but it's doable! ● Agility and innovation benefits are massive ● The customer is now serverless-first and an AWS expert Other interesting case studies: ☁ AWS HTC Grid - 🧬 COVID genome research @loige

Slide 42

Special thanks to @eoins, @cmthorne10 and the awesome team at RenRe! @loige fth.link/cm22 Serverless for HPC? IT WORKS!