Serverless for High Performance Computing

Serverless is great for web applications and APIs, but that does not mean it cannot be used successfully for other use cases. In this talk, we will discuss a successful application of serverless in the field of High Performance Computing. Specifically, we will discuss how Lambda, Fargate, Kinesis and other serverless technologies are being used to run sophisticated financial models at one of the major reinsurance companies in the world. We will learn about the architecture, the tradeoffs, some challenges and some unresolved pain points. Most importantly, we'll find out if serverless can be a great fit for HPC and if we can finally stop managing those boring EC2 instances!

Luciano Mammino

October 18, 2022

Transcript

  1. Luciano Mammino (@loige) Serverless for HPC 🚀 fth.link/cm22

  2. Is Serverless a good option for High Performance Computing? @loige

  3. 👋 Hello, I am Luciano, Senior Architect. nodejsdesignpatterns.com Let’s connect:

    🌎 loige.co 🐦 @loige 🎥 loige 🧳 lucianomammino
  4. Middy Framework SLIC Starter - Serverless Accelerator SLIC Watch -

    Observability Plugin Business focused technologists. Accelerated Serverless | AI as a Service | Platform Modernisation
  5. We host a podcast about AWS and Cloud computing 🔗

    awsbites.com 🎬 YouTube Channel 🎙 Podcast 📅 Episodes every week @loige
  6. Get the slides: fth.link/cm22 @loige

  7. Agenda • The 6 Rs of Cloud Migration • A

    serverless case study ◦ The problem space and types of workflows ◦ Original on premise implementation ◦ The PoC ◦ The final production version ◦ The components of a serverless job scheduler ◦ Challenges & Limits @loige fth.link/cm22
  8. The 6 Rs of Cloud Migrations @loige 🗑 🕸 🚚

    Retire Retain Rehost 🏗 📐 💰 Replatform Refactor Repurchase fth.link/cm22
  9. A case study @loige Case study on AWS blog: fth.link/awshpc

  10. The workloads - Risk Rollup 🏦 Financial modeling to understand

    the portfolio of risk 🧠 Internal, custom-built risk model on all reinsurance deals ⚙ HPC (High-Performance Computing) workload 🗄 ~45TB data processed ⏱ 2-3 rollups per day (6-8 hours each!) @loige
  11. The workloads - Deal Analytics ⚡ Near real-time deal pricing

    using the same risk model 🗃 Lower data volumes 🔁 High frequency of execution – up to 1,000 per day @loige
  12. Original on-prem implementation @loige

  13. Challenges 🐢 Long execution times, constraining business agility 🥊 Competing

    workloads 📈 Limits our ability to support portfolio growth 😩 Can’t deliver new features 🧾 Very high total cost of ownership @loige
  14. Thinking Big 💭 Imagine a solution that would … 1.

    Offer a dramatic increase in performance 2. Provide consistent run times 3. Support more executions, more often 4. Support future portfolio growth and new capabilities – 15x data volumes @loige
  15. The Goal ⚽ Run a Risk Rollup in 1 hour!

    @loige
  16. Architecture Options for Compute/Orchestration @loige AWS Lambda Amazon SQS AWS

    Step Functions AWS Fargate
  17. POC Architecture @loige AWS Batch S3 Step Functions Lambda SQS

  18. Measure Everything! 📏 ⏱ Built metrics in from the start

    📊 AWS metrics we wish existed out of the box: - Number of running containers - Success/failure counts 🎨 Custom metrics: - Scheduler overhead - Detailed timings (job duration, I/O time, algorithm steps) 🛠 Using CloudWatch EMF @loige
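CloudWatch EMF lets a worker emit custom metrics (like scheduler overhead or job duration) just by printing structured JSON to its logs, with no PutMetricData calls on the hot path. A minimal sketch of an EMF record builder; the namespace, dimension and metric names are illustrative, not the team's actual ones:

```python
import json
import time

def emf_record(namespace, dimensions, metric_name, value, unit="Milliseconds"):
    """Build a CloudWatch Embedded Metric Format (EMF) log line.

    When this JSON is written to CloudWatch Logs, CloudWatch extracts
    the metric automatically from the `_aws` metadata envelope.
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
    }
    record.update(dimensions)  # dimension values ride along as top-level fields
    return json.dumps(record)

# Illustrative usage: record how long one modelling job took.
print(emf_record("RiskRollup", {"Stage": "modelling"}, "JobDuration", 5321))
```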
  19. Measure Everything! 📏 👍 Rollup in 1 hour ☁ Running

    on AWS Batch 👎 Cluster utilisation was <50% ✅ Goal success 🤔 Understanding of what needs to be addressed @loige
  20. Beyond the PoC Production: optimise for unique workload characteristics @loige

  21. Job Plan @loige

  22. In reality, not all jobs are alike! @loige

  23. Horizontal scaling 🚀 1000’s of jobs Duration: 1 second –

    45 minutes Scaling horizontally = splitting jobs Jobs split according to their complexity/duration Resulting in >1 million jobs @loige
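The splitting idea above can be sketched in a few lines: estimate a job's duration, then cut it into sub-jobs sized to a target runtime, turning thousands of uneven jobs into over a million uniform ones. The heuristic and the 5-minute target are assumptions for illustration, not the talk's actual thresholds:

```python
import math

def split_job(job_id, estimated_minutes, target_minutes=5):
    """Split one logical job into sub-jobs of roughly `target_minutes` each,
    so long-tail jobs (up to 45 minutes) don't dominate the rollup's runtime.
    The sizing heuristic here is a hypothetical sketch."""
    parts = max(1, math.ceil(estimated_minutes / target_minutes))
    return [f"{job_id}/part-{i}" for i in range(parts)]

# A 45-minute job becomes 9 five-minute sub-jobs; a 1-minute job stays whole.
```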
  24. Moving to production 🚢 @loige

  25. Scope @loige

  26. Actual End to End overview @loige

  27. Modelling Worker @loige

  28. Compute Services AWS Fargate: scales to 1000’s of tasks (containers), little management

    overhead, up to 4 vCPUs and 30GB memory, up to 200GB ephemeral storage. AWS Lambda: scales to 1000’s of function containers (in seconds!), very little management overhead, up to 6 vCPUs and 10GB memory, up to 10GB ephemeral storage. It wasn’t always this way! @loige
  29. Store all the things in S3! The source of truth

    for: • Input Data (JSON, Parquet) • Intermediate Data (Parquet) • Results (Parquet) • Aggregates (Parquet) Input data: 20GB Output data: ~1 TB Reads and writes: 10,000s of objects per second. @loige
  30. Scheduling and Orchestration ✅ We have our cluster (Fargate or

    Lambda) ✅ We have a plan! (list of jobs, parameters and dependencies) 🤔 How do we feed this plan to the cluster?! 🤨 Existing schedulers use traditional clusters – there is no serverless job scheduler for workloads like this! @loige
  31. Lifecycle of a Job @loige A new job gets queued

    here 👇 A worker picks up the job and executes it The worker emits the job state (success or failure)
  32. Event-Driven Scheduler @loige Job states are pulled from a Kinesis

    Data Stream Redis stores: - Job states - Dependencies This scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next
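The scheduler's core step above is a dependency-resolution loop: on each success event, check which pending jobs now have all their dependencies met. A minimal sketch, with one big assumption: a plain dict and a direct method call stand in for the Redis state store and the Kinesis event stream:

```python
class DependencyScheduler:
    """Sketch of the event-driven scheduler's resolution step.

    In production, job states arrive as Kinesis Data Stream records and
    the state/dependency graph lives in Redis; this in-memory version
    only illustrates the decision logic."""

    def __init__(self, dependencies):
        # pending job -> set of jobs it still depends on
        self.pending = {job: set(deps) for job, deps in dependencies.items()}
        self.completed = set()

    def on_job_succeeded(self, job):
        """Record one success event; return jobs that are now ready to queue."""
        self.completed.add(job)
        ready = [j for j, deps in self.pending.items() if deps <= self.completed]
        for j in ready:
            del self.pending[j]  # never schedule the same job twice
        return ready
```

A real scheduler also has to handle failure events and retries, which this sketch omits.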
  33. Dynamic Runtime Handling @loige We also need to handle system

    failures!
  34. Outcomes 🙌 Business • Rollup in 1 hour • Removed

    limits on number of runs • Faster, more consistent deal analytics • Business spending more time on revenue-generating activities • Support portfolio growth and deliver new capabilities @loige Technology • Brought serverless to HPC financial modeling • Reduced codebase by ~70% • Lowered total cost of ownership • Increased dev team agility • Reduced carbon footprint
  35. Hitting the limits 😰 @loige

  36. S3 Throughput @loige

  37. S3 Partitioning S3 cleverly detects high-throughput prefixes and creates partitions

    … normally. If this does not happen: 🚨 Please reduce your request rate; Status Code: 503; Error Code: SlowDown @loige
  38. The Solution Explicit Partitioning: ◦ Figure out how many partitions you

    need ◦ Update code to create keys uniformly distributed over all partitions /part/0… /part/1… /part/2… /part/3… … /part/f… @loige 1. Talk (a lot) to AWS SAs, Support and your Account Manager for special requirements like this! 2. Think ahead if you have multiple accounts for different environments!
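One common way to spread keys uniformly over pre-created prefixes is to derive the partition from a stable hash of the object's identity. A sketch under assumptions: a 16-way split on the first hex character of an MD5 digest, with illustrative key names:

```python
import hashlib

NUM_PARTITIONS = 16  # illustrative: one partition per hex character, /part/0 … /part/f

def partitioned_key(job_id, suffix):
    """Prefix each S3 key with a hash-derived partition so reads/writes
    spread evenly across all partitions instead of hammering one prefix.
    The key layout is a hypothetical example, not the actual scheme."""
    part = hashlib.md5(job_id.encode()).hexdigest()[0]
    return f"part/{part}/{job_id}/{suffix}"
```

Because the hash is stable, a job always reads and writes under the same prefix, while different jobs land on different partitions.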
  39. Fargate Scaling • We want to run 3000 containers ASAP

    • This took > 1 hour! • We built a custom Fargate scaler ◦ Using the RunTask API (no ECS Service) ◦ Hidden quota increases ◦ Step Function + Lambda • 3000 containers in ~20 minutes @loige The AWS ECS team has since made lots of improvements, making it possible to scale to 3,000 containers in under 5 minutes
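The fan-out part of a custom scaler like this is simple arithmetic: ECS RunTask accepts at most 10 tasks per call, so 3,000 containers means 300 calls. A sketch with the launcher injected so the batching is testable; in practice it would wrap boto3's `ecs_client.run_task`, and a real scaler also handles throttling and capacity errors (both omitted here):

```python
def launch_fleet(total_tasks, run_task, batch_size=10):
    """Fan out RunTask calls in batches of at most `batch_size` (the API
    caps `count` at 10 tasks per call). `run_task` is an injected callable;
    wiring it to boto3 ecs_client.run_task(..., count=n, launchType="FARGATE")
    is left as an assumption."""
    launched = 0
    while launched < total_tasks:
        count = min(batch_size, total_tasks - launched)
        run_task(count)
        launched += count
    return launched
```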
  40. How high can we go today? 🚀 10,000 concurrent Lambda

    functions in seconds 🎢 10,000 Fargate containers in 10 minutes 💸 No additional cost vladionescu.me/posts/scaling-containers-on-aws-in-2022 @loige
  41. Wrapping up 🎁 • "Serverless supercomputer" lets you do HPC

    with commodity AWS compute • Plenty of challenges, but it's doable! • Agility and innovation benefits are massive • Customer is now serverless-first and expert in AWS Other interesting case studies: ☁ AWS HTC Grid - 🧬 COVID genome research @loige
  42. Special thanks to @eoins, @cmthorne10 and the awesome team at

    RenRe! @loige fth.link/cm22 Serverless for HPC? IT WORKS!