Serverless for High Performance Computing

Serverless for HPC Luciano Mammino fourTheorem @loige


👋 Hello, I am Luciano
Senior architect
nodejsdesignpatterns.com
Let’s connect: 🌎 loige.co 🐦 @loige 🎥 loige 🧳 lucianomammino

• Middy Framework
• SLIC Starter - Serverless Accelerator
• SLIC Watch - Observability Plugin
Business focused technologists.
Accelerated Serverless | AI as a Service | Platform Modernisation

We host a podcast about AWS and Cloud computing
🔗 awsbites.com
🎬 YouTube Channel
🎙 Podcast
📅 Episodes every week
@loige #CLOUDDAY2022

Get the slides: fth.link/cd22

Agenda
• The 6 Rs of Cloud Migration
• A serverless case study
  ◦ The problem space and types of workflows
  ◦ Original on-premise implementation
  ◦ The PoC
  ◦ The final production version
  ◦ The components of a serverless job scheduler
  ◦ Challenges & Limits

The 6 Rs of Cloud Migrations
🗑 Retire
🕸 Retain
🚚 Rehost
🏗 Replatform
📐 Refactor
💰 Repurchase

A case study
Case study on AWS blog: fth.link/awshpc

The workloads - Risk Rollup
🏦 Financial modeling to understand the portfolio of risk
🧠 Internal, custom-built risk model on all reinsurance deals
⚙ HPC (High-Performance Computing) workload
🗄 ~45TB data processed
⏱ 2-3 rollups per day (6-8 hours each!)

The workloads - Deal Analytics
⚡ Near real-time deal pricing using the same risk model
🗃 Lower data volumes
🔁 High frequency of execution: up to 1,000 per day

Original on-prem implementation

Challenges
🐢 Long execution times, constraining business agility
🥊 Competing workloads
📈 Limits our ability to support portfolio growth
😩 Can’t deliver new features
🧾 Very high total cost of ownership

Thinking Big 💭
Imagine a solution that would …
1. Offer a dramatic increase in performance
2. Provide consistent run times
3. Support more executions, more often
4. Support future portfolio growth and new capabilities (15x data volumes)

The Goal ⚽
Run a Risk Rollup in 1 hour!

Architecture Options for Compute/Orchestration
• AWS Lambda
• Amazon SQS
• AWS Step Functions
• AWS Fargate
Reduce the problem to simple, small, event-driven components

POC Architecture
AWS Batch, S3, Step Functions, Lambda, SQS

Measure Everything! 📏
⏱ Built metrics in from the start
AWS metrics we wish existed out of the box:
- Number of running containers
- Success/failure counts
🎨 Custom metrics:
- Scheduler overhead
- Detailed timings (job duration, I/O time, algorithm steps)
🛠 Using CloudWatch, EMF
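A minimal sketch of what a custom metric in CloudWatch Embedded Metric Format (EMF) can look like: a JSON log line whose `_aws` block declares which root-level fields are metrics and which are dimensions. The namespace, dimension, and metric names below are hypothetical, chosen only to mirror the custom metrics listed on the slide; the talk does not show the actual schema.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Build one EMF log record.

    dimensions: dict of dimension name -> string value
    metrics: dict of metric name -> (numeric value, CloudWatch unit)
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing all given dimension names.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_value, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension and metric values live at the root of the record.
    record.update(dimensions)
    record.update({name: value for name, (value, _unit) in metrics.items()})
    return record


record = emf_record(
    namespace="RiskRollup/Workers",  # hypothetical namespace
    dimensions={"JobType": "modelling"},  # hypothetical dimension
    metrics={
        "JobDuration": (12.4, "Seconds"),
        "IoTime": (3.1, "Seconds"),
        "SchedulerOverhead": (85, "Milliseconds"),
    },
    timestamp_ms=1_650_000_000_000,
)
print(json.dumps(record))
```

On Lambda, a line like this written to standard output ends up in the function's logs, where CloudWatch can extract the metrics; other environments need their own path into CloudWatch Logs.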

Measure Everything! 📏
👍 Rollup in 1 hour
☁ Running on AWS Batch
👎 Cluster utilisation was <50%
✅ Goal success
🤔 Understanding of what needs to be addressed next!

Beyond the PoC
Production: optimise for unique workload characteristics

Job Plan

In reality, not all jobs are alike!

Horizontal scaling 🚀
• 1,000s of jobs
• Duration: 1 second to 45 minutes
• Scaling horizontally = splitting jobs
• Jobs split according to their complexity/duration
• Resulting in >1 million jobs
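One way to picture "splitting jobs according to their complexity/duration" is to cut each job's work items into sub-jobs of roughly equal estimated run time. This is an illustrative sketch only: the function, its parameters, and the per-item time estimate are assumptions, not the splitting logic used in the case study.

```python
def split_job(job_id, items, estimated_seconds_per_item, target_seconds):
    """Split a job's work items into sub-jobs of roughly target_seconds each."""
    # Always keep at least one item per sub-job, even for very slow items.
    per_chunk = max(1, int(target_seconds // estimated_seconds_per_item))
    return [
        {"id": f"{job_id}-{n}", "items": items[start:start + per_chunk]}
        for n, start in enumerate(range(0, len(items), per_chunk))
    ]


# A heavy job: 10 items at ~30 s each, aiming for ~90 s per sub-job.
heavy = split_job("deal-42", list(range(10)), 30, 90)
# A light job: the same 10 items at ~1 s each stay in a single sub-job.
light = split_job("deal-43", list(range(10)), 1, 90)
print(len(heavy), len(light))
```

Expensive jobs fan out into many sub-jobs while cheap ones stay whole, which is how a few thousand logical jobs can turn into a far larger number of units of work.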

Moving to production 🚢

Scope

Actual End to End overview

Modelling Worker

Compute Services
AWS Fargate:
• Scales to 1,000s of tasks (containers)
• Little management overhead
• Up to 4 vCPUs and 30GB Memory
• Up to 200GB ephemeral storage
AWS Lambda:
• Scales to 1,000s of function containers (in seconds!)
• Very little management overhead
• Up to 6 vCPUs and 10GB Memory
• Up to 10GB ephemeral storage
It wasn’t always this way!

Store all the things in S3!
The source of truth for:
• Input Data (JSON, Parquet)
• Intermediate Data (Parquet)
• Results (Parquet)
• Aggregates (Parquet)
Input data: 20GB
Output data: ~1 TB
Reads and writes: 10,000s of objects per second.

Scheduling and Orchestration
✅ We have our cluster (Fargate or Lambda)
✅ We have a plan! (list of jobs, parameters and dependencies)
🤔 How do we feed this plan to the cluster?!
🤨 Existing schedulers use traditional clusters: there is no serverless job scheduler for workloads like this!

Lifecycle of a Job
1. A new job gets queued here 👇
2. A worker picks up the job and executes it
3. The worker emits the job state (success or failure)
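The lifecycle above can be sketched with in-memory stand-ins: a `queue.Queue` plays the job queue and a plain list plays the stream that receives job states. The job shape, the state names, and the `handler` function are hypothetical; the real workers run on Fargate or Lambda against AWS services.

```python
import queue


def run_worker(jobs, events, handler):
    """Drain the job queue, run each job, and emit one state event per job."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return  # nothing left to pick up
        try:
            handler(job)
            state = "SUCCEEDED"
        except Exception:
            state = "FAILED"
        # Stand-in for publishing the job state to a stream.
        events.append({"job_id": job["id"], "state": state})


def handler(job):
    # Hypothetical workload: fail when the job carries no input.
    if not job["input"]:
        raise ValueError("empty input")


jobs = queue.Queue()
jobs.put({"id": "job-1", "input": [1, 2, 3]})
jobs.put({"id": "job-2", "input": []})

events = []
run_worker(jobs, events, handler)
print(events)
```

The worker reports failure as just another state rather than crashing, so whatever consumes the events can decide what happens next.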

Event-Driven Scheduler
• Job states are pulled from a Kinesis Data Stream
• Redis stores:
  - Job states
  - Dependencies
• This scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next
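A rough sketch of the scheduling decision, with plain dictionaries standing in for Redis and a dict standing in for one record from the stream. The data layout, state names, and job names are assumptions for illustration; the talk does not show the scheduler's code.

```python
def next_runnable(dependencies, states, event):
    """Apply one job state event and return the jobs that became runnable.

    dependencies: job id -> set of job ids it depends on
    states: job id -> state, mutated in place (stand-in for Redis)
    """
    states[event["job_id"]] = event["state"]
    if event["state"] != "SUCCEEDED":
        return []  # a failure unlocks nothing
    runnable = []
    for job_id, deps in dependencies.items():
        if job_id in states:
            continue  # already queued, running, or finished
        if all(states.get(dep) == "SUCCEEDED" for dep in deps):
            runnable.append(job_id)
            states[job_id] = "QUEUED"
    return runnable


dependencies = {
    "load": set(),
    "model-a": {"load"},
    "model-b": {"load"},
    "aggregate": {"model-a", "model-b"},
}
states = {"load": "QUEUED"}

first = next_runnable(dependencies, states, {"job_id": "load", "state": "SUCCEEDED"})
second = next_runnable(dependencies, states, {"job_id": "model-a", "state": "SUCCEEDED"})
third = next_runnable(dependencies, states, {"job_id": "model-b", "state": "SUCCEEDED"})
print(first, second, third)
```

This version rescans every job on each event, which is fine for a sketch; at the scale described in the talk a scheduler would more likely look up only the dependents of the job that just finished.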

Dynamic Runtime Handling
We also need to handle system failures!

Outcomes 🙌
Business
• Rollup in 1 hour
• Removed limits on number of runs
• Faster, more consistent deal analytics
• Business spending more time on revenue-generating activities
• Support portfolio growth and deliver new capabilities
Technology
• Brought serverless to HPC financial modeling
• Reduced codebase by ~70%
• Lowered total cost of ownership
• Increased dev team agility
• Reduced carbon footprint

Hitting the limits 😰

S3 Throughput

S3 Partitioning
S3 cleverly detects high-throughput prefixes and creates partitions... normally.
If this does not happen…
🚨 Please reduce your request rate; Status Code: 503; Error Code: SlowDown

The Solution
Explicit Partitioning:
◦ Figure out how many partitions you need
◦ Update code to create keys uniformly distributed over all partitions
/part/0…
/part/1…
/part/2…
/part/3…
…
/part/f…
1. Talk (a lot) to AWS SAs, Support, Account Manager for special requirements like this!
2. Think ahead if you have multiple accounts for different environments!
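Spreading keys uniformly over partitions can be sketched by hashing each key and using the hash to pick a prefix. The `part/<hex digit>/<key>` layout below is modelled on the slide's listing, and the choice of SHA-256 and the example keys are assumptions; the talk does not say how the real keys were derived.

```python
import hashlib


def partitioned_key(key, partitions=16):
    """Prefix an object key with a partition chosen from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % partitions
    # With 16 partitions the prefix is a single hex digit, 0 through f.
    return f"part/{index:x}/{key}"


# Hypothetical object keys for intermediate results.
keys = [f"results/job-{n}.parquet" for n in range(10_000)]
prefixes = {partitioned_key(k).split("/")[1] for k in keys}
print(sorted(prefixes))
```

Because the prefix is derived from the key itself, readers can recompute it without a lookup table, and the same key always lands in the same partition.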

Fargate Scaling
• We want to run 3,000 containers ASAP
• This took > 1 hour!
• We built a custom Fargate scaler
  ◦ Using the RunTask API (no ECS Service)
  ◦ Hidden quota increases
  ◦ Step Function + Lambda
• 3,000 containers in ~20 minutes
The AWS ECS team has since made lots of improvements, making it possible to scale to 3,000 containers in under 5 minutes
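Part of what a RunTask-based scaler has to do is turn "3,000 containers" into many API calls, since a single RunTask call starts at most 10 tasks. The sketch below only plans and issues those calls through an injected `run_task` callable, which is a hypothetical wrapper (in practice it might call boto3's `ecs.run_task` with a `count` argument). Throttling, retries, and quotas are deliberately left out.

```python
def plan_run_task_calls(total_tasks, max_per_call=10):
    """Return the count to request in each RunTask call."""
    full, rest = divmod(total_tasks, max_per_call)
    return [max_per_call] * full + ([rest] if rest else [])


def launch(total_tasks, run_task):
    """Call run_task(count) once per planned batch; return tasks requested."""
    requested = 0
    for count in plan_run_task_calls(total_tasks):
        run_task(count)  # hypothetical wrapper around the real API call
        requested += count
    return requested


calls = []
requested = launch(3000, calls.append)  # record counts instead of calling AWS
print(len(calls), requested)
```

Starting 3,000 tasks this way takes 300 calls, which is why API rate limits matter as much as the task quota itself.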

How high can we go today?
🚀 10,000 concurrent Lambda functions in seconds
🎢 10,000 Fargate containers in 10 minutes
💸 No additional cost
vladionescu.me/posts/scaling-containers-on-aws-in-2022

Wrapping up 🎁
• "Serverless supercomputer" lets you do HPC with commodity AWS compute
• Plenty of challenges, but it's doable!
• Agility and innovation benefits are massive
• Customer is now serverless-first and expert in AWS
Other interesting case studies:
☁ AWS HTC Grid
🧬 COVID genome research

Special thanks to @eoins and @cmthorne10
