Slide 1

Slide 1 text

Serverless for HPC Luciano Mammino fourTheorem @loige

Slide 2

Slide 2 text

Diamond Sponsor Partner Platinum Sponsor Gold Sponsor

Slide 3

Slide 3 text

👋 Hello, I am Luciano Senior architect nodejsdesignpatterns.com Let’s connect: 🌎 loige.co 🐦 @loige 🎥 loige 🧳 lucianomammino

Slide 4

Slide 4 text

Middy Framework SLIC Starter - Serverless Accelerator SLIC Watch - Observability Plugin Business focused technologists. Accelerated Serverless | AI as a Service | Platform Modernisation

Slide 5

Slide 5 text

We host a podcast about AWS and Cloud computing 🔗 awsbites.com 🎬 YouTube Channel 🎙 Podcast 📅 Episodes every week @loige #CLOUDDAY2022

Slide 6

Slide 6 text

Get the slides: fth.link/cd22

Slide 7

Slide 7 text

Agenda
● The 6 Rs of Cloud Migration
● A serverless case study
  ○ The problem space and types of workflows
  ○ Original on-premise implementation
  ○ The PoC
  ○ The final production version
  ○ The components of a serverless job scheduler
  ○ Challenges & Limits

Slide 8

Slide 8 text

The 6 Rs of Cloud Migrations: 🗑 Retire, 🕸 Retain, 🚚 Rehost, 🏗 Replatform, Refactor, 💰 Repurchase

Slide 9

Slide 9 text

A case study Case study on AWS blog: fth.link/awshpc

Slide 10

Slide 10 text

The workloads - Risk Rollup 🏦 Financial modeling to understand the portfolio of risk 🧠 Internal, custom-built risk model on all reinsurance deals ⚙ HPC (High-Performance Computing) workload 🗄 ~45 TB data processed ⏱ 2-3 rollups per day (6-8 hours each!)

Slide 11

Slide 11 text

The workloads - Deal Analytics ⚡ Near real-time deal pricing using the same risk model 🗃 Lower data volumes 🔁 High frequency of execution: up to 1,000 per day

Slide 12

Slide 12 text

Original on-prem implementation

Slide 13

Slide 13 text

Challenges 🐢 Long execution times, constraining business agility 🥊 Competing workloads 📈 Limits our ability to support portfolio growth 😩 Can’t deliver new features 🧾 Very high total cost of ownership

Slide 14

Slide 14 text

Thinking Big 💭 Imagine a solution that would… 1. Offer a dramatic increase in performance 2. Provide consistent run times 3. Support more executions, more often 4. Support future portfolio growth and new capabilities (15x data volumes)

Slide 15

Slide 15 text

The Goal ⚽ Run a Risk Rollup in 1 hour!

Slide 16

Slide 16 text

Architecture Options for Compute/Orchestration: AWS Lambda, Amazon SQS, AWS Step Functions, AWS Fargate

Slide 17

Slide 17 text

PoC Architecture: AWS Batch, S3, Step Functions, Lambda, SQS

Slide 18

Slide 18 text

Measure Everything!
⏱ Built metrics in from the start
AWS metrics we wish existed out of the box:
- Number of running containers
- Success/failure counts
🎨 Custom metrics:
- Scheduler overhead
- Detailed timings (job duration, I/O time, algorithm steps)
🛠 Using CloudWatch, EMF
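As an illustration of the custom metrics above, here is a minimal sketch of a CloudWatch Embedded Metric Format (EMF) record built by hand. The namespace, dimension and metric values are hypothetical and not taken from the talk; only the `_aws` envelope layout follows the published EMF format.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Build one CloudWatch Embedded Metric Format (EMF) record as a dict.

    dimensions: dimension name -> value
    metrics: metric name -> (value, unit)
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every dimension name.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_value, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension and metric values live at the top level of the record.
    record.update(dimensions)
    for name, (value, _unit) in metrics.items():
        record[name] = value
    return record


example = emf_record(
    namespace="RiskRollup/PoC",  # hypothetical namespace
    dimensions={"JobType": "modelling"},  # hypothetical dimension
    metrics={
        "JobDuration": (1834.0, "Milliseconds"),
        "IoTime": (412.0, "Milliseconds"),
        "SchedulerOverhead": (57.0, "Milliseconds"),
    },
    timestamp_ms=1_650_000_000_000,
)

# One JSON object per log line; on AWS Lambda, stdout ends up in CloudWatch Logs.
print(json.dumps(example))
```

Writing metrics as structured log lines keeps the hot path free of metric API calls, which matters when thousands of workers report timings at once.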

Slide 19

Slide 19 text

Measure Everything! 👍 Rollup in 1 hour ☁ Running on AWS Batch 👎 Cluster utilisation was <50% ✅ Goal success 🤔 Understanding of what needs to be addressed next!

Slide 20

Slide 20 text

Beyond the PoC Production: optimise for unique workload characteristics

Slide 21

Slide 21 text

Job Plan

Slide 22

Slide 22 text

In reality, not all jobs are alike!

Slide 23

Slide 23 text

Horizontal scaling 🚀 1,000s of jobs. Duration: 1 second to 45 minutes. Scaling horizontally = splitting jobs. Jobs are split according to their complexity/duration, resulting in >1 million jobs.
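The splitting idea can be sketched as follows. This is only an illustration under stated assumptions: the `Job` shape, the per-item duration estimate and the target duration are hypothetical, and the talk does not describe the actual splitting rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    items: range                 # hypothetical units of work, e.g. deal indices
    est_seconds_per_item: float  # hypothetical estimate, assumed to be > 0


def split_job(job, target_seconds):
    """Split a job into sub-jobs whose estimated duration fits the target."""
    # How many items fit into one sub-job without exceeding the target.
    per_chunk = max(1, int(target_seconds // job.est_seconds_per_item))
    if len(job.items) <= per_chunk:
        return [job]  # already small enough, keep it whole
    return [
        Job(f"{job.job_id}.{n}", job.items[start:start + per_chunk],
            job.est_seconds_per_item)
        for n, start in enumerate(range(0, len(job.items), per_chunk))
    ]


# A 45-minute job (100 items x 27 s) split to fit a 60-second target.
big = Job("deal-42", range(100), 27.0)
parts = split_job(big, target_seconds=60)

# A 1-second job is left untouched.
small = Job("deal-7", range(1), 1.0)
unchanged = split_job(small, target_seconds=60)
```

Short jobs pass through unchanged while long ones fan out, which is how a plan of thousands of jobs can grow into a much larger number of small, evenly sized ones.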

Slide 24

Slide 24 text

Moving to production 🚢

Slide 25

Slide 25 text

Scope

Slide 26

Slide 26 text

Actual End to End overview

Slide 27

Slide 27 text

Modelling Worker

Slide 28

Slide 28 text

Compute Services
AWS Fargate:
● Scales to 1,000s of tasks (containers)
● Little management overhead
● Up to 4 vCPUs and 30 GB memory
● Up to 200 GB ephemeral storage
AWS Lambda:
● Scales to 1,000s of function containers (in seconds!)
● Very little management overhead
● Up to 6 vCPUs and 10 GB memory
● Up to 10 GB ephemeral storage
It wasn’t always this way!

Slide 29

Slide 29 text

Store all the things in S3!
The source of truth for:
● Input Data (JSON, Parquet)
● Intermediate Data (Parquet)
● Results (Parquet)
● Aggregates (Parquet)
Input data: 20 GB
Output data: ~1 TB
Reads and writes: 10,000s of objects per second.

Slide 30

Slide 30 text

Scheduling and Orchestration ✅ We have our cluster (Fargate or Lambda) ✅ We have a plan! (list of jobs, parameters and dependencies) 🤔 How do we feed this plan to the cluster?! 🤨 Existing schedulers use traditional clusters: there is no serverless job scheduler for workloads like this!

Slide 31

Slide 31 text

Lifecycle of a Job: A new job gets queued here 👇 A worker picks up the job and executes it. The worker emits the job state (success or failure).

Slide 32

Slide 32 text

Event-Driven Scheduler
Job states are pulled from a Kinesis Data Stream.
Redis stores:
- Job states
- Dependencies
This scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next.
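A minimal sketch of that dependency check, with plain in-memory dictionaries standing in for Redis and direct method calls standing in for records read from the Kinesis stream. The class, state names and job ids are hypothetical; the talk does not show the real implementation.

```python
class Scheduler:
    """Decides which jobs become runnable as job states arrive."""

    def __init__(self, dependencies):
        # dependencies: job id -> set of job ids that must succeed first
        self.dependencies = {job: set(deps) for job, deps in dependencies.items()}
        self.states = {job: "pending" for job in dependencies}
        # Reverse index: job id -> jobs waiting on it.
        self.dependents = {}
        for job, deps in self.dependencies.items():
            for dep in deps:
                self.dependents.setdefault(dep, set()).add(job)

    def initial_jobs(self):
        """Jobs with no dependencies can be queued straight away."""
        ready = sorted(job for job, deps in self.dependencies.items() if not deps)
        for job in ready:
            self.states[job] = "queued"
        return ready

    def on_job_state(self, job, state):
        """Record a new job state and return the jobs it unblocks."""
        self.states[job] = state
        if state != "succeeded":
            return []
        ready = []
        for candidate in sorted(self.dependents.get(job, ())):
            deps_done = all(
                self.states[dep] == "succeeded"
                for dep in self.dependencies[candidate]
            )
            # Only "pending" jobs are queued, so a replayed event is harmless.
            if self.states[candidate] == "pending" and deps_done:
                self.states[candidate] = "queued"
                ready.append(candidate)
        return ready


plan = {
    "load": set(),
    "model-a": {"load"},
    "model-b": {"load"},
    "aggregate": {"model-a", "model-b"},
}
scheduler = Scheduler(plan)
first = scheduler.initial_jobs()
after_load = scheduler.on_job_state("load", "succeeded")
after_a = scheduler.on_job_state("model-a", "succeeded")
after_b = scheduler.on_job_state("model-b", "succeeded")
replayed = scheduler.on_job_state("model-b", "succeeded")
```

Marking a job as queued before returning it keeps the check idempotent, which is useful because stream consumers can see the same record more than once.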

Slide 33

Slide 33 text

Dynamic Runtime Handling We also need to handle system failures!

Slide 34

Slide 34 text

Outcomes 🙌
Business
● Rollup in 1 hour
● Removed limits on number of runs
● Faster, more consistent deal analytics
● Business spending more time on revenue-generating activities
● Support portfolio growth and deliver new capabilities
Technology
● Brought serverless to HPC financial modeling
● Reduced codebase by ~70%
● Lowered total cost of ownership
● Increased dev team agility
● Reduced carbon footprint

Slide 35

Slide 35 text

Hitting the limits 😰

Slide 36

Slide 36 text

S3 Throughput

Slide 37

Slide 37 text

S3 Partitioning: S3 cleverly detects high-throughput prefixes and creates partitions… normally. If this does not happen… 🚨 Please reduce your request rate; Status Code: 503; Error Code: SlowDown

Slide 38

Slide 38 text

The Solution
Explicit Partitioning:
○ Figure out how many partitions you need
○ Update code to create keys uniformly distributed over all partitions
/part/0… /part/1… /part/2… /part/3… … /part/f…
1. Talk (a lot) to AWS SAs, Support, Account Manager for special requirements like this!
2. Think ahead if you have multiple accounts for different environments!
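One way to create keys uniformly distributed over a fixed set of prefixes is to hash the logical key and use the hash to pick the partition. The sketch below assumes 16 partitions named with one hex digit, as in the slide; the `part/<digit>/<key>` layout after the prefix and the example key names are assumptions.

```python
import hashlib
from collections import Counter


def partitioned_key(logical_key, partitions=16):
    """Prefix a key with a hash-derived partition so writes spread evenly."""
    digest = hashlib.sha256(logical_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % partitions
    # Hypothetical layout: part/<hex partition>/<original key>
    return f"part/{bucket:x}/{logical_key}"


# The same logical key always maps to the same partition,
# so readers can recompute the full key without a lookup table.
keys = [partitioned_key(f"results/job-{n}.parquet") for n in range(16_000)]
spread = Counter(key.split("/")[1] for key in keys)
```

Because the partition is derived from the key itself, producers and consumers only need to agree on the hashing rule and the number of partitions.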

Slide 39

Slide 39 text

Fargate Scaling
● We wanted to run 3,000 containers ASAP
● This took > 1 hour!
● We built a custom Fargate scaler
  ○ Using the RunTask API (no ECS Service)
  ○ Hidden quota increases
  ○ Step Function + Lambda
● 3,000 containers in ~20 minutes
The AWS ECS team has since made lots of improvements, making it possible to scale to 3,000 containers in under 5 minutes.
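A rough sketch of the RunTask-based approach: RunTask starts at most 10 tasks per call, so a large target has to be broken into many calls. The `run_task` callable is injected so the example runs without AWS access; with boto3 it could be a partial over `ecs_client.run_task` with cluster, task definition and network settings filled in. This is not the custom scaler from the talk, which used Step Functions and Lambda and would also need throttling and retry handling.

```python
def plan_run_task_calls(total_tasks, max_per_call=10):
    """Break a target task count into per-call counts (RunTask allows up to 10)."""
    full_calls, remainder = divmod(total_tasks, max_per_call)
    return [max_per_call] * full_calls + ([remainder] if remainder else [])


def launch(total_tasks, run_task):
    """Call run_task(count=...) repeatedly and tally started and failed tasks."""
    started = 0
    failed = 0
    for count in plan_run_task_calls(total_tasks):
        response = run_task(count=count)
        started += len(response.get("tasks", []))
        failed += len(response.get("failures", []))
    return started, failed


def fake_run_task(count):
    # Stand-in for the real API call: pretends every task starts.
    return {"tasks": [{} for _ in range(count)], "failures": []}


calls = plan_run_task_calls(3000)
started, failed = launch(3000, fake_run_task)
```

Issuing the calls from many concurrent workers rather than one sequential loop is what makes the launch rate, and not the call count, the limiting factor.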

Slide 40

Slide 40 text

How high can we go today? 🚀 10,000 concurrent Lambda functions in seconds 🎢 10,000 Fargate containers in 10 minutes 💸 No additional cost vladionescu.me/posts/scaling-containers-on-aws-in-2022

Slide 41

Slide 41 text

Wrapping up 🎁
● "Serverless supercomputer" lets you do HPC with commodity AWS compute
● Plenty of challenges, but it's doable!
● Agility and innovation benefits are massive
● Customer is now serverless-first and expert in AWS
Other interesting case studies: ☁ AWS HTC Grid - 🧬 COVID genome research

Slide 42

Slide 42 text

Special thanks to @eoins and @cmthorne10 fth.link/cd22