Serverless for High Performance Computing

Serverless is great for web applications and APIs, but that does not mean it cannot be used successfully for other use cases. In this talk, we will discuss a successful application of serverless in the field of High Performance Computing. Specifically, we will discuss how Lambda, Fargate, Kinesis and other serverless technologies are being used to run sophisticated financial models at one of the major reinsurance companies in the world. We will learn about the architecture, the tradeoffs, some challenges and some unresolved pain points. Most importantly, we'll find out if serverless can be a great fit for HPC and if we can finally stop managing those boring EC2 instances!

Luciano Mammino

October 18, 2022

Transcript

  1. Luciano Mammino (@loige) Serverless for HPC 🚀 fth.link/cm22

  2. Is Serverless a good option for High Performance Computing? @loige

  3. 👋 Hello, I am Luciano, Senior Architect. nodejsdesignpatterns.com Let’s connect:

    🌎 loige.co 🐦 @loige 🎥 loige 🧳 lucianomammino
  4. Middy Framework SLIC Starter - Serverless Accelerator SLIC Watch -

    Observability Plugin Business focused technologists. Accelerated Serverless | AI as a Service | Platform Modernisation
  5. We host a podcast about AWS and Cloud computing 🔗

    awsbites.com 🎬 YouTube Channel 🎙 Podcast 📅 Episodes every week @loige
  6. Get the slides: fth.link/cm22 @loige

  7. Agenda • The 6 Rs of Cloud Migration • A

    serverless case study ◦ The problem space and types of workflows ◦ Original on premise implementation ◦ The PoC ◦ The final production version ◦ The components of a serverless job scheduler ◦ Challenges & Limits @loige fth.link/cm22
  8. The 6 Rs of Cloud Migrations @loige 🗑 🕸 🚚

    Retire Retain Rehost 🏗 📐 💰 Replatform Refactor Repurchase fth.link/cm22
  9. A case study @loige Case study on AWS blog: fth.link/awshpc

  10. The workloads - Risk Rollup 🏦 Financial modeling to understand

    the portfolio of risk 🧠 Internal, custom-built risk model on all reinsurance deals ⚙ HPC (High-Performance Computing) workload 🗄 ~45TB data processed ⏱ 2-3 rollups per day (6-8 hours each!) @loige
  11. The workloads - Deal Analytics ⚡ Near real-time deal pricing

    using the same risk model 🗃 Lower data volumes 🔁 High frequency of execution – up to 1,000 per day @loige
  12. Original on-prem implementation @loige

  13. Challenges 🐢 Long execution times, constraining business agility 🥊 Competing

    workloads 📈 Limits our ability to support portfolio growth 😩 Can’t deliver new features 🧾 Very high total cost of ownership @loige
  14. Thinking Big 💭 Imagine a solution that would … 1.

    Offer a dramatic increase in performance 2. Provide consistent run times 3. Support more executions, more often 4. Support future portfolio growth and new capabilities – 15x data volumes @loige
  15. The Goal ⚽ Run a Risk Rollup in 1 hour!

    @loige
  16. Architecture Options for Compute/Orchestration @loige AWS Lambda Amazon SQS AWS

    Step Functions AWS Fargate
  17. POC Architecture @loige AWS Batch S3 Step Functions Lambda SQS

  18. Measure Everything! 📏 ⏱ Built metrics in from the start

    📊 AWS metrics we wish existed out of the box: - Number of running containers - Success/failure counts 🎨 Custom metrics: - Scheduler overhead - Detailed timings (job duration, I/O time, algorithm steps) 🛠 Using CloudWatch EMF @loige
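CloudWatch EMF lets a worker emit custom metrics (like scheduler overhead or job duration) just by printing structured JSON to its logs, with no PutMetricData calls on the hot path. A minimal sketch of an EMF record builder; the namespace, dimension and metric names are illustrative, not the team's actual ones:

```python
import json
import time

def emf_record(namespace, dimensions, metric_name, value, unit="Milliseconds"):
    """Build a CloudWatch Embedded Metric Format (EMF) log line.

    When this JSON is written to CloudWatch Logs, CloudWatch extracts
    the metric automatically from the `_aws` metadata envelope.
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
    }
    record.update(dimensions)  # dimension values ride along as top-level fields
    return json.dumps(record)

# Illustrative usage: record how long one modelling job took.
print(emf_record("RiskRollup", {"Stage": "modelling"}, "JobDuration", 5321))
```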
  19. Measure Everything! 📏 👍 Rollup in 1 hour ☁ Running

    on AWS Batch 👎 Cluster utilisation was <50% ✅ Goal success 🤔 Understanding of what needs to be addressed @loige
  20. Beyond the PoC Production: optimise for unique workload characteristics @loige

  21. Job Plan @loige

  22. In reality, not all jobs are alike! @loige

  23. Horizontal scaling 🚀 1000’s of jobs Duration: 1 second –

    45 minutes Scaling horizontally = splitting jobs Jobs split according to their complexity/duration Resulting in >1 million jobs @loige
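The splitting idea above can be sketched in a few lines: estimate a job's duration, then cut it into sub-jobs sized to a target runtime, turning thousands of uneven jobs into over a million uniform ones. The heuristic and the 5-minute target are assumptions for illustration, not the talk's actual thresholds:

```python
import math

def split_job(job_id, estimated_minutes, target_minutes=5):
    """Split one logical job into sub-jobs of roughly `target_minutes` each,
    so long-tail jobs (up to 45 minutes) don't dominate the rollup's runtime.
    The sizing heuristic here is a hypothetical sketch."""
    parts = max(1, math.ceil(estimated_minutes / target_minutes))
    return [f"{job_id}/part-{i}" for i in range(parts)]

# A 45-minute job becomes 9 five-minute sub-jobs; a 1-minute job stays whole.
```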
  24. Moving to production 🚢 @loige

  25. Scope @loige

  26. Actual End to End overview @loige

  27. Modelling Worker @loige

  28. Compute Services AWS Fargate: scales to 1000’s of tasks (containers), little management

    overhead, up to 4 vCPUs and 30GB memory, up to 200GB ephemeral storage. AWS Lambda: scales to 1000’s of function containers (in seconds!), very little management overhead, up to 6 vCPUs and 10GB memory, up to 10GB ephemeral storage. It wasn’t always this way! @loige
  29. Store all the things in S3! The source of truth

    for: • Input Data (JSON, Parquet) • Intermediate Data (Parquet) • Results (Parquet) • Aggregates (Parquet) Input data: 20GB Output data: ~1 TB Reads and writes: 10,000s of objects per second. @loige
  30. Scheduling and Orchestration ✅ We have our cluster (Fargate or

    Lambda) ✅ We have a plan! (list of jobs, parameters and dependencies) 🤔 How do we feed this plan to the cluster?! 🤨 Existing schedulers use traditional clusters – there is no serverless job scheduler for workloads like this! @loige
  31. Lifecycle of a Job @loige A new job gets queued

    here 👇 A worker picks up the job and executes it The worker emits the job state (success or failure)
  32. Event-Driven Scheduler @loige Job states are pulled from a Kinesis

    Data Stream Redis stores: - Job states - Dependencies This scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next
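The scheduler's core step above is a dependency-resolution loop: on each success event, check which pending jobs now have all their dependencies met. A minimal sketch, with one big assumption: a plain dict and a direct method call stand in for the Redis state store and the Kinesis event stream:

```python
class DependencyScheduler:
    """Sketch of the event-driven scheduler's resolution step.

    In production, job states arrive as Kinesis Data Stream records and
    the state/dependency graph lives in Redis; this in-memory version
    only illustrates the decision logic."""

    def __init__(self, dependencies):
        # pending job -> set of jobs it still depends on
        self.pending = {job: set(deps) for job, deps in dependencies.items()}
        self.completed = set()

    def on_job_succeeded(self, job):
        """Record one success event; return jobs that are now ready to queue."""
        self.completed.add(job)
        ready = [j for j, deps in self.pending.items() if deps <= self.completed]
        for j in ready:
            del self.pending[j]  # never schedule the same job twice
        return ready
```

A real scheduler also has to handle failure events and retries, which this sketch omits.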
  33. Dynamic Runtime Handling @loige We also need to handle system

    failures!
  34. Outcomes 🙌 Business • Rollup in 1 hour • Removed

    limits on number of runs • Faster, more consistent deal analytics • Business spending more time on revenue-generating activities • Support portfolio growth and deliver new capabilities @loige Technology • Brought serverless to HPC financial modeling • Reduced codebase by ~70% • Lowered total cost of ownership • Increased dev team agility • Reduced carbon footprint
  35. Hitting the limits 😰 @loige

  36. S3 Throughput @loige

  37. S3 Partitioning S3 cleverly detects high-throughput prefixes and creates partitions

    … normally. If this does not happen: 🚨 Please reduce your request rate; Status Code: 503; Error Code: SlowDown @loige
  38. The Solution Explicit Partitioning: ◦ Figure out how many partitions you

    need ◦ Update code to create keys uniformly distributed over all partitions /part/0… /part/1… /part/2… /part/3… … /part/f… @loige 1. Talk (a lot) to AWS SAs, Support and your Account Manager for special requirements like this! 2. Think ahead if you have multiple accounts for different environments!
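One common way to spread keys uniformly over pre-created prefixes is to derive the partition from a stable hash of the object's identity. A sketch under assumptions: a 16-way split on the first hex character of an MD5 digest, with illustrative key names:

```python
import hashlib

NUM_PARTITIONS = 16  # illustrative: one partition per hex character, /part/0 … /part/f

def partitioned_key(job_id, suffix):
    """Prefix each S3 key with a hash-derived partition so reads/writes
    spread evenly across all partitions instead of hammering one prefix.
    The key layout is a hypothetical example, not the actual scheme."""
    part = hashlib.md5(job_id.encode()).hexdigest()[0]
    return f"part/{part}/{job_id}/{suffix}"
```

Because the hash is stable, a job always reads and writes under the same prefix, while different jobs land on different partitions.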
  39. Fargate Scaling • We want to run 3000 containers ASAP

    • This took > 1 hour! • We built a custom Fargate scaler ◦ Using the RunTask API (no ECS Service) ◦ Hidden quota increases ◦ Step Function + Lambda • 3000 containers in ~20 minutes @loige The AWS ECS team has since made lots of improvements, making it possible to scale to 3,000 containers in under 5 minutes
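The fan-out part of a custom scaler like this is simple arithmetic: ECS RunTask accepts at most 10 tasks per call, so 3,000 containers means 300 calls. A sketch with the launcher injected so the batching is testable; in practice it would wrap boto3's `ecs_client.run_task`, and a real scaler also handles throttling and capacity errors (both omitted here):

```python
def launch_fleet(total_tasks, run_task, batch_size=10):
    """Fan out RunTask calls in batches of at most `batch_size` (the API
    caps `count` at 10 tasks per call). `run_task` is an injected callable;
    wiring it to boto3 ecs_client.run_task(..., count=n, launchType="FARGATE")
    is left as an assumption."""
    launched = 0
    while launched < total_tasks:
        count = min(batch_size, total_tasks - launched)
        run_task(count)
        launched += count
    return launched
```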
  40. How high can we go today? 🚀 10,000 concurrent Lambda

    functions in seconds 🎢 10,000 Fargate containers in 10 minutes 💸 No additional cost vladionescu.me/posts/scaling-containers-on-aws-in-2022 @loige
  41. Wrapping up 🎁 • "Serverless supercomputer" lets you do HPC

    with commodity AWS compute • Plenty of challenges, but it's doable! • Agility and innovation benefits are massive • Customer is now serverless-first and expert in AWS Other interesting case studies: ☁ AWS HTC Grid - 🧬 COVID genome research @loige
  42. Special thanks to @eoins, @cmthorne10 and the awesome team at

    RenRe! @loige fth.link/cm22 Serverless for HPC? IT WORKS!