Serverless for High Performance Computing

Serverless is great for web applications and APIs, but that does not mean it cannot be used successfully for other use cases. In this talk, we will discuss a successful application of serverless in the field of High Performance Computing. Specifically, we will discuss how Lambda, Fargate, Kinesis and other serverless technologies are being used to run sophisticated financial models at one of the major reinsurance companies in the world. We will learn about the architecture, the trade-offs, some challenges and some unresolved pain points. Most importantly, we'll find out if serverless can be a great fit for HPC and if we can finally stop managing those boring EC2 instances!

Luciano Mammino

October 18, 2022

Transcript

  1. Luciano Mammino (@loige)
    Serverless for HPC 🚀
    fth.link/cm22

  2. Is Serverless a good option for
    High Performance Computing?
    @loige

  3. 👋 Hello, I am Luciano
    Senior architect
    nodejsdesignpatterns.com
    Let’s connect:
    🌎 loige.co
    🐦 @loige
    🎥 loige
    🧳 lucianomammino

  4. Middy Framework
    SLIC Starter - Serverless Accelerator
    SLIC Watch - Observability Plugin
    Business-focused technologists.
    Accelerated Serverless | AI as a Service | Platform Modernisation

  5. We host a podcast about AWS and Cloud computing
    🔗 awsbites.com
    🎬 YouTube Channel
    🎙 Podcast
    📅 Episodes every week
    @loige

  6. Get the slides: fth.link/cm22
    @loige

  7. Agenda
    ● The 6 Rs of Cloud Migration
    ● A serverless case study
    ○ The problem space and types of workflows
    ○ Original on premise implementation
    ○ The PoC
    ○ The final production version
    ○ The components of a serverless job scheduler
    ○ Challenges & Limits
    @loige
    fth.link/cm22

  8. The 6 Rs of Cloud Migrations
    @loige
    🗑 🕸 🚚
    Retire Retain Rehost
    🏗 📐 💰
    Replatform Refactor Repurchase
    fth.link/cm22

  9. A case study
    @loige
    Case study on AWS blog: fth.link/awshpc

  10. The workloads - Risk Rollup
    🏦 Financial modeling to understand the portfolio of risk
    🧠 Internal, custom-built risk model on all reinsurance deals
    ⚙ HPC (High-Performance Computing) workload
    🗄 ~45 TB of data processed
    ⏱ 2–3 rollups per day (6–8 hours each!)
    @loige

  11. The workloads - Deal Analytics
    ⚡ Near real-time deal pricing using the same risk model
    🗃 Lower data volumes
    🔁 High frequency of execution – up to 1,000 per day
    @loige

  12. Original on-prem implementation
    @loige

  13. Challenges
    🐢 Long execution times, constraining business agility
    🥊 Competing workloads
    📈 Limits our ability to support portfolio growth
    😩 Can’t deliver new features
    🧾 Very high total cost of ownership
    @loige

  14. Thinking Big
    💭 Imagine a solution that would …
    1. Offer a dramatic increase in performance
    2. Provide consistent run times
    3. Support more executions, more often
    4. Support future portfolio growth and new
    capabilities – 15x data volumes
    @loige

  15. The Goal ⚽
    Run a Risk Rollup in 1 hour!
    @loige

  16. Architecture Options for Compute/Orchestration
    @loige
    AWS Lambda
    Amazon SQS AWS Step Functions
    AWS Fargate
    Common to them: reduce the problem to simple, scalable, event-driven components

  17. POC Architecture
    @loige
    AWS Batch
    S3
    Step Functions
    Lambda
    SQS

  18. Measure Everything! 📏
    ⏱ Built metrics in from the start
    AWS metrics we wish existed out of the box:
    - Number of running containers
    - Success/failure counts
    🎨 Custom metrics:
    - Scheduler overhead
    - Detailed timings (job duration, I/O time, algorithm steps)
    🛠 Using CloudWatch and the Embedded Metric Format (EMF) – see the sketch below
    @loige
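
    As a rough illustration of this approach, here is a minimal sketch of emitting a custom metric (e.g. a job duration) via the CloudWatch Embedded Metric Format from a Node.js worker. The namespace, dimension and metric names are invented for the example, not the real ones used in the project.

```typescript
// Minimal sketch: emit a custom metric using the CloudWatch Embedded Metric
// Format (EMF). In Lambda, printing this JSON to stdout is enough for
// CloudWatch to extract the metric from the log; on Fargate the log line
// would typically reach CloudWatch Logs via the awslogs driver or the agent.
// Namespace, dimensions and metric names are illustrative assumptions.
function emitJobMetric(jobType: string, durationMs: number, success: boolean): void {
  const emfPayload = {
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [
        {
          Namespace: 'RiskRollup', // hypothetical namespace
          Dimensions: [['JobType']],
          Metrics: [
            { Name: 'JobDuration', Unit: 'Milliseconds' },
            { Name: 'JobSuccess', Unit: 'Count' },
          ],
        },
      ],
    },
    JobType: jobType,
    JobDuration: durationMs,
    JobSuccess: success ? 1 : 0,
  };
  console.log(JSON.stringify(emfPayload));
}

// Example usage inside a worker:
emitJobMetric('rollup-step', 1234, true);
```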

  19. Measure Everything! 📏
    👍 Rollup in 1 hour
    ☁ Running on AWS Batch
    👎 Cluster utilisation was <50%
    ✅ Goal success
    🤔 Understanding of what needs to be addressed
    @loige

  20. Beyond the PoC
    Production: optimise for unique workload characteristics
    @loige

  21. Job Plan
    @loige

  22. In reality, not all jobs are alike!
    @loige

  23. Horizontal scaling 🚀
    1,000s of jobs
    Duration: 1 second – 45 minutes
    Scaling horizontally = splitting jobs
    Jobs split according to their complexity/duration (see the sketch below)
    Resulting in >1 million jobs
    @loige
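
    The actual splitting logic is proprietary, but as a sketch of the idea, assuming each job in the plan carries a rough duration estimate (the Job shape and the 5-minute target below are invented for illustration):

```typescript
// Sketch of horizontal scaling by splitting: each coarse job is broken into
// sub-jobs sized so that no single unit runs much longer than a target duration.
// The Job shape and the 5-minute target are assumptions for illustration only.
interface Job {
  id: string;
  estimatedMinutes: number; // rough duration estimate from the job plan
}

const TARGET_MINUTES = 5;

function splitJob(job: Job): Job[] {
  const parts = Math.max(1, Math.ceil(job.estimatedMinutes / TARGET_MINUTES));
  return Array.from({ length: parts }, (_, i) => ({
    id: `${job.id}/part-${i}`,
    estimatedMinutes: job.estimatedMinutes / parts,
  }));
}

// Thousands of coarse jobs expand into a much larger set of similarly sized
// units that can be processed in parallel.
const originalPlan: Job[] = [
  { id: 'deal-001', estimatedMinutes: 45 },
  { id: 'deal-002', estimatedMinutes: 1 },
];
const expandedPlan = originalPlan.flatMap((job) => splitJob(job));
```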

  24. Moving to production 🚢
    @loige

  25. Actual end-to-end overview
    @loige

  26. Modelling Worker
    @loige

  27. Compute Services
    AWS Fargate:
    ● Scales to 1,000s of tasks (containers)
    ● Little management overhead
    ● Up to 4 vCPUs and 30GB Memory
    ● Up to 200GB ephemeral storage
    AWS Lambda:
    ● Scales to 1,000s of function containers (in seconds!)
    ● Very little management overhead
    ● Up to 6 vCPUs and 10GB Memory
    ● Up to 10GB ephemeral storage
    It wasn’t always this way!
    @loige

  28. Store all the things in S3!
    The source of truth for:
    ● Input Data (JSON, Parquet)
    ● Intermediate Data (Parquet)
    ● Results (Parquet)
    ● Aggregates (Parquet)
    Input data: 20GB
    Output data: ~1 TB
    Reads and writes: 10,000s of objects per second.
    @loige

  29. Scheduling and Orchestration
    ✅ We have our cluster (Fargate or Lambda)
    ✅ We have a plan! (list of jobs, parameters and
    dependencies)
    🤔 How do we feed this plan to the cluster?!
    🤨 Existing schedulers use traditional clusters – there
    is no serverless job scheduler for workloads like this!
    @loige

  30. Lifecycle of a Job
    @loige
    1. A new job gets queued
    2. A worker picks up the job and executes it
    3. The worker emits the job state (success or failure)
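
    A minimal sketch of the worker side of this lifecycle, assuming an SQS job queue and a Kinesis Data Stream for job-state events; the queue URL, stream name, payload shape and the runModel step are placeholders, not the production code.

```typescript
// Sketch of a worker: pull a job from SQS, execute it, then emit the resulting
// job state to a Kinesis Data Stream so the scheduler can react to it.
// Queue URL, stream name and payload shape are illustrative assumptions.
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from '@aws-sdk/client-sqs';
import { KinesisClient, PutRecordCommand } from '@aws-sdk/client-kinesis';

const sqs = new SQSClient({});
const kinesis = new KinesisClient({});
const QUEUE_URL = process.env.JOB_QUEUE_URL!;       // hypothetical env var
const STATE_STREAM = process.env.JOB_STATE_STREAM!; // hypothetical env var

async function runModel(job: { id: string }): Promise<void> {
  // Placeholder for the actual financial modelling step.
}

async function pollOnce(): Promise<void> {
  const { Messages = [] } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 1,
      WaitTimeSeconds: 20,
    })
  );
  for (const message of Messages) {
    const job = JSON.parse(message.Body!);
    let status = 'SUCCEEDED';
    try {
      await runModel(job);
    } catch {
      status = 'FAILED';
    }
    // Emit the job state (success or failure) for the scheduler.
    await kinesis.send(
      new PutRecordCommand({
        StreamName: STATE_STREAM,
        PartitionKey: job.id,
        Data: Buffer.from(JSON.stringify({ jobId: job.id, status })),
      })
    );
    await sqs.send(
      new DeleteMessageCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle!,
      })
    );
  }
}
```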

  31. Event-Driven Scheduler
    @loige
    ● Job states are pulled from a Kinesis Data Stream
    ● Redis stores job states and dependencies
    ● The scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next
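
    A minimal sketch of that dependency-checking logic, assuming job states arrive via a Kinesis-triggered Lambda and Redis holds states plus forward/reverse dependency sets; the Redis key names, queue URL and record shape are assumptions for illustration.

```typescript
// Sketch of the event-driven scheduler: consume job-state records from a
// Kinesis-triggered Lambda, persist the state in Redis, and enqueue any job
// whose dependencies have all succeeded. Key names and shapes are illustrative.
import Redis from 'ioredis';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import type { KinesisStreamEvent } from 'aws-lambda';

const redis = new Redis(process.env.REDIS_URL!); // hypothetical connection string
const sqs = new SQSClient({});
const JOB_QUEUE_URL = process.env.JOB_QUEUE_URL!;

export async function handler(event: KinesisStreamEvent): Promise<void> {
  for (const record of event.Records) {
    const { jobId, status } = JSON.parse(
      Buffer.from(record.kinesis.data, 'base64').toString('utf8')
    );

    // 1. Record the new job state.
    await redis.hset('job:states', jobId, status);
    if (status !== 'SUCCEEDED') continue;

    // 2. Find jobs that depend on the one that just finished.
    const dependants = await redis.smembers(`job:dependants:${jobId}`);

    for (const dependant of dependants) {
      // 3. Check whether all of this job's dependencies have succeeded.
      const deps = await redis.smembers(`job:dependencies:${dependant}`);
      const states = await Promise.all(deps.map((d) => redis.hget('job:states', d)));
      const ready = states.every((s) => s === 'SUCCEEDED');

      // 4. If so, schedule it by sending it to the worker queue.
      if (ready) {
        await sqs.send(
          new SendMessageCommand({
            QueueUrl: JOB_QUEUE_URL,
            MessageBody: JSON.stringify({ id: dependant }),
          })
        );
      }
    }
  }
}
```

    This sketch only shows the happy path of the dependency check; the real scheduler also has to guard against double-scheduling and handle the failure cases covered on the next slide.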

  32. Dynamic Runtime Handling
    We also need to handle system failures!
    @loige

  33. Outcomes 🙌
    Business
    ● Rollup in 1 hour
    ● Removed limits on number of runs
    ● Faster, more consistent deal analytics
    ● Business spending more time on revenue-generating activities
    ● Support portfolio growth and deliver new capabilities
    Technology
    ● Brought serverless to HPC financial modeling
    ● Reduced codebase by ~70%
    ● Lowered total cost of ownership
    ● Increased dev team agility
    ● Reduced carbon footprint
    @loige

  34. Hitting the limits 😰
    @loige

  35. S3 Throughput
    @loige

  36. S3 Partitioning
    S3 cleverly detects high-throughput prefixes and creates partitions… normally.
    If this does not happen…
    🚨 Please reduce your request rate;
    Status Code: 503; Error Code: SlowDown
    @loige

  37. The Solution
    Explicit Partitioning:
    ○ Figure out how many partitions you need
    ○ Update code to create keys uniformly distributed over all partitions (see the sketch below)
    /part/0…
    /part/1…
    /part/2…
    /part/3…

    /part/f…
    @loige
    1. Talk (a lot) to AWS SAs, Support, Account
    Manager for special requirements like this!
    2. Think ahead if you have multiple accounts
    for different environments!
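
    One way to implement that key distribution, assuming 16 partitions selected by hashing the job id (the key layout and partition count below are illustrative, mirroring the /part/0… to /part/f… prefixes above):

```typescript
// Sketch of explicit S3 partitioning: spread object keys uniformly over a fixed
// set of prefixes so each prefix can be partitioned and scaled independently.
// The 16-partition scheme and key layout are assumptions for illustration.
import { createHash } from 'node:crypto';

const PARTITIONS = 16; // agree the partition count with AWS ahead of time

function partitionedKey(jobId: string, fileName: string): string {
  // Hash the job id and map it onto one of the 16 hex partitions.
  const digest = createHash('sha256').update(jobId).digest('hex');
  const partition = (parseInt(digest.slice(0, 8), 16) % PARTITIONS).toString(16);
  return `part/${partition}/${jobId}/${fileName}`;
}

// e.g. "part/a/deal-001/results.parquet"
console.log(partitionedKey('deal-001', 'results.parquet'));
```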

  38. Fargate Scaling
    ● We want to run 3000 containers ASAP
    ● This took > 1 hour!
    ● We built a custom Fargate scaler
    ○ Using the RunTask API (no ECS Service) – see the sketch below
    ○ Hidden quota increases
    ○ Step Function + Lambda
    ● 3000 containers in ~20 minutes
    @loige
    The AWS ECS team has since made lots of
    improvements, making it possible to scale to
    3,000 containers in under 5 minutes
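
    A rough sketch of the RunTask-based approach; the cluster, task definition and networking values are placeholders, and as the slide notes, the real implementation drove this loop from a Step Function + Lambda and dealt with throttling and quotas.

```typescript
// Sketch of a custom Fargate scaler using the ECS RunTask API directly
// (no ECS Service). Cluster, task definition, subnets and security groups
// are placeholders; the production version handled retries and API throttling.
import { ECSClient, RunTaskCommand } from '@aws-sdk/client-ecs';

const ecs = new ECSClient({});

async function launchWorkers(total: number): Promise<void> {
  // RunTask can start at most 10 tasks per call, so launch in batches.
  for (let started = 0; started < total; started += 10) {
    const count = Math.min(10, total - started);
    await ecs.send(
      new RunTaskCommand({
        cluster: 'modelling-cluster',       // placeholder
        taskDefinition: 'modelling-worker', // placeholder
        launchType: 'FARGATE',
        count,
        networkConfiguration: {
          awsvpcConfiguration: {
            subnets: ['subnet-xxxxxxxx'],        // placeholder
            securityGroups: ['sg-xxxxxxxx'],     // placeholder
          },
        },
      })
    );
  }
}

launchWorkers(3000).catch(console.error);
```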

  39. How high can we go today?
    🚀 10,000 concurrent Lambda functions in seconds
    🎢 10,000 Fargate containers in 10 minutes
    💸 No additional cost
    vladionescu.me/posts/scaling-containers-on-aws-in-2022
    @loige

  40. Wrapping up 🎁
    ● "Serverless supercomputer" lets you do HPC with
    commodity AWS compute
    ● Plenty of challenges, but it's doable!
    ● Agility and innovation benefits are massive
    ● Customer is now serverless-first and expert in AWS
    Other interesting case studies:
    ☁ AWS HTC Grid - 🧬 COVID genome research
    @loige

  41. Special thanks to @eoins, @cmthorne10 and the awesome team at RenRe!
    @loige
    fth.link/cm22
    Serverless for HPC?
    IT WORKS!
