
Building High Availability and Scalability Online Computing Applications on Ray (Tengwei Cai, Ant Group)

In many computing scenarios at Ant Group, computing applications need to provide synchronous and reliable services to satisfy the needs of online financial business systems. We present a general framework that empowers a Ray job to become an online application with high availability and scalability.

Anyscale

July 19, 2021

Transcript

  1. Our First Scenario

     • Strategy and indicator calculation for payment institutions
     • The problem is a single-node performance bottleneck
     • The number of institutions is increasing
     • The latency of the calculation is increasing
     • The logic is complex and the cost of rewriting it is high
     • How do we solve this problem at low cost?

       // pseudocode of the original single-node calculation
       import java.util.List;
       import java.util.Map;

       public class Strategy {

           // Entry point: walk over every bank and compute its indicators.
           public void calc(long time, Map<String, List<String>> banksAndIndicators) {
               for (Map.Entry<String, List<String>> e : banksAndIndicators.entrySet()) {
                   calcIndicator(time, e.getKey(), e.getValue());
               }
           }

           // Calculate all indicators of one bank, one by one.
           public void calcIndicator(long time, String bank, List<String> indicators) {
               for (String indicator : indicators) {
                   calcBankIndicators(time, bank, indicator);
               }
               // do the indicators' data calculation
           }

           // Calculate one indicator of one bank.
           public void calcBankIndicators(long time, String bank, String indicator) {
               // do the bank data calculation
           }
       }
  2. The Design of the New System

     • New system design:
       • Triggered by RPC or events, with very low latency
       • Easy to rewrite from the original code
       • Dynamic task execution
     • What's running today: the service trigger executes Task #0 through Task #N
       one after another, then processes the results
     • What we want: the trigger fans the tasks out to Ray workers for parallel
       execution, gathers the results, and then processes them (see the sketch
       below)
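     To make the fan-out/gather pattern concrete, here is a minimal sketch using
     Ray's Java API; the ParallelStrategy class, its sample data, and the per-bank
     task granularity are illustrative assumptions, not the actual Ant Group code:

       import io.ray.api.ObjectRef;
       import io.ray.api.Ray;
       import java.util.ArrayList;
       import java.util.List;
       import java.util.Map;

       public class ParallelStrategy {

           // Runs on a Ray worker: compute all indicators of one bank.
           public static String calcIndicator(long time, String bank,
                                              List<String> indicators) {
               // ... per-bank indicator calculation would happen here ...
               return bank + " done";
           }

           public static void main(String[] args) {
               Ray.init();
               Map<String, List<String>> banksAndIndicators =
                   Map.of("bankA", List.of("latency", "successRate"),
                          "bankB", List.of("latency"));

               // Fan out: one Ray task per bank, executed in parallel.
               long time = System.currentTimeMillis();
               List<ObjectRef<String>> refs = new ArrayList<>();
               for (Map.Entry<String, List<String>> e : banksAndIndicators.entrySet()) {
                   refs.add(Ray.task(ParallelStrategy::calcIndicator,
                                     time, e.getKey(), e.getValue()).remote());
               }

               // Gather: block until every result is back, then process them.
               List<String> results = Ray.get(refs);
               results.forEach(System.out::println);
           }
       }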
  3. The New System Architecture

     • One actor, the AppMaster, manages the other actors
     • Trigger actors receive tasks from the outside through an RPC server
     • Worker actors are started in advance to save runtime code-initialization
       time
     • A dispatcher routes tasks to the next level of actors with load-balancing
       strategies
     • The user app triggers a task and waits for the results (within 2xx ms); L1
       worker actors execute `calcIndicator`, and L2 worker actors execute
       `calcBankIndicators` (a sketch of the pattern follows)
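     A rough sketch of the pre-started worker pattern with Ray's Java actor API;
     WorkerActor and the round-robin dispatch below are hypothetical stand-ins for
     the real ExecutionRuntime and Dispatcher:

       import io.ray.api.ActorHandle;
       import io.ray.api.Ray;
       import java.util.ArrayList;
       import java.util.List;

       public class WorkerActor {

           public WorkerActor() {
               // User code is loaded here, ahead of the first request,
               // so no request pays the initialization cost.
           }

           public String execute(String task) {
               return "result of " + task;
           }

           public static void main(String[] args) {
               Ray.init();
               // AppMaster-style setup: start the worker actors in advance.
               List<ActorHandle<WorkerActor>> workers = new ArrayList<>();
               for (int i = 0; i < 3; i++) {
                   workers.add(Ray.actor(WorkerActor::new).remote());
               }
               // Dispatcher-style routing: pick a worker round-robin per task.
               for (int taskId = 0; taskId < 6; taskId++) {
                   ActorHandle<WorkerActor> worker =
                       workers.get(taskId % workers.size());
                   System.out.println(
                       worker.task(WorkerActor::execute, "task-" + taskId)
                             .remote().get());
               }
           }
       }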
  4. Model Serving

     • Online machine learning
       • Continuous model exporting, 10-30 minutes per iteration
       • Apply models online as soon as possible
     • OCR/NLP/CV
       • Heterogeneous serving
     • Pipeline: streaming training exports a model every 10-30 minutes; serving
       actors (prediction engines with NN/LR model-serving runtimes, plus an
       ensemble engine) load the update and answer the user app's RPC calls
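     A hedged sketch of how a serving actor could apply freshly exported models
     without a restart; PredictionActor, updateModel, and predict are hypothetical
     names, not the framework's real API:

       import io.ray.api.ActorHandle;
       import io.ray.api.Ray;

       public class PredictionActor {

           private volatile String modelVersion = "v0";

           // Called by the training pipeline after each export (~10-30 minutes).
           public void updateModel(String newVersion) {
               // load the newly exported model, then switch to it atomically
               this.modelVersion = newVersion;
           }

           // Called on the user app's RPC path.
           public String predict(String features) {
               return "prediction from model " + modelVersion;
           }

           public static void main(String[] args) {
               Ray.init();
               ActorHandle<PredictionActor> engine =
                   Ray.actor(PredictionActor::new).remote();
               // Serving path: a prediction request becomes an actor task.
               System.out.println(
                   engine.task(PredictionActor::predict, "f1,f2").remote().get());
               // Update path: hot-swap the model while the actor keeps serving.
               engine.task(PredictionActor::updateModel, "v1").remote();
           }
       }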
  5. Function as a Service

     • An event-driven platform like Knative Eventing transforms event consumption
       into RPC calls
     • FaaS instances scale in and out automatically (building)
     • Flow: event sources send events to the event platform, whose triggers
       deliver them as RPC calls to Ray actors running the FaaS runtime and user
       functions (Func A, Func B); the user app can also call a function directly
       via RPC
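     One plausible shape for the FaaS runtime, sketched as a Ray actor that maps
     function names to user code so events arrive as plain RPC calls;
     FaasRuntimeActor and invoke are invented for illustration:

       import io.ray.api.ActorHandle;
       import io.ray.api.Ray;
       import java.util.Map;
       import java.util.function.Function;

       public class FaasRuntimeActor {

           // Registered user functions, keyed by name.
           private final Map<String, Function<String, String>> functions = Map.of(
               "userFuncA", payload -> "A handled " + payload,
               "userFuncB", payload -> "B handled " + payload);

           // The event platform's trigger lands here as an ordinary actor call.
           public String invoke(String funcName, String eventPayload) {
               return functions.get(funcName).apply(eventPayload);
           }

           public static void main(String[] args) {
               Ray.init();
               ActorHandle<FaasRuntimeActor> runtime =
                   Ray.actor(FaasRuntimeActor::new).remote();
               System.out.println(
                   runtime.task(FaasRuntimeActor::invoke, "userFuncA", "event-1")
                          .remote().get());
           }
       }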
  6. Online Resource Allocation

     • Online resource allocation aims to solve large-scale linear programming
       problems with high performance
     • Check out the Ant Group talk "Application of online resource allocation
       based on Ray"
  7. A General Serving Framework

     • A general serving framework is needed: Ant Ray Serving
     • Challenges:
       • High availability, SLA > 99.9%
       • Fast upgrades: 500+ instances in 10 minutes
       • Easy scaling: scale out/in within minutes
  8. Cross-Cluster Architecture

     • New role introduced: the Serving Keeper
       • Cross-cluster service management
       • Proxies service deployment requests
       • Starts service jobs in multiple Ray clusters automatically
       • Coordinates multiple services to:
         • Adjust the traffic ratio between different jobs
         • Accomplish blue-green and canary releases
     • Diagram: the Serving Client sends start/stop, scale, and upgrade requests
       to the ServingKeeper, which drives the AppMaster actor (and its proxy and
       backend actors) in each serving cluster (a routing sketch follows)
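     The traffic-ratio adjustment can be pictured as weighted routing; the
     WeightedRouter below is a self-contained sketch of that idea, not the
     ServingKeeper's actual code:

       import java.util.Map;
       import java.util.Random;
       import java.util.TreeMap;

       public class WeightedRouter {

           // Cumulative weight -> job name, for O(log n) weighted picks.
           private final TreeMap<Integer, String> cumulative = new TreeMap<>();
           private final int total;
           private final Random random = new Random();

           public WeightedRouter(Map<String, Integer> jobWeights) {
               int sum = 0;
               for (Map.Entry<String, Integer> e : jobWeights.entrySet()) {
                   sum += e.getValue();
                   cumulative.put(sum, e.getKey());
               }
               this.total = sum;
           }

           // Pick a job in proportion to its weight.
           public String route() {
               return cumulative.higherEntry(random.nextInt(total)).getValue();
           }

           public static void main(String[] args) {
               // Canary release: send ~10% of the traffic to the new job.
               WeightedRouter router =
                   new WeightedRouter(Map.of("job-stable", 90, "job-canary", 10));
               System.out.println(router.route());
           }
       }

     Shifting the weights toward the new job completes the canary release;
     blue-green is the special case of flipping all the weight from one job to
     the other at once.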
  9. Cross-Cluster Architecture

     • Cross-cluster service discovery
       • Two-level service discovery: per-cluster names (service.i.xxx.com,
         service.ii.xxx.com) registered under a global name
         (service.global.xxx.com) with per-cluster weights (e.g. 2 : 1)
       • Local-first load balancing
       • Adjust traffic between clusters flexibly
       • Ray cluster disaster tolerance and rolling updates
     • Flow: on an RPC trigger from the user application, the Serving Client
       pulls instances, selects a living instance (local-first load balancing),
       and makes the RPC call to the selected instance (sketched below)
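     A small sketch of local-first load balancing over instances pulled from
     two-level service discovery; the Instance record and select method are
     illustrative:

       import java.util.Comparator;
       import java.util.List;
       import java.util.Optional;

       public class LocalFirstBalancer {

           record Instance(String address, String cluster, boolean alive) {}

           // Prefer a living instance in our own cluster; fall back to any
           // living instance in another cluster.
           static Optional<Instance> select(List<Instance> pulled,
                                            String localCluster) {
               return pulled.stream()
                   .filter(Instance::alive)
                   .min(Comparator.comparing(
                       i -> i.cluster().equals(localCluster) ? 0 : 1));
           }

           public static void main(String[] args) {
               List<Instance> instances = List.of(
                   new Instance("10.0.0.1:80", "cluster-i", false),
                   new Instance("10.0.0.2:80", "cluster-i", true),
                   new Instance("10.0.1.1:80", "cluster-ii", true));
               // Picks 10.0.0.2:80 -- alive and local.
               System.out.println(select(instances, "cluster-i").orElseThrow());
           }
       }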
  10. AppMaster State Management

     • Non-volatile state makes the AppMaster highly reliable
     • We use a database such as MySQL as the state backend in production
     • AppMaster state consists of:
       • Service metadata
       • Proxy and backend actors' handles
       • Service configuration, such as the replica number and the latest
         backend config
       • Runtime status of the proxies, the backends, and the AppMaster itself
     • Scene #1: the AppMaster actor goes down
       1. Ray reconstructs the AppMaster actor
       2. The AppMaster finds that only the actor was reconstructed, resumes all
          state into memory, and continues to serve
     • Scene #2: the whole service goes down
       1. The service job is restarted by the user or the Serving Keeper
       2. The AppMaster finds that this is not an actor reconstruction, resumes
          the service metadata and runtime configurations, and abandons the
          runtime status and actor handles
       3. The AppMaster re-creates all proxy and backend actors and saves the
          new runtime status
     • The AppMaster persists every state change to the state backend and
       restores from it on recovery (see the sketch below)
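     The two recovery scenes can be sketched as follows; StateBackend, the state
     keys, and the recover method are assumptions standing in for the real
     MySQL-backed implementation:

       import java.util.HashMap;
       import java.util.Map;

       public class AppMasterState {

           // Abstracted persistence layer; MySQL in production.
           interface StateBackend {
               Map<String, String> load();
               void persist(Map<String, String> state);
           }

           private final Map<String, String> state = new HashMap<>();

           // actorReconstructed is true in Scene #1 and false in Scene #2.
           void recover(StateBackend backend, boolean actorReconstructed) {
               state.putAll(backend.load());
               if (actorReconstructed) {
                   // Scene #1: same job, same actors -- resume everything
                   // into memory and continue to serve.
                   return;
               }
               // Scene #2: a fresh job -- the stored actor handles and runtime
               // status are stale, so abandon them and rebuild the actors.
               state.remove("actorHandles");
               state.remove("runtimeStatus");
               recreateProxyAndBackendActors();
               state.put("runtimeStatus", "recreated");
               backend.persist(state);
           }

           void recreateProxyAndBackendActors() {
               // start new proxy & backend actors and record their handles
           }
       }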
  11. Service Continuous Upgrade

     • The upgrade step size is usually no bigger than 25%
     • Remove traffic before upgrading
     • Reloading in place is faster, whether upgrading or rolling back
     • Validation is very important in online learning
     • Upgrade state machine of a serving backend: Serving → Pre-Updating
       (prepare resources & configuration) → Detached (remove itself from
       service discovery) → Upgrading (reload/upgrade once there is no
       traffic) → Validating (self-validate with recorded requests) → Serving
       on success, or Rollbacking to the previous version on failure (encoded
       below)
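     The upgrade flow maps naturally onto a state machine; this enum-based
     sketch mirrors the states named on the slide, with hypothetical transition
     logic:

       public class BackendUpgrade {

           enum State {
               SERVING, PRE_UPDATING, DETACHED, UPGRADING, VALIDATING, ROLLBACKING
           }

           private State state = State.SERVING;

           void upgrade() {
               transition(State.PRE_UPDATING);    // prepare resources & configuration
               transition(State.DETACHED);        // remove itself from service discovery
               transition(State.UPGRADING);       // reload in place once traffic drains
               transition(State.VALIDATING);      // replay recorded requests
               if (validate()) {
                   transition(State.SERVING);     // success: rejoin service discovery
               } else {
                   transition(State.ROLLBACKING); // failure: reload the previous version
                   transition(State.SERVING);
               }
           }

           boolean validate() {
               // compare outputs on recorded requests against expected results
               return true;
           }

           private void transition(State next) {
               System.out.println(state + " -> " + next);
               state = next;
           }
       }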
  12. Serving Scalability

     • Scalability is not the biggest challenge when using Ray
     • Autoscaling (exploring):
       • Time-series algorithms for predictable traffic
     • The real question is: can you scale out fast enough when an unforeseen
       traffic peak arrives?
     • SLA - a "Scheduling" level agreement
  13. Overall Architecture

     • The user application's data plane sends requests through a Serving Client
       to the serving clusters; the control plane sends control instructions
       through a Serving Client to the ServingKeeper
     • Each serving cluster (I, II, III, ... X) spans several Ray nodes and runs
       an AppMaster plus proxies and Java/Python backends
     • Proxies register themselves with service discovery; clients pull
       instances from it
  14. Future Plan

     • Speed up online-learning model updating to be 5x faster
     • Apply autoscaling in most production scenarios
     • Explore more complex scenarios, such as online distributed computing
     • Collaboration with Ray Serve:
       • Support Java in Ray Serve
       • Build pluggable components for the controller, ingress, and service
         discovery together
       • Other potential features