
Building High Availability and Scalability Online Computing Applications on Ray (Tengwei Cai, Ant Group)

In many computing scenarios at Ant Group, computing applications need to provide synchronous and reliable services to satisfy the needs of online financial business systems. We present a general framework that empowers a Ray job to become an online application with high availability and scalability.

Anyscale

July 19, 2021

Transcript

  1. Our First Scenario

     • Strategy and indicator calculation for payment institutions
     • The problem is a single-node performance bottleneck
     • The number of institutions is increasing
     • The latency of the calculation is increasing
     • The logic is complex and the cost of rewriting it is high
     • How do we solve this problem at low cost?

       // pseudocode of the original single-node calculation
       import java.util.List;
       import java.util.Map;

       public class Strategy {

           // Entry point: walk over every bank and compute its indicators.
           public void calc(long time, Map<String, List<String>> banksAndIndicators) {
               for (Map.Entry<String, List<String>> e : banksAndIndicators.entrySet()) {
                   calcIndicator(time, e.getKey(), e.getValue());
               }
           }

           // Calculate all indicators of one bank, one by one.
           public void calcIndicator(long time, String bank, List<String> indicators) {
               for (String indicator : indicators) {
                   calcBankIndicators(time, bank, indicator);
               }
               // do the indicators' data calculation
           }

           // Calculate one indicator of one bank.
           public void calcBankIndicators(long time, String bank, String indicator) {
               // do the bank data calculation
           }
       }
  2. The Design of the New System

     • New system design:
       • Triggered by RPC or events, with very low latency
       • Easy to rewrite from the original code
       • Dynamic task execution
     • What's running today: the service trigger executes Task #0 through Task #N
       one after another, then processes the results
     • What we want: the trigger fans the tasks out to Ray workers for parallel
       execution, gathers the results, and then processes them (see the sketch
       below)
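     To make the fan-out/gather pattern concrete, here is a minimal sketch using
     Ray's Java API; the ParallelStrategy class, its sample data, and the per-bank
     task granularity are illustrative assumptions, not the actual Ant Group code:

       import io.ray.api.ObjectRef;
       import io.ray.api.Ray;
       import java.util.ArrayList;
       import java.util.List;
       import java.util.Map;

       public class ParallelStrategy {

           // Runs on a Ray worker: compute all indicators of one bank.
           public static String calcIndicator(long time, String bank,
                                              List<String> indicators) {
               // ... per-bank indicator calculation would happen here ...
               return bank + " done";
           }

           public static void main(String[] args) {
               Ray.init();
               Map<String, List<String>> banksAndIndicators =
                   Map.of("bankA", List.of("latency", "successRate"),
                          "bankB", List.of("latency"));

               // Fan out: one Ray task per bank, executed in parallel.
               long time = System.currentTimeMillis();
               List<ObjectRef<String>> refs = new ArrayList<>();
               for (Map.Entry<String, List<String>> e : banksAndIndicators.entrySet()) {
                   refs.add(Ray.task(ParallelStrategy::calcIndicator,
                                     time, e.getKey(), e.getValue()).remote());
               }

               // Gather: block until every result is back, then process them.
               List<String> results = Ray.get(refs);
               results.forEach(System.out::println);
           }
       }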
  3. The New System Architecture

     • One actor, the AppMaster, manages the other actors
     • Trigger actors receive tasks from the outside through an RPC server
     • Worker actors are started in advance to save runtime code-initialization
       time
     • A dispatcher routes tasks to the next level of actors with load-balancing
       strategies
     • The user app triggers a task and waits for the results (within 2xx ms); L1
       worker actors execute `calcIndicator`, and L2 worker actors execute
       `calcBankIndicators` (a sketch of the pattern follows)
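     A rough sketch of the pre-started worker pattern with Ray's Java actor API;
     WorkerActor and the round-robin dispatch below are hypothetical stand-ins for
     the real ExecutionRuntime and Dispatcher:

       import io.ray.api.ActorHandle;
       import io.ray.api.Ray;
       import java.util.ArrayList;
       import java.util.List;

       public class WorkerActor {

           public WorkerActor() {
               // User code is loaded here, ahead of the first request,
               // so no request pays the initialization cost.
           }

           public String execute(String task) {
               return "result of " + task;
           }

           public static void main(String[] args) {
               Ray.init();
               // AppMaster-style setup: start the worker actors in advance.
               List<ActorHandle<WorkerActor>> workers = new ArrayList<>();
               for (int i = 0; i < 3; i++) {
                   workers.add(Ray.actor(WorkerActor::new).remote());
               }
               // Dispatcher-style routing: pick a worker round-robin per task.
               for (int taskId = 0; taskId < 6; taskId++) {
                   ActorHandle<WorkerActor> worker =
                       workers.get(taskId % workers.size());
                   System.out.println(
                       worker.task(WorkerActor::execute, "task-" + taskId)
                             .remote().get());
               }
           }
       }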
  4. Model Serving

     • Online machine learning
       • Continuous model exporting, 10-30 minutes per iteration
       • Apply models online as soon as possible
     • OCR/NLP/CV
       • Heterogeneous serving
     • Pipeline: streaming training exports a model every 10-30 minutes; serving
       actors (prediction engines with NN/LR model-serving runtimes, plus an
       ensemble engine) load the update and answer the user app's RPC calls
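     A hedged sketch of how a serving actor could apply freshly exported models
     without a restart; PredictionActor, updateModel, and predict are hypothetical
     names, not the framework's real API:

       import io.ray.api.ActorHandle;
       import io.ray.api.Ray;

       public class PredictionActor {

           private volatile String modelVersion = "v0";

           // Called by the training pipeline after each export (~10-30 minutes).
           public void updateModel(String newVersion) {
               // load the newly exported model, then switch to it atomically
               this.modelVersion = newVersion;
           }

           // Called on the user app's RPC path.
           public String predict(String features) {
               return "prediction from model " + modelVersion;
           }

           public static void main(String[] args) {
               Ray.init();
               ActorHandle<PredictionActor> engine =
                   Ray.actor(PredictionActor::new).remote();
               // Serving path: a prediction request becomes an actor task.
               System.out.println(
                   engine.task(PredictionActor::predict, "f1,f2").remote().get());
               // Update path: hot-swap the model while the actor keeps serving.
               engine.task(PredictionActor::updateModel, "v1").remote();
           }
       }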
  5. Function as a Service

     • An event-driven platform like Knative Eventing transforms event consumption
       into RPC calls
     • FaaS instances scale in and out automatically (building)
     • Flow: event sources send events to the event platform, whose triggers
       deliver them as RPC calls to Ray actors running the FaaS runtime and user
       functions (Func A, Func B); the user app can also call a function directly
       via RPC
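     One plausible shape for the FaaS runtime, sketched as a Ray actor that maps
     function names to user code so events arrive as plain RPC calls;
     FaasRuntimeActor and invoke are invented for illustration:

       import io.ray.api.ActorHandle;
       import io.ray.api.Ray;
       import java.util.Map;
       import java.util.function.Function;

       public class FaasRuntimeActor {

           // Registered user functions, keyed by name.
           private final Map<String, Function<String, String>> functions = Map.of(
               "userFuncA", payload -> "A handled " + payload,
               "userFuncB", payload -> "B handled " + payload);

           // The event platform's trigger lands here as an ordinary actor call.
           public String invoke(String funcName, String eventPayload) {
               return functions.get(funcName).apply(eventPayload);
           }

           public static void main(String[] args) {
               Ray.init();
               ActorHandle<FaasRuntimeActor> runtime =
                   Ray.actor(FaasRuntimeActor::new).remote();
               System.out.println(
                   runtime.task(FaasRuntimeActor::invoke, "userFuncA", "event-1")
                          .remote().get());
           }
       }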
  6. Online Resource Allocation

     • Online resource allocation aims to solve large-scale linear programming
       problems with high performance
     • Check out the Ant Group talk "Application of online resource allocation
       based on Ray"
  7. A General Serving Framework

     • A general serving framework is needed: Ant Ray Serving
     • Challenges:
       • High availability, SLA > 99.9%
       • Fast upgrades: 500+ instances in 10 minutes
       • Easy scaling: scale out/in within minutes
  8. Cross-Cluster Architecture

     • New role introduced: the Serving Keeper
       • Cross-cluster service management
       • Proxies service deployment requests
       • Starts service jobs in multiple Ray clusters automatically
       • Coordinates multiple services to:
         • Adjust the traffic ratio between different jobs
         • Accomplish blue-green and canary releases
     • Diagram: the Serving Client sends start/stop, scale, and upgrade requests
       to the ServingKeeper, which drives the AppMaster actor (and its proxy and
       backend actors) in each serving cluster (a routing sketch follows)
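     The traffic-ratio adjustment can be pictured as weighted routing; the
     WeightedRouter below is a self-contained sketch of that idea, not the
     ServingKeeper's actual code:

       import java.util.Map;
       import java.util.Random;
       import java.util.TreeMap;

       public class WeightedRouter {

           // Cumulative weight -> job name, for O(log n) weighted picks.
           private final TreeMap<Integer, String> cumulative = new TreeMap<>();
           private final int total;
           private final Random random = new Random();

           public WeightedRouter(Map<String, Integer> jobWeights) {
               int sum = 0;
               for (Map.Entry<String, Integer> e : jobWeights.entrySet()) {
                   sum += e.getValue();
                   cumulative.put(sum, e.getKey());
               }
               this.total = sum;
           }

           // Pick a job in proportion to its weight.
           public String route() {
               return cumulative.higherEntry(random.nextInt(total)).getValue();
           }

           public static void main(String[] args) {
               // Canary release: send ~10% of the traffic to the new job.
               WeightedRouter router =
                   new WeightedRouter(Map.of("job-stable", 90, "job-canary", 10));
               System.out.println(router.route());
           }
       }

     Shifting the weights toward the new job completes the canary release;
     blue-green is the special case of flipping all the weight from one job to
     the other at once.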
  9. Cross-Cluster Architecture

     • Cross-cluster service discovery
       • Two-level service discovery: per-cluster names (service.i.xxx.com,
         service.ii.xxx.com) registered under a global name
         (service.global.xxx.com) with per-cluster weights (e.g. 2 : 1)
       • Local-first load balancing
       • Adjust traffic between clusters flexibly
       • Ray cluster disaster tolerance and rolling updates
     • Flow: on an RPC trigger from the user application, the Serving Client
       pulls instances, selects a living instance (local-first load balancing),
       and makes the RPC call to the selected instance (sketched below)
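     A small sketch of local-first load balancing over instances pulled from
     two-level service discovery; the Instance record and select method are
     illustrative:

       import java.util.Comparator;
       import java.util.List;
       import java.util.Optional;

       public class LocalFirstBalancer {

           record Instance(String address, String cluster, boolean alive) {}

           // Prefer a living instance in our own cluster; fall back to any
           // living instance in another cluster.
           static Optional<Instance> select(List<Instance> pulled,
                                            String localCluster) {
               return pulled.stream()
                   .filter(Instance::alive)
                   .min(Comparator.comparing(
                       i -> i.cluster().equals(localCluster) ? 0 : 1));
           }

           public static void main(String[] args) {
               List<Instance> instances = List.of(
                   new Instance("10.0.0.1:80", "cluster-i", false),
                   new Instance("10.0.0.2:80", "cluster-i", true),
                   new Instance("10.0.1.1:80", "cluster-ii", true));
               // Picks 10.0.0.2:80 -- alive and local.
               System.out.println(select(instances, "cluster-i").orElseThrow());
           }
       }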
  10. AppMaster State Management

     • Non-volatile state makes the AppMaster highly reliable
     • We use a database such as MySQL as the state backend in production
     • AppMaster state consists of:
       • Service metadata
       • Proxy and backend actors' handles
       • Service configuration, such as the replica number and the latest
         backend config
       • Runtime status of the proxies, the backends, and the AppMaster itself
     • Scene #1: the AppMaster actor goes down
       1. Ray reconstructs the AppMaster actor
       2. The AppMaster finds that only the actor was reconstructed, resumes all
          state into memory, and continues to serve
     • Scene #2: the whole service goes down
       1. The service job is restarted by the user or the Serving Keeper
       2. The AppMaster finds that this is not an actor reconstruction, resumes
          the service metadata and runtime configurations, and abandons the
          runtime status and actor handles
       3. The AppMaster re-creates all proxy and backend actors and saves the
          new runtime status
     • The AppMaster persists every state change to the state backend and
       restores from it on recovery (see the sketch below)
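     The two recovery scenes can be sketched as follows; StateBackend, the state
     keys, and the recover method are assumptions standing in for the real
     MySQL-backed implementation:

       import java.util.HashMap;
       import java.util.Map;

       public class AppMasterState {

           // Abstracted persistence layer; MySQL in production.
           interface StateBackend {
               Map<String, String> load();
               void persist(Map<String, String> state);
           }

           private final Map<String, String> state = new HashMap<>();

           // actorReconstructed is true in Scene #1 and false in Scene #2.
           void recover(StateBackend backend, boolean actorReconstructed) {
               state.putAll(backend.load());
               if (actorReconstructed) {
                   // Scene #1: same job, same actors -- resume everything
                   // into memory and continue to serve.
                   return;
               }
               // Scene #2: a fresh job -- the stored actor handles and runtime
               // status are stale, so abandon them and rebuild the actors.
               state.remove("actorHandles");
               state.remove("runtimeStatus");
               recreateProxyAndBackendActors();
               state.put("runtimeStatus", "recreated");
               backend.persist(state);
           }

           void recreateProxyAndBackendActors() {
               // start new proxy & backend actors and record their handles
           }
       }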
  11. Service Continuous Upgrade

     • The upgrade step size is usually no bigger than 25%
     • Remove traffic before upgrading
     • Reloading in place is faster, whether upgrading or rolling back
     • Validation is very important in online learning
     • Upgrade state machine of a serving backend: Serving → Pre-Updating
       (prepare resources & configuration) → Detached (remove itself from
       service discovery) → Upgrading (reload/upgrade once there is no
       traffic) → Validating (self-validate with recorded requests) → Serving
       on success, or Rollbacking to the previous version on failure (encoded
       below)
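     The upgrade flow maps naturally onto a state machine; this enum-based
     sketch mirrors the states named on the slide, with hypothetical transition
     logic:

       public class BackendUpgrade {

           enum State {
               SERVING, PRE_UPDATING, DETACHED, UPGRADING, VALIDATING, ROLLBACKING
           }

           private State state = State.SERVING;

           void upgrade() {
               transition(State.PRE_UPDATING);    // prepare resources & configuration
               transition(State.DETACHED);        // remove itself from service discovery
               transition(State.UPGRADING);       // reload in place once traffic drains
               transition(State.VALIDATING);      // replay recorded requests
               if (validate()) {
                   transition(State.SERVING);     // success: rejoin service discovery
               } else {
                   transition(State.ROLLBACKING); // failure: reload the previous version
                   transition(State.SERVING);
               }
           }

           boolean validate() {
               // compare outputs on recorded requests against expected results
               return true;
           }

           private void transition(State next) {
               System.out.println(state + " -> " + next);
               state = next;
           }
       }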
  12. Serving Scalability

     • Scalability is not the biggest challenge when using Ray
     • Autoscaling (exploring):
       • Time-series algorithms for predictable traffic
     • The real question is: can you scale out fast enough when an unforeseen
       traffic peak arrives?
     • SLA - a "Scheduling" level agreement
  13. Overall Architecture

     • The user application's data plane sends requests through a Serving Client
       to the serving clusters; the control plane sends control instructions
       through a Serving Client to the ServingKeeper
     • Each serving cluster (I, II, III, ... X) spans several Ray nodes and runs
       an AppMaster plus proxies and Java/Python backends
     • Proxies register themselves with service discovery; clients pull
       instances from it
  14. Future Plan

     • Speed up online-learning model updating to be 5x faster
     • Apply autoscaling in most production scenarios
     • Explore more complex scenarios, such as online distributed computing
     • Collaboration with Ray Serve:
       • Support Java in Ray Serve
       • Build pluggable components for the controller, ingress, and service
         discovery together
       • Other potential features