
Improving Ray for Large-scale Applications (Hao Chen, Ant Group)

At Ant Group, we have built various kinds of distributed systems on top of Ray and deployed them in production at large scale. In this talk, we'll cover the problems we've encountered and the improvements we've made that make Ray an industry-grade system with high scalability and stability.


July 19, 2021


  1. Improving Ray for Large-scale Applications Hao Chen, Ant Group

  2. About me
     • Joined Ant and started working on Ray in 2018.
     • Leading Ant's Ray core team, which:
       • Improves Ray's functionality, performance, reliability, and scalability.
       • Supports Ant's various distributed applications and frameworks on top of Ray.
       • Has contributed many improvements to open source.
  3. Overview of Ant's Ray applications
     • Real-time graph computing
       • Bridges graph computing and stream computing together.
       • Real-time fraud detection, data lineage analysis, graph-based ML, etc.
     • Online machine learning
       • Runs stream data processing, model training, and model serving in a single system.
       • Recommendation and ads systems.
     • Online computing service
       • Allows doing distributed computations on online services.
       • Online financial decision systems.
  4. Overview of Ant's Ray applications (cont'd)
     • Distributed operational research
       • Allows quickly developing high-performance and robust distributed OR algorithms.
       • Online resource allocation.
     • [WIP] Large-scale Python data processing (Mars, github.com/mars-project/mars)
       • Scales NumPy, pandas, and scikit-learn.
     • Etc.
     Ray as a universal foundation for all distributed systems!
  5. Ray scale at Ant

  6. Ray architecture as of 2018
     [Diagram] Application layer: a driver and workers on each worker node.
     System layer: each worker node runs a Raylet and a shared-memory object
     store; the head node runs the GCS (Redis + a custom Redis module).
  7. Challenge: actor call performance
     Example: stream data processing in online ML, where streaming workers run
     as Ray actors and send messages to each other as actor calls.
     Requirement: high-performance actor calls.
     • All actor tasks had to go through Raylets.
     • The Raylets became the bottleneck!
  8. Direct actor calls
     • Actors talk directly to each other via gRPC.
     • Callee locations are obtained from GCS/Redis via Pub/Sub.
     • The RPC layer is implemented in C++.
     • Normal tasks were switched to direct calls by the community as well.
     Result: 6x throughput.
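The location-lookup half of this design can be sketched in plain Python. This is an illustrative model, not Ray's actual code: a client-side cache of actor addresses, refreshed by a pub/sub callback, so RPCs can be sent straight to the callee instead of being routed through a Raylet. The class and method names are hypothetical.

```python
# Illustrative sketch (not Ray's implementation): a client-side cache of
# actor -> address mappings, updated by a pub/sub subscription, enabling
# direct gRPC calls between actors.

class ActorLocationCache:
    """Caches actor addresses published by a GCS-like service."""

    def __init__(self):
        self._locations = {}  # actor_id -> "host:port"

    def on_location_update(self, actor_id, address):
        # Invoked by the pub/sub subscription whenever an actor is
        # (re)scheduled and its address changes.
        self._locations[actor_id] = address

    def resolve(self, actor_id):
        # Direct-call path: resolve the callee's address locally; a miss
        # would fall back to a blocking query against the GCS.
        return self._locations.get(actor_id)


cache = ActorLocationCache()
cache.on_location_update("actor-1", "10.0.0.5:6001")
assert cache.resolve("actor-1") == "10.0.0.5:6001"
```

The point of the design is that the hot path (every actor call) touches only a local dictionary; the pub/sub channel keeps it fresh asynchronously.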
  9. Actor call performance
     • Other RPC improvements:
       • Hot-path code optimizations.
       • Reduced task spec copies.
       • Java JNI cache.
       • Dedicated IO threads.
       • Etc.
     Result: 10x throughput.
  10. Challenge: actor failure recovery
     Raylet-based actor management: Raylets acquire actor leases from
     GCS/Redis; each lease is granted with a timeout, and duplicate leases
     can be granted due to network partitions.
     • ❌ Actors may restart multiple times.
     • Severe for large clusters or bad networks.
     • Workaround: sacrifice actor restart speed by setting longer timeouts.
     • The graph computing system restarts all actors if any actor fails.
     • Requirement: fast and reliable actor failure recovery.
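The duplicate-lease race can be shown with a toy model. This is not Ray's implementation, just a minimal simulation of why timeout-based leases are unsafe under partitions: if the lease holder is merely partitioned (not dead), its renewals are lost, and after the timeout the lease table grants the same actor to a second node.

```python
# Toy model of timeout-based leasing (illustrative, not Ray's code).
# "now" is passed explicitly so the race is easy to see.

class LeaseTable:
    def __init__(self, timeout):
        self.timeout = timeout
        self.leases = {}  # actor_id -> (holder, granted_at)

    def acquire(self, actor_id, holder, now):
        current = self.leases.get(actor_id)
        if current is None or now - current[1] >= self.timeout:
            # No holder, or the previous lease appears expired: grant it.
            self.leases[actor_id] = (holder, now)
            return True
        return False


table = LeaseTable(timeout=10)
assert table.acquire("actor-1", "raylet-1", now=0)       # first grant
assert not table.acquire("actor-1", "raylet-2", now=5)   # still held
# raylet-1 is only partitioned, but its renewals never arrive; after the
# timeout the lease is granted again while raylet-1 still runs the actor.
assert table.acquire("actor-1", "raylet-2", now=11)      # duplicate!
```

Raising the timeout makes the duplicate grant rarer but directly slows down legitimate recovery, which is exactly the trade-off the slide describes.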
  11. GCS-based actor management
     • Distributed actor management is error-prone.
     • The number of actors in a cluster is limited (compared with normal tasks).
     • So: use the GCS for centralized actor management.
     • But …
       • Old GCS = Redis server + custom Redis module.
       • APIs constrained to Redis commands.
       • Hard to implement complex logic.
       • Hard to ensure fault tolerance.
  12. GCS service
     New GCS = RPC service + pluggable backend storage, running on the head
     node; Raylets, workers, etc. talk to it via gRPC.
     • Fast and reliable actor failure recovery:
       • 100% successful.
       • Speed: 1 actor: ~1.5s; 10k actors: ~70s.
     • GCS fault tolerance:
       • The GCS service can recover its state from the storage.
       • Companies can use a reliable storage of their choice.
     • Other features based on the GCS service:
       • Cluster membership management.
       • Job management.
       • Placement groups.
       • More precise actor scheduling.
     (Some of these features are not, or not fully, open-sourced yet.)
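The "RPC service + pluggable backend storage" split can be sketched as a small storage interface that the GCS logic is written against, so each deployment can plug in whatever reliable store it trusts. All class names below are hypothetical, not Ray's real APIs.

```python
# Hypothetical sketch of a GCS with pluggable backend storage.
from abc import ABC, abstractmethod

class GcsStorage(ABC):
    """Minimal storage contract; Redis- or MySQL-backed variants would
    implement the same two methods."""
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key): ...

class InMemoryStorage(GcsStorage):
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class GcsService:
    """Keeps its durable state in the storage, so a restarted service
    instance can recover it."""
    def __init__(self, storage):
        self.storage = storage
    def register_actor(self, actor_id, state):
        self.storage.put(f"actor:{actor_id}", state)
    def recover_actor(self, actor_id):
        return self.storage.get(f"actor:{actor_id}")


storage = InMemoryStorage()
GcsService(storage).register_actor("a1", "ALIVE")
# A freshly restarted GCS sees the same state through the shared backend.
assert GcsService(storage).recover_actor("a1") == "ALIVE"
```

Because all durable state lives behind the interface, GCS fault tolerance reduces to the reliability of whichever backend a company chooses.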
  13. Challenge: reliability and scalability
     • Environment errors and edge-case bugs occasionally failed Ray jobs.
     • Ray itself became flakier as cluster workloads increased.
     • Dev-ops difficulties grew as the number of clusters increased.
  14. Fault tolerance
     • Ray itself should be fault tolerant.
       • GCS: the GCS service recovers its state from storage.
       • Raylets: a daemon process monitors and restarts the Raylet.
       • Nodes: a K8s operator replenishes the cluster on node failures.
     • It should be easy to write fault-tolerant applications with Ray.
       • Tasks/actors: convenient retry/restart APIs.
       • Placement groups: implemented with automatic failure recovery.
       • Implemented libraries for common patterns, e.g., fate-sharing multiple actors.
     • Set up automatic fault-injection tests to prevent regressions.
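The "convenient retry API" idea can be sketched as a plain-Python decorator. Ray's open-source equivalents are options such as `max_retries` for tasks and `max_restarts` for actors; this standalone version only shows the shape of the pattern, without Ray itself.

```python
# Standalone sketch of a task-retry wrapper (illustrative; Ray exposes
# this as remote-function/actor options rather than a decorator like this).
import functools

def retry(max_retries):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # retries exhausted: surface the error
        return wrapper
    return decorator


calls = []

@retry(max_retries=2)
def flaky():
    # Fails twice to stand in for transient environment errors,
    # then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "ok"

assert flaky() == "ok"
assert len(calls) == 3  # failed twice, succeeded on the third attempt
```

Declaring the retry budget at the call site, rather than hand-writing try/except loops, is what makes fault-tolerant application code "easy to write" in the sense of the slide.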
  15. Scaling a single Ray cluster
     • Scalability optimizations:
       • Separated heartbeats from resource updates.
       • Reduced the sizes of pub/sub and RPC messages.
       • Reduced the number of connections.
       • Allowed multiple Java actors to share a single JVM process.
       • Etc.
     • Now supports 1k nodes and 10k actors in a single Ray cluster.
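The first optimization above can be illustrated with a toy node reporter. The idea (names and message shapes are hypothetical) is that a combined heartbeat-plus-resources message forces the large payload onto every tick, while splitting them lets the node send a tiny liveness ping each tick and a resource update only when something actually changed.

```python
# Illustrative sketch: decoupling cheap heartbeats from (larger)
# resource updates, so unchanged resources cost nothing per tick.

class NodeReporter:
    def __init__(self):
        self._last_resources = None

    def tick(self, resources):
        messages = [("heartbeat", None)]  # small, sent on every tick
        if resources != self._last_resources:
            # Only ship the heavier resource report when it changed.
            messages.append(("resource_update", resources))
            self._last_resources = resources
        return messages


r = NodeReporter()
assert r.tick({"CPU": 8}) == [("heartbeat", None),
                              ("resource_update", {"CPU": 8})]
# Unchanged resources: only the cheap heartbeat goes out.
assert r.tick({"CPU": 8}) == [("heartbeat", None)]
```

At 1k nodes, shaving the per-tick payload like this is what keeps the head node's inbound traffic manageable.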
  16. Deployment model
     • Previously: one Ray cluster per job.
       • Dev-ops overhead due to too many clusters.
       • Slow job starts due to cluster initialization.
       • Low resource utilization.
     • Now: multi-tenancy.
       • Implemented process-level and node-level job isolation.
       • Introduced job-level configs.
  17. Challenge: debuggability and accessibility
     • Distributed systems are notoriously difficult to debug.
     • The Ray core team had to spend a lot of effort helping users debug problems.
     • How do we make Ray a self-service system?
  18. New dashboard
     • System and application states.
       • Resource usage, node/actor states, cluster/job configurations, etc.
     • System and application logs, errors, and key events.
       • WIP: automatic root-cause analysis based on logs and events.
     • Integrated common debugging tools.
       • C++/Python/Java stack-viewing tools, profilers, memory tools, etc.
     Try the new dashboard in OSS!
  19. Job submission
     • Simplified job submission:
       • Users can submit jobs from a web portal or via a RESTful API.
       • Users simply specify dependencies; Ray downloads them into the cluster.
       • Dependencies are cached, so a job usually takes only a few seconds to start on a cache hit.
     • Open-sourcing and integration with runtime environments are in progress.
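The dependency cache described above can be sketched as content-addressed storage: key the downloaded bundle by a hash of the dependency list, so a job whose dependencies were already fetched skips the slow download entirely. This is an illustrative sketch, not the actual implementation.

```python
# Illustrative sketch of dependency caching for job submission:
# bundles are keyed by a content hash of the (sorted) dependency list.
import hashlib

class DependencyCache:
    def __init__(self):
        self._cache = {}   # hash -> installed path
        self.downloads = 0

    def _key(self, deps):
        # Sort so that the same dependency set, in any order, hits
        # the same cache entry.
        return hashlib.sha256("\n".join(sorted(deps)).encode()).hexdigest()

    def ensure(self, deps):
        key = self._key(deps)
        if key not in self._cache:
            self.downloads += 1  # stands in for the slow download step
            self._cache[key] = f"/cache/{key[:8]}"
        return self._cache[key]


cache = DependencyCache()
p1 = cache.ensure(["numpy==1.21", "pandas==1.3"])
p2 = cache.ensure(["pandas==1.3", "numpy==1.21"])  # same deps, any order
assert p1 == p2 and cache.downloads == 1  # second job hits the cache
```

A cache hit turns job startup from "download everything" into a lookup, which is how start times drop to a few seconds.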
  20. Future work
     • Scalability
       • Support 5k nodes and 50k actors in a Ray cluster.
     • Scheduling
       • Improve resource utilization and scheduler performance, especially for large-scale clusters with mixed workloads.
       • Improve auto-scaling.
     • Ray for data-intensive applications
       • Improve the object store and support running Mars on Ray.
     • Advanced features and common libraries
       • Monitor actor states; allow running different actor methods in different threads; etc.
       • Enhance the C++ worker.
  21. Thanks We are hiring! [email protected]