Fusion Computing at Ant Group (Dr. Charles He, Ant Group)

Fusion Computing at Ant Group
Charles He, Chief Architect of Computing and Storage, Ant Group

Intro to Ant Group
Ant Group aims to create the infrastructure and platform to support the digital transformation of the service industry. We strive to enable all consumers and small businesses to have equal access to financial and other services that are inclusive, green and sustainable. Alipay is China’s largest mobile payments business operated by Ant Group. Alipay has grown through continuous innovation to serve approximately 1.3 billion users worldwide together with its global e-wallet partners.

Building a Fusion Engine with Ray
Fusion: Breaking Boundaries between Computing Paradigms

Ray in Production at Scale
The fusion engine on Ray at Ant Group is the world’s largest Ray cluster in production.
• Dynamic Graph, Online Machine Learning, Financial Online Decision
• Supporting dozens of core business use cases
• 200K-core Ray cluster in production
• Offline + Nearline + Online (various performance requirements)
• Big Data + Graph + AI (various paradigms of computing)

Ray in Production at Scale
Various aspects of Ray Core were enhanced as the fusion engine brought Ray into production at scale.
• Dynamic Graph, Online Machine Learning, Financial Online Decision
• Scalability: 1k nodes and 10k actors in a single cluster.
• Performance: actor call throughput of 70k/s, P999 latency of 1.5 ms.
• Scheduling: implemented GCS-based actor scheduling with improved scheduling algorithms for mixed large-scale workloads.
• DevOps: migrated to a multi-tenancy deployment model.
• Diagnostics: built a dashboard for self-service debugging.
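For readers unfamiliar with the "P999" notation, the following is a minimal sketch of how a 99.9th-percentile latency can be read off a set of samples using the nearest-rank method. The function name and the sample data are hypothetical and chosen only for illustration; the slides do not say how the 1.5 ms figure was measured.

```python
def per_mille_nearest_rank(samples, per_mille):
    """Nearest-rank percentile, with the percentile given in parts per thousand.

    per_mille=999 corresponds to P999 (the 99.9th percentile).
    Integer arithmetic avoids floating-point rounding when computing the rank.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < per_mille <= 1000:
        raise ValueError("per_mille must be in 1..1000")
    ordered = sorted(samples)
    # 1-based rank: ceil(per_mille * n / 1000)
    rank = (per_mille * len(ordered) + 999) // 1000
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 998 fast calls and 2 slower ones.
latencies_ms = [0.4] * 998 + [1.5, 9.0]
p999 = per_mille_nearest_rank(latencies_ms, 999)
p500 = per_mille_nearest_rank(latencies_ms, 500)
```

With these made-up samples, the P999 value is the second-slowest call, which shows why P999 is a stricter target than a median or P99 for online services.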

Another Case: Operations Research
Example Use Case: Payment Routing
Goal: Find the payment route with minimum overall cost
Constraints:
• Maximum load per day per payment network
• Minimum transactions per day per bank
Demand / Challenge:
• Accuracy: overall cost deviation per day
• Latency: millisecond-level response per payment
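To make the routing problem concrete, here is a minimal sketch of a capacity-aware router, assuming a flat per-payment cost for each network. It is a greedy simplification, not the OR model used at Ant Group: it enforces only the maximum-daily-load constraint, omits the per-bank minimum, and does not guarantee the minimum overall cost. The `Network` class, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Network:
    name: str
    cost_per_payment: float  # hypothetical flat unit cost
    max_daily_load: int      # constraint: maximum load per day
    routed_today: int = 0    # payments routed so far today


def route_payment(networks):
    """Pick the cheapest network that still has daily capacity left."""
    available = [n for n in networks if n.routed_today < n.max_daily_load]
    if not available:
        raise RuntimeError("no payment network has remaining capacity")
    best = min(available, key=lambda n: n.cost_per_payment)
    best.routed_today += 1
    return best.name


# Hypothetical networks: the cheaper one can take only 2 payments per day.
networks = [Network("net_a", 0.10, 2), Network("net_b", 0.25, 5)]
choices = [route_payment(networks) for _ in range(4)]
```

In this example the first two payments go to the cheaper network and the rest spill over to the more expensive one once the daily cap is reached. A real solver would weigh all constraints jointly rather than deciding payment by payment.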

Another Case: Operations Research
• Problem Modeling / Online Assignment: customized complex business logic (OR model)
• Specify Model Parameters: sampling and forecasting based on large-scale data (offline + nearline)
• Online Assignment: online, high concurrency + high stability + ms-level latency
• Complicated engineering pipeline + multiple computing paradigms
Fusion computing based on Ray unifies the complex optimization pipeline.
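The split between parameter estimation and online assignment can be sketched as two stages, assuming the offline/nearline stage simply averages sampled costs per network. This is plain Python with no Ray calls; the function names and sample records are hypothetical, and the actual forecasting used in the talk is not described in the slides.

```python
from statistics import mean


def estimate_parameters(sampled_history):
    """Offline/nearline stage: average sampled (network, cost) records per network."""
    costs = {}
    for network, cost in sampled_history:
        costs.setdefault(network, []).append(cost)
    return {network: mean(values) for network, values in costs.items()}


def make_online_assigner(params):
    """Online stage: rank networks once so each request only walks a short list."""
    ranked = sorted(params, key=params.get)  # cheapest estimated cost first

    def assign(unavailable=frozenset()):
        for network in ranked:
            if network not in unavailable:
                return network
        raise RuntimeError("no network available")

    return assign


# Hypothetical sampled history of (network, observed cost) pairs.
history = [("net_a", 0.10), ("net_a", 0.14), ("net_b", 0.20), ("net_b", 0.30)]
params = estimate_parameters(history)
assign = make_online_assigner(params)
first_choice = assign()
fallback = assign(unavailable={"net_a"})
```

The design point is that the expensive work (sampling and estimation over large data) happens outside the request path, so the online step stays cheap enough for millisecond-level responses.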

Business Requirements for Distributed System
We think Ray is the best unified distributed foundation for Fusion Computing.

Future Work
Customization:
• Simpler and more flexible distributed APIs
• More common libraries
Real Time / Online:
• Higher online service stability: higher concurrency + higher stability + lower latency
Fusion:
• Better support for data-intensive applications
• Larger-scale cluster
Building Ray towards a unified foundation for Fusion Computing.

For more details:
• Building High Availability and Scalability Online Computing Applications on Ray – Tengwei Cai, June 22, 1:45 PM - 2:15 PM PDT
• Improving Ray for Large-scale Applications – Hao Chen, June 22, 2:20 PM - 2:50 PM PDT
• Application of Online Resource Allocation Based on Ray – Fengbin Fang, June 22, 2:00 PM - 2:50 PM PDT
