Fusion Computing at Ant Group (Dr. Charles He, Ant Group)

While serving 1.3 billion users on Alipay, Ant Group found itself living with a multitude of computing paradigms (such as AI, graph, and streaming). This brought challenges in both performance and development efficiency. Ant Group has used Ray to build a fusion engine that breaks the boundaries between computing paradigms and improves development and performance efficiency.
In his keynote, Charles will describe how, in taking the fusion engine to large-scale production, Ant Group's production Ray cluster has reached 200K cores, making it the world’s largest Ray cluster in production. Along the way, we have enhanced various aspects of Ray Core. Charles will also discuss the achievements of the Fusion Engine over the last year and his vision for the future of Fusion Computing.

Anyscale

July 20, 2021

Transcript

  1. Intro to Ant Group

    Ant Group aims to create the infrastructure and platform to support the digital transformation of the service industry. We strive to enable all consumers and small businesses to have equal access to financial and other services that are inclusive, green and sustainable. Alipay is China’s largest mobile payments business operated by Ant Group. Alipay has grown through continuous innovation to serve approximately 1.3 billion users worldwide together with its global e-wallet partners.
  2. Ray in Production at Scale

    The fusion engine on Ray at Ant Group is the world’s largest Ray cluster in production.
    Use cases: Dynamic Graph, Online Machine Learning, Financial Online Decision
    • Supporting dozens of core business use cases
    • 200K Core Ray Cluster in Production
    • Offline + Nearline + Online (Various computing performance)
    • Big Data + Graph + AI (Various paradigms of computing)
  3. Ray in Production at Scale

    Various aspects of Ray Core were enhanced as the fusion engine brought Ray into production at scale.
    Use cases: Dynamic Graph, Online Machine Learning, Financial Online Decision
    • Scalability: 1k nodes and 10k actors in a single cluster.
    • Performance: Actor call throughput 70k/s, P999 latency 1.5 ms.
    • Scheduling: Implemented GCS-based actor scheduling with improved scheduling algorithms for mixed large-scale workloads.
    • DevOps: Migrated to a multi-tenancy deployment model.
    • Diagnostic: Built a dashboard for self-service debugging.
  4. Another Case: Operations Research

    Example Use Case: Payment Routing
    Goal: Find the payment route with minimum overall cost
    Constraints:
    • maximum load per day per payment network
    • lowest transaction per day per bank
    Demand / Challenge:
    • Accuracy: Overall cost deviation per day
    • Latency: ms response per payment
  5. Another Case: Operations Research

    • Problem Modeling / Online Assignment: Customized complex business logic (OR Model)
    • Specify Model Parameters: Sampling and forecasting based on large-scale data (offline + nearline)
    • Online Assignment: Online, high concurrency + high stability + ms-level latency
    • Complicated Engineering Pipeline + Multiple Computing Paradigms
    Fusion computing based on Ray unifies the complex optimization pipeline.
  6. Business Requirements for Distributed System

    We think Ray is the best unified distributed foundation for Fusion Computing.
  7. Future Work

    Customization:
    • Simpler and more flexible distributed APIs
    • More common libraries
    Real Time / Online:
    • Higher online service stability: higher concurrency + higher stability + lower latency
    Fusion:
    • Better support for data-intensive applications
    • Larger-scale cluster
    Building Ray towards a unified foundation for Fusion Computing.
  8. For more details

    • Building High Availability and Scalability Online Computing Applications on Ray – Tengwei Cai, June 22, 1:45 PM - 2:15 PM PDT
    • Improving Ray for Large-scale Applications – Hao Chen, June 22, 2:20 PM - 2:50 PM PDT
    • Application of Online Resource Allocation Based on Ray – Fengbin Fang, June 22, 2:00 PM - 2:50 PM PDT
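
The payment routing case in slide 4 (minimum overall cost, subject to a maximum daily load and a lowest daily transaction count) can be illustrated with a toy model. The sketch below is only an assumption about the shape of the problem, not Ant Group's OR model: the route names, costs, and bounds are hypothetical, and brute-force enumeration stands in for whatever solver a production system would use to meet millisecond latency.

```python
from itertools import product


def route_payments(total, routes):
    """Find the cheapest split of `total` payments across routes.

    routes maps a route name to (cost_per_payment, min_count, max_count),
    where min_count models a "lowest transaction per day" constraint and
    max_count models a "maximum load per day" constraint.
    Returns (total_cost, allocation) or None if no split is feasible.
    """
    names = list(routes)
    # Every count a route may legally take, given its lower and upper bound.
    ranges = [range(routes[n][1], routes[n][2] + 1) for n in names]
    best = None
    for counts in product(*ranges):
        if sum(counts) != total:
            continue  # all payments must be routed, no more and no fewer
        cost = sum(c * routes[n][0] for c, n in zip(counts, names))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, counts)))
    return best


# Hypothetical data: (cost per payment, minimum count, maximum count).
routes = {
    "network_a": (3, 2, 6),
    "network_b": (1, 0, 4),
    "network_c": (2, 1, 10),
}

print(route_payments(10, routes))
```

Enumeration grows exponentially with the number of routes, so this only shows how the objective and the two constraint types interact; the cheapest route fills to its maximum load, the most expensive one is held at its minimum, and the remainder absorbs the rest.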