Fusion Computing at Ant Group (Dr. Charles He, Ant Group)

Charles Changhua He is currently chief architect of compute and storage at Ant Group, operator of Alipay, the world’s largest digital payment platform. He obtained a PhD in computer science from Stanford University and was a tech leader at Google and Airbnb in Silicon Valley. Charles focuses on large-scale distributed systems and big data systems. While at Google, Charles led the development of Caffeine, a real-time web search indexing system that drastically improved the search experience by delivering much fresher results.

Charles joined Ant Group in 2017 and has since been responsible for compute and storage infrastructure, including financial big data systems, large-scale graph computing, and the machine learning platform.

Anyscale

July 23, 2021

Transcript

  1. Intro to Ant Group

    Ant Group aims to create the infrastructure and platform to support the digital transformation of the service industry. We strive to enable all consumers and small businesses to have equal access to financial and other services that are inclusive, green and sustainable. Alipay is China’s largest mobile payments business operated by Ant Group. Alipay has grown through continuous innovation to serve approximately 1.3 billion users worldwide together with its global e-wallet partners.
  2. Ray in Production at Scale

    The fusion engine on Ray at Ant Group is the world’s largest Ray cluster in production.
    • Dynamic Graph, Online Machine Learning, Financial Online Decision
    • Supporting dozens of core business use cases
    • 200K-core Ray cluster in production
    • Offline + Nearline + Online (various computing performance)
    • Big Data + Graph + AI (various paradigms of computing)
  3. Ray in Production at Scale

    Various aspects of Ray Core were enhanced as the fusion engine brought Ray into production at scale.
    • Dynamic Graph, Online Machine Learning, Financial Online Decision
    • Scalability: 1k nodes and 10k actors in a single cluster.
    • Performance: actor call throughput of 70k/s; P999 latency of 1.5 ms.
    • Scheduling: implemented GCS-based actor scheduling with improved scheduling algorithms for mixed large-scale workloads.
    • DevOps: migrated to a multi-tenancy deployment model.
    • Diagnostics: built a dashboard for self-service debugging.
  4. Another Case: Operations Research

    Example Use Case: Payment Routing
    • Goal: find the payment route with minimum overall cost
    • Constraints: maximum load per day per payment network; lowest transaction per day per bank
    • Demand / Challenge: accuracy (overall cost deviation per day); latency (ms response per payment)
  5. Another Case: Operations Research

    • Problem Modeling / Online Assignment: customized complex business logic (OR model)
    • Specify Model Parameters: sampling and forecasting based on large-scale data (offline + nearline)
    • Online Assignment: online, high concurrency + high stability + ms-level latency
    • Complicated engineering pipeline + multiple computing paradigms
    Fusion computing based on Ray unifies the complex optimization pipeline.
  6. Business Requirements for Distributed System

    We think Ray is the best unified distributed foundation for Fusion Computing.
  7. Future Work

    Customization:
    • Simpler and more flexible distributed APIs
    • More common libraries
    Real Time / Online:
    • Higher online service stability: higher concurrency + higher stability + lower latency
    Fusion:
    • Better support for data-intensive applications
    • Larger-scale cluster
    Building Ray towards a unified foundation for Fusion Computing.
  8. For more details

    • Building High Availability and Scalability Online Computing Applications on Ray – Tengwei Cai, June 22, 1:45 PM - 2:15 PM PDT
    • Improving Ray for Large-scale Applications – Hao Chen, June 22, 2:20 PM - 2:50 PM PDT
    • Application of Online Resource Allocation Based on Ray – Fengbin Fang, June 22, 2:00 PM - 2:50 PM PDT
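
The payment routing problem from slides 4 and 5 (minimum overall cost, subject to a maximum daily load per payment network and a minimum number of daily transactions per bank) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the route names, costs, limits, and the brute-force search are hypothetical and are not Ant Group's OR model, which the slides describe only at a high level. A production system would rely on a proper solver plus online assignment rather than enumeration.

```python
from collections import Counter
from itertools import product

# Hypothetical routes: name -> (payment network, bank, cost per payment).
ROUTES = {
    "r1": ("net_a", "bank_x", 1.0),
    "r2": ("net_a", "bank_y", 2.0),
    "r3": ("net_b", "bank_y", 3.0),
}
# Hypothetical daily limits, counted in number of payments.
MAX_LOAD_PER_NETWORK = {"net_a": 2, "net_b": 3}
MIN_TX_PER_BANK = {"bank_x": 1, "bank_y": 2}


def is_feasible(assignment):
    """Check both constraint families for one route-per-payment assignment."""
    network_load = Counter(ROUTES[r][0] for r in assignment)
    bank_tx = Counter(ROUTES[r][1] for r in assignment)
    within_load = all(
        network_load[net] <= cap for net, cap in MAX_LOAD_PER_NETWORK.items()
    )
    meets_minimum = all(
        bank_tx[bank] >= low for bank, low in MIN_TX_PER_BANK.items()
    )
    return within_load and meets_minimum


def total_cost(assignment):
    """Overall cost is the sum of the per-payment route costs."""
    return sum(ROUTES[r][2] for r in assignment)


def cheapest_routing(num_payments):
    """Enumerate every assignment; return a cheapest feasible one, or None."""
    candidates = (
        a for a in product(ROUTES, repeat=num_payments) if is_feasible(a)
    )
    return min(candidates, key=total_cost, default=None)


if __name__ == "__main__":
    best = cheapest_routing(4)
    print(best, total_cost(best))
```

With these made-up numbers, four payments are routed twice over `r1` and twice over `r3` for a total cost of 8.0, while six payments exceed the combined network capacity and yield `None`. Enumeration grows exponentially with the number of payments, which is one reason the slides pair offline and nearline parameter estimation with a millisecond-level online assignment step.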