Simplify Big Data & AI on Spark and Ray with MLSQL (Dong Li, Kyligence)

Most organizations prefer Python for AI and machine learning, while JVM-based distributed systems remain popular for big data processing. Many Ray users want to incorporate parallel data processing directly into their Python applications but struggle with the complexity and low efficiency of existing solutions.

MLSQL is a new SQL variant designed for big data and AI scenarios. It is open source under the Apache License 2.0. With MLSQL, users can perform self-service machine learning and AI tasks on large-scale datasets on top of Ray and Spark by writing a few lines of SQL, without worrying about the different programming paradigms of PySpark and Ray. MLSQL optimizes its distributed engine by combining Spark and Ray and improving the efficiency of the underlying data exchange between them. Users can also run the same piece of code on any Ray cluster of their choice.

In this presentation, Dong Li will outline the basics of MLSQL with a live demo, followed by a deep dive into how MLSQL implements Spark + Ray on the engine side to build a single, efficient substrate for big data and AI.

Anyscale

July 20, 2021

Transcript

  1. Simplify Big Data & AI on Spark and Ray With MLSQL

    Dong Li ◆ Head of Product, Kyligence ◆ Apache Kylin PMC Member
  2. Is Python (based on Spark + Ray) enough?

    ◆ Python is an advanced programming skill ◆ Learn Python ◆ Learn PySpark ◆ Learn Ray programming ◆ No management for data ACL ◆ Intermediate storage is required for data exchange between Spark and Ray
  3. MLSQL: Open Source SQL Variant for Big Data & AI

    Unified Language and Platform for Data Management, Business Intelligence, and Machine Learning / AI
  4. MLSQL Summary

    ◆ In Notebook ◆ All about SQL ◆ Seamless SQL & Python ◆ No PySpark ◆ Analyze and explore multiple data sources ◆ Support algorithms and feature engineering, support Python ecosystem ◆ Support Kylin and other analytical engines ◆ Non-intrusive, out-of-the-box data ACL ◆ Security on Plugin, Algorithm, Data and Directory ◆ Custom desensitization ◆ Hot deployment of UDFs and UDAFs ◆ Pluggable architecture ◆ User-defined extensions
  5. Ray on Spark vs. Spark on Ray (RayDP)

    [Diagram labels: HDFS/Object Store (Slow) ◆ Ray Object Store (Quick) ◆ PySpark App ◆ Ray Cluster ◆ Spark Driver ◆ Executor ◆ Ray manager ◆ Raylet]
  6. The New Way: MLSQL on Spark + Ray

    [Diagram labels: Apps ◆ JDBC/REST API ◆ Proxy Server (Load Balance) ◆ MLSQL Engine ◆ Driver ◆ Executor ◆ Java Executor ◆ Python Daemon ◆ Python Worker ◆ Ray Cluster ◆ Yarn/K8s/Standalone/Local ◆ MLSQL Cluster ◆ Existing Ray Cluster]
  7. Why do it this way?

    ◆ In fusion mode, Python may impact the stability of the big data cluster ◆ Traditionally, data landing is required ◆ MLSQL exchanges data on the fly ◆ Ray is optional ◆ Users can provide multiple Ray clusters to select from ◆ Traditionally, you need to learn Python/PySpark/Ray ◆ No need for PySpark with MLSQL
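
The contrast in slide 7 between "data landing" and exchanging data "on the fly" can be illustrated with a minimal, standard-library-only sketch. It is not MLSQL code: `produce_rows` and `train` are hypothetical stand-ins for the big data side and the AI side, and a temporary CSV file stands in for intermediate storage such as HDFS.

```python
import csv
import os
import tempfile


def produce_rows():
    """Hypothetical stand-in for the big data side emitting rows."""
    for i in range(5):
        yield {"id": i, "value": i * i}


def train(rows):
    """Hypothetical stand-in for the AI side: sums one column."""
    return sum(int(row["value"]) for row in rows)


def exchange_with_landing():
    """Traditional path: write everything to storage, then read it back."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["id", "value"])
            writer.writeheader()
            writer.writerows(produce_rows())
        with open(path, newline="") as f:
            return train(csv.DictReader(f))
    finally:
        os.remove(path)


def exchange_on_the_fly():
    """Streaming path: the consumer pulls rows directly from the producer."""
    return train(produce_rows())


landed = exchange_with_landing()
streamed = exchange_on_the_fly()
```

Both paths compute the same result; the second simply skips the write-then-read round trip through storage, which is the cost the slide attributes to the traditional approach.
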
  8. Deep Dive for Data Exchange Based on PyJava Lib

    Learn more from: https://github.com/allwefantasy/pyjava
  9. Deep Dive for Data Exchange (In detail)

    [Diagram labels: Read Once server (shown six times) ◆ python worker ◆ Ray client ◆ Actor 0 ◆ Actor 1 ◆ Actor 2]
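
The "Read Once server" idea in slide 9 can be sketched roughly as follows. This is an illustrative assumption about the pattern, not PyJava's actual implementation: each partition is exposed by a small socket server that hands its rows to exactly one client and then shuts down, and a plain function stands in for a Ray actor. The names `start_read_once_server` and `read_partition`, and the JSON-lines wire format, are hypothetical.

```python
import json
import socket
import threading


def start_read_once_server(rows):
    """Serve `rows` to exactly one client, then shut down. Returns (host, port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    listener.listen(1)
    host, port = listener.getsockname()

    def serve():
        conn, _ = listener.accept()
        with conn:
            for row in rows:
                conn.sendall((json.dumps(row) + "\n").encode("utf-8"))
        listener.close()  # the partition is not served a second time

    threading.Thread(target=serve, daemon=True).start()
    return host, port


def read_partition(address):
    """Stand-in for a Ray actor: connect to one server and read its partition."""
    with socket.create_connection(address) as conn:
        with conn.makefile("r", encoding="utf-8") as stream:
            return [json.loads(line) for line in stream]


# Three partitions, one read-once server each.
partitions = [[{"id": i, "part": p} for i in range(3)] for p in range(3)]
addresses = [start_read_once_server(rows) for rows in partitions]

# Each "actor" reads exactly one partition straight from its server.
received = [read_partition(address) for address in addresses]
```

In this sketch only the server addresses need to be handed to the consumers; the rows themselves travel directly from producer to consumer without being written to intermediate storage.
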
  10. "MLSQL is expected to bridge Data and AI, and become the industrial standard of language interface."

    --- William Zhu, Author of MLSQL
  11. Contact Us

    Kyligence Inc ◆ http://kyligence.io ◆ [email protected] ◆ Twitter: @Kyligence
    Apache Kylin ◆ http://kylin.apache.org ◆ [email protected] ◆ Twitter: @ApacheKylin