Light Up the Spark in Catalyst by avoiding UDFs

Adi Polak
November 30, 2018

Processing data at scale usually means struggling with performance, strict SLAs, limited hardware, and so on. I struggled to cut Spark SQL query run time and found the culprit. I would like to share both the culprit and the solution with you. In today's world of Big Data and Spark, we process high volumes of transactions. Catalyst is the Spark SQL query optimizer, and in this talk you will learn how to make full use of Catalyst's optimizations so that queries run as fast as possible, by letting operations be pushed down and by avoiding UDFs wherever possible.

Transcript

  1. Light Up the Spark in Catalyst by
    avoiding UDFs
    Adi Polak
    Milan | November 29 - 30, 2018

  2. About me – Adi Polak
    @adipolak
    https://medium.com/@adipolak

  3. Agenda
    • Scala
    • Apache Spark 2.3
    • UDF
    • Catalyst
    • Optimization

  4. Who uses Apache Spark?
    Data scientists
    Data engineers
    Business analysts

  5. Concepts
    Apache Spark - a distributed, general-purpose
    cluster-computing framework
    User-Defined Functions (UDFs) - column-based functions
    that extend the vocabulary of Spark SQL's DSL
    Catalyst - the Apache Spark SQL query optimizer
    Catalyst optimization - uses advanced programming language
    features to build an extensible query optimizer

  6. Data analytics generic architecture

  7. Structured Streaming
    MLlib, GraphFrames, TensorFrames
    SQL SparkSession / DataFrame / Dataset APIs
    Data Source Connectors
    Catalyst Optimization & Tungsten Execution
    Spark Core (RDD APIs)
    SQL

  8. Let's look at the demo

  9-17. [Image-only slides: no transcript text]

  18. CATALYST

  19. Fundamentals of Catalyst Optimizer: Trees and Rules
    Tree: SUB(Attribute(x), SUB(some_func(1), some_func(2)))
    After applying a rule: SUB(Attribute(x), some_func(-1))
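
The tree-and-rule idea on this slide can be sketched in plain Scala without Spark. This is only a toy model: the case classes below are hypothetical stand-ins rather than Catalyst's real classes, and the slide's some_func(n) nodes are assumed to behave like integer literals.

```scala
// Toy expression tree, loosely modeled on the slide's diagram.
// These case classes are hypothetical stand-ins, NOT Catalyst's real classes.
object TreeRuleSketch {
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class Attribute(name: String) extends Expr
  case class Sub(left: Expr, right: Expr) extends Expr

  // A "rule" is a partial function that rewrites only the nodes it matches.
  type Rule = PartialFunction[Expr, Expr]

  // Constant folding: a subtraction of two literals becomes a single literal.
  val foldConstants: Rule = {
    case Sub(Literal(a), Literal(b)) => Literal(a - b)
  }

  // Apply a rule bottom-up: rewrite the children first, then the node itself.
  def transformUp(expr: Expr, rule: Rule): Expr = {
    val withNewChildren = expr match {
      case Sub(l, r) => Sub(transformUp(l, rule), transformUp(r, rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withNewChildren, (e: Expr) => e)
  }

  // x - (1 - 2), assumed to correspond to the tree on the slide.
  val before: Expr = Sub(Attribute("x"), Sub(Literal(1), Literal(2)))
  val after: Expr = transformUp(before, foldConstants)
}
```

Here `TreeRuleSketch.after` is `Sub(Attribute("x"), Literal(-1))`: the attribute is left alone because its value is unknown until run time, while the all-literal subtree is folded once at planning time.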

  20. Spark SQL Execution Plan
    Logical optimization -> Optimization rules
    • Constant folding
    • Predicate pushdown
    • Projection pruning
    • …
    Physical planning -> Planning strategies
    Catalyst
    Frontend Backend

  21. SQL query with custom UDF

  22. SQL query with custom UDF
    "Use the higher-level standard Column-based functions with
    Dataset operators whenever possible before reverting to
    using your own custom UDF functions since UDFs are a
    blackbox for Spark and so it does not even try to optimize
    them."

  25. What do we lose when using a custom UDF?
    • Constant folding
    • Predicate pushdown

  26. What can we do?

  27. Use queryExecution & explain(true)
    Catalyst
    Frontend Backend
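
A minimal sketch of inspecting a query with explain(true) and queryExecution, assuming Spark SQL is on the classpath and a local session is acceptable. The object name, app name, and the toy query are made up for illustration; the query contains a literal-only sub-expression so the effect of constant folding can be looked for in the optimized plan.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object PlanInspectionSketch {
  // Assumption: local mode, used here only for illustration.
  def localSession(): SparkSession =
    SparkSession.builder().master("local[*]").appName("plan-inspection-sketch").getOrCreate()

  // id + (1 + 2): the literal sub-expression is a candidate for constant folding.
  def query(spark: SparkSession): DataFrame =
    spark.range(10).select((col("id") + (lit(1) + lit(2))).as("shifted"))

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    val df = query(spark)

    // Prints the parsed, analyzed and optimized logical plans plus the physical plan.
    df.explain(true)

    // The same stages are also available programmatically.
    val qe = df.queryExecution
    println(qe.analyzed)      // logical plan after names and types are resolved
    println(qe.optimizedPlan) // logical plan after Catalyst's optimization rules
    println(qe.executedPlan)  // physical plan that will actually run

    spark.stop()
  }
}
```

Comparing the analyzed and optimized plans side by side is usually the quickest way to see which rules fired for a given query.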

  29. Use queryExecution & explain(true) API
    My UDF
    Register

  30. Lost Push Down filter
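
The lost push-down can be reproduced with a small experiment. This sketch assumes Spark SQL on the classpath; the Parquet path, the column names, and the 1000.0 threshold are hypothetical and not taken from the talk's demo. It expresses the same predicate once as a UDF and once as a Column expression, so the PushedFilters entry of the two Parquet scans can be compared.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object PushDownSketch {
  // Hypothetical data: a tiny Parquet file with a numeric "price" column.
  def writeSample(spark: SparkSession): String = {
    import spark.implicits._
    val path = Files.createTempDirectory("pushdown-sketch").resolve("diamonds").toString
    Seq(("ideal", 326.0), ("premium", 2757.0), ("good", 18823.0))
      .toDF("cut", "price")
      .write.parquet(path)
    path
  }

  // The same predicate twice: once as an opaque UDF, once as a Column expression.
  val isExpensive = udf((price: Double) => price > 1000.0)

  def withUdf(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path).filter(isExpensive(col("price")))

  def withBuiltIn(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path).filter(col("price") > 1000.0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pushdown-sketch").getOrCreate()
    val path = writeSample(spark)
    // Compare the PushedFilters entry of the Parquet scan in the two physical plans.
    withUdf(spark, path).explain(true)
    withBuiltIn(spark, path).explain(true)
    spark.stop()
  }
}
```

Both queries return the same rows. The difference is where the filtering happens: with the Column expression the comparison can be handed to the Parquet reader, while the UDF has to be evaluated by Spark after the rows are read.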

  31. What can be done instead?
    SQL functions (DataFrame API):
    • Aggregate functions
    • Collection functions
    • Date time functions
    • Math functions
    • Non-aggregate functions
    • Sorting functions
    • String functions
    • Window functions
    SQL functions (Column API):
    • Expression operations …
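
As one illustration of swapping a UDF for built-in functions, the sketch below builds the same derived column both ways. It assumes Spark SQL on the classpath; the sample rows, column names, and the label format are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, udf, upper}

object BuiltInInsteadOfUdfSketch {
  // Hypothetical sample rows; the column names are made up for illustration.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("dana", "rome"), ("noa", "paris")).toDF("name", "city")
  }

  // UDF version: Catalyst only sees an opaque Scala function.
  val labelUdf = udf((name: String, city: String) => name.toUpperCase + "@" + city)

  def withUdf(df: DataFrame): DataFrame =
    df.withColumn("label", labelUdf(col("name"), col("city")))

  // Built-in version: upper and concat are ordinary Catalyst expressions.
  def withBuiltIns(df: DataFrame): DataFrame =
    df.withColumn("label", concat(upper(col("name")), lit("@"), col("city")))
}
```

The two versions produce the same labels for this data, but only the second is visible to the optimizer as a tree of known expressions. Note that null handling can differ: the Scala closure would throw on a null name, whereas the built-in functions return null.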

  32. How can I find what functions are available?
    array_contains, minute, round, rand, spark_partition_id, isin …
    version

  33. Runtime Benchmarks
    [Bar chart: RunTime UDF (sec) vs. RunTime Functions (sec), axis from 0 to 4.5, for the datasets Diamonds (53,940), users_filtered (115,134), BlackFriday (302,675), UserList (537,577)]

  34. Takeaways
    • Use UDFs as a last resort
    • Avoid UDFs or UDAFs that perform more than one thing
    • Look under the hood: analyze Spark's execution plan with .explain(true)

  35. Reference
    • https://aka.ms/AA3e7rd
    • https://aka.ms/AA3efgo
    • https://www.kaggle.com

  36. Grazie (thank you)
    @adipolak