
Light Up the Spark in Catalyst by avoiding UDFs

adi polak
November 30, 2018


Processing data at scale usually means struggling with performance, strict SLAs, limited hardware, and more. I struggled to cut the run-time of a Spark SQL query and found the culprit, and the solution, and I would like to share both with you. In today's world of Big Data and Spark we process high-volume transactions. Catalyst is the Spark SQL query optimizer, and in this talk you will learn how to fully utilize Catalyst's optimization power to make your queries as fast as possible: push operations down, avoid UDFs as much as possible, and maximize performance.


Transcript

  1. Light Up the Spark in Catalyst by avoiding UDFs. Adi Polak. Milan | November 29 - 30, 2018
  2. Concepts: Apache Spark is a distributed, general-purpose cluster-computing framework. User-Defined Functions (UDFs) are column-based functions that extend the vocabulary of Spark SQL's DSL. Catalyst is the Apache Spark SQL query optimizer; its design leverages advanced programming-language features to build an extensible query optimizer.
  3. The Spark stack: Structured Streaming, MLlib, GraphFrames, TensorFrames, and SQL sit on top of the SparkSession / DataFrame / Dataset APIs and Data Source Connectors, which run through Catalyst Optimization & Tungsten Execution on Spark Core (RDD APIs).
  4. Spark SQL execution plan. Catalyst frontend: logical optimization driven by optimization rules (constant folding, predicate pushdown, projection pruning, …). Catalyst backend: physical planning driven by planning strategies.
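To make the frontend optimizations on this slide concrete, here is a minimal sketch of a query whose plans show both constant folding and predicate pushdown; the Parquet path and column names are assumptions for illustration, not taken from the deck.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().appName("catalyst-plans").master("local[*]").getOrCreate()
val sales = spark.read.parquet("/tmp/sales.parquet")   // hypothetical dataset

sales
  .filter(col("price") > 100)                                              // predicate pushdown: shows up as PushedFilters in the Parquet scan
  .select(col("item"), (col("price") * (lit(60) * lit(60))).as("scaled"))  // lit(60) * lit(60) is constant-folded to 3600 by Catalyst
  .explain(true)                                                           // prints parsed, analyzed, optimized and physical plans
```

In the printed output, the folded constant and the pushed filter are visible in the Optimized Logical Plan and the Physical Plan sections respectively.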
  5. "Use the higher-level standard Column-based functions with Dataset operators whenever

    possible before reverting to using your own custom UDF functions since UDFs are a blackbox for Spark and so it does not even try to optimize them." SQL query with custom UDF
  6. "Use the higher-level standard Column-based functions with Dataset operators whenever

    possible before reverting to using your own custom UDF functions since UDFs are a blackbox for Spark and so it does not even try to optimize them." SQL query with custom UDF
  7. "Use the higher-level standard Column-based functions with Dataset operators whenever

    possible before reverting to using your own custom UDF functions since UDFs are a blackbox for Spark and so it does not even try to optimize them." SQL query with custom UDF
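As a sketch of the pattern the quote warns about, a "SQL query with custom UDF" might look like the following; the table, column, and UDF names are hypothetical and not the ones shown in the deck's screenshots.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// The Scala lambda is a black box: Catalyst cannot look inside it.
spark.udf.register("is_premium", (amount: Double) => amount > 1000.0)

spark.read.parquet("/tmp/purchases.parquet")        // hypothetical dataset
  .createOrReplaceTempView("purchases")

// The UDF in the WHERE clause blocks the optimizations listed on the next slide.
val premiumCustomers = spark.sql(
  "SELECT customer_id, amount FROM purchases WHERE is_premium(amount)")
premiumCustomers.explain(true)
```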
  8. What do we lose when using a custom UDF? • Constant folding • Predicate pushdown
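A rough way to see the loss in practice, assuming a hypothetical Parquet source, is to compare the physical plans of the same filter written as a UDF and as a Column expression.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val purchases = spark.read.parquet("/tmp/purchases.parquet")   // hypothetical dataset

// UDF version: the predicate is opaque, so the scan's PushedFilters list stays
// essentially empty and every row is read before the filter runs.
val over100 = udf((amount: Double) => amount > 100.0)
purchases.filter(over100(col("amount"))).explain()

// Column-expression version: the same predicate is visible to Catalyst and is
// typically pushed into the Parquet scan, e.g. PushedFilters: [GreaterThan(amount,100.0)].
purchases.filter(col("amount") > 100.0).explain()
```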
  9. What can be done instead? sql functions and the DataFrame API: aggregate functions, collection functions, date-time functions, math functions, non-aggregate functions, sorting functions, string functions, window functions; plus the Column API and expression operations.
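As a sketch of leaning on the built-ins listed above instead of writing UDFs (the column names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper, datediff, current_date, when}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val users = spark.read.parquet("/tmp/users.parquet")   // hypothetical dataset

val enriched = users
  .withColumn("country_code", upper(col("country")))                              // string function
  .withColumn("days_since_signup", datediff(current_date(), col("signup_date")))  // date-time function
  .withColumn("segment",
    when(col("purchases") > 10, "frequent").otherwise("casual"))                  // non-aggregate / conditional

enriched.explain(true)   // every expression above remains fully visible to Catalyst
```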
  10. How can I find what functions are available? arrayContains, minute, round, rand, spark_partition_id, isin, … and the version each was added in.
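Beyond the org.apache.spark.sql.functions scaladoc (which notes the version each function appeared in), one way to explore what is already available in a running session is to query the catalog; a small sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.catalog.listFunctions()              // Dataset of all functions registered in this session
  .filter(col("name").contains("array"))   // e.g. array_contains, array_distinct, ...
  .show(truncate = false)
```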
  11. Runtime benchmarks: UDF runtime vs. built-in functions runtime (seconds) across four datasets: Diamonds (53,940 rows), users_filtered (115,134), BlackFriday (302,675), UserList (537,577).
  12. Takeaways: • Use UDFs as a last resort • Avoid UDFs or UDAFs that perform more than one thing • Look under the hood: analyze Spark's execution plan with .explain(true)
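A minimal sketch of the last takeaway, with a hypothetical dataset and columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.read.parquet("/tmp/orders.parquet")
  .filter(col("status") === "SHIPPED")
  .groupBy(col("country"))
  .count()
  .explain(true)   // prints the Parsed, Analyzed and Optimized Logical Plans plus the Physical Plan
```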