Light Up the Spark in Catalyst by avoiding UDFs

Adi Polak
Milan | November 29 - 30, 2018

About me – Adi Polak @adipolak [email protected] https://medium.com/@adipolak

Agenda
• Scala
• Apache Spark 2.3
• UDF
• Catalyst Optimization

Who uses Apache Spark?
• Data scientists
• Data engineers
• Business analysts

Concepts
• Apache Spark: a distributed, general-purpose cluster-computing framework
• User-Defined Functions (UDFs): column-based functions that extend the vocabulary of Spark SQL's DSL
• Catalyst: the Apache Spark SQL query optimizer
• Catalyst optimization: relies on advanced programming language features that let you build an extensible query optimizer

Data analytics generic architecture

Structured Streaming | MLlib | GraphFrame | TensorFrames | SQL
SparkSession / DataFrame / Dataset APIs
Data Source Connectors | Catalyst Optimization & Tungsten Execution
Spark Core (RDD APIs)

Let's look at the demo

CATALYST

Fundamentals of Catalyst Optimizer
Tree: SUB(Attribute(x), SUB(some_func(1), some_func(2)))
Rules: the tree is rewritten to SUB(Attribute(x), some_func(-1))
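The tree-and-rules idea on this slide can be sketched in plain Scala, without Spark on the classpath. This is only an illustration under stated assumptions: `Expr`, `Literal`, `Attribute`, `Sub`, `transformUp` and `foldConstants` are hypothetical names made up for this sketch, and they are not Spark's own classes. It shows a rule, written as a pattern match, that folds a subtraction of two constants into a single constant.

```scala
// Hypothetical, simplified expression tree (NOT Spark's real classes).
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Sub(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // Apply a rule bottom-up: rewrite the children first, then the node itself.
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = e match {
      case Sub(l, r) => Sub(transformUp(l)(rule), transformUp(r)(rule))
      case leaf      => leaf
    }
    // Nodes the rule does not match are returned unchanged.
    rule.applyOrElse(withNewChildren, (x: Expr) => x)
  }

  // A rule is just a pattern match: subtracting two constants yields a constant.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Sub(Literal(a), Literal(b)) => Literal(a - b)
  }

  def optimize(e: Expr): Expr = transformUp(e)(foldConstants)

  def main(args: Array[String]): Unit = {
    val tree = Sub(Attribute("x"), Sub(Literal(1), Literal(2)))
    println(optimize(tree)) // Sub(Attribute(x),Literal(-1))
  }
}
```

The rule only fires where both operands are constants, so `Attribute("x")` is left alone and only the inner subtraction is folded.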

Spark SQL Execution Plan
Logical optimization –> Optimization rules
• Constant folding
• Predicate pushdown
• Projection pruning
• …
Physical Planning –> Planning strategies
Catalyst: Frontend, Backend

SQL query with custom UDF

"Use the higher-level standard Column-based functions with Dataset operators whenever
possible before reverting to using your own custom UDF functions since UDFs are a blackbox for Spark and so it does not even try to optimize them."

What do we lose when using a custom UDF?
• Constant folding
• Predicate pushdown

What can we do?

Use queryExecution & explain(true)
Catalyst: Frontend, Backend

Use queryExecution & explain(true)
API: My UDF, Register

Lost Push Down filter
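A minimal sketch of how the lost pushdown can be inspected, assuming a Spark 2.3-style Scala API with spark-sql on the classpath. The dataset, the column names `cut` and `price`, the temporary Parquet path and the object name `UdfVsBuiltin` are all made up for illustration; they are not taken from the demo in the talk. The same filter is written once as a custom UDF and once with a built-in Column operator, and both plans are printed with `explain(true)`.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object UdfVsBuiltin {
  // Local session for experimentation only.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("udf-vs-builtin")
    .getOrCreate()

  // Writes a tiny, made-up Parquet dataset and reads it back, so the scan
  // goes through a data source that supports filter pushdown.
  def sampleData(): DataFrame = {
    import spark.implicits._
    val path = Files.createTempDirectory("udf-demo").resolve("data").toString
    Seq(("Ideal", 326), ("Premium", 334), ("Good", 327))
      .toDF("cut", "price")
      .write.parquet(path)
    spark.read.parquet(path)
  }

  // Filter expressed through a custom UDF: a black box for Catalyst.
  def filterWithUdf(df: DataFrame): DataFrame = {
    val isIdeal = udf((cut: String) => cut == "Ideal")
    df.filter(isIdeal(col("cut")))
  }

  // The same filter expressed with a built-in Column operator.
  def filterWithBuiltin(df: DataFrame): DataFrame =
    df.filter(col("cut") === "Ideal")

  def main(args: Array[String]): Unit = {
    val df = sampleData()
    filterWithUdf(df).explain(true)     // parsed, analyzed, optimized, physical plans
    filterWithBuiltin(df).explain(true)
    // queryExecution exposes the individual plans programmatically.
    println(filterWithBuiltin(df).queryExecution.optimizedPlan)
    spark.stop()
  }
}
```

When comparing the two physical plans, the entry to look at is the `PushedFilters` list of the Parquet scan: with the built-in operator it would be expected to include the equality filter on `cut`, while with the UDF the filter stays above the scan. The exact plan text varies between Spark versions.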

What can be done instead?
sql functions, DataFrame API:
• Aggregate functions
• Collection functions
• Date time functions
• Math functions
• Non-aggregate functions
• Sorting functions
• String functions
• Window functions
sql functions, Column API:
• Expression operations..

How can I find what functions are available?
array_contains, minute, round, rand, spark_partition_id, isin …

Runtime Benchmarks
[Bar chart: RunTime UDF (sec) vs. RunTime Functions (sec), axis from 0 to 4.5, for:]
• Diamonds (53,940)
• users_filtered (115,134)
• BlackFriday (302,675)
• UserList (537,577)

Takeaways
• Use UDFs as a last resort
• Avoid UDFs or UDAFs that perform more than one thing
• Look under the hood: analyze Spark’s execution plan with .explain(true)

References
• https://aka.ms/AA3e7rd
• https://aka.ms/AA3efgo
• https://www.kaggle.com

Thank you
@adipolak
[email protected]
