Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Improving Spark SQL Performance_

Improving Spark SQL Performance_

LINE Developers

March 14, 2019
Tweet

More Decks by LINE Developers

Other Decks in Technology

Transcript

  1. SLOW QUERIES Rank Query Count Average Time (min) Total Time

    (min) 1 Query A 335 20 6,842 2 Query B 28 85 2,380 3 Query C 109 20 2,182 4 Query D 4028 0 1,543 5 Query E 14 105 1,480
  2. IMPACTS • Speed of data analysis and decision making •

    Usage efficiency of computing resources
  3. CHALLENGES FOR END USERS • Acquiring knowledge and skills of

    data engineering • Taking care of queries’ performances • Changing queries and tables in the production environment
  4. QUERY select t.dt, m.name, count(1) from ( select dt, key

    from db.transaction where dt between '20190211' and '20190217' ) t inner join db.master m on m.key = t.key group by t.dt, m.name
  5. QUERY select t.dt, m.name, count(1) from ( select dt, key

    from db.transaction where dt between '20190211' and '20190217' ) t inner join db.master m on m.key = t.key group by t.dt, m.name 1,800,000,000 records 200 records
  6. PHYSICAL PLAN *(6) HashAggregate(keys=[dt#958, name#970], functions=[count(1)]) +- Exchange hashpartitioning(dt#958, name#970,

    200) +- *(5) HashAggregate(keys=[dt#958, name#970], functions=[partial_count(1)]) +- *(5) Project [dt#958, name#970] +- *(5) SortMergeJoin [key#937], [key#960], Inner :- *(2) Sort [key#937 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(key#937, 200) : +- *(1) Project [dt#958, key#937] : +- *(1) Filter isnotnull(key#937) : +- *(1) FileScan orc transaction[key#937,dt#958] Batched: true, Format: ORC, Location:… +- *(4) Sort [key#960 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(key#960, 200) +- *(3) Project [key#960, name#970] +- *(3) Filter isnotnull(key#960) +- *(3) FileScan orc master[key#960,name#970] Batched: true, Format: ORC, Location:…
  7. QUERY select t.dt, m.name, count(1) from ( select dt, key

    from db.transaction where dt between '20190211' and '20190217' ) t inner join db.master m on m.key = t.key group by t.dt, m.name 1,800,000,000 records 200 records
  8. PHYSICAL PLAN *(6) HashAggregate(keys=[dt#958, name#970], functions=[count(1)]) +- Exchange hashpartitioning(dt#958, name#970,

    200) +- *(5) HashAggregate(keys=[dt#958, name#970], functions=[partial_count(1)]) +- *(5) Project [dt#958, name#970] +- *(5) SortMergeJoin [key#937], [key#960], Inner :- *(2) Sort [key#937 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(key#937, 200) : +- *(1) Project [dt#958, key#937] : +- *(1) Filter isnotnull(key#937) : +- *(1) FileScan orc db.transaction[key#937,dt#958] Batched: true, Format: ORC +- *(4) Sort [key#960 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(key#960, 200) +- *(3) Project [key#960, name#970] +- *(3) Filter isnotnull(key#960) +- *(3) FileScan orc db.master[key#960,name#970] Batched: true, Format: ORC
  9. PHYSICAL PLAN *(3) HashAggregate(keys=[dt#149, name#161], functions=[count(1)]) +- Exchange hashpartitioning(dt#149, name#161,

    200) +- *(2) HashAggregate(keys=[dt#149, name#161], functions=[partial_count(1)]) +- *(2) Project [dt#149, name#161] +- *(2) BroadcastHashJoin [key#128], [key#151], Inner, BuildRight :- *(2) Project [dt#149, key#128] : +- *(2) Filter isnotnull(key#128) : +- *(2) FileScan orc db.transaction[key#128,dt#149] Batched: true, Format:… +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true])) +- *(1) Project [key#151, name#161] +- *(1) Filter isnotnull(key#151) +- *(1) FileScan orc db.master[key#151,name#161] Batched: true, Format:…
  10. QUERY select t.dt, a.area_name, count(distinct uid) from ( select dt,

    area_code, uid from db.transaction where dt between '20180818' and '20190217' ) t inner join (select * from db.area where dt = ‘20190217’) a on a.area_code = t.area_code group by t.dt, a.area_name
  11. QUERY select t.dt, a.area_name, count(distinct uid) from ( select dt,

    area_code, uid from db.transaction where dt between '20180818' and '20190217' ) t inner join (select * from db.area where dt = ‘20190217’) a on a.area_code = t.area_code group by t.dt, a.area_name 4,500,000,000 records 90 records
  12. PHYSICAL PLAN +- *(5) HashAggregate(...) +- *(5) HashAggregate(...) +- *(5)

    HashAggregate(...) +- *(5) Project [...] +- *(5) SortMergeJoin [...], Inner :- *(2) Sort [...], false, 0 : +- Exchange hashpartitioning(...) : +- *(1) Project [...] : +- *(1) Filter (...) : +- *(1) FileScan orc db.transaction[...] Batched: true, Format: ORC +- *(4) Sort [...], false, 0 +- Exchange hashpartitioning(...) +- *(3) Filter (...) +- Scan hive db.area [...], HiveTableRelation
  13. CUSTOMIZE OPTIMIZATION object InsertBroadcastHints extends Rule[LogicalPlan] { val targets =

    Map(("db", "area") -> "dt") override def apply(plan: LogicalPlan): LogicalPlan = plan transform { case join: Join if isSelectFromTarget(join.left) => join.copy(left = ResolvedHint(join.left, HintInfo(true))) case join: Join if isSelectFromTarget(join.right) => join.copy(right = ResolvedHint(join.right, HintInfo(true))) } private def isSelectFromTarget(plan: LogicalPlan): Boolean = plan match { case p: Project => p.child match { case f: Filter => f.child match { case rel: HiveTableRelation => targets.get(rel.tableMeta.database, rel.tableMeta.identifier.table) match { case Some(pt) => isFilteredByLiteral(f.condition, pt) case None => false } … } private def isFilteredByLiteral(expr: Expression, pt: String): Boolean = ... } spark.experimental.extraOptimizations = InsertBroadcastHints :: Nil
  14. OPTIMIZED LOGICAL PLAN Aggregate [...] +- Project [...] +- Join

    Inner, (...) :- Project [...] : +- Filter (...) : +- Relation[...] orc +- ResolvedHint (broadcast) +- Project [...] +- Filter (...) +- HiveTableRelation `db`.`area`, ...
  15. PHYSICAL PLAN *(4) HashAggregate(...) +- Exchange hashpartitioning(...) +- *(3) HashAggregate(...)

    +- *(3) HashAggregate(...) +- Exchange hashpartitioning(...) +- *(2) HashAggregate(...) +- *(2) Project [...] +- *(2) BroadcastHashJoin [...], [...], Inner, BuildRight :- *(2) Project [...] : +- *(2) Filter (...) : +- *(2) FileScan orc db.transaction[...] Batched: true, Format: ORC +- BroadcastExchange HashedRelationBroadcastMode(...) +- *(1) Filter (...) +- Scan hive db.area [...], HiveTableRelation
  16. COST-BASED OPTIMIZATION • Apache Spark 2.2.0~ • Statistics, Cost •

    Advanced Optimizations 1. Broadcast Hash Join Based on Column Statistics 2. Join Reordering
  17. COST-BASED OPTIMIZATION • Apache Spark 2.2.0~ • Statistics, Cost •

    Advanced Optimizations 1. Broadcast Hash Join Based on Column Statistics 2. Join Reordering
  18. QUERY select a.* from db.a inner join db.b on a.id

    = b.id inner join db.c on b.key = c.key
  19. QUERY select a.* from db.a inner join db.b on a.id

    = b.id inner join db.c on b.key = c.key 1,500,000,000 records 400,000,000 records 1 records