Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Sven Bayer - SQL in Hadoop

Sven Bayer - SQL in Hadoop

This talk describes the Query Compiler of Hive for MapReduce. The architecture of the Hive Query Compiler is explained. Additionally, the compilation of a SQL-query to a MapReduce-Job is shown.

MunichDataGeeks

January 21, 2014
Tweet

More Decks by MunichDataGeeks

Other Decks in Technology

Transcript

  1. QAware GmbH QAware 3 21 January 2014 ▪45 employees ▪Software

    Engineering ▪Quality ▪Agility ▪Projects ▪Software diagnosis ▪Individual Software solutions ▪Customers ▪Automotive, Energy, Retail, Telecommunications, and others
  2. Agenda QAware 4 21 January 2014 1. Motivation 2. Big

    Data 3. MapReduce 4. Hadoop 5. Hive 6. Hive Query Compiler 7. Discussion
  3. Motivation QAware 5 21 January 2014 ▪Hadoop processes huge data

    on clusters ▪Hive provides SQL for Hadoop ▪Hive generates complex MapReduce jobs from SQL ▪How does Hive convert SQL to MapReduce? Masterthesis: Software-metrics + time tracking
  4. Big Data QAware 6 21 January 2014 ▪Defined by 4

    V‘s ▪Volume ▪Velocity ▪Variety ▪Veracity
  5. MapReduce QAware 7 21 January 2014 ▪In 2004 published by

    Google ▪MapReduce is highly scalable on clusters ▪Big Data can be processed with MapReduce ▪Consists mainly of a Map and Reduce function
  6. Example: Word-Count-Algorithm QAware 8 21 January 2014 MapReduce Hadoop uses

    MapReduce. There is a Map phase. There is a Reduce phase, a Map phase. There is a Map phase. Input Map {Hadoop,1}, {uses,1}, {MapReduce,1} Sort, Shuffle {There,1}, {is,1}, {a,1}, {Map,1}, {phase,1} {There,1}, {is,1}, {a,1}, {Reduce,1}, {phase,1}, {a,1}, {Map,1}, {phase,1} {There,1}, {is,1}, {a,1}, {Map,1}, {phase,1} {Hadoop,1} {uses,1} {MapReduce,1} {There,1}, {There,1}, {There, 1} {is,1}, {is,1}, {is, 1}, {a,1}, {a,1}, {a,1}, {a,1} {Map,1}, {Map,1}, {Map, 1} {phase,1}, {phase,1}, {phase, 1}, {phase,1} {Reduce,1} Reduce Hadoop 1 uses 1 MapReduce 1 {There,[1,1,1]}, {is,[1,1,1]}, {a,[1,1,1,1]} {Hadoop,[1]}, {uses,[1]}, {MapReduce,[1]} {Map,[1,1,1]}, {phase,[1,1,1,1]}, {Reduce,[1]} There 3 is 3 a 4 Map 3 phase 4 Reduce 1 Output
  7. MapReduce – In practice QAware 9 21 January 2014 ▪Get

    the users with the products that they watched ▪Get these products with their numbers, makers, price and filter the products on „audi“ access_log product Input Map {pNo1,user1}, {pNo1,user2}, {pNo2,user3} Sort, Shuffle {pNo1,{audi,30€}}, {pNo2,{audi,50€}}, {pNo3,{bmw,60€}} + Filtering on „audi“ {pNo1,user1}, {pNo1,user2}, {pNo1,{audi,30€}} {pNo2,user3}, {pNo2,{audi,50€}} Reduce pNo1 user1,audi,30€, pNo1 user2,audi,30€, pNo2 user3,audi,50€ Join on product_no + {pNo1,[user1,user2,{audi,30€}]}, {pNo2,[user3,{audi,50€}]} Output pNo1 audi 30€ pNo2 audi 50€ pNo3 bmw 60€ id1 user1 pNo1 id2 user2 pNo1 id3 user3 pNo2
  8. Hadoop QAware 10 21 January 2014 ▪In 2006 initiated by

    Yahoo ▪Hadoop cluster ▪Highly scalalbe for Big Data ▪Hadoop architecture Hadoop Common HDFS YARN (MapReduce)
  9. Hive QAware 11 21 January 2014 ▪Built on top of

    Hadoop ▪MapReduce ▪HDFS ▪Provides HiveQL queries for Hadoop ▪Compiles HiveQL to MapReduce
  10. Metastore CLI JDBC/ ODBC Driver Web-UI Thrift Server Execution Engine

    Query Compiler Parser Semantic Analyzer Logical Plan Generator Logical Optimizer Physical Plan Generator Physical Optimizer Hive architecture QAware 12 21 January 2014 Hive Legend Framework Component Framework Component Call of a component
  11. Hive Query Compiler QAware 21 January 2014 Start HiveQL Parser

    AST Semantic Analyzer QB Logical Plan Generator QB Tree Logical Optimizer QB Tree Physical Plan Generator Phys. Plan Physical Optimizer Phys. Plan Execution Engine End Hive
  12. Parser QAware 15 21 January 2014 SELECT a.user, a.product_no, p.maker,

    p.price FROM access_log a JOIN product p ON (a.product_no = p.product_no) WHERE p.maker = `audi`; Hive Query Compiler HiveQL Parser AST Parser Semantic Analyzer Logical Plan Generator Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine access_log product pNo1 audi 30€ pNo2 audi 50€ pNo3 bmw 60€ id1 user1 pNo1 id2 user2 pNo1 id3 user3 pNo2
  13. Parser QAware 16 21 January 2014 ▪… WHERE p.maker =

    `audi`; Hive Query Compiler HiveQL Parser AST Parser Semantic Analyzer Logical Plan Generator Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine
  14. Semantic Analyzer QAware 17 21 January 2014 Hive Query Compiler

    AST Semantic Analyzer QB Parser Semantic Analyzer Logical Plan Generator Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine Query Block FROM-Clause MetaData ParseInfo Alias to Table Info “a”=Table Info(“access_log”) “p”=Table Info(“product”) AST of Join-Expression
  15. Logical Plan Generator QAware 18 21 January 2014 Hive Query

    Compiler QB Logical Plan Generator QB Tree Parser Semantic Analyzer Logical Plan Generator Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine TableScanOperator TS_0 TableScanOperator TS_1 ReduceSinkOperator RS_2 ReduceSinkOperator RS_3 JoinOperator JOIN_4 FilterOperator FIL_5 (maker = ‘audi’) SelectOperator SEL_6 FileSinkOperator FS_7
  16. Logical Optimizer QAware 19 21 January 2014 Hive Query Compiler

    QB Tree Logical Optimizer QB Tree Parser Semantic Analyzer Logical Plan Generator Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine TableScanOperator TS_0 TableScanOperator TS_1 ReduceSinkOperator RS_2 ReduceSinkOperator RS_3 JoinOperator JOIN_4 SelectOperator SEL_6 FileSinkOperator FS_7 FilterOperator FIL_8 (maker = ‘audi’)
  17. Phyiscal Plan Generator QAware 20 21 January 2014 Hive Query

    Compiler QB Tree Physical Plan Generator Phys. Plan Parser Semantic Analyzer Logical Plan Generator Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine MapRedTask (Stage-1/root) Reducer Mapper Mapper TableScanOperator TS_0 TableScanOperator TS_1 ReduceSinkOperator RS_2 ReduceSinkOperator RS_3 JoinOperator JOIN_4 SelectOperator SEL_6 FileSinkOperator FS_7 FilterOperator FIL_8 (maker= ‘audi’) MoveTask (Stage-0) StatsTask (Stage-2)
  18. Physical Optimizer QAware 21 21 January 2014 ▪Optimizes the Physical

    Plan ▪Transforms a plan with Joins to multiple MapReduce jobs ▪Converts tasks including a Join to a MapJoin HiveQL-Verarbeitung Phys. Plan Physical Optimizer Phys. Plan Parser Semantic Analyzer Logical Plan Generator Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine
  19. Execution Engine QAware 22 21 January 2014 ▪MapReduce job is

    serialized as plan.xml ▪Returns the result ▪Temporary place ▪Table HiveQL-Verarbeitung Parser Semantic Analyzer Logical Plan Generator Logical Optimizer Physical Plan Generator Physical Optimizer Execution Engine Phys. Plan Execution Engine
  20. Discussion QAware 23 21 January 2014 ▪Hive brings SQL to

    Hadoop ▪Advantages of Hive ▪Reduces developer workload ▪No need for manual coding of MapReduce jobs ▪Easy migration for systems interacting with SQL ▪Disadvantages of Hive ▪High latency ▪Outlook for Hive ▪Apache Tez with container reusage, Mapper reduction in DAG ▪Alternatives for Hive ▪Impala, Shark, Presto, Lingual