Hive HiveServer2 Impala Parquet talk

Kaufman Ng
November 19, 2013

Presentation to the NJ Hadoop meetup on 11/19/2013

Transcript

  1. Hive, Hive Server 2, Impala, and Parquet
     Kaufman Ng, Solutions Architect, Cloudera
  2. Agenda
     • Hive, Hive Server 2, Metastore
     • Impala
     • Parquet
     • Hands-on session
  3. Hive Overview
     • Uses a SQL-like language called HiveQL (see the sketch after this slide)
     • Accesses data in HDFS
     • Query execution via MapReduce
     • Best suited for batch jobs and ETL
     • Functionality can be extended via User Defined Functions (UDFs)
     • Can be accessed via the command line or Hue
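     As a minimal illustration of the points above (our sketch; the table, columns, and path are hypothetical, not from the deck):

       -- Define a table over tab-delimited files already sitting in HDFS
       CREATE EXTERNAL TABLE weblogs (
         ip  STRING,
         url STRING,
         ts  STRING)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
       LOCATION '/data/weblogs';

       -- This aggregation compiles to one or more MapReduce jobs
       SELECT url, COUNT(*) AS hits
       FROM weblogs
       GROUP BY url;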
  4. Hive Clients
     • Command-line shell (CLI)
     • JDBC
     • ODBC
     • Thrift client (Python, Perl, C++, …)
  5. How Does Hive Know About the Tables? The Hive Metastore
     • Needs a backing database
     • Stores the metadata for Hive tables and partitions in a relational database (MySQL, PostgreSQL, etc.)
     • Provides clients (including Hive) access via the metastore service API
  6. Different Ways to Run the Metastore
     • Embedded, local, and remote
     • Embedded is the default
       ‣ Uses Derby
       ‣ Runs in the same JVM process as Hive
  7. Local Mode
     • The metastore service runs in the same JVM as the main Hive service
     • The database runs as a separate process
  8. Remote Mode
     • The metastore runs in its own JVM process
     • HiveServer2, HCatalog, Cloudera Impala™, and other processes communicate with it via the Thrift API
     • Clients do not need to know the metastore database credentials
     • Supports security (Kerberos, user impersonation)
     • The recommended mode
  9. Hive Server 1
     • Thrift-based server for data access
     • Works with CLI, JDBC, and ODBC clients
     • Does not support concurrent client access
     • No authentication
  10. Hive Server 2
      • Available since Hive 0.10
      • Better JDBC/ODBC support
      • Multiple-client concurrency
      • Security (Kerberos support)
      • Uses the Metastore, like Hive Server 1
      • Enables new types of Hive clients (e.g. Beeline)
  11. Impala
      • Interactive SQL directly on data in HDFS / HBase
      • Open source under the Apache license
      • Leverages Hive metadata
      • Same SQL as HiveQL (Create, Alter, Insert, Select, Join, subqueries); see the sketch after this slide
      • JDBC/ODBC drivers, Hue interface
      • Support for a variety of data formats:
        ‣ Hadoop-native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed)
        ‣ Text (uncompressed or LZO-compressed)
        ‣ Parquet (Snappy or uncompressed)
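      To illustrate the shared dialect, a query like the following (our example; the tables are hypothetical) runs unchanged under both Hive and Impala: an equi-join against an inline-view subquery with aggregation:

        SELECT c.name, t.total
        FROM customers c
        JOIN (SELECT customer_id, SUM(amount) AS total
              FROM orders
              GROUP BY customer_id) t
          ON c.id = t.customer_id;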
  12. Impala
      • High performance
        ‣ C++ instead of Java
        ‣ New execution engine that does not use MapReduce
  13. User View of Impala
      • Runs as a distributed service in the cluster: one Impala daemon on each node with data
      • The user submits a query via the ODBC/Beeswax Thrift API to any of the daemons
      • The query is distributed to all nodes with relevant data
      • If any node fails, the query fails
      • Impala uses Hive's metadata interface and connects to Hive's metastore
  14. Impala Architecture
      • Two binaries: impalad and statestored
      • Impala daemon (impalad)
        ‣ Handles client requests and all internal requests related to query execution
        ‣ Exports Thrift services for these two roles
      • State store daemon (statestored)
        ‣ Provides name service and metadata distribution
        ‣ Also exports a Thrift service
  15. Impala Query Execution
      [diagram: a SQL app submits a request over ODBC to one of several impalad nodes, each running a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase; the Hive Metastore, HDFS NameNode, and Statestore sit to the side]
      1) Request arrives via ODBC/JDBC/Beeswax/shell
  16. Impala Query Execution
      [diagram: same layout as the previous slide]
      2) The planner turns the request into collections of plan fragments
      3) The coordinator initiates execution on the impalad(s) local to the data
  17. Impala Query Execution
      [diagram: same layout as the previous slide, with query results flowing back to the SQL app]
      4) Intermediate results are streamed between impalad(s)
      5) Query results are streamed back to the client
  18. Impala SQL Support
      • Patterned after Hive's version of SQL
      • Limited to Select, Project, Join, Union, Subqueries, Aggregation, and Insert
      • Only equi-joins; no non-equi joins, no cross products (illustrated below)
      • Order By only with Limit
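      A hedged sketch of those constraints (our example; the tables are hypothetical):

        -- Allowed: equality join condition, ORDER BY paired with LIMIT
        SELECT u.name, p.amount
        FROM users u JOIN payments p ON u.id = p.user_id
        ORDER BY p.amount DESC
        LIMIT 10;

        -- Rejected at this point: a non-equi join condition such as
        --   ... JOIN payments p ON u.signup_date < p.paid_date ...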
  19. Impala and HBase
      • HBase has its own storage format (HFile)
      • Need to map the HBase table to a Hive table
      • Create the table in the HBase shell:
        create 'hbasealltypessmall', 'bools', 'ints', 'floats', 'strings'
        enable 'hbasealltypessmall'
  20. Impala and HBase (cont'd)
      CREATE EXTERNAL TABLE hbasealltypessmall (
        id string,
        bool_col boolean,
        tinyint_col tinyint,
        smallint_col smallint,
        int_col int,
        bigint_col bigint,
        float_col float,
        double_col double,
        date_string_col string,
        string_col string,
        timestamp_col timestamp)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" =
        ":key,bools:bool_col,ints:tinyint_col,ints:smallint_col,ints:int_col,ints:bigint_col,floats:float_col,floats:double_col,strings:date_string_col,strings:string_col,strings:timestamp_col"
      )
      TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall");
  21. Impala and HBase (cont'd)
      • Finally, query the table:
        select * from hbasealltypessmall where id < 5;
  22. Impala vs Hive
      • No support for maps, arrays, or structs
      • No custom UDFs, custom file formats, or custom SerDes
      • No UDAFs or UDTFs
      • No multiple DISTINCT clauses (illustrated below)
      • No MapReduce-specific features of Hive (SORT BY, CLUSTER BY)
      • Not all HiveQL statements are supported (e.g. SHOW CREATE TABLE)
      • Tables in certain file formats (Avro, RCFile, SequenceFile) need to be loaded in Hive
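      For instance, the multiple-DISTINCT limitation means the first query below (our example; the table is hypothetical) runs in Hive but is rejected by Impala at this point, while the second works in both:

        -- Two DISTINCT aggregates in one query: Hive only
        SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT session_id) FROM events;

        -- One DISTINCT aggregate per query: fine in both
        SELECT COUNT(DISTINCT user_id) FROM events;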
  23. What's Impala Good For?
      • Interactive queries
      • Ad-hoc querying/analysis
      • Integration with BI tools (via ODBC)
      • Support for many file formats: text, Avro, SequenceFile, RCFile, Parquet
  24. Impala Roadmap: Impala 1.2
      • UDFs (built-in and custom)
      • Automatic metadata refresh
      • HDFS caching
      • Cost-based join-order optimization
      • Integration with YARN
  25. Impala Roadmap: Impala 2.0
      • Analytic window functions
      • UDTFs
      • Nested data
      • Parquet enhancements (e.g. index pages)
      • More data types: date and decimal
  26. Parquet File Format
      • Open source columnar file format developed by Cloudera and Twitter
      • Supports storing each column in a separate file
      • Supports fully shredded nested data
      • Native type support
      • Type-specific value encodings (saves space)
      • Supports indexes for fast lookups
      • Allows nesting of data
      • Support for multiple programming languages
  27. Parquet File Format (cont'd)
      • Row group: a group of rows in columnar format
        ‣ Max size buffered in memory while writing
        ‣ One (or more) per split while reading
        ‣ Roughly 50 MB < row group < 1 GB
      • Column chunk: the data for one column in a row group
        ‣ Column chunks can be read independently for efficient scans
      • Page: unit of access in a column chunk
        ‣ Should be big enough for efficient compression
        ‣ Minimum size to read while accessing a single record
        ‣ Roughly 8 KB < page < 1 MB
  28. Parquet Model
      • Borrowed from Google Dremel's ColumnIO file format
      • The schema is defined in a familiar format
      • Supports nested data structures
      • Each cell is encoded as a triplet: repetition level, definition level, and the value (worked example below)
      • Level values are bounded by the depth of the schema, so they are stored in a compact form
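      To make the triplet encoding concrete, here is a small worked example of our own (not from the deck). Assume a schema with one repeated group a holding an optional int32 b:

        message example {
          repeated group a {
            optional int32 b;
          }
        }

      For the record {a: [{b: 1}, {}, {b: 3}]}, column a.b is stored as (repetition, definition, value) triplets:

        (0, 2, 1)     the first value in a record always starts at repetition level 0
        (1, 1, NULL)  the second a entry repeats at level 1; b is absent, so definition stops at 1
        (1, 2, 3)     the third a entry repeats at level 1, with b defined to the maximum level 2

      The maximum definition level here is 2 because the path a.b crosses one repeated field and one optional field; required fields add no levels. Both levels are small integers bounded by the schema depth, which is why they bit-pack into the compact form the slide mentions.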
  29. Parquet Model (cont'd)
      • Field types are either groups or primitive types, with a repetition of required, optional, or repeated
        ‣ i.e. exactly one, zero or one, or zero or more
      • Various encoding schemes:
        ‣ Bit packing: good for small integers
        ‣ Run-length encoding: allows compression
        ‣ Dictionary encoding
      • Extensible
  30. How To Use Parquet
      • Create a new table:
        [impala-host:21000] > create table parquet_table_name (x INT, y STRING) STORED AS PARQUETFILE;
      • Create from an existing table:
        [impala-host:21000] > create table parquet_table_name LIKE other_table_name STORED AS PARQUETFILE;
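      Either way the new Parquet table starts out empty; the usual way to populate it (our sketch, reusing the slide's placeholder names) is an INSERT ... SELECT, which rewrites the selected data in Parquet format:

        [impala-host:21000] > insert overwrite table parquet_table_name select * from other_table_name;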
  31. Parquet, Impala, and Hive
      • Impala uses Snappy compression by default with Parquet tables
      • Starting in Impala 1.1.1, you can use Parquet data in Hive:
        ALTER TABLE table_name SET FILEFORMAT PARQUETFILE;
  32. Future: Parquet 2.0
      ‣ More encodings: delta encodings, improved encodings
      ‣ Statistics: for query planners and predicate pushdown
      ‣ New page format: better skip-ahead
  33. Parquet and Impala URLs
      • Parquet sources: https://github.com/parquet/parquet-format
      • Blog post with info: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
      • Impala source: https://github.com/cloudera/impala
      • Impala: http://www.cloudera.com/content/cloudera/en/campaign/introducing-impala.html
  34. Impala Hands-On
      • Bring up your Cloudera QuickStart VM
      • In a terminal, go to the datasets directory
      • Run zipcode_setup.sh
      • Run tpcds_setup.sh (requires Internet connectivity)
  35. Impala Hands-On (cont'd)
      • Bring up the Impala shell (the impala-shell command)
      • Execute the 'invalidate metadata' statement (or 'refresh' if the Impala version is < 1.1)
      • Execute 'show tables;' and you should see the tables
      • Execute more queries from near the bottom of the blog page: http://blog.cloudera.com/blog/2013/06/quickstart-vm-now-with-real-time-big-data/
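      Put together, a minimal impala-shell session for this exercise might look like the following (our sketch; the final table name is hypothetical and depends on what the setup scripts create):

        [impala-host:21000] > invalidate metadata;
        [impala-host:21000] > show tables;
        [impala-host:21000] > select count(*) from zipcodes;  -- hypothetical table name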