Hive
• Uses a SQL-like language called HiveQL
• Accesses data in HDFS
• Executes queries via MapReduce
• Best suited for batch jobs and ETL
• Functionality can be extended via User Defined Functions (UDFs)
• Can be accessed via the command line or Hue
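As a minimal sketch of that workflow, the HiveQL below defines a table over files in HDFS and runs a batch-style aggregate that Hive compiles into MapReduce (the table, columns, and path are illustrative, not from the slides):

    -- Define a table over tab-delimited files in HDFS
    CREATE TABLE page_views (
      user_id STRING,
      url     STRING,
      ts      BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Move a file already in HDFS under the table's directory
    LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

    -- A batch aggregate; Hive runs this as one or more MapReduce jobs
    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url;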
How does Hive know about the tables? The Hive Metastore
• Needs a backing database
• Stores the metadata for Hive tables and partitions in a relational database (MySQL, PostgreSQL, etc.)
• Provides clients (including Hive itself) access via the metastore service API
• The metastore runs in its own JVM process
• HiveServer2, HCatalog, Cloudera Impala™, and other processes communicate with it via its Thrift API
• Does not require clients to know the metastore database credentials
• Supports security (Kerberos, user impersonation)
• Recommended deployment mode
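A minimal hive-site.xml sketch of this remote-metastore setup, assuming a MySQL backing database (the hostnames and database name are illustrative; 9083 is the usual metastore port):

    <!-- On the metastore service host only: where the backing database lives -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://dbhost/metastore</value>
    </property>

    <!-- On clients (Hive, Impala, HCatalog): just the Thrift endpoint, no DB credentials -->
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastorehost:9083</value>
    </property>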
HiveServer2
• Available since Hive 0.10
• Better JDBC/ODBC support
• Multiple concurrent clients
• Security (Kerberos support)
• Uses the metastore like HiveServer1
• Enables new types of Hive clients (e.g. Beeline)
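For example, connecting with Beeline over JDBC might look like this (the hostname and user are illustrative; 10000 is the usual HiveServer2 port):

    $ beeline -u jdbc:hive2://hs2host:10000 -n myuser
    0: jdbc:hive2://hs2host:10000> SHOW TABLES;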
Cloudera Impala
• SQL directly on data in HDFS / HBase
• Open source under the Apache license
• Leverages Hive metadata
• Same SQL as HiveQL (Create, Alter, Insert, Select, Join, subqueries)
• JDBC/ODBC drivers, Hue interface
• Support for a variety of data formats:
  ‣ Hadoop-native: Apache Avro, SequenceFile, RCFile (with Snappy, GZIP, BZIP, or uncompressed)
  ‣ Text (uncompressed or LZO-compressed)
  ‣ Parquet (Snappy or uncompressed)
Impala
• Runs as a distributed service in the cluster: one Impala daemon on each node with data
• The user submits a query via the ODBC/Beeswax Thrift API to any of the daemons
• The query is distributed to all nodes with relevant data
• If any node fails, the query fails
• Impala uses Hive's metadata interface and connects to Hive's metastore
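As a sketch, any impalad can act as the coordinator for a query (the hostname and table are illustrative; 21000 is the default impala-shell port):

    $ impala-shell -i node3.example.com
    [node3.example.com:21000] > SELECT COUNT(*) FROM page_views;
    -- node3 coordinates; fragments run on every node holding relevant blocks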
Two binaries: impalad and statestored
• Impala daemon (impalad)
  ‣ Handles client requests and all internal requests related to query execution
  ‣ Exports Thrift services for these two roles
• State store daemon (statestored)
  ‣ Provides name service and metadata distribution
  ‣ Also exports a Thrift service
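On a CDH-style install the two daemons typically run as separate services; a quick way to check them (the service names follow Cloudera's packaging and may differ on your distribution):

    $ sudo service impala-server status       # impalad, one per data node
    $ sudo service impala-state-store status  # statestored, one per cluster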
• Patterned after Hive's version of SQL
• Limited to Select, Project, Join, Union, Subqueries, Aggregation, and Insert
• Only equi-joins; no non-equi joins, no cross products
• Order By only with Limit
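To make those limits concrete, a sketch with illustrative table names:

    -- Supported: equi-join, aggregation, ORDER BY paired with LIMIT
    SELECT c.name, SUM(o.total) AS revenue
    FROM orders o JOIN customers c ON (o.cust_id = c.id)
    GROUP BY c.name
    ORDER BY revenue DESC
    LIMIT 10;

    -- Not supported: a non-equi join condition such as
    --   ... JOIN customers c ON (o.total > c.credit_limit)
    -- or an ORDER BY without a LIMIT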
• HBase has its own storage format (HFile)
• Need to map the HBase table to a Hive table (a sketch of the Hive-side mapping follows below)
• Create the table in the HBase shell:
  ‣ create 'hbasealltypessmall', 'bools', 'ints', 'floats', 'strings'
  ‣ enable 'hbasealltypessmall'
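A minimal sketch of the matching Hive DDL using the HBase storage handler; only the HBase table and column-family names come from the slide, while the Hive column names and qualifiers are illustrative:

    CREATE EXTERNAL TABLE hbasealltypessmall (
      id         STRING,
      bool_col   BOOLEAN,
      int_col    INT,
      float_col  FLOAT,
      string_col STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      -- the first Hive column maps to the HBase row key
      "hbase.columns.mapping" = ":key,bools:bool_col,ints:int_col,floats:float_col,strings:string_col"
    )
    TBLPROPERTIES ("hbase.table.name" = "hbasealltypessmall");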
• No support for maps, arrays, or structs
• No custom UDFs, custom file formats, or custom SerDes
• No UDAFs or UDTFs
• No multiple DISTINCT clauses
• No MapReduce-specific Hive features (SORT BY, CLUSTER BY)
• Not all HiveQL statements are supported (e.g. SHOW CREATE TABLE)
• Tables in certain file formats (Avro, RCFile, SequenceFile) need to be loaded through Hive (see the sketch below)
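A sketch of that last point with illustrative names: the data is written through Hive, and Impala only reads it:

    -- In hive: create and populate a table in a format Impala reads but cannot write
    CREATE TABLE events_rc (event STRING) STORED AS RCFILE;
    INSERT OVERWRITE TABLE events_rc SELECT event FROM events_text;

    -- In impala-shell: pick up the new table, then query it
    INVALIDATE METADATA;
    SELECT COUNT(*) FROM events_rc;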
Parquet
• Open source, columnar file format developed by Cloudera and Twitter
• Supports storing each column in a separate file
• Supports fully shredded nested data
• Native type support
• Type-specific value encodings (saves space)
• Supports indexes for fast lookups
• Allows nesting of data
• Support for multiple programming languages
(cont’d)
• Row group: a group of rows in columnar format
  ‣ Max size buffered in memory while writing
  ‣ One (or more) per split while reading
  ‣ Roughly 50 MB < row group < 1 GB
• Column chunk: the data for one column in a row group
  ‣ Column chunks can be read independently for efficient scans
• Page: the unit of access within a column chunk
  ‣ Should be big enough for efficient compression
  ‣ Minimum size to read while accessing a single record
  ‣ Roughly 8 KB < page < 1 MB
• Borrowed from Google Dremel’s ColumnIO file format
• The schema is defined in a familiar format
• Supports nested data structures
• Each cell is encoded as a triplet: repetition level, definition level, and the value
• Level values are bounded by the depth of the schema
• Stored in a compact form
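A worked micro-example of the triplet encoding under the usual Dremel/Parquet level semantics (the schema and records are illustrative):

    message m {
      repeated int32 v;   -- max definition level = 1, max repetition level = 1
    }

    -- Three records: {v: [1, 2]}, {v: []}, {v: [3]}
    -- value | repetition level | definition level
    --   1   |        0         |        1          (first value of a new record)
    --   2   |        1         |        1          (repeats within the same record)
    --  NULL |        0         |        0          (record whose v list is empty)
    --   3   |        0         |        1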
• Field types are either groups or primitive types, with a repetition of required, optional, or repeated
  ‣ i.e. exactly one, zero or one, or zero or more
• Various encoding schemes:
  ‣ Bit packing: good for small integers
  ‣ Run-length encoding: allows compression
  ‣ Dictionary encoding
• Extensible
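A rough illustration of why those encodings save space (all numbers are made up for the example):

    -- Bit packing: 8 values whose max is 7 fit in 3 bits each
    values:     [0, 5, 3, 7, 1, 2, 6, 4]
    bit-packed: 8 x 3 bits = 24 bits, vs. 8 x 32 bits unpacked

    -- Run-length encoding: long runs collapse to (count, value) pairs
    definition levels: 1 1 1 1 ... (500 times)
    RLE:               (500, 1)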
Hive
• Impala uses Snappy compression by default with Parquet tables
• Starting in Impala 1.1.1, you can use Parquet data in Hive
• ALTER TABLE table_name SET FILEFORMAT PARQUETFILE;
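A minimal sketch of the round trip (the table names are illustrative): create and populate a Parquet table from Impala, then read it from Hive:

    -- In impala-shell: make a Parquet copy of an existing table
    CREATE TABLE sales_parquet LIKE sales STORED AS PARQUETFILE;
    INSERT INTO sales_parquet SELECT * FROM sales;  -- written with Snappy by default

    -- In hive: the same table can now be queried directly
    SELECT COUNT(*) FROM sales_parquet;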
Hands On (cont’d)
• Bring up the Impala shell (impala-shell command)
• Execute the ‘invalidate metadata’ statement (or ‘refresh’ if the Impala version is < 1.1)
• Execute the ‘show tables;’ statement and you should see the tables
• Execute more queries from near the bottom of the blog page:
  http://blog.cloudera.com/blog/2013/06/quickstart-vm-now-with-real-time-big-data/
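Put together, the session might look like this (output elided):

    $ impala-shell
    [localhost:21000] > invalidate metadata;
    [localhost:21000] > show tables;
    -- the tables created through Hive should now be listed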