Learn Apache HIVE

Topics to Discuss Today What is Hive Why Hive was
developed Hive Components Hive Metastore Embedded Mode Local Mode Remote Mode Session Running Hive Hive DDL Operations Hive DML Operations Hive SQL Operations Hive Developer

What Is Hive? Hive is a data warehouse infrastructure built
on top of Hadoop. It facilitates querying large datasets residing on a distributed storage. It provides a mechanism to project structure on to the data and query the data using a SQL-like language called “HiveQL”. Why Hive was developed? Hive was developed by Facebook and was open sourced in 2008. Data stored in Hadoop is inaccessible to business users. High level languages like Pig, Cascading etc are geared towards developers. SQL is a common language that is known to many. Hive was developed to give access to data stored in Hadoop, translating SQL like queries into map reduce jobs.

Hive Components CLI JDBC/ODBC Web GUI Thrift Server Driver (Complier,
Optimizer, Executor) Metastore Hive Job Tracker Name node Data Node + Task Tracker Hadoop

Main Components of Hive Shell: It allows interactive queries Thrift
Server: This server exposes a very simple client API to execute HiveQL statements. Metastore: The metastore is the system catalog. All other components of Hive interact with the metastore Driver: It manages life cycle of HiveQL statement during compilation, optimization and execution. Query Compiler: The compiler is invoked by driver upon receiving a HiveQL statement. Execution Engine: The driver submits the individual map-reduce jobs from the DAG to the execution engine in a topological order.

Hive Metastore Hive stores all its metadata information in an
embedded Derby database. The disk storage location of this database is determined by the hive variable javax.jdo.option.ConnectionURL . By default this location is ./metastore.db As of now in default configuration, this metadata can be seen by one user at a time.

Hive MetaStore Deployment Modes . The Hive table definitions and
mapping to the data are stored in a Metastore. Metastore constitutes of The meta store service The database The metastore (service and database) can be configured in different ways Embedded Mode Local Mode Remote Mode

. Hive MetaStore Deployment Modes The Hive table definitions and
mapping to the data are stored in a Metastore. Metastore constitutes of The meta store service The database The metastore (service and database) can be configured in different ways Embedded Mode Local Mode Remote Mode

. Embedded Mode: This is the default metastore deployment mode.
In this mode the metastore uses a Derby database, and both the database and the metastore service run embedded in the main Hive Server process Hive Metastore Deployment Modes Hive Service JVM Derby Metastore Driver Embedded Metastore

. . . Hive Metastore Deployment Modes Metastore Hive Service
JVM Local Metastore In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process, and can be on a separate host Metastore Hive Service JVM Derby Driver Driver

. Hive Metastore Deployment Modes Remote Metastore: In this mode
the Hive metastore service runs in its own JVM process Database Hive Driver Hive Driver Metastore Interface Metastore Interface

. Before running HIVE, make sure of the following Hive
uses Hadoop. So you must have hadoop in the PATH environment variable OR export HADOOP_HOME=<hadoop-install-dir> You must have the following directories set up on the hdfs,/tmp and /user/hive/warehouse (hive.metastore.warehouse) directory on your hdfs Change the mode to g+w for this directory hadoop fs -mkdir /tmp hadoop fs -mkdir /user/hive/warehouse hadoop fs -chmod g+w /tmp hadoop fs -chmod g+w /user/hive/warehouse Running Hive

Run hive by typing the command “hive”. Running Hive (contd.)

HIVE DDL Operations HiveQL DDL(data definition language) statements are as
below CREATE DATABASE/SCHEMA, TABLE, VIEW DESCRIBE DATABASE, TABLE, VIEW DROP DATABASE/SCHEMA, TABLE, VIEW TRUNCATE TABLE ALTER TABLE SHOW DATABASES/SCHEMAS, TABLES,

HIVE DML Operations HiveQL DML(data manipulation language) statements are as
below LOAD FILES INTO TABLES INSERTING DATA INTO HIVE TABLE FROM QUERIES

The syntax for creating a database is as below CREATE
(DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT database_comment] [LOCATION hdfs_path]; Ex: CREATE DATABASE hivedemo COMMENT ‘This database is created for hive demo’ LOCATION ‘/user/hive/warehouse/hivedemo/’; The uses of SCHEMA and DATABASE are interchangeable – they mean the same thing. HIVE DDL – Create Database/Schema

SHOW DATABASES lists all the database that are present in
the hive metastore You can also use the ‘LIKE’ statement to filter the list of databases The syntax for creating a database is as below SHOW (DATABASES|SCHEMAS) [LIKE identifier_with_wildcards]; Ex: show databases; show databases like ‘h*’; HIVE DDL–Show Database/ Schema

The syntax for dropping a database is as below DROP
(DATABASE|SCHEMA) [IF EXISTS] database_name; Ex: drop database hivedemo; HIVE DDL – Drop Database

CREATE TABLE creates a table with the given name. An
error is thrown if a table or view with the same name already exists. You can use IF NOT EXISTS to skip the error. The syntax for creating a table is as below CREATE TABLE table_name [COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path] Ex: CREATE TABLE hivedemo.employees (EMPLOYEE_ID int,FIRST_NAME String, LAST_NAME String, EMAIL String, PHONE_NUMBER String, HIRE_DATE String, JOB_ID String, SALARY String, COMMISSION_PCT int, MANAGER_ID String, DEPARTMENT_ID String) COMMENT 'This is the Employees table' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n’ STORED AS TEXTFILE LOCATION '/user/hive/warehouse/hivedemo/employees‘; HIVE DDL–Create Tables

HIVE DDL–Create Tables (contd.) When this table is dropped, all
the data residing in the table is also deleted from the hdfs Use the EXTERNAL command to create a table and specify its location on the HDFS. When dropping an EXTERNAL table, data is not deleted from the file system(hdfs)

SHOW TABLES shows all the tables that are created under
a particular database The syntax for showing a list of tables is as below SHOW TABLES; Ex: show tables; HIVE DDL – Show Tables

TRUNCATE TABLE will remove all the rows of data from
a table The syntax for dropping a table is as below TRUNCATE TABLE table_name; Ex: truncate table employees; HIVE DDL–Truncate Table

DROP TABLE will remove the metadata and data for the
table When dropping an EXTERNAL table, data in the table will NOT be deleted from the file system When dropping a table referenced by views, no warning is given (the views are left dangling as invalid and must be dropped or recreated by the user). The syntax for dropping a table is as below DROP TABLE [IF EXISTS] table_name; Ex: drop table hivedemo.employees; HIVE DDL – Drop Table

ALTER TABLE will let you alter the table characteristics Rename
Table – It renames the table and also changes the location on hdfs, The syntax for altering a table is as below ALTER TABLE table_name RENAME TO new_table_name; Ex: alter table employees RENAME TO employees_int; HIVE DDL – Alter Table

Hive does not do any transformation while loading data into
tables. Load operations are copy/move operations that move datafiles into locations corresponding to Hive tables. The syntax for loading a file into table is as below LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] Ex: LOAD DATA LOCAL INPATH '/home/hduser/Namita/hivedemo/employees.txt' INTO TABLE hivedemo.employees; LOCAL -> indicates the local file system. If not mentioned, hive will look in HDFS. OVERWRITE -> if mentioned, then overwrites the data, else keeps appendind HIVE DML–Load Files into Tables

Hive allows the query data to be inserted into another
table.not do any transformation while loading data into tables. The syntax for inserting query data into table is as below INSERT INTO TABLE tablename1 Select_statement1 FROM from_statement;I Ex: LOAD DATA LOCAL INPATH '/home/hduser/Namita/hivedemo/employees.txt' INTO TABLE hivedemo.employees; HIVE DML–Inserting Data into Hive table from queries

The data returned by the SELECT queries is not stored
(unless specified by INSERT INTO), but are displayed on the console. The syntax for select query is as below SELECT * FROM table_name; SELECT COL1, COL2 ,.…COLn FROM table_name; Ex: select * from department_ext; HIVE SQL–SELECT

Hive allows data to be filtered by the WHERE conditiom
The syntax for select query is as below SELECT * FROM table_name WHERE <condition> Ex: select * from department_ext where department_id is null; HIVE SQL – FILTER

The syntax for join query is as below SELECT *
FROM table_name _1 t1 JOIN table_name2 t2 ON t1.COL1 = t2. COL1; Ex: select e.employee_id, e.first_name, e.last_name, d.department_name from employees_ext e join department_ext d ON e.department_id = d.department_id; HIVE SQL – JOIN

The syntax for group by query is as below SELECT
COL1 FROM table_name GROUP BY COL 1 Make sure that the column by which group by is taking place is also a part of the SELECT query. Ex: select department_id, count(*) from employees group by department_id; HIVE SQL – GROUP BY

Please go to the following sites for more information on
Hive https://cwiki.apache.org/confluence/display/Hive/GettingStarted https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML HIVE – More Information

What Is Hive Developer? HiveDeveloper is a Hive query tool
developed by StratApps . It gives users an interface to visualize large datasets stored on Hadoop HDFS. Features of HiveDeveloper Connects to multiple databases UI like SQL developer Query data can be exported to excel All hive commands that run on console can be run via HiveDeveloper For more information and to download the tool visit http://stratapps.net/hive-developer-tool.php

Learn Apache HIVE

Learn Apache HIVE

StratApps

More Decks by StratApps

Other Decks in Education

Featured

Transcript

Learn Apache HIVE

Topics to Discuss Today What is Hive Why Hive was

What Is Hive? Hive is a data warehouse infrastructure built

Hive Components CLI JDBC/ODBC Web GUI Thrift Server Driver (Complier,

Main Components of Hive Shell: It allows interactive queries Thrift

Hive Metastore Hive stores all its metadata information in an

Hive MetaStore Deployment Modes . The Hive table definitions and

. Hive MetaStore Deployment Modes The Hive table definitions and

. Embedded Mode: This is the default metastore deployment mode.

. . . Hive Metastore Deployment Modes Metastore Hive Service

. Hive Metastore Deployment Modes Remote Metastore: In this mode

. Before running HIVE, make sure of the following Hive

Run hive by typing the command “hive”. Running Hive (contd.)

HIVE DDL Operations HiveQL DDL(data definition language) statements are as

HIVE DML Operations HiveQL DML(data manipulation language) statements are as

The syntax for creating a database is as below CREATE

SHOW DATABASES lists all the database that are present in

The syntax for dropping a database is as below DROP

CREATE TABLE creates a table with the given name. An

HIVE DDL–Create Tables (contd.) When this table is dropped, all

SHOW TABLES shows all the tables that are created under

TRUNCATE TABLE will remove all the rows of data from

DROP TABLE will remove the metadata and data for the

ALTER TABLE will let you alter the table characteristics Rename

Hive does not do any transformation while loading data into

Hive allows the query data to be inserted into another

The data returned by the SELECT queries is not stored

Hive allows data to be filtered by the WHERE conditiom

The syntax for join query is as below SELECT *

The syntax for group by query is as below SELECT

Please go to the following sites for more information on

What Is Hive Developer? HiveDeveloper is a Hive query tool