Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Learn Apache HIVE

StratApps
February 14, 2014

Learn Apache HIVE

What is Hive
Why Hive was developed
Hive Components
Hive Metastore
Embedded Mode
Local Mode
Remote Mode

StratApps

February 14, 2014
Tweet

More Decks by StratApps

Other Decks in Education

Transcript

  1. Topics to Discuss Today What is Hive Why Hive was

    developed Hive Components Hive Metastore Embedded Mode Local Mode Remote Mode Session Running Hive Hive DDL Operations Hive DML Operations Hive SQL Operations Hive Developer
  2. What Is Hive? Hive is a data warehouse infrastructure built

    on top of Hadoop. It facilitates querying large datasets residing on a distributed storage. It provides a mechanism to project structure on to the data and query the data using a SQL-like language called “HiveQL”. Why Hive was developed? Hive was developed by Facebook and was open sourced in 2008. Data stored in Hadoop is inaccessible to business users. High level languages like Pig, Cascading etc are geared towards developers. SQL is a common language that is known to many. Hive was developed to give access to data stored in Hadoop, translating SQL like queries into map reduce jobs.
  3. Hive Components CLI JDBC/ODBC Web GUI Thrift Server Driver (Complier,

    Optimizer, Executor) Metastore Hive Job Tracker Name node Data Node + Task Tracker Hadoop
  4. Main Components of Hive Shell: It allows interactive queries Thrift

    Server: This server exposes a very simple client API to execute HiveQL statements. Metastore: The metastore is the system catalog. All other components of Hive interact with the metastore Driver: It manages life cycle of HiveQL statement during compilation, optimization and execution. Query Compiler: The compiler is invoked by driver upon receiving a HiveQL statement. Execution Engine: The driver submits the individual map-reduce jobs from the DAG to the execution engine in a topological order.
  5. Hive Metastore Hive stores all its metadata information in an

    embedded Derby database. The disk storage location of this database is determined by the hive variable javax.jdo.option.ConnectionURL . By default this location is ./metastore.db As of now in default configuration, this metadata can be seen by one user at a time.
  6. Hive MetaStore Deployment Modes . The Hive table definitions and

    mapping to the data are stored in a Metastore. Metastore constitutes of The meta store service The database The metastore (service and database) can be configured in different ways Embedded Mode Local Mode Remote Mode
  7. . Hive MetaStore Deployment Modes The Hive table definitions and

    mapping to the data are stored in a Metastore. Metastore constitutes of The meta store service The database The metastore (service and database) can be configured in different ways Embedded Mode Local Mode Remote Mode
  8. . Embedded Mode: This is the default metastore deployment mode.

    In this mode the metastore uses a Derby database, and both the database and the metastore service run embedded in the main Hive Server process Hive Metastore Deployment Modes Hive Service JVM Derby Metastore Driver Embedded Metastore
  9. . . . Hive Metastore Deployment Modes Metastore Hive Service

    JVM Local Metastore In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process, and can be on a separate host Metastore Hive Service JVM Derby Driver Driver
  10. . Hive Metastore Deployment Modes Remote Metastore: In this mode

    the Hive metastore service runs in its own JVM process Database Hive Driver Hive Driver Metastore Interface Metastore Interface
  11. . Before running HIVE, make sure of the following Hive

    uses Hadoop. So you must have hadoop in the PATH environment variable OR export HADOOP_HOME=<hadoop-install-dir> You must have the following directories set up on the hdfs,/tmp and /user/hive/warehouse (hive.metastore.warehouse) directory on your hdfs Change the mode to g+w for this directory hadoop fs -mkdir /tmp hadoop fs -mkdir /user/hive/warehouse hadoop fs -chmod g+w /tmp hadoop fs -chmod g+w /user/hive/warehouse Running Hive
  12. HIVE DDL Operations HiveQL DDL(data definition language) statements are as

    below CREATE DATABASE/SCHEMA, TABLE, VIEW DESCRIBE DATABASE, TABLE, VIEW DROP DATABASE/SCHEMA, TABLE, VIEW TRUNCATE TABLE ALTER TABLE SHOW DATABASES/SCHEMAS, TABLES,
  13. HIVE DML Operations HiveQL DML(data manipulation language) statements are as

    below LOAD FILES INTO TABLES INSERTING DATA INTO HIVE TABLE FROM QUERIES
  14. The syntax for creating a database is as below CREATE

    (DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT database_comment] [LOCATION hdfs_path]; Ex: CREATE DATABASE hivedemo COMMENT ‘This database is created for hive demo’ LOCATION ‘/user/hive/warehouse/hivedemo/’; The uses of SCHEMA and DATABASE are interchangeable – they mean the same thing. HIVE DDL – Create Database/Schema
  15. SHOW DATABASES lists all the database that are present in

    the hive metastore You can also use the ‘LIKE’ statement to filter the list of databases The syntax for creating a database is as below SHOW (DATABASES|SCHEMAS) [LIKE identifier_with_wildcards]; Ex: show databases; show databases like ‘h*’; HIVE DDL–Show Database/ Schema
  16. The syntax for dropping a database is as below DROP

    (DATABASE|SCHEMA) [IF EXISTS] database_name; Ex: drop database hivedemo; HIVE DDL – Drop Database
  17. CREATE TABLE creates a table with the given name. An

    error is thrown if a table or view with the same name already exists. You can use IF NOT EXISTS to skip the error. The syntax for creating a table is as below CREATE TABLE table_name [COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path] Ex: CREATE TABLE hivedemo.employees (EMPLOYEE_ID int,FIRST_NAME String, LAST_NAME String, EMAIL String, PHONE_NUMBER String, HIRE_DATE String, JOB_ID String, SALARY String, COMMISSION_PCT int, MANAGER_ID String, DEPARTMENT_ID String) COMMENT 'This is the Employees table' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n’ STORED AS TEXTFILE LOCATION '/user/hive/warehouse/hivedemo/employees‘; HIVE DDL–Create Tables
  18. HIVE DDL–Create Tables (contd.) When this table is dropped, all

    the data residing in the table is also deleted from the hdfs Use the EXTERNAL command to create a table and specify its location on the HDFS. When dropping an EXTERNAL table, data is not deleted from the file system(hdfs)
  19. SHOW TABLES shows all the tables that are created under

    a particular database The syntax for showing a list of tables is as below SHOW TABLES; Ex: show tables; HIVE DDL – Show Tables
  20. TRUNCATE TABLE will remove all the rows of data from

    a table The syntax for dropping a table is as below TRUNCATE TABLE table_name; Ex: truncate table employees; HIVE DDL–Truncate Table
  21. DROP TABLE will remove the metadata and data for the

    table When dropping an EXTERNAL table, data in the table will NOT be deleted from the file system When dropping a table referenced by views, no warning is given (the views are left dangling as invalid and must be dropped or recreated by the user). The syntax for dropping a table is as below DROP TABLE [IF EXISTS] table_name; Ex: drop table hivedemo.employees; HIVE DDL – Drop Table
  22. ALTER TABLE will let you alter the table characteristics Rename

    Table – It renames the table and also changes the location on hdfs, The syntax for altering a table is as below ALTER TABLE table_name RENAME TO new_table_name; Ex: alter table employees RENAME TO employees_int; HIVE DDL – Alter Table
  23. Hive does not do any transformation while loading data into

    tables. Load operations are copy/move operations that move datafiles into locations corresponding to Hive tables. The syntax for loading a file into table is as below LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] Ex: LOAD DATA LOCAL INPATH '/home/hduser/Namita/hivedemo/employees.txt' INTO TABLE hivedemo.employees; LOCAL -> indicates the local file system. If not mentioned, hive will look in HDFS. OVERWRITE -> if mentioned, then overwrites the data, else keeps appendind HIVE DML–Load Files into Tables
  24. Hive allows the query data to be inserted into another

    table.not do any transformation while loading data into tables. The syntax for inserting query data into table is as below INSERT INTO TABLE tablename1 Select_statement1 FROM from_statement;I Ex: LOAD DATA LOCAL INPATH '/home/hduser/Namita/hivedemo/employees.txt' INTO TABLE hivedemo.employees; HIVE DML–Inserting Data into Hive table from queries
  25. The data returned by the SELECT queries is not stored

    (unless specified by INSERT INTO), but are displayed on the console. The syntax for select query is as below SELECT * FROM table_name; SELECT COL1, COL2 ,.…COLn FROM table_name; Ex: select * from department_ext; HIVE SQL–SELECT
  26. Hive allows data to be filtered by the WHERE conditiom

    The syntax for select query is as below SELECT * FROM table_name WHERE <condition> Ex: select * from department_ext where department_id is null; HIVE SQL – FILTER
  27. The syntax for join query is as below SELECT *

    FROM table_name _1 t1 JOIN table_name2 t2 ON t1.COL1 = t2. COL1; Ex: select e.employee_id, e.first_name, e.last_name, d.department_name from employees_ext e join department_ext d ON e.department_id = d.department_id; HIVE SQL – JOIN
  28. The syntax for group by query is as below SELECT

    COL1 FROM table_name GROUP BY COL 1 Make sure that the column by which group by is taking place is also a part of the SELECT query. Ex: select department_id, count(*) from employees group by department_id; HIVE SQL – GROUP BY
  29. Please go to the following sites for more information on

    Hive https://cwiki.apache.org/confluence/display/Hive/GettingStarted https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML HIVE – More Information
  30. What Is Hive Developer? HiveDeveloper is a Hive query tool

    developed by StratApps . It gives users an interface to visualize large datasets stored on Hadoop HDFS. Features of HiveDeveloper Connects to multiple databases UI like SQL developer Query data can be exported to excel All hive commands that run on console can be run via HiveDeveloper For more information and to download the tool visit http://stratapps.net/hive-developer-tool.php