An introduction to Apache HCatalog

Apache HCatalog
• What is it?
• How does it work?
• Interfaces
• Architecture
• Example
www.semtech-solutions.co.nz [email protected]

HCatalog – What is it?
• A Hive metastore interface set
• Shared schema and data types for Hadoop tools
• REST interface for external data access
• Assists interoperability between
  – Pig, Hive and MapReduce
• Table abstraction of data storage
• Will provide data availability notifications
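The REST interface bullet can be made concrete with a small sketch. It assumes the REST server is WebHCat (formerly Templeton) listening on its default port 50111, and it only builds the URL that describes one table; it makes no network call. The host name, user name, and the helper name webhcat_table_url are illustrative assumptions, not part of the original deck.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build the WebHCat URL that describes one table.

    Hypothetical helper: host, user and the default port are assumptions
    about the cluster, not values taken from the slides.
    """
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    # WebHCat expects the caller's identity in the user.name query parameter.
    return "http://{}:{}{}?{}".format(host, port, path, urlencode({"user.name": user}))


if __name__ == "__main__":
    # An HTTP GET on this URL would return the table description as JSON.
    print(webhcat_table_url("localhost", "default", "rawevents", "sally"))
```

An external tool with no Hadoop client libraries could fetch this URL with any HTTP client to read the table's schema.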

HCatalog – How does it work?
• Pig – HCatLoader + HCatStorer interface
• MapReduce – HCatInputFormat + HCatOutputFormat interface
• Hive – no interface necessary; direct access to metadata
• Notifications when data is available

HCatalog – Interfaces
• Interface via
  – Pig
  – MapReduce
  – Hive
  – Streaming
• Access data via
  – ORC file
  – RC file
  – Text file
  – Sequence file
  – Custom format

HCatalog – Interfaces

HCatalog – Architecture

HCatalog – Example
A data flow example from hive.apache.org

First, Joe in data acquisition uses distcp to get data onto the grid.

    hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
    hcat -e "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"

Second, Sally in data processing uses Pig to cleanse and prepare the data. Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.

    A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
    B = filter A by bot_finder(zeta) == 0;
    …
    store Z into 'data/processedevents/20100819/data';

With HCatalog, a JMS message is sent when the data is available. The Pig job can then be started.

    A = load 'rawevents' using HCatLoader();
    B = filter A by date == '20100819' and bot_finder(zeta) == 0;
    …
    store Z into 'processedevents' using HCatStorer("date=20100819");

Note that the Pig job refers to the data by name (rawevents) rather than by location.

Now access the data via Hive QL.

    select advertiser_id, count(clicks) from processedevents where date = '20100819' group by advertiser_id;
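The difference between polling HDFS and being notified can be sketched with a toy model. This is not the HCatalog API and it does not use JMS: ToyCatalog, subscribe, add_partition and start_cleansing_job are hypothetical names invented for illustration. It shows only the two ideas from the example: a job is triggered when a partition is registered, and the job looks data up by table name rather than by path.

```python
from collections import defaultdict


class ToyCatalog:
    """In-memory stand-in for a metastore (illustrative only)."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # table -> {partition value: location}
        self._listeners = defaultdict(list)   # table -> callbacks to notify

    def subscribe(self, table, callback):
        """Register a callback to run whenever a partition is added to table."""
        self._listeners[table].append(callback)

    def add_partition(self, table, value, location):
        """Record where a partition lives, then notify subscribers."""
        self._partitions[table][value] = location
        for callback in self._listeners[table]:
            callback(table, value, location)

    def location(self, table, value):
        """Resolve a table name and partition value to a storage location."""
        return self._partitions[table][value]


started = []


def start_cleansing_job(table, value, location):
    # A real consumer would launch the Pig job here; the sketch records the trigger.
    started.append((table, value, location))


catalog = ToyCatalog()
catalog.subscribe("rawevents", start_cleansing_job)                 # Sally's side
catalog.add_partition("rawevents", "20100819",
                      "hdfs://data/rawevents/20100819/data")        # Joe's side
print(started)
```

Sally's side never mentions a path, so Joe can change the storage layout without breaking her job.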

Contact Us
• Feel free to contact us at
  – www.semtech-solutions.co.nz
  – [email protected]
• We offer IT project consultancy
• We are happy to hear about your problems
• You pay only for the hours you need to solve your problems

Mike Frampton
