An introduction to Apache HCatalog

An introduction to Apache HCatalog: what is it, why is it useful,
and how can it help Pig, Hive and MapReduce users on Hadoop share data?

Mike Frampton

August 17, 2013

Transcript

  1. Apache HCatalog
    • What is it?
    • How does it work?
    • Interfaces
    • Architecture
    • Example
    www.semtech-solutions.co.nz [email protected]
  2. HCatalog – What is it?
    • A Hive metastore interface set
    • Shared schema and data types for Hadoop tools
    • REST interface for external data access
    • Assists interoperability between Pig, Hive and MapReduce
    • Table abstraction of data storage
    • Will provide data availability notifications
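As a rough illustration of the REST interface bullet above, the sketch below builds the kind of URL an external client might call. It assumes the REST interface is the WebHCat (formerly Templeton) server with its usual port 50111 and `/templeton/v1` base path; the host name, the user names and the helper `table_url` are hypothetical and do not come from the slides. It only constructs URLs and does not contact a server.

```python
from urllib.parse import quote, urlencode

# Assumption: WebHCat is listening on its usual port 50111 under /templeton/v1.
# The host name is a hypothetical placeholder.
BASE_URL = "http://hcat-host.example.com:50111/templeton/v1"


def table_url(database, table=None, user="joe"):
    """Build the DDL URL that lists a database's tables or describes one table."""
    path = f"{BASE_URL}/ddl/database/{quote(database, safe='')}/table"
    if table is not None:
        path += f"/{quote(table, safe='')}"
    # WebHCat identifies the caller through the user.name query parameter.
    return f"{path}?{urlencode({'user.name': user})}"


if __name__ == "__main__":
    print(table_url("default"))               # list tables in 'default'
    print(table_url("default", "rawevents"))  # describe one table
```

An HTTP GET on either URL would return JSON on a running server, so a client needs no Hadoop libraries to inspect the shared schema.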
  3. HCatalog – How does it work?
    • Pig – HCatLoader + HCatStorer interface
    • MapReduce – HCatInputFormat + HCatOutputFormat interface
    • Hive – No interface necessary; direct access to metadata
    • Notifications when data is available
  4. HCatalog – Interfaces
    • Interface via
      – Pig
      – MapReduce
      – Hive
      – Streaming
    • Access data via
      – ORC file
      – RC file
      – Text file
      – Sequence file
      – Custom format
  5. HCatalog – Example
    A data flow example from hive.apache.org

    First, Joe in data acquisition uses distcp to get data onto the grid.

        hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
        hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"

    Second, Sally in data processing uses Pig to cleanse and prepare the data. Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.

        A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
        B = filter A by bot_finder(zeta) == 0;
        …
        store Z into 'data/processedevents/20100819/data';

    With HCatalog, a JMS message is sent when the data is available. The Pig job can then be started.

        A = load 'rawevents' using HCatLoader();
        B = filter A by date == '20100819' and bot_finder(zeta) == 0;
        …
        store Z into 'processedevents' using HCatStorer("date=20100819");

    Note that the Pig job refers to the data by name (rawevents) rather than by location.

    Now access the data via Hive QL:

        select advertiser_id, count(clicks) from processedevents where date = '20100819' group by advertiser_id;
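The two ideas in this example, looking data up by table name and being told when a partition arrives, can be pictured with a toy model. The sketch below is a hypothetical in-memory stand-in written for illustration; `ToyCatalog` and its methods are not part of HCatalog, and a plain callback stands in for the JMS message.

```python
class ToyCatalog:
    """Hypothetical in-memory stand-in for a metastore; not the HCatalog API."""

    def __init__(self):
        self._partitions = {}  # table name -> {partition value: location}
        self._listeners = []   # callbacks fired when a partition is added

    def subscribe(self, callback):
        """Register a callback(table, ds); plays the role of the JMS message."""
        self._listeners.append(callback)

    def add_partition(self, table, ds, location):
        """Record where a partition lives, then notify every subscriber."""
        self._partitions.setdefault(table, {})[ds] = location
        for callback in self._listeners:
            callback(table, ds)

    def location(self, table, ds):
        """Resolve a table name and partition value to a storage location."""
        return self._partitions[table][ds]


if __name__ == "__main__":
    catalog = ToyCatalog()
    arrivals = []
    # Sally's side: react to new data instead of polling HDFS.
    catalog.subscribe(lambda table, ds: arrivals.append((table, ds)))
    # Joe's side: register the partition after copying the data.
    catalog.add_partition(
        "rawevents", "20100819", "hdfs://data/rawevents/20100819/data"
    )
    print(arrivals)
    print(catalog.location("rawevents", "20100819"))
```

In this model a consumer only ever names `rawevents` and a partition value, so the storage path can change without touching the consumer's code.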
  6. Contact Us
    • Feel free to contact us at
      – www.semtech-solutions.co.nz
      – [email protected]
    • We offer IT project consultancy
    • We are happy to hear about your problems
    • You can just pay for those hours that you need to solve your problems