HCatalog & Templeton

dgkim84
July 18, 2012

Transcript

  1. HCatalog & Templeton
    Youngwoo Kim ([email protected], kt.com), Daegeun Kim ([email protected])
    Data Analysis Platform, KTCloudware (NexR)
    Wednesday, July 18, 12
  2. Hadoop Ecosystem (Many data processing tools)
    [Diagram: MapReduce (InputFormat / OutputFormat / ...), Pig (LoadFunc / StoreFunc), and Hive (SerDe, Metastore on an RDBMS), all on top of the filesystem.]
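  3. Problems
    • There is no metastore outside of Hive
    • When several tools are used on one cluster, integrating them is not easy.
    • Communication cost is incurred every time
      ◦ Where? How? What?
    • M/R and Pig users have a lot of information to remember
    • Changes to the schema, data path, or format heavily affect M/R and Pig jobs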
  4. HCatalog
    • Apache Incubator
    • Based on the Hive metastore
    • Provides a programming interface for M/R and Pig users to read and write data
    • Provides every DDL command that does not require a MapReduce job (CLI commands)
      ◦ Excludes import/export, CREATE TABLE AS SELECT, etc.
    • Provides data exploration
      ◦ SHOW TABLES, DESCRIBE
      ◦ http://incubator.apache.org/hcatalog/docs/r0.4.0/cli.html
    • Developed by Hortonworks, Yahoo, Twitter, and others
  5. Table abstraction
    • Metadata
      ◦ Data location, schema, compression, partitions, format, etc.
    • Data is abstracted through HCatalog
    • Metadata is managed in one place, which makes that role all the more important
    • Supported column types: primitives, map, list, struct
  6. HCatalog
    [Diagram: MapReduce (HCatInputFormat / HCatOutputFormat), Pig (HCatLoader / HCatStorer), and Hive on top of HCatalog, which uses InputFormat / OutputFormat, SerDe, and the Metastore (RDBMS) over the filesystem.]
  7. Data types : Pig
    HCatalog = Hive                                            Pig
    primitives (int, long, float, double, string)              int, long, float, double, chararray
    map (contains key and value pairs)                         map
    list (contains a list of elements of the same data type)   bag
    struct (contains elements of different data types)         tuple
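The mapping in the table above can be written down as a small lookup. This is only a sketch in plain Java: the class `HCatToPigTypes` and its method `toPig` are hypothetical names introduced for illustration, not part of HCatalog or Pig, and the table entries are exactly the ones listed on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative lookup from HCatalog (Hive) type names to the Pig types listed on the slide. */
final class HCatToPigTypes {
    private static final Map<String, String> PIG_TYPE = new LinkedHashMap<>();

    static {
        // Primitives keep their names, except string, which Pig calls chararray.
        PIG_TYPE.put("int", "int");
        PIG_TYPE.put("long", "long");
        PIG_TYPE.put("float", "float");
        PIG_TYPE.put("double", "double");
        PIG_TYPE.put("string", "chararray");
        // Complex types.
        PIG_TYPE.put("map", "map");
        PIG_TYPE.put("list", "bag");
        PIG_TYPE.put("struct", "tuple");
    }

    /** Returns the Pig type name, or throws if the slide's table has no entry for the input. */
    static String toPig(String hcatType) {
        String pig = PIG_TYPE.get(hcatType.toLowerCase(Locale.ROOT));
        if (pig == null) {
            throw new IllegalArgumentException("No mapping listed for: " + hcatType);
        }
        return pig;
    }
}
```

For example, `HCatToPigTypes.toPig("list")` yields the Pig type name `bag`.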
  8. DDL
        $HCAT_HOME/bin/hcat -e "
          drop table if exists rawevents;
          create external table rawevents (
            url string,
            user string
          )
          partitioned by (ds string)
        "
        $HIVE_HOME/bin/hive -e "
          LOAD DATA LOCAL INPATH '...' OVERWRITE
          INTO TABLE rawevents PARTITION (ds='20120530')
        "
  9. Pig
        raw = LOAD '/data/rawevents/20120530' AS (url, user);
        botless = FILTER raw BY myudfs.NotABot(user);
        grpd = GROUP botless BY (url, user);
        -- after GROUP, the (url, user) key is exposed as the field "group"
        cntd = FOREACH grpd GENERATE FLATTEN(group), COUNT(botless);
        STORE cntd INTO '/data/counted/20120530';
    Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012 : Page. 8
  10. Pig + HCatalog
    Load
        Pig:            raw = LOAD '/data/rawevents/20120530' AS (url, user);
        Pig + HCatalog: raw = LOAD 'rawevents' using org.apache.hcatalog.pig.HCatLoader();
    Store
        Pig:            STORE cntd INTO '/data/counted/20120530';
        Pig + HCatalog: STORE cntd INTO 'counted' using org.apache.hcatalog.pig.HCatStorer();
    Partition filter
        Pig:            LOAD '/data/rawevents/20120530'
        Pig + HCatalog: raw_0530 = FILTER raw BY ds == '20120530';
    Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012 : Page. 8
  11. MapReduce
    • Use the HCatInputFormat and HCatOutputFormat classes
    • The value class is HCatRecord by default
    • The key is not used
    • Set OutputValueClass to HCatRecord
    • As always, a Reducer is optional, not required
    • Partitions can be controlled
    • Easy to control through the schema
  12. MapReduce - Job
        Job job = new Job(getConf());
        job.setJarByClass(HCatMRTest.class);
        job.setJobName("HCatMRTest");
        // The key is unused; values are HCatRecord instances.
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(HCatRecord.class);
        job.setMapperClass(HCatMRTest.Map.class);
        job.setInputFormatClass(HCatInputFormat.class);
        job.setOutputFormatClass(HCatOutputFormat.class);
        // Map-only job: no reducer.
        job.setNumReduceTasks(0);
  13. MapReduce - DB, TBL, Partition
        java.util.Map<String, String> partition = ...
        partition.put("ds", "20120530");
        // Input: database, table, and a partition filter string.
        InputJobInfo in = InputJobInfo.create("DB", "rawevents", "ds='20120530'");
        // Output: database, table, and the partition values to write.
        OutputJobInfo out = OutputJobInfo.create("DB", "counted", partition);
        HCatInputFormat.setInput(job, in);
        HCatOutputFormat.setOutput(job, out);
        // Reuse the output table's own schema for the records being written.
        HCatSchema s = HCatOutputFormat.getTableSchema(job);
        HCatOutputFormat.setSchema(job, s);
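On this slide the input side takes a filter string (`ds='20120530'`) while the output side takes a map of partition values. A minimal sketch of keeping the two in sync follows. `PartitionFilters.toFilter` is a hypothetical helper, not an HCatalog API, and joining several keys with ` and ` is an assumption about the filter syntax that should be checked against the HCatalog version in use.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: renders partition values as a filter string like ds='20120530'. */
final class PartitionFilters {
    static String toFilter(Map<String, String> partition) {
        // Assumption: multiple partition keys are combined with " and ".
        StringJoiner filter = new StringJoiner(" and ");
        for (Map.Entry<String, String> entry : partition.entrySet()) {
            String value = entry.getValue();
            if (value.indexOf('\'') >= 0) {
                // Quoting rules are out of scope for this sketch, so reject such values.
                throw new IllegalArgumentException("Quote in partition value: " + value);
            }
            filter.add(entry.getKey() + "='" + value + "'");
        }
        return filter.toString();
    }
}
```

With a map holding only `ds -> 20120530`, the helper produces the same filter string that the slide passes to `InputJobInfo.create`.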
  14. MapReduce - HCatRecord
    • The class that represents a single record
    • boolean, byte, short, integer, long, float, double, string, list, struct, map
      ◦ tinyint : HCatRecord.getByte
      ◦ smallint : HCatRecord.getShort
    • Fields can be accessed by index or by column name
      ◦ Access by column name requires the HCatSchema
    • Reserve space in the record for the partition columns
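Why name-based access needs a schema can be shown without HCatalog on the classpath: a record stores values by position only, so a column name has to be resolved to a position first. The classes `MiniSchema` and `MiniRecord` below are hypothetical stand-ins written for this sketch, not HCatalog classes; only the argument order of `get(name, schema)` and `set(name, schema, value)` is modeled on the calls shown in this deck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a schema: an ordered list of column names. */
final class MiniSchema {
    private final List<String> columns;

    MiniSchema(String... columns) {
        this.columns = Arrays.asList(columns);
    }

    int positionOf(String column) {
        int index = columns.indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        return index;
    }

    int size() {
        return columns.size();
    }
}

/** Hypothetical stand-in for a record: values are stored by position only. */
final class MiniRecord {
    private final List<Object> values;

    MiniRecord(int size) {
        // Pre-size the record so every column, including partition columns, has a slot.
        this.values = new ArrayList<>(Collections.<Object>nCopies(size, null));
    }

    Object get(int index) {
        return values.get(index);
    }

    void set(int index, Object value) {
        values.set(index, value);
    }

    // Name-based access resolves the name to a position through the schema.
    Object get(String column, MiniSchema schema) {
        return get(schema.positionOf(column));
    }

    void set(String column, MiniSchema schema, Object value) {
        set(schema.positionOf(column), value);
    }
}
```

A record created with `new MiniRecord(schema.size())` for a schema of `url`, `user`, `ds` has three slots, so the partition column `ds` has a place even before it is set.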
  15. MapReduce - HCatRecord
    How to obtain the table schema
        HCatSchema in = HCatInputFormat.getTableSchema(context);
        HCatSchema out = HCatOutputFormat.getTableSchema(context);
        // HCatRecord is abstract; DefaultHCatRecord is the concrete implementation.
        HCatRecord record = new DefaultHCatRecord(3);
        record.set("url", out, value.get("url", in));
        context.write(null, record);
    The schema information is recorded (encoded) in job.xml
      * mapreduce.lib.hcat.job.info
      * mapreduce.lib.hcatoutput.info
  16. Conclusions
    • Metadata management becomes easier even if you use only Pig and MR
    • The benefit is greatest when several tools are used together
    • Contributions are arriving quickly, so more can be expected going forward
  17. The Templeton project is named after a character in the award-winning children's novel Charlotte's Web, by E. B. White. The novel's protagonist is a pig named Wilbur. Templeton is a rat who helps Wilbur by running errands and making deliveries.
  18. Templeton
    • HCatalog integration
      ◦ Thrift
      ◦ Java API (HCATALOG-419)
      ◦ REST API
    • Web services interface for HCatalog access and Pig, Hive and MR job execution
    • http://github.com/hortonworks/templeton
    • HCATALOG-182
    • a.k.a. 'webhcat'
  19. Getting started
    • Install
      ◦ Requirements
        ▪ Hadoop 0.20.205 or Hadoop 1.x
        ▪ ZooKeeper
        ▪ HCatalog
        ▪ Hadoop Distributed Cache
        ▪ To use the Hive, Pig, or hadoop/streaming resources
    • Configuration
      ◦ templeton-site.xml
    • Security
      ◦ Default security (without additional authentication)
      ◦ Authentication via Kerberos
  20. Templeton Resources
    Resource    Description
    :version    Returns a list of supported response types.
    status      Returns the Templeton server status.
    version     Returns a list of supported versions and the current version.
  21. Templeton Resources (2)
    Resource                                                        Description
    ddl                                                             Performs an HCatalog DDL command.
    ddl/database                                                    List HCatalog databases.
    ddl/database/:db (GET)                                          Describe an HCatalog database.
    ddl/database/:db (PUT)                                          Create an HCatalog database.
    ddl/database/:db (DELETE)                                       Delete (drop) an HCatalog database.
    ddl/database/:db/table                                          List the tables in an HCatalog database.
    ddl/database/:db/table/:table (GET)                             Describe an HCatalog table.
    ddl/database/:db/table/:table (POST)                            Rename an HCatalog table.
    ddl/database/:db/table/:table/partition                         List all partitions in an HCatalog table.
    ddl/database/:db/table/:table/partition/:partition (GET)        Describe a single partition in an HCatalog table.
    ......                                                          ......
    ddl/database/:db/table/:table/partition/:partition (PUT)
  22. Templeton Resources (3)
    Resource                  Description
    mapreduce/streaming       Creates and queues Hadoop streaming MapReduce jobs.
    mapreduce/jar             Creates and queues standard Hadoop MapReduce jobs.
    pig                       Creates and queues Pig jobs.
    hive                      Runs Hive queries and commands.
    queue                     Returns a list of all jobids registered for the user.
    queue/:jobid (GET)        Returns the status of a job given its ID.
    queue/:jobid (DELETE)     Kill a job given its ID.
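The resource names in these tables are relative paths. In this deck's curl examples they sit under a versioned base URL of the form `http://tb080:50111/templeton/v1/`, with the caller passed as `user.name`. A rough sketch of assembling such URLs is shown here. `TempletonUrls` is a hypothetical helper written for this transcript, not part of Templeton, and it only builds strings without contacting a server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that assembles Templeton resource URLs; not part of Templeton. */
final class TempletonUrls {
    private final String base;

    TempletonUrls(String host, int port) {
        // Versioned base path as it appears in the deck's curl examples.
        this.base = "http://" + host + ":" + port + "/templeton/v1/";
    }

    /** URL of a resource such as "status" or "ddl/database". */
    String resource(String path) {
        return base + path;
    }

    /** Same as resource(path), with the user.name query parameter appended. */
    String resource(String path, String userName) {
        return base + path + "?user.name=" + encode(userName);
    }

    /** URL of the table resource ddl/database/:db/table/:table; names are used as given. */
    String table(String db, String table, String userName) {
        return resource("ddl/database/" + db + "/table/" + table, userName);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Calling `new TempletonUrls("tb080", 50111).table("default", "emp", "nexr")` gives the URL used by the describe-table curl example in this deck. The HTTP method (GET, PUT, POST, DELETE) still selects the operation, as the tables indicate.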
  23. Examples
        $ curl -s 'http://tb080:50111/templeton/v1/status'
        {"status":"ok","version":"v1"}

        $ curl -s -d user.name=nexr -d 'exec=show tables;' 'http://tb080:50111/templeton/v1/ddl'
        {
          "stdout": "emp\nname\nname_a29\n",
          "stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ...... //[jar:file:\/home\/nexr\/nexr_platforms\/hadoop\/hadoop-1.0.3\/lib\/slf4j-log4j12-1.4.3.jar!\/org\/slf4j\/impl\/StaticLoggerBinder.class]\nSLF4J: See http:\/\/www.slf4j.org\/codes.html#multiple_bindings for an explanation.\nOK\nTime taken: 0.491 seconds\n",
          "exitcode": 0
        }
  24. Examples
        $ curl -s 'http://tb080:50111/templeton/v1/ddl/database/default/table/emp?user.name=nexr'
        {
          "statement": "use default; desc emp; ",
          "error": "...",
          "exec": {
            "stdout": "{\"columns\":[{\"name\":\"empno\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"deptno\",\"type\":\"int\"}]}\t \t \n",
            "stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ...... explanation.\nOK\nTime taken: 0.324 seconds\nOK\nTime taken: 0.398 seconds\n",
            "exitcode": 0
          }
        }
  25. Examples
        $ curl -s -X PUT -HContent-type:application/json -d '{
            "comment": "Test table",
            "columns": [
              { "name": "id", "type": "bigint" },
              { "name": "price", "type": "float", "comment": "The unit price" } ],
            "partitionedBy": [
              { "name": "country", "type": "string" } ],
            "format": { "storedAs": "rcfile" } }' \
          'http://tb080:50111/templeton/v1/ddl/database/default/table/test_table?user.name=nexr'

        hive> show tables;
        OK
        emp
        test_table
        Time taken: 0.477 seconds
        hive> describe extended test_table;
        OK
        id      bigint
        price   float   The unit price
        country string
        Detailed Table Information Table(tableName:test_table, dbName:default, owner:nexr, createTime:1342578059, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:bigint, comment:null), FieldSchema(name:price, type:float, comment:The unit price), FieldSchema(name:country, type:string,
  26. Future of Templeton
    • webhcat
    • Java API based on REST API
    • Integrate or replace existing web interfaces, e.g., WebHDFS
  27. References
    • Apache HCatalog (Incubating), http://incubator.apache.org/hcatalog/
    • HCatalog, http://www.slideshare.net/ydn/jan-2012-hug-hcatalog
    • Future of HCatalog, http://www.slideshare.net/hortonworks/future-of-hcatalog-hadoop-summit-2012
    • Introduction to HCatalog, http://geekdani.wordpress.com/2012/07/11/introduction-to-hcatalog/
    • HCatalog installation and Hive & Pig schema integration using HCatalog, http://mixellaneous.tistory.com/1123