Upgrade to Pro — share decks privately, control downloads, hide ads and more …

TreasureData-Hive 勉強会

Altech
February 15, 2013

TreasureData-Hive 勉強会

PIXTA社内勉強会で使った資料の修正版

Altech

February 15, 2013
Tweet

More Decks by Altech

Other Decks in Programming

Transcript

  1. ෼ࢄॲཧͷϑϨʔϜϫʔΫ ෼ࢄฒྻͳϑΝΠϧγεςϜ جૅ஌ࣝ MapReduce / Apache Hadoop Google File System

    / Hadoop Distributed File System MR ͸ GFS ্Ͱ࣮ߦ͞ΕΔ / Hadoop ͸ HDFS ্Ͱ.. ʮϑΝΠϧϕʔεͰͷॲཧʯ
  2. Hive? Apache Hive Hive is a data warehouse system for

    Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. ͜Εʂˠ http://hive.apache.org
  3. ςʔϒϧεΩʔϚͷѻ͍ ൚༻εΩʔϚ͕σϑΥϧτͰઃఆ͞Ε͍ͯΔ time : The time that each entry was

    generated, in int64 UNIX time v : Each log entry, stored as map<string, string> $ td schema:set pixta users id:int status:string ඞཁʹԠͯ͡εΩʔϚΛઃఆɾมߋՄೳ

  4. ΫΤϦ݁Ռͷॻ͖ग़͠ػೳ TreasureData ͷςʔϒϧ
 td://@/testdb/output_table MySQL ͷςʔϒϧ 
 mysql://user:password@host/database/table Google Spread

    Sheet
 gspreadsheet://user:[email protected]/spreadsheet/worksheet Web Server / FTP Server
 web://domain.com/path --result ΦϓγϣϯͰ URI Λࢦఆ
  5. ؔ਺ - TD_TIME_RANGE ൣғΛ۠੾Δؔ਺ SELECT referer, path, time FROM access

    WHERE TD_TIME_RANGE(time, "2013-01-01", "2013-01-03", "JST") boolean TD_TIME_RANGE(int/long/string unix_timestamp, int/long/string start_time, int/long/string end_time, string timezone = 'UTC')
  6. ؔ਺ - TD_LAST / TD_FIRST ͋ΔΧϥϜʹ஫໨͠ɺ࠷େ/࠷খͷϨίʔυΛऔΓग़͢ SELECT access_id, time, TD_LAST(path,

    time) AS last_url FROM access GROUP BY access_id TD_LAST(ret_col, cmp_col1, cmp_col2, ...) TD_FIRST(ret_col, cmp_col1, cmp_col2, ...)
  7. ؔ਺ - TD_X_RANK keys ͷ஋Ͱ෼ׂͨ͠Ϩίʔυ܈ʹ1, 2, 3... ͱॱং෇͚ ༧Ίιʔτͯ͠࢖͏͜ͱ͕ଟ͍ SELECT

    access_id, time, TD_X_RANK(access_id) AS rank FROM access DISTRIBUTE BY access_id SORT BY time long TD_X_RANK(keys)
  8. ׳༻۟ - ϢʔβʔID Λ JOIN ΞΫηεIDͱϢʔβʔIDΛ JOIN ͢Δ SELECT user_id,

    referer, path FROM access t1 JOIN user_access_ids t2 ON (t1.access_id=t2.access_id)
  9. SELECT COUNT(*) FROM users GROUP BY CASE WHEN TD_TIME_RANGE(at,NULL,"2009-01-01","JST") THEN

    1 WHEN TD_TIME_RANGE(at,"2009-01-01","2011-01-01","JST") THEN ELSE 3 END SELECT CASE WHEN TD_TIME_RANGE(at,NULL,"2009-01-01","JST") THEN 1 WHEN TD_TIME_RANGE(at,"2009-01-01","2011-01-01","JST") THEN ELSE 3 END, COUNT(*) FROM users GROUP BY CASE WHEN TD_TIME_RANGE(at,NULL,"2009-01-01","JST") THEN 1 WHEN TD_TIME_RANGE(at,"2009-01-01","2011-01-01","JST") THEN ELSE 3 END ձһొ࿥ͨ͠Ϣʔβʔͷ͏ͪɺ2008೥·ͰΛʮୈҰϢʔβʔ܈ʯɺ2009೥͔Β 2010೥·ͰΛʮୈೋϢʔβʔ܈ʯɺͦΕҎ߱ΛʮୈࡾϢʔβʔ܈ʯͱߟ͑Δɻ ͦΕͧΕͷਓ਺͸Ͳͷ͘Β͍͔ʁ ࢖༻ྫ - CASEࣜ
  10. (SELECT user_id FROM users WHERE TD_TIME_RANGE(confirmed_at, "2013-01-01", "2013-02-01", "JST")) new_users

    ! (SELECT * FROM access JOIN user_access_ids ON (access.access_id=user_access_ids.access_ids)) users_access ! SELECT user_id, TD_LAST(path, time) FROM new_users_access GROUP BY user_id SELECT user_id, TD_LAST(path, time) FROM (SELECT user_id FROM users WHERE TD_TIME_RANGE(confirmed_at, "2013-01-01", "2013-02-01", "JST")) new_users JOIN (SELECT * FROM access JOIN user_access_ids ON (access.access_id=user_access_ids.access_ids)) users_access ON (new_users.id=users_access.user_id) GROUP BY user_id ࢖༻ྫ - ϥϯσΟϯάϖʔδ ϢʔβʔͷΞΫηεϩά͔Βɺ৽نϢʔβʔͷϥϯσΟϯάϖʔδΛௐ΂͍ͨɻ ର৅ظؒ͸ 2013-01-01~2013-01-31 ͱ͢Δɻ