TreasureData-Hive 勉強会

6115584ec4554ed71f17effd3e0b6988?s=47 Altech
February 15, 2013

TreasureData-Hive 勉強会

PIXTA社内勉強会で使った資料の修正版

6115584ec4554ed71f17effd3e0b6988?s=128

Altech

February 15, 2013
Tweet

Transcript

  1. TreasureData-Hive ษڧձ 2013/02/15

  2. ໨త ղੳεΩϧͷ޲্ ཉ͍͠σʔλΛҾ͍ͯ͘Δ෦෼

  3. ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

  4. ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

  5. ෼ࢄॲཧͷϑϨʔϜϫʔΫ ෼ࢄฒྻͳϑΝΠϧγεςϜ جૅ஌ࣝ MapReduce / Apache Hadoop Google File System

    / Hadoop Distributed File System MR ͸ GFS ্Ͱ࣮ߦ͞ΕΔ / Hadoop ͸ HDFS ্Ͱ.. ʮϑΝΠϧϕʔεͰͷॲཧʯ
  6. Hive? Apache Hive Hive is a data warehouse system for

    Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. ͜Εʂˠ http://hive.apache.org
  7. Hive େن໛ͳσʔλΛ RDB / SQL ϥΠΫʹѻ͑Δɻ ࣅ͍ͯΔͷ͸ݴޠͱϞσϧ͚ͩ ʢ໰͍߹Θͤݴޠ / σʔλϕʔε&ςʔϒϧʣ

    ॏཁ ɾɾɾѻ͑Δʁ
  8. Pointer -> Hive Document
 https://cwiki.apache.org/Hive/home.html LanguageManual UDF
 https://cwiki.apache.org/Hive/languagemanual-udf.html


  9. TreasureData Ϋϥ΢υσʔλ΢ΣΞϋ΢ε ʮੈքதͰͻͱ͚ͭͩղੳ༻ͷσʔλϕʔε͕
 ɹ͋ΔͳΒɼͦΕ͕ Treasure Data Ͱ͋Γ͍ͨ ʯ ೖΕͨσʔλΛҾ͍ͯ͘ΔΠϯλʔϑΣʔε͕ SQL

  10. σʔλͷ֨ೲ ूΊͨϩά͸ S3 ʹϑΝΠϧͱͯ֨͠ೲ͞Ε͍ͯΔ ϩάͷλΠϜελϯϓʢtimeΧϥϜʣΛݟͯ
 Ұ࣌ؒ͝ͱʹ෼ׂͯ֨͠ೲ σʔλશମΛॲཧ͢Δ͜ͱΛආ͚ΒΕΔ࢓૊Έ

  11. Hive Query Language HiveQL ૊ΈࠐΈͷ UDF ͸͢΂ͯαϙʔτ Ճ͑ͯ TreasureData ੡ͷศརͳ

    UDF Λఏڙʢޙड़ʣ HiveQL ͷ͏ͪ SELECT จͷΈ࢖͑Δ
  12. ςʔϒϧεΩʔϚͷѻ͍ ൚༻εΩʔϚ͕σϑΥϧτͰઃఆ͞Ε͍ͯΔ time : The time that each entry was

    generated, in int64 UNIX time v : Each log entry, stored as map<string, string> $ td schema:set pixta users id:int status:string ඞཁʹԠͯ͡εΩʔϚΛઃఆɾมߋՄೳ

  13. ΫΤϦͷ౤͛ํ ίϚϯυϥΠϯ͔Β౤͛Δ ϒϥ΢β͔Β౤͛Δ

  14. ΫΤϦ݁Ռͷॻ͖ग़͠ػೳ TreasureData ͷςʔϒϧ
 td://@/testdb/output_table MySQL ͷςʔϒϧ 
 mysql://user:password@host/database/table Google Spread

    Sheet
 gspreadsheet://user:password@domain.com/spreadsheet/worksheet Web Server / FTP Server
 web://domain.com/path --result ΦϓγϣϯͰ URI Λࢦఆ
  15. Pointer -> TreasureData υΩϡϝϯτ
 http://docs.treasure-data.com TreasureData ੡ UDF
 http://docs.treasure-data.com/articles/udfs ΫΤϦ݁Ռͷॻ͖ग़͠ػೳ


    http://docs.treasure-data.com/categories/result
  16. ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

  17. ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

  18. skip

  19. ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

  20. ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

  21. Ωϗϯ͸ SQL Ωϗϯ͸ SQLʢͳͷͰ؆୯ʁʣ SQL Ͱͷσʔλॲཧ͸׳Ε͕ඞཁʢू߹ࢦ޲ݴޠʣ ू߹ࢦ޲ͷࢥߟύλʔϯ S0 S1 S5

    S2 S4 S3 खଓ͖ܕͷࢥߟύλʔϯ ॲཧ̍ ॲཧ̎ ॲཧ̏ ॲཧ̐
  22. จ๏ - CASEࣜ ৚݅ʹΑͬͯผʑͷΧϥϜͷ஋Λฦ͢ COUNT, SUM ͳͲͱಉ͡Α͏ͳؔ਺ͱݟͯ΋͍͍ ΧϥϜͷ஋Λड͚औΓɺ஋Λฦ͢ CASE WHEN

    user_status = 1 THEN 'ߪೖऀ' WHEN user_status = 2 THEN 'ΫϦΤʔλʔ’ ELSE 'ͦͷଞ' END
  23. จ๏ - ιʔτ ORDER BYɿιʔτ. ෼ࢄ࣮ߦ͞Εͳ͍ DISTRIBUTE BYɿࢦఆΧϥϜͷ஋ͰσʔλΛ෼ࢄ SORT BYɿ෼ࢄ͞ΕͨͦΕͧΕΛࢦఆΧϥϜͰιʔτ

    SELECT * FROM access DISTRIBUTE BY access_id SORT BY time ηοτͰ࢖͏
  24. ؔ਺ - TD_TIME_RANGE ൣғΛ۠੾Δؔ਺ SELECT referer, path, time FROM access

    WHERE TD_TIME_RANGE(time, "2013-01-01", "2013-01-03", "JST") boolean TD_TIME_RANGE(int/long/string unix_timestamp, int/long/string start_time, int/long/string end_time, string timezone = 'UTC')
  25. ؔ਺ - TD_LAST / TD_FIRST ͋ΔΧϥϜʹ஫໨͠ɺ࠷େ/࠷খͷϨίʔυΛऔΓग़͢ SELECT access_id, time, TD_LAST(path,

    time) AS last_url FROM access GROUP BY access_id TD_LAST(ret_col, cmp_col1, cmp_col2, ...) TD_FIRST(ret_col, cmp_col1, cmp_col2, ...)
  26. ؔ਺ - TD_X_RANK keys ͷ஋Ͱ෼ׂͨ͠Ϩίʔυ܈ʹ1, 2, 3... ͱॱং෇͚ ༧Ίιʔτͯ͠࢖͏͜ͱ͕ଟ͍ SELECT

    access_id, time, TD_X_RANK(access_id) AS rank FROM access DISTRIBUTE BY access_id SORT BY time long TD_X_RANK(keys)
  27. ׳༻۟ - ϢʔβʔID Λ JOIN ΞΫηεIDͱϢʔβʔIDΛ JOIN ͢Δ SELECT user_id,

    referer, path FROM access t1 JOIN user_access_ids t2 ON (t1.access_id=t2.access_id)
  28. SELECT COUNT(*) FROM users GROUP BY CASE WHEN TD_TIME_RANGE(at,NULL,"2009-01-01","JST") THEN

    1 WHEN TD_TIME_RANGE(at,"2009-01-01","2011-01-01","JST") THEN ELSE 3 END SELECT CASE WHEN TD_TIME_RANGE(at,NULL,"2009-01-01","JST") THEN 1 WHEN TD_TIME_RANGE(at,"2009-01-01","2011-01-01","JST") THEN ELSE 3 END, COUNT(*) FROM users GROUP BY CASE WHEN TD_TIME_RANGE(at,NULL,"2009-01-01","JST") THEN 1 WHEN TD_TIME_RANGE(at,"2009-01-01","2011-01-01","JST") THEN ELSE 3 END ձһొ࿥ͨ͠Ϣʔβʔͷ͏ͪɺ2008೥·ͰΛʮୈҰϢʔβʔ܈ʯɺ2009೥͔Β 2010೥·ͰΛʮୈೋϢʔβʔ܈ʯɺͦΕҎ߱ΛʮୈࡾϢʔβʔ܈ʯͱߟ͑Δɻ ͦΕͧΕͷਓ਺͸Ͳͷ͘Β͍͔ʁ ࢖༻ྫ - CASEࣜ
  29. (SELECT user_id FROM users WHERE TD_TIME_RANGE(confirmed_at, "2013-01-01", "2013-02-01", "JST")) new_users

    ! (SELECT * FROM access JOIN user_access_ids ON (access.access_id=user_access_ids.access_ids)) users_access ! SELECT user_id, TD_LAST(path, time) FROM new_users_access GROUP BY user_id SELECT user_id, TD_LAST(path, time) FROM (SELECT user_id FROM users WHERE TD_TIME_RANGE(confirmed_at, "2013-01-01", "2013-02-01", "JST")) new_users JOIN (SELECT * FROM access JOIN user_access_ids ON (access.access_id=user_access_ids.access_ids)) users_access ON (new_users.id=users_access.user_id) GROUP BY user_id ࢖༻ྫ - ϥϯσΟϯάϖʔδ ϢʔβʔͷΞΫηεϩά͔Βɺ৽نϢʔβʔͷϥϯσΟϯάϖʔδΛௐ΂͍ͨɻ ର৅ظؒ͸ 2013-01-01~2013-01-31 ͱ͢Δɻ
  30. Pointer -> Practice ϦϨʔγϣφϧɾσʔλϕʔεͷੈք
 http://www.geocities.jp/mickindex/database/idx_database.html खଓ͖ܕ͔Βએݴܕɾू߹ࢦ޲ʹ಄Λ੾Γସ͑ΔͨΊͷ7Օ৚
 http://www.geocities.jp/mickindex/database/db_laws.html TreasureData ੡ UDF


    http://docs.treasure-data.com/articles/udfs LanguageManual UDF
 https://cwiki.apache.org/Hive/languagemanual-udf.html

  31. ͓ΘΓ by Sohei Takeno