Slide 1

Slide 1 text

TreasureData-Hive ษڧձ 2013/02/15

Slide 2

Slide 2 text

໨త ղੳεΩϧͷ޲্ ཉ͍͠σʔλΛҾ͍ͯ͘Δ෦෼

Slide 3

Slide 3 text

ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

Slide 4

Slide 4 text

ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

Slide 5

Slide 5 text

෼ࢄॲཧͷϑϨʔϜϫʔΫ ෼ࢄฒྻͳϑΝΠϧγεςϜ جૅ஌ࣝ MapReduce / Apache Hadoop Google File System / Hadoop Distributed File System MR ͸ GFS ্Ͱ࣮ߦ͞ΕΔ / Hadoop ͸ HDFS ্Ͱ.. ʮϑΝΠϧϕʔεͰͷॲཧʯ

Slide 6

Slide 6 text

Hive? Apache Hive Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. ͜Εʂˠ http://hive.apache.org

Slide 7

Slide 7 text

Hive େن໛ͳσʔλΛ RDB / SQL ϥΠΫʹѻ͑Δɻ ࣅ͍ͯΔͷ͸ݴޠͱϞσϧ͚ͩ ʢ໰͍߹Θͤݴޠ / σʔλϕʔε&ςʔϒϧʣ ॏཁ ɾɾɾѻ͑Δʁ

Slide 8

Slide 8 text

Pointer -> Hive Document
 https://cwiki.apache.org/Hive/home.html LanguageManual UDF
 https://cwiki.apache.org/Hive/languagemanual-udf.html


Slide 9

Slide 9 text

TreasureData Ϋϥ΢υσʔλ΢ΣΞϋ΢ε ʮੈքதͰͻͱ͚ͭͩղੳ༻ͷσʔλϕʔε͕
 ɹ͋ΔͳΒɼͦΕ͕ Treasure Data Ͱ͋Γ͍ͨ ʯ ೖΕͨσʔλΛҾ͍ͯ͘ΔΠϯλʔϑΣʔε͕ SQL

Slide 10

Slide 10 text

σʔλͷ֨ೲ ूΊͨϩά͸ S3 ʹϑΝΠϧͱͯ֨͠ೲ͞Ε͍ͯΔ ϩάͷλΠϜελϯϓʢtimeΧϥϜʣΛݟͯ
 Ұ࣌ؒ͝ͱʹ෼ׂͯ֨͠ೲ σʔλશମΛॲཧ͢Δ͜ͱΛආ͚ΒΕΔ࢓૊Έ

Slide 11

Slide 11 text

Hive Query Language HiveQL ૊ΈࠐΈͷ UDF ͸͢΂ͯαϙʔτ Ճ͑ͯ TreasureData ੡ͷศརͳ UDF Λఏڙʢޙड़ʣ HiveQL ͷ͏ͪ SELECT จͷΈ࢖͑Δ

Slide 12

Slide 12 text

ςʔϒϧεΩʔϚͷѻ͍ ൚༻εΩʔϚ͕σϑΥϧτͰઃఆ͞Ε͍ͯΔ time : The time that each entry was generated, in int64 UNIX time v : Each log entry, stored as map $ td schema:set pixta users id:int status:string ඞཁʹԠͯ͡εΩʔϚΛઃఆɾมߋՄೳ


Slide 13

Slide 13 text

ΫΤϦͷ౤͛ํ ίϚϯυϥΠϯ͔Β౤͛Δ ϒϥ΢β͔Β౤͛Δ

Slide 14

Slide 14 text

ΫΤϦ݁Ռͷॻ͖ग़͠ػೳ TreasureData ͷςʔϒϧ
 td://@/testdb/output_table MySQL ͷςʔϒϧ 
 mysql://user:password@host/database/table Google Spread Sheet
 gspreadsheet://user:password@domain.com/spreadsheet/worksheet Web Server / FTP Server
 web://domain.com/path --result ΦϓγϣϯͰ URI Λࢦఆ

Slide 15

Slide 15 text

Pointer -> TreasureData υΩϡϝϯτ
 http://docs.treasure-data.com TreasureData ੡ UDF
 http://docs.treasure-data.com/articles/udfs ΫΤϦ݁Ռͷॻ͖ग़͠ػೳ
 http://docs.treasure-data.com/categories/result

Slide 16

Slide 16 text

ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

Slide 17

Slide 17 text

ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

Slide 18

Slide 18 text

skip

Slide 19

Slide 19 text

ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

Slide 20

Slide 20 text

ϝχϡʔ Hive / TreasureDataɹ ར༻Ͱ͖Δσʔλ Α͘࢖͏จ / ؔ਺ / ׳༻۟

Slide 21

Slide 21 text

Ωϗϯ͸ SQL Ωϗϯ͸ SQLʢͳͷͰ؆୯ʁʣ SQL Ͱͷσʔλॲཧ͸׳Ε͕ඞཁʢू߹ࢦ޲ݴޠʣ ू߹ࢦ޲ͷࢥߟύλʔϯ S0 S1 S5 S2 S4 S3 खଓ͖ܕͷࢥߟύλʔϯ ॲཧ̍ ॲཧ̎ ॲཧ̏ ॲཧ̐

Slide 22

Slide 22 text

จ๏ - CASEࣜ ৚݅ʹΑͬͯผʑͷΧϥϜͷ஋Λฦ͢ COUNT, SUM ͳͲͱಉ͡Α͏ͳؔ਺ͱݟͯ΋͍͍ ΧϥϜͷ஋Λड͚औΓɺ஋Λฦ͢ CASE WHEN user_status = 1 THEN 'ߪೖऀ' WHEN user_status = 2 THEN 'ΫϦΤʔλʔ’ ELSE 'ͦͷଞ' END

Slide 23

Slide 23 text

จ๏ - ιʔτ ORDER BYɿιʔτ. ෼ࢄ࣮ߦ͞Εͳ͍ DISTRIBUTE BYɿࢦఆΧϥϜͷ஋ͰσʔλΛ෼ࢄ SORT BYɿ෼ࢄ͞ΕͨͦΕͧΕΛࢦఆΧϥϜͰιʔτ SELECT * FROM access DISTRIBUTE BY access_id SORT BY time ηοτͰ࢖͏

Slide 24

Slide 24 text

ؔ਺ - TD_TIME_RANGE ൣғΛ۠੾Δؔ਺ SELECT referer, path, time FROM access WHERE TD_TIME_RANGE(time, "2013-01-01", "2013-01-03", "JST") boolean TD_TIME_RANGE(int/long/string unix_timestamp, int/long/string start_time, int/long/string end_time, string timezone = 'UTC')

Slide 25

Slide 25 text

ؔ਺ - TD_LAST / TD_FIRST ͋ΔΧϥϜʹ஫໨͠ɺ࠷େ/࠷খͷϨίʔυΛऔΓग़͢ SELECT access_id, time, TD_LAST(path, time) AS last_url FROM access GROUP BY access_id TD_LAST(ret_col, cmp_col1, cmp_col2, ...) TD_FIRST(ret_col, cmp_col1, cmp_col2, ...)

Slide 26

Slide 26 text

ؔ਺ - TD_X_RANK keys ͷ஋Ͱ෼ׂͨ͠Ϩίʔυ܈ʹ1, 2, 3... ͱॱং෇͚ ༧Ίιʔτͯ͠࢖͏͜ͱ͕ଟ͍ SELECT access_id, time, TD_X_RANK(access_id) AS rank FROM access DISTRIBUTE BY access_id SORT BY time long TD_X_RANK(keys)

Slide 27

Slide 27 text

׳༻۟ - ϢʔβʔID Λ JOIN ΞΫηεIDͱϢʔβʔIDΛ JOIN ͢Δ SELECT user_id, referer, path FROM access t1 JOIN user_access_ids t2 ON (t1.access_id=t2.access_id)

Slide 28

Slide 28 text

SELECT COUNT(*) FROM users GROUP BY CASE WHEN TD_TIME_RANGE(at,NULL,"2009-01-01","JST") THEN 1 WHEN TD_TIME_RANGE(at,"2009-01-01","2011-01-01","JST") THEN ELSE 3 END SELECT CASE WHEN TD_TIME_RANGE(at,NULL,"2009-01-01","JST") THEN 1 WHEN TD_TIME_RANGE(at,"2009-01-01","2011-01-01","JST") THEN ELSE 3 END, COUNT(*) FROM users GROUP BY CASE WHEN TD_TIME_RANGE(at,NULL,"2009-01-01","JST") THEN 1 WHEN TD_TIME_RANGE(at,"2009-01-01","2011-01-01","JST") THEN ELSE 3 END ձһొ࿥ͨ͠Ϣʔβʔͷ͏ͪɺ2008೥·ͰΛʮୈҰϢʔβʔ܈ʯɺ2009೥͔Β 2010೥·ͰΛʮୈೋϢʔβʔ܈ʯɺͦΕҎ߱ΛʮୈࡾϢʔβʔ܈ʯͱߟ͑Δɻ ͦΕͧΕͷਓ਺͸Ͳͷ͘Β͍͔ʁ ࢖༻ྫ - CASEࣜ

Slide 29

Slide 29 text

(SELECT user_id FROM users WHERE TD_TIME_RANGE(confirmed_at, "2013-01-01", "2013-02-01", "JST")) new_users ! (SELECT * FROM access JOIN user_access_ids ON (access.access_id=user_access_ids.access_ids)) users_access ! SELECT user_id, TD_LAST(path, time) FROM new_users_access GROUP BY user_id SELECT user_id, TD_LAST(path, time) FROM (SELECT user_id FROM users WHERE TD_TIME_RANGE(confirmed_at, "2013-01-01", "2013-02-01", "JST")) new_users JOIN (SELECT * FROM access JOIN user_access_ids ON (access.access_id=user_access_ids.access_ids)) users_access ON (new_users.id=users_access.user_id) GROUP BY user_id ࢖༻ྫ - ϥϯσΟϯάϖʔδ ϢʔβʔͷΞΫηεϩά͔Βɺ৽نϢʔβʔͷϥϯσΟϯάϖʔδΛௐ΂͍ͨɻ ର৅ظؒ͸ 2013-01-01~2013-01-31 ͱ͢Δɻ

Slide 30

Slide 30 text

Pointer -> Practice ϦϨʔγϣφϧɾσʔλϕʔεͷੈք
 http://www.geocities.jp/mickindex/database/idx_database.html खଓ͖ܕ͔Βએݴܕɾू߹ࢦ޲ʹ಄Λ੾Γସ͑ΔͨΊͷ7Օ৚
 http://www.geocities.jp/mickindex/database/db_laws.html TreasureData ੡ UDF
 http://docs.treasure-data.com/articles/udfs LanguageManual UDF
 https://cwiki.apache.org/Hive/languagemanual-udf.html


Slide 31

Slide 31 text

͓ΘΓ by Sohei Takeno