Takayuki Sakai
September 09, 2016
3k

# TDでHivemallを半年使ってみたノウハウ / Hivemall Meetup 20160908

nex8という株式会社ファンコミュニケーションズの開発・運用するDSPにおいてHivemallを使ってみたノウハウです。

## Takayuki Sakai

September 09, 2016

## Transcript

2. ### ञҪ ਸࢸ - 2016/01- F@N Communicationsגࣜձࣾ - ػցֶशΤϯδχΞ - Python,

Scala, Ruby, JS, Goͱ͔΋ॻ͘Α - TDܥTechTalkొஃ2ճ໨

10. ### ֤Ϋϥε੒੷্Ґ2ਓͷྻΛऔಘ͢ΔSQLʁ TUVEFOU DMBTT TDPSF      

            TUVEFOU DMBTT TDPSF             ݩςʔϒϧ ΫΤϦ݁Ռ

12. ### MySQL SELECT t.* FROM row_table t INNER JOIN ( SELECT

class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student FROM row_table GROUP BY class) group_max ON t.class = group_max.class AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2;
13. ### MySQL SELECT t.* FROM row_table t INNER JOIN ( SELECT

class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student FROM row_table GROUP BY class) group_max ON t.class = group_max.class AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2; GROUP_CONCATͰ  Ϋϥε͝ͱʹιʔτͯ͠
14. ### MySQL SELECT t.* FROM row_table t INNER JOIN ( SELECT

class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student FROM row_table GROUP BY class) group_max ON t.class = group_max.class AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2; FIND_IN_SETͰ  ઌ಄͚ͩऔΓग़͢ GROUP_CONCATͰ  Ϋϥε͝ͱʹιʔτͯ͠
15. ### Presto  Hive PostgreSQL SELECT * FROM ( SELECT *, rank()

over (partition by class order by score desc) as rank FROM row_table ) t WHERE rank <= 2
16. ### Presto  Hive PostgreSQL SELECT * FROM ( SELECT *, rank()

over (partition by class order by score desc) as rank FROM row_table ) t WHERE rank <= 2 rank()ͰΫϥε಺Ͱͷ  ॱҐΛ͚ͭͯ
17. ### Presto  Hive PostgreSQL SELECT * FROM ( SELECT *, rank()

over (partition by class order by score desc) as rank FROM row_table ) t WHERE rank <= 2 rank()ͰΫϥε಺Ͱͷ  ॱҐΛ͚ͭͯ 2ҐҎ্ͷਓ͚ͩൈ͖ग़͢
18. ### SELECT each_top_k( 2, class, score, student, class ) as (rank,

score, student, class) FROM (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t Hivemall
19. ### SELECT each_top_k( 2, class, score, student, class ) as (rank,

score, student, class) FROM (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t Hivemall Ϋϥε͝ͱʹϊʔυΛ  ෼ࢄ͠ιʔτ
20. ### SELECT each_top_k( 2, class, score, student, class ) as (rank,

score, student, class) FROM (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t Hivemall Ϋϥε͝ͱʹϊʔυΛ  ෼ࢄ͠ιʔτ each_top_kͰ্ҐͷΈநग़

22. ### ੑೳ্͕͕ͬͨ - Hiveͷؔ਺ͷΈ: 2೔ܦͬͯ΋ऴΘΒͳ͍ - Hivemall: 2࣌ؒఔ౓ - 1Ϋϥε͋ͨΓ1,000ਓ -

શ20,000,000Ϋϥε ͷςʔϒϧʹର࣮ͯ͠ߦ͢Δͱ… ※Treasure Data্Ͱ࣮ݧ

26. ### ݟͨ·Μ· SELECT array_avg(arr) avg, array_sum(arr) sum FROM (SELECT ARRAY(1,2) AS

arr UNION ALL SELECT ARRAY(3,4) AS arr) t # avg, sum # [2,4], [4,6]

49. ### - UDF : User Deﬁned Function - UDAF: User Deﬁned

Aggregate Function - UDTF: User Deﬁned Table-generate Function
50. ### - UDF : IN 1ྻ, OUT 1ྻ - UDAF: IN

ෳ਺ྻ, OUT 1ྻ - UDTF : IN 1ྻ, OUT ෳ਺ྻ
51. ### - UDF : exp, log, cosine - UDAF: count, max,

min - UDTF : ͪΐͬͱࢥ͍͔ͭͳ͍ HiveσϑΥϧτؔ਺ͷ৔߹
52. ### - UDF : mhash, cosine_similarity - UDAF: array_avg, rmse -

UDTF : train_xxx ʢػցֶशؔ਺ʣ Hivemallͷؔ਺ͷ৔߹

55. ### ΍ͬͨ͜ͱ - loglossͱ͍͏ UDAFΛ௥Ճ - ͍ΘΏΔଛࣦؔ਺ʢRMSEΈ͍ͨͳʣ - ࣮૷͸ଞͷଛࣦؔ਺Λ΄΅ίϐϖ - GitHubͰϓϧϦΫΤετΛग़ͯٞ͠࿦

- ؔ਺ͷཧ࿦ఆٛͷจݙఏग़ - ଞݴޠʹΑΔ࣮૷ఏग़ - ςετίʔυͷఏग़