Upgrade to Pro — share decks privately, control downloads, hide ads and more …

TDでHivemallを半年使ってみたノウハウ / Hivemall Meetup 20160908

Takayuki Sakai
September 09, 2016

TDでHivemallを半年使ってみたノウハウ / Hivemall Meetup 20160908

nex8という株式会社ファンコミュニケーションズの開発・運用するDSPにおいてHivemallを使ってみたノウハウです。

Takayuki Sakai

September 09, 2016
Tweet

More Decks by Takayuki Sakai

Other Decks in Technology

Transcript

  1. ञҪ ਸࢸ - 2016/01- F@N Communicationsגࣜձࣾ - ػցֶशΤϯδχΞ - Python,

    Scala, Ruby, JS, Goͱ͔΋ॻ͘Α - TDܥTechTalkొஃ2ճ໨
  2. ֤Ϋϥε੒੷্Ґ2ਓͷྻΛऔಘ͢ΔSQLʁ TUVEFOU DMBTT TDPSF      

                TUVEFOU DMBTT TDPSF             ݩςʔϒϧ ΫΤϦ݁Ռ
  3. MySQL SELECT t.* FROM row_table t INNER JOIN ( SELECT

    class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student FROM row_table GROUP BY class) group_max ON t.class = group_max.class AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2;
  4. MySQL SELECT t.* FROM row_table t INNER JOIN ( SELECT

    class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student FROM row_table GROUP BY class) group_max ON t.class = group_max.class AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2; GROUP_CONCATͰ
 Ϋϥε͝ͱʹιʔτͯ͠
  5. MySQL SELECT t.* FROM row_table t INNER JOIN ( SELECT

    class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student FROM row_table GROUP BY class) group_max ON t.class = group_max.class AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2; FIND_IN_SETͰ
 ઌ಄͚ͩऔΓग़͢ GROUP_CONCATͰ
 Ϋϥε͝ͱʹιʔτͯ͠
  6. Presto
 Hive PostgreSQL SELECT * FROM ( SELECT *, rank()

    over (partition by class order by score desc) as rank FROM row_table ) t WHERE rank <= 2
  7. Presto
 Hive PostgreSQL SELECT * FROM ( SELECT *, rank()

    over (partition by class order by score desc) as rank FROM row_table ) t WHERE rank <= 2 rank()ͰΫϥε಺Ͱͷ
 ॱҐΛ͚ͭͯ
  8. Presto
 Hive PostgreSQL SELECT * FROM ( SELECT *, rank()

    over (partition by class order by score desc) as rank FROM row_table ) t WHERE rank <= 2 rank()ͰΫϥε಺Ͱͷ
 ॱҐΛ͚ͭͯ 2ҐҎ্ͷਓ͚ͩൈ͖ग़͢
  9. SELECT each_top_k( 2, class, score, student, class ) as (rank,

    score, student, class) FROM (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t Hivemall
  10. SELECT each_top_k( 2, class, score, student, class ) as (rank,

    score, student, class) FROM (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t Hivemall Ϋϥε͝ͱʹϊʔυΛ
 ෼ࢄ͠ιʔτ
  11. SELECT each_top_k( 2, class, score, student, class ) as (rank,

    score, student, class) FROM (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t Hivemall Ϋϥε͝ͱʹϊʔυΛ
 ෼ࢄ͠ιʔτ each_top_kͰ্ҐͷΈநग़
  12. ੑೳ্͕͕ͬͨ - Hiveͷؔ਺ͷΈ: 2೔ܦͬͯ΋ऴΘΒͳ͍ - Hivemall: 2࣌ؒఔ౓ - 1Ϋϥε͋ͨΓ1,000ਓ -

    શ20,000,000Ϋϥε ͷςʔϒϧʹର࣮ͯ͠ߦ͢Δͱ… ※Treasure Data্Ͱ࣮ݧ
  13. ݟͨ·Μ· SELECT array_avg(arr) avg, array_sum(arr) sum FROM (SELECT ARRAY(1,2) AS

    arr UNION ALL SELECT ARRAY(3,4) AS arr) t # avg, sum # [2,4], [4,6]
  14. - UDF : User Defined Function - UDAF: User Defined

    Aggregate Function - UDTF: User Defined Table-generate Function
  15. - UDF : IN 1ྻ, OUT 1ྻ - UDAF: IN

    ෳ਺ྻ, OUT 1ྻ - UDTF : IN 1ྻ, OUT ෳ਺ྻ
  16. - UDF : exp, log, cosine - UDAF: count, max,

    min - UDTF : ͪΐͬͱࢥ͍͔ͭͳ͍ HiveσϑΥϧτؔ਺ͷ৔߹
  17. - UDF : mhash, cosine_similarity - UDAF: array_avg, rmse -

    UDTF : train_xxx ʢػցֶशؔ਺ʣ Hivemallͷؔ਺ͷ৔߹