$30 off During Our Annual Pro Sale. View Details »

TDでHivemallを半年使ってみたノウハウ / Hivemall Meetup 20160908

Takayuki Sakai
September 09, 2016

TDでHivemallを半年使ってみたノウハウ / Hivemall Meetup 20160908

nex8という株式会社ファンコミュニケーションズの開発・運用するDSPにおいてHivemallを使ってみたノウハウです。

Takayuki Sakai

September 09, 2016
Tweet

More Decks by Takayuki Sakai

Other Decks in Technology

Transcript

  1. TDͰHivemallΛ

    ൒೥࢖ͬͯΈͨ

    ϊ΢ϋ΢
    Hivemall Meetup 2016/09/08

    View Slide

  2. ञҪ ਸࢸ
    - 2016/01- F@N Communicationsגࣜձࣾ
    - ػցֶशΤϯδχΞ
    - Python, Scala, Ruby, JS, Goͱ͔΋ॻ͘Α
    - TDܥTechTalkొஃ2ճ໨

    View Slide

  3. ΞυςΫۀքͷதͰ΋

    DSPͱ͍͏ͷΛ࡞ͬͯ·͢

    View Slide

  4. લճͷൃදࢿྉ

    View Slide

  5. ର৅ऀ
    - HivemallΛ͢Ͱʹ࢖͍ͬͯΔਓ
    - ʢHivemallʹඞཁͳؔ਺͕ͳͯ͘ࠔͬͯΔਓʣ
    - ʢHivemallʹίϯτϦϏϡʔτ͍ͨ͠ਓʣ

    View Slide

  6. ࿩͢͜ͱ
    - Hivemallͷศརؔ਺
    - Φεεϝศརؔ਺
    - ศརؔ਺ͷ୳͠ํ
    - HivemallʹίϯτϦϏϡʔτ

    View Slide

  7. Φεεϝศརؔ਺
    - each_top_k
    - array_concat
    - array_avg
    - array_sum

    View Slide

  8. Φεεϝศརؔ਺
    - each_top_k
    - array_concat
    - array_avg
    - array_sum

    View Slide

  9. ಥવͰ͕͢SQLΫΠζ

    View Slide

  10. ֤Ϋϥε੒੷্Ґ2ਓͷྻΛऔಘ͢ΔSQLʁ
    TUVEFOU DMBTT TDPSF






    TUVEFOU DMBTT TDPSF




    ݩςʔϒϧ ΫΤϦ݁Ռ

    View Slide

  11. DBΤϯδϯ͝ͱʹ

    ݟͯΈ·͠ΐ͏

    View Slide

  12. MySQL
    SELECT
    t.*
    FROM
    row_table t
    INNER JOIN (
    SELECT
    class,
    GROUP_CONCAT(student ORDER BY score DESC) grouped_student
    FROM
    row_table
    GROUP BY class) group_max
    ON t.class = group_max.class
    AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2;

    View Slide

  13. MySQL
    SELECT
    t.*
    FROM
    row_table t
    INNER JOIN (
    SELECT
    class,
    GROUP_CONCAT(student ORDER BY score DESC) grouped_student
    FROM
    row_table
    GROUP BY class) group_max
    ON t.class = group_max.class
    AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2;
    GROUP_CONCATͰ

    Ϋϥε͝ͱʹιʔτͯ͠

    View Slide

  14. MySQL
    SELECT
    t.*
    FROM
    row_table t
    INNER JOIN (
    SELECT
    class,
    GROUP_CONCAT(student ORDER BY score DESC) grouped_student
    FROM
    row_table
    GROUP BY class) group_max
    ON t.class = group_max.class
    AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2;
    FIND_IN_SETͰ

    ઌ಄͚ͩऔΓग़͢
    GROUP_CONCATͰ

    Ϋϥε͝ͱʹιʔτͯ͠

    View Slide

  15. Presto

    Hive
    PostgreSQL
    SELECT
    *
    FROM (
    SELECT
    *,
    rank() over (partition by class order by score desc) as rank
    FROM row_table
    ) t
    WHERE rank <= 2

    View Slide

  16. Presto

    Hive
    PostgreSQL
    SELECT
    *
    FROM (
    SELECT
    *,
    rank() over (partition by class order by score desc) as rank
    FROM row_table
    ) t
    WHERE rank <= 2
    rank()ͰΫϥε಺Ͱͷ

    ॱҐΛ͚ͭͯ

    View Slide

  17. Presto

    Hive
    PostgreSQL
    SELECT
    *
    FROM (
    SELECT
    *,
    rank() over (partition by class order by score desc) as rank
    FROM row_table
    ) t
    WHERE rank <= 2
    rank()ͰΫϥε಺Ͱͷ

    ॱҐΛ͚ͭͯ
    2ҐҎ্ͷਓ͚ͩൈ͖ग़͢

    View Slide

  18. SELECT
    each_top_k(
    2, class, score,
    student,
    class
    ) as (rank, score, student, class)
    FROM
    (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t
    Hivemall

    View Slide

  19. SELECT
    each_top_k(
    2, class, score,
    student,
    class
    ) as (rank, score, student, class)
    FROM
    (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t
    Hivemall
    Ϋϥε͝ͱʹϊʔυΛ

    ෼ࢄ͠ιʔτ

    View Slide

  20. SELECT
    each_top_k(
    2, class, score,
    student,
    class
    ) as (rank, score, student, class)
    FROM
    (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t
    Hivemall
    Ϋϥε͝ͱʹϊʔυΛ

    ෼ࢄ͠ιʔτ
    each_top_kͰ্ҐͷΈநग़

    View Slide

  21. ී௨ͷHiveͷΫΤϦͱ

    ௕͞ಉ͘͡Β͍͡ΌΜ…

    View Slide

  22. ੑೳ্͕͕ͬͨ
    - Hiveͷؔ਺ͷΈ: 2೔ܦͬͯ΋ऴΘΒͳ͍
    - Hivemall: 2࣌ؒఔ౓
    - 1Ϋϥε͋ͨΓ1,000ਓ
    - શ20,000,000Ϋϥε
    ͷςʔϒϧʹର࣮ͯ͠ߦ͢Δͱ…
    ※Treasure Data্Ͱ࣮ݧ

    View Slide

  23. ΋ͱ΋ͱ͜ͷॲཧͷߴ଎ԽͷͨΊʹ

    ༉Ҫ͞Μ͕։ൃͨ͠Β͍͠

    View Slide

  24. Φεεϝศརؔ਺
    - each_top_k
    - array_concat
    - array_avg
    - array_sum

    View Slide

  25. ݟͨ·Μ·
    SELECT
    array_concat(ARRAY(1,2), ARRAY(3,4))

    # _col1
    # [1,2,3,4]

    View Slide

  26. ݟͨ·Μ·
    SELECT
    array_avg(arr) avg,
    array_sum(arr) sum
    FROM
    (SELECT ARRAY(1,2) AS arr UNION ALL SELECT
    ARRAY(3,4) AS arr) t
    # avg, sum
    # [2,4], [4,6]

    View Slide

  27. ศརؔ਺ͷ୳͠ํ

    View Slide

  28. Hivemall͋Δ͋Δ1
    ͋Δ೔ͷ๻ʮίαΠϯྨࣅ౓ٻΊ͍ͨͳʔʯ

    ʮHivemall cosine similarityʯͰάάΔ

    ๻ʮ͓ͬɺHivemallͷWikiͩʯ

    View Slide

  29. ๻ʮcosine_simؔ਺Ͷʯ

    ๻ʮTDͰࢼͯ͠ΈΔ͔ʔʯ

    View Slide

  30. ๻ʮ͓΍ͬɺΤϥʔͩʯ

    View Slide

  31. ๻ʮΉΉΉʁʯ

    View Slide

  32. ๻ʮʯ

    View Slide

  33. ๻ʮ͜Μͳؔ਺ͳ͍ʂʯ

    View Slide

  34. ରॲ๏

    View Slide

  35. 1. HivemallͷGitHubϨϙδτϦͰ
    /resources/ddl/define-all.hive Λ։͘

    View Slide

  36. 2. ctrl+FͰʮcosineʯΛจࣈྻݕࡧ

    View Slide

  37. ͋ͬͨʂ

    View Slide

  38. ࣮͸…
    - ੲ͸ cosine_sim ͱ͍͏ؔ਺͕͋ͬͨ
    - Ͳ͔ͬͷλΠϛϯάͰcosine_similarityͱ

    ͍͏໊લʹมߋ͞Εͨ
    - Wikiͷهड़͚ͩ͸ݹ͍··…

    View Slide

  39. define-all.hiveͬͯԿΑʁ
    - HivemallͷUDFΛొ࿥͢ΔεΫϦϓτ
    - HivemallͰ࢖͑Δؔ਺͸શ෦ࡌͬͯΔ (͸ͣ)

    View Slide

  40. Hivemall͋Δ͋Δ2
    ͋Δ೔ͷ๻ʮarray_avgؔ਺ͬͯͷ͕͋Δͷ͔ʯ

    ๻ʮͪΐͬͱࢼͯ͠ΈΑ͏ʯ

    SELECT array_avg(ARRAY(1,2), ARRAY(3,4))

    View Slide

  41. Hivemall͋Δ͋Δ2
    ͋Δ೔ͷ๻ʮarray_avgؔ਺ͬͯͷ͕͋Δͷ͔ʯ

    ๻ʮͪΐͬͱࢼͯ͠ΈΑ͏ʯ

    SELECT array_avg(ARRAY(1,2), ARRAY(3,4))

    UDFArgumentException

    View Slide

  42. Hivemall͋Δ͋Δ2

    ๻ʮҾ਺͕ҧ͏ͬΆ͍ͳʔʯ
    ๻ʮͲΜͳҾ਺͕ਖ਼͍͠ΜͩΖ͏ʯ

    View Slide

  43. Hivemall͋Δ͋Δ2

    ๻ʮҾ਺͕ҧ͏ͬΆ͍ͳʔʯ
    ๻ʮͲΜͳҾ਺͕ਖ਼͍͠ΜͩΖ͏ʯ

    ٧·Δ

    View Slide

  44. ରॲ๏

    View Slide

  45. 1. ઌ΄Ͳಉ༷ʹɺdefine-all.hiveͰؔ਺Λ

    ୳͢

    2. ͦͷؔ਺ͷ࣮૷Ϋϥε໊Λ֬ೝ

    View Slide

  46. 3. ͦͷ࣮૷ΫϥεΛݟͯΈΔͱ…
    Description͕ॻ͍ͯ͋Δʂ
    ʢ͜ͷ৔߹͸Ҿ਺͸ҰͭͩͬͨΒ͍͠ʣ

    View Slide

  47. ͱ͜ΖͰ

    View Slide

  48. HiveͷUDFʹ͸3छྨ͋Γ·͢
    - UDF
    - UDAF
    - UDTF

    View Slide

  49. - UDF : User Defined Function
    - UDAF: User Defined Aggregate Function
    - UDTF: User Defined Table-generate Function

    View Slide

  50. - UDF : IN 1ྻ, OUT 1ྻ
    - UDAF: IN ෳ਺ྻ, OUT 1ྻ
    - UDTF : IN 1ྻ, OUT ෳ਺ྻ

    View Slide

  51. - UDF : exp, log, cosine
    - UDAF: count, max, min
    - UDTF : ͪΐͬͱࢥ͍͔ͭͳ͍
    HiveσϑΥϧτؔ਺ͷ৔߹

    View Slide

  52. - UDF : mhash, cosine_similarity
    - UDAF: array_avg, rmse
    - UDTF : train_xxx ʢػցֶशؔ਺ʣ
    Hivemallͷؔ਺ͷ৔߹

    View Slide

  53. HivemallʹίϯτϦϏϡʔτ

    View Slide

  54. ࠷ۙHivemallʹ

    ίϯτϦϏϡʔτ͠·ͨ͠

    View Slide

  55. ΍ͬͨ͜ͱ
    - loglossͱ͍͏ UDAFΛ௥Ճ
    - ͍ΘΏΔଛࣦؔ਺ʢRMSEΈ͍ͨͳʣ
    - ࣮૷͸ଞͷଛࣦؔ਺Λ΄΅ίϐϖ
    - GitHubͰϓϧϦΫΤετΛग़ͯٞ͠࿦
    - ؔ਺ͷཧ࿦ఆٛͷจݙఏग़
    - ଞݴޠʹΑΔ࣮૷ఏग़
    - ςετίʔυͷఏग़

    View Slide

  56. ͜Μͳײ͡

    View Slide

  57. ඞཁͩͬͨ͜ͱ
    - Java
    - ਺ֶͷ஌ࣝ
    - ӳޠ
    - ΍Δؾ

    View Slide

  58. - Javaʢͪΐͬͱʣ
    - ਺ֶͷ஌ࣝʢͪΐͬͱʣ
    - ӳޠʢͪΐͬͱʣ
    - ΍Δؾʢͪΐͬͱʣ

    View Slide

  59. ͳΜ͔Ͱ͖ͦ͏ͳ

    ؾ͕͖ͯ͠·ͤΜ͔ʁ

    View Slide

  60. ͨͩ͠…
    - ػցֶशؔ਺ͷ࣮૷͸௒೉қ౓ߴ͍Ͱ͢
    - ಛʹ਺ֶͷ஌͕͕͕ࣝ
    - ࣮૷΋͔ͳΓߴ౓

    View Slide

  61. Ͱ͖Δ͜ͱ͔Β΍Γ·͠ΐ͏

    View Slide

  62. \ ͋Γ͕ͱ͏͍͟͝·ͨ͠ /

    View Slide

  63. View Slide