Upgrade to Pro — share decks privately, control downloads, hide ads and more …

TDでHivemallを半年使ってみたノウハウ / Hivemall Meetup 20160908

Be66cfc28b46698ca98fe49381db9368?s=47 Takayuki Sakai
September 09, 2016

TDでHivemallを半年使ってみたノウハウ / Hivemall Meetup 20160908

nex8という株式会社ファンコミュニケーションズの開発・運用するDSPにおいてHivemallを使ってみたノウハウです。

Be66cfc28b46698ca98fe49381db9368?s=128

Takayuki Sakai

September 09, 2016
Tweet

Transcript

  1. TDͰHivemallΛ
 ൒೥࢖ͬͯΈͨ
 ϊ΢ϋ΢ Hivemall Meetup 2016/09/08

  2. ञҪ ਸࢸ - 2016/01- F@N Communicationsגࣜձࣾ - ػցֶशΤϯδχΞ - Python,

    Scala, Ruby, JS, Goͱ͔΋ॻ͘Α - TDܥTechTalkొஃ2ճ໨
  3. ΞυςΫۀքͷதͰ΋
 DSPͱ͍͏ͷΛ࡞ͬͯ·͢

  4. લճͷൃදࢿྉ

  5. ର৅ऀ - HivemallΛ͢Ͱʹ࢖͍ͬͯΔਓ - ʢHivemallʹඞཁͳؔ਺͕ͳͯ͘ࠔͬͯΔਓʣ - ʢHivemallʹίϯτϦϏϡʔτ͍ͨ͠ਓʣ

  6. ࿩͢͜ͱ - Hivemallͷศརؔ਺ - Φεεϝศརؔ਺ - ศརؔ਺ͷ୳͠ํ - HivemallʹίϯτϦϏϡʔτ

  7. Φεεϝศརؔ਺ - each_top_k - array_concat - array_avg - array_sum

  8. Φεεϝศརؔ਺ - each_top_k - array_concat - array_avg - array_sum

  9. ಥવͰ͕͢SQLΫΠζ

  10. ֤Ϋϥε੒੷্Ґ2ਓͷྻΛऔಘ͢ΔSQLʁ TUVEFOU DMBTT TDPSF      

                TUVEFOU DMBTT TDPSF             ݩςʔϒϧ ΫΤϦ݁Ռ
  11. DBΤϯδϯ͝ͱʹ
 ݟͯΈ·͠ΐ͏

  12. MySQL SELECT t.* FROM row_table t INNER JOIN ( SELECT

    class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student FROM row_table GROUP BY class) group_max ON t.class = group_max.class AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2;
  13. MySQL SELECT t.* FROM row_table t INNER JOIN ( SELECT

    class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student FROM row_table GROUP BY class) group_max ON t.class = group_max.class AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2; GROUP_CONCATͰ
 Ϋϥε͝ͱʹιʔτͯ͠
  14. MySQL SELECT t.* FROM row_table t INNER JOIN ( SELECT

    class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student FROM row_table GROUP BY class) group_max ON t.class = group_max.class AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2; FIND_IN_SETͰ
 ઌ಄͚ͩऔΓग़͢ GROUP_CONCATͰ
 Ϋϥε͝ͱʹιʔτͯ͠
  15. Presto
 Hive PostgreSQL SELECT * FROM ( SELECT *, rank()

    over (partition by class order by score desc) as rank FROM row_table ) t WHERE rank <= 2
  16. Presto
 Hive PostgreSQL SELECT * FROM ( SELECT *, rank()

    over (partition by class order by score desc) as rank FROM row_table ) t WHERE rank <= 2 rank()ͰΫϥε಺Ͱͷ
 ॱҐΛ͚ͭͯ
  17. Presto
 Hive PostgreSQL SELECT * FROM ( SELECT *, rank()

    over (partition by class order by score desc) as rank FROM row_table ) t WHERE rank <= 2 rank()ͰΫϥε಺Ͱͷ
 ॱҐΛ͚ͭͯ 2ҐҎ্ͷਓ͚ͩൈ͖ग़͢
  18. SELECT each_top_k( 2, class, score, student, class ) as (rank,

    score, student, class) FROM (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t Hivemall
  19. SELECT each_top_k( 2, class, score, student, class ) as (rank,

    score, student, class) FROM (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t Hivemall Ϋϥε͝ͱʹϊʔυΛ
 ෼ࢄ͠ιʔτ
  20. SELECT each_top_k( 2, class, score, student, class ) as (rank,

    score, student, class) FROM (SELECT * FROM raw_table DISTRIBUTE BY class SORT BY class) t Hivemall Ϋϥε͝ͱʹϊʔυΛ
 ෼ࢄ͠ιʔτ each_top_kͰ্ҐͷΈநग़
  21. ී௨ͷHiveͷΫΤϦͱ
 ௕͞ಉ͘͡Β͍͡ΌΜ…

  22. ੑೳ্͕͕ͬͨ - Hiveͷؔ਺ͷΈ: 2೔ܦͬͯ΋ऴΘΒͳ͍ - Hivemall: 2࣌ؒఔ౓ - 1Ϋϥε͋ͨΓ1,000ਓ -

    શ20,000,000Ϋϥε ͷςʔϒϧʹର࣮ͯ͠ߦ͢Δͱ… ※Treasure Data্Ͱ࣮ݧ
  23. ΋ͱ΋ͱ͜ͷॲཧͷߴ଎ԽͷͨΊʹ
 ༉Ҫ͞Μ͕։ൃͨ͠Β͍͠

  24. Φεεϝศརؔ਺ - each_top_k - array_concat - array_avg - array_sum

  25. ݟͨ·Μ· SELECT array_concat(ARRAY(1,2), ARRAY(3,4)) 
 # _col1 # [1,2,3,4]

  26. ݟͨ·Μ· SELECT array_avg(arr) avg, array_sum(arr) sum FROM (SELECT ARRAY(1,2) AS

    arr UNION ALL SELECT ARRAY(3,4) AS arr) t # avg, sum # [2,4], [4,6]
  27. ศརؔ਺ͷ୳͠ํ

  28. Hivemall͋Δ͋Δ1 ͋Δ೔ͷ๻ʮίαΠϯྨࣅ౓ٻΊ͍ͨͳʔʯ ↓ ʮHivemall cosine similarityʯͰάάΔ ↓ ๻ʮ͓ͬɺHivemallͷWikiͩʯ

  29. ๻ʮcosine_simؔ਺Ͷʯ ↓ ๻ʮTDͰࢼͯ͠ΈΔ͔ʔʯ

  30. ๻ʮ͓΍ͬɺΤϥʔͩʯ

  31. ๻ʮΉΉΉʁʯ

  32. ๻ʮʯ

  33. ๻ʮ͜Μͳؔ਺ͳ͍ʂʯ

  34. ରॲ๏

  35. 1. HivemallͷGitHubϨϙδτϦͰ /resources/ddl/define-all.hive Λ։͘

  36. 2. ctrl+FͰʮcosineʯΛจࣈྻݕࡧ

  37. ͋ͬͨʂ

  38. ࣮͸… - ੲ͸ cosine_sim ͱ͍͏ؔ਺͕͋ͬͨ - Ͳ͔ͬͷλΠϛϯάͰcosine_similarityͱ
 ͍͏໊લʹมߋ͞Εͨ - Wikiͷهड़͚ͩ͸ݹ͍··…

  39. define-all.hiveͬͯԿΑʁ - HivemallͷUDFΛొ࿥͢ΔεΫϦϓτ - HivemallͰ࢖͑Δؔ਺͸શ෦ࡌͬͯΔ (͸ͣ)

  40. Hivemall͋Δ͋Δ2 ͋Δ೔ͷ๻ʮarray_avgؔ਺ͬͯͷ͕͋Δͷ͔ʯ ↓ ๻ʮͪΐͬͱࢼͯ͠ΈΑ͏ʯ ↓ SELECT array_avg(ARRAY(1,2), ARRAY(3,4))

  41. Hivemall͋Δ͋Δ2 ͋Δ೔ͷ๻ʮarray_avgؔ਺ͬͯͷ͕͋Δͷ͔ʯ ↓ ๻ʮͪΐͬͱࢼͯ͠ΈΑ͏ʯ ↓ SELECT array_avg(ARRAY(1,2), ARRAY(3,4)) ↓ UDFArgumentException

  42. Hivemall͋Δ͋Δ2 ↓ ๻ʮҾ਺͕ҧ͏ͬΆ͍ͳʔʯ ๻ʮͲΜͳҾ਺͕ਖ਼͍͠ΜͩΖ͏ʯ

  43. Hivemall͋Δ͋Δ2 ↓ ๻ʮҾ਺͕ҧ͏ͬΆ͍ͳʔʯ ๻ʮͲΜͳҾ਺͕ਖ਼͍͠ΜͩΖ͏ʯ ↓ ٧·Δ

  44. ରॲ๏

  45. 1. ઌ΄Ͳಉ༷ʹɺdefine-all.hiveͰؔ਺Λ
 ୳͢ ↓ 2. ͦͷؔ਺ͷ࣮૷Ϋϥε໊Λ֬ೝ

  46. 3. ͦͷ࣮૷ΫϥεΛݟͯΈΔͱ… Description͕ॻ͍ͯ͋Δʂ ʢ͜ͷ৔߹͸Ҿ਺͸ҰͭͩͬͨΒ͍͠ʣ

  47. ͱ͜ΖͰ

  48. HiveͷUDFʹ͸3छྨ͋Γ·͢ - UDF - UDAF - UDTF

  49. - UDF : User Defined Function - UDAF: User Defined

    Aggregate Function - UDTF: User Defined Table-generate Function
  50. - UDF : IN 1ྻ, OUT 1ྻ - UDAF: IN

    ෳ਺ྻ, OUT 1ྻ - UDTF : IN 1ྻ, OUT ෳ਺ྻ
  51. - UDF : exp, log, cosine - UDAF: count, max,

    min - UDTF : ͪΐͬͱࢥ͍͔ͭͳ͍ HiveσϑΥϧτؔ਺ͷ৔߹
  52. - UDF : mhash, cosine_similarity - UDAF: array_avg, rmse -

    UDTF : train_xxx ʢػցֶशؔ਺ʣ Hivemallͷؔ਺ͷ৔߹
  53. HivemallʹίϯτϦϏϡʔτ

  54. ࠷ۙHivemallʹ
 ίϯτϦϏϡʔτ͠·ͨ͠

  55. ΍ͬͨ͜ͱ - loglossͱ͍͏ UDAFΛ௥Ճ - ͍ΘΏΔଛࣦؔ਺ʢRMSEΈ͍ͨͳʣ - ࣮૷͸ଞͷଛࣦؔ਺Λ΄΅ίϐϖ - GitHubͰϓϧϦΫΤετΛग़ͯٞ͠࿦

    - ؔ਺ͷཧ࿦ఆٛͷจݙఏग़ - ଞݴޠʹΑΔ࣮૷ఏग़ - ςετίʔυͷఏग़
  56. ͜Μͳײ͡

  57. ඞཁͩͬͨ͜ͱ - Java - ਺ֶͷ஌ࣝ - ӳޠ - ΍Δؾ

  58. - Javaʢͪΐͬͱʣ - ਺ֶͷ஌ࣝʢͪΐͬͱʣ - ӳޠʢͪΐͬͱʣ - ΍Δؾʢͪΐͬͱʣ

  59. ͳΜ͔Ͱ͖ͦ͏ͳ
 ؾ͕͖ͯ͠·ͤΜ͔ʁ

  60. ͨͩ͠… - ػցֶशؔ਺ͷ࣮૷͸௒೉қ౓ߴ͍Ͱ͢ - ಛʹ਺ֶͷ஌͕͕͕ࣝ - ࣮૷΋͔ͳΓߴ౓

  61. Ͱ͖Δ͜ͱ͔Β΍Γ·͠ΐ͏

  62. \ ͋Γ͕ͱ͏͍͟͝·ͨ͠ /

  63. None