Slide 1

Slide 1 text

Know-how from half a year of using Hivemall on TD
Hivemall Meetup 2016/09/08

Slide 2

Slide 2 text

酒井 崇至
- At F@N Communications, Inc. since 2016/01
- Machine learning engineer
- Also writes Python, Scala, Ruby, JS, Go, and so on
- Second time speaking at a TD-related TechTalk

Slide 3

Slide 3 text

Within the ad tech industry, we build what is called a DSP

Slide 4

Slide 4 text

Slides from my previous talk

Slide 5

Slide 5 text

Target audience
- People who already use Hivemall
- (People stuck because Hivemall lacks a function they need)
- (People who want to contribute to Hivemall)

Slide 6

Slide 6 text

What I will talk about
- Handy Hivemall functions
  - Recommended handy functions
  - How to find handy functions
- Contributing to Hivemall

Slide 7

Slide 7 text

Recommended handy functions
- each_top_k
- array_concat
- array_avg
- array_sum

Slide 8

Slide 8 text

Recommended handy functions
- each_top_k
- array_concat
- array_avg
- array_sum

Slide 9

Slide 9 text

Out of the blue: an SQL quiz

Slide 10

Slide 10 text

Which SQL returns the rows for the top 2 students by score in each class?
Source table (columns: student, class, score) → Query result (columns: student, class, score)

Slide 11

Slide 11 text

Let's look at it one DB engine at a time

Slide 12

Slide 12 text

MySQL:
SELECT t.*
FROM row_table t
INNER JOIN (
  SELECT class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student
  FROM row_table
  GROUP BY class
) group_max
  ON t.class = group_max.class
  AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2;

Slide 13

Slide 13 text

MySQL:
SELECT t.*
FROM row_table t
INNER JOIN (
  SELECT class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student
  FROM row_table
  GROUP BY class
) group_max
  ON t.class = group_max.class
  AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2;
Note: GROUP_CONCAT sorts the students within each class

Slide 14

Slide 14 text

MySQL:
SELECT t.*
FROM row_table t
INNER JOIN (
  SELECT class, GROUP_CONCAT(student ORDER BY score DESC) grouped_student
  FROM row_table
  GROUP BY class
) group_max
  ON t.class = group_max.class
  AND FIND_IN_SET(student, grouped_student) BETWEEN 1 AND 2;
Note: GROUP_CONCAT sorts the students within each class
Note: FIND_IN_SET then takes only the ones at the head of the list

Slide 15

Slide 15 text

Presto / Hive / PostgreSQL:
SELECT *
FROM (
  SELECT *, rank() over (partition by class order by score desc) as rank
  FROM row_table
) t
WHERE rank <= 2

Slide 16

Slide 16 text

Presto / Hive / PostgreSQL:
SELECT *
FROM (
  SELECT *, rank() over (partition by class order by score desc) as rank
  FROM row_table
) t
WHERE rank <= 2
Note: rank() assigns each student a rank within the class

Slide 17

Slide 17 text

Presto / Hive / PostgreSQL:
SELECT *
FROM (
  SELECT *, rank() over (partition by class order by score desc) as rank
  FROM row_table
) t
WHERE rank <= 2
Note: rank() assigns each student a rank within the class
Note: the WHERE clause then keeps only students ranked 2nd or better
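To make the quiz concrete, here is a minimal sketch that runs the same window-function query on a tiny table. The sample rows are hypothetical (the slide's table contents are not available), and SQLite via Python's sqlite3 module is used purely as a stand-in engine for Presto, Hive, or PostgreSQL. Window functions need SQLite 3.25 or later, and the alias is renamed to rnk only to avoid clashing with the function name.

```python
import sqlite3

# Hypothetical sample data: (student, class, score).
rows = [
    ("alice", "A", 90),
    ("bob", "A", 80),
    ("carol", "A", 70),
    ("dave", "B", 85),
    ("erin", "B", 95),
    ("frank", "B", 60),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE row_table (student TEXT, class TEXT, score INTEGER)")
conn.executemany("INSERT INTO row_table VALUES (?, ?, ?)", rows)

# Same shape as the query on the slide: rank inside each class, keep rank <= 2.
query = """
SELECT student, class, score, rnk FROM (
  SELECT *, rank() OVER (PARTITION BY class ORDER BY score DESC) AS rnk
  FROM row_table
) t
WHERE rnk <= 2
ORDER BY class, rnk
"""
top2 = conn.execute(query).fetchall()
for row in top2:
    print(row)
```

Note that rank() gives tied scores the same rank, so a class with ties can return more than two rows.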

Slide 18

Slide 18 text

Hivemall:
SELECT
  each_top_k(
    2, class, score,
    student, class
  ) as (rank, score, student, class)
FROM (
  SELECT * FROM raw_table
  DISTRIBUTE BY class SORT BY class
) t

Slide 19

Slide 19 text

Hivemall:
SELECT
  each_top_k(
    2, class, score,
    student, class
  ) as (rank, score, student, class)
FROM (
  SELECT * FROM raw_table
  DISTRIBUTE BY class SORT BY class
) t
Note: DISTRIBUTE BY / SORT BY spreads the rows across nodes by class and sorts them

Slide 20

Slide 20 text

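Hivemall:
SELECT
  each_top_k(
    2, class, score,
    student, class
  ) as (rank, score, student, class)
FROM (
  SELECT * FROM raw_table
  DISTRIBUTE BY class SORT BY class
) t
Note: DISTRIBUTE BY / SORT BY spreads the rows across nodes by class and sorts them
Note: each_top_k then extracts only the top entries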
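The idea behind the query above can be sketched in plain Python. This is only an illustration of the "top k per consecutive group" technique under the assumption that rows of the same class arrive together (which is what DISTRIBUTE BY / SORT BY arranges); it is not Hivemall's implementation, and the function signature and sample rows are hypothetical.

```python
import heapq
from itertools import groupby


def each_top_k(k, rows, group_of, key_of):
    """Yield (rank, key, row) for the k largest keys in each run of equal groups.

    Assumes rows belonging to the same group are consecutive in the input.
    """
    for _, run in groupby(rows, key=group_of):
        # nlargest keeps at most k rows per group instead of sorting the group.
        top = heapq.nlargest(k, run, key=key_of)
        for rank, row in enumerate(top, start=1):
            yield rank, key_of(row), row


# Hypothetical sample data: (student, class, score), already grouped by class.
rows = [
    ("alice", "A", 90), ("bob", "A", 80), ("carol", "A", 70),
    ("dave", "B", 85), ("erin", "B", 95), ("frank", "B", 60),
]
result = list(each_top_k(2, rows, group_of=lambda r: r[1], key_of=lambda r: r[2]))
for item in result:
    print(item)
```

Because only k rows per group are ever held, no ranking of a whole group is needed, which is one plausible reason this approach scales better than rank() over very large tables.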

Slide 21

Slide 21 text

But that's about the same length as the plain Hive query...

Slide 22

Slide 22 text

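Performance improved
Run against a table with:
- 1,000 students per class
- 20,000,000 classes in total
Results:
- Hive functions only: not finished even after 2 days
- Hivemall: about 2 hours
* Measured on Treasure Data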

Slide 23

Slide 23 text

Apparently Yui-san (油井さん) originally developed it to speed up exactly this kind of processing

Slide 24

Slide 24 text

Recommended handy functions
- each_top_k
- array_concat
- array_avg
- array_sum

Slide 25

Slide 25 text

Exactly what it looks like:
SELECT array_concat(ARRAY(1,2), ARRAY(3,4))
# _col1
# [1,2,3,4]

Slide 26

Slide 26 text

Exactly what it looks like:
SELECT array_avg(arr) avg, array_sum(arr) sum
FROM (
  SELECT ARRAY(1,2) AS arr
  UNION ALL
  SELECT ARRAY(3,4) AS arr
) t
# avg, sum
# [2,3], [4,6]
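As a sanity check on what these two aggregates compute, here is a small Python sketch of element-wise sum and mean over equal-length arrays. The helper names mirror the Hivemall function names only for readability; this is not Hivemall code, and the exact output types of the real functions may differ.

```python
def array_sum(arrays):
    """Element-wise sum over equal-length arrays."""
    return [sum(col) for col in zip(*arrays)]


def array_avg(arrays):
    """Element-wise mean over equal-length arrays."""
    return [sum(col) / len(col) for col in zip(*arrays)]


arrs = [[1, 2], [3, 4]]
sums = array_sum(arrs)
avgs = array_avg(arrs)
print(avgs, sums)
```

For the two input rows [1,2] and [3,4], the element-wise mean is [2.0, 3.0] and the element-wise sum is [4, 6].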

Slide 27

Slide 27 text

How to find handy functions

Slide 28

Slide 28 text

Typical Hivemall moments, #1
One day, me: "I'd like to compute cosine similarity"
↓
Google "Hivemall cosine similarity"
↓
Me: "Oh, the Hivemall Wiki"

Slide 29

Slide 29 text

Me: "The cosine_sim function, I see"
↓
Me: "Let's try it on TD"

Slide 30

Slide 30 text

Me: "Huh, an error"

Slide 31

Slide 31 text

Me: "Hmmm?"

Slide 32

Slide 32 text

Me: "..."

Slide 33

Slide 33 text

Me: "There is no such function!"

Slide 34

Slide 34 text

How to deal with it

Slide 35

Slide 35 text

1. Open /resources/ddl/define-all.hive in the Hivemall GitHub repository

Slide 36

Slide 36 text

2. Press ctrl+F and search for the string "cosine"

Slide 37

Slide 37 text

Found it!

Slide 38

Slide 38 text

What actually happened...
- There used to be a function called cosine_sim
- At some point it was renamed to cosine_similarity
- Only the Wiki text was left out of date...

Slide 39

Slide 39 text

What is define-all.hive anyway?
- A script that registers Hivemall's UDFs
- Every function usable in Hivemall is listed there (or should be)
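The manual ctrl+F search can also be scripted. The sketch below assumes the file consists of "create temporary function NAME as 'CLASS';" statements; the embedded excerpt is illustrative only, so check the actual define-all.hive for the real function and class names. find_functions is a hypothetical helper, not part of Hivemall.

```python
import re

# Illustrative excerpt in the format define-all.hive is assumed to use.
DDL = """
drop temporary function if exists cosine_similarity;
create temporary function cosine_similarity as 'hivemall.knn.similarity.CosineSimilarityUDF';

drop temporary function if exists array_avg;
create temporary function array_avg as 'hivemall.tools.array.ArrayAvgGenericUDAF';
"""

PATTERN = re.compile(
    r"create\s+temporary\s+function\s+(\w+)\s+as\s+'([^']+)'", re.IGNORECASE
)


def find_functions(ddl, keyword):
    """Return {function name: implementing class} for names containing keyword."""
    return {
        name: cls
        for name, cls in PATTERN.findall(ddl)
        if keyword.lower() in name.lower()
    }


matches = find_functions(DDL, "cosine")
print(matches)
```

The returned class name is also the starting point for looking up a function's expected arguments in the source code.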

Slide 40

Slide 40 text

Typical Hivemall moments, #2
One day, me: "So there's a function called array_avg"
↓
Me: "Let's give it a quick try"
↓
SELECT array_avg(ARRAY(1,2), ARRAY(3,4))

Slide 41

Slide 41 text

Typical Hivemall moments, #2
One day, me: "So there's a function called array_avg"
↓
Me: "Let's give it a quick try"
↓
SELECT array_avg(ARRAY(1,2), ARRAY(3,4))
↓
UDFArgumentException

Slide 42

Slide 42 text

Typical Hivemall moments, #2
↓
Me: "Looks like the arguments are wrong"
Me: "What are the right arguments, then?"

Slide 43

Slide 43 text

Typical Hivemall moments, #2
↓
Me: "Looks like the arguments are wrong"
Me: "What are the right arguments, then?"
↓
Stuck

Slide 44

Slide 44 text

How to deal with it

Slide 45

Slide 45 text

1. As before, look up the function in define-all.hive
↓
2. Check the name of the class that implements the function

Slide 46

Slide 46 text

3. Open that implementation class and...
there is a Description written in it!
(In this case the function apparently takes a single argument)

Slide 47

Slide 47 text

By the way

Slide 48

Slide 48 text

Hive has three kinds of UDFs
- UDF
- UDAF
- UDTF

Slide 49

Slide 49 text

- UDF : User-Defined Function
- UDAF: User-Defined Aggregate Function
- UDTF: User-Defined Table-Generating Function

Slide 50

Slide 50 text

- UDF : IN 1 row, OUT 1 row
- UDAF: IN multiple rows, OUT 1 row
- UDTF: IN 1 row, OUT multiple rows
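The three input/output shapes can be illustrated with ordinary Python functions. This is only an analogy for the row cardinalities listed above, not how Hive UDFs are written (those are Java classes); all function names here are hypothetical.

```python
import math


def udf_log(x):
    # UDF shape: one input row -> one output value.
    return math.log(x)


def udaf_max(values):
    # UDAF shape: many input rows -> one aggregated value.
    return max(values)


def udtf_explode(items):
    # UDTF shape: one input row -> zero or more output rows.
    for item in items:
        yield (item,)


scores = [1.0, 2.0, 10.0]
per_row = [udf_log(s) for s in scores]     # as many results as input rows
aggregated = udaf_max(scores)              # a single value for all rows
generated = list(udtf_explode([1, 2, 3]))  # several rows from one input
print(len(per_row), aggregated, generated)
```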

Slide 51

Slide 51 text

For Hive's built-in functions:
- UDF : exp, log, cosine
- UDAF: count, max, min
- UDTF: can't think of one offhand

Slide 52

Slide 52 text

For Hivemall's functions:
- UDF : mhash, cosine_similarity
- UDAF: array_avg, rmse
- UDTF: train_xxx (the machine learning functions)

Slide 53

Slide 53 text

Contributing to Hivemall

Slide 54

Slide 54 text

I recently contributed to Hivemall

Slide 55

Slide 55 text

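What I did
- Added a UDAF called logloss
- It is a so-called loss function (like RMSE)
- The implementation is mostly copy-pasted from the other loss functions
- Opened a pull request on GitHub and discussed it there
- Submitted references for the function's theoretical definition
- Submitted an implementation in another language
- Submitted test code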
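For readers unfamiliar with log loss, here is a minimal Python sketch of the standard binary definition, the negative mean of y*log(p) + (1-y)*log(1-p). It is not the code contributed to Hivemall; the function signature and the eps clipping that guards against log(0) are assumptions made for this illustration.

```python
import math


def logloss(predicted, actual, eps=1e-15):
    """Mean binary log loss; predictions are clipped to avoid log(0)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for p, y in zip(predicted, actual):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(predicted)


# Hypothetical predictions and 0/1 labels.
loss = logloss([0.9, 0.1, 0.8], [1, 0, 1])
print(loss)
```

Like RMSE, this reduces many (prediction, label) rows to a single number, which is why it fits the UDAF shape.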

Slide 56

Slide 56 text

It looked like this

Slide 57

Slide 57 text

What it took
- Java
- Math knowledge
- English
- Motivation

Slide 58

Slide 58 text

- Java (a little)
- Math knowledge (a little)
- English (a little)
- Motivation (a little)

Slide 59

Slide 59 text

Starting to feel like you could do it too?

Slide 60

Slide 60 text

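However...
- Implementing the machine learning functions is extremely difficult
- The math knowledge in particular is rough going
- The implementation itself is also quite advanced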

Slide 61

Slide 61 text

Let's start with what we can do

Slide 62

Slide 62 text

\ Thank you very much /
