Slide 1

Slide 1 text

Implementing Recommendation Algorithms in Julia
Shotaro Tanaka / @yubessy / Livesense
MACHINE LEARNING Meetup KANSAI #3 LT

Slide 2

Slide 2 text

What I want to say today

Slide 3

Slide 3 text

Julia is running in production

Slide 4

Slide 4 text

That's all

Slide 5

Slide 5 text

In a bit more detail
• Recommendation at Livesense
• Pain points of implementing algorithms in Python
• Why we chose Julia, and how it actually went

Slide 6

Slide 6 text

Recommendation at Livesense

Slide 7

Slide 7 text

Livesense's services

Slide 8

Slide 8 text

Recommendations in the Tenshoku Navi (転職ナビ) job-listing app

Slide 9

Slide 9 text

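Characteristics of job recommendation
Data and system requirements differ from those of e-commerce sites and web advertising
• The numbers of items, users, and rated items per user are not very large
• Periodic batch processing is sufficient; online processing is not required
What is required of the recommendation algorithm
• Reasonably good results even for users with few ratings
• Higher accuracy is preferred, even at some extra computational cost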

Slide 10

Slide 10 text

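The BPMF algorithm
Matrix Factorization modeled with hierarchical Bayes
• Bayesian inference rather than MAP estimation
→ Less prone to overfitting even with little data
• Priors are also placed on the parameters of the user and item factor matrices
→ Little effort is needed for hyperparameter tuning
• Inference is possible with MCMC (Gibbs Sampling)
Details: "Recommendation with BPMF (Bayesian Probabilistic Matrix Factorization)"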
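For reference, the hierarchical model can be written out as below. The slide does not show the equations; the notation follows the original BPMF paper by Salakhutdinov and Mnih (ICML 2008), so the symbols are that paper's rather than anything defined in this talk.

```latex
\begin{align}
R_{ij} \mid U, V &\sim \mathcal{N}\!\left(U_i^{\top} V_j,\ \alpha^{-1}\right) \\
U_i \mid \mu_U, \Lambda_U &\sim \mathcal{N}\!\left(\mu_U,\ \Lambda_U^{-1}\right), \qquad
V_j \mid \mu_V, \Lambda_V \sim \mathcal{N}\!\left(\mu_V,\ \Lambda_V^{-1}\right) \\
(\mu_U, \Lambda_U) &\sim \mathcal{N}\!\left(\mu_U \mid \mu_0,\ (\beta_0 \Lambda_U)^{-1}\right)\,
\mathcal{W}\!\left(\Lambda_U \mid W_0,\ \nu_0\right)
\end{align}
```

The same Normal-Wishart prior is placed on $(\mu_V, \Lambda_V)$. Because these priors are conjugate, each conditional distribution has a closed form, which is what makes Gibbs sampling applicable.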

Slide 11

Slide 11 text

The Python case

Slide 12

Slide 12 text

Sketch of a BPMF implementation in Python

    import numpy as np

    # sample_param_U, sample_param_V, sample_U, sample_V are not shown on the slide
    def bpmf_gibbs_sampling(R, D=10, T=1000):
        N, M = R.shape[0], R.shape[1]
        U, V = np.zeros((T, N, D)), np.zeros((T, M, D))
        # Compute T samples with Gibbs sampling
        for t in range(T - 1):
            # Sample the hyperparameters of U and V
            lamU, muU = sample_param_U(U[t, :, :])
            lamV, muV = sample_param_V(V[t, :, :])
            # Sample U, one user (row of R) at a time
            for i in range(N):
                U[t+1, i, :] = sample_U(R[i, :], U[t, :, :], V[t, :, :], lamU, muU)
            # Sample V, one item (column of R) at a time
            for j in range(M):
                V[t+1, j, :] = sample_V(R[:, j], U[t+1, :, :], V[t, :, :], lamV, muV)
        return U, V
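The slide leaves the per-user sampling step undefined. As a minimal sketch, assuming the standard BPMF conditional (a Gaussian whose precision is the prior precision plus alpha times the sum of outer products of the rated items' factors), one user's update could look like the following. The function names, the boolean `observed` mask, and the default `alpha` are hypothetical and do not come from the talk.

```python
import numpy as np


def user_posterior(r_i, observed, V, lam_U, mu_U, alpha=2.0):
    """Mean and precision of the Gaussian conditional for one user's factors.

    r_i: (M,) ratings of the user; observed: (M,) boolean mask of rated items
    V: (M, D) current item factors
    lam_U: (D, D) prior precision; mu_U: (D,) prior mean
    alpha: rating noise precision (hypothetical default)
    """
    V_obs = V[observed]  # factors of the rated items only
    precision = lam_U + alpha * V_obs.T @ V_obs
    rhs = alpha * V_obs.T @ r_i[observed] + lam_U @ mu_U
    mean = np.linalg.solve(precision, rhs)
    return mean, precision


def sample_user_factor(rng, r_i, observed, V, lam_U, mu_U, alpha=2.0):
    """Draw one sample from N(mean, precision^-1) via a Cholesky factor."""
    mean, precision = user_posterior(r_i, observed, V, lam_U, mu_U, alpha)
    L = np.linalg.cholesky(precision)  # precision = L @ L.T
    z = rng.standard_normal(mean.shape[0])
    # Solving L.T x = z gives x with covariance (L @ L.T)^-1
    return mean + np.linalg.solve(L.T, z)
```

With no rated items the conditional reduces to the prior, which is the property the slides rely on for users with few ratings.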

Slide 13

Slide 13 text

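MCMC and Python
MCMC (Gibbs Sampling)
• Repeatedly computes the next sample from the previous one (random walk)
• A straightforward implementation inevitably involves many operation steps
Python
• An interpreter that executes code one step at a time
• Repeating an operation with many steps inside a for loop tends to be slow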
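To make the sequential dependence concrete, here is a toy Gibbs sampler for a standard bivariate normal with correlation `rho`. It is an illustration only, not the BPMF sampler from the talk; the function name and parameters are made up for this sketch.

```python
import numpy as np


def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    samples = np.zeros((n_samples, 2))
    x, y = 0.0, 0.0
    cond_std = np.sqrt(1.0 - rho ** 2)  # std of x | y and of y | x
    for t in range(n_samples):
        # Each draw conditions on the most recent value of the other variable,
        # so iteration t cannot start before iteration t-1 has finished.
        x = rng.normal(rho * y, cond_std)
        y = rng.normal(rho * x, cond_std)
        samples[t] = (x, y)
    return samples
```

The loop over `t` cannot be replaced by a single vectorized NumPy call, which is why the interpreter overhead of each iteration matters for this kind of algorithm.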

Slide 14

Slide 14 text

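Compile it in Python...?
Cython (Ahead-of-Time compilation)
• Similar syntax, but a different language from Python
• The cost of learning its notation is fairly high
Numba (Just-in-Time compilation)
• Without care, it falls back to the object type and the speedup is lost
• Calls into library functions sometimes cannot be compiled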

Slide 15

Slide 15 text

The Julia case

Slide 16

Slide 16 text

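Affinity with numerical algorithm implementation

                          Julia              Python
Multidimensional arrays   Built-in type      NumPy
Linear algebra            Standard library   NumPy, SciPy
Compilation               JIT by default     Numba, Cython

• The features needed to implement numerical algorithms are included from the start
• With JIT compilation, performance reaches up to about 1/2 that of C (as officially claimed)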

Slide 17

Slide 17 text

Sketch of a BPMF implementation in Julia

    using SparseArrays  # needed on Julia 0.7 and later; SparseMatrixCSC was in Base before that

    function bpmf(R::SparseMatrixCSC{Float64}, D::Int = 10, T::Int = 1000)
        N, M = size(R, 1), size(R, 2)
        U, V = zeros(T, N, D), zeros(T, M, D)
        # Compute T samples with Gibbs sampling
        for t in 1:(T-1)
            # Sample the hyperparameters of U and V
            λ_u, μ_u = sample_param_U(U[t, :, :])
            λ_v, μ_v = sample_param_V(V[t, :, :])
            # Sample U, one user (row of R) at a time
            for i in 1:N
                U[t+1, i, :] = sample_U(R[i, :], V[t, :, :], λ_u, μ_u)
            end
            # Sample V, one item (column of R) at a time
            for j in 1:M
                V[t+1, j, :] = sample_V(R[:, j], U[t+1, :, :], λ_v, μ_v)
            end
        end
        return U, V
    end
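Both sketches return U and V as stacks of T samples with shapes (T, N, D) and (T, M, D). The slides do not say how predictions are formed from them; a common choice in BPMF is the Monte Carlo average of the per-sample products. A minimal sketch in Python (to match the NumPy sketch earlier in the deck), where `predict_ratings` and `burn_in` are hypothetical names:

```python
import numpy as np


def predict_ratings(U, V, burn_in=0):
    """Average U[t] @ V[t].T over the samples kept after burn-in.

    U: (T, N, D) user-factor samples; V: (T, M, D) item-factor samples.
    Returns an (N, M) matrix of predicted ratings.
    """
    U_kept, V_kept = U[burn_in:], V[burn_in:]
    if U_kept.shape[0] == 0:
        raise ValueError("burn_in discards every sample")
    # Sum over factors d within each sample t, then average over t.
    return np.einsum("tnd,tmd->nm", U_kept, V_kept) / U_kept.shape[0]
```

Discarding the first few samples matters here because both sketches initialize the chain with all-zero factors.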

Slide 18

Slide 18 text

Simple benchmark of the BPMF implementations

                                        Run time   Speedup
Python (NumPy, SciPy)                   2382s      1.0
Julia (same implementation as Python)   122s       19.5
Julia (optimized with @inline etc.)     40s        59.5

• Dataset: MovieLens 100k (100k ratings, 9000 movies, 600 users)
• Environment: MBP2017 (2.9 GHz Core i7), 1 process
• Parameters: 10 factors, 100 samples

Slide 19

Slide 19 text

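Using Julia in production
• Almost all of our ML systems are Docker containerized
• Each step of a batch job runs in a separate container
→ Only the numerical computation is in Julia; DB I/O and the like are in Python
DB → (Python) → CSV → (Julia) → CSV → (Python) → DB
The Julia part has no domain-specific processing, so could it be open-sourced?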
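The two Python ends of this pipeline can be sketched as below. This is an illustration under assumptions: the talk does not name the database or the CSV layout, so `sqlite3` stands in for the real DB, and the table names (`ratings`, `scores`) and column names are hypothetical.

```python
import csv
import sqlite3


def export_ratings(conn, csv_path):
    """Dump (user_id, item_id, rating) rows to a CSV file for the numerical step."""
    rows = conn.execute(
        "SELECT user_id, item_id, rating FROM ratings ORDER BY user_id, item_id"
    )
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "item_id", "rating"])
        writer.writerows(rows)


def import_scores(conn, csv_path):
    """Load (user_id, item_id, score) rows written by the numerical step."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        rows = [
            (int(r["user_id"]), int(r["item_id"]), float(r["score"]))
            for r in reader
        ]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO scores (user_id, item_id, score) VALUES (?, ?, ?)", rows
        )
    return len(rows)
```

Keeping the interface to plain CSV files means the numerical container needs no database driver and no knowledge of the schema beyond the three columns.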

Slide 20

Slide 20 text

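Inconveniences
• Weak package manager
• sum and mean become slow depending on the type
• Version compatibility between the DataFrame library and the language itself
• ...
→ In fact, these improved significantly in Julia 1.0 (2018/08)
More on that another day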

Slide 21

Slide 21 text

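Summary
• Sequential computations with many operation steps, such as MCMC, are a pain point for Python
• For implementing numerical algorithms, Julia looks like a strong alternative to Python
• Julia is happily running in production again today
Note: we would like to publish our Julia implementation of BPMF at some point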

Slide 22

Slide 22 text

PR
We have opened a Kyoto office
Now hiring student part-timers as machine learning and data infrastructure engineers
https://recruit.livesense.co.jp/job/+parttimekyoto_data-engineer/