Implementing Recommendation Algorithms in Julia

2018/10/04
MACHINE LEARNING Meetup KANSAI #3

Livesense Inc.

October 05, 2018

Transcript

  1. Implementing Recommendation Algorithms in Julia
    Shotaro Tanaka / @yubessy / Livesense
    MACHINE LEARNING Meetup KANSAI #3 LT
  2. What I want to say today

  3. Julia is running in production

  4. That's all

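  5. In a bit more detail
    • Recommendation at Livesense
    • The pain points of implementing algorithms in Python
    • What we expected when choosing Julia, and how it actually went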

  6. Recommendation at Livesense

  7. Livesense's services

  8. Recommendations in the Tenshoku Navi job-listing app

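  9. Characteristics of job recommendation
    Data and system requirements differ from those of e-commerce sites and web advertising
    • The number of items, the number of users, and the number of rated items per user are not very large
    • Periodic batch processing is acceptable; online processing is not required
    What is required of the recommendation algorithm
    • Reasonably good results even for users with few ratings
    • Higher accuracy is preferred, even at some extra computational cost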
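  10. The BPMF algorithm
    Models matrix factorization with a hierarchical Bayesian model
    • Bayesian estimation rather than MAP estimation → less prone to overfitting even with little data
    • Priors are also placed on the parameters of the user and item factor matrices → little effort is needed for hyperparameter tuning
    • Inference can be done with MCMC (Gibbs sampling)
    Details: Recommendation with BPMF (Bayesian Probabilistic Matrix Factorization)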
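For reference, the model summarized on this slide can be written out as follows. This is a sketch in the notation of the standard BPMF formulation (Salakhutdinov and Mnih, 2008), not something shown in the talk: the symbols $\alpha$ (rating precision) and $\Omega_i$ (the set of items rated by user $i$) are assumptions introduced here, and the Gaussian-Wishart hyperprior on $(\mu_U, \Lambda_U)$ and $(\mu_V, \Lambda_V)$ is omitted.

```latex
\begin{align}
  % Likelihood of an observed rating
  p(R_{ij} \mid U_i, V_j) &= \mathcal{N}\!\left(R_{ij} \mid U_i^{\top} V_j,\ \alpha^{-1}\right) \\
  % Priors on the user and item factor vectors
  p(U_i \mid \mu_U, \Lambda_U) &= \mathcal{N}\!\left(U_i \mid \mu_U,\ \Lambda_U^{-1}\right), \qquad
  p(V_j \mid \mu_V, \Lambda_V) = \mathcal{N}\!\left(V_j \mid \mu_V,\ \Lambda_V^{-1}\right) \\
  % Gibbs conditional used to resample one user's factor vector
  p(U_i \mid R, V, \mu_U, \Lambda_U) &= \mathcal{N}\!\left(U_i \mid \mu_i^{*},\ [\Lambda_i^{*}]^{-1}\right) \\
  \Lambda_i^{*} &= \Lambda_U + \alpha \sum_{j \in \Omega_i} V_j V_j^{\top} \\
  \mu_i^{*} &= [\Lambda_i^{*}]^{-1} \left( \alpha \sum_{j \in \Omega_i} R_{ij} V_j + \Lambda_U \mu_U \right)
\end{align}
```

The conditional for $V_j$ is symmetric, with the roles of users and items swapped. Because each conditional is Gaussian, every Gibbs step is a closed-form draw, which is why inference reduces to many small sequential updates.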
  11. The Python case

  12. Sketch of a BPMF implementation in Python

        import numpy as np

        # The helper samplers (sample_param_U, sample_U, ...) are not shown on the slide.
        def bpmf_gibbs_sampling(R, D=10, T=1000):
            N, M = R.shape[0], R.shape[1]
            U, V = np.zeros((T, N, D)), np.zeros((T, M, D))
            # Compute T samples with Gibbs sampling
            for t in range(T - 1):
                # Sample the parameters of U and V
                lamU, muU = sample_param_U(U[t, :, :])
                lamV, muV = sample_param_V(V[t, :, :])
                # Sample U, one user (row of R) at a time
                for i in range(N):
                    U[t+1, i, :] = sample_U(R[i, :], U[t, :, :], V[t, :, :], lamU, muU)
                # Sample V, one item (column of R) at a time
                for j in range(M):
                    V[t+1, j, :] = sample_V(R[:, j], U[t+1, :, :], V[t, :, :], lamV, muV)
            return U, V
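The slide stops at returning the sampled factors. As a minimal sketch of how such samples are typically turned into a prediction, the function below averages the dot product of a user's and an item's factor vectors over the retained samples. The function name `predict_rating` and the `burn_in` parameter are hypothetical and do not come from the talk; the inputs are assumed to be indexable as `U[t][i]` and `V[t][j]`, which holds both for nested lists and for arrays of shape (T, N, D) and (T, M, D).

```python
def predict_rating(U, V, i, j, burn_in=0):
    """Monte Carlo estimate of user i's rating of item j.

    U[t][i] and V[t][j] are the factor vectors of sample t.
    The first `burn_in` samples are discarded (hypothetical parameter).
    """
    T = len(U)
    if not 0 <= burn_in < T:
        raise ValueError("burn_in must satisfy 0 <= burn_in < number of samples")
    total = 0.0
    for t in range(burn_in, T):
        # Dot product of the user's and the item's factor vectors for sample t
        total += sum(u * v for u, v in zip(U[t][i], V[t][j]))
    return total / (T - burn_in)
```

Averaging over samples, rather than using a single point estimate, is what distinguishes the Bayesian prediction from a MAP-style one.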
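  13. MCMC and Python
    MCMC (Gibbs sampling)
    • Repeatedly computes the next sample from the previous one (a random walk)
    • A straightforward implementation inevitably involves a large number of computation steps
    Python
    • Interpreted: code is executed one step at a time
    • Work that repeats many-step computations in a for loop tends to be slow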
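To make the sequential dependence concrete, here is a toy Gibbs sampler, written as a sketch under assumptions rather than anything from the talk. It targets a standard bivariate normal with correlation `rho` instead of BPMF, and the names `gibbs_bivariate_normal` and `correlation` are hypothetical. Each iteration needs the value drawn in the previous one, so the loop cannot simply be replaced by a single vectorized array operation.

```python
import random


def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho * rho) ** 0.5  # std. dev. of x | y and of y | x
    x, y = 0.0, 0.0                     # arbitrary starting point
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, cond_sd)  # x | y ~ N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, cond_sd)  # y | x ~ N(rho * x, 1 - rho^2)
        samples.append((x, y))
    return samples


def correlation(samples):
    """Sample Pearson correlation of the (x, y) pairs."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples)
    sxx = sum((x - mx) ** 2 for x, _ in samples)
    syy = sum((y - my) ** 2 for _, y in samples)
    return sxy / (sxx * syy) ** 0.5
```

With enough samples, `correlation(gibbs_bivariate_normal(0.5, 20000))` should land close to 0.5. In BPMF the same loop structure appears, but each step is a draw of a D-dimensional vector for every user and item.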
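  14. Compile Python... maybe?
    Cython (Ahead-of-Time compilation)
    • The syntax is similar, but it is a different language from Python
    • The cost of learning its notation is fairly high
    Numba (Just-in-Time compilation)
    • Unless you are careful, it falls back to the object type and the speedup is lost
    • Calls to library functions sometimes cannot be compiled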
  15. The Julia case

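  16. Affinity with numerical algorithm implementation

                                   Julia               Python
        Multidimensional arrays    Built-in type       NumPy
        Linear algebra             Standard library    NumPy, SciPy
        Compilation                JIT by default      Numba, Cython

    • The features needed to implement numerical algorithms are included from the start
    • With JIT compilation, performance reaches at most about 1/2 that of C (official claim)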
  17. Sketch of a BPMF implementation in Julia

        using SparseArrays

        function bpmf(R::SparseMatrixCSC{Float64}, D::Int = 10, T::Int = 1000)
            N, M = size(R, 1), size(R, 2)
            U, V = zeros(T, N, D), zeros(T, M, D)
            # Compute T samples with Gibbs sampling
            for t in 1:(T-1)
                # Sample the parameters of U and V
                λ_u, μ_u = sample_param_U(U[t, :, :])
                λ_v, μ_v = sample_param_V(V[t, :, :])
                # Sample U, one user (row of R) at a time
                for i in 1:N
                    U[t+1, i, :] = sample_U(R[i, :], V[t, :, :], λ_u, μ_u)
                end
                # Sample V, one item (column of R) at a time
                for j in 1:M
                    V[t+1, j, :] = sample_V(R[:, j], U[t+1, :, :], λ_v, μ_v)
                end
            end
            return U, V
        end
  18. Quick benchmark of the BPMF implementations

                                                  Run time    Speedup
        Python (NumPy, SciPy)                     2382 s      1.0x
        Julia (same implementation as Python)     122 s       19.5x
        Julia (optimized with @inline etc.)       40 s        59.5x

    • Dataset: MovieLens 100k (100k ratings, 9000 movies, 600 users)
    • Environment: MBP2017 (2.9 GHz Core i7), 1 process
    • Parameters: 10 factors, 100 samples
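  19. Getting Julia into production
    • Almost all of our ML systems run in Docker containers
    • Each step of a batch job runs in its own container
      → Only the numerical computation is in Julia; DB input/output and the like are in Python
    DB → (Python) → CSV → (Julia) → CSV → (Python) → DB
    The Julia part contains no domain-specific processing, so could it be released as OSS?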
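The Python ends of that pipeline can be sketched as below. This is an illustration under assumptions, not the production code: `sqlite3` stands in for the real database, and the table names (`ratings`, `predictions`), column names, and function names are all hypothetical. The Julia step in the middle is assumed to read the first CSV and write the second.

```python
import csv
import io
import sqlite3


def export_ratings(conn, out):
    """DB -> CSV: dump ratings for the numerical step (hypothetical schema)."""
    writer = csv.writer(out)
    writer.writerow(["user_id", "item_id", "rating"])
    rows = conn.execute(
        "SELECT user_id, item_id, rating FROM ratings ORDER BY user_id, item_id"
    )
    writer.writerows(rows)


def import_predictions(conn, src):
    """CSV -> DB: load scores written by the numerical step (hypothetical schema)."""
    reader = csv.DictReader(src)
    conn.executemany(
        "INSERT INTO predictions (user_id, item_id, score) VALUES (?, ?, ?)",
        ((int(r["user_id"]), int(r["item_id"]), float(r["score"])) for r in reader),
    )
    conn.commit()
```

Both functions take file-like objects, so they work with real files shared between containers as well as with `io.StringIO` in tests. Keeping plain CSV as the interface means the Julia container needs no knowledge of the database.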
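  20. What was inconvenient
    • The package manager was weak
    • sum and mean become slow depending on the type
    • Version compatibility between the DataFrame library and the language itself
    • ...
    → In fact, these were greatly improved in Julia 1.0 (2018/08)
    More on that another day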
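  21. Summary
    • Sequential computation with many steps, such as MCMC, is a pain point for Python
    • For implementing numerical algorithms, Julia looks like a strong alternative to Python
    • Julia is running happily in production again today
    Note: I would like to publish the Julia implementation of BPMF at some point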
  22. PR
    We have opened a Kyoto office
    Now hiring student part-timers as machine learning and data platform engineers
    https://recruit.livesense.co.jp/job/+parttimekyoto_data-engineer/