Gradient Descent in Go - Theory and Practice - / gradient-descent-in-golang


Math Study Group for Programmers @ Fukuoka #5
http://maths4pg-fuk.connpass.com/event/34164/


monochromegane

August 05, 2016

Transcript

  1. Yusuke Miyake, GMO PEPABO inc. / Math Study Group for Programmers @ Fukuoka / Gradient Descent in Go - Theory and Practice -

  2. Principal Engineer, Yusuke Miyake (@monochromegane), minne Division / http://blog.monochromegane.com

  3. Agenda: What is gradient descent / Steepest descent / Stochastic gradient descent / Optimizing gradient descent / Summary

  4. What is gradient descent?

  5. What is gradient descent?
     - One of the techniques used in machine learning to train a model.
     - The parameters inside the model are updated so that the error between the model and the training data becomes as small as possible.
     - The parameters are updated by repeatedly differentiating the function that defines the error and moving toward its minimum.

  6. I see...??

  7. For example, suppose the function that defines the error is f(x) = (x − 1)^2. The value of x that minimizes this function is what we want to find; at that point the error is considered minimal.
  8. In other words, keep differentiating and look for the point where the slope becomes 0.

  9. Guesswork? That would never finish, so instead we repeat the update x := x − (d/dx) f(x), increasing (or decreasing) x based on the slope we computed: if the sign of the derivative is negative, increase x; if it is positive, decrease x.
  10. Learning rate. The learning rate η adjusts how much x is updated per step: x := x − η (d/dx) f(x). If η is too large, x moves too far and the iteration may fail to converge or may even diverge; if it is too small, x moves only slightly and the number of iterations may grow.
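To make the update rule above concrete, here is a minimal sketch (not part of the original slides) of one-dimensional gradient descent on f(x) = (x − 1)^2 in Go; the starting point, the learning rate eta, and the iteration count are arbitrary illustrative choices.

      package main

      import "fmt"

      func main() {
          // f(x) = (x-1)^2, so its derivative is df/dx = 2*(x-1).
          df := func(x float64) float64 { return 2 * (x - 1) }

          x := 5.0   // arbitrary starting point
          eta := 0.1 // learning rate: too large may diverge, too small converges slowly
          for i := 0; i < 100; i++ {
              x -= eta * df(x) // step against the slope
          }
          fmt.Println(x) // approaches the minimizer x = 1
      }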
  11. Objective function
     - Defines the error between the model and the training data. Let θ denote the parameters we want to find.

       E(θ) = (1/2) Σ_{i=1}^{n} (y_i − f_θ(x_i))^2

     - (y_i − f_θ(x_i)) is the difference (error) between a training value y and the prediction computed by the model with the parameters θ at that point in time; E(θ) is the sum of squared errors over all training data.
  12. Objective function
     - All that remains is to differentiate the objective function, i.e. the function that defines the error, with respect to the parameters and drive the error toward its minimum. → steepest descent

  13. Steepest descent - gradient descent, GD -

  14. A concrete example

  15. Polynomial regression
     Training set: data generated from a sine function, with random noise of a given standard deviation added.
     Model: prediction with a third-degree polynomial,

       f_θ(x) = θ_0 + θ_1 x + θ_2 x^2 + θ_3 x^3
  16. Polynomial regression
     Objective function:

       E(θ) = (1/2) Σ_{i=1}^{n} (y_i − f_θ(x_i))^2,  with the model f_θ(x) = θ_0 + θ_1 x + θ_2 x^2 + θ_3 x^3

     The parameters are updated using the derivatives obtained by partially differentiating E(θ) with respect to each model parameter θ_0 ... θ_3, as worked out below and listed on the next slide.
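For reference, the partial derivative behind the update equations on the next slide can be worked out with the chain rule (this step is implied but not written out on the slides):

       ∂E(θ)/∂θ_j = Σ_{i=1}^{n} (f_θ(x_i) − y_i) · ∂f_θ(x_i)/∂θ_j = Σ_{i=1}^{n} (f_θ(x_i) − y_i) · x_i^j

so each θ_j is moved against its own gradient: θ_j := θ_j − η Σ_{i=1}^{n} (f_θ(x_i) − y_i) x_i^j.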
  17. Polynomial regression
     Parameter update equations (the partial derivative with respect to θ_0, θ_1, θ_2, θ_3 respectively):

       θ_0 := θ_0 − η Σ_{i=1}^{n} (f_θ(x_i) − y_i)
       θ_1 := θ_1 − η Σ_{i=1}^{n} (f_θ(x_i) − y_i) x_i
       θ_2 := θ_2 − η Σ_{i=1}^{n} (f_θ(x_i) − y_i) x_i^2
       θ_3 := θ_3 − η Σ_{i=1}^{n} (f_θ(x_i) − y_i) x_i^3
  18. Polynomial regression with steepest descent (Golang)

       // fθ(x): the model
       func PredictionFunction(x float64, thetas []float64) float64 {
           result := 0.0
           for i, theta := range thetas {
               result += theta * math.Pow(x, float64(i))
           }
           return result
       }

       // E(θ): the objective function
       func ObjectiveFunction(trainings DataSet, thetas []float64) float64 {
           result := 0.0
           for _, training := range trainings {
               result += math.Pow((training.Y - PredictionFunction(training.X, thetas)), 2)
           }
           return result / 2.0
       }
  19. Polynomial regression with steepest descent (Golang)

       // Gradient for a single parameter: index selects θ_index,
       // implementing Σ (f_θ(x_i) − y_i) x_i^index from the update equations.
       func gradient(dataset DataSet, thetas []float64, index int, batchSize int) float64 {
           result := 0.0
           for _, data := range dataset[0:batchSize] {
               result += ((PredictionFunction(data.X, thetas) - data.Y) * math.Pow(data.X, float64(index)))
           }
           return result
       }

     (Update equations as on slide 17: θ_j := θ_j − η Σ_{i=1}^{n} (f_θ(x_i) − y_i) x_i^j.)
  20. Polynomial regression with steepest descent (Golang)

       // learning (update parameters)
       for i := 0; i < opt.Epoch; i++ {
           // update parameters by gradient descent
           org_thetas := make([]float64, cap(thetas))
           copy(org_thetas, thetas)
           shuffled := dataset.Shuffle()
           for j, _ := range thetas {
               // compute gradient
               gradient := gradient(shuffled, org_thetas, j, batchSize)
               // update parameter
               thetas[j] = org_thetas[j] - (opt.LearingRate * gradient)
           }
       }
  21. Polynomial regression with steepest descent

  22. Stochastic gradient descent - stochastic gradient descent, SGD -

  23. Problems with steepest descent

  24. Problems with steepest descent
     - Computing the error for every parameter update requires a sum over the entire training set:

       E(θ) = (1/2) Σ_{i=1}^{n} (y_i − f_θ(x_i))^2

       → when the training set is very large, the amount of computation becomes enormous.
     - Because the whole training set is used, the descent follows the exact gradient every time.
       → the chance of getting trapped in a local minimum is high.
  25. Stochastic gradient descent - stochastic gradient descent, SGD -

  26. Polynomial regression with SGD
     Parameter update equations: instead of the sum of squared errors over the whole set, the parameters are updated using a single randomly selected data point:

       θ_0 := θ_0 − η Σ_{i=1}^{1} (f_θ(x_i) − y_i)
       θ_1 := θ_1 − η Σ_{i=1}^{1} (f_θ(x_i) − y_i) x_i
       θ_2 := θ_2 − η Σ_{i=1}^{1} (f_θ(x_i) − y_i) x_i^2
       θ_3 := θ_3 − η Σ_{i=1}^{1} (f_θ(x_i) − y_i) x_i^3

     The sum runs from i = 1 to 1, i.e. it has a single term. Rather than always learning from a fixed i-th training example, the training set is shuffled every time and the data point at its head is used for the update.
  27. Polynomial regression with stochastic gradient descent (Golang)

       // GD: batchSize=len(dataset), SGD: batchSize=1
       batchSize := len(dataset)
       if opt.Algorithm == "sgd" {
           if opt.BatchSize == -1 {
               batchSize = 1
           }
       }
  28. Stochastic gradient descent: convergence by learning rate

  29. Mini-batch gradient descent - mini-batch gradient descent, mini-batch SGD -

  30. mini-batch SGD
     A batch size B with 1 ≤ B ≤ len(training set) is fixed and the parameters are updated per mini-batch, aiming for the best of both steepest descent and stochastic gradient descent. Stochastic gradient descent can be seen as the special case B = 1.

       θ_0 := θ_0 − η Σ_{i=1}^{B} (f_θ(x_i) − y_i)
       θ_1 := θ_1 − η Σ_{i=1}^{B} (f_θ(x_i) − y_i) x_i
       θ_2 := θ_2 − η Σ_{i=1}^{B} (f_θ(x_i) − y_i) x_i^2
       θ_3 := θ_3 − η Σ_{i=1}^{B} (f_θ(x_i) − y_i) x_i^3

     The sum runs from i = 1 to the mini-batch size; the training set is shuffled every time and the data from its head up to the mini-batch size is used for the update.
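To make the "shuffle, then take the head of the data" step concrete, here is a small self-contained sketch (not from the slides) of drawing one mini-batch in Go; the Point type, the toy data, and the batch size B are illustrative placeholders.

      package main

      import (
          "fmt"
          "math/rand"
      )

      type Point struct{ X, Y float64 }

      func main() {
          // Toy training set of (x, y) pairs.
          dataset := []Point{{0, 0}, {1, 0.8}, {2, 0.9}, {3, 0.1}, {4, -0.7}, {5, -1.0}}

          B := 2 // mini-batch size, 1 <= B <= len(dataset); B = 1 is plain SGD

          // Shuffle the whole set, then use its first B points for this update,
          // mirroring "shuffle every time and take the data at the head".
          rand.Shuffle(len(dataset), func(i, j int) {
              dataset[i], dataset[j] = dataset[j], dataset[i]
          })
          miniBatch := dataset[:B]
          fmt.Println(miniBatch)
      }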
  31. Optimizing gradient descent - optimization -

  32. Momentum

  33. Speeding up the convergence of gradient descent

  34. Momentum
     Incorporating the idea of momentum (inertia) into the parameter update speeds up convergence (SGD without momentum vs. SGD with momentum):

       v_k = γ v_{k−1} + η ∇E(θ)
       θ_k = θ_{k−1} − v_k

     The gradient steps taken so far are accumulated as inertia: movement in the same direction increases the momentum, while movement that changes direction decreases it. (gradient / accumulated momentum)
     Reference: "Momentum and Learning Rate Adaptation", https://www.willamette.edu/~gorr/classes/cs449/momrate.html
  35. Optimization with Momentum (Golang)

       for j, _ := range thetas {
           // compute gradient
           gradient := gradient(shuffled, org_thetas, j, batchSize)
           // Use momentum if momentum option is passed
           velocities[j] = opt.Momentum*velocities[j] - (opt.LearingRate * gradient)
           // update parameter
           thetas[j] = org_thetas[j] + velocities[j]
       }

     This implements v_k = γ v_{k−1} + η ∇E(θ), θ_k = θ_{k−1} − v_k; the code stores the velocity with the opposite sign, so the parameter is updated by adding it.
  36. Optimization with Momentum: convergence by learning rate

  37. AdaGrad

  38. Automatically adjusting the learning rate

  39. AdaGrad
     One of the methods that automatically adjusts the learning rate while the model is trained:

       G_k = G_{k−1} + (∇E(θ_{k−1}))^2
       θ_k = θ_{k−1} − (η / √(G_k + ε)) ∇E(θ_{k−1})

     The initial learning rate η divided by the square root of the accumulated squared gradients is used as the effective learning rate.
     Merits: the learning rate is adjusted per parameter; parameters that have changed little are learned in large steps, while parameters that have changed a lot are learned in small steps.
     Demerit: since the accumulated gradients sit in the denominator, the learning rate becomes very small as training progresses. → set the initial learning rate somewhat large.
  40. Optimization with AdaGrad (Golang)

       for j, _ := range thetas {
           ~~~~
           // optimize by AdaGrad
           gradients[j] += math.Pow(gradient, 2)
           learningRate := opt.LearingRate / (math.Sqrt(gradients[j] + opt.Epsilon))
           update = -(learningRate * gradient)
           ~~~~
       }

       G_k = G_{k−1} + (∇E(θ_{k−1}))^2
       θ_k = θ_{k−1} − (η / √(G_k + ε)) ∇E(θ_{k−1})
  41. Optimization with AdaGrad: convergence by learning rate

  42. AdaDelta

  43. Automatically adjusting the learning rate

  44. AdaDelta
     Another method that automatically adjusts the learning rate while the model is trained.
     - Avoids the monotonic decay of the learning rate: instead of simply summing the gradients, it keeps a decaying average of them, so the effective learning rate is computed from recent gradients.
     - No initial learning rate to set: the initial learning rate is replaced by a decaying average of the parameter update values themselves.

       E[g^2]_t = ρ E[g^2]_{t−1} + (1 − ρ) g_t^2
       Δθ_t = −(√(E[Δθ^2]_{t−1} + ε) / √(E[g^2]_t + ε)) g_t
       E[Δθ^2]_t = ρ E[Δθ^2]_{t−1} + (1 − ρ) Δθ_t^2
       θ_{t+1} = θ_t + Δθ_t
  45. AdaDelta
     Another method that automatically adjusts the learning rate while the model is trained:

       E[g^2]_t = ρ E[g^2]_{t−1} + (1 − ρ) g_t^2              (accumulate a decaying average of the squared gradients)
       Δθ_t = −(√(E[Δθ^2]_{t−1} + ε) / √(E[g^2]_t + ε)) g_t   (derive the learning rate from the recent gradients and parameter updates to get the new update)
       E[Δθ^2]_t = ρ E[Δθ^2]_{t−1} + (1 − ρ) Δθ_t^2           (accumulate a decaying average of the parameter updates using the value just computed)
       θ_{t+1} = θ_t + Δθ_t                                   (update the parameter)
  46. Optimization with AdaDelta (Golang)

       for j, _ := range thetas {
           ~~~~
           // optimize by AdaDelta
           gradients[j] = (opt.DecayRate * gradients[j]) + (1.0-opt.DecayRate)*math.Pow(gradient, 2)
           update = -(math.Sqrt(updates[j]+opt.Epsilon) / math.Sqrt(gradients[j]+opt.Epsilon)) * gradient
           updates[j] = (opt.DecayRate * updates[j]) + (1.0-opt.DecayRate)*math.Pow(update, 2)
           ~~~~
       }

     (gradients[j] holds E[g^2], updates[j] holds E[Δθ^2], and opt.DecayRate is ρ in the equations on slide 45.)
  47. Optimization with AdaDelta: convergence by decay rate

  48. Learning-rate transition of AdaGrad vs. AdaDelta

  49. Comparison

  50. Comparison of convergence across the gradient descent variants and optimizers

  51. Summary

  52. Summary
     - In machine learning, a model is trained by minimizing the error with gradient descent.
     - There are many variants of gradient descent and of its optimizers; to pick one suited to a given training set, trial and error (choosing the algorithm, tuning the hyperparameters) is still necessary at present.
     - The newest method is not always the best...
     - Hyperparameters are not going away...
     - Implementing it yourself deepens your understanding, so it is well worth doing!

  53. Code

  54. Code
     A sample implementation of gradient descent in Go is available at:
     https://github.com/monochromegane/gradient_descent
     Usage:

       $ go run cmd/gradient_descent/main.go \
           -eta 0.075 \
           -m 3 \
           -epoch 40000 \
           -algorithm sgd \
           -momentum 0.9
  55. The end

  56. Why not come work at PEPABO too? Check the latest job openings → @pb_recruit