
Machine Learning and Automatic Differentiation

itakigawa
January 29, 2024

Friday, January 26, 2024: Doshisha University, "Optimization Methods", Lecture 15

Transcript

  1. (Speaker self-introduction) https://itakigawa.github.io/
  2. ChatGPT (released November 2022) and the resulting AI boom; GAFAM and the IT industry in 2023; OpenAI's ChatGPT.
  3. What ChatGPT can do (six examples, including DIY customization of OpenAI's ChatGPT, GPTs, and DALL-E); ChatGPT free and paid plans.
  4. Microsoft and OpenAI: ChatGPT and DALL-E power Microsoft's AI products. Microsoft's market capitalization (about 440 trillion yen) now exceeds Apple's and is comparable to the GDP of the UK, France, or Italy. Products: Bing (and Edge) with ChatGPT-based chat on the web; GitHub Copilot; Windows 11 and Microsoft 365 Copilot / Copilot Pro; Microsoft Copilot.
  5. AI in Windows 11 / Microsoft 365 Copilot: AI assistance across MS Office (Word, Excel, PowerPoint, Outlook, Teams), with examples in Word, PowerPoint, Excel, on the web, and in Windows.
  6. Generative AI topics (as of December 2023); ChatGPT and AI; reference: http://hdl.handle.net/2433/286548; Preferred Networks.
  7. (Part 1)

  8. (Part 2)

  9. Learning methods compared on queries q1, q2, q3, q4, …: Random Forest, GBDT, Nearest Neighbor, SVM, Gaussian Process, Neural Network.
  10. Model comparison (continued): training points p1, p2, p3, p4, … versus queries q1, q2, q3, q4, …
  11. Supervised learning: a model $f(x)$ predicts a target $y$, and a loss $\ell(f(x), y)$ measures the error; typical choices are mean squared error (MSE) for regression and cross entropy for classification. Learning = optimization: minimize $L(\theta) = \sum_{i=1}^{n} \ell(f(x_i; \theta), y_i) + \Omega(\theta)$ over the parameters $\theta$.
  12. Example: logistic regression. Data: $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$, where each $y_i$ takes the value 0 or 1. Model: $f(x) = \sigma(ax + b) = \frac{1}{1 + e^{-(ax+b)}} = P(y = 1 \mid x)$. The loss is $(-1)\times$ the log-likelihood, i.e. the cross entropy: $L(a, b) = -\log \prod_{i: y_i = 1} P(y = 1 \mid x_i) \prod_{i: y_i = 0} P(y = 0 \mid x_i) = -\sum_{i=1}^{n} \left[ y_i \log f(x_i) + (1 - y_i) \log (1 - f(x_i)) \right]$ (a code sketch follows).
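    To make this concrete, here is a minimal NumPy sketch (the function names and toy data are illustrative assumptions, not from the slides) that evaluates the cross-entropy loss $L(a, b)$ for the logistic model:

      import numpy as np

      def f(x, a, b):
          # logistic model: P(y=1|x) = sigmoid(a*x + b)
          return 1.0 / (1.0 + np.exp(-(a * x + b)))

      def cross_entropy(a, b, xs, ys):
          # L(a,b) = -sum_i [ y_i log f(x_i) + (1-y_i) log(1-f(x_i)) ]
          p = f(xs, a, b)
          return -np.sum(ys * np.log(p) + (1 - ys) * np.log(1 - p))

      xs = np.array([0.5, -1.2, 2.0])  # toy inputs (hypothetical)
      ys = np.array([1, 0, 1])         # binary labels
      print(cross_entropy(1.0, 0.0, xs, ys))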
  13. The simplest case: 1 input, 1 output. Computation graph: $x \to$ Linear $\to z = ax + b \to$ Sigmoid $\to \hat y = \sigma(z)$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$, so $f(x) = \sigma(ax + b) = \frac{1}{1 + e^{-(ax+b)}}$. For data $(x_1, y_1), (x_2, y_2), (x_3, y_3)$, minimize
    $L(a,b) = -y_1 \log \frac{1}{1+e^{-(ax_1+b)}} - (1-y_1) \log\left(1 - \frac{1}{1+e^{-(ax_1+b)}}\right) - y_2 \log \frac{1}{1+e^{-(ax_2+b)}} - (1-y_2) \log\left(1 - \frac{1}{1+e^{-(ax_2+b)}}\right) - y_3 \log \frac{1}{1+e^{-(ax_3+b)}} - (1-y_3) \log\left(1 - \frac{1}{1+e^{-(ax_3+b)}}\right)$
  14. Chaining two 1-input, 1-output units (part 1): $x \to z = \sigma(a_1 x + b_1) \to \hat y = \sigma(a_2 z + b_2)$. For the same 3 data points, minimize $L(a_1, a_2, b_1, b_2) = -\sum_{i=1}^{3} \left[ y_i \log \frac{1}{1 + e^{-\left(a_2 \frac{1}{1 + e^{-(a_1 x_i + b_1)}} + b_2\right)}} + (1 - y_i) \log\left(1 - \frac{1}{1 + e^{-\left(a_2 \frac{1}{1 + e^{-(a_1 x_i + b_1)}} + b_2\right)}}\right) \right]$; the nested exponentials already make the written-out formula unwieldy.
  15. Chaining the units (part 2), a small network with two hidden units: $x \to$ Linear($a_1, b_1$) $\to u_1 \to$ Sigmoid $\to z_1$ and $x \to$ Linear($a_2, b_2$) $\to u_2 \to$ Sigmoid $\to z_2$, combined by Linear($a_3, b_3, c_3$): $v = a_3 z_1 + b_3 z_2 + c_3 \to$ Sigmoid $\to \hat y$. Minimize $L(a_1, a_2, a_3, b_1, b_2, b_3, c_3)$ (a forward-pass sketch follows).
  16. Modern models have many parameters. CNNs: ResNet50 26 million, AlexNet 61 million, ResNeXt101 84 million, VGG19 143 million. Transformers: LLaMa 65 billion, Chinchilla 70 billion, GPT-3 175 billion, Gopher 280 billion, PaLM 540 billion.
  17. Training by gradient descent on $L(\theta_1, \theta_2, \cdots)$: 1. initialize the parameters $\theta_1, \theta_2, \cdots$; 2. compute $L(\theta_1, \theta_2, \cdots)$ and its gradient with respect to $\theta_1, \theta_2, \cdots$; 3. update the parameters and repeat (until convergence).
  18. Partial derivatives. For $f(x, y)$ at a point $(a, b)$: the partial derivative in $x$ treats $y = b$ as fixed and differentiates at $x = a$, $f_x(a, b) = \lim_{h \to 0} \frac{f(a + h,\, b) - f(a,\, b)}{h}$, and similarly for $f_y(a, b)$ (a numerical sketch follows).
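    A partial derivative can be checked numerically with a small finite step; a sketch with a hypothetical $f$ and step size:

      def f(x, y):
          return x**2 * y + y   # hypothetical example function

      def fx(a, b, h=1e-6):
          # forward-difference approximation of f_x(a, b)
          return (f(a + h, b) - f(a, b)) / h

      print(fx(2.0, 3.0))  # exact value is 2*a*b = 12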
  19. Gradient descent = repeatedly stepping $\boldsymbol{x}$ against the gradient of $f(\boldsymbol{x})$, scaled by the learning rate (step size) $\alpha$: for $\boldsymbol{x} = (x_1, x_2, \cdots, x_d)^\top$,
    $\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \leftarrow \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} - \alpha \begin{pmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \\ \vdots \\ \partial f/\partial x_d \end{pmatrix}$ (a sketch follows).
  20. Many refinements of plain gradient descent are used in practice: Momentum, AdaGrad, RMSProp, Adam. See https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c (a Momentum sketch follows).
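    As one example, Momentum keeps a running velocity instead of using the raw gradient; a sketch of the commonly stated form of the update (coefficients hypothetical):

      import numpy as np

      def grad_f(x):
          return np.array([2 * x[0], 4 * x[1]])  # same hypothetical f as above

      x = np.array([3.0, -2.0])
      v = np.zeros_like(x)       # velocity
      alpha, beta = 0.1, 0.9     # learning rate, momentum coefficient
      for _ in range(100):
          v = beta * v - alpha * grad_f(x)  # decaying accumulation of gradients
          x = x + v
      print(x)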
  21. Back to the small network ($x \to$ Linear($a_1, b_1$) $\to u_1$, Linear($a_2, b_2$) $\to u_2$, Sigmoid $\to z_1, z_2$, Linear($a_3, b_3, c_3$) $\to v$, Sigmoid $\to y$): training it requires the partial derivatives $\partial y/\partial a_1, \partial y/\partial b_1, \cdots, \partial y/\partial c_3$ of the output with respect to every parameter.
  22. The chain rule. For a single chain $x \xrightarrow{f} u \xrightarrow{g} y$ with $u = f(x)$ and $y = g(u)$: $\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}$. For multiple paths $t \to x_1, x_2, \cdots \to z$ with $x_1 = f_1(t), x_2 = f_2(t), \cdots$ and $z = g(x_1, x_2, \cdots)$: $\frac{\partial z}{\partial t} = \frac{\partial z}{\partial x_1}\frac{\partial x_1}{\partial t} + \frac{\partial z}{\partial x_2}\frac{\partial x_2}{\partial t} + \cdots$
  23. Example: $y = \frac{e^{-2x}}{x}$. Set $a = e^{-x}$ and $b = \frac{1}{x}$, so $y = a^2 b$ (graph: $x \to a, b \to y$). Direct differentiation: $\frac{\partial y}{\partial x} = \left(\frac{e^{-2x}}{x}\right)' = -\frac{e^{-2x}(2x+1)}{x^2}$. Chain rule: $\frac{\partial y}{\partial x} = \frac{\partial y}{\partial a}\frac{\partial a}{\partial x} + \frac{\partial y}{\partial b}\frac{\partial b}{\partial x} = 2ab \cdot (-e^{-x}) + a^2 \cdot \left(-\frac{1}{x^2}\right) = 2 e^{-x} \cdot \frac{1}{x} \cdot (-e^{-x}) + (e^{-x})^2 \cdot \left(-\frac{1}{x^2}\right) = -\frac{e^{-2x}(2x+1)}{x^2}$; the two agree (a numerical check follows).
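    A quick numerical check of this result with PyTorch autograd, at an arbitrarily chosen point $x = 1.5$:

      import math
      import torch

      x = torch.tensor(1.5, requires_grad=True)
      a = torch.exp(-x)
      b = 1 / x
      y = a**2 * b            # y = e^(-2x) / x
      y.backward()

      exact = -math.exp(-2 * 1.5) * (2 * 1.5 + 1) / 1.5**2
      print(x.grad.item(), exact)  # both are about -0.0885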
  24. Exercise: compute $\partial y/\partial x$ through the network graph $x \to$ Linear($a_1, b_1$) $\to u_1 \to$ Sigmoid $\to z_1$; $x \to$ Linear($a_2, b_2$) $\to u_2 \to$ Sigmoid $\to z_2$; Linear($a_3, b_3, c_3$) $\to v \to$ Sigmoid $\to y$.
  25. Answer: apply the chain rule along the two paths $x \to u_1 \to z_1 \to v \to y$ and $x \to u_2 \to z_2 \to v \to y$:
    $\frac{dy}{dx} = \frac{du_1}{dx} \frac{dz_1}{du_1} \frac{dv}{dz_1} \frac{dy}{dv} + \frac{du_2}{dx} \frac{dz_2}{du_2} \frac{dv}{dz_2} \frac{dy}{dv}$
  26. Example 2, built from elementary pieces. Unary functions: $f_1(u) = 3\sqrt{u}$ with $\frac{dv}{du} = \frac{3}{2\sqrt{u}}$; $f_2(u) = \log u$ with $\frac{dv}{du} = \frac{1}{u}$; $f_3(u) = 1/u$ with $\frac{dv}{du} = -\frac{1}{u^2}$. Binary functions: $g_1(a, b) = ab$ with $\frac{\partial v}{\partial a} = b$, $\frac{\partial v}{\partial b} = a$; $g_2(a, b) = a^2 + b$ with $\frac{\partial v}{\partial a} = 2a$, $\frac{\partial v}{\partial b} = 1$. Composition: $y = g_1(g_2(f_3(x), f_2(x)), f_1(x)) = 3\sqrt{x}\left(\log x + \frac{1}{x^2}\right)$.
  27. Example 2 as a graph: $z_2 = f_1(x) = 3\sqrt{x}$, $z_3 = f_2(x) = \log x$, $z_4 = f_3(x) = 1/x$, $z_1 = g_2(z_4, z_3) = z_4^2 + z_3$, $y = g_1(z_1, z_2) = z_1 z_2$, i.e. $y = 3\sqrt{x}\left(\log x + \frac{1}{x^2}\right)$. Chain rule over the three paths from $x$:
    $\frac{dy}{dx} = \frac{dz_2}{dx} \frac{dy}{dz_2} + \frac{dz_3}{dx} \frac{dz_1}{dz_3} \frac{dy}{dz_1} + \frac{dz_4}{dx} \frac{dz_1}{dz_4} \frac{dy}{dz_1} = 3\sqrt{x}\left(\frac{1}{x} - \frac{2}{x^3}\right) + \frac{3}{2\sqrt{x}}\left(\log x + \frac{1}{x^2}\right)$
  28. Automatic differentiation (single-variable case). Decompose $y = 3\sqrt{x}\left(\log x + \frac{1}{x^2}\right)$ into $z_2 = 3\sqrt{x}$, $z_3 = \log x$, $z_4 = 1/x$, $z_1 = z_4^2 + z_3$, $y = z_1 z_2$; then, at $x = 1.2$, run a forward pass to evaluate $y$ and a backward pass to obtain $dy/dx$ at $x = 1.2$ (reverse-mode automatic differentiation).
  29. Check against the closed form: $\frac{dy}{dx} = 3\sqrt{x}\left(\frac{1}{x} - \frac{2}{x^3}\right) + \frac{3}{2\sqrt{x}}\left(\log x + \frac{1}{x^2}\right)$ evaluated at $x = 1.2$ gives $\frac{dy}{dx} \approx 0.14$ (a sketch follows).
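    Evaluating the closed form in Python confirms the value:

      import math

      x = 1.2
      dydx = 3 * math.sqrt(x) * (1 / x - 2 / x**3) \
             + 3 / (2 * math.sqrt(x)) * (math.log(x) + 1 / x**2)
      print(round(dydx, 2))  # 0.14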
  30. Forward pass (1): each node stores its value in a data field. With $x = 1.2$: $z_3 = \log x = 0.18$, $z_2 = 3\sqrt{x} = 3.29$, $z_4 = 1/x = 0.83$.
  31. Forward pass (2): $z_1 = z_4^2 + z_3 = 0.88$.
  32. Forward pass (3): $y = z_1 z_2 = 2.88$.
  33. Backward pass (1): each node also has a grad field. Start at the output: $\frac{\partial y}{\partial y} = 1$, so $y$: data 2.88, grad 1.00.
  34. Backward pass (2): from $y = z_1 z_2$, $\frac{\partial y}{\partial z_2} = z_1 = 0.88$ and $\frac{\partial y}{\partial z_1} = z_2 = 3.29$.
  35. Backward pass (3): the local derivatives of $z_1 = z_4^2 + z_3$ are $\frac{\partial z_1}{\partial z_3} = 1$ and $\frac{\partial z_1}{\partial z_4} = 2 z_4 = 1.67$, combined as $\frac{\partial y}{\partial z_3} = \frac{\partial y}{\partial z_1} \frac{\partial z_1}{\partial z_3}$ and $\frac{\partial y}{\partial z_4} = \frac{\partial y}{\partial z_1} \frac{\partial z_1}{\partial z_4}$.
  36. Backward pass (4): $\frac{\partial y}{\partial z_3} = 3.29 \times 1 = 3.29$ and $\frac{\partial y}{\partial z_4} = 3.29 \times 1.67 = 5.48$.
  37. Backward pass (5): toward $\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x} + \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial z_3}\frac{\partial z_3}{\partial x} + \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial z_4}\frac{\partial z_4}{\partial x}$: first, $\frac{\partial z_2}{\partial x} = \frac{3}{2\sqrt{x}} = 1.37$.
  38. Backward pass (6): $\frac{\partial z_3}{\partial x} = \frac{1}{x} = 0.83$.
  39. Backward pass (7): $\frac{\partial z_4}{\partial x} = -\frac{1}{x^2} = -0.69$.
  40. Backward pass (8): accumulate $\frac{\partial y}{\partial x} = 0.88 \times 1.37 + 3.29 \times 0.83 + 5.48 \times (-0.69) = 0.14$, stored in $x$'s grad field.
  41. After one Forward and one Backward sweep from $y$, every node holds both its value and its derivative: $x$: data 1.2, grad 0.14 ($= \partial y/\partial x$); $z_3$: 0.18, 3.29; $z_2$: 3.29, 0.88; $z_4$: 0.83, 5.48; $z_1$: 0.88, 3.29; $y$: 2.88, 1.0 ($= \partial y/\partial y$). This is reverse-mode automatic differentiation (a plain-Python sketch follows).
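    The whole table can be reproduced in a few lines of plain Python (a sketch; the g_ variables play the role of the grad fields):

      import math

      x = 1.2
      # Forward: each node's value ("data")
      z2 = 3 * math.sqrt(x)   # 3.29
      z3 = math.log(x)        # 0.18
      z4 = 1 / x              # 0.83
      z1 = z4**2 + z3         # 0.88
      y = z1 * z2             # 2.88

      # Backward: propagate grads from y ("grad" = dy/d·)
      g_y = 1.0
      g_z2 = z1 * g_y          # 0.88
      g_z1 = z2 * g_y          # 3.29
      g_z3 = 1.0 * g_z1        # 3.29
      g_z4 = (2 * z4) * g_z1   # 5.48
      g_x = (3 / (2 * math.sqrt(x))) * g_z2 \
          + (1 / x) * g_z3 + (-1 / x**2) * g_z4
      print(round(g_x, 2))     # 0.14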
  42. Each operation only needs a local rule. For $z_2 = 3\sqrt{x}$: Forward: z2.data ← 3 * sqrt(x.data). Backward: x.grad ← 3/(2*sqrt(x.data)) * z2.grad, since $\frac{\partial z_2}{\partial x} = \frac{3}{2\sqrt{x}}$ and $\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z_2} \times \frac{\partial z_2}{\partial x}$.
  43. The same node with numbers: Forward gives x.data = 1.2 and z2.data = 3.29; Backward receives z2.grad $= \frac{\partial y}{\partial z_2} = 0.88$, so x.grad ← 3/(2*sqrt(x.data)) * z2.grad contributes $1.37 \times 0.88 = 1.20$.
  44. Algorithm, step 1 (Forward): run the forward pass over the graph $z_2 = 3\sqrt{x}$, $z_3 = \log x$, $z_4 = 1/x$, $z_1 = z_4^2 + z_3$, $y = z_1 z_2$, recording each node's data and its local Backward rule.
  45. Step 1 (continued): the local rules recorded during Forward are $\frac{\partial z_2}{\partial x} = \frac{3}{2\sqrt{x}}$, $\frac{\partial z_3}{\partial x} = \frac{1}{x}$, $\frac{\partial z_4}{\partial x} = -\frac{1}{x^2}$, $\frac{\partial z_1}{\partial z_3} = 1$, $\frac{\partial z_1}{\partial z_4} = 2 z_4$, $\frac{\partial y}{\partial z_1} = z_2$, $\frac{\partial y}{\partial z_2} = z_1$ (these become the Backward operations).
  46. Step 1 (continued): written as Backward assignments (contributions to the same grad accumulate):
      z2.grad ← z1.data * y.grad
      z1.grad ← z2.data * y.grad
      z3.grad ← 1.0 * z1.grad
      z4.grad ← (2*z4.data) * z1.grad
      x.grad ← 3/(2*sqrt(x.data)) * z2.grad
      x.grad ← 1/x.data * z3.grad
      x.grad ← (-1/x.data**2) * z4.grad
  47. Step 2 (Backward): initialize every node's grad to 0.0 ($x$, $z_2$, $z_3$, $z_4$, $z_1$, $y$ all start at grad = 0.0).
  48. Step 3 (Backward): set the output's grad to 1: y.grad = 1.0.
  49. Step 4 (Backward): apply $y$'s rules: z2.grad ← z1.data * y.grad = 0.88 and z1.grad ← z2.data * y.grad = 3.29.
  50. Step 5 (Backward): apply $z_1$'s rules: z3.grad ← 1.0 * z1.grad = 3.29 and z4.grad ← (2*z4.data) * z1.grad = 5.48, i.e. $\frac{\partial y}{\partial z_3} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial z_3}$ and $\frac{\partial y}{\partial z_4} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial z_4}$.
  51. Step 6 (Backward): apply $z_2$'s rule: x.grad ← 3/(2*sqrt(x.data)) * z2.grad adds 1.20 to x.grad (0.0 → 1.20), the first term of $\frac{\partial y}{\partial x}$.
  52. Step 6 (continued, $z_3$): x.grad ← 1/x.data * z3.grad adds 2.74 (1.20 → 3.94), the $\frac{\partial y}{\partial z_3}\frac{\partial z_3}{\partial x}$ term.
  53. Step 6 (continued, $z_4$): x.grad ← (-1/x.data**2) * z4.grad adds −3.80 (3.94 → 0.14), the $\frac{\partial y}{\partial z_4}\frac{\partial z_4}{\partial x}$ term.
  54. Done: x.grad = 0.14 $= \partial y/\partial x$ at $x = 1.2$; a single backward sweep from $y$ has filled in $\partial y/\partial\cdot$ for every node (a micrograd-style sketch of the whole mechanism follows).
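    Everything in steps 1-6 fits in a tiny micrograd-style engine. The sketch below (my own simplification, not the actual micrograd code) gives each Value a data field, a grad field, and a closure that pushes its grad to its inputs:

      import math

      class Value:
          # a scalar with reverse-mode autodiff (micrograd-style sketch)
          def __init__(self, data, parents=()):
              self.data = data
              self.grad = 0.0
              self._backward = lambda: None  # local Backward rule
              self._parents = parents

          def __add__(self, other):
              out = Value(self.data + other.data, (self, other))
              def _backward():
                  self.grad += out.grad    # d(a+b)/da = 1
                  other.grad += out.grad   # d(a+b)/db = 1
              out._backward = _backward
              return out

          def __mul__(self, other):
              out = Value(self.data * other.data, (self, other))
              def _backward():
                  self.grad += other.data * out.grad   # d(ab)/da = b
                  other.grad += self.data * out.grad   # d(ab)/db = a
              out._backward = _backward
              return out

          def backward(self):
              # topological order, then propagate grads from the output
              order, seen = [], set()
              def visit(v):
                  if v not in seen:
                      seen.add(v)
                      for p in v._parents:
                          visit(p)
                      order.append(v)
              visit(self)
              self.grad = 1.0
              for v in reversed(order):
                  v._backward()

      def unary(fwd, dfdx):
          # helper for one-input nodes: a value rule and a derivative rule
          def apply(x):
              out = Value(fwd(x.data), (x,))
              def _backward():
                  x.grad += dfdx(x.data) * out.grad
              out._backward = _backward
              return out
          return apply

      sqrt3 = unary(lambda t: 3 * math.sqrt(t), lambda t: 3 / (2 * math.sqrt(t)))
      log = unary(math.log, lambda t: 1 / t)
      inv = unary(lambda t: 1 / t, lambda t: -1 / t**2)

      x = Value(1.2)
      z2, z3, z4 = sqrt3(x), log(x), inv(x)
      z1 = z4 * z4 + z3
      y = z1 * z2
      y.backward()
      print(round(x.grad, 2))  # 0.14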
  55. PyTorch performs exactly this Forward/Backward bookkeeping for $z_2 = 3\sqrt{x}$, $z_3 = \log x$, $z_4 = 1/x$, $z_1 = z_4^2 + z_3$, $y = z_1 z_2$.
  56. The same example in PyTorch (retain_grad() keeps grad on intermediate nodes so it can be printed):

      import torch
      from torch import tensor, sqrt, log  # bare names used below
      torch.set_printoptions(2)

      x = tensor(1.2, requires_grad=True)
      z2 = 3*sqrt(x)
      z3 = log(x)
      z4 = 1/x
      z1 = z4**2 + z3
      y = z1 * z2
      y.retain_grad()
      z1.retain_grad()
      z2.retain_grad()
      z3.retain_grad()
      z4.retain_grad()
      y.backward()
      print(x.data, z2.data, z3.data, z4.data, z1.data, y.data)
      print(x.grad, z2.grad, z3.grad, z4.grad, z1.grad, y.grad)

    Output:
      tensor(1.20) tensor(3.29) tensor(0.18) tensor(0.83) tensor(0.88) tensor(2.88)
      tensor(0.14) tensor(0.88) tensor(3.29) tensor(5.48) tensor(3.29) tensor(1.)
  57. Exercise: minimize $y = x^2 + 2x + 3 = (x + 1)^2 + 2$ by gradient descent starting from $x = 2.0$: repeat Forward (compute $y$), Backward (compute x.grad), and an update of $x$ using x.grad. The minimum is $y = 2$ at $x = -1$ (a PyTorch sketch follows).
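    A PyTorch sketch of this loop (the learning rate 0.1 and 100 steps are arbitrary choices):

      import torch

      x = torch.tensor(2.0, requires_grad=True)
      alpha = 0.1
      for _ in range(100):
          y = x**2 + 2*x + 3        # Forward
          y.backward()              # Backward: fills x.grad
          with torch.no_grad():
              x -= alpha * x.grad   # update x using x.grad
          x.grad.zero_()            # reset grad for the next iteration
      print(x.item())               # converges to -1.0 (where y = 2)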
  58. PyTorch names for the ingredients: MSE = mean squared error (the regression loss), SGD = stochastic gradient descent (the optimizer); see the sketch below.
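    A minimal PyTorch training loop combining the two (the linear model, toy data, and hyperparameters are hypothetical):

      import torch
      from torch import nn, optim

      X = torch.tensor([[0.0], [1.0], [2.0], [3.0]])  # toy inputs
      Y = 2 * X + 1                                   # toy targets: y = 2x + 1
      model = nn.Linear(1, 1)                         # 1-input, 1-output model
      loss_fn = nn.MSELoss()                          # MSE loss
      opt = optim.SGD(model.parameters(), lr=0.05)    # SGD optimizer

      for _ in range(500):
          opt.zero_grad()                 # reset grads
          loss = loss_fn(model(X), Y)     # Forward
          loss.backward()                 # Backward
          opt.step()                      # parameter update
      print(model.weight.item(), model.bias.item())   # approaches 2.0 and 1.0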
  59. micrograd: https://github.com/karpathy/micrograd — a complete autograd engine in about 94 lines of Python: https://github.com/karpathy/micrograd/blob/master/micrograd/engine.py. Written by Andrej Karpathy of OpenAI (formerly director of AI at Tesla; returned to OpenAI in 2023), who builds it step by step in https://youtu.be/VMj-3S1tku0?si=91ZWzaA4ECidua4g
  60. Exercise: for $y = 3\sqrt{x}$, decompose as $z = \sqrt{x}$, $y = 3z$ and trace the Forward/Backward computation by hand; PyTorch's available math operations are listed at https://pytorch.org/docs/stable/torch.html#math-operations (a sketch follows).
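    A sketch of the decomposed computation in PyTorch, checking $dy/dx = \frac{3}{2\sqrt{x}}$ at an arbitrary point $x = 1.2$:

      import math
      import torch

      x = torch.tensor(1.2, requires_grad=True)
      z = torch.sqrt(x)   # z = sqrt(x)
      y = 3 * z           # y = 3z
      y.backward()
      print(x.grad.item())             # autograd result
      print(3 / (2 * math.sqrt(1.2)))  # analytic value, about 1.37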