itakigawa
January 29, 2024
88

機械学習と自動微分

2024年1月26日(金) 同志社大学「最適化法」第１５回

January 29, 2024

Transcript

3. 3 § § § § § ( ) § (

) § https://itakigawa.github.io/

9. 9 § ChatGPT 2022 11 2 1 § AI §

GAFAM IT 2023 OpenAI ChatGPT
10. 10 1. - - - 2. - - - 3.

- - - DIY OpenAI ChatGPT 4. - - 5. GPTs - DALL-E - - 6. - - § § ChatGPT (ChatGPT )
11. 13 § Microsoft OpenAI ChatGPT DALL-E AI § AI 3

(440 ) 3 Apple 2 (🇬🇧🇫🇷🇮🇹 GDP ) § Bing ( Edge) ChatGPT ChatGPT Web § GitHub Copilot( ) § ( ) § § Windows 11 Microsoft 365 Copilot/Copilot Pro Microsoft Copilot ( )
12. 14 § AI Windows 11/ Microsoft 365 Copilot § MS

(Word, Excel, Powerpoint, Outlook, Teams) AI § § Word § Powerpoint § Powerpoint § Excel § Web § Windows MS

14. 16 AI § § ( 2023 12 ) § ChatGPT

AI 2 § http://hdl.handle.net/2433/286548 § ( ), ( ) § 1 (Preferred Networks) § 2 ( ) § ( )

18. 20 ( ) J aime la musique I love music

(Software 2.0 )

0 1 or
22. 24 Random Forest Neural Network SVR Kernel Ridge p1 p2

p3 p4 … ( ) ( ) ( ) ⾒

24. 26 Decision Tree Random Forest GBDT Nearest Neighbor Logistic Regression

SVM Gaussian Process Neural Network ( )

26. 28 vs ( ) q1 q2 q3 q4 … Random

Forest GBDT Nearest Neighbor SVM Gaussian Process Neural Network
27. 29 vs ( ) p1 p2 p3 p4 … (

) q1 q2 q3 q4 …
28. 30 § § 𝑓(𝑥) 𝑦 ℓ(𝑓 𝑥 , 𝑦) §

(MSE) § (Cross Entropy) § = (optimization) Minimize! 𝐿 𝜃 𝐿 𝜃 = ∑"#\$ % ℓ(𝑓 𝑥" , 𝜃 , 𝑦" ) + Ω(𝜃)
29. 31 𝐿 𝑎, 𝑏 = − log * !:#!\$% 𝑃(𝑦

= 1|𝑥!) * !:#!\$& 𝑃(𝑦 = 0|𝑥!) データ モデル 𝑓 𝑥 = 𝜎 𝑎𝑥 + 𝑏 = 1 1 + 𝑒'()*+,) = 𝑃(𝑦 = 1|𝑥) 𝑥%, 𝑦% , 𝑥., 𝑦. , ⋯ , (𝑥/, 𝑦/) 各𝑦! 0 1 2 (Cross Entropy) (-1)× = − 8 !\$% / 𝑦! log 𝑓(𝑥!) + 1 − 𝑦! log(1 − 𝑓 𝑥! ) ( ) 0 1 ( )
30. 32 1 1 𝑓 𝑥 = 𝜎 𝑎𝑥 + 𝑏

= 1 1 + 𝑒'()*+,) 𝑎𝑥 + 𝑏 𝑎 𝑏 𝑥 9 𝑦 𝑧 𝑥" , 𝑦" , 𝑥# , 𝑦# , (𝑥\$ , 𝑦\$ ) Minimize 𝐿 𝑎, 𝑏 𝐿 𝑎, 𝑏 = − 𝑦" log " "%&! "#\$%& + 1 − 𝑦" log 1 − " "%&! "#\$%& − 𝑦# log 1 1 + 𝑒'()*'%+) + 1 − 𝑦# log 1 − 1 1 + 𝑒'()*'%+) − 𝑦\$ log 1 1 + 𝑒'()*(%+) + 1 − 𝑦\$ log 1 − 1 1 + 𝑒'()*(%+) 𝜎 𝑥 = 1 1 + 𝑒'* Linear Sigmoid ( )
31. 33 1 1 3 ( 1 ) Minimize 𝐿 𝑎!

, 𝑎" , 𝑏! , 𝑏" 𝐿 𝑎) , 𝑎* , 𝑏) , 𝑏* = − 𝑦) log 1 1 + 𝑒+ ,! ) )-."(\$%&%'(%) -/! + 1 − 𝑦) log 1 − 1 1 + 𝑒+ ,! ) )-."(\$%&%'(%) -/! − 𝑦* log 1 1 + 𝑒+ ,! ) )-."(\$%&!'(%) -/! + 1 − 𝑦* log 1 − 1 1 + 𝑒+ ,! ) )-."(\$%&!'(%) -/! − 𝑦0 log 1 1 + 𝑒+ ,! ) )-."(\$%&*'(%) -/! + 1 − 𝑦0 log 1 − 1 1 + 𝑒+ ,! ) )-."(\$%&*'(%) -/! 𝑎% 𝑏% 𝑥 𝑧 1 1 + 𝑒#(%!&'(!) 𝑎. 𝑏. 𝑦 1 1 + 𝑒#(%"*'(") 𝑥 𝑧 𝑦
32. 34 1 1 3 ( 2 ) 𝑥 𝑦 𝑎"

, 𝑏" Linear 𝑢" 𝑎# , 𝑏# Linear 𝑢# Sigmoid 𝑧" Sigmoid 𝑧# 𝑎\$ , 𝑏\$ , 𝑐\$ Linear 𝑣 Sigmoid Minimize 𝐿 𝑎! , 𝑎" , 𝑎+ , 𝑏! , 𝑏" , 𝑏+ , 𝑐+ 𝑎0 𝑧) + 𝑏0 𝑧* + 𝑐0
33. 35 § § ResNet50 2600 § AlexNet 6100 § ResNeXt101

8400 § VGG19 1 4300 § LLaMa 650 § Chinchilla 700 § GPT-3 1750 § Gopher 2800 § PaLM 5400 (CNN) Transformer ( )
34. 36 𝐿 𝜃%, 𝜃., ⋯ 𝜃% 𝜃. 1. θ\$ ,

θ; , ⋯ 2. 𝐿 𝜃\$ , 𝜃; , ⋯ θ\$ , θ; , ⋯ 3. ( )
35. 37 𝑓 𝑥, 𝑦 (𝑎, 𝑏) 𝑥 𝑥 𝑦 =

𝑏 𝑥 = 𝑎 𝑓* 𝑎, 𝑏 = lim 0→& 2 )+0, , '2(),,) 0 𝑓& (𝑎, 𝑏) 𝑓 𝑎, 𝑏 𝑎

39. 41 = • 𝒙 𝑓(𝒙) 𝒙 (learning rate) 𝛼 :

1 " "(step size) 𝒙 𝑥% 𝑥. ⋮ 𝑥4 ← 𝑥% 𝑥. ⋮ 𝑥4 − 𝛼 × 𝜕𝑓/𝜕𝑥% 𝜕𝑓/𝜕𝑥. ⋮ 𝜕𝑓/𝜕𝑥4 𝑓(𝒙) 𝑥! 𝑥" 𝒙 = 𝑥! 𝑥"

42. 44 § § § Momentum § AdaGrad § RMSProp §

Adam https://towardsdatascience.com/a-visual- explanation-of-gradient-descent-methods- momentum-adagrad-rmsprop-adam-f898b102325c
43. 45 § 45 𝑥 𝑦 𝑎) , 𝑏) Linear 𝑢)

𝑎* , 𝑏* Linear 𝑢* Sigmoid 𝑧) Sigmoid 𝑧* 𝑎0 , 𝑏0 , 𝑐0 Linear 𝑣 Sigmoid 𝑦 𝜕𝑦/𝜕𝑎! , 𝜕𝑦/𝜕𝑏! , 𝜕𝑦/𝜕𝑐! ,

45. 47 § (Chain Rule) § 𝑥 𝑢 = 𝑓(𝑥) 𝑦

= 𝑔(𝑢) 𝑑𝑦 𝑑𝑥 = 𝑑𝑦 𝑑𝑢 𝑑𝑢 𝑑𝑥 § 𝑡 𝑥! = 𝑓! 𝑡 , 𝑥" = 𝑓" 𝑡 , ⋯ ⾒ 𝑧 = 𝑔(𝑥! , 𝑥" , ⋯ ) 𝜕𝑧 𝜕𝑡 = 𝜕𝑧 𝜕𝑥% 𝜕𝑥% 𝜕𝑡 + 𝜕𝑧 𝜕𝑥. 𝜕𝑥. 𝜕𝑡 + ⋯ 𝑡 𝑥% 𝑥. 𝑧 ⋯ 𝑥 𝑢 𝑦 𝑓 𝑔 𝑓! 𝑓" 𝑔
46. 48 § 𝑦 = 𝑒;< = > < 𝑥 𝑒'*

1/𝑥 𝑎 𝑏 𝑦 𝑎.𝑏 𝑎 = 𝑒'* 𝑏 = 1 𝑥 𝑦 = 𝑎.𝑏 𝜕𝑓 𝜕𝑥 = 𝑒'.* 𝑥 5 = − 𝑒'.*(2𝑥 + 1) 𝑥. ( or ) 𝜕𝑦 𝜕𝑥 = 𝜕𝑦 𝜕𝑎 𝜕𝑎 𝜕𝑥 + 𝜕𝑦 𝜕𝑏 𝜕𝑏 𝜕𝑥 = 2𝑎𝑏 −𝑒'* + 𝑎. − 1 𝑥. = 2𝑒'* J % * −𝑒'* + 𝑒'* . − % *- = − 6.-/(.*+%) *- ( )
47. 49 1 3 𝜕𝑦/𝜕𝑥 𝑥 𝑦 𝑎" , 𝑏" Linear

𝑢" 𝑎# , 𝑏# Linear 𝑢# Sigmoid 𝑧" Sigmoid 𝑧# 𝑎\$ , 𝑏\$ , 𝑐\$ Linear 𝑣 Sigmoid
48. 50 1 3 𝑥 𝑦 𝑢" 𝑢# 𝑧" 𝑧# 𝑣

𝑑𝑦 𝑑𝑥 = 𝑑𝑢% 𝑑𝑥 𝑑𝑧% 𝑑𝑢% 𝑑𝑣 𝑑𝑧% 𝑑𝑦 𝑑𝑣 + 𝑑𝑢. 𝑑𝑥 𝑑𝑧. 𝑑𝑢. 𝑑𝑣 𝑑𝑧. 𝑑𝑦 𝑑𝑣 𝑑𝑦 𝑑𝑣 𝑑𝑣 𝑑𝑧" 𝑑𝑣 𝑑𝑧# 𝑑𝑧# 𝑑𝑢# 𝑑𝑧" 𝑑𝑢" 𝑑𝑢 𝑑𝑥 𝑑𝑢" 𝑑𝑥 𝑑𝑦 𝑑𝑣 ( )
49. 51 2 𝑣 𝑢 𝑓% 𝑣 = 3 𝑢 𝑣

= log 𝑢 𝑣 = 1/𝑢 2 𝑣 𝑎 𝑏 𝑔% 𝑣 = 𝑎 𝑏 𝑣 = 𝑎. + 𝑏 𝑑𝑣 𝑑𝑢 = 3 2 𝑢 𝑑𝑣 𝑑𝑢 = 1 𝑢 𝑑𝑣 𝑑𝑢 = − 1 𝑢. 𝜕𝑣 𝜕𝑎 = 𝑏 𝜕𝑣 𝜕𝑏 = 𝑎 𝜕𝑣 𝜕𝑎 = 2𝑎 𝜕𝑣 𝜕𝑏 = 1 𝑣 𝑢 𝑓. 𝑣 𝑢 𝑓7 𝑣 𝑎 𝑏 𝑔. 𝑦 = 𝑔%(𝑔. 𝑓7 𝑥 , 𝑓. 𝑥 , 𝑓% 𝑥 ) = 3 𝑥 log 𝑥 + 1 𝑥.
50. 52 2 𝑑𝑦 𝑑𝑥 = 𝑑𝑧, 𝑑𝑥 𝑑𝑧! 𝑑𝑧, 𝑑𝑦

𝑑𝑧! + 𝑑𝑧+ 𝑑𝑥 𝑑𝑧! 𝑑𝑧+ 𝑑𝑦 𝑑𝑧! + 𝑑𝑧" 𝑑𝑥 𝑑𝑦 𝑑𝑧" = 3 𝑥 1 𝑥 − 2 𝑥+ + 3 log 𝑥 + 1 𝑥" 2 𝑥 𝑦 = 𝑔% 𝑧%, 𝑧. = 𝑧%𝑧. 𝑧% = 𝑔. 𝑧8, 𝑧7 = 𝑧7 + 𝑧8 . 𝑧. = 𝑓% 𝑥 = 3 𝑥 𝑧7 = 𝑓. 𝑥 = log 𝑥 𝑧8 = 𝑓7 𝑥 = 1/𝑥 𝑥 𝑧. 𝑧7 𝑧8 𝑧% 𝑦 𝑓" 𝑓# 𝑓\$ 𝑔# 𝑔" 𝑦 = 𝑔%(𝑔. 𝑓7 𝑥 , 𝑓. 𝑥 , 𝑓% 𝑥 ) = 3 𝑥 log 𝑥 + 1 𝑥.
51. 53 (1 ) 𝑧. = 3 𝑥 𝑧7 = log

𝑥 𝑧8 = 1/𝑥 𝑧% = 𝑧8 . + 𝑧7 𝑦 = 𝑧%𝑧. 2 x=1.2 𝑦 = 3 𝑥 log 𝑥 + 1 𝑥. 2 • • backward x=1.2 dy/dx ( )
52. 54 § ( ) § § 𝑑𝑦 𝑑𝑥 = 3

𝑥 1 𝑥 − 2 𝑥? + 3 log 𝑥 + 1 𝑥; 2 𝑥 𝑥 = 1.2 @A @B 0.14 § ( )
53. 55 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 𝑧. 3.29 𝑧8 0.83 data data data data Forward
54. 56 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 𝑧. 3.29 𝑧8 0.83 𝑧% 0.88 data data data data data Forward
55. 57 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 𝑧. 3.29 𝑧8 0.83 𝑧% 0.88 𝑦 2.88 data data data data data data Forward
56. 58 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 𝑧. 3.29 𝑧8 0.83 𝑧% 0.88 𝑦 2.88 1.00 data grad data grad data grad data grad 𝜕𝑦 𝜕𝑦 = 1 data grad data grad Backward
57. 59 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 𝑧. 3.29 0.88 𝑧8 0.83 𝑧% 0.88 3.29 𝑦 2.88 1.00 data grad data grad data grad data grad 𝜕𝑦 𝜕𝑦 = 1 𝜕𝑦 𝜕𝑧" = 𝑧# 𝜕𝑦 𝜕𝑧# = 𝑧" data grad data grad = 0.88 = 3.29 Backward
58. 60 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 𝑧. 3.29 0.88 𝑧8 0.83 𝑧% 0.88 3.29 𝑦 2.88 1.00 data grad data grad data grad data grad 𝜕𝑦 𝜕𝑦 = 1 𝜕𝑦 𝜕𝑧" = 𝑧# 𝜕𝑦 𝜕𝑧# = 𝑧" 𝜕𝑧" 𝜕𝑧\$ = 1 𝜕𝑧" 𝜕𝑧0 = 2 𝑧0 data grad data grad = 1.67 𝜕𝑦 𝜕𝑧\$ = 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧\$ 𝜕𝑦 𝜕𝑧0 = 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧0 = 0.88 = 3.29 Backward
59. 61 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 3.29 𝑧. 3.29 0.88 𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.00 data grad data grad data grad data grad 𝜕𝑦 𝜕𝑦 = 1 𝜕𝑦 𝜕𝑧" = 𝑧# 𝜕𝑦 𝜕𝑧# = 𝑧" 𝜕𝑧" 𝜕𝑧\$ = 1 𝜕𝑧" 𝜕𝑧0 = 2 𝑧0 data grad data grad = 1.67 𝜕𝑦 𝜕𝑧\$ = 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧\$ 𝜕𝑦 𝜕𝑧0 = 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧0 = 0.88 = 3.29 Backward
60. 62 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 3.29 𝑧. 3.29 0.88 𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.00 data grad data grad data grad data grad 𝜕𝑦 𝜕𝑦 = 1 𝜕𝑦 𝜕𝑧" = 𝑧# 𝜕𝑦 𝜕𝑧# = 𝑧" 𝜕𝑧" 𝜕𝑧\$ = 1 𝜕𝑧" 𝜕𝑧0 = 2 𝑧0 data grad data grad = 1.67 𝜕𝑦 𝜕𝑥 = 𝜕𝑦 𝜕𝑧# 𝜕𝑧# 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧\$ 𝜕𝑧\$ 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧0 𝜕𝑧0 𝜕𝑥 𝜕𝑧# 𝜕𝑥 = 3 2 𝑥 = 1.37 = 0.88 = 3.29 Backward
61. 63 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 3.29 𝑧. 3.29 0.88 𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.00 data grad data grad data grad data grad 𝜕𝑦 𝜕𝑦 = 1 𝜕𝑦 𝜕𝑧" = 𝑧# 𝜕𝑦 𝜕𝑧# = 𝑧" 𝜕𝑧" 𝜕𝑧\$ = 1 𝜕𝑧" 𝜕𝑧0 = 2 𝑧0 data grad data grad = 1.67 𝜕𝑦 𝜕𝑥 = 𝜕𝑦 𝜕𝑧# 𝜕𝑧# 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧\$ 𝜕𝑧\$ 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧0 𝜕𝑧0 𝜕𝑥 𝜕𝑧# 𝜕𝑥 = 3 2 𝑥 = 1.37 𝜕𝑧\$ 𝜕𝑥 = 1 𝑥 = 0.83 = 0.88 = 3.29 Backward
62. 64 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 3.29 𝑧. 3.29 0.88 𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.00 data grad data grad data grad data grad 𝜕𝑦 𝜕𝑦 = 1 𝜕𝑦 𝜕𝑧" = 𝑧# 𝜕𝑦 𝜕𝑧# = 𝑧" 𝜕𝑧" 𝜕𝑧\$ = 1 𝜕𝑧" 𝜕𝑧0 = 2 𝑧0 data grad data grad = 1.67 𝜕𝑦 𝜕𝑥 = 𝜕𝑦 𝜕𝑧# 𝜕𝑧# 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧\$ 𝜕𝑧\$ 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧0 𝜕𝑧0 𝜕𝑥 𝜕𝑧# 𝜕𝑥 = 3 2 𝑥 = 1.37 𝜕𝑧\$ 𝜕𝑥 = 1 𝑥 = 0.83 𝜕𝑧0 𝜕𝑥 = − 1 𝑥# = -0.69 = 0.88 = 3.29 Backward
63. 65 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 0.14 𝑧7 0.18 3.29 𝑧. 3.29 0.88 𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.00 data grad data grad data grad data grad 𝜕𝑦 𝜕𝑦 = 1 data grad data grad 1.67 𝜕𝑦 𝜕𝑥 = 𝜕𝑦 𝜕𝑧# 𝜕𝑧# 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧\$ 𝜕𝑧\$ 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧0 𝜕𝑧0 𝜕𝑥 1.37 0.83 -0.69 0.88 3.29 1 Backward
64. 66 𝑥 1.2 0.14 𝑧7 0.18 3.29 𝑧. 3.29 0.88

𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.0 data data data data data data grad grad grad grad grad grad 𝜕𝑦 𝜕𝑥 𝜕𝑦 𝜕𝑧# 𝜕𝑦 𝜕𝑧\$ 𝜕𝑦 𝜕𝑧0 𝜕𝑦 𝜕𝑧" 𝜕𝑦 𝜕𝑦 Forward Backward y ( )
65. 67 𝑧. = 3 𝑥 𝑥 1.2 data grad 𝑧.

3.29 data grad Forward Backward z2.data ← 3 * sqrt(x.data) 𝑥 1.2 data grad 𝑧. 3.29 data grad backward x.grad ← 3/(2*sqrt(x.data)) * z2.grad 𝜕𝑧# 𝜕𝑥 = 3 2 𝑥 x.grad ← 3/(2*sqrt(x.data)) * z2.grad 𝜕𝑦 𝜕𝑥 = 𝜕𝑦 𝜕𝑧# × 𝜕𝑧# 𝜕𝑥
66. 68 𝑧. = 3 𝑥 𝑥 1.2 data grad 𝑧.

3.29 data grad Forward Backward z2.data ← 3 * sqrt(x.data) 𝑥 1.2 data grad 𝑧. 3.29 data grad backward x.grad ← 3/(2*sqrt(x.data)) * z2.grad 𝜕𝑧# 𝜕𝑥 = 3 2 𝑥 x.grad ← 3/(2*sqrt(x.data)) * z2.grad 𝜕𝑦 𝜕𝑧# = 0.88 1.20 0.88 𝜕𝑦 𝜕𝑥 = 𝜕𝑦 𝜕𝑧# × 𝜕𝑧# 𝜕𝑥
67. 69 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 𝑧. 3.29 𝑧8 0.83 𝑧% 0.88 𝑦 2.88 data data grad data grad data grad grad data grad data grad ( ) 1. Forward ( & )
68. 70 𝜕𝑧+ 𝜕𝑥 = − 1 𝑥, 𝑦 = 𝑧%𝑧.

𝑧% = 𝑧8 . + 𝑧7 𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 𝑧. 3.29 𝑧8 0.83 𝑧% 0.88 𝑦 2.88 data data data data data data grad grad grad grad grad grad ⾒ Backward ⾒ 𝜕𝑧- 𝜕𝑥 = 1 𝑥 𝜕𝑧, 𝜕𝑥 = 3 2 𝑥 𝜕𝑦 𝜕𝑧, = 𝑧. 𝜕𝑧. 𝜕𝑧- = 1 𝜕𝑧. 𝜕𝑧+ = 2 𝑧+ 𝜕𝑦 𝜕𝑧. = 𝑧, 1. Forward ( & )
69. 71 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 𝑧7 0.18 𝑧. 3.29 𝑧8 0.83 𝑧% 0.88 𝑦 2.88 data data data data data data grad grad grad grad grad grad x.grad ← 3/(2*sqrt(x.data)) * z2.grad x.grad ← (-1/x.data**2) * z4.grad z2.grad ← z1.data* y.grad z1.grad ← z2.data* y.grad z3.grad ← 1.0 * z1.grad z4.grad ← (2*z4.data)* z1.grad x.grad ← 1/x.data * z3.grad 1. Forward ( & ) ⾒ Backward ⾒
70. 72 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 0.0 𝑧7 0.18 0.0 𝑧. 3.29 0.0 𝑧8 0.83 0.0 𝑧% 0.88 0.0 𝑦 2.88 0.0 data data data data data data grad grad grad grad grad grad 2. Backward (grad 0 )
71. 73 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 0.0 𝑧7 0.18 0.0 𝑧. 3.29 0.0 𝑧8 0.83 0.0 𝑧% 0.88 0.0 𝑦 2.88 1.0 data data data data data data grad grad grad grad grad grad 3. Backward ( grad 1 )
72. 74 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 0.0 𝑧7 0.18 0.0 𝑧. 3.29 0.88 𝑧8 0.83 0.0 𝑧% 0.88 3.29 𝑦 2.88 1.0 data data data data data data grad grad grad grad grad grad z2.grad ← z1.data* y.grad z1.grad ← z2.data* y.grad 4. Backward (y grad )
73. 75 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 0.0 𝑧7 0.18 3.29 𝑧. 3.29 0.88 𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.0 data data data data data data grad grad grad grad grad grad z3.grad ← 1.0 * z1.grad z4.grad ← (2*z4.data)* z1.grad 𝜕𝑦 𝜕𝑧\$ = 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧\$ 𝜕𝑦 𝜕𝑧0 = 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧0 5. Backward (z1 grad )
74. 76 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 1.20 𝑧7 0.18 3.29 𝑧. 3.29 0.88 𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.0 data data data data data data grad grad grad grad grad grad x.grad ← 3/(2*sqrt(x.data)) * z2.grad 1.20 grad=0.0 𝜕𝑦 𝜕𝑥 = 𝜕𝑦 𝜕𝑧# 𝜕𝑧# 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧\$ 𝜕𝑧\$ 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧0 𝜕𝑧0 𝜕𝑥 6. Backward (z2 grad )
75. 77 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 3.94 𝑧7 0.18 3.29 𝑧. 3.29 0.88 𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.0 data data data data data data grad grad grad grad grad grad x.grad ← 1/x.data * z3.grad 2.74 grad=1.20 𝜕𝑦 𝜕𝑥 = 𝜕𝑦 𝜕𝑧# 𝜕𝑧# 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧\$ 𝜕𝑧\$ 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧0 𝜕𝑧0 𝜕𝑥 𝜕𝑦 𝜕𝑧\$ 6. Backward (z3 grad )
76. 78 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 0.14 𝑧7 0.18 3.29 𝑧. 3.29 0.88 𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.0 data data data data data data grad grad grad grad grad grad -3.80 grad=3.94 x.grad ← (-1/x.data**2) * z4.grad 𝜕𝑦 𝜕𝑥 = 𝜕𝑦 𝜕𝑧# 𝜕𝑧# 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧\$ 𝜕𝑧\$ 𝜕𝑥 + 𝜕𝑦 𝜕𝑧" 𝜕𝑧" 𝜕𝑧0 𝜕𝑧0 𝜕𝑥 𝜕𝑦 𝜕𝑧0 6. Backward (z4 grad )
77. 79 𝑦 = 𝑧%𝑧. 𝑧% = 𝑧8 . + 𝑧7

𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8 = 1/𝑥 𝑥 1.2 0.14 𝑧7 0.18 3.29 𝑧. 3.29 0.88 𝑧8 0.83 5.48 𝑧% 0.88 3.29 𝑦 2.88 1.0 data data data data data data grad grad grad grad grad grad 𝜕𝑦 𝜕𝑥 𝜕𝑦 𝜕𝑧# 𝜕𝑦 𝜕𝑧\$ 𝜕𝑦 𝜕𝑧0 𝜕𝑦 𝜕𝑧" 𝜕𝑦 𝜕𝑦 ( ) y
78. 80 𝑧. = 3 𝑥 𝑧7 = log 𝑥 𝑧8

= 1/𝑥 𝑧% = 𝑧8 . + 𝑧7 𝑦 = 𝑧%𝑧. PyTorch Backward Forward
79. 81 grad x = tensor(1.2, requires_grad=True) z2 = 3*sqrt(x) z3

= log(x) z4 = 1/x z1 = z4**2 + z3 y = z1 * z2 y.retain_grad() z1.retain_grad() z2.retain_grad() z3.retain_grad() z4.retain_grad() y.backward() print(x.data, z2.data, z3.data, z4.data, z1.data, y.data) print(x.grad, z2.grad, z3.grad, z4.grad, z1.grad, y.grad) tensor(1.20) tensor(3.29) tensor(0.18) tensor(0.83) tensor(0.88) tensor(2.88) tensor(0.14) tensor(0.88) tensor(3.29) tensor(5.48) tensor(3.29) tensor(1.) import torch torch.set_printoptions(2) 2 PyTorch
80. 82 𝑦 = 𝑥. + 2𝑥 + 3 = 𝑥

+ 1 . + 2 (2.0) • (Forward) • (Backward) • • x.grad 𝑥 = −1 𝑦 = 2
81. 83 𝑦 = 𝑥\$ + 3.2 𝑥# + 1.3 𝑥

− 2 20 ( -1.5 1.5)

83. 85 MSE = mean squared errors ( ) SGD =

stochastic gradient descent ( ) PyTorch

GPU
85. 87 • https://github.com/karpathy/micrograd • Python 94 • https://github.com/karpathy/micrograd/blob/master/micrograd/engine.py • Andrej

Karpathy OpenAI (Telsa AI 2023 OpenAI ) • https://youtu.be/VMj-3S1tku0?si=91ZWzaA4ECidua4g micrograd
86. 88 § 𝑦 = 3 𝑥 𝑧 = 𝑥 𝑦

= 3𝑧 § PyTorch https://pytorch.org/docs/stable/torch.html#math-operations § §
87. 89 ( ) J aime la musique I love music

(Software 2.0 )