Contrastive Self-Supervised Learning

Slide 1

Slide 1 text

Department of Physics, Tunghai University. Contrastive Self-Supervised Learning. https://yenlung.me/2022SSL

Slide 2

Slide 2 text

Contrastive Learning 2 蔡炎龍 (Yen-Lung Tsai), UC Irvine. Python, Deep Learning

Slide 3

Slide 3 text

Python & AI

Slide 4

Slide 4 text

Two related MOOC courses. eWant: https://www.ewant.org MOOC: https://moocs.nccu.edu.tw

Slide 5

Slide 5 text

I published a Python book: 少年Py的大冒險: 成為Python數據分析達人的第一門課 (The Great Adventure of Young Py: A First Course to Becoming a Python Data Analysis Expert)

Slide 6

Slide 6 text

A book coming out this year! 少年Py的大冒險 II: 成為 Python AI 達人的第一堂課 (The Great Adventure of Young Py II: A First Course to Becoming a Python AI Expert)

Slide 7

Slide 7 text

The most detailed livestream recordings so far: https://bit.ly/2021_FG_DeepLearning

Slide 8

Slide 8 text

AI course currently being taught: https://www.youtube.com/c/iyenlung > 1102

Slide 9

Slide 9 text

AI Is About Building a Function-Learning Machine 01.

Slide 10

Slide 10 text

Every problem has to be cast in the form of a function: x → f → y

Slide 11

Slide 11 text

In other words, we want to approximate a function... $y = f(x)$

Slide 12

Slide 12 text

In other words, we want to approximate a function... $f(x, y, u(x, y)) = 0$

Slide 13

Slide 13 text

Neural Network: the neuron is the basic unit of computation.

Slide 14

Slide 14 text

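Deep learning means building layer upon layer of "hidden layers". [Figure: a layer $\mathcal{F}_1$ maps the inputs $x = (x_1, x_2, \dots, x_n)$ to hidden units $h = (h_1, h_2, \dots, h_k)$.] Layer types: fully connected layer (Dense), used in a DNN; convolutional layer (Conv), used in a CNN; recurrent layer (LSTM, GRU), used in an RNN.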

Slide 15

Slide 15 text

Deep learning means building layer upon layer of "hidden layers". [Figure: $x$ → input layer → hidden layers → output layer → $\hat{y}$.] DNN, CNN, RNN.

Slide 16

Slide 16 text

How a neuron works

Slide 17

Slide 17 text

How a neuron works: form the weighted sum of the inputs, add a bias, and pass the result through an activation function $\varphi$: $\varphi\left(\sum_{i=1}^{3} w_i x_i + b\right) = h$
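
As a minimal sketch of the formula on this slide, the following computes the output of one neuron in plain Python. The function names (`neuron`, `relu`) and the sample weights are illustrative assumptions, not taken from the talk; the slide does not say which activation function is used.

```python
def relu(z):
    """One common activation function (an assumption; the slide only says phi)."""
    return max(0.0, z)


def neuron(x, w, b, phi=relu):
    """Return h = phi(sum_i w_i * x_i + b) for one neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return phi(z)


# Three inputs, as in the slide's sum from i = 1 to 3.
h = neuron([1.0, 2.0, 3.0], w=[1.0, 2.0, -1.0], b=0.5)
print(h)
```

Swapping `phi` for another function (for example `math.tanh`) changes only the last step; the weighted sum stays the same.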

Slide 18

Slide 18 text

Universal Approximation Theorem

Slide 19

Slide 19 text

Building a "function-learning machine": all the parameters $\{w_i, b_j\}$ are collectively written as $\theta$.

Slide 20

Slide 20 text

Training (learning): find parameters $\theta$ for $f_\theta$.

Slide 21

Slide 21 text

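Objective function, loss function. For the $i$-th sample $(x_i, y_i)$: $\ell_i(\theta) = \|y_i - f_\theta(x_i)\|^2$. Averaged over all samples (with a factor of 1/2): $L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \|y_i - f_\theta(x_i)\|^2$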
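
The loss above can be sketched in a few lines of plain Python. The toy model `make_model` (which just multiplies the input by a single parameter) and the sample data are hypothetical stand-ins for $f_\theta$, chosen only so the numbers are easy to check by hand.

```python
def squared_error(y, y_hat):
    """Squared Euclidean distance ||y - y_hat||^2 between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))


def loss(f, xs, ys):
    """L(theta) = 1/(2N) * sum_i ||y_i - f(x_i)||^2."""
    n = len(xs)
    return sum(squared_error(y, f(x)) for x, y in zip(xs, ys)) / (2 * n)


def make_model(theta):
    """Hypothetical one-parameter model f_theta(x) = theta * x (elementwise)."""
    return lambda x: [theta * xi for xi in x]


xs = [[1.0], [2.0]]
ys = [[2.0], [4.0]]
print(loss(make_model(2.0), xs, ys))  # perfect fit
print(loss(make_model(1.0), xs, ys))  # worse fit, larger loss
```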

Slide 22

Slide 22 text

Objective function, loss function. For classification, use one-hot encoding for the classes: 1 → [1 0 0], 2 → [0 1 0], 3 → [0 0 1].

Slide 23

Slide 23 text

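Objective function, loss function. The network $p_\theta$ outputs $\hat{y}_1, \hat{y}_2, \hat{y}_3$ through a softmax, so they sum to 1. For a sample $(x_i, y_i)$, the output for the correct class is read as $P(y_i \mid x, \theta)$.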

Slide 24

Slide 24 text

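Softmax: keep the ordering, and make everything sum to 1. Given $a, b, c$, we want $\alpha, \beta, \gamma$ with $\alpha + \beta + \gamma = 1$. If $a, b, c$ are all greater than 0, let $S = a + b + c$ and take $\alpha = \frac{a}{S}$, $\beta = \frac{b}{S}$, $\gamma = \frac{c}{S}$.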

Slide 25

Slide 25 text

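Softmax: keep the ordering, and make everything sum to 1. Given $a, b, c$, we want $\alpha, \beta, \gamma$ with $\alpha + \beta + \gamma = 1$. When $a, b, c$ are not all greater than 0, first exponentiate: $a' = e^a$, $b' = e^b$, $c' = e^c$. Then let $S = a' + b' + c'$ and take $\alpha = \frac{a'}{S}$, $\beta = \frac{b'}{S}$, $\gamma = \frac{c'}{S}$.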

Slide 26

Slide 26 text

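Softmax: keep the ordering, and make everything sum to 1. In general, $k$ numbers $z_1, z_2, \dots, z_k$ are turned into $\bar{z}_1, \bar{z}_2, \dots, \bar{z}_k$ with $\sum_{i=1}^{k} \bar{z}_i = 1$: $\bar{z}_j = \frac{\exp(z_j)}{\sum_{i=1}^{k} \exp(z_i)}$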
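
A short plain-Python sketch of the softmax formula follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not on the slide; it does not change the result, because the common factor cancels in the ratio.

```python
import math


def softmax(z):
    """Map k real numbers to k positive numbers that sum to 1, keeping their order."""
    m = max(z)  # stability shift (an addition, not from the slide)
    exps = [math.exp(zj - m) for zj in z]
    s = sum(exps)
    return [e / s for e in exps]


probs = softmax([2.0, -1.0, 0.5])
print(probs, sum(probs))
```

Negative inputs are fine here, which is exactly why the exponential is used instead of dividing the raw numbers by their sum.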

Slide 27

Slide 27 text

Next, prepare the training data (labeling): roughly 1000 examples per class! [Figure: × 1000, × 1000, × 1000]

Slide 28

Slide 28 text

Objective function, loss function. In Shannon's information theory, the information content of an event $x$ is $-\log P(x)$.

Slide 29

Slide 29 text

Objective function, loss function: $\ell_i(\theta) = -\log P(y_i \mid x, \theta)$ (cross entropy)
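
As a small illustration of this loss, assuming the network's softmax output is available as a list of probabilities and the correct class is given as a 0-based index (both are conventions chosen for this sketch, not from the slide):

```python
import math


def cross_entropy(probs, label):
    """l_i = -log P(y_i | x, theta): the negative log-probability of the correct class."""
    return -math.log(probs[label])


def cross_entropy_one_hot(probs, one_hot):
    """Same value written with a one-hot target: -sum_j y_j * log(p_j)."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y != 0)


probs = [0.5, 0.25, 0.25]
print(cross_entropy(probs, 0))
print(cross_entropy_one_hot(probs, [1, 0, 0]))
```

The loss is 0 only when the model puts probability 1 on the correct class, and it grows without bound as that probability approaches 0.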

Slide 30

Slide 30 text

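[Supervised learning] We prepare the training data ourselves, as pairs of an input and its label: $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. The data is split into $(x_1, y_1), \dots, (x_k, y_k)$ and $(x_{k+1}, y_{k+1}), \dots, (x_n, y_n)$ to watch for overfitting!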

Slide 31

Slide 31 text

The great success of supervised neural networks!

Slide 32

Slide 32 text

But... it needs a large amount of high-quality labeled data (labeling). 80%

Slide 33

Slide 33 text

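But sometimes the training data is not easy to prepare! There is too little labeled data! We do not know what the correct answer is! The training data is hard to prepare!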

Slide 34

Slide 34 text

More importantly, even small children learn better than AI!

Slide 35

Slide 35 text

Self-Supervised Learning. "We believe that self-supervised learning is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems." Yann LeCun (楊立昆) / Ishan Misra, 2021. Self-supervised Learning: The Dark Matter of Intelligence. https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/

Slide 36

Slide 36 text

[Example] The training data is hard to prepare

Slide 37

Slide 37 text

[Example] The training data is hard to prepare: $\pi$

Slide 38

Slide 38 text

[Example] We do not know the correct answer. (f; dog)

Slide 39

Slide 39 text

[Unsupervised learning] Basic idea 1: $f_\theta$, self-supervised learning

Slide 40

Slide 40 text

[Unsupervised learning] Basic idea 2: $f_\theta$, $J(\theta)$, self-supervised, Contrastive Learning

Slide 41

Slide 41 text

[Unsupervised learning] Basic idea 3: embedding, Pretext Task

Slide 42

Slide 42 text

NLP: Finding Representation Vectors for Words 02.

Slide 43

Slide 43 text

Feature Engineering: x → f → y, feature

Slide 44

Slide 44 text

Feature Engineering: dimension reduction, PCA. [Figure: PCA applied to x]

Slide 45

Slide 45 text

Feature Engineering: deep learning, feature engineering, f

Slide 46

Slide 46 text

Feature Engineering: feature engineering

Slide 47

Slide 47 text

Feature Engineering: feature engineering

Slide 48

Slide 48 text

Representation Learning: representation

Slide 49

Slide 49 text

Representation vector: input → $f_\theta$ → output [94, 87]

Slide 50

Slide 50 text

Word Embedding. In natural language processing, the most basic question is how we "input" language... [Figure: a piece of text → $f_\theta$]

Slide 51

Slide 51 text

Word Embedding. Usually each character (or each word) is simply given a representative "feature vector": 龍 → $f_\theta$ → [94, 87]. A function like this is called a word embedding.

Slide 52

Slide 52 text

Word Embedding. There is one more small problem... In 龍 → $f_\theta$, the input also has to be turned into numbers before it can be fed into a computer.

Slide 53

Slide 53 text

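We number the characters! The most common approach is to sort the characters by how often they appear; the more frequent a character, the smaller its number: 的 → 1, 一 → 2, 了 → 3, 是 → 4, 我 → 5.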

Slide 54

Slide 54 text

Then one-hot encoding! 的 (1) → [1 0 0 0 0 ⋯], 一 (2) → [0 1 0 0 0 ⋯], 了 (3) → [0 0 1 0 0 ⋯], 是 (4) → [0 0 0 1 0 ⋯], 我 (5) → [0 0 0 0 1 ⋯].
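
The two steps (number the characters by frequency, then one-hot encode) can be sketched as follows. The helper names `build_vocab` and `one_hot` and the tiny corpus are illustrative assumptions; ties in frequency are broken by character order here just to keep the result deterministic.

```python
from collections import Counter


def build_vocab(text):
    """Number characters from 1, most frequent first."""
    counts = Counter(text)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ranked)}


def one_hot(index, size):
    """One-hot vector of length `size` for a 1-based index."""
    vec = [0] * size
    vec[index - 1] = 1
    return vec


vocab = build_vocab("的的的一一了")  # toy corpus
print(vocab)
print(one_hot(vocab["一"], len(vocab)))
```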

Slide 55

Slide 55 text

Word2Vec. Let us use the well-known Word2Vec to see how word embedding is done. Similar words end up close together! Google's official site: https://code.google.com/archive/p/word2vec/

Slide 56

Slide 56 text

Word2Vec. T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR, 2013. Once trained, it can do many cool things. [Figure: Paris, France, Italy, Rome; king, man, woman, queen]

Slide 57

Slide 57 text

So what function is being learned here? 龍 → f → [94, 87]. Of course we know that word embedding means learning a feature vector for each character, but we have no way to prepare the training data!

Slide 58

Slide 58 text

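The key is still the function! Basically, you design a task that you believe the computer can only do if it "understands what the words mean"! CBOW model: use the surrounding words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ to predict the middle word $w_t$.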

Slide 59

Slide 59 text

The key is still the function! Or, even fancier, train a function like this! Skip-Gram model: use the middle word $w_t$ to predict the surrounding words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$.
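
Both pretext tasks need no manual labels, because the training pairs come straight from the text. The sketch below shows one way to generate them with a window of 2 on each side, as in the slides; the function names are hypothetical, and real Word2Vec implementations add further details (subsampling, negative sampling) that are not shown.

```python
def cbow_pairs(tokens, window=2):
    """CBOW: (surrounding words, middle word) pairs."""
    pairs = []
    for t, target in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != t]
        pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Skip-Gram: (middle word, one surrounding word) pairs."""
    return [(target, c)
            for context, target in cbow_pairs(tokens, window)
            for c in context]


tokens = ["a", "b", "c", "d", "e"]
print(cbow_pairs(tokens)[2])
print(skipgram_pairs(tokens)[:2])
```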

Slide 60

Slide 60 text

The key is still the function! Embedding: we decide how many dimensions to compress down to, say 128, and then put 128 neurons in the hidden layer in the middle of the neural network!

Slide 61

Slide 61 text

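Memorization or understanding? In word2vec, the weight matrix $W$ has $V$ rows and $N$ columns: $W = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ w_{21} & w_{22} & \cdots & w_{2N} \\ \vdots & \vdots & & \vdots \\ w_{i1} & w_{i2} & \cdots & w_{iN} \\ \vdots & \vdots & & \vdots \\ w_{V1} & w_{V2} & \cdots & w_{VN} \end{pmatrix}$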

Slide 62

Slide 62 text

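Memorization or understanding? With a one-hot encoded input $x = (0, 0, \dots, 1, \dots, 0)^T$ (the 1 in position $i$) and the $V \times N$ matrix $W$ with entries $w_{ij}$, the hidden layer is $W^T x = h$. In word2vec this simply picks out row $i$ of $W$: $h = (w_{i1}, w_{i2}, \dots, w_{iN})^T$.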
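
The claim that $W^T x$ with a one-hot $x$ is just a row lookup can be checked numerically. This is a minimal sketch with a made-up $3 \times 2$ matrix; `matvec_transpose` is a hypothetical helper written out in plain Python rather than a library call.

```python
def matvec_transpose(W, x):
    """Compute W^T x for a V x N matrix W (a list of V rows) and a length-V vector x."""
    n_cols = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(W))) for j in range(n_cols)]


W = [[0.1, 0.2],   # vector for word 1
     [0.3, 0.4],   # vector for word 2
     [0.5, 0.6]]   # vector for word 3
x = [0, 1, 0]      # one-hot encoding of word 2

h = matvec_transpose(W, x)
print(h, h == W[1])
```

So after training, the rows of `W` are the word vectors, and the "hidden layer" is a table lookup.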

Slide 63

Slide 63 text

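Traditional word embedding still has a drawback. With word embedding, a fixed character (or word) basically always gets the same fixed feature vector. But... compare "這個人的個性有點天天。" (This person's personality is a bit 天天, i.e. naive.) with "我天天都會喝一杯咖啡。" (I drink a cup of coffee 天天, i.e. every day.) The same character or word can mean different things in different places.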

Slide 64

Slide 64 text

A semantic word embedding! [Figure: a word with a particular meaning → f → code.] Encode by meaning! Can this really be done?

Slide 65

Slide 65 text

ELMo opens the "Sesame Street era" of natural language processing! ELMo: M.E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer. Deep contextualized word representations. NAACL 2018. arXiv preprint arXiv:1802.05365v2. (AI2)

Slide 66

Slide 66 text

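It is really just the hidden states of an RNN: $h_1, h_2, \dots, h_{n-1}, h_n$. [Figure: the characters 我, 天, 天, ..., 喝, 咖, 啡 are fed in one at a time; the hidden states are the embeddings we want.] The hidden states of a chatbot are already very good embeddings!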

Slide 67

Slide 67 text

Nobody says we can only have one layer! [Figure: two stacked layers, LSTM1 and LSTM2, each producing hidden states $h_1, h_2, \dots, h_{n-1}, h_n$ for the input characters.]

Slide 68

Slide 68 text

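So we get a more "customized" embedding. [Figure: for each token, the token vector and the hidden states $h_i$ from each layer are added together with weights $w_1, w_2, w_3$.] Only when we actually use it do we learn $w_1, w_2, w_3$, which gives the "real" embedding. The earlier parts, which needed huge amounts of training data, do not have to be touched!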
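
The weighted combination on this slide can be sketched as below, assuming three layer outputs for one token and three scalar weights. The function name and the numbers are illustrative; the published ELMo model additionally normalizes the weights with a softmax and applies a global scale factor, which this sketch leaves out.

```python
def combine_layers(layer_vectors, weights):
    """Return w1*v1 + w2*v2 + w3*v3 for one token's per-layer vectors."""
    dim = len(layer_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
            for d in range(dim)]


# Hypothetical per-layer vectors for one token: token layer, LSTM1, LSTM2.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.25, 0.25]  # the only part learned for the downstream task
print(combine_layers(layers, weights))
```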

Slide 69

Slide 69 text

BERT, leading a new era of natural language processing. BERT: J. Devlin, M.W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805v2. (Google)

Slide 70

Slide 70 text

Transformer. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008). It uses self-attention to avoid the drawbacks of RNNs!

Slide 71

Slide 71 text

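Transformer. BERT's architecture is basically the encoder of a transformer. One of its training tasks is a cloze (fill-in-the-blank) test: "我天天都會喝一杯__。" (I drink a cup of __ every day.) → BERT → "咖啡" (coffee).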

Slide 72

Slide 72 text

Contrastive Learning 03.

Slide 73

Slide 73 text

Suppose we want to do face recognition. From now on, the company's access control will simply use face recognition!

Slide 74

Slide 74 text

It seems easy enough: just give each colleague a number: 1, 2, 3, 4.

Slide 75

Slide 75 text

Then just learn this function: input → $f_\theta$ → output: 3

Slide 76

Slide 76 text

Problem 1: image recognition needs about 1000 images per class for training! We can hardly ask every colleague to come in for 1000 photos...

Slide 77

Slide 77 text

Problem 2: "I am a new hire. Does the model have to be retrained?"

Slide 78

Slide 78 text

This is rather unlike humans... People do not seem to need thousands of photos to recognize someone...

Slide 79

Slide 79 text

Is training on small data possible? Could we teach the computer "how to learn"? Once it has learned that, it can be trained even on small data.

Slide 80

Slide 80 text

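If we could find a function like this... [Figure: a photo → f → $\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n$.] Then she has a representation vector $\hat{y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n]$.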

Slide 81

Slide 81 text

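Everyone then has a "representation vector". Suppose $x_1, x_2, x_3, x_4$ are photos of four colleagues in the company, with vectors $f(x_1), f(x_2), f(x_3), f(x_4)$. For a new photo, compute $f$ of it to get $\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n$ and see which stored vector is at the smallest distance!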

Slide 82

Slide 82 text

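And so all kinds of problems are solved! For example, when a new hire arrives, we simply use this already-trained neural network $f$ to compute her representation vector $\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n$.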

Slide 83

Slide 83 text

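There is also an immediate benefit: we can specify what counts as "similar enough". Define a number $\tau$; if $d(f(x), f(x_i)) < \tau$, we decide it is this person. So we can also tell when someone is not a person inside the company.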
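
A minimal sketch of this decision rule, assuming the representation vectors $f(x_i)$ have already been computed and stored in a dictionary, and taking $d$ to be Euclidean distance (the slide does not fix a particular $d$). The names and numbers are made up for illustration.

```python
import math


def distance(u, v):
    """Euclidean distance, one possible choice for d."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(query_vec, enrolled, tau):
    """Return the closest enrolled name if it is within tau, else None (an outsider)."""
    best_name, best_dist = None, float("inf")
    for name, vec in enrolled.items():
        d = distance(query_vec, vec)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist < tau else None


# Hypothetical stored vectors f(x_i) for two colleagues.
enrolled = {"alice": [0.0, 0.0], "bob": [1.0, 1.0]}
print(identify([0.1, 0.0], enrolled, tau=0.5))
print(identify([5.0, 5.0], enrolled, tau=0.5))
```

Adding a new colleague only means adding one entry to `enrolled`; the network itself is not retrained.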

Slide 84

Slide 84 text

But the training data is hard to prepare... How would I know which vector represents her best?

Slide 85

Slide 85 text

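An automatic feature extractor. In a neural network, each hidden layer can be thought of as doing "automatic feature extraction". So the output of some hidden layer can be seen as a representation vector of the original data!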

Slide 86

Slide 86 text

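Inspiration from word embedding for text... [Figure: CNN → Dense ($\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n$) → Output (Softmax).] Train "normal" face recognition, then chop off the last layer. Chopping off the last layer is all it takes!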

Slide 87

Slide 87 text

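We can also train directly to decide whether two photos show the same person. [Figure: two CNN branches, each producing $\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n$, joined into a single output, here 0.] Chopping off the last layer is all it takes!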

Slide 88

Slide 88 text

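Even better is to use Triplet Loss. [Figure: three CNN branches, each producing $\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n$; one pair of vectors should be as close as possible, the other pair as far apart as possible.] labeling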

Slide 89

Slide 89 text

Even better is to use Triplet Loss. F. Schroff, D. Kalenichenko, J. Philbin (Google). FaceNet: A Unified Embedding for Face Recognition and Clustering. arXiv preprint arXiv:1503.03832. [Figure: CNN branches producing $\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n$ for a Positive Sample and a Negative Sample.]
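
For reference, a common form of the triplet loss is $\max(0, \|a - p\|^2 - \|a - n\|^2 + \alpha)$ for an anchor $a$, a positive sample $p$, a negative sample $n$, and a margin $\alpha$. The sketch below evaluates it on already-computed embedding vectors; the margin value 0.2 and the vectors are arbitrary choices for illustration, and the symbols $a$, $p$, $n$, $\alpha$ do not appear on the slide.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer than the negative by at least `margin`."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# Positive already much closer than negative: no loss.
print(triplet_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0]))
# Positive farther than negative: positive loss pushes them to swap.
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], margin=0.5))
```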

Slide 90

Slide 90 text

The more general form is Contrastive Learning. [Figure: Target → $f_{\theta_q}$ → $q$; Sample → $g_{\theta_k}$ → $k$.] negative samples, collapse

Slide 91

Slide 91 text

Contrastive Learning. $\mathrm{sim}(q, k)$ can be a distance function, though an inner product is even more commonly used. Examples: (1) $\|q - k\|^2$; (2) $\langle q, k \rangle / \tau$.

Slide 92

Slide 92 text

Contrastive Loss: $\mathcal{L}(\theta) = -\log \frac{e^{\mathrm{sim}(q, k^+)}}{\sum_{k} e^{\mathrm{sim}(q, k)}}$
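
A plain-Python sketch of this loss for a single query follows. It assumes $\mathrm{sim}(q, k)$ is the inner product divided by a temperature `tau`, and that the sum in the denominator runs over the positive key together with the negative keys; both are assumptions about details the slide leaves open. The log-sum-exp shift is only for numerical stability.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(q, k_pos, k_negs, tau=0.1):
    """-log( exp(sim(q,k+)) / sum_k exp(sim(q,k)) ) with sim = <q,k>/tau."""
    sims = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in k_negs]
    m = max(sims)
    log_sum = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_sum - sims[0]


q = [1.0, 0.0]
print(contrastive_loss(q, k_pos=[1.0, 0.0], k_negs=[[0.0, 1.0]]))  # small
print(contrastive_loss(q, k_pos=[0.0, 1.0], k_negs=[[1.0, 0.0]]))  # large
```

This is the same computation as a softmax cross entropy in which the positive key plays the role of the correct class.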

Slide 93

Slide 93 text

Augmentation: is it possible to do no labeling at all? labeling (1%), model, labeling? augmentation

Slide 94

Slide 94 text

Self-Supervised Learning: contrastive learning, representation, labeling, Yann LeCun, self-supervised learning

Slide 95

Slide 95 text

Self-Supervised Learning. "We believe that self-supervised learning is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems." Yann LeCun (楊立昆) / Ishan Misra, 2021. Self-supervised Learning: The Dark Matter of Intelligence. https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/

Slide 96

Slide 96 text

Non-Contrastive Learning: without negative samples, how do we avoid collapse? [Figure: $x$ → $f_{\theta_q}$ → $q$ → $P_\varphi$; $x^+$ → $g_{\theta_k}$ → $k$.]

Slide 97

Slide 97 text

Time-Series Data 04. Joint work with Yen Jan

Slide 98

Slide 98 text

Time-series data should of course also get representation vectors. [Figure: the past 20 days of data for some stock.]

Slide 99

Slide 99 text

It may then be easier to learn... f

Slide 100

Slide 100 text

Buy or sell? f → buy / sell

Slide 101

Slide 101 text

Or even predict what happens next: f

Slide 102

Slide 102 text

The difficulty: there is far less literature on contrastive learning for time series, and one of the problems is that a reasonable augmentation is hard to design!

Slide 103

Slide 103 text

Siamese Network (twin neural network): $x$ → $f_\theta$ → $z$ and $x'$ → $f_\theta$ → $z'$, with a contrastive loss on $z$ and $z'$.

Slide 104

Slide 104 text

Labeled data: the past 20 days of data for some stock → up / down.

Slide 105

Slide 105 text

A super strict criterion: $x_t < x_{t+1} < x_{t+2} < x_{t+3} < x_{t+4} < x_{t+5}$ (* $x_t$: the value on day $t$). Only rising all the way through the next five days counts as "up"!
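
This labeling rule is easy to state in code. The sketch below assumes the series is a plain list of daily values indexed by day; the function name `strictly_rising` and the sample numbers are hypothetical.

```python
def strictly_rising(values, t, horizon=5):
    """True only if values[t] < values[t+1] < ... < values[t+horizon]."""
    window = values[t:t + horizon + 1]
    if len(window) < horizon + 1:
        raise ValueError("not enough future days to label day t")
    return all(a < b for a, b in zip(window, window[1:]))


series = [10.0, 10.5, 11.0, 11.2, 11.9, 12.3, 12.1]
print(strictly_rising(series, 0))  # five straight rises after day 0
print(strictly_rising(series, 1))  # the last step falls, so not "up"
```

Because a single flat or falling day breaks the chain, very few days receive the "up" label.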

Slide 106

Slide 106 text

As one would expect, this is a very imbalanced dataset! [Figure: bar chart of the number of "up" and "down" samples; axis from 0 to 90000.]

Slide 107

Slide 107 text

Handling imbalanced data: v1, v2, v3

Slide 108

Slide 108 text

          original                 P-adic enhanced
          V1      V2      V3       V1      V2      V3
LSTM      -       71.6%   71.7%    -       72.1%   69.9%
SiamCL    65.6%   71.5%   71.3%    73.7%   73.8%   73.3%
* precision

Slide 109

Slide 109 text

Q & A. Any questions?