Slide 1

Probabilistic FastText for Multi-Sense Word Embeddings
Ben Athiwaratkun, Andrew Gordon Wilson, Anima Anandkumar (ACL 2018)
Presenter: Tosho Hirasawa, Tokyo Metropolitan University, Komachi Lab

Slide 2

1. Overview • Probabilistic FastText combines FastText's subword structure with multi-sense probabilistic (Gaussian mixture) word embeddings • Each word is a Gaussian mixture whose first-component mean is built from character n-grams • Captures multiple word senses while remaining robust to rare and out-of-vocabulary words

Slide 3

2. Word Embedding • Background: representing each word as a dense, low-dimensional vector

Slide 4

3. Background

Slide 5

3. Background: NNLM [Bengio+, 2003] • Neural Network Language Model • Predicts the next word from the preceding context with a feed-forward NN • Learns distributed word representations (embeddings) as a byproduct • Evaluated on the Brown corpus, with lower perplexity than n-gram baselines

Slide 6

3. Background: CBOW [Mikolov+, 2013a] • CBOW: Continuous Bag-of-Words • Simplifies NNLM by removing the non-linear hidden layer • Predicts the target word from the averaged embeddings of its surrounding context words • Together with skip-gram, forms the Word2Vec toolkit • Skip-gram instead predicts the surrounding context words from the target word

Slide 7

3. Background: skip-gram [Mikolov+, 2013b] • Predicts the surrounding context words from the center word • Efficient approximations of the softmax • Hierarchical Softmax • Negative Sampling • Faster and better-quality training • Subsampling of Frequent Words
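
A minimal NumPy sketch of the skip-gram negative-sampling loss summarized above (the vocabulary size, dimensionality, and the uniform toy negative sampler are assumptions for illustration; word2vec itself samples negatives from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 1000, 50                               # toy vocabulary size and embedding dimension (assumed)
W_in = rng.normal(scale=0.1, size=(V, d))     # center-word ("input") vectors
W_out = rng.normal(scale=0.1, size=(V, d))    # context-word ("output") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one (center, context) pair:
    -log sigmoid(v_c . u_o) - sum_k log sigmoid(-v_c . u_k)."""
    v = W_in[center]
    pos = np.log(sigmoid(W_out[context] @ v))
    neg = np.sum(np.log(sigmoid(-W_out[negatives] @ v)))
    return -(pos + neg)

# Example: center word 10, observed context word 42, 5 toy negatives
negatives = rng.integers(0, V, size=5)        # word2vec draws these from unigram^(3/4)
print(sgns_loss(10, 42, negatives))
```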

Slide 8

3. Background: GloVe [Pennington+, 2014] • GloVe: Global Vectors for Word Representation • Trains on global word–word co-occurrence counts rather than only local context windows • Weighted least-squares objective fits dot products of word vectors to log co-occurrence counts
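
For reference, the GloVe objective is the weighted least-squares fit mentioned above, where X_{ij} is the co-occurrence count of words i and j and f is a weighting function that down-weights rare pairs:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
```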

Slide 9

3. Background: ELMo [Peters+, 2018] • ELMo: Embeddings from Language Models • Pre-trains a bi-LSTM language model on a large corpus • Word representations are a learned weighted combination of the bi-LSTM layers • Embeddings are contextual: the same word gets different vectors in different sentences

Slide 10

3. Background: FastText [Bojanowski+, 2017] • Extends skip-gram with negative sampling (SGNS) to the subword level • Each word is represented by its character n-grams (n = 3, …, 6) plus the word itself • G_w: the set of n-grams of word w, drawn from an n-gram dictionary of size G • The word vector is the sum of the vectors z_g of its n-grams g ∈ G_w
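
A small sketch of the subword representation above: extract boundary-marked character n-grams and sum their vectors, hashing n-grams into a fixed bucket table as FastText does (bucket count, dimension, and the use of Python's built-in hash are toy assumptions; FastText uses the FNV-1a hash and keeps full words in a separate dictionary):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Boundary-marked character n-grams of a word, plus the full word itself."""
    marked = f"<{word}>"
    grams = {marked}
    for n in range(n_min, n_max + 1):
        grams.update(marked[i:i + n] for i in range(len(marked) - n + 1))
    return sorted(grams)

# Toy n-gram embedding table: hash each n-gram into one of B buckets (assumed sizes).
B, d = 100_000, 50
rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(B, d))

def word_vector(word):
    """FastText-style word vector: sum of the vectors of the word's n-grams."""
    ids = [hash(g) % B for g in char_ngrams(word)]
    return Z[ids].sum(axis=0)

print(char_ngrams("where")[:5])
print(word_vector("where").shape)   # (50,)
```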

Slide 11

3. Background: Probabilistic Word Embeddings • Word2Gaussian [Vilnis+, 2014], Word2GaussianMixture [Athiwaratkun and Wilson, 2017] • W2G: represents each word as a Gaussian distribution instead of a point vector • W2GM: extends W2G to a mixture of Gaussians, so one word can cover multiple senses • No prior model is both subword-level and probabilistic • Subword-level: robust to rare and out-of-vocabulary words • Probabilistic: captures uncertainty and multiple senses

                 Dictionary-level                      Subword-level
Deterministic    CBOW, Skip-gram, GloVe, ELMo          FastText
Probabilistic    Word2Gaussian, Word2GaussianMixture   Probabilistic FastText
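
The similarity used by these Gaussian embeddings is the expected likelihood kernel (Vilnis and McCallum, 2014), an inner product between densities that has a closed form for two Gaussians:

```latex
E(f, g)
  = \log \int \mathcal{N}(x;\, \mu_f, \Sigma_f)\, \mathcal{N}(x;\, \mu_g, \Sigma_g)\, dx
  = \log \mathcal{N}\!\left(0;\; \mu_f - \mu_g,\; \Sigma_f + \Sigma_g\right)
```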

Slide 12

4. Proposed Model: Probabilistic FastText • Each word w is a Gaussian mixture density with K components • Probabilistic Subword Representation • The mean of the first component (i = 1) is built from subword structure: the average of the word vector and its character n-gram vectors • The means of the remaining components (i = 2, …, K) are free dictionary-level vectors that capture other senses
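
The corresponding definitions can be written as follows (notation loosely following the paper; NG_w is the set of character n-grams of word w, v_w its word vector, z_g the n-gram vectors):

```latex
f_w(x) = \sum_{i=1}^{K} p_{w,i}\, \mathcal{N}\!\left(x;\, \mu_{w,i}, \Sigma_{w,i}\right),
\qquad
\mu_{w,1} = \frac{1}{|NG_w| + 1} \left( v_w + \sum_{g \in NG_w} z_g \right)
```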

Slide 13

4. Proposed Model: Probabilistic FastText • Word similarity is measured with the expected likelihood kernel, an inner product between densities in a Hilbert space • ξ_{i,j} is the partial energy between component i of word f and component j of word g • The total energy is dominated by the largest partial energy, e.g. for f = rock, g = pop the matching music senses give the dominant partial energy
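
A NumPy sketch of the partial and total energies described above, for two words represented as spherical-Gaussian mixtures (the toy means, variances, and mixture weights are assumptions for illustration):

```python
import numpy as np

def log_partial_energy(mu_f, mu_g, alpha_f, alpha_g):
    """xi_{i,j} = log N(0; mu_{f,i} - mu_{g,j}, (alpha_f + alpha_g) I)
    for spherical covariances Sigma = alpha * I."""
    d = mu_f.shape[0]
    var = alpha_f + alpha_g
    diff = mu_f - mu_g
    return -0.5 * d * np.log(2 * np.pi * var) - diff @ diff / (2 * var)

def log_energy(mus_f, mus_g, p_f, p_g, alpha_f, alpha_g):
    """E(f, g) = log sum_{i,j} p_{f,i} p_{g,j} exp(xi_{i,j})  (expected likelihood kernel)."""
    K_f, K_g = len(p_f), len(p_g)
    xi = np.array([[log_partial_energy(mus_f[i], mus_g[j], alpha_f, alpha_g)
                    for j in range(K_g)] for i in range(K_f)])
    logw = np.log(np.outer(p_f, p_g))
    m = (xi + logw).max()                      # log-sum-exp for numerical stability
    return m + np.log(np.exp(xi + logw - m).sum())

# Toy example with K = 2 components per word in 5 dimensions
rng = np.random.default_rng(0)
mus_rock = rng.normal(size=(2, 5))
mus_pop = rng.normal(size=(2, 5))
p = np.array([0.5, 0.5])
print(log_energy(mus_rock, mus_pop, p, p, alpha_f=1.0, alpha_g=1.0))
```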

Slide 14

4. Proposed Model: Probabilistic FastText • Training: max-margin ranking objective with negative sampling, as in Mikolov+, 2013b • Negative context words are drawn from a unigram distribution • The loss pushes the energy of observed (word, context) pairs above that of negative pairs by a margin • With K = 1 the model reduces to a single-sense (single-Gaussian) subword model
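
A sketch of the max-margin objective for one training pair, using energies such as those from the previous sketch (the margin value and the example energies are placeholders):

```python
def max_margin_loss(E_pos, E_neg, margin=1.0):
    """L = max(0, margin - E(w, c) + E(w, c')) where c is an observed context word
    and c' a negatively sampled one; the loss is zero once the positive pair's
    energy exceeds the negative pair's by at least the margin."""
    return max(0.0, margin - E_pos + E_neg)

# Toy usage: E_pos / E_neg would come from log_energy(...) above
print(max_margin_loss(E_pos=-3.2, E_neg=-7.5))   # 0.0: already separated by more than the margin
print(max_margin_loss(E_pos=-5.0, E_neg=-4.8))   # positive loss: pair not yet separated
```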

Slide 15

5. Experiment • Evaluation on word similarity datasets (correlation with human similarity judgments) • Training corpora • UKWAC + WACKYPEDIA (EN), FRWAC (FR), DEWAC (DE), ITWAC (IT) • Hyperparameters • Number of mixture components: K = 2 • l = 10 • Subsampling threshold: t = 10^-5 • n-gram sizes: n = 3, 4, 5, 6
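
The settings listed above, collected as a plain Python dictionary for reference (only the values come from the slide; the key names are illustrative):

```python
# Hyperparameters as listed on the slide; key names are illustrative only.
config = {
    "num_mixture_components": 2,     # K = 2
    "l": 10,                         # l = 10 (as given on the slide)
    "subsampling_threshold": 1e-5,   # t = 10^-5
    "ngram_sizes": [3, 4, 5, 6],     # character n-gram lengths
}
```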

Slide 16

6. Results: Nearest Neighbors • Nearest neighbors of PFT-GM (K = 2) vs. PFT-G (K = 1) • With K = 2, the two mixture components of a polysemous word retrieve neighbors for different senses
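
One simple way to produce per-component nearest-neighbor lists like these: rank vocabulary words by cosine similarity between their mean vectors and the query component's mean (the data below is random toy data; real lists would use the trained PFT component means):

```python
import numpy as np

def nearest_neighbors(query_mean, means, vocab, topk=5):
    """Rank vocabulary words by cosine similarity of their mean vectors
    to a single mixture-component mean of the query word."""
    q = query_mean / np.linalg.norm(query_mean)
    M = means / np.linalg.norm(means, axis=1, keepdims=True)
    sims = M @ q
    order = np.argsort(-sims)[:topk]
    return [(vocab[i], float(sims[i])) for i in order]

# Toy example
rng = np.random.default_rng(0)
vocab = [f"word{i}" for i in range(100)]
means = rng.normal(size=(100, 20))
print(nearest_neighbors(means[0], means, vocab, topk=3))
```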

Slide 17

6. Results: Word Similarity Dataset

Slide 18

6. Results: Word Similarity Dataset • Comparison against FastText, W2G, and W2GM: the proposed models achieve higher correlation on most datasets

Slide 19

6. Results: Multi-Prototype Models • SCWS dataset (word similarity in sentential context) • With Dim = 300, the proposed model achieves state-of-the-art among multi-prototype models • Baselines include the multi-sense skip-gram of Neelakantan et al.

Slide 20

6. Results: FR, DE, IT • Word similarity evaluation on French, German, and Italian • The proposed model also outperforms the baselines in these languages

Slide 21

6. Results: Subword Decomposition • Analysis of which subwords contribute most to a word's representation • For each word, the top-5 / bottom-5 contributing n-grams are shown • Example: for "abnormality" / "abnormal", the subword "abnorm" is among the top contributors • Other examples: autobiographer, circumnavigations, hypersensitivity
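
A sketch of one way to rank subword contributions as in this analysis: score each character n-gram by the cosine similarity between its vector and the word's mean vector, then take the top-5 / bottom-5 (this scoring rule and the random toy embedding table are assumptions, not the paper's exact procedure):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Boundary-marked character n-grams of a word."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        grams.update(marked[i:i + n] for i in range(len(marked) - n + 1))
    return sorted(grams)

def subword_contributions(word, ngram_vec, topk=5):
    """Rank a word's n-grams by cosine similarity to the word's mean n-gram vector."""
    grams = char_ngrams(word)
    Z = np.stack([ngram_vec(g) for g in grams])
    mean = Z.mean(axis=0)
    sims = (Z @ mean) / (np.linalg.norm(Z, axis=1) * np.linalg.norm(mean))
    order = np.argsort(-sims)
    return [grams[i] for i in order[:topk]], [grams[i] for i in order[-topk:]]

# Toy n-gram vectors via hashing into a random table
rng = np.random.default_rng(0)
table = rng.normal(size=(50_000, 50))
ngram_vec = lambda g: table[hash(g) % 50_000]

top, bottom = subword_contributions("abnormality", ngram_vec)
print("top-5:", top)
print("bottom-5:", bottom)
```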

Slide 22

6. Results: Number of Mixture Components • K = 2 gives the best overall results • K > 2 brings little additional benefit [Athiwaratkun and Wilson, 2017] • K = 1 cannot separate distinct senses of a word • e.g. ("cell", "jail"), ("cell", "biology"), ("cell", "phone") • Consistent with the view that a polysemous word's vector behaves like a frequency-weighted combination of a few sense vectors [Arora+, 2016]

Slide 23

7. Conclusion • Proposed Probabilistic FastText, a probabilistic subword-level model for multi-sense word embeddings • Combines the strengths of FastText (subword structure) and Gaussian mixture embeddings (multiple senses) • Achieves strong results on word similarity benchmarks, including state-of-the-art among multi-prototype embeddings • Further Works • multi-prototype multi-lingual embedding

Slide 24

8. Comments • Comparison with contextual embeddings such as ELMo would be interesting