Slide 1

Clustering Semantically Similar Words
DSW Camp & Jam, December 4th, 2016
Bayu Aldi Yansyah

Slide 2

Overview: Our Goals
- Understand, step by step, how to cluster words based on their semantic similarity
- Understand how deep learning models are applied to Natural Language Processing

Slide 3

Overview: I assume ...
- You understand the basics of Natural Language Processing and Machine Learning
- You are familiar with artificial neural networks

Slide 4

Overview: Outline
1. Introduction to Word Clustering
2. Introduction to Word Embedding
   - Feed-forward Neural Net Language Model
   - Continuous Bag-of-Words Model
   - Continuous Skip-gram Model
3. Similarity metrics
   - Cosine similarity
   - Euclidean similarity
4. Clustering algorithm: Consensus clustering

Slide 5

1. WORD CLUSTERING
INTRODUCTION
- Word clustering is a technique for partitioning a set of words into subsets of semantically similar words.
- Suppose we have a set of words $W = \{w_1, w_2, \dots, w_n\}$, $n \in \mathbb{N}$; our goal is to find $C = \{C_1, C_2, \dots, C_k\}$, $k \in \mathbb{N}$, where
  - $C_j = \{w \mid \forall w \in W \text{ where } \mathrm{similarity}(\bar{w}_j, w) \ge t\}$
  - $\bar{w}_j$ is the centroid of cluster $C_j$
  - $\mathrm{similarity}(w_i, w_j)$ is a function that measures the similarity score of two words
  - $t$ is a threshold value: $\mathrm{similarity}(w_i, w_j) \ge t$ means that $w_i$ and $w_j$ are semantically similar.
- For $w_1 \in C_i$ and $w_2 \in C_j$ (with $i \ne j$) it holds that $\mathrm{similarity}(w_1, w_2) < t$, so $C_i \cap C_j = \emptyset$ for all $C_i, C_j \in C$.

Slide 6

1. WORD CLUSTERING
INTRODUCTION
In order to perform word clustering, we need to (see the sketch below):
1. Represent words as semantic vectors, so we can compute their similarity and dissimilarity scores.
2. Find the centroid $\bar{w}_j$ for each cluster.
3. Choose the similarity metric $\mathrm{similarity}(w_i, w_j)$ and the threshold value $t$.
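As a rough illustration of these three steps, here is a minimal sketch. It assumes the word vectors are already available as a dict of numpy arrays, uses cosine similarity with a threshold of 0.5, and picks centroids greedily; these choices are illustrative, not necessarily the author's exact method.

```python
import numpy as np

def cosine(a, b):
    # similarity(w_i, w_j): cosine of the angle between two word vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cluster_words(word_vectors, t=0.5):
    """Greedy sketch: each unassigned word starts a new cluster and acts as
    its centroid; every remaining word within threshold t joins that cluster."""
    clusters = []
    unassigned = set(word_vectors)
    while unassigned:
        centroid = unassigned.pop()
        members = {centroid}
        for word in list(unassigned):
            if cosine(word_vectors[centroid], word_vectors[word]) >= t:
                members.add(word)
                unassigned.remove(word)
        clusters.append((centroid, members))
    return clusters
```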

Slide 7

Semantic ≠ Synonym
“Words are similar semantically if they have the same thing, are opposite of each other, used in the same way, used in the same context and one is a type of another.”
− Gomaa and Fahmy (2013)

Slide 8

2. WORD EMBEDDING
INTRODUCTION
- Word embedding is a technique to represent a word as a vector.
- The result of word embedding is frequently referred to as a “word vector” or a “distributed representation of words”.
- There are 3 main approaches to word embedding:
  1. Neural network based
  2. Dimensionality reduction based
  3. Probabilistic model based
- We focus on (1).
- The idea of these approaches is to learn vector representations of words in an unsupervised manner.

Slide 9

2. WORD EMBEDDING
INTRODUCTION
- Some neural network models that can learn representations of words are:
  1. Feed-forward Neural Net Language Model by Bengio et al. (2003).
  2. Continuous Bag-of-Words Model by Mikolov et al. (2013).
  3. Continuous Skip-gram Model by Mikolov et al. (2013).
- We will compare these 3 models.
- Fun fact: the last two models are highly inspired by the first one.
- Only the Feed-forward Neural Net Language Model is considered a deep learning model.

Slide 10

2. WORD EMBEDDING
COMPARING NEURAL NETWORK MODELS
- We will use the notation from Collobert et al. (2011) to describe the models. This helps us compare the models easily.
- Any feed-forward neural network with $L$ layers can be seen as a composition of functions $f_\theta^l(\cdot)$, one for each layer $l$:
  $$f_\theta(x) = f_\theta^L(f_\theta^{L-1}(\dots f_\theta^1(x) \dots))$$
- with parameters for each layer $l$:
  $$\theta = (\theta^1, \theta^2, \dots, \theta^L)$$
- Usually each layer $l$ has a weight $W^l$ and a bias $b^l$, i.e. $\theta^l = (W^l, b^l)$.
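A minimal sketch of this notation in code, assuming each layer is simply a (weight, bias, activation) triple; the names and the example network below are illustrative, not taken from the slides.

```python
import numpy as np

def feed_forward(x, layers):
    """f_theta(x) = f^L(f^{L-1}(... f^1(x) ...)): apply the layers in order.
    Each element of `layers` is (W, b, activation), standing for theta^l = (W^l, b^l)."""
    out = x
    for W, b, activation in layers:
        out = activation(W.T @ out + b)
    return out

# Example: a 2-layer network with a tanh hidden layer and an identity output layer
layers = [
    (np.random.randn(4, 3), np.zeros(3), np.tanh),
    (np.random.randn(3, 2), np.zeros(2), lambda z: z),
]
y = feed_forward(np.ones(4), layers)
```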

Slide 11

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL
Bengio et al. (2003)
- The training data is a sequence of words $w_1, w_2, \dots, w_T$ with $w_t \in V$ (the vocabulary).
- The model tries to predict the next word $w_t$ based on the previous context (the previous words $w_{t-1}, w_{t-2}, \dots, w_{t-n}$). (Figure 2.1.1)
- The model consists of 4 layers: input layer, projection layer, hidden layer(s) and output layer. (Figure 2.1.2)
- Known as NNLM.
Figure 2.1.1: predicting $w_t$ from the previous words $w_{t-1}, w_{t-2}, w_{t-3}, w_{t-4}$ (example sentence: “Keren Sale Stock bisa dirumah ...”).

Slide 12

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL
COMPOSITION OF FUNCTIONS: INPUT LAYER
- $x_{t-1}, x_{t-2}, \dots, x_{t-n}$ are the 1-of-$|V|$ (one-hot-encoded) vectors of $w_{t-1}, w_{t-2}, \dots, w_{t-n}$.
- $n$ is the number of previous words.
- The input layer just acts as a placeholder here.

$\hat{y}_t = f_\theta(x_{t-1}, \dots, x_{t-n})$
- Input layer for the i-th example: $f_\theta^1(x_i) = (x_{t-1}, x_{t-2}, \dots, x_{t-n})$
- Projection layer: $f_\theta^2(x_i) = W^{2\top} f_\theta^1(x_i)$
- Hidden layer: $f_\theta^3(x_i) = \tanh(W^{3\top} f_\theta^2(x_i) + b^3)$
- Output layer: $f_\theta^4(x_i) = \hat{y}_t = W^{4\top} f_\theta^3(x_i) + b^4$
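For concreteness, a 1-of-$|V|$ (one-hot) vector can be built like this; the tiny vocabulary below is made up for illustration.

```python
import numpy as np

vocab = ["keren", "sale", "stock", "bisa", "dirumah"]      # toy vocabulary V
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # 1-of-|V| encoding: a |V|-dimensional vector that is 1 at the word's index, 0 elsewhere
    x = np.zeros(len(vocab))
    x[word_to_index[word]] = 1.0
    return x

x_prev = [one_hot(w) for w in ["keren", "sale", "stock", "bisa"]]   # x_{t-4}, ..., x_{t-1}
```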

Slide 13

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL
COMPOSITION OF FUNCTIONS: PROJECTION LAYER
- The idea of this layer is to project each $|V|$-dimensional vector to a smaller dimension $m$.
- $W^2$ is a $|V| \times m$ matrix, also known as the embedding matrix, where each row is a word vector.
- The $n$ projected word vectors are concatenated, giving an $(n \cdot m)$-dimensional output.
- Unlike the hidden layer, there is no non-linearity here.
- This layer is also known as “the shared word features layer”.

$\hat{y}_t = f_\theta(x_{t-1}, \dots, x_{t-n})$
- Input layer for the i-th example: $f_\theta^1(x_i) = (x_{t-1}, x_{t-2}, \dots, x_{t-n})$
- Projection layer: $f_\theta^2(x_i) = W^{2\top} f_\theta^1(x_i)$
- Hidden layer: $f_\theta^3(x_i) = \tanh(W^{3\top} f_\theta^2(x_i) + b^3)$
- Output layer: $f_\theta^4(x_i) = \hat{y}_t = W^{4\top} f_\theta^3(x_i) + b^4$

Slide 14

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL
COMPOSITION OF FUNCTIONS: HIDDEN LAYER
- $W^3$ is an $(n \cdot m) \times h$ matrix, where $h$ is the number of hidden units.
- $b^3$ is an $h$-dimensional vector.
- The activation function is the hyperbolic tangent.

$\hat{y}_t = f_\theta(x_{t-1}, \dots, x_{t-n})$
- Input layer for the i-th example: $f_\theta^1(x_i) = (x_{t-1}, x_{t-2}, \dots, x_{t-n})$
- Projection layer: $f_\theta^2(x_i) = W^{2\top} f_\theta^1(x_i)$
- Hidden layer: $f_\theta^3(x_i) = \tanh(W^{3\top} f_\theta^2(x_i) + b^3)$
- Output layer: $f_\theta^4(x_i) = \hat{y}_t = W^{4\top} f_\theta^3(x_i) + b^4$

Slide 15

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL
COMPOSITION OF FUNCTIONS: OUTPUT LAYER
- $W^4$ is an $h \times |V|$ matrix.
- $b^4$ is a $|V|$-dimensional vector.
- The activation function is softmax.
- $\hat{y}_t$ is a $|V|$-dimensional vector.

$\hat{y}_t = f_\theta(x_{t-1}, \dots, x_{t-n})$
- Input layer for the i-th example: $f_\theta^1(x_i) = (x_{t-1}, x_{t-2}, \dots, x_{t-n})$
- Projection layer: $f_\theta^2(x_i) = W^{2\top} f_\theta^1(x_i)$
- Hidden layer: $f_\theta^3(x_i) = \tanh(W^{3\top} f_\theta^2(x_i) + b^3)$
- Output layer: $f_\theta^4(x_i) = \hat{y}_t = W^{4\top} f_\theta^3(x_i) + b^4$
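Putting the four layers together, here is a minimal sketch of the NNLM forward pass. Shapes follow the slides; initialization values, the training loop and all variable names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, n, m, h = 5000, 4, 2, 5                  # |V|, previous words, embedding dim, hidden units
W2 = 0.01 * np.random.randn(V, m)           # embedding matrix: one m-dimensional row per word
W3 = 0.01 * np.random.randn(n * m, h)       # hidden-layer weights
b3 = np.zeros(h)
W4 = 0.01 * np.random.randn(h, V)           # output-layer weights
b4 = np.zeros(V)

def nnlm_forward(context_ids):
    """context_ids: vocabulary indices of the previous words w_{t-1}, ..., w_{t-n}."""
    p = np.concatenate([W2[i] for i in context_ids])   # projection: look up and concatenate
    hidden = np.tanh(W3.T @ p + b3)                    # hidden layer with tanh activation
    return softmax(W4.T @ hidden + b4)                 # softmax estimate of P(w_t | context)
```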

Slide 16

2.1. FEED-FORWARD NEURAL NET LANGUAGE MODEL
LOSS FUNCTION
$$J_\theta = \frac{1}{N} \sum_{i=1}^{N} \log f_\theta(x_{t-1}, \dots, x_{t-n})_i$$
- where $N$ is the number of training examples and the subscript $i$ denotes the $i$-th example.
- The goal is to maximize this objective (the average log-likelihood).
- The neural network is trained using stochastic gradient ascent.
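Continuing the NNLM sketch above, the average log-likelihood over a set of examples could be computed as follows; the helper name and the (context_ids, target_id) format are assumptions, and the stochastic gradient ascent step itself is not shown.

```python
import numpy as np

def average_log_likelihood(examples):
    """examples: list of (context_ids, target_id) pairs, one per training example.
    Returns (1/N) * sum_i log f_theta(x_{t-1}, ..., x_{t-n})_i at the true next word."""
    total = 0.0
    for context_ids, target_id in examples:
        y_hat = nnlm_forward(context_ids)       # defined in the NNLM sketch above
        total += np.log(y_hat[target_id])       # log-probability assigned to the true w_t
    return total / len(examples)
```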

Slide 17

Figure 2.1.2: Flow of the tensors in the Feed-forward Neural Net Language Model with vocabulary size $|V|$ and hyperparameters $n = 4$, $m = 2$ and $h = 5$. (Inputs $x_{t-1}, x_{t-2}, x_{t-3}, x_{t-4}$; output $\hat{y}_t$.)

Slide 18

2.2. CONTINUOUS BAG-OF-WORDS MODEL
Mikolov et al. (2013)
- The training data is a sequence of words $w_1, w_2, \dots, w_T$ with $w_t \in V$.
- The model tries to predict the word $w_t$ based on the surrounding context ($n$ words from the left: $w_{t-1}, w_{t-2}$ and $n$ words from the right: $w_{t+1}, w_{t+2}$). (Figure 2.2.1)
- There is no hidden layer in this model.
- The projection layer is averaged across the input words.
Figure 2.2.1: predicting $w_t$ from its surrounding words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ (example sentence: “Keren Sale bisa bayar dirumah ...”).

Slide 19

2.2. CONTINUOUS BAG-OF-WORDS MODEL
COMPOSITION OF FUNCTIONS: INPUT LAYER
- $x_{t+j}$ is the 1-of-$|V|$ (one-hot-encoded) vector of $w_{t+j}$.
- $n$ is the number of words on the left and on the right.

$\hat{y}_t = f_\theta(x_{t-n}, \dots, x_{t-1}, x_{t+1}, \dots, x_{t+n})$
- Input layer for the i-th example: $f_\theta^{1(j)}(x_i) = x_{t+j}$, for $-n \le j \le n$, $j \ne 0$
- Projection layer: $f_\theta^2(x_i) = v = \frac{1}{2n} \sum_{-n \le j \le n,\ j \ne 0} W^{2\top} f_\theta^{1(j)}(x_i)$
- Output layer: $f_\theta^3(x_i) = \hat{y}_t = W^{3\top} f_\theta^2(x_i) + b^3$

Slide 20

2.2. CONTINUOUS BAG-OF-WORDS MODEL
COMPOSITION OF FUNCTIONS: PROJECTION LAYER
- The difference from the previous model is that this model projects all the inputs into one $m$-dimensional vector $v$.
- $W^2$ is a $|V| \times m$ matrix, also known as the embedding matrix, where each row is a word vector.

$\hat{y}_t = f_\theta(x_{t-n}, \dots, x_{t-1}, x_{t+1}, \dots, x_{t+n})$
- Input layer for the i-th example: $f_\theta^{1(j)}(x_i) = x_{t+j}$, for $-n \le j \le n$, $j \ne 0$
- Projection layer: $f_\theta^2(x_i) = v = \frac{1}{2n} \sum_{-n \le j \le n,\ j \ne 0} W^{2\top} f_\theta^{1(j)}(x_i)$
- Output layer: $f_\theta^3(x_i) = \hat{y}_t = W^{3\top} f_\theta^2(x_i) + b^3$

Slide 21

2.2. CONTINUOUS BAG-OF-WORDS MODEL
COMPOSITION OF FUNCTIONS: OUTPUT LAYER
- $W^3$ is an $m \times |V|$ matrix.
- $b^3$ is a $|V|$-dimensional vector.
- The activation function is softmax.
- $\hat{y}_t$ is a $|V|$-dimensional vector.

$\hat{y}_t = f_\theta(x_{t-n}, \dots, x_{t-1}, x_{t+1}, \dots, x_{t+n})$
- Input layer for the i-th example: $f_\theta^{1(j)}(x_i) = x_{t+j}$, for $-n \le j \le n$, $j \ne 0$
- Projection layer: $f_\theta^2(x_i) = v = \frac{1}{2n} \sum_{-n \le j \le n,\ j \ne 0} W^{2\top} f_\theta^{1(j)}(x_i)$
- Output layer: $f_\theta^3(x_i) = \hat{y}_t = W^{3\top} f_\theta^2(x_i) + b^3$
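A minimal sketch of the CBOW forward pass under the same conventions; the parameter values and names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, n, m = 5000, 2, 100                    # |V|, context words per side, embedding dim
W2 = 0.01 * np.random.randn(V, m)         # embedding matrix
W3 = 0.01 * np.random.randn(m, V)         # output-layer weights
b3 = np.zeros(V)

def cbow_forward(context_ids):
    """context_ids: indices of w_{t-n}, ..., w_{t-1}, w_{t+1}, ..., w_{t+n} (2n words)."""
    v = W2[context_ids].mean(axis=0)      # projection: average the 2n word vectors into v
    return softmax(W3.T @ v + b3)         # softmax estimate of P(w_t | context)
```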

Slide 22

2.2. CONTINUOUS BAG-OF-WORDS MODEL
LOSS FUNCTION
$$J_\theta = \frac{1}{N} \sum_{i=1}^{N} \log f_\theta(x_{t-n}, \dots, x_{t-1}, x_{t+1}, \dots, x_{t+n})_i$$
- where $N$ is the number of training examples.
- The goal is to maximize this objective (the average log-likelihood).
- The neural network is trained using stochastic gradient ascent.

Slide 23

Figure 2.2.2: Flow of the tensors in the Continuous Bag-of-Words Model with vocabulary size $|V|$ and hyperparameters $n = 2$, $m = 2$. (Inputs $x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2}$; output $\hat{y}_t$.)

Slide 24

2.3. CONTINUOUS SKIP-GRAM MODEL
Mikolov et al. (2013)
- The training data is a sequence of words $w_1, w_2, \dots, w_T$ with $w_t \in V$.
- The model tries to predict the surrounding context ($n$ words from the left: $w_{t-1}, w_{t-2}$ and $n$ words from the right: $w_{t+1}, w_{t+2}$) based on the word $w_t$. (Figure 2.3.1)
Figure 2.3.1: predicting the surrounding words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ from $w_t$ (example words: “Keren”, “bisa”, ...).

Slide 25

2.3. CONTINUOUS SKIP-GRAM MODEL
COMPOSITION OF FUNCTIONS: INPUT LAYER
- $x_t$ is the 1-of-$|V|$ (one-hot-encoded) vector of $w_t$.

$\hat{y} = f_\theta(x_t)$
- Input layer for the i-th example: $f_\theta^1(x_i) = x_t$
- Projection layer: $f_\theta^2(x_i) = W^{2\top} f_\theta^1(x_i)$
- Output layer: $f_\theta^3(x_i) = \hat{y} = W^{3\top} f_\theta^2(x_i) + b^3$

Slide 26

2.3. CONTINUOUS SKIP-GRAM MODEL
COMPOSITION OF FUNCTIONS: PROJECTION LAYER
- $W^2$ is a $|V| \times m$ matrix, also known as the embedding matrix, where each row is a word vector.
- Same as in the Continuous Bag-of-Words model.

$\hat{y} = f_\theta(x_t)$
- Input layer for the i-th example: $f_\theta^1(x_i) = x_t$
- Projection layer: $f_\theta^2(x_i) = W^{2\top} f_\theta^1(x_i)$
- Output layer: $f_\theta^3(x_i) = \hat{y} = W^{3\top} f_\theta^2(x_i) + b^3$

Slide 27

2.3. CONTINUOUS SKIP-GRAM MODEL
COMPOSITION OF FUNCTIONS: OUTPUT LAYER
- $W^3$ is an $m \times 2n|V|$ matrix.
- $b^3$ is a $2n|V|$-dimensional vector.
- The activation function is softmax.
- $\hat{y}$ is a $2n|V|$-dimensional vector that can be written as
  $$\hat{y} = (p(w_{t-n} \mid w_t), \dots, p(w_{t-1} \mid w_t), p(w_{t+1} \mid w_t), \dots, p(w_{t+n} \mid w_t))$$

$\hat{y} = f_\theta(x_t)$
- Input layer for the i-th example: $f_\theta^1(x_i) = x_t$
- Projection layer: $f_\theta^2(x_i) = W^{2\top} f_\theta^1(x_i)$
- Output layer: $f_\theta^3(x_i) = \hat{y} = W^{3\top} f_\theta^2(x_i) + b^3$
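A sketch of the skip-gram forward pass as the slides describe it, with one $|V|$-sized output block per context position. Parameter values and names are illustrative; note that practical word2vec implementations replace the full softmax with tricks such as hierarchical softmax or negative sampling, which are not shown here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, n, m = 5000, 2, 100                        # |V|, context words per side, embedding dim
W2 = 0.01 * np.random.randn(V, m)             # embedding matrix
W3 = 0.01 * np.random.randn(m, 2 * n * V)     # output-layer weights (2n blocks of |V| columns)
b3 = np.zeros(2 * n * V)

def skipgram_forward(center_id):
    """Returns 2n probability distributions over V: p(w_{t-n}|w_t), ..., p(w_{t+n}|w_t)."""
    v = W2[center_id]                              # projection: the center word's vector
    scores = (W3.T @ v + b3).reshape(2 * n, V)     # one |V|-dimensional block per position
    return np.array([softmax(s) for s in scores])  # softmax within each block
```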

Slide 28

2.3. CONTINUOUS SKIP-GRAM MODEL
LOSS FUNCTION
$$J_\theta = \frac{1}{N} \sum_{i=1}^{N} \sum_{-n \le j \le n,\ j \ne 0} \log p(w_{t+j} \mid w_t)_i$$
- where $N$ is the number of training examples.
- The goal is to maximize this objective (the average log-likelihood).
- The neural network is trained using stochastic gradient ascent.

Slide 29

Figure 2.3.2: The flow of the tensors in the Continuous Skip-gram Model with vocabulary size $|V|$ and hyperparameters $n = 2$, $m = 2$. (Input $x_t$; outputs for $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$.)

Slide 30

3. SIMILARITY METRICS
INTRODUCTION
- Recall $\mathrm{similarity}(w_i, w_j) \ge t$.
- Similarity metrics for words fall into Character-Based Similarity Measures and Term-based Similarity Measures (Gomaa and Fahmy 2013).
- We focus on Term-based Similarity Measures.

Slide 31

3. SIMILARITY METRICS
INTRODUCTION
- Recall $\mathrm{similarity}(w_i, w_j) \ge t$.
- Similarity metrics for words fall into Character-Based Similarity Measures and Term-based Similarity Measures (Gomaa and Fahmy 2013).
- We focus on Term-based Similarity Measures: Cosine & Euclidean.

Slide 32

3.1. SIMILARITY METRICS
COSINE
$$\mathrm{cosine}(w_1, w_2) = \frac{w_1^\top w_2}{\lVert w_1 \rVert \, \lVert w_2 \rVert}$$
- where $w_i$ is a word vector.
- Range: $-1 \le \mathrm{cosine}(w_1, w_2) \le 1$.
- Recommended threshold value: $t \ge 0.5$.
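A direct implementation of this metric, as a small numpy sketch:

```python
import numpy as np

def cosine_similarity(w1, w2):
    # cosine(w1, w2) = (w1 . w2) / (||w1|| * ||w2||), with values in [-1, 1]
    w1, w2 = np.asarray(w1), np.asarray(w2)
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))
```

With the recommended threshold, two words would then be treated as semantically similar when `cosine_similarity(v1, v2) >= 0.5`.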

Slide 33

3.2. SIMILARITY METRICS
EUCLIDEAN
$$\mathrm{euclidean}(w_1, w_2) = \sqrt{\sum_{i=1}^{m} (w_{1,i} - w_{2,i})^2}$$
$$\mathrm{similarity}(w_1, w_2) = \frac{1}{1 + \mathrm{euclidean}(w_1, w_2)}$$
- where $w_i$ is a word vector.
- Range: $0 \le \mathrm{similarity}(w_1, w_2) \le 1$.
- Recommended threshold value: $t \ge 0.75$.
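A corresponding sketch, assuming the similarity is the Euclidean distance mapped into (0, 1] via 1/(1 + d); this mapping is reconstructed from the stated range and threshold rather than spelled out in the original text.

```python
import numpy as np

def euclidean_distance(w1, w2):
    # sqrt(sum_i (w1_i - w2_i)^2)
    w1, w2 = np.asarray(w1), np.asarray(w2)
    return float(np.sqrt(np.sum((w1 - w2) ** 2)))

def euclidean_similarity(w1, w2):
    # Maps distance 0 -> similarity 1; larger distances approach 0
    return 1.0 / (1.0 + euclidean_distance(w1, w2))
```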

Slide 34

4. CONSENSUS CLUSTERING
INTRODUCTION
- The basic idea here is that we want to find the clusters $C_i$ based on the consensus of multiple clusterings.
- There are 3 approaches to Consensus clustering: Iterative Voting Consensus, Iterative Probabilistic Voting Consensus and Iterative Pairwise Consensus (Nguyen and Caruana 2007).
- We use a slightly modified version of Iterative Voting Consensus.

Slide 35

4.1. CONSENSUS CLUSTERING
THE ALGORITHM
Figure 4.1.1: Iterative Voting Consensus with a slight modification.
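Since the figure itself is not reproduced here, below is a minimal sketch of plain Iterative Voting Consensus as described by Nguyen and Caruana (2007). The author's modification is not captured in the text, so this shows the unmodified algorithm; function and variable names are illustrative.

```python
import numpy as np

def iterative_voting_consensus(labels, k, max_iter=100, seed=0):
    """labels: (N, H) array where labels[i, j] is the cluster of point i in base clustering j.
    Returns a consensus assignment of the N points into k clusters."""
    rng = np.random.default_rng(seed)
    N, H = labels.shape
    assign = rng.integers(k, size=N)              # random initial consensus partition
    for _ in range(max_iter):
        # Step 1: each cluster's center = per-clustering majority vote of its members
        centers = np.zeros((k, H), dtype=labels.dtype)
        for c in range(k):
            members = labels[assign == c]
            if len(members) == 0:
                members = labels[rng.integers(N, size=1)]   # re-seed an empty cluster
            for j in range(H):
                vals, counts = np.unique(members[:, j], return_counts=True)
                centers[c, j] = vals[np.argmax(counts)]
        # Step 2: reassign each point to the center with the smallest Hamming distance
        dists = (labels[:, None, :] != centers[None, :, :]).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):    # converged: no point changed cluster
            break
        assign = new_assign
    return assign
```

Given several clusterings of the same words (for example, produced with different similarity metrics or thresholds), stacking their label vectors column-wise gives the (N, H) input expected here.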

Slide 36

5. CASE STUDY OR DEMO
- Let’s do this

Slide 37

Thanks! | @bayualsyah
Notes available here: https://github.com/pyk/talks