$$
L_D(V; \theta^{D}, \theta^{emb}) \;=\; \frac{1}{|V_{pop}|}\sum_{w \in V_{pop}} \log f_{\theta^{D}}\!\left(\theta^{emb}_{w}\right) \;+\; \frac{1}{|V_{rare}|}\sum_{w \in V_{rare}} \log\!\left(1 - f_{\theta^{D}}\!\left(\theta^{emb}_{w}\right)\right). \tag{2}
$$

Following the principle of adversarial training, we develop a minimax objective to train the task-specific model ($\theta^{model}$ and $\theta^{emb}$) and the discriminator ($\theta^{D}$) as below:
$$
\min_{\theta^{model},\,\theta^{emb}}\;\max_{\theta^{D}}\; L_T(S; \theta^{model}, \theta^{emb}) + \lambda\, L_D(V; \theta^{D}, \theta^{emb}), \tag{3}
$$
where $\lambda$ is a coefficient to trade off the two loss terms. We can see that when the model parameters $\theta^{model}$ and the embedding $\theta^{emb}$ are fixed, the optimization of the discriminator $\theta^{D}$ becomes
$$
\max_{\theta^{D}}\; L_D(V; \theta^{D}, \theta^{emb}), \tag{4}
$$
which is to minimize the classification error of popular and rare words. When the discriminator $\theta^{D}$ is fixed, the optimization of $\theta^{model}$ and $\theta^{emb}$ becomes
$$
\min_{\theta^{model},\,\theta^{emb}}\; L_T(S; \theta^{model}, \theta^{emb}) + \lambda\, L_D(V; \theta^{D}, \theta^{emb}), \tag{5}
$$
i.e., to optimize the task performance while fooling the discriminator. We train $\theta^{model}$, $\theta^{emb}$ and $\theta^{D}$ iteratively by stochastic gradient descent or its variants. The general training process is shown in Algorithm 1.

… our method on a wide range of tasks, including word similarity, language modeling, machine translation …
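To make the alternating optimization in Eqs. (4) and (5) concrete, below is a minimal PyTorch sketch of one training loop. It assumes a toy task network, a small MLP discriminator, a hypothetical popular/rare split of the vocabulary, and synthetic data; all names (`task_model`, `discriminator`, `popular_ids`, `rare_ids`, `lam`, etc.) are illustrative and not taken from the paper or Algorithm 1.

```python
import torch
import torch.nn.functional as F

vocab_size, emb_dim = 10000, 256

# Illustrative components (not from the paper): a shared embedding table (theta_emb),
# a stand-in task network (theta_model), and an MLP discriminator (theta_D) that
# scores how likely an embedding belongs to a popular word.
embedding = torch.nn.Embedding(vocab_size, emb_dim)
task_model = torch.nn.Linear(emb_dim, vocab_size)
discriminator = torch.nn.Sequential(
    torch.nn.Linear(emb_dim, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1))

popular_ids = torch.arange(0, 2000)          # hypothetical V_pop
rare_ids = torch.arange(2000, vocab_size)    # hypothetical V_rare

opt_model = torch.optim.Adam(
    list(task_model.parameters()) + list(embedding.parameters()), lr=1e-3)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
lam = 0.1                                    # trade-off coefficient lambda in Eq. (3)

def disc_log_likelihood(detach_emb=False):
    """L_D in Eq. (2): log-likelihood of correctly classifying popular vs. rare embeddings."""
    pop_emb, rare_emb = embedding(popular_ids), embedding(rare_ids)
    if detach_emb:                           # freeze theta_emb during the discriminator step
        pop_emb, rare_emb = pop_emb.detach(), rare_emb.detach()
    pop_logit = discriminator(pop_emb).squeeze(-1)
    rare_logit = discriminator(rare_emb).squeeze(-1)
    # logsigmoid(x) = log f,  logsigmoid(-x) = log(1 - f)
    return F.logsigmoid(pop_logit).mean() + F.logsigmoid(-rare_logit).mean()

def task_loss(inputs, targets):
    """L_T: stand-in task loss (here, a toy word-prediction objective)."""
    return F.cross_entropy(task_model(embedding(inputs)), targets)

# Synthetic batches, only to make the loop runnable.
train_batches = [(torch.randint(0, vocab_size, (32,)),
                  torch.randint(0, vocab_size, (32,))) for _ in range(100)]

for inputs, targets in train_batches:
    # Eq. (4): with theta_model and theta_emb fixed, maximize L_D over theta_D
    # (gradient ascent, implemented as descent on -L_D).
    opt_disc.zero_grad()
    (-disc_log_likelihood(detach_emb=True)).backward()
    opt_disc.step()

    # Eq. (5): with theta_D fixed, minimize L_T + lambda * L_D over theta_model and
    # theta_emb, i.e. fit the task while pushing embeddings to fool the discriminator.
    opt_model.zero_grad()
    (task_loss(inputs, targets) + lam * disc_log_likelihood()).backward()
    opt_model.step()
```

The sketch only fixes the ideas of Eqs. (2)–(5); the paper's Algorithm 1 may organize the updates differently (for example, sampling subsets of popular and rare words per step, or taking several discriminator updates per model update).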