[Figure/table: the three review models considered in this section, HFT, BoWLF and LMLF, each modeling $p(d_{u,i} = (w^{(1)}_{u,i}, \ldots, w^{(n_{u,i})}_{u,i}) \mid \gamma_i)$ within the review cost $C_D(\theta_D)$.]

The model has two components: matrix factorization in Eq. (1) and review modeling, which shares some of the parameters $\theta$ from the rating prediction model. Here, we follow the approach from [13] by modeling the conditional probability of each review given the corresponding product $\gamma_i$:

  p\left(d_{u,i} = \left(w^{(1)}_{u,i}, \cdots, w^{(n_{u,i})}_{u,i}\right) \mid \gamma_i, \theta_D\right),    (3)

where $\theta_D$ is a set of parameters for this review model. We estimate the parameters of this review model ($\theta_D$ and the $\gamma_i$'s) by minimizing the negative log-likelihood:

  \arg\min_{\theta_D, \{\gamma_i\}_{i=1}^{M}} C_D\left(\theta_D, \{\gamma_i\}_{i=1}^{M}\right),

where

  C_D\left(\theta_D, \{\gamma_i\}_{i=1}^{M}\right) =    (4)
  -\frac{1}{|\mathcal{O}_D|} \sum_{(u,i) \in \mathcal{O}_D} \log p\left(d_{u,i} = \left(w^{(1)}_{u,i}, \cdots, w^{(n_{u,i})}_{u,i}\right) \mid \gamma_i\right).    (5)

We jointly optimize the rating prediction model in Eq. (1) and the review model in Eq. (3) by minimizing the convex combination of $C_R$ in Eq. (2) and $C_D$ in Eq. (4):

  \arg\min_{\theta, \theta_D} \alpha\, C_R(\theta) + (1 - \alpha)\, C_D\left(\theta_D, \{\gamma_i\}_{i=1}^{M}\right),    (6)

where the coefficient $\alpha$ is a hyperparameter.
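To make the optimization in Eq. (6) concrete, here is a minimal NumPy sketch that combines a rating loss standing in for $C_R$ with the review negative log-likelihood $C_D$ of Eq. (4)-(5). Since Eq. (1)-(2) are not reproduced in this excerpt, the squared-error form of `rating_loss`, and all function names, are illustrative assumptions rather than the exact formulation.

```python
import numpy as np

def rating_loss(preds, ratings):
    # Stand-in for C_R in Eq. (2); assumed squared error (Eq. (2) is not shown here).
    return np.mean((preds - ratings) ** 2)

def review_nll(review_log_probs):
    # C_D in Eq. (4)-(5): average negative log-likelihood over observed reviews,
    # where each entry is log p(d_{u,i} | gamma_i) of one review.
    return -np.mean(review_log_probs)

def joint_objective(preds, ratings, review_log_probs, alpha=0.5):
    # Convex combination of the two costs, as in Eq. (6).
    return alpha * rating_loss(preds, ratings) + (1 - alpha) * review_nll(review_log_probs)

# Tiny example with made-up numbers.
print(joint_objective(np.array([4.1, 2.9]), np.array([4.0, 3.0]),
                      review_log_probs=np.array([-120.3, -87.5]), alpha=0.7))
```

Both terms are minimized jointly with respect to $\theta$, $\theta_D$ and the product representations $\gamma_i$, e.g. by gradient-based optimization.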
3.2.1 BoWLF: Distributed Bag-of-Words

The first model we propose to use is a distributed bag-of-words prediction. In this case, we represent each review as a bag of words, meaning

  d_{u,i} = \left(w^{(1)}_{u,i}, \cdots, w^{(n_{u,i})}_{u,i}\right) \approx \left\{w^{(1)}_{u,i}, \cdots, w^{(n_{u,i})}_{u,i}\right\}.    (7)

This leads to

  p(d_{u,i} \mid \gamma_i) = \prod_{t=1}^{n_{u,i}} p\left(w^{(t)}_{u,i} \mid \gamma_i\right).

We model $p(w^{(t)}_{u,i} \mid \gamma_i)$ as an affine transformation of the product representation $\gamma_i$ followed by the so-called softmax normalization:

  p\left(w^{(t)}_{u,i} = j \mid \gamma_i\right) = \frac{\exp\{y_j\}}{\sum_{l=1}^{|V|} \exp\{y_l\}},    (8)

where $y = W\gamma_i + b$, and $V$, $W$ and $b$ are the vocabulary, a weight matrix and a bias vector. The parameters $\theta_D$ of this review model are $W$ and $b$.

When we use this distributed bag-of-words model together with matrix factorization for predicting ratings, we call this joint model the bag-of-words regularized latent factor model (BoWLF).
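As a concrete illustration of Eq. (7)-(8), the following NumPy sketch computes the BoWLF word distribution as a softmax over the affine transformation $W\gamma_i + b$ and sums per-word log-probabilities to obtain $\log p(d_{u,i} \mid \gamma_i)$. The dimensions and random initialization are illustrative only.

```python
import numpy as np

def bowlf_word_dist(gamma, W, b):
    # Eq. (8): p(w = j | gamma_i) = softmax(W gamma_i + b)_j.
    y = W @ gamma + b              # affine transformation, shape (|V|,)
    y = y - y.max()                # numerical stability
    p = np.exp(y)
    return p / p.sum()

def bowlf_review_log_prob(word_ids, gamma, W, b):
    # log p(d_{u,i} | gamma_i) under the bag-of-words model:
    # the sum of log p(w^{(t)} | gamma_i) over the words of the review.
    log_p = np.log(bowlf_word_dist(gamma, W, b))
    return log_p[word_ids].sum()

# Illustrative shapes: vocabulary of 1000 words, 5-dimensional gamma_i.
rng = np.random.default_rng(0)
V, K = 1000, 5
W, b = rng.normal(size=(V, K)), np.zeros(V)
gamma = rng.normal(size=K)
review = rng.integers(0, V, size=20)   # word indices of one review
print(bowlf_review_log_prob(review, gamma, W, b))
```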
3.2.2 LMLF: Recurrent Neural Network

The second model does not make any assumption on how each review is represented, but takes the sequence of words as it is, preserving the order of the words.

In this case, we model the probability over a review, which is a variable-length sequence of words, by rewriting the probability as

  p\left(d_{u,i} = \left(w^{(1)}_{u,i}, \cdots, w^{(n_{u,i})}_{u,i}\right) \mid \gamma_i\right) = p\left(w^{(1)}_{u,i} \mid \gamma_i\right) \prod_{t=2}^{n_{u,i}} p\left(w^{(t)}_{u,i} \mid w^{(1)}_{u,i}, \cdots, w^{(t-1)}_{u,i}, \gamma_i\right).

We approximate each conditional distribution with

  p\left(w^{(t)}_{u,i} = j \mid w^{(<t)}_{u,i}, \gamma_i\right) = \frac{\exp\left\{y^{(t)}_j\right\}}{\sum_{l=1}^{|V|} \exp\left\{y^{(t)}_l\right\}},

where $y^{(t)} = W h^{(t)} + b$ and

  h^{(t)} = \phi\left(h^{(t-1)}, w^{(t-1)}_{u,i}, \gamma_i\right).

There are a number of choices available for implementing the recurrent function $\phi$. Here, we use a long short-term memory (LSTM, [9]), which has recently been applied successfully to natural language-related tasks [7].

In the case of the LSTM, the recurrent function $\phi$ returns, in addition to its hidden state $h^{(t)}$, the memory cell $c^{(t)}$, such that

  \left[h^{(t)}; c^{(t)}\right] = \phi\left(h^{(t-1)}, c^{(t-1)}, w^{(t-1)}_{u,i}, \gamma_i\right),

where

  h^{(t)} = o^{(t)} \odot \tanh\left(c^{(t)}\right),
  c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}.

The output $o$, forget $f$ and input $i$ gates are computed by

  \left[o^{(t)}; f^{(t)}; i^{(t)}\right] = \sigma\left(V_g E\left[w^{(t-1)}_{u,i}\right] + W_g h^{(t-1)} + U_g c^{(t-1)} + b_g\right),

and the new memory content $\tilde{c}^{(t)}$ by

  \tilde{c}^{(t)} = \tanh\left(V_c E\left[w^{(t-1)}_{u,i}\right] + W_c h^{(t-1)} + U_c c^{(t-1)} + b_c\right),

where $E$ is a word embedding matrix shared by the gate and memory-content computations.
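The sketch below implements one step of this LSTM recurrence in NumPy: a single affine map of the previous word's embedding, the previous hidden state and the previous memory cell produces the three gates, and a second one produces the new memory content. How exactly $\gamma_i$ enters the recurrence is not fully visible in this excerpt; appending it to the word embedding, as done here, is an assumption for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_prev, h_prev, c_prev, gamma, params):
    # One step of phi(h^{(t-1)}, c^{(t-1)}, w^{(t-1)}, gamma_i):
    #   [o; f; i] = sigmoid(V_g x + W_g h_prev + U_g c_prev + b_g)
    #   c_tilde   = tanh(V_c x + W_c h_prev + U_c c_prev + b_c)
    #   c = f * c_prev + i * c_tilde,   h = o * tanh(c)
    # where x is the previous word's embedding; appending gamma_i to x is an
    # assumed way of conditioning phi on the product representation.
    E, Vg, Wg, Ug, bg, Vc, Wc, Uc, bc = (params[k] for k in
        ("E", "Vg", "Wg", "Ug", "bg", "Vc", "Wc", "Uc", "bc"))
    x = np.concatenate([E[w_prev], gamma])        # conditioning input
    gates = sigmoid(Vg @ x + Wg @ h_prev + Ug @ c_prev + bg)
    o, f, i = np.split(gates, 3)                  # output, forget, input gates
    c_tilde = np.tanh(Vc @ x + Wc @ h_prev + Uc @ c_prev + bc)
    c = f * c_prev + i * c_tilde                  # new memory cell
    h = o * np.tanh(c)                            # new hidden state
    return h, c

# Illustrative dimensions: 50-word vocabulary, 8-dim embeddings, 4-dim gamma, 16-dim state.
rng = np.random.default_rng(0)
Vsz, D, K, H = 50, 8, 4, 16
params = {
    "E":  rng.normal(scale=0.1, size=(Vsz, D)),
    "Vg": rng.normal(scale=0.1, size=(3 * H, D + K)),
    "Wg": rng.normal(scale=0.1, size=(3 * H, H)),
    "Ug": rng.normal(scale=0.1, size=(3 * H, H)),
    "bg": np.zeros(3 * H),
    "Vc": rng.normal(scale=0.1, size=(H, D + K)),
    "Wc": rng.normal(scale=0.1, size=(H, H)),
    "Uc": rng.normal(scale=0.1, size=(H, H)),
    "bc": np.zeros(H),
}
h, c = lstm_step(w_prev=3, h_prev=np.zeros(H), c_prev=np.zeros(H),
                 gamma=rng.normal(size=K), params=params)
```

The per-step word distribution is then $\mathrm{softmax}(W h^{(t)} + b)$, exactly as in the equations above, with $h^{(t)}$ playing the role that $\gamma_i$ plays in the BoWLF.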
3.3 HFT

HFT was proposed in [13], and we discuss it here to compare it with our approaches. HFT is based on latent Dirichlet allocation (LDA) and, like the distributed bag-of-words model (see Sec. 3.2.1), treats a review as a bag of words. We start by describing how LDA models a review (document). It starts by sampling a topic proportion $\tau$ from a Dirichlet distribution. Using this topic proportion as the parameter to a multinomial distribution, a topic is sampled, and the sampled topic defines a probability distribution over the words from which a word is drawn. In other words, given a topic proportion, a review is modeled with a mixture of multinomial distributions.

Instead of sampling the topic proportion from the top-level Dirichlet distribution in LDA, HFT replaces it with

  \tau = \frac{1}{\left\|\exp\{\kappa \gamma_i\}\right\|_1} \exp\{\kappa \gamma_i\},

where $\kappa$ is a free parameter estimated along with all the other parameters of the model.

In this case, the probability over a single review $d_{u,i}$ given a product $\gamma_i$ becomes

  p(d_{u,i} \mid \gamma_i) = \prod_{t=1}^{n_{u,i}} \sum_{k=1}^{\dim(\gamma_i)} p\left(w^{(t)}_{u,i} \mid z_k = 1\right) p\left(z_k = 1 \mid \gamma_i\right)    (11)
                          = \prod_{t=1}^{n_{u,i}} \sum_{k=1}^{\dim(\gamma_i)} \tau_k\, p\left(w^{(t)}_{u,i} \mid z_k = 1\right),

where $z_k$ is an indicator variable of the $k$-th topic out of $\dim(\gamma_i)$, and $\tau_k$ is the $k$-th element of $\tau$. The conditional probability over words given a topic is modeled with a stochastic matrix $W^* = \left[w^*_{j,k}\right] \in \mathbb{R}^{|V| \times \dim(\gamma_i)}$ (each column sums to 1). The conditional probability over words given a product $\gamma_i$ can then be written as

  p\left(w^{(t)}_{u,i} = j \mid \gamma_i\right) = \sum_{k=1}^{\dim(\gamma_i)} w^*_{j,k} \frac{\exp\{\kappa \gamma_{i,k}\}}{\left\|\exp\{\kappa \gamma_i\}\right\|_1}.    (12)

The matrix $W^*$ is often parametrized by

  w^*_{j,k} = \frac{\exp\{q_{j,k}\}}{\sum_l \exp\{q_{l,k}\}},

where $Q = [q_{j,k}]$ is an unconstrained matrix of the same size as $W^*$. In practice, a bias term is added to the formulation above to handle frequent words.
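For comparison with Eq. (8), the following NumPy sketch evaluates the HFT word distribution of Eq. (12): the topic proportion $\tau$ normalizes $\exp\{\kappa\gamma_i\}$, $W^*$ is the column-wise softmax of an unconstrained matrix $Q$, and the word distribution is their mixture. Shapes and values are illustrative only.

```python
import numpy as np

def hft_word_dist(gamma, Q, kappa):
    # Eq. (12): p(w = j | gamma_i) = sum_k w*_{j,k} * tau_k, with
    # tau = exp(kappa * gamma) / ||exp(kappa * gamma)||_1 and
    # W* obtained by a column-wise softmax of the unconstrained Q.
    tau = np.exp(kappa * gamma)
    tau /= tau.sum()                                   # topic proportion, shape (dim(gamma),)
    W_star = np.exp(Q - Q.max(axis=0, keepdims=True))  # column-wise softmax
    W_star /= W_star.sum(axis=0, keepdims=True)        # each column sums to 1
    return W_star @ tau                                # mixture over topics, shape (|V|,)

# Illustrative shapes: 1000-word vocabulary, 5 topics (= dim(gamma_i)).
rng = np.random.default_rng(0)
V, K = 1000, 5
p = hft_word_dist(gamma=rng.normal(size=K), Q=rng.normal(size=(V, K)), kappa=1.0)
print(p.sum())   # 1.0: a valid distribution over the vocabulary
```

Every entry of the result is a convex combination of the columns of $W^*$, which is precisely the mixture structure contrasted with the BoWLF's product of experts in Sec. 3.4 below.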
3.4 Comparing HFT and BoWLF

From Eq. (8) and Eq. (12), we can see that the HFT and the proposed BoWLF (see Sec. 3.2.1) are closely related. Most importantly, both of them consider a review as a bag of words and parametrize the conditional probability of a word given a product representation with a single affine transformation (weight matrix plus offset vector).

The main difference is in how the product representation and the weight matrix interact to form a point on the $|V|$-dimensional simplex. In the case of HFT, both the product representation $\gamma_i$ and the projection matrix $W^*$ are separately stochastic (i.e., each $\gamma_i$ and each column of $W^*$ are interpretable as a probability distribution), while the BoWLF projects the result of the matrix-vector product $W\gamma_i$ onto the simplex only through the softmax normalization in Eq. (8).

On a per-word basis, the BoWLF in Eq. (8) can be re-written as a (conditional) product of experts by

  p(w = j \mid \gamma_i) = \frac{1}{Z(\gamma_i)} \prod_{k=1}^{\dim(\gamma_i)} \exp\left\{w_{j,k} \gamma_{i,k} + b_j\right\},

where $w_{j,k}$ and $b_j$ are the element at the $j$-th row and $k$-th column of $W$ and the $j$-th element of $b$, respectively. On the other hand, an inspection of Eq. (11) reveals that, on a per-word basis, the HFT model is clearly a mixture model, with the topics playing the role of the mixture components.

As argued in [8], a product of experts can more easily model a peaky distribution, especially in a high-dimensional space. The reviews of each product tend to contain a small common subset of the whole vocabulary, while those subsets vastly differ from each other depending on the product. In other words, the conditional distribution of words given a product puts most of its probability mass on a small number of product-specific words, while leaving most other words with nearly zero probability. A product of experts is better suited to modeling such peaky distributions than a mixture model.

A more concrete way of understanding the difference between HFT and BoWLF may be to consider how the product representation and the weight matrix interact. In the case of the BoWLF, this is a simple matrix-vector product with no restrictions on the weight matrix. This means that the elements of the product representation, as well as those of the weight matrix, are free to assume negative values, and it is possible that an element of the product representation could exercise a strong influence suppressing the probability of a given set of words. Alternatively, with HFT, because the model interprets the elements of the product representation as mixture components, these elements have no mechanism for suppressing the probability mass assigned by the other elements of the product representation.

We suggest that this difference allows the BoWLF to better model reviews compared to the HFT, or any other LDA-based model, by offering a mechanism for negative correlations between words to be explicitly expressed as a function of the product representation. By offering a more flexible and natural model of reviews, the BoWLF may improve the rating prediction generalization performance. As we will see in Sec. 4, our experimental results support this proposition.

The proposed LMLF takes one step further by modeling each review with a chain of products of experts, taking into account the order of the words. This may seem an obvious benefit at first sight, since the order of the words is an important feature of language, but it is less clear whether the LMLF will model reviews in a way that also improves rating prediction.

4. EXPERIMENTS

4.1 Dataset

We evaluate the proposed models on the Amazon Reviews dataset [13], which consists of product ratings and accompanying reviews.