Slide 5
Results
[4a-3] ٠ాངฏ 5/6
4.2 Experimental Setup
Data Preparation.
We closely follow the procedure from [13] and [12], where
the evaluation is done per category. We randomly select
80% of ratings, up to two million samples, as a training set,
and split the rest evenly into validation and test sets, for
each category. We preprocess reviews only by tokenizing
them using a script from Moses (footnote 6), after which we build a
vocabulary of the 5000 most frequent words.
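The split and vocabulary steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `split_ratings` and `build_vocab`, the fixed seed, treating everything outside the capped training set as the "rest", and taking already-tokenized reviews as input (the Moses tokenizer is not called here) are all assumptions.

```python
import random
from collections import Counter


def split_ratings(samples, train_frac=0.8, max_train=2_000_000, seed=0):
    """Shuffle, keep up to max_train of the first 80% for training,
    and split the remainder evenly into validation and test sets."""
    rng = random.Random(seed)          # fixed seed is an assumption
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = min(int(train_frac * len(shuffled)), max_train)
    train, rest = shuffled[:n_train], shuffled[n_train:]
    half = len(rest) // 2
    return train, rest[:half], rest[half:]


def build_vocab(tokenized_reviews, size=5000):
    """Return the `size` most frequent tokens, most frequent first."""
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    return [word for word, _ in counts.most_common(size)]
```

In this sketch the split would be run once per category, since the evaluation is done per category.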
Evaluation Criteria.
We use mean squared error (MSE) of the rating prediction
to evaluate each approach. For assessing the performance on
review modeling, we use the average negative log-likelihood.
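The two criteria can be written out as a short Python sketch. The function names are hypothetical, the standard error is computed with the usual sample-variance formula, and averaging the negative log-likelihood per token (rather than per review) is an assumption, since the text does not say which normalization is used.

```python
import math


def mse_with_stderr(ratings, predictions):
    """Mean squared error and the standard error of that mean."""
    sq_errors = [(r - p) ** 2 for r, p in zip(ratings, predictions)]
    n = len(sq_errors)
    mean = sum(sq_errors) / n
    var = sum((e - mean) ** 2 for e in sq_errors) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)


def average_nll(token_probs):
    """Average negative log-likelihood of the observed tokens
    (per-token averaging is an assumption)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)
```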
Baseline.
We compare the two proposed approaches, BoWLF (see
Sec. 3.2.1) and LMLF (see Sec. 3.2.2), against three baseline
methods: matrix factorization with L2 regularization (MF,
see Eqs. (1)–(2)), the HFT model from [13] (see Sec. 3.3) and
the RMR model from [12]. In the case of HFT, we report
the performance both by evaluating the model ourselves (footnote 7)
and by reporting the results from [13] directly. For RMR,
we only report the results from [12].
Hyper-parameters.
Both the user vectors γ_u and the product vectors γ_i in Eq. (1)
are five-dimensional for all the experiments in this section. This
choice was made mainly to make the results comparable to
the previously reported ones in [13] and [12].
We initialize all the user and product representations by
sampling each element from a zero-mean Gaussian distri-
bution with its standard deviation set to 0.01. The biases,
µ, β_u and β_i, are all initialized to 0. All the parameters in
BoWLF and LMLF are initialized similarly, except for the
recurrent weights of the RNN-LM in LMLF, which are
initialized to be orthogonal.
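As a rough illustration of this initialization, the sketch below draws a five-dimensional representation from a zero-mean Gaussian with standard deviation 0.01 and builds a random orthogonal matrix by Gram-Schmidt. The text does not say how the orthogonal recurrent weights were obtained, so the Gram-Schmidt construction, the function names and the matrix size are assumptions.

```python
import random


def gaussian_vector(dim=5, std=0.01, rng=random):
    """Zero-mean Gaussian initialization for one user or product vector."""
    return [rng.gauss(0.0, std) for _ in range(dim)]


def orthogonal_matrix(n, rng=random):
    """Random n x n matrix with orthonormal rows (modified Gram-Schmidt).
    One possible way to get orthogonal recurrent weights; an assumption."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]   # remove component along q
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:                               # skip degenerate draws
            rows.append([a / norm for a in v])
    return rows


biases = {"mu": 0.0, "user": 0.0, "product": 0.0}     # all biases start at 0
```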
Training Procedure.
When training MF, BoWLF and LMLF, we use minibatch
RMSProp with the learning rate, momentum coefficient and
the size of minibatch set to 0.01, 0.9 and 128, respectively.
We train each model for at most 200 epochs, while monitoring
the validation performance. For HFT, we follow [13] which
uses the Expectation Maximization algorithm together with
L-BFGS. In all cases, we early-stop each training run based
on the validation set performance.
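The training loop can be illustrated on a toy problem. The sketch below uses one common formulation of RMSProp with momentum and a patience-based early-stopping rule; the decay rate, epsilon, patience value and the toy loss (w - 3)^2 are assumptions, since the text only gives the learning rate, momentum coefficient, minibatch size and epoch limit. Minibatching is omitted for brevity.

```python
def rmsprop_momentum_step(w, grad, state, lr=0.01, momentum=0.9,
                          decay=0.9, eps=1e-8):
    """One scalar RMSProp-with-momentum update (decay, eps are assumed)."""
    # Running average of squared gradients.
    state["sq"] = decay * state["sq"] + (1.0 - decay) * grad * grad
    # Momentum applied to the RMS-normalized gradient step.
    state["vel"] = momentum * state["vel"] - lr * grad / (state["sq"] ** 0.5 + eps)
    return w + state["vel"]


def fit(max_epochs=200, patience=10):
    """Train on the toy loss (w - 3)^2 and keep the best validation point."""
    w, state = 0.0, {"sq": 0.0, "vel": 0.0}
    best_loss, best_w, stale, epochs_run = float("inf"), w, 0, 0
    for _ in range(max_epochs):
        grad = 2.0 * (w - 3.0)             # gradient of the toy loss
        w = rmsprop_momentum_step(w, grad, state)
        epochs_run += 1
        val_loss = (w - 3.0) ** 2          # stand-in for validation MSE
        if val_loss < best_loss:
            best_loss, best_w, stale = val_loss, w, 0
        else:
            stale += 1
            if stale >= patience:          # early stop (patience is assumed)
                break
    return best_w, best_loss, epochs_run
```

Calling `fit()` returns the parameter with the lowest validation loss seen, which mirrors selecting the model by validation performance rather than by the last epoch.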
In the preliminary experiments, we found the choice of α
in Eq. (6), which balances matrix factorization and review
modeling, to be important. We searched for the α that maximizes
the validation performance, in the range of [0.1, 0.01].
We used a CPU cluster of 16 nodes each with 8 cores and
8-16 GB of memory to run experiments on BoWLF, MF, [...]
[...] categories in terms of MSE with the standard error of the mean
shown in parentheses. From this table, we can see that, except
for the single category “Jewelry”, the proposed BoWLF
outperforms all the other models with an improvement of
20.29% over MF and 5.64% over HFT across all categories (footnote 8).
In general, we note better performance of BoWLF and LMLF
models over other methods especially as the size of the dataset
grows, which is evident from Figs. 1 and 2.
Figure 1: Scatterplot showing performance improvement
over the number of samples. We see a performance
improvement of BoWLF over HFT as dataset
size increases.
Figure 2: Scatterplot showing performance improvement
over the number of samples. We see a modest
performance improvement of LMLF over HFT as
dataset size increases.
6 https://github.com/moses-smt/mosesdecoder/
7 The code was kindly provided by the authors of [13].
8 Due to the use of different splits, the results by HFT reported
in [13] and RMR in [12] are not directly comparable.
                                                                                BoWLF improvement
Dataset          Size   (a) MF        (b) HFT       (c) BoWLF     (d) LMLF      over (a)  over (b)  HFT*   RMR**
Arts             27K    1.434 (0.04)  1.425 (0.04)  1.413 (0.04)  1.426 (0.04)  2.15%     1.18%     1.388  1.371
Jewelry          58K    1.227 (0.04)  1.208 (0.03)  1.214 (0.03)  1.218 (0.03)  1.24%     -0.59%    1.178  1.160
Watches          68K    1.511 (0.03)  1.468 (0.03)  1.466 (0.03)  1.473 (0.03)  4.52%     0.20%     1.486  1.458
Cell Phones      78K    2.133 (0.03)  2.082 (0.02)  2.076 (0.02)  2.077 (0.02)  5.76%     0.66%     N/A    2.085
Musical Inst.    85K    1.426 (0.02)  1.382 (0.02)  1.375 (0.02)  1.388 (0.02)  5.12%     0.75%     1.396  1.374
Software         95K    2.241 (0.02)  2.194 (0.02)  2.174 (0.02)  2.203 (0.02)  6.70%     2.06%     2.197  2.173
Industrial       137K   0.360 (0.01)  0.354 (0.01)  0.352 (0.01)  0.356 (0.01)  0.76%     0.24%     0.357  0.362
Office Products  138K   1.662 (0.02)  1.656 (0.02)  1.629 (0.02)  1.646 (0.02)  3.32%     2.72%     1.680  1.638
Gourmet Foods    154K   1.517 (0.02)  1.486 (0.02)  1.464 (0.02)  1.478 (0.02)  5.36%     2.22%     1.431  1.465
Automotive       188K   1.460 (0.01)  1.429 (0.01)  1.419 (0.01)  1.428 (0.01)  4.17%     1.03%     1.428  1.403
Kindle Store     160K   1.496 (0.01)  1.435 (0.01)  1.418 (0.01)  1.437 (0.01)  7.83%     1.76%     N/A    1.412
Baby             184K   1.492 (0.01)  1.437 (0.01)  1.432 (0.01)  1.443 (0.01)  5.95%     0.48%     1.442  N/A
Patio            206K   1.725 (0.01)  1.687 (0.01)  1.674 (0.01)  1.680 (0.01)  5.10%     1.24%     N/A    1.669
Pet Supplies     217K   1.583 (0.01)  1.554 (0.01)  1.536 (0.01)  1.544 (0.01)  4.74%     1.78%     1.582  1.562
Beauty           252K   1.378 (0.01)  1.373 (0.01)  1.335 (0.01)  1.370 (0.01)  4.33%     3.82%     1.347  1.334
Shoes            389K   0.226 (0.00)  0.231 (0.00)  0.224 (0.00)  0.225 (0.00)  0.23%     0.72%     0.226  0.251
Tools & Home     409K   1.535 (0.01)  1.498 (0.01)  1.477 (0.01)  1.490 (0.01)  5.78%     2.15%     1.499  1.491
Health           428K   1.535 (0.01)  1.509 (0.01)  1.481 (0.01)  1.499 (0.01)  5.35%     2.82%     1.528  1.512
Toys & Games     435K   1.411 (0.01)  1.372 (0.01)  1.363 (0.01)  1.367 (0.01)  4.71%     0.89%     1.366  1.372
Video Games      463K   1.566 (0.01)  1.501 (0.01)  1.481 (0.01)  1.490 (0.01)  8.47%     2.00%     1.511  1.510
Sports           510K   1.144 (0.01)  1.137 (0.01)  1.115 (0.01)  1.127 (0.01)  2.94%     2.19%     1.136  1.129
Clothing         581K   0.339 (0.00)  0.343 (0.00)  0.333 (0.00)  0.344 (0.00)  0.60%     1.01%     0.327  0.336
Amazon Video     717K   1.317 (0.01)  1.239 (0.01)  1.184 (0.01)  1.206 (0.01)  13.33%    5.47%     N/A    1.270
Home             991K   1.587 (0.00)  1.541 (0.00)  1.513 (0.00)  1.535 (0.01)  7.41%     2.79%     1.527  1.501
Electronics      1.2M   1.754 (0.00)  1.694 (0.00)  1.671 (0.00)  1.698 (0.00)  8.29%     2.30%     1.724  1.722
Music            6.3M   1.112 (0.00)  0.970 (0.00)  0.920 (0.00)  0.924 (0.00)  19.15%    4.94%     0.969  0.959
Movies & TV      7.8M   1.379 (0.00)  1.089 (0.00)  0.999 (0.00)  1.022 (0.00)  37.95%    9.01%     1.119  1.120
Books            12.8M  1.272 (0.00)  1.141 (0.00)  1.080 (0.00)  1.110 (0.00)  19.21%    6.12%     1.135  1.113
All categories   35.3M  1.289         1.143         1.086         1.107         20.29%    5.64%
Table 1: Prediction mean squared error (MSE) results on test data. Standard error of the mean in parentheses.
Dimensionality of latent factors dim(γ_i) = 5 for all models. Best results for each dataset in bold. HFT* and
RMR** represent original paper results over different data splits [13, 12].
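As a reading aid for the improvement columns, the following Python sketch recomputes them from the "All categories" row. From the rounded MSE values shown in the table, the reported figures appear to match the absolute MSE difference multiplied by 100 rather than a relative reduction; this is an observation from the printed numbers, not a statement from the paper, and the function names are hypothetical.

```python
# Test MSE from the "All categories" row of Table 1.
mse = {"MF": 1.289, "HFT": 1.143, "BoWLF": 1.086, "LMLF": 1.107}


def absolute_gain_x100(baseline, model):
    """Absolute MSE difference, scaled by 100."""
    return 100.0 * (mse[baseline] - mse[model])


def relative_gain_percent(baseline, model):
    """MSE reduction relative to the baseline, in percent."""
    return 100.0 * (mse[baseline] - mse[model]) / mse[baseline]


# About 20.3 and 5.7, close to the reported 20.29% and 5.64%
# (the small gap would be explained by rounding of the printed MSEs).
abs_vs_mf = absolute_gain_x100("MF", "BoWLF")
abs_vs_hft = absolute_gain_x100("HFT", "BoWLF")

# The relative reduction over MF is noticeably smaller than 20.29%.
rel_vs_mf = relative_gain_percent("MF", "BoWLF")
```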
Interestingly, BoWLF always outperforms LMLF. These
results indicate that the complex language model, which the
LMLF learns using an LSTM network, does not seem to improve
over a simple bag-of-words representation, which the
BoWLF learns, in terms of the learned product representations.
This can be understood from how the product representa-
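Figures and tables are quoted from the original paper
The simplest model, BoWLF, gives good results
→ The performance of the language model may not be important
BoWLF ranks above HFT
LMLF ranks above HFT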