Improving Recommendation Systems with User Personality Inferred from Product Reviews

Lu Xinyuan (Presenter) 1,2, Kan Min-Yen 2
IRS Workshop in WSDM’23, March 3
1 ISEP Program, NUS Graduate School
2 School of Computing, National University of Singapore

Recommendation System (RecSys)
• RecSys: item recommendations to the end users (movies, music, products…)
• Diagram labels: Search, Recommendations

Recommendation System (RecSys)
• RecSys are designed to stimulate users’ consumption behavior.
• Such behavior is largely influenced by the user’s profile: age, education, demographic information.

Recommendation System (RecSys)
• Traditional RecSys focus on a user’s static profile: age, education, demographic information.
• A user’s psychology (e.g., personality, emotion) can help model the user’s dynamic profile.

Why we need personality in RecSys
Personality affects user preference
• Personality has been shown to be directly related to user preference.
• Examples:
  o Open people are more likely to watch comedy movies [Ivan et al. 2013]
  o Open people favor energetic music genres [Mariappan et al. 2012]

Why we need emotion in RecSys
Emotion state can influence people’s decisions
• The recommendation should depend on the user’s current emotional state.
• Example: the same user is likely to watch comedy movies when happy and tragedy movies when sad.

Challenges: Privacy of Personality Information
• Personality information can be misused by malicious users to cause undesirable outcomes. [Hinds et al. 2020]
• A challenging balance: utilizing information vs. protecting privacy

Challenges: Lack of Large Datasets
• Ground-truth psychology information is expensive to collect from users.
• Existing works have built only small-scale datasets.
• In 2018, sharing of a larger dataset, myPersonality, was stopped. https://sites.google.com/michalkosinski.com/mypersonality

Challenges: Subjectivity of Personality Measurement
• The measurement of personality can be very subjective.
• The reference-group effect often occurs. [Wu et al. 2017]
• Inaccurate measurement of users’ personality traits is likely to introduce more noise.

Personality Model: OCEAN (Big 5)
• Openness to experience: conventional vs. creative thinking
• Conscientiousness: disorganized vs. organized
• Extraversion: engagement with the external world
• Agreeableness: need for social harmony
• Neuroticism: emotional instability

Personality Detection
Explicit Method: Questionnaire
• 10-item Big Five Inventory (BFI) test
• 5-level Likert scale (strongly agree, agree, neutral, disagree, strongly disagree)
• Example: “I am outgoing, sociable” [1 2 3 4 5] (Extraversion related)
• Time consuming
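Scoring such a questionnaire can be sketched as follows. This is a minimal illustration, assuming two items per trait on the 5-level scale and reverse-scoring of negatively worded items; the item-to-trait key `ITEM_KEY` and the helper `score_big5` are hypothetical and not taken from the slides (only item 6, "outgoing, sociable", is confirmed there as Extraversion related).

```python
# Hypothetical scoring sketch for a 10-item, 5-level Likert questionnaire.
# ITEM_KEY (item number -> (trait, reverse-scored?)) is an illustrative
# assumption; replace it with the key of the inventory actually used.
ITEM_KEY = {
    1: ("Extraversion", True),
    2: ("Agreeableness", False),
    3: ("Conscientiousness", True),
    4: ("Neuroticism", True),
    5: ("Openness", True),
    6: ("Extraversion", False),
    7: ("Agreeableness", True),
    8: ("Conscientiousness", False),
    9: ("Neuroticism", False),
    10: ("Openness", False),
}


def score_big5(answers, key=ITEM_KEY, scale_max=5):
    """Average the answers per trait, reverse-scoring flagged items.

    `answers` maps item number -> response in 1..scale_max.
    """
    totals, counts = {}, {}
    for item, answer in answers.items():
        if not 1 <= answer <= scale_max:
            raise ValueError(f"answer for item {item} is out of range")
        trait, reverse = key[item]
        # A reverse-scored item maps 1 -> 5, 2 -> 4, ..., 5 -> 1.
        value = scale_max + 1 - answer if reverse else answer
        totals[trait] = totals.get(trait, 0) + value
        counts[trait] = counts.get(trait, 0) + 1
    return {trait: totals[trait] / counts[trait] for trait in totals}
```

Under this assumed key, answering 1 on item 1 and 5 on item 6 gives the maximum Extraversion score.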

Personality Detection
Implicit Methods: Automatic personality detection
• Language use differs between individuals
• Infer from texts, social media posts
• APIs:
  1) IBM Personality Insights: discontinued after 2021
  2) Receptiviti: sentence level
  3) SenticNet: lexicon-based approach

Receptiviti API
• Receptiviti API is a computational language psychology platform for understanding human behavior.
• Receptiviti was co-founded by Prof. James W. Pennebaker, the former Chair of the Department of Psychology and the inventor of LIWC, the gold-standard algorithm in the field of language psychology.
https://www.receptiviti.com/

Receptiviti API: Personality API package
• We use the Personality API Package.
• Our budget: USD 250/month, which includes 500,000 words; 6-month subscription.
• Ongoing work: we’ll discuss our own methods to replicate a personality API.
https://www.receptiviti.com/personality

Receptiviti example
• Input: pieces of text. The more words in the text, the higher the accuracy.
• In Receptiviti, more than 300 words are needed.
• Output: a personality score for each Big 5 category
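The 300-word requirement suggests pooling each user's reviews before scoring. The sketch below shows one way to do that; the `(user_id, review_text)` input layout and the helper name `build_user_documents` are assumptions for illustration, and no Receptiviti call is made.

```python
from collections import defaultdict

MIN_WORDS = 300  # threshold stated on the slide


def build_user_documents(reviews, min_words=MIN_WORDS):
    """Concatenate each user's reviews and keep users with more than min_words words.

    `reviews` is an iterable of (user_id, review_text) pairs (assumed layout).
    Words are counted by whitespace splitting, which is only an approximation
    of how a scoring service may count them.
    """
    texts = defaultdict(list)
    for user_id, text in reviews:
        texts[user_id].append(text)

    documents = {}
    for user_id, parts in texts.items():
        document = " ".join(parts)
        if len(document.split()) > min_words:
            documents[user_id] = document
    return documents


reviews = [
    ("u1", "good " * 200),
    ("u1", "nice " * 150),
    ("u2", "short review"),
]
documents = build_user_documents(reviews)
```

Here `u1` is kept with 350 pooled words, while `u2` is dropped.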

Current Datasets
• Serendipity 2018
  o A version of the MovieLens dataset.
  o It is used for serendipity in RecSys.
  o There are 10 million ratings.
  Drawback: it basically supports offline evaluation of recommendation algorithms. It does not contain real-time feedback for online evaluation.
• Personality 2018
  o A version of the MovieLens dataset.
  o Includes: personality information of the users + movie ratings
  Drawback: it only contains the Big 5 scores of 1,834 users along with the movie ratings given by these users.

Datasets: Taobao Serendipity
• A user survey on Mobile Taobao
• The users first received a recommended product, then completed a questionnaire that assessed immediate feedback.
• Users filled in two psychological quizzes:
  1) 10-item Curiosity and Exploration Inventory-II (CEI-II)
  2) Ten-Item Personality Inventory (TIPI)
• This dataset contains 11,383 users’ feedback from the user survey.
Drawback: due to commercial privacy concerns, the Taobao item descriptions and item category information are not publicly available.

My Work
• How can we acquire personality data for RecSys?
• How can we explore the impact of personality on RecSys?


Amazon Review dataset (updated version in 2018)
2014
• This is a large crawl of product reviews from Amazon.
• This dataset contains 82.83 million unique reviews, from around 20 million users.
• Metadata:
  ◦ reviews and ratings
  ◦ item-to-item relationships (e.g. "people who bought X also bought Y")
  ◦ timestamps
  ◦ helpfulness votes
  ◦ product image (and CNN features)
  ◦ price
  ◦ product descriptions
  ◦ category
  ◦ sales rank
2018
• More reviews: the total number of reviews is 233.1 million (142.8 million in 2014).
• Newer reviews: current data includes reviews in the range May 1996 - Oct 2018.
Download: https://nijianmo.github.io/amazon/index.html
Info: https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews

Amazon Review dataset
• Fields: user ID, item ID, rating score, review text

Personality Data Preparation
• To study whether personality has different influences on users’ behaviours in different domains, we choose All-Beauty and Music as 2 domains.
• Input: each user’s review text (more than 300 words)
• Output: each user’s Big 5 personality scores
• Sample dataset after filtering:

Dataset        # of items  # of users  # of ratings  % of interaction  Avg. words per user  Avg. words per review
Amazon-beauty          85         991         5,269             6.26%               990.48                 466.43
Amazon-Music        8,895       1,791        28,399             0.18%                51.01                  51.18

• 80% training / 20% testing
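The "% of interaction" column can be reproduced from the other columns if it is read as the density of the user-item matrix (ratings divided by users times items). That reading is an assumption, but it matches both rows of the table; the helper name `interaction_density` is illustrative.

```python
def interaction_density(num_ratings, num_users, num_items):
    """Percentage of user-item pairs that have an observed rating."""
    return 100.0 * num_ratings / (num_users * num_items)


# Values taken from the table on the slide.
beauty = interaction_density(num_ratings=5269, num_users=991, num_items=85)
music = interaction_density(num_ratings=28399, num_users=1791, num_items=8895)

print(f"Amazon-beauty: {beauty:.2f}%")
print(f"Amazon-Music: {music:.2f}%")
```

The Music domain is far sparser than the Beauty domain, which is worth keeping in mind when comparing results across the two.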

Personality Data Preparation
• To study the difference between questionnaire-based personality trait scores and our review-based automatic personality trait detection scores, we also include an existing dataset: Personality 2018.
• Users: 1,834
• MovieID: [1-197,529]
• Raw ratings: 1,028,751 (scores 1-7)

# of items  # of users  # of ratings  % of interaction
   197,529       1,834       339,000             0.28%

My Work
• How can we acquire personality data for RecSys?
• How can we explore the impact of personality on RecSys?


Models
• Baseline Models:
  (1) Neural Collaborative Filtering (NCF)
  (2) NCF + Random: randomly assign a personality label
  (3) NCF + Same: assign the same personality label
• Personality-based Models:
  Single personality
  (1) NCF + Most salient personality: assign the most salient personality label
  Multi personality distribution
  (2) NCF + Soft-labeled: take all personality scores and obtain a personality distribution with softmax
  (3) NCF + Hard-coded: directly add all personality scores as an additional feature vector in the network
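The three personality-based variants differ only in how the five trait scores become a feature vector. The sketch below illustrates that step in plain Python; the function names, the trait order, and the `temperature` parameter are assumptions, and how the vector is combined with the NCF embeddings is not described on the slides.

```python
import math

TRAITS = ["OPEN", "CON", "EXT", "AGR", "NEU"]  # assumed ordering


def most_salient(scores):
    """Single-personality variant: one-hot vector for the highest-scoring trait."""
    top = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == top else 0.0 for i in range(len(scores))]


def soft_label(scores, temperature=1.0):
    """Soft-labeled variant: softmax over all trait scores.

    `temperature` is an assumption: with raw scores on a wide scale the
    softmax may be close to one-hot, so some rescaling is often useful.
    """
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_coded(scores):
    """Hard-coded variant: pass the raw trait scores through unchanged."""
    return list(scores)


example = [40.0, 55.0, 75.0, 80.0, 30.0]  # made-up scores for illustration
one_hot = most_salient(example)
distribution = soft_label(example, temperature=10.0)
```

The soft-labeled vector keeps the ranking of the traits while still summing to one, which is what distinguishes it from the one-hot label.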

Model (architecture diagram)

RQ1: Can we accurately detect personality from texts?
• To evaluate whether we can accurately detect personality traits from texts, we analyze the personality scores inferred by the Receptiviti API for each user.
• We select the users with the top 10 highest scores for each personality type, for a total of 100 samples.
• Two graduates are given the review texts and the inferred personality. We ask them whether each sampled review text accurately matches its inferred personality, choosing among three options: yes, no, or not sure.

RQ1: Can we accurately detect personality from texts?
• We find that the inferred personality matches the review text in 81% of the Amazon-beauty samples and 79% of the Amazon-music samples. The average Cohen’s Kappa is 0.70.
• Examples (personality type, score, review text):
  o Extraversion, 75.06: "Love this shampoo! Recommended by a friend! The color really lasts!!!"
  o Agreeableness, 80.06: "Great product - my wife loves it"
  o Agreeableness, 78.18: "Great deal and leaves my kids smelling awesome! I bought a box of them years ago and we still have some left!!!"
  o Neuroticism, 62.28: "Nope. It smells like artificial bananas, and this smell does linger. It’s pure liquid, there is no thickness to it at all, it’s like pouring banana water on your head that lathers. It does not help with an itchy scalp either."
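Agreement between the two annotators is reported as Cohen's Kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch of the standard formula follows; the label lists are made up for illustration and are not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty sample set")
    n = len(labels_a)
    # Observed agreement: fraction of samples with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always used the same single label
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only.
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 3 of 4 samples, chance agreement is 0.5, and kappa is 0.5.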

RQ2: What is the distribution of users’ personalities?
• We further analyze the personality distribution for all users by plotting the score histograms for each personality trait in the Amazon-beauty dataset and the Amazon-music dataset.

RQ2: What is the distribution of users’ personalities?
Summary
• The personality traits of users are not evenly distributed. There are more instances of people with certain personality traits (e.g., agreeableness) than others (e.g., neuroticism). A possible reason is that people with certain personalities are more willing to write product reviews.
• The distributions for the two domains are generally the same, with higher agreeableness scores and lower neuroticism scores. However, there are slight differences. For example, extraversion scores in the music domain are generally higher than those in the beauty domain. This could be explained by the possibility that people who are passionate about music may be more emotional.

RQ3: Does incorporating personality improve RecSys performance?
Experiment Results: Amazon

Model                           HR@3   NDCG@3  HR@5   NDCG@5  HR@10  NDCG@10
NCF + Random                    0.923  0.675   0.965  0.605   0.975  0.660
NCF + Same                      0.918  0.683   0.967  0.630   0.975  0.662
NCF + Most salient personality  0.939  0.714   0.969  0.676   0.977  0.707
NCF + Soft-label                0.936  0.810   0.965  0.867   0.973  0.831
NCF + Hard-coded                0.948  0.849   0.961  0.826   0.977  0.848
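For readers unfamiliar with the two metrics, the sketch below computes HR@k and NDCG@k for a single test interaction. It assumes the leave-one-out protocol commonly used with NCF, where each user has exactly one held-out item; the slides do not state the protocol, and the function names are illustrative.

```python
import math


def hit_ratio_at_k(ranked_items, target, k):
    """1.0 if the held-out item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG with a single relevant item: 1 / log2(rank + 2), rank is 0-based."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target)
    return 1.0 / math.log2(rank + 2)


ranked = ["a", "b", "c", "d"]  # items sorted by predicted score, best first
hr = hit_ratio_at_k(ranked, "c", k=3)
ndcg = ndcg_at_k(ranked, "c", k=3)
```

Both metrics are then averaged over all test users. HR only checks presence in the top k, while NDCG also rewards placing the item nearer the top, which is why the two can move differently across models.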

RQ3: Does incorporating personality improve RecSys performance?
Experiment Results: Amazon
• Observation 1: NCF + Most salient personality scores higher than NCF + Same / Random in terms of NDCG.
• Conclusion: adding a personality label indeed helps.


RQ3: Does incorporating personality improve RecSys performance?
Experiment Results: Amazon (Beauty)
• Observation 2: NCF + Soft-labeled / Hard-coded scores higher than NCF + Most salient in terms of NDCG.
• Conclusion: using multiple personality features is better than using one single personality feature.

RQ3: Does incorporating personality improve RecSys performance?
Experiment Results: Personality2018

Model                           HR@3   NDCG@3  HR@5   NDCG@5  HR@10  NDCG@10
NCF + Random                    0.510  0.406   0.628  0.454   0.777  0.504
NCF + Same                      0.501  0.403   0.622  0.454   0.777  0.502
NCF + Most salient personality  0.516  0.415   0.631  0.463   0.795  0.511
NCF + Soft-label                0.528  0.421   0.656  0.471   0.805  0.511
NCF + Hard-coded                0.503  0.398   0.622  0.447   0.758  0.498


RQ3: Does incorporating personality improve RecSys performance?
Experiment Results: Personality2018
• Observation: the NCF + Soft-labeled model outperforms the other models.
• Conclusion 1: adding a personality label indeed helps.
• Conclusion 2: the improvement on Personality 2018 is less obvious than on the Amazon Beauty dataset.

RQ4: How does personality information improve the RecSys performance?
• HR and NDCG grouped by the 5 personalities: Amazon (Beauty)

Group  HR (+)         HR (-)  NDCG (+)       NDCG (-)
OPEN   0.833 (+11%)   0.750   0.729 (+34%)   0.545
NEU    0.933 (+12%)   0.833   0.835 (+56%)   0.536
CON    0.883 (+21%)   0.727   0.769 (+57%)   0.490
EXT    0.970 (+11%)   0.872   0.882 (+47%)   0.600
AGR    0.968 (+12%)   0.864   0.878 (+48%)   0.593

+: w/ personality, -: w/o personality

RQ4: How does personality information improve the RecSys performance?
• HR and NDCG grouped by the 5 personalities: Amazon (Beauty)
• Observation: CON has the largest improvement, OPEN has the smallest improvement.
• Conclusion: personality information has the largest impact for CON users and the smallest impact for OPEN users.
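The percentages in parentheses are relative gains of the "+" (with personality) score over the "-" (without personality) score. The short sketch below recomputes the HR gains for Amazon (Beauty) from the reported values; the helper name `relative_gain` is illustrative.

```python
def relative_gain(with_personality, without_personality):
    """Relative improvement in percent of the '+' score over the '-' score."""
    return 100.0 * (with_personality - without_personality) / without_personality


# HR values reported for Amazon (Beauty): trait -> (with, without)
hr_beauty = {
    "OPEN": (0.833, 0.750),
    "NEU": (0.933, 0.833),
    "CON": (0.883, 0.727),
    "EXT": (0.970, 0.872),
    "AGR": (0.968, 0.864),
}

hr_gains = {trait: relative_gain(w, wo) for trait, (w, wo) in hr_beauty.items()}
largest = max(hr_gains, key=hr_gains.get)

for trait, gain in hr_gains.items():
    print(f"{trait}: {gain:+.0f}%")
```

The recomputed gains match the slide, with CON showing the largest HR improvement.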

RQ4: How does personality information improve the RecSys performance?
• HR and NDCG grouped by the 5 personalities: Personality2018

Group  HR (+)         HR (-)  NDCG (+)        NDCG (-)
OPEN   0.535 (-2%)    0.547   0.420 (-0.4%)   0.422
NEU    0.489 (-4%)    0.511   0.390 (-6%)     0.415
CON    0.475 (+8%)    0.441   0.358 (-0.8%)   0.361
EXT    0.611 (+10%)   0.556   0.412 (+0.2%)   0.411
AGR    0.621 (+13%)   0.552   0.512 (+19%)    0.430

+: w/ personality, -: w/o personality

RQ4: How does personality information improve the RecSys performance?
• HR and NDCG grouped by the 5 personalities: Personality2018
• Observation: AGR has the largest improvement.
• Conclusion: AGR has the largest impact; the results are not consistent with those on the Amazon Beauty dataset.

Conclusion and Limitations
In this work, we make a preliminary attempt to explore how to automatically infer users’ personality traits from product reviews and how the inferred traits can benefit state-of-the-art automated recommendation processes. We observe that recommendation performance is indeed boosted by incorporating personality information.

Conclusion and Limitations
Limitations:
1. Capturing personality from the review texts may lead to selection bias.
2. More in-depth investigation is necessary on how personality affects recommendation and users’ behavior.
3. Openness, conscientiousness and neuroticism features do not have an obvious impact on the recommendation performance.
4. The 5 personalities are encoded independently of each other in our model, but in real life these personality traits are correlated.

Thank you! Any questions?
Lu Xinyuan 📧 [email protected]

References
• Ivan et al. 2013. Relating personality types with user preferences in multiple entertainment domains.
• M. B. Mariappan et al. 2012. Facefetch: A user emotion driven multimedia content recommendation system based on facial expression recognition.
• Joanne Hinds, Emma J. Williams, and Adam N. Joinson. 2020. “It wouldn’t happen to me”: Privacy concerns and perspectives following the Cambridge Analytica scandal. International Journal of Human-Computer Studies, 143:102498.
• Wu Youyou, David Stillwell, H. Andrew Schwartz, and Michal Kosinski. 2017. Birds of a feather do flock together: Behavior-based personality-assessment method reveals personality similarity among couples and friends. Psychological Science, 28(3):276–284. PMID: 28059682.
• Hsin-Chang Yang and Zi-Rui Huang. 2019. Mining personality traits from social messages for game recommender systems. Knowledge-Based Systems, 165:157–168.
• Nana Yaw Asabere, Amevi Acakpovi, and Mathias Bennet Michael. 2017. Improving socially aware recommendation accuracy through personality. IEEE Transactions on Affective Computing, 9(3):351–361.
• W. Wu, L. Chen, and Y. Zhao. 2018. Personalizing recommendation diversity based on user personality. User Modeling and User-Adapted Interaction, 28(3):237–276.
• Ignacio Fernández-Tobías, Matthias Braunhofer, Mehdi Elahi, Francesco Ricci, and Iván Cantador. 2016. Alleviating the new user problem in collaborative filtering by exploiting personality information. User Modeling and User-Adapted Interaction, 26(2):221–255.

Supplementary Slides

Data Collection Pipeline (diagram, Steps 1 to 5): all users -> active users -> A and B; A -> personality score -> A’; A’ -> train, validation/test -> Personality Detector; B -> Personality Detector -> C; C -> personality score -> C’

Data Collection Pipeline
◎ Step 1: Filter “active” users from the raw data.
◎ Step 2: Among the active users, randomly select part of the data as A and keep the rest as B. Annotate A with personality scores from the Receptiviti API to obtain A’.
   Size of A: 683 users. Size of A’: currently 500 users. Size of B: 10,268 - 683 = 9,585 users.
◎ Step 3: Train and test a Personality Detector on A’: 537 for training, 136 for testing (80% for training, 20% for testing).
◎ Step 4: Apply the Personality Detector to B to select the active users in B, which we call C (B -> C: 9,585 users -> 2,345 users).
◎ Step 5: Annotate C with personality scores from Receptiviti to obtain C’. Size of C’ ≈ 3,028 - 683 = 2,345 users.
Output: A’ + C’ = 3 million words, about 3,028 users.

Data Collection Pipeline
Step 1: active users satisfy the following conditions:
1. Each user purchased at least 10 items.
2. Each review contains 30 to 80 words.
After filtering, a total of 10,268 users are left. Total words: 10,170,213. The average number of words for each user is 990.48. The average number of words for each review is 51.01.

Max Words  Min Words  Min Items  No. of Users After Filtering  Total Words  Average Words for Each User  Average Words for Each Review
       80         30         10                        10,268   10,170,213                       990.48                          51.01
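The Step 1 filter can be sketched as below. The interpretation is an assumption: reviews outside the 30 to 80 word range are discarded first, and a user counts as active if at least 10 reviewed items remain. The `(user_id, item_id, text)` input layout and the name `select_active_users` are hypothetical.

```python
from collections import defaultdict


def select_active_users(reviews, min_words=30, max_words=80, min_items=10):
    """Return {user_id: {item_id: text}} for users passing the Step 1 filter.

    `reviews` is an iterable of (user_id, item_id, text) triples (assumed layout).
    Word counts use whitespace splitting.
    """
    kept = defaultdict(dict)
    for user_id, item_id, text in reviews:
        if min_words <= len(text.split()) <= max_words:
            kept[user_id][item_id] = text
    # Keep only users with enough distinct reviewed items after the length filter.
    return {user: items for user, items in kept.items() if len(items) >= min_items}


# Toy data: u1 has 10 valid reviews; u2 has 9 valid reviews and one too short.
reviews = (
    [("u1", f"i{n}", "word " * 40) for n in range(10)]
    + [("u2", f"i{n}", "word " * 40) for n in range(9)]
    + [("u2", "i9", "too short")]
)
active = select_active_users(reviews)
```

With the toy data only `u1` is kept, since `u2` falls one valid review short of the threshold.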
