Slide 1

Slide 1 text

Analysis of Bias in Gathering Information Between User Attributes in News Application
Yoshifumi Seki (Gunosy Inc.)
Mitsuo Yoshida (Toyohashi University of Technology)
ABCCS2018@IEEE Bigdata 2018
2018.12.10

Slide 2

Slide 2 text

Motivations
● Confirmation bias exists in information gathering on the web.
  ○ e.g., filter bubbles, echo chambers
  ○ These phenomena have been investigated by questionnaires.
● We would like to clarify these phenomena by analyzing behavior data.
  ○ In this study, we use user activity logs from a news application.
  ○ Intended uses include evaluating the diversity of recommender systems and improving long-term user satisfaction.

Slide 3

Slide 3 text

Research Question
● Q. How does behavior in the news application differ between user attributes?
  ○ Ideally, we would like to analyze users based on their interests.
  ○ Instead of users' interests, we analyze users based on their attributes.
● Our Contributions:
  ○ Clarify the relationships in user behavior between user attributes.
  ○ Detect keywords that are biased by attribute, using regression analysis.

Slide 4

Slide 4 text

Data Source
● Gunosy
  ○ A popular Japanese news delivery service
  ○ Provides mobile applications (iOS, Android)
  ○ Over 24 million downloads
  ○ Delivers news from over 600 media

Slide 5

Slide 5 text

Dataset
● August 1 to 31, 2018 (1 month)
● News articles
  ○ politics, society
● 2 types of action
  ○ Click, Like
● Clicked more than 100 times
● User attributes
  ○ Users register their own attributes in the application.
    ■ If users don't register, their attributes are predicted by supervised learning.
  ○ Age
    ■ -29 (younger), 30-39 (middle), 40- (older)
  ○ Gender
    ■ male, female

Slide 6

Slide 6 text

Gender Action Ratio

                 all      politics   society
click   male     58.9%    76.2%      54.0%
        female   41.1%    23.8%      46.0%
like    male     47.7%    78.2%      47.4%
        female   52.3%    21.8%      52.6%
# of news                 1,333      8,801

Slide 7

Slide 7 text

Age Action Ratio

                 all      politics   society
click   young    34.7%    16.4%      23.1%
        middle   30.2%    22.1%      30.4%
        older    35.1%    61.5%      46.5%
like    young    25.8%    8.8%       16.0%
        middle   25.4%    11.0%      22.1%
        older    48.7%    80.2%      61.9%
# of news                 1,333      8,801

Slide 8

Slide 8 text

Normalize # of Action
● The trend in the number of actions differs depending on categories and attributes.
  ○ Normalization is needed.
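
The slides do not state how the counts were normalized. One possible approach, shown here purely as an assumption, is to divide each article's action count by the total number of actions of that attribute group, so that groups of different sizes become comparable. The function name and the sample counts are hypothetical.

```python
def normalize_counts(counts):
    """Convert raw per-article action counts into shares of the group's total."""
    total = sum(counts)
    if total == 0:
        # No actions at all: return zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    return [c / total for c in counts]


# Hypothetical click counts on the same three articles for two groups.
male_clicks = [200, 100, 100]
female_clicks = [20, 10, 10]

male_share = normalize_counts(male_clicks)
female_share = normalize_counts(female_clicks)
print(male_share)
print(female_share)
```

In this toy example the two groups differ tenfold in raw volume but have the same relative preference across the articles, which only becomes visible after normalization.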

Slide 9

Slide 9 text

Scatter Plot by gender
[Figure: scatter plots for Click and Like]
Pearson's correlation coefficient: 0.902, 0.883, 0.502, 0.509
● Click: strong positive correlation
● Like: weaker correlation than click

Slide 10

Slide 10 text

Pearson’s coefficient by ages

                 politics           society
                 click    like      click    like
young-middle     0.993    0.909     0.985    0.955
middle-older     0.923    0.845     0.969    0.976
older-young      0.901    0.786     0.936    0.902

Slide 11

Slide 11 text

Result of Correlation Analysis
● Differences in per-category user behavior between attributes were compared using the correlation coefficient.
  ○ The number of clicks has strong positive correlations between attributes.
  ○ The number of likes has weaker correlations than the number of clicks.
● User behavior has a strong correlation between attributes.
  ○ We are therefore able to discuss their differences using user behavior data.
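
As an illustration of the comparison above, the following minimal sketch computes Pearson's correlation coefficient between the per-article action counts of two attribute groups. It assumes that each article contributes one pair of counts; the function name and the sample values are hypothetical and are not taken from the study's data.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-article click counts for two attribute groups.
male = [10, 20, 30, 40]
female = [12, 18, 33, 41]
r = pearson(male, female)
print(round(r, 3))
```

A value close to 1 means that articles popular with one group tend to be popular with the other group as well.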

Slide 12

Slide 12 text

Comparison by keywords
● Our purpose is to clarify how behavior differs between user attributes depending on the topic of news articles.
  ○ There are various definitions of news topics.
  ○ This study compares articles based on the keywords included in their titles.
● Extract keywords from news articles.
  ○ Divide the title of each news article into morphemes using MeCab.
    ■ These morphemes are taken as keyword candidates.
  ○ Count the news articles that include each keyword candidate.
  ○ We adopt the top 100 words by this count as keywords.
    ■ Meaningless words are excluded.
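
The extraction steps above can be sketched as follows. This sketch assumes the titles have already been split into tokens (the slides use MeCab for that step) and simply counts, for each candidate, the number of titles that contain it. The function name, the stopword set, and the sample titles are hypothetical.

```python
from collections import Counter


def top_keywords(tokenized_titles, stopwords, top_n=100):
    """Return the top_n tokens ranked by the number of titles containing them."""
    doc_freq = Counter()
    for tokens in tokenized_titles:
        # Use a set so a token repeated within one title is counted once.
        doc_freq.update(set(tokens) - stopwords)
    # Sort by count (descending), then alphabetically for a stable result.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:top_n]]


# Hypothetical, already-tokenized titles (English stand-ins for morphemes).
titles = [
    ["police", "arrest", "the", "suspect"],
    ["police", "search", "the", "forest"],
    ["cabinet", "approves", "the", "budget"],
    ["police", "and", "cabinet", "meet"],
]
keywords = top_keywords(titles, stopwords={"the", "and"}, top_n=2)
print(keywords)
```

Counting titles rather than token occurrences matches the slide's wording ("count news articles including each keyword candidate"); how meaningless words were identified is not stated, so the stopword set here is only a placeholder.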

Slide 13

Slide 13 text

Distribution of keyword correlation coefficient
● We would like to compare keywords between user attributes.
  ○ If the correlation coefficient of a keyword is weak, that keyword is not comparable.
● Keywords with a weak correlation coefficient include articles with very few actions.
[Figure: distributions of keyword correlation coefficients for Click and Like]

Slide 14

Slide 14 text

Regression Analysis
● To detect differences between keywords, we adopt regression analysis.
● Regression analysis yields a slope and an intercept for each keyword.
  ○ Keywords whose coefficient of determination is 0.5 or less are excluded.
    ■ The coefficient of determination is closely related to the correlation coefficient: in simple linear regression it equals the square of Pearson's correlation coefficient.
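
A minimal sketch of this step, assuming an ordinary least-squares fit of one group's action counts (y) against the other group's counts (x) over the articles that contain a keyword. The slides do not specify which group is on which axis, so that choice, the function name, and the sample data are all hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values are constant; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Coefficient of determination; defined as 0 when y has no variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, intercept, r2


# Hypothetical (x, y) action counts per article, grouped by keyword.
keyword_points = {
    "police": ([1, 2, 3, 4], [3, 5, 7, 9]),
    "noise": ([1, 2, 3, 4], [5, 1, 6, 2]),
}
fits = {kw: fit_line(xs, ys) for kw, (xs, ys) in keyword_points.items()}
# Keep only keywords whose coefficient of determination exceeds 0.5.
kept = {kw: fit for kw, fit in fits.items() if fit[2] > 0.5}
print(sorted(kept))
```

The filter mirrors the rule on the slide: a keyword whose points do not lie close to a line has a slope and intercept that are not meaningful to compare.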

Slide 15

Slide 15 text

Compare Keyword Intercept
The slopes of these two keywords are close to the average, while one intercept is large and the other is small.

Slide 16

Slide 16 text

Compare Keyword Slope
The intercepts of these two keywords are close to the average, while one slope is large and the other is small.

Slide 17

Slide 17 text

Compare keywords preferred by female users
The keyword “hospital” has many articles that receive fewer clicks than articles with the keyword “mother”.

Slide 18

Slide 18 text

Biased Keywords Detection
● Using slope (s) and intercept (i), keywords are divided into three categories based on mean ± σ.
  ○ larger than upper (x > mean + σ)
  ○ smaller than lower (x < mean - σ)
  ○ within the section (mean - σ < x < mean + σ)
● These categories are defined under the assumption that these parameters follow a normal distribution.
  ○ i.e., whether a value belongs to the 95% range or not (for a normal distribution, mean ± σ covers about 68% of values; a 95% range corresponds to about mean ± 2σ).
● If one of the two parameters is within the section and the other is not, the keyword is biased.
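
The rule above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say whether the population or the sample standard deviation is used (the population form, `statistics.pstdev`, is assumed here), and the function names and the slope/intercept values are hypothetical.

```python
from statistics import mean, pstdev


def classify(value, mu, sigma):
    """Place a value relative to the section [mu - sigma, mu + sigma]."""
    if value > mu + sigma:
        return "upper"
    if value < mu - sigma:
        return "lower"
    return "within"


def biased_keywords(params):
    """params maps keyword -> (slope, intercept); returns the biased ones."""
    slopes = [s for s, _ in params.values()]
    intercepts = [i for _, i in params.values()]
    s_mu, s_sd = mean(slopes), pstdev(slopes)
    i_mu, i_sd = mean(intercepts), pstdev(intercepts)
    result = {}
    for keyword, (s, i) in params.items():
        s_cat = classify(s, s_mu, s_sd)
        i_cat = classify(i, i_mu, i_sd)
        # Biased when exactly one of the two parameters leaves the section.
        if (s_cat == "within") != (i_cat == "within"):
            result[keyword] = (s_cat, i_cat)
    return result


# Hypothetical regression parameters per keyword: (slope, intercept).
params = {
    "police": (1.0, 5.0),
    "mother": (1.0, -5.0),
    "cabinet": (1.2, 0.0),
    "china": (0.8, 0.0),
    "budget": (1.0, 0.0),
}
biased = biased_keywords(params)
print(biased)
```

Here "budget" is not reported because both of its parameters stay within the section, while the other four keywords each deviate in exactly one parameter.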

Slide 19

Slide 19 text

Biased Keyword by intercept in gender
● Mio Sugita is a Japanese politician who presented papers on LGBT issues in magazines. The claims in these papers caused controversy.
● There was news about the possible introduction of Summer Time before the 2020 Summer Olympic Games in Tokyo.
● A 2-year-old boy went missing in the forest and was rescued by a volunteer.

Table columns: politics (click, like), society (click, like)
Upper (biased to male): House of Representatives, China; Police; Obscenity
Lower (biased to female): Sugita Mio, Summer Time, Cabinet, Olympics; Child, Mother; Boy, Crush, Mother, Children

Slide 23

Slide 23 text

Conclusion
● We analyzed behavior differences between user attributes based on the user behavior logs of a news application and extracted keywords with biased behavior.
● Using regression analysis, we identify biased keywords from how far their slope and intercept depart from the average values.
● Future Work
  ○ Verify whether these results are valid according to social science knowledge.
  ○ Discover topics that are strongly biased by users' interests rather than user attributes.
  ○ Create a measure that can extract keywords more simply.