What Big Data tells: Sampling the social network by communication channel

What does Big Data tell? Sampling the social network by communication channels
Yohsuke Murase
RIKEN Advanced Institute of Computational Science
2016 Aug. 10 @ Aalto Univ.

About me 2
• Name: Yohsuke Murase
• Major: statistical physics, complex networks
• Grew up in Okazaki City, Aichi
• Ph.D. at the Department of Applied Physics, Univ. Tokyo (statistical physics)
• Worked as a software engineer in Hamamatsu
• Joined RIKEN AICS in 2013

References 3
J. Török, Y. Murase, H.-H. Jo, J. Kertész, K. Kaski, “What does Big Data tell: Sampling the social network by communication channels”, arXiv:1511.08749

Computational Social Science 4
In recent decades, the rapid development of ICT has generated entirely new approaches in the social sciences.
[Figure: panels showing the degree k, the link weight w (s) with its distribution P(w), and the link overlap O_ij (<O>_w, <O>_b); from J. P. Onnela et al., Proc. Nat. Acad. Sci. 104, 7332 (2007)]
https://flic.kr/p/rYDu1X
https://flic.kr/p/6pXKes

Question 5
Most studies are conducted using a single communication channel. To what extent can we draw conclusions about the “real” network?

Decreasing degree distributions 6
[Figure: log-log plots of the degree distribution P(k) versus k for four datasets: phone calls (Onnela et al., PNAS (2007)), Facebook (Ugander, arXiv (2011)), coauthorship (Newman, PNAS (2004)), and iWiW (Hungarian SNS)]
All of these P(k) show a decreasing profile, i.e., the most probable degree is 1. However, in reality it is quite rare to find a person who has only one friend.

Conjecture 7
[Figure: schematic. Whole social network (peaked degree distribution P(k)) → sampling → sampled network (decreasing degree distribution P(k))]
Presumably, a decreasing P(k) is found because we sample only a part of the whole social network.

Null model: Random sampling 8
• Stumpf et al., PNAS (2005)
• Stumpf et al., Phys. Rev. E (2005)
• Han et al., Nat. Biotech. (2005)
• Lee et al., Phys. Rev. E (2006)
• Costa et al., Phys. Rev. E (2007)
Sampling schemes: random node/link sampling, snowball sampling.
[Figure: P(k) versus k after random link sampling (k=50 → 3)]
Random sampling does not explain why a decreasing P(k) is observed so widely.
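The point about random link sampling can be checked with a few lines of arithmetic. The sketch below is only an illustration under an assumption made here: the "k=50 → 3" label is read as a regular graph of degree 50 in which each link is kept independently with probability 3/50. In that case the sampled degree of every node is binomial, so P(k) stays peaked near k = 3 rather than decreasing from k = 1. The function name is hypothetical.

```python
from math import comb

def thinned_degree_distribution(k0, p):
    """P(k) after keeping each link of a k0-regular graph independently with probability p."""
    return [comb(k0, k) * p**k * (1 - p) ** (k0 - k) for k in range(k0 + 1)]

# Assumed reading of "k=50 -> 3": degree 50, keep probability 3/50.
pk = thinned_degree_distribution(50, 3 / 50)
mode = max(range(len(pk)), key=pk.__getitem__)
print("most probable degree:", mode)
print("P(1) / P(mode):", pk[1] / pk[mode])
```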

Channel selections are biased 9
E. Hargittai, AAPSS, 659 (2015)
The Pew Internet Project (PIP)
“It draws on survey data to show that people do not select into the use of such sites randomly. Instead, use is biased in certain ways yielding samples that limit the generalizability of findings.”

Empirical data 10
[Figure: mobile phone call network, log-log P(k) versus k for all users and for users with # calls > 10, 100, 1000]
[Figure: Hungarian SNS, log-log P(k) versus k for all users and for users with active period > 1000, > 2000, > 3000]
[Schematic: P(k) versus k, with regions labeled “people who devote only marginal efforts” and “active people”]

In this talk, … 11
• We propose a simple model for the choice of communication channel, which reproduces networks with a decreasing P(k).
• With this sampling method, assortativity is inevitably modified in the sampled networks as well. Even if the original network is not assortative, the sampled network becomes assortative.

Model (1/2) 12
We randomly assign to each node an intrinsic “affinity” f, which denotes how strongly that person prefers to adopt the communication channel. The affinity follows a Weibull distribution truncated at f = 1:
$$P(f) = \begin{cases} c\,(f/f_0)^{\alpha-1}\, e^{-(f/f_0)^{\alpha}} & (f \le 1) \\ 0 & (f > 1) \end{cases}$$
[Figure: P(f) versus f for α=1 and α=0.5]
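One simple way to draw affinities from a Weibull distribution cut off at f = 1 is rejection sampling: draw from the untruncated Weibull and discard values above 1. The sketch below assumes this approach (the talk does not say how the affinities were generated), and the function name and parameter values are illustrative only. Note that Python's `random.weibullvariate` takes the scale first and the shape second.

```python
import random

def sample_affinity(shape, scale, rng):
    """Draw f from a Weibull(shape, scale) distribution restricted to f <= 1 by rejection."""
    while True:
        f = rng.weibullvariate(scale, shape)  # stdlib argument order: (scale, shape)
        if f <= 1.0:
            return f

rng = random.Random(0)
# Illustrative parameters (assumed here): shape 1 and scale f0 = 0.2.
affinities = [sample_affinity(1.0, 0.2, rng) for _ in range(20000)]
print("mean affinity:", sum(affinities) / len(affinities))
```

Rejection is cheap as long as most of the untruncated mass lies below 1, which is the case for small f0.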

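Model (2/2) 13
Link sampling probability: the generalized mean of $f_i$ and $f_j$.
$$p_{ij} = \left( \frac{f_i^{\beta} + f_j^{\beta}}{2} \right)^{1/\beta}$$
• $\beta \to \infty$: $\max\{f_i, f_j\}$ (maximum)
• $\beta = 1$: $(f_i + f_j)/2$ (arithmetic mean)
• $\beta \to 0$: $\sqrt{f_i f_j}$ (geometric mean)
• $\beta = -1$: $2 f_i f_j / (f_i + f_j)$ (harmonic mean)
• $\beta \to -\infty$: $\min\{f_i, f_j\}$ (minimum)
β > 0 is max-type; β < 0 is min-type.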
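The generalized mean and its limiting cases can be written as a small function. This is a minimal sketch, not code from the talk: the name `link_probability` is hypothetical, and the limits β → ±∞ and β → 0 are handled as explicit special cases because the power formula cannot be evaluated there directly.

```python
import math

def link_probability(fi, fj, beta):
    """Generalized (power) mean of two affinities with exponent beta."""
    if beta == math.inf:
        return max(fi, fj)
    if beta == -math.inf:
        return min(fi, fj)
    if beta == 0:
        return math.sqrt(fi * fj)          # geometric mean (limit beta -> 0)
    if beta < 0 and min(fi, fj) == 0:
        return 0.0                          # limit value; avoids division by zero
    return ((fi**beta + fj**beta) / 2) ** (1 / beta)

for beta in (-math.inf, -1, 0, 1, math.inf):
    print(beta, link_probability(0.2, 0.8, beta))
```

For fixed affinities the value grows with β, from the minimum to the maximum of the two.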

Surrogate network 14
Since we do not know the properties of the “whole” social network, we use surrogate networks with a peaked P(k) as the whole social network:
1. Erdős-Rényi graph (ER)
2. Regular random graph (RR)
3. Weighted Social Network with Link Deletion rule (WSN), Y. Murase et al., PLoS ONE (2015)

Result: P(k) 15
[Figure: P(k) versus k for α=0.8, β=−1 and α=0.8, β=1]
[Figure: P_1 versus β for α = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, where $P_1 \equiv P(k=1) / \max\{P(k)\}$]
For β>0, there is no parameter region showing a decreasing P(k). A min-type sampling probability is the key to obtaining a decreasing P(k).
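The contrast between min-type and max-type sampling can be reproduced qualitatively with a small simulation. The sketch below is a rough illustration, not the authors' code, and every concrete choice in it is an assumption made here: it uses the two limiting rules (pure min and pure max) instead of β = ±1, exponential affinities with f0 = 0.2 cut off at 1, a sparse random graph built from distinct random node pairs as an ER-like surrogate with mean degree 50, and P_1 computed over nodes with k ≥ 1 only.

```python
import random
from collections import Counter

def simulate_p1(beta_sign, n=4000, k0=50, f0=0.2, seed=1):
    """Return P1 = P(k=1) / max_k P(k), taken over k >= 1, for a sampled random graph."""
    rng = random.Random(seed)
    # Affinities: exponential with mean f0, rejected above the cutoff f = 1.
    f = []
    while len(f) < n:
        x = rng.expovariate(1.0 / f0)
        if x <= 1.0:
            f.append(x)
    # Surrogate network: n*k0/2 distinct random pairs (ER-like, mean degree k0).
    edges = set()
    while len(edges) < n * k0 // 2:
        i, j = rng.sample(range(n), 2)
        edges.add((min(i, j), max(i, j)))
    # Keep each link with probability min(f_i, f_j) or max(f_i, f_j).
    pick = min if beta_sign < 0 else max
    degree = [0] * n
    for i, j in edges:
        if rng.random() < pick(f[i], f[j]):
            degree[i] += 1
            degree[j] += 1
    counts = Counter(k for k in degree if k >= 1)
    return counts[1] / max(counts.values())
```

Calling `simulate_p1(-1)` gives a value close to 1 (degree 1 is among the most common degrees), while `simulate_p1(+1)` gives a value close to 0 (the distribution is peaked at a much larger degree).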

Result: assortative mixing 16
[Figure: k_nn(k) versus k for α = 0.8, β = −1 and α = 0.8, β = 1]
[Figure: assortativity versus β for α = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
Even if the surrogate network is non-assortative and f_i is assigned independently, assortativity is biased by the sampling. → “sampling-induced assortativity”
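For readers who want to measure assortativity on their own sampled networks, the usual quantity is the Pearson correlation between the degrees at the two ends of a link. The helper below is a minimal sketch under that assumption (the talk does not state which estimator was used); the function name is hypothetical, and the function returns NaN for graphs where all link ends have the same degree, since the correlation is undefined there.

```python
from collections import Counter

def degree_assortativity(edges):
    """Pearson correlation of the degrees at both ends of each undirected link."""
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    # Count every link in both orientations so the measure is symmetric.
    xs, ys = [], []
    for i, j in edges:
        xs += [degree[i], degree[j]]
        ys += [degree[j], degree[i]]
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return cov / var if var > 0 else float("nan")

star = [(0, 1), (0, 2), (0, 3), (0, 4)]   # hub linked to four leaves
path = [(0, 1), (1, 2), (2, 3)]           # chain of four nodes
print(degree_assortativity(star), degree_assortativity(path))
```

Positive values mean that high-degree nodes tend to link to other high-degree nodes; both toy graphs above are disassortative.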

Why decreasing P(k) is found only for β<0 17
For a node i with low affinity:
• Max-type ($p_{ij} = f_j$): to have a low degree, not only i but all the surrounding nodes must have low affinity.
• Min-type ($p_{ij} = f_i$): the sampling probability can be low irrespective of the affinity of the other nodes. ⇒ Low-degree nodes are common.

Intuitive explanation of assortativity under min rule 18
• Node i with low affinity ($p_{ij} = f_i$): all links are equally sampled.
• Node i with high affinity ($p_{ij} = f_j$): links with high-affinity nodes are more likely to be sampled.

Min-Model: A Special Case 19
$$P(f) = \frac{1}{f_0} \exp(-f/f_0) \quad (\alpha = 1), \qquad p_{ij} = \min\{f_i, f_j\} \quad (\beta \to -\infty)$$
[Figure (a): P(k) versus k for f_0 = 0.1, 0.2, 0.3; analysis and simulation for RR and ER]
[Figure: k_nn(k) versus k]

Results: Analysis of P(k) 20
[Figure (a): P(k) versus k for f_0 = 0.1, 0.2, 0.3; analysis and simulation for RR and ER]
Equations are given for:
• the probability of sampling a link involving the node i
• the probability that the node i has exactly k_i links
• the degree distribution of the network sampled from RR
• the degree distribution of the network sampled from a network having P_0(k)
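The chain of quantities above can be worked through explicitly for the special case of exponential affinities and the min rule on a regular random graph of degree k0. What follows is a derivation made here under one extra approximation, namely neglecting the cutoff at f = 1 (reasonable when f0 is well below 1); it is not necessarily the form shown on the slide. A link of a node with affinity f is kept with probability q(f) = E[min(f, f')] = f0 (1 − exp(−f/f0)), which is uniformly distributed on (0, f0). The degree of the node is then binomial with k0 trials and success probability q, and averaging over q gives P(k) = Pr[Bin(k0+1, f0) ≥ k+1] / ((k0+1) f0), which decreases monotonically in k.

```python
from math import comb

def min_rule_degree_distribution(k0, f0):
    """P(k) for min-rule sampling of a k0-regular graph with exponential affinities.

    Assumes the cutoff at f = 1 is negligible (f0 well below 1).
    """
    n = k0 + 1
    pmf = [comb(n, m) * f0**m * (1 - f0) ** (n - m) for m in range(n + 1)]
    # tails[m] = Pr[Bin(n, f0) >= m], accumulated from the top.
    tails = [0.0] * (n + 1)
    running = 0.0
    for m in range(n, -1, -1):
        running += pmf[m]
        tails[m] = running
    return [tails[k + 1] / (n * f0) for k in range(k0 + 1)]

pk = min_rule_degree_distribution(50, 0.2)
print([round(p, 4) for p in pk[:15]])
```

The distribution is nearly flat for k well below k0 f0 and drops quickly above it.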

Results: Analysis of Assortativity 21
We consider the correlation of affinities between neighboring nodes in the sampled networks because it is hard to calculate k_nn exactly.
f_nn(i): the average affinity of the neighboring nodes of a node with affinity f_i.
[Figure: f_nn(f) versus f for RR, ER, WSN, and analysis]
This is independent of the surrogate network. We can also rigorously prove that the sign of the derivative of f_nn is determined by the sign of β for any P(f) and any surrogate network.
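The dependence of f_nn on the sign of β can be checked numerically. The sketch below assumes that f_nn(f) is the affinity of a neighbor averaged with weight p(f, f') P(f'), i.e. f_nn(f) = ∫ f' p(f, f') P(f') df' / ∫ p(f, f') P(f') df', which follows if neighbors in the surrogate network have independent affinities; the slide's own formula is not preserved in the transcript. The integral is approximated by a midpoint rule on (0, 1], and the parameter defaults are illustrative.

```python
import math

def neighbor_affinity(f, beta, alpha=1.0, f0=0.3, steps=2000):
    """Average affinity of sampled neighbors of a node with affinity f > 0 (finite beta != 0)."""
    h = 1.0 / steps
    num = den = 0.0
    for s in range(steps):
        g = (s + 0.5) * h  # midpoint of the s-th slice of (0, 1]
        # Unnormalized Weibull density; the normalization cancels in the ratio.
        weight = (g / f0) ** (alpha - 1) * math.exp(-((g / f0) ** alpha))
        p = ((f**beta + g**beta) / 2) ** (1 / beta)
        num += g * p * weight
        den += p * weight
    return num / den

grid = [0.1, 0.3, 0.5, 0.7, 0.9]
print([round(neighbor_affinity(f, -1), 3) for f in grid])  # beta < 0
print([round(neighbor_affinity(f, +1), 3) for f in grid])  # beta > 0
```

For β < 0 the values rise with f, and for β > 0 they fall, in line with the statement about the sign of the derivative.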

Take-home messages 22
• We proposed a simple and plausible model of a sampling process that reproduces a decreasing P(k).
• A decreasing P(k) is found only for the min-type sampling probability, i.e., β<0. For the min-type model, “sampling-induced assortativity” is inevitable.
• We must be careful when we obtain data from a single communication channel.
