Slide 1

Slide 1 text

What does Big Data tell? Sampling the social network by communication channels
Yohsuke Murase
RIKEN Advanced Institute for Computational Science
2016 Aug. 10 @ Aalto Univ.

Slide 2

Slide 2 text

About me
• Name: Yohsuke Murase (村瀬洋介)
• Major: statistical physics, complex networks
• Grew up in Okazaki City, Aichi
• Ph.D. from the Department of Applied Physics, Univ. of Tokyo (statistical physics)
• Worked as a software engineer in Hamamatsu
• Joined RIKEN AICS in 2013

Slide 3

Slide 3 text

References
J. Torok, Y. Murase, H.-H. Jo, J. Kertesz, K. Kaski, “What does Big Data tell: Sampling the social network by communication channels”, arXiv:1511.08749

Slide 4

Slide 4 text

Computational Social Science
In recent decades, the rapid development of ICT has generated entirely new approaches in the social sciences.
[Figure: panels from J. P. Onnela et al., Proc. Nat. Acad. Sci, 104, 7332 (2007), including distributions over degree k and link weight w (s)]
Images: https://flic.kr/p/rYDu1X https://flic.kr/p/6pXKes

Slide 5

Slide 5 text

Question
Most studies are conducted using a single communication channel.
To what extent can we draw conclusions about the “real” network?

Slide 6

Slide 6 text

Decreasing degree distributions
[Figure: degree distribution P(k) versus k for four datasets: phone call (Onnela et al., PNAS (2007)), Facebook (Ugander, arXiv (2011)), coauthor (Newman, PNAS (2004)), and iWiW (Hungarian SNS), all users, on log-log axes]
All of these P(k) show a decreasing profile, i.e., the most probable degree is 1.
However, in reality it is quite rare to find a person who has only one friend.

Slide 7

Slide 7 text

Conjecture
Presumably, a decreasing P(k) is found because we sample only a part of the whole social network.
[Diagram: whole social network (peaked degree distribution) → sampling → sampled network (decreasing degree distribution)]

Slide 8

Slide 8 text

Null model: Random sampling
Previous studies of random node/link sampling and snowball sampling:
• Stumpf et al., PNAS (2005)
• Stumpf et al., Phys. Rev. E (2005)
• Han et al., Nat. Biotech. (2005)
• Lee et al., Phys. Rev. E (2006)
• Costa et al., Phys. Rev. E (2007)
[Figure: P(k) versus k after random link sampling (k=50 → 3)]
Random sampling does not explain why a decreasing P(k) is observed so widely.
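As a quick check of this point, the degree distribution produced by random link sampling can be written down directly. The sketch below assumes a regular surrogate network in which every node has degree 50 and every link is kept independently with probability 3/50, which is one reading of "k=50 → 3"; the sampled degree of a node is then binomial. The function name is illustrative only.

```python
from math import comb

def binomial_degree_pmf(k0, p):
    """P(k) for a node of original degree k0 when each link is kept with probability p."""
    return [comb(k0, k) * p**k * (1 - p) ** (k0 - k) for k in range(k0 + 1)]

k0 = 50       # assumed degree of every node before sampling
p = 3 / k0    # assumed keep probability, so the mean sampled degree is 3
pmf = binomial_degree_pmf(k0, p)
mode = max(range(k0 + 1), key=lambda k: pmf[k])
print("most probable degree:", mode)
print("P(1) / P(mode):", pmf[1] / pmf[mode])
```

Under these assumptions the sampled distribution stays peaked near its mean instead of at k = 1.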

Slide 9

Slide 9 text

Channel selections are biased
E. Hargittai, AAPSS, 659 (2015); The Pew Internet Project (PIP)
“It draws on survey data to show that people do not select into the use of such sites randomly. Instead, use is biased in certain ways yielding samples that limit the generalizability of findings.”

Slide 10

Slide 10 text

Empirical data
[Figure, Hungarian SNS: P(k) versus k on log-log axes for all users and for users with active period >1000, >2000, >3000]
[Figure, mobile phone call: P(k) versus k on log-log axes for all users and for users with # calls > 10, > 100, > 1000]
[Sketch: P(k) versus k, annotated with “people who devote only marginal efforts” and “active people”]

Slide 11

Slide 11 text

In this talk, …
• We propose a simple model for the choice of communication channel that reproduces networks with a decreasing P(k).
• With this sampling method, assortativity is inevitably modified in the sampled networks as well. Even if the original network is not assortative, the sampled network becomes assortative.

Slide 12

Slide 12 text

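Model (1/2)
We randomly assign each node an intrinsic “affinity” f, which denotes how strongly that person prefers to adopt the communication channel.
The affinity follows a Weibull distribution truncated at f = 1, where c is a normalization constant:
$$P(f) = \begin{cases} c\,(f/f_0)^{\alpha-1}\, e^{-(f/f_0)^{\alpha}} & (f \le 1) \\ 0 & (f > 1) \end{cases}$$
[Figure: Weibull distribution P(f) versus f for α = 1 and α = 0.5]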
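A minimal sketch of how such affinities could be drawn, assuming the density is the Weibull form with shape α and scale f0 cut off at f = 1. It inverts the cumulative distribution restricted to [0, 1]; the function names and the parameter values in the usage lines are illustrative, not taken from the talk.

```python
import math
import random

def sample_affinity(alpha, f0, rng=random):
    """Draw f from the Weibull density truncated to 0 <= f <= 1 by inverting its CDF."""
    # Untruncated Weibull CDF: F(f) = 1 - exp(-(f/f0)**alpha); F(1) is the mass that is kept.
    mass_below_one = 1.0 - math.exp(-((1.0 / f0) ** alpha))
    u = rng.random() * mass_below_one
    f = f0 * (-math.log(1.0 - u)) ** (1.0 / alpha)
    return min(f, 1.0)  # guard against rounding just above the cutoff

def normalization(alpha, f0):
    """Constant c that makes the truncated density integrate to one."""
    return alpha / (f0 * (1.0 - math.exp(-((1.0 / f0) ** alpha))))

rng = random.Random(0)
samples = [sample_affinity(0.8, 0.3, rng) for _ in range(10000)]
print(min(samples), max(samples), normalization(0.8, 0.3))
```

Inverse-CDF sampling is used here only because it needs no rejection step; any other sampler for the same density would serve equally well.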

Slide 13

Slide 13 text

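Model (2/2)
Link sampling probability: the generalized mean of f_i and f_j,
$$p_{ij} = \left[ \tfrac{1}{2} \left( f_i^{\beta} + f_j^{\beta} \right) \right]^{1/\beta}$$
Special cases, from max-type to min-type:
• β = ∞: $\max\{f_i, f_j\}$ (maximum)
• β = 1: $(f_i + f_j)/2$ (arithmetic mean)
• β = 0: $\sqrt{f_i f_j}$ (geometric mean)
• β = −1: $2 f_i f_j / (f_i + f_j)$ (harmonic mean)
• β = −∞: $\min\{f_i, f_j\}$ (minimum)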
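The generalized mean and its limiting cases can be sketched as a single function. The name link_probability and the explicit handling of the limits β = ±∞, β = 0, and zero affinity are choices made for this illustration.

```python
import math

def link_probability(fi, fj, beta):
    """Generalized (power) mean of two affinities with exponent beta."""
    if beta == math.inf:
        return max(fi, fj)
    if beta == -math.inf:
        return min(fi, fj)
    if beta == 0:
        return math.sqrt(fi * fj)   # limit beta -> 0
    if beta < 0 and min(fi, fj) == 0:
        return 0.0                  # limit as either affinity -> 0
    return (0.5 * (fi**beta + fj**beta)) ** (1.0 / beta)

for beta in (math.inf, 1, 0, -1, -math.inf):
    print(beta, link_probability(0.2, 0.8, beta))
```

The value decreases monotonically as β goes from ∞ to −∞, which is why negative β is referred to as min-type and positive β as max-type.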

Slide 14

Slide 14 text

Surrogate network
Since we do not know the properties of the “whole” social network, we use surrogate networks with a peaked P(k) in its place:
1. Erdős-Rényi graph (ER)
2. Regular Random graph (RR)
3. Weighted Social Network with Link Deletion rule (WSN), Y. Murase et al., PLoS ONE (2015)

Slide 15

Slide 15 text

Result: P(k)
[Figure, left: P(k) versus k (logarithmic P(k) axis) for α=0.8, β=-1 and α=0.8, β=1]
[Figure, right: P_1 versus β for α = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, where $P_1 \equiv P(k=1) / \max\{P(k)\}$]
For β>0, no parameter region shows a decreasing P(k).
A min-type sampling probability is the key to obtaining a decreasing P(k).
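The ratio P_1 is straightforward to compute from a list of sampled degrees; a value of 1 means that k = 1 is the most probable degree. The sketch below assumes that nodes left with no links are not observed and are therefore excluded, which is an assumption of this example. The two degree lists are made-up toy data.

```python
from collections import Counter

def p1_ratio(degrees):
    """P(k=1) / max_k P(k), computed over nodes with degree >= 1."""
    # Assumption: isolated nodes are unobserved, so degree 0 is dropped.
    counts = Counter(k for k in degrees if k >= 1)
    if not counts:
        raise ValueError("no node with degree >= 1")
    return counts.get(1, 0) / max(counts.values())

decreasing = [1, 1, 1, 1, 2, 2, 3]   # toy data: most nodes have degree 1
peaked = [1, 2, 3, 3, 3, 4, 5]       # toy data: most nodes have degree 3
print(p1_ratio(decreasing), p1_ratio(peaked))
```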

Slide 16

Slide 16 text

Result: assortative mixing
[Figure, left: k_nn(k) versus k for α = 0.8, β = -1 and α = 0.8, β = 1]
[Figure, right: assortativity versus β for α = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
Even if the surrogate network is non-assortative and the f_i are assigned independently, the sampling biases the assortativity.
→ “sampling-induced assortativity”
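For reference, k_nn(k) is the average degree of the neighbors of nodes that have degree k; it increases with k in an assortative network. A minimal sketch of the measurement on an undirected edge list follows. The function name and the toy star graph are illustrative.

```python
from collections import defaultdict

def knn_by_degree(edges):
    """Return {k: average neighbor degree, averaged over nodes with degree k}."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}
    sums, counts = defaultdict(float), defaultdict(int)
    for node, nbrs in neighbors.items():
        mean_nbr_degree = sum(degree[n] for n in nbrs) / len(nbrs)
        sums[degree[node]] += mean_nbr_degree
        counts[degree[node]] += 1
    return {k: sums[k] / counts[k] for k in sums}

star = [(0, 1), (0, 2), (0, 3)]   # toy graph: one hub with three leaves
print(knn_by_degree(star))
```

The star graph is the extreme disassortative case: the hub of degree 3 sees only leaves, and each leaf sees only the hub.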

Slide 17

Slide 17 text

Why a decreasing P(k) is found only for β<0
Consider a node i with low affinity.
• Max-type (p_ij = f_j): for i to have a low degree, not only i but also all the surrounding nodes must have low affinity.
• Min-type (p_ij = f_i): the sampling probability can be low irrespective of the affinities of the other nodes. ⇒ Low-degree nodes are common.
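The argument can be checked with a small Monte Carlo sketch. It does not build a network: each node is given k0 original links whose far ends carry independently drawn affinities, which is an approximation that ignores correlations between neighboring nodes. The exponential affinity distribution (the α = 1 case, truncated at 1), the values k0 = 50 and f0 = 0.2, and the exclusion of isolated nodes are all assumptions of this example.

```python
import math
import random
from collections import Counter

def sample_affinity(f0, rng):
    """Exponential affinity with scale f0, truncated to [0, 1] (the alpha = 1 case)."""
    u = rng.random() * (1.0 - math.exp(-1.0 / f0))
    return min(-f0 * math.log(1.0 - u), 1.0)

def sampled_degrees(rule, k0, f0, n_nodes, rng):
    """Sampled degree of n_nodes independent nodes, each with k0 original links."""
    degrees = []
    for _ in range(n_nodes):
        fi = sample_affinity(f0, rng)
        kept = 0
        for _ in range(k0):
            fj = sample_affinity(f0, rng)     # neighbor affinity, drawn independently
            if rng.random() < rule(fi, fj):   # keep the link with probability p_ij
                kept += 1
        degrees.append(kept)
    return degrees

def p1_ratio(degrees):
    """P(k=1) / max_k P(k) over nodes with degree >= 1 (isolated nodes unobserved)."""
    counts = Counter(k for k in degrees if k >= 1)
    return counts.get(1, 0) / max(counts.values())

rng = random.Random(1)
p1_min = p1_ratio(sampled_degrees(min, k0=50, f0=0.2, n_nodes=5000, rng=rng))
p1_max = p1_ratio(sampled_degrees(max, k0=50, f0=0.2, n_nodes=5000, rng=rng))
print("P1 (min-type):", p1_min)
print("P1 (max-type):", p1_max)
```

With the min rule a single low-affinity node is enough to suppress all of its own links, so degree 1 is about as common as any other degree; with the max rule every link is kept at least as often as the neighbor's affinity allows, and degree-1 nodes become rare.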

Slide 18

Slide 18 text

Intuitive explanation of assortativity under min rule
• Node i with low affinity (p_ij = f_i): all links are equally sampled.
• Node i with high affinity (p_ij = f_j): links with high-affinity nodes are more likely to be sampled.

Slide 19

Slide 19 text

Min-Model: A Special Case
Affinity distribution (α = 1): $P(f) = \frac{1}{f_0} \exp(-f/f_0)$
Sampling probability (β = −∞): $p_{ij} = \min\{f_i, f_j\}$
[Figure (a): P(k) versus k for f0 = 0.1, 0.2, 0.3, comparing analysis and simulation for RR and ER]
[Figure: k_nn(k) versus k]

Slide 20

Slide 20 text

Results: Analysis of P(k)
[Figure (a): P(k) versus k for f0 = 0.1, 0.2, 0.3, comparing analysis and simulation for RR and ER]
(b) Quantities derived in the analysis, in order:
• the probability of sampling a link involving the node i
• the probability that the node i has exactly k_i links
• the degree distribution of the network sampled from RR
• the degree distribution of the network sampled from a network having P_0(k)
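A sketch of how these four quantities can be written for the min-type rule, assuming that the affinities of neighboring nodes are independent draws from P(f). The symbols q(f) and k_0 (the original degree of a node, the same for every node in RR) are notation introduced for this sketch, and the expressions are not necessarily in the form used in the talk.

```latex
\begin{align}
q(f_i) &= \int_0^1 \min\{f_i, f'\}\, P(f')\, df' \\
P(k_i \mid f_i, k_0) &= \binom{k_0}{k_i}\, q(f_i)^{k_i}\, \bigl[1 - q(f_i)\bigr]^{k_0 - k_i} \\
P_{\mathrm{RR}}(k) &= \int_0^1 P(f)\, P(k \mid f, k_0)\, df \\
P(k) &= \sum_{k_0} P_0(k_0) \int_0^1 P(f)\, P(k \mid f, k_0)\, df
\end{align}
```

The first line averages the link sampling probability over the neighbor's affinity, the second treats the k_0 links of a node as independent trials, and the last two average over the node's own affinity and, for a general surrogate network, over its original degree.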

Slide 21

Slide 21 text

Results: Analysis of Assortativity
Because it is hard to calculate k_nn exactly, we instead consider the correlation of affinities between neighboring nodes in the sampled networks.
f_nn(i): the average affinity of the neighbors of a node with affinity f_i.
[Figure: f_nn(f) versus f for RR, ER, WSN, and the analysis]
This is independent of the surrogate network.
We can also rigorously prove that the sign of the derivative of f_nn is determined by the sign of β for any P(f) and any surrogate network.
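One way to see why f_nn does not depend on the surrogate network, assuming affinities are assigned independently of the network structure: a sampled link of a node with affinity f leads to a neighbor with affinity f' with weight p(f, f') P(f'), so f_nn(f) = ∫ f' p(f, f') P(f') df' / ∫ p(f, f') P(f') df'. This expression is an assumption of the sketch below, not a transcription of the formula in the talk. The code evaluates it numerically for the min and max rules with an exponential P(f) truncated at 1 and an illustrative f0 = 0.2.

```python
import math

def affinity_density(f, f0):
    """Exponential density with scale f0, truncated to [0, 1] (assumed alpha = 1 case)."""
    return math.exp(-f / f0) / (f0 * (1.0 - math.exp(-1.0 / f0)))

def f_nn(f, rule, f0, n_grid=2000):
    """Mean neighbor affinity of a node with affinity f, by midpoint-rule integration."""
    num = den = 0.0
    for i in range(n_grid):
        x = (i + 0.5) / n_grid                      # neighbor affinity on the grid
        w = rule(f, x) * affinity_density(x, f0)    # weight of a sampled link to affinity x
        num += x * w
        den += w
    return num / den

for f in (0.1, 0.3, 0.5):
    print(f, f_nn(f, min, f0=0.2), f_nn(f, max, f0=0.2))
```

Under these assumptions f_nn grows with f for the min rule and shrinks with f for the max rule, in line with the statement that the sign of its derivative follows the sign of β.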

Slide 22

Slide 22 text

Take-home messages
• We proposed a simple and plausible model of the sampling process that reproduces a decreasing P(k).
• A decreasing P(k) is found only for the min-type sampling probability, i.e., β<0. For the min-type model, “sampling-induced assortativity” is inevitable.
• We must be careful when drawing conclusions from data obtained from a single communication channel.