
What Big Data tells: Sampling the social network by communication channel

Yohsuke Murase

August 10, 2016

Transcript

  1. What does Big Data tell? Sampling the social network by communication channels

    Yohsuke Murase, RIKEN Advanced Institute for Computational Science. 2016 Aug. 10 @ Aalto Univ.
  2. About me

    • Name: Yohsuke Murase
    • Major: statistical physics, complex networks
    • Grew up in Okazaki City, Aichi
    • Ph.D. from the Department of Applied Physics, Univ. Tokyo (statistical physics)
    • Worked as a software engineer in Hamamatsu
    • Joined RIKEN AICS in 2013
  3. References

    J. Torok, Y. Murase, H.-H. Jo, J. Kertesz, K. Kaski, “What does Big Data tell: Sampling the social network by communication channels”, arXiv:1511.08749
  4. Computational Social Science

    In recent decades, the rapid development of ICT has generated entirely new approaches in the social sciences.
    [Figure: panels reproduced from J. P. Onnela et al., Proc. Nat. Acad. Sci, 104, 7332 (2007); axis labels include degree k, link weight w, P(w), and overlap <O>. Photos: https://flic.kr/p/rYDu1X, https://flic.kr/p/6pXKes]
  5. Question

    Most studies are conducted using data from a single communication channel. To what extent can we draw conclusions about the “real” network?
  6. Decreasing degree distributions

    [Figure: degree distributions P(k) versus k for four datasets: phone calls (Onnela et al., PNAS (2007)), Facebook (Ugander, arXiv (2011)), coauthorship (Newman, PNAS (2004)), and iWiW (Hungarian SNS), all users]
    Every P(k) shows a decreasing profile, i.e., the most probable degree is 1. In reality, however, it is quite rare to find a person who has only one friend.
  7. Conjecture

    [Schematic: whole social network (peaked degree distribution) → sampling → sampled network (decreasing degree distribution)]
    Presumably, a decreasing P(k) is found because we sample only a part of the whole social network.
  8. Null model: Random sampling

    • Stumpf et al., PNAS (2005)
    • Stumpf et al., Phys. Rev. E (2005)
    • Han et al., Nat. Biotech. (2005)
    • Lee et al., Phys. Rev. E (2006)
    • Costa et al., Phys. Rev. E (2007)
    Sampling schemes: random node/link sampling, snowball sampling.
    [Figure: P(k) versus k after random link sampling (k=50 → 3)]
    Random sampling does not explain why a decreasing P(k) is found so widely.
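The point about random link sampling can be checked with a few lines of arithmetic. The sketch below assumes that "k=50 → 3" means every node starts with degree 50 and each link is kept independently with probability 3/50, so that the sampled degree is binomial; the function name `thinned_degree_pmf` is made up for this illustration.

```python
from math import comb

def thinned_degree_pmf(k0, keep_prob):
    """Degree distribution of a node with original degree k0 when every
    link is kept independently with probability keep_prob (binomial)."""
    return [comb(k0, k) * keep_prob**k * (1 - keep_prob) ** (k0 - k)
            for k in range(k0 + 1)]

# Assumed reading of the slide: original degree 50, mean sampled degree 3.
pmf = thinned_degree_pmf(50, 3 / 50)
most_probable_degree = max(range(len(pmf)), key=pmf.__getitem__)
print("most probable sampled degree:", most_probable_degree)
print("P(1) < P(3):", pmf[1] < pmf[3])
```

Under these assumptions the distribution stays peaked around the mean sampled degree rather than at k = 1, which is the sense in which uniform random sampling fails to produce a decreasing P(k).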
  9. Channel selections are biased

    E. Hargittai, AAPSS, 659 (2015), using data from The Pew Internet Project (PIP):
    “It draws on survey data to show that people do not select into the use of such sites randomly. Instead, use is biased in certain ways yielding samples that limit the generalizability of findings.”
  10. Empirical data

    [Figure: P(k) versus k on log-log axes for two datasets. Mobile phone calls: all users, and users with # calls > 10, 100, 1000. Hungarian SNS: all users, and users with active period > 1000, > 2000, > 3000.]
    [Schematic: P(k) versus k, annotated with “people who devote only marginal efforts” and “active people”]
  11. In this talk, …

    • We propose a simple model for the choice of communication channel, with which networks having a decreasing P(k) are reproduced.
    • With this sampling method, assortativity is inevitably modified in the sampled networks as well. Even if the original network is not assortative, the sampled network becomes assortative.
  12. Model (1/2)

    We randomly assign to each node an intrinsic “affinity”, f, which denotes the degree of preference of the person to adopt that communication channel. The affinity follows a Weibull distribution:
    $$P(f) = \begin{cases} c\,(f/f_0)^{a-1}\, e^{-(f/f_0)^{a}} & (f \le 1) \\ 0 & (f > 1) \end{cases}$$
    [Figure: P(f) versus f for a=1 and a=0.5]
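One way to draw affinities from this distribution is inverse-CDF sampling of a Weibull restricted to f ≤ 1. This is only a sketch: the function name `sample_affinity` and the parameter values a = 1, f0 = 0.2 are illustrative assumptions, and the normalization constant c is handled implicitly by rescaling the CDF at f = 1.

```python
import math
import random

def sample_affinity(rng, a, f0):
    """Draw one affinity f from P(f) ~ (f/f0)^(a-1) exp(-(f/f0)^a) on [0, 1].

    The untruncated Weibull CDF is F(f) = 1 - exp(-(f/f0)^a); restricting
    to f <= 1 rescales it by F(1), which is then inverted analytically.
    """
    cdf_at_1 = 1.0 - math.exp(-((1.0 / f0) ** a))
    u = rng.random()                                   # uniform on [0, 1)
    f = f0 * (-math.log1p(-u * cdf_at_1)) ** (1.0 / a)
    return min(1.0, f)                                 # guard against rounding above 1

rng = random.Random(42)                                # fixed seed for reproducibility
sample = [sample_affinity(rng, a=1.0, f0=0.2) for _ in range(5)]
print(sample)
```

Rejection sampling from an untruncated Weibull would work equally well; the inverse-CDF form is used here only because it needs a single uniform draw per node.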
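  13. Model (2/2)

    Link sampling probability: generalized mean of f_i and f_j.
    $$p_{ij} = \left( \tfrac{1}{2} \left( f_i^{\beta} + f_j^{\beta} \right) \right)^{1/\beta}$$
    • β = ∞ (maximum): max{f_i, f_j}
    • β = 1 (arithmetic mean): (f_i + f_j)/2
    • β = 0 (geometric mean): \sqrt{f_i f_j}
    • β = −1 (harmonic mean): 2 f_i f_j / (f_i + f_j)
    • β = −∞ (minimum): min{f_i, f_j}
    Negative β is “min-type”; positive β is “max-type”.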
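The generalized mean and its limiting cases can be written as a small function. This is a sketch; the name `link_sampling_probability` is not from the talk, and the treatment of a zero affinity for negative β (returning 0, the limiting value) is a choice made here to avoid division by zero.

```python
import math

def link_sampling_probability(fi, fj, beta):
    """Generalized (power) mean of two affinities with exponent beta."""
    if beta == math.inf:
        return max(fi, fj)
    if beta == -math.inf:
        return min(fi, fj)
    if beta == 0:
        return math.sqrt(fi * fj)          # limit beta -> 0: geometric mean
    if beta < 0 and min(fi, fj) == 0.0:
        return 0.0                         # limiting value when one affinity is zero
    return (0.5 * (fi**beta + fj**beta)) ** (1.0 / beta)

for beta in (math.inf, 1, 0, -1, -math.inf):
    print(beta, link_sampling_probability(0.2, 0.8, beta))
```

For any pair of affinities the value decreases monotonically from the maximum to the minimum as β goes from ∞ to −∞, which is why β interpolates between max-type and min-type sampling.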
  14. Surrogate network

    Since we do not know the properties of the “whole” social network, we use surrogate networks having a peaked P(k) as the whole social network:
    1. Erdős-Rényi graph (ER)
    2. Regular Random graph (RR)
    3. Weighted Social Network with Link Deletion rule (WSN), Y. Murase et al., PLoS ONE (2015)
  15. Result: P(k)

    [Figure, left: P(k) versus k for α=0.8, β=-1 and α=0.8, β=1. Right: P1 versus β for α = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
    $$P_1 \equiv P(k=1) / \max_k \{ P(k) \}$$
    For β>0, there is no parameter region showing a decreasing P(k). A min-type sampling probability is the key to obtaining a decreasing P(k).
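A small simulation can illustrate how P1 responds to the sign of β. Everything specific below is an assumption rather than the setup used in the talk: the surrogate is a random graph with a fixed number of links, α in the legend is taken to be the Weibull exponent a, f0 = 0.2 and mean degree 50 are arbitrary choices, and isolated nodes are excluded from P(k) because they would not be observed in the sampled channel.

```python
import math
import random
from collections import Counter

def sample_affinity(rng, a, f0):
    """Inverse-CDF draw from a Weibull(a, f0) restricted to [0, 1]."""
    cdf_at_1 = 1.0 - math.exp(-((1.0 / f0) ** a))
    return min(1.0, f0 * (-math.log1p(-rng.random() * cdf_at_1)) ** (1.0 / a))

def generalized_mean(fi, fj, beta):
    """Power mean for finite, nonzero beta."""
    if beta < 0 and min(fi, fj) == 0.0:
        return 0.0
    return (0.5 * (fi**beta + fj**beta)) ** (1.0 / beta)

def simulate_p1(n_nodes, mean_degree, a, f0, beta, seed):
    """Sample links of a random graph with p_ij and return P(k=1)/max P(k)."""
    rng = random.Random(seed)
    affinity = [sample_affinity(rng, a, f0) for _ in range(n_nodes)]
    n_links = n_nodes * mean_degree // 2
    links = set()
    while len(links) < n_links:            # random graph with a fixed link count
        i, j = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if i != j:
            links.add((min(i, j), max(i, j)))
    degree = Counter()
    for i, j in links:
        if rng.random() < generalized_mean(affinity[i], affinity[j], beta):
            degree[i] += 1
            degree[j] += 1
    histogram = Counter(degree.values())   # only nodes with k >= 1 appear here
    return histogram[1] / max(histogram.values())

if __name__ == "__main__":
    for beta in (-1.0, 1.0):
        print("beta =", beta, "P1 =", simulate_p1(2000, 50, 0.8, 0.2, beta, seed=1))
```

A value of P1 close to 1 means that k = 1 is the most probable degree, i.e. the sampled P(k) is decreasing; a small P1 means the distribution is still peaked at some larger k.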
  16. Result: assortative mixing

    [Figure, left: knn(k) versus k for α = 0.8, β = -1 and α = 0.8, β = 1. Right: assortativity versus β for α = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
    Even if the surrogate network is non-assortative and f_i is assigned independently, the assortativity is biased by the sampling. → “sampling-induced assortativity”
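For readers who want to measure assortativity on a sampled link list, the sketch below computes the Pearson correlation between the degrees at the two ends of each link, which is one standard definition of the degree assortativity coefficient. The talk does not say which estimator was used, so treat this as an assumed choice; the function name is illustrative.

```python
from collections import Counter

def degree_assortativity(links):
    """Pearson correlation of the degrees at both ends of each link.

    Every undirected link (i, j) is counted in both directions so that
    the measure is symmetric. Returns NaN if all degrees are equal.
    """
    degree = Counter()
    for i, j in links:
        degree[i] += 1
        degree[j] += 1
    xs, ys = [], []
    for i, j in links:
        xs += [degree[i], degree[j]]
        ys += [degree[j], degree[i]]
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return float("nan")
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    return cov / var

star = [(0, 1), (0, 2), (0, 3)]            # hub connected to three leaves
print(degree_assortativity(star))
```

A star graph is the extreme disassortative case, since every link joins the hub to a leaf; positive values indicate that high-degree nodes tend to be linked to other high-degree nodes.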
  17. Why decreasing P(k) is found only for β<0

    Consider a node i with low affinity.
    • Min-type (p_ij = f_i): the sampling probability can be low irrespective of the affinity of the other nodes. ⇒ Low-degree nodes are common.
    • Max-type (p_ij = f_j): for node i to have a low degree, not only i but all the surrounding nodes must have low affinity.
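A short worked example makes the argument concrete. The numbers are hypothetical: a node with affinity 0.05 and 20 neighbors that all have affinity 0.5. The expected sampled degree is the sum of the link sampling probabilities over the neighbors.

```python
def expected_sampled_degree(f_i, neighbor_affinities, rule):
    """Expected number of surviving links of node i when p_ij = rule(f_i, f_j)."""
    return sum(rule(f_i, f_j) for f_j in neighbor_affinities)

low_affinity = 0.05                 # hypothetical low-affinity node
neighbors = [0.5] * 20              # hypothetical neighbors with moderate affinity

print("min-type:", expected_sampled_degree(low_affinity, neighbors, min))
print("max-type:", expected_sampled_degree(low_affinity, neighbors, max))
```

Under the min rule the node keeps about one link on average regardless of how keen its neighbors are, while under the max rule it keeps about ten, so a low degree would additionally require low-affinity neighbors.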
  18. Intuitive explanation of assortativity under min rule

    • Node i with low affinity (p_ij = f_i): all links are equally sampled.
    • Node i with high affinity (p_ij = f_j): links with high-affinity nodes are more likely to be sampled.
  19. Min-Model: A Special Case

    $$P(f) = \frac{1}{f_0} \exp(-f/f_0) \quad (a = 1), \qquad p_{ij} = \min\{f_i, f_j\} \quad (\beta = -\infty)$$
    [Figure: (a) P(k) versus k for f0 = 0.1, 0.2, 0.3, comparing analysis and simulation for RR and ER; second panel: knn(k) versus k]
  20. Results: Analysis of P(k)

    [Figure: (a) P(k) versus k for f0 = 0.1, 0.2, 0.3, comparing analysis and simulation for RR and ER; (b)]
    The analysis proceeds through the following quantities:
    • the probability of sampling a link involving the node i
    • the probability that the node i has exactly k_i links
    • the degree distribution of the network sampled from RR
    • the degree distribution of the network sampled from a network having P0(k)
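The chain of quantities listed above can be evaluated numerically for the special case of the min-model. The sketch below is a reconstruction under stated assumptions, not necessarily the expressions used in the talk: the truncation of P(f) at f = 1 is neglected (reasonable only for f0 well below 1), the RR degree k0 = 50 is an arbitrary choice, and neighbors' affinities are taken as independent. With P(f) = exp(−f/f0)/f0, a node with affinity f keeps each link with probability q(f) = E[min(f, f′)] = f0 (1 − exp(−f/f0)), and its sampled degree is binomial with parameters k0 and q(f).

```python
from math import comb

def min_model_degree_pmf(k0, f0, n_grid=2000):
    """P(k) of the network sampled from a k0-regular random graph, min rule.

    Uses the substitution u = exp(-f/f0): for exponentially distributed f,
    u is uniform on (0, 1), so the integral over f becomes an average over u
    (midpoint rule with n_grid points).
    """
    binom = [comb(k0, k) for k in range(k0 + 1)]
    pmf = [0.0] * (k0 + 1)
    for m in range(n_grid):
        u = (m + 0.5) / n_grid
        q = f0 * (1.0 - u)                 # link-sampling probability q(f)
        for k in range(k0 + 1):
            pmf[k] += binom[k] * q**k * (1.0 - q) ** (k0 - k) / n_grid
    return pmf

pmf = min_model_degree_pmf(k0=50, f0=0.1)
print("P(k) decreasing for k = 0..15:", all(pmf[k] > pmf[k + 1] for k in range(15)))
```

Under these assumptions q is uniformly distributed on (0, f0), so P(k) is an average of binomial distributions whose means range from 0 up to k0 f0, and the resulting profile decreases with k. For a surrogate network with a general degree distribution P0(k), the same binomial average would additionally be weighted by P0(k0) and summed over k0.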
  21. Results: Analysis of Assortativity

    Because it is hard to calculate k_nn exactly, we consider instead the correlation of affinities between neighboring nodes in the sampled networks, measured by f_nn(i), the average affinity of the neighbors of a node with affinity f_i.
    [Figure: f_nn(f) versus f for RR, ER, WSN, and the analysis]
    This is independent of the surrogate network. We can also rigorously prove that the sign of the derivative of f_nn is determined by the sign of β, for any P(f) and any surrogate network.
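The neighbor affinity can be evaluated numerically to see the dependence on the sign of β. The sketch assumes that affinities are independent of the surrogate network's structure, in which case the average affinity of sampled neighbors is f_nn(f) = ∫ f′ p(f, f′) P(f′) df′ / ∫ p(f, f′) P(f′) df′. This expression is a reconstruction for illustration, and a = 1, f0 = 0.2 are assumed values.

```python
import math

def generalized_mean(fi, fj, beta):
    """Power mean of two affinities, including the beta = 0 and +-inf limits."""
    if beta == math.inf:
        return max(fi, fj)
    if beta == -math.inf:
        return min(fi, fj)
    if beta == 0:
        return math.sqrt(fi * fj)
    if beta < 0 and min(fi, fj) == 0.0:
        return 0.0
    return (0.5 * (fi**beta + fj**beta)) ** (1.0 / beta)

def neighbor_affinity(f, beta, a=1.0, f0=0.2, n_grid=2000):
    """Average affinity of the sampled neighbors of a node with affinity f.

    Midpoint-rule integration over f' in (0, 1] with the unnormalized
    Weibull weight; the normalization cancels in the ratio.
    """
    numerator = denominator = 0.0
    for m in range(n_grid):
        x = (m + 0.5) / n_grid
        weight = (x / f0) ** (a - 1.0) * math.exp(-((x / f0) ** a))
        p = generalized_mean(f, x, beta)
        numerator += x * p * weight
        denominator += p * weight
    return numerator / denominator

for beta in (-math.inf, -1.0, 1.0, math.inf):
    values = [neighbor_affinity(f, beta) for f in (0.1, 0.5, 0.9)]
    print("beta =", beta, "f_nn at f = 0.1, 0.5, 0.9:", values)
```

Because the surrogate network does not enter the expression at all, the same curve applies to RR, ER, and WSN, and f_nn increases with f for negative β and decreases for positive β.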
  22. Take-home messages

    • We proposed a simple and plausible model for a sampling process to reproduce a decreasing P(k).
    • A decreasing P(k) is found only for the min-type sampling probability, i.e. β<0. For the min-type model, “sampling-induced assortativity” is inevitable.
    • We must be careful when we obtain data from one communication channel.