WSDM 2016 Study Group Materials

Slide 1

Slide 1 text

WSDM Study Group 2016
Information Evolution in Social Networks
L. A. Adamic, T. M. Lento, E. Adar, P. C. Ng
Yohei Kikuta [email protected] https://www.facebook.com/yohei.kikuta.3
2016/3/19

Slide 2

Slide 2 text

Paper overview

Information Evolution in Social Networks : Yohei Kikuta 2/8

Excerpt from the original paper (shown as a figure on the slide):

[...] in which memes occur in online environments, e.g. photographs or videos. However, the information in these is not as readily analyzed, and they cannot be as easily modified by anyone as spoken or written ideas can. We therefore focus our attention on textual status updates on Facebook to understand how information evolves when anyone can easily modify and retell it.

In order to generate a set of candidate memes, we identified status updates that had at least 100 exact copies. Nearly all such status updates contained replication instructions such as ‘copy’, ‘paste’, and ‘repost’. The few exceptions included updates generated automatically by Facebook applications and some ubiquitous memes: jokes and “wise” sayings. We therefore narrowed our scope to status updates containing replication terms such as ‘copy’, ‘paste’, etc., which included the vast majority of memes propagating via Facebook, but excluded ubiquitous text whose origin would be difficult to discern. Since the replication instructions we searched for were in English, the process captured primarily English language variants of memes.

Prior to clustering the variants into memes, we removed non-alphanumeric characters and converted the remainder to lowercase. Each distinct variant was shingled into overlapping 4-word-grams, creating a term frequency vector from the 4-grams. Sorting the meme variants by month, then by frequency, we created a new cluster, i.e. a meme, if the cosine similarity of the 4-gram vector was below 0.2 to all prior clusters. Otherwise, we added the status update to the cluster it matched most closely and adjusted the term-frequency vector of the matching cluster to incorporate the additional variant. We modified the term frequency vector of existing clusters, or created a new cluster, only if the variant frequency exceeded 100 within a month.
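The clustering step described in the excerpt can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `shingle`, `cosine`, and `cluster` are hypothetical, the normalization keeps only ASCII letters and digits, and the sorting by month and frequency and the "more than 100 copies within a month" condition are left out for brevity. Only the 4-word shingling, the cosine similarity, and the 0.2 threshold come from the excerpt.

```python
import math
import re
from collections import Counter


def shingle(text, n=4):
    """Lowercase, drop non-alphanumeric characters, count overlapping n-word-grams."""
    # Assumption: only ASCII letters/digits are kept; whitespace separates words.
    words = re.sub(r"[^a-z0-9\s]", "", text.lower()).split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(variants, new_cluster_below=0.2):
    """Greedy single pass: open a new cluster if no existing one is similar enough."""
    clusters = []  # each cluster is the summed 4-gram Counter of its members
    labels = []
    for text in variants:
        vec = shingle(text)
        sims = [cosine(vec, c) for c in clusters]
        if not sims or max(sims) < new_cluster_below:
            clusters.append(Counter(vec))
            labels.append(len(clusters) - 1)
        else:
            best = sims.index(max(sims))
            clusters[best].update(vec)  # fold the variant into the cluster vector
            labels.append(best)
    return labels


variants = [
    "No one should die because they cannot afford health care",
    "No one should die because they cannot afford health care, "
    "and no one should go broke because they get sick",
    "Copy and paste this status if you love your dog more than anything",
]
labels = cluster(variants)
```

In this toy input the first two status updates share seven 4-grams and fall into the same cluster, while the third shares none and opens a new one.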
In a post-processing step we aggregated clusters whose term vectors had converged to a unigram cosine similarity exceeding 0.4. We then gathered all variants for these 4,087 most significant memes by assigning status updates to them if their cosine similarity exceeded 0.05 using 4-grams and 0.1 using unigrams. The unigram threshold assured that an unre- [...]

[...] memes with a sufficient number of observations to yield accurate statistics. To estimate power-law exponents, which require observations over several orders of magnitude, we included memes with upwards of 100,000 variants. To estimate the required number of variants to generate accurate Gini coefficients, we simulated the Yule process and contrasted the asymptotic Gini coefficient for a meme that had evolved for a long time period, with the G during the early evolution of the meme. We found that the two values matched closely once the meme had grown to over 1,000 variants and so set this as the lower bound for the number of variants for the empirical measurements of G.

Figure 2: Approximate phylogenetic forest of the “no one should” meme. Each node is a variant, and each edge connects a variant [...]

The paper clarifies how feeds "evolve" as they are copied and pasted on Facebook (FB).
Note: analogy with genetics.
Figures are quoted from the original paper.

Figure: how a feed (meme) characterized by “No one should…” spreads over the network.

meme: cultural information that propagates among people → here, something like a topic expressed in text

node : variant
edge : parent-child relationship
color : difference in time

variant: a subspecies with a certain degree of similarity in the 4-gram vector built from the text

The original feed is the following; it is modified as it propagates:
“No one should die because they cannot afford health care and no one should go broke because they get sick”
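The paper excerpt on this slide refers to a Gini coefficient G without defining it. As a rough sketch, assuming G is the standard Gini coefficient taken over the number of copies of each variant within one meme (that reading is an assumption, and the function name `gini` is hypothetical), it could be computed like this:

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    0 means all variants are equally popular; values near 1 mean a few
    variants account for almost all copies.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on ascending values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


equal = gini([5, 5, 5, 5])          # every variant copied equally often
concentrated = gini([0, 0, 0, 10])  # one variant holds all copies
```

For n items the largest value this estimator can take is (n - 1) / n, which is why the concentrated example gives 0.75 rather than 1.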

Slide 5

Slide 5 text

Yule-Simon process

Figure quoted from https://www.ndsu.edu/pubweb/~novozhil/Teaching/484%20Data/9.pdf

Excerpt from the lecture notes (Math 484/684: Mathematical modeling of biological processes, by Artem Novozhilov, e-mail: [email protected], Spring 2014):

[...] (e.g., it is a fact that the exponential growth does describe the population increase when the supply is virtually unlimited, and the logistic equation can be used to estimate the population carrying capacity in some cases), but such applications should be performed with great care and understanding of a huge amount of simplifying assumptions put into these models.

Sometimes, however, our models should be able to describe an observed phenomenon not only at a qualitative level, but also quantitatively. In this lecture I plan to discuss one such phenomenon and possible mathematical models explaining it.

Let me start with a biological example. In biological classification species is the lowest possible rank, and species are grouped together in genera (or, singular, genus). The next taxonomic unit is a family. Therefore, we can talk about the distribution of the number of species in different genera in a given family. Such data can be collected and analyzed. An example of such data is given in the figure below (this example is borrowed from a wonderful paper by G. Yule [1]).

Here a distribution of the sizes of the genera is shown for the family Chrysomelidae. The distribution means that the number of genera with one species was counted, with two species, with three species, and so on, and after it these numbers were plotted against species numbers (to be more precise, for genera with more than 9 species actually intervals of number of species were considered, i.e., the [...]

[1] Yule, G.U. (1925). A mathematical theory of evolution, based on the conclusions of Dr. J.C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character, 213(402-410), 21-87.
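Explains the relationship between the number of "genera" and the number of "species" in biology.
A stochastic process in which species multiply while mutating reproduces a power-law population.

Figure axes: number of genera having a given number of species; number of species.

Probability distribution of genera having n species (ρ is a parameter):

\[ p(n) = \rho \, \frac{\Gamma(n)\,\Gamma(1+\rho)}{\Gamma(n+1+\rho)} \]

This is used to compare against and validate the distribution of memes having n variants.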
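To make the distribution on this slide concrete, here is a small sketch that evaluates p(n) numerically. The function name `yule_simon_pmf` is hypothetical, and the computation goes through log-gamma only to avoid overflow for large n; nothing here is taken from the paper's own code.

```python
import math


def yule_simon_pmf(n, rho):
    """p(n) = rho * Gamma(n) * Gamma(1 + rho) / Gamma(n + 1 + rho), n = 1, 2, ..."""
    log_p = (math.log(rho) + math.lgamma(n) + math.lgamma(1.0 + rho)
             - math.lgamma(n + 1.0 + rho))
    return math.exp(log_p)


rho = 1.0
# For rho = 1 the formula reduces to p(n) = 1 / (n * (n + 1)).
first = yule_simon_pmf(1, rho)
partial_sum = sum(yule_simon_pmf(n, rho) for n in range(1, 10001))
# For large n the tail behaves like n ** -(1 + rho): doubling n divides p(n) by about 2 ** (1 + rho).
tail_ratio = yule_simon_pmf(2000, rho) / yule_simon_pmf(1000, rho)
```

The tail ratio illustrates the power-law behaviour that the slide refers to: with ρ = 1, doubling n reduces the probability by roughly a factor of four.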
