Slide 1

Slide 1 text

Introduction and Overview APAM E4990 Modeling Social Data Jake Hofman Columbia University January 25, 2019 Jake Hofman (Columbia University) Introduction and Overview January 25, 2019 1 / 58

Slide 2

Slide 2 text

Course overview
Modeling social data requires an understanding of:
1. How to obtain data produced by (online) human interactions
2. What questions we typically ask about human-generated data
3. How to make these questions precise and quantitative
4. How to interpret and communicate results

Slide 3

Slide 3 text

Questions
Many long-standing questions in the social sciences are notoriously difficult to answer, e.g.:
• “Who says what to whom in what channel with what effect?” (Lasswell, 1948)
• How do ideas and technology spread through cultures? (Rogers, 1962)
• How do new forms of communication affect society? (Singer, 1970)
• ...

Slide 4

Slide 4 text

Questions
Typically difficult to observe the relevant information via conventional methods
[Figure: Moreno, 1933]

Slide 5

Slide 5 text

Large-scale data
Recently available electronic data provide an unprecedented opportunity to address these questions at scale
[Figure panels: Demographic, Behavioral, Network]

Slide 6

Slide 6 text

Computational social science
An emerging discipline at the intersection of the social sciences, statistics, and computer science

Slide 7

Slide 7 text

Computational social science
An emerging discipline at the intersection of the social sciences, statistics, and computer science
(motivating questions)

Slide 8

Slide 8 text

Computational social science
An emerging discipline at the intersection of the social sciences, statistics, and computer science
(fitting large, potentially sparse models)

Slide 9

Slide 9 text

Computational social science
An emerging discipline at the intersection of the social sciences, statistics, and computer science
(parallel processing for filtering and aggregating data)

Slide 10

Slide 10 text

Topics
• Exploratory Data Analysis
• Classification
• Regression
• Networks

Slide 11

Slide 11 text

Exploratory Data Analysis
(a.k.a. counting and plotting things)

Slide 12

Slide 12 text

Regression
(a.k.a. modeling continuous things)

Slide 13

Slide 13 text

Classification
(a.k.a. modeling discrete things)

Slide 14

Slide 14 text

Networks
(a.k.a. counting complicated things)

Slide 15

Slide 15 text

Prediction and explanation
Important to view prediction and explanation as complements, not substitutes
Computer science (predict $\hat{y}$) and, rather than vs., social science (explain $\hat{\beta}$)
Otherwise it can be difficult to make long-term progress in advancing social science

Slide 16

Slide 16 text

The clean real story
“We have a habit in writing articles published in scientific journals to make the work as finished as possible, to cover all the tracks, to not worry about the blind alleys or to describe how you had the wrong idea first, and so on. So there isn’t any place to publish, in a dignified manner, what you actually did in order to get to do the work ...”
- Richard Feynman, Nobel Lecture [1], 1965
[1] http://bit.ly/feynmannobel

Slide 17

Slide 17 text

Case studies
• Web demographics [Figure: daily per-capita pageviews by income, race, education, age, and sex]
• Search predictions [Figure: Billboard rank and search rank by week for "Right Round", Mar-09 to Aug-09]
• Viral hits

Slide 18

Slide 18 text

Predicting consumer activity with Web search
with Sharad Goel, Sébastien Lahaie, David Pennock, Duncan Watts
[Figure: Billboard rank and search rank by week for "Right Round", Mar-09 to Aug-09]

Slide 19

Slide 19 text

Search predictions
Motivation
Does collective search activity provide useful predictive signal about real-world outcomes?
[Figure: Billboard rank and search rank by week for "Right Round", Mar-09 to Aug-09]

Slide 20

Slide 20 text

Search predictions
Motivation
Past work mainly focuses on predicting the present [2] and ignores baseline models trained on publicly available data
[Figure: flu level (percent) by date, 2004 to 2010: actual, search model, and autoregressive model]
[2] Varian, 2009

Slide 21

Slide 21 text

Search predictions
Motivation
We predict future sales for movies, video games, and music
[Figure: (a) search volume vs. time to release (days, −30 to 30) for "Transformers 2"; (b) the same for "Tom Clancy's HAWX"; (c) Billboard rank and search rank by week for "Right Round", Mar-09 to Aug-09]

Slide 22

Slide 22 text

Search predictions
Search models
For movies and video games, predict opening weekend box office and first month sales, respectively:
$$\log(\text{revenue}) = \beta_0 + \beta_1 \log(\text{search}) + \epsilon$$
For music, predict following week’s Billboard Hot 100 rank:
$$\text{billboard}_{t+1} = \beta_0 + \beta_1 \text{search}_t + \beta_2 \text{search}_{t-1} + \epsilon$$
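As a rough illustration of the log-log search model for movies and video games, the following minimal sketch fits $\beta_0$ and $\beta_1$ by ordinary least squares on log-transformed values. The function names and the toy data are hypothetical and are not taken from the study; the fitting procedure actually used is not specified on the slide.

```python
import math


def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize squared error for y ~ b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_log_log(search, revenue):
    """Fit log(revenue) = beta0 + beta1 * log(search) by least squares."""
    log_search = [math.log(s) for s in search]
    log_revenue = [math.log(r) for r in revenue]
    return fit_simple_ols(log_search, log_revenue)


def predict_revenue(beta0, beta1, search):
    """Map a search volume back to the revenue scale."""
    return math.exp(beta0 + beta1 * math.log(search))


# Hypothetical toy data generated exactly as revenue = 50 * search^1.2
toy_search = [100.0, 1000.0, 10000.0, 100000.0]
toy_revenue = [50.0 * s ** 1.2 for s in toy_search]
beta0, beta1 = fit_log_log(toy_search, toy_revenue)
```

Because both sides are logged, $\beta_1$ reads as an elasticity: a 1% increase in search volume corresponds to roughly a $\beta_1$% change in predicted revenue.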

Slide 23

Slide 23 text

Search predictions
Search volume

Slide 24

Slide 24 text

Search predictions
Search models
Search activity is predictive for movies, video games, and music weeks to months in advance
[Figure: panels (a) Movies, (b) Video Games (non-sequel vs. sequel), and (c) Music show predicted vs. actual revenue in dollars (a, b) or predicted vs. actual Billboard rank (c); panels (d) to (f) show model fit vs. time to release (weeks, −6 to 0) for movies, video games, and music]

Slide 25

Slide 25 text

Search predictions
Baseline models
For movies, use budget, number of opening screens, and Hollywood Stock Exchange:
$$\log(\text{revenue}) = \beta_0 + \beta_1 \log(\text{budget}) + \beta_2 \log(\text{screens}) + \beta_3 \log(\text{hsx}) + \epsilon$$

Slide 26

Slide 26 text

Search predictions
Baseline models
For video games, use critic ratings and predecessor sales (sequels only):
$$\log(\text{revenue}) = \beta_0 + \beta_1 \text{rating} + \beta_2 \log(\text{predecessor}) + \epsilon$$

Slide 27

Slide 27 text

Search predictions
Baseline models
For music, use an autoregressive model with the previously available rank:
$$\text{billboard}_{t+1} = \beta_0 + \beta_1 \text{billboard}_{t-1} + \epsilon$$
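The autoregressive music baseline can be sketched as a regression of each week's rank on the rank from two weeks earlier, which matches the $t+1$ and $t-1$ indices in the formula. The helper names and the toy rank series below are hypothetical, and least squares is assumed here as the fitting method.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y ~ b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_lagged_baseline(ranks, gap=2):
    """Regress rank at week t+1 on rank at week t-1 (a gap of two weeks)."""
    predictors = ranks[:-gap]  # billboard_{t-1}
    targets = ranks[gap:]      # billboard_{t+1}
    return fit_line(predictors, targets)


# Hypothetical toy series built so that rank_{t+1} = 5 + 0.8 * rank_{t-1}
toy_ranks = [40.0, 38.0]
for _ in range(10):
    toy_ranks.append(5.0 + 0.8 * toy_ranks[-2])

ar_beta0, ar_beta1 = fit_lagged_baseline(toy_ranks)
```

A baseline like this uses only publicly available history, which is what makes it a fair point of comparison for the search-based model.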

Slide 28

Slide 28 text

Search predictions
Baseline + combined models
Baseline models are often surprisingly good
[Figure: predicted vs. actual revenue in dollars for movies and video games (non-sequel vs. sequel) and predicted vs. actual Billboard rank for music; panels (a) to (c) show the baseline models and panels (d) to (f) show the combined models]

Slide 29

Slide 29 text

Search predictions
Model comparison
For movies, search is outperformed by the baseline and of little marginal value
[Figure: model fit (0.4 to 1.0) of the baseline, search, and combined models for non-sequel games, sequel games, music, movies, and flu]

Slide 30

Slide 30 text

Search predictions
Model comparison
For video games, search helps substantially for non-sequels, less so for sequels
[Figure: model fit (0.4 to 1.0) of the baseline, search, and combined models for non-sequel games, sequel games, music, movies, and flu]

Slide 31

Slide 31 text

Search predictions
Model comparison
For music, the addition of search yields a substantially better combined model
[Figure: model fit (0.4 to 1.0) of the baseline, search, and combined models for non-sequel games, sequel games, music, movies, and flu]

Slide 32

Slide 32 text

Search predictions
Summary
• Relative performance and value of search varies across domains
• Search provides a fast, convenient, and flexible signal across domains
• “Predicting consumer activity with Web search”, Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010

Slide 33

Slide 33 text

P.S.
“The Parable of Google Flu: Traps in Big Data Analysis” (Policy Forum), David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani
Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data.
Excerpts:
“In February 2013, Google Flu Trends (GFT) made headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error? The problems we identify are not limited to GFT. Research on whether search or social media can predict x has become commonplace (5–7) and is often put in sharp contrast with traditional methods and hypotheses. ...”
“... measurement and construct validity and reliability and dependencies among data (12). ...”
“... the algorithm in 2009, and this model has run ever since, with a few changes announced in October 2013 (10, 15). Although not widely reported until 2013, the new GFT has been persistently overestimating flu prevalence for a much longer time. GFT also missed by a very large margin in the 2011–2012 flu season and has missed high for 100 out of 108 weeks starting with August 2011 (see the graph). These errors are not randomly distributed. For example, last week’s errors predict this week’s errors (temporal autocorrelation), and the direction and magnitude of error varies with the time of year (seasonality). These patterns mean that GFT overlooks considerable information that could be extracted by traditional statistical methods. Even after GFT was updated in 2009, the comparative value of the algorithm as a ...”
[Figure: % ILI from 07/01/09 to 07/01/13 for Google Flu, Lagged CDC, Google Flu + CDC, and CDC, with a second panel in % of baseline; annotations: “Google estimates more than double CDC estimates” and “Google starts estimating high 100 out of 108 weeks”]

Slide 34

Slide 34 text

Demographic diversity on the Web
with Irmak Sirer and Sharad Goel (ICWSM 2012)
[Figure: daily per-capita pageviews by income (over/under $25k), race (Black & Hispanic, White), education (no college, some college), age (over/under 65), and sex (female, male)]

Slide 35

Slide 35 text

Motivation
Previous work is largely survey-based and focuses on group-level differences in online access

Slide 36

Slide 36 text

Motivation
“As of January 1997, we estimate that 5.2 million African Americans and 40.8 million whites have ever used the Web, and that 1.4 million African Americans and 20.3 million whites used the Web in the past week.”
- Hoffman & Novak (1998)

Slide 37

Slide 37 text

Motivation
Focus on activity instead of access
How diverse is the Web?
To what extent do online experiences vary across demographic groups?

Slide 38

Slide 38 text

Data
• Representative sample of 265,000 individuals in the US, paid via the Nielsen MegaPanel [3]
• Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information (age, education, income, race, sex, etc.)
[3] Special thanks to Mainak Mazumdar

Slide 39

Slide 39 text

Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar

Slide 40

Slide 40 text

Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www, e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com

Slide 41

Slide 41 text

Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www, e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)

Slide 42

Slide 42 text

Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www, e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)
• Aggregate activity at the site, group, and user levels
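The normalization step can be sketched as follows. This is one literal reading of the rule on the slide (drop a leading www, then keep at most the last three host labels); the function name and the sample log are hypothetical, and the original pipeline may have handled edge cases such as country-code domains differently.

```python
from collections import Counter
from urllib.parse import urlsplit


def normalize_site(url, max_levels=3):
    """Reduce a URL to at most `max_levels` domain levels, without a leading www."""
    # urlsplit only recognizes the host when a scheme or "//" prefix is present
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ".".join(labels[-max_levels:])


# Hypothetical pageview log, aggregated at the site level
toy_pageviews = [
    "www.yahoo.com",
    "us.mg2.mail.yahoo.com/neo/launch",
    "http://www.yahoo.com/news",
    "mail.yahoo.com/inbox",
]
site_counts = Counter(normalize_site(u) for u in toy_pageviews)
```

Counting normalized sites per user or per demographic group instead of overall would give the user-level and group-level aggregates mentioned in the last bullet.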

Slide 43

Slide 43 text

Aggregate usage patterns
How do users distribute their time across different categories?
[Figure: fraction of total pageviews by category (Social Media, E-mail, Games, Portals, Search)]
All groups spend the majority of their time in the top five most popular categories

Slide 44

Slide 44 text

Aggregate usage patterns
How do users distribute their time across different categories?
[Figure: fraction of pageviews in each category (Social Media, E-mail, Games, Portals, Search) vs. user rank by daily activity (10% to 90%)]
Highly active users devote nearly twice as much of their time to social media relative to typical individuals

Slide 45

Slide 45 text

Group-level activity
How does browsing activity vary at the group level?
[Figure: daily per-capita pageviews by income (over/under $25k), race (Black & Hispanic, White), education (no college, some college), age (over/under 65), and sex (female, male)]
Large differences exist even at the aggregate level (e.g. women on average generate 40% more pageviews than men)

Slide 46

Slide 46 text

Group-level activity
How does browsing activity vary at the group level?
[Figure: daily per-capita pageviews by income (over/under $25k), race (Black & Hispanic, White), education (no college, some college), age (over/under 65), and sex (female, male)]
Younger and more educated individuals are both more likely to access the Web and more active once they do

Slide 47

Slide 47 text

Group-level activity
All demographic groups spend the majority of their time in the same categories
[Figure: fraction of total pageviews in each category (Social Media, E-mail, Games, Portals, Search) by age (5 to 80), education (grammar school through post-graduate degree), sex, income ($0-25k through $150k+), and race (Other, Hispanic, Black, White, Asian)]

Slide 48

Slide 48 text

Group-level activity
Older, more educated, male, wealthier, and Asian Internet users spend a smaller fraction of their time on social media
[Figure: fraction of total pageviews in each category (Social Media, E-mail, Games, Portals, Search) by age (5 to 80), education (grammar school through post-graduate degree), sex, income ($0-25k through $150k+), and race (Other, Hispanic, Black, White, Asian)]

Slide 49

Slide 49 text

Group-level activity
Lower social media use by these groups is often accompanied by higher e-mail volume
[Figure: fraction of total pageviews in each category (Social Media, E-mail, Games, Portals, Search) by age (5 to 80), education (grammar school through post-graduate degree), sex, income ($0-25k through $150k+), and race (Other, Hispanic, Black, White, Asian)]

Slide 50

Slide 50 text

Revisiting the digital divide
How does usage of news, health, and reference vary with demographics?
[Figure: average pageviews per month on news, health, and reference sites by education (grammar school through post-graduate degree), sex, income ($0-25k through $150k+), and race (Other, Hispanic, Black, White, Asian)]
Post-graduates spend three times as much time on health sites as adults with only some high school education

Slide 51

Slide 51 text

Revisiting the digital divide
How does usage of news, health, and reference vary with demographics?
[Figure: average pageviews per month on news, health, and reference sites by education (grammar school through post-graduate degree), sex, income ($0-25k through $150k+), and race (Other, Hispanic, Black, White, Asian)]
Asians spend more than 50% more time browsing online news than do other race groups

Slide 52

Slide 52 text

Revisiting the digital divide
How does usage of news, health, and reference vary with demographics?
[Figure: average pageviews per month on news, health, and reference sites by education (grammar school through post-graduate degree), sex, income ($0-25k through $150k+), and race (Other, Hispanic, Black, White, Asian)]
Even when less educated and less wealthy groups gain access to the Web, they utilize these resources relatively infrequently

Slide 53

Slide 53 text

Revisiting the digital divide
How does usage of news, health, and reference vary with demographics?
[Figure: average pageviews per month (0 to 12) on news, health, and reference sites, by education level (high school graduate through post-graduate degree) and race (Asian, Black, Hispanic, White)]
Controlling for other variables, effects of race and gender largely disappear, while education continues to have a large effect
$$p_i = \sum_j \alpha_j x_{ij} + \sum_j \sum_k \beta_{jk} x_{ij} x_{ik} + \sum_j \gamma_j x_{ij}^2 + \epsilon_i$$

Slide 54

Slide 54 text

Revisiting the digital divide
How does usage of news, health, and reference vary with demographics?
[Figure: average pageviews per month (0 to 12) on health sites by education level, female vs. male]
However, women spend considerably more time on health sites compared to men

Slide 55

Slide 55 text

Revisiting the digital divide
How does usage of news, health, and reference vary with demographics?
[Figure: monthly pageviews on health sites (20 to 100), female vs. male]
However, women spend considerably more time on health sites compared to men, although means can be misleading

Slide 56

Slide 56 text

Individual-level prediction
How well can one predict an individual’s demographics from their browsing activity?
• Represent each user by the set of sites visited
• Fit linear models [4] to predict majority/minority for each attribute on 80% of users
• Tune model parameters using a 10% validation set
• Evaluate final performance on held-out 10% test set
[4] http://bit.ly/svmperf
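
The representation and the 80/10/10 split described above could be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the binary site-indicator encoding, and the seeded shuffle are all hypothetical choices made here, and the model fitting itself (the slide points to SVMperf) is not shown.

```python
import random


def featurize(sites_visited, vocabulary):
    """Encode a user's set of visited sites as 0/1 indicators over a fixed site list."""
    return [1 if site in sites_visited else 0 for site in vocabulary]


def split_users(user_ids, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle users and split them into train, validation, and test lists.

    The test set receives whatever remains after the first two splits.
    """
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = round(len(ids) * train_frac)
    n_valid = round(len(ids) * valid_frac)
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test
```

Splitting by user rather than by pageview keeps all of one person's activity in a single partition, so the test set measures performance on people the model has never seen.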

Slide 57

Slide 57 text

Individual-level prediction
Reasonable (∼70-85%) accuracy and AUC across all attributes
[Figure: accuracy and AUC (0.5 to 1) for each classification task: College/No College, Under/Over $50,000 Household Income, White/Non-White, Female/Male, Over/Under 25 Years Old]

Slide 58

Slide 58 text

Individual-level prediction
Highly-weighted sites under the fitted models

Attribute | Large positive weight | Large negative weight
Female | winster.com, lancome-usa.com | sports.yahoo.com, espn.go.com
White | marlboro.com, cmt.com | mediatakeout.com, bet.com
College Educated | news.yahoo.com, linkedin.com | youtube.com, myspace.com
Over 25 Years Old | evite.com, classmates.com | addictinggames.com, youtube.com
Household Income Under $50,000 | eharmony.com, tracfone.com | rownine.com, matrixdirect.com

Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task.

[Figure: AUC and accuracy (0.5 to 1) for each classification task]

[...] Figure 7, a measure that effectively re-normalizes the majority and minority classes to have equal size. Intuitively, AUC is the probability that a model scores a randomly selected positive example higher than a randomly selected negative one (e.g., the probability that the model correctly distinguishes between a randomly selected female and male). Though an uninformative rule would correctly discriminate between such pairs 50% of the time, predictions based on [...]
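
The pairwise interpretation of AUC quoted above can be checked directly on small examples. The sketch below compares every positive/negative pair and counts how often the positive example gets the higher score; the function name is hypothetical, and counting ties as half a win is a common convention assumed here rather than something the slide states.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    labels are 1 for positive and 0 for negative; ties count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

This brute-force version is quadratic in the number of examples, so it is meant for building intuition rather than for scoring large test sets. A model that gives every user the same score gets 0.5, matching the uninformative baseline mentioned in the excerpt.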

Slide 59

Slide 59 text

Individual-level prediction
Proof of concept browser demo
http://bit.ly/surfpreds (deprecated)

Slide 60

Slide 60 text

Summary
• Highly active users spend disproportionately more of their time on social media and less on e-mail relative to the overall population
• Access to research, news, and healthcare is strongly related to education, not as closely to ethnicity
• User demographics can be inferred from browsing activity with reasonable accuracy
• “Who Does What on the Web”, Goel, Hofman & Sirer, ICWSM 2012

Slide 61

Slide 61 text

The structural virality of online diffusion
with Ashton Anderson, Sharad Goel, Duncan Watts (Management Science 2015)

Slide 62

Slide 62 text

“Going Viral”?

Slide 63

Slide 63 text

“Going Viral”?

Slide 64

Slide 64 text

“Going Viral”?
“Therefore we ... wish to proceed with great care as is proper, and to cut off the advance of this plague and cancerous disease so it will not spread any further ...” [5]
Pope Leo X, Exsurge Domine (1520)
[5] http://www.economist.com/node/21541719

Slide 65

Slide 65 text

“Going Viral”?
Rogers (1962), Bass (1969)

Slide 66

Slide 66 text

“Going viral”?

Slide 67

Slide 67 text

“Going viral”?

Slide 68

Slide 68 text

“Going viral”?
How do popular things become popular?

Slide 69

Slide 69 text

Data
• Examined one year of tweets from July 2011 to July 2012

Slide 70

Slide 70 text

Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news, videos, images, and petitions sites

Slide 71

Slide 71 text

Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news, videos, images, and petitions sites
• Aggregated tweets by URL, resulting in 1 billion distinct “events”

Slide 72

Slide 72 text

Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news, videos, images, and petitions sites
• Aggregated tweets by URL, resulting in 1 billion distinct “events”
• Crawled friend list of each adopter

Slide 73

Slide 73 text

Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news, videos, images, and petitions sites
• Aggregated tweets by URL, resulting in 1 billion distinct “events”
• Crawled friend list of each adopter
• Inferred “who got what from whom” to construct diffusion trees

Slide 74

Slide 74 text

Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news, videos, images, and petitions sites
• Aggregated tweets by URL, resulting in 1 billion distinct “events”
• Crawled friend list of each adopter
• Inferred “who got what from whom” to construct diffusion trees
• Characterized size and structure of trees
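
One way the "who got what from whom" step might work for a single URL is sketched below. The attribution rule is an assumption made for illustration: each adopter's parent is taken to be the friend who most recently posted the same URL before them, and adopters with no earlier-adopting friend become roots. The slides do not spell out the rule, and all names here are hypothetical.

```python
def infer_parents(adoptions, friends):
    """Map each adopter of one URL to an inferred parent (or None for roots).

    adoptions: list of (user, time) pairs for a single URL.
    friends: dict mapping a user to the set of accounts that user follows.
    """
    adopted_at = {}  # first adoption time per user
    parents = {}
    for user, time in sorted(adoptions, key=lambda a: a[1]):
        if user in adopted_at:
            continue  # keep only each user's first adoption
        # Friends who posted the URL strictly earlier than this user.
        candidates = [
            f for f in friends.get(user, ())
            if f in adopted_at and adopted_at[f] < time
        ]
        if candidates:
            # Assumed rule: attribute to the most recent earlier adopter.
            parents[user] = max(candidates, key=lambda f: adopted_at[f])
        else:
            parents[user] = None
        adopted_at[user] = time
    return parents
```

The result is a forest: every user with parent `None` is the root of its own tree, so one URL can produce several independent cascades.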

Slide 75

Slide 75 text

The Structural Virality of Online Diffusion
[Figure: example diffusion tree over nodes A through E, laid out along a time axis]

Slide 76

Slide 76 text

Information diffusion
Cascade size distribution
[Figure: CCDF of cascade size (1 to 10,000), log-log axes]
Focus on the rare hits that get at least 100 adoptions
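
For readers unfamiliar with the plot type, a complementary cumulative distribution function (CCDF) gives, for each size, the fraction of cascades at least that large. The sketch below computes it from a list of cascade sizes; the function name is hypothetical, and the "at least" (rather than "strictly greater than") convention is an assumption chosen to match the caption's wording.

```python
from collections import Counter


def ccdf(sizes):
    """Return sorted (size, fraction of cascades with size >= that size) pairs."""
    if not sizes:
        raise ValueError("need at least one cascade")
    total = len(sizes)
    counts = Counter(sizes)
    remaining = total  # cascades at least as large as the current size
    points = []
    for size in sorted(counts):
        points.append((size, remaining / total))
        remaining -= counts[size]
    return points
```

Plotting these points on log-log axes makes the heavy tail visible: most of the mass sits at the smallest sizes, and large cascades appear only at tiny fractions.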

Slide 77

Slide 77 text

Quantifying structure
Measure the average distance between all pairs of nodes [6]
[6] Wiener (1947); correlated with other possible metrics
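
The average pairwise distance can be computed for a small tree with one breadth-first search per node, as in the sketch below. It assumes the cascade is a single connected tree given as a child-to-parent mapping with `None` for the root; the function name and input format are hypothetical, and this brute-force approach is only practical for small trees.

```python
from collections import deque


def structural_virality(parents):
    """Average shortest-path distance over all pairs of distinct nodes in a tree.

    parents maps each node to its parent, with None for the root.
    Assumes the nodes form one connected tree.
    """
    # Build an undirected adjacency list from the parent pointers.
    adjacency = {node: set() for node in parents}
    for child, parent in parents.items():
        if parent is not None:
            adjacency.setdefault(parent, set()).add(child)
            adjacency[child].add(parent)
    n = len(adjacency)
    if n < 2:
        raise ValueError("need at least two nodes")
    total = 0
    for source in adjacency:
        # Breadth-first search gives distances from source to every node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))  # each ordered pair is counted once
```

A broadcast (one root with four direct children) scores lower than a five-node chain of the same size, which is the distinction the measure is meant to capture.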

Slide 78

Slide 78 text

Information diffusion
Size and virality by category
Remarkable structural diversity across categories
[Figure: CCDFs of cascade size (100 to 10,000) and structural virality (3 to 30) for videos, pictures, news, and petitions]

Slide 79

Slide 79 text

Information diffusion
Structural diversity
[Figure: six example cascades, each plotted as size against time]

Slide 80

Slide 80 text

Information diffusion
Structural diversity
Size is a relatively poor predictor of structure

Slide 81

Slide 81 text

Summary
Popular ≠ Viral

Slide 82

Slide 82 text

Information diffusion
Summary
• Most cascades fail, resulting in fewer than two adoptions, on average
• Of the hits that do succeed, we observe a wide range of diverse diffusion structures
• It’s difficult to say how something spread given only its popularity
• “The structural virality of online diffusion”, Anderson, Goel, Hofman & Watts (Management Science 2015)

Slide 83

Slide 83 text

1. Ask good questions
There’s nothing interesting in the data without them

Slide 84

Slide 84 text

2. Think before you code
5 minutes at the whiteboard is worth an hour at the keyboard

Slide 85

Slide 85 text

3. Keep the answers simple
Exploratory data analysis and linear models go a long way

Slide 86

Slide 86 text

4. Replication is key
Otherwise it’s easy to get fooled by randomness and difficult to assess progress