1. How to obtain data produced by (online) human interactions
2. What questions we typically ask about human-generated data
3. How to make these questions precise and quantitative
4. How to interpret and communicate results
Jake Hofman (Columbia University) Introduction and Overview January 25, 2019
difficult to answer, e.g.:
• “Who says what to whom in what channel with what effect?” (Lasswell, 1948)
• How do ideas and technology spread through cultures? (Rogers, 1962)
• How do new forms of communication affect society? (Singer, 1970)
• ...
the social sciences (motivating questions), statistics (fitting large, potentially sparse models), and computer science (parallel processing for filtering and aggregating data)
complements, not substitutes
Computer science (predict, $\hat{y}$) and, rather than vs., social science (explain, $\hat{\beta}$)
Otherwise it can be difficult to make long-term progress in advancing social science
articles published in scientific journals to make the work as finished as possible, to cover all the tracks, to not worry about the blind alleys or to describe how you had the wrong idea first, and so on. So there isn’t any place to publish, in a dignified manner, what you actually did in order to get to do the work ...”
-Richard Feynman, Nobel Lecture¹, 1965
¹ http://bit.ly/feynmannobel
[Figure: comparison across Income (Over/Under $25k), Race (Black & Hispanic/White), Education (No College/Some College), Age (Over/Under 65), and Sex (Female/Male)]
Search predictions
[Figure: weekly rank of "Right Round", Mar−09 to Aug−09, Billboard vs. Search]
Viral hits
ebastien Lahaie, David Pennock, Duncan Watts
[Figure: weekly rank of "Right Round", Mar−09 to Aug−09, Billboard vs. Search]
signal about real-world outcomes?
[Figure: weekly rank of "Right Round", Mar−09 to Aug−09, Billboard vs. Search]
games, and music
[Figure: (a) search volume for "Transformers 2" vs. time to release (days, −30 to 30); (b) search volume for "Tom Clancy's HAWX" vs. time to release (days, −30 to 30); (c) weekly rank of "Right Round", Mar−09 to Aug−09, Billboard vs. Search]
with the previously available rank:
$\text{billboard}_{t+1} = \beta_0 + \beta_1 \, \text{billboard}_{t-1} + \epsilon$
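The baseline above regresses a future rank on the most recently available one. A minimal sketch of that fit with ordinary least squares in NumPy follows; the function name `fit_lagged_baseline` and the weekly rank series are illustrative assumptions, not data or code from the study.

```python
import numpy as np

def fit_lagged_baseline(ranks, gap=2):
    """Fit rank[t+1] = b0 + b1 * rank[t-1] + noise by least squares.

    `gap` is the number of weeks between predictor and target
    (2 for the t-1 -> t+1 model on the slide).
    """
    ranks = np.asarray(ranks, dtype=float)
    x, y = ranks[:-gap], ranks[gap:]            # predictor and target, aligned
    X = np.column_stack([np.ones_like(x), x])   # intercept column plus lagged rank
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r2 = 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
    return float(coef[0]), float(coef[1]), float(r2)

# Hypothetical weekly chart positions for one song (made-up numbers).
weekly_ranks = [38, 30, 22, 15, 11, 9, 8, 10, 13, 18, 24, 31]
b0, b1, r2 = fit_lagged_baseline(weekly_ranks)
print(f"beta0={b0:.2f}, beta1={b1:.2f}, in-sample R^2={r2:.2f}")
```

The in-sample R² returned here is only one possible notion of model fit; a search-based or combined model would add further columns to `X`.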
the baseline and of little marginal value
[Figure: model fit (0.4 to 1.0) of the Baseline, Search, and Combined models for Nonsequel Games, Sequel Games, Music, Movies, and Flu]
varies across domains
• Search provides a fast, convenient, and flexible signal across domains
• “Predicting consumer activity with Web search”, Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010
The Parable of Google Flu: Traps in Big Data Analysis
David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani
Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data.

headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error?

The problems we identify are not limited to GFT. Research on whether search or social media can predict x has become commonplace (5–7) and is often put in sharp contrast with traditional methods and hypotheses.

surement and construct validity and reliability and dependencies among data (12).

the algorithm in 2009, and this model has run ever since, with a few changes announced in October 2013 (10, 15). Although not widely reported until 2013, the new GFT has been persistently overestimating flu prevalence for a much longer time. GFT also missed by a very large margin in the 2011–2012 flu season and has missed high for 100 out of 108 weeks starting with August 2011 (see the graph). These errors are not randomly distributed. For example, last week’s errors predict this week’s errors (temporal autocorrelation), and the direction and magnitude of error varies with the time of year (seasonality). These patterns mean that GFT overlooks considerable information that could be extracted by traditional statistical methods.

Even after GFT was updated in 2009, the comparative value of the algorithm as a
[Figure: % ILI and error (% baseline), 07/01/09 to 07/01/13, for Google Flu, Lagged CDC, Google Flu + CDC, and CDC. Annotations: "Google estimates more than double CDC estimates"; "Google starts estimating high 100 out of 108 weeks"]
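The excerpt's point that "last week's errors predict this week's errors" can be checked with a lag-1 autocorrelation of the forecast errors, alongside the share of weeks that missed high. The sketch below assumes a plain list of weekly errors (estimate minus CDC value); the numbers are made up for illustration and are not GFT data.

```python
import numpy as np

def lag1_autocorrelation(errors):
    """Correlation between each week's error and the previous week's error."""
    e = np.asarray(errors, dtype=float)
    e = e - e.mean()                      # center the series
    return float(np.dot(e[:-1], e[1:]) / np.dot(e, e))

def share_overestimated(errors):
    """Fraction of weeks in which the estimate was too high."""
    return float(np.mean(np.asarray(errors) > 0))

# Hypothetical weekly errors in percentage points of ILI.
weekly_errors = [0.4, 0.6, 0.9, 1.1, 0.8, 0.5, 0.3, -0.1, 0.2, 0.7]
print(lag1_autocorrelation(weekly_errors), share_overestimated(weekly_errors))
```

Errors that were pure noise would give an autocorrelation near zero; a clearly positive value suggests a simple correction based on last week's error would help.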
Goel (ICWSM 2012)
[Figure: daily per-capita pageviews (0 to 70) by Income (Over/Under $25k), Race (Black & Hispanic/White), Education (No College/Some College), Age (Over/Under 65), and Sex (Female/Male)]
African Americans and 40.8 million whites have ever used the Web, and that 1.4 million African Americans and 20.3 million whites used the Web in the past week.”
-Hoffman & Novak (1998)
the Web? To what extent do online experiences vary across demographic groups?
paid via the Nielsen MegaPanel³
• Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information (age, education, income, race, sex, etc.)
³ Special thanks to Mainak Mazumdar
nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www, e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)
• Aggregate activity at the site, group, and user levels
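The normalization and filtering steps above can be sketched as follows. The helper names (`normalize_site`, `top_sites_by_unique_visitors`) and the `(user_id, url)` record layout are assumptions for illustration; the actual log format is not shown in the slides.

```python
from collections import defaultdict

def normalize_site(url, max_levels=3):
    """Reduce a URL to at most `max_levels` domain levels, without a leading www."""
    host = url.split("://", 1)[-1]         # drop a scheme such as http:// if present
    host = host.split("/", 1)[0].lower()   # drop the path
    host = host.split(":", 1)[0]           # drop a port if present
    labels = [part for part in host.split(".") if part]
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ".".join(labels[-max_levels:])

def top_sites_by_unique_visitors(pageviews, k):
    """Rank normalized sites by number of distinct users and keep the top k."""
    visitors = defaultdict(set)
    for user_id, url in pageviews:
        visitors[normalize_site(url)].add(user_id)
    ranked = sorted(visitors, key=lambda site: (-len(visitors[site]), site))
    return ranked[:k]

# Hypothetical log records: (user_id, url).
log = [
    ("u1", "www.yahoo.com"),
    ("u2", "yahoo.com/news"),
    ("u1", "us.mg2.mail.yahoo.com/neo/launch"),
    ("u1", "www.yahoo.com"),
]
print(normalize_site("us.mg2.mail.yahoo.com/neo/launch"))
print(top_sites_by_unique_visitors(log, k=1))
```

Counting distinct users rather than raw pageviews keeps a single heavy user from pushing an obscure site into the top list.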
different categories?
[Figure: fraction of total pageviews (0.05 to 0.25) in the Social Media, E-mail, Games, Portals, and Search categories]
All groups spend the majority of their time in the top five most popular categories
[Figure: fraction of pageviews in category (0.05 to 0.30) vs. user rank by daily activity (10% to 90%) for Social Media, E-mail, Games, Portals, and Search]
Highly active users devote nearly twice as much of their time to social media relative to typical individuals
level?
[Figure: daily per-capita pageviews (0 to 70) by Income (Over/Under $25k), Race (Black & Hispanic/White), Education (No College/Some College), Age (Over/Under 65), and Sex (Female/Male)]
Large differences exist even at the aggregate level (e.g., women on average generate 40% more pageviews than men)
Younger and more educated individuals are both more likely to access the Web and more active once they do
time in the same categories
users spend a smaller fraction of their time on social media
often accompanied by higher e-mail volume
[Figure: fraction of total pageviews in Social Media, E-mail, Games, Portals, and Search by Age (5 to 80), Education (Grammar School to Post Graduate Degree), Sex (Female, Male), Income ($0−25k to $150k+), and Race (Other, Hispanic, Black, White, Asian)]
and reference vary with demographics?
[Figure: average pageviews per month (0 to 12) on News, Health, and Reference sites by Education (Grammar School to Post Graduate Degree), Sex (Female, Male), Income ($0−25k to $150k+), and Race (Other, Hispanic, Black, White, Asian)]
Post-graduates spend three times as much time on health sites as adults with only some high school education
Asians spend more than 50% more time browsing online news than do other race groups
Even when less educated and less wealthy groups gain access to the Web, they utilize these resources relatively infrequently
and reference vary with demographics?
[Figure: average pageviews per month (0 to 12) on News, Health, and Reference sites by education level (High School Graduate to Post Graduate Degree) for Asian, Black, Hispanic, and White users]
Controlling for other variables, effects of race and gender largely disappear, while education continues to have a large effect
$p_i = \sum_j \alpha_j x_{ij} + \sum_j \sum_k \beta_{jk} x_{ij} x_{ik} + \sum_j \gamma_j x_{ij}^2 + \epsilon_i$
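The model above is linear in its coefficients, so it can be fit by least squares once the design matrix holds the main effects, pairwise products, and squares. The sketch below is one way to do that under stated assumptions: the data are synthetic, the function names are hypothetical, and interactions are restricted to pairs with j < k to avoid duplicate columns.

```python
from itertools import combinations
import numpy as np

def design_matrix(X):
    """Columns: x_j, then x_j * x_k for j < k, then x_j ** 2."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    linear = [X[:, j] for j in range(d)]
    pairs = [X[:, j] * X[:, k] for j, k in combinations(range(d), 2)]
    squares = [X[:, j] ** 2 for j in range(d)]
    return np.column_stack(linear + pairs + squares)

def fit_interaction_model(X, p):
    """Least-squares estimates of (alpha, beta, gamma), in design_matrix column order."""
    coef, *_ = np.linalg.lstsq(design_matrix(X), p, rcond=None)
    return coef

# Synthetic covariates and outcome (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_coef = np.array([1.0, -2.0, 0.5, 3.0, -1.0])  # alpha_1, alpha_2, beta_12, gamma_1, gamma_2
p = design_matrix(X) @ true_coef + 0.01 * rng.normal(size=200)
coef = fit_interaction_model(X, p)
print(np.round(coef, 2))
```

For 0/1 indicator covariates the squared column equals the linear one, so those columns are collinear; `lstsq` still returns a minimum-norm solution, but the individual coefficients are then not separately identified.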
and reference vary with demographics?
[Figure: average pageviews per month (0 to 12) on Health sites by education level (High School Graduate to Post Graduate Degree) for Female and Male users]
[Figure: monthly pageviews on health sites (20 to 100) for Female and Male users]
However, women spend considerably more time on health sites compared to men, although means can be misleading
from their browsing activity?
• Represent each user by the set of sites visited
• Fit linear models⁴ to predict majority/minority for each attribute on 80% of users
• Tune model parameters using a 10% validation set
• Evaluate final performance on held-out 10% test set
⁴ http://bit.ly/svmperf
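The 80/10/10 procedure above can be sketched end to end on synthetic data. This is not the SVM-based setup referenced in the footnote: as a stand-in, the sketch uses a ridge-regularized least-squares classifier in NumPy, and the visit matrix, labels, and regularization grid are all assumptions for illustration.

```python
import numpy as np

def split_indices(n_users, rng, train_frac=0.8, valid_frac=0.1):
    """Shuffle user indices and split them into train/validation/test parts."""
    order = rng.permutation(n_users)
    n_train = int(train_frac * n_users)
    n_valid = int(valid_frac * n_users)
    return (order[:n_train],
            order[n_train:n_train + n_valid],
            order[n_train + n_valid:])

def fit_linear(X, y, lam):
    """Ridge-regularized least squares on labels in {-1, +1}."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def accuracy(w, X, y):
    """Fraction of users whose predicted sign matches the label."""
    return float(np.mean(np.sign(X @ w) == y))

# Synthetic stand-in for browsing data: one row per user, one 0/1 column per site.
rng = np.random.default_rng(0)
n_users, n_sites = 1000, 50
visits = (rng.random((n_users, n_sites)) < 0.2).astype(float)
hidden = rng.normal(size=n_sites)                  # made-up site/attribute association
score = visits @ hidden
y = np.where(score > np.median(score), 1.0, -1.0)  # balanced binary attribute
X = np.column_stack([visits, np.ones(n_users)])    # add a bias column

train, valid, test = split_indices(n_users, rng)
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
# Tune the regularization strength on the validation set only.
best_lam = max(grid, key=lambda lam: accuracy(
    fit_linear(X[train], y[train], lam), X[valid], y[valid]))
w = fit_linear(X[train], y[train], best_lam)
test_accuracy = accuracy(w, X[test], y[test])      # touched once, at the end
print(best_lam, test_accuracy)
```

Keeping the test users out of both fitting and tuning is what makes the final number an honest estimate of performance on unseen users.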
                                 Large positive weight            Large negative weight
Female                           winster.com, lancome-usa.com     sports.yahoo.com, espn.go.com
White                            marlboro.com, cmt.com            mediatakeout.com, bet.com
College Educated                 news.yahoo.com, linkedin.com     youtube.com, myspace.com
Over 25 Years Old                evite.com, classmates.com        addictinggames.com, youtube.com
Household Income Under $50,000   eharmony.com, tracfone.com       rownine.com, matrixdirect.com
Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task.

[Figure 7: AUC and accuracy (.5 to 1) for the College/No College, Under/Over $50,000 Household Income, White/Non−White, Female/Male, and Over/Under 25 Years Old tasks]

Figure 7, a measure that effectively re-normalizes the majority and minority classes to have equal size. Intuitively, AUC is the probability that a model scores a randomly selected positive example higher than a randomly selected negative one (e.g., the probability that the model correctly distinguishes between a randomly selected female and male). Though an uninformative rule would correctly discriminate between such pairs 50% of the time, predictions based on
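The pairwise reading of AUC in the excerpt above translates directly into code: compare every positive example with every negative one and count how often the positive scores higher. In this sketch ties count as half a win, which is a common convention assumed here rather than something stated in the source; the scores are made up.

```python
def pairwise_auc(scores, labels):
    """Probability that a random positive example outscores a random negative one."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # assumed tie-handling convention
    return wins / (len(positives) * len(negatives))

# Hypothetical model scores; label 1 marks the positive class.
print(pairwise_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The double loop is quadratic in the number of examples, which is fine for illustration; rank-based formulas compute the same quantity faster on large data.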
time on social media and less on e-mail relative to the overall population
• Access to research, news, and healthcare is strongly related to education, not as closely to ethnicity
• User demographics can be inferred from browsing activity with reasonable accuracy
• “Who Does What on the Web”, Goel, Hofman & Sirer, ICWSM 2012
care as is proper, and to cut off the advance of this plague and cancerous disease so it will not spread any further ...”⁵
-Pope Leo X, Exsurge Domine (1520)
⁵ http://www.economist.com/node/21541719
to July 2012
• Restricted to 1.4 billion tweets containing links to top news, videos, images, and petitions sites
• Aggregated tweets by URL, resulting in 1 billion distinct “events”
• Crawled friend list of each adopter
• Inferred “who got what from whom” to construct diffusion trees
• Characterized size and structure of trees
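The tree-construction step in the list above can be sketched for a single URL. The attribution rule used here, crediting each adoption to the friend who adopted most recently beforehand, is an assumption made for illustration, as are the function names and the toy data.

```python
def infer_diffusion_trees(adoptions, friends):
    """Map each adopter to an inferred parent (None for a root).

    adoptions: list of (user, timestamp) pairs for one URL.
    friends:   dict mapping a user to the set of users they follow.
    """
    parent, adopted_at = {}, {}
    for user, t in sorted(adoptions, key=lambda a: a[1]):
        if user in adopted_at:
            continue                      # keep only each user's first adoption
        earlier = [f for f in friends.get(user, ()) if f in adopted_at]
        # Assumed rule: credit the friend who adopted most recently.
        parent[user] = max(earlier, key=lambda f: adopted_at[f]) if earlier else None
        adopted_at[user] = t
    return parent

def tree_sizes(parent):
    """Number of adopters in each tree, keyed by the tree's root."""
    sizes = {}
    for user in parent:
        root = user
        while parent[root] is not None:   # walk up to the root
            root = parent[root]
        sizes[root] = sizes.get(root, 0) + 1
    return sizes

# Toy example: "x" is followed by "d" but never adopts.
adoptions = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
friends = {"b": {"a"}, "c": {"a", "b"}, "d": {"x"}, "e": {"d"}}
parent = infer_diffusion_trees(adoptions, friends)
print(parent, tree_sizes(parent))
```

Adopters with no earlier-adopting friend become roots, so one URL can produce several independent trees of different sizes.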
[Figure: CCDF (%) vs. cascade size (1 to 10,000)]
Focus on the rare hits that get at least 100 adoptions
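The curve in this figure is a complementary cumulative distribution: for each size s, the fraction of cascades with at least s adoptions. A minimal sketch follows, using made-up cascade sizes and the convention P(size ≥ s), which is assumed here since the slide does not say whether the inequality is strict.

```python
import numpy as np

def empirical_ccdf(sizes):
    """Return the distinct sizes and, for each, the fraction of cascades at least that large."""
    sizes = np.asarray(sizes)
    values = np.unique(sizes)             # sorted distinct sizes
    ccdf = np.array([(sizes >= v).mean() for v in values])
    return values, ccdf

# Hypothetical cascade sizes: mostly tiny, one rare hit.
cascade_sizes = [1, 1, 1, 2, 5, 100]
values, ccdf = empirical_ccdf(cascade_sizes)
share_of_hits = float((np.asarray(cascade_sizes) >= 100).mean())
print(dict(zip(values.tolist(), ccdf.tolist())), share_of_hits)
```

Plotting `ccdf` against `values` on logarithmic axes makes the rare large cascades visible even when nearly all events are small.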
than two adoptions, on average
• Of the hits that do succeed, we observe a wide range of diverse diffusion structures
• It’s difficult to say how something spread given only its popularity
• “The structural virality of online diffusion”, Anderson, Goel, Hofman & Watts (Management Science 2015)