Slide 1

Slide 1 text

Survey research in the digital age. Matthew J. Salganik, Department of Sociology, Princeton University. msalganik. AAPOR Webinar, February 21, 2019

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Isn’t “big data” a fad?

Slide 4

Slide 4 text

https://commons.wikimedia.org/wiki/File:Gartner_Hype_Cycle.svg

Slide 5

Slide 5 text

Key abstraction is research design

Slide 6

Slide 6 text

Four main research designs:

Slide 7

Slide 7 text

Four main research designs: Observing behavior

Slide 8

Slide 8 text

Four main research designs: Observing behavior; Asking questions

Slide 9

Slide 9 text

Four main research designs: Observing behavior; Asking questions; Running experiments

Slide 10

Slide 10 text

Four main research designs: Observing behavior; Asking questions; Running experiments; Creating mass collaboration

Slide 11

Slide 11 text

Four main research designs: Observing behavior; Asking questions; Running experiments; Creating mass collaboration

Slide 12

Slide 12 text

Social Scientists ←→ Data Scientists

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Readymades

Slide 15

Slide 15 text

Readymades

Slide 16

Slide 16 text

Readymades Custommades

Slide 17

Slide 17 text

Readymades + Custommades https://commons.wikimedia.org/wiki/File:Duchamp_Fountaine.jpg https://commons.wikimedia.org/wiki/File:%27David%27_by_Michelangelo_JBU0001.JPG

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Blumenstock et al. (2015), Figure 2

Slide 26

Slide 26 text

Readymade + Custommade

Slide 27

Slide 27 text

Readymade + Custommade vs. Custommade

Slide 28

Slide 28 text

Readymade + Custommade vs. Custommade: 10 times faster, 50 times cheaper. Blumenstock et al. (2015), Figure 3

Slide 29

Slide 29 text

Readymades + Custommades https://commons.wikimedia.org/wiki/File:Duchamp_Fountaine.jpg https://commons.wikimedia.org/wiki/File:%27David%27_by_Michelangelo_JBU0001.JPG

Slide 30

Slide 30 text

Created by Kedron Rhodes

Slide 31

Slide 31 text

Why should I care about surveys in the age of big data?

Slide 32

Slide 32 text

We will always need to ask: limitations of big data (fubu vs. nufu-nubu)

Slide 33

Slide 33 text

We will always need to ask: limitations of big data (fubu vs. nufu-nubu); internal states vs. external states

Slide 34

Slide 34 text

We will always need to ask: limitations of big data (fubu vs. nufu-nubu); internal states vs. external states; inaccessibility of big data

Slide 35

Slide 35 text

We will always need to ask: limitations of big data (fubu vs. nufu-nubu); internal states vs. external states; inaccessibility of big data. But how we are going to ask is going to change.

Slide 36

Slide 36 text

1st era: Area probability sampling; Face-to-face interviews

Slide 37

Slide 37 text

1st era: Area probability sampling; Face-to-face interviews
2nd era: Random-digit dialing probability sampling; Telephone interviews

Slide 38

Slide 38 text

1st era: Area probability sampling; Face-to-face interviews
2nd era: Random-digit dialing probability sampling; Telephone interviews
3rd era:

Slide 39

Slide 39 text

1st era: Area probability sampling; Face-to-face interviews
2nd era: Random-digit dialing probability sampling; Telephone interviews
3rd era: Non-probability sampling; Computer-administered interviews

Slide 40

Slide 40 text

1st era: Area probability sampling; Face-to-face interviews; Stand-alone data environment
2nd era: Random-digit dialing probability sampling; Telephone interviews; Stand-alone data environment
3rd era: Non-probability sampling; Computer-administered interviews; Linked data environment

Slide 41

Slide 41 text

1st era: Area probability sampling; Face-to-face interviews; Stand-alone data environment
2nd era: Random-digit dialing probability sampling; Telephone interviews; Stand-alone data environment
3rd era: Non-probability sampling; Computer-administered interviews; Linked data environment

Slide 42

Slide 42 text

1st era: Area probability sampling; Face-to-face interviews; Stand-alone data environment
2nd era: Random-digit dialing probability sampling; Telephone interviews; Stand-alone data environment
3rd era: Non-probability sampling; Computer-administered interviews; Linked data environment

Slide 43

Slide 43 text

https://www.chicagotribune.com/news/opinion/commentary/ct-truman-defeats-dewey-1948-flashback-perspec-1113-md-20161111-story.html

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

1st era: Area probability sampling; Face-to-face interviews; Stand-alone data environment
2nd era: Random-digit dialing probability sampling; Telephone interviews; Stand-alone data environment
3rd era: Non-probability sampling; Computer-administered interviews; Linked data environment

Slide 48

Slide 48 text

Human-administered → computer-administered: enables change; requires change

Slide 49

Slide 49 text

https://doi.org/10.1371/journal.pone.0123483

Slide 50

Slide 50 text

http://kittenwar.com

Slide 51

Slide 51 text

http://kittenwar.com

Slide 52

Slide 52 text

http://kittenwar.com

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

quantification or openness

Slide 56

Slide 56 text

quantification + openness = wiki surveys

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

General principles of wiki surveys: greedy

Slide 59

Slide 59 text

Good web-based systems use the fat head and the long tail. [Figure: information contributed (zero to lots) plotted against contributors sorted by rank (contributes most to contributes least)]

Slide 60

Slide 60 text

Surveys don’t use the fat head or the long tail. [Figure: information contributed (zero to lots) plotted against contributors sorted by rank (contributes most to contributes least)]

Slide 61

Slide 61 text

General principles of wiki surveys: greedy

Slide 62

Slide 62 text

General principles of wiki surveys: greedy collaborative

Slide 63

Slide 63 text

General principles of wiki surveys: greedy collaborative adaptive

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

Which do you think is a better idea for creating a greener, greater New York City? Seeded the wiki survey with 25 ideas, for example: Require all big buildings to make certain energy efficiency upgrades; Increase targeted tree plantings in neighborhoods with high asthma rates; Establish a New York City Energy Planning Board

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

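What are we trying to estimate?
Data (each vote records its session and its prompt, a pair of items):
Vote 1, session 1: item 4, item 1
Vote 2, session 1: item 3, item 1
Vote 3, session 1: item 4, item 3
Vote 4, session 2: item 3, item 4
Vote 5, session 2: item 4, item 2
...
Opinion matrix:
\begin{bmatrix} \theta_{1,1} & \theta_{1,2} & \cdots & \theta_{1,K} \\ \theta_{2,1} & \theta_{2,2} & \cdots & \theta_{2,K} \\ \vdots & \vdots & \ddots & \vdots \\ \theta_{J,1} & \theta_{J,2} & \cdots & \theta_{J,K} \end{bmatrix}
\theta_{j,k}: how much respondent j likes item k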
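As a rough illustration of the vote data above, the following Python sketch computes a raw win rate per item from pairwise votes. It is not the estimator used in the talk, which targets the opinion matrix; it only shows the shape of the data. The sessions and prompts mirror the slide, but the winner of each vote is not recoverable from the slide text, so the `winner` values here are hypothetical, as is the function name `win_rates`.

```python
from collections import defaultdict

# Each vote: (session, first_item, second_item, winner).
# Sessions and prompts mirror the slide; the winners are HYPOTHETICAL.
votes = [
    (1, 4, 1, 4),
    (1, 3, 1, 3),
    (1, 4, 3, 4),
    (2, 3, 4, 3),
    (2, 4, 2, 4),
]


def win_rates(votes):
    """Return {item: percent of the prompts it appeared in that it won}."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for _session, first, second, winner in votes:
        if winner not in (first, second):
            raise ValueError("winner must be one of the two prompted items")
        appearances[first] += 1
        appearances[second] += 1
        wins[winner] += 1
    return {item: 100.0 * wins[item] / appearances[item] for item in appearances}


scores = win_rates(votes)
# Print items from highest to lowest raw win rate.
for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"item {item}: {score:.1f}")
```

A raw win rate ignores who voted and which opponents an item happened to face, which is one reason to model respondent-level opinions instead.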

Slide 72

Slide 72 text

Which do you think is a better idea for creating a greener, greater New York City? Seeded the wiki survey with 25 ideas, for example: Require all big buildings to make certain energy efficiency upgrades; Increase targeted tree plantings in neighborhoods with high asthma rates; Establish a New York City Energy Planning Board

Slide 73

Slide 73 text

Recruited participants through Twitter, Facebook, blogs, etc. This is not a random sample, but random samples are possible

Slide 74

Slide 74 text

31,893 responses; 464 ideas uploaded
[Figure: "Which do you think is a better idea for creating a greener, greater New York City?" Two panels: responses per session by rank of session, and contributed ideas per session by rank of session]

Slide 75

Slide 75 text

Which do you think is a better idea for creating a greener, greater New York City? [Figure: ideas plotted by score \hat{s}_i, axis from 60 to 90]
Provide better transit service outside of Manhattan
Create a network of protected bike paths throughout the entire city
Support and protect community gardens and create mechanisms to create new gardens and open space
Implement congestion pricing in lower Manhattan
Require all big buildings to make certain energy efficiency upgrades
Create more year-round Greenmarkets in under-served communities.
Continue enhancing bike lane network, to finally connect separated bike lane systems to each other across all five boroughs.
Plug ships into electricity grid so they don't idle in port - reducing emissions equivalent to 12000 cars per ship.
Invest in multiple modes of transportation and provide both improved infrastructure and improved safety
Keep NYC's drinking water clean by banning fracking in NYC's watershed.

Slide 76

Slide 76 text

Alternative framings: “Keep NYC’s drinking water clean by banning fracking in NYC’s watershed”

Slide 77

Slide 77 text

Alternative framings: “Keep NYC’s drinking water clean by banning fracking in NYC’s watershed” Novel information: “Plug ships into electricity grid so they don’t idle in port - reducing emissions equivalent to 12000 cars per ship.”

Slide 78

Slide 78 text

[Figure: score \hat{s}_i by rank, in two panels: seed ideas (ranks 1 to 25) and user-contributed ideas (ranks 1 to 244); score axis from 10 to 80]

Slide 79

Slide 79 text

[Figure: score \hat{s}_i by rank, in two panels: seed ideas (ranks 1 to 25) and user-contributed ideas (ranks 1 to 244); score axis from 10 to 80]
variance + volume → extreme cases
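The "variance + volume → extreme cases" point can be illustrated with a small simulation: the best of many draws tends to be more extreme than the best of a few, and more so when the draws are more spread out. This is only a sketch under assumed conditions. The counts 25 and 244 come from the figure, but the normal distribution, the mean of 50, and the spreads of 10 and 15 are assumptions for illustration, not estimates from the survey.

```python
import random
import statistics


def mean_best_score(n_ideas, spread, trials=2000, seed=0):
    """Average, over many trials, of the best score among n_ideas draws.

    Scores are drawn from Normal(50, spread); the distribution and its
    parameters are ASSUMPTIONS, not fitted to the wiki survey data.
    """
    rng = random.Random(seed)
    best_scores = [
        max(rng.gauss(50.0, spread) for _ in range(n_ideas))
        for _ in range(trials)
    ]
    return statistics.mean(best_scores)


seed_best = mean_best_score(n_ideas=25, spread=10.0)        # few ideas
user_best = mean_best_score(n_ideas=244, spread=10.0)       # more volume
user_best_wide = mean_best_score(n_ideas=244, spread=15.0)  # volume + variance

print(f"best of 25 ideas (spread 10):  {seed_best:.1f}")
print(f"best of 244 ideas (spread 10): {user_best:.1f}")
print(f"best of 244 ideas (spread 15): {user_best_wide:.1f}")
```

Under these assumptions, both a larger pool and a wider spread push the top-scoring idea higher, even though the average idea is no better.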

Slide 80

Slide 80 text

Currently hosting: 14,000 wiki surveys with 700,000 ideas and 21 million votes

Slide 81

Slide 81 text

improved allourideas.org users research

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

1st era: Area probability sampling; Face-to-face interviews; Stand-alone data environment
2nd era: Random-digit dialing probability sampling; Telephone interviews; Stand-alone data environment
3rd era: Non-probability sampling; Computer-administered interviews; Linked data environment

Slide 85

Slide 85 text

1st era: Area probability sampling; Face-to-face interviews; Stand-alone data environment
2nd era: Random-digit dialing probability sampling; Telephone interviews; Stand-alone data environment
3rd era: Non-probability sampling; Computer-administered interviews; Linked data environment

Slide 86

Slide 86 text

Will big data kill surveys?

Slide 87

Slide 87 text

http://schlitterblog.com/wp-content/uploads/2014/05/peanutbutterlover.jpg

Slide 88

Slide 88 text

http://schlitterblog.com/wp-content/uploads/2014/05/peanutbutterlover.jpg

Slide 89

Slide 89 text

Note the different role of the big data in each case

Slide 90

Slide 90 text

http://dx.doi.org/10.1126/science.aac4420

Slide 91

Slide 91 text

No content

Slide 92

Slide 92 text

No content

Slide 93

Slide 93 text

No content

Slide 94

Slide 94 text

The beginning is not the end . . . .

Slide 95

Slide 95 text

No content

Slide 96

Slide 96 text

http://dx.doi.org/10.1126/science.aaf7894 https://motherboard.vice.com/en_us/article/artificial-intelligence-is-predicting-human-poverty-from-space

Slide 97

Slide 97 text

Daytime satellite images are available, but most researchers had been using night lights https://www.nasa.gov/multimedia/imagegallery/image_feature_2480.html

Slide 98

Slide 98 text

Prior research: Nightlights + survey data → estimates of wealth in places without surveys

Slide 99

Slide 99 text

Jean et al. (2016): Day pictures + Nightlights + survey data → estimates of wealth in places without surveys

Slide 100

Slide 100 text

Start with CNN pretrained on ImageNet

Slide 101

Slide 101 text

Start with CNN pretrained on ImageNet. Train CNN to predict nightlights from day pictures (lots of training data).

Slide 102

Slide 102 text

Start with CNN pretrained on ImageNet (e.g. hamsters and weasels). Train CNN to predict nightlights from day pictures (lots of training data). Take features from CNN and train ridge regression to predict cluster mean survey response. http://dx.doi.org/10.1126/science.aah5217
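The last step of this pipeline, ridge regression from image features to a cluster-level survey mean, can be sketched in plain Python. This is a minimal illustration, not the authors' code: the two-dimensional feature vectors stand in for CNN features (which would be far higher-dimensional), the numbers are made up, the CNN steps are omitted, and the helper names `ridge_fit` and `predict` are hypothetical.

```python
def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y by Gauss-Jordan elimination.

    No intercept is fitted; lam >= 0 is the ridge penalty.
    """
    k = len(X[0])
    # Augmented matrix [X^T X + lam * I | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
         for j in range(k)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(X, w):
    """Linear prediction for each row of X."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in X]


# HYPOTHETICAL stand-ins: one feature vector per surveyed cluster, and the
# cluster mean survey response the model is trained to predict.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
cluster_means = [2.0, -1.0, 1.0, 3.0]

w_unpenalized = ridge_fit(features, cluster_means, lam=0.0)
w_ridge = ridge_fit(features, cluster_means, lam=10.0)

# Predict for a cluster with imagery but no survey (made-up features).
print(predict([[1.5, 0.5]], w_ridge))
```

The penalty shrinks the weights toward zero, which matters when there are many features and few surveyed clusters.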

Slide 103

Slide 103 text

http://dx.doi.org/10.1126/science.aaf7894

Slide 104

Slide 104 text

Two patterns: (1) Performance decreases when training on one country and testing on another. (2) Performance varies by the quantity being estimated (assets seem easier to estimate than consumption expenditures). http://dx.doi.org/10.1126/science.aaf7894

Slide 105

Slide 105 text

https://github.com/nealjean/predicting-poverty

Slide 106

Slide 106 text

1st era: Area probability sampling; Face-to-face interviews; Stand-alone data environment
2nd era: Random-digit dialing probability sampling; Telephone interviews; Stand-alone data environment
3rd era: Non-probability sampling; Computer-administered interviews; Linked data environment

Slide 107

Slide 107 text

1st era: Area probability sampling; Face-to-face interviews; Stand-alone data environment
2nd era: Random-digit dialing probability sampling; Telephone interviews; Stand-alone data environment
3rd era: Non-probability sampling; Computer-administered interviews; Linked data environment

Slide 108

Slide 108 text

https://commons.wikimedia.org/wiki/File:Gartner_Hype_Cycle.svg

Slide 109

Slide 109 text

Read: http://www.bitbybitbook.com Teach: http://www.bitbybitbook.com/en/teaching/