
Social Substrates


About people and the data they make.

David Shamma

April 09, 2012

Transcript

  1. SOCIAL SUBSTRATES: PEOPLE AND THE DATA THEY MAKE. David Shamma (@ayman), Internet Experiences, Microeconomics & Social Systems, Yahoo! Research
  2. Nothing ever happens in a vacuum. Elizabeth Churchill, Jude Yew, Lyndon Kennedy
  3. Big data...there’s stacks and stacks of it...it helps us solve problems. http://www.flickr.com/photos/mwichary/3369638710/ http://www.flickr.com/photos/mwichary/2355790455/ http://www.flickr.com/photos/jagadish/3072156349/
  4. People using technology, sharing & communicating (1); massive amounts of data are stored (2); Hadoop, Pig, ML (3).
  5. The typical “Big Data” approach: people using technology, sharing & communicating; massive amounts of data are stored; Hadoop, Pig, ML.
  6. People using technology, sharing & communicating; massive amounts of data are stored; Hadoop, Pig, ML. What about motivations? Why did people make all this data?
  7. People using technology, sharing & communicating; massive amounts of data are stored; Hadoop, Pig, ML. What about motivations? Why did people make all this data?
  8. INDIRECT ANNOTATION. RT: @jowyang If you are watching the debate you’re invited to participate in #tweetdebate Here is the 411 http://tinyurl.com/3jdy67 (Sept 26, 2008 18:23 EST)
  9. ANATOMY OF A TWEET. RT: @jowyang If you are watching the debate you’re invited to participate in #tweetdebate Here is the 411 http://tinyurl.com/3jdy67. Repeated (retweet) content starts with RT; address other users with an @; tags start with #; rich media embeds via links.
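The anatomy above maps directly onto simple pattern matching. A minimal sketch in Python (illustrative only; parse_tweet and the regexes are mine, not the study's pipeline):

```python
import re

MENTION = re.compile(r'@(\w+)')
HASHTAG = re.compile(r'#(\w+)')
LINK = re.compile(r'https?://\S+')

def parse_tweet(text):
    """Pull out the tweet anatomy described on the slide."""
    return {
        'is_retweet': text.lstrip().startswith('RT'),  # repeated content starts with RT
        'mentions': MENTION.findall(text),             # addressed users (@)
        'hashtags': HASHTAG.findall(text),             # tags (#)
        'links': LINK.findall(text),                   # rich media embeds via links
    }

tweet = ("RT: @jowyang If you are watching the debate you're invited "
         "to participate in #tweetdebate Here is the 411 http://tinyurl.com/3jdy67")
print(parse_tweet(tweet))
# {'is_retweet': True, 'mentions': ['jowyang'], 'hashtags': ['tweetdebate'],
#  'links': ['http://tinyurl.com/3jdy67']}
```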
  10. A long time ago...
      • Tweet Data circa 2008
      • Three hashtags: #current #debate08 #tweetdebate
      • 97 mins debate + 53 mins following = 2.5 hours total.
      • 3,238 tweets from 1,160 people.
      • 1,824 tweets from 647 people during the debate.
      • 1,414 tweets from 738 people post debate.
  11. Automatic Segment Detection. We use Newton’s Method to find extrema outside μ±σ as candidate markers. Any marker that follows a marker on the previous minute is ignored.
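A rough sketch of the thresholding step described on slide 11: bucket tweets into per-minute volumes and flag minutes whose counts fall outside μ±σ, dropping a marker that immediately follows one on the previous minute. The Newton's Method extrema search itself is not reproduced here, and the function below is an illustration, not the paper's code:

```python
from collections import Counter
from statistics import mean, stdev

def candidate_markers(tweet_minutes):
    """tweet_minutes: iterable of per-tweet minute offsets (ints).
    Returns minutes whose tweet volume lies outside mean +/- one
    standard deviation, skipping a marker that immediately follows
    a marker on the previous minute."""
    volume = Counter(tweet_minutes)                       # tweets per minute
    counts = [volume.get(m, 0) for m in range(max(volume) + 1)]
    mu, sigma = mean(counts), stdev(counts)
    markers = []
    for minute, count in enumerate(counts):
        if abs(count - mu) > sigma:                       # outside mu +/- sigma
            if markers and markers[-1] == minute - 1:     # follows a previous-minute marker
                continue
            markers.append(minute)
    return markers
```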
  12. Automatic segment detection agrees 92% with CSPAN’s editorialized Debate Summary, within ±1 minute.
  13. Directed Communication via @mentions. John tweets: “Hey @mary, my person is winning!” This creates a directed edge from John to Mary in the graph.
  14. Barack, NewsHour (Jim Lehrer), & McCain: the high eigenvector centrality figures on Twitter from the first US Presidential Debate of 2008.
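Slides 13 and 14 together describe building a directed @mention graph and ranking accounts by eigenvector centrality. A small sketch using networkx (my choice of library; the toy tweets are invented):

```python
import networkx as nx

def mention_graph(tweets):
    """tweets: iterable of (author, [mentioned_users]) pairs.
    Each @mention adds a directed edge author -> mentioned user;
    repeated mentions increase the edge weight."""
    g = nx.DiGraph()
    for author, mentions in tweets:
        for target in mentions:
            if g.has_edge(author, target):
                g[author][target]['weight'] += 1
            else:
                g.add_edge(author, target, weight=1)
    return g

g = mention_graph([('john', ['mary']),
                   ('mary', ['john', 'newshour']),
                   ('john', ['newshour'])])
# Eigenvector centrality on the directed graph; networkx scores nodes by
# their in-edges, so accounts mentioned by well-mentioned accounts rank highest.
centrality = nx.eigenvector_centrality(g, weight='weight')
print(max(centrality, key=centrality.get))  # 'newshour'
```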
  15. Sentiment/affect judgements from the debate. [1] Diakopoulos, N. A., and Shamma, D. A. Characterizing debate performance via aggregated twitter sentiment. In CHI ’10: Proceedings of the 28th international conference on Human factors in computing systems (New York, NY, USA, 2010), ACM, pp. 1195–1198.
  16. Sentiment/affect judgements by candidate. [1] Diakopoulos, N. A., and Shamma, D. A. Characterizing debate performance via aggregated twitter sentiment. In CHI ’10: Proceedings of the 28th international conference on Human factors in computing systems (New York, NY, USA, 2010), ACM, pp. 1195–1198.
  17. A little less long ago...with more data...
      • Twitter Data circa 2009: Data Mining Feed
      • 600 Tweets per minute
      • 90 Minutes
      • 54,000 Tweets from 1.5 hours
      • Constant data rate means the volume method doesn’t work.
  18. Terms as topic points. Using a TF/IDF window of 5 mins.
  19. Terms as topic points. Using a TF/IDF window of 5 mins, find terms that are only relevant to that slice; subtract out salient, non-stop-listed terms like Obama, president, and speech. No significant occurrence of “remaking”.
  20. Terms as Sustained Interest. Using a TF/IDF window of 5 mins, find terms that are only relevant to that slice; subtract out salient, non-stop-listed terms like Obama, president, and speech. Figure annotations: no significant occurrence of “remaking”; an occurrence of “remaking” less significant than the peak; peak occurrence of “remaking”.
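One way to approximate the windowed TF/IDF idea from slides 18 to 20: treat each 5-minute slice of tweets as a document, score terms across slices, and keep the top slice-specific terms after removing globally salient background terms. The sketch below uses scikit-learn's TfidfVectorizer; the window size, background list, and ranking are illustrative choices, not the study's exact parameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def windowed_terms(tweets, window=5, top_k=5,
                   background=frozenset({'obama', 'president', 'speech'})):
    """tweets: list of (minute, text). Groups tweets into `window`-minute
    slices, TF-IDF scores each slice against the others, and returns the
    top terms per slice after dropping globally salient background terms."""
    slices = {}
    for minute, text in tweets:
        slices.setdefault(minute // window, []).append(text)
    docs = [' '.join(slices[k]) for k in sorted(slices)]

    vec = TfidfVectorizer(stop_words='english')
    tfidf = vec.fit_transform(docs)            # one row per 5-minute slice
    terms = vec.get_feature_names_out()

    topics = []
    for row in tfidf.toarray():
        ranked = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)
        topics.append([term for term, score in ranked if term not in background][:top_k])
    return topics
```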
  21. Sustained Interest & Background Whispers. Some topics continue over time with a higher conversational context.
  22. Sustained Interest & Background Whispers. These terms are not salient by any standard term/document model.
  23. People Announce. (12:05) Bastille71: OMG - Obama just messed up the oath - AWESOME! he’s human! (12:07) ryantherobot: LOL Obama messed up his inaugural oath twice! regardless, Obama is the president today! whoooo! (12:46) mattycus: RT @deelah: it wasn’t Obama that messed the oath, it was Chief Justice Roberts: http://is.gd/gAVo (12:53) dawngoldberg: @therichbrooks He flubbed the oath because Chief Justice screwed up the order of the words.
  24. People Reply. (12:05) Bastille71: OMG - Obama just messed up the oath - AWESOME! he’s human! (12:07) ryantherobot: LOL Obama messed up his inaugural oath twice! regardless, Obama is the president today! whoooo! (12:46) mattycus: RT @deelah: it wasn’t Obama that messed the oath, it was Chief Justice Roberts: http://is.gd/gAVo (12:53) dawngoldberg: @therichbrooks He flubbed the oath because Chief Justice screwed up the order of the words.
  25. Will the scaling end? Does granularity change the picture? In this case the conversational patterns stayed the same. Firehose data is ≈ 26% dissimilar from other collection methods. What about people and deep motivations?
  26. We forgot about this experience...how can we better understand this interaction? http://flickr.com/photos/chrisdonia/2781047956/
  27. Social conversations happen around videos. Well, actually people join a session and converse afterwards.
  28. What do we need to measure engagement?
      • Type of event (Zync player command or a normal chat message)
      • Anonymous hash (uniquely identifies the sender and the receiver, without exposing personal account data)
      • URL to the shared video
      • Timestamp for the event
      • The player time (with respect to the specific video) at the point the event occurred
      • The number of characters and the number of words typed (for chat messages)
      • Emoticons used in the chat message
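A sketch of what one such instrumented event record could look like; the field names and types below are assumptions, not the actual Zync log schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ZyncEvent:
    """One logged event in a shared video session (illustrative schema)."""
    event_type: str        # player command (play/pause/scrub) or chat message
    sender_hash: str       # anonymous hash identifying the sender
    receiver_hash: str     # anonymous hash identifying the receiver
    video_url: str         # URL to the shared video
    timestamp: float       # wall-clock time of the event (epoch seconds)
    player_time: float     # position within the video when the event fired
    char_count: int = 0    # chat only: number of characters typed
    word_count: int = 0    # chat only: number of words typed
    emoticons: List[str] = field(default_factory=list)  # emoticons in the chat message

event = ZyncEvent('chat', 'a1b2', 'c3d4', 'http://example.com/v/123',
                  timestamp=1325376000.0, player_time=42.5,
                  char_count=18, word_count=4, emoticons=[':)'])
```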
  29. Reciprocity
      • In 43.6% of the sessions, the invitee played at least one video back to the session’s initiator.
      • 77.7% sharing reciprocation.
      • Pairs of people often exchanged more than one set of videos in a session.
      • In the categories of Nonprofit, Technology, and Shows, the invitees shared more videos to the initiator (5:4, 9:7, and 5:2 respectively).
  30. CLASSIFICATION. How do we know what people are watching? How can we give them better things to watch?
  31. The Big Data approach: get a ton of ratings! Hey, it worked for Netflix...
  32. There are more “subtle” problems with genre and human classification...and even harder problems with finding what’s funny...
  33. CLASSIFICATION BASED ON IMPLICIT CONNECTED SOCIAL ACTIONS. Five-star ratings have been the golden egg for recommendation systems so far; implicit human cooperative sharing activity works better. If comedy is a social construct, let’s train on the construct.
  34. Legacy online video: Tags, Video duration, Ratings, Comments, View Count, Sharing Activity, Category, Recommendations.
  35. Let’s not just throw data at the problem...
      Human perception and experience: Pilot Survey; Crowdflower/Amazon MT Survey; Interviews.
      Computational classification: Naïve Bayes Classifier; GBDTs.
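For the computational side listed above (Naïve Bayes and GBDTs), a minimal scikit-learn sketch with stand-in features per shared video (session duration, play/pause, scrub, and chat counts, echoing the later "Used Data" slide); the random data is a placeholder for the real Zync logs:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Illustrative feature matrix: one row per shared video with
# [session duration (s), # play/pause, # scrubs, # chats]; labels are
# video categories (e.g. Sports/Comedy/Entertainment/People/Film).
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, 5, size=200)

for model in (GaussianNB(), GradientBoostingClassifier()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())
```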
  36. Research Questions. RQ1: How do people socially consume, perceive, and categorize videos (Comedy, in our study) at the time of creation and consumption? RQ2: Can a predictive model be built to automatically categorize media in a manner that is contextually and socially appropriate?
  37. Survey One
      • We sent out a web survey to get judgements on 20 randomly chosen videos from a sample of videos (43 complete responses).
      • In general, the human as a classifier categorizes at 60.9% accuracy.
      • For Comedy, their accuracy was 52.3%.
  38. “This is very funny video of a baby laughing. Not sure it should be categorize as a comedy.” (Respondent 16 on Video 8) “Its funny but its only an animal getting startled by her sneezing baby. It is not comedy because the actions were not specifically done to make us laugh.” (Respondent 9 on Video 9)
  39. Online videos are difficult to classify. “Films in general tend to fit categories more narrowly...If NetFlix says that something is a screwball comedy, I know what to expect. I think on YouTube the range of possibilities for the content of the videos is much less constrained. Cause it might literally be a segment from a film or it could be something shot on a cheap digital camera.”
  40. Comedy has internal cues & signals. “If I ever hear the laughter track, it doesn’t even have to be funny, it’s intended to be Comedy. The style of it...even if it’s a Music Video.” “Anything comedy is always impressed upon you with laughter in the background or some funny accompanying music...Those contextual cues.”
  41. Comedy has social contexts. “There’s a context that you need to have for something to be identified as comedy … when I see a video that I have no context for, I don’t know whether to identify it as funny. But if people are interacting with it in a way that makes me believe that it’s funny. Same thing for the wedding dance (JK wedding video) … my interaction with it is, ‘people are saying that this is funny.’”
  42. The “Comedy” label itself is a signal. “They (the uploaders of videos) are uploading things and categorizing it as ‘Comedy’ because they are proposing that there’s something funny in it. Even though the content itself may not be ‘Comedy’.”
  43. FIRST ORDER DATA WASN’T PRETTY. Phone in your favorite ML technique.
  44. Used and Unused Data
      Used: YouTube (Duration (video), Views, Rating*); Zync (Duration (session)*, # of Play/Pause*, # of Scrubs*, # of Chats*).
      Not used: YouTube (Tags, Comments, Favorites); Zync (Emoticons, User ID data, # of Sessions, # of Loads).
  45. Classification Using Conversation Signals
      Accuracy by type: Random Chance 23.0%; You Tube Features 14.6%; Humans 60.9%; YouTube Features 75.9%; Zync Features 87.8%.
      Social sharing patterns from Zync are highly predictive of content type. Here we can predict whether a video is Sports, Comedy, Entertainment, People, or Film just from how the video is shared. This brings better recommendations for content and advertising.
  46. Viral content can be identified similarly, and more accurately: 98.5% (F1 = 0.78). Does a video have over 10M views? We did this using data from 10 people in 5 shared sessions. The drawback is you need to use a synchronous connected experience.
  47. Big Data, Instrumentation, and Social Motivations
      • More data isn’t necessarily better data.
      • Corporate design & building patterns can deliver you bad data.
      • Big Data tends to answer “what can we do with all this data” but not “how can we create and properly instrument the world”.
      • People and their communication patterns can be found and utilized effectively, often requiring significantly less data to solve similar problems.
  48. Buildings account for 48% of all GHG emissions. Farming is 18%. Transportation is 14%. Photos courtesy of AutoDesk Research.
  49. Design and Interaction Meets Big Environments and Big Data, with AutoDesk and MobileLife/KTH. (Diagram labels: Building, Floor Area, Occupant, Sensor.) Photos courtesy of AutoDesk Research.
  50. Fin. Thanks to J. Yew, L. Kennedy, E. Churchill, R. Sheppard, A. Khan, N. Diakopoulos, A. Brooks, Y. Liu, S. Pentland, J. Antin, J. Dunning, Chloe S., Marc S., & M. Cameron J.
      Viral Actions: Predicting Video View Counts Using Synchronous Sharing Behaviors. David A. Shamma; Jude Yew; Lyndon Kennedy; Elizabeth F. Churchill. ICWSM 2011, International AAAI Conference on Weblogs and Social Media, AAAI, 2011.
      Knowing Funny: Genre Perception and Categorization in Social Video Sharing. Jude Yew; David A. Shamma; Elizabeth F. Churchill. CHI 2011, ACM, 2011.
      Peaks and Persistence: Modeling the Shape of Microblog Conversations. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. CSCW 2011, ACM, 2011.
      Know Your Data: Understanding Implicit Usage versus Explicit Action in Video Content Classification. Jude Yew; David A. Shamma. Electronic Imaging, IS&T/SPIE, 2011.
      Beyond Freebird. David A. Shamma. XRDS: Crossroads, ACM, 2010.
      Conversational Shadows: Describing Live Media Events Using Short Messages. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. International AAAI Conference on Weblogs and Social Media, AAAI, 2010.
      Characterizing Debate Performance via Aggregated Twitter Sentiment. Nicholas A. Diakopoulos; David A. Shamma. CHI 2010, ACM, 2010.
      Statler: Summarizing Media through Short-Message Services. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. CSCW, 2010.
      Tweet the Debates: Understanding Community Annotation of Uncollected Sources. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. ACM Multimedia, ACM, 2009.
      Understanding the Creative Conversation: Modeling to Engagement. David A. Shamma; Dan Perkel; Kurt Luther. Creativity and Cognition, ACM, 2009.
      Spinning Online: A Case Study of Internet Broadcasting by DJs. David A. Shamma; Elizabeth Churchill; Nikhil Bobb; Matt Fukuda. Communities & Technology, ACM, 2009.
      Zync with Me: Synchronized Sharing of Video through Instant Messaging. David A. Shamma; Yiming Liu. In Pablo Cesar, David Geerts, Konstantinos Chorianopoulos (eds.), Social Interactive Television: Immersive Shared Experiences and Perspectives, Information Science Reference, IGI Global, 2009.
      Enhancing Online Personal Connections through the Synchronized Sharing of Online Video. Shamma, D. A.; Bastéa-Forte, M.; Joubert, N.; Liu, Y. Human Factors in Computing Systems (CHI), ACM, 2008.
      Supporting Creative Acts beyond Dissemination. David A. Shamma; Ryan Shaw. Creativity and Cognition, ACM, 2007.
      Watch What I Watch: Using Community Activity to Understand Content. David A. Shamma; Ryan Shaw; Peter Shafton; Yiming Liu. ACM Multimedia Workshop on Multimedia Information Retrieval (MIR), ACM, 2007.
      Zync: The Design of Synchronized Video Sharing. Yiming Liu; David A. Shamma; Peter Shafton; Jeannie Yang. Designing for User eXperiences, ACM, 2007.