
Social Substrates

About People and the Data they make.

David Shamma

April 09, 2012

Transcript

  1. SOCIAL SUBSTRATES: PEOPLE AND THE DATA THEY MAKE
     david @ayman shamma, Internet Experiences, Microeconomics & Social Systems, Yahoo! Research
     Thursday, March 15, 2012
  2. Nothing ever happens in a vacuum.
     With Elizabeth Churchill, Jude Yew, and Lyndon Kennedy.
  3. Big data...there’s stacks and stacks of it...it helps us solve problems.
     Photos: http://www.flickr.com/photos/mwichary/3369638710/ http://www.flickr.com/photos/mwichary/2355790455/ http://www.flickr.com/photos/jagadish/3072156349/
  4. People Using Technology, Sharing & Communicating → Massive Amounts of Data are Stored → Hadoop, Pig, ML
  5. The Typical “Big Data” Approach: People Using Technology, Sharing & Communicating → Massive Amounts of Data are Stored (“Big Data”) → Hadoop, Pig, ML
  6. People Using Technology, Sharing & Communicating → Massive Amounts of Data are Stored → Hadoop, Pig, ML. What about motivations? Why did people make all this data?
  7. People Using Technology, Sharing & Communicating → Massive Amounts of Data are Stored → Hadoop, Pig, ML. What about motivations? Why did people make all this data?
  8. Dolores Park, San Francisco, 2006
  9. Social Conversations Happen Around Media. Dolores Park, San Francisco, 2006
  10. Media is a Social Experience
  11. People Tweet While They Watch
  12.

  13. INDIRECT ANNOTATION: RT: @jowyang If you are watching the debate you’re invited to participate in #tweetdebate Here is the 411 http://tinyurl.com/3jdy67 (Sept 26, 2008, 18:23 EST)
  14. ANATOMY OF A TWEET: RT: @jowyang If you are watching the debate you’re invited to participate in #tweetdebate Here is the 411 http://tinyurl.com/3jdy67
      Repeated (retweet) content starts with RT. Address other users with an @. Tags start with #. Rich media embeds via links.
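The tweet anatomy on this slide maps directly onto a few regular expressions. A minimal sketch (the function and field names are mine, not from the deck):

```python
import re

def parse_tweet(text):
    """Pull out the structural parts of a tweet the slide names:
    retweet marker, @mentions, #hashtags, and embedded links."""
    return {
        "is_retweet": text.strip().startswith("RT"),
        "mentions": re.findall(r"@(\w+)", text),
        "hashtags": re.findall(r"#(\w+)", text),
        "links": re.findall(r"https?://\S+", text),
    }

tweet = ("RT: @jowyang If you are watching the debate you're invited "
         "to participate in #tweetdebate Here is the 411 "
         "http://tinyurl.com/3jdy67")
parts = parse_tweet(tweet)
```

Running it on the slide's example tweet yields the retweet flag, the mention `jowyang`, the tag `tweetdebate`, and the tinyurl link.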
  15. A long time ago...
      • Tweet data circa 2008
      • Three hashtags: #current #debate08 #tweetdebate
      • 97 mins debate + 53 mins following = 2.5 hours total
      • 3,238 tweets from 1,160 people
      • 1,824 tweets from 647 people during the debate
      • 1,414 tweets from 738 people post debate
  16. Volume of Tweets by Minute. Crawled from the Twitter RESTful search API.
  17. Tweets During and After the Debates. Conversation swells after the debate.
  18. Volume of Conversation Follows the Debate.
  19. Does conversation follow after a segment? Think of Isaac Newton.
  20. Do the roots of f′(x) find segmentation?
  21. Automatic Segment Detection. We use Newton’s Method to find extrema outside μ±σ as candidate markers. Any marker that directly follows a marker on the previous minute is ignored.
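The segmentation idea can be approximated in a few lines: take the discrete per-minute derivative of tweet volume, keep local extrema whose volume falls outside μ±σ, and drop any marker that directly follows a previous-minute marker. This sketch uses a simple sign-change test on the derivative rather than a full Newton's Method root finder, and the volumes are toy numbers:

```python
from statistics import mean, stdev

def candidate_markers(volume):
    """Mark minutes where the per-minute volume derivative changes
    sign (a local extremum) AND the volume lies outside mu +/- sigma;
    a marker directly following a previous-minute marker is ignored."""
    mu, sigma = mean(volume), stdev(volume)
    deriv = [volume[i + 1] - volume[i] for i in range(len(volume) - 1)]
    markers = []
    for i in range(1, len(deriv)):
        is_extremum = deriv[i - 1] * deriv[i] <= 0  # derivative crosses zero
        is_outlier = abs(volume[i] - mu) > sigma    # outside mu +/- sigma
        if is_extremum and is_outlier:
            if markers and markers[-1] == i - 1:    # previous-minute rule
                continue
            markers.append(i)
    return markers

# toy per-minute tweet volumes with spikes at minutes 3 and 7
vol = [10, 12, 11, 40, 12, 11, 10, 45, 12, 11]
```

On the toy series, the two spike minutes come back as the candidate markers.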
  22. Automatic Segment Detection agrees 92% with CSPAN’s editorialized Debate Summary, ± 1 minute.
  23. Directed Communication via @mentions. John tweets: “Hey @mary, my person is winning!” This makes a directed edge from John to Mary.
  24. Barack Obama, NewsHour (Jim Lehrer), & McCain: the high eigenvector-centrality figures on Twitter from the first US presidential debate of 2008.
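Slides 23 and 24 combine into a small pipeline: build a directed graph from @mentions, then rank users by eigenvector centrality. A stdlib-only sketch using plain power iteration, with invented usernames and tweets:

```python
import re

def mention_graph(tweets):
    """Directed @mention edges: one edge from the tweet's author to
    each user they @mention, as on the slide (John -> Mary)."""
    edges = []
    for author, text in tweets:
        edges += [(author, m) for m in re.findall(r"@(\w+)", text)]
    return edges

def eigenvector_centrality(edges, iters=100):
    """Power iteration: a node scores high when many high-scoring
    nodes mention it. Seeding `new` with the current score (an
    identity shift) keeps 2-cycles from oscillating."""
    nodes = sorted({n for e in edges for n in e})
    idx = {n: i for i, n in enumerate(nodes)}
    score = [1.0] * len(nodes)
    for _ in range(iters):
        new = score[:]
        for src, dst in edges:
            new[idx[dst]] += score[idx[src]]
        norm = sum(v * v for v in new) ** 0.5
        score = [v / norm for v in new]
    return dict(zip(nodes, score))

tweets = [("john", "Hey @mary, my person is winning!"),
          ("alice", "@mary me too!"),
          ("mary", "@john we will see")]
centrality = eigenvector_centrality(mention_graph(tweets))
```

In the toy graph, mary is mentioned by two users (one of whom she mentions back), so she ends up most central; alice, whom nobody mentions, decays toward zero.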
  25. Tweets to Terms. Common stems in bold-italic.
  26. Tweets are Reaction not Content

  27. Sentiment/affect judgements from the debate. [1] Diakopoulos, N. A., and Shamma, D. A. Characterizing Debate Performance via Aggregated Twitter Sentiment. In CHI ’10: Proceedings of the 28th International Conference on Human Factors in Computing Systems, ACM, 2010, pp. 1195–1198.
  28. Sentiment/affect judgements by candidate. [1]
  29. It’s still the same today...
  30. Inauguration 2009. Photo: http://www.flickr.com/photos/twistedart/3212723019/

  31. A little less long ago...with more data...
      • Twitter data circa 2009: Data Mining Feed
      • 600 tweets per minute
      • 90 minutes
      • 54,000 tweets from 1.5 hours
      • A constant data rate means the volume method doesn’t work.
  32. Data Mining Feed: 53,000 tweets @ 600 per minute
  33. Drop in @conversation as onset
  34. Photo: http://www.flickr.com/photos/wvs/3833148925/

  35. Fewer @s means fewer characters

  36. Terms as topic points, using a TF/IDF window of 5 minutes.
  37. Terms as topic points. Using a TF/IDF window of 5 minutes, find terms that are only relevant to that slice, subtracting out salient, non-stop-listed terms like Obama, president, and speech. No significant occurrence of “remaking”.
  38. Terms as Sustained Interest. Using a TF/IDF window of 5 minutes, find terms that are only relevant to that slice, subtracting out salient, non-stop-listed terms like Obama, president, and speech. One window has no significant occurrence of “remaking”, one contains an occurrence less significant than the peak, and one holds the peak occurrence of “remaking”.
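The windowed TF/IDF idea treats each 5-minute slice of the feed as one document. A rough sketch; the stoplist follows the slide, but the scoring details are a simplified guess at the method, not the paper's exact formulation:

```python
import math
from collections import Counter

# globally salient terms the slide says to subtract out
STOPLIST = {"obama", "president", "speech"}

def windowed_tfidf(minute_tweets, window=5):
    """Treat each `window`-minute slice of tweets as one document and
    score each term by TF-IDF across the slices; a term that spikes in
    one slice only ("remaking") scores high there and nowhere else."""
    docs = []
    for start in range(0, len(minute_tweets), window):
        words = [w.lower()
                 for text in minute_tweets[start:start + window]
                 for w in text.split()]
        docs.append(Counter(w for w in words if w not in STOPLIST))
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    scores = []
    for doc in docs:
        total = max(sum(doc.values()), 1)
        scores.append({term: (count / total) * math.log(n / df[term])
                       for term, count in doc.items()})
    return scores
```

With a toy feed, a term confined to one window ("remaking") outscores terms of lower frequency in that window, and never appears in the scores of windows that lack it.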
  39. Sustained Interest & Background Whispers. Some topics continue over time with a higher conversational context.
  40. Sustained Interest & Background Whispers. These terms are not salient by any standard term/document model.
  41. People Announce
      (12:05) Bastille71: OMG - Obama just messed up the oath - AWESOME! he’s human!
      (12:07) ryantherobot: LOL Obama messed up his inaugural oath twice! regardless, Obama is the president today! whoooo!
      (12:46) mattycus: RT @deelah: it wasn’t Obama that messed the oath, it was Chief Justice Roberts: http://is.gd/gAVo
      (12:53) dawngoldberg: @therichbrooks He flubbed the oath because Chief Justice screwed up the order of the words.
  42. People Reply
      (12:05) Bastille71: OMG - Obama just messed up the oath - AWESOME! he’s human!
      (12:07) ryantherobot: LOL Obama messed up his inaugural oath twice! regardless, Obama is the president today! whoooo!
      (12:46) mattycus: RT @deelah: it wasn’t Obama that messed the oath, it was Chief Justice Roberts: http://is.gd/gAVo
      (12:53) dawngoldberg: @therichbrooks He flubbed the oath because Chief Justice screwed up the order of the words.
  43. Two Metrics
  44. Two Metrics
  45. Statler: http://bit.ly/statler

  46. 2.4 Million Tweets on the VMA
  47. 2010 VMAs TOC

  48. Top 2 whisper terms during the 2010 MTV VMAs
  49. kanye (solid line), a**--le (dashed line)
  50. Will the scaling end? Does granularity change the picture? In this case the conversational patterns stayed the same. Firehose data is ≈26% dissimilar from other collection methods. What about people and deep motivations?
  51. I’m not talking about many-to-many anymore...
  52. Social Substrates Dig Deeper

  53. We forgot about this experience...how can we better understand this interaction? Photo: http://flickr.com/photos/chrisdonia/2781047956/
  54. Or the perhaps more common... Photo: http://flickr.com/photos/dotbenjamin/2996692553
  55. Social conversations happen around videos. Well, actually, people join a session and converse afterwards.
  56.

  57.

  58. What do we need to measure engagement?
      • Type of event (Zync player command or a normal chat message)
      • Anonymous hash (uniquely identifies the sender and the receiver, without exposing personal account data)
      • URL to the shared video
      • Timestamp for the event
      • The player time (with respect to the specific video) at the point the event occurred
      • The number of characters and the number of words typed (for chat messages)
      • Emoticons used in the chat message
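The event fields listed above fit naturally into a small record type. The field and function names here are illustrative, not Zync's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ZyncEvent:
    """One logged event in a shared-video session, mirroring the
    fields on the slide (names invented for illustration)."""
    event_type: str      # player command ("play", "pause", "scrub") or "chat"
    sender_hash: str     # anonymous hash identifying the sender
    receiver_hash: str   # anonymous hash identifying the receiver
    video_url: str       # URL to the shared video
    timestamp: float     # wall-clock time of the event (epoch seconds)
    player_time: float   # position in the video when the event fired
    n_chars: int = 0     # chat messages only
    n_words: int = 0     # chat messages only
    emoticons: List[str] = field(default_factory=list)

def chat_event(sender, receiver, url, ts, pos, message, emoticons=()):
    """Convenience constructor that derives the chat-message counts."""
    return ZyncEvent("chat", sender, receiver, url, ts, pos,
                     n_chars=len(message), n_words=len(message.split()),
                     emoticons=list(emoticons))

e = chat_event("a1", "b2", "http://example.com/v", 0.0, 12.5,
               "lol that was great", [":)"])
```

Keeping only hashes for the sender and receiver is what lets the engagement analysis run without exposing personal account data.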
  59. A Short Movie
  60. Percent of actions over time.
  61. Chat follows the video!

  62. Reciprocity
      • In 43.6% of the sessions the invitee played at least one video back to the session’s initiator.
      • 77.7% sharing reciprocation.
      • Pairs of people often exchanged more than one set of videos in a session.
      • In the Nonprofit, Technology, and Shows categories, the invitees shared more videos to the initiator (5:4, 9:7, and 5:2 respectively).
  63. CLASSIFICATION: How do we know what people are watching? How can we give them better things to watch?
  64. The Big Data Approach: get a ton of ratings! Hey it worked for Netflix...
  65. This was funny...wasn’t it? People & Blogs...all the way.
  66. There are more “subtle” problems with genre and human classification...and even harder problems with finding what’s funny...
  67.

  68.

  69.

  70. CLASSIFICATION BASED ON IMPLICIT CONNECTED SOCIAL ACTIONS. Five-star ratings have been the golden egg for recommendation systems so far; implicit human cooperative sharing activity works better. If comedy is a social construct, let’s train on the construct.
  71. Legacy online video: Tags, Video Duration, Ratings, Comments, View Count, Sharing Activity, Category, Recommendations
  72. Let’s not just throw data at the problem...
      Human perception and experience: Pilot Survey, Crowdflower/Amazon MT Survey, Interviews
      Computational classification: Naïve Bayes Classifier, GBDTs
  73. Research Questions. RQ1: How do people socially consume, perceive, and categorize videos (Comedy, in our study) at the time of creation and consumption? RQ2: Can a predictive model be built to automatically categorize media in a manner that is contextually and socially appropriate?
  74. 2 Surveys: the online clicky web form type...
  75. Survey One
      • We sent out a web survey to get judgements on 20 randomly chosen videos from a sample of videos. (43 complete responses)
      • In general, the human as a classifier categorizes at 60.9% accuracy.
      • For Comedy, their accuracy was 52.3%.
  76. Crowdflower/MTurk Survey Two: what’s funny vs. what’s comedy?
  77. Comedy vs. Funny Judgements
  78. A few videos stood out

  79. “This is very funny video of a baby laughing. Not sure it should be categorize as a comedy.” (Respondent 16 on Video 8)
      “Its funny but its only an animal getting startled by her sneezing baby. It is not comedy because the actions were not specifically done to make us laugh.” (Respondent 9 on Video 9)
  80. 11 interviews, 1 hour apiece
  81. Online videos are difficult to classify. “Films in general tend to fit categories more narrowly...If NetFlix says that something is a screwball comedy, I know what to expect. I think on YouTube the range of possibilities for the content of the videos is much less constrained. Cause it might literally be a segment from a film or it could be something shot on a cheap digital camera.”
  82. Comedy has internal cues & signals. “If I ever hear the laughter track, it doesn’t even have to be funny, it’s intended to be Comedy. The style of it...even if it’s a Music Video.” “Anything comedy is always impressed upon you with laughter in the background or some funny accompanying music...Those contextual cues.”
  83. Comedy has social contexts. “There’s a context that you need to have for something to be identified as comedy … when I see a video that I have no context for, I don’t know whether to identify it as funny. But if people are interacting with it in a way that makes me believe that it’s funny. Same thing for the wedding dance (JK wedding video) … my interaction with it is, ‘people are saying that this is funny.’”
  84. The “Comedy” label itself is a signal. “They (the uploaders of videos) are uploading things and categorizing it as ‘Comedy’ because they are proposing that there’s something funny in it. Even though the content itself may not be ‘Comedy’.”
  85.

  86.

  87.

  88. FIRST ORDER DATA WASN’T PRETTY. Phone in your favorite ML technique.
  89. Used and Unused Data
      Used: YouTube (Duration (video), Views, Rating*); Zync (Duration (session)*, # of Play/Pause*, # of Scrubs*, # of Chats*)
      Not used: YouTube (Tags, Comments, Favorites); Zync (Emoticons, User ID data, # of Sessions, # of Loads)
  90. Classification Using Conversation Signals
      Random Chance: 23.0%
      You Tube Features: 14.6%
      Humans: 60.9%
      YouTube Features: 75.9%
      Zync Features: 87.8%
      Social sharing patterns from Zync are highly predictive of content type. Here we can predict whether a video is Sports, Comedy, Entertainment, People, or Film just from how the video is shared. This brings better recommendations for content and advertising.
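Slide 72 names Naïve Bayes and GBDTs as the models; below is a minimal from-scratch Gaussian Naïve Bayes over Zync-style session counts. The feature layout (play/pause, scrubs, chats) and the toy training data are invented for illustration, not the study's data:

```python
import math
from collections import defaultdict

class GaussianNB:
    """Minimal Gaussian Naive Bayes: fit a per-class mean/variance for
    each numeric feature, then pick the class with the highest
    log-posterior under independent Gaussian likelihoods."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.stats, self.prior = {}, {}
        for label, rows in groups.items():
            per_feature = []
            for col in zip(*rows):
                mu = sum(col) / len(col)
                var = max(sum((v - mu) ** 2 for v in col) / len(col), 1e-6)
                per_feature.append((mu, var))
            self.stats[label] = per_feature
            self.prior[label] = len(rows) / len(X)
        return self

    def predict(self, row):
        def log_posterior(label):
            ll = math.log(self.prior[label])
            for x, (mu, var) in zip(row, self.stats[label]):
                ll += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
            return ll
        return max(self.prior, key=log_posterior)

# toy sessions: [# of play/pause, # of scrubs, # of chats]
X = [[2, 1, 30], [3, 2, 25], [10, 8, 2], [12, 9, 1]]
y = ["comedy", "comedy", "sports", "sports"]
model = GaussianNB().fit(X, y)
```

The toy split encodes the slide's intuition: chat-heavy sessions look like comedy, scrub-heavy sessions look like sports, and the classifier separates them from sharing behavior alone.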
  91. What about Viral Content? Viral is a genre too.
  92. Viral content can be identified similarly, and more accurately, at 98.5% (F1 = 0.78). Does a video have over 10M views? We did this using data from 10 people in 5 shared sessions. The drawback is you need a synchronous connected experience.
  93. Big Data, Instrumentation, and Social Motivations
      • More data isn’t necessarily better data.
      • Corporate design & building patterns can deliver you bad data.
      • Big Data tends to answer “what can we do with all this data” but not “how can we create and properly instrument the world”.
      • People and their communication patterns can be found and utilized effectively, often requiring significantly less data to solve similar problems.
  94. New Directions
  95. New Interestingness: can we get to hyper-personalization?
  96. Buildings account for 48% of all GHG emissions. Farming is 18%. Transportation is 14%. Photos courtesy of AutoDesk Research.
  97. Design and Interaction Meets Big Environments and Big Data, with AutoDesk and MobileLife/KTH. (Diagram labels: Building, Floor Area, Occupant, Sensor.) Photos courtesy of AutoDesk Research.
  98. Fin. Thanks to J. Yew, L. Kennedy, E. Churchill, R. Sheppard, A. Khan, N. Diakopoulos, A. Brooks, Y. Liu, S. Pentland, J. Antin, J. Dunning, Chloe S., Marc S., & M. Cameron J.
      • Viral Actions: Predicting Video View Counts Using Synchronous Sharing Behaviors. David A. Shamma; Jude Yew; Lyndon Kennedy; Elizabeth F. Churchill. ICWSM 2011, AAAI, 2011.
      • Knowing Funny: Genre Perception and Categorization in Social Video Sharing. Jude Yew; David A. Shamma; Elizabeth F. Churchill. CHI 2011, ACM, 2011.
      • Peaks and Persistence: Modeling the Shape of Microblog Conversations. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. CSCW 2011, ACM, 2011.
      • Know Your Data: Understanding Implicit Usage versus Explicit Action in Video Content Classification. Jude Yew; David A. Shamma. Electronic Imaging, IS&T/SPIE, 2011.
      • Beyond Freebird. David A. Shamma. XRDS: Crossroads, ACM, 2010.
      • Conversational Shadows: Describing Live Media Events Using Short Messages. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. ICWSM, AAAI, 2010.
      • Characterizing Debate Performance via Aggregated Twitter Sentiment. Nicholas A. Diakopoulos; David A. Shamma. CHI 2010, ACM, 2010.
      • Statler: Summarizing Media through Short-Message Services. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. CSCW, 2010.
      • Tweet the Debates: Understanding Community Annotation of Uncollected Sources. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. ACM Multimedia, ACM, 2009.
      • Understanding the Creative Conversation: Modeling to Engagement. David A. Shamma; Dan Perkel; Kurt Luther. Creativity and Cognition, ACM, 2009.
      • Spinning Online: A Case Study of Internet Broadcasting by DJs. David A. Shamma; Elizabeth Churchill; Nikhil Bobb; Matt Fukuda. Communities & Technologies, ACM, 2009.
      • Zync with Me: Synchronized Sharing of Video through Instant Messaging. David A. Shamma; Yiming Liu. In Social Interactive Television: Immersive Shared Experiences and Perspectives (Pablo Cesar, David Geerts, Konstantinos Chorianopoulos, eds.), Information Science Reference, IGI Global, 2009.
      • Enhancing Online Personal Connections through the Synchronized Sharing of Online Video. Shamma, D. A.; Bastéa-Forte, M.; Joubert, N.; Liu, Y. Human Factors in Computing Systems (CHI), ACM, 2008.
      • Supporting Creative Acts Beyond Dissemination. David A. Shamma; Ryan Shaw. Creativity and Cognition, ACM, 2007.
      • Watch What I Watch: Using Community Activity to Understand Content. David A. Shamma; Ryan Shaw; Peter Shafton; Yiming Liu. ACM Multimedia Workshop on Multimedia Information Retrieval (MIR), ACM, 2007.
      • Zync: The Design of Synchronized Video Sharing. Yiming Liu; David A. Shamma; Peter Shafton; Jeannie Yang. Designing for User eXperiences, ACM, 2007.