
Social Substrates

About people and the data they make.

David Shamma

April 09, 2012

Transcript

  1. SOCIAL SUBSTRATES:
    PEOPLE AND
    THE DATA THEY MAKE
    david @ayman shamma
    Internet Experiences
    Microeconomics & Social Systems
    Yahoo! Research

  2. Nothing ever happens in a vacuum.
    Elizabeth Churchill
    Jude Yew
    Lyndon Kennedy

  3.
    Big data...there’s stacks and stacks of it...it helps
    us solve problems.
    http://www.flickr.com/photos/mwichary/3369638710/
    http://www.flickr.com/photos/mwichary/2355790455/
    http://www.flickr.com/photos/jagadish/3072156349/

  4. People Using Technology, Sharing & Communicating →
    Massive Amounts of Data are Stored →
    Hadoop, Pig, ML

  5. The Typical “Big Data” Approach
    People Using Technology, Sharing & Communicating →
    Massive Amounts of Data are Stored →
    Hadoop, Pig, ML
    (diagram label: “Big Data”)

  6. What about motivations? Why did people make all this data?
    People Using Technology, Sharing & Communicating →
    Massive Amounts of Data are Stored →
    Hadoop, Pig, ML
    (diagram labels: “Motivations”, “Big Data”)


  8. Dolores Park, San Francisco, 2006

  9. Social Conversations Happen Around Media
    Dolores Park, San Francisco, 2006

  10. Media is a Social
    Experience

  11. People Tweet While
    They Watch


  13. INDIRECT ANNOTATION
    RT: @jowyang If you are watching the debate you’re
    invited to participate in #tweetdebate Here is the 411
    http://tinyurl.com/3jdy67
    Sept 26, 2008 18:23 EST

  14. ANATOMY OF A TWEET
    RT: @jowyang If you are watching the debate you’re
    invited to participate in #tweetdebate Here is the 411
    http://tinyurl.com/3jdy67
    Repeated (retweet) content starts with RT
    Address other users with an @
    Tags start with #
    Rich Media embeds via links
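The four markers above are easy to pull out mechanically. Below is a minimal sketch in Python; the regular expressions and the parse_tweet helper are illustrative, not the parser used in this work.

```python
import re

# Illustrative patterns for the four elements called out above.
RT_RE      = re.compile(r'^RT\b:?')       # repeated (retweet) content starts with RT
MENTION_RE = re.compile(r'@(\w+)')        # other users addressed with an @
HASHTAG_RE = re.compile(r'#(\w+)')        # tags start with #
LINK_RE    = re.compile(r'https?://\S+')  # rich media embedded via links

def parse_tweet(text):
    """Return the anatomy of a tweet as a small dict."""
    return {
        'is_retweet': bool(RT_RE.match(text)),
        'mentions': MENTION_RE.findall(text),
        'hashtags': HASHTAG_RE.findall(text),
        'links': LINK_RE.findall(text),
    }

print(parse_tweet("RT: @jowyang If you are watching the debate you're invited "
                  "to participate in #tweetdebate Here is the 411 "
                  "http://tinyurl.com/3jdy67"))
# {'is_retweet': True, 'mentions': ['jowyang'], 'hashtags': ['tweetdebate'],
#  'links': ['http://tinyurl.com/3jdy67']}
```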

  15. A long time ago...
    • Tweet Data circa 2008
    • Three hashtags: #current #debate08 #tweetdebate
    • 97 mins debate + 53 mins following = 2.5 hours total.
    • 3,238 tweets from 1,160 people.
    • 1,824 tweets from 647 people during the debate.
    • 1,414 tweets from 738 people post debate.

  16. Volume of Tweets by Minute
    Crawled from the Twitter RESTful search API.

  17. Tweets During and After the Debates
    Conversation swells after the debate.

  18. Volume of Conversation Follows the Debate
    Post debate

  19. Does Conversation Follow After a Segment?
    Think of Isaac Newton.
    Post segment?

  20. Do the roots of f′(x) find segmentation?

  21. Automatic Segment Detection
    We use Newton’s Method to find extrema outside μ±σ as candidate
    markers. Any marker that follows a marker on the previous minute is
    ignored.
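A rough numerical sketch of this idea, using a discrete gradient and sign changes in place of a true Newton's-method root finder; the function name and the toy volume series are mine, not from the study.

```python
import numpy as np

def candidate_markers(volume_per_minute):
    """Flag minutes where the tweet-volume curve has an extremum (a root of f')
    and the volume lies outside the mean +/- one standard deviation."""
    v = np.asarray(volume_per_minute, dtype=float)
    dv = np.gradient(v)                        # discrete stand-in for f'(x)
    mu, sigma = v.mean(), v.std()

    markers = []
    for t in range(1, len(v)):
        sign_change = dv[t - 1] * dv[t] <= 0   # f' crosses zero: local extremum
        outside_band = abs(v[t] - mu) > sigma  # extremum falls outside mu +/- sigma
        if sign_change and outside_band:
            if markers and markers[-1] == t - 1:
                continue                       # ignore a marker right after another
            markers.append(t)
    return markers

# Toy series: a spike around minute 5 and a lull around minute 12.
volume = [40, 42, 45, 60, 90, 120, 85, 55, 48, 44, 30, 20, 18, 25, 40]
print(candidate_markers(volume))  # -> [5, 12]
```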

  22. Automatic Segment Detection with 92% Accuracy
    When compared to CSPAN’s editorialized Debate Summary ± 1 minute.

  23. Directed Communication via @mentions
    John tweets: “Hey @mary, my person is winning!” This adds a directed
    edge from John to Mary.
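A small sketch of the graph construction and the centrality measure used on the next slide; it assumes networkx is available, and the mention pairs are invented for illustration.

```python
import networkx as nx

# Each @mention adds a directed edge from the tweet's author to the mentioned user.
mentions = [
    ("john", "mary"),           # "Hey @mary, my person is winning!"
    ("mary", "john"),
    ("john", "barackobama"),
    ("mary", "mccain"),
    ("bob", "john"),
    ("alice", "newshour"),
    ("newshour", "barackobama"),
    ("newshour", "mccain"),
]

G = nx.DiGraph()
G.add_edges_from(mentions)

# Eigenvector centrality over the mention graph; on the next slide the
# highest-ranked accounts during the first 2008 debate were Obama, the
# NewsHour (Jim Lehrer), and McCain.
centrality = nx.eigenvector_centrality_numpy(G)
for user, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{user:>12s}  {score:.3f}")
```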

  24. Barack, NewsHour (Jim Lehrer), & McCain
    High Eigenvector Centrality Figures on Twitter from the First US Presidential
    Debate of 2008.

  25. Tweets to Terms
    Common stems in bold-italic.

  26. Tweets are Reaction not Content

  27. Sentiment/Affect judgements from the debate.
    [1]
    Diakopoulos, N. A., and Shamma, D. A. Characterizing debate performance via
    aggregated twitter sentiment. In CHI ’10: Proceedings of the 28th international
    conference on Human factors in computing systems (New York, NY, USA, 2010), ACM,
    pp. 1195–1198.

  28. Sentiment/Affect judgements by candidate.
    [1]
    Diakopoulos, N. A., and Shamma, D. A. Characterizing debate performance via
    aggregated twitter sentiment. In CHI ’10: Proceedings of the 28th international
    conference on Human factors in computing systems (New York, NY, USA, 2010), ACM,
    pp. 1195–1198.

  29. It’s still the same today...

  30. Inauguration 2009 http://www.flickr.com/photos/twistedart/3212723019/

  31. A little less long ago...with more data...
    • Twitter Data circa 2009: Data Mining Feed
    • 600 Tweets per minute
    • 90 Minutes
    • 54,000 Tweets from 1.5 hours
    • Constant data rate means the volume method doesn’t work.

  32. Data Mining Feed
    53,000 Tweets @ 600 per minute

  33. Drop in @conversation as onset

  34. http://www.flickr.com/photos/wvs/3833148925/

  35. Fewer @s means fewer chars

  36. Terms as topic points
    Using a TF/IDF window of 5 mins

  37. Terms as topic points
    Using a TF/IDF window of 5 mins, find terms that are only relevant to that
    slice, subtract out salient, non-stop listed terms like: Obama, president, and
    speech.
    No significant occurrence of “remaking”.

  38. Terms as Sustained Interest
    Using a TF/IDF window of 5 mins, find terms that are only relevant to that
    slice, subtract out salient, non-stop listed terms like: Obama, president, and
    speech.
    No significant occurrence of “remaking”.
    Contains an occurrence of “remaking”, less significant than peak.
    Peak occurrence of “remaking”.
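A toy sketch of the windowed TF/IDF idea: each 5-minute slice of tweets becomes one “document”, terms are scored per slice, and globally salient terms (obama, president, speech) are held out. The slices and scoring details here are illustrative only.

```python
import math
from collections import Counter

BACKGROUND = frozenset({"obama", "president", "speech"})  # salient, non-stoplisted terms

def windowed_terms(slices, top_k=3):
    """Score terms per 5-minute slice with TF-IDF, treating each slice as a document."""
    docs = [Counter(w for tweet in s for w in tweet.lower().split()) for s in slices]
    n = len(docs)
    df = Counter(term for d in docs for term in d)   # slice (document) frequency
    top = []
    for d in docs:
        total = sum(d.values())
        scores = {t: (c / total) * math.log(n / df[t])
                  for t, c in d.items() if t not in BACKGROUND}
        top.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return top

slices = [
    ["remaking america is the theme", "remaking everything he said"],
    ["great speech by obama", "the crowd is huge today"],
    ["remaking came up again", "president obama on the economy"],
]
print(windowed_terms(slices))  # terms common to every slice score zero and drop out
```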

  39. Sustained Interest & Background Whispers.
    Some topics continue over time with a higher conversational context.

  40. Sustained Interest & Background Whispers.
    These terms are not salient by any standard term/document model.
    0.35
    0.25

  41. People Announce
    (12:05) Bastille71: OMG - Obama just messed up the oath -
    AWESOME! he’s human!
    (12:07) ryantherobot: LOL Obama messed up his inaugural
    oath twice! regardless, Obama is the president today!
    whoooo!
    (12:46) mattycus: RT @deelah: it wasn’t Obama that
    messed the oath, it was Chief Justice Roberts: http://is.gd/gAVo
    (12:53) dawngoldberg: @therichbrooks He flubbed the oath
    because Chief Justice screwed up the order of the words.

  42. People Reply
    (12:05) Bastille71: OMG - Obama just messed up the oath -
    AWESOME! he’s human!
    (12:07) ryantherobot: LOL Obama messed up his inaugural
    oath twice! regardless, Obama is the president today!
    whoooo!
    (12:46) mattycus: RT @deelah: it wasn’t Obama that
    messed the oath, it was Chief Justice Roberts: http://is.gd/gAVo
    (12:53) dawngoldberg: @therichbrooks He flubbed the oath
    because Chief Justice screwed up the order of the words.

  43. Two Metrics


  45. Statler http://bit.ly/statler

  46. 2.4 Million Tweets on the VMAs

  47. 2010 VMAs TOC

  48. Top 2 whisper terms
    during the 2010 MTV VMAs

  49. kanye (solid line)
    a**--le (dashed line)

  50. Will the scaling end?
    Does granularity change the
    picture? In this case the
    conversational patterns stayed
    the same.
    Firehose data is ≈ 26% dissimilar
    from other collection methods.
    What about people and deep
    motivations?

  51. I’m not talking about many-to-many
    anymore...

  52. Social Substrates Dig Deeper

  53.
    We forgot about this
    experience...how can we better
    understand this interaction?
    http://flickr.com/photos/chrisdonia/2781047956/

  54.
    Or perhaps the more common...
    http://flickr.com/photos/dotbenjamin/2996692553

  55. Social Conversations
    happen around videos
    Well – actually people join a
    session and converse afterwards.



  58. What do we need to measure engagement?
    • Type of event
    (Zync player command or a normal chat message)
    • Anonymous hash
    (uniquely identifies the sender and the receiver, without
    exposing personal account data)
    • URL to the shared video
    • Timestamp for the event
    • The player time (with respect to the specific video) at the
    point the event occurred
    • The number of characters and the number words typed (for
    chat messages)
    • Emoticons used in the chat message
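The fields above amount to one small record per event. Here is a sketch of what that record could look like; the class and field names are mine, not the actual Zync log schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ZyncEvent:
    event_type: str                 # player command (play/pause/scrub) or chat message
    pair_hash: str                  # anonymous hash identifying sender and receiver
    video_url: str                  # URL of the shared video
    timestamp: float                # wall-clock time of the event
    player_time: float              # position in the video when the event occurred
    n_chars: Optional[int] = None   # chat messages only: characters typed
    n_words: Optional[int] = None   # chat messages only: words typed
    emoticons: List[str] = field(default_factory=list)  # emoticons used in the chat

event = ZyncEvent("chat", "a1b2c3d4", "http://example.com/video",
                  1242345600.0, 42.5, n_chars=18, n_words=4, emoticons=[":)"])
```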

  59. A Short Movie

  60. Percent of actions over time.

  61. Chat follows the video!

  62. Reciprocity
    • In 43.6% of the sessions, the invitee played at least one
    video back to the session’s initiator.
    • 77.7% sharing reciprocation.
    • Pairs of people often exchanged more than one set
    of videos in a session.
    • In the categories of Nonprofit, Technology, and
    Shows, the invitees shared more videos to the
    initiator (5:4, 9:7, and 5:2 respectively).

  63. CLASSIFICATION
    How do we know what people are watching?
    How can we give them better things to watch?

  64. The Big Data Approach:
    get a ton of ratings! Hey it worked for Netflix...

  65. This was funny...wasn’t it? People & Blogs...all the way

  66. There are more “subtle” problems with genre and
    human classification...and even harder problems
    with finding what’s funny...




  70. CLASSIFICATION BASED ON IMPLICIT
    CONNECTED SOCIAL ACTIONS
    5-star ratings have been the golden egg for recommendation systems so
    far; implicit human cooperative sharing activity works better.
    If comedy is a social construct, let’s train on the construct.

  71. Legacy online video
    Tags, Video duration, Ratings, Comments, View Count,
    Sharing Activity, Category, Recommendations

  72. Let’s not just throw data at the problem...
    Human perception and experience
    – Pilot Survey
    – Crowdflower/Amazon MT Survey
    – Interviews
    Computational classification
    – Naïve Bayes Classifier
    – GBDTs
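A minimal sketch of the computational side: training one of the two model families named above (a GBDT, via scikit-learn's GradientBoostingClassifier) on implicit Zync sharing signals to predict a video's category. The feature columns follow the Zync features listed on later slides; all values and labels here are invented toy data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# One row per shared video.
# Columns: session duration (s), # of play/pause, # of scrubs, # of chat messages.
X = np.array([
    [180,  2, 1,  4],
    [600, 10, 6, 25],
    [240,  3, 0,  2],
    [420,  7, 4, 15],
    [ 90,  1, 0,  1],
    [500,  9, 5, 20],
    [300,  4, 2,  8],
    [360,  5, 3, 10],
])
y = np.array(["People", "Comedy", "Sports", "Comedy",
              "Sports", "Comedy", "Entertainment", "People"])

clf = GradientBoostingClassifier(n_estimators=50, max_depth=2)
clf.fit(X, y)

# A sharing pattern that resembles the Comedy rows above.
print(clf.predict([[550, 8, 5, 22]]))
```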

  73. Research Questions
    RQ1:
    How do people socially consume, perceive, and
    categorize videos, as Comedy in our study, at the
    time of creation and consumption?
    RQ2:
    Can a predictive model be built to automatically
    categorize media in a manner that is contextually
    and socially appropriate?

  74. 2 Surveys
    The online clicky web form type...

  75. Survey One
    • We sent out a web survey to
    get judgements on 20 randomly
    chosen videos from a sample of
    videos. (43 complete
    responses)
    • In general, the human as a
    classifier categorizes at 60.9%
    accuracy.
    • For Comedy, their accuracy
    was 52.3%

  76. Survey Two: Crowdflower/MTurk
    What’s funny vs. what’s comedy?

  77. Comedy vs Funny
    Judgements

  78. A few videos stood out

  79.
    “This is very funny video of a baby
    laughing. Not sure it should be categorize
    as a comedy.”
    (Respondent 16 on Video 8)
    “Its funny but its only an animal getting
    startled by her sneezing baby. It is not
    comedy because the actions were not
    specifically done to make us laugh.”
    (Respondent 9 on Video 9)

  80. 11 Interviews
    1 hour apiece

  81. Online videos are difficult to classify
    “Films in general tend to fit categories more narrowly...If NetFlix says that
    something is a screwball comedy, I know what to expect. I think on YouTube
    the range of possibilities for the content of the videos is much less
    constrained. Cause it might literally be a segment from a film or it could be
    something shot on a cheap digital camera.”

  82. Comedy has internal cues & signals.
    “If I ever hear the laughter track, it doesn’t even have to be funny, it’s intended
    to be Comedy. The style of it...even if it’s a Music Video.”
    “Anything comedy is always impressed upon you with laughter in the
    background or some funny accompanying music...Those contextual cues.”

  83. Comedy has social contexts
    “There’s a context that you need to have for something to be identified as
    comedy … when I see a video that I have no context for, I don’t know whether
    to identify it as funny. But if people are interacting with it in a way that
    makes me believe that it’s funny. Same thing for the wedding dance (JK
    wedding video) … my interaction with it is, ‘people are saying that this is
    funny.’”

  84. The “Comedy” label itself is a signal
    “They (the uploaders of videos) are uploading things and categorizing it as
    ‘Comedy’ because they are proposing that there’s something funny in it.
    Even though the content itself may not be ‘Comedy’.”




  88. FIRST ORDER DATA WASN’T PRETTY
    Phone in your favorite ML technique.

  89. Used and Unused Data
    Used (YouTube): Duration (video), Views, Rating*
    Used (Zync): Duration (session)*, # of Play/Pause*, # of Scrubs*, # of Chats*
    Not used (YouTube): Tags, Comments, Favorites
    Not used (Zync): Emoticons, User ID data, # of Sessions, # of Loads

  90. Classification Using Conversation Signals
    Type                Accuracy
    Random Chance       23.0%
    You Tube Features   14.6%
    Humans              60.9%
    YouTube Features    75.9%
    Zync Features       87.8%
    Social sharing patterns from Zync are highly predictive of content type.
    Here we can predict if a video is Sports, Comedy, Entertainment, People,
    or Film just from how the video is shared.
    This brings better recommendations for content and advertising.

  91. What about Viral Content?
    Viral is a genre too.

  92. Viral Content can be identified similarly and more
    accurately at 98.5% (F1 = 0.78)
    Does a video have over 10M views? We did this using data from 10 people in 5
    shared sessions. The drawback is you need to use a synchronous connected
    experience.

  93. Big Data, Instrumentation, and Social Motivations
    • More data isn’t necessarily better data
    • Corporate design & building patterns can deliver you bad data.
    • Big Data tends to answer “what can we do with all this data” but not “how
    can we create and properly instrument the world”
    • People and their communication patterns can be found and utilized
    effectively, often requiring significantly less data to solve similar problems.

  94. New Directions

  95. New Interestingness: Can we get to hyper-personalization?

  96. Buildings account for 48% of all GHG emissions.
    Farming is 18%. Transportation is 14%.
    Photos courtesy of AutoDesk Research

  97. Design and Interaction Meets
    Big Environments and Big Data with AutoDesk and MobileLife/KTH
    Building, Floor, Area, Occupant, Sensor
    Photos courtesy of AutoDesk Research

  98. Fin.
    Thanks to J.Yew, L. Kennedy, E. Churchill, R. Sheppard, A. Khan, N. Diakopoulos,
    A. Brooks, Y. Liu, S. Pentland, J. Antin, J. Dunning, Chloe S.,
    Marc S., & M. Cameron J.
    Viral Actions: Predicting Video View Counts Using Synchronous Sharing Behaviors. David A. Shamma; Jude Yew; Lyndon Kennedy; Elizabeth F. Churchill. ICWSM 2011 - International AAAI Conference on Weblogs and Social Media, AAAI, 2011.
    Knowing Funny: Genre Perception and Categorization in Social Video Sharing. Jude Yew; David A. Shamma; Elizabeth F. Churchill. CHI 2011, ACM, 2011.
    Peaks and Persistence: Modeling the Shape of Microblog Conversations. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. CSCW 2011, ACM, 2011.
    Know Your Data: Understanding Implicit Usage versus Explicit Action in Video Content Classification. Jude Yew; David A. Shamma. Electronic Imaging, IS&T/SPIE, 2011.
    Beyond Freebird. David A. Shamma. XRDS: Crossroads, ACM, 2010, 2.
    Conversational Shadows: Describing Live Media Events Using Short Messages. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. International AAAI Conference on Weblogs and Social Media, AAAI, 2010.
    Characterizing Debate Performance via Aggregated Twitter Sentiment. Nicholas A. Diakopoulos; David A. Shamma. CHI 2010, ACM, 2010.
    Statler: Summarizing Media through Short-Message Services. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. CSCW, 2010.
    Tweet the Debates: Understanding Community Annotation of Uncollected Sources. David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. ACM Multimedia, ACM, 2009.
    Understanding the Creative Conversation: Modeling to Engagement. David A. Shamma; Dan Perkel; Kurt Luther. Creativity and Cognition, ACM, 2009.
    Spinning Online: A Case Study of Internet Broadcasting by DJs. David A. Shamma; Elizabeth Churchill; Nikhil Bobb; Matt Fukuda. Communities & Technology, ACM, 2009.
    Zync with Me: Synchronized Sharing of Video through Instant Messaging. David A. Shamma; Yiming Liu; Pablo Cesar, David Geerts, Konstantinos Chorianopoulos. Social Interactive Television: Immersive Shared Experiences and Perspectives, Information Science Reference, IGI Global, 2009.
    Enhancing online personal connections through the synchronized sharing of online video. Shamma, D. A.; Bastéa-Forte, M.; Joubert, N.; Liu, Y. Human Factors in Computing Systems (CHI), ACM, 2008.
    Supporting creative acts beyond dissemination. David A. Shamma; Ryan Shaw. Creativity and Cognition, ACM, 2007.
    Watch what I watch: using community activity to understand content. David A. Shamma; Ryan Shaw; Peter Shafton; Yiming Liu. ACM Multimedia Workshop on Multimedia Information Retrieval (MIR), ACM, 2007.
    Zync: the design of synchronized video sharing. Yiming Liu; David A. Shamma; Peter Shafton; Jeannie Yang. Designing for User eXperiences, ACM, 2007.