
Flock: Hybrid Crowd-Machine Learning Classifiers

Presented at CSCW 2015

Hybrid crowd-machine learning classifiers are classification models that start from a written description of a learning goal, use the crowd to suggest predictive features and label data, and then weight those features with machine learning, producing accurate models built on human-understandable features. These hybrid classifiers enable fast prototyping of models that can improve on both algorithmic performance and human judgment, and can tackle tasks where automated feature extraction is not yet feasible. Flock, an interactive machine learning platform, instantiates this approach.

Justin Cheng

March 17, 2015

Transcript

  1. Flock
    Hybrid Crowd-Machine Learning Classifiers
    Justin Cheng @jcccf, Michael Bernstein @msbernst · Stanford University


  2. We rely on predictions every day
    Today’s weather is…
    If you liked…, you may also like…
    The hourly trending topics are…
    You may know these people…
    Is this email spam?


  3. Developing predictive models is hard!


  4. It’s time-consuming to figure out
    which features work.
    orange
    contains the word “cat”
    # of page-views
    # of likes
    time between edits
    head looking down
    pastel colors
    positive sentiment
    punctuation
    capitalization
    repetition
    Domingos, P. (CACM 2012)


  5. Identifying Useful Features
    Feature Generation / Annotation / Testing


  6. Identifying Useful Features
    Feature Generation / Annotation / Testing
    Did I miss an important feature?


  7. Identifying Useful Features
    Feature Generation / Annotation / Testing
    How do I extract this feature (automatically)?


  8. Identifying Useful Features
    Feature Generation / Annotation / Testing
    Was the effort worth it?


  9. We could help developers do this faster…


  10. But what if we add more humans?


  11. Embedding crowds inside
    machine learning architectures
    Works in domains where machines alone fail
    Allows for faster prototyping
    Automatically self-improving
    (And is more accurate)
    Flock


  12. Which is the lie?
    Truth Lie


  13. Many prediction tasks exist where
    neither humans nor machines do well!
    (But Flock can help!)


  14. Analogical encoding allows crowds
    to effectively generate features.


  15. Crowds can automatically improve
    a model when it performs poorly.


  16. Where we’re going
    Flock: a crowd-machine classifier
    Generating, labeling and evaluating features
    Evaluating human and machine performance


  17. Flock: automating the learning process
    Input: Examples
    Process: Feature Engineering (Feature Generation → Example Annotation → Model Evaluation)
    Output: Model

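    A minimal sketch of this loop in Python, assuming scikit-learn; crowd_suggest_features and crowd_annotate are hypothetical placeholders for the crowd tasks, not Flock’s actual API:

    # Flock-style loop: the crowd nominates features and annotates examples
    # against them, a learning algorithm weighs the features, and examples the
    # model gets wrong go back to the crowd for new feature ideas.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def build_model(examples, labels, crowd_suggest_features, crowd_annotate,
                    target_accuracy=0.8, max_rounds=3):
        features = crowd_suggest_features(examples, labels)  # e.g. "Is it well-organized?"
        for round_no in range(max_rounds):
            # Feature matrix: one row per example, one 0/1 column per crowd feature.
            X = [[crowd_annotate(example, f) for f in features] for example in examples]
            model = LogisticRegression(max_iter=1000).fit(X, labels)
            accuracy = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
            if accuracy >= target_accuracy or round_no == max_rounds - 1:
                return model, features
            # Model is weak: ask the crowd for more features on the examples it misses.
            misses = [i for i, (p, y) in enumerate(zip(model.predict(X), labels)) if p != y]
            features += crowd_suggest_features([examples[i] for i in misses],
                                               [labels[i] for i in misses])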

  18. But how do we use a crowd?


  19. Why use people at all?


  20. Why use people at all?
    Good at generating diverse ideas
    Andre, P., et al. (CSCW 2014), Yu, L., et al. (CHI 2014)


  21. Why use people at all?
    Can annotate arbitrary data


  22. Why use people at all?
    Poor at aggregating information
    Hammond, K. R., et al. (Psych. Rev. 1964), Dawes, R. (Am. Psych. 1971)


  23. People: good at generating diverse ideas; poor at aggregating information; can annotate arbitrary data
    Machines: great at weighing multiple factors; limited in feature expressiveness; can only annotate certain types of data


  26. Feature engineering in Flock
    Feature Generation Example Annotation Model Evaluation


  27. Flock leverages the complementary
    strengths of humans and machines


  28. Why not ask the crowd the prediction question directly?


  29. Why not ask the crowd the prediction question directly?
    Because we can be 10% more accurate using Flock.


  30. So, how do we generate features?
    Feature Generation Example Annotation Model Evaluation


  31. How do we get a crowd to suggest features?
    (*binary features)


  32. Can you tell a good Wikipedia article from a bad one?


  33. (image-only slide)

  34. What do you think makes this
    Wikipedia article a “Good Article”?


  35. What do you think makes this Wikipedia article a “Good Article”?
    “It makes me feel pleasant.”


  36. Analogical Encoding
    “Good” Article vs. “Bad” Article
    Gentner, D., et al. (J. Ed. Psych. 2003)


  37. How does the “good article”
    differ from the “bad article”?


  38. Broken down into organized sections.
    Thorough and well-organized.
    First article was poorly organized.
    First article has more in-depth photos.
    One article offers more photos.
    There are insufficient images.
    More pictures and descriptions
    More historically reliable references
    First article offers more in-depth photos

    Too many!


  39. Broken down into organized sections.
    Thorough and well-organized.
    First article was poorly organized.
    First article has more in-depth photos.
    One article offers more photos.
    There are insufficient images.
    Cluster 1
    Cluster 2
    Cluster 3


  40. Broken down into organized sections.
    Thorough and well-organized.
    First article was poorly organized.
    First article has more in-depth photos.
    One article offers more photos.
    There are insufficient images.
    Is this article well-organized?
    Does this article have photos?
    Are there insufficient images?


  41. Is this article well-organized?
    Does this article have photos?
    Are there insufficient images?


  42. Feature generation
    Compare positive and negative examples
    Cluster similar suggestions
    Generate a feature for each cluster

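    A minimal sketch of this step, assuming scikit-learn; TF-IDF plus k-means is an illustrative clustering choice, not necessarily the method Flock uses:

    # Cluster free-text crowd suggestions, then turn each cluster into one
    # candidate yes/no feature (e.g. "Is this article well-organized?").
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    suggestions = [
        "Broken down into organized sections.",
        "Thorough and well-organized.",
        "First article was poorly organized.",
        "First article has more in-depth photos.",
        "One article offers more photos.",
        "There are insufficient images.",
    ]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(suggestions)
    cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

    for cluster in sorted(set(cluster_ids)):
        members = [s for s, c in zip(suggestions, cluster_ids) if c == cluster]
        print(cluster, members)  # each cluster is rewritten as one binary question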

  43. How do we annotate our examples?
    Feature Generation Example Annotation Model Evaluation


  44. Does this article have photos?
    Yes
    No

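    Combining several workers’ yes/no answers into one value per (example, feature) pair is assumed here to use a simple majority vote; a minimal sketch under that assumption, not necessarily Flock’s exact scheme:

    # Majority-vote aggregation of redundant crowd answers into one 0/1
    # feature value per (example, feature) question; ties break toward "yes".
    from collections import defaultdict

    def aggregate_answers(answers):
        """answers: iterable of (example_id, feature, answer) with answer in {0, 1}."""
        votes = defaultdict(list)
        for example_id, feature, answer in answers:
            votes[(example_id, feature)].append(answer)
        return {key: int(sum(v) >= len(v) / 2) for key, v in votes.items()}

    # Three workers answer "Does this article have photos?" for one article:
    print(aggregate_answers([(7, "has photos", 1), (7, "has photos", 1), (7, "has photos", 0)]))
    # -> {(7, 'has photos'): 1}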

  45. How do we aggregate these features?
    Feature Generation Example Annotation Model Evaluation


  46. A learning algorithm aggregates features.
    Feature Generation → List of Features → Example Annotation → Feature Matrix → Machine Learning

  47. Three possible learning algorithms
    Logistic Regression: interpretable, scalable
    Decision Trees: most interpretable, but prone to over-fitting
    Random Forests: least interpretable and slower, but tends to be most accurate

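    A minimal comparison sketch, assuming scikit-learn; the feature matrix below is synthetic stand-in data and the hyperparameters are illustrative:

    # Compare the three candidate learners on a 0/1 crowd-feature matrix by
    # cross-validation; shallow trees stay readable, forests usually win on accuracy.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 8))  # 200 examples, 8 yes/no crowd features
    y = X[:, 0] | X[:, 1]                  # toy labels, for demonstration only

    candidates = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "decision tree": DecisionTreeClassifier(max_depth=4),
        "random forest": RandomForestClassifier(n_estimators=200),
    }
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")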

  48. (Decision tree diagram: splits on “Short Paragraphs?” with Yes/No branches)

  49. (Decision tree diagram, continued)

  50. (Decision tree diagram with crowd-nominated features: “Short Paragraphs?”, “Strong Intro?”, “Attractive Images?”, “Sounds complicated?”)

  51. Interface


  52. Interface


  53. Interface


  54. Humans and machines can work together!
    Feature Generation Example Annotation Model Evaluation


  56. Does all of this work?
    (Yes.)


  57. Evaluation
    Six domains: Paintings, Hotel Reviews, Wikipedia, Jokes, StackExchange, Lying
    200 examples in five of the domains; 400 in the sixth


  58. Metrics
    Baselines:
    Guessing: the crowd is directly asked the prediction question
    Automatic ML: a classifier trained using the best features from prior work
    Flock conditions:
    Flock: a classifier trained with crowd-nominated features
    Flock + ML: a classifier trained with crowd-nominated and machine features


  59. Performance
    (Bar chart: accuracy of Guessing, Automatic ML, Flock, and Flock + ML on a 0.5 to 0.9 scale; results shown on the following slides)

  60. Baseline performance is decent
    (Bar chart: accuracy of the Guessing and Automatic ML baselines across Lying, Jokes, StackEx, Paintings, Reviews, Wikipedia, and the median)
    Sen, S., et al. (CSCW 2015)

  61. Flock improves on humans/machines
    (Bar chart: accuracy of Guessing, Automatic ML, and Flock across Lying, Jokes, StackEx, Paintings, Reviews, Wikipedia, and the median)

  62. Flock + ML is 10% more accurate
    (Bar chart: accuracy of all four conditions across Lying, Jokes, StackEx, Paintings, Reviews, Wikipedia, and the median)

  63. What were the most predictive features?
    Paintings: Monet or Sisley? Monets are more likely to have flowers.
    Hotel Reviews: Truthful or deceptive? Truthful reviews have negative content.
    Wikipedia: Good or bad article? Good articles have strong introductions.
    Jokes: Popular joke or not? Popular jokes use repetition.
    StackExchange: Answer selected? Selected answers are well-written.
    Lying: Lie or truth? A lying person has shifty eyes.


  64. Shouldn’t ML systems be completely
    automatic?
    $$$
    $


  65. Learning how the crowd labels features
    Baseline: 1,000 examples, all annotated by the crowd
    Learned features: 200 examples annotated by the crowd + 800 annotated by a bigram model trained on the crowd’s annotations


  66. Learning how the crowd labels features
    (Bar chart: accuracies of 0.74 and 0.78 for the two conditions)
    Baseline: 1,000 examples, all annotated by the crowd
    Learned features: 200 examples annotated by the crowd + 800 annotated by a bigram model trained on the crowd’s annotations

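    A minimal sketch of that idea, assuming scikit-learn; the vectorizer settings and helper name are illustrative, not Flock’s exact setup:

    # Train a bigram text model on the crowd's answers to one feature question
    # (e.g. "Is this article well-organized?") for the first 200 examples, then
    # use it to annotate the remaining 800 examples automatically.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def learn_feature_labeler(texts, crowd_answers):
        """texts: raw example texts; crowd_answers: the crowd's 0/1 answers for one feature."""
        labeler = make_pipeline(
            CountVectorizer(ngram_range=(1, 2)),  # unigrams + bigrams
            LogisticRegression(max_iter=1000),
        )
        return labeler.fit(texts, crowd_answers)

    # labeler = learn_feature_labeler(first_200_texts, first_200_answers)
    # auto_answers = labeler.predict(remaining_800_texts)  # cheap annotations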

  67. Flock
    Hybrid Crowd-Machine Learning Classifiers
    Justin Cheng @jcccf, Michael Bernstein @msbernst · Stanford University hci.st/flock
