Flock: Hybrid Crowd-Machine Learning Classifiers

Presented at CSCW 2015

Hybrid crowd-machine learning classifiers are classification models that start with a written description of a learning goal, use the crowd to suggest predictive features and label data, and then weight those features with machine learning to produce models that are both accurate and built on human-understandable features. These hybrid classifiers enable fast prototyping of machine learning models, can improve on both algorithm performance and human judgment, and can accomplish tasks where automated feature extraction is not yet feasible. Flock, an interactive machine learning platform, instantiates this approach.

Justin Cheng

March 17, 2015

Transcript

  1. Flock: Hybrid Crowd-Machine Learning Classifiers. Justin Cheng @jcccf, Michael Bernstein @msbernst · Stanford University

  2. We rely on predictions every day: Today’s weather is… If you liked…, you may also like… The hourly trending topics are… You may know these people… Is this email spam?

  3. Developing predictive models is hard!

  4. It’s time-consuming to figure out which features work: orange, contains the word “cat”, # of page-views, # of likes, time between edits, head looking down, pastel colors, positive sentiment, punctuation, capitalization, repetition. Domingos, P. (CACM 2012)

  5. Identifying Useful Features: Feature Generation / Annotation / Testing

  6. Identifying Useful Features: Feature Generation / Annotation / Testing. Did I miss an important feature?

  7. Identifying Useful Features: Feature Generation / Annotation / Testing. How do I extract this feature (automatically)?

  8. Identifying Useful Features: Feature Generation / Annotation / Testing. Was the effort worth it?

  9. We could help developers do this faster…

  10. But what if we add more humans?

  11. Flock: embedding crowds inside machine learning architectures. Works in domains where machines alone fail; allows for faster prototyping; automatically self-improving (and is more accurate).

  12. Which is the lie? Truth Lie

  13. Many prediction tasks exist where neither humans nor machines do well! (But Flock can help!)

  14. Analogical encoding allows crowds to effectively generate features.

  15. Crowds can automatically improve a model when it performs poorly.

  16. Where we’re going: Flock, a crowd-machine classifier; generating, labeling, and evaluating features; evaluating human and machine performance.

  17. Flock: automating the learning process. Input: examples. Process: feature engineering (feature generation, example annotation, model evaluation). Output: a model.

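To make this pipeline concrete, here is a minimal sketch of what such a loop could look like in Python, assuming scikit-learn for the learning step. The crowd-facing calls (`ask_crowd_for_features`, `ask_crowd_to_label`) are hypothetical placeholders for crowdsourcing tasks, not Flock's actual API.

```python
# Minimal sketch of a Flock-style hybrid pipeline (assumes scikit-learn).
# ask_crowd_for_features / ask_crowd_to_label are hypothetical stand-ins
# for crowdsourcing tasks, not part of any real API.
import numpy as np
from sklearn.linear_model import LogisticRegression


def ask_crowd_for_features(goal, positive_examples, negative_examples):
    """Workers compare positive and negative examples and nominate yes/no features."""
    raise NotImplementedError("replace with a crowdsourcing task")


def ask_crowd_to_label(example, feature_question):
    """Workers answer one yes/no feature question (1 = yes, 0 = no) for one example."""
    raise NotImplementedError("replace with a crowdsourcing task")


def train_hybrid_classifier(goal, examples, labels):
    # 1. Feature generation: the crowd nominates human-understandable features.
    pos = [e for e, y in zip(examples, labels) if y == 1]
    neg = [e for e, y in zip(examples, labels) if y == 0]
    feature_questions = ask_crowd_for_features(goal, pos, neg)

    # 2. Example annotation: the crowd answers every feature question per example.
    X = np.array([[ask_crowd_to_label(e, q) for q in feature_questions]
                  for e in examples], dtype=float)

    # 3. Model building: machine learning weights the crowd's features.
    model = LogisticRegression().fit(X, labels)
    return feature_questions, model
```
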
  18. But how do we use a crowd?

  19. Why use people at all?

  20. Why use people at all? Good at generating diverse ideas. Andre, P., et al. (CSCW 2014); Yu, L., et al. (CHI 2014)

  21. Why use people at all? Can annotate arbitrary data

  22. Why use people at all? Poor at aggregating information. Hammond, K. R., et al. (Psych. Rev. 1964); Dawes, R. (Am. Psych. 1971)

  23. People: good at generating diverse ideas, poor at aggregating information, can annotate arbitrary data. Machines: great at weighing multiple factors, can only annotate certain types of data, limited in feature expressiveness.

  24. People: good at generating diverse ideas, poor at aggregating information, can annotate arbitrary data. Machines: great at weighing multiple factors, can only annotate certain types of data, limited in feature expressiveness.

  25. People: good at generating diverse ideas, poor at aggregating information, can annotate arbitrary data. Machines: great at weighing multiple factors, can only annotate certain types of data, limited in feature expressiveness.

  26. Feature engineering in Flock: Feature Generation / Example Annotation / Model Evaluation

  27. Flock leverages the complementary strengths of humans and machines

  28. Why not directly ask the crowd the prediction task?

  29. Why not directly ask the crowd the prediction task? Because we can be 10% more accurate using Flock.

  30. So, how do we generate features? (Feature Generation / Example Annotation / Model Evaluation)

  31. How do we get a crowd to suggest features? (*binary features)

  32. Can you tell a good Wikipedia article from a bad one?

  33. None

  34. What do you think makes this Wikipedia article a “Good Article”?

  35. “It makes me feel pleasant.” What do you think makes this Wikipedia article a “Good Article”?

  36. Analogical Encoding: “Good” Article vs. “Bad” Article. Gentner, D., et al. (J. Ed. Psych. 2003)

  37. How does the “good article” differ from the “bad article”?

  38. Broken down into organized sections. Thorough and well-organized. First article was poorly organized. First article has more in-depth photos. One article offers more photos. There are insufficient images. More pictures and descriptions. More historically reliable references. First article offers more in-depth photos. … Too many!

  39. Broken down into organized sections. Thorough and well-organized. First article was poorly organized. First article has more in-depth photos. One article offers more photos. There are insufficient images. (Grouped into Cluster 1, Cluster 2, Cluster 3, …)

  40. Broken down into organized sections. Thorough and well-organized. First article was poorly organized. First article has more in-depth photos. One article offers more photos. There are insufficient images. (Each cluster becomes a question: Is this article well-organized? Does this article have photos? Are there insufficient images? …)

  41. Is this article well-organized? Does this article have photos? Are there insufficient images? …

  42. Feature generation: compare positive and negative examples, cluster similar suggestions, and generate a feature for each cluster.

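The slides don't spell out how similar suggestions are clustered, so the following is only an illustrative sketch: embed the free-text suggestions with TF-IDF and group them with k-means, so that each cluster can be rewritten as one yes/no feature question. Flock's actual clustering procedure may differ.

```python
# Illustrative only: cluster free-text crowd suggestions so that each
# cluster can be turned into one candidate yes/no feature question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

suggestions = [
    "Broken down into organized sections.",
    "Thorough and well-organized.",
    "First article was poorly organized.",
    "First article has more in-depth photos.",
    "One article offers more photos.",
    "There are insufficient images.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(suggestions)
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for cid in sorted(set(cluster_ids)):
    members = [s for s, c in zip(suggestions, cluster_ids) if c == cid]
    print(cid, members)
    # Each cluster is then rewritten as a question, e.g.
    # "Is this article well-organized?" or "Does this article have photos?"
```
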
  43. How do we annotate our examples? (Feature Generation / Example Annotation / Model Evaluation)

  44. Does this article have photos? Yes / No

  45. How do we aggregate these features? (Feature Generation / Example Annotation / Model Evaluation)

  46. A learning algorithm aggregates features: the list of features and the example annotations form a feature matrix, which is fed to machine learning.

  47. Three possible learning algorithms: Logistic Regression (interpretable, scalable), Decision Trees (most interpretable, prone to over-fitting), Random Forests (least interpretable, slower, tends to be most accurate).

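As a toy illustration of this trade-off, the sketch below fits all three learners on a synthetic binary feature matrix standing in for crowd answers (1 = "Yes", 0 = "No"); the data and labeling rule are made up, not from the paper.

```python
# Synthetic illustration: compare the three candidate learners on a small
# matrix of crowd answers to yes/no feature questions (1 = yes, 0 = no).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))      # 200 examples, 5 crowd features
y = (X[:, 0] & X[:, 1]) | X[:, 2]          # toy labeling rule, for illustration

learners = [
    ("logistic regression", LogisticRegression()),
    ("decision tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]
for name, clf in learners:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```
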
  48. [Decision tree diagram: examples split on “Short Paragraphs?” into Yes/No branches]

  49. [Decision tree diagram, continued: the “Short Paragraphs?” split over the examples]

  50. [Decision tree diagram, completed: further splits on “Strong Intro?”, “Attractive Images?”, and “Sounds complicated?”]

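The tree sketched in these slides can be mimicked in a few lines: train a shallow decision tree on yes/no crowd answers and print its splits so each one reads as a feature question. Feature names follow the slides; the data and labeling rule below are synthetic, not from the paper.

```python
# Synthetic sketch of an interpretable decision tree over crowd features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["Short paragraphs?", "Strong intro?", "Attractive images?"]
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 3))
# Toy rule: an article is "good" if it has a strong intro without short
# paragraphs, or if it has attractive images.
y = ((1 - X[:, 0]) & X[:, 1]) | X[:, 2]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))
```
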
  51. Interface

  52. Interface

  53. Interface

  54. Humans and machines can work together! (Feature Generation / Example Annotation / Model Evaluation)

  55. Humans and machines can work together! (Feature Generation / Example Annotation / Model Evaluation)

  56. Does all of this work? (Yes.)

  57. Evaluation on six domains: Paintings, Hotel Reviews, Wikipedia, Jokes, StackExchange, Lying (200 examples each in five domains; 400 in one).

  58. Metrics. Baselines: Guessing (the crowd is directly asked the prediction question) and Automatic ML (a classifier trained using the best features from prior work). Flock: a classifier trained with crowd-nominated features. Flock + ML: a classifier trained with both crowd-nominated and machine features.

  59. Performance: [accuracy chart comparing Guessing, Automatic ML, Flock, and Flock + ML; accuracy axis 0.5–0.9]

  60. Baseline performance is decent. Sen, S., et al. (CSCW 2015) [accuracy chart: Guessing and Automatic ML across Lying, Jokes, StackExchange, Paintings, Reviews, Wikipedia, with median]

  61. Flock improves on humans/machines. [accuracy chart: Flock added across the six domains, with median]

  62. Flock + ML is 10% more accurate. [accuracy chart: Flock + ML added across the six domains, with median]

  63. What were the most predictive features? Paintings (Monet or Sisley?): Monets are more likely to have flowers. Hotel Reviews (truthful or deceptive?): truthful reviews have negative content. Wikipedia (good or bad article?): good articles have strong introductions. Jokes (popular joke or not?): popular jokes use repetition. StackExchange (answer selected?): selected answers are well-written. Lying (lie or truth?): a lying person has shifty eyes.

  64. Shouldn’t ML systems be completely automatic? (Crowd annotation costs $$$; automated annotation costs $.)

  65. Learning how the crowd labels features. Learned Features condition: 200 examples annotated by the crowd, plus 800 examples annotated using a bigram model trained on the crowd’s annotations. Baseline condition: 1000 examples, all annotated by the crowd.

  66. Learning how the crowd labels features. Accuracy: 0.74 with Learned Features (200 crowd-annotated + 800 model-annotated examples) vs. 0.78 for the Baseline (1000 crowd-annotated examples).

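One plausible reading of "a bigram model trained on crowd annotations" is a bag-of-unigrams-and-bigrams classifier that learns to answer a single feature question the way the crowd does, so that later examples can be annotated automatically. A minimal sketch under that assumption (Flock's exact model may differ):

```python
# Sketch: learn to imitate the crowd's answers to one feature question
# (e.g. "Is this article well-organized?") with a unigram+bigram model,
# then use it to annotate further examples without buying more labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def fit_feature_annotator(texts, crowd_answers):
    """texts: crowd-annotated examples; crowd_answers: 1 = 'Yes', 0 = 'No'."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),   # unigrams + bigrams
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, crowd_answers)
    return model

# annotator = fit_feature_annotator(crowd_texts, crowd_answers)
# auto_answers = annotator.predict(new_texts)  # stands in for crowd annotation
```
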
  67. Flock: Hybrid Crowd-Machine Learning Classifiers. Justin Cheng @jcccf, Michael Bernstein @msbernst · Stanford University. hci.st/flock