Flock: Hybrid Crowd-Machine Learning Classifiers

Presented at CSCW 2015

Hybrid crowd-machine learning classifiers are classification models that start with a written description of a learning goal, use the crowd to suggest predictive features and label data, and then weight those features with machine learning to produce models that are both accurate and built on human-understandable features. These hybrid classifiers enable fast prototyping of machine learning models, can improve on both algorithm performance and human judgment, and can accomplish tasks where automated feature extraction is not yet feasible. Flock, an interactive machine learning platform, instantiates this approach.

Justin Cheng

March 17, 2015

Transcript

  1. Flock: Hybrid Crowd-Machine Learning Classifiers. Justin Cheng @jcccf, Michael Bernstein @msbernst · Stanford University

  2. We rely on predictions every day: Today’s weather is… If you liked…, you may also like… The hourly trending topics are… You may know these people… Is this email spam?

  3. Developing predictive models is hard!

  4. It’s time-consuming to figure out which features work: orange, contains the word “cat”, # of page-views, # of likes, time between edits, head looking down, pastel colors, positive sentiment, punctuation, capitalization, repetition. Domingos, P. (CACM 2012)

  5. Identifying Useful Features: Feature Generation / Annotation / Testing

  6. Identifying Useful Features: Feature Generation / Annotation / Testing. Did I miss an important feature?

  7. Identifying Useful Features: Feature Generation / Annotation / Testing. How do I extract this feature (automatically)?

  8. Identifying Useful Features: Feature Generation / Annotation / Testing. Was the effort worth it?

  9. We could help developers do this faster…

  10. But what if we add more humans?

  11. Flock: embedding crowds inside machine learning architectures. Works in domains where machines alone fail; allows for faster prototyping; automatically self-improving (and is more accurate).

  12. Which is the lie? Truth Lie

  13. Many prediction tasks exist where neither humans nor machines do well! (But Flock can help!)

  14. Analogical encoding allows crowds to effectively generate features.

  15. Crowds can automatically improve a model when it performs poorly.

  16. Where we’re going: Flock, a crowd-machine classifier; generating, labeling, and evaluating features; evaluating human and machine performance.

  17. Flock: automating the learning process. Input: examples. Process: feature engineering (feature generation, example annotation, model evaluation). Output: a model.

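To make this pipeline concrete, here is a minimal sketch of what such a loop could look like in Python, assuming scikit-learn for the learning step. The crowd-facing calls (`ask_crowd_for_features`, `ask_crowd_to_label`) are hypothetical placeholders for crowdsourcing tasks, not Flock's actual API.

```python
# Minimal sketch of a Flock-style hybrid pipeline (assumes scikit-learn).
# ask_crowd_for_features / ask_crowd_to_label are hypothetical stand-ins
# for crowdsourcing tasks, not part of any real API.
import numpy as np
from sklearn.linear_model import LogisticRegression


def ask_crowd_for_features(goal, positive_examples, negative_examples):
    """Workers compare positive and negative examples and nominate yes/no features."""
    raise NotImplementedError("replace with a crowdsourcing task")


def ask_crowd_to_label(example, feature_question):
    """Workers answer one yes/no feature question (1 = yes, 0 = no) for one example."""
    raise NotImplementedError("replace with a crowdsourcing task")


def train_hybrid_classifier(goal, examples, labels):
    # 1. Feature generation: the crowd nominates human-understandable features.
    pos = [e for e, y in zip(examples, labels) if y == 1]
    neg = [e for e, y in zip(examples, labels) if y == 0]
    feature_questions = ask_crowd_for_features(goal, pos, neg)

    # 2. Example annotation: the crowd answers every feature question per example.
    X = np.array([[ask_crowd_to_label(e, q) for q in feature_questions]
                  for e in examples], dtype=float)

    # 3. Model building: machine learning weights the crowd's features.
    model = LogisticRegression().fit(X, labels)
    return feature_questions, model
```
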
  18. But how do we use a crowd?

  19. Why use people at all?

  20. Why use people at all? Good at generating diverse ideas. Andre, P., et al. (CSCW 2014); Yu, L., et al. (CHI 2014)

  21. Why use people at all? Can annotate arbitrary data

  22. Why use people at all? Poor at aggregating information. Hammond, K. R., et al. (Psych. Rev. 1964); Dawes, R. (Am. Psych. 1971)

  23. People: good at generating diverse ideas, poor at aggregating information, can annotate arbitrary data. Machines: great at weighing multiple factors, can only annotate certain types of data, limited in feature expressiveness.

  24. People: good at generating diverse ideas, poor at aggregating information, can annotate arbitrary data. Machines: great at weighing multiple factors, can only annotate certain types of data, limited in feature expressiveness.

  25. People: good at generating diverse ideas, poor at aggregating information, can annotate arbitrary data. Machines: great at weighing multiple factors, can only annotate certain types of data, limited in feature expressiveness.

  26. Feature engineering in Flock: Feature Generation / Example Annotation / Model Evaluation

  27. Flock leverages the complementary strengths of humans and machines

  28. Why not directly ask the crowd the prediction task?

  29. Why not directly ask the crowd the prediction task? Because we can be 10% more accurate using Flock.

  30. So, how do we generate features? (Feature Generation / Example Annotation / Model Evaluation)

  31. How do we get a crowd to suggest features? (*binary features)

  32. Can you tell a good Wikipedia article from a bad one?

  33. None

  34. What do you think makes this Wikipedia article a “Good Article”?

  35. “It makes me feel pleasant.” What do you think makes this Wikipedia article a “Good Article”?

  36. Analogical Encoding: “Good” Article vs. “Bad” Article. Gentner, D., et al. (J. Ed. Psych. 2003)

  37. How does the “good article” differ from the “bad article”?

  38. Broken down into organized sections. Thorough and well-organized. First article was poorly organized. First article has more in-depth photos. One article offers more photos. There are insufficient images. More pictures and descriptions. More historically reliable references. First article offers more in-depth photos. … Too many!

  39. Broken down into organized sections. Thorough and well-organized. First article was poorly organized. First article has more in-depth photos. One article offers more photos. There are insufficient images. (Grouped into Cluster 1, Cluster 2, Cluster 3, …)

  40. Broken down into organized sections. Thorough and well-organized. First article was poorly organized. First article has more in-depth photos. One article offers more photos. There are insufficient images. (Each cluster becomes a question: Is this article well-organized? Does this article have photos? Are there insufficient images? …)

  41. Is this article well-organized? Does this article have photos? Are there insufficient images? …

  42. Feature generation: compare positive and negative examples, cluster similar suggestions, and generate a feature for each cluster.

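The slides don't spell out how similar suggestions are clustered, so the following is only an illustrative sketch: embed the free-text suggestions with TF-IDF and group them with k-means, so that each cluster can be rewritten as one yes/no feature question. Flock's actual clustering procedure may differ.

```python
# Illustrative only: cluster free-text crowd suggestions so that each
# cluster can be turned into one candidate yes/no feature question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

suggestions = [
    "Broken down into organized sections.",
    "Thorough and well-organized.",
    "First article was poorly organized.",
    "First article has more in-depth photos.",
    "One article offers more photos.",
    "There are insufficient images.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(suggestions)
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for cid in sorted(set(cluster_ids)):
    members = [s for s, c in zip(suggestions, cluster_ids) if c == cid]
    print(cid, members)
    # Each cluster is then rewritten as a question, e.g.
    # "Is this article well-organized?" or "Does this article have photos?"
```
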
  43. How do we annotate our examples? (Feature Generation / Example Annotation / Model Evaluation)

  44. Does this article have photos? Yes / No

  45. How do we aggregate these features? (Feature Generation / Example Annotation / Model Evaluation)

  46. A learning algorithm aggregates features: the list of features and the example annotations form a feature matrix, which is fed to machine learning.

  47. Three possible learning algorithms: Logistic Regression (interpretable, scalable), Decision Trees (most interpretable, prone to over-fitting), Random Forests (least interpretable, slower, tends to be most accurate).

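As a toy illustration of this trade-off, the sketch below fits all three learners on a synthetic binary feature matrix standing in for crowd answers (1 = "Yes", 0 = "No"); the data and labeling rule are made up, not from the paper.

```python
# Synthetic illustration: compare the three candidate learners on a small
# matrix of crowd answers to yes/no feature questions (1 = yes, 0 = no).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))      # 200 examples, 5 crowd features
y = (X[:, 0] & X[:, 1]) | X[:, 2]          # toy labeling rule, for illustration

learners = [
    ("logistic regression", LogisticRegression()),
    ("decision tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]
for name, clf in learners:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```
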
  48. [Decision tree diagram: examples split on “Short Paragraphs?” into Yes/No branches]

  49. [Decision tree diagram, continued: the “Short Paragraphs?” split over the examples]

  50. [Decision tree diagram, completed: further splits on “Strong Intro?”, “Attractive Images?”, and “Sounds complicated?”]

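The tree sketched in these slides can be mimicked in a few lines: train a shallow decision tree on yes/no crowd answers and print its splits so each one reads as a feature question. Feature names follow the slides; the data and labeling rule below are synthetic, not from the paper.

```python
# Synthetic sketch of an interpretable decision tree over crowd features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["Short paragraphs?", "Strong intro?", "Attractive images?"]
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 3))
# Toy rule: an article is "good" if it has a strong intro without short
# paragraphs, or if it has attractive images.
y = ((1 - X[:, 0]) & X[:, 1]) | X[:, 2]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))
```
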
  51. Interface

  52. Interface

  53. Interface

  54. Humans and machines can work together! (Feature Generation / Example Annotation / Model Evaluation)

  55. Humans and machines can work together! (Feature Generation / Example Annotation / Model Evaluation)

  56. Does all of this work? (Yes.)

  57. Evaluation on six domains: Paintings, Hotel Reviews, Wikipedia, Jokes, StackExchange, Lying (200 examples each in five domains; 400 in one).

  58. Metrics. Baselines: Guessing (the crowd is directly asked the prediction question) and Automatic ML (a classifier trained using the best features from prior work). Flock: a classifier trained with crowd-nominated features. Flock + ML: a classifier trained with both crowd-nominated and machine features.

  59. Performance: [accuracy chart comparing Guessing, Automatic ML, Flock, and Flock + ML; accuracy axis 0.5–0.9]

  60. Baseline performance is decent. Sen, S., et al. (CSCW 2015) [accuracy chart: Guessing and Automatic ML across Lying, Jokes, StackExchange, Paintings, Reviews, Wikipedia, with median]

  61. Flock improves on humans/machines. [accuracy chart: Flock added across the six domains, with median]

  62. Flock + ML is 10% more accurate. [accuracy chart: Flock + ML added across the six domains, with median]

  63. What were the most predictive features? Paintings (Monet or Sisley?): Monets are more likely to have flowers. Hotel Reviews (truthful or deceptive?): truthful reviews have negative content. Wikipedia (good or bad article?): good articles have strong introductions. Jokes (popular joke or not?): popular jokes use repetition. StackExchange (answer selected?): selected answers are well-written. Lying (lie or truth?): a lying person has shifty eyes.

  64. Shouldn’t ML systems be completely automatic? (Crowd annotation costs $$$; automated annotation costs $.)

  65. Learning how the crowd labels features. Learned Features condition: 200 examples annotated by the crowd, plus 800 examples annotated using a bigram model trained on the crowd’s annotations. Baseline condition: 1000 examples, all annotated by the crowd.

  66. Learning how the crowd labels features. Accuracy: 0.74 with Learned Features (200 crowd-annotated + 800 model-annotated examples) vs. 0.78 for the Baseline (1000 crowd-annotated examples).

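One plausible reading of "a bigram model trained on crowd annotations" is a bag-of-unigrams-and-bigrams classifier that learns to answer a single feature question the way the crowd does, so that later examples can be annotated automatically. A minimal sketch under that assumption (Flock's exact model may differ):

```python
# Sketch: learn to imitate the crowd's answers to one feature question
# (e.g. "Is this article well-organized?") with a unigram+bigram model,
# then use it to annotate further examples without buying more labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def fit_feature_annotator(texts, crowd_answers):
    """texts: crowd-annotated examples; crowd_answers: 1 = 'Yes', 0 = 'No'."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),   # unigrams + bigrams
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, crowd_answers)
    return model

# annotator = fit_feature_annotator(crowd_texts, crowd_answers)
# auto_answers = annotator.predict(new_texts)  # stands in for crowd annotation
```
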
  67. Flock: Hybrid Crowd-Machine Learning Classifiers. Justin Cheng @jcccf, Michael Bernstein @msbernst · Stanford University. hci.st/flock