Measuring Crowdsourcing Effort with Error-Time Curves

Measuring Crowdsourcing Effort with Error-Time Curves
Justin Cheng @jcccf / Stanford, Jaime Teevan @jteevan / Microsoft Research, Michael Bernstein @msbernst / Stanford
http://hci.st/eta

Crowdsourcing allows large numbers of people to accomplish tasks at
a global scale. Kittur, A., et al. CSCW (2013)

Example tasks: copyediting, categorization, retrieval, labeling, surveys, experiments.

But it’s difficult to design and price tasks well.

Which task design is better? Prompt: What is this person doing?
Tagging: free-text answer (the slide shows “climbin” being typed)
Choose from 8 options: Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping

Which task design is better? Prompt: What is this person doing?
Choose from 4 options: Brushing, Applauding, Drinking, Climbing
Choose from 8 options: Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping

Requesters end up pricing tasks arbitrarily.

Requesters tend to underestimate the effort required to complete tasks.
Hinds, P. Journal of Experimental Psychology: Applied (1999)

Workers are hard-pressed to figure out which tasks are
worth their time.

Why not just measure how long workers take to complete
a task?

Why not just ask workers how difficult they thought a
task was? Many existing measures are imprecise.

To reliably determine task difficulty, we need a robust, objective
measure of effort.

ETA (error-time area) is a continuous, absolute, data-driven measure
of task effort.

ETA (error-time area) models the relationship between time and worker
error rate.

Why use ETA? Requesters can use ETA to compare task
designs and iterate towards better ones, as well as objectively price tasks. Workers can identify tasks worth their time, and have a guide for how much time they should spend on a task.

Overview
1. Error-Time Curves (and ETA)
2. Evaluating ETA and other measures
3. ETA in action

Understanding Error-Time Curves

[Figure: error-time curve. x-axis: Time Taken; y-axis: Error Rate; ETA is the area under the curve.] How do we generate this?

Generating a task’s ETA
1. Have workers complete tasks given different time limits. Example prompt: “Tag this image.” shown with a countdown (5 seconds left… 4 seconds left… 3 seconds left… 2 seconds left… 1 second left… Time’s up!).

Time limits shown, each with a “Tag this image.” prompt: 1s, 2s, 4s, 6s, 8s, 10s, 16s. Image credits: https://www.flickr.com/photos/[jking89/4572668505, jking89/4572668505, manoftaste-de/9563451348, jfh686/3613641379, patdavid/5568423570, dj-dwayne/6056431256, rsmith11235/9254525480]

Workers also see Practice Questions. Time limits shown on this slide: 2s, 16s, 6s, 1s, … Image credits: https://www.flickr.com/photos/[mindwhisperings/5874135107, sugarhiccuphiccup/4808600654, sunsward7/8078455200]

Generating a task’s ETA
2. Fit a curve to the recorded data. [Figure: x-axis: Time Taken; y-axis: Error Rate.]

Example data points (Time Taken, Error Rate):
1s: 1.0 (20 / 20 wrong)
2s: .90 (18 / 20 wrong)
…
8s: .00 (0 / 20 wrong)

Generating a task’s ETA
3. Calculate the area under the curve (and other measures). [Figure: x-axis: Time Taken; y-axis: Error Rate; the shaded area under the curve is the ETA.]

Effective Time: the time at which the curve reaches a given error rate (in the figure, an error rate of .10 at 4s).

Effective Wage = Effective Time × Wage Rate
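
The three steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the slides do not give the form of the fitted curve, so straight-line interpolation between observed points stands in for it here. The three data points are the ones shown on the slides (1s, 2s, 8s), the .10 threshold is the value marked on the Effective Time figure, and `WAGE_RATE_CENTS_PER_SEC` is a hypothetical value chosen only for illustration.

```python
# Wrong answers out of 20 at each time limit (seconds), as shown on the slides.
WRONG_BY_LIMIT = {1: 20, 2: 18, 8: 0}
RESPONSES_PER_LIMIT = 20

THRESHOLD = 0.10               # error rate marked on the Effective Time figure
WAGE_RATE_CENTS_PER_SEC = 0.4  # hypothetical wage rate, not from the slides


def error_curve(wrong_by_limit, total):
    """Return (time limit, error rate) points sorted by time limit."""
    return [(t, wrong / total) for t, wrong in sorted(wrong_by_limit.items())]


def error_time_area(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    return sum((e0 + e1) / 2 * (t1 - t0)
               for (t0, e0), (t1, e1) in zip(points, points[1:]))


def effective_time(points, threshold):
    """First time at which the interpolated error rate falls to the threshold."""
    for (t0, e0), (t1, e1) in zip(points, points[1:]):
        if e0 > e1 and e0 >= threshold >= e1:
            return t0 + (e0 - threshold) / (e0 - e1) * (t1 - t0)
    return None  # the curve never reaches the threshold


points = error_curve(WRONG_BY_LIMIT, RESPONSES_PER_LIMIT)
eta = error_time_area(points)
eff_time = effective_time(points, THRESHOLD)
eff_wage = eff_time * WAGE_RATE_CENTS_PER_SEC

print(f"Area under the curve from 1s to 8s: {eta:.2f}")
print(f"Effective Time: {eff_time:.2f}s")
print(f"Effective Wage: {eff_wage:.2f} cents")
```

With only three points the numbers are coarse; more time limits, and a smooth fitted curve in place of the interpolation, would tighten the estimate.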

Example #1: Choose from 4 options (Brushing, Applauding, Drinking, Climbing). What is this person doing?
[Figure: Normalized Error Rate (ticks at .50 and 1.0) vs. Time Taken (1s to 5s).]
ETA=3.5  Eff. Time=2.4s  Eff. Wage=1¢

Example #2: Search for the answer. In what year did California become a state? (The slide shows “185” being typed.)
[Figure: Normalized Error Rate (ticks at .50 and 1.0) vs. Time Taken (4s to 20s).]
ETA=11.7  Eff. Time=16s  Eff. Wage=7¢

ETA can be computed with as few as 8 workers.
For a 2¢ task, ETA costs less than $5.

ETA vs. other measures of effort

How well can ETA (or other measures) predict task effort?

[Figure: Tasks A to E ordered from less to more difficult under a gold-standard ranking, shown next to the ordering produced by a candidate measure (Measure X).]

We compute ten measures… ETA, Effective Time, Time Taken, Estimated Time, Rel. Subj. Dur., Error Rate, NASA TLX, Market Price, Estimated Cost, Subjective Rank

…on ten common tasks. Binary Choice (True / False), Scaled Choice (Strongly Agree to Strongly Disagree), Categorization (A, B, C, D), Description (“a person is standing”), Tagging (“apple”), Finding Errors (“this paer is juicy”), Fixing Errors (“this paer is juicy” to “pear”), Transcription (“tiny green pear”), Addition (1.68 + 0.74 = ?), Search (What year did California become a state?)

10 measures, 10 tasks, 8 time conditions, 60 workers

Results: ETA by task, in order of increasing ETA
(1) Binary Choice 1.6
(2) Scaled Choice 1.9
(3) Categorization 2.0
(4) Tagging 2.9
(5) Finding Errors 3.9
(6) Fixing Errors 4.3
(7) Addition 4.9
(8) Transcription 7.6
(9) Description 7.8
(10) Search 11.7

Comparing to subjective rank (ETA rank in parentheses, followed by subjective rank)
Binary Choice (1) 1
Scaled Choice (2) 2
Categorization (3) 3
Tagging (4) 4
Finding Errors (5) 6
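
Agreement between two rankings, as in the comparison above, can be scored with a rank correlation. The slides do not name the statistic used, so the sketch below assumes Kendall's tau (tau-a, no ties) as one common choice. The ETA and subjective ranks are the five pairs from the slide; `hypothetical_rank` is a made-up measure added only to show a score below 1.

```python
from itertools import combinations


def kendall_tau(ranks_a, ranks_b):
    """Kendall's tau-a for two tie-free rankings of the same items."""
    concordant = discordant = 0
    for i, j in combinations(range(len(ranks_a)), 2):
        sign = (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j])
        if sign > 0:
            concordant += 1   # both rankings order the pair the same way
        elif sign < 0:
            discordant += 1   # the rankings disagree on this pair
    pairs = len(ranks_a) * (len(ranks_a) - 1) // 2
    return (concordant - discordant) / pairs


tasks = ["Binary Choice", "Scaled Choice", "Categorization",
         "Tagging", "Finding Errors"]
eta_rank = [1, 2, 3, 4, 5]           # ETA ranks from the slide
subjective_rank = [1, 2, 3, 4, 6]    # subjective ranks from the slide
hypothetical_rank = [2, 1, 3, 4, 5]  # made-up measure: first two tasks swapped

tau_eta = kendall_tau(eta_rank, subjective_rank)
tau_hypothetical = kendall_tau(hypothetical_rank, subjective_rank)

print(f"ETA vs. subjective rank: {tau_eta:.2f}")
print(f"Hypothetical measure vs. subjective rank: {tau_hypothetical:.2f}")
```

Only the relative order matters here, so the gap in the subjective ranks (4, then 6) does not lower the score for these five tasks.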

So how did each measure do?
Measures: ETA, Effective Time, Time Taken, Estimated Time, Rel. Subj. Dur., Error Rate, NASA TLX, Market Price, Estimated Cost, Subjective Rank
Values, in the order they appear on the slide: 1.0 .87 .69 .66 .78 .82 .29 .82 .78 .69
Notes on individual measures:
It’s relative.
Workers multitask. Rzeszotarski, J. M., and Kittur, A. UIST (2011)
Expensive; market is inelastic. Toomim, M., et al. CHI (2011)
High variance. DeLeeuw, K. E. and Mayer, R. E. J. Educ. Psychol. (2008) / Herlocker, J., et al. Inform. Retrieval (2002)

Individually, hard to interpret. Finding Errors: ETA 3.9, Effective Time 5.5s, Time Taken 10.9s, Estimated Time 8s, Rel. Subj. Dur. -0.13, Error Rate 0.13, NASA TLX 48, Market Price 1¢, Estimated Cost 6.1¢, Subjective Rank 6

ETA in Action

People doing things.

What’s the best way to label these images?

Multiple Choice vs. Tagging. Prompt: What is this person doing?
2 Choices: Brushing, Drinking
4 Choices: Brushing, Applauding, Gardening, Climbing
8 Choices: Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping
16 Choices: Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping, Gardening, Cleaning, Writing, Waving, Typing, Reading, Phoning, Swimming
Tagging: free-text answer (the slide shows “runnin” being typed)

Effort increases with choices*
ETA: 2 Choices 1.6, 4 Choices 1.8, 8 Choices 2.5, 16 Choices 3.2, Tagging 3.1
Effective Time: 2 Choices 2.3, 4 Choices 2.5, 8 Choices 3.6, 16 Choices 5.0, Tagging 4.2
* But tagging is less work than picking from 16 choices!
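
As a rough illustration of how a requester might use these numbers to compare designs and set prices, the sketch below ranks the five designs by ETA and derives a price from Effective Time. The ETA and Effective Time values are the ones on the slide. Treating Effective Time as seconds and the value of `WAGE_RATE_CENTS_PER_SEC` are assumptions made for the example, not figures from the talk.

```python
# (ETA, Effective Time) for each design, as listed on the slide.
DESIGNS = {
    "2 Choices": (1.6, 2.3),
    "4 Choices": (1.8, 2.5),
    "8 Choices": (2.5, 3.6),
    "16 Choices": (3.2, 5.0),
    "Tagging": (3.1, 4.2),
}

WAGE_RATE_CENTS_PER_SEC = 0.4  # hypothetical wage rate, not from the slides

# Designs from least to most effort, using ETA as the effort measure.
by_eta = sorted(DESIGNS, key=lambda name: DESIGNS[name][0])

# Price per task = Effective Time x Wage Rate (Effective Time assumed in seconds).
prices = {name: eff_time * WAGE_RATE_CENTS_PER_SEC
          for name, (_, eff_time) in DESIGNS.items()}

for name in by_eta:
    eta, eff_time = DESIGNS[name]
    print(f"{name}: ETA {eta}, Effective Time {eff_time}, "
          f"price {prices[name]:.2f} cents")
```

Sorting by ETA places Tagging just ahead of 16 Choices, which matches the footnote on the slide.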

Also in the paper…
1. Computing ETA without ground truth
2. Measuring the perceptual cost of a task
3. Complete experimental results

ETA can effectively capture task effort to inform task design
and pricing. (Minimal effort required.)

http://hci.st/eta

