Measuring Crowdsourcing Effort with Error-Time Curves

Presented at CHI 2015.

Crowdsourcing systems lack effective measures of the effort required to complete each task. Without knowing how much time workers need to execute a task well, requesters struggle to accurately structure and price their work. Objective measures of effort could better help workers identify tasks that are worth their time. We propose a data-driven effort metric, ETA (error-time area), that can be used to determine a task's fair price. It empirically models the relationship between time and error rate by manipulating the time that workers have to complete a task. ETA reports the area under the error-time curve as a continuous metric of worker effort. The curve's 10th percentile is also interpretable as the minimum time most workers require to complete the task without error, which can be used to price the task. We validate the ETA metric on ten common crowdsourcing tasks, including tagging, transcription, and search, and find that ETA closely tracks how workers would rank these tasks by effort. We also demonstrate how ETA allows requesters to rapidly iterate on task designs and measure whether the changes improve worker efficiency. Our findings can facilitate the process of designing, pricing, and allocating crowdsourcing tasks.
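As a rough illustration of the "area under the error-time curve" idea, the sketch below integrates a piecewise-linear curve with the trapezoidal rule. The function name `error_time_area` and the sample points are hypothetical, and the authors may fit and integrate a smooth curve rather than connecting measured points directly.

```python
def error_time_area(times, error_rates):
    """Trapezoidal area under an error-time curve.

    times: strictly increasing time limits, in seconds.
    error_rates: error rate observed at each time limit (0 to 1).
    """
    if len(times) != len(error_rates) or len(times) < 2:
        raise ValueError("need at least two matching (time, error) points")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        if width <= 0:
            raise ValueError("times must be strictly increasing")
        # Area of one trapezoid between two neighbouring measurements.
        area += width * (error_rates[i - 1] + error_rates[i]) / 2.0
    return area


# Hypothetical measurements for illustration only (not data from the paper).
times = [1, 2, 4, 8]
errors = [1.0, 0.9, 0.4, 0.0]
eta = error_time_area(times, errors)
print(round(eta, 2))
```

A task whose error rate stays high for longer yields a larger area, which is why the area can serve as a single continuous effort score.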

Justin Cheng

April 21, 2015

Transcript

  1. Measuring Crowdsourcing Effort with Error-Time Curves. Justin Cheng @jcccf / Stanford, Jaime Teevan @jteevan / Microsoft Research, Michael Bernstein @msbernst / Stanford. http://hci.st/eta
  2. Crowdsourcing allows large numbers of people to accomplish tasks at a global scale. Kittur, A., et al. CSCW (2013)
  3. Crowdsourcing allows large numbers of people to accomplish tasks at a global scale: copyediting, categorization, retrieval, labeling, surveys, experiments. Kittur, A., et al. CSCW (2013)
  4. Which task design is better? Tagging ("What is this person doing?", free-text answer shown mid-entry as "climbin") or choosing from 8 options (Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping)?
  5. Which task design is better? "What is this person doing?" with 4 options (Brushing, Applauding, Drinking, Climbing) or with 8 options (Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping)?
  6. Requesters tend to underestimate the effort required to complete tasks. Hinds, P. Journal of Experimental Psychology: Applied (1999)
  7. Why not just ask workers how difficult they thought a task was? Many existing measures are imprecise.
  8. Why use ETA? Requesters can use ETA to compare task designs and iterate towards better ones, as well as objectively price tasks. Workers can identify tasks worth their time, and have a guide for how much time they should spend on a task.
  9. Generating a task’s ETA. 1. Have workers complete tasks given different time limits. Example: "Tag this image." with a countdown (5 seconds left… 4 seconds left… 3 seconds left… 2 seconds left… 1 second left… Time’s up!).
  10-11. Generating a task’s ETA. 1. Have workers complete tasks given different time limits. Seven "Tag this image." tasks with time limits of 1s, 2s, 4s, 6s, 8s, 10s, and 16s. Images: https://www.flickr.com/photos/[jking89/4572668505, jking89/4572668505, manoftaste-de/9563451348, jfh686/3613641379, patdavid/5568423570, dj-dwayne/6056431256, rsmith11235/9254525480]
  12. Generating a task’s ETA. 1. Have workers complete tasks given different time limits. Practice Questions and timed "Tag this image." tasks (limits shown: 2s, 16s, 6s, 1s, …). Images: https://www.flickr.com/photos/[mindwhisperings/5874135107, sugarhiccuphiccup/4808600654, sunsward7/8078455200]
  13. Generating a task’s ETA. 2. Fit a curve to the recorded data. Plot: Error Rate vs. Time Taken.
  14. Generating a task’s ETA. 2. Fit a curve to the recorded data. Plot: Error Rate vs. Time Taken, starting with the 1s time limit.
  15. Generating a task’s ETA. 2. Fit a curve to the recorded data. At 1s: error rate 1.0 (20 / 20 wrong).
  16. Generating a task’s ETA. 2. Fit a curve to the recorded data. At 2s: error rate .90 (18 / 20 wrong).
  17. Generating a task’s ETA. 2. Fit a curve to the recorded data. At 8s: error rate .00 (0 / 20 wrong).
  18. Generating a task’s ETA. 2. Fit a curve to the recorded data. Plot: Error Rate vs. Time Taken.
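The data points that feed the curve in step 2 are just the fraction of wrong answers at each time limit. A minimal sketch of that aggregation follows, using the three counts shown on the slides (20 / 20, 18 / 20, and 0 / 20 wrong). The function name and dictionary layout are hypothetical, and the curve fitting itself is not shown because the slides do not say which curve family is used.

```python
def error_rates_by_limit(wrong_counts, total_counts):
    """Map each time limit (seconds) to the fraction of wrong responses."""
    rates = {}
    for limit, wrong in wrong_counts.items():
        total = total_counts[limit]
        if total <= 0:
            raise ValueError(f"no responses recorded for the {limit}s limit")
        rates[limit] = wrong / total
    # Return the points ordered by time limit, ready for curve fitting.
    return dict(sorted(rates.items()))


# Counts taken from the slides: wrong answers out of 20 at 1s, 2s, and 8s.
wrong = {1: 20, 2: 18, 8: 0}
total = {1: 20, 2: 20, 8: 20}
rates = error_rates_by_limit(wrong, total)
print(rates)
```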
  19. Generating a task’s ETA. 3. Calculate the area under the curve (and other measures). Plot: Error Rate vs. Time Taken, with the area under the curve labeled ETA.
  20. Generating a task’s ETA. 3. Calculate the area under the curve (and other measures). Effective Time: the time at which the error rate reaches .10 (4s in the plot).
  21. Generating a task’s ETA. 3. Calculate the area under the curve (and other measures). Effective Wage = Effective Time × Wage Rate.
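The Effective Time in step 3 is read off the curve at an error rate of .10. One way to sketch that lookup is linear interpolation between measured points. This is an assumption made for illustration: the function name `effective_time` and the sample points are hypothetical, and the authors may read the value from a fitted curve instead.

```python
def effective_time(times, error_rates, threshold=0.10):
    """First time at which a piecewise-linear error curve falls to `threshold`.

    Returns None if the error rate never drops to the threshold.
    """
    if error_rates[0] <= threshold:
        return float(times[0])
    for i in range(1, len(times)):
        prev_err, err = error_rates[i - 1], error_rates[i]
        if prev_err > threshold >= err:
            # Fraction of the way along this segment where the curve crosses.
            frac = (prev_err - threshold) / (prev_err - err)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None


# Hypothetical measurements for illustration only.
times = [1, 2, 4, 8]
errors = [1.0, 0.9, 0.5, 0.0]
t_eff = effective_time(times, errors)
print(t_eff)
```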
  22. Example #1: Choose from 4 options ("What is this person doing?": Brushing, Applauding, Drinking, Climbing). Plot: Normalized Error Rate (.50, 1.0) vs. Time Taken (1s to 5s). ETA=3.5, Eff. Time=2.4s, Eff. Wage=1¢.
  23. Example #2: Search for the answer ("In what year did California become a state?", answer shown mid-entry as "185"). Plot: Normalized Error Rate (.50, 1.0) vs. Time Taken (4s to 20s). ETA=11.7, Eff. Time=16s, Eff. Wage=7¢.
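The effective wage on these example slides is the effective time multiplied by a wage rate. The slides do not state the rate, so the sketch below assumes a hypothetical $15 per hour, chosen only because it gives values close to the rounded figures in the two examples. The function name is also hypothetical.

```python
def effective_wage_cents(effective_time_s, hourly_rate_dollars):
    """Price of one task in cents: effective time (s) times wage rate."""
    cents_per_second = hourly_rate_dollars * 100.0 / 3600.0
    return effective_time_s * cents_per_second


# Assumed wage rate; not stated in the slides.
ASSUMED_HOURLY_RATE = 15.0

wage_example_1 = effective_wage_cents(2.4, ASSUMED_HOURLY_RATE)   # Eff. Time=2.4s
wage_example_2 = effective_wage_cents(16.0, ASSUMED_HOURLY_RATE)  # Eff. Time=16s
print(round(wage_example_1), round(wage_example_2))
```

With a different wage rate the prices scale proportionally, so the effective time is the quantity that carries the measurement.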
  24. ETA can be computed with as few as 8 workers. For a 2¢ task, ETA costs less than $5.
  25. Gold-standard ranking (arrow: more difficult): Task A, Task C, Task B, Task D, Task E.
  26. Gold-standard ranking: Task A, Task C, Task B, Task D, Task E. Measure X ranking: Task C, Task A, Task B, Task D, Task E. (Arrow: more difficult.)
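One common way to score how closely a measure's ranking matches a gold-standard ranking is a rank correlation. The sketch below computes Kendall's tau for the two orderings on this slide. The choice of Kendall's tau is an assumption made for illustration, since the slides do not name the statistic used, and the function name is hypothetical.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if pos_a.keys() != pos_b.keys() or len(pos_a) < 2:
        raise ValueError("rankings must cover the same two or more items")
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both rankings order the pair the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Orderings as they appear on the slide.
gold = ["A", "C", "B", "D", "E"]
measure_x = ["C", "A", "B", "D", "E"]
tau = kendall_tau(gold, measure_x)
print(tau)
```

Only the pair (A, C) is ordered differently here, so nine of the ten pairs agree.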
  27-32. We compute ten measures: ETA, Effective Time, Time Taken, Estimated Time, Relative Subjective Duration (Rel. Subj. Dur.), Error Rate, NASA TLX, Market Price, Estimated Cost, and Subjective Rank.
  33. …on ten common tasks: Binary Choice, Scaled Choice, Categorization, Description, Tagging, Finding Errors, Fixing Errors, Transcription, Addition, Search. Example content shown: True / False; A B C D; Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree; "apple"; "a person is standing"; "this paer is juicy" with "pear"; "this paer is juicy" with words numbered 1 2 3 4 and "2"; "tiny green pear" with "tiny gree"; "1.68 + 0.74 = ?" with "2.4"; "What year did California become a state?" with "185".
  34. Results (ETA): Binary Choice 1.6, Scaled Choice 1.9, Categorization 2.0, Description 7.8, Tagging 2.9, Finding Errors 3.9, Fixing Errors 4.3, Transcription 7.6, Addition 4.9, Search 11.7.
  35. Results in order of increasing ETA: Binary Choice 1.6, Scaled Choice 1.9, Categorization 2.0, Tagging 2.9, Finding Errors 3.9, Fixing Errors 4.3, Addition 4.9, Transcription 7.6, Description 7.8, Search 11.7.
  36. Results in order of increasing ETA, with ranks: (1) Binary Choice 1.6, (2) Scaled Choice 1.9, (3) Categorization 2.0, (4) Tagging 2.9, (5) Finding Errors 3.9, (6) Fixing Errors 4.3, (7) Addition 4.9, (8) Transcription 7.6, (9) Description 7.8, (10) Search 11.7.
  37. Comparing to subjective rank. ETA rank vs. Subjective Rank: Binary Choice (1) vs. 1, Scaled Choice (2) vs. 2, Categorization (3) vs. 3, Tagging (4) vs. 4, Finding Errors (5) vs. 6.
  38. So how did each measure do? Measures: ETA, Effective Time, Time Taken, Estimated Time, Rel. Subj. Dur., Error Rate, NASA TLX, Market Price, Estimated Cost, Subjective Rank. Value shown: 1.0.
  39. So how did each measure do? Values shown: 1.0, .87, .69, .66, .78, .82, .29, .82, .78, .69.
  40. So how did each measure do? Same values, annotated: It’s relative.
  41. So how did each measure do? Same values, annotated: Workers multitask. Rzeszotarski, J. M., and Kittur, A. UIST (2011)
  42. So how did each measure do? Same values, annotated: Expensive; market is inelastic. Toomim, M., et al. CHI (2011)
  43. So how did each measure do? Same values, annotated: High variance. DeLeeuw, K. E. and Mayer, R. E. J. Educ. Psychol. (2008) / Herlocker, J., et al. Inform. Retrieval (2002)
  44-45. Individually, hard to interpret. Finding Errors: ETA 3.9, Effective Time 5.5s, Time Taken 10.9s, Estimated Time 8s, Rel. Subj. Dur. -0.13, Error Rate 0.13, NASA TLX 48, Market Price 1¢, Estimated Cost 6.1¢, Subjective Rank 6.
  46. Multiple Choice vs. Tagging. "What is this person doing?" with 2 Choices (Brushing, Drinking), 4 Choices (Brushing, Applauding, Gardening, Climbing), 8 Choices (Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping), 16 Choices (Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping, Gardening, Cleaning, Writing, Waving, Typing, Reading, Phoning, Swimming), or Tagging (free-text answer shown mid-entry as "runnin").
  47. Effort increases with choices*. ETA: 2 Choices 1.6, 4 Choices 1.8, 8 Choices 2.5, 16 Choices 3.2, Tagging 3.1. Effective Time: 2 Choices 2.3, 4 Choices 2.5, 8 Choices 3.6, 16 Choices 5.0, Tagging 4.2.
  48. Effort increases with choices*. ETA: 2 Choices 1.6, 4 Choices 1.8, 8 Choices 2.5, 16 Choices 3.2, Tagging 3.1. Effective Time: 2 Choices 2.3, 4 Choices 2.5, 8 Choices 3.6, 16 Choices 5.0, Tagging 4.2. * But tagging is less work than picking from 16 choices!
  49. Also in the paper… 1. Computing ETA without ground truth. 2. Measuring the perceptual cost of a task. 3. Complete experimental results.
  50. ETA can effectively capture task effort to inform task design and pricing. (Minimal effort required.)
  51. Measuring Crowdsourcing Effort with Error-Time Curves. Justin Cheng @jcccf / Stanford, Jaime Teevan @jteevan / Microsoft Research, Michael Bernstein @msbernst / Stanford. http://hci.st/eta