Measuring Crowdsourcing Effort with Error-Time Curves

Presented at CHI 2015.

Crowdsourcing systems lack effective measures of the effort required to complete each task. Without knowing how much time workers need to execute a task well, requesters struggle to accurately structure and price their work. Objective measures of effort could better help workers identify tasks that are worth their time. We propose a data-driven effort metric, ETA (error-time area), that can be used to determine a task's fair price. It empirically models the relationship between time and error rate by manipulating the time that workers have to complete a task. ETA reports the area under the error-time curve as a continuous metric of worker effort. The curve's 10th percentile is also interpretable as the minimum time most workers require to complete the task without error, which can be used to price the task. We validate the ETA metric on ten common crowdsourcing tasks, including tagging, transcription, and search, and find that ETA closely tracks how workers would rank these tasks by effort. We also demonstrate how ETA allows requesters to rapidly iterate on task designs and measure whether the changes improve worker efficiency. Our findings can facilitate the process of designing, pricing, and allocating crowdsourcing tasks.

Justin Cheng

April 21, 2015

Transcript

  1. Measuring Crowdsourcing Effort with Error-Time Curves
    Justin Cheng @jcccf / Stanford, Jaime Teevan @jteevan / Microsoft Research, Michael Bernstein @msbernst / Stanford
    http://hci.st/eta
  2. Crowdsourcing allows large numbers of people to accomplish tasks at a global scale. Kittur, A., et al. CSCW (2013)
  3. Crowdsourcing allows large numbers of people to accomplish tasks at a global scale. Kittur, A., et al. CSCW (2013)
    Examples: Copyediting, Categorization, Retrieval, Labeling, Surveys, Experiments
  4. But it’s difficult to design and price tasks well.

  5. Which task design is better?
    Tagging: "What is this person doing?" (free-text entry, shown mid-typing as "climbin")
    Choose from 8 options: "What is this person doing?" Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping
  6. Which task design is better?
    Choose from 4 options: "What is this person doing?" Brushing, Applauding, Drinking, Climbing
    Choose from 8 options: "What is this person doing?" Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping
  7. Requesters end up pricing tasks arbitrarily.

  8. Requesters tend to underestimate the effort required to complete tasks. Hinds, P. Journal of Experimental Psychology: Applied (1999)
  9. Workers are hard-pressed to figure out which tasks are worth their time.
  10. Why not just measure how long workers take to complete a task?
  11. Why not just ask workers how difficult they thought a task was? Many existing measures are imprecise.
  12. To reliably determine task difficulty, we need a robust, objective measure of effort.
  13. ETA (error-time area) is a continuous, absolute, data-driven measure of task effort.
  14. ETA (error-time area) models the relationship between time and worker error rate.
  15. Why use ETA?
    Requesters can use ETA to compare task designs and iterate towards better ones, as well as objectively price tasks.
    Workers can identify tasks worth their time, and have a guide for how much time they should spend on a task.
  16. Overview
    1. Error-Time Curves (and ETA)
    2. Evaluating ETA and other measures
    3. ETA in action
  17. Understanding Error-Time Curves

  18. [Plot: Error Rate vs. Time Taken, with ETA as the area under the curve] How do we generate this?

  19. Generating a task’s ETA
    1. Have workers complete tasks given different time limits.
    Example: "Tag this image." with a countdown: 5 seconds left… 4 seconds left… 3 seconds left… 2 seconds left… 1 second left… Time’s up!
  20. Generating a task’s ETA
    1. Have workers complete tasks given different time limits.
    Seven "Tag this image." questions, with time limits of 1s, 2s, 4s, 6s, 8s, 10s, and 16s.
    Image credits: https://www.flickr.com/photos/[jking89/4572668505, jking89/4572668505, manoftaste-de/9563451348, jfh686/3613641379, patdavid/5568423570, dj-dwayne/6056431256, rsmith11235/9254525480]
  21. Generating a task’s ETA
    1. Have workers complete tasks given different time limits.
    Seven "Tag this image." questions, with time limits of 1s, 2s, 4s, 6s, 8s, 10s, and 16s.
    Image credits: https://www.flickr.com/photos/[jking89/4572668505, jking89/4572668505, manoftaste-de/9563451348, jfh686/3613641379, patdavid/5568423570, dj-dwayne/6056431256, rsmith11235/9254525480]
  22. Generating a task’s ETA
    1. Have workers complete tasks given different time limits.
    Workers see "Tag this image." Practice Questions as well as timed questions (time limits shown: 2s, 16s, 6s, 1s, …).
    Image credits: https://www.flickr.com/photos/[mindwhisperings/5874135107, sugarhiccuphiccup/4808600654, sunsward7/8078455200]
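One way to set up step 1 is sketched below in Python. It is an illustration only, not the authors' procedure: the function name `assign_time_limits`, the round-robin balancing, and the seed are assumptions, and only the time limits themselves (1s to 16s) come from the slides.

```python
import random

# Time limits (seconds) shown on the slides.
TIME_LIMITS_S = [1, 2, 4, 6, 8, 10, 16]


def assign_time_limits(question_ids, limits=TIME_LIMITS_S, seed=None):
    """Pair each question with a time limit (hypothetical helper).

    Questions are shuffled, limits are dealt out round-robin so every
    limit is used about equally often, and the final order is shuffled
    again so limits do not appear in a predictable sequence.
    """
    rng = random.Random(seed)
    questions = list(question_ids)
    rng.shuffle(questions)
    schedule = [(q, limits[i % len(limits)]) for i, q in enumerate(questions)]
    rng.shuffle(schedule)
    return schedule


# 14 hypothetical questions: each of the 7 limits is used exactly twice.
schedule = assign_time_limits(range(14), seed=0)
for question_id, limit_s in schedule:
    print(f"question {question_id}: {limit_s}s")
```

Balancing the limits gives each time condition the same number of responses, which keeps the per-condition error rates comparable.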
  23. Generating a task’s ETA
    2. Fit a curve to the recorded data. [Plot axes: Error Rate vs. Time Taken]
  24. Generating a task’s ETA
    2. Fit a curve to the recorded data. Data point at 1s …
  25. Generating a task’s ETA
    2. Fit a curve to the recorded data. At 1s: 20 / 20 wrong, error rate 1.0 …
  26. Generating a task’s ETA
    2. Fit a curve to the recorded data. At 2s: 18 / 20 wrong, error rate .90 …
  27. Generating a task’s ETA
    2. Fit a curve to the recorded data. At 8s: 0 / 20 wrong, error rate .00 …
  28. Generating a task’s ETA
    2. Fit a curve to the recorded data. [Plot axes: Error Rate vs. Time Taken]
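The data points that the curve is fit to can be tabulated as in the Python sketch below. The function name `error_rates` and the `(time_limit, is_correct)` response format are assumptions made for illustration; the counts are the three shown on the slides. The curve fit itself is left out because the slides do not say which curve model is used.

```python
from collections import defaultdict


def error_rates(responses):
    """Return {time_limit_seconds: fraction of wrong answers}.

    `responses` is an iterable of (time_limit, is_correct) pairs
    (a hypothetical format, not taken from the slides).
    """
    wrong = defaultdict(int)
    total = defaultdict(int)
    for time_limit, is_correct in responses:
        total[time_limit] += 1
        if not is_correct:
            wrong[time_limit] += 1
    return {t: wrong[t] / total[t] for t in sorted(total)}


# Counts from the slides: 20/20 wrong at 1s, 18/20 wrong at 2s, 0/20 wrong at 8s.
responses = (
    [(1, False)] * 20
    + [(2, False)] * 18 + [(2, True)] * 2
    + [(8, True)] * 20
)
rates = error_rates(responses)
for time_limit, rate in rates.items():
    print(f"{time_limit}s: error rate {rate:.2f}")
```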
  29. Generating a task’s ETA
    3. Calculate the area under the curve (and other measures). ETA is the area under the Error Rate vs. Time Taken curve.
  30. Generating a task’s ETA
    3. Calculate the area under the curve (and other measures). Effective Time is the time at which the error rate falls to .10 (4s in the plotted example).
  31. Generating a task’s ETA
    3. Calculate the area under the curve (and other measures). Effective Wage = Effective Time × Wage Rate, with Effective Time read off the curve at error rate .10 (4s in the plotted example).
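A minimal Python sketch of step 3 follows. It makes several assumptions that are not in the slides: a piecewise-linear curve through the measured points stands in for the fitted curve, the area is taken only over the measured time range, and the wage rate of $15 per hour is a placeholder. The function names are hypothetical.

```python
def area_under_curve(points):
    """Trapezoidal area under (time_s, error_rate) points sorted by time.

    Assumption: the area covers only the measured range, starting at the
    first time limit rather than at zero.
    """
    area = 0.0
    for (t0, e0), (t1, e1) in zip(points, points[1:]):
        area += (t1 - t0) * (e0 + e1) / 2
    return area


def effective_time(points, threshold=0.10):
    """First time at which the interpolated error rate falls to `threshold`."""
    for (t0, e0), (t1, e1) in zip(points, points[1:]):
        if e0 > threshold >= e1:
            return t0 + (e0 - threshold) * (t1 - t0) / (e0 - e1)
    return None  # the error rate never reaches the threshold


def effective_wage(effective_time_s, hourly_wage_cents):
    """Effective Wage = Effective Time x Wage Rate, in cents."""
    return effective_time_s * hourly_wage_cents / 3600


# The three data points shown on the slides: (seconds, error rate).
points = [(1, 1.0), (2, 0.9), (8, 0.0)]
eta = area_under_curve(points)
eff_time = effective_time(points)

# Assumed wage rate of $15/hour (1500 cents); not stated in the slides.
wage = effective_wage(2.4, 1500)
print(f"ETA={eta:.2f}, Eff. Time={eff_time:.1f}s, Eff. Wage for 2.4s={wage:.1f} cents")
```

Because only three points are used and no curve is fit, the numbers printed here are not comparable to the ETA values reported in the talk.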
  32. Example #1: Choose from 4 options ("What is this person doing?" Brushing, Applauding, Drinking, Climbing)
    [Plot: Normalized Error Rate (.50, 1.0) vs. Time Taken (1s to 5s)]
    ETA=3.5, Eff. Time=2.4s, Eff. Wage=1¢
  33. Example #2: Search for the answer ("In what year did California become a state?", answer shown mid-typing as "185")
    [Plot: Normalized Error Rate (.50, 1.0) vs. Time Taken (4s to 20s)]
    ETA=11.7, Eff. Time=16s, Eff. Wage=7¢
  34. ETA can be computed with as few as 8 workers. For a 2¢ task, ETA costs less than $5.
  35. ETA vs. other measures of effort

  36. How well can ETA (or other measures) predict task effort?

  37. Task A, Task C, Task B, Task D, Task E
  38. Gold-standard ordering by difficulty: Task A, Task C, Task B, Task D, Task E
  39. Gold-standard ordering by difficulty: Task A, Task C, Task B, Task D, Task E
    Measure X ordering by difficulty: Task C, Task A, Task B, Task D, Task E
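Agreement between a measure's ordering and the gold-standard ordering can be quantified with a rank correlation. The sketch below uses Kendall's tau as an assumed choice, since the slides do not name the statistic used; `kendall_tau` is a hypothetical helper and handles only orderings without ties.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    pos_a = {task: i for i, task in enumerate(order_a)}
    pos_b = {task: i for i, task in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # Positive product: both orderings place x and y the same way round.
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agreement > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Orderings from the slides.
gold_standard = ["A", "C", "B", "D", "E"]
measure_x = ["C", "A", "B", "D", "E"]

tau = kendall_tau(gold_standard, measure_x)
print(f"Kendall's tau = {tau:.2f}")
```

Only one of the ten task pairs (A and C) is ordered differently, so the two orderings agree closely.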
  40. We compute ten measures…
    ETA, Effective Time, Time Taken, Estimated Time, Rel. Subj. Dur., Error Rate, NASA TLX, Market Price, Estimated Cost, Subjective Rank
  46. …on ten common tasks.
    Binary Choice (True / False), Scaled Choice (Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree), Categorization (A, B, C, D), Description ("a person is standing"), Tagging ("apple")
    Finding Errors ("this paer is juicy", word 2 selected out of 1 2 3 4), Fixing Errors ("this paer is juicy", corrected to "pear"), Transcription ("tiny green pear", typed so far: "tiny gree"), Addition ("1.68 + 0.74 = ?", typed so far: "2.4"), Search ("What year did California become a state?", typed so far: "185")
  47. 10 measures, 10 tasks, 8 time conditions, 60 workers

  48. Results: ETA by task
    Binary Choice 1.6, Scaled Choice 1.9, Categorization 2.0, Description 7.8, Tagging 2.9
    Finding Errors 3.9, Fixing Errors 4.3, Transcription 7.6, Addition 4.9, Search 11.7
  49. Results in order of increasing ETA
    Binary Choice 1.6, Scaled Choice 1.9, Categorization 2.0, Tagging 2.9, Finding Errors 3.9
    Fixing Errors 4.3, Addition 4.9, Transcription 7.6, Description 7.8, Search 11.7
  50. Results in order of increasing ETA, with ranks
    (1) Binary Choice 1.6, (2) Scaled Choice 1.9, (3) Categorization 2.0, (4) Tagging 2.9, (5) Finding Errors 3.9
    (6) Fixing Errors 4.3, (7) Addition 4.9, (8) Transcription 7.6, (9) Description 7.8, (10) Search 11.7
  51. Comparing to subjective rank
    ETA rank vs. Subjective Rank: Binary Choice (1) vs. 1, Scaled Choice (2) vs. 2, Categorization (3) vs. 3, Tagging (4) vs. 4, Finding Errors (5) vs. 6
  52. So how did each measure do?
    Measures: ETA, Effective Time, Time Taken, Estimated Time, Rel. Subj. Dur., Error Rate, NASA TLX, Market Price, Estimated Cost, Subjective Rank
    Score shown: 1.0
  53. So how did each measure do?
    Scores shown (in slide layout order): 1.0 .87 .69 .66 .78 .82 .29 .82 .78 .69
  54. So how did each measure do?
    Scores shown (in slide layout order): 1.0 .87 .69 .66 .78 .82 .29 .82 .78 .69
    It’s relative.
  55. So how did each measure do?
    Scores shown (in slide layout order): 1.0 .87 .69 .66 .78 .82 .29 .82 .78 .69
    Workers multitask. Rzeszotarski, J. M., and Kittur, A. UIST (2011)
  56. So how did each measure do?
    Scores shown (in slide layout order): 1.0 .87 .69 .66 .78 .82 .29 .82 .78 .69
    Expensive; market is inelastic. Toomim, M., et al. CHI (2011)
  57. So how did each measure do?
    Scores shown (in slide layout order): 1.0 .87 .69 .66 .78 .82 .29 .82 .78 .69
    High variance. DeLeeuw, K. E. and Mayer, R. E. J. Educ. Psychol. (2008) / Herlocker, J., et al. Inform. Retrieval (2002)
  58. Individually, hard to interpret
    Finding Errors (values in slide layout order): 0.13, 48, 1¢, 6.1¢, 6, 3.9, 5.5s, 10.9s, 8s, -0.13
    Measures: ETA, Effective Time, Time Taken, Estimated Time, Rel. Subj. Dur., Error Rate, NASA TLX, Market Price, Estimated Cost, Subjective Rank
  60. ETA in Action

  61. People doing things.

  62. What’s the best way to label these images?

  63. Multiple Choice vs. Tagging ("What is this person doing?")
    2 Choices: Brushing, Drinking
    4 Choices: Brushing, Applauding, Gardening, Climbing
    8 Choices: Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping
    16 Choices: Brushing, Cooking, Applauding, Drinking, Climbing, Rowing, Fishing, Jumping, Gardening, Cleaning, Writing, Waving, Typing, Reading, Phoning, Swimming
    Tagging: free-text entry (shown mid-typing as "runnin")
  65. Effort increases with choices*
    ETA: 2 Choices 1.6, 4 Choices 1.8, 8 Choices 2.5, 16 Choices 3.2, Tagging 3.1
    Effective Time: 2 Choices 2.3, 4 Choices 2.5, 8 Choices 3.6, 16 Choices 5.0, Tagging 4.2
  66. Effort increases with choices*
    ETA: 2 Choices 1.6, 4 Choices 1.8, 8 Choices 2.5, 16 Choices 3.2, Tagging 3.1
    Effective Time: 2 Choices 2.3, 4 Choices 2.5, 8 Choices 3.6, 16 Choices 5.0, Tagging 4.2
    * But tagging is less work than picking from 16 choices!
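The design comparison on these slides amounts to sorting the candidate designs by a measured effort score. The Python sketch below uses the ETA and Effective Time values reported on the slide; the helper name `rank_by_effort` is hypothetical.

```python
# Values reported on the slide for each task design.
eta = {
    "2 Choices": 1.6,
    "4 Choices": 1.8,
    "8 Choices": 2.5,
    "16 Choices": 3.2,
    "Tagging": 3.1,
}
effective_time = {
    "2 Choices": 2.3,
    "4 Choices": 2.5,
    "8 Choices": 3.6,
    "16 Choices": 5.0,
    "Tagging": 4.2,
}


def rank_by_effort(scores):
    """Return design names ordered from least to most measured effort."""
    return sorted(scores, key=scores.get)


by_eta = rank_by_effort(eta)
by_time = rank_by_effort(effective_time)
print("By ETA:           ", by_eta)
print("By Effective Time:", by_time)
```

Both measures place Tagging ahead of 16 Choices, which matches the slide's note that tagging is less work than picking from 16 choices.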
  67. Also in the paper…
    1. Computing ETA without ground truth
    2. Measuring the perceptual cost of a task
    3. Complete experimental results
  68. ETA can effectively capture task effort to inform task design and pricing.
  69. ETA can effectively capture task effort to inform task design and pricing. (Minimal effort required.)
  70. http://hci.st/eta

  71. Measuring Crowdsourcing Effort with Error-Time Curves
    Justin Cheng @jcccf / Stanford, Jaime Teevan @jteevan / Microsoft Research, Michael Bernstein @msbernst / Stanford
    http://hci.st/eta