Human Evaluation for Google Play

Edwin Chen
February 01, 2015

Transcript

  1. Man Meets Machine: Evaluation Insights for Android Play

    Edwin Chen / echen.me
  2. Overview

  3. Building a search and discovery marketplace is hard. Building it

    for millions of diverse users and items, against established competitors like Apple and Amazon, is even harder.
  4. How well does the Play Store do?

  5. Are users receiving good recommendations? How do they compare for

    apps vs. movies? How friendly is the UI? Is it easy to find the Game of Thrones episode I want? Do inventory and pricing match up against other app stores like Apple and Amazon? Where else does the experience succeed and fail?
  6. What are common problems with recommendations?

  7. Do other stores do better?

  8. Pick an app you enjoyed in the past year. “I'd just watched an episode of the Walking Dead and decided to look for a game that would simulate the worldwide spread of a disease, if it could be combated, etc.”

    Rate one of the similar apps. Would you want to buy it? Horrible suggestion. “Never in a million years. I am not a teen girl, I have zero interest in celebs in real life, and I definitely don't want to simulate becoming one.”
  9. What parts of the Play Store most need improvement?

  10. Machine learning is only one aspect of a digital store.

    Is the pricing any good?
  11. What about inventory?

  12. Can I even find the products I’m looking for?

  13. Human Evaluation

  14. How can we measure and improve the quality of Play

    Store recommendations?
  15. Log metrics are the best source of truth, but often

    don’t contain enough information. Can we quickly ask users what they think, at scale?
  16. Pick a book you enjoyed in the past year. “It

    blends my two favorite genres, Comedy and Science fiction. I was engaged in the story but also laughing most of the way through, and Douglas Adams had a really great, unique writing style.”
  17. Rate the first Similar book. Would you want to buy

    it? Horrible suggestion. “Google Play /thinks/ this is the same author, but they're not. Also, this is some kind of health monitoring textbook and completely unlike Douglas Adams' work.”
  18. Rate the second Similar book. Would you want to buy

    it? Horrible suggestion. “Another textbook by a not-Douglas-Adams-Douglas-Adams. This time a textbook about structural dynamics.”
  19. Rate the third Similar book. Would you want to buy

    it? Fairly bad suggestion. “The reviews aren't too great (but ok) and it is the second book in a series.”
  20. Potential Improvements

    1. Build an author disambiguation model (say, by using genre or other features), and stop showing related books by different authors with the same name.
    2. Penalize poorly rated books, or books whose ratings differ sharply from the rating of the original.
    3. Build series detectors, and only recommend related books that are the first in their series (unless you’re recommending a follow-up to the target book).
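
The three improvements above could be prototyped as simple filters before any model is trained. The following is a minimal sketch, not the Play Store's actual logic: the `Book` fields, the genre-based disambiguation heuristic, the `min_rating` threshold, and all ratings in the example are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Book:
    """Hypothetical record for a catalog entry."""
    title: str
    author: str
    genre: str
    rating: float
    series: Optional[str] = None
    series_index: Optional[int] = None


def same_name_different_author(target: Book, cand: Book) -> bool:
    # Crude disambiguation heuristic (assumption): an identical author
    # name in a different genre is treated as a different person.
    return cand.author == target.author and cand.genre != target.genre


def acceptable_series_position(target: Book, cand: Book) -> bool:
    # Standalone books and series openers are always acceptable.
    if cand.series is None or cand.series_index == 1:
        return True
    # Otherwise only accept the direct follow-up to the target book.
    return (
        cand.series == target.series
        and target.series_index is not None
        and cand.series_index == target.series_index + 1
    )


def rerank_related(target: Book, candidates: List[Book],
                   min_rating: float = 3.5) -> List[Book]:
    kept = [
        c for c in candidates
        if not same_name_different_author(target, c)
        and c.rating >= min_rating
        and acceptable_series_position(target, c)
    ]
    # Highest rated first.
    return sorted(kept, key=lambda b: b.rating, reverse=True)


# Illustrative data only; ratings and the two textbook titles are invented.
target = Book("The Hitchhiker's Guide to the Galaxy", "Douglas Adams",
              "Science Fiction", 4.5, "Hitchhiker's Guide", 1)
candidates = [
    Book("Health Monitoring Textbook", "Douglas Adams", "Textbook", 4.0),
    Book("Structural Dynamics Textbook", "Douglas Adams", "Textbook", 4.1),
    Book("Unrelated Sequel", "Someone Else", "Science Fiction", 3.8,
         "Other Series", 2),
    Book("The Restaurant at the End of the Universe", "Douglas Adams",
         "Science Fiction", 4.4, "Hitchhiker's Guide", 2),
]
related = rerank_related(target, candidates)
print([b.title for b in related])
```

With this toy data, the two same-name textbooks and the mid-series book are dropped, and only the direct follow-up survives.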
  21. Pick a book you enjoyed in the past year. “I

    was a fan of the Lord of the Rings movies, but it was too many books. I liked that The Hobbit was short, and it was humorous. I was able to knock it out quickly.”
  22. Rate the first Similar book. Would you want to buy

    it? Horrible suggestion. “Just reading and watching The Hobbit is enough for me. I don't want to delve into this world any further.”
  23. Rate the second Similar book. Would you want to buy

    it? Horrible suggestion. “This is just a study guide, so something I would have used when I didn't read the books in school. But definitely not something I would need to use now.”
  24. Rate the third Similar book. Would you want to buy

    it? Horrible suggestion. “It's yet another study guide! And not necessary for me.”
  25. Potential Improvements

    1. Penalize books in different genres.
    2. Stop showing study guides everywhere!
  26. None
  27. And Hunger Games, Hobbit, and Hitchhiker’s Guide to the Galaxy

    are all extremely popular items! Until these models improve, what if we could manually fix the Related Item recommendations at the head?
  28. Elite Proletariat

  29. I sit down at the computer like, "What up? I

    got some big debt!” I'm so pumped about tasks from my work desk Transcription on the down low, it's so damn noisy Trying to make out people sayin', "Damn! That podcast's crazy!” Categorizin' ten layers deep, trying to get it right, Brain is startin' to keep me up all night Dreamin' about those apps, pennies raining down Probably need a break, they're all I can think about (Taskssssssss…) But shit, it paid ninety-nine cents! (Work it!)
  30. Faster, Cheaper, Stronger

    The Video Inventory Analytics team found that, compared to the internal raters they had been using for a while, our elite proletariat was:
    • 7X cheaper
    • 5X faster
    • Higher quality: on 500 videos, their internal raters made 15 mistakes, compared to 3 mistakes by our workers (on their first exposure to the task!)
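
The quality comparison above is easier to read as error rates. A small sketch of the arithmetic, using only the counts quoted on the slide (the function and variable names are illustrative):

```python
def error_rate(mistakes: int, total: int) -> float:
    """Share of items labeled incorrectly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return mistakes / total


TOTAL_VIDEOS = 500
internal_rate = error_rate(15, TOTAL_VIDEOS)  # internal raters
crowd_rate = error_rate(3, TOTAL_VIDEOS)      # crowd workers
ratio = internal_rate / crowd_rate

print(f"internal raters: {internal_rate:.1%}")  # 3.0%
print(f"crowd workers:   {crowd_rate:.1%}")     # 0.6%
print(f"ratio:           {ratio:.1f}x")         # 5.0x
```

On this sample, the internal raters made five times as many mistakes. With only 18 mistakes in total, the exact ratio should be read as indicative rather than precise.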
  31. Rundown

    • Want 100,000 apps categorized? Just give us a couple days.
    • Need special languages? We can even handle esoteric languages like Thai and Icelandic.
    • Require specific types of workers? We can recruit whatever you need (e.g., Korean Android users who are also heavy Playstation players).
    • Have complex tasks? It’s one of our specialties. We have the full power of the human brain available, so we shouldn't be limited to labeling cat images. We can download apps and play them, write catchy descriptions, etc.
  32. Similar Books

  33. None
  34. note: biased towards popular books, not randomly sampled — but

    a random sample would likely be even worse!
  35. Similar Apps

  36. Pick an app you enjoyed in the past year. “I

    downloaded the app because I saw that everyone was playing it and always wanted to play it myself.”
  37. Rate the second Similar app. Would you want to buy

    it? Horrible suggestion. “The game itself looks very badly programmed and put together, while putting in Minecraft content that looks terribly made.”
  38. Rate the third Similar app. Would you want to buy

    it? Horrible suggestion. “It doesn't seem like a very good app, since the graphics and style look horrible.”
  39. Graphics, screenshots, and quality matter. And even without knowing that

    the related apps look horrible, it’s easy to tell that Angry Birds is on a completely different level of popularity and professionalism.
  40. Pick an app you enjoyed in the past year. “I'd

    just watched an episode of the Walking Dead and decided to look for a game that would simulate the worldwide spread of a disease, if it could be combated, etc.”
  41. Rate the second Similar app. Would you want to buy

    it? Horrible suggestion. “I don't have any interest in collecting cute little dragons.”
  42. Rate the third Similar app. Would you want to buy

    it? Horrible suggestion. “Never in a million years. I am not a teen girl, I have zero interest in celebs in real life, and I definitely don't want to simulate becoming one.”
  43. These apps are for completely different demographics. Do the Play

    Store’s machine learning models use personalized age and gender features?
  44. Recommendations

  45. Go to your Play Store homepage, and rate the first

    three Recommended for You items.
  46. Which items have the best recommendations?

  47. None
  48. Which recommendation reasons are the best?

  49. “Based on X” recommendations are the most relevant. “Top X”

    recommendations are the least relevant.
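
A comparison like the one above can be produced by averaging rater scores per recommendation reason. The sketch below assumes a four-point scale; "Horrible", "Fairly bad", and "Excellent" appear in the rater responses, while "Fairly good", the numeric mapping, and the sample ratings are invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Assumed mapping from rater labels to numeric scores.
SCALE = {"Horrible": 1, "Fairly bad": 2, "Fairly good": 3, "Excellent": 4}


def mean_score_by_reason(
    ratings: Iterable[Tuple[str, str]]
) -> Dict[str, float]:
    """Average score for each recommendation reason."""
    by_reason = defaultdict(list)
    for reason, label in ratings:
        by_reason[reason].append(SCALE[label])
    return {reason: mean(scores) for reason, scores in by_reason.items()}


# Made-up ratings: (recommendation reason, rater label).
ratings = [
    ("Based on X", "Excellent"),
    ("Based on X", "Fairly good"),
    ("Top X", "Horrible"),
    ("Top X", "Fairly bad"),
]
scores = mean_score_by_reason(ratings)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Treating an ordinal scale as numeric is a simplification; reporting the share of "Horrible" ratings per reason is a reasonable alternative.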
  50. Horrible recommendation. I listen to all of my music on

    Google Play Music so you would think they would know what type of music to recommend to me. I would never listen to this horrible music. This is not a genre I ever listen to. It says Top selling song but I would rather have something personalized and recommended to fit my tastes, not just a top seller.
  51. Excellent recommendation. I've listened to this entire disk, on Youtube,

    numerous times. It was a good bet that I would like the file on my phone, and I would. I love the music and Youtube is the perfect place to determine what I would listen to, it is the only platform I use for streaming music.
  52. Google knows a lot of information about users from places

    besides the Play Store. It may be worth incorporating even more of this information, to improve the coverage of “Based on…” recommendations even further.
  53. Metrics & Evaluation

  54. Imagine running a side-by-side evaluation for every experiment. What could

    we do with this?
  55. Find examples: What actually happens isn’t always what we expect, so evaluations can help find a bunch of examples of what your experiment is actually doing, what’s wrong, and why.

    Iterate faster on new features: Launching new A/B tests can be slow, so human evaluation can provide a quicker feedback loop.

    Better launch decisions: There’s no single perfect metric. By incorporating a complementary relevance score into every experiment, we can hopefully improve long-term user happiness. We can even try training models on such a score.
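
One way a side-by-side evaluation might be turned into a per-experiment score is a win rate plus a sign test on the non-tied judgments. This is a sketch under assumptions: the "A"/"B"/"tie" judgment format, the choice of a two-sided sign test, and the sample counts are illustrative, not taken from the talk.

```python
from math import comb
from typing import List, Tuple


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided sign test p-value, ignoring ties."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarize(judgments: List[str]) -> Tuple[float, float]:
    """Return (win rate of B among non-ties, p-value)."""
    wins_a = judgments.count("A")
    wins_b = judgments.count("B")
    decided = wins_a + wins_b
    win_rate_b = wins_b / decided if decided else 0.5
    return win_rate_b, sign_test_p_value(wins_a, wins_b)


# Made-up judgments: raters preferred the experiment (B) 14 times,
# the control (A) 4 times, and saw no difference twice.
judgments = ["B"] * 14 + ["A"] * 4 + ["tie"] * 2
win_rate_b, p_value = summarize(judgments)
print(f"B win rate: {win_rate_b:.2f}, p-value: {p_value:.4f}")
```

The win rate can sit next to log metrics in a launch review, and the individual "A" judgments are the examples worth reading first.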
  56. UX

  57. Sometimes the biggest gains come from design changes. Where did

    users have problems with the Play Store’s UI?
  58. Try to find a TV episode you want to watch.

    “I wanted to watch the next episode of GOT, but it’s impossible for me to tell which one is which because the titles are cut off.”
  59. None
  60. Would improving the UI of customer reviews increase downloads and

    purchases?
  61. None
  62. None
  63. None
  64. Inventory

  65. None
  66. 37% of Animation is missing from the Play Store

  67. 40% of Anime is missing from the Play Store

  68. Even classics like I Love Lucy are missing
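
Missing-inventory percentages like those above can be computed by comparing a reference list of titles against what the store carries. A minimal sketch; the titles are placeholders and how the reference list is built (for example, by human raters searching the store) is an assumption.

```python
from typing import Iterable


def missing_share(reference_titles: Iterable[str],
                  store_titles: Iterable[str]) -> float:
    """Fraction of reference titles that the store does not carry."""
    reference = set(reference_titles)
    if not reference:
        raise ValueError("reference list must not be empty")
    missing = reference - set(store_titles)
    return len(missing) / len(reference)


# Placeholder titles for one genre.
reference = ["Show A", "Show B", "Show C", "Show D", "Show E"]
in_store = ["Show A", "Show C", "Show E"]
share = missing_share(reference, in_store)
print(f"{share:.0%} missing")  # 40% missing
```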

  69. Why is the Play Store’s inventory so limited? How does

    this affect brand perception and adoption?
  70. Pricing

  71. Over half of the Play Store’s music is more expensive than on Amazon

    note: biased towards popular music, not sampled from each genre
  72. Over 10% of the time, it’s more than twice as

    expensive!
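
The two pricing figures above are shares over matched items. A short sketch of that calculation, assuming each item has been matched to the same item in the other store; the prices are invented.

```python
from typing import List, Tuple


def price_comparison(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (share more expensive, share more than twice as expensive).

    Each pair is (play_store_price, competitor_price) for the same item.
    """
    if not pairs:
        raise ValueError("need at least one matched item")
    n = len(pairs)
    more = sum(1 for play, other in pairs if play > other)
    twice = sum(1 for play, other in pairs if play > 2 * other)
    return more / n, twice / n


# Invented prices for four matched items.
pairs = [(1.29, 0.99), (1.29, 1.29), (9.49, 4.49), (0.99, 1.29)]
share_more, share_twice = price_comparison(pairs)
print(f"more expensive: {share_more:.0%}, over 2x: {share_twice:.0%}")
```

As the slide's note says, a sample biased towards popular items may not reflect the full catalog, so the sampling method should be reported alongside the shares.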
  73. None
  74. None
  75. Summary

  76. In Books, Music, and Movies/TV, the Play Store lags behind

    some of its competitors.
  77. There are a few problems with the recommendation and related

    engines.
  78. Adding personalized user models should help some issues.

  79. “Based on X” recommendations (recommendations that use information from other

    Google properties) are the most useful.
  80. Fixing the UX could also improve metrics dramatically.

  81. The Play Store’s inventory is also very limited, which doesn’t

    make recommending good content any easier.
  82. In many cases, the Play Store is also more expensive.

  83. Building a discovery engine is hard. So is creating a

    marketplace. Can we mix humans and machines to make it easier?
  84. http://blog.echen.me/2014/10/07/moving-beyond-ctr-better-recommendations-through-human-evaluation/

    http://blog.echen.me/2013/01/08/improving-twitter-search-with-real-time-human-computation/

  85. hello@echen.me