Human Evaluation for Google Play

Edwin Chen
February 01, 2015

Transcript

  1. Building a search and discovery marketplace is hard. Building it

    for millions of diverse users and items, against established competitors like Apple and Amazon, is even harder.
  2. Are users receiving good recommendations? How do they compare for

    apps vs. movies? How friendly is the UI? Is it easy to find the Game of Thrones episode I want? Do inventory and pricing match up against other app stores like Apple and Amazon? Where else does the experience succeed and fail?
  3. Pick an app you enjoyed in the past year. Rate

    one of the similar apps. Would you want to buy it? Horrible suggestion. “Never in a million years. I am not a teen girl, I have zero interest in celebs in real life, and I definitely don't want to simulate becoming one.” “I'd just watched an episode of the Walking Dead and decided to look for a game that would simulate the worldwide spread of a disease, if it could be combated, etc.”
  4. Log metrics are the best source of truth, but often

    don’t contain enough information. Can we quickly ask users what they think, at scale?
  5. Pick a book you enjoyed in the past year. “It

    blends my two favorite genres, Comedy and Science fiction. I was engaged in the story but also laughing most of the way through, and Douglas Adams had a really great, unique writing style.”
  6. Rate the first Similar book. Would you want to buy

    it? Horrible suggestion. “Google Play /thinks/ this is the same author, but they're not. Also, this is some kind of health monitoring textbook and completely unlike Douglas Adams' work.”
  7. Rate the second Similar book. Would you want to buy

    it? Horrible suggestion. “Another textbook by a not-Douglas-Adams-Douglas-Adams. This time a textbook about structural dynamics.”
  8. Rate the third Similar book. Would you want to buy

    it? Fairly bad suggestion. “The reviews aren't too great (but ok) and it is the second book in a series.”
  9. Potential Improvements

    1. Build an author disambiguation model (say, by using genre or other features), and stop showing related books by different authors with the same name.
    2. Penalize poorly rated books, or books whose ratings differ from the rating of the original.
    3. Build series detectors, and only recommend related books that are the first in their series (unless you’re recommending a follow-up to the target book).
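The three improvements above could be prototyped as a simple post-filter on the Related Items list. The sketch below is only an illustration under stated assumptions: the `Book` fields, the `author_id` produced by some author disambiguation model, the `max_rating_gap` threshold, and the `filter_related` function are all hypothetical names, not anything the Play Store is known to use. It also applies a hard filter on ratings rather than the softer penalty the slide suggests.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Book:
    title: str
    author_name: str
    author_id: str                      # hypothetical: output of an author disambiguation model
    avg_rating: float                   # hypothetical: average star rating
    series: Optional[str] = None
    series_index: Optional[int] = None  # 1 means first book in the series


def filter_related(target: Book, candidates: List[Book],
                   max_rating_gap: float = 1.0) -> List[Book]:
    """Drop related-book candidates that trip any of the three heuristics."""
    kept = []
    for book in candidates:
        # 1. Same author name but a different person: drop.
        if book.author_name == target.author_name and book.author_id != target.author_id:
            continue
        # 2. Rated much lower than the target: drop.
        #    (A hard cutoff; only the "worse than the original" direction is checked.)
        if target.avg_rating - book.avg_rating > max_rating_gap:
            continue
        # 3. Mid-series books are kept only if they directly follow the target.
        if book.series is not None and book.series_index is not None:
            is_first = book.series_index == 1
            is_follow_up = (
                book.series == target.series
                and target.series_index is not None
                and book.series_index == target.series_index + 1
            )
            if not (is_first or is_follow_up):
                continue
        kept.append(book)
    return kept
```

Calling `filter_related(target, candidates)` returns the surviving candidates in their original order, so it can sit after whatever ranking model produced the list.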
  10. Pick a book you enjoyed in the past year. “I

    was a fan of the Lord of the Rings movies, but it was too many books. I liked that The Hobbit was short, and it was humorous. I was able to knock it out quickly.”
  11. Rate the first Similar book. Would you want to buy

    it? Horrible suggestion. “Just reading and watching The Hobbit is enough for me. I don't want to delve into this world any further.”
  12. Rate the second Similar book. Would you want to buy

    it? Horrible suggestion. “This is just a study guide, so something I would have used when I didn't read the books in school. But definitely not something I would need to use now.”
  13. Rate the third Similar book. Would you want to buy

    it? Horrible suggestion. “It's yet another study guide! And not necessary for me.”
  14. And Hunger Games, Hobbit, and Hitchhiker’s Guide to the Galaxy

    are all extremely popular items! Until these models improve, what if we could manually fix the Related Item recommendations at the head?
  15. I sit down at the computer like, “What up? I

    got some big debt!” / I'm so pumped about tasks from my work desk / Transcription on the down low, it's so damn noisy / Trying to make out people sayin', “Damn! That podcast's crazy!” / Categorizin' ten layers deep, trying to get it right, / Brain is startin' to keep me up all night / Dreamin' about those apps, pennies raining down / Probably need a break, they're all I can think about / (Taskssssssss…) / But shit, it paid ninety-nine cents! (Work it!)
  16. Faster, Cheaper, Stronger

    The Video Inventory Analytics team found that, compared to the internal raters they had been using for a while, our elite proletariat was:
    • 7X cheaper
    • 5X faster
    • Higher quality: on 500 videos, their internal raters made 15 mistakes, compared to 3 mistakes by our workers (on their first exposure to the task!)
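To put the quality figures above on a common footing, the mistake counts can be turned into error rates. This is a small arithmetic sketch, assuming both groups rated the same 500 videos (the slide implies this but does not say so outright); the variable names are illustrative.

```python
from fractions import Fraction

VIDEOS = 500  # assumption: both groups were scored on the same 500 videos

internal_error = Fraction(15, VIDEOS)  # internal raters: 15 mistakes
crowd_error = Fraction(3, VIDEOS)      # crowd workers: 3 mistakes

print(f"internal raters: {float(internal_error):.1%}")  # 3.0%
print(f"crowd workers:   {float(crowd_error):.1%}")     # 0.6%
print(f"ratio: {internal_error / crowd_error}x")        # 5x
```

In other words, the crowd workers made one fifth as many mistakes on this sample, although with counts this small the exact ratio should be read loosely.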
  17. Rundown

    • Want 100,000 apps categorized? Just give us a couple days.
    • Need special languages? We can even handle esoteric languages like Thai and Icelandic.
    • Require specific types of workers? We can recruit whatever you need (e.g., Korean Android users who are also heavy PlayStation players).
    • Have complex tasks? It’s one of our specialties. We have the full power of the human brain available, so we shouldn't be limited to labeling cat images. We can download apps and play them, write catchy descriptions, etc.
  18. note: biased towards popular books, not randomly sampled — but

    a random sample would likely be even worse!
  19. Pick an app you enjoyed in the past year. “I

    downloaded the app because I saw that everyone was playing it and always wanted to play it myself.”
  20. Rate the second Similar app. Would you want to buy

    it? Horrible suggestion. “The game itself looks very badly programmed and put together, while putting in Minecraft content that looks terribly made.”
  21. Rate the third Similar app. Would you want to buy

    it? Horrible suggestion. “It doesn't seem like a very good app, since the graphics and style look horrible.”
  22. Graphics, screenshots, and quality matter. And even without knowing that

    the related apps look horrible, it’s easy to tell that Angry Birds is in a completely different level of popularity and professionalism.
  23. Pick an app you enjoyed in the past year. “I'd

    just watched an episode of the Walking Dead and decided to look for a game that would simulate the worldwide spread of a disease, if it could be combated, etc.”
  24. Rate the second Similar app. Would you want to buy

    it? Horrible suggestion. “I don't have any interest in collecting cute little dragons.”
  25. Rate the third Similar app. Would you want to buy

    it? Horrible suggestion. “Never in a million years. I am not a teen girl, I have zero interest in celebs in real life, and I definitely don't want to simulate becoming one.”
  26. These apps are for completely different demographics. Do the Play

    Store’s machine learning models use personalized age and gender features?
  27. Go to your Play Store homepage, and rate the first

    three Recommended for You items.
  28. Horrible recommendation. I listen to all of my music on

    Google Play Music so you would think they would know what type of music to recommend to me. I would never listen to this horrible music. This is not a genre I ever listen to. It says Top selling song but I would rather have something personalized and recommended to fit my tastes, not just a top seller.
  29. Excellent recommendation. I've listened to this entire disk, on YouTube,

    numerous times. It was a good bet that I would like the file on my phone, and I would. I love the music and YouTube is the perfect place to determine what I would listen to, it is the only platform I use for streaming music.
  30. Google knows a lot of information about users from places

    besides the Play Store. It may be worth incorporating even more of this information, to improve the coverage of “Based on…” recommendations even further.
  31. Find examples. What actually happens isn’t always what we expect, so evaluations can help find a bunch of examples of what your experiment is actually doing, what’s wrong, and why.

    Iterate faster on new features. Launching new A/B tests can be slow, so human evaluation can provide a quicker feedback loop.

    Better launch decisions. There’s no single perfect metric. By incorporating a complementary relevance score into every experiment, we can hopefully improve long-term user happiness. We can even try training models on such a score.
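One way to picture the complementary relevance score mentioned above is as an average of rater labels per experiment arm. The sketch below is a guess at the shape of such a score, not the method used in the talk: the numeric `SCALE` mapping, the "fairly good" label, and the `relevance_by_arm` function are assumptions (only "horrible", "fairly bad", and "excellent" appear in the slides).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from rater labels to numbers.
# "fairly good" is an assumed label; the slides do not show the full scale.
SCALE = {"horrible": 0, "fairly bad": 1, "fairly good": 2, "excellent": 3}
MAX_SCORE = max(SCALE.values())


def relevance_by_arm(ratings):
    """Average normalized rating (0.0 to 1.0) for each experiment arm.

    `ratings` is an iterable of (arm_name, label) pairs.
    """
    scores = defaultdict(list)
    for arm, label in ratings:
        scores[arm].append(SCALE[label] / MAX_SCORE)
    return {arm: mean(values) for arm, values in scores.items()}


ratings = [
    ("control", "horrible"),
    ("control", "fairly bad"),
    ("treatment", "excellent"),
    ("treatment", "fairly good"),
]
print(relevance_by_arm(ratings))
```

A score like this could be reported next to log metrics for each A/B arm; treating the labels as evenly spaced numbers is a simplification that a real analysis might want to revisit.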
  32. UX

  33. Sometimes the biggest gains come from design changes. Where did

    users have problems with the Play Store’s UI?
  34. Try to find a TV episode you want to watch.

    “I wanted to watch the next episode of GOT, but it’s impossible for me to tell which one is which because the titles are cut off.”
  35. Why is the Play Store’s inventory so limited? How does

    this affect brand perception and adoption?
  36. Note: biased towards popular music, not sampled from each genre.

    Over half of the Play Store’s music is more expensive than on Amazon.
  37. The Play Store’s inventory is also very limited, which doesn’t

    make recommending good content any easier.
  38. Building a discovery engine is hard. So is creating a

    marketplace. Can we mix humans and machines to make it easier?