Human Evaluation for Google Play

Slide 1

Slide 1 text

Man Meets Machine
Evaluation Insights for Android Play
Edwin Chen / echen.me

Slide 2

Slide 2 text

Overview

Slide 3

Slide 3 text

Building a search and discovery marketplace is hard. Building it for millions of diverse users and items, against established competitors like Apple and Amazon, is even harder.

Slide 4

Slide 4 text

How well does the Play Store do?

Slide 5

Slide 5 text

Are users receiving good recommendations? How do they compare for apps vs. movies? How friendly is the UI? Is it easy to find the Game of Thrones episode I want? Do inventory and pricing match up against other app stores like Apple and Amazon? Where else does the experience succeed and fail?

Slide 6

Slide 6 text

What are common problems with recommendations?

Slide 7

Slide 7 text

Do other stores do better?

Slide 8

Slide 8 text

Pick an app you enjoyed in the past year. “I'd just watched an episode of the Walking Dead and decided to look for a game that would simulate the worldwide spread of a disease, if it could be combated, etc.”
Rate one of the similar apps. Would you want to buy it? Horrible suggestion. “Never in a million years. I am not a teen girl, I have zero interest in celebs in real life, and I definitely don't want to simulate becoming one.”

Slide 9

Slide 9 text

What parts of the Play Store most need improvement?

Slide 10

Slide 10 text

Machine learning is only one aspect of a digital store. Is the pricing any good?

Slide 11

Slide 11 text

What about inventory?

Slide 12

Slide 12 text

Can I even find the products I’m looking for?

Slide 13

Slide 13 text

Human Evaluation

Slide 14

Slide 14 text

How can we measure and improve the quality of Play Store recommendations?

Slide 15

Slide 15 text

Log metrics are the best source of truth, but often don’t contain enough information. Can we quickly ask users what they think, at scale?

Slide 16

Slide 16 text

Pick a book you enjoyed in the past year. “It blends my two favorite genres, Comedy and Science fiction. I was engaged in the story but also laughing most of the way through, and Douglas Adams had a really great, unique writing style.”

Slide 17

Slide 17 text

Rate the first Similar book. Would you want to buy it? Horrible suggestion. “Google Play /thinks/ this is the same author, but they're not. Also, this is some kind of health monitoring textbook and completely unlike Douglas Adams' work.”

Slide 18

Slide 18 text

Rate the second Similar book. Would you want to buy it? Horrible suggestion. “Another textbook by a not-Douglas-Adams-Douglas-Adams. This time a textbook about structural dynamics.”

Slide 19

Slide 19 text

Rate the third Similar book. Would you want to buy it? Fairly bad suggestion. “The reviews aren't too great (but ok) and it is the second book in a series.”

Slide 20

Slide 20 text

Potential Improvements
1. Build an author disambiguation model (say, by using genre or other features), and stop showing related books by different authors with the same name.
2. Penalize poorly rated books, or ratings that differ from the rating of the original.
3. Build series detectors, and only recommend related books that are the first in the series (unless you’re recommending a follow-up to the target book).
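The three improvements above could be prototyped as post-processing rules on top of an existing related-books score. The sketch below is only an illustration under stated assumptions: the `Book` fields, the genre-match stand-in for author disambiguation, the 3.5-star cutoff, and the penalty weights are all hypothetical and are not taken from the Play Store's actual models.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Book:
    title: str
    author_name: str
    genre: str
    avg_rating: float                   # assumed 1.0-5.0 star average
    series: Optional[str] = None        # series name, if any
    series_index: Optional[int] = None  # 1 = first volume


LOW_RATING = 3.5  # assumed cutoff for "poorly rated"


def likely_same_author(seed: Book, cand: Book) -> bool:
    # Crude stand-in for a disambiguation model: a matching name only
    # counts as the same person when the genre matches too.
    return seed.author_name == cand.author_name and seed.genre == cand.genre


def adjusted_score(seed: Book, cand: Book, base_score: float) -> float:
    # 1. Same name but probably a different person: drop the candidate.
    if seed.author_name == cand.author_name and not likely_same_author(seed, cand):
        return 0.0
    score = base_score
    # 2. Penalize poorly rated books and ratings far from the seed's rating.
    if cand.avg_rating < LOW_RATING:
        score *= 0.5
    gap = min(abs(seed.avg_rating - cand.avg_rating) / 4.0, 1.0)
    score *= 1.0 - 0.5 * gap
    # 3. Series: keep first volumes, or the direct follow-up to the seed.
    if cand.series is not None and cand.series_index is not None:
        is_first = cand.series_index == 1
        is_follow_up = (
            cand.series == seed.series
            and seed.series_index is not None
            and cand.series_index == seed.series_index + 1
        )
        if not (is_first or is_follow_up):
            return 0.0
    return score


def rerank(seed: Book, candidates: List[Tuple[Book, float]]) -> List[Book]:
    # candidates: (book, base relevance score) pairs from the existing model.
    scored = [(adjusted_score(seed, book, base), book) for book, base in candidates]
    kept = [(s, b) for s, b in scored if s > 0.0]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [b for _, b in kept]
```

Hard drops are used for the author and series rules because the rater comments treat those suggestions as flatly wrong, while the rating rule only scales the score down.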

Slide 21

Slide 21 text

Pick a book you enjoyed in the past year. “I was a fan of the Lord of the Rings movies, but it was too many books. I liked that The Hobbit was short, and it was humorous. I was able to knock it out quickly.”

Slide 22

Slide 22 text

Rate the first Similar book. Would you want to buy it? Horrible suggestion. “Just reading and watching The Hobbit is enough for me. I don't want to delve into this world any further.”

Slide 23

Slide 23 text

Rate the second Similar book. Would you want to buy it? Horrible suggestion. “This is just a study guide, so something I would have used when I didn't read the books in school. But definitely not something I would need to use now.”

Slide 24

Slide 24 text

Rate the third Similar book. Would you want to buy it? Horrible suggestion. “It's yet another study guide! And not necessary for me.”

Slide 25

Slide 25 text

Potential Improvements
1. Penalize books in different genres.
2. Stop showing study guides everywhere!

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

And Hunger Games, Hobbit, and Hitchhiker’s Guide to the Galaxy are all extremely popular items! Until these models improve, what if we could manually fix the Related Item recommendations at the head?

Slide 28

Slide 28 text

Elite Proletariat

Slide 29

Slide 29 text

I sit down at the computer like, “What up? I got some big debt!”
I'm so pumped about tasks from my work desk
Transcription on the down low, it's so damn noisy
Trying to make out people sayin', “Damn! That podcast's crazy!”
Categorizin' ten layers deep, trying to get it right,
Brain is startin' to keep me up all night
Dreamin' about those apps, pennies raining down
Probably need a break, they're all I can think about
(Taskssssssss…)
But shit, it paid ninety-nine cents! (Work it!)

Slide 30

Slide 30 text

Faster, Cheaper, Stronger
The Video Inventory Analytics team found that, compared to the internal raters they had been using for a while, our elite proletariat was
• 7X cheaper
• 5X faster
• Higher quality: on 500 videos, their internal raters made 15 mistakes, compared to 3 mistakes by our workers (on their first exposure to the task!)
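The quality comparison above can be turned into error rates with a couple of lines of arithmetic. This is a minimal sketch using only the counts quoted on the slide; the helper name `error_rate` is made up for illustration.

```python
def error_rate(mistakes: int, total: int) -> float:
    # Fraction of rated items that were labeled incorrectly.
    if total <= 0:
        raise ValueError("total must be positive")
    return mistakes / total


VIDEOS = 500
internal = error_rate(15, VIDEOS)  # internal raters
crowd = error_rate(3, VIDEOS)      # crowd workers

print(f"internal raters: {internal:.1%}")  # 3.0%
print(f"crowd workers:   {crowd:.1%}")     # 0.6%
print(f"ratio of mistakes: {15 / 3:.0f}x")  # 5x
```

With counts this small the gap is suggestive rather than conclusive, so a larger sample would be needed before treating the 5x figure as stable.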

Slide 31

Slide 31 text

Rundown
• Want 100,000 apps categorized? Just give us a couple days.
• Need special languages? We can even handle esoteric languages like Thai and Icelandic.
• Require specific types of workers? We can recruit whatever you need (e.g., Korean Android users who are also heavy PlayStation players).
• Have complex tasks? It’s one of our specialties. We have the full power of the human brain available, so we shouldn't be limited to labeling cat images. We can download apps and play them, write catchy descriptions, etc.

Slide 32

Slide 32 text

Similar Books

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

note: biased towards popular books, not randomly sampled — but a random sample would likely be even worse!

Slide 35

Slide 35 text

Similar Apps

Slide 36

Slide 36 text

Pick an app you enjoyed in the past year. “I downloaded the app because I saw that everyone was playing it and always wanted to play it myself.”

Slide 37

Slide 37 text

Rate the second Similar app. Would you want to buy it? Horrible suggestion. “The game itself looks very badly programmed and put together, while putting in Minecraft content that looks terribly made.”

Slide 38

Slide 38 text

Rate the third Similar app. Would you want to buy it? Horrible suggestion. “It doesn't seem like a very good app, since the graphics and style look horrible.”

Slide 39

Slide 39 text

Graphics, screenshots, and quality matter. And even without knowing that the related apps look horrible, it’s easy to tell that Angry Birds is on a completely different level of popularity and professionalism.

Slide 40

Slide 40 text

Pick an app you enjoyed in the past year. “I'd just watched an episode of the Walking Dead and decided to look for a game that would simulate the worldwide spread of a disease, if it could be combated, etc.”

Slide 41

Slide 41 text

Rate the second Similar app. Would you want to buy it? Horrible suggestion. “I don't have any interest in collecting cute little dragons.”

Slide 42

Slide 42 text

Rate the third Similar app. Would you want to buy it? Horrible suggestion. “Never in a million years. I am not a teen girl, I have zero interest in celebs in real life, and I definitely don't want to simulate becoming one.”

Slide 43

Slide 43 text

These apps are for completely different demographics. Do the Play Store’s machine learning models use personalized age and gender features?

Slide 44

Slide 44 text

Recommendations

Slide 45

Slide 45 text

Go to your Play Store homepage, and rate the first three Recommended for You items.

Slide 46

Slide 46 text

Which items have the best recommendations?

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Which recommendation reasons are the best?

Slide 49

Slide 49 text

“Based on X” recommendations are the most relevant. “Top X” recommendations are the least relevant.
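A comparison like this can be computed by mapping each rater label to a number and averaging per recommendation reason. The sketch below assumes a five-point scale: only "Horrible", "Fairly bad", and "Excellent" appear in this deck, so the middle labels and the numeric values are hypothetical, as is the input format.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Assumed label-to-score mapping; the middle labels are placeholders.
SCALE = {
    "horrible": 1,
    "fairly bad": 2,
    "okay": 3,
    "fairly good": 4,
    "excellent": 5,
}


def mean_score_by_reason(judgments: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    # judgments: (recommendation reason, rater label) pairs.
    by_reason = defaultdict(list)
    for reason, label in judgments:
        by_reason[reason].append(SCALE[label.lower()])
    return {reason: mean(scores) for reason, scores in by_reason.items()}
```

Calling `mean_score_by_reason([("Based on X", "Excellent"), ("Top X", "Horrible")])` would return one average per reason, which can then be sorted to rank the reasons.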

Slide 50

Slide 50 text

Horrible recommendation. I listen to all of my music on Google Play Music so you would think they would know what type of music to recommend to me. I would never listen to this horrible music. This is not a genre I ever listen to. It says Top selling song but I would rather have something personalized and recommended to fit my tastes, not just a top seller.

Slide 51

Slide 51 text

Excellent recommendation. I've listened to this entire disk, on Youtube, numerous times. It was a good bet that I would like the file on my phone, and I would. I love the music and Youtube is the perfect place to determine what I would listen to, it is the only platform I use for streaming music.

Slide 52

Slide 52 text

Google knows a lot of information about users from places besides the Play Store. It may be worth incorporating even more of this information, to improve the coverage of “Based on…” recommendations even further.

Slide 53

Slide 53 text

Metrics & Evaluation

Slide 54

Slide 54 text

Imagine running a side-by-side evaluation for every experiment. What could we do with this?

Slide 55

Slide 55 text

Find examples
What actually happens isn’t always what we expect, so evaluations can help find a bunch of examples of what your experiment is actually doing, what’s wrong, and why.
Iterate faster on new features
Launching new A/B tests can be slow, so human evaluation can provide a quicker feedback loop.
Better launch decisions
There’s no single perfect metric. By incorporating a complementary relevance score into every experiment, we can hopefully improve long-term user happiness. We can even try training models on such a score.
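One simple way to turn side-by-side judgments into a complementary relevance score is a win rate with an exact sign test. This is a sketch under assumptions: the vote labels `"experiment"`, `"control"`, and `"tie"` and the choice of a sign test are illustrative, not the method described in the talk.

```python
from math import comb
from typing import Dict, Iterable, Optional, Union


def side_by_side_summary(votes: Iterable[str]) -> Dict[str, Union[int, float, Optional[float]]]:
    # Each vote is 'experiment', 'control', or 'tie': which side the rater preferred.
    votes = list(votes)
    wins = sum(v == "experiment" for v in votes)
    losses = sum(v == "control" for v in votes)
    ties = sum(v == "tie" for v in votes)
    n = wins + losses  # ties are dropped for the sign test

    if n == 0:
        return {"wins": 0, "losses": 0, "ties": ties, "win_rate": None, "p_value": 1.0}

    # Two-sided exact sign test under the null that each side wins half the time.
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)

    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "win_rate": wins / n,
        "p_value": p_value,
    }
```

The per-item votes behind the summary are worth keeping too, since the rater comments attached to the losses are where the concrete failure examples come from.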

Slide 56

Slide 56 text

Slide 57

Slide 57 text

Sometimes the biggest gains come from design changes. Where did users have problems with the Play Store’s UI?

Slide 58

Slide 58 text

Try to find a TV episode you want to watch. “I wanted to watch the next episode of GOT, but it’s impossible for me to tell which one is which because the titles are cut off.”

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

Would improving the UI of customer reviews increase downloads and purchases?

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

Inventory

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

37% of Animation is missing from the Play Store

Slide 67

Slide 67 text

40% of Anime is missing from the Play Store

Slide 68

Slide 68 text

Even classics like I Love Lucy are missing

Slide 69

Slide 69 text

Why is the Play Store’s inventory so limited? How does this affect brand perception and adoption?

Slide 70

Slide 70 text

Pricing

Slide 71

Slide 71 text

Over half of the Play Store’s music is more expensive than Amazon
note: biased towards popular music, not sampled from each genre

Slide 72

Slide 72 text

Over 10% of the time, it’s more than twice as expensive!
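The two pricing figures are simple fractions over matched items. Below is a minimal sketch of that calculation, assuming each item has been matched by hand or by a rater into a `(play_price, amazon_price)` pair; the function name and input format are hypothetical.

```python
from typing import Iterable, Tuple


def price_comparison(pairs: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    # pairs: (Play Store price, Amazon price) for the same item.
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one matched item")
    n = len(pairs)
    # Share of items that cost more on the Play Store.
    more_expensive = sum(play > other for play, other in pairs) / n
    # Share of items that cost more than twice as much on the Play Store.
    over_double = sum(play > 2 * other for play, other in pairs) / n
    return more_expensive, over_double
```

Because the sample in the deck is biased towards popular music, the same function would need to be run per genre on a stratified sample to say anything about the catalog as a whole.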

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

Summary

Slide 76

Slide 76 text

In Books, Music, and Movies/TV, the Play Store lags behind some of its competitors.

Slide 77

Slide 77 text

There are a few problems with the recommendation and related engines.

Slide 78

Slide 78 text

Adding personalized user models should help with some issues.

Slide 79

Slide 79 text

“Based on X” recommendations (recommendations that use information from other Google properties) are the most useful.

Slide 80

Slide 80 text

Fixing the UX could also improve metrics dramatically.

Slide 81

Slide 81 text

The Play Store’s inventory is also very limited, which doesn’t make recommending good content any easier.

Slide 82

Slide 82 text

In many cases, the Play Store is also more expensive.

Slide 83

Slide 83 text

Building a discovery engine is hard. So is creating a marketplace. Can we mix humans and machines to make it easier?

Slide 84

Slide 84 text

http://blog.echen.me/2014/10/07/moving-beyond-ctr-better-recommendations-through-human-evaluation/
http://blog.echen.me/2013/01/08/improving-twitter-search-with-real-time-human-computation/

Slide 85

Slide 85 text

[email protected]