Human Evaluation for Google Play

Man Meets Machine
Evaluation Insights for Android Play
Edwin Chen / echen.me

Overview

Building a search and discovery marketplace is hard. Building it for millions of diverse users and items, against established competitors like Apple and Amazon, is even harder.

How well does the Play Store do?

Are users receiving good recommendations? How do they compare for apps vs. movies? How friendly is the UI? Is it easy to find the Game of Thrones episode I want? Do inventory and pricing match up against other app stores like Apple and Amazon? Where else does the experience succeed and fail?

What are common problems with recommendations?

Do other stores do better?

Pick an app you enjoyed in the past year. “I'd just watched an episode of the Walking Dead and decided to look for a game that would simulate the worldwide spread of a disease, if it could be combated, etc.”

Rate one of the similar apps. Would you want to buy it? Horrible suggestion. “Never in a million years. I am not a teen girl, I have zero interest in celebs in real life, and I definitely don't want to simulate becoming one.”

What parts of the Play Store most need improvement?

Machine learning is only one aspect of a digital store.
Is the pricing any good?

What about inventory?

Can I even find the products I’m looking for?

Human Evaluation

How can we measure and improve the quality of Play Store recommendations?

Log metrics are the best source of truth, but often don’t contain enough information. Can we quickly ask users what they think, at scale?

Pick a book you enjoyed in the past year. “It blends my two favorite genres, Comedy and Science fiction. I was engaged in the story but also laughing most of the way through, and Douglas Adams had a really great, unique writing style.”

Rate the first Similar book. Would you want to buy it? Horrible suggestion. “Google Play /thinks/ this is the same author, but they're not. Also, this is some kind of health monitoring textbook and completely unlike Douglas Adams' work.”

Rate the second Similar book. Would you want to buy it? Horrible suggestion. “Another textbook by a not-Douglas-Adams-Douglas-Adams. This time a textbook about structural dynamics.”

Rate the third Similar book. Would you want to buy it? Fairly bad suggestion. “The reviews aren't too great (but ok) and it is the second book in a series.”

Potential Improvements
1. Build an author disambiguation model (say, by using genre or other features), and stop showing related books by different authors with the same name.
2. Penalize poorly rated books, or books whose rating differs from that of the original.
3. Build series detectors, and only recommend related books that are the first in the series (unless you’re recommending a follow-up to the target book).
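The three improvements above could be prototyped as a simple re-ranking pass over the candidate list. The sketch below is only an illustration under assumptions: the `Book` record, its fields, and the helper names are hypothetical, the "same name plus shared genre" check stands in for a real author disambiguation model, and the rating penalty is just the absolute rating gap to the original book. The records in any usage would be made-up examples, not Play Store data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Book:
    title: str
    author: str
    genres: frozenset
    avg_rating: float                    # store rating, 1.0 to 5.0
    series: Optional[str] = None         # hypothetical series identifier
    series_index: Optional[int] = None   # 1 = first book in the series


def likely_same_author(target: Book, cand: Book) -> bool:
    # Crude disambiguation: a shared name only counts if genres overlap.
    return cand.author == target.author and bool(cand.genres & target.genres)


def passes_series_rule(target: Book, cand: Book) -> bool:
    # Standalone books and first volumes are always allowed.
    if cand.series is None or cand.series_index == 1:
        return True
    # A later volume is allowed only as the direct follow-up to the target.
    return (
        cand.series == target.series
        and target.series_index is not None
        and cand.series_index == target.series_index + 1
    )


def rerank_similar(target: Book, candidates: List[Book], k: int = 3) -> List[Book]:
    kept = []
    for cand in candidates:
        name_clash = (
            cand.author == target.author
            and not likely_same_author(target, cand)
        )
        if name_clash or not passes_series_rule(target, cand):
            continue
        # Penalty grows with the rating gap to the original book.
        penalty = abs(target.avg_rating - cand.avg_rating)
        kept.append((penalty, cand))
    kept.sort(key=lambda pair: pair[0])
    return [cand for _, cand in kept[:k]]
```

Dropping name clashes outright is a blunt choice: it would also hide a genuine author who writes in an unrelated genre, so a production model would likely want softer evidence than genre overlap alone.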

Pick a book you enjoyed in the past year. “I was a fan of the Lord of the Rings movies, but it was too many books. I liked that The Hobbit was short, and it was humorous. I was able to knock it out quickly.”

Rate the first Similar book. Would you want to buy it? Horrible suggestion. “Just reading and watching The Hobbit is enough for me. I don't want to delve into this world any further.”

Rate the second Similar book. Would you want to buy it? Horrible suggestion. “This is just a study guide, so something I would have used when I didn't read the books in school. But definitely not something I would need to use now.”

Rate the third Similar book. Would you want to buy it? Horrible suggestion. “It's yet another study guide! And not necessary for me.”

Potential Improvements
1. Penalize books in different genres.
2. Stop showing study guides everywhere!
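As a rough sketch of these two ideas, the snippet below scores candidates by genre overlap (Jaccard similarity) and drops anything whose title looks like a study guide. The keyword list and function names are assumptions made for illustration; a real system would more likely rely on catalog metadata than on title matching.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical title hints; a real catalog would likely expose a category flag.
STUDY_GUIDE_HINTS = ("study guide", "summary and analysis")


def genre_overlap(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity between two genre sets (0.0 = disjoint, 1.0 = identical)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_like_study_guide(title: str) -> bool:
    lowered = title.lower()
    return any(hint in lowered for hint in STUDY_GUIDE_HINTS)


def filter_and_score(
    target_genres: Set[str],
    candidates: List[Tuple[str, Set[str]]],
) -> List[Tuple[str, float]]:
    """Drop study guides, then sort the rest by genre overlap, best first."""
    scored = [
        (title, genre_overlap(target_genres, genres))
        for title, genres in candidates
        if not looks_like_study_guide(title)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```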

And Hunger Games, Hobbit, and Hitchhiker’s Guide to the Galaxy are all extremely popular items! Until these models improve, what if we could manually fix the Related Item recommendations at the head?

Elite Proletariat

I sit down at the computer like, “What up? I got some big debt!”
I'm so pumped about tasks from my work desk
Transcription on the down low, it's so damn noisy
Trying to make out people sayin', “Damn! That podcast's crazy!”
Categorizin' ten layers deep, trying to get it right,
Brain is startin' to keep me up all night
Dreamin' about those apps, pennies raining down
Probably need a break, they're all I can think about
(Taskssssssss…)
But shit, it paid ninety-nine cents! (Work it!)

Faster, Cheaper, Stronger
The Video Inventory Analytics team found that, compared to the internal raters they had been using for a while, our elite proletariat was
• 7X cheaper
• 5X faster
• Higher quality: on 500 videos, their internal raters made 15 mistakes, compared to 3 mistakes by our workers (on their first exposure to the task!)
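To put the quality numbers on a common scale, the arithmetic can be worked out directly. The snippet uses only the counts quoted above (500 videos, 15 vs. 3 mistakes); the variable names are illustrative, and with counts this small the comparison is an indication rather than a significance test.

```python
def error_rate(mistakes: int, total: int) -> float:
    """Share of items labeled incorrectly."""
    return mistakes / total


TOTAL_VIDEOS = 500
INTERNAL_MISTAKES = 15
WORKER_MISTAKES = 3

internal_rate = error_rate(INTERNAL_MISTAKES, TOTAL_VIDEOS)
worker_rate = error_rate(WORKER_MISTAKES, TOTAL_VIDEOS)
# Same sample size on both sides, so the mistake counts can be compared directly.
mistake_ratio = INTERNAL_MISTAKES / WORKER_MISTAKES

print(f"internal raters: {internal_rate:.1%} error rate")   # 3.0%
print(f"crowd workers:   {worker_rate:.1%} error rate")     # 0.6%
print(f"internal raters made {mistake_ratio:.0f}x as many mistakes")  # 5x
```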

Rundown
• Want 100,000 apps categorized? Just give us a couple days.
• Need special languages? We can even handle esoteric languages like Thai and Icelandic.
• Require specific types of workers? We can recruit whatever you need (e.g., Korean Android users who are also heavy PlayStation players).
• Have complex tasks? It’s one of our specialties. We have the full power of the human brain available, so we shouldn't be limited to labeling cat images. We can download apps and play them, write catchy descriptions, etc.

Similar Books

Note: biased towards popular books, not randomly sampled, but a random sample would likely be even worse!

Similar Apps

Pick an app you enjoyed in the past year. “I downloaded the app because I saw that everyone was playing it and always wanted to play it myself.”

Rate the second Similar app. Would you want to buy it? Horrible suggestion. “The game itself looks very badly programmed and put together, while putting in Minecraft content that looks terribly made.”

Rate the third Similar app. Would you want to buy it? Horrible suggestion. “It doesn't seem like a very good app, since the graphics and style look horrible.”

Graphics, screenshots, and quality matter. And even without knowing that the related apps look horrible, it’s easy to tell that Angry Birds is on a completely different level of popularity and professionalism.

Pick an app you enjoyed in the past year. “I'd just watched an episode of the Walking Dead and decided to look for a game that would simulate the worldwide spread of a disease, if it could be combated, etc.”

Rate the second Similar app. Would you want to buy it? Horrible suggestion. “I don't have any interest in collecting cute little dragons.”

Rate the third Similar app. Would you want to buy it? Horrible suggestion. “Never in a million years. I am not a teen girl, I have zero interest in celebs in real life, and I definitely don't want to simulate becoming one.”

These apps are for completely different demographics. Do the Play Store’s machine learning models use personalized age and gender features?

Recommendations

Go to your Play Store homepage, and rate the first three Recommended for You items.

Which items have the best recommendations?

Which recommendation reasons are the best?

“Based on X” recommendations are the most relevant. “Top X” recommendations are the least relevant.
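A comparison like this comes down to grouping rater scores by the recommendation reason shown to the user. A minimal sketch, assuming each judgment has already been mapped to a numeric score (for example 1 = horrible up to 4 = excellent, which is an assumed scale, not one stated in the deck):

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_score_by_reason(ratings: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average rater score for each recommendation reason."""
    by_reason = defaultdict(list)
    for reason, score in ratings:
        by_reason[reason].append(score)
    return {reason: mean(scores) for reason, scores in by_reason.items()}


# Illustrative, made-up judgments: (recommendation reason, numeric score).
example = [
    ("Based on X", 4),
    ("Based on X", 3),
    ("Top X", 1),
    ("Top X", 2),
]
print(mean_score_by_reason(example))
```

The same grouping works for item type (apps, books, music, movies) by swapping the key.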

Horrible recommendation. I listen to all of my music on Google Play Music so you would think they would know what type of music to recommend to me. I would never listen to this horrible music. This is not a genre I ever listen to. It says Top selling song but I would rather have something personalized and recommended to fit my tastes, not just a top seller.

Excellent recommendation. I've listened to this entire disk, on Youtube, numerous times. It was a good bet that I would like the file on my phone, and I would. I love the music and Youtube is the perfect place to determine what I would listen to, it is the only platform I use for streaming music.

Google knows a lot of information about users from places besides the Play Store. It may be worth incorporating even more of this information, to improve the coverage of “Based on…” recommendations even further.

Metrics & Evaluation

Imagine running a side-by-side evaluation for every experiment. What could we do with this?

Find examples
What actually happens isn’t always what we expect, so evaluations can help find a bunch of examples of what your experiment is actually doing, what’s wrong, and why.

Iterate faster on new features
Launching new A/B tests can be slow, so human evaluation can provide a quicker feedback loop.

Better launch decisions
There’s no single perfect metric. By incorporating a complementary relevance score into every experiment, we can hopefully improve long-term user happiness. We can even try training models on such a score.
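One way a side-by-side evaluation could be turned into a complementary score is to count which arm raters prefer and run an exact sign test on the non-tied judgments. This is a sketch under assumptions: the label strings and the function name are hypothetical, and the sign test is one reasonable choice rather than the method the deck describes.

```python
from math import comb
from typing import Dict, List, Optional


def side_by_side_summary(judgments: List[str]) -> Dict[str, Optional[float]]:
    """Summarize rater preferences; each judgment is 'experiment', 'control', or 'tie'."""
    wins = sum(j == "experiment" for j in judgments)
    losses = sum(j == "control" for j in judgments)
    n = wins + losses  # ties carry no preference and are excluded
    if n == 0:
        return {"wins": 0, "losses": 0, "win_rate": None, "p_value": 1.0}
    # Exact two-sided sign test: how surprising is a split this lopsided
    # if raters had no real preference (each side equally likely)?
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return {
        "wins": wins,
        "losses": losses,
        "win_rate": wins / n,
        "p_value": min(1.0, 2 * tail),
    }
```

For example, 9 preferences for the experiment against 1 for control (ties ignored) gives a win rate of 0.9 and a p-value of 22/1024, roughly 0.021.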

Sometimes the biggest gains come from design changes. Where did users have problems with the Play Store’s UI?

Try to find a TV episode you want to watch. “I wanted to watch the next episode of GOT, but it’s impossible for me to tell which one is which because the titles are cut off.”

Would improving the UI of customer reviews increase downloads and purchases?

Inventory

37% of Animation is missing from the Play Store

40% of Anime is missing from the Play Store

Even classics like I Love Lucy are missing
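The deck does not say how these percentages were computed. One plausible approach is to take a reference list of titles for a genre and check which of them can be found in the store. The sketch below assumes exact title matching and uses hypothetical names; real matching would need to handle editions, seasons, and spelling variants.

```python
from typing import Iterable, List, Tuple


def missing_share(
    reference_titles: Iterable[str],
    store_titles: Iterable[str],
) -> Tuple[float, List[str]]:
    """Fraction of reference titles absent from the store, plus the missing titles."""
    reference = set(reference_titles)
    if not reference:
        raise ValueError("reference list is empty")
    missing = reference - set(store_titles)
    return len(missing) / len(reference), sorted(missing)


# Illustrative placeholder titles, not real catalog data.
share, missing = missing_share(["A", "B", "C", "D", "E"], ["A", "C", "E", "Z"])
print(f"{share:.0%} missing: {missing}")
```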

Why is the Play Store’s inventory so limited? How does this affect brand perception and adoption?

Pricing

Note: biased towards popular music, not sampled from each genre.

Over half of the Play Store’s music is more expensive than Amazon

Over 10% of the time, it’s more than twice as expensive!
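Both statistics can be computed from matched price pairs for the same item in the two stores. A minimal sketch, assuming the pairs have already been matched (the function name and input shape are illustrative, and the prices in any example would be made up):

```python
from typing import List, Tuple


def price_comparison(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Given (play_price, amazon_price) pairs, return the share of items that
    cost more on Play and the share that cost more than twice as much."""
    valid = [(play, amazon) for play, amazon in pairs if amazon > 0]
    if not valid:
        raise ValueError("no comparable price pairs")
    n = len(valid)
    more_expensive = sum(play > amazon for play, amazon in valid)
    over_double = sum(play > 2 * amazon for play, amazon in valid)
    return more_expensive / n, over_double / n
```

Items that are free or unavailable on the comparison store are skipped here, since a price ratio is not meaningful for them.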

Summary

In Books, Music, and Movies/TV, the Play Store lags behind some of its competitors.

There are a few problems with the recommendation and related-item engines.

Adding personalized user models should help with some issues.

“Based on X” recommendations (recommendations that use information from other Google properties) are the most useful.

Fixing the UX could also improve metrics dramatically.

The Play Store’s inventory is also very limited, which doesn’t make recommending good content any easier.

In many cases, the Play Store is also more expensive.

Building a discovery engine is hard. So is creating a marketplace. Can we mix humans and machines to make it easier?

http://blog.echen.me/2014/10/07/moving-beyond-ctr-better-recommendations-through-human-evaluation/
http://blog.echen.me/2013/01/08/improving-twitter-search-with-real-time-human-computation/

