Jonny Brooks-Bartlett - Hello, Is It Me You're Looking For? Improving Spotify Search (Turing Fest 2022)

When a Spotify user types a query into the search bar it sets off a cascade of algorithms which ultimately ends in the user being shown a set of music and/or podcast results. This cascade of processes can be very complex, as we must try to understand exactly what the user is looking for and choose a handful of results from a set of tens of millions of potential options. As more and more items are added to the Spotify catalogue, from music to podcasts and soon audiobooks too, the Search team has increased in size and scope to try to tackle the growing challenges. In this talk I'll give an overview of the components of a typical search system and explain some of the challenges that arise in search systems. Then I'll talk about some of the new algorithms that we've developed over the last year that have led to huge improvements in Search quality. Finally, I'll outline some of the general lessons that I've personally learned from building machine learning models.

Head to www.turingfest.com to learn more about Europe's best cross-functional tech conference.

Turing Fest

August 15, 2022

Transcript

  1. Agenda: • Introducing the search problem • Why search can be hard • Improvements to Spotify search • Lessons learned
  2. What is the goal of search? Given a search query, provide the user the most “relevant” results. “Relevant” is context dependent: e.g. time of day, user, query, etc. You name it.
  3. The search problem can be divided into 3 parts: 1. Query processing 2. Candidate retrieval 3. Ranking
  4. 1. Query processing: converts a query into a form that can be easily used by downstream components. Original query: “2000’s Party bangers” → Processed query (tokenised and lowercased): [2000s], [party], [bangers]
  5. 1. Query processing (cont): converts a query into a form that can be easily used by downstream components. Original query: “rinning” → Processed query (spell corrected): [running]
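A minimal sketch, in Python, of this kind of query processing. The tokenisation rules and the tiny KNOWN_WORDS vocabulary used for spell correction are illustrative assumptions, not Spotify's actual pipeline.

```python
import difflib
import re

# Illustrative vocabulary for spell correction; a real system would use a
# much larger dictionary built from the catalogue and query logs.
KNOWN_WORDS = ["running", "party", "bangers", "2000s"]

def spell_correct(token: str) -> str:
    """Replace a token with its closest known word, if one is close enough."""
    matches = difflib.get_close_matches(token, KNOWN_WORDS, n=1, cutoff=0.8)
    return matches[0] if matches else token

def process_query(query: str) -> list[str]:
    """Lowercase, strip apostrophes, tokenise, and spell-correct a raw query."""
    cleaned = query.lower().replace("'", "").replace("’", "")
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    return [spell_correct(t) for t in tokens]

print(process_query("2000's Party bangers"))  # ['2000s', 'party', 'bangers']
print(process_query("rinning"))               # ['running']
```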
  6. 2. Candidate retrieval: narrows down the set of candidates from hundreds of millions (100,000,000s) to a few hundred (100s)
  7. 3. Ranking: rank the narrowed-down set of candidates to show the user the most relevant candidates first
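At its simplest, ranking is just ordering the retrieved candidates by a relevance score. A toy sketch follows; the items and scores are made up, and in practice the score would come from a learned ranking model rather than being hard-coded.

```python
# Toy ranking step: order the retrieved candidates by a relevance score.
# The items and scores are invented; a real system would compute the score
# from query, user and content features via a learned model.
candidates = [
    {"title": "Running Up That Hill", "score": 0.91},
    {"title": "Born to Run", "score": 0.76},
    {"title": "Run the World (Girls)", "score": 0.83},
]

ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
for item in ranked:
    print(f'{item["score"]:.2f}  {item["title"]}')
```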
  8. The 3 parts of the search problem: 1. Query processing 2. Candidate retrieval 3. Ranking. I will talk about the improvements in this phase.
  9. Search is easy when the user tells you exactly what they are looking for. Query = “hello lionel ri”
  10. Search is much harder when the user doesn’t know what they want exactly. Query = “something to chill to”
  11. Search is very hard when the user could want so many different types of content. Query = “latest on political unrest”
  12. Search feels like a shot in the dark when the query could mean anything. Query = “h”
  13. Reminder: the 3 parts of the search problem: 1. Query processing 2. Candidate retrieval 3. Ranking. I will talk about the improvements in this phase.
  14. Reminder: candidate retrieval narrows down the set of candidates from hundreds of millions (100,000,000s) to a few hundred (100s)
  15. Prefix or word matching: an intuitive way of retrieving candidates • “run” is a prefix of “running”, so return things that contain the word “running” • Get candidates that contain the word “run”
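A rough sketch of what prefix/word matching retrieval could look like. The tiny in-memory catalogue is illustrative; a production system would use an inverted index over the full catalogue.

```python
# Illustrative in-memory catalogue: item title -> lowercased words from its metadata.
CATALOGUE = {
    "Running Up That Hill": ["running", "up", "that", "hill"],
    "Run the World (Girls)": ["run", "the", "world", "girls"],
    "Chill Hits": ["chill", "hits"],
}

def retrieve_by_prefix(query_token: str) -> list[str]:
    """Return items containing a word that the query token is a prefix of."""
    return [
        item
        for item, words in CATALOGUE.items()
        if any(word.startswith(query_token) for word in words)
    ]

print(retrieve_by_prefix("run"))      # matches both "running" and "run"
print(retrieve_by_prefix("rinning"))  # [] -- the typo defeats prefix matching
```

Note how the typo “rinning” returns nothing, which is exactly the limitation the next slide calls out.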
  16. Limitations of prefix/word matching: Query = “rinning” wouldn’t return this result, nor would queries like “something to chill to” or “latest on political unrest”.
  17. How does Crowdsurf work? Let’s say a user types this sequence of characters: R Ri RiR RiRi RiR Ri Rih Riha Rihan Rihann Rihanna. The user subsequently clicks on the artist “Rihanna”.
  18. How does Crowdsurf work? (cont) R Ri RiR RiRi Rih Riha Rihan Rihann Rihanna: every term in the sequence will be somewhat associated with Rihanna, and that relationship will be stored in Bigtable.
  19. Scoring candidates: Query (e.g. “RiRi”) → URI (e.g. Rihanna). P(content | query) ≅ (number of times the query is searched and the content is clicked) / (number of times the query is searched)
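A minimal sketch of the counting behind this score. The log format, placeholder URIs, and in-memory dictionaries are assumptions for illustration; the talk stores these query–content associations in Bigtable.

```python
from collections import Counter, defaultdict

# Illustrative search logs as (query typed, content clicked or None) pairs;
# the URI strings are placeholders, not real Spotify URIs.
logs = [
    ("ri", "artist:rihanna"),
    ("riri", "artist:rihanna"),
    ("riri", None),               # a search that ended without a click
    ("riri", "artist:rihanna"),
]

searches = Counter()              # how often each query was searched
clicks = defaultdict(Counter)     # how often each (query, content) pair was clicked

for query, clicked in logs:
    searches[query] += 1
    if clicked is not None:
        clicks[query][clicked] += 1

def score(query: str, content: str) -> float:
    """P(content | query) ~= clicks(query, content) / searches(query)."""
    return clicks[query][content] / searches[query] if searches[query] else 0.0

print(score("riri", "artist:rihanna"))  # 2 / 3 ≈ 0.67
```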
  20. Big wins from Crowdsurf: • Reduced the amount that users had to type to find content that they interacted with • Increased the number of “successful” searches globally
  21. What you might see now. How do we get a computer to understand the semantic meaning of the query?
  22. We can group Spotify content and queries together (e.g. “tips for ending bad friendships”, “dealing with covid ptsd”). Train a machine learning model using content metadata to understand more about the content and associate it with relevant queries. • https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/ • https://www.pinecone.io/learn/spotify-podcast-search/
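A hedged sketch of embedding-based retrieval of this kind: queries and content are mapped into a shared vector space and matched by similarity. The sentence-transformers model used here is a generic stand-in, not the dual-encoder described in the linked posts, and the episode descriptions are invented.

```python
# Sketch of semantic retrieval with a shared embedding space.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

episodes = [
    "An episode about coping with stress and anxiety during the pandemic",
    "A conversation about letting go of toxic friendships",
    "Commentary on the best upbeat workout anthems",
]
episode_vecs = model.encode(episodes, normalize_embeddings=True)

query = "tips for ending bad friendships"
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With normalised vectors, cosine similarity is just a dot product.
scores = episode_vecs @ query_vec
print(episodes[int(np.argmax(scores))])  # likely the toxic-friendships episode
```

At catalogue scale, the exhaustive dot product would typically be replaced by an approximate nearest-neighbour index rather than a brute-force comparison against every item.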
  23. Big wins from Natural Language Search: • Reduced the amount that users had to type to find content that they interacted with • Increased the number of “successful” searches globally
  24. 1. Don’t jump to machine learning straight away • Think “How would I solve this problem without ML?” • Exact matching for search worked well enough for a long time • Natural language search is very engineering intensive
  25. The search problem can be divided into 3 parts: 1. Query processing 2. Candidate retrieval 3. Ranking
  26. We learned some lessons: 1. Don’t go straight to ML 2. Monitor your pipelines and systems 3. Make sure you can debug and respond quickly