Hello, is it me
you’re looking
for?
Improving Spotify
search
Slide 2
● Introducing the search problem
● Why search can be hard
● Improvements to Spotify search
● Lessons learned
Agenda
Slide 3
Introducing
the search
problem
Slide 4
What is the goal of search?
Given a search query, provide the user with the most “relevant” results.
“Relevant” is context dependent: e.g. time of day, user, query, etc. You name it.
Slide 5
The search problem can be divided into 3 parts
1. Query processing
2. Candidate retrieval
3. Ranking
Slide 6
1. Query processing
Converts a query into a form that can be easily used by
downstream components
Original query: “2000’s Party bangers”
Processed query: [2000s], [party], [bangers]
(tokenised and lowercased)
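The tokenise-and-lowercase step can be sketched in a few lines. This is a minimal illustration, not Spotify’s actual pipeline; the function name and the apostrophe handling are assumptions made so that “2000’s” collapses to the single token [2000s] as on the slide.

```python
import re

def process_query(query: str) -> list[str]:
    # Lowercase, drop apostrophes so "2000's" becomes "2000s",
    # then keep alphanumeric runs as tokens.
    cleaned = re.sub(r"[’']", "", query.lower())
    return re.findall(r"[a-z0-9]+", cleaned)

# process_query("2000’s Party bangers") → ["2000s", "party", "bangers"]
```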
Slide 7
1. Query processing (cont)
Converts a query into a form that can
be easily used by downstream
components
Original query: “rinning”
Processed query: [running]
(spell correction)
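A simple spell-correction sketch, assuming a known vocabulary: pick the closest known word by string similarity. The toy vocabulary and the 0.8 cutoff are illustrative assumptions; production spell correction is far more sophisticated.

```python
import difflib

# Toy vocabulary; a real system would derive this from the catalogue.
VOCABULARY = ["running", "rihanna", "party", "chill"]

def correct(token: str) -> str:
    # Return the closest known word, or the token unchanged if nothing is close.
    matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token

# correct("rinning") → "running"
```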
Slide 8
2. Candidate Retrieval
Narrows down the set of candidates from hundreds of
millions to a few hundred
100,000,000s → 100s
Slide 9
3. Ranking
Rank the narrowed down set of candidates to show the
user the most relevant candidates first
Slide 10
The 3 parts of the search problem
1. Query processing
2. Candidate retrieval
3. Ranking
I will talk about the
improvements in
this phase
Slide 11
Why search
can be hard
Slide 12
Search is easy when the user tells you exactly what they
are looking for
Query = “hello lionel ri”
Slide 13
Search is harder when the query is ambiguous
Query = “hello”
Slide 14
Search is much harder when the user doesn’t know what
they want exactly
Query = “something to chill to”
Slide 15
Search is very hard when the user could want so many
different types of content
Query = “latest on political unrest”
Slide 16
Search feels like a shot in the dark when the query could
mean anything
Query = “h”
Slide 17
Improvements
to Spotify
search
Slide 18
Reminder: the 3 parts of the search problem
1. Query processing
2. Candidate retrieval
3. Ranking
I will talk about the
improvements in
this phase
Slide 19
Reminder: Candidate Retrieval
Narrows down the set of candidates from hundreds of
millions to a few hundred
100,000,000s → 100s
Slide 20
Prefix or word matching
An intuitive way of retrieving candidates
● “run” is a prefix of “running”, so return candidates that contain the word “running”
● Also return candidates that contain the exact word “run”
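The idea above can be sketched with an inverted index from word prefixes to titles. The toy catalogue and function names are illustrative assumptions, not Spotify’s implementation.

```python
from collections import defaultdict

# Toy catalogue; a real index covers hundreds of millions of items.
CATALOGUE = ["Running Up That Hill", "Born to Run", "Hello"]

# Map every prefix of every word to the titles containing that word.
index = defaultdict(set)
for title in CATALOGUE:
    for word in title.lower().split():
        for i in range(1, len(word) + 1):
            index[word[:i]].add(title)

def retrieve(query: str) -> set[str]:
    # Return titles that contain every query token as a word prefix.
    results = None
    for token in query.lower().split():
        hits = index.get(token, set())
        results = hits if results is None else results & hits
    return results or set()

# retrieve("run") → matches both "Running Up That Hill" and "Born to Run";
# retrieve("rinning") → empty set, which is exactly the limitation the
# next slide points out.
```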
Slide 21
Limitations of prefix/word matching
Query = “rinning” wouldn’t return this result
Query = “something to chill to”
or
Query = “latest on political unrest”
Slide 22
We built 2 systems to improve search results
1. Crowdsurf
2. Natural Search
Slide 23
Crowdsurf
Slide 24
How does Crowdsurf work?
Let’s say a user types this sequence of characters:
R
Ri
RiR
RiRi
RiR
Ri
Rih
Riha
Rihan
Rihann
Rihanna
User subsequently
clicks on the artist
“Rihanna”
Slide 25
How does Crowdsurf work? (cont)
R
Ri
RiR
RiRi
Rih
Riha
Rihan
Rihann
Rihanna
Every term in the sequence will be somewhat associated with Rihanna and that
relationship will be stored in Bigtable
Slide 26
Extending to all queries
Query → Content
E.g. “RiRi” → Rihanna
“Drizzy” → Drake
“Work” → Work
Slide 27
Scoring candidates
Query → URI
E.g. “RiRi” → Rihanna
P(content | query) ≅
(Number of times the query is searched and the content is clicked) ÷
(Number of times the query is searched)
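That estimate can be sketched as simple counting over search logs. The log format, function names, and example numbers below are assumptions for illustration; Crowdsurf itself stores these associations in Bigtable at scale.

```python
from collections import Counter

search_counts = Counter()  # times each query prefix was searched
click_counts = Counter()   # times a (prefix, content) pair led to a click

def record(prefixes: list[str], clicked_content: str) -> None:
    # One search session: every typed prefix gets credited to the final click.
    for p in prefixes:
        search_counts[p] += 1
        click_counts[(p, clicked_content)] += 1

def score(query: str, content: str) -> float:
    # P(content | query) ≈ clicks(query, content) / searches(query)
    if search_counts[query] == 0:
        return 0.0
    return click_counts[(query, content)] / search_counts[query]

record(["r", "ri", "riri"], "Rihanna")
record(["r", "ri"], "Radiohead")
# score("ri", "Rihanna") → 0.5  (1 click out of 2 searches of "ri")
```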
Slide 28
Big wins from Crowdsurf
● Reduced the amount that users had to type to find content
that they interacted with
● Increased the amount of “successful” searches globally
Slide 29
Natural Search
Slide 30
The “exact match” problem
Slide 31
What you might see now
How do we get a computer to understand the semantic meaning of the query?
Slide 32
Word/Sentence embeddings
Slide 33
We can do computations on the embeddings
Slide 34
We can group Spotify content and queries together
tips for ending bad friendships
dealing with covid ptsd
Train a machine learning model using
content metadata to understand
more about the content and
associate them with relevant queries
● https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/
● https://www.pinecone.io/learn/spotify-podcast-search/
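The grouping of queries and content can be sketched as nearest-neighbour search over embedding vectors. The tiny hand-made vectors below are assumptions for illustration; in Natural Search the vectors come from a trained encoder model, and retrieval uses approximate nearest-neighbour search rather than a linear scan.

```python
import math

# Toy 3-d embeddings; real sentence embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "episode about ending toxic friendships": [0.9, 0.1, 0.2],
    "episode about covid anxiety": [0.1, 0.9, 0.3],
}

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec: list[float]) -> str:
    # Return the content whose embedding is most similar to the query's.
    return max(EMBEDDINGS, key=lambda k: cosine(query_vec, EMBEDDINGS[k]))

# A query like "tips for ending bad friendships" would be encoded to a
# vector near the first episode's embedding.
```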
Slide 35
Big wins from Natural Search
● Reduced the amount that users had to type to find content
that they interacted with
● Increased the amount of “successful” searches globally
Slide 36
Lessons
learned
Slide 37
1. Don’t jump to machine learning straight away
● Think “How would I solve this
problem without ML?”
● Exact matching for search
worked well enough for a
long time
● Natural Search is very engineering-intensive
Slide 38
2. Make sure you set up monitoring for your systems
Crowdsurf fail
Slide 39
3. Add features that will enable you to debug and respond
quickly
Slide 40
Wrapping up
Slide 41
The search problem can be divided into 3 parts
1. Query processing
2. Candidate retrieval
3. Ranking
Slide 42
Search can be hard
Query = “h”
Slide 43
We built 2 systems to improve search results
1. Crowdsurf
2. Natural Search
Slide 44
We learned some lessons
1. Don’t go straight to ML
2. Monitor your pipelines and systems
3. Make sure you can debug and respond quickly