Hello, is it me
you’re looking
for?
Improving Spotify
search
Slide 2
● Introducing the search problem
● Why search can be hard
● Improvements to Spotify search
● Lessons learned
Agenda
Slide 3
Introducing
the search
problem
Slide 4
What is the goal of search?
Given a search query, provide the user with the most “relevant” results.
“Relevant” is context dependent: e.g. time of day, user, query, etc. You name it.
Slide 5
The search problem can be divided into 3 parts
1. Query processing
2. Candidate retrieval
3. Ranking
Slide 6
1. Query processing
Converts a query into a form that can be easily used by
downstream components
Original query: “2000’s Party bangers”
Processed query: [2000s], [party], [bangers]
(tokenised and lowercased)
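The tokenise-and-lowercase step can be sketched in a few lines. This is a minimal illustration, not Spotify’s actual pipeline; the function name and the apostrophe handling are assumptions made so that “2000’s” collapses to the single token [2000s] as on the slide.

```python
import re

def process_query(query: str) -> list[str]:
    # Lowercase, drop apostrophes so "2000's" becomes "2000s",
    # then keep alphanumeric runs as tokens.
    cleaned = re.sub(r"[’']", "", query.lower())
    return re.findall(r"[a-z0-9]+", cleaned)

# process_query("2000’s Party bangers") → ["2000s", "party", "bangers"]
```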
Slide 7
1. Query processing (cont)
Converts a query into a form that can
be easily used by downstream
components
Original query: “rinning”
Processed query: [running]
(spell correction)
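A simple spell-correction sketch, assuming a known vocabulary: pick the closest known word by string similarity. The toy vocabulary and the 0.8 cutoff are illustrative assumptions; production spell correction is far more sophisticated.

```python
import difflib

# Toy vocabulary; a real system would derive this from the catalogue.
VOCABULARY = ["running", "rihanna", "party", "chill"]

def correct(token: str) -> str:
    # Return the closest known word, or the token unchanged if nothing is close.
    matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token

# correct("rinning") → "running"
```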
Slide 8
2. Candidate Retrieval
Narrows down the set of candidates from hundreds of
millions to a few hundred
100,000,000s → 100s
Slide 9
3. Ranking
Rank the narrowed down set of candidates to show the
user the most relevant candidates first
Slide 10
The 3 parts of the search problem
1. Query processing
2. Candidate retrieval
3. Ranking
I will talk about the
improvements in
this phase
Slide 11
Why search
can be hard
Slide 12
Search is easy when the user tells you exactly what they
are looking for
Query = “hello lionel ri”
Slide 13
Search is harder when the query is ambiguous
Query = “hello”
Slide 14
Search is much harder when the user doesn’t know what
they want exactly
Query = “something to chill to”
Slide 15
Search is very hard when the user could want so many
different types of content
Query = “latest on political unrest”
Slide 16
Search feels like a shot in the dark when the query could
mean anything
Query = “h”
Slide 17
Improvements
to Spotify
search
Slide 18
Reminder: the 3 parts of the search problem
1. Query processing
2. Candidate retrieval
3. Ranking
I will talk about the
improvements in
this phase
Slide 19
Reminder: Candidate Retrieval
Narrows down the set of candidates from hundreds of
millions to a few hundred
100,000,000s → 100s
Slide 20
Prefix or word matching
An intuitive way of retrieving candidates
● “run” is a prefix of “running”, so return candidates that contain the word “running”
● Also return candidates that contain the exact word “run”
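The idea above can be sketched with an inverted index from word prefixes to titles. The toy catalogue and function names are illustrative assumptions, not Spotify’s implementation.

```python
from collections import defaultdict

# Toy catalogue; a real index covers hundreds of millions of items.
CATALOGUE = ["Running Up That Hill", "Born to Run", "Hello"]

# Map every prefix of every word to the titles containing that word.
index = defaultdict(set)
for title in CATALOGUE:
    for word in title.lower().split():
        for i in range(1, len(word) + 1):
            index[word[:i]].add(title)

def retrieve(query: str) -> set[str]:
    # Return titles that contain every query token as a word prefix.
    results = None
    for token in query.lower().split():
        hits = index.get(token, set())
        results = hits if results is None else results & hits
    return results or set()

# retrieve("run") → matches both "Running Up That Hill" and "Born to Run";
# retrieve("rinning") → empty set, which is exactly the limitation the
# next slide points out.
```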
Slide 21
Limitations of prefix/word matching
Query = “rinning” wouldn’t return this result
Query = “something to chill to”
or
Query = “latest on political unrest”
Slide 22
We built 2 systems to improve search results
1. Crowdsurf
2. Natural Search
Slide 23
Crowdsurf
Slide 24
How does Crowdsurf work?
Let’s say a user types this sequence of characters:
R
Ri
RiR
RiRi
RiR
Ri
Rih
Riha
Rihan
Rihann
Rihanna
User subsequently
clicks on the artist
“Rihanna”
Slide 25
How does Crowdsurf work? (cont)
R
Ri
RiR
RiRi
Rih
Riha
Rihan
Rihann
Rihanna
Every term in the sequence will be somewhat associated with Rihanna and that
relationship will be stored in Bigtable
Slide 26
Extending to all queries
Query → Content
E.g. “RiRi” → Rihanna
“Drizzy” → Drake
“Work” → Work
Slide 27
Scoring candidates
Query → URI
E.g. “RiRi” → Rihanna
P(content | query) ≅
(Number of times the query is searched and the content is clicked) ÷
(Number of times the query is searched)
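That estimate can be sketched as simple counting over search logs. The log format, function names, and example numbers below are assumptions for illustration; Crowdsurf itself stores these associations in Bigtable at scale.

```python
from collections import Counter

search_counts = Counter()  # times each query prefix was searched
click_counts = Counter()   # times a (prefix, content) pair led to a click

def record(prefixes: list[str], clicked_content: str) -> None:
    # One search session: every typed prefix gets credited to the final click.
    for p in prefixes:
        search_counts[p] += 1
        click_counts[(p, clicked_content)] += 1

def score(query: str, content: str) -> float:
    # P(content | query) ≈ clicks(query, content) / searches(query)
    if search_counts[query] == 0:
        return 0.0
    return click_counts[(query, content)] / search_counts[query]

record(["r", "ri", "riri"], "Rihanna")
record(["r", "ri"], "Radiohead")
# score("ri", "Rihanna") → 0.5  (1 click out of 2 searches of "ri")
```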
Slide 28
Big wins from Crowdsurf
● Reduced the amount that users had to type to find content
that they interacted with
● Increased the amount of “successful” searches globally
Slide 29
Natural Search
Slide 30
The “exact match” problem
Slide 31
What you might see now
How do we get a computer to understand the semantic meaning of the query?
Slide 32
Word/Sentence embeddings
Slide 33
We can do computations on the embeddings
Slide 34
We can group Spotify content and queries together
tips for ending bad friendships
dealing with covid ptsd
Train a machine learning model using
content metadata to understand
more about the content and
associate them with relevant queries
● https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/
● https://www.pinecone.io/learn/spotify-podcast-search/
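The grouping of queries and content can be sketched as nearest-neighbour search over embedding vectors. The tiny hand-made vectors below are assumptions for illustration; in Natural Search the vectors come from a trained encoder model, and retrieval uses approximate nearest-neighbour search rather than a linear scan.

```python
import math

# Toy 3-d embeddings; real sentence embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "episode about ending toxic friendships": [0.9, 0.1, 0.2],
    "episode about covid anxiety": [0.1, 0.9, 0.3],
}

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec: list[float]) -> str:
    # Return the content whose embedding is most similar to the query's.
    return max(EMBEDDINGS, key=lambda k: cosine(query_vec, EMBEDDINGS[k]))

# A query like "tips for ending bad friendships" would be encoded to a
# vector near the first episode's embedding.
```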
Slide 35
Big wins from Natural Search
● Reduced the amount that users had to type to find content
that they interacted with
● Increased the amount of “successful” searches globally
Slide 36
Lessons
learned
Slide 37
1. Don’t jump to machine learning straight away
● Think “How would I solve this
problem without ML?”
● Exact matching for search
worked well enough for a
long time
● Natural Search is very engineering-intensive
Slide 38
2. Make sure you set up monitoring for your systems
Crowdsurf fail
Slide 39
3. Add features that will enable you to debug and respond
quickly
Slide 40
Wrapping up
Slide 41
The search problem can be divided into 3 parts
1. Query processing
2. Candidate retrieval
3. Ranking
Slide 42
Search can be hard
Query = “h”
Slide 43
We built 2 systems to improve search results
1. Crowdsurf
2. Natural Search
Slide 44
We learned some lessons
1. Don’t go straight to ML
2. Monitor your pipelines and systems
3. Make sure you can debug and respond quickly