Jonny Brooks-Bartlett - Hello, Is It Me You're Looking For? Improving Spotify Search (Turing Fest 2022)

When a Spotify user types a query into the search bar it sets off a cascade of algorithms which ultimately ends in the user being shown a set of music and/or podcast results. This cascade of processes can be very complex, as we must try to understand exactly what the user is looking for and choose a handful of results from a set of tens of millions of potential options. As more and more items are added to the Spotify catalogue, from music to podcasts and soon audiobooks too, the Search team has increased in size and scope to try to tackle the growing challenges. In this talk I'll give an overview of the components of a typical search system and explain some of the challenges that arise in search systems. Then I'll talk about some of the new algorithms that we've developed over the last year that have led to huge improvements in Search quality. Finally, I'll outline some of the general lessons that I've personally learned from building machine learning models.

Head to www.turingfest.com to learn more about Europe's best cross-functional tech conference.

Turing Fest

August 15, 2022

Transcript

  1. Agenda: • Introducing the search problem • Why search can be hard • Improvements to Spotify search • Lessons learned
  2. What is the goal of search? Given a search query, provide the user the most “relevant” results. “Relevant” is context dependent: e.g. time of day, user, query, etc. You name it.
  3. The search problem can be divided into 3 parts: 1. Query processing 2. Candidate retrieval 3. Ranking
  4. 1. Query processing: converts a query into a form that can be easily used by downstream components. Original query: “2000’s Party bangers” → Processed query (tokenised and lowercased): [2000s], [party], [bangers]
  5. 1. Query processing (cont): converts a query into a form that can be easily used by downstream components. Original query: “rinning” → Processed query (spell corrected): [running]
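A minimal sketch, in Python, of this kind of query processing. The tokenisation rules and the tiny KNOWN_WORDS vocabulary used for spell correction are illustrative assumptions, not Spotify's actual pipeline.

```python
import difflib
import re

# Illustrative vocabulary for spell correction; a real system would use a
# much larger dictionary built from the catalogue and query logs.
KNOWN_WORDS = ["running", "party", "bangers", "2000s"]

def spell_correct(token: str) -> str:
    """Replace a token with its closest known word, if one is close enough."""
    matches = difflib.get_close_matches(token, KNOWN_WORDS, n=1, cutoff=0.8)
    return matches[0] if matches else token

def process_query(query: str) -> list[str]:
    """Lowercase, strip apostrophes, tokenise, and spell-correct a raw query."""
    cleaned = query.lower().replace("'", "").replace("’", "")
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    return [spell_correct(t) for t in tokens]

print(process_query("2000's Party bangers"))  # ['2000s', 'party', 'bangers']
print(process_query("rinning"))               # ['running']
```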
  6. 2. Candidate retrieval: narrows down the set of candidates from hundreds of millions (100,000,000s) to a few hundred (100s)
  7. 3. Ranking: rank the narrowed-down set of candidates to show the user the most relevant candidates first
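At its simplest, ranking is just ordering the retrieved candidates by a relevance score. A toy sketch follows; the items and scores are made up, and in practice the score would come from a learned ranking model rather than being hard-coded.

```python
# Toy ranking step: order the retrieved candidates by a relevance score.
# The items and scores are invented; a real system would compute the score
# from query, user and content features via a learned model.
candidates = [
    {"title": "Running Up That Hill", "score": 0.91},
    {"title": "Born to Run", "score": 0.76},
    {"title": "Run the World (Girls)", "score": 0.83},
]

ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
for item in ranked:
    print(f'{item["score"]:.2f}  {item["title"]}')
```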
  8. The 3 parts of the search problem: 1. Query processing 2. Candidate retrieval 3. Ranking. I will talk about the improvements in this phase.
  9. Search is easy when the user tells you exactly what they are looking for. Query = “hello lionel ri”
  10. Search is much harder when the user doesn’t know what they want exactly. Query = “something to chill to”
  11. Search is very hard when the user could want so many different types of content. Query = “latest on political unrest”
  12. Search feels like a shot in the dark when the query could mean anything. Query = “h”
  13. Reminder: the 3 parts of the search problem: 1. Query processing 2. Candidate retrieval 3. Ranking. I will talk about the improvements in this phase.
  14. Reminder: candidate retrieval narrows down the set of candidates from hundreds of millions (100,000,000s) to a few hundred (100s)
  15. Prefix or word matching: an intuitive way of retrieving candidates • “run” is a prefix of “running”, so return things that contain the word “running” • Get candidates that contain the word “run”
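A rough sketch of what prefix/word matching retrieval could look like. The tiny in-memory catalogue is illustrative; a production system would use an inverted index over the full catalogue.

```python
# Illustrative in-memory catalogue: item title -> lowercased words from its metadata.
CATALOGUE = {
    "Running Up That Hill": ["running", "up", "that", "hill"],
    "Run the World (Girls)": ["run", "the", "world", "girls"],
    "Chill Hits": ["chill", "hits"],
}

def retrieve_by_prefix(query_token: str) -> list[str]:
    """Return items containing a word that the query token is a prefix of."""
    return [
        item
        for item, words in CATALOGUE.items()
        if any(word.startswith(query_token) for word in words)
    ]

print(retrieve_by_prefix("run"))      # matches both "running" and "run"
print(retrieve_by_prefix("rinning"))  # [] -- the typo defeats prefix matching
```

Note how the typo “rinning” returns nothing, which is exactly the limitation the next slide calls out.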
  16. Limitations of prefix/word matching: Query = “rinning” wouldn’t return this result, nor would queries like “something to chill to” or “latest on political unrest”.
  17. How does Crowdsurf work? Let’s say a user types this sequence of characters: R Ri RiR RiRi RiR Ri Rih Riha Rihan Rihann Rihanna. The user subsequently clicks on the artist “Rihanna”.
  18. How does Crowdsurf work? (cont) R Ri RiR RiRi Rih Riha Rihan Rihann Rihanna: every term in the sequence will be somewhat associated with Rihanna, and that relationship will be stored in Bigtable.
  19. Scoring candidates: Query (e.g. “RiRi”) → URI (e.g. Rihanna). P(content | query) ≅ (number of times the query is searched and the content is clicked) / (number of times the query is searched)
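A minimal sketch of the counting behind this score. The log format, placeholder URIs, and in-memory dictionaries are assumptions for illustration; the talk stores these query–content associations in Bigtable.

```python
from collections import Counter, defaultdict

# Illustrative search logs as (query typed, content clicked or None) pairs;
# the URI strings are placeholders, not real Spotify URIs.
logs = [
    ("ri", "artist:rihanna"),
    ("riri", "artist:rihanna"),
    ("riri", None),               # a search that ended without a click
    ("riri", "artist:rihanna"),
]

searches = Counter()              # how often each query was searched
clicks = defaultdict(Counter)     # how often each (query, content) pair was clicked

for query, clicked in logs:
    searches[query] += 1
    if clicked is not None:
        clicks[query][clicked] += 1

def score(query: str, content: str) -> float:
    """P(content | query) ~= clicks(query, content) / searches(query)."""
    return clicks[query][content] / searches[query] if searches[query] else 0.0

print(score("riri", "artist:rihanna"))  # 2 / 3 ≈ 0.67
```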
  20. Big wins from Crowdsurf: • Reduced the amount that users had to type to find content that they interacted with • Increased the number of “successful” searches globally
  21. What you might see now. How do we get a computer to understand the semantic meaning of the query?
  22. We can group Spotify content and queries together (e.g. “tips for ending bad friendships”, “dealing with covid ptsd”). Train a machine learning model using content metadata to understand more about the content and associate it with relevant queries. • https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/ • https://www.pinecone.io/learn/spotify-podcast-search/
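A hedged sketch of embedding-based retrieval of this kind: queries and content are mapped into a shared vector space and matched by similarity. The sentence-transformers model used here is a generic stand-in, not the dual-encoder described in the linked posts, and the episode descriptions are invented.

```python
# Sketch of semantic retrieval with a shared embedding space.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

episodes = [
    "An episode about coping with stress and anxiety during the pandemic",
    "A conversation about letting go of toxic friendships",
    "Commentary on the best upbeat workout anthems",
]
episode_vecs = model.encode(episodes, normalize_embeddings=True)

query = "tips for ending bad friendships"
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With normalised vectors, cosine similarity is just a dot product.
scores = episode_vecs @ query_vec
print(episodes[int(np.argmax(scores))])  # likely the toxic-friendships episode
```

At catalogue scale, the exhaustive dot product would typically be replaced by an approximate nearest-neighbour index rather than a brute-force comparison against every item.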
  23. Big wins from Natural Language Search: • Reduced the amount that users had to type to find content that they interacted with • Increased the number of “successful” searches globally
  24. 1. Don’t jump to machine learning straight away • Think “How would I solve this problem without ML?” • Exact matching for search worked well enough for a long time • Natural language search is very engineering intensive
  25. The search problem can be divided into 3 parts: 1. Query processing 2. Candidate retrieval 3. Ranking
  26. We learned some lessons: 1. Don’t go straight to ML 2. Monitor your pipelines and systems 3. Make sure you can debug and respond quickly