Language Model for Music Recommendation

Language Models for Music Discovery @raghothams @nischalhp

Who We Are
Nischal HP: VP Data and ML at scoutbee, Berlin. Building Large Language Models, Knowledge Graphs and MLOps. Twitter: @nischalhp, linkedin.com/in/nischalhp, https://github.com/deep-learning-for-humans
Raghotham Sripadraj: AI Architect at PayPal, Bangalore. Building Document AI, Large Language and Computer Vision Models at PayPal. Twitter: @raghothams, linkedin.com/in/raghothams/

Introduction
Le us circa 2022 - There are so many cool music producers. How can we find them based on what we like?
Le us circa 2023 - Here is shazam.nearest_neighbors, music discovery using language models and graphs. https://github.com/deep-learning-for-humans

What is the problem, you ask? @raghothams @nischalhp

Problem with Music Discovery #1
Finding artists in your city who perform the music you like is a challenge. Genres can be hard to work with as a filter.

Problem with Music Discovery #2
You like a certain guitar solo from David Gilmour's rendition of Comfortably Numb and you would love to find other tracks that have similar solos, but can you?

Problem with Music Discovery #3
As a music producer, would it not be cool to have an assistant who could help me build great DJ sets?

What does the current landscape look like? @raghothams @nischalhp

Music Discovery ~= 70% of music streaming market share

Music Discovery
Recommendation and discovery on these platforms happen via Content Filtering + Collaborative Filtering.
Explore vs Exploit

Music Discovery
> Music exploration via search is limited.
> Memorisation of tracks from your listening history.
> New artists are not often recommended.

How did we solve this?** @raghothams @nischalhp
** conditions apply ;)

Our Goal
Given a 10-second sample, can we identify other songs and artists that contain a similar pattern of music?

Dataset
600 songs, mostly without lyrics. 10-second samples at 16,000 Hz and 48,000 Hz sampling rates.

Trivia: Sampling Rate
Sampling rate refers to the number of samples taken per second of audio. The higher the sampling rate, the higher the frequencies that can be captured, and generally the better the quality.
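To make the trivia concrete, here is a small sketch of the arithmetic for the two rates used in the dataset. The number of stored values is the sampling rate times the duration, and the highest representable frequency is half the sampling rate (the Nyquist limit). The helper names `num_samples` and `nyquist_hz` are hypothetical, chosen only for this illustration.

```python
def num_samples(sample_rate_hz: int, duration_s: float) -> int:
    """Number of amplitude values stored for a mono clip."""
    return int(sample_rate_hz * duration_s)


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


# Compare the two sampling rates for a 10-second clip.
for rate in (16_000, 48_000):
    print(rate, num_samples(rate, 10), nyquist_hz(rate))
```

A 48,000 Hz clip therefore carries three times as many values as a 16,000 Hz clip of the same length.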

Algorithm Transformers is all you need.

Our Experiments with Transformers
We looked at 3 Transformer-based audio models:
Wav2Vec2
Audio Spectrogram Transformer (AST)
CLAP

Our Experiments with Transformers
We took a 10-second sample, generated embeddings using each transformer model and ran a quick similarity search using a vector store to find the most similar samples.
Models: Wav2Vec2, Audio Spectrogram Transformer, CLAP
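The similarity step can be sketched without any model or vector store. The example below is an illustration only: the embeddings are random stand-ins (a real run would take them from one of the audio models), the dimension 768 is an assumption, and the brute-force cosine search plays the role that a vector store would play at scale. The function names are hypothetical.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_similar(index: np.ndarray, query: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (row id, cosine similarity) for the k rows of `index` closest to `query`."""
    scores = normalize(index) @ normalize(query[None, :])[0]
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]


# Stand-in data: 600 hypothetical sample embeddings of dimension 768.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(600, 768)).astype(np.float32)

# Query with a slightly perturbed copy of sample 42; it should come back first.
query = embeddings[42] + 0.01 * rng.normal(size=768).astype(np.float32)
print(top_k_similar(embeddings, query, k=3))
```

Normalising first means the ranking depends only on the direction of each embedding, not its magnitude.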

Our Experiments with Transformers
Our hypothesis was that the closest matches would be other parts of the same track. Wav2Vec2 completely surprised us. We wanted to understand a bit further using visualisation.
[Figures: spectrum analysis of the input sample, followed by one slide each for Wav2Vec2, Audio Spectrogram Transformer and CLAP]
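A spectrum analysis of this kind can be approximated with a short-time Fourier transform. The following is a minimal NumPy sketch on a synthetic tone, assuming a Hann window and arbitrary frame sizes. It is not the tooling used to produce the slides, and `spectrogram` is a hypothetical helper.

```python
import numpy as np


def spectrogram(signal: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate   # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)       # synthetic 440 Hz tone
spec = spectrogram(tone)

# Convert the strongest bin of the average spectrum back to a frequency in Hz.
freqs = np.fft.rfftfreq(1024, d=1 / sample_rate)
peak_hz = freqs[spec.mean(axis=0).argmax()]
print(spec.shape, peak_hz)
```

Plotting `spec` for a query sample next to its nearest neighbours is one way to check by eye whether a model matched on similar frequency content.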

Understanding the transformers
[Figure: the input sample alongside the 2nd most similar sample from CLAP, Audio Spectrogram Transformer and Wav2Vec2]

Audio Transformers - Our Findings
We realised Wav2Vec2 was different, and went deeper to understand the datasets used to train these models and their intended purposes.

Audio Transformers - Our Findings
Wav2Vec2 was trained for speech recognition on the LibriSpeech corpus, which contains 960 hours of speech audio.

Audio Transformers - Our Findings
AST was trained for audio classification on the AudioSet dataset, a collection of over 2 million 10-second audio clips excised from YouTube videos and labeled with the sounds that each clip contains from a set of 527 labels.

Audio Transformers - Our Findings
[Figure: Audio Spectrogram Transformer]

Audio Transformers - Our Findings
CLAP was designed to build good audio representations and was trained using the Freesound Dataset 50k (FSD50K for short), an open dataset of human-labeled sound events. It consists mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more.

Audio Transformers - Summary
> Dataset: 600 songs, 10-second samples, 16000 and 48000 sampling rates used.
> Embeddings generated for all the samples using the Wav2Vec2, CLAP and AST transformer models.
> Indexed the embeddings into a vector store (FAISS) to generate similarity.
> Spectrum analysis of the samples and similar samples to better understand the transformers.

Just to clarify
We are not looking to replace current methods of recommendation. We just wanted to see if we can build something to change how users discover music.

Music Discovery @raghothams @nischalhp

Music Discovery
Given a sample, we can now identify other samples that sound similar. How do we enable discovery though?

Music Discovery
We introduced graphs as a way for users to discover and explore similar tracks and artists.

Music Discovery
Using a simple ontology, we generated a graph data model with Artist and Audio entities. Audio can refer to a song or a sample. Audio has relationships with other Audio entities and Artist entities.
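The ontology described here can be sketched in plain Python. This is an illustration under assumptions, not the project's actual schema: the relationship labels `PERFORMED_BY`, `SAMPLE_OF` and `SIMILAR_TO`, the `MusicGraph` class and the example titles are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Artist:
    name: str


@dataclass(frozen=True)
class Audio:
    title: str
    kind: str  # "song" or "sample"


@dataclass
class MusicGraph:
    # Each edge is (source node, relationship label, target node).
    edges: list[tuple[object, str, object]] = field(default_factory=list)

    def relate(self, source, label: str, target) -> None:
        self.edges.append((source, label, target))

    def neighbors(self, node, label: str) -> list:
        """Targets reachable from `node` over edges with the given label."""
        return [t for s, l, t in self.edges if s == node and l == label]


graph = MusicGraph()
artist = Artist("Artist A")
song = Audio("Song 1", "song")
clip = Audio("Song 1 @ 0:30", "sample")
other = Audio("Song 2 @ 1:10", "sample")

graph.relate(song, "PERFORMED_BY", artist)  # Audio -> Artist
graph.relate(clip, "SAMPLE_OF", song)       # Audio -> Audio
graph.relate(clip, "SIMILAR_TO", other)     # Audio -> Audio
print(graph.neighbors(clip, "SIMILAR_TO"))
```

In practice a graph database or a graph library would replace the edge list, but the two entity types and the Audio-to-Audio and Audio-to-Artist relationships stay the same.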

Music Discovery
Graphs make discovery fun. For starters, they make it easier to find similar audio that does not come from the same song or artist. With the help of graphs, we also saw some interesting patterns, such as certain 10-second samples being used by the same artist in different songs.
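One way to picture the "similar, but not the same song or artist" step is a filter over similarity results. The sketch below assumes each sample id maps to a track and an artist. The `discover` function and the metadata layout are hypothetical, introduced only for this example.

```python
def discover(query_id, similar_ids, metadata):
    """Keep similar samples that come from a different track and a different artist."""
    source = metadata[query_id]
    return [
        sid
        for sid in similar_ids
        if metadata[sid]["track"] != source["track"]
        and metadata[sid]["artist"] != source["artist"]
    ]


# Hypothetical metadata keyed by sample id.
metadata = {
    "s1": {"track": "Track A", "artist": "Artist X"},
    "s2": {"track": "Track A", "artist": "Artist X"},  # same track as the query
    "s3": {"track": "Track B", "artist": "Artist X"},  # same artist, other track
    "s4": {"track": "Track C", "artist": "Artist Y"},  # a different artist
}

print(discover("s1", ["s2", "s3", "s4"], metadata))
```

Dropping the artist condition instead surfaces the other pattern mentioned here: similar samples reused by the same artist across different songs.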

It’s not all talk and no show. Let's go! @raghothams @nischalhp

Future is wild
After enabling the discovery of music with language models and graphs, our goal is to train a generative model to create DJ sets from playlists.

Open source for the win!

Thank you ^_^

Image credits
Sampling image in Slide 16
Image used in Slide 12
References
