How to trust music metadata - PyCon FR 2019

Did you know American singer/songwriter Joan Baez has a near-homonym in Colombia, Joan Báez, who plays progressive rock?
Did you know there are (at least) two bands called Aggression that play thrash metal?
Jazzman Avishai Cohen, anyone? There are two contemporary jazz musicians who go by that name.

At Deezer, the data we receive from music labels is often ambiguous. And if we display an album on the wrong artist page, users get (rightfully!) mad and may turn to the competition the next time they want to listen to their favorite tracks.

In this talk, I'll present some of the techniques we use to verify our metadata and fix our music catalog.
We leverage several metadata sources to consolidate a "source of truth" which then helps us spot & correct errors in our database.
We'll explore topics like record linkage, bag-of-words, TF-IDF, graph algorithms and community detection. All of this in Python using scikit-learn, scipy and networkx.

Paul Tremberth

November 03, 2019

Transcript

  1. How to trust music metadata
     Paul Tremberth, PyCon FR 2019
  2. Deezer
     • Online music streaming service
       ◦ 180+ countries, 14M MAUs
     • 600 employees in ~10 cities
       ◦ All tech in France (Paris & Bordeaux)
     • Deezer's music catalog
       ◦ ~4M "main" artists (11M artists overall)
       ◦ ~10M albums
       ◦ 53M+ tracks
  3. Me
     • Software Engineer at Deezer (since 2019)
       ◦ Working on music catalog quality (integrity, completeness)
     • In the past:
       ◦ Embedded software (mobile phones)
       ◦ Web scraping in Python
       ◦ (One of the) Scrapy maintainers
       ◦ Data consolidation in winetech
     • redapple on GitHub
  4. Music metadata
     Data accompanying audio file collections (i.e. singles and albums):
     • Cover art
     • Album title
     • Artists (performers, composers, lyricists)
     • Copyright and publishing information
     • Label, release dates, release countries
     • Tags, genres
     • Press release, booklet
     • ...
     For this talk, the focus is on textual metadata delivered by music labels.
  5. Tracking homonyms
     (Artist photos: thrash metal from Canada, thrash metal from Spain, prog rock from Colombia, singer-songwriter from the USA)
  6. 1. Match against ground truth
     (Diagram: ground-truth artists A-E with albums 1-9 on one side, Deezer artists and albums on the other, with links between matching entries)
  7. 2. Swap, merge, split
     (Diagram: Deezer artists A'-E' and their albums 1-9 rearranged, swapped, merged or split, to line up with the ground-truth artists A-E)
  8. Linking music metadata sources
     • Human-curated metadata
       ◦ MusicBrainz, Discogs, Wikipedia, etc.
     • Sources can be broad or focused regarding music genres
     • Coreference: find records that refer to the same "real-world" entities
     • Multiple sources → better coverage and completeness
  9. 1. Match albums by artist & title

     Source | Artist name | Album title       | Artist ID
     A      | Nirvana     | Nevermind         | 34
     A      | Nirvana     | In Utero          | 34
     A      | Nirvana     | Local Anaesthetic | 68
     B      | Nirvana     | Nevermind         | EF65
     B      | Nirvana     | Local Anaesthetic | 45AE
  10. 1. Match albums by artist & title
     (Same table as the previous slide; "Nevermind" and "Local Anaesthetic" appear in both sources)
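     Slides 9-10 boil down to grouping releases by a normalized (artist name, album title) key and keeping the keys that occur in more than one source. Below is a minimal sketch of that exact-match step; the toy records and the naive normalization are illustrative, not Deezer's actual pipeline.

     from collections import defaultdict

     # Toy records from the table above: (source, artist name, album title, artist ID).
     records = [
         ("A", "Nirvana", "Nevermind", "34"),
         ("A", "Nirvana", "In Utero", "34"),
         ("A", "Nirvana", "Local Anaesthetic", "68"),
         ("B", "Nirvana", "Nevermind", "EF65"),
         ("B", "Nirvana", "Local Anaesthetic", "45AE"),
     ]

     def normalize(text):
         # Naive normalization; a real pipeline would also handle accents, punctuation, etc.
         return " ".join(text.lower().split())

     # Group records by (artist name, album title) and keep cross-source matches.
     by_key = defaultdict(list)
     for source, artist, album, artist_id in records:
         by_key[(normalize(artist), normalize(album))].append((source, artist_id))

     for key, entries in by_key.items():
         if len({source for source, _ in entries}) > 1:
             print(key, "->", entries)

     With the toy data this prints the two keys shared by sources A and B: "Nevermind" (artist IDs 34 and EF65) and "Local Anaesthetic" (artist IDs 68 and 45AE).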
  11. 2. Build a graph
     (Diagram: artist nodes and album nodes from source A and source B, the Nirvana artists linked to Nevermind, In Utero and Local Anaesthetic, with one edge per artist-album pair)
  12. 3. Find connected components
     (Same graph: the connected components group the artist and album nodes that match across source A and source B)
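     Slides 11-12 can be reproduced with networkx, one of the libraries listed in the abstract: artists and matched album titles become nodes, artist-album pairs become edges, and connected components group the records that should refer to the same artist. A minimal sketch with made-up node identifiers:

     import networkx as nx

     # One node per (source, artist ID) and one per matched album title;
     # each edge links an artist to an album it released in that source.
     G = nx.Graph()
     G.add_edges_from([
         (("A", "artist", "34"), ("album", "nevermind")),
         (("A", "artist", "34"), ("album", "in utero")),
         (("A", "artist", "68"), ("album", "local anaesthetic")),
         (("B", "artist", "EF65"), ("album", "nevermind")),
         (("B", "artist", "45AE"), ("album", "local anaesthetic")),
     ])

     # Each connected component groups artist records (and albums) that are
     # linked through shared album titles across sources.
     for component in nx.connected_components(G):
         print(sorted(component, key=str))

     Here one component contains artists 34 and EF65 together with "Nevermind" and "In Utero", and a second component contains artists 68 and 45AE with "Local Anaesthetic".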
  13. Unrelated artists get connected
     (Diagram: a generic "Best of" album title appears under different artists named Nirvana in source A and source B, linking otherwise unrelated artists into the same component)
  14. Match using track names too
     • Each release as one text document:
       ◦ Artist name
       ◦ Release name
       ◦ Track titles
     • Vectorize documents in word space:
       ◦ 1-word tokens
       ◦ But you can also use 2-word tokens, etc.
  15. Text1 → tokens
     Text document: "Nevermind" by Nirvana
     Nirvana / Nevermind / Smells Like Teen Spirit / In Bloom / Come as You Are / Breed / Lithium / Polly / Territorial Pissings / Drain You / Lounge Act / Stay Away / On a Plain / Something in the Way / Endless, Nameless
     Tokens (1-word, 2-word, 3-word): "nirvana", "nevermind", "smells", "like", "teen", "spirit", "smells like", "like teen", "teen spirit", "smells like teen", "like teen spirit", "in", "bloom", "in bloom", "come", "as", "you", "are", "come as", "as you", "you are", "come as you", "as you are", "breed", "lithium", "polly", "territorial", "pissings", "territorial pissings", "drain", "you", "drain you", "lounge", "act", "lounge act", "stay", "away", "stay away", "on", "plain", "on plain", "something", "in", "the", "way", "endless", "nameless", "something in", "in the", "the way", "way endless", "endless nameless", "something in the", "in the way", "the way endless", "way endless nameless"
  16. Text2 → tokens
     Text document: "MTV Unplugged" by Nirvana
     Nirvana / MTV Unplugged in New York / About a Girl / Come as You Are / Jesus Doesn’t Want Me for a Sunbeam / The Man Who Sold the World / Pennyroyal Tea / Dumb / Polly / On a Plain / Something in the Way / Plateau / Oh Me / Lake of Fire / All Apologies / Where Did You Sleep Last Night
     Tokens (1-word, 2-word, 3-word): "nirvana", "mtv", "unplugged", "in", "new", "york", "mtv unplugged", "unplugged in", "in new", "new york", "mtv unplugged in", "unplugged in new", "in new york", "about", "girl", "about girl", "come", "as", "you", "are", "come as", "as you", "you are", "come as you", "as you are", "jesus", "doesn", "want", "me", "for", "sunbeam", "jesus doesn", "doesn want", "want me", "me for", "for sunbeam", "jesus doesn want", "doesn want me", "want me for", "me for sunbeam", "the", "man", "who", "sold", "the", "world", "the man", "man who", "who sold", "sold the", "the world", "the man who", "man who sold", "who sold the", "sold the world", "pennyroyal", "tea", "pennyroyal tea", "dumb", "polly", "on", "plain", "on plain", "something", "in", "the", "way", "something in", "in the", "the way", "something in the", "in the way", "plateau", "oh", "me", "oh me", "lake", "of", "fire", "lake of", "of fire", "lake of fire", "all", "apologies", "all apologies", "where", "did", "you", "sleep", "last", "night", "where did", "did you", "you sleep", "sleep last", "last night", "where did you", "did you sleep", "you sleep last", "sleep last night"
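     The 1- to 3-word tokens on slides 15-16 are word n-grams; scikit-learn's CountVectorizer can produce them via ngram_range=(1, 3). A minimal sketch; the exact tokenizer and preprocessing used in the talk are not specified, so the resulting vocabulary only approximates the lists above.

     from sklearn.feature_extraction.text import CountVectorizer

     # One release (artist name + album title + track titles) as a single text document.
     document = (
         "Nirvana Nevermind Smells Like Teen Spirit In Bloom Come as You Are "
         "Breed Lithium Polly Territorial Pissings Drain You Lounge Act "
         "Stay Away On a Plain Something in the Way Endless Nameless"
     )

     # Word n-grams of length 1 to 3, lowercased; by default single-character
     # words like "a" are dropped, which matches the token lists above.
     vectorizer = CountVectorizer(ngram_range=(1, 3))
     vectorizer.fit([document])
     print(sorted(vectorizer.vocabulary_)[:10])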
  17. Bag-of-words model
     • Count the number of appearances of each token in the document
     • Orderless representation
     • Tokens can be seen as features
  18. Compare documents using vectors

     tokens/features   1   2   3  ...  98  99  100
     doc1              0   1   0  ...   1   1    0
     doc2              1   1   0  ...   0   1    1

     Possible measures:
     • Jaccard similarity
     • Cosine similarity
     • ...
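     A quick sketch of the two measures listed above, computed on toy binary bag-of-words vectors (the values are illustrative):

     import numpy as np

     # Toy binary bag-of-words vectors for two documents (1 = token present).
     doc1 = np.array([0, 1, 0, 1, 1, 0])
     doc2 = np.array([1, 1, 0, 0, 1, 1])

     # Jaccard similarity: shared tokens / tokens appearing in either document.
     intersection = np.sum((doc1 > 0) & (doc2 > 0))
     union = np.sum((doc1 > 0) | (doc2 > 0))
     jaccard = intersection / union

     # Cosine similarity: dot product divided by the product of the norms.
     cosine = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))

     print(f"Jaccard: {jaccard:.2f}, cosine: {cosine:.2f}")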
  19. TF-IDF
     • Not all tokens are relevant: "in" or "a" vs. "lithium" or "spirit"
     • Wikipedia: "In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus."
     • Term Frequency: how many times a token appears in the document
     • Inverse Document Frequency: inversely proportional to the number of documents the token appears in (logarithmic)
     • Weight of a token in a document d: w(token, d) = tf(token, d) × idf(token, d)
  20. TF-IDF matrix
     (Diagram: documents from source A and source B stacked as rows of one TF-IDF matrix, with tokens as columns)
     E.g. with scikit-learn (see the sketch below):
     • scikit-learn normalizes the vectors for you by default
     • So the scalar product gives you cosine similarity
     • You can customize the tokenizer (for example, not to use multiline tokens)
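     A minimal sketch of this step with scikit-learn's TfidfVectorizer, assuming one text document per release (toy data below); because rows are L2-normalized by default, multiplying the source-A matrix by the transposed source-B matrix directly yields the cosine similarities of all pairs.

     from sklearn.feature_extraction.text import TfidfVectorizer

     # One text document per release, for each source (toy data).
     source_a_docs = [
         "Nirvana Nevermind Smells Like Teen Spirit In Bloom Lithium Polly",
         "Nirvana In Utero Heart-Shaped Box Dumb All Apologies",
     ]
     source_b_docs = [
         "Nirvana Nevermind Smells Like Teen Spirit Come as You Are Breed",
         "Nirvana Local Anaesthetic",
     ]

     # Fit the vocabulary on both sources so vectors share the same token space.
     vectorizer = TfidfVectorizer(ngram_range=(1, 2))
     vectorizer.fit(source_a_docs + source_b_docs)

     X_a = vectorizer.transform(source_a_docs)  # rows are L2-normalized by default
     X_b = vectorizer.transform(source_b_docs)

     # Scalar products of normalized vectors = cosine similarities for all pairs.
     similarities = X_a @ X_b.T
     print(similarities.toarray())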
  21. Similarity scores for all pairs

           DocA  DocB  DocC  DocD  DocE  DocF  DocG  DocH  DocI  DocJ
     Doc1  0.2   0.9   0.1   0.0   0.6   0.7   0.2   0.9   0.1   0.0
     Doc2  0.3   0.1   0.9   0.8   0.7   0.1   0.1   0.2   0.3   0.1
     Doc3  0.9   0.8   0.1   0.2   0.5   0.6   0.4   0.2   0.1   0.9
     Doc4  0.2   0.2   0.8   0.5   0.1   0.9   0.4   0.3   0.3   0.2
     Doc5  0.9   0.5   0.5   0.5   0.2   0.1   0.9   0.7   0.8   0.1

     • Matching is now "fuzzy"
     • Multiple possible matches for each document (i.e. above some chosen threshold)
  22. Simple assumptions
     • No duplicates within each source
     • One document in one source can only match one document in another
     • Some documents may not have any match in another source
     → Solve the assignment problem
  23. Find best matches overall
     (Same similarity matrix as slide 21)
     e.g., with scipy (see the sketch below)
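     A minimal sketch of the assignment step with scipy.optimize.linear_sum_assignment, which finds the one-to-one pairing maximizing the total similarity (by minimizing the negated scores); the similarity values and the 0.5 threshold are illustrative choices.

     import numpy as np
     from scipy.optimize import linear_sum_assignment

     # Pairwise similarity scores (rows: documents from one source, columns: the other).
     similarity = np.array([
         [0.2, 0.9, 0.1, 0.0, 0.6],
         [0.3, 0.1, 0.9, 0.8, 0.7],
         [0.9, 0.8, 0.1, 0.2, 0.5],
     ])

     # linear_sum_assignment minimizes total cost, so negate to maximize similarity.
     rows, cols = linear_sum_assignment(-similarity)

     # Keep only the one-to-one matches above a chosen threshold.
     THRESHOLD = 0.5
     matches = [(r, c, similarity[r, c]) for r, c in zip(rows, cols)
                if similarity[r, c] >= THRESHOLD]
     print(matches)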
  24. Releases not linked to the same artist
     (Diagram: two artists named Nirvana in each of source A and source B, with the releases Nevermind, In Utero, Bleach, To Markos III and Local Anaesthetic distributed between them)
  25. Next steps?
     • Use other relations: labels, publishers, collaborators, composers, lyricists
     • Different artist-release link weights for different sources
     • Non-textual signals
       ◦ Similarity based on audio features
       ◦ Cover art similarities