Music data & music search at insane scale @ The Echo Nest

Brian Whitman
October 03, 2012

In the past few years, The Echo Nest has built the largest database of music anywhere – over 2 million artists and 30 million songs each with detailed information down to the pitch of each note in each guitar solo and every adjective ever said about your favorite new band. We’ve done it with a nimble and speedy custom infrastructure—web crawling, natural language processing, audio analysis and synthesis, audio fingerprinting and deduplication, and front ends to our massive key-value stores and text indexes. Our real time music data API handles hundreds of queries a second and powers most music discovery experiences you have on the internet today, from iHeartRadio and Spotify to eMusic, VEVO, MOG and MTV.

During this talk, the Echo Nest's co-founder and CTO will run through the challenges and solutions needed to build music recommendation, search and identification at "severe scale," with the constraint that most of our results are computed on the fly with little caching. It's hard to store results when data about music on the internet changes so quickly, as do the tastes and preferences of your customers' listeners.


Transcript

  1. Music data & music search at insane scale at The Echo Nest
     Brian Whitman, co-founder & CTO, The Echo Nest, @bwhitman
  2. The world of music data
     - We know about almost 3 million artists
     - 35 million unique songs
     - Hundreds of millions of recordings
     - We estimate 100 MB of data unpacks from each song
  3. The future music platform & APIs
     - APIs have transformed the music industry
     - New music experiences
     - Content now available via Spotify or sandboxes
  4. The Echo Nest
     - We power most music experiences you have on the internet
     - Recommendation, search, analytics, audio analysis, fingerprinting, playlists, radio
     - You've never heard of us; that's cool
  5. Today's talk
     - Real-time music services: the problem space in general
     - Doing "weird" stuff at scale
     - Operations & scale challenges on a large music data service:
       - Acoustic analysis on hundreds of millions of songs
       - Specific challenges with search (Solr) at scale & with user models
       - Compute-heavy API serving
       - Fingerprinting
     - What's next
  6. Music analysis
     - How does a computer listen to music, anyway?
     - We do "cultural" (web crawling) & "acoustic" (DSP) analysis on all music, ever
     - We've analyzed hundreds of millions of songs & billions of documents about artists
  10. Care & scale
      - We try to make our results as useful as possible
      - We do a very weird kind of recommendation technology
      - It works well but has some bad performance & ops characteristics
  13. Music similarity & playlisting
      - Streaming radio experiences for many clients
      - We serve hundreds of API calls a second
      - Many involve a user model per query
  14. Music resolving
      - Audio resolving (fingerprinting): Echoprint & ENMFP
      - Text resolving: artist/extract
      - ID space resolving: Rosetta Stone
  15. Taste profiles
      - Server-side representation of everything someone does with music
      - Preferences, play logging, skips, bans, favorites & other metadata
      - Predictive power: increase play time, better recommendations, ads & demo stuff
  16. Basic overview of the data processing stack
      - Data in (crawling, partners)
      - SQL store for relational data
      - Key-value store for "storables"
      - Column store for metadata
      - Various "fake map reduce" compute processes
      - Solr & a key-value store behind the API
  17. Acoustic analysis
      - All audio from partners & API users goes through an analysis
      - Heavily optimized C++, ~1s per 1 minute of audio
      - Divides each song into segments, each 0.1s - 4s:
        - Timbre
        - Pitch
        - Loudness
      - Beat tracking:
        - Measures
        - Downbeat
      - Song-level features:
        - Key, mode
        - Energy, danceability, etc. (a sketch of the output shape follows)
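
  The deck never shows the analysis output itself, so here is a hypothetical sketch of the JSON shape these bullets imply (per-segment timbre/pitch/loudness plus song-level features). Every field name is a guess rather than the real Echo Nest schema; the 12-dimensional pitch (chroma) and timbre vectors per segment match the publicly documented analyze format.

      import json

      # Hypothetical analysis output, shaped after the feature list above;
      # the actual schema may differ.
      analysis = json.loads("""
      {
        "track": {"key": 9, "mode": 1, "energy": 0.83, "danceability": 0.71},
        "segments": [
          {"start": 0.0, "duration": 0.42, "loudness_max": -11.2,
           "pitches": [0.1, 0.9, 0.2, 0.0, 0.1, 0.3, 0.0, 0.2, 0.1, 0.0, 0.4, 0.1],
           "timbre": [48.7, 12.1, -35.0, 5.5, -8.9, 14.2, 3.3, -6.1, 2.0, 4.4, -1.7, 0.9]}
        ],
        "beats": [{"start": 0.51, "duration": 0.49}]
      }
      """)

      # Each segment (0.1s - 4s) carries a 12-dim pitch and 12-dim timbre vector.
      for seg in analysis["segments"]:
          print(seg["start"], seg["pitches"].index(max(seg["pitches"])))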
  18. Acoustic analysis
      - Transcoding-style async job
      - Output of analysis is a ~1 MB JSON block for an average song
      - RabbitMQ for analysis queues
      - Workers run on EC2
      - Raw output goes on S3
      - Workers pull together indexable data (see the worker sketch below)
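
  A minimal sketch of that queue worker, assuming pika for RabbitMQ and boto3 for S3; the queue, bucket and function names are invented for illustration, and the real workers wrap the C++ analyzer rather than this stub.

      import json
      import boto3
      import pika

      def run_cpp_analyzer(audio_url):
          # Stand-in for the optimized C++ analyzer (~1s per minute of audio).
          return {"track": {}, "segments": [], "beats": []}

      s3 = boto3.client("s3")
      conn = pika.BlockingConnection(pika.ConnectionParameters("mq.internal"))
      channel = conn.channel()
      channel.queue_declare(queue="analyze", durable=True)

      def handle(ch, method, properties, body):
          job = json.loads(body)
          result = run_cpp_analyzer(job["audio_url"])
          # Raw ~1 MB JSON block goes to S3; indexable bits get pulled out later.
          s3.put_object(Bucket="en-analysis", Key=job["track_id"] + ".json",
                        Body=json.dumps(result))
          ch.basic_ack(delivery_tag=method.delivery_tag)

      channel.basic_consume(queue="analyze", on_message_callback=handle)
      channel.start_consuming()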
  19. Text indexing
      - Quite a lot of EN is powered directly by text-style inverted-index search:
        - Artist similarity
        - Music search & metadata
        - Playlisting
        - Even audio fingerprinting
      - We use Solr quite extensively, with a lot of custom stuff on top
      - Backed by a key-value store (Tokyo Tyrant) for storage

      Index sizes:
        Artists        2,200,000 docs
        Songs         34,000,000 docs
        Documents    200,000,000 docs
        Metadata     344,000,000 docs
        Fingerprints 192,000,000 docs
  22. Why Solr & TT?
      - I get this question so many times
      - I've destroyed the following products with 1% of our data:
        - CouchDB
        - SimpleDB & S3
        - Voldemort
        - MySQL w/ special indexing
      - Test it first, let me know
  23. In practice: artist similarity
      - Artist similarity is a complicated query with boosts:
        - t_term:'swedish pop'^5.3 t_term:'dancing queen'^0.2 f_familiarity:[0.8 TO *] etc. (see the query sketch after the table)
      - Terms are indexed with term-frequency weights using a custom component
      - Should run in real time whenever possible, not cached:
        - things change
        - people change
      - This makes Solr a big sorting engine. It's not so bad at it!

      Table 4.2: Top 10 terms of various types for ABBA. The score is TF-IDF for adj (adjective), and gaussian-weighted TF-IDF for term types n2 (bigrams) and np (noun phrases).

        n2 (bigrams)             np (noun phrases)          adj (adjectives)
        dancing queen    0.0707  dancing queen      0.0875  perky           0.8157
        mamma mia        0.0622  mamma mia          0.0553  nonviolent      0.7178
        disco era        0.0346  benny              0.0399  swedish         0.2991
        winner takes     0.0307  chess              0.0390  international   0.2010
        chance on        0.0297  its chorus         0.0389  inner           0.1776
        swedish pop      0.0296  vous               0.0382  consistent      0.1508
        my my            0.0290  the invitations    0.0377  bitter          0.0871
        s enduring       0.0287  voulez             0.0377  classified      0.0735
        and gimme        0.0280  something's        0.0374  junior          0.0664
        enduring appeal  0.0280  priscilla          0.0369  produced        0.0616
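
  A sketch of what a boosted similarity query like the one above could look like from Python, assuming a stock Solr HTTP endpoint; the field names (t_term, f_familiarity) come from the slide, while the URL and helper function are made up.

      import requests

      SOLR_URL = "http://localhost:8983/solr/artists/select"  # assumed endpoint

      def similar_artists(weighted_terms, min_familiarity=0.8, rows=20):
          # Boost each crawled term by its weight, e.g. t_term:"swedish pop"^5.3
          clauses = ['t_term:"%s"^%.2f' % (term, w) for term, w in weighted_terms]
          clauses.append("f_familiarity:[%.1f TO *]" % min_familiarity)
          params = {"q": " ".join(clauses), "rows": rows, "wt": "json"}
          return requests.get(SOLR_URL, params=params).json()["response"]["docs"]

      # Terms & weights would come from a table like the ABBA one above.
      print(similar_artists([("swedish pop", 5.3), ("dancing queen", 0.2)]))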
  24. Solr: Sharding
      - Two types of data sharding:
        - Consistent hashing of data with replication (toy ring sketch below)
        - Full data replication on multiple nodes
      - Choice depends on data size & index complexity
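
  A toy consistent-hash ring, only to illustrate the first mode; the deck doesn't describe the actual implementation, and the node names are invented.

      import bisect
      import hashlib

      class Ring:
          def __init__(self, nodes, vnodes=64):
              # vnodes virtual points per node smooth out the distribution.
              self.ring = sorted(
                  (self._hash("%s:%d" % (node, i)), node)
                  for node in nodes for i in range(vnodes)
              )
              self.keys = [h for h, _ in self.ring]

          @staticmethod
          def _hash(key):
              return int(hashlib.md5(key.encode()).hexdigest(), 16)

          def node_for(self, doc_id):
              # First virtual node clockwise from the document's hash.
              i = bisect.bisect(self.keys, self._hash(doc_id)) % len(self.ring)
              return self.ring[i][1]

      ring = Ring(["solr1", "solr2", "solr3"])
      print(ring.node_for("ARH6W4X1187B99274F"))  # same doc -> same shard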
  25. Solr: Storage
      - Should you store data in a text index? Sometimes
      - As average document size goes up, Solr does better at retrieval, but that's an edge case
      - In general, if you don't need to (highlighting, etc.), don't
  26. Solr: Indexing (1)
      - Indexing large amounts of data is the biggest bottleneck we've had
      - We built a bulk Lucene indexing system (Flattery), sketched below:
        - All new or updated documents go into the key-value store immediately & onto a queue
        - Since all indexes run at least two copies of themselves we can:
          - Take down one
          - Index the documents out of the queue & from the KV store using embedded Solr
          - Then bring it back up and take down the other during rsyncing
      - This avoids query slowdowns at the expense of truly real-time indexing (but it's close)
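
  A compressed sketch of that rolling swap; Flattery is internal, so every class and helper below is a toy stand-in for the real queue, KV store and embedded-Solr pieces.

      from collections import deque

      class Replica:
          # Toy stand-in for one copy of an index with embedded Solr.
          def __init__(self, name):
              self.name, self.docs, self.in_rotation = name, [], True

          def index(self, doc):
              self.docs.append(doc)

      def rolling_reindex(replicas, update_queue, kv_store):
          pending = list(update_queue)
          for replica in replicas:             # every index runs >= 2 copies
              replica.in_rotation = False      # queries keep hitting the others
              for doc_id in pending:
                  replica.index(kv_store[doc_id])  # canonical doc lives in KV
              replica.in_rotation = True       # rsync, then swap back in
          update_queue.clear()

      kv = {"SO123": {"title": "Dancing Queen"}}
      rolling_reindex([Replica("a"), Replica("b")], deque(["SO123"]), kv)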
  28. Solr: Indexing (2)
      - Other simple things:
        - Only index what you need: date accuracy, stemming, etc.
        - Index should be in RAM or on fast SSD as much as possible
        - We use Fusion-io and slower SSDs as storage
        - Don't use the grouping feature
        - EC2, EBS: no.
  29. User models
      - Taste profiles: users can send us anything about their music preference: plays, skips, collection contents
      - We currently track millions of individuals' listening habits via our partners & API
      - 48 shards of SQL after a resolving step (routing sketch below):
        - Async resolving workers
      - A Solr index as well, for similarity (as artists / songs)
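
  The deck gives the shard count but not the routing scheme; one obvious stable mapping from a resolved user ID to one of the 48 SQL shards looks like this (the hash choice is an assumption).

      import zlib

      N_SHARDS = 48

      def shard_for(user_id):
          # Stable hash: the same user always resolves to the same SQL shard.
          return zlib.crc32(user_id.encode()) % N_SHARDS

      print(shard_for("user-8675309"))  # e.g. 23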
  30. User models
      - Users can search within a profile or use a profile as a seed (recommendation)
      - How can you quickly "cut" a huge index down to a user model?
        - The model changes frequently & we can't have stale info
        - Can't index the model into Solr for perf reasons
      - We built Polarized, a custom Lucene / Solr filter component (open source!):
        - Reads a set from an external data source (SQL) from an ID passed in at query time
        - Builds an in-memory bitset at query time to filter results to Lucene docIDs (illustrated below)
        - Very useful
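
  Polarized itself is a Java Lucene/Solr component; this Python toy only illustrates the core move: fetch the user's ID set from SQL at query time, turn it into a bitset over Lucene docIDs, and mask search hits with it.

      def build_filter_bitset(user_song_ids, docid_by_song):
          bits = 0
          for song_id in user_song_ids:      # set read from SQL at query time
              doc = docid_by_song.get(song_id)
              if doc is not None:
                  bits |= 1 << doc           # one bit per Lucene docID
          return bits

      def filter_hits(scored_docids, bits):
          # Keep only hits whose docID bit is set for this user.
          return [d for d in scored_docids if bits >> d & 1]

      docid_by_song = {"SO1": 0, "SO2": 1, "SO3": 2}
      bits = build_filter_bitset({"SO1", "SO3"}, docid_by_song)
      print(filter_hits([2, 1, 0], bits))  # -> [2, 0]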
  31. API
      - Hundreds of music queries a second hit our stack
      - Mix of enterprise customers & indie developers
      - Mostly on-the-fly compute queries
      - Some data cached through a CDN for some customers
      - All compute in Python
  32. API
      - Front end: nginx & Tornado (minimal frontend sketch below)
      - Worker model:
        - Workers do CPU compute:
          - FP matching
          - Playlist generation
        - Specialized workers for:
          - Profile resolution
          - Audio analysis
        - Specialized workers use MQ
      - Python threading during compute is the biggest killer (the GIL)

      [Architecture diagram: developer.echonest.com (LB) → nginx + Tornado API frontend (incl. track-upload Tornado) → worker pool (LB) of Apache + mod_wsgi PyAPI threads with Memcache, plus specialized Python catalog & analyze workers fed by RabbitMQ queues]
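
  A minimal sketch of the frontend tier only, using Tornado; the worker handoff is faked with a stub function where the real stack load-balances to Apache + mod_wsgi workers (and RabbitMQ for the specialized ones). The route and names are invented.

      import tornado.ioloop
      import tornado.web

      def run_on_worker(artist_id):
          # Stub for the worker tier; keeps heavy compute off the IO loop.
          return {"artist": artist_id, "similar": ["AR123", "AR456"]}

      class SimilarHandler(tornado.web.RequestHandler):
          def get(self):
              self.write(run_on_worker(self.get_argument("id")))

      app = tornado.web.Application([(r"/artist/similar", SimilarHandler)])

      if __name__ == "__main__":
          app.listen(8888)
          tornado.ioloop.IOLoop.current().start()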
  33. API ops
      - Graphite
      - Everything logged
      - Fabric for deployment
      - ~1 deploy / day, new features & bugs
  34. Audio fingerprinting
      - Echoprint & ENMFP
      - We identify 75 songs a second via our FPs
      - >20M songs, >100M tracks in the lookup DB (& people can add to it!)
      - Internal use & APIs
      - Both are open to a degree; Echoprint is fully open source end to end
  35. Fingerprinting process
      - All open source, if you want to look: http://echoprint.me/

      [Pipeline diagram: File → Codegen (mobile or server) → 1000s of 20-bit code / time pairs in hash space → Lookup API → matching index + storage]
  36. Fingerprinting indexing & search
      - We use Solr as the search backend:
        - Treat the 20-bit codes as tokens in the index
        - The query finds the top N docs with the most overlap
      - A Python process in the worker then does "post-filtering" (sketched below):
        - Very CPU-intensive, a dynamic programming problem
        - Takes a list of ~20 contenders from Solr and decides whether there is a real match
      - Index: sharded by song duration
        - Queries hit 1 or 2 of 16 shards if we know the duration
        - For Echoprint, songs are indexed in 60s pieces to match the right spot

      [Diagram: inverted index from hash codes (e.g. 123984, 6940, 21823) to (track, time offset) postings such as TR5904 @ 50ms, TR1283 @ 840ms]
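
  The slide frames post-filtering as a dynamic programming problem; this toy swaps in a simpler time-offset voting scheme just to show the flavor: matching codes must agree on where they sit in time, not merely overlap.

      from collections import Counter

      def best_match(query_codes, candidates, min_votes=10):
          # query_codes and each candidate: list of (20-bit code, time) pairs.
          best = None
          for track_id, track_codes in candidates.items():
              by_code = {}
              for code, t in track_codes:
                  by_code.setdefault(code, []).append(t)
              # Vote on the relative offset between query and track times.
              offsets = Counter(
                  t_track - t_query
                  for code, t_query in query_codes
                  for t_track in by_code.get(code, ())
              )
              votes = max(offsets.values(), default=0)
              if votes >= min_votes and (best is None or votes > best[1]):
                  best = (track_id, votes)
          return best  # (track_id, votes), or None if nothing lines up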
  37. Fingerprinting performance
      - People want FP results fast
      - But compute is a serious bottleneck
      - At 50 requests a second per server, we aim for no more than 1% of lookups slower than 500ms

      [Chart: lookup latency in ms (0-900) at the average, median, and slowest 1% / 5% / 20% percentiles, Echoprint vs ENMFP]
  38. What's next
      - We keep annoying our engineers
      - Our next feature release is more live compute