Music data & music search at insane scale @ The Echo Nest

Brian Whitman
October 03, 2012

In the past few years, The Echo Nest has built the largest database of music anywhere: over 2 million artists and 30 million songs, each with detailed information down to the pitch of each note in each guitar solo and every adjective ever said about your favorite new band. We've done it with a nimble and speedy custom infrastructure: web crawling, natural language processing, audio analysis and synthesis, audio fingerprinting and deduplication, and front ends to our massive key-value stores and text indexes. Our real-time music data API handles hundreds of queries a second and powers most music discovery experiences you have on the internet today, from iHeartRadio and Spotify to eMusic, VEVO, MOG and MTV.

During this talk, The Echo Nest's co-founder and CTO will run through the challenges and solutions involved in building music recommendation, search and identification at "severe scale," under the constraint that most results are computed on the fly with little caching. It's hard to store results when data about music on the internet changes so quickly, as do the tastes and preferences of your customers' listeners.

Transcript

  1. Music data & music
    search at insane scale at
    The Echo Nest
    Brian Whitman
    co-Founder & CTO
    The Echo Nest
    @bwhitman

  4. musician (1997-2004) academic (1999-2005) The Echo Nest (2005-)

  5. The world of music data
     We know about almost 3 million artists
     35 million unique songs
     Hundreds of millions of recordings
     We estimate 100 MB of data unpacks from each song

  6. The future music platform & APIs
     APIs have transformed the music industry
     New music experiences
     Content is now available via Spotify or API sandboxes

  7. The Echo Nest
     We power most music experiences you have on the internet
     Recommendation, search, analytics, audio analysis, fingerprinting, playlists, radio
     You’ve never heard of us, that’s cool

  8. Today’s talk
     Real time music services - the problem space in general
     Doing “weird” stuff at scale
     Operations & scale challenges on a large music data service:
    - Acoustic analysis on hundreds of millions of songs
    - Specific challenges with search (Solr) at scale & with user models
    - Compute heavy API serving
    - Fingerprinting
     What’s next

  9. Music analysis
     How does a computer listen to music, anyway?
     We do “cultural” (web crawling) & “acoustic” (DSP) analysis on all music ever
     We’ve analyzed hundreds of millions of songs & billions of documents about artists

  13. Care & scale
     We try to make our results as useful as possible
     We do a very weird kind of recommendation technology
     It works well but has some bad performance & ops characteristics

  16. Music similarity & playlisting
     Streaming radio experiences for many clients
     We serve hundreds of API calls a second
     Many involve a user model per query

  17. Music resolving
     Audio resolving (fingerprinting) - Echoprint & ENMFP
     Text resolving - artist/extract
     ID space resolving - Rosetta Stone

  18. Taste profiles
     Server side representation of everything someone does with music
     Preferences, play logging, skips, bans, favorites & other metadata
     Predictive power -- increased play time, better recommendations, ads & demographics

  19. Basic overview of the data processing stack
     Data in (crawling, partners) feeds a SQL store for relational data, a key-value store for "storables", and a column store for metadata; various "fake map reduce" compute processes build the Solr indexes and key-value stores that back the API.

  20. Acoustic analysis
     All audio from partners & API users goes through analysis
     Heavily optimized C++, ~1s per 1 minute of audio
     Divides song into segments, each 0.1s - 4s:
    - Timbre
    - Pitch
    - Loudness
     Beat tracking
    - Measures
    - Downbeat
     Song level features:
    - Key, mode
    - Energy, danceability, etc
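
     As a rough sketch, the per-song output looks like the dict below. Field names are paraphrased from the public Analyze documentation; the values are invented for illustration:

        # Rough shape of one song's analysis; values invented for illustration.
        analysis = {
            "track": {"key": 9, "mode": 1, "tempo": 117.2,
                      "energy": 0.83, "danceability": 0.71},  # song-level features
            "bars": [{"start": 0.46, "duration": 2.05}],      # measures & downbeats
            "beats": [{"start": 0.46, "duration": 0.51}],     # beat tracking
            "segments": [{                                    # one per 0.1s - 4s slice
                "start": 0.0,
                "duration": 0.37,
                "loudness_max": -11.2,                        # dB
                "pitches": [0.9] + [0.1] * 11,                # 12-dim chroma vector
                "timbre": [48.7] + [0.0] * 11,                # 12 timbre coefficients
            }],
        }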

  21. Acoustic analysis

  22. Acoustic analysis
     Transcoding-style async job
     Output of analysis is a ~1 MB JSON block for an average song
     RabbitMQ for analysis queues
     Workers run on EC2
     Raw output goes on S3
     Workers pull together indexable data
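
     A minimal sketch of one such worker, assuming current pika and boto3; the queue and bucket names are invented, and analyze() is a hypothetical stand-in for the C++ analyzer binding:

        # Sketch of an analyze worker: pull a job, run analysis, park raw JSON on S3.
        import json
        import boto3
        import pika

        def on_message(channel, method, properties, body):
            job = json.loads(body)
            result = analyze(job["audio_url"])      # hypothetical C++ analyzer binding
            boto3.client("s3").put_object(          # raw output goes on S3
                Bucket="en-analysis",               # invented bucket name
                Key=job["track_id"] + ".json",
                Body=json.dumps(result),
            )
            channel.basic_ack(delivery_tag=method.delivery_tag)

        conn = pika.BlockingConnection(pika.ConnectionParameters("mq-host"))
        channel = conn.channel()
        channel.basic_consume(queue="analyze", on_message_callback=on_message)
        channel.start_consuming()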

  23. Text indexing
     Quite a lot of EN is powered directly by text-style inverted index search:
    - Artist similarity
    - Music search & metadata
    - Playlisting
    - Even audio fingerprinting
     We use Solr quite extensively with a lot of custom stuff on top
     Backed by a key-value store (Tokyo Tyrant) for storage
     Index         docs
     Artists         2,200,000
     Songs          34,000,000
     Documents     200,000,000
     Metadata      344,000,000
     Fingerprints  192,000,000

  27. Why Solr & TT?
     I get this question so many times
     I’ve destroyed the following products with 1% of our data:
    - CouchDB
    - SimpleDB & S3
    - Voldemort
    - MySQL w/ special indexing
     Test it first, let me know

  28. In practice: artist similarity

  29. In practice: artist similarity
     Artist similarity is a complicated query with boosts:
    - t_term:‘swedish pop’^5.3 t_term:‘dancing queen’^0.2 f_familiarity:[0.8 TO *] etc
     Terms are indexed with term frequency weights using a custom component
     Should run in real time whenever possible, not cached
    - things change
    - people change
     This makes Solr a big sorting engine. It's not so bad at it!

     Table 4.2: Top 10 terms of various types for ABBA. The score is TF-IDF for adj (adjectives), and Gaussian-weighted TF-IDF for term types n2 (bigrams) and np (noun phrases).

     n2 term          score   np term          score   adj term        score
     dancing queen    0.0707  dancing queen    0.0875  perky           0.8157
     mamma mia        0.0622  mamma mia        0.0553  nonviolent      0.7178
     disco era        0.0346  benny            0.0399  swedish         0.2991
     winner takes     0.0307  chess            0.0390  international   0.2010
     chance on        0.0297  its chorus       0.0389  inner           0.1776
     swedish pop      0.0296  vous             0.0382  consistent      0.1508
     my my            0.0290  the invitations  0.0377  bitter          0.0871
     s enduring       0.0287  voulez           0.0377  classified      0.0735
     and gimme        0.0280  something's      0.0374  junior          0.0664
     enduring appeal  0.0280  priscilla        0.0369  produced        0.0616
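
     A minimal sketch of firing such a boosted query with pysolr; the core URL and the returned "id" field are assumptions, not our schema:

        # Sketch: an artist-similarity-style boosted query against a Solr core.
        import pysolr

        solr = pysolr.Solr("http://localhost:8983/solr/artists")  # invented endpoint

        # Boost each cultural term by its weight; require high familiarity.
        query = ('t_term:"swedish pop"^5.3 t_term:"dancing queen"^0.2 '
                 'f_familiarity:[0.8 TO *]')
        for doc in solr.search(query, rows=10):
            print(doc["id"])  # assumes an "id" field in the schema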

  30. Solr: Sharding
     Two types of data sharding
    - Consistent hashing of data with replication
    - Full data replication on multiple nodes
    - Choice depends on data size & index complexity
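
     A minimal consistent-hashing sketch for the first option; the shard names and document ID are invented, and this is not Echo Nest code:

        # Sketch: consistent hashing of document IDs onto shards, with virtual nodes.
        import bisect
        import hashlib

        class Ring:
            def __init__(self, shards, vnodes=64):
                self.points = sorted(
                    (self._h(f"{shard}:{i}"), shard)
                    for shard in shards
                    for i in range(vnodes)
                )
                self.keys = [h for h, _ in self.points]

            @staticmethod
            def _h(key):
                return int(hashlib.md5(key.encode()).hexdigest(), 16)

            def shard_for(self, doc_id):
                # First ring point clockwise from the doc's hash (wraps around).
                i = bisect.bisect(self.keys, self._h(doc_id)) % len(self.keys)
                return self.points[i][1]

        ring = Ring(["solr-a", "solr-b", "solr-c"])
        print(ring.shard_for("AR123456789"))  # invented artist ID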

  31. Solr: Storage
     Should you store data in a text index?
    - Sometimes
     As average data size goes up, Solr does better at retrieval - but that’s an edge case
     In general, if you don't need stored fields (e.g. for highlighting), don't store them

  32. Solr: Indexing (1)
     Indexing large amounts of data is the biggest bottleneck we’ve had
     We built a bulk Lucene indexing system (Flattery)
    - All new or updated documents go into the key value store immediately & on a queue
    - Since all indexes run at least two copies of themselves we can:
    - Take down one
    - Index the documents out of the queue & from the KV store using embedded Solr
    - Then bring it back up and take down the other during rsyncing
     This avoids query slowdowns at the expense of truly RT indexing (but it’s close)
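
     In outline, the rotation looks something like this sketch; every function name here is a hypothetical placeholder, not the real Flattery API:

        # Sketch of a Flattery-style rotate: index offline, then swap replicas.
        def rebuild(replicas, queue, kv_store):
            for replica in replicas:
                replica.remove_from_load_balancer()      # take one copy down
                docs = (kv_store.get(doc_id) for doc_id in queue.drain())
                replica.index_with_embedded_solr(docs)   # bulk index, no live queries
                replica.add_to_load_balancer()           # bring it back up
                # next iteration takes down the other copy while indexes rsync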

  34. Solr: Indexing (2)
     Other simple things:
    - Only index what you need: date accuracy, stemming, etc
    - Index should be on RAM or fast SSD as much as possible
    - We use Fusion-IO and slower SSDs as storage
    - Don’t use the grouping feature
    - EC2, EBS, no.

  35. User models
     Taste profiles: users can send us anything about their music preference - plays, skips, collection contents
     We currently track millions of individuals' listening habits via our partners & API
     48 shards of SQL after a resolving step
    - Async resolving workers
     Solr index as well for similarity (as artists / songs)
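
     A sketch of the routing step after resolving; the CRC-modulo scheme is an assumption, since the slide only gives the shard count:

        # Sketch: route a resolved profile to one of 48 SQL shards.
        # The CRC-modulo scheme is an assumption; the slide only gives the count.
        import zlib

        NUM_SHARDS = 48

        def shard_for_user(user_id: str) -> int:
            return zlib.crc32(user_id.encode()) % NUM_SHARDS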

  36. User models

  37. User models
     Users can search within a profile or use a profile as a seed (recommendation)
     How can you quickly “cut” a huge index to a user model?
    - Model changes frequently & we can’t have stale info
    - Can’t index model into Solr for perf reasons
     We built Polarized: custom Lucene / Solr filter component
    - open source!
    - Reads a set from external data source (SQL) from an ID passed in at query time
    - Builds an in memory bitset at query time to filter results to Lucene docIDs
    - Very useful
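
     Polarized itself is a Java Lucene/Solr component; this Python sketch only illustrates the idea, with hypothetical callables standing in for the SQL read and the docID lookup:

        # Conceptual sketch only: Polarized itself is a Java Lucene/Solr component.
        # fetch_track_ids (SQL read) and docid_of (field-cache lookup) are
        # hypothetical callables supplied by the caller.
        from bitarray import bitarray

        def build_profile_filter(profile_id, max_doc, fetch_track_ids, docid_of):
            bits = bitarray(max_doc)          # one bit per Lucene docID
            bits.setall(False)
            for track_id in fetch_track_ids(profile_id):
                docid = docid_of(track_id)
                if docid is not None:
                    bits[docid] = True
            return bits  # intersected with candidate docs at collection time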

  38. API
     Hundreds of music queries a second hit our stack
     Mix of enterprise customers & indie developers
     Most queries are computed on the fly
     Some data cached through CDN for some customers
     All compute in Python

  39. API
     Front end nginx & tornado
     Worker model
    - Workers do CPU compute:
    - FP matching
    - Playlist generation
    - Specialized workers for
    - Profile resolution
    - Audio analysis
    - Specialized workers use MQ
     Python threading during compute is the biggest killer
     [Architecture diagram: developer.echonest.com (LB) → NginX → Tornado front ends (incl. track upload) → worker-pool (LB) → API workers (Apache + mod_wsgi, PyAPI threads) → data layer; specialized catalog & analyze workers (Python) consume RabbitMQ queues, with Memcache alongside.]
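
     For flavor, a minimal Tornado front end that proxies compute to the worker pool; the endpoint path and worker URL are invented:

        # Sketch: non-blocking Tornado front end proxying compute to the worker pool.
        import tornado.ioloop
        import tornado.web
        from tornado.httpclient import AsyncHTTPClient

        class PlaylistHandler(tornado.web.RequestHandler):
            async def get(self):
                # Front end stays non-blocking; CPU work happens in a worker.
                resp = await AsyncHTTPClient().fetch(
                    "http://worker-pool/playlist?" + self.request.query
                )
                self.write(resp.body)

        app = tornado.web.Application([(r"/api/v4/playlist", PlaylistHandler)])
        app.listen(8888)
        tornado.ioloop.IOLoop.current().start()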

  40. API ops
     Graphite
     Everything logged
     Fabric for deployment
     ~1 deploy/day, new features & bugs

  41. Audio fingerprinting
     Echoprint & ENMFP
     We identify 75 songs a second via our FPs
     >20m songs, >100m tracks in the lookup DB (& people can add to it!)
     Internal use & APIs
     Both are open to a degree; Echoprint is fully open source end to end

  42. Fingerprinting process
     (All open source, if you want to look: http://echoprint.me/)
     Flow: File → Codegen (mobile or server) → 1000s of 20-bit code/time pairs (the hash space) → Lookup API → matching index & storage

  43. Fingerprinting indexing & search
     We use Solr as the search backend
    - treat the 20-bit codes as tokens in the index
    - the query finds the top N docs with the most overlap
     Python process in the worker then does “post filtering”
    - very CPU intensive, dynamic programming problem
    - takes a list of ~20 contenders from Solr and finds if there is a real match
     Index: sharded by song duration
    - queries hit 1 or 2 of 16 shards if we know the duration
    - For echoprint, songs are indexed in 60s pieces to match the right spot
     Example inverted-index entries (code → track @ offset):
     123984: TR5904 @ 50ms,  TR1283 @ 120ms
       6940: TR1283 @ 840ms, TR7348 @ 860ms
      21823: TR1293 @ 680ms, TR5909 @ 650ms
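
     The production post-filter is described above only as a dynamic programming problem; this simpler offset-histogram sketch shows the general verification idea:

        # Sketch: verify fingerprint contenders by histogramming time offsets.
        # A true match's codes align at a consistent query-vs-database offset.
        from collections import Counter

        def best_match(query_codes, contenders):
            # query_codes: list of (code, time_ms) from codegen
            # contenders: {track_id: list of (code, time_ms)} from the Solr top-N
            best = (None, 0)
            for track_id, codes in contenders.items():
                index = {}
                for code, t in codes:
                    index.setdefault(code, []).append(t)
                offsets = Counter(
                    t_db - t_q
                    for code, t_q in query_codes
                    for t_db in index.get(code, ())
                )
                score = max(offsets.values(), default=0)
                if score > best[1]:
                    best = (track_id, score)
            return best  # (track_id, count of codes aligned at one offset)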

  44. Fingerprinting performance
     People want FP results fast
     But compute is a serious bottleneck
     At 50 rps per server, we aim for at most 1% of lookups slower than 500 ms
     [Chart: lookup latency in ms (0-900) for Echoprint vs ENMFP at the average, median, and slowest 1%, 5% & 20% of queries.]

  45. What's next
     We keep annoying our engineers
     Our next feature release is more live compute

  46. What's next
     Singular value decomposition as-a-service
     Live machine learning evaluation on taste profiles
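
     A sketch of what that factorization might look like, using scipy's truncated SVD on an invented sparse user x song play matrix:

        # Sketch: truncated SVD on an invented sparse user x song play matrix.
        import numpy as np
        from scipy.sparse import random as sparse_random
        from scipy.sparse.linalg import svds

        plays = sparse_random(10_000, 50_000, density=1e-4, format="csr")
        u, s, vt = svds(plays, k=50)     # rank-50 factorization
        user_vecs = u * s                # latent taste vector per user
        song_vecs = vt.T                 # latent vector per song

        # Recommend for user 0: rank songs by dot product with their taste vector.
        scores = user_vecs[0] @ song_vecs.T
        top10 = np.argsort(-scores)[:10]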

  47. Thank you!
     email: [email protected]
     Thanks: Tyler Williams, Hui Cao, Aaron Daubman & john!
