Music data & music search at insane scale @ The Echo Nest

Brian Whitman
October 03, 2012

In the past few years, The Echo Nest has built the largest database of music anywhere – over 2 million artists and 30 million songs each with detailed information down to the pitch of each note in each guitar solo and every adjective ever said about your favorite new band. We’ve done it with a nimble and speedy custom infrastructure—web crawling, natural language processing, audio analysis and synthesis, audio fingerprinting and deduplication, and front ends to our massive key-value stores and text indexes. Our real time music data API handles hundreds of queries a second and powers most music discovery experiences you have on the internet today, from iHeartRadio and Spotify to eMusic, VEVO, MOG and MTV.

During this talk, the Echo Nest's co-founder and CTO will run through the challenges and solutions needed to build music recommendation, search and identification at "severe scale," with the constraint that most of our results are computed on the fly with little caching. It's hard to store results when data about music on the internet changes so quickly, as do the tastes and preferences of your customers' listeners.


Transcript

  1. Music data & music search at insane scale at The Echo Nest
     Brian Whitman, co-founder & CTO, The Echo Nest, @bwhitman
  2. The world of music data
     - We know about almost 3 million artists
     - 35 million unique songs
     - Hundreds of millions of recordings
     - We estimate 100 MB of data unpacks from each song
  3. The future music platform & APIs
     - APIs have transformed the music industry
     - New music experiences
     - Content now available via Spotify or sandboxes
  4. The Echo Nest
     - We power most music experiences you have on the internet
     - Recommendation, search, analytics, audio analysis, fingerprinting, playlists, radio
     - You've never heard of us; that's cool
  5. Today's talk
     - Real-time music services: the problem space in general
     - Doing "weird" stuff at scale
     - Operations & scale challenges on a large music data service:
       - Acoustic analysis on hundreds of millions of songs
       - Specific challenges with search (Solr) at scale & with user models
       - Compute-heavy API serving
       - Fingerprinting
     - What's next
  6. Music analysis
     - How does a computer listen to music, anyway?
     - We do "cultural" (web crawling) & "acoustic" (DSP) analysis on all music, ever
     - We've analyzed hundreds of millions of songs & billions of documents about artists
  10. Care & scale
      - We try to make our results as useful as possible
      - We do a very weird kind of recommendation technology
      - It works well but has some bad performance & ops characteristics
  13. Music similarity & playlisting
      - Streaming radio experiences for many clients
      - We serve hundreds of API calls a second
      - Many involve a user model per query
  14. Music resolving
      - Audio resolving (fingerprinting): Echoprint & ENMFP
      - Text resolving: artist/extract
      - ID space resolving: Rosetta Stone
  15. Taste profiles
      - Server-side representation of everything someone does with music
      - Preferences, play logging, skips, bans, favorites & other metadata
      - Predictive power: increase play time, better recommendations, ads & demo stuff
  16. Basic overview of the data processing stack
      - Data in (crawling, partners)
      - SQL store for relational data
      - Key-value store for "storables"
      - Column store for metadata
      - Various "fake map reduce" compute processes
      - Solr & a key-value store behind the API
  17. Acoustic analysis
      - All audio from partners & API users goes through an analysis
      - Heavily optimized C++, ~1s per 1 minute of audio
      - Divides each song into segments, each 0.1s - 4s:
        - Timbre
        - Pitch
        - Loudness
      - Beat tracking:
        - Measures
        - Downbeat
      - Song-level features:
        - Key, mode
        - Energy, danceability, etc. (a sketch of the output shape follows)
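
  The deck never shows the analysis output itself, so here is a hypothetical sketch of the JSON shape these bullets imply (per-segment timbre/pitch/loudness plus song-level features). Every field name is a guess rather than the real Echo Nest schema; the 12-dimensional pitch (chroma) and timbre vectors per segment match the publicly documented analyze format.

      import json

      # Hypothetical analysis output, shaped after the feature list above;
      # the actual schema may differ.
      analysis = json.loads("""
      {
        "track": {"key": 9, "mode": 1, "energy": 0.83, "danceability": 0.71},
        "segments": [
          {"start": 0.0, "duration": 0.42, "loudness_max": -11.2,
           "pitches": [0.1, 0.9, 0.2, 0.0, 0.1, 0.3, 0.0, 0.2, 0.1, 0.0, 0.4, 0.1],
           "timbre": [48.7, 12.1, -35.0, 5.5, -8.9, 14.2, 3.3, -6.1, 2.0, 4.4, -1.7, 0.9]}
        ],
        "beats": [{"start": 0.51, "duration": 0.49}]
      }
      """)

      # Each segment (0.1s - 4s) carries a 12-dim pitch and 12-dim timbre vector.
      for seg in analysis["segments"]:
          print(seg["start"], seg["pitches"].index(max(seg["pitches"])))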
  18. Acoustic analysis
      - Transcoding-style async job
      - Output of analysis is a ~1 MB JSON block for an average song
      - RabbitMQ for analysis queues
      - Workers run on EC2
      - Raw output goes on S3
      - Workers pull together indexable data (see the worker sketch below)
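
  A minimal sketch of that queue worker, assuming pika for RabbitMQ and boto3 for S3; the queue, bucket and function names are invented for illustration, and the real workers wrap the C++ analyzer rather than this stub.

      import json
      import boto3
      import pika

      def run_cpp_analyzer(audio_url):
          # Stand-in for the optimized C++ analyzer (~1s per minute of audio).
          return {"track": {}, "segments": [], "beats": []}

      s3 = boto3.client("s3")
      conn = pika.BlockingConnection(pika.ConnectionParameters("mq.internal"))
      channel = conn.channel()
      channel.queue_declare(queue="analyze", durable=True)

      def handle(ch, method, properties, body):
          job = json.loads(body)
          result = run_cpp_analyzer(job["audio_url"])
          # Raw ~1 MB JSON block goes to S3; indexable bits get pulled out later.
          s3.put_object(Bucket="en-analysis", Key=job["track_id"] + ".json",
                        Body=json.dumps(result))
          ch.basic_ack(delivery_tag=method.delivery_tag)

      channel.basic_consume(queue="analyze", on_message_callback=handle)
      channel.start_consuming()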
  19. Text indexing
      - Quite a lot of EN is powered directly by text-style inverted-index search:
        - Artist similarity
        - Music search & metadata
        - Playlisting
        - Even audio fingerprinting
      - We use Solr quite extensively, with a lot of custom stuff on top
      - Backed by a key-value store (Tokyo Tyrant) for storage

      Index sizes:
        Artists        2,200,000 docs
        Songs         34,000,000 docs
        Documents    200,000,000 docs
        Metadata     344,000,000 docs
        Fingerprints 192,000,000 docs
  22. Why Solr & TT?
      - I get this question so many times
      - I've destroyed the following products with 1% of our data:
        - CouchDB
        - SimpleDB & S3
        - Voldemort
        - MySQL w/ special indexing
      - Test it first, let me know
  23. In practice: artist similarity
      - Artist similarity is a complicated query with boosts:
        - t_term:'swedish pop'^5.3 t_term:'dancing queen'^0.2 f_familiarity:[0.8 TO *] etc. (see the query sketch after the table)
      - Terms are indexed with term-frequency weights using a custom component
      - Should run in real time whenever possible, not cached:
        - things change
        - people change
      - This makes Solr a big sorting engine. It's not so bad at it!

      Table 4.2: Top 10 terms of various types for ABBA. The score is TF-IDF for adj (adjective), and gaussian-weighted TF-IDF for term types n2 (bigrams) and np (noun phrases).

        n2 (bigrams)             np (noun phrases)          adj (adjectives)
        dancing queen    0.0707  dancing queen      0.0875  perky           0.8157
        mamma mia        0.0622  mamma mia          0.0553  nonviolent      0.7178
        disco era        0.0346  benny              0.0399  swedish         0.2991
        winner takes     0.0307  chess              0.0390  international   0.2010
        chance on        0.0297  its chorus         0.0389  inner           0.1776
        swedish pop      0.0296  vous               0.0382  consistent      0.1508
        my my            0.0290  the invitations    0.0377  bitter          0.0871
        s enduring       0.0287  voulez             0.0377  classified      0.0735
        and gimme        0.0280  something's        0.0374  junior          0.0664
        enduring appeal  0.0280  priscilla          0.0369  produced        0.0616
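
  A sketch of what a boosted similarity query like the one above could look like from Python, assuming a stock Solr HTTP endpoint; the field names (t_term, f_familiarity) come from the slide, while the URL and helper function are made up.

      import requests

      SOLR_URL = "http://localhost:8983/solr/artists/select"  # assumed endpoint

      def similar_artists(weighted_terms, min_familiarity=0.8, rows=20):
          # Boost each crawled term by its weight, e.g. t_term:"swedish pop"^5.3
          clauses = ['t_term:"%s"^%.2f' % (term, w) for term, w in weighted_terms]
          clauses.append("f_familiarity:[%.1f TO *]" % min_familiarity)
          params = {"q": " ".join(clauses), "rows": rows, "wt": "json"}
          return requests.get(SOLR_URL, params=params).json()["response"]["docs"]

      # Terms & weights would come from a table like the ABBA one above.
      print(similar_artists([("swedish pop", 5.3), ("dancing queen", 0.2)]))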
  24. Solr: Sharding
      - Two types of data sharding:
        - Consistent hashing of data with replication (toy ring sketch below)
        - Full data replication on multiple nodes
      - Choice depends on data size & index complexity
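
  A toy consistent-hash ring, only to illustrate the first mode; the deck doesn't describe the actual implementation, and the node names are invented.

      import bisect
      import hashlib

      class Ring:
          def __init__(self, nodes, vnodes=64):
              # vnodes virtual points per node smooth out the distribution.
              self.ring = sorted(
                  (self._hash("%s:%d" % (node, i)), node)
                  for node in nodes for i in range(vnodes)
              )
              self.keys = [h for h, _ in self.ring]

          @staticmethod
          def _hash(key):
              return int(hashlib.md5(key.encode()).hexdigest(), 16)

          def node_for(self, doc_id):
              # First virtual node clockwise from the document's hash.
              i = bisect.bisect(self.keys, self._hash(doc_id)) % len(self.ring)
              return self.ring[i][1]

      ring = Ring(["solr1", "solr2", "solr3"])
      print(ring.node_for("ARH6W4X1187B99274F"))  # same doc -> same shard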
  25. Solr: Storage
      - Should you store data in a text index? Sometimes
      - As average document size goes up, Solr does better at retrieval, but that's an edge case
      - In general, if you don't need to (highlighting, etc.), don't
  26. Solr: Indexing (1)
      - Indexing large amounts of data is the biggest bottleneck we've had
      - We built a bulk Lucene indexing system (Flattery), sketched below:
        - All new or updated documents go into the key-value store immediately & onto a queue
        - Since all indexes run at least two copies of themselves we can:
          - Take down one
          - Index the documents out of the queue & from the KV store using embedded Solr
          - Then bring it back up and take down the other during rsyncing
      - This avoids query slowdowns at the expense of truly real-time indexing (but it's close)
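
  A compressed sketch of that rolling swap; Flattery is internal, so every class and helper below is a toy stand-in for the real queue, KV store and embedded-Solr pieces.

      from collections import deque

      class Replica:
          # Toy stand-in for one copy of an index with embedded Solr.
          def __init__(self, name):
              self.name, self.docs, self.in_rotation = name, [], True

          def index(self, doc):
              self.docs.append(doc)

      def rolling_reindex(replicas, update_queue, kv_store):
          pending = list(update_queue)
          for replica in replicas:             # every index runs >= 2 copies
              replica.in_rotation = False      # queries keep hitting the others
              for doc_id in pending:
                  replica.index(kv_store[doc_id])  # canonical doc lives in KV
              replica.in_rotation = True       # rsync, then swap back in
          update_queue.clear()

      kv = {"SO123": {"title": "Dancing Queen"}}
      rolling_reindex([Replica("a"), Replica("b")], deque(["SO123"]), kv)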
  28. Solr: Indexing (2)
      - Other simple things:
        - Only index what you need: date accuracy, stemming, etc.
        - Index should be in RAM or on fast SSD as much as possible
        - We use Fusion-io and slower SSDs as storage
        - Don't use the grouping feature
        - EC2, EBS: no.
  29. User models
      - Taste profiles: users can send us anything about their music preference: plays, skips, collection contents
      - We currently track millions of individuals' listening habits via our partners & API
      - 48 shards of SQL after a resolving step (routing sketch below):
        - Async resolving workers
      - A Solr index as well, for similarity (as artists / songs)
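
  The deck gives the shard count but not the routing scheme; one obvious stable mapping from a resolved user ID to one of the 48 SQL shards looks like this (the hash choice is an assumption).

      import zlib

      N_SHARDS = 48

      def shard_for(user_id):
          # Stable hash: the same user always resolves to the same SQL shard.
          return zlib.crc32(user_id.encode()) % N_SHARDS

      print(shard_for("user-8675309"))  # e.g. 23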
  30. User models
      - Users can search within a profile or use a profile as a seed (recommendation)
      - How can you quickly "cut" a huge index down to a user model?
        - The model changes frequently & we can't have stale info
        - Can't index the model into Solr for perf reasons
      - We built Polarized, a custom Lucene / Solr filter component (open source!):
        - Reads a set from an external data source (SQL) from an ID passed in at query time
        - Builds an in-memory bitset at query time to filter results to Lucene docIDs (illustrated below)
        - Very useful
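
  Polarized itself is a Java Lucene/Solr component; this Python toy only illustrates the core move: fetch the user's ID set from SQL at query time, turn it into a bitset over Lucene docIDs, and mask search hits with it.

      def build_filter_bitset(user_song_ids, docid_by_song):
          bits = 0
          for song_id in user_song_ids:      # set read from SQL at query time
              doc = docid_by_song.get(song_id)
              if doc is not None:
                  bits |= 1 << doc           # one bit per Lucene docID
          return bits

      def filter_hits(scored_docids, bits):
          # Keep only hits whose docID bit is set for this user.
          return [d for d in scored_docids if bits >> d & 1]

      docid_by_song = {"SO1": 0, "SO2": 1, "SO3": 2}
      bits = build_filter_bitset({"SO1", "SO3"}, docid_by_song)
      print(filter_hits([2, 1, 0], bits))  # -> [2, 0]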
  31. API
      - Hundreds of music queries a second hit our stack
      - Mix of enterprise customers & indie developers
      - Mostly on-the-fly compute queries
      - Some data cached through a CDN for some customers
      - All compute in Python
  32. API
      - Front end: nginx & Tornado (minimal frontend sketch below)
      - Worker model:
        - Workers do CPU compute:
          - FP matching
          - Playlist generation
        - Specialized workers for:
          - Profile resolution
          - Audio analysis
        - Specialized workers use MQ
      - Python threading during compute is the biggest killer (the GIL)

      [Architecture diagram: developer.echonest.com (LB) → nginx + Tornado API frontend (incl. track-upload Tornado) → worker pool (LB) of Apache + mod_wsgi PyAPI threads with Memcache, plus specialized Python catalog & analyze workers fed by RabbitMQ queues]
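
  A minimal sketch of the frontend tier only, using Tornado; the worker handoff is faked with a stub function where the real stack load-balances to Apache + mod_wsgi workers (and RabbitMQ for the specialized ones). The route and names are invented.

      import tornado.ioloop
      import tornado.web

      def run_on_worker(artist_id):
          # Stub for the worker tier; keeps heavy compute off the IO loop.
          return {"artist": artist_id, "similar": ["AR123", "AR456"]}

      class SimilarHandler(tornado.web.RequestHandler):
          def get(self):
              self.write(run_on_worker(self.get_argument("id")))

      app = tornado.web.Application([(r"/artist/similar", SimilarHandler)])

      if __name__ == "__main__":
          app.listen(8888)
          tornado.ioloop.IOLoop.current().start()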
  33. API ops
      - Graphite
      - Everything logged
      - Fabric for deployment
      - ~1 deploy / day, new features & bugs
  34. Audio fingerprinting
      - Echoprint & ENMFP
      - We identify 75 songs a second via our FPs
      - >20M songs, >100M tracks in the lookup DB (& people can add to it!)
      - Internal use & APIs
      - Both are open to a degree; Echoprint is fully open source end to end
  35. Fingerprinting process
      - All open source, if you want to look: http://echoprint.me/

      [Pipeline diagram: File → Codegen (mobile or server) → 1000s of 20-bit code / time pairs in hash space → Lookup API → matching index + storage]
  36. Fingerprinting indexing & search
      - We use Solr as the search backend:
        - Treat the 20-bit codes as tokens in the index
        - The query finds the top N docs with the most overlap
      - A Python process in the worker then does "post-filtering" (sketched below):
        - Very CPU-intensive, a dynamic programming problem
        - Takes a list of ~20 contenders from Solr and decides whether there is a real match
      - Index: sharded by song duration
        - Queries hit 1 or 2 of 16 shards if we know the duration
        - For Echoprint, songs are indexed in 60s pieces to match the right spot

      [Diagram: inverted index from hash codes (e.g. 123984, 6940, 21823) to (track, time offset) postings such as TR5904 @ 50ms, TR1283 @ 840ms]
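
  The slide frames post-filtering as a dynamic programming problem; this toy swaps in a simpler time-offset voting scheme just to show the flavor: matching codes must agree on where they sit in time, not merely overlap.

      from collections import Counter

      def best_match(query_codes, candidates, min_votes=10):
          # query_codes and each candidate: list of (20-bit code, time) pairs.
          best = None
          for track_id, track_codes in candidates.items():
              by_code = {}
              for code, t in track_codes:
                  by_code.setdefault(code, []).append(t)
              # Vote on the relative offset between query and track times.
              offsets = Counter(
                  t_track - t_query
                  for code, t_query in query_codes
                  for t_track in by_code.get(code, ())
              )
              votes = max(offsets.values(), default=0)
              if votes >= min_votes and (best is None or votes > best[1]):
                  best = (track_id, votes)
          return best  # (track_id, votes), or None if nothing lines up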
  37. Fingerprinting performance
      - People want FP results fast
      - But compute is a serious bottleneck
      - At 50 requests a second per server, we aim for no more than 1% of lookups slower than 500ms

      [Chart: lookup latency in ms (0-900) at the average, median, and slowest 1% / 5% / 20% percentiles, Echoprint vs ENMFP]
  38. What's next
      - We keep annoying our engineers
      - Our next feature release is more live compute