Slide 1

Music data & music search at insane scale at The Echo Nest
Brian Whitman, co-founder & CTO, The Echo Nest
@bwhitman

Slide 2

musician (1997-2004)

Slide 3

musician (1997-2004) academic (1999-2005)

Slide 4

musician (1997-2004) academic (1999-2005) The Echo Nest (2005-)

Slide 5

The world of music data
- We know about almost 3 million artists
- 35 million unique songs
- Hundreds of millions of recordings
- We estimate 100 MB of data unpacks from each song
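Back of the envelope, assuming that 100 MB figure applies to each of the 35 million unique songs:

    35,000,000 songs × 100 MB ≈ 3,500,000 GB ≈ 3.5 PB of unpacked analysis data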

Slide 6

The future music platform & APIs
- APIs have transformed the music industry
- New music experiences
- Content now available via Spotify or sandboxes

Slide 7

The Echo Nest
- We power most music experiences you have on the internet
- Recommendation, search, analytics, audio analysis, fingerprinting, playlists, radio
- You’ve never heard of us, that’s cool

Slide 8

Today’s talk
- Real time music services - the problem space in general
- Doing “weird” stuff at scale
- Operations & scale challenges on a large music data service:
  - Acoustic analysis on hundreds of millions of songs
  - Specific challenges with search (Solr) at scale & with user models
  - Compute heavy API serving
  - Fingerprinting
- What’s next

Slide 9

Music analysis
- How does a computer listen to music anyway?
- We do “cultural” (web crawling) & “acoustic” (DSP) analysis on all music ever
- We’ve analyzed hundreds of millions of songs & billions of documents about artists

Slide 13

Care & scale
- We try to make our results as useful as possible
- We do a very weird kind of recommendation technology
- It works well but has some bad performance & ops characteristics

Slide 16

Music similarity & playlisting
- Streaming radio experiences for many clients
- We serve hundreds of API calls a second
- Many involve a user model per query

Slide 17

Music resolving
- Audio resolving (fingerprinting) - Echoprint & ENMFP
- Text resolving - artist/extract
- ID space resolving - Rosetta Stone

Slide 18

Taste profiles
- Server side representation of everything someone does with music
- Preferences, play logging, skips, bans, favorites & other metadata
- Predictive power -- increase play time, better recommendations, ads & demo stuff

Slide 19

Basic overview of the data processing stack:
- Data in (crawling, partners)
- SQL store for relational data
- Key-value store for "storables"
- Columnstore for metadata
- Various "fake map reduce" compute processes
- Solr
- Key-value store
- API

Slide 20

Acoustic analysis
- All audio from partners & API users goes through an analysis
- Heavily optimized C++, ~1s per 1 minute of audio
- Divides the song into segments, each 0.1s - 4s:
  - Timbre
  - Pitch
  - Loudness
- Beat tracking
  - Measures
  - Downbeat
- Song level features:
  - Key, mode
  - Energy, danceability, etc
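To make that concrete, here is a rough sketch in Python of the shape a single song's analysis might take; the field names are illustrative assumptions, not the exact Echo Nest schema.

    # Illustrative shape of per-song analysis output; field names are assumptions.
    analysis = {
        "track": {
            "key": 9, "mode": 1,                  # song-level key & mode
            "tempo": 117.2,
            "energy": 0.83, "danceability": 0.71,
        },
        "beats": [{"start": 0.46, "duration": 0.51}],   # beat tracking
        "bars":  [{"start": 0.46, "duration": 2.05}],   # measures / downbeats
        "segments": [
            {                                      # one of thousands, each 0.1s - 4s long
                "start": 0.0, "duration": 0.32,
                "loudness_max": -11.4,
                "pitches": [0.0] * 12,             # 12-bin pitch (chroma) vector
                "timbre":  [0.0] * 12,             # 12-dimensional timbre coefficients
            },
        ],
    }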

Slide 21

Acoustic analysis

Slide 22

Acoustic analysis
- Transcoding-style async job
- Output of analysis is a ~1MB JSON block for an average song
- RabbitMQ for analysis queues
- Workers run on EC2
- Raw output goes on S3
- Workers pull together indexable data
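A minimal sketch of one such worker, assuming a hypothetical run_analyzer() wrapper around the C++ analyzer; boto3 stands in for whatever S3 client the stack actually used, and the bucket name is made up.

    import json
    import boto3   # generic S3 client for the sketch; the original stack predates boto3

    def run_analyzer(audio_path):
        """Hypothetical wrapper around the optimized C++ analyzer (~1s per minute of audio)."""
        raise NotImplementedError

    def handle_job(job, s3, bucket="analysis-raw"):
        """Process one job pulled off the RabbitMQ analysis queue."""
        analysis = run_analyzer(job["audio_path"])
        raw = json.dumps(analysis)                 # ~1 MB of JSON for an average song
        s3.put_object(Bucket=bucket,
                      Key=job["track_id"] + ".json",
                      Body=raw.encode("utf-8"))
        # the worker then pulls together the fields that actually get indexed
        return {"track_id": job["track_id"],
                "key": analysis["track"]["key"],
                "tempo": analysis["track"]["tempo"]}

    # usage: handle_job({"track_id": "TR12345", "audio_path": "/tmp/a.mp3"}, boto3.client("s3"))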

Slide 23

Text indexing
- Quite a lot of EN is powered directly by text-style inverted index search:
  - Artist similarity
  - Music search & metadata
  - Playlisting
  - Even audio fingerprinting
- We use Solr quite extensively with a lot of custom stuff on top
- Backed by a key-value store (Tokyo Tyrant) for storage
- Index sizes: Artists / Songs / Documents / Metadata / Fingerprints; 192,000,000 / 344,000,000 / 200,000,000 / 34,000,000 / 2,200,000 docs
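The search-then-fetch split looks roughly like this; the Solr host, core and field names are made up, requests stands in for whatever Solr client is actually used, and Tokyo Tyrant sits behind a generic dict-like get().

    import requests

    SOLR = "http://solr-host:8983/solr/artists/select"   # host & core name are made up

    def search_ids(query, rows=10):
        """Hit the standard Solr select handler and return just the matching document IDs."""
        params = {"q": query, "fl": "id", "rows": rows, "wt": "json"}
        resp = requests.get(SOLR, params=params).json()
        return [doc["id"] for doc in resp["response"]["docs"]]

    def fetch(kv, ids):
        """kv: any dict-like client over the key-value store (Tokyo Tyrant in the talk)."""
        return [kv.get(i) for i in ids]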

Slide 24

Why Solr & TT?
- I get this question so many times

Slide 25

Why Solr & TT?
- I get this question so many times
- I’ve destroyed the following products with 1% of our data:
  - CouchDB
  - SimpleDB & S3
  - Voldemort
  - MySQL w/ special indexing

Slide 27

Why Solr & TT?
- I get this question so many times
- I’ve destroyed the following products with 1% of our data:
  - CouchDB
  - SimpleDB & S3
  - Voldemort
  - MySQL w/ special indexing
- Test it first, let me know

Slide 28

In practice: artist similarity

Slide 29

In practice: artist similarity
- Artist similarity is a complicated query with boosts (sketched below):
  - t_term:‘swedish pop’^5.3 t_term:‘dancing queen’^0.2 f_familiarity:[0.8 TO *] etc
- Terms are indexed with term frequency weights using a custom component
- Should run in real time whenever possible, not cached
  - things change
  - people change
- This makes Solr a big sorting engine. It’s not so bad at it!

Table 4.2: Top 10 terms of various types for ABBA. The score is TF-IDF for adjectives, and gaussian weighted TF-IDF for term types n2 (bigrams) and np (noun phrases).

    n2 (bigram)        score    np (noun phrase)   score    adjective        score
    dancing queen      0.0707   dancing queen      0.0875   perky            0.8157
    mamma mia          0.0622   mamma mia          0.0553   nonviolent       0.7178
    disco era          0.0346   benny              0.0399   swedish          0.2991
    winner takes       0.0307   chess              0.0390   international    0.2010
    chance on          0.0297   its chorus         0.0389   inner            0.1776
    swedish pop        0.0296   vous               0.0382   consistent       0.1508
    my my              0.0290   the invitations    0.0377   bitter           0.0871
    s enduring         0.0287   voulez             0.0377   classified       0.0735
    and gimme          0.0280   something’s        0.0374   junior           0.0664
    enduring appeal    0.0280   priscilla          0.0369   produced         0.0616
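A sketch of assembling that kind of boosted query from an artist's weighted terms and posting it to Solr; the field names t_term and f_familiarity come from the slide, while the host, core name and weights are illustrative.

    import requests

    def similarity_query(terms, min_familiarity=0.8):
        """terms: list of (term, weight) pairs, e.g. [('swedish pop', 5.3), ('dancing queen', 0.2)]."""
        clauses = ['t_term:"%s"^%.2f' % (term, weight) for term, weight in terms]
        clauses.append("f_familiarity:[%s TO *]" % min_familiarity)
        return " ".join(clauses)

    q = similarity_query([("swedish pop", 5.3), ("dancing queen", 0.2)])
    resp = requests.get("http://solr-host:8983/solr/artists/select",   # made-up host & core
                        params={"q": q, "wt": "json", "rows": 50}).json()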

Slide 30

Solr: Sharding
- Two types of data sharding
  - Consistent hashing of data with replication
  - Full data replication on multiple nodes
  - Choice depends on data size & index complexity
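For the consistent-hashing option, a minimal ring sketch in Python (not The Echo Nest's actual sharding code): each document lands on the first node at or after its hash, plus a replica on the next node around the ring.

    import bisect
    import hashlib

    class Ring(object):
        """Tiny consistent-hash ring with simple successor replication."""
        def __init__(self, nodes, replicas=2):
            self.replicas = replicas
            self.ring = sorted((self._hash(n), n) for n in nodes)
            self.hashes = [h for h, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

        def nodes_for(self, doc_id):
            """Primary node plus (replicas - 1) successors, wrapping around the ring."""
            i = bisect.bisect(self.hashes, self._hash(doc_id)) % len(self.ring)
            return [self.ring[(i + k) % len(self.ring)][1] for k in range(self.replicas)]

    ring = Ring(["solr1", "solr2", "solr3", "solr4"])
    print(ring.nodes_for("AR12345"))   # made-up artist ID -> two of the four nodes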

Slide 31

Solr: Storage
- Should you store data in a text index?
  - Sometimes
- As average data size goes up, Solr does better at retrieval - but that’s an edge case
- In general, if you don’t need it (highlighting, etc.), don’t

Slide 32

Solr: Indexing (1)
- Indexing large amounts of data is the biggest bottleneck we’ve had
- We built a bulk Lucene indexing system (Flattery)
  - All new or updated documents go into the key value store immediately & onto a queue
  - Since all indexes run at least two copies of themselves we can (see the sketch below):
    - Take down one
    - Index the documents out of the queue & from the KV store using embedded Solr
    - Then bring it back up and take down the other during rsyncing
- This avoids query slowdowns at the expense of truly real-time indexing (but it’s close)
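A sketch of that cycle; every function here is a hypothetical stand-in for real ops tooling, and the only point is the ordering that keeps one replica serving queries at all times.

    # All helpers are hypothetical stand-ins; prints mark where real ops tooling would run.
    def remove_from_rotation(node): print("draining", node)
    def add_to_rotation(node):      print("serving", node)
    def drain_update_queue():       return []   # new/updated docs also live in the KV store
    def bulk_index(node, docs):     print("embedded-Solr indexing %d docs on %s" % (len(docs), node))
    def rsync_index(src, dst):      print("rsync", src, "->", dst)

    def flattery_cycle(replica_a, replica_b):
        remove_from_rotation(replica_a)            # replica_b keeps answering queries
        bulk_index(replica_a, drain_update_queue())
        add_to_rotation(replica_a)
        remove_from_rotation(replica_b)            # take the other copy down only while rsyncing
        rsync_index(replica_a, replica_b)
        add_to_rotation(replica_b)

    flattery_cycle("solr1a", "solr1b")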

Slide 33

Solr: Indexing (2)
- Other simple things:
  - Only index what you need: date accuracy, stemming, etc
  - Index should be on RAM or fast SSD as much as possible
  - We use Fusion-IO and slower SSDs as storage
  - Don’t use the grouping feature

Slide 34

Solr: Indexing (2)
- Other simple things:
  - Only index what you need: date accuracy, stemming, etc
  - Index should be on RAM or fast SSD as much as possible
  - We use Fusion-IO and slower SSDs as storage
  - Don’t use the grouping feature
  - EC2, EBS, no.

Slide 35

User models
- Taste profiles: users can send us anything about their music preference - plays, skips, collection contents
- We currently track millions of individuals’ listening habits via our partners & API
- 48 shards of SQL after a resolving step
  - Async resolving workers
- Solr index as well for similarity (as artists / songs)
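Routing a resolved taste-profile update to one of those 48 SQL shards can be as simple as a stable hash of the profile ID; a sketch, not necessarily how the shards are actually picked.

    import hashlib

    NUM_SHARDS = 48

    def shard_for(profile_id):
        """Stable shard assignment so the same profile always lands on the same SQL box."""
        digest = hashlib.md5(profile_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    print(shard_for("user-12345"))   # some value in 0..47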

Slide 36

User models

Slide 37

User models
- Users can search within a profile or use a profile as a seed (recommendation)
- How can you quickly “cut” a huge index to a user model?
  - Model changes frequently & we can’t have stale info
  - Can’t index the model into Solr for perf reasons
- We built Polarized: a custom Lucene / Solr filter component - open source!
  - Reads a set from an external data source (SQL) from an ID passed in at query time
  - Builds an in-memory bitset at query time to filter results to Lucene docIDs
  - Very useful (idea sketched below)
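Polarized itself is a Lucene/Solr component in Java, but the idea fits in a few lines of Python: pull the user's item IDs out of SQL at query time, build an in-memory set, and let only matching documents through. The table and column names below are illustrative, and sqlite3 stands in for the real SQL store.

    import sqlite3   # stand-in for the sharded SQL store

    def load_profile_ids(conn, profile_id):
        """Fetch the user's item IDs at query time -- never stale, never indexed into Solr."""
        rows = conn.execute(
            "SELECT item_id FROM taste_profile_items WHERE profile_id = ?", (profile_id,))
        return {r[0] for r in rows}          # the in-memory "bitset"

    def filter_results(solr_docs, allowed_ids):
        """Keep only the Solr hits that fall inside the user's profile."""
        return [doc for doc in solr_docs if doc["id"] in allowed_ids]

    # usage sketch:
    # conn = sqlite3.connect("shard_17.db")                  # made-up shard
    # allowed = load_profile_ids(conn, "user-12345")
    # hits = filter_results(solr_response["response"]["docs"], allowed)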

Slide 38

API
- Hundreds of music queries a second hit our stack
- Mix of enterprise customers & indie developers
- Most are on-the-fly compute queries
- Some data cached through a CDN for some customers
- All compute in Python

Slide 39

API
- Front end: nginx & Tornado
- Worker model
  - Workers do CPU compute:
    - FP matching
    - Playlist generation
  - Specialized workers for:
    - Profile resolution
    - Audio analysis
  - Specialized workers use MQ
- Python threading during compute is the biggest killer

(Architecture diagram: developer.echonest.com load balancer -> API frontend (nginx in front of Tornado handlers, including track upload) -> load-balanced worker pool (Apache + mod_wsgi running PyAPI threads, with memcache), plus catalog workers and analyze workers fed by RabbitMQ queues)
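A minimal sketch of the frontend half, assuming a hypothetical dispatch_to_worker() that puts the job on the queue and waits for the worker pool's reply; the handler and URL are made up.

    import tornado.ioloop
    import tornado.web

    def dispatch_to_worker(method, params):
        """Hypothetical: enqueue the job for the CPU-heavy worker pool and block for its reply."""
        raise NotImplementedError

    class PlaylistHandler(tornado.web.RequestHandler):
        def get(self):
            # the frontend stays thin: parse the request, dispatch, serialize the reply
            result = dispatch_to_worker("playlist", dict(self.request.arguments))
            self.write(result)

    app = tornado.web.Application([(r"/api/playlist", PlaylistHandler)])

    if __name__ == "__main__":
        app.listen(8080)
        tornado.ioloop.IOLoop.current().start()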

Slide 40

API ops
- Graphite
- Everything logged
- Fabric for deployment
- ~1 deploy / day, new features & bug fixes

Slide 41

Audio fingerprinting
- Echoprint & ENMFP
- We identify 75 songs a second via our FPs
- >20m songs, >100m tracks in the lookup DB (& people can add to it!)
- Internal use & APIs
- Both are open to a degree, Echoprint fully open source end to end

Slide 42

Fingerprinting process
- (All open source! if you want to look)
- http://echoprint.me/

(Pipeline diagram: File -> Codegen (mobile or server) -> Lookup API -> matching index & storage; the hash space is 1000s of 20-bit code / time pairs)
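Client side, the flow is: run the codegen over the file, then POST the resulting codes to the lookup API. A sketch assuming the open-source echoprint-codegen binary is on the PATH and a made-up lookup URL; the exact output fields may differ.

    import json
    import subprocess
    import requests

    def fingerprint(path, start=10, duration=30):
        """Run the open-source codegen over a slice of the file; it prints JSON on stdout."""
        out = subprocess.check_output(["echoprint-codegen", path, str(start), str(duration)])
        return json.loads(out)[0]                        # one entry per input file

    def lookup(fp, url="http://fp.example.com/query"):   # made-up endpoint
        # fp["code"] carries the packed code / time pairs the matching index works on
        return requests.post(url, data={"code": fp["code"]}).json()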

Slide 43

Fingerprinting indexing & search
- We use Solr as the search backend
  - treat the 20-bit codes as tokens in the index
  - the query finds the top N docs with the most overlap
- A Python process in the worker then does “post filtering” (sketched below)
  - very CPU intensive, dynamic programming problem
  - takes a list of ~20 contenders from Solr and finds if there is a real match
- Index: sharded by song duration
  - queries hit 1 or 2 of 16 shards if we know the duration
  - For Echoprint, songs are indexed in 60s pieces to match the right spot

(Diagram: inverted index entries, e.g. codes 123984, 6940, 21823, pointing at (track, time) postings such as TR5904 @ 50ms, TR1283 @ 840ms, TR1293 @ 680ms, TR1283 @ 120ms, TR7348 @ 860ms, TR5909 @ 650ms)
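A rough sketch of both halves: the Solr query is just the code tokens joined into one big query over the code field, and the post-filter below is a simplified time-offset histogram rather than the real dynamic-programming pass. Field names, host and core are made up.

    import collections
    import requests

    def solr_candidates(codes, rows=20):
        """codes: the 20-bit codes from the query fingerprint, treated as index tokens."""
        q = "codes:(%s)" % " ".join(str(c) for c in codes)            # field name is made up
        resp = requests.get("http://solr-host:8983/solr/fp/select",   # made-up host & core
                            params={"q": q, "wt": "json", "rows": rows}).json()
        return resp["response"]["docs"]

    def best_match(query_pairs, candidates, min_aligned=20):
        """query_pairs: (code, time) pairs from the query. A real match should have many
        codes whose time offsets agree, so histogram matches by time delta per contender."""
        best = None
        for doc in candidates:
            index_times = dict(doc["code_times"])     # assumed stored (code, time) pairs
            deltas = collections.Counter(
                round(t - index_times[c], 1) for c, t in query_pairs if c in index_times)
            aligned = max(deltas.values()) if deltas else 0
            if aligned >= min_aligned and (best is None or aligned > best[1]):
                best = (doc["track_id"], aligned)
        return best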

Slide 44

Fingerprinting performance
- People want FP results fast
- But compute is a serious bottleneck
- At 50 rps per server, we aim for 1% being slower than 500ms

(Chart: lookup time in ms, 0 to 900 scale, for the Average, Median, 1%, 5% and 20% buckets, Echoprint vs ENMFP)

Slide 45

What’s next
- We keep annoying our engineers
- Our next feature release is more live compute

Slide 46

What’s next
- Singular value decomposition as-a-service
- Live machine learning evaluation on taste profiles

Slide 47

Thank you!
- email: [email protected]
- Thanks: Tyler Williams, Hui Cao, Aaron Daubman & john!