A social music service, powered by ElasticSearch

ElasticSearch is one of the main technologies behind serendip.me, the social music service. It enables the application's main features: generating the music feed and providing user recommendations.
You can find some more details about the technology behind serendip in this blog post: http://rore.im/posts/building-serendip/

Rotem Hermon

July 28, 2014

Transcript

  1. A social music service. Powered by ElasticSearch Rotem Hermon

  2. serendip.me

  3. serendip.me Feed

  4. serendip.me Recommendations

  5. serendip.me Stations

  6. • Scala (and Java) • akka • Play (web app and API) • MongoDB Stack
  7. • Scala (and Java) • akka • Play (web app and API) • MongoDB • ElasticSearch Stack
  8. Data In

  9. Twitter API Facebook API URL Expander Music Service Filters Importer Meta Data Enrichers ElasticSearch The Pump
  10. Twitter API Facebook API URL Expander Music Service Filters Importer Meta Data Enrichers ElasticSearch The Pump akka actors
  11. • In: 5M Items/day • Valid: 850,000 (containing music links) • → ~40 Index op/sec The Pump
  12. The Model

  13. Post An item containing a music link. { "postid" : "0972cd80-01bb-11e4-b21c-123136519c3", "network" : "serendip", "postDate" : "2014-07-01T01:00:04.000Z", "txt" : "#airing \"Door Gunner\"Performed by Herb Hutchinson Written by Jeffrey Deitelbaum #rockradio ROCK INSTRUMENTAL http://srndp.me/ahMFkTQ", "lang" : "en", "uid" : ["tw_...", "fb_...", "sd_..."], "service" : "serendip", "clip" : ["yt_at1kaxrmOR8"], ... }
  14. • ~25M Posts/month • Data continuously increasing: Using monthly indexes! • Searches are always within a time frame: Search only on the needed indexes! (e.g. posts-514, posts-614, posts-714) Post
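
The monthly-index idea on slide 14 can be sketched as a small helper that turns a search time frame into the list of indexes to query. This is only an illustration: the function name `monthly_indexes` is hypothetical, and the naming scheme (month number followed by a two-digit year) is inferred from the three example names on the slide, not confirmed by the talk.

```python
from datetime import date


def monthly_indexes(start, end, prefix="posts"):
    """Return one index name per calendar month from start to end, inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # Assumed naming: month number, then two-digit year (5/2014 -> posts-514).
        names.append(f"{prefix}-{month}{year % 100:02d}")
        # Step to the next calendar month, rolling over at December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


# A search covering late May to early July 2014 touches three indexes.
print(monthly_indexes(date(2014, 5, 20), date(2014, 7, 1)))
# → ['posts-514', 'posts-614', 'posts-714']
```

Elasticsearch accepts a comma-separated list of index names in a search request, so the result can be joined with `",".join(...)` to search only the months that matter.
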
  15. User A social network user (Facebook/Twitter/Serendip) { "network" : "serendip", "id" : "4dd0e2775c6b09a536aee1ab", "name" : "Rotem Hermon", "dsc" : "Non-social media amateur", "country" : "Israel", "city" : "Tel Aviv Yaffo", "connectedAccounts" : ["tw_...", "fb_..."], "lastUpdate" : "2014-06-30T09:00:00.000Z", "postCount" : 710, "rockOnCount" : 249, "reairCount" : 93 }
  16. • A single index • Limiting index size: Scheduled cleanup of inactive users User
  17. • Metadata for a music clip (url, artist, album, genre etc.) • Not indexed in ES Clip
  18. The Feed

  19. • Requirements: ◦ Combine music from several sources (friends, preferred artists, recommendations) ◦ Reactive to user input and actions. This means generating the feed in real-time. The Feed
  20. • A collection of “strategies” (e.g. “friends”, “preferred artists”, “suggested users”) ◦ A strategy considers most recent user actions • Strategies are dynamically combined in every feed fetch ◦ This translates to searches on posts in Elasticsearch The Feed Algorithm
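
One way to picture "strategies combined into a search" is to have each strategy produce a query clause and merge the clauses into a single `bool` query. The sketch below is an assumption about how this could look, not the actual serendip code: the function names are hypothetical, the `artist` field is a made-up name for denormalized clip data (`uid` and `postDate` do appear in the post document on slide 13), and the body uses the current Elasticsearch query DSL rather than the 2014-era syntax.

```python
import json


def friends_strategy(friend_ids):
    # Posts shared by users the listener follows ("uid" is a field of the post).
    return {"terms": {"uid": friend_ids}}


def preferred_artists_strategy(artists):
    # Posts whose clip artist is one the listener likes.
    # "artist" is a hypothetical name for the denormalized clip field.
    return {"terms": {"artist": artists}}


def build_feed_query(strategies, since):
    """Combine strategy clauses: a post qualifies if any strategy matches it."""
    return {
        "query": {
            "bool": {
                "should": strategies,
                "minimum_should_match": 1,
                # Restrict to the requested time frame without affecting scoring.
                "filter": [{"range": {"postDate": {"gte": since}}}],
            }
        }
    }


query = build_feed_query(
    [friends_strategy(["tw_...", "fb_..."]),
     preferred_artists_strategy(["Herb Hutchinson"])],
    since="2014-07-01T00:00:00.000Z",
)
print(json.dumps(query, indent=2))
```

Because each strategy is just a clause, the set of strategies can change on every feed fetch without changing how the search is executed.
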
  21. The key - The Feed Algorithm

  22. The key - Denormalization! The Feed Algorithm

  23. • A post is indexed with needed data from other objects: ◦ User details (e.g. location) ◦ Clip metadata (artist, genre, description, language) • So all required data for a strategy search is contained in the posts index • Cons: space (data is duplicated), integrity (data may not be recent) The Feed Algorithm
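
As a rough illustration of the denormalization on slide 23, the sketch below copies the user and clip fields a strategy might search on into the post before it is indexed. The function name and the output field names (`userCountry`, `userCity`, `artist`, `genre`) are hypothetical; only `country` and `city` on the user, and artist and genre as clip metadata, come from the slides.

```python
def denormalize_post(post, user, clip):
    """Return a copy of the post enriched with user and clip data for indexing."""
    doc = dict(post)  # shallow copy so the original post is left untouched
    # User details (slide 15 shows "country" and "city" on the user document).
    doc["userCountry"] = user.get("country")
    doc["userCity"] = user.get("city")
    # Clip metadata (slide 17 lists artist and genre; key names are assumed).
    doc["artist"] = clip.get("artist")
    doc["genre"] = clip.get("genre")
    return doc


post = {"postid": "p1", "uid": ["tw_..."], "clip": ["yt_at1kaxrmOR8"]}
user = {"country": "Israel", "city": "Tel Aviv Yaffo"}
clip = {"artist": "Herb Hutchinson", "genre": "rock"}
print(denormalize_post(post, user, clip))
```

The trade-off named on the slide is visible here: the copied values are frozen at index time, so a later change to the user or the clip is not reflected until the post is reindexed.
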
  24. Same for creating “stations” by artists or genre: All required data is indexed under the post. The Feed Algorithm
  25. Recommendations

  26. • “Music Soulmates” - find users with matching musical taste • Common solutions - using machine learning, Hadoop, M/R jobs • We’re a small startup. We already have enough systems on our plate • Can we do it with the existing system? Recommendations
  27. The key - Recommendations

  28. The key - Prepare the data in advance! Recommendations

  29. • Data preparation: ◦ When importing posts, constantly calculate top shared artists for users ◦ Top artists are found using faceted search on posts shared by the user ◦ Mark “spammers” (e.g. a lot of shares of only a single artist) Recommendations
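
The data preparation on slide 29 boils down to counting artists per user. In the talk the counting is done inside Elasticsearch with a faceted search (a terms aggregation in later versions); the sketch below reproduces the same logic locally on a list of posts so it can be run without a cluster. The function names, the `artist` key, and the spam thresholds are all assumptions made for illustration.

```python
from collections import Counter


def artist_counts(user_posts):
    # Count how often each artist appears in the user's shared posts.
    return Counter(p["artist"] for p in user_posts if p.get("artist"))


def top_artists(user_posts, size=10):
    """Return the user's most shared artists, most frequent first."""
    return [artist for artist, _ in artist_counts(user_posts).most_common(size)]


def looks_like_spammer(user_posts, min_posts=20, max_share=0.9):
    """Flag users whose shares are dominated by a single artist.

    Both thresholds are hypothetical; the slides give no concrete values.
    """
    counts = artist_counts(user_posts)
    total = sum(counts.values())
    if total == 0 or total < min_posts:
        return False  # too little data to judge
    top_count = counts.most_common(1)[0][1]
    return top_count / total >= max_share


posts = [{"artist": "A"}] * 3 + [{"artist": "B"}] * 2 + [{"artist": "C"}]
print(top_artists(posts, size=2))
print(looks_like_spammer([{"artist": "A"}] * 30))
```
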
  30. • Finding “Music Soulmates”: ◦ Search for users with matching “top artists” ◦ Use scoring to surface users with most matches ◦ Use boosting to tweak the results (e.g. prefer users from the same country, active users, recent activity) Recommendations
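
The soulmate search on slide 30 could be expressed as a `bool` query over the users index: one clause per top artist, so that users matching more artists tend to score higher, plus an optional boosted clause for the same country. This is a sketch under assumptions, not the query used by serendip: `soulmates_query` and the `topArtists` field are hypothetical names (`country` appears in the user document on slide 15), the boost value is arbitrary, and the `term` clauses assume both fields are indexed as exact, non-analyzed values.

```python
import json


def soulmates_query(top_artists, country=None, size=20):
    """Build a search body that ranks users by overlap in top artists."""
    # One clause per artist: each match adds to the user's score.
    artist_clauses = [{"term": {"topArtists": a}} for a in top_artists]
    # Optional clauses only influence ranking, never who matches.
    boosts = []
    if country:
        boosts.append({"term": {"country": {"value": country, "boost": 2.0}}})
    return {
        "size": size,
        "query": {
            "bool": {
                # A candidate must share at least one top artist.
                "must": [{"bool": {"should": artist_clauses,
                                   "minimum_should_match": 1}}],
                "should": boosts,
            }
        },
    }


print(json.dumps(soulmates_query(["A", "B", "C"], country="Israel"), indent=2))
```

Keeping the artist clauses under `must` and the country clause under `should` means location only reorders candidates; a user from the same country who shares no artists is not returned.
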
  31. Scaling

  32. • Current setup: ◦ 4 × m2.2xlarge ◦ Most CPU goes to pump imports (indexing + facets) • Scaling: ◦ More nodes, bigger nodes, IOPS optimization ◦ Indexing optimizations (use parent-child for frequently updated fields) Scaling
  33. Questions? Thank you!