A social music service, powered by ElasticSearch

ElasticSearch is one of the main technologies behind serendip.me, the social music service. It enables the application's main features: generating the music feed and providing user recommendations.
You can find some more details about the technology behind serendip in this blog post: http://rore.im/posts/building-serendip/

Rotem Hermon

July 28, 2014

Transcript

  1. A social music service.
    Powered by ElasticSearch
    Rotem Hermon

  2. serendip.me

  3. serendip.me
    Feed

  4. serendip.me
    Recommendations

  5. serendip.me
    Stations

  6. Stack
    ● Scala (and Java)
    ● akka
    ● Play (web app and API)
    ● MongoDB

  7. Stack
    ● Scala (and Java)
    ● akka
    ● Play (web app and API)
    ● MongoDB
    ● ElasticSearch

  8. Data In

  9. The Pump
    [Diagram] Components: Twitter API, Facebook API, Importer, URL Expander,
    Music Service Filters, Meta Data Enrichers, ElasticSearch

  10. The Pump
    [Diagram] Components: Twitter API, Facebook API, Importer, URL Expander,
    Music Service Filters, Meta Data Enrichers, ElasticSearch
    akka actors

  11. The Pump
    ● In: 5M items/day
    ● Valid: 850,000 (containing music links)
    ● → ~40 index ops/sec
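
The throughput figures on this slide can be sanity-checked with a little arithmetic. This is only a sketch: the slide does not say how the ~40 index operations per second are derived, so the last ratio below (index operations per valid item) is an inference from the slide's numbers, not something the deck states.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

items_in_per_day = 5_000_000   # from the slide
valid_per_day = 850_000        # from the slide (items containing music links)
index_ops_per_sec = 40         # from the slide (approximate)

# Share of incoming items that survive the music-link filters.
valid_fraction = valid_per_day / items_in_per_day

# Average rate of valid items, assuming traffic is spread evenly over the day.
valid_per_sec = valid_per_day / SECONDS_PER_DAY

# Inferred, not stated on the slide: index operations per valid item.
ops_per_valid_item = index_ops_per_sec / valid_per_sec

print(f"valid fraction: {valid_fraction:.0%}")
print(f"valid items/sec: {valid_per_sec:.1f}")
print(f"implied index ops per valid item: {ops_per_valid_item:.1f}")
```

Under the even-traffic assumption, roughly one in six incoming items is kept, which is about ten valid items per second. The quoted ~40 index ops/sec would then correspond to around four index operations per valid item, though peak load or extra writes per post could explain the figure equally well.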

  12. The Model

  13. Post
    An item containing a music link.
    {
    "postid" : "0972cd80-01bb-11e4-b21c-123136519c3",
    "network" : "serendip",
    "postDate" : "2014-07-01T01:00:04.000Z",
    "txt" : "#airing \"Door Gunner\"Performed by Herb Hutchinson Written by Jeffrey Deitelbaum #rockradio ROCK INSTRUMENTAL http://srndp.me/ahMFkTQ",
    "lang" : "en",
    "uid" : ["tw_...", "fb_...", "sd_..."],
    "service" : "serendip",
    "clip" : ["yt_at1kaxrmOR8"],
    ...
    }

  14. Post
    ● ~25M posts/month
    ● Data continuously increasing:
    Using monthly indexes!
    ● Searches are always within a time frame:
    Search only on the needed indexes!
    (e.g. posts-514, posts-614, posts-714)
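
Picking "only the needed indexes" for a time frame can be sketched as below. The naming scheme (prefix, month without zero padding, two-digit year) is inferred from the three example names on the slide and is an assumption; the function name `monthly_indexes` is hypothetical.

```python
from datetime import date


def monthly_indexes(start, end, prefix="posts"):
    """List one index name per calendar month between start and end (inclusive).

    Assumed naming scheme, inferred from the slide's examples:
    "<prefix>-<month><two-digit year>", e.g. May 2014 -> "posts-514".
    """
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{month}{year % 100:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            month = 1
            year += 1
    return names


indexes = monthly_indexes(date(2014, 5, 10), date(2014, 7, 28))
print(",".join(indexes))
```

Elasticsearch accepts a comma-separated list of index names in the search URL path, so the joined string can be used directly as the target of a search request. Older months are never touched by the query.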

  15. User
    A social network user (Facebook/Twitter/Serendip)
    {
    "network" : "serendip",
    "id" : "4dd0e2775c6b09a536aee1ab",
    "name" : "Rotem Hermon",
    "dsc" : "Non-social media amateur",
    "country" : "Israel",
    "city" : "Tel Aviv Yaffo",
    "connectedAccounts" : ["tw_...", "fb_..."],
    "lastUpdate" : "2014-06-30T09:00:00.000Z",
    "postCount" : 710,
    "rockOnCount" : 249,
    "reairCount" : 93

  16. User
    ● A single index
    ● Limiting index size:
    Scheduled cleanup of inactive users

  17. Clip
    ● Metadata for a music clip (URL, artist, album, genre, etc.)
    ● Not indexed in ES

  18. The Feed

  19. The Feed
    ● Requirements:
    ○ Combine music from several sources (friends,
    preferred artists, recommendations)
    ○ Reactive to user input and actions
    This means generating the feed in real time.

  20. The Feed Algorithm
    ● A collection of “strategies” (e.g. “friends”,
    “preferred artists”, “suggested users”)
    ○ A strategy considers the user's most recent actions
    ● Strategies are dynamically combined on
    every feed fetch
    ○ This translates to searches on posts in ElasticSearch
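
One way to picture strategies being combined into a single posts search is the sketch below. It is not the service's actual code: the strategy functions, the context keys, and the `artist` field name are hypothetical, while `uid` and `postDate` come from the Post document shown earlier. The query body uses the standard `bool`, `terms`, and `range` queries.

```python
def friends_strategy(ctx):
    """Posts shared by accounts the user follows (context key is hypothetical)."""
    ids = ctx.get("friend_ids", [])
    return {"terms": {"uid": ids}} if ids else None


def preferred_artists_strategy(ctx):
    """Posts whose clip is by a preferred artist ("artist" field is assumed)."""
    artists = ctx.get("preferred_artists", [])
    return {"terms": {"artist": artists}} if artists else None


def build_feed_query(ctx, strategies, since="now-7d", size=20):
    """Combine the active strategies into one search body for the posts indexes."""
    clauses = [c for c in (s(ctx) for s in strategies) if c is not None]
    if not clauses:
        raise ValueError("no strategy produced a clause for this user")
    return {
        "size": size,
        "sort": [{"postDate": {"order": "desc"}}],
        "query": {
            "bool": {
                # Searches are always limited to a time frame.
                "must": [{"range": {"postDate": {"gte": since}}}],
                # A post qualifies if at least one strategy matches it.
                "should": clauses,
                "minimum_should_match": 1,
            }
        },
    }


ctx = {"friend_ids": ["tw_1", "fb_2"], "preferred_artists": ["example artist"]}
feed_query = build_feed_query(ctx, [friends_strategy, preferred_artists_strategy])
```

Because the query is rebuilt on every fetch from the current context, a change in the user's recent actions changes the clauses on the next request without any precomputed feed to invalidate.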

  21. The Feed Algorithm
    The key -

  22. The Feed Algorithm
    The key: denormalization!

  23. The Feed Algorithm
    ● A post is indexed with needed data from
    other objects:
    ○ User details (e.g. location)
    ○ Clip metadata (artist, genre, description, language)
    ● So all required data for a strategy search is
    contained in the posts index
    ● Cons: space (data is duplicated), integrity (data
    may not be recent)
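
The denormalization step can be illustrated with a small sketch. It assumes the import pipeline has the post, its user, and its clip metadata at hand before indexing; the function name and the target field names (`userCountry`, `userCity`, `clipLang`, and the top-level `artist`, `genre`, `description`) are hypothetical, since the deck does not show the indexed field names.

```python
def denormalize_post(post, user, clip):
    """Return a new post document with user and clip fields copied in.

    Every field a feed strategy may search on ends up in the post itself,
    so a single search on the posts index is enough.
    """
    doc = dict(post)  # shallow copy; the input documents are left unchanged
    doc["userCountry"] = user.get("country")
    doc["userCity"] = user.get("city")
    for field in ("artist", "genre", "description"):
        doc[field] = clip.get(field)
    doc["clipLang"] = clip.get("language")
    return doc


post = {"postid": "p1", "uid": ["tw_1"], "clip": ["yt_abc"], "lang": "en"}
user = {"id": "u1", "country": "Israel", "city": "Tel Aviv Yaffo"}
clip = {"artist": "Example Artist", "genre": "rock",
        "description": "demo clip", "language": "en"}

indexed_doc = denormalize_post(post, user, clip)
```

The copy is a snapshot taken at index time, which is exactly the integrity cost named on the slide: if the user later moves to another city, already-indexed posts keep the old value until they are reindexed.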

  24. The Feed Algorithm
    Same for creating “stations” by artist or genre:
    All required data is indexed under the post.

  25. Recommendations

  26. Recommendations
    ● “Music Soulmates” - find users with matching
    musical taste
    ● Common solutions - using machine learning,
    Hadoop, M/R jobs
    ● We’re a small startup. We already have
    enough systems on our plate
    ● Can we do it with the existing system?

  27. Recommendations
    The key -

  28. Recommendations
    The key: prepare the data in advance!

  29. Recommendations
    ● Data preparation:
    ○ When importing posts, constantly calculate top
    shared artists for users
    ○ Top artists are found using faceted search on posts
    shared by the user
    ○ Mark “spammers” (e.g. a lot of shares of only a
    single artist)
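
The data-preparation step might look roughly like the sketch below. It uses the terms facet request and response shape from the Elasticsearch versions current when this deck was given (facets were later replaced by aggregations). The `artist` field name, the helper names, and the spammer thresholds are all assumptions for illustration; the deck only says that spammers share a lot of a single artist.

```python
def top_artists_request(uid, size=10):
    """Search body asking for the most shared artists in one user's posts."""
    return {
        "size": 0,  # only the facet counts are needed, not the posts
        "query": {"term": {"uid": uid}},
        "facets": {
            "top_artists": {"terms": {"field": "artist", "size": size}}
        },
    }


def parse_top_artists(response):
    """Extract (artist, share count) pairs from a terms facet response."""
    terms = response["facets"]["top_artists"]["terms"]
    return [(t["term"], t["count"]) for t in terms]


def looks_like_spammer(top_artists, min_shares=50, max_top_share=0.9):
    """Hypothetical heuristic: many shares, nearly all of one artist.

    Simplification: the total only counts the artists returned by the facet.
    """
    total = sum(count for _, count in top_artists)
    if not top_artists or total < min_shares:
        return False
    return max(count for _, count in top_artists) / total >= max_top_share


# A hand-written response in the terms facet format, for illustration only.
sample_response = {
    "facets": {
        "top_artists": {
            "_type": "terms", "missing": 0, "total": 100, "other": 0,
            "terms": [{"term": "artist a", "count": 95},
                      {"term": "artist b", "count": 5}],
        }
    }
}

top = parse_top_artists(sample_response)
spammer = looks_like_spammer(top)
```

The resulting artist list would then be stored on the user document, so that matching users later is a plain search on the users index rather than a computation over posts.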

  30. Recommendations
    ● Finding “Music Soulmates”:
    ○ Search for users with matching “top artists”
    ○ Use scoring to surface users with most matches
    ○ Use boosting to tweak the results (e.g. prefer users
    from the same country, active users, recent activity)
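
A soulmates search along these lines could be expressed as a `bool` query, as in the sketch below. The `topArtists` field name, the function name, and the boost value are assumptions; `id` and `country` come from the User document shown earlier. The sketch also assumes these fields are indexed as exact values, since `term` queries do not analyze their input.

```python
def soulmates_query(user_id, top_artists, country=None, size=10):
    """Search body for users sharing top artists with the given user."""
    if not top_artists:
        raise ValueError("the user has no top artists to match on")

    # One clause per artist: a bool query scores a user higher the more
    # "should" clauses match, so users with more shared artists rank first.
    artist_clauses = [{"term": {"topArtists": a}} for a in top_artists]

    bool_query = {
        "must": [
            {"bool": {"should": artist_clauses, "minimum_should_match": 1}}
        ],
        # Never recommend the user to themselves.
        "must_not": [{"term": {"id": user_id}}],
    }
    if country:
        # Optional boost: same-country users score higher but are not required.
        bool_query["should"] = [
            {"term": {"country": {"value": country, "boost": 2.0}}}
        ]
    return {"size": size, "query": {"bool": bool_query}}


query = soulmates_query("4dd0e2775c6b09a536aee1ab",
                        ["artist a", "artist b"], country="Israel")
```

Keeping the artist clauses inside `must` and the country clause in the outer `should` means a user can never match on country alone; the boost only reorders users who already share at least one artist. Boosts for activity or recency would be added as further `should` clauses in the same way.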

  31. Scaling

  32. Scaling
    ● Current setup:
    ○ 4 × m2.2xlarge
    ○ Most CPU goes to pump imports (indexing + facets)
    ● Scaling:
    ○ More nodes, bigger nodes, IOPS optimization
    ○ Indexing optimizations (use parent-child for
    frequently updated fields)

  33. Questions?
    Thank you!
