Slide 1

Slide 1 text

Data Science at Whisper
ULAS BARDAK, MAARTEN BOSMA, ROHAN MONGA, MARK HSIAO, NICK STUCKY-MACK
Presented at Data Science, LA Meetup. March 23rd, 2015.

Slide 2

Slide 2 text

A little background on Whisper
•  Anonymous social network focused on mobile apps
•  Users come to share secrets, make confessions, find others to connect to
•  No need to create an account
•  Engagement through replies, direct messages, “hearts”
•  Millions of users & hundreds of millions of whispers
Whisper @ LA DS Meetup, 2015/03/23

Slide 3

Slide 3 text

High Level Usage Patterns
[Diagram labels: App Launch; Recommended Whispers; Recommendation Engine; User + Content Models; User Engagement; Whisper Create; Suggest Image; Creation Flow; Interaction Flow]

Slide 4

Slide 4 text

Some Problems We Are Tackling
Content Understanding
•  Spam detection
•  Language detection
•  Content quality prediction
•  Content classification
•  Image suggestion
User Understanding
•  Spammer detection
•  Personalization
•  Similar user detection
•  Churn prediction
Overall
•  A/B testing
•  Reporting

Slide 5

Slide 5 text

Language Detection

Slide 6

Slide 6 text

Content Quality Prediction

Slide 7

Slide 7 text

Image Suggestion From Text
Maarten Bosma

Slide 8

Slide 8 text

Whisper Creation Flow
1.  User creates text content.
2.  System suggests one image. OK? Yes: whisper is created. No: go to step 3.
3.  More suggestions are shown. OK? Yes: whisper is created. No: go to step 4.
4.  User searches or uploads.

Slide 9

Slide 9 text

When Whisper First Started…
•  No image suggestions
•  Users had to type in a search phrase after they created whispers.

Slide 10

Slide 10 text

Image Suggest Goals
•  5-second create
•  Support the “mood” set in the whisper
•  High quality images
•  High variation in suggested images

Slide 11

Slide 11 text

Where do we get the images?
•  Building an image repo is difficult:
    •  Need a lot of images
    •  Still need a source to populate the repo
•  Cannot simply use a search engine

Slide 13

Slide 13 text

Pipeline
1.  Start creating whisper.
2.  Generate search terms.
3.  Terms cached? Yes: read images from the cache. No: query the external (3rd party) source.
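The cache check in this pipeline can be sketched as follows. This is a minimal illustration, not Whisper's implementation: the names `images_for_terms` and `query_external` are hypothetical, the cache is a plain in-memory dict, and storing external results back into the cache is an assumption.

```python
from typing import Callable, Dict, List


def images_for_terms(
    terms: List[str],
    cache: Dict[str, List[str]],
    query_external: Callable[[str], List[str]],
) -> List[str]:
    """Collect image URLs for each search term.

    The external (3rd party) source is queried only on a cache miss,
    and its result is stored so the next whisper with the same term
    is served from the cache (an assumption of this sketch).
    """
    images: List[str] = []
    for term in terms:
        if term not in cache:
            cache[term] = query_external(term)  # cache miss
        images.extend(cache[term])
    return images
```

Passing the external lookup in as a function keeps the sketch independent of any particular image provider.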

Slide 14

Slide 14 text

How do we get the search terms?
We use four different strategies:
•  Fixed list
•  Sentiment analysis
•  Keyword extraction: cut the text into phrases, score them using tf-idf, POS tags, etc.
•  Learn from previous searches
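As a rough illustration of the keyword extraction strategy, the sketch below scores single words by tf-idf against a small corpus. It is a simplification under stated assumptions: the slide mentions phrases and POS tags, which are left out here, and the function names, the tokenizer, and the smoothed idf formula are all choices made for this example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_frequencies(corpus: List[str]) -> Counter:
    """Count in how many documents each token appears."""
    df: Counter = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def top_keywords(text: str, corpus: List[str], k: int = 3) -> List[str]:
    """Return the k tokens of `text` with the highest tf-idf score."""
    df = document_frequencies(corpus)
    n_docs = len(corpus)
    tf = Counter(tokenize(text))
    scores = {
        term: count * math.log((1 + n_docs) / (1 + df[term]))
        for term, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for stability.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Words that appear in every document (such as "i" in a corpus of confessions) get a score of zero, so they fall to the bottom of the list.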

Slide 15

Slide 15 text

Learning Using Similar Whispers
•  Top-n similar whispers (cosine similarity on tf-idf weighted bag of words)
•  We only use image search terms that worked before
Good, but not great…
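The similar-whisper lookup can be sketched as below. This is an illustrative version only: `suggest_terms`, the `history` structure (pairs of whisper text and the search term that led to an accepted image), whitespace tokenization, and the smoothed idf weighting are assumptions of the example, not details given in the talk.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vector(tokens: List[str], df: Counter, n_docs: int) -> Dict[str, float]:
    """Build a sparse tf-idf vector (smoothed idf, always positive)."""
    tf = Counter(tokens)
    return {
        t: c * (math.log((1 + n_docs) / (1 + df.get(t, 0))) + 1.0)
        for t, c in tf.items()
    }


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_terms(text: str, history: List[Tuple[str, str]], n: int = 2) -> List[str]:
    """Return search terms that worked for the n most similar past whispers."""
    docs = [past_text.lower().split() for past_text, _ in history]
    df = Counter(tok for doc in docs for tok in set(doc))
    query = tfidf_vector(text.lower().split(), df, len(docs))
    scored = [
        (cosine(query, tfidf_vector(doc, df, len(docs))), term)
        for doc, (_, term) in zip(docs, history)
    ]
    scored.sort(key=lambda pair: -pair[0])
    # Ignore past whispers that share no vocabulary with the new one.
    return [term for score, term in scored[:n] if score > 0]
```

Comparing against every past whisper is linear in the size of the history, which is one reason a per-term representation scales better.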

Slide 16

Slide 16 text

From Similar Whispers to Similar Terms
•  Represent each term as a vector
•  Faster, more scalable
•  Fixed vocabulary

Slide 17

Slide 17 text

New Pipeline
1.  Start creating whisper.
2.  Generate search terms.
3.  Read images from the image repository.
Offline Processing (fills the image repo)
1.  Generate dictionary.
2.  For each term, query 3rd party if needed.
3.  Remove low quality images.

Slide 18

Slide 18 text

Low Quality Image Detection
•  Remove dead images
•  Check how quickly images can be loaded
•  Remove images too big or too small (in addition to query parameters)
•  Text detection
    •  Images with text make poor Whisper backgrounds
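The checks listed above could be combined into a single filter along these lines. Everything concrete here is an assumption for illustration: the `ImageInfo` record, the field names, and especially the thresholds (1500 ms, 300 px, 4000 px) are hypothetical and not taken from the talk. The `has_text` flag stands in for the output of a separate text detector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageInfo:
    url: str
    reachable: bool   # False for dead links
    load_ms: float    # measured download time in milliseconds
    width: int
    height: int
    has_text: bool    # result of a separate text-detection step


def is_usable(
    img: ImageInfo,
    max_load_ms: float = 1500.0,  # hypothetical threshold
    min_side: int = 300,          # hypothetical threshold
    max_side: int = 4000,         # hypothetical threshold
) -> bool:
    """Apply the dead-link, load-time, size, and text checks in order."""
    if not img.reachable:
        return False
    if img.load_ms > max_load_ms:
        return False
    if min(img.width, img.height) < min_side:
        return False
    if max(img.width, img.height) > max_side:
        return False
    return not img.has_text


def filter_images(images: List[ImageInfo]) -> List[ImageInfo]:
    """Keep only the images that pass every quality check."""
    return [img for img in images if is_usable(img)]
```

Ordering the checks from cheapest to most expensive means the text detector only needs to run on images that survive the simpler tests.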

Slide 19

Slide 19 text

Text Detection in Images
•  Developed in-house
•  Stroke based feature detection

Slide 20

Slide 20 text

Future Work
•  Learn which domains are likely to contain good images
•  Combine different image sources
•  Better image quality computations
•  Other search term prediction strategies
•  Internationalization (i18n)

Slide 21

Slide 21 text

Personalization
Rohan Monga

Slide 22

Slide 22 text

Why do we need recommendations?
Problem: Showing every user the exact same content is not efficient. Engagement and interest depend on matching users’ preferences to content, i.e. personalization.
Requirements: Fast and able to work with little data

Slide 23

Slide 23 text

Recommendation Engine - Concerns
Algorithmic
•  User’s past actions, explicit preferences, inferred / implicit information
•  Content features
•  Model training, testing, feedback delay between rec. and user actions
•  …
Business
•  Ability to override algorithmic decisions for special cases
•  Insights into quality, performance of the algorithms
•  Ability to rapidly A/B test new ideas
•  …
Platform
•  Data stores, unified user and item features
•  Throughput of the rec. engine, timeouts
•  Code reuse and testing
•  …

Slide 24

Slide 24 text

Well that complicates things…
… let’s see if we can build something not overly complicated.

Slide 25

Slide 25 text

Recommendation Engine
Start out by building a profile for each user based on their activity (created/liked/available user properties)
•  Preferred categories
•  Preferred languages
•  Keywords
•  User device
•  Whether or not the user is “new”
•  …

Slide 26

Slide 26 text

Recommendation Engine
High Personalization
•  Like-minded users
•  Collaborative filtering
•  …
High Coverage
•  Popular in location
•  Recently popular
•  Popular with new users
•  …
Combiner
•  Merge results, deciding on the right ordering
•  If not enough results, use fallback methods to backfill.
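A combiner of this kind might look like the sketch below. The talk does not say how the ordering is decided, so the round-robin interleaving used here is purely an assumption, as are the function name `combine` and its parameters.

```python
from itertools import zip_longest
from typing import Iterable, List, Sequence


def combine(
    ranked_lists: Sequence[Sequence[str]],
    fallback: Iterable[str],
    limit: int,
) -> List[str]:
    """Merge several ranked result lists, then backfill from a fallback.

    Results are interleaved round-robin (an assumed ordering policy),
    duplicates are dropped, and fallback items are appended only while
    fewer than `limit` results have been collected.
    """
    merged: List[str] = []
    seen = set()

    def add(item) -> None:
        if item is not None and item not in seen and len(merged) < limit:
            seen.add(item)
            merged.append(item)

    # zip_longest pads shorter lists with None, which add() skips.
    for position in zip_longest(*ranked_lists):
        for item in position:
            add(item)
    for item in fallback:
        add(item)
    return merged
```

For example, `combine([["a", "b"], ["b", "c", "d"]], ["x", "a", "y"], 6)` interleaves the two method outputs and then backfills with the fallback items that were not already present.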

Slide 27

Slide 27 text

Like-minded User Calculations
1.  Agglomeration [Convert the user into a giant document]
2.  Pre-processing [Lowercase, remove stopwords, etc.]
3.  Vectorization [Bag of words into vectors]
4.  Dimensionality reduction [Autoencoder maps 5K+ into ~100]
5.  Similarity calculation [Top k users via cosine similarity]
6.  Recommendation [Collect whispers from similar users]
7.  Feedback [Regenerate model with new activity]
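Steps 1, 2, 3, 5, and 6 above can be sketched in a few lines. This example deliberately skips step 4 (the autoencoder) and compares raw bag-of-words counts instead, so it only illustrates the overall shape of the calculation. The `activity` structure, the tiny stopword list, and the function names are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List

STOPWORDS = {"i", "my", "the", "a", "is", "to", "and"}  # illustrative only


def user_vector(whisper_texts: Iterable[str]) -> Counter:
    """Steps 1-3: join a user's whispers into one document, lowercase it,
    drop stopwords, and count the remaining words."""
    document = " ".join(whisper_texts).lower()
    return Counter(tok for tok in document.split() if tok not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[tok] for tok, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(user_id: str, activity: Dict[str, Dict[str, str]], k: int = 1) -> List[str]:
    """Steps 5-6: find the k most similar users and collect their whispers.

    `activity` maps a user id to {whisper id: whisper text}.
    """
    vectors = {u: user_vector(w.values()) for u, w in activity.items()}
    others = [u for u in activity if u != user_id]
    neighbours = sorted(
        others, key=lambda u: -cosine(vectors[user_id], vectors[u])
    )[:k]
    already_seen = set(activity[user_id])
    return [
        wid for u in neighbours for wid in activity[u] if wid not in already_seen
    ]
```

Step 7 would amount to rebuilding `activity` and the vectors as new engagement arrives.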

Slide 28

Slide 28 text

Collaborative Filtering
•  We want to learn a low dimensional embedding for users and Whispers.
•  Learn a score function f(u,w) that gives scores of whispers given a user. Ex:
   $$f(u, w) = U_u \cdot W_w$$
•  Define a rank function that ranks all whispers for all users:
   $$\mathrm{rank}(u, w) = \sum_{k \in \mathcal{W},\, k \neq w} \mathbb{I}\{\, f(u, k) > f(u, w) \,\}$$
   where $\mathcal{W}$ is the set of all whispers and $\mathbb{I}\{\cdot\}$ is the indicator function.
*Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35, 2010.
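The score and rank functions on this slide translate directly into code. The sketch below assumes embeddings are stored as plain dicts from id to a list of floats; the names `score` and `rank` and that storage format are choices made for the example.

```python
from typing import Dict, List

Embeddings = Dict[str, List[float]]


def score(U: Embeddings, W: Embeddings, u: str, w: str) -> float:
    """f(u, w): dot product of the user and whisper embeddings."""
    return sum(a * b for a, b in zip(U[u], W[w]))


def rank(U: Embeddings, W: Embeddings, u: str, w: str) -> int:
    """Number of other whispers that score strictly higher than w for user u.

    A rank of 0 means w is the top-scoring whisper for this user.
    """
    target = score(U, W, u, w)
    return sum(1 for k in W if k != w and score(U, W, u, k) > target)
```

Computing the exact rank touches every whisper, which is what makes it costly at the scale of hundreds of millions of whispers.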

Slide 29

Slide 29 text

Collaborative Filtering
•  We can then define an error function using the template:
   $$\mathrm{err}(f(x), y) = L(\mathrm{rank}(x, y))$$
   where L is a non-decreasing loss function and rank is the actual rank.
•  For large datasets like ours, it is computationally expensive to obtain exact ranks of items.
•  Idea: Online learning to rank - utilize Weighted Approximate Rank Pairwise (WARP) loss
•  Then use stochastic gradient descent for optimization
•  Extension to basic model: Use like-minded user metrics to make sure similar users have similar embeddings.
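A single WARP update can be sketched as follows, based on the general description in Weston et al. rather than on Whisper's own code. Several simplifications are assumptions of this example: negatives are sampled without replacement, the rank weight is the harmonic sum L(r) = 1 + 1/2 + … + 1/r, the margin is 1, and the norm constraints on the embeddings are omitted. The function name `warp_step` and its parameters are hypothetical.

```python
import random
from typing import Dict, List

Embeddings = Dict[str, List[float]]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def warp_step(
    U: Embeddings,
    W: Embeddings,
    u: str,
    pos: str,
    rng: random.Random,
    lr: float = 0.05,
    margin: float = 1.0,
) -> bool:
    """One SGD step for user u and a whisper `pos` the user engaged with.

    Negatives are drawn until one violates the margin. The number of
    draws needed gives a cheap estimate of the rank of `pos`, which
    sets the step weight. Returns False if no negative violates.
    """
    negatives = [w for w in W if w != pos]
    rng.shuffle(negatives)
    pos_score = dot(U[u], W[pos])
    for trials, neg in enumerate(negatives, start=1):
        if dot(U[u], W[neg]) + margin > pos_score:
            est_rank = len(negatives) // trials
            weight = sum(1.0 / i for i in range(1, est_rank + 1))
            step = lr * weight
            user_vec = list(U[u])  # keep the pre-update user vector
            # Gradient step on the hinge loss margin - f(u,pos) + f(u,neg).
            U[u] = [x + step * (p - n) for x, p, n in zip(user_vec, W[pos], W[neg])]
            W[pos] = [p + step * x for p, x in zip(W[pos], user_vec)]
            W[neg] = [n - step * x for n, x in zip(W[neg], user_vec)]
            return True
    return False
```

A violating negative found on the first draw implies a poor estimated rank and therefore a large step, while a positive that is already near the top produces small or no updates.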

Slide 30

Slide 30 text

Current Architecture
[Diagram labels: User W.; User Context; DAO; Rec Group for New Users; Rec Group for Users w/Churn Risk; …; Tier 1 (three chains of Method, Filter, Sort); Tier 2 (three chains of Method, Filter, Sort); Merger; Group Sort]

Slide 31

Slide 31 text

Performance
•  By offloading most of the difficult calculations to offline jobs, we simplify the online calculation requirements.
•  The current system can handle more than 500 queries per second with a response time of less than 1 second per query.

Slide 32

Slide 32 text

Future directions
•  Use more implicit signals on top of explicit user actions
•  Personalize the way methods are used for each user by employing multi-armed bandits (MABs)
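One simple bandit strategy that could serve this purpose is epsilon-greedy, sketched below. The talk does not say which bandit algorithm is planned, so this choice is an assumption, as are the class name, the arm names in the usage note, and the idea of using engagement (1.0 or 0.0) as the reward.

```python
import random
from typing import Dict, Iterable, Optional


class EpsilonGreedy:
    """Pick among recommendation methods (arms) by observed reward.

    With probability `epsilon` a random arm is explored; otherwise the
    arm with the highest average reward so far is exploited.
    """

    def __init__(
        self,
        arms: Iterable[str],
        epsilon: float = 0.1,
        rng: Optional[random.Random] = None,
    ) -> None:
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.pulls: Dict[str, int] = {arm: 0 for arm in arms}
        self.mean_reward: Dict[str, float] = {arm: 0.0 for arm in self.pulls}

    def choose(self) -> str:
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.pulls, key=lambda arm: self.mean_reward[arm])

    def update(self, arm: str, reward: float) -> None:
        """Fold a new reward into the running average for `arm`."""
        self.pulls[arm] += 1
        n = self.pulls[arm]
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / n
```

Keeping one such object per user (for example with arms `"like_minded"` and `"popular_nearby"`) would let the mix of methods adapt to each user's own engagement.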

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Our Technology Stack for DS

Slide 35

Slide 35 text

Thank You for Listening! Questions?
•  For more info:
•  We are hiring ;-)
•  Contact me at
•  Try out the app for yourself!