[Figure: document scatter plot; panels labeled Cluster 1, Cluster 2, Cluster 3.]

Each word token $w_{di}$ in each document is considered in turn, and its topic assignment $z_i$ is computed conditioned on the topic assignments of all other word tokens (Steyvers & Griffiths, 2007). In other words, the probability that a specific topic $j$ is assigned to the current word $w_{di}$ depends on the probability that the same word has been assigned that topic in other positions in the corpus. Formally, this posterior can be written as:

$$P(z_i = j \mid z_{-i}, w_i, d_i, \cdot) \propto \frac{C^{WT}_{w_i j} + \beta}{\sum_{w=1}^{W} C^{WT}_{wj} + W\beta} \cdot \frac{C^{DT}_{d_i j} + \alpha}{\sum_{t=1}^{T} C^{DT}_{d_i t} + T\alpha} \tag{2.1}$$

where $\cdot$ is all other known information, such as the Dirichlet priors and all other words $w_{-i}$ and documents $d_{-i}$; and $\propto$ means "proportional to", as in $y \propto x \equiv y = kx$. $C^{WT}$ and $C^{DT}$ are matrices of counts with dimensions $W \times T$ (number of unique words in the vocabulary $\times$ number of topics) and $D \times T$ (number of documents $\times$ number of topics), respectively:

• $C^{WT}_{wj}$ is the count of word $w$ assigned to topic $j$, not including the current instance $i$.
• $C^{DT}_{dj}$ is the count of topic $j$ assigned to some word token in document $d$, not including the current instance $i$.

Conceptually, the first ratio is the probability of $w_i$ under topic $j$, and the second ratio is the probability of topic $j$ in document $d_i$. Once many tokens of word $i$ have been assigned topic $j$ (across all documents), the probability increases that subsequent tokens of word $i$ are also assigned topic $j$. Similarly, if topic $j$ has been used multiple times within a document, the probability increases that any word within that document is assigned topic $j$. Estimates of the topic distribution $\theta$ and term distribution $\phi$ can then be calculated from the same count matrices (Griffiths & Steyvers, 2004; Steyvers & Griffiths, 2007):

$$\hat{\phi}^{(j)}_w = \frac{C^{WT}_{wj} + \beta}{\sum_{w'=1}^{W} C^{WT}_{w'j} + W\beta}, \qquad \hat{\theta}^{(d)}_j = \frac{C^{DT}_{dj} + \alpha}{\sum_{t=1}^{T} C^{DT}_{dt} + T\alpha}$$
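To make the update rule concrete, the following is a minimal sketch of a collapsed Gibbs sampler implementing Eq. (2.1) and the $\hat{\phi}$/$\hat{\theta}$ estimates above. It is an illustration under assumptions, not the implementation used in this work: the function name gibbs_lda, the toy corpus format (lists of integer word ids), and the parameter defaults are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, n_iter=200):
    """docs: list of word-id lists; W: vocabulary size; T: number of topics."""
    D = len(docs)
    CWT = np.zeros((W, T))            # word-topic counts, W x T
    CDT = np.zeros((D, T))            # document-topic counts, D x T
    z = []                            # topic assignment per token
    # random initialisation of all assignments
    for d, doc in enumerate(docs):
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            CWT[w, t] += 1
            CDT[d, t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = z[d][i]
                # remove the current token from the counts
                # ("not including the current instance i")
                CWT[w, t_old] -= 1
                CDT[d, t_old] -= 1
                # Eq. (2.1): P(z_i = j | .) is proportional to
                # (word given topic) x (topic given document); the second
                # ratio's denominator is constant in j and can be dropped
                p = ((CWT[w] + beta) / (CWT.sum(axis=0) + W * beta)
                     * (CDT[d] + alpha))
                t_new = rng.choice(T, p=p / p.sum())
                z[d][i] = t_new
                CWT[w, t_new] += 1
                CDT[d, t_new] += 1
    # point estimates of phi (term distributions) and theta (topic
    # distributions), matching the two formulas given above
    phi = (CWT + beta) / (CWT.sum(axis=0) + W * beta)                     # W x T
    theta = (CDT + alpha) / (CDT.sum(axis=1, keepdims=True) + T * alpha)  # D x T
    return phi, theta
```

A real run would also discard an initial burn-in phase of iterations before reading off $\hat{\phi}$ and $\hat{\theta}$, since early samples still reflect the random initialisation.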
Table 3.1: Topics in cluster 1 and their associated terms.

Topic 1     Topic 2     Topic 4       Topic 6
communic    word        mean          experi
inform      order       emerg         particip
speaker     product     languag       categori
relev       event       featur        studi
system      interpret   form          result
question    semant      composit      set
utter       data        space         condit
simpl       present     semant        task
encod       studi       combin        test
cue         lexicon     combinatori   present

Topic 10    Topic 11    Topic 18      Topic 19
signal      learn       gestur        model
game        cultur      languag       agent
system      bias        sign          popul
communic    structur    symbol        network
strategi    generat     system        interact
agent       languag     point         simul
interact    linguist    icon          communiti
high        regular     action        dynam
player      learner     speech        comput
refer       transmiss   form          effect

Modeling the authors of EvoLang: the topography of collaborations

By constructing an authorship network from co-authored abstracts, we can examine the nature of collaborations at EvoLang. Who collaborates with whom? What kind of submission elicits large collaborations? Are there large components in the network?
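A hedged sketch of this construction: treat authors as nodes and co-authorship on an abstract as an edge, then inspect connected components. The toy author lists and the use of networkx are illustrative assumptions, not the actual EvoLang data or pipeline.

```python
import itertools
import networkx as nx

abstracts = [                       # hypothetical author lists per abstract
    ["A. Smith", "B. Jones"],
    ["B. Jones", "C. Lee", "D. Kim"],
    ["E. Wong"],
]

G = nx.Graph()
for authors in abstracts:
    G.add_nodes_from(authors)       # solo authors still appear as nodes
    # connect every pair of co-authors on the same abstract
    for a, b in itertools.combinations(authors, 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1  # repeated collaborations add weight
        else:
            G.add_edge(a, b, weight=1)

# component sizes answer "are there large components in the network?"
sizes = sorted((len(c) for c in nx.connected_components(G)), reverse=True)
print(sizes)                        # e.g. [4, 1] for the toy data above
```

Edge weights record repeated collaborations between the same pair of authors, and the sorted component sizes show at a glance whether the network is dominated by one large connected component or fragmented into many small ones.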