Towards a Data Driven Product: Techniques and Tools at GitHub

JD Maturen
October 08, 2014

Presented at FutureStack14

Transcript

  1. But First Some Meta-Points
     • Try to remember and act on one thing.
     • Go buy Wizard.app
     • Go read Kahneman's "Thinking, Fast and Slow"

  2. @jdmaturen
     • Background in infrastructure, both at social networks and enterprise SaaS.
     • OODA loop: Observe, Orient, Decide, Act.
     • "Every deploy changes a metric."
     • Not a stats expert! Likely not an expert in anything.

  3. What questions do we ask of our data?
     • Out of our active users how many are new? How many are returning? Where do they come from? Do they have paid accounts? What are they here to do?
     • How many new users are we retaining? Which actions are early indications of retention? Which types of users are most likely to be successful?
     • What is our Customer Lifetime Value (CLV)? Which customers will upgrade next month? Which will leave?

  4. Sampling from the Bernoulli distribution
     • The Bernoulli distribution is 1 with probability p else 0, written Bernoulli(p).
     • We are going to estimate the distribution of the parameter p itself based on observed samples x1, x2, x3, …, xn!
     • The conjugate prior for the Bernoulli distribution is the beta distribution.
     • Conjugate priors are algebraic closed forms for estimating distribution parameters using only a fixed-memory summary of the sample data.

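A minimal sketch of the conjugate update this slide describes (the function name is illustrative, not from the talk): with a Beta(α, β) prior and Σxi successes in n Bernoulli trials, the posterior is Beta(α + Σxi, β + n − Σxi), so only two running counts need to be kept.

    # Sketch of the Beta-Bernoulli conjugate update: only the running counts
    # (successes, trials) are needed -- a fixed-memory summary of the samples.
    def beta_posterior(alpha, beta_, successes, trials):
        """Posterior Beta parameters after observing `successes` out of `trials`."""
        return alpha + successes, beta_ + trials - successes

    # Uniform prior beta(1, 1), then 3 successes out of 10 trials:
    print(beta_posterior(1, 1, successes=3, trials=10))  # (4, 8), i.e. beta(1+3, 1+10-3)
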
  5. Sampling from the Bernoulli distribution
     • We have observed 3 successes (Σxi) out of 10 trials (n).
     • In lieu of a long proof about what to do next we'll instead refer to a math textbook: bit.ly/conj-prior
     • With a uniform prior, beta(1, 1), the posterior distribution for p is beta(1 + 3, 1 + 10 - 3).

  6. [Plot of Probability vs. p value] Beta(α=1, β=1): our prior distribution before observing any data, aka the uniform distribution.

  7. [Plot of Probability vs. p value] Beta(α=1+3, β=1+10-3): posterior distribution of p in Bernoulli(p) based on a summary of observed samples.

  8. Increasing Confidence in the Distribution of p
     As we add observations the possible range of p shrinks. [Plot of Probability vs. p value for observed samples 3/10, 26/100, and 316/1000]

  9. Shrinking Range of p
     • Short of always churning out graphs how do we communicate the distribution of p? By giving a range of values (typically the 5th and 95th percentile):
       >>> from scipy.stats import beta
       >>> beta.ppf([0.05, 0.95], 1+3, 1+10-3)
       array([ 0.13507547, 0.56437419])
     • 7 slides later a possible answer: our sign-up rate is likely between 13.5% and 56.4%.
     • 26/100: 19.6%–33.9%; 316/1000: 29.2%–34.1%
     • Empirical confirmation: simulating 10 trials at p = 0.135 (the lower bound) puts the 5th–95th percentile of successes at 0–3, consistent with having observed 3.
       >>> from random import random
       >>> import numpy as np
       >>> binomial_rvs = lambda trials, p: sum(1 for _ in range(trials) if random() < p)
       >>> samples = [binomial_rvs(10, 0.135) for _ in range(1000)]
       >>> [np.percentile(samples, x) for x in (5, 95)]
       [0.0, 3.0]

  10. Range of p
     • In this example every order of magnitude increase in observations results in a 3x increase in confidence (the width of the range shrinks roughly as 1/√n).

  11. The Types of Questions We Can Now Answer
     • How does retention this month compare to retention last month?
     • How much has our upgrade rate increased?
     • Which variant of the new user landing page works the best?
     • Does new feature X work in all browsers?

  12. Comparing Samples
     • Intuitively: look at the relative difference distribution to see if the 5th %ile is >0% or the 95th %ile is <0%.
     • Can compute this empirically using random sampling from the two beta distributions (here beta is scipy.stats.beta).
     • E.g. Month 1: 1000 sign ups & retain 25%; Month 2: 1400 sign ups & retain 26%. Did retention go up by 1% (absolute)?
       >>> from scipy.stats import beta
       >>> import numpy as np
       >>> w1 = beta(1 + 250, 1 + 1000 - 250); w2 = beta(1 + 365, 1 + 1400 - 365)
       >>> samples = [(w1.rvs(), w2.rvs()) for _ in range(1000)]
       >>> diff = [(w2_sample - w1_sample) / w1_sample for w1_sample, w2_sample in samples]
       >>> [np.percentile(diff, x) for x in (5, 95)]
       [-0.076501431244058685, 0.17644472655825005]

  13. Retention didn’t significantly increase One mistake was to call retention

    in Month 1 25%, when it should be called 23%–27% If we had 10x more observations, or 10x fewer. 
 What would our ranges look like?
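One possible answer, not from the slides: rerun the interval calculation with the Month 1 counts scaled down and up by 10x, holding the observed rate at 25%. Each order of magnitude narrows the range roughly 3x, as slide 10 suggests.

    # Sketch: Month 1 retention interval at 1/10x, 1x, and 10x the observations,
    # keeping the observed retention rate at 25%.
    from scipy.stats import beta

    for retained, total in [(25, 100), (250, 1000), (2500, 10000)]:
        lo, hi = beta.ppf([0.05, 0.95], 1 + retained, 1 + total - retained)
        print(f"{retained}/{total}: {lo:.1%}-{hi:.1%}")
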
  14. Answering the More Complex Questions
     • Can extend this to multiple comparisons, the basis of A/B or multi-variate testing; nicely handles stopping conditions and weeding out underperforming options early: Thompson Sampling (bit.ly/bayes-bandits). A minimal sketch follows below.
     • Conjugate priors exist for many families of distributions. Other methods exist for non-algebraic solutions.
     • In order to drill into questions we need to store lots of dimensions with our raw data.

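A minimal Thompson Sampling sketch under the same Beta-Bernoulli model (the variant names and counts are invented for illustration): each time we need to serve a variant, draw once from every variant's posterior and pick the variant with the highest draw.

    # Minimal Thompson Sampling sketch for the Beta-Bernoulli model above.
    # Variants and their (successes, trials) counts are hypothetical.
    import numpy as np

    rng = np.random.default_rng()

    counts = {"variant_a": (30, 240), "variant_b": (41, 250), "variant_c": (12, 60)}

    def choose_variant(counts):
        """Draw once from each variant's Beta posterior; serve the best draw."""
        draws = {
            name: rng.beta(1 + successes, 1 + trials - successes)
            for name, (successes, trials) in counts.items()
        }
        return max(draws, key=draws.get)

    # Underperforming variants get served (and measured) less and less often.
    print(choose_variant(counts))
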
  15. Segmentation
     • Slicing by user (or test-unit) dimensions instead of by sample.
     • E.g. when looking at the effect of a new feature is it the same in all browsers? (See the sketch below.)
     • Or do all countries have the same activity rates?

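A small sketch of that kind of slicing, assuming per-browser success/trial counts have already been aggregated (the browser names and numbers are invented): compute a credible interval per segment and compare them.

    # Sketch: per-segment (per-browser) intervals for a feature's success rate.
    from scipy.stats import beta

    by_browser = {
        "chrome":  (412, 5000),   # (successes, trials); invented counts
        "firefox": (160, 2000),
        "safari":  (51, 1000),
    }

    for browser, (successes, trials) in by_browser.items():
        lo, hi = beta.ppf([0.05, 0.95], 1 + successes, 1 + trials - successes)
        print(f"{browser}: {lo:.1%}-{hi:.1%}")

    # Non-overlapping intervals (or a relative-difference interval that excludes 0,
    # as on slide 12) suggest the feature behaves differently for that segment.
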
  16. Sources of Data
     • The application DB(s). Be sure to schematize with an eye towards future analysis. Keep as much as you can.
     • Structured event logs: everything else that happens that isn't necessary to go in the DB.
     Except we didn't have that! So, off to build a data collection pipeline.

  17. Data Collection & Storage (Nothing Novel)
     • HTTP JSON API that receives events. HMAC signing of events or partial event data. (A sketch of the signing idea follows below.)
     • Data gets batched into S3.
     • There is a queue in the middle. May in the future move to a log.
     • Collect from the browser and the web app.

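A minimal sketch of HMAC-signing an event payload before posting it to a collection endpoint; the secret, URL, and header name are hypothetical, not GitHub's actual pipeline.

    # Sketch only: HMAC-sign an event payload for a hypothetical collection endpoint.
    import hashlib
    import hmac
    import json

    SHARED_SECRET = b"not-a-real-secret"  # invented for illustration

    def sign_event(event):
        """Serialize the event and compute an HMAC-SHA256 signature over the body."""
        body = json.dumps(event, sort_keys=True).encode("utf-8")
        signature = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
        return body, signature

    body, signature = sign_event({"event": "page_view", "repo_public": True})
    # e.g. POST the body to a collector URL with the signature in a request header;
    # the receiver recomputes the HMAC over the body to verify integrity.
    print(signature)
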
  18. Attributes of a Good Event
     • Attach any extra data which is mutable, e.g. on a repository view we store whether the repository is public or private, etc.
     • Attach any extra data which is hard to join on later.
     • Space allowing, pretty much add everything you can think of.
     • Don't store PII or user content. (An illustrative event is sketched below.)

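An illustrative event following those attributes; the field names and values are invented, not GitHub's actual schema. Mutable or hard-to-join facts are copied onto the event at write time so later analysis doesn't need a point-in-time join.

    # Hypothetical event illustrating the attributes above (field names invented).
    repository_view_event = {
        "event": "repository_view",
        "timestamp": "2014-10-08T17:03:00Z",
        "actor_id": 12345,            # opaque id, not PII
        "repository_id": 67890,
        "repository_public": True,    # mutable: could flip to private tomorrow
        "actor_plan": "paid",         # hard to join on later if the plan changes
        "browser": "chrome",
        "country": "US",
    }
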
  19. Events We Collect
     • Page views from the browser, with browser timings.
     • A whole slew of user actions (push, star, follow, etc). See also the public GitHub Archive.
     • Mostly a task of thinking through the concrete things a user can do in your app.

  20. Processing Data (Nothing Novel)
     • Cool, so how do you get the data out? Put a SQL interface on it (Hive).
     • Typical automated workflow is to query a subset or aggregate out of S3, merge it with other data from the database in an intermediary DB, run full queries there, and then get that into something that can do stats. (Sketched below.)
     • Would be ideal to have the data and the stats functions in one (fast) place.

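A toy sketch of that workflow, with invented table contents standing in for a Hive/S3 export and an application-DB extract: merge the two, then hand the result to the stats step from the earlier slides.

    # Sketch of the workflow above; all names and numbers are invented.
    import pandas as pd
    from scipy.stats import beta

    # Aggregate pulled out of the event store (e.g. via a Hive query over S3).
    events = pd.DataFrame({
        "cohort":   ["2014-08", "2014-09"],
        "signups":  [1000, 1400],
        "retained": [250, 365],
    })

    # Dimension data pulled from the application DB.
    plans = pd.DataFrame({
        "cohort": ["2014-08", "2014-09"],
        "dominant_plan": ["free", "paid"],
    })

    # Merge in an intermediary step, then run the stats.
    merged = events.merge(plans, on="cohort")
    for row in merged.itertuples():
        lo, hi = beta.ppf([0.05, 0.95], 1 + row.retained, 1 + row.signups - row.retained)
        print(f"{row.cohort} ({row.dominant_plan}): retention {lo:.1%}-{hi:.1%}")
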
  21. Organizational Tips
     • Have a dedicated repository for asking questions.
     • Get leadership to talk about the metrics you think are important.
     • Reward curiosity.

  22. Resources
     • "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained" (bit.ly/puzzling-outcomes)
     • "A modern Bayesian look at the multi-armed bandit" (bit.ly/bayes-bandits)
     • "Bayesian Reasoning and Machine Learning" (bit.ly/bayes-ml)
     • "The Log: What every software engineer should know about real-time data's unifying abstraction" (bit.ly/the-log)