[DevDojo] Experiments and Feature Flags

mercari
December 23, 2022


In this course we will go over the essentials for understanding experimentation at Mercari, and how to instrument and configure basic experiments and feature flags. We will also briefly review how this affects our development practices.


Transcript

  1. Experiments and Feature Flags
    Fall 2022 Dev Dojo
    @rhomel


  2. Agenda
    Experimentation
    ● Why run experiments?
    ● Implementing Experiments
    ● V2 API Preview
    ● Experimentation Issues
    Feature Flags
    ● Why use feature flags?
    ● Common code-merging issues and how we can solve them.
    ● Testing In Production
    Hands-On Session


  3. Goals for this DevDojo
    Experimentation
    Provide insight into good, trustworthy experiments so that you can readily contribute back to your teams.
    Provide ideas and designs for implementing and executing experiments in the product code base.
    Feature Flags
    Increase development efficiency with faster and continuous integration practices.


  4. Experimentation


  5. Experimentation


  6. What is experimentation?
    Galileo's Leaning Tower of Pisa experiment
    “Galileo discovered through this experiment
    that the objects fell with the same acceleration,
    proving his prediction true, while at the same
    time disproving Aristotle's theory of gravity
    (which states that objects fall at speed
    proportional to their mass).”
    Original image source: Theresa knott at English Wikibooks., Pisa experiment,
    modified with vector arrows, CC BY-SA 3.0


  7. Mercari A/B Experiment Example
    Personal Inventory - Hiding Prices
    (1) Control: show both total inventory price and individual item prices
    (2) Treatment: hide total inventory price
    (3) Treatment: hide both prices


    Hiding prices
    influenced seller listing
    behavior. But why?


  8. What is experimentation?
    Hypothesis: A proposed explanation of why a particular observation occurs.
    Experiment: A repeatable method to validate or invalidate a given hypothesis.
    Analysis: What we can learn from our observations.


  9. Which answer do you think is the most correct?
    1: “A way to increase GMV and therefore Mercari’s stock price.”
    2: “An empirical method of acquiring knowledge.”
    3: “Establish a data-driven culture that informs rather than relies on the HiPPO (Highest Paid Person’s Opinion)”
    4: “Compare variations of the same web page to determine which will generate the best outcomes.”
    Why Run Experiments?
    GMV = Gross Merchandise Value


  10. Which answer do you think is the most correct? (with sources)
    1: “A way to increase GMV and therefore Mercari’s stock price.”
    Your DevDojo presenter: @rhomel
    2: “An empirical method of acquiring knowledge.”
    Wikipedia’s description of “Scientific Method”
    https://en.wikipedia.org/wiki/Scientific_method
    3: “Establish a data-driven culture that informs rather than relies on the HiPPO (Highest Paid Person’s Opinion)”
    Published Book: “Trustworthy Online Controlled Experiments”
    Ron Kohavi (Amazon, Microsoft), Diane Tang (Google), Ya Xu (LinkedIn)
    4: “Compare variations of the same web page to determine which will generate the best outcomes.”
    Nielsen Norman User Research Group on AB testing
    https://www.nngroup.com/articles/ab-testing-and-ux-research/
    Why Run Experiments?


  11. Which answer do you think is the most correct?
    1: “A way to increase GMV and therefore Mercari’s stock price.”
    This is the end result that we hope to achieve for our company.
    2: “An empirical method of acquiring knowledge.”
    This is the purpose of experimentation.
    It is the best answer because it creates an environment where the other answers become potential outcomes.
    3: “Establish a data-driven culture that informs rather than relies on the HiPPO (Highest Paid Person’s Opinion)”
    This is one of the effects of understanding the importance of experimentation.
    4: “Compare variations of the same web page to determine which will generate the best outcomes.”
    This is a tendency of how we conduct our experiments.
    Why Run Experiments?


  12. “An empirical method of acquiring knowledge.”
    What we can learn from our observations.
    Apply new knowledge about our service and marketplace.
    Why Run Experiments?


  13. “An empirical method of acquiring knowledge.”
    What we can learn from our observations.
    Apply new knowledge about our service and marketplace.
    Why Run Experiments?
    But in order to learn anything,
    we need to establish causality.
    Simply finding correlations in
    data is not enough.


  14. Why Run Experiments?
    Original chart source unmodified from: tylervigen.com/spurious-correlations, CC BY 4.0
    Data sources: U.S. Department of Agriculture and National Science Foundation
    Can we learn from correlations?
    Correlation vs Causality


  15. Why Run Experiments?
    Should we increase app crashes to increase purchases?
    *Don’t worry, this is made-up data.
    Correlation vs Causality


  16. Why Run Experiments?
    We cannot know the counterfactual result of a treatment. Additionally, there is some stochastic behavior among participants. But we can use randomized controlled experiments to gather enough statistical evidence that a treatment causes a particular effect within a population.
    Correlation vs Causality
    The counterfactual is the result that would have occurred had the treatment not been applied. For example, suppose we give a treatment such as a medicine to relieve a patient's headache. Since this participant is part of the treatment group, we cannot know what would have happened had they received the placebo instead of the actual treatment.


  17. Some excerpts from Netflix: A Culture of Learning
    “At Netflix, we’ve been leveraging experimentation and the scientific method for decades, and are fortunate to have a
    mature experimentation culture. There is broad buy-in across the company, including from the C-Suite, that, whenever possible,
    results from A/B tests or other causal inference approaches are near-requirements for decision making”
    “At Netflix, we do not view tests that do not produce winning experience as “failures.” When our members vote down new
    product experiences with their actions, we still learn a lot about their preferences, what works (and does not
    work!) for different member cohorts, and where there may, or may not be, opportunities for innovation.”
    Why Run Experiments?
    What Other Companies Are Doing


  18. Experiment Design is a description of our hypothesis and how we plan on using our experiment to prove or disprove
    it.
    Common components of an Experiment Design Document at Mercari:
    ● Hypothesis
    ● Variables and their associated treatments
    ● Metrics
    ○ Goal Metrics
    ○ Guardrail Metrics
    ● Analysis
    ● Conclusions and Future Actions
    Experiment Design
    In addition to reviewing the
    variants and treatments, also
    carefully look at the suggested
    guardrail metrics. For example
    certain feature interactions
    within the code base can lead to
    undesirable service incidents.


  19. Implementing
    Experiments


  20. Experiment Development Lifecycle
    Experiment Design → Treatment or Variant Development → App Release → A/A Tests → A/B Test → Analysis → Feature and Experiment Cleanup Development → Experiment Archival
    For client code deployed to customer devices, be careful not to archive experiments or flags until we confirm old clients will still operate properly without the flag.


  21. V1 Configuration Web Dashboard
    Configuration Workflow
    New Experiment Configuration Draft (or Edit Existing Experiment Configuration Draft) → Review and Publish Draft Changes → Updated Configuration Published to Prod and Dev Environments
    Only members with the approver role may publish changes.


  22. Let’s take a brief look at our current Experiments Configuration Web Dashboard
    V1 Configuration Web Dashboard


  23. 5 minute break


  24. - name: the experiment identifier. Case sensitive and cannot contain spaces.
    - platform: ios and android require a minimum version (integer). These versions should match Android version numbers or iOS build numbers.
    - weights: percentage described as a decimal (0.5 = 50%) of threshold traffic to distribute to a particular variant. Changing weight ratios while an experiment is running will cause some users to be switched between variants.
    - threshold: percentage of all traffic to divert to the experiment.
    - start/end date: the time window during which the API will serve variant distributions for your experiment. If the API is accessed outside of the start/end dates, the experiment will not be included in the result set, so all client versions must always define an internal default value.
    V1 API Configuration Terminology
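
    To make these fields concrete, here is a purely illustrative Go sketch of how a configuration with these fields might be modeled. The type and field names are assumptions for demonstration only, not the actual platform schema.

    package config

    import "time"

    // ExperimentConfig is a hypothetical model of a V1 experiment configuration
    // (illustrative only; not the real dashboard schema).
    type ExperimentConfig struct {
        Name        string         // experiment identifier: case sensitive, no spaces
        MinVersions map[string]int // minimum version per platform, e.g. "ios", "android"
        Weights     []float64      // per-variant share of threshold traffic (0.5 = 50%)
        Threshold   float64        // share of all traffic diverted to the experiment
        Start, End  time.Time      // the API serves variants only within this window
    }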


  25. GET /experiments/get_running_experiments_by_name?names=Experiment1,Experiment2,Experiment3

    {
      "result": "OK",
      "data": {
        "experiment_results": [
          {
            "name": "Experiment2",
            "variant": 2
          },
          {
            "name": "Experiment3",
            "variant": 0
          }
        ]
      },
      "meta": {}
    }
    V1 API Details
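
    As a rough illustration of consuming this response shape in Go, here is a small sketch; the type names and the fallback note are assumptions based on the JSON above, not an official client library.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // v1Response mirrors the V1 response shape shown above (illustrative only).
    type v1Response struct {
        Result string `json:"result"`
        Data   struct {
            ExperimentResults []struct {
                Name    string `json:"name"`
                Variant uint   `json:"variant"`
            } `json:"experiment_results"`
        } `json:"data"`
    }

    func main() {
        body := []byte(`{"result":"OK","data":{"experiment_results":[{"name":"Experiment2","variant":2},{"name":"Experiment3","variant":0}]},"meta":{}}`)
        var resp v1Response
        if err := json.Unmarshal(body, &resp); err != nil {
            fmt.Println("failed to parse response:", err)
            return
        }
        // Experiment1 is missing from the response, so the client must fall back
        // to its own predefined default variant for it.
        for _, r := range resp.Data.ExperimentResults {
            fmt.Printf("%s -> variant %d\n", r.Name, r.Variant)
        }
    }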


  26. In addition to a valid authentication token, the following headers should be present:
    - X-Platform: the value here will be compared to the selected platform parameters
    - X-App-Version: the value here will be compared to the min-version parameter
    depending on the platform specified in the previous parameter
    If either header does not match the experiment configuration, variants will not be assigned.
    V1 API Server Headers
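
    A minimal sketch of attaching these headers in Go; the authorization header and the example values are placeholders, and only X-Platform and X-App-Version come from the description above.

    package experiments

    import "net/http"

    // newRequest builds a request carrying the headers the V1 API compares
    // against an experiment's platform and min-version parameters.
    func newRequest(url, platform, appVersion, token string) (*http.Request, error) {
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("X-Platform", platform)      // e.g. "ios" or "android"
        req.Header.Set("X-App-Version", appVersion) // e.g. a build or version number
        // Placeholder for whatever authentication scheme the service actually uses.
        req.Header.Set("Authorization", "Bearer "+token)
        return req, nil
    }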


  27. V1 Usage
    Your teams likely already have a method for retrieving experiment variants. So usage will
    be implementation specific.
    In the following section we will go over a naive implementation so you can familiarize
    yourselves with general architecture nuances and gotchas.


  28. package main

    import (
        "fmt"
        "log"
        "net/http"
        "net/url"
        "strings"
    )

    func main() {
        tutorial := &Tutorial{
            eventLogger: &EventLogger{},
        }
        experiments := GetExperiments("Experiment_1", "Experiment_2", "Experiment_3")
        variant := experiments["Experiment_2"] // *** What is wrong here?
        tutorial.Show(variant)
    }

    // Tutorial decides which version of the tutorial to show, if any.
    type Tutorial struct {
        eventLogger *EventLogger
    }

    func (t *Tutorial) Show(variant uint) {
        t.eventLogger.LogEvent("Experiment_2", fmt.Sprintf("variant:%d", variant))
        if variant == 0 || variant == 1 {
            // don't show anything for users not assigned to the experiment (0) and
            // the control group (1)
            return
        }
        if variant == 2 {
            t.Slideshow()
            return
        }
        if variant == 3 {
            t.Animation()
            return
        }
        if variant == 4 {
            t.DemoVideo()
            return
        }
        // *** Hands-on session: How can we improve the maintainability of this code?
    }

    func (t *Tutorial) Slideshow() {
        // omitted for brevity
    }

    func (t *Tutorial) Animation() {
        // omitted for brevity
    }

    func (t *Tutorial) DemoVideo() {
        // omitted for brevity
    }

    // Results maps experiment names to their assigned variants.
    type Results map[string]uint

    func GetExperiments(names ...string) Results {
        // *** What is wrong with this function implementation???
        var response *http.Response
        var err error
        const maxRetries = 10
        for c := 0; c < maxRetries; c++ {
            response, err = http.Get(generateURL(names))
            if err != nil {
                log.Println("failed to fetch experiments")
            }
            if err == nil {
                break
            }
        }
        if err != nil {
            return Results{}
        }
        return parseExperiments(response)
    }

    func parseExperiments(response *http.Response) Results {
        // omitted for brevity
        return Results{}
    }

    func generateURL(names []string) string {
        param := url.QueryEscape(strings.Join(names, ","))
        return fmt.Sprintf("https://localhost:55555/get_running_experiments_by_name?names=%s", param)
    }

    type EventLogger struct{}

    func (el *EventLogger) LogEvent(details ...string) {
        // Omitted for brevity. Assume all relevant details are properly logged.
    }

    V1 Sample Usage
    This is a deliberately naive way to use the API. It is not a model of a good implementation.


  29. func main() {
        tutorial := &Tutorial{
            eventLogger: &EventLogger{},
        }
        experiments := GetExperiments("Experiment_1", "Experiment_2", "Experiment_3")
        variant := experiments["Experiment_2"] // *** What is wrong here?
        tutorial.Show(variant)
    }
    V1 Sample Usage: Key Concerns (1)


  30. func main() {
        tutorial := &Tutorial{
            eventLogger: &EventLogger{},
        }
        experiments := GetExperiments("Experiment_1", "Experiment_2", "Experiment_3")
        variant := experiments["Experiment_2"] // *** What is wrong here?
        tutorial.Show(variant)
    }
    V1 Sample Usage: Key Concerns (1)
    A client-defined default variant value is not clearly specified. Because this is Go, the map access returns the zero value for uint, which is 0, so the program still executes normally.
    However, other programming languages or constructs may not behave in such a safe manner. It is better to be explicit about the client's default behavior in case the experiments API becomes unavailable, as sketched below.
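
    One possible way to make the default explicit, sketched against the sample's Results type; choosing 0 ("not assigned, show nothing") as the fallback is an assumption for this tutorial experiment.

    // variantOrDefault returns the assigned variant for an experiment, or an
    // explicit client-defined default when the experiment is missing from the
    // results (for example because the experiments API was unavailable).
    func variantOrDefault(results Results, name string, def uint) uint {
        if v, ok := results[name]; ok {
            return v
        }
        return def
    }

    // In main, instead of relying on Go's implicit zero value:
    //     variant := variantOrDefault(experiments, "Experiment_2", 0) // 0 = not assigned, show nothing
    //     tutorial.Show(variant)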


  31. func GetExperiments(names ...string) Results {
        // *** What is wrong with this function implementation???
        var response *http.Response
        var err error
        const maxRetries = 10
        for c := 0; c < maxRetries; c++ {
            response, err = http.Get(generateURL(names))
            if err != nil {
                log.Println("failed to fetch experiments")
            }
            if err == nil {
                break
            }
        }
        if err != nil {
            return Results{}
        }
        return parseExperiments(response)
    }
    V1 Sample Usage: Key Concerns (2)


  32. func GetExperiments(names ...string) Results {
        // *** What is wrong with this function implementation???
        var response *http.Response
        var err error
        const maxRetries = 10
        for c := 0; c < maxRetries; c++ {
            response, err = http.Get(generateURL(names))
            if err != nil {
                log.Println("failed to fetch experiments")
            }
            if err == nil {
                break
            }
        }
        if err != nil {
            return Results{}
        }
        return parseExperiments(response)
    }
    V1 Sample Usage: Key Concerns (2)
    The HTTP request has no timeout and no retry backoff.
    Without a timeout, the request can block the program forever.
    Without backoff between retries, many clients can send a request storm to the server. This puts too much pressure on the server infrastructure and can lead to further crashes and issues on the server if not handled well.
    Even when the server is able to properly reject requests in an overloaded state, a client that retries this quickly will burn through its retries and fail during intermittent service downtime. A sketch of one fix follows.
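
    A minimal sketch addressing both concerns with a request timeout and exponential backoff between retries (it also requires importing the time package); the specific timeout, retry count, and backoff values are illustrative assumptions, not platform recommendations.

    func GetExperiments(names ...string) Results {
        client := &http.Client{Timeout: 3 * time.Second} // bound each request
        backoff := 500 * time.Millisecond
        const maxRetries = 3

        for attempt := 0; attempt < maxRetries; attempt++ {
            response, err := client.Get(generateURL(names))
            if err == nil {
                defer response.Body.Close()
                return parseExperiments(response)
            }
            log.Println("failed to fetch experiments:", err)
            time.Sleep(backoff)
            backoff *= 2 // exponential backoff; adding jitter would be even better
        }
        // Give up and let callers fall back to their client-defined defaults.
        return Results{}
    }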


  33. func (t *Tutorial) Show(variant uint) {
        t.eventLogger.LogEvent("Experiment_2", fmt.Sprintf("variant:%d", variant))
        if variant == 0 || variant == 1 {
            // don't show anything for users not assigned to the experiment (0) and
            // the control group (1)
            return
        }
        if variant == 2 {
            t.Slideshow()
            return
        }
        if variant == 3 {
            t.Animation()
            return
        }
        if variant == 4 {
            t.DemoVideo()
            return
        }
        // *** Hands-on session: How can we improve the maintainability of this code?
    }
    V1 Sample Usage: Key Concerns (3)
    Are we logging our events in the right place?
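
    One possible answer, sketched under the assumption that the display event should be logged only when a tutorial variant is actually shown; the switch also hints at the maintainability question above.

    func (t *Tutorial) Show(variant uint) {
        switch variant {
        case 2:
            t.Slideshow()
        case 3:
            t.Animation()
        case 4:
            t.DemoVideo()
        default:
            // Unassigned (0) and control (1): nothing is displayed here. Depending
            // on the experiment design, a separate "decision point" event may still
            // be needed so the analysis can compare populations.
            return
        }
        // Log the display event only after the component was actually shown.
        t.eventLogger.LogEvent("Experiment_2", fmt.Sprintf("variant:%d", variant))
    }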


  34. Ensure your request implementation properly satisfies good client-server HTTP behavior:
    ● Retry failed requests with backoff intervals.
    ● Ensure your client can react properly when the experiments API becomes unavailable.
    ● React properly to defined HTTP response status codes.
    Ensure your event logging implementation logs events correctly:
    ● Event logs should happen when it is clear the event has taken place:
    ○ When a backend response completed transmission
    ○ When a frontend client actually displays a component
    V1 Usage: Key Takeaways


  35. - Cannot easily re-randomize after the experiment starts. Therefore it is highly recommended
    to perform an A/A test. If the A/A test results in any imbalances, we must create a new
    experiment with a different name to get a new distribution.
    - Cannot isolate experiments from other running experiments. All clients querying the API will
    be considered for all experiments.
    - Cannot easily configure multiple client parameters from a single experiment.
    V1 API Key Limitations
    So my team is currently working on a new
    experimentation platform.


  36. V2 API Preview


  37. - Concept of parameter values
    - Concept of treatments (a set of parameter values)
    - A newly designed bucketing algorithm based on layers (“overlapping experiments”)
    - Ability to re-randomize buckets without updating clients
    - And more!
    V2 Preview


  38. GET /v2/parameters?name=ListButtonColor&name=ListTutorialPage

    [
      {
        "parameter": "ListButtonColor",
        "value": "blue",
        "experiment-id": "ListMoreColors-1234"
      },
      {
        "parameter": "ListTutorialPage",
        "value": "video-animation",
        "experiment-id": "ListTutorials-6789"
      }
    ]
    V2 Preview
    This is still on-going work. It is just a sample for demonstrative purposes.


  39. Experimentation Issues


  40. - Implementation bugs can obviously cause issues in experiments. For example, we may observe an extremely large increase or decrease in key metrics without observing any notable issues like app crashes or direct user feedback. Twyman's Law applies any time we observe a hard-to-believe statistic: a figure that looks surprising or unusual is usually wrong, so check the implementation first.
    - Inaccurate event logging is another possible problem. For example, if we naively believe we can just use server event logs for features that affect client-side user metrics, we may not get accurate interpretations because the communication between the client and server is lossy. We must be careful to accurately log the state of the client app, in particular the variant and experiment configuration that was actually displayed.
    General Experimentation Issues
    Implementation and Logging Bugs


  41. - Our V1 experimentation system uses an MD5-based hashing algorithm for randomization. Therefore its randomization traits are limited by the properties of MD5 hashing, and we have observed imbalances because of this. (A simplified sketch of hash-based bucketing follows.)
    - Even given purely random assignment, statistical hypothesis testing is still based on probabilities, so there is always a small chance that we observe some sort of unexpected result.
    General Experimentation Issues
    Observed Effects Due to Chance
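
    For intuition, deterministic bucketing typically hashes a stable key per user and experiment. The following is a simplified illustration of the general idea, not the actual V1 algorithm.

    package bucketing

    import (
        "crypto/md5"
        "encoding/binary"
    )

    // bucket deterministically maps a user to one of n buckets for a given
    // experiment by hashing a stable key (simplified illustration only).
    func bucket(experimentName, userID string, n uint32) uint32 {
        sum := md5.Sum([]byte(experimentName + ":" + userID))
        // Interpret the first 4 bytes of the digest as an unsigned integer.
        return binary.BigEndian.Uint32(sum[:4]) % n
    }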


  42. - A key assumption for any randomized controlled experiment is that there are no confounding variables in the experiment design. A common potential confounding variable in our environment is the effect of latency on user behavior. For example, it is possible that our users' behavior is sensitive to the responsiveness of our application and APIs. Therefore, if we test a new feature that is noticeably slower to respond than the feature it replaces, we may inadvertently measure the effect of latency rather than the feature itself.
    - We can also inadvertently introduce confounding variables through the time period over which we run our experiment. For example, user behavior is likely influenced by time of day, day of the week, and seasonal aspects (holiday seasons vs non-holidays, winter vs summer, etc.). So the chosen experimentation period may also introduce uncontrolled biases.
    General Experimentation Issues
    Confounding Variables


  43. - SUTVA basically means the experiment units do not interact with each other. For example
    applying a treatment to a single user should not have any spillover effect to other users in
    the experiment.
    - Unfortunately our environment is a “two-sided marketplace” where buyers and sellers
    interact. So it is fairly easy to come up with experiment ideas that violate SUTVA.
    General Experimentation Issues
    Stable Unit Treatment Value Assumption (SUTVA)


  44. Feature Flags


  45. ● We want to simplify releasing and merging code changes
    ● We want to have our main branch in a releasable state
    ● We want to be able to activate and deactivate code that is already released
    Why Use Feature Flags?


  46. Common Development Scenario With Branches
    [Diagram: your-branch and alice-branch diverge from main over several commits, ending in a big merge conflict]
    When multiple developers work on the same repository in separate branches, merge conflicts become a common pain point that slows down releasing code. Additionally, large merges have a higher chance of producing out-of-sync changes and so-called "bad merges". This becomes another source of wasted development effort and potential bugs.


  47. Productivity Goal: Fast Code Merges
    [Diagram: developers merge their changes directly into main — "We want to be able to merge as soon as possible."]
    In our ideal scenario we would like to merge as soon as possible so that our changes become available to the rest of the developers. This reduces the possibility of large merge conflicts and helps everyone work on an up-to-date version of the code base. But if code is not always in a mergeable and, more importantly, releasable state, how can we accomplish this?


  48. Potential Solution 1: rebasing
    Encourage others to rebase branches regularly to pick up merged changes.
    Good: Everyone keeps their branches up to date reducing the possibility of a large
    merge conflict later.
    Bad: Individual developers spend time keeping their branches up to date.
    Additionally, their local changes are not immediately visible to others.


  49. Potential Solution 2: dev and release branches
    Automatically attempt to merge changes into an unstable “development”
    branch.
    Good: Everyone synchronizes code into a single branch.
    Bad: Must introduce release branches to stabilize development. This leads to
    multiple branch management for each release.


  50. Not Ideal Solutions
    Both solutions tend to view the merge problem as a source control problem. But
    what if we step away from source control and identify what code base traits we
    want to have:
    ● We would like to have the code base in a releasable state at any time.
    ● We would like developers to publish their changes as early as possible for
    shared visibility in the team.


  51. We can accomplish one trait if we simply merge changes immediately.
    But what about in-progress changes or changes that are not ready for end-users?


  52. func main() {
        // how can both versions co-exist in the same code-base on the same branch?
    -   showOld()
    +   showNew()
    }

    func showOld() {}
    func showNew() {}


  53. var featureToggle = "old"

    func main() {
        if featureToggle == "new" {
            showNew()
        } else {
            showOld()
        }
    }


  54. Feature Toggles and Feature Flags
    Feature toggles or feature flags hide code paths. This enables us to merge code
    that is not yet ready so we can achieve our ideal development scenario: fast
    merges and a single releasable trunk branch.
    This strategy is known as trunk based development.


  55. Testing In Production


  56. Testing In Production
    If we combine feature flags with our experiments API, we gain the ability to remotely activate and deactivate any code path, and therefore any feature. This gives us a new capability: testing in production.
    But why would we want to test in production?
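
    A rough sketch of the idea in Go, reusing the V1-style Results lookup from the earlier sample as a remotely controlled flag; the flag name and the variant semantics here are assumptions for illustration.

    // isFeatureEnabled treats an experiment as a remote rollout flag: variant 0
    // (unassigned) and 1 (control/off) keep the feature hidden, any other
    // variant turns it on. The semantics are illustrative, not a standard.
    func isFeatureEnabled(results Results, flagName string) bool {
        return results[flagName] > 1
    }

    func renderListing(results Results) {
        if isFeatureEnabled(results, "NewListingFlow_Flag") {
            showNewListingFlow() // merged but dark until remotely enabled
            return
        }
        showOldListingFlow() // current behavior stays the safe default
    }

    func showNewListingFlow() { /* omitted for brevity */ }
    func showOldListingFlow() { /* omitted for brevity */ }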


  57. Typical App Development Lifecycle
    New Feature Development → In-House Testing and QA → App Release
    In-house testing and QA is slow and expensive, but still necessary. New client releases (especially iOS and Android) can take hours.


  58. Lifecycle with Testing In Production
    New Feature Development → In-House Testing and QA → App Release with New Feature DISABLED → Remotely Enable Feature for 1% of users → Verify no anomalies → Continue Feature Roll-Out
    If a major incident occurs, we now limit the risk to 1% of our user-base. Additionally we have the ability to disable the feature completely without deploying a new release.


  59. Feature Flags Conclusion
    ● Trunk based development simplifies our development and release processes
    by allowing us to merge early and release early.
    ● We reuse our existing Experiments API for feature flags and new feature
    rollouts.
    ● We gain a new testing step called Testing In Production to reduce the risks
    associated with releasing new features.


  60. End of Presentation! 🎉
