[DevDojo] Experiments and Feature Flags

mercari
December 23, 2022

In this course we will go over the essentials for understanding experimentation at Mercari, and how to instrument and configure basic experiments and feature flags. We will also briefly review how this affects our development practices.


Transcript

  1. Experiments and Feature Flags Fall 2022 Dev Dojo @rhomel

  2. Agenda

    Experimentation
    • Why run experiments?
    • Implementing Experiments
    • V2 API Preview
    • Experimentation Issues

    Feature Flags
    • Why use feature flags?
    • Common code-merging issues and how we can solve them.
    • Testing In Production

    Hands-On Session
  3. Goals for this DevDojo

    Experimentation: Provide insight into good, trustworthy experiments so that you can readily contribute back to your teams. Provide ideas and designs for implementing and executing experiments in the product code base.
    Feature Flags: Increase development efficiency with faster and continuous integration practices.
  4. Experimentation

  5. Experimentation

  6. What is experimentation? Galileo's Leaning Tower of Pisa experiment “Galileo

    discovered through this experiment that the objects fell with the same acceleration, proving his prediction true, while at the same time disproving Aristotle's theory of gravity (which states that objects fall at speed proportional to their mass).” Original image source: Theresa knott at English Wikibooks., Pisa experiment, modified with vector arrows, CC BY-SA 3.0
  7. Mercari A/B Experiment Example: Personal Inventory - Hiding Prices

    (1) Control: show both total inventory price and individual item prices. (2) Treatment: hide total inventory price. (3) Treatment: hide both prices. Hiding prices influenced seller listing behavior. But why?
  8. What is experimentation?

    Hypothesis: a proposed explanation of why a particular observation occurs. Experiment: a repeatable method to validate or invalidate a given hypothesis. Analysis: what we can learn from our observations.
  9. Which answer do you think is the most correct? 1:

    “A way to increase GMV and therefore Mercari’s stock price.” 2: “An empirical method of acquiring knowledge.” 3: “Establish a data-driven culture that informs rather than relies on the HiPPO (Highest Paid Person’s Opinion)” 4: “Compare variations of the same web page to determine which will generate the best outcomes.” Why Run Experiments? GMV = Gross Merchandising Value
  10. Which answer do you think is the most correct? (with

    sources) 1: “A way to increase GMV and therefore Mercari’s stock price.” Your DevDojo presenter: @rhomel 2: “An empirical method of acquiring knowledge.” Wikipedia’s description of “Scientific Method” https://en.wikipedia.org/wiki/Scientific_method 3: “Establish a data-driven culture that informs rather than relies on the HiPPO (Highest Paid Person’s Opinion)” Published Book: “Trustworthy Online Controlled Experiments” Ron Kohavi (Amazon, Microsoft), Diane Tang (Google), Ya Xu (LinkedIn) 4: “Compare variations of the same web page to determine which will generate the best outcomes.” Nielsen Norman User Research Group on AB testing https://www.nngroup.com/articles/ab-testing-and-ux-research/ Why Run Experiments?
  11. Which answer do you think is the most correct? 1:

    “A way to increase GMV and therefore Mercari’s stock price.” This is the end result that we hope to achieve for our company. 2: “An empirical method of acquiring knowledge.” This is the purpose of experimentation. It is the best answer because it creates an environment where the other answers become potential outcomes. 3: “Establish a data-driven culture that informs rather than relies on the HiPPO (Highest Paid Person’s Opinion)” This is one of the effects of understanding the importance of experimentation. 4: “Compare variations of the same web page to determine which will generate the best outcomes.” This is a tendency of how we conduct our experiments. Why Run Experiments?
  12. “An empirical method of acquiring knowledge.” What we can learn

    from our observations. Apply new knowledge about our service and marketplace. Why Run Experiments?
  13. “An empirical method of acquiring knowledge.” What we can learn

    from our observations. Apply new knowledge about our service and marketplace. Why Run Experiments? But in order to learn anything, we need to establish causality. Simply finding correlations in data is not enough.
  14. Why Run Experiments? Original chart source unmodified from: tylervigen.com/spurious-correlations, CC

    BY 4.0 Data sources: U.S. Department of Agriculture and National Science Foundation Can we learn from correlations? Correlation vs Causality
  15. Why Run Experiments? Should we increase app crashes to increase

    purchases? *Don’t worry, this is made-up data. Correlation vs Causality
  16. Why Run Experiments? Correlation vs Causality

    We cannot know the counterfactual result of a treatment. Additionally, there is some stochastic behavior between participants. But we can use randomized controlled experiments to gather enough statistical evidence that a treatment causes a particular effect within a population. The counterfactual is the result had the treatment not been applied. For example, suppose we give a treatment such as a medicine to relieve a patient's headache. Since this participant is part of the treatment group, we cannot know what would have happened to them had they received the placebo instead of the actual treatment.
  17. Some excerpts from Netflix: A Culture of Learning “At Netflix,

    we’ve been leveraging experimentation and the scientific method for decades, and are fortunate to have a mature experimentation culture. There is broad buy-in across the company, including from the C-Suite, that, whenever possible, results from A/B tests or other causal inference approaches are near-requirements for decision making” “At Netflix, we do not view tests that do not produce winning experience as “failures.” When our members vote down new product experiences with their actions, we still learn a lot about their preferences, what works (and does not work!) for different member cohorts, and where there may, or may not be, opportunities for innovation.” Why Run Experiments? What Other Companies Are Doing
  18. Experiment Design is a description of our hypothesis and how

    we plan on using our experiment to prove or disprove it. Common components of an Experiment Design Document at Mercari: • Hypothesis • Variables and their associated treatments • Metrics ◦ Goal Metrics ◦ Guardrail Metrics • Analysis • Conclusions and Future Actions Experiment Design In addition to reviewing the variants and treatments, also carefully look at the suggested guardrail metrics. For example certain feature interactions within the code base can lead to undesirable service incidents.
  19. Implementing Experiments

  20. Experiment Development Lifecycle

    Experiment Design → Treatment or Variant Development → App Release → A/A Tests → A/B Test → Analysis → Feature and Experiment Cleanup → Experiment Archival

    For client code deployed to customer devices, be careful not to archive experiments or flags until we confirm old clients will still operate properly without the flag.
  21. V1 Configuration Web Dashboard: Configuration Workflow

    New Experiment → Configuration Draft → Review and Publish Draft Changes → Configuration Published to Prod and Dev Environments
    Edit Existing Experiment → Configuration Draft → Updated Configuration → (same review and publish flow)

    Only members with the approver role may publish changes.
  22. Let’s take a brief look at our current Experiments Configuration

    Web Dashboard V1 Configuration Web Dashboard
  23. 5 minute break

  24. V1 API Configuration Terminology

    - name: the experiment identifier. Case sensitive and cannot contain spaces.
    - platform: ios and android require a minimum version (integer). These versions should match Android version codes or iOS build numbers.
    - weights: the percentage, described as a decimal (0.5 = 50%), of threshold traffic to distribute to a particular variant. Changing weight ratios while an experiment is running will cause some users to be switched between variants.
    - threshold: the percentage of all traffic to divert to the experiment.
    - start/end date: the window during which the API will serve variant distributions for your experiment. If the API is accessed outside of the start/end dates, the experiment will not be included in the result set, so all client versions must always define an internal default value.
  25. V1 API Details

    GET /experiments/get_running_experiments_by_name?names=Experiment1,Experiment2,Experiment3

    {
      "result": "OK",
      "data": {
        "experiment_results": [
          { "name": "Experiment2", "variant": 2 },
          { "name": "Experiment3", "variant": 0 }
        ]
      },
      "meta": {}
    }
  26. V1 API Server Headers

    In addition to a valid authentication token, the following headers should be present:
    - X-Platform: the value here will be compared to the selected platform parameters
    - X-App-Version: the value here will be compared to the min-version parameter for the platform specified in the previous header
    If either header does not match the experiment configuration, variants will not be assigned.
  27. V1 Usage Your teams likely already have a method for

    retrieving experiment variants. So usage will be implementation specific. In the following section we will go over a naive implementation so you can familiarize yourselves with general architecture nuances and gotchas.
  28. V1 Sample Usage

    package main

    import (
        "fmt"
        "log"
        "net/http"
        "net/url"
        "strings"
    )

    func main() {
        tutorial := &Tutorial{
            eventLogger: &EventLogger{},
        }
        experiments := GetExperiments("Experiment_1", "Experiment_2", "Experiment_3")
        variant := experiments["Experiment_2"] // *** What is wrong here?
        tutorial.Show(variant)
    }

    // Tutorial decides which version of the tutorial to show, if any.
    type Tutorial struct {
        eventLogger *EventLogger
    }

    func (t *Tutorial) Show(variant uint) {
        t.eventLogger.LogEvent("Experiment_2", fmt.Sprintf("variant:%d", variant))
        if variant == 0 || variant == 1 {
            // don't show anything for users not assigned to the experiment (0) and
            // the control group (1)
            return
        }
        if variant == 2 {
            t.Slideshow()
            return
        }
        if variant == 3 {
            t.Animation()
            return
        }
        if variant == 4 {
            t.DemoVideo()
            return
        }
        // *** Hands-on session: How can we improve maintainability of this code?
    }

    func (t *Tutorial) Slideshow() { /* omitted for brevity */ }
    func (t *Tutorial) Animation() { /* omitted for brevity */ }
    func (t *Tutorial) DemoVideo() { /* omitted for brevity */ }

    // Results maps experiment names to their assigned variants.
    type Results map[string]uint

    func GetExperiments(names ...string) Results {
        // *** What is wrong with this function implementation???
        var response *http.Response
        var err error
        const maxRetries = 10
        for c := 0; c < maxRetries; c++ {
            response, err = http.Get(generateURL(names))
            if err != nil {
                log.Println("failed to fetch experiments")
            }
            if err == nil {
                break
            }
        }
        if err != nil {
            return Results{}
        }
        return parseExperiments(response)
    }

    func parseExperiments(response *http.Response) Results {
        // omitted for brevity
        return Results{}
    }

    func generateURL(names []string) string {
        param := url.QueryEscape(strings.Join(names, ","))
        return fmt.Sprintf("https://localhost:55555/get_running_experiments_by_name?names=%s", param)
    }

    type EventLogger struct{}

    func (el *EventLogger) LogEvent(details ...string) {
        // Omitted for brevity. Assume all relevant details are properly logged.
    }

    This is a very obvious way to use the API. It is not a model of a good implementation.
  29. V1 Sample Usage: Key Concerns (1)

    func main() {
        tutorial := &Tutorial{
            eventLogger: &EventLogger{},
        }
        experiments := GetExperiments("Experiment_1", "Experiment_2", "Experiment_3")
        variant := experiments["Experiment_2"] // *** What is wrong here?
        tutorial.Show(variant)
    }
  30. V1 Sample Usage: Key Concerns (1)

    func main() {
        tutorial := &Tutorial{
            eventLogger: &EventLogger{},
        }
        experiments := GetExperiments("Experiment_1", "Experiment_2", "Experiment_3")
        variant := experiments["Experiment_2"] // *** What is wrong here?
        tutorial.Show(variant)
    }

    A client-side predefined default variant value is not clearly specified. Because this is Go, the map access will return the zero value for uint, which is 0, so the program will still execute normally. However, other programming languages or constructs may not behave in such a safe manner. It is better to be explicit about the default behavior for the client in case the experiments API becomes unavailable.
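One way to make the default explicit, as this slide suggests, is a small lookup helper. This is a sketch of the idea, not the production client; the helper name is illustrative.

```go
package main

import "fmt"

// Results maps experiment names to assigned variants (as in the sample code).
type Results map[string]uint

// VariantOrDefault makes the client-side fallback explicit instead of
// silently relying on Go's zero value for missing map keys.
func VariantOrDefault(r Results, name string, def uint) uint {
	if v, ok := r[name]; ok {
		return v
	}
	return def
}

func main() {
	// Experiment_2 is missing, e.g. because the experiments API was unavailable.
	experiments := Results{"Experiment_3": 2}
	const defaultVariant = 0 // 0 = not assigned: show nothing
	v := VariantOrDefault(experiments, "Experiment_2", defaultVariant)
	fmt.Println(v) // 0, but now by explicit choice rather than by accident
}
```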
  31. V1 Sample Usage: Key Concerns (2)

    func GetExperiments(names ...string) Results {
        // *** What is wrong with this function implementation???
        var response *http.Response
        var err error
        const maxRetries = 10
        for c := 0; c < maxRetries; c++ {
            response, err = http.Get(generateURL(names))
            if err != nil {
                log.Println("failed to fetch experiments")
            }
            if err == nil {
                break
            }
        }
        if err != nil {
            return Results{}
        }
        return parseExperiments(response)
    }
  32. V1 Sample Usage: Key Concerns (2)

    func GetExperiments(names ...string) Results {
        // *** What is wrong with this function implementation???
        var response *http.Response
        var err error
        const maxRetries = 10
        for c := 0; c < maxRetries; c++ {
            response, err = http.Get(generateURL(names))
            if err != nil {
                log.Println("failed to fetch experiments")
            }
            if err == nil {
                break
            }
        }
        if err != nil {
            return Results{}
        }
        return parseExperiments(response)
    }

    The HTTP request has no timeout and no backoff implementation. The lack of a timeout means the request can block the program forever. The lack of retry backoff means many clients can send a request storm to the server, putting excessive pressure on the server infrastructure and potentially leading to further crashes and issues. Even when the server is able to properly deny requests in an overloaded state, the client will retry too quickly and fail even for intermittent service downtime.
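A sketch of how the two missing pieces might look in Go: a per-request client timeout plus exponentially growing, jittered retry delays. The URL, retry counts, and delay values are illustrative, not the production configuration.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// fetchWithBackoff issues a GET with a bounded timeout and retries
// failed attempts with exponential backoff plus jitter.
func fetchWithBackoff(url string, maxRetries int) (*http.Response, error) {
	client := &http.Client{Timeout: 3 * time.Second} // never block forever

	var lastErr error
	delay := 200 * time.Millisecond
	for attempt := 0; attempt < maxRetries; attempt++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			// non-5xx responses are final; let the caller inspect the status
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			lastErr = fmt.Errorf("server error: %s", resp.Status)
		} else {
			lastErr = err
		}
		// Jitter spreads retries from many clients apart in time,
		// avoiding a synchronized request storm against the server.
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay/2))))
		delay *= 2
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxRetries, lastErr)
}

func main() {
	// With no server listening, this fails fast after bounded retries
	// instead of hanging, and the caller can fall back to defaults.
	_, err := fetchWithBackoff("http://127.0.0.1:59999/experiments", 2)
	fmt.Println("fetch result:", err)
}
```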
  33. V1 Sample Usage: Key Concerns (3)

    func (t *Tutorial) Show(variant uint) {
        t.eventLogger.LogEvent("Experiment_2", fmt.Sprintf("variant:%d", variant))
        if variant == 0 || variant == 1 {
            // don't show anything for users not assigned to the experiment (0) and
            // the control group (1)
            return
        }
        if variant == 2 {
            t.Slideshow()
            return
        }
        if variant == 3 {
            t.Animation()
            return
        }
        if variant == 4 {
            t.DemoVideo()
            return
        }
        // *** Hands-on session: How can we improve maintainability of this code?
    }

    Are we logging our events in the right place?
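One possible direction for both questions on this slide: log the exposure event only after we know what was actually rendered, and use a switch for maintainability. This is a sketch, not the production logging contract; the in-memory logger and event strings are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// EventLogger records events; here it keeps them in memory so the
// sketch is self-contained and inspectable.
type EventLogger struct{ events []string }

func (el *EventLogger) LogEvent(details ...string) {
	el.events = append(el.events, strings.Join(details, " "))
}

type Tutorial struct{ eventLogger *EventLogger }

// Show logs the exposure only after it knows whether a variant's UI
// was actually displayed, instead of unconditionally on entry.
func (t *Tutorial) Show(variant uint) {
	switch variant {
	case 2:
		t.Slideshow()
	case 3:
		t.Animation()
	case 4:
		t.DemoVideo()
	default:
		// not assigned (0) or control (1): nothing was rendered
		t.eventLogger.LogEvent("Experiment_2", fmt.Sprintf("variant:%d not-displayed", variant))
		return
	}
	// the variant UI has been rendered, so the exposure event is accurate
	t.eventLogger.LogEvent("Experiment_2", fmt.Sprintf("variant:%d displayed", variant))
}

func (t *Tutorial) Slideshow() { /* omitted for brevity */ }
func (t *Tutorial) Animation() { /* omitted for brevity */ }
func (t *Tutorial) DemoVideo() { /* omitted for brevity */ }

func main() {
	t := &Tutorial{eventLogger: &EventLogger{}}
	t.Show(2)
	fmt.Println(t.eventLogger.events[0]) // Experiment_2 variant:2 displayed
}
```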
  34. V1 Usage: Key Takeaways

    Ensure your request implementation satisfies good client-server HTTP behavior:
    • Retry failed requests with backoff intervals.
    • Ensure your client reacts properly when the experiments API becomes unavailable.
    • React properly to defined HTTP response status codes.

    Ensure your event logging implementation logs events correctly:
    • Event logs should happen when it is clear the event has taken place:
    ◦ When a backend response has completed transmission
    ◦ When a frontend client actually displays a component
  35. V1 API Key Limitations

    - Cannot easily re-randomize after the experiment starts. It is therefore highly recommended to perform an A/A test. If the A/A test results in any imbalances, we must create a new experiment with a different name to get a new distribution.
    - Cannot isolate experiments from other running experiments. All clients querying the API will be considered for all experiments.
    - Cannot easily configure multiple client parameters from a single experiment.

    So my team is currently working on a new experimentation platform.
  36. V2 API Preview

  37. V2 Preview

    - Concept of parameter values
    - Concept of treatments (a set of parameter values)
    - A newly designed bucketing algorithm based on layers ("overlapping experiments")
    - Ability to re-randomize buckets without updating clients
    - And more!
  38. V2 Preview

    GET /v2/parameters?name=ListButtonColor&name=ListTutorialPage

    [
      { "parameter": "ListButtonColor", "value": "blue", "experiment-id": "ListMoreColors-1234" },
      { "parameter": "ListTutorialPage", "value": "video-animation", "experiment-id": "ListTutorials-6789" }
    ]

    This is still ongoing work and is just a sample for demonstrative purposes.
  39. Experimentation Issues

  40. General Experimentation Issues: Implementation and Logging Bugs

    - Implementation bugs can obviously cause issues in experiments. For example, we may observe an extremely large increase or decrease in key metrics without observing any notable issues like app crashes or direct user feedback. Twyman's law warns us about this: any statistic that looks surprising or unusual is most likely wrong, so hard-to-believe results should be investigated before being trusted.
    - Accurate event logging is another possible problem. For example, if we naively believe we can just use server event logs for features that affect client user metrics, we may not get accurate interpretations, because the communication between the client and server is lossy. We must be careful to accurately log the state of the client app with regard to the actually displayed variant and experiment configurations.
  41. General Experimentation Issues: Observed Effects Due to Chance

    - Our V1 experimentation system is based on an MD5 algorithm for randomization. Its randomization traits are therefore limited by the properties of MD5 hashing, and we have observed imbalances because of this.
    - Even given purely random assignment, statistical hypothesis testing is still based on probabilities, so there is always a small chance we observe some sort of unexpected result.
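For intuition, deterministic hash-based bucketing can be sketched like this: hash the user ID together with the experiment name and map the digest onto a fixed range. The real system's salt, bucket count, and algorithm details may differ; this only illustrates why the same user always gets the same assignment and why re-randomizing requires a new experiment name.

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// bucket deterministically maps (userID, experiment) onto [0, 100)
// by hashing with MD5 and reducing the first 8 bytes of the digest.
func bucket(userID, experiment string) uint64 {
	sum := md5.Sum([]byte(experiment + ":" + userID))
	return binary.BigEndian.Uint64(sum[:8]) % 100
}

func main() {
	// The same user always lands in the same bucket for a given experiment.
	fmt.Println(bucket("user-42", "Experiment_2") == bucket("user-42", "Experiment_2")) // true
	// A different experiment name yields an independent assignment,
	// which is the only way to get a fresh distribution.
	fmt.Println(bucket("user-42", "Experiment_2_v2"))
}
```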
  42. General Experimentation Issues: Confounding Variables

    - A key assumption for any randomized controlled experiment is that there are no confounding variables in the experiment design. A common potential confounding variable in our environment is the effect of latency on user behavior. It is possible that our users' behavior is sensitive to the responsiveness of our application and APIs, so if we test a new feature that is noticeably slower to respond than the feature it replaces, we may inadvertently measure the effect of the added latency, not of the feature itself.
    - Another way we can inadvertently run into confounding variables is the time period over which we run our experiment. User behavior is likely influenced by time of day, day of the week, and seasonal aspects (holiday seasons vs. non-holidays, winter vs. summer, etc.), so the chosen experimentation period may also introduce uncontrolled biases.
  43. - SUTVA basically means the experiment units do not interact

    with each other. For example applying a treatment to a single user should not have any spillover effect to other users in the experiment. - Unfortunately our environment is a “two-sided marketplace” where buyers and sellers interact. So it is fairly easy to come up with experiment ideas that violate SUTVA. General Experimentation Issues Stable Unit Treatment Value Assumption (SUTVA)
  44. Feature Flags

  45. Why Use Feature Flags?

    • We want to simplify releasing and merging code changes
    • We want to have our main branch in a releasable state
    • We want to be able to activate and deactivate code that is already released
  46. Common Development Scenario With Branches

    (diagram: main diverges into your-branch and alice-branch; after several commits, merging back produces a big merge conflict)

    When multiple developers are working on the same repository in separate branches, merge conflicts become a common pain point that slows down releasing code. Additionally, large merges carry a higher chance of out-of-sync changes and so-called "bad merges". This becomes another source of wasted development effort and potential bugs.
  47. Productivity Goal: Fast Code Merges

    We want to be able to merge as soon as possible. In our ideal scenario we would merge as soon as possible so that our changes become available to the rest of the developers. This reduces the possibility of large merge conflicts and helps everyone work on an up-to-date version of the code base. But if the code is not always in a mergeable and, more importantly, releasable state, how can we accomplish this?
  48. Potential Solution 1: rebasing Encourage others to rebase branches regularly

    to pick up merged changes. Good: Everyone keeps their branches up to date reducing the possibility of a large merge conflict later. Bad: Individual developers spend time to keep their branches up to date. Additionally their local changes are not immediately visible to others.
  49. Potential Solution 2: dev and release branches Automatically attempt to

    merge changes into an unstable “development” branch. Good: Everyone synchronizes code into a single branch. Bad: Must introduce release branches to stabilize development. This leads to multiple branch management for each release.
  50. Not Ideal Solutions Both solutions tend to view the merge

    problem as a source control problem. But what if we step away from source control and identify what code base traits we want to have: • We would like to have the code base in a releasable state at any time. • We would like developers to publish their changes as early as possible for shared visibility in the team.
  51. We can accomplish one of these traits if we simply merge changes immediately. But what about in-progress changes, or changes that are not ready for end users?
  52. func main() {
        // how can both versions co-exist in the same code base on the same branch?
        - showOld()
        + showNew()
    }

    func showOld() {}
    func showNew() {}
  53. var featureToggle = "old"

    func main() {
        if featureToggle == "new" {
            showNew()
        } else {
            showOld()
        }
    }
  54. Feature Toggles and Feature Flags Feature toggles or feature flags

    hide code paths. This enables us to merge code that is not yet ready so we can achieve our ideal development scenario: fast merges and a single releasable trunk branch. This strategy is known as trunk based development.
  55. Testing In Production

  56. Testing In Production If we combine feature flags with our

    experiments API, we gain the ability to remotely activate and deactivate any code path and therefore feature. This gives a new capability: testing in production. But why would we want to test in production?
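A minimal sketch of gating a code path behind a remotely controlled flag. Here a hardcoded map stands in for the Experiments API, and the flag and function names are illustrative; the point is only that flipping the flag value switches the active code path without a new release.

```go
package main

import "fmt"

// flags stands in for a remotely fetched flag set; in practice these values
// would come from the Experiments API rather than a hardcoded map.
var flags = map[string]bool{"new-tutorial": false}

// isEnabled reports whether a feature's code path should be active.
// Missing flags default to false, so new code stays dark until enabled.
func isEnabled(name string) bool { return flags[name] }

func showOld() { fmt.Println("old tutorial") }
func showNew() { fmt.Println("new tutorial") }

func main() {
	// Both code paths live on trunk; the remote flag decides which one runs.
	if isEnabled("new-tutorial") {
		showNew()
	} else {
		showOld()
	}
}
```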
  57. Typical App Development Lifecycle

    New Feature Development → In-House Testing and QA → App Release

    Slow and expensive, but still necessary. New client releases (especially iOS and Android) can take hours.
  58. Lifecycle with Testing In Production

    New Feature Development → In-House Testing and QA → App Release with New Feature DISABLED → Remotely Enable Feature for 1% of users → Verify no anomalies → Continue Feature Roll-Out

    If a major incident occurs, we now limit the risk to 1% of our user base. Additionally, we have the ability to disable the feature completely without deploying a new release.
  59. Feature Flags Conclusion • Trunk based development simplifies our development

    and release processes by allowing us to merge early and release early. • We reuse our existing Experiments API for feature flags and new feature rollouts. • We gain a new testing step called Testing In Production to reduce the risks associated with releasing new features.
  60. End of Presentation! 🎉