
Automating Gradle benchmarks at N26

Companion slides for my related talk, delivered at the following events:

• Android World Wide (Online) - July/2022


Ubiratan Soares

July 26, 2022


Transcript

  1. Automating Gradle benchmarks at N26 Ubiratan Soares July / 2022

  2. https://n26.com/en/careers

  3. CONTEXT

  4. About Android at N26 🧑💻 🧑💻 👩💻 • 35 Engineers • 1MM LoC • 380 modules • ~20k tests
  5. Platform Engineering at N26 Android/Core iOS/Core Web/Core NxD/Core Scalability Connectivity

    Observability Client Platform • Four Engineers in Android/Core • Core / Platform libraries • Code + Test + Build + Deploy infrastructure • CI/CD and release train automation • Gradle builds = Top priority for 2022
  6. Ways of Working in Client Platform: Discovery (2 weeks), Delivery (2 weeks), Cooldown (1 week), Housekeeping (1 week) • Continuous Discovery Framework • Product vision over the Code + Build + Test + Delivery • "Sprints" of 6 weeks (aka cycles)
  7. Solution Space Opportunity Space Ideation Research Experimentation

  8. • Scopes what we want to do • Uncovers value proposition • Propositions that address how we’ll do it • Confidence assessment • Impact assessment (Solution Space / Opportunity Space / Ideation / Research / Experimentation)
  9. https://producttalk.com

  10. “How can we learn about the impact of a given change we want to apply to our build setup without rolling out that change? How can we ensure it will even work?”
  11. Pre-mortems

  12. https://github.com/gradle/gradle-profiler

  13. Gradle Profiler Features: Tooling API • Cold/warm builds • Daemon control • Benchmarking • Profiling • Multiple build systems • Multiple profilers • Scenarios definition • Incremental builds evaluation • Reports (CSV, HTML)
  14. Gradle Profiler Features: Tooling API • Cold/warm builds • Daemon control • Benchmarking • Profiling • Multiple build systems • Multiple profilers • Scenarios definition • Incremental builds evaluation • Reports (CSV, HTML)
  15. $> gradle-profiler \
        --benchmark \
        --project-dir "~/Dev/android-machete" \
        --output-dir "~/Dev/gradle-benchmarks" \
        --scenario-file "ANC-666.scenario"
  16. None
  17. Evaluation • gradle-profiler could be a great fit for our Solution Discovery process • Data could be generated targeting the situation we wanted to achieve (not implementation details to achieve it) • High confidence for the Solution score (RICA) • Ideation pre-benchmarking could uncover implementation paths
  18. LEARNING FROM BENCHMARKS

  19. assemble-sfa-spaces {
        title = "Assemble Spaces Single-Feature app"
        tasks = ":features:spaces:app:assembleDebug"
        daemon = warm
        cleanup-tasks = ["clean"]
        warm-ups = 3
        iterations = 15
      }
  20. assemble-sfa-spaces {
        title = "Assemble Spaces Single-Feature app"
        tasks = ":features:spaces:app:assembleDebug"
        daemon = warm
        cleanup-tasks = ["clean"]
        gradle-args = ["--no-build-cache"]
        warm-ups = 3
        iterations = 15
      }
  21. assemble-sfa-spaces {
        title = "Assemble Spaces Single-Feature app"
        tasks = ":features:spaces:app:assembleDebug"
        daemon = warm
        cleanup-tasks = ["clean"]
        apply-abi-change-to = ["<path>/spaces/common/domain/model/Space.kt"]
        warm-ups = 3
        iterations = 15
      }
  22. $> gradle-profiler \
        --benchmark \
        --project-dir "~/Dev/android-machete" \
        --output-dir "~/Dev/benchmarks" \
        --scenario-file "ANC-666.scenario"

      [Diagram labels: master, ANC-666, Changes, Baseline]
  23. $> git checkout master
      $> gradle-profiler \
        --benchmark \
        --project-dir "~/Dev/android-machete" \
        --output-dir "~/Dev/benchmarks" \
        --scenario-file "ANC-666.scenario"

      [Diagram labels: master, ANC-666, Changes, Baseline]
  24. 5 scenarios per pass • 20 builds per scenario • 2 passes • 2 minutes per build • ~6.5 hours total
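A back-of-the-envelope check of the total above, as a sketch; the 2-minute figure is the rough per-build duration quoted on the slide:

    # Rough cost of one full benchmark run, using the figures from the slide above.
    scenarios_per_pass = 5
    builds_per_scenario = 20   # warm-ups + measured iterations
    passes = 2                 # baseline + changes
    minutes_per_build = 2      # rough average quoted on the slide

    total_minutes = scenarios_per_pass * builds_per_scenario * passes * minutes_per_build
    print(f"{total_minutes / 60:.1f} hours")  # ~6.7 hours, i.e. the "~6.5 hours total" above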
  25. CI

  26. None
  27. None
  28. 😎

  29. 🤔

  30. [Chart: build time in milliseconds per measured build]
  31. [Histogram: measured builds (occurrences per range) by time in milliseconds]
  32. Hypothesis Testing: Population → Sampling → Sample Data → Refinement and Validation → Probability Analysis. Mean (µ) for the population, mean (X̄) for the sample. “If I pick another sample, what chance do I have of getting the same results?”
  33. p-value: from the sample, a probability analysis yields a calculated score (e.g., Z or t) and a p-value (area), the probability of not observing this sample again
  34. ~97% probability that the means are different for the outcomes of the benchmarked task
  35. Gradle task → Benchmark #01 (status quo) and Benchmark #02 (modifications) → Paired T-test → p-value. Compare p-value and alpha (0.05): if the p-value is BIGGER, evidence that means between samples are different is WEAK; if the p-value is SMALLER, evidence that means between samples are different is STRONG
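A minimal sketch of that decision rule. The build times below are made-up numbers in milliseconds, and SciPy's ttest_rel is used here since the slide describes a paired t-test:

    # Compare benchmark #01 (status quo) against benchmark #02 (modifications)
    # with a paired t-test at alpha = 0.05. Numbers are illustrative only.
    from scipy import stats

    alpha = 0.05
    baseline = [236000, 241000, 239500, 244000, 238000, 242500, 240000, 243000]
    changes = [221000, 226500, 224000, 228000, 223500, 227000, 225000, 226000]

    _, p_value = stats.ttest_rel(changes, baseline)

    if p_value < alpha:
        print(f"p-value {p_value:.4f} < {alpha}: evidence that the means differ is STRONG")
    else:
        print(f"p-value {p_value:.4f} >= {alpha}: evidence that the means differ is WEAK")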
  36. Summary of challenges • Running large benchmarks on local machines was expensive and executions were not isolated • Consolidating data from outcomes into Google Sheets was a manual process • Reliable interpretation of results could be non-trivial, especially when disambiguating inconclusive scenarios • Statistics is powerful but hard
  37. SHIFTING LEFT EXPERIMENTS

  38. Goals • Build an automation over the complexity of running benchmarks and evaluating outcomes • Make gradle-profiler (almost) invisible • Fast results even when exercising several scenarios • Self-service solution • Non-blocking solution
  39. ⚒ sculptor 🔥 fornax 🦅 aquila machete-benchmarks

  40. Data generation Data evaluation ⚒ 🔥 🦅

  41. ⚒ sculptor: set of scripts that prepares a vanilla self-hosted Linux machine with the required tooling • 🔥 fornax: set of scripts that wraps gradle-profiler and git in an opinionated way and drives the benchmark execution • 🦅 aquila: small Python SDK that parses CSV files, pairs the data points and runs a Paired T-test on top of SciPy and NumPy
  42. Benchmarker Workflow: 🧑💻 Branch with changes + Scenario file
  43. Benchmarker Workflow: 🧑💻 Branch with changes + Scenario file → packaging → benchmark (changes) from the branch Head, benchmark (baseline) from the Merge-base with master
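A rough sketch of that workflow, assuming a local checkout of android-machete and gradle-profiler on the PATH; paths and refs are illustrative, and this is not the actual fornax implementation:

    # Benchmark the branch head (changes) and its merge-base with master (baseline)
    # using the same scenario file, producing two result sets to compare.
    import os
    import subprocess

    PROJECT_DIR = os.path.expanduser("~/Dev/android-machete")
    SCENARIO_FILE = "ANC-666.scenario"

    def run_benchmark(ref, output_dir):
        subprocess.run(["git", "checkout", ref], cwd=PROJECT_DIR, check=True)
        subprocess.run(
            ["gradle-profiler", "--benchmark",
             "--project-dir", PROJECT_DIR,
             "--output-dir", os.path.expanduser(output_dir),
             "--scenario-file", SCENARIO_FILE],
            check=True,
        )

    # Baseline = the point where the branch diverged from master.
    merge_base = subprocess.run(
        ["git", "merge-base", "master", "HEAD"],
        cwd=PROJECT_DIR, check=True, capture_output=True, text=True,
    ).stdout.strip()
    branch_head = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=PROJECT_DIR, check=True, capture_output=True, text=True,
    ).stdout.strip()

    run_benchmark(branch_head, "~/Dev/benchmarks/changes")   # benchmark (changes)
    run_benchmark(merge_base, "~/Dev/benchmarks/baseline")   # benchmark (baseline)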
  44. from aquila.models import (
          OutcomeFromScenario,
          PairedBenchmarkData,
          StatisticalEquality,
          StatisticalConclusion,
          BenchmarkEvaluation,
      )
      from scipy import stats
      import numpy as np

      class BenchmarkEvaluator(object):
          def __init__(self, benchmarks_parser):
              self.benchmarks_parser = benchmarks_parser

          def evaluate(self):
              baseline, changes = self.benchmarks_parser.parse_results()
              paired_data = self._align_benchmarks(baseline, changes)
              overview = self._extract_overview(baseline)
              outcomes = [self._extract_outcome(item) for item in paired_data]
              return BenchmarkEvaluation(overview, outcomes)
  45. @staticmethod
      def _extract_overview(suite):
          scenarios = len(suite)
          warmup_builds = 0
          measured_builds = 0
          for scenario in suite:
              warmup_builds = warmup_builds + len(scenario.warmup_builds)
              measured_builds = measured_builds + len(scenario.measured_builds)
          return scenarios, warmup_builds, measured_builds

      @staticmethod
      def _align_benchmarks(baseline, changes):
          pairs = []
          for item in baseline:
              task = item.invoked_gradle_task
              for candidate in changes:
                  if candidate.invoked_gradle_task == task:
                      pairs.append(PairedBenchmarkData(task, item, candidate))
                      break
          return pairs
  46. @staticmethod
      def _extract_outcome(paired_data):
          alpha = 0.05
          baseline = paired_data.baseline.measured_builds
          changes = paired_data.changes.measured_builds
          mean_baseline = np.mean(baseline)
          mean_changes = np.mean(changes)
          diff_abs = mean_changes - mean_baseline
          diff_improvement = "+" if diff_abs > 0 else "-"
          diff_relative_upper = (mean_changes - mean_baseline) / mean_changes
          diff_relative_lower = (mean_baseline - mean_changes) / mean_baseline
          diff_relative = diff_relative_upper if diff_abs > 0 else diff_relative_lower
          diff_percent = f"{diff_improvement} {(diff_relative * 100):.2f}%"
          improving = diff_abs < 0
          _, pvalue = stats.ttest_ind(changes, baseline)
          equality = StatisticalEquality.Equal if pvalue > alpha else StatisticalEquality.NotEqual
          improvements = mean_changes < mean_baseline and pvalue < alpha
          regression = mean_changes > mean_baseline and pvalue < alpha
          evaluation = StatisticalConclusion.Rejects if regression else StatisticalConclusion.Neutral
          conclusion = StatisticalConclusion.Accepts if improvements else evaluation
  47. @staticmethod
      def _extract_outcome(paired_data):
          alpha = 0.05
          baseline = paired_data.baseline.measured_builds
          changes = paired_data.changes.measured_builds
          mean_baseline = np.mean(baseline)
          mean_changes = np.mean(changes)
          diff_abs = mean_changes - mean_baseline
          diff_improvement = "+" if diff_abs > 0 else "-"
          diff_relative_upper = (mean_changes - mean_baseline) / mean_changes
          diff_relative_lower = (mean_baseline - mean_changes) / mean_baseline
          diff_relative = diff_relative_upper if diff_abs > 0 else diff_relative_lower
          diff_percent = f"{diff_improvement} {(diff_relative * 100):.2f}%"
          improving = diff_abs < 0
          _, pvalue = stats.ttest_ind(changes, baseline)
          equality = StatisticalEquality.Equal if pvalue > alpha else StatisticalEquality.NotEqual
          improvements = mean_changes < mean_baseline and pvalue < alpha
          regression = mean_changes > mean_baseline and pvalue < alpha
          evaluation = StatisticalConclusion.Rejects if regression else StatisticalConclusion.Neutral
          conclusion = StatisticalConclusion.Accepts if improvements else evaluation
  48. None
  49. None
  50. None
  51. LIVE DEMO

  52. FINAL REMARKS

  53. Our journey so far • Our self-service automation allowed us to run 100+ experiments since March/2022 • Thousands of Engineer-minutes saved, async-await style • Assertive solutions to improve our build setup and a clearer implementation path when delivering them • We can validate whether any input from the Android ecosystem actually works for us, avoiding ad verecundiam arguments
  54. UBIRATAN SOARES Computer Scientist made in 🇧🇷 Senior Software Engineer

    @ N26 GDE for Android and Kotlin @ubiratanfsoares ubiratansoares.dev
  55. THANKS