
Automating Gradle benchmarks at N26

Companion slides for my related talk, delivered at the following events:

• Android Worldwide (Online) - July 2022

Ubiratan Soares

July 26, 2022
Transcript

  1. Platform Engineering at N26
     Client Platform: Android/Core, iOS/Core, Web/Core, NxD/Core, Scalability, Connectivity, Observability
     • Four engineers in Android/Core
     • Core / platform libraries
     • Code + Test + Build + Deploy infrastructure
     • CI/CD and release train automation
     • Gradle builds = top priority for 2022

  2. Ways of Working in Client Platform
     Discovery (2 weeks) → Delivery (2 weeks) → Cooldown (1 week) → Housekeeping (1 week)
     • Continuous Discovery framework
     • Product vision over the Code + Build + Test + Delivery
     • "Sprints" of 6 weeks (aka cycles)

  3. Opportunity Space and Solution Space (Ideation, Research, Experimentation)
     • Opportunity Space: scopes what we want to do; uncovers the value proposition
     • Solution Space: propositions that address how we'll do it; confidence assessment; impact assessment

  4. "How can we learn about the impact of a given change we want to apply to our build setup
     without rolling out that change? How can we ensure it will even work?"

  5. Gradle Profiler features
     • Tooling API
     • Cold/warm builds
     • Daemon control
     • Benchmarking
     • Profiling
     • Multiple build systems
     • Multiple profilers
     • Scenario definitions
     • Incremental build evaluation
     • Reports (CSV, HTML)

  7. Evaluation
     • gradle-profiler could be a great fit for our Solution Discovery process
     • Data could be generated targeting the situation we wanted to achieve (not the implementation details to achieve it)
     • High confidence for the Solution score (RICA)
     • Ideation pre-benchmarking could uncover implementation paths

  8. assemble-sfa-spaces {
         title = "Assemble Spaces Single-Feature app"
         tasks = ":features:spaces:app:assembleDebug"
         daemon = warm
         cleanup-tasks = ["clean"]
         warm-ups = 3
         iterations = 15
     }

  9. assemble-sfa-spaces {
         title = "Assemble Spaces Single-Feature app"
         tasks = ":features:spaces:app:assembleDebug"
         daemon = warm
         cleanup-tasks = ["clean"]
         gradle-args = ["--no-build-cache"]
         warm-ups = 3
         iterations = 15
     }

  10. assemble-sfa-spaces {
          title = "Assemble Spaces Single-Feature app"
          tasks = ":features:spaces:app:assembleDebug"
          daemon = warm
          cleanup-tasks = ["clean"]
          apply-abi-change-to = ["<path>/spaces/common/domain/model/Space.kt"]
          warm-ups = 3
          iterations = 15
      }

  11. $> gradle-profiler \
           --benchmark \
           --project-dir "~/Dev/android-machete" \
           --output-dir "~/Dev/benchmarks" \
           --scenario-file "ANC-666.scenario"

      Branches: master (baseline) · ANC-666 (changes)

  12. $> git checkout master
      $> gradle-profiler \
           --benchmark \
           --project-dir "~/Dev/android-machete" \
           --output-dir "~/Dev/benchmarks" \
           --scenario-file "ANC-666.scenario"

      Branches: master (baseline) · ANC-666 (changes)

  13. 5 scenarios per pass · 20 builds per scenario · 2 passes · ~2 minutes per build
      ≈ 6.5 hours total

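      A minimal sketch that just reproduces the arithmetic behind that estimate, using the figures quoted on the slide:

      scenarios_per_pass = 5
      builds_per_scenario = 20    # as quoted on the slide
      passes = 2                  # baseline and changes
      minutes_per_build = 2

      total_minutes = scenarios_per_pass * builds_per_scenario * passes * minutes_per_build
      print(f"{total_minutes} minutes ≈ {total_minutes / 60:.1f} hours")  # 400 minutes ≈ 6.7 hours
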
  14. CI

  15. [Bar chart: time (milliseconds) per measured build, builds 1-10]

  16. [Histogram: measured builds, occurrences per time range (milliseconds)]

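      A minimal sketch of how such charts could be reproduced with matplotlib, assuming the measured build times
      (in milliseconds) have already been extracted from gradle-profiler's CSV output; the numbers below are made up:

      import matplotlib.pyplot as plt

      # Hypothetical measured build durations, in milliseconds
      measured_builds_ms = [218_431, 224_902, 231_118, 226_540, 219_873,
                            242_310, 228_004, 235_662, 221_450, 229_777]

      fig, (bars, hist) = plt.subplots(1, 2, figsize=(10, 4))

      # Left: one bar per measured build (as in slide 15)
      bars.bar(range(1, len(measured_builds_ms) + 1), measured_builds_ms)
      bars.set_xlabel("Measured build")
      bars.set_ylabel("Time (milliseconds)")

      # Right: occurrences per time range (as in slide 16)
      hist.hist(measured_builds_ms, bins=5)
      hist.set_xlabel("Time (milliseconds)")
      hist.set_ylabel("Occurrences per range")

      fig.tight_layout()
      plt.show()
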
  17. Hypothesis Testing
      Population (mean µ) → Sampling → Sample Data (sample mean X̄) → Refinement and Validation → Probability Analysis
      "If I pick another sample, what chance do I have of getting the same results?"

  18. p-value
      Sample → Probability Analysis → calculated score (e.g., Z or t) → p-value (area):
      the probability of not observing this sample again

  19. Benchmark #01 (status quo) vs. Benchmark #02 (modifications), same Gradle task
      → Paired T-test → p-value → compare p-value with alpha (0.05)
      • p-value is BIGGER: evidence that the means of the samples differ is WEAK
      • p-value is SMALLER: evidence that the means of the samples differ is STRONG

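      A minimal sketch of that comparison with SciPy, on made-up millisecond samples; stats.ttest_rel implements
      the paired test named on the slide (aquila, shown later, calls stats.ttest_ind):

      from scipy import stats

      alpha = 0.05

      # Made-up build times (milliseconds): same scenario, baseline vs. modified setup
      baseline = [231_000, 228_500, 234_200, 229_900, 232_700, 230_100]
      changes  = [221_400, 219_800, 224_900, 220_300, 223_100, 221_000]

      _, p_value = stats.ttest_rel(changes, baseline)  # paired t-test

      if p_value < alpha:
          print(f"p-value {p_value:.4f} < alpha: evidence that the means differ is STRONG")
      else:
          print(f"p-value {p_value:.4f} >= alpha: evidence that the means differ is WEAK")
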
  20. Summary of challenges
      • Running large benchmarks on local machines was expensive, and executions were not isolated
      • Consolidating data from outcomes into Google Sheets was a manual process
      • Reliable interpretation of results could be non-trivial, especially when disambiguating inconclusive scenarios
      • Statistics is powerful but hard

  21. Goals
      • Build an automation layer over the complexity of running benchmarks and evaluating outcomes
      • Make gradle-profiler (almost) invisible
      • Fast results, even when exercising several scenarios
      • Self-service solution
      • Non-blocking solution

  22. ⚒ sculptor: set of scripts that prepares a vanilla self-hosted Linux machine with the required tooling
      🔥 fornax: set of scripts that wraps gradle-profiler and git in an opinionated way and drives the benchmark execution
      🦅 aquila: small Python SDK that parses the CSV files, pairs the data points and runs a paired T-test on top of SciPy and NumPy

  23. Benchmarker workflow packaging
      🧑‍💻 master (merge-base) + branch with changes (head) + scenario file
      → benchmark (baseline) + benchmark (changes), as sketched below

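      A rough sketch of how that workflow could be driven from Python; the project paths, branch name and scenario
      file reuse the examples from slides 11-12, but this is not the actual fornax implementation:

      import subprocess

      PROJECT_DIR = "~/Dev/android-machete"
      OUTPUT_DIR = "~/Dev/benchmarks"
      SCENARIO_FILE = "ANC-666.scenario"
      BRANCH = "ANC-666"

      def run(*args):
          subprocess.run(args, check=True)

      def benchmark(label):
          run("gradle-profiler", "--benchmark",
              "--project-dir", PROJECT_DIR,
              "--output-dir", f"{OUTPUT_DIR}/{label}",
              "--scenario-file", SCENARIO_FILE)

      # Baseline: the merge-base between master and the branch under evaluation
      merge_base = subprocess.run(
          ["git", "merge-base", "master", BRANCH],
          capture_output=True, text=True, check=True).stdout.strip()

      run("git", "checkout", merge_base)
      benchmark("baseline")

      # Changes: the head of the branch carrying the build-setup modifications
      run("git", "checkout", BRANCH)
      benchmark("changes")
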
  24. from aquila.models import (
          OutcomeFromScenario,
          PairedBenchmarkData,
          StatisticalEquality,
          StatisticalConclusion,
          BenchmarkEvaluation,
      )

      from scipy import stats
      import numpy as np


      class BenchmarkEvaluator(object):
          """Evaluates a pair of gradle-profiler benchmark outcomes (baseline vs. changes)."""

          def __init__(self, benchmarks_parser):
              self.benchmarks_parser = benchmarks_parser

          def evaluate(self):
              # Parse both outcomes, pair them per Gradle task and evaluate each pair
              baseline, changes = self.benchmarks_parser.parse_results()
              paired_data = self._align_benchmarks(baseline, changes)
              overview = self._extract_overview(baseline)
              outcomes = [self._extract_outcome(item) for item in paired_data]
              return BenchmarkEvaluation(overview, outcomes)

  25.     @staticmethod
          def _extract_overview(suite):
              # Counts scenarios and the total warm-up/measured builds across the suite
              scenarios = len(suite)
              warmup_builds = 0
              measured_builds = 0
              for scenario in suite:
                  warmup_builds = warmup_builds + len(scenario.warmup_builds)
                  measured_builds = measured_builds + len(scenario.measured_builds)
              return scenarios, warmup_builds, measured_builds

          @staticmethod
          def _align_benchmarks(baseline, changes):
              # Pairs baseline and changes outcomes that invoked the same Gradle task
              pairs = []
              for item in baseline:
                  task = item.invoked_gradle_task
                  for candidate in changes:
                      if candidate.invoked_gradle_task == task:
                          pairs.append(PairedBenchmarkData(task, item, candidate))
                          break
              return pairs

  26.     @staticmethod
          def _extract_outcome(paired_data):
              alpha = 0.05
              baseline = paired_data.baseline.measured_builds
              changes = paired_data.changes.measured_builds
              mean_baseline = np.mean(baseline)
              mean_changes = np.mean(changes)

              # Signed relative difference between the means, formatted as a percentage
              diff_abs = mean_changes - mean_baseline
              diff_improvement = "+" if diff_abs > 0 else "-"
              diff_relative_upper = (mean_changes - mean_baseline) / mean_changes
              diff_relative_lower = (mean_baseline - mean_changes) / mean_baseline
              diff_relative = diff_relative_upper if diff_abs > 0 else diff_relative_lower
              diff_percent = f"{diff_improvement} {(diff_relative * 100):.2f}%"
              improving = diff_abs < 0

              # Means are considered different only when the p-value falls below alpha
              _, pvalue = stats.ttest_ind(changes, baseline)
              equality = StatisticalEquality.Equal if pvalue > alpha else StatisticalEquality.NotEqual
              improvements = mean_changes < mean_baseline and pvalue < alpha
              regression = mean_changes > mean_baseline and pvalue < alpha
              evaluation = StatisticalConclusion.Rejects if regression else StatisticalConclusion.Neutral
              conclusion = StatisticalConclusion.Accepts if improvements else evaluation

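      A hedged sketch of how this evaluator might be wired up end to end; BenchmarksParser, its arguments and the
      CSV locations are hypothetical placeholders for whatever component loads the two gradle-profiler outcomes:

      # Hypothetical wiring: parser name, arguments and paths are illustrative only
      parser = BenchmarksParser(
          baseline_csv="~/Dev/benchmarks/baseline/benchmark.csv",
          changes_csv="~/Dev/benchmarks/changes/benchmark.csv",
      )
      evaluation = BenchmarkEvaluator(parser).evaluate()
      print(evaluation)
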
  28. Our journey so far
      • Our self-service automation has allowed us to run 100+ experiments since March 2022
      • Thousands of engineer-minutes saved, async-await style
      • Well-targeted solutions to improve our build setup, and a clearer implementation path when delivering them
      • We can validate whether any input from the Android ecosystem actually works for us, avoiding ad verecundiam arguments

  29. UBIRATAN SOARES
      Computer Scientist made in 🇧🇷
      Senior Software Engineer @ N26
      GDE for Android and Kotlin
      @ubiratanfsoares · ubiratansoares.dev