
Automating Gradle benchmarks at N26

Companion slides for my related talk, delivered at the following events:

• Android Worldwide (Online) - July 2022

Ubiratan Soares

July 26, 2022
Transcript

  1. Platform Engineering at N26
     Client Platform: Android/Core, iOS/Core, Web/Core, NxD/Core, Scalability, Connectivity, Observability
     • Four engineers in Android/Core
     • Core / platform libraries
     • Code + Test + Build + Deploy infrastructure
     • CI/CD and release train automation
     • Gradle builds = top priority for 2022

  2. Ways of Working in Client Platform
     Discovery (2 weeks) → Delivery (2 weeks) → Cooldown (1 week) → Housekeeping (1 week)
     • Continuous Discovery framework
     • Product vision over the Code + Build + Test + Delivery
     • "Sprints" of 6 weeks (aka cycles)

  3. Opportunity Space and Solution Space (Ideation, Research, Experimentation)
     • Opportunity Space: scopes what we want to do; uncovers the value proposition
     • Solution Space: propositions that address how we'll do it; confidence assessment; impact assessment

  4. "How can we learn about the impact of a given change we want to apply to our build setup
     without rolling out that change? How can we ensure it will even work?"

  5. Gradle Profiler features
     • Tooling API
     • Cold/warm builds
     • Daemon control
     • Benchmarking
     • Profiling
     • Multiple build systems
     • Multiple profilers
     • Scenario definitions
     • Incremental build evaluation
     • Reports (CSV, HTML)

  7. Evaluation
     • gradle-profiler could be a great fit for our Solution Discovery process
     • Data could be generated targeting the situation we wanted to achieve (not the implementation details to achieve it)
     • High confidence for the Solution score (RICA)
     • Ideation pre-benchmarking could uncover implementation paths

  8. assemble-sfa-spaces {
         title = "Assemble Spaces Single-Feature app"
         tasks = ":features:spaces:app:assembleDebug"
         daemon = warm
         cleanup-tasks = ["clean"]
         warm-ups = 3
         iterations = 15
     }

  9. assemble-sfa-spaces {
         title = "Assemble Spaces Single-Feature app"
         tasks = ":features:spaces:app:assembleDebug"
         daemon = warm
         cleanup-tasks = ["clean"]
         gradle-args = ["--no-build-cache"]
         warm-ups = 3
         iterations = 15
     }

  10. assemble-sfa-spaces {
          title = "Assemble Spaces Single-Feature app"
          tasks = ":features:spaces:app:assembleDebug"
          daemon = warm
          cleanup-tasks = ["clean"]
          apply-abi-change-to = ["<path>/spaces/common/domain/model/Space.kt"]
          warm-ups = 3
          iterations = 15
      }

  11. $> gradle-profiler \
           --benchmark \
           --project-dir "~/Dev/android-machete" \
           --output-dir "~/Dev/benchmarks" \
           --scenario-file "ANC-666.scenario"

      Branches: master (baseline) · ANC-666 (changes)

  12. $> git checkout master
      $> gradle-profiler \
           --benchmark \
           --project-dir "~/Dev/android-machete" \
           --output-dir "~/Dev/benchmarks" \
           --scenario-file "ANC-666.scenario"

      Branches: master (baseline) · ANC-666 (changes)

  13. 5 scenarios per pass · 20 builds per scenario · 2 passes · ~2 minutes per build
      ≈ 6.5 hours total

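      A minimal sketch that just reproduces the arithmetic behind that estimate, using the figures quoted on the slide:

      scenarios_per_pass = 5
      builds_per_scenario = 20    # as quoted on the slide
      passes = 2                  # baseline and changes
      minutes_per_build = 2

      total_minutes = scenarios_per_pass * builds_per_scenario * passes * minutes_per_build
      print(f"{total_minutes} minutes ≈ {total_minutes / 60:.1f} hours")  # 400 minutes ≈ 6.7 hours
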
  14. CI

  15. [Bar chart: time (milliseconds) per measured build, builds 1-10]

  16. [Histogram: measured builds, occurrences per time range (milliseconds)]

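      A minimal sketch of how such charts could be reproduced with matplotlib, assuming the measured build times
      (in milliseconds) have already been extracted from gradle-profiler's CSV output; the numbers below are made up:

      import matplotlib.pyplot as plt

      # Hypothetical measured build durations, in milliseconds
      measured_builds_ms = [218_431, 224_902, 231_118, 226_540, 219_873,
                            242_310, 228_004, 235_662, 221_450, 229_777]

      fig, (bars, hist) = plt.subplots(1, 2, figsize=(10, 4))

      # Left: one bar per measured build (as in slide 15)
      bars.bar(range(1, len(measured_builds_ms) + 1), measured_builds_ms)
      bars.set_xlabel("Measured build")
      bars.set_ylabel("Time (milliseconds)")

      # Right: occurrences per time range (as in slide 16)
      hist.hist(measured_builds_ms, bins=5)
      hist.set_xlabel("Time (milliseconds)")
      hist.set_ylabel("Occurrences per range")

      fig.tight_layout()
      plt.show()
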
  17. Hypothesis Testing
      Population (mean µ) → Sampling → Sample Data (sample mean X̄) → Refinement and Validation → Probability Analysis
      "If I pick another sample, what chance do I have of getting the same results?"

  18. p-value
      Sample → Probability Analysis → calculated score (e.g., Z or t) → p-value (area):
      the probability of not observing this sample again

  19. Benchmark #01 (status quo) vs. Benchmark #02 (modifications), same Gradle task
      → Paired T-test → p-value → compare p-value with alpha (0.05)
      • p-value is BIGGER: evidence that the means of the samples differ is WEAK
      • p-value is SMALLER: evidence that the means of the samples differ is STRONG

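      A minimal sketch of that comparison with SciPy, on made-up millisecond samples; stats.ttest_rel implements
      the paired test named on the slide (aquila, shown later, calls stats.ttest_ind):

      from scipy import stats

      alpha = 0.05

      # Made-up build times (milliseconds): same scenario, baseline vs. modified setup
      baseline = [231_000, 228_500, 234_200, 229_900, 232_700, 230_100]
      changes  = [221_400, 219_800, 224_900, 220_300, 223_100, 221_000]

      _, p_value = stats.ttest_rel(changes, baseline)  # paired t-test

      if p_value < alpha:
          print(f"p-value {p_value:.4f} < alpha: evidence that the means differ is STRONG")
      else:
          print(f"p-value {p_value:.4f} >= alpha: evidence that the means differ is WEAK")
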
  20. Summary of challenges
      • Running large benchmarks on local machines was expensive, and executions were not isolated
      • Consolidating data from outcomes into Google Sheets was a manual process
      • Reliable interpretation of results could be non-trivial, especially when disambiguating inconclusive scenarios
      • Statistics is powerful but hard

  21. Goals
      • Build an automation layer over the complexity of running benchmarks and evaluating outcomes
      • Make gradle-profiler (almost) invisible
      • Fast results, even when exercising several scenarios
      • Self-service solution
      • Non-blocking solution

  22. ⚒ sculptor: set of scripts that prepares a vanilla self-hosted Linux machine with the required tooling
      🔥 fornax: set of scripts that wraps gradle-profiler and git in an opinionated way and drives the benchmark execution
      🦅 aquila: small Python SDK that parses the CSV files, pairs the data points and runs a paired T-test on top of SciPy and NumPy

  23. Benchmarker workflow packaging
      🧑‍💻 master (merge-base) + branch with changes (head) + scenario file
      → benchmark (baseline) + benchmark (changes), as sketched below

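      A rough sketch of how that workflow could be driven from Python; the project paths, branch name and scenario
      file reuse the examples from slides 11-12, but this is not the actual fornax implementation:

      import subprocess

      PROJECT_DIR = "~/Dev/android-machete"
      OUTPUT_DIR = "~/Dev/benchmarks"
      SCENARIO_FILE = "ANC-666.scenario"
      BRANCH = "ANC-666"

      def run(*args):
          subprocess.run(args, check=True)

      def benchmark(label):
          run("gradle-profiler", "--benchmark",
              "--project-dir", PROJECT_DIR,
              "--output-dir", f"{OUTPUT_DIR}/{label}",
              "--scenario-file", SCENARIO_FILE)

      # Baseline: the merge-base between master and the branch under evaluation
      merge_base = subprocess.run(
          ["git", "merge-base", "master", BRANCH],
          capture_output=True, text=True, check=True).stdout.strip()

      run("git", "checkout", merge_base)
      benchmark("baseline")

      # Changes: the head of the branch carrying the build-setup modifications
      run("git", "checkout", BRANCH)
      benchmark("changes")
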
  24. from aquila.models import (
          OutcomeFromScenario,
          PairedBenchmarkData,
          StatisticalEquality,
          StatisticalConclusion,
          BenchmarkEvaluation,
      )

      from scipy import stats
      import numpy as np


      class BenchmarkEvaluator(object):
          """Evaluates a pair of gradle-profiler benchmark outcomes (baseline vs. changes)."""

          def __init__(self, benchmarks_parser):
              self.benchmarks_parser = benchmarks_parser

          def evaluate(self):
              # Parse both outcomes, pair them per Gradle task and evaluate each pair
              baseline, changes = self.benchmarks_parser.parse_results()
              paired_data = self._align_benchmarks(baseline, changes)
              overview = self._extract_overview(baseline)
              outcomes = [self._extract_outcome(item) for item in paired_data]
              return BenchmarkEvaluation(overview, outcomes)

  25.     @staticmethod
          def _extract_overview(suite):
              # Counts scenarios and the total warm-up/measured builds across the suite
              scenarios = len(suite)
              warmup_builds = 0
              measured_builds = 0
              for scenario in suite:
                  warmup_builds = warmup_builds + len(scenario.warmup_builds)
                  measured_builds = measured_builds + len(scenario.measured_builds)
              return scenarios, warmup_builds, measured_builds

          @staticmethod
          def _align_benchmarks(baseline, changes):
              # Pairs baseline and changes outcomes that invoked the same Gradle task
              pairs = []
              for item in baseline:
                  task = item.invoked_gradle_task
                  for candidate in changes:
                      if candidate.invoked_gradle_task == task:
                          pairs.append(PairedBenchmarkData(task, item, candidate))
                          break
              return pairs

  26.     @staticmethod
          def _extract_outcome(paired_data):
              alpha = 0.05
              baseline = paired_data.baseline.measured_builds
              changes = paired_data.changes.measured_builds
              mean_baseline = np.mean(baseline)
              mean_changes = np.mean(changes)

              # Signed relative difference between the means, formatted as a percentage
              diff_abs = mean_changes - mean_baseline
              diff_improvement = "+" if diff_abs > 0 else "-"
              diff_relative_upper = (mean_changes - mean_baseline) / mean_changes
              diff_relative_lower = (mean_baseline - mean_changes) / mean_baseline
              diff_relative = diff_relative_upper if diff_abs > 0 else diff_relative_lower
              diff_percent = f"{diff_improvement} {(diff_relative * 100):.2f}%"
              improving = diff_abs < 0

              # Means are considered different only when the p-value falls below alpha
              _, pvalue = stats.ttest_ind(changes, baseline)
              equality = StatisticalEquality.Equal if pvalue > alpha else StatisticalEquality.NotEqual
              improvements = mean_changes < mean_baseline and pvalue < alpha
              regression = mean_changes > mean_baseline and pvalue < alpha
              evaluation = StatisticalConclusion.Rejects if regression else StatisticalConclusion.Neutral
              conclusion = StatisticalConclusion.Accepts if improvements else evaluation

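      A hedged sketch of how this evaluator might be wired up end to end; BenchmarksParser, its arguments and the
      CSV locations are hypothetical placeholders for whatever component loads the two gradle-profiler outcomes:

      # Hypothetical wiring: parser name, arguments and paths are illustrative only
      parser = BenchmarksParser(
          baseline_csv="~/Dev/benchmarks/baseline/benchmark.csv",
          changes_csv="~/Dev/benchmarks/changes/benchmark.csv",
      )
      evaluation = BenchmarkEvaluator(parser).evaluate()
      print(evaluation)
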
  28. Our journey so far
      • Our self-service automation has allowed us to run 100+ experiments since March 2022
      • Thousands of engineer-minutes saved, async-await style
      • Well-targeted solutions to improve our build setup, and a clearer implementation path when delivering them
      • We can validate whether any input from the Android ecosystem actually works for us, avoiding ad verecundiam arguments

  29. UBIRATAN SOARES
      Computer Scientist made in 🇧🇷
      Senior Software Engineer @ N26
      GDE for Android and Kotlin
      @ubiratanfsoares · ubiratansoares.dev