Slide 1

Automating Gradle benchmarks at N26
Ubiratan Soares
July 2022

Slide 2

https://n26.com/en/careers

Slide 3

CONTEXT

Slide 4

About Android at N26
• 35 Engineers
• 1MM LoC
• 380 modules
• ~20k tests

Slide 5

Platform Engineering at N26
Client Platform: Android/Core · iOS/Core · Web/Core · NxD/Core · Scalability · Connectivity · Observability
• Four Engineers in Android/Core
• Core / Platform libraries
• Code + Test + Build + Deploy infrastructure
• CI/CD and release train automation
• Gradle builds = top priority for 2022

Slide 6

Ways of Working in Client Platform
Cycle: Discovery (2 weeks) → Delivery (2 weeks) → Cooldown (1 week) → Housekeeping (1 week)
• Continuous Discovery Framework
• Product vision over the Code + Build + Test + Delivery
• "Sprints" of 6 weeks (aka cycles)

Slide 7

Opportunity Space · Solution Space · Ideation · Research · Experimentation

Slide 8

Opportunity Space
• Scopes what we want to do
• Uncovers value proposition

Solution Space
• Propositions that address how we'll do it
• Confidence assessment
• Impact assessment

Ideation · Research · Experimentation

Slide 9

https://producttalk.com

Slide 10

“How can we learn about the impact of a given change we want to apply to our build setup without rolling out that change? How can we ensure it will even work?”

Slide 11

Pre-mortems

Slide 12

https://github.com/gradle/gradle-profiler

Slide 13

Gradle Profiler Features
• Tooling API
• Cold/warm builds
• Daemon control
• Benchmarking
• Profiling
• Multiple build systems
• Multiple profilers
• Scenarios definition
• Incremental builds evaluation
• Reports (CSV, HTML)

Slide 14

Gradle Profiler Features
• Tooling API
• Cold/warm builds
• Daemon control
• Benchmarking
• Profiling
• Multiple build systems
• Multiple profilers
• Scenarios definition
• Incremental builds evaluation
• Reports (CSV, HTML)

Slide 15

$> gradle-profiler \
     --benchmark \
     --project-dir "~/Dev/android-machete" \
     --output-dir "~/Dev/gradle-benchmarks" \
     --scenario-file "ANC-666.scenario"

Slide 16

No content

Slide 17

Evaluation
• gradle-profiler could be a great fit for our Solution Discovery process
• Data could be generated targeting the situation we wanted to achieve (not implementation details to achieve it)
• High confidence for the Solution score (RICA)
• Ideation pre-benchmarking could uncover implementation paths

Slide 18

LEARNING FROM BENCHMARKS

Slide 19

assemble-sfa-spaces {
    title = "Assemble Spaces Single-Feature app"
    tasks = ":features:spaces:app:assembleDebug"
    daemon = warm
    cleanup-tasks = ["clean"]
    warm-ups = 3
    iterations = 15
}

Slide 20

assemble-sfa-spaces {
    title = "Assemble Spaces Single-Feature app"
    tasks = ":features:spaces:app:assembleDebug"
    daemon = warm
    cleanup-tasks = ["clean"]
    gradle-args = ["--no-build-cache"]
    warm-ups = 3
    iterations = 15
}

Slide 21

assemble-sfa-spaces {
    title = "Assemble Spaces Single-Feature app"
    tasks = ":features:spaces:app:assembleDebug"
    daemon = warm
    cleanup-tasks = ["clean"]
    apply-abi-change-to = ["/spaces/common/domain/model/Space.kt"]
    warm-ups = 3
    iterations = 15
}

Slide 22

$> gradle-profiler \
     --benchmark \
     --project-dir "~/Dev/android-machete" \
     --output-dir "~/Dev/benchmarks" \
     --scenario-file "ANC-666.scenario"

master (Baseline) · ANC-666 (Changes)

Slide 23

$> git checkout master
$> gradle-profiler \
     --benchmark \
     --project-dir "~/Dev/android-machete" \
     --output-dir "~/Dev/benchmarks" \
     --scenario-file "ANC-666.scenario"

master (Baseline) · ANC-666 (Changes)

Slide 24

5 scenarios per pass · 20 builds per scenario · 2 passes · ~2 minutes per build
5 × 20 × 2 × 2 = 400 minutes, ~6.5 hours total

Slide 25

CI

Slide 26

No content

Slide 27

No content

Slide 28

😎

Slide 29

🤔

Slide 30

(Chart: time in milliseconds per measured build, builds 1–10)

Slide 31

(Histogram: measured builds, occurrences per time range, in milliseconds)
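A minimal sketch of reproducing a distribution plot like this from benchmark output, assuming a hypothetical single-column CSV of millisecond build times (gradle-profiler's real report has more structure):

import numpy as np
import matplotlib.pyplot as plt

# Assumed layout: one measured build time (ms) per line.
times_ms = np.loadtxt("measured-builds.csv", delimiter=",")

plt.hist(times_ms, bins=8)
plt.xlabel("Time (milliseconds)")
plt.ylabel("Measured builds (occurrences per range)")
plt.show()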

Slide 32

Hypothesis Testing
Population (mean µ) → Sampling → Sample Data (mean X̄) → Refinement and Validation → Probability Analysis
“If I pick another sample, what chance do I have of getting the same results?”
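For reference, the statistic behind the paired t-test used later is computed from the per-pair differences:

t = \bar{d} / (s_d / \sqrt{n})

where \bar{d} is the mean of the differences between paired builds, s_d the standard deviation of those differences, and n the number of pairs.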

Slide 33

Probability Analysis
Sample → calculated score (e.g., Z or t) → p-value (area): the probability of not observing this sample again

Slide 34

~97% probability that the means are different for the outcomes of the benchmarked task

Slide 35

Gradle task → Benchmark #01 (status quo) + Benchmark #02 (modifications) → Paired T-test → p-value → compare p-value with alpha (0.05)
• p-value is BIGGER than alpha: evidence that means between samples are different is WEAK
• p-value is SMALLER than alpha: evidence that means between samples are different is STRONG
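A minimal, self-contained sketch of this decision rule, using scipy.stats.ttest_rel for the paired comparison; the build-time samples are made up for illustration:

from scipy import stats

# Made-up paired build times in seconds; not real benchmark data.
baseline = [251.0, 248.3, 255.1, 249.8, 252.4, 250.6]
changes = [243.2, 240.9, 246.5, 241.7, 244.0, 242.8]

alpha = 0.05
_, p_value = stats.ttest_rel(baseline, changes)  # pairs the i-th builds

if p_value < alpha:
    print(f"p={p_value:.5f} < alpha: STRONG evidence the means differ")
else:
    print(f"p={p_value:.5f} >= alpha: WEAK evidence the means differ")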

Slide 36

Summary of challenges
• Running large benchmarks on local machines was expensive, and executions were not isolated
• Consolidating data from outcomes into Google Sheets was a manual process
• Reliable interpretation of results could be non-trivial, especially when disambiguating inconclusive scenarios
• Statistics is powerful but hard

Slide 37

SHIFTING LEFT EXPERIMENTS

Slide 38

Goals
• Build an automation layer over the complexity of running benchmarks and evaluating outcomes
• Make gradle-profiler (almost) invisible
• Fast results even when exercising several scenarios
• Self-service solution
• Non-blocking solution

Slide 39

machete-benchmarks: ⚒ sculptor · 🔥 fornax · 🦅 aquila

Slide 40

Data generation: ⚒ 🔥 · Data evaluation: 🦅

Slide 41

⚒ sculptor: set of scripts that prepares a vanilla self-hosted Linux machine with the required tooling
🔥 fornax: set of scripts that wraps gradle-profiler and git in an opinionated way and drives the benchmark execution
🦅 aquila: small Python SDK that parses CSV files, pairs the data points, and runs a Paired T-test on top of SciPy and NumPy

Slide 42

Benchmarker Workflow
🧑‍💻 Branch with changes + Scenario file

Slide 43

Benchmarker Workflow
🧑‍💻 Branch with changes + Scenario file → merge-base between the branch head and master → benchmark (baseline) + benchmark (changes) → packaging
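fornax's actual scripts are not shown in the talk; a hypothetical Python sketch of the workflow above, using only real git and gradle-profiler commands, could look like this:

import subprocess

def run(*cmd):
    # Run a command, fail fast on errors, and return trimmed stdout.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

branch = run("git", "rev-parse", "--abbrev-ref", "HEAD")
merge_base = run("git", "merge-base", "master", branch)

# Benchmark the branch head (changes), then the merge-base (baseline).
for label, revision in (("changes", branch), ("baseline", merge_base)):
    run("git", "checkout", revision)
    run("gradle-profiler", "--benchmark",
        "--project-dir", ".",
        "--output-dir", f"benchmarks/{label}",
        "--scenario-file", "ANC-666.scenario")

run("git", "checkout", branch)  # restore the working branch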

Slide 44

from aquila.models import (
    OutcomeFromScenario,
    PairedBenchmarkData,
    StatisticalEquality,
    StatisticalConclusion,
    BenchmarkEvaluation,
)
from scipy import stats
import numpy as np


class BenchmarkEvaluator(object):

    def __init__(self, benchmarks_parser):
        self.benchmarks_parser = benchmarks_parser

    def evaluate(self):
        baseline, changes = self.benchmarks_parser.parse_results()
        paired_data = self._align_benchmarks(baseline, changes)
        overview = self._extract_overview(baseline)
        outcomes = [self._extract_outcome(item) for item in paired_data]
        return BenchmarkEvaluation(overview, outcomes)

Slide 45

@staticmethod
def _extract_overview(suite):
    scenarios = len(suite)
    warmup_builds = 0
    measured_builds = 0
    for scenario in suite:
        warmup_builds = warmup_builds + len(scenario.warmup_builds)
        measured_builds = measured_builds + len(scenario.measured_builds)
    return scenarios, warmup_builds, measured_builds

@staticmethod
def _align_benchmarks(baseline, changes):
    pairs = []
    for item in baseline:
        task = item.invoked_gradle_task
        for candidate in changes:
            if candidate.invoked_gradle_task == task:
                pairs.append(PairedBenchmarkData(task, item, candidate))
                break
    return pairs

Slide 46

@staticmethod
def _extract_outcome(paired_data):
    alpha = 0.05
    baseline = paired_data.baseline.measured_builds
    changes = paired_data.changes.measured_builds
    mean_baseline = np.mean(baseline)
    mean_changes = np.mean(changes)
    diff_abs = mean_changes - mean_baseline
    diff_improvement = "+" if diff_abs > 0 else "-"
    diff_relative_upper = (mean_changes - mean_baseline) / mean_changes
    diff_relative_lower = (mean_baseline - mean_changes) / mean_baseline
    diff_relative = diff_relative_upper if diff_abs > 0 else diff_relative_lower
    diff_percent = f"{diff_improvement} {(diff_relative * 100):.2f}%"
    improving = diff_abs < 0
    _, pvalue = stats.ttest_ind(changes, baseline)
    equality = StatisticalEquality.Equal if pvalue > alpha else StatisticalEquality.NotEqual
    improvements = mean_changes < mean_baseline and pvalue < alpha
    regression = mean_changes > mean_baseline and pvalue < alpha
    evaluation = StatisticalConclusion.Rejects if regression else StatisticalConclusion.Neutral
    conclusion = StatisticalConclusion.Accepts if improvements else evaluation

Slide 47

@staticmethod
def _extract_outcome(paired_data):
    alpha = 0.05
    baseline = paired_data.baseline.measured_builds
    changes = paired_data.changes.measured_builds
    mean_baseline = np.mean(baseline)
    mean_changes = np.mean(changes)
    diff_abs = mean_changes - mean_baseline
    diff_improvement = "+" if diff_abs > 0 else "-"
    diff_relative_upper = (mean_changes - mean_baseline) / mean_changes
    diff_relative_lower = (mean_baseline - mean_changes) / mean_baseline
    diff_relative = diff_relative_upper if diff_abs > 0 else diff_relative_lower
    diff_percent = f"{diff_improvement} {(diff_relative * 100):.2f}%"
    improving = diff_abs < 0
    _, pvalue = stats.ttest_ind(changes, baseline)
    equality = StatisticalEquality.Equal if pvalue > alpha else StatisticalEquality.NotEqual
    improvements = mean_changes < mean_baseline and pvalue < alpha
    regression = mean_changes > mean_baseline and pvalue < alpha
    evaluation = StatisticalConclusion.Rejects if regression else StatisticalConclusion.Neutral
    conclusion = StatisticalConclusion.Accepts if improvements else evaluation
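To see those rules produce a verdict end to end, here is a self-contained toy run with made-up numbers; the string verdicts stand in for aquila's StatisticalConclusion values:

import numpy as np
from scipy import stats

# Made-up millisecond samples for illustration; not N26 data.
baseline = np.array([251000, 248300, 255100, 249800, 252400, 250600])
changes = np.array([243200, 240900, 246500, 241700, 244000, 242800])

alpha = 0.05
_, pvalue = stats.ttest_ind(changes, baseline)  # same call as the code above

improvement = np.mean(changes) < np.mean(baseline) and pvalue < alpha
regression = np.mean(changes) > np.mean(baseline) and pvalue < alpha
verdict = "Accepts" if improvement else ("Rejects" if regression else "Neutral")
print(f"p-value={pvalue:.5f} -> {verdict}")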

Slide 48

No content

Slide 49

No content

Slide 50

No content

Slide 51

LIVE DEMO

Slide 52

FINAL REMARKS

Slide 53

Our journey so far
• Our self-service automation has allowed us to run 100+ experiments since March 2022
• Thousands of engineer-minutes saved, async-await style
• Assertive solutions to improve our build setup, and a clearer implementation path when delivering them
• We can validate whether any input from the Android ecosystem actually works for us, avoiding ad verecundiam arguments

Slide 54

UBIRATAN SOARES
Computer Scientist made in 🇧🇷
Senior Software Engineer @ N26
GDE for Android and Kotlin
@ubiratanfsoares
ubiratansoares.dev

Slide 55

THANKS