
Automating Gradle benchmarks at N26

Companion slides for my related talk, delivered at the following events:

• Android Worldwide (online), July 2022

Ubiratan Soares

July 26, 2022


Transcript

  1. Automating Gradle
    benchmarks at N26
    Ubiratan Soares


    July / 2022

  2. https://n26.com/en/careers

  3. About Android at N26
    35 Engineers · 1M LoC · 380 modules · ~20k tests

  4. Platform Engineering at N26
    [Diagram: Client Platform: Android/Core · iOS/Core · Web/Core · NxD/Core · Scalability · Connectivity · Observability]
    • Four Engineers in Android/Core
    • Core / Platform libraries
    • Code + Test + Build + Deploy infrastructure
    • CI/CD and release train automation
    • Gradle builds = Top priority for 2022

  5. Ways of Working in Client Platform
    [Cycle: Discovery (2 weeks) → Delivery (2 weeks) → Cooldown (1 week) → Housekeeping (1 week)]
    • Continuous Discovery Framework
    • Product vision over the Code + Build + Test + Delivery
    • "Sprints" of 6 weeks (aka cycles)

  6. [Diagram: Opportunity Space and Solution Space, explored through Ideation, Research and Experimentation]

  7. Opportunity Space
    • Scopes what we want to do
    • Uncovers the value proposition
    Solution Space
    • Propositions that address how we'll do it
    • Confidence assessment
    • Impact assessment
    (Ideation · Research · Experimentation)

  8. https://producttalk.com

  9. "How can we learn about the impact of a given change we want to apply to our build setup without rolling out that change?
    How can we ensure it will even work?"

  10. https://github.com/gradle/gradle-profiler

  11. Gradle Profiler Features
    Tooling API · Cold/warm builds · Daemon control · Benchmarking · Profiling · Multiple build systems · Multiple profilers · Scenario definition · Incremental builds evaluation · Reports (CSV, HTML)


  13. $> gradle-profiler \
        --benchmark \
        --project-dir "~/Dev/android-machete" \
        --output-dir "~/Dev/gradle-benchmarks" \
        --scenario-file "ANC-666.scenario"

  14. Evaluation
    • gradle-profiler could be a great fit for our Solution Discovery process
    • Data could be generated targeting the situation we wanted to achieve (not the implementation details to achieve it)
    • High confidence for the Solution score (RICA)
    • Ideation pre-benchmarking could uncover implementation paths

  15. LEARNING FROM
    BENCHMARKS

  16. assemble-sfa-spaces {
        title = "Assemble Spaces Single-Feature app"
        tasks = ":features:spaces:app:assembleDebug"
        daemon = warm
        cleanup-tasks = ["clean"]
        warm-ups = 3
        iterations = 15
    }

  17. assemble-sfa-spaces {
        title = "Assemble Spaces Single-Feature app"
        tasks = ":features:spaces:app:assembleDebug"
        daemon = warm
        cleanup-tasks = ["clean"]
        gradle-args = ["--no-build-cache"]
        warm-ups = 3
        iterations = 15
    }

  18. assemble-sfa-spaces {
        title = "Assemble Spaces Single-Feature app"
        tasks = ":features:spaces:app:assembleDebug"
        daemon = warm
        cleanup-tasks = ["clean"]
        apply-abi-change-to = ["/spaces/common/domain/model/Space.kt"]
        warm-ups = 3
        iterations = 15
    }

  19. $> gradle-profiler \
        --benchmark \
        --project-dir "~/Dev/android-machete" \
        --output-dir "~/Dev/benchmarks" \
        --scenario-file "ANC-666.scenario"
    [Diagram: branch ANC-666 = Changes · master = Baseline]

  20. $> git checkout master
    $> gradle-profiler \
        --benchmark \
        --project-dir "~/Dev/android-machete" \
        --output-dir "~/Dev/benchmarks" \
        --scenario-file "ANC-666.scenario"
    [Diagram: branch ANC-666 = Changes · master = Baseline]

  21. 5 scenarios per pass × 20 builds per scenario × 2 passes × 2 minutes per build ≈ 400 minutes: ~6.5 hours total

  22. [Bar chart: Time (milliseconds) per measured build, builds 1–10, times up to ~270,000 ms]

  23. [Histogram: occurrences per time range, measured build times between ~180,000 and ~280,000 milliseconds]

  24. Hypothesis Testing
    [Diagram: Population (mean µ) → Sampling → Sample (mean X̄) → Data → Refinement and Validation → Probability Analysis]
    "If I pick another sample, what chance do I have to get the same results?"

  25. p-value
    [Diagram: from the sample, the Probability Analysis computes a score (e.g. Z or t); the p-value is the tail area beyond that score]
    The probability of observing a result at least as extreme as this sample, assuming the null hypothesis holds

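    A minimal sketch of this analysis with SciPy (the build times below are made up for illustration; they are not numbers from the talk):

    import numpy as np
    from scipy import stats

    # Measured build times in milliseconds, one array per benchmark pass
    baseline = np.array([201000, 198500, 205300, 199800, 202100, 200400])
    changes = np.array([188700, 192300, 186900, 190100, 189400, 191800])

    # Paired t-test: iteration i of 'changes' is compared against iteration i of 'baseline'
    t_score, p_value = stats.ttest_rel(changes, baseline)
    print(f"t = {t_score:.2f}, p-value = {p_value:.4f}")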
  26. ~97% probability that the means are different for the outcomes of the benchmarked task

  27. [Decision tree: for a Gradle task, run a Paired T-test between Benchmark #01 (status quo) and Benchmark #02 (modifications), then compare the p-value against alpha (0.05)]
    • p-value is BIGGER than alpha → evidence that the means between samples are different is WEAK
    • p-value is SMALLER than alpha → evidence that the means between samples are different is STRONG

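    The decision tree reads naturally as a tiny helper function (a sketch; the name is mine, not part of the talk's SDK):

    ALPHA = 0.05  # significance level used throughout the talk

    def evidence_of_difference(p_value, alpha=ALPHA):
        # Interpret a paired t-test p-value as in the decision tree above
        return "STRONG" if p_value < alpha else "WEAK"

    evidence_of_difference(0.03)  # "STRONG", matching the ~97% reading on the previous slide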

  28. Summary of challenges
    • Running large benchmarks on local machines was expensive and executions were not isolated
    • Consolidating data from outcomes into Google Sheets was a manual process
    • Reliable interpretation of results could be non-trivial, especially when disambiguating inconclusive scenarios
    • Statistics is powerful but hard

  29. SHIFTING LEFT
    EXPERIMENTS

  30. Goals
    • Build an automation over the complexity of running benchmarks and evaluating outcomes
    • Make gradle-profiler (almost) invisible
    • Fast results even when exercising several scenarios
    • Self-service solution
    • Non-blocking solution


  31. machete-benchmarks: ⚒ sculptor · 🔥 fornax · 🦅 aquila

  32. [Diagram: Data generation (⚒ sculptor · 🔥 fornax) → Data evaluation (🦅 aquila)]

  33. ⚒ sculptor: a set of scripts that prepares a vanilla self-hosted Linux machine with the required tooling
    🔥 fornax: a set of scripts that wraps gradle-profiler and git in an opinionated way and drives the benchmark execution
    🦅 aquila: a small Python SDK that parses CSV files, pairs the data points and runs a Paired T-test on top of SciPy and NumPy

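    A sketch of aquila's job in miniature. It assumes a simplified CSV with one row per measured build and columns scenario,time_ms; the real benchmark.csv written by gradle-profiler has a richer layout:

    import csv
    from scipy import stats

    def read_measured_builds(path):
        # Map each scenario name to its list of measured build times (ms)
        times = {}
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                times.setdefault(row["scenario"], []).append(float(row["time_ms"]))
        return times

    baseline = read_measured_builds("baseline/benchmark.csv")
    changes = read_measured_builds("changes/benchmark.csv")

    # Pair scenarios by name and run a paired t-test per scenario
    for scenario, baseline_times in baseline.items():
        if scenario in changes:
            _, p_value = stats.ttest_rel(changes[scenario], baseline_times)
            print(f"{scenario}: p-value = {p_value:.4f}")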
  34. Benchmarker Workflow
    Inputs: a branch with changes and a scenario file

  35. Benchmarker Workflow
    [Diagram: from the branch with changes and a scenario file, compute the merge-base between master and the branch head; run benchmark (baseline) at the merge-base and benchmark (changes) at the head, then package both results]

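    A rough sketch of this driver in Python, reusing only the git and gradle-profiler invocations shown on the earlier slides (the paths, output directories and helpers are illustrative, not the actual fornax code):

    import subprocess

    PROJECT_DIR = "~/Dev/android-machete"    # as on the earlier slides
    SCENARIO_FILE = "ANC-666.scenario"

    def run(*args):
        subprocess.run(args, check=True)

    def capture(*args):
        return subprocess.run(args, capture_output=True, text=True, check=True).stdout.strip()

    def benchmark(output_dir):
        run("gradle-profiler", "--benchmark",
            "--project-dir", PROJECT_DIR,
            "--output-dir", output_dir,
            "--scenario-file", SCENARIO_FILE)

    # Baseline = merge-base between master and the branch under test
    head = capture("git", "rev-parse", "HEAD")
    base = capture("git", "merge-base", "master", head)

    run("git", "checkout", base)
    benchmark("benchmarks/baseline")    # benchmark (baseline)
    run("git", "checkout", head)
    benchmark("benchmarks/changes")     # benchmark (changes)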
  36. from aquila.models import (
        OutcomeFromScenario,
        PairedBenchmarkData,
        StatisticalEquality,
        StatisticalConclusion,
        BenchmarkEvaluation,
    )
    from scipy import stats
    import numpy as np

    class BenchmarkEvaluator(object):

        def __init__(self, benchmarks_parser):
            self.benchmarks_parser = benchmarks_parser

        def evaluate(self):
            baseline, changes = self.benchmarks_parser.parse_results()
            paired_data = self._align_benchmarks(baseline, changes)
            overview = self._extract_overview(baseline)
            outcomes = [self._extract_outcome(item) for item in paired_data]
            return BenchmarkEvaluation(overview, outcomes)


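    Wiring the evaluator up is then a single call (benchmarks_parser is assumed to be any object exposing parse_results(), as used above):

    evaluation = BenchmarkEvaluator(benchmarks_parser).evaluate()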
  37. @staticmethod
    def _extract_overview(suite):
        scenarios = len(suite)
        warmup_builds = 0
        measured_builds = 0
        for scenario in suite:
            warmup_builds = warmup_builds + len(scenario.warmup_builds)
            measured_builds = measured_builds + len(scenario.measured_builds)
        return scenarios, warmup_builds, measured_builds

    @staticmethod
    def _align_benchmarks(baseline, changes):
        pairs = []
        for item in baseline:
            task = item.invoked_gradle_task
            for candidate in changes:
                if candidate.invoked_gradle_task == task:
                    pairs.append(PairedBenchmarkData(task, item, candidate))
                    break
        return pairs


  38. @staticmethod
    def _extract_outcome(paired_data):
        alpha = 0.05
        baseline = paired_data.baseline.measured_builds
        changes = paired_data.changes.measured_builds

        mean_baseline = np.mean(baseline)
        mean_changes = np.mean(changes)
        diff_abs = mean_changes - mean_baseline
        diff_improvement = "+" if diff_abs > 0 else "-"
        diff_relative_upper = (mean_changes - mean_baseline) / mean_changes
        diff_relative_lower = (mean_baseline - mean_changes) / mean_baseline
        diff_relative = diff_relative_upper if diff_abs > 0 else diff_relative_lower
        diff_percent = f"{diff_improvement} {(diff_relative * 100):.2f}%"
        improving = diff_abs < 0

        # Paired t-test over the aligned measured builds
        _, pvalue = stats.ttest_rel(changes, baseline)

        equality = StatisticalEquality.Equal if pvalue > alpha else StatisticalEquality.NotEqual
        improvements = mean_changes < mean_baseline and pvalue < alpha
        regression = mean_changes > mean_baseline and pvalue < alpha
        evaluation = StatisticalConclusion.Rejects if regression else StatisticalConclusion.Neutral
        conclusion = StatisticalConclusion.Accepts if improvements else evaluation



  40. FINAL REMARKS

  41. Our journey so far
    • Our self-service automation allowed us to run 100+ experiments since March 2022
    • Thousands of Engineer-minutes saved, async-await style
    • Assertive solutions to improve our build setup and a clearer implementation path when delivering them
    • We can validate whether any input from the Android ecosystem actually works for us, avoiding ad verecundiam arguments

  42. UBIRATAN SOARES
    Computer Scientist made in 🇧🇷
    Senior Software Engineer @ N26
    GDE for Android and Kotlin
    @ubiratanfsoares · ubiratansoares.dev
