
Automating Gradle benchmarks at N26

Companion slides for my related talk, delivered at the following events:

• Android Worldwide (Online) - July/2022

Ubiratan Soares

July 26, 2022


Transcript

  1. Automating Gradle
    benchmarks at N26
    Ubiratan Soares


    July / 2022


  2. https://n26.com/en/careers


  3. CONTEXT


  4. About Android at N26
🧑‍💻 🧑‍💻 👩‍💻
    35 Engineers · 1MM LoC · 380 modules · ~20k tests


  5. Platform Engineering at N26
    Client Platform: Android/Core · iOS/Core · Web/Core · NxD/Core · Scalability · Connectivity · Observability

    • Four Engineers in Android/Core

    • Core / Platform libraries

    • Code + Test + Build + Deploy infrastructure

    • CI/CD and release train automation

    • Gradle builds = Top priority for 2022


  6. Ways of Working in Client Platform
    Discovery (2 weeks) → Delivery (2 weeks) → Cooldown (1 week) → Housekeeping (1 week)

    • Continuous Discovery Framework

    • Product vision over the Code + Build + Test + Delivery

    • "Sprints" of 6 weeks (aka cycles)


  7. Opportunity Space · Solution Space · Ideation · Research · Experimentation


  8. Opportunity Space
    • Scopes what we want to do
    • Uncovers value proposition

    Solution Space
    • Propositions that address how we'll do it
    • Confidence assessment
    • Impact assessment

    Ideation · Research · Experimentation


  9. https://producttalk.com


  10. “How can we learn about the
    impact of a given change we
    want to apply to our build
    setup without rolling out that
    change?


    How can we ensure it will
even work?”


  11. Pre-mortems


  12. https://github.com/gradle/gradle-profiler


  13. Gradle Profiler Features
    Tooling API · Cold/warm builds · Daemon control · Benchmarking · Profiling ·
    Multiple build systems · Multiple profilers · Scenarios definition ·
    Incremental builds evaluation · Reports (CSV, HTML)


  15. $> gradle-profiler \
        --benchmark \
        --project-dir "~/Dev/android-machete" \
        --output-dir "~/Dev/gradle-benchmarks" \
        --scenario-file "ANC-666.scenario"
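
    The --benchmark flag runs every scenario from the given scenario file; per
    the feature list above, results land as CSV and HTML reports under the
    directory passed to --output-dir.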


  16. (screenshot)

  17. Evaluation
    • gradle-profiler could be a great fit for our Solution Discovery process

    • Data could be generated targeting the situation we wanted to achieve (not implementation details to achieve it)

    • High confidence for the Solution score (RICA)

    • Ideation pre-benchmarking could uncover implementation paths


  18. LEARNING FROM
    BENCHMARKS


  19. assemble-sfa-spaces {
        title = "Assemble Spaces Single-Feature app"
        tasks = ":features:spaces:app:assembleDebug"
        daemon = warm
        cleanup-tasks = ["clean"]
        warm-ups = 3
        iterations = 15
    }
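
    Note the split that matters later: warm-up builds and measured builds are
    tracked separately (see aquila's _extract_overview further down), and only
    the 15 measured iterations feed the statistics.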


  20. assemble-sfa-spaces {
        title = "Assemble Spaces Single-Feature app"
        tasks = ":features:spaces:app:assembleDebug"
        daemon = warm
        cleanup-tasks = ["clean"]
        gradle-args = ["--no-build-cache"]
        warm-ups = 3
        iterations = 15
    }


  21. assemble-sfa-spaces {
        title = "Assemble Spaces Single-Feature app"
        tasks = ":features:spaces:app:assembleDebug"
        daemon = warm
        cleanup-tasks = ["clean"]
        apply-abi-change-to = ["/spaces/common/domain/model/Space.kt"]
        warm-ups = 3
        iterations = 15
    }
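
    apply-abi-change-to tells gradle-profiler to apply an ABI-breaking edit
    (e.g. adding a public method) to the listed source file before each
    measured build, so the scenario exercises incremental compilation rather
    than a clean assembly.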


  22. $> gradle-profiler \
        --benchmark \
        --project-dir "~/Dev/android-machete" \
        --output-dir "~/Dev/benchmarks" \
        --scenario-file "ANC-666.scenario"

    Branches: master (Baseline) · ANC-666 (Changes)


  23. $> git checkout master

    $> gradle-profiler \
        --benchmark \
        --project-dir "~/Dev/android-machete" \
        --output-dir "~/Dev/benchmarks" \
        --scenario-file "ANC-666.scenario"

    Branches: master (Baseline) · ANC-666 (Changes)


  24. 5 scenarios per pass
    20 builds per scenario
    2 passes
    2 minutes per build
    ~ 6.5 hours total
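
    The back-of-the-envelope arithmetic: 5 scenarios × 20 builds × 2 passes =
    200 builds; at ~2 minutes each, that is ~400 minutes of wall-clock time,
    i.e. 6.5+ hours.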


  25. CI


  26. (screenshot)

  27. (screenshot)

  28. 😎


  29. 🤔


  30. [Bar chart: measured builds 1 to 10 on the x-axis, time in milliseconds (0 to 270,000) on the y-axis]

  31. [Histogram: build-time distribution of the measured builds, occurrences per time range in milliseconds]

  32. Hypothesis Testing

    Population (mean µ)
      ↓ sampling
    Sample (mean X̄)
      ↓ refinement and validation
    Data
      ↓ probability analysis
    "If I pick another sample, what chance do I have to get the same results?"


  33. p-value

    Sample → probability analysis → calculated score (e.g. Z or t) → p-value (area):
    the probability of not observing this sample again


  34. ~97% probability that the means are different
    for the outcomes of the benchmarked task
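
    Read against the next slide's decision rule: that corresponds to a p-value
    of roughly 0.03, below the alpha of 0.05.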


  35. Gradle task
      → Benchmark #01 (status quo) + Benchmark #02 (modifications)
      → Paired T-test → p-value
      → compare p-value and alpha (0.05)

    p-value is BIGGER: evidence that means between samples are different is WEAK
    p-value is SMALLER: evidence that means between samples are different is STRONG
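
    A minimal sketch of that decision rule with SciPy (toy numbers, not real
    benchmark data; aquila wraps the same idea, as shown later):

    from scipy import stats

    baseline = [252.1, 248.9, 255.3, 250.7, 249.5]  # build times (s), status quo
    changes = [241.2, 239.8, 244.1, 240.5, 238.9]   # build times (s), modifications

    _, pvalue = stats.ttest_rel(changes, baseline)  # paired t-test over aligned builds

    alpha = 0.05
    verdict = "STRONG" if pvalue < alpha else "WEAK"
    print(f"evidence that the means differ is {verdict} (p = {pvalue:.4f})")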



  36. Summary of challenges
    • Running large benchmarks on local machines was expensive and executions were not isolated

    • Consolidating data from outcomes into Google Sheets was a manual process

    • Reliable interpretation of results could be non-trivial, especially when disambiguating inconclusive scenarios

    • Statistics is powerful but hard


  37. SHIFTING LEFT
    EXPERIMENTS


  38. Goals
    • Build an automation over the complexity of running benchmarks and evaluating outcomes

    • Make gradle-profiler (almost) invisible

    • Fast results even when exercising several scenarios

    • Self-service solution

    • Non-blocking solution



  39. machete-benchmarks = ⚒ sculptor + 🔥 fornax + 🦅 aquila


  40. Data generation: ⚒ 🔥 · Data evaluation: 🦅


  41. ⚒ sculptor: set of scripts that prepares a vanilla self-hosted Linux machine with the required tooling

    🔥 fornax: set of scripts that wraps gradle-profiler and git in an opinionated way and drives the benchmark execution

    🦅 aquila: small Python SDK that parses CSV files, pairs the data points and runs a Paired T-test on top of SciPy and NumPy


  42. Benchmarker Workflow

    🧑‍💻 provides a branch with changes + a scenario file


  43. Benchmarker Workflow

    🧑‍💻 provides a branch with changes + a scenario file
    Git refs involved: master · head of the branch · merge-base
    → benchmark (changes), taken at the head of the branch
    → benchmark (baseline), taken at the merge-base with master
    → packaging of the results
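
    A minimal sketch of the two-pass drive that fornax automates, reconstructed
    from the commands on slides 22-23 (branch name, paths and output layout are
    hypothetical):

    import subprocess

    def run_benchmark(ref, output_dir):
        # Check out the ref, then benchmark it with the shared scenario file
        subprocess.run(["git", "checkout", ref], check=True)
        subprocess.run([
            "gradle-profiler", "--benchmark",
            "--project-dir", ".",
            "--output-dir", output_dir,
            "--scenario-file", "ANC-666.scenario",
        ], check=True)

    # Baseline at the merge-base with master, changes at the branch head
    merge_base = subprocess.run(
        ["git", "merge-base", "master", "ANC-666"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

    run_benchmark(merge_base, "benchmarks/baseline")
    run_benchmark("ANC-666", "benchmarks/changes")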


  44. from aquila.models import (
        OutcomeFromScenario,
        PairedBenchmarkData,
        StatisticalEquality,
        StatisticalConclusion,
        BenchmarkEvaluation,
    )
    from scipy import stats
    import numpy as np


    class BenchmarkEvaluator(object):

        def __init__(self, benchmarks_parser):
            self.benchmarks_parser = benchmarks_parser

        def evaluate(self):
            baseline, changes = self.benchmarks_parser.parse_results()
            paired_data = self._align_benchmarks(baseline, changes)
            overview = self._extract_overview(baseline)
            outcomes = [self._extract_outcome(item) for item in paired_data]
            return BenchmarkEvaluation(overview, outcomes)
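
    A hypothetical wiring of the evaluator (BenchmarksParser and its
    constructor are assumptions; the deck only shows that parse_results() must
    yield the baseline and changes suites):

    parser = BenchmarksParser("baseline.csv", "changes.csv")  # assumed aquila helper
    evaluation = BenchmarkEvaluator(parser).evaluate()
    for outcome in evaluation.outcomes:
        print(outcome)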



  45. @staticmethod
    def _extract_overview(suite):
        scenarios = len(suite)
        warmup_builds = 0
        measured_builds = 0
        for scenario in suite:
            warmup_builds = warmup_builds + len(scenario.warmup_builds)
            measured_builds = measured_builds + len(scenario.measured_builds)
        return scenarios, warmup_builds, measured_builds

    @staticmethod
    def _align_benchmarks(baseline, changes):
        # Pairs each baseline scenario with the changes scenario
        # that invoked the same Gradle task
        pairs = []
        for item in baseline:
            task = item.invoked_gradle_task
            for candidate in changes:
                if candidate.invoked_gradle_task == task:
                    pairs.append(PairedBenchmarkData(task, item, candidate))
                    break
        return pairs



  46. @staticmethod
    def _extract_outcome(paired_data):
        alpha = 0.05
        baseline = paired_data.baseline.measured_builds
        changes = paired_data.changes.measured_builds
        mean_baseline = np.mean(baseline)
        mean_changes = np.mean(changes)

        # Relative difference, reported against the larger of the two means
        diff_abs = mean_changes - mean_baseline
        diff_improvement = "+" if diff_abs > 0 else "-"
        diff_relative_upper = (mean_changes - mean_baseline) / mean_changes
        diff_relative_lower = (mean_baseline - mean_changes) / mean_baseline
        diff_relative = diff_relative_upper if diff_abs > 0 else diff_relative_lower
        diff_percent = f"{diff_improvement} {(diff_relative * 100):.2f}%"
        improving = diff_abs < 0

        # Paired T-test over the aligned measured builds
        _, pvalue = stats.ttest_rel(changes, baseline)
        equality = StatisticalEquality.Equal if pvalue > alpha else StatisticalEquality.NotEqual
        improvements = mean_changes < mean_baseline and pvalue < alpha
        regression = mean_changes > mean_baseline and pvalue < alpha
        evaluation = StatisticalConclusion.Rejects if regression else StatisticalConclusion.Neutral
        conclusion = StatisticalConclusion.Accepts if improvements else evaluation
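
    Note the asymmetry in the conclusion: a benchmark only Accepts when the
    mean improves and the paired test is significant at alpha = 0.05, while an
    insignificant result lands on Neutral rather than Rejects, in line with the
    non-blocking goal from slide 38.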




  48. (screenshot)

  49. (screenshot)

  50. (screenshot)

  51. LIVE DEMO


  52. FINAL REMARKS


  53. Our journey so far
    • Our self-service automation allowed us to run 100+ experiments since March/2022

    • Thousands of Engineer-minutes saved, async-await style

    • Accurate solutions to improve our build setup and a clearer implementation path when delivering them

    • We can validate whether any input from the Android ecosystem actually works for us, avoiding ad verecundiam arguments


  54. UBIRATAN SOARES
    Computer Scientist made in 🇧🇷
    Senior Software Engineer @ N26
    GDE for Android and Kotlin

    @ubiratanfsoares
    ubiratansoares.dev


  55. THANKS
