Slide 1

Slide 1 text

How we Built a Distributed Testing Platform

Slide 2

Slide 2 text

Marc Philipp, Sr. Principal Software Engineer, @marcphilipp
Roberto Perez Alcolea, Senior Software Engineer, @rpalcolea (Special Guest)

Slide 3

Slide 3 text

1 Why Distribute Tests?

Slide 4

Slide 4 text

Testing usually dominates build times

Slide 5

Slide 5 text

Why is testing so slow? Databases, web servers, directories, virtual machines, network latency

Slide 6

Slide 6 text

Reduced test time yields an increase in productivity
⬢ More build and test executions
⬢ Shift-left testing from CI to local builds
⬢ Less context-switching

Slide 7

Slide 7 text

2 Existing Solutions for running tests in parallel

Slide 8

Slide 8 text

Local parallelization lacks historical information

Slide 9

Slide 9 text

Single-machine parallelism does not scale

Slide 10

Slide 10 text

CI Fanout
Idea: run subset of tests on CI agents in parallel
⬢ Grouping of tests often manual
⬢ Large overhead because each CI agent has to run build up to test task
⬢ Test results are scattered over multiple CI jobs
⬢ Does not support local builds

Slide 11

Slide 11 text

3 Building Test Distribution Our Journey

Slide 12

Slide 12 text

Initial prototype (2019)

Slide 13

Slide 13 text

Initial prototype (2019)
⬢ Implemented as Jenkins plugin
⬢ Idea: reuse Jenkins nodes and infrastructure to run tests remotely
⬢ Pros:
  ○ Jenkins was widely used by our largest customers at the time
⬢ Cons:
  ○ Not all customers use Jenkins
  ○ Installing/updating Jenkins plugins in a corporate environment is not as simple as it seems (i.e. we couldn’t move as fast as we wanted to)

Slide 14

Slide 14 text

Initial release (Gradle Enterprise 2020.2)
⬢ Broker component included in Gradle Enterprise server
⬢ Agents connect to the broker
⬢ Builds connect to the broker and request agents
⬢ Works for local and CI builds
⬢ On-premises inside your network

Slide 15

Slide 15 text

Running Test Distribution agents (Gradle Enterprise 2020.2)
⬢ The agent comes in two flavors: Jar and Docker image
⬢ Runs on Java 11+, requires about 128 MB of memory
⬢ Runs on Windows, macOS, and Linux
⬢ Detects its environment (JDKs, OS) during startup
⬢ Administrators can pass additional capabilities as command line parameters
⬢ Runs on-premises inside your own network infrastructure

Jar:

    java -jar gradle-enterprise-test-distribution-agent.jar \
      --server https://ge.example.com \
      --api-key «api-key» \
      --capabilities docker,postgres=14

Docker image:

    docker run \
      --env TEST_DISTRIBUTION_AGENT_SERVER=https://ge.example.com \
      --env TEST_DISTRIBUTION_AGENT_API_KEY=«api-key» \
      --env TEST_DISTRIBUTION_AGENT_CAPABILITIES=postgres=14 \
      gradle/gradle-enterprise-test-distribution-agent
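How a build's requirements might be matched against the capabilities an agent advertises can be sketched as a simple subset check. This is only an illustration under that assumption, not the product's matching logic; the class and method names are hypothetical, and the capability strings follow the `name` / `name=value` shape used in this deck.

```java
import java.util.Set;

// Hypothetical sketch: an agent is eligible when it advertises every required capability.
final class CapabilityMatcher {

    /** Returns true if the agent's capabilities include all of the requirements. */
    static boolean satisfies(Set<String> agentCapabilities, Set<String> requirements) {
        return agentCapabilities.containsAll(requirements);
    }

    public static void main(String[] args) {
        // "docker" and "postgres=14" mirror the --capabilities example;
        // "os=linux" and "jdk=11" stand in for values detected at startup (assumed).
        Set<String> agent = Set.of("docker", "postgres=14", "os=linux", "jdk=11");

        System.out.println(satisfies(agent, Set.of("postgres=14")));           // true
        System.out.println(satisfies(agent, Set.of("docker", "postgres=15"))); // false
    }
}
```

Exact string equality keeps the sketch small; a real matcher could just as well parse `name=value` pairs and compare versions.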

Slide 16

Slide 16 text

Integrates with default Test task (Gradle Enterprise 2020.2)

    tasks.test {
        useJUnitPlatform()
        distribution {
            enabled.set(true)
        }
    }

⬢ Input files (e.g. classpath) are automatically transferred to remote agents
⬢ Code coverage and other output files are transferred back and merged automatically

Slide 17

Slide 17 text

Test Distribution requires JUnit Platform
⬢ Most test frameworks with JUnit Platform test engines are supported:
  ○ JUnit 5 (Jupiter) ✔
  ○ JUnit 3/4 and Spock 1.x (via junit-vintage-engine included in JUnit 5) ✔
  ○ Spock 2.x ✔
  ○ TestNG (via testng-engine) ✔
  ○ ScalaTest (via scalatest-junit-runner) ✔
  ○ ArchUnit ✔
  ○ jqwik ✔
⬢ Currently unsupported (we have plans to add support where possible):
  ○ Kotest
  ○ Spek
  ○ Cucumber

Slide 18

Slide 18 text

Checking compatibility ⬢ Compatibility can be checked before adopting Test Distribution by applying a custom build script that adds custom values to the Build Scan https://github.com/gradle/gradle-enterprise-build-config-samples/pull/469

Slide 19

Slide 19 text

Trace Files ⬢ Simple but powerful ⬢ Using Chrome for visualization

Slide 20

Slide 20 text

Run build only once, distribute test execution

Slide 21

Slide 21 text

Automatic distribution based on previous execution times
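One way to picture distribution based on previous execution times is a greedy "longest first" assignment: sort test classes by their historical duration and always hand the next one to the least-loaded agent. The sketch below assumes that heuristic purely for illustration; the deck does not say which algorithm Test Distribution uses, and the class name, method name, and test names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of duration-aware partitioning (greedy, longest test class first).
final class HistoricalPartitioner {

    /** Assigns test classes to agents, longest first, always onto the least-loaded agent. */
    static List<List<String>> partition(Map<String, Long> historicalMillis, int agentCount) {
        List<List<String>> buckets = new ArrayList<>();
        long[] load = new long[agentCount];
        for (int i = 0; i < agentCount; i++) {
            buckets.add(new ArrayList<>());
        }

        // Sort by duration descending; break ties by name so the result is deterministic.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(historicalMillis.entrySet());
        sorted.sort((a, b) -> {
            int byDuration = Long.compare(b.getValue(), a.getValue());
            return byDuration != 0 ? byDuration : a.getKey().compareTo(b.getKey());
        });

        for (Map.Entry<String, Long> entry : sorted) {
            int target = 0;
            for (int i = 1; i < agentCount; i++) {
                if (load[i] < load[target]) {
                    target = i;
                }
            }
            buckets.get(target).add(entry.getKey());
            load[target] += entry.getValue();
        }
        return buckets;
    }

    public static void main(String[] args) {
        Map<String, Long> history = Map.of(
                "SlowIT", 9000L, "MediumIT", 5000L, "FastTest", 4000L, "TinyTest", 1000L);
        // prints [[SlowIT, TinyTest], [MediumIT, FastTest]]
        System.out.println(partition(history, 2));
    }
}
```

With these made-up durations the two agents end up with 10 s and 9 s of work. Without historical data, a scheduler can only balance by test count.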

Slide 22

Slide 22 text

Biggest shortcoming of initial release: file transfer
⬢ Worst case: m input files with n agents results in n*m files sent over WebSocket connection
⬢ Huge bottleneck!
[Diagram: the build stores input files in the local file cache of each agent]

Slide 23

Slide 23 text

File transfer multiplexing/deduplication (Gradle Enterprise 2020.3)
⬢ Idea: upload only once (via HTTP) and cache on Gradle Enterprise server
[Diagram: the build uploads each file only once to a file cache on the server; agents read from it and store files in their local file caches]
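The "upload only once" idea can be sketched as a content-addressed cache: a file is keyed by a hash of its bytes, and an upload is skipped when the server already holds that key. Keying by SHA-256 is an assumption made for this illustration (the deck does not say how files are identified), and the class and its methods are hypothetical, kept in memory rather than behind HTTP.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a server-side file cache keyed by content hash.
final class DedupFileCache {
    private final Map<String, byte[]> store = new HashMap<>();
    private int uploads = 0;

    /** Hex-encoded SHA-256 of the content, used as the cache key. */
    static String sha256(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every Java platform
        }
    }

    /** Stores the content only if its hash is not cached yet; returns the cache key. */
    String uploadIfMissing(byte[] content) {
        String key = sha256(content);
        if (!store.containsKey(key)) {
            store.put(key, content);
            uploads++;
        }
        return key;
    }

    byte[] read(String key) {
        return store.get(key);
    }

    int uploadCount() {
        return uploads;
    }

    public static void main(String[] args) {
        DedupFileCache server = new DedupFileCache();
        // The byte arrays stand in for the contents of m = 2 input files.
        byte[][] inputFiles = {
            "classes.jar".getBytes(StandardCharsets.UTF_8),
            "test-classes.jar".getBytes(StandardCharsets.UTF_8)
        };
        int agents = 3; // n = 3 agents each need every input file
        for (int agent = 0; agent < agents; agent++) {
            for (byte[] file : inputFiles) {
                server.uploadIfMissing(file);
            }
        }
        // prints "uploads: 2 instead of 6"
        System.out.println("uploads: " + server.uploadCount()
                + " instead of " + agents * inputFiles.length);
    }
}
```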

Slide 24

Slide 24 text

Speedup over slow connection (Gradle Enterprise 2020.3)

Slide 25

Slide 25 text

Next problem: network failures causing builds to fail ⬢ Test Distribution requires WebSocket connections between builds/agents and the Gradle Enterprise server ⬢ Murphy’s law: Anything that can go wrong, will go wrong

Slide 26

Slide 26 text

Resilience against temporary network failures (Gradle Enterprise 2020.4)
⬢ Actively manage connections using WebSocket pings
⬢ Reconnect if connection is lost or unresponsive
⬢ Reschedule work on other agents if agent disappears
⬢ Retry file uploads on non-client errors
⬢ Keep builds from breaking and causing disruption
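"Retry file uploads on non-client errors" can be read as: a 4xx response means the request itself is wrong and repeating it will not help, while anything else that is not a success is worth another attempt. The sketch below assumes that reading of HTTP status codes; the class, its methods, and the attempt limit are hypothetical, and it leaves out backoff and connection-level failures for brevity.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of a retry policy that skips client (4xx) errors.
final class UploadRetry {

    /** Success (2xx) and client errors (4xx) are final; every other status is retried. */
    static boolean isRetryable(int status) {
        boolean success = status >= 200 && status < 300;
        boolean clientError = status >= 400 && status < 500;
        return !success && !clientError;
    }

    /** Runs the upload attempt up to maxAttempts times and returns the last status seen. */
    static int uploadWithRetry(IntSupplier attempt, int maxAttempts) {
        int status = attempt.getAsInt();
        for (int i = 1; i < maxAttempts && isRetryable(status); i++) {
            status = attempt.getAsInt();
        }
        return status;
    }

    public static void main(String[] args) {
        // Simulated server: two temporary failures, then success.
        int[] responses = {503, 503, 200};
        int[] calls = {0};
        int status = uploadWithRetry(() -> responses[calls[0]++], 5);
        // prints "status 200 after 3 attempts"
        System.out.println("status " + status + " after " + calls[0] + " attempts");
    }
}
```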

Slide 27

Slide 27 text

Maven support (Gradle Enterprise 2020.5)
⬢ Requires Gradle Enterprise Maven extension
⬢ Integrates with Surefire and Failsafe plugins
⬢ Example: maven-surefire-plugin 2.22.2 with distribution enabled set to true

Slide 28

Slide 28 text

Adaptive scheduling (Gradle Enterprise 2021.1)
⬢ Be able to react to additional agents becoming available during test execution
⬢ Increase agent utilization
⬢ Reduce testing time

Slide 29

Slide 29 text

Auto-scaling Test Distribution agents (Gradle Enterprise 2021.2)
⬢ Agent pools with min/max size and capabilities for horizontal scaling
⬢ HTTP endpoint provides metrics indicating the target number of agents for each pool, based on demand
⬢ Step-by-step instructions for Kubernetes in docs
⬢ Real-time and historical usage can be visualized by Gradle Enterprise administrators

    {
      "id": "sosmbpbr",
      "name": "Linux",
      "capabilities": [ "jdk=8", "os=linux" ],
      "minimumAgents": 1,
      "maximumAgents": 90,
      "connectedAgents": 2,
      "idleAgents": 0,
      "desiredAgents": 8
    }

✅ Test Distribution is production-ready
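An autoscaler consuming the pool status could turn it into a scaling decision by comparing the desired and connected agent counts. The sketch below is one possible reading, not the documented behavior: the parameter names mirror the JSON fields on this slide, while the class, its methods, and the defensive clamping to the pool bounds are assumptions.

```java
// Hypothetical sketch: derive "how many agents to add or remove" from one pool status.
final class PoolScaler {

    static int clamp(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }

    /** Positive result: agents to start. Negative result: agents to stop. */
    static int scalingDelta(int desiredAgents, int connectedAgents,
                            int minimumAgents, int maximumAgents) {
        return clamp(desiredAgents, minimumAgents, maximumAgents) - connectedAgents;
    }

    public static void main(String[] args) {
        // Values taken from the pool status shown on this slide.
        int delta = scalingDelta(8, 2, 1, 90);
        System.out.println(delta); // 6
    }
}
```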

Slide 30

Slide 30 text

Gradle Enterprise Acceleration Features

Slide 31

Slide 31 text

Gradle Enterprise Acceleration Features Gradle Enterprise 2022.2

Slide 32

Slide 32 text

Teaser: Foundations of Predictive Test Selection
Eric Wendelin, Principal Data Scientist
Luke Daley, Principal Executive
Thu 10am

Slide 33

Slide 33 text

Single Gradle plugin and Maven extension for all Gradle Enterprise (Gradle Enterprise 2022.3)

    plugins {
        id("com.gradle.enterprise") version "3.11.2"
        id("com.gradle.enterprise.test-distribution") version "2.3.5"
    }

Slide 34

Slide 34 text

Demo recorded by Doug Tidwell (a.k.a. Dr. DPE)

Slide 36

Slide 36 text

More Information https://gradle.com/gradle-enterprise-solutions/test-distribution/

Slide 37

Slide 37 text

4 Experience Report

Slide 38

Slide 38 text

Test Distribution @ Netflix
Roberto Perez Alcolea, Senior Software Engineer, @rpalcolea

Slide 39

Slide 39 text

Netflix JVM Build/CI footprint
⬢ 3.2k Gradle-based repositories
⬢ ~191k weekly Gradle-based builds
⬢ 35 Jenkins controllers
  ○ 650-1500 Jenkins agents
    ■ 1-100 executors per agent
⬢ Hundreds of engineers

Slide 40

Slide 40 text

Why did Netflix start using Test Distribution?
⬢ 88% of build time was spent on test execution
⬢ The need for a consistent experience between local and CI
  ○ Reduce the "it works on my machine" or "only works on Jenkins" cases
  ○ Tests that require Docker containers had different results locally and on CI due to architecture or version differences
⬢ Better compute resource usage
  ○ Build performance varies a lot based on who runs the build and where
    ■ Engineers have different hardware, even within the same team
    ■ Some Jenkins agents have more or less capacity than others
⬢ Faster test feedback and the ability to run tests locally mean developers run tests more frequently, speeding up the local development feedback loop rather than depending on CI environments

Slide 41

Slide 41 text

Getting ready for Test Distribution (default configurations through Nebula)
⬢ Introduced recommended defaults when Test Distribution is enabled via Nebula, our set of Gradle plugins that are distributed as a custom Gradle distribution within Netflix
  ○ Default requirements (capabilities)
  ○ Timeouts and remote execution preferences
⬢ Used gradle-lint-plugin to introduce JUnit Platform engines
  ○ Mix of junit4, testng, scalatest and junit5
    ■ ~1.4k repositories required this work
  ○ Decided not to pursue test code migration but instead to move projects to the proper engines so they are compatible

Slide 42

Slide 42 text

Rolling out Test Distribution at Netflix - Phase 1
⬢ Single agent pool
  ○ One AWS region
  ○ 20 container-based agents
⬢ Enrolled the JVM Ecosystem team repositories and a couple of external partners interested in beta testing
  ○ Most of these repositories have Gradle TestKit-based tests, which made them a great use case to test right away

Slide 43

Slide 43 text

Rolling out Test Distribution at Netflix - Phase 1 Build w/ Integration tests 62 min → 5 min

Slide 45

Slide 45 text

Rolling out Test Distribution at Netflix - Phase 2 (Current)
⬢ Multiple agent pools
  ○ Two types of agents: container and EC2 (VM)
    ■ EC2 agents are required to support Docker in Docker for use cases like Testcontainers-based integration tests
  ○ Agents in multiple AWS regions to match the Jenkins job location and get a better network experience
  ○ The number of agents is scaled up/down based on agent availability/usage, but we run overprovisioned to continue learning
  ○ Same CPU, disk and memory
⬢ Enrolled ~350 projects (~10% of builds)

Slide 46

Slide 46 text

Test Distribution learnings so far
⬢ Having two types of agents (EC2 vs. container-based) increases operational complexity
  ○ Docker maintenance
  ○ Agent selection (capabilities)
⬢ Engineers write tests that might require resources that can’t be added to common agent pools. Unfortunately, these tests are blockers for adoption. Examples of this are:
  ○ Network access
  ○ Security policies
⬢ Every build is unique and not all of them save several minutes, but we have seen incredible results and amazing feedback from most of the people using the product today
⬢ Some tests require more compute resources than others; having a single configuration of agents is far from ideal

Slide 47

Slide 47 text

What’s next for Netflix on Test Distribution?

Slide 48

Slide 48 text

THANKS
Any questions?
marc@gradle.com, Twitter: @marcphilipp
rperezalcolea@netflix.com, Twitter: @rpalcolea