How we Built a Distributed Testing Platform

Marc Philipp, Sr. Principal Software Engineer, @marcphilipp
Roberto Perez Alcolea, Senior Software Engineer, @rpalcolea (special guest)

1 Why Distribute Tests?

Testing usually dominates build times

Why is testing so slow?
⬢ Databases
⬢ Web servers
⬢ Directories
⬢ Virtual machines
⬢ Network latency

Reduced test time yields an increase in productivity
⬢ More build and test executions
⬢ Shift-left testing from CI to local builds
⬢ Less context-switching

2 Existing Solutions for running tests in parallel

Local parallelization lacks historical information

Single-machine parallelism does not scale

CI fanout
Idea: run subsets of tests on CI agents in parallel
⬢ Grouping of tests is often manual
⬢ Large overhead because each CI agent has to run the build up to the test task
⬢ Test results are scattered over multiple CI jobs
⬢ Does not support local builds

3 Building Test Distribution: Our Journey

Initial prototype (2019)
⬢ Implemented as a Jenkins plugin
⬢ Idea: reuse Jenkins nodes and infrastructure to run tests remotely
⬢ Pros:
  ◦ Jenkins was widely used by our largest customers at the time
⬢ Cons:
  ◦ Not all customers use Jenkins
  ◦ Installing/updating Jenkins plugins in a corporate environment is not as simple as it seems (i.e. we couldn't move as fast as we wanted to)

Initial release (Gradle Enterprise 2020.2)
⬢ Broker component included in the Gradle Enterprise server
⬢ Agents connect to the broker
⬢ Builds connect to the broker and request agents
⬢ Works for local and CI builds
⬢ Runs on-premises inside your network

Running Test Distribution agents (Gradle Enterprise 2020.2)
⬢ The agent comes in two flavors: Jar and Docker image
⬢ Runs on Java 11+, requires about 128 MB of memory
⬢ Runs on Windows, macOS, and Linux
⬢ Detects its environment (JDKs, OS) during startup
⬢ Administrators can pass additional capabilities as command line parameters
⬢ Runs on-premises inside your own network infrastructure

Jar:

    java -jar gradle-enterprise-test-distribution-agent.jar \
      --server https://ge.example.com \
      --api-key «api-key» \
      --capabilities docker,postgres=14

Docker image:

    docker run \
      --env TEST_DISTRIBUTION_AGENT_SERVER=https://ge.example.com \
      --env TEST_DISTRIBUTION_AGENT_API_KEY=«api-key» \
      --env TEST_DISTRIBUTION_AGENT_CAPABILITIES=postgres=14 \
      gradle/gradle-enterprise-test-distribution-agent
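One way to picture how capabilities come into play: a build states what it requires, and only agents that advertise all of those capabilities are eligible. The following is a minimal sketch under that assumption. The function name `agent_satisfies` and the plain subset rule are illustrative and not taken from the product.

```python
def agent_satisfies(capabilities: set[str], requirements: set[str]) -> bool:
    """Return True if the agent advertises every capability the build requires."""
    return requirements <= capabilities


# Capabilities as passed via --capabilities, plus hypothetical auto-detected ones.
agent = {"docker", "postgres=14", "os=linux", "jdk=11"}

print(agent_satisfies(agent, {"postgres=14"}))            # True
print(agent_satisfies(agent, {"postgres=15", "docker"}))  # False
```

Treating `postgres=14` as an opaque string keeps matching trivial; version ranges would need a richer comparison.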

Integrates with the default Test task (Gradle Enterprise 2020.2)

    tasks.test {
        useJUnitPlatform()
        distribution {
            enabled.set(true)
        }
    }

⬢ Input files (e.g. classpath) are automatically transferred to remote agents
⬢ Code coverage and other output files are transferred back and merged automatically

Test Distribution requires JUnit Platform
⬢ Most test frameworks with JUnit Platform test engines are supported:
  ◦ JUnit 5 (Jupiter) ✔
  ◦ JUnit 3/4 and Spock 1.x (via junit-vintage-engine included in JUnit 5) ✔
  ◦ Spock 2.x ✔
  ◦ TestNG (via testng-engine) ✔
  ◦ ScalaTest (via scalatest-junit-runner) ✔
  ◦ ArchUnit ✔
  ◦ jqwik ✔
⬢ Currently unsupported (we have plans to add support where possible):
  ◦ Kotest
  ◦ Spek
  ◦ Cucumber

Checking compatibility
⬢ Compatibility can be checked before adopting Test Distribution by applying a custom build script that adds custom values to the Build Scan
https://github.com/gradle/gradle-enterprise-build-config-samples/pull/469

Trace files
⬢ Simple but powerful
⬢ Using Chrome for visualization
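The slide does not show what the trace files contain. As a rough sketch, assuming they follow Chrome's Trace Event Format, each test execution could be written as a "complete" event (`"ph": "X"`) with microsecond timestamps. The input tuples and the function name `to_trace_events` are hypothetical.

```python
import json


def to_trace_events(executions):
    """Convert (test, agent, start ms, duration ms) tuples into complete ("X") events."""
    agents = sorted({agent for _, agent, _, _ in executions})
    events = []
    for name, agent, start_ms, duration_ms in executions:
        events.append({
            "name": name,
            "ph": "X",                        # complete event: start plus duration
            "ts": start_ms * 1000,            # the format expects microseconds
            "dur": duration_ms * 1000,
            "pid": 1,                         # one build, shown as one process
            "tid": agents.index(agent) + 1,   # one row per agent
        })
    return {"traceEvents": events}


trace = to_trace_events([
    ("FooTest", "agent-1", 0, 1200),
    ("BarTest", "agent-2", 0, 800),
    ("BazTest", "agent-2", 800, 300),
])
print(json.dumps(trace, indent=2))
```

Saving the output to a `.json` file lets it be loaded into Chrome's tracing view, with one row per agent.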

Run build only once, distribute test execution

Automatic distribution based on previous execution times
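The slides do not say which algorithm is used to balance tests. One simple illustration of the idea is the greedy "longest processing time first" heuristic: sort test classes by their previous duration and always hand the next one to the least-loaded executor. The function name and the sample history below are assumptions for this sketch.

```python
import heapq


def partition_by_history(durations: dict[str, float], executors: int) -> list[list[str]]:
    """Greedy LPT: assign the slowest tests first, each to the least-loaded executor."""
    heap = [(0.0, i) for i in range(executors)]   # (accumulated seconds, executor index)
    buckets = [[] for _ in range(executors)]
    for test, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, i = heapq.heappop(heap)
        buckets[i].append(test)
        heapq.heappush(heap, (load + seconds, i))
    return buckets


# Hypothetical durations (in seconds) from a previous build.
history = {"SlowIT": 90, "MediumIT": 60, "FastTest": 20, "TinyTest": 10}
print(partition_by_history(history, 2))
# [['SlowIT'], ['MediumIT', 'FastTest', 'TinyTest']]
```

Both executors end up with 90 seconds of work, whereas an alphabetical or round-robin split would leave one of them idle for part of the run.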

Biggest shortcoming of initial release: file transfer
⬢ Worst case: m input files with n agents results in n*m files sent over the WebSocket connection
⬢ Huge bottleneck!
[Figure labels: three agents, each with a "Local file cache" and a "Store" arrow]

File transfer multiplexing/deduplication (Gradle Enterprise 2020.3)
⬢ Idea: upload only once (via HTTP) and cache on the Gradle Enterprise server
[Figure labels: "Upload only once" into the server's "File cache"; agents "Read" from it and "Store" into their "Local file cache"]
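The "upload only once" idea can be sketched with a content-addressed cache: a file is keyed by a hash of its content, so the build uploads each distinct file at most once and agents read it from the server. This is an in-memory illustration only. The class `ServerFileCache`, the use of SHA-256, and the sample files are assumptions, not the product's implementation.

```python
import hashlib


class ServerFileCache:
    """In-memory stand-in for a server-side file cache (hypothetical)."""

    def __init__(self):
        self._blobs = {}
        self.uploads = 0

    def upload_if_missing(self, content: bytes) -> str:
        """Store content under its SHA-256 digest; count only real uploads."""
        key = hashlib.sha256(content).hexdigest()
        if key not in self._blobs:
            self._blobs[key] = content
            self.uploads += 1
        return key

    def read(self, key: str) -> bytes:
        return self._blobs[key]


cache = ServerFileCache()
inputs = [b"test-classes", b"main-classes", b"junit.jar"]  # m = 3 input files
agents = 4                                                  # n = 4 agents

keys = [cache.upload_if_missing(blob) for blob in inputs]   # the build uploads m files
for _ in range(agents):                                     # each agent reads from the server
    assert [cache.read(k) for k in keys] == inputs

print(cache.uploads)   # 3 uploads instead of n*m = 12 direct transfers
cache.upload_if_missing(b"junit.jar")
print(cache.uploads)   # still 3: an unchanged file is not uploaded again
```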

Speedup over a slow connection (Gradle Enterprise 2020.3)

Next problem: network failures causing builds to fail
⬢ Test Distribution requires WebSocket connections between builds/agents and the Gradle Enterprise server
⬢ Murphy's law: Anything that can go wrong will go wrong

Resilience against temporary network failures (Gradle Enterprise 2020.4)
⬢ Actively manage connections using WebSocket pings
⬢ Reconnect if the connection is lost or unresponsive
⬢ Reschedule work on other agents if an agent disappears
⬢ Retry file uploads on non-client errors
⬢ Prevent builds from breaking and causing disruption
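"Retry on non-client errors" can be read as: a 4xx response means the request itself is wrong and retrying will not help, while anything else may be transient. A minimal sketch of that rule with exponential backoff follows. The function name, attempt count, and delays are assumptions chosen for illustration.

```python
import time


def upload_with_retry(send, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call send() until it succeeds, retrying only on non-client errors.

    send() returns an HTTP status code. 2xx is success, 4xx is a client error
    that retrying cannot fix, and anything else is treated as transient.
    """
    for attempt in range(1, max_attempts + 1):
        status = send()
        if 200 <= status < 300:
            return status
        if 400 <= status < 500 or attempt == max_attempts:
            raise RuntimeError(f"upload failed with status {status}")
        sleep(base_delay * 2 ** (attempt - 1))   # back off: 0.1s, 0.2s, ...


# Two transient server errors followed by success.
responses = iter([503, 502, 200])
print(upload_with_retry(lambda: next(responses), sleep=lambda s: None))  # 200
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test.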

Maven support (Gradle Enterprise 2020.5)

    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <build>
        <plugins>
          <plugin>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>2.22.2</version>
            <configuration>
              <properties>
                <distribution>
                  <enabled>true</enabled>
                </distribution>
              </properties>
            </configuration>
          </plugin>
        </plugins>
      </build>
    </project>

⬢ Requires the Gradle Enterprise Maven extension
⬢ Integrates with Surefire and Failsafe plugins

Adaptive scheduling (Gradle Enterprise 2021.1)
⬢ React to additional agents becoming available during test execution
⬢ Increase agent utilization
⬢ Reduce testing time
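The actual scheduler is not described in the slides. To see why reacting to late-joining agents reduces testing time, consider a toy simulation in which every agent pulls the next test from a shared queue as soon as it is free. The function `makespan` and the numbers are assumptions for this sketch.

```python
import heapq


def makespan(durations, agent_join_times):
    """Simulate agents pulling tests from a shared queue; return the finish time."""
    free_at = list(agent_join_times)   # time at which each agent can next take work
    heapq.heapify(free_at)
    finish = 0
    for duration in durations:
        start = heapq.heappop(free_at)  # the earliest available agent takes the test
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


tests = [10] * 8                       # eight tests of 10 seconds each
print(makespan(tests, [0, 0]))         # 40: two agents from the start
print(makespan(tests, [0, 0, 10]))     # 30: a third agent joins after 10 seconds
```

With a fixed up-front partition across the first two agents, the third agent would have had nothing to do.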

Auto-scaling Test Distribution agents (Gradle Enterprise 2021.2)
⬢ Agent pools with min/max size and capabilities for horizontal scaling
⬢ HTTP endpoint provides metrics indicating the target number of agents for each pool, based on demand
⬢ Step-by-step instructions for Kubernetes in docs
⬢ Real-time and historical usage can be visualized by Gradle Enterprise administrators

    {
      "id": "sosmbpbr",
      "name": "Linux",
      "capabilities": [ "jdk=8", "os=linux" ],
      "minimumAgents": 1,
      "maximumAgents": 90,
      "connectedAgents": 2,
      "idleAgents": 0,
      "desiredAgents": 8
    }

✅ Test Distribution is production-ready
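An autoscaler could consume the pool status shown above roughly as follows. This sketch only parses the JSON from the slide; how the endpoint is fetched is not shown in the source and is left out. The function name `scaling_delta` is hypothetical, and the clamping to the pool's min/max is a defensive assumption.

```python
import json

# Pool status exactly as shown on the slide.
POOL_STATUS = """
{ "id": "sosmbpbr", "name": "Linux", "capabilities": ["jdk=8", "os=linux"],
  "minimumAgents": 1, "maximumAgents": 90,
  "connectedAgents": 2, "idleAgents": 0, "desiredAgents": 8 }
"""


def scaling_delta(pool: dict) -> int:
    """Agents to add (positive) or remove (negative) to reach the desired count."""
    target = max(pool["minimumAgents"], min(pool["desiredAgents"], pool["maximumAgents"]))
    return target - pool["connectedAgents"]


print(scaling_delta(json.loads(POOL_STATUS)))  # 6
```

In a Kubernetes setup, the resulting target would typically drive the replica count of the agent deployment.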

Gradle Enterprise Acceleration Features (Gradle Enterprise 2022.2)

Teaser: Foundations of Predictive Test Selection (Thu 10am)
⬢ Eric Wendelin, Principal Data Scientist
⬢ Luke Daley, Principal Executive

Single Gradle plugin and Maven extension for all Gradle Enterprise
Gradle Enterprise 2022.3

    plugins {
        id("com.gradle.enterprise") version "3.11.2"
        id("com.gradle.enterprise.test-distribution") version "2.3.5"
    }

Demo recorded by Doug Tidwell (a.k.a. Dr. DPE)

More Information https://gradle.com/gradle-enterprise-solutions/test-distribution/

4 Experience Report

Test Distribution @ Netflix
Roberto Perez Alcolea, Senior Software Engineer, @rpalcolea

Netflix JVM Build/CI footprint
⬢ 3.2k Gradle-based repositories
⬢ ~191k weekly Gradle-based builds
⬢ 35 Jenkins controllers
  ◦ 650-1500 Jenkins agents
    ▪ 1-100 executors per agent
⬢ Hundreds of engineers

Why did Netflix start using Test Distribution?
⬢ 88% of build time was spent on test execution
⬢ The need for a consistent experience between local and CI
  ◦ Reduce the "it works on my machine" or "only works on Jenkins" problems
  ◦ Tests that require Docker containers had different results locally and on CI due to architecture or version differences
⬢ Better compute resource usage
  ◦ Build performance varies a lot based on who ran the build and where
    ▪ Engineers have different hardware, even within the same team
    ▪ Some Jenkins agents have more capacity than others
⬢ Faster test feedback and the ability to run tests locally mean developers run tests more frequently, speeding up the local development feedback loop rather than depending on CI environments

Getting ready for Test Distribution (default configurations through Nebula)
⬢ Introduced recommended defaults when Test Distribution is enabled via Nebula, our set of Gradle plugins that are distributed as a custom Gradle distribution within Netflix
  ◦ Default requirements (capabilities)
  ◦ Timeouts and remote execution preferences
⬢ Used gradle-lint-plugin to introduce JUnit Platform engines
  ◦ Mix of junit4, testng, scalatest and junit5
    ▪ ~1.4k repositories required this work
  ◦ Decided not to pursue test code migration; instead, moved projects to the proper engines so that they are compatible

Rolling out Test Distribution at Netflix - Phase 1
⬢ Single agent pool
  ◦ One AWS region
  ◦ 20 container-based agents
⬢ Enrolled the JVM Ecosystem team repositories and a couple of external partners interested in beta testing
  ◦ Most of these repositories have Gradle TestKit-based tests, which made them a great use case to test right away

Rolling out Test Distribution at Netflix - Phase 1
Build w/ integration tests: 62 min → 5 min

Rolling out Test Distribution at Netflix - Phase 2 (current)
⬢ Multiple agent pools
  ◦ Two types of agents: container and EC2 (VM)
    ▪ EC2 agents are required to support Docker-in-Docker for use cases like Testcontainers-based integration tests
  ◦ Agents in multiple AWS regions to match the Jenkins job location and get a better network experience
  ◦ The number of agents is scaled up/down based on agent availability/usage, but we run overprovisioned to continue learning
  ◦ Same CPU, disk and memory
⬢ Enrolled ~350 projects (~10% of builds)

Test Distribution learnings so far
⬢ Having two types of agents (EC2 vs. container-based) increases operational complexity
  ◦ Docker maintenance
  ◦ Agent selection (capabilities)
⬢ Engineers write tests that might require resources that can't be added to common agent pools. Unfortunately, these tests are blockers for adoption. Examples:
  ◦ Network access
  ◦ Security policies
⬢ Every build is unique and not all of them save several minutes, but we have seen incredible results and amazing feedback from most of the people who are using the product today
⬢ Some tests require more compute resources than others; having a single configuration of agents is far from ideal

What’s next for Netﬂix on Test Distribution?

THANKS
Any questions?
Twitter: @marcphilipp
rperezalcolea@netflix.com, Twitter: @rpalcolea
