
How we Built a Distributed Testing Platform

Test time is one of the key drivers of build times. Many factors contribute to it, including tests that run sequentially and dependencies on expensive external resources or services. The sheer number of tests needed to cover a wide range of inputs is also a factor. As a result, tests are often run only on CI, which considerably lengthens the feedback loop.

Join us in this session to take a behind-the-scenes look at the challenges the team building Test Distribution faced and how we solved them. To name a few: efficient file transfer, unstable network connections, scheduling, and auto-scaling. In addition, we’ll share experiences from customers adopting Test Distribution. What issues did they face when starting to distribute existing test suites? How did they overcome them? What successes/gains did they see?

Marc Philipp

November 02, 2022


  1. How we Built a Distributed Testing Platform

  2. Marc Philipp, Sr. Principal Software Engineer (@marcphilipp)
    Special Guest: Roberto Perez Alcolea, Senior Software Engineer (@rpalcolea)
  3. 1 Why Distribute Tests?

  4. Testing usually dominates build times

  5. Why is testing so slow? Databases, web servers, directories, virtual machines, network latency
  6. Reduced test time yields increase in productivity
    ⬢ More build and test executions
    ⬢ Shift-left testing from CI to local builds
    ⬢ Less context-switching
  7. 2 Existing Solutions for running tests in parallel

  8. Local parallelization lacks historical information

  9. Single-machine parallelism does not scale

  10. CI Fanout
    Idea: run subset of tests on CI agents in parallel
    ⬢ Grouping of tests often manual
    ⬢ Large overhead because each CI agent has to run build up to test task
    ⬢ Test results are scattered over multiple CI jobs
    ⬢ Does not support local builds
  11. 3 Building Test Distribution Our Journey

  12. Initial prototype (2019)

  13. Initial prototype (2019)
    ⬢ Implemented as Jenkins plugin
    ⬢ Idea: reuse Jenkins nodes and infrastructure to run tests remotely
    ⬢ Pros:
      ◦ Jenkins was widely used by our largest customers at the time
    ⬢ Cons:
      ◦ Not all customers use Jenkins
      ◦ Installing/updating Jenkins plugins in a corporate environment is not as simple as it seems (i.e. we couldn’t move as fast as we wanted to)
  14. Initial release (Gradle Enterprise 2020.2)
    ⬢ Broker component included in Gradle Enterprise server
    ⬢ Agents connect to the broker
    ⬢ Builds connect to the broker and request agents
    ⬢ Works for local and CI builds
    On-premises inside your network
  15. Running Test Distribution agents (Gradle Enterprise 2020.2)
    ⬢ The agent comes in two flavors: Jar and Docker image
    ⬢ Runs on Java 11+, requires about 128 MB of memory
    ⬢ Runs on Windows, macOS, and Linux
    ⬢ Detects its environment (JDKs, OS) during startup
    ⬢ Administrators can pass additional capabilities as command line parameters
    ⬢ Runs on-premises inside your own network infrastructure

        java -jar gradle-enterprise-test-distribution-agent.jar \
          --server https://ge.example.com \
          --api-key «api-key» \
          --capabilities docker,postgres=14

        docker run \
          --env TEST_DISTRIBUTION_AGENT_SERVER=https://ge.example.com \
          --env TEST_DISTRIBUTION_AGENT_API_KEY=«api-key» \
          --env TEST_DISTRIBUTION_AGENT_CAPABILITIES=postgres=14 \
          gradle/gradle-enterprise-test-distribution-agent
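The slide does not spell out how requested requirements are matched against agent capabilities. A minimal Python sketch, assuming a capability is either a bare name (`docker`) or a `key=value` string (`postgres=14`) and that an agent qualifies only when it offers every required capability. The function names are hypothetical and this is not the product's actual matching logic:

```python
def parse_capabilities(spec: str) -> set[str]:
    """Split a comma-separated capability string such as 'docker,postgres=14'."""
    return {item.strip() for item in spec.split(",") if item.strip()}


def agent_matches(agent_caps: set[str], requirements: set[str]) -> bool:
    """Assumed rule: an agent qualifies only if it offers every required capability."""
    return requirements <= agent_caps


# Hypothetical agent: detected environment plus administrator-supplied capabilities.
agent = parse_capabilities("docker,postgres=14,os=linux,jdk=11")
print(agent_matches(agent, parse_capabilities("postgres=14")))         # True
print(agent_matches(agent, parse_capabilities("docker,postgres=15")))  # False
```

Treating capabilities as opaque strings keeps the check to a subset test; version ranges or wildcards would need a richer comparison.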
  16. Integrates with default Test task (Gradle Enterprise 2020.2)

        tasks.test {
            useJUnitPlatform()
            distribution {
                enabled.set(true)
            }
        }

    ⬢ Input files (e.g. classpath) are automatically transferred to remote agents
    ⬢ Code coverage and other output files are transferred back and merged automatically
  17. Test Distribution requires JUnit Platform
    ⬢ Most test frameworks with JUnit Platform test engines are supported:
      ◦ JUnit 5 (Jupiter) ✔
      ◦ JUnit 3/4 and Spock 1.x (via junit-vintage-engine included in JUnit 5) ✔
      ◦ Spock 2.x ✔
      ◦ TestNG (via testng-engine) ✔
      ◦ ScalaTest (via scalatest-junit-runner) ✔
      ◦ ArchUnit ✔
      ◦ jqwik ✔
    ⬢ Currently unsupported (we have plans to add support where possible):
      ◦ Kotest
      ◦ Spek
      ◦ Cucumber
  18. Checking compatibility
    ⬢ Compatibility can be checked before adopting Test Distribution by applying a custom build script that adds custom values to the Build Scan
    https://github.com/gradle/gradle-enterprise-build-config-samples/pull/469
  19. Trace Files ⬢ Simple but powerful ⬢ Using Chrome for

  20. Run build only once, distribute test execution

  21. Automatic distribution based on previous execution times
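The slide gives no detail on how previous execution times drive the split. One common way to balance work from historical durations is a greedy longest-first assignment, sketched here in Python. The test names, durations, and the `partition_by_duration` function are made up for illustration, and this is not necessarily the algorithm Test Distribution uses:

```python
import heapq


def partition_by_duration(durations: dict[str, float], agents: int) -> list[list[str]]:
    """Greedy longest-first: hand each test class to the currently least-loaded agent."""
    # Heap entries are (total_seconds_assigned, agent_index), lightest agent first.
    heap = [(0.0, i) for i in range(agents)]
    buckets: list[list[str]] = [[] for _ in range(agents)]
    for name, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return buckets


# Hypothetical durations (seconds) recorded from an earlier build.
history = {"SlowIT": 120.0, "DbIT": 90.0, "ApiTest": 60.0, "UtilTest": 30.0, "FastTest": 10.0}
for i, bucket in enumerate(partition_by_duration(history, 2)):
    print(i, bucket, sum(history[t] for t in bucket))
```

With these numbers the two agents end up with 160 and 150 seconds of work, compared with 310 seconds when everything runs sequentially.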

  22. Biggest shortcoming of initial release: file transfer
    ⬢ Worst case: m input files with n agents results in n*m files sent over WebSocket connection
    ⬢ Huge bottleneck!
    [Diagram labels: Store (to each agent), Local file cache (on each agent)]
  23. File transfer multiplexing/deduplication (Gradle Enterprise 2020.3)
    ⬢ Idea: upload only once (via HTTP) and cache on Gradle Enterprise server
    [Diagram labels: Upload only once, File cache (on server), Read, Store, Local file cache (on each agent)]
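To make the saving concrete: with a hypothetical m = 500 input files and n = 10 agents, the initial design could send up to 5,000 files from the build, while uploading once caps that at 500. The Python sketch below illustrates the upload-once idea with an in-memory cache keyed by a SHA-256 content hash. Keying by content hash is an assumption, since the slides only say that files are uploaded once and cached on the server, and the `FileCache` class is hypothetical:

```python
import hashlib


class FileCache:
    """In-memory stand-in for a server-side file cache keyed by content hash."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
        self.uploads = 0  # number of times content was actually transferred

    def upload_if_missing(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        if key not in self._blobs:  # skip the transfer when the server already has it
            self._blobs[key] = content
            self.uploads += 1
        return key

    def read(self, key: str) -> bytes:
        return self._blobs[key]


cache = FileCache()
inputs = [b"classes.jar", b"test-classes.jar", b"classes.jar"]  # third one is a duplicate
keys = [cache.upload_if_missing(content) for content in inputs]

agents = 3
reads = [cache.read(key) for _ in range(agents) for key in set(keys)]
print(cache.uploads, len(reads))  # 2 uploads from the build, 6 reads by agents
```

The build's slow uplink carries each distinct file once; the fan-out to agents happens from the server side.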
  24. Speedup over slow connection (Gradle Enterprise 2020.3)

  25. Next problem: network failures causing builds to fail
    ⬢ Test Distribution requires WebSocket connections between builds/agents and the Gradle Enterprise server
    ⬢ Murphy’s law: Anything that can go wrong, will go wrong
  26. Resilience against temporary network failures (Gradle Enterprise 2020.4)
    ⬢ Actively manage connections using WebSocket pings
    ⬢ Reconnect if connection is lost or unresponsive
    ⬢ Reschedule work on other agents if agent disappears
    ⬢ Retry file uploads on non-client errors
    ⬢ Prevent builds from breaking and causing disruption
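The "retry file uploads on non-client errors" rule can be sketched in Python as follows. The sketch assumes that "client errors" means HTTP 4xx responses and adds exponential backoff between attempts, which the slides do not mention. `UploadError` and `upload_with_retry` are hypothetical names:

```python
import time


class UploadError(Exception):
    """Hypothetical error carrying the HTTP status of a failed upload."""

    def __init__(self, status: int) -> None:
        super().__init__(f"upload failed with HTTP {status}")
        self.status = status


def upload_with_retry(upload, attempts: int = 4, base_delay: float = 0.5, sleep=time.sleep):
    """Call upload(); retry with exponential backoff unless the failure is a 4xx client error."""
    for attempt in range(attempts):
        try:
            return upload()
        except UploadError as err:
            client_error = 400 <= err.status < 500
            if client_error or attempt == attempts - 1:
                raise  # retrying a bad request cannot help; also give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# Usage: an upload that fails twice with 503 before succeeding.
calls = {"n": 0}


def flaky_upload() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise UploadError(503)
    return "stored"


print(upload_with_retry(flaky_upload, sleep=lambda seconds: None))  # stored
```

Passing `sleep` as a parameter keeps the example fast and makes the backoff easy to test.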
  27. Maven support (Gradle Enterprise 2020.5)

        <project xmlns="http://maven.apache.org/POM/4.0.0">
          <!-- ... -->
          <build>
            <plugins>
              <plugin>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.22.2</version>
                <configuration>
                  <properties>
                    <distribution>
                      <enabled>true</enabled>
                    </distribution>
                  </properties>
                </configuration>
              </plugin>
            </plugins>
          </build>
        </project>

    ⬢ Requires Gradle Enterprise Maven extension
    ⬢ Integrates with Surefire and Failsafe plugins
  28. Adaptive scheduling (Gradle Enterprise 2021.1)
    ⬢ Be able to react to additional agents becoming available during test execution
    ⬢ Increase agent utilization
    ⬢ Reduce testing time
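Why reacting to late-joining agents reduces testing time can be shown with a small Python simulation. It assumes a pull model in which every agent takes the next test from a shared queue as soon as it is free. That model, the durations, and the `simulate` function are illustrative assumptions, not a description of the actual scheduler:

```python
import heapq
from collections import deque


def simulate(durations, agent_join_times):
    """Return the total time when each agent pulls the next test as soon as it is free.

    agent_join_times[i] is the moment agent i becomes available; agents that
    join late simply start pulling from the shared queue once they connect.
    """
    queue = deque(durations)
    # Heap of (time_agent_is_free, agent_index), earliest free agent first.
    free_at = [(t, i) for i, t in enumerate(agent_join_times)]
    heapq.heapify(free_at)
    finish = 0
    while queue:
        now, agent = heapq.heappop(free_at)
        done = now + queue.popleft()
        finish = max(finish, done)
        heapq.heappush(free_at, (done, agent))
    return finish


tests = [30, 30, 30, 30, 30, 30]     # six tests of 30 seconds each
print(simulate(tests, [0]))          # a single agent for the whole run
print(simulate(tests, [0, 40, 40]))  # two more agents join after 40 seconds
```

A scheduler that fixed the plan at the start of the run would leave the two late agents idle; pulling from a shared queue lets them take over remaining work.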
  29. Auto-scaling Test Distribution agents (Gradle Enterprise 2021.2)
    ⬢ Agent pools with min/max size and capabilities for horizontal scaling
    ⬢ HTTP endpoint provides metrics indicating the target number of agents for each pool, based on demand
    ⬢ Step-by-step instructions for Kubernetes in docs
    ⬢ Real-time and historical usage can be visualized by Gradle Enterprise administrators

        {
          "id": "sosmbpbr",
          "name": "Linux",
          "capabilities": [
            "jdk=8",
            "os=linux"
          ],
          "minimumAgents": 1,
          "maximumAgents": 90,
          "connectedAgents": 2,
          "idleAgents": 0,
          "desiredAgents": 8
        }

    ✅ Test Distribution is production-ready
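An external autoscaler could turn the pool status shown on the slide into a scaling decision roughly as follows. This Python sketch uses the slide's JSON payload verbatim; the endpoint URL is not given in the talk, so fetching is left out, and clamping `desiredAgents` to the pool bounds is a defensive assumption (the server may already do this). `replicas_to_add` is a hypothetical helper:

```python
import json

# Pool status exactly as shown on the slide.
POOL_STATUS = """
{ "id": "sosmbpbr", "name": "Linux", "capabilities": ["jdk=8", "os=linux"],
  "minimumAgents": 1, "maximumAgents": 90,
  "connectedAgents": 2, "idleAgents": 0, "desiredAgents": 8 }
"""


def replicas_to_add(status: dict) -> int:
    """Positive result: start that many agents; negative: stop that many."""
    # Keep the target within the pool's configured bounds.
    target = max(status["minimumAgents"], min(status["desiredAgents"], status["maximumAgents"]))
    return target - status["connectedAgents"]


print(replicas_to_add(json.loads(POOL_STATUS)))  # 6
```

For the status above, the pool wants 8 agents and has 2 connected, so the autoscaler would start 6 more.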
  30. Gradle Enterprise Acceleration Features

  31. Gradle Enterprise Acceleration Features Gradle Enterprise 2022.2

  32. Teaser: Foundations of Predictive Test Selection
    ⬢ Eric Wendelin, Principal Data Scientist
    ⬢ Luke Daley, Principal Executive
    ⬢ Thu 10am
  33. Single Gradle plugin and Maven extension for all Gradle Enterprise
    Gradle Enterprise 2022.3

        plugins {
            id("com.gradle.enterprise") version "3.11.2"
            id("com.gradle.enterprise.test-distribution") version "2.3.5"
        }
  34. Demo recorded by Doug Tidwell (a.k.a. Dr. DPE)

  36. More Information https://gradle.com/gradle-enterprise-solutions/test-distribution/

  37. 4 Experience Report

  38. Test Distribution @ Netflix: Roberto Perez Alcolea, Senior Software Engineer (@rpalcolea)

  39. Netflix JVM Build/CI footprint
    ⬢ 3.2k Gradle based repositories
    ⬢ ~191k weekly Gradle based builds
    ⬢ 35 Jenkins controllers
      ◦ 650-1500 Jenkins agents
        ▪ 1-100 executors per agent
    ⬢ Hundreds of Engineers
  40. Why did Netflix start using Test Distribution?
    ⬢ 88% of Build Time was spent on Test execution
    ⬢ The need for a consistent experience between local and CI
      ◦ Reduce the "it works on my machine" or "Only works on Jenkins" cases
      ◦ Tests that require Docker containers had different results on local and CI due to architecture or just version differences
    ⬢ Better compute resource usage
      ◦ Build performance varies a lot based on who ran the build and where it was executed
        ▪ Engineers have different hardware, even within the same team
        ▪ Jenkins agents could have more or less capacity than others
    ⬢ Faster test feedback and the ability to run tests locally mean developers run tests more frequently, which speeds up the local development feedback loop instead of depending on CI environments
  41. Getting ready for Test Distribution (Default configurations through Nebula)
    ⬢ Introduce recommended defaults when Test Distribution is enabled via Nebula, our set of Gradle plugins that are distributed as a custom Gradle distribution within Netflix
      ◦ Default requirements (capabilities)
      ◦ Timeouts and remote execution preferences
    ⬢ Used gradle-lint-plugin to introduce JUnit Platform engines
      ◦ Mix of junit4, testng, scalatest and junit5
        ▪ ~1.4k repositories required this work
      ◦ Decided not to pursue test code migration and instead moved projects to the proper engines to make them compatible
  42. Rolling out Test Distribution at Netflix - Phase 1
    ⬢ Single Agent pool
      ◦ One AWS region
      ◦ 20 container based agents
    ⬢ Enrolled the JVM Ecosystem team repositories and a couple of external partners interested in beta testing
      ◦ Most of these repositories have Gradle TestKit based tests, which made them a great use case to test right away
  43. Rolling out Test Distribution at Netflix - Phase 1
    Build w/ Integration tests: 62 min → 5 min
  45. Rolling out Test Distribution at Netflix - Phase 2 (Current)
    ⬢ Multiple Agent pools
      ◦ Two types of Agents: Container and EC2 (VM)
        ▪ EC2 agents are required to support Docker-in-Docker for use cases like Testcontainers based integration tests
      ◦ Agents in multiple AWS regions to match Jenkins Job location and have a better network experience
      ◦ The number of agents is scaled up/down based on Agent availability/usage, but we run overprovisioned to continue learning
      ◦ Same CPU, Disk and Memory
    ⬢ Enrolled ~350 projects (~10% of builds)
  46. Test Distribution learnings so far
    ⬢ Having two types of agents (EC2 vs Container based) increases operational complexity
      ◦ Docker maintenance
      ◦ Agent selection (capabilities)
    ⬢ Engineers write tests that might require resources that can’t be added to common agent pools. Unfortunately, these tests are blockers for adoption. Examples of this are:
      ◦ Network access
      ◦ Security policies
    ⬢ Every build is unique and not all of them save several minutes, but we have seen incredible results and amazing feedback from most of the people using the product today
    ⬢ Some tests require more compute resources than others, so having a single configuration of agents is far from ideal
  47. What’s next for Netflix on Test Distribution?

  48. THANKS Any questions?
    Marc Philipp: marc@gradle.com, Twitter: @marcphilipp
    Roberto Perez Alcolea: rperezalcolea@netflix.com, Twitter: @rpalcolea