How we Built a Distributed Testing Platform

Test time is one of the key drivers of build times. Many factors contribute to test time, including running tests sequentially and depending on expensive external resources or services. The sheer number of tests needed to cover a wide range of inputs is also a factor. As a result, tests are often run only on CI, which considerably lengthens the feedback loop.

Join us in this session to take a behind-the-scenes look at the challenges the team building Test Distribution faced and how we solved them. To name a few: efficient file transfer, unstable network connections, scheduling, and auto-scaling. In addition, we’ll share experiences from customers adopting Test Distribution. What issues did they face when starting to distribute existing test suites? How did they overcome them? What successes/gains did they see?

Marc Philipp

November 02, 2022

Transcript

  1. How we Built a Distributed Testing Platform

  2. Marc Philipp, Sr. Principal Software Engineer, @marcphilipp
    Roberto Perez Alcolea, Senior Software Engineer, @rpalcolea (Special Guest)

  3. Part 1: Why Distribute Tests?

  4. Testing usually dominates build times

  5. Why is testing so slow?
    Databases
    Web servers
    Directories
    Virtual machines
    Network latency

  6. Reduced test time yields an increase in productivity
    More build and test executions
    Shift-left testing from CI to local builds
    Less context-switching

  7. Part 2: Existing Solutions
    for running tests in parallel

  8. Local parallelization lacks historical information

  9. Single-machine parallelism does not scale

  10. CI Fanout
    Idea: run a subset of tests on CI agents in parallel
    ⬢ Grouping of tests is often manual
    ⬢ Large overhead because each CI agent has to run the build up to the test task
    ⬢ Test results are scattered over multiple CI jobs
    ⬢ Does not support local builds

  11. Part 3: Building Test Distribution
    Our Journey

  12. Initial prototype (2019)

  13. Initial prototype (2019)
    ⬢ Implemented as Jenkins plugin
    ⬢ Idea: reuse Jenkins nodes and infrastructure to run tests remotely
    ⬢ Pros:
    ○ Jenkins was widely used by our largest customers at the time
    ⬢ Cons:
    ○ Not all customers use Jenkins
    ○ Installing/updating Jenkins plugins in a corporate environment is not as simple as it seems (i.e. we couldn’t move as fast as we wanted to)

  14. Initial release
    Gradle Enterprise 2020.2
    ⬢ Broker component included in Gradle Enterprise server
    ⬢ Agents connect to the broker
    ⬢ Builds connect to the broker and request agents
    ⬢ Works for local and CI builds
    On-premises inside your network

  15. Running Test Distribution agents
    Gradle Enterprise 2020.2
    ⬢ The agent comes in two flavors: Jar and Docker image
    ⬢ Runs on Java 11+, requires about 128 MB of memory
    ⬢ Runs on Windows, macOS, and Linux
    ⬢ Detects its environment (JDKs, OS) during startup
    ⬢ Administrators can pass additional capabilities as command line parameters
    Runs on-premises inside your own network infrastructure
    Jar:
    java -jar gradle-enterprise-test-distribution-agent.jar \
      --server https://ge.example.com \
      --api-key «api-key» \
      --capabilities docker,postgres=14
    Docker image:
    docker run \
      --env TEST_DISTRIBUTION_AGENT_SERVER=https://ge.example.com \
      --env TEST_DISTRIBUTION_AGENT_API_KEY=«api-key» \
      --env TEST_DISTRIBUTION_AGENT_CAPABILITIES=postgres=14 \
      gradle/gradle-enterprise-test-distribution-agent
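
The capability strings above suggest a simple matching model: a test task states requirements, and only agents that offer all of them qualify. The following Python sketch illustrates that idea under stated assumptions. The function names, the agent names, and the plain superset rule are hypothetical and are not taken from the product.

```python
def parse_capabilities(spec: str) -> set[str]:
    """Split a comma-separated capability string such as 'docker,postgres=14'."""
    return {item.strip() for item in spec.split(",") if item.strip()}


def eligible_agents(requirements: set[str], agents: dict[str, set[str]]) -> list[str]:
    """Return the names of agents whose capabilities include every requirement.

    Assumption: matching is exact string comparison, so 'postgres=14'
    does not satisfy a requirement of 'postgres=15'.
    """
    return sorted(name for name, caps in agents.items() if requirements <= caps)


# Hypothetical agents: detected capabilities (OS, JDK) plus extra ones
# passed by an administrator on the command line.
agents = {
    "agent-1": parse_capabilities("os=linux,jdk=11,docker,postgres=14"),
    "agent-2": parse_capabilities("os=linux,jdk=11"),
    "agent-3": parse_capabilities("os=macos,jdk=17,docker"),
}

needs_postgres = parse_capabilities("os=linux,postgres=14")
needs_docker = parse_capabilities("docker")
```

With these inputs, `eligible_agents(needs_postgres, agents)` yields only `agent-1`, while `eligible_agents(needs_docker, agents)` yields `agent-1` and `agent-3`.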

  16. Integrates with default Test task
    Gradle Enterprise 2020.2
    tasks.test {
        useJUnitPlatform()
        distribution {
            enabled.set(true)
        }
    }
    Code coverage and other output files are transferred back and merged automatically
    Input files (e.g. classpath) are automatically transferred to remote agents

  17. Test Distribution requires JUnit Platform
    ⬢ Most test frameworks with JUnit Platform test engines are supported:
    ○ JUnit 5 (Jupiter) ✔
    ○ JUnit 3/4 and Spock 1.x (via junit-vintage-engine included in JUnit 5) ✔
    ○ Spock 2.x ✔
    ○ TestNG (via testng-engine) ✔
    ○ ScalaTest (via scalatest-junit-runner) ✔
    ○ ArchUnit ✔
    ○ jqwik ✔
    ⬢ Currently unsupported (we have plans to add support where possible):
    ○ Kotest
    ○ Spek
    ○ Cucumber

  18. Checking compatibility
    ⬢ Compatibility can be checked before adopting Test Distribution by applying a custom build script that adds custom values to the Build Scan
    https://github.com/gradle/gradle-enterprise-build-config-samples/pull/469

  19. Trace Files
    ⬢ Simple but powerful
    ⬢ Using Chrome for visualization
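
The slide does not say which file format the trace files use. As an illustration only, the sketch below assumes Chrome's Trace Event Format, which the chrome://tracing viewer can load, and writes one "complete" event (`ph` set to `X`) per test class. The helper names and the sample spans are hypothetical.

```python
import json


def complete_event(name, start_us, duration_us, pid, tid):
    """Build one Trace Event Format 'complete' event (timestamps in microseconds)."""
    return {
        "name": name,
        "ph": "X",  # 'X' marks a complete event with a start and a duration
        "ts": start_us,
        "dur": duration_us,
        "pid": pid,
        "tid": tid,
    }


def build_trace(spans):
    """Serialize (name, start_us, duration_us, agent_index) tuples as a trace file.

    Each agent is mapped to its own 'tid' so the viewer draws one row per agent.
    """
    events = [
        complete_event(name, start, duration, pid=1, tid=agent)
        for name, start, duration, agent in spans
    ]
    return json.dumps({"traceEvents": events}, indent=2)


# Hypothetical spans: two test classes running on two different agents.
example = build_trace([("FooTest", 0, 1500, 1), ("BarTest", 0, 900, 2)])
```

Writing `example` to a `.json` file gives something a trace viewer can open to show which agent ran what, and when.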

  20. Run build only once, distribute test execution

  21. Automatic distribution based on previous execution times
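
One common way to use historical execution times is a greedy "longest first" partitioning: sort test classes by previous duration and always hand the next one to the least-loaded agent. The sketch below shows that heuristic. It is not claimed to be the algorithm Test Distribution uses, and the test names and durations are made up.

```python
import heapq


def partition_by_duration(durations: dict[str, float], agent_count: int) -> list[list[str]]:
    """Assign tests to agents, longest first, always picking the least-loaded agent."""
    # Heap entries are (total seconds assigned so far, agent index).
    load = [(0.0, index) for index in range(agent_count)]
    heapq.heapify(load)
    partitions: list[list[str]] = [[] for _ in range(agent_count)]
    # Sort by descending duration; the name breaks ties deterministically.
    for test, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        total, index = heapq.heappop(load)
        partitions[index].append(test)
        heapq.heappush(load, (total + seconds, index))
    return partitions


# Hypothetical durations (seconds) recorded from a previous build.
history = {"SlowIT": 60.0, "DbIT": 40.0, "ApiTest": 30.0, "UtilTest": 20.0, "FastTest": 10.0}
plan = partition_by_duration(history, agent_count=2)
```

For this history, `plan` is `[["SlowIT", "UtilTest"], ["DbIT", "ApiTest", "FastTest"]]`: both agents get 80 seconds of work, compared with 160 seconds when everything runs sequentially.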

  22. Biggest shortcoming of initial release: file transfer
    Worst case: m input files with n agents results in n*m files sent over WebSocket connection
    Huge bottleneck!
    [Diagram: files are sent to each agent, which stores them in its local file cache]

  23. File transfer multiplexing/deduplication
    Gradle Enterprise 2020.3
    Idea: upload only once (via HTTP) and cache on Gradle Enterprise server
    [Diagram: the build uploads each file only once to a file cache on the server; agents read from that cache and store files in their local file caches]
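
The "upload only once" idea can be sketched with a content-addressed cache: the build uploads a file only if the server does not already hold its hash, and each agent downloads a file only if it is missing from its local cache. The classes below are an in-memory illustration under those assumptions. Keying by SHA-256 and all class and method names are hypothetical; the slide only states that files are uploaded once via HTTP and cached on the server.

```python
import hashlib


class ServerFileCache:
    """In-memory stand-in for the server-side file cache, keyed by content hash."""

    def __init__(self):
        self._blobs = {}
        self.uploads = 0

    def missing(self, digests):
        return [digest for digest in digests if digest not in self._blobs]

    def upload(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = content
            self.uploads += 1
        return digest

    def read(self, digest: str) -> bytes:
        return self._blobs[digest]


class Agent:
    """Agent that keeps a local cache so repeated builds skip downloads."""

    def __init__(self, server: ServerFileCache):
        self.server = server
        self.local_cache = {}
        self.downloads = 0

    def fetch(self, digest: str) -> bytes:
        if digest not in self.local_cache:
            self.local_cache[digest] = self.server.read(digest)
            self.downloads += 1
        return self.local_cache[digest]


def distribute(files, server, agents):
    """Upload each distinct file once, then let every agent pull what it lacks."""
    by_digest = {hashlib.sha256(content).hexdigest(): content for content in files}
    for digest in server.missing(by_digest):
        server.upload(by_digest[digest])
    for agent in agents:
        for digest in by_digest:
            agent.fetch(digest)
```

With m files and n agents, the build side sends m uploads instead of n*m transfers, and a second build with unchanged inputs sends none.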

  24. Speedup over slow connections
    Gradle Enterprise 2020.3

  25. Next problem: network failures causing builds to fail
    ⬢ Test Distribution requires WebSocket connections between builds/agents and the Gradle Enterprise server
    ⬢ Murphy’s law: Anything that can go wrong, will go wrong

  26. Resilience against temporary network failures
    Gradle Enterprise 2020.4
    ⬢ Actively manage connections using WebSocket pings
    ⬢ Reconnect if connection is lost or unresponsive
    ⬢ Reschedule work on other agents if agent disappears
    ⬢ Retry file uploads on non-client errors
    ⬢ Prevent builds from breaking and causing disruption
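
"Retry file uploads on non-client errors" can be read as: retry when the failure looks transient (connection problems, timeouts, 5xx responses) and give up immediately on 4xx responses, since repeating a bad request will not help. The sketch below follows that reading with exponential backoff. The `HttpError` class, the retry limits, and the backoff schedule are assumptions made for illustration.

```python
import time


class HttpError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed upload."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(error: Exception) -> bool:
    """Treat everything except 4xx client errors as potentially transient."""
    if isinstance(error, HttpError):
        return not 400 <= error.status < 500
    return isinstance(error, (ConnectionError, TimeoutError))


def upload_with_retry(upload, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call upload() until it succeeds, backing off 0.5s, 1s, 2s, ... between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return upload()
        except Exception as error:
            # Re-raise on the last attempt or when retrying cannot help.
            if attempt == max_attempts or not is_retryable(error):
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without actually waiting.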

  27. Maven support
    Gradle Enterprise 2020.5
    [pom.xml snippet; XML markup lost in extraction. Surviving values: maven-surefire-plugin, 2.22.2, true]
    ⬢ Requires Gradle Enterprise Maven extension
    ⬢ Integrates with Surefire and Failsafe plugins

  28. Adaptive scheduling
    Gradle Enterprise 2021.1
    ⬢ Be able to react to additional agents becoming available during test execution
    ⬢ Increase agent utilization
    ⬢ Reduce testing time
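
To see why reacting to late-joining agents matters, consider a pull-based model in which whichever agent becomes free first takes the next test from a shared queue. The small simulation below compares a build that only ever has one agent with one where a second agent joins 20 seconds in. It is a simplified model with made-up numbers, not the product's scheduler.

```python
import heapq


def simulate_pull_scheduling(durations, join_times):
    """Return the finish time when agents pull tests from a shared queue.

    durations:  test durations in queue order (seconds)
    join_times: time at which each agent becomes available (seconds)
    """
    # Heap entries are (time the agent is next free, agent index).
    free_at = [(joined, index) for index, joined in enumerate(join_times)]
    heapq.heapify(free_at)
    finish = 0
    for seconds in durations:
        start, agent = heapq.heappop(free_at)  # earliest-free agent takes the test
        end = start + seconds
        finish = max(finish, end)
        heapq.heappush(free_at, (end, agent))
    return finish


tests = [10] * 8  # eight hypothetical test classes of 10 seconds each
single_agent = simulate_pull_scheduling(tests, join_times=[0])
with_late_agent = simulate_pull_scheduling(tests, join_times=[0, 20])
```

Here `single_agent` is 80 seconds and `with_late_agent` is 50 seconds: the first agent completes two tests alone, and the remaining six are split evenly once the second agent arrives.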

  29. Auto-scaling Test Distribution agents
    Gradle Enterprise 2021.2
    ⬢ Agent pools with min/max size and capabilities for horizontal scaling
    ⬢ HTTP endpoint provides metrics indicating the target number of agents for each pool, based on demand
    ⬢ Step-by-step instructions for Kubernetes in docs
    ⬢ Real-time and historical usage can be visualized by Gradle Enterprise administrators
    {
      "id": "sosmbpbr",
      "name": "Linux",
      "capabilities": [
        "jdk=8",
        "os=linux"
      ],
      "minimumAgents": 1,
      "maximumAgents": 90,
      "connectedAgents": 2,
      "idleAgents": 0,
      "desiredAgents": 8
    }
    ✅ Test Distribution is production-ready
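
An external autoscaler could turn the pool status shown above into a scaling decision. The sketch below parses that JSON and computes how many agents to add or remove. The field names come from the slide; the clamping to the pool's minimum and maximum is a defensive assumption (the server may already apply it), and how the status is fetched is deliberately left out because the slide does not give the endpoint.

```python
import json


def scale_delta(pool: dict) -> int:
    """Return how many agents to add (positive) or remove (negative) for a pool."""
    # Assumption: clamp the desired count to the pool limits before comparing.
    target = max(pool["minimumAgents"], min(pool["desiredAgents"], pool["maximumAgents"]))
    return target - pool["connectedAgents"]


# Pool status as shown on the slide.
status = json.loads("""
{
  "id": "sosmbpbr",
  "name": "Linux",
  "capabilities": ["jdk=8", "os=linux"],
  "minimumAgents": 1,
  "maximumAgents": 90,
  "connectedAgents": 2,
  "idleAgents": 0,
  "desiredAgents": 8
}
""")

delta = scale_delta(status)
```

For the status on the slide, `delta` is 6: the pool has 2 connected agents and wants 8, well within its limit of 90.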

  30. Gradle Enterprise Acceleration Features

  31. Gradle Enterprise Acceleration Features
    Gradle Enterprise 2022.2

  32. Teaser: Foundations of Predictive Test Selection
    Eric Wendelin, Principal Data Scientist
    Luke Daley, Principal Executive
    Thu 10am

  33. Single Gradle plugin and Maven extension for all Gradle Enterprise
    Gradle Enterprise 2022.3
    plugins {
        id("com.gradle.enterprise") version "3.11.2"
        id("com.gradle.enterprise.test-distribution") version "2.3.5"
    }

  34. Demo
    recorded by Doug Tidwell (a.k.a. Dr. DPE)

  35. More Information
    https://gradle.com/gradle-enterprise-solutions/test-distribution/

  36. Part 4: Experience Report

  37. Test Distribution @ Netflix
    Roberto Perez Alcolea, Senior Software Engineer, @rpalcolea

  38. Netflix JVM Build/CI footprint
    ⬢ 3.2k Gradle based repositories
    ⬢ ~191k weekly Gradle based builds
    ⬢ 35 Jenkins controllers
    ○ 650-1500 Jenkins agents
    ■ 1-100 executors per agent
    ⬢ Hundreds of Engineers

  39. Why did Netflix start using Test Distribution?
    ⬢ 88% of build time was spent on test execution
    ⬢ The need for a consistent experience between local and CI
    ○ Reduce the "it works on my machine" or "only works on Jenkins" cases
    ○ Tests that require Docker containers had different results locally and on CI due to architecture or version differences
    ⬢ Better compute resource usage
    ○ Build performance varies a lot based on who runs the build and where
    ■ Engineers have different hardware, even within the same team
    ■ Jenkins agents could have more or less capacity than others
    ⬢ Faster test feedback and the ability to run tests locally mean developers run tests more frequently, speeding up the local development feedback loop rather than depending on CI environments

  40. Getting ready for Test Distribution
    (Default configurations through Nebula)
    ⬢ Introduced recommended defaults when Test Distribution is enabled via Nebula, our set of Gradle plugins distributed as a custom Gradle distribution within Netflix
    ○ Default requirements (capabilities)
    ○ Timeouts and remote execution preferences
    ⬢ Used gradle-lint-plugin to introduce JUnit Platform engines
    ○ Mix of junit4, testng, scalatest and junit5
    ■ ~1.4k repositories required this work
    ○ Decided not to migrate test code, but instead to move projects to the proper engines so they are compatible

  41. Rolling out Test Distribution at Netflix - Phase 1
    ⬢ Single agent pool
    ○ One AWS region
    ○ 20 container-based agents
    ⬢ Enrolled the JVM Ecosystem team repositories and a couple of external partners interested in beta testing
    ○ Most of these repositories have Gradle TestKit-based tests, which made them a great use case to test right away

  42. Rolling out Test Distribution at Netflix - Phase 1
    Build w/ Integration tests 62 min → 5 min

  43. Rolling out Test Distribution at Netflix - Phase 1
    Build w/ Integration tests 62 min → 5 min

  44. Rolling out Test Distribution at Netflix - Phase 2
    (Current)
    ⬢ Multiple agent pools
    ○ Two types of agents: container and EC2 (VM)
    ■ EC2 agents are required to support Docker-in-Docker for use cases like Testcontainers-based integration tests
    ○ Agents in multiple AWS regions to match Jenkins job locations and provide a better network experience
    ○ The number of agents is scaled up/down based on agent availability/usage, but we run overprovisioned to continue learning
    ○ Same CPU, disk, and memory
    ⬢ Enrolled ~350 projects (~10% of builds)

  45. Test Distribution learnings so far
    ⬢ Having two types of agents (EC2 vs. container-based) increases operational complexity
    ○ Docker maintenance
    ○ Agent selection (capabilities)
    ⬢ Engineers write tests that might require resources that can’t be added to common agent pools. Unfortunately, these tests are blockers for adoption. Examples are:
    ○ Network access
    ○ Security policies
    ⬢ Every build is unique and not all of them save several minutes, but we have seen incredible results and amazing feedback from most of the people using the product today
    ⬢ Some tests require more compute resources than others; having a single agent configuration is far from ideal

  46. What’s next for Netflix on Test Distribution?

  47. THANKS
    Any questions?
    Marc Philipp, Twitter: @marcphilipp
    Roberto Perez Alcolea, rperezalcolea@netflix.com, Twitter: @rpalcolea
