Day Of Jenkins 2017. Dealing with agent connectivity issues

Welcome to the Dark side of Jenkins...

Almost all agent types in Jenkins, including JNLP and SSH agents, use the Remoting library to communicate with the master. Although the ability to run tasks on multiple hosts is one of Jenkins’ success factors, agent connection stability is known to be a major pain point in large-scale installations. This talk covers Remoting internals, how to diagnose issues, how to configure Jenkins and the underlying infrastructure, and the future of this layer in Jenkins.

Simplified version: https://speakerdeck.com/onenashev/day-of-jenkins-2017-dealing-with-agent-connectivity-issues-simplified

Oleg Nenashev

May 30, 2017

Transcript

  1. Dark side of Jenkins.
    Troubleshooting Remoting issues*
    Oleg Nenashev
    CloudBees, Inc.
    Day of Jenkins
    Göteborg, May 30, 2017
    * Simplified version: http://bit.ly/day-of-jenkins-remoting-light

  2. @oleg_nenashev, #DayOfJenkins - 2017 © 2017 CloudBees, Inc. All Rights Reserved. 2
    About me
    @oleg_nenashev
    oleg-nenashev
    LibreCores project
    St. Petersburg Polytechnic University
    Jenkins meetups
    Google Summer of Code

  3. Oleg’s “Hall of Shame”(c)
    • Jenkins Core
    • Windows Service Wrapper
    • Plugins
    • Remoting

  4. What is Remoting?
    • Distributed builds – success factor of Hudson/Jenkins
    • Remoting – engine under the hood of Jenkins
    • In-house implementation

  5. Have you seen THIS?

  6. Or THIS?

  7. Remoting. What does it do?
    • Agent executable (slave.jar)
    • Master communication protocols
    • Classloading
    • Remote I/O Streams
    • Monitoring of agents

  8. Oleg vs. Remoting
    • Dealing with Remoting since 2008
    • Maintained own fork at Synopsys
    • Became Remoting maintainer at CloudBees
    • Maintain Remoting during working hours
    • Deal with support escalations

  9. Agenda
    • HOW DOES REMOTING WORK?
    • WHAT TO DO IF IT DOES NOT?

  10. Disclaimer
    • The presentation represents the speaker’s personal opinion
    • This opinion may differ from the official position of CloudBees or the Jenkins Community
    • The Jenkins terms “agent” and “slave” are equivalent; apologies in advance for any use of the obsolete term

  11. When you run builds on agents, do they get executed on agents?

  13. Build in Jenkins
    Diagram labels: Master, Agent
    • RPC calls
    • System calls
    • RemoteInputStream/RemoteOutputStream
    • Missing classes

  14. Remoting Usage in Jenkins
    • Master <-> Agents
      • 4 protocols
    • Master <-> CLI (Deprecated)
      • https://jenkins.io/blog/2017/04/11/new-cli/
    • Agent <-> Maven in Maven Project Plugin
      • via Maven Interceptors
    • CloudBees Jenkins Enterprise:
      • Client master <-> CloudBees Jenkins Operations Center

  15. Remoting protocols
    • JNLP1 – deprecated protocol
    • JNLP2 – NIO, no encryption
    • JNLP3 – no NIO, encrypted, unstable
    • JNLP4 – NIO, encrypted via TLS
    • CLI1 – no encryption
    • CLI2 – encrypted
    • Ping – test protocol
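
The protocol list above can be read as a small table. A minimal sketch, assuming a hypothetical `Protocol` record and a hypothetical filter `encrypted_and_stable`; properties the slide does not state are left as `None`, and the filter is an illustration, not the recommended configuration (the talk marks CLI over Remoting as deprecated elsewhere):

```python
from typing import NamedTuple, Optional


class Protocol(NamedTuple):
    name: str
    nio: Optional[bool]          # None = not stated on the slide
    encrypted: Optional[bool]    # None = not stated on the slide
    flags: frozenset = frozenset()


# Values copied from the slide; nothing here is read from a real Jenkins instance.
PROTOCOLS = [
    Protocol("JNLP1", None, None, frozenset({"deprecated"})),
    Protocol("JNLP2", True, False),
    Protocol("JNLP3", False, True, frozenset({"unstable"})),
    Protocol("JNLP4", True, True),
    Protocol("CLI1", None, False),
    Protocol("CLI2", None, True),
    Protocol("Ping", None, None, frozenset({"test"})),
]


def encrypted_and_stable(protocols):
    """Names of protocols that are explicitly encrypted and carry no warning flag."""
    return [p.name for p in protocols if p.encrypted is True and not p.flags]


print(encrypted_and_stable(PROTOCOLS))
```

Keeping unknown properties as `None` rather than `False` avoids claiming more than the slide says.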

  16. Protocol configuration
    • Before 2.19.1 – via System Property
    • After – Global security settings

  17. Recommended configuration
    2.32+
    2.54+
    2.46.2+
    • JNLP1 – deprecated protocol
    • JNLP2 – NIO, no encryption
    • JNLP3 – no NIO, encrypted, unstable
    • JNLP4 – NIO, encrypted via TLS
    • CLI1 – no encryption
    • CLI2 – encrypted
    • Ping – test protocol

  18. JNLP-agent
    Diagram labels: Master JVM (jenkins.war), Agent JVM (remoting.jar), HTTP/HTTPS, /tcpAgentListener, JNLP-protocol
    • Docker: jenkinsci/jnlp-slave
    • Swarm Plugin: bundled remoting.jar

  19. SSH-agent
    Diagram labels: Master JVM (jenkins.war), SSH-connect, SSH, Agent JVM, SSH Server, JRE/JDK, remoting.jar, settings, STDOUT/STDERR
    • SSH Slaves Plugin
    • CloudBees NIO SSH Slaves Plugin
    • Docker: jenkinsci/ssh-slave
    • Remoting auto-update from master

  20. JNLP-agent + Windows Service
    • Windows Agent Installer Module
    • Extra: Windows Agents Plugin (installation via DCOM)
    Diagram labels: Master JVM (jenkins.war), Agent (JVM + slave.jar), WinSW (jenkins-slave.exe), HTTP/HTTPS, /tcpAgentListener, JNLP-protocol, remoting.jar
    • Remoting auto-update support
    • Logging by default
    • Failover

  21. Top-5 Remoting Issues
    • Depends on TCP
    • Runaway processes in Windows
    • Outdated Remoting
    • No logging by default
    • Traffic prioritization

  22. Problem 1: Connection failure
    TCP-connection failure
    Agent monitoring
    • Disk usage
    • Remoting version
    • …
    Bug in Remoting PingThread

  23. Bugs in Remoting
    All
    • Poor diagnosability when an issue happens
    JNLP1/JNLP2
    • Known issues in connection management
    • RejectedExecutionException in ExecutorService kills ALL connections (.../remoting/pull/156)
    JNLP3
    • Does not “just work”…
    • Errata: .../remoting/blob/master/docs/protocols.md - jnlp3-connect-errata
    JNLP4
    • No big ones so far…
    • Jenkins 2.27+

  24. PingThread – what is it?

  25. Remoting PingThread
    • Calls Ping with 4-minute timeout?

  26. Remoting PingThread
    • Calls Ping with 4-minute timeout?
    • No-Op RPC request

  27. Ping Thread
    • Send NOOP RPC request to agent
    • Deliver the request over network
    • Wait in agent execution queue
    • Execute in agent ThreadPool
    • Deliver the result back
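
The stages above explain why a ping can fail even when the socket is still open. A minimal sketch of the idea, assuming a local thread pool stands in for the agent's execution queue; the `ping` helper is hypothetical and is not Remoting's implementation:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def ping(executor, timeout_s):
    """Submit a no-op task and wait for its result.

    Returns True if the no-op completed within timeout_s, False otherwise.
    The wait covers queueing and execution, not only connection state.
    """
    future = executor.submit(lambda: None)  # stands in for the no-op RPC request
    try:
        future.result(timeout=timeout_s)
        return True
    except TimeoutError:
        return False


if __name__ == "__main__":
    pool = ThreadPoolExecutor(max_workers=1)
    print("idle pool:", ping(pool, 1.0))
    pool.submit(time.sleep, 0.5)            # a long task blocks the only worker
    print("busy pool:", ping(pool, 0.05))   # the no-op is stuck in the queue
    pool.shutdown(wait=True)
```

With the single worker blocked, the no-op never leaves the queue before the timeout, which is the kind of hang that a check of TCP state alone would not reveal.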

  28. Famous Last Words
    • “Let’s disable PingThread”

  29. Famous Last Words
    • “Let’s disable PingThread”…
    • PingThread monitors all request execution stages
    • Without it you rely on the TCP state
    • Agents may hang forever

  30. TCP…
    • Reliable message delivery…
    • What can possibly go wrong?

  31. 7 circles of virtualization
    • Hardware & Network
    • vSphere/AWS/…
    • *nix OS
    • Docker
    • *nix OS
    • JVM

  33. Agents – double trouble
    The same stack appears twice in the diagram:
    • Hardware
    • vSphere/AWS/…
    • *nix OS
    • Docker
    • *nix OS
    • JVM
    Between the two stacks:
    • Network (routers, VPN, dial-up, proxy…)
    • TCP

  34. Dealing with network
    TCP Retransmission timeout
    • *nix: https://unix.stackexchange.com/questions/210367/changing-the-tcp-rto-value-in-linux
    • Windows: https://support.microsoft.com/en-us/help/170359/how-to-modify-the-tcp-ip-maximum-retransmission-time-out
    Network configuration
    • External monitoring
    • Independent management- and storage-networks
    Reducing [peak] throughput
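
To put a number on the TCP retransmission timeout, here is a rough estimate. It assumes Linux-style exponential backoff with an initial retransmission timeout of 0.2 s capped at 120 s, and a retry count in the style of the `net.ipv4.tcp_retries2` setting. None of these values come from the slides, and real timeouts depend on the measured round-trip time, so treat the result as a ballpark:

```python
def retransmission_timeout_s(retries, rto_min_s=0.2, rto_max_s=120.0):
    """Estimate how long TCP keeps retransmitting before giving up.

    One wait follows the initial send and one follows each retransmission,
    and every wait doubles until it reaches the cap.
    """
    return sum(min(rto_min_s * 2 ** i, rto_max_s) for i in range(retries + 1))


PING_TIMEOUT_S = 4 * 60  # the 4-minute ping timeout mentioned in this talk

for retries in (15, 8):
    total = retransmission_timeout_s(retries)
    verdict = "below" if total < PING_TIMEOUT_S else "NOT below"
    print(f"retries={retries}: about {total:.1f} s, {verdict} the ping timeout")
```

Under these assumptions 15 retries add up to roughly 15 minutes, far longer than a 4-minute ping timeout, while 8 retries stay under two minutes.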

  35. Reducing Remoting network throughput
    Diagram labels: Master, Node, Access from UI
    Storage
    • Temporary Data
    • Logs
    • Artifacts

  36. Reducing Remoting network throughput
    Low-hanging fruits:
    • Less console logging
    • Persisted JAR cache, esp. in Cloud Nodes
    • External Artifact publishers
    • Pipeline: Local WS instead of stash/unstash

  37. Stash() replacement. External Workspace Manager
    https://github.com/jenkinsci/external-workspace-manager-plugin

  38. Example
    https://github.com/jenkinsci/external-workspace-manager-plugin

  39. Problem. Invalid Objects
    JENKINS-23271
    • Garbage Collector is too smart in Java 8
    • RemoteInvocationHandler => command.start().join()
    • Jenkins 2.35+

  40. Issue: Outdated Remoting on agents
    • No bugfixes
    • No new protocols (e.g. JNLP4)
    • Worse diagnosability
    • Potential compatibility issues
    OOTB:
    SSH agents
    • Update to the version on master
    Windows Service agents
    • No auto-update till Jenkins 2.50
    JNLP agents
    • No auto-update

  41. How to check the Remoting version?
    • System Information on the agent page
    • Version Column Plugin: https://plugins.jenkins.io/versioncolumn

  42. Issue. Java versions
    Vendor
    • Jenkins needs some native libs
    • IBM Java has known compat issues
    Target platform
    • Do NOT use 32-bit Java on 64-bit Windows
    • https://github.com/kohsuke/winp#platform-support
    • 32-bit Java is bundled in Jenkins Windows Installers :(
    Version
    • Jenkins’ Java requirements apply to agents as well
    • Otherwise – Undefined behavior

  43. How to monitor Java versions?
    • Version Column Plugin again!
    • Since: 2.0-beta-1 (Experimental update center)

  44. Configuring Java monitors
    ${JENKINS_URL}/computer/configure
    Built-in strategies:
    • Agent JVM version is greater than or equal to the Master’s supported one (default)
    • Agent JVM major.minor version is equal to the Master one (paranoid)
    • Agent JVM version is exactly equal to the Master one (paranoid++)
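
The three strategies above differ only in how much of the version string they compare. A minimal sketch, assuming purely numeric version strings such as `1.8.0_131`; the function names are hypothetical and this is not the Version Column Plugin's code:

```python
import re


def parse_java_version(version):
    """Split a version such as '1.8.0_131' into a tuple of ints: (1, 8, 0, 131)."""
    return tuple(int(part) for part in re.split(r"[._]", version))


def at_least(agent, minimum):
    """Default strategy: the agent JVM is not older than the required minimum."""
    return parse_java_version(agent) >= parse_java_version(minimum)


def same_major_minor(agent, master):
    """'Paranoid' strategy: the first two version components match."""
    return parse_java_version(agent)[:2] == parse_java_version(master)[:2]


def exactly_equal(agent, master):
    """'Paranoid++' strategy: every version component matches."""
    return parse_java_version(agent) == parse_java_version(master)


if __name__ == "__main__":
    agent, master = "1.8.0_131", "1.8.0_66"
    print(at_least(agent, master), same_major_minor(agent, master), exactly_equal(agent, master))
```

Version strings with suffixes such as `-ea` would need extra parsing that this sketch leaves out.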

  45. Issue: “Agent is already connected”
    Windows Service, Other

  46. Runaway Process
    • Kill jenkins-slave.exe on Jenkins 2.50-
    • Runaway agent process

  47. Runaway Process

  48. New Windows Service Wrapper
    • Jenkins 2.50+
    • For new agents…
      • Remoting auto-update on agent side
      • Runaway Process Killer
    • Old agents need configuration
    https://github.com/kohsuke/winsw/

  49. jenkins-slave.xml
    <service>
      <id>@…@</id>
      <name>Jenkins agent (@…@)</name>
      <description>This service runs an agent for Jenkins automation server.</description>
      <executable>@…@</executable>
      <arguments>-Xrs @…@ -jar "%BASE%\slave.jar" @…@</arguments>
      <logmode>rotate</logmode>
      …
      <extensions>
        <extension … className="…RunawayProcessKiller.RunawayProcessKillerExtension">
          <pidfile>%BASE%\jenkins_agent.pid</pidfile>
          <stopTimeout>5000</stopTimeout>
          <stopParentFirst>false</stopParentFirst>
        </extension>
      </extensions>
    </service>
    https://github.com/jenkinsci/windows-slave-installer-module/…/src/main/resources/org/jenkinsci/modules/windows_slave_installer/jenkins-slave.xml
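
The PID file and the 5000 value in this configuration hint at how a runaway process can be cleaned up on startup. A rough sketch of that idea, assuming the 5000 is a stop timeout in milliseconds; the function names and the injected `is_alive`, `terminate` and `force_kill` callables are hypothetical, and this is not the WinSW implementation:

```python
import os
import tempfile
import time


def read_pid(pidfile):
    """Return the PID stored in pidfile, or None if the file is missing or malformed."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def kill_runaway(pidfile, is_alive, terminate, force_kill, stop_timeout_s=5.0, poll_s=0.1):
    """Stop a process left over from a previous run and report what was done."""
    pid = read_pid(pidfile)
    if pid is None or not is_alive(pid):
        return "nothing to do"
    terminate(pid)                          # ask the process to stop first
    deadline = time.monotonic() + stop_timeout_s
    while time.monotonic() < deadline:
        if not is_alive(pid):
            return "stopped"
        time.sleep(poll_s)
    force_kill(pid)                         # timeout expired, kill it
    return "killed"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "jenkins_agent.pid")
        print(kill_runaway(path, lambda pid: True, print, print))  # no PID file yet
```

Passing the process operations in as callables keeps the sketch platform-neutral; a real service wrapper would use the operating system's process APIs instead.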

  50. Updating Windows agents
    • Update the core and/or WinSW
    • Update jenkins-slave.xml
    • Restart agent
    • Wait & Restart again
    Guide: https://github.com/jenkinsci/windows-slave-installer-module#upgrading-old-agents

  51. Diagnosing Remoting issues

  52. What do we need?
    Master
    • Core version
    • Logs
    • ? Stackdumps
    Agent
    • Remoting version
    • Logs
    • ? Stackdumps

  53. Support Core Plugin

  54. When the Agent fails…
    Master
    • Core version
    • Logs
    • ? Stackdumps
    Agent
    • Remoting version
    • Logs
    • ? Stackdumps

  55. Problem: No logging by default
    OOTB:
    SSH agents
    • No logging
    Windows Service agents
    • Logging with logrotate
    • Collected by Support Core when agent is online
    JNLP agents
    • No logging

  56. Enabling logging in SSH/JNLP agents
    GOOD – “-slaveLog” parameter
    • Tee STDOUT/STDERR to a file
    • No Logrotate Support
    BAD – STDOUT/STDERR redirect
    • Shell-dependent
    • SSH agents – patch command suffix
    UGLY – JUL property file
    • NOT documented as well
    • Some logs go to STDOUT/STDERR
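
As an illustration of the `-slaveLog` and JUL property file options, here is a sketch that assembles an agent launch command. The helper `agent_command`, the file names and the URL are placeholders; `-jnlpUrl` and the `java.util.logging.config.file` system property are not named on the slide and are included here as assumptions about a typical launch line:

```python
def agent_command(jar="slave.jar", jnlp_url=None, log_file=None, jul_config=None):
    """Build an argv list for launching a Remoting agent with logging enabled."""
    cmd = ["java"]
    if jul_config:
        # JVM options must come before -jar to be seen by the JVM.
        cmd.append(f"-Djava.util.logging.config.file={jul_config}")
    cmd += ["-jar", jar]
    if jnlp_url:
        cmd += ["-jnlpUrl", jnlp_url]
    if log_file:
        cmd += ["-slaveLog", log_file]
    return cmd


if __name__ == "__main__":
    # Placeholder URL and file names, for illustration only.
    print(" ".join(agent_command(
        jnlp_url="https://jenkins.example.com/computer/agent1/slave-agent.jnlp",
        log_file="agent.log",
    )))
```

Building the command as a list rather than a shell string sidesteps the shell-dependent redirect problem listed under BAD.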

  57. So… What to do?

  58. What to do NOW?
    Update Jenkins
    • 2.46+ for Remoting patches
    • For Windows – LTS 2.60.1+
    Check timeouts (TCP Retransmission + Ping)
    • T_Retransmission < T_Ping
    • PingThread should stay turned on
    Setup logging
    • Wait for Remoting 3.8 with configurable JUL
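
The version baselines on this slide can be turned into a quick check. A minimal sketch, assuming plain dotted numeric versions; `should_update` and its parameter are hypothetical names, and the thresholds are the ones quoted above:

```python
def parse_version(version):
    """Turn '2.46.2' into (2, 46, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def should_update(core_version, has_windows_agents=False):
    """True if the Jenkins core is older than the baseline quoted on the slide."""
    baseline = "2.60.1" if has_windows_agents else "2.46"
    return parse_version(core_version) < parse_version(baseline)


if __name__ == "__main__":
    for version, windows in (("2.32.3", False), ("2.46.2", False), ("2.50", True)):
        print(version, "windows agents" if windows else "no windows agents",
              "-> update" if should_update(version, windows) else "-> ok")
```

Comparing tuples of integers avoids the classic string-comparison trap where "2.9" sorts after "2.46".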

  59. What’s next?
    • Better Diagnosability
    • Better Stability

  60. Ongoing changes
    • New release – 3.8+
      • .../remoting/blob/master/CHANGELOG.md#38
    • Docs
      • https://github.com/jenkinsci/remoting/
    • Work directories in Remoting (JENKINS-44108)
    • Logging on agents by default (JENKINS-39369)

  61. Work Directories (JENKINS-44108)
    • Logging by default
    • Independent JAR Cache
    • Workspace status checks

  62. Work Directories (JENKINS-44108)
    Long adoption…
    ETA: September LTS

  63. What’s next?

  64. Moonshots
    Remoting
    • TCP-robust Remoting?
    • Autoupdate of ALL JNLP agents? (JENKINS-44099)
    • Update of Remoting in Master without the core upgrade?
    Traffic optimization
    • External logging
    • Pluggable storage

  65. Takeaways
    • Remoting – risk factor in Jenkins
      • Remoting does not scale well OOTB
      • INFRA issues – frequent root cause
      • Remoting can be stabilized
    • Jenkins 2 is not just about Pipelines, keep updating

  66. Links
    • Remoting
      • GitHub: https://github.com/jenkinsci/remoting (Docs, etc.)
    • Windows services:
      • https://github.com/kohsuke/winsw/
    • CloudBees
      • go.cloudbees.com – community pages, knowledge base

  67. Thank you!
    Contacts:
    GitHub: oleg-nenashev
    Twitter: @oleg_nenashev

  68. Software at the speed of ideas
    THANK YOU!
    www.cloudbees.com