Day Of Jenkins 2017. Dealing with agent connectivity issues

Welcome to the Dark side of Jenkins...

Almost all agent types in Jenkins, including JNLP and SSH agents, use the Remoting library to communicate with the master. Although Jenkins’ ability to run tasks on multiple hosts is one of its success factors, agent connection stability is known to be a major pain point in large-scale installations. This talk covers Remoting internals, how to diagnose issues, how to configure Jenkins and the underlying infrastructure, and the future of this layer in Jenkins.

Simplified version: https://speakerdeck.com/onenashev/day-of-jenkins-2017-dealing-with-agent-connectivity-issues-simplified

Oleg Nenashev

May 30, 2017

Transcript

  1. Dark side of Jenkins. Troubleshooting Remoting issues*

    Oleg Nenashev, CloudBees, Inc. Day of Jenkins, Göteborg, May 30, 2017
    * Simplified version: http://bit.ly/day-of-jenkins-remoting-light
  2. About me

    • Twitter: @oleg_nenashev
    • GitHub: oleg-nenashev
    • LibreCores project
    • St. Petersburg Polytechnic University
    • Jenkins meetups
    • Google Summer of Code

    Slide footer: @oleg_nenashev, #DayOfJenkins - 2017 © 2017 CloudBees, Inc. All Rights Reserved.
  3. Oleg’s “Hall of Shame”(c)

    • Jenkins Core
    • Windows Service Wrapper
    • Plugins
    • Remoting
  4. What is Remoting?

    • Distributed builds: a success factor of Hudson/Jenkins
    • Remoting: the engine under the hood of Jenkins
    • In-house implementation
  5. Remoting: what does it do?

    • Agent executable (slave.jar)
    • Master communication protocols
    • Classloading
    • Remote I/O Streams
    • Monitoring of agents
  6. Oleg vs. Remoting

    • Dealing with Remoting since 2008
    • Maintained own fork at Synopsys
    • Became Remoting maintainer at CloudBees
    • Maintain Remoting during working hours
    • Deal with support escalations
  7. Agenda

    • How does Remoting work?
    • What to do if it does not?
  8. Disclaimer

    • The presentation represents the speaker’s personal opinion
    • This opinion may differ from the official position of CloudBees or the Jenkins community
    • The Jenkins terms “agent” and “slave” are equivalent; apologies in advance for any use of the obsolete term
  9. When you run builds on agents, do they get executed on agents?
  10. When you run builds on agents, do they get executed on agents?
  11. Build in Jenkins

    [Diagram: Master and Agent, linked by RPC calls, system calls, RemoteInputStream/RemoteOutputStream, and missing classes]
  12. Remoting Usage in Jenkins

    • Master ↔ Agents
      • 4 protocols
    • Master ↔ CLI (deprecated)
      • https://jenkins.io/blog/2017/04/11/new-cli/
    • Agent ↔ Maven in Maven Project Plugin
      • via Maven Interceptors
    • CloudBees Jenkins Enterprise:
      • Client master ↔ CloudBees Jenkins Operations Center
  13. Remoting protocols

    • JNLP1: deprecated protocol
    • JNLP2: NIO, no encryption
    • JNLP3: no NIO, encrypted, unstable
    • JNLP4: NIO, encrypted via TLS
    • CLI1: no encryption
    • CLI2: encrypted
    • Ping: test protocol
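The protocol properties listed on this slide can be captured in a small lookup table. The following is only an illustrative sketch: the `Protocol` class and the helper function are hypothetical names, and any property the slide does not state is recorded as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Protocol:
    name: str
    nio: Optional[bool]        # None = not stated on the slide
    encrypted: Optional[bool]  # None = not stated on the slide
    note: str = ""


# Properties transcribed from the slide; nothing else is assumed.
PROTOCOLS = [
    Protocol("JNLP1", nio=None, encrypted=None, note="deprecated"),
    Protocol("JNLP2", nio=True, encrypted=False),
    Protocol("JNLP3", nio=False, encrypted=True, note="unstable"),
    Protocol("JNLP4", nio=True, encrypted=True),
    Protocol("CLI1", nio=None, encrypted=False),
    Protocol("CLI2", nio=None, encrypted=True),
    Protocol("Ping", nio=None, encrypted=None, note="test protocol"),
]


def nio_and_encrypted(protocols):
    """Names of protocols the slide describes as both NIO-based and encrypted."""
    return [p.name for p in protocols if p.nio is True and p.encrypted is True]


print(nio_and_encrypted(PROTOCOLS))
```

Filtering on both properties leaves JNLP4 as the only agent protocol that the slide describes as NIO-based and encrypted.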
  14. Protocol configuration

    • Before 2.19.1: via System Property
    • After: Global security settings
  15. Recommended configuration

    Jenkins versions noted on the slide: 2.32+, 2.54+, 2.46.2+

    • JNLP1: deprecated protocol
    • JNLP2: NIO, no encryption
    • JNLP3: no NIO, encrypted, unstable
    • JNLP4: NIO, encrypted via TLS
    • CLI1: no encryption
    • CLI2: encrypted
    • Ping: test protocol
  16. JNLP-agent

    [Diagram: Master JVM (jenkins.war) and Agent JVM (remoting.jar); links: HTTP/HTTPS, /tcpAgentListener, JNLP-protocol]

    • Docker: jenkinsci/jnlp-slave
    • Swarm Plugin: bundled remoting.jar
  17. SSH-agent

    [Diagram labels: Master JVM, jenkins.war, Agent JVM, SSH Server, SSH-connect, SSH, STDOUT/STDERR, JRE/JDK, remoting.jar, settings]

    • SSH Slaves Plugin
    • CloudBees NIO SSH Slaves Plugin
    • Docker: jenkinsci/ssh-slave
    • Remoting auto-update from master
  18. JNLP-agent + Windows Service

    • Windows Agent Installer Module
    • Extra: Windows Agents Plugin (installation via DCOM)

    [Diagram labels: Master JVM, jenkins.war, Agent, JVM + slave.jar, WinSW (jenkins-slave.exe), HTTP/HTTPS, /tcpAgentListener, JNLP-protocol, HTTPS/HTTP remoting.jar]

    • Remoting auto-update support
    • Logging by default
    • Failover
  19. Top-5 Remoting Issues

    • Depends on TCP
    • Runaway processes in Windows
    • Outdated Remoting
    • No logging by default
    • Traffic prioritization
  20. Problem 1: Connection failure

    • TCP-connection failure
    • Agent monitoring
      • Disk usage
      • Remoting version
      • …
    • Bug in Remoting PingThread
  21. Bugs in Remoting

    • All
      • Poor diagnosability when an issue happens
    • JNLP1/JNLP2
      • Known issues in connection management
      • RejectedExecutionEx in ExecutorService kills ALL connections (.../remoting/pull/156)
    • JNLP3
      • Does not “just work”…
      • Errata: .../remoting/blob/master/docs/protocols.md - jnlp3-connect-errata
    • JNLP4
      • No big ones so far…
      • Jenkins 2.27+
  22. Remoting PingThread

    • Calls Ping with 4-minute timeout?
  23. Remoting PingThread

    • Calls Ping with 4-minute timeout?
    • No-Op RPC request
  24. Ping Thread

    • Send NOOP RPC request to agent
    • Deliver the request over network
    • Wait in agent execution queue
    • Execute in agent ThreadPool
    • Deliver the result back
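The stages on this slide can be imitated with a tiny model in which a thread pool stands in for the agent. This is a hedged sketch, not Remoting's actual PingThread: the function names are hypothetical, there is no network involved, and the timeouts are shortened so the demo finishes quickly.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def noop():
    """The no-op request: does nothing, it only proves the whole path works."""
    return None


def ping(agent_pool, timeout_s):
    """Return True if a no-op round trip completes within timeout_s seconds."""
    future = agent_pool.submit(noop)      # the request waits in the execution queue
    try:
        future.result(timeout=timeout_s)  # wait for the result to be delivered back
        return True
    except FutureTimeout:
        future.cancel()                   # drop the request that never ran
        return False


def demo():
    with ThreadPoolExecutor(max_workers=1) as pool:  # a one-thread "agent"
        healthy = ping(pool, timeout_s=1.0)
        release = threading.Event()
        pool.submit(release.wait)          # occupy the only worker thread
        stuck = ping(pool, timeout_s=0.1)  # the no-op is stuck behind the busy task
        release.set()                      # let the worker finish so the pool can shut down
    return healthy, stuck


print(demo())
```

The second ping fails even though nothing is wrong with the "connection": the request never leaves the execution queue. That is the kind of stall a check of the TCP state alone would not reveal.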
  25. Famous Last Words

    • “Let’s disable PingThread”
  26. Famous Last Words

    • “Let’s disable PingThread”…
    • PingThread monitors all request execution stages
    • Without it you rely on the TCP state
    • Agents may hang forever
  27. TCP…

    • Reliable message delivery…
    • What can possibly go wrong?
  28. 7 circles of virtualization

    [Layer stack: Hardware & Network, vSphere/AWS/…, *nix OS, Docker, *nix OS, JVM]
  29. 7 circles of virtualization

    [Layer stack: Hardware & Network, vSphere/AWS/…, *nix OS, Docker, *nix OS, JVM]
  30. Agents: double trouble

    [Diagram: two layer stacks (Hardware, vSphere/AWS/…, *nix OS, Docker, *nix OS, JVM), joined by Network (routers, VPN, dial-up, proxy…) and TCP]
  31. Dealing with network

    • TCP Retransmission timeout
      • *nix: https://unix.stackexchange.com/questions/210367/changing-the-tcp-rto-value-in-linux
      • Windows: https://support.microsoft.com/en-us/help/170359/how-to-modify-the-tcp-ip-maximum-retransmission-time-out
    • Network configuration
      • External monitoring
      • Independent management and storage networks
    • Reducing [peak] throughput
  32. Reducing Remoting network throughput

    [Diagram labels: Master, Node, Access from UI, Storage]

    • Temporary Data
    • Logs
    • Artifacts
  33. Reducing Remoting network throughput: low-hanging fruits

    • Less console logging
    • Persisted JAR cache, esp. in cloud nodes
    • External artifact publishers
    • Pipeline: local workspace instead of stash/unstash
  34. Stash() replacement: External Workspace Manager

    https://github.com/jenkinsci/external-workspace-manager-plugin
  35. Example

    https://github.com/jenkinsci/external-workspace-manager-plugin
  36. Problem: Invalid Objects (JENKINS-23271)

    • Garbage Collector is too smart in Java 8
    • RemoteInvocationHandler => command.start().join()
    • Jenkins 2.35+
  37. Issue: Outdated Remoting on agents

    OOTB:
    • SSH agents: update to the version on master
    • Windows Service agents: no auto-update till Jenkins 2.50
    • JNLP agents: no auto-update

    • No bugfixes
    • No new protocols (e.g. JNLP4)
    • Worse diagnosability
    • Potential compatibility issues
  38. How to check the Remoting version?

    • System Information on the agent page
    • Version Column Plugin: https://plugins.jenkins.io/versioncolumn
  39. Issue: Java versions

    • Vendor
      • Jenkins needs some native libs
      • IBM Java has known compat issues
    • Target platform
      • Do NOT use 32-bit Java on 64-bit Windows
      • https://github.com/kohsuke/winp#platform-support
      • 32-bit Java is bundled in Jenkins Windows Installers :(
    • Version
      • Jenkins’ Java requirements apply to agents as well
      • Otherwise: undefined behavior
  40. How to monitor Java versions?

    • Version Column Plugin again!
    • Since: 2.0-beta-1 (Experimental update center)
  41. Configuring Java monitors

    ${JENKINS_URL}/computer/configure

    Built-in strategies:
    • Agent JVM version is greater than or equal to the Master’s supported one (default)
    • Agent JVM major.minor version is equal to the Master’s (paranoid)
    • Agent JVM version is exactly equal to the Master’s (paranoid++)
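The three strategies differ only in how strictly two version strings are compared. The sketch below illustrates that difference; it is not the Version Column Plugin's implementation, the function names are hypothetical, and it assumes Java 8-style version strings such as `1.8.0_131`.

```python
import re


def parse(version):
    """Split a version such as '1.8.0_131' into a tuple of integers."""
    return tuple(int(part) for part in re.split(r"[._-]", version) if part.isdigit())


def at_least(agent, minimum):
    """Default strategy: the agent JVM is not older than the supported minimum."""
    return parse(agent) >= parse(minimum)


def same_major_minor(agent, master):
    """'Paranoid' strategy: the first two version components match."""
    return parse(agent)[:2] == parse(master)[:2]


def exactly_equal(agent, master):
    """'Paranoid++' strategy: every version component matches."""
    return parse(agent) == parse(master)


agent, master = "1.8.0_131", "1.8.0_121"
print(at_least(agent, "1.7"), same_major_minor(agent, master), exactly_equal(agent, master))
```

For the sample pair, the first two checks pass and the strictest one fails, because the agent and master differ only in the update number.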
  42. Issue: “Agent is already connected”

    • Windows Service
    • Other
  43. Runaway Process

    • Kill jenkins-slave.exe on Jenkins before 2.50
    • Runaway agent process
  44. New Windows Service Wrapper

    • Jenkins 2.50+
    • For new agents…
      • Remoting auto-update on agent side
      • Runaway Process Killer
    • Old agents need configuration

    https://github.com/kohsuke/winsw/
  45. jenkins-slave.xml

        <service>
          <id>@ID@</id>
          <name>Jenkins agent (@ID@)</name>
          <description>This service runs an agent for Jenkins automation server.</description>
          <executable>@JAVA@</executable>
          <arguments>-Xrs @VMARGS@ -jar "%BASE%\slave.jar" @ARGS@</arguments>
          <logmode>rotate</logmode>
          <onfailure action="restart" />
          <download from="JENKINS_URL/jnlpJars/slave.jar" to="%BASE%\slave.jar"/>
          <extensions>
            <extension id="killOnStartup" enabled="true" className="winsw.Plugins.RunawayProcessKiller.RunawayProcessKillerExtension">
              <pidfile>%BASE%\jenkins_agent.pid</pidfile>
              <stopTimeout>5000</stopTimeout>
              <stopParentFirst>false</stopParentFirst>
            </extension>
          </extensions>
        </service>

    https://github.com/jenkinsci/windows-slave-installer-module/…/src/main/resources/org/jenkinsci/modules/windows_slave_installer/jenkins-slave.xml
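When many old Windows agents have to be reviewed, a small script can flag configuration files that lack the two elements shown on this slide. This is a rough sketch using only the Python standard library; the function name, the sample values (agent id, example.com URL) and the wording of the findings are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical pre-upgrade configuration: no <download>, no extensions.
OLD_CONFIG = r"""<service>
  <id>agent-1</id>
  <executable>java</executable>
  <arguments>-Xrs -jar "%BASE%\slave.jar"</arguments>
</service>"""

# Hypothetical configuration with the elements shown on the slide.
NEW_CONFIG = r"""<service>
  <id>agent-1</id>
  <executable>java</executable>
  <arguments>-Xrs -jar "%BASE%\slave.jar"</arguments>
  <download from="https://jenkins.example.com/jnlpJars/slave.jar" to="%BASE%\slave.jar"/>
  <extensions>
    <extension id="killOnStartup" enabled="true"
               className="winsw.Plugins.RunawayProcessKiller.RunawayProcessKillerExtension">
      <pidfile>%BASE%\jenkins_agent.pid</pidfile>
    </extension>
  </extensions>
</service>"""


def check_agent_config(xml_text):
    """Return a list of findings for a jenkins-slave.xml document."""
    root = ET.fromstring(xml_text)
    findings = []
    if root.find("download") is None:
        findings.append("no <download> element: slave.jar is not refreshed on startup")
    killers = [e for e in root.iter("extension") if e.get("id") == "killOnStartup"]
    if not killers:
        findings.append("no killOnStartup extension: runaway processes are not killed")
    elif killers[0].get("enabled") != "true":
        findings.append("killOnStartup extension is present but disabled")
    return findings


for finding in check_agent_config(OLD_CONFIG):
    print(finding)
```

The check only looks for the presence of the elements; it does not validate paths or URLs, which still need a manual review.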
  46. Updating Windows agents

    • Update the core and/or WinSW
    • Update jenkins-slave.xml
    • Restart agent
    • Wait & Restart again

    Guide: https://github.com/jenkinsci/windows-slave-installer-module#upgrading-old-agents
  47. What do we need?

    • Master: Core version, Logs ?, Stackdumps
    • Agent: Remoting version, Logs ?, Stackdumps
  48. When the Agent fails…

    • Master: Core version, Logs ?, Stackdumps
    • Agent: Remoting version, Logs ?, Stackdumps
  49. Problem: No logging by default

    OOTB:
    • SSH agents
      • No logging
    • Windows Service agents
      • Logging with logrotate
      • Collected by Support Core when agent is online
    • JNLP agents
      • No logging
  50. Enabling logging in SSH/JNLP agents

    • GOOD: “-slaveLog” parameter
      • Tee STDOUT/STDERR to a file
      • No logrotate support
    • BAD: STDOUT/STDERR redirect
      • Shell-dependent
      • SSH agents: patch command suffix
    • NOT UGLY: JUL property file
      • NOT documented as well
      • Some logs go to STDOUT/STDERR
  51. What to do NOW?

    • Update Jenkins
      • 2.46+ for Remoting patches
      • For Windows: LTS 2.60.1+
    • Check timeouts (TCP Retransmission + Ping)
      • T_Retransmission < T_Ping
      • PingThread should be turned off
    • Set up logging
      • Wait for Remoting 3.8 with configurable JUL
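To see whether a host meets the condition that the TCP retransmission timeout stays below the ping timeout, the worst-case retransmission time can be estimated from the retry count. The sketch below uses a simplified exponential-backoff model and assumes commonly cited Linux defaults (a 200 ms minimum RTO, a 120 s RTO cap and `tcp_retries2 = 15`); real kernels derive the RTO from the measured round-trip time, so treat the numbers as rough estimates. The 4-minute ping timeout is the value mentioned earlier in the deck.

```python
def total_retransmission_timeout(retries, rto_initial_s, rto_max_s):
    """Sum the waits before the stack gives up on an unacknowledged segment.

    Simplified model: the wait doubles after every attempt and is capped
    at rto_max_s; there is one initial wait plus one per retransmission.
    """
    return sum(min(rto_initial_s * 2 ** attempt, rto_max_s)
               for attempt in range(retries + 1))


def retransmission_shorter_than_ping(t_retransmission_s, t_ping_s):
    """The condition from the slide: T_Retransmission < T_Ping."""
    return t_retransmission_s < t_ping_s


PING_TIMEOUT_S = 4 * 60  # 4-minute ping timeout, as mentioned in the deck

# Assumed Linux defaults: tcp_retries2 = 15, 200 ms minimum RTO, 120 s cap.
default_timeout = total_retransmission_timeout(15, 0.2, 120.0)
tuned_timeout = total_retransmission_timeout(8, 0.2, 120.0)

print(round(default_timeout, 1), retransmission_shorter_than_ping(default_timeout, PING_TIMEOUT_S))
print(round(tuned_timeout, 1), retransmission_shorter_than_ping(tuned_timeout, PING_TIMEOUT_S))
```

Under these assumptions the default settings give roughly 924.6 seconds, well above the ping timeout, while lowering the retry count to 8 brings the estimate to about 102 seconds.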
  52. What’s next?

    • Better Diagnosability
    • Better Stability
  53. Ongoing changes

    • New release: 3.8+
      • .../remoting/blob/master/CHANGELOG.md#38
    • Docs
      • https://github.com/jenkinsci/remoting/
    • Work directories in Remoting (JENKINS-44108)
    • Logging on agents by default (JENKINS-39369)
  54. Work Directories (JENKINS-44108)

    • Logging by default
    • Independent JAR Cache
    • Workspace status checks
  55. Work Directories (JENKINS-44108)

    • Long adoption…
    • ETA: September LTS
  56. Moonshots

    • Remoting
      • TCP-robust Remoting?
      • Autoupdate of ALL JNLP agents? (JENKINS-44099)
      • Update of Remoting in Master without the core upgrade?
    • Traffic optimization
      • External logging
      • Pluggable storage
  57. Takeaways

    • Remoting: a risk factor in Jenkins
      • Remoting does not scale well OOTB
      • INFRA issues: a frequent root cause
      • Remoting can be stabilized
    • Jenkins 2 is not just about Pipelines, keep updating
  58. Links

    • Remoting
      • GitHub: https://github.com/jenkinsci/remoting (Docs, etc.)
    • Windows services:
      • https://github.com/kohsuke/winsw/
    • CloudBees
      • go.cloudbees.com: community pages, knowledge base
  59. Thank you!

    Contacts:
    • E-mail: [email protected]
    • GitHub: oleg-nenashev
    • Twitter: @oleg_nenashev