Slide 1

Slide 1 text

Dark side of Jenkins. Troubleshooting Remoting issues*
Oleg Nenashev
CloudBees, Inc.
Day of Jenkins Göteborg, May 30, 2017
* Simplified version: http://bit.ly/day-of-jenkins-remoting-light

Slide 2

Slide 2 text

@oleg_nenashev, #DayOfJenkins - 2017 © 2017 CloudBees, Inc. All Rights Reserved.
About me
@oleg_nenashev
oleg-nenashev
LibreCores project
St. Petersburg Polytechnic University
Jenkins meetups
Google Summer of Code

Slide 3

Slide 3 text

Oleg’s “Hall of Shame”(c)
• Jenkins Core
• Windows Service Wrapper
• Plugins
• Remoting

Slide 4

Slide 4 text

What is Remoting?
• Distributed builds – success factor of Hudson/Jenkins
• Remoting – engine under the hood of Jenkins
• In-house implementation

Slide 5

Slide 5 text

Have you seen THIS?

Slide 6

Slide 6 text

Or THIS?

Slide 7

Slide 7 text

Remoting. What does it do
Agent executable (slave.jar) Master communication protocols Classloading Remote I/O Streams Monitoring of agents

Slide 8

Slide 8 text

Oleg vs. Remoting
• Dealing with Remoting since 2008
• Maintained own fork at Synopsys
• Became Remoting maintainer at CloudBees
• Maintain Remoting during working hours
• Deal with support escalations

Slide 9

Slide 9 text

Agenda
✓ HOW DOES REMOTING WORK?
✓ WHAT TO DO IF IT DOES NOT?

Slide 10

Slide 10 text

Disclaimer
• The presentation represents the speaker’s personal opinion
• This opinion may differ from the official position of CloudBees or the Jenkins community
• The Jenkins terms “agent” and “slave” are equivalent; apologies in advance if the obsolete term slips in

Slide 11

Slide 11 text

When you run builds on agents, do they get executed on agents?

Slide 12

Slide 12 text

When you run builds on agents, do they get executed on agents?

Slide 13

Slide 13 text

Build in Jenkins
[Diagram labels: Master, Agent, RPC calls, System calls, RemoteInputStream/RemoteOutputStream, Missing classes]

Slide 14

Slide 14 text

Remoting Usage in Jenkins
• Master ⇔ Agents
  • 4 protocols
• Master ⇔ CLI (Deprecated)
  • https://jenkins.io/blog/2017/04/11/new-cli/
• Agent ⇔ Maven in Maven Project Plugin
  • via Maven Interceptors
• CloudBees Jenkins Enterprise:
  • Client master ⇔ CloudBees Jenkins Operations Center

Slide 15

Slide 15 text

Remoting protocols
• JNLP1 – deprecated protocol
• JNLP2 – NIO, no encryption
• JNLP3 – no NIO, encrypted, unstable
• JNLP4 – NIO, encrypted via TLS
• CLI1 – no encryption
• CLI2 – encrypted
• Ping – test protocol
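The agent protocol list can be read as a small table of properties. A minimal sketch in Python, not Jenkins code: the `Protocol` class and the selection rule in `preferred` (NIO, encrypted, no caveat) are assumptions made for illustration, and properties the slide does not state are left as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Protocol:
    name: str
    nio: Optional[bool]        # None = not stated on the slide
    encrypted: Optional[bool]  # None = not stated on the slide
    caveat: str = ""


# Transcribed from the slide; only the agent (JNLP) protocols are listed.
AGENT_PROTOCOLS = [
    Protocol("JNLP1", nio=None, encrypted=None, caveat="deprecated"),
    Protocol("JNLP2", nio=True, encrypted=False),
    Protocol("JNLP3", nio=False, encrypted=True, caveat="unstable"),
    Protocol("JNLP4", nio=True, encrypted=True),
]


def preferred(protocols):
    """Hypothetical rule: keep protocols that are NIO-based, encrypted and caveat-free."""
    return [p.name for p in protocols if p.nio and p.encrypted and not p.caveat]
```

Under that rule only JNLP4 remains, which fits the slide's description of it as the NIO protocol encrypted via TLS.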

Slide 16

Slide 16 text

Protocol configuration
• Before 2.19.1 – via System Property
• After – Global security settings

Slide 17

Slide 17 text

Recommended configuration
2.32+ 2.54+ 2.46.2+
• JNLP1 – deprecated protocol
• JNLP2 – NIO, no encryption
• JNLP3 – no NIO, encrypted, unstable
• JNLP4 – NIO, encrypted via TLS
• CLI1 – no encryption
• CLI2 – encrypted
• Ping – test protocol

Slide 18

Slide 18 text

JNLP-agent
[Diagram labels: Master JVM, Agent JVM, HTTP/HTTPS, /tcpAgentListener, remoting.jar, jenkins.war, JNLP-protocol]
• Docker: jenkinsci/jnlp-slave
• Swarm Plugin: bundled remoting.jar

Slide 19

Slide 19 text

SSH-agent
[Diagram labels: Master JVM, Agent JVM, SSH Server, jenkins.war, STDOUT/STDERR, SSH-connect, SSH, JRE/JDK, remoting.jar, settings]
• SSH Slaves Plugin
• CloudBees NIO SSH Slaves Plugin
• Docker: jenkinsci/ssh-slave
• Remoting auto-update from master

Slide 20

Slide 20 text

JNLP-agent + Windows Service
• Windows Agent Installer Module
• Extra: Windows Agents Plugin (installation via DCOM)
[Diagram labels: Master JVM, Agent, HTTP/HTTPS, /tcpAgentListener, JVM + slave.jar, jenkins.war, JNLP-protocol, HTTPS/HTTP, remoting.jar, WinSW (jenkins-slave.exe), jenkins-slave.exe]
• Remoting auto-update support
• Logging by default
• Failover

Slide 21

Slide 21 text

Top-5 Remoting Issues
• Depends on TCP
• Runaway processes in Windows
• Outdated Remoting
• No logging by default
• Traffic prioritization

Slide 22

Slide 22 text

Problem 1: Connection failure
TCP-connection failure
Agent monitoring
• Disk usage
• Remoting version
• …
Bug in Remoting
PingThread

Slide 23

Slide 23 text

Bugs in Remoting
All
• Poor diagnosability when issue happens
JNLP1/JNLP2
• Known issues in connection management
• RejectedExecutionEx in ExecutorService kills ALL connections (.../remoting/pull/156)
JNLP3
• Does not “just work”…
• Errata: .../remoting/blob/master/docs/protocols.md - jnlp3-connect-errata
JNLP4
• No big ones so far…
• Jenkins 2.27+

Slide 24

Slide 24 text

PingThread – what is it?

Slide 25

Slide 25 text

Remoting PingThread
• Calls Ping with 4-minute timeout?

Slide 26

Slide 26 text

Remoting PingThread
• Calls Ping with 4-minute timeout?
• No-Op RPC request

Slide 27

Slide 27 text

Ping Thread
• Send NOOP RPC request to agent
• Deliver the request over network
• Wait in agent execution queue
• Execute in agent ThreadPool
• Deliver the result back
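The stages above can be imitated with a worker pool. This is a simplified Python sketch, not the Remoting implementation: the names `noop` and `ping` are hypothetical, a local thread pool stands in for the agent, and the network hops are not modelled. It shows why a no-op request exercises the queue and the executor as well as the connection.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def noop():
    """The no-op request: it does no work, so only the round trip matters."""
    return None


def ping(agent_pool, timeout_s):
    """Return True if the no-op was queued, executed and answered in time."""
    future = agent_pool.submit(noop)      # send the request; it waits in the queue
    try:
        future.result(timeout=timeout_s)  # executed in the pool, result delivered back
        return True
    except FutureTimeout:
        future.cancel()                   # give up on the request still in the queue
        return False
```

With an idle single-worker `ThreadPoolExecutor`, `ping(pool, 1.0)` succeeds. If the only worker is stuck on another task, the ping starves in the queue and fails, even though nothing is wrong at the TCP level.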

Slide 28

Slide 28 text

Famous Last Words
• “Let’s disable PingThread”

Slide 29

Slide 29 text

Famous Last Words
• “Let’s disable PingThread”…
• PingThread monitors all request execution stages
• Without it you rely on the TCP state
• Agents may hang forever

Slide 30

Slide 30 text

TCP…
• Reliable message delivery…
• What can possibly go wrong?

Slide 31

Slide 31 text

7 circles of virtualization
• Hardware & Network
• vSphere/AWS/…
• *nix OS
• Docker
• *nix OS
• JVM

Slide 32

Slide 32 text

7 circles of virtualization
• Hardware & Network
• vSphere/AWS/…
• *nix OS
• Docker
• *nix OS
• JVM

Slide 33

Slide 33 text

Agents – double trouble
[Diagram: two identical stacks (Hardware, vSphere/AWS/…, *nix OS, Docker, *nix OS, JVM) with Network (routers, VPN, dial-up, proxy…) and TCP between them]

Slide 34

Slide 34 text

Dealing with network
TCP Retransmission timeout
• *nix: https://unix.stackexchange.com/questions/210367/changing-the-tcp-rto-value-in-linux
• Windows: https://support.microsoft.com/en-us/help/170359/how-to-modify-the-tcp-ip-maximum-retransmission-time-out
Network configuration
• External monitoring
• Independent management- and storage-networks
Reducing [peak] throughput
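To get a feeling for the retransmission timeout, it helps to add up the exponential backoff. The sketch below is a rough Python estimate and not part of the talk. It assumes the Linux values described in the kernel's ip-sysctl documentation (minimum RTO of 200 ms, maximum of 120 s, `tcp_retries2` defaulting to 15); real RTOs depend on the measured round-trip time. The function names are hypothetical.

```python
def hypothetical_timeout(retries, rto_min=0.2, rto_max=120.0):
    """Seconds before TCP gives up on unacknowledged data.

    The wait doubles after every attempt, starting at rto_min and capped at
    rto_max; the original transmission plus `retries` retransmissions give
    retries + 1 waits in total.
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


def retransmission_gives_up_first(retries, ping_timeout_s):
    """True if TCP gives up before a ping with the given timeout would expire."""
    return hypothetical_timeout(retries) < ping_timeout_s
```

Under these assumptions `hypothetical_timeout(15)` comes to roughly 924.6 seconds, far more than a 4-minute (240 s) ping, while 8 retries give roughly 102 seconds.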

Slide 35

Slide 35 text

Reducing Remoting network throughput
Master Node Access from UI Storage
• Temporary Data
• Logs
• Artifacts

Slide 36

Slide 36 text

Reducing Remoting network throughput
Low Hanging Fruits
• Less console logging
• Persisted JAR cache, esp. in Cloud Nodes
• External Artifact publishers
• Pipeline: Local WS instead of stash/unstash

Slide 37

Slide 37 text

Stash() replacement. External Workspace Manager
https://github.com/jenkinsci/external-workspace-manager-plugin

Slide 38

Slide 38 text

Example
https://github.com/jenkinsci/external-workspace-manager-plugin

Slide 39

Slide 39 text

Problem. Invalid Objects
JENKINS-23271
• Garbage Collector is too smart in Java 8
• RemoteInvocationHandler => command.start().join()
• Jenkins 2.35+

Slide 40

Slide 40 text

Issue: Outdated Remoting on agents
• No bugfixes
• No new protocols (e.g. JNLP4)
• Worse diagnosability
• Potential compatibility issues
OOTB:
SSH agents
• Update to the version on master
Windows Service agents
• No auto-update till Jenkins 2.50
JNLP agents
• No auto-update

Slide 41

Slide 41 text

How to check the Remoting version?
• System Information on the agent page
• Version Column Plugin: https://plugins.jenkins.io/versioncolumn

Slide 42

Slide 42 text

Issue. Java versions
Vendor
• Jenkins needs some native libs
• IBM Java has known compat issues
Target platform
• Do NOT use 32-bit Java on 64-bit Windows
• https://github.com/kohsuke/winp#platform-support
• 32-bit Java is bundled in Jenkins Windows Installers :(
Version
• Jenkins’ Java requirements apply to agents as well
• Otherwise – Undefined behavior

Slide 43

Slide 43 text

How to monitor Java versions?
• Version Column Plugin again!
• Since: 2.0-beta-1 (Experimental update center)

Slide 44

Slide 44 text

Configuring Java monitors
${JENKINS_URL}/computer/configure
Built-in strategies:
• Agent JVM version is greater than or equal to the Master’s supported one (default)
• Agent JVM major.minor version is equal to the Master one (paranoid)
• Agent JVM version is exactly equal to the Master one (paranoid++)
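The three strategies differ only in how much of the version string they compare. A minimal Python sketch of the idea, not the Version Column Plugin's code: the function names and the `1.8.0_131`-style version strings are assumptions made for illustration.

```python
import re


def parse(version):
    """Split a JVM version string such as "1.8.0_131" into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def at_least(agent, minimum):
    """Default: the agent JVM is not older than the minimum the master supports."""
    return parse(agent) >= parse(minimum)


def same_major_minor(agent, master):
    """Paranoid: only the first two components (major.minor) must match."""
    return parse(agent)[:2] == parse(master)[:2]


def exactly_equal(agent, master):
    """Paranoid++: every component must match."""
    return parse(agent) == parse(master)
```

For an agent on `1.8.0_131` and a master on `1.8.0_121`, the first two checks pass and the third fails.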

Slide 45

Slide 45 text

Issue: “Agent is already connected”
• Windows Service
• Other

Slide 46

Slide 46 text

Runaway Process
Kill jenkins-slave.exe on Jenkins 2.50- Runaway agent process

Slide 47

Slide 47 text

Runaway Process

Slide 48

Slide 48 text

New Windows Service Wrapper
• Jenkins 2.50+
• For new agents…
  • Remoting auto-update on agent side
  • Runaway Process Killer
• Old agents need configuration
https://github.com/kohsuke/winsw/

Slide 49

Slide 49 text

jenkins-slave.xml
@ID@
Jenkins agent (@ID@)
This service runs an agent for Jenkins automation server.
@JAVA@
-Xrs @VMARGS@ -jar "%BASE%\slave.jar" @ARGS@
rotate
%BASE%\jenkins_agent.pid
5000
false
https://github.com/jenkinsci/windows-slave-installer-module/…/src/main/resources/org/jenkinsci/modules/windows_slave_installer/jenkins-slave.xml

Slide 50

Slide 50 text

Updating Windows agents
• Update the core and/or WinSW
• Update jenkins-slave.xml
• Restart agent
• Wait & Restart again
Guide: https://github.com/jenkinsci/windows-slave-installer-module#upgrading-old-agents

Slide 51

Slide 51 text

Diagnosing Remoting issues

Slide 52

Slide 52 text

What do we need?
Master: Core version, Logs, ?, Stackdumps
Agent: Remoting version, Logs, ?, Stackdumps

Slide 53

Slide 53 text

Support Core Plugin

Slide 54

Slide 54 text

When the Agent fails…
Master: Core version, Logs, ?, Stackdumps
Agent: Remoting version, Logs, ?, Stackdumps

Slide 55

Slide 55 text

Problem: No logging by default
OOTB:
SSH agents
• No logging
Windows Service agents
• Logging with logrotate
• Collected by Support Core when agent is online
JNLP agents
• No logging

Slide 56

Slide 56 text

Enabling logging in SSH/JNLP agents
GOOD – “-slaveLog” parameter
• Tee STDOUT/STDERR to a file
• No Logrotate Support
BAD – STDOUT/STDERR redirect
• Shell-dependent
• SSH agents – patch command suffix
NOT UGLY – JUL property file
• NOT Documented as well
• Some logs go to STDOUT/STDERR

Slide 57

Slide 57 text

So… What to do?

Slide 58

Slide 58 text

What to do NOW?
Update Jenkins
• 2.46+ for Remoting patches
• For Windows – LTS 2.60.1+
Check timeouts (TCP Retransmission + Ping)
• T_Retransmission < T_Ping
• PingThread should stay turned on
Set up logging
• Wait for Remoting 3.8 with configurable JUL

Slide 59

Slide 59 text

What’s next?
• Better Diagnosability
• Better Stability

Slide 60

Slide 60 text

Ongoing changes
• New release – 3.8+
  • .../remoting/blob/master/CHANGELOG.md#38
• Docs
  • https://github.com/jenkinsci/remoting/
• Work directories in Remoting (JENKINS-44108)
• Logging on agents by default (JENKINS-39369)

Slide 61

Slide 61 text

Work Directories (JENKINS-44108)
• Logging by default
• Independent JAR Cache
• Workspace status checks

Slide 62

Slide 62 text

Work Directories (JENKINS-44108)
Long adoption…
ETA: September LTS

Slide 63

Slide 63 text

What’s next?

Slide 64

Slide 64 text

Moonshots
Remoting
• TCP-robust Remoting?
• Autoupdate of ALL JNLP agents? (JENKINS-44099)
• Update of Remoting in Master without the core upgrade?
Traffic optimization
• External logging
• Pluggable storage

Slide 65

Slide 65 text

Takeaways
• Remoting – risk factor in Jenkins
  • Remoting does not scale well OOTB
  • INFRA issues - frequent root cause
  • Remoting can be stabilized
• Jenkins 2 is not just about Pipelines, keep updating

Slide 66

Slide 66 text

Links
• Remoting
  • GitHub: https://github.com/jenkinsci/remoting (Docs, etc.)
• Windows services:
  • https://github.com/kohsuke/winsw/
• CloudBees
  • go.cloudbees.com – community pages, knowledge base

Slide 67

Slide 67 text

Thank you!
Contacts:
GitHub: oleg-nenashev
Twitter: @oleg_nenashev

Slide 68

Slide 68 text

Software at the speed of ideas THANK YOU! www.cloudbees.com