Practical Monitoring, Tuning and Troubleshooting of Sakai CLE

#apereo13

Mike DeSimone

June 05, 2013

Transcript

  1. Practical Monitoring, Tuning and Troubleshooting of Sakai CLE
    Mike DeSimone, Director of Enterprise Operations [email protected]
    Brooke Biltimier, Tier 3 Support Manager [email protected]
  2. Agenda
    ◦ Describe some problem scenarios and the process followed to diagnose the underlying issue
    ◦ Monitoring review
    ◦ Tuning
  3. 1. Diagnosing a Runaway Thread
    • Symptoms
      ◦ Load on app server high
      ◦ CPU % high
      ◦ Tomcat unresponsive
      ◦ ==> restart ???
      ◦ [too easy/disruptive]
  4. 1. Diagnosing a Runaway Thread, cont'd
    • Didn't find much initially, so dig deeper
    • check online users
    • kill -3 (SIGQUIT: makes the JVM write a thread dump)
    • find top java CPU threads
    • top -b -n1 -H -p 18595 [main tomcat process]
      PID   USER   PR NI VIRT RES  SHR S %CPU  %MEM TIME+     COMMAND
      18602 tomcat 22 0  9.9g 8.9g 40m R 100.6 74.6 408:13.42 java
      16752 tomcat 18 0  9.9g 8.9g 40m R 96.8  74.6 429:21.89 java
      25479 tomcat 21 0  9.9g 8.9g 40m R 89.0  74.6 168:24.49 java
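The "find top java CPU threads" step can be scripted. This is a minimal sketch, assuming the batch-mode column layout shown on the slide (thread ID in column 1, %CPU in column 9); the function name `busiest_threads` is hypothetical, not part of any tool.

```shell
#!/usr/bin/env bash
# busiest_threads N: read `top -b -n1 -H -p <pid>` output on stdin and print
# the N thread IDs with the highest %CPU, busiest first.
# Assumes the column layout from the slide: column 1 = thread ID, column 9 = %CPU.
busiest_threads() {
  local n="${1:-3}"
  # Keep only rows whose first field is numeric (skips summary and header lines).
  awk '$1 ~ /^[0-9]+$/ { print $9, $1 }' \
    | LC_ALL=C sort -k1,1 -rn \
    | head -n "$n" \
    | awk '{ print $2 }'
}

# Usage (18595 is the Tomcat PID from the slide):
#   top -b -n1 -H -p 18595 | busiest_threads 3
```

Other versions of top may order the columns differently, so check the header line before relying on column 9.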
  5. 1. Diagnosing a Runaway Thread, cont'd
    • convert to hex, correlate to thread dump
    • google! '18602 in hex'
    • 18602 is 0x48aa, which matches nid=0x48aa: the CMS garbage collector thread
      "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x000000005d9b0800 nid=0x48aa runnable
      "VM Periodic Task Thread" prio=10 tid=0x000000005da42000 nid=0x48b3 waiting on condition
  6. 1. Diagnosing a Runaway Thread, cont'd
    • try 16752
      PID 16752
      "TP-Processor221" daemon prio=10 tid=0x00002aaac0160000 nid=0x4170 runnable [0x000000005b29f000]
         java.lang.Thread.State: RUNNABLE
              at java.lang.String.toLowerCase(String.java:2763)
              at java.lang.String.toLowerCase(String.java:2847)
              at org.hibernate.hql.QuerySplitter.concreteQueries(QuerySplitter.java:77)
              at org.hibernate.engine.query.HQLQueryPlan.<init>(HQLQueryPlan.java:68)
              at org.hibernate.engine.query.HQLQueryPlan.<init>(HQLQueryPlan.java:56)
              at org.hibernate.engine.query.QueryPlanCache.getHQLQueryPlan(QueryPlanCache.java:72)
              at org.hibernate.impl.AbstractSessionImpl.getHQLQueryPlan(AbstractSessionImpl.java:133)
              at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1114)
              at org.hibernate.impl.QueryImpl.list(QueryImpl.java:79)
              at org.sakaiproject.sitestats.impl.StatsManagerImpl$28.doInHibernate(StatsManagerImpl.java:3573)
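The hex conversion does not need a search engine: the shell's `printf` can do it. A small sketch, assuming the thread dump was saved to a file; the helper names `tid_to_nid` and `find_thread` are hypothetical.

```shell
# tid_to_nid TID: print a decimal thread ID from `top -H` as the lowercase hex
# "nid" value that appears in a JVM thread dump.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# find_thread TID DUMP_FILE: print the header line of the matching thread plus
# the stack lines that follow. The 12-line context is an arbitrary choice.
find_thread() {
  # -w stops nid=0x417 from also matching nid=0x4170.
  grep -A 12 -w "nid=$(tid_to_nid "$1")" "$2"
}

# Usage:
#   tid_to_nid 18602              # prints 0x48aa
#   find_thread 16752 dump.txt    # dump.txt is a hypothetical saved thread dump
```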
  7. 1. Diagnosing a Runaway Thread, cont'd
    • Aha! found the proverbial
    • Site stats...hibernate, spinning in toLowerCase
    • Solution: rewrite with the Hibernate Criteria pattern
  8. 1. Diagnosing a Runaway Thread, cont'd
    • Alternate approach, more heavy-handed & risky. E.g.,
    • jdb -attach localhost:5005                # attach debugger
    • thread 0xacdf                             # jump into thread
    • suspend 0xacdf                            # suspend thread
    • step
    • kill 0xacdf new java.lang.Exception()     # send something bad
    • org.apache.avalon.excalibur.thread.impl.SimpleWorkerThread(name='default Worker #1', id=44084) killed
  9. 2. Don't overlook the obvious
    • Sometimes the issue is right in front of you: don't overthink it. Follow the evidence.
    • A user received an internal server error on site duplication; it turned out there was an issue with a property. The logs had NumberFormatException errors:
      content.upload.max=500mb
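A quick sanity check on a properties file can catch this kind of value before a restart. This is a sketch only, assuming the property expects a plain integer (which is what the NumberFormatException suggests); the function name `check_int_property` and the file path in the usage line are hypothetical.

```shell
# check_int_property FILE KEY: succeed only if KEY is set in FILE and its value
# is a plain non-negative integer. Reports the offending value on stderr.
# KEY is used as a regular expression, so "." in it matches any character.
check_int_property() {
  local file="$1" key="$2" value
  # Take the last definition of the key and strip whitespace around the value.
  value="$(grep -E "^${key}[[:space:]]*=" "$file" | tail -n 1 | cut -d= -f2- | tr -d '[:space:]')"
  case "$value" in
    ''|*[!0-9]*) echo "$key has non-integer value: '$value'" >&2; return 1 ;;
    *) return 0 ;;
  esac
}

# Usage (path is a placeholder):
#   check_int_property /path/to/sakai.properties content.upload.max
```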
  10. 2. Don't overlook the obvious
    • Sluggish, or user(s) can't log in: suspect the Internet connection or an issue with external authentication
    • View user reports with a (small) grain of salt
  11. Topic 3: Troubleshooting SSL certificates
    • PKIX errors
      LDAPException: I/O Exception on host 71.51.124.140, port 636 (91) Connect Error
      javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
          at com.novell.ldap.Connection.writeMessage(Unknown Source)
          at com.novell.ldap.Connection.writeMessage(Unknown Source)
          at com.novell.ldap.Message.sendMessage(Unknown Source)
  12. Topic 3: Troubleshooting SSL certificates
    Process
    • Connectivity: telnet yourserver.edu 636
    • Review expiration, etc.
    • openssl x509 -text < in.crt
    • SSLPoke: http://goo.gl/G5ZX6
    • e.g., java -Djavax.net.debug=all SSLPoke yourserver.edu 443
  13. Topic 3: Troubleshooting SSL certificates
    java -Djavax.net.debug=ssl SSLPoke yourserver.edu 636
      keyStore is :
      keyStore type is : jks
      keyStore provider is :
      init keystore
      init keymanager of type SunX509
      trustStore is: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/lib/security/cacerts
      trustStore type is : jks
      ...
      main, WRITE: TLSv1 Handshake, length = 32
      main, READ: TLSv1 Change Cipher Spec, length = 1
      main, READ: TLSv1 Handshake, length = 32
      *** Finished
      verify_data: { 126, 123, 181, 220, 168, 147, 131, 225, 169, 216, 34, 83 }
      ***
      %% Cached client session: [Session-1, SSL_RSA_WITH_RC4_128_MD5]
      main, WRITE: TLSv1 Application Data, length = 17
      Successfully connected
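For the "review expiration" step, the end date that `openssl x509 -noout -enddate` prints can be turned into a day count. A minimal sketch, assuming GNU `date` (its `-d` option); the function name `days_until_expiry` is hypothetical.

```shell
# days_until_expiry NOT_AFTER_LINE [NOW_EPOCH]
# NOT_AFTER_LINE is the output of `openssl x509 -noout -enddate`, e.g.
#   notAfter=Jun  5 12:00:00 2014 GMT
# Prints whole days remaining, truncated toward zero.
# Assumes GNU date (`date -d`); BSD/macOS date needs different flags.
days_until_expiry() {
  local end_text="${1#notAfter=}"      # drop the "notAfter=" prefix
  local now="${2:-$(date -u +%s)}"     # default to the current time
  local end
  end="$(date -u -d "$end_text" +%s)" || return 1
  echo $(( (end - now) / 86400 ))
}

# Usage, with in.crt as on the slide:
#   days_until_expiry "$(openssl x509 -noout -enddate < in.crt)"
```

An expiry check covers only one cause of handshake failures; a PKIX path error like the one above can also come from a missing CA certificate in the trust store.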
  14. Topic 4 - symptom is high cpu load
    • App server load is high
    • Database load is low
    • DB pool exhaustion
    • App server thread exhaustion
    • Slow query log
      ◦ missing index?
    • App server logs
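One way to look for app server thread exhaustion is to summarise thread states from a `kill -3` thread dump. This is a sketch under that assumption; the function name `thread_state_counts` is hypothetical, and interpreting the counts still takes judgment.

```shell
# thread_state_counts: read a JVM thread dump on stdin and print one
# "STATE count" line per java.lang.Thread.State value, sorted by state name.
thread_state_counts() {
  awk '$1 == "java.lang.Thread.State:" { count[$2]++ }
       END { for (s in count) print s, count[s] }' \
    | LC_ALL=C sort
}

# Usage (dump.txt is a hypothetical saved thread dump):
#   thread_state_counts < dump.txt
```

A large share of request threads sitting in BLOCKED or WAITING while few are RUNNABLE can hint that they are queued on a shared resource such as the DB pool.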
  15. Monitoring
    • Nagios - http
      ◦ ok, but incomplete
    • Better: Access public resource
    • Better still: perform login & access private resource
    • Munin & graphs
      ◦ OS Level: cpu, disk, network
    • Commercial: e.g., AppDynamics
      ◦ Limited free version
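The "perform login & access private resource" check could look roughly like the following Nagios-style plugin. It is a sketch, not a tested Sakai check: the function names are hypothetical, the URLs are supplied by the caller, and the login form field names `eid` and `pw` are assumptions to be matched against the real login form.

```shell
# nagios_exit_for HTTP_CODE: map an HTTP status code to a Nagios plugin exit
# status (0 = OK, 2 = CRITICAL, 3 = UNKNOWN) and print a one-line summary.
nagios_exit_for() {
  case "$1" in
    2[0-9][0-9]) echo "OK - HTTP $1"; return 0 ;;
    [1345][0-9][0-9]) echo "CRITICAL - HTTP $1"; return 2 ;;
    *) echo "UNKNOWN - no usable HTTP status ('$1')"; return 3 ;;
  esac
}

# check_private_page LOGIN_URL PRIVATE_URL USER PASS
# Logs in with a cookie jar, then requests a page that needs the session.
# The field names "eid" and "pw" are assumptions; match them to your login form.
check_private_page() {
  local jar code
  jar="$(mktemp)"
  # Step 1: post credentials and store the session cookie.
  curl -s -o /dev/null --max-time 15 -c "$jar" \
       --data-urlencode "eid=$3" --data-urlencode "pw=$4" "$1"
  # Step 2: fetch the private page with that cookie; keep only the status code.
  code="$(curl -s -o /dev/null --max-time 15 -b "$jar" -w '%{http_code}' "$2")"
  rm -f "$jar"
  nagios_exit_for "$code"
}
```

A status code alone can mislead if the server answers 200 with a login page, so a fuller check would also look for text that only appears when logged in.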
  16. Tuning
    • A perennial topic :)
    Recent thread on the production list, 'out of memory exceptions after upgrade to Sakai CLE 2.9', has a nice example. Thanks to UFL.
      # only create a huge JVM if the operation is 'start'
      if [[ "$1" == 'start' ]]; then
        JAVA_OPTS="$JAVA_OPTS -Xmx6144m -Xms6144m"
        # etc.
      fi
  17. Summary
    Take-Aways
    The problem may be unfamiliar, but you can use logical analysis to narrow down the causes.
    Understand the overall application architecture, then systematically look at each layer: Tomcat, database, load balancer.