
Swap the Engine without Stopping the Car

A Sakai CLE Infrastructure Replacement Project
Open Apereo Conference 2013
San Diego, CA
June 5, 2013

Jeff Cousineau



Transcript

  1. Swap the Engine without Stopping the Car
     A Sakai CLE Infrastructure Replacement Project
     Open Apereo Conference 2013, June 5, 2013
     Jeff Cousineau, Beth Kirschner, Chris Kretler

  2. Introductions
     • CTools – University of Michigan’s Sakai CLE instantiation
       • Initial Pilot Service: Winter 2003
       • Initial Production Service: Fall 2004
       • As of March 2013:
         • Over 120,000 Class & Project sites
         • Over 220,000 My Workspaces
       • Reference: http://ctoolsstatus.blogspot.com/
     • CTIR – CTools Infrastructure Rationalization Project (a FY12 Priority IT Project for U-M)

  3. What we hope to accomplish today…
     • Lessons learned from virtual vs. physical infrastructure
     • Implications of shared infrastructure decisions
     • Cost savings are a significant factor in many institutions’ decision-making processes today
     • Different services have different requirements (resource, performance, etc.)

  4. Rationale for changes
     • Management perspective
       • Ongoing merger of legacy central IT organizations
       • New support model of stratified technical teams
     • Technical perspective
       • Create a standardized and sustainable infrastructure following U-M (ITS) strategic direction
       • Retire at-risk hardware and software
       • Migrate to new data centers
       • Cross-train technical staff to eliminate single points of knowledge
       • Document all infrastructure and procedures
       • Better position for unknown capacity needs in the near future
       • Untangle unknown and undocumented dependencies

  5. Architecture decisions
     • Cost savings
       • Virtualization when possible/feasible without sacrificing performance or increasing risk
       • Implement using shared infrastructure based on organizational standards (hardware & software)
       • File resources required to be on shared file systems (NFS) (see the mount-check sketch below)
       • Migrate database storage to a high-performance SAN-based solution
     • Capacity
       • Address immediate capacity issues at the database server layer and plan for an unknown growth rate
       • Horizontal vs. vertical scaling

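The shared-NFS requirement above lends itself to a simple operational check. The sketch below is not part of the original deck: it is a minimal Python sanity check, run from an app node, that the shared content volume is mounted from the expected filer and is writable. The mount point and filer hostname are hypothetical placeholders.

      #!/usr/bin/env python3
      """Pre-deployment sanity check (illustrative): confirm the shared NFS
      content volume is mounted from the expected filer and is writable.
      MOUNT_POINT and EXPECTED_SOURCE are hypothetical placeholders."""

      import subprocess
      import sys
      import tempfile

      MOUNT_POINT = "/sakai/content"              # hypothetical shared content path
      EXPECTED_SOURCE = "nfs-filer.example.edu"   # hypothetical NFS server

      def mounted_from(path):
          """Return the filesystem source backing `path`, or None on error."""
          try:
              out = subprocess.check_output(["df", "-P", path]).decode()
          except subprocess.CalledProcessError:
              return None
          lines = out.strip().splitlines()
          return lines[1].split()[0] if len(lines) > 1 else None

      def main():
          source = mounted_from(MOUNT_POINT)
          if source is None or EXPECTED_SOURCE not in source:
              print("FAIL: %s is not served by %s (got %r)" % (MOUNT_POINT, EXPECTED_SOURCE, source))
              return 1
          try:
              # Confirm the node can actually write to the shared volume.
              with tempfile.NamedTemporaryFile(dir=MOUNT_POINT):
                  pass
          except OSError as exc:
              print("FAIL: cannot write to %s: %s" % (MOUNT_POINT, exc))
              return 1
          print("OK: %s mounted from %s and writable" % (MOUNT_POINT, source))
          return 0

      if __name__ == "__main__":
          sys.exit(main())

A check like this can run at boot or before each deployment, so a missing mount is caught before the application starts writing content to local disk.
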
  6. Implementation overview
     • Design
     • Architecture
     • Phase 1 implementation
       • Non-production: Build, Dev, QA, Integration, Support services
     • Phase 2 implementation
       • Non-production: Load
       • Production
     • Testing
       • Performance & Capacity testing
       • DRBC
     • Release & Stabilization

  7. Communications planning & scheduling
     • Deployment & Contingency Planning
       • What if the outage is delayed, postponed, or takes longer than expected?
       • What if users experience problems after the update?
     • Communications planning
       • T-9 months: Email to selected campus groups; special email address set up for service requests
       • T-4 months: CTools/Sakai MOTD on Gateway
       • T-3 months: Recurring email, Facebook, Yammer, and U-M web page posts, plus ITS news article
       • T-2 weeks: One more email, continued social media presence, campus paper news brief
       • T-30 minutes: Global Alert Message to active sessions
       • T+1 day: All-clear email announcement; new gateway

  8. Development impact
     • Limited new functionality/bug-fix releases
     • Technical staff resources needed for testing/validating new environments
     • Investment in automated functional testing
     • Review of new service architecture designs for usability and security from a developer perspective

  9. Performance Testing: Background
     • Commitment to CTools load testing
       • Load test environment equivalent to production
       • Dedicated load tester from 2007–2009 and again from 2012
       • Data roughly equivalent, but diverging over time
     • Goals:
       • Correctly size flexible infrastructure
       • Certify combinations of hardware & software configuration
     (A minimal load-test sketch follows below.)

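For readers unfamiliar with this kind of load testing, here is a minimal sketch; it is illustrative only and does not represent the project's actual tooling or test data. The gateway URL and concurrency numbers are hypothetical placeholders.

      #!/usr/bin/env python3
      """Minimal load-test sketch (illustrative only; not the project's actual
      tooling). Drives concurrent simulated users against a gateway URL and
      reports response-time percentiles. The URL and load numbers are
      hypothetical placeholders."""

      import statistics
      import time
      import urllib.request
      from concurrent.futures import ThreadPoolExecutor

      GATEWAY_URL = "https://ctools-load.example.edu/portal"  # hypothetical load-test gateway
      BASE_USERS = 50         # "base load" concurrency (placeholder)
      LOAD_MULTIPLE = 3       # stress test at a multiple of base load
      REQUESTS_PER_USER = 20

      def one_user(_):
          """Simulate one user issuing sequential requests; return latencies in seconds."""
          latencies = []
          for _ in range(REQUESTS_PER_USER):
              start = time.time()
              with urllib.request.urlopen(GATEWAY_URL, timeout=30) as resp:
                  resp.read()
              latencies.append(time.time() - start)
          return latencies

      def main():
          users = BASE_USERS * LOAD_MULTIPLE
          with ThreadPoolExecutor(max_workers=users) as pool:
              latencies = sorted(l for per_user in pool.map(one_user, range(users)) for l in per_user)
          print("requests: %d  median: %.3fs  p95: %.3fs" % (
              len(latencies),
              statistics.median(latencies),
              latencies[int(len(latencies) * 0.95)]))

      if __name__ == "__main__":
          main()

Stress testing at "multiples of base load" (next slide) amounts to raising LOAD_MULTIPLE until response times or error rates exceed the target.
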
  10. Performance Testing: Results
     • Stress tested both systems to find upper limits
       • Multiples of base load test
     • System capacity increased 120%
     • Significant increase in database CPU capacity: five times
     • New bottleneck at the application tier
     • Virtualization: flexible application tier
       • Added nodes, not CPUs per node
       • Reduced risk for users in case of server failure

  11. Performance Testing: Takeaways
     • Multi-discipline team resolved bottlenecks
     • Communication challenges sat within the infrastructure support model rather than the application support model
     • Provided a basis for communication when resolving production incidents in the Fall and Winter terms

  12. DRBC (Disaster Recovery / Business Continuity)
     • Ambitious plan initially: design and test multiple disaster scenarios of varying size, impact, and recovery difficulty
     • Scaled back due to time and resource constraints
     • End result: single scenario (loss of primary data center) tested
       • Exposed some conflicting architecture assumptions and some contrived environmentals
       • BUT… a successful and well-documented test!

  13. Deployment
     • Dress Rehearsal – late May 2012
     • Pre-deployment Freeze – June 2012
     • Deployment: June 30 – July 1, 2012
       • Start time – 12:15am EDT 06/30/2012
       • Progressed smoothly overall
       • Slow rebuild of Site Search indexes introduced delays
         • Decision to leave it running overnight and reconvene Sunday morning (07/01/2012)
       • Largely coincidental timing of the index rebuild at 11:59:60pm, a.k.a. the Java “leap second” bug ☹
         • Rebooted all application servers Sunday morning to resolve the issue
       • Service restored by early Sunday afternoon (within the scheduled window)

  14. Successes and challenges
     • Delivered on time and within budget!
     • Met stated goals of the project
     • Major achievement for the organizational/institutional direction toward shared services
     • Some tasks slipped out of the project’s scope
     • “CTools is different” – it doesn’t fit the same availability and reliability requirements as some other services
     • Stratified support model requires significantly more communication and engagement

  15. Monitoring & Diagnostic Tools
     • Exposure to Nagios led to organizational adoption (see the check sketch below)
     • CTSTATS
       • Trending
       • Diagnostics
       • Expanded use
     • Homegrown tools
     • EUE

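As an illustration of the Nagios-style checks mentioned above, here is a minimal check sketch. The URL and thresholds are hypothetical and this is not the team's actual plugin; it simply follows the standard Nagios plugin contract of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), 3 (UNKNOWN) with optional perfdata after the pipe.

      #!/usr/bin/env python3
      """Nagios-style check sketch (illustrative, not the team's actual plugin).
      Measures gateway response time and uses the standard Nagios exit codes:
      0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN. URL and thresholds are
      hypothetical placeholders."""

      import sys
      import time
      import urllib.request

      URL = "https://ctools.example.edu/portal"   # hypothetical gateway URL
      WARN_SECONDS = 2.0
      CRIT_SECONDS = 5.0

      def main():
          start = time.time()
          try:
              with urllib.request.urlopen(URL, timeout=CRIT_SECONDS + 5) as resp:
                  resp.read()
          except Exception as exc:
              print("CRITICAL - gateway unreachable: %s" % exc)
              return 2
          elapsed = time.time() - start
          perfdata = "time=%.3fs;%.1f;%.1f" % (elapsed, WARN_SECONDS, CRIT_SECONDS)
          if elapsed >= CRIT_SECONDS:
              print("CRITICAL - response took %.3fs | %s" % (elapsed, perfdata))
              return 2
          if elapsed >= WARN_SECONDS:
              print("WARNING - response took %.3fs | %s" % (elapsed, perfdata))
              return 1
          print("OK - response took %.3fs | %s" % (elapsed, perfdata))
          return 0

      if __name__ == "__main__":
          sys.exit(main())
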
  16. Stabilization
     • Fall 2012 “crunch”
       • Application servers could not handle the load
       • Usage profile significantly different from the load-testing profile
       • Earlier load-testing configuration validation informed a quick decision to change the production application configuration to resolve the issue
     • Winter 2013 “bleed over”
       • February: Observed a curious performance issue in CTLOAD during a Gradebook-related production incident
       • March: Observed reproducible performance degradation in the production service during load testing
       • March-April: Long problem investigation eventually identified a shared infrastructure component (firewall device) as the cause of the degraded performance (see the latency-probe sketch below)
       • May: Near-term solution implemented to have backend app/db communication bypass the firewall device

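The kind of measurement that can help localize a problem like the firewall issue above is a simple path comparison: measure TCP connect latency from an app node to the database through the firewalled route and through a bypass route. The sketch below is illustrative, not the team's actual diagnostic; hostnames and the database port are hypothetical placeholders.

      #!/usr/bin/env python3
      """Latency probe sketch (illustrative, not the team's actual diagnostic).
      Compares TCP connect times from an app node to the database over two
      network paths, e.g. through the shared firewall versus a bypass route.
      Hostnames and the port are hypothetical placeholders."""

      import socket
      import statistics
      import time

      PATHS = {
          "through-firewall": ("db-vip.example.edu", 1521),    # hypothetical firewalled route
          "bypass-firewall":  ("db-direct.example.edu", 1521), # hypothetical direct route
      }
      SAMPLES = 50

      def connect_time(host, port):
          """Return the time in seconds to complete a TCP handshake with host:port."""
          start = time.time()
          sock = socket.create_connection((host, port), timeout=5)
          elapsed = time.time() - start
          sock.close()
          return elapsed

      def main():
          for name, (host, port) in PATHS.items():
              times = [connect_time(host, port) for _ in range(SAMPLES)]
              print("%-17s median=%.1fms max=%.1fms" % (
                  name, statistics.median(times) * 1000, max(times) * 1000))

      if __name__ == "__main__":
          main()

Consistently higher or spikier numbers on one path point at the network device in between rather than at the application or database tier.
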
  17. Where we are now
     • San Diego, CA!
     • Spring/Summer 2013 – low load again ☺
     • All legacy hardware has been decommissioned
     • Preparing for next application release
     • Learned a lot about shared infrastructure and support model (benefits and challenges)

  18. Lessons learned
     • Well-defined and scoped high-level project plan
     • DRBC testing was a successful milestone
     • QA processes refined
     • Open collaboration across infrastructure teams
     • Communication: start early and communicate often
     • Organizational culture disconnects
     • No clearly defined SLA/SLE w/ performance metrics; affected architecture decision-making
     • Resource contention (staff availability)

  19. Image References
     • Slide 4, http://benzironen.files.wordpress.com/2008/02/pimp-my-ride1.jpg
     • Slide 5, http://classicoldsmobile.com/forums/members/joe_padavano-albums-1962+f-85+wagon-picture2503-old-motor-out.jpg
     • Slide 15, http://www.wired.com/images_blogs/wiredscience/2012/01/clock-face-flickr-geishaboy500.jpg
     • Slide 22, http://www.fastcoolcars.com/Pimp/My-Ride-0014.htm