
Swap the Engine without Stopping the Car

A Sakai CLE Infrastructure Replacement Project
Open Apereo Conference 2013
San Diego, CA
June 5, 2013

Jeff Cousineau



Transcript

  1. Swap the Engine without Stopping the Car
     A Sakai CLE Infrastructure Replacement Project
     Open Apereo Conference 2013, June 5, 2013
     Jeff Cousineau, Beth Kirschner, Chris Kretler

  2. Introductions
     • CTools – University of Michigan’s Sakai CLE instantiation
       • Initial Pilot Service: Winter 2003
       • Initial Production Service: Fall 2004
       • As of March 2013:
         • Over 120,000 Class & Project sites
         • Over 220,000 My Workspaces
       • Reference: http://ctoolsstatus.blogspot.com/
     • CTIR – CTools Infrastructure Rationalization Project (a FY12 Priority IT Project for U-M)

  3. What we hope to accomplish today…
     • Lessons learned from virtual vs. physical infrastructure
     • Implications of shared infrastructure decisions
     • Cost savings are a significant factor in many institutions’ decision-making processes today
     • Different services have different requirements (resource, performance, etc.)

  4. Rationale for changes
     • Management perspective
       • Ongoing merger of legacy central IT organizations
       • New support model of stratified technical teams
     • Technical perspective
       • Create a standardized and sustainable infrastructure following U-M (ITS) strategic direction
       • Retire at-risk hardware and software
       • Migrate to new data centers
       • Cross-train technical staff to eliminate single points of knowledge
       • Document all infrastructure and procedures
       • Better position for unknown capacity needs in the near future
       • Untangle unknown and undocumented dependencies

  5. Architecture decisions
     • Cost savings
       • Virtualization when possible/feasible without sacrificing performance or increasing risk
       • Implement using shared infrastructure based on organizational standards (hardware & software)
       • File resources required to be on shared file systems (NFS) (see the mount-check sketch below)
       • Migrate database storage to a high-performance SAN-based solution
     • Capacity
       • Address immediate capacity issues at the database server layer and plan for an unknown growth rate
       • Horizontal vs. vertical scaling

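The shared-NFS requirement above lends itself to a simple operational check. The sketch below is not part of the original deck: it is a minimal Python sanity check, run from an app node, that the shared content volume is mounted from the expected filer and is writable. The mount point and filer hostname are hypothetical placeholders.

      #!/usr/bin/env python3
      """Pre-deployment sanity check (illustrative): confirm the shared NFS
      content volume is mounted from the expected filer and is writable.
      MOUNT_POINT and EXPECTED_SOURCE are hypothetical placeholders."""

      import subprocess
      import sys
      import tempfile

      MOUNT_POINT = "/sakai/content"              # hypothetical shared content path
      EXPECTED_SOURCE = "nfs-filer.example.edu"   # hypothetical NFS server

      def mounted_from(path):
          """Return the filesystem source backing `path`, or None on error."""
          try:
              out = subprocess.check_output(["df", "-P", path]).decode()
          except subprocess.CalledProcessError:
              return None
          lines = out.strip().splitlines()
          return lines[1].split()[0] if len(lines) > 1 else None

      def main():
          source = mounted_from(MOUNT_POINT)
          if source is None or EXPECTED_SOURCE not in source:
              print("FAIL: %s is not served by %s (got %r)" % (MOUNT_POINT, EXPECTED_SOURCE, source))
              return 1
          try:
              # Confirm the node can actually write to the shared volume.
              with tempfile.NamedTemporaryFile(dir=MOUNT_POINT):
                  pass
          except OSError as exc:
              print("FAIL: cannot write to %s: %s" % (MOUNT_POINT, exc))
              return 1
          print("OK: %s mounted from %s and writable" % (MOUNT_POINT, source))
          return 0

      if __name__ == "__main__":
          sys.exit(main())

A check like this can run at boot or before each deployment, so a missing mount is caught before the application starts writing content to local disk.
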
  6. Implementation overview
     • Design
     • Architecture
     • Phase 1 implementation
       • Non-production: Build, Dev, QA, Integration, Support services
     • Phase 2 implementation
       • Non-production: Load
       • Production
     • Testing
       • Performance & Capacity testing
       • DRBC
     • Release & Stabilization

  7. Communications planning & scheduling
     • Deployment & Contingency Planning
       • What if the outage is delayed, postponed, or takes longer than expected?
       • What if users experience problems after the update?
     • Communications planning
       • T-9 months: Email to selected campus groups; special email address set up for service requests
       • T-4 months: CTools/Sakai MOTD on Gateway
       • T-3 months: Recurring email, Facebook, Yammer, and U-M web page posts, plus ITS news article
       • T-2 weeks: One more email, continued social media presence, campus paper news brief
       • T-30 minutes: Global Alert Message to active sessions
       • T+1 day: All-clear email announcement; new gateway

  8. Development impact
     • Limited new functionality/bug-fix releases
     • Technical staff resources needed for testing/validating new environments
     • Investment in automated functional testing
     • Review of new service architecture designs for usability and security from a developer perspective

  9. Performance Testing: Background
     • Commitment to CTools load testing
       • Load test environment equivalent to production
       • Dedicated load tester from 2007–2009 and again from 2012
       • Data roughly equivalent, but diverging over time
     • Goals:
       • Correctly size flexible infrastructure
       • Certify combinations of hardware & software configuration
     (A minimal load-test sketch follows below.)

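For readers unfamiliar with this kind of load testing, here is a minimal sketch; it is illustrative only and does not represent the project's actual tooling or test data. The gateway URL and concurrency numbers are hypothetical placeholders.

      #!/usr/bin/env python3
      """Minimal load-test sketch (illustrative only; not the project's actual
      tooling). Drives concurrent simulated users against a gateway URL and
      reports response-time percentiles. The URL and load numbers are
      hypothetical placeholders."""

      import statistics
      import time
      import urllib.request
      from concurrent.futures import ThreadPoolExecutor

      GATEWAY_URL = "https://ctools-load.example.edu/portal"  # hypothetical load-test gateway
      BASE_USERS = 50         # "base load" concurrency (placeholder)
      LOAD_MULTIPLE = 3       # stress test at a multiple of base load
      REQUESTS_PER_USER = 20

      def one_user(_):
          """Simulate one user issuing sequential requests; return latencies in seconds."""
          latencies = []
          for _ in range(REQUESTS_PER_USER):
              start = time.time()
              with urllib.request.urlopen(GATEWAY_URL, timeout=30) as resp:
                  resp.read()
              latencies.append(time.time() - start)
          return latencies

      def main():
          users = BASE_USERS * LOAD_MULTIPLE
          with ThreadPoolExecutor(max_workers=users) as pool:
              latencies = sorted(l for per_user in pool.map(one_user, range(users)) for l in per_user)
          print("requests: %d  median: %.3fs  p95: %.3fs" % (
              len(latencies),
              statistics.median(latencies),
              latencies[int(len(latencies) * 0.95)]))

      if __name__ == "__main__":
          main()

Stress testing at "multiples of base load" (next slide) amounts to raising LOAD_MULTIPLE until response times or error rates exceed the target.
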
  10. Performance Testing: Results
     • Stress tested both systems to find upper limits
       • Multiples of base load test
     • System capacity increased 120%
     • Significant increase in database CPU capacity: five times
     • New bottleneck at the application tier
     • Virtualization: flexible application tier
       • Added nodes, not CPUs per node
       • Reduced risk for users in case of server failure

  11. Performance Testing: Takeaways
     • Multi-discipline team resolved bottlenecks
     • Communication challenges sat within the infrastructure support model rather than the application support model
     • Provided a basis for communication when resolving production incidents in the Fall and Winter terms

  12. DRBC (Disaster Recovery / Business Continuity)
     • Ambitious plan initially: design and test multiple disaster scenarios of varying size, impact, and recovery difficulty
     • Scaled back due to time and resource constraints
     • End result: single scenario (loss of primary data center) tested
       • Exposed some conflicting architecture assumptions and some contrived environmentals
       • BUT… a successful and well-documented test!

  13. Deployment
     • Dress Rehearsal – late May 2012
     • Pre-deployment Freeze – June 2012
     • Deployment: June 30 – July 1, 2012
       • Start time – 12:15am EDT 06/30/2012
       • Progressed smoothly overall
       • Slow rebuild of Site Search indexes introduced delays
         • Decision to leave it running overnight and reconvene Sunday morning (07/01/2012)
       • Largely coincidental timing of the index rebuild at 11:59:60pm, a.k.a. the Java “leap second” bug ☹
         • Rebooted all application servers Sunday morning to resolve the issue
       • Service restored by early Sunday afternoon (within the scheduled window)

  14. Successes and challenges
     • Delivered on time and within budget!
     • Met stated goals of the project
     • Major achievement for the organizational/institutional direction toward shared services
     • Some tasks slipped out of the project’s scope
     • “CTools is different” – it doesn’t fit the same availability and reliability requirements as some other services
     • Stratified support model requires significantly more communication and engagement

  15. Monitoring & Diagnostic Tools
     • Exposure to Nagios led to organizational adoption (see the check sketch below)
     • CTSTATS
       • Trending
       • Diagnostics
       • Expanded use
     • Homegrown tools
     • EUE

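As an illustration of the Nagios-style checks mentioned above, here is a minimal check sketch. The URL and thresholds are hypothetical and this is not the team's actual plugin; it simply follows the standard Nagios plugin contract of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), 3 (UNKNOWN) with optional perfdata after the pipe.

      #!/usr/bin/env python3
      """Nagios-style check sketch (illustrative, not the team's actual plugin).
      Measures gateway response time and uses the standard Nagios exit codes:
      0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN. URL and thresholds are
      hypothetical placeholders."""

      import sys
      import time
      import urllib.request

      URL = "https://ctools.example.edu/portal"   # hypothetical gateway URL
      WARN_SECONDS = 2.0
      CRIT_SECONDS = 5.0

      def main():
          start = time.time()
          try:
              with urllib.request.urlopen(URL, timeout=CRIT_SECONDS + 5) as resp:
                  resp.read()
          except Exception as exc:
              print("CRITICAL - gateway unreachable: %s" % exc)
              return 2
          elapsed = time.time() - start
          perfdata = "time=%.3fs;%.1f;%.1f" % (elapsed, WARN_SECONDS, CRIT_SECONDS)
          if elapsed >= CRIT_SECONDS:
              print("CRITICAL - response took %.3fs | %s" % (elapsed, perfdata))
              return 2
          if elapsed >= WARN_SECONDS:
              print("WARNING - response took %.3fs | %s" % (elapsed, perfdata))
              return 1
          print("OK - response took %.3fs | %s" % (elapsed, perfdata))
          return 0

      if __name__ == "__main__":
          sys.exit(main())
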
  16. Stabilization
     • Fall 2012 “crunch”
       • Application servers could not handle the load
       • Usage profile significantly different from the load-testing profile
       • Earlier load-testing configuration validation informed a quick decision to change the production application configuration to resolve the issue
     • Winter 2013 “bleed over”
       • February: Observed a curious performance issue in CTLOAD during a Gradebook-related production incident
       • March: Observed reproducible performance degradation in the production service during load testing
       • March-April: Long problem investigation eventually identified a shared infrastructure component (firewall device) as the cause of the degraded performance (see the latency-probe sketch below)
       • May: Near-term solution implemented to have backend app/db communication bypass the firewall device

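The kind of measurement that can help localize a problem like the firewall issue above is a simple path comparison: measure TCP connect latency from an app node to the database through the firewalled route and through a bypass route. The sketch below is illustrative, not the team's actual diagnostic; hostnames and the database port are hypothetical placeholders.

      #!/usr/bin/env python3
      """Latency probe sketch (illustrative, not the team's actual diagnostic).
      Compares TCP connect times from an app node to the database over two
      network paths, e.g. through the shared firewall versus a bypass route.
      Hostnames and the port are hypothetical placeholders."""

      import socket
      import statistics
      import time

      PATHS = {
          "through-firewall": ("db-vip.example.edu", 1521),    # hypothetical firewalled route
          "bypass-firewall":  ("db-direct.example.edu", 1521), # hypothetical direct route
      }
      SAMPLES = 50

      def connect_time(host, port):
          """Return the time in seconds to complete a TCP handshake with host:port."""
          start = time.time()
          sock = socket.create_connection((host, port), timeout=5)
          elapsed = time.time() - start
          sock.close()
          return elapsed

      def main():
          for name, (host, port) in PATHS.items():
              times = [connect_time(host, port) for _ in range(SAMPLES)]
              print("%-17s median=%.1fms max=%.1fms" % (
                  name, statistics.median(times) * 1000, max(times) * 1000))

      if __name__ == "__main__":
          main()

Consistently higher or spikier numbers on one path point at the network device in between rather than at the application or database tier.
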
  17. Where we are now
     • San Diego, CA!
     • Spring/Summer 2013 – low load again ☺
     • All legacy hardware has been decommissioned
     • Preparing for next application release
     • Learned a lot about shared infrastructure and support model (benefits and challenges)

  18. Lessons learned
     • Well-defined and scoped high-level project plan
     • DRBC testing was a successful milestone
     • QA processes refined
     • Open collaboration across infrastructure teams
     • Communication: start early and communicate often
     • Organizational culture disconnects
     • No clearly defined SLA/SLE w/ performance metrics; affected architecture decision-making
     • Resource contention (staff availability)

  19. Image References
     • Slide 4, http://benzironen.files.wordpress.com/2008/02/pimp-my-ride1.jpg
     • Slide 5, http://classicoldsmobile.com/forums/members/joe_padavano-albums-1962+f-85+wagon-picture2503-old-motor-out.jpg
     • Slide 15, http://www.wired.com/images_blogs/wiredscience/2012/01/clock-face-flickr-geishaboy500.jpg
     • Slide 22, http://www.fastcoolcars.com/Pimp/My-Ride-0014.htm