Disaster Recovery and Site Migration with Site Recovery Manager: Customer Experiences from Around the World

Presentation about the use of NetApp storage and VMware SRM for successful disaster recovery during the 2011 Tōhoku earthquake and tsunami.

Christopher Wells

August 30, 2011

Transcript

  1. BCO3276 Disaster Recovery and Site Migration with Site Recovery Manager: Customer Experiences from Around the World

    • Gil Haberman, Product Marketing Manager, Business Continuity and Disaster Recovery, VMware, Inc.
    • Alan Baird, VMware, Inc.
    • Christopher Wells, TUV Rheinland Japan Ltd.
    • Paul Schlosser, VMware, Inc.
    • Robert Busillo, Independence Blue Cross
  2. 2 Disclaimer

    • This session may contain product features that are currently under development.
    • This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
    • Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
    • Technical feasibility and market demand will affect final delivery.
    • Pricing and packaging for any new technologies or features discussed or presented have not been determined.
  3. 3 Agenda

    • SRM and vSphere For Simple and Reliable DR
    • TÜV Rheinland, Japan
    • Mainfreight, New Zealand
    • Independence Blue Cross, USA
  4. 5 Disasters Happen. Do You Need Protection?

    • 43% of companies experiencing disasters never re-open, and 29% close within two years. (McGladrey and Pullen)
    • 93% of businesses that lost their data center for 10 days went bankrupt within one year. (National Archives & Records Administration)
    • 40% of all companies that experience a major disaster will go out of business if they cannot gain access to their data within 24 hours. (Gartner)
    • Top executives say 10 hours to recovery; IT managers say up to 30 hours. (Harris Interactive)
  5. 6 vCenter Site Recovery Manager Ensures Simple, Reliable DR

    Provide cost-efficient replication of applications to failover site
    • Built-in vSphere Replication
    • Broad support for storage-based replication
    Simplify management of recovery and migration plans
    • Replace manual runbooks with centralized recovery plans
    • From weeks to minutes to set up new plan
    Automate failover and migration processes for reliable recovery
    • Enable frequent non-disruptive testing
    • Ensure fast, automated failover
    • Automate failback processes
    Site Recovery Manager complements vSphere to provide the simplest and most reliable disaster protection and site migration for all applications
    [Diagram: Site A (Primary) and Site B (Recovery), each with servers running VMware vSphere, VMware vCenter Server, and Site Recovery Manager]
  6. 7 SRM Momentum Introduced in Q2’ 2008 125,000+ units sold

    5,000+ customers 50% annual growth in 2010 “If your organization is already taking advantage of virtualization, then adding Site Recovery Manager to handle disaster recovery is a no-brainer.” ― Jerry Wilkin Senior Systems Administrator, Dayton Superior Corp
  7. 8 What’s New In Site Recovery Manager 5.0?

    vSphere Replication
    • Bundled with SRM at no additional cost
    • Provides simple, cost-efficient replication between vSphere clusters
    Automated failback
    • Bi-directional recovery plans
    • Automates failback to original site
    Planned migration
    • New workflow that can be applied to any recovery plan
    • Ensures no data-loss, application-consistent migrations of virtual machines
    Others
    • More granular control over VM startup order
    • Protection-side APIs
    • IPv6 support
    Expand DR coverage to Tier 2 apps and smaller sites
    Streamline planned migrations (for disaster avoidance, planned maintenance, …)
  8. 9 Beyond DR: Disaster Avoidance And Planned Migrations

    3 typical use-cases for SRM

    Disaster Failover: Recover from unexpected site failure
    • Full or partial site failure
    The most critical but least frequent use-case
    • Unexpected site failures do not happen often
    • When they do, fast recovery is critical to the business

    Disaster Avoidance: Anticipate potential datacenter outages
    • For example: a forecast hurricane, floods, forced evacuation, etc.
    Initiate preventive failover for smooth migration
    • Leverage SRM ‘planned migration’ to ensure no data-loss
    • ‘Automated failback’ enables easy return to original site

    Planned Migration: Most frequent SRM use case
    • Planned datacenter maintenance
    • Global load balancing
    Streamline routine migrations across sites
    • Test to minimize risk
    • Execute partial failovers
    • Leverage SRM ‘planned migration’ to ensure no data-loss
    • ‘Automated failback’ enables bi-directional migrations
  9. 11 Background TÜV Rheinland was started in Germany in 1872

    to perform safety testing of steam pressure vessels. Today TÜV Rheinland is active in 61 countries and 39 different business fields. Technical certification of a wide range of technology products and services. Examples: PV cells, X-ray machines, photocopiers, computer monitors, computer mice/keyboards. Also perform Business Continuity Management, Data Protection Management, Information Security and ITIL services.
  10. 12 Justification Propensity for seismicity in Japan. Already had infrastructure

    at more than one location. Services hosted for external customers required specific SLA. Simplify difficult process of disaster recovery.
  11. 13 Status Quo Before the earthquake, companies were using physical

    servers at their DR site, or had no DR site at all! Companies in Japan are now conscious of a need for DR and BCP solutions. Many Japanese VMware customers are only familiar with the vSphere base product, not complementary solutions. VMware is now more actively marketing the SRM products as a result of the recent earthquake.
  12. 14 History Prior to SRM, DR process was manual. Already

    had implemented SAN replication, so running SRM was next logical step. DR testing was non-existent due to manual overhead involved with testing. Leveraged VMware snapshots to reduce RTO during failback.
  13. 15 Implementation Met with VMware and a local reseller for

    guidance. Set up a POC and learned the product, especially with help of official documentation and books by 3rd party authors. Performed tests of the recovery plan. Leveraged IP address mapping CSV. 3-4 months later, put system into production.
  14. 16 Use Cases General use of VMware products helps conserve

    power (useful during power shortages). Shift workloads from areas under power consumption constraints/reductions to unaffected areas. Typical DR protection between Eastern and Western Japan offices. Temporary fail-over to remote site for planned power outage situations (once per year).
  15. 17 Disaster & Aftermath On March 11th, at 2:46PM JST

    our disaster recovery plan went into motion. Immediately following the initial shock, systems were functional. Performed testing of the SRM recovery plans as extra precaution. Rolling power outages were implemented by TEPCO, necessitating failover process. Systems not covered by SRM (physical machines) had RTO of >24 hours.
  16. 18 Lessons & Suggestions Planning for the initial disaster is

    not enough; you must also plan for energy and other supply shortages. Ensure there is a chain of command to kick off recovery and ensure more than 1 person can initiate it. Make sure newly created VMs are configured in the Recovery Plan. Be sure to back up the SRM configuration (local files) and DB backend prior to upgrade. Perform frequent disaster tests. Provide more user-friendly way to map IP addresses. Alert administrators about unprotected or misconfigured VMs.
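
The suggestion to alert administrators about unprotected VMs can be sketched as a simple set comparison. This is a minimal illustration, not part of SRM: the function name `find_unprotected_vms` and the VM names are hypothetical, and in practice both lists would have to come from the vCenter inventory and the recovery plan configuration.

```python
def find_unprotected_vms(inventory, protected):
    """Return VM names that exist in the inventory but are in no recovery plan.

    Both arguments are plain iterables of VM names (hypothetical data; a real
    check would pull them from the vCenter inventory and the SRM configuration).
    """
    protected_names = {name.lower() for name in protected}
    # Preserve inventory order so the report is stable between runs.
    return [name for name in inventory if name.lower() not in protected_names]


if __name__ == "__main__":
    inventory = ["web01", "db01", "mail01", "web02"]
    protected = ["web01", "db01", "mail01"]
    for vm in find_unprotected_vms(inventory, protected):
        print(f"WARNING: {vm} is not covered by any recovery plan")
```

Running a check like this on a schedule would catch the case called out above, where a newly created VM never gets added to the Recovery Plan.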
  17. 20 Thanks! For more information: www.tuv.com

    Follow me:
    • Blog: http://www.vsamurai.com
    • Twitter: @wygtya
    • LinkedIn: http://jp.linkedin.com/in/wygtya
    • Facebook: http://www.facebook.com/wygtya
  18. 23 Confidential Challenges we face

    Natural Disasters
    • Earthquakes (3 major and 250 minor in the last 12 months)
    • Tsunami
    • Volcanic – 2 active
    Remote
    • 3 hour flight to Australia
    Stability of Power
    • 1998 Auckland power crisis
    • Reliance on hydro electricity
    WAN Considerations
    • Cost and bandwidth limitations
  19. 24 Confidential What was learnt from Christchurch

    Christchurch was considered low risk for earthquakes
    Servers and desktops
    • Unable to return to the office 6 months later
    • Servers were protected but desktops were lost
    Reliance on backup media
    • Slow and potentially unreliable
    The Human factor
    • Other priorities
    • Civil unrest
    The value of virtualisation
    • DR with SRM becomes viable
  20. 26 Confidential Who are we Mainfreight is a global supply

    chain logistics provider Commenced business in 1978 Today has a market capitalisation of $993 million Sales revenues in excess of $1.75 billion 4,600+ team members Unique culture & philosophy We have a quality focus and aim to delight our customers. “A company with a 100 year vision”
  21. 28 Confidential Our Challenges

    “Do more with less”
    • Hybrid model consisting of mostly physical
    • Cost of DR & BCP
    • Previous DR process worked but was complex & time consuming
    • Recent Christchurch earthquakes reiterated to our business the reality of disaster occurring & the importance of DR & BCP
    • Costs of ~$10,000 every hour the systems are down
  22. 30 Confidential About our environment

    Hardware / Software
    • HP servers & storage
    • Cisco network
    • Microsoft, Citrix, vSphere/SRM 4.x
    • Active – Active data centres
    Applications protected with SRM
    • Maintrak - Web-based consignment tracking system
    • MIMs - Inventory management system
    • Cargowise – International freight forwarding system
    • On Account – Accounting system
    • On Sale – CRM system
    “Top performing organisations are those that have harnessed the true potential of today’s cutting edge technologies”
    Production and Recovery sites: South Auckland and Central Auckland
  23. 31 Confidential SRM Highlights

    • Reduced DR test times from ~15 hours to 4 hours
    • Reduced number of team members needed for DR from 4 to 2
    • Minimised downtime costs – estimated at $10k per hour
    • Achieved 99.999% availability
    • SRM has been proven and used in ‘anger’ - SAN failure
    • Installation well planned and implemented
    • Project completed on time and on budget
    • Minimal external consultancy required
    • Provided a platform to deliver DR for future business applications
    “DR is only as good as the last time it was tested”
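
The figures on this slide can be put into perspective with some simple arithmetic. The sketch below is illustrative only: it assumes a 365-day year, treats the estimated $10k per hour as a flat rate, and assumes every team member is engaged for the full duration of a DR test. The function names are made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Downtime budget per year, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def downtime_cost(hours, cost_per_hour=10_000):
    """Cost of an outage at a flat hourly rate (the slide estimates ~$10k/hour)."""
    return hours * cost_per_hour


def person_hours(team_size, hours):
    """Staff effort for one DR test, assuming everyone works the full test."""
    return team_size * hours


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.999)
    print(f"99.999% availability allows about {budget:.2f} minutes of downtime per year")
    print(f"DR test effort before: {person_hours(4, 15)} person-hours")
    print(f"DR test effort after:  {person_hours(2, 4)} person-hours")
    print(f"A 4-hour outage at the flat rate costs ${downtime_cost(4):,}")
```

Under these assumptions, "five nines" leaves a budget of only a few minutes per year, which is why the reduction in test time and staffing matters: tests that are cheap to run get run more often.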
  24. 32 Confidential Thank you “VMware has provided us with a

    flexible, reliable IT platform to support the business and deliver IT services in more responsive and cost-effective ways.” – Kevin Drinkwater, Global Chief Information Officer
  25. 34 Company Background

    VMware History: IBC started in 2004 to convert physical servers to VMs in a company-wide effort to consolidate hardware and drive down maintenance cost and datacenter space/utilities.
    Servers Virtualized: We currently manage about 800 VMs residing on 60-plus ESX hosts running ESX 4.1 and ESXi. Since 2005 we have converted over 300 physical servers to VMs.
    Storage: EMC DMX 4 (Production and DR) and NetApp (Test, Dev and QA).
    Uses for VMware: We run Windows 2003, 2008, and Red Hat v5 (64- and 32-bit OSes). We have many Tier 1 applications running in our VM environment: SQL, SharePoint, Citrix, Hyperion/Informatics and our claims processing servers.
  26. 35 Business Needs

    What was needed: We were moving our data center in the summer of 2009 from Philadelphia to Hershey, PA and needed to migrate 300+ production VMs to our new location.
    SRM Review: VMware came onsite to present the SRM product for a future IBC project (DR insourcing). After the product presentation we saw the potential in using this product for our datacenter move. Working with VMware Professional Services proved very beneficial for IBC.
    Did it solve the problem? Yes. SRM made our D.C. move streamlined and less stressful; it also addressed our plans for DR insourcing and a redundant production environment.
  27. 36 Business Needs

    Why VMware solution: When we saw the SRM product and how it could help us move 300+ production VMs from our Center City Philadelphia D.C. to our new Hershey, PA D.C., it was clear to us that this product would save us many man-hours that we needed elsewhere on our D.C. move weekend.
    SRM Characteristics: The SRM advantages that IBC leveraged were the pre-move testing, streamlining and automation of the overall D.C. move script, in which we could plan out the recovery sequence from Tier 1 Prod VMs to Tier 3 Test VMs. The overall reliability of this product saved our company many admin man-hours, pre- and post-migration.
  28. 37 Business Needs

    Time outages avoided: We saved hours of production server outage time by using SRM instead of a manual migration, and countless admin man-hours were saved, allowing our staff to be utilized in other areas of the move weekend.
    What was needed:
    • SRM plugin for Virtual Center
    • EMC – SRDF
    • VMware Professional Services – The professional services contact was very knowledgeable in the SRM product and how to integrate this with our EMC storage.
    • SRM script and planning – Setting up your server priority migration planning.
  29. 38 Data Center Migration

    How much time till DC cutover: Professional Services came out a few months prior to the DC move and were onsite for 2 days to prepare the plan and gather information about the environment.
    What was the setup and integration process: We worked with VMware to set up our migration script and verify that the EMC storage was replicating correctly.
  30. 39 Data Center Migration

    Services needed
    • Replication of data – Our initial sync was about 50 LUNs and about 30TB of data.
    • We then set up daily replication of about 1TB a day.
    • Set up our server priority script (which servers to power down last and which servers to power up first).
    • VMware came onsite 1 more day for verification that all was well before the final move date.
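
The server priority script described on this slide boils down to an ordered list: the servers that power up first at the new site are the ones that power down last at the old one. Here is a minimal sketch of that ordering, assuming each VM is tagged with a tier number (1 = most critical). The `Vm` type, the VM names, and the tier values are hypothetical; they are not taken from IBC's actual script or from the SRM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vm:
    name: str
    tier: int  # 1 = most critical (hypothetical tagging scheme)


def power_up_order(vms):
    """Most critical tier first; ties broken by name for a stable plan."""
    return sorted(vms, key=lambda vm: (vm.tier, vm.name))


def power_down_order(vms):
    """Reverse of the power-up order, so critical servers go down last."""
    return list(reversed(power_up_order(vms)))


if __name__ == "__main__":
    vms = [Vm("test-app", 3), Vm("claims-db", 1), Vm("intranet", 2)]
    print("Power down:", [vm.name for vm in power_down_order(vms)])
    print("Power up:  ", [vm.name for vm in power_up_order(vms)])
```

Deriving the shutdown order from the startup order, rather than maintaining two lists, keeps the two sequences from drifting apart as VMs are added.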
  31. 40 Data Center Migration

    What happened on Labor Day move weekend? VMware was on site Friday night when we kicked off SRM; there was about 1TB of changes left to be synced. We then disconnected our EMC storage at the old datacenter and failed over to the new datacenter storage. Fewer than 10 VMs needed some attention to get back online. I would highly recommend VMware Professional Services. They were on site a total of 4 days and walked us through the whole datacenter migration.
  32. 41 Today How is SRM running today? We currently insource

    our Disaster Recovery Drill at our D.R./Redundant Production datacenter in Reading, PA utilizing SRM and VMware to get us through the DR drill with replication and failover. We currently run these tests 3-4 times a year.
  33. 43 Where Can I Learn More?

    At VMworld
    • Visit us at the booth
    • Multiple great sessions on SRM
        • BCO 1269 – SRM 5 technical – Tue 4:30PM; Wed 1 PM
        • BCO 1562 – SRM 5 technical – Tue 12 PM, Wed 10 AM
        • BCO 2527 – SRM 5 technical – Tue 3 PM
        • BCO 3334 – Cloud DR – Mon 10 AM; Wed 4 PM
        • BCO 3336 – Cloud DR – SP perspective – Mon 11:30AM; Tue 12 PM
    VMware.com
    • Product Page – www.vmware.com/products/srm
    • Overview, datasheet, webinars, docs, community links
    • Free 60-day Evaluation – all you need to get started!
    • Solutions from VMware – www.vmware.com/solutions/continuity