
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017

All modern Big Data solutions, such as Hadoop, Kafka and the rest of the ecosystem tools, are designed as distributed processes and as such include some sort of redundancy for High Availability.


Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

December 05, 2017

1. About us
 We are nerds!
 Started working in Big Data for international companies
 Founded a start-up a few years ago:
 With colleagues working in related technical areas
 And who also knew business stuff!
 We’ve been participating in different Big Data projects

2. Introduction
“I already have HDFS replication and High Availability in my services, why would I need Disaster Recovery (or backup)?”

3. Concepts: High Availability (HA)
 Protects from failing components: disks, servers, network
 Is generally a “systems” issue
 Redundant, doubles components
 Generally has strict network requirements
 Fully automated, immediate

4. Concepts: Backup
 Allows you to go back to a previous state in time: daily, monthly, etc.
 It is a “data” issue
 Protects from accidental deletion or modification
 Also used to check for unwanted modifications
 Takes some time to restore

5. Concepts: Disaster Recovery
 Allows you to work elsewhere
 It is a “business” issue
 Covers you from main-site failures such as electric power or network outages, fires, floods or building damage
 Similar to having insurance
 Medium time to be back online

6. The ideal Disaster Recovery
 High Availability for datacenters
 Exact duplicate of the main site
 Seamless operation (no changes required)
 Same performance
 Same data
 This is often very expensive and sometimes downright impossible

7. DR considerations
So, can we build a cheap(ish) DR? We must evaluate some tradeoffs:
 What’s the cost of the service not being available? (Murphy’s Law: accidents will happen when you are busiest)
 Is all information equally important? Can we lose a small amount of data?
 Can we wait until we recover certain data from backup?
 Can I find other uses for the DR site?

8. DR considerations: Synchronous vs Asynchronous
 Synchronous replication requires a FAST connection
 Synchronous works at transaction level and is necessary for operational systems
 Asynchronous replication converges over time
 Asynchronous is not affected by delays, nor does it create them

9. Big Data DR
 Can’t generally be copied synchronously
 No VM replication
 Other DR rules apply: since failing over impacts users, someone is in charge of the “starting gun”, and DNS and network changes are needed to point clients to the DR site
 Main types: storage replication and dual ingestion

10. Storage replication
 Similar to non-Big Data solutions, where central storage is replicated
 Generally implemented using distcp and HDFS snapshots
 Data is ingested in the source cluster and then copied

11. Storage replication
 Administrative overhead: copy jobs must be scheduled, and metadata changes must be tracked
 Good enough for data that comes from traditional ETLs, such as daily batches

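The scheduled copy jobs above are typically a snapshot-plus-distcp step. A minimal sketch of one such step, assuming the hadoop/hdfs CLIs are on the path; the cluster addresses, paths and snapshot names are illustrative, and distcp's -diff option requires that the previous snapshot also exists on the destination:

```python
# Sketch of one scheduled HDFS snapshot + distcp replication step.
# All host names, paths and snapshot names below are hypothetical.
import subprocess

def build_distcp_cmd(src, dst, prev_snap, new_snap):
    """Build a distcp command that copies only the changes between two
    HDFS snapshots of the source directory to the DR cluster."""
    return [
        "hadoop", "distcp",
        "-update",                     # only transfer changed files
        "-diff", prev_snap, new_snap,  # use the snapshot diff to find changes
        src, dst,
    ]

def replicate(src="hdfs://prod-nn:8020/data/batch",
              dst="hdfs://dr-nn:8020/data/batch",
              prev_snap="s-20171115", new_snap="s-20171116"):
    # Freeze the source in a new snapshot, then ship the diff to the DR site.
    subprocess.run(["hdfs", "dfs", "-createSnapshot", src, new_snap], check=True)
    subprocess.run(build_distcp_cmd(src, dst, prev_snap, new_snap), check=True)
```

Using the snapshot diff keeps each run incremental, which is what makes daily batch schedules practical; a cron or Oozie job would call something like `replicate()` after each ETL finishes.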
12. Dual Ingestion
 No files, just streams
 Generally ingested from multiple outside sources through Kafka
 Streams must be directed to both sites

13. Dual Ingestion
 Adds complexity to apps
 NiFi can be set up as a front-end to both endpoints
 Data consistency must be checked; the check can be automated via monitoring
 Consolidation processes (such as a monthly re-sync) might be needed

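The dual-ingestion pattern above can be sketched as a small wrapper that writes every event to both sites and exposes the consistency check that monitoring would run. This is a toy illustration: the two sites are stubbed as in-memory lists standing in for two Kafka producers (or a NiFi flow), and the class name is ours:

```python
# Minimal dual-ingestion sketch: every incoming event is written to BOTH
# the main and the DR site. Sites are stubbed as lists for illustration;
# a real setup would hold one Kafka producer per site.

class DualIngest:
    def __init__(self, main_site, dr_site):
        self.sites = (main_site, dr_site)

    def send(self, event):
        # The app (or the ingestion front-end) writes to both endpoints.
        for site in self.sites:
            site.append(event)

    def consistent(self):
        # Cheap check that monitoring can run periodically: compare counts.
        # A consolidation job (e.g. monthly re-sync) reconciles divergence.
        return len(self.sites[0]) == len(self.sites[1])

main, dr = [], []
ingest = DualIngest(main, dr)
for e in ["click-1", "click-2", "click-3"]:
    ingest.send(e)
```

When the count check fails (a site was unreachable for a while), that is the trigger for the consolidation process mentioned above.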
14. Others
 Ingestion replication: a variant of dual ingestion in which a consumer is set up on the source Kafka that in turn writes to a destination Kafka; it can become a bottleneck if the initial streams were generated by many producers
 Mixed: the previous solutions are not mutually exclusive; use storage replication for batch processes’ results and dual ingestion for streams

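The ingestion-replication variant can be sketched as a single consume-and-republish bridge. Again the topics are stubbed (a deque for the source, a list for the destination); the point the sketch makes is that one bridge serializes traffic that many producers wrote in parallel, which is exactly why it can become a bottleneck:

```python
# Sketch of ingestion replication: one bridge process consumes from the
# source "Kafka" and re-publishes to the destination. Queues stand in for
# topics; function and parameter names are ours, not a Kafka API.
from collections import deque

def replicate_topic(source_topic, dest_topic, max_batch=100):
    """Drain up to max_batch messages from the source and write them,
    in order, to the destination topic. Returns the number moved."""
    moved = 0
    while source_topic and moved < max_batch:
        # A real consumer would commit offsets rather than pop messages.
        dest_topic.append(source_topic.popleft())
        moved += 1
    return moved
```

In practice this role is played by a tool such as Kafka's MirrorMaker; the single-bridge topology is what to watch when the source side has many parallel producers.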
15. Commercial offerings
Solutions that ease DR setup:
 Cloudera BDR: coordinates HDFS snapshots and copy jobs
 WANdisco Fusion: continuous storage replication
 Confluent Multi-site: allows multi-site Kafka data replication

16. Tips
 Big Data clusters have many nodes, which makes them costly to replicate
 Performance / capacity tradeoff: we can use cheaper servers at the DR site, since we don’t expect to use them often

17. Tips
 Document and test procedures
 DR is rarely fully automated, so responsibilities and actions should be clearly defined
 Plan for (at least) a yearly DR run
 Track changes in software and configuration

18. Tips
 Once you have a DR solution, other uses will surface
 The DR site can be used for backup: maintain HDFS snapshots
 DR data can be used for testing / reporting (warning: this may alter stored data)

19. Conclusions
 Balance HA / Backup / DR as needed; they are not exclusive: different costs, different impact
 Big Data DR is different: dedicated hardware, no VMs, no shared storage array
 Plan for DATA-CENTRIC solutions