
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017


All modern Big Data solutions, such as Hadoop, Kafka and the rest of the ecosystem tools, are designed as distributed processes and as such include some form of redundancy for High Availability.


Big Data Spain 2017
November 16th - 17th Kinépolis Madrid




  2. Disaster Recovery for Big Data


  3. About us

    We are nerds!

    Started working in Big Data for international companies

    Founded a start-up a few years ago:
     With colleagues working in related technical areas
     And who also knew business stuff!

    We’ve been participating in different Big Data projects


  4. Introduction
    “I already have HDFS replication and High
    Availability in my services, why would I
    need Disaster Recovery (or backup)?”


  5. Concepts

    High Availability (HA)
     Protects from failing components: disks, servers, network
     Is generally a “systems” issue
     Redundant: key components are doubled
     Generally has strict network requirements
     Fully automated, requires no manual intervention


  6. Concepts

    Backup
     Allows you to go back to a previous state in time: daily, monthly, etc.
     It is a “data” issue
     Protects from accidental deletion or modification
     Also used to check for unwanted modifications
     Takes some time to restore


  7. Concepts

    Disaster Recovery
     Allows you to keep working from an alternative site
     It is a “business” issue
     Covers you from main site failures such as electric power or network outages, fires, floods or building damage
     Similar to having insurance
     Medium time to be back in operation


  8. The ideal Disaster Recovery

    High Availability for the whole site

    Exact duplicate of the main site:
     Seamless operation (no changes required)
     Same performance
     Same data

    This is often very expensive and sometimes downright impossible


  9. DR considerations

    So, can we build a cheap(ish) DR?

    We must evaluate some tradeoffs:
     What’s the cost of the service not being available? (Murphy’s Law: accidents will happen when you are busiest)
     Is all information equally important? Can we lose a small amount of data?
     Can we wait until we recover certain data from backups?
     Can I find other uses for the DR site?


  10. DR considerations

    Near or far?
     Availability
     Latency
     Legal considerations


  11. DR considerations

    Synchronous vs. asynchronous replication
     Synchronous replication requires a FAST connection
     Synchronous works at transaction level and is necessary for operational systems
     Asynchronous replication converges over time
     Asynchronous is not affected by delays nor does it create them


  12. Big Data DR

    Data can’t generally be copied at the storage-array level

    No VM replication

    Other DR rules apply:
     Since it impacts users, someone is in charge of the “starting gun”
     DNS and network changes to point clients to the DR site

    Main types:
     Storage replication
     Dual ingestion


  13. Storage replication

    Similar to non-Big Data solutions, where central storage is replicated

    Generally implemented using distcp and HDFS snapshots

    Data is ingested in the source cluster and then copied to the DR cluster, as sketched below

  14. Storage replication

     Copy jobs must be scheduled and monitored
     Metadata changes must be tracked

    Good enough for data that comes from traditional ETLs such as daily batches


  15. Dual Ingestion

    No files, just streams

    Generally ingested from multiple outside sources through Kafka

    Streams must be directed to both sites, as in the sketch below

  16. Dual Ingestion

    Adds complexity to apps
     NiFi can be set up as a front-end to both sites

    Data consistency must be checked (see the sketch below)
     Checks can be automated via monitoring
     Consolidation processes (such as a monthly re-sync) might be needed
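    A minimal sketch of an automated consistency check, again assuming the confluent-kafka Python client: it compares the approximate number of retained messages on each site, which is only a rough signal because retention and partitioning can legitimately differ between clusters:

    from confluent_kafka import Consumer, TopicPartition

    def message_count(bootstrap: str, topic: str) -> int:
        """Approximate number of retained messages for a topic on one cluster."""
        consumer = Consumer({"bootstrap.servers": bootstrap,
                             "group.id": "dr-consistency-check"})
        partitions = consumer.list_topics(topic).topics[topic].partitions
        total = 0
        for p in partitions:
            low, high = consumer.get_watermark_offsets(TopicPartition(topic, p), timeout=10)
            total += high - low
        consumer.close()
        return total

    main_total = message_count("kafka-main:9092", "events")  # placeholders
    dr_total = message_count("kafka-dr:9092", "events")
    if main_total != dr_total:
        print(f"sites have diverged: main={main_total} dr={dr_total}")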


  17. Others

    Ingestion replication (sketched below)
     Variant of the dual ingestion
     A consumer is set up in the source Kafka that in turn writes to a destination Kafka
     Can become a bottleneck if the initial streams were generated by many producers
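    A minimal sketch of that replicating consumer, assuming the confluent-kafka Python client (the pattern that tools such as Kafka MirrorMaker automate); broker addresses and the topic name are placeholders:

    from confluent_kafka import Consumer, Producer

    source = Consumer({"bootstrap.servers": "kafka-main:9092",  # placeholder
                       "group.id": "dr-replicator",
                       "auto.offset.reset": "earliest"})
    dest = Producer({"bootstrap.servers": "kafka-dr:9092"})     # placeholder

    source.subscribe(["events"])
    try:
        while True:
            msg = source.poll(1.0)
            if msg is None or msg.error():
                continue
            # Re-publish on the same topic in the DR cluster; this single
            # process is the bottleneck mentioned above.
            dest.produce(msg.topic(), value=msg.value(), key=msg.key())
            dest.poll(0)
    finally:
        dest.flush()
        source.close()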

     Previous solutions are not mutually exclusive
     Storage replication for batch processes’ results
     Dual ingestion for streams


  18. Commercial offerings

    Solutions that ease DR setup

    Cloudera BDR
     Coordinates HDFS snapshots and copy jobs

    WANdisco Fusion
     Continuous storage replication

    Confluent Multi-site
     Allows multi-site Kafka data replication


  19. Tips

    Big Data clusters have many nodes
     Costly to replicate
     Performance / capacity tradeoff
     We can use cheaper servers in DR, since we don’t expect to use them


  20. Tips

    Document and test procedures
     DR is rarely fully automated, so responsibilities and
    actions should be clearly defined
     Plan for (at least) a yearly DR run
     Track changes in software and configuration


  21. Tips

    Once you have a DR solution, other uses will appear

    DR site can be used for backups
     Maintain HDFS snapshots

    DR data can be used for testing / reporting
     Warning: it may alter stored data


  22. Conclusions

    Balance HA / Backup / DR as needed, they are not exclusive:
     Different costs
     Different impact

    Big Data DR is different:
     Dedicated hardware
     No VMs, no central storage array

    Plan for DATA-CENTRIC solutions


  23. Questions
