Scaling Data Engineering

Michael Hausenblas

November 17, 2016

Transcript

  1. © 2016 Mesosphere, Inc. All Rights Reserved. SCALING DATA ENGINEERING

    Michael Hausenblas | 2016-11-17 | Big Data Spain, Madrid
  2. FAST AND BIG DATA …

    [Diagram labels: batch, streaming, PaaS, MapReduce]
    *) kudos to Timothy St. Clair, @timothysc
  3. CHALLENGES

    • Setup and operation of data pipeline components
    • Dealing with back-pressure: elasticity (static vs. dynamic partitioning)
    • Efficient usage of resources (utilization/TCO)
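
The back-pressure challenge can be illustrated with a minimal sketch, assuming a single producer and a single slow consumer connected by a bounded in-memory buffer. The function name `run_pipeline` and all parameters are hypothetical and not from the talk; real pipelines would use a message queue rather than Python's `queue.Queue`.

```python
import queue
import threading
import time


def run_pipeline(num_items: int, capacity: int, consume_delay: float) -> dict:
    """Push num_items through a bounded buffer and report what happened."""
    buffer = queue.Queue(maxsize=capacity)  # bounded: put() blocks when full
    consumed = []
    max_depth = 0

    def consumer():
        while True:
            item = buffer.get()
            if item is None:  # sentinel: the producer is done
                break
            time.sleep(consume_delay)  # simulate a slow downstream stage
            consumed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(num_items):
        # Blocks while the buffer is full, so the producer is slowed down
        # to the consumer's pace instead of piling up unbounded work.
        buffer.put(i)
        max_depth = max(max_depth, buffer.qsize())
    buffer.put(None)
    worker.join()
    return {"consumed": consumed, "max_depth": max_depth}
```

The buffer depth never exceeds `capacity`; in this sketch the only relief is a slower producer, whereas an elastic setup could instead add consumers when the buffer stays full.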
  4. MESSAGE QUEUES & ROUTERS

    • Apache Kafka
    • ØMQ, RabbitMQ, Disque (Redis-based), etc.
    • fluentd, Logstash, Flume
    • Akka streams
    • cloud-only: AWS SQS, Google Cloud Pub/Sub
    • see also queues.io
  5. STREAM PROCESSING PLATFORMS

    • Apache Storm
    • Apache Spark
    • Apache Samza
    • Apache Flink
    • Concord
    • cloud-only: AWS Kinesis, Google Cloud Dataflow
    • see also my webinar on stream processing
  6. TIME SERIES DATASTORES

    • InfluxDB
    • OpenTSDB
    • KairosDB
    • Prometheus
  7. DISTRIBUTED APPLICATION

    [Diagram: one "app" box and seven "hardware + OS" stacks]
  8. DISTRIBUTED OS + DISTRIBUTED APP

    [Diagram: one "app" box, seven "hardware + OS" stacks, and a "distributed OS" layer]
  9. DC/OS BENEFITS

    • One cluster for
      • stateless services such as Web servers & app servers
      • stateful services like PostgreSQL, MemSQL, Kafka, Cassandra, etc.
      • elastic data processing via Spark, Storm/Heron, Akka, etc.
      • CI/CD, for example Jenkins/Marathon
    • Dynamic partitioning of your cluster, depending on your needs
    • Increased utilization (10% → 80%+)
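
Why dynamic partitioning can raise utilization is easier to see with a small worked example. The sketch below is an illustration under stated assumptions, not a measurement: the `DEMAND` numbers are invented, the function names are hypothetical, and the model simply assumes that workload peaks do not coincide. It does not try to reproduce the 10% → 80%+ figure from the slide.

```python
from statistics import mean

# Hypothetical demand in CPU cores per time slot (invented numbers).
DEMAND = {
    "web":   [10, 40, 10, 10],
    "kafka": [30, 10, 10, 10],
    "spark": [10, 10, 10, 60],
}


def total_per_slot(demand):
    """Combined demand of all workloads in each time slot."""
    return [sum(slot) for slot in zip(*demand.values())]


def static_capacity(demand):
    """Static partitioning: each workload owns a partition sized for its own peak."""
    return sum(max(series) for series in demand.values())


def dynamic_capacity(demand):
    """Dynamic partitioning: one shared cluster sized for the combined peak."""
    return max(total_per_slot(demand))


def utilization(demand, capacity):
    """Average combined demand divided by provisioned capacity."""
    return mean(total_per_slot(demand)) / capacity


static_util = utilization(DEMAND, static_capacity(DEMAND))
dynamic_util = utilization(DEMAND, dynamic_capacity(DEMAND))
```

With these invented numbers the static layout provisions 130 cores and the shared cluster 80 cores for the same average demand of 55 cores, so the shared cluster is better utilized.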
  10. appops

    The person who writes an app is also the person responsible for operating the app in prod.
  11. appops

    It's not about provisioning a VM or installing a DC/OS cluster or replacing a faulty HDD …

    … this would be on the infrastructure team.
  12. human fault tolerance

    UX matters! Protect people from themselves.
  13. A SIMPLE DATA PIPELINE

    https://api.github.com/orgs/$ORG/events
  14. A SIMPLE DATA PIPELINE

    $ dcos package install marathon-lb
    $ dcos package install --options=config.json influxdb
    $ dcos package install grafana
    $ dcos marathon app add fetcher.json
    $ curl fetcher.marathon.l4lb.thisdcos.directory:80/start
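
The slides do not show what the fetcher does internally. As a rough sketch of one plausible shape, the code below pulls events from the GitHub organization events endpoint and turns each into an InfluxDB line-protocol record. Everything here is an assumption made for illustration: the measurement name `github_events`, the choice of tags, and the helper names `fetch_events`, `escape_tag`, and `to_line` are hypothetical, and this is not the fetcher used in the talk.

```python
import json
import urllib.request
from datetime import datetime, timezone


def fetch_events(org: str) -> list:
    """Fetch recent public events for a GitHub organization (network call)."""
    url = f"https://api.github.com/orgs/{org}/events"
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "fetcher-sketch",  # GitHub rejects requests without a User-Agent
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def escape_tag(value: str) -> str:
    """Escape commas, equals signs, and spaces in a line-protocol tag value."""
    for char in ",= ":
        value = value.replace(char, "\\" + char)
    return value


def to_line(event: dict) -> str:
    """Convert one GitHub event into an InfluxDB line-protocol record."""
    created = datetime.strptime(event["created_at"], "%Y-%m-%dT%H:%M:%SZ")
    created = created.replace(tzinfo=timezone.utc)
    nanoseconds = int(created.timestamp()) * 1_000_000_000
    tags = f"type={escape_tag(event['type'])},repo={escape_tag(event['repo']['name'])}"
    # "github_events" is a hypothetical measurement name; count=1i is an integer field.
    return f"github_events,{tags} count=1i {nanoseconds}"
```

A caller would join the lines from `to_line` with newlines and write them to InfluxDB, where Grafana can then chart event counts by type or repository.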
  15. TAKE HOME MESSAGES

    • Try to have short feedback loops
    • Containers and 'The Cloud' make deployment easy, leverage it!
    • Technology is the simple part of the solution: big data technologies won't fix your broken culture
  16. Q & A

    • @mhausenblas
    • mhausenblas.info
    • [email protected]

    https://dcos.io