Scaling Data Engineering

Michael Hausenblas

November 17, 2016

Transcript

  1. © 2016 Mesosphere, Inc. All Rights Reserved. SCALING DATA ENGINEERING

    Michael Hausenblas | 2016-11-17 | Big Data Spain, Madrid
  2. FAST AND BIG DATA …

    [Figure labels: batch, streaming, PaaS, MapReduce]
    *) kudos to Timothy St. Clair, @timothysc
  3. CHALLENGES

    • Set up and operation of data pipeline components
    • Dealing with back-pressure: elasticity (static vs. dynamic partitioning)
    • Efficient usage of resources (utilization/TCO)
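The back-pressure challenge above can be illustrated with a bounded buffer between a fast producer and a slow consumer. This is only a minimal sketch: the function name `run_pipeline`, the buffer size, and the simulated delay are assumptions made for illustration and do not come from the talk. It shows the static case, where the buffer is fixed and the producer simply has to wait.

```python
import queue
import threading
import time

SENTINEL = None  # marks the end of the stream (illustrative convention)


def run_pipeline(num_items, buffer_size, consumer_delay=0.001):
    """Push num_items through a bounded buffer to a deliberately slow consumer.

    Returns the consumed items and the largest buffer depth the producer saw.
    """
    buffer = queue.Queue(maxsize=buffer_size)  # bounded, so put() can block
    consumed = []
    depths = []

    def consumer():
        while True:
            item = buffer.get()
            if item is SENTINEL:
                break
            time.sleep(consumer_delay)  # simulate a slow downstream stage
            consumed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(num_items):
        # Blocks while the buffer is full: the producer is slowed down to the
        # consumer's pace instead of piling up unbounded work.
        buffer.put(i)
        depths.append(buffer.qsize())
    buffer.put(SENTINEL)
    worker.join()
    return consumed, max(depths, default=0)


if __name__ == "__main__":
    items, max_depth = run_pipeline(num_items=20, buffer_size=4)
    print(len(items), "items consumed, max buffer depth", max_depth)
```

With a fixed buffer and a single consumer the only lever is to slow the producer. Elastic setups would instead add consumers when the buffer stays full, which is where dynamic partitioning of the cluster comes in.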
  4. MESSAGE QUEUES & ROUTERS

    • Apache Kafka
    • ØMQ, RabbitMQ, Disque (Redis-based), etc.
    • fluentd, Logstash, Flume
    • Akka streams
    • cloud-only: AWS SQS, Google Cloud Pub/Sub
    • see also queues.io
  5. STREAM PROCESSING PLATFORMS

    • Apache Storm
    • Apache Spark
    • Apache Samza
    • Apache Flink
    • Concord
    • cloud-only: AWS Kinesis, Google Cloud Dataflow
    • see also my webinar on stream processing
  6. TIME SERIES DATASTORES

    • InfluxDB
    • OpenTSDB
    • KairosDB
    • Prometheus
  7. DISTRIBUTED APPLICATION

    [Diagram: seven hardware/OS stacks and one app]
  8. DISTRIBUTED OS + DISTRIBUTED APP

    [Diagram: seven hardware/OS stacks, a distributed OS, and one app]
  9. DC/OS BENEFITS

    • One cluster for
      • stateless services such as Web servers & app servers
      • stateful services like PostgreSQL, MemSQL, Kafka, Cassandra, etc.
      • elastic data processing via Spark, Storm/Heron, Akka, etc.
      • CI/CD, for example Jenkins/Marathon
    • Dynamic partitioning of your cluster, depending on your needs
    • Increased utilization (10% → 80%+)
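To make the utilization figures concrete, here is a rough back-of-the-envelope calculation. Only the 10% and 80% values come from the slide; the workload size, the cores per node, and the helper name `nodes_needed` are hypothetical numbers chosen purely for illustration.

```python
def nodes_needed(required_cores, cores_per_node, utilization_pct):
    """Nodes needed to serve required_cores of real work when each node is
    only busy utilization_pct percent of the time (integer ceiling division)."""
    usable_per_node_x100 = cores_per_node * utilization_pct
    return -(-(required_cores * 100) // usable_per_node_x100)


# Hypothetical workload: 160 cores of actual work on 16-core machines.
at_10_percent = nodes_needed(160, 16, 10)
at_80_percent = nodes_needed(160, 16, 80)
print(at_10_percent, "nodes at 10% vs", at_80_percent, "nodes at 80%")
```

Under these assumed numbers the same workload fits on 13 nodes instead of 100, which is the TCO argument behind sharing one cluster across workloads.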
  10. appops

    The person who writes an app is also the person responsible for operating the app in prod.
  11. appops

    It's not about provisioning a VM or installing a DC/OS cluster or replacing a faulty HDD …

    … this would be on the infrastructure team.
  12. human fault tolerance

    UX matters!
    protect people from themselves
  13. A SIMPLE DATA PIPELINE

    https://api.github.com/orgs/$ORG/events
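One way a fetcher for this endpoint could turn events into time series points is sketched below. This is not the fetcher from the talk: the function names, the measurement name `github_events`, and the choice to count events per type are all assumptions. The sketch relies only on the `type` field that GitHub event objects carry and on the InfluxDB line protocol layout (`measurement,tags fields timestamp`). It does no network I/O, and it assumes org names and event types contain no characters that need escaping.

```python
from collections import Counter


def events_url(org):
    """Build the org events URL shown on the slide for a given organization."""
    return f"https://api.github.com/orgs/{org}/events"


def count_by_type(events):
    """Count events per GitHub event type (e.g. PushEvent, IssuesEvent)."""
    return Counter(event["type"] for event in events)


def to_line_protocol(org, counts, timestamp_ns):
    """Render per-type counts as InfluxDB line protocol, one line per type.

    'github_events' is a hypothetical measurement name; the trailing 'i'
    marks the field value as an integer.
    """
    return [
        f"github_events,org={org},type={event_type} count={count}i {timestamp_ns}"
        for event_type, count in sorted(counts.items())
    ]


if __name__ == "__main__":
    # Hand-written sample standing in for a decoded API response.
    sample = [{"type": "PushEvent"}, {"type": "IssuesEvent"}, {"type": "PushEvent"}]
    print(events_url("mesosphere"))
    for line in to_line_protocol("mesosphere", count_by_type(sample), 1479340800000000000):
        print(line)
```

The resulting lines could then be written to InfluxDB and charted in Grafana, matching the components installed in the pipeline.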
  14. A SIMPLE DATA PIPELINE

    $ dcos package install marathon-lb
    $ dcos package install --options=config.json influxdb
    $ dcos package install grafana
    $ dcos marathon app add fetcher.json
    $ curl fetcher.marathon.l4lb.thisdcos.directory:80/start
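The contents of `fetcher.json` are not shown in the deck. The sketch below generates a plausible Marathon app definition for it, assuming a Docker-based fetcher. The image name, resource sizes, and container port are hypothetical, and the field names reflect Marathon app definitions as commonly written for DC/OS of that period, so check them against the documentation for your version. The `VIP_0` label is what would yield the `fetcher.marathon.l4lb.thisdcos.directory:80` address used in the `curl` command.

```python
import json


def fetcher_app(image, vip_port=80, container_port=80):
    """Return a hypothetical Marathon app definition for the fetcher service."""
    return {
        "id": "/fetcher",
        "cpus": 0.1,       # assumed sizing, not from the talk
        "mem": 128,        # MiB, assumed
        "instances": 1,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,
                "network": "BRIDGE",
                "portMappings": [
                    {
                        "containerPort": container_port,
                        "hostPort": 0,  # let Marathon pick a host port
                        "protocol": "tcp",
                        # Named VIP: reachable inside the cluster as
                        # fetcher.marathon.l4lb.thisdcos.directory:<vip_port>
                        "labels": {"VIP_0": f"/fetcher:{vip_port}"},
                    }
                ],
            },
        },
    }


if __name__ == "__main__":
    # 'example/fetcher:latest' is a placeholder image name.
    print(json.dumps(fetcher_app("example/fetcher:latest"), indent=2))
```

Writing the printed JSON to `fetcher.json` would give a file in the shape that `dcos marathon app add` expects.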
  15. TAKE HOME MESSAGES

    • Try to have short feedback loops
    • Containers and 'The Cloud' make deployment easy, leverage it!
    • Technology is the simple part of the solution: big data technologies won't fix your broken culture
  16. Q & A

    • @mhausenblas
    • mhausenblas.info
    • [email protected]

    https://dcos.io