Developing Elastic Data Pipelines

Webinar, eSynergySolutions, see also video at https://www.youtube.com/watch?v=9y1qHY3JMko

Michael Hausenblas

March 14, 2016

Transcript

  1. © 2016 Mesosphere, Inc. All Rights Reserved. DEVELOPING ELASTIC DATA PIPELINES
    Michael Hausenblas, Developer & Cloud Advocate | 2016-03-14 | Webinar
  2. LET'S TALK ABOUT WORKLOADS* …
    (diagram labels: batch, streaming, PaaS, MapReduce)
    *) kudos to Timothy St. Clair, @timothysc
  3. MESSAGE QUEUES & ROUTERS
    • Apache Kafka
    • ØMQ, RabbitMQ, Disque (Redis-based), etc.
    • fluentd, Logstash, Flume
    • Akka streams
    • cloud-only: AWS SQS, Google Cloud Pub/Sub
    • see also queues.io
  4. APACHE KAFKA (Message queues & routers)
    • High-throughput, distributed, persistent publish-subscribe messaging system
    • Originates from LinkedIn
    • Typically used as buffer/de-coupling layer in online stream processing
    kafka.apache.org
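
The "persistent publish-subscribe" and "buffer/de-coupling layer" points can be illustrated with a toy, in-memory log. This is only a sketch of the idea and does not use Kafka or any Kafka client API: the class `TopicLog`, its methods, and the consumer names are hypothetical. It shows that messages stay in the log after being read and that each consumer tracks its own position, so a slow consumer does not hold back a fast one.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log for one topic (hypothetical, in-memory, single process)."""

    def __init__(self):
        self._messages = []               # messages are kept, not deleted on read
        self._offsets = defaultdict(int)  # next offset to read, per consumer

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Return unread messages for this consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


log = TopicLog()
for reading in ("t=1 cpu=0.42", "t=2 cpu=0.57", "t=3 cpu=0.61"):
    log.publish(reading)

# Two independent consumers read the same log at their own pace.
fast = log.poll("stream-job", max_messages=10)
slow = log.poll("archiver", max_messages=1)
print(fast)
print(slow)
```

Here the producer never waits for either consumer; the log acts as the buffer between them. A real deployment would add partitioning, replication, and durable storage, none of which this sketch attempts.
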
  5. STREAM PROCESSING PLATFORMS
    • Apache Storm
    • Apache Spark
    • Apache Samza
    • Apache Flink
    • Concord
    • cloud-only: AWS Kinesis, Google Cloud Dataflow
    • see also my webinar on stream processing
  6. APACHE STORM (Stream processing platforms)
    • Distributed, fault-tolerant stream-processing platform
    • Guaranteed message processing (replaying messages on failure)
    • Concepts: tuples, streams, spouts, bolts, topologies
    storm.apache.org
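
The concepts on this slide can be sketched in a few lines of plain Python. This is not Storm's API: the classes `SentenceSpout`, `SplitBolt`, `FlakyCountBolt` and the function `run_topology` are hypothetical names chosen for illustration. A spout emits tuples, bolts process them, a topology wires them together, and a tuple that fails downstream is handed back to the spout and replayed.

```python
class SentenceSpout:
    """Source of the stream: emits tuples and re-emits those that failed."""

    def __init__(self, sentences):
        self._pending = list(enumerate(sentences))  # (message id, tuple) pairs

    def next_tuple(self):
        return self._pending.pop(0) if self._pending else None

    def fail(self, msg_id, tup):
        # Replay: put the failed tuple back at the end of the queue.
        self._pending.append((msg_id, tup))


class SplitBolt:
    """Bolt: turns one sentence tuple into a list of words."""

    def process(self, tup):
        return tup.split()


class FlakyCountBolt:
    """Bolt: counts words; fails once on purpose to trigger a replay."""

    def __init__(self):
        self.counts = {}
        self._failed_once = False

    def process(self, words):
        if not self._failed_once:
            self._failed_once = True
            raise RuntimeError("simulated worker failure")
        for word in words:
            self.counts[word] = self.counts.get(word, 0) + 1


def run_topology(spout, split_bolt, count_bolt):
    """Topology: wires spout -> split bolt -> count bolt and drains the stream."""
    replays = 0
    while (item := spout.next_tuple()) is not None:
        msg_id, sentence = item
        try:
            count_bolt.process(split_bolt.process(sentence))
        except RuntimeError:
            spout.fail(msg_id, sentence)
            replays += 1
    return replays


counter = FlakyCountBolt()
replays = run_topology(SentenceSpout(["to be or", "not to be"]), SplitBolt(), counter)
print(replays, counter.counts)
```

In this sketch the simulated failure happens before any count is updated, so the replay leaves the counts exact. With replay-based guarantees in general, a tuple that fails after partial processing may be counted more than once, which is why downstream updates are often made idempotent.
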
  7. APACHE SPARK (Stream processing platforms)
    • Libraries: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph processing)
    • Engine: Spark core (RDD)
    • Cluster managers: Mesos, YARN, Standalone
    • Storage: filesystem (local, HDFS, S3) or data store (HBase, Cassandra, Elasticsearch, etc.)
    spark.apache.org
  8. TIME SERIES DATASTORES
    • InfluxDB
    • OpenTSDB
    • KairosDB
    • Prometheus
    • see also iot-a.info
  9. OPENTSDB (Time series datastores)
    • Distributed time series database on top of HBase
    • Store, index, query & plot metrics
    • Extremely scalable
    • Low-level monitoring
    opentsdb.net
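
To make the "store metrics" point concrete: OpenTSDB's telnet-style interface accepts data points as text lines of the form `put <metric> <timestamp> <value> <tagk=tagv> ...`. The helper below only builds such a line and sends nothing over the network; the function name `format_put`, the metric name, and the tag values are hypothetical. The check that at least one tag is present reflects my understanding of OpenTSDB's data model and should be treated as an assumption.

```python
def format_put(metric, timestamp, value, tags):
    """Render one data point as a telnet-style `put` line (string only, no I/O)."""
    if not tags:
        # Assumption: every data point carries at least one tag.
        raise ValueError("at least one tag is expected per data point")
    # Sort tags so that the same point always renders to the same line.
    tag_part = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_part}"


# A Unix timestamp in seconds, a float value, and two tags.
line = format_put("sys.cpu.user", 1457913600, 42.5, {"host": "web01", "cpu": 0})
print(line)
```

The metric name plus the tag set identifies a series, which is what lets the store index and query metrics per host, per CPU, and so on.
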
  10. INFLUXDB (Time series datastores)
    • No-dependency time series database written in Go
    • SQL-ish query language (incl. regex, fan out)
    • Single-node or Raft-based distributed mode
    influxdb.com
  11. CHALLENGES (The Toolbox)
    • Distributed systems are hard
    • Set up and operation of components
    • One (static) cluster per component
    • Efficient usage of cluster resources (TCO)
  12. BENEFITS (DCOS)
    • Run stateless services (Web server, app server, etc.) and Big Data services like Kafka, Spark, or Cassandra together on one cluster
    • Dynamic partitioning of your cluster, depending on your business requirements
    • Increased utilization (10% → 80%++)
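
The utilization argument can be checked with simple arithmetic. The sketch below compares one static cluster per component, each sized for its own peak, with a single shared pool sized for the peak of the combined demand. All demand figures are invented for illustration and are not meant to reproduce the 10% → 80% numbers on the slide.

```python
# Hypothetical CPU demand (in cores) per component over four time slots.
demand = {
    "web":   [40, 10, 5, 10],
    "kafka": [10, 30, 10, 5],
    "spark": [5, 10, 40, 10],
}


def utilization(used_per_slot, capacity):
    """Average fraction of the capacity that is actually in use."""
    return sum(used_per_slot) / (capacity * len(used_per_slot))


# Combined demand across all components in each time slot.
total_per_slot = [sum(slot) for slot in zip(*demand.values())]

# Static partitioning: each component owns a cluster sized for its own peak.
static_capacity = sum(max(series) for series in demand.values())
static_util = utilization(total_per_slot, static_capacity)

# Shared cluster: one pool sized for the peak of the combined demand.
shared_capacity = max(total_per_slot)
shared_util = utilization(total_per_slot, shared_capacity)

print(f"static: {static_capacity} cores, utilization {static_util:.0%}")
print(f"shared: {shared_capacity} cores, utilization {shared_util:.0%}")
```

The gain comes from the components peaking at different times: the shared pool needs to cover only the highest combined load, not the sum of the individual peaks.
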
  13. A SIMPLE DATA PIPELINE (Examples)
    https://github.com/mesosphere/cassandra-kairosdb-tutorial
  14. HYBRID DATA PIPELINE (Examples)
    https://mesosphere.com/blog/2015/11/18/dcos-time-series-demo
  15. Q & A
    • @mhausenblas
    • mhausenblas.info
    • @mesosphere
    • mesosphere.com