Scaling Data Engineering

SCALING DATA ENGINEERING
Michael Hausenblas | 2016-11-17 | Big Data Spain, Madrid
© 2016 Mesosphere, Inc. All Rights Reserved.

CITIES
Image © 2014, Wired magazine

TRENDS, OPPORTUNITIES, AND CHALLENGES

ACTIONABLE INSIGHTS, ANYONE?

TOWARDS 100% SELF SERVICE

CLOUD, CONTAINERS AND DEVOPS

FAST AND BIG DATA …
[Figure labels: batch, streaming, PaaS, MapReduce]
*) kudos to Timothy St. Clair, @timothysc

CHALLENGES
• Set up and operation of data pipeline components
• Dealing with back-pressure: elasticity (static vs. dynamic partitioning)
• Efficient usage of resources (utilization/TCO)
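As a rough illustration of the back-pressure bullet (not taken from the talk), the sketch below uses a bounded in-memory buffer between a fast producer and a slower consumer. The function name `run_pipeline` and all parameter values are hypothetical; a real pipeline would put a message queue between the stages rather than Python's `queue.Queue`.

```python
import queue
import threading
import time


def run_pipeline(num_items=20, buffer_size=5, consume_delay=0.001):
    """Push num_items through a bounded buffer.

    Returns the consumed items and the largest queue depth the producer saw.
    """
    buf = queue.Queue(maxsize=buffer_size)  # bounded: put() blocks when full
    consumed = []
    sentinel = object()  # marks the end of the stream

    def consumer():
        while True:
            item = buf.get()
            if item is sentinel:
                break
            time.sleep(consume_delay)  # simulate a slow downstream stage
            consumed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()

    max_depth = 0
    for i in range(num_items):
        buf.put(i)  # blocks while the buffer is full: this is the back-pressure
        max_depth = max(max_depth, buf.qsize())
    buf.put(sentinel)
    worker.join()
    return consumed, max_depth


items, depth = run_pipeline()
print(f"consumed {len(items)} items, max queue depth {depth}")
```

The producer is slowed to the consumer's pace instead of overflowing the buffer. Elasticity, as named on the slide, would address the same pressure by adding consumers rather than throttling the producer.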

MESSAGE QUEUES & ROUTERS
• Apache Kafka
• ØMQ, RabbitMQ, Disque (Redis-based), etc.
• fluentd, Logstash, Flume
• Akka streams
• cloud-only: AWS SQS, Google Cloud Pub/Sub
• see also queues.io

STREAM PROCESSING PLATFORMS
• Apache Storm
• Apache Spark
• Apache Samza
• Apache Flink
• Concord
• cloud-only: AWS Kinesis, Google Cloud Dataflow
• see also my webinar on stream processing

TIME SERIES DATASTORES
• InfluxDB
• OpenTSDB
• KairosDB
• Prometheus

TIME FOR A NEW KIND OF OPERATING SYSTEM

SINGLE MACHINE APPLICATION
[Diagram labels: hardware, OS, app]

DISTRIBUTED APPLICATION
[Diagram labels: app; seven machines, each labeled hardware and OS]

DISTRIBUTED OS + DISTRIBUTED APP
[Diagram labels: app; distributed OS; seven machines, each labeled hardware and OS]

LOCAL OS VS DISTRIBUTED OS

DC/OS BENEFITS
• One cluster for
  • stateless services such as Web servers & app servers
  • stateful services like PostgreSQL, MemSQL, Kafka, Cassandra, etc.
  • elastic data processing via Spark, Storm/Heron, Akka, etc.
  • CI/CD, for example Jenkins/Marathon
• Dynamic partitioning of your cluster, depending on your needs
• Increased utilization (10% → 80%+)
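To make the link between dynamic partitioning and utilization concrete, here is a small back-of-the-envelope sketch. The workload names and demand numbers are invented for illustration and do not reproduce the 10% → 80%+ figure on the slide. The assumption is that static partitioning sizes each workload's silo for its own peak, while dynamic partitioning sizes one shared cluster for the peak of the combined demand.

```python
def utilization(demand_by_workload):
    """Return (static, dynamic) average utilization.

    demand_by_workload maps a workload name to its demand (in arbitrary
    resource units) at each time step; all lists must have equal length.
    """
    # Total demand across all workloads at each time step.
    total_per_step = [sum(step) for step in zip(*demand_by_workload.values())]
    avg_total = sum(total_per_step) / len(total_per_step)

    # Static partitioning: every workload gets a silo sized for its own peak.
    static_capacity = sum(max(d) for d in demand_by_workload.values())
    # Dynamic partitioning: one shared pool sized for the combined peak.
    dynamic_capacity = max(total_per_step)

    return avg_total / static_capacity, avg_total / dynamic_capacity


# Hypothetical demand curves whose peaks fall at different times.
demand = {
    "web": [6, 6, 2, 2],
    "batch": [1, 1, 7, 1],
    "streaming": [2, 4, 2, 4],
}
static_util, dynamic_util = utilization(demand)
print(f"static: {static_util:.0%}, dynamic: {dynamic_util:.0%}")
```

The gain comes entirely from peaks that do not coincide; if all workloads peaked at the same moment, both capacities would be equal and sharing would not help.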

A SIMPLE DATA PIPELINE
$ dcos package install marathon-lb
$ dcos package install --options=config.json influxdb
$ dcos package install grafana
$ dcos marathon app add fetcher.json
$ curl fetcher.marathon.l4lb.thisdcos.directory:80/start
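The slide does not show the contents of `fetcher.json`. The sketch below generates one plausible Marathon app definition for it, purely as an assumption: the image name, container port, and resource sizes are hypothetical, and the `VIP_0` label follows the DC/OS named-VIP convention as I understand it, which would yield the `fetcher.marathon.l4lb.thisdcos.directory:80` address used in the `curl` call.

```python
import json


def fetcher_app(image="example/fetcher:latest", vip_name="fetcher", vip_port=80):
    """Build a hypothetical Marathon app definition for the fetcher service."""
    return {
        "id": "/fetcher",
        "cpus": 0.1,       # assumed sizing, not from the talk
        "mem": 128,
        "instances": 1,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,  # placeholder image name
                "network": "BRIDGE",
                "portMappings": [
                    {
                        "containerPort": 8080,  # assumed service port
                        "hostPort": 0,          # let Marathon pick a host port
                        "protocol": "tcp",
                        # Named VIP, assumed to map to the l4lb address below.
                        "labels": {"VIP_0": f"/{vip_name}:{vip_port}"},
                    }
                ],
            },
        },
    }


def vip_address(app):
    """Derive the internal load-balanced address from the app's VIP_0 label."""
    label = app["container"]["docker"]["portMappings"][0]["labels"]["VIP_0"]
    name, port = label.lstrip("/").split(":")
    return f"{name}.marathon.l4lb.thisdcos.directory:{port}"


app = fetcher_app()
print(json.dumps(app, indent=2))  # content one could save as fetcher.json
print(vip_address(app))
```

Generating the definition from code rather than editing JSON by hand keeps the VIP name and the address that clients call in one place.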

TAKE HOME MESSAGES
• Try to have short feedback loops
• Containers and 'The Cloud' make deployment easy: leverage them!
• Technology is the simple part of the solution: big data technologies won't fix your broken culture
