resilient-data-pipelines-in-go-uk

The modern world runs on Data. In this talk we will cover how Gophers of any level can easily build Data Pipelines in Go with Kafka and Cassandra. At the end, we will look at how GE has written a Data Pipeline in Go that can handle over 800,000 writes per second of industrial time series data.

Grant Griffiths

August 03, 2018

Transcript

  1. Building Resilient Data Pipelines: An introduction to Building Data Pipelines in Go on Kubernetes
    Grant Griffiths, Software Engineer, Platform Cloud Engineering, GE Digital
  2. Who am I?
    • Climbing + Mountaineering
    • GE Digital - Platform Cloud Engineering, Sr. Software Engineer
    • San Francisco, CA
    • GE Go User Group Founder & Organizer
    • @ggriffiths, @griffithsgrant
    Building Resilient Data Pipelines in Go
  3. What we’ll cover
    1. Industrial IoT and Data @ GE
    2. Introduction to Data Pipelines
    3. Sample Data Pipeline in Go
    4. Resiliency Testing our Data Pipeline
    5. Results and takeaways
  4. 1. Industrial IoT and Data @ GE
     2. Introduction to Data Pipelines
     3. Sample Data Pipeline in Go
     4. Reliability Testing our Data Pipeline
     5. Results and takeaways
  5. Fun facts about GE
    • GE Power generates roughly 33% of the world’s electricity
    • Every two seconds, an aircraft powered by GE takes off
    • 35,000 wind turbines globally
    • 25% of all global hydropower
  6. Industrial Internet of Things (IIoT)
    • GE Assets produce petabytes of useful data
    • Valuable for gaining insights into these assets
    • Optimize w/ Asset Performance Management (APM)
    • Small percent increase in efficiency saves billions of dollars
  7. Predix
    • Multi-cloud platform for the Industrial Internet of Things
    • Many services and applications optimized for industrial data
  8. Some Customers
    • Schindler
    • Exelon
    • Rosneft
    • BP
    • GE Power
    • GE Aviation
    • GE Renewables
  9. Monitoring and Diagnostics Architecture
    [Diagram] Components: Edge device/Customer app, Cloud Gateway (Go), Apache Kafka, Pipeline (Java -> Go), Apache Cassandra, Query Service (Java), Customer Query App
    [Diagram] Arrow labels: Ingest, Publish, Subscribe, Write, HTTP Req, Query C*
  11. Data Pipeline component (Java version)
    • Stack:
        o Stateful Java app
        o Java and Apache Apex
    • Largest deployment (out of several)
        o 150 Cassandra nodes
        o 30 Kafka nodes
        o 144 Apache Apex containers
    • Results:
        o 900,000 C* writes/sec (peak)
  12. Data Pipeline component (Go version)
    • Stack:
        o Stateless Go app
        o Cloud Foundry application (k8s in some newer envs)
    • Prod Configuration (smaller):
        o 9 Cassandra nodes
        o 4 Kafka nodes
        o 32+ Go-Pipeline instances
    • Results:
        o 450,000 C* writes/sec (peak)
  13. Comparison (Java vs. Go)
    Pipeline version    | Kafka nodes | C* nodes | Pipeline nodes | Throughput (writes/sec) | AVG Throughput (writes/day) | Environment
    Java & Apache       | 30          | 150      | 144            | 900,000 (peak)          | 77,760,000,000              | Largest production environment
    Go (actual numbers) | 4           | 9        | 32             | 450,000 (peak)          | 38,880,000,000              | Performance environment
    Go (projected)      | 30          | 40       | 128            | 1,800,000 (peak)        | 155,520,000,000             | Planned production environment
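The writes/day figures in the comparison match the peak writes/sec rate multiplied by 86,400 seconds, so they appear to describe what the peak rate would yield if sustained for a full day. The sketch below reproduces that arithmetic and also divides throughput by pipeline nodes. The function names are illustrative, and the even per-node split is an assumption rather than something the slides state.

```go
package main

import "fmt"

// secondsPerDay is 24h * 60m * 60s = 86,400.
const secondsPerDay = 24 * 60 * 60

// writesPerDay extrapolates a sustained writes/sec rate to a full day.
func writesPerDay(writesPerSec int64) int64 {
	return writesPerSec * secondsPerDay
}

// perNode divides cluster throughput evenly across pipeline nodes.
// An even split is an assumption made for illustration.
func perNode(writesPerSec, nodes int64) float64 {
	return float64(writesPerSec) / float64(nodes)
}

func main() {
	// Rows copied from the comparison slide.
	rows := []struct {
		name         string
		writesPerSec int64
		nodes        int64
	}{
		{"Java & Apache", 900000, 144},
		{"Go (actual numbers)", 450000, 32},
		{"Go (projected)", 1800000, 128},
	}
	for _, r := range rows {
		fmt.Printf("%-20s %d writes/day, %.1f writes/sec per pipeline node\n",
			r.name, writesPerDay(r.writesPerSec), perNode(r.writesPerSec, r.nodes))
	}
}
```

Under these assumptions the projected Go row works out to the same per-node rate as the measured Go row, which is consistent with a linear scaling estimate.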
  14. Rewrite motivations
    • Change in service vision/purpose
    • Originally – wanted customer specific data models/parsing
    • Now – standard data model, parsing
    • Operational cost ($$$)
    • Managing a Hadoop cluster
    • Resources (RAM/CPU/Disk)
    • Moving towards Kubernetes
    • Simple Go Microservice
    • We love Go!
  15. 1. Data @ GE
      2. Introduction to Data Pipelines
      3. Sample Data Pipeline in Go
      4. Reliability Testing our Data Pipeline
      5. Results and takeaways
  16. Introduction to Data Pipelines
    • Move data from one system to another
    • Perform transformations and business logic
    [Diagram] Labels: Kafka, Custom Microservice, Cassandra; Kafka, CockroachDB, Storm; NATS, Custom Microservice, Cassandra
  17. Data source: Apache Kafka
    • Publish/Subscribe messaging system
    • Parallelized with topic partitions
    • High throughput
    • Very widely used
    • Java and Scala, open source - github.com/apache/kafka
    [Diagram] Labels: App1, Publish, Kafka, Subscribe, App2
  18. Consumer groups Example: 1:1
    [Diagram] Topic name: demo-topic (Partition 1, Partition 2, Partition 3, Partition 4)
    [Diagram] Consumer group name: pipeline-group, in a Kubernetes cluster (go-pipeline-node-1, go-pipeline-node-2, go-pipeline-node-3, go-pipeline-node-4)
  19. Consumer groups Example: 2:1
    [Diagram] Topic name: demo-topic (Partition 1, Partition 2, Partition 3, Partition 4)
    [Diagram] Consumer group name: pipeline-group, in a Kubernetes cluster (go-pipeline-node-1, go-pipeline-node-2)
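The 1:1 and 2:1 examples can be reproduced with a small stand-alone sketch. It splits a topic's partitions across the members of a consumer group in contiguous ranges. `assignPartitions` is a hypothetical helper written for illustration; in a real deployment the assignment is negotiated between the Kafka brokers and the client library, and the exact strategy may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// assignPartitions spreads partitions across the members of one consumer
// group in contiguous ranges. Simplified illustration only: it does not
// talk to Kafka and ignores rebalancing.
func assignPartitions(partitions []int, members []string) map[string][]int {
	out := make(map[string][]int, len(members))
	if len(members) == 0 {
		return out
	}
	// Sort a copy so the result does not depend on the caller's ordering.
	sorted := append([]string(nil), members...)
	sort.Strings(sorted)

	base := len(partitions) / len(sorted)  // every member gets at least this many
	extra := len(partitions) % len(sorted) // the first `extra` members get one more
	next := 0
	for i, m := range sorted {
		n := base
		if i < extra {
			n++
		}
		out[m] = append([]int(nil), partitions[next:next+n]...)
		next += n
	}
	return out
}

func main() {
	partitions := []int{1, 2, 3, 4}
	fourNodes := []string{
		"go-pipeline-node-1", "go-pipeline-node-2",
		"go-pipeline-node-3", "go-pipeline-node-4",
	}
	twoNodes := fourNodes[:2]

	fmt.Println(assignPartitions(partitions, fourNodes)) // 1:1, one partition per node
	fmt.Println(assignPartitions(partitions, twoNodes))  // 2:1, two partitions per node
}
```

With more members than partitions, the surplus members receive no partitions in this sketch, which is why instance count is usually chosen with the partition count in mind.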
  20. Using Kafka with Go
    • Many libraries
        o github.com/Shopify/sarama + github.com/bsm/sarama-cluster
        o github.com/confluentinc/confluent-kafka-go
        o github.com/segmentio/kafka-go (June 2017)
    • Chose Sarama + Sarama Cluster
        1. No CGo dependency
        2. Most mature library (at the time)
        3. Wrote internal tooling + documentation around it for ease of use
    • Pick what works for you
  21. Data store: Apache Cassandra
    • Column-oriented database
    • Fault Tolerant - replicated
    • Scalable
    • Apple: over 75,000 nodes, over 10 PB of data
    • Netflix: 2,500 nodes, 420 TB, 1 trillion requests
    • Java, open source - github.com/apache/cassandra
  22. Go and Cassandra
    • github.com/gocql/gocql
    • For high performance data bindings:
    • github.com/scylladb/gocqlx
  23. 1. Data @ GE
      2. Introduction to Data Pipelines
      3. Sample Data Pipeline in Go
      4. Reliability Testing our Data Pipeline
      5. Results and takeaways
  24. Example: Data Pipeline (Gophers “R” Us)
    [Diagram] Labels: Gopher sales data, Log data, Sensor data
  25. Data pipeline architecture
    [Diagram] Labels: Sensor data, Log data, Gopher sales data, Publish, Kafka
  26. Data pipeline architecture
    [Diagram] Labels: Sensor data, Log data, Purchase data, Gopher sales data, Publish, Kafka, Subscribe, Transformation
  27. Data pipeline architecture
    [Diagram] Labels: Sensor data, Log data, Gopher sales data, Publish, Kafka, Subscribe, Transformation, Write, Cassandra
  28. Data pipeline architecture
    [Diagram] Labels: Sensor data, Log data, Gopher sales data, Publish, Kafka, Subscribe, Transformation, Write, Cassandra
    [Diagram] Callout: What we’ll focus on
  29. Simplified Application Flow: 3 Easy Steps
    for {
        select {
        case msg := <-consumer.Messages():
            event := Transform(&msg)
            sink.Write(&event)
        }
    }
    [Diagram] Kafka, Transformation, Cassandra
  30. Subscribing to Kafka
    for {
        select {
        case msg := <-consumer.Messages():
            event := Transform(&msg)
            sink.Write(&event)
        }
    }
    [Diagram] Kafka, Transformation, Cassandra
  31. Subscribing to Kafka
    1. Messages channel: Data from Kafka
    2. Notifications channel: Rebalance notifications
    3. Errors channel: Errors in offset management
  32. Handling messages
    for {
        select {
        case msg := <-consumer.Messages():
            event := Transform(&msg)
            sink.Write(&event)
        }
    }
    [Diagram] Kafka, Transformation, Cassandra
  33. Writing to Cassandra
    for {
        select {
        case msg := <-consumer.Messages():
            event := Transform(&msg)
            sink.Write(&event)
        }
    }
    [Diagram] Kafka, Transformation, Cassandra
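The subscribe, transform, write loop can be exercised without Kafka or Cassandra by putting a channel and an interface in their place. Everything named below (`Message`, `Event`, `Sink`, `memorySink`, `run`) is a hypothetical stand-in, not the talk's actual code. Unlike the slide snippet, this sketch also checks the error returned by the write.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical stand-in for a record read from Kafka.
type Message struct {
	Key   string
	Value string
}

// Event is a hypothetical transformed record ready to be stored.
type Event struct {
	Source  string
	Payload string
}

// Sink is a hypothetical interface over the data store, so a test can
// swap an in-memory implementation for Cassandra.
type Sink interface {
	Write(e *Event) error
}

// memorySink keeps events in a slice instead of a database.
type memorySink struct{ events []Event }

func (s *memorySink) Write(e *Event) error {
	s.events = append(s.events, *e)
	return nil
}

// Transform holds the business logic; here it only normalises the payload.
func Transform(m *Message) Event {
	return Event{Source: m.Key, Payload: strings.ToUpper(strings.TrimSpace(m.Value))}
}

// run reads until messages is closed, transforming and writing each record.
// It stops at the first write error.
func run(messages <-chan Message, sink Sink) error {
	for msg := range messages {
		event := Transform(&msg)
		if err := sink.Write(&event); err != nil {
			return fmt.Errorf("write %q: %w", event.Source, err)
		}
	}
	return nil
}

func main() {
	messages := make(chan Message, 2)
	messages <- Message{Key: "sensor-1", Value: " temp=21 "}
	messages <- Message{Key: "sensor-2", Value: "temp=19"}
	close(messages)

	sink := &memorySink{}
	if err := run(messages, sink); err != nil {
		fmt.Println("pipeline stopped:", err)
		return
	}
	fmt.Println(sink.events)
}
```

Keeping the sink behind an interface is a design choice made here for testability; the slides do not say how the original pipeline structures its writer.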
  34. Graceful shutdown
    1. Set up a channel for listening to OS signals
    2. Listen on the OS signals channel
    3. Handle SIGTERM for graceful shutdown
        • k8s sends this signal when containers are stopped, scaled down, etc.
    4. Grace period (10-30s)
        • Configurable in k8s: terminationGracePeriodSeconds
        • Container killed after this period (or you can os.Exit(0) manually)
  35. 1. Data @ GE
      2. Introduction to Data Pipelines
      3. Sample Data Pipeline in Go
      4. Reliability Testing our Data Pipeline
      5. Results and takeaways
  36. Reliability Testing
    • Systems fail
    • How does our pipeline behave during times like these?
    • How can we remedy these failures?
    • How can we ensure customer data is not lost?
    • Embrace failure scenarios
    • or they will embrace you at 3 AM
    • If this interests you: Google SRE Book landing.google.com/sre/book/index.html
  37. Reliability Testing our Data Pipeline
    • What can fail?
    • Data pipeline node(s)
    • Kafka node(s)
    • Cassandra node(s)
    • Kubernetes can fail
    • How does our pipeline behave when this happens?
    • We can write a test
    • How can we remedy these failures?
  38. Reliability Test example
    • What else can we test?
    • Partial cluster failures
    • Full cluster failures
    • Pipeline failures
    • High load
    • etc
    • Integration Test with Docker!
  39. 1. Data @ GE
      2. Introduction to Data Pipelines
      3. Sample Data Pipeline in Go
      4. Reliability Testing our Data Pipeline
      5. Results and takeaways
  40. Results & Takeaways
    • Building Data Pipelines in Go
        o Simple to get a small app up and running
        o Good enough community support for Kafka, Cassandra, etc.
        o Use a chan os.Signal for graceful shutdown
    • Reliability testing
        o Use Docker for integration/reliability tests
        o Understand how your system behaves during failure scenarios
    • Go Pipelines at GE Digital
        o Replaced our existing Java pipeline for lower operational cost, simplicity, and performance