Slide 1

Slide 1 text

Building Data-driven Applications with Go
Felix Raab, Head of Engineering @ KI labs

Slide 2

Slide 2 text

The problem with data: Data Never Sleeps (7.0)
"By 2020, there will be 40x more bytes of data than there are stars in the observable universe." [1]
[1] Data Never Sleeps 7.0

Slide 3

Slide 3 text

The big data ecosystem
FAANG companies have released their tools as OSS [2], but you are most likely not processing FAANG-scale data!
"No, Kafka doesn't cut it any more, I'm now looking at Pulsar..."
[2] Data & AI Landscape 2019

Slide 4

Slide 4 text

This talk... shows you how to solve 80% [3] of your data problems with simpler tooling (no Hadoop, no Kubernetes, ...).
[3] Wild Pareto-style guess

Slide 5

Slide 5 text

What are (data) pipelines?
Data pipeline: "the process that takes input data through a series of transformation stages, producing data as output." [4]
Special type, ML pipeline: "the process that takes data and code as input, and produces a trained ML model as the output." [4]
[4] Continuous Delivery for Machine Learning

Slide 6

Slide 6 text

Data pipeline building blocks [5]
[5] Adapted from "Foundations for Architecting Data Solutions", Seidman and Malaska

Slide 7

Slide 7 text

Typical data flow
Raw Data | Queue | (Stream Processor || HDFS|S3) | (Queue || Scheduled ETL) | Target DB

Slide 8

Slide 8 text

World's simplest data pipeline? [6]
    awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -r -n | head -n 5
More than a poor man's solution:
• Unix philosophy: Compose small programs that do one thing well
• GNU sort handles data larger than memory by spilling to disk and runs in parallel across CPU cores
[6] Designing Data-Intensive Applications, Martin Kleppmann

Slide 9

Slide 9 text

On the other hand... If only we had a nice programming language...

Slide 10

Slide 10 text

Advanced Taco Bell programming [7] using the Go standard library and a single binary
[7] Taco Bell Programming
[8] Original creator of "performance gopher" unknown

Slide 11

Slide 11 text

Memory Challenge
OOM = working set > available memory
Otherwise: disk I/O is an order of magnitude slower than RAM.

Slide 12

Slide 12 text

Instead of adding RAM or spinning up an expensive big data cluster [9]:
#1 Compress
e.g., store two-valued strings as booleans (in memory!)
[9] When your data doesn’t fit in memory: the basic techniques
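
A minimal sketch of the compression idea, assuming a column that only ever holds the strings "yes" and "no" (the helper name compressFlags is hypothetical). On 64-bit platforms a Go string value takes a 16-byte header plus its bytes, while a bool takes one byte.

```go
package main

import "fmt"

// compressFlags converts a column of "yes"/"no" strings into booleans.
// Any value other than "yes" is treated as false (an assumption of this sketch).
func compressFlags(raw []string) []bool {
	flags := make([]bool, len(raw))
	for i, s := range raw {
		flags[i] = s == "yes"
	}
	return flags
}

func main() {
	raw := []string{"yes", "no", "yes", "yes"}
	fmt.Println(compressFlags(raw)) // prints [true false true true]
}
```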

Slide 13

Slide 13 text

Instead of adding RAM or spinning up an expensive big data cluster [10]:
#2 Chunk
e.g., load data into memory in chunks and process chunks in parallel
[10] When your data doesn’t fit in memory: the basic techniques
[11] MapReduce explained in 41 words

Slide 14

Slide 14 text

Instead of adding RAM or spinning up an expensive big data cluster [12]:
#3 Index
e.g., only load a subset of the data, using an index ("summary")
[12] When your data doesn’t fit in memory: the basic techniques

Slide 15

Slide 15 text

Now, Go: Data Flow Primitives!

Slide 16

Slide 16 text

#1 Generator
A generator is a function that returns a sequence of values through a channel ("producer-only module"):
    type Generator func() <-chan int
Use cases: A generator could generate a number stream, load a file, read from a database, scrape the web, etc.

Slide 17

Slide 17 text

#2 Processor
A processor is a function that takes a channel and returns a sequence of values through a channel:
    type Processor func(<-chan int) <-chan int
Use cases: A processor could be used for number crunching, data aggregation, deduplication, filtering, validation, etc.

Slide 18

Slide 18 text

#3 Consumer
A consumer is a function that takes a channel:
    type Consumer func(<-chan int)
Use cases: Print a sequence, save data, etc.

Slide 19

Slide 19 text

Returning function types...
...allows us to customize our generators, processors, and consumers:
    func customGenerator(params ...int) Generator {
        return func() <-chan int {
            // use params...
        }
    }
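
To make the customization concrete, here is a small sketch with two hypothetical constructors, rangeGenerator and multiplyProcessor, whose parameters are captured by the returned closures. The type definitions repeat those from the earlier slides so the program is self-contained.

```go
package main

import "fmt"

type Generator func() <-chan int
type Processor func(<-chan int) <-chan int

// rangeGenerator (hypothetical) returns a Generator that emits
// from, from+1, ..., to-1 and then closes its channel.
func rangeGenerator(from, to int) Generator {
	return func() <-chan int {
		out := make(chan int)
		go func() {
			defer close(out)
			for i := from; i < to; i++ {
				out <- i
			}
		}()
		return out
	}
}

// multiplyProcessor (hypothetical) returns a Processor that multiplies
// every incoming value by factor.
func multiplyProcessor(factor int) Processor {
	return func(in <-chan int) <-chan int {
		out := make(chan int)
		go func() {
			defer close(out)
			for v := range in {
				out <- v * factor
			}
		}()
		return out
	}
}

func main() {
	// Prints 10, 20 and 30 on separate lines.
	for v := range multiplyProcessor(10)(rangeGenerator(1, 4)()) {
		fmt.Println(v)
	}
}
```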

Slide 20

Slide 20 text

Our first Go pipeline: Producer – Processor – Consumer
    func numberGenerator(nums ...int) Generator {
        return func() <-chan int {
            out := make(chan int)
            go func() {
                defer close(out)
                for _, i := range nums {
                    out <- i
                }
            }()
            return out
        }
    }

Slide 21

Slide 21 text

Our first Go pipeline: Producer – Processor – Consumer
    func squareProcessor() Processor {
        return func(in <-chan int) <-chan int {
            out := make(chan int)
            go func() {
                defer close(out)
                for i := range in {
                    out <- i * i
                }
            }()
            return out
        }
    }

Slide 22

Slide 22 text

Our first Go pipeline: Producer – Processor – Consumer
    func printConsumer() Consumer {
        return func(in <-chan int) {
            for {
                i, ok := <-in
                if ok {
                    fmt.Println(i)
                } else {
                    return
                }
            }
        }
    }

Slide 23

Slide 23 text

Our first Go pipeline: Run
    func main() {
        printConsumer()(
            squareProcessor()(
                numberGenerator(1, 2, 3, 4)(),
            ))
    }

Slide 24

Slide 24 text

Data flow patterns: Fan-in, fan-out
Remember chunking/indexing?

Slide 25

Slide 25 text

Data flow patterns: Fan-in, fan-out (Code)
    func main() {
        logConsumer()(
            // Fan-in
            logAggregator(100)(
                // Fan-out
                logProcessor()(
                    logGenerator("2019-01.log")()),
                logProcessor()(
                    logGenerator("2019-02.log")()),
            ),
        )
    }

Slide 26

Slide 26 text

Data flow patterns: Multiplexing
The aggregator takes a variable number of channels and returns a (buffered) output channel.
Multiplexing: Range over all input channels and use a sync.WaitGroup to close the output channel once all inner goroutines writing to it are done.

Slide 27

Slide 27 text

Data flow patterns: Multiplexing (Code)
    var wg sync.WaitGroup
    for _, in := range ins {
        wg.Add(1)
        go func(in <-chan string) {
            defer wg.Done()
            for i := range in {
                out <- i
            }
        }(in)
    }

Slide 28

Slide 28 text

Data flow patterns: Closing & Unblocking
General guidelines [13]:
• Pipeline stages close their out-channels when all send operations are done (i.e., goroutines exit once values have been sent downstream).
• Pipeline stages keep receiving values until their in-channels are closed or the senders are unblocked.
Q: How to unblock?
A: 1) Use buffered channels, 2) Explicitly cancel via done channels, passed to all pipeline stages.
[13] Go Concurrency Patterns: Pipelines and cancellation

Slide 29

Slide 29 text

Data flow patterns: Cancellation Example
In the consumer: defer close(done), or close it explicitly (e.g., on error).
    // Assumes Processor now also takes the done channel:
    // func(done <-chan struct{}, in <-chan int) <-chan int
    func squareProcessor() Processor {
        return func(done <-chan struct{}, in <-chan int) <-chan int {
            // ...
            defer close(out)
            for n := range in {
                // Proceeds either after a successful send on out or after receiving from done.
                // Returns without draining in; upstream also stops sending after the done broadcast.
                select {
                case out <- n * n:
                case <-done:
                    return
                }
            }
            // ...
        }
    }
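
The elided parts can be filled in as follows. This is a sketch modeled on the done-channel pattern from the Go blog article cited in these slides; the plain functions generate, square, and firstSquare are hypothetical stand-ins for the Generator, Processor, and Consumer types.

```go
package main

import "fmt"

// generate emits nums until they are exhausted or done is closed.
func generate(done <-chan struct{}, nums ...int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, n := range nums {
			select {
			case out <- n:
			case <-done:
				return
			}
		}
	}()
	return out
}

// square squares incoming values until in is closed or done is closed.
func square(done <-chan struct{}, in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for n := range in {
			select {
			case out <- n * n:
			case <-done:
				return
			}
		}
	}()
	return out
}

// firstSquare consumes a single value and then cancels the pipeline.
// Closing done unblocks both upstream goroutines so they can exit.
func firstSquare(nums ...int) int {
	done := make(chan struct{})
	defer close(done)
	return <-square(done, generate(done, nums...))
}

func main() {
	fmt.Println(firstSquare(2, 3, 4)) // prints 4
}
```

Without the done channel, the two goroutines would stay blocked on their sends forever once the consumer stops reading.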

Slide 30

Slide 30 text

Other data flow patterns: Streaming
"Poor man's streaming":
    func randNumStream(max int) Generator {
        return func() <-chan int {
            out := make(chan int, 10)
            rand.Seed(time.Now().UnixNano())
            go func() {
                for {
                    out <- rand.Intn(max)
                    time.Sleep(10 * time.Millisecond)
                }
            }()
            return out
        }
    }

Slide 31

Slide 31 text

Other data flow patterns: Rate Limiting [14]
    throttle := make(chan time.Time, 100) // buffer = burstLimit
    go func() {
        tick := time.NewTicker(rate)
        defer tick.Stop()
        for t := range tick.C {
            select {
            case throttle <- t:
            default:
            }
        }
    }()
    for i := range in {
        <-throttle
        out <- i
    }
[14] Rate Limiting

Slide 32

Slide 32 text

More advanced data processing
Three OSS projects to watch:
• Benthos: Stream processor, sources and sinks concept (actions, transformations, filters)
• Automi: Stream processing API (emitter – operation – collector), abstraction on top of channels, early stages
• Bigslice: Ad-hoc cluster processing framework (not strictly streaming, distributed computation across multiple nodes), similar to Spark but "serverless"

Slide 33

Slide 33 text

Of Sources and Sinks... ...producers/consumers, inputs/outputs, ...

Slide 34

Slide 34 text

Benthos [15]
• Directly links inputs to outputs with acknowledgements (optional buffering)
• Supports resilient and customizable processing logic (incl. conveniences such as JMESPath queries)
• Is well documented, configuration-driven, and used in production
[15] Benthos

Slide 35

Slide 35 text

Benthos Input Handling [16]
[16] Building a Resilient Stream Processor in Go

Slide 36

Slide 36 text

Benthos Custom Processing [17]
[17] Building a Resilient Stream Processor in Go

Slide 37

Slide 37 text

Benthos Config [18]
[18] Building a Resilient Stream Processor in Go

Slide 38

Slide 38 text

Final data recommendations...
• Use Go (obviously...), try data flow programming
• Stay lean, check if you really need typical big data tooling: be aware of both infrastructure complexities and runtime overheads (splitting data sets, serializing, sending data over the network for distributed computations)
• (and also: often use Postgres, learn SQL)

Slide 39

Slide 39 text

Recap
• We looked at basic building blocks of a data pipeline and some of the challenges when processing large data sets
• We covered some general techniques to work with data in memory
• We looked at how to construct simple data flow systems using goroutines and channels
• And finally we looked at more advanced open-source tooling to set up flexible and scalable pipelines

Slide 40

Slide 40 text

Thank you! Q&A?

Slide 41

Slide 41 text

Links
➡ Designing Data-Intensive Applications, Martin Kleppmann
➡ When your data doesn’t fit in memory: the basic techniques
➡ Channel Use Cases – Data Flow Manipulations
➡ Go Concurrency Patterns: Pipelines and cancellation
➡ Generator Tricks For System Programmers
➡ Building a Resilient Stream Processor in Go
➡ Benthos
➡ Automi
➡ Bigslice