
Ingest, Store, Process, Serve – Building data-driven applications and ML pipelines with Golang - Felix Raab - KI Labs

GoDays
January 22, 2020


As more and more businesses need to process data from diverse sources to support decision making, designing data-driven applications has become increasingly important. This talk gives an overview of how to build data flow solutions using Golang. The first part covers the conceptual building blocks of a lean data architecture and highlights basic techniques for processing data. We then show the evolution of a data pipeline: starting by connecting simple CLI tools, moving on to generators with goroutines and channels, and closing with more advanced examples that leverage stream processing tooling.


Transcript

1. Building data-driven applications with Go. Felix Raab, Head of Engineering @ KI labs.
   https://www.linkedin.com/in/dr-felix-r-698a19154/
   https://medium.com/@fe9lix

2. The problem with data: Data Never Sleeps (7.0). "By 2020, there will be 40x more bytes of data than there are stars in the observable universe." [1]
   [1] Data Never Sleeps 7.0

3. The big data ecosystem: FAANG companies have released their tools as OSS [2]. But: you are most likely not processing FAANG-scale data! "No, Kafka doesn't cut it anymore, I'm now looking at Pulsar..."
   [2] Data & AI Landscape 2019

4. This talk shows you how to solve 80% [3] of your data problems with simpler tooling (no Hadoop, no Kubernetes, ...).
   [3] Wild Pareto-style guess

5. What are (data) pipelines? A data pipeline is "the process that takes input data through a series of transformation stages, producing data as output." [4] A special type is the ML pipeline: "the process that takes data and code as input, and produces a trained ML model as the output." [4]
   [4] Continuous Delivery for Machine Learning

6. Data pipeline building blocks [5]
   [5] Adapted from "Foundations for Architecting Data Solutions", Seidman and Malaska

7. Typical data flow: Raw Data → Queue → (Stream Processor or HDFS/S3) → (Queue or Scheduled ETL) → Target DB

8. World's simplest data pipeline? [6]

       awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -r -n | head -n 5

   More than a poor man's solution:
   • Unix philosophy: compose small programs that do one thing well
   • Handles large data by spilling to disk and runs in parallel across CPU cores
   [6] Designing Data-Intensive Applications, Martin Kleppmann

9. On the other hand... if only we had a nice programming language...

10. Advanced Taco Bell programming [7]: using the Go standard library and a single binary (image credit: the original creator of the "performance gopher" is unknown)
    [7] Taco Bell Programming

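    The slide itself is an image; as a hedged sketch, a single-binary Go version of the earlier shell pipeline (same log path and field position as the awk example, standard library only) might look like this:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "sort"
        "strings"
    )

    func main() {
        f, err := os.Open("/var/log/nginx/access.log")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Count occurrences of the 7th whitespace-separated field
        // (the request path), like awk '{print $7}' | sort | uniq -c.
        counts := make(map[string]int)
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            fields := strings.Fields(scanner.Text())
            if len(fields) >= 7 {
                counts[fields[6]]++
            }
        }
        if err := scanner.Err(); err != nil {
            panic(err)
        }

        // Sort paths by descending count and print the top 5,
        // like sort -r -n | head -n 5.
        paths := make([]string, 0, len(counts))
        for p := range counts {
            paths = append(paths, p)
        }
        sort.Slice(paths, func(i, j int) bool { return counts[paths[i]] > counts[paths[j]] })
        for i := 0; i < 5 && i < len(paths); i++ {
            fmt.Println(counts[paths[i]], paths[i])
        }
    }
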
11. The memory challenge: OOM = working set > available memory; and falling back to disk is no remedy either, since disk I/O is orders of magnitude slower than RAM.

12. Instead of adding RAM or spinning up an expensive big data cluster [9], technique #1: Compress. E.g., store boolean-like strings as actual booleans (in memory!); see the sketch below.
    [9] When your data doesn't fit in memory: the basic techniques

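    The slide only names the technique, so here is a minimal sketch of one way to apply it (the function name and input shape are illustrative, not from the talk): parsing a column of "true"/"false" strings into a []bool replaces a 16-byte string header plus backing data per value with a single byte per value.

    import (
        "fmt"
        "strconv"
    )

    // compressFlags stores a column of boolean-like strings ("true"/"false")
    // as actual booleans: one byte per value instead of a string header
    // plus string data per value.
    func compressFlags(raw []string) ([]bool, error) {
        flags := make([]bool, len(raw))
        for i, s := range raw {
            b, err := strconv.ParseBool(s)
            if err != nil {
                return nil, fmt.Errorf("row %d: %w", i, err)
            }
            flags[i] = b
        }
        return flags, nil
    }
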
13. Instead of adding RAM or spinning up an expensive big data cluster [10], technique #2: Chunk. E.g., load data into memory in chunks and process the chunks in parallel (see also: "MapReduce explained in 41 words", and the sketch below).
    [10] When your data doesn't fit in memory: the basic techniques

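    Again a minimal sketch, not from the talk: read a line-oriented file in fixed-size chunks and hand each chunk to a worker goroutine (the file name, chunk size, and the per-chunk work are placeholders).

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "sync"
    )

    const chunkSize = 10000 // lines per chunk; tune so a chunk fits in memory

    func processChunk(lines []string, results chan<- int) {
        // Stand-in for real work: count bytes in the chunk.
        n := 0
        for _, l := range lines {
            n += len(l)
        }
        results <- n
    }

    func main() {
        f, err := os.Open("data.txt") // hypothetical input file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        results := make(chan int)
        var wg sync.WaitGroup

        scanner := bufio.NewScanner(f)
        chunk := make([]string, 0, chunkSize)
        dispatch := func() {
            wg.Add(1)
            c := chunk // snapshot the current chunk for the goroutine
            go func() {
                defer wg.Done()
                processChunk(c, results)
            }()
            chunk = make([]string, 0, chunkSize)
        }
        for scanner.Scan() {
            chunk = append(chunk, scanner.Text())
            if len(chunk) == chunkSize {
                dispatch()
            }
        }
        if len(chunk) > 0 {
            dispatch() // flush the final partial chunk
        }

        // Close results once all workers are done, then aggregate.
        go func() {
            wg.Wait()
            close(results)
        }()
        total := 0
        for n := range results {
            total += n
        }
        fmt.Println("total bytes:", total)
    }
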
14. Instead of adding RAM or spinning up an expensive big data cluster [12], technique #3: Index. E.g., only load a subset of the data, using an index ("summary"); see the sketch below.
    [12] When your data doesn't fit in memory: the basic techniques

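    A minimal sketch of the idea, assuming newline-delimited records keyed by their first CSV field (file name and key are made up): one pass builds a key-to-byte-offset index, after which individual records can be loaded with a seek instead of reading the whole file.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // buildIndex scans the file once and records the byte offset of each
    // record, keyed by its first CSV field. Only the keys stay in memory,
    // not the records themselves. Assumes Unix (\n) line endings.
    func buildIndex(f *os.File) (map[string]int64, error) {
        index := make(map[string]int64)
        var offset int64
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            line := scanner.Text()
            key := strings.SplitN(line, ",", 2)[0]
            index[key] = offset
            offset += int64(len(line)) + 1 // +1 for the newline
        }
        return index, scanner.Err()
    }

    func main() {
        f, err := os.Open("records.csv") // hypothetical input file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        index, err := buildIndex(f)
        if err != nil {
            panic(err)
        }

        // Load only the record for one key instead of the whole file.
        if off, ok := index["user42"]; ok {
            if _, err := f.Seek(off, 0); err != nil {
                panic(err)
            }
            line, _ := bufio.NewReader(f).ReadString('\n')
            fmt.Print(line)
        }
    }
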
15. #1 Generator: a generator is a function that returns a sequence of values through a channel ("producer-only module"):

        type Generator func() <-chan int

    Use cases: a generator could produce a number stream, load a file, read from a database, scrape the web, etc. (see the file-loading sketch below).

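    For the "load a file" use case, a sketch of a line-emitting generator. Since Generator is int-valued, this assumes a hypothetical string-valued variant (the later log examples imply one exists):

    import (
        "bufio"
        "os"
    )

    // Hypothetical string-valued counterpart to Generator.
    type StringGenerator func() <-chan string

    // lineGenerator returns a generator that emits a file's lines one by one.
    func lineGenerator(path string) StringGenerator {
        return func() <-chan string {
            out := make(chan string)
            go func() {
                defer close(out)
                f, err := os.Open(path)
                if err != nil {
                    return // a real pipeline would surface the error
                }
                defer f.Close()
                scanner := bufio.NewScanner(f)
                for scanner.Scan() {
                    out <- scanner.Text()
                }
            }()
            return out
        }
    }
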
16. #2 Processor: a processor is a function that takes a channel and returns a sequence of values through a channel:

        type Processor func(<-chan int) <-chan int

    Use cases: number crunching, data aggregation, deduplication, filtering, validation, etc.

17. #3 Consumer: a consumer is a function that takes a channel:

        type Consumer func(<-chan int)

    Use cases: print the sequence, save the data, etc.

18. Returning function types allows us to customize our generators, processors, and consumers:

        func customGenerator(params ...int) Generator {
            return func() <-chan int {
                // use params...
            }
        }

19. Our first Go pipeline: Producer – Processor – Consumer

        func numberGenerator(nums ...int) Generator {
            return func() <-chan int {
                out := make(chan int)
                go func() {
                    defer close(out)
                    for _, i := range nums {
                        out <- i
                    }
                }()
                return out
            }
        }

20. Our first Go pipeline: Producer – Processor – Consumer

        func squareProcessor() Processor {
            return func(in <-chan int) <-chan int {
                out := make(chan int)
                go func() {
                    defer close(out)
                    for i := range in {
                        out <- i * i
                    }
                }()
                return out
            }
        }

21. Our first Go pipeline: Producer – Processor – Consumer

        func printConsumer() Consumer {
            return func(in <-chan int) {
                for {
                    i, ok := <-in
                    if ok {
                        fmt.Println(i)
                    } else {
                        return
                    }
                }
            }
        }

22. Our first Go pipeline: Run

        func main() {
            printConsumer()(
                squareProcessor()(
                    numberGenerator(1, 2, 3, 4)(),
                ),
            )
        }

    This prints 1, 4, 9, 16.

23. Data flow patterns: Fan-in, fan-out (code)

        func main() {
            logConsumer()(
                // Fan-in
                logAggregator(100)(
                    // Fan-out
                    logProcessor()(logGenerator("2019-01.log")()),
                    logProcessor()(logGenerator("2019-02.log")()),
                ),
            )
        }

24. Data flow patterns: Multiplexing. The aggregator takes a variable number of channels and returns a (buffered) output channel. Multiplexing: range over all input channels and use a sync.WaitGroup to close the output channel once all inner goroutines writing to it are done.

25. Data flow patterns: Multiplexing (code)

        var wg sync.WaitGroup
        for _, in := range ins {
            wg.Add(1)
            go func(in <-chan string) {
                defer wg.Done()
                for i := range in {
                    out <- i
                }
            }(in)
        }
        // Close out once all forwarding goroutines are done (see previous slide).
        go func() {
            wg.Wait()
            close(out)
        }()

26. Data flow patterns: Closing & unblocking. General guidelines [13]:
    • Pipeline stages close their out-channels when all send operations are done (i.e., goroutines exit once their values have been sent downstream).
    • Pipeline stages keep receiving values until their in-channels are closed or the senders are unblocked.
    Q: How to unblock? A: 1) use buffered channels, or 2) cancel explicitly via a done channel passed to all pipeline stages.
    [13] Go Concurrency Patterns: Pipelines and cancellation

27. Data flow patterns: Cancellation example. In the consumer: defer close(done), or close it explicitly (e.g., on error). Note that the processor's inner function now also receives the done channel:

        func squareProcessor() Processor {
            return func(done <-chan struct{}, in <-chan int) <-chan int {
                // ...
                defer close(out)
                for n := range in {
                    // Proceeds either after a successful send on out or after
                    // receiving a value from done. Returns without draining in;
                    // upstream also stops sending after the done broadcast.
                    select {
                    case out <- n * n:
                    case <-done:
                        return
                    }
                }
                // ...
            }
        }

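    A sketch of the consumer side driving the cancellation, assuming the Consumer type is extended analogously to accept the done channel (the early-exit condition is illustrative):

    import "fmt"

    // Cancellation variant of the earlier Consumer type: the consumer owns
    // the done channel; closing it broadcasts cancellation to every stage
    // that selects on it.
    type Consumer func(done chan struct{}, in <-chan int)

    func printConsumer() Consumer {
        return func(done chan struct{}, in <-chan int) {
            defer close(done) // runs on every return path, unblocking upstream
            for i := range in {
                if i > 100 { // illustrative early-exit condition
                    return
                }
                fmt.Println(i)
            }
        }
    }
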
28. Other data flow patterns: Streaming. "Poor man's streaming":

        func randNumStream(max int) Generator {
            return func() <-chan int {
                out := make(chan int, 10)
                rand.Seed(time.Now().UnixNano())
                go func() {
                    for {
                        out <- rand.Intn(max)
                        time.Sleep(10 * time.Millisecond)
                    }
                }()
                return out
            }
        }

29. Other data flow patterns: Rate limiting [14]

        throttle := make(chan time.Time, 100) // buffer = burstLimit
        go func() {
            tick := time.NewTicker(rate)
            defer tick.Stop()
            for t := range tick.C {
                select {
                case throttle <- t:
                default:
                }
            }
        }()
        for i := range in {
            <-throttle
            out <- i
        }

    [14] Rate Limiting

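    To connect this to the earlier pipeline abstractions, a sketch wrapping the snippet into the deck's Processor shape (assuming the int-valued Processor type from slide 16; the ticker goroutine is left running for brevity):

    import "time"

    // rateLimitProcessor forwards values from in to out at most once per
    // rate interval, allowing bursts of up to burstLimit values.
    func rateLimitProcessor(rate time.Duration, burstLimit int) Processor {
        return func(in <-chan int) <-chan int {
            out := make(chan int)
            throttle := make(chan time.Time, burstLimit)
            go func() {
                tick := time.NewTicker(rate)
                defer tick.Stop()
                for t := range tick.C {
                    select {
                    case throttle <- t:
                    default: // bucket full, drop the tick
                    }
                }
            }()
            go func() {
                defer close(out)
                for i := range in {
                    <-throttle // wait for a token
                    out <- i
                }
            }()
            return out
        }
    }
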
30. More advanced data processing. Three OSS projects to watch:
    • Benthos: stream processor built around a sources-and-sinks concept (actions, transformations, filters)
    • Automi: stream processing API (emitter – operation – collector), an abstraction on top of channels, still in its early stages
    • Bigslice: ad-hoc cluster processing framework (not strictly streaming; distributed computation across multiple nodes), similar to Spark but "serverless"

31. Benthos [15]
    • Directly links inputs to outputs with acknowledgements (optional buffering)
    • Supports resilient and customizable processing logic (incl. conveniences such as JMESPath queries)
    • Is well documented, configuration-driven, and used in production
    [15] Benthos

32. Benthos Input Handling [16]
    [16] Building a Resilient Stream Processor in Go

33. Benthos Custom Processing [17]
    [17] Building a Resilient Stream Processor in Go

34. Benthos Config [18]
    [18] Building a Resilient Stream Processor in Go

35. Final data recommendations:
    • Use Go (obviously...) and try data flow programming
    • Stay lean; check whether you really need typical big data tooling and be aware of both infrastructure complexity and runtime overheads (splitting data sets, serializing, and sending data over the network for distributed computation)
    • (Also: often you can just use Postgres; learn SQL)

36. Recap:
    • We looked at the basic building blocks of a data pipeline and some of the challenges of processing large data sets
    • We covered general techniques for working with data in memory
    • We looked at how to construct simple data flow systems using goroutines and channels
    • And finally we looked at more advanced open-source tooling for setting up flexible and scalable pipelines

37. Links
    ➡ Designing Data-Intensive Applications, Martin Kleppmann
    ➡ When your data doesn't fit in memory: the basic techniques
    ➡ Channel Use Cases – Data Flow Manipulations
    ➡ Go Concurrency Patterns: Pipelines and cancellation
    ➡ Generator Tricks for Systems Programmers
    ➡ Building a Resilient Stream Processor in Go
    ➡ Benthos
    ➡ Automi
    ➡ Bigslice