Ingest, Store, Process, Serve – Building data-driven applications and ML pipelines with Golang - Felix Raab - KI Labs

GoDays
January 22, 2020

As more and more businesses need to process data from diverse sources to support decision making, designing data-driven applications has become increasingly important. This talk gives an overview of how to build data flow solutions using Golang. The first part covers the conceptual building blocks of a lean data architecture and highlights basic techniques for processing data. We'll then show the evolution of a data pipeline: starting with connecting simple CLI tools, moving on to generators with goroutines and channels, and closing with more advanced examples that leverage stream processing tooling.

Transcript

  1. Building data-driven applications with Go

    Felix Raab, Head of Engineering @ KI labs
    https://www.linkedin.com/in/dr-felix-r-698a19154/
    https://medium.com/@fe9lix
  2. The problem with data: Data Never Sleeps (7.0)

    "By 2020, there will be 40x more bytes of data than there are stars in the observable universe." [1]

    [1] Data Never Sleeps 7.0
  3. The big data ecosystem

    FAANG companies have released their tools as OSS [2], but: you are most likely not processing FAANG-scale data!

    "No, Kafka doesn't cut it any more, I'm now looking at Pulsar..."

    [2] Data & AI Landscape 2019
  4. This talk...

    shows you how to solve 80% [3] of your data problems with simpler tooling (no Hadoop, no Kubernetes, ...).

    [3] Wild Pareto-style guess
  5. What are (data) pipelines?

    Data pipeline: "the process that takes input data through a series of transformation stages, producing data as output." [4]

    Special type, ML pipeline: "the process that takes data and code as input, and produces a trained ML model as the output." [4]

    [4] Continuous Delivery for Machine Learning
  6. Data pipeline building blocks [5]

    [5] Adapted from "Foundations For Architecting Data Solutions", Seidman and Malaska
  7. Typical data flow

    Raw Data | Queue | (Stream Processor || HDFS|S3) | (Queue || Scheduled ETL) | Target DB
  8. World's simplest data pipeline? [6]

    awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -r -n | head -n 5

    More than a poor man's solution:
    • Unix philosophy: Compose small programs that do one thing well
    • sort handles data larger than memory by spilling to disk and runs in parallel across CPU cores

    [6] Designing Data-Intensive Applications, Martin Kleppmann
  9. On the other hand...

    If only we had a nice programming language...
  10. Advanced Taco Bell programming [7] using Go std. lib and a single binary

    [7] Taco Bell Programming
    [8] Original creator of "performance gopher" unknown
  11. Memory Challenge

    OOM = working set > available memory
    otherwise: disk I/O is at least an order of magnitude slower than RAM.
  12. Instead of adding RAM or spinning up an expensive big data cluster [9]:

    #1 Compress
    e.g., store strings as booleans (in memory!)

    [9] When your data doesn’t fit in memory: the basic techniques
  13. Instead of adding RAM or spinning up an expensive big data cluster [10]:

    #2 Chunk
    e.g., load data into memory in chunks and process chunks in parallel

    [10] When your data doesn’t fit in memory: the basic techniques
    [11] MapReduce explained in 41 words
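A minimal sketch of chunked, parallel processing might look like the following. The function `sumChunks` is hypothetical; for brevity the data already sits in a slice, whereas a real pipeline would read each chunk from a file or database instead.

```go
package main

import (
	"fmt"
	"sync"
)

// sumChunks splits nums into chunks of at most chunkSize elements,
// sums every chunk in its own goroutine ("map") and then adds up
// the partial results ("reduce").
func sumChunks(nums []int, chunkSize int) int {
	if chunkSize <= 0 {
		chunkSize = 1
	}
	numChunks := (len(nums) + chunkSize - 1) / chunkSize
	partial := make([]int, numChunks) // one slot per chunk, so no locking is needed

	var wg sync.WaitGroup
	for c := 0; c < numChunks; c++ {
		start := c * chunkSize
		end := start + chunkSize
		if end > len(nums) {
			end = len(nums)
		}
		wg.Add(1)
		go func(idx int, chunk []int) {
			defer wg.Done()
			for _, n := range chunk {
				partial[idx] += n
			}
		}(c, nums[start:end])
	}
	wg.Wait()

	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	nums := make([]int, 1000)
	for i := range nums {
		nums[i] = i + 1
	}
	fmt.Println(sumChunks(nums, 100)) // prints 500500, the sum of 1..1000
}
```

Each goroutine writes only to its own slot in `partial`, which keeps the example free of data races without a mutex.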
  14. Instead of adding RAM or spinning up an expensive big data cluster [12]:

    #3 Index
    e.g., only load subset of data, using an index ("summary")

    [12] When your data doesn’t fit in memory: the basic techniques
  15. Now, Go: Data Flow Primitives!
  16. #1 Generator

    A generator is a function that returns a sequence of values through a channel ("producer-only module"):

        type Generator func() <-chan int

    Use cases: A generator could generate a number stream, load a file, read from a database, scrape the web, etc.
  17. #2 Processor

    A processor is a function that takes a channel and returns a sequence of values through a channel:

        type Processor func(<-chan int) <-chan int

    Use cases: A processor could be used for number crunching, data aggregation, deduplication, filtering, validation, etc.
  18. #3 Consumer

    A consumer is a function that takes a channel:

        type Consumer func(<-chan int)

    Use cases: Print sequence, save data, etc.
  19. Returning function types...

    ...allows us to customize our generators, processors, and consumers:

        func customGenerator(params ...int) Generator {
            return func() <-chan int {
                // use params...
            }
        }
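As one possible instance of such a customized generator, the sketch below captures two parameters in a closure. The name `rangeGenerator` and its half-open range semantics are assumptions made for this example; the `Generator` type is the one defined on slide 16.

```go
package main

import "fmt"

// Generator matches the type from the slides.
type Generator func() <-chan int

// rangeGenerator returns a Generator that emits start, start+1, ..., end-1.
// The parameters are captured by the returned closure.
func rangeGenerator(start, end int) Generator {
	return func() <-chan int {
		out := make(chan int)
		go func() {
			defer close(out) // signal downstream stages that no more values follow
			for i := start; i < end; i++ {
				out <- i
			}
		}()
		return out
	}
}

func main() {
	for n := range rangeGenerator(1, 4)() {
		fmt.Println(n) // prints 1, 2 and 3 on separate lines
	}
}
```

Configuration happens once when `rangeGenerator` is called; the returned function can then be passed around like any other `Generator`.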
  20. Our first Go pipeline: Producer – Processor – Consumer

        func numberGenerator(nums ...int) Generator {
            return func() <-chan int {
                out := make(chan int)
                go func() {
                    defer close(out)
                    for _, i := range nums {
                        out <- i
                    }
                }()
                return out
            }
        }
  21. Our first Go pipeline: Producer – Processor – Consumer

        func squareProcessor() Processor {
            return func(in <-chan int) <-chan int {
                out := make(chan int)
                go func() {
                    defer close(out)
                    for i := range in {
                        out <- i * i
                    }
                }()
                return out
            }
        }
  22. Our first Go pipeline: Producer – Processor – Consumer

        func printConsumer() Consumer {
            return func(in <-chan int) {
                for {
                    i, ok := <-in
                    if ok {
                        fmt.Println(i)
                    } else {
                        return
                    }
                }
            }
        }
  23. Our first Go pipeline: Run

        func main() {
            printConsumer()(
                squareProcessor()(
                    numberGenerator(1, 2, 3, 4)(),
                ))
        }
  24. Data flow patterns: Fan-in, fan-out

    Remember chunking/indexing?
  25. Data flow patterns: Fan-in, fan-out (Code)

        func main() {
            logConsumer()(
                // Fan-in
                logAggregator(100)(
                    // Fan-out
                    logProcessor()(
                        logGenerator("2019-01.log")()),
                    logProcessor()(
                        logGenerator("2019-02.log")()),
                ),
            )
        }
  26. Data flow patterns: Multiplexing

    The aggregator takes a variable number of channels and returns a (buffered) output channel.

    Multiplexing: Range over all input channels and use a sync.WaitGroup to close the output channel once all inner goroutines writing to it are done.
  27. Data flow patterns: Multiplexing (Code)

        var wg sync.WaitGroup
        for _, in := range ins {
            wg.Add(1)
            go func(in <-chan string) {
                defer wg.Done()
                for i := range in {
                    out <- i
                }
            }(in)
        }
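The fragment above omits the creation and closing of the output channel. A self-contained sketch of the full pattern could look like this; the function names `merge` and `emit` and the `bufSize` parameter are assumptions for illustration, not the talk's `logAggregator`.

```go
package main

import (
	"fmt"
	"sync"
)

// merge fans any number of input channels into one buffered output
// channel and closes it once every input has been drained.
func merge(bufSize int, ins ...<-chan string) <-chan string {
	out := make(chan string, bufSize)
	var wg sync.WaitGroup
	for _, in := range ins {
		wg.Add(1)
		go func(in <-chan string) {
			defer wg.Done()
			for s := range in {
				out <- s
			}
		}(in)
	}
	// Close out in a separate goroutine so merge returns immediately.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// emit is a small helper that sends the given lines and then closes its channel.
func emit(lines ...string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, l := range lines {
			out <- l
		}
	}()
	return out
}

func main() {
	count := 0
	for range merge(100, emit("a", "b"), emit("c")) {
		count++
	}
	fmt.Println(count) // prints 3
}
```

The order in which values from different inputs arrive on the merged channel is not deterministic; only the order within a single input is preserved.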
  28. Data flow patterns: Closing & Unblocking

    General guidelines [13]:
    • Pipeline stages close their out-channels when send operations are done (i.e., goroutines exit once values have been sent downstream).
    • Pipeline stages receive their values until in-channels are closed or the senders are unblocked.

    Q: How to unblock?
    A: 1) Use buffered channels, 2) Explicitly cancel via done channels, passed to all pipeline stages.

    [13] Go Concurrency Patterns: Pipelines and cancellation
  29. Data flow patterns: Cancellation Example

    In the consumer: defer close(done), or close it explicitly (e.g., on error).

        func squareProcessor() Processor {
            return func(done <-chan struct{}, in <-chan int) <-chan int {
                // ...
                defer close(out)
                for n := range in {
                    // proceeds either after successful send on out or received value from done
                    // return without draining in, upstream will also stop sending after done broadcast
                    select {
                    case out <- n * n:
                    case <-done:
                        return
                    }
                }
                // ...
            }
        }
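Put together, cancellation via a done channel might look like the following runnable sketch. The stage functions `numbers` and `square` and the helper `firstSquares` are hypothetical; they take `done` as an explicit first argument, which means they do not match the `Processor` type from slide 17 as written.

```go
package main

import "fmt"

// numbers emits 1, 2, 3, ... until done is closed.
func numbers(done <-chan struct{}) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 1; ; i++ {
			select {
			case out <- i:
			case <-done:
				return
			}
		}
	}()
	return out
}

// square squares incoming values and stops early when done is closed.
func square(done <-chan struct{}, in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for n := range in {
			select {
			case out <- n * n:
			case <-done:
				return
			}
		}
	}()
	return out
}

// firstSquares consumes n values from an endless pipeline and then
// cancels all stages by closing done.
func firstSquares(n int) []int {
	done := make(chan struct{})
	defer close(done) // broadcast: every stage selecting on done returns

	out := square(done, numbers(done))
	var result []int
	for len(result) < n {
		result = append(result, <-out)
	}
	return result
}

func main() {
	fmt.Println(firstSquares(3)) // prints [1 4 9]
}
```

Closing `done` works as a broadcast because a receive on a closed channel never blocks, so every stage that selects on it is released at once.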
  30. Other data flow patterns: Streaming

    "Poor man's streaming":

        func randNumStream(max int) Generator {
            return func() <-chan int {
                out := make(chan int, 10)
                rand.Seed(time.Now().UnixNano())
                go func() {
                    for {
                        out <- rand.Intn(max)
                        time.Sleep(10 * time.Millisecond)
                    }
                }()
                return out
            }
        }
  31. Other data flow patterns: Rate Limiting [14]

        throttle := make(chan time.Time, 100) // buffer = burstLimit
        go func() {
            tick := time.NewTicker(rate)
            defer tick.Stop()
            for t := range tick.C {
                select {
                case throttle <- t:
                default:
                }
            }
        }()
        for i := range in {
            <-throttle
            out <- i
        }

    [14] Rate Limiting
  32. More advanced data processing

    Three OSS projects to watch:
    • Benthos: Stream processor, sources and sinks concept (actions, transformations, filters)
    • Automi: Stream processing API (emitter – operation – collector), abstraction on top of channels, early stages
    • Bigslice: Ad-hoc cluster processing framework (not strictly streaming, distributed computation across multiple nodes), similar to Spark but "serverless"
  33. Of Sources and Sinks...

    ...producers/consumers, inputs/outputs, ...
  34. Benthos [15]

    • Directly links inputs to outputs with acknowledgements (optional buffering)
    • Supports resilient and customizable processing logic (incl. conveniences such as JMESPath queries)
    • Is well documented, configuration-driven, used in production

    [15] Benthos
  35. Benthos Input Handling [16]

    [16] Building a Resilient Stream Processor in Go
  36. Benthos Custom Processing [17]

    [17] Building a Resilient Stream Processor in Go
  37. Benthos Config [18]

    [18] Building a Resilient Stream Processor in Go
  38. Final data recommendations...

    • Use Go (obviously...), try data flow programming
    • Stay lean, check if you really need typical big data tooling: be aware of both infrastructure complexities and runtime overheads (splitting data sets, serializing, sending data over the network for distributed computations)
    • (and also: often, just use Postgres; learn SQL)
  39. Recap

    • We looked at basic building blocks of a data pipeline and some of the challenges when processing large data sets
    • We covered some general techniques to work with data in memory
    • We looked at how to construct simple data flow systems using goroutines and channels
    • And finally we looked at more advanced open-source tooling to set up flexible and scalable pipelines
  40. Thank you! Q&A?

  41. Links

    ➡ Designing Data-Intensive Applications, Martin Kleppmann
    ➡ When your data doesn’t fit in memory: the basic techniques
    ➡ Channel Use Cases – Data Flow Manipulations
    ➡ Go Concurrency Patterns: Pipelines and cancellation
    ➡ Generator Tricks For System Programmers
    ➡ Building a Resilient Stream Processor in Go
    ➡ Benthos
    ➡ Automi
    ➡ Bigslice