
Ingest, Store, Process, Serve – Building data-driven applications and ML pipelines with Golang - Felix Raab - KI Labs

GoDays
January 22, 2020


As more and more businesses need to process data from diverse sources to support decision making, designing data-driven applications has become increasingly important. This talk gives an overview of how to build data flow solutions using Golang. The first part covers the conceptual building blocks of a lean data architecture and highlights basic techniques for processing data. We then show the evolution of a data pipeline: starting by connecting simple CLI tools, moving on to generators with goroutines and channels, and closing with more advanced examples that leverage stream processing tooling.


Transcript

1. Building data-driven applications with Go. Felix Raab, Head of Engineering @ KI labs.
   https://www.linkedin.com/in/dr-felix-r-698a19154/
   https://medium.com/@fe9lix

2. The problem with data: Data Never Sleeps (7.0). "By 2020, there will be 40x more bytes of data than there are stars in the observable universe." [1]
   [1] Data Never Sleeps 7.0

3. The big data ecosystem: FAANG companies have released their tools as OSS [2]. But: you are most likely not processing FAANG-scale data! "No, Kafka doesn't cut it anymore, I'm now looking at Pulsar..."
   [2] Data & AI Landscape 2019

4. This talk shows you how to solve 80% [3] of your data problems with simpler tooling (no Hadoop, no Kubernetes, ...).
   [3] Wild Pareto-style guess

5. What are (data) pipelines? A data pipeline is "the process that takes input data through a series of transformation stages, producing data as output." [4] A special type is the ML pipeline: "the process that takes data and code as input, and produces a trained ML model as the output." [4]
   [4] Continuous Delivery for Machine Learning

6. Data pipeline building blocks [5]
   [5] Adapted from "Foundations for Architecting Data Solutions", Seidman and Malaska

7. Typical data flow: Raw Data → Queue → (Stream Processor or HDFS/S3) → (Queue or Scheduled ETL) → Target DB

8. World's simplest data pipeline? [6]

       awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -r -n | head -n 5

   More than a poor man's solution:
   • Unix philosophy: compose small programs that do one thing well
   • Handles large data by spilling to disk and runs in parallel across CPU cores
   [6] Designing Data-Intensive Applications, Martin Kleppmann

9. On the other hand... if only we had a nice programming language...

10. Advanced Taco Bell programming [7]: using the Go standard library and a single binary (image credit: the original creator of the "performance gopher" is unknown)
    [7] Taco Bell Programming

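    The slide itself is an image; as a hedged sketch, a single-binary Go version of the earlier shell pipeline (same log path and field position as the awk example, standard library only) might look like this:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "sort"
        "strings"
    )

    func main() {
        f, err := os.Open("/var/log/nginx/access.log")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Count occurrences of the 7th whitespace-separated field
        // (the request path), like awk '{print $7}' | sort | uniq -c.
        counts := make(map[string]int)
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            fields := strings.Fields(scanner.Text())
            if len(fields) >= 7 {
                counts[fields[6]]++
            }
        }
        if err := scanner.Err(); err != nil {
            panic(err)
        }

        // Sort paths by descending count and print the top 5,
        // like sort -r -n | head -n 5.
        paths := make([]string, 0, len(counts))
        for p := range counts {
            paths = append(paths, p)
        }
        sort.Slice(paths, func(i, j int) bool { return counts[paths[i]] > counts[paths[j]] })
        for i := 0; i < 5 && i < len(paths); i++ {
            fmt.Println(counts[paths[i]], paths[i])
        }
    }
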
11. The memory challenge: OOM = working set > available memory; and falling back to disk is no remedy either, since disk I/O is orders of magnitude slower than RAM.

12. Instead of adding RAM or spinning up an expensive big data cluster [9], technique #1: Compress. E.g., store boolean-like strings as actual booleans (in memory!); see the sketch below.
    [9] When your data doesn't fit in memory: the basic techniques

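    The slide only names the technique, so here is a minimal sketch of one way to apply it (the function name and input shape are illustrative, not from the talk): parsing a column of "true"/"false" strings into a []bool replaces a 16-byte string header plus backing data per value with a single byte per value.

    import (
        "fmt"
        "strconv"
    )

    // compressFlags stores a column of boolean-like strings ("true"/"false")
    // as actual booleans: one byte per value instead of a string header
    // plus string data per value.
    func compressFlags(raw []string) ([]bool, error) {
        flags := make([]bool, len(raw))
        for i, s := range raw {
            b, err := strconv.ParseBool(s)
            if err != nil {
                return nil, fmt.Errorf("row %d: %w", i, err)
            }
            flags[i] = b
        }
        return flags, nil
    }
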
13. Instead of adding RAM or spinning up an expensive big data cluster [10], technique #2: Chunk. E.g., load data into memory in chunks and process the chunks in parallel (see also: "MapReduce explained in 41 words", and the sketch below).
    [10] When your data doesn't fit in memory: the basic techniques

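    Again a minimal sketch, not from the talk: read a line-oriented file in fixed-size chunks and hand each chunk to a worker goroutine (the file name, chunk size, and the per-chunk work are placeholders).

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "sync"
    )

    const chunkSize = 10000 // lines per chunk; tune so a chunk fits in memory

    func processChunk(lines []string, results chan<- int) {
        // Stand-in for real work: count bytes in the chunk.
        n := 0
        for _, l := range lines {
            n += len(l)
        }
        results <- n
    }

    func main() {
        f, err := os.Open("data.txt") // hypothetical input file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        results := make(chan int)
        var wg sync.WaitGroup

        scanner := bufio.NewScanner(f)
        chunk := make([]string, 0, chunkSize)
        dispatch := func() {
            wg.Add(1)
            c := chunk // snapshot the current chunk for the goroutine
            go func() {
                defer wg.Done()
                processChunk(c, results)
            }()
            chunk = make([]string, 0, chunkSize)
        }
        for scanner.Scan() {
            chunk = append(chunk, scanner.Text())
            if len(chunk) == chunkSize {
                dispatch()
            }
        }
        if len(chunk) > 0 {
            dispatch() // flush the final partial chunk
        }

        // Close results once all workers are done, then aggregate.
        go func() {
            wg.Wait()
            close(results)
        }()
        total := 0
        for n := range results {
            total += n
        }
        fmt.Println("total bytes:", total)
    }
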
14. Instead of adding RAM or spinning up an expensive big data cluster [12], technique #3: Index. E.g., only load a subset of the data, using an index ("summary"); see the sketch below.
    [12] When your data doesn't fit in memory: the basic techniques

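    A minimal sketch of the idea, assuming newline-delimited records keyed by their first CSV field (file name and key are made up): one pass builds a key-to-byte-offset index, after which individual records can be loaded with a seek instead of reading the whole file.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // buildIndex scans the file once and records the byte offset of each
    // record, keyed by its first CSV field. Only the keys stay in memory,
    // not the records themselves. Assumes Unix (\n) line endings.
    func buildIndex(f *os.File) (map[string]int64, error) {
        index := make(map[string]int64)
        var offset int64
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            line := scanner.Text()
            key := strings.SplitN(line, ",", 2)[0]
            index[key] = offset
            offset += int64(len(line)) + 1 // +1 for the newline
        }
        return index, scanner.Err()
    }

    func main() {
        f, err := os.Open("records.csv") // hypothetical input file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        index, err := buildIndex(f)
        if err != nil {
            panic(err)
        }

        // Load only the record for one key instead of the whole file.
        if off, ok := index["user42"]; ok {
            if _, err := f.Seek(off, 0); err != nil {
                panic(err)
            }
            line, _ := bufio.NewReader(f).ReadString('\n')
            fmt.Print(line)
        }
    }
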
15. #1 Generator: a generator is a function that returns a sequence of values through a channel ("producer-only module"):

        type Generator func() <-chan int

    Use cases: a generator could produce a number stream, load a file, read from a database, scrape the web, etc. (see the file-loading sketch below).

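    For the "load a file" use case, a sketch of a line-emitting generator. Since Generator is int-valued, this assumes a hypothetical string-valued variant (the later log examples imply one exists):

    import (
        "bufio"
        "os"
    )

    // Hypothetical string-valued counterpart to Generator.
    type StringGenerator func() <-chan string

    // lineGenerator returns a generator that emits a file's lines one by one.
    func lineGenerator(path string) StringGenerator {
        return func() <-chan string {
            out := make(chan string)
            go func() {
                defer close(out)
                f, err := os.Open(path)
                if err != nil {
                    return // a real pipeline would surface the error
                }
                defer f.Close()
                scanner := bufio.NewScanner(f)
                for scanner.Scan() {
                    out <- scanner.Text()
                }
            }()
            return out
        }
    }
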
16. #2 Processor: a processor is a function that takes a channel and returns a sequence of values through a channel:

        type Processor func(<-chan int) <-chan int

    Use cases: number crunching, data aggregation, deduplication, filtering, validation, etc.

17. #3 Consumer: a consumer is a function that takes a channel:

        type Consumer func(<-chan int)

    Use cases: print the sequence, save the data, etc.

18. Returning function types allows us to customize our generators, processors, and consumers:

        func customGenerator(params ...int) Generator {
            return func() <-chan int {
                // use params...
            }
        }

19. Our first Go pipeline: Producer – Processor – Consumer

        func numberGenerator(nums ...int) Generator {
            return func() <-chan int {
                out := make(chan int)
                go func() {
                    defer close(out)
                    for _, i := range nums {
                        out <- i
                    }
                }()
                return out
            }
        }

20. Our first Go pipeline: Producer – Processor – Consumer

        func squareProcessor() Processor {
            return func(in <-chan int) <-chan int {
                out := make(chan int)
                go func() {
                    defer close(out)
                    for i := range in {
                        out <- i * i
                    }
                }()
                return out
            }
        }

21. Our first Go pipeline: Producer – Processor – Consumer

        func printConsumer() Consumer {
            return func(in <-chan int) {
                for {
                    i, ok := <-in
                    if ok {
                        fmt.Println(i)
                    } else {
                        return
                    }
                }
            }
        }

22. Our first Go pipeline: Run

        func main() {
            printConsumer()(
                squareProcessor()(
                    numberGenerator(1, 2, 3, 4)(),
                ),
            )
        }

    This prints 1, 4, 9, 16.

23. Data flow patterns: Fan-in, fan-out (code)

        func main() {
            logConsumer()(
                // Fan-in
                logAggregator(100)(
                    // Fan-out
                    logProcessor()(logGenerator("2019-01.log")()),
                    logProcessor()(logGenerator("2019-02.log")()),
                ),
            )
        }

24. Data flow patterns: Multiplexing. The aggregator takes a variable number of channels and returns a (buffered) output channel. Multiplexing: range over all input channels and use a sync.WaitGroup to close the output channel once all inner goroutines writing to it are done.

25. Data flow patterns: Multiplexing (code)

        var wg sync.WaitGroup
        for _, in := range ins {
            wg.Add(1)
            go func(in <-chan string) {
                defer wg.Done()
                for i := range in {
                    out <- i
                }
            }(in)
        }
        // Close out once all forwarding goroutines are done (see previous slide).
        go func() {
            wg.Wait()
            close(out)
        }()

26. Data flow patterns: Closing & unblocking. General guidelines [13]:
    • Pipeline stages close their out-channels when all send operations are done (i.e., goroutines exit once their values have been sent downstream).
    • Pipeline stages keep receiving values until their in-channels are closed or the senders are unblocked.
    Q: How to unblock? A: 1) use buffered channels, or 2) cancel explicitly via a done channel passed to all pipeline stages.
    [13] Go Concurrency Patterns: Pipelines and cancellation

27. Data flow patterns: Cancellation example. In the consumer: defer close(done), or close it explicitly (e.g., on error). Note that the processor's inner function now also receives the done channel:

        func squareProcessor() Processor {
            return func(done <-chan struct{}, in <-chan int) <-chan int {
                // ...
                defer close(out)
                for n := range in {
                    // Proceeds either after a successful send on out or after
                    // receiving a value from done. Returns without draining in;
                    // upstream also stops sending after the done broadcast.
                    select {
                    case out <- n * n:
                    case <-done:
                        return
                    }
                }
                // ...
            }
        }

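    A sketch of the consumer side driving the cancellation, assuming the Consumer type is extended analogously to accept the done channel (the early-exit condition is illustrative):

    import "fmt"

    // Cancellation variant of the earlier Consumer type: the consumer owns
    // the done channel; closing it broadcasts cancellation to every stage
    // that selects on it.
    type Consumer func(done chan struct{}, in <-chan int)

    func printConsumer() Consumer {
        return func(done chan struct{}, in <-chan int) {
            defer close(done) // runs on every return path, unblocking upstream
            for i := range in {
                if i > 100 { // illustrative early-exit condition
                    return
                }
                fmt.Println(i)
            }
        }
    }
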
28. Other data flow patterns: Streaming. "Poor man's streaming":

        func randNumStream(max int) Generator {
            return func() <-chan int {
                out := make(chan int, 10)
                rand.Seed(time.Now().UnixNano())
                go func() {
                    for {
                        out <- rand.Intn(max)
                        time.Sleep(10 * time.Millisecond)
                    }
                }()
                return out
            }
        }

29. Other data flow patterns: Rate limiting [14]

        throttle := make(chan time.Time, 100) // buffer = burstLimit
        go func() {
            tick := time.NewTicker(rate)
            defer tick.Stop()
            for t := range tick.C {
                select {
                case throttle <- t:
                default:
                }
            }
        }()
        for i := range in {
            <-throttle
            out <- i
        }

    [14] Rate Limiting

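    To connect this to the earlier pipeline abstractions, a sketch wrapping the snippet into the deck's Processor shape (assuming the int-valued Processor type from slide 16; the ticker goroutine is left running for brevity):

    import "time"

    // rateLimitProcessor forwards values from in to out at most once per
    // rate interval, allowing bursts of up to burstLimit values.
    func rateLimitProcessor(rate time.Duration, burstLimit int) Processor {
        return func(in <-chan int) <-chan int {
            out := make(chan int)
            throttle := make(chan time.Time, burstLimit)
            go func() {
                tick := time.NewTicker(rate)
                defer tick.Stop()
                for t := range tick.C {
                    select {
                    case throttle <- t:
                    default: // bucket full, drop the tick
                    }
                }
            }()
            go func() {
                defer close(out)
                for i := range in {
                    <-throttle // wait for a token
                    out <- i
                }
            }()
            return out
        }
    }
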
30. More advanced data processing. Three OSS projects to watch:
    • Benthos: stream processor built around a sources-and-sinks concept (actions, transformations, filters)
    • Automi: stream processing API (emitter – operation – collector), an abstraction on top of channels, still in its early stages
    • Bigslice: ad-hoc cluster processing framework (not strictly streaming; distributed computation across multiple nodes), similar to Spark but "serverless"

31. Benthos [15]
    • Directly links inputs to outputs with acknowledgements (optional buffering)
    • Supports resilient and customizable processing logic (incl. conveniences such as JMESPath queries)
    • Is well documented, configuration-driven, and used in production
    [15] Benthos

32. Benthos Input Handling [16]
    [16] Building a Resilient Stream Processor in Go

33. Benthos Custom Processing [17]
    [17] Building a Resilient Stream Processor in Go

34. Benthos Config [18]
    [18] Building a Resilient Stream Processor in Go

35. Final data recommendations:
    • Use Go (obviously...) and try data flow programming
    • Stay lean; check whether you really need typical big data tooling and be aware of both infrastructure complexity and runtime overheads (splitting data sets, serializing, and sending data over the network for distributed computation)
    • (Also: often you can just use Postgres; learn SQL)

36. Recap:
    • We looked at the basic building blocks of a data pipeline and some of the challenges of processing large data sets
    • We covered general techniques for working with data in memory
    • We looked at how to construct simple data flow systems using goroutines and channels
    • And finally we looked at more advanced open-source tooling for setting up flexible and scalable pipelines

37. Links
    ➡ Designing Data-Intensive Applications, Martin Kleppmann
    ➡ When your data doesn't fit in memory: the basic techniques
    ➡ Channel Use Cases – Data Flow Manipulations
    ➡ Go Concurrency Patterns: Pipelines and cancellation
    ➡ Generator Tricks for Systems Programmers
    ➡ Building a Resilient Stream Processor in Go
    ➡ Benthos
    ➡ Automi
    ➡ Bigslice