
Software 2.0 With Go


Ngalam Backend Community

February 22, 2020


Transcript

  1. @ Ice House (3.5 yr): Ruby on Rails + MySQL + Redis + ElasticSearch + Salesforce + AWS; Scala + Play + Akka + Cassandra; Python + Flask; Go + PostgreSQL
     @ DANA (8 mo): Java Spring + MySQL + Aliyun
     @ OnTel AB (5 mo): Python + React + Kafka + RocksDB
     @ SpaceStock (2 mo): Go + MongoDB + ElasticSearch
  2. Software 2.0 with Go
     . Software 2.0
     . Software 1.0 -> Software 2.0
     . Lifecycle of a Software 2.0 Project
     . Infrastructure and Tooling
     . Python is the King, Go can be the Minister
     . Go Case Study: Data Processing Pipeline
  3. Programmer 2.0 vs 1.0
     2.0: curate + maintain + massage + clean + label the dataset
     1.0: maintain the tools: analytics + visualization + labeling interface + infrastructure + training code
  4. Benefits of S/W 2.0
     Works better in practice
     Computationally homogeneous
     Simple to bake into silicon
     Constant running time
     Constant memory use
     Highly portable
     Agile
     Modules can meld into an optimal whole
     Better than you
  5. S/W 1.0 vs S/W 2.0
     S/W 1.0 -> write code: express how the system achieves its goal
     S/W 2.0 -> curate training data: spec-by-example of what the system should do
     Toolchain 1.0 -> create + validate logic
     Toolchain 2.0 -> create / curate + validate data
  6. Example: Google Email Extraction System
     Learns templates for B2C email
     Uses templates to extract info (order number / travel date)
     Originally heuristic-based, with handcrafted rules
     Lesson learned -> coverage of the heuristic-based extraction system stayed flat for several months because it was too brittle to improve without introducing errors
  7. Benefits of switching to S/W 2.0
     Precision and recall quickly surpassed the S/W 1.0 results
     Google deleted 45k lines of code
     The new system is easier to maintain
     The old system was brittle: difficult to debug, difficult to improve accuracy further
     New possibility -> cross-language word embeddings to learn extraction models across several languages
  8. Result: machine-learned > heuristic-based, and easier to understand and improve
     The critical ingredient for Software 2.0 is managing training data:
     acquiring, debugging, versioning, transforming
  9. Mental Model for a S/W 2.0 Project
     High impact:
       complex parts of your pipeline
       where "cheap prediction" is valuable
       where automating a complicated manual process is valuable
     Low cost. Cost is driven by:
       data availability
       performance requirements
       problem difficulty
  10. Data Management
     . Data Sources
     . Data Labeling
     . Data Storage
     . Data Versioning
     . Data Processing
  11. Data Sources
     Supervised deep learning requires a lot of labeled data
     Labeling your own data is costly
     Some resources for data:
       Open source data (good to start with, not an advantage)
       Data augmentation (a MUST for CV, optional for NLP)
       Synthetic data (worth starting with, esp. in NLP)
  12. Data Labeling
     Requires: labeling platforms, temporary labor, and QC
     Sources of labor:
       Crowdsourcing: cheap and scalable, less reliable, needs QC
       Hiring your own annotators: less QC needed, expensive, slow to scale
       Data labeling service companies: FigureEight
  13. Data Labeling
     Labeling platforms:
       Diffgram: training data software (CV)
       Prodigy: annotation tool powered by active learning (text + image)
       HIVE: AI-as-a-Service platform for CV
       Supervisely: entire CV platform
       Labelbox: CV
       Scale: AI data platform (CV & NLP)
  14. Data Storage
     Object store: store binary data (images, sound files, compressed texts)
       Amazon S3
       Ceph Object Store
     Database: store metadata (file paths, labels, user activity, etc.)
       Postgres: right choice for most applications, best-in-class SQL and great support for unstructured JSON
  15. Data Storage
     Data lake: aggregates features that are not obtainable from a database (e.g. logs)
       Amazon Redshift
     Feature store: store, access, and share ML features
       FEAST
       Michelangelo Palette
     At training time, copy data into a local or networked filesystem (NFS)
  16. Data Versioning
     It's a MUST for deployed ML models: deployed ML models are part code, part data. No data versioning means no model versioning.
     Data versioning platforms:
       DVC: open source version control system for ML projects
       Pachyderm: version control for data
       Dolt: versioning for SQL databases
  17. Data Processing
     Training data for production models may come from different sources: data stored in databases and object stores, log processing, and the outputs of other classifiers.
     There are dependencies between tasks; each needs to be kicked off after its dependencies finish. For example, training on new log data requires a preprocessing step before training.
     Makefiles are not scalable, so workflow managers become essential here.
  18. Data Processing
     Workflow orchestration:
       Luigi by Spotify
       Airflow by Airbnb: dynamic, extensible, elegant, and scalable (the most widely used)
         DAG workflows
         Robust conditional execution: retry in case of failure
         Pusher supports Docker images with TensorFlow Serving
         Whole workflow in a single .py file
  19. Development, Training, and Evaluation
     . Software Engineering
     . Resource Management
     . Deep Learning Frameworks
     . Experiment Management
     . Hyperparameter Tuning
     . Distributed Training
  20. Software Engineering
     Winner language: Python
     Editors: Vim / Emacs / VS Code
     Notebooks: great as a starting point for projects, hard to scale
       nteract: a next-gen React-based UI for Jupyter notebooks
       Papermill: an nteract library for parameterizing, executing, and analyzing Jupyter notebooks
       Commuter: another nteract project, providing a read-only display of notebooks (e.g. from S3 buckets)
       Streamlit: interactive data science tool with applets
  21. Software Engineering
     Compute recommendations:
     For individuals or startups:
       Development: a 4x Turing-architecture PC
       Training/Evaluation: use the same 4x GPU PC. When running many experiments, either buy shared servers or use cloud instances.
     For large companies:
       Development: buy a 4x Turing-architecture PC per ML scientist, or let them use V100 instances
       Training/Evaluation: use cloud instances with proper provisioning and handling of failures
  22. Resource Management
     Allocating free resources to programs
     Resource management options:
       Old-school cluster job schedulers (e.g. Slurm workload manager)
       Docker + Kubernetes
       Kubeflow
       Polyaxon (paid features)
  23. Experiment Management
     Development, training, and evaluation strategy: always start simple.
     Train a small model on a small batch. Only if it works, scale up to larger data and models, and do hyperparameter tuning!
  24. Experiment Management
     Experiment management tools:
       TensorBoard: provides the visualization and tooling needed for ML experimentation
       Losswise: monitoring for ML
       Comet: lets you track code, experiments, and results on ML projects
       Weights & Biases: record and visualize every detail of your research, with easy collaboration
  25. Experiment Management
     MLflow Tracking: logs parameters, code versions, metrics, and output files, and visualizes the results
       Automatic experiment tracking with one line of Python code
       Side-by-side comparison of experiments
       Hyperparameter tuning
       Supports Kubernetes-based jobs
  26. Hyperparameter Tuning
     Approaches:
       Grid search
       Random search
       Bayesian optimization
       HyperBand and Asynchronous Successive Halving Algorithm (ASHA)
       Population-Based Training
  27. Distributed Training
     Data parallelism: use it when iteration time is too long (both TensorFlow and PyTorch support it)
       Ray Distributed Training
       Horovod
     Model parallelism: use it when the model does not fit on a single GPU
  28. Testing and Deployment
     . Testing and CI/CD
     . Web Deployment
     . Service Mesh and Traffic Routing
     . Monitoring
     . Deploying on Embedded and Mobile Devices
  29. Testing and CI/CD
     Unit and integration testing. Types of tests:
       Training system tests: test the training pipeline
       Validation tests: test the prediction system on the validation set
       Functionality tests: test the prediction system on a few important examples
     Continuous integration: run the tests after each new code change is pushed to the repo
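As a sketch of what such a functionality test could look like in Go: the stringPayload type and the upperCase stage below are hypothetical stand-ins for a real prediction/processing step, and the table-driven test only exercises a few important examples.

    package prediction

    import (
        "context"
        "strings"
        "testing"
    )

    // stringPayload and upperCase are hypothetical stand-ins for a real payload
    // type and a real prediction/processing stage.
    type stringPayload struct{ value string }

    func upperCase(_ context.Context, p *stringPayload) (*stringPayload, error) {
        return &stringPayload{value: strings.ToUpper(p.value)}, nil
    }

    // TestUpperCaseStage is a functionality test: it exercises the processing
    // logic on a few important examples, independent of training or serving.
    func TestUpperCaseStage(t *testing.T) {
        cases := []struct{ in, want string }{
            {"hello", "HELLO"},
            {"Go", "GO"},
            {"", ""},
        }
        for _, tc := range cases {
            got, err := upperCase(context.Background(), &stringPayload{value: tc.in})
            if err != nil {
                t.Fatalf("upperCase(%q) returned error: %v", tc.in, err)
            }
            if got.value != tc.want {
                t.Errorf("upperCase(%q) = %q, want %q", tc.in, got.value, tc.want)
            }
        }
    }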
  30. Testing and CI/CD
     SaaS for continuous integration:
       Argo: open source Kubernetes-native workflow engine for orchestrating parallel jobs (includes workflows, events, CI and CD)
       CircleCI: language-inclusive support, custom environments, flexible resource allocation; used by Instacart, Lyft, and StackShare
       Travis CI
       Buildkite: fast and stable builds; the open source agent runs on almost any machine and architecture; freedom to use your own tools and services
       Jenkins: old-school build system
  31. Web Deployment
     Consists of a prediction system and a serving system
       Prediction system: processes input data, makes predictions
       Serving system (web server): serves predictions with scale in mind; exposes a REST API for prediction HTTP requests and calls the prediction system to respond
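A minimal sketch of such a serving system in Go, using only the standard library: the predict function, the /predict route, and the request/response fields are hypothetical placeholders for a real prediction system.

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
    )

    // predict is a stand-in for the prediction system; a real service would call
    // the model in-process or over RPC to a model server.
    func predict(text string) string {
        if len(text) > 20 {
            return "long"
        }
        return "short"
    }

    type request struct {
        Text string `json:"text"`
    }

    type response struct {
        Label string `json:"label"`
    }

    // The serving system: a web server exposing the prediction system over REST.
    func main() {
        http.HandleFunc("/predict", func(w http.ResponseWriter, r *http.Request) {
            var req request
            if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            json.NewEncoder(w).Encode(response{Label: predict(req.Text)})
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }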
  32. Web Deployment
     Serving options:
       Deploy to VMs, scale by adding instances
       Deploy as containers, scale via orchestration
         Containers: Docker
         Container orchestration: Kubernetes (the most popular now), Mesos, Marathon
       Deploy code as a "serverless function"
       Deploy via a model serving solution
  33. Web Deployment
     Model serving: specialized web deployment for ML models; batches requests for GPU inference
     Frameworks:
       TensorFlow Serving
       MXNet Model Server
       Clipper (Berkeley)
       SaaS solutions
         Seldon: serve and scale models built in any framework on Kubernetes
         Algorithmia
     Deploying Jupyter notebooks: Kubeflow Fairing is a hybrid deployment package that lets you deploy your Jupyter notebook code
  34. Web Deployment
     Decision making: CPU or GPU?
       CPU inference: preferable if it meets the requirements; scale by adding more servers or going serverless
       GPU inference: TF Serving or Clipper; adaptive batching is useful
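A minimal sketch of adaptive (micro-)batching in Go: requests arrive on a channel and are flushed to a batched infer call either when the batch is full or when a short timeout expires. The maxBatch, maxWait, and infer names are hypothetical; a real system would forward the batch to a GPU-backed model server.

    package main

    import (
        "fmt"
        "time"
    )

    // request represents one inference request; the result comes back on resp.
    type request struct {
        input string
        resp  chan string
    }

    // batcher collects requests and flushes them either when maxBatch is reached
    // or when maxWait elapses, so the GPU sees larger, more efficient batches.
    func batcher(in <-chan request, maxBatch int, maxWait time.Duration, infer func([]string) []string) {
        for {
            first, ok := <-in
            if !ok {
                return
            }
            batch := []request{first}
            timer := time.NewTimer(maxWait)
        collect:
            for len(batch) < maxBatch {
                select {
                case r, ok := <-in:
                    if !ok {
                        break collect
                    }
                    batch = append(batch, r)
                case <-timer.C:
                    break collect
                }
            }
            timer.Stop()

            inputs := make([]string, len(batch))
            for i, r := range batch {
                inputs[i] = r.input
            }
            for i, out := range infer(inputs) {
                batch[i].resp <- out
            }
        }
    }

    func main() {
        in := make(chan request)
        // infer is a stand-in for a real batched model call (e.g. TF Serving).
        go batcher(in, 8, 10*time.Millisecond, func(xs []string) []string {
            outs := make([]string, len(xs))
            for i, x := range xs {
                outs[i] = "label-for-" + x
            }
            return outs
        })

        resp := make(chan string, 1)
        in <- request{input: "example", resp: resp}
        fmt.Println(<-resp)
    }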
  35. Service Mesh and Traffic Routing
     The transition from monolithic applications toward a distributed microservice architecture can be challenging. A service mesh (consisting of a network of microservices) reduces the complexity of such deployments and eases the strain on development teams.
     Istio: a service mesh to ease creation of a network of deployed services, with load balancing, service-to-service authentication, and monitoring, with few or no changes in service code.
  36. Monitoring
     Purpose of monitoring:
       Alerts for downtime, errors, and distribution shifts
       Catching service and data regressions
     Cloud providers' solutions are decent
     Kiali: observability console for Istio with service mesh configuration capabilities; it answers: How are the microservices connected? How are they performing?
  37. Deploying on Embedded and Mobile Devices
     Main challenge: memory footprint and compute constraints
     Solutions:
       Quantization
       Reduced model size: MobileNets
       Knowledge distillation: DistilBERT (for NLP)
  38. Deploying on Embedded and Mobile Devices
     Embedded and mobile frameworks:
       TensorFlow Lite
       PyTorch Mobile
       Core ML
       ML Kit
       FRITZ
       OpenVINO
     Model conversion: Open Neural Network Exchange (ONNX), an open-source format for deep learning models
  39. Though Python is the King, Go can be the Minister
     Python is the King for S/W 2.0
     To actually run a production system at scale, you need infra that implements:
       Autoscaling -> traffic fluctuations don't break the API
       API management -> handle simultaneous API deployments
       Rolling updates -> update models while still serving users
       Logging
       Cost optimization
  40. Why Go?
     Concurrency is crucial for S/W 2.0 infrastructure
       You wrangle a few different APIs, calling them programmatically to provision clusters, launch deployments, and monitor them
       Doing that in a performant, reliable way is challenging; Go has goroutines + channels
     Building cross-platform CLIs is easier in Go
     The Go ecosystem is great for infrastructure projects
     Go is just a pleasure to work with: good for large projects, fast compilation, static typing, and great tooling
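A minimal sketch of that concurrency story: probing a few deployment health endpoints in parallel with goroutines and fanning the results back in over a channel. The endpoint names and URLs are hypothetical placeholders; real infrastructure code would call cloud-provider or Kubernetes APIs instead.

    package main

    import (
        "fmt"
        "net/http"
        "sync"
        "time"
    )

    // status pairs a monitored endpoint with the outcome of a health probe.
    type status struct {
        name string
        err  error
    }

    func main() {
        // Hypothetical deployment endpoints to monitor.
        endpoints := map[string]string{
            "model-a": "http://localhost:8080/healthz",
            "model-b": "http://localhost:8081/healthz",
        }

        client := &http.Client{Timeout: 2 * time.Second}
        results := make(chan status)
        var wg sync.WaitGroup

        // Probe every endpoint concurrently: one goroutine per API call,
        // with results fanned back in over a channel.
        for name, url := range endpoints {
            wg.Add(1)
            go func(name, url string) {
                defer wg.Done()
                resp, err := client.Get(url)
                if err == nil {
                    resp.Body.Close()
                    if resp.StatusCode != http.StatusOK {
                        err = fmt.Errorf("unexpected status %d", resp.StatusCode)
                    }
                }
                results <- status{name: name, err: err}
            }(name, url)
        }

        go func() {
            wg.Wait()
            close(results)
        }()

        for s := range results {
            if s.err != nil {
                fmt.Printf("%s: unhealthy (%v)\n", s.name, s.err)
                continue
            }
            fmt.Printf("%s: healthy\n", s.name)
        }
    }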
  41. // Processor transforms a payload; ProcessorFunc adapts a plain function to the interface.
      type Processor interface {
          Process(context.Context, Payload) (Payload, error)
      }

      type ProcessorFunc func(context.Context, Payload) (Payload, error)

      func (f ProcessorFunc) Process(ctx context.Context, p Payload) (Payload, error) {
          return f(ctx, p)
      }
  42. // StageParams carries a stage's position and its input, output, and error channels.
      type StageParams interface {
          StageIndex() int
          Input() <-chan Payload
          Output() chan<- Payload
          Error() chan<- error
      }

      type StageRunner interface {
          Run(context.Context, StageParams)
      }
  43. // Source emits payloads into the pipeline; Sink consumes the final output.
      type Source interface {
          Next(context.Context) bool
          Payload() Payload
          Error() error
      }

      type Sink interface {
          Consume(context.Context, Payload) error
      }
  44. func sourceWorker(ctx context.Context, source Source, outCh chan<- Payload, errCh chan<- error) {
          for source.Next(ctx) {
              payload := source.Payload()
              select {
              case outCh <- payload:
              case <-ctx.Done():
                  return
              }
          }
          if err := source.Error(); err != nil {
              wrappedErr := xerrors.Errorf("pipeline source: %w", err)
              maybeEmitError(wrappedErr, errCh)
          }
      }
  45. func sinkWorker(ctx context.Context, sink Sink, inCh <-chan Payload, errCh chan<- error) {
          for {
              select {
              case payload, ok := <-inCh:
                  if !ok {
                      return
                  }
                  if err := sink.Consume(ctx, payload); err != nil {
                      wrappedErr := xerrors.Errorf("pipeline sink: %w", err)
                      maybeEmitError(wrappedErr, errCh)
                      return
                  }
                  payload.MarkAsProcessed()
              case <-ctx.Done():
                  return
              }
          }
      }
  46. Pipeline

      type Pipeline struct {
          stages []StageRunner
      }

      func New(stages ...StageRunner) *Pipeline {
          return &Pipeline{stages: stages}
      }
  47. func (p *Pipeline) Process(ctx context.Context, source Source, sink Sink) error {
          var wg sync.WaitGroup
          pCtx, ctxCancelFn := context.WithCancel(ctx)

          // One channel per stage boundary plus a shared error channel.
          stageCh := make([]chan Payload, len(p.stages)+1)
          errCh := make(chan error, len(p.stages)+2)
          for i := 0; i < len(stageCh); i++ {
              stageCh[i] = make(chan Payload)
          }

          // Start a worker per stage, wiring each stage's output to the next stage's input.
          for i := 0; i < len(p.stages); i++ {
              wg.Add(1)
              go func(stageIndex int) {
                  p.stages[stageIndex].Run(pCtx, &workerParams{
                      stage: stageIndex,
                      inCh:  stageCh[stageIndex],
                      outCh: stageCh[stageIndex+1],
                      errCh: errCh,
                  })
                  close(stageCh[stageIndex+1])
                  wg.Done()
              }(i)
          }

          // Start the source and sink workers at the two ends of the pipeline.
          wg.Add(2)
          go func() {
              sourceWorker(pCtx, source, stageCh[0], errCh)
              close(stageCh[0])
              wg.Done()
          }()
          go func() {
              sinkWorker(pCtx, sink, stageCh[len(stageCh)-1], errCh)
              wg.Done()
          }()

          go func() {
              wg.Wait()
              close(errCh)
              ctxCancelFn()
          }()

          // Collect errors; any error cancels the shared context.
          var err error
          for pErr := range errCh {
              err = multierror.Append(err, pErr)
              ctxCancelFn()
          }
          return err
      }
  48. func (r fifo) Run(ctx context.Context, params StageParams) {
          for {
              select {
              case <-ctx.Done():
                  return
              case payloadIn, ok := <-params.Input():
                  if !ok {
                      return
                  }
                  payloadOut, err := r.proc.Process(ctx, payloadIn)
                  if err != nil {
                      wrappedErr := xerrors.Errorf("pipeline stage %d: %w", params.StageIndex(), err)
                      maybeEmitError(wrappedErr, params.Error())
                      return
                  }
                  // A nil output means this stage filtered the payload out.
                  if payloadOut == nil {
                      payloadIn.MarkAsProcessed()
                      continue
                  }
                  select {
                  case params.Output() <- payloadOut:
                  case <-ctx.Done():
                      return
                  }
              }
          }
      }
  49. // Fixed worker pool: numWorkers FIFO runners share the same stage channels.
      func FixedWorkerPool(proc Processor, numWorkers int) StageRunner {
          fifos := make([]StageRunner, numWorkers)
          for i := 0; i < numWorkers; i++ {
              fifos[i] = FIFO(proc)
          }
          return &fixedWorkerPool{fifos: fifos}
      }

      func (p *fixedWorkerPool) Run(ctx context.Context, params StageParams) {
          var wg sync.WaitGroup
          for i := 0; i < len(p.fifos); i++ {
              wg.Add(1)
              go func(fifoIndex int) {
                  p.fifos[fifoIndex].Run(ctx, params)
                  wg.Done()
              }(i)
          }
          wg.Wait()
      }
  50. // Worker-pool stage excerpt: each payload is processed in its own goroutine,
      // bounded by a token pool that is drained on shutdown.
                  payloadOut, err := p.proc.Process(ctx, payloadIn)
                  if err != nil {
                      wrappedErr := xerrors.Errorf("pipeline stage %d: %w", params.StageIndex(), err)
                      maybeEmitError(wrappedErr, params.Error())
                      return
                  }
                  if payloadOut == nil {
                      payloadIn.MarkAsProcessed()
                      return
                  }
                  select {
                  case params.Output() <- payloadOut:
                  case <-ctx.Done():
                  }
              }(payloadIn, token)
          }
      }

      for i := 0; i < cap(p.tokenPool); i++ {
          <-p.tokenPool
      }
  }
  51. type broadcast struct {
          fifos []StageRunner
      }

      func Broadcast(procs ...Processor) StageRunner {
          if len(procs) == 0 {
              panic("Broadcast: at least one processor must be specified")
          }
          fifos := make([]StageRunner, len(procs))
          for i, p := range procs {
              fifos[i] = FIFO(p)
          }
          return &broadcast{fifos: fifos}
      }

      func (b *broadcast) Run(ctx context.Context, params StageParams) {
          var (
              wg   sync.WaitGroup
              inCh = make([]chan Payload, len(b.fifos))
          )
          for i := 0; i < len(b.fifos); i++ {
              wg.Add(1)
              inCh[i] = make(chan Payload)
              go func(fifoIndex int) {
                  fifoParams := &workerParams{
                      stage: params.StageIndex(),
                      inCh:  inCh[fifoIndex],
                      outCh: params.Output(),
                      errCh: params.Error(),
                  }
                  b.fifos[fifoIndex].Run(ctx, fifoParams)
                  wg.Done()
              }(i)
          }
      done:
          for {
              select {
              case <-ctx.Done():
                  break done
              case payload, ok := <-params.Input():
                  if !ok {
                      break done
                  }
                  for i := len(b.fifos) - 1; i >= 0; i-- {
                      var fifoPayload = payload
                      if i != 0 {
                          fifoPayload = payload.Clone()
                      }
                      select {
                      case <-ctx.Done():
                          break done
                      case inCh[i] <- fifoPayload:
                      }
                  }
              }
          }
          for _, ch := range inCh {
              close(ch)
          }
          wg.Wait()
      }
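To tie the previous slides together, here is a hedged end-to-end usage sketch of the pipeline shown above. It assumes the code lives in an importable pipeline package (the import path example.com/pipeline is a placeholder), that the Payload interface requires only Clone() and MarkAsProcessed(), and that FIFO(proc) is the FIFO stage constructor referenced on the earlier slides.

    package main

    import (
        "context"
        "fmt"
        "strings"

        // Hypothetical import path; adjust to wherever the pipeline package lives.
        "example.com/pipeline"
    )

    // stringPayload is a minimal Payload implementation for this sketch,
    // assuming Payload requires only Clone and MarkAsProcessed.
    type stringPayload struct{ value string }

    func (p *stringPayload) Clone() pipeline.Payload { return &stringPayload{value: p.value} }
    func (p *stringPayload) MarkAsProcessed()        {}

    // sliceSource emits one payload per input string.
    type sliceSource struct {
        items []string
        next  int
        cur   pipeline.Payload
    }

    func (s *sliceSource) Next(context.Context) bool {
        if s.next >= len(s.items) {
            return false
        }
        s.cur = &stringPayload{value: s.items[s.next]}
        s.next++
        return true
    }
    func (s *sliceSource) Payload() pipeline.Payload { return s.cur }
    func (s *sliceSource) Error() error              { return nil }

    // printSink consumes the pipeline's final output.
    type printSink struct{}

    func (printSink) Consume(_ context.Context, p pipeline.Payload) error {
        fmt.Println(p.(*stringPayload).value)
        return nil
    }

    func main() {
        // Two stages, each a ProcessorFunc run as a FIFO stage: trim, then upper-case.
        trim := pipeline.ProcessorFunc(func(_ context.Context, p pipeline.Payload) (pipeline.Payload, error) {
            sp := p.(*stringPayload)
            sp.value = strings.TrimSpace(sp.value)
            return sp, nil
        })
        upper := pipeline.ProcessorFunc(func(_ context.Context, p pipeline.Payload) (pipeline.Payload, error) {
            sp := p.(*stringPayload)
            sp.value = strings.ToUpper(sp.value)
            return sp, nil
        })

        p := pipeline.New(pipeline.FIFO(trim), pipeline.FIFO(upper))
        src := &sliceSource{items: []string{"  hello ", " go "}}
        if err := p.Process(context.Background(), src, printSink{}); err != nil {
            fmt.Println("pipeline error:", err)
        }
    }

Each stage runs in its own goroutine; Process blocks until the source is drained or the context is canceled, and returns any accumulated stage errors.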