
Software Design Practices for Large-Scale Automation

Design practices for large-scale, high-performance, distributed systems for complex algorithms such as graph, optimization, prediction, and machine learning.


hohaxu

January 05, 2017

Transcript

  1. Ongoing... Design for large-scale, high-performance, distributed software systems for complex

    algorithms such as graph, optimization, prediction, and machine learning. Corrections/improvements are very welcome at [email protected] (Hao Xu)
  2. Topics • Large-scale Automation: Why Challenging? • Design Principles: Coping

    with Complexity and Physicality • Computation Paradigms: HPC, Spark, Tensorflow • Designs: Logical, Physical, System levels • Distributed and Iterative Algorithms: Partition, Sync, Iteration Trade-offs • Smart QA: Protection, Auditing, Debug codes
  3. Design Objectives for Large-scale Automation • Scalability (growing) • Extensibility

    (evolving) • Performance (fast) • Maintenance (controllable)
  4. Scalability: Name of the Game • Electronics simulation: mandatory for

    simulation software to scale with Moore’s law • Internet Applications: systems need to be ready for the next 10x user growth and feature evolution • Knowledge Base: a bigger system improves cross-referencing and hence the quality of newly learned knowledge • Deep Learning: the capacity of the system affects the quality of the latent features learned and hence the prediction capability • Internet of Things: as the name suggests...
  5. What makes it difficult? #1 Complexity • Complexity is the

    TOP challenge for software engineering • Usually grows with the scale of the system ◦ exhibits different patterns at different scales ◦ explodes with the number of software features • The only way to handle complexity ◦ “Divide and Conquer” ◦ realized by various Design Principles
  6. What makes it difficult? #2 Physicality • Software is physical,

    just like humans ◦ Results are stored in physical memory (RAM/ROM/Disk) ◦ Computation is done in physical processing units (CPU/GPU/FPGA) • Not feasible to build one gigantic machine that solves everything ◦ The system should live on machine farms ◦ Data / Computation should be distributed • Physicality complicates the design of systems ◦ Data partition ◦ Computation partition
  7. Design Principles for Coping with Complexity • Abstraction (Vertical Divide

    & Conquer) ◦ Core Abstractions ◦ Hierarchization • Decoupling (Horizontal Divide & Conquer) ◦ Encapsulation ◦ Layerization • Abstraction and Decoupling together are the centerpiece of large-scale system design
  8. Abstraction: Vertical Divide and Conquer • Core Abstractions ◦ the

    soul of large-scale systems ◦ the root of the abstraction hierarchy ◦ higher-level abstraction = better extensibility • Hierarchization ◦ simplification of the system functionality graph ◦ ideally mapped into tree structures (no loops) ◦ the template for Object-Oriented Design ◦ needs a balance between delegation & checking
  9. Decoupling: Horizontal Divide and Conquer • Encapsulation ◦ components encapsulate

    complex logic ◦ API design for a minimal interface • Layerization ◦ algorithms divided into layers ◦ each layer handles one feature/algorithm (see the sketch below) ▪ layer 1: Graph partition and communication ▪ layer 2: Graph node property analysis ▪ layer 3: User operations on Graph nodes ▪ ...
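
A minimal Python sketch of this layering, using hypothetical class and method names: each layer encapsulates its own state and exposes only a small API to the layer above it.

```python
# Hypothetical layered design: each layer exposes a minimal API to the layer above.

class GraphPartitionLayer:                    # layer 1: partition & communication
    def __init__(self, edges, num_parts):
        self._edges = edges                   # internal state stays encapsulated
        self._num_parts = num_parts

    def partition(self):
        """Minimal interface: return {part_id: [(u, v), ...]} via hash partitioning."""
        parts = {i: [] for i in range(self._num_parts)}
        for u, v in self._edges:
            parts[hash(u) % self._num_parts].append((u, v))
        return parts


class NodeAnalysisLayer:                      # layer 2: node property analysis
    def __init__(self, partition_layer):
        self._parts = partition_layer.partition()   # uses only the minimal API

    def out_degrees(self):
        degrees = {}
        for edges in self._parts.values():
            for u, _ in edges:
                degrees[u] = degrees.get(u, 0) + 1
        return degrees


edges = [("a", "b"), ("a", "c"), ("b", "c")]
print(NodeAnalysisLayer(GraphPartitionLayer(edges, num_parts=2)).out_degrees())
```
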
  10. The Priority of Abstractions for Project Management • Core abstractions

    (1st Priority) ◦ Determines functionality/scalability • Library abstractions (2nd Priority) ◦ Determines performance • Logic abstractions (low priority) ◦ Flows ◦ Apps ◦ Business logic
  11. Computation Paradigms • What is Computation Paradigm? ◦ Computation abstraction

    at different levels ◦ Offers encapsulation and parallelism at different levels ◦ Crucial to choose the right computation paradigm • Computation Paradigm at different levels ◦ Language level: Python, C, Scala ◦ Flow level: Imperative, Symbolic, Functional programming ◦ System level: Computation-centric (HPC) or Data-centric (e.g. Spark)
  12. Flow level: Imperative Programming • Imperative Programming: No native abstraction

    ◦ C++ / Python / Java ◦ Computation at the instruction level ◦ Task-level parallelism
  13. Flow level: Functional Programming • Functional Programming: Data abstraction ◦

    Scala / MapReduce ◦ Immutable data, stateless functions • Pros ◦ Offers data-level parallelism • Cons ◦ Data is read-only; an update requires making another copy ◦ More memory consumption and potential performance overhead (sketch below)
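
A minimal Python sketch of the functional style described above: data is never mutated in place, and an "update" produces a new collection. That extra copy is the memory cost the slide mentions, but it is also what makes data-level parallelism safe.

```python
from functools import reduce

prices = (10.0, 25.5, 7.25)                       # immutable input data

# "Updating" means building a new collection, not mutating the old one.
with_tax = tuple(map(lambda p: p * 1.08, prices))
total = reduce(lambda acc, p: acc + p, with_tax, 0.0)

print(prices)      # original data is untouched
print(with_tax)    # the extra copy is the memory cost noted above
print(round(total, 2))
```
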
  14. Flow level: Symbolic Programming • Symbolic Programming: Operator abstraction ◦

    Theano / TensorFlow ◦ Operator-level parallelism ◦ Graph model as the base engine • Pros ◦ Offers high operator parallelism through graph propagation • Cons ◦ Not flexible enough for all programming tasks ◦ May incur overhead when handling fine-grained operators (see the sketch below)
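
A minimal sketch of the symbolic style, assuming the TensorFlow 1.x graph API that was current when this deck was written: the code first declares an operator graph, and actual computation happens only when the engine executes the graph.

```python
import tensorflow as tf  # assumes the TensorFlow 1.x graph API

# Build the symbolic graph; nothing is computed yet.
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
w = tf.Variable(tf.ones([3, 1]), name="w")
y = tf.matmul(x, w)                       # an operator node in the graph

# Execution: the engine schedules operators, enabling operator-level parallelism.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```
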
  15. System level: Computation-Centric System (typical HPC 1) • What is

    HPC ◦ HPC is extreme parallel computing ◦ Computation Partition ▪ Communication-delay aware • Intra-node: L1/L2/L3 caches • Inter-node interconnect: ~100 Gb/s • Inter-cluster Ethernet: ~1 Gb/s plus RAM-to-disk time ▪ Physical-architecture aware • Register size, etc.
  16. System level: Computation-Centric System (typical HPC 2) • Parallel at

    different levels ◦ Multi-threading ◦ Multi-process ◦ Distributed cluster ◦ Mainstream communication: MPI • Partition based on needs of communication ◦ Minimize communication ◦ Algorithm partition ◦ Data partition
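
A minimal sketch of the MPI style using the mpi4py binding (an assumption; the deck only names MPI): each rank computes on its own partition of the data, and a single collective call combines the partial results, which keeps communication to a minimum.

```python
# Run with: mpiexec -n 4 python partial_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank owns one partition of the data and computes locally.
data = list(range(1_000_000))
chunk = data[rank::size]
local_sum = sum(chunk)

# One collective call instead of point-to-point messages: minimize communication.
total = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print("total =", total)
```
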
  17. System level: Computation-Centric System (typical HPC 3) • Exploit Heterogeneous

    Components ◦ GPU acceleration (many small cores) ▪ If the model is too small, the offload overhead dominates: keep it on the CPU ▪ If the model is too large, it exceeds GPU memory: do partial acceleration ▪ Exchange data with the CPU through memory copies ◦ FPGA (millions of gates) ◦ SSD, RAID 0/1/5/10 • Disk IO ◦ HDF5 parallel read/write (sketch below)
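
A minimal h5py sketch of the HDF5 point (an assumption; the deck does not name a binding). Truly parallel read/write additionally requires an MPI-enabled HDF5/h5py build; the sketch below only shows chunked partial I/O, so each worker can read just its own slice.

```python
import h5py
import numpy as np

# Write a chunked dataset once.
with h5py.File("features.h5", "w") as f:
    f.create_dataset("x", data=np.random.rand(10_000, 64), chunks=(1_000, 64))

# Later, a worker reads only the slice it owns instead of the whole dataset.
with h5py.File("features.h5", "r") as f:
    block = f["x"][2_000:3_000]
print(block.shape)        # (1000, 64)
```
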
  18. System level: Data-Centric System (Spark-like) • Data partition: Physically distributed

    central DB ◦ Serialization: boost::serialization (C++), pickling (Python) • Scalable computation ◦ Usually has a scheduler ◦ Explicit scheduling: the user defines the computation graph nodes ◦ Implicit scheduling: the engine analyzes the computation graph • Stateless ◦ Good for debugging; easy recovery from failure (see the PySpark sketch below)
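
A minimal data-centric sketch using PySpark (an assumption; any Spark-like engine fits the slide): the data lives in a partitioned RDD, user functions are pickled and shipped to the partitions, and the engine schedules the resulting computation graph.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "word-count-sketch")

lines = sc.parallelize(
    ["design for scale", "design for failure", "scale by partition"],
    numSlices=4,                       # explicit data partitioning
)

counts = (
    lines.flatMap(lambda line: line.split())   # stateless functions are pickled
         .map(lambda word: (word, 1))          # and shipped to the partitions
         .reduceByKey(lambda a, b: a + b)
)

print(sorted(counts.collect()))
sc.stop()
```
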
  19. System level: Hybrid Architecture • Hybrid Architecture Example: TensorFlow ◦

    Stochastic algorithms → use Data-centric model ▪ E.g. Back propagation: Parameter Server ◦ Deterministic algorithms → use Computation-centric (HPC) model ▪ E.g. Common data sync among model partitions: Bulk Synchronous Parallel
  20. Logical Design • Objectify everything ◦ an object can have

    multiple copies for parallel computing ◦ avoid singleton / global / static variables ◦ the top level should only fall through (dispatch); it should not execute any real work
  21. Logical Design • Standardize everything ◦ Base Class for any

    task = function(data, parameters, executor_id) ◦ schema (base class) for any task ◦ schema for any data ◦ schema for any function ◦ schema for any parameter • Benefits ◦ higher-level automation ◦ potentially a more intelligent system (sketch below)
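
A minimal Python sketch of this standardization, with hypothetical class names: every task follows the same task = function(data, parameters, executor_id) schema, so a generic driver can run any task, which is what enables higher-level automation.

```python
from abc import ABC, abstractmethod

class Task(ABC):
    """Base schema: every task is function(data, parameters, executor_id)."""

    @abstractmethod
    def run(self, data, parameters, executor_id):
        ...

class NormalizeTask(Task):
    def run(self, data, parameters, executor_id):
        scale = parameters.get("scale", 1.0)
        return [x / scale for x in data]

class SumTask(Task):
    def run(self, data, parameters, executor_id):
        return sum(data)

# Because every task shares the same schema, a generic driver can chain any of them.
def execute(tasks, data, parameters, executor_id=0):
    for task in tasks:
        data = task.run(data, parameters, executor_id)
    return data

print(execute([NormalizeTask(), SumTask()], [2.0, 4.0, 6.0], {"scale": 2.0}))
```
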
  22. Logical Design • Modularize everything ◦ encapsulate data by using

    setter / getter ◦ encapsulate atomic or repeated functionality ◦ give every hard-coded number a named constant (#define) ◦ factor out long functions or classes ◦ build shared libraries bottom-up ▪ communication lib ▪ parallel computing lib ▪ debug / reporting lib
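
A small Python illustration of the setter/getter and named-constant points (in C++ the constant would be a #define or constexpr; the names here are hypothetical): the setter gives one place to enforce an invariant on otherwise-hidden data.

```python
MAX_FANOUT = 64          # named constant instead of a hard-coded number

class GraphNode:
    def __init__(self, name):
        self._name = name
        self._fanout = 0          # internal data is never touched directly

    @property
    def fanout(self):             # getter
        return self._fanout

    @fanout.setter
    def fanout(self, value):      # setter enforces the invariant in one place
        if not 0 <= value <= MAX_FANOUT:
            raise ValueError(f"fanout must be in [0, {MAX_FANOUT}]")
        self._fanout = value

node = GraphNode("n1")
node.fanout = 8
print(node.fanout)
```
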
  23. Physical Design: Code • Source Code ◦ component-level decoupling

    by folder ◦ module-level decoupling by file ◦ variable-space decoupling by namespace • Code change ◦ a physical change (files/folders touched) should reflect a logical change ◦ the scope of changes should narrow as development progresses ◦ diff management
  24. Physical Design: Memory 1) • Memory is the #1 factor

    for performance ◦ Code runs in memory, not in the air • OS Memory Handling ◦ Memory allocation, fragmentation, release, etc. ◦ tcmalloc vs. jemalloc ▪ Improve allocation/fragmentation behavior ▪ Still have issues releasing memory back to the OS
  25. Physical Design: Memory 2) • Interpreter Memory Handling ◦ Garbage

    Collection • Manual Memory Management ◦ memory pooling is mandatory (sketch below) ◦ memory lifecycle management for anything with large memory usage
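
Memory pooling for systems like these is usually written in C++; the sketch below shows the same idea in Python with a hypothetical pool of preallocated NumPy buffers, so the hot path reuses buffers instead of repeatedly allocating and freeing them.

```python
import numpy as np

class BufferPool:
    """Reuse preallocated buffers instead of allocating in the hot path."""

    def __init__(self, count, shape):
        self._free = [np.empty(shape, dtype=np.float64) for _ in range(count)]

    def acquire(self):
        if not self._free:
            raise MemoryError("pool exhausted; size the pool for peak usage")
        return self._free.pop()

    def release(self, buf):
        self._free.append(buf)       # explicit lifecycle management

pool = BufferPool(count=4, shape=(1024, 1024))
buf = pool.acquire()
buf.fill(0.0)                        # ... do work in the borrowed buffer ...
pool.release(buf)
```
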
  26. Physical Design: Memory 3) • Trade-offs ◦ Depends on application

    ▪ Memory critical: tcmalloc/jemalloc ▪ Memory and performance critical: manual memory management (MMU) ◦ HPC is memory and performance critical ▪ Parallelism does not solve every problem; single-machine performance is still the dominant factor ▪ You must know the code very well to design a manual MMU ◦ Spark is replacing JVM memory management with the Tungsten project
  27. Physical Design: Performance • Performance Tuning ◦ profiling, profiling, profiling...

    ◦ lazy initialization / write / read ◦ cache-aware design ▪ cache-friendly data structures • linked-structure locality ▪ cache-friendly algorithms • read / write locality (profiling sketch below)
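
A minimal profiling sketch using Python's built-in cProfile module; for a C++ engine the same step would use a native profiler such as perf or gprof.

```python
import cProfile
import pstats

def row_major_sum(matrix):
    # Cache-friendly: walks each row in the order it is laid out in memory.
    return sum(sum(row) for row in matrix)

matrix = [[float(i * j) for j in range(500)] for i in range(500)]

profiler = cProfile.Profile()
profiler.enable()
row_major_sum(matrix)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```
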
  28. System Design • Scalable Distributed System ◦ DB Service: Data

    and Computation decoupled ◦ Task/Scheduler: Computation and Execution decoupled ◦ Query/Queue: Producer and Consumer decoupled (sketch below)
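
A minimal sketch of the Query/Queue decoupling with Python's standard queue and threading modules: the producer and the consumer never call each other directly; the queue is the only coupling point.

```python
import queue
import threading

tasks = queue.Queue(maxsize=100)     # the only coupling point

def producer():
    for i in range(10):
        tasks.put(f"query-{i}")      # producer does not know who consumes
    tasks.put(None)                  # sentinel: no more work

def consumer():
    while True:
        item = tasks.get()
        if item is None:
            break
        print("handled", item)       # consumer does not know who produced

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
```
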
  29. System Design • DB Service ◦ Logically Centralized ▪ Parameter

    Server ◦ Physically distributed ▪ Only routing / bookkeeping service on Master ▪ Master capacity is not an issue ▪ Computation locality on Slaves
  30. System Design • Parallel Computing ◦ multi-threading ▪ light overhead

    ▪ shared memory, data exchange is easy ◦ multi-process ▪ heavy overhead ▪ separate memory spaces, data exchange is harder ◦ distributed across multiple machines ▪ balance computation VS. communication (multiprocessing sketch below)
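
A minimal multi-process sketch with Python's multiprocessing module, illustrating the heavier-weight option: each worker has its own memory space, so arguments and results are exchanged by pickling rather than by sharing memory.

```python
from multiprocessing import Pool

def partial_dot(chunk):
    # Runs in a separate process with its own memory space.
    return sum(a * b for a, b in chunk)

if __name__ == "__main__":
    pairs = [(float(i), 2.0) for i in range(100_000)]
    chunks = [pairs[i::4] for i in range(4)]        # data partition, one per worker

    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_dot, chunks))  # arguments/results are pickled

    print(total)
```
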
  31. System Design • TensorFlow Example ◦ Multi-threading: Graph Execution Engine

    ▪ BFS ▪ DFS ◦ Multi-machine: Graph partition ▪ Edge-cut? ▪ Vertex-cut?
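
A minimal sketch of a BFS-style graph execution engine over a hypothetical operator-dependency dict: a node becomes ready as soon as all of its inputs are done, and every node in the same wave could be dispatched to its own thread.

```python
from collections import deque

# Hypothetical operator graph: node -> list of nodes it depends on.
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["c"]}

def bfs_schedule(deps):
    """Return execution waves; every node in a wave could run on its own thread."""
    remaining = {node: set(d) for node, d in deps.items()}
    ready = deque(node for node, d in remaining.items() if not d)
    waves = []
    while ready:
        wave = sorted(ready)           # all currently ready nodes form one wave
        ready.clear()
        waves.append(wave)
        for done in wave:
            remaining.pop(done)
        for node, d in remaining.items():
            d.difference_update(wave)
            if not d:
                ready.append(node)
    return waves

print(bfs_schedule(deps))    # [['a', 'b'], ['c'], ['d', 'e']]
```
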
  32. System Design • Fault Tolerance ◦ Monitor granularity ▪ system

    level: module behavior ▪ flow level: major steps ▪ algorithm level: major checkpoints ◦ Persistence granularity ▪ recovery depth ▪ recovery contents
  33. Key Issues of Distributed Algorithms • Data / Model partition

    ◦ inference data partition; graph partition; datastore sharding • Communication paradigm ◦ Spark RDD; MPI; RPC • Computation locality ◦ locality-aware job scheduling; Yarn; Drill • Parallel algorithm paradigm ◦ Map/Reduce; Spark • Multi-stage distributed flow
  34. Distributed Deterministic Algorithms 1) • What to sync? ◦ what

    is the key information needed to stitch the pieces together ◦ sync data to resemble the single-machine algorithm (rare but can be useful) ◦ keep data local, sync results (map/reduce) • When to sync? ◦ lazy sync (e.g. Bulk Synchronous Parallel) ◦ async (e.g. Parameter Server) • Where to sync? ◦ refactor the algorithm around optimal sync points
  35. Distributed Deterministic Algorithms 2) • Trade-offs ◦ performance ▪ computation

    VS. communication ◦ scalability ▪ need a scalable communication pattern ▪ avoid point-to-point communication
  36. Distributed Approximate Algorithms 1) • QoR loss in distributed computing

    ◦ for many algorithms, lack of global sync leads to QoR loss ◦ full global sync is very expensive in communication cost ◦ carefully choose sync points to maximize the ratio of performance gain to QoR loss • Self-healing Algorithms ◦ some algorithms have less dependency on global sync ◦ e.g. in Stochastic Optimization ▪ global sync may be postponed to allow local optima to be explored ▪ however, this nice property is data / model dependent
  37. Distributed Approximate Algorithms 2) • Major challenges 1) ◦ Trade-off

    on QoR? ▪ approximation is inevitable, so what can be approximated? ▪ not just an engineering problem ▪ usually needs an assessment of business impact ◦ Solutions ▪ for each approximation candidate, detailed profiling of QoR loss VS. performance gain VS. business impact
  38. Distributed Approximate Algorithms 3) • Major challenges 2) ◦ Hard

    to maintain? ▪ Stochastic Algorithms: look for determinism in the probability values ▪ Graph algorithms: hard to trace in a large-scale graph ◦ Solutions ▪ develop the single-machine algorithm first as the golden reference ▪ detailed testing and correlation for each parallelization step ▪ detailed testing to understand result/error patterns on small data
  39. Distributed Iterative Algorithms 1) • Many algorithms for large-scale problems

    are iterative ◦ Simulated Annealing; Genetic Algorithms; Graph Partition; PageRank; Expectation Maximization; Loopy Belief Propagation, etc. • Two common approaches ◦ Local computation + lazy sync (sketch below) ◦ Global computation with graph propagation
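
A minimal sketch of the "local computation + lazy sync" approach using a toy PageRank over two hypothetical partitions: each partition iterates on the nodes it owns against a possibly stale view of the others, and ranks are exchanged only every few iterations.

```python
# Toy graph: node -> incoming neighbors, plus out-degrees for the contributions.
incoming = {"a": ["c", "d"], "b": ["a"], "c": ["a", "b"], "d": ["c"]}
out_degree = {"a": 2, "b": 1, "c": 2, "d": 1}
partitions = [["a", "b"], ["c", "d"]]

n = len(incoming)
global_ranks = {node: 1.0 / n for node in incoming}
# Each partition works on its own (possibly stale) copy between syncs.
local_views = [dict(global_ranks) for _ in partitions]
SYNC_EVERY = 3                                # lazy sync interval

for it in range(1, 13):
    for part, view in zip(partitions, local_views):
        for node in part:                     # local computation on owned nodes only
            contrib = sum(view[src] / out_degree[src] for src in incoming[node])
            view[node] = 0.15 / n + 0.85 * contrib
    if it % SYNC_EVERY == 0:                  # lazy sync: exchange owned ranks
        for part, view in zip(partitions, local_views):
            for node in part:
                global_ranks[node] = view[node]
        local_views = [dict(global_ranks) for _ in partitions]

print({node: round(rank, 3) for node, rank in sorted(global_ranks.items())})
```
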
  40. Distributed Iterative Algorithms 2) • Distributed environment adds another layer

    of complexity ◦ iterations need to be tuned, or completely re-designed ◦ convergence may become harder • Tuning iterations ◦ Again, where to iterate? ▪ spend runtime on the key gainers ▪ profile iterations VS. QoR gain ◦ Tuning knobs for convergence ▪ iteration knobs have a very high impact on convergence ▪ profile convergence parameters VS. runtime VS. QoR
  41. Multi-stage Distributed Flow • Data re-partition problem (“Shuffle” in Spark

    parlance) “In these distributed computation engines, the shuffle refers to the repartitioning and aggregation of data during an all-to-all operation. Understandably, most performance, scalability, and reliability issues that we observe in production Spark deployments occur within the shuffle.” http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double
  42. Multi-stage Distributed Flow 1) • Data re-partition problem (“Shuffle” in

    Spark parlance) ◦ unified partition VS. per-stage partition ▪ per-stage partition fits the algorithm better, but requires data migration ◦ global partition VS. stream partition ▪ global partition fits the algorithm better, but requires a single machine to hold all the data during partitioning ▪ stream partition + post-partition adjustment
  43. Multi-stage Distributed Flow 2) • Data re-partition problem (“Shuffle” in

    Spark parlance) ◦ QoR depends numerically on the number of partitions ▪ direct partitioning has numerical stability problems ▪ fine-grained partition + post-partition coarsening is better • Solutions ◦ Hard to use a standard library for a high-performance system ◦ The best-performing system is customized to: ▪ Data volume ▪ Computation intensity ▪ (Multi-stage) Algorithm parallelism ◦ Always keep a golden single-machine run, even if only for small input data!
  44. Smart QA • You cannot fix a bug unless you can reproduce

    it • You cannot build a system unless you can test it ...
  45. Smart QA: Why • Successful software must have good QA

    ◦ A high-level model of the system ◦ Saves time in debugging ◦ Saves the business in a crisis • Throughout the Software Lifecycle ◦ Development: test-driven development ◦ Deployment: handles discrepancies between the user env and the dev env ◦ Maintenance: predicts errors, learns from failures, improves the system
  46. Protection Code • Assert / Try, Except / Raise… •

    Good to have: ◦ Cases run through ◦ Information on internal data, sometimes • Too much of it? ◦ hurts performance • Need a balance (sketch below) ◦ External input data → sanity check ◦ Internal data → no checks inside the high-performance engine; system design and code should ensure correctness
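
A minimal sketch of that balance, with hypothetical function names: external input is fully sanity-checked at the boundary, while the inner engine uses only cheap asserts that can be stripped with python -O.

```python
def load_weights(raw):
    """Boundary: external data gets a full sanity check before entering the engine."""
    if not isinstance(raw, list) or not raw:
        raise ValueError("weights must be a non-empty list")
    weights = [float(w) for w in raw]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and sum to a positive value")
    return weights

def normalize(weights):
    """Inner engine: no validation, only a cheap assert (removable with `python -O`)."""
    total = sum(weights)
    assert total > 0, "load_weights guarantees a positive total"
    return [w / total for w in weights]

print(normalize(load_weights(["1", "3", "4"])))   # [0.125, 0.375, 0.5]
```
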
  47. Auditing Code • Check correctness from another angle ◦ Rule

    based ▪ Simply add up the numbers to see if they match (sketch below) ▪ Use another, simpler algorithm as a rough cross-check ◦ Data driven ▪ Sample intermediate data from normal runs; issue an alert when the runtime data distribution deviates
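
A minimal sketch of a rule-based audit, with hypothetical function names: the partitioned computation is cross-checked by a simpler independent rule, here that the partial sums must add up to the single-pass total.

```python
def partition_sum(values, num_parts):
    parts = [values[i::num_parts] for i in range(num_parts)]
    return [sum(p) for p in parts]           # the "real" partitioned computation

def audit(values, partial_sums, tol=1e-6):
    """Independent rule: the partial sums must match the single-pass total."""
    expected = sum(values)
    actual = sum(partial_sums)
    if abs(expected - actual) > tol:
        raise RuntimeError(f"audit failed: {actual} != {expected}")

values = [0.1 * i for i in range(1000)]
partial_sums = partition_sum(values, num_parts=4)
audit(values, partial_sums)
print("audit passed:", sum(partial_sums))
```
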
  48. Debug Code • As important as functional code! (if not

    more) • Essentially a high-level abstraction of the code's OUTPUT ◦ Not just for debugging ◦ A reversed tree structure, with samples at key nodes ◦ Grows intelligently with field practice • Maintenance effort should decrease over time ◦ The error handling/messaging system should mature over time ◦ Bugs should be fixed in the right direction, not just worked around