Software Design Practices for Large-Scale Automation
Design practices for large-scale, high-performance, distributed systems running complex algorithms such as graph processing, optimization, prediction, and machine learning.
simulation software to scale with Moore's law
• Internet Applications: systems need to be ready for the next 10x user growth and for feature evolution
• Knowledge Base: a bigger system improves cross-referencing and hence the quality of learning new knowledge
• Deep Learning: system capacity affects the quality of the latent features learned and hence the prediction capability
• Internet of Things: as the name suggests...
TOP challenge for software engineering
• Usually grows with the scale of the system
  ◦ exhibits different patterns at different scales
  ◦ explodes with the number of software features
• The only way to handle complexity
  ◦ “Divide and Conquer”
  ◦ realized by various Design Principles
just like human
  ◦ Results are stored in physical memory (RAM/ROM/Disk)
  ◦ Computation is done in physical processing units (CPU/GPU/FPGA)
• Not feasible to build one gigantic machine that solves everything
  ◦ System should live on machine farms
  ◦ Data / Computation should be distributed
• Physicality complicates the design of systems
  ◦ Data partition
  ◦ Computation partition
soul of large-scale systems
  ◦ the root of the abstraction hierarchy
  ◦ higher-level abstraction = better extensibility
• Hierarchization
  ◦ simplification of the system functionality graph
  ◦ ideally mapped into tree structures (no loops)
  ◦ the template for Object-Oriented Design
  ◦ needs a balance b/w delegation & check
complex logic
  ◦ API design for minimal interface
• Layerization
  ◦ algorithms divided into layers
  ◦ each layer handles a feature/algorithm
    ▪ layer 1: Graph partition and communication
    ▪ layer 2: Graph node property analysis
    ▪ layer 3: User operation on Graph nodes
    ▪ ...
at different levels
  ◦ Offers encapsulation and parallelism at different levels
  ◦ Crucial to choose the right computation paradigm
• Computation Paradigm at different levels
  ◦ Language level: Python, C, Scala
  ◦ Flow level: Imperative, Symbolic, Functional programming
  ◦ System level: Computation-centric (HPC) or Data-centric (e.g. Spark)
Scala / MapReduce
  ◦ Immutable data, stateless functions
• Pros
  ◦ Offers data-level parallelism
• Cons
  ◦ Data is read-only; an update requires making another copy
  ◦ More memory consumption; potential performance overhead
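A minimal sketch of the functional style described above (the slide names Scala / MapReduce; plain Python is used here for brevity): functions are stateless and data is never mutated, which is what makes the map step trivially parallelizable.

    from functools import reduce

    def square(x):        # stateless: the output depends only on the input
        return x * x

    def add(a, b):
        return a + b

    data = [1, 2, 3, 4]
    # map() can be parallelized freely because square() touches no shared state;
    # the input list is never modified, results land in a new collection.
    total = reduce(add, map(square, data))   # 30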
Theano / TensorFlow
  ◦ Operator-level parallelism
  ◦ Graph model as the base engine
• Pros
  ◦ Offers high operator parallelism through graph propagation
• Cons
  ◦ Not flexible enough for all programming tasks
  ◦ May incur overhead when handling fine-grained operators
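A toy deferred-execution sketch of the symbolic/graph paradigm; this is not the Theano or TensorFlow API, only an illustration of building a graph first and executing it later, so that an engine can analyze it and schedule independent operators in parallel.

    class Node:
        """Toy symbolic operator; nothing is computed until run() is called."""
        def __init__(self, op, inputs):
            self.op, self.inputs = op, inputs
        def run(self, feed):
            args = [i.run(feed) if isinstance(i, Node) else feed[i]
                    for i in self.inputs]
            return self.op(*args)

    # Build the graph first (no computation happens here) ...
    x_plus_y = Node(lambda a, b: a + b, ["x", "y"])
    doubled = Node(lambda s: s * 2, [x_plus_y])

    # ... an engine could now inspect the graph (e.g. run independent branches
    # in parallel) before feeding concrete data.
    print(doubled.run({"x": 3, "y": 4}))   # 14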
different levels
  ◦ Multi-threading
  ◦ Multi-process
  ◦ Distributed cluster
  ◦ Mainstream communication: MPI
• Partition based on needs of communication
  ◦ Minimize communication
  ◦ Algorithm partition
  ◦ Data partition
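A minimal mpi4py sketch (assuming mpi4py is available) of the "keep data local, minimize communication" idea: each rank works on its own data partition, and only the small per-rank result crosses the network. The round-robin partitioning by rank is illustrative only.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Data partition: each rank only touches its local shard.
    full_data = list(range(100))
    local_sum = sum(full_data[rank::size])

    # Only the tiny per-rank result is communicated, not the raw data.
    total = comm.allreduce(local_sum, op=MPI.SUM)
    if rank == 0:
        print(total)   # 4950

Run with e.g. mpiexec -n 4 python partition_sum.py.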
Components
  ◦ GPU acceleration (many small cores)
    ▪ Model too small: too much overhead, stays on the CPU
    ▪ Model too large: exceeds GPU memory, do partial acceleration
    ▪ Exchange memory with the CPU through memory copies
  ◦ FPGA (millions of gates)
  ◦ SSD, RAID 0/1/5/10
• Disk IO
  ◦ HDF5 parallel read/write
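A sketch of parallel HDF5 writes via h5py (this assumes h5py built against a parallel HDF5 plus mpi4py; the file and dataset names are made up): each rank writes a disjoint slice of the same dataset, so no single rank has to hold the whole array.

    from mpi4py import MPI
    import h5py

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Collective open: all ranks see the same file and dataset layout.
    with h5py.File("results.h5", "w", driver="mpio", comm=comm) as f:
        dset = f.create_dataset("scores", shape=(size, 1000), dtype="f8")
        # Each rank writes only its own row, so writes never conflict.
        dset[rank, :] = float(rank)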
central DB
  ◦ Serialization: boost::serialization (C++), pickle (Python)
• Scalable computation
  ◦ Usually has a scheduler
  ◦ Explicit scheduling: the user defines the computation graph nodes
  ◦ Implicit scheduling: the engine analyzes the computation graph
• Stateless
  ◦ Good for debugging, easy to recover from failure
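A small Python illustration of the serialization + stateless-executor point: once a task is reduced to bytes with pickle, any worker process can reconstruct and run it, and a failed worker can simply be restarted. The task fields here are hypothetical.

    import pickle

    task = {"function": "normalize", "shard_id": 7, "parameters": {"scale": 0.5}}

    blob = pickle.dumps(task)        # bytes that can go to a queue or a central DB
    restored = pickle.loads(blob)    # any stateless executor can pick this up
    assert restored == task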
Stochastic algorithms → use Data-centric model
    ▪ E.g. Back propagation: Parameter Server
  ◦ Deterministic algorithms → use Computation-centric (HPC) model
    ▪ E.g. Common data sync among model partitions: Bulk Synchronous Parallel
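A toy, single-process sketch of the two sync styles (class and function names are illustrative, not a real framework API): BSP combines all workers' results at a global barrier, while a parameter server lets each worker push and pull on its own schedule.

    class ParameterServer:
        """Async style: an update is applied whenever a worker pushes it."""
        def __init__(self, value=0.0):
            self.value = value
        def push(self, grad):
            self.value -= 0.1 * grad
        def pull(self):
            return self.value

    def bsp_round(local_grads):
        """BSP style: all partitions finish, then one global sync combines them."""
        return sum(local_grads) / len(local_grads)

    print(bsp_round([0.2, 0.4, 0.6]))     # one synchronized update, ~0.4

    ps = ParameterServer()
    for grad in [0.2, 0.4, 0.6]:          # workers push independently
        ps.push(grad)
    print(ps.pull())                      # ~-0.12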
task = function(data, parameters, executor_id)
  ◦ schema (base class) for task
  ◦ schema for any data
  ◦ schema for any function
  ◦ schema for any parameter
• Benefits
  ◦ higher-level automation
  ◦ potentially more intelligent system
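A sketch of such a task schema in Python (the class and field names are assumptions, not a specific framework): once every job fits the same shape, a generic engine can schedule, log, and retry all of them uniformly, which is what enables higher-level automation.

    from dataclasses import dataclass, field
    from typing import Any, Callable, Dict

    @dataclass
    class Task:
        function: Callable[..., Any]                               # schema for any function
        data: Any                                                  # schema for any data
        parameters: Dict[str, Any] = field(default_factory=dict)   # schema for any parameter
        executor_id: int = 0

        def run(self):
            return self.function(self.data, **self.parameters)

    t = Task(function=lambda xs, scale=1.0: [scale * x for x in xs],
             data=[1, 2, 3], parameters={"scale": 2.0}, executor_id=3)
    print(t.run())   # [2.0, 4.0, 6.0]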
setter / getter
  ◦ encapsulate atomic or repeated functionality
  ◦ #define any hard-coded number
  ◦ factor long functions or classes into smaller pieces
  ◦ build shared libraries bottom-up
    ▪ communication lib
    ▪ parallel computing lib
    ▪ debug / reporting lib
by folder
  ◦ module-level decoupling by file
  ◦ variable-space decoupling by namespace
• Code change
  ◦ physical changes (files/folders touched) should reflect logical changes
  ◦ change scope should narrow as development proceeds
  ◦ diff management
for performance
  ◦ Code runs in memory, not in the air
• OS memory handling
  ◦ Memory allocation, fragmentation, release, etc.
  ◦ tcmalloc vs. jemalloc
    ▪ Improve allocation/fragmentation behavior
    ▪ Still have issues releasing memory back to the OS
    ▪ Memory critical: tcmalloc / jemalloc
    ▪ Memory and performance critical: manual memory management (MMU)
  ◦ HPC is memory and performance critical
    ▪ Parallelism does not solve every problem; single-machine performance is still the dominant factor
    ▪ You must know the code very well to design a manual MMU
  ◦ Spark is replacing JVM memory management with the Tungsten project
    ▪ shared memory, easy data exchange
  ◦ multi-process
    ▪ heavier overhead
    ▪ separate memory spaces, more difficult data exchange
  ◦ distributed, multiple machines
    ▪ balance computation vs. communication
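A small Python comparison of the first two levels (illustrative only; on some platforms multiprocessing code has to sit under an if __name__ == "__main__" guard): threads share one address space, while processes pay pickle/copy overhead in exchange for real CPU parallelism.

    from multiprocessing import Pool
    from threading import Thread

    def work(chunk):
        return sum(x * x for x in chunk)

    data = list(range(1000))
    chunks = [data[i::4] for i in range(4)]

    # Threads: shared memory, cheap data exchange, but CPython's GIL limits
    # CPU-bound parallelism.
    results = []
    threads = [Thread(target=lambda c=c: results.append(work(c))) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()

    # Processes: real parallelism, but arguments and results are pickled and
    # copied between separate memory spaces.
    with Pool(4) as pool:
        results2 = pool.map(work, chunks)

    print(sum(results), sum(results2))   # both 332833500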
is the key information to stitch each piece together
  ◦ sync data to resemble the single-machine algorithm (rare, but can be useful)
  ◦ keep data local, sync results (map/reduce)
• When to sync?
  ◦ lazy sync (e.g. Bulk Synchronous Parallel)
  ◦ async (e.g. Parameter Server)
• Where to sync?
  ◦ refactor the algorithm around the optimal sync points
  ◦ for many algorithms, lack of global sync leads to QoR loss
  ◦ full global sync is very expensive in communication cost
  ◦ carefully choose sync points to maximize performance gain per unit of QoR loss
• Self-healing algorithms
  ◦ some algorithms depend less on global sync
  ◦ e.g. in stochastic optimization
    ▪ global sync may be postponed to let local optima be explored
    ▪ however, this nice property is data / model dependent
on QoR?
    ▪ approximation is inevitable, so what can be approximated?
    ▪ not just an engineering problem
    ▪ usually needs an assessment of business impact
  ◦ Solutions
    ▪ for each approximation candidate, detailed profiling of QoR loss vs. performance gain vs. business impact
to maintain?
    ▪ Stochastic algorithms: determinism exists only in probability values (distributions)
    ▪ Graph algorithms: hard to trace in a large-scale graph
  ◦ Solutions (see the sketch below)
    ▪ develop the single-machine algorithm first as the golden reference
    ▪ detailed testing and correlation for each parallelization step
    ▪ detailed testing to understand result/error patterns on small data
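A minimal illustration of the "golden single-machine run" practice (the mean computation stands in for a real algorithm): every parallelization step is correlated against the trusted reference on small data before scaling up.

    import random

    def golden_mean(xs):
        """Single-machine reference implementation: trusted and easy to debug."""
        return sum(xs) / len(xs)

    def distributed_mean(xs, n_parts=4):
        """Stand-in for the parallel version: per-partition partial results."""
        parts = [xs[i::n_parts] for i in range(n_parts)]
        return sum(sum(p) for p in parts) / sum(len(p) for p in parts)

    data = [random.random() for _ in range(1000)]
    # The tolerance allows for floating-point reordering across partitions.
    assert abs(distributed_mean(data) - golden_mean(data)) < 1e-9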
of complexity
  ◦ iterations need to be tuned, or completely re-designed
  ◦ may become harder to converge
• Tuning iterations
  ◦ Again, where to iterate?
    ▪ spend runtime on the key gainers
    ▪ profile iterations vs. QoR gain
  ◦ Tuning knobs for convergence
    ▪ iteration knobs have a very high impact on convergence
    ▪ profile convergence parameters vs. runtime vs. QoR
Language)
“In these distributed computation engines, the shuffle refers to the repartitioning and aggregation of data during an all-to-all operation. Understandably, most performance, scalability, and reliability issues that we observe in production Spark deployments occur within the shuffle.”
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double
Spark Language)
  ◦ unified partition vs. per-stage partition
    ▪ per-stage partition fits the algorithm better, but requires data migration
  ◦ global partition vs. stream partition
    ▪ global partition fits the algorithm better, but requires a single machine to hold all data for partitioning
    ▪ stream partition + post-partition adjustment
Spark Language)
  ◦ QoR numerically depends on the number of partitions
    ▪ direct partitioning has numerical stability problems
    ▪ fine-grained partition + post-partition coarsening is better
• Solutions
  ◦ Hard to use a standard library for a high-performance system
  ◦ The best-performing system is customized for:
    ▪ Data volume
    ▪ Computation intensity
    ▪ (Multi-stage) algorithm parallelism
  ◦ Always keep a golden single-machine run, even if only for small input data!
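A toy sketch of stream partitioning (the greedy heuristic and names are illustrative only): records are assigned in one pass to the currently lightest partition, so no single machine needs a global view; a post-partition adjustment step can then move records to fix any remaining imbalance.

    def stream_partition(records, n_parts):
        parts = [[] for _ in range(n_parts)]
        sizes = [0] * n_parts
        for rec, weight in records:
            i = sizes.index(min(sizes))   # single pass, no global view of the data
            parts[i].append(rec)
            sizes[i] += weight
        return parts, sizes

    records = [("a", 5), ("b", 1), ("c", 4), ("d", 2), ("e", 3)]
    parts, sizes = stream_partition(records, 2)
    print(parts, sizes)   # [['a', 'd'], ['b', 'c', 'e']] [7, 8]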
  ◦ A high-level model of the system
  ◦ Saves time in debugging
  ◦ Saves the business in a crisis
• Throughout the software lifecycle
  ◦ Development: test-driven development
  ◦ Deployment: handles discrepancies b/w the user env and the dev env
  ◦ Maintenance: predicts errors, learns from failures, improves the system
Good to have:
  ◦ Cases run through
  ◦ Information on internal data, sometimes
• Too much of it?
  ◦ hurts performance
• Need a balance
  ◦ External input data → sanity check (see the sketch below)
  ◦ Internal data → no checks inside the high-performance engine; system design and code should ensure correctness there
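A small illustration of checking external input only at the boundary (field names and value ranges are hypothetical): once records pass this gate, the engine trusts them and skips per-record checks on the hot path.

    def load_external(records):
        clean = []
        for r in records:
            if "id" not in r or r.get("value") is None:
                raise ValueError(f"malformed input record: {r}")
            if not 0.0 <= r["value"] <= 1.0:
                raise ValueError(f"value out of expected range: {r}")
            clean.append(r)
        return clean

    print(load_external([{"id": 1, "value": 0.3}, {"id": 2, "value": 0.9}]))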
based
    ▪ Simply add up the numbers to see if they match
    ▪ Use another, simpler algorithm that does a rough check
  ◦ Data driven
    ▪ Sample intermediate data from normal runs; issue an alert when the runtime data distribution differs
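A sketch of the data-driven check (the baseline values and the 3-sigma threshold are assumptions): intermediate statistics sampled from known-good runs form a baseline, and an alert fires when runtime samples drift away from it.

    from statistics import mean, stdev

    baseline = [0.48, 0.51, 0.50, 0.52, 0.49, 0.50, 0.47, 0.53]   # from normal runs
    mu, sigma = mean(baseline), stdev(baseline)

    def check_sample(runtime_values):
        m = mean(runtime_values)
        if abs(m - mu) > 3 * sigma:
            print(f"ALERT: distribution shift, mean {m:.3f} vs baseline {mu:.3f}")

    check_sample([0.50, 0.49, 0.51])   # within baseline, stays quiet
    check_sample([0.91, 0.88, 0.95])   # drifted, fires the alert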
more)
• Essentially a high-level abstraction of the code's OUTPUT
  ◦ Not just for debugging
  ◦ A reversed tree structure, with samples on key nodes
  ◦ Grows intelligently with field practice
• Maintenance effort should decrease over time
  ◦ The error handling/messaging system should mature over time
  ◦ Bugs should be fixed in the right direction, not just worked around