Architecting Immediacy: The Design of a High-Performance Portable Wrangling Engine

Talk at Strata San Jose 2016 on the Trifacta Photon in-memory compute framework

Joe Hellerstein

March 31, 2016

Transcript

  1. Architecting Immediacy: The Design of a High-Performance Portable Wrangling Engine. PHOTON. Joe Hellerstein and Seshadri Mahalingam
  2. Outline: (1) Wrangling and Immediacy; (2) Technical Challenges; (3) Architectural Context; (4) JavaScript and its Discontents; (5) Photon: A Clean Slate Design
  3. Turn of the Century Data Transformation: - Schemas and annotations; - Box-and-arrow programming; - Batch execution
  4. Research Roots: Potter’s Wheel, 2001: + Real data, sampled on the fly; + Menu-driven transforms; + Immediate execution and feedback [Raman & Hellerstein, VLDB01]
  5. Research Roots: Open Source Data Wrangler, 2011: – Browser-sized data sets; + Predictive Transformation; + Immediate feedback on multiple choices [Kandel, Heer & Hellerstein, CHI 11]
  6. 2016: Data Wrangling and Immediacy. DEMO. Grown from research roots: + Deeper and broader Predictive Transformation; + Sample-to-Scale with Intelligent Execution; + Interactive Visual Profiling. Multi-faceted immediacy!
  7. The Value of Immediacy: Immediate, step-by-step feedback during data transformation. Understand options; Confirm intent; Assess progress.
  8. The Value of Immediacy [Miller ’68, Card ’91, Nielsen ’93]. Figure: elapsed time (secs) on an axis from 0 to 10, with regions labeled “Instantaneous”, “User flow” and “User attention”. Batch processing is in conflict with these goals!
  9. Outline: (1) Wrangling and Immediacy; (2) Technical Challenges; (3) Architectural Context; (4) JavaScript and its Discontents; (5) Photon: A Clean Slate Design
  10. Technical Challenges: Performance: Stay in the Flow [Miller ’68, Card ’91, Nielsen ’93]. Figure: elapsed time (secs) on an axis from 0 to 10, with regions labeled “Instantaneous”, “User flow” and “User attention”.
  11. Technical Challenges: The Dirty Details of Dirty Data (six numbered items): Ambiguous Schemas; Heavy String Processing; Ambiguous Types; Noise & Exceptions; Limited Filtering; Rich Transforms
  12. This is the New World of Data Wrangling: Stay in the Flow; Scale the Unknown; Expect the Unexpected
  13. Outline: (1) Wrangling and Immediacy; (2) Technical Challenges; (3) Architectural Context; (4) JavaScript and its Discontents; (5) Photon: A Clean Slate Design
  14. Flexible Deployments: Trifacta Wrangler App; On Premises (DSL, Compiler, Intelligent Execution); Hosted (DSL, Compiler, Intelligent Execution); AWS (DSL, Compiler, Intelligent Execution)
  15. Outline: (1) Wrangling and Immediacy; (2) Technical Challenges; (3) Architectural Context; (4) JavaScript and its Discontents; (5) Photon: A Clean Slate Design
  16. Why a JS Wrangling Engine? Stanford Wrangler team was deep in JS: easy to prototype & add new features; continued to attract top JS talent. JS runs in the browser …and on the desktop too: JS is the way people-centric data apps are built today! Dynamically typed: JS deals well with ambiguous types and structures found while wrangling.
  17. Why a JS Wrangling Engine? Stanford Wrangler team was deep in JS: easy to prototype & add new features; continued to attract top JS talent. JS runs in the browser …and on the desktop too: JS is the way people-centric data apps are built today! Dynamically typed: JS deals well with ambiguous types and structures found while wrangling. But does it perform? Does it scale?
  18. Jeff Heer’s Datavore Prototype: in-memory column-oriented database. Datavore can complete queries over million-element data tables at interactive (sub-100ms) rates.
  19. Hard to extend, performantly: You will have to modify the guts of the engine to add new aggregate operators: add new logic to the inner loop of the query processor (for both dense and sparse queries).
  20. Type ambiguity: Inconsistencies with strongly typed execution engines. Core engine slows down: switch (typeof inputValue) { case 'string': return inputValue; case 'number': return inputValue + '';
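The truncated `switch` on slide 20 points at per-value type dispatch inside the engine's inner loop. A minimal sketch of that pattern follows. The function name `toStringValue` and every branch other than 'string' and 'number' are assumptions made for illustration; they are not taken from the Trifacta code.

```javascript
// Hypothetical helper: coerce one cell to a string by dispatching on its
// runtime type. Only the 'string' and 'number' cases appear on the slide;
// the remaining branches are illustrative assumptions.
function toStringValue(inputValue) {
  switch (typeof inputValue) {
    case 'string':
      return inputValue;
    case 'number':
      return inputValue + '';
    case 'boolean':
      return inputValue ? 'true' : 'false';
    default:
      // null, undefined and objects are treated as missing (assumption).
      return '';
  }
}

// A column holding mixed types pays for this branch on every single value.
const mixedColumn = ['42', 42, true, null];
const asStrings = mixedColumn.map(toStringValue);
console.log(asStrings); // [ '42', '42', 'true', '' ]
```

The point of the sketch is the cost model: the dispatch runs once per cell, so it sits on the hottest path of any scan over an ambiguously typed column.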
  21. Code Generation is deceptively easy. Function bodies can be analyzed: Function.prototype.toString(). Functions can be generated: new Function(functionBody). Inlining function calls brought ~15-20% speed up. Difficult to maintain & debug. Modest wins don’t justify the maintenance cost.
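The two built-ins named on slide 21 can be combined roughly as follows. This is a sketch under stated assumptions, not the generator used in the talk: `addTax`, `mapInlined` and the regular expression are hypothetical, and the regex only handles functions of the exact shape `function (p) { return expr; }`.

```javascript
// Hypothetical user-supplied scalar function.
const addTax = function (x) { return x * 1.08; };

// 1. Analyze: recover the source text, then extract parameter and expression.
//    The pattern below is an assumption and covers only this simple shape.
const source = Function.prototype.toString.call(addTax);
const match = /^function\s*\w*\s*\((\w+)\)\s*\{\s*return\s+([^;]+);?\s*\}$/.exec(source);
if (!match) throw new Error('unsupported function shape');
const [, param, expr] = match;

// 2. Generate: splice the expression into the loop body so the per-element
//    function call disappears.
const inlinedExpr = expr.replace(new RegExp('\\b' + param + '\\b', 'g'), 'input[i]');
const mapInlined = new Function(
  'input',
  'output',
  'for (var i = 0; i < input.length; i++) { output[i] = ' + inlinedExpr + '; }' +
    ' return output;'
);

// Both paths evaluate the same expression, so the results agree.
const input = [100, 200];
const generic = input.map(addTax);
const inlined = mapInlined(input, new Array(input.length));
console.log(inlinedExpr); // input[i] * 1.08
console.log(generic, inlined);
```

Even in this toy form the maintenance cost noted on the slide is visible: the generated loop exists only as a string, so it is invisible to linters and awkward to step through in a debugger.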
  22. JavaScript runtime: fast, but not easily tamed. Difficult to exploit parallelism via multi-core processors. Garbage-collected language pain points: value copies, object allocation and garbage collection. Small inputs easily blow up memory usage.
  23. Escape from JS: A Study of Alternatives. DSL, Compiler, Intelligent Execution. Interaction via roundtrip to a server (User-interface, Server).
  24. A Clean Slate. Go for performance? Maybe write our own C++/LLVM-based query engine! Java? Not viable in the browser. Portability a primary concern: Browser + Single-node Server. But … portability? LLVM
  25. Surprise: What can’t a browser do these days? Native code execution frameworks: Chrome’s Portable Native Client (PNaCl); asm.js; Upcoming: WebAssembly. Run compiled C and C++ code (LLVM): No JIT overhead; Dense data representations; No unpredictable garbage collection.
  26. We went with Chrome’s PNaCl. Portable Native Client (PNaCl): LLVM-based, cross-compilation style toolchain; Plenty of ports for popular vendor libraries (webports). https://developer.chrome.com/native-client
  27. Outline: (1) Wrangling and Immediacy; (2) Technical Challenges; (3) Architectural Context; (4) JavaScript and its Discontents; (5) Photon: A Clean Slate Design
  28. Engineering Process. Step 1: Start by establishing “Speed of Light”. Step 2: Then add features for functionality and extensibility. Step 3: Compare to speed of light; compromise judiciously. Step 4: Iterate, back to step 1.
  29. Inspiration & Reading (Impala, HyPer, Tupleware, Spark): ➔ Serializable & human-readable description of data flow ➔ Partitioned (sharded) data flow ➔ Strategies for complex transformations ➔ Data locality maximization
  30. In-memory data layout. Diagram: a Table with Table Metadata and Row Metadata; Column 0, Column 1 and Column 2, each with its own Column Metadata and stored as Chunk 0 and Chunk 1.
  31. In-memory data layout. Diagram: two Row Batches, each with its own Row Metadata; one groups the Chunk 0 of every column, the other groups the Chunk 1 of every column. http://arrow.apache.org/
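The layout on slides 30 and 31 can be approximated with typed arrays. The sketch below is an illustration in JavaScript, not Photon's C++ data structures: `CHUNK_ROWS`, `makeColumn`, `rowBatch` and all metadata field names are hypothetical.

```javascript
// Assumed chunk size; a real engine would tune this to cache and batch sizes.
const CHUNK_ROWS = 4;

// Build a chunked column of 64-bit floats from a plain array.
// Each chunk is a dense typed array rather than an array of boxed values.
function makeColumn(name, values) {
  const chunks = [];
  for (let start = 0; start < values.length; start += CHUNK_ROWS) {
    chunks.push(Float64Array.from(values.slice(start, start + CHUNK_ROWS)));
  }
  return { metadata: { name, type: 'float64' }, chunks };
}

// A row batch groups chunk k of every column with per-batch row metadata.
function rowBatch(table, k) {
  return {
    rowMetadata: {
      firstRow: k * CHUNK_ROWS,
      rowCount: table.columns[0].chunks[k].length,
    },
    chunks: table.columns.map((col) => col.chunks[k]),
  };
}

const table = {
  metadata: { rowCount: 6 },
  columns: [
    makeColumn('price', [1, 2, 3, 4, 5, 6]),
    makeColumn('qty', [10, 20, 30, 40, 50, 60]),
  ],
};

const batch1 = rowBatch(table, 1);
console.log(batch1.rowMetadata); // { firstRow: 4, rowCount: 2 }
console.log(Array.from(batch1.chunks[1])); // [ 50, 60 ]
```

Because a row batch only references existing chunks, forming one copies no cell data, which is one way to keep the memory footprint a small multiple of the dataset size.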
  32–46. Execution (fifteen animation frames of one diagram). A Thread Pool with threads T0 and T1 drives an operator pipeline of Source, Filter, Map, Agg, Map, Sink. Operators are linked Producer to Consumer by Pipes, with a Barrier in the pipeline. Row Batches 0 through 3 are picked up by T0 and T1 and advance through the operators frame by frame.
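The operator chain in the execution frames can be sketched as a pull-based pipeline over row batches. This is a single-threaded JavaScript illustration only: the thread pool (T0, T1) is omitted, all function names are hypothetical, and treating the aggregate as the Barrier is an assumption drawn from the diagram labels rather than something the slides state.

```javascript
// Source: emit row batches (plain arrays here) of a fixed, assumed size.
function* source(rows, batchSize) {
  for (let i = 0; i < rows.length; i += batchSize) {
    yield rows.slice(i, i + batchSize);
  }
}

// Filter and Map are streaming: each batch flows through independently.
function* filter(batches, predicate) {
  for (const batch of batches) yield batch.filter(predicate);
}
function* map(batches, fn) {
  for (const batch of batches) yield batch.map(fn);
}

// Agg acts as a barrier: it consumes every upstream batch before it can
// emit its single result batch downstream.
function* aggSum(batches) {
  let total = 0;
  for (const batch of batches) {
    for (const value of batch) total += value;
  }
  yield [total];
}

// Sink: collect whatever reaches the end of the pipeline.
function sink(batches) {
  const out = [];
  for (const batch of batches) out.push(...batch);
  return out;
}

// Source -> Filter -> Map -> Agg -> Map -> Sink, as in the diagram.
const rows = [1, 2, 3, 4, 5, 6, 7, 8];
const evens = filter(source(rows, 2), (x) => x % 2 === 0);
const scaled = map(evens, (x) => x * 10);
const summed = aggSum(scaled);
const result = sink(map(summed, (sum) => sum + 1));
console.log(result); // [ 201 ]
```

In a multi-threaded engine the streaming stages before the barrier are where separate threads can work on different row batches at once; the generators here only show the data dependencies.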
  47. Development Toolchain. CMake: Handles cross-compilation toolchain for PNaCl & Desktop. Modern C++11/14: std::unique_ptr & std::shared_ptr simplify memory management; Standard threading library; New container classes. LLVM Clang libraries: ASAN (Address Sanitizer); TSAN (Thread Sanitizer); Useful for catching large classes of bugs. Google C++ libraries: googletest; googlemock; benchmark.
  48. Design Requirements: What is your desired workload? Low memory footprint: % of dataset size. Immediate Feedback: < 1 second response time.
  49. Lessons learned: ➔ The right engine for the right job ➔ Photon is yet another payoff of (DSL + compiler + intelligent execution) ➔ Alongside Spark and Hadoop ➔ Photon specialized for UX: immediacy and scale ➔ Start by establishing speed-of-light ➔ Design to ensure it remains achievable as you grow ➔ Memory management with portability ➔ LLVM + toolchain form a powerful portability platform. ➔ Explicit memory management is critical for immediacy, even more so in-browser
  50. Part of Trifacta’s Intelligent Execution: Alongside the distributed processing of Spark and Hadoop. Portability: Browser, Desktop, Single-Node Server. Lean usage of client memory: Efficient and predictable. Photon performance preserves user flow: No loss of context. Immediate data wrangling: Fluid UX with predictions, profiles, previews.