
Python data-science going functional

Functional concepts are on the rise. In the big data space, this has been known ever since MapReduce took the stage, and nowadays Spark's resilient distributed datasets (RDDs) are a prime example of how functional, immutable concepts translate into successful architectures that we can leverage from Python.

In this presentation I show how a functional approach to programming leads to formulating computation problems as directed acyclic graphs (DAGs), which lets innovative libraries like Dask provide an abstraction over the limitations of the von Neumann architecture.

Holger Peters

April 16, 2016

Transcript

  1. FP: An Umbrella Term (https://flic.kr/p/wPKKto)

     • expressions instead of statements
     • evaluation instead of instruction
     • no memory mutation
     • programs are built by composing functions
     • pure (side-effect free) functions
  2. FP: A Constant Innovator. Python features originating in functional programming:

     • Garbage collection (1959, LISP by John McCarthy)
     • List comprehensions (1969, SETL)
     • Lazy evaluation (1976, Friedman & Wise), i.e. yield and generator expressions
     • The interactive prompt
  3. Imperative:

         lst = []
         for i in range(10):
             lst.append(str(i))

     Functional:

         [str(i) for i in range(10)]
         map(str, range(10))

     (diagram: the elements 0, 1, 2, 3, 4 mapped to "0", "1", "2", "3", "4")
  4. A numerically more stable summation:

         >>> xs = [10.0, 15.0, 20.0, 3.0]
         >>> mx = max(xs)
         >>> q = [x / mx for x in xs]
         >>> s = sum(q) * mx

     Scaling by the maximum keeps the intermediate values small, which avoids overflow for extreme inputs (see the sketch below).
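     A hypothetical demonstration (not from the deck) of what "more stable" buys here, with made-up values chosen to force overflow:

         xs = [1e308, 1e308, -1e308]
         print(sum(xs))                        # inf: 1e308 + 1e308 overflows
         mx = max(xs)
         print(sum(x / mx for x in xs) * mx)   # 1e308: intermediates stay near 1.0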
  5. Python-Haskell Side-By-Side

     Python:

         >>> xs = [10.0, 15.0, 20.0, 3.0]
         >>> mx = max(xs)
         >>> q = [x / mx for x in xs]
         >>> s = sum(q) * mx
         >>> q
         [0.5, 0.75, 1.0, 0.15]

     Haskell:

         >>> xs = [10.0, 15.0, 20.0, 3.0] :: [Double]
         >>> mx = maximum xs
         >>> q = [x / mx | x <- xs]
         >>> s = sum q * mx
         >>> :sprint q
         q = _

     :sprint inspects Haskell's internal evaluation state; _ represents a not-yet-evaluated value.
  6. The same Haskell session:

         >>> xs = [10.0, 15.0, 20.0, 3.0] :: [Double]
         >>> mx = maximum xs
         >>> q = [x / mx | x <- xs]
         >>> s = sum q * mx
         >>> :sprint q
         q = _

     (diagram: dependency graph — q depends on xs and on mx = max xs via the array division x / mx; s depends on sum q and mx)

     Haskell's functions are pure; evaluation is lazy per default. Hint: we'll see how this is possible in Python, too!
  7. Continuing the session:

         >>> :sprint q
         q = _
         >>> :sprint mx
         mx = _
         >>> q !! 0        -- Python: q[0]
         0.5
         >>> q !! 1        -- Python: q[1]
         0.75
         >>> :sprint q
         q = 0.5 : 0.75 : _
         >>> :sprint mx
         mx = 20.0
         >>> s
         48.0
         >>> :sprint q
         q = [0.5,0.75,1.0,0.15]

     Everything starts out lazy (_); indexing triggers evaluation, the list is only partially evaluated, and s finally forces all of it.
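     For comparison, a minimal sketch (not from the deck) of the same on-demand behaviour in Python, using a generator expression:

         xs = [10.0, 15.0, 20.0, 3.0]
         mx = max(xs)
         q = (x / mx for x in xs)   # generator expression: no division yet
         print(next(q))             # 0.5  -- first element computed on demand
         print(next(q))             # 0.75 -- second element computed on demand

     Unlike Haskell's q, the generator can be consumed only once: it is lazy, but not a persistent value.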
  8. FP: A traveller's dictionary

     Imperative                  | Functional
     ----------------------------|--------------------------------------
     method, procedure           | pure function, higher-order function
     mutable object              | immutable value
     execute statements          | evaluate expressions
     loop                        | higher-order functions, recursion
     direct execution (machine)  | lazy or eager evaluation
     break / continue / goto     | (continuation)

     (the loop row is illustrated in the sketch below)
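     A small sketch (not from the deck) of the loop row: the same sum phrased imperatively and with a higher-order function:

         from functools import reduce

         # Imperative: a loop mutating an accumulator.
         total = 0
         for x in [1, 2, 3, 4]:
             total += x

         # Functional: the same reduction via the higher-order function reduce.
         total_fp = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)

         assert total == total_fp == 10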
  9. C/Fortran Style Numerics

     • in-place memory mutation
     • inner-loop optimisation
     • focus on machine layout (x86)
     • FORTRAN: John Backus, received the Turing Award in 1977

     (image by SciCompTeacher [Public domain], via Wikimedia Commons)
  10. A LAPACK routine in full:

            SUBROUTINE CGTSV( N, NRHS, DL, D, DU, B, LDB, INFO )
      *     .. Scalar Arguments ..
            INTEGER            INFO, LDB, N, NRHS
      *     ..
      *     .. Array Arguments ..
            COMPLEX            B( LDB, * ), D( * ), DL( * ), DU( * )
      *
      *  Purpose
      *  =======
      *
      *  CGTSV solves the equation
      *
      *     A*X = B,
      *
      *  where A is an N-by-N tridiagonal matrix, by Gaussian elimination with
      *  partial pivoting.
      *
      *  Note that the equation A**H *X = B may be solved by interchanging the
      *  order of the arguments DU and DL.
      *
      *     .. Parameters ..
            COMPLEX            ZERO
            PARAMETER          ( ZERO = ( 0.0E+0, 0.0E+0 ) )
      *     .. Local Scalars ..
            INTEGER            J, K
            COMPLEX            MULT, TEMP, ZDUM
      *     .. Intrinsic Functions ..
            INTRINSIC          ABS, AIMAG, MAX, REAL
      *     .. External Subroutines ..
            EXTERNAL           XERBLA
      *     ..
      *     .. Statement Functions ..
            REAL               CABS1
      *     ..
      *     .. Statement Function definitions ..
            CABS1( ZDUM ) = ABS( REAL( ZDUM ) ) + ABS( AIMAG( ZDUM ) )
            INFO = 0
            IF( N.LT.0 ) THEN
               INFO = -1
            ELSE IF( NRHS.LT.0 ) THEN
               INFO = -2
            ELSE IF( LDB.LT.MAX( 1, N ) ) THEN
               INFO = -7
            END IF
            IF( INFO.NE.0 ) THEN
               CALL XERBLA( 'CGTSV ', -INFO )
               RETURN
            END IF
            IF( N.EQ.0 )
           $   RETURN
            DO 30 K = 1, N - 1
               IF( DL( K ).EQ.ZERO ) THEN
                  IF( D( K ).EQ.ZERO ) THEN
                     INFO = K
                     RETURN
                  END IF
               ELSE IF( CABS1( D( K ) ).GE.CABS1( DL( K ) ) ) THEN
                  MULT = DL( K ) / D( K )
                  D( K+1 ) = D( K+1 ) - MULT*DU( K )
                  DO 10 J = 1, NRHS
                     B( K+1, J ) = B( K+1, J ) - MULT*B( K, J )
         10       CONTINUE
                  IF( K.LT.( N-1 ) )
           $         DL( K ) = ZERO
               ELSE
                  MULT = D( K ) / DL( K )
                  D( K ) = DL( K )
                  TEMP = D( K+1 )
                  D( K+1 ) = DU( K ) - MULT*TEMP
                  IF( K.LT.( N-1 ) ) THEN
                     DL( K ) = DU( K+1 )
                     DU( K+1 ) = -MULT*DL( K )
                  END IF
                  DU( K ) = TEMP
                  DO 20 J = 1, NRHS
                     TEMP = B( K, J )
                     B( K, J ) = B( K+1, J )
                     B( K+1, J ) = TEMP - MULT*B( K+1, J )
         20       CONTINUE
               END IF
         30 CONTINUE
            IF( D( N ).EQ.ZERO ) THEN
               INFO = N
               RETURN
            END IF
            DO 50 J = 1, NRHS
               B( N, J ) = B( N, J ) / D( N )
               IF( N.GT.1 )
           $      B( N-1, J ) = ( B( N-1, J )-DU( N-1 )*B( N, J ) ) / D( N-1 )
               DO 40 K = N - 2, 1, -1
                  B( K, J ) = ( B( K, J )-DU( K )*B( K+1, J )-DL( K )*
           $                  B( K+2, J ) ) / D( K )
         40    CONTINUE
         50 CONTINUE
            RETURN
            END

      LAPACK routine (version 3.3.1). LAPACK is a software package provided by Univ. of Tennessee, Univ. of California Berkeley, Univ. of Colorado Denver and NAG Ltd.
  11. Abstract of John Backus' "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs" (ACM Turing Award 1977):

      "Conventional programming languages are growing ever more enormous, but not stronger. Inherent defects at the most basic level cause them to be both fat and weak: their primitive word-at-a-time style of programming inherited from their common ancestor — the von Neumann computer, their close coupling of semantics to state transitions, their division of programming into a world of expressions and a world of statements, [...], and their lack of useful mathematical properties for reasoning about programs."
  12. Calculating a Linear Regression

      Model: ŷᵢ = w₀ + w₁xᵢ₁ + w₂xᵢ₂ + w₃xᵢ₃ + …, with design matrix X = (xᵢⱼ), targets y = (yᵢ), and weights w = (wᵢ).

      Estimation: w = (XᵀX)⁻¹ Xᵀ y

      Fit:

          def linreg(X, y):
              return np.linalg.inv(X.T @ X) @ X.T @ y

          def linreg_pinv(X, y):
              return np.linalg.pinv(X) @ y

      Linear algebra is not word-at-a-time; is it proper to think in "von Neumann" style when implementing a linear regression?
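      A hypothetical usage sketch (not from the deck) with a made-up design matrix, first column of ones for the intercept:

          import numpy as np

          X = np.array([[1.0, 0.0],
                        [1.0, 1.0],
                        [1.0, 2.0]])   # intercept column + one feature
          y = np.array([1.0, 3.0, 5.0])

          w = np.linalg.pinv(X) @ y    # least-squares fit: w ≈ [1.0, 2.0]
          y_hat = X @ w                # predictions ŷ = Xw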
  13. When FP is the only viable option (https://flic.kr/p/D2Dti)

      What happens once fundamental assumptions suddenly don't matter any more?
  14. My Data Does Not Fit Into Machine Memory

      "As of June 2015, the dump of all pages with complete edit history in XML format [...] is about 100 GB compressed [...], and 10 TB uncompressed."

      We need methods that do not assume random memory access at all times.
  15. Meet Dask (https://flic.kr/p/o4vNCv)

      • pure Python library
      • API similar to NumPy
      • does not assume that all data is in RAM (out-of-core)
  16. Idea 1: Evaluation as a directed acyclic graph (DAG)

      (diagram: the expression sum(x / mx) * mx with mx = max(x), drawn as a DAG — nodes are the values x, mx = max(x), q = x / mx, s = sum(q), and the result s * mx; edges are data flow)
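      A minimal sketch (not from the deck) of this DAG spelled as a Dask task graph in dask's dictionary format; the key names are made up:

          from operator import mul
          from dask import get  # dask's simple synchronous scheduler

          x = [10.0, 15.0, 20.0, 3.0]
          dsk = {
              "x": x,
              "mx": (max, "x"),                                 # mx = max(x)
              "q": (lambda xs, m: [v / m for v in xs], "x", "mx"),  # q = x / mx
              "s": (sum, "q"),                                  # s = sum(q)
              "result": (mul, "s", "mx"),                       # s * mx
          }
          print(get(dsk, "result"))  # 48.0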
  17. Idea 2: Chunks, Map-Reduce Style

      (diagram: x is split into chunks x1 and x2; each chunk is divided by the global mx, giving q1 and q2; the partial sums s1 = sum(q1) and s2 = sum(q2) are combined as s1 + s2)
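      A pure-Python sketch (not from the deck; chunk contents made up) of the chunked evaluation:

          def stable_sum_chunked(chunks):
              mx = max(max(chunk) for chunk in chunks)                 # reduce: global maximum
              partials = [sum(v / mx for v in chunk) for chunk in chunks]  # map over chunks
              return sum(partials) * mx                                # reduce: combine partials

          x1, x2 = [10.0, 15.0], [20.0, 3.0]
          print(stable_sum_chunked([x1, x2]))  # 48.0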
  18. Evaluation Strategy 1: Minimise Memory Usage

      (diagram: the chunked graph is evaluated step by step; as soon as a chunk's consumers have run, the chunk is discarded, so only a small part of the data is held in memory at any time)
  19. Computation Graphs w/ Dask

          def stable_sum(x):
              mx = x.max()
              return (x / mx).sum() * mx

          s_nump = stable_sum(numpy_array)
          s_dask = stable_sum(dask_array).compute()

      The same expression works on a NumPy array (eager) and on a Dask array (lazy, evaluated by .compute()).
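      A minimal runnable sketch (not from the deck) of the missing setup; the array contents are made up, and da.from_array wraps an existing NumPy array into chunks:

          import numpy as np
          import dask.array as da

          numpy_array = np.array([10.0, 15.0, 20.0, 3.0])
          dask_array = da.from_array(numpy_array, chunks=2)  # chunks of length 2

          assert np.allclose(stable_sum(numpy_array),
                             stable_sum(dask_array).compute())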
  20. (benchmark plot: array length 5e8, 10 runs per box, in memory, not out of core. Legend: nx = numexpr, np = numpy, CyOMP = Cython with OpenMP, DskFromNumpy = Dask)
  21. (benchmark plot: array length 100, 10 runs per box, in memory, not out of core. Legend: nx = numexpr, np = numpy, CyOMP = Cython with OpenMP, DskFromNumpy = Dask with conversion from numpy, DskNative = Dask with a given Dask array)
  22. Cython and Dask Side-by-Side

          def stable_sum(x):
              mx = x.max()
              return (x / mx).sum() * mx

      versus:

          cimport cython
          cimport numpy as np
          import numpy as np
          from cython.parallel import prange

          @cython.boundscheck(False)
          def csum_par(np.ndarray[np.float64_t, ndim=1] arr):
              cdef int n = arr.shape[0]
              cdef int i
              cdef np.float64_t s = 0.0
              cdef np.float64_t mx = np.max(arr)
              for i in prange(n, nogil=True):   # parallel loop, GIL released
                  s += arr[i] / mx              # reduction over s
              return s * mx
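      A hypothetical usage sketch (not from the deck), assuming the Cython routine has been compiled with OpenMP into an extension module named csum:

          import numpy as np
          from csum import csum_par  # hypothetical compiled Cython module

          arr = np.random.uniform(size=10**6)
          print(csum_par(arr))       # parallel, numerically scaled sum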
  23. My Take on this Benchmark

      Dask performance comes in 2nd, after parallelized Cython.

      Dask:
      • scheduling overhead
      • needs to know shapes/chunks of arrays at every step of the computation
      • simple, NumPy-like formulation

      Cython:
      • C/Fortran-like formulation
      • complex
      • lower-level
  24. Dask in Action: Clean Data

          import dask.array as da
          import numpy as np

          def test_cleanup_with_np():
              x = np.random.normal(size=10000) + 4.0
              x[x < 0] = 0.0               # in-place mutation
              is_positive = (x >= 0).all()
              assert is_positive

          def test_cleanup_with_dask():
              x = da.random.normal(size=10000, chunks=1000) + 4
              x = da.where(x < 0, 0.0, x)  # no mutation: a new lazy array
              is_positive = (x >= 0).all()
              assert is_positive.compute()
  25. (image-only slide)

  26. Summary (https://flic.kr/p/7z55Lg)

      Composing expressions liberates us from the von Neumann architecture.

      Decoupling HOW from WHAT:
      • HOW: word/chunk-at-a-time, parallel/sequential/distributed
      • WHAT: sum(x / mx) * mx where mx = x.max()
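      A sketch (not from the deck) of this decoupling in Dask terms: the expression is the WHAT, the scheduler choice is the HOW. The scheduler names assume a current Dask version:

          import dask.array as da

          x = da.random.normal(size=10000, chunks=1000) + 4
          mx = x.max()
          expr = (x / mx).sum() * mx                    # WHAT: a lazy graph

          print(expr.compute(scheduler="synchronous"))  # HOW: single-threaded
          print(expr.compute(scheduler="threads"))      # HOW: a thread pool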
  27. (https://flic.kr/p/bjRngF)

      We have successfully built abstraction upon abstraction. The imperative epoch is superseded by functional, mathematical thinking!
  28. What is ahead? (https://flic.kr/p/7z55Lg, https://flic.kr/p/5hFkUR)

      • Decoupling specification (the what?) from implementation (the how?)
      • Reusability with regard to the what
      • Guarantees