
Python data-science going functional

Functional concepts are on the rise. In the big data space, this has been known ever since MapReduce took the stage, and nowadays Spark's resilient distributed datasets (RDDs) are a prime example of how functional, immutable concepts translate into successful architectures that we can leverage from Python.

In this presentation I show how a functional approach to programming leads to formulating computation problems as directed acyclic graphs (DAGs), which lets innovative libraries like Dask provide an abstraction over the limitations of the von Neumann architecture.

Holger Peters

April 16, 2016

Transcript

  1. FP: An Umbrella Term (https://flic.kr/p/wPKKto)

     • expressions instead of statements
     • evaluation instead of instruction
     • no memory mutation
     • programs are built by composing functions
     • pure (side-effect free) functions
  2. FP: A Constant Innovator. Python features originating in functional programming:

     • Garbage collection (1959, LISP by John McCarthy)
     • List comprehensions (1969, SETL)
     • Lazy evaluation (1976, Friedman & Wise), i.e. yield and generator expressions
     • The interactive prompt
  3. Imperative:

         lst = []
         for i in range(10):
             lst.append(str(i))

     Functional:

         [str(i) for i in range(10)]
         map(str, range(10))

     (diagram: the elements 0, 1, 2, 3, 4 mapped to "0", "1", "2", "3", "4")
  4. A numerically more stable summation:

         >>> xs = [10.0, 15.0, 20.0, 3.0]
         >>> mx = max(xs)
         >>> q = [x / mx for x in xs]
         >>> s = sum(q) * mx

     Scaling by the maximum keeps the intermediate values small, which avoids overflow for extreme inputs (see the sketch below).
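     A hypothetical demonstration (not from the deck) of what "more stable" buys here, with made-up values chosen to force overflow:

         xs = [1e308, 1e308, -1e308]
         print(sum(xs))                        # inf: 1e308 + 1e308 overflows
         mx = max(xs)
         print(sum(x / mx for x in xs) * mx)   # 1e308: intermediates stay near 1.0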
  5. Python-Haskell Side-By-Side

     Python:

         >>> xs = [10.0, 15.0, 20.0, 3.0]
         >>> mx = max(xs)
         >>> q = [x / mx for x in xs]
         >>> s = sum(q) * mx
         >>> q
         [0.5, 0.75, 1.0, 0.15]

     Haskell:

         >>> xs = [10.0, 15.0, 20.0, 3.0] :: [Double]
         >>> mx = maximum xs
         >>> q = [x / mx | x <- xs]
         >>> s = sum q * mx
         >>> :sprint q
         q = _

     :sprint inspects Haskell's internal evaluation state; _ represents a not-yet-evaluated value.
  6. The same Haskell session:

         >>> xs = [10.0, 15.0, 20.0, 3.0] :: [Double]
         >>> mx = maximum xs
         >>> q = [x / mx | x <- xs]
         >>> s = sum q * mx
         >>> :sprint q
         q = _

     (diagram: dependency graph — q depends on xs and on mx = max xs via the array division x / mx; s depends on sum q and mx)

     Haskell's functions are pure; evaluation is lazy per default. Hint: we'll see how this is possible in Python, too!
  7. Continuing the session:

         >>> :sprint q
         q = _
         >>> :sprint mx
         mx = _
         >>> q !! 0        -- Python: q[0]
         0.5
         >>> q !! 1        -- Python: q[1]
         0.75
         >>> :sprint q
         q = 0.5 : 0.75 : _
         >>> :sprint mx
         mx = 20.0
         >>> s
         48.0
         >>> :sprint q
         q = [0.5,0.75,1.0,0.15]

     Everything starts out lazy (_); indexing triggers evaluation, the list is only partially evaluated, and s finally forces all of it.
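     For comparison, a minimal sketch (not from the deck) of the same on-demand behaviour in Python, using a generator expression:

         xs = [10.0, 15.0, 20.0, 3.0]
         mx = max(xs)
         q = (x / mx for x in xs)   # generator expression: no division yet
         print(next(q))             # 0.5  -- first element computed on demand
         print(next(q))             # 0.75 -- second element computed on demand

     Unlike Haskell's q, the generator can be consumed only once: it is lazy, but not a persistent value.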
  8. FP: A traveller's dictionary

     Imperative                  | Functional
     ----------------------------|--------------------------------------
     method, procedure           | pure function, higher-order function
     mutable object              | immutable value
     execute statements          | evaluate expressions
     loop                        | higher-order functions, recursion
     direct execution (machine)  | lazy or eager evaluation
     break / continue / goto     | (continuation)

     (the loop row is illustrated in the sketch below)
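     A small sketch (not from the deck) of the loop row: the same sum phrased imperatively and with a higher-order function:

         from functools import reduce

         # Imperative: a loop mutating an accumulator.
         total = 0
         for x in [1, 2, 3, 4]:
             total += x

         # Functional: the same reduction via the higher-order function reduce.
         total_fp = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)

         assert total == total_fp == 10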
  9. C/Fortran Style Numerics

     • in-place memory mutation
     • inner-loop optimisation
     • focus on machine layout (x86)
     • FORTRAN: John Backus, received the Turing Award in 1977

     (image by SciCompTeacher [Public domain], via Wikimedia Commons)
  10. A LAPACK routine in full:

            SUBROUTINE CGTSV( N, NRHS, DL, D, DU, B, LDB, INFO )
      *     .. Scalar Arguments ..
            INTEGER            INFO, LDB, N, NRHS
      *     ..
      *     .. Array Arguments ..
            COMPLEX            B( LDB, * ), D( * ), DL( * ), DU( * )
      *
      *  Purpose
      *  =======
      *
      *  CGTSV solves the equation
      *
      *     A*X = B,
      *
      *  where A is an N-by-N tridiagonal matrix, by Gaussian elimination with
      *  partial pivoting.
      *
      *  Note that the equation A**H *X = B may be solved by interchanging the
      *  order of the arguments DU and DL.
      *
      *     .. Parameters ..
            COMPLEX            ZERO
            PARAMETER          ( ZERO = ( 0.0E+0, 0.0E+0 ) )
      *     .. Local Scalars ..
            INTEGER            J, K
            COMPLEX            MULT, TEMP, ZDUM
      *     .. Intrinsic Functions ..
            INTRINSIC          ABS, AIMAG, MAX, REAL
      *     .. External Subroutines ..
            EXTERNAL           XERBLA
      *     ..
      *     .. Statement Functions ..
            REAL               CABS1
      *     ..
      *     .. Statement Function definitions ..
            CABS1( ZDUM ) = ABS( REAL( ZDUM ) ) + ABS( AIMAG( ZDUM ) )
            INFO = 0
            IF( N.LT.0 ) THEN
               INFO = -1
            ELSE IF( NRHS.LT.0 ) THEN
               INFO = -2
            ELSE IF( LDB.LT.MAX( 1, N ) ) THEN
               INFO = -7
            END IF
            IF( INFO.NE.0 ) THEN
               CALL XERBLA( 'CGTSV ', -INFO )
               RETURN
            END IF
            IF( N.EQ.0 )
           $   RETURN
            DO 30 K = 1, N - 1
               IF( DL( K ).EQ.ZERO ) THEN
                  IF( D( K ).EQ.ZERO ) THEN
                     INFO = K
                     RETURN
                  END IF
               ELSE IF( CABS1( D( K ) ).GE.CABS1( DL( K ) ) ) THEN
                  MULT = DL( K ) / D( K )
                  D( K+1 ) = D( K+1 ) - MULT*DU( K )
                  DO 10 J = 1, NRHS
                     B( K+1, J ) = B( K+1, J ) - MULT*B( K, J )
         10       CONTINUE
                  IF( K.LT.( N-1 ) )
           $         DL( K ) = ZERO
               ELSE
                  MULT = D( K ) / DL( K )
                  D( K ) = DL( K )
                  TEMP = D( K+1 )
                  D( K+1 ) = DU( K ) - MULT*TEMP
                  IF( K.LT.( N-1 ) ) THEN
                     DL( K ) = DU( K+1 )
                     DU( K+1 ) = -MULT*DL( K )
                  END IF
                  DU( K ) = TEMP
                  DO 20 J = 1, NRHS
                     TEMP = B( K, J )
                     B( K, J ) = B( K+1, J )
                     B( K+1, J ) = TEMP - MULT*B( K+1, J )
         20       CONTINUE
               END IF
         30 CONTINUE
            IF( D( N ).EQ.ZERO ) THEN
               INFO = N
               RETURN
            END IF
            DO 50 J = 1, NRHS
               B( N, J ) = B( N, J ) / D( N )
               IF( N.GT.1 )
           $      B( N-1, J ) = ( B( N-1, J )-DU( N-1 )*B( N, J ) ) / D( N-1 )
               DO 40 K = N - 2, 1, -1
                  B( K, J ) = ( B( K, J )-DU( K )*B( K+1, J )-DL( K )*
           $                  B( K+2, J ) ) / D( K )
         40    CONTINUE
         50 CONTINUE
            RETURN
            END

      LAPACK routine (version 3.3.1). LAPACK is a software package provided by Univ. of Tennessee, Univ. of California Berkeley, Univ. of Colorado Denver and NAG Ltd.
  11. Abstract of John Backus' "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs" (ACM Turing Award 1977):

      "Conventional programming languages are growing ever more enormous, but not stronger. Inherent defects at the most basic level cause them to be both fat and weak: their primitive word-at-a-time style of programming inherited from their common ancestor — the von Neumann computer, their close coupling of semantics to state transitions, their division of programming into a world of expressions and a world of statements, [...], and their lack of useful mathematical properties for reasoning about programs."
  12. Calculating a Linear Regression

      Model: ŷᵢ = w₀ + w₁xᵢ₁ + w₂xᵢ₂ + w₃xᵢ₃ + …, with design matrix X = (xᵢⱼ), targets y = (yᵢ), and weights w = (wᵢ).

      Estimation: w = (XᵀX)⁻¹ Xᵀ y

      Fit:

          def linreg(X, y):
              return np.linalg.inv(X.T @ X) @ X.T @ y

          def linreg_pinv(X, y):
              return np.linalg.pinv(X) @ y

      Linear algebra is not word-at-a-time; is it proper to think in "von Neumann" style when implementing a linear regression?
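      A hypothetical usage sketch (not from the deck) with a made-up design matrix, first column of ones for the intercept:

          import numpy as np

          X = np.array([[1.0, 0.0],
                        [1.0, 1.0],
                        [1.0, 2.0]])   # intercept column + one feature
          y = np.array([1.0, 3.0, 5.0])

          w = np.linalg.pinv(X) @ y    # least-squares fit: w ≈ [1.0, 2.0]
          y_hat = X @ w                # predictions ŷ = Xw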
  13. When FP is the only viable option (https://flic.kr/p/D2Dti)

      What happens once fundamental assumptions suddenly don't matter any more?
  14. My Data Does Not Fit Into Machine Memory

      "As of June 2015, the dump of all pages with complete edit history in XML format [...] is about 100 GB compressed [...], and 10 TB uncompressed."

      We need methods that do not assume random memory access at all times.
  15. Meet Dask (https://flic.kr/p/o4vNCv)

      • pure Python library
      • API similar to NumPy
      • does not assume that all data is in RAM (out-of-core)
  16. Idea 1: Evaluation as a directed acyclic graph (DAG)

      (diagram: the expression sum(x / mx) * mx with mx = max(x), drawn as a DAG — nodes are the values x, mx = max(x), q = x / mx, s = sum(q), and the result s * mx; edges are data flow)
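      A minimal sketch (not from the deck) of this DAG spelled as a Dask task graph in dask's dictionary format; the key names are made up:

          from operator import mul
          from dask import get  # dask's simple synchronous scheduler

          x = [10.0, 15.0, 20.0, 3.0]
          dsk = {
              "x": x,
              "mx": (max, "x"),                                 # mx = max(x)
              "q": (lambda xs, m: [v / m for v in xs], "x", "mx"),  # q = x / mx
              "s": (sum, "q"),                                  # s = sum(q)
              "result": (mul, "s", "mx"),                       # s * mx
          }
          print(get(dsk, "result"))  # 48.0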
  17. Idea 2: Chunks, Map-Reduce Style

      (diagram: x is split into chunks x1 and x2; each chunk is divided by the global mx, giving q1 and q2; the partial sums s1 = sum(q1) and s2 = sum(q2) are combined as s1 + s2)
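      A pure-Python sketch (not from the deck; chunk contents made up) of the chunked evaluation:

          def stable_sum_chunked(chunks):
              mx = max(max(chunk) for chunk in chunks)                 # reduce: global maximum
              partials = [sum(v / mx for v in chunk) for chunk in chunks]  # map over chunks
              return sum(partials) * mx                                # reduce: combine partials

          x1, x2 = [10.0, 15.0], [20.0, 3.0]
          print(stable_sum_chunked([x1, x2]))  # 48.0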
  18. Evaluation Strategy 1: Minimise Memory Usage

      (diagram: the chunked graph is evaluated step by step; as soon as a chunk's consumers have run, the chunk is discarded, so only a small part of the data is held in memory at any time)
  19. Computation Graphs w/ Dask

          def stable_sum(x):
              mx = x.max()
              return (x / mx).sum() * mx

          s_nump = stable_sum(numpy_array)
          s_dask = stable_sum(dask_array).compute()

      The same expression works on a NumPy array (eager) and on a Dask array (lazy, evaluated by .compute()).
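      A minimal runnable sketch (not from the deck) of the missing setup; the array contents are made up, and da.from_array wraps an existing NumPy array into chunks:

          import numpy as np
          import dask.array as da

          numpy_array = np.array([10.0, 15.0, 20.0, 3.0])
          dask_array = da.from_array(numpy_array, chunks=2)  # chunks of length 2

          assert np.allclose(stable_sum(numpy_array),
                             stable_sum(dask_array).compute())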
  20. (benchmark plot: array length 5e8, 10 runs per box, in memory, not out of core. Legend: nx = numexpr, np = numpy, CyOMP = Cython with OpenMP, DskFromNumpy = Dask)
  21. (benchmark plot: array length 100, 10 runs per box, in memory, not out of core. Legend: nx = numexpr, np = numpy, CyOMP = Cython with OpenMP, DskFromNumpy = Dask with conversion from numpy, DskNative = Dask with a given Dask array)
  22. Cython and Dask Side-by-Side

          def stable_sum(x):
              mx = x.max()
              return (x / mx).sum() * mx

      versus:

          cimport cython
          cimport numpy as np
          import numpy as np
          from cython.parallel import prange

          @cython.boundscheck(False)
          def csum_par(np.ndarray[np.float64_t, ndim=1] arr):
              cdef int n = arr.shape[0]
              cdef int i
              cdef np.float64_t s = 0.0
              cdef np.float64_t mx = np.max(arr)
              for i in prange(n, nogil=True):   # parallel loop, GIL released
                  s += arr[i] / mx              # reduction over s
              return s * mx
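      A hypothetical usage sketch (not from the deck), assuming the Cython routine has been compiled with OpenMP into an extension module named csum:

          import numpy as np
          from csum import csum_par  # hypothetical compiled Cython module

          arr = np.random.uniform(size=10**6)
          print(csum_par(arr))       # parallel, numerically scaled sum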
  23. My Take on this Benchmark

      Dask performance comes in 2nd, after parallelized Cython.

      Dask:
      • scheduling overhead
      • needs to know shapes/chunks of arrays at every step of the computation
      • simple, NumPy-like formulation

      Cython:
      • C/Fortran-like formulation
      • complex
      • lower-level
  24. Dask in Action: Clean Data

          import dask.array as da
          import numpy as np

          def test_cleanup_with_np():
              x = np.random.normal(size=10000) + 4.0
              x[x < 0] = 0.0               # in-place mutation
              is_positive = (x >= 0).all()
              assert is_positive

          def test_cleanup_with_dask():
              x = da.random.normal(size=10000, chunks=1000) + 4
              x = da.where(x < 0, 0.0, x)  # no mutation: a new lazy array
              is_positive = (x >= 0).all()
              assert is_positive.compute()
  25. (image-only slide)

  26. Summary (https://flic.kr/p/7z55Lg)

      Composing expressions liberates us from the von Neumann architecture.

      Decoupling HOW from WHAT:
      • HOW: word/chunk-at-a-time, parallel/sequential/distributed
      • WHAT: sum(x / mx) * mx where mx = x.max()
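      A sketch (not from the deck) of this decoupling in Dask terms: the expression is the WHAT, the scheduler choice is the HOW. The scheduler names assume a current Dask version:

          import dask.array as da

          x = da.random.normal(size=10000, chunks=1000) + 4
          mx = x.max()
          expr = (x / mx).sum() * mx                    # WHAT: a lazy graph

          print(expr.compute(scheduler="synchronous"))  # HOW: single-threaded
          print(expr.compute(scheduler="threads"))      # HOW: a thread pool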
  27. (https://flic.kr/p/bjRngF)

      We have successfully built abstraction upon abstraction. The imperative epoch is superseded by functional, mathematical thinking!
  28. What is ahead? (https://flic.kr/p/7z55Lg, https://flic.kr/p/5hFkUR)

      • Decoupling specification (the what?) from implementation (the how?)
      • Reusability with regard to the what
      • Guarantees