
Data Science, Python and the Functional Programming Revolution


Holger Peters

August 26, 2016


Transcript

  1. Data Science, Python and the Functional Programming Revolution
     Holger Peters, @data_hope
     EuroSciPy 2016, Erlangen
     How to profit from functional principles when doing number-crunching.
  2. Functional programming: An umbrella term
     - programs are built by composing functions
     - expressions instead of statements, evaluation instead of instruction
     - avoid memory mutation
     - pure (side-effect free) functions (see the sketch below)
     - higher-order functions
     Image licensed CC BY 2.0, https://flic.kr/p/wPKKto © flickr user surfergirl30
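     A tiny illustrative sketch of the "pure function" idea from this list (not from the slides; the names are made up):

         # Impure: mutates shared state, so results depend on call history.
         history = []
         def record_and_scale(x):
             history.append(x)  # side effect: mutation of external state
             return 2 * x

         # Pure: the output depends only on the input, with no observable
         # side effects.
         def scale(x):
             return 2 * x

         assert scale(3) == scale(3) == 6  # referentially transparent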
  3. FP: A constant innovator
     Python features originating in FP (a sketch follows this list):
     - garbage collection
     - list comprehensions
     - lazy evaluation, i.e. yield / generator expressions
     - interactive prompt (ipython)
     - lambdas, higher-order functions
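     A minimal sketch of lazy evaluation and higher-order functions in plain Python (illustrative, not from the slides):

         from itertools import islice

         # Lazy evaluation: a generator expression computes squares on demand
         # and never materializes the whole sequence in memory.
         squares = (x * x for x in range(10**12))
         print(list(islice(squares, 5)))  # [0, 1, 4, 9, 16]

         # Higher-order functions and lambdas: map takes a function as argument.
         print(list(map(lambda x: 2 * x, [1, 2, 3])))  # [2, 4, 6]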
  4. Imperative vs. functional Python
     Imperative (statement):

         exponentials = []
         for x in values:
             exponentials.append(exp(x))  # mutation!

     Functional (expression):

         exponentials = [exp(x) for x in values]
         exponentials = map(exp, values)  # higher-order function
         exponentials = np.exp(values)
  5. Imperative vs. functional Python
     Imperative (statements):

         x = np.random.randn(100)
         res = np.empty_like(x)
         mask = x < 0
         res[mask] = 0
         res[~mask] = np.sqrt(x[~mask])

     Functional (expression):

         x = np.random.randn(100)
         res = np.where(x < 0, 0, np.sqrt(x))
  6. FP: A traveler's dictionary

     Imperative                   Functional
     ---------------------------  ----------------------------------
     method, procedure, callable  (pure) function, lambda function
     mutable object               value (immutable)
     execute                      evaluate
     statement                    expression
     loop                         higher-order functions, recursion
     break / continue / goto      continuation, yield

     (The loop row is illustrated in the sketch below.)
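     A small illustration of the "loop vs. higher-order function" row (illustrative code, not from the slides):

         from functools import reduce

         values = [1, 2, 3, 4]

         # Imperative: a loop with a mutable accumulator.
         total = 0
         for v in values:
             total += v

         # Functional: the same reduction as an expression.
         total_fp = reduce(lambda acc, v: acc + v, values, 0)
         assert total == total_fp == 10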
  7. Linear regression
     Prediction: \hat{y} = X\hat{\beta}, i.e. \hat{y}_i = \beta_0 + \sum_j \beta_j x_{ij}
     Fit: \hat{\beta} = (X^T X)^{-1} X^T y = X^+ y, where X^+ = (X^T X)^{-1} X^T is the pseudo-inverse.

         def linreg(X, y):
             return np.linalg.inv(X.T @ X) @ X.T @ y

         def linreg_pinv(X, y):
             return np.linalg.pinv(X) @ y

     Thinking on a more abstract level (algebra) already comes naturally to us! Linear algebra is about expressions. Math is functional.
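     A hedged usage sketch for the slide's linreg_pinv (the data here is made up for illustration):

         import numpy as np

         def linreg_pinv(X, y):
             return np.linalg.pinv(X) @ y

         # Illustrative data: y = 1 + 2*x plus a little noise; a column of
         # ones models the intercept beta_0.
         rng = np.random.default_rng(0)
         x = rng.uniform(0, 1, size=100)
         X = np.column_stack([np.ones_like(x), x])
         y = 1.0 + 2.0 * x + 0.01 * rng.normal(size=100)

         print(linreg_pinv(X, y))  # approximately [1.0, 2.0]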
  8. When FP is the only viable option
     What happens once fundamental assumptions suddenly don't matter any more?
     Image CC © flickr user "greg" https://flic.kr/p/D2Dti
  9. Two common problems with the Scientific Python stack
     The dataset does not fit into RAM: a core assumption of the von Neumann model, random memory access, cannot be upheld! How can I write code for programs distributed over clusters? Can I use out-of-core methods to load data from disk on demand?
     The implementation of an algorithm only uses one core: writing code as "instructions for a CPU", we typically write it in a way that utilizes only one CPU core. Writing parallelized implementations can be challenging. Is there a better way?
  10. Dask: Out-of-core, distributed and parallel computation for numpy/pandas
      Dask works on expressions. Dask partitions data into chunks and evaluates expressions on these chunks using map and reduce. Dask can parallelize the evaluation of expressions. Dask can distribute the evaluation of expressions.
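      A minimal sketch of this expression style (array contents and chunk size are illustrative):

          import dask.array as da
          import numpy as np

          # Build a lazy expression over a chunked array; nothing runs yet.
          x = da.from_array(np.arange(1_000_000, dtype=np.float64),
                            chunks=250_000)
          expr = (x ** 2).sum()

          # Evaluation happens only on .compute(), chunk by chunk, in parallel.
          print(expr.compute())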
  11. How can map and reduce work chunkwise?
      Many reductive operations can be chunked:

          >>> np.sum(np.r_[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
          55
          >>> sum([0, 1, 2, 3, 4, 5]) + sum([6, 7, 8, 9, 10])
          55

      Elementwise mapping operations can be chunked:

          >>> 10. * np.r_[0, 1, 2, 3, 4, 5]
          array([  0.,  10.,  20.,  30.,  40.,  50.])
          >>> np.hstack((10. * np.r_[0, 1, 2],
          ...            10. * np.r_[3, 4, 5]))
          array([  0.,  10.,  20.,  30.,  40.,  50.])
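      A toy chunkwise reduction in plain numpy that makes the map/reduce split explicit (illustrative only; Dask schedules the partial reductions in parallel):

          import numpy as np

          def chunked_sum(x, n_chunks=4):
              chunks = np.array_split(x, n_chunks)
              partials = [chunk.sum() for chunk in chunks]  # map: per chunk
              return sum(partials)                          # reduce: combine

          x = np.arange(11)
          assert chunked_sum(x) == np.sum(x) == 55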
  12. Softmax
      A vector \vec{x} \in \mathbb{R}^n is mapped, each component being constrained between 0 and 1 (so \sigma : \mathbb{R}^n \to (0, 1)^n).
      Example: normalized exponential, a.k.a. the "softmax function":

          \sigma(\vec{x})_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}

      Beware: \exp(x_i) overflows for large x_i (around x_i > 709 in float64). With \hat{x} = \max_j x_j:

          \sigma(\vec{x})_i = \frac{\exp(x_i - \hat{x})}{\sum_{j=1}^{n} \exp(x_j - \hat{x})}
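      A quick demonstration of why the maximum is subtracted (the values here are illustrative):

          import numpy as np

          x = np.array([10.0, 1000.0])

          # Naive softmax overflows: exp(1000.) is inf in float64,
          # so the result contains nan.
          naive = np.exp(x) / np.exp(x).sum()

          # Subtracting the maximum keeps every exponent <= 0.
          stable = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
          print(naive)   # [ 0. nan] (with overflow warnings)
          print(stable)  # [ 0.  1.] (up to rounding)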
  13. Softmax's expression graph

          x_hat = \max_i x_i
          exp_x = \exp(x - \hat{x})
          sum_x = \sum_{i=1}^{n} \exp(x_i - \hat{x})
          \sigma(\vec{x}) = exp_x / sum_x

      In numpy:

          x_exp = np.exp(x - x.max())
          return x_exp / x_exp.sum()
  14. Dask's evaluation graph
      Map and reduce operations can be performed chunkwise; they are parallelizable and serializable.
      Example: graph for softmax with four chunks of data.
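      A hedged sketch of how such a graph can be produced (the .visualize() call needs the optional graphviz dependency; the filename is made up):

          import dask.array as da
          import numpy as np

          x = da.from_array(np.random.randn(1_000_000),
                            chunks=250_000)  # four chunks
          e_x = da.exp(x - x.max())
          softmax = e_x / e_x.sum()

          # Render the task graph; each chunk shows up as its own branch.
          softmax.visualize('softmax-graph.png')
          print(softmax.compute()[:5])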
  15. How fast is parallelization by Dask, compared to alternatives?
      Image licensed CC BY, flickr user ChristianSinclair https://flic.kr/p/aGo5fr
  16. Benchmark setup: array length 1e8, 10 runs per box, in memory (not out-of-core, not distributed), parallelized with 4 cores (8 virtual).
  17. Comparison of softmax implementations

          def softmax_numpy(x):
              e_x = np.exp(x - x.max())
              return e_x / e_x.sum()

          def softmax_dask(x):
              x = da.from_array(x, chunks=int(x.shape[0] / CPU_COUNT), name='x')
              e_x = da.exp(x - x.max())
              return (e_x / e_x.sum()).compute()

          def softmax_numexpr(x):
              mx = ne.evaluate('max(x)')
              e_x = ne.evaluate('exp(x - mx)')
              sum_of_exp = ne.evaluate('sum(e_x)')
              normalized = ne.evaluate('e_x / sum_of_exp')
              return normalized
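      A minimal timing harness for comparing such functions (illustrative; the talk's plots used 1e8 elements and 10 runs per implementation):

          import time
          import numpy as np

          def softmax_numpy(x):
              e_x = np.exp(x - x.max())
              return e_x / e_x.sum()

          def benchmark(fn, x, runs=10):
              # Report the best wall-clock time over several runs.
              timings = []
              for _ in range(runs):
                  start = time.perf_counter()
                  fn(x)
                  timings.append(time.perf_counter() - start)
              return min(timings)

          x = np.random.randn(10**7)  # smaller than 1e8 to keep the demo quick
          print('numpy:', benchmark(softmax_numpy, x))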
  18. Comparison of softmax implementations: Cython with OpenMP

          import numpy as np
          cimport numpy as np
          cimport cython
          from cython.parallel import prange
          from libc.math cimport exp

          @cython.nonecheck(False)
          @cython.cdivision(True)
          @cython.boundscheck(False)
          def softmax_openmp(np.ndarray[np.float64_t, ndim=1] x):
              cdef:
                  int n = x.shape[0]
                  int i
                  np.float64_t s = 0.0
                  double max_x = np.max(x)
                  np.ndarray[np.float64_t, ndim=1] e_x = np.empty(n)
              with nogil:
                  for i in prange(n):
                      e_x[i] = exp(x[i] - max_x)
                      s += e_x[i]
                  for i in prange(n):
                      e_x[i] /= s
              return e_x
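      A hedged sketch of how such an extension might be built (file and module names here are illustrative assumptions):

          # setup.py
          from setuptools import Extension, setup
          from Cython.Build import cythonize
          import numpy as np

          ext = Extension(
              'softmax_openmp_ext',
              sources=['softmax_openmp_ext.pyx'],
              include_dirs=[np.get_include()],
              extra_compile_args=['-fopenmp'],  # let prange run in parallel
              extra_link_args=['-fopenmp'],
          )
          setup(ext_modules=cythonize(ext))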
  19. Learning from this example
      With an API based on functional, pure-Python expressions, Dask gives you great performance without losing any level of abstraction. Cython with OpenMP can be faster, but introduces complexity (natively compiled, low-level code).
      Image: flickr user Jordan Schwartz https://flic.kr/p/9B9HN3
  20. Summary
      It is better to focus on the problem and its algorithmic properties when implementing than on the implementation details of the machine. Functional programming, composing expressions, abstracts nicely over the machine level. We rely on software anyway: let software generate the most efficient implementation from your specification. Ergonomics often trumps slight performance improvements.
      Image licensed CC BY 2.0 © flickr user dfaulder https://flic.kr/p/7z8NMj
  21. "Conventional programming languages are growing ever more enormous, but not stronger. Inherent defects at the most basic level cause them to be both fat and weak: their primitive word-at-a-time style of programming inherited from their common ancestor, the von Neumann computer, their close coupling of semantics to state transitions, their division of programming into a world of expressions and a world of statements, [...], and their lack of useful mathematical properties for reasoning about programs."
      From the abstract of John Backus' "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs", his 1977 ACM Turing Award lecture.