Introduction to Number Crunching in Python @ EuroPython 2012

EuroPython 2012 Talk [1] by Dr. Enrico Franchi [2] and me

**Abstract**:

As computer scientists and geeks, we hate repetitive manual operations and usually prefer to make all processing as automatic as possible (http://jonudell.net/images/geeks-win-eventually.png). Manual operations are boring, time-consuming and error-prone, and they do not allow for any kind of replication or reuse. Automatic processing, on the other hand, promotes the reuse of common operations and scales easily to problems of different sizes, from a few values to very large amounts of data.

All such data analysis processes are usually indicated by the term “crunching”, which refers to the analysis of large amounts of information in an automatic fashion, together with its corresponding set of “complex” operations.

Many tools used for data analysis are not overly geek-friendly, as they require a great deal of repetitive work: consider, for example, the simple case in which we have to collect the values obtained from an experimental trial and compute their mean, minimum and maximum. A typical solution is to copy all the data into an Excel file and perform the analysis from there. However, these operations become infeasible in real-world scenarios, where we have to deal with huge amounts of data and where “doing things manually” means copying and pasting data from several different files.

While other tools, such as Matlab, allow better automation and offer a more programmer-friendly environment, Python offers extremely interesting solutions for these kinds of problems. In particular, Python lets us exploit the benefits of a general-purpose programming language in combination with a large set of capabilities for crunching (NumPy, SciPy), data storage (PyTables, NoSQL interfaces), data visualization (matplotlib) and an easy-to-use interactive environment (IPython, IPython Notebook).

In this talk we present some of the powerful tools available in the Python environment to automatically analyze, filter and process large amounts of data. In particular, we present different real-world case studies along with the corresponding working Python code.

Basic maths skills and basic knowledge of the Python programming language are the only suggested prerequisites.

---
[1]: https://ep2013.europython.eu/conference/talks/introduction-to-number-crunching

[2]: https://ep2013.europython.eu/conference/p/enrico-franchi

Valerio Maggio

July 06, 2012

Transcript

  1. OUTLINE

    • Scientific and Engineering Computing
    • Common FP pitfalls
    • NumPy ndarray (Memory and Indexing)
    • Case Studies
  4. number-crunching: n. [common] Computations of a numerical nature, esp. those

    that make extensive use of floating-point numbers. This term is in widespread informal use outside hackerdom and even in mainstream slang, but has additional hackish connotations: namely, that the computations are mindless and involve massive use of brute force. This is not always evil, esp. if it involves ray tracing or fractals or some other use that makes pretty pictures, esp. if such pictures can be used as screen backgrounds. See also crunch.
  5. number-crunching (continued)

    We are not evil. Just chaotic neutral.
  7. ALTERNATIVES

    • Matlab (IDE, oriented to numeric computations, high-quality algorithms, lots of packages, poor general-purpose programming support, commercial)
    • Octave (Matlab clone)
    • R (stats oriented, poor general-purpose programming support)
    • Fortran/C++ (very low level, very fast, more complex to use)
    • In general, these tools are either low-level general-purpose (GP) languages or high-level DSLs
  8. PYTHON

    • NumPy (low-level numerical computations) + SciPy (lots of additional packages)
    • IPython (wonderful command-line interpreter) + IPython Notebook (“Mathematica-like” interactive documents)
    • HDF5 (PyTables, h5py), databases
    • Specific libraries for machine learning, etc.
    • General-purpose object-oriented programming
  9. NUMPY STACK

    Layers in the diagram: Our Code, NumPy, ATLAS/MKL (the two lower layers are labelled “Improvements”)
    Algorithms are fast because of highly optimized C/Fortran code.
    Bytecode for c = a · b:
      4          30 LOAD_GLOBAL     1 (dot)
                 33 LOAD_FAST       0 (a)
                 36 LOAD_FAST       1 (b)
                 39 CALL_FUNCTION   2
                 42 STORE_FAST      2 (c)
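The bytecode listing on this slide can be reproduced with the standard `dis` module. The following is a minimal sketch, assuming a hypothetical wrapper function `multiply` around the `c = a · b` statement; opcode names and offsets differ between Python versions (the slide shows Python 2.7 output), so the exact listing will not match.

```python
import dis
import io

import numpy as np
from numpy import dot


def multiply(a, b):
    # The interpreter only looks up `dot` and calls it;
    # the arithmetic itself runs in compiled code inside NumPy.
    c = dot(a, b)
    return c


# Capture the disassembly as text instead of printing it directly.
buffer = io.StringIO()
dis.dis(multiply, file=buffer)
bytecode = buffer.getvalue()
print(bytecode)

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[0.0, 1.0], [1.0, 0.0]])
result = multiply(a, b)
```

Only a handful of bytecode instructions are executed regardless of the size of `a` and `b`, which is why the cost of the interpreter becomes negligible for large arrays.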
  10. ndarray

    • Memory: an n-dimensional array references some (usually contiguous) memory area
    • Properties (shape, stride, flags): an n-dimensional array has properties such as its shape and the data type of the elements it contains
    • Behavior: it is an object, so there is some behavior, e.g., the definition of __add__ and similar methods
    • N-dimensional arrays are homogeneous
    • Shape: $(d_0, \dots, d_{n-1})$, e.g. 4x3; a multi-index maps to a memory offset, $(i_0, \dots, i_{n-1}) \to I$
  11. Element Layout in Memory

    Shape: $(d_0, \dots, d_k, \dots, d_{n-1})$; a multi-index maps to a memory offset: $(i_0, \dots, i_{n-1}) \to I$

    C-contiguous: $I_C = \sum_{k=0}^{n-1} i_k \prod_{j=k+1}^{n-1} d_j$
    F-contiguous: $I_F = \sum_{k=0}^{n-1} i_k \prod_{j=0}^{k-1} d_j$

    Two-dimensional case (e.g. 4x3): $I_C = i_0 \cdot d_1 + i_1$ and $I_F = i_0 + i_1 \cdot d_0$
  12. Stride

    C-contiguous: $s_C(k) = \prod_{j=k+1}^{n-1} d_j$, so that $I_C = \sum_{k=0}^{n-1} i_k \cdot s_C(k)$
    F-contiguous: $s_F(k) = \prod_{j=0}^{k-1} d_j$, so that $I_F = \sum_{k=0}^{n-1} i_k \cdot s_F(k)$

    Two-dimensional case: C-contiguous $(s_0 = d_1, s_1 = 1)$; F-contiguous $(s_0 = 1, s_1 = d_0)$
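The stride formulas can be checked directly against NumPy. This is a small sketch, assuming a 4x3 array of 8-byte integers; the helper `flat_offset` is a hypothetical name introduced here, not part of NumPy. Note that NumPy reports strides in bytes, while the formulas count elements.

```python
import numpy as np

# A 4x3 array of 8-byte integers, as in the 4x3 example on the slides.
c_arr = np.arange(12, dtype=np.int64).reshape(4, 3)  # C-contiguous (row-major)
f_arr = np.asfortranarray(c_arr)                      # F-contiguous (column-major)

itemsize = c_arr.itemsize  # 8 bytes per element

# Convert byte strides into element strides.
c_strides = tuple(s // itemsize for s in c_arr.strides)  # (d1, 1)
f_strides = tuple(s // itemsize for s in f_arr.strides)  # (1, d0)


def flat_offset(index, strides):
    """Offset I = sum_k i_k * s(k) of a multi-index, counted in elements."""
    return sum(i * s for i, s in zip(index, strides))


i0, i1 = 2, 1
offset_c = flat_offset((i0, i1), c_strides)  # 2*3 + 1
offset_f = flat_offset((i0, i1), f_strides)  # 2 + 1*4

# Reading each array in its own memory order shows that both offsets
# address the same logical element [2, 1].
same_c = c_arr.ravel(order="C")[offset_c] == c_arr[i0, i1]
same_f = f_arr.ravel(order="F")[offset_f] == f_arr[i0, i1]
```

The two arrays compare equal element by element; only the order in which the elements sit in memory, and therefore the strides, differs.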
  13. Awful Maths

    General Disclaimer: All the maths appearing in the next slides is only intended to better introduce the considered case studies. The speakers are not responsible for any possible disease or “brain consumption” caused by too many formulas. So BEWARE; use this information at your own risk! Its intention is solely educational. We would strongly encourage you to use this information in cooperation with a medical or health professional.
  14. BEFORE STARTING

    What do you need to get started:
    • A handy Unix command-line tool:
      • Linux / Mac OS X users: you’re done.
      • Windows users: it may be time to change your OS :-)
    • [I]Python (You say?!)
    • A DBMS:
      • Relational: e.g., SQLite3, PostgreSQL
      • NoSQL: e.g., MongoDB
  15. CASE STUDIES ON NUMERICAL EFFICIENCY

    • Vectorization (NumPy vs. “pure” Python)
    • Loops and math functions (e.g., sin(x))
    • Matrix-Vector Product
    • Different implementations of Matrix-Vector Product
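The case studies listed above can be sketched as follows. This is an illustrative comparison, not the speakers' benchmark code: the function names, the array size and the number of timing repetitions are all assumptions made for this example.

```python
import math
import timeit

import numpy as np

n = 100_000
x_list = [i * 0.001 for i in range(n)]
x_arr = np.array(x_list)


def sin_pure_python(values):
    # One interpreter-level function call per element.
    return [math.sin(v) for v in values]


def sin_numpy(values):
    # A single call; the loop runs in compiled code.
    return np.sin(values)


def matvec_pure_python(matrix, vector):
    # Matrix-vector product with explicit Python loops over nested lists.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


matrix = np.arange(12, dtype=float).reshape(4, 3)
vector = np.array([1.0, 2.0, 3.0])

loop_result = matvec_pure_python(matrix.tolist(), vector.tolist())
dot_result = np.dot(matrix, vector)

# Timings depend on the machine, so they are printed rather than asserted.
t_python = timeit.timeit(lambda: sin_pure_python(x_list), number=5)
t_numpy = timeit.timeit(lambda: sin_numpy(x_arr), number=5)
print(f"pure Python: {t_python:.4f} s, NumPy: {t_numpy:.4f} s")
```

Both versions compute the same values; the difference lies in where the loop runs, which is the point of vectorization.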
  16. dot

  21. MACHINE LEARNING

    • Machine Learning = Learning by Machine(s)
    • Algorithms and techniques to gain insights from data or a dataset
    • Supervised or Unsupervised Learning
    • Machine Learning is actively being used today, perhaps in many more places than you’d expect:
      • Mail spam filtering
      • Search engine results ranking
      • Preference selection, e.g., Amazon “Customers Who Bought This Item Also Bought”
  22. CLUSTERING: BRIEF INTRODUCTION

    • Clustering is a type of unsupervised learning that automatically forms clusters (groups) of similar things. It’s like automatic classification. You can cluster almost anything, and the more similar the items are in the cluster, the better your clusters are.
    • k-means is an algorithm that will find k clusters for a given dataset.
    • The number of clusters k is user defined.
    • Each cluster is described by a single point known as the centroid.
    • Centroid means it’s at the center of all the points in the cluster.
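The description of k-means above can be sketched in a few lines of NumPy. This is a plain version of Lloyd's algorithm written for illustration, not the code used in the talk; the function name `k_means`, its parameters and the synthetic two-blob dataset are assumptions.

```python
import numpy as np


def k_means(points, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on an (n_points, n_dims) array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points of the dataset chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Synthetic data: two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = k_means(points, k=2)
```

The sketch uses Euclidean distance; for points on a map, a distance on the sphere is the more natural choice.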
  23. EXAMPLE: CLUSTERING POINTS ON A MAP

    Here’s the situation: your friend <NAME> wants you to take him out in the greater Portland, Oregon, area (US) for his birthday. A number of other friends are going to come also, so you need to provide a plan that everyone can follow. Your friend has given you a list of places he wants to go. This list is long; it has 70 establishments in it.
  24. Spherical Distance Measure

    • $\phi_s, \lambda_s$ and $\phi_f, \lambda_f$: latitude and longitude coordinates of two points (s and f)
    • $\Delta\phi, \Delta\lambda$: corresponding differences
    • $\Delta\hat{\sigma} = \arccos(\sin\phi_s \sin\phi_f + \cos\phi_s \cos\phi_f \cos\Delta\lambda)$
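The spherical distance measure translates almost literally into code. This is a minimal sketch, assuming coordinates given in degrees and a mean Earth radius of 6371 km; the function name `spherical_distance` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def spherical_distance(lat_s, lon_s, lat_f, lon_f, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in degrees."""
    phi_s, phi_f = math.radians(lat_s), math.radians(lat_f)
    delta_lambda = math.radians(lon_f - lon_s)
    cos_sigma = (math.sin(phi_s) * math.sin(phi_f)
                 + math.cos(phi_s) * math.cos(phi_f) * math.cos(delta_lambda))
    # Clamp to [-1, 1] to guard against rounding errors before arccos.
    cos_sigma = max(-1.0, min(1.0, cos_sigma))
    return radius * math.acos(cos_sigma)


# Two points on the equator, a quarter of the way around the globe.
quarter = spherical_distance(0.0, 0.0, 0.0, 90.0)
# Distance from a point to itself.
zero = spherical_distance(45.0, -122.0, 45.0, -122.0)
```

For points that are very close together the arccos form loses precision, so a production implementation might prefer the haversine formula instead.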
  25. TRIVIAL EXAMPLE: INVERSE MATRIX

    • Problem: Given an input matrix A, calculate, if possible, its inverse matrix.
    • Definition: In linear algebra, an n-by-n (square) matrix A is invertible (a.k.a. nonsingular or nondegenerate) if there exists an n-by-n matrix B ($A^{-1}$) such that $AB = BA = I_n$
  26. Solution(s)

    ✓ Eigen Decomposition: $A^{-1} = Q \Lambda^{-1} Q^{-1}$
      • If A can be eigendecomposed ($A = Q \Lambda Q^{-1}$) and none of its eigenvalues is equal to zero, i.e., A is nonsingular
    ✓ Cholesky Decomposition: $A^{-1} = (L^*)^{-1} L^{-1}$
      • If A is positive definite ($A = L L^*$), where L is a lower triangular matrix and $L^*$ is the conjugate transpose of L
    ✓ LU Factorization: $A = LU$ (with L and U lower and upper triangular matrices, respectively)
    ✓ Analytic Solution (writing the matrix of cofactors), a.k.a. Cramer's method:
      $A^{-1} = \frac{1}{\det(A)} (C^T)_{i,j} = \frac{1}{\det(A)} (C_{j,i}) = \frac{1}{\det(A)} \begin{pmatrix} C_{1,1} & C_{2,1} & \cdots & C_{n,1} \\ C_{1,2} & C_{2,2} & \cdots & C_{n,2} \\ \vdots & \vdots & \ddots & \vdots \\ C_{1,n} & C_{2,n} & \cdots & C_{n,n} \end{pmatrix}$
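The decomposition-based solutions can be compared numerically. The sketch below assumes a small symmetric positive-definite matrix chosen for illustration, so that every approach applies; because the matrix is real and symmetric, `eigh` returns an orthogonal Q, so $Q^{-1} = Q^T$ and $L^* = L^T$.

```python
import numpy as np

# A small symmetric positive-definite matrix, chosen for illustration.
A = np.array([[4.0, 2.0, 0.6],
              [2.0, 5.0, 1.0],
              [0.6, 1.0, 3.0]])
identity = np.eye(3)

# Eigendecomposition: A = Q Lambda Q^T  =>  A^{-1} = Q Lambda^{-1} Q^T
eigenvalues, Q = np.linalg.eigh(A)
inv_eig = Q @ np.diag(1.0 / eigenvalues) @ Q.T

# Cholesky: A = L L^T  =>  A^{-1} = (L^T)^{-1} L^{-1} = (L^{-1})^T L^{-1}
L = np.linalg.cholesky(A)              # lower triangular factor
L_inv = np.linalg.solve(L, identity)
inv_chol = L_inv.T @ L_inv

# LU: np.linalg.solve relies on a LAPACK routine that factorizes A = P L U
# with partial pivoting, then solves A X = I by substitution.
inv_lu = np.linalg.solve(A, identity)

reference = np.linalg.inv(A)
```

In practice an explicit inverse is rarely needed: solving the linear system directly is usually cheaper and more accurate.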
  27. Example

    $C = \begin{pmatrix} C_{1,1} & C_{1,2} & C_{1,3} \\ C_{2,1} & C_{2,2} & C_{2,3} \\ C_{3,1} & C_{3,2} & C_{3,3} \end{pmatrix}$
  28. Example (continued)

    $C^{-1} = \frac{1}{\det(C)} \begin{pmatrix} (C_{2,2}C_{3,3} - C_{2,3}C_{3,2}) & (C_{1,3}C_{3,2} - C_{1,2}C_{3,3}) & (C_{1,2}C_{2,3} - C_{1,3}C_{2,2}) \\ (C_{2,3}C_{3,1} - C_{2,1}C_{3,3}) & (C_{1,1}C_{3,3} - C_{1,3}C_{3,1}) & (C_{1,3}C_{2,1} - C_{1,1}C_{2,3}) \\ (C_{2,1}C_{3,2} - C_{2,2}C_{3,1}) & (C_{3,1}C_{1,2} - C_{1,1}C_{3,2}) & (C_{1,1}C_{2,2} - C_{1,2}C_{2,1}) \end{pmatrix}$
  29. Example (continued)

    $\det(C) = C_{1,1}(C_{2,2}C_{3,3} - C_{2,3}C_{3,2}) + C_{1,2}(C_{2,3}C_{3,1} - C_{2,1}C_{3,3}) + C_{1,3}(C_{2,1}C_{3,2} - C_{2,2}C_{3,1})$
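The 3x3 example maps directly onto code. This is a sketch of the analytic (cofactor) approach under the assumption of a plain 3x3 input; the function name `inverse_3x3` and the sample matrix are hypothetical, and the determinant is expanded along the first row.

```python
import numpy as np


def inverse_3x3(C):
    """Inverse of a 3x3 matrix via the adjugate (transposed cofactor matrix)."""
    (c11, c12, c13), (c21, c22, c23), (c31, c32, c33) = C
    # Determinant, expanded along the first row.
    det = (c11 * (c22 * c33 - c23 * c32)
           + c12 * (c23 * c31 - c21 * c33)
           + c13 * (c21 * c32 - c22 * c31))
    if det == 0:
        raise ValueError("matrix is singular")
    adjugate = np.array([
        [c22 * c33 - c23 * c32, c13 * c32 - c12 * c33, c12 * c23 - c13 * c22],
        [c23 * c31 - c21 * c33, c11 * c33 - c13 * c31, c13 * c21 - c11 * c23],
        [c21 * c32 - c22 * c31, c31 * c12 - c11 * c32, c11 * c22 - c12 * c21],
    ])
    return adjugate / det


C = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 0.0, 0.0]])
C_inv = inverse_3x3(C)
```

The exact comparison `det == 0` is only adequate for a toy example; with floating-point input a nearly singular matrix would pass the check and produce a badly conditioned result.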
  30. Home Made

    • Duplicated code
    • Template Method Pattern
    • However, we still have to implement the computational functions from scratch! Reinventing the wheel!
  31. Under the hood

    Type:        function
    String Form: <function inv at 0x105f72b90>
    File:        /Library/Python/2.7/site-packages/numpy/linalg/linalg.py
    Definition:  linalg.inv(a)
    Source:
    def inv(a):
        """
        Compute the (multiplicative) inverse of a matrix.
        [...]
        Parameters
        ----------
        a : array_like, shape (M, M)
            Matrix to be inverted.

        Returns
        -------
        ainv : ndarray or matrix, shape (M, M)
            (Multiplicative) inverse of the matrix `a`.

        Raises
        ------
        LinAlgError
            If `a` is singular or not square.
        [...]
        """
        a, wrap = _makearray(a)
        return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
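The source shown on this slide computes the inverse by solving $AX = I$. A short sketch of the same idea from the caller's side, using an arbitrary 2x2 matrix chosen for illustration; more recent NumPy releases implement `inv` differently internally, but the public behaviour shown here is the same.

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

# What the slide's source does: solve A X = I for X.
inv_via_solve = np.linalg.solve(A, np.eye(A.shape[0], dtype=A.dtype))
inv_direct = np.linalg.inv(A)

# A singular matrix raises LinAlgError, as the docstring states.
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])
try:
    np.linalg.inv(singular)
    raised = False
except np.linalg.LinAlgError:
    raised = True
```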