based access • Single data type

    arr = np.array([1, 2, 3, 4], dtype=np.int64)
    arr
    array([1, 2, 3, 4])

Note: location-based access. A NumPy "Structured Array" can contain multiple dtypes.
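To illustrate that note, here is a minimal structured-array sketch; the field names 'id' and 'score' are made up for the example:

    import numpy as np

    # One array, two dtypes: each record holds an int64 and a float64.
    rec = np.array([(1, 0.5), (2, 1.5)],
                   dtype=[('id', np.int64), ('score', np.float64)])
    rec['id']      # array([1, 2])
    rec['score']   # array([ 0.5,  1.5])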
Reshaping (merge, join, concat…) • Various I/O support (SQL, Excel, CSV/TSV…) • Flexible time series handling • Plotting • Please refer to the official documentation for details.
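A minimal sketch of two of these features, reshaping and time-series handling (all data below is made up):

    import pandas as pd

    left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
    right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})
    pd.merge(left, right, on='key')   # reshaping: join two frames on a key

    ts = pd.Series(range(4),
                   index=pd.date_range('2017-01-01', periods=4, freq='H'))
    ts.resample('2H').sum()           # time series: aggregate into 2-hour bins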
C functions and C types. Can be compiled to C code.

    def ismember(ndarray arr, set values):
        cdef:                                  # type definitions
            Py_ssize_t i, n
            ndarray[uint8_t] result
            object val

        n = len(arr)
        result = np.empty(n, dtype=np.uint8)
        for i in range(n):
            val = util.get_value_at(arr, i)    # get input's i-th value
            result[i] = val in values
        # return bool array indicating which array elements are included in the set
        return result.view(np.bool_)

    ismember(np.array([1, 2, 3, 4]), set([2, 3]))
    array([False, True, True, False], dtype=bool)
    duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
        cdef:
            int ret = 0, value, k
            Py_ssize_t i, n = len(values)
            kh_int64_t * table = kh_init_int64()
            ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')

        kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))
        …
        else:
            with nogil:                        # release the GIL
                for i from 0 <= i < n:
                    value = values[i]
                    k = kh_get_int64(table, value)
                    if k != table.n_buckets:   # value seen before: mark both occurrences
                        out[table.vals[k]] = 1
                        out[i] = 1
                    else:                      # first occurrence: record it in the hash table
                        k = kh_put_int64(table, value, &ret)
                        table.keys[k] = value
                        table.vals[k] = i
                        out[i] = 0
        kh_destroy_int64(table)
        return out
multiple threads. • The GIL is released on I/O. • The GIL can be released using Cython. • Scientific packages are working to release the GIL: NumPy, SciPy, scikit-learn, pandas… • Python objects/classes cannot be used while the GIL is released. • If the target is object dtype, the GIL cannot be released.
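A minimal sketch of why releasing the GIL matters: because NumPy releases the GIL inside its C routines, a plain thread pool can overlap numeric work on multiple cores (array sizes and worker count here are arbitrary):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    arrays = [np.random.randn(1000, 1000) for _ in range(4)]

    # np.dot releases the GIL while it runs, so these calls can
    # genuinely run in parallel even though we only use threads.
    with ThreadPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(lambda a: a.dot(a), arrays))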
pandas performs computations using a single thread. • Users have to parallelize by themselves. • pandas cannot handle data that exceeds physical memory. • Users have to write the logic themselves using pandas' chunked I/O (the "chunksize" option), as sketched below.
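A minimal sketch of that chunk-based approach ('large.csv' and the column name are placeholders):

    import pandas as pd

    # Read the file in 1-million-row pieces instead of all at once;
    # the combine step has to be written by hand.
    total = 0
    for chunk in pd.read_csv('large.csv', chunksize=1000000):
        total += chunk['value'].sum()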
framework for numeric operations which offers: • Data structures like nd-array and DataFrame which extend common interfaces like NumPy and pandas. • A dynamic task graph and scheduling optimized for computation. • Author: Matthew Rocklin • License: BSD • GitHub: 1500↑⭐
library featuring a higher-level API for TensorFlow. • Airflow (uses the Distributed Scheduler): a platform to author, schedule and monitor workflows. • scikit-image: Image Processing SciKit. • xarray: N-D labeled arrays and datasets in Python. • Blaze: an interface to query data on different storage systems. • Datashader: a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly.
chunk size. • It is recommended to use a uniform chunk size along each axis, because computations are performed per chunk.

    da.ones((30, 20, 1, 15), chunks=(3, 7, 1, 2))
    dask.array<wrapped, shape=(30, 20, 1, 15), dtype=float64, chunksize=(3, 7, 1, 2)>
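If the chunking is awkward, it can be changed with rechunk; a minimal sketch (the sizes are arbitrary, chosen to divide each axis evenly):

    import dask.array as da

    x = da.ones((30, 20, 1, 15), chunks=(3, 7, 1, 2))
    x.chunks                       # the per-axis chunk sizes actually used
    y = x.rechunk((3, 5, 1, 5))    # uniform chunks along each axis
    y.sum().compute()              # the reduction runs chunk by chunk, then combines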
(split-apply-combine) • Reshaping (merge, join, concat…) • Various I/O support (SQL, CSV/TSV…) • Flexible time series handling • Please refer to the official documentation for details.
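A minimal sketch of split-apply-combine and CSV I/O in dask.dataframe (the file pattern and column names are placeholders):

    import dask.dataframe as dd

    # Lazily reads many CSV files as one logical DataFrame
    df = dd.read_csv('data-*.csv')

    # Same split-apply-combine API as pandas; nothing runs until .compute()
    df.groupby('key')['value'].mean().compute()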
    @delayed                   # the @delayed decorator makes the wrapped function lazy
    def inc(x):
        return x + 1

    @delayed
    def add(x, y):
        return x + y

    x = inc(1)
    y = inc(5)
    total = add(x, y)          # delayed functions can be chained; the output is a
    total                      # Delayed instance, not evaluated at this moment
    Delayed('add-b43be476-ffc7-48d7-a8ec-0f95df821e64')
    total.compute()            # trigger computation
    8
which works on a single node: • threading • multiprocessing • The Dask Distributed package offers a distributed scheduler, which distributes tasks over multiple nodes. A sketch of selecting a single-node scheduler follows.
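A minimal sketch of picking a single-node scheduler (the scheduler= keyword follows current Dask releases; older versions used the get= keyword instead):

    import dask.array as da

    x = da.ones((10000, 10000), chunks=(1000, 1000))

    # Run the task graph on a thread pool (good for GIL-releasing NumPy code)
    x.sum().compute(scheduler='threads')

    # Run on a process pool instead (good for GIL-bound, pure-Python code)
    x.sum().compute(scheduler='processes')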
• Low latency: each task suffers about 1 ms of overhead. • Peer-to-peer data sharing: workers communicate with each other to share data. • Complex scheduling: supports complex workflows. • Data locality: scheduling algorithms cleverly execute computations where data lives.

[Diagram: Distributed Client → Distributed Scheduler → Distributed Workers]
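A minimal sketch of connecting to a distributed scheduler (the address is a placeholder for a running dask-scheduler process):

    from dask.distributed import Client
    import dask.array as da

    client = Client('scheduler-host:8786')   # hypothetical scheduler address

    x = da.ones((10000, 10000), chunks=(1000, 1000))
    x.mean().compute()                       # tasks now run on the cluster's workers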
to scale an existing NumPy or pandas project. • Parallel / out-of-core processing on a single node. • Prototyping complex algorithms interactively. • You don't have Big Data infrastructure you can use freely. • Spark works well when: • You need to scale to a large number of cluster nodes. • Your workflow fits the Spark API (typical ETL or SQL-like operations). • You need enterprise support.
internal data efficiently. • Dask to parallelize data processing easily. It provides: • Data structures like nd-array and DataFrame which extend common interfaces like NumPy and pandas. • A dynamic task graph and scheduling optimized for computation.