(merge, join, concat…)
• Various I/O support (SQL, Excel, …)
• Flexible time series handling
• Plotting
• Please refer to the official documentation for details.
internally.
• Intended to clarify the basics rather than explain algorithmic details.
• Expected to help you achieve better performance in your programs.
C functions and C types. Can be compiled to C code.

    def ismember(ndarray arr, set values):
        # Return a bool array indicating whether each element of "arr"
        # is included in the "values" set.
        cdef:
            # Type definitions
            Py_ssize_t i, n
            ndarray[uint8_t] result
            object val

        n = len(arr)
        result = np.empty(n, dtype=np.uint8)
        for i in range(n):
            # Get the i-th value of "arr"
            val = util.get_value_at(arr, i)
            result[i] = val in values
        return result.view(np.bool_)

    ismember(np.array([1, 2, 3, 4]), set([2, 3]))
    array([False, True, True, False], dtype=bool)
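For comparison, the same membership test can be sketched from plain Python with NumPy. This is only an illustration of what `ismember` computes, not the Cython code path itself; the variable names are made up for the example, and `np.isin` is used as the vectorized counterpart.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
values = {2, 3}

# Element-by-element loop, mirroring the Cython function above
loop_result = np.array([val in values for val in arr], dtype=bool)

# Vectorized equivalent; np.isin expects an array-like, so the set is converted to a list
isin_result = np.isin(arr, list(values))

print(loop_result)
print(isin_result)
```

Both arrays should mark the second and third elements as members, matching the output shown on the slide.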
C.
• Cited from “Cython: A Guide for Python Programmers” by Kurt W. Smith.
• NOTE: C and Fortran are not included in the table.

Table: lines of Cython per project (Sage, pandas, SciPy, scikit-learn, NumPy).
ndim=1] values, object keep='first'):
        cdef:
            int ret = 0, value, k
            Py_ssize_t i, n = len(values)
            kh_int64_t * table = kh_init_int64()
            ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')

        kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))
        …
        else:
            # Release the GIL
            with nogil:
                for i from 0 <= i < n:
                    value = values[i]
                    # Use khash directly (unable to use wrapper class without GIL)
                    k = kh_get_int64(table, value)
                    if k != table.n_buckets:
                        out[table.vals[k]] = 1
                        out[i] = 1
                    else:
                        k = kh_put_int64(table, value, &ret)
                        table.keys[k] = value
                        table.vals[k] = i
                        out[i] = 0
        kh_destroy_int64(table)
        return out
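The logic of the branch shown above can be sketched in pure Python. This is an illustration only: a `dict` stands in for the khash table, the function name `duplicated_all` is hypothetical, and there is no GIL release or C-level typing here.

```python
def duplicated_all(values):
    """Mark every value that occurs more than once (pure-Python sketch)."""
    first_seen = {}                 # value -> index of its first occurrence
    out = [False] * len(values)
    for i, value in enumerate(values):
        if value in first_seen:
            # Seen before: flag both the first occurrence and the current one
            out[first_seen[value]] = True
            out[i] = True
        else:
            # First occurrence: remember where it was found
            first_seen[value] = i
    return out


print(duplicated_all([1, 2, 1, 3, 2]))  # [True, True, True, False, True]
```

The Cython version follows the same steps, but with typed variables and direct khash calls so that the loop can run without the GIL.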
applied to most cases from a performance point of view.
• Some functions favor user convenience over performance.
• Environment
  • AWS EC2: c4.2xlarge (vCPU: 8, Memory: 15 GiB)
  • Python 3.5.0
• DISCLAIMER: Performance depends heavily on the actual data and operations. Be sure to profile to confirm the effect.
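Outside IPython, where `%timeit` is not available, a comparable measurement can be sketched with the standard `timeit` module. The helper name `best_time` and the two statements being compared are assumptions chosen for illustration; substitute your own data and operations.

```python
import timeit


def best_time(stmt, setup="pass", repeat=3, number=10):
    """Return the best per-call time in seconds, similar in spirit to %timeit."""
    timings = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(timings) / number


# Hypothetical workload: two ways to remove duplicates from a list
setup = "data = list(range(10000)) * 10"
t_set = best_time("list(set(data))", setup)
t_dict = best_time("list(dict.fromkeys(data))", setup)

print("set:  {:.3f} ms per loop".format(t_set * 1e3))
print("dict: {:.3f} ms per loop".format(t_dict * 1e3))
```

Taking the minimum over several repeats reduces noise from other processes, but the numbers still depend on the machine and the data.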
BLAS/ATLAS, LAPACK
• Install pandas optional dependencies:
  • Bottleneck: A collection of fast NumPy array functions.
  • Numexpr: A fast numerical expression evaluator.
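A quick way to check whether these optional dependencies are available in the current environment is sketched below. It uses only the standard library; the helper name `is_installed` is made up for the example, and the import names `bottleneck` and `numexpr` are assumed to match the package names.

```python
import importlib.util

OPTIONAL_DEPENDENCIES = ("bottleneck", "numexpr")


def is_installed(module_name):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


status = {name: is_installed(name) for name in OPTIONAL_DEPENDENCIES}
for name, found in status.items():
    print("{}: {}".format(name, "installed" if found else "missing"))
```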
writing user-defined functions (UDFs) yourself.
• Some functions may be faster than NumPy depending on conditions.
• Example: Uniquify (remove duplicates)

    np.unique([1, 2, 2, 3, 2, 4])
    array([1, 2, 3, 4])
of 3: 42.2 ms per loop (NumPy)

    %timeit pd.unique(values)
    100 loops, best of 3: 7.1 ms per loop (pandas)

Generate sample data:

    np.random.seed(71)
    values = np.random.randint(1, 1000, 1000000)
    values
    array([108, 942, 12, ..., 308, 897, 40])
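One behavioral difference is worth keeping in mind when swapping one function for the other: `np.unique` returns sorted values, while `pd.unique` returns values in order of first appearance. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

values = np.array([3, 1, 3, 2, 1])

np_result = np.unique(values)   # sorted unique values
pd_result = pd.unique(values)   # unique values in order of first appearance

print(np_result)  # [1 2 3]
print(pd_result)  # [3 1 2]
```

If sorted output is required, the result of `pd.unique` has to be sorted separately, which may reduce the speed advantage.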
Note that “str” is regarded as “object” dtype.
• Example: Group-by → mean
  Group by the “b” column, which has … unique values, then calculate the mean.

    df.groupby('b').mean()
column to “Categorical”.

“object” dtype:

    %timeit df.groupby('b').mean()
    10 loops, best of 3: 59.7 ms per loop

“Categorical”:

    df['b'] = df['b'].astype('category')
    %timeit df.groupby('b').mean()
    100 loops, best of 3: 17.2 ms per loop
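The comparison above can be reproduced with a self-contained sketch. The data here is made up (100 string labels, 100,000 rows), so the timings will not match the slide; `observed=True` is passed only to keep the categorical result limited to groups that actually occur.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(71)
n = 100000
labels = np.array(["g{}".format(i) for i in range(100)], dtype=object)

# "b" holds strings, so it is stored with "object" dtype
df = pd.DataFrame({"a": rng.rand(n), "b": labels[rng.randint(0, 100, n)]})

# Same data, with the grouping key converted to "Categorical"
df_cat = df.copy()
df_cat["b"] = df_cat["b"].astype("category")

mean_obj = df.groupby("b")["a"].mean()
mean_cat = df_cat.groupby("b", observed=True)["a"].mean()

t_obj = min(timeit.repeat(lambda: df.groupby("b")["a"].mean(), repeat=3, number=5)) / 5
t_cat = min(timeit.repeat(
    lambda: df_cat.groupby("b", observed=True)["a"].mean(), repeat=3, number=5)) / 5
print("object:      {:.2f} ms per loop".format(t_obj * 1e3))
print("categorical: {:.2f} ms per loop".format(t_cat * 1e3))
```

Both group-bys should return the same means; only the time spent factorizing the key is expected to differ, and the size of the gap depends on the data.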
= 10000
    df_right = pd.DataFrame({'c': np.random.randint(1, 100, n_right)})

Left: …M rows, … columns; Right: …K rows, … column

Join by sorted unique Index:

    %timeit df_left.join(df_right)
    100 loops, best of 3: 6.88 ms per loop

Shuffle by random sampling:

    df_right_shuffled = df_right.sample(n=len(df_right))

Join by non-sorted unique Index:

    %timeit df_left.join(df_right_shuffled)
    100 loops, best of 3: 18.7 ms per loop
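Whether an Index is sorted and unique can be checked before joining, and a shuffled frame can be restored with `sort_index()`. The sketch below uses small made-up frames (the sizes and the `random_state` are assumptions), so it only illustrates the check, not the timings.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_left = pd.DataFrame({"a": rng.randint(1, 100, 1000)})
df_right = pd.DataFrame({"c": rng.randint(1, 100, 100)})

# Shuffle by random sampling: the Index stays unique but is no longer sorted
df_right_shuffled = df_right.sample(n=len(df_right), random_state=0)
print(df_right_shuffled.index.is_unique)                # True
print(df_right_shuffled.index.is_monotonic_increasing)

# Sorting the Index restores the sorted, unique case
df_right_sorted = df_right_shuffled.sort_index()
print(df_right_sorted.index.is_monotonic_increasing)    # True

joined_original = df_left.join(df_right)
joined_sorted = df_left.join(df_right_sorted)
```

Sorting has its own cost, so it is likely to pay off mainly when the same frame is joined repeatedly.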
days = np.random.randint(1, 28, N)

Prepare …K random combinations of month and day.

ISO 8601 format like “…”:

    dates = [iso_8601_fmt.format(m, d) for m, d in zip(months, days)]
    %timeit pd.to_datetime(dates)
    100 loops, best of 3: 2.26 ms per loop

Other format, without and with the “format” keyword:

    dates = [mdy_fmt.format(m, d) for m, d in zip(months, days)]
    %timeit pd.to_datetime(dates)
    1 loops, best of 3: 805 ms per loop

    %timeit pd.to_datetime(dates, format='%m/%d/%Y')
    10 loops, best of 3: 26.1 ms per loop

Providing the “format” keyword may improve the performance.
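A minimal, self-contained sketch of the `format` keyword follows. The two date strings are made up for illustration; with only a handful of values the speed difference is not measurable, so this only shows that both calls parse to the same timestamps.

```python
import pandas as pd

# Month/day/year strings, which are not in ISO 8601 format
dates = ["01/15/2015", "12/03/2015"]

# Passing the format explicitly tells pandas how to parse every element
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# The ISO 8601 equivalents parse to the same timestamps
iso_parsed = pd.to_datetime(["2015-01-15", "2015-12-03"])

print(parsed)
print(parsed.equals(iso_parsed))  # True
```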