Sungmin Han
October 15, 2023

# Python Optimization and the Story of C Bindings _ Sungmin Han [PyCon Korea 2021]

## Transcript

2. ### Speaker

Sungmin Han (한성민)

- AIOps Senior backend-engineer at Riiid
- Former Research engineer at Naver Clova
- Former Software engineer at IGAWorks
- Former Software engineer at 심심이 (SimSimi)
- https://www.facebook.com/han.sungmin/
- https://github.com/KennethanCeyer
- https://www.linkedin.com/in/sungmin-han-768419133/
- [email protected]

- Thread in Python
  - GIL (Global Interpreter Lock)
  - IO Bound
  - Async IO
- Bindings
  - C-bindings
  - Lib bindings
  - Cross-compile bindings
- Hardware-accelerations
  - SIMD (Single Instruction Multiple Data)
  - GPGPU
  - Introduce libraries
- Data processing
  - Memory access mechanism
  - Zero-copy
  - Columnar Data Format
  - Clustered computing

7. ### Thread in Python

Threaded version:

    from concurrent.futures import ThreadPoolExecutor

    def calc(x: int) -> int:
        # Pure-Python arithmetic loop: CPU bound, so it holds the GIL.
        sum_value = 0
        for _ in range(10000000):
            sum_value += x
        return sum_value

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=10) as executor:
            res = executor.map(calc, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
            print(list(res))

Single-threaded version:

    def calc(x: int) -> int:
        sum_value = 0
        for _ in range(10000000):
            sum_value += x
        return sum_value

    if __name__ == "__main__":
        res = map(calc, [1, 2, 3])
        print(list(res))

- Thread: 4,116ms
- Single core: 1,208ms

9. ### Thread in Python with IO Bound Requests

Threaded version:

    from concurrent.futures import ThreadPoolExecutor
    from typing import List

    import requests

    URLS: List[str] = [
        "https://naver.com",
        "https://daum.net",
        "https://kakao.com",
        "https://google.com",
        "https://stackoverflow.com",
    ]

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=5) as executor:
            res = executor.map(requests.get, [*(URLS * 10)])
            print(list(res))

Single-threaded version:

    from typing import List

    import requests

    URLS: List[str] = [
        "https://naver.com",
        "https://daum.net",
        "https://kakao.com",
        "https://google.com",
        "https://stackoverflow.com",
    ]

    if __name__ == "__main__":
        res = map(requests.get, [*(URLS * 10)])
        print(list(res))

- Thread: 4,316ms
- Single core: 12,358ms
10. ### CPU Bound vs IO Bound

CPU Bound

- Access & Update Variable
- CPU Bound Function Call
- Sys call
- GC

IO Bound

- Disk Read & Write
- Network Communication
- DB Connect & Requests
- Audio Play
- Sleep (In case of AsyncIO)
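The IO-bound column can be illustrated with a small sketch. It assumes `time.sleep` as a stand-in for a network or disk wait (the sleep releases the GIL, as blocking IO calls do); the helper names `io_task`, `run_sequential`, and `run_threaded` are hypothetical and not from the talk.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def io_task(delay: float) -> float:
    """Stand-in for an IO-bound call; the thread waits without holding the GIL."""
    time.sleep(delay)
    return delay


def run_sequential(delays: List[float]) -> Tuple[List[float], float]:
    """Run every task one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [io_task(d) for d in delays]
    return results, time.perf_counter() - start


def run_threaded(delays: List[float]) -> Tuple[List[float], float]:
    """Run every task in its own thread and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        results = list(executor.map(io_task, delays))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.1] * 4
    _, sequential_time = run_sequential(delays)
    _, threaded_time = run_threaded(delays)
    print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The waits overlap in the threaded run, which is the same effect the `requests.get` benchmark shows. A CPU-bound function such as `calc` would not benefit, because only one thread can execute Python bytecode at a time.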

13. ### Native Python STD

    from math import sqrt
    from timeit import timeit
    from typing import List

    def mean(nums: List[float]) -> float:
        return sum(nums) / len(nums)

    def var(nums: List[float]) -> float:
        # Population variance: mean of squared deviations.
        mean_ = mean(nums)
        vsum = 0
        for n in nums:
            vsum += (n - mean_) ** 2
        return vsum / len(nums)

    def std(nums: List[float]) -> float:
        return sqrt(var(nums))

    data = list(range(5000))
    timeit(lambda: std(data), number=100000)

94,582ms
14. ### Python with Clang

    #include <Python.h>
    #include <math.h>

    /* API is the export macro used on the slide; its definition is not shown. */

    API PyObject* gs_sum(PyObject* pList) {
        PyObject* pListItem;
        double result = 0;
        int length;
        int i;

        if (!PyList_Check(pList)) {
            PyErr_SetString(PyExc_TypeError, "parameter must be a list.");
            return NULL;
        }

        length = PyList_Size(pList);
        for (i=0; i<length; ++i) {
            pListItem = PyList_GetItem(pList, (Py_ssize_t)i);
            result += PyFloat_AsDouble(pListItem);
        }
        return PyFloat_FromDouble(result);
    }

    API PyObject* gs_mean(PyObject* pList) {
        PyObject* pSum;
        int length;

        pSum = gs_sum(pList);
        length = PyList_Size(pList);
        return PyFloat_FromDouble(PyFloat_AsDouble(pSum) / length);
    }

    API PyObject* gs_var(PyObject* pList) {
        PyObject* pListItem;
        double result = 0;
        double meanValue;
        int length;
        int i;

        if (!PyList_Check(pList)) {
            PyErr_SetString(PyExc_TypeError, "parameter must be a list.");
            return NULL;
        }

        meanValue = PyFloat_AsDouble(gs_mean(pList));
        length = PyList_Size(pList);
        for (i=0; i<length; ++i) {
            pListItem = PyList_GetItem(pList, (Py_ssize_t)i);
            result += pow(PyFloat_AsDouble(pListItem) - meanValue, 2);
        }
        result /= length;
        return PyFloat_FromDouble(result);
    }

    API PyObject* gs_std(PyObject* pList) {
        return PyFloat_FromDouble(sqrt(PyFloat_AsDouble(gs_var(pList))));
    }
15. ### Python with Clang

    from ctypes import PyDLL, py_object
    from os import path
    from timeit import timeit

    # PyDLL keeps the GIL held during calls, which the Python C API requires.
    dll = PyDLL(path.join(path.dirname(__file__), './lib/grayscale.dll'))

    # math
    dll.gs_sum.restype = py_object
    dll.gs_sum.argtypes = py_object,
    dll.gs_mean.restype = py_object
    dll.gs_mean.argtypes = py_object,
    dll.gs_var.restype = py_object
    dll.gs_var.argtypes = py_object,
    dll.gs_std.restype = py_object
    dll.gs_std.argtypes = py_object,

    data = list(range(5000))
    timeit(lambda: dll.gs_std(data), number=100000)

27,095ms (-71%)
16. ### Native vs Clang Binding

| Function | Python (3.9), ms | Clang, ms |
| --- | --- | --- |
| Sum | 2,638 | 5,475 |
| Mean | 2,264 | 5,706 |
| Var | 95,506 | 26,945 |
| Std | 94,582 | 27,095 |
17. ### C Bindings

[Diagram: two binding paths. Module-level Binding: Python Runtime, PyModule, C-lang. API-level Binding: Python Runtime, PyAPI, C-lang. Other labels: Python Function, GIL Release]
18. ### Library Bindings

Library Extension by Platforms

- Linux: .so
- MacOS: .dylib
- Windows: .dll

[Diagram: Library, Library Mapping (Parameter Definition), Python Runtime, Python Function]
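The "library mapping (parameter definition)" step can be sketched with the standard `ctypes` module. This is a minimal illustration, assuming the C math library can be located on the host; the helpers `shared_library_suffix` and `load_c_sqrt` are hypothetical names, not part of the talk.

```python
import ctypes
import ctypes.util
import sys
from typing import Callable, Optional


def shared_library_suffix(platform: str = sys.platform) -> str:
    """Return the shared-library extension for a sys.platform string."""
    if platform.startswith("win"):
        return ".dll"
    if platform == "darwin":
        return ".dylib"
    return ".so"


def load_c_sqrt() -> Optional[Callable[[float], float]]:
    """Map libm's sqrt into Python, or return None if libm cannot be found."""
    lib_path = ctypes.util.find_library("m")
    if lib_path is None:
        return None
    libm = ctypes.CDLL(lib_path)
    # Parameter definition: declare the C signature `double sqrt(double)`.
    libm.sqrt.argtypes = (ctypes.c_double,)
    libm.sqrt.restype = ctypes.c_double
    return libm.sqrt


if __name__ == "__main__":
    print("expected suffix on this platform:", shared_library_suffix())
    c_sqrt = load_c_sqrt()
    if c_sqrt is not None:
        print("sqrt(9.0) via libm:", c_sqrt(9.0))
```

Without the `argtypes` and `restype` declarations, `ctypes` would treat the return value as a C `int`, so declaring the signature is what makes the mapped function usable.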
19. ### Binding with Various languages

[Diagram: Library, Python Runtime]

Note: because of how shared libraries are structured, only languages that compile to native code can be bound this way; languages that run on a VM runtime cannot.
20. ### Rust: PyO3

https://github.com/PyO3/pyo3

Cargo.toml

    [package]
    name = "string-sum"
    version = "0.1.0"
    edition = "2018"

    [lib]
    name = "string_sum"
    # "cdylib" is necessary to produce a shared library for Python to import from.
    #
    # Downstream Rust code (including code in `bin/`, `examples/`, and `tests/`) will not be able
    # to `use string_sum;` unless the "rlib" or "lib" crate type is also included, e.g.:
    # crate-type = ["cdylib", "rlib"]
    crate-type = ["cdylib"]

    [dependencies.pyo3]
    version = "0.14.1"
    features = ["extension-module"]
21. ### Rust: PyO3

https://github.com/PyO3/pyo3

src/lib.rs

    use pyo3::prelude::*;

    /// Formats the sum of two numbers as string.
    #[pyfunction]
    fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
        Ok((a + b).to_string())
    }

    /// A Python module implemented in Rust. The name of this function must match
    /// the `lib.name` setting in the `Cargo.toml`, else Python will not be able to
    /// import the module.
    #[pymodule]
    fn string_sum(_py: Python, m: &PyModule) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
        Ok(())
    }

\$ python
23. ### Rust: PyO3

https://github.com/PyO3/pyo3

[Diagram: Rust Module, maturin, Library, Library Mapping (Parameter Definition), Python Runtime, Python Function]
24. ### Other projects

- Golang binding: gopy, https://github.com/go-python/gopy
- Kotlin binding: kotlin-native python binding, https://github.com/JetBrains/kotlin/blob/master/kotlin-native/samples/python_extension
25. ### Hardware Acceleration

[Diagram: Sequential Processing compared with Parallel Processing for Hardware Acceleration. Labels: Task 2_1, Task 2_2, Task 2_3, Task 3, Mem]
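27. ### SIMD Operation

Single Instruction Single Data Stream (SISD): one addition per CPU tick, four ticks in total.

- CPU Tick 1: 1 + 5 = 6
- CPU Tick 2: 2 + 6 = 8
- CPU Tick 3: 3 + 7 = 10
- CPU Tick 4: 4 + 8 = 12

Single Instruction Multiple Data Stream (SIMD): all four additions in one CPU tick.

- CPU Tick 1: [1, 2, 3, 4] + [5, 6, 7, 8] = [6, 8, 10, 12]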
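The tick counts in the SISD and SIMD pictures can be reproduced with a toy model. This is not real SIMD: plain Python cannot issue vector instructions, so the sketch only counts how many add instructions each scheme would need. The function names and the `lanes` parameter are hypothetical.

```python
from typing import List, Tuple


def add_sisd(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Model SISD: each tick adds exactly one pair of operands."""
    out: List[int] = []
    ticks = 0
    for x, y in zip(a, b):
        out.append(x + y)
        ticks += 1
    return out, ticks


def add_simd(a: List[int], b: List[int], lanes: int = 4) -> Tuple[List[int], int]:
    """Model SIMD: each tick adds up to `lanes` pairs of operands at once."""
    out: List[int] = []
    ticks = 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        ticks += 1
    return out, ticks


if __name__ == "__main__":
    print(add_sisd([1, 2, 3, 4], [5, 6, 7, 8]))  # ([6, 8, 10, 12], 4)
    print(add_simd([1, 2, 3, 4], [5, 6, 7, 8]))  # ([6, 8, 10, 12], 1)
```

Both functions return the same sums; only the modeled tick count differs, which is the point of the slide.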
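28. ### SSE3 in Cython

    # The intrinsics and the __m128d type are assumed to be declared through a
    # `cdef extern from "emmintrin.h"` block that the slide does not show.
    cdef void sample():
        cdef double[4] operand1
        cdef double[4] operand2
        cdef __m128d moperand1, moperand2, mtmp
        cdef double[4] out
        cdef int i

        operand1[:] = [1.0, 2.0, 3.0, 4.0]
        operand2[:] = [5.0, 6.0, 7.0, 8.0]

        with nogil:
            # A __m128d register holds two doubles, so four values take two passes.
            for i in range(0, 4, 2):
                moperand1 = _mm_loadu_pd(&operand1[i])
                moperand2 = _mm_loadu_pd(&operand2[i])
                # tmp = A + B
                # [1.0, 2.0, 3.0, 4.0] + [5.0, 6.0, 7.0, 8.0]
                mtmp = _mm_add_pd(moperand1, moperand2)
                # Unaligned store, because `out` is not guaranteed to be 16-byte aligned.
                _mm_storeu_pd(&out[i], mtmp)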
29. ### BLAS (Basic Linear Algebra Subprograms)

- Level 1: Linear Algebra
- Level 2: Matrix to Vector operation
- Level 3: Matrix to Matrix operation
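In standard BLAS terminology, Level 1 covers vector-vector operations, Level 2 matrix-vector operations, and Level 3 matrix-matrix operations. The sketch below shows one representative operation per level in plain Python. The function names borrow from the BLAS routines AXPY, GEMV, and GEMM, but the signatures are simplified assumptions for illustration; real BLAS routines take scaling factors, strides, and layout flags.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Level 1 (vector-vector): return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a: Matrix, x: Vector) -> Vector:
    """Level 2 (matrix-vector): return A @ x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Level 3 (matrix-matrix): return A @ B."""
    b_columns = list(zip(*b))  # transpose B so each column is easy to iterate
    return [
        [sum(aik * bkj for aik, bkj in zip(row, col)) for col in b_columns]
        for row in a
    ]


if __name__ == "__main__":
    print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))
    print(gemv([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))
    print(gemm([[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]))
```

The amount of work grows with each level: roughly n operations for Level 1, n² for Level 2, and n³ for Level 3, which is why optimized BLAS libraries pay off most on matrix-matrix products.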
30. ### GPU Processing with CUDA

[Diagram: Native Python, C-lang Binding, CUDA Lib; CPU and GPU Device; CUDA Memory and GPU Mem]
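31. ### CUDA Python using numba

    import numba as nb
    import numpy as np
    from numba import prange

    @nb.njit(parallel=True, fastmath=True)
    def sum_in_parallel(A):
        acc = 0.
        n = len(A)
        # prange splits the loop across threads; acc is treated as a reduction.
        for i in prange(n):
            acc += np.log(A[i])
        return acc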
32. ### Numba package

This slide covers the following:

- Introduction to the Numba project
- Inside NumPy's implementation
- Main differences between Numba and NumPy

https://developer.nvidia.com/how-to-cuda-python

34. ### Memory-access mechanism

    menu_price = {
        'Apple': 1300,
        'Banana': 700,
        'Grape': 2200,
    }

[Diagram: Keys (Apple, Banana, Grape) map to Buckets (0x01, 0x02), which point to Entries (1,300, 700, 2,200)]
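The keys, buckets, and entries picture can be sketched as a tiny hash table. This is a simplified model loosely inspired by CPython's compact dict layout, not its actual implementation: it assumes linear probing (CPython uses a perturbation-based probe sequence) and omits resizing and deletion. `TinyDict` is a hypothetical name.

```python
from typing import Any, Hashable, List, Optional, Tuple


class TinyDict:
    """Toy hash table: buckets hold indices into an insertion-ordered entries list."""

    def __init__(self, size: int = 8) -> None:
        self.buckets: List[Optional[int]] = [None] * size
        self.entries: List[Tuple[int, Hashable, Any]] = []  # (hash, key, value)

    def _find_bucket(self, key: Hashable) -> Tuple[int, int]:
        """Return (bucket position, hash) of the key or of the first free bucket."""
        h = hash(key)
        i = h % len(self.buckets)
        while True:
            entry_index = self.buckets[i]
            if entry_index is None or self.entries[entry_index][1] == key:
                return i, h
            i = (i + 1) % len(self.buckets)  # linear probing on collision

    def __setitem__(self, key: Hashable, value: Any) -> None:
        i, h = self._find_bucket(key)
        if self.buckets[i] is None:
            # Keep at least one bucket free so probing always terminates.
            if len(self.entries) >= len(self.buckets) - 1:
                raise OverflowError("TinyDict is full (resizing is not modeled)")
            self.buckets[i] = len(self.entries)
            self.entries.append((h, key, value))
        else:
            self.entries[self.buckets[i]] = (h, key, value)

    def __getitem__(self, key: Hashable) -> Any:
        i, _ = self._find_bucket(key)
        if self.buckets[i] is None:
            raise KeyError(key)
        return self.entries[self.buckets[i]][2]


if __name__ == "__main__":
    menu_price = TinyDict()
    menu_price["Apple"] = 1300
    menu_price["Banana"] = 700
    menu_price["Grape"] = 2200
    print(menu_price["Banana"])
```

A lookup hashes the key, jumps to a bucket, and follows the stored index to the entry, so the cost stays roughly constant regardless of how many items are stored.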
35. ### Hash mechanisms

    >>> hash(10)
    10
    >>> hash('a')
    2879375576708175707
    >>> hash((1, 2, 3))
    2528502973977326415
    >>> from dataclasses import dataclass
    >>> @dataclass(frozen=True)
    ... class HelloWorld:
    ...     hello: str
    ...
    >>> hello_world = HelloWorld('world')
    >>> hash(hello_world)
    9136121803269620124

[Diagram: Hashable PyObject, `__hash__()`, `hash()`]
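The relationship between `hash()` and `__hash__()` can be sketched as follows. The `Point` and `Menu` classes and the `is_hashable` helper are hypothetical examples, not from the talk. Note that string hashes are randomized per process by default, so concrete values for strings differ between runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """frozen=True makes the dataclass immutable, so it gets a generated __hash__."""
    x: int
    y: int


class Menu:
    """Hand-written hashing: objects that compare equal must hash equal."""

    def __init__(self, name: str, price: int) -> None:
        self.name = name
        self.price = price

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Menu) and (self.name, self.price) == (other.name, other.price)

    def __hash__(self) -> int:
        return hash((self.name, self.price))


def is_hashable(obj: object) -> bool:
    """Return True if hash(obj) succeeds, i.e. obj can be used as a dict key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    print(hash(Point(1, 2)) == hash(Point(1, 2)))
    print(is_hashable((1, 2, 3)), is_hashable([1, 2, 3]))
    prices = {Menu("Apple", 1300): "in stock"}
    print(prices[Menu("Apple", 1300)])
```

Mutable built-ins such as `list` are unhashable because their contents, and therefore any content-based hash, could change while they sit in a bucket.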
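36. ### Context switching, copy costs

[Diagram: traditional file send across User Space and Kernel space. Read file: Syscall (read), Hardware read (DMA copy) into a kernel Buffer, then CPU copy into the user-space Buffer. Send bytes: Syscall (write), Buffer send (CPU copy) into a kernel Buffer, then Transport (DMA copy) to the NIC. Other label: Descriptor]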
37. ### Introduce Zero-copy

[Diagram: os.sendfile across User Space and Kernel space. Labels: Syscall (read), Hardware, DMA copy, Kernel Buffer, CPU copy, Descriptor, Socket Buffer, Syscall (write), DMA copy, NIC]
38. ### Traditional application file send

    import os

    source = './src_file'
    dest = './dst_file'

    OFFSET = 0
    COUNT = 1024

    with (
        open(source, 'rb') as src_fd,
        open(dest, 'wb') as dest_fd,
    ):
        src_fd.seek(OFFSET)
        buffer = src_fd.read(COUNT)
        bytes_sent = dest_fd.write(buffer)
        print(bytes_sent)
39. ### os.sendfile for Zero-copy

    import os

    source = './src_file'
    dest = './dst_file'

    OFFSET = 0
    COUNT = 1024

    with (
        open(source, 'rb') as src_fd,
        open(dest, 'wb') as dest_fd,
    ):
        # os.sendfile expects integer file descriptors, not file objects.
        bytes_sent = os.sendfile(dest_fd.fileno(), src_fd.fileno(), OFFSET, COUNT)
        print(bytes_sent)
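A self-contained variant of the `os.sendfile` call can be sketched with a temporary file and a local socket pair, so it runs without any pre-existing files. It assumes a Unix-like platform (`os.sendfile` is not available on Windows); the helper names `send_file_zero_copy` and `demo` are hypothetical.

```python
import os
import socket
import tempfile
from typing import Tuple


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send a whole file over a socket with os.sendfile; return bytes sent."""
    sent_total = 0
    with open(path, "rb") as src:
        size = os.fstat(src.fileno()).st_size
        while sent_total < size:
            # sendfile may send fewer bytes than requested, so loop on the offset.
            sent = os.sendfile(sock.fileno(), src.fileno(), sent_total, size - sent_total)
            if sent == 0:
                break
            sent_total += sent
    return sent_total


def demo(payload: bytes = b"zero-copy!" * 100) -> Tuple[int, bytes]:
    """Write payload to a temp file, send it through a socket pair, read it back."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_zero_copy(tmp.name, sender)
        sender.close()  # signal end-of-stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        sender.close()
        receiver.close()
        os.unlink(tmp.name)
    return sent, b"".join(chunks)


if __name__ == "__main__":
    sent, received = demo()
    print(sent, received[:10])
```

The file contents never pass through a Python `bytes` object on the sending side; the kernel moves them from the file to the socket directly. The small payload fits in the socket buffer, which is why sending before receiving does not block here.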
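40. ### Columnar data format

https://parquet.apache.org/

Row-oriented layout:

| Row | | | | |
| --- | --- | --- | --- | --- |
| Row 1 | Apple | 1,300 | Red | 2021-08-22T03:41:14+09:00 |
| Row 2 | Banana | 700 | Yellow | 2021-08-22T08:13:01+09:00 |
| Row 3 | Grape | 2,200 | Purple | 2021-08-22T16:01:42+09:00 |

Column-oriented layout:

| Column | | | |
| --- | --- | --- | --- |
| Column 1 | Apple | Banana | Grape |
| Column 2 | 1,300 | 700 | 2,200 |
| Column 3 | Red | Yellow | Purple |
| Column 4 | 2021-08-22T03:41:14+09:00 | 2021-08-22T08:13:01+09:00 | 2021-08-22T16:01:42+09:00 |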
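The difference between the two layouts can be sketched in plain Python, using a list of dicts for rows and a dict of lists for columns. The field names (`name`, `price`, `color`, `created_at`) are assumptions for illustration; the slide only labels them Column 1 to Column 4. Parquet itself adds encoding, compression, and metadata that this sketch does not model.

```python
from typing import Any, Dict, List

# Row-oriented: each record keeps all of its fields together.
ROWS: List[Dict[str, Any]] = [
    {"name": "Apple", "price": 1300, "color": "Red", "created_at": "2021-08-22T03:41:14+09:00"},
    {"name": "Banana", "price": 700, "color": "Yellow", "created_at": "2021-08-22T08:13:01+09:00"},
    {"name": "Grape", "price": 2200, "color": "Purple", "created_at": "2021-08-22T16:01:42+09:00"},
]


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into a column-oriented dict of lists."""
    columns: Dict[str, List[Any]] = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns


def total_price_from_rows(rows: List[Dict[str, Any]]) -> int:
    """Row layout: every whole record is visited just to read one field."""
    return sum(row["price"] for row in rows)


def total_price_from_columns(columns: Dict[str, List[Any]]) -> int:
    """Column layout: only the contiguous price column is read."""
    return sum(columns["price"])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(columns["price"])
    print(total_price_from_rows(ROWS), total_price_from_columns(columns))
```

Analytical queries usually touch a few columns across many rows, so keeping each column contiguous reduces how much data has to be read.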
41. ### Clustered computing

[Diagram: User, Job Scheduler, Worker cluster containing seven Workers]