Sungmin Han
October 15, 2023

# Python Optimization and the Story of C Bindings _ Sungmin Han (한성민) [PyCon Korea 2021]

## Transcript

1. Python Optimization and the Story of C Bindings
Sungmin Han (한성민)

2. Speaker
한성민 (Sungmin Han)
AIOps Senior backend-engineer at Riiid
Former Research engineer at Naver Clova
Former Software engineer at IGAWorks
Former Software engineer at SimSimi (심심이)
https://github.com/KennethanCeyer

3. Introduction
Multi-core Programming
- GIL (Global Interpreter Lock)
- IO Bound
- Async IO
Bindings
- C-bindings
- Lib bindings
- Cross-compile bindings
Hardware-accelerations
- SIMD (Single Instruction Multiple Data)
- GPGPU
- Introduce libraries
Data processing
- Memory access mechanism
- Zero-copy
- Columnar Data Format
- Clustered computing

4. Introduction

5. Web
Data Science
ML/DL
Robotics
Extra...

6. Multi-core Programming

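# Threaded version
from concurrent.futures import ThreadPoolExecutor
def calc(x: int) -> int:
    sum_value = 0
    for _ in range(10000000):
        sum_value += x
    return sum_value
if __name__ == "__main__":
    # Assumption: the executor setup is not shown on the slide; a thread pool is used here.
    with ThreadPoolExecutor() as executor:
        res = executor.map(calc, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
        print(list(res))
# Sequential version
def calc(x: int) -> int:
    sum_value = 0
    for _ in range(10000000):
        sum_value += x
    return sum_value
if __name__ == "__main__":
    res = map(calc, [1, 2, 3])
    print(list(res))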

8. Global Interpreter Lock (GIL)

9. Thread in Python with IO Bound Requests
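# Threaded version
from concurrent.futures import ThreadPoolExecutor
from typing import List
import requests
URLS: List[str] = [
    "https://naver.com",
    "https://daum.net",
    "https://kakao.com",
    "https://stackoverflow.com",
]
if __name__ == "__main__":
    # Assumption: the executor setup is not shown on the slide; a thread pool is used here.
    with ThreadPoolExecutor() as executor:
        res = executor.map(requests.get, [*(URLS * 10)])
        print(list(res))
# Sequential version
from typing import List
import requests
URLS: List[str] = [
    "https://naver.com",
    "https://daum.net",
    "https://kakao.com",
    "https://stackoverflow.com",
]
if __name__ == "__main__":
    res = map(requests.get, [*(URLS * 10)])
    print(list(res))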

10. CPU Bound vs IO Bound
CPU Bound
- Access & Update Variable
- CPU Bound Function Call
- Sys call
- GC
IO Bound
- Network Communication
- DB Connect & Requests
- Audio Play
- Sleep (In case of AsyncIO)
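
The IO Bound items are the ones where concurrency helps even under the GIL, because a task that is waiting on IO is not executing bytecode. A minimal sketch, assuming `asyncio.sleep` as a stand-in for a network or DB call; the `fake_io` and `timed_run` names and the 0.2 s delay are illustrative and not from the talk:

```python
import asyncio
import time


async def fake_io(task_id: int, delay: float = 0.2) -> int:
    # asyncio.sleep yields to the event loop, much like a pending socket read would.
    await asyncio.sleep(delay)
    return task_id


async def run_all(n: int) -> list:
    # gather runs all coroutines concurrently on a single thread.
    return await asyncio.gather(*(fake_io(i) for i in range(n)))


def timed_run(n: int = 5):
    start = time.perf_counter()
    results = asyncio.run(run_all(n))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_run()
    # Elapsed time stays close to one delay rather than the sum of all delays.
    print(results, f"{elapsed:.2f}s")
```

The waits overlap, so five 0.2 s tasks finish in roughly 0.2 s. A CPU-bound loop in place of the sleep would not overlap this way.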

11. IO Bound Lock

12. Bindings

13. Native Python STD
from math import sqrt
from timeit import timeit
from typing import List
def mean(nums: List[float]) -> float:
    return sum(nums) / len(nums)
def var(nums: List[float]) -> float:
    # Population variance: divide by len(nums), not len(nums) - 1.
    mean_ = mean(nums)
    vsum = 0
    for n in nums:
        vsum += (n - mean_) ** 2
    return vsum / len(nums)
def std(nums: List[float]) -> float:
    return sqrt(var(nums))
data = list(range(5000))
timeit(lambda: std(data), number=100000)
94,582ms
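
One way to sanity-check the pure-Python implementation is to compare it with the standard library. A small sketch, assuming the slide's `std` is the population standard deviation (it divides by `len(nums)`), which corresponds to `statistics.pstdev`:

```python
import math
import statistics
from typing import List


def mean(nums: List[float]) -> float:
    return sum(nums) / len(nums)


def var(nums: List[float]) -> float:
    mean_ = mean(nums)
    return sum((n - mean_) ** 2 for n in nums) / len(nums)


def std(nums: List[float]) -> float:
    return math.sqrt(var(nums))


data = list(range(5000))
# Both compute the population standard deviation, so they should agree closely.
assert math.isclose(std(data), statistics.pstdev(data), rel_tol=1e-9)
```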

14. Python with Clang
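/* API is assumed to be an export macro defined elsewhere; it is not shown on the slide. */
#include <Python.h>
#include <math.h>
API PyObject* gs_sum(PyObject* pList) {
    PyObject* pListItem;
    double result = 0;
    int length;
    int i;
    if (!PyList_Check(pList)) {
        PyErr_SetString(PyExc_TypeError, "parameter must be a list.");
        return NULL;
    }
    length = PyList_Size(pList);
    for (i = 0; i < length; i++) {
        pListItem = PyList_GetItem(pList, (Py_ssize_t)i);  /* borrowed reference */
        result += PyFloat_AsDouble(pListItem);
    }
    return PyFloat_FromDouble(result);
}
API PyObject* gs_mean(PyObject* pList) {
    PyObject* pSum;
    double sumValue;
    int length;
    pSum = gs_sum(pList);
    if (pSum == NULL) {
        return NULL;  /* propagate the error set by gs_sum */
    }
    sumValue = PyFloat_AsDouble(pSum);
    Py_DECREF(pSum);  /* gs_sum returns a new reference */
    length = PyList_Size(pList);
    return PyFloat_FromDouble(sumValue / length);
}
API PyObject* gs_var(PyObject* pList) {
    PyObject* pListItem;
    PyObject* pMean;
    double result = 0;
    double meanValue;
    int length;
    int i;
    if (!PyList_Check(pList)) {
        PyErr_SetString(PyExc_TypeError, "parameter must be a list.");
        return NULL;
    }
    pMean = gs_mean(pList);
    if (pMean == NULL) {
        return NULL;
    }
    meanValue = PyFloat_AsDouble(pMean);
    Py_DECREF(pMean);
    length = PyList_Size(pList);
    for (i = 0; i < length; i++) {
        pListItem = PyList_GetItem(pList, (Py_ssize_t)i);  /* borrowed reference */
        result += pow(PyFloat_AsDouble(pListItem) - meanValue, 2);
    }
    result /= length;
    return PyFloat_FromDouble(result);
}
API PyObject* gs_std(PyObject* pList) {
    PyObject* pVar;
    double varValue;
    pVar = gs_var(pList);
    if (pVar == NULL) {
        return NULL;
    }
    varValue = PyFloat_AsDouble(pVar);
    Py_DECREF(pVar);
    return PyFloat_FromDouble(sqrt(varValue));
}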

15. Python with Clang
from ctypes import PyDLL, py_object
from os import path
from timeit import timeit
# PyDLL keeps the GIL held during calls, which is required for functions using the Python C API.
dll = PyDLL(path.join(path.dirname(__file__),
                      './lib/grayscale.dll'))
# math
dll.gs_sum.restype = py_object
dll.gs_sum.argtypes = py_object,
dll.gs_mean.restype = py_object
dll.gs_mean.argtypes = py_object,
dll.gs_var.restype = py_object
dll.gs_var.argtypes = py_object,
dll.gs_std.restype = py_object
dll.gs_std.argtypes = py_object,
data = list(range(5000))
timeit(lambda: dll.gs_std(data), number=100000)
27,095ms (-71%)

16. Native vs Clang Binding
Elapsed time (ms)
| | Python (3.9) | Clang |
| --- | --- | --- |
| Sum | 2,638 | 5,475 |
| Mean | 2,264 | 5,706 |
| Var | 95,506 | 26,945 |
| Std | 94,582 | 27,095 |

17. C Bindings
[Diagram: Module-level Binding vs. API-level Binding. Labels: Python Runtime, PyModule, C-lang; Python Runtime, PyAPI, C-lang, Python Function; GIL Release]

18. Library Bindings
[Diagram: Library, Python Runtime, Library Mapping (Parameter Definition), Python Function]
| Platform | Library extension |
| --- | --- |
| Linux | .so |
| macOS | .dylib |
| Windows | .dll |
Library Extension by Platforms
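
The "Library Mapping (Parameter Definition)" step can be sketched with `ctypes` from the standard library. This example is an assumption for illustration, not code from the talk: it maps `sqrt` from the system C math library, and the `load_c_sqrt` helper name is hypothetical. It returns `None` on platforms where the library cannot be located, so it stays runnable across `.so`, `.dylib`, and `.dll` systems.

```python
import ctypes
import ctypes.util
from typing import Callable, Optional


def load_c_sqrt() -> Optional[Callable[[float], float]]:
    """Return the C library's sqrt, or None if no suitable library is found."""
    # find_library resolves the platform-specific file name for us.
    lib_path = ctypes.util.find_library("m") or ctypes.util.find_library("c")
    if lib_path is None:
        return None
    try:
        lib = ctypes.CDLL(lib_path)
        c_sqrt = lib.sqrt
    except (OSError, AttributeError):
        return None
    # Parameter definition: without it, ctypes assumes int arguments and an int result.
    c_sqrt.argtypes = (ctypes.c_double,)
    c_sqrt.restype = ctypes.c_double
    return c_sqrt


if __name__ == "__main__":
    c_sqrt = load_c_sqrt()
    if c_sqrt is not None:
        print(c_sqrt(9.0))
```

Declaring `argtypes` and `restype` is the important part: the shared library carries no type information, so the mapping has to be stated on the Python side.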

19. Binding with Various languages
[Diagram: Library, Python Runtime]
* Because a shared library has to expose native code, only languages that compile to a native binary can be bound this way; languages that run on a VM runtime are not compatible.

20. Rust: PyO3
https://github.com/PyO3/pyo3
[package]
name = "string-sum"
version = "0.1.0"
edition = "2018"
[lib]
name = "string_sum"
# "cdylib" is necessary to produce a shared library for Python to import from.
#
# Downstream Rust code (including code in `bin/`, `examples/`, and `tests/`) will not be able
# to `use string_sum;` unless the "rlib" or "lib" crate type is also included, e.g.:
# crate-type = ["cdylib", "rlib"]
crate-type = ["cdylib"]
[dependencies.pyo3]
version = "0.14.1"
features = ["extension-module"]
Cargo.toml

21. Rust: PyO3
https://github.com/PyO3/pyo3
use pyo3::prelude::*;
/// Formats the sum of two numbers as string.
#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}
/// A Python module implemented in Rust. The name of this function must match
/// the `lib.name` setting in the `Cargo.toml`, else Python will not be able to
/// import the module.
#[pymodule]
fn string_sum(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    Ok(())
}
src/lib.rs

22. Rust: PyO3
https://github.com/PyO3/pyo3
$ pip install maturin
$ maturin develop
$ python

23. Rust: PyO3
https://github.com/PyO3/pyo3
[Diagram: Rust Module, maturin, Library, Python Runtime, Library Mapping (Parameter Definition), Python Function]

24. Other projects
Golang binding
- gopy: https://github.com/go-python/gopy
Kotlin binding
- kotlin-native python binding: https://github.com/JetBrains/kotlin/blob/master/kotlin-native/samples/python_extension

25. Hardware Acceleration

26. Hardware Acceleration
[Diagram: Sequential Processing vs. Parallel Processing for Hardware Acceleration, each drawn against memory (Mem)]

27. SIMD Operation
Single Instruction Single Data Stream (SISD)
CPU Tick 1: 1 + 5 = 6
CPU Tick 2: 2 + 6 = 8
CPU Tick 3: 3 + 7 = 10
CPU Tick 4: 4 + 8 = 12
Single Instruction Multiple Data Stream (SIMD)
CPU Tick 1: [1, 2, 3, 4] + [5, 6, 7, 8] = [6, 8, 10, 12]
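
The tick counts on this slide can be modeled in a few lines. This is only a counting model written for illustration: Python does not issue SIMD instructions here, and the function names and the 4-lane width are assumptions chosen to match the slide's four-element example.

```python
from typing import List, Tuple


def sisd_add(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """One add per tick: the tick count equals the number of elements."""
    out, ticks = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        ticks += 1
    return out, ticks


def simd_add(a: List[int], b: List[int], lanes: int = 4) -> Tuple[List[int], int]:
    """One add per group of `lanes` elements: a model of a vector instruction."""
    out, ticks = [], 0
    for i in range(0, len(a), lanes):
        # Conceptually, a single instruction handles this whole slice at once.
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        ticks += 1
    return out, ticks


if __name__ == "__main__":
    print(sisd_add([1, 2, 3, 4], [5, 6, 7, 8]))  # ([6, 8, 10, 12], 4)
    print(simd_add([1, 2, 3, 4], [5, 6, 7, 8]))  # ([6, 8, 10, 12], 1)
```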

28. SSE3 in Cython
cdef void sample():
    cdef double[4] operand1
    cdef double[4] operand2
    cdef __m128d mdata, mtmp
    cdef double[4] out
    operand1[:] = [1.0, 2.0, 3.0, 4.0]
    operand2[:] = [5.0, 6.0, 7.0, 8.0]
    with nogil:
        # tmp = A + B
        # [1.0, 2.0, 3.0, 4.0] + [5.0, 6.0, 7.0, 8.0]
        _mm_store_pd(out, mtmp)

29. BLAS (Basic Linear Algebra Subprograms)
| Level 1 | Level 2 | Level 3 |
| --- | --- | --- |
| Linear Algebra | Matrix to Vector operation | Matrix to Matrix operation |
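
To make the three levels concrete, here is a pure-Python sketch of one representative operation per level. The function names follow the BLAS routine families (`axpy`, `gemv`, `gemm`), but the signatures are simplified assumptions for illustration; real BLAS routines take extra scaling, stride, and layout parameters and run as optimized native code.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Level 1 (vector-vector): alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a: Matrix, x: Vector) -> Vector:
    """Level 2 (matrix-vector): A times x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Level 3 (matrix-matrix): A times B."""
    b_cols = list(zip(*b))
    return [[sum(aik * bkj for aik, bkj in zip(row, col)) for col in b_cols] for row in a]


if __name__ == "__main__":
    print(axpy(2.0, [1.0, 2.0], [3.0, 4.0]))                  # [5.0, 8.0]
    print(gemv([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))         # [3.0, 7.0]
    print(gemm([[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]))  # [[2.0, 1.0], [4.0, 3.0]]
```

The work grows from O(n) at Level 1 to O(n^2) at Level 2 and O(n^3) at Level 3, which is why the higher levels benefit most from hardware acceleration.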

30. GPU Processing with CUDA
[Diagram: GPU processing with CUDA. Labels: Native Python, CUDA Lib, CUDA, Memory, GPU Mem, CPU, GPU, C-lang Binding, GPU Device]

31. CUDA Python using numba
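import numba as nb
import numpy as np
from numba import prange
@nb.njit(parallel=True,
         fastmath=True)
def sum_in_parallel(A):
    acc = 0.
    n = len(A)
    # prange splits the loop across threads; acc is treated as a reduction.
    for i in prange(n):
        acc += np.log(A[i])
    return acc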

32. Numba package
This slide covers the following.
- Introduction to the Numba project
- Inside the NumPy implementation
- Main differences between Numba and NumPy
https://developer.nvidia.com/how-to-cuda-python

33. Data Processing

34. Memory-access mechanism
{
    'Apple': 1300,
    'Banana': 700,
    'Grape': 2200,
}
[Diagram: Keys (Apple, Banana, Grape) map to Buckets (0x01, 0x02) that point to Entries (1,300; 700; 2,200)]
Keys Buckets Entries
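
The keys, buckets, and entries relationship can be sketched with a toy table. This is an illustrative assumption, not CPython's implementation: CPython's `dict` uses open addressing, while this `ToyHashTable` (a hypothetical name) uses separate chaining because it is shorter to read.

```python
from typing import Any, List, Optional, Tuple


class ToyHashTable:
    """Separate-chaining table: hash(key) -> bucket index -> list of entries."""

    def __init__(self, n_buckets: int = 8) -> None:
        self._buckets: List[List[Tuple[Any, Any]]] = [[] for _ in range(n_buckets)]

    def _bucket(self, key: Any) -> List[Tuple[Any, Any]]:
        # The hash picks the bucket, so lookup cost does not depend on table size.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key: Any, value: Any) -> None:
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key: Any) -> Optional[Any]:
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None


if __name__ == "__main__":
    table = ToyHashTable()
    for name, price in [("Apple", 1300), ("Banana", 700), ("Grape", 2200)]:
        table.put(name, price)
    print(table.get("Banana"))
```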

35. Hash mechanisms
>>> hash(10)
10
>>> hash('a')
2879375576708175707
>>> hash((1, 2, 3))
2528502973977326415
>>> from dataclasses import dataclass
>>> @dataclass(frozen=True)
... class HelloWorld:
... hello: str
...
>>> hello_world = HelloWorld('world')
>>> hash(hello_world)
9136121803269620124
Hashable PyObject
__hash__()
hash()
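
The `hash()` to `__hash__()` relationship also applies to hand-written classes. A minimal sketch, assuming a hypothetical `Fruit` class that is not from the talk: objects that compare equal must hash equally, so `__hash__` uses the same fields as `__eq__`.

```python
class Fruit:
    def __init__(self, name: str, price: int) -> None:
        self.name = name
        self.price = price

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Fruit) and (self.name, self.price) == (other.name, other.price)

    def __hash__(self) -> int:
        # Hash the same fields that __eq__ compares.
        return hash((self.name, self.price))


if __name__ == "__main__":
    prices = {Fruit("Apple", 1300): "in stock"}
    # A distinct but equal object finds the same dict entry.
    print(prices[Fruit("Apple", 1300)])
```

Mutating `name` or `price` after the object is used as a key would break lookups, which is why the slide's example uses `frozen=True`.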

36. Context switching, copy costs
[Diagram: conventional send path across User Space, Kernel space, and Hardware. Labels: Buffer, send (CPU copy), Syscall, Descriptor, Send bytes, NIC, Syscall (write), Buffer, Transport (DMA copy), CPU copy]

37. Introduce Zero-copy
[Diagram: zero-copy path with os.sendfile across User Space, Kernel space, and Hardware. Labels: Kernel Buffer, Syscall, Descriptor, DMA copy, NIC, Syscall (write), Socket Buffer, DMA copy, CPU copy]

38. import os
source = './src_file'
dest = './dst_file'
OFFSET = 0
COUNT = 1024
with (
    open(source, 'rb') as src_fd,
    open(dest, 'wb') as dest_fd,
):
    src_fd.seek(OFFSET)
    # Read into a user-space buffer, then write it back out.
    buffer = src_fd.read(COUNT)
    bytes_sent = dest_fd.write(buffer)
print(bytes_sent)

39. os.sendfile for Zero-copy
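import os
source = './src_file'
dest = './dst_file'
OFFSET = 0
COUNT = 1024
with (
    open(source, 'rb') as src_fd,
    open(dest, 'wb') as dest_fd,
):
    # os.sendfile takes file descriptors (out_fd, in_fd), not file objects.
    # Writing to a regular file this way works on Linux; some platforms require a socket.
    bytes_sent = os.sendfile(dest_fd.fileno(), src_fd.fileno(), OFFSET, COUNT)
print(bytes_sent)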
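
`os.sendfile` may transfer fewer bytes than requested, and it is not available or not usable for file-to-file copies on every platform. The sketch below is one way to handle both cases; the `copy_file` helper is a hypothetical name, and the fallback to `shutil.copyfileobj` is an assumption about what a reasonable non-zero-copy path would be.

```python
import os
import shutil


def copy_file(source: str, dest: str) -> int:
    """Copy source to dest, using os.sendfile when the platform supports it."""
    size = os.path.getsize(source)
    with open(source, "rb") as src, open(dest, "wb") as dst:
        if hasattr(os, "sendfile"):
            try:
                offset = 0
                while offset < size:
                    # A single call may send only part of the range, so loop.
                    sent = os.sendfile(dst.fileno(), src.fileno(), offset, size - offset)
                    if sent == 0:
                        break
                    offset += sent
                return offset
            except OSError:
                # For example, platforms where the output must be a socket.
                src.seek(0)
                dst.seek(0)
                dst.truncate()
        # Fallback: ordinary buffered copy through user space.
        shutil.copyfileobj(src, dst)
    return size
```

The return value is the number of bytes copied, which equals the source size when the copy completes.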

40. Columnar data format
https://parquet.apache.org/
Row 1: Apple | 1,300 | Red | 2021-08-22T03:41:14+09:00
Row 2: Banana | 700 | Yellow | 2021-08-22T08:13:01+09:00
Row 3: Grape | 2,200 | Purple | 2021-08-22T16:01:42+09:00
Column 1: Apple | Banana | Grape
Column 2: 1,300 | 700 | 2,200
Column 3: Red | Yellow | Purple
Column 4: 2021-08-22T03:41:14+09:00 | 2021-08-22T08:13:01+09:00 | 2021-08-22T16:01:42+09:00
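
The difference between the two layouts can be sketched in plain Python. The field names (`name`, `price`, `color`) are assumptions, since the slide does not label its columns, and real columnar formats such as Parquet add encoding and compression on top of this idea.

```python
from typing import Any, Dict, List

# Row-oriented: each record keeps all of its fields together.
rows: List[Dict[str, Any]] = [
    {"name": "Apple", "price": 1300, "color": "Red"},
    {"name": "Banana", "price": 700, "color": "Yellow"},
    {"name": "Grape", "price": 2200, "color": "Purple"},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row records into one list per column."""
    return {key: [r[key] for r in records] for key in records[0]}


columns = to_columns(rows)
# An aggregate reads a single column instead of every field of every row.
total_price = sum(columns["price"])
```

With the columnar layout, summing prices never touches names or colors, which is the access pattern analytical queries tend to have.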

41. Clustered computing
[Diagram: User, Job Scheduler, and a Worker cluster of seven Workers]
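
The scheduler and worker roles can be sketched on a single machine. This is an illustrative assumption only: threads stand in for cluster nodes, the `run_jobs` name is hypothetical, and a real cluster framework would add serialization, networking, and failure handling.

```python
import queue
import threading
from typing import Callable, List


def run_jobs(jobs: List[Callable[[], int]], n_workers: int = 7) -> List[int]:
    """Scheduler: put jobs on a shared queue; each worker pulls until it is empty."""
    job_queue: queue.Queue = queue.Queue()
    results: List[int] = [0] * len(jobs)
    for index, job in enumerate(jobs):
        job_queue.put((index, job))

    def worker() -> None:
        while True:
            try:
                index, job = job_queue.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            results[index] = job()

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    print(run_jobs([lambda i=i: i * i for i in range(10)]))
```

Each worker takes the next available job, so faster workers naturally pick up more of the queue.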

42. Example projects