Slide 1

Slide 1 text

Beyond Multiprocessing: A Real-World ML Workload Speedup with Python 3.13+ Free-Threading
Kitsuya Azuma, Institute of Science Tokyo
PyConJP 2025

Slide 2

Slide 2 text

About Me
Kitsuya Azuma, Master’s Student @ Institute of Science Tokyo
Research Focus: Federated Learning & Computer Vision
Passion: High-performance, robust Python
Next Step: Platform Engineer (from 2026)
@kitsuyaazuma @azuma_alvin

Slide 3

Slide 3 text

Follow Along
All materials for today's talk, including code and a link to the slides, are available here:
https://github.com/kitsuyaazuma/pyconjp2025

Slide 4

Slide 4 text

The Core Question
How should we parallelize CPU-bound tasks across multiple cores in Python?
The Go-To Answer: multiprocessing
A New Contender: Free-threaded threading

Slide 5

Slide 5 text

Today’s Message
1. Free-threading makes threading a powerful, sometimes superior, choice for CPU-bound tasks (especially with shared data!).
2. The Golden Rule: Measure, Don't Guess. Your workload is unique.

Slide 6

Slide 6 text

The Game Changer: Free-threaded Python
What is it? An official, yet still optional, way to run Python without the Global Interpreter Lock (GIL).
What does it mean? True parallelism for your threads.
✓ Officially supported in Python 3.14
[Diagram: before, threads 1-3 take turns on the CPU cores through the GIL; after, each thread runs on its own core in parallel.]
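It is easy to check at runtime which build you are running. A minimal sketch, assuming Python 3.13+ for `sys._is_gil_enabled()` (the code falls back gracefully on older interpreters):

```python
import sys
import sysconfig

# Was this interpreter compiled with free-threading support (PEP 703)?
ft_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Is the GIL actually disabled right now? sys._is_gil_enabled() exists
# only on Python 3.13+, so fall back to "GIL enabled" on older versions.
check = getattr(sys, "_is_gil_enabled", None)
gil_enabled = check() if check is not None else True

print(f"free-threaded build: {ft_build}, GIL enabled: {gil_enabled}")
```

Note that on a free-threaded build the GIL can still be re-enabled (e.g. via `PYTHON_GIL=1`), which is why the build flag and the runtime check are reported separately.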

Slide 7

Slide 7 text

Agenda: The Path to Proof
1. The Basics: A quick recap of concurrency concepts.
2. The Evidence: Head-to-head benchmarks: threading vs. multiprocessing.
3. The Real World: Applying it to a Machine Learning workload.
4. Your Playbook: How to choose the right tool and avoid common pitfalls.

Slide 8

Slide 8 text

Agenda: The Path to Proof
1. The Basics: A quick recap of concurrency concepts.
2. The Evidence: Head-to-head benchmarks: threading vs. multiprocessing.
3. The Real World: Applying it to a Machine Learning workload.
4. Your Playbook: How to choose the right tool and avoid common pitfalls.

Slide 9

Slide 9 text

Recap #1: Concurrency vs. Parallelism
Concurrency: Composition of independently executing processes. "Dealing with lots of things at once."
Parallelism: Simultaneous execution of computations. "Doing lots of things at once."
Adapted from Rob Pike, "Concurrency is not Parallelism" (Waza, 2013)

Slide 10

Slide 10 text

Recap #2: Process vs. Thread
Process
• An executing program itself.
• Has an independent memory space.
Thread
• The basic unit to which the OS allocates CPU time.
• Shares memory space with other threads in the same process.
[Diagram: a multi-threaded process; threads share code, data, and files, but each has its own registers, stack, and program counter.]
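The memory-sharing distinction can be shown in a few lines: a worker thread mutates a list and the main thread immediately sees the change, whereas a child process would only mutate its own copy. A minimal sketch:

```python
import threading

# Threads share their process's memory: a mutation made by a worker
# thread is immediately visible to the main thread.
shared: list[int] = []

def worker() -> None:
    shared.append(42)  # mutate the shared list in place

t = threading.Thread(target=worker)
t.start()
t.join()

print(shared)  # the main thread sees the worker's append: [42]
```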

Slide 11

Slide 11 text

The Contenders: A New Landscape

                threading (GIL)   threading (Free-threaded)   multiprocessing
Model           Concurrent        Parallel                    Parallel
Unit            Thread            Thread                      Process
GIL Impact      Yes               No                          No
Best For        I/O-Bound         I/O-Bound + CPU-Bound       CPU-Bound
Memory Space    Shared            Shared                      Independent
Overhead        Low               Low                         High

Free-threading fundamentally changes the game for the threading module, making it a true parallel contender.
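One practical consequence of this landscape: with `concurrent.futures`, both contenders share the same interface, so swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` is a one-line change. A minimal sketch (thread pool shown; the process pool is a drop-in swap):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x: int) -> int:
    return x * x

# The same code works with ProcessPoolExecutor: only the class name
# changes, which makes benchmarking the two models against each other easy.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```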

Slide 12

Slide 12 text

Agenda: The Path to Proof
1. The Basics: A quick recap of concurrency concepts.
2. The Evidence: Head-to-head benchmarks: threading vs. multiprocessing.
3. The Real World: Applying it to a Machine Learning workload.
4. Your Playbook: How to choose the right tool and avoid common pitfalls.

Slide 13

Slide 13 text

The Proving Ground (Benchmark Setup)
Goal: Compare threading vs. multiprocessing on CPU-bound tasks.
Machine: Supermicro AS-1014S-WTRT (OS: Ubuntu 24.04.2 LTS, Chip: AMD EPYC 7543P, Cores: 32, Memory: 128 GB)
Python:
• 3.14.0rc2
• 3.14.0rc2 free-threading build

Slide 14

Slide 14 text

Case #1: Pure CPU-Bound Task
Task: Count prime numbers up to 10,000,000
Characteristic: Embarrassingly Parallel (No data sharing)
Workflow: the input (numbers 1-10M) is split into chunks; parallel workers count the primes in each chunk; the per-chunk counts are aggregated into the total prime count.
What each worker does:

    def count_primes_in_range(start: int, end: int) -> int:
        return sum(1 for n in range(start, end) if is_prime(n))
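The slide's worker relies on an `is_prime` helper that is not shown. A self-contained sketch of the whole pipeline, with a trial-division `is_prime` and an even-chunk splitting scheme (both are assumptions, not the talk's exact code), scaled down to a small limit:

```python
import math
from concurrent.futures import ThreadPoolExecutor

def is_prime(n: int) -> bool:
    # Simple trial division up to sqrt(n); enough for a sketch.
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

def count_primes_in_range(start: int, end: int) -> int:
    return sum(1 for n in range(start, end) if is_prime(n))

def count_primes(limit: int, workers: int = 4) -> int:
    # Split [1, limit] into equal chunks; each worker counts independently
    # (embarrassingly parallel: no shared data between chunks).
    step = limit // workers
    bounds = [(i * step + 1, (i + 1) * step + 1 if i < workers - 1 else limit + 1)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: count_primes_in_range(*b), bounds))

print(count_primes(100))  # 25 primes up to 100
```

On a GIL build the thread pool gives no speedup for this task; on a free-threaded build the chunks genuinely run in parallel.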

Slide 15

Slide 15 text

Result #1: No GIL, No Performance Gap
• GIL threading: No speedup.
• Free-threaded threading scales! Performance is now comparable to multiprocessing.
For pure parallel work, free-threaded threading is now just as fast as multiprocessing.

Slide 16

Slide 16 text

Case #2: CPU-Bound Task with Data Sharing
Task: Sum a giant NumPy array (100 million elements).
Key Difference from Case #1: The workflow is similar, but now all workers reference the same initial array (a shared, read-only array). The cost of passing or sharing these large data chunks could be the new bottleneck.
[Diagram: a large NumPy array is split into chunks; parallel workers sum each chunk; the partial sums are aggregated into the total sum.]
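With threads, each worker can read a slice of the same buffer with no copy at all. A minimal sketch of the pattern, using the stdlib `array` module in place of NumPy to keep it dependency-free (the structure is identical with `numpy.ndarray`; sizes are scaled down):

```python
from array import array
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the 100M-element NumPy array.
data = array("d", (float(i) for i in range(1_000)))

def sum_chunk(start: int, end: int) -> float:
    # Each worker reads a slice of the SAME underlying buffer through a
    # memoryview: no copy, unlike pickling chunks over to a child process.
    view = memoryview(data)
    return sum(view[start:end])

def parallel_sum(workers: int = 4) -> float:
    step = len(data) // workers
    bounds = [(i * step, (i + 1) * step if i < workers - 1 else len(data))
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: sum_chunk(*b), bounds))

print(parallel_sum())  # 499500.0 == sum(range(1000))
```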

Slide 17

Slide 17 text

Result #2: The Tables Have Turned
• multiprocessing overhead is significant, even with shared memory.
• Free-threaded threading is the clear winner!
With shared data, threading's low overhead gives it a powerful advantage.

Slide 18

Slide 18 text

Agenda: The Path to Proof
1. The Basics: A quick recap of concurrency concepts.
2. The Evidence: Head-to-head benchmarks: threading vs. multiprocessing.
3. The Real World: Applying it to a Machine Learning workload.
4. Your Playbook: How to choose the right tool and avoid common pitfalls.

Slide 19

Slide 19 text

Case #3: A Real-World ML Workload
Federated Learning (FL): Training a global ML model across many devices without centralizing the private data.
The Challenge: Simulating this is a CPU- and GPU-intensive task. Each round, a large model (the shared data) is processed by many clients in parallel.
[Diagram: a server aggregates the model parameters trained by each of its clients by averaging them.]

Slide 20

Slide 20 text

The Classic Approach: Flower (Ray)
Backend: Built on Ray, a general-purpose distributed computing framework.
Parallelism Model: Uses multiprocessing via Ray Actors.
Data Sharing: Model parameters (as NumPy arrays) are placed in an in-memory object store for zero-copy reads.
Potential Bottleneck: The overhead of scheduling and object-store management can be significant for small, simple tasks.
[Diagram: server and worker processes exchange object references to parameters held in Ray's in-memory object store.]

Slide 21

Slide 21 text

Our Solution: BlazeFL (Free-Threading)
Backend: Built on the standard library (threading).
Parallelism Model: Uses threading with free-threaded Python.
Data Sharing: All threads access model parameters via direct memory access (zero-copy).
Key Advantage: Minimal overhead. No serialization or copying bottleneck.
[Diagram: a single main process in which the server thread and the worker threads share the model parameters directly.]

Slide 22

Slide 22 text

The Proving Ground (Benchmark Setup)
Goal: Compare BlazeFL (threading, multiprocessing) vs. Flower on a real-world FL simulation workload.
Machine: TYAN S8030GM2NE (OS: Ubuntu 24.04.2 LTS, Chip: AMD EPYC 7542, Cores: 32, Memory: 256 GB, GPU: 4×Quadro RTX 6000)
Python:
• 3.13.7
• 3.13.7 experimental free-threading build

Slide 23

Slide 23 text

Result #3a: Lightweight CNN: Overhead Revealed
• BlazeFL (threading) is overwhelmingly fast. Minimal framework overhead is a huge advantage for fast tasks.
• Process-based methods are slower. The overhead of Ray's framework and process creation is significant here.
On lightweight models, low-overhead threading provides a massive performance advantage.

Slide 24

Slide 24 text

Result #3b: Heavier ResNet-18: Threading Still Leads
• Heavier computation allows all methods to scale better. The task's workload is now large enough to hide some of the framework overhead.
• threading maintains its lead. It consistently provides the best performance, saturating around 16 workers.
Even on heavier workloads, the efficiency of threading provides the best scalability and top-end speed.

Slide 25

Slide 25 text

Summary of The Evidence
• Case #1 (No Data Sharing): Free-threaded threading catches up. It's now just as fast as multiprocessing for simple parallel tasks.
• Case #2 (With Data Sharing): Free-threaded threading pulls ahead. Its direct memory access gives it a massive advantage over multiprocessing.
• Case #3 (Real-World ML): Free-threaded threading is the champion. In complex, single-node simulations, its minimal overhead leads to the best performance.

Slide 26

Slide 26 text

Agenda: The Path to Proof
1. The Basics: A quick recap of concurrency concepts.
2. The Evidence: Head-to-head benchmarks: threading vs. multiprocessing.
3. The Real World: Applying it to a Machine Learning workload.
4. Your Playbook: How to choose the right tool and avoid common pitfalls.

Slide 27

Slide 27 text

The Pitfalls: With Great Power...
Sharing a memory space is what makes threading fast, but it also opens the door to a classic problem: Race Conditions.
What is it? Multiple threads access shared data at the same time, leading to unpredictable results.
Why now? The GIL often hid this problem. With free-threading, true parallelism makes race conditions far more likely to occur.

Slide 28

Slide 28 text

Same Code, Different Results
GIL build: Expected: 200000, Actual: 200000
Free-threaded build: Expected: 200000, Actual: 103725

    import threading

    counter = 0

    def worker(n: int) -> None:
        global counter
        for _ in range(n):
            counter += 1

    if __name__ == "__main__":
        N = 100_000
        t1, t2 = [threading.Thread(target=worker, args=(N,)) for _ in range(2)]
        t1.start(); t2.start(); t1.join(); t2.join()
        print(f"Expected: {2*N}, Actual: {counter}")

Slide 29

Slide 29 text

Anatomy of a Race Condition & The Fix
Why It Fails: It's Not One Operation. counter += 1 is actually three steps: READ → MODIFY → WRITE. Without the GIL, another thread can interrupt between these steps, causing a "lost update".
The Fix: Use threading.Lock. A lock protects the critical section, ensuring it runs atomically.

    import threading

    counter = 0
    lock = threading.Lock()

    def worker(n: int) -> None:
        global counter
        for _ in range(n):
    -       counter += 1
    +       with lock:
    +           counter += 1
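Taking the lock on every iteration serializes the hot loop. A common complementary pattern (not from the slides, but standard practice) is to accumulate in a thread-private variable and merge under the lock once per thread:

```python
import threading

# Alternative fix: each worker accumulates privately, so there is no
# contended shared counter inside the hot loop.
results: list[int] = []
results_lock = threading.Lock()

def worker(n: int) -> None:
    local = 0              # thread-private: no race is possible here
    for _ in range(n):
        local += 1
    with results_lock:     # one locked operation per thread, not per loop
        results.append(local)

N = 100_000
threads = [threading.Thread(target=worker, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Expected: {2*N}, Actual: {sum(results)}")  # Actual: 200000
```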

Slide 30

Slide 30 text

The Golden Rule: Measure, Don’t Guess
Benchmarks: Tell You WHAT is Slow. They measure the total execution time to show which approach is faster overall.
Profiling: Tells You WHY it’s Slow. It looks inside your code to find the true bottleneck:
• A specific function?
• Memory allocation?
• Lock contention?
Don't optimize without data!
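A minimal benchmarking harness in this spirit, using `time.perf_counter` and reporting the best of several runs (a common convention to cut scheduler noise; the helper name is ours):

```python
import time
from typing import Callable

def bench(fn: Callable[[], object], repeat: int = 3) -> float:
    """Return the best wall-clock time of `repeat` runs of fn()."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn()  # the workload under test
        best = min(best, time.perf_counter() - t0)
    return best

elapsed = bench(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.6f}s")
```

Run the same harness against a threaded and a multiprocessing variant of your workload before committing to either.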

Slide 31

Slide 31 text

Profiling in a Free-Threaded World
The Challenge: GIL-Dependent Profilers. Many popular profilers rely on GIL internals and are currently incompatible with free-threaded builds.
A Solution: System-Level Samplers (samply, Scalene, py-spy). These work by sampling the call stack at fixed intervals, independent of the GIL.

Slide 32

Slide 32 text

A Decision Guide: Choosing Your Tool
Is your task I/O-Bound? → Use asyncio or threading.
Is your task CPU-Bound? Does it require heavy data sharing?
• Yes → Start with free-threaded threading.
• No → multiprocessing and free-threaded threading are both good. Benchmark them!
Do you need to scale across multiple machines? → Use a distributed framework like Ray, Dask, or Horovod.

Slide 33

Slide 33 text

Final Takeaways
1. The GIL era is ending. For data-intensive, single-node tasks, free-threading makes threading a first-class citizen for parallelism, often simpler and faster than multiprocessing.
2. Measure, don't guess. Benchmark and profile your code to find the real bottlenecks before you optimize.

Slide 34

Slide 34 text

Thank You. Any Questions?
Blog Post: https://alvinvin.hatenablog.jp/entry/17
Code & Slides: https://github.com/kitsuyaazuma/pyconjp2025