Slide 1

Slide 1 text

🦀 Rust for Data: What Works and What Doesn’t
Faster doesn’t always mean better

Slide 2

Slide 2 text

About me
Karn Wong (kahnwong), karnwong.me
- Platform Engineer @ Data Cafe Company Limited
- Used to work as a Data Engineer and Machine Learning Engineer
- Tinkers with Go, Python and Rust
- Deployed a bajillion things to production

Slide 3

Slide 3 text

Table of contents
1. Types of Data Work
2. Data Engineering Workloads
3. Machine Learning Workloads
4. When to Use Rust
5. When Not to Use Rust
6. Bonus: Boosting Python Performance via Rust FFI
7. Conclusion
   1. Should You Use Rust?

Slide 4

Slide 4 text

Types of Data Work

| Type             | Java / Scala | R  | Python | Rust |
|------------------|--------------|----|--------|------|
| Data Analysis    | ❌           | ❌→✅ | ✅     | ❌   |
| Data Engineering | ✅           | ❌ | ✅     | ✅   |
| Machine Learning | ☑️           | ✅ | ✅     | ✅   |

Ratings are based on how common it is to adopt a language for a particular domain.
Our main focus today is the Python vs Rust comparison, specifically Rust for Data Engineering and Machine Learning.

Slide 5

Slide 5 text

Data Engineering Workloads

Slide 6

Slide 6 text

Data Engineering: Python
- For small data, pandas is very popular
  - But it does not have strong typing features
  - You can’t be sure what your dtypes are ⚠️⚠️⚠️
- For big data, pyspark is king
  - If you are not familiar with the Spark DataFrame API, you can use Spark SQL
  - The pandas API is also available
- The middle-of-the-road solution is polars, a Rust-based library similar to pandas but more performant
  - Also has strong typing features, courtesy of Rust
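To make the dtype warning concrete, here is a minimal sketch of how pandas silently changes a column's dtype when a missing value shows up. The column name and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the same column, with and without a missing value.
clean = pd.DataFrame({"user_id": [1, 2, 3]})
dirty = pd.DataFrame({"user_id": [1, 2, None]})

# An all-integer column is stored with an integer dtype.
clean_is_int = pd.api.types.is_integer_dtype(clean["user_id"])

# One missing value is enough for pandas to fall back to a float dtype,
# without raising an error or a warning.
dirty_is_float = pd.api.types.is_float_dtype(dirty["user_id"])

print(clean["user_id"].dtype, dirty["user_id"].dtype)
```

Passing an explicit dtype when building the column (for example the nullable "Int64") is one way to pin this down, but nothing forces you to do so.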

Slide 7

Slide 7 text

Benchmark: Polars vs Spark
https://karnwong.me/posts/2023/04/duckdb-vs-polars-vs-spark/

Slide 8

Slide 8 text

Data Engineering: Rust
- Polars (pandas alternative)
  - 30k stars
  - Rust API documentation is still lacking compared to Python’s
  - Verdict: you should use polars via Python. It integrates better with the existing ecosystem that way
- DataFusion (spark alternative)
  - 6.1k stars
  - Benchmark: https://andygrove.io/2018/03/datafusion-0.2.1-benchmark/
  - Still not widely adopted
  - Verdict: probably don’t use it in production for now, unless you know what you are doing

Slide 9

Slide 9 text

Machine Learning Workloads

Slide 10

Slide 10 text

Machine Learning: Python
Machine Learning Workflow: Data Preparation → Model Training → Model Evaluation → Model Deployment
Model Training
- Various frameworks to choose from: Spark MLlib, scikit-learn, pytorch, tensorflow
- Large ecosystem to support distributed training and model deployment
Model Deployment
- Mostly by utilizing FastAPI to expose your model as an API endpoint
- Real-time inference involves input validation and data prep ⚠️⚠️⚠️
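As a rough sketch of the input validation and data prep step, the snippet below checks a request payload before it reaches a model. It uses only the standard library. `HouseFeatures`, its fields, `parse_features` and `to_model_input` are hypothetical names; a FastAPI service would normally delegate this work to pydantic models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HouseFeatures:
    """Hypothetical model input: names and ranges are illustrative only."""
    area_sqm: float
    bedrooms: int


def parse_features(payload: dict) -> HouseFeatures:
    """Validate a raw JSON-like payload and convert it into typed features."""
    try:
        area = float(payload["area_sqm"])
        bedrooms = int(payload["bedrooms"])
    except KeyError as exc:
        raise ValueError(f"missing field: {exc.args[0]}") from exc
    except (TypeError, ValueError) as exc:
        raise ValueError(f"wrong type in payload: {exc}") from exc
    if area <= 0 or bedrooms < 0:
        raise ValueError("area_sqm must be positive and bedrooms non-negative")
    return HouseFeatures(area_sqm=area, bedrooms=bedrooms)


def to_model_input(features: HouseFeatures) -> list[float]:
    """Data prep: flatten validated features into the vector a model expects."""
    return [features.area_sqm, float(features.bedrooms)]
```

Every check here runs per request at runtime, which is why this step deserves the warning signs on the slide.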

Slide 11

Slide 11 text

Machine Learning: Rust
Model Training
- Various frameworks for model training: candle-core, burn, tensorflow, tch-rs, linfa
- Sparse documentation
- Verdict: probably better to still use Python for model training due to its better ecosystem
Model Deployment
- You can convert your model to onnx or gguf and serve it via Rust
- Rust also has backend frameworks similar to FastAPI, such as actix and axum
- Strong type safety means the Rust compiler catches many errors at compile time, including in the data processing that happens before inference
  - In Python, sometimes you need to run the code to see the errors
- Rust has no interpreter or garbage collector overhead, so it is typically faster than Python for the same workload
- Verdict: use Rust-based solutions if you want better performance and stability

Slide 12

Slide 12 text

When to Use Rust
- Rust is suitable when you have to create APIs for other services to use (model deployment)
  - Many production issues in Python come from unforeseen bugs hidden in nested statements
  - Rust catches many of these errors at compile time
- Additionally, if your services need low latency, Rust is a strong fit because it is very fast
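A minimal illustration of that failure mode, using a made-up `shipping_fee` function: the typo sits in a rarely taken nested branch, so Python only reports it when a request happens to reach that branch. A compiled language such as Rust would reject the undefined name at build time.

```python
def shipping_fee(weight_kg: float, express: bool, remote_area: bool) -> float:
    """Hypothetical pricing rule with a bug hidden in a nested branch."""
    base_fee = 5.0 + 1.5 * weight_kg
    if express:
        if remote_area:
            # Typo: `base_fe` is undefined. Python does not notice until
            # this exact branch runs.
            return base_fe * 2.5
        return base_fee * 1.5
    return base_fee


# The common paths work, so basic tests and early traffic look healthy.
standard_fee = shipping_fee(2.0, express=False, remote_area=False)
express_fee = shipping_fee(2.0, express=True, remote_area=False)

# The rare path fails only at runtime.
try:
    shipping_fee(2.0, express=True, remote_area=True)
    rare_path_error = None
except NameError as exc:
    rare_path_error = exc
```

A linter or type checker can catch this in Python as well, but it is an extra step rather than part of every build.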

Slide 13

Slide 13 text

Benchmark: Rust vs Node, Go and Python
https://karnwong.me/posts/2024/09/hello-world-api-performance-benchmark-go-node-python-rust/

Slide 14

Slide 14 text

Benchmark: LLM Serving Latency
https://karnwong.me/posts/2024/10/llm-serving-latency-benchmark/
"This one doesn't fit in with the rest" - A certain MLE

Slide 15

Slide 15 text

When Not to Use Rust
- Data processing involves many transformations, so data types change all the time
  - Using Rust would make things more complicated and verbose
- For model training, there are not many resources for doing this in Rust
  - Some light transformations are still involved, so Rust would be counterproductive in this case
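The sketch below, with made-up records and field names, shows how the type of a single field drifts through a short cleaning pipeline. Python lets each step produce a new type freely; in a statically typed language each intermediate shape generally needs its own declared type or an explicit conversion, which is where the extra verbosity comes from.

```python
# Hypothetical raw records, as they might arrive from a CSV export.
raw = [{"price": "12.50"}, {"price": ""}, {"price": "7"}]

# Step 1: str -> float or None (empty strings become missing values).
parsed = [float(r["price"]) if r["price"] else None for r in raw]

# Step 2: float or None -> float (fill missing values with the mean of the known ones).
known = [p for p in parsed if p is not None]
mean_price = sum(known) / len(known)
filled = [p if p is not None else mean_price for p in parsed]

# Step 3: float -> int (convert to whole cents for storage).
cents = [round(p * 100) for p in filled]

# Record the type of the first record's value at each stage.
types_seen = [type(raw[0]["price"]), type(parsed[0]), type(cents[0])]
```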

Slide 16

Slide 16 text

Bonus: Boosting Python Performance via Rust FFI
"So which are you, a snake or a sloth?" - A certain MLE

Slide 17

Slide 17 text

Any questions?

Slide 18

Slide 18 text

Conclusion
- Model deployment with Rust results in better reliability and fewer bugs in production, because you find out about many bugs at compile time
- You can use Rust for data processing, model training and model deployment
  - Data processing in Rust means more overhead due to strong typing, and Rust has a less mature ecosystem than Python
  - Model training in Rust has similar drawbacks, namely the lack of ecosystem and readily available examples

Slide 19

Slide 19 text

Should You Use Rust?
- It’s harder to find Rust developers :(
  - Most data folks are only familiar with Python
- Rust takes longer to write (but it will pay off in production due to fewer bugs)
- If you manage to find Rust developers to work for you, great!
  - But if they leave, how are you going to maintain Rust-based solutions in your organization?

Slide 20

Slide 20 text

Thank You
Slides | Blog | GitHub | LinkedIn