
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems

As Data Scientists and Engineers working in Python, we focus on solving problems with large amounts of data while staying in Python. This is where we are most effective and feel most comfortable. Libraries like Pandas and NumPy provide efficient interfaces and optimal performance for working with this data. The main problem appears when we have to deal with systems outside of our comfortable ecosystem: we need to write cumbersome and mostly slow conversion code to ingest data from there into our pipeline before we can work efficiently. Using Apache Arrow and Parquet as base technologies, we get a set of tools that eases this interaction and also brings a huge performance improvement. As part of the talk, we will walk through a basic problem where we take data coming from a Java application through Python into R using these tools.

Uwe L. Korn

July 02, 2019

Transcript

  1. (Efficient) Data Exchange with
    "Foreign" Ecosystems
    Uwe Korn – QuantCo – 2nd July 2019

  2. About me
    • Engineering at QuantCo

    • Apache {Arrow, Parquet} PMC

    • Focus on Python but interact with
    R, Java, SAS, …
    @xhochy
    @xhochy
    [email protected]
    https://uwekorn.com

  3. Python & R
    … & Java & Rust &
    Javascript & C# & Matlab
    & …

  4. Do we have a problem?

  5. Do we have a problem?
    • Yes, there are different ecosystems!

  6. Do we have a problem?
    • Yes, there are different ecosystems!
    • PyData

    • Python / R

    • Pandas / NumPy / PySpark/sparklyr / Docker

  7. Do we have a problem?
    • Yes, there are different ecosystems!
    • PyData

    • Python / R

    • Pandas / NumPy / PySpark/sparklyr / Docker
    • Two weeks ago: Berlin Buzzwords

    • Java / Scala

    • Flink / ElasticSearch / Kafka

    • Scala-Spark / Kubernetes

  8. Do we have a problem?
    • Yes, there are different ecosystems!
    • PyData

    • Python / R

    • Pandas / NumPy / PySpark/sparklyr / Docker
    • Two weeks ago: Berlin Buzzwords

    • Java / Scala

    • Flink / ElasticSearch / Kafka

    • Scala-Spark / Kubernetes
    • SQL-based databases

    • ODBC / JDBC

    • Custom protocols (e.g. Postgres)

  9. Why solve this?
    • We build pipelines to move data

    • Goal: end-to-end data products

    • Somewhere along the path, these systems need to talk to each other

    • Avoid duplicate work / work on converters

    • We don’t want Python vs R but use each of them where they’re best.

  10. Apache Arrow at its core
    • Main idea: common columnar representation of data in memory
    • Provide libraries to access the data structures

    • Broad support for many languages

    • Create building blocks to form an ecosystem around it

    • Implement adaptors for existing structures

  11. Columnar Data

  12. Previous Work
    • CSV works really everywhere

    • Slow, untyped and row-wise

    • Parquet is gaining traction in all ecosystems

    • one of the major features and interaction points of Arrow

    • Still, this serializes data

    • RAM copy: 10 GB/s on a laptop

    • DataFrame implementations look similar but still are incompatible

  13. Languages
    • C++, C(glib), Python, Ruby, R, Matlab

    • C#

    • Go

    • Java

    • JavaScript

    • Rust

  14. There’s a social component
    • It’s not only APIs you need to bring together

    • Communities are also quite distinct

    • Get them talking!

  15. Shipped with batteries
    • There is more than just data structures

    • Batteries in Arrow

    • Vectorized Parquet reader: C++, Rust, Java (WIP)

    • C++ also supports ORC

    • Gandiva: LLVM-based expression kernels

    • Plasma: Shared-memory object store

    • DataFusion: Rust-based query engine

    • Flight: RPC protocol built on top of gRPC with zero-copy optimizations

  16. Ecosystem
    • RAPIDS: Analytics on the GPU

    • Dremio: Data platform

    • Turbodbc: columnar ODBC access in C++/Python

    • Spark: fast Python and R bridge

    • fletcher (pandas): Use Arrow instead of NumPy as backing storage

    • fletcher (FPGA): Use Arrow on FPGAs

    • Many more … https://arrow.apache.org/powered_by/

  17. Ecosystem
    Kartothek:

    • Heavily relies on Parquet adapter

    • Uses Arrow’s type system which is more sophisticated than pandas’

    • Using Arrow instead of building some components on their own allows
    us to provide Kartothek access in other languages easily in the future

  18. Does it work?

  19. Does it work?
    Everything is amazing on slides …

  20. Does it work?
    Everything is amazing on slides …
    … so does this Arrow actually work?

  21. Does it work?
    Everything is amazing on slides …
    … so does this Arrow actually work?
    Let’s take a real example with:

  22. Does it work?
    Everything is amazing on slides …
    … so does this Arrow actually work?
    Let’s take a real example with:
    • ERP System in Java with JDBC access (no non-Java client)

  23. Does it work?
    Everything is amazing on slides …
    … so does this Arrow actually work?
    Let’s take a real example with:
    • ERP System in Java with JDBC access (no non-Java client)
    • ETL and Data Cleaning in Python

  24. Does it work?
    Everything is amazing on slides …
    … so does this Arrow actually work?
    Let’s take a real example with:
    • ERP System in Java with JDBC access (no non-Java client)
    • ETL and Data Cleaning in Python
    • Analysis in R

  25. Does it work?

  26. Does it work?

  27. Does it work?

  28. Does it work?
    WIP

  29. Get started easily?

  30. Up Next
    • Build more adaptors, e.g. Postgres

    • Building blocks for query engines on top of Arrow

    • Datasets

    • Analytical kernels

    • DataFrame implementations directly on top of Arrow

  31. Thanks
    Slides at https://twitter.com/xhochy

    Questions here!
