
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems

As Data Scientists and Engineers working in Python, we focus on solving problems with large amounts of data while still staying in Python. This is where we are most effective and feel comfortable. Libraries like Pandas and NumPy provide us with efficient interfaces to deal with this data while still getting optimal performance. The main problem appears when we have to deal with systems outside of our comfortable ecosystem: we need to write cumbersome and mostly slow conversion code that ingests data from there into our pipeline before we can work efficiently. Using Apache Arrow and Parquet as base technologies, we get a set of tools that eases this interaction and also brings us a huge performance improvement. As part of the talk we will show a basic problem where we take data coming from a Java application through Python into R using these tools.

Uwe L. Korn

July 02, 2019

Transcript

  1. About me
     • Engineering at QuantCo
     • Apache {Arrow, Parquet} PMC
     • Focus on Python but interact with R, Java, SAS, …
     @xhochy @xhochy [email protected] https://uwekorn.com
  2. Do we have a problem?
     • Yes, there are different ecosystems!
     • PyData
       • Python / R
       • Pandas / NumPy / PySpark/sparklyr / Docker
  3. Do we have a problem?
     • Yes, there are different ecosystems!
     • PyData
       • Python / R
       • Pandas / NumPy / PySpark/sparklyr / Docker
     • Two weeks ago: Berlin Buzzwords
       • Java / Scala
       • Flink / ElasticSearch / Kafka
       • Scala-Spark / Kubernetes
  4. Do we have a problem?
     • Yes, there are different ecosystems!
     • PyData
       • Python / R
       • Pandas / NumPy / PySpark/sparklyr / Docker
     • Two weeks ago: Berlin Buzzwords
       • Java / Scala
       • Flink / ElasticSearch / Kafka
       • Scala-Spark / Kubernetes
     • SQL-based databases
       • ODBC / JDBC
       • Custom protocols (e.g. Postgres)
  5. Why solve this?
     • We build pipelines to move data
     • Goal: end-to-end data products
     • Somewhere along the path we need to talk
     • Avoid duplicate work / work on converters
     • We don’t want Python vs. R; use each of them where they’re best.
  6. Apache Arrow at its core
     • Main idea: common columnar representation of data in memory
     • Provide libraries to access the data structures
     • Broad support for many languages
     • Create building blocks to form an ecosystem around it
     • Implement adaptors for existing structures
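
To make the columnar model concrete, here is a minimal sketch in Python using pyarrow; the column names and values are made up for illustration:

    import pyarrow as pa

    # Each column is a typed, contiguous Arrow array; nulls are tracked
    # in a separate validity bitmap instead of sentinel values.
    ids = pa.array([1, 2, 3], type=pa.int64())
    names = pa.array(["a", "b", None], type=pa.string())

    # A Table groups columns under a schema. The same memory layout is
    # understood by every Arrow implementation (C++, Java, R, ...), so
    # no conversion is needed when crossing a language boundary.
    table = pa.Table.from_arrays([ids, names], names=["id", "name"])
    print(table.schema)
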
  7. Previous Work
     • CSV works nearly everywhere
       • but slow, untyped and row-wise
     • Parquet is gaining traction in all ecosystems (see the sketch below)
       • one of the major features and interaction points of Arrow
     • Still, this serializes data
       • RAM copy: 10GB/s on a laptop
     • DataFrame implementations look similar but are still incompatible
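
As a concrete example of the Parquet interaction point, a hedged sketch of a round trip in Python with pyarrow; the file name and data are made up:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    # pandas -> Arrow: a typed, columnar representation of the frame
    table = pa.Table.from_pandas(df)

    # Write a Parquet file that Java, R, Spark, ... can read directly
    pq.write_table(table, "data.parquet")

    # ... and read it back without going through row-wise parsing
    roundtrip = pq.read_table("data.parquet").to_pandas()
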
  8. Languages
     • C++, C (GLib), Python, Ruby, R, Matlab
     • C#
     • Go
     • Java
     • JavaScript
     • Rust
  9. There’s a social component
     • It’s not only APIs you need to bring together
     • Communities are also quite distinct
     • Get them talking!
  10. Shipped with batteries
      • There is more than just data structures
      • Batteries in Arrow:
        • Vectorized Parquet reader: C++, Rust, Java (WIP)
          • C++ also supports ORC
        • Gandiva: LLVM-based expression kernels
        • Plasma: shared-memory object store (see the sketch after this list)
        • DataFusion: Rust-based query engine
        • Flight: RPC protocol built on top of gRPC with zero-copy optimizations
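
A hedged sketch of Plasma from Python; it assumes a plasma_store process was started separately, and the socket path and payload are made up:

    import pyarrow.plasma as plasma

    # Assumes a store was started beforehand, e.g.:
    #   plasma_store -m 1000000000 -s /tmp/plasma
    client = plasma.connect("/tmp/plasma")

    # Put a Python object into shared memory; we get back its ObjectID
    object_id = client.put([1, 2, 3])

    # Any other process on the same host can fetch it without a copy
    print(client.get(object_id))
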
  11. Ecosystem
      • RAPIDS: Analytics on the GPU
      • Dremio: Data platform
      • Turbodbc: columnar ODBC access in C++/Python (see the sketch after this list)
      • Spark: fast Python and R bridge
      • fletcher (pandas): Use Arrow instead of NumPy as backing storage
      • fletcher (FPGA): Use Arrow on FPGAs
      • Many more … https://arrow.apache.org/powered_by/
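
To illustrate the Turbodbc entry, a hedged sketch of columnar fetching; it assumes a Turbodbc build with Arrow support, and the DSN, table and columns are hypothetical:

    from turbodbc import connect

    # DSN and table are placeholders for a real ODBC data source
    connection = connect(dsn="my_database")
    cursor = connection.cursor()
    cursor.execute("SELECT id, value FROM measurements")

    # Results arrive as a pyarrow.Table instead of Python row tuples
    table = cursor.fetchallarrow()
    df = table.to_pandas()
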
  12. Ecosystem: Kartothek
      • Heavily relies on the Parquet adapter
      • Uses Arrow’s type system, which is more sophisticated than pandas’
      • Using Arrow instead of building these components on our own allows us to provide Kartothek access in other languages easily in the future
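
One example of that type-system gap, sketched in Python: Arrow stores nullable integers natively, while pandas (as of 2019, with a NumPy backend) has to upcast them to floats:

    import pyarrow as pa

    # An int64 column with a null: Arrow keeps the integer type and
    # records the null in a validity bitmap.
    arr = pa.array([1, 2, None], type=pa.int64())
    print(arr.type)              # int64

    # pandas has no null for int64, so conversion yields float64 + NaN
    print(arr.to_pandas().dtype) # float64
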
  13. Does it work?
      Everything is amazing on slides …
      … so does this Arrow actually work?
  14. Does it work?
      Everything is amazing on slides …
      … so does this Arrow actually work?
      Let’s take a real example with:
  15. Does it work?
      Everything is amazing on slides …
      … so does this Arrow actually work?
      Let’s take a real example with:
      • ERP System in Java with JDBC access (no non-Java client)
  16. Does it work?
      Everything is amazing on slides …
      … so does this Arrow actually work?
      Let’s take a real example with:
      • ERP System in Java with JDBC access (no non-Java client)
      • ETL and Data Cleaning in Python
  17. Does it work?
      Everything is amazing on slides …
      … so does this Arrow actually work?
      Let’s take a real example with:
      • ERP System in Java with JDBC access (no non-Java client)
      • ETL and Data Cleaning in Python
      • Analysis in R
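
A hedged sketch of the Python leg of this pipeline: it assumes the Java side used Arrow's JDBC adapter to dump the ERP result set as an Arrow IPC file, and all file and column names are made up for illustration:

    import pyarrow as pa
    import pyarrow.feather as feather

    # Read the record batches the Java side wrote (Arrow IPC file
    # format); the columnar buffers are used as-is, no row-wise parsing.
    reader = pa.ipc.open_file("erp_dump.arrow")
    table = reader.read_all()

    # ETL / data cleaning in pandas
    df = table.to_pandas()
    df = df.dropna(subset=["customer_id"])   # hypothetical cleaning step

    # Hand the cleaned data to R via Feather;
    # in R: df <- arrow::read_feather("clean.feather")
    feather.write_feather(df, "clean.feather")
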
  18. Up Next
      • Build more adaptors, e.g. Postgres
      • Building blocks for query engines on top of Arrow
        • Datasets
        • Analytical kernels
      • DataFrame implementations directly on top of Arrow