PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems


As Data Scientists and Engineers in Python, we focus on solving problems with large amounts of data while staying in Python. This is where we are most effective and feel comfortable. Libraries like Pandas and NumPy provide efficient interfaces to this data while still delivering good performance. The main problem appears when we have to deal with systems outside of our comfortable ecosystem: we need to write cumbersome and mostly slow conversion code to ingest data from there into our pipeline before we can work efficiently. Using Apache Arrow and Parquet as base technologies, we get a set of tools that eases this interaction and also brings a huge performance improvement. As part of the talk we will show a basic problem where we take data coming from a Java application into Python using these tools.


Uwe L. Korn

July 02, 2019

Transcript

  1. (Efficient) Data Exchange with "Foreign" Ecosystems Uwe Korn – QuantCo – 2nd July 2019
  2. About me • Engineering at QuantCo • Apache {Arrow, Parquet} PMC • Focus on Python but interact with R, Java, SAS, … @xhochy mail@uwekorn.com https://uwekorn.com
  3. Python vs R

  4. Python vs R

  5. Python & R

  6. Python & R … & Java & Rust & Javascript

    & C# & Matlab & …
  7. Do we have a problem?

  8. Do we have a problem? • Yes, there are different ecosystems!
  9. Do we have a problem? • Yes, there are different ecosystems! • PyData • Python / R • Pandas / NumPy / PySpark/sparklyr / Docker
  10. Do we have a problem? • Yes, there are different ecosystems! • PyData • Python / R • Pandas / NumPy / PySpark/sparklyr / Docker • Two weeks ago: Berlin Buzzwords • Java / Scala • Flink / ElasticSearch / Kafka • Scala-Spark / Kubernetes
  11. Do we have a problem? • Yes, there are different ecosystems! • PyData • Python / R • Pandas / NumPy / PySpark/sparklyr / Docker • Two weeks ago: Berlin Buzzwords • Java / Scala • Flink / ElasticSearch / Kafka • Scala-Spark / Kubernetes • SQL-based databases • ODBC / JDBC • Custom protocols (e.g. Postgres)
  12. Why solve this? • We build pipelines to move data • Goal: end-to-end data products • Somewhere along the path we need to talk • Avoid duplicate work / work on converters • We don’t want Python vs R but use each of them where they’re best.
  13. None
  14. Apache Arrow at its core • Main idea: common columnar representation of data in memory • Provide libraries to access the data structures • Broad support for many languages • Create building blocks to form an ecosystem around it • Implement adaptors for existing structures
  15. Columnar Data

  16. Previous Work • CSV works really everywhere • Slow, untyped and row-wise • Parquet is gaining traction in all ecosystems • one of the major features and interaction points of Arrow • Still, this serializes data • RAM-Copy: 10GB/s on a Laptop • DataFrame implementations look similar but still are incompatible
  17. Languages • C++, C (GLib), Python, Ruby, R, Matlab • C# • Go • Java • JavaScript • Rust
  18. There’s a social component • It’s not only APIs you need to bring together • Communities are also quite distinct • Get them talking!
  19. Shipped with batteries • There is more than just data structures • Batteries in Arrow • Vectorized Parquet reader: C++, Rust, Java (WIP) • C++ also supports ORC • Gandiva: LLVM-based expression kernels • Plasma: Shared-memory object store • DataFusion: Rust-based query engine • Flight: RPC protocol built on top of gRPC with zero-copy optimizations
  20. Ecosystem • RAPIDS: Analytics on the GPU • Dremio: Data platform • Turbodbc: columnar ODBC access in C++/Python • Spark: fast Python and R bridge • fletcher (pandas): Use Arrow instead of NumPy as backing storage • fletcher (FPGA): Use Arrow on FPGAs • Many more … https://arrow.apache.org/powered_by/
  21. Ecosystem Kartothek: • Heavily relies on Parquet adapter • Uses Arrow’s type system which is more sophisticated than pandas’ • Using Arrow instead of building some components on their own allows us to provide Kartothek access in other languages easily in the future
  22. Does it work?

  23. Does it work? Everything is amazing on slides …

  24. Does it work? Everything is amazing on slides … … so does this Arrow actually work?
  25. Does it work? Everything is amazing on slides … … so does this Arrow actually work? Let’s take a real example with:
  26. Does it work? Everything is amazing on slides … … so does this Arrow actually work? Let’s take a real example with: • ERP System in Java with JDBC access (no non-Java client)
  27. Does it work? Everything is amazing on slides … … so does this Arrow actually work? Let’s take a real example with: • ERP System in Java with JDBC access (no non-Java client) • ETL and Data Cleaning in Python
  28. Does it work? Everything is amazing on slides … … so does this Arrow actually work? Let’s take a real example with: • ERP System in Java with JDBC access (no non-Java client) • ETL and Data Cleaning in Python • Analysis in R
  29. Does it work?

  30. Does it work?

  31. Does it work?

  32. Does it work? WIP

  33. Get started easily?

  34. Up Next • Build more adaptors, e.g. Postgres • Building blocks for query engines on top of Arrow • Datasets • Analytical kernels • DataFrame implementations directly on top of Arrow
  35. Thanks Slides at https://twitter.com/xhochy Questions here!