PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landscapes using Arrow and Parquet



Uwe L. Korn

October 25, 2017

Transcript

  1. 1.

    1 Connecting PyData to other Big Data Landscapes using Arrow

    and Parquet Uwe L. Korn, PyCon.DE 2017
  2. 2.

    2 • Data Scientist & Architect at Blue Yonder (@BlueYonderTech)

    • Apache {Arrow, Parquet} PMC • Work in Python, Cython, C++11 and SQL • Heavy Pandas user About me xhochy uwe@apache.org
  3. 3.

    3 Python is a good companion for a Data Scientist

    …but there are other ecosystems out there.
  4. 4.

    • Large set of files on distributed filesystem • Non-uniform

    schema • Execute query • Only a subset is interesting 4 Why do I care? not in Python
  5. 5.

    5 All are amazing but… How to get my data

    out of Python and back in again? Use Parquet! …but there was no fast Parquet access 2 years ago.
  6. 6.

    6 A general problem • Great interoperability inside ecosystems •

    Often based on a common backend (e.g. NumPy) • Poor integration with other systems • CSV is your only resort • „We need to talk!“ • Memory copy runs at about 10 GiB/s • (De-)serialisation comes on top
  7. 9.

    9 About Parquet 1. Columnar on-disk storage format 2. Started

    in fall 2012 by Cloudera & Twitter 3. July 2013: 1.0 release 4. top-level Apache project 5. Fall 2016: Python & C++ support 6. State of the art format in the Hadoop ecosystem • often used as the default I/O option
  8. 10.

    10 Why use Parquet? 1. Columnar format
 —> vectorized operations

    2. Efficient encodings and compressions
 —> small size without the need for a fat CPU 3. Predicate push-down
 —> bring computation to the I/O layer 4. Language independent format
 —> libs in Java / Scala / C++ / Python /…
  9. 11.

    Compression 1. Shrink data size independent of its content 2.

    More CPU intensive than encoding 3. encoding+compression performs better than compression alone with less CPU cost 4. LZO, Snappy, GZIP, Brotli
 —> If in doubt: use Snappy 5. GZIP: 174 MiB (11%)
 Snappy: 216 MiB (14%)
  10. 12.

    Predicate pushdown 1. Only load used data • skip columns

that are not needed • skip (chunks of) rows that are not relevant 2. saves I/O load as the data is not transferred 3. saves CPU as the data is not decoded Which products are sold in $?
  11. 15.

    Read & Write Parquet 15 Pandas 0.21 will bring pd.read_parquet(…)

    df.to_parquet(…) http://pandas.pydata.org/pandas-docs/version/0.21/io.html#io-parquet
  12. 19.

    19 Apache Arrow • Specification for in-memory columnar data layout

    • No overhead for cross-system communication • Designed for efficiency (exploit SIMD, cache locality, ..) • Exchange data without conversion between Python, C++, C(glib), Ruby, Lua, R, JavaScript and the JVM • This brought Parquet to Pandas without any Python code in parquet-cpp
  13. 20.

    20 Dissecting Arrow C++ • General zero-copy memory management •

    jemalloc as the base allocator • Columnar memory format & metadata • Schema & DataType • Columns & Table
  14. 21.

    21 Dissecting Arrow C++ • Structured data IPC (inter-process communication)

    • used in Spark for JVM<->Python • future extensions include: gRPC backend, shared memory communication, … • Columnar in-memory analytics • will be the backbone of Pandas 2.0
  15. 22.

    0.05s Converting 1 million longs from Spark to PySpark 22

    with Arrow https://github.com/apache/spark/pull/15821#issuecomment-282175163
  16. 23.

    23 Apache Arrow – Real life improvement Real life example!

    Retrieve a dataset from an MPP database and analyze it in Pandas 1. Run a query in the DB 2. Pass it in columnar form to the DB driver 3. The ODBC layer transforms it into row-wise form 4. Pandas makes it columnar again Ugly real-life solution: export as CSV, bypass ODBC
  17. 24.

    24 Better solution: Turbodbc with Arrow support 1. Retrieve columnar

    results 2. Pass them in a columnar fashion to Pandas More systems in the future (without the ODBC overhead) See also Michael’s talk tomorrow: Turbodbc: Turbocharged database access for data scientists Apache Arrow – Real life improvement
  18. 25.
  19. 27.

    Cross language DataFrame library • Website: https://arrow.apache.org/ • ML: dev@arrow.apache.org

    • Issues & Tasks: https://issues.apache.org/jira/browse/ARROW • Slack: https://apachearrowslackin.herokuapp.com/ • Github: https://github.com/apache/arrow Apache Arrow Apache Parquet Famous columnar file format • Website: https://parquet.apache.org/ • ML: dev@parquet.apache.org • Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET • Slack: https://parquet-slack-invite.herokuapp.com/ • Github: https://github.com/apache/parquet-cpp 27 Get Involved!
  20. 28.

    Blue Yonder GmbH Ohiostraße 8 76149 Karlsruhe Germany +49 721

    383117 0 Blue Yonder Software Limited 19 Eastbourne Terrace London, W2 6LG United Kingdom +44 20 3626 0360 Blue Yonder Best decisions, delivered daily Blue Yonder Analytics, Inc. 5048 Tennyson Parkway Suite 250 Plano, Texas 75024 USA 28