Apache Arrow

This presentation gives an overview of the Apache Arrow project. It explains Arrow in terms of its in-memory structure, its purpose, its language interfaces, and its supporting projects.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

May 29, 2020

Transcript

  1. What Is Apache Arrow?
     • A development platform for in-memory data
     • It has a columnar memory format
     • It provides efficient analytic operations on modern hardware
     • Used for in-memory processing
     • Cross-language support
     • Open source / Apache 2.0 license
     • Supports zero-copy reads for lightning-fast data access
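The columnar layout and zero-copy reads mentioned on this slide can be sketched with Python's standard library alone. This is not Arrow's API; the `array` and `memoryview` types only stand in for Arrow's column buffers, and the `ids` and `prices` columns are made-up sample data.

```python
from array import array

# Row-oriented layout: one Python tuple per record (hypothetical sample data).
rows = [(1, 10.5), (2, 20.0), (3, 7.25)]

# Columnar layout: one contiguous, typed buffer per column.
ids = array("q", (r[0] for r in rows))      # signed 64-bit integers
prices = array("d", (r[1] for r in rows))   # 64-bit floats

# An analytic operation scans only the column it needs.
total_price = sum(prices)

# Zero-copy read: the memoryview slice shares the column's buffer
# instead of copying it, so a later write to the column is visible.
view = memoryview(prices)[1:]
prices[1] = 99.0
first_in_view = view[0]
```

Because `view` shares memory with `prices`, the write to `prices[1]` shows up through the view without any data being copied.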
  2. Languages supported
     • Arrow supports many languages
     • C
     • C++
     • C#
     • Go
     • Java
     • JavaScript
     • MATLAB
     • Python
     • R
     • Ruby
     • Rust
  3. Open Source Community Support
     • Many open source projects support Arrow
     • Calcite
     • Cassandra
     • Drill
     • Hadoop
     • HBase
     • Ibis
     • Impala
     • Kudu
     • Pandas
     • Parquet
     • Phoenix
     • Spark
     • Storm
  4. The problem Arrow tackles
     • Each system has its own internal memory format
     • 70-80% of computation is wasted on serialization and deserialization
     • Similar functionality is implemented in multiple projects
     • Overheads for cross-system communication
     • All systems utilize different memory formats
  5. Arrow solves this problem
     • All systems utilize the same memory format
       – In memory
       – Columnar format
       – Optimized for modern CPUs and GPUs
     • No overhead for cross-system communication
     • Projects can share functionality
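The difference between exchanging data through serialization and sharing one memory format can be illustrated roughly as follows. This is a standard-library sketch, not Arrow code: JSON stands in for whatever wire format two systems might use, and a `memoryview` stands in for a shared Arrow buffer.

```python
import json
from array import array

# Producer's data: one column of 64-bit floats (hypothetical values).
values = array("d", [1.0, 2.0, 3.0, 4.0])

# Without a shared format: the producer encodes the data and the
# consumer decodes it into its own structure. That is a full copy
# plus CPU time spent on both sides.
wire = json.dumps(list(values))
decoded = array("d", json.loads(wire))

# With a shared format: the consumer reads the producer's buffer directly.
shared = memoryview(values)

# A later write by the producer is visible only through the shared buffer.
values[0] = 100.0
copy_sees_update = decoded[0] == 100.0
shared_sees_update = shared[0] == 100.0
```

The decoded copy is a separate object that had to be built element by element, while the shared view required no encoding, decoding, or copying.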
  6. Arrow works with Parquet
     • Arrow is an in-memory format
     • Parquet is designed for disk storage
     • Arrow and Parquet are intended to be used together
     • Parquet is a columnar file format
     • Used for data serialization
     • Parquet is a streaming format
     • Data must be decoded from start to end
     • Files are compressed and encoded
     • Means smaller files on disk
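The trade-off between a compressed on-disk format and a directly usable in-memory format can be sketched as below. This does not use Parquet or its actual encodings: `zlib` compression of a raw column buffer is an assumption chosen only to show that an encoded file is smaller but must be decoded before the data can be used. The file name `column.bin` is hypothetical.

```python
import os
import tempfile
import zlib
from array import array

# In-memory column: highly repetitive, so it compresses well.
column = array("q", [7] * 1000)
raw = column.tobytes()

# "Disk format": compress (encode) the column before writing it out.
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    f.write(zlib.compress(raw))

# Reading back: the whole file must be decoded before the values are usable.
with open(path, "rb") as f:
    restored = array("q")
    restored.frombytes(zlib.decompress(f.read()))

on_disk_bytes = os.path.getsize(path)
in_memory_bytes = len(raw)
```

The file on disk is much smaller than the in-memory buffer, at the cost of a decode step on every read, which is the division of labor the slide describes between the two formats.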
  7. Available Books
     • See “Big Data Made Easy” (Apress, Jan 2015)
     • See “Mastering Apache Spark” (Packt, Oct 2015)
     • See “Complete Guide to Open Source Big Data Stack” (Apress, Jan 2018)
     • Find the author on Amazon
       – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
     • Connect on LinkedIn
       – www.linkedin.com/in/mike-frampton-38563020
  8. Connect
     • Feel free to connect on LinkedIn
       – www.linkedin.com/in/mike-frampton-38563020
     • See my open source blog at
       – open-source-systems.blogspot.com/
     • I am always interested in
       – New technology
       – Opportunities
       – Technology-based issues
       – Big data integration