
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4


Dongjoon Hyun

June 18, 2018



Transcript

  1. 1 © Hortonworks Inc. 2011–2018. All rights reserved ORC Improvement

    & Roadmap in Apache Spark 2.3 and 2.4 Dongjoon Hyun Principal Software Engineer @ Hortonworks Data Science Team June 2018
  2. 2 © Hortonworks Inc. 2011–2018. All rights reserved HDP 2.6.5

    (May 2018) • Apache Spark − 2.3.0 (2018 FEB) • Apache ORC − 1.4.3 (2018 FEB) • Apache Kafka − 1.0.0 (2017 NOV)
  3. 3 © Hortonworks Inc. 2011–2018. All rights reserved Apache Spark 2.3.x

    Major Features: • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 Experimental Features: • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing Spark 2.3.0 (and 2.3.1) resolved 1409 (and 134) JIRA issues.
  4. 4 © Hortonworks Inc. 2011–2018. All rights reserved Spark’s built-in

    file-based data sources • TEXT The simplest one with one string column schema • CSV Popular for data science workloads • JSON The most flexible one for schema changes • PARQUET The only one with vectorized reader • ORC Storage-efficient and popular for shared Hive tables
  5. 5 © Hortonworks Inc. 2011–2018. All rights reserved Motivation •

    TEXT The simplest one with one string column schema • CSV Popular for data science workloads • JSON The most flexible one for schema changes • PARQUET The only one with vectorized reader • ORC Storage-efficient and popular for shared Hive tables Goal: Fast + Flexible + Hive Table Access
  6. 6 © Hortonworks Inc. 2011–2018. All rights reserved The story

    of Spark, ORC, and Hive – Cont. • Before Apache ORC − Hive 1.2.1 (2015 JUN) ➔ SPARK-2883 Hive 1.2.1 Spark 1.4 • After Apache ORC − v1.0.0 (2016 JAN) − v1.3.3 (2017 FEB) ➔ HIVE-15841 Hive 2.3.0 ~ 2.3.3 − v1.4.1 (2017 OCT) ➔ SPARK-22300 Spark 2.3.0 (FEB) − v1.4.3 (2018 FEB) ➔ SPARK-23340, HIVE-18674 Hive 3.0 (MAY) − v1.4.4 (2018 MAY) ➔ SPARK-24322 Spark 2.3.1 (JUN) − v1.5.1 (2018 MAY) ➔ SPARK-24576, HIVE-19669 Hive 3.1 Spark 2.4
  7. 8 © Hortonworks Inc. 2011–2018. All rights reserved Six Issue

    Categories • ORC Writer Versions • Performance • Structured streaming • Column names • Hive tables and schema evolution • Robustness
  8. 9 © Hortonworks Inc. 2011–2018. All rights reserved Category 1

    – ORC Writer Versions • ORIGINAL • HIVE_8732 (2014) ORC string statistics are not merged correctly • HIVE_4243 (2015) Use real column names from Hive tables • HIVE_12055 (2015) Vectorized Writer • HIVE_13083 (2016) Decimals write present stream correctly • ORC_101 (2016) Correct the use of the default charset in bloom filter • ORC_135 (2018) PPD for timestamp is wrong when reader/writer timezones are different
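    Each ORC file records the writer version that produced it, so you can tell which of these fixes a given file carries. A minimal sketch of inspecting it with the ORC Java API from Scala; the file path is hypothetical:

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      import org.apache.orc.OrcFile

      // Open a reader on a single ORC file and print the recorded writer
      // version, e.g. ORIGINAL, HIVE_8732, ..., ORC_101, ORC_135.
      val reader = OrcFile.createReader(
        new Path("/data/part-00000.orc"),          // hypothetical path
        OrcFile.readerOptions(new Configuration()))
      println(reader.getWriterVersion)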
  9. 10 © Hortonworks Inc. 2011–2018. All rights reserved Category 2

    – Performance • Vectorized ORC Reader (SPARK-16060) • Fast reading of partition columns (SPARK-22712) • Pushing down filters for DateType (SPARK-21787)
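    A minimal sketch of exercising these improvements; `spark.sql.orc.filterPushdown` is the real configuration key (still disabled by default in Spark 2.3), while the path and column name are hypothetical:

      import org.apache.spark.sql.functions.col

      // Enable ORC predicate pushdown so filters are evaluated inside the reader.
      spark.conf.set("spark.sql.orc.filterPushdown", "true")

      // DateType predicates can now be pushed down (SPARK-21787), and partition
      // columns are served from metadata without reading data files (SPARK-22712).
      val df = spark.read.orc("/data/events")      // hypothetical path
      df.filter(col("event_date") > "2018-01-01").count()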
  10. 11 © Hortonworks Inc. 2011–2018. All rights reserved Category 3

    – Structured streaming • Write (SPARK-15474): `FileNotFoundException` when writing empty partitions as ORC • Read (SPARK-22781): create a structured stream from ORC files, e.g. spark.readStream.orc(path)
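    A short sketch of both fixes, assuming hypothetical paths:

      // SPARK-15474: writing a Dataset that has empty partitions no longer
      // fails with FileNotFoundException.
      val empty = spark.range(0).toDF("id")
      empty.write.orc("/tmp/empty_orc")            // hypothetical path

      // SPARK-22781: an ORC directory can now feed a structured stream
      // (file-source streams require an explicit schema).
      val stream = spark.readStream.schema("id LONG").orc("/tmp/empty_orc")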
  11. 12 © Hortonworks Inc. 2011–2018. All rights reserved Category 4

    – Column names • Unicode column names (SPARK-23072) • Column names with dot (SPARK-21791) • Should not create invalid column names (SPARK-21912)
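    A minimal sketch of the now-working cases; the column names and path are hypothetical, and `toDF` needs `spark.implicits._`:

      import spark.implicits._

      // Unicode column names (SPARK-23072) and names containing dots
      // (SPARK-21791) now round-trip through the ORC writer and reader.
      val df = Seq((1, "a")).toDF("이름", "p.q")
      df.write.orc("/tmp/orc_names")               // hypothetical path
      spark.read.orc("/tmp/orc_names").select("`p.q`").show()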
  12. 13 © Hortonworks Inc. 2011–2018. All rights reserved Category 5

    – Hive tables and schema evolution • Support `ALTER TABLE ADD COLUMNS` (SPARK-21929) − Introduced in Spark 2.2, but threw AnalysisException for ORC • Support column positional mismatch (SPARK-22267) − Returned wrong results when the ORC file schema order differed from the Hive MetaStore schema order • Support table properties during `convertMetastoreOrc/Parquet` (SPARK-23355, Spark 2.4) − For ORC/Parquet Hive tables, `convertMetastore` ignores table properties
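    A brief sketch of SPARK-21929 in action, with a hypothetical table name:

      // Before SPARK-21929, the ALTER TABLE below threw AnalysisException
      // for ORC tables.
      spark.sql("CREATE TABLE people (name STRING) USING ORC")
      spark.sql("ALTER TABLE people ADD COLUMNS (age INT)")
      spark.table("people").printSchema()   // now shows both columns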
  13. 14 © Hortonworks Inc. 2011–2018. All rights reserved Category 6

    – Robustness • ORC metadata exceed ProtoBuf message size limit (SPARK-19109) • NullPointerException on zero-size ORC file (SPARK-19809) • Support `ignoreCorruptFiles` (SPARK-23049) • Support `ignoreMissingFiles` (SPARK-23305)
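    The two options correspond to real Spark SQL configuration keys; a minimal sketch with a hypothetical path:

      // Skip unreadable or vanished files instead of failing the whole scan.
      spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
      spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
      spark.read.orc("/data/possibly_damaged").count()   // hypothetical path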
  14. 16 © Hortonworks Inc. 2011–2018. All rights reserved Supports two

    ORC file formats • Adding a new OrcFileFormat (SPARK-20682) [Class diagram: the FileFormat hierarchy in o.a.s.sql.execution.datasources — TextBasedFileFormat (CSVFileFormat, JsonFileFormat, TextFileFormat), ParquetFileFormat, and the new `native` OrcFileFormat built on ORC 1.4+; LibSVMFileFormat in o.a.s.ml.source.libsvm; HiveFileFormat and the `hive` OrcFileFormat from Hive 1.2.1 in o.a.s.sql.hive.orc]
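    Besides the `spark.sql.orc.impl` switch shown later, either implementation can be requested explicitly by its fully qualified class name; a sketch, with a hypothetical path:

      val path = "/data/people_orc"   // hypothetical path

      // `native` OrcFileFormat, backed by Apache ORC 1.4+:
      spark.read
        .format("org.apache.spark.sql.execution.datasources.orc.OrcFileFormat")
        .load(path)

      // `hive` OrcFileFormat, backed by Hive 1.2.1:
      spark.read
        .format("org.apache.spark.sql.hive.orc.OrcFileFormat")
        .load(path)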
  15. 17 © Hortonworks Inc. 2011–2018. All rights reserved In Reality

    – Four cases for ORC Reader/Writer
    1. `native` writer + `native` reader: New Data, New Apps, Best performance (Vectorized Reader)
    2. `native` writer + `hive` reader: New Data, Old Apps, Improved performance (Non-vectorized Reader)
    3. `hive` writer + `native` reader: Old Data, New Apps, Improved performance (Vectorized Reader)
    4. `hive` writer + `hive` reader: Old Data, Old Apps, As-Is performance (Non-vectorized Reader)
  16. 18 © Hortonworks Inc. 2011–2018. All rights reserved Performance –

    Single column scan from wide tables [Chart: scan time (ms) vs. number of columns (100, 200, 300) for 1M rows of all-BIGINT columns, comparing the four writer/reader cases above; case 1 (`native` writer / `native` reader) is about 4x faster than case 4 (`hive` writer / `hive` reader)] https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  17. 19 © Hortonworks Inc. 2011–2018. All rights reserved Switch ORC

    implementation (SPARK-20728) • spark.sql.orc.impl=native (default: `hive`) Create ORC Table: CREATE TABLE people (name string, age int) USING ORC OPTIONS (orc.compress 'ZLIB') Read/Write Dataset: spark.read.orc(path) / df.write.orc(path), or spark.read.format("orc").load(path) / df.write.format("orc").save(path)
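    Putting the slide's snippets together as runnable Scala; `path` is hypothetical:

      val path = "/data/people_orc"   // hypothetical path

      // Switch to the new implementation for this session (2.3 default: hive).
      spark.conf.set("spark.sql.orc.impl", "native")

      // Create an ORC table with explicit compression.
      spark.sql("CREATE TABLE people (name STRING, age INT) USING ORC OPTIONS (orc.compress 'ZLIB')")

      // Dataset read/write; the format("orc") forms are equivalent.
      val df = spark.read.orc(path)
      df.write.format("orc").save(path + "_copy")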
  18. 20 © Hortonworks Inc. 2011–2018. All rights reserved Switch ORC

    implementation (SPARK-20728) – Cont. • spark.sql.orc.impl=native (default: `hive`) spark.readStream.orc(path) spark.readStream.format("orc").load(path) df.writeStream .option("checkpointLocation", path1) .format("orc") .option("path", path2) .start Read/Write Structured Stream
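    The same snippets as a complete streaming sketch; paths are hypothetical, and file-source streams require an explicit schema:

      // Read ORC files arriving in a directory as a stream.
      val stream = spark.readStream
        .schema("name STRING, age INT")
        .format("orc")
        .load("/data/incoming")                    // hypothetical path

      // Write the stream back out as ORC, with checkpointing for recovery.
      val query = stream.writeStream
        .option("checkpointLocation", "/tmp/checkpoint")
        .format("orc")
        .option("path", "/data/outgoing")
        .start()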
  19. 21 © Hortonworks Inc. 2011–2018. All rights reserved Support vectorized

    read on Hive ORC Tables • spark.sql.hive.convertMetastoreOrc=true (default: false) − `spark.sql.orc.impl=native` is required, too. CREATE TABLE people (name string, age int) STORED AS ORC CREATE TABLE people (name string, age int) USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
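    A minimal sketch combining the two settings so that a Hive ORC table is served by the vectorized reader:

      // Both settings are needed in Spark 2.3 for vectorized reads of
      // Hive ORC tables.
      spark.conf.set("spark.sql.orc.impl", "native")
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

      spark.sql("CREATE TABLE people (name STRING, age INT) STORED AS ORC")
      spark.table("people").count()   // now uses the native vectorized reader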
  20. 22 © Hortonworks Inc. 2011–2018. All rights reserved Schema evolution

    at reading file-based data sources • Frequently, new files can have wider column types or new columns − Before SPARK-21929, users had to drop and recreate the ORC table with an updated schema. • A user-defined schema reduces schema inference cost and handles upcasting − boolean -> byte -> short -> int -> long − float -> double Old Data: spark.read.schema("col1 int").orc(path) New Data: spark.read.schema("col1 long, col2 long").orc(path)
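    A short sketch of the upcasting path, assuming hypothetical directories of old and new files:

      // Old files were written as (col1 INT); newer files as (col1 LONG, col2 LONG).
      // One user-specified schema reads both: INT values are upcast to LONG, and
      // the missing col2 is null for old files.
      val df = spark.read
        .schema("col1 LONG, col2 LONG")
        .orc("/data/old", "/data/new")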
  21. 23 © Hortonworks Inc. 2011–2018. All rights reserved Schema evolution

    at reading file-based data sources – Cont. Supported changes by file format (TEXT, CSV, JSON, ORC `hive`, ORC `native`¹, PARQUET):
    • Add Column At The End: CSV, JSON, ORC `hive`, ORC `native`, PARQUET
    • Hide Trailing Column: CSV, JSON, ORC `hive`, ORC `native`, PARQUET
    • Hide Column: JSON, ORC `native`, PARQUET
    • Change Column Type²: CSV, JSON³, ORC `native`
    • Change Column Position: JSON, ORC `native`, PARQUET
    1. Native Vectorized ORC Reader 2. Only safe change via upcasting 3. JSON is the most flexible for changing types
  22. 25 © Hortonworks Inc. 2011–2018. All rights reserved Micro Benchmark

    (Apache Spark 2.3.0) • Target − Apache Spark 2.3.0 − Apache ORC 1.4.1 • Machine − MacBook Pro (2015 Mid) − Intel® Core™ i7-4770HQ CPU @ 2.20GHz − Mac OS X 10.13.4 − JDK 1.8.0_161
  23. 26 © Hortonworks Inc. 2011–2018. All rights reserved Performance –

    Single column scan from wide tables [Chart: scan time (ms) vs. number of columns (100, 200, 300) for 1M rows of all-BIGINT columns; `native` writer / `native` reader is about 4x faster than `hive` writer / `hive` reader] https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  24. 27 © Hortonworks Inc. 2011–2018. All rights reserved Performance –

    Vectorized Read [Chart: read time (ms) for 15M rows in a single-column table of TINYINT, SMALLINT, INT, BIGINT, FLOAT, and DOUBLE; `native` is roughly 5x to 11x faster than `hive` depending on the column type] https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  25. 28 © Hortonworks Inc. 2011–2018. All rights reserved Performance –

    Partitioned table read [Chart: read time (ms) for 15M rows in a partitioned table, scanning the data column, the partition column, and both; `native` is roughly 7x to 21x faster than `hive`] https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  26. 29 © Hortonworks Inc. 2011–2018. All rights reserved Predicate Pushdown

    [Chart: query time (ms) for 15M rows with 5 data columns and 1 sequential id column, selecting 10%, 50%, 90%, and all rows (`id < value` / `id IS NOT NULL`), comparing parquet and ORC `native`] https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
  27. 31 © Hortonworks Inc. 2011–2018. All rights reserved Support Matrix

    • Spark 2.3 and ORC 1.4 became GA in HDP 2.6.5.
    HDP 2.6.3~4 (TP for ORC on Spark): Spark 2.2, ORC N/A, spark.sql.orc.enabled=true, spark.sql.orc.char.enabled=true
    HDP 2.6.5 (GA for ORC on Spark): Spark 2.3.0+, ORC 1.4.3, spark.sql.orc.impl=native
    HDP 3.0 EA1 (Early Access): Spark 2.3.1+, ORC 1.4.3+, spark.sql.orc.impl=native
    1. https://hortonworks.com/info/early-access-hdp-3-0/
  28. 32 © Hortonworks Inc. 2011–2018. All rights reserved Future Roadmap

    – Targeting Apache Spark 2.4 (2018 Fall) Umbrella Issue • Feature Parity for ORC with Parquet SPARK-20901 Sub issues • Upgrade Apache ORC to 1.5.1 SPARK-24576 • Use `native` ORC implementation by default SPARK-23456 • Use ORC predicate pushdown by default SPARK-21783 • Use `convertMetastoreOrc` by default SPARK-22279 • Support table properties with `convertMetastoreOrc/Parquet` SPARK-23355 • Test ORC as default data source format SPARK-23553 • Test and support Bloom Filters SPARK-12417
  29. 33 © Hortonworks Inc. 2011–2018. All rights reserved Future Roadmap

    – On-going work • ORC Column-level encryption (with ORC 1.6) • Support VectorUDT/MatrixUDT (SPARK-22320) • Vectorized Writer with DataSource V2 • Support CHAR/VARCHAR Types • ALTER TABLE … CHANGE column type (SPARK-18727)
  30. 34 © Hortonworks Inc. 2011–2018. All rights reserved Summary •

    Like Hive, Apache Spark 2.3 starts to take advantage of Apache ORC − Improved feature parity between Spark and Hive • Native vectorized ORC reader − boosts Spark ORC performance − provides better schema evolution ability • Structured streaming starts to work with ORC (both reader/writer) • Spark is going to become faster and faster with ORC
  31. 35 © Hortonworks Inc. 2011–2018. All rights reserved Reference •

    https://www.slideshare.net/DongjoonHyun/orc-improvement-in-apache-spark-23, Dataworks Summit 2018 Berlin • https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3 • https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow • https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html • https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney • https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose