
ORC 2015: Faster, Better, Smaller

In this talk we cover the ORC (Optimized Row Columnar) file format and the features and performance optimizations that went in after its initial version (Hive 0.11, back in May 2013). We also briefly cover the latest and greatest features, as well as future enhancements planned for Hive 0.15.


Prasanth Jayachandran

June 09, 2015


Transcript

1. ORC 2015: Faster, Better, Smaller
   Prasanth Jayachandran, Apache Hive Team, Hortonworks (@prasanth_j)

2. Apache ORC – Optimized Row-Columnar File
   Apache TLP – orc.apache.org
   + Type Specific Encodings
   + Came out of Apache Hive
   + Vectorized Readers (Java, C++)
   + Projection and Predicate Pushdown
   + Columnar Storage
   + Block Compression
   + Hive ACID Transactions
   + Single SerDe Format
   + Protobuf Metadata Storage

3. ORC: Format Specification
   How does ORC store data?

4. ORC File Layout
   - File Footer and Postscript
   - Stripes
     - Indexes (row group indexes and Bloom filters interleaved)
       - Min/max stats and stream positions for every 10K rows
     - Data
       - Multiple streams per column, encoded and compressed independently
     - Stripe Footer
       - Locations of streams, type of encoding
   - Full specification at [1]

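   The row group index layout above is controlled through ORC table properties. A minimal sketch, assuming a hypothetical table name (the values shown are the usual defaults):

      CREATE TABLE orders_orc (order_id bigint, status string)
      STORED AS ORC
      TBLPROPERTIES (
        'orc.create.index'='true',        -- write row group indexes
        'orc.row.index.stride'='10000'    -- index entry (min/max stats + positions) every 10K rows
      );
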
5. ORC Writer
   Schema: <i:int, m:map<k:string, v:struct<s:string, d:double>>, t:time>
   - One tree writer per flattened column
   - Multiple streams per column:
     - PRESENT
     - DATA
     - LENGTH
     - DICTIONARY_DATA
     - SECONDARY
     - ROW_INDEX
     - BLOOM_FILTER

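   A minimal sketch of how that schema could be declared from Hive (table name is hypothetical; the slide's time column is written as Hive's timestamp type here). Each flattened column of the schema, including the map key, map value and struct fields, gets its own tree writer and set of streams:

      -- Hypothetical Hive table matching the schema on the slide.
      CREATE TABLE example_orc (
        i int,
        m map<string, struct<s:string, d:double>>,
        t timestamp
      )
      STORED AS ORC;
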
6. ORC Data Streams
   Schema: <i:int, m:map<k:string, v:struct<s:string, d:double>>, t:time>
   - Streams can be suppressed.
   - Example: the PRESENT stream is suppressed when all values in a stripe are non-null.
   (Diagram: IS_PRESENT, DATA, DICTIONARY, LENGTH and SECONDARY streams laid out over compression buffers.)

7. ORC: Features Timeline
   How has ORC improved over time?

8. Timeline: February 2013
   Stinger Initiative announcement*
   - Roadmap to improve Apache Hive's performance by 100x
   - Delivered in 100% Apache open source
   - Vectorized SQL engine + columnar storage (ORC) + distributed execution (Apache Tez) = 100x
   * http://hortonworks.com/blog/100x-faster-hive/

9. Timeline: March 2013
   Optimized Row Columnar (ORC) file format committed to Hive
   - Hive version: 0.11
   - Native data format in Hive

10. Timeline: March 2013
    Predicate Pushdown
    - SARG (search argument) interface
    - Prune stripes and row groups based on min/max statistics
    Improved Run Length Encoding
    - Tighter bit packing
    - Longer runs
    - DELTA, SHORT_REPEAT, DIRECT, PATCHED_BASE sub-encodings

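    A minimal sketch of exercising predicate pushdown from Hive (table and predicate are hypothetical): with index filtering enabled, the WHERE clause is handed to the ORC reader as a search argument, and stripes and row groups whose min/max statistics cannot match are skipped.

       SET hive.optimize.index.filter=true;   -- push predicates down to the ORC reader
       SELECT count(*)
       FROM lineitem_orc                      -- hypothetical ORC table
       WHERE l_shipdate = '1998-09-02';       -- min/max stats prune non-matching row groups
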
11. Run Length Encoding Improvements

    Dataset                                          | RLE (Hive 0.11)              | RLE (Hive >= 0.12)
                                                     | Ratio  | Enc (ms) | Dec (ms) | Ratio  | Enc (ms) | Dec (ms)
    Twitter Census API ID (24,556,361 records)       | 2.32   | 1770     | 1263     | 6.97   | 1558     | 864
    HTTP Archive (bytes.json)                        | 79.4   | 198      | 191      | 200.82 | 263      | 125
    Github Archive (root.payload.name.txt.dict-len)  | 114.05 | 21       | 15       | 260.73 | 23       | 15
    AOL Querylog Epoch (36,389,577 records)          | 2.51   | 553      | 364      | 3.7    | 652      | 246

    Ratio = compression ratio; Enc/Dec = encoding/decoding time.
    Reference: https://issues.apache.org/jira/secure/attachment/12596722/ORC-Compression-Ratio-Comparison.xlsx

12. Timeline: April 2013 and June 2013
    Vectorized ORC readers
    - Read and process columns in batches of size 1024
    Null stream suppression
    - Suppress the PRESENT stream if a stripe has no nulls
    - Enables the fast path in vectorization

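    A small sketch (table name and query are illustrative) of turning on vectorized execution so the ORC reader delivers 1024-row batches:

       SET hive.vectorized.execution.enabled=true;   -- process data in 1024-row batches
       SELECT l_returnflag, count(*)
       FROM lineitem_orc                              -- hypothetical ORC table
       GROUP BY l_returnflag;
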
13. Timeline: October 2013 and November 2013
    Statistics Interface
    - Writer: update statistics during load time
    - Reader: ANALYZE TABLE .. NOSCAN
    Split Elimination
    - Stripe-level column statistics
    - Eliminate stripes that do not satisfy predicate conditions

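    A minimal sketch of the reader side (hypothetical table name): with NOSCAN, Hive fills in basic statistics from ORC's stored metadata rather than scanning the data.

       -- Gather table statistics from ORC metadata, without a full scan.
       ANALYZE TABLE lineitem_orc COMPUTE STATISTICS NOSCAN;
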
14. Timeline: February 2014 and June 2014
    Zero-copy read path
    - HDFS caching APIs to read directly into memory without extra data copies
    Serialization improvements
    - Bit-width alignment (trade-off: space for speed)
    - Unrolled bit packing and unpacking
    - Buffered double reader and writer

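    The zero-copy read path is gated behind a Hive setting; a minimal sketch (the SELECT and table name are illustrative):

       SET hive.exec.orc.zerocopy=true;    -- read ORC buffers via the HDFS zero-copy API
       SELECT count(*) FROM lineitem_orc;  -- hypothetical ORC table
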
15. Serialization Improvements
    (Chart: ORC integer read performance, mean time in ms vs. bit width 1–64, smaller is better;
    hive 0.13 unpacking vs. hive-1.0 unpacking (new).)

16. Serialization Improvements
    ORC double read performance, mean time in ms (smaller is better):
    - hive <= 0.13: 241.679
    - buffered + BE: 171.045
    - buffered + LE: 174.163
    ~1.4x improvement

17. Timeline: June 2014 and July 2014
    Adaptive compression buffer size
    - For tables with more than 1,000 columns, adjust the compression buffer size based on available memory
    - Avoids OOMs on wide tables
    Fast stripe-level file merging
    - Merges many small files into a few large files
    - No decompression, no decoding
    - ALTER TABLE … CONCATENATE (see the sketch below)

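    A minimal sketch of the merge command (table and partition names are hypothetical); because ORC files are merged at the stripe level, the data is neither decompressed nor decoded:

       -- Merge the small ORC files of one partition into fewer, larger files.
       ALTER TABLE web_logs_orc PARTITION (dt='2015-06-09') CONCATENATE;
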
18. Fast File Merging
    ETL with file merging, TPC-H 1000-scale lineitem, time in seconds (smaller is better):
    - ORC: 1091 (load) + 245 (merge) = 1336 total
    - RCFile: 651 (load) + 816 (merge) = 1467 total
    ~3.33x improvement in merge time (file formats supporting CONCAT)

19. Timeline: July 2014
    ORC padding improvements
    - Pad bytes to avoid remote HDFS reads
    - The last stripe is adjusted to fit within the HDFS block boundary (worst case: 5% wastage)
    Decoupled stripe size and block size
    - Smaller stripes (64 MB)
    - More stripes per block (4 per block)
    - Better parallelism and split elimination

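    Both knobs are exposed as ORC table properties. A minimal sketch, assuming a hypothetical table (the values are illustrative):

       CREATE TABLE events_orc (event_id bigint, payload string)
       STORED AS ORC
       TBLPROPERTIES (
         'orc.stripe.size'='67108864',   -- 64 MB stripes, several per 256 MB HDFS block
         'orc.block.padding'='true'      -- pad so stripes do not straddle block boundaries
       );
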
20. Timeline: September 2014
    String dictionary improvements
    - Row group level checking
    - Remember the dictionary decision across stripes
    - Avoids expensive RBTree insertions

21. String Dictionary Improvements
    TPC-H 1000-scale lineitem load time in seconds (smaller is better):
    - hive <= 0.13: 767
    - hive > 0.13: 540
    ~1.4x improvement

22. Timeline: September 2014
    Improved ZLIB compression
    - Different streams are compressed with different zlib strategies/levels
    - Integers and doubles are compressed differently
    - Data and dictionary streams: look for smaller byte patterns
    - All other streams: less LZ77, more Huffman

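    The codec itself is chosen per table. A minimal sketch (table name hypothetical) of selecting ZLIB, which the next two slides compare against SNAPPY:

       CREATE TABLE sales_orc (sale_id bigint, amount double)
       STORED AS ORC
       TBLPROPERTIES ('orc.compress'='ZLIB');   -- alternatives: SNAPPY, NONE
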
23. ZLIB Improvements
    TPC-H 1000-scale lineitem data size in GB (smaller is better):
    - ORC + old ZLIB: 178.5
    - ORC + new ZLIB: 172.2
    - ORC + SNAPPY: 225.1
    ~4% improvement over old ZLIB; ~1.3x smaller than SNAPPY

24. ZLIB Improvements
    TPC-H 1000-scale lineitem load time in seconds (smaller is better):
    - ORC + old ZLIB: 674
    - ORC + new ZLIB: 433
    - ORC + SNAPPY: 389
    ~1.6x improvement over old ZLIB; only ~10% slower than SNAPPY

25. Timeline: September 2014
    ACID transactions
    - Transactions on the order of millions of rows
    - Not designed for OLTP requirements
    - Streaming ingest via Flume or Storm
    - Atomically add base and delta directories
    - Minor compaction: merge many delta files
    - Major compaction: rewrite base files to incorporate delta file changes
    - Replaces the broken pattern of adding partitions for atomicity

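    A minimal sketch of an ACID table backed by ORC (names, bucket count and settings are illustrative); transactional tables must be bucketed and stored as ORC, and compactions run automatically but can also be requested explicitly:

       SET hive.support.concurrency=true;
       SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

       CREATE TABLE events_acid (id bigint, msg string)
       CLUSTERED BY (id) INTO 4 BUCKETS
       STORED AS ORC
       TBLPROPERTIES ('transactional'='true');

       ALTER TABLE events_acid COMPACT 'minor';   -- merge many delta files
       ALTER TABLE events_acid COMPACT 'major';   -- rewrite base + deltas into a new base
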
26. Timeline: January 2015
    hasNull flag in the ORC internal index
    - Better pruning of row groups
    - Improves the performance of SELECT .. WHERE column IS NULL

27. hasNull Index Improvement
    select * from lineitem where l_shipdate is null (smaller is better)
    - Bytes read: 208.77 GB vs. 539 MB
    - Execution time: 66.73 s (hive < 1.1.0) vs. 7.87 s (hive >= 1.1.0)
    ~8.5x improvement

28. Timeline: February 2015
    Bloom filter index
    - Much better row group pruning than min/max alone
    - Bloom filters are evaluated after the fast min/max based elimination

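    Bloom filters are enabled per column through table properties. A minimal sketch with hypothetical names (the FPP value shown is the usual default):

       CREATE TABLE lineitem_orc (l_orderkey bigint, l_shipdate string)
       STORED AS ORC
       TBLPROPERTIES (
         'orc.bloom.filter.columns'='l_orderkey',  -- columns to index with Bloom filters
         'orc.bloom.filter.fpp'='0.05'             -- target false positive probability
       );
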
29. Bloom Filter Index Improvements
    select * from tpch_1000.lineitem where l_orderkey = 1212000001;
    Rows read (log scale, smaller is better):
    - No indexes: 5,999,989,709
    - Min/max indexes: 540,000
    - Bloom filter indexes: 10,000

30. Bloom Filter Index Improvements
    select * from tpch_1000.lineitem where l_orderkey = 1212000001;
    Time taken in seconds (smaller is better):
    - No indexes: 74
    - Min/max indexes: 4.5 (~16x improvement)
    - Bloom filter indexes: 1.34 (~3.3x further improvement)

31. Timeline: April 2015
    Split strategies
    - BI: skip reading file footers
    - ETL: read and cache file footers
    - HYBRID (default): chooses BI or ETL based on the number of files and average file size
    - Group splits based on columnar projection size instead of file size

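    The strategy is a session-level setting; a small sketch (the choice of ETL here is illustrative):

       -- HYBRID is the default; BI skips footers for faster planning,
       -- ETL reads and caches footers for better splits.
       SET hive.exec.orc.split.strategy=ETL;
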
32. Timeline: April 2015
    ORC became an Apache Top-Level Project
    - C++ reader with contributions from Hortonworks, HP and Microsoft
    - Column encryption to encrypt sensitive columns
    http://orc.apache.org/

33. ORC: In Production

34. ORC at Facebook
    - Compression: saved more than 1,400 servers' worth of storage. (2)
    - Compression: compression ratio increased from 5x to 8x globally. (2)

35. ORC at Spotify
    - IO: 16x less HDFS read when using ORC versus Avro. (3)
    - CPU: 32x less CPU when using ORC versus Avro. (3)

36. ORC at Yahoo!
    - 6-50x speedup when using ORC versus text files. (4)
    - 1.6-30x speedup when using ORC versus RCFile. (4)

37. ORC: LLAP and Sub-second
    ORC: pushing for sub-second queries

38. ORC: LLAP - JIT performance for short queries
    + Row-group level caching
    + Asynchronous IO elevator
    + Multi-threaded column vector processing

39. ORC: Vectorization + SIMD
    - Allocation-free tight inner loops enable the JDK's auto-vectorization
    - Vectors can be filtered early in ORC
    - The string dictionary can be binary-searched
    - Vectorized SIMD join improves performance for single-key joins
    Example query: select ss_ext_tax + 1.0 from store_sales_orc;
    JVM options: HADOOP_OPTS="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"
    Note: make sure the hotspot disassembler is in $JAVA_HOME/jre/lib
    Generated assembly (AVX packed-double vector addition; 4 doubles loaded into 256-bit registers):
       0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
       0x00007f13d2e6afb6: vaddpd  %ymm1,%ymm2,%ymm2
       0x00007f13d2e6afba: movslq  %eax,%r10
       0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3
       ;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)

40. ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)
    select * from tpch_1000.lineitem where l_orderkey = 1212000001;

41. Questions?
    Interested? Stop by the Hortonworks booth to learn more.

42. Endnotes
    (1) https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification
    (2) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
    (3) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
    (4) http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3