
ORC 2015: Faster, Better, Smaller

In this talk we cover the ORC (Optimized Row Columnar) file format and the features and performance optimizations that went in after its initial version (Hive 0.11, back in May 2013). We also briefly cover the latest and greatest features, as well as future enhancements planned for Hive 0.15.


Prasanth Jayachandran

June 09, 2015


Transcript

1. ORC 2015: Faster, Better, Smaller
   Prasanth Jayachandran, Apache Hive Team, Hortonworks (@prasanth_j)

2. Apache ORC – Optimized Row-Columnar File
   Apache TLP – orc.apache.org
   + Type Specific Encodings
   + Came out of Apache Hive
   + Vectorized Readers (Java, C++)
   + Projection and Predicate Pushdown
   + Columnar Storage
   + Block Compression
   + Hive ACID Transactions
   + Single SerDe Format
   + Protobuf Metadata Storage

3. ORC: Format Specification
   How does ORC store data?

4. ORC File Layout
   - File Footer and Postscript
   - Stripes
     - Indexes (row group indexes and Bloom filters interleaved)
       - Min/max stats and stream positions for every 10K rows
     - Data
       - Multiple streams per column, encoded and compressed independently
     - Stripe Footer
       - Locations of streams, type of encoding
   - Full specification at [1]

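   The row group index layout above is controlled through ORC table properties. A minimal sketch, assuming a hypothetical table name (the values shown are the usual defaults):

      CREATE TABLE orders_orc (order_id bigint, status string)
      STORED AS ORC
      TBLPROPERTIES (
        'orc.create.index'='true',        -- write row group indexes
        'orc.row.index.stride'='10000'    -- index entry (min/max stats + positions) every 10K rows
      );
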
5. ORC Writer
   Schema: <i:int, m:map<k:string, v:struct<s:string, d:double>>, t:time>
   - One tree writer per flattened column
   - Multiple streams per column:
     - PRESENT
     - DATA
     - LENGTH
     - DICTIONARY_DATA
     - SECONDARY
     - ROW_INDEX
     - BLOOM_FILTER

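   A minimal sketch of how that schema could be declared from Hive (table name is hypothetical; the slide's time column is written as Hive's timestamp type here). Each flattened column of the schema, including the map key, map value and struct fields, gets its own tree writer and set of streams:

      -- Hypothetical Hive table matching the schema on the slide.
      CREATE TABLE example_orc (
        i int,
        m map<string, struct<s:string, d:double>>,
        t timestamp
      )
      STORED AS ORC;
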
6. ORC Data Streams
   Schema: <i:int, m:map<k:string, v:struct<s:string, d:double>>, t:time>
   - Streams can be suppressed.
   - Example: the PRESENT stream is suppressed when all values in a stripe are non-null.
   (Diagram: IS_PRESENT, DATA, DICTIONARY, LENGTH and SECONDARY streams laid out over compression buffers.)

7. ORC: Features Timeline
   How has ORC improved over time?

8. Timeline: February 2013
   Stinger Initiative announcement*
   - Roadmap to improve Apache Hive's performance by 100x
   - Delivered in 100% Apache open source
   - Vectorized SQL engine + columnar storage (ORC) + distributed execution (Apache Tez) = 100x
   * http://hortonworks.com/blog/100x-faster-hive/

9. Timeline: March 2013
   Optimized Row Columnar (ORC) file format committed to Hive
   - Hive version: 0.11
   - Native data format in Hive

10. Timeline: March 2013
    Predicate Pushdown
    - SARG (search argument) interface
    - Prune stripes and row groups based on min/max statistics
    Improved Run Length Encoding
    - Tighter bit packing
    - Longer runs
    - DELTA, SHORT_REPEAT, DIRECT, PATCHED_BASE sub-encodings

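    A minimal sketch of exercising predicate pushdown from Hive (table and predicate are hypothetical): with index filtering enabled, the WHERE clause is handed to the ORC reader as a search argument, and stripes and row groups whose min/max statistics cannot match are skipped.

       SET hive.optimize.index.filter=true;   -- push predicates down to the ORC reader
       SELECT count(*)
       FROM lineitem_orc                      -- hypothetical ORC table
       WHERE l_shipdate = '1998-09-02';       -- min/max stats prune non-matching row groups
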
11. Run Length Encoding Improvements

    Dataset                                          | RLE (Hive 0.11)              | RLE (Hive >= 0.12)
                                                     | Ratio  | Enc (ms) | Dec (ms) | Ratio  | Enc (ms) | Dec (ms)
    Twitter Census API ID (24,556,361 records)       | 2.32   | 1770     | 1263     | 6.97   | 1558     | 864
    HTTP Archive (bytes.json)                        | 79.4   | 198      | 191      | 200.82 | 263      | 125
    Github Archive (root.payload.name.txt.dict-len)  | 114.05 | 21       | 15       | 260.73 | 23       | 15
    AOL Querylog Epoch (36,389,577 records)          | 2.51   | 553      | 364      | 3.7    | 652      | 246

    Ratio = compression ratio; Enc/Dec = encoding/decoding time.
    Reference: https://issues.apache.org/jira/secure/attachment/12596722/ORC-Compression-Ratio-Comparison.xlsx

12. Timeline: April 2013 and June 2013
    Vectorized ORC readers
    - Read and process columns in batches of size 1024
    Null stream suppression
    - Suppress the PRESENT stream if a stripe has no nulls
    - Enables the fast path in vectorization

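    A small sketch (table name and query are illustrative) of turning on vectorized execution so the ORC reader delivers 1024-row batches:

       SET hive.vectorized.execution.enabled=true;   -- process data in 1024-row batches
       SELECT l_returnflag, count(*)
       FROM lineitem_orc                              -- hypothetical ORC table
       GROUP BY l_returnflag;
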
13. Timeline: October 2013 and November 2013
    Statistics Interface
    - Writer: update statistics during load time
    - Reader: ANALYZE TABLE .. NOSCAN
    Split Elimination
    - Stripe-level column statistics
    - Eliminate stripes that do not satisfy predicate conditions

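    A minimal sketch of the reader side (hypothetical table name): with NOSCAN, Hive fills in basic statistics from ORC's stored metadata rather than scanning the data.

       -- Gather table statistics from ORC metadata, without a full scan.
       ANALYZE TABLE lineitem_orc COMPUTE STATISTICS NOSCAN;
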
14. Timeline: February 2014 and June 2014
    Zero-copy read path
    - HDFS caching APIs to read directly into memory without extra data copies
    Serialization improvements
    - Bit-width alignment (trade-off: space for speed)
    - Unrolled bit packing and unpacking
    - Buffered double reader and writer

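    The zero-copy read path is gated behind a Hive setting; a minimal sketch (the SELECT and table name are illustrative):

       SET hive.exec.orc.zerocopy=true;    -- read ORC buffers via the HDFS zero-copy API
       SELECT count(*) FROM lineitem_orc;  -- hypothetical ORC table
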
15. Serialization Improvements
    (Chart: ORC integer read performance, mean time in ms vs. bit width 1–64, smaller is better;
    hive 0.13 unpacking vs. hive-1.0 unpacking (new).)

16. Serialization Improvements
    ORC double read performance, mean time in ms (smaller is better):
    - hive <= 0.13: 241.679
    - buffered + BE: 171.045
    - buffered + LE: 174.163
    ~1.4x improvement

17. Timeline: June 2014 and July 2014
    Adaptive compression buffer size
    - For tables with more than 1,000 columns, adjust the compression buffer size based on available memory
    - Avoids OOMs on wide tables
    Fast stripe-level file merging
    - Merges many small files into a few large files
    - No decompression, no decoding
    - ALTER TABLE … CONCATENATE (see the sketch below)

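    A minimal sketch of the merge command (table and partition names are hypothetical); because ORC files are merged at the stripe level, the data is neither decompressed nor decoded:

       -- Merge the small ORC files of one partition into fewer, larger files.
       ALTER TABLE web_logs_orc PARTITION (dt='2015-06-09') CONCATENATE;
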
18. Fast File Merging
    ETL with file merging, TPC-H 1000-scale lineitem, time in seconds (smaller is better):
    - ORC: 1091 (load) + 245 (merge) = 1336 total
    - RCFile: 651 (load) + 816 (merge) = 1467 total
    ~3.33x improvement in merge time (file formats supporting CONCAT)

19. Timeline: July 2014
    ORC padding improvements
    - Pad bytes to avoid remote HDFS reads
    - The last stripe is adjusted to fit within the HDFS block boundary (worst case: 5% wastage)
    Decoupled stripe size and block size
    - Smaller stripes (64 MB)
    - More stripes per block (4 per block)
    - Better parallelism and split elimination

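    Both knobs are exposed as ORC table properties. A minimal sketch, assuming a hypothetical table (the values are illustrative):

       CREATE TABLE events_orc (event_id bigint, payload string)
       STORED AS ORC
       TBLPROPERTIES (
         'orc.stripe.size'='67108864',   -- 64 MB stripes, several per 256 MB HDFS block
         'orc.block.padding'='true'      -- pad so stripes do not straddle block boundaries
       );
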
20. Timeline: September 2014
    String dictionary improvements
    - Row group level checking
    - Remember the dictionary decision across stripes
    - Avoids expensive RBTree insertions

21. String Dictionary Improvements
    TPC-H 1000-scale lineitem load time in seconds (smaller is better):
    - hive <= 0.13: 767
    - hive > 0.13: 540
    ~1.4x improvement

22. Timeline: September 2014
    Improved ZLIB compression
    - Different streams are compressed with different zlib strategies/levels
    - Integers and doubles are compressed differently
    - Data and dictionary streams: look for smaller byte patterns
    - All other streams: less LZ77, more Huffman

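    The codec itself is chosen per table. A minimal sketch (table name hypothetical) of selecting ZLIB, which the next two slides compare against SNAPPY:

       CREATE TABLE sales_orc (sale_id bigint, amount double)
       STORED AS ORC
       TBLPROPERTIES ('orc.compress'='ZLIB');   -- alternatives: SNAPPY, NONE
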
23. ZLIB Improvements
    TPC-H 1000-scale lineitem data size in GB (smaller is better):
    - ORC + old ZLIB: 178.5
    - ORC + new ZLIB: 172.2
    - ORC + SNAPPY: 225.1
    ~4% improvement over old ZLIB; ~1.3x smaller than SNAPPY

24. ZLIB Improvements
    TPC-H 1000-scale lineitem load time in seconds (smaller is better):
    - ORC + old ZLIB: 674
    - ORC + new ZLIB: 433
    - ORC + SNAPPY: 389
    ~1.6x improvement over old ZLIB; only ~10% slower than SNAPPY

25. Timeline: September 2014
    ACID transactions
    - Transactions on the order of millions of rows
    - Not designed for OLTP requirements
    - Streaming ingest via Flume or Storm
    - Atomically add base and delta directories
    - Minor compaction: merge many delta files
    - Major compaction: rewrite base files to incorporate delta file changes
    - Replaces the broken pattern of adding partitions for atomicity

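    A minimal sketch of an ACID table backed by ORC (names, bucket count and settings are illustrative); transactional tables must be bucketed and stored as ORC, and compactions run automatically but can also be requested explicitly:

       SET hive.support.concurrency=true;
       SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

       CREATE TABLE events_acid (id bigint, msg string)
       CLUSTERED BY (id) INTO 4 BUCKETS
       STORED AS ORC
       TBLPROPERTIES ('transactional'='true');

       ALTER TABLE events_acid COMPACT 'minor';   -- merge many delta files
       ALTER TABLE events_acid COMPACT 'major';   -- rewrite base + deltas into a new base
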
26. Timeline: January 2015
    hasNull flag in the ORC internal index
    - Better pruning of row groups
    - Improves the performance of SELECT .. WHERE column IS NULL

27. hasNull Index Improvement
    select * from lineitem where l_shipdate is null (smaller is better)
    - Bytes read: 208.77 GB vs. 539 MB
    - Execution time: 66.73 s (hive < 1.1.0) vs. 7.87 s (hive >= 1.1.0)
    ~8.5x improvement

28. Timeline: February 2015
    Bloom filter index
    - Much better row group pruning than min/max alone
    - Bloom filters are evaluated after the fast min/max based elimination

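    Bloom filters are enabled per column through table properties. A minimal sketch with hypothetical names (the FPP value shown is the usual default):

       CREATE TABLE lineitem_orc (l_orderkey bigint, l_shipdate string)
       STORED AS ORC
       TBLPROPERTIES (
         'orc.bloom.filter.columns'='l_orderkey',  -- columns to index with Bloom filters
         'orc.bloom.filter.fpp'='0.05'             -- target false positive probability
       );
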
29. Bloom Filter Index Improvements
    select * from tpch_1000.lineitem where l_orderkey = 1212000001;
    Rows read (log scale, smaller is better):
    - No indexes: 5,999,989,709
    - Min/max indexes: 540,000
    - Bloom filter indexes: 10,000

30. Bloom Filter Index Improvements
    select * from tpch_1000.lineitem where l_orderkey = 1212000001;
    Time taken in seconds (smaller is better):
    - No indexes: 74
    - Min/max indexes: 4.5 (~16x improvement)
    - Bloom filter indexes: 1.34 (~3.3x further improvement)

31. Timeline: April 2015
    Split strategies
    - BI: skip reading file footers
    - ETL: read and cache file footers
    - HYBRID (default): chooses BI or ETL based on the number of files and average file size
    - Group splits based on columnar projection size instead of file size

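    The strategy is a session-level setting; a small sketch (the choice of ETL here is illustrative):

       -- HYBRID is the default; BI skips footers for faster planning,
       -- ETL reads and caches footers for better splits.
       SET hive.exec.orc.split.strategy=ETL;
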
32. Timeline: April 2015
    ORC became an Apache Top-Level Project
    - C++ reader with contributions from Hortonworks, HP and Microsoft
    - Column encryption to encrypt sensitive columns
    http://orc.apache.org/

33. ORC: In Production

34. ORC at Facebook
    - Compression: saved more than 1,400 servers' worth of storage. (2)
    - Compression: compression ratio increased from 5x to 8x globally. (2)

35. ORC at Spotify
    - IO: 16x less HDFS read when using ORC versus Avro. (3)
    - CPU: 32x less CPU when using ORC versus Avro. (3)

36. ORC at Yahoo!
    - 6-50x speedup when using ORC versus text files. (4)
    - 1.6-30x speedup when using ORC versus RCFile. (4)

37. ORC: LLAP and Sub-second
    ORC: pushing for sub-second queries

38. ORC: LLAP - JIT performance for short queries
    + Row-group level caching
    + Asynchronous IO elevator
    + Multi-threaded column vector processing

39. ORC: Vectorization + SIMD
    - Allocation-free tight inner loops enable the JDK's auto-vectorization
    - Vectors can be filtered early in ORC
    - The string dictionary can be binary-searched
    - Vectorized SIMD join improves performance for single-key joins
    Example query: select ss_ext_tax + 1.0 from store_sales_orc;
    JVM options: HADOOP_OPTS="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"
    Note: make sure the hotspot disassembler is in $JAVA_HOME/jre/lib
    Generated assembly (AVX packed-double vector addition; 4 doubles loaded into 256-bit registers):
       0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
       0x00007f13d2e6afb6: vaddpd  %ymm1,%ymm2,%ymm2
       0x00007f13d2e6afba: movslq  %eax,%r10
       0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3
       ;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)

40. ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)
    select * from tpch_1000.lineitem where l_orderkey = 1212000001;

41. Questions?
    Interested? Stop by the Hortonworks booth to learn more.

42. Endnotes
    (1) https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification
    (2) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
    (3) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
    (4) http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3