
Amazon Redshift: it's all about data compression

Botify
March 08, 2013


We had the chance at Botify to be among the first to access the limited preview of Amazon Redshift last Christmas.

Now that Redshift is public, it's time to share what we learned about one of its key features: data compression!

More about us on http://www.botify.com/ - http://twitter.com/botify

We're hiring! http://www.botify.com/jobs/

Transcript

  1. We crawl websites of dozens of millions of pages. We extract metrics (load time, http_code, size), we extract data (html tags, schema anchors, linking...)... and we count things!
  2. Petabyte-scale data warehouse. SQL-like querying. Near real-time. Columnar storage. Data loaded via Amazon S3. Made for business intelligence and data analytics. Low cost!
  3. @airbnb and @hapyrus did nice benchmarks: #Redshift can be up to 10x faster than Hive. http://www.hapyrus.com/blog/posts/behind-amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive-slides http://nerds.airbnb.com/redshift-performance-cost
  4. We want to store the results of a crawl of a website containing 300M pages. For each url, we store: its depth from the homepage, its page type, its title anchor, the url fetching timestamp, the http code we encountered, the content size, the load time, the number of outlinks, and its compression style (gzipped or not).
  5. CREATE TABLE urls_data (
         url_id BIGINT, url_host VARCHAR(50), url_path VARCHAR(255),
         url_query_string VARCHAR(500), title VARCHAR(100), page_type CHAR(20),
         depth INT, nb_outlinks INT, ts_crawled BIGINT,
         http_code INT, size_bytes INT, delay_ms INT, gzipped BOOLEAN
     );
  6. For each value in the column, we store the difference with the previous one*. As the difference is generally small, it uses 1 byte instead of 8. Recommended for columns with incremental numbers like primary keys, timestamps...
     * Use DELTA if the absolute difference doesn't exceed 127, or DELTA32K if it doesn't exceed 32K.
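As a minimal sketch of how the two delta variants are declared (the table and column names below are illustrative, not from the deck):

     CREATE TABLE delta_example (
         event_id BIGINT ENCODE DELTA,    -- consecutive ids: each stored difference fits in 1 byte
         event_ts BIGINT ENCODE DELTA32K  -- larger but bounded gaps: each difference fits in 2 bytes
     );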
  7. url_id column:

     Value     Original size (bytes)   Compressed size (bytes)
     10000     4                       4
     10001     4                       1
     10002     4                       1
     10010     4                       1
     10100     4                       1
     etc...    4 x ...                 1 x ...

     Over 300M urls, we're able to compress our 2.23GB url_id column down to 300MB, pretty neat!
  8. Imagine a column that repeats the word A 1000 times, then the word B 200 times, then another value x times, etc. Instead of storing every value, RLE stores in a header the number of times each value is repeated. Useful for values repeated consecutively. About run-length encoding: http://fr.wikipedia.org/wiki/Run-length_encoding
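Run-length encoding only pays off when identical values actually sit next to each other on disk, so a sketch like the following pairs RUNLENGTH with a sort key that groups the runs (the SORTKEY choice is our assumption, not something stated in the deck):

     CREATE TABLE rle_example (
         url_host  VARCHAR(50) ENCODE RUNLENGTH,  -- few distinct hosts, crawled in batches
         http_code INT         ENCODE RUNLENGTH   -- mostly long runs of 200s
     )
     SORTKEY (url_host);  -- sorting keeps identical values consecutive, maximizing run length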
  9. http_code column:

     Value   Original size (bytes)   Compressed size (bytes)
     200     2                       |
     200     2                       | 1-byte header + 2 bytes = 3
     200     2                       |
     200     2                       |
     301     2                       | 1-byte header + 2 bytes = 3
     301     2                       |
     Total   12                      6

     Over 300M urls (more than 90% of them 200 codes), we're able to compress our 600MB http_code column down to 20MB, fantastic!
  10. Useful for columns with a lot of common values (but "randomly" distributed). #Redshift creates, outside of the column, a dictionary of up to 256 distinct values per disk block. Each of those values then takes 1 byte in the column instead of its original size. Works with all columns except VARCHAR and BOOLEAN.
  11. page_type column (CHAR 30):

      Column value   Original size (bytes)   Compressed size (bytes)
      article        30                      1
      article        30                      1
      author         30                      1
      author         30                      1
      video          30                      1
      author         30                      1
      comment        30                      1
      ...            ...                     ...

      Bytedict dictionary:

      Value     Size in bytes
      article   7
      author    6
      video     5
      comment   7
      Total     25

      Over 300M urls, we're able to compress our 6GB page_type column down to 300MB, champagne!
  12. If your column shares few common values but a lot of common words, it might be the best compression method. As Amazon does not communicate about the compression algorithm used, it's difficult to evaluate its impact. For our use case, we will use it for the title column, which shares a lot of common words like the site name, category names... Useful for text columns.
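A minimal sketch of the two text dictionary encodings, which apply only to VARCHAR columns (the table name is illustrative; the per-block dictionary limits in the comments come from the Redshift documentation):

     CREATE TABLE text_example (
         title    VARCHAR(100) ENCODE TEXT32K,  -- per-block dictionary of frequent words, up to 32K bytes
         url_path VARCHAR(255) ENCODE TEXT255   -- smaller per-block dictionary of the first 245 distinct words
     );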
  13. Redshift has no 1-byte integer type, and INT types cannot be declared UNSIGNED. If you need to store INT or BIGINT values but you know that most of them are small, you can use MOSTLY8, MOSTLY16, or MOSTLY32 to store the smallest values in 1, 2, or 4 bytes instead of the original 2, 4, or 8 bytes! Useful for some integer datasets.
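A minimal sketch of how the three MOSTLYn variants map onto storage widths (nb_outlinks and delay_ms are the deck's columns; big_counter is a hypothetical BIGINT column added just to show MOSTLY32):

     CREATE TABLE mostly_example (
         nb_outlinks INT    ENCODE MOSTLY8,   -- most values stored in 1 byte, outliers kept at full width
         delay_ms    INT    ENCODE MOSTLY16,  -- most values stored in 2 bytes
         big_counter BIGINT ENCODE MOSTLY32   -- most values stored in 4 bytes instead of 8
     );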
  14. nb_outlinks column (INT, 4 bytes):

      Value   Original size (bytes)   Compressed size (bytes)
      150     4                       2
      130     4                       2
      20      4                       2
      40000   4                       4 (takes more than 16 bits signed, so same as raw)
      320     4                       2
      250     4                       2
      ...     ...                     ...

      Over 300M urls, we're able to compress our 1.1GB nb_outlinks column down to 600MB, not so bad!
  15. CREATE TABLE urls_data_compressed (
          url_id BIGINT ENCODE DELTA, url_host VARCHAR(50) ENCODE RUNLENGTH,
          url_path VARCHAR(255) ENCODE RAW, url_query_string VARCHAR(500) ENCODE RUNLENGTH,
          title VARCHAR(100) ENCODE TEXT32K, page_type CHAR(20) ENCODE BYTEDICT,
          depth INT ENCODE RUNLENGTH, nb_outlinks INT ENCODE MOSTLY8,
          ts_crawled BIGINT ENCODE DELTA, http_code INT ENCODE RUNLENGTH,
          size_bytes INT ENCODE DELTA32K, delay_ms INT ENCODE MOSTLY16,
          gzipped BOOLEAN ENCODE RUNLENGTH
      );
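Rather than hand-picking every encoding, you can also ask Redshift itself: ANALYZE COMPRESSION reports a suggested encoding per column based on a sample of the data, and a first COPY into an empty table with COMPUPDATE ON applies automatic compression. A sketch (the S3 path and credentials string are placeholders):

     -- Ask Redshift which encoding it would pick for each column
     ANALYZE COMPRESSION urls_data;

     -- Or let the first load into an empty table choose encodings automatically
     COPY urls_data FROM 's3://my-bucket/crawl/'
     CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
     COMPUPDATE ON;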
  16. DELTA

      CREATE TABLE urls_data_compressed (
          url_id BIGINT ENCODE DELTA,
          ..
          ts_crawled BIGINT ENCODE DELTA,
          ..
          size_bytes INT ENCODE DELTA32K,
          ..
      );

      url_id and ts_crawled: consecutive numbers! size_bytes: generally not a big difference between two pages crawled around the same time.
  17. RUNLENGTH

      CREATE TABLE urls_data_compressed (
          url_host VARCHAR(50) ENCODE RUNLENGTH,
          url_query_string VARCHAR(500) ENCODE RUNLENGTH,
          ...
          depth INT ENCODE RUNLENGTH,
          ...
          http_code INT ENCODE RUNLENGTH,
          ...
          gzipped BOOLEAN ENCODE RUNLENGTH
      );

      url_host: not a lot of subdomains, so there's a good chance we crawl in per-subdomain batches. depth: as we crawl in BFS (http://en.wikipedia.org/wiki/Breadth-first_search) and we don't go deeper than 100 levels, depths are stored consecutively. http_code: there's a good chance that most of the pages crawled return 200! gzipped: usually the whole website is gzipped, or none of it is.
  18. TEXT

      CREATE TABLE urls_data_compressed (
          ...
          title VARCHAR(100) ENCODE TEXT32K
      );

      title: a lot of common words (site name, category names...). Surprisingly, for url_query_string we thought TEXT32K would be the best method (because of some common patterns), but RUNLENGTH turned out to be better, because the column contains large empty sections! Compressing long runs of empty values is more efficient here than compressing text.
  19. BYTEDICT

      CREATE TABLE urls_data_compressed (
          ...
          page_type CHAR(20) ENCODE BYTEDICT
      );

      page_type: no more than 200 different types, perfect for BYTEDICT.
  20. MOSTLYn

      CREATE TABLE urls_data_compressed (
          ...
          nb_outlinks INT ENCODE MOSTLY8,
          ...
          delay_ms INT ENCODE MOSTLY16
      );

      nb_outlinks: the average number of outlinks is usually around 150. As some pages, like sitemaps, contain more than 1000 links, we have to store it in an INT data type. delay_ms: pages are usually rendered in less than 1000 ms, but some of them can take more than 30000 ms...
  21. Column            Encoding    Size per row    Total raw size   Total compressed size
      url_id            DELTA       8 bytes         2.23 GB          300 MB
      url_host          RUNLENGTH   ~16 bytes avg   4.47 GB          1 MB
      url_path          TEXT255     ~70 bytes avg   19.55 GB         18 GB
      url_query_string  RUNLENGTH   ~4 bytes avg    1 GB             350 MB
      page_type         BYTEDICT    20 bytes        5.58 GB          300 MB
      title             TEXT32K     ~60 bytes avg   16.7 GB          14.6 GB
      depth             RUNLENGTH   4 bytes         1.11 GB          1 MB
      nb_outlinks       MOSTLY8     4 bytes         1.1 GB           ~310 MB
      ts_crawled        DELTA       8 bytes         2.23 GB          300 MB
      http_code         RUNLENGTH   4 bytes         1.11 GB          20 MB
      size_bytes        DELTA32K    4 bytes         1.11 GB          0.72 GB
      delay_ms          MOSTLY16    4 bytes         1.11 GB          0.6 GB
      gzipped           RUNLENGTH   1 byte          0.27 GB          1 MB
      Total                                         58 GB            36 GB
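These per-column sizes can be checked directly on the cluster: PG_TABLE_DEF shows which encoding each column ended up with, and STV_BLOCKLIST counts the 1 MB blocks each column occupies on disk. A sketch against the urls_data_compressed table:

     -- Which encoding was applied to each column?
     SELECT "column", type, encoding
     FROM pg_table_def
     WHERE tablename = 'urls_data_compressed';

     -- How many 1 MB blocks does each column occupy?
     SELECT b.col, COUNT(*) AS blocks_1mb
     FROM stv_blocklist b
     JOIN stv_tbl_perm p ON b.tbl = p.id
     WHERE p.name = 'urls_data_compressed'
     GROUP BY b.col
     ORDER BY b.col;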
  22. By the way, we're hiring! #devops #bigdata #aws #puppet #redshift

    #scala #python http://www.botify.com/jobs/