
Amazon Redshift: it's all about data compression

Botify
March 08, 2013


We had the chance at Botify to be among the first to access the limited preview of Amazon Redshift last Christmas.

Now that Redshift is public, it's time to share what we learned about one of its key features: data compression!

More about us on http://www.botify.com/ - http://twitter.com/botify

We're hiring! http://www.botify.com/jobs/

Transcript

  1. We crawl websites of dozens of millions of pages. We extract metrics (load time, http_code, size), we extract data (html tags, schema anchors, linking...)... and we count things!
  2. Petabyte-scale data warehouse. SQL-like querying. Near real-time. Columnar storage. Data loaded via Amazon S3. Made for business intelligence and data analytics. Low cost!
  3. @airbnb and @hapyrus did nice benchmarks: #Redshift can be up to 10x faster than Hive. http://www.hapyrus.com/blog/posts/behind-amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive-slides http://nerds.airbnb.com/redshift-performance-cost
  4. We want to store the results of a crawl of a website containing 300M pages. For each url, we store: its depth from the homepage, its page type, its title anchor, the url fetching timestamp, the http code we encountered, the content size, the load time, the number of outlinks, and its compression style (gzipped or not).
  5. CREATE TABLE urls_data (
         url_id BIGINT, url_host VARCHAR(50), url_path VARCHAR(255),
         url_query_string VARCHAR(500), title VARCHAR(100), page_type CHAR(20),
         depth INT, nb_outlinks INT, ts_crawled BIGINT,
         http_code INT, size_bytes INT, delay_ms INT, gzipped BOOLEAN
     );
  6. For each value in the column, we store the difference with the previous one*. As the difference is generally small, it uses 1 byte instead of 8. Recommended for columns with incremental numbers like primary keys, timestamps...
     * Use DELTA if the absolute difference doesn't exceed 127, or DELTA32K if it doesn't exceed 32K.
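As a minimal sketch of how the two delta variants are declared (the table and column names below are illustrative, not from the deck):

     CREATE TABLE delta_example (
         event_id BIGINT ENCODE DELTA,    -- consecutive ids: each stored difference fits in 1 byte
         event_ts BIGINT ENCODE DELTA32K  -- larger but bounded gaps: each difference fits in 2 bytes
     );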
  7. url_id column:

     Value     Original size (bytes)   Compressed size (bytes)
     10000     4                       4
     10001     4                       1
     10002     4                       1
     10010     4                       1
     10100     4                       1
     etc...    4 x ...                 1 x ...

     Over 300M urls, we're able to compress our 2.23GB url_id column down to 300MB, pretty neat!
  8. Imagine a column that repeats the word A 1000 times, then the word B 200 times, then another value x times, etc. Instead of storing every value, RLE stores in a header the number of times each value is repeated. Useful for values repeated consecutively. About run-length encoding: http://fr.wikipedia.org/wiki/Run-length_encoding
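Run-length encoding only pays off when identical values actually sit next to each other on disk, so a sketch like the following pairs RUNLENGTH with a sort key that groups the runs (the SORTKEY choice is our assumption, not something stated in the deck):

     CREATE TABLE rle_example (
         url_host  VARCHAR(50) ENCODE RUNLENGTH,  -- few distinct hosts, crawled in batches
         http_code INT         ENCODE RUNLENGTH   -- mostly long runs of 200s
     )
     SORTKEY (url_host);  -- sorting keeps identical values consecutive, maximizing run length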
  9. http_code column:

     Value   Original size (bytes)   Compressed size (bytes)
     200     2                       |
     200     2                       | 1-byte header + 2 bytes = 3
     200     2                       |
     200     2                       |
     301     2                       | 1-byte header + 2 bytes = 3
     301     2                       |
     Total   12                      6

     Over 300M urls (more than 90% of them 200 codes), we're able to compress our 600MB http_code column down to 20MB, fantastic!
  10. Useful for columns with a lot of common values (but "randomly" distributed). #Redshift creates, outside of the column, a dictionary of up to 256 distinct values per disk block. Each of those values then takes 1 byte in the column instead of its original size. Works with all columns except VARCHAR and BOOLEAN.
  11. page_type column (CHAR 30):

      Column value   Original size (bytes)   Compressed size (bytes)
      article        30                      1
      article        30                      1
      author         30                      1
      author         30                      1
      video          30                      1
      author         30                      1
      comment        30                      1
      ...            ...                     ...

      Bytedict dictionary:

      Value     Size in bytes
      article   7
      author    6
      video     5
      comment   7
      Total     25

      Over 300M urls, we're able to compress our 6GB page_type column down to 300MB, champagne!
  12. If your column shares few common values but a lot of common words, it might be the best compression method. As Amazon does not communicate about the compression algorithm used, it's difficult to evaluate its impact. For our use case, we will use it for the title column, which shares a lot of common words like the site name, category names... Useful for text columns.
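A minimal sketch of the two text dictionary encodings, which apply only to VARCHAR columns (the table name is illustrative; the per-block dictionary limits in the comments come from the Redshift documentation):

     CREATE TABLE text_example (
         title    VARCHAR(100) ENCODE TEXT32K,  -- per-block dictionary of frequent words, up to 32K bytes
         url_path VARCHAR(255) ENCODE TEXT255   -- smaller per-block dictionary of the first 245 distinct words
     );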
  13. Redshift has no 1-byte integer type, and INT types cannot be declared UNSIGNED. If you need to store INT or BIGINT values but you know that most of them are small, you can use MOSTLY8, MOSTLY16, or MOSTLY32 to store the smallest values in 1, 2, or 4 bytes instead of the original 2, 4, or 8 bytes! Useful for some integer datasets.
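A minimal sketch of how the three MOSTLYn variants map onto storage widths (nb_outlinks and delay_ms are the deck's columns; big_counter is a hypothetical BIGINT column added just to show MOSTLY32):

     CREATE TABLE mostly_example (
         nb_outlinks INT    ENCODE MOSTLY8,   -- most values stored in 1 byte, outliers kept at full width
         delay_ms    INT    ENCODE MOSTLY16,  -- most values stored in 2 bytes
         big_counter BIGINT ENCODE MOSTLY32   -- most values stored in 4 bytes instead of 8
     );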
  14. nb_outlinks column (INT, 4 bytes):

      Value   Original size (bytes)   Compressed size (bytes)
      150     4                       2
      130     4                       2
      20      4                       2
      40000   4                       4 (takes more than 16 bits signed, so same as raw)
      320     4                       2
      250     4                       2
      ...     ...                     ...

      Over 300M urls, we're able to compress our 1.1GB nb_outlinks column down to 600MB, not so bad!
  15. CREATE TABLE urls_data_compressed (
          url_id BIGINT ENCODE DELTA, url_host VARCHAR(50) ENCODE RUNLENGTH,
          url_path VARCHAR(255) ENCODE RAW, url_query_string VARCHAR(500) ENCODE RUNLENGTH,
          title VARCHAR(100) ENCODE TEXT32K, page_type CHAR(20) ENCODE BYTEDICT,
          depth INT ENCODE RUNLENGTH, nb_outlinks INT ENCODE MOSTLY8,
          ts_crawled BIGINT ENCODE DELTA, http_code INT ENCODE RUNLENGTH,
          size_bytes INT ENCODE DELTA32K, delay_ms INT ENCODE MOSTLY16,
          gzipped BOOLEAN ENCODE RUNLENGTH
      );
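Rather than hand-picking every encoding, you can also ask Redshift itself: ANALYZE COMPRESSION reports a suggested encoding per column based on a sample of the data, and a first COPY into an empty table with COMPUPDATE ON applies automatic compression. A sketch (the S3 path and credentials string are placeholders):

     -- Ask Redshift which encoding it would pick for each column
     ANALYZE COMPRESSION urls_data;

     -- Or let the first load into an empty table choose encodings automatically
     COPY urls_data FROM 's3://my-bucket/crawl/'
     CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
     COMPUPDATE ON;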
  16. DELTA

      CREATE TABLE urls_data_compressed (
          url_id BIGINT ENCODE DELTA,
          ..
          ts_crawled BIGINT ENCODE DELTA,
          ..
          size_bytes INT ENCODE DELTA32K,
          ..
      );

      url_id and ts_crawled: consecutive numbers! size_bytes: generally not a big difference between two pages crawled around the same time.
  17. RUNLENGTH

      CREATE TABLE urls_data_compressed (
          url_host VARCHAR(50) ENCODE RUNLENGTH,
          url_query_string VARCHAR(500) ENCODE RUNLENGTH,
          ...
          depth INT ENCODE RUNLENGTH,
          ...
          http_code INT ENCODE RUNLENGTH,
          ...
          gzipped BOOLEAN ENCODE RUNLENGTH
      );

      url_host: not a lot of subdomains, so there's a good chance we crawl in per-subdomain batches. depth: as we crawl in BFS (http://en.wikipedia.org/wiki/Breadth-first_search) and we don't go deeper than 100 levels, depths are stored consecutively. http_code: there's a good chance that most of the pages crawled return 200! gzipped: usually the whole website is gzipped, or none of it is.
  18. TEXT

      CREATE TABLE urls_data_compressed (
          ...
          title VARCHAR(100) ENCODE TEXT32K
      );

      title: a lot of common words (site name, category names...). Surprisingly, for url_query_string we thought TEXT32K would be the best method (because of some common patterns), but RUNLENGTH turned out to be better, because the column contains large empty sections! Compressing long runs of empty values is more efficient here than compressing text.
  19. BYTEDICT

      CREATE TABLE urls_data_compressed (
          ...
          page_type CHAR(20) ENCODE BYTEDICT
      );

      page_type: no more than 200 different types, perfect for BYTEDICT.
  20. MOSTLYn

      CREATE TABLE urls_data_compressed (
          ...
          nb_outlinks INT ENCODE MOSTLY8,
          ...
          delay_ms INT ENCODE MOSTLY16
      );

      nb_outlinks: the average number of outlinks is usually around 150. As some pages, like sitemaps, contain more than 1000 links, we have to store it in an INT data type. delay_ms: pages are usually rendered in less than 1000 ms, but some of them can take more than 30000 ms...
  21. Column            Encoding    Size per row    Total raw size   Total compressed size
      url_id            DELTA       8 bytes         2.23 GB          300 MB
      url_host          RUNLENGTH   ~16 bytes avg   4.47 GB          1 MB
      url_path          TEXT255     ~70 bytes avg   19.55 GB         18 GB
      url_query_string  RUNLENGTH   ~4 bytes avg    1 GB             350 MB
      page_type         BYTEDICT    20 bytes        5.58 GB          300 MB
      title             TEXT32K     ~60 bytes avg   16.7 GB          14.6 GB
      depth             RUNLENGTH   4 bytes         1.11 GB          1 MB
      nb_outlinks       MOSTLY8     4 bytes         1.1 GB           ~310 MB
      ts_crawled        DELTA       8 bytes         2.23 GB          300 MB
      http_code         RUNLENGTH   4 bytes         1.11 GB          20 MB
      size_bytes        DELTA32K    4 bytes         1.11 GB          0.72 GB
      delay_ms          MOSTLY16    4 bytes         1.11 GB          0.6 GB
      gzipped           RUNLENGTH   1 byte          0.27 GB          1 MB
      Total                                         58 GB            36 GB
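These per-column sizes can be checked directly on the cluster: PG_TABLE_DEF shows which encoding each column ended up with, and STV_BLOCKLIST counts the 1 MB blocks each column occupies on disk. A sketch against the urls_data_compressed table:

     -- Which encoding was applied to each column?
     SELECT "column", type, encoding
     FROM pg_table_def
     WHERE tablename = 'urls_data_compressed';

     -- How many 1 MB blocks does each column occupy?
     SELECT b.col, COUNT(*) AS blocks_1mb
     FROM stv_blocklist b
     JOIN stv_tbl_perm p ON b.tbl = p.id
     WHERE p.name = 'urls_data_compressed'
     GROUP BY b.col
     ORDER BY b.col;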
  22. By the way, we're hiring! #devops #bigdata #aws #puppet #redshift

    #scala #python http://www.botify.com/jobs/