Slide 1

Amazon Redshift: it's all about data compression! @botify

Slide 2

What we do at Botify @botify

Slide 3

- We crawl websites: dozens of millions of pages
- We extract metrics: load time, http_code, size
- We extract data: HTML tags, schema anchors, linking...
... and we count things!

Slide 4

Our data model fits particularly well into a SQL schema

Slide 5

Old-fashioned, you said?

Slide 6

Redshift for beginners

Slide 7

- Petabyte-scale data warehouse
- SQL-like querying
- Near real-time
- Columnar storage
- Data loaded via Amazon S3
- Made for business intelligence and data analytics
- Low cost!

Slide 8

Big data ready!

Slide 9

#redshift is the new trendy hashtag!

Slide 10

@airbnb and @hapyrus ran nice benchmarks: #Redshift can be up to 10x faster than Hive.
http://www.hapyrus.com/blog/posts/behind-amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive-slides
http://nerds.airbnb.com/redshift-performance-cost

Slide 11

What about its key feature: data compression?

Slide 12

Simple Use Case

Slide 13

We want to store the results from a crawl of a website containing 300M pages. For each URL, we store:
- Its depth from the homepage
- Its page type
- Its title
- The URL fetching timestamp
- The HTTP code we encountered
- The content size
- The load time
- The number of outlinks
- Its compression (gzipped or not)

Slide 14

CREATE TABLE urls_data (
    url_id           BIGINT,
    url_host         VARCHAR(50),
    url_path         VARCHAR(255),
    url_query_string VARCHAR(500),
    title            VARCHAR(100),
    page_type        CHAR(20),
    depth            INT,
    nb_outlinks      INT,
    ts_crawled       BIGINT,
    http_code        INT,
    size_bytes       INT,
    delay_ms         INT,
    gzipped          BOOLEAN
);
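
Once the table exists, you can check which encoding each column actually has (they all default to RAW here since none was specified). A minimal sketch, assuming the table is in your search_path:

SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'urls_data';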

Slide 15

Choose the best compression method for each column

Slide 16

"Compressing data ain't no different from compressing a car" - Xzibit

Slide 17

TEXT32K, DELTA, MOSTLYn, RUNLENGTH, BYTEDICT... ?

Slide 18

Delta compression

Slide 19

For each value in the column, we store the difference with the previous one*. As the difference is generally small, it uses 1 byte instead of 8 bytes. Recommended for columns with incremental numbers like primary keys, timestamps...
* Use DELTA if the absolute difference doesn't exceed 127, or DELTA32K if it doesn't exceed 32K.

Slide 20

url_id column:

Value | Original size (bytes) | Compressed size (bytes)
10000 | 8                     | 8
10001 | 8                     | 1
10002 | 8                     | 1
10010 | 8                     | 1
10100 | 8                     | 1
etc.  | 8 x ...               | 1 x ...

Over 300M urls, we're able to compress our 2.23GB url_id column down to 300MB, pretty neat!
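
Before picking DELTA over DELTA32K, it can help to check how large consecutive differences actually get. A quick sketch (assuming the raw data is already loaded somewhere queryable, and that url_id roughly reflects the storage order):

SELECT MAX(ABS(diff)) AS max_abs_diff
FROM (
    SELECT url_id - LAG(url_id) OVER (ORDER BY url_id) AS diff
    FROM urls_data
) t;
-- <= 127 -> DELTA fits; <= 32K -> DELTA32K fits; otherwise keep RAW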

Slide 21

Runlength Compression

Slide 22

Imagine that a column repeats the value A 1000 times, then the value B 200 times, then x times etc... Instead of storing every value, RLE stores a header with the number of times the value is repeated. Useful for values repeated consecutively.
About run-length encoding: http://fr.wikipedia.org/wiki/Run-length_encoding
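
To estimate how well RUNLENGTH would do on a column, you can count how many consecutive runs it contains. A rough, hypothetical sketch (assuming rows are stored roughly in crawl order, approximated here by url_id):

SELECT SUM(new_run) AS runs
FROM (
    SELECT CASE WHEN http_code = LAG(http_code) OVER (ORDER BY url_id)
                THEN 0 ELSE 1 END AS new_run
    FROM urls_data
) t;
-- each run costs roughly one header byte plus one stored value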

Slide 23

http_code column:

Value | Original size (bytes) | Compressed size (bytes)
200   | 2                     |
200   | 2                     |
200   | 2                     |
200   | 2                     | 1 byte for header + 2 bytes = 3 (for the run of four 200s)
301   | 2                     |
301   | 2                     | 1 byte for header + 2 bytes = 3 (for the run of two 301s)
Total | 12                    | 6

Over 300M urls, we're able to compress our 600MB http_code column down to 20MB (with more than 90% of 200 codes), fantastic!

Slide 24

Bytedict Compression

Slide 25

Useful for columns with a lot of common values (but "randomly" distributed). #Redshift creates, alongside the column, a dictionary containing the 256 most common values appearing in the column (for each disk block). Then each occurrence of one of those values takes 1 byte instead of its original size. Works with all columns except VARCHAR and BOOLEAN.
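
A quick way to see whether a column is a good BYTEDICT candidate is to count its distinct values (the dictionary only holds 256 entries per block). A minimal sketch:

SELECT COUNT(DISTINCT page_type) AS distinct_values
FROM urls_data;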

Slide 26

page_type column:

Value   | Original size (bytes, CHAR(20)) | Compressed size (bytes)
article | 20                              | 1
article | 20                              | 1
author  | 20                              | 1
author  | 20                              | 1
video   | 20                              | 1
author  | 20                              | 1
comment | 20                              | 1
...     | ...                             | ...

Bytedict (dictionary of column values):

Value   | Size in bytes
article | 7
author  | 6
video   | 5
comment | 7
Total   | 25

Over 300M urls, we're able to compress our 6GB page_type column down to 300MB, champagne!

Slide 27

Text Compression

Slide 28

If your column shares few common values but a lot of common words, this might be the best compression method. As Amazon does not document the compression algorithm used, it's difficult to evaluate its impact. For our use case, we'll use it for the title column, which shares a lot of common words like the site name, category names... Useful for text columns.

Slide 29

MostlyN Compression

Slide 30

Redshift does not let you create 1-byte integer columns, or declare INTx datatypes as UNSIGNED. If you need to store INT or BIGINT values but you know that most of them are not that big, you can use MOSTLY8, MOSTLY16 and MOSTLY32 to store the smallest values in 1, 2 or 4 bytes instead of the initial 2, 4 or 8 bytes! Useful for some integer datasets.
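
To decide between MOSTLY8 and MOSTLY16 for a column, you can check what fraction of its values fits in each width. A hypothetical sketch for nb_outlinks:

SELECT
    SUM(CASE WHEN nb_outlinks BETWEEN -128   AND 127   THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS fits_mostly8,
    SUM(CASE WHEN nb_outlinks BETWEEN -32768 AND 32767 THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS fits_mostly16
FROM urls_data;
-- values outside the chosen range are simply stored raw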

Slide 31

nb_outlinks column:

Value | Original size (bytes, INT = 4 bytes) | Compressed size (bytes)
150   | 4                                    | 2
130   | 4                                    | 2
20    | 4                                    | 2
40000 | 4                                    | 4 (takes more than 16 bits on a signed column, so same as raw)
320   | 4                                    | 2
250   | 4                                    | 2
...   | ...                                  | ...

Over 300M urls, we're able to compress our 1.1GB nb_outlinks column down to 600MB, not so bad!

Slide 32

To summarize...

Slide 33

CREATE TABLE urls_data_compressed (
    url_id           BIGINT ENCODE DELTA,
    url_host         VARCHAR(50) ENCODE RUNLENGTH,
    url_path         VARCHAR(255) ENCODE RAW,
    url_query_string VARCHAR(500) ENCODE RUNLENGTH,
    title            VARCHAR(100) ENCODE TEXT32K,
    page_type        CHAR(20) ENCODE BYTEDICT,
    depth            INT ENCODE RUNLENGTH,
    nb_outlinks      INT ENCODE MOSTLY8,
    ts_crawled       BIGINT ENCODE DELTA,
    http_code        INT ENCODE RUNLENGTH,
    size_bytes       INT ENCODE DELTA32K,
    delay_ms         INT ENCODE MOSTLY16,
    gzipped          BOOLEAN ENCODE RUNLENGTH
);

Slide 34

DELTA

CREATE TABLE urls_data_compressed (
    url_id     BIGINT ENCODE DELTA,
    ...
    ts_crawled BIGINT ENCODE DELTA,
    ...
    size_bytes INT ENCODE DELTA32K,
    ...
);

url_id and ts_crawled: consecutive numbers!
size_bytes: generally not a big difference between two pages crawled around the same time.

Slide 35

RUNLENGTH

CREATE TABLE urls_data_compressed (
    url_host         VARCHAR(50) ENCODE RUNLENGTH,
    url_query_string VARCHAR(500) ENCODE RUNLENGTH,
    ...
    depth            INT ENCODE RUNLENGTH,
    ...
    http_code        INT ENCODE RUNLENGTH,
    ...
    gzipped          BOOLEAN ENCODE RUNLENGTH
);

url_host: not a lot of subdomains, so there's a good chance we're crawling in batches per subdomain.
depth: as we crawl in BFS (http://en.wikipedia.org/wiki/Breadth-first_search) and we don't have more than 100 depth levels, they're sorted consecutively.
http_code: there's a good chance that most of the pages crawled are 200s!
gzipped: usually either the whole website is gzipped or none of it is.

Slide 36

TEXT

CREATE TABLE urls_data_compressed (
    ...
    title VARCHAR(100) ENCODE TEXT32K,
    ...
);

title: a lot of common words: site name, category names...
Surprisingly, for url_query_string we thought TEXT32K was the best method (because of some common patterns), but RUNLENGTH turned out to be the best, because there were large empty sections in the column! Compressing large runs of empty values is here more efficient than compressing text!

Slide 37

BYTEDICT

CREATE TABLE urls_data_compressed (
    ...
    page_type CHAR(20) ENCODE BYTEDICT,
    ...
);

page_type: not more than 200 different types, perfect for BYTEDICT.

Slide 38

MOSTLYn

CREATE TABLE urls_data_compressed (
    ...
    nb_outlinks INT ENCODE MOSTLY8,
    ...
    delay_ms    INT ENCODE MOSTLY16,
    ...
);

nb_outlinks: the average number of outlinks is usually around 150. As we have some pages like sitemaps containing more than 1000 links, we have to store it in an INT datatype.
delay_ms: pages are usually rendered in less than 1000 ms, but some of them can take more than 30000 ms...

Slide 39

Column           | Encoding  | Size per row  | Total raw size | Total compressed size
url_id           | DELTA     | 8 bytes       | 2.23 GB        | 300 MB
url_host         | RUNLENGTH | ~16 bytes avg | 4.47 GB        | 1 MB
url_path         | TEXT255   | ~70 bytes avg | 19.55 GB       | 18 GB
url_query_string | RUNLENGTH | ~4 bytes avg  | 1 GB           | 350 MB
page_type        | BYTEDICT  | 20 bytes      | 5.58 GB        | 300 MB
title            | TEXT32K   | ~60 bytes avg | 16.7 GB        | 14.6 GB
depth            | RUNLENGTH | 4 bytes       | 1.11 GB        | 1 MB
nb_outlinks      | MOSTLY8   | 4 bytes       | 1.1 GB         | ~310 MB
ts_crawled       | DELTA     | 8 bytes       | 2.23 GB        | 300 MB
http_code        | RUNLENGTH | 4 bytes       | 1.11 GB        | 20 MB
size_bytes       | DELTA32K  | 4 bytes       | 1.11 GB        | 0.72 GB
delay_ms         | MOSTLY16  | 4 bytes       | 1.11 GB        | 0.6 GB
gzipped          | RUNLENGTH | 1 byte        | 0.27 GB        | 1 MB
Total            |           |               | 58 GB          | 36 GB
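
One way to double-check figures like these on a live cluster is to count 1 MB blocks per column in the STV system tables. A hedged sketch (column numbers include a few hidden system columns, and the STV tables require appropriate privileges):

SELECT b.col, COUNT(*) AS mb_used
FROM stv_blocklist b
JOIN stv_tbl_perm p ON b.tbl = p.id
WHERE p.name = 'urls_data_compressed'
GROUP BY b.col
ORDER BY b.col;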

Slide 40

Still don't know which encoding to use?

Slide 41

No problem!

Slide 42

Redshift does it for you!

Slide 43

Challenge accepted !

Slide 44

COPY urls_data FROM 's3://mybucket/file.txt'
CREDENTIALS 'aws_access_key_id=AWS_ACCESS_KEY;aws_secret_access_key=AWS_SECRET_KEY'
DELIMITER '\t';

Load your data from S3 and let Redshift automatically apply the best compression.
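
Note that this automatic compression only kicks in when the target table is empty; the COMPUPDATE option controls the behavior explicitly. A sketch reusing the same hypothetical bucket and keys:

COPY urls_data FROM 's3://mybucket/file.txt'
CREDENTIALS 'aws_access_key_id=AWS_ACCESS_KEY;aws_secret_access_key=AWS_SECRET_KEY'
DELIMITER '\t'
COMPUPDATE ON;  -- force encoding analysis; COMPUPDATE OFF skips it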

Slide 45

ANALYZE COMPRESSION urls_data;

Or run ANALYZE COMPRESSION to get Redshift's suggestions.
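
You can also restrict the analysis to specific columns and a sample size; the column list and row count below are purely illustrative:

ANALYZE COMPRESSION urls_data (title, url_query_string) COMPROWS 1000000;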

Slide 46

Thanks!

Slide 47

By the way, we're hiring! #devops #bigdata #aws #puppet #redshift #scala #python http://www.botify.com/jobs/

Slide 48

Crafted by @mpelmann & @oleiade. Follow us @botify... and be one of our early adopters!