Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data on Disk: When Data Storage Matters

Becky Sweger
November 05, 2018
37

Data on Disk: When Data Storage Matters

An overview of columnar data storage, why it matters, and some Redshift-specific optimization strategies.

Becky Sweger

November 05, 2018
Tweet

Transcript

  1. Data on Disk Becky Sweger Sr. Data Engineer Trello @

    Atlassian When data storage matters
  2. Row-Based Storage Event ID Type Date OS Build A29 pageview

    2018-10-15 macos 1028 3TY pageview 2018-10-15 iOS 1028 V8Q join card 2011-10-01 win 1018 6TT leave team 2013-10-02 macos 1029 90I pageview 2018-10-03 win 1030 BQ7 join card 2018-10-15 iOS 1028 GVG join team 2017-10-13 iOS 1028 AX2 pageview 2018-10-04 win 1019
  3. Event ID Type Date OS Build A29 pageview 2018-10-15 macos

    1028 3TY pageview 2018-10-15 iOS 1028 BQ7 join card 2018-10-15 iOS 1028 SELECT type, COUNT(*) as total FROM events WHERE date = '2018-10-15' GROUP BY type
  4. Date 2018-10-15 2018-10-15 2018-10-15 Type pageview pageview join card Column-Based

    Storage Event ID A29 3TY V8Q 6TT 90I BQ7 GVG AX2 Type pageview pageview join card leave team pageview join card join team pageview Date 2018-10-15 2018-10-15 2011-10-01 2013-10-02 2018-10-03 2018-10-15 2017-10-13 2018-10-04 OS macos iOS win macos win iOS iOS win Build 1028 1028 1018 1029 1030 1028 1028 1019
  5. SELECT type, COUNT(*) as total FROM events WHERE date =

    '2018-10-15' GROUP BY type Type pageview pageview join card Date 2018-10-15 2018-10-15 2018-10-15
  6. Top Three Culprits 1. Using 80% of the available 64

    TB 2. High percentage of unsorted data 3. Sporadic compression
  7. Fix One: More Nodes SQL Clients leader 16 TB 16

    TB 16 TB 16 TB 16 TB 16 TB 16 TB 16 TB
  8. Date 2018-10-15 2018-10-15 2018-10-15 Type pageview pageview join card Fix

    Two: Sort & Compress Event ID A29 3TY V8Q 6TT 90I BQ7 GVG AX2 Type pageview pageview join card leave team pageview join card join team pageview Date 2018-10-15 2018-10-15 2011-10-01 2013-10-02 2018-10-03 2018-10-15 2017-10-13 2018-10-04 OS macos iOS win macos win iOS iOS win Build 1028 1028 1018 1029 1030 1028 1028 1019
  9. Date 2018-10-15 2018-10-15 2018-10-15 Type pageview pageview join card Fix

    Two: Sort & Compress Event ID V8Q 6TT GVG 90I AX2 BQ7 3TY A29 Type join card leave join team pageview pageview join card pageview pageview Date 2011-10-01 2013-10-02 2017-10-13 2018-10-03 2018-10-04 2018-10-15 2018-10-15 2018-10-15 OS win macos iOS win win iOS iOS macos Build 1029 1029 1028 1030 1019 1028 1028 1028
  10. Sorting: What Went Wrong? CREATE TABLE events ( event_id CHAR(36),

    type VARCHAR(128), date TIMESTAMP, [a bunch of other columns]) DISTSTYLE KEY DISTKEY (event_id) SORTKEY (date);
  11. – Amazon Redshift Documentation “If each batch of new data

    follows the existing rows in your table, your data is properly stored in sort order…”
  12. – Amazon Redshift Documentation “…new rows are stored in a

    separate unsorted region on disk, then sorted on demand for queries as required”
  13. Original Events Table New Events Table Attempt One: Deep Copy

    1. Sort original data in an intermediate temp table 2. Temp table spills to disk if it exceeds available memory 3. Temp table is not compressed
  14. Original Events Table New Events Table S3 Attempt Two: Export

    1. Dump event data to a series of files on AWS S3 2. "COPY" those files into a new events table 3. COPY operation sorts data when the target table is empty
  15. Redshift: April 2018 1. Data size reduced by 62% 2.

    Data is 100% sorted (and kept that way) 3. Happier data scientists and analysts
  16. Queries as a Service hosted query service (AWS Athena, Google

    BigQuery) distributed query engine (Presto, BigQuery) metastore (Hive, AWS Glue Catalog) structured data files (AWS S3, Google Cloud Storage)