Slide 1

Slide 1 text

Data Warehousing: for fun and profit.

Slide 2

Slide 2 text

So, what is a Data Warehouse?
● Simply put, a Data Warehouse is a system used for efficient reporting and data analysis (see the query sketch below).
● Data is traditionally uploaded to a Data Warehouse via an Extract, Transform, Load (ETL) process from one or more source systems.
● Data is stored in a denormalised format, usually employing something called a Star Schema.
● Uses columnar storage, along with compression, for efficient data storage and retrieval.
● Can process very large data sets utilising parallelism (MPP, massively parallel processing).
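As a rough illustration of the kind of query a warehouse is built for, here is a minimal sketch of an analytical aggregate. The table and column names (fact_sales, dim_date, sale_amount, and so on) are hypothetical, not from the slides:

    -- Total revenue per month. The engine only has to read the two or
    -- three columns involved, which is where columnar storage and
    -- compression pay off.
    SELECT d.calendar_month,
           SUM(f.sale_amount) AS total_revenue
    FROM   fact_sales f
    JOIN   dim_date d ON d.date_key = f.date_key
    GROUP  BY d.calendar_month
    ORDER  BY d.calendar_month;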

Slide 3

Slide 3 text

Why use a Data Warehouse?
● Critical for effective business intelligence from one or more data sources.
● Designed to be used for reporting and Business Intelligence (e.g. Tableau).
● Leaves the application's transactional database to do its job of processing transactions.
● Designed to be queried, and excels at querying and aggregating large volumes of data.
● Data is transformed into a format optimised for reporting.
● Reports no longer need to be built by developers; anyone with access to a Business Intelligence tool can build a report quickly and easily.

Slide 4

Slide 4 text

We haz the need, the need for speedz!

Slide 5

Slide 5 text

Our Solution - Amazon Redshift
● Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing solution.
● Redshift was forked from PostgreSQL 8 and adds its own functionality on top.
● Redshift is therefore largely PostgreSQL compatible (see the sketch below).
● Redshift is reasonably priced, has no maintenance costs, and is easy to scale.
● It is capable of massively parallel processing.
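A small, hedged illustration of that compatibility: standard PostgreSQL SQL runs unchanged, while Redshift layers its own system views on top (SVV_TABLE_INFO is one such Redshift-specific view; the query itself is just an example, not from the slides):

    -- Plain PostgreSQL syntax works as-is on Redshift.
    SELECT current_database(), version();

    -- Redshift-specific additions, e.g. the SVV_TABLE_INFO system view,
    -- here listing the ten largest tables in the cluster.
    SELECT "table", size, tbl_rows
    FROM   svv_table_info
    ORDER  BY size DESC
    LIMIT  10;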

Slide 6

Slide 6 text

Parallelism in Amazon Redshift
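The slide itself is title-only, so as a rough sketch of how that parallelism is exposed in DDL: Redshift spreads a table's rows across compute nodes by a distribution key and orders them by a sort key, so work can proceed on every node at once. The table below is hypothetical:

    -- Hypothetical fact table laid out for parallel processing.
    -- DISTKEY spreads rows across nodes by customer_key, so joins on
    -- that key stay node-local; SORTKEY lets the planner skip blocks
    -- when filtering on date_key.
    CREATE TABLE fact_sales (
        sale_id      BIGINT,
        customer_key INT,
        date_key     INT,
        sale_amount  DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_key)
    SORTKEY (date_key);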

Slide 7

Slide 7 text

Data Flow Overview
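This slide is also title-only. A common shape for the flow it presumably depicts: data is extracted from the source system, staged in S3, and bulk-loaded into Redshift with the COPY command, which loads in parallel across nodes. The bucket path and IAM role below are placeholders:

    -- Bulk-load an extract staged in S3; Redshift parallelises the load.
    -- Bucket path and IAM role ARN are placeholders.
    COPY fact_sales
    FROM 's3://my-etl-bucket/exports/fact_sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    GZIP;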

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Keeping things up to date
● Obviously data is changing all the time, so the ETL process runs hourly to process updates.
● When an export starts, the start time is logged and matched against each table's update timestamp.
● The next export uses the previous start time and extracts only the data changed since that point.
● Updates are quick due to parallel processing on both sides.
● Using a staging temp table and a transaction allows us to seamlessly update the production tables without any downtime for users (an upsert; see the sketch below).
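A minimal sketch of that staging-table upsert, assuming the hypothetical fact_sales table keyed on sale_id. Redshift has traditionally had no single UPSERT statement, so the usual pattern is delete-then-insert inside one transaction:

    BEGIN;

    -- Stage the hourly extract; the temp table disappears at session end.
    CREATE TEMP TABLE staging (LIKE fact_sales);

    COPY staging
    FROM 's3://my-etl-bucket/exports/fact_sales/incremental/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV;

    -- Remove the rows that are about to be replaced...
    DELETE FROM fact_sales
    USING  staging
    WHERE  fact_sales.sale_id = staging.sale_id;

    -- ...then insert the fresh versions.
    INSERT INTO fact_sales
    SELECT * FROM staging;

    COMMIT;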

Slide 10

Slide 10 text

What is a Star Schema?
● A Star Schema is optimised for analytical reporting.
● It consists of Fact tables and related Dimension tables (see the sketch below).
● Facts are measurable, quantitative data about a business.
● Dimensions are descriptive data relating to the fact data.
● Star schemas are denormalised.
● A disadvantage of the Star Schema is that referential integrity is not enforced, a trade-off made for speed.
● Its name comes from the schema representing a star shape.
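A minimal DDL sketch with hypothetical tables, reusing the fact_sales shape from the parallelism slide. The *_key columns are naming conventions only, since referential integrity is not enforced:

    -- Dimensions: descriptive attributes (the points of the star).
    CREATE TABLE dim_customer (
        customer_key INT,
        name         VARCHAR(100),
        country      VARCHAR(50)
    );

    CREATE TABLE dim_date (
        date_key       INT,
        calendar_date  DATE,
        calendar_month CHAR(7)   -- e.g. '2015-06'
    );

    -- The fact table (fact_sales, sketched earlier) sits at the centre,
    -- pointing at the dimensions via customer_key and date_key.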

Slide 11

Slide 11 text

Dimension vs Fact
Dimensions provide the detail for facts. Facts can be easily aggregated.
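To make the split concrete, a hedged example using the hypothetical tables from the earlier sketches: the fact supplies the numbers being aggregated, while the dimension supplies the descriptive detail they are sliced by:

    -- The fact column (sale_amount) is aggregated; the dimension
    -- column (country) describes and groups the result.
    SELECT c.country,
           COUNT(*)           AS num_sales,
           SUM(f.sale_amount) AS total_revenue
    FROM   fact_sales f
    JOIN   dim_customer c ON c.customer_key = f.customer_key
    GROUP  BY c.country;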

Slide 12

Slide 12 text

Normalisation vs Denormalisation
● Normalised tables are typical of online transaction processing (OLTP) databases such as MySQL or PostgreSQL.
● Denormalised tables are typical of online analytical processing (OLAP) databases such as Amazon Redshift.
● Normalised tables reduce redundancy, maintain data integrity, and are optimised for efficient data storage.
● Denormalised tables, especially when combined with a Star Schema, are optimised for querying (see the sketch below).
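A hedged illustration with hypothetical tables: on the OLTP side, customer and country live in separate normalised tables, while the warehouse dimension flattens them into one row so queries avoid the join:

    -- Normalised (OLTP): country sits in its own table, so fetching a
    -- customer's country requires a join.
    SELECT cu.name, co.country_name
    FROM   customers cu
    JOIN   countries co ON co.country_id = cu.country_id;

    -- Denormalised (OLAP): dim_customer carries country directly,
    -- duplicating the value but keeping queries simple and fast.
    SELECT name, country
    FROM   dim_customer;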

Slide 13

Slide 13 text

Business Intelligence Tools - e.g. Tableau

Slide 14

Slide 14 text

No content