
Data Warehousing for Fun and Profit


Learn the advantages of having a Data Warehouse, what the hell a star schema is, and how we designed and built our solution with a custom PHP 7 & Symfony 4 powered ETL (Extract, Transform, Load) process leveraging parallel processing. Our solution at DealTrak runs on the highly scalable Amazon Redshift in our production environment and uses Postgres for development and testing.

Mark Taylor

October 17, 2018

Transcript

  1. So, what is a Data Warehouse?
     • Simply put, a Data Warehouse is a system used for efficient reporting and data analysis.
     • Data is traditionally uploaded to a Data Warehouse via an Extract, Transform, Load (ETL) process from one or more source systems.
     • Data is stored in a denormalised format, usually employing something called a Star Schema.
     • Uses columnar storage, along with compression, for efficient data storage and retrieval.
     • Can process very large data sets using parallelism (MPP, massively parallel processing).
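The Extract, Transform, Load flow above can be sketched as a minimal Python example. The source rows, field names (`deal_id`, `amount_pence`), and transform rule are illustrative assumptions, not the actual DealTrak pipeline:

```python
def extract():
    # Extract: pull raw rows from a source system (hard-coded here
    # for illustration; in practice this would query the OLTP database).
    return [
        {"id": 1, "amount": "1200.50", "status": "APPROVED"},
        {"id": 2, "amount": "845.00", "status": "declined"},
    ]

def transform(rows):
    # Transform: coerce types and normalise values into the
    # denormalised, report-friendly shape the warehouse expects.
    return [
        {"deal_id": r["id"],
         "amount_pence": int(float(r["amount"]) * 100),
         "status": r["status"].upper()}
        for r in rows
    ]

def load(rows, warehouse):
    # Load: append the transformed rows to the warehouse table.
    warehouse.extend(rows)

warehouse_fact_deals = []
load(transform(extract()), warehouse_fact_deals)
print(warehouse_fact_deals[0])
# {'deal_id': 1, 'amount_pence': 120050, 'status': 'APPROVED'}
```

In a real pipeline each stage would run in parallel across source tables, but the three-step shape stays the same.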
  2. Why use a Data Warehouse?
     • Critical for effective business intelligence drawn from one or more data sources.
     • Designed to be used for reporting and Business Intelligence tools (e.g. Tableau).
     • Leaves the application's transactional database to do its job of processing transactions.
     • Designed to be queried, and excels at aggregating large volumes of data.
     • Data is transformed into a format optimised for reporting.
     • Reports no longer need to be built by developers. Anyone with access to a Business Intelligence tool can build a report quickly and easily.
  3. Our Solution - Amazon Redshift
     • Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing solution.
     • Redshift was forked from PostgreSQL 8 and adds its own functionality, so it is largely PostgreSQL-compatible.
     • Redshift is reasonably priced, has no maintenance costs, and is easy to scale.
     • It is capable of massively parallel processing.
  4. Keeping things up to date
     • Obviously data is changing all the time. The ETL process runs hourly to process updates.
     • When an export starts, the start time is logged and matched against each table's update timestamp.
     • The next export uses the previous start time and extracts only the data changed since that point.
     • Updates are quick due to parallel processing on both sides.
     • Using a staging temp table and a transaction allows us to seamlessly update the production tables without any downtime for users (an upsert).
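The staging-table upsert described above can be sketched with SQLite standing in for the warehouse; on Redshift the same pattern would use a TEMP table inside an explicit transaction. All table and column names here are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for a production fact table with two existing rows.
conn.execute("CREATE TABLE fact_deals (deal_id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO fact_deals VALUES (1, 'PENDING'), (2, 'PENDING')")

# Extract only rows changed since the last logged start time into staging:
# one updated row (deal 1) and one brand-new row (deal 3).
conn.execute("CREATE TEMP TABLE staging_deals (deal_id INTEGER, status TEXT)")
conn.execute("INSERT INTO staging_deals VALUES (1, 'APPROVED'), (3, 'PENDING')")

# Delete-then-insert inside a single transaction, so readers either see
# the old rows or the new rows, never a gap.
with conn:
    conn.execute("DELETE FROM fact_deals "
                 "WHERE deal_id IN (SELECT deal_id FROM staging_deals)")
    conn.execute("INSERT INTO fact_deals "
                 "SELECT deal_id, status FROM staging_deals")

print(conn.execute(
    "SELECT deal_id, status FROM fact_deals ORDER BY deal_id").fetchall())
# [(1, 'APPROVED'), (2, 'PENDING'), (3, 'PENDING')]
```

The delete-then-insert pair is the classic way to express an upsert on engines (like Redshift) that lack a native MERGE/UPSERT statement.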
  5. What is a Star Schema?
     • A Star Schema is optimised for analytical reporting.
     • It consists of Fact tables and related Dimension tables.
     • Facts are measurable, quantitative data about a business.
     • Dimensions are descriptive data relating to the fact data.
     • Star schemas are denormalised.
     • A disadvantage of the Star Schema is that referential integrity is not enforced, for the sake of speed.
     • Its name comes from the schema diagram resembling a star shape.
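As a toy illustration of the shape (not the real DealTrak schema), here is a tiny star schema built and queried in SQLite. Note that no FOREIGN KEY constraints are declared, mirroring the referential-integrity trade-off above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: descriptive data. Fact table: measurable data.
CREATE TABLE dim_dealer (dealer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_date   (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE fact_deals (deal_id INTEGER, dealer_id INTEGER, date_id INTEGER,
                         amount INTEGER);

INSERT INTO dim_dealer VALUES (1, 'Leeds Motors', 'North'), (2, 'Kent Cars', 'South');
INSERT INTO dim_date   VALUES (10, '2018-10-01', '2018-10'), (11, '2018-10-02', '2018-10');
INSERT INTO fact_deals VALUES (100, 1, 10, 5000), (101, 1, 11, 7000), (102, 2, 10, 3000);
""")

# A typical analytical query: aggregate a fact, sliced by a dimension.
rows = conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_deals f
    JOIN dim_dealer d ON d.dealer_id = f.dealer_id
    GROUP BY d.region ORDER BY d.region
""").fetchall()
print(rows)  # [('North', 12000), ('South', 3000)]
```

Drawn as a diagram, `fact_deals` sits in the centre with each `dim_*` table radiating off it, which is where the "star" name comes from.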
  6. Normalisation vs Denormalisation
     • Normalised tables are typical of online transaction processing (OLTP) databases such as MySQL or PostgreSQL.
     • Denormalised tables are typical of online analytical processing (OLAP) databases such as Amazon Redshift.
     • Normalised tables reduce redundancy, maintain data integrity, and are optimised for efficient data storage.
     • Denormalised tables, especially when combined with a Star Schema, are optimised for querying.
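The trade-off can be shown in miniature: the same report computed from normalised tables (requiring a join) and from a single denormalised wide table (one scan, with the region repeated on every row). Schemas and data are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised (OLTP style): each value stored once, looked up by key.
CREATE TABLE dealers (dealer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE deals   (deal_id INTEGER PRIMARY KEY, dealer_id INTEGER, amount INTEGER);
INSERT INTO dealers VALUES (1, 'North'), (2, 'South');
INSERT INTO deals VALUES (100, 1, 5000), (101, 1, 7000), (102, 2, 3000);

-- Denormalised (OLAP style): region copied onto every deal row.
CREATE TABLE deals_wide (deal_id INTEGER, region TEXT, amount INTEGER);
INSERT INTO deals_wide
SELECT deal_id, region, amount FROM deals JOIN dealers USING (dealer_id);
""")

normalised = conn.execute(
    "SELECT region, SUM(amount) FROM deals JOIN dealers USING (dealer_id) "
    "GROUP BY region ORDER BY region").fetchall()
denormalised = conn.execute(
    "SELECT region, SUM(amount) FROM deals_wide "
    "GROUP BY region ORDER BY region").fetchall()
print(normalised == denormalised)  # True: same answer, no join at query time
```

The denormalised table spends storage on repeated values to buy simpler, faster analytical queries, which is exactly the bargain a columnar, compressed warehouse is built around.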