
Data Warehousing for Fun and Profit


Learn the advantages of having a Data Warehouse, what the hell a star schema is, and how we designed and built our solution with a custom PHP 7 & Symfony 4 powered ETL (Extract, Transform, Load) process leveraging parallel processing. Our solution at DealTrak runs on the highly scalable Amazon Redshift in our production environment and uses Postgres for development and testing.

Mark Taylor

October 17, 2018

Transcript

  1. So, what is a Data Warehouse?
     • Simply put, a Data Warehouse is a system used for efficient reporting and data analysis.
     • Data is traditionally uploaded to a Data Warehouse via an Extract, Transform, Load (ETL) process from one or more source systems.
     • Data is stored in a denormalised format, usually employing something called a Star Schema.
     • Uses columnar storage, along with compression, for efficient data storage and retrieval.
     • Can process very large data sets using parallelism (MPP, massively parallel processing).
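The Extract, Transform, Load flow above can be sketched as a minimal Python example. The source rows, field names (`deal_id`, `amount_pence`), and transform rule are illustrative assumptions, not the actual DealTrak pipeline:

```python
def extract():
    # Extract: pull raw rows from a source system (hard-coded here
    # for illustration; in practice this would query the OLTP database).
    return [
        {"id": 1, "amount": "1200.50", "status": "APPROVED"},
        {"id": 2, "amount": "845.00", "status": "declined"},
    ]

def transform(rows):
    # Transform: coerce types and normalise values into the
    # denormalised, report-friendly shape the warehouse expects.
    return [
        {"deal_id": r["id"],
         "amount_pence": int(float(r["amount"]) * 100),
         "status": r["status"].upper()}
        for r in rows
    ]

def load(rows, warehouse):
    # Load: append the transformed rows to the warehouse table.
    warehouse.extend(rows)

warehouse_fact_deals = []
load(transform(extract()), warehouse_fact_deals)
print(warehouse_fact_deals[0])
# {'deal_id': 1, 'amount_pence': 120050, 'status': 'APPROVED'}
```

In a real pipeline each stage would run in parallel across source tables, but the three-step shape stays the same.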
  2. Why use a Data Warehouse?
     • Critical for effective business intelligence drawn from one or more data sources.
     • Designed to be used for reporting and Business Intelligence tools (e.g. Tableau).
     • Leaves the application's transactional database to do its job of processing transactions.
     • Designed to be queried, and excels at aggregating large volumes of data.
     • Data is transformed into a format optimised for reporting.
     • Reports no longer need to be built by developers. Anyone with access to a Business Intelligence tool can build a report quickly and easily.
  3. Our Solution - Amazon Redshift
     • Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing solution.
     • Redshift was forked from PostgreSQL 8 and adds its own functionality, so it is largely PostgreSQL-compatible.
     • Redshift is reasonably priced, has no maintenance costs, and is easy to scale.
     • It is capable of massively parallel processing.
  4. Keeping things up to date
     • Obviously data is changing all the time. The ETL process runs hourly to process updates.
     • When an export starts, the start time is logged and matched against each table's update timestamp.
     • The next export uses the previous start time and extracts only the data changed since that point.
     • Updates are quick due to parallel processing on both sides.
     • Using a staging temp table and a transaction allows us to seamlessly update the production tables without any downtime for users (an upsert).
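The staging-table upsert described above can be sketched with SQLite standing in for the warehouse; on Redshift the same pattern would use a TEMP table inside an explicit transaction. All table and column names here are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for a production fact table with two existing rows.
conn.execute("CREATE TABLE fact_deals (deal_id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO fact_deals VALUES (1, 'PENDING'), (2, 'PENDING')")

# Extract only rows changed since the last logged start time into staging:
# one updated row (deal 1) and one brand-new row (deal 3).
conn.execute("CREATE TEMP TABLE staging_deals (deal_id INTEGER, status TEXT)")
conn.execute("INSERT INTO staging_deals VALUES (1, 'APPROVED'), (3, 'PENDING')")

# Delete-then-insert inside a single transaction, so readers either see
# the old rows or the new rows, never a gap.
with conn:
    conn.execute("DELETE FROM fact_deals "
                 "WHERE deal_id IN (SELECT deal_id FROM staging_deals)")
    conn.execute("INSERT INTO fact_deals "
                 "SELECT deal_id, status FROM staging_deals")

print(conn.execute(
    "SELECT deal_id, status FROM fact_deals ORDER BY deal_id").fetchall())
# [(1, 'APPROVED'), (2, 'PENDING'), (3, 'PENDING')]
```

The delete-then-insert pair is the classic way to express an upsert on engines (like Redshift) that lack a native MERGE/UPSERT statement.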
  5. What is a Star Schema?
     • A Star Schema is optimised for analytical reporting.
     • It consists of Fact tables and related Dimension tables.
     • Facts are measurable, quantitative data about a business.
     • Dimensions are descriptive data relating to the fact data.
     • Star schemas are denormalised.
     • A disadvantage of the Star Schema is that referential integrity is not enforced, for the sake of speed.
     • Its name comes from the schema diagram resembling a star shape.
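As a toy illustration of the shape (not the real DealTrak schema), here is a tiny star schema built and queried in SQLite. Note that no FOREIGN KEY constraints are declared, mirroring the referential-integrity trade-off above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: descriptive data. Fact table: measurable data.
CREATE TABLE dim_dealer (dealer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_date   (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE fact_deals (deal_id INTEGER, dealer_id INTEGER, date_id INTEGER,
                         amount INTEGER);

INSERT INTO dim_dealer VALUES (1, 'Leeds Motors', 'North'), (2, 'Kent Cars', 'South');
INSERT INTO dim_date   VALUES (10, '2018-10-01', '2018-10'), (11, '2018-10-02', '2018-10');
INSERT INTO fact_deals VALUES (100, 1, 10, 5000), (101, 1, 11, 7000), (102, 2, 10, 3000);
""")

# A typical analytical query: aggregate a fact, sliced by a dimension.
rows = conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_deals f
    JOIN dim_dealer d ON d.dealer_id = f.dealer_id
    GROUP BY d.region ORDER BY d.region
""").fetchall()
print(rows)  # [('North', 12000), ('South', 3000)]
```

Drawn as a diagram, `fact_deals` sits in the centre with each `dim_*` table radiating off it, which is where the "star" name comes from.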
  6. Normalisation vs Denormalisation
     • Normalised tables are typical of online transaction processing (OLTP) databases such as MySQL or PostgreSQL.
     • Denormalised tables are typical of online analytical processing (OLAP) databases such as Amazon Redshift.
     • Normalised tables reduce redundancy, maintain data integrity, and are optimised for efficient data storage.
     • Denormalised tables, especially when combined with a Star Schema, are optimised for querying.
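The trade-off can be shown in miniature: the same report computed from normalised tables (requiring a join) and from a single denormalised wide table (one scan, with the region repeated on every row). Schemas and data are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised (OLTP style): each value stored once, looked up by key.
CREATE TABLE dealers (dealer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE deals   (deal_id INTEGER PRIMARY KEY, dealer_id INTEGER, amount INTEGER);
INSERT INTO dealers VALUES (1, 'North'), (2, 'South');
INSERT INTO deals VALUES (100, 1, 5000), (101, 1, 7000), (102, 2, 3000);

-- Denormalised (OLAP style): region copied onto every deal row.
CREATE TABLE deals_wide (deal_id INTEGER, region TEXT, amount INTEGER);
INSERT INTO deals_wide
SELECT deal_id, region, amount FROM deals JOIN dealers USING (dealer_id);
""")

normalised = conn.execute(
    "SELECT region, SUM(amount) FROM deals JOIN dealers USING (dealer_id) "
    "GROUP BY region ORDER BY region").fetchall()
denormalised = conn.execute(
    "SELECT region, SUM(amount) FROM deals_wide "
    "GROUP BY region ORDER BY region").fetchall()
print(normalised == denormalised)  # True: same answer, no join at query time
```

The denormalised table spends storage on repeated values to buy simpler, faster analytical queries, which is exactly the bargain a columnar, compressed warehouse is built around.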