
Data Warehouses: Analyzing a Lot of Data Fast - Papp Tamás (Ustream)

When I tell someone that I work with data warehouses, I always get the same question: "OK, but how is that different from a 'traditional' database?" That is exactly what we go through in this talk: what the main differences are, why users can query large amounts of data quickly, and which design methods we use for data warehouses.

Budapest Database Meetup

March 07, 2013

Transcript

  1. What is a DataWarehouse?
     ▪ Collection of diverse data
       ▪ subject oriented
       ▪ aimed at business users
       ▪ often a copy of operational data
       ▪ with value-added data (e.g., summaries, history)
       ▪ integrated
       ▪ time-varying
       ▪ non-volatile
     ▪ Collection of tools
       ▪ gathering data
       ▪ cleansing, integrating
       ▪ querying, reporting, analysis
       ▪ data mining
       ▪ monitoring, administering
     A DataWarehouse is basically a regular database, but its data structure is created for analysis…
  2. Advantages of DataWarehousing
     ▪ High query performance (query-optimized structure)
     ▪ Local processing -> sources unaffected
     ▪ Can operate when sources are unavailable
     ▪ Can query data not stored in a source DBMS (off-line sources)
     ▪ Extra information in the datawarehouse
       ▪ Aggregated and/or modified data
       ▪ Adds historical information to the data -> trends
  3. Modeling DataWarehouses
     ▪ Modeling terms: fact, dimension, measure
     ▪ In Ustream terms: fact = views, broadcasts, events; dimension = users, channels, videos, etc.; measures = viewer hours, timestamps, revenue, paid amounts, etc.
     ▪ Modeling data warehouses: fact and dimension tables
     ▪ Star schema: a single object (fact) in the middle connected to a number of objects (dimensions) radially.
     ▪ Snowflake schema: a refinement of the star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
     ▪ Of course, star and snowflake modeling can be mixed in a DW if necessary.
  4. Example of star schema
     ▪ Star schema
       ▪ Structure is simple
       ▪ Denormalized, flat tables -> redundant data
       ▪ No complex joins -> easy to understand and query
       ▪ Be careful about dimensions with a large number of records
       ▪ Example: time, category
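     A minimal sketch of how such a star schema could be declared in SQL; the table and column names (fact_view, dim_date, dim_channel, viewer_hours) are illustrative assumptions, not the actual Ustream schema:

         -- Denormalized dimension tables around a single fact table.
         CREATE TABLE dim_date (
             date_key     INT PRIMARY KEY,   -- surrogate key, e.g. 20130307
             full_date    DATE,
             day_of_week  VARCHAR(10),
             month_name   VARCHAR(10),
             year_number  SMALLINT
         );

         CREATE TABLE dim_channel (
             channel_key   INT PRIMARY KEY,  -- surrogate key
             channel_id    INT,              -- original key from the source system
             channel_name  VARCHAR(255),
             category      VARCHAR(100)      -- flat, redundant attribute
         );

         CREATE TABLE fact_view (
             date_key      INT,              -- points to dim_date
             channel_key   INT,              -- points to dim_channel
             viewer_hours  DECIMAL(12,2),    -- measure
             view_count    INT               -- measure
         );

     Every report query joins the fact table directly to the dimensions it needs, one join per dimension.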
  5. Example of snowflake schema
     ▪ Snowflake schema
       ▪ Structure is complex, but the fact is still in the middle
       ▪ No redundant data in dimensions
       ▪ Complex joins -> longer query execution
       ▪ Consider for large dimensions, or dimensions with a lot of different attributes per element type
       ▪ Example: different types of broadcasters: PPV, Pro, RevShare
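     A hedged sketch of the snowflaked variant, reusing the same illustrative names: the category attribute is moved out of dim_channel into its own normalized table, at the cost of one extra join in queries.

         -- Normalized (snowflaked) channel dimension.
         CREATE TABLE dim_category (
             category_key  INT PRIMARY KEY,
             category_name VARCHAR(100)
         );

         CREATE TABLE dim_channel (
             channel_key   INT PRIMARY KEY,
             channel_id    INT,              -- original source key
             channel_name  VARCHAR(255),
             category_key  INT               -- reference to dim_category instead of a stored name
         );

         -- A report now needs an extra join compared to the star variant:
         SELECT c.channel_name, cat.category_name, SUM(f.viewer_hours) AS viewer_hours
         FROM fact_view f
         JOIN dim_channel  c   ON c.channel_key    = f.channel_key
         JOIN dim_category cat ON cat.category_key = c.category_key
         GROUP BY c.channel_name, cat.category_name;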
  6. ETL: Extract, Transform and Load
     ▪ Get data out of the sources and load it into the DW – a process of copying data (scheduled, automatic)
     ▪ Data is transformed to match the DW schema during the process
     ▪ Data is also needed from non-OLTP systems such as text files, data feeds, and spreadsheets
     ▪ When defining ETL, always think of ETL as a process, not a physical implementation -> a logical data mapping is needed before the data transformation (the mapping is usually a SQL expression, sketched below)
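     For illustration, such a logical data mapping expressed as SQL could look like the sketch below; the source table src_broadcast and its columns are assumptions, not the real source schema.

         -- Logical mapping: target DW columns on the left, source expressions on the right.
         -- This documents the transformation before any physical ETL job is built.
         SELECT
             DATE_FORMAT(b.started_at, '%Y%m%d')  AS date_key,      -- derive the date surrogate key
             ch.channel_key                       AS channel_key,   -- resolve the source id to a surrogate key
             b.total_viewer_seconds / 3600.0      AS viewer_hours,  -- convert seconds to hours
             b.view_count                         AS view_count
         FROM src_broadcast b
         JOIN dim_channel ch ON ch.channel_id = b.channel_id;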
  7. Loading Data
     ▪ Initial load for the first time
     ▪ Incremental vs. refresh (see the sketch below)
     ▪ Off-line vs. on-line
     ▪ Frequency of loading can be critical
       ▪ At night, once a week/month, continuously
     ▪ Parallel / partitioned load
     ▪ The integration of all of the sources is the real challenge
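     A minimal sketch of the two loading styles, assuming a staging table staging_view and a hypothetical bookkeeping table etl_load_log that stores the high-water mark of the last successful load.

         -- Incremental load: append only the rows that arrived since the last load.
         INSERT INTO fact_view (date_key, channel_key, viewer_hours, view_count)
         SELECT s.date_key, s.channel_key, s.viewer_hours, s.view_count
         FROM staging_view s
         WHERE s.loaded_at >
               (SELECT last_loaded_at FROM etl_load_log WHERE table_name = 'fact_view');

         -- Refresh: throw the target away and rebuild it completely.
         TRUNCATE TABLE fact_view;
         INSERT INTO fact_view (date_key, channel_key, viewer_hours, view_count)
         SELECT s.date_key, s.channel_key, s.viewer_hours, s.view_count
         FROM staging_view s;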
  8. Derived Data
     ▪ Derived data
       ▪ Indexes (built-in)
       ▪ Aggregates (with stored procedures; see the sketch below)
       ▪ Materialized views (built-in)
     ▪ Updating derived data
       ▪ Incremental
       ▪ Refresh
     [Diagram: Table1, Table2, Table3 feed the derived data, which feeds Report1 and Report2]
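     As a sketch of aggregates as derived data (table names and the sample date key are illustrative), a daily summary can be built once and then refreshed only for the day that changed.

         -- One-time build of a daily aggregate that reports can query
         -- instead of scanning the detailed fact table.
         CREATE TABLE agg_daily_channel AS
         SELECT date_key, channel_key,
                SUM(viewer_hours) AS viewer_hours,
                SUM(view_count)   AS view_count
         FROM fact_view
         GROUP BY date_key, channel_key;

         -- Incremental update: recompute only the freshly loaded day (20130307 is a sample key).
         DELETE FROM agg_daily_channel WHERE date_key = 20130307;
         INSERT INTO agg_daily_channel
         SELECT date_key, channel_key, SUM(viewer_hours), SUM(view_count)
         FROM fact_view
         WHERE date_key = 20130307
         GROUP BY date_key, channel_key;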
  9. Staging Database
     ▪ Usually a separate relational database is needed to serve the operational needs of the ETL
     ▪ Data structure is similar to the sources
     ▪ Logical and physical separation of the source systems and the DW
     ▪ Minimizes the impact of the ETL activity on both the source and DW databases
     [Diagram: Staging DB -> DW]
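     A small sketch of what a staging table could look like under these assumptions: it mirrors the source table's structure and only adds a couple of ETL bookkeeping columns (all names are illustrative).

         CREATE TABLE staging_broadcast (
             broadcast_id         INT,
             channel_id           INT,
             started_at           DATETIME,
             total_viewer_seconds BIGINT,
             view_count           INT,
             loaded_at            DATETIME,    -- when this row arrived in staging
             source_system        VARCHAR(50)  -- which source it was extracted from
         );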
 10. Before ETL development
     ▪ Data discovery
       ▪ Profiling (rowcounts, NULL values, dates in different formats or in char columns; see the sketch below)
       ▪ Find master data („system of record”)
     ▪ Data cleansing (endless fight :))
       ▪ Matching similar records -> merge, skip or delete
       ▪ Finding invalid values (based on a dictionary)
       ▪ Data completeness check (do we have every value?)
       ▪ Anomaly checking (out-of-range values, values not matching the usual pattern)
       ▪ Logical data check (applying business rules)
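     A few profiling queries of the kind used in data discovery, run against the hypothetical source table src_broadcast:

         -- Rowcount of the source table.
         SELECT COUNT(*) AS row_count FROM src_broadcast;

         -- NULL check on a key column.
         SELECT COUNT(*) AS missing_channel
         FROM src_broadcast
         WHERE channel_id IS NULL;

         -- Dates stored in a char column: list values that do not parse as dates.
         SELECT started_at_text, COUNT(*) AS occurrences
         FROM src_broadcast
         WHERE STR_TO_DATE(started_at_text, '%Y-%m-%d %H:%i:%s') IS NULL
         GROUP BY started_at_text
         LIMIT 20;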
 11. Determining changed data
     ▪ Audit logging on the source side – extra overhead for sources
     ▪ Built-in change data capture on sources (MySQL doesn’t have one)
     ▪ Using „_history” / „_deleted” tables where they exist in the source
     ▪ Time-driven approach – only the newly created data is loaded (works for facts and some time-driven dimensions)
     ▪ Staging comparison (outer joins of previously and currently loaded data) – resource-intensive on staging, but the only 100% sure solution
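     A sketch of the staging comparison, assuming two snapshot tables (staging_channel_previous and staging_channel_current); MySQL has no FULL OUTER JOIN, so it is emulated with a UNION of two outer joins plus an inner join for changed rows.

         SELECT cur.channel_id, 'insert' AS change_type          -- rows that are new
         FROM staging_channel_current cur
         LEFT JOIN staging_channel_previous prev ON prev.channel_id = cur.channel_id
         WHERE prev.channel_id IS NULL

         UNION ALL

         SELECT prev.channel_id, 'delete'                        -- rows that disappeared
         FROM staging_channel_previous prev
         LEFT JOIN staging_channel_current cur ON cur.channel_id = prev.channel_id
         WHERE cur.channel_id IS NULL

         UNION ALL

         SELECT cur.channel_id, 'update'                         -- rows whose attributes changed
         FROM staging_channel_current cur
         JOIN staging_channel_previous prev ON prev.channel_id = cur.channel_id
         WHERE NOT (cur.channel_name <=> prev.channel_name
                    AND cur.category <=> prev.category);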
 12. Loading dimensions
     ▪ Denormalized (flat) tables to minimize the number of joins
     ▪ Yes, store data redundantly :)
     ▪ Primary key is a surrogate key (a meaningless number)
       ▪ The original key in the source can change (but needs to be stored in the DW)
       ▪ Data mapping between different sources is easier
     ▪ Data in dimensions usually changes slowly, but changes must be followed:
       ▪ Simple update
       ▪ Changed value inserted as a new record (see the sketch below)
       ▪ Old value stored in a separate column
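     A sketch of the "changed value inserted as a new record" strategy (a Type 2 slowly changing dimension), assuming validity columns valid_from / valid_to / is_current and an AUTO_INCREMENT surrogate key on dim_channel; all names are illustrative.

         -- 1) Close the current version of every channel whose tracked attribute changed.
         UPDATE dim_channel d
         JOIN staging_channel_current s ON s.channel_id = d.channel_id
         SET d.valid_to = CURRENT_DATE,
             d.is_current = 0
         WHERE d.is_current = 1
           AND d.channel_name <> s.channel_name;

         -- 2) Insert the new versions (and brand-new channels); the surrogate key
         --    is generated automatically.
         INSERT INTO dim_channel (channel_id, channel_name, category, valid_from, valid_to, is_current)
         SELECT s.channel_id, s.channel_name, s.category, CURRENT_DATE, '9999-12-31', 1
         FROM staging_channel_current s
         LEFT JOIN dim_channel d ON d.channel_id = s.channel_id AND d.is_current = 1
         WHERE d.channel_id IS NULL;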
 13. Loading Fact Tables
     ▪ Fact tables are highly denormalized tables -> a lot of restructuring is needed
     ▪ Yes, store data redundantly :)
     ▪ No foreign key objects implemented in the DW
       ▪ Too much overhead for the loading process
     ▪ Converting the original keys of the data into surrogate keys
       ▪ Performance boost: creating separate lookup tables
     ▪ Managing indexes
       ▪ Indexes are the biggest performance killers for the data load
       ▪ Drop index, load data, rebuild index (sketched below) -> see partitioning on the next slide
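     A minimal sketch of one fact load step under these assumptions: a lookup table lkp_channel maps original source ids to surrogate keys, and the index idx_fact_view_date is dropped for the bulk load and rebuilt afterwards.

         ALTER TABLE fact_view DROP INDEX idx_fact_view_date;            -- drop before the bulk load

         INSERT INTO fact_view (date_key, channel_key, viewer_hours, view_count)
         SELECT s.date_key,
                l.channel_key,                                           -- surrogate key from the lookup table
                s.viewer_hours,
                s.view_count
         FROM staging_view s
         JOIN lkp_channel l ON l.channel_id = s.channel_id;

         ALTER TABLE fact_view ADD INDEX idx_fact_view_date (date_key);  -- rebuild after loading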
 14. Partition-driven loading
     ▪ Partitions allow a table (and its indexes!) to be physically divided into smaller tables to improve performance
     ▪ Partitioning strategy on fact tables: partition by the date key; the new data goes into the last partition
     ▪ Derived data (indexes, aggregates) is easier to rebuild based only on the one changed partition
     [Diagram: Partition Day 1, Partition Day 2, Partition Day 3, …, Partition Day N, Partition Day N+1]
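     A hedged sketch of range partitioning by the date key in MySQL, reusing the illustrative fact table from above; each daily load then only touches the newest partition.

         CREATE TABLE fact_view (
             date_key     INT NOT NULL,
             channel_key  INT NOT NULL,
             viewer_hours DECIMAL(12,2),
             view_count   INT
         )
         PARTITION BY RANGE (date_key) (
             PARTITION p20130305 VALUES LESS THAN (20130306),
             PARTITION p20130306 VALUES LESS THAN (20130307),
             PARTITION p20130307 VALUES LESS THAN (20130308),
             PARTITION pmax      VALUES LESS THAN MAXVALUE
         );

         -- Split a new daily partition off the open-ended one before each load.
         ALTER TABLE fact_view REORGANIZE PARTITION pmax INTO (
             PARTITION p20130308 VALUES LESS THAN (20130309),
             PARTITION pmax      VALUES LESS THAN MAXVALUE
         );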
 15. Let’s do Ustream BI, not just theories… :)
     ▪ What do we have:
       ▪ 3rd party analytics tools
       ▪ Our own (both statistical and operational) DBs and logs + some offline files
       ▪ A huge number of users, channels, broadcasts and views, so a lot of DATA
     ▪ What do we need:
       ▪ An ETL tool with a lot of built-in functions that is scalable: Kettle
       ▪ A DBMS that supports indexing and partitioning to handle the huge amount of data: MySQL, Greenplum
       ▪ A reporting solution: coming later…
       ▪ A lot of work… :) -> data modeling, ETL development, etc.
 16. ETL plans – near and further future
     ▪ Near future
       ▪ Collecting data from local DBs, 3rd party tools, and offline files into the MySQL-based DW
       ▪ Developing the Prio 1 ETL processes necessary for the KRs’ data needs
       ▪ Automated ETL based on the event calendar
     ▪ Further future
       ▪ Consolidating all 3rd party analytical tools into 1-2, with our own event handling
       ▪ Hadoop-based cluster for advanced data processing (e.g. clickstream analysis)
       ▪ Moving the DW to a Greenplum-based database
 17. ETL tool
     ▪ Kettle (Pentaho Data Integration – PDI), http://kettle.pentaho.com/
     ▪ Open source, GPL
     ▪ Broadest functionality among the open source tools
     ▪ A lot of supported inputs: various RDBMS connections, text files, GA, Excel, etc.
     ▪ Both MySQL and Greenplum connections are supported in bulk mode
     ▪ Also strong support for big data stores (Hadoop, HBase, Cassandra, CouchDB, MapReduce, MongoDB)
 18. DataWarehouse data flow, a.k.a. ETL architecture
     [Architecture diagram showing the ETL data flow between Kettle and the systems involved: Solr / Metrics, Conviva, Graphite, Quantcast, comScore, GA, uStreamTv Statistics, Editorial, Ustream Meta, StreamStat, SocialStream, Event calendar, ConnStat, Hadoop]