

Data Warehouses: Analyzing Lots of Data Fast - Papp Tamás (Ustream)

Whenever I tell someone that I work with data warehouses, I always get the same question: 'OK, but how is that different from a "traditional" database?'. In this talk we go through exactly that: what the main differences are, why users can query large amounts of data quickly, and what design methods we use for data warehouses.


Budapest Database Meetup

March 07, 2013



Transcript

  1. What is a DataWarehouse?

     ▪ Collection of diverse data
       ▪ subject oriented
       ▪ aimed at business users
       ▪ often a copy of operational data
       ▪ with value-added data (e.g., summaries, history)
       ▪ integrated
       ▪ time-varying
       ▪ non-volatile
     ▪ Collection of tools
       ▪ gathering data
       ▪ cleansing, integrating
       ▪ querying, reporting, analysis
       ▪ data mining
       ▪ monitoring, administering

     A DataWarehouse is basically a regular database, but its data structure is designed for analysis…
  2. Advantages of DataWarehousing

     ▪ High query performance (query-optimized structure)
     ▪ Local processing -> sources unaffected
     ▪ Can operate when sources are unavailable
     ▪ Can query data not stored in a source DBMS (off-line sources)
     ▪ Extra information at the datawarehouse
       ▪ aggregated and/or modified data
       ▪ adds historical information to the data -> trends
  3. Modeling DataWarehouses

     ▪ Modeling terms: fact, dimension, measure
       ▪ In Ustream terms: facts = views, broadcasts, events; dimensions = users, channels, videos, etc.; measures = viewer hours, timestamps, revenue, paid amounts, etc.
     ▪ Modeling data warehouses: fact and dimension tables
       ▪ Star schema: a single object (fact) in the middle, connected to a number of objects (dimensions) radially.
       ▪ Snowflake schema: a refinement of the star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
     ▪ Of course, star and snowflake modeling can be mixed in a DW if necessary.
  4. Example of star schema

     ▪ Star schema
       ▪ Structure is simple
       ▪ Denormalized, flat tables -> redundant data
       ▪ No complex joins -> easy to understand and query
       ▪ Be careful about dimensions with large number of records
         ▪ Example: time, category
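A minimal sketch of a star schema in SQL, with hypothetical Ustream-flavored names (fact_views, dim_user, dim_channel and all of their columns are illustrative, not taken from the deck):

```sql
-- Hypothetical star schema: flat dimensions joined directly to one central fact.
CREATE TABLE dim_user (
  user_key  INT PRIMARY KEY,   -- surrogate key (meaningless number)
  user_id   INT,               -- original key from the source system
  user_name VARCHAR(100),
  country   VARCHAR(50)        -- denormalized attribute, stored redundantly
);

CREATE TABLE dim_channel (
  channel_key  INT PRIMARY KEY,
  channel_id   INT,
  channel_name VARCHAR(100),
  category     VARCHAR(50)     -- flat: the category name is repeated on every channel row
);

-- The fact sits in the middle: dimension keys plus measures, one row per view.
CREATE TABLE fact_views (
  date_key     INT,
  user_key     INT,
  channel_key  INT,
  viewer_hours DECIMAL(10,2)   -- measure
);
```

Every dimension joins to the fact in a single hop, which is what keeps star-schema queries simple.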
  5. Example of snowflake schema

     ▪ Snowflake schema
       ▪ Structure is complex, but the fact is still in the middle
       ▪ No redundant data in dimensions
       ▪ Complex joins -> longer query executions
       ▪ Consider for large dimensions, or dimensions with a lot of different attributes per element type
         ▪ Example: different types of broadcasters: PPV, Pro, RevShare
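A sketch of snowflaking the illustrative channel dimension from the star example: the category attribute is normalized into its own table, removing the redundancy at the cost of one more join at query time:

```sql
-- Snowflaked replacement for the flat dim_channel above.
CREATE TABLE dim_category (
  category_key  INT PRIMARY KEY,
  category_name VARCHAR(50)
);

CREATE TABLE dim_channel_sf (
  channel_key  INT PRIMARY KEY,
  channel_id   INT,
  channel_name VARCHAR(100),
  category_key INT              -- points at dim_category instead of repeating the name
);
```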
  6. ETL: Extract, Transform and Load

     ▪ Get data out of the sources and load it into the DW – a process of copying data (scheduled, automatic)
     ▪ Data is transformed to match the DW schema during the process
     ▪ Data is also needed from non-OLTP systems such as text files, data feeds, and spreadsheets
     ▪ When defining ETL, always think of it as a process, not a physical implementation -> you need a logical data mapping before the data transformation (the mapping is usually a SQL expression; see the sketch below)
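A sketch of such a logical data mapping expressed as SQL, from a hypothetical source table src_broadcasts onto the DW's fact columns (the table, its columns, and the formats are all assumptions):

```sql
-- Logical data mapping as a SQL expression: each source column or
-- expression maps onto one DW column.
SELECT
  CAST(DATE_FORMAT(b.started_at, '%Y%m%d') AS UNSIGNED) AS date_key,       -- timestamp -> integer date key
  b.user_id                                             AS source_user_id, -- swapped for a surrogate key later
  TIMESTAMPDIFF(SECOND, b.started_at, b.ended_at) / 3600.0
                                                        AS viewer_hours    -- derived measure
FROM src_broadcasts AS b
WHERE b.ended_at IS NOT NULL;  -- map only completed broadcasts
```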
  7. Loading Data

     ▪ Initial load for the first time
     ▪ Incremental vs. refresh
     ▪ Off-line vs. on-line
     ▪ Frequency of loading can be critical
       ▪ at night, 1x a week/month, continuously
     ▪ Parallel / partitioned load
     ▪ The integration of all of the sources is the real challenge
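The incremental vs. refresh distinction, sketched against the illustrative tables above (staging_views, etl_watermark, and the load_ts column are assumptions, not from the deck):

```sql
-- Refresh: wipe the target and rebuild it completely from staging.
TRUNCATE TABLE fact_views;
INSERT INTO fact_views (date_key, user_key, channel_key, viewer_hours)
SELECT date_key, user_key, channel_key, viewer_hours
FROM staging_views;

-- Incremental: append only rows that arrived since the last successful load.
INSERT INTO fact_views (date_key, user_key, channel_key, viewer_hours)
SELECT date_key, user_key, channel_key, viewer_hours
FROM staging_views
WHERE load_ts > (SELECT last_load_ts FROM etl_watermark WHERE job_name = 'fact_views');
```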
  8. Derived Data

     ▪ Derived data
       ▪ indexes (built-in)
       ▪ aggregates (with stored procedures)
       ▪ materialized views (built-in)
     ▪ Update derived data
       ▪ incremental
       ▪ refresh

     [Diagram: Table1, Table2, Table3 feeding derived data, which feeds Report1 and Report2]
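MySQL has no built-in materialized views, so an aggregate like the one below is typically maintained by a scheduled statement or stored procedure; this is a sketch with illustrative names over the hypothetical fact_views table:

```sql
-- Daily per-channel aggregate over the illustrative fact table.
CREATE TABLE agg_daily_channel_views (
  date_key     INT,
  channel_key  INT,
  viewer_hours DECIMAL(14,2),
  PRIMARY KEY (date_key, channel_key)
);

-- Refresh: recompute and overwrite the aggregate rows in place.
REPLACE INTO agg_daily_channel_views (date_key, channel_key, viewer_hours)
SELECT date_key, channel_key, SUM(viewer_hours)
FROM fact_views
GROUP BY date_key, channel_key;
```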
  9. Staging Database

     ▪ Usually a separate relational database is needed to serve the operational needs of the ETL
     ▪ Data structure is similar to the sources
     ▪ Logical and physical separation of the source systems and the DW
     ▪ Minimizes the impact of the ETL activity on both the source and DW databases

     [Diagram: sources -> Staging DB -> DW]
  10. Before ETL development

      ▪ Data discovery
        ▪ Profiling (rowcounts, NULL values, dates in different formats or in char columns)
        ▪ Find master data ("system of record")
      ▪ Data cleansing (an endless fight :))
        ▪ Matching similar records -> merge, skip, or delete
        ▪ Finding invalid values (based on a dictionary)
        ▪ Data completeness check (do we have every value?)
        ▪ Anomaly checking (out-of-range, not matching the usual pattern)
        ▪ Logical data check (applying business rules)
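A couple of profiling queries of the kind listed above, against a hypothetical source table src_users (the table and column names are assumptions):

```sql
-- Rowcount and NULL profile of one column.
SELECT COUNT(*)           AS total_rows,
       SUM(email IS NULL) AS null_emails
FROM src_users;

-- Dates hiding in a char column: count rows that fail to parse as a date.
SELECT COUNT(*) AS bad_dates
FROM src_users
WHERE STR_TO_DATE(signup_date_txt, '%Y-%m-%d') IS NULL;
```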
  11. Determining changed data

      ▪ Audit logging on the source side – extra overhead for sources
      ▪ Built-in change data capture on sources (MySQL doesn't have one)
      ▪ Using "_history" / "_deleted" tables where they exist in the source
      ▪ Time-driven approach – the newly created data is loaded (works for facts and some time-driven dimensions)
      ▪ Staging comparison (outer joins of previously and currently loaded data) – resource-intensive on staging, but the only 100% sure solution
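A sketch of the staging-comparison approach. MySQL has no FULL OUTER JOIN, so inserts and deletes are found with two anti-joins; stage_prev and stage_curr are hypothetical snapshots of the previous and current extract:

```sql
-- New rows: in the current extract but not in the previous one.
SELECT c.id
FROM stage_curr c
LEFT JOIN stage_prev p ON p.id = c.id
WHERE p.id IS NULL;

-- Deleted rows: in the previous extract but gone from the current one.
SELECT p.id
FROM stage_prev p
LEFT JOIN stage_curr c ON c.id = p.id
WHERE c.id IS NULL;

-- Changed rows: present in both, but with differing content
-- (<=> is MySQL's NULL-safe equality operator).
SELECT c.id
FROM stage_curr c
JOIN stage_prev p ON p.id = c.id
WHERE NOT (c.name <=> p.name AND c.status <=> p.status);
```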
  12. Loading dimensions

      ▪ Denormalized (flat) tables to minimize the number of joins
        ▪ Yes, store data redundantly :)
      ▪ Primary key is a surrogate key (a meaningless number)
        ▪ The original key in the source can change (but needs to be stored in the DW)
        ▪ Data mapping between different sources is easier
      ▪ Data in dimensions usually changes slowly, but changes must be followed:
        ▪ simple update
        ▪ changed value inserted as a new record
        ▪ old value stored in a separate column
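A sketch of the "changed value inserted as a new record" strategy (a Type 2 slowly changing dimension, to name the standard technique) on the illustrative dim_user table; the valid_from, valid_to, and is_current housekeeping columns, and all values, are assumptions added for the example:

```sql
-- Close out the current version of the changed user (user_id 42 is illustrative)...
UPDATE dim_user
SET valid_to = CURDATE(), is_current = 0
WHERE user_id = 42 AND is_current = 1;

-- ...then insert the new version under a fresh surrogate key.
INSERT INTO dim_user
  (user_key, user_id, user_name, country, valid_from, valid_to, is_current)
VALUES
  (10042, 42, 'alice', 'DE', CURDATE(), '9999-12-31', 1);
```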
  13. Loading Fact Tables

      ▪ Fact tables are highly denormalized tables -> a lot of restructuring is needed
        ▪ Yes, store data redundantly :)
      ▪ No foreign key objects implemented in the DW
        ▪ too much overhead for the loading process
      ▪ Converting the original keys of the data into surrogate keys
        ▪ performance boost: creating separate lookup tables
      ▪ Managing indexes
        ▪ Indexes are the biggest performance killers for the data load
        ▪ Drop index, load data, rebuild index -> see partitioning on the next slide
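A sketch of the surrogate-key conversion during a fact load, plus the drop/load/rebuild index pattern; lkp_user is an illustrative lookup table mapping source keys to surrogate keys, and the index name is invented:

```sql
-- 1. Drop the index so the bulk load avoids per-row index maintenance.
ALTER TABLE fact_views DROP INDEX idx_fact_views_date;

-- 2. Load facts, converting the source user_id into a surrogate key
--    through a separate lookup table.
INSERT INTO fact_views (date_key, user_key, channel_key, viewer_hours)
SELECT s.date_key, l.user_key, s.channel_key, s.viewer_hours
FROM staging_views s
JOIN lkp_user l ON l.user_id = s.user_id;

-- 3. Rebuild the index once, after the load has finished.
ALTER TABLE fact_views ADD INDEX idx_fact_views_date (date_key);
```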
  14. Partition-driven loading

      ▪ Partitions allow a table (and its indexes!) to be physically divided into smaller tables to improve performance
      ▪ Partitioning strategy on fact tables: partition by the date key; the new data comes into the last partition
      ▪ Derived data (indexes, aggregates) is easier to rebuild based on the single changed partition

      [Diagram: Partition Day 1, Partition Day 2, Partition Day 3, …, Partition Day N, Partition Day N+1]
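A sketch of date-key range partitioning in MySQL (partition names and date boundaries are illustrative):

```sql
-- Fact table partitioned by the date key; each day lands in its own partition.
CREATE TABLE fact_views_p (
  date_key     INT NOT NULL,
  user_key     INT,
  channel_key  INT,
  viewer_hours DECIMAL(10,2)
)
PARTITION BY RANGE (date_key) (
  PARTITION p20130305 VALUES LESS THAN (20130306),
  PARTITION p20130306 VALUES LESS THAN (20130307),
  PARTITION p20130307 VALUES LESS THAN (20130308)
);

-- New data goes into a freshly added last partition.
ALTER TABLE fact_views_p
  ADD PARTITION (PARTITION p20130308 VALUES LESS THAN (20130309));
```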
  15. Let's do Ustream BI, not just theories… :)

      ▪ What do we have:
        ▪ 3rd party analytics tools
        ▪ Own (both statistical and operational) DBs and logs + some offline files
        ▪ Huge numbers of users, channels, broadcasts, views, so a lot of DATA
      ▪ What do we need:
        ▪ An ETL tool that is scalable and has a lot of built-in functions: Kettle
        ▪ A DBMS which supports indexing and partitioning to handle the huge amount of data: MySQL, Greenplum
        ▪ A reporting solution: coming later…
        ▪ To do a lot of work… :) -> data modeling, ETL development, etc.
  16. ETL plans – near and further future

      ▪ Near future
        ▪ Collecting data from local DBs, 3rd party tools, and offline files into a MySQL-based DW
        ▪ Developing the Prio 1 ETL processes necessary for the KRs' data needs
        ▪ Automated ETL based on the event calendar
      ▪ Further future
        ▪ Consolidating all 3rd party analytical tools into 1-2, with our own event handling
        ▪ Hadoop-based cluster for advanced data processing (e.g. clickstream analysis)
        ▪ Moving the DW to a Greenplum-based database
  17. ETL tool

      ▪ Kettle (Pentaho Data Integration – PDI), http://kettle.pentaho.com/
        ▪ Open source, GPL
        ▪ Broadest functionality among the open source tools
        ▪ A lot of supported inputs: various RDBMS connections, text files, GA, Excel, etc.
        ▪ Both MySQL and Greenplum connections are supported in bulk mode
        ▪ Also strong support for big data stores (Hadoop, HBase, Cassandra, CouchDB, MapReduce, MongoDB)
  18. DataWarehouse data flow a.k.a. ETL architecture

      [Diagram: Kettle connecting the source systems (Solr / Metrics, Conviva, Graphite, Quantcast, comScore, GA, uStreamTv Statistics, Editorial, Ustream Meta, StreamStat, SocialStream, Event calendar, ConnStat) with the DW and Hadoop]