
Data Warehouses: Analyzing a Lot of Data Fast - Papp Tamás (Ustream)

When I tell someone that I work with data warehouses, I always get the same question: "OK, but how is that different from a 'traditional' database?" That is exactly what we go through in this talk: what the main differences are, why users can query large amounts of data quickly, and which design methods we use for data warehouses.

Budapest Database Meetup

March 07, 2013

Transcript

  1. What is a DataWarehouse?
     ▪ Collection of diverse data
       ▪ subject oriented
       ▪ aimed at business users
       ▪ often a copy of operational data
       ▪ with value-added data (e.g., summaries, history)
       ▪ integrated
       ▪ time-varying
       ▪ non-volatile
     ▪ Collection of tools
       ▪ gathering data
       ▪ cleansing, integrating
       ▪ querying, reporting, analysis
       ▪ data mining
       ▪ monitoring, administering
     A DataWarehouse is basically a regular database, but its data structure is created for analysis…
  2. Advantages of DataWarehousing
     ▪ High query performance (query-optimized structure)
     ▪ Local processing -> sources unaffected
     ▪ Can operate when sources are unavailable
     ▪ Can query data not stored in a source DBMS (off-line sources)
     ▪ Extra information in the datawarehouse
       ▪ Aggregated and/or modified data
       ▪ Adds historical information to the data -> trends
  3. Modeling DataWarehouses
     ▪ Modeling terms: fact, dimension, measure
     ▪ In Ustream terms: fact = views, broadcasts, events; dimension = users, channels, videos, etc.; measures = viewer hours, timestamps, revenue, paid amounts, etc.
     ▪ Modeling data warehouses: fact and dimension tables
     ▪ Star schema: a single object (fact) in the middle connected to a number of objects (dimensions) radially.
     ▪ Snowflake schema: a refinement of the star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
     ▪ Of course, star and snowflake modeling can be mixed in a DW if necessary.
  4. Example of star schema
     ▪ Star schema
       ▪ Structure is simple
       ▪ Denormalized, flat tables -> redundant data
       ▪ No complex joins -> easy to understand and query
       ▪ Be careful about dimensions with a large number of records
       ▪ Example: time, category
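     A minimal sketch of how such a star schema could be declared in SQL; the table and column names (fact_view, dim_date, dim_channel, viewer_hours) are illustrative assumptions, not the actual Ustream schema:

         -- Denormalized dimension tables around a single fact table.
         CREATE TABLE dim_date (
             date_key     INT PRIMARY KEY,   -- surrogate key, e.g. 20130307
             full_date    DATE,
             day_of_week  VARCHAR(10),
             month_name   VARCHAR(10),
             year_number  SMALLINT
         );

         CREATE TABLE dim_channel (
             channel_key   INT PRIMARY KEY,  -- surrogate key
             channel_id    INT,              -- original key from the source system
             channel_name  VARCHAR(255),
             category      VARCHAR(100)      -- flat, redundant attribute
         );

         CREATE TABLE fact_view (
             date_key      INT,              -- points to dim_date
             channel_key   INT,              -- points to dim_channel
             viewer_hours  DECIMAL(12,2),    -- measure
             view_count    INT               -- measure
         );

     Every report query joins the fact table directly to the dimensions it needs, one join per dimension.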
  5. Example of snowflake schema
     ▪ Snowflake schema
       ▪ Structure is complex, but the fact is still in the middle
       ▪ No redundant data in dimensions
       ▪ Complex joins -> longer query execution
       ▪ Consider for large dimensions, or dimensions with a lot of different attributes per element type
       ▪ Example: different types of broadcasters: PPV, Pro, RevShare
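     A hedged sketch of the snowflaked variant, reusing the same illustrative names: the category attribute is moved out of dim_channel into its own normalized table, at the cost of one extra join in queries.

         -- Normalized (snowflaked) channel dimension.
         CREATE TABLE dim_category (
             category_key  INT PRIMARY KEY,
             category_name VARCHAR(100)
         );

         CREATE TABLE dim_channel (
             channel_key   INT PRIMARY KEY,
             channel_id    INT,              -- original source key
             channel_name  VARCHAR(255),
             category_key  INT               -- reference to dim_category instead of a stored name
         );

         -- A report now needs an extra join compared to the star variant:
         SELECT c.channel_name, cat.category_name, SUM(f.viewer_hours) AS viewer_hours
         FROM fact_view f
         JOIN dim_channel  c   ON c.channel_key    = f.channel_key
         JOIN dim_category cat ON cat.category_key = c.category_key
         GROUP BY c.channel_name, cat.category_name;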
  6. ETL: Extract, Transform and Load
     ▪ Get data out of the sources and load it into the DW – a process of copying data (scheduled, automatic)
     ▪ Data is transformed to match the DW schema during the process
     ▪ Data is also needed from non-OLTP systems such as text files, data feeds, and spreadsheets
     ▪ When defining ETL, always think of ETL as a process, not a physical implementation -> a logical data mapping is needed before the data transformation (the mapping is usually a SQL expression, sketched below)
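     For illustration, such a logical data mapping expressed as SQL could look like the sketch below; the source table src_broadcast and its columns are assumptions, not the real source schema.

         -- Logical mapping: target DW columns on the left, source expressions on the right.
         -- This documents the transformation before any physical ETL job is built.
         SELECT
             DATE_FORMAT(b.started_at, '%Y%m%d')  AS date_key,      -- derive the date surrogate key
             ch.channel_key                       AS channel_key,   -- resolve the source id to a surrogate key
             b.total_viewer_seconds / 3600.0      AS viewer_hours,  -- convert seconds to hours
             b.view_count                         AS view_count
         FROM src_broadcast b
         JOIN dim_channel ch ON ch.channel_id = b.channel_id;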
  7. Loading Data
     ▪ Initial load for the first time
     ▪ Incremental vs. refresh (see the sketch below)
     ▪ Off-line vs. on-line
     ▪ Frequency of loading can be critical
       ▪ At night, once a week/month, continuously
     ▪ Parallel / partitioned load
     ▪ The integration of all of the sources is the real challenge
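     A minimal sketch of the two loading styles, assuming a staging table staging_view and a hypothetical bookkeeping table etl_load_log that stores the high-water mark of the last successful load.

         -- Incremental load: append only the rows that arrived since the last load.
         INSERT INTO fact_view (date_key, channel_key, viewer_hours, view_count)
         SELECT s.date_key, s.channel_key, s.viewer_hours, s.view_count
         FROM staging_view s
         WHERE s.loaded_at >
               (SELECT last_loaded_at FROM etl_load_log WHERE table_name = 'fact_view');

         -- Refresh: throw the target away and rebuild it completely.
         TRUNCATE TABLE fact_view;
         INSERT INTO fact_view (date_key, channel_key, viewer_hours, view_count)
         SELECT s.date_key, s.channel_key, s.viewer_hours, s.view_count
         FROM staging_view s;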
  8. Derived Data
     ▪ Derived data
       ▪ Indexes (built-in)
       ▪ Aggregates (with stored procedures; see the sketch below)
       ▪ Materialized views (built-in)
     ▪ Updating derived data
       ▪ Incremental
       ▪ Refresh
     [Diagram: Table1, Table2, Table3 feed the derived data, which feeds Report1 and Report2]
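     As a sketch of aggregates as derived data (table names and the sample date key are illustrative), a daily summary can be built once and then refreshed only for the day that changed.

         -- One-time build of a daily aggregate that reports can query
         -- instead of scanning the detailed fact table.
         CREATE TABLE agg_daily_channel AS
         SELECT date_key, channel_key,
                SUM(viewer_hours) AS viewer_hours,
                SUM(view_count)   AS view_count
         FROM fact_view
         GROUP BY date_key, channel_key;

         -- Incremental update: recompute only the freshly loaded day (20130307 is a sample key).
         DELETE FROM agg_daily_channel WHERE date_key = 20130307;
         INSERT INTO agg_daily_channel
         SELECT date_key, channel_key, SUM(viewer_hours), SUM(view_count)
         FROM fact_view
         WHERE date_key = 20130307
         GROUP BY date_key, channel_key;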
  9. Staging Database
     ▪ Usually a separate relational database is needed to serve the operational needs of the ETL
     ▪ Data structure is similar to the sources
     ▪ Logical and physical separation of the source systems and the DW
     ▪ Minimizes the impact of the ETL activity on both the source and DW databases
     [Diagram: Staging DB -> DW]
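     A small sketch of what a staging table could look like under these assumptions: it mirrors the source table's structure and only adds a couple of ETL bookkeeping columns (all names are illustrative).

         CREATE TABLE staging_broadcast (
             broadcast_id         INT,
             channel_id           INT,
             started_at           DATETIME,
             total_viewer_seconds BIGINT,
             view_count           INT,
             loaded_at            DATETIME,    -- when this row arrived in staging
             source_system        VARCHAR(50)  -- which source it was extracted from
         );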
 10. Before ETL development
     ▪ Data discovery
       ▪ Profiling (rowcounts, NULL values, dates in different formats or in char columns; see the sketch below)
       ▪ Find master data („system of record”)
     ▪ Data cleansing (endless fight :))
       ▪ Matching similar records -> merge, skip or delete
       ▪ Finding invalid values (based on a dictionary)
       ▪ Data completeness check (do we have every value?)
       ▪ Anomaly checking (out-of-range values, values not matching the usual pattern)
       ▪ Logical data check (applying business rules)
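     A few profiling queries of the kind used in data discovery, run against the hypothetical source table src_broadcast:

         -- Rowcount of the source table.
         SELECT COUNT(*) AS row_count FROM src_broadcast;

         -- NULL check on a key column.
         SELECT COUNT(*) AS missing_channel
         FROM src_broadcast
         WHERE channel_id IS NULL;

         -- Dates stored in a char column: list values that do not parse as dates.
         SELECT started_at_text, COUNT(*) AS occurrences
         FROM src_broadcast
         WHERE STR_TO_DATE(started_at_text, '%Y-%m-%d %H:%i:%s') IS NULL
         GROUP BY started_at_text
         LIMIT 20;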
 11. Determining changed data
     ▪ Audit logging on the source side – extra overhead for sources
     ▪ Built-in change data capture on sources (MySQL doesn’t have one)
     ▪ Using „_history” / „_deleted” tables where they exist in the source
     ▪ Time-driven approach – only the newly created data is loaded (works for facts and some time-driven dimensions)
     ▪ Staging comparison (outer joins of previously and currently loaded data) – resource-intensive on staging, but the only 100% sure solution
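     A sketch of the staging comparison, assuming two snapshot tables (staging_channel_previous and staging_channel_current); MySQL has no FULL OUTER JOIN, so it is emulated with a UNION of two outer joins plus an inner join for changed rows.

         SELECT cur.channel_id, 'insert' AS change_type          -- rows that are new
         FROM staging_channel_current cur
         LEFT JOIN staging_channel_previous prev ON prev.channel_id = cur.channel_id
         WHERE prev.channel_id IS NULL

         UNION ALL

         SELECT prev.channel_id, 'delete'                        -- rows that disappeared
         FROM staging_channel_previous prev
         LEFT JOIN staging_channel_current cur ON cur.channel_id = prev.channel_id
         WHERE cur.channel_id IS NULL

         UNION ALL

         SELECT cur.channel_id, 'update'                         -- rows whose attributes changed
         FROM staging_channel_current cur
         JOIN staging_channel_previous prev ON prev.channel_id = cur.channel_id
         WHERE NOT (cur.channel_name <=> prev.channel_name
                    AND cur.category <=> prev.category);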
 12. Loading dimensions
     ▪ Denormalized (flat) tables to minimize the number of joins
     ▪ Yes, store data redundantly :)
     ▪ Primary key is a surrogate key (a meaningless number)
       ▪ The original key in the source can change (but needs to be stored in the DW)
       ▪ Data mapping between different sources is easier
     ▪ Data in dimensions usually changes slowly, but changes must be followed:
       ▪ Simple update
       ▪ Changed value inserted as a new record (see the sketch below)
       ▪ Old value stored in a separate column
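     A sketch of the "changed value inserted as a new record" strategy (a Type 2 slowly changing dimension), assuming validity columns valid_from / valid_to / is_current and an AUTO_INCREMENT surrogate key on dim_channel; all names are illustrative.

         -- 1) Close the current version of every channel whose tracked attribute changed.
         UPDATE dim_channel d
         JOIN staging_channel_current s ON s.channel_id = d.channel_id
         SET d.valid_to = CURRENT_DATE,
             d.is_current = 0
         WHERE d.is_current = 1
           AND d.channel_name <> s.channel_name;

         -- 2) Insert the new versions (and brand-new channels); the surrogate key
         --    is generated automatically.
         INSERT INTO dim_channel (channel_id, channel_name, category, valid_from, valid_to, is_current)
         SELECT s.channel_id, s.channel_name, s.category, CURRENT_DATE, '9999-12-31', 1
         FROM staging_channel_current s
         LEFT JOIN dim_channel d ON d.channel_id = s.channel_id AND d.is_current = 1
         WHERE d.channel_id IS NULL;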
 13. Loading Fact Tables
     ▪ Fact tables are highly denormalized tables -> a lot of restructuring is needed
     ▪ Yes, store data redundantly :)
     ▪ No foreign key objects implemented in the DW
       ▪ Too much overhead for the loading process
     ▪ Converting the original keys of the data into surrogate keys
       ▪ Performance boost: creating separate lookup tables
     ▪ Managing indexes
       ▪ Indexes are the biggest performance killers for the data load
       ▪ Drop index, load data, rebuild index (sketched below) -> see partitioning on the next slide
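     A minimal sketch of one fact load step under these assumptions: a lookup table lkp_channel maps original source ids to surrogate keys, and the index idx_fact_view_date is dropped for the bulk load and rebuilt afterwards.

         ALTER TABLE fact_view DROP INDEX idx_fact_view_date;            -- drop before the bulk load

         INSERT INTO fact_view (date_key, channel_key, viewer_hours, view_count)
         SELECT s.date_key,
                l.channel_key,                                           -- surrogate key from the lookup table
                s.viewer_hours,
                s.view_count
         FROM staging_view s
         JOIN lkp_channel l ON l.channel_id = s.channel_id;

         ALTER TABLE fact_view ADD INDEX idx_fact_view_date (date_key);  -- rebuild after loading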
 14. Partition-driven loading
     ▪ Partitions allow a table (and its indexes!) to be physically divided into smaller tables to improve performance
     ▪ Partitioning strategy on fact tables: partition by the date key; the new data goes into the last partition
     ▪ Derived data (indexes, aggregates) is easier to rebuild based only on the one changed partition
     [Diagram: Partition Day 1, Partition Day 2, Partition Day 3, …, Partition Day N, Partition Day N+1]
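     A hedged sketch of range partitioning by the date key in MySQL, reusing the illustrative fact table from above; each daily load then only touches the newest partition.

         CREATE TABLE fact_view (
             date_key     INT NOT NULL,
             channel_key  INT NOT NULL,
             viewer_hours DECIMAL(12,2),
             view_count   INT
         )
         PARTITION BY RANGE (date_key) (
             PARTITION p20130305 VALUES LESS THAN (20130306),
             PARTITION p20130306 VALUES LESS THAN (20130307),
             PARTITION p20130307 VALUES LESS THAN (20130308),
             PARTITION pmax      VALUES LESS THAN MAXVALUE
         );

         -- Split a new daily partition off the open-ended one before each load.
         ALTER TABLE fact_view REORGANIZE PARTITION pmax INTO (
             PARTITION p20130308 VALUES LESS THAN (20130309),
             PARTITION pmax      VALUES LESS THAN MAXVALUE
         );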
 15. Let’s do Ustream BI, not just theories… :)
     ▪ What do we have:
       ▪ 3rd party analytics tools
       ▪ Our own (both statistical and operational) DBs and logs + some offline files
       ▪ A huge number of users, channels, broadcasts and views, so a lot of DATA
     ▪ What do we need:
       ▪ An ETL tool with a lot of built-in functions that is scalable: Kettle
       ▪ A DBMS that supports indexing and partitioning to handle the huge amount of data: MySQL, Greenplum
       ▪ A reporting solution: coming later…
       ▪ A lot of work… :) -> data modeling, ETL development, etc.
 16. ETL plans – near and further future
     ▪ Near future
       ▪ Collecting data from local DBs, 3rd party tools, and offline files into the MySQL-based DW
       ▪ Developing the Prio 1 ETL processes necessary for the KRs’ data needs
       ▪ Automated ETL based on the event calendar
     ▪ Further future
       ▪ Consolidating all 3rd party analytical tools into 1-2, with our own event handling
       ▪ Hadoop-based cluster for advanced data processing (e.g. clickstream analysis)
       ▪ Moving the DW to a Greenplum-based database
 17. ETL tool
     ▪ Kettle (Pentaho Data Integration – PDI), http://kettle.pentaho.com/
     ▪ Open source, GPL
     ▪ Broadest functionality among the open source tools
     ▪ A lot of supported inputs: various RDBMS connections, text files, GA, Excel, etc.
     ▪ Both MySQL and Greenplum connections are supported in bulk mode
     ▪ Also strong support for big data stores (Hadoop, HBase, Cassandra, CouchDB, MapReduce, MongoDB)
 18. DataWarehouse data flow, a.k.a. ETL architecture
     [Architecture diagram showing the ETL data flow between Kettle and the systems involved: Solr / Metrics, Conviva, Graphite, Quantcast, comScore, GA, uStreamTv Statistics, Editorial, Ustream Meta, StreamStat, SocialStream, Event calendar, ConnStat, Hadoop]