Introduction to Big Data

Big Data employees are typically paid 20-30% more than their peers.  
Explore the technical realities and challenges within Big Data.  
We will discuss Data Warehouses, Star Schemas, Data Repository Architecture, Slowly Changing Data and several common Big Data terms.
Goal: Enable you to speak intelligently about Big Data


John Kerley-Weeks

May 09, 2014

Transcript

  1. Big Data Introduction
    • Big Data employees are typically paid 20-30% more than their peers.
    • Explore the technical realities and challenges within Big Data.
    • We will discuss Data Warehouses, Star Schemas, Data Repository Architecture, Slowly Changing Data, and several common Big Data terms.
    • Goal: Enable you to speak intelligently about Big Data
    Slides and Feedback at: http://joind.in/11011
    Friday, May 9th, 10:30am-11:20am, SB 139
  2. What is Big Data?
    • Massive volume
      – Facebook generates 10TB of data daily
      – Twitter generates 7TB of data daily
      – 90% of available data was generated in the last 2 years
      – In 2000 there were 800,000,000 TB worldwide
      – http://hosteddocs.ittoolbox.com/zd_wp_big_data_v2.pdf
    • Un-structured or table-friendly data
    • Identifying positive and negative trends
    • Lots of live data coming in very fast!
  3. http://wikibon.org/blog/big-data-statistics/
    • 2.7 Zettabytes of data exist in the digital universe today. (IBM 2012)
    • 1.8 Zettabytes of data created in 2011 (IDC)
    • By 2020, Internet B2B and B2C transactions will exceed 450 B per day. (IDC)
    • Facebook stores, accesses, and analyzes 30+ Petabytes of user-generated data.
    • Akamai analyzes 75 million events per day to better target advertisements.
    • Walmart handles 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes of data.
    • More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
    • In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day.
    The Rapid Growth of Unstructured Data
    • YouTube users upload 48 hours of new video every minute of the day.
    • 100 terabytes of data uploaded daily to Facebook.
  4. Twitter
    • Started as only 140 Characters?
    • As of 2010
      – Stored 7TB/day
      – 17,500 Bytes / tweet
    • As of October 2013
      – 300 Billion tweets sent
      – 500 Million tweets per day average
    • Kept and searchable for how long?
    • Simple analysis queries
      – 1 in 13: number of tweets containing curse words
      – 569 F Bombs in The Wolf of Wall Street
    • http://expandedramblings.com/index.php/march-2013-by-the-numbers-a-few-amazing-twitter-stats/#.U2zgHizjhdg
    • http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010
  5. ====Tweet====
    {
      "contributors": null,
      "coordinates": null,
      "created_at": "Wed Mar 12 21:56:15 +0000 2014",
      "entities": {
        "hashtags": [],
        "symbols": [],
        "urls": [ { "display_url": "jobs.usu.edu/applicants/Cen\u2026",
                    "expanded_url": "http://jobs.usu.edu/applicants/Central?quickFind=59832",
                    "indices": [ 71, 93 ],
                    "url": "http://t.co/zy1s0IAnh0" } ],
        "user_mentions": [ { "id": 810656204, "id_str": "810656204", "indices": [ 3, 16 ],
                             "name": "Alice Dolan", "screen_name": "AliceDolinen" } ]
      },
      "favorite_count": 0,
      "favorited": false,
      "geo": null,
      "id": 443868097514967040,
      "id_str": "443868097514967040",
      "in_reply_to_screen_name": null,
      "in_reply_to_status_id": null,
      "in_reply_to_status_id_str": null,
      "in_reply_to_user_id": null,
      "in_reply_to_user_id_str": null,
      "lang": "en",
      "metadata": { "iso_language_code": "en",
    ====Tweet====
    {
      "contributors": null,
      "coordinates": null,
      "created_at": "Wed Mar 12 20:56:35 +0000 2014",
      "entities": {
        "hashtags": [],
        "symbols": [],
        "urls": [],
        "user_mentions": [ { "id": 869089020, "id_str": "869089020", "indices": [ 0, 13 ],
                             "name": "USU Campus Rec", "screen_name": "USUCampusRec" } ]
      },
      "favorite_count": 1,
      "favorited": false,
      "geo": null,
      "id": 443853080174407680,
      "id_str": "443853080174407680",
      "in_reply_to_screen_name": "USUCampusRec",
      "in_reply_to_status_id": 443852925333274624,
      "in_reply_to_status_id_str": "443852925333274624",
      "in_reply_to_user_id": 869089020,
      "in_reply_to_user_id_str": "869089020",
      "lang": "et",
      "metadata": { "iso_language_code": "et", "result_type": "recent" },
      "place": null,
      "retweet_count": 0,
      "retweeted": false,
      "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
      "text": "@USUCampusRec Brian.isom@aggiemail.usu.edu",
      "truncated": false,
      "user": {
  6. Why Big Data?
    • Will improve business performance by an average of 26-41% over the next 3 years
    • Retail data determines if a customer should be offered a 15% or 25% discount.
    • Automotive: GPS, telematics, and maintenance sensors
    • Healthcare: Big Data could generate up to $100 billion annually
    http://hosteddocs.ittoolbox.com/zd_wp_big_data_v2.pdf
  7. Why Big Data?
    • Companies want to understand their organization and their customers
    • We can collect and store more data now
  8. Problems with Big Data
    • Can provide lots of cool charts
      – What are the Big Questions?
        • How can we attract more customers?
        • Sell more products?
        • Save the world?
    • Organizations don't share data internally
    • Shortage of skilled professionals who understand data analytics
  9. Why Learn Data Warehousing?
    Gartner reported in February 2012:
    • The DBMS market represented $23.2 billion with expected growth of 7%
    • The Data Warehouse market represents about $7 billion of that $23.2 billion
    Dice.com High Paying Skills Salaries:
    • #1 Hadoop $115,062
    • #2 Big Data $113,739
    • #3 NoSQL $113,031
    • #10 ETL $104,777
    • #25 Business Intelligence $101,854
    • #29 Data Warehouse $101,061
    http://marketing.dice.com/pdf/Dice_TechSalarySurvey_2013.pdf
  10. Recommended Reading
    The Data Warehouse Toolkit by Ralph Kimball and Margy Ross
    http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/1118530802/
  11. Why Data Warehousing?
    • Problems
      – Performance – Production DB slowed down
      – Ease of Use – Difficulty for Business Users
    • Solutions
      – Independent storage
      – Historical storage
      – De-normalization of data structures
      – Organize data in a simple and logical layout
    • Communicating in a common language
      – Business: One question, one answer
      – IT Dept.: Common terminology
  12. Data Warehouse Components
    • Operational Source Systems – Business transaction databases
    • Source Extracts
      – Downloads from or ODBC connections to business transaction databases
      – Filters data required for analysis (don't need or want user passwords)
      – May be internal, partner (supply chain), or external (census, zip code)
    • Staging Area
      – Storage area for initial extraction, verification, and transformation
      – Can represent databases anywhere between source and warehouse
    • Operational Data Store (ODS)
      – Aggregates current and historical data
      – Provides structure for simplified, scheduled extraction into warehouse
  13. Data Warehouse Components
    • Data Warehouse
      – Processed data is loaded into the Data Warehouse
      – Contains star schemas that can be queried for ad hoc requirements
      – Contains the lowest "grain" of data available to users
      – Contains conformed dimensions shared by all star schemas
    • Data Marts
      – Subsets of the data warehouse representing a specific business process
      – Provide limited data sets to specific groups (HR, Sales totals)
      – Often one department starts with a data mart which becomes the foundation for the data warehouse
  14. Data Warehouse Components
    • Data Access Tools
      – End user tools to generate reports
      – Cubes, reports, SQL queries, visualization, dashboards, etc.
    • Cubes
      – Provide data aggregation across multiple, often hierarchical, dimensions
      – Dimensions provide "filter by" and "group by" capabilities
    • Reports
      – Standard data views that are frequently generated
      – Often regularly scheduled and emailed or posted
      – Standard configuration allows trend and anomaly detection
    • Ad Hoc
      – SQL queries run directly on the data warehouse
      – Tools often allow simple custom queries to be developed by end users
  15. Metadata
    Describes the state, source, history, and characteristics of data:
    – Definitions: business terms, abbreviations, descriptions
    – Data Source
    – Hierarchical relationships
    – Default Values
    – Extraction and Transformation methods
    – Accuracy, completeness, reliability, and trustworthiness of data
    – Data expiration and refresh
    – Record History and Timestamps
    – Historical Policy and Structure Changes
    – Access Control / Permissions
    – Intellectual Property Limitations and Fees
    – Access Monitoring: Usage, Audit, Diagnostic, Privacy, and Security
  16. Snowflaking
    SKU > Style > Brand > ProductGroup > MerchGroup > MerchDivision
    Snowflake = Normalized
    • If every Dimension Table is normalized, the Dimensional Model begins to look like a crystal or snowflake
  17. Snowflaking
    Single Dimension Table or Hierarchical 3NF?
    • Which is easier to understand?
    • Which is easier to query?
    • Which is faster to query?
    • Which is easier for report users to understand?
  18. Snowflaking
    Advantages of Snowflaking:
    • Imports directly from the transaction database
    • Easier to maintain
    • Easier to identify duplicates
    • Requires less storage space – reduction in redundant data
  19. Snowflaking Disadvantages
    • Increases presentation complexity for end users
      – If reports are not easily built, data is not queried
      – If reports are not easily understood, they are not read
      – If data and reports are not used, the Data Warehouse project has failed
    • Queries require more joins
      – Query performance is impacted
    • Disk space savings are limited
      – Fact Tables are comparatively much larger
    • Data accuracy is maintained
      – The Data Warehouse is compiled from normalized transaction data
    • Don't snowflake unless you must
      – OK if there is a compelling reason, such as data update challenges
  20. Common Problem
    • One row in the fact table = one treatment
    • How do we associate multiple criteria with a single dimension?
      – Multiple diagnoses during a doctor's visit
      – Increasing number of insured family members
      – Cellular family plan with multiple phones
      – Retail order with multiple shipping addresses
      – Trouble ticket with multiple problems
  21. Solutions
    • How do we associate multiple criteria with a single dimension?
      – Create a separate record for each instance
        • Requires a primary key for each instance
        • Requires a common key for each group
      – Many-to-Many Relationship
        • Multiple patients with identical diagnoses
        • Patients with many diagnoses
    • How do we solve many-to-many relationships?
  22. Healthcare Model
    To handle multiple values per fact table record, we introduce a Helper or Bridge table.
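The bridge pattern above can be sketched with a tiny in-memory database. This is a hypothetical illustration, not the presenter's schema: the table and column names (fact_treatment, bridge_diagnosis_group, dim_diagnosis) are invented for the example. One fact row carries a group key, and the bridge maps that group to any number of diagnoses.

```python
import sqlite3

# Hypothetical sketch of the helper/bridge pattern from the slide:
# one visit (fact row) maps to many diagnoses via a diagnosis group.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_diagnosis (diagnosis_key INTEGER PRIMARY KEY, diagnosis_desc TEXT);
CREATE TABLE bridge_diagnosis_group (
    diagnosis_group_key INTEGER,   -- shared by every diagnosis in one visit
    diagnosis_key INTEGER REFERENCES dim_diagnosis(diagnosis_key));
CREATE TABLE fact_treatment (
    treatment_id INTEGER PRIMARY KEY,
    diagnosis_group_key INTEGER,   -- the fact joins the group, not one diagnosis
    billed_amount REAL);
INSERT INTO dim_diagnosis VALUES (1, 'Hypertension'), (2, 'Diabetes'), (3, 'Sprained ankle');
INSERT INTO bridge_diagnosis_group VALUES (10, 1), (10, 2), (11, 3);
INSERT INTO fact_treatment VALUES (100, 10, 250.0), (101, 11, 90.0);
""")
# One fact row reaches all of its diagnoses through the bridge table.
rows = con.execute("""
    SELECT f.treatment_id, d.diagnosis_desc
    FROM fact_treatment f
    JOIN bridge_diagnosis_group b ON f.diagnosis_group_key = b.diagnosis_group_key
    JOIN dim_diagnosis d ON b.diagnosis_key = d.diagnosis_key
    ORDER BY f.treatment_id, d.diagnosis_key""").fetchall()
print(rows)  # [(100, 'Hypertension'), (100, 'Diabetes'), (101, 'Sprained ankle')]
```

Note the design choice: the fact table stays at one row per treatment, and the many-to-many relationship lives entirely in the bridge.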
  23. Common Language
    • Helper Table
      – Solves many-to-many problems
      – Snowflaking reduces the number of shared elements
      – A helper table increases the number of shared elements
    • Similar Terms
      – Bridge
      – Junction Table
      – Joining Table
  24. Attributes with little variation
    IS_VARIANT_ACTIVE  IS_PRODUCT_ACTIVE  IS_DOWNGRADE  IS_GSO  IS_CLONED  IS_PACKAGE
    (row after row of flag values that are almost all 0, with only an occasional 1)
  25. Junk Dimensions
    Junk_ID  IS_VARIANT_ACTIVE  IS_PRODUCT_ACTIVE  IS_DOWNGRADE  IS_GSO  IS_CLONED  IS_PACKAGE
     1       0  0  0  0   0  0
     2       0  0  0  0   0  1
     3       0  0  0  0   1  0
     4       0  0  0  0   2  0
     5       0  0  0  0   3  0
     6       0  0  0  0   4  0
     7       0  0  0  0   5  0
     8       0  0  0  0   6  0
     9       0  0  0  0   7  0
    10       0  0  0  0   8  0
    11       0  0  0  0   9  0
    12       0  0  0  0  10  0
    13       0  0  0  0  14  0
    14       0  0  0  0  18  0
    15       0  0  0  0  19  0
    16       0  0  0  1   0  0
    17       0  1  0  0   0  0
    18       0  1  0  0   0  1
    19       0  1  0  0   1  0
    20       1  1  0  0   0  6
    6 attributes for 5000+ entries can be consolidated to 1 uniquely defining surrogate key in a dimension table with 20 rows
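The consolidation shown above can be sketched in a few lines: collect the distinct combinations of the low-cardinality flags, assign each combination a surrogate key, and let the fact rows carry only that key. The data and field names here are hypothetical stand-ins for the slide's flag columns.

```python
# Sketch (hypothetical data): build a junk dimension by collapsing the
# distinct flag combinations into a small surrogate-keyed lookup table.
fact_rows = [
    {"sale_id": 1, "is_active": 0, "is_package": 0},
    {"sale_id": 2, "is_active": 0, "is_package": 1},
    {"sale_id": 3, "is_active": 0, "is_package": 0},
    {"sale_id": 4, "is_active": 1, "is_package": 0},
]

junk_dim = {}   # (is_active, is_package) -> surrogate junk_id
slim_fact = []  # fact rows now carry only the surrogate key
for row in fact_rows:
    combo = (row["is_active"], row["is_package"])
    junk_id = junk_dim.setdefault(combo, len(junk_dim) + 1)
    slim_fact.append({"sale_id": row["sale_id"], "junk_id": junk_id})

print(junk_dim)   # {(0, 0): 1, (0, 1): 2, (1, 0): 3}
print(slim_fact)  # 4 fact rows point at only 3 junk-dimension rows
```

The same idea scales to the slide's case: 5000+ fact rows reduce to however many distinct flag combinations actually occur.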
  26. Junk Dimensions
    Demographics contain multiple dimensions with few unique values:
    DIM_TIME (PK TIME_KEY; SEMESTER_NAME)
    DIM_COURSE (PK COURSE_KEY; COURSE_NBR, COURSE_NAME)
    DIM_MARITAL_STATUS (PK MARITAL_STATUS_KEY; MARITAL_STATUS_CODE, MARITAL_STATUS_DESC)
    DIM_GENDER (PK GENDER_KEY; GENDER_CODE, GENDER_DESC)
    DIM_AGE (PK AGE_KEY; AGE, AGE_GROUP)
    FACT_COURSE_ENROLLMENT (PK,FK1 TIME_KEY; PK,FK2 COURSE_KEY; PK,FK3 MARITAL_STATUS_KEY; PK,FK5 GENDER_KEY; FK4 AGE_KEY; GRADE_PCT)
  27. Junk Dimensions
    Can be combined into one Demographic Dimension Table:
    DIM_TIME (PK TIME_KEY; SEMESTER_NAME)
    DIM_COURSE (PK COURSE_KEY; COURSE_NBR, COURSE_NAME)
    DIM_DEMOGRAPHIC (PK DEMOGRAPHIC_KEY; MARITAL_STATUS_CODE, MARITAL_STATUS_DESC, GENDER_CODE, GENDER_DESC, AGE_CODE, AGE_DESC)
    FACT_COURSE_ENROLLMENT (PK,FK1 TIME_KEY; PK,FK2 COURSE_KEY; PK,FK3 DEMOGRAPHIC_KEY; GRADE_PCT)
  28. Junk Dimensions
    • Junk dimensions should be less than 20,000 rows
      – Multiply the total number of potential values for each attribute
      – 5 Marital Statuses X 3 Genders X 110 Ages X 7 Age Groups = 11,550, or 1,650 rows? (Age Group is determined by Age, so 5 X 3 X 110 = 1,650)
    • Can be used to collect associated "comments" or optional text
    • In Alan Simon's experience, DBAs hate junk dimensions even more than de-normalization, so use them sparingly
  29. Why a "Junk" Dimension?
    Normally a Database and Kitchen should be organized and free of clutter.
  30. Why a "Junk" Dimension?
    A Junk Drawer collects the clutter
    http://www.doublecoveragefootball.com/2012/06/dcs-newest-feature-friday-junk-drawer.html
    http://ths.gardenweb.com/forums/load/kitchbath/msg1222054422880.html
    http://iheartorganizing.blogspot.com/2011/08/its-here-kitchen-cabinet-tour.html
  31. Slowly Changing Historical Records
    What do you want to track?
    • Only the most recent information
      – Previous information is assumed to be incorrect
      – Birth Date, Social Security Number
      – Last Name?
    • Previous values
      – Previous values needed for historical reports or tracking
      – Previous Address or Company Affiliation
      – Changes in Demographics – Age?
    • Additional Product Inventory
      – Any changes are treated as a new product or customer
      – Quantity of socks in a pack
      – Work vs. Personal Chase Visa Card
    • Combination of the above?
  32. Slowly Changing Dimensions
    • Slowly Changing Dimensions do not typically change daily but may change over time, unpredictably, or never.
    • Three basic techniques:
      – Type 1 SCD: Overwrite the value
      – Type 2 SCD: Add a new dimension row
      – Type 3 SCD: Create a dimension column
  33. Type 1 SCD Example
    Well suited for corrections when historical information does not need to be tracked.
    • Correction to a data entry error
    • Birthdate, Race, and Gender don't typically change
    • Demographic changes may flag purchasing changes
    Customer    Preferred Contact  Address       Phone         Email
    Jane Smith  Phone              123 Maple St  435-555-1212  Jane.smith@gmail.com
    Jane Smith  Phone              123 Maple St  435-757-1212  Jane.smith@gmail.com
  34. Type 1 SCD – Overwrite
    • Old attribute value is replaced with the current value
    • Advantages
      – Fast
      – Simple
    • Disadvantages
      – Loses history of attribute changes
      – Aggregate tables may need to be rebuilt
    • Appropriate for corrections
      – If previous information was incorrect
  35. Type 2 SCD Example
    Utilized to track history of changes.
    • A Common Key (Customer #) is required to correlate changes
    • An Effective Date column may be helpful
    • Is the Fact table associated with SurrogateID or Customer #?
    SurrogateID  Current  Customer    Customer #  Address       Phone         Email
    123          N        Jane Smith  87653       123 Maple St  435-555-1212  Jane.smith@gmail.com
    642          Y        Jane Smith  87653       381 First St  435-757-1212  Jane.smith@gmail.com
  36. Type 2 SCD – Create a New Record
    • Utilized to track history of changes
    • A new dimension row is added to represent the change
      – Effective Date and Current Indicator columns are typically utilized
      – A new surrogate key is created
      – A common key is typically maintained to show correlation (SKU, SSN)
      – The previous record remains unchanged other than the Current Indicator
    • Advantages
      – History is recorded between fact table and dimension table
        • Adding a bigger sized candy bar does not change previous purchase records
      – Aggregate tables do not need to be rebuilt
    • Disadvantages
      – Frequent updates significantly increase dimension table size
      – Why would you track Facts by common vs. surrogate keys?
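The Type 2 mechanics (expire the current row, insert a new version under a fresh surrogate key) can be sketched as below. Column names and sample values loosely follow the Jane Smith example; the helper function is an invented illustration, not a production ETL routine.

```python
import sqlite3

# Type 2 SCD sketch: expire the current row and add a new one with a
# fresh surrogate key; the common key (customer_nbr) stays the same.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    surrogate_id INTEGER PRIMARY KEY,
    customer_nbr INTEGER,      -- common/natural key, never changes
    address TEXT,
    is_current TEXT,
    effective_date TEXT)""")
con.execute("INSERT INTO dim_customer VALUES (123, 87653, '123 Maple St', 'Y', '2010-01-01')")

def apply_type2_change(con, customer_nbr, new_address, change_date):
    # Expire the old version; its history is otherwise untouched.
    con.execute("UPDATE dim_customer SET is_current = 'N' "
                "WHERE customer_nbr = ? AND is_current = 'Y'", (customer_nbr,))
    # Insert the new version; SQLite assigns the next surrogate key.
    con.execute("INSERT INTO dim_customer (customer_nbr, address, is_current, effective_date) "
                "VALUES (?, ?, 'Y', ?)", (customer_nbr, new_address, change_date))

apply_type2_change(con, 87653, '381 First St', '2014-05-09')
history = con.execute("SELECT surrogate_id, address, is_current "
                      "FROM dim_customer ORDER BY surrogate_id").fetchall()
print(history)  # [(123, '123 Maple St', 'N'), (124, '381 First St', 'Y')]
```

Old fact rows keep pointing at surrogate 123, so historical reports still see the Maple St address; new fact rows reference 124.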
  37. Type 3 SCD Example
    Utilized to show two conditions simultaneously, e.g. comparing proposed redistricting plans.
    Current Region  New Region  Customer    Customer #  Address       Phone
    South Towne     Mid Towne   Jane Smith  87653       381 First St  435-757-1212
  38. Type 3 SCD – Add a Dimension Column
    • A column is added to the dimension table to track previous values
    • When a change occurs
      – The Current value is copied to the Previous value
      – The New value replaces the Current value
    • Advantages
      – Easily associates old and new values
      – Married and maiden names
      – Compare changes before and after a reorganization
    • Disadvantages
      – Only records the most recent change
      – A column must be added for each attribute being tracked
      – Aggregate tables may need to be rebuilt
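The copy-then-overwrite step above fits in a single UPDATE, because SQL evaluates the right-hand sides of SET against the row's pre-update values. The region columns follow the redistricting example; the table name is an assumed stand-in.

```python
import sqlite3

# Type 3 SCD sketch: one extra column holds the prior value so current
# and previous states can be compared side by side.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_customer "
            "(customer_nbr INTEGER PRIMARY KEY, current_region TEXT, previous_region TEXT)")
con.execute("INSERT INTO dim_customer VALUES (87653, 'South Towne', NULL)")

# On a change: copy current into previous, then overwrite current.
# Both SET expressions read the ORIGINAL row values, so the order works.
con.execute("""UPDATE dim_customer
               SET previous_region = current_region, current_region = ?
               WHERE customer_nbr = ?""", ('Mid Towne', 87653))
row = con.execute("SELECT * FROM dim_customer").fetchone()
print(row)  # (87653, 'Mid Towne', 'South Towne')
```

A second change would overwrite previous_region, which is exactly the "only records the most recent change" limitation noted on the slide.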
  39. Designing an Integrated Warehouse
    • Business Process Integration
      – Analysis across business units, processes, and databases
      – Data Marts represent Business Processes
      – Often built incrementally, one Business Process at a time
    • Bus Matrix
      – Business Processes + Conformed Dimensions
      – Allows uniform conformed dimensions across Business Processes
      – Common currency, product IDs, definitions
      – Shipping uses pallets, stores use cases, the customer buys cans
  40. Data Warehouse Bus Matrix
    Business Processes / Dimensions: Date, Product, Promotion, Website, Customer, Supplier, Region
    Sales: X X X X X X
    Billing: X X X X X
    Inventory: X X X
    Shipping: X X X X
    Market Analysis: X X X X X X
    Partners: X X X
  41. Conformed Dimensions
    Dimensions may be shared across data marts from the same table or duplicate tables using the same source attributes.
    Consistency must be maintained between:
    • Dimension keys
    • Attribute names
    • Attribute values
    • Attribute units
    • Attribute definitions
  42. Conformed Dimensions
    How large is a quantity of one?
    • Cashier = One can of Sprite
    • Stock Person = One six pack of Sprite
    • Delivery Person = One pallet of Sprite
    • Factory = One truck of Sprite
  43. Drilling Across
    • Drill Across queries can span multiple business processes
    • Queries span multiple data marts, cubes, and reports
      – Drill Through obtains data not available within a cube
    • Conformed dimensions and facts are required
    • Allows detailed drill down and correlations
      – Combine Sales and Shipping to identify product backorder trends
      – Combine Sales and Inventory to identify theft or damage
      – Obtain product descriptions from quantity aggregates
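A drill-across query can be sketched as two fact tables aggregated separately and then joined on a conformed dimension. The Sales-vs-Shipping backorder comparison below uses invented data and names; the pattern (aggregate each fact at the same grain, then join) is the point.

```python
import sqlite3

# Drill-across sketch: Sales and Shipping facts share the conformed
# Product dimension, so their aggregates can be compared side by side.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (product_key INTEGER, qty_sold INTEGER);
CREATE TABLE fact_shipping (product_key INTEGER, qty_shipped INTEGER);
INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Skis');
INSERT INTO fact_sales VALUES (1, 500), (2, 300), (1, 100);
INSERT INTO fact_shipping VALUES (1, 450), (2, 300);
""")
rows = con.execute("""
    SELECT p.product_name, s.sold, sh.shipped, s.sold - sh.shipped AS backordered
    FROM (SELECT product_key, SUM(qty_sold) AS sold
          FROM fact_sales GROUP BY product_key) s
    JOIN (SELECT product_key, SUM(qty_shipped) AS shipped
          FROM fact_shipping GROUP BY product_key) sh
         ON s.product_key = sh.product_key
    JOIN dim_product p ON p.product_key = s.product_key
    ORDER BY p.product_name""").fetchall()
print(rows)  # [('Bikes', 600, 450, 150), ('Skis', 300, 300, 0)]
```

Each fact table is aggregated on its own before the join; joining the raw fact rows directly would multiply rows and inflate the sums.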
  44. Broad Definitions
    Data Warehouse – Data repository for broad queries
    Business Intelligence – Data with which to make decisions
    Data Mart – Data representing a Business Process
    Cube – Self-service reporting tool
    Dashboard – Single visual representation of processes
    Ask two experts, you will get three opinions. Every implementation is different. Clients want one answer. Often product defined (Excel, Oracle, Informatica).
  45. Transactions vs Reports
    An e-commerce store has been processing 300,000 sales per day without incident. Management wants to better understand business sales trends in order to increase product sales. When we started generating reports on all sales per quarter, the 80 requested reports took three days each to run. During report generation, customer transactions took minutes instead of seconds to complete. The CEO has ordered that all report generation be stopped immediately.
  46. OLAP / OLTP Examples
    • OLAP = On-Line Analytical Processing
      – Reports – answers that take more than a few seconds
      – Data Warehouse
      – Star Schema
    • OLTP = On-Line Transaction Processing
      – Production databases used to store individual transactions
      – Add, update, and delete transactions
      – Directly impacted by customer interactions
      – Traditional normalized databases
  47. OLTP vs. OLAP
                          OLTP                                OLAP
    User                  Customer Service, IT Professionals  Decision Makers
    Purpose               Production Business Operations      Report Generation
    Structure             3rd Normal Form                     Star Schema
    Timeline              Current                             Historical
    Data                  Detailed                            Filtered
    Usage                 Short, Simple Transactions          Large Complex Queries
    Access                Read & Write                        Read Mostly
    Records per Query     Tens                                Millions
    Database Size         100 MB-GB                           100 GB-TB
    Measurement Metric    # transactions/minute               Query Response Time
  48. Why Separate OLAP and OLTP?
    • Fast report generation
    • Reduced workload on production systems
    • Centralized repository for departmental data
    • Historical records
    • Single definition
    • Single answer
  49. Transactions
    • Amazon.com
      – Cyber Monday orders
        • 13.7 million in 2010
        • 27 million in 2012
        • 36.8 million in 2013 (1KB/order = 36.8TB)
      – Shipped
        • 15.6 million units shipped in one day
    • Facebook
      – 1.23 billion monthly active users
      – 757 million daily active users
      – 4.5 billion likes
      – Size of user data: 300,000,000 GB (240KB/user)
  50. Why is 3NF valuable?
    • Avoids redundancy
    • Information is entered once
    • Updates occur in one place
    • Fast transactions
    • Fast Add, Modify, Delete
    • Individual rows retrieved quickly
    • However...
      – Slow reports due to JOINs
      – Difficult for database novices to understand
  51. 3NF = 3rd Normal Form
    What is the average GPA of Male and Female students in the College of Education each decade?
  52. Dimensional Model (Star Schema)
    • Dimensional Model = Fact Table + Dimension Tables
    • Benefits of the Dimensional Model
      – Easy for business users to understand
      – Improved query performance
      – Future expansions and changes are simplified
      – Simplified JOIN queries for reports
  53. Star Schema
    Fact elements are generally aggregated:
    • # of Students
    • Sum of Sales
    • Quantity of Sales
    • Average Age
    Dimension elements are generally "filtered by":
    • Region: State, Eastern Sales Division
    • Demographics: Gender, Age
    • Date: Holiday, Weekend, 1st Quarter
    • Product: Promotion, Price, Weight, Season
  54. Star Schema
    Retail Sales Business Process: Given total sales for a large company, how would you divide up the data to better understand it?
    Fact Table = Sales represented by sales receipts
    • Customer, date, product, price, quantity, store
    Dimension Tables = Details and hierarchies of facts
    • Customer = Address, Demographics
    • Date = Weekday, Season, Holiday, Promotions
    • Product = SKU, List Price, Weight, Department
    • Store = Address, Region
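The retail star above can be sketched as a tiny working schema. Columns are abbreviated and the sample rows are invented; the shape (one fact table with foreign keys out to each dimension, queried by joining and grouping on dimension attributes) is the technique the slide describes.

```python
import sqlite3

# Minimal retail star-schema sketch: one fact table, three dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, weekday TEXT, holiday TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, sku TEXT, department TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    quantity INTEGER,
    price REAL);
INSERT INTO dim_date VALUES (20140509, 'Friday', 'N');
INSERT INTO dim_product VALUES (1, 'SKI-001', 'Sporting Goods'), (2, 'PNT-001', 'Clothing');
INSERT INTO dim_store VALUES (1, 'West'), (2, 'East');
INSERT INTO fact_sales VALUES
    (20140509, 1, 1, 3, 299.99), (20140509, 1, 2, 1, 299.99), (20140509, 2, 2, 2, 39.99);
""")
# Typical star query: aggregate a fact, grouped by dimension attributes.
rows = con.execute("""
    SELECT st.region, p.department, SUM(f.quantity) AS units
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_store st ON f.store_key = st.store_key
    GROUP BY st.region, p.department
    ORDER BY st.region, p.department""").fetchall()
print(rows)  # [('East', 'Clothing', 2), ('East', 'Sporting Goods', 1), ('West', 'Sporting Goods', 3)]
```

Every analytical question ("units by region", "sales on holidays") becomes the same join-group-aggregate shape, which is what makes the star easy for report tools to generate against.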
  55. Fact Tables
    • Store business performance measurements
    • Each row represents a specific measurement or activity
    • All measurements in a fact table must be at the same grain
    • Contain foreign keys to dimension tables
    • Most facts are numeric
  56. Fact Tables
    Facts are typically foreign keys or numeric additive, semi-additive, or non-additive values.
    • Additive: Purchase Price
      – Can be added across all dimensions
    • Semi-Additive: % Profit Margin, Age
      – Can be added across some dimensions
    • Non-Additive: Zip Code, SSN
      – Should not be added across any dimensions
  57. Semi-Additive Facts
    • Additive facts can be added across all dimensions
      – Sales Price
    • Semi-additive facts can be added across some dimensions
      – Inventory can be added across locations, but not time
      – Current Bank Balance can be added across customers, but not time
    • Non-additive facts should not be added across any dimensions
      – Totals would be uninformative or useless
      – ZIP Codes and Invoice Numbers are non-additive
    • Which of the following are additive, semi-additive, or non-additive?
      – Sales Quantity
      – Student ID#
      – Patient Weight
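The bank-balance case above can be made concrete with a few hypothetical snapshot rows: summing balances across customers on one day is valid, while summing the same customer across days double-counts, so a "last value" aggregation is used along the time dimension instead.

```python
# Semi-additive sketch (hypothetical balances): balance snapshots add
# across the customer dimension but NOT across the time dimension.
balances = [
    {"day": "2014-05-08", "customer": "Jane", "balance": 100},
    {"day": "2014-05-08", "customer": "Bob",  "balance":  50},
    {"day": "2014-05-09", "customer": "Jane", "balance": 120},
    {"day": "2014-05-09", "customer": "Bob",  "balance":  40},
]

# Valid: add across customers for a single day.
total_may9 = sum(b["balance"] for b in balances if b["day"] == "2014-05-09")
print(total_may9)  # 160

# Adding across days would report 310, which is meaningless;
# take each customer's latest snapshot ("last" aggregation) instead.
latest = {}
for b in sorted(balances, key=lambda b: b["day"]):
    latest[b["customer"]] = b["balance"]  # later days overwrite earlier ones
print(sum(latest.values()))  # 160
```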
  58. Dimension Tables
    • Textual descriptors of the business
    • Often low in cardinality (rows) compared to fact tables
    • 50-100 attributes (columns) are not uncommon
    • Dimension attributes are used as "filter by" constraints
    • Often contain hierarchical components (product > brand > division)
  59. Continuous Refinement
    Dimensional models are often refined over time.
    An attribute may be a fact or a dimension; you may want to filter by it or aggregate it.
    Patient Age
    • Fact: What is the average age of slip and fall patients?
    • Dimension: What is the most common teenage diagnosis?
  60. Dimensional Model Design
    1) Select the Business Process to model
    2) Declare the Grain (detail level) of the Business Process
    3) Choose the Dimensions that apply to each fact table row
    4) Identify the numeric Facts that will populate each fact table row
    Dimensional Model Design: The Data Warehouse Toolkit by Ralph Kimball and Margy Ross
  61. Grain
    http://www.fchn.org/images/paying-for-care/FMH-bill.jpg
    Narrows the scope to a particular task
    • What will be on one form but not others?
      – Admissions Form, Treatment Form, Billing
    • What is the time frame? Previous or current?
      – Patient Weight, Patient Age, Patient History
  62. Fact or Dimension Table?
    Where do the following attributes belong?
    Height, Weight, Temperature, Patient Name, Employer, Diagnosis, Treatment, Primary Doctor, Phone, Address, ZIP, Date, Check-In Time, Check-Out Time, Wait Time, Treatment Time, Room #, Doctor's Phone, Doctor's Address, Doctor's ZIP, Doctor's Office Building
  63. Example Dimensional Model Design
    1) Select the Business Process to model
      • Billing
    2) Declare the Grain (detail level) of the Business Process
      • Patient Bill
    3) Choose the Dimensions that apply to each fact table row
      • Date, Patient, Provider, Treatment
    4) Identify the numeric Facts that will populate each fact table row
      • Price, Insurance Payment
      • Invoice ID, Date ID, Patient ID, Provider ID, Treatment ID
  64. Why Dimensional Models
    • Easy for customers to understand
      – What they can "filter by"
      – What they can aggregate
    • Reports must run quickly
    • Must not slow production systems
    • Often de-normalized
    • Value is only created if reports are used
  65. Date Dimensions
    Date Dimension example attributes:
    – Date
    – Full Date Description
    – Month Number
    – Month Name
    – Month Short Name
    – Day Number in Month
    – Day of Week
    – Day Number in Year
    – Year
    – Fiscal Quarter
    – Fiscal Year
    – Holiday Indicator
    – First Day of Quarter Indicator
    100 years = 36,500 rows
    Multiple hierarchies: Q1 2010, Fall 2013, 1st Monday
    Free prepopulated downloads:
    http://www.kimballgroup.com/data-warehouse-and-business-intelligence-resources/data-warehouse-books/booksmdwt/
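Because every attribute is derivable from the calendar, a date dimension can also be generated in a few lines of stdlib Python rather than downloaded. This sketch uses an abridged attribute set and assumes the common YYYYMMDD integer key format; extend the dict with fiscal and holiday attributes as needed.

```python
from datetime import date, timedelta

# Sketch: generate date-dimension rows from the calendar itself.
def date_dim_row(d: date) -> dict:
    return {
        "DateKey": int(d.strftime("%Y%m%d")),          # assumed YYYYMMDD key format
        "FullDate": d.isoformat(),
        "DayNameOfWeek": d.strftime("%A"),
        "DayOfYear": d.timetuple().tm_yday,
        "WeekdayWeekend": "Weekend" if d.weekday() >= 5 else "Weekday",
        "MonthName": d.strftime("%B"),
        "CalendarYearQtr": f"{d.year}Q{(d.month - 1) // 3 + 1}",
    }

start = date(2000, 1, 1)
rows = [date_dim_row(start + timedelta(days=i)) for i in range(366)]  # 2000 is a leap year
print(rows[0])
# {'DateKey': 20000101, 'FullDate': '2000-01-01', 'DayNameOfWeek': 'Saturday',
#  'DayOfYear': 1, 'WeekdayWeekend': 'Weekend', 'MonthName': 'January',
#  'CalendarYearQtr': '2000Q1'}
```

Holiday indicators are the one attribute that cannot be derived; those come from a business-supplied calendar, which is why prepopulated downloads remain useful.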
  66. Date Dimensions
    DateKey FullDate DateName Holiday DayOfWeek DayNameOfWeek DayOfMonth DayOfYear WeekdayWeekend WeekOfYear MonthName MonthOfYear CalendarYear CalendarYearQtr
    20000101 1/1/2000 2000/01/01 Y 7 Saturday 1 1 Weekend 1 January 1 2000 2000Q1
    20000102 1/2/2000 2000/01/02 N 1 Sunday 2 2 Weekday 2 January 1 2000 2000Q1
    20000103 1/3/2000 2000/01/03 N 2 Monday 3 3 Weekday 2 January 1 2000 2000Q1
    20000104 1/4/2000 2000/01/04 N 3 Tuesday 4 4 Weekday 2 January 1 2000 2000Q1
    20000105 1/5/2000 2000/01/05 N 4 Wednesday 5 5 Weekday 2 January 1 2000 2000Q1
    20000106 1/6/2000 2000/01/06 N 5 Thursday 6 6 Weekday 2 January 1 2000 2000Q1
    20000107 1/7/2000 2000/01/07 N 6 Friday 7 7 Weekend 2 January 1 2000 2000Q1
    20000108 1/8/2000 2000/01/08 N 7 Saturday 8 8 Weekend 2 January 1 2000 2000Q1
    20000109 1/9/2000 2000/01/09 N 1 Sunday 9 9 Weekday 3 January 1 2000 2000Q1
    20000110 1/10/2000 2000/01/10 N 2 Monday 10 10 Weekday 3 January 1 2000 2000Q1
    20000111 1/11/2000 2000/01/11 N 3 Tuesday 11 11 Weekday 3 January 1 2000 2000Q1
    20000112 1/12/2000 2000/01/12 N 4 Wednesday 12 12 Weekday 3 January 1 2000 2000Q1
    20000113 1/13/2000 2000/01/13 N 5 Thursday 13 13 Weekday 3 January 1 2000 2000Q1
    20000114 1/14/2000 2000/01/14 N 6 Friday 14 14 Weekend 3 January 1 2000 2000Q1
    20000115 1/15/2000 2000/01/15 N 7 Saturday 15 15 Weekend 3 January 1 2000 2000Q1
    20000116 1/16/2000 2000/01/16 N 1 Sunday 16 16 Weekday 4 January 1 2000 2000Q1
    20000117 1/17/2000 2000/01/17 Y 2 Monday 17 17 Weekday 4 January 1 2000 2000Q1
    20000118 1/18/2000 2000/01/18 N 3 Tuesday 18 18 Weekday 4 January 1 2000 2000Q1
    20000119 1/19/2000 2000/01/19 N 4 Wednesday 19 19 Weekday 4 January 1 2000 2000Q1
    20000120 1/20/2000 2000/01/20 N 5 Thursday 20 20 Weekday 4 January 1 2000 2000Q1
    20000121 1/21/2000 2000/01/21 N 6 Friday 21 21 Weekend 4 January 1 2000 2000Q1
    20000122 1/22/2000 2000/01/22 N 7 Saturday 22 22 Weekend 4 January 1 2000 2000Q1
    20000123 1/23/2000 2000/01/23 N 1 Sunday 23 23 Weekday 5 January 1 2000 2000Q1
    20000124 1/24/2000 2000/01/24 N 2 Monday 24 24 Weekday 5 January 1 2000 2000Q1
    20000125 1/25/2000 2000/01/25 N 3 Tuesday 25 25 Weekday 5 January 1 2000 2000Q1
    20000126 1/26/2000 2000/01/26 N 4 Wednesday 26 26 Weekday 5 January 1 2000 2000Q1
    20000127 1/27/2000 2000/01/27 N 5 Thursday 27 27 Weekday 5 January 1 2000 2000Q1
    20000128 1/28/2000 2000/01/28 N 6 Friday 28 28 Weekend 5 January 1 2000 2000Q1
    20000129 1/29/2000 2000/01/29 N 7 Saturday 29 29 Weekend 5 January 1 2000 2000Q1
    20000130 1/30/2000 2000/01/30 N 1 Sunday 30 30 Weekday 6 January 1 2000 2000Q1
    20000131 1/31/2000 2000/01/31 N 2 Monday 31 31 Weekday 6 January 1 2000 2000Q1
    20000201 2/1/2000 2000/02/01 N 3 Tuesday 1 32 Weekday 6 February 2 2000 2000Q1
    20000202 2/2/2000 2000/02/02 N 4 Wednesday 2 33 Weekday 6 February 2 2000 2000Q1
    20000203 2/3/2000 2000/02/03 N 5 Thursday 3 34 Weekday 6 February 2 2000 2000Q1
    20000204 2/4/2000 2000/02/04 N 6 Friday 4 35 Weekend 6 February 2 2000 2000Q1
    20000205 2/5/2000 2000/02/05 N 7 Saturday 5 36 Weekend 6 February 2 2000 2000Q1
    20000206 2/6/2000 2000/02/06 N 1 Sunday 6 37 Weekday 7 February 2 2000 2000Q1
    20000207 2/7/2000 2000/02/07 N 2 Monday 7 38 Weekday 7 February 2 2000 2000Q1
    20000208 2/8/2000 2000/02/08 N 3 Tuesday 8 39 Weekday 7 February 2 2000 2000Q1
    20000209 2/9/2000 2000/02/09 N 4 Wednesday 9 40 Weekday 7 February 2 2000 2000Q1
  67. Kimball vs Inmon: Bottom-Up Approach
    Ralph Kimball's paradigm: The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
    "...The data warehouse is nothing more than the union of all the data marts...," Ralph Kimball, December 29, 1997.
  68. Kimball vs Inmon: Top-Down Approach
    Bill Inmon's paradigm: The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.
    "You can catch all the minnows in the ocean and stack them together and they still do not make a whale," Bill Inmon, January 8, 1998.
  69. 90.

    There is no right or wrong between these two ideas,

    as they represent different data warehousing philosophies. In reality, the data warehouse in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse. http://www.1keydata.com/datawarehousing/inmon-kimball.html http://www.nagesh.com/publications/technology/173-inmon-vs-kimball-an-analysis.html Kimball vs Inmon 90
  70. Kimball vs Inmon
    Whose methodology is more popular? Dice.com technical job listing board:
    • 85,000 job listings, average turnover 14 days
    • 90-118 job listings for Kimball
    • 33-39 for Inmon
    • 33-37 for both Inmon and Kimball
  73. Trimmed Sample Data
    Day  Weekend  Zip    State  Region  Division        Group        Quantity
    Thu  No       95014  CA     West    Sporting Goods  Skis         300
    Thu  No       90210  CA     West    Sporting Goods  Skis         400
    Fri  No       19090  PA     East    Sporting Goods  Bikes        200
    Fri  No       20500  DC     East    Sporting Goods  Bikes        600
    Sat  Yes      19090  PA     East    Sporting Goods  Bikes        100
    Sun  Yes      95014  CA     West    Sporting Goods  Skis         800
    Sun  Yes      20500  DC     East    Sporting Goods  Bikes        200
    Sun  Yes      19090  PA     East    Clothing        Men's Pants  100
    …    …        …      …      …       …               …            …
    • Note Hierarchies: Day > Weekend; Zip > State > Region
  74. Cubes
    • Slice
    • Dice
    • Roll Up
    • Drill Down
    • Drill Through
    • Pivot
  76. Slice of Bike Data
         95014  90210  20500  19090
    THU  300    200    400    400
    FRI  200    300    200    100
    SAT  100    100    300    300
    SUN  400    400    100    200
    SUM of 16 cells associated with Bike Sales = 4000
  78. Roll Up: Sporting Goods (Bikes and Skis)
    Bikes  95014  90210  20500  19090
    THU    300    200    400    400
    FRI    200    300    200    100
    SAT    100    100    300    300
    SUN    400    400    100    200

    Skis   95014  90210  20500  19090
    THU    400    400    100    200
    FRI    100    100    300    300
    SAT    200    300    200    100
    SUN    300    200    400    400
    SUM of 32 cells associated with Sporting Goods Sales = 8000
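A roll-up is just aggregation at a coarser level of a hierarchy, so in SQL it is a GROUP BY on the higher-level attribute. This sketch uses a few invented rows in the shape of the "Trimmed Sample Data" slide (Group rolls up to Division); the column is named grp because GROUP is a reserved word.

```python
import sqlite3

# Roll-up sketch: aggregate sales at successively coarser levels of the
# product hierarchy (Group -> Division) with GROUP BY.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day TEXT, division TEXT, grp TEXT, quantity INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Thu", "Sporting Goods", "Skis", 300), ("Thu", "Sporting Goods", "Skis", 400),
    ("Fri", "Sporting Goods", "Bikes", 200), ("Fri", "Sporting Goods", "Bikes", 600),
    ("Sun", "Clothing", "Men's Pants", 100),
])
by_group = con.execute(
    "SELECT grp, SUM(quantity) FROM sales GROUP BY grp ORDER BY grp").fetchall()
by_division = con.execute(
    "SELECT division, SUM(quantity) FROM sales GROUP BY division ORDER BY division").fetchall()
print(by_group)     # [('Bikes', 800), ("Men's Pants", 100), ('Skis', 700)]
print(by_division)  # [('Clothing', 100), ('Sporting Goods', 1500)]
```

Drill down is the reverse direction: re-running the query at a finer grouping level (Division back down to Group, or Group down to individual days).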
  83. Pivot Tables
    Move unique row information into expanded column header data.
    Zip Code  Product  Day  Sales
    95010     Skis     Sat  2
    95010     Skis     Sun  0
    95010     Bikes    Sat  4
    95010     Bikes    Sun  0
    95014     Skis     Sat  5
    95014     Skis     Sun  3
    95014     Bikes    Sat  1
    95014     Bikes    Sun  9

    Zip Code  Product  Sat Sales  Sun Sales
    95010     Skis     2          0
    95010     Bikes    4          0
    95014     Skis     5          3
    95014     Bikes    1          9
  84. Pivot Tables
    Often aggregate (combine) hierarchical data.
    Zip Code  Product  Sat Sales  Sun Sales
    95010     Skis     2          0
    95010     Bikes    4          0
    95014     Skis     5          3
    95014     Bikes    1          9

    Zip Code  Product         Sat Sales  Sun Sales
    95010     Sporting Goods  6          0
    95014     Sporting Goods  6          12