Slide 1

Slide 1 text

Big Data Introduction

Slide 2

Slide 2 text

Big Data Introduction • Big Data employees are typically paid 20-30% more than their peers. • Explore the technical realities and challenges within Big Data. • We will discuss Data Warehouses, Star Schemas, Data Repository Architecture, Slowly Changing Data and several common Big Data terms. • Goal: Enable you to speak intelligently about Big Data Slides and Feedback at: http://joind.in/11011 Friday, may 9th, 10:30am-11:20am SB 139 2

Slide 3

Slide 3 text

What is Big Data? • Massive volume – Facebook generates 10TB of data daily – Twitter generates 7TB of data daily – 90% of available data generated in last 2 years – In 2000 there were 800,000,000 TB world wide – http://hosteddocs.ittoolbox.com/zd_wp_big_data_v2.pdf • Un-structured or table friendly data • Identifying Positive and Negative Trends • Lots of live data coming in very Fast! 3 Slides and Feedback at: http://joind.in/11011

Slide 4

Slide 4 text

http://wikibon.org/blog/big-data-statistics/ • 2.7 Zettabytes of data exist in the digital universe today. (IBM 2012) • 1.8 Zettabytes of data created in 2011 (IDC) • By 2020, Internet B2B and B2C transactions will exceed 450 B per day. (IDC) • Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data. • Akamai analyzes 75 million events per day to better target advertisements. • Walmart handles 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes of data. • More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide. • In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day. The Rapid Growth of Unstructured Data • YouTube users upload 48 hours of new video every minute of the day. • 100 terabytes of data uploaded daily to Facebook. 4 Slides and Feedback at: http://joind.in/11011

Slide 5

Slide 5 text

Twitter • Started as only 140 Characters? • As of 2010 – Stored 7TB/day – 17,500 Bytes / tweet • As of October 2013 – 300 Billion tweets sent – 500 Million tweets per day average • Kept and Searchable for how long? • Simple Analysis Query – 1 in 13 Number of tweets containing curse words – 569 F Bombs in The Wolf of Wall Street • http://expandedramblings.com/index.php/march-2013-by-the-numbers-a-few-amazing- twitter-stats/#.U2zgHizjhdg • http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010 5

Slide 6

Slide 6 text

====Tweet==== { "contributors": null, "coordinates": null, "created_at": "Wed Mar 12 21:56:15 +0000 2014", "entities": { "hashtags": [], "symbols": [], "urls": [ { "display_url": "jobs.usu.edu/applicants/Cen\u2026", "expanded_url": "http://jobs.usu.edu/applicants/Central?quickFind=59832", "indices": [ 71, 93 ], "url": "http://t.co/zy1s0IAnh0" } ], "user_mentions": [ { "id": 810656204, "id_str": "810656204", "indices": [ 3, 16 ], "name": "Alice Dolan", "screen_name": "AliceDolinen" } ] }, "favorite_count": 0, "favorited": false, "geo": null, "id": 443868097514967040, "id_str": "443868097514967040", "in_reply_to_screen_name": null, "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "lang": "en", "metadata": { "iso_language_code": "en", ====Tweet==== { "contributors": null, "coordinates": null, "created_at": "Wed Mar 12 20:56:35 +0000 2014", "entities": { "hashtags": [], "symbols": [], "urls": [], "user_mentions": [ { "id": 869089020, "id_str": "869089020", "indices": [ 0, 13 ], "name": "USU Campus Rec", "screen_name": "USUCampusRec" } ] }, "favorite_count": 1, "favorited": false, "geo": null, "id": 443853080174407680, "id_str": "443853080174407680", "in_reply_to_screen_name": "USUCampusRec", "in_reply_to_status_id": 443852925333274624, "in_reply_to_status_id_str": "443852925333274624", "in_reply_to_user_id": 869089020, "in_reply_to_user_id_str": "869089020", "lang": "et", "metadata": { "iso_language_code": "et", "result_type": "recent" }, "place": null, "retweet_count": 0, "retweeted": false, "source": "Twitter for Android", "text": "@USUCampusRec [email protected]", "truncated": false, "user": { 6

Slide 7

Slide 7 text

Twitter • http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010 7 Slides and Feedback at: http://joind.in/11011

Slide 8

Slide 8 text

Why Big Data? • Will improved business performance by an average of 26-41% over the next 3 years • Retail data determines if a customer should be offered a 15% or 25% discount. • Automotive GPS, telematics, and maintenance sensors • Healthcare Big Data could generate up to $100 billion annually http://hosteddocs.ittoolbox.com/zd_wp_big_data_v2.pdf 8 Slides and Feedback at: http://joind.in/11011

Slide 9

Slide 9 text

Why Big Data? • Companies want to understand their organization and their customers • We can collect and store more data now 9 Slides and Feedback at: http://joind.in/11011

Slide 10

Slide 10 text

Problems with Big Data • Can provide lots of cool charts – What are the Big Questions? • How can we attract more customers? • Sell more products? • Save the world? • Organizations don’t share data internally • Skilled Professionals who understand data analytics 10 Slides and Feedback at: http://joind.in/11011

Slide 11

Slide 11 text

Why Learn Data Warehousing? Gartner reported in February 2012 • The DBMS market represented $23.2 billion with expected growth of 7% • The Data Warehouse market represents about $7 billion of that $23.2 billion Dice.com High Paying Skills Salaries: • #1 Hadoop $115,062 • #2 Big Data $113,739 • #3 NoSQL $113,031 • #10 ETL $104,777 • #25 Business Intelligence $101,854 • #29 Data Warehouse $101,061 http://marketing.dice.com/pdf/Dice_TechSalarySurvey_2013.pdf 11

Slide 12

Slide 12 text

Recommended Reading The Data Warehouse Toolkit by Ralph Kimball and Margy Ross http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/1118530802/ 12

Slide 13

Slide 13 text

Data Warehouse Structures What is a Data Warehouse? 13

Slide 14

Slide 14 text

Why Data Warehousing? • Problems – Performance – Production DB slowed down – Ease of Use – Difficulty for Business Users • Solutions – Independent storage – Historical storage – De-normalization of data structures – Organize data in a simple and logical layout • Communicating in a common language – Business: One question, one answer – IT Dept.: Common terminology 14

Slide 15

Slide 15 text

Data Warehouse Structures 15

Slide 16

Slide 16 text

Data Warehouse Components • Operational Source Systems - Business transaction databases • Source Extracts – Downloads from or ODBC connections to business transaction databases – Filters data required for analysis (don’t need or want user passwords) – May be internal, partner (supply chain), or external (census, zip code) • Staging Area – Storage area for initial extraction, verification, and transformation. – Can represent databases anywhere between source and warehouse • Operational data store (ODS) – Aggregates current and historical data – Provides structure for simplified, scheduled extraction into warehouse 16 Slides and Feedback at: http://joind.in/11011

Slide 17

Slide 17 text

Data Warehouse Components • Data Warehouse – Processed data is Loaded into Data Warehouse – Contains star-schemas that can be queried for ad hoc requirements – Contains lowest “grain” of data available to users – Contains conformed dimensions shared by all star schemas • Data Marts – Subsets of data warehouse representing a specific business process – Provides limited data sets to specific groups (HR, Sales totals) – Often one department starts with a data mart which becomes the foundation for the data warehouse 17

Slide 18

Slide 18 text

Data Warehouse Structures 18

Slide 19

Slide 19 text

Data Warehouse Components Data Access Tools – End user tools to generate reports – Cubes, reports, SQL queries, visualization, dashboards, etc. Cubes – Provides data aggregation across multiple, often hierarchical, dimensions. Dimensions provide “filter by” and “group by” capabilities. Reports – Standard data views that are frequently generated – Often regularly scheduled and emailed or posted – Standard configuration allows trend and anomaly detection Ad Hoc – SQL query run directly on data warehouse – Tools often provide simple custom queries to be developed by end users 19

Slide 20

Slide 20 text

Metadata Describes the state, source, history, and characteristics of data – Definitions: business terms, abbreviations, descriptions – Data Source – Hierarchical relationships – Default Values – Extraction and Transformation methods – Accuracy, completeness, reliability, and trustworthiness of data – Data expiration and refresh – Record History and Timestamps – Historical Policy and Structure Changes – Access Control / Permissions – Intellectual Property Limitations and Fees – Access Monitoring: Usage, Audit, Diagnostic, Privacy, and Security 20

Slide 21

Slide 21 text

Restructuring Data 21 Slides and Feedback at: http://joind.in/11011

Slide 22

Slide 22 text

Snowflake Hierarchy 22 Slides and Feedback at: http://joind.in/11011

Slide 23

Slide 23 text

Snowflaking SKU>Style>Brand>ProductGroup>MerchGroup>MerchDivision Snowflake = Normalized • If every Dimension Table is Normalized the Dimensional Model begins to look like a crystal or snowflake 23

Slide 24

Slide 24 text

Snowflaking Single Dimension Table or Hierarchical 3NF? • Which is easier to understand? • Which is easier to query? • Which is faster to query? • Which is easier for report users to understand? 24

Slide 25

Slide 25 text

Snowflaking Advantages of Snowflaking: • Imports directly from transaction database • Easier to maintain • Easier to identify duplicates • Requires less storage space – Reduction in Redundant data

Slide 26

Slide 26 text

Snowflaking Disadvantages Increases presentation complexity for end users. – If reports are not easily built data is not queried – If reports are not easily understood they are not read – If data and reports are not used Data Warehouse project has failed. Queries require more joins – Query performance is impacted Disk space savings – Fact Tables are comparatively much larger Data accuracy is maintained – Data Warehouse is compiled from normalized transaction data Don’t snowflake unless you must – OK, If there is a compelling reason such as data update challenges. 26 Slides and Feedback at: http://joind.in/11011

Slide 27

Slide 27 text

Common Problem • One row in fact table = one treatment. • How do we associate multiple criteria to a single dimension? – Multiple diagnosis during a doctor’s visit – Increasing number of insured family members – Cellular Family Plan with multiple phones – Retail order with multiple shipping addresses – Trouble ticket with multiple problems 27

Slide 28

Slide 28 text

Solutions • How do we associate multiple criteria to a single dimension? – Create a separate record for each instance • Requires a primary key for each instance • Requires a common key for each group – Many to Many Relationship • Multiple patients with identical diagnosis • Patients with many diagnosis • How do we solve many-to-many relationships? 28

Slide 29

Slide 29 text

Healthcare Model To handle multiple values per fact table record, we introduce a Helper or Bridge table. 29

Slide 30

Slide 30 text

Common Language • Helper Table – Solves Many to Many Problems – Snowflaking reduces number of shared elements – Helper table increases number of shared elements • Similar Terms – Bridge – Junction Table – Joining Table 30

Slide 31

Slide 31 text

Common Problem Data Warehouses start out organized… 31

Slide 32

Slide 32 text

Common Problem …but quickly become cluttered 32

Slide 33

Slide 33 text

IS_VARIANT_ACTIVE IS_PRODUCT_ACTIVE IS_DOWNGRADE IS_GSO IS_CLONED IS_PACKAGE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Attributes with little variation

Slide 34

Slide 34 text

Junk Dimensions Junk_ID IS_VARIANT_ACTIVE IS_PRODUCT_ACTIVE IS_DOWNGRADE IS_GSO IS_CLONED IS_PACKAGE 1 0 0 0 0 0 0 2 0 0 0 0 0 1 3 0 0 0 0 1 0 4 0 0 0 0 2 0 5 0 0 0 0 3 0 6 0 0 0 0 4 0 7 0 0 0 0 5 0 8 0 0 0 0 6 0 9 0 0 0 0 7 0 10 0 0 0 0 8 0 11 0 0 0 0 9 0 12 0 0 0 0 10 0 13 0 0 0 0 14 0 14 0 0 0 0 18 0 15 0 0 0 0 19 0 16 0 0 0 1 0 0 17 0 1 0 0 0 0 18 0 1 0 0 0 1 19 0 1 0 0 1 0 20 1 1 0 0 0 6 attributes for 5000+ entries can be consolidated to 1 uniquely defining surrogate key in a dimension table with 20 rows

Slide 35

Slide 35 text

Junk Dimensions Demographics Contain multiple dimensions with few unique values DIM_TIME PK TIME_KEY SEMESTER_NAME DIM_COURSE PK COURSE_KEY COURSE_NBR COURSE_NAME DIM_MARITAL_STATUS PK MARITAL_STATUS_KEY MARITAL_STATUS_CODE MARITAL_STATUS_DESC FACT_COURSE_ENROLLMENT PK,FK1 TIME_KEY PK,FK2 COURSE_KEY PK,FK3 MARITAL_STATUS_KEY PK,FK5 GENDER_KEY FK4 AGE_KEY GRADE_PCT DIM_AGE PK AGE_KEY AGE AGE_GROUP DIM_GENDER PK GENDER_KEY GENDER_CODE GENDER_DESC

Slide 36

Slide 36 text

Junk Dimensions Can be combined into one Demographic Dimension Table DIM_TIME PK TIME_KEY SEMESTER_NAME DIM_COURSE PK COURSE_KEY COURSE_NBR COURSE_NAME DIM_DEMOGRAPHIC PK DEMOGRAPHIC_KEY MARITAL_STATUS_CODE MARITAL_STATUS_DESC GENDER_CODE GENDER_DESC AGE_CODE AGE_DESC FACT_COURSE_ENROLLMENT PK,FK1 TIME_KEY PK,FK2 COURSE_KEY PK,FK3 DEMOGRAPHIC_KEY GRADE_PCT

Slide 37

Slide 37 text

Junk Dimensions • Junk dimensions should be less than 20,000 rows. – Multiply total number of potential values for each attribute – 5 Marital Status X 3 Genders X 110 Ages X 7 Age Groups = 11,550 or 1,650 Rows? • Can be used to collect associated “comments” or optional text. • In Alan Simon’s experience, DBA’s hate junk dimensions even more than de-normalization, so use them sparingly. 37

Slide 38

Slide 38 text

Why a “Junk” Dimension? Normally a Database and Kitchen should be organized and free of clutter Slides and Feedback at: http://joind.in/11011

Slide 39

Slide 39 text

Why a “Junk” Dimension? A Junk Drawer collects the clutter http://www.doublecoveragefootball.com/2012/06/dcs-newest-feature-friday-junk-drawer.html http://ths.gardenweb.com/forums/load/kitchbath/msg1222054422880.html http://iheartorganizing.blogspot.com/2011/08/its-here-kitchen-cabinet-tour.html

Slide 40

Slide 40 text

Slowly Changing Historical Records 40 Slides and Feedback at: http://joind.in/11011

Slide 41

Slide 41 text

Slowly Changing Historical Records What do you want to track? • Only the most recent information – Previous information is assumed to be incorrect – Birth Date, Social Security Number – Last Name? • Previous Values – Previous Values needed for historical reports or tracking – Previous Address or Company Affiliation – Changes in Demographics – Age? • Additional Product Inventory – Any Changes are treated as a new product or customer – Quantity of socks in a pack – Work vs. Personal Chase Visa Card • Combination of above? 41

Slide 42

Slide 42 text

• Slowly Changing Dimensions do not typically change daily but may change over time, unpredictably, or never. • Three basic techniques: – Type 1 SCD: Overwrite the value – Type 2 SCD: Add a new dimension row – Type 3 SCD: Create a dimension column Slowly Changing Dimensions 42

Slide 43

Slide 43 text

Well suited for corrections when historical information does not need to be tracked. Correction to data entry error. Birthdate, Race, and Gender don’t typically change. Demographic changes may flag purchasing changes Type 1 SCD Example Customer Preferred Contact Address Phone Email Jane Smith Phone 123 Maple St 435-555-1212 [email protected] Jane Smith Phone 123 Maple St 435-757-1212 [email protected] 43

Slide 44

Slide 44 text

Type 1 SCD - Overwrite • Old attribute value is replaced with current value • Advantages – Fast – Simple • Disadvantages – Loses history of attribute changes – Aggregate tables may need to be rebuilt • Appropriate for corrections – If previous information was incorrect 44 Slides and Feedback at: http://joind.in/11011

Slide 45

Slide 45 text

Utilized to track history of changes Common Key (Customer#) is required to correlate changes. Effective Date column may be helpful Is Fact table associated with SurrogateID or Customer#? Type 2 SCD Example SurrogateID Current Customer Customer # Address Phone Email 123 N Jane Smith 87653 123 Maple St 435-555-1212 [email protected] 642 Y Jane Smith 87653 381 First St 435-757-1212 [email protected] 45

Slide 46

Slide 46 text

Type 2 SCD – Create a New Record • Utilized to track history of changes • A new dimension row is added to represent the change – Effective Date and Current Indicator columns are typically utilized – A new surrogate key is created – A common key is typically maintained to show correlation (SKU, SSN) – Previous record remains unchanged other than Current Indicator • Advantages – History is recorded between fact table and dimension table • Adding a bigger sized candy bar does not change previous purchase records – Aggregate tables do not need to be rebuilt • Disadvantages – Frequent Updates significantly increase dimension table size – Why would you track Facts by common vs. surrogate keys? 46

Slide 47

Slide 47 text

Utilized to show two conditions simultaneously Compare proposed redistricting plans Type 3 SCD Example Current Region New Region Customer Customer # Address Phone South Towne Mid Towne Jane Smith 87653 381 First St 435-757-1212 47

Slide 48

Slide 48 text

Type 3 SCD – Add a Dim Column • Column added to dimension table to track previous values • When a change occurs – Current value is copied to Previous value – New value replaces Current value • Advantages – Easily associates old and new values – Married and maiden names – Compare changes before and after reorganization • Disadvantages – Only records most recent change – A column must be added for each attribute being tracked – Aggregate tables may need to be rebuilt 48

Slide 49

Slide 49 text

Designing an Integrated Warehouse • Business Processes Integration – Analysis across business units, processes, and databases – Data Marts represent Business Processes – Often built incrementally, one Business Process at a time • Bus Matrix – Business Processes + Conformed Dimensions – Allows uniform conformed across Business Processes – Common currency, product IDs, definitions – Shipping uses pallets, stores use cases, customer buys cans 49

Slide 50

Slide 50 text

Data Warehouse Bus Matrix Business Processes/ Dimension Date Product Promotion Website Customer Supplier Region Sales X X X X X X Billing X X X X X Inventory X X X Shipping X X X X Market Analysis X X X X X X Partners X X X 50

Slide 51

Slide 51 text

Conformed Dimensions Dimensions may be shared across data marts from the same table or duplicate tables using the same source attributes. Consistency must be maintained between: • Dimension keys • Attribute names • Attribute values • Attribute units • Attribute definitions 51 Slides and Feedback at: http://joind.in/11011

Slide 52

Slide 52 text

Conformed Dimensions How large is a quantity of one? • Cashier = One can of Sprite • Stock Person = One six pack of Sprite • Delivery Person = One pallet of Sprite • Factory = One truck of Sprite 52

Slide 53

Slide 53 text

Drilling Across • Drill Across queries can span multiple business processes • Queries span across multiple data marts, cubes, and reports – Drill Through obtains data not available within a cube • Conformed dimensions and facts are required • Allow detailed drill down and correlations – Combine Sales and Shipping identity backorder trends products – Combine Sales and Inventory to identify theft or damage – Obtains product descriptions from quantity aggregates 53

Slide 54

Slide 54 text

Data Warehouse – Data repository for broad queries Business Intelligence – Data with which to make decisions Data Mart – Data representing a Business Process Cube – Self-service reporting tool Dashboard – Single visual representation of processes Broad Definitions Ask two experts, you will get three opinions. Every implementation is different. Clients want one answer. Often product defined (Excel, Oracle, Informatica) 54

Slide 55

Slide 55 text

Transactions vs Reports 55 An e-commerce store has been processing 300,000 sales per day without incident. Management wants to better understand their business sales trends to increase product sales. When we started generating reports on all sales per quarter the 80 requested reports took three days each to run. During report generation customer transactions took minutes instead of seconds to complete. The CEO has ordered that all report generations be stopped immediately.

Slide 56

Slide 56 text

OLAP / OLTP Examples • OLAP = On-Line Analytical Processing – Reports – Answers that take more than a few seconds – Data Warehouse – Star Schema • OLTP = On-Line Transaction Processing – Production databases used to store individual transactions. – Add, update, and delete transactions – Directly impacted by customer interactions – Traditional normalized databases 56 Slides and Feedback at: http://joind.in/11011

Slide 57

Slide 57 text

OLTP vs. OLAP OLTP OLAP User Customer Service IT Professionals Decision Makers Purpose Production Business Operations Report Generation Structure 3rd Normal Form Star Schema Timeline Current Historical Data Detailed Filtered Usage Short, Simple Transactions Large Complex Queries Access Read & Write Read Mostly Records per Query Tens Millions Database Size 100 MB-GB 100 GB-TB Measurement Metric # transactions/minute Query Response Time 57

Slide 58

Slide 58 text

• Fast Report Generation • Workload on Production Systems • Centralized Repository for Departmental Data • Historical Records • Single Definition • Single Answer Why separate OLAP and OLTP 58

Slide 59

Slide 59 text

Transactions • Amazon.com – Cyber Monday • 13.7 million 2010 • 27 million orders 2012 • 36.8 million order 2013 (1KB/order = 36.8TB) – Shipped • 15.6 million units shipped in one day • Facebook – 1.23 billion Facebook monthly active users – 757 million daily active users – 4.5 billion likes – Size of user data 300,000,000 GB (240KB/user) 59

Slide 60

Slide 60 text

Star Schema = Dimensional Model = Fact and Dimension Tables 60 Data Warehouse Structures

Slide 61

Slide 61 text

Data Warehouse Structures 61

Slide 62

Slide 62 text

Star Schema vs 3rd Normal Form 62 Slides and Feedback at: http://joind.in/11011

Slide 63

Slide 63 text

Why is 3NF valuable?

Slide 64

Slide 64 text

• Avoids redundancy • Information is entered once • Updates occur in one place • Fast transactions • Fast Add, Modify, Delete • Individual rows retrieved quickly • However… • Slow reports due to JOINs • Difficult for database novices to understand Why is 3NF valuable?

Slide 65

Slide 65 text

Retail 3NF Which ZIP code and state ordered the most product?

Slide 66

Slide 66 text

What is the average GPA of Male and Female students in the College of Education each decade? 3NF = 3rd Normal Form 66

Slide 67

Slide 67 text

Dimensional Model (Star Schema) • Dimensional Model = Fact Table + Dimension Tables • Benefits of Dimensional Model • Easy for business users to understand • Improved query performance • Future expansions and changes are simplified • Simplified JOIN queries for reports 67

Slide 68

Slide 68 text

Fact and Dimension Table Star Schema (Dimensional Model) 68

Slide 69

Slide 69 text

Fact Elements are generally aggregated # of Students Sum of Sales Quantity of Sales Average Age Dimension Elements are generally “Filtered By” Region: State, Eastern Sales Division Demographics: Gender, Age Date: Holiday, Weekend, 1st Quarter Product: Promotion, Price, Weight, Season Star Schema 69

Slide 70

Slide 70 text

Retail Sales Business Process: Given total sales for a large company how would you divide up the data to better understand it? Fact Table = Sales represented by Sales Receipts Customer, date, product, price, quantity, store Dimension Tables = Details and Hierarchies of Facts Customer = Address, Demographics Date = Weekday, Season, Holiday, Promotions Product = SKU, List Price, Weight, Department Store = Address, Region Star Schema 70

Slide 71

Slide 71 text

Fact Tables • Stores business performance measurements • Each row represents a specific measurement or activity • All measurements in a fact table must be at the same grain • Contains foreign keys to dimension tables • Most Facts are numeric 71

Slide 72

Slide 72 text

Fact Tables Facts are typically Foreign Keys or numeric additive, semi-additive, or non-additive values • Additive: Purchase Price – Can be added across all dimensions • Semi-Additive: % Profit Margin, Age – Can be added across some dimensions • Non-Additive: Zip Code, SSN – Should not be added across any dimensions 72

Slide 73

Slide 73 text

Semi-Additive Facts • Additive facts can be added across all dimensions – Sales Price • Semi-additive facts can be added across some dimensions – Inventory can be added across locations, but not time. – Current Bank Balance can be added across customers, but not time. • Non-additive fact should not be added across any dimensions – Totals would be uninformative or useless – ZIP Codes and Invoice Numbers are non-additive • Which of the following are additive, semi-additive, or non- additive? – Sales Quantity – Student ID# – Patient Weight 73

Slide 74

Slide 74 text

Dimension Tables • Textual descriptors of the business • Often low in cardinality (rows) compared to fact tables • 50-100 attributes (columns) are not uncommon • Dimension attributes used as “filter by” constraints • Often hierarchical components (product > brand > division) 74

Slide 75

Slide 75 text

Continuous Refinement Dimensional models are often refined over time Attributes may be a fact or dimension May want to filter by or aggregate Patient Age Fact: What is the average age of slip and fall patients? Dimension: What is the most common teenage diagnosis? 75

Slide 76

Slide 76 text

1) Select the Business Process to model 2) Declare the Grain (Detail Level) of the Business Process 3) Choose the Dimensions that apply to each fact table row 4) Identify the numeric Facts that will populate each fact table row Dimensional Model Design: The Data Warehouse Toolkit by Ralph Kimball and Margy Ross Dimensional Model Design 76

Slide 77

Slide 77 text

Patient Treatment Billing Staff Scheduling Room Scheduling Community Relations Business Relations Hospital Business Processes 77

Slide 78

Slide 78 text

78 http://www.fchn.org/images/paying-for-care/FMH-bill.jpg Grain Narrows the scope to a particular task What will be on one form but not others? Admissions Form Treatment Form Billing What is the time frame? Previous or Current? Patient Weight Patient Age Patient History

Slide 79

Slide 79 text

Height Weight Temperature Patient Name Employer Diagnosis Treatment Primary Doctor Phone Address ZIP Fact or Dimension Table? Where do the following attributes belong: Date Check-In Time Check-Out Time Wait Time Treatment Time Room # Doctor’s Phone Doctor’s Address Doctor’s ZIP Doctor’s Office Building 79

Slide 80

Slide 80 text

Hospital Patients 3NF

Slide 81

Slide 81 text

Hospital Claims 3NF

Slide 82

Slide 82 text

1) Select the Business Process to model • Billing 2) Declare the Grain (Detail Level) of the Business Process • Patient Bill 3) Choose the Dimensions that apply to each fact table row • Date, Patient, Provider, Treatment 4) Identify the numeric Facts that will populate each fact table row • Price, Insurance Payment • Invoice ID, Date ID, Patient ID, Provider ID, Treatment ID Example Dimensional Model Design 82

Slide 83

Slide 83 text

Example Dimensional Model Design 83

Slide 84

Slide 84 text

• Easy for customers to understand • What they can “filter by” • What they can aggregate • Reports must run quickly • Must not slow production systems • Often de-normalized • Value is only created if reports are used Why Dimensional Models

Slide 85

Slide 85 text

Special Interest Topics 85 Slides and Feedback at: http://joind.in/11011

Slide 86

Slide 86 text

Date Dimensions Date Dimension Example attributes: – Date – Full Date Description – Month Number – Month Name – Month Short Name – Day Number in Month – Day of Week – Day Number in Year – Year – Fiscal Quarter – Fiscal Year – Holiday Indicator – First Day of Quarter Indicator 100 years = 36500 rows Multiple Hierarchies: Q1 2010, Fall 2013, 1st Monday Free Prepopulated downloads http://www.kimballgroup.com/data -warehouse-and-business- intelligence-resources/data- warehouse-books/booksmdwt/ 86

Slide 87

Slide 87 text

Date Dimensions DateKey FullDate DateName Holiday DayOfWeek DayNameOfWeek DayOfMonth DayOfYear WeekdayWeekend WeekOfYear MonthName MonthOfYear CalendarYear CalendarYearQtr 20000101 1/1/2000 2000/01/01 Y 7 Saturday 1 1 Weekend 1 January 1 2000 2000Q1 20000102 1/2/2000 2000/01/02 N 1 Sunday 2 2 Weekday 2 January 1 2000 2000Q1 20000103 1/3/2000 2000/01/03 N 2 Monday 3 3 Weekday 2 January 1 2000 2000Q1 20000104 1/4/2000 2000/01/04 N 3 Tuesday 4 4 Weekday 2 January 1 2000 2000Q1 20000105 1/5/2000 2000/01/05 N 4 Wednesday 5 5 Weekday 2 January 1 2000 2000Q1 20000106 1/6/2000 2000/01/06 N 5 Thursday 6 6 Weekday 2 January 1 2000 2000Q1 20000107 1/7/2000 2000/01/07 N 6 Friday 7 7 Weekend 2 January 1 2000 2000Q1 20000108 1/8/2000 2000/01/08 N 7 Saturday 8 8 Weekend 2 January 1 2000 2000Q1 20000109 1/9/2000 2000/01/09 N 1 Sunday 9 9 Weekday 3 January 1 2000 2000Q1 20000110 1/10/2000 2000/01/10 N 2 Monday 10 10 Weekday 3 January 1 2000 2000Q1 20000111 1/11/2000 2000/01/11 N 3 Tuesday 11 11 Weekday 3 January 1 2000 2000Q1 20000112 1/12/2000 2000/01/12 N 4 Wednesday 12 12 Weekday 3 January 1 2000 2000Q1 20000113 1/13/2000 2000/01/13 N 5 Thursday 13 13 Weekday 3 January 1 2000 2000Q1 20000114 1/14/2000 2000/01/14 N 6 Friday 14 14 Weekend 3 January 1 2000 2000Q1 20000115 1/15/2000 2000/01/15 N 7 Saturday 15 15 Weekend 3 January 1 2000 2000Q1 20000116 1/16/2000 2000/01/16 N 1 Sunday 16 16 Weekday 4 January 1 2000 2000Q1 20000117 1/17/2000 2000/01/17 Y 2 Monday 17 17 Weekday 4 January 1 2000 2000Q1 20000118 1/18/2000 2000/01/18 N 3 Tuesday 18 18 Weekday 4 January 1 2000 2000Q1 20000119 1/19/2000 2000/01/19 N 4 Wednesday 19 19 Weekday 4 January 1 2000 2000Q1 20000120 1/20/2000 2000/01/20 N 5 Thursday 20 20 Weekday 4 January 1 2000 2000Q1 20000121 1/21/2000 2000/01/21 N 6 Friday 21 21 Weekend 4 January 1 2000 2000Q1 20000122 1/22/2000 2000/01/22 N 7 Saturday 22 22 Weekend 4 January 1 2000 2000Q1 20000123 1/23/2000 2000/01/23 N 1 Sunday 23 23 Weekday 5 January 1 2000 2000Q1 20000124 1/24/2000 2000/01/24 N 2 Monday 24 24 Weekday 5 January 1 2000 2000Q1 20000125 1/25/2000 2000/01/25 N 3 Tuesday 25 25 Weekday 5 January 1 2000 2000Q1 20000126 1/26/2000 2000/01/26 N 4 Wednesday 26 26 Weekday 5 January 1 2000 2000Q1 20000127 1/27/2000 2000/01/27 N 5 Thursday 27 27 Weekday 5 January 1 2000 2000Q1 20000128 1/28/2000 2000/01/28 N 6 Friday 28 28 Weekend 5 January 1 2000 2000Q1 20000129 1/29/2000 2000/01/29 N 7 Saturday 29 29 Weekend 5 January 1 2000 2000Q1 20000130 1/30/2000 2000/01/30 N 1 Sunday 30 30 Weekday 6 January 1 2000 2000Q1 20000131 1/31/2000 2000/01/31 N 2 Monday 31 31 Weekday 6 January 1 2000 2000Q1 20000201 2/1/2000 2000/02/01 N 3 Tuesday 1 32 Weekday 6 February 2 2000 2000Q1 20000202 2/2/2000 2000/02/02 N 4 Wednesday 2 33 Weekday 6 February 2 2000 2000Q1 20000203 2/3/2000 2000/02/03 N 5 Thursday 3 34 Weekday 6 February 2 2000 2000Q1 20000204 2/4/2000 2000/02/04 N 6 Friday 4 35 Weekend 6 February 2 2000 2000Q1 20000205 2/5/2000 2000/02/05 N 7 Saturday 5 36 Weekend 6 February 2 2000 2000Q1 20000206 2/6/2000 2000/02/06 N 1 Sunday 6 37 Weekday 7 February 2 2000 2000Q1 20000207 2/7/2000 2000/02/07 N 2 Monday 7 38 Weekday 7 February 2 2000 2000Q1 20000208 2/8/2000 2000/02/08 N 3 Tuesday 8 39 Weekday 7 February 2 2000 2000Q1 20000209 2/9/2000 2000/02/09 N 4 Wednesday 9 40 Weekday 7 February 2 2000 2000Q1 87

Slide 88

Slide 88 text

Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model. Kimball vs Inmon Bottom-Up Approach "...The data warehouse is nothing more than the union of all the data marts...," Ralph Kimball, December 29, 1997. 88

Slide 89

Slide 89 text

Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form. Kimball vs Inmon Top-Down Approach "You can catch all the minnows in the ocean and stack them together and they still do not make a whale," Bill Inmon, January 8, 1998. 89

Slide 90

Slide 90 text

There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouse in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse. http://www.1keydata.com/datawarehousing/inmon-kimball.html http://www.nagesh.com/publications/technology/173-inmon-vs-kimball-an-analysis.html Kimball vs Inmon 90

Slide 91

Slide 91 text

Whose methodology is more popular? Dice.com technical job listing board • 85,000 job listings, average turn over 14 days • 90-118 job listings for Kimball • 33-39 for Inmon • 33-37 Inmon and Kimball Kimball vs Inmon 91 Slides and Feedback at: http://joind.in/11011

Slide 92

Slide 92 text

Cubes

Slide 93

Slide 93 text

Cube

Slide 94

Slide 94 text

Trimmed Sample Data Day Weekend Zip State Region Division Group Quantity Thu No 95014 CA West Sporting Goods Skis 300 Thu No 90210 CA West Sporting Goods Skis 400 Fri No 19090 PA East Sporting Goods Bikes 200 Fri No 20500 DC East Sporting Goods Bikes 600 Sat Yes 19090 PA East Sporting Goods Bikes 100 Sun Yes 95014 CA West Sporting Goods Skis 800 Sun Yes 20500 DC East Sporting Goods Bikes 200 Sun Yes 19090 PA East Clothing Men’s Pants 100 … … … … … … … … • Note Hierarchies: Day > Weekend; Zip > State > Region

Slide 95

Slide 95 text

Cubes • Slice • Dice • Roll Up • Drill Down • Drill Through • Pivot

Slide 96

Slide 96 text

Slice

Slide 97

Slide 97 text

Slice of Bike Data 95014 90210 20500 19090 THU 300 200 400 400 FRI 200 300 200 100 SAT 100 100 300 300 SUN 400 400 100 200 SUM of 16 cells associated with Bike Sales = 4000

Slide 98

Slide 98 text

Roll Up

Slide 99

Slide 99 text

Roll Up Sporting Goods (Bikes and Skis) Bikes 95014 90210 20500 19090 THU 300 200 400 400 FRI 200 300 200 100 SAT 100 100 300 300 SUN 400 400 100 200 SUM of 32 cells associated with Sporting Good Sales = 8000 Skis 95014 90210 20500 19090 THU 400 400 100 200 FRI 100 100 300 300 SAT 200 300 200 100 SUN 300 200 400 400

Slide 100

Slide 100 text

Slice 2

Slide 101

Slide 101 text

Drill Down

Slide 102

Slide 102 text

Drill Through

Slide 103

Slide 103 text

Dice

Slide 104

Slide 104 text

Dice2

Slide 105

Slide 105 text

Pivot Tables Zip Code Product Day Sales 95010 Skis Sat 2 95010 Skis Sun 0 95010 Bikes Sat 4 95010 Bikes Sun 0 95014 Skis Sat 5 95014 Skis Sun 3 95014 Bikes Sat 1 95014 Bikes Sun 9 Zip Code Product Sat Sales Sun Sales 95010 Skis 2 0 95010 Bikes 4 0 95014 Skis 5 3 95014 Bikes 1 9 Move unique row information into expanded column header data

Slide 106

Slide 106 text

Pivot Tables Zip Code Product Sat Sales Sun Sales 95010 Skis 2 0 95010 Bikes 4 0 95014 Skis 5 3 95014 Bikes 1 9 Zip Code Product Sat Sales Sun Sales 95010 Sporting Goods 6 0 95014 Sporting Goods 6 12 Often aggregate (combine) hierarchical data

Slide 107

Slide 107 text

Tableau = Fantastic, Easy, and Powerful Data Visualization Tool

Slide 108

Slide 108 text

Virtualized Business Intelligence with InfoSphere Warehouse By: Adriana Carvajal; Thomas Chong Big Data Development Cycle 10

Slide 109

Slide 109 text

Why Big Data Skills? 10

Slide 110

Slide 110 text

Why Big Data Skills? 11

Slide 111

Slide 111 text

Why Big Data Skills? 11

Slide 112

Slide 112 text

Why Big Data Skills? 11

Slide 113

Slide 113 text

Why Big Data Skills? 11

Slide 114

Slide 114 text

Questions ??? [email protected] Slides and Feedback at: http://joind.in/11011