Introduction to Big Data

Big Data employees are typically paid 20-30% more than their peers.  
Explore the technical realities and challenges within Big Data.  
We will discuss Data Warehouses, Star Schemas, Data Repository Architecture, Slowly Changing Data and several common Big Data terms.
Goal: Enable you to speak intelligently about Big Data


John Kerley-Weeks

May 09, 2014

Transcript

  1. Big Data Introduction
    • Big Data employees are typically paid 20-30% more than their peers.
    • Explore the technical realities and challenges within Big Data.
    • We will discuss Data Warehouses, Star Schemas, Data Repository Architecture, Slowly Changing Data, and several common Big Data terms.
    • Goal: Enable you to speak intelligently about Big Data
    Slides and Feedback at: http://joind.in/11011
    Friday, May 9th, 10:30am-11:20am, SB 139
  2. What is Big Data?
    • Massive volume
      – Facebook generates 10TB of data daily
      – Twitter generates 7TB of data daily
      – 90% of available data was generated in the last 2 years
      – In 2000 there were 800,000,000 TB worldwide
      – http://hosteddocs.ittoolbox.com/zd_wp_big_data_v2.pdf
    • Un-structured or table-friendly data
    • Identifying positive and negative trends
    • Lots of live data coming in very fast!
  3. http://wikibon.org/blog/big-data-statistics/
    • 2.7 Zettabytes of data exist in the digital universe today. (IBM 2012)
    • 1.8 Zettabytes of data created in 2011 (IDC)
    • By 2020, Internet B2B and B2C transactions will exceed 450 B per day. (IDC)
    • Facebook stores, accesses, and analyzes 30+ Petabytes of user-generated data.
    • Akamai analyzes 75 million events per day to better target advertisements.
    • Walmart handles 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes of data.
    • More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
    • In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day.
    The Rapid Growth of Unstructured Data
    • YouTube users upload 48 hours of new video every minute of the day.
    • 100 terabytes of data uploaded daily to Facebook.
  4. Twitter
    • Started as only 140 Characters?
    • As of 2010
      – Stored 7TB/day
      – 17,500 Bytes / tweet
    • As of October 2013
      – 300 Billion tweets sent
      – 500 Million tweets per day average
    • Kept and searchable for how long?
    • Simple analysis queries
      – 1 in 13: number of tweets containing curse words
      – 569 F Bombs in The Wolf of Wall Street
    • http://expandedramblings.com/index.php/march-2013-by-the-numbers-a-few-amazing-twitter-stats/#.U2zgHizjhdg
    • http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010
  5. ====Tweet====
    {
      "contributors": null,
      "coordinates": null,
      "created_at": "Wed Mar 12 21:56:15 +0000 2014",
      "entities": {
        "hashtags": [],
        "symbols": [],
        "urls": [ { "display_url": "jobs.usu.edu/applicants/Cen\u2026",
                    "expanded_url": "http://jobs.usu.edu/applicants/Central?quickFind=59832",
                    "indices": [ 71, 93 ],
                    "url": "http://t.co/zy1s0IAnh0" } ],
        "user_mentions": [ { "id": 810656204, "id_str": "810656204", "indices": [ 3, 16 ],
                             "name": "Alice Dolan", "screen_name": "AliceDolinen" } ]
      },
      "favorite_count": 0,
      "favorited": false,
      "geo": null,
      "id": 443868097514967040,
      "id_str": "443868097514967040",
      "in_reply_to_screen_name": null,
      "in_reply_to_status_id": null,
      "in_reply_to_status_id_str": null,
      "in_reply_to_user_id": null,
      "in_reply_to_user_id_str": null,
      "lang": "en",
      "metadata": { "iso_language_code": "en",
    ====Tweet====
    {
      "contributors": null,
      "coordinates": null,
      "created_at": "Wed Mar 12 20:56:35 +0000 2014",
      "entities": {
        "hashtags": [],
        "symbols": [],
        "urls": [],
        "user_mentions": [ { "id": 869089020, "id_str": "869089020", "indices": [ 0, 13 ],
                             "name": "USU Campus Rec", "screen_name": "USUCampusRec" } ]
      },
      "favorite_count": 1,
      "favorited": false,
      "geo": null,
      "id": 443853080174407680,
      "id_str": "443853080174407680",
      "in_reply_to_screen_name": "USUCampusRec",
      "in_reply_to_status_id": 443852925333274624,
      "in_reply_to_status_id_str": "443852925333274624",
      "in_reply_to_user_id": 869089020,
      "in_reply_to_user_id_str": "869089020",
      "lang": "et",
      "metadata": { "iso_language_code": "et", "result_type": "recent" },
      "place": null,
      "retweet_count": 0,
      "retweeted": false,
      "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
      "text": "@USUCampusRec Brian.isom@aggiemail.usu.edu",
      "truncated": false,
      "user": {
  6. Why Big Data?
    • Will improve business performance by an average of 26-41% over the next 3 years
    • Retail data determines if a customer should be offered a 15% or 25% discount.
    • Automotive: GPS, telematics, and maintenance sensors
    • Healthcare: Big Data could generate up to $100 billion annually
    http://hosteddocs.ittoolbox.com/zd_wp_big_data_v2.pdf
  7. Why Big Data?
    • Companies want to understand their organization and their customers
    • We can collect and store more data now
  8. Problems with Big Data
    • Can provide lots of cool charts
      – What are the Big Questions?
        • How can we attract more customers?
        • Sell more products?
        • Save the world?
    • Organizations don't share data internally
    • Shortage of skilled professionals who understand data analytics
  9. Why Learn Data Warehousing?
    Gartner reported in February 2012:
    • The DBMS market represented $23.2 billion with expected growth of 7%
    • The Data Warehouse market represents about $7 billion of that $23.2 billion
    Dice.com High Paying Skills Salaries:
    • #1 Hadoop $115,062
    • #2 Big Data $113,739
    • #3 NoSQL $113,031
    • #10 ETL $104,777
    • #25 Business Intelligence $101,854
    • #29 Data Warehouse $101,061
    http://marketing.dice.com/pdf/Dice_TechSalarySurvey_2013.pdf
  10. Recommended Reading
    The Data Warehouse Toolkit by Ralph Kimball and Margy Ross
    http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/1118530802/
  11. Why Data Warehousing?
    • Problems
      – Performance – Production DB slowed down
      – Ease of Use – Difficulty for Business Users
    • Solutions
      – Independent storage
      – Historical storage
      – De-normalization of data structures
      – Organize data in a simple and logical layout
    • Communicating in a common language
      – Business: One question, one answer
      – IT Dept.: Common terminology
  12. Data Warehouse Components
    • Operational Source Systems – Business transaction databases
    • Source Extracts
      – Downloads from or ODBC connections to business transaction databases
      – Filters data required for analysis (don't need or want user passwords)
      – May be internal, partner (supply chain), or external (census, zip code)
    • Staging Area
      – Storage area for initial extraction, verification, and transformation
      – Can represent databases anywhere between source and warehouse
    • Operational Data Store (ODS)
      – Aggregates current and historical data
      – Provides structure for simplified, scheduled extraction into warehouse
  13. Data Warehouse Components
    • Data Warehouse
      – Processed data is loaded into the Data Warehouse
      – Contains star schemas that can be queried for ad hoc requirements
      – Contains the lowest "grain" of data available to users
      – Contains conformed dimensions shared by all star schemas
    • Data Marts
      – Subsets of the data warehouse representing a specific business process
      – Provide limited data sets to specific groups (HR, Sales totals)
      – Often one department starts with a data mart which becomes the foundation for the data warehouse
  14. Data Warehouse Components
    • Data Access Tools
      – End user tools to generate reports
      – Cubes, reports, SQL queries, visualization, dashboards, etc.
    • Cubes
      – Provide data aggregation across multiple, often hierarchical, dimensions
      – Dimensions provide "filter by" and "group by" capabilities
    • Reports
      – Standard data views that are frequently generated
      – Often regularly scheduled and emailed or posted
      – Standard configuration allows trend and anomaly detection
    • Ad Hoc
      – SQL queries run directly on the data warehouse
      – Tools often allow simple custom queries to be developed by end users
  15. Metadata
    Describes the state, source, history, and characteristics of data:
    – Definitions: business terms, abbreviations, descriptions
    – Data Source
    – Hierarchical relationships
    – Default Values
    – Extraction and Transformation methods
    – Accuracy, completeness, reliability, and trustworthiness of data
    – Data expiration and refresh
    – Record History and Timestamps
    – Historical Policy and Structure Changes
    – Access Control / Permissions
    – Intellectual Property Limitations and Fees
    – Access Monitoring: Usage, Audit, Diagnostic, Privacy, and Security
  16. Snowflaking
    SKU > Style > Brand > ProductGroup > MerchGroup > MerchDivision
    Snowflake = Normalized
    • If every Dimension Table is normalized, the Dimensional Model begins to look like a crystal or snowflake
  17. Snowflaking
    Single Dimension Table or Hierarchical 3NF?
    • Which is easier to understand?
    • Which is easier to query?
    • Which is faster to query?
    • Which is easier for report users to understand?
  18. Snowflaking
    Advantages of Snowflaking:
    • Imports directly from the transaction database
    • Easier to maintain
    • Easier to identify duplicates
    • Requires less storage space – reduction in redundant data
  19. Snowflaking Disadvantages
    • Increases presentation complexity for end users
      – If reports are not easily built, data is not queried
      – If reports are not easily understood, they are not read
      – If data and reports are not used, the Data Warehouse project has failed
    • Queries require more joins
      – Query performance is impacted
    • Disk space savings are limited
      – Fact Tables are comparatively much larger
    • Data accuracy is maintained
      – The Data Warehouse is compiled from normalized transaction data
    • Don't snowflake unless you must
      – OK if there is a compelling reason, such as data update challenges
  20. Common Problem
    • One row in the fact table = one treatment
    • How do we associate multiple criteria with a single dimension?
      – Multiple diagnoses during a doctor's visit
      – Increasing number of insured family members
      – Cellular family plan with multiple phones
      – Retail order with multiple shipping addresses
      – Trouble ticket with multiple problems
  21. Solutions
    • How do we associate multiple criteria with a single dimension?
      – Create a separate record for each instance
        • Requires a primary key for each instance
        • Requires a common key for each group
      – Many-to-Many Relationship
        • Multiple patients with identical diagnoses
        • Patients with many diagnoses
    • How do we solve many-to-many relationships?
  22. Healthcare Model
    To handle multiple values per fact table record, we introduce a Helper or Bridge table.
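The bridge pattern above can be sketched with a tiny in-memory database. This is a hypothetical illustration, not the presenter's schema: the table and column names (fact_treatment, bridge_diagnosis_group, dim_diagnosis) are invented for the example. One fact row carries a group key, and the bridge maps that group to any number of diagnoses.

```python
import sqlite3

# Hypothetical sketch of the helper/bridge pattern from the slide:
# one visit (fact row) maps to many diagnoses via a diagnosis group.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_diagnosis (diagnosis_key INTEGER PRIMARY KEY, diagnosis_desc TEXT);
CREATE TABLE bridge_diagnosis_group (
    diagnosis_group_key INTEGER,   -- shared by every diagnosis in one visit
    diagnosis_key INTEGER REFERENCES dim_diagnosis(diagnosis_key));
CREATE TABLE fact_treatment (
    treatment_id INTEGER PRIMARY KEY,
    diagnosis_group_key INTEGER,   -- the fact joins the group, not one diagnosis
    billed_amount REAL);
INSERT INTO dim_diagnosis VALUES (1, 'Hypertension'), (2, 'Diabetes'), (3, 'Sprained ankle');
INSERT INTO bridge_diagnosis_group VALUES (10, 1), (10, 2), (11, 3);
INSERT INTO fact_treatment VALUES (100, 10, 250.0), (101, 11, 90.0);
""")
# One fact row reaches all of its diagnoses through the bridge table.
rows = con.execute("""
    SELECT f.treatment_id, d.diagnosis_desc
    FROM fact_treatment f
    JOIN bridge_diagnosis_group b ON f.diagnosis_group_key = b.diagnosis_group_key
    JOIN dim_diagnosis d ON b.diagnosis_key = d.diagnosis_key
    ORDER BY f.treatment_id, d.diagnosis_key""").fetchall()
print(rows)  # [(100, 'Hypertension'), (100, 'Diabetes'), (101, 'Sprained ankle')]
```

Note the design choice: the fact table stays at one row per treatment, and the many-to-many relationship lives entirely in the bridge.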
  23. Common Language
    • Helper Table
      – Solves many-to-many problems
      – Snowflaking reduces the number of shared elements
      – A helper table increases the number of shared elements
    • Similar Terms
      – Bridge
      – Junction Table
      – Joining Table
  24. Attributes with little variation
    IS_VARIANT_ACTIVE  IS_PRODUCT_ACTIVE  IS_DOWNGRADE  IS_GSO  IS_CLONED  IS_PACKAGE
    (row after row of flag values that are almost all 0, with only an occasional 1)
  25. Junk Dimensions
    Junk_ID  IS_VARIANT_ACTIVE  IS_PRODUCT_ACTIVE  IS_DOWNGRADE  IS_GSO  IS_CLONED  IS_PACKAGE
     1       0  0  0  0   0  0
     2       0  0  0  0   0  1
     3       0  0  0  0   1  0
     4       0  0  0  0   2  0
     5       0  0  0  0   3  0
     6       0  0  0  0   4  0
     7       0  0  0  0   5  0
     8       0  0  0  0   6  0
     9       0  0  0  0   7  0
    10       0  0  0  0   8  0
    11       0  0  0  0   9  0
    12       0  0  0  0  10  0
    13       0  0  0  0  14  0
    14       0  0  0  0  18  0
    15       0  0  0  0  19  0
    16       0  0  0  1   0  0
    17       0  1  0  0   0  0
    18       0  1  0  0   0  1
    19       0  1  0  0   1  0
    20       1  1  0  0   0  6
    6 attributes for 5000+ entries can be consolidated to 1 uniquely defining surrogate key in a dimension table with 20 rows
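The consolidation shown above can be sketched in a few lines: collect the distinct combinations of the low-cardinality flags, assign each combination a surrogate key, and let the fact rows carry only that key. The data and field names here are hypothetical stand-ins for the slide's flag columns.

```python
# Sketch (hypothetical data): build a junk dimension by collapsing the
# distinct flag combinations into a small surrogate-keyed lookup table.
fact_rows = [
    {"sale_id": 1, "is_active": 0, "is_package": 0},
    {"sale_id": 2, "is_active": 0, "is_package": 1},
    {"sale_id": 3, "is_active": 0, "is_package": 0},
    {"sale_id": 4, "is_active": 1, "is_package": 0},
]

junk_dim = {}   # (is_active, is_package) -> surrogate junk_id
slim_fact = []  # fact rows now carry only the surrogate key
for row in fact_rows:
    combo = (row["is_active"], row["is_package"])
    junk_id = junk_dim.setdefault(combo, len(junk_dim) + 1)
    slim_fact.append({"sale_id": row["sale_id"], "junk_id": junk_id})

print(junk_dim)   # {(0, 0): 1, (0, 1): 2, (1, 0): 3}
print(slim_fact)  # 4 fact rows point at only 3 junk-dimension rows
```

The same idea scales to the slide's case: 5000+ fact rows reduce to however many distinct flag combinations actually occur.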
  26. Junk Dimensions
    Demographics contain multiple dimensions with few unique values:
    DIM_TIME (PK TIME_KEY; SEMESTER_NAME)
    DIM_COURSE (PK COURSE_KEY; COURSE_NBR, COURSE_NAME)
    DIM_MARITAL_STATUS (PK MARITAL_STATUS_KEY; MARITAL_STATUS_CODE, MARITAL_STATUS_DESC)
    DIM_GENDER (PK GENDER_KEY; GENDER_CODE, GENDER_DESC)
    DIM_AGE (PK AGE_KEY; AGE, AGE_GROUP)
    FACT_COURSE_ENROLLMENT (PK,FK1 TIME_KEY; PK,FK2 COURSE_KEY; PK,FK3 MARITAL_STATUS_KEY; PK,FK5 GENDER_KEY; FK4 AGE_KEY; GRADE_PCT)
  27. Junk Dimensions
    Can be combined into one Demographic Dimension Table:
    DIM_TIME (PK TIME_KEY; SEMESTER_NAME)
    DIM_COURSE (PK COURSE_KEY; COURSE_NBR, COURSE_NAME)
    DIM_DEMOGRAPHIC (PK DEMOGRAPHIC_KEY; MARITAL_STATUS_CODE, MARITAL_STATUS_DESC, GENDER_CODE, GENDER_DESC, AGE_CODE, AGE_DESC)
    FACT_COURSE_ENROLLMENT (PK,FK1 TIME_KEY; PK,FK2 COURSE_KEY; PK,FK3 DEMOGRAPHIC_KEY; GRADE_PCT)
  28. Junk Dimensions
    • Junk dimensions should be less than 20,000 rows
      – Multiply the total number of potential values for each attribute
      – 5 Marital Statuses X 3 Genders X 110 Ages X 7 Age Groups = 11,550, or 1,650 rows? (Age Group is determined by Age, so 5 X 3 X 110 = 1,650)
    • Can be used to collect associated "comments" or optional text
    • In Alan Simon's experience, DBAs hate junk dimensions even more than de-normalization, so use them sparingly
  29. Why a "Junk" Dimension?
    Normally a Database and Kitchen should be organized and free of clutter.
  30. Why a "Junk" Dimension?
    A Junk Drawer collects the clutter
    http://www.doublecoveragefootball.com/2012/06/dcs-newest-feature-friday-junk-drawer.html
    http://ths.gardenweb.com/forums/load/kitchbath/msg1222054422880.html
    http://iheartorganizing.blogspot.com/2011/08/its-here-kitchen-cabinet-tour.html
  31. Slowly Changing Historical Records
    What do you want to track?
    • Only the most recent information
      – Previous information is assumed to be incorrect
      – Birth Date, Social Security Number
      – Last Name?
    • Previous values
      – Previous values needed for historical reports or tracking
      – Previous Address or Company Affiliation
      – Changes in Demographics – Age?
    • Additional Product Inventory
      – Any changes are treated as a new product or customer
      – Quantity of socks in a pack
      – Work vs. Personal Chase Visa Card
    • Combination of the above?
  32. Slowly Changing Dimensions
    • Slowly Changing Dimensions do not typically change daily but may change over time, unpredictably, or never.
    • Three basic techniques:
      – Type 1 SCD: Overwrite the value
      – Type 2 SCD: Add a new dimension row
      – Type 3 SCD: Create a dimension column
  33. Type 1 SCD Example
    Well suited for corrections when historical information does not need to be tracked.
    • Correction to a data entry error
    • Birthdate, Race, and Gender don't typically change
    • Demographic changes may flag purchasing changes
    Customer    Preferred Contact  Address       Phone         Email
    Jane Smith  Phone              123 Maple St  435-555-1212  Jane.smith@gmail.com
    Jane Smith  Phone              123 Maple St  435-757-1212  Jane.smith@gmail.com
  34. Type 1 SCD – Overwrite
    • Old attribute value is replaced with the current value
    • Advantages
      – Fast
      – Simple
    • Disadvantages
      – Loses history of attribute changes
      – Aggregate tables may need to be rebuilt
    • Appropriate for corrections
      – If previous information was incorrect
  35. Type 2 SCD Example
    Utilized to track history of changes.
    • A Common Key (Customer #) is required to correlate changes
    • An Effective Date column may be helpful
    • Is the Fact table associated with SurrogateID or Customer #?
    SurrogateID  Current  Customer    Customer #  Address       Phone         Email
    123          N        Jane Smith  87653       123 Maple St  435-555-1212  Jane.smith@gmail.com
    642          Y        Jane Smith  87653       381 First St  435-757-1212  Jane.smith@gmail.com
  36. Type 2 SCD – Create a New Record
    • Utilized to track history of changes
    • A new dimension row is added to represent the change
      – Effective Date and Current Indicator columns are typically utilized
      – A new surrogate key is created
      – A common key is typically maintained to show correlation (SKU, SSN)
      – The previous record remains unchanged other than the Current Indicator
    • Advantages
      – History is recorded between fact table and dimension table
        • Adding a bigger sized candy bar does not change previous purchase records
      – Aggregate tables do not need to be rebuilt
    • Disadvantages
      – Frequent updates significantly increase dimension table size
      – Why would you track Facts by common vs. surrogate keys?
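The Type 2 mechanics (expire the current row, insert a new version under a fresh surrogate key) can be sketched as below. Column names and sample values loosely follow the Jane Smith example; the helper function is an invented illustration, not a production ETL routine.

```python
import sqlite3

# Type 2 SCD sketch: expire the current row and add a new one with a
# fresh surrogate key; the common key (customer_nbr) stays the same.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    surrogate_id INTEGER PRIMARY KEY,
    customer_nbr INTEGER,      -- common/natural key, never changes
    address TEXT,
    is_current TEXT,
    effective_date TEXT)""")
con.execute("INSERT INTO dim_customer VALUES (123, 87653, '123 Maple St', 'Y', '2010-01-01')")

def apply_type2_change(con, customer_nbr, new_address, change_date):
    # Expire the old version; its history is otherwise untouched.
    con.execute("UPDATE dim_customer SET is_current = 'N' "
                "WHERE customer_nbr = ? AND is_current = 'Y'", (customer_nbr,))
    # Insert the new version; SQLite assigns the next surrogate key.
    con.execute("INSERT INTO dim_customer (customer_nbr, address, is_current, effective_date) "
                "VALUES (?, ?, 'Y', ?)", (customer_nbr, new_address, change_date))

apply_type2_change(con, 87653, '381 First St', '2014-05-09')
history = con.execute("SELECT surrogate_id, address, is_current "
                      "FROM dim_customer ORDER BY surrogate_id").fetchall()
print(history)  # [(123, '123 Maple St', 'N'), (124, '381 First St', 'Y')]
```

Old fact rows keep pointing at surrogate 123, so historical reports still see the Maple St address; new fact rows reference 124.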
  37. Type 3 SCD Example
    Utilized to show two conditions simultaneously, e.g. comparing proposed redistricting plans.
    Current Region  New Region  Customer    Customer #  Address       Phone
    South Towne     Mid Towne   Jane Smith  87653       381 First St  435-757-1212
  38. Type 3 SCD – Add a Dimension Column
    • A column is added to the dimension table to track previous values
    • When a change occurs
      – The Current value is copied to the Previous value
      – The New value replaces the Current value
    • Advantages
      – Easily associates old and new values
      – Married and maiden names
      – Compare changes before and after a reorganization
    • Disadvantages
      – Only records the most recent change
      – A column must be added for each attribute being tracked
      – Aggregate tables may need to be rebuilt
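The copy-then-overwrite step above fits in a single UPDATE, because SQL evaluates the right-hand sides of SET against the row's pre-update values. The region columns follow the redistricting example; the table name is an assumed stand-in.

```python
import sqlite3

# Type 3 SCD sketch: one extra column holds the prior value so current
# and previous states can be compared side by side.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_customer "
            "(customer_nbr INTEGER PRIMARY KEY, current_region TEXT, previous_region TEXT)")
con.execute("INSERT INTO dim_customer VALUES (87653, 'South Towne', NULL)")

# On a change: copy current into previous, then overwrite current.
# Both SET expressions read the ORIGINAL row values, so the order works.
con.execute("""UPDATE dim_customer
               SET previous_region = current_region, current_region = ?
               WHERE customer_nbr = ?""", ('Mid Towne', 87653))
row = con.execute("SELECT * FROM dim_customer").fetchone()
print(row)  # (87653, 'Mid Towne', 'South Towne')
```

A second change would overwrite previous_region, which is exactly the "only records the most recent change" limitation noted on the slide.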
  39. Designing an Integrated Warehouse
    • Business Process Integration
      – Analysis across business units, processes, and databases
      – Data Marts represent Business Processes
      – Often built incrementally, one Business Process at a time
    • Bus Matrix
      – Business Processes + Conformed Dimensions
      – Allows uniform conformed dimensions across Business Processes
      – Common currency, product IDs, definitions
      – Shipping uses pallets, stores use cases, the customer buys cans
  40. Data Warehouse Bus Matrix
    Business Processes / Dimensions: Date, Product, Promotion, Website, Customer, Supplier, Region
    Sales: X X X X X X
    Billing: X X X X X
    Inventory: X X X
    Shipping: X X X X
    Market Analysis: X X X X X X
    Partners: X X X
  41. Conformed Dimensions
    Dimensions may be shared across data marts from the same table or duplicate tables using the same source attributes.
    Consistency must be maintained between:
    • Dimension keys
    • Attribute names
    • Attribute values
    • Attribute units
    • Attribute definitions
  42. Conformed Dimensions
    How large is a quantity of one?
    • Cashier = One can of Sprite
    • Stock Person = One six pack of Sprite
    • Delivery Person = One pallet of Sprite
    • Factory = One truck of Sprite
  43. Drilling Across
    • Drill Across queries can span multiple business processes
    • Queries span multiple data marts, cubes, and reports
      – Drill Through obtains data not available within a cube
    • Conformed dimensions and facts are required
    • Allows detailed drill down and correlations
      – Combine Sales and Shipping to identify product backorder trends
      – Combine Sales and Inventory to identify theft or damage
      – Obtain product descriptions from quantity aggregates
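A drill-across query can be sketched as two fact tables aggregated separately and then joined on a conformed dimension. The Sales-vs-Shipping backorder comparison below uses invented data and names; the pattern (aggregate each fact at the same grain, then join) is the point.

```python
import sqlite3

# Drill-across sketch: Sales and Shipping facts share the conformed
# Product dimension, so their aggregates can be compared side by side.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (product_key INTEGER, qty_sold INTEGER);
CREATE TABLE fact_shipping (product_key INTEGER, qty_shipped INTEGER);
INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Skis');
INSERT INTO fact_sales VALUES (1, 500), (2, 300), (1, 100);
INSERT INTO fact_shipping VALUES (1, 450), (2, 300);
""")
rows = con.execute("""
    SELECT p.product_name, s.sold, sh.shipped, s.sold - sh.shipped AS backordered
    FROM (SELECT product_key, SUM(qty_sold) AS sold
          FROM fact_sales GROUP BY product_key) s
    JOIN (SELECT product_key, SUM(qty_shipped) AS shipped
          FROM fact_shipping GROUP BY product_key) sh
         ON s.product_key = sh.product_key
    JOIN dim_product p ON p.product_key = s.product_key
    ORDER BY p.product_name""").fetchall()
print(rows)  # [('Bikes', 600, 450, 150), ('Skis', 300, 300, 0)]
```

Each fact table is aggregated on its own before the join; joining the raw fact rows directly would multiply rows and inflate the sums.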
  44. Broad Definitions
    Data Warehouse – Data repository for broad queries
    Business Intelligence – Data with which to make decisions
    Data Mart – Data representing a Business Process
    Cube – Self-service reporting tool
    Dashboard – Single visual representation of processes
    Ask two experts, you will get three opinions. Every implementation is different. Clients want one answer. Often product defined (Excel, Oracle, Informatica).
  45. Transactions vs Reports
    An e-commerce store has been processing 300,000 sales per day without incident. Management wants to better understand business sales trends in order to increase product sales. When we started generating reports on all sales per quarter, the 80 requested reports took three days each to run. During report generation, customer transactions took minutes instead of seconds to complete. The CEO has ordered that all report generation be stopped immediately.
  46. OLAP / OLTP Examples
    • OLAP = On-Line Analytical Processing
      – Reports – answers that take more than a few seconds
      – Data Warehouse
      – Star Schema
    • OLTP = On-Line Transaction Processing
      – Production databases used to store individual transactions
      – Add, update, and delete transactions
      – Directly impacted by customer interactions
      – Traditional normalized databases
  47. OLTP vs. OLAP
                          OLTP                                OLAP
    User                  Customer Service, IT Professionals  Decision Makers
    Purpose               Production Business Operations      Report Generation
    Structure             3rd Normal Form                     Star Schema
    Timeline              Current                             Historical
    Data                  Detailed                            Filtered
    Usage                 Short, Simple Transactions          Large Complex Queries
    Access                Read & Write                        Read Mostly
    Records per Query     Tens                                Millions
    Database Size         100 MB-GB                           100 GB-TB
    Measurement Metric    # transactions/minute               Query Response Time
  48. Why Separate OLAP and OLTP?
    • Fast report generation
    • Reduced workload on production systems
    • Centralized repository for departmental data
    • Historical records
    • Single definition
    • Single answer
  49. Transactions
    • Amazon.com
      – Cyber Monday orders
        • 13.7 million in 2010
        • 27 million in 2012
        • 36.8 million in 2013 (1KB/order = 36.8TB)
      – Shipped
        • 15.6 million units shipped in one day
    • Facebook
      – 1.23 billion monthly active users
      – 757 million daily active users
      – 4.5 billion likes
      – Size of user data: 300,000,000 GB (240KB/user)
  50. Why is 3NF valuable?
    • Avoids redundancy
    • Information is entered once
    • Updates occur in one place
    • Fast transactions
    • Fast Add, Modify, Delete
    • Individual rows retrieved quickly
    • However...
      – Slow reports due to JOINs
      – Difficult for database novices to understand
  51. 3NF = 3rd Normal Form
    What is the average GPA of Male and Female students in the College of Education each decade?
  52. Dimensional Model (Star Schema)
    • Dimensional Model = Fact Table + Dimension Tables
    • Benefits of the Dimensional Model
      – Easy for business users to understand
      – Improved query performance
      – Future expansions and changes are simplified
      – Simplified JOIN queries for reports
  53. Star Schema
    Fact elements are generally aggregated:
    • # of Students
    • Sum of Sales
    • Quantity of Sales
    • Average Age
    Dimension elements are generally "filtered by":
    • Region: State, Eastern Sales Division
    • Demographics: Gender, Age
    • Date: Holiday, Weekend, 1st Quarter
    • Product: Promotion, Price, Weight, Season
  54. Star Schema
    Retail Sales Business Process: Given total sales for a large company, how would you divide up the data to better understand it?
    Fact Table = Sales represented by sales receipts
    • Customer, date, product, price, quantity, store
    Dimension Tables = Details and hierarchies of facts
    • Customer = Address, Demographics
    • Date = Weekday, Season, Holiday, Promotions
    • Product = SKU, List Price, Weight, Department
    • Store = Address, Region
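The retail star above can be sketched as a tiny working schema. Columns are abbreviated and the sample rows are invented; the shape (one fact table with foreign keys out to each dimension, queried by joining and grouping on dimension attributes) is the technique the slide describes.

```python
import sqlite3

# Minimal retail star-schema sketch: one fact table, three dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, weekday TEXT, holiday TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, sku TEXT, department TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    quantity INTEGER,
    price REAL);
INSERT INTO dim_date VALUES (20140509, 'Friday', 'N');
INSERT INTO dim_product VALUES (1, 'SKI-001', 'Sporting Goods'), (2, 'PNT-001', 'Clothing');
INSERT INTO dim_store VALUES (1, 'West'), (2, 'East');
INSERT INTO fact_sales VALUES
    (20140509, 1, 1, 3, 299.99), (20140509, 1, 2, 1, 299.99), (20140509, 2, 2, 2, 39.99);
""")
# Typical star query: aggregate a fact, grouped by dimension attributes.
rows = con.execute("""
    SELECT st.region, p.department, SUM(f.quantity) AS units
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_store st ON f.store_key = st.store_key
    GROUP BY st.region, p.department
    ORDER BY st.region, p.department""").fetchall()
print(rows)  # [('East', 'Clothing', 2), ('East', 'Sporting Goods', 1), ('West', 'Sporting Goods', 3)]
```

Every analytical question ("units by region", "sales on holidays") becomes the same join-group-aggregate shape, which is what makes the star easy for report tools to generate against.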
  55. Fact Tables
    • Store business performance measurements
    • Each row represents a specific measurement or activity
    • All measurements in a fact table must be at the same grain
    • Contain foreign keys to dimension tables
    • Most facts are numeric
  56. Fact Tables
    Facts are typically foreign keys or numeric additive, semi-additive, or non-additive values.
    • Additive: Purchase Price
      – Can be added across all dimensions
    • Semi-Additive: % Profit Margin, Age
      – Can be added across some dimensions
    • Non-Additive: Zip Code, SSN
      – Should not be added across any dimensions
  57. Semi-Additive Facts
    • Additive facts can be added across all dimensions
      – Sales Price
    • Semi-additive facts can be added across some dimensions
      – Inventory can be added across locations, but not time
      – Current Bank Balance can be added across customers, but not time
    • Non-additive facts should not be added across any dimensions
      – Totals would be uninformative or useless
      – ZIP Codes and Invoice Numbers are non-additive
    • Which of the following are additive, semi-additive, or non-additive?
      – Sales Quantity
      – Student ID#
      – Patient Weight
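The bank-balance case above can be made concrete with a few hypothetical snapshot rows: summing balances across customers on one day is valid, while summing the same customer across days double-counts, so a "last value" aggregation is used along the time dimension instead.

```python
# Semi-additive sketch (hypothetical balances): balance snapshots add
# across the customer dimension but NOT across the time dimension.
balances = [
    {"day": "2014-05-08", "customer": "Jane", "balance": 100},
    {"day": "2014-05-08", "customer": "Bob",  "balance":  50},
    {"day": "2014-05-09", "customer": "Jane", "balance": 120},
    {"day": "2014-05-09", "customer": "Bob",  "balance":  40},
]

# Valid: add across customers for a single day.
total_may9 = sum(b["balance"] for b in balances if b["day"] == "2014-05-09")
print(total_may9)  # 160

# Adding across days would report 310, which is meaningless;
# take each customer's latest snapshot ("last" aggregation) instead.
latest = {}
for b in sorted(balances, key=lambda b: b["day"]):
    latest[b["customer"]] = b["balance"]  # later days overwrite earlier ones
print(sum(latest.values()))  # 160
```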
  58. Dimension Tables
    • Textual descriptors of the business
    • Often low in cardinality (rows) compared to fact tables
    • 50-100 attributes (columns) are not uncommon
    • Dimension attributes are used as "filter by" constraints
    • Often contain hierarchical components (product > brand > division)
  59. Continuous Refinement
    Dimensional models are often refined over time.
    An attribute may be a fact or a dimension; you may want to filter by it or aggregate it.
    Patient Age
    • Fact: What is the average age of slip and fall patients?
    • Dimension: What is the most common teenage diagnosis?
  60. Dimensional Model Design
    1) Select the Business Process to model
    2) Declare the Grain (detail level) of the Business Process
    3) Choose the Dimensions that apply to each fact table row
    4) Identify the numeric Facts that will populate each fact table row
    Dimensional Model Design: The Data Warehouse Toolkit by Ralph Kimball and Margy Ross
  61. Grain
    http://www.fchn.org/images/paying-for-care/FMH-bill.jpg
    Narrows the scope to a particular task
    • What will be on one form but not others?
      – Admissions Form, Treatment Form, Billing
    • What is the time frame? Previous or current?
      – Patient Weight, Patient Age, Patient History
  62. Fact or Dimension Table?
    Where do the following attributes belong?
    Height, Weight, Temperature, Patient Name, Employer, Diagnosis, Treatment, Primary Doctor, Phone, Address, ZIP, Date, Check-In Time, Check-Out Time, Wait Time, Treatment Time, Room #, Doctor's Phone, Doctor's Address, Doctor's ZIP, Doctor's Office Building
  63. Example Dimensional Model Design
    1) Select the Business Process to model
      • Billing
    2) Declare the Grain (detail level) of the Business Process
      • Patient Bill
    3) Choose the Dimensions that apply to each fact table row
      • Date, Patient, Provider, Treatment
    4) Identify the numeric Facts that will populate each fact table row
      • Price, Insurance Payment
      • Invoice ID, Date ID, Patient ID, Provider ID, Treatment ID
  64. Why Dimensional Models
    • Easy for customers to understand
      – What they can "filter by"
      – What they can aggregate
    • Reports must run quickly
    • Must not slow production systems
    • Often de-normalized
    • Value is only created if reports are used
  65. Date Dimensions
    Date Dimension example attributes:
    – Date
    – Full Date Description
    – Month Number
    – Month Name
    – Month Short Name
    – Day Number in Month
    – Day of Week
    – Day Number in Year
    – Year
    – Fiscal Quarter
    – Fiscal Year
    – Holiday Indicator
    – First Day of Quarter Indicator
    100 years = 36,500 rows
    Multiple hierarchies: Q1 2010, Fall 2013, 1st Monday
    Free prepopulated downloads:
    http://www.kimballgroup.com/data-warehouse-and-business-intelligence-resources/data-warehouse-books/booksmdwt/
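Because every attribute is derivable from the calendar, a date dimension can also be generated in a few lines of stdlib Python rather than downloaded. This sketch uses an abridged attribute set and assumes the common YYYYMMDD integer key format; extend the dict with fiscal and holiday attributes as needed.

```python
from datetime import date, timedelta

# Sketch: generate date-dimension rows from the calendar itself.
def date_dim_row(d: date) -> dict:
    return {
        "DateKey": int(d.strftime("%Y%m%d")),          # assumed YYYYMMDD key format
        "FullDate": d.isoformat(),
        "DayNameOfWeek": d.strftime("%A"),
        "DayOfYear": d.timetuple().tm_yday,
        "WeekdayWeekend": "Weekend" if d.weekday() >= 5 else "Weekday",
        "MonthName": d.strftime("%B"),
        "CalendarYearQtr": f"{d.year}Q{(d.month - 1) // 3 + 1}",
    }

start = date(2000, 1, 1)
rows = [date_dim_row(start + timedelta(days=i)) for i in range(366)]  # 2000 is a leap year
print(rows[0])
# {'DateKey': 20000101, 'FullDate': '2000-01-01', 'DayNameOfWeek': 'Saturday',
#  'DayOfYear': 1, 'WeekdayWeekend': 'Weekend', 'MonthName': 'January',
#  'CalendarYearQtr': '2000Q1'}
```

Holiday indicators are the one attribute that cannot be derived; those come from a business-supplied calendar, which is why prepopulated downloads remain useful.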
  66. Date Dimensions
    DateKey FullDate DateName Holiday DayOfWeek DayNameOfWeek DayOfMonth DayOfYear WeekdayWeekend WeekOfYear MonthName MonthOfYear CalendarYear CalendarYearQtr
    20000101 1/1/2000 2000/01/01 Y 7 Saturday 1 1 Weekend 1 January 1 2000 2000Q1
    20000102 1/2/2000 2000/01/02 N 1 Sunday 2 2 Weekday 2 January 1 2000 2000Q1
    20000103 1/3/2000 2000/01/03 N 2 Monday 3 3 Weekday 2 January 1 2000 2000Q1
    20000104 1/4/2000 2000/01/04 N 3 Tuesday 4 4 Weekday 2 January 1 2000 2000Q1
    20000105 1/5/2000 2000/01/05 N 4 Wednesday 5 5 Weekday 2 January 1 2000 2000Q1
    20000106 1/6/2000 2000/01/06 N 5 Thursday 6 6 Weekday 2 January 1 2000 2000Q1
    20000107 1/7/2000 2000/01/07 N 6 Friday 7 7 Weekend 2 January 1 2000 2000Q1
    20000108 1/8/2000 2000/01/08 N 7 Saturday 8 8 Weekend 2 January 1 2000 2000Q1
    20000109 1/9/2000 2000/01/09 N 1 Sunday 9 9 Weekday 3 January 1 2000 2000Q1
    20000110 1/10/2000 2000/01/10 N 2 Monday 10 10 Weekday 3 January 1 2000 2000Q1
    20000111 1/11/2000 2000/01/11 N 3 Tuesday 11 11 Weekday 3 January 1 2000 2000Q1
    20000112 1/12/2000 2000/01/12 N 4 Wednesday 12 12 Weekday 3 January 1 2000 2000Q1
    20000113 1/13/2000 2000/01/13 N 5 Thursday 13 13 Weekday 3 January 1 2000 2000Q1
    20000114 1/14/2000 2000/01/14 N 6 Friday 14 14 Weekend 3 January 1 2000 2000Q1
    20000115 1/15/2000 2000/01/15 N 7 Saturday 15 15 Weekend 3 January 1 2000 2000Q1
    20000116 1/16/2000 2000/01/16 N 1 Sunday 16 16 Weekday 4 January 1 2000 2000Q1
    20000117 1/17/2000 2000/01/17 Y 2 Monday 17 17 Weekday 4 January 1 2000 2000Q1
    20000118 1/18/2000 2000/01/18 N 3 Tuesday 18 18 Weekday 4 January 1 2000 2000Q1
    20000119 1/19/2000 2000/01/19 N 4 Wednesday 19 19 Weekday 4 January 1 2000 2000Q1
    20000120 1/20/2000 2000/01/20 N 5 Thursday 20 20 Weekday 4 January 1 2000 2000Q1
    20000121 1/21/2000 2000/01/21 N 6 Friday 21 21 Weekend 4 January 1 2000 2000Q1
    20000122 1/22/2000 2000/01/22 N 7 Saturday 22 22 Weekend 4 January 1 2000 2000Q1
    20000123 1/23/2000 2000/01/23 N 1 Sunday 23 23 Weekday 5 January 1 2000 2000Q1
    20000124 1/24/2000 2000/01/24 N 2 Monday 24 24 Weekday 5 January 1 2000 2000Q1
    20000125 1/25/2000 2000/01/25 N 3 Tuesday 25 25 Weekday 5 January 1 2000 2000Q1
    20000126 1/26/2000 2000/01/26 N 4 Wednesday 26 26 Weekday 5 January 1 2000 2000Q1
    20000127 1/27/2000 2000/01/27 N 5 Thursday 27 27 Weekday 5 January 1 2000 2000Q1
    20000128 1/28/2000 2000/01/28 N 6 Friday 28 28 Weekend 5 January 1 2000 2000Q1
    20000129 1/29/2000 2000/01/29 N 7 Saturday 29 29 Weekend 5 January 1 2000 2000Q1
    20000130 1/30/2000 2000/01/30 N 1 Sunday 30 30 Weekday 6 January 1 2000 2000Q1
    20000131 1/31/2000 2000/01/31 N 2 Monday 31 31 Weekday 6 January 1 2000 2000Q1
    20000201 2/1/2000 2000/02/01 N 3 Tuesday 1 32 Weekday 6 February 2 2000 2000Q1
    20000202 2/2/2000 2000/02/02 N 4 Wednesday 2 33 Weekday 6 February 2 2000 2000Q1
    20000203 2/3/2000 2000/02/03 N 5 Thursday 3 34 Weekday 6 February 2 2000 2000Q1
    20000204 2/4/2000 2000/02/04 N 6 Friday 4 35 Weekend 6 February 2 2000 2000Q1
    20000205 2/5/2000 2000/02/05 N 7 Saturday 5 36 Weekend 6 February 2 2000 2000Q1
    20000206 2/6/2000 2000/02/06 N 1 Sunday 6 37 Weekday 7 February 2 2000 2000Q1
    20000207 2/7/2000 2000/02/07 N 2 Monday 7 38 Weekday 7 February 2 2000 2000Q1
    20000208 2/8/2000 2000/02/08 N 3 Tuesday 8 39 Weekday 7 February 2 2000 2000Q1
    20000209 2/9/2000 2000/02/09 N 4 Wednesday 9 40 Weekday 7 February 2 2000 2000Q1
  67. Kimball vs Inmon: Bottom-Up Approach
    Ralph Kimball's paradigm: The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
    "...The data warehouse is nothing more than the union of all the data marts...," Ralph Kimball, December 29, 1997.
  68. Kimball vs Inmon: Top-Down Approach
    Bill Inmon's paradigm: The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.
    "You can catch all the minnows in the ocean and stack them together and they still do not make a whale," Bill Inmon, January 8, 1998.
  69. 90.

    There is no right or wrong between these two ideas,

    as they represent different data warehousing philosophies. In reality, the data warehouse in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse. http://www.1keydata.com/datawarehousing/inmon-kimball.html http://www.nagesh.com/publications/technology/173-inmon-vs-kimball-an-analysis.html Kimball vs Inmon 90
  70. Kimball vs Inmon
    Whose methodology is more popular? Dice.com technical job listing board:
    • 85,000 job listings, average turnover 14 days
    • 90-118 job listings for Kimball
    • 33-39 for Inmon
    • 33-37 for both Inmon and Kimball
  73. Trimmed Sample Data
    Day  Weekend  Zip    State  Region  Division        Group        Quantity
    Thu  No       95014  CA     West    Sporting Goods  Skis         300
    Thu  No       90210  CA     West    Sporting Goods  Skis         400
    Fri  No       19090  PA     East    Sporting Goods  Bikes        200
    Fri  No       20500  DC     East    Sporting Goods  Bikes        600
    Sat  Yes      19090  PA     East    Sporting Goods  Bikes        100
    Sun  Yes      95014  CA     West    Sporting Goods  Skis         800
    Sun  Yes      20500  DC     East    Sporting Goods  Bikes        200
    Sun  Yes      19090  PA     East    Clothing        Men's Pants  100
    …    …        …      …      …       …               …            …
    • Note Hierarchies: Day > Weekend; Zip > State > Region
  74. Cubes
    • Slice
    • Dice
    • Roll Up
    • Drill Down
    • Drill Through
    • Pivot
  76. Slice of Bike Data
         95014  90210  20500  19090
    THU  300    200    400    400
    FRI  200    300    200    100
    SAT  100    100    300    300
    SUN  400    400    100    200
    SUM of 16 cells associated with Bike Sales = 4000
  78. Roll Up: Sporting Goods (Bikes and Skis)
    Bikes  95014  90210  20500  19090
    THU    300    200    400    400
    FRI    200    300    200    100
    SAT    100    100    300    300
    SUN    400    400    100    200

    Skis   95014  90210  20500  19090
    THU    400    400    100    200
    FRI    100    100    300    300
    SAT    200    300    200    100
    SUN    300    200    400    400
    SUM of 32 cells associated with Sporting Goods Sales = 8000
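A roll-up is just aggregation at a coarser level of a hierarchy, so in SQL it is a GROUP BY on the higher-level attribute. This sketch uses a few invented rows in the shape of the "Trimmed Sample Data" slide (Group rolls up to Division); the column is named grp because GROUP is a reserved word.

```python
import sqlite3

# Roll-up sketch: aggregate sales at successively coarser levels of the
# product hierarchy (Group -> Division) with GROUP BY.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day TEXT, division TEXT, grp TEXT, quantity INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Thu", "Sporting Goods", "Skis", 300), ("Thu", "Sporting Goods", "Skis", 400),
    ("Fri", "Sporting Goods", "Bikes", 200), ("Fri", "Sporting Goods", "Bikes", 600),
    ("Sun", "Clothing", "Men's Pants", 100),
])
by_group = con.execute(
    "SELECT grp, SUM(quantity) FROM sales GROUP BY grp ORDER BY grp").fetchall()
by_division = con.execute(
    "SELECT division, SUM(quantity) FROM sales GROUP BY division ORDER BY division").fetchall()
print(by_group)     # [('Bikes', 800), ("Men's Pants", 100), ('Skis', 700)]
print(by_division)  # [('Clothing', 100), ('Sporting Goods', 1500)]
```

Drill down is the reverse direction: re-running the query at a finer grouping level (Division back down to Group, or Group down to individual days).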
  83. Pivot Tables
    Move unique row information into expanded column header data.
    Zip Code  Product  Day  Sales
    95010     Skis     Sat  2
    95010     Skis     Sun  0
    95010     Bikes    Sat  4
    95010     Bikes    Sun  0
    95014     Skis     Sat  5
    95014     Skis     Sun  3
    95014     Bikes    Sat  1
    95014     Bikes    Sun  9

    Zip Code  Product  Sat Sales  Sun Sales
    95010     Skis     2          0
    95010     Bikes    4          0
    95014     Skis     5          3
    95014     Bikes    1          9
  84. Pivot Tables
    Often aggregate (combine) hierarchical data.
    Zip Code  Product  Sat Sales  Sun Sales
    95010     Skis     2          0
    95010     Bikes    4          0
    95014     Skis     5          3
    95014     Bikes    1          9

    Zip Code  Product         Sat Sales  Sun Sales
    95010     Sporting Goods  6          0
    95014     Sporting Goods  6          12