Slide 1

Slide 1 text

Exploratory Seminar
Factor: Categorical with Order

Slide 2

Slide 2 text

EXPLORATORY

Slide 3

Slide 3 text

Speaker
Kan Nishida (@KanAugust), co-founder/CEO, Exploratory
Summary
At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Slide 4

Slide 4 text

Mission Make Data Science Available for Everyone

Slide 5

Slide 5 text

The Third Wave
Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

Slide 6

Slide 6 text

Democratization of Data Science
First Wave (1976) | Second Wave (2000) | Third Wave (2016)
Theme: Monetization | Commoditization | Democratization
Users: Statisticians | Data Scientists | Business Users
Algorithms, Experience, Tools: Proprietary, Open Source, UI & Programming, Programming, Open Source, UI & Automation

Slide 7

Slide 7 text

Exploratory Data Analysis
Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning)

Slide 8

Slide 8 text

Exploratory Seminar
Factor: Categorical with Order

Slide 9

Slide 9 text

Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning)

Slide 10

Slide 10 text

Factor
• For Ordinal Data (Categorical Data with Order) Columns
• Set Levels Explicitly
• Manipulate Levels
• Many Statistical Models rely on the ‘Base Level’ of a Factor

Slide 11

Slide 11 text

Data Type

Slide 12

Slide 12 text

Data Type in General | Data Type in R / Exploratory
Numerical | numeric, integer
Categorical | character
Ordinal | factor
Logical | logical
Date, Time | Date, POSIXct
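
The mapping in the table above can be checked directly in R. This is a minimal sketch: the variable names and values are hypothetical and only the R classes come from the table.

```r
# Hypothetical example values; only the classes mirror the table above.
rating_levels <- c("Really Bad", "Bad", "Neutral", "Good", "Really Good")

price      <- 19.99                                   # Numerical   -> numeric
quantity   <- 3L                                      # Numerical   -> integer
state      <- "California"                            # Categorical -> character
rating     <- factor("Good", levels = rating_levels)  # Ordinal     -> factor
shipped    <- TRUE                                    # Logical     -> logical
ordered_on <- as.Date("2018-01-15")                   # Date        -> Date
ordered_at <- as.POSIXct("2018-01-15 09:30:00", tz = "UTC")  # Time -> POSIXct

# class() of a POSIXct value returns c("POSIXct", "POSIXt"); keep the first.
types <- c(class(price), class(quantity), class(state), class(rating),
           class(shipped), class(ordered_on), class(ordered_at)[1])
print(types)
```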

Slide 13

Slide 13 text

Numerical
Values such as 11, 22, and 45 placed on a continuous number line (axis from 0 to 50).

Slide 14

Slide 14 text

Categorical
California, Texas, New York, Florida, Oregon
• No continuous relationship
• Limited Set of Values
• Ordinal relationship is NOT necessary

Slide 15

Slide 15 text

Ordinal
Really Bad, Bad, Neutral, Good, Really Good
It looks like Categories…

Slide 16

Slide 16 text

Ordinal
Really Bad = 1, Bad = 2, Neutral = 3, Good = 4, Really Good = 5
But there is an inherent ordinal relationship.

Slide 17

Slide 17 text

Character vs. Factor
Character (Category only): Really Bad, Bad, Neutral, Good, Really Good
Factor (Category = Level): Really Bad = 1, Bad = 2, Neutral = 3, Good = 4, Really Good = 5

Slide 18

Slide 18 text

Really Bad, Bad, Neutral, Good, Really Good

Slide 19

Slide 19 text

Category | Level
Really Bad | 1 (Base Level)
Bad | 2
Neutral | 3
Good | 4
Really Good | 5
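
The difference between a character column and a factor can be sketched in base R as follows. The sample responses are made up for illustration and the level order follows the table above.

```r
# Hypothetical survey responses stored as plain character values.
ratings <- c("Good", "Really Bad", "Neutral", "Really Good", "Bad", "Good")

# A character vector has no notion of order beyond alphabetical sorting.
sort(ratings)

# A factor carries an explicit level order.
rating_levels <- c("Really Bad", "Bad", "Neutral", "Good", "Really Good")
ratings_fct <- factor(ratings, levels = rating_levels)

# Each value is stored as the integer position of its level.
level_codes <- as.integer(ratings_fct)   # 4 1 3 5 2 4

# Sorting a factor follows the levels, not the alphabet.
sorted_ratings <- as.character(sort(ratings_fct))

# The first level is the base level.
base_level <- levels(ratings_fct)[1]     # "Really Bad"
print(sorted_ratings)
```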

Slide 20

Slide 20 text

When do we need it?
• Visualization
• Window Calculation with Chart
• Binning
• Statistical Model with Categorical Predictors

Slide 21

Slide 21 text

Examples

Slide 22

Slide 22 text

Kaggle Data Scientist Survey Data 2018

Slide 23

Slide 23 text

Summary View

Slide 24

Slide 24 text

Bar Chart, sorted by number of rows (people)

Slide 25

Slide 25 text

Bar Chart, sorted by X-Axis Value Names.

Slide 26

Slide 26 text

But, it would be better to see them being sorted like this…

Slide 27

Slide 27 text

But, it would be better to see them being sorted like this…

Slide 28

Slide 28 text

Factor!

Slide 29

Slide 29 text

forcats

Slide 30

Slide 30 text

forcats: Tools for Working with Categorical Variables (Factors) https://cran.r-project.org/web/packages/forcats/index.html

Slide 31

Slide 31 text

forcats home page https://forcats.tidyverse.org

Slide 32

Slide 32 text

forcats functions
• fct_relevel
• fct_inorder
• fct_infreq
• fct_reorder
• fct_rev (reverse)
• fct_lump
• and others…

Slide 33

Slide 33 text

Setting the levels manually

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

fct_relevel(`Online learning platforms and MOOCs`, "Much worse", "Slightly worse", "Neither better nor worse", "Slightly better", "Much better")
List all the values whose levels you want to set explicitly, in the order you want them.

Slide 37

Slide 37 text

Character (Category only): Much Worse, Slightly Worse, Neither better or worse, Slightly Better, Much Better, No Opinions
Factor (Category = Level): Much Worse = 1, Slightly Worse = 2, Neither better or worse = 3, Slightly Better = 4, Much Better = 5, No Opinions = (not set yet)

Slide 38

Slide 38 text

Character (Category only): Much Worse, Slightly Worse, Neither better or worse, Slightly Better, Much Better, No Opinions
Factor (Category = Level): Much Worse = 1, Slightly Worse = 2, Neither better or worse = 3, Slightly Better = 4, Much Better = 5, No Opinions = 6
The values you didn’t list are added after the listed ones, in their existing order (alphabetical for a column converted from Character).
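
A minimal sketch of this behavior with forcats, using a made-up vector of answers rather than the actual survey column. The labels follow the fct_relevel call shown earlier in the deck, and "No Opinions" stands in for the value that is not listed.

```r
library(forcats)

# Hypothetical answers; not the real Kaggle survey data.
answers <- c("Slightly better", "No Opinions", "Much worse", "Much better",
             "Neither better nor worse", "Slightly worse", "Slightly better")

# Convert to a factor first (levels start out alphabetical), then move the
# listed values to the front in the given order.
answers_fct <- fct_relevel(
  factor(answers),
  "Much worse", "Slightly worse", "Neither better nor worse",
  "Slightly better", "Much better"
)

# "No Opinions" was not listed, so it ends up after the listed levels.
print(levels(answers_fct))
```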

Slide 39

Slide 39 text

Character Factor

Slide 40

Slide 40 text

Without setting any sort in the chart, the X-Axis values are shown in the order of the Factor levels.

Slide 41

Slide 41 text

Setting the Levels As Is

Slide 42

Slide 42 text

US Presidents & Years

Slide 43

Slide 43 text

Bar Chart, sorted by Number of Rows (Years)

Slide 44

Slide 44 text

Bar Chart, sorted by Names

Slide 45

Slide 45 text

But we want to show the Presidents sorted by the years they served.

Slide 46

Slide 46 text

Luckily, we have a YEAR column, so we can sort the data by YEAR.

Slide 47

Slide 47 text

Sort Data by YEAR with Arrange

Slide 48

Slide 48 text

Select ‘As Is’ option.

Slide 49

Slide 49 text

Summary view respects the levels for Factor columns.

Slide 50

Slide 50 text

The chart respects the levels of Factor columns when the sorting option is left at its default.

Slide 51

Slide 51 text

The Presidents are now sorted by the years they served.
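
Outside the UI, the "sort by YEAR, then set levels As Is" steps might look like the sketch below. It assumes the ‘As Is’ option corresponds to forcats' fct_inorder, which the deck lists among the forcats functions but does not explicitly tie to this option. The data frame and its column names are a small hypothetical sample.

```r
library(forcats)

# Hypothetical sample; one row per year, deliberately out of order.
pres_years <- data.frame(
  YEAR = c(2009, 1993, 2001, 2010, 1994, 2002),
  President = c("Barack Obama", "Bill Clinton", "George W. Bush",
                "Barack Obama", "Bill Clinton", "George W. Bush")
)

# Step 1: sort the rows by YEAR.
pres_years <- pres_years[order(pres_years$YEAR), ]

# Step 2: set the levels in the order the values first appear.
pres_years$President <- fct_inorder(factor(pres_years$President))

# Levels now follow the years served instead of the alphabet.
print(levels(pres_years$President))
```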

Slide 52

Slide 52 text

Bonus

Slide 53

Slide 53 text

We have historical US beer tax rate data.

Slide 54

Slide 54 text

ttbbeer: US Beer Statistics from TTB https://cran.r-project.org/web/packages/ttbbeer/index.html

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

Use Line Chart to visualize the tax rate changes.

Slide 57

Slide 57 text

Let’s color the line by the Presidents to find out who raised the beer tax! ;)

Slide 58

Slide 58 text

Join with the Presidents data frame!

Slide 59

Slide 59 text

Assign the President column to Color.

Slide 60

Slide 60 text

President names are sorted by the year they served.

Slide 61

Slide 61 text

What???

Slide 62

Slide 62 text

Setting Levels Based on Frequency

Slide 63

Slide 63 text

We have fictional Order data.

Slide 64

Slide 64 text

Bar chart, Countries are sorted by Average Marketing Cost

Slide 65

Slide 65 text

Can we reorder by the number of orders?

Slide 66

Slide 66 text

Select ‘By Frequency’.

Slide 67

Slide 67 text

fct_infreq
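
As a rough illustration of what fct_infreq does, here is a sketch on a made-up vector of countries, one element per order.

```r
library(forcats)

# Hypothetical orders: 3 from the United States, 2 from Japan, 1 from Canada.
country <- c("Japan", "United States", "United States",
             "Canada", "United States", "Japan")

# Levels are ordered by how often each value appears, most frequent first.
country_fct <- fct_infreq(factor(country))

print(levels(country_fct))
print(table(country_fct))  # counts are listed in level order
```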

Slide 68

Slide 68 text

Chart respects the levels of Factor columns.

Slide 69

Slide 69 text

Notice that we are not sorting inside the chart. The countries are sorted according to the Factor order.

Slide 70

Slide 70 text

Switch Y-Axis column to Average Marketing.

Slide 71

Slide 71 text

Countries are still sorted by the frequency (number of orders).

Slide 72

Slide 72 text

Countries with fewer orders have higher Marketing Cost.

Slide 73

Slide 73 text

What if we want to set the level based on Sales Amount, NOT based on Frequency (Number of Orders)?

Slide 74

Slide 74 text

Setting Levels Based on Another Column

Slide 75

Slide 75 text

Select ‘By Another Column’.

Slide 76

Slide 76 text

fct_reorder

Slide 77

Slide 77 text

If you don’t set any summarizing function, ‘Mean with Ascending Order’ will be used by default.

Slide 78

Slide 78 text

Countries are sorted by ‘Mean with Ascending Order.’

Slide 79

Slide 79 text

Set ‘sum’ to calculate the total Sales Amount, and set ‘Descending Order’ option.

Slide 80

Slide 80 text

Now it’s ordered by Sum of Sales.
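
The two orderings can be compared in plain forcats with the sketch below. The order data is made up. Note that forcats' own fct_reorder defaults to the median, so the mean is passed explicitly here to mirror the default described on the slides.

```r
library(forcats)

# Hypothetical orders: Canada has many small orders, the US one large order.
orders <- data.frame(
  Country = c("Canada", "Japan", "Canada", "United States",
              "Canada", "Japan", "Canada"),
  Sales   = c(60, 100, 60, 500, 60, 120, 60)
)

# Mean of Sales per country, ascending: Canada (60), Japan (110), US (500).
by_mean <- fct_reorder(factor(orders$Country), orders$Sales, .fun = mean)

# Sum of Sales per country, descending: US (500), Canada (240), Japan (220).
by_sum_desc <- fct_reorder(factor(orders$Country), orders$Sales,
                           .fun = sum, .desc = TRUE)

print(levels(by_mean))
print(levels(by_sum_desc))
```

With this sample the two results are not simply the reverse of each other, which shows that the summarizing function matters as much as the direction.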

Slide 81

Slide 81 text

If your Sales column happens to have NA…

Slide 82

Slide 82 text

Window Calculation Difference from First

Slide 83

Slide 83 text

We have Employee data.

Slide 84

Slide 84 text

Showing Average Monthly Income by Job Role.

Slide 85

Slide 85 text

Compare against Healthcare Representative.

Slide 86

Slide 86 text

Use Window Calculation with % of Difference from First.

Slide 87

Slide 87 text

% of Difference from First.

Slide 88

Slide 88 text

But, can we compare against Research Director?

Slide 89

Slide 89 text

No content

Slide 90

Slide 90 text

Category: Research Director, Healthcare Rep, Human Resources, Laboratory Technician, Manager, Manufacturing Director
Category: Healthcare Rep, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director

Slide 91

Slide 91 text

Character (Category only): Healthcare Rep, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director
Factor (Category = Level): Healthcare Rep, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director = 1
Set Research Director as the 1st Level.

Slide 92

Slide 92 text

Factor (Category = Level): Research Director = 1, Healthcare Rep = 2, Human Resources = 3, Laboratory Technician = 4, Manager = 5, Manufacturing Director = 6
Factor (Category = Level): Healthcare Rep, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director = 1
The rest of the values are assigned levels in alphabetical order.

Slide 93

Slide 93 text

Factor (Category = Level): Research Director = 1, Healthcare Rep = 2, Human Resources = 3, Laboratory Technician = 4, Manager = 5, Manufacturing Director = 6
Factor (Category = Level): Healthcare Rep, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director = 1
The rest of the values are assigned levels in alphabetical order.

Slide 94

Slide 94 text

Category | Level
Research Director | 1 (Base Level)
Healthcare Rep | 2
Human Resources | 3
Laboratory Technician | 4
Manager | 5
Manufacturing Director | 6

Slide 95

Slide 95 text

Select ‘Manually’.

Slide 96

Slide 96 text

fct_relevel

Slide 97

Slide 97 text

‘% of Difference’ is calculated against Research Director.
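
The effect of the first level on this calculation can be sketched as below. The income figures are invented, and pct_diff_from_first is a hypothetical helper written for illustration; it is not Exploratory's implementation of the window calculation.

```r
library(forcats)

# Hypothetical average monthly income per job role.
income <- data.frame(
  JobRole = c("Healthcare Representative", "Human Resources", "Research Director"),
  AvgMonthlyIncome = c(8000, 4000, 16000)
)

# Hypothetical helper: order the rows by factor level, then compute the
# percentage difference of each value from the first one.
pct_diff_from_first <- function(role, value) {
  ord <- order(as.integer(role))
  v <- value[ord]
  setNames((v - v[1]) / v[1] * 100, as.character(role[ord]))
}

# Default (alphabetical) levels: compared against Healthcare Representative.
role_default <- factor(income$JobRole)
diff_default <- pct_diff_from_first(role_default, income$AvgMonthlyIncome)

# Research Director moved to the 1st level: compared against Research Director.
role_rd <- fct_relevel(role_default, "Research Director")
diff_rd <- pct_diff_from_first(role_rd, income$AvgMonthlyIncome)

print(diff_default)  # 0, -50, 100
print(diff_rd)       # 0, -50, -75
```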

Slide 98

Slide 98 text

Binning

Slide 99

Slide 99 text

Numeric to Ordinal

Slide 100

Slide 100 text

No content

Slide 101

Slide 101 text

No content

Slide 102

Slide 102 text

In newer versions, assigning a Numerical column to the X-Axis automatically categorizes the values (binning).

Slide 103

Slide 103 text

You can set up the ‘binning’ from the property.
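
Binning turns a numeric column into an ordered set of categories. A minimal base R sketch using cut() is shown below. The ages and break points are hypothetical, and this is not necessarily how Exploratory bins values internally.

```r
# Hypothetical ages.
age <- c(23, 35, 47, 52, 61, 38)

# cut() returns a factor whose levels follow the numeric order of the bins.
# right = FALSE makes each bin include its lower bound and exclude its upper.
age_bin <- cut(age, breaks = c(20, 30, 40, 50, 60, 70), right = FALSE)

print(levels(age_bin))
print(table(age_bin))
```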

Slide 104

Slide 104 text

Statistical Learning Model

Slide 105

Slide 105 text

Research Director’s Monthly Income is about $4,000 higher compared to …

Slide 106

Slide 106 text

What is the base level of Job Role variable?

Slide 107

Slide 107 text

I’m not familiar with Sales Executive. I want to compare all the Job Roles against Research Director.

Slide 108

Slide 108 text

Set ‘Research Director’ as the 1st level with fct_relevel.

Slide 109

Slide 109 text

Research Director is set to the base level.

Slide 110

Slide 110 text

Run the model again, and you will see that Research Director is now the base level.

Slide 111

Slide 111 text

All the coefficients are now interpreted relative to Research Director.
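
The same effect can be reproduced with a plain linear model in R. This is a sketch on invented employee data, assuming a model of the form MonthlyIncome ~ JobRole; the deck does not show the exact model specification.

```r
library(forcats)

# Hypothetical employees: three per job role.
employees <- data.frame(
  JobRole = rep(c("Healthcare Representative", "Research Director",
                  "Sales Executive"), each = 3),
  MonthlyIncome = c(7000, 8000, 9000,      # mean 8000
                    15000, 16000, 17000,   # mean 16000
                    6000, 7000, 8000)      # mean 7000
)

# Default levels are alphabetical, so Healthcare Representative is the base.
employees$JobRole <- factor(employees$JobRole)
fit_default <- lm(MonthlyIncome ~ JobRole, data = employees)

# Make Research Director the 1st level, i.e. the base level.
employees$JobRole <- fct_relevel(employees$JobRole, "Research Director")
fit_rd <- lm(MonthlyIncome ~ JobRole, data = employees)

# The intercept is now the Research Director mean, and every other
# coefficient is the difference from Research Director.
print(coef(fit_default))
print(coef(fit_rd))
```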

Slide 112

Slide 112 text

Q & A

Slide 113

Slide 113 text

EXPLORATORY