Exploratory: Introduction to Factor for Handling Ordered Categorical Data

Factor is one of the data types in R, designed to address typical challenges with categorical data. With the Factor data type, you can set the order of the categorical values and manipulate that order to suit your needs with a series of convenient functions.

In this seminar, Kan will introduce the Factor data type and show how to manage the level order in Exploratory.

Kan Nishida

August 30, 2019

Transcript

  1. Exploratory Seminar Factor: Categorical with Order

  2. EXPLORATORY

  3. Speaker: Kan Nishida, co-founder/CEO, Exploratory (@KanAugust). Summary: At the beginning of 2016, launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  4. Mission: Make Data Science Available for Everyone

  5. The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  6. Democratization of Data Science: First Wave (1976; theme: Monetization; users: Statisticians), Second Wave (2000; theme: Commoditization; users: Data Scientists), Third Wave (2016; theme: Democratization; users: Business Users). The slide also compares Algorithms, Experience, and Tools across the waves (Proprietary, Open Source, Programming, UI & Automation).
  7. Exploratory Data Analysis: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides)
  8. Exploratory Seminar Factor: Categorical with Order

  9. Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides)
  10. Factor • For Ordinal Data (Categorical Data with Order) Columns • Set Levels Explicitly • Manipulate Levels • Many Statistical Models rely on the ‘Base Level’ of a Factor
  11. Data Type

  12. Data Type in General vs. Data Type in R / Exploratory: Numerical = numeric, integer; Categorical = character; Ordinal = factor; Logical = logical; Date, Time = Date, POSIXct
  13. Numerical: a continuous scale (0, 10, 20, 30, 40, 50) with example values 11, 22, 45

  14. Categorical: California, Texas, New York, Florida, Oregon • No continuous relationship • Limited Set of Values • Ordinal relationship is NOT necessary
  15. Ordinal: Really Bad, Bad, Neutral, Good, Really Good. It looks like Categories…
  16. Ordinal: Really Bad (1), Bad (2), Neutral (3), Good (4), Really Good (5). But, there is an inherent ordinal relationship.
  17. Character vs. Factor. Character (Category): Really Bad, Bad, Neutral, Good, Really Good. Factor (Category, Level): Really Bad 1, Bad 2, Neutral 3, Good 4, Really Good 5.
  18. Really Bad Bad Neutral Good Really Good

  19. Category and Level: Really Bad 1, Bad 2, Neutral 3, Good 4, Really Good 5. Base Level: the first level (Really Bad).
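
The difference between a character column and a factor column described on the slides above can be sketched in base R. The `answers` vector is made-up illustration data, not taken from the seminar; `factor()`, `levels()`, and `as.integer()` are standard R.

```r
# Hypothetical survey answers stored as plain character values.
answers <- c("Good", "Really Bad", "Neutral", "Good", "Really Good", "Bad")

# Sorting character values is alphabetical, which ignores the rating order.
sort(unique(answers))

# Declaring the levels explicitly turns the values into a factor with an order.
rating_levels <- c("Really Bad", "Bad", "Neutral", "Good", "Really Good")
ratings <- factor(answers, levels = rating_levels)

levels(ratings)      # the level order, with "Really Bad" first
as.integer(ratings)  # the level number stored behind each value
levels(ratings)[1]   # the base level is simply the first level
```

The values themselves are unchanged; only the order attached to them differs from the character version.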
  20. When do we need it? • Visualization • Window Calculation with Chart • Binning • Statistical Model with Categorical Predictors
  21. Examples

  22. Kaggle Data Scientist Survey Data 2018

  23. Summary View

  24. Bar Chart, sorted by number of rows (people)

  25. Bar Chart, sorted by X-Axis Value Names.

  26. But, it would be better to see them sorted like this…
  28. Factor!

  29. forcats

  30. forcats: Tools for Working with Categorical Variables (Factors) https://cran.r-project.org/web/packages/forcats/index.html

  31. forcats home page https://forcats.tidyverse.org

  32. forcats functions • fct_relevel • fct_inorder • fct_infreq • fct_reorder • fct_rev (reverse) • fct_lump • and others…
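
Two of the listed functions, `fct_rev` and `fct_lump`, are not demonstrated later in the deck, so here is a minimal sketch of what they do. The `countries` data is invented for illustration, and the example assumes the forcats package is installed.

```r
library(forcats)

# Hypothetical country column: US appears 3 times, Japan twice, the rest once.
countries <- factor(c("US", "US", "US", "Japan", "Japan", "France", "Chile"))

levels(countries)                      # default levels are alphabetical

# fct_rev keeps the values but reverses the level order.
reversed <- fct_rev(countries)
levels(reversed)

# fct_lump keeps the n most frequent levels and merges the rest into "Other".
lumped <- fct_lump(countries, n = 2)
levels(lumped)
```

Note that `fct_lump` keeps the surviving levels in their existing order and appends "Other" at the end; it does not sort by frequency on its own.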
  33. Setting the levels manually

  36. fct_relevel(`Online learning platforms and MOOCs`, "Much worse", "Slightly worse", "Neither better nor worse", "Slightly better", "Much better") List all the values whose levels you want to set explicitly.
  37. Character (Category): Much Worse, Slightly Worse, Neither better or worse, Slightly Better, Much Better, No Opinions. Factor (Category, Level): Much Worse 1, Slightly Worse 2, Neither better or worse 3, Slightly Better 4, Much Better 5, No Opinions (no level set).
  38. Character (Category): Much Worse, Slightly Worse, Neither better or worse, Slightly Better, Much Better, No Opinions. Factor (Category, Level): Much Worse 1, Slightly Worse 2, Neither better or worse 3, Slightly Better 4, Much Better 5, No Opinions 6. The values you didn’t list are added after the listed ones, in alphabetical order.
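
The manual releveling shown on the slides above can be tried on a small vector. This is a sketch with made-up `answers` data (the "No opinion" value stands in for the unlisted category) and assumes the forcats package is installed.

```r
library(forcats)

# Hypothetical answers; "No opinion" is deliberately left out of the call below.
answers <- c("Slightly better", "No opinion", "Much worse", "Much better",
             "Neither better nor worse", "Slightly worse")

# List the values in the order you want; fct_relevel moves them to the front.
releveled <- fct_relevel(
  answers,
  "Much worse", "Slightly worse", "Neither better nor worse",
  "Slightly better", "Much better"
)

levels(releveled)  # the five listed values first, then "No opinion"
```

Strictly speaking, forcats keeps the unlisted levels in their existing order; for a character column that existing order is alphabetical, which matches the behavior described on the slide.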
  39. Character Factor

  40. Without setting a sort, X-Axis values are shown in the order of the Factor levels.
  41. Setting the Levels As Is

  42. US Presidents & Years

  43. Bar Chart, sorted by Number of Rows (Years)

  44. Bar Chart, sorted by Names

  45. But, we want to show the Presidents sorted by the years they served.
  46. Luckily, we have a YEAR column, so we can sort the data by YEAR.
  47. Sort Data by YEAR with Arrange

  48. Select ‘As Is’ option.
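
The two steps above (sort the rows, then set the levels as is) can be sketched in R. The assumption here is that the ‘As Is’ option corresponds to `fct_inorder`, which appears in the forcats function list earlier; the `pres` data frame is a made-up slice, and base `order()` stands in for the Arrange step.

```r
library(forcats)

# Hypothetical slice of a presidents table, one row per year.
pres <- data.frame(
  year = c(2009, 1993, 2001, 1994, 2010, 2002),
  president = c("Obama", "Clinton", "Bush", "Clinton", "Obama", "Bush"),
  stringsAsFactors = FALSE
)

# Step 1: sort the rows by year (the Arrange step).
pres <- pres[order(pres$year), ]

# Step 2: freeze the current row order into the factor levels.
pres$president <- fct_inorder(pres$president)

levels(pres$president)  # in order of first appearance, i.e. by year served
```

Because `fct_inorder` uses the order of first appearance, the sort has to happen before the conversion, not after.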

  49. Summary view respects the levels for Factor columns.

  50. The chart respects the levels of Factor columns when the sorting option is left at its default.
  51. The Presidents are now sorted by the years they served.

  52. Bonus

  53. We have historical US beer tax rate data.

  54. ttbbeer: US Beer Statistics from TTB https://cran.r-project.org/web/packages/ttbbeer/index.html

  56. Use Line Chart to visualize the tax rate changes.

  57. Let’s color the line by the Presidents to find out who raised the beer tax! ;)
  58. Join with the Presidents data frame!

  59. Assign the President column to Color.

  60. President names are sorted by the year they served.

  61. What???

  62. Setting Levels Based on Frequency

  63. We have fictional Order data.

  64. Bar chart, Countries are sorted by Average Marketing Cost

  65. Can we reorder by the number of orders?

  66. Select ‘By Frequency’.

  67. fct_infreq
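
A minimal sketch of what `fct_infreq` does, using an invented `countries` vector where each element stands for one order. It assumes the forcats package is installed.

```r
library(forcats)

# Hypothetical order records: one element per order.
countries <- c("Japan", "US", "US", "France", "US", "Japan")

# Levels are set by how often each value appears, most frequent first.
by_freq <- fct_infreq(countries)

levels(by_freq)
table(by_freq)  # counts are listed in level order
```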

  68. Chart respects the levels of Factor columns.

  69. Notice that we are not sorting inside the chart. The countries are sorted according to the Factor order.
  70. Switch Y-Axis column to Average Marketing.

  71. Countries are still sorted by the frequency (number of orders).

  72. Countries with fewer orders have higher Marketing Cost.

  73. What if we want to set the levels based on Sales Amount, NOT based on Frequency (Number of Orders)?
  74. Setting Levels Based on Another Column

  75. Select ‘By Another Column’.

  76. fct_reorder

  77. If you don’t set any summarizing function, ‘Mean with Ascending Order’ will be used by default.
  78. Countries are sorted by ‘Mean with Ascending Order.’

  79. Set ‘sum’ to calculate the total Sales Amount, and set the ‘Descending Order’ option.
  80. Now it’s ordered by Sum of Sales.
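
The two orderings described above can be reproduced with `fct_reorder` on made-up data. The `orders` data frame is hypothetical. One caveat: when `fct_reorder` is called directly in R its default summary function is `median`, so `mean` is passed explicitly here to mirror the default the slides describe for Exploratory.

```r
library(forcats)

# Hypothetical order data: two orders per country.
orders <- data.frame(
  country = c("Japan", "US", "France", "Japan", "US", "France"),
  sales = c(100, 50, 10, 300, 20, 500),
  stringsAsFactors = FALSE
)

# Mean of sales per country, ascending.
by_mean <- fct_reorder(orders$country, orders$sales, .fun = mean)
levels(by_mean)

# Total sales per country, largest first.
by_total <- fct_reorder(orders$country, orders$sales, .fun = sum, .desc = TRUE)
levels(by_total)
```

Here the means are 35 (US), 200 (Japan), and 255 (France), while the totals are 510 (France), 400 (Japan), and 70 (US), so the two calls give different level orders.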

  81. If your Sales column happens to have NA…

  82. Window Calculation Difference from First

  83. We have Employee data.

  84. Showing Average Monthly Income by Job Role.

  85. Compare against Healthcare Representative.

  86. Use Window Calculation with % of Difference from First.

  87. % of Difference from First.

  88. But, can we compare against Research Director?

  90. Category (Research Director first): Research Director, Healthcare Rep, Human Resources, Laboratory Technician, Manager, Manufacturing Director. Category (alphabetical): Healthcare Rep, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director.
  91. Character (Category): Healthcare Rep, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director. Factor (Category, Level): Research Director 1; the other values have no level yet. Set Research Director as the 1st Level.
  92. Factor (Category, Level): Research Director 1, Healthcare Rep 2, Human Resources 3, Laboratory Technician 4, Manager 5, Manufacturing Director 6. The rest of the values are assigned levels in alphabetical order.
  94. Factor (Category, Level): Research Director 1, Healthcare Rep 2, Human Resources 3, Laboratory Technician 4, Manager 5, Manufacturing Director 6. Base Level: Research Director.
  95. Select ‘Manually’.

  96. fct_relevel

  97. ‘% of Difference’ are calculated against Research Director.
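
The effect of the base level on ‘% of Difference from First’ can be sketched as follows. The `income` figures and the helper `pct_diff_from_first` are hypothetical; the helper only approximates what a chart's window calculation would do, and forcats is assumed to be installed.

```r
library(forcats)

# Hypothetical average monthly income per job role.
income <- data.frame(
  job_role = c("Healthcare Representative", "Human Resources",
               "Manager", "Research Director"),
  avg_income = c(8000, 4000, 16000, 16000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sort by job_role, then compare every row to the first.
pct_diff_from_first <- function(df) {
  df <- df[order(df$job_role), ]  # factors sort by level, characters by name
  df$pct_diff <- (df$avg_income - df$avg_income[1]) / df$avg_income[1] * 100
  df
}

# As character: the alphabetically first role is the reference.
before <- pct_diff_from_first(income)

# As factor with Research Director moved to the 1st level.
income$job_role <- fct_relevel(income$job_role, "Research Director")
after <- pct_diff_from_first(income)
after
```

Nothing in the calculation changes between the two runs; only the level order of `job_role` decides which row counts as "first".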

  98. Binning

  99. Numeric Ordinal

  102. With the newer version, assigning Numerical columns to X-Axis will automatically categorize the values (binning).
  103. You can set up the ‘binning’ from the property.
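
Outside the chart UI, turning a numeric column into ordered categories can be sketched with base R's `cut()`. This is only an illustration of the idea of binning, not necessarily what Exploratory runs internally; the `ages` values and break points are made up.

```r
# Hypothetical numeric column.
ages <- c(18, 25, 33, 41, 58)

# cut() assigns each value to an interval and returns a factor
# whose levels follow the numeric order of the intervals.
age_bins <- cut(ages, breaks = c(0, 20, 40, 60))

levels(age_bins)      # "(0,20]" "(20,40]" "(40,60]"
as.integer(age_bins)  # which bin each value fell into
table(age_bins)       # counts per bin, in level order
```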

  104. Statistical Learning Model

  105. Research Director’s Monthly Income is about $4,000 higher compared to

  106. What is the base level of Job Role variable?

  107. I’m not familiar with Sales Executive, so I want to compare all the Job Roles against Research Director.
  108. Set ‘Research Director’ as the 1st level with fct_relevel.

  109. Research Director is set to the base level.

  110. Run the model again, and you will see that Research Director is now the base level.
  111. All the coefficients are now interpreted relative to Research Director.
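
The role of the base level in a model can be checked with a small linear regression. This is a sketch on invented `employees` data with only three job roles; the seminar's actual data and model settings are not reproduced here, and forcats is assumed to be installed.

```r
library(forcats)

# Hypothetical employee data: two people per job role.
employees <- data.frame(
  job_role = c("Sales Executive", "Research Director", "Healthcare Representative",
               "Sales Executive", "Research Director", "Healthcare Representative"),
  monthly_income = c(7000, 16000, 7500, 6800, 15800, 7700),
  stringsAsFactors = FALSE
)

# Default: the alphabetically first role becomes the base level,
# so it has no coefficient of its own.
default_model <- lm(monthly_income ~ job_role, data = employees)
names(coef(default_model))

# Move Research Director to the 1st level and fit again.
employees$job_role <- fct_relevel(employees$job_role, "Research Director")
model <- lm(monthly_income ~ job_role, data = employees)
coef(model)
```

After releveling, the intercept is the average income of Research Directors in this toy data, and each remaining coefficient is the difference between that role and Research Director.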
  112. Q & A

  113. EXPLORATORY