
How to choose the right charts for Exploratory Data Analysis

Kan Nishida
September 04, 2019

This deck shows how to choose the right charts for Exploratory Data Analysis and how to use charts such as Histogram, Density Plot, Boxplot, Scatter Plot, and Stack Bar in Exploratory.


Transcript

  1. How to Choose the Right Charts for EDA

  2. How to Choose the Right Charts for Exploratory Data Analysis

  3. EXPLORATORY

  4. Speaker: Kan Nishida (@KanAugust), co-founder/CEO, Exploratory
    Summary: At the beginning of 2016, launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

  5. Mission: Make Data Science Available for Everyone

  6. The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  7. Democratization of Data Science (diagram)
    Waves: First Wave (1976), Second Wave (2000), Third Wave (2016)
    Theme: Monetization, Commoditization, Democratization
    Users: Statisticians, Data Scientists, Business Users
    Other labels: Proprietary, Open Source, UI & Programming, Programming, Algorithms, Experience, Tools, Open Source, UI & Automation
  8. Exploratory Data Analysis (diagram labels): Questions; Communication (Dashboard, Note, Slides); Data Access; Data Wrangling; Visualization; Analytics (Statistics / Machine Learning)
  9. How to Choose the Right Charts for Exploratory Data Analysis

  10. Diagram labels: Questions; Communication (Dashboard, Note, Slides); Data Access; Data Wrangling; Visualization; Analytics (Statistics / Machine Learning)

  11. EDA (Exploratory Data Analysis)

  12. EDA (Exploratory Data Analysis): An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.

  13. EDA (Exploratory Data Analysis): Questions, Answers, Data, Hypotheses

  14. Why Data Analysis?

  15. Goal: Grow Business

  16. Goal: Grow Business. Problem: Increase Number of Customers

  17. Goal: Grow Business. Problem: Increase Customers. Quantify: Number of Customers

  18. Prediction: Predict how many customers we will have. e.g. Customers will be 1000 by the end of this year.

  19. Control: e.g. We want to grow Customers to 1000. What can we do to make that happen?

  20. Hypothesis: Prediction (Correlation / Influence), Control (Causal Relationship)

  21. Hypothesis
    Prediction: If the weather is warm, we will have more customers.
    Control: If we offer a 10% discount, we will have more customers.

  22. Test Hypotheses by using Data (diagram labels: Hypotheses, Data, Test)
    • Hypothesis Test
    • A/B Test

  23. Confirmatory Analysis (diagram labels: Hypotheses, Data, Test)

  24. How can we build a Hypothesis? Hypothesis: Prediction (Correlation / Influence), Control (Causal Relationship)

  25. Exploratory Data Analysis: Build hypotheses by exploring data. (Diagram labels: EDA, Hypotheses, Data, Test)

  26. EDA (Exploratory Data Analysis): An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.
  27. Employee Data

  28. Employee Data

  29. Goal
    • Want to explain how the salary is decided.
    • Want to predict the salary based on the attributes.
    • Want to know what to control to increase the salary, if possible!

  30. EDA (Exploratory Data Analysis): An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Prediction and Control.

  31. “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” (John Tukey)

  32. Two Principal Questions for EDA
    • How do the values of each variable vary?
    • How are the variables associated (or correlated) with one another?

  33. “Since the aim of exploratory data analysis is to learn what seems to be, it should be no surprise that pictures play a vital role in doing it well. There is nothing better for making you think of questions you had forgotten to ask (even mentally).” (John Tukey)
  34. Visualizing Variance and Correlation

  35. Visualizing Variance

  36. Variance: $1,000 to $15,000. Average: $6,503

  37. Average: $6,503

  38. Questions for Variance
    • What are the typical values?
    • Are there any outliers compared to the general trend in the variance?
    • How is the data distributed?
    • Are there any patterns you can spot in the variance?
  39. Summary View

  40. None

  41. Charts to visualize the variance: Histogram, Density Plot, Bar Chart

  42. How to pick which charts to use?

  43. Depends on Data Type

  44. Continuous Data vs. Categorical Data

  45. Continuous Data vs. Categorical Data

  46. • Numerical - Numeric, Integer, Double
    • Date/Time - Date, POSIXct

  47. Numerical: Continuous and Ordinal relationship among values. (Number line from 0 to 50 with example values 11, 22, 45)

  48. Visualize the Variance of Numerical Variables: Histogram, Density Plot, Bar Chart

  49. Histogram

  50. It splits numerical values into a set of ‘bins’ with equal ranges and shows the size (or number of rows) of each ‘bin’.
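The binning described on this slide can be sketched in a few lines of Python. This is only an illustration of equal-width binning, not Exploratory's implementation; the function name `histogram_counts` and the income values are made up for the example.

```python
def histogram_counts(values, num_bins):
    """Split values into num_bins equal-width bins and count the rows in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        raise ValueError("all values are identical; equal-width bins are undefined")
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    # Each entry is (bin start, bin end, number of rows in the bin).
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(num_bins)]

# Hypothetical monthly incomes, for illustration only.
monthly_income = [1000, 1500, 2200, 2500, 3100, 4800, 5200, 9000, 15000]
bins = histogram_counts(monthly_income, 4)
for start, end, count in bins:
    print(f"{start:>8.0f} to {end:>8.0f}: {'#' * count}")
```

Raising `num_bins` corresponds to increasing the number of bars: narrower bins can reveal separate groups that wider bins merge together.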
  51. Visualizing Variance with Histogram
    1. Visualize the variance of Monthly Income
    2. Find if there is a difference in the Monthly Income variance between Male and Female.
    3. Find if there is a difference in the Monthly Income variance among Job Roles.

  52. Visualize the Variance with Histogram
    1. Visualize the variance of Monthly Income
    2. Find if there is a difference in the Monthly Income variance between Male and Female.
    3. Find if there is a difference in the Monthly Income variance among Job Roles.

  53. None

  54. Increase Number of Bars

  55. Increase to 100 Bars.

  56. There seem to be a few different groups.

  57. Visualize the Variance with Density Plot
    1. Visualize the variance of Monthly Income
    2. Find if there is a difference in the Monthly Income variance between Male and Female.
    3. Find if there is a difference in the Monthly Income variance among Job Roles.

  58. There is no clear difference between Female and Male.

  59. Visualize the Variance with Density Plot
    1. Visualize the variance of Monthly Income
    2. Find if there is a difference in the Monthly Income variance between Male and Female.
    3. Find if there is a difference in the Monthly Income variance among Job Roles.

  60. Managers’ Monthly Income range seems to be higher, while Sales Rep & Research Scientist are lower.

  61. Density Plot

  62. • Draws a smooth curve to visualize the distribution of data.
    • The height shows the estimated data density at any given point.
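One common way to obtain such a smooth curve is a Gaussian kernel density estimate. The deck does not say which estimator or bandwidth Exploratory uses, so the sketch below is an assumption: the function name `gaussian_kde`, the bandwidth of 1000, and the income values are all hypothetical.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that estimates the data density at any point x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each observation contributes a small bell curve centred on itself;
        # the estimate at x is the sum of those contributions.
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density

# Hypothetical monthly incomes, for illustration only.
monthly_income = [1000, 1500, 2200, 2500, 3100, 4800, 5200, 9000, 15000]
density = gaussian_kde(monthly_income, bandwidth=1000.0)

# The curve is higher where observations cluster and lower where they are sparse.
print(density(2000) > density(12000))
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, much like changing the number of bars in a histogram.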
  63. Visualizing Variance with Density Plot
    1. Visualize the variance of Monthly Income
    2. Find if there is a difference in the Monthly Income variance between Male and Female.
    3. Find if there is a difference in the Monthly Income variance among Job Roles.

  64. Visualizing Variance with Density Plot
    1. Visualize the variance of Monthly Income
    2. Find if there is a difference in the Monthly Income variance between Male and Female.
    3. Find if there is a difference in the Monthly Income variance among Job Roles.

  65. None

  66. Visualizing Variance with Density Plot
    1. Visualize the variance of Monthly Income
    2. Find if there is a difference in the Monthly Income variance between Male and Female.
    3. Find if there is a difference in the Monthly Income variance among Job Roles.

  67. There is no clear difference between Female and Male.

  68. Visualizing Variance with Density Plot
    1. Visualize the variance of Monthly Income
    2. Find if there is a difference in the Monthly Income variance between Male and Female.
    3. Find if there is a difference in the Monthly Income variance among Job Roles.

  69. It’s easier to see the differences among Job Roles than with the Histogram.

  70. Density Plot vs. Histogram: The same data variance is visualized in different ways.

  71. Continuous Data vs. Categorical Data

  72. Categorical (e.g. California, Texas, New York, Florida, Oregon)
    • No continuous relationship
    • Limited Set of Values
    • Ordinal relationship is NOT necessary

  73. Histogram, Density Plot, Bar Chart

  74. Visualize the Variance with Bar Chart
    1. Visualize the variation of Job Role
    2. Find if there is a difference in the variations in Job Role between Male and Female.
    3. Find if there is a difference in the variations in Job Role between Attrition Status.
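For a categorical variable, the bar heights are simply the number of rows per category. A minimal sketch, assuming a plain list of made-up Job Role values rather than the deck's employee data:

```python
from collections import Counter

# Hypothetical Job Role column, for illustration only.
job_roles = [
    "Manager", "Sales Rep", "Research Scientist",
    "Sales Rep", "Research Scientist", "Research Scientist",
]

# Number of rows per category: this is what each bar's height represents.
counts = Counter(job_roles)

# Share of each category in the whole data set.
shares = {role: n / len(job_roles) for role, n in counts.items()}

# A text-only bar chart, largest category first.
for role, n in counts.most_common():
    print(f"{role:<20} {'#' * n} ({shares[role]:.0%})")
```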
  75. Visualize the Variance with Bar Chart
    1. Visualize the variation of Job Role
    2. Find if there is a difference in the variations in Job Role between Male and Female.
    3. Find if there is a difference in the variations in Job Role between Attrition Status.

  76. None

  77. None

  78. Visualize the Variance with Bar Chart
    1. Visualize the variation of Job Role
    2. Find if there is a difference in the variations in Job Role between Male and Female.
    3. Find if there is a difference in the variations in Job Role between Attrition Status.

  79. None

  80. Visualize the Variance with Bar Chart
    1. Visualize the variation of Job Role
    2. Find if there is a difference in the variations in Job Role between Male and Female.
    3. Find if there is a difference in the variations in Job Role between Attrition Status.

  81. None

  82. None

  83. Visualizing Association and Correlation

  84. Association and Correlation: A relationship where changes in one variable happen together with changes in another variable according to a certain rule.

  85. Association: Any type of relationship between two variables.
    Correlation: A certain type of (usually linear) association between two variables.

  86. Association: Monthly Income variances are different among countries. (Chart: Monthly Income, 0 to 5000, by Country: US, UK, Japan)

  87. Correlation: The bigger the Age is, the bigger the Monthly Income is. (Chart: Age vs. Monthly Income)

  88. Correlation: Strong Negative Correlation (-1), No Correlation (0), Strong Positive Correlation (1). (Scale marked at -1, -0.5, 0, 0.5, 1)
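The -1 to 1 scale on this slide matches the Pearson correlation coefficient, a standard measure of linear correlation. The deck does not name the coefficient it uses, so treating it as Pearson's r is an assumption; the function name `pearson_r` and the data are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, for illustration only.
age = [25, 30, 35, 40, 45]
income_rising = [2000, 3000, 4000, 5000, 6000]   # rises with age
income_falling = [6000, 5000, 4000, 3000, 2000]  # falls with age
income_unrelated = [2000, 1000, 3000, 1000, 2000]  # no linear pattern

print(pearson_r(age, income_rising))
print(pearson_r(age, income_falling))
print(pearson_r(age, income_unrelated))
```

The three made-up series are chosen to land at the right end, the left end, and the middle of the scale respectively.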
  89. Why are Association & Correlation Important?

  90. Variance: $1,000 to $20,000. Average (Mean)

  91. Monthly Income Variance: $1,000 to $20,000

  92. How much would the income be in this company? (Monthly Income Variance: $1,000 to $20,000)

  93. Uncertainty: How much would the income be in this company? (Monthly Income Variance: $1,000 to $20,000)
  94. If we can find a correlation between Monthly Income and Working Years… (Chart: Working Years, 0 to 30, vs. Monthly Income, $1,000 to $20,000)

  95. If Working Years is 20 years, Monthly Income would be around $15,000. (Chart: Working Years, 0 to 30, vs. Monthly Income, $1,000 to $20,000)

  96. Correlation reduces Uncertainty caused by Variance. (Chart: Working Years vs. Monthly Income, $1,000 to $20,000, with $15,000 marked; labels: Correlation, Variance)
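One way to make "correlation reduces uncertainty" concrete is to fit a straight trend line and compare the spread of Monthly Income before and after accounting for Working Years. The sketch below uses ordinary least squares, which is an assumption (the deck does not say how its trend line is fitted), and the numbers are invented for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

# Hypothetical data, for illustration only.
working_years = [1, 5, 10, 15, 20, 25, 30]
monthly_income = [1500, 4000, 8500, 11000, 15500, 17000, 20000]

slope, intercept = fit_line(working_years, monthly_income)

# Spread of income when nothing else is known about the employee.
spread_before = statistics.pstdev(monthly_income)

# Spread that remains around the trend line once Working Years is known.
residuals = [y - (slope * x + intercept) for x, y in zip(working_years, monthly_income)]
spread_after = statistics.pstdev(residuals)

predicted_at_20 = slope * 20 + intercept
print(spread_before, spread_after, predicted_at_20)
```

With strongly correlated data the remaining spread is much smaller than the original one, which is why knowing Working Years narrows the guess for Monthly Income.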
  97. Association, Variance, Reduce Uncertainty. (Chart: Monthly Income by country: US, UK, Japan)

  98. If we can find strong correlations, it makes it easier to explain how Monthly Income changes and to predict what Monthly Income will be.

  99. Correlation is not equal to Causation. Causation is a special type of Correlation. If we can confirm a given Correlation is Causation, then we can control the outcome.

  100. Visualizing Association and Correlation

  101. Scatter Plot, Boxplot, Violin Plot, Heatmap, Density Plot, Stack Bar

  102. How to choose the right charts?

  103. Depends on the combination of data types.

  104. Combination of Data Types
    • Category vs. Numerical
    • Numerical vs. Numerical
    • Category vs. Category

  105. Combination of Data Types
    • Category vs. Numerical
    • Numerical vs. Numerical
    • Category vs. Category

  106. Scatter Plot, Boxplot, Violin Plot, Heatmap, Density Plot, Stack Bar

  107. Histogram with Color

  108. Job Role (Category) vs. Monthly Income (Numeric)

  109. Density Plot with Color

  110. Job Role (Category) vs. Monthly Income (Numeric)

  111. Boxplot

  112. Boxplot
    • Displays the distribution of numerical values by Category
    • Y Axis represents the range of values, X Axis represents each Category
  113. None
  114. Separate into 4 groups (quartiles) so that each group has equal size.

  115. 3Q (3rd Quartile / 75th Percentile), 2Q (2nd Quartile / 50th Percentile / Median), 1Q (1st Quartile / 25th Percentile)

  116. None

  117. 3Q, Median, 1Q

  118. 3Q, Median, 1Q, Max, Min
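The five values a boxplot draws (Min, 1Q, Median, 3Q, Max) can be computed per category with Python's standard `statistics.quantiles`. This is a sketch under assumptions: tools differ in how they interpolate quartiles and draw whiskers, the `inclusive` method is just one convention, and the incomes are made up.

```python
import statistics

def five_number_summary(values):
    """Min, quartiles, and max: the values a basic boxplot is drawn from."""
    # n=4 cuts the data into four equally sized groups and returns the three cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

# Hypothetical monthly incomes by Job Role, for illustration only.
income_by_role = {
    "Manager": [15000, 16000, 17000, 18000, 19000],
    "Sales Rep": [2000, 2500, 3000, 3500, 4000],
}

# One summary (one box) per category on the X axis.
summaries = {role: five_number_summary(v) for role, v in income_by_role.items()}
for role, summary in summaries.items():
    print(role, summary)
```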

  119. Visualize the relationship between Categorical and Numerical
    1. Visualize the relationship between Monthly Salary and Job Role.
    2. Visualize the relationship between Monthly Salary and Gender.

  120. Visualize the relationship between Categorical and Numerical
    1. Visualize the relationship between Monthly Salary and Job Role.
    2. Visualize the relationship between Monthly Salary and Gender.

  121. Job Role (Category) vs. Monthly Income (Numeric)

  122. Visualize the relationship between Categorical and Numerical
    1. Visualize the relationship between Monthly Salary and Job Role.
    2. Visualize the relationship between Monthly Salary and Gender.

  123. Gender (Category) vs. Monthly Income (Numeric)

  124. Gender (Category) vs. Monthly Income (Numeric) by Job Role (Category)

  125. OR…

  126. Job Role (Category) vs. Monthly Income (Numeric) by Gender (Category)

  127. Combination of Data Types
    • Category vs. Numerical
    • Numerical vs. Numerical
    • Category vs. Category

  128. Scatter Plot, Boxplot, Violin Plot, Heatmap, Density Plot, Stack Bar

  129. Scatter Plot

  130. Each data point (row) is positioned at the intersection of two numeric variables. (Axes: Numeric, Numeric)

  131. Correlation: Strong Negative Correlation (-1), No Correlation (0), Strong Positive Correlation (1). (Scale marked at -1, -0.5, 0, 0.5, 1)

  132. Visualize the relationship between Numerical and Numerical
    1. Visualize the relationship between Monthly Salary and Age.
    2. Visualize the relationship between Monthly Salary and Total Working Years.
    3. Find if the correlations are different among Job Roles.

  133. Visualize the relationship between Numerical and Numerical
    1. Visualize the relationship between Monthly Salary and Age.
    2. Visualize the relationship between Monthly Salary and Total Working Years.
    3. Find if the correlations are different among Job Roles.

  134. Age (Numeric) vs. Monthly Income (Numeric)

  135. Show a Trend Line.

  136. None

  137. Check the strength of correlation.

  138. Visualize the relationship between Numerical and Numerical
    1. Visualize the relationship between Monthly Salary and Age.
    2. Visualize the relationship between Monthly Salary and Total Working Years.
    3. Find if the correlations are different among Job Roles.

  139. None

  140. Check the strength of correlation.

  141. Visualize the relationship between Numerical and Numerical
    1. Visualize the relationship between Monthly Salary and Age.
    2. Visualize the relationship between Monthly Salary and Total Working Years.
    3. Find if the correlations are different among Job Roles.

  142. The strength of correlation varies among Job Roles.

  143. Combination of Data Types
    • Category vs. Numerical
    • Numerical vs. Numerical
    • Category vs. Category

  144. Scatter Plot, Boxplot, Violin Plot, Heatmap, Density Plot, Stack Bar
  145. Category vs. Category: Calculate the size (number of rows) for each pair and/or calculate the ratio against the total size.
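The pair sizes and ratios described on this slide can be sketched with a counter over (category, category) pairs. The rows below are invented, and the "ratio within each Job Role" is an extra view added here as an assumption about what a stacked bar normalized per category would show.

```python
from collections import Counter

# Hypothetical (Job Role, Education) rows, for illustration only.
rows = [
    ("Manager", "Master"), ("Manager", "Master"), ("Manager", "Bachelor"),
    ("Sales Rep", "Bachelor"), ("Sales Rep", "Bachelor"), ("Sales Rep", "College"),
    ("Research Scientist", "Master"), ("Research Scientist", "Bachelor"),
]

# Size (number of rows) for each pair of categories.
pair_counts = Counter(rows)

# Ratio of each pair against the total number of rows.
total = len(rows)
pair_ratio = {pair: n / total for pair, n in pair_counts.items()}

# Ratio of each Education level within its Job Role.
role_totals = Counter(role for role, _ in rows)
within_role = {
    (role, edu): n / role_totals[role] for (role, edu), n in pair_counts.items()
}

for (role, edu), n in sorted(pair_counts.items()):
    print(f"{role:<20} {edu:<10} {n}  {pair_ratio[(role, edu)]:.1%}")
```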
  146. Visualize the relationship between Categorical and Categorical
    1. Visualize the relationship between Job Role and Education.
    2. Visualize the relationship between Job Role and Attrition.
    3. Visualize the relationship between Job Role and Monthly Salary.
    4. Visualize the relationship between Monthly Salary and Total Working Years.

  147. 1. Visualize the relationship between Job Role and Education.
    2. Visualize the relationship between Job Role and Attrition.
    3. Visualize the relationship between Job Role and Monthly Salary.
    4. Visualize the relationship between Monthly Salary and Total Working Years.
  148. Education (Category) vs. Job Role (Category)

  149. 1. Visualize the relationship between Job Role and Education.
    2. Visualize the relationship between Job Role and Attrition.
    3. Visualize the relationship between Job Role and Monthly Salary.
    4. Visualize the relationship between Monthly Salary and Total Working Years.

  150. Visualize the relationship between Job Role and Attrition: Job Role (Categorical) vs. Attrition (Logical)

  151. • Logical is a special case of Categorical.
    • It can have only two unique values.
    • They are TRUE or FALSE.
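Because a Logical column has only TRUE and FALSE, its relationship with a category can be summarized as the share of TRUE rows per category. A minimal sketch with made-up (Job Role, Attrition) rows:

```python
from collections import defaultdict

# Hypothetical (Job Role, Attrition) rows, for illustration only.
rows = [
    ("Sales Rep", True), ("Sales Rep", False), ("Sales Rep", True),
    ("Manager", False), ("Manager", False),
    ("Research Scientist", True), ("Research Scientist", False),
]

totals = defaultdict(int)
true_counts = defaultdict(int)
for role, attrition in rows:
    totals[role] += 1
    # In Python, True counts as 1 and False as 0 when added to an integer.
    true_counts[role] += attrition

# Share of TRUE rows within each Job Role.
attrition_rate = {role: true_counts[role] / totals[role] for role in totals}
for role, rate in attrition_rate.items():
    print(f"{role:<20} {rate:.0%}")
```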
  152. Categorical (e.g. California, Texas, New York, Florida, Oregon)
    • No continuous relationship
    • No ordinal relationship is necessary

  153. Logical: Is she Japanese? Either TRUE or FALSE

  154. Job Role (Category) vs. Attrition (Logical)

  155. 1. Visualize the relationship between Job Role and Education.
    2. Visualize the relationship between Job Role and Attrition.
    3. Visualize the relationship between Job Role and Monthly Salary.
    4. Visualize the relationship between Monthly Salary and Total Working Years.

  156. Visualize the relationship between Job Role and Monthly Income: Job Role (Categorical) vs. Monthly Income (Numerical)

  157. Visualize the relationship between Job Role and Monthly Income: Job Role (Categorical) vs. Monthly Income (Numerical). Binning turns Monthly Income (Numerical) into Monthly Income (Categorical).
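Binning here means replacing each numeric value with the label of the range it falls into, so the result can be treated as a category. The cut points, the helper name `bin_label`, and the rows below are assumptions for illustration; the deck does not state which ranges Exploratory picks.

```python
from collections import Counter

def bin_label(value, edges):
    """Return a label such as '5000-10000' for the range that contains value."""
    for lo, hi in zip(edges, edges[1:]):
        # Each range includes its lower edge and excludes its upper edge.
        if lo <= value < hi:
            return f"{lo}-{hi}"
    raise ValueError(f"{value} is outside the bin edges")

# Assumed cut points for Monthly Income.
edges = [0, 5000, 10000, 15000, 20000]

# Hypothetical (Job Role, Monthly Income) rows, for illustration only.
rows = [
    ("Manager", 17000), ("Manager", 12000),
    ("Sales Rep", 2500), ("Sales Rep", 4000),
    ("Research Scientist", 3000), ("Research Scientist", 6000),
]

# After binning, both columns are categorical, so pairs can simply be counted.
binned_counts = Counter((role, bin_label(income, edges)) for role, income in rows)
for (role, income_range), n in sorted(binned_counts.items()):
    print(f"{role:<20} {income_range:<12} {n}")
```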
  158. Monthly Income

  159. After Binning…

  160. Monthly Income (Category) vs. Job Role (Category)

  161. 1. Visualize the relationship between Job Role and Education.
    2. Visualize the relationship between Job Role and Attrition.
    3. Visualize the relationship between Job Role and Monthly Salary.
    4. Visualize the relationship between Monthly Salary and Total Working Years.
  162. Monthly Income (Category) vs. Total Working Years (Category)

  163. Monthly Income (Category) vs. Job Role (Category)

  164. Q & A

  165. EXPLORATORY