Introduction to Exploratory v6.0

Kan Nishida

June 10, 2020

Transcript

  1. EXPLORATORY Seminar #28 Exploratory v6.0

  2. Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory

    Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  3. The Third Wave: Data Science is not just for Engineers and Statisticians.

    Exploratory makes it possible for Everyone to do Data Science.
  4. Data Science Workflow

    Diagram labels: Questions, Communication, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis.
  5. Exploratory: Modern & Simple UI

    Diagram labels: Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis.
  6. EXPLORATORY Seminar #28 Exploratory v6.0

  7. EXPLORATORY v6.0 4th Year Anniversary

  8. EXPLORATORY v1.0 2016-05-06

  9. None
  10. None
  11. None
  12. EXPLORATORY v2.0 2016-09-01

  13. None
  14. EXPLORATORY v3.0 2017-01-24

  15. None
  16. None
  17. EXPLORATORY v4.0 2017-08-08

  18. None
  19. None
  20. EXPLORATORY v5.0 2018-10-31

  21. None
  22. EXPLORATORY v6.0 2020-05-28

  23. EDA (Exploratory Data Analysis)

    An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.
  24. 24

  25. "Far better an approximate answer to the right question,

    which is often vague, than an exact answer to the wrong question, which can always be made precise." (John Tukey)
  26. Two Principal Questions for EDA

    • How do the variables vary?
    • How are the variables associated or correlated with one another?
  27. Variation (chart labels: Average (Mean), $20,000, $1,000)

  28. Association and Correlation

    A relationship in which changes in one variable happen together with changes in another variable according to a certain rule.
  29. Association: Any type of relationship between two variables.

    Correlation: A certain type of (usually linear) association between two continuous variables.
  30. Association: The variation of Monthly Income differs among countries.

    Chart: Monthly Income (axis ticks 0, 2500, 5000) by Country (US, UK, Japan).
  31. Correlation: The greater the Age, the greater the Monthly Income.

    Chart: Monthly Income by Age.
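
The strength of a relationship like the one on this slide is commonly summarized with the Pearson correlation coefficient. The following is a minimal sketch in Python rather than anything shown in the seminar; the `pearson` helper and the age and income values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: income tends to rise with age.
ages = [25, 30, 35, 40, 45, 50]
incomes = [2500, 2900, 3400, 3600, 4300, 4800]
r = pearson(ages, incomes)
print(f"correlation = {r:.3f}")
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little linear relationship.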
  32. Summary View

  33. How does each variable vary?

  34. v1.0

  35. v6.0

  36. None
  37. Two Principal Questions for EDA

    • How do the variables vary?
    • How are the variables associated or correlated with one another?
  38. Highlight Correlate

  39. Highlight

  40. Highlight

  41. Highlight - Ratio

  42. None
  43. Correlate

  44. Correlate - Numerical

  45. Correlate - Logical

  46. None
  47. It will create an Error Bar chart automatically.
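
An error bar chart typically shows a group mean together with an interval around it. As a rough sketch of the numbers behind such a chart, the Python below computes a mean and an approximate 95% confidence interval per group. The normal approximation (z = 1.96), the `mean_with_ci` helper, and the sample values are assumptions for illustration; the seminar does not state how Exploratory computes its intervals.

```python
import math
import statistics

def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    m = statistics.mean(values)
    # Standard error of the mean from the sample standard deviation.
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m, m - z * se, m + z * se

# Hypothetical numeric values split by a logical (TRUE/FALSE) column.
groups = {
    "TRUE": [52, 48, 55, 60, 50],
    "FALSE": [40, 42, 38, 45, 41],
}
error_bars = {name: mean_with_ci(vals) for name, vals in groups.items()}
for name, (m, lo, hi) in error_bars.items():
    print(f"{name}: mean={m:.1f}, interval=({lo:.1f}, {hi:.1f})")
```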

  48. None
  49. It will create a Chi-Square Test automatically.
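
A chi-square test of independence compares the observed counts in a contingency table with the counts expected if the two variables were unrelated. The sketch below computes only the test statistic and degrees of freedom in plain Python; the `chi_square_statistic` helper and the counts are hypothetical, and the p-value step is left out.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical 2x2 table: rows are one logical variable, columns another.
observed = [[30, 10],
            [20, 40]]
stat, dof = chi_square_statistic(observed)
print(f"chi-square = {stat:.3f}, degrees of freedom = {dof}")
```

The larger the statistic relative to its degrees of freedom, the stronger the evidence that the two variables are associated.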

  50. None
  51. It will create a Logistic Regression model automatically.

  52. Data Wrangling

  53. • Table View: Filter
      • Filter: Outlier
      • Window Calculation for Date / POSIXct Columns
  54. Table View Filter

  55. Table View Filter

  56. Table View Filter: Disable

  57. Outliers

  58. None
  59. None
  60. Chart Level Filter: Disable

  61. Highlight - Outliers

  62. Table Filter - Outliers
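
The slides do not say which rule is used to detect outliers. One common choice is the interquartile range (IQR) rule, sketched below in Python as an assumption rather than as Exploratory's method; the helper names, the 1.5 multiplier, and the income values are illustrative.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

# Hypothetical monthly incomes with one extreme value.
incomes = [2100, 2300, 2500, 2600, 2800, 3000, 3100, 20000]
filtered = remove_outliers(incomes)
print(filtered)
```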

  63. Window Calculation - Date / POSIXct Data Type Columns

  64. Chart

  65. • Map: Standard
      • Map: More Granular Level Zoom In/Out
      • Limit Values for Color / Repeat By
      • Multiple Reference Lines
  66. Map - Standard

  67. You can switch between Circle and Area types.

  68. Various types of names can be used for the column assignment.

    For example, a country can be given by name (US, United States, etc.) or by its ISO2 or ISO3 code.
  69. Limit Values on Color

    Sometimes, there are too many values in Color.
  70. You can use ‘Limit’ to show only the Top N

    countries or the countries that match a given condition.
  71. It shows only the top 20 countries based on the

    ‘Death’ value on the last day.
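
The Top N limit described on these slides can be sketched as: find the latest date, rank the groups by their value on that date, and keep every row belonging to the top groups. The Python below is an illustration only; the `top_n_by_last_day` helper, the placeholder country names, and the counts are made up.

```python
from datetime import date

def top_n_by_last_day(rows, n):
    """rows are (country, day, value) tuples.

    Keep all rows for the n countries with the largest value on the latest day.
    """
    last_day = max(day for _, day, _ in rows)
    last_values = {c: v for c, day, v in rows if day == last_day}
    top = set(sorted(last_values, key=last_values.get, reverse=True)[:n])
    return [row for row in rows if row[0] in top]

# Hypothetical time series for four placeholder countries.
rows = [
    ("A", date(2020, 5, 1), 10), ("A", date(2020, 5, 2), 30),
    ("B", date(2020, 5, 1), 50), ("B", date(2020, 5, 2), 20),
    ("C", date(2020, 5, 1), 5),  ("C", date(2020, 5, 2), 40),
    ("D", date(2020, 5, 1), 1),  ("D", date(2020, 5, 2), 2),
]
limited = top_n_by_last_day(rows, n=2)
print(sorted({country for country, _, _ in limited}))
```

Note that "B" has the largest value on the first day but is dropped, because the ranking uses only the last day.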
  72. Multiple Reference Lines

  73. Analytics

  74. • Model Interpretability
      • Survival Analysis: Random Survival Forest
      • Survival Analysis: Cox Regression: Survival Curve, Prediction
      • ROC Curve
  75. Model Interpretability

  76. Two Principal Questions for EDA

    • How do the variables vary?
    • How are the variables associated or correlated with one another?
  77. Build Prediction Model

    Job | Age | Good Looking | Nationality | School
    Politician | 60s | FALSE | Japanese | Law School
    Actor | 40s | TRUE | American | Actor’s School
    Actor | 50s | TRUE | American | Actor’s School
    Politician | 40s | TRUE | Canadian | Law School
    Politician | 50s | TRUE | American | Law School
    Actor | 50s | TRUE | American | Actor’s School

    Diagram labels: Algorithm, Model
  78. Predict

    Data used to build the model:
    Job | Age | Good Looking | Nationality | School
    Politician | 60s | FALSE | Japanese | Law School
    Actor | 40s | TRUE | American | Actor’s School
    Actor | 50s | TRUE | American | Actor’s School
    Politician | 40s | TRUE | Canadian | Law School
    Politician | 50s | TRUE | American | Law School
    Actor | 50s | TRUE | American | Actor’s School

    Diagram labels: Algorithm, Model

    Data to predict (Job unknown):
    Job | Age | Good Looking | Nationality | School
    ? | 60s | FALSE | Japanese | Law School
    ? | 40s | TRUE | American | Actor’s School
    ? | 50s | TRUE | American | Actor’s School

    Predicted result:
    Job | Age | Good Looking | Nationality | School
    Politician | 60s | FALSE | Japanese | Law School
    Actor | 40s | TRUE | American | Actor’s School
    Actor | 50s | TRUE | American | Actor’s School
  79. The algorithm has learned the patterns to predict.

    Diagram labels: Algorithm, Model

    Job | Age | Good Looking | Nationality | School
    Politician | 60s | FALSE | Japanese | Law School
    Actor | 40s | TRUE | American | Actor’s School
    Actor | 50s | TRUE | American | Actor’s School
    Politician | 40s | TRUE | Canadian | Law School
    Politician | 50s | TRUE | American | Law School
    Actor | 50s | TRUE | American | Actor’s School
  80. Prediction models recognize relationships among the variables and predict based

    on the relationship.
  81. Interpreting a model means understanding the relationships in the data.

  82. The difference among prediction model algorithms (Statistical Learning, Machine

    Learning) lies in which relationships they can recognize. What kinds of relationships have the algorithms found?
  83. Can we have a common framework to understand the relationships

    effectively?
  84. Analytics Grammar

    A common framework for understanding what the models found.
  85. Analytics Grammar

    • Variable Importance
    • Prediction by Variable
    • Summary - Model Quality
    • Predicted Data
  86. Variable Importance

    Which variables are more correlated or associated with a target variable?
  87. Variable Importance

    It uses the ‘Permutation Importance’ method, which scores the importance of each variable by evaluating how much the prediction quality degrades without the variable.
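
In the usual formulation of permutation importance, a variable is not removed outright. Its values are shuffled across rows, which breaks its link to the target, and the increase in prediction error is taken as its importance. The following is a minimal sketch of that idea in Python, not Exploratory's implementation; the toy `model`, the column names `x1` and `x2`, and the data are hypothetical.

```python
import random

def mean_squared_error(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, n_repeats=5, seed=0):
    """Average increase in error after shuffling each variable in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    importance = {}
    for name in rows[0]:
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving the other columns untouched.
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
            error = mean_squared_error(model, permuted, targets)
            increases.append(error - baseline)
        importance[name] = sum(increases) / n_repeats
    return importance

def model(row):
    # Hypothetical already-fitted model: depends on x1 and ignores x2.
    return 3.0 * row["x1"] + 0.0 * row["x2"]

rows = [{"x1": float(i), "x2": float((i * 7) % 5)} for i in range(20)]
targets = [3.0 * r["x1"] for r in rows]
importance = permutation_importance(model, rows, targets)
print(importance)
```

Shuffling `x1` hurts the predictions, so it receives a positive score, while shuffling `x2` changes nothing because the toy model never uses it.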
  88. Who is more important?

  89. How does the performance degrade when John Lennon is not here?

  90. How about without Ringo Starr?

  91. How about without George?

  92. How about without Paul?

  93. Variable Importance

  94. Linear Regression

  95. GLM - Poisson Regression

  96. Logistic Regression

  97. Decision Tree

  98. Random Forest

  99. Cox Regression

  100. Random Survival Forest

  101. Prediction by Variables

    How does the target variable change when each predictor variable changes?
  102. Linear Regression

  103. GLM - Poisson Regression

  104. Logistic Regression

  105. Decision Tree

  106. Random Forest

  107. Cox Regression

  108. Random Survival Forest

  109. Survival Prediction Model

  110. Cox Regression

  111. Random Survival Forest

  112. Chart labels: Calculated by Kaplan-Meier; Predicted by Cox Regression Model

    Cox Regression is not good at capturing a relationship that changes over time because of its constraint that the hazard ratio is constant.
  113. Chart labels: Calculated by Kaplan-Meier; Predicted by Random Survival Forest Model

    Random Survival Forest can capture a relationship that changes over time because it is a machine learning algorithm with fewer constraints.
  114. Cox Regression / Random Survival Forest

    The survival curves with different trends.
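
The "Calculated by Kaplan-Meier" curves on the preceding slides are the model-free reference against which the model predictions are compared. As a rough illustration of how such a curve is computed, here is a small Kaplan-Meier estimator in plain Python; the `kaplan_meier` helper, the durations, and the event flags are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    events[i] is True if the event was observed at durations[i],
    and False if the subject was censored at that time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0
        leaving = 0
        # Group all subjects that share the same time t.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            # Multiply by the share of at-risk subjects surviving past t.
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

durations = [1, 2, 2, 3, 4, 5]
events = [True, True, False, True, False, True]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored subjects do not lower the curve themselves, but they leave the at-risk set, which changes the size of later steps.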
  115. Data Access

  116. • Weather Data
       • Schedule for Extension Data

  117. Weather Data

  118. Select Country Network and Station.

  119. None
  120. Schedule for Extension Data

  121. Release Note: Many more Enhancements & Bug Fixes! https://exploratory.io/release-notes

  122. Exploratory Desktop - Personal / Business / Community: https://exploratory.io/download

    Exploratory Public Download: https://exploratory.io/download-public
  123. Next Seminar

  124. Data Visualization Workshop

    • Part 1 - Basics: Visualizing Summarized Data
    • Part 2 - Visualizing Time Series Data
    • Part 3 - Visualizing Variance & Correlation
    • Part 4 - Visualizing Uncertainty - 6/17 (Wed)
    • Part 5 - Data Wrangling for Data Visualization
  125. Q & A

  126. Information: Email kan@exploratory.io; Website https://exploratory.io; Twitter @KanAugust; Training https://exploratory.io/training

  127. EXPLORATORY