Slide 1

Slide 1 text

EXPLORATORY Online Seminar #47 Survey Data Analysis Part 1 PCA, Clustering, & NPS

Slide 2

Slide 2 text

Speaker: Kan Nishida, CEO/co-founder, Exploratory (@KanAugust)
Summary: In Spring 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Slide 3

Slide 3 text

The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

Slide 4

Slide 4 text

4 Questions Communication Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Science Workflow

Slide 5

Slide 5 text

Exploratory: Modern & Simple UI. Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning)

Slide 6

Slide 6 text

EXPLORATORY Online Seminar #47 Survey Data Analysis Part 1 PCA, Clustering, & NPS

Slide 7

Slide 7 text

7 1. Correlation 2. PCA (Principal Component Analysis) 3. Clustering 4. NPS Survey Data Analysis Part 1

Slide 8

Slide 8 text

8 1. Correlation 2. PCA (Principal Component Analysis) 3. Clustering 4. NPS Survey Data Analysis Part 1

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

When you run a survey, you want to:
• Get as many responses as you can
• Get high-quality answers

Slide 14

Slide 14 text

The more questions there are, the fewer people complete the survey.

Slide 15

Slide 15 text

The more questions there are, the lower the quality of the answers.

Slide 16

Slide 16 text

“We can’t remove these questions because some people want them to be asked. How about offering an Amazon gift card?”
“We already have too many questions. It will take more than 20 minutes to answer them all.”

Slide 17

Slide 17 text

“I know, but this is a great opportunity to get to know them better. I don’t want to miss anything potentially important.”
“We should keep the questions to a minimum so that they can all be answered in under 5 minutes.”

Slide 18

Slide 18 text

18 We want to have our questions answered with high quality from as many customers as possible. How can we ask fewer questions without losing important information?

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Name       Passionate about my work   Consider my work is important   Support company’s mission
John       5                          5                               2
Nancy      5                          5                               3
Yoko       5                          4                               3
Mike       4                          5                               5
Stephany   4                          3                               4
Mary       3                          3                               2
Ken        3                          2                               1
Sunil      2                          2                               4
Tom        2                          1                               3
Brenda     1                          1                               3

If two questions have very similar answers, you can guess how a given person would answer one of them once you know how they answered the other.

Slide 21

Slide 21 text

[Scatter chart: Correlation. Axes: ‘Passionate about work’ and ‘My work is important’, each on a 1 to 5 scale.]

Slide 22

Slide 22 text

Correlation scale: -1 (Strong Negative Correlation), 0 (No Correlation), 1 (Strong Positive Correlation)

Slide 23

Slide 23 text

The Correlation Coefficient is 0.84, which indicates a strong positive correlation between the two questions.
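As a rough sketch of how such a coefficient can be computed, the snippet below calculates the Pearson correlation for the two questions using the ten answers from the example table. The function name `pearson` and the variable names are illustrative assumptions, not part of Exploratory. The value it produces for this small table need not equal the 0.84 quoted on the slide, which may have been computed from different data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Answers from the ten-row example table (John ... Brenda).
passionate = [5, 5, 5, 4, 4, 3, 3, 2, 2, 1]
important = [5, 5, 4, 5, 3, 3, 2, 2, 1, 1]

r = pearson(passionate, important)
print(f"Correlation coefficient: {r:.2f}")
```

A value close to 1 means the two questions are answered in nearly the same way.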

Slide 24

Slide 24 text

Name       Passionate about my work   Consider my work is important   Support company’s mission
John       5                          5                               2
Nancy      5                          5                               3
Yoko       5                          4                               3
Mike       4                          5                               5
Stephany   4                          3                               4
Mary       3                          3                               2
Ken        3                          2                               1
Sunil      2                          2                               4
Tom        2                          1                               3
Brenda     1                          1                               3

If two questions have very similar answers, you can guess how a given person would answer one of them once you know how they answered the other.

Slide 25

Slide 25 text

On the other hand…

Slide 26

Slide 26 text

Name       Passionate about my work   Consider my work is important   Support company’s mission
John       5                          5                               2
Nancy      5                          5                               3
Yoko       5                          4                               3
Mike       4                          5                               5
Stephany   4                          3                               4
Mary       3                          3                               2
Ken        3                          2                               1
Sunil      2                          2                               4
Tom        2                          1                               3
Brenda     1                          1                               3

Some questions are very different in terms of how they are answered.

Slide 27

Slide 27 text

27 The correlation coefficient is 0.019, which means there is almost no correlation between the two questions.

Slide 28

Slide 28 text

Name       Passionate about my work   Consider my work is important   Support company’s mission
John       5                          5                               2
Nancy      5                          5                               3
Yoko       5                          4                               3
Mike       4                          5                               5
Stephany   4                          3                               4
Mary       3                          3                               2
Ken        3                          2                               1
Sunil      2                          2                               4
Tom        2                          1                               3
Brenda     1                          1                               3

This means that removing either of these questions would lose a significant part of the information about the employees.

Slide 29

Slide 29 text

29 We have more questions, and we can investigate every single combination. But…

Slide 30

Slide 30 text

You can use ‘Correlation’ under the Analytics view to investigate the correlation between any combination of the variables.

Slide 31

Slide 31 text

Select the variables (questions) and run it.

Slide 32

Slide 32 text

You can see which pairs of questions are the most highly correlated.
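The same idea can be sketched in a few lines: compute the correlation for every pair of questions and sort the pairs by strength. The data is the three-question example table from the earlier slides; the names `answers`, `pearson`, and `ranked` are hypothetical, and this is only an illustration, not how Exploratory implements its Correlation view.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# One list of answers per question, in the row order of the example table.
answers = {
    "Passionate about my work": [5, 5, 5, 4, 4, 3, 3, 2, 2, 1],
    "Consider my work is important": [5, 5, 4, 5, 3, 3, 2, 2, 1, 1],
    "Support company's mission": [2, 3, 3, 5, 4, 2, 1, 4, 3, 3],
}

# Correlation for every pair of questions.
pairs = [
    (a, b, pearson(answers[a], answers[b]))
    for a, b in combinations(answers, 2)
]

# Strongest relationship first, regardless of sign.
ranked = sorted(pairs, key=lambda p: abs(p[2]), reverse=True)
for a, b, r in ranked:
    print(f"{r:+.2f}  {a}  vs  {b}")
```

Pairs at the top of the list are candidates for keeping only one of the two questions.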

Slide 33

Slide 33 text

33 1. Correlation 2. PCA (Principal Component Analysis) 3. Clustering 4. NPS Survey Data Analysis Part 1

Slide 34

Slide 34 text

‘Correlation’ helps you understand how strong (or weak) the relationship between two variables is. Using the Correlation Coefficient values, you can compare which combinations are more correlated than others. However, it doesn’t give you an overall picture of how all the questions are related to one another.

Slide 35

Slide 35 text

PCA (Principal Component Analysis): Generates a new set of artificial dimensions (components) that are not correlated with one another and that carry as much information from the original data as possible in fewer dimensions. It is one of the ‘Dimensionality Reduction’ techniques.

Slide 36

Slide 36 text

PCA
• Find the directions (Components) in the data that have high variance.
• Find the few components with high variance that explain most of the variance in the data. (Principal Components)

Slide 37

Slide 37 text

How does PCA find the new dimensions?
1. Find the center point of the whole data in the multi-dimensional space.
2. Find the direction that has the highest variance. (The 1st component)
3. Find the direction that is orthogonal to the 1st component and has the highest variance. (The 2nd component)
4. Find the direction that is orthogonal to the 1st and 2nd components and has the highest variance. (The 3rd component)
5. Repeat until the last (Nth) component.
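The steps above can be sketched for the simplest case of two questions, where the direction of highest variance has a closed form and no linear algebra library is needed. This is a minimal illustration using two columns of the earlier example table; the variable names are assumptions, and real tools handle many variables with an eigen or singular value decomposition rather than this two-variable shortcut.

```python
import math

# Two questions from the example table (John ... Brenda).
passionate = [5, 5, 5, 4, 4, 3, 3, 2, 2, 1]
important = [5, 5, 4, 5, 3, 3, 2, 2, 1, 1]
n = len(passionate)

# Step 1: center the data on its mean point.
mean_x = sum(passionate) / n
mean_y = sum(important) / n
xs = [x - mean_x for x in passionate]
ys = [y - mean_y for y in important]

# Entries of the 2x2 sample covariance matrix.
var_x = sum(x * x for x in xs) / (n - 1)
var_y = sum(y * y for y in ys) / (n - 1)
cov_xy = sum(x * y for x, y in zip(xs, ys)) / (n - 1)

# Step 2: direction of highest variance (closed form for two variables).
theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
pc1 = (math.cos(theta), math.sin(theta))
# Step 3: in two dimensions only one orthogonal direction remains.
pc2 = (-math.sin(theta), math.cos(theta))

def variance_along(direction):
    """Sample variance of the centered data projected onto a unit direction."""
    dx, dy = direction
    scores = [x * dx + y * dy for x, y in zip(xs, ys)]
    return sum(s * s for s in scores) / (n - 1)

var_pc1 = variance_along(pc1)
var_pc2 = variance_along(pc2)
explained = var_pc1 / (var_pc1 + var_pc2)
print(f"1st component direction: ({pc1[0]:.2f}, {pc1[1]:.2f})")
print(f"Share of variance carried by the 1st component: {explained:.0%}")
```

When the first component carries most of the variance, the two questions are largely telling you the same thing.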

Slide 38

Slide 38 text

38 PCA helps you understand which questions are similar to one another and how similar they are. And also, how different they are. You can grasp the overall relationship among all the questions. This helps you to understand which questions can be removed or should be kept. PCA for Survey Data Analysis

Slide 39

Slide 39 text

Let’s Try!

Slide 40

Slide 40 text

40 Sample Data Employee Satisfaction Survey

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

Each row represents one employee. Each column represents one question. Each cell is a survey answer (on a scale of 1 to 5).

Slide 43

Slide 43 text

43 Select ‘Principal Component Analysis (PCA)’.

Slide 44

Slide 44 text

44 Click on the Variable Columns button to select the variables.

Slide 45

Slide 45 text

45 Select all the questions (variables) and run it.

Slide 46

Slide 46 text

You will see a chart called a ‘Biplot’, which presents all the variables in a 2-dimensional space and places all the rows (employees) as dots in relation to the variables.

Slide 47

Slide 47 text

Variables that point in a similar direction are considered highly and positively correlated.

Slide 48

Slide 48 text

For example, these two questions are asking about a similar thing.

Slide 49

Slide 49 text

You can see that these are highly correlated when you visualize them with a Scatter chart.

Slide 50

Slide 50 text

Both of these questions are about ‘amount of work’ and are similar.

Slide 51

Slide 51 text

You can see that these are highly correlated when you visualize them with a Scatter chart.

Slide 52

Slide 52 text

These two questions diverge from each other by almost 90 degrees. This means they are independent of each other in the context of all the variables.
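One way to see why a right angle signals "no relationship": once each variable is centered on its mean, the cosine of the angle between two variables in the full data space equals their Pearson correlation, and the Biplot shows a 2-dimensional approximation of those angles. The sketch below uses the earlier three-question example table as a stand-in, since the survey questions shown on this slide are not reproduced here; the function name `angle_between` is hypothetical.

```python
import math

def angle_between(xs, ys):
    """Angle in degrees between two variables after centering each on its mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cx = [x - mean_x for x in xs]
    cy = [y - mean_y for y in ys]
    dot = sum(a * b for a, b in zip(cx, cy))
    norm = math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy))
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))

passionate = [5, 5, 5, 4, 4, 3, 3, 2, 2, 1]
important = [5, 5, 4, 5, 3, 3, 2, 2, 1, 1]
mission = [2, 3, 3, 5, 4, 2, 1, 4, 3, 3]

angle_similar = angle_between(passionate, important)
angle_independent = angle_between(passionate, mission)
print(f"Passionate vs. important: {angle_similar:.0f} degrees")
print(f"Passionate vs. mission:   {angle_independent:.0f} degrees")
```

A small angle corresponds to a high positive correlation, and an angle near 90 degrees corresponds to a correlation near zero.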

Slide 53

Slide 53 text

You can see that these are not correlated at all when you visualize them with a Scatter chart.

Slide 54

Slide 54 text

These are the questions we can consider removing, because removing them won’t lose much information.

Slide 55

Slide 55 text

With a Scatter chart, you can visualize the relationship between a given pair of questions intuitively.
With Correlation under Analytics, you can investigate the strength of the correlation for every combination of the questions.
With PCA under Analytics, you can visualize the relationship among all the questions and see which questions are similar or different in an overall view.

Slide 56

Slide 56 text

With these tools, you can work out the minimal set of questions that doesn’t lose much information. By reducing the number of questions, you will have more people complete your survey with high-quality answers, which will help you understand your customers better.

Slide 57

Slide 57 text

57 1. Correlation 2. PCA (Principal Component Analysis) 3. Clustering 4. NPS Survey Data Analysis Part 1

Slide 58

Slide 58 text

58 Some people answer the questions the same way, but some don’t. Can we segment them based on how they answer the questions so that we can approach them differently in more optimized ways?

Slide 59

Slide 59 text

59 Let’s say we ask what is important about their work.

Slide 60

Slide 60 text

Name       Relationship is important for my work   Salary is important for my work
John       5                                       2
Nancy      5                                       1
Yoko       5                                       2
Mike       4                                       2
Stephany   4                                       1
Mary       4                                       1
Ken        1                                       4
Sunil      2                                       5
Tom        2                                       5
Brenda     1                                       5

For some people Relationship is more important, but for others Salary is more important.

Slide 61

Slide 61 text

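Name       Relationship is important for my work   Salary is important for my work   Group
John       5                                       2                                 Relationship is more important
Nancy      5                                       1                                 Relationship is more important
Yoko       5                                       2                                 Relationship is more important
Mike       4                                       2                                 Relationship is more important
Stephany   4                                       1                                 Relationship is more important
Mary       4                                       1                                 Relationship is more important
Ken        1                                       4                                 Salary is more important
Sunil      2                                       5                                 Salary is more important
Tom        2                                       5                                 Salary is more important
Brenda     1                                       5                                 Salary is more important

We can segment them into 2 groups.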

Slide 62

Slide 62 text

We have more questions! Can we segment them based on how they answered all these questions automatically?

Slide 63

Slide 63 text

Clustering

Slide 64

Slide 64 text

Clustering
• Detect the inherent structures in the data
• Categorize the data into groups of maximum commonality

Slide 65

Slide 65 text

Let's do it! 65

Slide 66

Slide 66 text

66 Sample Data Employee Satisfaction Survey

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

Each row represents one employee and their survey answers.

Slide 69

Slide 69 text

69 Select ‘K-Means Clustering’ under the Analytics view.

Slide 70

Slide 70 text

70 Select all the numerical variables (questions) and run it.
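To give a feel for what K-Means does, here is a minimal sketch run on the two-question ‘Relationship’ and ‘Salary’ table from the earlier slide. It alternates between assigning each person to the nearest cluster center and moving each center to the mean of its members. The function name `kmeans` and the choice to seed the two centers with the first and last rows are assumptions made to keep the example deterministic; real implementations, including whatever Exploratory uses, typically pick starting centers differently.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid (squared distance).
        new_labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
            )
            for point in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

names = ["John", "Nancy", "Yoko", "Mike", "Stephany",
         "Mary", "Ken", "Sunil", "Tom", "Brenda"]
# (Relationship is important, Salary is important) for each person.
points = [(5, 2), (5, 1), (5, 2), (4, 2), (4, 1),
          (4, 1), (1, 4), (2, 5), (2, 5), (1, 5)]

# Assumption: seed the two clusters with the first and last rows.
labels, centroids = kmeans(points, [points[0], points[-1]])
for name, label in zip(names, labels):
    print(f"{name}: cluster {label + 1}")
```

On this table the two clusters match the ‘Relationship is more important’ and ‘Salary is more important’ groups identified by eye.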

Slide 71

Slide 71 text

Once you run it, you will see a Biplot chart similar to the one we saw with PCA.

Slide 72

Slide 72 text

People in Cluster 1 are located on the opposite side from the satisfaction questions, which means that they score low on these questions. They are not happy!

Slide 73

Slide 73 text

On the other hand, people in Cluster 2 scored high on the satisfaction questions. We can consider this group the ‘happy’ group.

Slide 74

Slide 74 text

People in Cluster 3 score high on the questions related to the company mission and the amount of work.

Slide 75

Slide 75 text

The Boxplot tab shows you the distribution of the scores on each question in each cluster. The Y-Axis values are the scores on a standardized scale.
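A common way to put scores on a standardized scale is the z-score: subtract the mean and divide by the standard deviation, so that 0 means "average" and 1 means "one standard deviation above average". The sketch below assumes that is what "standardized" means here, which the slide does not confirm; the function name `standardize` is hypothetical, and the data is one column of the earlier example table.

```python
import statistics

def standardize(values):
    """Convert raw scores to z-scores (mean 0, sample standard deviation 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

# 'Passionate about my work' answers from the example table.
passionate = [5, 5, 5, 4, 4, 3, 3, 2, 2, 1]

z_scores = standardize(passionate)
for raw, z in zip(passionate, z_scores):
    print(f"raw {raw} -> standardized {z:+.2f}")
```

Standardizing makes questions with different spreads comparable on the same Y-Axis.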

Slide 76

Slide 76 text

Cluster 1’s satisfaction levels are low on all measures. This is the ‘unhappy’ group.

Slide 77

Slide 77 text

Cluster 2’s satisfaction levels are high overall, though their support for the mission is relatively low.

Slide 78

Slide 78 text

People in Cluster 3 score high on the questions related to the mission, the salary, and the amount of work.

Slide 79

Slide 79 text

We can use the Label Column to see how another attribute is related to each cluster.

Slide 80

Slide 80 text

80 By assigning the Age column to the Label, you can see the age bucket for each employee that is shown as a dot.

Slide 81

Slide 81 text

81 Under the Stack Bar tab, we can see the ratio of each age bucket in each cluster.

Slide 82

Slide 82 text

For example, Cluster 2 is the ‘happy’ group, and we can see that it consists mainly of employees in their 40s.

Slide 83

Slide 83 text

On the other hand, Cluster 3 is the group that supports the company mission the most, and it consists mainly of employees in their 20s and 30s.

Slide 84

Slide 84 text

84 With Clustering under Analytics, you can segment the respondents (customers, employees, etc.) of the survey questions into a few groups and understand the characteristics of each group. This type of insight will help you strategize how you can approach or communicate to your customers (or employees) in more optimized ways.

Slide 85

Slide 85 text

85 1. Correlation 2. PCA (Principal Component Analysis) 3. Clustering 4. NPS Survey Data Analysis Part 1

Slide 86

Slide 86 text

86 Often, we do surveys because we want to understand if / how customers are satisfied with our product or service in order to improve our product or service.

Slide 87

Slide 87 text

A typical question about customer satisfaction would be…

Slide 88

Slide 88 text

The problem with this question is that it is vague, so many people end up scoring too high (or too low) without thinking much about it.

Slide 89

Slide 89 text

This is where NPS (Net Promoter Score) comes to the rescue. NPS is a measure of how much value the customers find in your product or service.

Slide 90

Slide 90 text

NPS asks how likely customers are to recommend your product or service to other people.

Slide 91

Slide 91 text

Because the question is more specific, people don’t blindly score high unless they can see themselves really doing it.

Slide 92

Slide 92 text

According to Airbnb, 4% of the customers who scored 10 referred other customers within a year, while 0% of the customers who scored between 0 and 6 referred anyone.

Slide 93

Slide 93 text

Now that 100 people have answered the NPS question, how should we calculate the overall NPS? Not by taking the average.

Slide 94

Slide 94 text

We’ll group the scores (0 to 10) into 3 buckets.

Slide 95

Slide 95 text

First, the people who scored 6 or lower (0 to 6) are called ‘Detractors’.

Slide 96

Slide 96 text

Second, the ones who scored 7 or 8 are called ‘Passives’. They are not disappointed, but they also don’t think your product is superb.

Slide 97

Slide 97 text

Last, the people who scored 9 or 10 are called ‘Promoters’. These are the people who are really satisfied and therefore will tell their friends good things about your product.

Slide 98

Slide 98 text

NPS = % of Promoters − % of Detractors
(Promoters scored 9 or 10; Detractors scored 0 to 6.)

Slide 99

Slide 99 text

Let’s say 10 people answered the NPS question as shown below.
[Chart: responses on the 0 to 10 scale]

Slide 100

Slide 100 text

We can segment them into the three groups.
[Chart: responses on the 0 to 10 scale, with the Detractors and Promoters marked]

Slide 101

Slide 101 text

We calculate the % of Promoters and the % of Detractors.
Detractors: 30% (3/10)
Promoters: 40% (4/10)

Slide 102

Slide 102 text

We can subtract the % of Detractors from the % of Promoters.
Detractors: 30% (3/10)
Promoters: 40% (4/10)
NPS = % of Promoters (40%) − % of Detractors (30%) = 10
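The same calculation can be sketched in code: first label each answer as Promoter, Passive, or Detractor, then subtract the share of Detractors from the share of Promoters. The individual scores below are hypothetical, chosen only so that they contain 3 Detractors, 3 Passives, and 4 Promoters as in the example; the function names `categorize` and `nps` are also assumptions.

```python
def categorize(score):
    """Map a 0-10 answer to its NPS bucket."""
    if score >= 9:
        return "Promoter"
    if score >= 7:
        return "Passive"
    return "Detractor"

def nps(scores):
    """NPS = % of Promoters minus % of Detractors, on a -100 to 100 scale."""
    categories = [categorize(s) for s in scores]
    promoters = categories.count("Promoter")
    detractors = categories.count("Detractor")
    return 100 * (promoters - detractors) / len(scores)

# Hypothetical answers: 3 Detractors, 3 Passives, 4 Promoters.
scores = [3, 5, 6, 7, 8, 8, 9, 9, 10, 10]

result = nps(scores)
print(f"NPS = {result:.0f}")  # 40% Promoters - 30% Detractors = 10
```

Passives count toward the total number of respondents but do not add to or subtract from the score.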

Slide 103

Slide 103 text

103 In general, if your NPS is greater than 50 you are considered ‘Excellent’. If it is greater than 70 you are considered ‘World Class!’

Slide 104

Slide 104 text

[Figure: example NPS values: 62, 68, 72, 96, 74, 77]

Slide 105

Slide 105 text

When the NPS is around 50, there tend to be many Passives.

Slide 106

Slide 106 text

When the NPS goes beyond 70, a significant portion of people score 9 or 10 and there are not many Detractors.

Slide 107

Slide 107 text

107 Here is a distribution of NPS scores for Airbnb. NPS = 74

Slide 108

Slide 108 text

Let's do it! 108

Slide 109

Slide 109 text

109 Each row represents each customer’s answer.

Slide 110

Slide 110 text

We don’t have a column that indicates whether a given customer is a Promoter, Passive, or Detractor, so we need to create one.

Slide 111

Slide 111 text

No content

Slide 112

Slide 112 text

No content

Slide 113

Slide 113 text

No content

Slide 114

Slide 114 text

No content

Slide 115

Slide 115 text

No content

Slide 116

Slide 116 text

No content

Slide 117

Slide 117 text

No content

Slide 118

Slide 118 text

No content

Slide 119

Slide 119 text

No content

Slide 120

Slide 120 text

No content

Slide 121

Slide 121 text

No content

Slide 122

Slide 122 text

No content

Slide 123

Slide 123 text

No content

Slide 124

Slide 124 text

No content

Slide 125

Slide 125 text

By creating a dashboard from the calculated NPS data, you can understand the latest performance and the trend over time.

Slide 126

Slide 126 text

That’s it for today!

Slide 127

Slide 127 text

Next Seminar

Slide 128

Slide 128 text

EXPLORATORY Online Seminar #48 6/16/2021 (Wed) 11AM PT Exploratory v6.6

Slide 129

Slide 129 text

No content

Slide 130

Slide 130 text

Information
Email: kan@exploratory.io
Website: https://exploratory.io
Twitter: @ExploratoryData
Seminar: https://exploratory.io/online-seminar

Slide 131

Slide 131 text

Q & A 131

Slide 132

Slide 132 text

EXPLORATORY 132