Slide 1

Slide 1 text

Exploratory Seminar Distance & MDS

Slide 2

Slide 2 text

EXPLORATORY

Slide 3

Slide 3 text

Speaker: Kan Nishida (@KanAugust), co-founder/CEO, Exploratory
Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Slide 4

Slide 4 text

Mission: Make Data Science Available for Everyone

Slide 5

Slide 5 text

The Third Wave
Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

Slide 6

Slide 6 text

Democratization of Data Science

            First Wave        Second Wave      Third Wave
Year        1976              2000             2016
Algorithms  Proprietary       Open Source      Open Source
Experience  UI & Programming  Programming      UI & Automation
Tools
Theme       Monetization      Commoditization  Democratization
Users       Statisticians     Data Scientists  Business Users

Slide 7

Slide 7 text

Exploratory Data Analysis
• Questions
• Data Access
• Data Wrangling
• Visualization
• Analytics (Statistics / Machine Learning)
• Communication (Dashboard, Note, Slides)

Slide 8

Slide 8 text

Exploratory Seminar Distance & MDS

Slide 9

Slide 9 text

Distance

Slide 11

Slide 11 text

Distance Types • Euclidean Distance • Manhattan Distance • Binary Distance

Slide 13

Slide 13 text

The United Nations Resolutions
Two voting examples:
• Jerusalem: General Assembly Resolution ES-10/L.22 - Criticizing US policy on Jerusalem (2017)
• Ukraine: General Assembly Resolution 68/262 - Territorial Integrity of Ukraine (2014)

Slide 14

Slide 14 text

How Countries Voted for UN Resolutions

        Jerusalem  Ukraine
US      No         Yes
Russia  Yes        No
Canada  Abstain    Yes

Slide 15

Slide 15 text

Give numeric weights to the voting results.
• Yes -> 1
• No -> -1
• Absent / Abstain -> 0

Slide 16

Slide 16 text

How Countries Voted for UN Resolutions

        Jerusalem  Ukraine
US      -1         1
Russia  1          -1
Canada  0          1
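
The weighting step above can be sketched in a few lines of Python. This is only an illustration: the names `VOTE_WEIGHTS`, `votes`, and `encode_votes` are made up for this example, and the weights are the ones listed on the earlier slide.

```python
# Weights from the slide: Yes -> 1, No -> -1, Absent / Abstain -> 0.
# All names here are hypothetical, chosen only for this sketch.
VOTE_WEIGHTS = {"Yes": 1, "No": -1, "Abstain": 0, "Absent": 0}

# Voting results as shown in the "How Countries Voted" table.
votes = {
    "US": {"Jerusalem": "No", "Ukraine": "Yes"},
    "Russia": {"Jerusalem": "Yes", "Ukraine": "No"},
    "Canada": {"Jerusalem": "Abstain", "Ukraine": "Yes"},
}


def encode_votes(raw_votes):
    """Replace each vote label with its numeric weight."""
    return {
        country: {resolution: VOTE_WEIGHTS[vote] for resolution, vote in by_resolution.items()}
        for country, by_resolution in raw_votes.items()
    }


encoded = encode_votes(votes)
```

With the votes in numeric form, each country becomes a point with one coordinate per resolution, which is what the distance calculations need.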

Slide 17

Slide 17 text

Visualize Voting Results with 2 Dimensions
(Scatter plot. Axes: Jerusalem and Ukraine, each running from -1 (No) through 0 to 1 (Yes). Points as (Jerusalem, Ukraine): US ( -1, 1 ), Russia ( 1, -1 ), and Canada.)

Slide 18

Slide 18 text

Euclidean Distance between US and Russia
Direct distance between US ( -1, 1 ) and Russia ( 1, -1 ): 2.828

Slide 19

Slide 19 text

Euclidean Distance between US and Canada
Direct distance between US ( -1, 1 ) and Canada ( 0, 1 ): 1

Slide 20

Slide 20 text

Euclidean Distances
US ( -1, 1 ) to Russia ( 1, -1 ): 2.828
US ( -1, 1 ) to Canada ( 0, 1 ): 1
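
As a quick check of the numbers on this slide, here is a minimal Python sketch of the Euclidean (straight-line) distance. The function name `euclidean` and the variable names are chosen for this example only; the coordinates are the (Jerusalem, Ukraine) values from the slides.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Points as (Jerusalem, Ukraine), taken from the slides.
us, russia, canada = (-1, 1), (1, -1), (0, 1)

us_russia = euclidean(us, russia)  # sqrt(2^2 + 2^2) = sqrt(8), about 2.828
us_canada = euclidean(us, canada)  # sqrt(1^2 + 0^2) = 1
```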

Slide 21

Slide 21 text

Distance Types • Euclidean Distance • Manhattan Distance • Binary Distance

Slide 22

Slide 22 text

Manhattan Distance

Slide 23

Slide 23 text

Manhattan distance is measured along the grid, one axis at a time.
US ( -1, 1 ) to Russia ( 1, -1 ): 2 + 2 = 4

Slide 24

Slide 24 text

Between US and Canada
US ( -1, 1 ) to Canada ( 0, 1 ): 1

Slide 25

Slide 25 text

Euclidean vs. Manhattan
(Two plots side by side. Points as (Jerusalem, Ukraine): US ( -1, 1 ), Russia ( 1, -1 ), Canada ( 0, 1 ).)

             Euclidean  Manhattan
US - Russia  2.828      4
US - Canada  1          1

Slide 26

Slide 26 text

Manhattan Distance emphasizes the difference more: US - Russia is 2.828 times as far as US - Canada by Euclidean distance, but 4 times as far by Manhattan distance.

             Euclidean  Manhattan
US - Russia  2.828      4
US - Canada  1          1
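
The comparison above can be reproduced with a short Python sketch. The function and variable names are invented for this example; the points are the (Jerusalem, Ukraine) coordinates used on the slides.

```python
import math


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Grid distance: sum of the absolute differences on each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Points as (Jerusalem, Ukraine), taken from the slides.
us, russia, canada = (-1, 1), (1, -1), (0, 1)

manhattan_us_russia = manhattan(us, russia)  # |-1 - 1| + |1 - (-1)| = 4
manhattan_us_canada = manhattan(us, canada)  # |-1 - 0| + |1 - 1| = 1

# How many times farther Russia is than Canada, seen from the US.
euclidean_ratio = euclidean(us, russia) / euclidean(us, canada)
manhattan_ratio = manhattan_us_russia / manhattan_us_canada
```

When two points differ on only one axis, as US and Canada do, both methods give the same value. The gap appears once the points differ on several axes at the same time.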

Slide 27

Slide 27 text

Distance Types • Euclidean Distance • Manhattan Distance • Binary Distance

Slide 28

Slide 28 text

Binary Distance = 1 - Jaccard Index

Slide 29

Slide 29 text

Use Binary Distance when you want to calculate the distance based on logical values (TRUE/FALSE, 1/0) instead of numerical values.

RC_ID  US  Japan  Canada  Russia
1      1   1      1       0
2      1   0      1       0
3      0   1      1       1
4      0   0      0       0

Slide 30

Slide 30 text

US vs. Japan

RC_ID  US  Japan
1      1   1
2      1   0
3      0   1
4      0   0
…

(Venn diagram placing RC_IDs 1 to 7 in these regions: Both US and Japan are 1; Only US is 1; Only Japan is 1; Both US and Japan are 0.)

Slide 31

Slide 31 text

US vs. Canada: Very Close

RC_ID  US  Canada
1      1   1
2      1   1
3      1   1
4      1   0
…

(Venn diagram placing RC_IDs 1 to 7 in these regions: Both US and Canada are 1; Only US is 1; Only Canada is 1.)

Slide 32

Slide 32 text

US vs. Russia: Very Far

RC_ID  US  Russia
1      1   0
2      1   0
3      0   1
4      1   0
…

(Venn diagram placing RC_IDs 1 to 7 in these regions: Both US and Russia are 1; Only US is 1; Only Russia is 1.)

Slide 33

Slide 33 text

• US vs. Japan: Not too close, not too far
• US vs. Canada: Very Close
• US vs. Russia: Very Far

Slide 34

Slide 34 text

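Binary Distance
• Jaccard Index = (number of items where both are 1) / (number of items where at least one is 1). Between 0 and 1. Bigger numbers imply being close.
• Binary Distance = 1 - Jaccard Index. Between 0 and 1. Bigger numbers imply being far.
The two measures run in opposite directions.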
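
To make the relationship concrete, here is a minimal Python sketch that applies it to the four-row US / Japan / Canada / Russia sample table shown earlier in this deck. The function names are invented for this example, and the handling of two all-zero vectors is an assumption of the sketch, not something the slides specify.

```python
def jaccard_index(a, b):
    """Share of items where both are 1, among items where at least one is 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-zero vectors are treated as identical (index 1).
    return both / either if either else 1.0


def binary_distance(a, b):
    """Binary Distance = 1 - Jaccard Index."""
    return 1.0 - jaccard_index(a, b)


# One 1/0 value per RC_ID (rows 1 to 4 of the sample table).
sample = {
    "US": [1, 1, 0, 0],
    "Japan": [1, 0, 1, 0],
    "Canada": [1, 1, 1, 0],
    "Russia": [0, 0, 1, 0],
}

us_canada = binary_distance(sample["US"], sample["Canada"])  # 1 - 2/3
us_japan = binary_distance(sample["US"], sample["Japan"])    # 1 - 1/3
us_russia = binary_distance(sample["US"], sample["Russia"])  # 1 - 0/3
```

Row 4, where every country is 0, does not count toward either the numerator or the denominator. On this small sample the ordering matches the slides: Canada is closest to the US, Russia is farthest, and Japan sits in between.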

Slide 35

Slide 35 text

Let’s try!

Slide 36

Slide 36 text

2016 California Election Data

Slide 37

Slide 37 text

Data
• 17 ballot measures.
• 59 California counties.
• Yes Ratio indicates the share of voters who voted ‘Yes’.

Slide 38

Slide 38 text

Question
Which California counties are similar based on the California Election data?

Slide 39

Slide 39 text

Analytics View
• Analytics Type: Similarity by Categories
• Category: COUNTY_NAME
• Measured By: BALLOT_MEASURE_TITLE
• Measure: yes_ratio
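
Conceptually, these settings ask for one vector of Yes ratios per county (one value per ballot measure) and a distance for every pair of counties. The Python sketch below illustrates that idea with the four-county, two-measure sample values from the appendix of this deck. It is not Exploratory's implementation; the names `long_rows`, `vectors`, and `distances` are invented for this example, and Euclidean distance is assumed.

```python
import math
from itertools import combinations

# (measure, county, yes ratio) rows: the sample values from the appendix.
long_rows = [
    ("Firearms Sales", "Sacramento", 0.57),
    ("Firearms Sales", "San Francisco", 0.85),
    ("Firearms Sales", "Los Angeles", 0.72),
    ("Firearms Sales", "Napa", 0.62),
    ("Cigarette Tax", "Sacramento", 0.51),
    ("Cigarette Tax", "San Francisco", 0.81),
    ("Cigarette Tax", "Los Angeles", 0.68),
    ("Cigarette Tax", "Napa", 0.65),
]

# Category = county, Measured By = ballot measure, Measure = yes ratio.
measures = sorted({measure for measure, _, _ in long_rows})
vectors = {}
for measure, county, ratio in long_rows:
    vectors.setdefault(county, {})[measure] = ratio


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# One distance per pair of counties, keyed by the alphabetically ordered pair.
distances = {
    (a, b): euclidean(
        [vectors[a][m] for m in measures],
        [vectors[b][m] for m in measures],
    )
    for a, b in combinations(sorted(vectors), 2)
}

closest_pair = min(distances, key=distances.get)
farthest_pair = max(distances, key=distances.get)
```

On this sample, Los Angeles and Napa come out as the most similar pair and Sacramento and San Francisco as the least similar.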

Slide 41

Slide 41 text

But… is this the best way to understand the similarity among the counties?

Slide 42

Slide 42 text

Visualizing Distance with MDS (Multi-Dimensional Scaling)

Slide 43

Slide 43 text

Distance Between Cities
Let’s think about the distance between 3 cities.

               San Francisco  Los Angeles  New York
San Francisco  0 miles
Los Angeles    380 miles      0 miles
New York       2,900 miles    2,700 miles  0 miles

Slide 44

Slide 44 text

We can look at the distance between two cities one by one.
• SF - NY: 2,900 miles
• SF - LA: 380 miles
• NY - LA: 2,700 miles

Slide 45

Slide 45 text

Or, look at them together on a 2-dimensional scale.
(Plot showing NY, SF, and LA.)

Slide 46

Slide 46 text

A 2-dimensional scale makes it more intuitive to understand the similarity.
(Plot showing NY, SF, and LA.)

Slide 47

Slide 47 text

We can transform the one-by-one (pairwise) distances into positions on a 2-dimensional scale with the MDS (Multi-Dimensional Scaling) algorithm.
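
As a rough illustration of that transformation, the sketch below runs classical (Torgerson) MDS on the three-city distance table from the earlier slide, using only the Python standard library. It is one common MDS variant and not necessarily the one Exploratory uses; the function name `classical_mds`, the power-iteration approach, and the fixed start vector are choices made for this example.

```python
import math


def classical_mds(dist, dims=2, iters=500):
    """Turn a symmetric distance matrix into coordinates (classical MDS)."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double-center the squared distances to get an inner-product matrix.
    b = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean) for j in range(n)]
        for i in range(n)
    ]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration finds the largest remaining eigenvector.
        v = [float(i + 1) for i in range(n)]  # fixed start vector (assumption)
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if eigval > 0:
            for i in range(n):
                coords[i][d] = v[i] * math.sqrt(eigval)
        # Deflate: remove this component before looking for the next one.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


cities = ["San Francisco", "Los Angeles", "New York"]
miles = [
    [0, 380, 2900],
    [380, 0, 2700],
    [2900, 2700, 0],
]
coords = classical_mds(miles)


def plotted_distance(i, j):
    """Distance between two cities as drawn on the 2-dimensional scale."""
    return math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
```

For these three cities the plotted distances reproduce the original mileages, because any three points whose distances satisfy the triangle inequality fit on a plane. With more items and more measures, a 2-dimensional layout is generally only an approximation of the original distances.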

Slide 48

Slide 48 text

The Similarity Map shows the result of MDS on a scatter chart.

Slide 49

Slide 49 text

It also uses the K-means clustering algorithm to group the data into a number of clusters based on the distance information.
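
For readers who want to see the grouping idea in code, here is a minimal K-means sketch in plain Python. The points are hypothetical 2-dimensional coordinates standing in for MDS output, and taking the first k points as the initial centers is a simplification made so the result is repeatable; real implementations usually pick random starting centers.

```python
def kmeans(points, k, iters=100):
    """Basic K-means: assign each point to its nearest center, then move the centers."""
    centers = [list(p) for p in points[:k]]  # simplification: first k points
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centers


# Hypothetical coordinates, not values from the election data.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.3)]
labels, centers = kmeans(points, k=2)
```

Here the three points near the origin end up in one cluster and the two points near (5, 5) in the other.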

Slide 50

Slide 50 text

You can uncheck “Show on Plot” in the chart properties to hide the county names from the plot area.

Slide 51

Slide 51 text

In the Analytics properties, you can customize the settings: you can change the distance method and set the number of clusters for K-means clustering.

Slide 52

Slide 52 text

Example: Number of Clusters: 5

Slide 53

Slide 53 text

Appendix

Slide 54

Slide 54 text

Long Data / Wide Data

Slide 55

Slide 55 text

Wide Data

Measure         Sacramento  San Francisco  Los Angeles  Napa
Firearms Sales  0.57        0.85           0.72         0.62
Cigarette Tax   0.51        0.81           0.68         0.65

Slide 56

Slide 56 text

Measure         Sacramento  San Francisco  Los Angeles  Napa
Firearms Sales  0.57        0.85           0.72         0.62
Cigarette Tax   0.51        0.81           0.68         0.65

If there are more counties, the data becomes wider.

Slide 57

Slide 57 text

Long Data

Measure         County         YesRatio
Firearms Sales  Sacramento     0.57
Firearms Sales  San Francisco  0.85
Firearms Sales  Los Angeles    0.72
Firearms Sales  Napa           0.62
Cigarette Tax   Sacramento     0.51
Cigarette Tax   San Francisco  0.81
Cigarette Tax   Los Angeles    0.68
Cigarette Tax   Napa           0.65

Slide 58

Slide 58 text

Measure         County         YesRatio
Firearms Sales  Sacramento     0.57
Firearms Sales  San Francisco  0.85
Firearms Sales  Los Angeles    0.72
Firearms Sales  Napa           0.62
Cigarette Tax   Sacramento     0.51
Cigarette Tax   San Francisco  0.81
Cigarette Tax   Los Angeles    0.68
Cigarette Tax   Napa           0.65

If there are more counties, the data becomes longer.

Slide 59

Slide 59 text

Gather / Un-Pivot: Wide Data to Long Data

Slide 60

Slide 60 text

Gather, Un-Pivot, Tidy

Wide Data

Measure         Sacramento  San Francisco  Los Angeles  Napa
Firearms Sales  0.57        0.85           0.72         0.62
Cigarette Tax   0.51        0.81           0.68         0.65

Long Data

Measure         County         YesRatio
Firearms Sales  Sacramento     0.57
Firearms Sales  San Francisco  0.85
Firearms Sales  Los Angeles    0.72
Firearms Sales  Napa           0.62
Cigarette Tax   Sacramento     0.51
Cigarette Tax   San Francisco  0.81
Cigarette Tax   Los Angeles    0.68
Cigarette Tax   Napa           0.65

Slide 61

Slide 61 text

Measure         Sacramento  San Francisco  Los Angeles  Napa
Firearms Sales  0.57        0.85           0.72         0.62
Cigarette Tax   0.51        0.81           0.68         0.65

gather(County, YesRatio, Sacramento:Napa)

Measure         County         YesRatio
Firearms Sales  Sacramento     0.57
Firearms Sales  San Francisco  0.85
Firearms Sales  Los Angeles    0.72
Firearms Sales  Napa           0.62
Cigarette Tax   Sacramento     0.51
Cigarette Tax   San Francisco  0.81
Cigarette Tax   Los Angeles    0.68
Cigarette Tax   Napa           0.65
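
For readers outside R, the same reshaping can be sketched in plain Python. The `gather` function below is a hand-written stand-in for this example, not the tidyr function, and its row order follows the table on the slide.

```python
# Wide data from the slide: one row per measure, one column per county.
wide = [
    {"Measure": "Firearms Sales", "Sacramento": 0.57, "San Francisco": 0.85,
     "Los Angeles": 0.72, "Napa": 0.62},
    {"Measure": "Cigarette Tax", "Sacramento": 0.51, "San Francisco": 0.81,
     "Los Angeles": 0.68, "Napa": 0.65},
]


def gather(rows, key, value, columns):
    """Stack the given columns into key/value pairs, one output row per cell."""
    long_rows = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in columns}
        for column in columns:
            long_rows.append({**kept, key: column, value: row[column]})
    return long_rows


counties = ["Sacramento", "San Francisco", "Los Angeles", "Napa"]
long_data = gather(wide, "County", "YesRatio", counties)
```

Each of the 2 x 4 cells in the wide table becomes its own row, so the result has 8 rows and 3 columns.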

Slide 62

Slide 62 text

Spread / Pivot: Long Data to Wide Data

Slide 63

Slide 63 text

Spread, Pivot, Un-Tidy

Long Data

Measure         County         YesRatio
Firearms Sales  Sacramento     0.57
Firearms Sales  San Francisco  0.85
Firearms Sales  Los Angeles    0.72
Firearms Sales  Napa           0.62
Cigarette Tax   Sacramento     0.51
Cigarette Tax   San Francisco  0.81
Cigarette Tax   Los Angeles    0.68
Cigarette Tax   Napa           0.65

Wide Data

Measure         Sacramento  San Francisco  Los Angeles  Napa
Firearms Sales  0.57        0.85           0.72         0.62
Cigarette Tax   0.51        0.81           0.68         0.65

Slide 64

Slide 64 text

Measure         County         YesRatio
Firearms Sales  Sacramento     0.57
Firearms Sales  San Francisco  0.85
Firearms Sales  Los Angeles    0.72
Firearms Sales  Napa           0.62
Cigarette Tax   Sacramento     0.51
Cigarette Tax   San Francisco  0.81
Cigarette Tax   Los Angeles    0.68
Cigarette Tax   Napa           0.65

spread(County, YesRatio)

Measure         Sacramento  San Francisco  Los Angeles  Napa
Firearms Sales  0.57        0.85           0.72         0.62
Cigarette Tax   0.51        0.81           0.68         0.65
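
The reverse reshaping can be sketched in plain Python as well. As before, `spread` here is a hand-written stand-in for this example, not the tidyr function.

```python
# Long data from the slide: one row per measure and county.
long_data = [
    {"Measure": "Firearms Sales", "County": "Sacramento", "YesRatio": 0.57},
    {"Measure": "Firearms Sales", "County": "San Francisco", "YesRatio": 0.85},
    {"Measure": "Firearms Sales", "County": "Los Angeles", "YesRatio": 0.72},
    {"Measure": "Firearms Sales", "County": "Napa", "YesRatio": 0.62},
    {"Measure": "Cigarette Tax", "County": "Sacramento", "YesRatio": 0.51},
    {"Measure": "Cigarette Tax", "County": "San Francisco", "YesRatio": 0.81},
    {"Measure": "Cigarette Tax", "County": "Los Angeles", "YesRatio": 0.68},
    {"Measure": "Cigarette Tax", "County": "Napa", "YesRatio": 0.65},
]


def spread(rows, key, value):
    """Turn each distinct key into its own column, one output row per remaining id."""
    wide_rows = {}
    for row in rows:
        # Everything except the key and value columns identifies the output row.
        ident = tuple((k, v) for k, v in row.items() if k not in (key, value))
        wide_rows.setdefault(ident, dict(ident))[row[key]] = row[value]
    return list(wide_rows.values())


wide = spread(long_data, "County", "YesRatio")
```

The 8 long rows collapse back into 2 wide rows, one per measure, with one column per county.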

Slide 65

Slide 65 text

Q & A

Slide 66

Slide 66 text

Contact
• Email: [email protected]
• Home Page: https://exploratory.io
• Twitter: @KanAugust
• Online Seminar: https://exploratory.io/online-seminar