Exploratory: An Introduction to Distance / MDS algorithms

A family of distance algorithms helps you understand the similarity (or difference) among your subjects, such as customers, countries, or products.

By using another algorithm called Multi-Dimensional Scaling (MDS), you can visualize these relationships in a much more intuitive way.

Kan introduces Distance algorithms and how to use them in Exploratory.

Kan Nishida

July 17, 2019

Transcript

  1. Speaker: Kan Nishida, co-founder/CEO, Exploratory (@KanAugust)

    Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  2. The Third Wave

    Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  3. Democratization of Data Science

    First Wave (1976): theme Monetization; users Statisticians.
    Second Wave (2000): theme Commoditization; users Data Scientists.
    Third Wave (2016): theme Democratization; users Business Users.
    Other rows on the slide: Algorithms, Experience, Tools (values shown: Proprietary, Open Source, UI & Programming, Programming, Open Source, UI & Automation).
  4. Exploratory Data Analysis

    Questions; Data Access; Data Wrangling; Visualization; Analytics (Statistics / Machine Learning); Communication (Dashboard, Note, Slides).
  5. The United Nations Resolutions

    Two voting examples:
    • Jerusalem: General Assembly Resolution ES-10/L.22 - Criticizing US policy on Jerusalem (2017)
    • Ukraine: General Assembly Resolution 68/262 - Territorial Integrity of Ukraine (2014)
  6. How Countries Voted for UN Resolutions

    Country   Jerusalem   Ukraine
    US        No          Yes
    Russia    Yes         No
    Canada    Abstain     Yes
  7. Give numeric weights to the voting result.

    • Yes -> 1
    • No -> -1
    • Absence / Abstain -> 0
  8. How Countries Voted for UN Resolutions

    Country   Jerusalem   Ukraine
    US        -1          1
    Russia    1           -1
    Canada    0           1
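
The weighting step on the two slides above can be sketched in a few lines. This is a minimal illustration in plain Python (the deck itself works in the Exploratory UI); the names `WEIGHTS`, `votes`, and `encode` are hypothetical and only the vote values come from the slides.

```python
# Numeric weights from the slide: Yes -> 1, No -> -1, Absence / Abstain -> 0.
WEIGHTS = {"Yes": 1, "No": -1, "Abstain": 0, "Absence": 0}

# Votes as shown on the slide, in (Jerusalem, Ukraine) order.
votes = {
    "US": ("No", "Yes"),
    "Russia": ("Yes", "No"),
    "Canada": ("Abstain", "Yes"),
}


def encode(ballots):
    """Turn a tuple of vote labels into a tuple of numeric weights."""
    return tuple(WEIGHTS[ballot] for ballot in ballots)


# One numeric vector per country; each resolution becomes one dimension.
encoded = {country: encode(ballots) for country, ballots in votes.items()}
print(encoded)
```

Once every country is a numeric vector, any distance measure can be applied to a pair of countries.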
  9. Visualize Voting Results with 2 Dimensions

    Axes: Jerusalem and Ukraine, each running from -1 (No) through 0 to 1 (Yes). Points as (Jerusalem, Ukraine): US (-1, 1), Russia (1, -1), Canada (0, 1).
  10. Euclidean Distance between US and Russia

    The direct (straight-line) distance between US (-1, 1) and Russia (1, -1) is 2.828.
  11. Euclidean Distance between US and Canada

    The distance between US (-1, 1) and Canada (0, 1) is 1.
  12. Euclidean Distances

    US to Russia: 2.828. US to Canada: 1.
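
The two Euclidean distances quoted on these slides can be reproduced with a short calculation. This is a sketch in plain Python, assuming the (Jerusalem, Ukraine) coordinates from the slides; the function name `euclidean` is just an illustrative helper.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Coordinates in (Jerusalem, Ukraine) order, taken from the slides.
us, russia, canada = (-1, 1), (1, -1), (0, 1)

us_russia = euclidean(us, russia)  # sqrt(2^2 + 2^2) = sqrt(8)
us_canada = euclidean(us, canada)  # sqrt(1^2 + 0^2) = 1

print(round(us_russia, 3), us_canada)
```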
  13. Distance is measured in a grid way.

    Moving along the grid from US (-1, 1) to Russia (1, -1), the distance is 4.
  14. Between US and Canada

    Moving along the grid from US (-1, 1) to Canada (0, 1), the distance is 1.
  15. Euclidean vs. Manhattan

    Pair            Euclidean   Manhattan
    US vs. Russia   2.828       4
    US vs. Canada   1           1
  16. Manhattan Distance emphasizes the difference more.

    Euclidean: US to Canada is 1, US to Russia is 2.828. Manhattan: US to Canada is 1, US to Russia is 4.
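
To make the comparison concrete, here is a small sketch of the Manhattan ("grid") distance next to the Euclidean one, in plain Python with the same slide coordinates. The helper names are hypothetical.

```python
import math


def manhattan(a, b):
    """Grid distance: sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance, shown for comparison."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Coordinates in (Jerusalem, Ukraine) order, taken from the slides.
us, russia, canada = (-1, 1), (1, -1), (0, 1)

manhattan_us_russia = manhattan(us, russia)  # |-1 - 1| + |1 - (-1)| = 4
manhattan_us_canada = manhattan(us, canada)  # |-1 - 0| + |1 - 1| = 1

# Ratio of the far pair to the near pair under each measure.
manhattan_ratio = manhattan_us_russia / manhattan_us_canada
euclidean_ratio = euclidean(us, russia) / euclidean(us, canada)
print(manhattan_ratio, round(euclidean_ratio, 3))
```

The near pair is 1 apart under both measures, while the far pair grows from about 2.828 to 4, which is the sense in which Manhattan distance emphasizes the difference more.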
  17. When you want to calculate the distance based on Logical values (TRUE/FALSE, 1/0) instead of Numerical values.

    RC_ID   US   Japan   Canada   Russia
    1       1    1       1        0
    2       1    0       1        0
    3       0    1       1        1
    4       0    0       0        0
  18. US vs. Japan

    RC_ID   US   Japan
    1       1    1
    2       1    0
    3       0    1
    4       0    0
    …

    The diagram places each RC_ID (1 to 7) into one of four groups: both US and Japan are 1, only US is 1, only Japan is 1, or both US and Japan are 0.
  19. US vs. Canada: Very Close

    RC_ID   US   Canada
    1       1    1
    2       1    1
    3       1    1
    4       1    0
    …

    The diagram places each RC_ID (1 to 7) into one of three groups: both US and Canada are 1, only US is 1, or only Canada is 1.
  20. US vs. Russia: Very Far

    RC_ID   US   Russia
    1       1    0
    2       1    0
    3       0    1
    4       1    0
    …

    The diagram places each RC_ID (1 to 7) into one of three groups: both US and Russia are 1, only US is 1, or only Russia is 1.
  21. US vs. Japan: Not too close, not too far. US vs. Canada: Very Close. US vs. Russia: Very Far.
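  22. Jaccard Index vs. Binary Distance

    Jaccard Index = (number of items where both are 1) / (number of items where at least one is 1). Between 0 and 1. Bigger numbers imply being close.
    Binary Distance = 1 - Jaccard Index. Between 0 and 1. Bigger numbers imply being far.
    The two measures run in opposite directions.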
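
The relationship between the two measures can be checked on the four-row 1/0 table from slide 17. The following is a minimal plain-Python sketch; the function names are hypothetical, and returning 1.0 for two all-zero vectors is an assumed convention, not something the deck specifies.

```python
# 1/0 values per country for RC_ID 1 to 4, copied from the slide 17 table.
votes = {
    "US":     [1, 1, 0, 0],
    "Japan":  [1, 0, 1, 0],
    "Canada": [1, 1, 1, 0],
    "Russia": [0, 0, 1, 0],
}


def jaccard_index(a, b):
    """Share of items where both are 1 among items where at least one is 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-zero vectors are treated as identical.
    return both / either if either else 1.0


def binary_distance(a, b):
    """Binary distance as defined on the slide: 1 - Jaccard Index."""
    return 1 - jaccard_index(a, b)


for other in ("Japan", "Canada", "Russia"):
    print("US vs.", other, round(binary_distance(votes["US"], votes[other]), 3))
```

Rows where both countries are 0 (RC_ID 4 here) do not affect either measure, which is why the result differs from simply counting mismatches.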
  23. Data

    • 17 ballot measures.
    • 59 California counties.
    • Yes Ratio indicates what share of the voters voted ‘Yes’.
  24. Analytics View

    • Analytics Type: Similarity by Categories
    • Category: COUNTY_NAME
    • Measured By: BALLOT_MEASURE_TITLE
    • Measure: yes_ratio
  25. Distance Between Cities

    Let’s think about the distance between 3 cities (in miles).

                    San Francisco   Los Angeles   New York
    San Francisco   0
    Los Angeles     380             0
    New York        2,900           2,700         0
  26. We can look at the distance between two cities one by one.

    SF to NY: 2,900 miles. SF to LA: 380 miles. NY to LA: 2,700 miles.
  27. We can transform the one-dimensional scale into a two-dimensional scale with the MDS (Multi-Dimensional Scaling) algorithm.
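
As an illustration of what MDS does with the three city distances, here is a sketch of classical (Torgerson) MDS in plain Python. It is not the implementation Exploratory uses; the function `classical_mds` and its power-iteration eigen solver are assumptions chosen to keep the example dependency-free, and the approach assumes the largest eigenvalues of the centred matrix are positive (true for these three cities).

```python
import math


def classical_mds(dist, dims=2, iters=500):
    """Classical MDS via power iteration; returns one coordinate list per point."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances (dist is symmetric, so
    # column means equal row means).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        v = [float(i + 1) for i in range(n)]  # fixed, deterministic start vector
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


cities = ["San Francisco", "Los Angeles", "New York"]
miles = [
    [0, 380, 2900],
    [380, 0, 2700],
    [2900, 2700, 0],
]
coords = classical_mds(miles)
for city, (x, y) in zip(cities, coords):
    print(city, round(x, 1), round(y, 1))
```

The resulting positions are only defined up to rotation and reflection, so the axes have no meaning of their own; what is preserved is the set of pairwise distances.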
  28. The view also uses the K-means clustering algorithm to group the data into a number of clusters based on the distance information.
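
For readers unfamiliar with K-means, the grouping step can be sketched as follows. This is a bare-bones version of Lloyd's algorithm in plain Python, not Exploratory's implementation; it reuses the three country coordinates from the earlier voting slides, and seeding the centroids with the first `k` points is an assumption made to keep the run deterministic.

```python
import math


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iters=100):
    """Minimal K-means: assign to nearest centroid, recompute, repeat."""
    # Assumption: use the first k points as starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


names = ["US", "Russia", "Canada"]
points = [(-1, 1), (1, -1), (0, 1)]  # (Jerusalem, Ukraine) from the slides
labels, centroids = kmeans(points, k=2)
print(dict(zip(names, labels)))
```

With two clusters, US and Canada end up together and Russia is on its own, matching the distances computed earlier.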
  29. You can uncheck “Show on Plot” in the chart property

    to hide the country names from the plot area.
  30. In the Analytics property, you can customize the settings. You can change the distance method and set the number of clusters for K-means clustering.
  31. Wide Data

    Measure          Sacramento   San Francisco   Los Angeles   Napa
    Firearms Sales   0.57         0.85            0.72          0.62
    Cigarette Tax    0.51         0.81            0.68          0.65
  32. If there are more counties, the data becomes wider.

    Measure          Sacramento   San Francisco   Los Angeles   Napa
    Firearms Sales   0.57         0.85            0.72          0.62
    Cigarette Tax    0.51         0.81            0.68          0.65
  33. Long Data

    Measure          County          YesRatio
    Firearms Sales   Sacramento      0.57
    Firearms Sales   San Francisco   0.85
    Firearms Sales   Los Angeles     0.72
    Firearms Sales   Napa            0.62
    Cigarette Tax    Sacramento      0.51
    Cigarette Tax    San Francisco   0.81
    Cigarette Tax    Los Angeles     0.68
    Cigarette Tax    Napa            0.65
  34. If there are more counties, the data becomes longer.

    Measure          County          YesRatio
    Firearms Sales   Sacramento      0.57
    Firearms Sales   San Francisco   0.85
    Firearms Sales   Los Angeles     0.72
    Firearms Sales   Napa            0.62
    Cigarette Tax    Sacramento      0.51
    Cigarette Tax    San Francisco   0.81
    Cigarette Tax    Los Angeles     0.68
    Cigarette Tax    Napa            0.65
  35. Wide Data to Long Data: Gather, Un-Pivot, Tidy

    Wide Data:

    Measure          Sacramento   San Francisco   Los Angeles   Napa
    Firearms Sales   0.57         0.85            0.72          0.62
    Cigarette Tax    0.51         0.81            0.68          0.65

    Long Data:

    Measure          County          YesRatio
    Firearms Sales   Sacramento      0.57
    Firearms Sales   San Francisco   0.85
    Firearms Sales   Los Angeles     0.72
    Firearms Sales   Napa            0.62
    Cigarette Tax    Sacramento      0.51
    Cigarette Tax    San Francisco   0.81
    Cigarette Tax    Los Angeles     0.68
    Cigarette Tax    Napa            0.65
  36. gather(County, YesRatio, Sacramento:Napa)

    Wide Data:

    Measure          Sacramento   San Francisco   Los Angeles   Napa
    Firearms Sales   0.57         0.85            0.72          0.62
    Cigarette Tax    0.51         0.81            0.68          0.65

    Long Data:

    Measure          County          YesRatio
    Firearms Sales   Sacramento      0.57
    Firearms Sales   San Francisco   0.85
    Firearms Sales   Los Angeles     0.72
    Firearms Sales   Napa            0.62
    Cigarette Tax    Sacramento      0.51
    Cigarette Tax    San Francisco   0.81
    Cigarette Tax    Los Angeles     0.68
    Cigarette Tax    Napa            0.65
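
To show what the gather step does mechanically, here is a plain-Python sketch of the same wide-to-long reshaping. The `gather` function below is a hypothetical helper written for this example, not the tidyr function named on the slide; only the table values come from the deck.

```python
# Wide table from the slide: one row per measure, one column per county.
wide = [
    {"Measure": "Firearms Sales", "Sacramento": 0.57, "San Francisco": 0.85,
     "Los Angeles": 0.72, "Napa": 0.62},
    {"Measure": "Cigarette Tax", "Sacramento": 0.51, "San Francisco": 0.81,
     "Los Angeles": 0.68, "Napa": 0.65},
]


def gather(rows, key, value, columns):
    """Turn each listed column into its own row holding (key, value)."""
    long_rows = []
    for row in rows:
        # Columns that are not gathered are repeated on every output row.
        kept = {k: v for k, v in row.items() if k not in columns}
        for col in columns:
            long_rows.append({**kept, key: col, value: row[col]})
    return long_rows


counties = ["Sacramento", "San Francisco", "Los Angeles", "Napa"]
long_rows = gather(wide, "County", "YesRatio", counties)
for row in long_rows:
    print(row)
```

Two rows with four county columns become eight rows with one `County` and one `YesRatio` column each.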
  37. Long Data to Wide Data: Spread, Pivot, Un-Tidy

    Long Data:

    Measure          County          YesRatio
    Firearms Sales   Sacramento      0.57
    Firearms Sales   San Francisco   0.85
    Firearms Sales   Los Angeles     0.72
    Firearms Sales   Napa            0.62
    Cigarette Tax    Sacramento      0.51
    Cigarette Tax    San Francisco   0.81
    Cigarette Tax    Los Angeles     0.68
    Cigarette Tax    Napa            0.65

    Wide Data:

    Measure          Sacramento   San Francisco   Los Angeles   Napa
    Firearms Sales   0.57         0.85            0.72          0.62
    Cigarette Tax    0.51         0.81            0.68          0.65
  38. spread(County, YesRatio)

    Long Data:

    Measure          County          YesRatio
    Firearms Sales   Sacramento      0.57
    Firearms Sales   San Francisco   0.85
    Firearms Sales   Los Angeles     0.72
    Firearms Sales   Napa            0.62
    Cigarette Tax    Sacramento      0.51
    Cigarette Tax    San Francisco   0.81
    Cigarette Tax    Los Angeles     0.68
    Cigarette Tax    Napa            0.65

    Wide Data:

    Measure          Sacramento   San Francisco   Los Angeles   Napa
    Firearms Sales   0.57         0.85            0.72          0.62
    Cigarette Tax    0.51         0.81            0.68          0.65
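
The reverse reshaping can be sketched the same way. The `spread` function below is a hypothetical plain-Python helper for illustration, not the tidyr function named on the slide; it assumes each (Measure, County) pair appears only once.

```python
# Long table from the slide: one row per (measure, county) pair.
long_rows = [
    {"Measure": "Firearms Sales", "County": "Sacramento", "YesRatio": 0.57},
    {"Measure": "Firearms Sales", "County": "San Francisco", "YesRatio": 0.85},
    {"Measure": "Firearms Sales", "County": "Los Angeles", "YesRatio": 0.72},
    {"Measure": "Firearms Sales", "County": "Napa", "YesRatio": 0.62},
    {"Measure": "Cigarette Tax", "County": "Sacramento", "YesRatio": 0.51},
    {"Measure": "Cigarette Tax", "County": "San Francisco", "YesRatio": 0.81},
    {"Measure": "Cigarette Tax", "County": "Los Angeles", "YesRatio": 0.68},
    {"Measure": "Cigarette Tax", "County": "Napa", "YesRatio": 0.65},
]


def spread(rows, key, value):
    """Turn each distinct value of `key` into its own column."""
    wide = {}
    for row in rows:
        # The remaining columns identify which output row this belongs to.
        ident = tuple((k, v) for k, v in row.items() if k not in (key, value))
        wide.setdefault(ident, dict(ident))[row[key]] = row[value]
    return list(wide.values())


wide_rows = spread(long_rows, "County", "YesRatio")
for row in wide_rows:
    print(row)
```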