Exploratory: An Introduction to Distance / MDS algorithms

A family of distance algorithms helps you understand the similarity (or difference) among your subjects, such as customers, countries, or products.

By using another algorithm called Multi-Dimensional Scaling (MDS), you can visualize these relationships in a much more intuitive way.

Kan introduces Distance algorithms and how to use them in Exploratory.

Kan Nishida

July 17, 2019

Transcript

  1. Exploratory Seminar Distance & MDS

  2. EXPLORATORY

  3. Speaker: Kan Nishida (@KanAugust), co-founder/CEO, Exploratory

    Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  4. Mission: Make Data Science Available for Everyone
  5. The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  6. First Wave, Second Wave, Third Wave (Democratization of Data Science)

    Year: 1976, 2000, 2016
    Theme: Monetization, Commoditization, Democratization
    Users: Statisticians, Data Scientists, Business Users
    Algorithms, Experience, Tools: Proprietary, Open Source, Programming, UI & Automation
  7. Exploratory Data Analysis: Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning)
  8. Exploratory Seminar Distance & MDS

  9. Distance

  10. None
  11. Distance Types • Euclidean Distance • Manhattan Distance • Binary Distance
  12. Distance Types • Euclidean Distance • Manhattan Distance • Binary Distance
  13. The United Nations Resolutions. Two voting examples: • Jerusalem: General Assembly Resolution ES-10/L.22 - Criticizing US policy on Jerusalem (2017) • Ukraine: General Assembly Resolution 68/262 - Territorial Integrity of Ukraine (2014)
  14. How Countries Voted for UN Resolutions

    Country  Jerusalem  Ukraine
    US       No         Yes
    Russia   Yes        No
    Canada   Abstain    Yes
  15. Give numeric weights to the voting result. • Yes -> 1 • No -> -1 • Absence / Abstain -> 0
  16. How Countries Voted for UN Resolutions

    Country  Jerusalem  Ukraine
    US       -1         1
    Russia   1          -1
    Canada   0          1
  17. Visualize Voting Results with 2 Dimensions

    Figure: the Jerusalem and Ukraine votes as two axes, each running from -1 (No) through 0 to 1 (Yes). US is at ( -1, 1 ) and Russia is at ( 1, -1 ), written as ( Jerusalem, Ukraine ). Canada is also plotted.
  18. Euclidean Distance between US and Russia

    Figure: the direct distance between US ( -1, 1 ) and Russia ( 1, -1 ) is 2.828.
  19. Euclidean Distance between US and Canada

    Figure: with Canada at ( 0, 1 ), the distance between US and Canada is 1.
  20. Euclidean Distances

    Figure: US to Russia is 2.828; US to Canada is 1.
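The Euclidean distances on the slides above can be checked with a few lines of Python. This is a minimal sketch for illustration, not Exploratory's implementation; the `votes` dictionary and the `euclidean` helper are names made up here, and the encoding follows the Yes = 1, No = -1, Abstain = 0 weights from the deck.

```python
import math

# Votes encoded as (Jerusalem, Ukraine): Yes = 1, No = -1, Abstain/Absence = 0.
votes = {
    "US": (-1, 1),
    "Russia": (1, -1),
    "Canada": (0, 1),
}

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

us_russia = euclidean(votes["US"], votes["Russia"])  # sqrt(2^2 + 2^2) = sqrt(8)
us_canada = euclidean(votes["US"], votes["Canada"])  # sqrt(1^2 + 0^2) = 1
print(round(us_russia, 3), round(us_canada, 3))
```

The two values round to 2.828 and 1, matching the figures on the slides.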
  21. Distance Types • Euclidean Distance • Manhattan Distance • Binary Distance
  22. Manhattan Distance

  23. Distance is measured along the grid.

    Figure: moving along the grid lines from US ( -1, 1 ) to Russia ( 1, -1 ), the Manhattan distance is 4.
  24. Between US and Canada

    Figure: with Canada at ( 0, 1 ), the Manhattan distance between US and Canada is 1.
  25. Euclidean vs. Manhattan

    Distance   US to Russia  US to Canada
    Euclidean  2.828         1
    Manhattan  4             1
  26. Manhattan Distance emphasizes the difference more.

    Figure: Euclidean gives 1 for US to Canada and 2.828 for US to Russia; Manhattan gives 1 and 4.
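To make the comparison above concrete, here is a small sketch that computes both distances for the same vote vectors. The helper names `manhattan` and `euclidean` and the `votes` dictionary are illustrative assumptions; only the coordinates come from the deck.

```python
import math

# Votes encoded as (Jerusalem, Ukraine): Yes = 1, No = -1, Abstain/Absence = 0.
votes = {"US": (-1, 1), "Russia": (1, -1), "Canada": (0, 1)}

def manhattan(a, b):
    """Grid distance: sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for other in ("Canada", "Russia"):
    e = euclidean(votes["US"], votes[other])
    m = manhattan(votes["US"], votes[other])
    print(f"US vs. {other}: Euclidean {e:.3f}, Manhattan {m}")
```

For US vs. Canada the two measures agree because the countries differ on only one resolution. For US vs. Russia they differ on both, and the Manhattan value (4) is larger than the Euclidean one (2.828).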
  27. Distance Types • Euclidean Distance • Manhattan Distance • Binary Distance
  28. Binary Distance = 1 - Jaccard Index

  29. Use Binary Distance when you want to calculate the distance based on logical values (TRUE/FALSE, or 1/0) instead of numerical values.

    RC_ID  US  Japan  Canada  Russia
    1      1   1      1       0
    2      1   0      1       0
    3      0   1      1       1
    4      0   0      0       0
  30. US vs. Japan

    RC_ID  US  Japan
    1      1   1
    2      1   0
    3      0   1
    4      0   0
    …

    Venn diagram regions: Both US and Japan are 1; Only US is 1; Only Japan is 1; Both US and Japan are 0. The numbers 1 through 7 are placed in these regions.
  31. US vs. Canada: Very Close

    RC_ID  US  Canada
    1      1   1
    2      1   1
    3      1   1
    4      1   0
    …

    Venn diagram regions: Both US and Canada are 1; Only US is 1; Only Canada is 1. The numbers 1 through 7 are placed in these regions.
  32. US vs. Russia: Very Far

    RC_ID  US  Russia
    1      1   0
    2      1   0
    3      0   1
    4      1   0
    …

    Venn diagram regions: Only US is 1; Only Russia is 1; Both US and Russia are 1. The numbers 1 through 7 are placed in these regions.
  33. US vs. Japan: Not too close, not too far. US vs. Canada: Very Close. US vs. Russia: Very Far.
  34. Jaccard Index and Binary Distance point in opposite directions.

    Jaccard Index: between 0 and 1. Bigger numbers imply being close.
    Binary Distance = 1 - Jaccard Index: between 0 and 1. Bigger numbers imply being far.
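The relationship above can be sketched in Python using the four-row RC_ID table shown earlier in the deck. The Jaccard Index formula on the slide did not survive in the transcript, so this sketch assumes the standard definition: rows where both are 1, divided by rows where at least one is 1. The function names and the handling of the all-zero case are assumptions made for this illustration.

```python
# Logical votes per RC_ID (rows 1 to 4 of the table in the deck).
votes = {
    "US":     [1, 1, 0, 0],
    "Japan":  [1, 0, 1, 0],
    "Canada": [1, 1, 1, 0],
    "Russia": [0, 0, 1, 0],
}

def jaccard_index(a, b):
    """Rows where both are 1, divided by rows where at least one is 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-zero vectors are treated as identical.
    return 1.0 if either == 0 else both / either

def binary_distance(a, b):
    """Binary Distance = 1 - Jaccard Index."""
    return 1.0 - jaccard_index(a, b)

for other in ("Canada", "Japan", "Russia"):
    print(f"US vs. {other}: {binary_distance(votes['US'], votes[other]):.3f}")
```

With these four rows the distances come out smallest for Canada, in between for Japan, and largest for Russia, which is the same ordering the slides describe. Rows where both countries are 0 do not affect the result.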
  35. Let’s try!

  36. 2016 California Election Data

  37. Data • 17 ballot measures. • 59 California counties. • Yes Ratio indicates what share of the voters voted ‘Yes’.
  38. Question: Which California counties are similar based on California Election data?
  39. Analytics View • Analytics Type: Similarity by Categories • Category: COUNTY_NAME • Measured By: BALLOT_MEASURE_TITLE • Measure: yes_ratio
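Conceptually, the settings above describe building one vector per category (county), with one entry per "Measured By" value (ballot measure), and then computing distances between those vectors. The sketch below illustrates that idea only; it is not Exploratory's internal code. The rows are the sample values from the deck's appendix, and the variable names are assumptions.

```python
import math
from itertools import combinations

# (measure, county, yes_ratio) rows in long format, from the deck's appendix.
rows = [
    ("Firearms Sales", "Sacramento", 0.57),
    ("Firearms Sales", "San Francisco", 0.85),
    ("Firearms Sales", "Los Angeles", 0.72),
    ("Firearms Sales", "Napa", 0.62),
    ("Cigarette Tax", "Sacramento", 0.51),
    ("Cigarette Tax", "San Francisco", 0.81),
    ("Cigarette Tax", "Los Angeles", 0.68),
    ("Cigarette Tax", "Napa", 0.65),
]

measures = sorted({measure for measure, _, _ in rows})
counties = sorted({county for _, county, _ in rows})
lookup = {(county, measure): ratio for measure, county, ratio in rows}

# One vector per county, with entries in a fixed measure order.
vectors = {c: [lookup[(c, m)] for m in measures] for c in counties}

# Euclidean distance for every pair of counties.
distances = {
    (a, b): math.dist(vectors[a], vectors[b])
    for a, b in combinations(counties, 2)
}

closest_pair = min(distances, key=distances.get)
farthest_pair = max(distances, key=distances.get)
print("closest:", closest_pair, "farthest:", farthest_pair)
```

With only two measures the vectors are two-dimensional; with all 17 ballot measures each county would be a 17-dimensional point, and the same distance calculation applies.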
  40. None
  41. But… is this the best way to understand the similarity among the counties?
  42. Visualizing Distance with MDS (Multi-Dimensional Scaling)

  43. Distance Between Cities: Let’s think about the distance between 3 cities.

    City           San Francisco  Los Angeles  New York
    San Francisco  0 miles
    Los Angeles    380 miles      0 miles
    New York       2,900 miles    2,700 miles  0 miles
  44. We can look at the distance between two cities one by one: SF to NY is 2,900 miles, SF to LA is 380 miles, and NY to LA is 2,700 miles.
  45. Or, look at them together on a 2-dimensional scale. (Figure: NY, SF, and LA plotted together.)
  46. A 2-dimensional scale makes it more intuitive to understand the similarity. (Figure: NY, SF, and LA plotted together.)
  47. We can transform a 1-dimensional scale to a 2-dimensional scale with the MDS (Multi-Dimensional Scaling) algorithm.
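To show what such a transformation can look like, here is a minimal sketch of classical (Torgerson) MDS applied to the three-city distance table from the deck. The deck does not say which MDS variant Exploratory uses, so treat this as one possible approach. The function names are made up, and the eigenvectors are found with plain power iteration to keep the example dependency-free.

```python
import math

def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def classical_mds(dist, dims=2, iters=200):
    """Return coordinates whose pairwise distances approximate `dist`."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration converges to the eigenvector with the largest eigenvalue.
        v = [float(i + 1) for i in range(n)]
        for _ in range(iters):
            w = matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to explain in this direction
                break
            v = [x / norm for x in w]
        eigval = sum(x * y for x, y in zip(v, matvec(b, v)))
        if eigval > 0:
            for i in range(n):
                coords[i][k] = v[i] * math.sqrt(eigval)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords

cities = ["SF", "LA", "NY"]
dist = [
    [0, 380, 2900],
    [380, 0, 2700],
    [2900, 2700, 0],
]
coords = classical_mds(dist)
for city, (x, y) in zip(cities, coords):
    print(f"{city}: ({x:.1f}, {y:.1f})")
```

The absolute coordinates, including their sign and rotation, are arbitrary. What matters is that the distances between the plotted points reproduce the distances in the table.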
  48. Similarity Map shows the result of MDS on a Scatter chart.
  49. Also, it uses the K-means clustering algorithm to group the data into a number of clusters based on the distance information.
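As a rough illustration of the clustering step, the sketch below runs a basic K-means (Lloyd's algorithm) on a handful of 2-D points. The points are hypothetical, the initial centers are simply the first k points so the result is reproducible, and the deck does not specify exactly which input Exploratory passes to K-means, so this is an assumption-laden sketch rather than a description of the product.

```python
import math

def kmeans(points, k, iters=100):
    """Basic Lloyd's algorithm; returns (labels, centers)."""
    centers = [points[i] for i in range(k)]  # deterministic start for this sketch
    labels = None
    for _ in range(iters):
        # Assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers

# Hypothetical 2-D coordinates, e.g. positions on a similarity map.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.3), (4.8, 5.0)]
labels, centers = kmeans(points, k=2)
print(labels)
```

Real implementations usually pick the starting centers at random and may repeat the run several times, so cluster numbering can differ between runs.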
  50. You can uncheck “Show on Plot” in the chart property to hide the country names from the plot area.
  51. In Analytics property, you can customize the settings. You can change the distance methods and set the number of clusters for K-means clustering.
  52. Example: Number of Clusters: 5

  53. Appendix

  54. Long Data / Wide Data

  55. Wide Data

    Measure         Sacramento  San Francisco  Los Angeles  Napa
    Firearms Sales  0.57        0.85           0.72         0.62
    Cigarette Tax   0.51        0.81           0.68         0.65
  56. If there are more counties, the data becomes wider. (Same Wide Data table as slide 55.)
  57. Long Data

    Measure         County         YesRatio
    Firearms Sales  Sacramento     0.57
    Firearms Sales  San Francisco  0.85
    Firearms Sales  Los Angeles    0.72
    Firearms Sales  Napa           0.62
    Cigarette Tax   Sacramento     0.51
    Cigarette Tax   San Francisco  0.81
    Cigarette Tax   Los Angeles    0.68
    Cigarette Tax   Napa           0.65
  58. If there are more counties, the data becomes longer. (Same Long Data table as slide 57.)
  59. Gather / Un-Pivot Wide Data to Long Data

  60. Gather, Un-Pivot, Tidy: the Wide Data table (slide 55) becomes the Long Data table (slide 57).
  61. gather(County, YesRatio, Sacramento:Napa) turns the Wide Data table (slide 55) into the Long Data table (slide 57).
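For readers who want to see the mechanics of the gather step above outside of R, here is a minimal Python sketch of the same wide-to-long reshaping. The `gather` function below is a hypothetical stand-in written for this example, not the tidyr function, and its row order follows the slide's Long Data table, which may differ from the order tidyr returns.

```python
# Wide Data from the deck: one row per measure, one column per county.
wide_rows = [
    {"Measure": "Firearms Sales", "Sacramento": 0.57, "San Francisco": 0.85,
     "Los Angeles": 0.72, "Napa": 0.62},
    {"Measure": "Cigarette Tax", "Sacramento": 0.51, "San Francisco": 0.81,
     "Los Angeles": 0.68, "Napa": 0.65},
]

def gather(rows, key, value, columns):
    """Turn the given columns into (key, value) pairs, one output row per pair."""
    long_rows = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in columns}
        for col in columns:
            long_rows.append({**kept, key: col, value: row[col]})
    return long_rows

county_columns = ["Sacramento", "San Francisco", "Los Angeles", "Napa"]
long_rows = gather(wide_rows, "County", "YesRatio", county_columns)
for row in long_rows:
    print(row)
```

Each of the two wide rows yields four long rows, one per county, so the result has eight rows with the columns Measure, County, and YesRatio.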
  62. Spread / Pivot Long Data to Wide Data

  63. Spread, Pivot, Un-Tidy: the Long Data table (slide 57) becomes the Wide Data table (slide 55).
  64. spread(County, YesRatio) turns the Long Data table (slide 57) into the Wide Data table (slide 55).
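The reverse direction of the spread step above can be sketched the same way in Python. The `spread` function here is a hypothetical illustration written for this example, not the tidyr function; it groups rows by every column other than the key and value columns.

```python
# Long Data from the deck: one row per (measure, county) pair.
long_rows = [
    {"Measure": "Firearms Sales", "County": "Sacramento", "YesRatio": 0.57},
    {"Measure": "Firearms Sales", "County": "San Francisco", "YesRatio": 0.85},
    {"Measure": "Firearms Sales", "County": "Los Angeles", "YesRatio": 0.72},
    {"Measure": "Firearms Sales", "County": "Napa", "YesRatio": 0.62},
    {"Measure": "Cigarette Tax", "County": "Sacramento", "YesRatio": 0.51},
    {"Measure": "Cigarette Tax", "County": "San Francisco", "YesRatio": 0.81},
    {"Measure": "Cigarette Tax", "County": "Los Angeles", "YesRatio": 0.68},
    {"Measure": "Cigarette Tax", "County": "Napa", "YesRatio": 0.65},
]

def spread(rows, key, value):
    """Turn each distinct `key` entry into its own column holding `value`."""
    wide = {}
    for row in rows:
        # Rows sharing all other columns collapse into one wide row.
        ident = tuple((k, v) for k, v in row.items() if k not in (key, value))
        wide.setdefault(ident, dict(ident))[row[key]] = row[value]
    return list(wide.values())

wide_rows = spread(long_rows, "County", "YesRatio")
for row in wide_rows:
    print(row)
```

The eight long rows collapse back into two wide rows, one per measure, with one column per county.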
  65. Q & A

  66. Contact

    Email: kan@exploratory.io
    Home Page: https://exploratory.io
    Twitter: @KanAugust
    Online Seminar: https://exploratory.io/online-seminar