
High Dimensional Data Visualization

The data gathered in scientific domains and industrial applications is steadily growing in size. What seems large-scale today may be small-scale in five to ten years. The data grows both in the number of records and in the number of measured or simulated variables per datum. This growth makes such data sets more complex to handle. On the one hand, large-scale data needs to be processed within a reasonable amount of time; on the other hand, the person analyzing the data can only cope with a limited amount of complexity. For the latter, reducing the complexity of the data is unavoidable. A method commonly chosen in applications such as compression, classification, or visualization is to reduce the number of dimensions of the data. Dimension reduction techniques aim to compute a data set with fewer dimensions from the original data that still represents the patterns and characteristics of the original data.

Fabian Keller

July 16, 2015

Transcript

  1. High Dimensional Data Visualization
    Presented by Fabian Keller
    Seminar: Large Scale Visualization
    Advisor: Steffen Koch
    University of Stuttgart, Summer Term 2015
  2. Agenda
    • Introduction
    • Dimension Reduction Techniques: PCA / LLE / ISOMAP / t-SNE
    • Visualization Techniques: Scatterplots / Parallel Coordinate Plots / Glyphs
    • Conclusion
    16.07.2015 Fabian Keller 4
  3. Goal of dimensionality reduction
    • High Dimensional Data (>>1000 dimensions)
    • Reduce Dimensions (for Clustering / Learning / …)
    • Extract Meaning
    • Visualize and Interact
    [cf. Card et al. 1999; dos Santos and Brodlie 2004]
  4. Intrinsic Dimensionality: How many dimensions can we reduce?
    2D → 1D
    3D → 1D
    → Intrinsic Dimensionality: 1
  5. Agenda
    • Introduction
    • Dimension Reduction Techniques: PCA / LLE / ISOMAP / t-SNE
    • Visualization Techniques: Scatterplots / Parallel Coordinate Plots / Glyphs
    • Conclusion
  6. Dimension Reduction: What techniques are there?
    DR Techniques
    • Linear: Principal Component Analysis
    • Non-Linear, Local: Locally Linear Embedding
    • Non-Linear, Global: ISOMAP, t-SNE
  7. Principal Component Analysis (PCA): Eigen-*
    • Linear, Global
    • Find “Principal Components”
    • Minimize Reconstruction Error
    [isomorphismes, 2014]
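To make the "find principal components" step concrete, here is a minimal sketch in plain Python. It assumes a small 2D toy data set and uses power iteration on the sample covariance matrix to approximate the first principal component; the function names `first_principal_component` and `project` are hypothetical and not taken from the slides, and a real analysis would normally rely on an established linear algebra library.

```python
import math

def first_principal_component(points, iterations=200):
    """Return (mean, unit direction) of the first principal component."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(d)] for a in range(d)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue
    # (assuming the start vector is not orthogonal to it).
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(points, mean, direction):
    """Reduce each point to one coordinate along the given direction."""
    return [sum((p[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for p in points]

# Toy data lying close to the line y = 2x (an assumption for illustration).
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
mean, direction = first_principal_component(points)
scores = project(points, mean, direction)
print(direction)  # roughly proportional to (1, 2)
print(scores)     # one coordinate per point instead of two
```

Projecting onto the leading eigenvector is the one-dimensional linear projection with the smallest squared reconstruction error, which is the sense in which PCA "minimizes reconstruction error".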
  8. Locally Linear Embedding (LLE): Assumes the data is locally linear
    • Non-Linear, Local
    • Select neighbors and approximate linearly
    • Map to lower dimension
    [Roweis, 2000]
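The "approximate linearly" step of LLE can be sketched as follows: each point is written as a weighted combination of its neighbors, with weights that sum to one. This is only an illustration under assumptions: the neighbors are given by hand, the helper names `solve` and `lle_weights` are hypothetical, and a small regularization term is added so the linear system stays solvable. The final step, mapping to the lower dimension, requires an eigenvalue problem and is omitted here.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def lle_weights(x, neighbors, reg=1e-3):
    """Weights that best reconstruct x from its neighbors and sum to one."""
    k = len(neighbors)
    diffs = [[xi - ni for xi, ni in zip(x, nb)] for nb in neighbors]
    # Local Gram matrix of the differences between x and each neighbor.
    gram = [[sum(p * q for p, q in zip(diffs[i], diffs[j]))
             for j in range(k)] for i in range(k)]
    # Regularize the diagonal (an assumption; keeps the system non-singular).
    trace = sum(gram[i][i] for i in range(k))
    eps = reg * trace if trace > 0 else reg
    for i in range(k):
        gram[i][i] += eps
    w = solve(gram, [1.0] * k)
    total = sum(w)
    return [wi / total for wi in w]

# A point halfway between its two neighbors gets equal weights.
weights = lle_weights((1.0, 1.0), [(0.0, 0.0), (2.0, 2.0)])
print(weights)
```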
  9. ISOMAP: Isometric feature mapping
    • Non-linear, Global
    • K-Nearest Neighbors
    • Construct neighborhood graph
    • Compute shortest paths
    [Balasubramanian, 2002]
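The three ISOMAP steps listed on this slide can be sketched in plain Python. The example assumes points sampled along a half circle and a hypothetical helper named `geodesic_distances`; it builds a k-nearest-neighbor graph and runs Floyd-Warshall to obtain shortest-path distances. The last stage of ISOMAP, embedding these distances with classical multidimensional scaling, is left out of this sketch.

```python
import math

def geodesic_distances(points, k):
    """Approximate distances along the data via shortest paths in a kNN graph."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        others = [j for j in range(n) if j != i]
        # Connect each point to its k nearest neighbors (undirected edges).
        for j in sorted(others, key=lambda j: dist[i][j])[:k]:
            graph[i][j] = graph[j][i] = dist[i][j]
    # Floyd-Warshall: all-pairs shortest paths through the neighborhood graph.
    for mid in range(n):
        for i in range(n):
            for j in range(n):
                via = graph[i][mid] + graph[mid][j]
                if via < graph[i][j]:
                    graph[i][j] = via
    return graph

# Seven points on a half circle of radius 1 (toy data, an assumption).
points = [(math.cos(math.pi * t / 6), math.sin(math.pi * t / 6)) for t in range(7)]
geo = geodesic_distances(points, k=2)
straight = math.dist(points[0], points[6])
print(straight)   # straight-line distance between the two end points
print(geo[0][6])  # path length along the curve, longer than the straight line
```

The graph distance between the two end points follows the curve rather than cutting across it, which is the property ISOMAP tries to preserve in the low-dimensional embedding.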
  10. t-SNE: t-Distributed Stochastic Neighbor Embedding
    • Non-linear, Global
    • Uses Gaussian similarities
    • Preserves the similarities in lower dimensions
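The "Gaussian similarities" mentioned on this slide can be illustrated with a short sketch. It computes, for every point, a probability distribution over the other points based on a Gaussian of the pairwise distance. The function name `gaussian_similarities` and the single fixed `sigma` are simplifying assumptions: t-SNE itself tunes a separate bandwidth per point from a perplexity setting, models the low-dimensional similarities with a Student-t distribution, and then minimizes the divergence between the two.

```python
import math

def gaussian_similarities(points, sigma=1.0):
    """Row i holds the similarity of every other point j to point i."""
    n = len(points)
    p = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        # Normalize so that each row is a probability distribution.
        p.append([w / total for w in weights])
    return p

# Two close points and one distant point (toy data, an assumption).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
p = gaussian_similarities(points)
for row in p:
    print(row)
```

Nearby points receive almost all of the probability mass, so an embedding that preserves these similarities keeps close points close.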
  11. Agenda
    • Introduction
    • Dimension Reduction Techniques: PCA / LLE / ISOMAP / t-SNE
    • Visualization Techniques: Scatterplots / Parallel Coordinate Plots / Glyphs
    • Conclusion
  12. 2D Scatter Plots: Commonly used
    • Easy Perception
    • (No) Interaction
    • Limited to two dimensions
    • Colors?!
  13. 2D Scatter Plot Matrices: Show relationships with scatter plots
    • Slow perception
    • May have interaction
    • Does not scale well
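One way to see why scatter plot matrices do not scale well is to count the plots. The sketch below, with a hypothetical helper `scatterplot_pairs`, enumerates every unordered pair of dimensions; each pair corresponds to one distinct off-diagonal plot.

```python
from itertools import combinations

def scatterplot_pairs(dimensions):
    """Return every unordered pair of dimensions, one per distinct plot."""
    return list(combinations(dimensions, 2))

pairs = scatterplot_pairs(["a", "b", "c", "d"])
print(len(pairs))  # 4 dimensions -> 6 distinct plots

# The count grows quadratically: d * (d - 1) / 2 plots for d dimensions.
for d in (10, 100, 1000):
    print(d, d * (d - 1) // 2)  # 45, 4950 and 499500 plots
```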
  14. 2D Scatter Plot Matrices: Let an algorithm choose the plots
    [Zheng, 2014]
  15. 3D Scatter Plots: Interactive
    • Only one additional dimension
    • Expensive interaction, but useless without it!
    • Limited benefit compared to 2D scatter plots
    [Sedlmair, 2013]
  16. Parallel Coordinate Plot: Display >2 dimensions
    Interaction Examples: https://syntagmatic.github.io/parallel-coordinates/
    • Noisy
    • Slow perception
    • Meaning of x-axis?!
    [Harvard Business Manager, 2015-07]
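The construction behind a parallel coordinate plot can be sketched without any plotting library: every dimension becomes a vertical axis scaled to the range 0 to 1, and every record becomes a polyline that crosses each axis at its scaled value. The function name `to_polylines` and the sample records are assumptions for illustration; drawing the polylines is left to whatever plotting tool is at hand.

```python
def to_polylines(records):
    """Turn each record into a list of (axis index, scaled value) vertices."""
    dims = len(records[0])
    lows = [min(r[j] for r in records) for j in range(dims)]
    highs = [max(r[j] for r in records) for j in range(dims)]
    lines = []
    for r in records:
        line = []
        for j in range(dims):
            span = highs[j] - lows[j]
            # Min-max scaling per axis; constant axes are placed in the middle.
            y = (r[j] - lows[j]) / span if span else 0.5
            line.append((j, y))
        lines.append(line)
    return lines

# Three records with three variables on very different scales (toy data).
records = [(1.0, 200.0, 0.3), (2.0, 100.0, 0.9), (3.0, 150.0, 0.6)]
lines = to_polylines(records)
for line in lines:
    print(line)
```

The horizontal position only encodes which axis a vertex belongs to, so the order of the axes is a design choice rather than a data value, which relates to the "meaning of x-axis" question on the slide.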
  17. Glyphs: Encode important information
    • Memorable semantics
    • Small
    • Details through interaction
    • Overwhelming?
    [Fuchs, 2013]
  18. Conclusion: High Dimensional Data Visualization
    • Lots of DR / visualization techniques
    • Even more combinations
    • Application needs to be tailored to needs
    “A problem well put is half-solved” – John Dewey
  19. Literature
    • Sedlmair, Michael; Munzner, Tamara; Tory, Melanie (2013): Empirical guidance on scatterplot and dimension reduction technique choices.
    • Zheng, Yunzhu; Suematsu, Haruka; Itoh, Takayuki; Fujimaki, Ryohei; Morinaga, Satoshi; Kawahara, Yoshinobu (2014): Scatterplot layout for high-dimensional data visualization.
    • Card, S. K.; Mackinlay, J. D.; Shneiderman, B., editors (1999): Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, San Francisco.
    • Fuchs, Johannes, et al. (2013): Evaluation of alternative glyph designs for time series data in a small multiple setting. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM.
    • Kintzel, Christopher; Fuchs, Johannes; Mansmann, Florian (2011): Monitoring large IP spaces with ClockView.
    • Fuchs, Johannes, et al. (2014): “Leaf Glyph Visualizing Multi-Dimensional Data with Environmental Cues”.
    • Balasubramanian, Mukund; Schwartz, Eric L. (2002): The isomap algorithm and topological stability. Science 295.5552: 7-7.
    • Roweis, Sam T.; Saul, Lawrence K. (2000): Nonlinear dimensionality reduction by locally linear embedding.
    • dos Santos, S.; Brodlie, K. (2004): Gaining understanding of multivariate and multidimensional data through visualization. Computers & Graphics, 28(3):311–325.
    • Harvard Business Manager, 2015-07: Andere Länder, anderer Stil. http://www.harvardbusinessmanager.de/heft/d-135395625.html
    • isomorphismes (2014): pca - making sense of principal component analysis, eigenvectors & eigenvalues - cross validated. http://stats.stackexchange.com/a/82427/80011
  20. Example Applications
    • Biological / Medical (genes, fMRI)
    • Finance (time series)
    • Geological (climate, spatial, temporal)
    • Big Data Analysis (Netflix Movie Rating Data)
  21. Other DR techniques: Matlab toolbox for dimensionality reduction
    • Principal Component Analysis (PCA)
    • Probabilistic PCA
    • Factor Analysis (FA)
    • Classical multidimensional scaling (MDS)
    • Sammon mapping
    • Linear Discriminant Analysis (LDA)
    • Isomap
    • Landmark Isomap
    • Local Linear Embedding (LLE)
    • Laplacian Eigenmaps
    • Hessian LLE
    • Local Tangent Space Alignment (LTSA)
    • Conformal Eigenmaps (extension of LLE)
    • Maximum Variance Unfolding (extension of LLE)
    • Landmark MVU (LandmarkMVU)
    • Fast Maximum Variance Unfolding (FastMVU)
    • Kernel PCA
    • Generalized Discriminant Analysis (GDA)
    • Diffusion maps
    • Neighborhood Preserving Embedding (NPE)
    • Locality Preserving Projection (LPP)
    • Linear Local Tangent Space Alignment (LLTSA)
    • Stochastic Proximity Embedding (SPE)
    • Deep autoencoders (using denoising autoencoder pretraining)
    • Local Linear Coordination (LLC)
    • Manifold charting
    • Coordinated Factor Analysis (CFA)
    • Gaussian Process Latent Variable Model (GPLVM)
    • Stochastic Neighbor Embedding (SNE)
    • Symmetric SNE
    • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    • Neighborhood Components Analysis (NCA)
    • Maximally Collapsing Metric Learning (MCML)
    • Large-Margin Nearest Neighbor (LMNN)
    See: http://lvdmaaten.github.io/drtoolbox/