clustree: a package for producing clustering trees using ggraph

Clustering analysis is commonly used in many fields to group together similar samples. Many clustering algorithms exist, but all of them require some sort of user input to set parameters that affect the number of clusters produced. Deciding on the correct number of clusters for a given dataset is a difficult problem that can be tackled by looking at the relationships between samples at different resolutions. Here I will present clustree, an R package for producing clustering tree visualisations. These visualisations combine information from multiple clusterings with different resolutions, showing where new clusters come from and how samples change clusters as the number of clusters increases. Summarised information describing the samples in each cluster can be overlaid on the tree to give additional insight. I will also describe my experience developing clustree, particularly how I have made use of the ggraph package. The clustree package is available at https://github.com/lazappi/clustree and a preprint describing clustering trees can be read at https://www.biorxiv.org/content/early/2018/03/02/274035.
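
A minimal sketch of the workflow described above, assuming the clustree package is installed from CRAN. It uses the built-in iris data and k-means for k = 1 to 5. The `K` column prefix, the seed and the object names are illustrative choices rather than anything prescribed by the package.

```r
library(clustree)

set.seed(1)  # k-means starts from random centres, so fix the seed
features <- iris[, 1:4]

# One column of cluster labels per resolution: K1, K2, ..., K5
clusterings <- sapply(1:5, function(k) kmeans(features, centers = k)$cluster)
colnames(clusterings) <- paste0("K", 1:5)
iris_clusts <- cbind(iris, clusterings)

# Clustering tree built from every column whose name starts with "K"
clustree(iris_clusts, prefix = "K")

# Overlay summarised sample information: mean petal length per cluster
clustree(iris_clusts, prefix = "K",
         node_colour = "Petal.Length", node_colour_aggr = "mean")
```

Each node in the resulting plot is a cluster at one resolution, and each edge shows how many samples move between clusters at neighbouring resolutions.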

This talk was presented at useR! 2018 in Brisbane.

Luke Zappia

July 12, 2018

Transcript

  1. clustree: producing clustering trees using ggraph. Luke Zappia, @_lazappi_, lazappi.github.io/clustree

  2. My data (image: OpenStax College, CC BY 3.0 via Wikimedia Commons)

    Single-cell RNA-sequencing: gene activity in thousands of cells. ~20000 features (genes), ~8000 samples (cells). Look for different cell types.
  3. How many clusters?

  4. (no text)
  5. Table:

    Sample  K1  K2  K3
    0       A   A   A
    1       A   B   C
    2       A   A   A
    3       A   A   B
    4       A   B   A
    5       A   A   B
    6       A   B   C
    7       A   A   A
    8       A   A   B
    9       A   B   C
  6. (no text)
  7. (no text)
  8. How do we do that in R?

  9. Clusters + transitions

    Clusters:

    ID  Resolution  Cluster  Size
    1A  1           A        10
    2A  2           A        6
    2B  2           B        4
    3A  3           A        4
    3B  3           B        3
    3C  3           C        3

    Transitions:

    From  To  Number
    1A    2A  6
    1A    2B  4
    2A    3A  3
    2A    3B  3
    2B    3A  1
    2B    3C  3
  10. ggplot?

  11. (no text)
  12. Building a graph

    igraph::graph_from_data_frame(edges, vertices = nodes)
    tidygraph::tbl_graph(nodes = nodes, edges = edges)

    graph %>%
        activate(nodes) %>% filter(...) %>% mutate(...) %>%
        activate(edges) %>% filter(...) %>% mutate(...)
  13. ggraph

    ggraph(graph, layout = "tree") +
        geom_edge_link(...) +
        geom_node_point(...) +
        scale_size(...) +
        scale_edge_colour(...)
  14. clustree(clusterings, ...)

  15. What does it look like?

  16. Simulations - 1 group

  17. Simulations - 4 groups

  18. The Iris dataset

    Iris setosa, Iris versicolor, Iris virginica
    Image credits: Tiia Monto, CC BY-SA 4.0, via Wikimedia Commons; C T Johansson, CC BY 3.0, via Wikimedia Commons; Jefficus, via Wikimedia Commons
  19. Iris dataset: k-means clustering, k = 1, ..., 5

  20. (no text)
  21. Petal length

  22. Organoid data

  23. NPHS1

  24. (no text)
  25. (plots with axes t-SNE 1 and t-SNE 2)

  26. Acknowledgements

    Everyone that makes tools and data available
    MCRI Bioinformatics: Belinda Phipson
    MCRI KDDR: Alex Combes
    Supervisors: Alicia Oshlack, Melissa Little

    @_lazappi_
    lazappi.github.io/clustree
    install.packages("clustree")
    Paper: doi.org/10.1101/274035
    Slides: tinyurl.com/clustree-useR2018
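
The toy example on slides 5 and 9 can be reproduced in base R. The sketch below rebuilds the ten-sample table of clusterings and derives the cluster (node) and transition (edge) tables from it. It is an illustration of the idea under my own naming, not the code clustree uses internally; the object names `clusterings`, `nodes` and `edges` are assumptions.

```r
# Toy clusterings from slide 5: ten samples at three resolutions
clusterings <- data.frame(
  Sample = 0:9,
  K1 = rep("A", 10),
  K2 = c("A", "B", "A", "A", "B", "A", "B", "A", "A", "B"),
  K3 = c("A", "C", "A", "B", "A", "B", "C", "A", "B", "C")
)

resolutions <- 1:3

# Nodes: one row per cluster at each resolution, with its size
nodes <- do.call(rbind, lapply(resolutions, function(r) {
  sizes <- table(clusterings[[paste0("K", r)]])
  data.frame(
    ID = paste0(r, names(sizes)),
    Resolution = r,
    Cluster = names(sizes),
    Size = as.integer(sizes)
  )
}))

# Edges: number of samples moving between clusters at neighbouring resolutions
edges <- do.call(rbind, lapply(head(resolutions, -1), function(r) {
  from <- paste0(r, clusterings[[paste0("K", r)]])
  to <- paste0(r + 1, clusterings[[paste0("K", r + 1)]])
  counts <- as.data.frame(table(From = from, To = to), stringsAsFactors = FALSE)
  names(counts)[3] <- "Number"
  counts <- counts[counts$Number > 0, ]  # drop transitions that never occur
  counts[order(counts$From, counts$To), ]
}))
rownames(edges) <- NULL

print(nodes)
print(edges)
```

The two data frames have the shape expected by the graph constructors on slide 12, with `From` and `To` referring to values in the `ID` column.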