Building a clustering tree

Luke Zappia
August 18, 2017

A common task in many fields is to group objects or samples based on their characteristics, such that objects in one group are more similar to each other than to objects in other groups. This unsupervised process is known as clustering. Many clustering algorithms exist, but all of them require some sort of user input to decide how many groups to produce. For example, k-means clustering requires a value for k, the number of clusters, and graph-based clustering methods can include a resolution parameter. How do we know what the correct number of clusters is, and what are the relationships between clusters at different resolutions?

I will show how to construct a clustering tree to help answer these questions. While I will use a single-cell RNA sequencing dataset as an example, this approach could be used for any discipline or clustering method. A clustering tree combines information from multiple clusterings at different resolutions and shows the relationships between them. This visualisation can be used to see which clusters are distinct, where new clusters come from and how many samples change clusters as the number of clusters increases. Building a clustering tree provides an alternative way of looking at clusterings, giving extra insight into the clustering process and helping to determine the correct number of clusters to use in further analysis.


Transcript

  1. Building a clustering tree Luke Zappia @_lazappi_

  2. My data (image: OpenStax College, CC BY 3.0 via Wikimedia Commons)
     Single-cell RNA-sequencing: gene activity in thousands of cells (~20000 rows x ~7000 columns). Look for different cell types.
  3. How many clusters?

  4. None
  5. None
  6. Can we build a tree of clusters?

  7. Nodes: resolution (k), cluster, size

  8. Edges: cluster from (lower resolution), cluster to (higher resolution)

  9. Edges: cluster from (lower resolution), cluster to (higher resolution), number, proportion
 10. Proportions
     A cluster of size = 100 at k = 1 splits into clusters of n = 60 and n = 40 at k = 2.
 11. Proportions
     For an edge carrying n samples: p_from = n / size_low and p_to = n / size_high.
     Same example: 100 at k = 1 splits into n = 60 and n = 40 at k = 2.
 12. Proportions
     Edge 100 -> 60: p_from = 0.6, p_to = 1.0. Edge 100 -> 40: p_from = 0.4, p_to = 1.0.
 13. Proportions
     k = 1: one cluster of 100. k = 2: clusters of 60 and 40. k = 3: clusters of 40, 30 and 30.
     Edges from the cluster of 60: n = 40 (p_from = 0.67, p_to = 1.0) and n = 20 (p_from = 0.33, p_to = 0.67).
     Edges from the cluster of 40: n = 10 (p_from = 0.25, p_to = 0.33) and n = 30 (p_from = 0.75, p_to = 1.0).
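The proportion calculation in the example above amounts to two divisions per edge. A minimal sketch follows; the function name `edge_proportions` is my own for illustration (the talk's accompanying tutorial is in R, so this Python version only mirrors the arithmetic):

```python
# Sketch of the proportion calculation from the slides. The function
# name `edge_proportions` is illustrative, not part of any package.
def edge_proportions(n, size_low, size_high):
    """For an edge carrying n samples from a lower-resolution cluster
    of size_low to a higher-resolution cluster of size_high."""
    p_from = n / size_low    # fraction of the lower-resolution cluster
    p_to = n / size_high     # fraction of the higher-resolution cluster
    return p_from, p_to

# k = 1 (one cluster of 100) to k = 2 (clusters of 60 and 40):
print(edge_proportions(60, 100, 60))  # (0.6, 1.0)
print(edge_proportions(40, 100, 40))  # (0.4, 1.0)
```

A large p_from with a small p_to flags an edge where most of a cluster's samples end up as only a small part of a bigger cluster, which is exactly the kind of relationship the tree is meant to surface.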
 14. Algorithm
     For each resolution r = 1, ..., R - 1
       For each unique cluster C_low in r
         For each unique cluster C_high in r + 1
           Count (C_low, C_high) pairs

     Sample  Res1  Res2  Res3
     S1      1     1     1
     S2      2     1     2
     S3      1     2     2
     S4      1     3     3
     S5      2     1     4
 15. Algorithm (same slide, highlighting the adjacent resolution pair Res1, Res2)
 16. Algorithm (highlighting the unique clusters 1, 2 at Res1)
 17. Algorithm (highlighting the unique clusters 1, 2, 3 at Res2)
 18. Algorithm (highlighting the count for one (C_low, C_high) pair: 1)
 19. Edge table (built from the membership table above)
     ResFrom  ClusterFrom  ResTo  ClusterTo  Number
     Res1     1            Res2   1          1
     Res1     1            Res2   2          1
     Res1     1            Res2   3          1
     Res1     2            Res2   1          2
     Res1     2            Res2   2          0
     Res1     2            Res2   3          0
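The counting loop and the edge table it produces can be sketched in a few lines. This is my own Python illustration of the algorithm on the slides, not the talk's R implementation; `edge_table` and the dictionary layout are assumptions:

```python
from collections import Counter
from itertools import product

# Cluster memberships per resolution, copied from the slide's table
# (Res3 omitted, since the edge table shown covers Res1 -> Res2).
res1 = {"S1": 1, "S2": 2, "S3": 1, "S4": 1, "S5": 2}
res2 = {"S1": 1, "S2": 1, "S3": 2, "S4": 3, "S5": 1}

def edge_table(low, high):
    """Count (C_low, C_high) pairs between two adjacent resolutions."""
    pairs = Counter((low[s], high[s]) for s in low)
    # One row per cluster combination, keeping zero counts as on the slide.
    return {(c_low, c_high): pairs.get((c_low, c_high), 0)
            for c_low, c_high in product(sorted(set(low.values())),
                                         sorted(set(high.values())))}

table = edge_table(res1, res2)
for (c_low, c_high), n in sorted(table.items()):
    print(f"Res1 {c_low} -> Res2 {c_high}: {n}")
```

Running this over every adjacent pair of resolutions yields the full edge list; the tree is then drawn with one node per (resolution, cluster) and these counts (or the derived proportions) on the edges.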
 20. Tree [clustering tree figure; resolution axis 0.01, 0.1, 0.2, ..., 1.0]
 21. [Clustering tree figure, resolutions 0.01 to 1.0]
 22. [Clustering tree figure, resolutions 0.01 to 1.0]
  23. None
  24. None
  25. None
  26. None
  27. None
 28. Summary
     Choosing the number of clusters is hard but important.
     A clustering tree can help by showing:
     - Relationships between clusters
     - Which clusters are distinct
     - Where samples are changing
 29. Acknowledgements
     Everyone that makes tools and data available. Supervisors: Alicia Oshlack, Melissa Little.
     MCRI Bioinformatics, Belinda Phipson, MCRI KDDR, Alex Combes.
     @_lazappi_ / oshlacklab.com
     R tutorial: lazappi.id.au/building-a-clustering-tree/
     Slides: speakerdeck.com/lazappi/building-a-clustering-tree