clustree: a package for producing clustering trees using ggraph

Clustering analysis is commonly used in many fields to group together similar samples. Many clustering algorithms exist, but all of them require some sort of user input to set parameters that affect the number of clusters produced. Deciding on the correct number of clusters for a given dataset is a difficult problem that can be tackled by looking at the relationships between samples at different resolutions. Here I will present clustree, an R package for producing clustering tree visualisations. These visualisations combine information from multiple clusterings with different resolutions, showing where new clusters come from and how samples change clusters as the number of clusters increases. Summarised information describing the samples in each cluster can be overlaid on the tree to give additional insight. I will also describe my experience developing clustree, particularly how I have made use of the ggraph package. The clustree package is available at https://github.com/lazappi/clustree and a preprint describing clustering trees can be read at https://www.biorxiv.org/content/early/2018/03/02/274035.

This talk was presented at useR! 2018 in Brisbane.

Luke Zappia

July 12, 2018

Transcript

  1. clustree
    producing clustering trees using ggraph
    Luke Zappia
    @_lazappi_
    lazappi.github.io/clustree

  2. My data
    OpenStax College, CC BY 3.0 via Wikimedia Commons
    Single-cell RNA-sequencing
    Gene activity in thousands of cells
    ~20000 features (genes)
    ~8000 samples (cells)
    Look for different cell types

  3. How many
    clusters?

  4. (no text on this slide)

  5. Sample  K1  K2  K3
     0       A   A   A
     1       A   B   C
     2       A   A   A
     3       A   A   B
     4       A   B   A
     5       A   A   B
     6       A   B   C
     7       A   A   A
     8       A   A   B
     9       A   B   C

  6. (no text on this slide)

  7. (no text on this slide)

  8. How do we
    do that in R?

  9. Clusters + transitions
    ID  Resolution  Cluster  Size
    1A  1           A        10
    2A  2           A        6
    2B  2           B        4
    3A  3           A        4
    3B  3           B        3
    3C  3           C        3

    From  To  Number
    1A    2A  6
    1A    2B  4
    2A    3A  3
    2A    3B  3
    2B    3A  1
    2B    3C  3
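
The two tables on this slide can be derived from the per-sample assignments on slide 5. The following is a minimal base R sketch of one way to do that, not the code clustree itself uses; the object names (`clusterings`, `res_cols`, `nodes`, `edges`) are assumptions made for illustration.

```r
# Cluster assignments from slide 5: one row per sample, one column per resolution.
clusterings <- data.frame(
  Sample = 0:9,
  K1 = rep("A", 10),
  K2 = c("A", "B", "A", "A", "B", "A", "B", "A", "A", "B"),
  K3 = c("A", "C", "A", "B", "A", "B", "C", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper vector naming the resolution columns, lowest first.
res_cols <- c("K1", "K2", "K3")

# Nodes: one row per cluster at each resolution, with the number of samples in it.
nodes <- do.call(rbind, lapply(seq_along(res_cols), function(i) {
  sizes <- table(clusterings[[res_cols[i]]])
  data.frame(
    ID = paste0(i, names(sizes)),
    Resolution = i,
    Cluster = names(sizes),
    Size = as.integer(sizes),
    stringsAsFactors = FALSE
  )
}))

# Edges: for each pair of neighbouring resolutions, count the samples that
# move from a cluster at the lower resolution to a cluster at the higher one.
edges <- do.call(rbind, lapply(seq_len(length(res_cols) - 1), function(i) {
  from <- paste0(i, clusterings[[res_cols[i]]])
  to <- paste0(i + 1, clusterings[[res_cols[i + 1]]])
  counts <- as.data.frame(table(From = from, To = to),
                          responseName = "Number", stringsAsFactors = FALSE)
  counts[counts$Number > 0, ]  # drop transitions that no sample makes
}))
rownames(edges) <- NULL

print(nodes)
print(edges)
```

The rows of `edges` may come out in a different order from the slide, but the counts are the same.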

  10. ggplot?

  11. (no text on this slide)

  12. Building a graph
    igraph::graph_from_data_frame(edges, vertices = nodes)
    tidygraph::tbl_graph(nodes = nodes, edges = edges)
    graph %>%
      activate(nodes) %>%
      filter(...) %>%
      mutate(...) %>%
      activate(edges) %>%
      filter(...) %>%
      mutate(...)

  13. ggraph
    ggraph(graph, layout = "tree") +
      geom_edge_link(...) +
      geom_node_point(...) +
      scale_size(...) +
      scale_edge_colour(...)
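
As a rough illustration of how the pieces on slides 12 and 13 fit together, the sketch below builds a graph from the slide 9 tables and draws it with ggraph. It is an assumed minimal example rather than clustree's internal code: the column names, the choice of aesthetics and the scale settings are all picked for illustration. The plotting part only runs if tidygraph and ggraph are installed.

```r
# Node and edge tables from slide 9 (column names are assumptions).
nodes <- data.frame(
  name = c("1A", "2A", "2B", "3A", "3B", "3C"),
  resolution = c(1, 2, 2, 3, 3, 3),
  size = c(10, 6, 4, 4, 3, 3),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("1A", "1A", "2A", "2A", "2B", "2B"),
  to = c("2A", "2B", "3A", "3B", "3A", "3C"),
  number = c(6, 4, 3, 3, 1, 3),
  stringsAsFactors = FALSE
)

if (requireNamespace("tidygraph", quietly = TRUE) &&
    requireNamespace("ggraph", quietly = TRUE)) {
  library(ggplot2)
  library(ggraph)

  # Character from/to values are matched against the "name" column of nodes.
  graph <- tidygraph::tbl_graph(nodes = nodes, edges = edges)

  p <- ggraph(graph, layout = "tree") +
    # Edge colour shows how many samples follow each transition.
    geom_edge_link(aes(edge_colour = number)) +
    # Node size shows cluster size, node colour shows the resolution.
    geom_node_point(aes(size = size, colour = factor(resolution))) +
    geom_node_text(aes(label = name)) +
    scale_size(range = c(4, 12)) +
    scale_edge_colour_gradient(low = "grey80", high = "black")

  print(p)
}
```

Cluster 3A has two parents (2A and 2B), so the graph is not strictly a tree; the "tree" layout is used here only because it is the one named on the slide.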

  14. clustree(clusterings, ...)

  15. What does it
    look like?

  16. Simulations - 1 group

  17. Simulations - 4 groups

  18. The Iris dataset
    Tiia Monto CC BY-SA 4.0,
    via Wikimedia Commons
    C T Johansson CC BY 3.0,
    via Wikimedia Commons
    Jefficus,
    via Wikimedia Commons
    Iris setosa
    Iris versicolor
    Iris virginica

  19. Iris dataset
    k-means clustering
    k = 1,...,5
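
A clustering like this could be set up roughly as follows. This is a sketch under assumptions, not necessarily how the figures in the talk were produced: the seed, `nstart` value and the `K` column prefix are illustrative choices. The `clustree()` calls only run if the package is installed.

```r
set.seed(1)  # k-means starts from random centres, so fix the seed

# Cluster on the four numeric measurements only.
measurements <- iris[, 1:4]

# One column of cluster labels per value of k, named K1 to K5.
k_labels <- as.data.frame(lapply(1:5, function(k) {
  kmeans(measurements, centers = k, nstart = 10)$cluster
}))
names(k_labels) <- paste0("K", 1:5)
iris_clusts <- cbind(iris, k_labels)

if (requireNamespace("clustree", quietly = TRUE)) {
  library(clustree)

  # Basic clustering tree: columns starting with the prefix hold the clusterings.
  print(clustree(iris_clusts, prefix = "K"))

  # Same tree with nodes coloured by the mean petal length of their samples.
  print(clustree(iris_clusts, prefix = "K",
                 node_colour = "Petal.Length", node_colour_aggr = "mean"))
}
```

Keeping the original measurement columns alongside the `K` columns is what allows a summary such as mean petal length to be overlaid on the tree.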

  20. (no text on this slide)

  21. Petal length

  22. Organoid
    data

  23. NPHS1

  24. (no text on this slide)

  25. t-SNE plots (axes: t-SNE 1, t-SNE 2)

  26. Acknowledgements
    Everyone that makes tools and data available
    MCRI Bioinformatics
    Belinda Phipson
    MCRI KDDR
    Alex Combes
    @_lazappi_
    lazappi.github.io/clustree
    install.packages("clustree")
    Paper
    doi.org/10.1101/274035
    Slides
    tinyurl.com/clustree-useR2018
    Supervisors
    Alicia Oshlack
    Melissa Little
