Lecture 20 ChIP-seq

shaunmahony
March 28, 2022

BMMB 554 Lecture 20


Transcript

  1. How do transcription factors recognize their binding sites?

    [Figure: example binding sites and the Isl1 motif.] Sequence that best matches motif: 206,801 exact matches in mouse genome. 2,807 of these are bound by Isl1 in motor neurons. In general (vertebrate genomes): • Millions of motif instances. • Tens of thousands of binding sites.
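To make the gap between motif instances and bound sites concrete, here is a minimal sketch of exact-match motif scanning on both strands. The function names and the toy sequence are hypothetical, and the motif string used in the usage note is illustrative only, not the Isl1 motif from the slide. Real scans usually score a position weight matrix rather than require exact matches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def count_motif_matches(genome, motif):
    """Count exact motif matches on both strands (overlaps allowed).

    A palindromic motif is counted once per position, because the
    forward and reverse-complement patterns collapse into one set entry.
    """
    genome = genome.upper()
    motif = motif.upper()
    patterns = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(
        1 for i in range(len(genome) - k + 1) if genome[i:i + k] in patterns
    )
```

For example, `count_motif_matches("CTAATTAGGTAATTAG", "CTAATTA")` counts one forward-strand match and two reverse-strand matches in the toy sequence. Running the same scan over a whole vertebrate genome is what produces the very large instance counts quoted above.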
  2. Most TF binding motif instances are unbound • Essentially all

    TF binding motif occurrences will have no function in a given cell type*. • How can we focus on motif instances that are more likely to be bound & functional? • Conservation • Cis-regulatory modules (i.e., clusters of sites) • Measuring TF binding (ChIP-seq) • Accessibility (DNaseI hypersensitivity) • Chromatin marks (H3K4me1, H3K27ac) * Wasserman & Sandelin, Nature Reviews Genetics (2004) Cell type dependent: we would need experimental data
  3. Limitations of phylogenetic footprinting • Only ~60% of TF binding

    sites are directly conserved between human & mouse. • Perhaps even fewer TF binding sites are functionally conserved. • Phylogenetic shadowing approaches allow conservation analysis of binding sites across more closely related species.
  4. Cis-regulatory modules • Idea: motif instances should be clustered at

    real enhancers. • Different combinations of TFs may have different regulatory effects. Regulatory synergy
  5. Cis-regulatory modules • Multiple TF binding sites will form a

    stronger binding region. • Motif instances for cooperating TFs should be located together.
  6. How do transcription factors recognize their regulatory targets in a

    given cell type? [Figure panels: chromatin state dependent accessibility; cooperative interactions; pioneer binding; no dependence on prior state… Panels show TF binding before and after.]
  7. Chromatin structure may determine TF binding locations • Regulatory sites

    are typically located in regions of accessible chromatin. • Enhancers and promoters have characteristic histone modifications. • H3K4me1 at enhancers • H3K4me3 at promoters • TFs may interact differently with methylated DNA. • Higher-order genome topology may also play a role in TF binding. Rosa & Shaw, Biology (2013)
  8. Chromatin immunoprecipitation (ChIP) 1. Crosslink: Use formaldehyde or UV light

    to covalently crosslink proteins to DNA (i.e., stick everything together). 2. Lyse: Use a lysis buffer to break down cell membranes and release chromatin. 3. Shear: Use sonication or nuclease digestion to break the crosslinked chromatin into small fragments.
  9. Chromatin immunoprecipitation (ChIP) 4. Immunoprecipitate: Use antibodies attached to beads

    to select for DNA fragments that have a protein of interest attached. Separate out beads using magnet for magnetic beads or centrifugation for agarose beads. [Figure: magnetic bead with attached antibodies.]
  10. Chromatin immunoprecipitation (ChIP) 5. Reverse crosslinks Incubate at 70°C to

    reverse cross-linking (i.e., unstick DNA from protein), and separate DNA.
  11. Chromatin immunoprecipitation (ChIP) 6. Library preparation: Select DNA fragments that

    are ~200bp in length. Amplify the fragments using PCR. Size select PCR amplify
  12. ChIP-seq • Directly sequence ChIPed DNA fragments using a high-throughput

    sequencer. • Sequence one or both fragment ends? • → FASTQ file • Map the sequenced reads back to the genome. • Bowtie / BWA • → BAM file
  13. Data smoothing helps ChIP-seq signals to stand out + -

    500bp Foxa2 Liver ChIP-seq Aldh1a1 ~35Kbp upstream
  14. Data smoothing helps ChIP-seq signals to stand out + -

    500bp Foxa2 Liver ChIP-seq extend in 3’ direction Aldh1a1 ~35Kbp upstream
  15. Data smoothing helps ChIP-seq signals to stand out + -

    500bp Foxa2 Liver ChIP-seq extend in 3’ direction remove strandedness Aldh1a1 ~35Kbp upstream
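The smoothing steps on these slides (extend each read in its 3' direction, then drop strand information) can be sketched as a small coverage calculation. This is a minimal illustration under stated assumptions: reads are given as hypothetical `(five_prime_position, strand)` tuples in 0-based coordinates, and the default `fragment_length` of 200 is an assumed value, not one taken from the slides.

```python
def extended_coverage(reads, region_start, region_end, fragment_length=200):
    """Per-base coverage over [region_start, region_end) after read extension.

    Each read is (five_prime_pos, strand). '+' reads are extended to the
    right and '-' reads to the left, so both cover the inferred fragment.
    Strand is then ignored when the coverage is summed.
    """
    size = region_end - region_start
    diff = [0] * (size + 1)  # difference array: +1 at start, -1 at end
    for pos, strand in reads:
        if strand == "+":
            start, end = pos, pos + fragment_length
        else:
            start, end = pos - fragment_length + 1, pos + 1
        start = max(start, region_start)
        end = min(end, region_end)
        if start < end:
            diff[start - region_start] += 1
            diff[end - region_start] -= 1
    coverage, running = [], 0
    for delta in diff[:size]:
        running += delta
        coverage.append(running)
    return coverage
```

With a plus-strand read at position 2 and a minus-strand read at position 7, extending both by 4 bp makes them overlap in the middle, which is why extended, unstranded coverage peaks nearer the binding site than the raw read starts do.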
  16. Making genome browser tracks • Why view data in a

    browser? • Visually assess quality of ChIP-enrichment • Confirm previously known binding targets • Explore the data! • Genome browsers • UCSC genome browser (tracks from Galaxy) [http://genome.ucsc.edu] • IGV [https://www.broadinstitute.org/igv]
  17. ChIP-seq ChIP-seq signal can be punctate or broadly distributed (or

    both) Park, Nature Reviews Genetics (2009)
  18. Identifying ChIP-enriched regions • Scan the genome using a sliding

    window, looking for regions that have more ChIP-seq reads than we would have expected. • Expected by chance? • Expected according to a control experiment? • What would be a good control experiment? • Sequence input material? • Pseudo IP experiment? • Reads from control experiments are not evenly distributed. • More likely to occur in regions of open chromatin • Possibly more likely to occur near highly expressed genes
  19. Identifying ChIP-enriched regions • MACS: • Artificially extend reads to

    expected fragment length, and generate coverage map along genome. • Assume background reads are Poisson distributed. • Mean of the Poisson is locally variable… estimate from control experiment in 5Kbp or 10Kbp around examined location. • For a given location, do we see more reads than we would have expected from the Poisson (p < 10^-5)? MACS: Zhang, et al. Genome Biology, 2008
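The Poisson test described on this slide can be sketched as follows. This is a simplified illustration, not the MACS implementation: the function names are hypothetical, control counts are assumed to be already scaled to the ChIP sequencing depth, and taking the largest of the candidate background rates is an assumption made here to keep the test conservative.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summing the upper tail directly."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # Start at P(X = k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    while term > 0.0:
        total += term
        i += 1
        term *= lam / i  # P(X = i) from P(X = i - 1)
        if i > lam and term < total * 1e-17:
            break
    return min(1.0, total)


def local_lambda(window_size, genome_rate, control_5k, control_10k):
    """Expected background reads in a window of window_size bp.

    genome_rate is control reads per bp genome-wide; control_5k and
    control_10k are control read counts in 5 kbp and 10 kbp regions
    centred on the window.
    """
    return max(
        genome_rate * window_size,
        control_5k * window_size / 5000.0,
        control_10k * window_size / 10000.0,
    )


def is_enriched(chip_count, lam, p_threshold=1e-5):
    """True if the ChIP count exceeds the Poisson expectation at p_threshold."""
    return poisson_sf(chip_count, lam) < p_threshold
```

For a 200 bp window with 100 control reads in the surrounding 5 kbp, the local expectation is 4 reads; 20 ChIP reads in that window pass the 10^-5 threshold, while 6 reads do not.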
  20. Identifying ChIP-enriched regions PeakSeq: Rozowsky, et al., Nature Biotech, 2009

    • Appropriate normalization between signal & control experiments is key • Approaches: • Total tag normalization • Regression on binned paired read counts (ignoring peaks)
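The two normalization approaches listed here can be compared in a short sketch. The function names are hypothetical, and fitting the regression through the origin is a simplifying assumption made for this illustration. Bins flagged as peaks are skipped so that true signal does not inflate the scaling factor.

```python
def total_tag_scaling(signal_total, control_total):
    """Scale control to signal by the ratio of total read counts."""
    return signal_total / control_total


def regression_scaling(signal_bins, control_bins, peak_bins=()):
    """Least-squares slope (through the origin) of signal on control counts.

    signal_bins and control_bins are paired per-bin read counts.
    peak_bins holds indices of bins to ignore (candidate peaks).
    """
    skip = set(peak_bins)
    num = den = 0.0
    for i, (s, c) in enumerate(zip(signal_bins, control_bins)):
        if i in skip:
            continue
        num += s * c
        den += c * c
    if den == 0:
        raise ValueError("no control reads in the background bins")
    return num / den
```

With signal bins `[2, 4, 100, 6]` and control bins `[1, 2, 3, 3]`, total tag normalization gives about 12.4 because the enriched third bin dominates, whereas the regression that ignores that bin gives 2.0, the ratio that actually describes the background.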
  21. Identifying ChIP-enriched regions • PeakSeq: • Step 1: Identify candidate

    enriched locations • Examine only the signal experiment. • Determine an enrichment threshold by simulating the same number of reads and setting a false discovery rate. • Step 2: • Find a control vs. signal scaling ratio by performing regression using non-candidate regions. • Use Binomial test of population proportion differences to assess differences between signal read counts and corresponding scaled control read counts. • Benjamini-Hochberg correction for multiple hypothesis testing. PeakSeq: Rozowsky, et al., Nature Biotech, 2009
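The second PeakSeq step (a binomial test against scaled control counts, followed by Benjamini-Hochberg correction) can be sketched as below. This is an illustration under assumptions, not the PeakSeq code: the function names are hypothetical, the scaled control count is rounded to an integer, and the null hypothesis is taken to be that a read in the region is equally likely to come from signal or scaled control.

```python
import math


def binomial_sf(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1, in log space."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log(1 - p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_term)
    return min(1.0, total)


def peak_p_value(signal_count, control_count, scaling):
    """One-sided p-value that signal exceeds the scaled control."""
    scaled_control = int(round(control_count * scaling))
    return binomial_sf(signal_count, signal_count + scaled_control)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Candidate regions would each get a `peak_p_value`, and the full list would then be passed through `benjamini_hochberg` before applying a false discovery rate cutoff.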
  22. How many locations does a protein bind? • Statistical answers:

    • p-value thresholds (significance of enrichment) • Fold-enrichment cutoffs • Greater sequencing depth typically yields higher numbers of peaks. • Statistical significance vs. biological significance?
  23. Identifying precise TF binding locations • ChIP-seq reads are distributed

    bimodally around binding sites. • Regions of ChIP-enrichment are known as “peaks”. Valouev, et al. Nature Methods (2008)
  24. Computational challenge: resolve the structure of TF binding events from

    ChIP-seq data How many binding events are here? How close to the actual bound bases are event predictions? + -
  25. Signal to noise scores • FRiP: fraction of reads in

    peaks • Correlates with number of peaks • “Successful” experiments have FRiP >1% • FRiP is sensitive to peak-calling parameters, etc. Landt, et al., Genome Res, 2012
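FRiP is simple to compute once peaks have been called. The sketch below assumes a single chromosome, hypothetical inputs (a list of read positions and a list of non-overlapping, half-open `(start, end)` peak intervals), and a hypothetical function name.

```python
import bisect


def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak interval."""
    if not read_positions:
        return 0.0
    peaks = sorted(peaks)
    starts = [start for start, _ in peaks]
    in_peaks = 0
    for pos in read_positions:
        # Find the last peak that starts at or before this read.
        i = bisect.bisect_right(starts, pos) - 1
        if i >= 0 and pos < peaks[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions)
```

Because the peak list is an input, the same reads give a different FRiP under different peak-calling parameters, which is the sensitivity noted on the slide.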
  26. Signal to noise scores • Cross-correlation analysis • Pearson linear

    correlation between Watson & Crick strand read densities after shifting Watson reads by k bases. • Takes advantage of fact that reads are bimodally distributed around true binding sites. • Independent of peak-calling. Landt, et al., Genome Res, 2012
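A minimal sketch of strand cross-correlation follows, assuming hypothetical per-base read-start counts for each strand over the same region. The function names are illustrative. The shift with the highest correlation is an estimate of the typical offset between the two strand modes.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def strand_cross_correlation(plus_counts, minus_counts, max_shift):
    """Correlation of plus-strand counts shifted by k against minus-strand counts.

    Shifting plus reads k bases downstream pairs plus[i] with minus[i + k].
    max_shift must be smaller than the region length.
    """
    n = len(plus_counts)
    return {
        k: pearson(plus_counts[:n - k], minus_counts[k:])
        for k in range(max_shift + 1)
    }
```

Usage: `cc = strand_cross_correlation(plus, minus, 5)` followed by `max(cc, key=cc.get)` returns the best shift. No peak calls are needed at any point, which is why this metric is independent of peak-calling.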
  27. Signal to noise scores • SES: Signal extraction scaling •

    Rank order statistic • Rank genomic bins by read count • Plot against cumulative % of reads • Look for inflection point in the graph • CHANCE: https://github.com/songlab/chance Diaz, et al., Genome Biol, 2012
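The rank-and-accumulate idea on this slide can be sketched as below. This is a simplified illustration rather than the CHANCE implementation. The function names are hypothetical, and locating the split at the largest gap between the control and ChIP cumulative curves is an assumption about how the inflection point is chosen. Both samples are assumed to contain at least one read.

```python
def cumulative_fractions(bin_counts, order):
    """Cumulative fraction of reads when bins are visited in the given order."""
    total = sum(bin_counts)
    running, fractions = 0, []
    for i in order:
        running += bin_counts[i]
        fractions.append(running / total)
    return fractions


def ses_split(chip_bins, control_bins):
    """Rank bins by ChIP count and find where the two curves diverge most.

    Returns (k, chip_curve, control_curve); ranks 0..k are treated as
    background and the remaining ranks as enriched.
    """
    order = sorted(range(len(chip_bins)), key=lambda i: chip_bins[i])
    chip_curve = cumulative_fractions(chip_bins, order)
    control_curve = cumulative_fractions(control_bins, order)
    gaps = [c - s for s, c in zip(chip_curve, control_curve)]
    k = max(range(len(gaps)), key=lambda i: gaps[i])
    return k, chip_curve, control_curve
```

In a successful experiment a few high-ranked bins hold a large share of the ChIP reads, so the ChIP curve stays low and then rises steeply, while the control curve rises more evenly.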
  28. Signal to noise scores NCIS: Liang & Keles (2012) NCIS:

    • Rank bins by total tags (signal+control) • Estimate ratio using summed tag counts in n lowest ranked bins. • Find point at which estimated ratio begins to increase (using at least 75% of bins)
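The three steps listed on this slide can be sketched directly. This is a simplified reading of those steps, not the published NCIS code: the function name is hypothetical, bins are added one at a time rather than by count threshold, and the search stops at the first increase in the ratio.

```python
import math


def ncis_like_ratio(chip_bins, control_bins, min_fraction=0.75):
    """Estimate the ChIP/control background ratio from low-count bins.

    Bins are ranked by total tags (ChIP + control). The ratio of summed
    counts is tracked as bins are added, starting once min_fraction of
    the bins are included, and the last ratio before it starts to
    increase is returned. Returns None if no ratio could be computed.
    """
    order = sorted(
        range(len(chip_bins)), key=lambda i: chip_bins[i] + control_bins[i]
    )
    first = max(1, math.ceil(min_fraction * len(order)))
    chip_sum = control_sum = 0
    previous = None
    for n, i in enumerate(order, start=1):
        chip_sum += chip_bins[i]
        control_sum += control_bins[i]
        if n < first or control_sum == 0:
            continue
        ratio = chip_sum / control_sum
        if previous is not None and ratio > previous:
            return previous
        previous = ratio
    return previous
```

The rationale is that low-count bins are mostly background in both samples, so their ratio reflects sequencing depth alone. The ratio only rises once enriched bins start to be included.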
  29. Reproducibility scores • Correlation between biological replicates • Compute correlation

    between binned read counts across entire genome, or read counts within peak regions. • IDR: Irreproducible Discovery Rate • Assumes ranks of genuine binding site peaks should be consistent across replicates, and ranks of false positive peaks should not. • Uses a model to separate reproducible peaks from irreproducible peaks. • IDR score = probability that a peak comes from the irreproducible group.
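The replicate-correlation check can be sketched as binning followed by a Pearson correlation. The function names, the single-chromosome input, and the bin size are assumptions for illustration. IDR itself is not implemented here.

```python
import math


def bin_counts(read_positions, genome_length, bin_size):
    """Count reads per fixed-size bin along one sequence of genome_length bp."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_positions:
        counts[pos // bin_size] += 1
    return counts


def replicate_correlation(rep1_positions, rep2_positions, genome_length, bin_size):
    """Pearson correlation between binned read counts of two replicates."""
    x = bin_counts(rep1_positions, genome_length, bin_size)
    y = bin_counts(rep2_positions, genome_length, bin_size)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined when one replicate is flat
    return sxy / math.sqrt(sxx * syy)
```

Restricting `read_positions` to reads inside peak regions gives the second variant mentioned on the slide. Pearson correlation is insensitive to overall depth, so replicates sequenced to different depths can still correlate strongly.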
  30. Summary • We cannot (yet) predict where a given transcription

    factor will bind in a given cell type. • Motif scanning yields too many potential sites. • ChIP-seq or other experimental approaches are required. • Analysis of ChIP-seq and other protein-DNA binding assay data is not fully standardized. • Choice of peak-finding methodology & interpretation of QC metrics is dataset-dependent. • Many tools available for ChIP-seq analysis, only some of which are accessible beyond the command-line. • Few principled performance comparisons have been performed.
  31. Further reading • Park PJ “ChIP-seq: advantages and disadvantages of

    a maturing technology”, Nature Reviews Genetics (2009) 10(10):669-680 • Landt S, et al. “ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia”, Genome Research (2012) 22:1813-1831 • Carroll TS, et al. “Impact of artifact removal on ChIP quality metrics in ChIP-seq and ChIP-exo data”, Frontiers in Genetics (2014) 5:75 • Mahony S & Pugh BF “Protein-DNA binding in high resolution”, Critical Reviews in Biochemistry and Molecular Biology (2015) 50(4):269-283