Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Beating the bugs: Simulating drug resistance in viral and bacterial DNA using Python and AWS by Imogen Wright

Pycon ZA
October 06, 2016

Beating the bugs: Simulating drug resistance in viral and bacterial DNA using Python and AWS by Imogen Wright

As a species, we're engaged in a crucial evolutionary struggle, and we're losing: pathogens are evolving resistance to drugs faster than we can make new ones. To slow down the clock and beat the bugs, we need to make sure that resistant pathogens don't get a chance to replicate unchecked in their human hosts. This means doing drug resistance tests to ensure that we only give patients drugs that their infections will respond to.

At Hyrax Biosciences, our software developers build web services that use machine learning to analyse DNA for drug resistance. While building Exatype, our drug resistance testing platform, we ran into a classic problem: how do we build a validation test with a verified result, when we're already the most sensitive tool on the market? To solve this problem, we turned to simulation. This talk is about a multithreaded python tool (Biopython, multiprocessing, numpy, pysam) that simulates every stage of the evolution of HIV drug resistance and the DNA sequencing process. Each run returns 300 unique, procedurally-generated HIV samples from "patients" with different histories and drug resistance profiles. We host the tool on AWS and integrate with Slack to run validations and report results.

The code to simulate your own HIV dataset is available publicly at https://github.com/hyraxbio/simulated-data. Other pathogens coming soon.

Pycon ZA

October 06, 2016
Tweet

More Decks by Pycon ZA

Other Decks in Programming

Transcript

  1. © Hyrax Biosciences | hyraxbio.com Simulating drug resistance in HIV

    DNA using Python Imogen Wright, Hyrax Biosciences
  2. © Hyrax Biosciences | hyraxbio.com Outline How do you manipulate

    DNA with Python? What is HIV drug resistance? Why do we want to simulate it on a computer? How do you simulate it in Python?
  3. © Hyrax Biosciences | hyraxbio.com Hopes for this talk Learn

    a bit about Biopython Learn a bit about bioinformatics
  4. © Hyrax Biosciences | hyraxbio.com Hopes for this talk Learn

    a bit about Biopython Learn a bit about bioinformatics Develop a burning desire to simulate your own HIV DNA? (unlikely)
  5. © Hyrax Biosciences | hyraxbio.com >Resistant1 TTTTTTAGGGAAATTTTGGCCTCCTAAAACAATTT... Each “letter” of

    the code is a nucleotide, one of the four molecular building blocks of DNA.
  6. © Hyrax Biosciences | hyraxbio.com >Resistant1 TTTTTTAGGGAAATTTTGGCCTCCTAAAACAATTT... Each “letter” of

    the code is a nucleotide, one of the four molecular building blocks of DNA. A bonds to T, and C to G, in a double helix. Complement strand Template strand
  7. © Hyrax Biosciences | hyraxbio.com from Bio.Seq import Seq s

    = Seq("TTAAAGGGAAAAACTACCTCCAAAGAGAGAGA”) Each “letter” of the code is a nucleotide, one of the four molecular building blocks of DNA. A bonds to T, and C to G, in a double helix. Template strand Complement strand
  8. © Hyrax Biosciences | hyraxbio.com from Bio.Seq import Seq s

    = Seq("TTAAAGGGAAAAACTACCTCCAAAGAGAGAGA") s.complement() Each “letter” of the code is a nucleotide, one of the four molecular building blocks of DNA. A bonds to T, and C to G, in a double helix. Template strand Complement strand
  9. © Hyrax Biosciences | hyraxbio.com from Bio.Seq import Seq s

    = Seq("TTAAAGGGAAAAACTACCTCCAAAGAGAGAGA") s.complement() s Seq('AATTTCCCTTTTTGATGGAGGTTTCTCTCTCT', Alphabet()) Each “letter” of the code is a nucleotide, one of the four molecular building blocks of DNA. A bonds to T, and C to G, in a double helix. Template strand Complement strand
  10. © Hyrax Biosciences | hyraxbio.com from Bio.Seq import Seq s

    = Seq("TTAAAGGGAAAAACTACCTCCAAAGAGAGAGA") s.complement() s Seq('AATTTCCCTTTTTGATGGAGGTTTCTCTCTCT', Alphabet()) Each “letter” of the code is a nucleotide, one of the four molecular building blocks of DNA. A bonds to T, and C to G, in a double helix. Template strand Complement strand
  11. © Hyrax Biosciences | hyraxbio.com >Resistant1 TTTTTTAGGGAAATTTTGGCCTTCTTACAAAAA... Amino acid chains

    are called proteins., which fold into a F F R E D L A F L Q K ...
  12. © Hyrax Biosciences | hyraxbio.com pytheric, adj. (programming) Unlike a

    pyrrhic victory, a pytheric victory negates any sense of achievement because doing things in Python is too easy. fold into a m = Seq("CCTTATACACATGAACGGATCTGT", generic_dna) m.translate() Seq('PYTHERIC', ExtendedIUPACProtein())
  13. © Hyrax Biosciences | hyraxbio.com >Resistant1 TTTTTTAGGGAAATTTTGGCCTTCTTACAAAAA... Proteins fold into

    a functional shape, the building blocks of all life on Earth. F F R E D L A F L Q K ...
  14. © Hyrax Biosciences | hyraxbio.com >Resistant1 TTTTTTAGGGAAATTTTGGCCTTCTTACAAAAA... Proteins fold into

    a functional shape, the building blocks of all life on Earth. F F R E D L A F L Q K ... If you like the look of this, Bio.PDB will let you investigate it
  15. © Hyrax Biosciences | hyraxbio.com >Resistant1 TTTTTTAAGGAAATTTTGGCCTTCTTACAAAAA... >Resistant1 TTTTTTAAGGAAATTTTGGCCATCTTACAAAAA... Evolution

    results in random changes in the genetic code - mutations, which change the shape of the protein.
  16. © Hyrax Biosciences | hyraxbio.com Actually, this method is a

    little slow and (sometimes) inaccurate. alignment = pairwise2.align.globalds( reference, query, blosum62, -10, -0.5 )
  17. © Hyrax Biosciences | hyraxbio.com Now we have the building

    blocks of life worked out, let’s apply them to an interesting problem.
  18. © Hyrax Biosciences | hyraxbio.com HIV drug resistance A recent

    South African study showed that 1 in 2 HIV+ patients on ARVs had high viral loads. Only a drug resistance test can tell non-adherence from drug resistance. Lippman et al. J Acquir Immune Defic. Syn. 2016
  19. © Hyrax Biosciences | hyraxbio.com DNA sequencing for drug resistance

    tests DNA sequencing detects drug resistance. However, it’s currently costly and doesn’t scale. take blood extract and sequence HIV DNA analyse DNA for drug-resistance mutations
  20. © Hyrax Biosciences | hyraxbio.com We want to detect drug

    resistance if even 1% of the HIV sequences in a sample were drug resistant.
  21. © Hyrax Biosciences | hyraxbio.com We want to detect drug

    resistance if even 1% of the HIV sequences in a sample were drug resistant. we call this the prevalence of resistance
  22. © Hyrax Biosciences | hyraxbio.com We want to detect drug

    resistance if even 1% of the HIV sequences in a sample were drug resistant. No other genetic test is currently this accurate.
  23. © Hyrax Biosciences | hyraxbio.com How do you test that

    a system is working properly, when there’s nothing to compare the results to?
  24. © Hyrax Biosciences | hyraxbio.com How do you test that

    a system is working properly, when there’s nothing to compare the results to? Simulate!
  25. © Hyrax Biosciences | hyraxbio.com How do you test that

    a system is working properly, when there’s nothing to compare the results to? Simulate...carefully... “There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy...” – Hamlet Act 1, Scene 5
  26. © Hyrax Biosciences | hyraxbio.com B C HIV evolves very

    quickly; the dominant HIV subtype differs between regions. ≠
  27. © Hyrax Biosciences | hyraxbio.com HIV exists in the body

    as a population of distinct viruses, only some of which may contain resistant mutations.
  28. © Hyrax Biosciences | hyraxbio.com Put DNA into a sequencing

    machine, and it’ll give you millions of fragments of error-prone DNA sequence “reads”. We simulate the puzzle pieces – Exatype puts the puzzle together.
  29. © Hyrax Biosciences | hyraxbio.com Simulate HIV evolution and diversity

    Simulate DNA sequencing Simulate drug resistance at different prevalences in a sequenced sample
  30. © Hyrax Biosciences | hyraxbio.com Simulate HIV evolution and diversity

    Simulate DNA sequencing Simulate drug resistance at different prevalences in a sequenced sample EvolveAGene ART
  31. © Hyrax Biosciences | hyraxbio.com HIV drug resistance testing Simulate

    HIV evolution and diversity Simulate DNA sequencing Simulate drug resistance at different prevalences in a sequenced sample EvolveAGene ART Python read selection tool
  32. © Hyrax Biosciences | hyraxbio.com HIV drug resistance testing Simulate

    HIV evolution and diversity Simulate DNA sequencing Simulate drug resistance at different prevalences in a sequenced sample EvolveAGene ART Python read selection tool
  33. © Hyrax Biosciences | hyraxbio.com to model this, select some

    interesting resistant HIV strains from across the world
  34. © Hyrax Biosciences | hyraxbio.com >Resistant1 TTTTTTAGGGAAATTTTGGCCTCCTAAAACAATTTAACACGGGG >Resistant2 TTTTATAGGGAAAATTTGGCCTTCTAAAACAATTTAACACGGGG >Resistant3

    TTTTATAGGCAAAATTTGGCCTTCTAAAACAATTTAACACGGGG >Resistant4 TTTTTTAGTGAAATTTTGGCCTTCTAAAACAATTTAACACGGGG to model this, select some interesting resistant HIV strains from across the world This is FASTA format you can download your own strains from hiv.lanl.gov
  35. © Hyrax Biosciences | hyraxbio.com Use EvolveAGene to evolve the

    resistant and susceptible sequences Resistant Susceptible
  36. © Hyrax Biosciences | hyraxbio.com Use EvolveAGene to evolve the

    resistant and susceptible sequences Resistant Susceptible
  37. © Hyrax Biosciences | hyraxbio.com Resistant Susceptible Keep all sequences,

    except ones where the drug resistance profile has deviated.
  38. © Hyrax Biosciences | hyraxbio.com ...and, Python 2.7 multiprocessing child

    processes don’t log errors to the screen (this is a known issue)...
  39. © Hyrax Biosciences | hyraxbio.com class LoggingPool(Pool): def apply_async(self, func,

    args=(), kwds = {}, callback = None): return Pool.apply_async( self, LogExceptions(func), args, kwds, callback) ...as a complication, Python 2.7 multiprocessing child processes don’t log errors to the screen (this is a known issue)... so we derive the Pool class and wrap it in a logger
  40. © Hyrax Biosciences | hyraxbio.com class LoggingPool(Pool): def apply_async(self, func,

    args=(), kwds = {}, callback = None): return Pool.apply_async( self, LogExceptions(func), args, kwds, callback) ...as a complication, Python 2.7 multiprocessing child processes don’t log errors to the screen (this is a known issue)... so we derive the Pool class and wrap it in a logger
  41. © Hyrax Biosciences | hyraxbio.com class LoggingPool(Pool): def apply_async(self, func,

    args=(), kwds = {}, callback = None): return Pool.apply_async( self, LogExceptions(func), args, kwds, callback) ...as a complication, Python 2.7 multiprocessing child processes don’t log errors to the screen (this is a known issue)... so we derive the Pool class and wrap it in a logger
  42. © Hyrax Biosciences | hyraxbio.com class LoggingPool(Pool): def apply_async(self, func,

    args=(), kwds = {}, callback = None): return Pool.apply_async( self, LogExceptions(func), args, kwds, callback) ...as a complication, Python 2.7 multiprocessing child processes don’t log errors to the screen (this is a known issue)... so we derive the Pool class and wrap it in a logger
  43. © Hyrax Biosciences | hyraxbio.com class LoggingPool(Pool): def apply_async(self, func,

    args=(), kwds = {}, callback = None): return Pool.apply_async( self, LogExceptions(func), args, kwds, callback) ...as a complication, Python 2.7 multiprocessing child processes don’t log errors to the screen (this is a known issue)... so we derive the Pool class and wrap it in a logger
  44. © Hyrax Biosciences | hyraxbio.com class LoggingPool(Pool): def apply_async(self, func,

    args=(), kwds = {}, callback = None): return Pool.apply_async( self, LogExceptions(func), args, kwds, callback) ...as a complication, Python 2.7 multiprocessing child processes don’t log errors to the screen (this is a known issue)... so we derive the Pool class and wrap it in a logger* * Huge thanks to Rupert Nash of StackOverflow for this solution
  45. © Hyrax Biosciences | hyraxbio.com Resistant Susceptible Each read is

    a short chunk of the original sequence (200-450 nucleotides long), with a random amount of junk in it.
  46. © Hyrax Biosciences | hyraxbio.com Resistant Susceptible We developed a

    Python wrapper for ART to generate DNA sequence reads.
  47. © Hyrax Biosciences | hyraxbio.com Resistant Susceptible Now, combine resistant

    and susceptible reads at different percentages. and some of these we need some of these
  48. © Hyrax Biosciences | hyraxbio.com We tried randomly selecting susceptible

    and resistant reads in the correct proportion. This couldn’t fulfill our 1% prevalence requirement.
  49. © Hyrax Biosciences | hyraxbio.com 20% Resistant Susceptible HIV gene

    Choose prevalence at which to simulate resistance 0 3000
  50. © Hyrax Biosciences | hyraxbio.com 20% Resistant Susceptible DRM Add

    the correct ratio of resistant to susceptible reads at each drug resistance mutation (DRM) position. (use pysam to parse ART output and locate reads)
  51. © Hyrax Biosciences | hyraxbio.com What happens if nearby drug

    resistance mutations block each other from being fulfilled concurrently?
  52. © Hyrax Biosciences | hyraxbio.com What happens if nearby drug

    resistance mutations block each other from being fulfilled concurrently?
  53. © Hyrax Biosciences | hyraxbio.com drm_dict = {d:0 for d

    in drms_to_cover} simulate_selection(drm_dict)
  54. © Hyrax Biosciences | hyraxbio.com drm_dict = {d:0 for d

    in drms_to_cover} simulate_selection(drm_dict) ig = initial_generator(read_file, drm_dict)
  55. © Hyrax Biosciences | hyraxbio.com drm_dict = {d:0 for d

    in drms_to_cover} simulate_selection(drm_dict) ig = initial_generator(read_file, drm_dict) if any(low_coverage(d) for d in drm_dict): b = blockers(read_file, drm_dict)
  56. © Hyrax Biosciences | hyraxbio.com drm_dict = {d:0 for d

    in drms_to_cover} simulate_selection(drm_dict) ig = initial_generator(read_file, drm_dict) if any(low_coverage(d) for d in drm_dict): b = blockers(read_file, drm_dict) dg = delete_generator(ig, drm_dict, b)
  57. © Hyrax Biosciences | hyraxbio.com drm_dict = {d:0 for d

    in drms_to_cover} simulate_selection(drm_dict) ig = initial_generator(read_file, drm_dict) if any(low_coverage(d) for d in drm_dict): b = blockers(read_file, drm_dict) dg = delete_generator(ig, drm_dict, b) ag = add_generator(drm_dict, b) og = itertools.chain(dg, ag)
  58. © Hyrax Biosciences | hyraxbio.com 4 Coverage needed for ratio:

    ig = initial_generator(read_file, drm_dict)
  59. © Hyrax Biosciences | hyraxbio.com 1 1 4 Coverage needed

    for ratio: ig = initial_generator(read_file, drm_dict)
  60. © Hyrax Biosciences | hyraxbio.com 2 2 4 Coverage needed

    for ratio: ig = initial_generator(read_file, drm_dict)
  61. © Hyrax Biosciences | hyraxbio.com 3 2 4 Coverage needed

    for ratio: ig = initial_generator(read_file, drm_dict)
  62. © Hyrax Biosciences | hyraxbio.com 4 3 4 Coverage needed

    for ratio: ig = initial_generator(read_file, drm_dict)
  63. © Hyrax Biosciences | hyraxbio.com 4 3 4 Coverage needed

    for ratio: dg = delete_generator(ig, drm_dict, b)
  64. © Hyrax Biosciences | hyraxbio.com 4 3 4 Coverage needed

    for ratio: dg = delete_generator(ig, drm_dict, b)
  65. © Hyrax Biosciences | hyraxbio.com 4 4 4 Coverage needed

    for ratio: ag = add_generator(drm_dict, b)
  66. © Hyrax Biosciences | hyraxbio.com >Resistant1 TTTTTTAGGGAAATTTTGGCCTCC... >Resistant2 TTTTATAGGGAAAATTTGGCCTTC... >Resistant3

    TTTTATAGGCAAAATTTGGCCTTC... >Resistant4 TTTTTTAGTGAAATTTTGGCCTTC... >Resistant5 TTTTTTAGGGAAATTTTGGCCTCC... >Resistant6 TTTTATAGGGAAAATTTGGCCTTC... >Resistant7 TTTTATAGGCAAAATTTGGCCTTC... >Resistant8 TTTTTTAGTGAAATTTTGGCCTTC... >Resistant9 TTTTATAGGCAAAATTTGGCCTTC... >Resistant10 TTTTTTAGTGAAATTTTGGCCTTC... 1% 5% 2% 10% 15% 20% 30% 50% 100%
  67. © Hyrax Biosciences | hyraxbio.com You can produce your own

    samples with any • resistance profile • sequencing platform ­ 454, Ion Torrent, Illumina (PacBio soon!) • drug resistance mutation prevalence gi t cl one gi t @ gi t hub. com : hyraxbi o/ si m ul at ed- dat a 5’ Pol 3’