Slide 1

Slide 1 text

Signal Processing meets Immunology: Towards a Hepatitis C Vaccine via High-Dimensional Correlation Estimation

Matthew McKay
ECE Department, Hong Kong University of Science and Technology

Centrale-Supelec, February 3, 2015

Slide 2

Slide 2 text

Other Team Members
• I-Ming Hsing: Professor, CBME; Head and Professor, BME
• Raymond H. Y. Louie: Visiting Assistant Professor, ECE
• Ahmed Abdul Quadeer: PhD student, ECE
• Arup K. Chakraborty: Robert T. Haslam Professor of Chemical Engineering; Professor of Chemistry, Physics, and Biological Engineering
• Karthik Shekhar: Post-doc, Broad Institute

Slide 3

Slide 3 text

Outline
• Immunology Background
• Vaccine Design – Challenges, Conventional Strategy, and Proposed Idea
• Correlation Matrix Estimation using RMT
• Vaccine Design – Details and Validation
• Conclusions

Slide 4

Slide 4 text

Virus
• An invading microbial organism that replicates inside living cells
• Viruses cause infectious diseases, for example:
  • Human Immunodeficiency Virus (HIV), which leads to AIDS
  • Hepatitis A, B, and C viruses, which cause hepatitis
  • Influenza viruses (H1N1, H3N2, H7N9)

Slide 5

Slide 5 text

Hepatitis C virus (HCV)
• HCV causes an infectious disease that mainly affects the liver
• More than 170 million people are affected globally
• Treatment is available: pegylated interferon and ribavirin
  • Expensive
  • Prolonged
  • Extensive side effects
  • Frequently fails
• No vaccine available!

Vexing problem: the virus's extreme mutability

Slide 6

Slide 6 text

Virus consists of proteins

[Figure: HCV viral genome]

Slide 7

Slide 7 text

Proteins consist of sequences of amino acids

No.  Amino Acid      Letter
1    Alanine         A
2    Arginine        R
3    Asparagine      N
4    Aspartic acid   D
5    Cysteine        C
6    Glutamic acid   E
7    Glutamine       Q
8    Glycine         G
9    Histidine       H
10   Isoleucine      I
11   Leucine         L
12   Lysine          K
13   Methionine      M
14   Phenylalanine   F
15   Proline         P
16   Serine          S
17   Threonine       T
18   Tryptophan      W
19   Tyrosine        Y
20   Valine          V

Slide 8

Slide 8 text

Protein properties

The same protein has similar length and amino acid sequence (same function but different effectiveness):

Position    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Protein 1   V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K
Protein 1   V  A  S  K  T  K  R  S  K  G  L  R  R  K  K

Different proteins have different amino acid sequences and lengths:

Position    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Protein 1   V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K

Position    1  2  3  4  5  6  7  8
Protein 2   M  Q  S  A  A  K  L  R

Slide 9

Slide 9 text

Multiple sequence alignment (MSA)

Position     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  …
Sequence 1   V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K  …
Sequence 2   V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K  …
Sequence 3   V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K  …
:            :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  …
Sequence n   V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K  …

[Figure label: Peptide]

All observed viral sequences are considered fit

Slide 10

Slide 10 text

Pathogen specific adaptive immune system

[Figure: virus, host cell, and infected cell presenting peptide-MHC; T cell with TCR; B cell with BCR; antibodies]

Slide 11

Slide 11 text

Pathogen specific adaptive immune system
• Memory of past infections
• Basis for vaccination
• Goal: find specific peptides whose recognition leads to the killing of a large number of infected cells

[Figure: virus, host cell, and infected cell presenting peptide-MHC; T cell (CTL) with TCR; B cell with BCR; antibodies]

Slide 12

Slide 12 text

[Figure: T cells and the peptide-MHC of an infected cell. Epitope with no mutation: recognition and activation. Epitope with one mutation: cannot recognize.]

A single mutation in a peptide can abrogate T cell recognition

Slide 13

Slide 13 text

Outline
• Immunology Background
• Vaccine Design – Challenges, Conventional Strategy, and Proposed Idea
• Correlation Matrix Estimation using RMT
• Vaccine Design – Details and Validation
• Conclusions

Slide 14

Slide 14 text

Vaccine Design Challenges
1. Which type of immune response should the vaccine induce?
2. Which proteins to target?
3. Which peptides of the protein to target?

Slide 15

Slide 15 text

1. B cell or T cell vaccine?
• A B cell (antibody) based vaccine that targets the external proteins?
• A T cell based vaccine that targets the internal proteins?
• Experimental and clinical studies reveal that HCV controllers use a broadly directed T cell response to clear the virus

A T cell based immune response is important in the case of HCV

Slide 16

Slide 16 text

2. Which proteins to target?
• Why NS3?
  • The immune systems of HCV controllers target peptides of NS3
  • Comparatively large number of sequences

[Figure: HCV proteins annotated by function: helicase/protease, membrane binding, polymerase]

Slide 17

Slide 17 text

3. Which peptides of the protein to target?
• Major challenge
• Difficult to address experimentally

Use statistical and computational methods to help find a solution, based on the large amount of sequence data now available

Slide 18

Slide 18 text

Human Genome Project
• Modern advances in biotechnology are revolutionizing the field of biomedical research
• Landmark: Human Genome Project
  • Time period: 1990 to 2003
  • Cost: 3 billion US dollars
• Advances in genomics paved the way for advanced medical research into treatments for cancer and other diseases

Slide 19

Slide 19 text

• Increase in data

Slide 20

Slide 20 text

Open databases

Lots of databases! Explosive growth in submissions!

... and many more (e.g. UniProt, ProDom, VectorBase, …)

Slide 21

Slide 21 text

Large number of sequences for many infectious diseases!

Slide 22

Slide 22 text

3. Which peptides of the protein to target?
• Large number of sequences (observations): 2800+ for NS3
• Large number of amino acids in the protein (variables): 631 for NS3

This is the most difficult challenge; it is addressed here using high-dimensional correlation matrix estimation

Slide 23

Slide 23 text

A TOY EXAMPLE: Conventional vaccine design strategy

Position            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  …
Sequence 1          V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K  …
Sequence 2          V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K  …
Sequence 3          V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K  …
:                   :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  …
Sequence n          V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K  …
Consensus Sequence  V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K  …

• No mutation at all: 100% conserved
• Conventional approach: design a vaccine that elicits a T cell response targeting highly conserved peptides
• Basis of a recently proposed HCV vaccine, IC-41

Problem: the high mutability of the virus may result in escape mutations

[Figure: two T cells]
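The conservation idea behind the conventional strategy can be sketched in a few lines of Python. This is only an illustration on the four toy rows shown on the slide, not the authors' pipeline; the helper names `consensus` and `fully_conserved_sites` are hypothetical.

```python
from collections import Counter

# Toy MSA copied from the slide (only the four rows shown, 15 positions each).
msa = [
    "VYATTSASAGLRQVK",  # Sequence 1
    "VYSTTKRSKGLRQKK",  # Sequence 2
    "VYSTTSRSKGLRQKK",  # Sequence 3
    "VYATTSRSAGLRQKK",  # Sequence n
]

def consensus(seqs):
    """Most frequent residue per column (ties go to the residue seen first)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def fully_conserved_sites(seqs):
    """1-based positions at which every sequence carries the same residue."""
    return [i for i, col in enumerate(zip(*seqs), start=1) if len(set(col)) == 1]

print(consensus(msa))
print(fully_conserved_sites(msa))
```

With only four rows, two columns are tied; the tie-break happens to reproduce the consensus row on the slide, but a real alignment with thousands of sequences would not depend on it.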

Slide 24

Slide 24 text

Proposed vaccine design approach

Position            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  …
Sequence 1          V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K  …
Sequence 2          V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K  …
Sequence 3          V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K  …
:                   :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  …
Sequence n          V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K  …
Consensus Sequence  V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K  …

Positively correlated pairs of locations → Beneficial mutations

Slide 25

Slide 25 text

Proposed vaccine design approach

Position            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  …
Sequence 1          V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K  …
Sequence 2          V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K  …
Sequence 3          V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K  …
:                   :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  …
Sequence n          V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K  …
Consensus Sequence  V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K  …

Positively correlated pairs of locations → Beneficial mutations
Negatively correlated pairs of locations → Harmful mutations

Target the negatively correlated pairs of locations, along with the 100% conserved ones, and avoid the positively correlated pairs of locations
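To make "correlated pairs of locations" concrete, here is a small sketch that computes a pairwise correlation on the toy alignment. The binary encoding (1 if a sequence differs from the consensus at a site, 0 otherwise) is an assumption, since the slides do not state the encoding; `mutation_indicator`, `pearson`, and `site_correlation` are hypothetical helper names.

```python
from math import sqrt

# Toy rows and consensus copied from the slide.
msa = [
    "VYATTSASAGLRQVK",
    "VYSTTKRSKGLRQKK",
    "VYSTTSRSKGLRQKK",
    "VYATTSRSAGLRQKK",
]
consensus_seq = "VYATTSRSAGLRQKK"

def mutation_indicator(seqs, ref, pos):
    """1 where a sequence differs from the consensus at 1-based position pos."""
    return [int(s[pos - 1] != ref[pos - 1]) for s in seqs]

def pearson(x, y):
    """Sample Pearson correlation; None if either site never varies."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # a 100% conserved site has no defined correlation
    return sxy / sqrt(sxx * syy)

def site_correlation(i, j):
    return pearson(mutation_indicator(msa, consensus_seq, i),
                   mutation_indicator(msa, consensus_seq, j))

print(site_correlation(3, 9))  # sites that mutate together
print(site_correlation(3, 7))  # sites that avoid mutating together
```

Four rows are far too few for the values to mean anything statistically; the point is only the sign convention: sites that tend to mutate together correlate positively, and sites that avoid mutating together correlate negatively.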

Slide 26

Slide 26 text

Outline
• Immunology Background
• Vaccine Design – Challenges, Conventional Strategy, and Proposed Idea
• Correlation Matrix Estimation using RMT
• Vaccine Design – Details and Validation
• Conclusions

Slide 27

Slide 27 text

Technical problem …
• Large number of sequences (observations): 2800+ for NS3
• Large number of amino acids in the protein (variables): 631 for NS3

Challenge: accurate high-dimensional correlation estimation

Slide 28

Slide 28 text

Correlation matrix estimation
• Examples
  • Portfolio management and risk assessment
  • Array processing
  • Designing wireless communication receivers
• Number of observations ≈ number of variables
• The sample correlation matrix is known to perform poorly in this regime [Johnstone, 2001]
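A quick numerical experiment illustrates why the sample correlation matrix struggles when the number of variables is comparable to the number of observations. The sketch below assumes NumPy and uses hypothetical sizes (800 observations, 200 variables, a ratio of 0.25, close to the 631/2800 ≈ 0.23 of the NS3 data). The columns are independent, so the true correlation matrix is the identity and every true eigenvalue equals 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 800, 200                   # hypothetical sizes, ratio p/n = 0.25
X = rng.standard_normal((n_obs, n_vars))   # independent columns: true correlation = identity

C = np.corrcoef(X, rowvar=False)           # sample correlation matrix (n_vars x n_vars)
eigvals = np.linalg.eigvalsh(C)

ratio = n_vars / n_obs
mp_lower = (1 - np.sqrt(ratio)) ** 2       # Marchenko-Pastur bulk edges
mp_upper = (1 + np.sqrt(ratio)) ** 2

print(f"sample eigenvalue range: [{eigvals.min():.2f}, {eigvals.max():.2f}]")
print(f"Marchenko-Pastur edges:  [{mp_lower:.2f}, {mp_upper:.2f}]")
```

Although every true eigenvalue is 1, the sample eigenvalues spread over roughly the Marchenko-Pastur interval. That spread is pure estimation noise, which is what RMT-based cleaning tries to separate from genuine structure.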

Slide 29

Slide 29 text

Basis - RMT application in finance
• Random Matrix Theory (RMT) for noise-cleaning in finance
• RMT is also instrumental in modern communication system design, such as WiFi and cellular phones
• HIV work by Arup Chakraborty (MIT) [PNAS, 2011]
  • Finding HIV sectors (groups of amino acids)
  • Designing a vaccine to attack such sectors
  • Vaccine trials in progress

[Photos: Bouchaud, Stanley, Arup K. Chakraborty]

Slide 30

Slide 30 text

In the news…

Slide 31

Slide 31 text

Method
1. Obtain the multiple sequence alignment (MSA)
2. Construct the sample correlation matrix from the MSA
3. Clean the correlation matrix using RMT
4. Design an immunogen targeting the highly conserved and negatively correlated pairs of sites

Advantages:
• The results can potentially yield significant improvements over IC-41
• Such vaccine strategies can be explored with computational methods
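The cleaning step can be sketched as eigenvalue thresholding. The slides do not spell out the exact rule, so the one used here (keep only eigen-modes whose eigenvalue exceeds the Marchenko-Pastur upper edge) is an assumption borrowed from the finance literature the talk cites; `clean_correlation` is a hypothetical name and the data are synthetic.

```python
import numpy as np

def clean_correlation(C, n_obs):
    """Keep only eigen-modes above the Marchenko-Pastur upper edge (assumed rule)."""
    p = C.shape[0]
    mp_upper = (1 + np.sqrt(p / n_obs)) ** 2
    vals, vecs = np.linalg.eigh(C)             # eigenvalues in ascending order
    keep = vals > mp_upper
    C_clean = (vecs[:, keep] * vals[keep]) @ vecs[:, keep].T
    return C_clean, int(keep.sum())

# Synthetic stand-in for the encoded MSA: 800 observations of 200 variables,
# with a shared factor planted in the first 20 variables.
rng = np.random.default_rng(1)
X = rng.standard_normal((800, 200))
X[:, :20] += 2.0 * rng.standard_normal((800, 1))

C_sample = np.corrcoef(X, rowvar=False)        # step 2: sample correlation matrix
C_clean, n_kept = clean_correlation(C_sample, n_obs=800)   # step 3: RMT cleaning
print("modes kept:", n_kept)
```

This sketch leaves the diagonal of the cleaned matrix as it falls out of the truncation and handles only statistical noise; any treatment of phylogenetic effects would need an additional step that is not shown here.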

Slide 32

Slide 32 text

Sample correlation matrix

[Figure: sample correlation matrix]

Slide 33

Slide 33 text

Cleaned correlation matrix

[Figure: cleaned correlation matrix, with statistical noise and phylogenetic noise indicated]

Slide 34

Slide 34 text

Alternate covariance matrix estimation methods
• Regularized (shrinkage) methods [Ledoit et al., 2004; Ledoit et al., 2012]
• Sparse covariance matrix estimation [Bickel et al., 2008; Cai et al., 2012]
• Sparse PCA [Johnstone et al., 2009; Paul et al., 2012; Ma, 2013; Vu, 2013; Liu et al., 2014]
• Robust estimation [Maronna, 1976; Couillet et al., 2013; Zheng et al., 2014]

Slide 35

Slide 35 text

Outline
• Immunology Background
• Vaccine Design – Challenges, Conventional Strategy, and Proposed Idea
• Correlation Matrix Estimation using RMT
• Vaccine Design – Details and Validation
• Conclusions

Slide 36

Slide 36 text

Important factors in the proposed vaccine design
1. Metric L - calculated based on correlations
2. Population coverage

[Figure: host cell presenting a peptide on MHC to a T cell]

Slide 37

Slide 37 text

1. Metric L - calculated based on correlations
• PCP = Percentage of 100% conserved pairs
• PNCP = Percentage of negatively correlated pairs
• PPCP = Percentage of positively correlated pairs
• PUCP = Percentage of uncorrelated pairs

Vaccine design objective: maximize L = PCP + PNCP - PPCP - PUCP

[Figure: Peptide 1 and Peptide 2, each shown with and without a single mutation]
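The arithmetic of the metric can be sketched as follows. How each pair of sites is assigned to one of the four classes is not detailed on the slide, so `label_pair`, its tolerance `tol`, and the example pairs are hypothetical; only the final formula comes from the slide.

```python
from collections import Counter

def label_pair(corr, both_conserved, tol=0.05):
    """Classify one pair of sites: C(onserved), N(egative), P(ositive), U(ncorrelated)."""
    if both_conserved:
        return "C"
    if corr < -tol:
        return "N"
    if corr > tol:
        return "P"
    return "U"

def l_score(labels):
    """L = PCP + PNCP - PPCP - PUCP, with each term a percentage of all pairs."""
    counts = Counter(labels)
    total = len(labels)
    pct = {k: 100.0 * counts[k] / total for k in ("C", "N", "P", "U")}
    return pct["C"] + pct["N"] - pct["P"] - pct["U"]

# Hypothetical pair statistics: (cleaned correlation, both sites 100% conserved?)
pairs = ([(0.0, True)] * 3 + [(-0.4, False)] * 4
         + [(0.3, False)] + [(0.0, False)] * 2)
labels = [label_pair(c, cons) for c, cons in pairs]
L = l_score(labels)
print(L)
```

Because the four percentages sum to 100, maximizing L is equivalent to minimizing the share of positively correlated and uncorrelated pairs among the targeted sites.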

Slide 38

Slide 38 text

2. Population Coverage
• Different people have different types of MHC molecules
• Different MHC molecules may present different peptides
• Thus different people may present different peptides

[Figure: cells with MHC molecules for Person 1 to Person 5]

Differences in MHC molecules lead to the presentation of different peptides across populations

Slide 39

Slide 39 text

2. Population Coverage

Position   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26  …
Person 1   V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S  …
Person 2   V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S  …
Person 3   V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S  …
Person 4   V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S  …

• Challenge: designing a vaccine that covers a large proportion of the population
• Information required:
  • Detailed statistics of the distribution of MHCs in a given population
  • Data on NS3 peptides presented by particular MHCs (IEDB database)

Slide 40

Slide 40 text

Statistics of haplotypes in the US Caucasian population [Maiers et al., 2007]

Slide 41

Slide 41 text

Proposed T cell vaccine design
• A list of 32 peptides recognized by T cells in individuals in a large proportion of the US Caucasian population was compiled
• We consider a vaccine design based on 5 peptides for this population as an example

NS3 sequence shown on the slide:

APITAYAQQTRGLLGCIITSLTGRDKNQVEGEVQIVSTAAQTFLATCINGVCWTVYHGAGTRTIASPKGPVIQMYTNVDQDLV
GWPAPQGARSLTPCTCGSSDLYLVTRHADVIPVRRRGDSRGSLLSPRPISYLKGSSGGPLLCPAGHAVGIFRAAVCTRGVAKAV
DFIPVENLETTMRSPVFTDNSSPPAVPQSFQVAHLHAPTGSGKSTKVPAAYAAQGYKVLVLNPSVAATLGFGAYMSKAHGI
DPNIRTGVRTITTGSPITYSTYGKFLADGGCSGGAYDIIICDECHSTDATSILGIGTVLDQAETAGARLVVLATATPPGSVTVPHP
NIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMT
GFTGDFDSVIDCNTCVTQTVDFSLDPTFTIETTTLPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPSGMFDSSVLCECYDAGCA
WYELTPAETTVRLRAYMNTPGLPVCQDHLEFWEGVFTGLTHIDAHFLSQTKQSGENLPYLVAYQATVCARAQAPPPSW
DQMWKCLIRLKPTLHGPTPLLYRLGAVQNEVTLTHPITKYIMTCMSADLEVVT

Slide 42

Slide 42 text

Proposed T cell vaccine design
• Obtain the 10 combinations with maximum L (effectiveness of the combination in killing viruses)
• Order them with respect to Dcov (double coverage)

Combination  Peptide 1  Peptide 2  Peptide 3  Peptide 4  Peptide 5  L      Dcov
1            1251-1259  1292-1300  1436-1444  1585-1594  1585-1595  63.58  0.50
2            1123-1131  1169-1177  1251-1259  1292-1300  1436-1444  61.62  0.44
3            1123-1131  1175-1183  1251-1259  1292-1300  1436-1444  65.45  0.37
4            1123-1131  1175-1183  1251-1259  1359-1367  1436-1444  61.62  0.37
5            1169-1177  1175-1183  1251-1259  1292-1300  1436-1444  64.46  0.34
6            1123-1131  1251-1259  1292-1300  1359-1367  1436-1444  65.45  0.30
7            1251-1259  1292-1300  1436-1444  1540-1550  1541-1550  61.31  0.18
8            1169-1177  1251-1259  1292-1300  1359-1367  1436-1444  61.62  0.14
9            1175-1183  1251-1259  1292-1300  1359-1367  1436-1444  65.45  0.07
10           1123-1131  1175-1183  1251-1259  1292-1300  1359-1367  61.62  0.07
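The two-stage selection above (take the combinations with the highest L, then order them by double coverage) can be sketched as below. Everything in the toy data is hypothetical: the peptide and haplotype names, the frequencies, the additive stand-in for L, and the reading of double coverage as the fraction of the population whose haplotype presents at least two peptides of the combination.

```python
from itertools import combinations
from math import comb

def double_coverage(combo, presented_by, hap_freq):
    """Fraction of the population presenting at least two peptides of combo (assumed definition)."""
    return sum(freq for hap, freq in hap_freq.items()
               if sum(1 for p in combo if hap in presented_by[p]) >= 2)

def rank_combinations(peptides, k, l_of, dcov_of, top=10):
    """Keep the `top` combinations by L, then order those by double coverage."""
    scored = [(c, l_of(c), dcov_of(c)) for c in combinations(peptides, k)]
    best = sorted(scored, key=lambda t: t[1], reverse=True)[:top]
    return sorted(best, key=lambda t: t[2], reverse=True)

# Hypothetical toy data: four peptides, three haplotypes.
presented_by = {"p1": {"h1", "h2"}, "p2": {"h1"}, "p3": {"h2", "h3"}, "p4": {"h3"}}
hap_freq = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
toy_l = {"p1": 30.0, "p2": 10.0, "p3": 25.0, "p4": 20.0}  # stand-in; the real L is pair-based

ranked = rank_combinations(
    sorted(presented_by), k=2,
    l_of=lambda c: sum(toy_l[p] for p in c),
    dcov_of=lambda c: double_coverage(c, presented_by, hap_freq),
    top=3,
)
for combo, l_val, dcov in ranked:
    print(combo, l_val, dcov)

# Size of the exhaustive search for 32 candidate peptides.
print(comb(32, 5), comb(32, 2))
```

With 32 candidate peptides there are 201,376 five-peptide combinations and 496 two-peptide combinations, so exhaustive enumeration is cheap at this scale.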

Slide 43

Slide 43 text

Analysis of NS3 peptides of IC41
• Plus point: no positively correlated pairs of sites!
• Rank in the 2-peptide vaccine design: 71/496

[Figure: bar charts over combinations of 2 NS3 peptides (1, IC41, 2, 3, 4, 5). Panels: double coverage; mean conservation across all genotypes. L-score labels: 67.03, 38.34, 75.44, 72.55, 80.39, 86.93]

Slide 44

Slide 44 text

Validation
• Experiments
• Existing clinical and experimental data
  • Cannot directly validate proposed peptides
• Validation strategy:
  1. Identify a group/sector of potentially vulnerable (negatively correlated) sites that are collectively coupled
  2. Validate this sector by comparing with structural and clinical data
  3. Check whether our vaccine targets the sites in this sector

Slide 45

Slide 45 text

1. Identify sectors of potentially vulnerable sites
• Use a clustering algorithm based on the eigenvectors of the cleaned correlation matrix
• Analogy with finance: economic sectors
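One simple way to turn eigenvectors into sectors is to assign each site to the leading eigenvector on which it loads most strongly. The slides do not specify the clustering algorithm, so the rule, the threshold `min_loading`, and the toy block matrix below are all assumptions made for illustration.

```python
import numpy as np

def assign_sectors(C_clean, n_sectors, min_loading=0.3):
    """Assign each site to the leading eigenvector with the largest |loading|, or -1."""
    vals, vecs = np.linalg.eigh(C_clean)        # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_sectors]          # leading eigenvectors, largest eigenvalue first
    loadings = np.abs(top)
    sector = loadings.argmax(axis=1)
    sector[loadings.max(axis=1) < min_loading] = -1   # weakly loaded sites stay unassigned
    return sector.tolist()

def block(size, rho):
    """Correlation block with 1 on the diagonal and rho elsewhere."""
    return np.full((size, size), rho) + (1 - rho) * np.eye(size)

# Toy matrix: two groups of three co-varying sites plus two independent sites.
C_toy = np.eye(8)
C_toy[0:3, 0:3] = block(3, 0.8)
C_toy[3:6, 3:6] = block(3, 0.6)

sectors = assign_sectors(C_toy, n_sectors=2)
print(sectors)
```

In the toy matrix each block produces one dominant eigenvector supported only on its own sites, which is the same intuition as stocks grouping into economic sectors.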

Slide 46

Slide 46 text

Three sectors of co-evolving sites in NS3

[Figure: 3-D scatter plot of eigenvectors; bar charts per sector (1, 2, 3) of mean conservation, % positive correlations, % negative correlations, and the ratio of negative to positive correlations]

Sector 1 consists of the most immunologically vulnerable sites

Slide 47

Slide 47 text

2. Structural significance of sector 1

Sector 1 sites are dominant in the critical interface of the NS3 crystal structure (p-value < 0.01)

[Figure: NS3 crystal structure; red marks sector 1 sites]

Slide 48

Slide 48 text

2. Significance of sector 1 based on previously published experimental and clinical results

[Figure: bar chart of % sector 1 sites per epitope, for allele-independent epitopes (1 to 3) and allele-restricted epitopes (1 to 13); annotation: >30%]

The majority of peptides targeted by "HCV controllers" consist predominantly of sector 1 sites (p-value < 0.05).

Slide 49

Slide 49 text

3. Sector 1 sites in proposed vaccine design

Combination  Peptide 1  Peptide 2  Peptide 3  Peptide 4  Peptide 5  L      Dcov
1            1251-1259  1292-1300  1436-1444  1585-1594  1585-1595  63.58  0.50
2            1123-1131  1169-1177  1251-1259  1292-1300  1436-1444  61.62  0.44
3            1123-1131  1175-1183  1251-1259  1292-1300  1436-1444  65.45  0.37
4            1123-1131  1175-1183  1251-1259  1359-1367  1436-1444  61.62  0.37
5            1169-1177  1175-1183  1251-1259  1292-1300  1436-1444  64.46  0.34
6            1123-1131  1251-1259  1292-1300  1359-1367  1436-1444  65.45  0.30
7            1251-1259  1292-1300  1436-1444  1540-1550  1541-1550  61.31  0.18
8            1169-1177  1251-1259  1292-1300  1359-1367  1436-1444  61.62  0.14
9            1175-1183  1251-1259  1292-1300  1359-1367  1436-1444  65.45  0.07
10           1123-1131  1175-1183  1251-1259  1292-1300  1359-1367  61.62  0.07

A large proportion (~60%) of sites in the proposed vaccine design belong to sector 1 (p-value < 0.01)
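A p-value for this kind of enrichment can be obtained from a hypergeometric tail, as sketched below. The slides do not say which test was used, so this choice is an assumption. Only the protein length of 631 sites comes from the talk; the sector size, the number of sites in the design, and the number of hits are hypothetical placeholders.

```python
from math import comb

def hypergeom_tail(n_sites, n_sector, n_drawn, n_hits):
    """P(X >= n_hits) when n_drawn sites are drawn without replacement from n_sites,
    of which n_sector belong to the sector."""
    total = comb(n_sites, n_drawn)
    upper = min(n_sector, n_drawn)
    favourable = sum(comb(n_sector, k) * comb(n_sites - n_sector, n_drawn - k)
                     for k in range(n_hits, upper + 1))
    return favourable / total

# 631 sites in NS3 (from the talk); the other three numbers are hypothetical.
p_value = hypergeom_tail(n_sites=631, n_sector=150, n_drawn=40, n_hits=24)
print(p_value)
```

The function asks how likely it would be to see at least the observed number of sector sites if the targeted sites were picked at random from the protein.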

Slide 50

Slide 50 text

Conclusions
• The majority of the sites in the proposed design belong to sector 1, which appears to be significant according to experimental and clinical data available in the literature
• Numerical validation of the currently proposed vaccine design, IC-41
• Proposal of new vaccine design strategies which can:
  • Potentially improve upon IC-41 by inducing an immune response against more vulnerable parts of the HCV genome
  • Cover a large portion of the population (currently, for the US)
• Similar analysis for the NS4B and NS5B proteins also reveals potential sites for vaccine design

Next step: experimental trials!

Slide 51

Slide 51 text

Conclusions
• There is much similarity between high-dimensional statistical problems in immunology and those in signal processing
• Many methods common in SP find direct application (though currently not well explored):
  • Maximum entropy modeling
  • Sampling methods (e.g., MCMC)
  • Sparsity
  • Subspace estimation
  • Robust estimation
  • Machine learning
  • …

Slide 52

Slide 52 text

Related Publications
• A. A. Quadeer, R. H. Y. Louie, K. Shekhar, A. K. Chakraborty, I. Hsing, and M. R. McKay, “Discovering statistical vulnerabilities in highly mutable viruses: a random matrix approach,” in Proc. of the IEEE Workshop on Statistical Signal Processing (SSP), Gold Coast, Australia, July 2014.
• A. A. Quadeer, R. H. Y. Louie, K. Shekhar, A. K. Chakraborty, I. Hsing, and M. R. McKay, “Statistical linkage of substitutions in patient-derived sequences of genotype 1a hepatitis C virus non-structural protein 3 exposes targets for immunogen design,” Journal of Virology, 88 (13), pp. 7628-7644, July 2014.

Slide 53

Slide 53 text

Join us in Brisbane 19 – 24 April 2015 www.icassp2015.org