BIM_2010_20-Project Presentation

Validation of Time Series Technique for Prediction of Conformational States of Amino Acids
Dr. Sangeeta Sawant, Bioinformatics Centre, UoP, Pune (Guide)
Dr. Mohan Kale, Dept. of Statistics, UoP, Pune (Co-guide)

Concepts Used
Ramachandran plot
Time series
AR, ARMA, ARIMA models
AIC criterion
Euclidean distance
Potential values for AA residues
Feynman Problem Solving Algorithm

Ramachandran Plot

Time Series
A sequence of data points or set of observations, typically measured at successive time instants spaced at uniform time intervals.
Used to study patterns and variations and for forecasting.

Time Series Models (probability models)
Autoregressive (AR) models
Autoregressive moving average (ARMA) models
Autoregressive integrated moving average (ARIMA) models
In each, the current value depends linearly on previous data points.

Materials & Methods
Language: R
Editors: RStudio, Tinn-R
R packages: bio3d, itsmr, forecast, tseries, timsac, wordcloud
Standalone software: ITSM_2000
Forums: R Nabble, BioStars, stats.stackexchange

Methods
A) Calculation of potential values for AA residues
B) Forecasting of AA states
C) Clustering

Calculation of Potential values for AA residues
Dataset-I
3829 proteins selected from the PDB (Protein Data Bank) using the PDBSelect dataset list (25% sequence similarity)
Experimental method: X-ray; R-factor: 0-0.25 (for best-resolved structures)
Files checked for chain breaks and for entries containing only CA atoms
Phi-Psi values computed with torsion.pdb() of “bio3d” and verified via PDBGoodies (IISc, Bangalore) and the Protein Angle Descriptor utility (IIT, Delhi)
Conformational state 1, 2 or 3 assigned to each amino-acid residue according to whether its Phi-Psi values fall in region I, II or III of the Ramachandran plot

Figure No. 2: Ramachandran plot showing the three conformational regions I, II and III (axes: Phi and Psi)
I: closely/tightly packed conformations, Phi -140 to 0, Psi -100 to 0
II: extended conformations, Phi -180 to 0, Psi 80 to 180
III: all remaining conformations
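The region boundaries in the figure caption can be turned into a small lookup. This is only a Python sketch (the project itself used R); the function name `conformational_state` is hypothetical, and treating the boundaries as inclusive and testing region I before region II are assumptions not stated on the slide.

```python
def conformational_state(phi: float, psi: float) -> int:
    """Return 1, 2 or 3 for regions I, II, III (angles in degrees)."""
    # Region I: closely/tightly packed conformations.
    if -140 <= phi <= 0 and -100 <= psi <= 0:
        return 1
    # Region II: extended conformations.
    if -180 <= phi <= 0 and 80 <= psi <= 180:
        return 2
    # Region III: everything else.
    return 3


# Illustrative (made-up) Phi/Psi pairs, one per region.
angles = [(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]
states = [conformational_state(phi, psi) for phi, psi in angles]
print(states)  # prints [1, 2, 3]
```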

Frequencies of single residues in the three states were calculated and normalized using (Kolaskar, A.S. & Sawant, S.V., 1996):
$$P_{ik} = \frac{n_{ik} / \sum_{k} n_{ik}}{\sum_{i} n_{ik} / N}$$
n_ik: no. of times the AA of type i occurs in state k (k = 1-3)
N: total no. of residues
P_ik: potential value of the AA of type i in state k
Potential values in pdf
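A rough Python sketch of the normalization step, assuming a propensity-style form: the fraction of residue type i found in state k, divided by the overall fraction of residues in state k. That form is an assumption about the slide's formula, and the function name `potentials` and the toy data are hypothetical.

```python
from collections import Counter


def potentials(residues, states):
    """Map (residue type, state) to its potential value.

    Assumed form: (n_ik / n_i) / (N_k / N), where n_i is the count of
    residue type i, N_k the count of state k, and N the total count.
    """
    n_ik = Counter(zip(residues, states))
    n_i = Counter(residues)
    n_k = Counter(states)
    total = len(residues)
    return {
        (aa, k): (count / n_i[aa]) / (n_k[k] / total)
        for (aa, k), count in n_ik.items()
    }


# Toy data: four residues with their assigned conformational states.
table = potentials(list("AAGG"), [1, 1, 1, 3])
for key in sorted(table):
    print(key, round(table[key], 3))
```

A value above 1 means the residue type occurs in that state more often than the dataset average; only observed (residue, state) pairs get an entry in this sketch.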

Potential values

Time Series

ACF Plot

ACF: Stationary vs. Non-stationary (panels: Non-stationary, Stationary)

Time Series (diagram labels: Stationary; Non-stationary; Stationary; ACF plot)
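For reference, the sample autocorrelation function behind an ACF plot can be sketched in a few lines of Python (the deck's plots came from R). The function name `sample_acf` is hypothetical; a slowly decaying ACF, as for the trending series below, is the usual visual sign of non-stationarity.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations of x at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / c0
        for h in range(max_lag + 1)
    ]


# A pure trend (non-stationary): autocorrelations stay close to 1.
trend = list(range(50))
print([round(r, 2) for r in sample_acf(trend, 3)])

# An alternating series: autocorrelations flip sign at every lag.
alternating = [1, -1] * 4
print(sample_acf(alternating, 2))  # prints [1.0, -0.875, 0.75]
```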

Stationary TS

TS model building: AR(p), ARMA(p, q), ARIMA(p, d, q)

Best model selection
Candidates: AR(p), ARMA(p, q), ARIMA(p, d, q)
Criterion: AIC
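To illustrate AIC-based selection, here is a Python sketch restricted to AR(p) candidates. It is not the R workflow used in the project: the AR fits come from the Levinson-Durbin recursion on sample autocovariances, the score n ln(sigma^2) + 2p is the Gaussian AIC up to an additive constant, and the names `ar_order_by_aic` and `autocovariances` plus the simulated series are all hypothetical.

```python
import math
import random


def autocovariances(x, max_lag):
    """Biased sample autocovariances at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    return [
        sum((x[t] - mean) * (x[t + h] - mean) for t in range(n - h)) / n
        for h in range(max_lag + 1)
    ]


def ar_order_by_aic(x, max_p):
    """Return (best order, {order: score}) for AR(0)..AR(max_p)."""
    n = len(x)
    g = autocovariances(x, max_p)
    v = g[0]  # innovation variance of the AR(0) model
    phi = []  # coefficients of the current AR(p - 1) fit
    scores = {0: n * math.log(v)}
    for p in range(1, max_p + 1):
        # Levinson-Durbin step: last coefficient of the AR(p) fit.
        k = (g[p] - sum(phi[j] * g[p - 1 - j] for j in range(p - 1))) / v
        phi = [phi[j] - k * phi[p - 2 - j] for j in range(p - 1)] + [k]
        v *= 1.0 - k * k
        scores[p] = n * math.log(v) + 2 * p
    best = min(scores, key=scores.get)
    return best, scores


# Simulated AR(1) series with phi = 0.8 (illustrative data only).
rng = random.Random(0)
series = [0.0]
for _ in range(299):
    series.append(0.8 * series[-1] + rng.gauss(0.0, 1.0))

best, scores = ar_order_by_aic(series, max_p=5)
print("order with minimum AIC:", best)
```

The same idea extends to ARMA and ARIMA candidates, but those need a likelihood-based fit that this sketch does not attempt.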

Forecasting of AA states for best models

Forecasting of AA states for best models (contd.)
e.g. for an AR(1) process,
$$X_t = \phi X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots$$
where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi| < 1$.
The observed potential of an AA and its index are given as the data point and t, respectively; with the 1st observed potential as the start, prediction runs from the 2nd position up to the last index using forecast() of “itsmr”.
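The AR(1) prediction scheme above can be sketched as follows in Python. This is not the itsmr forecast() call: phi is estimated here as the lag-1 sample autocorrelation (a Yule-Walker estimate), the series mean is added back as an assumption, and each position from the 2nd onward is predicted from the observed value before it. The function names are hypothetical.

```python
def fit_ar1(x):
    """Return (mean, phi) with phi the lag-1 sample autocorrelation."""
    mean = sum(x) / len(x)
    dev = [v - mean for v in x]
    phi = sum(dev[t] * dev[t - 1] for t in range(1, len(dev))) / sum(
        d * d for d in dev
    )
    return mean, phi


def one_step_predictions(x):
    """Predict positions 2..n, each from the observed previous value."""
    mean, phi = fit_ar1(x)
    return [mean + phi * (x[t - 1] - mean) for t in range(1, len(x))]


# Toy alternating series standing in for a potential-value series.
toy = [1, -1] * 4
mean, phi = fit_ar1(toy)
print(phi)                           # prints -0.875
print(one_step_predictions(toy)[0])  # prints -0.875
```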

Similarly, for ARMA (1,1) / ARIMA (1,1),
$$X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}, \quad \theta + \phi \neq 0$$
Forecasting quality is measured by the coefficient of determination (R2):
$$R^2 = 1 - \frac{\sum_i (Y_i - F_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$
Y_i = true/observed value
F_i = forecasted/predicted value
\bar{Y} = mean of the observed values
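The coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares about the observed mean. A minimal Python version, with a hypothetical function name and made-up numbers:

```python
def r_squared(observed, forecast):
    """Coefficient of determination between observed and forecast values."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, forecast))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


print(r_squared([1, 2, 3], [1, 2, 3]))  # prints 1.0 (perfect forecast)
print(r_squared([1, 2, 3], [2, 2, 2]))  # prints 0.0 (no better than the mean)
```

R2 can be negative when the forecast is worse than simply predicting the mean.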

Clustering
Dataset-II
SCOP domain-specific PDB-style files (ATOM & HETATM records) downloaded from the ASTRAL Compendium for Sequence and Structure Analysis, release 1.75 (June 2009)
Files scanned for chain breaks and for the presence of CA atoms only; files with breaks were set aside

Length of AA residues: 100-110, e.g. file 10gsa1_a_133_pot.txt

Potential values (time series): each domain classified as a stationary (506) or non-stationary (1692) process
Non-stationary data kept aside for further transformations
AR, ARMA & ARIMA models fitted
Best model (minimum AIC criterion): AR (22), ARMA (484), ARIMA (no model)
AR(p), ARMA(p,q): distance matrix (Euclidean distance)
Dendrogram: Neighbour-joining (PHYLIP package)
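One way to read the distance-matrix step is as Euclidean distances between the (p, q) orders of the fitted models. That reading is an assumption, and the Python sketch below uses hypothetical domain names and orders; the resulting matrix is the kind of input a neighbour-joining program expects.

```python
import math


def distance_matrix(orders):
    """Pairwise Euclidean distances between model-order vectors."""
    names = list(orders)
    matrix = [
        [math.dist(orders[a], orders[b]) for b in names] for a in names
    ]
    return names, matrix


# Hypothetical (p, q) orders of the best model for three domains.
orders = {"dom_a": (1, 3), "dom_b": (4, 7), "dom_c": (1, 3)}
names, matrix = distance_matrix(orders)
for name, row in zip(names, matrix):
    print(name, row)
```

Domains whose best models share the same orders end up at distance 0, so they cannot be separated in the tree.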

Dendrogram_TS –AR models(22)

Dendrogram_TS –ARMA models(484) • Phylowidget link

Results & Discussion
For each AA of all the proteins, 3D Cartesian coordinates were transformed into 2D information, i.e. the conformational states of the AAs. Potential values were then computed and used to build a statistical model, dependent on time-distance (the index of the AA), as a time series for forecasting purposes.

AR values
Autoregressive order (p) → range 1-18
Short- & long-range dependence → variations in protein structural arrangements
Variations show → biodiversity is exhibited through structural components
Lower order: more variation; higher order: less variation

Table No. II: Forecasting results for AR models (44) out of the best 90 models (Note: for 46 models, class information was not found in the SCOP database). All values are % accuracy.

SCOP class (no. of models)    AA seq (%)       States (%)
                              Max     Min      Max     Min
All α (a), 12                 26.82   2.41     55.68   21.77
All β (b), 5                  16.30   8.88     51.11   44.76
α/β (c), 9                    27.77   1.47     54.76   30.64
α+β (d), 13                   28.57   7.04     51.70   19.04
Small proteins (g), 1         19.51   -        48.78   -
Coiled-coil (h), 3            22.5    5.88     26      15
Designed proteins (k), 1      29.03   -        26.88   -

Classes with a single model (g, k) have one value only.
Conformational-state accuracy > AA-residue accuracy, due to the low resolution of the potential values (forecasted values).

Table No. III: Forecasting results for ARMA models (557) out of the best 1239 models (Note: for 682 models, class information was not found in the SCOP database). All values are % accuracy.

SCOP class (no. of models)        AA seq (%)       States (%)
                                  Max     Min      Max     Min
All α (a), 123                    32.55   2.63     65.77   8.06
All β (b), 146                    32.81   3.96     65.01   17.94
α/β (c), 120                      43.47   5        62.89   8.97
α+β (d), 127                      37.96   2.70     68.15   11.11
Multi-domain proteins (e), 13     24.39   6.034    50      17.80
Membrane & cell surface (f), 3    12.65   7.01     34.33   11.42
Small proteins (g), 17            30.64   6.60     64.51   14.28

Because the dataset is non-representative and class information is inadequate, it cannot be said for any particular class that (i) prediction accuracy increases or decreases, or (ii) it mostly follows an ARMA process.

Discussion
TS graphs open a new door in the scientific visualization of proteins without 3D structure information: a specific AA can be visualized on a line plot, with its value proportional to its frequency of occurrence in the allowed regions of the Ramachandran plot.
The potential value of each AA adds a new feature for selection in machine learning techniques.
The order of an AR model tells how the current value is linearly related to the past p values.
Intra-dependency of AAs is shown using TS models, e.g. AR(4), ARMA(1,3).

CONCLUSIONS
Found a new way of looking at protein structure prediction.
Application of the TS technique for predicting conformational states, based on conformational state potentials instead of secondary structure, has been attempted.
Accuracy of prediction of conformational states of AAs using time series is higher than that of prediction of AA residues.
To increase prediction accuracy, multivariate time series may be useful instead of univariate time series.
Intra-fluctuations inside proteins, due to AA arrangement, can be traced out through the stationary and non-stationary groups.

FUTURE WORK
AR and MA orders of TS models could serve as points of genetic information (distances) to predict evolutionary relationships between different proteins.
The TS concept can be used to predict conformational states of missing residues in PDB data files.
Hierarchical clustering/classification of protein TS gives birth to a new concept of time-dependent clustering (pseudo-clustering) and pseudo-phylogeny.
Development of synthetic proteins to combat seasonal diseases and to tackle chemical warfare attacks.
TS fluctuations for a specific class of proteins can be used as a “pattern” for data analysis and pattern-dependent classification of proteins.

References
Blundell, T.L., Sibanda, B.L., Sternberg, M.J., Thornton, J.M. (1987). Knowledge-based prediction of protein structures and the design of novel molecules. Nature 326(6111):347-52. Review.
Kolaskar, A.S., Sawant, S.V. (1996). Prediction of conformational states of amino acids using a Ramachandran plot. Int. J. Peptide Protein Res. 110-116.
Alessandro G., Romualdo B. (2000). Nonlinear Methods in the Analysis of Protein Sequences: A Case Study in Rubredoxins. Biophysical Journal. 136-148.

Questions

Thank You !

More Decks by Sagar Nikam
