The following information has been received by the server: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ________________________________________________________________________________ reference predict_h2098 (Fri May 24 11:41:41 MDT 1996) from pazos@gredos.cnb.uam.es password(###) resp MAIL orig HTML prediction of: -secondary structure-solvent accessibility- return no aligment # msf format Please check that the conversion from MSF to HSSP format of the align- ment looks reasonable. The alignment that has been used as input to the network is: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ________________________________________________________________________________ --- ------------------------------------------------------------ --- MAXHOM multiple sequence alignment --- ------------------------------------------------------------ --- --- MAXHOM ALIGNMENT HEADER: ABBREVIATIONS FOR SUMMARY --- ID : identifier of aligned (homologous) protein --- STRID : PDB identifier (only for known structures) --- PIDE : percentage of pairwise sequence identity --- WSIM : percentage of weighted similarity --- LALI : number of residues aligned --- NGAP : number of insertions and deletions (indels) --- LGAP : number of residues in all indels --- LSEQ2 : length of aligned sequence --- ACCNUM : SwissProt accession number --- NAME : one-line description of aligned protein --- --- MAXHOM ALIGNMENT HEADER: SUMMARY ID STRID IDE WSIM LALI NGAP LGAP LEN2 ACCNUM NAME S38883 82 0 84 0 0 84 PNP_HAEIN 67 0 83 0 0 83 RS1H_BACSU 57 0 78 1 1 77 BS29668_2 52 0 83 0 0 83 CEEEED8_6 49 0 77 1 1 76 YABR_BACSU 48 0 84 0 0 84 HS055891_1 46 0 83 0 0 83 RS1_RHIME 43 0 83 0 0 83 SA19858_1 43 0 81 0 0 81 HSHRH1_1 43 0 80 1 1 79 RS1_MYCLE 43 0 74 0 0 74 1222505 1222 42 0 78 0 0 78 RR1_SPIOL 42 0 78 1 1 77 NMCPSPS_1 41 0 78 0 0 78 YHGF_ECOLI 40 0 84 0 0 84 PR22_YEAST 41 0 76 0 0 76 RS1_SYNP6 38 0 79 1 1 78 RS1_PROSP 38 0 83 1 1 82 1221346 1221 35 0 83 1 1 82 HSKIAA25_1 35 0 74 0 0 
74 RPOE_SULAC 34 0 72 1 1 71 SC9959_11 31 0 83 0 0 83 RS1_CHLTR 32 0 81 1 5 76 --- --- MAXHOM ALIGNMENT: IN MSF FORMAT MSF of: /home/phd/server/work/predict_h2098_1615.hssp from: 1 to: 84 /home/phd/server/work/predict_h2098_1615.ret_msf MSF: 84 Type: P 24-May-96 11:41:4 Check: 6000 .. Name: PNS1 Len: 84 Check: 4930 Weight: 1.00 Name: S38883 Len: 84 Check: 4045 Weight: 1.00 Name: PNP_HAEIN Len: 84 Check: 4384 Weight: 1.00 Name: RS1H_BACSU Len: 84 Check: 9612 Weight: 1.00 Name: BS29668_2 Len: 84 Check: 2115 Weight: 1.00 Name: CEEEED8_6 Len: 84 Check: 2346 Weight: 1.00 Name: YABR_BACSU Len: 84 Check: 3816 Weight: 1.00 Name: HS055891_1 Len: 84 Check: 4280 Weight: 1.00 Name: RS1_RHIME Len: 84 Check: 5063 Weight: 1.00 Name: SA19858_1 Len: 84 Check: 5029 Weight: 1.00 Name: HSHRH1_1 Len: 84 Check: 3390 Weight: 1.00 Name: RS1_MYCLE Len: 84 Check: 6902 Weight: 1.00 Name: 1222505 Len: 84 Check: 1566 Weight: 1.00 Name: RR1_SPIOL Len: 84 Check: 373 Weight: 1.00 Name: NMCPSPS_1 Len: 84 Check: 152 Weight: 1.00 Name: YHGF_ECOLI Len: 84 Check: 4396 Weight: 1.00 Name: PR22_YEAST Len: 84 Check: 2065 Weight: 1.00 Name: RS1_SYNP6 Len: 84 Check: 9505 Weight: 1.00 Name: RS1_PROSP Len: 84 Check: 2941 Weight: 1.00 Name: 1221346 Len: 84 Check: 3567 Weight: 1.00 Name: HSKIAA25_1 Len: 84 Check: 8406 Weight: 1.00 Name: RPOE_SULAC Len: 84 Check: 9904 Weight: 1.00 Name: SC9959_11 Len: 84 Check: 1446 Weight: 1.00 Name: RS1_CHLTR Len: 84 Check: 5767 Weight: 1.00 // 1 50 PNS1 AEIEVGRVYT GKVTRIVDFG AFVAIGGGKE GLVHISQIAD KRVEKVTDYL S38883 AEIEVGRIYA GKVTRIVDFG AFVAIGGGKE GLVHISQIAD KRVEKVADYL PNP_HAEIN AEVEAGVIYK GKVTRLADFG AFVAIVGNKE GLVHISQIAE ERVEKVSDYL RS1H_BACSUQSLEVGSVLD GKVQRLTDFG AFVDIGG.ID GLVHISQLSH SHVEKPSDVV BS29668_2 .EVEVGQLYL GKVKRIEKFG AFVEIFSGKD GLVHISELAL ERVGKVEDVV CEEEED8_6 ...EIGKIYD GRVNSIQSFG AFITLEGFQE GLVHISQIRN ERVQTVADVL YABR_BACSUMSIEVGSKLQ GKITGITNFG AFVELPGGST GLVHISEVAD NYVKDINDHL HS055891_1DQIAAGSVLE GTVKRVKDFG AFVEILPGIE GLVHVSQISN KRIENPSEVL RS1_RHIME AKYPVGKKIS 
GTVTNITDYG AFVELEPGIE GLIHISEMST KKNVHPGKIL SA19858_1 ...EVGERIL GSVVKTTTFG AFVSLLPGKD GLLHISQIRK KRVENVEDVL HSHRH1_1 EEPTIGDIYN GKVTSIMQFG CFVQLEGLRE GLVHISELRR ERVANVADVV RS1_MYCLE .THAIGQIVP GKVTKLVPFG AFVRVEEGIE GLVHISELAE RHVEVPDQVV 1222505 TDLKSGMILE GTVTNVTNFG AFVDIGVHQD GLVHISSLSD KFVEDPHQVV RR1_SPIOL AQLGIGSVVT GTVQSLKPYG AFIDIGG.IN GLLHVSQISH DRVSDIATVL NMCPSPS_1 SDLQVGMILE GVVSNVANFG AFVDIGVHQD GLVHISALSN KFVQDPREVV YHGF_ECOLINDLQPGMILE GAVTNVTNFG AFVDIGVHQD GLVHISSLSN KFVEDPHTVV PR22_YEAST....LHKVYE GKVRNITTFG CFVQIFGTRD GLVHISEMSD QRTLDPHDVV RS1_SYNP6 NRLEVGEVVV GAVRGIKPYG AFIDIGG.VS GLLHISEISH DHIETPHSVF RS1_PROSP ENLQEGMEVK GIVKNLTDYG AFVDLGG.VD GLLHITDMAW KRVKHPSEIV 1221346 ENLVEGSEVK GVVKNLTEYG AFVDLGG.VD GLLHITDMAW KRVKHPSEIV HSKIAA25_1SEIHPGMLLI GFVKSIKDYG VFIQFPSGLS GLAPKAIMSD KFVTSTSDHF RPOE_SULAC....IHEVIE GEVSQVDNYG VYVNMGP.VD GLVHISQITD DNLEKSKKSI SC9959_11 SDIKAGDVFE GTIKSVTDFG VFVKLDNTVT GLAHITEIAD KKPEDLSALF RS1_CHLTR SEVQPGAILK GTVVDISKDF VVVDVGLKSE GVIPMSEFID S.....SEGL 51 84 PNS1 QMGQEVPVKV LEVDRQGRIR LSIKEATEQS QPAA S38883 QVGQETSVKV LEIDRQGRVR LSIKEATAGT AVEE PNP_HAEIN QVGQEVNVKV VEIDRQGRIR LTMKDLAPKQ ETE. RS1H_BACSUEEGQEVKVKV LSVDRDERIS LSIKDTLP.. .... BS29668_2 KIGDEILVKV TEIDKQGRVN LSRKAVLREE KEKE CEEEED8_6 KRGENVKVKV NKIEN.GKIS LSMKEVDQNS .... YABR_BACSUKVGDQVEVKV INVEKDGKIG LSIKKAKDRP QARP HS055891_1KSGDKVQVKV LDIKPEERIS LSMKALEEKP ERE. RS1_RHIME STSQEVDVVV LEVDPTRRIS LGLKQTLENP WQA. SA19858_1 GVGQKVQVEI AEIDSRGKLS LIPVIEGEEA ASDE HSHRH1_1 SKGQRVKVKV LSFTG.TKTS LSMKDVDQET .... RS1_MYCLE AVGDDAMVKV IDIDLERRIS LSLKA..... .... 1222505 KTGNIVKVKV LEVDVRKRIA LTMRLDES.. .... RR1_SPIOL QPGDTLKVMI LSHDREGRVS LSTKKLEP.. .... NMCPSPS_1 KAGDVVKVKV LEVDARKRIA LTMRLDDE.. .... YHGF_ECOLIKAGDIVKVKV LEVDLQKRIA LTMRLDEQPG ETNA PR22_YEASTRQGQHIFVEV IKIQNNGKIS LSMKNIDQHS .... RS1_SYNP6 NVNDEVKVMI IDLDAEGRIS LSTKQLEPE. .... RS1_PROSP NVGDEITVKV LKFDRETRVS LGLKQLGEDP WVA. 1221346 NVGDEVTVKV LKFDKDTRVS LGLKQLGQDP WAA. 
HSKIAA25_1VEGQTVAAKV TNVDEEQRML LSLR...... .... RPOE_SULACTKGDRVRAMI ISSGRLPRIA LTMKQP.... .... SC9959_11 GVGDRVKAIV LKTNPEKQIS LSLKASHFSK EAE. RS1_CHLTR SVGAEVEVYL DQEDEEGKVV LSREKATRQR Q... ________________________________________________________________________________ PredictProtein@EMBL-Heidelberg.DE PHD: Profile fed neural network systems from HeiDelberg ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Prediction of secondary structure by PHDsec Prediction of solvent accessibility by PHDacc ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Author: Burkhard Rost EMBL, Heidelberg, FRG Meyerhofstrasse 1, 69 117 Heidelberg Internet: Predict-Help@EMBL-Heidelberg.DE All rights reserved. Please quote ~~~~~~~~~~~ The PredictProtein mail server is described in: B Rost: PHD: predicting one-dimensional protein structure by pro- file based neural networks. Meth. in Enzym., 1996, 266, 525-539. Additionally to be quoted for publications of PHDsec output: B Rost & C Sander: Prediction of protein structure at better than 70% accuracy. J. Mol. Biol., 1993, 232, 584-599. The latest improvement steps (up to 72%) are explained in: B Rost & C Sander: Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 1994, 19, 55-72. Additionally to be quoted for publications of PHDacc output: B Rost & C Sander: Conservation and prediction of solvent accessi- bility in protein families. Proteins, 1994, 20, 216-226. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Prediction of secondary structure by PHDsec ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ About the input to the network ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The prediction is performed by a system of neural networks. The input is a multiple sequence alignment. It is taken from an HSSP file (produced by the program MaxHom: Sander, Chris & Schneider, Reinhard: Database of Homology-Derived Structures and the Structural Meaning of Sequence Alignment. 
Proteins, 1991, 9, 56-68).

For optimal results the alignment should contain sequences with varying
degrees of sequence similarity relative to the input protein. The
following is an ideal situation:

+-----------------+----------------------+
| sequence:       | sequence identity    |
+-----------------+----------------------+
| target sequence | 100 %                |
| aligned seq. 1  |  90 %                |
| aligned seq. 2  |  80 %                |
| ...             | ...                  |
| aligned seq. 7  |  30 %                |
+-----------------+----------------------+

Estimated Accuracy of Prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A careful cross validation test on some 250 protein chains (in total
about 55,000 residues) with less than 25% pairwise sequence identity
gave the following results:

++================++-----------------------------------------+
|| Qtotal = 72.1% || ("overall three state accuracy")        |
++================++-----------------------------------------+

+----------------------------+-----------------------------+
| Qhelix (% of observed)=70% | Qhelix (% of predicted)=77% |
| Qstrand(% of observed)=62% | Qstrand(% of predicted)=64% |
| Qloop  (% of observed)=79% | Qloop  (% of predicted)=72% |
+----------------------------+-----------------------------+

..........................................................................

These percentages are defined by:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

|                      number of correctly predicted residues
|Qtotal             = --------------------------------------- (*100)
|                      number of all residues
|
|                      no of res correctly predicted to be in helix
|Qhelix (% of obs)  = -------------------------------------------- (*100)
|                      no of all res observed to be in helix
|
|                      no of res correctly predicted to be in helix
|Qhelix (% of pred) = -------------------------------------------- (*100)
|                      no of all residues predicted to be in helix

..........................................................................
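As a worked example of these definitions, the following Python sketch
evaluates them on the 3x3 accuracy matrix that this report lists under
"The accuracy matrix in detail". The function and variable names are
illustrative assumptions and are not part of PHD.

```python
# Illustrative sketch; all names are hypothetical, not part of PHD.
# Rows are observed states (H, E, L), columns are predicted states.
# Counts are copied from the accuracy matrix given in this report.
STATES = ("H", "E", "L")
MATRIX = [
    [12447, 1255, 3990],   # observed helix
    [949, 7493, 3750],     # observed strand
    [2604, 2875, 19962],   # observed loop
]


def q_total(m):
    """Percentage of all residues predicted in the correct state."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return 100.0 * correct / total


def q_observed(m, i):
    """Correctly predicted residues as % of residues observed in state i."""
    return 100.0 * m[i][i] / sum(m[i])


def q_predicted(m, i):
    """Correctly predicted residues as % of residues predicted in state i."""
    return 100.0 * m[i][i] / sum(row[i] for row in m)


print(f"Qtotal = {q_total(MATRIX):.1f}%")
for idx, state in enumerate(STATES):
    print(f"{state}: %obs = {q_observed(MATRIX, idx):.1f}, "
          f"%pred = {q_predicted(MATRIX, idx):.1f}")
```

"% of observed" divides by a row sum and "% of predicted" divides by a
column sum, which is the only difference between the two measures.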
Averaging over single chains ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The most reasonable way to compute the overall accuracies is the above quoted percentage of correctly predicted residues. However, since the user is mainly interested in the expected performance of the prediction for a particular protein, the mean value when averaging over protein chains might be of help as well. Computing first the three state accuracy for each protein chain, and then averaging over 250 chains yields the following average: +-------------------------------====--+ | Qtotal/averaged over chains = 72.2% | +-------------------------------====--+ | standard deviation = 9.3% | +-------------------------------------+ .......................................................................... Further measures of performance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Matthews correlation coefficient: +---------------------------------------------+ | Chelix = 0.63, Cstrand = 0.53, Cloop = 0.52 | +---------------------------------------------+ .......................................................................... Average length of predicted secondary structure segments: . +------------+----------+ . | predicted | observed | +-----------+------------+----------+ | Lhelix = | 10.3 | 9.3 | | Lstrand = | 5.0 | 5.3 | | Lloop = | 7.2 | 5.9 | +-----------+------------+----------+ .......................................................................... 
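The report quotes the Matthews correlation coefficients without giving
the formula. The sketch below assumes the standard one-state-versus-rest
definition, C = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
and applies it to the 3x3 accuracy matrix listed in this report. The
function name is an illustrative assumption.

```python
from math import sqrt

# Rows are observed states (H, E, L), columns are predicted states.
# Counts are copied from the accuracy matrix given in this report.
MATRIX = [
    [12447, 1255, 3990],   # observed helix
    [949, 7493, 3750],     # observed strand
    [2604, 2875, 19962],   # observed loop
]


def matthews(m, i):
    """Matthews correlation for state i, treated as 'state i' vs 'rest'.

    ASSUMPTION: standard binary definition; the report does not spell it out.
    """
    total = sum(sum(row) for row in m)
    tp = m[i][i]                          # observed i, predicted i
    fn = sum(m[i]) - tp                   # observed i, predicted other
    fp = sum(row[i] for row in m) - tp    # observed other, predicted i
    tn = total - tp - fn - fp             # observed other, predicted other
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


for idx, name in enumerate(("Chelix", "Cstrand", "Cloop")):
    print(f"{name} = {matthews(MATRIX, idx):.2f}")
```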
The accuracy matrix in detail: +---------------------------------------+ | number of residues with H, E, L | +---------+------+------+------+--------+ | |net H |net E |net L |sum obs | +---------+------+------+------+--------+ | obs H |12447 | 1255 | 3990 | 17692 | | obs E | 949 | 7493 | 3750 | 12192 | | obs L | 2604 | 2875 |19962 | 25441 | +---------+------+------+------+--------+ | sum Net |16000 |11623 |27702 | 55325 | +---------+------+------+------+--------+ Note: This table is to be read in the following manner: 12447 of all residues predicted to be in helix, were observed to be in helix, 949 however belong to observed strands, 2604 to observed loop regions. The term "observed" refers to the DSSP assignment of secondary structure calculated from 3D coordinates of experimentally determined structures (Dictionary of Secondary Structure of Proteins: Kabsch & Sander (1983) Biopolymers, 22, 2577-2637). Position-specific reliability index ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The network predicts the three secondary structure types using real numbers from the output units. The prediction is assigned by choosing the maximal unit ("winner takes all"). However, the real numbers contain additional information. E.g. the difference between the maximal and the second largest output unit can be used to derive a "reliability index". This index is given for each residue along with the prediction. The index is scaled to have values between 0 (lowest reliability), and 9 (highest). 
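A minimal sketch of the winner-takes-all assignment and of a
reliability index derived from the two largest output units. The report
only states that the index ranges from 0 to 9; the exact scaling used
here (integer part of ten times the difference, capped at 9) is an
assumption, and the function name is hypothetical.

```python
def predict_with_reliability(outputs):
    """Return (predicted state, reliability index) for one residue.

    outputs: dict mapping a state letter ("H", "E", "L") to the network
    output for that state, each assumed to lie in the interval [0, 1].
    """
    ranked = sorted(outputs.items(), key=lambda kv: kv[1], reverse=True)
    best_state, best = ranked[0]
    _, second = ranked[1]
    # ASSUMPTION: index = integer part of 10 * (best - second), capped at 9.
    # The report does not give the exact scaling.
    reliability = min(9, int(10 * (best - second)))
    return best_state, reliability


print(predict_with_reliability({"H": 0.05, "E": 0.90, "L": 0.05}))
print(predict_with_reliability({"H": 0.40, "E": 0.35, "L": 0.25}))
```

A large gap between the winning unit and the runner-up gives a high
index; two nearly equal outputs give an index near 0.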
The accuracies (Qtot) to be expected for residues with values above a particular value of the index are given below as well as the fraction of such residues (%res).: +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ | index| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | | %res |100.0| 99.2| 90.4| 80.9| 71.6| 62.5| 52.8| 42.3| 29.8| 14.1| +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ | | | | | | | | | | | | | Qtot | 72.1| 72.3| 74.8| 77.7| 80.3| 82.9| 85.7| 88.5| 91.1| 94.2| | | | | | | | | | | | | +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ | H%obs| 70.4| 70.6| 73.7| 77.1| 80.1| 83.1| 86.0| 89.3| 92.5| 96.4| | E%obs| 61.5| 61.7| 63.7| 66.6| 69.1| 71.7| 74.6| 77.0| 77.8| 68.1| | | | | | | | | | | | | | H%prd| 77.8| 78.0| 80.0| 82.6| 84.7| 86.9| 89.2| 91.3| 93.1| 95.4| | E%prd| 64.5| 64.7| 67.8| 71.0| 74.2| 77.6| 81.4| 85.1| 89.8| 93.5| +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ The above table gives the cumulative results, e.g. 62.5% of all residues have a reliability of at least 5. The overall three-state accuracy for this subset of almost two thirds of all residues is 82.9%. For this subset, e.g., 83.1% of the observed helices are correctly predicted, and 86.9% of all residues predicted to be in helix are correct. .......................................................................... The following table gives the non-cumulative quantities, i.e. the values per reliability index range. These numbers answer the question: how reliable is the prediction for all residues labeled with the particular index i. 
+------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| index|  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  |
| %res |  8.8|  9.5|  9.3|  9.1|  9.7| 10.5| 12.5| 15.7| 14.1|
+------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|      |     |     |     |     |     |     |     |     |     |
| Qtot | 46.6| 50.6| 57.7| 62.6| 67.9| 74.2| 82.2| 88.3| 94.2|
|      |     |     |     |     |     |     |     |     |     |
+------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| H%obs| 36.8| 42.3| 49.5| 55.2| 61.7| 69.9| 78.8| 87.4| 96.4|
| E%obs| 44.7| 44.5| 52.1| 55.4| 60.9| 68.0| 75.9| 81.0| 68.1|
|      |     |     |     |     |     |     |     |     |     |
| H%prd| 49.9| 52.5| 60.3| 64.2| 69.2| 77.5| 85.4| 89.9| 95.4|
| E%prd| 41.7| 47.1| 53.6| 57.0| 64.0| 71.6| 78.8| 88.8| 93.5|
+------+-----+-----+-----+-----+-----+-----+-----+-----+-----+

For example, for residues with a reliability index of 5, 64% of all
predicted beta-strand residues are correctly identified.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Prediction of solvent accessibility by PHDacc
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Definition of accessibility
~~~~~~~~~~~~~~~~~~~~~~~~~~

For training the residue solvent accessibility the DSSP (Dictionary of
Secondary Structure of Proteins; Kabsch & Sander (1983) Biopolymers, 22,
2577-2637) values of accessible surface area have been used. The
prediction provides values for the relative solvent accessibility. The
normalisation is the following:

|                          ACCESSIBILITY (from DSSP in Angstrom)
|RELATIVE_ACCESSIBILITY = ------------------------------------- * 100
|                          MAXIMAL_ACC (amino acid type i)

where MAXIMAL_ACC (i) is the maximal accessibility of amino acid type i.
The maximal values are:

+----+----+----+----+----+----+----+----+----+----+----+----+
|  A |  B |  C |  D |  E |  F |  G |  H |  I |  K |  L |  M |
| 106| 160| 135| 163| 194| 197|  84| 184| 169| 205| 164| 188|
+----+----+----+----+----+----+----+----+----+----+----+----+
|  N |  P |  Q |  R |  S |  T |  V |  W |  X |  Y |  Z |
| 157| 136| 198| 248| 130| 142| 142| 227| 180| 222| 196|
+----+----+----+----+----+----+----+----+----+----+----+

Notation: one letter code for amino acid, B stands for D or N; Z stands
for E or Q; and X stands for undetermined.

The relative solvent accessibility can be used to estimate the number of
water molecules (W) in contact with the residue:

W = ACCESSIBILITY / 10

The prediction is given in 10 states for relative accessibility, with

RELATIVE_ACCESSIBILITY = (PREDICTED_ACC * PREDICTED_ACC)

where PREDICTED_ACC = 0 - 9.

Estimated Accuracy of Prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A careful cross validation test on some 238 protein chains (in total
about 62,000 residues) with less than 25% pairwise sequence identity
gave the following results:

Correlation
...........

The correlation between observed and predicted solvent accessibility is:

-----------
corr = 0.53
-----------

This value ought to be compared to the worst and best case prediction
scenario: random prediction (corr = 0.0) and homology modelling
(corr = 0.66). (Note: homology modelling yields a relatively accurate
prediction in 3D if, and only if, a significantly identical sequence has
a known 3D structure.)

3-state accuracy
................

Often the relative accessibility is projected onto, e.g., 3 states:

b = buried       (here defined as < 9% relative accessibility),
i = intermediate ( 9% <= rel. acc. < 36% ),
e = exposed      ( rel. acc. >= 36% ).

A projection onto 3 states or 2 states (buried/exposed) enables the
compilation of a 3- and 2-state prediction accuracy. PHD reaches an
overall 3-state accuracy of:

Q3 = 57.5%

(compared to 35% for random prediction and 70% for homology modelling).
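The accessibility conversions described above can be sketched in Python
as follows. The maximal values, the n*n rule and the 3-state thresholds
are taken from this report; the function names are illustrative
assumptions and are not part of PHD.

```python
# Maximal accessibility per amino acid type, copied from the report.
MAXIMAL_ACC = {
    "A": 106, "B": 160, "C": 135, "D": 163, "E": 194, "F": 197,
    "G": 84, "H": 184, "I": 169, "K": 205, "L": 164, "M": 188,
    "N": 157, "P": 136, "Q": 198, "R": 248, "S": 130, "T": 142,
    "V": 142, "W": 227, "X": 180, "Y": 222, "Z": 196,
}


def relative_accessibility(acc, residue):
    """Relative accessibility in % from a DSSP accessibility value."""
    return 100.0 * acc / MAXIMAL_ACC[residue]


def water_contacts(acc):
    """Estimated number of water molecules, W = ACCESSIBILITY / 10."""
    return acc / 10.0


def state_to_relative(predicted_acc):
    """Map a predicted 10-state value n (0-9) to n*n % relative acc."""
    if not 0 <= predicted_acc <= 9:
        raise ValueError("PREDICTED_ACC must be between 0 and 9")
    return predicted_acc * predicted_acc


def three_state(rel_acc):
    """Project relative accessibility (%) onto b / i / e."""
    if rel_acc < 9:
        return "b"
    if rel_acc < 36:
        return "i"
    return "e"


# First ten 10-state predictions for PNS1, as printed in this report.
digits = "9807000007"
print("".join(three_state(state_to_relative(int(d))) for d in digits))
```

With these thresholds the 10-state values 0-2 map to buried, 3-5 to
intermediate and 6-9 to exposed.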
The 3-state accuracy in detail:

+-----------------------------------+-------------------------+
| Qburied       (% of observed)=77% | Qb (% of predicted)=60% |
| Qintermediate (% of observed)= 9% | Qi (% of predicted)=44% |
| Qexposed      (% of observed)=78% | Qe (% of predicted)=56% |
+-----------------------------------+-------------------------+

10-state accuracy
.................

The network predicts relative solvent accessibility in 10 states, with
state i (i = 0-9) corresponding to a relative solvent accessibility of
i*i %. The 10-state accuracy of the network is:

Q10 = 24.5%

..........................................................................

These percentages are defined by:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

|                       number of correctly predicted residues
|Q3                  = --------------------------------------- (*100)
|                       number of all residues
|
|                       no of res. correctly predicted to be buried
|Qburied (% of obs)  = ------------------------------------------- (*100)
|                       no of all res. observed to be buried
|
|                       no of res. correctly predicted to be buried
|Qburied (% of pred) = ------------------------------------------- (*100)
|                       no of all residues predicted to be buried

..........................................................................

Averaging over single chains
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The most reasonable way to compute the overall accuracies is the above
quoted percentage of correctly predicted residues. However, since the
user is mainly interested in the expected performance of the prediction
for a particular protein, the mean value when averaging over protein
chains might be of help as well.
Computing first the correlation between observed and predicted
accessibility for each protein chain, and then averaging over all 238
chains yields the following average:

+-------------------------------====--+
| corr/averaged over chains =   0.53  |
+-------------------------------====--+
| standard deviation        =   0.11  |
+-------------------------------------+

..........................................................................

Further details of performance accuracy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The accuracy matrix in detail:
..............................

-------+-------------------------------------------------------------+-------------
 \ PHD |     0     1     2     3     4     5     6     7     8     9 |   SUM  %obs
-------+-------------------------------------------------------------+-------------
 OBS 0 |  8611   140     8    44    82   169   772   334    27     0 | 10187  16.6
 OBS 1 |  4367   164     0    50   106   231   738   346    44     3 |  6049   9.8
 OBS 2 |  3194   168     1    68   125   303   951   513    42     7 |  5372   8.7
 OBS 3 |  2760   159     8    80   136   327  1246   746    58    19 |  5539   9.0
 OBS 4 |  2312   144     2    72   166   396  1615  1245   124    19 |  6095   9.9
 OBS 5 |  1873    96     3    84   138   425  1979  1834   187    27 |  6646  10.8
 OBS 6 |  1387    67     1    60    80   278  2237  2627   231    51 |  7019  11.4
 OBS 7 |  1082    35     0    32    56   225  1871  3107   302    60 |  6770  11.0
 OBS 8 |   660    25     0    27    43   136  1206  2374   325    87 |  4883   7.9
 OBS 9 |   325    20     2    27    29    74   648  1159   366   214 |  2864   4.7
-------+-------------------------------------------------------------+-------------
 SUM   | 26571  1018    25   544   961  2564 13263 14285  1706   487 |
 %pred |  43.3   1.7   0.0   0.9   1.6   4.2  21.6  23.3   2.8   0.8 |
-------+-------------------------------------------------------------+-------------

Note: This table is to be read in the following manner: 8611 of all
residues predicted to have 0% relative accessibility were observed with
0% relative accessibility. However, 325 of all residues predicted to
have 0% are observed as completely exposed (obs = 9 -> rel. acc. >= 81%).
The term "observed" refers to the DSSP compilation of area of solvent accessibility calculated from 3D coordinates of experimentally determined structures (Diction- ary of Secondary Structure of Proteins: Kabsch & Sander (1983) Biopolymers, 22, 2577-2637). Accuracy for each amino acid: ............................. +---+------------------------------+-----+-------+------+ |AA | Q3 b%o b%p i%o i%p e%o e%p | Q10 | corr | N | +---+------------------------------+-----+-------+------+ | A | 59.0 87 60 2 38 66 57 | 31 | 0.530 | 5054 | | C | 62.0 91 67 5 39 25 21 | 34 | 0.244 | 893 | | D | 56.5 21 45 6 49 94 57 | 20 | 0.321 | 3536 | | E | 60.8 9 40 3 41 98 61 | 21 | 0.347 | 3743 | | F | 63.3 94 67 9 46 29 37 | 27 | 0.366 | 2436 | | G | 52.1 75 51 1 31 67 53 | 22 | 0.405 | 4787 | | H | 50.9 63 53 23 45 71 50 | 18 | 0.442 | 1366 | | I | 64.9 95 68 6 41 30 38 | 34 | 0.360 | 3437 | | K | 66.6 2 11 2 37 98 67 | 23 | 0.267 | 3652 | | L | 61.6 93 65 8 44 31 40 | 31 | 0.368 | 5016 | | M | 60.1 92 64 5 39 45 44 | 29 | 0.452 | 1371 | | N | 55.5 45 45 8 38 87 59 | 17 | 0.410 | 2923 | | P | 53.0 48 48 9 39 83 56 | 18 | 0.364 | 2920 | | Q | 54.3 27 44 7 44 92 56 | 20 | 0.344 | 2225 | | R | 49.9 15 47 36 47 76 51 | 18 | 0.372 | 2765 | | S | 55.6 69 53 3 51 81 56 | 22 | 0.464 | 3981 | | T | 51.8 61 51 8 38 78 53 | 21 | 0.432 | 3740 | | V | 61.1 93 65 5 40 39 42 | 34 | 0.418 | 4156 | | W | 56.2 85 62 20 49 29 27 | 21 | 0.318 | 891 | | Y | 49.7 73 52 33 49 36 38 | 19 | 0.359 | 2301 | +---+------------------------------+-----+-------+------+ Abbreviations: AA: amino acid in one-letter code b%o, i%o, e%o: = Qburied, Qintermediate, Qexposed (% of observed), i.e. percentage of correct prediction in each state, see above b%p, i%p, e%p: = Qburied, Qintermediate, Qexposed (% of predicted), i.e. 
probability of correct prediction in each state, see above b%o: = Qburied (% of observed), see above Q10: percentage of correctly predicted residues in each of the 10 states of predicted relative accessibility. corr: correlation between predicted and observed rel. acc. N: number of residues in data set Accuracy for different secondary structure: ........................................... +--------+------------------------------+----+-------+-------+ | type | Q3 b%o b%p i%o i%p e%o e%p |Q10 | corr | N | +--------+------------------------------+----+-------+-------+ | helix | 59.5 79 64 8 44 80 56 | 27 | 0.574 | 20100 | | strand | 61.3 84 73 9 46 69 37 | 35 | 0.524 | 13356 | | loop | 54.4 64 43 11 44 78 61 | 18 | 0.442 | 27968 | +--------+------------------------------+----+-------+-------+ Abbreviations as before. Position-specific reliability index ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The network predicts the 10 states for relative accessibility using real numbers from the output units. The prediction is assigned by choosing the maximal unit ("winner takes all"). However, the real numbers contain additional information. E.g. the difference between the maximal and the second largest output unit (with the constraint that the second largest output is compiled among all units at least 2 positions off the maximal unit) can be used to derive a "reliability index". This index is given for each residue along with the prediction. The index is scaled to have values between 0 (lowest reliability), and 9 (highest). The accuracies (Q3, corr, asf.) 
to be expected for residues with values above a particular value of the index are given below as well as the fraction of such residues (%res).: +---+------------------------------+----+-------+-------+ |RI | Q3 b%o b%p i%o i%p e%o e%p |Q10 | corr | %res | +---+------------------------------+----+-------+-------+ | 0 | 57.5 77 60 9 44 78 56 | 24 | 0.535 | 100.0 | | 1 | 59.1 76 63 9 45 82 57 | 25 | 0.560 | 91.2 | | 2 | 61.7 79 66 4 47 87 58 | 27 | 0.594 | 77.1 | | 3 | 66.6 87 70 1 51 89 63 | 30 | 0.650 | 57.1 | | 4 | 70.0 89 72 0 83 91 67 | 32 | 0.686 | 45.8 | | 5 | 72.9 92 75 0 0 93 70 | 34 | 0.722 | 35.6 | | 6 | 76.3 95 77 0 0 93 75 | 36 | 0.769 | 24.7 | | 7 | 79.0 97 79 0 0 93 78 | 39 | 0.803 | 16.0 | | 8 | 80.9 98 80 0 0 91 81 | 43 | 0.824 | 9.6 | | 9 | 81.2 99 80 0 0 88 83 | 45 | 0.828 | 5.9 | +---+------------------------------+----+-------+-------+ Abbreviations as before. The above table gives the cumulative results, e.g. 45.8% of all residues have a reliability of at least 4. The correlation for this most reliably predicted half of the residues is 0.686, i.e. a value comparable to what could be expected if homology modelling were possible. For this subset of 45.8% of all residues, 89% of the buried residues are correctly predicted, and 72% of all residues predicted to be buried are correct. .......................................................................... The following table gives the non-cumulative quantities, i.e. the values per reliability index range. These numbers answer the question: how reliable is the prediction for all residues labeled with the particular index i. 
+---+------------------------------+----+-------+-------+ |RI | Q3 b%o b%p i%o i%p e%o e%p |Q10 | corr | %res | +---+------------------------------+----+-------+-------+ | 0 | 40.9 79 40 16 41 21 40 | 14 | 0.175 | 8.8 | | 1 | 45.4 61 46 28 44 48 44 | 17 | 0.278 | 14.1 | | 2 | 47.4 53 52 10 46 80 44 | 19 | 0.343 | 19.9 | | 3 | 52.9 75 59 4 50 77 47 | 23 | 0.439 | 11.4 | | 4 | 60.0 81 63 0 83 84 56 | 25 | 0.547 | 10.1 | | 5 | 65.2 82 70 0 0 93 62 | 28 | 0.607 | 10.9 | | 6 | 71.3 90 72 0 0 94 70 | 31 | 0.692 | 8.8 | | 7 | 76.0 94 76 0 0 95 75 | 34 | 0.762 | 6.3 | | 8 | 80.5 97 81 0 0 94 79 | 39 | 0.808 | 3.8 | | 9 | 81.2 99 80 0 0 88 83 | 45 | 0.828 | 5.9 | +---+------------------------------+----+-------+-------+ For example, for residues with RI = 4 83% of all predicted intermediate residues are correctly predicted as such. The resulting network (PHD) prediction is: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ________________________________________________________________________________ PredictProtein@EMBL-Heidelberg.DE PHD: Profile fed neural network systems from HeiDelberg ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Prediction of: - secondary structure, by PHDsec - solvent accessibility, by PHDacc - and helical transmembrane regions, by PHDhtm Author: Burkhard Rost EMBL, Heidelberg, FRG Meyerhofstrasse 1, 69 117 Heidelberg Internet: Predict-Help@EMBL-Heidelberg.DE All rights reserved. The network systems are described in: PHDsec: B Rost & C Sander, JMB, 1993, 232, 584-599. B Rost & C Sander, Proteins, 1994, 19, 55-72. PHDacc: B Rost & C Sander, Proteins, 1994, 20, 216-226. PHDhtm: B Rost et al., Prot. Science, 4, 521-533. 
Some statistics ~~~~~~~~~~~~~~ Percentage of amino acids: +--------------+--------+--------+--------+--------+--------+ | AA: | V | G | E | I | A | | % of AA: | 14.3 | 10.7 | 9.5 | 8.3 | 8.3 | +--------------+--------+--------+--------+--------+--------+ | AA: | R | Q | K | T | L | | % of AA: | 7.1 | 7.1 | 7.1 | 4.8 | 4.8 | +--------------+--------+--------+--------+--------+--------+ | AA: | D | S | Y | P | F | | % of AA: | 4.8 | 3.6 | 2.4 | 2.4 | 2.4 | +--------------+--------+--------+--------+--------+--------+ | AA: | M | H | | % of AA: | 1.2 | 1.2 | +--------------+--------+--------+ Percentage of secondary structure predicted: +--------------+--------+--------+--------+ | SecStr: | H | E | L | | % Predicted: | 0.0 | 59.5 | 40.5 | +--------------+--------+--------+--------+ According to the following classes: all-alpha: %H>45 and %E< 5; all-beta : %H<5 and %E>45 alpha-beta : %H>30 and %E>20; mixed: rest, this means that the predicted class is: all-beta PHD output for your protein ~~~~~~~~~~~~~~~~~~~~~~~~~~ Fri May 24 11:44:47 1996 Jury on: 10 different architectures (version 5.94_317 ). Note: differently trained architectures, i.e., different versions can result in different predictions. 
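The class assignment rule quoted in the statistics above can be written
as a short Python function. The thresholds are taken from this report;
the function name is an illustrative assumption.

```python
def structural_class(pct_helix, pct_strand):
    """Assign the structural class from predicted %H and %E.

    Thresholds as quoted in the report:
      all-alpha : %H > 45 and %E < 5
      all-beta  : %H < 5  and %E > 45
      alpha-beta: %H > 30 and %E > 20
      mixed     : everything else
    """
    if pct_helix > 45 and pct_strand < 5:
        return "all-alpha"
    if pct_helix < 5 and pct_strand > 45:
        return "all-beta"
    if pct_helix > 30 and pct_strand > 20:
        return "alpha-beta"
    return "mixed"


# Values predicted for PNS1 in this report: H = 0.0 %, E = 59.5 %.
print(structural_class(0.0, 59.5))
```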
About the protein
~~~~~~~~~~~~~~~~

HEADER
COMPND
SOURCE
AUTHOR
SEQLENGTH 84
NCHAIN 1 chain(s) in PNS1 data set
NALIGN 23 (=number of aligned sequences in HSSP file)

Abbreviations: PHDsec
~~~~~~~~~~~~~~~~~~~~

sequence:
  AA : amino acid sequence
secondary structure:
  HEL: H=helix, E=extended (sheet), blank=other (loop)
  PHD: Profile network prediction HeiDelberg
  Rel: Reliability index of prediction (0-9)
detail:
  prH: 'probability' for assigning helix
  prE: 'probability' for assigning strand
  prL: 'probability' for assigning loop
  note: the 'probabilities' are scaled to the interval 0-9, e.g.,
        prH=5 means that the first output node is 0.5-0.6
subset:
  SUB: a subset of the prediction, for all residues with an expected
       average accuracy > 82% (tables in header)
  note: for this subset the following symbols are used:
    L: is loop (for which above " " is used)
    ".": means that no prediction is made for this residue, as the
         reliability is: Rel < 5

Abbreviations: PHDacc
~~~~~~~~~~~~~~~~~~~~

solvent accessibility:
  3st: relative solvent accessibility (acc) in 3 states:
       b = 0-9%, i = 9-36%, e = 36-100%.
  PHD: Profile network prediction HeiDelberg
  Rel: Reliability index of prediction (0-9)
  P_3: predicted relative accessibility in 3 states
  note: for convenience a blank is used for intermediate (i).
  10st: relative accessibility in 10 states: = n corresponds to a relative acc.
of n*n % subset: SUB: a subset of the prediction, for all residues with an expected average correlation > 0.69 (tables in header) note: for this subset the following symbols are used: "I": is intermediate (for which above " " is used) ".": means that no prediction is made for this residue, as the reliability is: Rel < 4 protein: PNS1 length 84 ....,....1....,....2....,....3....,....4....,....5....,....6 AA |AEIEVGRVYTGKVTRIVDFGAFVAIGGGKEGLVHISQIADKRVEKVTDYLQMGQEVPVKV| PHD sec | EEEEEEEEEEEEEEE EEEEEEE EEEEEEEEE EEE EEEEEE| Rel sec |963754676667686531317999714640589998522344346320143384599999| detail: prH sec |000000000000000000000000000000000000001222101233311000000000| prE sec |026776787777787654357998843134689998653211321112455312789999| prL sec |973122212221111235541000146764200000235565566553222586200000| subset: SUB sec |LL.EE.EEEEEEEEEE....EEEEE..L..EEEEEEE.......L.......L.EEEEEE| ACCESSIBILITY 3st: P_3 acc |eebebbbbbebebebbbebbbbbbbbebbebbbbbbebbeeebeebeebbebeeebebeb| 10st: PHD acc |980700000706060006000000007017000000600776077077007077706060| Rel acc |841442013411420601226670503005454284132451243224443023452727| subset: SUB acc |ee.eb....e..b..b....bbb.b....ebbb.bb...ee..e...ebb....eb.b.b| ....,....7....,....8....,....9....,....10...,....11...,....12 AA |LEVDRQGRIRLSIKEATEQSQPAA| PHD sec |EEE EEEEEEE | Rel sec |873477269998633578757889| detail: prH sec |000000000000000000111000| prE sec |876211478998753211110000| prL sec |013688510001135688767889| subset: SUB sec |EE..LL.EEEEEE..LLLLLLLLL| ACCESSIBILITY 3st: P_3 acc |bebeeeeebbbbbeebeeeeeeee| 10st: PHD acc |060677660000076077777799| Rel acc |412135024042351145444349| subset: SUB acc |b....e..b.b..e..eeeee.ee| ________________________________________________________________________________ ----------------------------------------------------------------------------- --- PredictProtein: NEWS from March, 1996 --- --- --- --- PredictProtein is available interactively via WWW: --- --- 
http://www.embl-heidelberg.de/predictprotein/predictprotein.html --- --- --- --- The error rate in falsely predicting transmembrane helices for --- --- globular proteins has been reduced by a new program to below 2%. --- --- --- --- The following option is now available upon request: --- --- --- --- 1. "predict htm topology" --- --- Usage: add the words "predict htm topology" in any line before --- --- the one beginning with a hash (#), i.e. the line with --- --- the sequence name. --- --- Result: a refined prediction of transmembrane helices and top- --- --- ology (PHDtopology) is returned. --- --- --- -----------------------------------------------------------------------------
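As a note on reading the PHDsec output earlier in this report: the
"SUB sec" line can be reproduced from the "prH/prE/prL" and "Rel sec"
lines using the rules given in the abbreviations (winner takes all, and
"." wherever Rel < 5). The Python sketch below does this for the first
ten residues of PNS1. The function names are illustrative assumptions.
Because the printed digits are truncated network outputs, ties between
digits are possible and are resolved arbitrarily here.

```python
def winner(pr_h, pr_e, pr_l):
    """Winner-takes-all over the three scaled 'probability' digits.

    ASSUMPTION: on a tie the first state in the order H, E, L is returned;
    the real program decides on the unrounded network outputs.
    """
    scores = {"H": int(pr_h), "E": int(pr_e), "L": int(pr_l)}
    return max(scores, key=scores.get)


def subset_line(states, rel, threshold=5):
    """Keep the state where Rel >= threshold, otherwise print '.'."""
    return "".join(s if int(r) >= threshold else "."
                   for s, r in zip(states, rel))


# First ten columns of the PNS1 output, copied from this report.
PR_H = "0000000000"
PR_E = "0267767877"
PR_L = "9731222122"
REL = "9637546766"

states = "".join(winner(h, e, l) for h, e, l in zip(PR_H, PR_E, PR_L))
print(states)
print(subset_line(states, REL))
```

The second printed line corresponds to the first ten characters of the
"SUB sec" line in the report.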