CONTACTS PREDICTION EVALUATION
FOR CAFASP-3
Evaluation Criteria
The distance between two residues will be calculated as the distance in Å
between their Cβ carbons (Cα for Gly).
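As an illustration of this distance definition, here is a minimal Python sketch. The residue representation (a dict with `name` and `atoms` keys) and the function name `residue_distance` are assumptions made for the example, not part of the CAFASP specification.

```python
import math

def residue_distance(res_a, res_b):
    """Distance in Angstroms between two residues.

    Each residue is a hypothetical dict:
      {"name": three-letter code, "atoms": {atom name: (x, y, z)}}
    """
    def reference_atom(res):
        # Glycine has no C-beta, so its C-alpha is used instead.
        atom_name = "CA" if res["name"] == "GLY" else "CB"
        return res["atoms"][atom_name]

    return math.dist(reference_atom(res_a), reference_atom(res_b))

ala = {"name": "ALA", "atoms": {"CA": (0.0, 0.0, 0.0), "CB": (1.0, 0.0, 0.0)}}
gly = {"name": "GLY", "atoms": {"CA": (4.0, 4.0, 0.0)}}
print(residue_distance(ala, gly))
```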
For a given set of predicted contacts, five main parameters will be calculated.
- Accuracy (Acc).
The ratio of the number of correctly predicted contacts to the total number
of predicted contacts.
Acc = nt / n
where nt is the number of correctly predicted contacts and n
is the total number of predicted contacts.
- Improvement over random (Imp).
The ratio of the accuracy to the accuracy of a random prediction
(equivalent to predicting all the pairs in the protein as contacting).
Imp = Acc / (C/N)
where N is the total number of residue pairs in the protein, excluding
those close in the sequence (see below), and C is the observed
number of contacts among those N pairs.
- Distance distribution of the predicted contacts (Xd).
The weighted harmonic average difference between the distance distribution
of the predicted contacts and the distance distribution of all pairs.
Xd = SUM {i=1,15} ((Pip - Pia) / (di * 15))
where the sum runs over all the distance bins. There are 15 distance bins
covering the range from 0 to 60 Å. di is the distance representing
bin i, taken as its upper limit (normalised to 60). Pip is the percentage
of predicted pairs whose distance falls in bin i, and Pia
is the same percentage for all the pairs. Defined in this way, Xd>0
indicates the favourable case where the distribution of predicted contact
distances is shifted towards shorter distances
(see J. Mol. Biol. (1997), 271:511-523).
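The Xd formula can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `xd` is hypothetical, percentages are taken on a 0 to 100 scale, both input lists are non-empty, and distances at or beyond 60 Å are placed in the last bin (the text does not say how they are handled).

```python
def xd(pred_distances, all_distances, n_bins=15, max_dist=60.0):
    """Xd = SUM_i (Pip - Pia) / (di * n_bins) over equal-width distance bins."""
    width = max_dist / n_bins  # 4 Angstroms per bin for 15 bins over 0-60

    def percentages(distances):
        counts = [0] * n_bins
        for d in distances:
            # Assumption: distances >= max_dist are counted in the last bin.
            counts[min(int(d // width), n_bins - 1)] += 1
        return [100.0 * c / len(distances) for c in counts]

    pip = percentages(pred_distances)  # predicted pairs
    pia = percentages(all_distances)   # all pairs
    total = 0.0
    for i in range(n_bins):
        d_i = (i + 1) * width / max_dist  # upper bin limit, normalised to 60
        total += (pip[i] - pia[i]) / (d_i * n_bins)
    return total
```

With this weighting, differences in the short-distance bins count most, so a prediction concentrated at short distances yields a positive Xd.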
For the calculation of these three parameters, both the predicted pairs
of residues and all the pairs in the protein are split into three sets according
to the separation of the two residues of the pair in the linear sequence
of the protein (the number of residues between them): seqsep>=6, seqsep>=12
and seqsep>=24. Acc, Imp and Xd are evaluated for each of these three sets.
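A minimal Python sketch of Acc and Imp for one sequence-separation set follows. The function name and the data layout (sets of `(i, j)` tuples with `i < j`, residues numbered from 1) are assumptions for illustration. The sketch also assumes that seqsep is the difference of the residue numbers, `j - i`; the phrase "number of residues between them" could instead be read as `j - i - 1`.

```python
from itertools import combinations

def accuracy_and_improvement(predicted, true_contacts, length, min_sep):
    """Return (Acc, Imp) restricted to pairs with seqsep >= min_sep.

    predicted, true_contacts: sets of (i, j) tuples with i < j.
    length: number of residues in the protein.
    """
    def far_enough(pair):
        # Assumption: seqsep is the residue-number difference j - i.
        return pair[1] - pair[0] >= min_sep

    pred = {p for p in predicted if far_enough(p)}
    n = len(pred)                    # predicted contacts in this set
    nt = len(pred & true_contacts)   # correctly predicted contacts
    # N: all residue pairs in this set; C: observed contacts among them.
    N = sum(1 for p in combinations(range(1, length + 1), 2) if far_enough(p))
    C = sum(1 for p in true_contacts if far_enough(p))

    acc = nt / n if n else 0.0
    imp = acc / (C / N) if C else float("nan")
    return acc, imp

# Evaluate the three sets used in the text.
for sep in (6, 12, 24):
    print(sep, accuracy_and_improvement({(1, 7), (1, 30)}, {(1, 7)}, 40, sep))
```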
- Delta evaluation.
Percentage of correctly predicted contacts within |d| residues, measured
along the sequence, of the experimental contact (see Ortiz et al., 1999,
Proteins 3:177-185). This will be done for |d| = 0, 1, 2, 3, 4 and 5. This
means that a predicted contact between two residues i and j is considered
correct if there is any real contact between any residue in the range
[i-d, i+d] and any residue in the range [j-d, j+d]. A contact evaluation with
d=0 is equivalent to a standard contact evaluation. Accuracy and
improvement over random will be evaluated here as well.
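The delta criterion can be sketched as below. This is an illustrative Python version, not the official evaluation code: the function name is hypothetical, contacts are assumed to be stored as `(i, j)` tuples, and the result is returned as a fraction rather than a percentage.

```python
def delta_accuracy(predicted, true_contacts, delta):
    """Fraction of predicted pairs that are correct within +/- delta residues."""
    def is_correct(i, j):
        # Correct if any real contact links [i-delta, i+delta] and [j-delta, j+delta].
        return any(
            (a, b) in true_contacts or (b, a) in true_contacts
            for a in range(i - delta, i + delta + 1)
            for b in range(j - delta, j + delta + 1)
        )

    if not predicted:
        return 0.0
    return sum(is_correct(i, j) for i, j in predicted) / len(predicted)

true_contacts = {(3, 20)}
predicted = [(5, 20), (10, 30)]
for d in range(6):
    print(d, delta_accuracy(predicted, true_contacts, d))
```

With `delta=0` the window collapses to the pair itself, which reproduces the standard contact evaluation.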
Predictors can either submit a list of residue pairs predicted
to be in contact, or send all the pairs in the protein with an associated
score for each pair (see file format). In the first case, the coverage
of the prediction is also calculated as the ratio of the number
of predicted pairs to the total number of possible pairs. In the second
case, the list will be sorted by score and evaluations will be made taking
different numbers of top pairs as a function of the protein length: the first
2L, L, L/2, L/5 and L/10 pairs will be taken (L: length of the protein). All
these calculations are performed for the three subsets of pairs explained above
(seqsep>=6, 12 and 24).
Targets for Contact Prediction
will be split into different sets according to their sequence length.
The fundamental parameter for the evaluation will be Xd at high sequence
separation (seqsep>=24).
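The selection of top-scoring pairs can be sketched as follows. The function name and tuple layout are hypothetical. The sketch assumes that a higher score means a more confident prediction and that fractional cutoffs such as L/5 are rounded down; the text does not specify either point.

```python
def top_pair_sets(scored_pairs, length):
    """Split a scored pair list into the top 2L, L, L/2, L/5 and L/10 pairs.

    scored_pairs: iterable of (i, j, score) tuples.
    length: protein length L.
    """
    # Assumption: higher score = stronger prediction, so sort descending.
    ranked = sorted(scored_pairs, key=lambda p: p[2], reverse=True)
    cutoffs = {
        "2L": 2 * length,
        "L": length,
        "L/2": length // 2,    # assumption: round down
        "L/5": length // 5,
        "L/10": length // 10,
    }
    return {label: ranked[:k] for label, k in cutoffs.items()}

pairs = [(i, i + 6, 1.0 / (i + 1)) for i in range(1, 26)]
for label, subset in top_pair_sets(pairs, 10).items():
    print(label, len(subset))
```

Each subset could then be passed to the Acc, Imp and Xd calculations for the three sequence-separation sets.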
File format for submissions
There is a single file format, which must be PDB-like. The
AUTHOR field must contain the name of the predictor. The
AUTHID field must contain the predictor's registration code. The
TARGID field must contain the target ID, and the
PREDN field must contain the prediction number of that author for
that target.
Each predictor can send more than one prediction for a given target using
different files with different PREDN numbers. If two files contain the same
values for AUTHID, TARGID and PREDN, the second file will overwrite the
first one.
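The overwrite rule amounts to keying each submission on the (AUTHID, TARGID, PREDN) triple. A minimal sketch, in which the function name, the `header` dict and the example values are all hypothetical:

```python
def store_submission(store, header, payload):
    """Keep one submission per (AUTHID, TARGID, PREDN); later files replace earlier ones."""
    key = (header["AUTHID"], header["TARGID"], header["PREDN"])
    store[key] = payload  # a second file with the same key overwrites the first
    return key

store = {}
store_submission(store, {"AUTHID": "A1", "TARGID": "T1", "PREDN": 1}, "first file")
store_submission(store, {"AUTHID": "A1", "TARGID": "T1", "PREDN": 1}, "second file")
store_submission(store, {"AUTHID": "A1", "TARGID": "T1", "PREDN": 2}, "other prediction")
print(store)
```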
In the SEQRES field of the PDB file there must be the sequence in
one-letter code.
After this, there are two possibilities:
- A field labeled CONTC must contain the contact predictions as follows:
the sequence numbers of the two residues (numbered as in the CASP template
PDB file, with the lower sequence number first), their respective amino acid
types and, optionally, a score proportional to the strength/confidence of
that prediction. An example of this line in C/Perl format:
So the required fields in the file are: AUTHOR, AUTHID, TARGID, PREDN,
SEQRES, CONTC and/or ATOM (if both CONTC and ATOM are submitted, only the
CONTC field will be taken into account).
Example: