HS3D

Database Description

In recent years many computational tools for gene identification and characterization [1,2,3,4,5,6,7,8 and many others], mostly based on machine learning approaches, have been used. In the machine learning approach, a learning algorithm receives a set of training examples, each labelled as belonging to a particular class. The algorithm's goal is to produce a classification rule that correctly assigns new examples to these classes. The success of these methods depends largely on the quality of the data sets used for training [9]. Furthermore, a common data set is necessary when the prediction accuracy of different programs needs to be compared [10,11].

The Irvine Primate Splice Junctions Dataset (UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html) is a de facto standard in the machine learning community [12,13,14,15 and many others], but it is now badly out of date and does not include enough material for the needs of most learning algorithms. A recent, EST-confirmed data set [16] has the same limitation in the extent of its data. More recently Burset et al. [17] developed an extensive database, but its data do not include false splice sites (negative examples) and, specifically, proximal false splice sites. The latter are a well-known critical point for classification systems [11].

We developed a new database (HS3D - Homo Sapiens Splice Site Dataset) of Homo sapiens exon, intron and splice regions. The aim of this data set is to provide standardized material for training computational approaches to gene identification and characterization and for assessing their prediction accuracy. From the complete GenBank Primate Sequences Rel. 123 (8436 entries), 697 entries of human nuclear DNA including a gene with complete CDS and with more than one exon have been selected according to assessed selection criteria [18] (file genbank_filtered.inf).
From these entries, 4450 exons and 3752 introns have been extracted (files exons.seq and introns.seq). Several statistics for these exons and introns are reported (files exons.stat and introns.stat): overall nucleotides, average GC content, number of exons/introns including bases other than A, G, C, T, number of exons/introns in which the annotated end is not found, minimum, maximum and average exon/intron length, standard deviation of the exon/intron length, number of introns whose sequence does not start with GT, and number of introns whose sequence does not end with AG.

Then 3762 + 3762 donor and acceptor sites have been extracted as windows of 140 nucleotides around each splice site. After discarding sequences that do not include canonical GT-AG junctions (176 + 191), that include insufficient data (not enough material for a 140-nucleotide window) (590 + 547), or that include bases other than A, G, C, T (30 + 32), there are 2955 + 2992 windows (files GT_true.seq and AG_true.seq). Information and several statistics about the splice site extraction are reported (files GT_true.inf, AG_true.inf, GT_true.stat, and AG_true.stat).

Finally, there are 287,296 + 348,370 windows of false splice sites, selected by searching for canonical GT-AG pairs in non-splicing positions (files GT_false.seq and AG_false.seq; related information in GT_false.inf and AG_false.inf). The false sites within +/- 60 nucleotides of a true splice site are marked as proximal.

HS3D is available at the Web server of the University of Sannio: http://www.sci.unisannio.it/docenti/rampone/
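The window extraction and the search for false sites described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build HS3D. The function names, the placement of the dinucleotide inside the window (70 nucleotides upstream of it) and the inclusive reading of the +/- 60 range are assumptions made for illustration only; the window length of 140, the rejection of bases other than A, C, G, T and the proximal range come from the description.

```python
WINDOW = 140    # window length stated in the description
FLANK = 70      # assumption: nucleotides kept upstream of the dinucleotide
PROXIMAL = 60   # +/- range around a true site stated in the description


def extract_window(seq, pos):
    """Return the 140-nt window whose dinucleotide starts at `pos`, or None.

    None is returned when there is not enough sequence on either side
    or when the window contains a base other than A, C, G, T.
    """
    start = pos - FLANK
    end = start + WINDOW
    if start < 0 or end > len(seq):
        return None                      # insufficient data for a full window
    window = seq[start:end].upper()
    if set(window) - set("ACGT"):
        return None                      # ambiguous or unknown base present
    return window


def false_sites(seq, dinucleotide, true_positions):
    """Yield (pos, window, is_proximal) for each non-splicing occurrence.

    `dinucleotide` is "GT" for donor-like sites or "AG" for acceptor-like
    sites; `true_positions` holds the 0-based starts of the annotated sites.
    """
    seq = seq.upper()
    true_set = set(true_positions)
    pos = seq.find(dinucleotide)
    while pos != -1:
        if pos not in true_set:
            window = extract_window(seq, pos)
            if window is not None:
                # assumption: the +/- 60 range is inclusive
                proximal = any(abs(pos - t) <= PROXIMAL for t in true_set)
                yield pos, window, proximal
        pos = seq.find(dinucleotide, pos + 1)


# Toy sequence: a true donor site at position 100, a second GT at 132.
toy = "A" * 100 + "GT" + "C" * 30 + "GT" + "A" * 100
for pos, window, proximal in false_sites(toy, "GT", [100]):
    print(pos, len(window), proximal)    # prints: 132 140 True
```

In this toy example the second GT lies 32 nucleotides from the true site, so it is reported as a proximal false site. The same functions apply to acceptor sites by passing "AG".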

REFERENCES

[1] S. Brunak, J. Engelbrecht, and S. Knudsen (1991) Prediction of the human mRNA donor and acceptor sites from the DNA sequence. J. Mol. Biol., 220, 49-65.
[2] V.V. Solovyev, A.A. Salamov, and C.B. Lawrence (1994) Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Research, 22, 5156-5163.
[3] J. Henderson, S. Salzberg, and K.H. Fasman (1997) Finding Genes in DNA with a Hidden Markov Model. J. Comput. Biol., 4(2), 127-141.
[4] N. Friedman, D. Geiger, and M. Goldszmidt (1997) Bayesian network classifiers. Machine Learning, 29, 131-163.
[5] M.Q. Zhang (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl. Acad. Sci. USA, 94, 565-568.
[6] A. Krogh (1998) An Introduction to Hidden Markov Models for Biological Sequences. In Computational Methods in Molecular Biology, S.L. Salzberg, D.B. Searls, and S. Kasif eds., Elsevier, 45-63.
[7] S. Rampone (1998) Recognition of Splice-Junctions on DNA Sequences by BRAIN learning algorithm. Bioinformatics, 14, 676-684.
[8] D. Cai, A. Delcher, B. Kao, and S. Kasif (2000) Modelling splice sites with Bayes Networks. Bioinformatics, 16, 152-158.
[9] C.M. Bishop (1995) Neural Networks for Pattern Recognition. Oxford University Press.
[10] M. Burset and R. Guigo (1996) Evaluation of gene structure prediction programs. Genomics, 34, 353-367.
[11] T.A. Thanaraj (2000) Positional Characterisation of False Positives from Computational Prediction of Human Splice Sites. Nucleic Acids Research, 28, 744-754.
[12] M.O. Noordewier, G.G. Towell, and J.W. Shavlik (1991) Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences. In Advances in Neural Information Processing Systems, volume 3, Morgan Kaufmann.
[13] G.G. Towell, J.W. Shavlik, and M.W. Craven (1991) Constructive Induction in Knowledge-Based Neural Networks. In Proceedings of the Eighth International Machine Learning Workshop, Morgan Kaufmann.
[14] G.G. Towell (1991) Symbolic Knowledge and Neural Networks: Insertion, Refinement, and Extraction. PhD Thesis, University of Wisconsin - Madison.
[15] G.G. Towell and J.W. Shavlik (1992) Interpretation of Artificial Neural Networks: Mapping Knowledge-based Neural Networks into Rules. In Advances in Neural Information Processing Systems, volume 4, Morgan Kaufmann.
[16] T.A. Thanaraj (1999) A Clean data set of EST-confirmed Splice Sites from Homo Sapiens and Standards for Clean-up Procedures. Nucleic Acids Research, 27, 2627-2637.
[17] M. Burset, I.A. Seledtsov, and V.V. Solovyev (2001) SpliceDB: database of canonical and non-canonical mammalian splice sites. Nucleic Acids Research, 29, 255-259.
[18] T.A. Thanaraj (1999) Standards to Create Clean Data Sets for Gene Prediction. Bioinformer, Fall '99, http://bioinformer.ebi.ac.uk/newsletter/archives/5/gene_prediction.html.
