Facoltà di Scienze MM.FF.NN. and INFM, Università del Sannio, Via Port'Arsa 11, I-82100 Benevento, ITALY
Contact rampone@unisannio.it
In recent years, many computational tools for gene identification and characterization [1,2,3,4,5,6,7,8 and many others], mostly based on machine learning approaches, have been used. In the machine learning approach, a learning algorithm receives a set of training examples, each labelled as belonging to a particular class. The algorithm's goal is to produce a classification rule that correctly assigns new examples to these classes. The success of these methods depends largely on the quality of the data sets used for training [9]. Furthermore, a common data set is necessary when the prediction accuracy of different programs needs to be comparatively assessed [10,11]. The Irvine Primate Splice Junctions Dataset (UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html) is a "de facto" standard in the machine learning community [12,13,14,15 and many others], but it is now badly out of date and does not include sufficient material for the needs of most learning algorithms. A recent, EST-confirmed data set [16] has the same limitation in the extent of its data. More recently, Burset et al. [17] developed an extensive database, but its data do not include false splice sites (negative examples) and, specifically, proximal false splice sites. The latter are a well-known critical point for classification systems [11]. We developed a new database (HS3D - Homo Sapiens Splice Site Dataset) of Homo sapiens exon, intron and splice regions. The aim of this data set is to provide standardized material for training and for assessing the prediction accuracy of computational approaches to gene identification and characterization. From the complete GenBank Primate Sequences Rel. 123 (8436 entries), 697 entries of human nuclear DNA including a gene with complete CDS and with more than one exon have been selected according to assessed selection criteria [18] (file genbank_filtered.inf).
4450 exons and 3752 introns have been extracted from these entries (files exons.seq and introns.seq). Several statistics for these exons and introns are reported (files exons.stat and introns.stat): overall nucleotides, average GC content, number of exons/introns including non-AGCT bases, number of exons/introns in which the annotated end is not found, minimum, maximum and average exon/intron length, standard deviation of exon/intron length, number of introns whose sequence does not start with GT, and number of introns whose sequence does not end with AG. Then 3762 + 3762 donor and acceptor sites have been extracted as windows of 140 nucleotides around each splice site. After discarding sequences that do not include canonical GT-AG junctions (176 + 191), that include insufficient data (not enough material for a 140-nucleotide window) (590 + 547), or that include non-AGCT bases (30 + 32), there are 2955 + 2992 windows (files GT_true.seq and AG_true.seq). Information and several statistics about the splice site extraction are reported (files GT_true.inf, AG_true.inf, GT_true.stat, and AG_true.stat). Finally, there are 287,296 + 348,370 windows of false splice sites, selected by searching for canonical GT-AG pairs in non-splicing positions. False sites within +/- 60 nucleotides of a true splice site are marked as proximal (files GT_false.seq and AG_false.seq; related information: GT_false.inf and AG_false.inf). HS3D is available at the Web server of the University of Sannio, http://www.sci.unisannio.it/docenti/rampone/
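The window extraction and the labelling of false sites described above can be illustrated with a minimal Python sketch. It is not the code used to build HS3D. The function names, the label strings, and the choice to centre each 140-nucleotide window on the exon/intron boundary (70 nucleotides on each side) are assumptions made for illustration only; the window length, the rejection of non-AGCT bases and of incomplete windows, and the +/- 60 proximity range come from the text.

```python
WINDOW = 140          # window length stated in the text
HALF = WINDOW // 2    # assumption: window centred on the exon/intron boundary
PROXIMAL = 60         # false sites within +/- 60 of a true site are "proximal"


def extract_window(seq, boundary):
    """Return the window centred on `boundary`, or None if it is rejected."""
    start, end = boundary - HALF, boundary + HALF
    if start < 0 or end > len(seq):
        return None                    # not enough material for a full window
    window = seq[start:end].upper()
    if set(window) - set("ACGT"):
        return None                    # contains bases other than A, C, G, T
    return window


def classify_sites(seq, dinucleotide, true_boundaries):
    """Yield (boundary, window, label) for each occurrence of `dinucleotide`.

    Assumed convention: for donors ("GT") the boundary is the index of the G;
    for acceptors ("AG") it is the index just past the G.
    """
    seq = seq.upper()
    offset = 0 if dinucleotide == "GT" else 2
    true_set = set(true_boundaries)
    pos = seq.find(dinucleotide)
    while pos != -1:
        boundary = pos + offset
        window = extract_window(seq, boundary)
        if window is not None:
            if boundary in true_set:
                label = "true"
            elif any(abs(boundary - t) <= PROXIMAL for t in true_set):
                label = "false_proximal"
            else:
                label = "false"
            yield boundary, window, label
        pos = seq.find(dinucleotide, pos + 1)


if __name__ == "__main__":
    # Toy sequence: an annotated donor at index 80 and a second GT 32 nt later.
    toy = "C" * 80 + "GT" + "C" * 30 + "GT" + "C" * 100
    for boundary, window, label in classify_sites(toy, "GT", [80]):
        print(boundary, label, len(window))
```

Under these assumptions the GT of a donor window falls at positions 71-72 and the AG of an acceptor window at positions 69-70 (1-based); a different centring convention would only change the `offset` and `HALF` arithmetic.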
The authors wish to thank Massimo Mastroianni, Webmaster of the Facoltà di Scienze, Università del Sannio, for his patience.
Category Gene Identification and Structure