GeneFIND Family Identification System Introduction

GeneFIND (Gene Family Identification Network Design) is an integrated database search system that combines several search and alignment tools with the ProClass database to provide rapid and accurate gene family classification with enriched family information. Its objectives are to improve speed and sensitivity, differentiate global and motif similarities, and present collective information on an integrated platform that reduces human annotation effort. It has been used to identify several thousand new ProSite members, which have been incorporated into our ProClass_Motif sub-database [Wu & Shivakumar, 1998] and flagged as PCT or PCN entries.

GeneFIND uses a multi-level filter system (Figure 1), with MOTIFIND and BLAST as the first-level filters that quickly eliminate query sequences with very low probabilities of being family members. MOTIFIND (Motif Identification Neural Design) (Figure 2) [Wu et al., 1996] combines global and motif sequence features to improve sensitivity, and runs about 10 times faster than BLAST in direct comparisons. An individual three-layered feed-forward backpropagation neural network is used for each protein family. The current implementation has 1302 MOTIFIND networks, trained for all families with defined ProSite patterns (i.e., PCFA families, Table).
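The per-family network filter described above can be illustrated with a minimal sketch. It is not the MOTIFIND implementation: the feature encoding, layer sizes, weights, single output unit, family names and the 0.5 threshold are all assumptions made for illustration. It only shows the general idea of scoring one query against one small three-layer network per family and passing on the families that score above a cutoff.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(features, w_hidden, w_output):
    """Forward pass of a three-layer (input, hidden, output) feed-forward network.

    w_hidden: one weight row per hidden unit; the last entry of each row is the bias.
    w_output: weights of a single output unit; the last entry is the bias.
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row[:-1], features)) + row[-1])
        for row in w_hidden
    ]
    return sigmoid(sum(w * h for w, h in zip(w_output[:-1], hidden)) + w_output[-1])


def first_level_filter(features, family_networks, threshold=0.5):
    """Return {family: score} for families whose network score meets the threshold."""
    scores = {
        family: forward(features, net["w_hidden"], net["w_output"])
        for family, net in family_networks.items()
    }
    return {family: s for family, s in scores.items() if s >= threshold}


# Hypothetical toy data: two families, 3 input features, 2 hidden units each.
family_networks = {
    "FAM_A": {"w_hidden": [[4.0, 0.0, 0.0, -2.0], [0.0, 4.0, 0.0, -2.0]],
              "w_output": [3.0, 3.0, -3.0]},
    "FAM_B": {"w_hidden": [[-4.0, 0.0, 0.0, 2.0], [0.0, -4.0, 0.0, 2.0]],
              "w_output": [3.0, 3.0, -3.0]},
}
query_features = [1.0, 1.0, 0.0]  # assumed, already-encoded query sequence
candidates = first_level_filter(query_features, family_networks)
```

Only the families left in `candidates` would be handed to the slower alignment steps; in this toy data that is `FAM_A` alone.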

For full-scale classification, each query sequence is searched through all neural networks to identify PCFA members. The sequence is also searched against ProClass using BLAST (version 2.0) to find matches to PCFB families and PCFC unclassified entries. Each potential member is then aligned to known family members using the Smith-Waterman SSEARCH program (version 3.0). For probable ProSite members, motif pattern matching is performed using our signature pattern search program to determine whether the sequence has a "T" (true positive) or "N" (false negative) pattern. The motif regions are further modeled with the HMMER program (version 1.8) against sequence motifs of the respective family. Multiple motif alignment is then performed using ClustalW (version 1.74).
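As an illustration of the motif pattern matching step, the sketch below translates a ProSite-style pattern into a regular expression and reports where it matches a sequence. It is not the signature pattern search program mentioned above; the function names, the example pattern and the test sequence are chosen only for illustration, and rarer syntax (such as an anchor inside square brackets) is not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a ProSite-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        core, _, repeat = element.partition("(")
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # forbidden residues
            core = "[^" + core[1:-1] + "]"
        # [ABC] sets and single residues are already valid regex.
        if repeat:
            repeat = "{" + repeat.rstrip(")") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motifs(sequence, pattern):
    """Return (start, end) positions of non-overlapping pattern matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.span() for m in regex.finditer(sequence)]


# Illustrative pattern and made-up sequence.
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
sequence = "MK" + "CAACAAALAAAAAAAAHAAAH" + "GG"
matches = find_motifs(sequence, pattern)
```

A sequence with at least one match would count as carrying the pattern; how GeneFIND handles near-misses and mismatch ratios is not covered by this sketch.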

This on-line server [Wu et al., 1998] allows direct family identification of query sequences and returns search results as HTML documents. As illustrated in the example report, overall global and motif probability scores are displayed for likely families, together with BLAST scores, links to the corresponding ProClass records, SSEARCH full-length pair-wise alignments and ClustalW multiple motif alignments. The global score is computed from the BLAST E-value, the SSEARCH score and the percentage of sequence identity at overlapped length ratio. The motif score is computed from the ratio of mismatched amino acids to ProSite patterns and an HMM motif match score. Because the system is family-based and displays only the best-matched sequences of each family, it greatly reduces the noise and redundancy seen in standard entry-based database search results. Family information from the ProClass database, with hypertext links to all other major family and structural databases, can further assist the discovery of new family relationships.
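The difference between entry-based and family-based output can be sketched as follows. This is a simplified illustration, not the server's code: the hit records, field names and family labels are hypothetical, and "best-matched" is assumed here to mean the lowest E-value within each family.

```python
def best_hit_per_family(hits):
    """Collapse entry-based hits to one best hit (lowest E-value) per family."""
    best = {}
    for hit in hits:
        current = best.get(hit["family"])
        if current is None or hit["evalue"] < current["evalue"]:
            best[hit["family"]] = hit
    # Report families from most to least significant.
    return sorted(best.values(), key=lambda h: h["evalue"])


# Hypothetical entry-based search output: several hits fall in the same family.
hits = [
    {"entry": "SEQ_1", "family": "FAM_A", "evalue": 1e-30},
    {"entry": "SEQ_2", "family": "FAM_A", "evalue": 1e-45},
    {"entry": "SEQ_3", "family": "FAM_B", "evalue": 1e-5},
    {"entry": "SEQ_4", "family": "FAM_A", "evalue": 1e-12},
]
report = best_hit_per_family(hits)
```

Four entry-level hits collapse to two family-level lines, which is the kind of redundancy reduction the paragraph above describes.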

Figure 1. The GeneFIND Family Identification System.

Figure 2. The MOTIFIND Neural Network Design.

Suggestions by e-mail to Cathy Wu

Last Update March 1, 1999