Nuclc. Acids. Res. OUP

VIDA

http://www.biochem.ucl.ac.uk/bsm/virus_database/VIDA.html

Albà, M.M.1, Holzerlandt, R.1, Myers, R.E.1, Shepherd, A.J.2, Lee, D.2, Martin, N.3, Orengo, C.A.2, Kellam, P.1

1Wohl Virion Centre, Department of Immunology and Molecular Pathology, Windeyer Institute of Medical Sciences, University College London
2Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London
3Department of Computer Science, Birkbeck College, University of London

Contact   p.kellam@ucl.ac.uk


Database Description

Analysis of viral proteins in the absence of many of today's bioinformatics approaches has left viral genomic sequences less well organized and annotated than the sequences in other organism-centered genome databases. Existing virus sequence databases focus mainly on the visualization and interpretation of complete virus genomes (1-2) or on the detailed study of particular viral proteins (3). We have therefore developed a virus database, VIDA, that organizes open reading frames (ORFs) from partial and complete genomic sequences using a consistent methodology (4). In VIDA, the proteins are grouped into homologous protein families (HPFs) on the basis of regions or domains of sequence conservation, which allows the identification of functionally important regions and the cross-comparison of different virus genomes (5). Relevant information on the proteins is parsed from GenBank, regions of sequence conservation among different proteins are identified with XDOM (6), and HPFs integrating GenBank annotations are constructed with PSCbuilder (4). VIDA is currently a flat-file database. Each HPF is a separate entry and is presented to the user as an HTML document. The HPFs can be searched in different ways: using a pre-defined list of virus names, functional description or class, GenBank protein identification number, keywords and, more recently, sequence similarity to a submitted sequence. The latter option uses the IMPALA program (7) with position-specific scoring matrices derived from the conserved regions in the HPFs. Conserved sequence regions can be retrieved from VIDA as sequence alignments. A controlled taxonomic and functional classification is used for all proteins and protein families. When available, viral proteins and HPFs are mapped to structural protein families and to the corresponding SWISS-PROT and TrEMBL entries. All protein sequences from an HPF can be retrieved in FASTA format, individually or as a single file.
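The idea behind searching with position-specific scoring matrices can be illustrated with a small sketch. This is not IMPALA and not the VIDA implementation: it is a toy log-odds matrix built from an ungapped alignment block, assuming a uniform amino acid background and a simple pseudocount. The function names `build_pssm` and `best_hit` and the example sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from an ungapped block."""
    if not block or len({len(seq) for seq in block}) != 1:
        raise ValueError("block must be non-empty, ungapped and of equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_hit(pssm, query):
    """Slide the matrix along the query and return (best score, start offset)."""
    width = len(pssm)
    if len(query) < width:
        raise ValueError("query is shorter than the conserved region")
    hits = []
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Residues outside the 20 standard amino acids contribute 0.0.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        hits.append((score, start))
    return max(hits, key=lambda hit: hit[0])


# Hypothetical conserved region from three family members.
conserved_block = ["GDSGG", "GDSGA", "GDTGG"]
pssm = build_pssm(conserved_block)
score, start = best_hit(pssm, "MKTAGDSGGPLV")
print(f"best window starts at {start}, score {score:.2f}")
```

A real search would use matrices constructed by PSI-BLAST and the statistics implemented in IMPALA (7); the sketch only shows why a matrix derived from a conserved region scores a matching window more highly than the rest of the query.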
Sequences that are 100% redundant with other proteins in the HPF, if any, can be viewed in a separate table. VIDA 2.0 includes all sequences from GenBank for the Herpesviridae, Coronaviridae, Arteriviridae, Papillomaviridae and Poxviridae. VIDA now contains 12,400 protein sequences, 163 complete viral genomes and 1,930 homologous protein families (HPFs). Among the HPFs there are 909 for the Poxviridae, 884 for the Herpesviridae, 80 for the Papillomaviridae, 43 for the Coronaviridae and 14 for the Arteriviridae. About 92% of the viral proteins (11,349) are part of multi-protein HPFs and the remaining 1,051 are singleton entries.
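The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below only uses the figures given in the text; the variable names are illustrative, and the number of multi-protein HPFs is an inference that assumes each singleton entry is an HPF containing exactly one protein.

```python
# Figures as stated in the database description.
hpfs_per_family = {
    "Poxviridae": 909,
    "Herpesviridae": 884,
    "Papillomaviridae": 80,
    "Coronaviridae": 43,
    "Arteriviridae": 14,
}
total_proteins = 12_400
in_multi_protein_hpfs = 11_349
singletons = 1_051

total_hpfs = sum(hpfs_per_family.values())
multi_fraction = in_multi_protein_hpfs / total_proteins

print(total_hpfs)                          # 1930
print(in_multi_protein_hpfs + singletons)  # 12400
print(f"{multi_fraction:.1%}")             # 91.5%

# Inference (assumption: one protein per singleton entry).
multi_protein_hpfs = total_hpfs - singletons
print(multi_protein_hpfs)                  # 879
```

The per-family counts sum to the stated 1,930 HPFs, and 11,349 of 12,400 proteins is 91.5%, consistent with "about 92%".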

Recent Developments

In addition to the Herpesviridae, Coronaviridae and Arteriviridae, VIDA now includes two more virus families, the Poxviridae and the Papillomaviridae. The number of proteins in VIDA 2.0 has approximately doubled relative to VIDA 1.0, as has the number of homologous protein families (HPFs). Another novelty is that the HPFs can now be searched for sequence similarity to a user-defined sequence. The search uses the IMPALA program (7) with position-specific scoring matrices that correspond to the conserved regions. Sequence similarity searches are useful not only for finding information about a given protein but also for annotating new viral proteins from virus genome projects. We have also improved the functional annotation of the HPFs by including a representative gene name in the functional description. VIDA is constructed with non-redundant sequences, but we have added references to all the 100% redundant protein sequences as a separate table in each HPF. This information may be of interest because the 'redundant' proteins may come from different virus strains.
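The bookkeeping for redundant sequences can be sketched as follows. This is a minimal illustration, not the VIDA pipeline: it assumes that "100% redundant" means identical full-length sequences, keeps the first occurrence as the representative, and records the identifiers of the duplicates so that they can be listed in a separate table. The function name `collapse_redundant` and the identifiers are hypothetical.

```python
def collapse_redundant(records):
    """Map each representative ID to the IDs of proteins with an identical sequence.

    records: iterable of (protein_id, sequence) pairs.
    The first protein seen with a given sequence becomes the representative.
    """
    representative_for = {}  # sequence -> representative protein ID
    redundant_table = {}     # representative ID -> list of redundant IDs
    for protein_id, sequence in records:
        key = sequence.upper()  # compare case-insensitively
        if key not in representative_for:
            representative_for[key] = protein_id
            redundant_table[protein_id] = []
        else:
            redundant_table[representative_for[key]].append(protein_id)
    return redundant_table


# Hypothetical identifiers and sequences.
records = [
    ("prot_A", "MKTLLV"),
    ("prot_B", "MKTLIV"),
    ("prot_C", "MKTLLV"),  # identical to prot_A, e.g. from another strain
]
for representative, duplicates in collapse_redundant(records).items():
    print(representative, duplicates)
```

Keeping the duplicate identifiers, rather than discarding them, preserves the link to proteins that may originate from different strains while the family itself is built from non-redundant sequences.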

Acknowledgements

We are thankful for useful suggestions from Prof. Geoffrey Smith and Dr. Caroline Gubser on the Poxviridae and from Richard Jenner on human herpesvirus 8. We acknowledge the Medical Research Council for grants to R.E.M., D.L., N.M., C.A.O. and P.K. and the Biotechnology and Biological Sciences Research Council for grants to M.M.A., R.H. and A.J.S.

REFERENCES

  1. Tamames, J. and Tramontano, A. (2000) DANTE: A workbench for sequence analysis. Trends Biochem. Sci., 25, 402-403.
  2. Hiscock, D. and Upton, C. (2000) Viral Genome DataBase: storing and analyzing genes and proteins from complete viral genomes. Bioinformatics, 16, 484-485.
  3. Shafer, R.W., Jung, D.R., Betts, B.J., Xi, Y. and Gonzales, M.J. (2001) Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res., 29, 11-16.
  4. Albà, M.M., Lee, D., Pearl, F.M.G., Shepherd, A.J., Martin, N., Orengo, C.A. and Kellam, P. (2001) VIDA: a virus database system for the organization of animal virus genome open reading frames. Nucleic Acids Res., 29, 133-136.
  5. Albà, M.M., Das, R., Orengo, C. and Kellam, P. (2001) Genome wide function conservation and phylogeny in the Herpesviridae. Genome Res., 11, 43-54.
  6. Gouzy, J., Eugene, P., Greene, E.A., Khan, D. and Corpet, F. (1997) XDOM, a graphical tool to analyse domain arrangements in any set of protein sequences. Comput. Appl. Biosci., 13, 601-608.
  7. Schaffer, A.A., Wolf, Y.I., Ponting, C.P., Koonin, E.V., Aravind, L. and Altschul, S.F. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics, 15, 1000-1011.

Category   Protein Databases

Go to the abstract in the NAR 2001 Database Issue.
