Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, MA 02543 USA
Contact mriley@mbl.edu
GenProtEC is a database of the chromosomally encoded genes and proteins of Escherichia coli K-12. The database contains information on 4401 genes encoding 4285 proteins and 116 RNAs. The proteins are presented as modular units, where a module is defined as a domain that has at least 100 amino acid residues, carries a biological function, and has an independent evolutionary history. Protein modules were identified by Darwin analysis of E. coli protein sequences (1), using an alignment of at least 100 amino acids and a PAM (accepted point mutation) score of less than 200 as minimum requirements. Most of the modules are represented as individual proteins, but some are part of multi-modular (compound) proteins, which are the result of gene fusion events during evolution (2). There are currently 287 such compound genes containing 2, 3 or 4 modules. The present number of modules is 4616. GenProtEC (http://genprotec.mbl.edu) can be searched by bnumber (Blattner number), gene or protein name, gene product type, Enzyme Commission (EC) number, function description or physiological role. For any sequence-related pair, the position and length of the alignment are given, as well as the percentage of the protein aligned, the percentage of identical amino acids, and the PAM score. Most of the information in the database is available as downloadable text files. GenProtEC provides molecular function assignments for 3531 (80%) of the E. coli genes, including experimentally characterized functions (1924 genes), phenotype-associated functions (96 genes), phage-associated functions (312 genes) and putative functions (1199 genes). The putative function assignments of E. coli K-12 were recently reviewed and updated (3). Literature references to the characterized, phenotype-associated and some putative functions are provided. GenProtEC also contains MultiFun, a recently developed system for classification of cellular functions in E. coli (4).
MultiFun is based on the previous classification system developed by Monica Riley (5) and has incorporated the transport classification system of Milton Saier (6). MultiFun contains 10 major categories: Metabolism, Information Transfer, Regulation, Transport, Cell Processes, Cell Structure, Location, Extrachromosomal Origin, DNA Site and Cryptic Genes, which are further subdivided in a hierarchical system. Cellular functions have been assigned to 66% of the E. coli gene products. Because a gene product may play many roles in the cell, multiple cell function assignments are allowed per gene product where appropriate. The average number of cellular roles assigned per gene product is currently 2-3. A correlation table between MultiFun categories and Gene Ontology (GO) Consortium categories (7) is also available. GenProtEC's protein modules have been grouped into sequence-similar, or paralogous, groups in which each member must recognize at least one partner within the group and none outside the group, using the requirements listed above (an alignment of at least 100 amino acids and a PAM score of less than 200). The current groups were generated in collaboration with Bernard Labedan (2). Over half of the protein modules have at least one E. coli partner, and the sequence-related groups range in size from 2 to 96 members. Most members of each group are related by function as well as by sequence.
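The grouping rule described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build the GenProtEC groups: it assumes pairwise alignment results are already available, applies the two stated thresholds (alignment of at least 100 amino acids, PAM score below 200), and then takes connected components of the resulting similarity graph as one possible reading of "each member recognizes at least one partner within the group and none outside". The `Pair` record, the function names and the module identifiers are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

MIN_ALIGNED = 100  # minimum alignment length in amino acids (threshold from the text)
MAX_PAM = 200      # PAM score must be below this value (threshold from the text)


@dataclass(frozen=True)
class Pair:
    """Hypothetical record for one pairwise alignment between two modules."""
    a: str
    b: str
    aligned_len: int
    pam: float


def is_related(pair: Pair) -> bool:
    """Return True if the pair meets both minimum requirements."""
    return pair.aligned_len >= MIN_ALIGNED and pair.pam < MAX_PAM


def group_modules(pairs):
    """Group modules into connected components of the 'related' graph.

    Assumption: a paralogous group is modelled as a connected component,
    so every member has a partner inside the group and none outside it.
    """
    graph = defaultdict(set)
    for p in pairs:
        if is_related(p):
            graph[p.a].add(p.b)
            graph[p.b].add(p.a)

    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups
```

Called on a list of such records, `group_modules` returns one sorted list of identifiers per group; modules without any qualifying partner do not appear in the output, which corresponds to modules that have no E. coli partner.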
Incorporation of MultiFun, a classification schema for cellular function. Reannotation of putative function assignments.
Category Protein Databases