Blocks

Database Description

Blocks are ungapped multiple alignments corresponding to the most conserved regions of proteins. The Blocks Database (1) consists of blocks constructed from documented families of related proteins by the automated PROTOMAT system (2). It currently contains 11,853 blocks representing 2,608 protein families documented in InterPro (3) and Prints (4). A blocks multiple alignment consists of ungapped conserved regions separated by unaligned regions of variable size.
The PROTOMAT system applies a robust motif-finder (5) to a set of related protein sequences. The resulting candidate motifs are assembled into a best set along the lengths of the sequences, and additional sequences may be added if they are known to be related and can be aligned with all of the resulting blocks for a family. To reduce the redundancy and size of the Blocks Database, PROTOMAT is applied to families of sequences documented in InterPro in a hierarchical manner, ordered by perceived quality of documentation and lack of family inter-relationships. Since the Prints Database format is consistent with that of Blocks, Prints blocks are added directly without running PROTOMAT. The LAMA algorithm (6) is used to compare all blocks added to the Blocks Database with each other to further reduce redundant entries.
The Blocks Database is used to annotate proteins of unknown function. A protein or DNA sequence can be compared with the Blocks Database using the BLIMPS searching tool (7), or the IMPALA and RPS-BLAST tools from NCBI (8, 9), all of which provide statistics to evaluate hits. When a search of the Blocks Database hits a protein family, the user is linked to the InterPro documentation page. The Blocks WWW Server also provides several tools that supplement the information found there. Visual displays of the blocks for a family are provided by sequence logos (10), by maps of the blocks along the sequences, and by highlighting the blocks on known structures of the sequences in them.
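
The per-column information content that a sequence logo displays for a block can be sketched as follows. This is a minimal illustration, assuming the uncorrected measure of log2(20) minus the column's Shannon entropy (no small-sample correction is applied). The example block and the function name column_information are hypothetical and are not taken from the Blocks Database.

```python
import math
from collections import Counter

# Hypothetical ungapped block: every segment has the same width.
BLOCK = [
    "GDSGGP",
    "GDSGGP",
    "GDSGSP",
    "GDAGGA",
]

def column_information(block):
    """Return the information content of each column in bits.

    Assumption: information = log2(20) - Shannon entropy of the column,
    with no correction for the small number of sequences.
    """
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("a block is ungapped: all segments must have equal width")
    info = []
    for col in range(width):
        counts = Counter(segment[col] for segment in block)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        info.append(math.log2(20) - entropy)
    return info

info = column_information(BLOCK)
```

In this toy block the fully conserved first column reaches the maximum of log2(20) bits, while the third column, which mixes S and A, carries less information and would be drawn shorter in a logo.
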
A phylogenetic tree is made from the blocks representing each family and can be explored with the ProWeb tree viewer. Reverse searches of the block alignments against sequence databases are facilitated by links to BLAST (9, 11), MAST (12) and LAMA (6) searching pages. Links are provided to CYRCA sets of consistently aligned blocks (13), which are usually found to be protein regions of similar structure and function that appear in different contexts. CYRCA sets are found by comparing pairs of blocks using the LAMA algorithm. Blocks are linked to the CODEHOP PCR primer design tool (14), which uses the multiple alignment to design hybrid consensus-degenerate primers. Finally, blocks are linked to SIFT (15), a program that predicts the effect of amino acid substitutions from multiple alignment information. These blocks tools are also available when users make blocks from their own sequences, or excise blocks from their own multiple alignments, using the "Block Maker" and "Multiple Alignment Processor" features.
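
To make the idea of comparing a sequence with a block concrete, the sketch below converts an ungapped block into a log-odds position-specific scoring matrix and slides it along a query protein. It is a simplified illustration under stated assumptions (a uniform background of 1/20, a flat pseudocount of 1 per residue, a hypothetical block and query) and does not reproduce the scoring or the statistics used by BLIMPS, IMPALA or RPS-BLAST.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def block_to_pssm(block, pseudocount=1.0):
    """Build per-column log-odds scores (in bits) from an ungapped block."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("all segments must have equal width")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(width):
        column = [segment[col] for segment in block]
        denom = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / denom
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def best_hit(pssm, query):
    """Return (score, offset) of the best-scoring ungapped window in query."""
    width = len(pssm)
    if len(query) < width:
        raise ValueError("query is shorter than the block")
    best = None
    for offset in range(len(query) - width + 1):
        window = query[offset:offset + width]
        # Assumption: letters outside the 20 standard residues score 0.
        score = sum(pssm[i].get(aa, 0.0) for i, aa in enumerate(window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best

block = ["GDSGGP", "GDSGGP", "GDSGSP", "GDAGGA"]  # hypothetical block
query = "MKTLLCQGDSGGPLVAAR"                       # hypothetical query
score, offset = best_hit(block_to_pssm(block), query)
```

Here the best window starts at offset 7, where the query contains the segment GDSGGP. A real search would also need a statistical measure of how surprising such a score is, which this sketch omits.
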

Recent Developments

InterPro and Prints are now the only sources of protein families for the Blocks Database, which was updated from InterPro 3.1 and Prints 31.0 in August 2001.
Phylogenetic trees made from the conserved regions of protein sequences for a family can now be explored with the ProWeb tree viewer. It allows interactive zooming and the construction of new blocks from sequences associated with sub-trees, which may be desirable when the original blocks family contains distinct sub-families.
Blocks are now linked to CYRCA sets of consistently aligned blocks. CYRCA sets are formed by aligning pairs of blocks using the LAMA algorithm. Alignable blocks are usually found to be protein regions of similar structure and function that appear in different contexts. CYRCA has a tool to superimpose the structures of sequences belonging to these sets.
Blocks can now be analyzed by SIFT, a tool that uses multiple alignments to predict whether an amino acid substitution will affect protein function. SIFT takes advantage of the fact that substitutions at conserved positions, such as those in blocks, are more likely to affect function.
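
The intuition that substitutions at conserved positions are more likely to affect function can be illustrated with a toy calculation. The sketch below is not SIFT's published method: it assumes a flat pseudocount, scales each residue's column count by that of the most frequent residue, and applies an arbitrary cutoff. The block, the function names and the threshold are all hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical ungapped block used only for illustration.
BLOCK = ["GDSGGP", "GDSGGP", "GDSGSP", "GDAGGA"]

def scaled_probabilities(column, pseudocount=0.1):
    """Scale pseudocount-adjusted residue counts so the most frequent residue is 1.0."""
    counts = Counter(column)
    raw = {aa: counts.get(aa, 0) + pseudocount for aa in AMINO_ACIDS}
    top = max(raw.values())
    return {aa: value / top for aa, value in raw.items()}

def likely_affects_function(block, position, new_residue, cutoff=0.05):
    """Toy rule: flag a substitution whose scaled value falls below the cutoff."""
    column = [segment[position] for segment in block]
    return scaled_probabilities(column)[new_residue] < cutoff

conserved_site = likely_affects_function(BLOCK, 0, "A")  # column is all G
variable_site = likely_affects_function(BLOCK, 2, "A")   # A already observed here
```

With this toy block, substituting alanine at the invariant first position is flagged, whereas the same residue at the third position, where alanine already occurs in the alignment, is not.
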

Acknowledgements

This work is supported by grants from the NIH (GM29009) and the DOE (DE-FG03-97ER62382).

REFERENCES

1. Henikoff, J.G., Greene, E.A., Pietrokovski, S. and Henikoff, S. (2000) Nucleic Acids Res., 28, 228-230.
2. Henikoff, S. and Henikoff, J.G. (1994) Genomics, 19, 97-107.
3. Apweiler, R., Attwood, T.K., Bairoch, A. et al. (2000) Bioinformatics, 16, 1145-1150.
4. Attwood, T.K., Croning, M.D., Flower, D.R. et al. (2000) Nucleic Acids Res., 28, 225-227.
5. Smith, H.O., Annau, T.M. and Chandrasegaran, S. (1990) Proc. Natl. Acad. Sci. USA, 87, 826-830.
6. Pietrokovski, S. (1996) Nucleic Acids Res., 24, 3836-3845.
7. Henikoff, J.G. and Henikoff, S. (1996) Computer Applications in the Biological Sciences, 12, 135-143.
8. Schaffer, A.A., Wolf, Y.I., Ponting, C.P., Koonin, E.V., Aravind, L. and Altschul, S.F. (1999?) Bioinformatics.
9. Altschul, S.F., Madden, T.L., Schaffer, A.A. et al. (1997) Nucleic Acids Res., 25, 3389-3402.
10. Schneider, T.D. and Stephens, R.M. (1990) Nucleic Acids Res., 18, 6097-6100.
11. Henikoff, S. and Henikoff, J.G. (1997) Protein Science, 6, 698-705.
12. Bailey, T.L. and Gribskov, M. (1998) Bioinformatics, 14, 48-54.
13. Kunin, V., Chan, B., Sitbon, E., Lithwick, G. and Pietrokovski, S. (2001) J. Mol. Biol., 307, 939-949.
14. Rose, T.M., Schultz, E.R., Henikoff, J.G., Pietrokovski, S., McCallum, C.M. and Henikoff, S. (1998) Nucleic Acids Res., 26, 1628-1635.
15. Ng, P.C. and Henikoff, S. (2001) Genome Res., 11, 863-874.