Protein Function Prediction
 Methods for Functional Assignment

              


Pairwise and Multiple Sequence Alignments and Similarity Searches


Introduction
Alignments are fundamental to most bioinformatics methods. Good alignments can be used to infer the common ancestry of a number of sequences, to make predictions about function and structure and to identify conserved regions that may be of structural or functional importance.

In this part of the practical we will encounter some basic alignment techniques using server-based sequence database search methods and multiple sequence alignment tools.
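
As a rough illustration of what a pairwise alignment method computes, the sketch below implements a simple global alignment by dynamic programming (the Needleman-Wunsch approach). The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for the example. The servers used in the practical rely on amino acid substitution matrices and more elaborate gap penalties.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). The scoring scheme is a toy
    identity score, not a real substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning residue a[i-1] with residue b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


# Toy peptides: the second lacks one residue of the first.
total, row_a, row_b = global_alignment("MKVL", "MVL")
print(total)
print(row_a)
print(row_b)
```

With these toy peptides the single missing residue is placed opposite a gap, so the remaining residues line up as matches.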



Tools and Databases

References




Patterns and Profiles, Protein Motifs and Domains


Introduction

 
In general, protein sequences can be grouped into clusters. It is possible to identify groups of sequences that are maximally similar to one another and minimally similar to the members of other clusters. These clusters of evolutionarily related protein sequences are called families.
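
The grouping idea can be sketched with a very simple single-linkage clustering, in which any two sequences whose identity reaches a threshold end up in the same cluster. Everything here is an assumption made for illustration: the sequences are invented, they are taken to be already aligned and of equal length, and the 0.6 threshold is arbitrary. Real family databases use far more careful procedures.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold=0.6):
    """Single-linkage clustering of named sequences by pairwise identity."""
    names = list(seqs)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Link every pair that is at least `threshold` identical.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if identity(seqs[a], seqs[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Invented toy sequences forming two obvious groups.
toy = {
    "a1": "MKVLAAGT",
    "a2": "MKVLAAGS",
    "a3": "MRVLAAGT",
    "b1": "WWPHDEQC",
    "b2": "WWPHDEQY",
}
print(cluster(toy))
```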

The sequence information from alignments of these clusters or families can be combined and used to predict function or to search for more distantly related sequences (remote homologues). Evolutionary information from aligned homologous sequences combined in this way is known as a profile.
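
One common way to combine an alignment into a profile is a position-specific scoring matrix: for each alignment column, count the residues observed and convert the frequencies to log-odds scores. The sketch below assumes an ungapped toy alignment, a uniform background distribution and a pseudocount of 1. These are simplifications for illustration and do not describe how any particular database builds its profiles.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_window(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, segment))


def best_hit(profile, sequence):
    """Slide the profile along a sequence; return (best score, start index)."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Invented four-column alignment of a small family.
family = ["GKST", "GKSS", "GKTT", "AKST"]
profile = build_profile(family)
print(best_hit(profile, "MMGKSTLL"))
```

Because the profile scores every residue at every position, a sequence can still score well when it differs from each individual family member, which is what makes profiles useful for finding remote homologues.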

Another general property of sequences is that similarity is often restricted to limited stretches of sequence, known as domains and motifs.

The definition of these conserved sub-sequences depends on their size and function. Domains are stretches of sequence that appear as structural modules, often within many proteins. Motifs are short conserved sub-sequences that often correspond to active or functional sites. Motifs can be used to help predict function and in the identification of remote homologues.
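
Motifs are often written as patterns in the PROSITE syntax, which maps closely onto regular expressions. The sketch below converts a basic PROSITE-style pattern to a Python regular expression and scans a sequence with it. The converter handles only the core syntax (x, [..], {..}, repeat counts and the < and > anchors) and is a simplification, not a full parser. The example uses the ATP/GTP-binding P-loop pattern [AG]-x(4)-G-K-[ST], and the test sequence is only an illustrative peptide.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern into a regular expression.

    Simplified: anchors inside square brackets are not supported.
    """
    regex = pattern.strip().rstrip(".")   # drop the terminating full stop
    regex = regex.replace("-", "")        # element separators
    regex = regex.replace("x", ".")       # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 times
    regex = regex.replace("<", "^").replace(">", "$")   # terminus anchors
    return regex


def find_motif(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]


p_loop = "[AG]-x(4)-G-K-[ST]."
peptide = "MTEYKLVVVGAGGVGKSALTIQ"  # illustrative test sequence
print(prosite_to_regex(p_loop))
print(find_motif(p_loop, peptide))
```

A short pattern like this will match many unrelated proteins by chance, so a hit is a hint about function rather than proof.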

In the second part of the practical we will go through several examples to illustrate the concepts of patterns, profiles, motifs, domains and families. Each exercise is centred on the analysis of a specific sequence. The examples include results from database searches and from the application of a range of tools.



Tools and Databases

  • InterPro - an integrated database of protein families, domains, motifs and functional sites.
  • Blocks - multiply aligned ungapped segments for the most highly conserved regions of proteins.
  • Motif - a server that scans databases to find motifs or patterns and that can generate sequence profiles.
  • Pfam - multiple sequence alignments and HMMs of protein domains and families.
  • PRINTS - database of groups of conserved motifs, or protein fingerprints.
  • ProDom - protein domain families automatically generated from SWISS-PROT and TrEMBL.
  • PROSITE - database of protein families and domains defined by functional sites, patterns and profiles.
  • SMART - Simple Modular Architecture Research Tool for the identification of domains.



Non-sequence Based Function Prediction


Introduction

The traditional function prediction methods are rapidly being complemented by servers whose predictions are not based on simple sequence similarity alone. These servers often use a combination of text mining, genomic context and machine learning techniques to make their predictions. Some even make use of 3D structural information when it is known.

In the third part of the practical we will take a closer look at some of these methods and compare their efficacy with the more traditional methods.



Tools and Databases

  • COGs database - clusters of orthologous groups determined by comparing sequences from whole genomes.
  • MBGD - the Microbial Genome Database.
  • KEGG - the Kyoto Encyclopedia of Genes and Genomes.
  • STRING - looks for conservation of gene order, common phylogenetic patterns and gene fusions.
  • iHOP - an information system that accesses the more than ten million abstracts in PubMed.
  • GOA - the Gene Ontology Annotation database.
  • IntAct - a system and analysis tools for protein interaction data.
  • BIND - the Biomolecular Interaction Network Database.
  • ProtFun - queries other prediction servers to obtain the cellular role, enzyme class, etc.
  • ProFunc - uses both sequence- and structure-based methods to deduce function.
  • TRIPPI - predicts residues involved in interactions between proteins.



References



 

Michael Tress
Protein Design Group