Protein Function Prediction

Methods for Functional Assignment

Pairwise and Multiple Sequence Alignments and Similarity Searches

Introduction

Alignments are fundamental to most bioinformatics methods. Good alignments can be used to infer the common ancestry of a number of sequences, to make predictions about function and structure and to identify conserved regions that may be of structural or functional importance.
In this part of the practical we will encounter some basic alignment techniques using server-based sequence database search methods and multiple sequence alignment tools.
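As a rough illustration of what a global pairwise alignment tool computes, the sketch below implements the Needleman-Wunsch dynamic programming recurrence. The scoring values (match, mismatch, and a linear gap penalty) are assumptions chosen for readability; the servers listed below typically use amino acid substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b (toy scoring, linear gaps)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

A local (Smith-Waterman) alignment uses the same table but floors every cell at zero and starts the traceback from the highest-scoring cell rather than the corner.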

Tools and Databases

BLAST Servers

EMBL WU-BLAST
NCBI BLAST

Servers for Aligning Two Sequences

EBI Pairwise Alignment - offers both Smith-Waterman (local) and Needleman-Wunsch (global) alignment tools
NCBI bl2seq
Lipman and Pearson's Align program

Multiple Alignment Servers

ClustalW server

T-Coffee server

Sequence Databases

UniProt - the most comprehensive database of catalogued protein sequences.
Swiss-Prot - a curated database of highly annotated protein sequences.
PIR Database - another curated sequence database.
TrEMBL - a computer-annotated database of sequences translated from the EMBL nucleotide database.

Multiple Sequence Alignment Viewer

BoxShade

Patterns and Profiles, Protein Motifs and Domains

Introduction

In general, protein sequences can be grouped into clusters: groups of sequences that are maximally similar to one another and minimally similar to the sequences in other clusters. These clusters of evolutionarily related protein sequences are called families.
The sequence information from alignments of these clusters or families can be combined in order to search for function or for more distantly related sequences (remote homologues). Evolutionary information from aligned homologous sequences combined in this way is known as a profile.
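To make the idea of a profile concrete, here is a minimal sketch that turns a small ungapped alignment into per-column residue frequencies and then scores a candidate segment against them. The toy alignment, the pseudocount of 1, and the uniform background frequency are all assumptions made for illustration; real profile methods use sequence weighting and empirically derived background frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Count only standard residues; gap characters are ignored.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, segment, background=1 / 20):
    """Sum of log2(observed / background) over the segment's positions."""
    if len(segment) != len(profile):
        raise ValueError("segment must be as long as the profile")
    return sum(
        math.log2(profile[i][aa] / background) for i, aa in enumerate(segment)
    )


if __name__ == "__main__":
    # Hypothetical four-sequence family alignment.
    family = ["GKST", "GKSS", "GKTT", "AKST"]
    profile = build_profile(family)
    print(log_odds_score(profile, "GKST"))  # segment resembling the family
    print(log_odds_score(profile, "WWWW"))  # unrelated segment
```

A segment that resembles the family scores higher than an unrelated one, which is the property that lets profiles pick out remote homologues that a single query sequence would miss.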
Another general property of sequences is that sequence similarity may be restricted to short stretches of sequence called domains and motifs.
The definition of these conserved sub-sequences depends on their size and function. Domains are stretches of sequence that appear as structural modules, often within many proteins. Motifs are short conserved sub-sequences that often correspond to active or functional sites. Motifs can be used to help predict function and in the identification of remote homologues.
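Short motifs are often written as patterns. The sketch below converts a pattern in PROSITE-style syntax into a Python regular expression and scans a sequence with it. It handles only a simplified subset of the syntax (single residues, x wildcards, [..] alternatives, {..} exclusions, repeat counts, and < or > anchors at the ends of an element); the pattern and the sequence fragment are used purely as illustration.

```python
import re

# One pattern element: optional '<', a core, optional (n) or (n,m), optional '>'.
_ELEMENT = re.compile(
    r"(<?)([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
    sequence = "MTEYKLVVVGAGGVGKSALTIQ"  # example fragment for illustration
    hit = re.search(regex, sequence)
    print(regex)
    print(hit.start(), hit.group())
```

Because patterns like this match exactly or not at all, they are fast to scan but less sensitive than profiles, which is why databases often offer both.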
In the second part of the practical we will go through several examples to illustrate the concepts of patterns, profiles, motifs, domains and families. Each exercise centers on the analysis of a specific sequence. The examples include results from database searches and from the application of a range of tools.

Tools and Databases

InterPro - an integrated database of protein families, domains, motifs and functional sites.
Blocks - multiply aligned ungapped segments for the most highly conserved regions of proteins.
Motif - a server that scans databases to find motifs or patterns and that can generate sequence profiles.
Pfam - multiple sequence alignments and HMMs of protein domains and families.
PRINTS - database of groups of conserved motifs, or protein fingerprints.
ProDom - protein domain families automatically generated from SWISS-PROT and TrEMBL.
PROSITE - database of protein families and domains defined by functional sites, patterns and profiles.
SMART - Simple Modular Architecture Research Tool for the identification of domains.

Non-sequence Based Function Prediction

Introduction

The traditional function prediction methods are rapidly being complemented by servers whose predictions are not based on simple sequence similarity alone. These servers often use a combination of text mining, genomic context, and machine learning techniques to make their predictions. Some even make use of 3D structural information when it is known.

In the third part of the practical we will take a closer look at some of these methods and compare their efficacy with the more traditional methods.

Tools and Databases

COGS database - clusters of sequences determined by comparing sequences from whole genomes.

MGDB - the Microbial Genome Database.

KEGG - the Kyoto Encyclopedia of Genes and Genomes.

STRING - looks for conservation of gene order, common phylogenetic patterns and gene fusions.

iHOP - an information system that accesses the more than ten million abstracts in PubMed.

GOA - the Gene Ontology Annotation database.

IntAct - a database system and analysis tools for protein interaction data.

BIND - the Biomolecular Interaction Network Database.

ProtFun - queries other prediction servers to predict the cellular role, enzyme class and other functional categories of a protein.

ProFunc - uses both sequence- and structure-based methods to deduce function.

TRIPPI - predicts residues involved in interactions between proteins.

References

Prediction of Protein Function Tutorial at the EBI

Identifying Domains and Motifs

Domains, motifs, and clusters in the protein universe - a paper by Liu and Rost

Michael Tress
Protein Design Group