Practical lesson 1

PATTERNS AND PROFILES.
PROTEIN MOTIFS, DOMAINS AND FAMILIES

By F. Abascal, PDG, CNB, CSIC.

In this practical lesson we will go through several examples to illustrate the concepts of "PATTERNS", "PROFILES", "MOTIFS", "DOMAINS" and "FAMILIES". Each exercise is centered on the analysis of a specific sequence, as described below.

The examples include results from database searches or from the application of several tools.

HOWEVER, the idea is that you repeat the various analyses yourself.

A selection of links to Databases and Tools
Exercise 1: PATTERNS

swiss::RPE_YEAST, from Saccharomyces cerevisiae, annotated as: "Ribulose-phosphate 3-epimerase".

Exercise 2: PROFILES

swiss::YD33_MYCTU, from Mycobacterium tuberculosis, annotated as "Hypothetical protein Rv1333".

Exercise 3: FAMILIES

Protein from the gene gcsf, from Bos taurus, annotated as "Granulocyte colony-stimulating factor precursor (G-CSF)".

Exercise 4: DOMAINS

swiss::ICE9_HUMAN, from Homo sapiens; precursor of caspase-9.

Link to another list of Databases and Tools centered on Protein Patterns and Profiles.
Link to tutorial on Patterns and Profiles (in Spanish)
Link to tutorial on Families (in Spanish)

A selection of links to Databases and Tools:

General Tools:

BLAST: EMBL, EBI, NCBI.

ClustalW: EBI, ch.EMBNET.org, crick.genes.nig.ac.jp, NPS@, GenomeNet.

Databases: SwissProt, EMBL; SRS-EBI, SRS-EMBL.

Pattern Tools and Databases:

PROSITE: Database of protein motifs expressed as patterns or profiles

http://us.expasy.org/prosite/

ScanProsite (several mirrors): Scans a sequence against PROSITE or a pattern against SWISS-PROT and TrEMBL

http://www.expasy.org/cgi-bin/scanprosite

http://us.expasy.org/tools/scanprosite/

http://kr.expasy.org/cgi-bin/scanprosite

http://tw.expasy.org/cgi-bin/scanprosite

http://ca.expasy.org/cgi-bin/scanprosite

http://cn.expasy.org/cgi-bin/scanprosite

PRATT: Generation of patterns (regular expressions) from a group of unaligned sequences.

http://www.ebi.ac.uk/pratt/

Profile Tools and Databases:

NCBI PSI-BLAST: Automatic generation of profiles in iterative searches.

BLAST page at NCBI

ProfileScan: Scans a sequence to find matches to protein patterns and profiles in PROSITE and Pfam.

http://hits.isb-sib.ch/cgi-bin/PFSCAN?

MOTIF: Scans sequences to find motifs, searches databases for sequences that match profiles or patterns, and generates profiles from sequences provided by the user.

http://motif.genome.ad.jp/

InterPro: Database of protein families defined by the presence of common motifs and domains, as described in several databases such as Pfam, SMART, Prosite, and others.

http://www.ebi.ac.uk/InterProScan

Pfam: database of protein HMM profiles that define domain families.

Sanger Institute (UK)

St. Louis (USA)

Karolinska Institutet (Sweden)

Institut National de la Recherche Agronomique (France)

Bioaccelerator: allows the generation of profiles and the performance of several kinds of searches.

http://eta.embl-heidelberg.de:8000/

Meme: identifies conserved motifs in groups of sequences, and generates profiles that can be used to search related sequences.

http://meme.sdsc.edu/meme/website/intro.html

1. PATTERNS

RPE_YEAST, from Saccharomyces cerevisiae, annotated as: "Ribulose-phosphate 3-epimerase".

This case will be used to illustrate how to generate a pattern that can describe a protein family.

Let's assume that we are interested in characterizing the following protein sequence.

>my_protein
MVKPIIAPSI LASDFANLGC ECHKVINAGA DWLHIDVMDG HFVPNITLGQ PIVTSLRRSV
PRPGDASNTE KKPTAFFDCH MMVENPEKWV DDFAKCGADQ FTFHYEATQD PLHLVKLIKS
KGIKAACAIK PGTSVDVLFE LAPHLDMALV MTVEPGFGGQ KFMEDMMPKV ETLRAKFPHL
NIQVDGGLGK ETIPKAAKAG ANVIVAGTSV FTAADPHDVI SFMKEEVSKE LRSRDLLD

First, perform a BLAST search to identify related proteins. The link to NCBI BLAST is in the table at the top. Paste the protein sequence and search against SwissProt (pre-computed results, obtained with NCBI BLAST, are here). The BLAST results indicate that SwissProt contains a considerable number of proteins that are similar in sequence and have been annotated as Ribulose-phosphate 3-epimerase. They can be considered a family.

Then retrieve the sequences of all proteins identified in the BLAST search, or choose those that show similarity to RPE_YEAST, with an associated e-value below some threshold (for example, 1e-6).
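The e-value selection described above can also be done with a few lines of Python once the hits are saved to a file. This is only a minimal sketch, assuming the hits were exported in the 12-column tabular format of NCBI BLAST+ ("-outfmt 6"), where column 2 is the subject identifier and column 11 the e-value. The three hit lines and their identifiers are made-up placeholders, not real search results.

```python
# Hypothetical tabular BLAST output (BLAST+ "-outfmt 6" column order):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
BLAST_TABLE = """\
my_protein\thit_A\t62.1\t220\t80\t2\t3\t222\t1\t218\t2e-90\t270
my_protein\thit_B\t35.0\t200\t120\t5\t5\t204\t10\t205\t4e-20\t95.1
my_protein\thit_C\t28.0\t60\t40\t1\t100\t159\t30\t88\t0.5\t30.2
"""


def select_hits(table, max_evalue=1e-6):
    """Return the subject IDs whose e-value is below max_evalue."""
    selected = []
    for line in table.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) < max_evalue:  # column 11 holds the e-value
            selected.append(fields[1])      # column 2 holds the subject ID
    return selected


family_candidates = select_hits(BLAST_TABLE)
print(family_candidates)
```

Only the identifiers that pass the threshold are kept; their sequences can then be retrieved and passed to ClustalW.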

Then, align them with ClustalW. Part of the resulting multiple sequence alignment, viewed with Belvu, should look like the one shown below. If you are not using Belvu, you should still be able to recognize some conserved regions.

Light blue columns indicate positions with a high percentage of identity. Dark blue columns indicate positions with high percentage of similarity (amino acids with similar properties). Gray columns indicate positions with a lower degree of similarity.
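If no alignment viewer is at hand, the percentage of identity per column can be estimated directly. The following is a minimal sketch, not the colouring scheme that Belvu actually uses: it only measures identity (the fraction of rows that share the most common residue) and ignores similarity between amino acids. The four aligned rows are a hypothetical block invented for illustration.

```python
from collections import Counter

# Hypothetical aligned block (rows of equal length, "-" marks a gap).
ALIGNMENT = [
    "HIDVMDGHF",
    "HVDIMDGRF",
    "HLDVEDGTY",
    "HIDVMD-HF",
]


def column_identity(alignment):
    """Fraction of rows carrying the most common residue, per column."""
    result = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]  # gaps do not count as identical
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        result.append(top / len(column))
    return result


identity = column_identity(ALIGNMENT)
# 1-based positions where at least 75% of the rows agree
conserved = [i + 1 for i, value in enumerate(identity) if value >= 0.75]
print(conserved)
```

Runs of consecutive conserved positions are the candidate Motifs.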

The blocks of positions (2 to 11) and (25 to 53) are clearly more conserved than the rest of aligned segments, and therefore, correspond to PROTEIN MOTIFS that could be characteristic of the RPE protein family. As you know, protein Motifs can be efficiently represented as simple patterns or regular expressions, if they correspond to sequences that are not too long. Therefore, we will now produce regular expressions that represent some of these Motifs, and we will try to find other sequences that also contain them.

You have now TWO options: you can generate MANUALLY your own pattern, or use some AUTOMATIC algorithm to do it.

First, we will try MANUALLY:

Choose some of the conserved segments and deduce patterns or regular expressions that could represent them. To do so, use the one-letter code to refer to amino acids, x to denote any amino acid, square brackets to list alternative amino acids at a position, and numbers in parentheses to specify how many times an element is repeated, for example x(2) or x(1,2).

Then try to find other sequences that contain the same Motifs, using ScanProsite or any of the servers that allow database searches with user-provided patterns (check the table at the top, or this other list, to find servers that you could use).
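The manual procedure can be mimicked with a short script. This is a minimal sketch under simplifying assumptions: the block is ungapped (so variable spacers such as x(1,2) are never produced), a column with more than max_alternatives different residues becomes x, and the three aligned segments are hypothetical. The function name and its parameter are illustrative only.

```python
def block_to_pattern(block, max_alternatives=2):
    """Derive a PROSITE-style pattern from an ungapped aligned block."""
    elements = []
    for column in zip(*block):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                    # invariant position
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")  # a few alternatives
        else:
            elements.append("x")                            # too variable

    # Merge consecutive x elements into x(n).
    merged, run = [], 0
    for element in elements + [None]:
        if element == "x":
            run += 1
            continue
        if run:
            merged.append("x" if run == 1 else "x(%d)" % run)
            run = 0
        if element is not None:
            merged.append(element)
    return "-".join(merged)


# Hypothetical conserved segment taken from three aligned sequences.
BLOCK = ["HIDVMDGHF", "HVDIMDGRF", "HLDVEDGTY"]
pattern = block_to_pattern(BLOCK)
print(pattern)
```

Raising max_alternatives gives a more specific pattern for the sequences in the block, at the risk of missing family members that carry a residue not yet seen at that position.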

You can compare your results with those obtained with the following patterns:

Example 1: H-x-D-[IVM]-[ME]-D-x(1,2)-[FY]
Search results with ScanProsite, against SwissProt.

Example 2: [FVLYI]-H-x-D-[IVM]-[ME]-D-x(1,2)-[FY]-x(2)-N
Search results with ScanProsite, against SwissProt.
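Before sending a pattern to ScanProsite it can be useful to check locally that it matches the query. The sketch below translates a subset of the PROSITE syntax (single residues, x, square brackets, curly brackets for excluded residues, and repetition counts) into a Python regular expression and applies both example patterns to the first 60 residues of my_protein. The anchors "<" and ">" and other less common elements are not handled, and the function name is illustrative.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = match.groups()
        if residue == "x":
            core = "."                             # any amino acid
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"      # any amino acid except these
        else:
            core = residue                         # literal residue or [ABC]
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)


# First 60 residues of my_protein, as given in the exercise.
QUERY = "MVKPIIAPSI LASDFANLGC ECHKVINAGA DWLHIDVMDG HFVPNITLGQ PIVTSLRRSV".replace(" ", "")

EXAMPLE_1 = "H-x-D-[IVM]-[ME]-D-x(1,2)-[FY]"
EXAMPLE_2 = "[FVLYI]-H-x-D-[IVM]-[ME]-D-x(1,2)-[FY]-x(2)-N"

match_1 = re.search(prosite_to_regex(EXAMPLE_1), QUERY)
match_2 = re.search(prosite_to_regex(EXAMPLE_2), QUERY)
print(match_1.start() + 1, match_1.group())  # 1-based start and matched segment
print(match_2.start() + 1, match_2.group())
```

Both patterns match the same region of the query; the second one is simply extended by one residue on the left and three on the right.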

Questions:

What type of proteins have been obtained using patterns and ScanProsite?
Which of them could be considered homologous?
How do you interpret the differences in the set of proteins identified with the patterns of each example?
Which approach identifies more proteins: pattern searches in databases or BLAST? (The results of a BLAST search are here; with BLAST we identify mainly "Ribulose-phosphate 3-epimerases" and "D-allulose-6-phosphate 3-epimerases".)

Now, we will try to AUTOMATICALLY generate the pattern, using Pratt

We will use the same segment of the alignment that we chose before to define patterns manually.

To trim the alignment and keep only a given segment, you definitely need an MSA viewer and editor, such as Belvu, Pfaat or JalView.

If you cannot obtain a portion of the alignment, continue with the sequences in the file: prueba.faa

The Pratt server (http://www.ebi.ac.uk/pratt/) takes a set of unaligned sequences as input. Just copy the sequences from the file and paste them into the Pratt input box.

The results should look like those reproduced below:

Best Patterns before refinement:
          fitness     hits(seqs)    Pattern
     1:   16.1802     16( 16)   H-x-D-x(2)-D-x(4,5)-N
     2:   12.0102     16( 16)   D-x(2)-D-x(4,5)-N
     3:    7.8401     17( 16)   V-x(2,3)-T
     4:    7.8401     19( 16)   D-x(4,5)-N
     5:    7.8401     33( 16)   L-x(2,3)-D
     6:    7.8401     23( 16)   S-x(0,1)-L
     7:    7.8401     21( 16)   S-x(1,2)-L
     8:    7.3401     25( 16)   D-x(1,3)-L
     9:    7.3401     28( 16)   D-x(2,4)-H
    10:    7.3401     18( 16)   M-x(3,5)-V
    11:    7.3401     23( 16)   L-x(1,3)-F

Best Patterns (after refinement phase):
                             fitness     hits(seqs)    Pattern
                    A   1:   23.2043     16( 16)   H-x-D-[IMV]-x-D-x(4,5)-N-x-[ASTV]-[FILMV]
                    B   2:   19.0342     16( 16)   D-[IMV]-x-D-x(4,5)-N-x-[ASTV]-[FILMV]
                    C   3:   16.7715     18( 16)   D-x(2,4)-H-x-[DTV]-x(3)-[AGNSTV]-x(2)-[GILPV]-x(2)-[IL]
                    D   4:   12.1179     18( 16)   M-x(3,5)-V-x-[ENR]-x-[ASTV]
                    E   5:   11.7492     16( 16)   D-x(4,5)-N-x-[AGSTV]-[FILMV]
                    F   6:    7.8401     17( 16)   V-x(2,3)-T
                    G   7:    7.8401     33( 16)   L-x(2,3)-D
                    H   8:    7.8401     23( 16)   S-x(0,1)-L
                    I   9:    7.8401     21( 16)   S-x(1,2)-L
                    J 10:    7.3401     25( 16)   D-x(1,3)-L
                    K 11:    7.3401     23( 16)   L-x(1,3)-F
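In the "hits(seqs)" column, a value such as 33( 16) can be read as 33 matches spread over 16 sequences: short patterns like L-x(2,3)-D may match several times in the same sequence. The sketch below shows the difference between the two counts. It assumes that overlapping matches are counted separately, which may not be exactly how Pratt counts them, and it uses three hypothetical peptide sequences.

```python
import re


def count_hits(regex, sequences):
    """Return (total matches, number of sequences with at least one match)."""
    # A lookahead lets matches that overlap each other be counted separately.
    finder = re.compile("(?=(" + regex + "))")
    per_sequence = [len(finder.findall(seq)) for seq in sequences]
    hits = sum(per_sequence)
    seqs = sum(1 for n in per_sequence if n > 0)
    return hits, seqs


# Hypothetical sequences; "L.{2,3}D" is the regular expression for L-x(2,3)-D.
SEQUENCES = ["ALKKDLAADG", "MLGGDP", "MKVTA"]
hits, seqs = count_hits("L.{2,3}D", SEQUENCES)
print("%d(%d)" % (hits, seqs))
```

A pattern whose hit count is much larger than its sequence count is usually too short to be specific for the family.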

You could now use some of the patterns generated by Pratt to perform searches with ScanProsite, in order to test whether they recover more or fewer protein sequences that seem to belong to the same functional group.

Now, we will try to identify Prosite patterns in your target sequence.

So far, you have been able to identify protein Motifs that are characteristic of a group of aligned sequences, and to identify new sequences in SwissProt that also contain the Motif. According to their annotations, most of those sequences seem to have a similar function and, therefore, you could consider that you have discovered a Motif that is characteristic of the RPE family of proteins. That Motif is represented by the patterns that you have either manually constructed or automatically generated with Pratt.

It seems interesting, therefore, to check whether a similar pattern (or Motif) has been deposited in Prosite and, if that is the case, whether the pattern identifies a family of proteins that includes those that we have identified.

You can follow several approaches:

Use ScanProsite to scan the amino acid sequence of RPE_YEAST against the database of patterns and profiles of Prosite.
Search Prosite with the expression "Ribulose-phosphate 3-epimerase".
Find the RPE_YEAST entry in SwissProt (accessing SwissProt directly or through SRS), and find out whether there are cross-references to the Prosite database.

You should be able to find that the Ribulose-phosphate 3-epimerase family has been documented in Prosite, and has accession number PDOC00833. Also, Prosite contains information about patterns that are characteristic of this family.

Questions:

How many patterns have been identified as characteristic of this family?
What are their accession numbers?
Are the patterns very different from those that you have generated?
What can you say about the specificity of the various patterns?
How many proteins have been identified in Prosite as members of the family?

Following the same idea, we will now try Pfam

We want to know whether there is some profile (HMM profile) in Pfam, which identifies the RPE family.

As before, you can:

Use the RPE_YEAST sequence to perform a Protein Search against Pfam.
Find the RPE_YEAST entry in SwissProt (accessing SwissProt directly or through SRS), and find out whether there are cross-references to the Pfam database.

You should be able to find that THERE IS a Pfam profile that describes the family (Ribul_P_3_epim). As expected, the Pfam profile describes a protein segment that is longer than that described by the Prosite patterns. It is considered a conserved Domain rather than a conserved Motif.

Questions:

How long is the Pfam Domain that is characteristic of the RPE family?
How many proteins have been identified that contain the conserved domain?
Do you believe that, in the case of RPE, it is useful to define a pattern to describe the family?
Does the pattern help in identifying remote homologs of the family?

(You can check again the results of searching SwissProt with the pattern, or with BLAST.)

Does the pattern provide information about functional features of the members of the family?

2. PROFILES

YD33_MYCTU, from Mycobacterium tuberculosis, "Hypothetical protein Rv1333".

This exercise will show that profile-based methods are more sensitive than pairwise similarity searches such as those conducted with BLAST.

Let's assume that we are trying to predict the function of YD33_MYCTU.

A. First get the sequence with SwissProt entry name (ID) YD33_MYCTU.

alternative 1: Open SRS at http://srs.ebi.ac.uk/; select the SwissProt database in the Library Page and search for "YD33_MYCTU". Choose "FastaSeqs" view and click on the button "view".

alternative 2 (faster): Open SwissProt at http://us.expasy.org/sprot/ and "quick search" for "YD33_MYCTU". At the end of the page click on "Q10644 in FASTA format".

You should obtain the sequence

>sw|Q10644|YD33_MYCTU Hypothetical protein Rv1333.
MNSITDVGGIRVGHYQRLDPDASLGAGWACGVTVVLPPPGTVGAVDCRGGAPGTRETDLL
DPANSVRFVDALLLAGGSAYGLAAADGVMRWLEEHRRGVAMDSGVVPIVPGAVIFDLPVG
GWNCRPTADFGYSACAAAGVDVAVGTVGVGVGARAGALKGGVGTASATLQSGVTVGVLAV
VNAAGNVVDPATGLPWMADLVGEFALRAPPAEQIAALAQLSSPLGAFNTPFNTTIGVIAC
DAALSPAACRRIAIAAHDGLARTIRPAHTPLDGDTVFALATGAVAVPPEAGVPAALSPET
QLVTAVGAAAADCLARAVLAGVLNAQPVAGIPTYRDMFPGAFGS
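If you prefer to handle the downloaded record with a script, the header and the sequence can be separated as follows. This is a minimal sketch that assumes headers of the form "db|accession|entry_name description", as in the record above; only the first sequence line of YD33_MYCTU is included here for brevity, so the printed length is not the length of the full protein.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.replace(" ", ""))  # also accepts 10-residue blocks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split 'db|accession|entry_name description' into its four parts."""
    identifier, _, description = header.partition(" ")
    database, accession, entry_name = identifier.split("|")
    return database, accession, entry_name, description


# Header and FIRST sequence line only (record truncated for brevity).
RECORD = """\
>sw|Q10644|YD33_MYCTU Hypothetical protein Rv1333.
MNSITDVGGIRVGHYQRLDPDASLGAGWACGVTVVLPPPGTVGAVDCRGGAPGTRETDLL
"""

header, sequence = parse_fasta(RECORD)[0]
database, accession, entry_name, description = split_header(header)
print(accession, entry_name, len(sequence))
```

Note that Q10644 is the accession number of the entry, while YD33_MYCTU is its entry name.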

B. Now search for homologous sequences using BLAST:

BLAST servers: EMBL, NCBI, EBI.

For EMBL WU-BLAST the initial parameters could be:

program=blastp
database=nrdb95
filter=SEG
descriptions=250
alignments=100
Click on "Submit Query"

Pre-computed results:

Questions

Which proteins are identified in the BLAST searches?
Do the annotations provide any information?
What does the "X" mean in the alignments?
Check whether different results are obtained using SEG as filter or not.

C. Since there is no obvious way to infer the function of the protein, we can go on to use PROFILES:

There are many possible strategies, but the most obvious one is to retrieve the protein sequences identified as similar with BLAST and generate an MSA with ClustalW. Then we would generate a Profile or an HMM profile, and we would use the profile to search the protein sequence databases again.

The easiest approach is to use PSI-BLAST, because the same algorithm automatically performs an initial BLAST search, retrieves similar sequences, generates an alignment, constructs a profile, and performs a new search, this time using the profile. The newly identified sequences are retrieved and aligned, and the profile is updated with the new information. The process is iterated as many times as the user decides, or until no additional sequences that match the profile are identified in the database.
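The iterative logic can be illustrated with a toy example. This is a sketch of the idea only, not the PSI-BLAST algorithm: real PSI-BLAST works with gapped local alignments, weights the sequences, and includes hits according to an e-value threshold. Here all peptides have the same length, the profile is an ungapped log-odds matrix built with a uniform background and a small pseudocount, and hits are selected with a raw score threshold. All names, sequences and parameter values are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=0.1):
    """Log-odds scores (base 2) per column, with a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score(pssm, sequence):
    """Sum of the column scores for an ungapped, full-length comparison."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


def iterative_search(query, database, threshold=12.0, max_rounds=10):
    """Rebuild the profile from the hits of each round until no new hits appear."""
    included = [query]          # round 1: the profile contains the query only
    rounds = []
    for _ in range(max_rounds):
        pssm = build_pssm(included)
        hits = sorted(name for name, seq in database.items()
                      if score(pssm, seq) >= threshold)
        if rounds and hits == rounds[-1]:
            break               # converged: same hits as in the previous round
        rounds.append(hits)
        included = [database[name] for name in hits]
    return rounds


# Hypothetical database of 8-residue peptides.
DATABASE = {
    "query":     "HIDVMDGH",
    "close1":    "HVDVMDGH",
    "close2":    "HIDIEDGH",
    "remote":    "HVDIEDGR",
    "unrelated": "KLSTPAWQ",
}

rounds = iterative_search(DATABASE["query"], DATABASE)
for number, hits in enumerate(rounds, start=1):
    print("round", number, hits)
```

In this toy case the sequence named "remote" scores below the threshold against the query alone, but above it once the profile has learned, from the first-round hits, which substitutions are tolerated at each position.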

Using PSI-BLAST at the NCBI:

BLAST is at http://www.ncbi.nlm.nih.gov/BLAST/. Choose "Protein BLAST / PSI- and PHI-BLAST".
Paste the sequence in "Search".
Descriptions=500
Alignments=100
Format for PSI-BLAST -> with inclusion threshold = 1e-05 (=0.00001)
Hit "BLAST!"

After a while click on "format" to display the results of the first round of the search.
Choose the proteins you want to use to construct the profile and select "Run PSI-BLAST iteration 2".

Here you have the pre-computed results of the FIRST ROUND, the SECOND ROUND and the THIRD ROUND.

Questions:

Have we detected more possible homologues with PSI-BLAST than with BLAST?
Are the annotations clearer?
Can we make any hypothesis about the function?

D. Searching against the HMM profile libraries of Pfam

Now we will try to identify whether there is an HMM profile for this family of proteins in Pfam.

Using Pfam

Go to Pfam: http://www.sanger.ac.uk/Pfam/.
Paste the sequence.
E-value=10
Hit "Search Pfam".

The pre-computed results of the search in Pfam are here.

Questions:

How many domains does this protein have?

Where are they?

What are their functions? (Click on each of them to access their individual entries)

What can we say about the function of the protein?

How many proteins have the domain peptidase_S58?

Which other domains appear associated to the domain peptidase_S58? (Go to the Domain Organization box and click on "View Graphic")

3. FAMILIES

Protein coded by the gene gcsf of Bos taurus (Granulocyte colony-stimulating factor precursor)

In this exercise we will illustrate the importance of defining protein families in sequence analyses.

A. First, get the sequence:

Go to: http://srs.ebi.ac.uk/.
Choose one of the following databases to look for it: SwissProt, SpTrEMBL or TrEMBL (updates).
Click on "Query forms => Standard".
Search for: "GeneName=gcsf", "Organism name=Bos taurus". Click on "submit query".
And now the same as before: "FastaSeqs" and "View"

>sw|P35833|CSF3_BOVIN Granulocyte colony-stimulating factor precursor (G-CSF).
MKLMVLQLLLWHSALWTVHEATPLGPARSLPQSFLLKCLEQVRKIQADGAELQERLCAAH
KLCHPEELMLLRHSLGIPQAPLSSCSSQSLQLTSCLNQLHGGLFLYQGLLQALAGISPEL
APTLDTLQLDVTDFATNIWLQMEDLGAAPAVQPTQGAMPTFTSAFQRRAGGVLVASQLHR
FLELAYRGLRYLAEP

B. PSI-BLAST search.

Proceed as in the previous exercise.
Here you have the pre-computed results for FIVE ROUNDS.
Try to understand the differences between the rounds.

Questions:

What has PSI-BLAST shown us?
In the second round this appears: "Q90YI0 (Q90YI0) Interleukin-6 precursor". It has an identity of 20% with respect to the "query", but a significant e-value. Why?
What are the consequences of this last sequence being included in the profile?
Why, in the last round, do the proteins annotated as "interleukin..." have better e-values than the proteins annotated as "granulocyte..."?
All the proteins that have been identified have the same evolutionary origin, but do they have the same function?
Can you identify subfamilies?
How do you believe that the existence of sub-families affects function prediction by the identification of homologous proteins?

C. Pfam search.

Connect to Pfam.
Repeat the steps indicated in Exercise number 2.
Pre-computed results are here

Questions:

Read the Pfam documentation for the protein domain. Which subfamilies are grouped here?
There is a remote relationship (e-values of 0.39 and 5.6) with the families IL-11 and IL-12. Could it be that the family IL6 documented in Pfam has a common ancestor with IL-11 and IL-12?
To address the previous question it would be necessary to build a multiple alignment to see whether the families have similarities, or to compare their structures.

D. Search in InterPro.

Open a connection to InterProScan at: http://www.ebi.ac.uk/InterProScan/.
Paste the amino acid sequence of gcsf.
Enter your e-mail address (any expression with the character @ is OK)
And "Submit job".

Pre-computed results can be found here.

Questions:

With which InterPro entries are there similarities?
Is there a hierarchy?
Is InterPro of any use in determining which subfamily gcsf belongs to?

4. DOMAINS

ICE9_HUMAN, from Homo sapiens; precursor of caspase-9.

With this example we will illustrate the importance of considering the multi-domain organization of many proteins.

A. First, you should get the sequence from SwissProt using the identifier ICE9_HUMAN:

>ICE9_HUMAN
MDEADRRLLR RCRLRLVEEL QVDQLWDALL SRELFRPHMI EDIQRAGSGS RRDQARQLII
DLETRGSQAL PLFISCLEDT GQDMLASFLR TNRQAAKLSK PTLENLTPVV LRPEIRKPEV
LRPETPRPVD IGSGGFGDVG ALESLRGNAD LAYILSMEPC GHCLIINNVN FCRESGLRTR
TGSNIDCEKL RRRFSSLHFM VEVKGDLTAK KMVLALLELA QQDHGALDCC VVVILSHGCQ
ASHLQFPGAV YGTDGCPVSV EKIVNIFNGT SCPSLGGKPK LFFIQACGGE QKDHGFEVAS
TSPEDESPGS NPEPDATPFQ EGLRTFDQLD AISSLPTPSD IFVSYSTFPG FVSWRDPKSG
SWYVETLDDI FEQWAHSEDL QSLLLRVANA VSVKGIYKQM PGCFNFLRKK LFFKTS

B. Then we will look for related sequences with EMBL-BLAST:

Pre-computed results are here.

Study the results to check what kind of proteins have been found.
Take a look at the graphic representation of the sequence matches.

Questions:

What can you conclude from the graphical representation of the alignments?
Could it be related with the presence of different domains?

C. Now we will make a search in Pfam

The pre-computed result is here.

Questions:

How many domains does ICE9_HUMAN have?
What is the function of the various domains? (Click on each of them to access their individual entries).
Is there any reason why these domains should appear together?
For each of the domains, find out which other domains it is associated with (go to the Domain Organization box and click on "View Graphic").
If you check the SwissProt entry for ICE9_HUMAN you will learn that this protein interacts with the protein apaf-1. If you search SRS with "description=apaf-1" + "organism name=Homo sapiens", you will find that the SwissProt entry for APAF-1 is APAF_HUMAN. From the SwissProt entry you can follow the link to the Pfam entry for APAF_HUMAN. Having done all this, can you conclude how caspase-9 and apaf-1 interact?

Think a little bit:

Protein comparisons indicate that protein domains often appear combined with different partners (shuffled).
What are the implications of domain shuffling for protein function prediction?
What are the implications of domain shuffling when constructing multiple sequence alignments?

D. Construction of a profile from a MSA, and identification of similar proteins by profile search.

To continue with the analysis of (part of) ICE9_HUMAN, we will use Bioaccelerator, a server that generates profiles from MSAs and performs profile searches in databases. This is equivalent to what we could do with PSI-BLAST, but it is a better method in the sense that the user has complete control over the alignment that is used to construct the profile.

Since, from the previous exercise, it is obvious that to study multi-domain proteins it may be necessary to isolate the domains, we will concentrate only on the CARD domain.
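Isolating a domain amounts to cutting the sequence at the domain boundaries. The sketch below does this for the CARD region of ICE9_HUMAN, using the label "ICE9_HUMAN/2-92" that appears in the Pfam CARD alignment used in this exercise. The coordinates are taken as given (1-based, both ends included); whether they agree with the current Pfam boundaries has not been checked. Only the first 120 residues of the protein are included, which is enough to cover the region.

```python
import re


def parse_region(label):
    """Split a label such as 'ICE9_HUMAN/2-92' into (name, start, end)."""
    match = re.fullmatch(r"(\S+)/(\d+)-(\d+)", label)
    if match is None:
        raise ValueError("unexpected label: " + label)
    name, start, end = match.groups()
    return name, int(start), int(end)


def extract_region(sequence, start, end):
    """Return residues start..end (1-based, inclusive)."""
    return sequence[start - 1:end]


# First 120 residues of ICE9_HUMAN, as listed in the exercise.
ICE9_HUMAN_NTERM = (
    "MDEADRRLLR RCRLRLVEEL QVDQLWDALL SRELFRPHMI EDIQRAGSGS RRDQARQLII"
    "DLETRGSQAL PLFISCLEDT GQDMLASFLR TNRQAAKLSK PTLENLTPVV LRPEIRKPEV"
).replace(" ", "")

name, start, end = parse_region("ICE9_HUMAN/2-92")
card = extract_region(ICE9_HUMAN_NTERM, start, end)
print(">%s/%d-%d" % (name, start, end))
print(card)
```

The resulting fragment can be used on its own as a query, so that matches to the caspase domain do not mask the matches to the CARD domain.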

Procedure:

1.- We will obtain an MSA of the CARD domain, for example from the Pfam CARD entry (locate the "Alignment" box, select FASTA format and click on "Get alignment").

The alignment, visualized with Belvu, looks like this:

2.- Now, we will use the FASTA formatted alignment to construct a profile or PSSM (position specific scoring matrix), with the program ProfileWeight.

Once the PSSM matrix (profile) has been generated, we will save it in a file.

A piece of information that can be obtained from the PSSM file is the weight that has been assigned to each sequence. More divergent sequences have larger weights.

Sequence Weights:
   1 CED4_CAEEL/3-90      100
   2 RIK2_HUMAN/436-524    94
   3 CRAD_HUMAN/2-89       92
   4 ICE2_HUMAN/16-104     54
   5 ICE2_CHICK/8-96       62
   6 ICE9_HUMAN/2-92       83
   7 CED3_CAEVU/3-91       56
   8 CED3_CAEEL/3-91       58
   9 Q66677/22-110         89
  10 APAF_HUMAN/2-90       96
  11 ICEB_MOUSE/2-94       79
  12 ICE5_HUMAN/44-132     69
  13 ICED_BOVIN/2-91       62
  14 ICE4_HUMAN/2-91       65
  15 BIR2_MOUSE/437-525    58
...etc.
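One common way of obtaining weights of this kind is the position-based scheme of Henikoff and Henikoff, sketched below. This is an assumption made for illustration: the scheme actually implemented in ProfileWeight is not documented here and may differ. In each column, every sequence receives 1 / (k * n), where k is the number of different residues in the column and n is the number of sequences that share its residue; the contributions are summed over columns and scaled so that the largest weight is 100, as in the listing above. The three-sequence alignment is hypothetical, and gap characters, if present, would be treated like any other symbol.

```python
from collections import Counter


def position_based_weights(alignment):
    """Position-based sequence weights, scaled so that the maximum is 100."""
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        distinct = len(counts)                    # k: different residues in the column
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (distinct * counts[residue])
    top = max(weights)
    return [round(100 * w / top) for w in weights]


# Hypothetical alignment: two identical sequences and one divergent sequence.
ALIGNMENT = ["ACDE", "ACDE", "GCHK"]
weights = position_based_weights(ALIGNMENT)
print(weights)
```

The two identical sequences share the credit in every column, whereas the divergent sequence keeps the credit for its residues to itself and ends up with the largest weight.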

3.- Now we can use this profile to search a database of sequences. Click on the link "ProfileSearch".

Click on "Upload file" to load the PSSM profile.

Then, set the gap opening and gap extension penalties to 11 and 1, respectively.

4.- Finally, launch the search.

The pre-computed results can be found here.

Questions:

Why do more divergent sequences have larger weights in the profile generated with ProfileWeight?

Compare the results of the profile search with those of a BLAST search using one of the sequences in the CARD MSA, ICE2_HUMAN:

>ICE2_HUMAN/16-104
HPHHQETLKKNRVVLAKQLLLSELLEHLLEKDIITLEMRELIQAKVGSFSQNVELLN
LLPKRGPQAFDAFCEALRETKQGHLEDMLLTT

Which method is capable of detecting more homologues?

February 2004

Manuel J. Gómez
Grupo de Diseño de Proteínas
Centro Nacional de Biotecnología, CSIC
