Sequence alignments. Sequence search. Practice.

Sequence Analysis
Sequence alignment. Similarity searches.
Multiple alignments. Patterns and profiles.

Practice

Juan Carlos Sanchez

Links
1. Two sequence alignment
2. Homologues search in a database. BLAST.
3. Search of homologues of known 3D structure.
4. Identification of coding regions using BLAST.

5. Manual alignment using Jalview.
6. Multiple sequence alignment using BLAST results.
7. Get evolution information (phylogeny) using a multiple sequence alignment.
8. Remote homologues search using PSI-BLAST.
9. Remote homologues search using PSI-BLAST II.

Links:

Databases: Swiss-Prot, EMBL, UniProt.
Format conversion of sequences and alignments: ReadSeq.
Two sequence alignment: EBI (global or local).
Homologues search with BLAST: EMBL, EBI, NCBI. [FTP]
Homologues search with PSI-BLAST: NCBI
Multiple sequence alignment:

ClustalW: EMBNET, EBI, GenomeNet.

[FTP]

Muscle.

Multiple sequence alignment editors:

Domains and protein families databases:

Jalview use tutorial

Most of the exercises will be done using the "Ribulose-phosphate 3-epimerase" of yeast (S. cerevisiae), whose UniProt code is RPE_YEAST. In UniProt you will find some bibliographic references about this protein, along with extra information that relates it to the pentose phosphate pathway and indicates that it has a protective role against oxidative stress.

1. Two sequence alignment

Align these two sequences (RPE_YEAST and RPE_MYCPN) using the EMBL-EBI server. RPE_MYCPN is the "Ribulose-phosphate 3-epimerase" from the bacterium Mycoplasma pneumoniae.

>RPE_YEAST
MVKPIIAPSI LASDFANLGC ECHKVINAGA DWLHIDVMDG HFVPNITLGQ PIVTSLRRSV PRPGDASNTE KKPTAFFDCH MMVENPEKWV DDFAKCGADQ FTFHYEATQD PLHLVKLIKS KGIKAACAIK PGTSVDVLFE LAPHLDMALV MTVEPGFGGQ KFMEDMMPKV ETLRAKFPHL NIQVDGGLGK ETIPKAAKAG ANVIVAGTSV FTAADPHDVI SFMKEEVSKE LRSRDLLD

>RPE_MYCPN
MLNLVVNREI AFSLLPLLHQ FDRKLLEQFF ADGLRLIHYD VMDHFVDNTV FQGEHLDELQ QIGFQVNVHL MVQALEQILP VYLHHQAVKR ISFHVEPFDI PTIKHFIAQI KQAGKQVGLA FKFTTPLVNY ERLVQQLDFV TLMSVPPGKG GQAFNSAVFN NLKQAHKYHC SIEIDGGIKL DNIHQIQDDV NFIVMGSGFI KLERWQRQQL LKTNQ
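The sequences above are written in FASTA format with the residues grouped in blocks of ten. Not every tool accepts that layout, so a minimal Python sketch for reading such records and stripping the spaces may be handy. The function name `parse_fasta` and the toy records are illustrative choices, not part of any server or library.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its sequence without whitespace."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep only the first word as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: drop the spaces between blocks of residues.
            records[name] += "".join(line.split())
    return records


# Toy records (short pieces only, for illustration).
example = """>toy1
MVKPIIAPSI LASDF
>toy2
MLNLVVNREI
AFSLL
"""
seqs = parse_fasta(example)
for record_name, sequence in seqs.items():
    print(record_name, len(sequence), sequence)
```

Note that the header must sit on its own line: a record whose name and residues share one line would be read as a header only.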

a) Try doing a global alignment (option "needle") and a local alignment (option "water").
Question: Which differences can be observed? Do you think these two sequences are related?
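To see where the difference between the two modes comes from, here is a minimal sketch of both dynamic programming schemes in Python. It is a simplification under stated assumptions: it uses a plain match/mismatch score and a linear gap penalty, whereas the servers use substitution matrices and separate gap opening and extension penalties. The function name `alignment_score` and the toy sequences are hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming with a linear gap penalty.

    local=False follows the global (Needleman-Wunsch) scheme,
    local=True follows the local (Smith-Waterman) scheme.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # Local alignment: a negative prefix is dropped by restarting at 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Toy sequences sharing only a central region.
a = "AAAAGATTACATTTT"
b = "CCGATTACAGG"
global_score = alignment_score(a, b)
local_score = alignment_score(a, b, local=True)
print("global:", global_score, "local:", local_score)
```

The global score is dragged down by the unrelated ends that must be aligned or gapped, while the local score only reflects the shared central region.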

b) Try to obtain the alignment using different substitution matrices and several penalties for opening and extending gaps. For instance, try BLOSUM62 and BLOSUM40.
Question: Which differences do you observe?
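The effect of the two gap penalties can be illustrated with a small Python sketch. It assumes the common affine convention in which a gap of length n costs open + (n - 1) x extend; a given server may define the penalty slightly differently, and the default values below are only illustrative.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine gap cost: gap_open for the first position, gap_extend for each further one."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# One long gap versus several short gaps covering the same number of positions.
one_long_gap = gap_penalty(6)
three_short_gaps = 3 * gap_penalty(2)
print("one gap of 6:", one_long_gap)
print("three gaps of 2:", three_short_gaps)
```

With a high opening penalty and a low extension penalty, the alignment tends to contain a few long gaps. Making both values similar allows many short gaps instead.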

c) Question: How could we be more confident about which alignment is the best one? How could it be obtained?

2. Homologues search in a database. BLAST.

Do a BLAST search using RPE_YEAST. The goal is to find possible homologues in other species.

Use BLAST at EMBL, EBI, or NCBI.

If we use BLAST-EMBL:

database=Swiss-Prot (nrdb95 is more complete).
filter=none
descriptions=250
alignments=250

If we check "Get selected sequences", we will obtain the corresponding sequences (by default, the ones that have a good p-value are already checked).

Try using BLAST at NCBI as well and compare the results (the EMBL server actually runs WU-BLAST, a slightly different version of BLAST).

BLAST-EMBL results. BLAST-NCBI results.

Questions: For the two sequences that we previously aligned (RPE_YEAST and RPE_MYCPN), what is the e-value of the hit? Is it meaningful?
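As a reminder of what the e-value measures, the sketch below uses the standard relation between a bit score S' and the expected number of chance hits, E = m x n x 2^(-S'), where m is the query length and n the total database length. It is a simplification: real BLAST applies effective-length corrections that are ignored here, and the database size used is a made-up number.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score (no length corrections)."""
    return query_length * database_length * 2.0 ** (-bit_score)


query_length = 238          # length of RPE_YEAST
database_length = 1.0e8     # hypothetical total number of residues in the database

for bits in (30, 50, 100):
    print(bits, expected_hits(bits, query_length, database_length))
```

The e-value depends on the database size as well as on the score, so the same alignment can get different e-values on different servers.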

3. Search of homologues of known 3D structure.

Now we will try to find out whether there is a known 3D structure for RPE_YEAST or for some homologous protein.

The easiest way to find this out is to run BLAST against a database of known structures (PDB).
BLAST search: we can use any BLAST server, submit the RPE_YEAST sequence and select the PDB database. One possible result can be found here.

Question: Is the structure of RPE_YEAST known? Or that of some homologue?

Question: To which organism does the PDB entry closest to RPE_YEAST belong? Do you think that the RPE_YEAST structure could be predicted using the information from the alignment?

Question: How could the information from the multiple alignment be combined with structural information?

4. Identification of coding regions using BLAST.

A different use for BLAST.

>human
AGCTTTCTTCTTTTCCCTGTTGCTCAAATAAATAGTGTTCTTTGCTCAAA
CCCCCTTTCCCTCCTCCTTCTGCAATCTCAGCGCCTAGCGAAATCTGTTT
TCTTCATTGTAACCTCAGCTTCACCGCAATTAATTTTTTTTCCCTCTGGT
CACAAGATAATTCCTGACGCCAGTGAGTCTGGAGGTCAGACGAACAGCAA
ATTGGGGAACAAGGCGGCACTAATTCCTTACAAGTTCCTTGAAAAATCTT
TCGCTTAAAAAAAACGGGGGGTGGGGGGAGCTTCTTTGCTGTTCAGGGAT
TTATGCCTCGCGGAGCTGTGGCTCGAACCAGTGTTGGCTAAGGCGGACTG
GCAGGGGCAGGGAAGCTCAAAGATCTGGGGTGCTGCCAGGAAAAAGCAAA
TTCTGGAAGTTAATGGTTTTGAGTGATTTTTAAATCCTTGCTGGCGGAGA
GGCCCGCCTCTCCCCGGTATCAGCGCTTCCTCATTCTTTGAATCCGCGGC
TCCGCGGTCTTCGGCGTCAGACCAGCCGGAGGAAGCCTGTTTGCAATTTA
AGCGGGCTGTGAACGCCCAGGGCCGGCGGGGGCAGGGCCGAGGCGGGCCA
TTTTGAATAAAGAGGCGTGCCTTCCAGGCAGGCTCTATAAGTGACCGCCG
CGGCGAGCGTGCGCGCGTTGCAGGTCACTGTAGCGGACTTCTTTTGGTTT
TCTTTCTCTTTGGGGCACCTCTGGACTCACTCCCCAGCATGAAGGCGCTG
AGCCCGGTGCGCGGCTGCTACGAGGCGGTGTGCTGCCTGTCGGAACGCAG
TCTGGCCATCGCCCGGGGCCGAGGGAAGGGCCCGGCAGCTGAGGAGCCGC
TGAGCTTGCTGGACGACATGAACCACTGCTACTCCCGCCTGCGGGAACTG
GTACCCGGAGTCCCGAGAGGCACTCAGCTTAGCCAGGTGGAAATCCTACA
GCGCGTCATCGACTACATTCTCGACCTGCAGGTAGTCCTGGCCGAGCCAG
CCCCTGGACCCCCTGATGGCCCCCACCTTCCCATCCAGGTAAGCCTCGAA
GTCGGGACAGGGCTGAACACCCAGGCAAGGATGCTGCGGGACCCTCGGAG
CTCCCGATTGCCTCGCGTAACTCTTCCCTCTTTTCCTCTAATCAGACAGC
CGAGCTCGCTCCGGAACTTGTCATCTCCAACGACAAAAGGAGCTTTTGCC
ACTGACTCGGCCGTGTCCTGACACCTCCAGGTGAGTATCTCCTCTCTTGG
AGAGGGAGGTTTAAACGGCAAGTCCTGGAGTTGGCAGACGTTTTGAAAAA
TTGCCACTCACTCGGTTTAGGGAAACTGAGGCCAGAGAGGGACAAGTGAC
TTGCCCATGGTTGCATCAAATGAATGGCAGAGTCAGTTTCCATGTGATGT
GCATTTAAGCCTTAATGCGCCTGGCCCTGCCTCCGCAGTGGCCGAGGTCT
GGCAAGTAGACATGGTCCGACTAAATACAAGTCTTTCTGTTCCATGTTGT
ATAGGAGCTGTCTTCGGCAGCCCCCTCCCAGCTAGTGTCAATTCCAAGTA
GGAGGGGTAGCGCAACGTCCGCCTGTGGTCTTTGGCGCCAACTGGGTGGG
GGCAGCGTGGGGGGCGGAGTTATCAGGCTGGAGGTACAGACCAAGTTTCC
TCCCTGGCGCCGGCCAGTCTGCGGACGGCCCCCGCCTCGGCACGCTCGGC
GGAAACTGACTGCTCCTTGGTCTTCTTTCCTCCCCCGCCCAGAACGCAGG
TGCTGGCGCCCGTTCTGCCTGGGACCCCGGGAACCTCTCCTGCCGGAAGC
CGGACGGCAGGGATGGGCCCCAACTTCGCCCTGCCCACTTGACTTCACCA
AATCCCTTCCTGGAGACTAAACCTGGTGCTCAGGAGCGAAGGACTGTGAA
CTTGTGGCCTGAAGAGCCAGAGCTAGCTCTGGCCACCAGCTGGGCGACGT
CACCCTGCTCCCACCCCACCCCCAAGTTCTAAGGTCTTTTCAGAGCGTGG
AGGTGTGGAAGGAGTGGCTGCTCTCCAAACTATGCCAAGGCGGCGGCAGA
GCTGGTCTTCTGGTCTCCTTGGAGAAAGGTTCTGTTGCCCTGATTTATGA
ACTCTATAATAGAGTATATAGGTTTTGTACCTTTTTTACAGGAAGGTGAC
TTTCTGTAACAATGCGATGTATATTAAACTTTTTATAAAAGTTAACATTT
TGCATAATAAACGATTTTTAAACACTTGTGTATATGATGACACCCGTCTC
CATTAAGTACTAATGATGCTTTCTCGCACATGGCCGAATTTTGGGAGCTT
TGGGAAAGTGAACTTGCTTATTCTACGAGAGGGAAATGAAAAACTGCCTG
GTTGAGAGGGGATGGGGTGGAGAGAGAAGGGTTCATGATGGGAGTCTCAT
GTCCATTGAGGGATGGGTGCAGAGAAAAGTTCTGGCTCTGCCTCATTATT
TCAGAGATGAAACCAGAGACTGGTGCAAGCT

We will try to find evidence that some fragment of this sequence is transcribed (a possible gene). For this we will use BLAST against a database of ESTs (expressed sequence tags).

    * Go now to the BLAST page and select "blastn".

    * Choose the EST database and, since we are using a human sequence, "EST human".

    * The result will be something like this.

Question: Do you know what an EST is? Could this fragment contain a gene? If so, what could the gene structure look like?

Question: If no EST databases were available and you had a DNA sequence, how could you use BLAST to determine whether the fragment contains a gene? Hint: blastx.
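The hint relies on translating the DNA in all six reading frames before comparing it with proteins. A minimal Python sketch of that translation step is given below, assuming the standard genetic code. The helper names (`translate`, `six_frames`) are hypothetical, and the short stretch used as input is copied from the human fragment above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons with unknown bases become 'X'."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frames(dna):
    """Translate the three forward and three reverse reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


stretch = "ATGAAGGCGCTGAGCCCGGTG"
for frame, protein in six_frames(stretch).items():
    print(frame, protein)
```

In a real genomic fragment, coding exons appear as stretches without stop codons ('*') in one of the frames, interrupted by introns.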

5. "Manual" Pairwise Sequence Alignments using JalView
- All Files Containing only 2 Sequences
We will demonstrate aligning these two tubulin sequences to each other using JalView. The demonstration will involve:

Loading the sequences into JalView using both

File->Input Alignment->from File
File->Input Alignment->from Textbox

Changing the residue colouring scheme to:

Percentage Identity
Clustalx

Inserting gaps into a sequence using the mouse
Saving your work

6. Multiple sequence alignment using BLAST results.

Now we will build a multiple alignment with the sequences that we previously found using BLAST. We will use ClustalW and Muscle.
This can be done using a web server, or by downloading the programs and installing them locally.

ClustalW and Muscle

ClustalW web servers: EBI, EMBNET. For Muscle: EBI.
To visualize the alignment, click the "Jalview" button.

Practice with Jalview:

sort the sequences according to different criteria (menu "Calculate > Sort"),
remove redundant sequences ("Edit > Remove redundancy"),
obtain a neighbor joining tree ("Calculate > Calculate Tree"),
sort the sequences by tree order ("Calculate > Sort"),
delete sequences (select one or more sequences and click "Edit > Delete"),
etc.
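To make the idea behind removing redundancy concrete, here is a rough Python sketch that filters an alignment by pairwise percentage identity. It is only an illustration under stated assumptions: the greedy strategy, the 90% threshold and the way identity is counted are choices made here, and Jalview's own procedure may differ. The toy alignment is made up.

```python
def percent_identity(a, b):
    """Identity over the columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x == y)
    return 100.0 * same / len(pairs)


def remove_redundancy(alignment, threshold=90.0):
    """Greedily keep sequences that stay below the threshold against all kept ones."""
    kept = {}
    for name, seq in alignment.items():
        if all(percent_identity(seq, other) < threshold for other in kept.values()):
            kept[name] = seq
    return kept


# Toy alignment (hypothetical names): seqB differs from seqA at one position only.
toy = {
    "seqA": "MVKPIIAPSI",
    "seqB": "MVKPIIAPSL",
    "seqC": "MLN-VVNREI",
}
filtered = remove_redundancy(toy)
print(list(filtered))
```

Here seqB is dropped because it is 90% identical to seqA, which was kept first. The result of a greedy filter depends on the order of the sequences.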

Question: What are the differences, if any, between the ClustalW and Muscle alignments? What do you think is the reason?

7. Get evolution information (phylogeny) using a multiple sequence alignment.

Order sequences by the Neighbor Joining tree ("Calculate > Sort > By tree order") and color by %identity ("Color").

Question: Is there any particularly divergent sequence? If so, which function is assigned to it in UniProt?

Question: Can you observe any region that is more conserved than the others? Why?

Question: Where could the active center be found? Which other information could help to find it?
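Conservation can also be quantified column by column instead of judged by eye. The Python sketch below uses a very simple measure, the fraction of sequences that share the most common residue in each column. This measure is an assumption made for illustration, not the conservation score that Jalview displays, and the third sequence in the toy alignment is made up.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column.

    Gaps never count as the common residue, but they stay in the denominator.
    """
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n)
    return scores


# Toy alignment of three short aligned segments.
toy_alignment = ["HIDVMDG", "HYDVMDH", "HLDV-DG"]
scores = column_conservation(toy_alignment)
for position, value in enumerate(scores, start=1):
    print(position, round(value, 2))
```

Columns with a value close to 1 are candidates for functionally or structurally important positions, although a small or biased set of sequences can make a column look more conserved than it is.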

8. Remote homologues search using PSI-BLAST.

Starting with the RPE_YEAST sequence, we will try to find remote homologues (20-30% identity), that is, those that BLAST is not able to find.

PSI-BLAST server (go to "protein blast", and then select the option "PSI-BLAST").
We will use only the SwissProt database (this way the searches will be faster).

Do several iterations with PSI-BLAST and analyze which proteins are found in each round. The results can also be found here: first iteration, second (multiple alignment), third, fourth.

Question: When doing the second iteration what happens to the e-values of the proteins found in the first iteration?

Question: Check the multiple alignment of the homologues found (close and remote) in the second iteration. Compared with the multiple alignment obtained in the previous exercise, which differences can you observe? Is it more or less conserved? Does including remote homologues give any extra information? What problems could arise when including remote homologues?

Question: After the second iteration, are there any new sequences? Are their e-values statistically significant?

Question: Check the graph that shows which part of the protein is aligned. To which region do the new homologues found in the second iteration align? Propose a reason for what can be seen.

Question: After checking the results of the four iterations and which proteins appear in each one, what can you say about these proteins?
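Between iterations, PSI-BLAST turns the alignment of the hits into a position-specific scoring matrix (a profile) and searches again with it. The Python sketch below shows the general idea in a much simplified form. It assumes a uniform background frequency and a flat pseudocount, whereas the real program uses sequence weighting and more elaborate pseudocounts. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds score per residue and column, against a uniform background."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum of the column scores for an ungapped sequence of the same length."""
    return sum(column.get(res, 0.0) for column, res in zip(pssm, seq))


# Toy alignment of three short aligned segments.
toy_alignment = ["HIDVMDG", "HYDVMDH", "HLDVMDG"]
pssm = build_pssm(toy_alignment)
print(score_sequence(pssm, "HIDVMDG"))
print(score_sequence(pssm, "HVDVMDG"))  # variant at a variable position
print(score_sequence(pssm, "AIDVMDG"))  # variant at a conserved position
```

A substitution at a conserved column costs more than one at a variable column. A fixed substitution matrix cannot express this, and it is why the profile can detect more distant relatives.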

9. Remote homologues search using PSI-BLAST II.

Go to the BLAST web-site at NCBI. Select the appropriate Blast method. Paste in the query sequence Query. Set the database to pdb, and press Blast.

Question 1: How many significant hits does BLAST find (E-value < 0.005)?
Now go back to the BLAST web-site. Paste in the query sequence Query. Set the database to nr, select PSI-BLAST (Position-Specific Iterated BLAST), change the Algorithm parameters Max target sequences to 5000 and press Blast.

Question 2: How many significant hits does BLAST find (E-value < 0.005)?

Question 3: How large a fraction (Query coverage) of the query sequence do the significant hits match (excluding the identical matches)?

Question 4: Do you find any PDB hits among the significant hits (look for the colored S to the right of the E-value)?
Now run a second BLAST iteration. Press Go at Run PSI-Blast iteration 2.

Question 5: How many significant hits does BLAST find (E-value < 0.005)? Just give a rough estimate.

Question 6: How large a fraction of the query sequence do the significant hits match (do not include the first hit since this is identical to the query)?

Question 7: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!

Question 8: Do you find any PDB hits among the significant hits (look for the red colored S to the right of the E-value)?
If you did not find a PDB hit among the significant hits, run a third BLAST iteration.

Question 9: What is the PDB identifier (a 4 letter code followed by a single letter chain name) for the best PDB hit?

Question 10: What is the sequence similarity (Identity) between the query and this PDB hit? Click on the alignment score (the link in the Max score column) to get to the actual alignment of the query sequence to the PDB hit. If the alignment is not shown, go to Formatting options at the very top of the page, set Alignments to 5000 and press Reformat.

Question 11: What is the function of this protein?
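When the number of hits is large, counting the significant ones by hand becomes tedious. The Python sketch below counts them from a tab-separated hit table. It assumes the usual 12-column tabular BLAST layout (query id, subject id, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The hit lines are made up, and the coverage is computed per hit, which may differ from the "Query coverage" column of the web page.

```python
def significant_hits(tabular_text, query_length, max_evalue=0.005):
    """Return (subject id, E-value, % of query covered) for hits below max_evalue."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        q_start, q_end = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        if evalue < max_evalue:
            coverage = 100.0 * (q_end - q_start + 1) / query_length
            hits.append((fields[1], evalue, coverage))
    return hits


# Made-up hit table for a hypothetical query of 200 residues.
toy_output = "\n".join([
    "query\thitA\t45.0\t200\t110\t2\t1\t200\t5\t204\t1e-50\t180",
    "query\thitB\t28.0\t100\t72\t3\t51\t150\t20\t119\t0.001\t40.0",
    "query\thitC\t25.0\t60\t45\t1\t10\t69\t1\t60\t2.5\t25.0",
])
hits = significant_hits(toy_output, query_length=200)
for subject, evalue, coverage in hits:
    print(subject, evalue, coverage)
```

Running the same filter on the results of each iteration makes the growth in the number of significant hits easy to compare.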