Sequence Analysis
Sequence alignment.
Similarity searches.
Multiple alignments. Patterns and profiles.
Practice
Most of the exercises use the yeast (S. cerevisiae) protein "Ribulose-phosphate 3-epimerase", whose UniProt entry name is RPE_YEAST. In UniProt you will find bibliographic references for this protein, along with extra information that relates it to the pentose phosphate pathway and indicates that it has a protective role against oxidative stress.
1. Sequence alignment.
Align these two sequences (RPE_YEAST and RPE_MYCPN) using the EMBL-EBI server. RPE_MYCPN is the "Ribulose-phosphate 3-epimerase" from the bacterium Mycoplasma pneumoniae.
>RPE_YEAST
MVKPIIAPSI LASDFANLGC ECHKVINAGA DWLHIDVMDG HFVPNITLGQ PIVTSLRRSV
PRPGDASNTE KKPTAFFDCH MMVENPEKWV DDFAKCGADQ FTFHYEATQD PLHLVKLIKS
KGIKAACAIK PGTSVDVLFE LAPHLDMALV MTVEPGFGGQ KFMEDMMPKV ETLRAKFPHL
NIQVDGGLGK ETIPKAAKAG ANVIVAGTSV FTAADPHDVI SFMKEEVSKE LRSRDLLD
>RPE_MYCPN
MLNLVVNREI AFSLLPLLHQ FDRKLLEQFF ADGLRLIHYD VMDHFVDNTV FQGEHLDELQ
QIGFQVNVHL MVQALEQILP VYLHHQAVKR ISFHVEPFDI PTIKHFIAQI KQAGKQVGLA
FKFTTPLVNY ERLVQQLDFV TLMSVPPGKG GQAFNSAVFN NLKQAHKYHC SIEIDGGIKL
DNIHQIQDDV NFIVMGSGFI KLERWQRQQL LKTNQ
a) Try a global alignment ("needle" option) and a local alignment ("water" option).
Question: Which differences can be observed? Do you think these two sequences are related?
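The difference between the two alignment modes can be illustrated with a minimal dynamic-programming scorer. This is only a sketch: the function name `align_score` and the match, mismatch and gap values are assumptions chosen for the example, and the gap penalty is linear, whereas needle and water use a substitution matrix and separate gap opening and extension penalties.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Return the optimal alignment score of sequences a and b.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    The scoring values are illustrative assumptions.
    """
    cols = len(b) + 1
    # Global mode charges for leading gaps; local mode lets an alignment start anywhere.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                        prev[j] + gap,     # gap in b
                        curr[j - 1] + gap) # gap in a
            if local:
                score = max(score, 0)      # a local alignment may restart at any cell
                best = max(best, score)
            curr[j] = score
        prev = curr
    # Global: score of aligning both sequences end to end. Local: best cell anywhere.
    return best if local else prev[-1]


print(align_score("ACGT", "TTACGTTT"))              # global: 0
print(align_score("ACGT", "TTACGTTT", local=True))  # local: 8
```

In the toy example the short sequence is fully contained in the longer one. The local score counts only the matching core, while the global score also pays for the unaligned flanks.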
b) Try to obtain the alignment using different substitution matrices and several gap opening and gap extension penalties.
For instance, try BLOSUM62 and BLOSUM40.
Question: Which differences do you observe?
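To see why the opening and extension penalties matter, it helps to compute the cost of a gap as a function of its length. The sketch below assumes the convention in which the first gapped position costs the opening penalty and every further position costs the extension penalty; some programs instead charge open + length × extend, so check the documentation of the tool you use. The function name and the numeric values are illustrative assumptions.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of a single gap of the given length under an affine penalty.

    Assumed convention: the first gapped position costs gap_open,
    each additional position costs gap_extend.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# High opening / low extension penalty versus equal penalties
for n in (1, 5, 20):
    print(n, affine_gap_cost(n), affine_gap_cost(n, gap_open=2.0, gap_extend=2.0))
# 1 10.0 2.0
# 5 12.0 10.0
# 20 19.5 40.0
```

With a high opening and low extension penalty, one long gap is much cheaper than many short ones, so the aligner tends to produce few, long gaps. With equal penalties, short scattered gaps become affordable.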
c) Question: How could we be more confident about which alignment is the best one? How could such an alignment be obtained?
2. Homologue search in a database. BLAST.
Do a BLAST search using RPE_YEAST.
The goal is to find possible homologues in other species.
Use BLAST at EMBL, EBI or NCBI.
If we use BLAST-EMBL:
database=Swiss-Prot (nrdb95 is more complete).
filter=none
descriptions=250
alignments=250
If we check "Get selected sequences", we will obtain the corresponding sequences (by default, the ones with a good p-value are already checked).
Try using BLAST at NCBI (the EMBL server actually runs WU-BLAST, a slightly different implementation of BLAST) and compare the results.
BLAST-EMBL results. BLAST-NCBI results.
Questions: For the two sequences that we previously aligned (RPE_YEAST and RPE_MYCPN), what is the E-value of the match? Is it meaningful?
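When judging an E-value, it is useful to remember how it depends on the bit score and on the size of the search space: E = m × n × 2^(−S'), where S' is the bit score and m and n are the query and database lengths. The sketch below applies this formula directly. It uses raw lengths, whereas BLAST uses slightly shorter "effective" lengths, and the database sizes are hypothetical values chosen for the example.

```python
def expected_hits(bit_score, query_len, db_len):
    """E-value from a bit score: E = m * n * 2**(-S').

    Simplification (assumption): raw lengths are used instead of the
    effective lengths that BLAST computes internally.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


query_len = 238          # length of the RPE_YEAST sequence given above
small_db = 2 * 10**8     # hypothetical database size in residues
large_db = 2 * 10**11    # hypothetical larger database

for bits in (30, 50, 100):
    print(bits, expected_hits(bits, query_len, small_db),
          expected_hits(bits, query_len, large_db))
```

The same bit score gives a larger E-value in a larger database, which is one reason why E-values from different servers or databases are not directly comparable.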
3. Search for homologues of known 3D structure.
Now we will try to find out whether there is a known 3D structure for RPE_YEAST or for a homologous protein.
The easiest way to find this out is to run BLAST against a database of known structures (PDB).
BLAST search: we can use any BLAST server, with the RPE_YEAST sequence as the query and the PDB database selected. An example result can be found here.
Question: Is the structure of RPE_YEAST known? Or that of a homologue?
Question: Which organism does the PDB entry closest to RPE_YEAST belong to? Do you think that the RPE_YEAST structure could be predicted using the information from the alignment?
Question: How could the information from the multiple alignment be combined with structural information?
4. Identification of coding regions using BLAST.
A different use for BLAST.
>human
AGCTTTCTTCTTTTCCCTGTTGCTCAAATAAATAGTGTTCTTTGCTCAAA
CCCCCTTTCCCTCCTCCTTCTGCAATCTCAGCGCCTAGCGAAATCTGTTT
TCTTCATTGTAACCTCAGCTTCACCGCAATTAATTTTTTTTCCCTCTGGT
CACAAGATAATTCCTGACGCCAGTGAGTCTGGAGGTCAGACGAACAGCAA
ATTGGGGAACAAGGCGGCACTAATTCCTTACAAGTTCCTTGAAAAATCTT
TCGCTTAAAAAAAACGGGGGGTGGGGGGAGCTTCTTTGCTGTTCAGGGAT
TTATGCCTCGCGGAGCTGTGGCTCGAACCAGTGTTGGCTAAGGCGGACTG
GCAGGGGCAGGGAAGCTCAAAGATCTGGGGTGCTGCCAGGAAAAAGCAAA
TTCTGGAAGTTAATGGTTTTGAGTGATTTTTAAATCCTTGCTGGCGGAGA
GGCCCGCCTCTCCCCGGTATCAGCGCTTCCTCATTCTTTGAATCCGCGGC
TCCGCGGTCTTCGGCGTCAGACCAGCCGGAGGAAGCCTGTTTGCAATTTA
AGCGGGCTGTGAACGCCCAGGGCCGGCGGGGGCAGGGCCGAGGCGGGCCA
TTTTGAATAAAGAGGCGTGCCTTCCAGGCAGGCTCTATAAGTGACCGCCG
CGGCGAGCGTGCGCGCGTTGCAGGTCACTGTAGCGGACTTCTTTTGGTTT
TCTTTCTCTTTGGGGCACCTCTGGACTCACTCCCCAGCATGAAGGCGCTG
AGCCCGGTGCGCGGCTGCTACGAGGCGGTGTGCTGCCTGTCGGAACGCAG
TCTGGCCATCGCCCGGGGCCGAGGGAAGGGCCCGGCAGCTGAGGAGCCGC
TGAGCTTGCTGGACGACATGAACCACTGCTACTCCCGCCTGCGGGAACTG
GTACCCGGAGTCCCGAGAGGCACTCAGCTTAGCCAGGTGGAAATCCTACA
GCGCGTCATCGACTACATTCTCGACCTGCAGGTAGTCCTGGCCGAGCCAG
CCCCTGGACCCCCTGATGGCCCCCACCTTCCCATCCAGGTAAGCCTCGAA
GTCGGGACAGGGCTGAACACCCAGGCAAGGATGCTGCGGGACCCTCGGAG
CTCCCGATTGCCTCGCGTAACTCTTCCCTCTTTTCCTCTAATCAGACAGC
CGAGCTCGCTCCGGAACTTGTCATCTCCAACGACAAAAGGAGCTTTTGCC
ACTGACTCGGCCGTGTCCTGACACCTCCAGGTGAGTATCTCCTCTCTTGG
AGAGGGAGGTTTAAACGGCAAGTCCTGGAGTTGGCAGACGTTTTGAAAAA
TTGCCACTCACTCGGTTTAGGGAAACTGAGGCCAGAGAGGGACAAGTGAC
TTGCCCATGGTTGCATCAAATGAATGGCAGAGTCAGTTTCCATGTGATGT
GCATTTAAGCCTTAATGCGCCTGGCCCTGCCTCCGCAGTGGCCGAGGTCT
GGCAAGTAGACATGGTCCGACTAAATACAAGTCTTTCTGTTCCATGTTGT
ATAGGAGCTGTCTTCGGCAGCCCCCTCCCAGCTAGTGTCAATTCCAAGTA
GGAGGGGTAGCGCAACGTCCGCCTGTGGTCTTTGGCGCCAACTGGGTGGG
GGCAGCGTGGGGGGCGGAGTTATCAGGCTGGAGGTACAGACCAAGTTTCC
TCCCTGGCGCCGGCCAGTCTGCGGACGGCCCCCGCCTCGGCACGCTCGGC
GGAAACTGACTGCTCCTTGGTCTTCTTTCCTCCCCCGCCCAGAACGCAGG
TGCTGGCGCCCGTTCTGCCTGGGACCCCGGGAACCTCTCCTGCCGGAAGC
CGGACGGCAGGGATGGGCCCCAACTTCGCCCTGCCCACTTGACTTCACCA
AATCCCTTCCTGGAGACTAAACCTGGTGCTCAGGAGCGAAGGACTGTGAA
CTTGTGGCCTGAAGAGCCAGAGCTAGCTCTGGCCACCAGCTGGGCGACGT
CACCCTGCTCCCACCCCACCCCCAAGTTCTAAGGTCTTTTCAGAGCGTGG
AGGTGTGGAAGGAGTGGCTGCTCTCCAAACTATGCCAAGGCGGCGGCAGA
GCTGGTCTTCTGGTCTCCTTGGAGAAAGGTTCTGTTGCCCTGATTTATGA
ACTCTATAATAGAGTATATAGGTTTTGTACCTTTTTTACAGGAAGGTGAC
TTTCTGTAACAATGCGATGTATATTAAACTTTTTATAAAAGTTAACATTT
TGCATAATAAACGATTTTTAAACACTTGTGTATATGATGACACCCGTCTC
CATTAAGTACTAATGATGCTTTCTCGCACATGGCCGAATTTTGGGAGCTT
TGGGAAAGTGAACTTGCTTATTCTACGAGAGGGAAATGAAAAACTGCCTG
GTTGAGAGGGGATGGGGTGGAGAGAGAAGGGTTCATGATGGGAGTCTCAT
GTCCATTGAGGGATGGGTGCAGAGAAAAGTTCTGGCTCTGCCTCATTATT
TCAGAGATGAAACCAGAGACTGGTGCAAGCT
We will try to find evidence that some fragment of this sequence (a possible gene) is transcribed. To do so, we will use BLAST against an EST (expressed sequence tag) database.
Go now to the BLAST page and select "blastn".
* Now choose the EST database; since we are using a human sequence, choose the human EST database.
* The result will be something like this.
Question: Do you know what an EST is? Could this fragment contain a gene? If so, what could the gene structure look like?
Question: If no EST databases were available and you had a DNA sequence, how could you use BLAST to determine whether the fragment contains a gene? Hint: blastx.
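As background for the hint: blastx translates the nucleotide query in all six reading frames and compares the translations against a protein database. The sketch below shows only the translation step, assuming the standard genetic code. The function names are illustrative, and the code knows nothing about introns, so in a genomic fragment the exons of one gene may appear in different frames.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate frame 1 of dna; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the translations of the three forward and three reverse frames."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest stretch without a stop codon in a translated frame."""
    return max(protein.split("*"), key=len)


print(translate("ATGGCCTAA"))  # MA*
for name, protein in six_frames("ATGGCCTAAGGCCAT").items():
    print(name, protein, longest_open_stretch(protein))
```

Applied to the human fragment above, long stop-free stretches in one frame are only a weak indication of coding potential; a significant blastx hit to a known protein is much stronger evidence.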
5. "Manual" Pairwise Sequence Alignments using JalView
- All Files Containing only 2 Sequences
We will demonstrate aligning these two tubulin sequences to each other using JalView. The demonstration will involve
6. Multiple sequence alignment using BLAST results.
Now we will build a multiple alignment with the sequences that we previously found using BLAST. We will use ClustalW and Muscle.
This can be done on a web server, or the programs can be downloaded and installed locally.
ClustalW and Muscle
ClustalW web servers: EBI, EMBNET. For Muscle: EBI.
To visualize the alignment, click the "Jalview" button.
Practice with Jalview:
sort the sequences by one of the available criteria (menu "Calculate > Sort"),
remove redundant sequences ("Edit > Remove redundancy"),
obtain a neighbor-joining tree ("Calculate > Calculate Tree"),
sort the sequences by tree order ("Calculate > Sort"),
delete sequences (select one or more sequences and click "Edit > Delete"),
etc.
Question: What are the differences, if any, between the ClustalW and Muscle alignments? What do you think is the reason?
7. Obtain evolutionary information (phylogeny) from a multiple sequence alignment.
Sort the sequences by the neighbor-joining tree ("Calculate > Sort > By tree order") and color by percentage identity ("Color").
Question: Is there any particularly divergent sequence? If so, which function is assigned to it in UniProt?
Question: Can you observe any region that is more conserved than the others? Why?
Question: Where could the active site be located? Which other information could help to find it?
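Conservation can also be quantified column by column instead of being judged by eye from the coloring. The sketch below computes, for each alignment column, the fraction of sequences that carry the most common residue. It is a deliberately simple measure and not necessarily the one Jalview uses for its coloring; the toy alignment is hypothetical.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Fraction of sequences sharing the most common residue in each column.

    alignment: list of equal-length aligned sequences (strings).
    Gaps never count as the conserved residue but stay in the denominator.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("need a non-empty alignment of equal-length sequences")
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n)
    return scores


# Hypothetical toy alignment, for illustration only
toy = ["MKV-L",
       "MKVAL",
       "MRV-L"]
for position, score in enumerate(column_conservation(toy), start=1):
    print(position, round(score, 2))
```

Runs of columns with scores close to 1 are candidates for functionally important regions, for example residues around the active site.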
8. Remote homologue search using PSI-BLAST.
Starting with the RPE_YEAST sequence, we will try to find remote homologues (20-30% identity), that is, those that BLAST is not able to find.
PSI-BLAST server (go to "protein blast", and then select the "PSI-BLAST" option).
We will use only the SwissProt database (this way the searches will be faster).
Run several iterations of PSI-BLAST and analyze which proteins are found in each round. The results can also be found here: first iteration, second (multiple alignment), third, fourth.
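Between iterations, PSI-BLAST turns the significant hits into a position-specific scoring matrix (PSSM) and searches again with that profile instead of the single query sequence. The sketch below builds a very simplified PSSM from a toy alignment. It assumes a uniform background frequency and a fixed pseudocount and applies no sequence weighting, all of which differ from what PSI-BLAST actually does; the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds score (in bits) for every amino acid at every column.

    Assumptions: uniform background (1/20), fixed pseudocount,
    no sequence weighting; gaps and unknown residues are ignored.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


def profile_score(pssm, segment):
    """Score an ungapped segment against the profile, position by position."""
    return sum(column.get(res, 0.0) for column, res in zip(pssm, segment))


# Hypothetical toy alignment: positions 1 and 2 are invariant, position 3 varies
toy = ["DGG", "DGA", "DGG"]
pssm = build_pssm(toy)
print(round(profile_score(pssm, "DGG"), 2))
print(round(profile_score(pssm, "AAA"), 2))
```

Because conserved positions are rewarded more strongly than variable ones, a profile can recognize distant relatives that keep only the key residues, which a single-sequence search would score poorly.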
Question: When running the second iteration, what happens to the E-values of the proteins found in the first iteration?
Question: Check the multiple alignment of the homologues found (close and remote) in the second iteration. Compared with the multiple alignment obtained in the previous exercise, which differences can you observe? Is it more or less conserved? Does including remote homologues give any extra information? What could be the problem with including remote homologues?
Question: After the second iteration, are there any new sequences? Are their E-values statistically significant?
Question: Check the graph that shows which part of the protein is aligned. To which region do the new homologues found in the second iteration align? Propose a reason for what you see.
Question: After checking the results of the four iterations and which proteins appear in each one, what can you say about these proteins?
9. Remote homologue search using PSI-BLAST II.
Go to the BLAST web site at NCBI. Select the appropriate BLAST method. Paste in the query sequence (Query). Set the database to pdb, and press Blast.
Question 1: How many significant hits does BLAST find (E-value < 0.005)?
Now go back to the BLAST web site. Paste in the query sequence (Query). Set the database to nr, select PSI-BLAST (Position-Specific Iterated BLAST), change Max target sequences under Algorithm parameters to 5000, and press Blast.
Question 2: How many significant hits does BLAST find (E-value < 0.005)?
Question 3: How large a fraction (Query coverage) of the query sequence do the significant hits match (excluding the identical matches)?
Question 4: Do you find any PDB hits among the significant hits (look for the colored S to the right of the E-value)?
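If the hit list is long, counting significant hits by eye becomes tedious. The sketch below assumes the hits were saved in the 12-column tabular BLAST format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and filters them by E-value. The function name, the hit identifiers and the numbers are made up for illustration, and the coverage is computed per aligned segment, which may differ from the Query coverage shown on the web page when a hit has several segments.

```python
def significant_hits(tabular_text, query_length, max_evalue=0.005):
    """Return (subject, evalue, percent_identity, query_coverage) per significant hit.

    Expects tab-separated lines with the 12 standard tabular BLAST columns.
    Coverage is computed for each aligned segment separately.
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject = fields[1]
        identity = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        if evalue < max_evalue:
            coverage = (qend - qstart + 1) / query_length
            hits.append((subject, evalue, identity, coverage))
    return hits


# Made-up example rows, NOT real BLAST output
rows = [
    ("query", "hit_A", "45.0", "200", "110", "2", "1", "200", "5", "204", "1e-50", "180"),
    ("query", "hit_B", "25.0", "100", "75", "3", "51", "150", "10", "109", "0.5", "30"),
]
example = "\n".join("\t".join(row) for row in rows)

for subject, evalue, identity, coverage in significant_hits(example, query_length=200):
    print(subject, evalue, identity, f"{coverage:.0%}")
```

Only hit_A passes the E-value threshold here; hit_B is discarded because its E-value of 0.5 is above 0.005.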
Now run a second BLAST iteration. Press Go at Run PSI-Blast iteration 2.
Question 5: How many significant hits does BLAST find (E-value < 0.005)? Just give a rough estimate.
Question 6: How large a fraction of the query sequence do the significant hits match (do not include the first hit since this is identical to the query)?
Question 7: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!
Question 8: Do you find any PDB hits among the significant hits (look for the red colored S to the right of the E-value)?
If you did not find a PDB hit among the significant hits, run a third BLAST iteration.
Question 9: What is the PDB identifier (a 4-character code followed by a single-letter chain name) of the best PDB hit?
Question 10: What is the sequence similarity (Identity) between the query and this PDB hit? Click on the alignment score (the link in the Max score column) to get to the actual alignment of the query sequence to the PDB hit. If the alignment is not shown, go to Formatting options at the very top of the page, set Alignments to 5000 and press Reformat.
Question 11: What is the function of this protein?