Practical lesson 1
Accessing GenBank and EMBL databases by FTP and the WWW.
By Manuel J. Gómez, PDG, CNB, CSIC.
You will get the genomic sequence of Mycoplasma pneumoniae from the GenBank
and EMBL nucleotide databases, in FASTA, GenBank and EMBL formats, as well
as a file with the sequences of all proteins encoded in the Mycoplasma
pneumoniae genome. You will access the databases both by FTP and through
a WWW server.
For comparison, you will also retrieve another version of the Mycoplasma
pneumoniae genome that has been curated by NCBI staff and that, therefore,
is not part of GenBank. In doing so, you will see examples of different
data formats and entry identifiers, and you will browse a little through
the NCBI WWW server.
A. Accessing the NCBI server by FTP
By command line.
-
Open a console and run the following commands:
-
ftp
-
open ftp.ncbi.nih.gov
-
As login, use "anonymous"
-
As password, use your e-mail address, or any string containing the character
"@".
-
You will be at the main directory of the NCBI's FTP server.
-
Continue with the following commands:
-
ls (to get a list of the directory's content)
-
cd genbank
-
ls (to get another list; take a look at the contents: they include all
GenBank sequences, grouped by DIVISIONS; each division is stored as several
files).
-
cd genomes (to access a section of GenBank that contains only genomic sequences).
-
ls
-
cd Bacteria
-
ls
-
cd Mycoplasma_pneumoniae
-
ls
-
You should see several files, whose names include the GenBank accession
number for the Mycoplasma pneumoniae genome, U00089.
-
get U00089.gbk (to download the complete database entry, in GenBank format).
-
get U00089.fna (to download the complete nucleotide sequence of the genome,
in FASTA format).
-
get U00089.ptt (to download a table with the coordinates of all the predicted
ORFs in the genome, the Protein Translation Table).
-
get U00089.faa (to download a file with the amino acid sequence deduced
for all ORFs, in FASTA format).
-
bye (to close the connection)
-
The files that you have downloaded are probably in your home directory
(or in C:\, if you are using Windows), and they should be text files.
-
Open the files to check their content and to see what the different
formats look like.
-
An interesting exercise, ideally suited to practice a little with Perl,
could be as follows: write a script that uses the Protein Translation Table
and the file that contains the complete nucleotide sequence in FASTA format,
to generate a file that contains the sequence of all intergenic regions
in the genome.
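The exercise suggests Perl; as an alternative, here is a minimal Python sketch of one possible approach. It rests on assumptions that are not stated in the lesson: that each data row of the Protein Translation Table starts with a location written as "start..end", that header lines do not start that way, and that the genome can be treated as linear (a region spanning the origin of a circular genome is reported as two pieces). The function names are hypothetical.

```python
import re

# Assumed row layout: the line begins with "start..end" followed by whitespace.
LOCATION = re.compile(r"^(\d+)\.\.(\d+)\s")


def read_fasta_sequence(lines):
    """Return the sequence of the first record found in FASTA-formatted lines."""
    parts = []
    seen_header = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here; stop reading
            seen_header = True
        elif line and seen_header:
            parts.append(line)
    return "".join(parts)


def read_ptt_intervals(lines):
    """Collect (start, end) coordinates from rows that begin with a location."""
    intervals = []
    for line in lines:
        match = LOCATION.match(line)
        if match:
            a, b = int(match.group(1)), int(match.group(2))
            intervals.append((min(a, b), max(a, b)))
    return sorted(intervals)


def intergenic_regions(intervals, genome_length):
    """Return 1-based inclusive (start, end) gaps not covered by any interval."""
    regions = []
    covered_up_to = 0
    for start, end in intervals:
        if start > covered_up_to + 1:
            regions.append((covered_up_to + 1, start - 1))
        covered_up_to = max(covered_up_to, end)  # handles overlapping ORFs
    if covered_up_to < genome_length:
        regions.append((covered_up_to + 1, genome_length))
    return regions


def write_intergenic_fasta(sequence, regions, out):
    """Write each intergenic region as one FASTA record to the file object out."""
    for start, end in regions:
        out.write(f">intergenic_{start}_{end}\n")
        out.write(sequence[start - 1:end] + "\n")
```

To use it, you would open U00089.fna and U00089.ptt, pass the open files to read_fasta_sequence and read_ptt_intervals, and call intergenic_regions with the length of the sequence. Check the column layout of your own downloaded table before relying on the result.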
A variation to compare GenBank versus NCBI's curated databases
-
From the main directory of the NCBI's FTP server:
-
ls (to get the list of contents)
-
cd genomes (now, you are NOT in GenBank anymore, but in the GENOMES database
of NCBI).
-
ls
-
cd Bacteria
-
ls
-
get README (to download a file with information about the directory)
-
cd Mycoplasma_pneumoniae
-
ls
-
You should see several files, whose names include the accession number
for the Mycoplasma pneumoniae genome in the Genomes database, NC_000912,
which is different from the one in GenBank. While the information in GenBank
is that submitted by the original authors, the information in the Genomes
database has been curated or added by the NCBI staff.
-
get NC_000912.gbk (to download the complete database entry, in GenBank
format).
-
get NC_000912.fna (to download the complete nucleotide sequence of the
genome, in FASTA format).
-
get NC_000912.ptt (to download a table with the coordinates of all the
predicted ORFs in the genome, the Protein Translation Table).
-
get NC_000912.Glimmer2.ptt (to download a different table with coordinates).
-
get NC_000912.faa (to download a file with the amino acid sequence deduced
for all ORFs, in FASTA format).
-
In the README file you will find the explanation of why there is more
than one Protein Translation Table. The additional protein translation tables
have been generated by the NCBI, using several gene prediction programs,
to add value to the database.
-
A question that can be discussed is how to decide which of the Protein
Translation Tables should be considered the most reliable.
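The two command-line sessions above could also be scripted. The following is a minimal sketch using Python's standard ftplib module. It assumes that the server name and directory layout are exactly those described in this lesson, which may no longer be the case; the function names and the placeholder e-mail address are hypothetical.

```python
import os
from ftplib import FTP

# Host and directories as described in this lesson (assumed unchanged).
HOST = "ftp.ncbi.nih.gov"
GENBANK_DIR = "/genbank/genomes/Bacteria/Mycoplasma_pneumoniae"
GENOMES_DIR = "/genomes/Bacteria/Mycoplasma_pneumoniae"


def file_names(accession, extensions=("gbk", "fna", "ptt", "faa")):
    """Build the file names used in the lesson, e.g. U00089.gbk."""
    return [f"{accession}.{ext}" for ext in extensions]


def download(ftp, directory, names, dest="."):
    """Change to directory on the server and save each named file under dest."""
    ftp.cwd(directory)
    for name in names:
        with open(os.path.join(dest, name), "wb") as handle:
            ftp.retrbinary(f"RETR {name}", handle.write)


def fetch_all(email="user@example.org", dest="."):
    """Anonymous login, then download both versions of the genome files."""
    ftp = FTP(HOST)
    ftp.login("anonymous", email)  # password: any string containing "@"
    try:
        download(ftp, GENBANK_DIR, file_names("U00089"), dest)
        download(ftp, GENOMES_DIR, file_names("NC_000912"), dest)
    finally:
        ftp.quit()
```

Calling fetch_all() would open the connection and save the eight files in the current directory. Nothing is downloaded until that function is called, so the helper functions can be tried on their own first.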
Using a WWW Browser.
-
Use Netscape or a similar browser to make a connection to the following
address: ftp://ftp.ncbi.nih.gov/
-
You will have a view of the main directory of the NCBI's FTP server.
-
The browser allows you to move within the hierarchy of directories, but
now you are using a graphical interface. You could try to repeat some of
the instructions suggested in the previous exercise.
B. Accessing the NCBI server through the WWW.
-
Open Netscape or a similar browser and make a connection to the NCBI's
WWW server, at the address:
http://www.ncbi.nlm.nih.gov/
-
Open Entrez, the general interface to NCBI databases, one of which
is GenBank.
-
Select the Protein database.
-
Enter the Mycoplasma pneumoniae genome accession number, U00089, and click
on GO.
-
You will see an error message, of course, because the accession number
points to a nucleotide sequence.
-
Now select the Nucleotide database. This database includes GenBank
records, but also nucleotide sequences from other databases, such as RefSeq
and PDB.
-
Enter Mycoplasma pneumoniae and click on GO.
-
You will find more than 500 entries that contain the expression
"Mycoplasma pneumoniae". One of them is the complete genome.
Instead of trying to find it in the list, we will try a more specific query.
-
Enter the accession number of the Mycoplasma pneumoniae genome (U00089)
and click on GO.
-
You will get information about the Mycoplasma pneumoniae entry in Summary
format, that is, just the accession number and a brief description. You
can now choose among different display options.
-
If you select GenBank from the Display menu, or click on the accession
number, you will see a document with a HEADER section that includes information
about the entry, for example, locus and accession numbers, bibliographic
references and submission information. This is a special type of GenBank
document because, instead of the SEQUENCE section, it contains a CONTIG
section that includes a list of contigs (referred to by their accession
numbers). These are the segments into which the genome sequence was
divided at the time of submission.
-
If you select FASTA from the Display menu, you will get the complete sequence
of the genome, reconstructed by joining the contigs that were listed in
the GenBank document.
-
Now display the U00089 entry in GenBank format again, and click on any
of the contig accession numbers.
-
You will get a document in standard GenBank format (HEADER, FEATURES
and SEQUENCE). In the HEADER, you should see a line like the following:
COMMENT On Nov. 27, 2000 this sequence version replaced gi:xxxxxxx.
-
This comment indicates that there are other versions for this entry. Each
version has the same accession numbers (U00089 for the complete genome,
and another one for the current contig), but different gi numbers.
-
If you click on the gi number of the obsolete entry, you will see
its GenBank record, annotated with the comment:
This record was replaced or removed
and you will have access to the Revision history for that entry,
by clicking on the appropriate link.
-
Now, retrieve entry NC_000912, the NCBI curated version of the Mycoplasma
pneumoniae genome, from the nucleotide database.
-
You will get a Summary view of the entry.
-
As before, it is possible to visualize the entry in several formats, by
choosing among the options of the Display menu. For example:
-
Taxonomic information about Mycoplasma pneumoniae, by choosing Taxonomy
Links.
-
Bibliographic references, by choosing PMC Links or PubMed Links.
-
FASTA formatted nucleotide sequence, by choosing FASTA.
-
Complete entry in GenBank format, by choosing GenBank.
-
You have probably noticed that once you have chosen a specific format,
the display options change. If you have displayed a GenBank formatted document,
they should now include the option Graphics, which produces a linear
map of the genome, together with a segment of its sequence. You can walk
along the genomic sequence and zoom in and out of the map.
-
If you click on the link Protein coding genes, located above the
map in the Graphics display, you will get the Mycoplasma pneumoniae Protein
Translation Table.
-
List of protein sequence entries associated with NC_000912, by choosing Protein Links.
-
When you choose Protein Links, you move automatically from the Nucleotide
database to the Protein database.
-
You will now have on screen a Summary view of the 689 entries that correspond
to all the proteins predicted to be encoded in NC_000912.
-
If you select FASTA, among the display options, you will retrieve the amino
acid sequences of all those predicted proteins, in FASTA format.
-
At any moment, you can save your results as a file in your local computer,
by several methods:
-
You can use the menu commands File-Save as-Text or File-Save as-Source,
of the web browser.
-
You can use the option Send to - File, of the Entrez interface.
-
Or you can use the option Send to - Text, of the Entrez interface, followed
by the menu commands File-Save as-Text or File-Save as-Source, of the web
browser.
-
Now, select the Genome database.
-
Enter the terms Mycoplasma pneumoniae and click on GO.
-
You will get a list of 14 entries that contain the expression Mycoplasma
pneumoniae. All of them have accession numbers that start with "NC", which
means that they are part of the RefSeq project.
-
Click on the entry with accession number NC_000912.
-
You will have access to all the information available at the NCBI databases
about the genome of Mycoplasma pneumoniae (some of which you have seen
already), for example:
-
Link to the FTP site to download sequence files.
-
Links to the Protein Translation Table and the Structural RNA Coding Table.
-
Link to a database of orthologous proteins (COGs).
-
Link to a list of Mycoplasma pneumoniae proteins that are similar to other
proteins with known structure.
-
Link to the Microbial Genomes page of the Genomes database. Please,
follow that link.
-
You will see a summary of all microbial genomes in the database. For each
of them there are links to:
-
Entries in GenBank format.
-
Nucleotide sequence in FASTA format.
-
Protein translation tables.
-
Families of orthologous proteins.
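The GenBank documents examined in this section can also be inspected with a short script once they have been saved as text files. The sketch below is only an illustration under stated assumptions: that top-level keywords (ACCESSION, VERSION, FEATURES, CONTIG, ORIGIN) start in the first column, that the VERSION line may carry a "GI:" token as it did in records of this period, and that "//" ends the record. The function name is hypothetical.

```python
def summarize_genbank(lines):
    """Report accession, version, gi and which top-level sections appear."""
    summary = {"accession": None, "version": None, "gi": None, "sections": []}
    for line in lines:
        # Indented lines are continuations, features or sequence data.
        if not line.strip() or line[0].isspace():
            continue
        tokens = line.split()
        keyword, rest = tokens[0], tokens[1:]
        if keyword == "//":
            break  # end of the record
        if keyword == "ACCESSION" and rest:
            summary["accession"] = rest[0]
        elif keyword == "VERSION" and rest:
            summary["version"] = rest[0]
            for token in rest[1:]:
                if token.startswith("GI:"):
                    summary["gi"] = token[3:]
        elif keyword in ("FEATURES", "CONTIG", "ORIGIN"):
            summary["sections"].append(keyword)
    return summary
```

With a record such as the U00089 entry described above, the list of sections would be expected to contain CONTIG rather than ORIGIN, whereas the record of an individual contig would contain ORIGIN, the keyword that introduces the sequence itself.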
C. Accessing the EMBL nucleotide database through
the WWW.
-
Open Netscape or a similar browser and make a connection to the EMBL databases
server, which is hosted by the European Bioinformatics Institute, at the
address:
http://www.ebi.ac.uk/Databases/index.html
-
Select the link Nucleotide Databases (on the left side of the page);
then, the link to the Genomes server (in the table).
-
Select the link to Bacteria (on the left side).
-
You will see a table with a list of the bacterial genomes available from
this database.
-
Find Mycoplasma pneumoniae.
-
Clicking on the name Mycoplasma pneumoniae, you will get taxonomic
information from Newt, the taxonomy database of the EBI.
-
Clicking on the accession number U00089 you will retrieve the complete
entry, in EMBL format.
-
Clicking on Proteome (last column), you will get a lot of information
about the Mycoplasma pneumoniae proteome: protein size distribution, amino
acid composition statistics, protein functional annotation distribution
(according to the Gene Ontology), comparisons with other proteomes, etc.
D. Accessing the EMBL nucleotide database by FTP (using
a WWW browser).
Open Netscape or a similar browser and make a connection to the EMBL databases
server, at the address:
http://www.ebi.ac.uk/Databases/index.html
Click on the tab labeled Downloads (at the top of the page); then
on the link to Databases (in the table).
You will be in the main directory of the EBI databases FTP site. You can
use the browser to travel through the directories. Take some time to have
a look and check what you can find.
Follow the path to: genomes/bacteria/mpneumoniae/U00089.embl
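Once U00089.embl has been saved, you can compare its layout with the GenBank file. In the EMBL flat-file format each line starts with a two-letter code (ID, AC, DE and so on), XX lines are spacers, SQ introduces the sequence and "//" ends the entry. The following is a minimal sketch that collects the header lines by code; the function names are hypothetical and the sketch makes no attempt to parse the feature table or the sequence.

```python
def parse_embl_header(lines):
    """Group header lines by their two-letter code, stopping at the sequence."""
    fields = {}
    for line in lines:
        code = line[:2]
        if code in ("SQ", "//"):
            break  # the sequence block or the end of the entry
        if code == "XX" or not code.strip():
            continue  # spacer or blank line
        # The text of a line starts after the code and its padding.
        fields.setdefault(code, []).append(line[5:].rstrip())
    return fields


def accessions(fields):
    """Split the AC line(s) into individual accession numbers."""
    found = []
    for text in fields.get("AC", []):
        found.extend(part.strip() for part in text.split(";") if part.strip())
    return found
```

For example, passing the open file to parse_embl_header and then calling accessions on the result should return the accession numbers listed on the AC line of the entry.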
February 2004