Practical lesson 1

Accessing GenBank and EMBL databases by FTP and the WWW.

By Manuel J. Gómez, PDG, CNB, CSIC.

You will get the genomic sequence of Mycoplasma pneumoniae from the GenBank and EMBL nucleotide databases, in FASTA, GenBank and EMBL formats, as well as a file with the sequences of all proteins encoded in the Mycoplasma pneumoniae genome. You will access the databases both by FTP and through a WWW server.

For comparison, you will also retrieve another version of the Mycoplasma pneumoniae genome that has been curated by NCBI staff and that, therefore, is not part of GenBank. In doing so, you will see examples of different data formats and entry identifiers, and you will browse the NCBI WWW server a little.

A. Accessing the NCBI server by FTP

By command line.

Open a console and run the following commands:

ftp
open ftp.ncbi.nih.gov

As login, use "anonymous".
As password, use your e-mail address, or any string containing the character "@".
You will be at the main directory of the NCBI's FTP server.
Continue with the following commands:

ls (to get a list of the directory's content)
cd genbank
ls (to get another list; take a look at the contents: they include all GenBank sequences, grouped by DIVISIONS; each division is stored as several files).
cd genomes (to access a section of GenBank that contains only genomic sequences).
ls
cd Bacteria
ls
cd Mycoplasma_pneumoniae
ls

You should see several files, whose names include the GenBank accession number for the Mycoplasma pneumoniae genome, U00089.

get U00089.gbk (to download the complete database entry, in GenBank format).
get U00089.fna (to download the complete nucleotide sequence of the genome, in FASTA format).
get U00089.ptt (to download a table with the coordinates of all the predicted ORFs in the genome, the Protein Translation Table).
get U00089.faa (to download a file with the amino acid sequence deduced for all ORFs, in FASTA format).

bye (to close the connection)
The files that you have downloaded are probably in your home directory (or in C:\, if you are using Windows), and they should be text files.
Open the files to check their content and to see what the different formats look like.
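The same session can be scripted. The following is a minimal Python sketch, not part of the original lesson, that uses the standard ftplib module. It assumes the host name, directory path and file names given above, which may no longer exist on the server in that form. The function names download_files and fetch_lesson_files are made up for this illustration.

```python
from ftplib import FTP
from pathlib import Path

# Assumptions: host, directory and file names are the ones used in this
# lesson; the server layout may have changed since the lesson was written.
HOST = "ftp.ncbi.nih.gov"
DIRECTORY = "genbank/genomes/Bacteria/Mycoplasma_pneumoniae"
FILENAMES = ["U00089.gbk", "U00089.fna", "U00089.ptt", "U00089.faa"]


def download_files(ftp, directory, filenames, dest="."):
    """Change to `directory` on an open FTP connection and fetch each file."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    ftp.cwd(directory)  # equivalent to the chain of "cd" commands
    saved = []
    for name in filenames:
        target = dest_dir / name
        with open(target, "wb") as handle:
            # equivalent to "get <name>" at the ftp prompt
            ftp.retrbinary("RETR " + name, handle.write)
        saved.append(target)
    return saved


def fetch_lesson_files(dest="."):
    """Open an anonymous session, download the four files, and close it."""
    with FTP(HOST) as ftp:
        ftp.login()  # no arguments means an anonymous login
        return download_files(ftp, DIRECTORY, FILENAMES, dest)
```

Calling fetch_lesson_files("genomes") would try to save the four files in a local directory named genomes. Unlike the interactive client, the script lets you choose the destination explicitly.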
An interesting exercise, ideally suited to practice a little with Perl, could be as follows: write a script that uses the Protein Translation Table and the file that contains the complete nucleotide sequence in FASTA format, to generate a file that contains the sequence of all intergenic regions in the genome.
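One possible approach to this exercise is sketched below, in Python rather than Perl. It is only a starting point under stated assumptions: that the data rows of the .ptt file are tab-separated and begin with a location written as start..end (1-based, inclusive), that the FASTA file holds a single sequence, and that the genome can be treated as linear, so the regions before the first ORF and after the last one are reported separately. All function names are invented for this example.

```python
import re

LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")


def read_fasta_sequence(path):
    """Return the sequence of a single-record FASTA file as one string."""
    with open(path) as handle:
        return "".join(line.strip() for line in handle if not line.startswith(">"))


def read_ptt_locations(path):
    """Return sorted (start, end) pairs, 1-based inclusive, from a .ptt file.

    Assumption: data rows are tab-separated and begin with "start..end";
    any other line (title, protein count, column names) is skipped.
    """
    locations = []
    with open(path) as handle:
        for line in handle:
            match = LOCATION.match(line.split("\t")[0].strip())
            if match:
                locations.append((int(match.group(1)), int(match.group(2))))
    return sorted(locations)


def intergenic_regions(locations, genome_length):
    """Return (start, end) pairs for the stretches not covered by any ORF."""
    gaps = []
    covered_up_to = 0  # last position known to lie inside an ORF
    for start, end in sorted(locations):
        if start > covered_up_to + 1:
            gaps.append((covered_up_to + 1, start - 1))
        covered_up_to = max(covered_up_to, end)  # handles overlapping ORFs
    if covered_up_to < genome_length:
        gaps.append((covered_up_to + 1, genome_length))
    return gaps


def write_intergenic_fasta(fasta_path, ptt_path, out_path):
    """Write every intergenic region as one FASTA record; return the regions."""
    sequence = read_fasta_sequence(fasta_path)
    gaps = intergenic_regions(read_ptt_locations(ptt_path), len(sequence))
    with open(out_path, "w") as out:
        for start, end in gaps:
            out.write(f">intergenic_{start}_{end}\n{sequence[start - 1:end]}\n")
    return gaps
```

With the files downloaded above, write_intergenic_fasta("U00089.fna", "U00089.ptt", "intergenic.fna") would be the call to try. Note that the table lists only protein-coding ORFs, so regions reported as intergenic may still contain RNA genes.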

A variation to compare GenBank versus NCBI's curated databases

From the main directory of the NCBI's FTP server:

ls (to get the list of contents)
cd genomes (now, you are NOT in GenBank anymore, but in the GENOMES database of NCBI).
ls
cd Bacteria
ls
get README (to download a file with information about the directory)
cd Mycoplasma_pneumoniae
ls

You should see several files, whose names include the accession number for the Mycoplasma pneumoniae genome in the Genomes database, NC_000912, which is different from the one in GenBank. While the information in GenBank is that submitted by the original authors, the information in the Genomes database has been curated or added by the NCBI staff.

get NC_000912.gbk (to download the complete database entry, in GenBank format).
get NC_000912.fna (to download the complete nucleotide sequence of the genome, in FASTA format).
get NC_000912.ptt (to download a table with the coordinates of all the predicted ORFs in the genome, the Protein Translation Table).
get NC_000912.Glimmer2.ptt (to download a different table with coordinates).
get NC_000912.faa (to download a file with the amino acid sequence deduced for all ORFs, in FASTA format).

In the README file you will find the explanation of why there is more than one Protein Translation Table. The additional tables have been generated by the NCBI, using several gene prediction programs, to add value to the database.
A question worth discussing is how to decide which of the Protein Translation Tables should be considered more reliable.
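A first step towards an answer is to measure how much two tables agree. The sketch below, in Python, is one illustrative way to do it and not an NCBI method. It assumes that each data row of a .ptt file is tab-separated, with the location (start..end) in the first column and the strand (+ or -) in the second. ORFs are counted as identical, as sharing the stop position but not the start, or as present in only one table. The function names are hypothetical.

```python
def read_orfs(path):
    """Return a set of (start, end, strand) tuples from a .ptt file."""
    orfs = set()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or fields[1] not in ("+", "-"):
                continue  # header line or unexpected row
            start, _, end = fields[0].partition("..")
            if start.isdigit() and end.isdigit():
                orfs.add((int(start), int(end), fields[1]))
    return orfs


def stop_key(orf):
    """Return the coordinate of the 3' end of the ORF, plus its strand."""
    start, end, strand = orf
    return (end if strand == "+" else start, strand)


def compare_orf_sets(reference, prediction):
    """Count exact matches, shared stops and table-specific ORFs."""
    exact = reference & prediction
    ref_rest = reference - exact
    pred_rest = prediction - exact
    ref_stops = {stop_key(orf) for orf in ref_rest}
    pred_stops = {stop_key(orf) for orf in pred_rest}
    same_stop = {orf for orf in pred_rest if stop_key(orf) in ref_stops}
    only_ref = {orf for orf in ref_rest if stop_key(orf) not in pred_stops}
    return {
        "exact": len(exact),
        "same_stop_different_start": len(same_stop),
        "only_in_reference": len(only_ref),
        "only_in_prediction": len(pred_rest - same_stop),
    }
```

For example, compare_orf_sets(read_orfs("NC_000912.ptt"), read_orfs("NC_000912.Glimmer2.ptt")) would summarize the differences between the two tables. Agreement between tables does not prove that an ORF is real, but ORFs that appear in only one table are natural candidates for closer inspection.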

Using a WWW Browser.

Use Netscape or a similar browser to make a connection to the following address: ftp://ftp.ncbi.nih.gov/
You will have a view of the main directory of the NCBI's FTP server.
The browser allows you to move within the hierarchy of directories, but now you are using a graphical interface. You could try to repeat some of the instructions suggested in the previous exercise.

B. Accessing the NCBI server through the WWW.

Open Netscape or a similar browser and make a connection to the NCBI's WWW server, at the address: http://www.ncbi.nlm.nih.gov/
Open Entrez, the general interface to NCBI databases, one of which is GenBank.
Select the Protein database.

Use the Mycoplasma pneumoniae genome accession number, U00089, and click on GO.
You will see an error message, of course, because the accession number points to a nucleotide sequence.

Now select the Nucleotide database. This database includes GenBank records, but also nucleotide sequences from other databases, such as RefSeq and PDB.

Enter Mycoplasma pneumoniae and click on GO.
You will find more than 500 entries that contain the expression "Mycoplasma pneumoniae". One of them is the complete genome. Instead of trying to find it in the list, we will search again.
Enter the accession number of the Mycoplasma pneumoniae genome (U00089) and click on GO.
You will get information about the Mycoplasma pneumoniae entry in Summary format, that is, just the accession number and a brief description. You can now choose among different display options.

If you select GenBank from the Display menu, or click on the accession number, you will see a document with a HEADER section that includes information about the entry, for example, the locus name and accession numbers, bibliographic references and submission information. This is a special type of GenBank document because, instead of the SEQUENCE section, it contains a CONTIG section that includes a list of contigs (referred to by their accession numbers). These are the segments into which the genome sequence was divided at the time of submission.
If you select FASTA from the Display menu, you will get the complete sequence of the genome, reconstructed by joining the contigs that were listed in the GenBank document.
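If you save a record in GenBank format, the difference between the two kinds of document can also be checked with a short script. The Python sketch below is an illustration under assumptions: top-level keywords (LOCUS, FEATURES and so on) start in the first column, the sequence section is introduced by the keyword ORIGIN, and the CONTIG section lists segments as accession:start..end inside a join(...) statement. The function names are invented for this example.

```python
import re

# Matches an accession (optionally with a version) followed by ":",
# as written inside a CONTIG join(...) statement.
ACCESSION_IN_JOIN = re.compile(r"([A-Z][A-Z_]*\d+(?:\.\d+)?):")


def top_level_keywords(lines):
    """Return the keywords that start in column 1, in order of appearance."""
    keywords = []
    for line in lines:
        if line.startswith("//"):  # end-of-record marker
            break
        if line[:1].isupper():
            keywords.append(line.split()[0])
    return keywords


def sequence_sections(lines):
    """Report which of CONTIG and ORIGIN the record contains."""
    keywords = top_level_keywords(lines)
    return [name for name in ("CONTIG", "ORIGIN") if name in keywords]


def contig_accessions(lines):
    """Return the accessions listed in the CONTIG section, if there is one."""
    text = []
    in_contig = False
    for line in lines:
        if line.startswith("CONTIG"):
            in_contig = True
            text.append(line[len("CONTIG"):])
        elif in_contig and line.startswith(" "):
            text.append(line)  # continuation line of the join statement
        elif in_contig:
            break
    return ACCESSION_IN_JOIN.findall("".join(text))
```

Reading a saved file with open("U00089.gbk").read().splitlines() and passing the result to sequence_sections would show whether that particular file stores the sequence itself or only the list of contigs.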

Now display the U00089 entry in GenBank format again, and click on any of the contig accession numbers.

You will get a document in standard GenBank format (HEADER, FEATURES and SEQUENCE). In the HEADER, you should see a line like the following:

COMMENT On Nov. 27, 2000 this sequence version replaced gi:xxxxxxx.

This comment indicates that there are other versions of this entry. Each version has the same accession numbers (U00089 for the complete genome, and another one for the current contig), but a different gi number.
If you click on the gi number of the obsolete entry, you will see its GenBank record, annotated with the comment:

This record was replaced or removed

Revision history

Now, retrieve entry NC_000912, the NCBI curated version of the Mycoplasma pneumoniae genome, from the nucleotide database.

You will get a Summary view of the entry.
As before, it is possible to visualize the entry in several formats, by choosing among the options of the Display menu. For example:

Taxonomic information about Mycoplasma pneumoniae, by choosing Taxonomy Links.
Bibliographic references, by choosing PMC Links or PubMed Links.
FASTA formatted nucleotide sequence, by choosing FASTA.
Complete entry in GenBank format, by choosing GenBank.

You have probably noticed that once you have chosen a specific format, the display options change. If you have displayed a GenBank formatted document, they should now include the option Graphics, which produces a linear map of the genome, together with a segment of its sequence. You can move along the genomic sequence and zoom the map.
If you click on the link Protein coding genes, located above the map in the Graphics display, you will get the Mycoplasma pneumoniae Protein Translation Table.

List of protein sequence entries associated with NC_000912, by choosing Protein Links.

When you choose Protein Links, you move automatically from the Nucleotide database to the Protein database.
You will now see a Summary view of the 689 entries that correspond to all the proteins predicted to be encoded in NC_000912.
If you select FASTA, among the display options, you will retrieve the amino acid sequences of all those predicted proteins, in FASTA format.

At any moment, you can save your results as a file on your local computer, by several methods:

You can use the menu commands File-Save as-Text or File-Save as-Source, of the web browser.
You can use the option Send to - File, of the Entrez interface.
Or you can use the option Send to - Text, of the Entrez interface, followed by the menu commands File-Save as-Text or File-Save as-Source, of the web browser.
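Records can also be saved without a browser. The following Python sketch is not part of the original lesson: it assumes the NCBI E-utilities efetch service at the URL shown in the code, with the parameters db, id, rettype and retmode, and it should be checked against the current NCBI documentation before being relied on. The function names efetch_url and save_entry are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="fasta", db="nuccore"):
    """Build the request URL for one entry (rettype "fasta" or "gb")."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def save_entry(accession, out_path, rettype="fasta", db="nuccore"):
    """Download one entry as plain text and write it to `out_path`."""
    url = efetch_url(accession, rettype, db)
    with urlopen(url) as response, open(out_path, "wb") as out:
        out.write(response.read())
    return out_path
```

Under these assumptions, save_entry("NC_000912", "NC_000912.fna") would play the role of choosing FASTA in the Display menu and then Send to - File.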

Now, select the Genome database.

Enter the terms Mycoplasma pneumoniae and click on GO.

You will get a list of 14 entries that contain the expression Mycoplasma pneumoniae. All of them have accession numbers that start with "NC", which means that they are part of the RefSeq project.
Click on the entry with accession number NC_000912.
You will have access to all the information available at the NCBI databases about the genome of Mycoplasma pneumoniae (some of which you have seen already), for example:

Link to the FTP site to download sequence files.
Links to the Protein Translation Table and the Structural RNA Coding Table.
Link to a database of orthologous proteins (COGs).
Link to a list of Mycoplasma pneumoniae proteins that are similar to other proteins with known structure.
Link to the Microbial Genomes page of the Genomes database. Please follow that link.

You will see a summary of all microbial genomes in the database. For each of them there are links to:

Entries in GenBank format.
Nucleotide sequence in FASTA format.
Protein translation tables.
Families of orthologous proteins.

C. Accessing the EMBL nucleotide database through the WWW.

Open Netscape or a similar browser and make a connection to the EMBL databases server, which is hosted by the European Bioinformatics Institute, at the address: http://www.ebi.ac.uk/Databases/index.html
Select the link Nucleotide Databases (on the left side of the page); then, the link to the Genomes server (in the table).
Select the link to Bacteria (on the left side).

You will see a table with a list of the bacterial genomes available from this database.
Find Mycoplasma pneumoniae.

By clicking on the name Mycoplasma pneumoniae, you will get taxonomic information from Newt, the taxonomy database of the EBI.
By clicking on the accession number U00089, you will retrieve the complete entry, in EMBL format.
By clicking on Proteome (last column), you will get a lot of information about the Mycoplasma pneumoniae proteome: protein size distribution, amino acid composition statistics, protein functional annotation distribution (according to the Gene Ontology), comparisons with other proteomes, etc.

D. Accessing the EMBL nucleotide database by FTP (using a WWW browser).

Open Netscape or a similar browser and make a connection to the EMBL databases server, at the address: http://www.ebi.ac.uk/Databases/index.html

Click on the tab labeled Downloads (in the upper part of the page); then on the link to Databases (in the table).

You will be in the main directory of the EBI databases FTP site. You can use the browser to travel through the directories. Take some time to have a look and check what you can find.

Follow the path to: genomes/bacteria/mpneumoniae/U00089.embl
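Once the file is saved, a few lines of code are enough to get a first impression of the EMBL format and to compare it with the GenBank one. The Python sketch below relies on the general layout of EMBL flat files: each line starts with a two-letter code (ID, AC, DE, SQ and so on), the sequence follows the SQ line, and the entry ends with "//". The function name and the choice of fields are just for illustration.

```python
def summarize_embl_entry(lines):
    """Collect ID, AC and DE lines and count the residues after SQ."""
    fields = {"ID": [], "AC": [], "DE": []}
    sequence_length = 0
    in_sequence = False
    for line in lines:
        code = line[:2]
        if code == "//":  # end-of-entry marker
            break
        if in_sequence:
            # sequence lines hold letters in blocks, plus a running position
            sequence_length += sum(ch.isalpha() for ch in line)
        elif code == "SQ":
            in_sequence = True
        elif code in fields:
            fields[code].append(line[2:].strip())
    accessions = [
        acc.strip()
        for line in fields["AC"]
        for acc in line.split(";")
        if acc.strip()
    ]
    return {
        "id": " ".join(fields["ID"]),
        "accessions": accessions,
        "description": " ".join(fields["DE"]),
        "sequence_length": sequence_length,
    }
```

Applied to the lines of U00089.embl, the summary should show the same accession number as the GenBank entry, and the sequence length can be compared with that of the FASTA file downloaded from the NCBI.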

February 2004

Manuel J. Gómez
Grupo de Diseño de Proteínas
Centro Nacional de Biotecnología, CSIC