Help for the ExPASy BLAST Interface

Query sequence

Enter a query protein sequence in raw format or a SWISS-PROT, TrEMBL or TrEMBL-new accession number.

Output format

HTML - BLAST native output format with hyperlinks and some formatting.
NiceBlast - View with full descriptions and organism sources.
Plain Text - Text format with no links.

BLAST program and databases

Programs available on ExPASy
blastp	compares a protein query sequence against a protein sequence database.
tblastn	compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
Programs available elsewhere
blastn	compares a nucleotide query sequence against a nucleotide sequence database. Available at EMBnet Switzerland
blastx	compares a nucleotide query sequence translated in all reading frames against a protein sequence database. Available at EMBnet Switzerland
tblastx	compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Available at EMBnet Switzerland
PSI-BLAST	Position Specific Iterative BLAST detects weak homologs by building a profile from a multiple alignment of the highest scoring hits in an initial BLAST search. Available at NCBI
PHI-BLAST	Pattern-Hit Initiated BLAST combines matching of regular expressions with local alignments surrounding the match. Available at NCBI
xblast	a totally unrelated game. Available at xblast center
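
The translation "in all reading frames" that tblastn, blastx and tblastx rely on can be illustrated with a short Python sketch. This is not the code BLAST itself uses: the function names are hypothetical, the standard genetic code is assumed, and any codon containing an ambiguous base is rendered as X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumption:
# no alternative genetic codes are handled in this sketch).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six strings is then compared as a protein sequence, which is why a protein query can be matched against a nucleotide database.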

Databases

Protein Databases

SWISS-PROT	Manually annotated protein sequence database (over 100000 entries). Includes weekly updates and splice variants.
SWISS-PROT, TrEMBL and TrEMBL-new	TrEMBL is a computer-annotated supplement to SWISS-PROT with some redundancy (over 600000 entries). TrEMBL-new contains the translations of the newest submissions to the EMBL database. Contains all consolidated proteins and ORFs, with weekly updates and annotated splice variants.
complete microbial proteomes	Non-redundant sets of all the proteins from complete genome sequencing projects, compiled from SWISS-PROT and TrEMBL.
Translated EST	Protein sequences derived from EST sequencing data (human, mouse, rat, zebrafish, Drosophila, bovine, Arabidopsis). This database contains many potential errors because of the low quality of the data.

DNA Databases (for tblastn)

All databases are subdivided into taxonomic sections, selectable from the Taxonomic groups drop-down list.

All EMBL + GSS	All entries from the EMBL database (equivalent to GenBank and DDBJ).
HTG	Unverified data from high-throughput genomic sequencing. Usually in the form of cosmids.
dbEST	Expressed sequence tag database from the NCBI.
EST contigs	Database of contigs based on EST clusters from Unigene (human, mouse, rat, bovine, zebrafish) and SwissClusters (Drosophila melanogaster, Arabidopsis thaliana).
Unigene EST	Database of EST clusters (lists of ESTs known to match the same cDNA) from the NCBI (updated occasionally). This database also contains useful information such as STS matches, tissue distribution, and transcript maps.
Complete genomes	Genomes released in the form of a complete, assembled sequence.

Taxonomic groups

A taxonomic subselection can be made through a free-text input field for blastp (except on Translated EST) and through a drop-down list of database subsections for tblastn.

For blastp, you may enter a numeric NCBI TaxID (e.g. 10090), a taxon (e.g. Bacteria), or a species name in either Latin or English. For the list of known species names and synonyms, see SWISS-PROT species list. As the hits are filtered in a post-processing stage, a taxonomic restriction may result in a significant delay.

A display of the BLAST hits as a taxonomic tree is also available from the result page, by clicking on the "Taxonomic view of BLAST hits" button.

E-mail address

Enter your e-mail address to receive the results by e-mail. Otherwise, they will be displayed interactively in your browser. The e-mail option is recommended for tblastn searches on big databases such as EMBL. If your interactive search takes too long, you will receive an error message asking you to resubmit via e-mail.

Options

Comparison matrix

The matrix assigns a score to each position in an alignment. In the BLOSUM matrices, this score is based on the frequency with which the substitution is known to occur among consensus blocks within related proteins. BLOSUM62 is among the best of the available matrices for detecting weak protein similarities. The PAM set of matrices is also available. If the "Auto-select" option is selected (default), the matrix is chosen according to the query sequence length, based on the following (empirically constructed) table:

Query length	Substitution matrix
<35	PAM-30
35-50	PAM-70
50-85	BLOSUM-80
>85	BLOSUM-62
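
As an illustration, the table above can be written as a small Python lookup. This is only a sketch, not the server's implementation: the function name is hypothetical, and because the table lists 50 in two ranges, the sketch assumes that a length of exactly 50 falls into the lower range (PAM-70).

```python
def auto_select_matrix(query_length):
    """Pick a substitution matrix name from the query length in residues.

    Assumption: a boundary value listed in two ranges (50) is assigned
    to the lower range; 85 belongs to 50-85 because the last row is ">85".
    """
    if query_length < 35:
        return "PAM-30"
    if query_length <= 50:
        return "PAM-70"
    if query_length <= 85:
        return "BLOSUM-80"
    return "BLOSUM-62"


if __name__ == "__main__":
    for length in (20, 40, 70, 300):
        print(length, auto_select_matrix(length))
```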

Setting the E threshold

The expectation value (E) threshold is a statistical measure of the number of expected matches in a random database. The lower the E-value, the more likely the match is to be significant. E-values between 0.1 and 10 are generally dubious, and values over 10 are unlikely to have biological significance. In all cases, the matches need to be verified manually. You may need to increase the E threshold in the following cases:

if you have a very short query sequence
to detect very weak similarities, or similarities in a short region
if your sequence has a low complexity region and you use the masking option

Filter the sequence for low-complexity regions

Low-complexity regions (e.g. stretches of cysteine in CSP_DROME (Q03751), hydrophobic regions in membrane proteins) tend to produce spurious, insignificant matches with database sequences that have the same kind of low-complexity regions but are biologically unrelated. If this option is checked, the query sequence will be run through the program SEG, and all amino acids in low-complexity regions will be replaced by X's, which will appear in the alignment. The masked regions will also be visible as slashed regions in the PaintBlast image.
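
To give an idea of what masking does, here is a deliberately simplified Python sketch. It is not the SEG algorithm: it only slides a fixed window along the sequence and replaces every window whose Shannon entropy is at or below a cutoff with X's. The function names, the window size and the cutoff are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace residues in low-entropy windows with 'X'.

    Simplified stand-in for SEG; `window` and `threshold` are
    illustrative values, not the settings used by the server.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) <= threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)


if __name__ == "__main__":
    query = "ACDEFGHIKLMN" + "C" * 12 + "PQRSTVWYADEF"
    print(mask_low_complexity(query))
```

A sequence shorter than the window is returned unchanged, and the masked output always has the same length as the input, so alignment coordinates are preserved.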

Gapped alignment

This option allows gaps to be introduced in the sequences during the comparison. It is usually left checked.

Output page

The output page is divided into three sections. The first is a summary of the hits, including the score and E-value of the best HSP for each hit. The second is a PaintBlast image summarizing the matching portions for each hit. The third contains the alignments between the query and the hits. From the summary of the hits, several operations may be performed on selected sequences. These operations are only available for blastp against the protein databases:

ClustalW is a multiple sequence alignment program.
T-COFFEE is a new alignment program that often gives better results than ClustalW, especially when dealing with divergent sequences and long insertions.
Clicking on "Retrieve selected sequences" retrieves several sequences at a time from the database. Individual entries are always available by clicking on the accession numbers.