Gene Expression Pattern Preprocessing
 
--  Manual  --

Javier Herrero
Bioinformatics Unit. CNIO

3rd October 2002

Manual

Gene expression patterns

Gene expression data are ratios of expression levels of two different samples. Usually, one sample corresponds to the experimental condition and the other one is a reference sample (Figure 1).

Figure 1: Experimental design
Image extract_RNA.gif

mRNA is extracted from two different samples and labeled with a fluorescent probe. Both samples are mixed before hybridization.

A gene expression pattern is a suite of expression data for a gene along a suite of different experimental conditions (Figure 2).

Figure 2: Different representations of two gene expression patterns
A





Image pattern1.gif
B
Image pattern2.gif

(A) Gene expression patterns in a common plot. (B) Gene expression patterns in a color scale. In this case, higher values appear in green and lower ones in red.

Step 1: Scale Transformation

Since we are looking at expression ratios, we will get the patterns in an asymmetrical scale: over-expressions will have values between 1 and infinite while over-repressions will be between 1 and 0. In order to give the same weight to both over-expressions and over-repressions we need to transform the scale (Figure 3). The most common function used for this task is the log-transformation.

Figure 3: Log-Transformation
A



Image log_transform1.gif
B



Image log_transform2.gif

(A) Gene expression patterns before log-transformation: gene B seems to be nearly flat (B) Gene expression patterns after log-transformation: it is now possible to see that gene A and gene B are symmetrical.

The Expression Data Preprocessor has a function for log-transforming the patterns (Figure 4). You can choose among 2, e and 10 for the base of log-transformation.

Figure 4: Interface to log-transformation function
Image snap_log_transform.png

Step 2: Replicate Handling

In order to check the quality of the final result it is usual to spot several times the same cDNA clone or different ones representing the same gene on the cDNA array. We will get several values of expression for those genes but we would like to have only one to continue the analysis of the results.

The solution is quite simple: we have to get the average or the median value of all the replicates, but it is desirable to check first on inconsistencies among replicates. The system includes a function for removing inconsistent replicates based on maximum distance to median. For each replicated gene, it calculates the median and remove all the values that are beyond the threshold from the median (Table 1).


Table 1: Examples of inconsistent replicate removing
  Replicated values Median Threshold Accepted range Resulting values
1 2.6; 3.8 3.2 1.0 2.2 -- 4.2 2.6; 3.8
2 1.3; 1.6; 3.8 1.6 1.5 0.1 -- 3.1 1.3; 1.6
3 -2.2; -1.8; -1.6; 0.3 -1.7 1.0 -2.7 -- -0.7 -2.2; -1.8; -1.6
4 0.8; 4.2 2.5 1.0 1.5 -- 3.5 --
5 -0.3; 1.1; 1.3 1.1 0.5 0.6 -- 1.6 1.1; 1.3

Once inconsistent replicates have been removed, it is possible to merge them together. Two options are available at this stage: getting the average or the median of the replicates (Table 2).


Table 2: Examples of replicate merging
  Resulting values Average Median
1 2.6; 3.8 3.2 3.2
2 1.3; 1.6 1.45 1.45
3 -2.2; -1.8; -1.6 -1.87 -1.8
5 1.1; 1.3; 1.8 1.4 1.3

Figure 5: Interface to replicate handling functions
Image snap_replicates.png

Step 3: Missing Value Handling

One of the characteristic of the gene expression patterns is the existence of missing values in the data set. The excess of missing values can be a problem for standard hierarchical clustering methods where all the calculations are based on a distance matrix. If two patterns have not enough points in common, the distance between them will be undefined and the clustering will fail. Moreover, other methods like principal component analysis cannot deal with missing values. Using imputation for these values can bypass these problems but it could be still necessary to remove beforehand some patterns if there are not enough data (see [Troyanskaya et al., 2001] for further reading).

The function for removing patterns with too many missing values is based on the percentage of existing values, i.e., on the percentage of non-missing values (Table 3)


Table 3: Examples of removing patterns with excess of missing values
 
Length of patterns
(number of conditions)
Number of
missing values
Number of
existing values
Persentage of
existing values
Result
1 10 3 7 70 % Accept
2 20 7 13 70 % Remove
3 5 3 2 50 % Remove
4 12 2 10 75 % Accept

Actually, the system has 4 different options for imputing missing values: (1) filling blanks with zeros, (2) with row (pattern) average, (3) with row median or (4) using KNN-imputation method. KNN-imputation method is a method which uses the K nearest neighbors (where K is a user-defined parameter and euclidean distance is used for determining the nearest neighbors) to impute the missing values (Table 4).


Table 4: Example of missing values imputation for different imputation methods
Method x$_{1}$ x$_{2}$ x$_{3}$ x$_{4}$ x$_{5}$ x$_{6}$ x$_{7}$ x$_{8}$ x$_{9}$ x$_{10}$ x$_{11}$
Original pattern 1.2 1.5 -- 2.1 3.1 -- 2.2 0.8 -- -0.3 0.5
Fill with 0's 1.2 1.5 0 2.1 3.1 0 2.2 0.8 0 -0.3 0.5
Fill with row average 1.2 1.5 1.39 2.1 3.1 1.39 2.2 0.8 1.39 -0.3 0.5
Fill with row median 1.2 1.5 1.35 2.1 3.1 1.35 2.2 0.8 1.35 -0.3 0.5
KNNimpute 1.2 1.5 1.8 (*) 2.1 3.1 3.2 (*) 2.2 0.8 0.5 (*) -0.3 0.5

(*) Actual results depend on the value of K and on the other patterns

In principle KNN-imputation works much better than the other methods but it requires to have enough complete patterns (patterns with no missing values) in the data set to be confident of finding real neighbors of the patterns with missing values. It also requires to have enough existing values in patterns with missing values in order to be able to determine their neighbors. Usually one expect to have at least 1000 complete patterns in the data set and at the very minimum 5 existing values for using this method.

Figure 6: Interface to missing value handling functions
Image snap_missing_values.png

Step 4: Flat Pattern Filtering

It is desirable to remove flat patterns because we cannot distinguish signal form noise in such patterns. Moreover, we will get strange results while clustering if we use a correlation metrics since flat patterns can correlate with nearly anything. There are different functions for flat patterns filtering depending on the actual definition of ``flat'' pattern. The simplest one is based on number of peaks in the pattern. The other two functions are based on the variability of the values in the pattern: the root mean square (RMS; cf. (1)) and the standard deviation (stddev; cf. (2)).


$\displaystyle RMS$ $\textstyle =$ $\displaystyle \sqrt{\frac{\sum _{i}x_{i}^{2}}{n}}$ (1)
$\displaystyle stddev$ $\textstyle =$ $\displaystyle \sqrt{\frac{\sum _{i}(x_{i}-\bar{x})^{2}}{n}}$ (2)

Figure 7: Example of patterns for different flat pattern definitions
A





Image flat_patterns_example.png

B 
  x$_{1}$ x$_{2}$ x$_{3}$ x$_{4}$ x$_{5}$ x$_{6}$ x$_{7}$ x$_{8}$
A 1.2 1.5 1.8 2.1 3.1 3.5 2.2 0.8
B -0.3 -0.9 0.9 1.8 2.2 1.8 -0.7 -1.1
C 3.1 3.9 2.7 4.1 1.8 2.3 3.5 3.9
D -2.1 -1.3 -0.7 0.2 1.3 0.6 0.2 -0.1

C 
 
Peaks
(Thr. $\pm 1)$
Peaks
(Thr. $\pm 2)$
RMS stddev
A 7 peaks 4 peaks 2.2 0.92
B 3 peaks 1 peak 1.36 1.36
C 8 peaks 7 peaks 3.26 0.84
D 3 peaks 1 peak 1.04 1.09

(A) Representation of expression patterns of genes A, B, C and D. (B) Numerical values of their expression patterns. (C) Number of peaks, RMS and stddev of their expression patterns.

In those examples (Figure 7) we see the difference between RMS and stddev: genes A and C have an RMS much higher than their stddev because the average value of their patterns is far from zero, whereas genes B and D show nearly the same RMS as stddev because their average values are approximately zero.

Different authors used different methods. For example, [Iyer et al., 1999] used a combination of peak and stddev for filtering flat patterns: they only took into account patterns (in $log_{2}$ scale) with at least two peaks greater than 2.2 (or lower than -2.2) and patterns with a stddev greater than 0.7. [Chu et al., 1998] used a threshold of 1.13 of RMS (in $log_{2}$ scale) for filtering flat patterns.

Figure 8: Interface to flat pattern filtering functions
Image snap_flat_patterns.png

Step 5: Unknown Gene Removing

All previous filters were based on the gene expression patterns, whereas this one is based on the names of the genes. The Expression Data Preprocessor reads a list of known genes from an extra file and removes all the patterns corresponding to genes that are not present in this list. This option can be used if you want to concentrate in a particular family of genes, or apply the filtering steps from a different dataset. For example, it is possible to apply several filtering criteria to a subset of the experimental conditions and use the result to filter the original dataset.

Figure 9: Interface to unknown gene removing function
Image snap_unknown_genes.png

Step 6: Pattern Standardization

Finally, there is an option for standardizing the patterns. Pattern standardization is nothing but subtracting the average value of the pattern to each value and dividing the result by the stddev (cf. ([*])):


$\displaystyle x_{i}^{standardized}$ $\textstyle =$ $\displaystyle \frac{x_{i}-\bar{x}}{stddev}$ (3)

Figure 11: Examples of gene expression patterns after standardization
A





Image before_standardization_example.png

B 
  x$_{1}$ x$_{2}$ x$_{3}$ x$_{4}$ x$_{5}$ x$_{6}$ x$_{7}$ x$_{8}$ Average stddev
A 1.20 2.60 3.20 2.00 3.10 4.30 3.70 1.40 2.69 1.10
B -0.50 0.20 0.40 -0.50 0.40 0.90 0.60 -0.90 0.08 0.63
C -2.50 -3.20 -3.10 -2.20 -0.80 -1.30 -0.50 -1.70 -1.91 1.01
D 1.50 0.80 0.70 1.50 3.10 2.50 3.40 2.70 2.03 1.04

(A) Representation of expression patterns of genes A, B, C and D before pattern standardization. (B) Numerical values of their expression patterns, their average value and their stddev.


Figure 10: Examples of gene expression patterns before standardization
A




Image after_standardization_example.png

  x$_{1}$ x$_{2}$ x$_{3}$ x$_{4}$ x$_{5}$ x$_{6}$ x$_{7}$ x$_{8}$ Average stddev
A 1.36 -0.08 0.47 -0.63 0.38 1.47 0.92 -1.17 0.0 1.0
B -0.91 0.20 0.51 -0.91 0.51 1.31 0.83 -1.54 0.0 1.0
C -0.58 -1.27 -1.18 -0.28 1.10 0.61 1.40 0.21 0.0 1.0
D -0.51 -1.18 -1.28 -0.51 1.04 0.46 1.33 0.65 0.0 1.0

(A) Representation of expression patterns of genes A, B, C and D after pattern standardization. (B) Numerical values of their expression patterns, their average value and their stddev.

This function is used to bring all the patterns of the data set into the same range for an easier comparison. In particular, this transformation is applied when one wants to calculate correlations among gene expression patterns but the method he wants to use only permits to calculate euclidean distances: an euclidean metrics to standardized patterns is comparable to a correlation metrics on original patterns.

Nevertheless, pattern standardization present a serious drawback when dealing with flat patterns: the method can produce a big noise if the stddev of the pattern is close to zero, moreover, the result is undefined if the stddev is exactly zero. Thus it is desirable to remove flat patterns beforehand.

Figure 12: Interface to pattern standardization function
Image snap_standardization.png

Pre-Analyse Module

The most interesting feature of this interface is its Pre-Analysis module. It performs several checks and plots different histograms to help and to guide the user through the options. It automatically selects or unselects different options and gives useful information to the user.

Here is the list of topics covered by the pre-analyse module:

It is important to emphasize that the Pre-Analyse module does not change the original file: the user can ignore the recommendations simply by unselecting the corresponding options.

Figure 13: Result of Pre-Analyse on diauxic shift data set [DeRisi et al., 1997]
Image snap_preanalyse.png

Figure 14: Histogram of values in the original data set (data from [DeRisi et al., 1997])
Image orig_distrib.png

Figure 15: Histogram of values in the data set after log2-transformation (data from [DeRisi et al., 1997])
Image log2_distrib.png

Figure 16: Histogram of distances from replicates to their median (data from [DeRisi et al., 1997])
Image median_distrib.png

Figure 17: Histogram of missing values by pattern (data from [DeRisi et al., 1997])
Image miss_distrib.png

Figure 18: Histogram of remaining patterns after peak filtering (data from [DeRisi et al., 1997])
Image peaks_distrib.png

Figure 19: Histogram of RMS (data from [DeRisi et al., 1997])
Image rms_distrib.png

Figure 20: Histogram of stddev (data from [DeRisi et al., 1997])
Image stddev_distrib.png

Interface

Help

All functions have a small square with an ``i'' inside: this is a link to the help section related to that function (see Figures 4, 5, 6, 8, 9 and 12).

Setting and Unsetting Options

If your browser understands and uses JavaScript, you can click on the names of the functions for setting and unsetting them (see Figures 4, 5, 6, 8, 9 and 12) and on the names of their options (see Figures 5 and 6) for selecting them.

Messages

This server returns four kinds of messages:

Input File Format

Figure 21: Example of input file format (data from [Chu et al., 1998])
Image file_format.png

Output File Formats

Preprocessed data will appear in the same format as the input file format. The file contains a comment line for each function used. This is the file format understood by the hierarchical clustering and the SOTA [Herrero et al., 2001] servers. Additionnaly, there is an extra file for SOM [Kohonen, 1997] and SomTree [Herrero & Dopazo, 2002], two files for sending the preprocessed data set to the EPClust [Brazma & Vilo, 2000] and a file containing the list of removed genes. The last one is used in the FatiGO (Al-Shahrour et al., in preparation) analysis for comparing removed against remaining genes.

In SOM file format the gene names appear at the end of the patterns and the first line must contain the length of patterns. EPClust is able to use two different file: one for gene expression patterns and the other one for gene descriptions.

Privacy and Security

Uploaded data set are saved in temporary directories in the server and are accessible through the web until they are erased after some time. Anybody can access those directories, nevertheless the name of the directories are not trivial, thus it is not easy for a third person to access your data.

In any case, you should keep in mind that communications between the client (your computer) and the server are not encripted at all, thus it is also possible for somebody else to look at your data while you are uploading or dowloading them.

Exercises

All the data set needed for these exercises are available here

Exercise 1

  1. Go to the ``Diauxic Shift'' [DeRisi et al., 1997] data set web page [Click here]
  2. Send the data set to the Preprocessor by clicking on ``preprocess''
  3. Pre-Analyse the data set by clicking on the ``Pre-Analyse'' button.

Questions:

  1. How many gene expression patterns are in the data set?
  2. How many of them are replicated?
  3. Are there a lot of missing values?
  4. Was the data set already in a symmetrical scale?
  5. How many gene expression patterns have 3 missing values?
  6. How many genes will remain if we want to keep only genes with at least 3 peaks higher than 1 or lower than -1 (in a $log_{2}$ scale)
  7. Which functions are activated by default?

Exercise 2

  1. Go to the ``Yeast cell cycle'' [Spellman et al., 1998] data set web page [Click here]
  2. Send the data set to the Preprocessor by clicking on ``preprocess''
  3. Pre-Analyse the data set by clicking on the ``Pre-Analyse'' button.

Questions:

  1. How many gene expression patterns are in the data set?
  2. Was the data set already in a symmetrical scale?
  3. How many of them are replicated?
  4. Are there a lot of missing values?
  5. How many gene expression patterns have no missing values?
  6. Which method will you use if you want to impute missing values?
  7. How many genes will be removed if we use the ``Filter flat patterns by RMS'' option with a threshold of 2 (in a $log_{2}$ scale)?
  8. And if we use a threshold of 0.2 (in a $log_{2}$ scale)?
  9. Which functions are activated by default?

Exercise 3

First Part:

  1. Go to the ``Response of Human Fibroblasts to Serum'' [Iyer et al., 1999] data set web page [Click here]
  2. Send the data set to the Preprocessor by clicking on ``preprocess''
  3. Pre-Analyse the data set by clicking on the ``Pre-Analyse'' button.

Questions:

  1. How many gene expression patterns are in the data set?
  2. Was the data set already in a symmetrical scale?
  3. How many missing values are in the data set?
  4. Propose a good threshold of maximum distance to median of replicates for removing inconsistent replicates.
  5. Are the patterns already standardized?
  6. Which functions are activated by default?

Second Part:

  1. Activate the ``Remove Inconsistent Replicates'' function and put the threshold you proposed before.
  2. Activate and set up any combination of flat pattern filtering function.
  3. Click on the ``Run'' button.

Questions:

  1. How many gene expression patterns are now in the data set?
  2. Is the data set in a symmetrical scale?
  3. How many files do appear?
  4. What are the differences between <fibroblasts_ori.txt> and <fibroblasts_ori.dat>?

Exercise 4

First Part:

  1. Go to the ``Lymphoma/leukemia Molecular Profiling Project'' [Alizadeh et al., 2000] data set web page [Click here]
  2. Send the data set to the Preprocessor by clicking on ``preprocess''
  3. Pre-Analyse the data set by clicking on the ``Pre-Analyse'' button.

Questions:

  1. Is the data set already in a symmetrical scale?
  2. How many missing values are in the data set?
  3. How many complete gene expression patterns are available for KNN imputation?

Second Part:

  1. Activate the ``Log-transform'' function and choose your preferred base for logarithm.
  2. Activate the ``Impute missing values'' function and choose the ``use KNNimpute'' option.
  3. Activate the ``Filter flat patterns by standard deviation'' function and set up the threshold to 3.
  4. Click on the ``Run'' button.

Questions:

  1. What happen when the server tried to log-transform the data set?
  2. How many complete gene expression patterns are available for KNN imputation now? Why?
  3. How many gene expression patterns are now in the data set?

Bibliography

Alizadeh et al., 2000
Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Jr, J. H., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O. & Staudt, L. M. (2000).
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
Nature, 403:503-511.

Brazma & Vilo, 2000
Brazma, A. & Vilo, J. (2000).
Gene expression data analysis.
FEBS Letters, 480(1):17-24.

Chu et al., 1998
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O. & Herskowitz, I. (1998).
The Transcriptional Program of Sporulation in Budding Yeast.
Science, 282(5389):699-705.

DeRisi et al., 1997
DeRisi, J. L., Iyer, V. R. & Brown, P. O. (1997).
Exploring the metabolic and genomic control of gene expression on a genomic scale.
Science, 278:680-686.

Herrero & Dopazo, 2002
Herrero, J. & Dopazo, J. (2002).
Combining hierarchical clustering and self-organizing maps for exploratory analysis of gene expression patterns.
Journal of Proteome Research, 1(5):467-470.

Herrero et al., 2001
Herrero, J., Valencia, A. & Dopazo, J. (2001).
A hierarchical unsupervised growing neural network for clustering gene expression patterns.
Bioinformatics, 17(2):126-138.

Iyer et al., 1999
Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C. F., Trent, J. M., Staudt, L. M., Jr., J. H., Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D. & Brown, P. O. (1999).
The transcriptional program in the response of human fibroblasts to serum.
Science, 283:83-87.

Kohonen, 1997
Kohonen, T. (1997).
Self-organazing Maps.
Springer-Verlag, Berlin.

Spellman et al., 1998
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998).
Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization.
Mol. Biol. Cell, 9(12):3273-3297.

Troyanskaya et al., 2001
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. & Altman, R. B. (2001).
Missing value estimation methods for DNA microarrays.
Bioinformatics, 17(6):520-525.