Gene Expression Pattern Preprocessing

-- Manual --

Javier Herrero
Bioinformatics Unit. CNIO

3rd October 2002

Manual

Gene expression patterns

Gene expression data are ratios of expression levels of two different samples. Usually, one sample corresponds to the experimental condition and the other one is a reference sample (Figure 1).

Figure 1: Experimental design

mRNA is extracted from two different samples and labeled with a fluorescent probe. Both samples are mixed before hybridization.

A gene expression pattern is a suite of expression data for a gene along a suite of different experimental conditions (Figure 2).

Figure 2: Different representations of two gene expression patterns

(A) Gene expression patterns in a common plot. (B) Gene expression patterns in a color scale. In this case, higher values appear in green and lower ones in red.

Step 1: Scale Transformation

Since we are looking at expression ratios, we will get the patterns in an asymmetrical scale: over-expressions will have values between 1 and infinite while over-repressions will be between 1 and 0. In order to give the same weight to both over-expressions and over-repressions we need to transform the scale (Figure 3). The most common function used for this task is the log-transformation.

Figure 3: Log-Transformation

(A) Gene expression patterns before log-transformation: gene B seems to be nearly flat (B) Gene expression patterns after log-transformation: it is now possible to see that gene A and gene B are symmetrical.

The Expression Data Preprocessor has a function for log-transforming the patterns (Figure 4). You can choose among 2, e and 10 for the base of log-transformation.

**Figure 4:** **Interface to log-transformation function**

Step 2: Replicate Handling

In order to check the quality of the final result it is usual to spot several times the same cDNA clone or different ones representing the same gene on the cDNA array. We will get several values of expression for those genes but we would like to have only one to continue the analysis of the results.

The solution is quite simple: we have to get the average or the median value of all the replicates, but it is desirable to check first on inconsistencies among replicates. The system includes a function for removing inconsistent replicates based on maximum distance to median. For each replicated gene, it calculates the median and remove all the values that are beyond the threshold from the median (Table 1).

**Table 1:** **Examples of inconsistent replicate removing**
	Replicated values	Median	Threshold	Accepted range	Resulting values
1	2.6; 3.8	3.2	1.0	2.2 -- 4.2	2.6; 3.8
2	1.3; 1.6; 3.8	1.6	1.5	0.1 -- 3.1	1.3; 1.6
3	-2.2; -1.8; -1.6; 0.3	-1.7	1.0	-2.7 -- -0.7	-2.2; -1.8; -1.6
4	0.8; 4.2	2.5	1.0	1.5 -- 3.5	--
5	-0.3; 1.1; 1.3	1.1	0.5	0.6 -- 1.6	1.1; 1.3

Once inconsistent replicates have been removed, it is possible to merge them together. Two options are available at this stage: getting the average or the median of the replicates (Table 2).

**Table 2:** **Examples of replicate merging**
	Resulting values	Average	Median
1	2.6; 3.8	3.2	3.2
2	1.3; 1.6	1.45	1.45
3	-2.2; -1.8; -1.6	-1.87	-1.8
5	1.1; 1.3; 1.8	1.4	1.3

**Figure 5:** **Interface to replicate handling functions**

Step 3: Missing Value Handling

One of the characteristic of the gene expression patterns is the existence of missing values in the data set. The excess of missing values can be a problem for standard hierarchical clustering methods where all the calculations are based on a distance matrix. If two patterns have not enough points in common, the distance between them will be undefined and the clustering will fail. Moreover, other methods like principal component analysis cannot deal with missing values. Using imputation for these values can bypass these problems but it could be still necessary to remove beforehand some patterns if there are not enough data (see [Troyanskaya et al., 2001] for further reading).

The function for removing patterns with too many missing values is based on the percentage of existing values, i.e., on the percentage of non-missing values (Table 3)

Table 3: Examples of removing patterns with excess of missing values

Length of patterns
(number of conditions)

Number of
missing values

Number of
existing values

Persentage of
existing values

Result

70 %

Remove

50 %

Remove

75 %

Actually, the system has 4 different options for imputing missing values: (1) filling blanks with zeros, (2) with row (pattern) average, (3) with row median or (4) using KNN-imputation method. KNN-imputation method is a method which uses the K nearest neighbors (where K is a user-defined parameter and euclidean distance is used for determining the nearest neighbors) to impute the missing values (Table 4).

**Table 4:** **Example of missing values imputation for different imputation methods**
Method	x $_{1}$	x $_{2}$	x $_{3}$	x $_{4}$	x $_{5}$	x $_{6}$	x $_{7}$	x $_{8}$	x $_{9}$	x $_{10}$	x $_{11}$
Original pattern	1.2	1.5	--	2.1	3.1	--	2.2	0.8	--	-0.3	0.5
Fill with 0's	1.2	1.5	0	2.1	3.1	0	2.2	0.8	0	-0.3	0.5
Fill with row average	1.2	1.5	1.39	2.1	3.1	1.39	2.2	0.8	1.39	-0.3	0.5
Fill with row median	1.2	1.5	1.35	2.1	3.1	1.35	2.2	0.8	1.35	-0.3	0.5
KNNimpute	1.2	1.5	*1.8 ()**	2.1	3.1	*3.2 ()**	2.2	0.8	*0.5 ()**	-0.3	0.5

(*) Actual results depend on the value of K and on the other patterns

In principle KNN-imputation works much better than the other methods but it requires to have enough complete patterns (patterns with no missing values) in the data set to be confident of finding real neighbors of the patterns with missing values. It also requires to have enough existing values in patterns with missing values in order to be able to determine their neighbors. Usually one expect to have at least 1000 complete patterns in the data set and at the very minimum 5 existing values for using this method.

**Figure 6:** **Interface to missing value handling functions**

Step 4: Flat Pattern Filtering

It is desirable to remove flat patterns because we cannot distinguish signal form noise in such patterns. Moreover, we will get strange results while clustering if we use a correlation metrics since flat patterns can correlate with nearly anything. There are different functions for flat patterns filtering depending on the actual definition of ``flat'' pattern. The simplest one is based on number of peaks in the pattern. The other two functions are based on the variability of the values in the pattern: the root mean square (RMS; cf. (1)) and the standard deviation (stddev; cf. (2)).

$\displaystyle RMS$	$\textstyle =$	$\displaystyle \sqrt{\frac{\sum _{i}x_{i}^{2}}{n}}$	(1)
$\displaystyle stddev$	$\textstyle =$	$\displaystyle \sqrt{\frac{\sum _{i}(x_{i}-\bar{x})^{2}}{n}}$	(2)

Figure 7: Example of patterns for different flat pattern definitions

A

	x $_{1}$	x $_{2}$	x $_{3}$	x $_{4}$	x $_{5}$	x $_{6}$	x $_{7}$	x $_{8}$
A	1.2	1.5	1.8	2.1	3.1	3.5	2.2	0.8
B	-0.3	-0.9	0.9	1.8	2.2	1.8	-0.7	-1.1
C	3.1	3.9	2.7	4.1	1.8	2.3	3.5	3.9
D	-2.1	-1.3	-0.7	0.2	1.3	0.6	0.2	-0.1

Peaks
(Thr. $\pm 1)$

Peaks
(Thr. $\pm 2)$

RMS

stddev

7 peaks

4 peaks

2.2

0.92

3 peaks

1 peak

1.36

8 peaks

7 peaks

3.26

0.84

3 peaks

1 peak

1.04

1.09

(A) Representation of expression patterns of genes A, B, C and D. (B) Numerical values of their expression patterns. (C) Number of peaks, RMS and stddev of their expression patterns.

In those examples (Figure 7) we see the difference between RMS and stddev: genes A and C have an RMS much higher than their stddev because the average value of their patterns is far from zero, whereas genes B and D show nearly the same RMS as stddev because their average values are approximately zero.

Different authors used different methods. For example, [Iyer et al., 1999] used a combination of peak and stddev for filtering flat patterns: they only took into account patterns (in $log_{2}$ scale) with at least two peaks greater than 2.2 (or lower than -2.2) and patterns with a stddev greater than 0.7. [Chu et al., 1998] used a threshold of 1.13 of RMS (in $log_{2}$ scale) for filtering flat patterns.

**Figure 8:** **Interface to flat pattern filtering functions**

Step 5: Unknown Gene Removing

All previous filters were based on the gene expression patterns, whereas this one is based on the names of the genes. The Expression Data Preprocessor reads a list of known genes from an extra file and removes all the patterns corresponding to genes that are not present in this list. This option can be used if you want to concentrate in a particular family of genes, or apply the filtering steps from a different dataset. For example, it is possible to apply several filtering criteria to a subset of the experimental conditions and use the result to filter the original dataset.

**Figure 9:** **Interface to unknown gene removing function**

Step 6: Pattern Standardization

Finally, there is an option for standardizing the patterns. Pattern standardization is nothing but subtracting the average value of the pattern to each value and dividing the result by the stddev (cf. ()):

$\displaystyle x_{i}^{standardized}$

$\textstyle =$

$\displaystyle \frac{x_{i}-\bar{x}}{stddev}$

(3)

Figure 11: Examples of gene expression patterns after standardization

Image before_standardization_example.png

	x $_{1}$	x $_{2}$	x $_{3}$	x $_{4}$	x $_{5}$	x $_{6}$	x $_{7}$	x $_{8}$	Average	stddev
A	1.20	2.60	3.20	2.00	3.10	4.30	3.70	1.40	2.69	1.10
B	-0.50	0.20	0.40	-0.50	0.40	0.90	0.60	-0.90	0.08	0.63
C	-2.50	-3.20	-3.10	-2.20	-0.80	-1.30	-0.50	-1.70	-1.91	1.01
D	1.50	0.80	0.70	1.50	3.10	2.50	3.40	2.70	2.03	1.04

(A) Representation of expression patterns of genes A, B, C and D before pattern standardization. (B) Numerical values of their expression patterns, their average value and their stddev.

Figure 10: Examples of gene expression patterns before standardization

	x $_{1}$	x $_{2}$	x $_{3}$	x $_{4}$	x $_{5}$	x $_{6}$	x $_{7}$	x $_{8}$	Average	stddev
A	1.36	-0.08	0.47	-0.63	0.38	1.47	0.92	-1.17	0.0	1.0
B	-0.91	0.20	0.51	-0.91	0.51	1.31	0.83	-1.54	0.0	1.0
C	-0.58	-1.27	-1.18	-0.28	1.10	0.61	1.40	0.21	0.0	1.0
D	-0.51	-1.18	-1.28	-0.51	1.04	0.46	1.33	0.65	0.0	1.0

(A) Representation of expression patterns of genes A, B, C and D after pattern standardization. (B) Numerical values of their expression patterns, their average value and their stddev.

This function is used to bring all the patterns of the data set into the same range for an easier comparison. In particular, this transformation is applied when one wants to calculate correlations among gene expression patterns but the method he wants to use only permits to calculate euclidean distances: an euclidean metrics to standardized patterns is comparable to a correlation metrics on original patterns.

Nevertheless, pattern standardization present a serious drawback when dealing with flat patterns: the method can produce a big noise if the stddev of the pattern is close to zero, moreover, the result is undefined if the stddev is exactly zero. Thus it is desirable to remove flat patterns beforehand.

**Figure 12:** **Interface to pattern standardization function**

Pre-Analyse Module

The most interesting feature of this interface is its Pre-Analysis module. It performs several checks and plots different histograms to help and to guide the user through the options. It automatically selects or unselects different options and gives useful information to the user.

Here is the list of topics covered by the pre-analyse module:

File format: it checks that all the patterns have the same number of conditions. It adds extra missing values or removes extra ones if needed. It also checks that all the patterns have a valid identifier. It adds a unique one when necessary. It also keeps a list of symbols interpreted as missing values and displays this information. A warning message is displayed if there are too many of them, as could happen if the file uploaded is in a wrong format.
Dimensions of the data set: the server expects a dataset with more rows (genes) than columns (experimental conditions). It displays a warning message if it founds more conditions than patterns, as could happen if the data matrix is transposed.
Scale of expression patterns: the server plots the histogram of values found in the data set (Figure 14) and looks for negative values. It assumes the user did not log-transform the dataset if none is found. It displays a warning message, log-transforms the data internally to continue with the analysis, auto-selects the option for log-transforming the dataset and plots the new histogram of log-transformed values (Figure 15).
Replicated genes: the server looks for replicated genes and merge them into their average pattern. The list of replicated genes as well as the number of them is displayed. The server also plots an additional histogram of distances to median of replicates (Figure 16) to guide user in selecting a good threshold for removing inconsistent replicates.
Missing values: the server obtains the number of missing values for this dataset and suggests the best imputation method among the available ones depending on the amount of missing values, the amount of complete patterns and the number of conditions. It also plots an histogram of the number of missing values by pattern (Figure 17).
Histogram of peaks: this histogram informs the user about the number of remaining profiles after using the ``Filter flat patterns by number of peaks'' option versus the threshold used (see Figure 18).
Histogram of RMS: this is the histogram of RMS in the data set (see Figure 19).
Histogram of stddev: this is the histogram of stddev in the data set (see Figure 20).

It is important to emphasize that the Pre-Analyse module does not change the original file: the user can ignore the recommendations simply by unselecting the corresponding options.

**Figure 13:** **Result of Pre-Analyse on diauxic shift data set** [DeRisi *et al.*, 1997]

**Figure 14:** **Histogram of values in the original data set** (data from [DeRisi *et al.*, 1997])

**Figure 15:** **Histogram of values in the data set after log2-transformation** (data from [DeRisi *et al.*, 1997])

**Figure 16:** **Histogram of distances from replicates to their median** (data from [DeRisi *et al.*, 1997])

**Figure 17:** **Histogram of missing values by pattern** (data from [DeRisi *et al.*, 1997])

**Figure 18:** **Histogram of remaining patterns after peak filtering** (data from [DeRisi *et al.*, 1997])

**Figure 19:** **Histogram of RMS** (data from [DeRisi *et al.*, 1997])

**Figure 20:** **Histogram of stddev** (data from [DeRisi *et al.*, 1997])

Interface

Help

All functions have a small square with an ``i'' inside: this is a link to the help section related to that function (see Figures 4, 5, 6, 8, 9 and 12).

Setting and Unsetting Options

If your browser understands and uses JavaScript, you can click on the names of the functions for setting and unsetting them (see Figures 4, 5, 6, 8, 9 and 12) and on the names of their options (see Figures 5 and 6) for selecting them.

Messages

This server returns four kinds of messages:

Useful Information is displayed in black.
Output from some functions will appear in scrollable text boxes.
Harmless warning messages are written in blue.
More serious warning messages appear in red.

Input File Format

Figure 21: Example of input file format (data from [Chu et al., 1998])

This server uses plain text file as input. The gene expression patterns must appear in a table separated by tabulators where each line corresponds to a gene and each column to an experimental condition. The first column must contain the gene names.
There is no special character for missing values, simply leave these places empty.
All lines beginning with a ``#'' are ignored.
This server does not need to know the name of the conditions, but you can specify them by adding a row to the table where the first element must be ``#NAMES''.
You can add class labels to the genes by appending an ``@'' and a number to the name of the gene. You can also set up the names of those class labels by adding a line starting by ``#LABELS'' (see Figure 21 for an example).

Output File Formats

Preprocessed data will appear in the same format as the input file format. The file contains a comment line for each function used. This is the file format understood by the hierarchical clustering and the SOTA [Herrero et al., 2001] servers. Additionnaly, there is an extra file for SOM [Kohonen, 1997] and SomTree [Herrero & Dopazo, 2002], two files for sending the preprocessed data set to the EPClust [Brazma & Vilo, 2000] and a file containing the list of removed genes. The last one is used in the FatiGO (Al-Shahrour et al., in preparation) analysis for comparing removed against remaining genes.

In SOM file format the gene names appear at the end of the patterns and the first line must contain the length of patterns. EPClust is able to use two different file: one for gene expression patterns and the other one for gene descriptions.

Privacy and Security

Uploaded data set are saved in temporary directories in the server and are accessible through the web until they are erased after some time. Anybody can access those directories, nevertheless the name of the directories are not trivial, thus it is not easy for a third person to access your data.

In any case, you should keep in mind that communications between the client (your computer) and the server are not encripted at all, thus it is also possible for somebody else to look at your data while you are uploading or dowloading them.

Exercises

All the data set needed for these exercises are available here

Exercise 1

Go to the ``Diauxic Shift'' [DeRisi et al., 1997] data set web page [Click here]
Send the data set to the Preprocessor by clicking on ``preprocess''
Pre-Analyse the data set by clicking on the ``Pre-Analyse'' button.

Questions:

How many gene expression patterns are in the data set?
How many of them are replicated?
Are there a lot of missing values?
Was the data set already in a symmetrical scale?
How many gene expression patterns have 3 missing values?
How many genes will remain if we want to keep only genes with at least 3 peaks higher than 1 or lower than -1 (in a $log_{2}$ scale)
Which functions are activated by default?

Exercise 2

Go to the ``Yeast cell cycle'' [Spellman et al., 1998] data set web page [Click here]
Send the data set to the Preprocessor by clicking on ``preprocess''
Pre-Analyse the data set by clicking on the ``Pre-Analyse'' button.

Questions:

How many gene expression patterns are in the data set?
Was the data set already in a symmetrical scale?
How many of them are replicated?
Are there a lot of missing values?
How many gene expression patterns have no missing values?
Which method will you use if you want to impute missing values?
How many genes will be removed if we use the ``Filter flat patterns by RMS'' option with a threshold of 2 (in a $log_{2}$ scale)?
And if we use a threshold of 0.2 (in a $log_{2}$ scale)?
Which functions are activated by default?

Exercise 3

First Part:

Go to the ``Response of Human Fibroblasts to Serum'' [Iyer et al., 1999] data set web page [Click here]
Send the data set to the Preprocessor by clicking on ``preprocess''
Pre-Analyse the data set by clicking on the ``Pre-Analyse'' button.

Questions:

How many gene expression patterns are in the data set?
Was the data set already in a symmetrical scale?
How many missing values are in the data set?
Propose a good threshold of maximum distance to median of replicates for removing inconsistent replicates.
Are the patterns already standardized?
Which functions are activated by default?

Second Part:

Activate the ``Remove Inconsistent Replicates'' function and put the threshold you proposed before.
Activate and set up any combination of flat pattern filtering function.
Click on the ``Run'' button.

Questions:

How many gene expression patterns are now in the data set?
Is the data set in a symmetrical scale?
How many files do appear?
What are the differences between <fibroblasts_ori.txt> and <fibroblasts_ori.dat>?

Exercise 4

First Part:

Go to the ``Lymphoma/leukemia Molecular Profiling Project'' [Alizadeh et al., 2000] data set web page [Click here]
Send the data set to the Preprocessor by clicking on ``preprocess''
Pre-Analyse the data set by clicking on the ``Pre-Analyse'' button.

Questions:

Is the data set already in a symmetrical scale?
How many missing values are in the data set?
How many complete gene expression patterns are available for KNN imputation?

Second Part:

Activate the ``Log-transform'' function and choose your preferred base for logarithm.
Activate the ``Impute missing values'' function and choose the ``use KNNimpute'' option.
Activate the ``Filter flat patterns by standard deviation'' function and set up the threshold to 3.
Click on the ``Run'' button.

Questions:

What happen when the server tried to log-transform the data set?
How many complete gene expression patterns are available for KNN imputation now? Why?
How many gene expression patterns are now in the data set?

Bibliography

Alizadeh et al., 2000: Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Jr, J. H., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O. & Staudt, L. M. (2000).
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
Nature, 403:503-511.
Brazma & Vilo, 2000: Brazma, A. & Vilo, J. (2000).
Gene expression data analysis.
FEBS Letters, 480(1):17-24.
Chu et al., 1998: Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O. & Herskowitz, I. (1998).
The Transcriptional Program of Sporulation in Budding Yeast.
Science, 282(5389):699-705.
DeRisi et al., 1997: DeRisi, J. L., Iyer, V. R. & Brown, P. O. (1997).
Exploring the metabolic and genomic control of gene expression on a genomic scale.
Science, 278:680-686.
Herrero & Dopazo, 2002: Herrero, J. & Dopazo, J. (2002).
Combining hierarchical clustering and self-organizing maps for exploratory analysis of gene expression patterns.
Journal of Proteome Research, 1(5):467-470.
Herrero et al., 2001: Herrero, J., Valencia, A. & Dopazo, J. (2001).
A hierarchical unsupervised growing neural network for clustering gene expression patterns.
Bioinformatics, 17(2):126-138.
Iyer et al., 1999: Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C. F., Trent, J. M., Staudt, L. M., Jr., J. H., Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D. & Brown, P. O. (1999).
The transcriptional program in the response of human fibroblasts to serum.
Science, 283:83-87.
Kohonen, 1997: Kohonen, T. (1997).
Self-organazing Maps.
Springer-Verlag, Berlin.
Spellman et al., 1998: Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998).
Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization.
Mol. Biol. Cell, 9(12):3273-3297.
Troyanskaya et al., 2001: Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. & Altman, R. B. (2001).
Missing value estimation methods for DNA microarrays.
Bioinformatics, 17(6):520-525.

Gene Expression Pattern Preprocessing -- Manual --

Gene Expression Pattern Preprocessing

-- Manual --