About BDPServer

- Summary.
- Representation of chemical compounds.
- Analysis of the biodegradation network and classification of the compounds: BDPServer DB.
- Production of classifiers and evaluation of the predictive performance of the system.
- Implementation of the prediction web server.
- Predictions: examples of application.
- Acknowledgements.


SUMMARY

BDPServer is a server that can be used to predict whether a chemical compound can be biodegraded or not.

The system currently uses information from the November 17th 2003 update of the University of Minnesota Biocatalysis / Biodegradation Database (UMBBD), which includes information about 850 compounds and 903 reactions.

Each compound is represented as a collection of 152 attributes that describe its atomic composition and connectivity, molecular weight, and solubility.

The biodegradation network, defined upon the complete set of reactions in UMBBD, has been used to classify the chemical compounds into three overlapping groups, depending on whether they are connected to non-biodegradable compounds, central metabolism compounds or carbon dioxide. The three groups are:

  • NB: non-biodegradable path compounds (353 compounds).

  • CM: central metabolism path compounds (533 compounds).

  • CD: carbon dioxide path compounds (329 compounds).

A fourth group has been defined, by merging the groups CM and CD. This group (CMCD) contains the compounds that are connected to central metabolism or to carbon dioxide, and corresponds to what can be considered more intuitively as the set of biodegradable compounds.

For each of the groups, a binary classification scheme has been defined and, therefore, each compound can be independently classified as:

  • NB or No NB.

  • CM or No CM.

  • CD or No CD.

  • CMCD or No CMCD.

For each of the four classification schemes, a machine learning algorithm has been used to generate a classifier that associates the attributes of chemical compounds to a given class.

BDPServer is the web interface to the four classifiers.


REPRESENTATION OF CHEMICAL COMPOUNDS

The November 17th 2003 update of UMBBD includes information about 850 compounds that participate in chemical reactions related to biodegradation. We have represented each compound as a collection of 152 attributes:

  • One of them is molecular weight.

  • Two others are the solubility value, expressed in g/100 ml, and the solubility class, expressed with the following terms:

| Solubility class | Solubility value |
|---|---|
| Soluble (Miscible) | > 1 g/100 ml |
| Slightly | 0.05 - 1 g/100 ml |
| Sparingly | 0.005 - 0.05 g/100 ml |
| Insoluble | < 0.005 g/100 ml |

  • The remaining 149 attributes represent the structure of the compound in terms of the absolute frequency of atomic triads. An atomic triad is any group of three consecutive, connected atoms that can be defined on the structure of the compound. The types of the chemical bonds connecting the three atoms are also considered. For example:

    • C-C-H is different from C=C-H (the bond types differ).

    • C=C-H is equivalent to H-C=C (the same triad read in the opposite direction).
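The triad counting described above can be sketched as follows. This is a minimal illustration, not the server's code: the function names, the hand-written atom and bond lists, and the choice to count each three-atom path once are all assumptions made for the example.

```python
from collections import Counter

def canonical_triad(a, bond_ab, b, bond_bc, c):
    """Return one fixed spelling for a triad, so C=C-H and H-C=C coincide."""
    forward = f"{a}{bond_ab}{b}{bond_bc}{c}"
    backward = f"{c}{bond_bc}{b}{bond_ab}{a}"
    return min(forward, backward)

def count_triads(atoms, bonds):
    """Count atomic triads in a molecule.

    atoms: list of element symbols, indexed by position.
    bonds: list of (i, j, symbol) with symbol such as "-" or "=".
    Each path of three connected atoms is counted once (an assumption).
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, symbol in bonds:
        neighbours[i].append((j, symbol))
        neighbours[j].append((i, symbol))

    counts = Counter()
    for centre, nbrs in neighbours.items():
        # Every unordered pair of neighbours around a central atom is a triad.
        for x in range(len(nbrs)):
            for y in range(x + 1, len(nbrs)):
                (a, bond_a), (c, bond_c) = nbrs[x], nbrs[y]
                triad = canonical_triad(atoms[a], bond_a, atoms[centre],
                                        bond_c, atoms[c])
                counts[triad] += 1
    return counts

# Ethene (C2H4) with explicit hydrogens, written out by hand.
ethene_atoms = ["C", "C", "H", "H", "H", "H"]
ethene_bonds = [(0, 1, "="), (0, 2, "-"), (0, 3, "-"), (1, 4, "-"), (1, 5, "-")]
ethene_triads = count_triads(ethene_atoms, ethene_bonds)
# ethene_triads holds 4 x "C=C-H" and 2 x "H-C-H"
```

Sorting the two readings of a triad and keeping the smaller string is just one convenient way to make direction irrelevant; any consistent rule would do.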


ANALYSIS OF THE BIODEGRADATION NETWORK AND CLASSIFICATION OF THE COMPOUNDS: BDPServer DB

The November 17th 2003 update of UMBBD includes information about 903 reactions in which 850 compounds participate either as substrates or as products; some of these compounds are annotated as belonging to central metabolism.

From the set of 903 reactions, a directed graph, in which nodes correspond to compounds, was constructed. For example, from the reaction:

A -> B + C

the following pairs of connected nodes are generated:

A -> B
A -> C
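A minimal sketch of this conversion is shown below. The reaction representation (a pair of substrate and product lists) and the function name are assumptions made for illustration; for reactions with several substrates, the sketch links every substrate to every product, which the text does not state explicitly.

```python
def reaction_edges(reactions):
    """Turn reactions into directed compound-to-compound edges.

    reactions: iterable of (substrates, products) pairs.
    Returns a set of (substrate, product) tuples.
    """
    edges = set()
    for substrates, products in reactions:
        for substrate in substrates:
            for product in products:
                edges.add((substrate, product))
    return edges

# The reaction A -> B + C from the text.
example_edges = reaction_edges([(["A"], ["B", "C"])])
# example_edges == {("A", "B"), ("A", "C")}
```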

From the analysis of the network, it was concluded that:

  • 112 compounds are not biodegradable because they do not participate as substrates in any reaction.

  • 353 compounds are connected, or in the path, to non-biodegradable compounds or are non-biodegradable compounds, and have been defined as non-biodegradable path compounds (NB).

  • 38 compounds are annotated as belonging to central metabolism.

  • 533 compounds are connected, or in the path, to central metabolism compounds or are central metabolism compounds, and have been defined as central metabolism path compounds (CM).

  • The overlap between the NB and CM sets is 112 compounds.

  • The non-redundant set that contains CM and NB includes 774 compounds.

  • The 76 remaining compounds are part of several paths (some of them cyclic) that end in carbon dioxide. Therefore, they are biodegradable even though they are not connected to central metabolism compounds. A graph representing the paths connected exclusively to carbon dioxide (UMBBD id c0131) is provided here.

  • 329 compounds (out of the whole set of compounds) are connected, or in the path, to carbon dioxide. These have been defined as carbon dioxide path compounds (CD).
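The grouping above amounts to asking, for each compound, which kinds of end point can be reached from it in the directed graph. The sketch below illustrates this on a toy network. The compound names, the helper names, and the decision to exclude carbon dioxide and central metabolism compounds from the dead-end set are assumptions made for the example, not details taken from UMBBD.

```python
from collections import deque

def compounds_on_path_to(edges, targets):
    """Return the targets plus every compound from which a target is reachable."""
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(dst, set()).add(src)
    seen = set(targets)
    queue = deque(targets)
    while queue:  # breadth-first walk against the edge direction
        node = queue.popleft()
        for prev in predecessors.get(node, ()):
            if prev not in seen:
                seen.add(prev)
                queue.append(prev)
    return seen

def dead_ends(edges, compounds):
    """Compounds that never appear as a substrate in any reaction."""
    substrates = {src for src, _ in edges}
    return set(compounds) - substrates

# Toy network: X is a dead end, M stands for a central metabolism compound.
toy_edges = {("A", "B"), ("B", "X"), ("A", "M"), ("M", "CO2"), ("C", "CO2")}
toy_compounds = {"A", "B", "C", "M", "X", "CO2"}
central = {"M"}

# Assumption: CO2 and central metabolism compounds are not treated as dead ends.
non_degradable = dead_ends(toy_edges, toy_compounds) - {"CO2"} - central

nb = compounds_on_path_to(toy_edges, non_degradable)  # {"X", "B", "A"}
cm = compounds_on_path_to(toy_edges, central)         # {"M", "A"}
cd = compounds_on_path_to(toy_edges, {"CO2"})         # {"CO2", "M", "A", "C"}
cmcd = cm | cd                                        # merged CM and CD group
```

Compound A belongs to NB, CM and CD at once, which is why the groups overlap. The counts reported above are consistent with one another: 353 + 533 - 112 = 774 compounds in the union of NB and CM, and 850 - 774 = 76 remaining compounds.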

Therefore three groups of compounds have been defined according to their possible fate:

  • NB: non-biodegradable path compounds (353 compounds).

  • CM: central metabolism path compounds (533 compounds).

  • CD: carbon dioxide path compounds (329 compounds).

A fourth group has been defined, by merging the groups CM and CD. This group (CMCD) contains the compounds that are connected to central metabolism or to carbon dioxide, and corresponds to what can be considered more intuitively as the set of biodegradable compounds.

Four independent binary classification schemes have been defined based on the groups previously considered:

  • NB vs No NB.

  • CM vs No CM.

  • CD vs No CD.

  • CMCD vs No CMCD.

It has to be stressed that these classification schemes are relative to the information in the UMBBD database.

The classification of the 850 chemical compounds according to the four classification schemes described above can be consulted in the BDPServer DB section.


PRODUCTION OF CLASSIFIERS AND EVALUATION OF THE PREDICTIVE PERFORMANCE OF THE SYSTEM

For each of the binary classification schemes mentioned above, an algorithm that constructs decision trees (c4.5) was used to generate classifiers that associate the properties of chemical compounds to one of the two predictable classes.

To evaluate the predictive performance of the system on unseen cases, cross-validation experiments were conducted in which 80% of the cases were used for training and the remaining 20% for testing.
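An 80%/20% split repeated over five iterations can be produced as sketched below. The sketch assumes five disjoint folds, each serving once as the test set; the text does not say whether the splits were disjoint, and the function name and random seed are illustrative.

```python
import random

def k_fold_splits(cases, k=5, seed=0):
    """Yield (train, test) pairs; each case lands in exactly one test set."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled cases into k folds of (nearly) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [case for j, fold in enumerate(folds) if j != i for case in fold]
        yield train, test

# With 850 placeholder cases, every iteration trains on 680 and tests on 170.
splits = list(k_fold_splits(range(850)))
```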

Results are as follows:

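| Scheme | CM vs. No CM | NB vs. No NB | CD vs. No CD | CMCD vs. No CMCD |
|---|---|---|---|---|
| AGA (%) | 83.40 | 76.60 | 69.60 | 82.30 |
| Default class | CM | No NB | No CD | CMCD |
| Majority class | CM | No NB | No CD | CMCD |
| Majority class: number of cases | 533 | 496 | 513 | 634 |
| Majority class: sensitivity (%) | 92.00 | 85.80 | 74.60 | 92.40 |
| Majority class: specificity (%) | 83.40 | 77.20 | 75.40 | 85.20 |
| Minority class | No CM | NB | CD | No CMCD |
| Minority class: number of cases | 308 | 345 | 328 | 207 |
| Minority class: sensitivity (%) | 68.40 | 63.00 | 61.40 | 51.00 |
| Minority class: specificity (%) | 83.60 | 76.60 | 62.40 | 68.60 |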

AGA (Average Global Accuracy) is the percentage of compounds correctly classified as belonging to any of the classes, or their negation, in the five iterations of cross-validation experiments.
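The text does not spell out how sensitivity and specificity are computed. As a rough check, the sketch below rebuilds an approximate confusion matrix for the CM vs. No CM scheme from the reported case counts and per-class sensitivities. Under that reconstruction, the reported AGA agrees with overall accuracy, and the reported specificity is close to the fraction of predictions for a class that are correct (often called precision) rather than to the true-negative rate. This reading is an inference from the figures, not something the text confirms, and the function name is illustrative.

```python
def reconstruct(n_major, sens_major, n_minor, sens_minor):
    """Approximate summary figures from class sizes and sensitivities (in %)."""
    major_ok = n_major * sens_major / 100   # majority cases predicted as majority
    minor_ok = n_minor * sens_minor / 100   # minority cases predicted as minority
    major_missed = n_major - major_ok       # majority cases predicted as minority
    minor_missed = n_minor - minor_ok       # minority cases predicted as majority

    accuracy = 100 * (major_ok + minor_ok) / (n_major + n_minor)
    # Share of "majority" predictions that are right, and likewise for "minority".
    precision_major = 100 * major_ok / (major_ok + minor_missed)
    precision_minor = 100 * minor_ok / (minor_ok + major_missed)
    return accuracy, precision_major, precision_minor

# CM vs. No CM: 533 CM cases (sensitivity 92.00), 308 No CM cases (68.40).
accuracy, precision_cm, precision_no_cm = reconstruct(533, 92.00, 308, 68.40)
# accuracy is about 83.4 (reported AGA: 83.40)
# precision_cm is about 83.4 (reported specificity for CM: 83.40)
# precision_no_cm is about 83.2 (reported specificity for No CM: 83.60)
```

Small differences are expected, since the reported values are presumably averaged over the five iterations while this reconstruction pools all cases.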

The results indicate that, in general, the best predictive performance of the system is achieved for the classification scheme CM vs. No CM.


IMPLEMENTATION OF THE PREDICTION WEB SERVER

BDPServer is a prediction server implemented on top of classifiers generated by c4.5, using training sets that included all compounds.

The input for the predictor consists of chemical compound descriptions in SMILES format, which can be typed directly in a text box or drawn with the JME applet, provided by Peter Ertl (Novartis). Solubility information can also be submitted, but it is optional.

The server uses OpenBabel to translate SMILES strings to other formats, from which the connectivity between atoms is extracted, and to add hydrogen atoms, which are not explicitly represented in SMILES strings. From the connectivity data, which include the type of chemical bond between each pair of atoms, the composition of each compound in terms of atomic triads is calculated. OpenBabel is also used to calculate the molecular weight of compounds.

For each compound, four independent predictions are provided, together with a confidence factor (CF) that is calculated by c4.5. If a user-provided SMILES string matches another string in the database, the server returns the actual classification of the compound and predictions are not shown, unless the user chooses to force the prediction.
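The lookup-before-predict behaviour can be sketched as below. Everything here is hypothetical scaffolding: the dictionary of known compounds, the stand-in predictor, its confidence factors and the result layout are invented for illustration and do not reflect the server's actual code or output.

```python
def classify(smiles, known, predict, force=False):
    """Return stored classes for a known SMILES string, else predictions.

    known:   dict mapping a SMILES string to its four stored class labels.
    predict: callable returning {scheme: (label, confidence_factor)}.
    force:   when True, predict even if the compound is already known.
    """
    if smiles in known and not force:
        return {"source": "database", "classes": known[smiles]}
    return {"source": "prediction", "classes": predict(smiles)}

# Hypothetical stored entry and a dummy predictor standing in for the classifiers.
known_compounds = {"CCO": {"NB": "No NB", "CM": "CM", "CD": "CD", "CMCD": "CMCD"}}

def dummy_predict(smiles):
    return {scheme: ("No " + scheme, 0.5) for scheme in ("NB", "CM", "CD", "CMCD")}

stored = classify("CCO", known_compounds, dummy_predict)
forced = classify("CCO", known_compounds, dummy_predict, force=True)
unseen = classify("CCCC", known_compounds, dummy_predict)
```

Note that matching is done on the literal string, as the text describes, so two different SMILES spellings of the same molecule would not be recognised as the same entry in this sketch.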


PREDICTIONS: EXAMPLES OF APPLICATION

BDPServer has been applied to predicting the environmental fate of several sets of compounds that are relevant for European regulations concerning chemicals.

The sets of compounds analyzed so far are:

  • Annex I-S.

  • HPVC-S.

  • LPVC-S.

Annex I is a list of dangerous substances published as part of directive 67/548 of the European Commission, which regulates the classification, packaging and labelling of dangerous substances. The version used was updated in April 2004 (29th ATP).

HPVC and LPVC are two lists of substances implemented in application of directive 793/93. They refer to substances that are produced in quantities exceeding 1,000 tonnes per year, and between 10 and 1,000 tonnes per year, respectively.

The three lists were kindly provided by Rémi Allanou, of the European Chemicals Bureau.

The lists include many substances of poorly defined composition, which cannot be analyzed by BDPServer. By cross-checking the lists with the SMILECAS database, they were filtered to obtain new lists of substances that consist of pure compounds with an associated SMILES descriptor. The filtered lists are referred to as Annex I-S, HPVC-S and LPVC-S.
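A filtering step of this kind might look like the sketch below. It assumes that each list entry carries a CAS number and that the SMILECAS data are available as a CAS-to-SMILES mapping; both the data layout and the function name are assumptions, and the mixture entry is a made-up placeholder.

```python
def filter_with_smiles(substances, cas_to_smiles):
    """Keep only substances whose CAS number has a SMILES descriptor.

    substances:    iterable of (name, cas_number) pairs; cas_number may be None.
    cas_to_smiles: dict mapping CAS numbers to SMILES strings.
    """
    kept = []
    for name, cas in substances:
        smiles = cas_to_smiles.get(cas)
        if smiles:  # drops entries without a CAS number or without SMILES
            kept.append((name, cas, smiles))
    return kept

# Two pure compounds and one placeholder for a poorly defined mixture.
raw_list = [
    ("benzene", "71-43-2"),
    ("ethanol", "64-17-5"),
    ("poorly defined mixture (placeholder)", None),
]
smiles_table = {"71-43-2": "c1ccccc1", "64-17-5": "CCO"}

filtered_list = filter_with_smiles(raw_list, smiles_table)
```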

The number of compounds in each of these lists is presented in the next table.

| Compound set | Number of substances |
|---|---|
| Annex I | 3,365 |
| Annex I-S | 1,766 |
| HPVC | 2,747 |
| HPVC-S | 1,653 |
| LPVC | 7,829 |
| LPVC-S | 5,645 |

The percentages of compound types in Annex I-S, HPVC-S and LPVC-S, according to their predicted environmental fate, are summarized in the next table.

| Environmental fate (%) | Annex I-S | HPVC-S | LPVC-S |
|---|---|---|---|
| CM / No CM | 57.81 / 36.86 | 61.16 / 32.30 | 62.23 / 33.43 |
| NB / No NB | 51.42 / 43.26 | 42.77 / 50.70 | 49.92 / 45.75 |
| CD / No CD | 24.35 / 70.33 | 19.60 / 73.87 | 21.72 / 73.94 |
| CMCD / No CMCD | 68.35 / 26.33 | 73.62 / 19.84 | 71.24 / 24.42 |

The predictions for each compound can be consulted in the Predictions section.


Bioinformatics Lab, CAB (INTA-CSIC).
PDG, CNB (CSIC).