About BDPServer
- Summary.
BDPServer is a server that predicts whether a chemical compound can be biodegraded. The system currently uses information from the November 17th, 2003 update of the University of Minnesota Biocatalysis / Biodegradation Database (UMBBD), which includes information about 850 compounds and 903 reactions. Each compound is represented as a collection of 152 attributes that describe atomic composition and connectivity, molecular weight, and solubility. The biodegradation network, defined on the complete set of reactions in UMBBD, has been used to classify the chemical compounds into three overlapping groups, depending on whether they are connected to non-biodegradable compounds, central metabolism compounds, or carbon dioxide. The three groups are:
A fourth group has been defined, by merging the groups CM and CD. This group (CMCD) contains the compounds that are connected to central metabolism or to carbon dioxide, and corresponds to what can be considered more intuitively as the set of biodegradable compounds. For each of the groups, a binary classification scheme has been defined and, therefore, each compound can be independently classified as:
For each of the four classification schemes, a machine learning algorithm has been used to generate a classifier that associates the attributes of chemical compounds to a given class. BDPServer is the web interface to the four classifiers.
The November 17th 2003 update of UMBBD includes information about 850 compounds that participate in chemical reactions related to biodegradation. We have represented each compound as a collection of 152 attributes:
The November 17th, 2003 update of UMBBD includes information about 903 reactions in which 850 compounds participate as either substrates or products; some of them are annotated as belonging to central metabolism. From the set of 903 reactions, a directed graph was constructed in which nodes correspond to compounds. For example, from the reaction A -> B + C, the following pairs of connected nodes are generated: A -> B and A -> C.
From the analysis of the network, it was concluded that:
Therefore three groups of compounds have been defined according to their possible fate:
A fourth group has been defined, by merging the groups CM and CD. This group (CMCD) contains the compounds that are connected to central metabolism or to carbon dioxide, and corresponds to what can be considered more intuitively as the set of biodegradable compounds. Four independent binary classification schemes have been defined based on the groups previously considered:
It must be stressed that these classification schemes are relative to the information in the UMBBD database. The classification of the 850 chemical compounds according to the four classification schemes described above can be consulted in the BDPServer DB section.
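The graph construction and the connectivity-based grouping described above can be sketched as follows. This is a minimal illustration, not the code behind BDPServer: the function names, the reaction encoding as (substrates, products) pairs, the label "CO2", and the choice to treat a compound as connected to itself are all assumptions made for the example. The group related to non-biodegradable compounds is left out because its criterion is not detailed here.

```python
from collections import defaultdict, deque


def build_graph(reactions):
    """Turn (substrates, products) pairs into substrate -> product edges.

    A reaction A -> B + C yields the edges A -> B and A -> C.
    """
    graph = defaultdict(set)
    for substrates, products in reactions:
        for substrate in substrates:
            for product in products:
                graph[substrate].add(product)
    return graph


def reachable(graph, start):
    """Return every compound reachable from `start`, including `start` itself
    (an assumption: a compound counts as connected to itself)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def classify(graph, compound, central_metabolism, co2="CO2"):
    """Assign the CM, CD and merged CMCD labels from graph connectivity."""
    targets = reachable(graph, compound)
    cm = bool(targets & set(central_metabolism))
    cd = co2 in targets
    return {"CM": cm, "CD": cd, "CMCD": cm or cd}


# Hypothetical toy network: A -> B + C, B -> CO2, C -> D (D is central metabolism).
toy_reactions = [(["A"], ["B", "C"]), (["B"], ["CO2"]), (["C"], ["D"])]
toy_graph = build_graph(toy_reactions)
print(classify(toy_graph, "A", {"D"}))
print(classify(toy_graph, "C", {"D"}))
```

In this toy network, A reaches both carbon dioxide and a central metabolism compound, whereas C reaches only central metabolism, so the groups overlap in the way the text describes.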
For each of the binary classification schemes mentioned above, an algorithm that constructs decision trees (C4.5) was used to generate classifiers that associate the properties of chemical compounds with one of the two predictable classes. To evaluate the predictive performance of the system on unseen cases, cross-validation experiments were conducted in which 80% of the cases were used as the training set and 20% as the test set. Results are as follows:
AGA (Average Global Accuracy) is the percentage of compounds correctly classified as belonging to any of the classes, or their negation, in the five iterations of cross-validation experiments. The results indicate that, in general, the best predictive performance of the system is achieved for the classification scheme CM vs. No CM.
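The evaluation procedure can be illustrated with a short sketch of five-fold cross-validation in which correct predictions are pooled over the five iterations to give a single percentage. Everything here is hypothetical: the function names are invented for the example, and a trivial majority-class learner stands in for C4.5, which is not reimplemented.

```python
import random


def make_folds(n_cases, n_folds=5, seed=0):
    """Shuffle case indices and deal them into n_folds disjoint test sets."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def average_global_accuracy(cases, labels, train, n_folds=5, seed=0):
    """Percentage of cases classified correctly when each fold is held out once.

    `train` takes (training cases, training labels) and returns a predict function.
    With five folds, each iteration trains on 80% and tests on 20% of the cases.
    """
    correct = 0
    for test_idx in make_folds(len(cases), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(cases)) if i not in held_out]
        predict = train([cases[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        correct += sum(predict(cases[i]) == labels[i] for i in test_idx)
    return 100.0 * correct / len(cases)


def train_majority(train_cases, train_labels):
    """Stand-in learner (NOT C4.5): always predicts the most common training label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda case: majority


# Hypothetical data: ten cases, eight labelled True and two labelled False.
toy_cases = list(range(10))
toy_labels = [True] * 8 + [False] * 2
print(average_global_accuracy(toy_cases, toy_labels, train_majority))
```

Replacing `train_majority` with a real decision-tree learner would leave the evaluation loop unchanged, since the loop only depends on the returned predict function.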
BDPServer is a prediction server implemented on top of classifiers generated by C4.5, using training sets that included all compounds. The input for the predictor consists of chemical compound descriptions in SMILES format, which can be typed directly in a text box or drawn with the JME applet, provided by Peter Ertl (Novartis). Solubility information can also be submitted, but it is optional. The server uses OpenBabel to translate SMILES strings to other formats, from which the connectivity between atoms is extracted, and to add hydrogen atoms, which are not explicitly represented in SMILES strings. From the connectivity data, which include the type of chemical bond between each atom pair, the composition of each compound in terms of atomic triads is calculated. OpenBabel is also used to calculate the molecular weight of compounds. For each compound, four independent predictions are provided, each with a confidence factor (CF) calculated by C4.5. If a user-provided SMILES string matches a string in the database, the server returns the actual classification of the compound and predictions are not shown, unless the user chooses to force the prediction.
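As a rough illustration of how triad composition might be derived from connectivity data, the sketch below counts atom-bond-atom-bond-atom patterns centred on each atom. The exact definition and encoding of atomic triads used by BDPServer are not given here, so the string encoding, the canonical ordering, and the function name are assumptions; the input is a hand-written atom and bond list rather than OpenBabel output.

```python
from collections import Counter
from itertools import combinations


def count_triads(atoms, bonds):
    """Count triads of the form X-bond-Y-bond-Z, where Y is bonded to X and Z.

    atoms: list of element symbols, indexed by position.
    bonds: list of (i, j, bond_symbol) tuples, e.g. (0, 1, "=").
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, bond in bonds:
        neighbours[i].append((j, bond))
        neighbours[j].append((i, bond))

    triads = Counter()
    for centre, linked in neighbours.items():
        for (a, bond_a), (b, bond_b) in combinations(linked, 2):
            forward = atoms[a] + bond_a + atoms[centre] + bond_b + atoms[b]
            backward = atoms[b] + bond_b + atoms[centre] + bond_a + atoms[a]
            # Both reading directions describe the same triad; keep one of them.
            triads[min(forward, backward)] += 1
    return triads


# Carbon dioxide, O=C=O: a single triad centred on the carbon atom.
print(count_triads(["O", "C", "O"], [(0, 1, "="), (1, 2, "=")]))
# Heavy-atom skeleton of ethanol, C-C-O (hydrogens omitted for brevity).
print(count_triads(["C", "C", "O"], [(0, 1, "-"), (1, 2, "-")]))
```

With hydrogen atoms added explicitly, as the server does through OpenBabel, the same counting would also produce triads that involve hydrogen.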
BDPServer has been applied to predict the environmental fate of several sets of compounds that are relevant to European regulations concerning chemicals. The sets of compounds analyzed so far are:
Annex I is a list of dangerous substances published as part of directive 67/548 of the European Commission, which regulates the classification, packaging and labelling of dangerous substances; it was updated in April 2004 (29th ATP). HPVC and LPVC are two lists of substances implemented in application of directive 793/93 and refer to substances that are produced in quantities exceeding 1,000 tonnes per year, and between 10 and 1,000 tonnes per year, respectively. The three lists were kindly provided by Rémi Allanou, of the European Chemicals Bureau. The lists include many substances of poorly defined composition, which cannot be analyzed by BDPServer. By cross-checking the lists against the SMILECAS database, they were filtered to obtain new lists of substances that consist of pure compounds with an associated SMILES descriptor. The filtered lists are referred to as Annex I-S, HPVC-S and LPVC-S. The number of compounds in each of these lists is presented in the next table.
The percentages of compound types in Annex I-S, HPVC-S and LPVC-S, according to their predicted environmental fate, are summarized in the next table.
The predictions for each compound can be consulted at the Predictions section.