Aljoša Smajić, Thomas Steger-Hartmann, Gerhard F. Ecker, Anke Hackl: Data Exploration for Target Predictions Using Proprietary and Publicly Available Data Sets. Chem. Res. Toxicol., April 20, 2025
DOI
https://doi.org/10.1021/acs.chemrestox.4c00347
Abstract
When applying machine learning (ML) approaches to the prediction of bioactivity, it is common to collect data from different assays or sources and combine them into a single data set. However, depending on the data domains and sources from which these data are retrieved, bioactivity data for the same macromolecular target may show a high variance of values (looking at a single compound) and cover very different parts of the chemical space as well as of the bioactivity range (looking at the whole data set). The effectiveness and applicability domain of the resulting prediction models may be strongly influenced by the sources from which their training data were retrieved. Therefore, we investigated the chemical space and active/inactive distribution of proprietary pharmaceutical data from Bayer AG and of the publicly available ChEMBL database, and their impact when used as training data for classification models. To this end, we applied two different sets of descriptors in combination with different ML algorithms. The results show substantial differences in chemical space between the two data sources, leading to suboptimal prediction performance when models are applied to domains other than that of their training data. When models trained on Bayer AG data were tested on ChEMBL data and vice versa, Matthews correlation coefficient (MCC) values between −0.34 and 0.37 were obtained across all targets. The mean Tanimoto similarity of the nearest neighbors between the two data sources was 0.3 or lower for 31 targets. Interestingly, none of the methods applied to assess the overlap in chemical space between the two data sources, and thereby to predict the applicability of models beyond their training data sets, correlated with the observed performances. Finally, we applied different strategies for creating mixed training data sets from both public and proprietary sources, using assay format (cell-based and cell-free) information and Tanimoto similarities.
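As a rough illustration of the two quantities quoted in the abstract, the sketch below computes a mean nearest-neighbor Tanimoto similarity between two compound sets and an MCC for binary active/inactive predictions. It is not the authors' pipeline: the abstract does not specify the fingerprints used, so each compound is represented here as a plain set of on-bit indices, and the function names and toy data are hypothetical.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints (assumption)
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def mean_nearest_neighbor_similarity(query_fps, reference_fps):
    """Average, over all query compounds, of the highest similarity to any reference compound."""
    best = [max(tanimoto(q, r) for r in reference_fps) for q in query_fps]
    return sum(best) / len(best)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when a row or column of the confusion matrix is empty (common convention).
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy fingerprints standing in for two data sources (hypothetical values).
source_a = [{1, 2, 3, 4}, {2, 3, 5}]
source_b = [{1, 2, 3}, {7, 8, 9}]

print(mean_nearest_neighbor_similarity(source_a, source_b))
print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

MCC ranges from −1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct), so values between −0.34 and 0.37 reflect weak transfer between the two data sources. A low mean nearest-neighbor similarity means that the compounds in one source have, on average, no close structural analogue in the other.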
Funding
The Pharmacoinformatics Research Group (Ecker lab) acknowledges funding provided by the Austrian Science Fund (FWF) under W1232 MolTag. Open Access is funded by the Austrian Science Fund (FWF). Furthermore, the authors would like to acknowledge the resources provided by Bayer Pharma AG for this project.
Rights & permissions
This publication is licensed under CC-BY 4.0.
© 2025 The Authors. Published by American Chemical Society
Keywords
algorithms, assays, bioactivity, biological databases, chemical calculations