Data Exploration for Target Predictions Using Proprietary and Publicly Available Data Sets

24.04.2025

Combining bioactivity data from different sources for ML predictions can lead to high variance in values and differences in chemical space. This study analyzed proprietary Bayer AG data and public ChEMBL data, revealing significant chemical space differences that impacted model performance. Models trained on one source performed poorly on the other, with MCC values ranging from −0.34 to 0.37. Overlap assessments, such as Tanimoto similarity, showed limited correlation with performance. Mixed training strategies using assay format and similarity were also explored.

Aljoša Smajić, Thomas Steger-Hartmann, Gerhard F. Ecker, Anke Hackl: Data Exploration for Target Predictions Using Proprietary and Publicly Available Data Sets. Chem. Res. Toxicol., April 20, 2025

DOI

https://doi.org/10.1021/acs.chemrestox.4c00347

Abstract

When applying machine learning (ML) approaches to the prediction of bioactivity, it is common to collect data from different assays or sources and combine them into single data sets. However, depending on the data domains and sources from which these data are retrieved, bioactivity data for the same macromolecular target may show a high variance of values (looking at a single compound) and cover very different parts of the chemical space as well as the bioactivity range (looking at the whole data set). The effectiveness and applicability domain of the resulting prediction models may be strongly influenced by the sources from which their training data were retrieved. Therefore, we investigated the chemical space and active/inactive distribution of proprietary pharmaceutical data from Bayer AG and the publicly available ChEMBL database, and their impact when applied as training data for classification models. To this end, we applied two different sets of descriptors in combination with different ML algorithms. The results show substantial differences in chemical space between the two data sources, leading to suboptimal prediction performance when models are applied to domains other than their training data. When models trained on Bayer AG data were tested on ChEMBL data and vice versa, MCC values across all targets ranged from −0.34 to 0.37. For 31 targets, the mean Tanimoto similarity of the nearest neighbors between the two data sources was equal to or less than 0.3. Interestingly, none of the methods applied to assess the overlap in chemical space between the two data sources, and thereby to predict the applicability of models beyond their training data sets, correlated with the observed performances. Finally, we applied different strategies for creating mixed training data sets based on both public and proprietary sources, using assay format (cell-based and cell-free) information and Tanimoto similarities.
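The two quantities reported in the abstract, the Matthews correlation coefficient (MCC) and the mean nearest-neighbor Tanimoto similarity between two data sets, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names are hypothetical, and fingerprints are represented as plain sets of "on" bit indices as a simplifying assumption, since the descriptors used in the study are not specified here.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def mean_nearest_neighbor_similarity(query, reference):
    """Average, over all query fingerprints, of the highest similarity
    to any fingerprint in the reference set."""
    best = [max(tanimoto(q, r) for r in reference) for q in query]
    return sum(best) / len(best)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data only: two tiny "data sets" of made-up fingerprints.
source_a = [{1, 2, 3}, {5, 6}]
source_b = [{2, 3, 4}, {5, 6, 7}]
print(mean_nearest_neighbor_similarity(source_a, source_b))
print(mcc([1, 1, 0, 0], [1, 0, 1, 0]))
```

MCC ranges from −1 (complete disagreement) through 0 (no better than chance) to 1 (perfect prediction), which is why values between −0.34 and 0.37 indicate weak transfer between the two sources. A mean nearest-neighbor similarity of 0.3 or less means that, on average, even the closest compound in the other data set shares few structural features.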

Funding

The Pharmacoinformatics Research Group (Ecker lab) acknowledges funding provided by the Austrian Science Fund (FWF), grant W1232 MolTag. Open Access is funded by the Austrian Science Fund (FWF). Furthermore, the authors would like to acknowledge the resources provided by Bayer Pharma AG for this project.

Rights & permissions

This publication is licensed under CC-BY 4.0

© 2025 The Authors. Published by American Chemical Society

Keywords

algorithms, assays, bioactivity, biological databases, chemical calculations
