Identifying Differences in the Performance of Machine Learning Models for Off-Targets Trained on Publicly Available and Proprietary Data Sets

01.08.2023

In silico toxicology has gained significant importance in recent years as computational technology has advanced rapidly. The recently signed FDA Modernization Act 2.0, which marks a major shift away from animal use in drug development, will further boost new approach methods in safety assessment.

Smajić A, Rami I, Sosnin S, Ecker GF. Identifying Differences in the Performance of Machine Learning Models for Off-Targets Trained on Publicly Available and Proprietary Data Sets. Chem Res Toxicol. 2023 

DOI

https://doi.org/10.1021/acs.chemrestox.3c00042

Abstract

Each year, publicly available databases are updated with new compounds from different research institutions. Positive experimental outcomes are more likely to be reported; therefore, they account for a considerable fraction of these entries. Established publicly available databases such as ChEMBL allow researchers to use information without constrictions and create predictive tools for a broad spectrum of applications in the field of toxicology. Therefore, we investigated the distribution of positive and nonpositive entries within ChEMBL for a set of off-targets and its impact on the performance of classification models when applied to pharmaceutical industry data sets. Results indicate that models trained on publicly available data tend to overpredict positives, and models based on industry data sets predict negatives more often than those built using publicly available data sets. This is strengthened even further by the visualization of the prediction space for a set of 10,000 compounds, which makes it possible to identify regions in the chemical space where predictions converge. Finally, we highlight the utilization of these models for consensus modeling for potential adverse events prediction.
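The comparison described in the abstract can be illustrated with a minimal sketch. It assumes binary predictions (1 = positive, 0 = nonpositive) from two models, one trained on public data and one on industry data. The function names, the toy prediction lists, and the agreement-based consensus rule are all assumptions made for illustration; they are not the authors' code, data, or consensus method.

```python
from typing import List, Optional, Sequence


def positive_fraction(labels: Sequence[int]) -> float:
    """Return the share of entries labeled positive (1) in a binary list."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


def consensus(pred_a: Sequence[int], pred_b: Sequence[int]) -> List[Optional[int]]:
    """Hypothetical agreement rule: keep a prediction only where both models agree.

    Compounds on which the two models disagree are marked None (no consensus).
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    return [a if a == b else None for a, b in zip(pred_a, pred_b)]


# Toy predictions for six compounds (made-up values, not results from the paper).
public_preds = [1, 1, 0, 1, 1, 0]    # model trained on public data
industry_preds = [1, 0, 0, 0, 1, 0]  # model trained on industry data

print(positive_fraction(public_preds))
print(positive_fraction(industry_preds))
print(consensus(public_preds, industry_preds))
```

In this toy case the public-data model calls more compounds positive than the industry-data model, and the consensus list keeps only the compounds on which both agree.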

Funding

This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement no. 777365 eTRANSAFE. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation program and EFPIA. The Pharmacoinformatics Research Group (Ecker lab) acknowledges funding provided by the Austrian Science Fund FWF AW012321 MolTag. Open Access is funded by the Austrian Science Fund (FWF).

Rights & permissions

This is an open access article distributed under the terms of the Creative Commons CC-BY license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

© 2023 The Authors. Published by American Chemical Society
