Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data
In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité Universitätsmedizin Berlin, Berlin, Germany.
BASF SE, Ludwigshafen, Germany; Division of Pharmaceutical Chemistry, Department of Pharmaceutical Sciences, University of Vienna, Vienna, Austria.
Örebro University, School of Science and Technology. Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden; Dept Computer and Systems Sciences, Stockholm University, Kista, Sweden. (MTM Research Centre) ORCID iD: 0000-0003-3107-331X
Alzheimer's Research UK UCL Drug Discovery Institute, London, UK.
2022 (English) In: Scientific Reports, E-ISSN 2045-2322, Vol. 12, no 1, article id 7244. Article in journal (Refereed) Published
Abstract [en]

Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate under the assumption that test and calibration sets are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly available data and applied to proprietary data for the liver toxicity and MNT in vivo endpoints were investigated. In most cases, a drastic decrease in the validity of the models was observed when applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. The restored validity is the first requisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven to be a useful approach to restore the validity of most models.
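
The recalibration idea in the abstract can be illustrated with a minimal sketch of a class-conditional (Mondrian) inductive conformal predictor. Everything here is an assumption made for illustration and not the authors' implementation: the frozen `model_prob` function stands in for a trained classifier, the one-dimensional Gaussian data and the `shift` parameter simulate a data drift, and the significance level `epsilon = 0.2` is arbitrary. The model is never retrained; only the calibration scores are swapped.

```python
import math
import random

LABELS = (0, 1)

def model_prob(x):
    """Hypothetical frozen model: P(class 1 | x). It is never retrained."""
    return 1.0 / (1.0 + math.exp(-x))

def nonconformity(x, label):
    """1 minus the model's probability for `label` (higher = stranger)."""
    p1 = model_prob(x)
    return 1.0 - (p1 if label == 1 else 1.0 - p1)

def calibrate(calib):
    """Collect nonconformity scores separately for each class (Mondrian)."""
    scores = {lab: [] for lab in LABELS}
    for x, y in calib:
        scores[y].append(nonconformity(x, y))
    return scores

def p_value(scores, x, label):
    """Share of calibration scores at least as nonconforming as (x, label)."""
    alpha = nonconformity(x, label)
    cal = scores[label]
    return (sum(s >= alpha for s in cal) + 1) / (len(cal) + 1)

def prediction_set(scores, x, epsilon):
    """Keep every label whose p-value exceeds the significance level."""
    return {lab for lab in LABELS if p_value(scores, x, lab) > epsilon}

def evaluate(scores, test, epsilon):
    """Validity: true label in the set. Efficiency: single-label sets."""
    sets = [prediction_set(scores, x, epsilon) for x, _ in test]
    validity = sum(y in s for s, (_, y) in zip(sets, test)) / len(test)
    efficiency = sum(len(s) == 1 for s in sets) / len(test)
    return validity, efficiency

def sample(rng, n_per_class, shift):
    """Synthetic data (assumption): two Gaussians, moved by `shift`."""
    data = []
    for lab, mean in ((0, -1.5), (1, 1.5)):
        data += [(rng.gauss(mean + shift, 1.0), lab) for _ in range(n_per_class)]
    return data

rng = random.Random(0)
epsilon = 0.2
old_calib = sample(rng, 200, shift=0.0)   # same distribution as training
new_calib = sample(rng, 200, shift=1.5)   # more similar to the holdout set
holdout = sample(rng, 500, shift=1.5)     # drifted test data

valid_old, eff_old = evaluate(calibrate(old_calib), holdout, epsilon)
valid_new, eff_new = evaluate(calibrate(new_calib), holdout, epsilon)
print(f"old calibration: validity={valid_old:.2f} efficiency={eff_old:.2f}")
print(f"new calibration: validity={valid_new:.2f} efficiency={eff_new:.2f}")
```

With `epsilon = 0.2` the expected validity is 0.8. In this toy setting the original calibration set falls short of that level on the drifted holdout data, while the updated calibration set brings validity back close to it. The printed efficiency values show how the share of single-label predictions changes alongside; the synthetic data are not meant to reproduce the magnitudes reported in the study.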

Place, publisher, year, edition, pages
Nature Publishing Group, 2022. Vol. 12, no 1, article id 7244
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:oru:diva-98864
DOI: 10.1038/s41598-022-09309-3
ISI: 000790941900035
PubMedID: 35508546
Scopus ID: 2-s2.0-85129425620
OAI: oai:DiVA.org:oru-98864
DiVA, id: diva2:1656441
Available from: 2022-05-06 Created: 2022-05-06 Last updated: 2024-01-16 Bibliographically approved

Authority records

Norinder, Ulf
