Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/108340
Title: | Few-shot learning for low data drug discovery |
Authors: | Vella, Daniel (2022) |
Keywords: | Drug development -- Computer simulation; Drugs -- Structure-activity relationships -- Computer simulation; Ligand binding (Biochemistry) -- Computer simulation; Machine learning |
Issue Date: | 2022 |
Citation: | Vella, D. (2022). Few-shot learning for low data drug discovery (Master's dissertation). |
Abstract: | Humans exhibit a remarkable ability to learn quickly from just a few examples. A child seeing a cat for the first time can reliably identify the animal as a cat upon future encounters. This learning ability is in stark contrast with conventional machine learning (ML) techniques, which are data-hungry. This data requirement poses a challenge for the application of ML to virtual screening (VS) in drug discovery. The main goal in ligand-based VS (LBVS) is to identify active molecules that exhibit the desired therapeutic activity against a biological target, based on information about known ligands for that target. Acquiring data on a compound’s activity against a biological target is resource-intensive, so such data are difficult to obtain. Hence, the aim of this study is to assess whether few-shot ML can be used effectively for low-data drug discovery. Meta-learning techniques aim to achieve the “learning to learn” ability observed in humans; we therefore explore few-shot ML for this problem domain using the Tox21, MUV and the GPCR subset of the DUD-E datasets. In few-shot ML, we train a model on data from a number of experimental assays, and then use this model to generalise to new experimental assays given only a small support set (1-10 examples per class). We build on the state-of-the-art work of Altae-Tran et al. (2017), which is based on Matching Networks, and use it as a foundation to introduce two architectures, Prototypical Networks and Relation Networks, to this domain. Additionally, we evaluate results using PR-AUC alongside the ROC-AUC used in the original work, giving a more interpretable picture of the proposed models’ performance on highly imbalanced data. Our results are consistent with those of the state of the art on the Tox21 and MUV datasets. To the best of our knowledge, the DUD-E dataset has not previously been explored in the few-shot learning setting. Our application of Prototypical Networks, improved with the iterative-refinement LSTM, achieves overall better performance than the state of the art on Tox21 data. On MUV data, the baseline models outperform the few-shot learning models. On the GPCR subset of DUD-E, our results are inconclusive: the models obtained outstanding performance on one target and inferior performance on the other. We also experiment with different embeddings on the Tox21 data and find that learned graph embeddings consistently outperform extended-connectivity fingerprints, a popular LBVS approach. Based on our findings, we conclude that the effectiveness of few-shot learning is highly dependent on the nature of the data available. The few-shot learning models struggle to perform consistently on MUV and DUD-E data, in which the active compounds are structurally distinct. However, on Tox21 data, which is typically used for lead optimisation, the few-shot ML models perform well, and our Prototypical Networks contribution even outperforms the state of the art. Additionally, these networks train much faster (up to 190% faster), achieving comparable or better results in a fraction of the training time. (Illustrative sketches of the episodic few-shot setup and of PR-AUC versus ROC-AUC under class imbalance follow this record.) |
Description: | M.Sc.(Melit.) |
URI: | https://www.um.edu.mt/library/oar/handle/123456789/108340 |
Appears in Collections: | Dissertations - FacICT - 2022; Dissertations - FacICTAI - 2022 |
Files in This Item:
File | Description | Size | Format
---|---|---|---
2219ICTICS520000003154_1.PDF | | 12.16 MB | Adobe PDF
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.
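The episodic setup described in the abstract, where a model classifies new queries from a small (1-10 per class) support set, can be illustrated with Prototypical Networks, in which each class is represented by the mean of its support embeddings. The sketch below is a minimal illustration, not code from the dissertation: the function name, embedding dimension, and random data are hypothetical, and in the dissertation's setting the embeddings would come from a learned graph network or extended-connectivity fingerprints rather than random vectors.

```python
import numpy as np

def prototypical_episode(support_x, support_y, query_x, n_classes=2):
    """Classify query embeddings by a softmax over negative squared
    Euclidean distances to class prototypes (mean support embeddings),
    following Snell et al. (2017). Shapes: support_x (N, D),
    support_y (N,), query_x (Q, D); returns (Q, n_classes) probabilities."""
    # Prototype = mean embedding of each class's support examples.
    prototypes = np.stack([support_x[support_y == c].mean(axis=0)
                           for c in range(n_classes)])              # (C, D)
    # Squared Euclidean distance from every query to every prototype.
    dists = ((query_x[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    # Numerically stable softmax over negative distances.
    logits = -dists
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy 5-shot episode with random 16-dimensional "embeddings".
rng = np.random.default_rng(0)
support_x = rng.normal(size=(10, 16))
support_y = np.repeat([0, 1], 5)   # 5 inactives, 5 actives
query_x = rng.normal(size=(3, 16))
print(prototypical_episode(support_x, support_y, query_x))
```

The iterative-refinement LSTM mentioned in the abstract would sit between the raw embeddings and this distance computation, letting support and query representations inform each other before the prototypes are formed.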
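The abstract's point about PR-AUC versus ROC-AUC on highly imbalanced data can be seen with a small synthetic example. This is a hedged sketch with made-up numbers (roughly 2% actives, a weak scorer), not results from the dissertation: a random classifier's expected ROC-AUC stays near 0.5 regardless of imbalance, while the PR-AUC baseline equals the active-class prevalence, so PR-AUC exposes poor retrieval of the rare actives that ROC-AUC can mask.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
# Hypothetical assay: 5,000 compounds, roughly 2% actives.
y_true = (rng.random(5000) < 0.02).astype(int)
# Weak scorer: actives receive only a modest bump over random noise.
scores = rng.random(5000) + 0.35 * y_true

print("actives:", y_true.sum(), "of", y_true.size)
print("ROC-AUC:", round(roc_auc_score(y_true, scores), 3))            # looks decent
print("PR-AUC :", round(average_precision_score(y_true, scores), 3))  # far lower
```

Here `average_precision_score` is used as a common PR-AUC estimator; the dissertation may compute the curve area differently, but the qualitative gap between the two metrics under heavy imbalance is the point being illustrated.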