Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/53023
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.date.accessioned | 2020-03-25T08:26:44Z | - |
dc.date.available | 2020-03-25T08:26:44Z | - |
dc.date.issued | 2019 | - |
dc.identifier.citation | Penza, K. (2019). GO term predictions in CATH : a machine learning approach (Master's dissertation). | en_GB |
dc.identifier.uri | https://www.um.edu.mt/library/oar/handle/123456789/53023 | - |
dc.description | M.SC.ARTIFICIAL INTELLIGENCE | en_GB |
dc.description.abstract | Proteins perform different tasks within an organism, such as regulation and signalling. Protein function is characterised through laboratory experiments or predicted using computational methods, and is described using Gene Ontology (GO) terms. Protein sequencing is the process of determining the amino acid sequence that makes up the protein. Improvements in sequencing technology are making the process more accessible, leading to an ever-increasing growth rate of protein databases. The low throughput of laboratory experiments and the increasing rate at which proteins are deposited in protein databases have made protein function prediction (PFP) a central problem in computational biology. Domains are independent structural units that have their own structure and function. Structural protein databases categorise proteins using structural properties. CATH is a structural database that classifies protein domains using four levels of hierarchy. This research applied machine learning (ML) techniques to improve PFP. The protein function aspect investigated was molecular function. This research uses a labelled ML dataset consisting of GO terms, features extracted from protein sequence, and proportions computed from protein databases such as CATH and PFAM. The problem was tackled by defining five experiments that were executed on Homo sapiens and E. coli datasets. Model performance was measured using Fmax, computed as per the Critical Assessment of Functional Annotation (CAFA) shared task methodology. The first experiment applied automatic feature selection using four different fitness methods based on Random Forest (RF) and Support Vector Machine. The second experiment applied different neural network architectures to the datasets. The third experiment applied cross validation to the automatic feature selection process to assess how sensitive feature selection is to the dataset. The fourth experiment investigated the amount of training data required by the best-performing ML model for each species identified in the first experiment. The fifth experiment investigated the application of the best-performing ML model for each species identified in the first experiment to other species. The methods selected in the first and second experiments were evaluated on the CAFA3 targets. The RF with the Gini node-splitting criterion outperforms the best CAFA2 methods by an Fmax of 0.01 for Homo sapiens and an Fmax of 0.16 for E. coli. The cross validation of the automatic feature selection shows that E. coli models were more sensitive to changes in the dataset than Homo sapiens models. The smaller E. coli dataset explains the sensitivity observed. The training dataset size experiment shows that the models have similar performance levels with the same amount of training data. The experiment that applied species-specific models to different species confirms the intuition that models perform well on species of the same domain, and that performance decreases as evolutionary distance increases. The results show that features based on protein structure and proportions from structural protein databases permit reliable PFP. | en_GB |
dc.language.iso | en | en_GB |
dc.rights | info:eu-repo/semantics/openAccess | en_GB |
dc.subject | Machine learning | en_GB |
dc.subject | Protein-protein interactions | en_GB |
dc.subject | Computational biology | en_GB |
dc.subject | Bioinformatics | en_GB |
dc.title | GO term predictions in CATH : a machine learning approach | en_GB |
dc.type | masterThesis | en_GB |
dc.rights.holder | The copyright of this work belongs to the author(s)/publisher. The rights of this work are as defined by the appropriate Copyright Legislation or as modified by any successive legislation. Users may access this work and can make use of the information contained in accordance with the Copyright Legislation provided that the author must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the prior permission of the copyright holder. | en_GB |
dc.publisher.institution | University of Malta | en_GB |
dc.publisher.department | Faculty of Information and Communication Technology. Department of Artificial Intelligence | en_GB |
dc.description.reviewed | N/A | en_GB |
dc.contributor.creator | Penza, Kenneth | - |
Appears in Collections: | Dissertations - FacICT - 2019; Dissertations - FacICTAI - 2019 |
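
The abstract reports model performance as Fmax, computed per the CAFA methodology. As a rough illustration of that metric only, the sketch below computes a protein-centric Fmax: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all benchmark proteins, and the best F-measure across thresholds is kept. The function name `fmax`, the dictionary layout, the 0.01 threshold step, and the toy GO identifiers are assumptions made for this example; it does not propagate terms to GO ancestors and is not the dissertation's own code.

```python
from typing import Dict, Set


def fmax(predictions: Dict[str, Dict[str, float]],
         truth: Dict[str, Set[str]],
         steps: int = 100) -> float:
    """Protein-centric Fmax over thresholds 1/steps, 2/steps, ..., 1.0.

    predictions: protein id -> {GO term -> confidence score in [0, 1]}
    truth:       protein id -> set of experimentally annotated GO terms
    """
    # Assumption: only proteins with at least one true term are benchmarked.
    benchmark = {p: terms for p, terms in truth.items() if terms}
    if not benchmark:
        return 0.0

    best = 0.0
    for k in range(1, steps + 1):
        threshold = k / steps
        precisions = []      # one entry per protein with >= 1 predicted term
        recall_sum = 0.0     # accumulated over every benchmark protein
        for protein, true_terms in benchmark.items():
            scores = predictions.get(protein, {})
            predicted = {t for t, s in scores.items() if s >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(benchmark)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data with made-up term identifiers, for illustration only.
truth = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
predictions = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4, "GO:X": 0.2},
    "P2": {"GO:C": 0.6, "GO:Y": 0.6},
}
score = fmax(predictions, truth)
print(f"{score:.3f}")  # 0.857
```

In the toy data the maximum is reached for thresholds between 0.2 and 0.4, where average precision is 0.75 and average recall is 1.0.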
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
19MAIPT011.pdf | | 3.84 MB | Adobe PDF | View/Open |
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.