GO term predictions in CATH : a machine learning approach

Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/53023

Full metadata record

DC Field	Value	Language
dc.date.accessioned	2020-03-25T08:26:44Z	-
dc.date.available	2020-03-25T08:26:44Z	-
dc.date.issued	2019	-
dc.identifier.citation	Penza, K. (2019). GO term predictions in CATH : a machine learning approach (Master's dissertation).	en_GB
dc.identifier.uri	https://www.um.edu.mt/library/oar/handle/123456789/53023	-
dc.description	M.SC.ARTIFICIAL INTELLIGENCE	en_GB
dc.description.abstract	Proteins perform different tasks within an organism, such as regulation and signalling. Protein function is characterised through laboratory experiments or predicted using computational methods, and is described using Gene Ontology (GO) terms. Protein sequencing is the process of determining the amino acid sequence that makes up a protein. Improvements in sequencing technology are making the process more accessible, leading to an ever-increasing growth rate of protein databases. The low throughput of laboratory experiments and the increasing rate at which proteins are deposited in protein databases have made protein function prediction (PFP) a central problem in computational biology. Domains are independent structural units that have their own structure and function. Structural protein databases categorise proteins using structural properties. CATH is a structural database that classifies protein domains using four levels of hierarchy. This research applied machine learning (ML) techniques to improve PFP. The protein function aspect investigated was molecular function. This research uses a labelled ML dataset consisting of GO terms, features extracted from protein sequence and proportions computed from protein databases such as CATH and PFAM. The problem was tackled by defining five experiments that were executed on Homo sapiens and E. coli datasets. Model performance was measured using Fmax, computed as per the Critical Assessment of Functional Annotation (CAFA) shared task methodology. The first experiment applied automatic feature selection using four different fitness methods based on Random Forest (RF) and Support Vector Machine. The second experiment applied different neural network architectures to the datasets. The third experiment applied cross validation to the automatic feature selection process to assess how sensitive feature selection is to the dataset. The fourth experiment investigated the amount of training data required by the best performing ML model for each species identified in the first experiment. The fifth experiment investigated the application of the best performing ML model for each species identified in the first experiment to other species. The methods selected in the first and second experiments were evaluated on the CAFA3 targets. The RF with Gini node splitting criterion outperforms the best CAFA2 methods by an Fmax margin of 0.01 for Homo sapiens and 0.16 for E. coli. The cross validation of the automatic feature selection shows that E. coli models were more sensitive to changes in the dataset than Homo sapiens models. The smaller E. coli dataset explains the sensitivity observed. The training dataset size experiment shows that the models have similar performance levels with the same amount of training data. The experiment that applied species-specific models to different species confirms the intuition that models perform well on species of the same domain, and that performance decreases as evolutionary distance increases. The results show that features based on protein structure and proportions from structural protein databases permit reliable PFP.	en_GB
dc.language.iso	en	en_GB
dc.rights	info:eu-repo/semantics/openAccess	en_GB
dc.subject	Machine learning	en_GB
dc.subject	Protein-protein interactions	en_GB
dc.subject	Computational biology	en_GB
dc.subject	Bioinformatics	en_GB
dc.title	GO term predictions in CATH : a machine learning approach	en_GB
dc.type	masterThesis	en_GB
dc.rights.holder	The copyright of this work belongs to the author(s)/publisher. The rights of this work are as defined by the appropriate Copyright Legislation or as modified by any successive legislation. Users may access this work and can make use of the information contained in accordance with the Copyright Legislation provided that the author must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the prior permission of the copyright holder.	en_GB
dc.publisher.institution	University of Malta	en_GB
dc.publisher.department	Faculty of Information and Communication Technology. Department of Artificial Intelligence	en_GB
dc.description.reviewed	N/A	en_GB
dc.contributor.creator	Penza, Kenneth	-
Appears in Collections:	Dissertations - FacICT - 2019; Dissertations - FacICTAI - 2019

Files in This Item:

File	Description	Size	Format
19MAIPT011.pdf		3.84 MB	Adobe PDF
