Scaling protein motif discovery using tries in ‘Apache Spark’

Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/92112

Full metadata record

DC Field	Value	Language
dc.date.accessioned	2022-03-24T08:03:43Z	-
dc.date.available	2022-03-24T08:03:43Z	-
dc.date.issued	2021	-
dc.identifier.citation	Briffa, E.J. (2021). Scaling protein motif discovery using tries in ‘Apache Spark’ (Bachelor's dissertation).	en_GB
dc.identifier.uri	https://www.um.edu.mt/library/oar/handle/123456789/92112	-
dc.description	B.Sc. IT (Hons)(Melit.)	en_GB
dc.description.abstract	Bioinformatics applies computational techniques to biology. Proteins are large molecules that perform specific functions in organisms, and understanding them better requires discovering motifs: fixed patterns within protein sequences that are indicative of a protein’s structure and function. This research attempts to speed up motif discovery by comparing unknown protein sequences against known protein domains as classified in the CATH hierarchy. The approach uses the Multiple Sequence Alignments (MSAs) of proteins in CATH Functional Families. Each MSA contains motifs with sequence regions that have been preserved through evolution, known as conserved regions. The representative sequences for the Functional Families are stored in a suffix trie, which is then searched to find potential structures. To make the search more efficient, the suffix trie is implemented on the Spark framework, which is designed to process large amounts of data. The Spark architecture scales by distributing the work over a number of nodes, thereby speeding up the search. A scoring algorithm then determines the best match by ranking the output according to how closely it matches a known structural motif. A substitution matrix is also used to consider all possible variations of the conserved regions. The system is compared against a library of Hidden Markov Models. Its results are comparable to those of the benchmark system, which suggests the approach has considerable potential.	en_GB
dc.language.iso	en	en_GB
dc.rights	info:eu-repo/semantics/restrictedAccess	en_GB
dc.subject	Spark (Electronic resource : Apache Software Foundation)	en_GB
dc.subject	Data structures (Computer science)	en_GB
dc.subject	Information retrieval	en_GB
dc.subject	Sequence alignment (Bioinformatics)	en_GB
dc.title	Scaling protein motif discovery using tries in ‘Apache Spark’	en_GB
dc.type	bachelorThesis	en_GB
dc.rights.holder	The copyright of this work belongs to the author(s)/publisher. The rights of this work are as defined by the appropriate Copyright Legislation or as modified by any successive legislation. Users may access this work and can make use of the information contained in accordance with the Copyright Legislation provided that the author must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the prior permission of the copyright holder.	en_GB
dc.publisher.institution	University of Malta	en_GB
dc.publisher.department	Faculty of Information and Communication Technology. Department of Computer Information Systems	en_GB
dc.description.reviewed	N/A	en_GB
dc.contributor.creator	Briffa, Ethan Joseph (2021)	-
Appears in Collections:	Dissertations - FacICT - 2021; Dissertations - FacICTCIS - 2021
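
The abstract above describes storing representative sequences in a suffix trie and ranking candidate matches with a substitution-aware score. The dissertation itself is restricted, so the following is only a minimal single-machine sketch of those two ideas, not the author's implementation. The names `SuffixTrie`, `score`, and `rank_families`, the toy `SIMILAR` substitution pairs, the score values, and the example sequences are all assumptions made for illustration. No real substitution matrix is used, and distribution over Spark nodes is not shown.

```python
class Node:
    """One trie node: child links plus the families whose suffixes pass through it."""
    __slots__ = ("children", "families")

    def __init__(self):
        self.children = {}
        self.families = set()


class SuffixTrie:
    """Naive suffix trie: every suffix of every sequence is inserted character by character."""

    def __init__(self):
        self.root = Node()

    def add(self, family_id, sequence):
        # Insert each suffix; O(n^2) nodes per sequence, acceptable for a sketch.
        for start in range(len(sequence)):
            node = self.root
            for ch in sequence[start:]:
                node = node.children.setdefault(ch, Node())
                node.families.add(family_id)

    def find(self, pattern):
        """Return the set of families containing `pattern` as an exact substring."""
        node = self.root
        for ch in pattern:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.families)


# Toy stand-in for a substitution matrix (hypothetical values, not BLOSUM or PAM).
SIMILAR = {frozenset("VI"), frozenset("LI"), frozenset("KR"), frozenset("DE")}


def score(motif, window):
    """Score two equal-length strings: 2 for identity, 1 for a 'similar' pair, -1 otherwise."""
    total = 0
    for a, b in zip(motif, window):
        if a == b:
            total += 2
        elif frozenset((a, b)) in SIMILAR:
            total += 1
        else:
            total -= 1
    return total


def rank_families(query, motifs):
    """Rank families by the best-scoring window of `query` against each family's motif."""
    ranked = []
    for family_id, motif in motifs.items():
        windows = (query[i:i + len(motif)] for i in range(len(query) - len(motif) + 1))
        best = max((score(motif, w) for w in windows), default=float("-inf"))
        ranked.append((family_id, best))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


trie = SuffixTrie()
trie.add("famA", "MKVLAG")
trie.add("famB", "GGVLAT")
exact_hits = trie.find("VLA")
ranking = rank_families("MKILAG", {"famA": "VLA", "famB": "KRD"})
```

Here the trie answers exact substring queries, while `rank_families` tolerates substitutions by scoring every window of the query. In a distributed setting the per-family scoring could in principle be spread over worker nodes, but how the dissertation partitions the trie is not stated in this record.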

Files in This Item:

File	Description	Size	Format
21BITSD008.pdf Restricted Access		2.28 MB	Adobe PDF