Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/92112
Full metadata record
| DC Field | Value | Language |
| --- | --- | --- |
| dc.date.accessioned | 2022-03-24T08:03:43Z | - |
| dc.date.available | 2022-03-24T08:03:43Z | - |
| dc.date.issued | 2021 | - |
| dc.identifier.citation | Briffa, E.J. (2021). Scaling protein motif discovery using tries in ‘Apache Spark’ (Bachelor's dissertation). | en_GB |
| dc.identifier.uri | https://www.um.edu.mt/library/oar/handle/123456789/92112 | - |
| dc.description | B.Sc. IT (Hons)(Melit.) | en_GB |
| dc.description.abstract | The field of bioinformatics applies computational techniques to biology. Improving the understanding of proteins, which are large molecules that have specific functions in organisms, requires discovering fixed patterns called motifs inside protein sequences, which are indicative of a protein’s structure and function. This research attempts to improve the speed of finding motifs by comparing unknown protein sequences to known protein domains as classified in the CATH hierarchy. The approach adopted in this study uses the Multiple Sequence Alignments (MSAs) from proteins found in CATH Functional Families. Each MSA contains motifs whose sequence regions have been preserved through evolution, known as conserved regions. The representative sequences for the Functional Families are stored as a suffix trie that is then used to find potential structures. To improve the efficiency of the search, the suffix trie is implemented using the Spark framework, which is used to process large amounts of data efficiently. The Spark architecture offers processing scalability by distributing the process over a number of nodes, thereby speeding up the search. The method then determines the best match through a scoring algorithm that ranks the output based on the closest match to a known structural motif. A substitution matrix is also used to consider all possible variations of the conserved regions. This system is compared against a library of Hidden Markov Models. The results produced by our system are comparable to those of the benchmark system and show that our system has great potential. | en_GB |
| dc.language.iso | en | en_GB |
| dc.rights | info:eu-repo/semantics/restrictedAccess | en_GB |
| dc.subject | Spark (Electronic resource : Apache Software Foundation) | en_GB |
| dc.subject | Data structures (Computer science) | en_GB |
| dc.subject | Information retrieval | en_GB |
| dc.subject | Sequence alignment (Bioinformatics) | en_GB |
| dc.title | Scaling protein motif discovery using tries in ‘Apache Spark’ | en_GB |
| dc.type | bachelorThesis | en_GB |
| dc.rights.holder | The copyright of this work belongs to the author(s)/publisher. The rights of this work are as defined by the appropriate Copyright Legislation or as modified by any successive legislation. Users may access this work and can make use of the information contained in accordance with the Copyright Legislation provided that the author must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the prior permission of the copyright holder. | en_GB |
| dc.publisher.institution | University of Malta | en_GB |
| dc.publisher.department | Faculty of Information and Communication Technology. Department of Computer Information Systems | en_GB |
| dc.description.reviewed | N/A | en_GB |
| dc.contributor.creator | Briffa, Ethan Joseph (2021) | - |
Appears in Collections: Dissertations - FacICT - 2021
Dissertations - FacICTCIS - 2021

Files in This Item:
| File | Description | Size | Format |
| --- | --- | --- | --- |
| 21BITSD008.pdf (Restricted Access) | | 2.28 MB | Adobe PDF |


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.