Investigating the use of the map-reduce paradigm for genome alignment

Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/24751

Title:	Investigating the use of the map-reduce paradigm for genome alignment
Authors:	Abdilla, Sara Ann
Keywords:	Genomics Python (Computer program language) Malta Human Genome Project
Issue Date:	2017
Abstract:	Genome alignment has many crucial applications in multiple areas, from medical diagnosis to forensic investigation. Genomes, particularly human ones, consist of a large amount of data - approximately 3 GB (or 3 billion base-pairs) - so they take up a lot of space. Due to limits in DNA sequencing technologies, the whole genome cannot currently be read in one go. These technologies instead have to output millions of short DNA reads (fragments of the genome) to deal with the data. All of these reads are then aligned to the reference genome. This alignment is not a perfect exact match and may contain differences in the form of substitutions, insertions and deletions due to sequencing errors, the natural variation between species and the di erences between DNA sequences of individuals. The motivation behind this project is the fact that the University of Malta is developing a national Maltese Human Reference Genome, for the Malta Human Genome Project (MHGP), which requires analysis tools for its alignment technology. Our aim is to investigate the best strategy for this distributed alignment using the Map-Reduce paradigm and to run it on the cloud in order to attempt to achieve speed up over current methods. This project was designed around the investigation and evaluation of two genome compressors - a 32-bit and a 64-bit encoder - and seven Map-Reduce aligners - na ve matching using the Hamming distance metric, na ve matching using the Edit distance metric, Boyer-Moore, k-mer index, Smith-Waterman, FM index and Burrows-Wheeler. The implementation was developed using the Python programming language. Also, the Apache Hadoop Map-Reduce framework was used for local and cloud executions. The results and evaluation showed that the 32-bit encoder performed similarly to the Gzip compression tool. The latter is however slightly faster, so it was selected to be used by the Map-Reduce aligners. The results and evaluation of these aligners showed that most produced accurate alignments and were executed in a reasonable amount of time - the k-mer index aligner clearly executing the fastest. Evaluating these against the existing, popular alignment tool BWA showed that, in terms of accuracy, all the aligners return similar results; our aligners providing a more readable and concise output. In terms of run-time, BWA outperforms most of the Map-Reduce aligners with the exception of the k-mer index aligner which showed similar timings. However, BWA is limited to execute on a multi-threaded single node while our approach is more scalable as it allows execution on multiple nodes. It was concluded that as the 32-bit encoder and the k-mer index aligner showed similar results as those of existing tools, our implementations could outperform these known tools with some optimizations involving algorithm updates and improved system capabilities. The aligner system could be improved, for instance, by increasing the cluster size to outperform BWA. Future work on this area may involve some optimisations for the 32-bit encoder and for the different Map-Reduce aligners (in particular, the k-mer index aligner), the use of quality scores to measure the sequencing accuracy of the resultant alignments and the construction of more analysis tools for the MHGP such as genome visualisation tools and genome browsers.
Description:	B.SC.(HONS)COMP.SCI.
URI:	https://www.um.edu.mt/library/oar//handle/123456789/24751
Appears in Collections:	Dissertations - FacICT - 2017 Dissertations - FacICTCS - 2017

Files in This Item:

File	Description	Size	Format
17BCS001.pdf Restricted Access		962.91 kB	Adobe PDF	View/Open Request a copy

Show full item record Statistics