Application and improvement of genetic algorithms and genetic programming towards the fight against spam and other internet malware

Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/73653

Title:	Application and improvement of genetic algorithms and genetic programming towards the fight against spam and other internet malware
Authors:	Meli, Clyde (2017)
Keywords:	Spam (Electronic mail) Mathematical notation Genetic algorithms Genetic programming (Computer science) Spam filtering (Electronic mail)
Issue Date:	2017
Citation:	Meli, C. (2017). Application and improvement of genetic algorithms and genetic programming towards the fight against spam and other internet malware (Doctoral dissertation).
Abstract:	While spam is increasingly acknowledged as a very expensive problem on the Internet, spam filters attempt to detect spam from emails with the highest possible accuracy. Reverse Polish Notation (RPN) expressions are proposed as a means to combine a range of evaluated features from emails for the detection of spam. Theoretical arguments for the use of RPN expressions applied to spam detection are proposed, together with a new RPN Block Representation. Theoretical comparisons of RPN expressions with Naïve Bayes and Support Vector Machine are also given. It is shown that such RPN expressions are more expressive. A proof that email spam detection is NP-complete is given by mapping groups of email spams onto malware virus families. Seventy-two features, ranging from Subject-line, Header-based, Message Body-based, URL-based and stylistic, have been used to evolve RPN expressions using Linear Genetic Programming. New features and specifically the application of a group of URL features to spam detection are proposed (since many spams contain links to domains which are at times even malicious). These new features are shown to be useful for spam detection theoretically and in practice. Linear Genetic Programming is a subset of Genetic Programming where chromosomes are computer programs represented using imperative computer language instructions or machine code instead of trees made up of symbolic expressions. Such machine code can encode RPN expressions. The Linear Genetic Programming system is used to “learn” an RPN expression consisting of a combination of features which can be used to detect spam. A number of feature selection algorithms are used to identify which subsets of features are most relevant to classification. The feature selection techniques Minimum Redundancy Maximum Relevance method (this filter technique finds features which are mutually far from each other while still having high “correlation” to classification), the conventional Maximum Relevance method, Principle Component Analysis and using the entire feature set are investigated using the Linear Genetic Programming system applied to the SpamAssassin Spam Corpus, which is a standard ham and spam archive. Theoretical and practical comparisons with literature results and industry open source applications are given for the proposed system. A new Block Crossover operator which helps preserve and keep intact RPN Blocks as well as a new RPN First Subexpression algorithm are also proposed. This algorithm helps to maintain a double check for spam and ham and results in improvement of detection accuracy and other metrics. The system was very demanding computationally, taking a long time to run on a supercomputer. The proposed system achieved 99.17% accuracy and an F1-score of 0.9868, which compare very well with results given in the literature as well as with performance of industrial Bogofilter and SpamAssassin Spam Filters.
Description:	PH.D.
URI:	https://www.um.edu.mt/library/oar/handle/123456789/73653
Appears in Collections:	Dissertations - FacICT - 2017 Dissertations - FacICTCIS - 2017

Files in This Item:

File	Description	Size	Format
Clyde Meli - Dissertation.pdf Restricted Access		3.46 MB	Adobe PDF	View/Open Request a copy

Show full item record Statistics