Generating datasets through data source analysis using ADaGe

Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/26429

Full metadata record

DC Field	Value	Language
dc.date.accessioned	2018-02-06T13:34:40Z	-
dc.date.available	2018-02-06T13:34:40Z	-
dc.date.issued	2017	-
dc.identifier.uri	https://www.um.edu.mt/library/oar//handle/123456789/26429	-
dc.description	B.SC.IT(HONS)	en_GB
dc.description.abstract	Generating a synthetic dataset means simulating an existing dataset while retaining or modifying the proportions and characteristics of its real-world counterpart. Using synthetic datasets instead of real-world data has several advantages. Synthetic datasets can safeguard sensitive information by replacing it with fictitious subject matter (Wu et al., 2016). They can also reproduce conditions that cannot be found in real-world datasets by modifying field properties. Changing the distributions of values within a dataset is useful for extensively testing new algorithms to identify robustness issues and errors (Ayala-Rivera, McDonagh, Cerqueus, & Murphy, 2013). This can reveal more software bugs in a product, which is essential during testing. This study aims to extend the synthetic data generation tool ADaGe (Camilleri & Bonello, 2016) to semi-automatically infer information from an existing dataset, with the aim of replicating its characteristics and properties in a synthetic dataset. The main objective of this research is to automate the process of extracting data properties from the source database and to infer not only the structure of the data, but also the patterns that can be used to generate semantically similar data. By applying exploratory data analysis (EDA) techniques, relevant information on the relationships between different fields can be acquired. In contrast to existing data generation tools, which generally generate data on a column-by-column basis, this enables the relationships between the various attributes of a table to be replicated. Results show that by identifying the field types in a table and gathering the relevant statistics, we can acquire the information needed to replicate the relationships between different attributes. Percentage frequency distribution statistics retained relationships between categorical fields, while binning proved effective in preserving the distribution in quantitative fields. Regular expressions were used to define values for text fields which are known to follow a pre-specified pattern.	en_GB
dc.language.iso	en	en_GB
dc.rights	info:eu-repo/semantics/restrictedAccess	en_GB
dc.subject	Computer algorithms	en_GB
dc.subject	Quantitative research	en_GB
dc.title	Generating datasets through data source analysis using ADaGe	en_GB
dc.type	bachelorThesis	en_GB
dc.rights.holder	The copyright of this work belongs to the author(s)/publisher. The rights of this work are as defined by the appropriate Copyright Legislation or as modified by any successive legislation. Users may access this work and can make use of the information contained in accordance with the Copyright Legislation provided that the author must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the prior permission of the copyright holder.	en_GB
dc.publisher.institution	University of Malta	en_GB
dc.publisher.department	Faculty of Information and Communication Technology	en_GB
dc.description.reviewed	N/A	en_GB
dc.contributor.creator	Xuereb, Matthew	-
Appears in Collections:	Dissertations - FacICT - 2017
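
The abstract above mentions percentage frequency distributions for categorical fields and binning for quantitative fields. The following is a minimal Python sketch of those two ideas only; it is not ADaGe's implementation, which is not shown in this record. The function names (joint_frequencies, bin_counts, generate), the equal-width binning, and the choice to sample the numeric field independently of the categorical ones are all assumptions made for illustration.

```python
import random
from collections import Counter


def joint_frequencies(rows, fields):
    """Relative frequency of each observed combination of categorical values."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}


def bin_counts(values, n_bins):
    """Equal-width histogram as a list of (low, high, count) tuples."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def generate(rows, cat_fields, num_field, n, n_bins=5, seed=0):
    """Draw n synthetic records (hypothetical helper, not part of ADaGe).

    Categorical fields are sampled jointly, so combinations that co-occur in
    the source keep their observed proportions. The numeric field is sampled
    from its binned distribution, independently of the categorical fields.
    """
    rng = random.Random(seed)
    freqs = joint_frequencies(rows, cat_fields)
    combos = list(freqs)
    combo_weights = [freqs[c] for c in combos]
    bins = bin_counts([r[num_field] for r in rows], n_bins)
    bin_weights = [count for _, _, count in bins]

    out = []
    for _ in range(n):
        combo = rng.choices(combos, weights=combo_weights)[0]
        low, high, _count = rng.choices(bins, weights=bin_weights)[0]
        record = dict(zip(cat_fields, combo))
        record[num_field] = rng.uniform(low, high)  # uniform within the chosen bin
        out.append(record)
    return out
```

Sampling whole combinations rather than one column at a time is what keeps the relationship between categorical fields; a column-by-column generator would lose it. Text fields that follow a fixed pattern, which the abstract handles with regular expressions, are not covered by this sketch.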

Files in This Item:

File	Description	Size	Format
17BITSD031.pdf (Restricted Access)		2.5 MB	Adobe PDF
