Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/26429
Full metadata record
DC Field | Value | Language
dc.date.accessioned | 2018-02-06T13:34:40Z | -
dc.date.available | 2018-02-06T13:34:40Z | -
dc.date.issued | 2017 | -
dc.identifier.uri | https://www.um.edu.mt/library/oar//handle/123456789/26429 | -
dc.description | B.SC.IT(HONS) | en_GB
dc.description.abstract | Generating a synthetic dataset means simulating an existing dataset while retaining or modifying the proportions and characteristics of its real-world counterpart. Using synthetic datasets instead of real-world data has several advantages. Synthetic datasets can safeguard sensitive information by replacing it with fictitious subject matter (Wu et al., 2016). They can also produce synthetic conditions that cannot be found in real-world datasets by modifying field properties. Changing the distributions of values within a dataset is useful for extensively testing new algorithms to identify robustness issues and errors (Ayala-Rivera, McDonagh, Cerqueus, & Murphy, 2013). This can reveal more software bugs in a product, which is essential during testing. This study aims to extend the synthetic data generation tool ADaGe (Camilleri & Bonello, 2016) to semi-automatically infer information from an existing dataset, with the aim of replicating its characteristics and properties in a synthetic dataset. The main objective of this research is to automate the extraction of data properties from the source database and to infer not only the structure of the data but also the patterns that can be used to generate semantically similar data. By applying exploratory data analysis (EDA) techniques, relevant information on the relationships between different fields can be acquired. In contrast to existing data generation tools, which generally generate data on a column-by-column basis, this enables the relationships between the various attributes of a table to be replicated. Results show that by identifying the field types in a table and gathering the relevant statistics, we can acquire the information needed to replicate the relationships between different attributes. Percentage frequency distribution statistics retained relationships between categorical fields, while binning proved effective in preserving the distribution in quantitative fields. Regular expressions were used to define values for text fields which are known to follow a pre-specified pattern. | en_GB
dc.language.iso | en | en_GB
dc.rights | info:eu-repo/semantics/restrictedAccess | en_GB
dc.subject | Computer algorithms | en_GB
dc.subject | Quantitative research | en_GB
dc.title | Generating datasets through data source analysis using ADaGe | en_GB
dc.type | bachelorThesis | en_GB
dc.rights.holder | The copyright of this work belongs to the author(s)/publisher. The rights of this work are as defined by the appropriate Copyright Legislation or as modified by any successive legislation. Users may access this work and can make use of the information contained in accordance with the Copyright Legislation provided that the author must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the prior permission of the copyright holder. | en_GB
dc.publisher.institution | University of Malta | en_GB
dc.publisher.department | Faculty of Information and Communication Technology | en_GB
dc.description.reviewed | N/A | en_GB
dc.contributor.creator | Xuereb, Matthew | -
Appears in Collections: Dissertations - FacICT - 2017
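
The abstract names three techniques: percentage frequency distributions for categorical fields, binning for quantitative fields, and regular expressions for patterned text fields. The following is a minimal Python sketch of how such statistics could be gathered and sampled. It is not ADaGe's implementation or API: every function name, the equal-width binning choice, and the `[A-Z]{2}\d{4}` pattern are hypothetical and chosen only for illustration.

```python
import random
import re
import string
from collections import Counter

# Hypothetical pattern for a text field: two capital letters, then four digits.
CODE_PATTERN = re.compile(r"[A-Z]{2}\d{4}")


def joint_percentages(rows, fields):
    """Percentage frequency of each combination of categorical values.

    Counting whole combinations (not single columns) is what keeps the
    relationship between the fields when sampling later.
    """
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    total = sum(counts.values())
    return {combo: 100.0 * n / total for combo, n in counts.items()}


def equal_width_bins(values, n_bins):
    """Return (low, high, count) triples for equal-width bins (an assumption)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # avoid zero width for constant data
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - low) / width), n_bins - 1)  # max value -> last bin
        counts[idx] += 1
    return [(low + i * width, low + (i + 1) * width, c)
            for i, c in enumerate(counts)]


def sample_row(rng, percentages, fields, bins, numeric_field):
    """Draw one synthetic row from the gathered statistics."""
    combos = list(percentages)
    combo = rng.choices(combos, weights=[percentages[c] for c in combos], k=1)[0]
    lo, hi, _ = rng.choices(bins, weights=[c for _, _, c in bins], k=1)[0]
    row = dict(zip(fields, combo))
    row[numeric_field] = rng.uniform(lo, hi)  # uniform within the chosen bin
    return row


def sample_code(rng):
    """Generate a text value intended to match CODE_PATTERN."""
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(2))
    digits = "".join(rng.choice(string.digits) for _ in range(4))
    return letters + digits
```

Given source rows as a list of dictionaries, `joint_percentages` and `equal_width_bins` would be computed once, after which `sample_row` can be called repeatedly with a seeded `random.Random` to produce reproducible synthetic rows.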

Files in This Item:
File | Description | Size | Format
17BITSD031.pdf (Restricted Access) | | 2.5 MB | Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.