Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/26429
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.date.accessioned | 2018-02-06T13:34:40Z | - |
dc.date.available | 2018-02-06T13:34:40Z | - |
dc.date.issued | 2017 | - |
dc.identifier.uri | https://www.um.edu.mt/library/oar//handle/123456789/26429 | - |
dc.description | B.SC.IT(HONS) | en_GB |
dc.description.abstract | Generating a synthetic dataset means simulating an existing dataset in a way that retains, or deliberately modifies, the proportions and characteristics of its real-world counterpart. Using synthetic datasets instead of real-world data has several advantages. Synthetic datasets can safeguard sensitive information by replacing it with fictitious values (Wu et al., 2016). They can also reproduce conditions that do not occur in real-world datasets, by modifying field properties. Changing the distributions of values within a dataset is useful for extensively testing new algorithms, to identify robustness issues and errors (Ayala-Rivera, McDonagh, Cerqueus, & Murphy, 2013). This can reveal more software bugs in a product, which is essential during testing. This study aims to extend the synthetic data generation tool ADaGe (Camilleri & Bonello, 2016) to semi-automatically infer information from an existing dataset, with the aim of replicating its characteristics and properties in a synthetic dataset. The main objective of this research is to automate the extraction of data properties from the source database and to infer not only the structure of the data but also the patterns that can be used to generate semantically similar data. By applying exploratory data analysis (EDA) techniques, relevant information on the relationships between different fields can be acquired. In contrast to existing data generation tools, which generally generate data column by column, this enables the relationships between the various attributes of a table to be replicated. Results show that by identifying the field types in a table and gathering the relevant statistics, we can acquire the information needed to replicate the relationships between different attributes. Percentage frequency distribution statistics retained relationships between categorical fields, while binning proved effective in preserving the distribution in quantitative fields. Regular expressions were used to define values for text fields which are known to follow a pre-specified pattern. | en_GB |
dc.language.iso | en | en_GB |
dc.rights | info:eu-repo/semantics/restrictedAccess | en_GB |
dc.subject | Computer algorithms | en_GB |
dc.subject | Quantitative research | en_GB |
dc.title | Generating datasets through data source analysis using ADaGe | en_GB |
dc.type | bachelorThesis | en_GB |
dc.rights.holder | The copyright of this work belongs to the author(s)/publisher. The rights of this work are as defined by the appropriate Copyright Legislation or as modified by any successive legislation. Users may access this work and can make use of the information contained in accordance with the Copyright Legislation provided that the author must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the prior permission of the copyright holder. | en_GB |
dc.publisher.institution | University of Malta | en_GB |
dc.publisher.department | Faculty of Information and Communication Technology | en_GB |
dc.description.reviewed | N/A | en_GB |
dc.contributor.creator | Xuereb, Matthew | - |
Appears in Collections: | Dissertations - FacICT - 2017 |
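The abstract names two statistics used to carry relationships and distributions over to synthetic data: percentage frequency distributions for categorical fields and binning for quantitative fields. The Python sketch below illustrates the general idea only. It is not ADaGe's implementation, and every function name, the equal-width binning choice, and the sample rows are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def conditional_frequencies(rows, given, target):
    """Percentage frequency of each `target` value, per value of `given`.

    Hypothetical helper: conditioning one categorical field on another is
    one way to keep the relationship between the two fields.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[given]][row[target]] += 1
    return {
        g: {t: 100.0 * n / sum(c.values()) for t, n in c.items()}
        for g, c in counts.items()
    }


def bin_counts(values, n_bins):
    """Equal-width histogram as a list of (low, high, count) tuples."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant fields
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls into the last bin rather than past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]


def sample_categorical(freqs, rng):
    """Draw one value with probability proportional to its percentage."""
    values = list(freqs)
    return rng.choices(values, weights=[freqs[v] for v in values], k=1)[0]


def sample_binned(bins, rng):
    """Pick a bin by its count, then draw uniformly inside that bin."""
    low, high, _ = rng.choices(bins, weights=[c for _, _, c in bins], k=1)[0]
    return rng.uniform(low, high)


if __name__ == "__main__":
    # Made-up source rows, for illustration only.
    source = [
        {"country": "MT", "language": "mt", "age": 23},
        {"country": "MT", "language": "en", "age": 31},
        {"country": "MT", "language": "mt", "age": 45},
        {"country": "IT", "language": "it", "age": 52},
    ]
    rng = random.Random(0)
    lang_given_country = conditional_frequencies(source, "country", "language")
    age_bins = bin_counts([r["age"] for r in source], n_bins=3)

    # Generate one synthetic row: the language depends on the country drawn.
    country = rng.choice([r["country"] for r in source])
    print({
        "country": country,
        "language": sample_categorical(lang_given_country[country], rng),
        "age": round(sample_binned(age_bins, rng)),
    })
```

Sampling the language conditionally on the country, rather than independently, is what distinguishes this from column-by-column generation: a country and language combination that never occurs in the source cannot appear in the output.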
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
17BITSD031.pdf | Restricted Access | 2.5 MB | Adobe PDF |
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.