Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/108383
Title: | Automatic structure analysis on the Maltese Government Gazette |
Authors: | Rizzo, Christopher (2022) |
Keywords: | Malta. Government Gazette; Optical character recognition; Machine learning; Neural networks (Computer science) |
Issue Date: | 2022 |
Citation: | Rizzo, C. (2022). Automatic structure analysis on the Maltese Government Gazette (Master's dissertation). |
Abstract: | The digital documents of the Government’s Department of Information contain a treasure trove of information. Government Gazettes centralise the broadcasting of all legal notices from the government. The Maltese Government Gazette portal exposes these documents but offers limited means of browsing this information. In this dissertation we present a solution that involves a number of steps defined within a pipeline. Initially we collected the Government Gazette PDF documents from the government portal. For each document, the pages were converted to images and then forwarded to the next step in order to extract the logical structure. Document Layout Analysis was applied to the document images in order to identify and classify regions of interest. Mask R-CNN is a Convolutional Neural Network (CNN) and a state-of-the-art model for image segmentation that is usually applied to object detection in images; here we applied it specifically to document images. To obtain usable results, the model had to be trained. Readily available Mask R-CNN models mostly cover the detection of objects such as persons, cars and furniture. Our challenge was of a completely different kind: it entailed identifying document segments such as titles, figures, lists, text paragraphs and tables. This raised another challenge: the absence of readily available annotated Government Gazette datasets. The next best option was therefore to find a similar dataset on which to initially train the machine learning model. One such dataset is PubLayNet (Zhong et al., 2019), which contains a large number of medical document images. It was selected because the layout of its PubMed documents was found to be similar, to a degree, to that of the Government Gazette.
An implementation of such a prediction model was obtained using Detectron, a solution from Facebook, which we initially trained on PubLayNet. Subsequently, a smaller annotated dataset based on actual Maltese Government Gazette pages was created from the ground up and split between training and evaluation. The next step was to determine the impact that transfer learning with this dataset has on the results obtained with the PubLayNet dataset. In the subsequent steps, Optical Character Recognition (OCR) was applied to the classified regions of interest. Reading order and geometric exclusion of text allowed us to maintain the context of each extracted topic, which is needed in the subsequent information extraction step. We then performed Named Entity Recognition (NER) to identify salient entities, which were used in an initial iteration to produce a basic Knowledge Graph (KG) that allows the end user to search for dates and streets within the Police Notices section. Finally, the Knowledge Graph was evaluated using methods available in this field. The main research field of this dissertation is Document Layout Analysis (DLA); on a smaller scale, the creation of a Knowledge Graph (KG) was also reviewed and evaluated. |
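The abstract mentions reading order and geometric exclusion of text but gives no details. The following is only an illustrative sketch of what such a step could look like, not the dissertation's implementation. It assumes that layout analysis yields labelled bounding boxes with a top-left origin, that pages have a fixed number of equal-width columns, and that text lying mostly inside a table or figure should be dropped. The names `Region`, `exclude_nested_text` and `reading_order`, the two-column default and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A classified region of interest (hypothetical structure)."""
    label: str  # e.g. "title", "text", "list", "table", "figure"
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_fraction(inner: Region, outer: Region) -> float:
    """Fraction of inner's area that lies inside outer."""
    w = min(inner.x1, outer.x1) - max(inner.x0, outer.x0)
    h = min(inner.y1, outer.y1) - max(inner.y0, outer.y0)
    if w <= 0 or h <= 0 or inner.area == 0:
        return 0.0
    return (w * h) / inner.area


def exclude_nested_text(regions, containers=("table", "figure"), threshold=0.8):
    """Geometric exclusion: drop regions lying mostly inside a container region."""
    outers = [r for r in regions if r.label in containers]
    kept = []
    for r in regions:
        nested = r.label not in containers and any(
            overlap_fraction(r, o) >= threshold for o in outers
        )
        if not nested:
            kept.append(r)
    return kept


def reading_order(regions, page_width, columns=2):
    """Sort regions column by column (left to right), then top to bottom."""
    col_width = page_width / columns

    def key(r):
        centre_x = (r.x0 + r.x1) / 2
        col = min(int(centre_x // col_width), columns - 1)
        return (col, r.y0, r.x0)

    return sorted(regions, key=key)
```

On a page with a title and a paragraph in the left column and a paragraph and a table in the right column, `reading_order(exclude_nested_text(regions), page_width)` would return the left-column regions first, after dropping any text box detected inside the table.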
Description: | M.Sc.(Melit.) |
URI: | https://www.um.edu.mt/library/oar/handle/123456789/108383 |
Appears in Collections: | Dissertations - FacICT - 2022; Dissertations - FacICTAI - 2022 |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
2219ICTICS520005057139_1.PDF (Restricted Access) | | 9.09 MB | Adobe PDF |
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.