Title: | A study on the use of keywords in a graph-based image caption generation system |
Authors: | Birmingham, Brandon (2022) |
Keywords: | Natural language generation (Computer science); Computer vision |
Issue Date: | 2022 |
Citation: | Birmingham, B. (2022). A study on the use of keywords in a graph-based image caption generation system (Doctoral dissertation). |
Abstract: | A long-standing goal of Artificial Intelligence is to have agents capable of understanding and interpreting the visual world using natural language. Research at the intersection of Computer Vision and Natural Language Processing is currently booming, and the automatic generation of image captions has gained significant popularity. Several ideas and architectures have been proposed to machine-generate human-like sentences that describe images, but all fall short of human-level quality. In general, the task of image caption generation involves selecting the salient objects, attributes, and relations depicted in an image, which are then combined into a natural language sentence. While state-of-the-art architectures attempt this task in one step, this PhD studies the generation of sentences from a set of discrete keywords. The attentional behaviour of the human brain while processing the visual world inspired this study and led to the hypothesis that captions can be generated from a relevant set of keywords which are then connected through a path traversal in a knowledge graph derived from a language dataset. This novel combination acts as a bridge between the vision and language modalities, where keywords are represented as graph nodes, while the sequence between keywords is reflected by directed edges. As opposed to the currently popular end-to-end learning approach, the proposed model reduces the dependency on large-scale paired image-caption datasets, which are laborious and expensive to collect. To test this hypothesis, this study develops and evaluates KENGIC, a Keyword-driven and N-gram Graph-based Image Captioning framework which exploits n-gram sequences found in a given text corpus to construct sub-knowledge graphs for query images. With a set of predicted image keywords treated as nodes, the proposed system probabilistically connects these nodes into a directed graph through overlapping n-grams. The system infers the most likely captions by maximising the most probable n-gram sequences constructed from the predicted keywords. The study investigates the generation of image captions under different configurations based on (a) keywords extracted from gold-standard captions and (b) automatically detected keywords. Both quantitative and qualitative analyses demonstrate the effectiveness of KENGIC. As spatial relations (SRs) are inherently more difficult to predict from whole images due to their highly polysemous, locative, explicit, and ambiguous nature, this research also contributes to the problem of SR prediction by investigating it from a multi-label perspective. However, the explicit use of SRs was not found to improve the quality of the generated captions as measured by automatic metrics. The performance achieved by KENGIC is very close to that of current state-of-the-art image caption generators trained in the unpaired setting. The analysis of this approach could also shed light on the generation process behind current top-performing caption generators trained in the paired setting and, in addition, provide insights into the limitations of the most widely used automatic evaluation metrics in image captioning. |
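As a rough illustration of the keyword-to-graph idea described in the abstract, the sketch below builds a directed bigram graph from a toy corpus and walks it greedily so that the resulting path covers a set of predicted image keywords. This is a simplified stand-in, not the KENGIC implementation: the corpus, keyword set, and function names are hypothetical, and the actual framework connects keywords through overlapping higher-order n-grams with probabilistic path scoring rather than a greedy bigram walk.

```python
from collections import defaultdict

# Illustrative sketch only (not the KENGIC implementation): build a directed
# word graph from corpus bigram counts, then walk it greedily so that the
# path visits a set of predicted image keywords.

def bigram_graph(corpus):
    """Count directed bigram edges over a whitespace-tokenised corpus."""
    edges = defaultdict(int)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            edges[(a, b)] += 1
    return edges

def greedy_caption(keywords, edges, max_len=12):
    """Start at the first keyword and repeatedly follow the most frequent
    outgoing edge, preferring edges that reach a still-uncovered keyword."""
    remaining = set(keywords[1:])
    path = [keywords[0]]
    while remaining and len(path) < max_len:
        current = path[-1]
        successors = [(b, c) for (a, b), c in edges.items() if a == current]
        if not successors:
            break  # dead end: no outgoing edge from the current node
        to_keyword = [(b, c) for b, c in successors if b in remaining]
        next_word = max(to_keyword or successors, key=lambda e: e[1])[0]
        path.append(next_word)
        remaining.discard(next_word)
    return " ".join(path)

# Toy usage with a hypothetical corpus and detected keywords.
corpus = [
    "a man riding a horse on a beach",
    "a man riding a bike on a road",
    "a horse standing on a sandy beach",
]
edges = bigram_graph(corpus)
print(greedy_caption(["man", "horse", "beach"], edges))
# -> "man riding a horse on a beach"
```

Note how the caption emerges without any paired image-caption training data: the graph is built purely from a text corpus, and the keywords supplied by a visual classifier steer the traversal, which is the unpaired setting the thesis targets.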
Description: | Ph.D. (Melit.) |
URI: | https://www.um.edu.mt/library/oar/handle/123456789/106666 |
Appears in Collections: | Dissertations - FacICT - 2022 |
Files in This Item:
File | Description | Size | Format
---|---|---|---
Brandon Birmingham PH.D..pdf | | 10.71 MB | Adobe PDF
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.