Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/127773
Title: Visually grounded language generation : data, models and explanations beyond descriptive captions
Authors: Cafagna, Michele (2024)
Keywords: Artificial intelligence
Description logics
Linguistics -- Malta
Issue Date: 2024
Citation: Cafagna, M. (2024). Visually grounded language generation: data, models and explanations beyond descriptive captions (Doctoral dissertation).
Abstract: Vision and language are two essential capabilities through which we talk about what we see and communicate it to others, ultimately allowing us to perform tasks and understand the world. Modeling this interaction is critical to creating agents able to understand, at least to some extent, the world we perceive. This challenge is generally known as multimodal grounding and corresponds to a model's capability to create meaningful connections between different modalities in order to solve a task. Ungrounded models do not properly interleave the two modalities, yet they can still perform well on downstream tasks, leading to misleading and potentially harmful behaviors. Among other fields, Explainable Artificial Intelligence research has moved forward in recent years, proposing methods that help scrutinize the inner workings of these models and, therefore, also assess their grounding capabilities. However, these methods have relevant limitations, especially for generative models, and they are still unpopular in Vision and Language research. Vision and Language research has mostly focused on performing and evaluating tasks involving the identification and recognition of objects and entities, as they represent the most basic meaningful information in a visual scene and can serve as building blocks for composing complex multimodal relations, especially in the visual modality. In the textual modality, however, objects capture only a limited amount of linguistic information, as language is enriched by words and expressions that do not always correspond to concrete physical objects. Some linguistic expressions can represent complex contexts and situational knowledge that go beyond the objects visible in an image. For example, describing a picture as a “picnic” (high-level) triggers a whole set of expectations about the scene, making the mention of the objects and entities totally redundant and uninformative, e.g. “people eating food on the grass” (low-level). The latter description is object-centric and is most likely generated by an automatic captioning system, whereas the former is the kind of description naturally used by humans. The research community's general lack of interest in this aspect has created a gap in the overall assessment of how well large-scale models understand the “language” in “vision and language”, preventing potential gains in overall output quality for multimodal models in generative settings. In this thesis, we pursue this direction with the aim of discovering whether large pre-trained Vision and Language models can handle high-level linguistic descriptions and to what extent they are able to effectively ground them in the visual modality; implications for both language understanding and generation are of interest in this work. By moving away from object-centric descriptions, we potentially change the paradigm used to assess multimodal grounding. We analyze the resulting changes in terms of tasks and evaluation methods, introducing an explainability framework designed to complement the currently available tools for assessing models' multimodal grounding capabilities in generative settings.
Description: Ph.D.(Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/127773
Appears in Collections:Dissertations - InsLin - 2024

Files in This Item:
File: 2401LLTLIN600005073698_1.PDF
Size: 19.48 MB
Format: Adobe PDF

