Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/120525
Title: COMET for low-resource machine translation evaluation : a case study of English-Maltese and Spanish-Basque
Authors: Falcão, Júlia
Keywords: English language -- Translating into Maltese
Spanish language -- Translating into Basque
Machine translating -- Evaluation
Neural networks (Computer science)
Issue Date: 2023
Citation: Falcão, J. (2023). COMET for low-resource machine translation evaluation: a case study of English-Maltese and Spanish-Basque (Master's dissertation).
Abstract: Translation quality is a largely subjective concept, but in machine translation it needs to be measurable. Human judgements are regarded as the gold standard of evaluation methods, but they are expensive and time-consuming to obtain, so the field has turned to automatic metrics such as BLEU, which measures the lexical overlap between a translation candidate and one or more reference translations. However, lexical overlap is not all there is to a good translation, and BLEU has repeatedly been shown to correlate poorly with human judgements of quality. A new paradigm has emerged in recent years: trainable metrics, based on neural networks trained to directly predict human judgements of quality, have been topping the ranks in the latest meta-evaluation studies. However, because they need to be trained on annotated parallel data, these metrics have limited language support, and under-resourced languages are mostly left out. In this work, we look at the most prominent trainable evaluation system proposed for MT so far, the COMET framework, and take English–Maltese and Spanish–Basque as a case study to investigate the extent of COMET's language support restrictions: how well can it evaluate languages outside of its training data, and languages not supported by its underlying encoder, as is the case for Maltese and Basque? We run a crowd-based evaluation campaign to collect human judgements, and then use this data to analyze the performance of COMET out of the box. We also explore potential avenues of improvement: fine-tuning existing models and training new models from scratch. Our results, based on correlations between human evaluations and metric outputs, attest to the potential of fine-tuning to improve existing models, but also indicate that COMET is highly susceptible to the distribution of scores in its training data, which is especially concerning in low-resource scenarios. This dissertation is a step towards the inclusion of under-resourced languages in the development of better metrics for MT evaluation, and we also release our anonymized campaign results to the public for future work.
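For readers unfamiliar with the pipeline the abstract describes, the sketch below shows the general shape of segment-level metric evaluation against human judgements. It is a minimal illustration, not the dissertation's actual code: it assumes the sacrebleu, unbabel-comet (version 2 or later) and scipy Python packages, uses the publicly released "Unbabel/wmt22-comet-da" checkpoint as a stand-in for the models studied, and the English–Maltese sentences and human scores are hypothetical toy data invented purely for illustration.

# Minimal sketch of BLEU scoring, COMET scoring, and meta-evaluation
# by correlation with human judgements. Hypothetical toy data only.
import sacrebleu
from comet import download_model, load_from_checkpoint
from scipy.stats import spearmanr

# Toy source sentences, MT outputs, references, and per-segment
# human quality judgements (all invented for this example).
sources = ["The cat is on the mat.", "It is raining.", "I like tea."]
hypotheses = ["Il-qattus qiegħed fuq it-tapit.", "Qed tagħmel ix-xita.", "Inħobb it-te."]
references = ["Il-qattus jinsab fuq it-tapit.", "Nieżla x-xita.", "Jien inħobb it-te."]
human_scores = [0.9, 0.6, 0.8]

# BLEU: corpus-level lexical overlap between candidates and references.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# COMET: a trained neural metric scoring each (src, mt, ref) triple.
# A fine-tuned or from-scratch model would be loaded the same way.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": m, "ref": r}
        for s, m, r in zip(sources, hypotheses, references)]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU

# Meta-evaluation: correlate segment-level metric scores with the
# human judgements; higher correlation means a better metric.
rho, _ = spearmanr(output.scores, human_scores)
print(f"Spearman correlation with human judgements: {rho:.3f}")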
Description: M.Sc. (HLST) (Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/120525
Appears in Collections:Dissertations - FacICT - 2023
Dissertations - FacICTAI - 2023

Files in This Item:
File: 2318ICTCSA531005079271_1.PDF (Restricted Access)
Size: 2.18 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.