Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/120557
Title: Lessons learned from the evaluation of Portuguese language models
Authors: Chaves Rodrigues, Ruan (2023)
Keywords: Portuguese language -- Data processing
Natural language processing (Computer science)
Issue Date: 2023
Citation: Chaves Rodrigues, R. (2023). Lessons learned from the evaluation of Portuguese language models (Master's dissertation).
Abstract: With the rising prominence of the Portuguese language in Natural Language Processing (NLP), a clear divide is observed between major corporations and smaller academic entities in model training. This raises an important question: can the efforts of smaller entities compete with those of major corporations in Portuguese natural language tasks? And which aspects should they prioritize to enhance their advantage? In our pursuit to answer this, we provide a historical overview of advancements in Portuguese NLP, from early word embeddings to the rise of Large Language Models (LLMs). We then discuss the linguistic challenges of benchmark construction and set out to perform a comprehensive evaluation of modern language models using a carefully designed benchmark. Using detailed evaluation methods and rigorous statistical analysis, our findings show no significant performance differences between models trained solely on Portuguese datasets and those trained on multilingual data. Our study challenges the perceived benefits of current Portuguese language models and highlights the need for deeper linguistic research and evaluation in Portuguese NLP. Our main contribution, the Natural Portuguese Language Benchmark (Napolab), is available at https://github.com/ruanchaves/napolab.
Description: M.Sc. (HLST)(Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/120557
Appears in Collections: Dissertations - FacICT - 2023
Dissertations - FacICTAI - 2023

Files in This Item:
File: 2418ICTCSA531005079272_1.PDF
Size: 4.55 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.