Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/104597
Full metadata record
dc.contributor.author: Micallef, Kurt
dc.contributor.author: Gatt, Albert
dc.contributor.author: Tanti, Marc
dc.contributor.author: van der Plas, Lonneke
dc.contributor.author: Borg, Claudia
dc.date.accessioned: 2022-12-21T11:29:01Z
dc.date.available: 2022-12-21T11:29:01Z
dc.date.issued: 2022
dc.identifier.citation: Micallef, K., Gatt, A., Tanti, M., van der Plas, L., & Borg, C. (2022). Pre-training data quality and quantity for a low-resource language: New corpus and BERT models for Maltese. Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, Virtual conference.
dc.identifier.uri: https://www.um.edu.mt/library/oar/handle/123456789/104597
dc.description.abstract: Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT – Maltese – with a range of pre-training setups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks – dependency parsing, part-of-speech tagging, and named-entity recognition – and one semantic classification task – sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than the corpora typically used for high-resource languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.
dc.language.iso: en
dc.publisher: Association for Computational Linguistics
dc.rights: info:eu-repo/semantics/restrictedAccess
dc.subject: Artificial intelligence
dc.subject: Natural language processing (Computer science)
dc.subject: Semantics
dc.title: Pre-training data quality and quantity for a low-resource language: New corpus and BERT models for Maltese
dc.type: conferenceObject
dc.rights.holder: The copyright of this work belongs to the author(s)/publisher. The rights of this work are as defined by the appropriate Copyright Legislation or as modified by any successive legislation. Users may access this work and can make use of the information contained in accordance with the Copyright Legislation provided that the author must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the prior permission of the copyright holder.
dc.bibliographicCitation.conferencename: Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing
dc.bibliographicCitation.conferenceplace: Virtual conference, July 2022
dc.description.reviewed: peer-reviewed
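
As a companion to the abstract above, the following is a minimal sketch of the continued (further) masked-language-model pre-training setup that produces a model like mBERTu: multilingual BERT further pre-trained on monolingual Maltese text. It assumes the Hugging Face transformers and datasets libraries; the corpus path (maltese_corpus.txt) and all hyperparameters are illustrative placeholders, not the configuration reported in the paper.

# Sketch: continued MLM pre-training of multilingual BERT on Maltese text.
# Assumptions: corpus is a plain-text file with one segment per line at
# maltese_corpus.txt (hypothetical path); hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Load the raw Maltese corpus and tokenize it into fixed-length inputs.
dataset = load_dataset("text", data_files={"train": "maltese_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked when each batch is built.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mbertu-continued",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()

The from-scratch alternative described in the abstract (BERTu) would differ mainly in that a Maltese tokenizer is trained on the new corpus and the model is initialized from a fresh BERT configuration rather than from the mBERT checkpoint.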
Appears in Collections: Scholarly Works - InsLin

Files in This Item:
File: Pre_training_data_quality_and_quantity_for_a_low_resource_language_new_Corpus_and_BERT_models_for_Maltese_2022.pdf (Restricted Access)
Size: 245.44 kB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.