MASRI - Maltese Automatic Speech Recognition

MASRI is developing the first Automatic Speech Recognition technologies for Maltese.

Il-proġett MASRI qed jiżviluppa t-teknoloġiji għat-traskrizzjoni awtomatika bil-Malti.

MASRI

We are interested in a broad variety of problems in Speech Recognition for under-resourced languages, especially Maltese.

Aħna interessati f'diversi sfidi konnessi mat-traskrizzjoni awtomatika b'lingwi bi ftit riżorsi, speċjalment il-Malti.

Support | Fondi

MASRI is supported by a University of Malta Research Fund Excellence grant.
MASRI għandu l-fondi mill-Fond tar-Riċerka tal-Università ta' Malta.

Data

We are creating speech corpora and investigating data augmentation techniques.
Qed noħolqu korpora tat-taħdit u ninvestigaw mezzi biex nawmentaw id-data.

Technologies | Teknoloġiji

We investigate neural techniques for speech-to-text, forced alignment, and related tasks.
Qed ninvestigaw teknoloġija bbażata fuq xbiek newrali għat-traskrizzjoni awtomatika, forced alignment, eċċ.

Neħtieġu l-għajnuna tiegħek

Biex nibnu sistemi li jużaw il-vuċi aħjar neħtieġu għadd kbir ta' recordings ta' taħdit bil-Malti. Għaldaqstant, qed nużaw il-Common Voice, proġett tal-Mozilla mfassal sabiex jgħin fil-ħolqien ta' sistemi tat-taħdit għal-lingwi kollha mitkellma madwar id-dinja. Permezz ta' dan, qed niġbru kampjuni ta' taħdit minn kelliema madwar Malta u Għawdex.

  • Agħfas il-buttuna biex iżżur il-paġna ta' Common Voice.
  • Qis li tagħżel il-Malti bħala l-lingwa li trid tuża.
  • Jekk trid, oħloq kont. Dan iħallik tippersonalizza l-esperjenza u tneħħi kwalunkwe data li tkun tajtna.
  • Ibda rrekordja l-vuċi tiegħek billi taqra ftit sentenzi kuljum.
  • Isma' r-rekordings ta' ħaddieħor u għidilna humiex tajbin jew le.

Ara vidjow dwar kif taħdem is-sistema.

We need your help

To build better voice-activated systems, we need large amounts of spoken data in Maltese. We are therefore using Common Voice, a Mozilla project designed to help build voice-activated systems for all the languages of the world. Through it, we are collecting speech samples from speakers all over the Maltese islands.

  • Click the button to visit the Common Voice page.
  • Make sure you choose Maltese (Malti) as your language.
  • If you want, you can create an account. This lets you personalise the experience and delete your data should you wish to.
  • Start recording your voice by reading a few sentences a day.
  • Listen to recordings made by other people and tell us if they're good or not.

Watch this video to see how it works (in Maltese).

Data

Data and resources created in the MASRI project.

Dati

Data u riżorsi maħluqa fil-proġett MASRI.

MASRI Data Repo

MASRI has released several corpora of paired text and speech. For full details of the datasets, see our GitHub repo.
Currently, this data is only being released for non-commercial purposes.

Il-proġett MASRI nieda diversi datasets ta' taħdit bit-traskrizzjonijiet. Għal aktar informazzjoni ara d-dettalji fuq ir-repożitorju tal-GitHub tal-proġett.
Bħalissa, id-data hija disponibbli biss għal skopijiet mhux kummerċjali.

Download | Niżżel

To download the data, please read our terms and conditions.
Biex tniżżel id-data, aqra t-termini u l-kundizzjonijiet.

Ready to download? Click the button below.
Lest biex tniżżel? Agħfas il-buttuna hawn taħt.

You should receive a download link. | Għandek tirċievi ħolqa biex tniżżel id-data.

G2P Tool for Maltese
Sistema G2P għall-Malti

A Python 3 grapheme-to-phoneme (G2P) tool. The code is available on GitHub. A demo and web service interface are hosted on the Maltese Language Resource Server.
Implimentazzjoni bil-Python 3 ta' programm għat-traskrizzjoni fonetika. Kodiċi disponibbli fuq GitHub. Tista' tara demo u tikseb aċċess permezz ta' web service mis-Server għar-Riżorsi Lingwistiċi Maltin.

MASRI-HEADSET

Corpus of 8 hours of paired Maltese speech and text.
To obtain the dataset, kindly contact us.
Note that this dataset has been superseded by our full data release.

Korpus ta' 8 sigħat ta' taħdit bil-Malti, flimkien mat-traskrizzjonijiet.
Dan id-dataset huwa sorpassat, mindu nidejna d-data kompluta.

MASRI-HEADSET splits
Taqsimiet tal-MASRI-HEADSET

These files reproduce the train/test splits of MASRI-HEADSET used in the experiments described in this paper.
Dawn il-fajls jirriproduċu t-taqsimiet fil-MASRI-HEADSET użati fl-esperimenti rrappurtati f'dan l-artiklu.

Publications

Project research papers

Pubblikazzjonijiet

Artikli dwar ir-riċerka tal-proġett.

Mena, C; Gatt, A; DeMarco, A; Borg, C; van der Plas, L; Muscat, A and Padovani, I. (2020). MASRI-HEADSET: A Maltese corpus for speech recognition. Proceedings of the 12th edition of the Language Resources and Evaluation Conference (LREC'20). Marseille, France: ELRA.

People

Project members and associated research assistants.

Membri

Membri tal-proġett u assistenti tar-riċerka.

Albert Gatt

Andrea DeMarco

Claudia Borg

Lonneke van der Plas

Carlos Mena

Alexandra Vella

Amanda Muscat

Ian Padovani

Kirsty Azzopardi

Ayrton Didier Brincat

Contact

Ikkuntattjana

Institute of Linguistics and Language Technology
University of Malta
Msida MSD2080

masri {at} um.edu.mt

Message sent. Thank you! | Il-messaġġ intbagħat. Grazzi!