MASRI

We are interested in a broad variety of problems in Speech Recognition for under-resourced languages, especially Maltese.

Aħna interessati f'diversi sfidi konnessi mat-traskrizzjoni awtomatika b'lingwi bi ftit riżorsi, speċjalment il-Malti.

Support | Fondi

MASRI is supported by a University of Malta Research Fund Excellence grant.
MASRI għandu l-fondi mill-Fond tar-Riċerka tal-Università ta' Malta.

Data

We are creating speech corpora and investigating data augmentation techniques.
Qed noħolqu korpora tat-taħdit u ninvestigaw mezzi biex nawmentaw id-data.

Technologies | Teknoloġiji

We investigate neural techniques for speech-to-text, forced alignment, and related tasks.
Qed ninvestigaw teknoloġija bbażata fuq xbiek newrali għat-traskrizzjoni awtomatika, forced alignment, eċċ.

Neħtieġu l-għajnuna tiegħek

Biex nibnu sistemi li jużaw il-vuċi aħjar neħtieġu għadd kbir ta' recordings ta' taħdit bil-Malti. Għaldaqstant, qed nużaw il-Common Voice, proġett tal-Mozilla mfassal sabiex jgħin fil-ħolqien ta' sistemi tat-taħdit għal-lingwi kollha mitkellma madwar id-dinja. Permezz ta' dan, qed niġbru kampjuni ta' taħdit minn kelliema madwar Malta u Għawdex.

Agħfas il-buttuna biex iżżur il-paġna ta' Common Voice.
Qis li tagħżel il-Malti bħala l-lingwa li trid tuża.
Jekk trid, oħloq kont. Dan iħallik tippersonalizza l-esperjenza u tneħħi kwalunkwe data li tkun tajtna.
Ibda rrekordja l-vuċi tiegħek billi taqra ftit sentenzi kuljum.
Isma' r-rekordings ta' ħaddieħor u għidilna humiex tajbin jew le.

Ara vidjow dwar kif taħdem is-sistema.

Ibda għin

We need your help

To build better voice-activated systems, we need large amounts of spoken Maltese data. We are therefore using Common Voice, a Mozilla project designed to support the construction of voice-activated systems for all the world's languages. Through it, we are collecting speech samples from speakers all over the Maltese islands.

Click the button to visit the Common Voice page.
Make sure you choose Maltese (Malti) as your language.
If you want, you can create an account. This allows you to personalise the experience, but also to delete your data should you wish to.
Start recording your voice by reading a few sentences a day.
Listen to recordings made by other people and tell us if they're good or not.

Watch this video to see how it works (in Maltese).

Start helping

Data

Data and resources created in the MASRI project.

Dati

Data u riżorsi maħluqa fil-proġett MASRI.

MASRI Data Repo

MASRI has released several corpora of paired text and speech. For full details of the datasets, check our Github repo.
Currently, this data is only being released for non-commercial purposes.
Il-proġett MASRI nieda diversi datasets ta' taħdit bit-traskrizzjonijiet. Għal aktar informazzjoni ara d-dettalji fuq ir-repożitorju tal-Github tal-proġett.
Bħalissa, id-data hija disponibbli biss għal skopijiet mhux kummerċjali.

General data characteristics

Audio files are distributed in 16 kHz, 16-bit mono format.
Every audio file has an ID that is compatible with ASR engines such as Kaldi and CMU-Sphinx.
Transcriptions are lowercase. Unless otherwise stated below, no punctuation marks are permitted except dashes (-) and apostrophes (') because these belong to the Maltese orthography.
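The characteristics above can be checked with a short script. The following is a minimal sketch, not part of the MASRI tooling: it assumes the audio is stored as WAV files, and the function names `normalise_transcription` and `matches_masri_format` are hypothetical. It uses only the Python standard library.

```python
import re
import wave

# Anything that is not a letter, digit, whitespace, dash, or apostrophe
# is treated as punctuation. The underscore is listed separately because
# \w would otherwise keep it.
_DISALLOWED = re.compile(r"[^\w\s'\-]|_")


def normalise_transcription(text: str) -> str:
    """Lowercase a transcription and drop punctuation other than - and '."""
    text = text.lower()
    # Assumption: curly apostrophes are mapped to the plain ASCII one.
    text = text.replace("\u2019", "'")
    text = _DISALLOWED.sub(" ", text)
    # Collapse the whitespace left behind by removed punctuation.
    return " ".join(text.split())


def matches_masri_format(path: str) -> bool:
    """Return True if a WAV file is 16 kHz, 16-bit, mono."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16 bits
            and wav.getnchannels() == 1
        )
```

For example, `normalise_transcription("Il-Malti, ta' Malta!")` returns `"il-malti ta' malta"`: the comma and exclamation mark are dropped, while the dash and apostrophe are kept.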

Dataset	Size	Type
MASRI-Headset v2	6h39m	Read speech
MASRI-Farfield	9h37m	Read speech
MASRI-Booths	2h27m	Read speech
MASRI-MEP	1h17m	Spontaneous speech
MASRI-COMVO	7h29m	Read speech
MASRI-TUBE	13h17m	Spontaneous speech
MASRI-Synthetic	99h18m	Synthesized speech
MASRI-Dev	1h	Spontaneous speech
MASRI-Test	1h	Spontaneous speech

Download | Niżżel

To download the data, please read our terms and conditions.
Biex tniżżel id-data, aqra t-termini u l-kundizzjonijiet.

Ready to download? Click the button below.
Lest biex tniżżel? Agħfas il-buttuna hawn taħt.

Role (choose the best that applies) | Rwol (agħżel l-aħjar kategorija)

Briefly describe why you wish to use this data. | Għidilna fil-qosor għalfejn tixtieq tuża d-data.

I have read and understood the terms and conditions and the standard terms of use.
Nikkonferma li qrajt u fhimt it-termini u l-kundizzjonijiet għall-użu tad-data.

You should receive a download link. | Għandek tirċievi ħolqa biex tniżżel id-data.

G2P Tool for Maltese
Sistema G2P għall-Malti

Python 3 grapheme-to-phoneme (G2P) tool. Code is available on GitHub. A demo and a web service interface are hosted on the Maltese Language Resource Server.
Implimentazzjoni bil-Python 3 ta' programm għat-traskrizzjoni fonetika. Kodiċi disponibbli fuq GitHub. Tista' tara demo u tikseb aċċess permezz ta' web service mis-Server għar-Riżorsi Lingwistiċi Maltin.

MASRI-HEADSET

Corpus of 8 hours of paired Maltese speech and text.
To obtain the dataset, kindly contact us.
Note that this dataset is superseded by our full data release.
Korpus ta' 8 sigħat ta' taħdit bil-Malti, flimkien mat-traskrizzjonijiet.
Dan id-dataset huwa sorpassat, mindu nidejna d-data kompluta.

MASRI-HEADSET splits
Taqsimiet tal-MASRI-HEADSET

These files reproduce the train/test experiments on MASRI-HEADSET, as described in this paper.
Dawn il-fajls jirriproduċu t-taqsimiet fil-MASRI-HEADSET użati fl-esperimenti rrappurtati f'dan l-artiklu.

Publications

Project research papers

Pubblikazzjonijiet

Artikli dwar ir-riċerka tal-proġett.

Mena, C; Gatt, A; DeMarco, A; Borg, C; van der Plas, L; Muscat, A and Padovani, I. (2020). MASRI-HEADSET: A Maltese corpus for speech recognition. Proceedings of the 12th edition of the Language Resources and Evaluation Conference (LREC'20). Marseille, France: ELRA.

People

Project members and associated research assistants.

Membri

Membri tal-proġett u assistenti tar-riċerka.

Albert Gatt

Andrea DeMarco

Claudia Borg

Lonneke van der Plas

Carlos Mena

Alexandra Vella

Amanda Muscat

Ian Padovani

Kirsty Azzopardi

Ayrton Didier Brincat

Contact

Ikkuntattjana

Institute of Linguistics and Language Technology
University of Malta
Msida MSD2080

masri {at} um.edu.mt

MASRI - Maltese Automatic Speech Recognition

MASRI is developing the first Automatic Speech Recognition technologies for Maltese.

Il-proġett MASRI qed jiżviluppa t-teknoloġiji għat-traskrizzjoni awtomatika bil-Malti.
