Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/63172
Title: | Topic modelling of newspaper comments using embedding vectors and clustering |
Authors: | Zammit, Samuel |
Keywords: | Press -- Malta; Natural language processing (Computer science); Vector analysis |
Issue Date: | 2020 |
Citation: | Zammit, S. (2020). Topic modelling of newspaper comments using embedding vectors and clustering (Bachelor's dissertation). |
Abstract: | As use of the Internet and online social media grows, text is becoming an ever more important source of data. To this end, a vast number of techniques have been developed in the field of Natural Language Processing. These include n-grams, skip-grams, the Bag-of-Words model, Term Frequency-Inverse Document Frequency, stemming, lemmatisation, embedding vectors, and clustering techniques. This dissertation investigates the theoretical foundations of these techniques and then applies them to a dataset of online newspaper comments written between 2008 and 2017, obtained from the Times of Malta website. In particular, the FastText algorithm (Bojanowski et al., 2017) is used to transform each unique word in the dataset into a vector representation, known as a word embedding, by means of an underlying neural network framework. The word embeddings are then grouped into clusters using the k-means clustering algorithm. Vector representations are also obtained for each online newspaper comment, where, as with words, similar comments are assigned similar representations. These representations, known as document embeddings, are likewise clustered using k-means. The results of the in-depth analysis show that the vast majority of comments are political in nature, with comments related to sports, arts and culture being less frequent than might be expected. In addition, a number of topics were identified as being more prevalent during some time periods than during others. These include divorce in 2011, Maltese citizenship in 2013, and Russia’s annexation of Crimea in 2014. Furthermore, the morning-after pill and corruption were two widely discussed topics in 2016. |
Description: | B.SC.(HONS)STATS.&OP.RESEARCH |
URI: | https://www.um.edu.mt/library/oar/handle/123456789/63172 |
Appears in Collections: | Dissertations - FacSci - 2020; Dissertations - FacSciSOR - 2020 |
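The pipeline summarised in the abstract (word vectors, then one vector per comment, then k-means) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-dimensional `word_vectors` are made-up stand-ins for trained FastText embeddings, averaging word vectors is only one common way to form a document embedding and may not be the method used in the dissertation, and `kmeans` is a plain NumPy implementation of Lloyd's algorithm rather than the author's code.

```python
import numpy as np

# Hypothetical 2-D word vectors standing in for trained FastText embeddings.
word_vectors = {
    "election": np.array([1.0, 0.1]),
    "minister": np.array([0.9, 0.0]),
    "vote":     np.array([1.0, 0.2]),
    "football": np.array([0.0, 1.0]),
    "goal":     np.array([0.1, 0.9]),
    "match":    np.array([0.2, 1.0]),
}

def document_embedding(tokens, vectors, dim=2):
    """Average the vectors of known tokens (an assumed, simple scheme)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)  # no known words: fall back to the zero vector
    return np.mean(known, axis=0)

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Toy tokenised "comments": two political, two about sport.
comments = [
    ["election", "vote"],
    ["minister", "vote"],
    ["football", "goal"],
    ["goal", "match"],
]
doc_vectors = np.array([document_embedding(c, word_vectors) for c in comments])
labels, centroids = kmeans(doc_vectors, k=2)
print(labels)
```

On this toy input the two political comments should share one cluster label and the two sport comments the other. Which cluster receives label 0 depends on the random initialisation.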
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
20BSCMSOR009.pdf (Restricted Access) | | 3.05 MB | Adobe PDF |
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.