Master thesis : Generating Topic Models from Corpora Across Languages

Thielen, Benoit

Date of defense : 28-Jan-2022 • Permalink : `http://hdl.handle.net/2268.2/13874`

Details

Title :	Master thesis : Generating Topic Models from Corpora Across Languages
Author :	Thielen, Benoit
Date of defense :	28-Jan-2022
Advisor(s) :	Ittoo, Ashwin
Committee's member(s) :	Geurts, Pierre ; Louppe, Gilles
Language :	English
Keywords :	[en] LDA ; HDP ; nHDP ; Poincaré ; embeddings ; topic ; modeling ; Dirichlet
Discipline(s) :	Engineering, computing & technology > Computer science
Target public :	Researchers
Institution(s) :	Université de Liège, Liège, Belgique
Degree:	Master in Computer Science, professional focus in "intelligent systems"
Faculty:	Master thesis of the Faculté des Sciences appliquées

Abstract

[en] Topic modeling is a learning process that analyzes texts to discover their topic composition by grouping correlated words. Historically, topic modeling has relied on unsupervised learning techniques. Bayesian generative models such as Latent Dirichlet Allocation (LDA) quickly proved effective at representing, as probability distributions, how words are distributed across topics and how topics are distributed across documents. More recently, topic models building on LDA have emerged, such as the Hierarchical Dirichlet Process (HDP), which infers the number of topics from the text itself, and the nested Hierarchical Dirichlet Process (nHDP), which enables a hierarchical representation of the topics.
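
As an illustration of the two kinds of distributions mentioned above, the following is a minimal sketch of the LDA generative story, not code from the thesis. The toy vocabulary, the two hand-written topics, the Dirichlet parameter and the helper names `sample_dirichlet` and `generate_document` are all assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, doc_alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # Topic proportions of this document (distribution of topics in a document).
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # Choose a topic for this word position, then a word from that topic
        # (distribution of words in a topic).
        k = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[k])[0])
    return theta, words


# Toy vocabulary and two hand-written topics (illustrative values only).
VOCAB = ["gene", "cell", "protein", "goal", "match", "team"]
TOPIC_WORD = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "biology"-like topic
    [0.0, 0.0, 0.0, 0.3, 0.3, 0.4],  # a "sports"-like topic
]

rng = random.Random(0)
theta, doc = generate_document(TOPIC_WORD, VOCAB, [0.5, 0.5], 8, rng)
print(theta, doc)
```

Training a topic model runs this story in reverse: only the words are observed, and the topic-word and document-topic distributions are inferred. In this sketch the number of topics is fixed by hand, which is the choice HDP is described as making from the data.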

This work evaluates the topic identification and hierarchical modeling performance of HDP and nHDP on English and French corpora built from Wikipedia articles. Many highly coherent and interesting topics were detected in both languages, alongside some less coherent ones. Correlations were found between corpus statistics and evaluation metrics such as coherence and model perplexity.

Additionally, a more recent approach, learning word embeddings in hyperbolic space and specifically in the Poincaré ball, was studied to determine whether it could be a promising route to hierarchical topic modeling. Poincaré embeddings of 10 dimensions were trained on hypernymy relations from our English corpus. Our analysis revealed clusters of words that can be linked to topics. Unfortunately, the 2D representation method we applied could not show hierarchical relations between those clusters.
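
For readers unfamiliar with the Poincaré ball, the sketch below computes the standard hyperbolic distance between two points inside the unit ball. It is a generic illustration and not the thesis's code; the function name `poincare_distance` and the 2D sample points are assumptions chosen for readability (the embeddings described above have 10 dimensions).

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


origin = [0.0, 0.0]
inner = [0.5, 0.0]
edge_a = [0.9, 0.0]
edge_b = [0.0, 0.9]

# From the origin, the distance to a point of norm r is ln((1 + r) / (1 - r)).
print(poincare_distance(origin, inner))
# Two points near the boundary are much farther apart than their
# Euclidean separation suggests.
print(poincare_distance(edge_a, edge_b))
```

Distances grow rapidly toward the boundary of the ball, which is why general terms can sit near the center and specific terms near the edge, a layout suited to hypernymy hierarchies.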

In conclusion, HDP and nHDP showed good and similar learning performance when trained on French and English corpora, with nHDP also providing an effective hierarchical representation of the topics. The Poincaré embeddings succeeded in learning and representing the hypernymy relations in the Poincaré ball, but suffered from the constraints imposed by the data acquisition methods and the required filtering processes.

File(s)

Document(s)

Master_thesis_final.pdf
Description:
Size: 5.23 MB
Format: Adobe PDF

summary_thesis.pdf
Description:
Size: 244.36 kB
Format: Adobe PDF

All documents available on MatheO are protected by copyright and subject to the usual rules for fair use.
The University of Liège does not guarantee the scientific quality of these students' works or the accuracy of all the information they contain.
