Predicting review helpfulness : the case of class imbalance
Delic, Leïla
Promotor(s) :
Ittoo, Ashwin
Date of defense : 21-Jun-2019/25-Jun-2019 • Permalink : http://hdl.handle.net/2268.2/6399
Details
Title : | Predicting review helpfulness : the case of class imbalance |
Author : | Delic, Leïla |
Date of defense : | 21-Jun-2019/25-Jun-2019 |
Advisor(s) : | Ittoo, Ashwin |
Committee member(s) : | Heuchenne, Cédric; Hoffait, Anne-Sophie |
Language : | English |
Number of pages : | 67 |
Keywords : | [en] Machine learning; Review helpfulness; Text classification; Class imbalance; Prediction; Online customer review |
Discipline(s) : | Business & economic sciences > Management information systems |
Institution(s) : | Université de Liège, Liège, Belgium |
Degree: | Master en ingénieur de gestion, à finalité spécialisée en Supply Chain Management and Business Analytics |
Faculty: | Master thesis of the HEC-Ecole de gestion de l'Université de Liège |
Abstract
[en] Online reviews are increasingly abundant, which can make them overwhelming for users. To mitigate this information overload, online retailers often display reviews according to their helpfulness to other users. In recent years, research has sought efficient ways to predict review helpfulness automatically. This thesis offers insight into both the most appropriate algorithm for predicting review helpfulness in the specific context of class imbalance and highly overlapping class features, and the pre-processing techniques that can improve classifier performance in that context. To do so, it considers three classification algorithms: random forest, multinomial naive Bayes, and a linear support vector machine trained with stochastic gradient descent.
It shows that: (1) none of the considered algorithms performs satisfactorily when facing imbalanced datasets and similar class features; (2) linguistic pre-processing techniques yield marginal or no improvement; (3) frequency-based pre-processing yields moderate improvement; (4) re-sampling techniques are highly effective, especially the Synthetic Minority Over-sampling TEchnique (SMOTE); (5) overall, random forest combined with SMOTE shows the best performance in terms of precision, recall and F1-score.
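To make the re-sampling idea and the evaluation metrics named in the abstract concrete, the following is a minimal sketch of SMOTE-style over-sampling and of precision, recall and F1-score, written in plain Python. It is an illustration under stated assumptions, not the thesis's implementation: the function names `smote` and `precision_recall_f1`, the parameters `k` and `seed`, and the use of Euclidean distance on dense feature vectors are all hypothetical choices made here for the example.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    Illustrative sketch: each new point lies on the segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-score for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy minority class with four 2-D feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_synthetic=6, k=2, seed=1)

# A classifier that always predicts the majority class (0) scores zero on
# all three metrics for the minority class, however high its accuracy.
print(precision_recall_f1([1, 0, 0, 0], [0, 0, 0, 0]))
```

The second function shows why accuracy alone is misleading under class imbalance: predicting only the majority class is right for three of the four toy labels, yet precision, recall and F1 for the minority class are all zero. In practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be preferred over a hand-written one; the abstract does not say which implementation was used.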
The University of Liège does not guarantee the scientific quality of these students' works or the accuracy of all the information they contain.