HEC-Ecole de gestion de l'Université de Liège
VIEW 335 | DOWNLOAD 6976

Predicting ratings of Amazon reviews - Techniques for imbalanced datasets

Martin, Marie ULiège
Promotor(s) : Ittoo, Ashwin ULiège
Date of defense : 22-Jun-2017/27-Jun-2017
Committee's member(s) : Schyns, Michael ULiège
Beretta, Alessandro ULiège
Language : English
Number of pages : 79
Keywords : [en] product review
[en] rating
[en] supervised machine learning
[en] prediction
[en] text classification
[en] imbalanced datasets
Discipline(s) : Business & economic sciences > Management information systems
Institution(s) : Université de Liège, Liège, Belgique
Degree: Master in Business Engineering (Master en ingénieur de gestion), specialising in Supply Chain Management and Business Analytics
Faculty: Master thesis of the HEC-Ecole de gestion de l'Université de Liège


[en] The goal of this dissertation is to predict a user's numerical rating from the text of their review. To do so, supervised machine learning techniques, and more specifically text classification, are used.
Three distinct approaches are presented: binary classification, which predicts whether the rating of a review is low or high, and multi-class classification and logistic regression, which aim to predict the exact value of the rating for each review. Moreover, three different classifiers (Naïve Bayes, Support Vector Machine and Random Forest) are trained and tested on two Amazon datasets. These datasets cover two major product categories, experience products and search products, and both have an imbalanced class distribution. We address this issue by applying sampling techniques to even out the class distributions. Finally, the performance of the classifiers is assessed using evaluation metrics including precision, recall and F1-score.
Our results show that the two most successful classifiers are Naïve Bayes and SVM, with a slight advantage for SVM on both datasets. Binary classification gives quite good results, while making more precise predictions (i.e. on a scale from 1 to 5) is a significantly harder task. Nevertheless, these results are still acceptable.
More practically, our approach enables users' feedback to be expressed automatically on a numerical scale, which eases the consumer's decision process prior to making a purchase. The approach can in turn be extended to other settings where no numerical rating system is available, for instance comments on YouTube or Twitter.
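
As an illustration of two ideas mentioned in the abstract, the sketch below shows one way to even out class distributions and to compute precision, recall and F1-score per class. It is not the thesis's implementation: the abstract does not say which sampling technique was used, so random oversampling is assumed here, and the function names (`random_oversample`, `per_class_scores`) and the toy reviews are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until
    every class has as many examples as the largest class.
    (Assumed technique; the abstract only says "sampling techniques".)"""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing examples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)}, computed one-vs-rest."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


# Toy, made-up data: four "high" reviews and one "low" review.
toy_texts = ["great", "love it", "works well", "fine", "broke fast"]
toy_labels = ["high", "high", "high", "high", "low"]
balanced_texts, balanced_labels = random_oversample(toy_texts, toy_labels)
print(Counter(balanced_labels))
```

In a real pipeline the resampling would normally be applied to the training split only, so that the test set keeps the original, imbalanced distribution; per-class scores are then more informative than overall accuracy, which a classifier can inflate by always predicting the majority class.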



Access Memoire_MarieMartin_s112740.pdf
Size: 1.56 MB
Format: Adobe PDF


  • Martin, Marie ULiège Université de Liège > Master ingé. gest., à fin.



All documents available on MatheO are protected by copyright and subject to the usual rules for fair use.
The University of Liège does not guarantee the scientific quality of these students' works or the accuracy of all the information they contain.