Predicting ratings of Amazon reviews - Techniques for imbalanced datasets
Martin, Marie
Promotor(s) : Ittoo, Ashwin
Date of defense : 22-Jun-2017/27-Jun-2017 • Permalink : http://hdl.handle.net/2268.2/2707
Details
Title : | Predicting ratings of Amazon reviews - Techniques for imbalanced datasets |
Author : | Martin, Marie |
Date of defense : | 22-Jun-2017/27-Jun-2017 |
Advisor(s) : | Ittoo, Ashwin |
Committee's member(s) : | Schyns, Michael ; Beretta, Alessandro |
Language : | English |
Number of pages : | 79 |
Keywords : | product review ; rating ; supervised machine learning ; prediction ; text classification ; imbalanced datasets |
Discipline(s) : | Business & economic sciences > Management information systems |
Institution(s) : | Université de Liège, Liège, Belgique |
Degree: | Master en ingénieur de gestion, à finalité spécialisée en Supply Chain Management and Business Analytics |
Faculty: | Master thesis of the HEC-Ecole de gestion de l'Université de Liège |
Abstract
The goal of this dissertation is to predict the numerical rating a user gives from the text of their review. To do so, supervised machine learning techniques, and more specifically text classification, are used.
Three distinct approaches are presented: binary classification, which predicts whether the rating of a review is low or high, and multi-class classification and logistic regression, which both aim to predict the exact value of the rating for each review. In addition, three different classifiers (Naïve Bayes, Support Vector Machine and Random Forest) are trained and tested on two different datasets from Amazon. These datasets fall into two major product categories, experience products and search products, and are characterized by an imbalanced class distribution. We address this issue by applying sampling techniques to even out the class distributions. Finally, the performance of these classifiers is assessed using evaluation metrics including precision, recall and F1-score.
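To make the sampling and evaluation steps more concrete, here is a minimal sketch in plain Python. It is not the thesis's implementation: the abstract does not say which sampling techniques were applied, so random oversampling is used here purely as one common example. The low/high threshold in `to_binary`, the function names, and the toy reviews are all assumptions made for illustration.

```python
import random
from collections import Counter


def to_binary(rating, threshold=3):
    """Map a 1-5 star rating to 'low' or 'high'.

    The threshold is an assumption; the abstract does not state the cut-off.
    """
    return "high" if rating > threshold else "low"


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1-score with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy, made-up data: five reviews skewed toward high ratings.
reviews = ["great", "love it", "works well", "excellent", "broke quickly"]
ratings = [5, 4, 5, 5, 1]
labels = [to_binary(r) for r in ratings]

balanced_reviews, balanced_labels = random_oversample(reviews, labels)
print(Counter(labels))
print(Counter(balanced_labels))
```

In this toy example the original labels contain four "high" reviews and one "low" review; after oversampling, each class has four samples. In practice, resampling would normally be applied to the training split only, so that the test set keeps the original distribution and the reported precision, recall and F1-score are not inflated by duplicated reviews.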
Our results show that the two most successful classifiers are Naïve Bayes and SVM, with a slight advantage for the latter on both datasets. Binary classification yields quite good results, whereas making more precise predictions (i.e. on a scale from 1 to 5) is a significantly harder task. Nevertheless, the results for this harder task remain acceptable.
More practically, our approach enables users' feedback to be expressed automatically on a numerical scale, which eases the consumer's decision process prior to making a purchase. It can in turn be extended to other situations where no numerical rating system is available, for instance comments on YouTube or Twitter.