
Faculty of Sciences
Master's thesis
VIEW 112 | DOWNLOAD 302

Toward automated classification of bacterial metabarcoding samples by machine learning

Misztak, Agnieszka ULiège
Supervisor(s): Baurain, Denis ULiège
Defense date: 3-Sep-2021 • Permanent URL: http://hdl.handle.net/2268.2/12550
Details
Title: Toward automated classification of bacterial metabarcoding samples by machine learning
Author: Misztak, Agnieszka ULiège
Defense date: 3-Sep-2021
Supervisor(s): Baurain, Denis ULiège
Jury member(s): Hanikenne, Marc ULiège
Meyer, Patrick ULiège
Taminiau, Bernard ULiège
Language: English
Number of pages: 48
Discipline(s): Life sciences > Microbiology
Institution(s): Université de Liège, Liège, Belgium
Degree: Master en bioinformatique et modélisation, à finalité approfondie
Faculty: Master's theses of the Faculty of Sciences

Abstract

[en] Studies of bacterial communities are increasingly popular. Thanks to the continuous decrease in the price of NGS services, curiosity is the only limit, which is reflected in the diversity of the metabarcoding data available. Recently, the collaborative Earth Microbiome Project began creating a catalogue of Earth's multiscale microbial diversity, unifying the efforts of almost 100 independent studies to standardize the protocol for analysing bacterial communities. However, public databases hold a substantial amount of metabarcoding data generated over the years with different sequencing primers targeting different hypervariable regions.
The information about bacterial community composition accumulated in those metabarcoding samples could serve, for example, to identify the origin of a sample. This work aims to establish a base process for jointly analysing metabarcoding data obtained with various protocols. In the selection process, out of over a million sequencing runs, 1567 individually processed paired-end read samples were merged into 45 fine-scaled categories falling into four general datasets: animal-, animal-gut-, environment-, and plant-related. Next, they were processed with the popular QIIME2 software without OTU clustering. Three general databases containing 16S rRNA taxonomic information were tested for their efficacy at five taxonomic ranks in order to optimize the taxonomic identification of amplicon sequence variants. The above-mentioned datasets were tested for classification accuracy using two dimensionality reduction techniques, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), applied to the similarity/dissimilarity matrices obtained separately from abundance and presence/absence matrices. The aptitude of machine learning for establishing a taxonomy-based classification of the sample sources was tested with four algorithms: radial SVM, Naive Bayes, Random Forest and k-Nearest Neighbours. The LDA-transformed similarity matrix created at the Order rank provided the best and most confident classification, with a corrected accuracy of 97.6%. Additionally, to examine whether taxonomic relationships exist among the microorganisms detected in the aforementioned studies, the association rule learning algorithm 'Apriori' was used. A number of co-occurrences of microorganisms at different taxonomic ranks were detected, and several taxa forming highly connected nodes were observed. Those taxa can be regarded as putative keystone taxa and considered for further investigation in different niches.
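
As a rough illustration of the classification idea in the abstract, the sketch below computes dissimilarities from an abundance matrix and from its presence/absence version, then assigns a sample source by k-Nearest Neighbours. It is not the thesis pipeline: the abstract does not name the dissimilarity measures, so Bray-Curtis and Jaccard are assumptions, the counts and labels are invented toy data, and the PCA/LDA step is left out for brevity.

```python
from collections import Counter


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (assumed metric)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def jaccard(a, b):
    """Jaccard dissimilarity on presence/absence profiles (assumed metric)."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    return 1.0 - len(present_a & present_b) / len(union) if union else 0.0


def knn_predict(query, samples, labels, metric, k=3):
    """Majority vote among the k reference samples closest to the query."""
    ranked = sorted(range(len(samples)), key=lambda i: metric(query, samples[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: rows are samples, columns are read counts
# for three unnamed taxa at a single rank (e.g. Order).
samples = [
    [90, 10, 0],
    [80, 15, 5],
    [5, 20, 75],
    [0, 30, 70],
]
labels = ["animal-gut", "animal-gut", "environment", "environment"]
query = [85, 10, 0]

pred_abundance = knn_predict(query, samples, labels, bray_curtis, k=3)
pred_presence = knn_predict(query, samples, labels, jaccard, k=3)
print(pred_abundance, pred_presence)
```

Passing the metric as a function keeps the abundance-based and presence/absence-based variants side by side, which mirrors the separate matrices mentioned in the abstract.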


File(s)

Document(s)

File
Access BIM_thesis_AMisztak.pdf
Description:
Size: 15.39 MB
Format: Adobe PDF
File
Access Erratum_BIM_thesis_AMisztak.pdf
Description: -
Size: 1.87 MB
Format: Adobe PDF

Appendix(es)

File
Access SFig1.jpeg
Description:
Size: 180.16 kB
Format: JPEG
File
Access SFig2.jpeg
Description:
Size: 210.6 kB
Format: JPEG
File
Access SFig3.jpeg
Description:
Size: 180.1 kB
Format: JPEG
File
Access SFig4.jpeg
Description:
Size: 271.23 kB
Format: JPEG
File
Access SFig5.jpeg
Description:
Size: 203.66 kB
Format: JPEG
File
Access SFig6.png
Description:
Size: 2.1 MB
Format: image/png
File
Access SFig7.jpeg
Description:
Size: 184.97 kB
Format: JPEG
File
Access SFig8.jpeg
Description:
Size: 191.74 kB
Format: JPEG
File
Access SFig9.jpeg
Description:
Size: 207.98 kB
Format: JPEG
File
Access STable1.csv
Description:
Size: 711.88 kB
Format: Unknown
File
Access STable2.txt
Description:
Size: 63.12 kB
Format: Text
File
Access STable3.csv
Description:
Size: 976.89 kB
Format: Unknown
File
Access STable4.csv
Description:
Size: 680 B
Format: Unknown
File
Access STable5-8.xls
Description:
Size: 66 kB
Format: Microsoft Excel
File
Access STable9-32.xls
Description:
Size: 319 kB
Format: Microsoft Excel

Author

  • Misztak, Agnieszka ULiège Université de Liège > Master bioinf. & mod., à fin.

Supervisor(s)

Jury member(s)

  • Hanikenne, Marc ULiège Université de Liège - ULiège > Département des sciences de la vie > Génomique fonctionnelle et imagerie moléculaire végétale
    View their publications on ORBi
  • Meyer, Patrick ULiège Université de Liège - ULiège > Département des sciences de la vie > Biologie des systèmes et bioinformatique
    View their publications on ORBi
  • Taminiau, Bernard ULiège Université de Liège - ULiège > Département de sciences des denrées alimentaires (DDA) > Microbiologie des denrées alimentaires
    View their publications on ORBi
  • Total number of views 112
  • Total number of downloads 302

All documents available on MatheO are protected by copyright and subject to the usual rules of fair use.
The University of Liège does not guarantee the scientific quality of these student works or the accuracy of all the information they contain.