Faculty of Applied Sciences
Master thesis

Master thesis : Invoice Entity Recognition

Hubar, Julien ULiège
Supervisor(s): Louppe, Gilles ULiège
Defense date: 27-jui-2022/28-jui-2022 • Permanent URL: http://hdl.handle.net/2268.2/14584
Details
Title: Master thesis : Invoice Entity Recognition
Translated title: [fr] Reconnaissance d'entités de facture
Author: Hubar, Julien ULiège
Defense date: 27-jui-2022/28-jui-2022
Supervisor(s): Louppe, Gilles ULiège
Jury member(s): Van Droogenbroeck, Marc ULiège
Fontaine, Pascal ULiège
Rosu, Lionel
Language: English
Number of pages: 74
Keywords: [en] Machine learning, Deep learning
Discipline(s): Engineering, computing & technology > Computer science
Institution(s): Université de Liège, Liège, Belgium
Degree: Master's degree: engineering in data science (ingénieur civil en science des données), specialized focus
Faculty: Master theses of the Faculty of Applied Sciences

Abstract

[en] Today, for Billy and many accounting fiduciaries, invoice information is usually encoded manually by the accountant or by low-performance software, so a lot of time is spent on encoding rather than on advice. During 2021, Billy's accountants spent 37% of their time on encoding. Consequently, recognizing fields in semi-structured documents of variable layout (i.e., invoices) is a growing need for accountants, and especially for Billy, where the number of new customers increases every month. Meanwhile, the text and image pre-training strategies of Transformer-based models have proven effective in the field of document understanding. Several OCR tools were therefore tested, and the Azure OCR tool, which gave the best performance, was selected to extract text from the invoice images of Billy's customers in order to create datasets. This allowed the construction of four partially annotated datasets, named BTT, BTT Star, BTT QV, and BTT QV Date, which were created from scanned purchase and sales documents and their accounting encoding in the Horus accounting software. The pre-trained multi-modal models LayoutLMv2_BASE, LayoutLMv2_LARGE, and LayoutXLM_BASE were then fine-tuned. In contrast to the earlier models of the LayoutLM family, these models include, in addition to text and layout information, information provided by the document image. Thanks to the spatial-aware self-attention mechanism integrated in the Transformer architecture, they can interpret relations between different bounding boxes.
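As an illustration of the layout information mentioned above: models of the LayoutLM family conventionally take each word's bounding box rescaled to an integer grid from 0 to 1000, independent of the page size. The sketch below shows that rescaling under stated assumptions. The function name `normalize_box`, the pixel-coordinate input format, and the clamping step are hypothetical choices for illustration and are not taken from the thesis code.

```python
def normalize_box(box, page_width, page_height):
    """Rescale a pixel box (x0, y0, x1, y1) to the 0-1000 integer grid.

    Hypothetical helper: the input format (pixel corners from an OCR tool)
    is an assumption, not the thesis's actual preprocessing.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so that small OCR overshoots stay inside the expected range.
    return [max(0, min(1000, v)) for v in scaled]


# A word box on a 1000 x 500 pixel page.
print(normalize_box((100, 50, 300, 100), 1000, 500))
```

Rescaling in this way lets invoices scanned at different resolutions share one coordinate system, which is what makes relative positions comparable across documents.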
According to Billy's accountants, invoice information is recognized by Horus in 70% of cases. The experiments conducted in this Master thesis showed that, on token classification tasks, higher results were obtained on the different datasets in terms of F1-score: BTT (0.9420), BTT Star (0.9553), BTT QV (0.9413), and BTT QV Date (0.9472). In addition, results close to the state of the art were obtained on the open-source CORD dataset, with an F1-score of 0.9354. Moreover, the impact of a model pre-trained only on English documents (LayoutLMv2_BASE) was studied in comparison with a model pre-trained on a multilingual dataset (LayoutXLM_BASE) for classifying tokens in documents that are mostly in French. The results show that the choice of pre-trained model does not have a great impact on the final result for this type of task: BTT (0.9323 -> 0.9338), BTT Star (0.9354 -> 0.9468), BTT QV (0.9229 <- 0.8955), and BTT QV Date (0.9411 <- 0.9328). To conclude, although the results on the four datasets were close to each other, the BTT Star dataset produced the best results. This dataset has the largest number of labels, and they are the most widely distributed over the documents, leading to the hypothesis that a more widely distributed set of labels provides better results.
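For readers unfamiliar with the metric, the following is a minimal sketch of how an F1-score can be computed for token classification. It assumes a micro-averaged, token-level F1 that ignores a background label `"O"`; the thesis may use an entity-level or differently averaged variant. The function name `micro_f1` and the label names in the example (`DATE`, `TOTAL`, `VAT`) are hypothetical.

```python
def micro_f1(gold, pred, ignore_label="O"):
    """Micro-averaged token-level F1, skipping the background label.

    Assumption for illustration only: the exact averaging used in the
    thesis is not stated in the abstract.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore_label and p == g:
            tp += 1  # correct non-background prediction
        else:
            if p != ignore_label:
                fp += 1  # predicted a field that is wrong or absent
            if g != ignore_label:
                fn += 1  # missed or mislabeled a true field
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = ["O", "DATE", "TOTAL", "TOTAL", "O"]
pred = ["O", "DATE", "TOTAL", "O", "VAT"]
print(round(micro_f1(gold, pred), 4))
```

In this toy example two tokens are correct, one true field is missed and one spurious field is predicted, so precision and recall are both 2/3.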
Finally, to put this work into practice, a web application was developed in parallel so that both accountants and customers can use the tool in their daily work.


File(s)

Document(s)

File: Master-Thesis-HubarJulien.pdf
Size: 18.16 MB
Format: Adobe PDF
File: Abstract_Julien_Hubar.pdf
Size: 214.68 kB
Format: Adobe PDF

Appendices

File: code Hubar Julien.zip
Size: 34.61 MB
Format: Unknown

Author

  • Hubar, Julien ULiège Université de Liège > Master ingé. civ. sc. don. à . fin.

Supervisor(s)

  • Louppe, Gilles ULiège

Jury member(s)

  • Van Droogenbroeck, Marc ULiège Université de Liège - ULiège > Dept. of Electrical Engineering, Electronics and Computer Science (Montefiore Institute) > Telecommunications
  • Fontaine, Pascal ULiège Université de Liège - ULiège > Dept. of Electrical Engineering, Electronics and Computer Science (Montefiore Institute) > Distributed computing systems
  • Rosu, Lionel

All documents available on MatheO are protected by copyright and subject to the usual rules of proper use.
The Université de Liège does not guarantee the scientific quality of these student works or the accuracy of all the information they contain.