Faculté des Sciences appliquées

Master thesis : Invoice Entity Recognition

Hubar, Julien ULiège
Promotor(s) : Louppe, Gilles ULiège
Date of defense : 27-Jun-2022/28-Jun-2022
Translated title : [fr] Reconnaissance d'entités de facture
Committee's member(s) : Van Droogenbroeck, Marc ULiège
Fontaine, Pascal ULiège
Rosu, Lionel 
Language : English
Number of pages : 74
Keywords : [en] Machine learning, Deep learning
Discipline(s) : Engineering, computing & technology > Computer science
Institution(s) : Université de Liège, Liège, Belgique
Degree: Master : ingénieur civil en science des données, à finalité spécialisée
Faculty: Master thesis of the Faculté des Sciences appliquées


[en] Today, for Billy and many accounting fiduciaries, invoice information is usually encoded manually by an accountant or by low-performance software, so a lot of time is spent on encoding rather than on advice. During 2021, Billy's accountants spent 37% of their time on encoding. Consequently, recognizing fields in semi-structured documents of variable layout (i.e. invoices) is a growing need for accountants, and especially for Billy, whose number of new customers increases every month. Text and image pre-training strategies for Transformer models have proven effective in the field of document understanding. Several OCR tools were therefore tested, and the Azure OCR tool, which gave the best performance, was selected to extract text from the invoice images of Billy's customers in order to create datasets. This allowed the creation of four partially annotated datasets, named BTT, BTT Star, BTT QV, and BTT QV Date, built from scanned purchase and sales documents and their accounting encoding in the Horus accounting software. The pre-trained multi-modal models LayoutLMv2_BASE, LayoutLMv2_LARGE, and LayoutXLM_BASE were then fine-tuned. In contrast to earlier models of the LayoutLM family, these models use, in addition to the text and layout information, information provided by the document image. Thanks to the spatial-aware self-attention mechanism integrated into the Transformer architecture, they can interpret relations between different bounding boxes.
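As an illustration of the layout input mentioned above: LayoutLM-family models receive, for each OCR token, a bounding box that is typically rescaled to a 0 to 1000 integer grid relative to the page size. The following is a minimal sketch under that assumption; the function name `normalize_box`, the pixel coordinate format `(x0, y0, x1, y1)`, and the sample words and boxes are hypothetical and are not taken from the thesis.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to a 0-1000 integer grid."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR output for two tokens on a 1000 x 2000 pixel page.
words = ["Total", "121.00"]
boxes = [(100, 200, 300, 250), (620, 200, 800, 250)]

normalized = [normalize_box(b, page_width=1000, page_height=2000) for b in boxes]
for word, box in zip(words, normalized):
    print(word, box)
```

Rescaling to a fixed grid makes the layout features independent of the scan resolution, so boxes from invoices scanned at different sizes remain comparable.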
According to Billy's accountants, Horus recognizes invoice information in 70% of cases. The experiments conducted in this Master thesis showed that, on token classification tasks, higher results were obtained on the different datasets in terms of F1-score: BTT (0.9420), BTT Star (0.9553), BTT QV (0.9413), and BTT QV Date (0.9472). In addition, results close to the state of the art were obtained on the open-source CORD dataset, with an F1-score of 0.9354. Moreover, the impact of a model pre-trained on a dataset composed only of English documents (LayoutLMv2_BASE) was studied in comparison with a model pre-trained on a multilingual dataset (LayoutXLM_BASE) for classifying tokens in documents that are mostly in French. The results show that the choice of pre-trained model does not have a great impact on the final result for this type of task: BTT (0.9323 -> 0.9338), BTT Star (0.9354 -> 0.9468), BTT QV (0.9229 <- 0.8955), and BTT QV Date (0.9411 <- 0.9328). To conclude, although the results on the four datasets were close to each other, BTT Star produced the best results. This dataset has the largest number of labels, and they are the most widely distributed over the documents, leading to the hypothesis that a more widely distributed set of labels provides better results.
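To make the F1-score figures above concrete, here is a minimal sketch of a micro-averaged, token-level F1 computation that ignores the "outside" label. It is an assumption that this matches the thesis's evaluation protocol (which may, for example, score whole entities rather than tokens); the function name `micro_f1` and the label names are hypothetical.

```python
def micro_f1(true_labels, pred_labels, outside="O"):
    """Micro-averaged token-level F1, ignoring tokens labeled `outside`."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if p != outside and p == t:
            tp += 1  # a field token predicted with the correct label
        else:
            if p != outside:
                fp += 1  # predicted a field label that is wrong
            if t != outside:
                fn += 1  # missed (or mislabeled) a true field token
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for five tokens of one invoice.
true = ["O", "TOTAL", "DATE", "O", "VAT"]
pred = ["O", "TOTAL", "O", "DATE", "VAT"]

score = micro_f1(true, pred)
print(round(score, 4))
```

In this toy example two of the three field tokens are found and one spurious field is predicted, so precision and recall are both 2/3.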
Finally, to put this work into practice, a web application was developed in parallel so that both accountants and customers can use the tool in their day-to-day work.



Access Master-Thesis-HubarJulien.pdf
Size: 18.16 MB
Format: Adobe PDF
Access Abstract_Julien_Hubar.pdf
Size: 214.68 kB
Format: Adobe PDF


Access code Hubar
Size: 34.61 MB
Format: Unknown


Author

  • Hubar, Julien ULiège Université de Liège > Master ingé. civ. sc. don. à . fin.


Committee's member(s)

  • Van Droogenbroeck, Marc ULiège Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Télécommunications
  • Fontaine, Pascal ULiège Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes informatiques distribués
  • Rosu, Lionel

All documents available on MatheO are protected by copyright and subject to the usual rules for fair use.
The University of Liège does not guarantee the scientific quality of these students' works or the accuracy of all the information they contain.