Master thesis : Invoice Entity Recognition

Hubar, Julien

Date of defense : 27-Jun-2022/28-Jun-2022 • Permalink : `http://hdl.handle.net/2268.2/14584`

Details

Title :	Master thesis : Invoice Entity Recognition
Translated title :	[fr] Reconnaissance d'entités de facture
Author :	Hubar, Julien
Date of defense :	27-Jun-2022/28-Jun-2022
Advisor(s) :	Louppe, Gilles
Committee's member(s) :	Van Droogenbroeck, Marc ; Fontaine, Pascal ; Rosu, Lionel
Language :	English
Number of pages :	74
Keywords :	[en] Machine learning, Deep learning
Discipline(s) :	Engineering, computing & technology > Computer science
Institution(s) :	Université de Liège, Liège, Belgique
Degree:	Master : ingénieur civil en science des données, à finalité spécialisée
Faculty:	Master thesis of the Faculté des Sciences appliquées

Abstract

[en] Today, for Billy and many accounting fiduciaries, invoice information is usually encoded manually by the accountant or by low-performance software, so a lot of time is spent on encoding rather than on advice. During 2021, Billy's accountants spent 37% of their time on encoding. Consequently, recognizing fields in semi-structured documents of variable layout (i.e. invoices) is a growing need for accountants, and especially for Billy, whose number of new customers increases every month. Meanwhile, the text and image pre-training strategies of Transformer-based models have proven effective in the field of document understanding. Several OCR tools were therefore tested, and the Azure OCR tool, which gave the best performance, was selected to extract text from the invoice images of Billy's customers in order to create datasets. This allowed the creation of four partially annotated datasets, named BTT, BTT Star, BTT QV, and BTT QV Date, built from scanned purchase and sales documents and their accounting encoding in the Horus accounting software. The pre-trained multi-modal models LayoutLMv2_BASE, LayoutLMv2_LARGE, and LayoutXLM_BASE were then fine-tuned. In contrast to the earlier models of the LayoutLM family, these models use, in addition to the text and layout information, information provided by the document image. Thanks to the spatial-aware self-attention mechanism integrated in the Transformer architecture, they are able to interpret relations between different bounding boxes.
According to Billy's accountants, invoice information is recognized by Horus in 70% of cases. The experiments conducted in this Master thesis showed that, on token classification tasks, higher results were obtained on the different datasets in terms of F1-score: BTT (0.9420), BTT Star (0.9553), BTT QV (0.9413), and BTT QV Date (0.9472). In addition, results close to the state of the art were obtained on the open-source CORD dataset, with an F1-score of 0.9354. Moreover, the impact of a model pre-trained on a multilingual dataset (LayoutXLM_BASE) was studied in comparison with a model pre-trained on a dataset composed only of English documents (LayoutLMv2_BASE) for classifying tokens in documents mostly written in French. The results show that the choice of pre-trained model does not have a great impact on the final result for this type of task: BTT (0.9323 -> 0.9338), BTT Star (0.9354 -> 0.9468), BTT QV (0.9229 <- 0.8955), and BTT QV Date (0.9411 <- 0.9328). To conclude, although the results on the four datasets were close to each other, BTT Star produced the best results. This dataset has the largest number of labels, and its labels are the most widely distributed over the documents, which leads to the hypothesis that a more widely distributed set of labels provides better results.
Finally, to put this work into practice, a web application was developed in parallel so that accountants and customers can use this tool in their daily work.
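
As an illustration of the F1-scores reported in the abstract, the following minimal sketch computes a micro-averaged, token-level F1 for a token classification task. It is not taken from the thesis code: the function name `token_f1`, the label names (`TOTAL`, `DATE`, `VAT`) and the convention of ignoring an "outside" label `O` are assumptions made for the example, and the thesis may well use a different (for instance entity-level) scoring scheme.

```python
from typing import Sequence


def token_f1(gold: Sequence[str], pred: Sequence[str], outside: str = "O") -> float:
    """Micro-averaged token-level F1, ignoring the `outside` label.

    Hypothetical helper for illustration only; not the thesis implementation.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    # True positive: a non-outside label predicted correctly.
    tp = sum(1 for g, p in pairs if g == p and g != outside)
    # False positive: a non-outside label predicted where it does not belong.
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    # False negative: a non-outside gold label that was missed or mislabeled.
    fn = sum(1 for g, p in pairs if g != outside and g != p)

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up labels: 2 correct fields, 1 spurious, 1 missed.
gold = ["O", "TOTAL", "DATE", "O", "VAT"]
pred = ["O", "TOTAL", "O", "DATE", "VAT"]
score = token_f1(gold, pred)  # precision = recall = 2/3, so F1 = 2/3
```

In this toy case precision and recall are both 2/3, so the F1-score is also 2/3; a score such as 0.9553 means that both the precision and the recall over labeled tokens are high.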

File(s)

Document(s)

Master-Thesis-HubarJulien.pdf
Description:
Size: 18.16 MB
Format: Adobe PDF

Abstract_Julien_Hubar.pdf
Description:
Size: 214.68 kB
Format: Adobe PDF

Annexe(s)

code Hubar Julien.zip
Description:
Size: 34.61 MB
Format: Unknown

All documents available on MatheO are protected by copyright and subject to the usual rules for fair use.
The University of Liège does not guarantee the scientific quality of these students' works or the accuracy of all the information they contain.