Representing Jupyter Notebooks with Knowledge Graphs to Address Data Lineage Problems

Birtles, Alixia

Date of defense : 24-Jun-2024/25-Jun-2024 • Permalink : `http://hdl.handle.net/2268.2/20479`

Details

Title :	Representing Jupyter Notebooks with Knowledge Graphs to Address Data Lineage Problems
Translated title :	[fr] Représentation de Notebooks Jupyter à l'aide de graphes de connaissances pour résoudre des problèmes de traçabilité de données
Author :	Birtles, Alixia
Date of defense :	24-Jun-2024/25-Jun-2024
Advisor(s) :	Debruyne, Christophe
Committee's member(s) :	Geurts, Pierre; Ittoo, Ashwin
Language :	English
Number of pages :	80
Keywords :	[en] Data Lineage; Jupyter Notebook; Knowledge Graph; PROV-O Ontology; RML
Discipline(s) :	Engineering, computing & technology > Computer science
Target public :	Researchers; Professionals of domain; Student
Institution(s) :	Université de Liège, Liège, Belgique
Degree:	Master : ingénieur civil en science des données, à finalité spécialisée
Faculty:	Master thesis of the Faculté des Sciences appliquées

Abstract

[en] In data science, data lineage is a crucial aspect that is often insufficiently considered. To
address challenges related to data lineage, the approach presented in this thesis leverages
knowledge graphs and data provenance.
The PROV-O ontology and the FOAF vocabulary are used, together with additionally defined
terms, to design an ontology. This ontology represents the information extracted from Jupyter
notebooks, a tool widely used in data science. Public APIs are also used to enrich
the graph.
Initially, the RML language was used to map the data, but it proved too limiting, which led to
the consideration of the RDFLib library in Python. Two RML processors, RMLMapper and
Morph-KGC, were considered: the former lacks the extension required to access the desired data
in the source code, while the latter has iterator challenges and does not support theta-joins.
The correctness of the approach was validated through visualization in GraphDB and through
SPARQL queries. A complex query on the extraction of licenses demonstrated the feasibility of
the approach and its ability to answer questions about data lineage. Moreover, experiments
with queries on a real-world dataset, the KGTorrent dataset, showed the effectiveness of
the approach. Performance measurements of graph construction and of SPARQL
queries under real-world conditions gave promising results.
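
As an illustration of the general idea only, the sketch below turns the code cells of a notebook into PROV-O and FOAF statements. It is not the thesis's implementation: the `http://example.org/notebook/` namespace, the function names, the sample notebook, and the modelling choices (notebook as `prov:Collection`, cells as `prov:hadMember` entities, cell source as `prov:value`) are all assumptions made for this example. To stay dependency-free, it writes N-Triples lines by hand instead of using RDFLib.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PROV = "http://www.w3.org/ns/prov#"
FOAF = "http://xmlns.com/foaf/0.1/"
BASE = "http://example.org/notebook/"  # hypothetical namespace, not from the thesis


def iri(value):
    """Wrap an IRI string in N-Triples angle brackets."""
    return f"<{value}>"


def literal(text):
    """Serialize a plain string literal with N-Triples escaping."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def notebook_to_ntriples(notebook, notebook_id, author_name):
    """Return N-Triples lines describing a notebook dict (nbformat-style JSON)."""
    nb = iri(BASE + quote(notebook_id))
    agent = iri(BASE + "agent/" + quote(author_name))
    triples = [
        (nb, iri(RDF_TYPE), iri(PROV + "Collection")),
        (agent, iri(RDF_TYPE), iri(PROV + "Agent")),
        (agent, iri(FOAF + "name"), literal(author_name)),
        (nb, iri(PROV + "wasAttributedTo"), agent),
    ]
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # this sketch only models code cells
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of lines
            source = "".join(source)
        cell_iri = iri(f"{BASE}{quote(notebook_id)}/cell/{index}")
        triples.append((cell_iri, iri(RDF_TYPE), iri(PROV + "Entity")))
        triples.append((nb, iri(PROV + "hadMember"), cell_iri))
        triples.append((cell_iri, iri(PROV + "value"), literal(source)))
    return [f"{s} {p} {o} ." for s, p, o in triples]


# Invented sample with the same shape that json.load() yields for an .ipynb file.
SAMPLE = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Analysis"]},
        {"cell_type": "code", "source": ["import pandas as pd\n", "df = pd.read_csv('data.csv')"]},
        {"cell_type": "code", "source": "df.describe()"},
    ]
}

if __name__ == "__main__":
    for line in notebook_to_ntriples(SAMPLE, "demo", "Jane Doe"):
        print(line)
```

The resulting lines can be loaded into any triple store and queried with SPARQL, for example to list every cell that belongs to a given notebook.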

File(s)

Document(s)

Resume_AlixiaBirtles.pdf
Description:
Size: 61.8 kB
Format: Adobe PDF

Thesis_AlixiaBirtles.pdf
Description:
Size: 1.19 MB
Format: Adobe PDF

All documents available on MatheO are protected by copyright and subject to the usual rules for fair use.
The University of Liège does not guarantee the scientific quality of these students' works or the accuracy of all the information they contain.
