Can Large Language Models accelerate the correction of student code?
Coco, Andreas
Promotor(s): Geurts, Pierre
Date of defense: 8-Sep-2025/9-Sep-2025 • Permalink: http://hdl.handle.net/2268.2/24781
Details
| Title: | Can Large Language Models accelerate the correction of student code? |
| Author: | Coco, Andreas |
| Date of defense: | 8-Sep-2025/9-Sep-2025 |
| Advisor(s): | Geurts, Pierre |
| Committee's member(s): | Louppe, Gilles; Donnet, Benoît; Debruyne, Christophe |
| Language: | English |
| Discipline(s): | Engineering, computing & technology > Computer science |
| Institution(s): | Université de Liège, Liège, Belgium |
| Degree: | Master en science des données, à finalité spécialisée (Master in Data Science, specialised focus) |
| Faculty: | Master thesis of the Faculté des Sciences appliquées |
Abstract
[en] This thesis assesses to what extent large language models (LLMs) can accelerate the correction of student code in an introductory C programming course. The motivation is practical: autograders are helpful grading tools, but they miss many dimensions of code quality such as clarity, efficiency, and style, so human review remains heavy and slow. LLMs, which can read code in context and provide natural-language feedback, may fill part of this gap. The principal objective is to determine how they can be leveraged to accelerate code correction.
We conduct three sets of experiments on real coursework from the "Additional Information Theory" course at the University of Liège. First, we run preliminary code-generation tests to determine whether state-of-the-art LLMs can solve the course tasks. Second, we evaluate automated grading with Qwen2.5-Coder-7B on two datasets, consisting of student submissions for a homework assignment and for a project respectively, and we compare model-predicted grades and feedback to human grades. Third, we study error detection and code correction on the same homework by fine-tuning Qwen2.5-Coder-7B with LoRA on prompt-response pairs.
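To make the third experiment concrete, the sketch below shows one way such a LoRA fine-tune of Qwen2.5-Coder-7B on prompt-response pairs could be set up with the Hugging Face transformers and peft libraries. The data file, prompt format, and hyperparameters are illustrative assumptions, not the configuration used in the thesis.

    # Hypothetical LoRA fine-tuning sketch; file name, prompt format and
    # hyperparameters are placeholders, not the thesis' actual setup.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    # Attach low-rank adapters to the attention projections only.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)

    # Each JSON record pairs an error-detection/correction prompt with the
    # expected response (e.g. the located bug and the repaired code).
    pairs = load_dataset("json", data_files="prompt_response_pairs.jsonl")["train"]

    def tokenize(example):
        text = example["prompt"] + example["response"] + tokenizer.eos_token
        return tokenizer(text, truncation=True, max_length=2048)

    pairs = pairs.map(tokenize, remove_columns=pairs.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="qwen-lora-correction",
                               per_device_train_batch_size=1,
                               gradient_accumulation_steps=8,
                               num_train_epochs=3,
                               learning_rate=2e-4,
                               logging_steps=10),
        train_dataset=pairs,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()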
With respect to grading, the model's numeric predictions are not reliable: on both tasks, the mean errors often match or exceed those of a constant baseline. However, when the task is reframed as a simpler classification problem, asking the LLM whether each submission is fully correct, Qwen's performance is above chance. The best setting uses a criteria-based prompt written in French and consistently outperforms the baseline, but it remains insufficient for autonomous grading.
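As an illustration of this pass/fail reframing, the sketch below queries the model with a simple English yes/no criteria prompt and parses its answer. The prompt wording and helper names are hypothetical assumptions; the thesis' actual French criteria-based prompt is not reproduced here.

    # Hypothetical pass/fail classification sketch; the prompt wording is an
    # assumption, not the French criteria-based prompt used in the thesis.
    import re
    from transformers import pipeline

    generator = pipeline("text-generation",
                         model="Qwen/Qwen2.5-Coder-7B-Instruct",
                         device_map="auto")

    CRITERIA_PROMPT = (
        "You are grading a student's C submission for an introductory course.\n"
        "Criteria: it compiles, produces correct output, is reasonably\n"
        "efficient, and is clearly written.\n"
        "Answer YES if the submission is fully correct, otherwise answer NO.\n\n"
        "Submission:\n{code}\n\nAnswer:"
    )

    def predict_fully_correct(code: str) -> bool:
        # Greedy decoding; only the few generated tokens are returned.
        answer = generator(CRITERIA_PROMPT.format(code=code),
                           max_new_tokens=5, do_sample=False,
                           return_full_text=False)[0]["generated_text"]
        return re.search(r"\bYES\b", answer, re.IGNORECASE) is not None

    def accuracy(predictions, labels):
        # The model is only useful if this beats a constant baseline that
        # always predicts the majority class of the labels.
        return sum(p == y for p, y in zip(predictions, labels)) / len(labels)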
In error detection and correction, our initial fine-tuning with Qwen-generated data slightly improved correction rates, but it often produced full code rewrites rather than targeted corrections. A second fine-tuning attempt used more diverse, higher-quality training data generated by OpenAI models, which encouraged targeted edits, yet it reduced correction performance on student submissions. These results indicate that improving a model's error detection and repair abilities is difficult with such limited datasets.
Overall, we find that LLMs are not yet powerful enough to replace human graders, whether for grading or for error detection and correction. Their most promising use today is as a support tool alongside autograders and human review. Still, our findings are limited in scope: we only used C tasks from a single course and minimal prompting. We recommend exploring more powerful models and considering fine-tuning on Python tasks with a larger, more comprehensive training set.
Master Thesis Online: s2302246Coco2025.pdf