Can Large Language Models accelerate the correction of student code?
Coco, Andreas
Supervisor(s): Geurts, Pierre
Defense date: 8-Sep-2025/9-Sep-2025 • Permanent URL: http://hdl.handle.net/2268.2/24781
Details
| Title: | Can Large Language Models accelerate the correction of student code? |
| Author: | Coco, Andreas |
| Defense date: | 8-Sep-2025/9-Sep-2025 |
| Supervisor(s): | Geurts, Pierre |
| Jury member(s): | Louppe, Gilles; Donnet, Benoît; Debruyne, Christophe |
| Language: | English |
| Discipline(s): | Engineering, computing & technology > Computer science |
| Institution(s): | Université de Liège, Liège, Belgium |
| Degree: | Master in data science, professional focus |
| Faculty: | Master theses of the Faculty of Applied Sciences |
Abstract
This thesis assesses to what extent large language models (LLMs) can accelerate the correction of student code in an introductory C programming course. The motivation is practical: autograders are helpful grading tools, but they miss many dimensions of code quality such as clarity, efficiency, and style, so human review remains heavy and slow. LLMs, which can read code in context and provide natural-language feedback, may fill part of this gap. The principal objective is to determine how they can be leveraged to accelerate code correction.
We conduct three sets of experiments on real coursework from the "Additional Information Theory" course at the University of Liège. First, we run preliminary code-generation tests to determine whether state-of-the-art LLMs can solve the course tasks. Second, we evaluate automated grading with Qwen2.5-Coder-7B on two datasets, consisting of student submissions for a homework assignment and a project, respectively. We compare model-predicted grades and feedback to human grades. Third, we study error detection and code correction on the same homework by fine-tuning Qwen2.5-Coder-7B with LoRA on prompt-response pairs, in the spirit of the sketch below.
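The following is an illustrative sketch of this kind of LoRA fine-tuning setup, assuming the Hugging Face transformers/peft/datasets stack; the data file name and hyperparameters are placeholders, not the thesis' actual configuration.

```python
# Minimal LoRA fine-tuning sketch on prompt-response pairs (illustrative only).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections; only these weights are trained.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Each JSONL record holds a prompt (assignment statement + buggy student code)
# and a response (detected errors and a corrected version). Hypothetical file name.
dataset = load_dataset("json", data_files="prompt_response_pairs.jsonl", split="train")

def tokenize(example):
    text = example["prompt"] + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen-lora-correction",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=3,
                           learning_rate=2e-4,
                           bf16=True,
                           logging_steps=10),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```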
With respect to grading, the model's numeric grade predictions are not reliable: on both tasks, the mean errors often match or exceed those of a constant baseline. However, when the task is reframed as a simpler classification problem, where we ask the LLM whether each submission is fully correct, Qwen's performance is above chance. The best setting uses a criteria-based prompt written in French; it consistently outperforms the baseline but remains insufficient for autonomous grading. A sketch of this classification reframing is given below.
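The sketch below illustrates the binary reframing, assuming the same Qwen2.5-Coder-7B-Instruct checkpoint; the criteria list and English prompt wording are illustrative placeholders, not the exact French prompt used in the thesis.

```python
# Illustrative sketch: ask the model for a YES/NO verdict instead of a numeric grade.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                             device_map="auto")

def is_fully_correct(statement: str, submission: str) -> bool:
    """Classify a student submission as fully correct or not against a list of criteria."""
    prompt = (
        "You are grading a student's C submission.\n"
        f"Assignment:\n{statement}\n\nSubmission:\n```c\n{submission}\n```\n\n"
        "Criteria: compiles, handles all required cases, frees allocated memory.\n"
        "Is the submission fully correct? Answer with exactly YES or NO."
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                           return_tensors="pt").to(model.device)
    output = model.generate(inputs, max_new_tokens=5, do_sample=False)
    # Decode only the newly generated tokens and map them to a boolean verdict.
    answer = tokenizer.decode(output[0, inputs.shape[1]:], skip_special_tokens=True)
    return answer.strip().upper().startswith("YES")
```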
In error detection and correction, our initial fine-tuning with Qwen-generated data slightly improved correction rates. However, it often produced full code rewrites rather than genuine code corrections. A second fine-tuning attempt used more diverse, high-quality training data generated by OpenAI models, which encouraged targeted edits. However, this reduced correction performance on student submissions. These results indicate that improving a model's error detection and repair abilities is difficult with such limited datasets.
Overall, we find that LLMs are not yet powerful enough to replace human graders, either for grading or for error detection and correction. Their most promising use today is as a support tool alongside autograders and human review. Still, our findings are limited in scope: we only used C tasks from a single course and minimal prompting. We recommend exploring more powerful models and considering fine-tuning on Python tasks with a larger, more comprehensive training set.
The University of Liège does not guarantee the scientific quality of these student works nor the accuracy of all the information they contain.
