Master thesis : Performance evaluation and optimization of a GPU-enabled Discontinuous Galerkin code
D'Antonio, Marco
Promotor(s) :
Geuzaine, Christophe
Date of defense : 5-Sep-2022/6-Sep-2022 • Permalink : http://hdl.handle.net/2268.2/15924
Details
Title : | Master thesis : Performance evaluation and optimization of a GPU-enabled Discontinuous Galerkin code |
Author : | D'Antonio, Marco |
Date of defense : | 5-Sep-2022/6-Sep-2022 |
Advisor(s) : | Geuzaine, Christophe |
Committee's member(s) : | Cicuttin, Matteo; Hillewaert, Koen; Arnst, Maarten |
Language : | English |
Discipline(s) : | Engineering, computing & technology > Computer science |
Target public : | Researchers; Professionals of domain; Student |
Institution(s) : | Université de Liège, Liège, Belgique; Università degli Studi di Salerno, Fisciano, Italia |
Degree: | Supplementary courses for exchange students (Erasmus, ...) |
Faculty: | Master thesis of the Faculté des Sciences appliquées |
Abstract
[en] Modern supercomputers rely on GPUs to deliver better performance on many problems, but developing parallel applications that run at high performance requires a thorough understanding of both the hardware and the software platforms.
Numerical electromagnetics, for example, is one of the fields that benefit from modern HPC machines: various numerical methods have shown improved performance once implemented on GPUs.
In particular, Discontinuous Galerkin Time Domain methods are usually implemented on GPUs for their scalability.
Gmsh DG, developed at the Applied and Computational Electromagnetics research group of the University of Liège, is a solver for Maxwell's equations using the Discontinuous Galerkin method, targeting high-performance parallel systems.
This thesis aimed to implement performance optimizations, carry out a thorough performance analysis, and add support for multi-GPU systems.
Two optimizations were implemented during the work; they improve overall application performance by reducing memory traffic, increasing locality, and enabling the use of compiler optimizations.
The performance of the application was evaluated on real-world problems through a scaling analysis on a multiprocessor system, which showed perfect scaling up to the bandwidth saturation of the NUMA domains of the AMD processors used for testing.
Furthermore, the results show that about 64 dedicated cores are required to outperform single-GPU execution.
The evaluation was also carried out for the individual computational kernels, showing that all of them, on both CPU and GPU, make full use of the available bandwidth and that, especially at high orders of approximation, some kernels perform very close to the peak achievable by the hardware.
Finally, the work focused on implementing multi-GPU support for the application and testing its performance on the available platform. Our measurements show that the solver achieves good performance, which becomes optimal as the problem size increases.
The University of Liège does not guarantee the scientific quality of these students' works or the accuracy of all the information they contain.