Using Deep Learning for Malware generation to bypass Malware detection mechanisms
Schleich, Corinne
Promotor(s) : Louppe, Gilles ; Donnet, Benoît
Date of defense : 6-Sep-2021/7-Sep-2021 • Permalink : http://hdl.handle.net/2268.2/13112
Details
Title : Using Deep Learning for Malware generation to bypass Malware detection mechanisms
Author : Schleich, Corinne
Date of defense : 6-Sep-2021/7-Sep-2021
Advisor(s) : Louppe, Gilles ; Donnet, Benoît
Committee's member(s) : Mathy, Laurent ; Boigelot, Bernard
Language : English
Discipline(s) : Engineering, computing & technology > Computer science
Commentary : The code can be found at https://gitlab.com/CorinneSch/using-deep-learning-for-malware-generation-to-bypass-malware-detection-mechanisms
Institution(s) : Université de Liège, Liège, Belgique
Degree : Master en ingénieur civil en informatique, à finalité spécialisée en "intelligent systems"
Faculty : Master thesis of the Faculté des Sciences appliquées
Abstract
[en] Machine learning (ML) has become increasingly accessible in recent years, and documentation about ML techniques and their implementation is widely available online. As a result, these techniques are increasingly used with malicious intent. Creating malware variants that evade security measures is tedious and difficult. The idea of this thesis is therefore to use deep learning techniques to automatically generate malware variants that can bypass classical detection techniques such as signature checking.

The deep learning models used are encoder-decoder models, which have proved themselves as state-of-the-art models for text generation. Two main types of models are tested in this thesis: variational auto-encoders (VAEs) and sequence-to-sequence models with attention. For each, different architectures are implemented and tested. Convolutional and recurrent neural network (RNN) VAEs are tested with different parameters and on different datasets. Different sequence-to-sequence architectures with a Bahdanau attention mechanism are also tested.

The models are trained on a benign dataset: since the goal is to add modifications to an initial code, it appears inappropriate at this stage to apply the approach specifically to malware files. This also makes the dataset easier to construct, since finding benign C files is simpler than finding enough malware files. A training dataset is also created, made up of variants of lines of C code. Using modified lines rather than fully modified programs means the dataset consists of shorter sequences, which should make learning easier for the deep learning models. To this end, an initial file is chosen and parsed into lines of code. A genetic algorithm is then applied to each line to add random modifications and so create variants of that line. Those variants are then collected to form the dataset.
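As a reminder of what the Bahdanau attention mechanism mentioned above computes, the following is a minimal textbook sketch of additive attention in plain Python. It is not taken from the thesis code: the function names, the toy weight matrices `w_dec` and `w_enc`, the vector `v`, and the toy states are all hypothetical and chosen only for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(decoder_state, encoder_states, w_dec, w_enc, v):
    """Additive (Bahdanau-style) attention.

    score_i = v . tanh(w_dec @ decoder_state + w_enc @ encoder_states[i])
    weights = softmax(scores); context = sum_i weights[i] * encoder_states[i]
    """
    projected_state = matvec(w_dec, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(w_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Numerically stable softmax over the alignment scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[j] for w, h in zip(weights, encoder_states)) for j in range(dim)
    ]
    return weights, context


# Toy values (assumptions, not parameters from the thesis).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
w_dec = [[1.0, 0.0], [0.0, 1.0]]
w_enc = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

weights, context = additive_attention(decoder_state, encoder_states, w_dec, w_enc, v)
```

In a sequence-to-sequence model, the decoder would recompute `weights` and `context` at every output step, so each generated token can focus on different parts of the input sequence.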
Genetic algorithms (GAs) have already been used in other works and have shown good results in generating code variants. Hence, GAs were chosen to generate the variant datasets. Tests have been performed on the outputs of the GA: files are reconstructed from the modified lines and tested to verify that the modifications made to the initial lines also have an effect at the binary level. To this end, the reconstructed files are compiled, and the compiler output is used to measure the similarity between the initial code and the reconstructed code. The goal is a low similarity rate, so that the variants differ as much as possible from their initial code. This thesis is a first step in the area of malware variant code generation using deep learning methods. The tested deep learning models appear to have great difficulty learning to generate such variant sequences and therefore do not yield groundbreaking predictions. However, the results allow important conclusions to be drawn to direct future research in the field.
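The abstract does not state which similarity measure is applied to the compiler output. As one possible illustration, the sketch below compares two byte sequences with the `difflib.SequenceMatcher` ratio from the Python standard library. The function name `similarity_ratio` and the toy byte strings are assumptions for this example, not the metric or data used in the thesis.

```python
import difflib


def similarity_ratio(original: bytes, variant: bytes) -> float:
    """Return a similarity score in [0, 1] between two byte sequences.

    Uses difflib's ratio 2*M/T, where M is the number of matching bytes
    and T is the total length of both sequences. This is an assumed,
    illustrative metric, not necessarily the one used in the thesis.
    """
    matcher = difflib.SequenceMatcher(None, original, variant, autojunk=False)
    return matcher.ratio()


# Toy byte strings standing in for two compiled outputs (hypothetical data).
same = similarity_ratio(b"abcd", b"abcd")
half = similarity_ratio(b"abcd", b"abxy")
none = similarity_ratio(b"abcd", b"wxyz")
```

With a measure of this kind, a score close to 1 means the compiled variant is nearly identical to the original, while a lower score means the line-level modifications carried through to the binary. `SequenceMatcher` has quadratic worst-case running time, so it is only practical for small inputs.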
The University of Liège does not guarantee the scientific quality of these students' works or the accuracy of all the information they contain.