yun gao | The University of Tokushima (original) (raw)

Uploads

Papers by yun gao

Research paper thumbnail of Malware Detection by Control-Flow Graph Level Representation Learning With Graph Isomorphism Network

IEEE Access

With society's increasing reliance on computer systems and network technology, the threat of mali... more With society's increasing reliance on computer systems and network technology, the threat of malicious software grows more and more serious. In the field of information security, malware detection has been a key problem that academia and industry are committed to solving. Machine learning is an effective method for processing large-scale data, such as the Gradient Boosting Decision Tree (GBDT) and deep neural network technology. Although these types of detection methods can deal with cyber threats, most feature extraction methods are based on the statistical information features of portable executable (PE) files and thus lack the decompiled code and execution flow structure of the PE samples. Therefore, we propose a Control-Flow Graph (CFG)-and Graph Isomorphism Network (GIN)-based malware classification system. The feature vectors of CFG basic blocks are generated using the large-scale pre-trained language model MiniLM, which is beneficial for the GIN to further learn and compress the CFG-based representation, and classified with multi-layer perceptron. In addition, we evaluated the effectiveness of the representation under different dimensions and classifiers. To evaluate our method, we set up a CFG-based malware detection graph dataset from a PE file of the Blue Hexagon Open Dataset for Malware Analysis (BODMAS), which we call the Malware Geometric Binary Dataset (MGD-BINARY) and collected the experimental results of CFG representation in different dimensions and classifier settings. The evaluation results show that our proposal has proved an Accuracy metric of 0.99160 and achieved 0.99148 Area Under the Curve (AUC) results.

Research paper thumbnail of Malware Detection Using LightGBM With a Custom Logistic Loss Function

IEEE Access

The increased spread of malicious software (malware) through the internet remains a serious threa... more The increased spread of malicious software (malware) through the internet remains a serious threat. Malware authors use obfuscation and deformation techniques to generate new types than can evade traditional detection methods. Hence, it is widely expected that machine learning methods can classify malware and cleanware based on the characteristics of malware samples. This paper investigates malware classification accuracy using static methods for malware detection based on LightGBM by a custom log loss function, which controls learning by installing coefficient α to a loss function of the false-negative side and coefficient β to a loss function of the false-positive side. By installing coefficients, we can create a lopsided classifier. We used two malware datasets, non-public and public, to construct a malware baseline model to verify the effectiveness of the proposed method. We extracted the dataset features from PE-file surface analysis and PE-header dumps and customized a binary log loss function to improve all the classification evaluation metrics to a certain extent. We obtained a better result (AUC = 0.979) at α = 430 and β = 339 than the normal log loss function (AUC = 0.978) on the EMBER dataset. In addition, to maintain malware detection coverage and quick countermeasures to true positive results, we propose a hybrid usage of different custom models to prioritize positive results.

Research paper thumbnail of Malware Detection by Control-Flow Graph Level Representation Learning With Graph Isomorphism Network

IEEE Access

With society's increasing reliance on computer systems and network technology, the threat of mali... more With society's increasing reliance on computer systems and network technology, the threat of malicious software grows more and more serious. In the field of information security, malware detection has been a key problem that academia and industry are committed to solving. Machine learning is an effective method for processing large-scale data, such as the Gradient Boosting Decision Tree (GBDT) and deep neural network technology. Although these types of detection methods can deal with cyber threats, most feature extraction methods are based on the statistical information features of portable executable (PE) files and thus lack the decompiled code and execution flow structure of the PE samples. Therefore, we propose a Control-Flow Graph (CFG)-and Graph Isomorphism Network (GIN)-based malware classification system. The feature vectors of CFG basic blocks are generated using the large-scale pre-trained language model MiniLM, which is beneficial for the GIN to further learn and compress the CFG-based representation, and classified with multi-layer perceptron. In addition, we evaluated the effectiveness of the representation under different dimensions and classifiers. To evaluate our method, we set up a CFG-based malware detection graph dataset from a PE file of the Blue Hexagon Open Dataset for Malware Analysis (BODMAS), which we call the Malware Geometric Binary Dataset (MGD-BINARY) and collected the experimental results of CFG representation in different dimensions and classifier settings. The evaluation results show that our proposal has proved an Accuracy metric of 0.99160 and achieved 0.99148 Area Under the Curve (AUC) results.

Research paper thumbnail of Malware Detection Using LightGBM With a Custom Logistic Loss Function

IEEE Access

The increased spread of malicious software (malware) through the internet remains a serious threa... more The increased spread of malicious software (malware) through the internet remains a serious threat. Malware authors use obfuscation and deformation techniques to generate new types than can evade traditional detection methods. Hence, it is widely expected that machine learning methods can classify malware and cleanware based on the characteristics of malware samples. This paper investigates malware classification accuracy using static methods for malware detection based on LightGBM by a custom log loss function, which controls learning by installing coefficient α to a loss function of the false-negative side and coefficient β to a loss function of the false-positive side. By installing coefficients, we can create a lopsided classifier. We used two malware datasets, non-public and public, to construct a malware baseline model to verify the effectiveness of the proposed method. We extracted the dataset features from PE-file surface analysis and PE-header dumps and customized a binary log loss function to improve all the classification evaluation metrics to a certain extent. We obtained a better result (AUC = 0.979) at α = 430 and β = 339 than the normal log loss function (AUC = 0.978) on the EMBER dataset. In addition, to maintain malware detection coverage and quick countermeasures to true positive results, we propose a hybrid usage of different custom models to prioritize positive results.

Log In