Towards an Automated Pipeline for Detecting and Classifying Malware through Machine Learning
Related papers
A New Classification Based Model for Malicious PE Files Detection
International Journal of Computer Network and Information Security, 2019
Malware presents a major threat to the security of computer systems, smart devices, and applications. It can also endanger sensitive data by modifying or destroying it, which can compromise electronic exchanges between communicating entities. However, currently used signature-based methods cannot accurately detect zero-day attacks or polymorphic and metamorphic programs, which can change their code during propagation. To address this issue, static and dynamic malware analysis is being used along with machine learning algorithms for malware detection and classification. Machine learning methods play an important role in automated malware detection, and several approaches have been applied to classify and detect malware. The most challenging task is selecting a relevant set of features from a large dataset so that the classification model can be built in less time with higher accuracy. The purpose of this work is, first, to give a general review of the existing classification and detection methods and, second, to develop an automated system that detects malicious Portable Executable files based on their headers with low performance overhead and greater efficiency. Experimental results are presented for the best classifier selected in this study, namely Random Forest; accuracy and time performance are discussed.
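To make "detection based on headers" concrete, the following is a minimal sketch of reading the COFF file header of a PE image with Python's standard `struct` module. It is not the authors' feature extractor: the function name, the feature names, and the synthetic sample bytes are assumptions chosen for illustration, and a real pipeline would also read the optional header and section table.

```python
import struct

# Field names are illustrative labels for the 20-byte COFF file header.
COFF_FIELDS = (
    "machine",
    "number_of_sections",
    "time_date_stamp",
    "pointer_to_symbol_table",
    "number_of_symbols",
    "size_of_optional_header",
    "characteristics",
)


def parse_pe_file_header(data: bytes) -> dict:
    """Return the COFF file-header fields of a PE image as a feature dict."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew (offset 0x3C in the DOS header) is the offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # The COFF header follows the 4-byte signature: 2+2+4+4+4+2+2 = 20 bytes.
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))


def make_sample() -> bytes:
    """Build a tiny synthetic header (hypothetical values, not a real binary)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature right after DOS header
    coff = struct.pack("<HHIIIHH", 0x8664, 3, 0, 0, 0, 240, 0x0022)
    return bytes(dos) + b"PE\x00\x00" + coff


features = parse_pe_file_header(make_sample())
print(features)
```

Each resulting dictionary can serve as one row of a feature table; which fields carry predictive value is an empirical question that the feature-selection step described above would have to answer.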
Machine Learning Techniques to Detect Maliciousness of Portable Executable Files
2019 International Conference on Promising Electronic Technologies (ICPET), 2019
In the past few years, malware has become one of the most significant threats to computer security. Malware, or malicious software, is software that attackers use or write to interrupt the operations of a computer, to collect secret or private information, or to access computer systems without authorization. In this paper, we present a machine learning based approach to classifying a portable executable (PE) file as benign or malware with high accuracy. The proposed approach uses static analysis to extract an integrated feature set, which is created by combining a few raw features selected from the three main headers of PE files with a set of derived features. Seven supervised learning algorithms are used to classify malware. We compare the performance of each classifier in terms of accuracy, precision, and F-measure. The experimental results indicate that the integrated feature set performs better than the raw feature set on all metrics. Accuracy on the integrated dataset ranges from 91% to 99%, against 71% to 97% on the raw dataset, using a 70/30 split. Random Forest outperformed all other classifiers on both datasets (with an accuracy of 99.23%).
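As a reminder of how the reported metrics relate to one another, here is a small sketch that derives accuracy, precision, recall, and F-measure from confusion-matrix counts. The function name and the counts are hypothetical and are not taken from the paper; "malware" is assumed to be the positive class.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure (F1) is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up counts for a 200-file test set (illustration only).
metrics = binary_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because accuracy alone can look favourable on an imbalanced test set, reporting precision and F-measure next to it, as the abstract does, gives a fuller picture of classifier behaviour.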
Complex & Intelligent Systems
Enterprises are striving to remain protected against malware-based cyber-attacks on their infrastructure, facilities, networks and systems. Static analysis is an effective approach to detecting malware, i.e., malicious Portable Executable (PE) files. It performs an in-depth analysis of PE files without executing them, which is highly useful for minimizing the risk of a malicious PE contaminating the system. Yet instant detection using static analysis has become very difficult due to the exponential rise in the volume and variety of malware. The compelling need for early-stage detection of malware-based attacks strongly motivates research into automated malware detection. Recent machine learning aided malware detection approaches using static analysis are mostly supervised. Supervised malware detection using static analysis requires manual labelling and human feedback; therefore, it is less effective in a rapidly evolving and dynamic threat space. To this end, we propose a pro...
Malware Analysis with Machine Learning: Classifying Malware based on PE Header
International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2022
The history of malware goes back to the early days of the computer itself. Since then, malware has been a fascinating tool for hackers to exploit computers. As technology has advanced, malware has also become more complex and harder to detect, which in turn creates a cat-and-mouse chase between security researchers and hackers trying to outsmart each other. Because traditional antivirus software is failing to detect these malware variants, new technologies have to be adopted to detect them. This paper presents an approach to detecting malware using machine learning techniques. It explores different malware detection techniques and how ML can be integrated to make detection more reliable. ML models are trained with PE features of malware and benign executables. It is observed that the model predicts with 99.4% accuracy. This approach would help researchers to understand and develop more sophisticated antivirus tools that detect even complex malware.
2020
There is a lack of scientific testing of commercially available malware detectors, especially those that boast accurate classification of never-before-seen (zero-day) files using machine learning (ML). The result is that the efficacy and trade-offs among the different available approaches are opaque. In this paper, we address this gap in the scientific literature with an evaluation of commercially available malware detection tools. We tested each tool against 3,536 total files (2,554, or 72%, malicious; 982, or 28%, benign), including over 400 zero-day malware samples, and tested with a variety of file types and protocols for delivery. Specifically, we investigate three questions: Do ML-based malware detectors provide better detection than signature-based detectors? Is it worth purchasing a network-level malware detector to complement host-based detection? What is the trade-off in detection time and detection accuracy among commercially available tools using static and dynamic analysis? We present sta...
A Multi-Feature Dataset for Windows PE Malware Classification
Social Science Research Network, 2023
This paper describes a multi-feature dataset for training machine learning classifiers to detect malicious Windows Portable Executable (PE) files. The dataset includes four feature sets from 18,551 binary samples belonging to five malware families: Spyware, Ransomware, Downloader, Backdoor and Generic Malware. The feature sets include the list of DLLs and their functions, and the values of different fields of the PE header and sections. First, we explain the data collection and creation phase, and then we explain how we labeled the samples using VirusTotal's services. Finally, we explore the dataset to describe how it can benefit researchers in static malware analysis. The dataset is made public in the hope that it will help inspire machine learning research for malware detection.
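One common way to turn a per-sample list of imported DLLs into classifier input is a binary presence vector over a shared vocabulary. The sketch below illustrates that idea only; the abstract does not say how the dataset encodes its DLL feature set, and the function names and sample import lists here are hypothetical.

```python
def build_vocabulary(samples: list) -> list:
    """Collect every DLL name seen across samples, case-folded and sorted."""
    return sorted({dll.lower() for dlls in samples for dll in dlls})


def to_binary_vector(dlls: list, vocabulary: list) -> list:
    """Return 1 for each vocabulary DLL the sample imports, else 0."""
    present = {dll.lower() for dll in dlls}
    return [1 if name in present else 0 for name in vocabulary]


# Hypothetical import lists for two samples (not taken from the dataset).
samples = [
    ["KERNEL32.dll", "USER32.dll"],
    ["kernel32.dll", "WS2_32.dll", "ADVAPI32.dll"],
]

vocabulary = build_vocabulary(samples)
vectors = [to_binary_vector(dlls, vocabulary) for dlls in samples]

print(vocabulary)
for vector in vectors:
    print(vector)
```

DLL names are case-folded because Windows treats them case-insensitively, so `KERNEL32.dll` and `kernel32.dll` map to the same column.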
IJERT-Integrating Machine Learning in Malware Detection
International Journal of Engineering Research and Technology (IJERT), 2021
https://www.ijert.org/integrating-machine-learning-in-malware-detection https://www.ijert.org/research/integrating-machine-learning-in-malware-detection-IJERTV10IS080016.pdf
Malware has become one of the biggest cyber threats today with the rapid growth of the Internet. Malware can be defined as any programme that performs malicious acts, including data theft, espionage, etc. In a world of growing technology, protection should grow at the same pace. Machine learning has played a significant role in operating systems over the years. Cybersecurity can use machine learning to improve organisations' malware detection, triage, breach recognition and security alerting. Machine learning will significantly change the cybersecurity climate, and new techniques such as machine learning must be used to address the rising malware problem. This paper aims to research how machine learning can be used for cybersecurity and how it can be used to detect malware. We will look at the PE (portable executable) headers of malware and non-malware samples and create a classifier that can detect whether or not malware is present.
Multi-feature Dataset for Windows PE Malware Classification
Cornell University - arXiv, 2022
This paper describes a multi-feature dataset for training machine learning classifiers to detect malicious Windows Portable Executable (PE) files. The dataset includes four feature sets from 18,551 binary samples belonging to five malware families: Spyware, Ransomware, Downloader, Backdoor and Generic Malware. The feature sets include the list of DLLs and their functions, and the values of different fields of the PE header and sections. First, we explain the data collection and creation phase, and then we explain how we labeled the samples using VirusTotal's services. Finally, we explore the dataset to describe how it can benefit researchers in static malware analysis. The dataset is made public in the hope that it will help inspire machine learning research for malware detection.
Detecting Malware in Portable Executable Files using Machine Learning Approach
International Journal of Network Security & Its Applications
Many solutions have been proposed to improve the detection of malware in executable files in general and in Portable Executable files in particular. In this paper, we rely on the PE header structure of Portable Executable files to propose another approach that uses machine learning to classify these files as malware or benign. The proposed approach still uses the Random Forest algorithm for the classification problem, but experimental results show that accuracy and execution time are improved compared to some recent publications (accuracy reaches 99.71%).
Enhancing cyber security by predicting malwares using supervised machine learning models
International Journal of Computing and Artificial Intelligence, 2021
Malware poses a severe threat to computer systems and networks. Quick and accurate detection of malware is crucial to mitigating its detrimental impacts. This study aimed to develop a machine learning model to accurately classify whether a Portable Executable (P.E.) file is malware or benign. Supervised classification algorithms like Random Forest, K-Nearest Neighbors (KNN), Support Vector Classifier (SVC), Decision Tree, Multinomial Naïve Bayes, and Logistic Regression were trained on a dataset of 10,868 PE files. Each file had extracted static features like file headers, entropy, string literals, metadata, etc. The algorithms were evaluated using accuracy, precision, recall, and F1 scores. Random Forest performed the best with 99% accuracy, 0.99 precision, 1.00 recall, and a 0.99 F1 score. The features were ranked by importance, with the top ones providing the most discriminatory power. The finalized Random Forest model was saved for operationalization to classify unknown P.E. files automatically. In conclusion, machine learning, especially ensemble tree-based methods, proves highly efficacious for malware detection with the proper feature engineering of file content and characteristics. The model has promising capabilities as an anti-malware system to identify and nullify malware attacks proactively. Further research can focus on generalizability testing across different file types and integration with antivirus solutions.
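Entropy is listed among the extracted static features. A minimal sketch of one usual definition, the Shannon entropy of the byte distribution in bits per byte, is shown below; the function name is an assumption, and the study may compute entropy differently (for example per section rather than per file).

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # iterating bytes yields ints 0..255
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative inputs: constant bytes, two symbols, and all 256 byte values.
low = shannon_entropy(b"\x00" * 64)
two_symbols = shannon_entropy(b"ab" * 8)
high = shannon_entropy(bytes(range(256)))

print(low, two_symbols, high)
```

Values close to 8 bits per byte are typical of compressed or encrypted content, which is why entropy is often used as a hint that an executable may be packed; on its own it does not indicate maliciousness.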