A Novel Feature Representation for Malware Classification

Deep Learning at the Shallow End: Malware Classification for Non-Domain Experts

Digital Investigation, 2018

Current malware detection and classification approaches generally rely on time-consuming and knowledge-intensive processes to extract patterns (signatures) and behaviors from malware, which are then used for identification. Moreover, these signatures are often limited to local, contiguous sequences within the data whilst ignoring their context in relation to each other and throughout the malware file as a whole. We present a Deep Learning-based malware classification approach that requires no expert domain knowledge and is based on a purely data-driven approach for complex pattern and feature identification.

Convolutional neural networks for malware classification

2016

According to AV vendors, malicious software has been growing exponentially in recent years. One of the main reasons for these high volumes is that, in order to evade detection, malware authors started using polymorphic and metamorphic techniques. As a result, traditional signature-based approaches to detecting malware have become insufficient against new malware, and the categorization of malware samples has become essential to understand the basis of malware behavior and to fight back against cybercriminals. During the last decade, solutions that fight against malicious software have begun using machine learning approaches. Unfortunately, there are few open-source datasets available to the academic community. One of the biggest datasets available was released last year in a competition hosted on Kaggle with data provided by Microsoft for the Big Data Innovators Gathering (BIG 2015). This thesis presents two novel and scalable approaches using Convolutional Neural Networks (CNNs) to assign malware to it...

Activation Analysis of a Byte-Based Deep Neural Network for Malware Classification

Feature engineering is one of the most costly aspects of developing effective machine learning models, and that cost is even greater in specialized problem domains, like malware classification, where expert skills are necessary to identify useful features. Recent work, however, has shown that deep learning models can be used to automatically learn feature representations directly from the raw, unstructured bytes of the binaries themselves. In this paper, we explore what these models are learning about malware. To do so, we examine the learned features at multiple levels of resolution, from individual byte embeddings to end-to-end analysis of the model. At each step, we connect these byte-oriented activations to their original semantics through parsing and disassembly of the binary to arrive at human-understandable features. Through our results, we identify several interesting features learned by the model and their connection to manually-derived features typically used by traditional machine learning models. Additionally, we explore the impact of training data volume and regularization on the quality of the learned features and the efficacy of the classifiers, revealing the somewhat paradoxical insight that better generalization does not necessarily result in better performance for byte-based malware classifiers.

Malware Classification with Improved Convolutional Neural Network Model

International Journal of Computer Network and Information Security

Malware is a threat to people in the cyber world. It steals personal information and harms computer systems. Various developers and information security specialists around the globe continuously work on strategies for detecting malware. Over the last few years, machine learning has been investigated by many researchers for malware classification. The existing solutions require more computing resources and are not efficient for datasets with large numbers of samples. Using existing feature extractors for extracting features of images consumes more resources. This paper presents a Convolutional Neural Network model with pre-processing and augmentation techniques for the classification of malware gray-scale images. An investigation is conducted on the Malimg dataset, which contains 9339 gray-scale images. The dataset was created from malware binaries belonging to 25 different families. To create a precise approach and considering the success of deep learning techniques for the classificat...

Semantic malware classification using convolutional neural networks

This paper addresses malware classification into families using static analysis and a convolutional neural network through raw bytes. Previous research indicates that machine learning is an interesting approach to malware classification. The neural network used was based on the proposed MalConv, a convolutional neural network used for malware classification by training the network with the whole binary. Minor modifications were made to get better results and apply them to a multi-classification problem. Four models were trained with data extracted from Portable Executable malware samples labeled into nine families. These data were extracted in two ways: according to the semantic variation of bytes and using the entire file. The trained models were used for testing to check generality. The results from these four proposed models were compared and analyzed against models trained according to similar research. We concluded that the header is the most important part of a PE for malware i...

Recent Innovations and Comparison of Deep Learning Techniques in Malware Classification : A Review

2021

The internet has made an individual's life easier and more productive, but there are associated threats linked to the internet and devices. Malware has been considered the most severe threat to the digital world for decades, and the identification and classification of malware variants is a vital and critical research problem. Malware is invasive malicious code that accesses devices, information, and services without the permission or knowledge of the user. Researchers, analysts, and antivirus companies are incessantly inventing and implementing new strategies to fight back against malware and its variants. In the last decade, one strategy used extensively in the field of malware detection and classification has been deep learning with malware visualization. Results revealed that, using visualization, malware can be identified and classified more promptly, efficiently, and accurately. Deep learning algorithms vary according to applications, architecture, and uses, so it is required to...

Using convolutional neural networks for classification of malware represented as images

Journal of Computer Virology and Hacking Techniques, 2018

Malware authors have introduced obfuscation techniques to existing malware in order to evade detection and hide its purposes. As a result, the number of malicious programs has grown in both volume and sophistication. Thus, effective categorization of malware based on its characteristics and behavior is required. In this paper, malicious software is visualized as gray scale images, since their ability to capture minor changes while retaining the global structure helps to detect variations. Motivated by the visual similarity between malware samples of the same family, we propose a file-agnostic deep learning approach for malware categorization to efficiently group malicious software into families based on a set of discriminant patterns extracted from their visualization as images. The suitability of our approach is evaluated against two benchmarks: the MalImg dataset and the BigData Innovators Gathering. Experimental comparison demonstrates its superior performance with respect to state-of-the-art techniques.
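As a rough illustration of the visualization step described in this abstract, the sketch below maps each byte of a file to one pixel intensity (0 to 255) and wraps the byte stream into rows of a fixed width. The function name, the default width of 16, and the zero padding of the last row are assumptions made for illustration; they are not details taken from the paper.

```python
def bytes_to_grayscale(data: bytes, width: int = 16) -> list[list[int]]:
    """Arrange raw bytes as rows of pixel intensities (one byte per pixel).

    Assumption: the final row is padded with zeros (black pixels) so that
    every row has the same width.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])   # each byte is an int 0..255
        row.extend([0] * (width - len(row)))    # pad the last, shorter row
        rows.append(row)
    return rows


# Usage: five bytes wrapped at width 4 give two rows.
image = bytes_to_grayscale(b"MZ\x90\x00\x03", width=4)
print(image)  # [[77, 90, 144, 0], [3, 0, 0, 0]]
```

Because neighbouring bytes stay neighbouring pixels within a row, small local changes in a binary alter only a few pixels while the overall layout of the image is preserved.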

Malware Detection using Deep Learning

International Journal of Modern Agriculture, 2021

Malicious software, or malware, continues to pose a major security concern in this digital age as computer users, corporations, and governments witness an exponential growth in malware attacks. Current malware detection solutions adopt static and dynamic analysis of malware signatures and behaviour patterns, which is time-consuming and ineffective in identifying unknown malware. Recent malware uses polymorphic, metamorphic, and other evasive techniques to change its behaviour quickly and to generate large numbers of variants. Since new malware samples are predominantly variants of existing malware, machine learning algorithms (MLAs) have recently been employed to conduct effective malware analysis. This requires extensive feature engineering, feature learning, and feature representation. By using advanced MLAs such as deep learning, the feature engineering phase can be completely avoided. Though some recent research studies exist in this direction, the performance of the algorithms is biased by the training data. There is a need to mitigate bias and evaluate these methods independently in order to arrive at new enhanced methods for effective zero-day malware detection. To fill this gap in the literature, this work evaluates classical MLAs and deep learning architectures for malware detection, classification, and categorization with both public and private datasets. The train and test splits of the public and private datasets used in the experimental analysis are disjoint from each other and were collected over different timescales. In addition, we propose a novel image processing technique with optimal parameters for MLAs and deep learning architectures. A comprehensive experimental evaluation of these methods indicates that deep learning architectures outperform classical MLAs. Overall, this work proposes an effective visual detection of malware using a scalable and hybrid deep learning framework for real-time deployments. The visualization and deep learning architectures for a static, dynamic, and image processing-based hybrid approach in a big data environment constitute a new enhanced method for effective zero-day malware detection.

Classification of Malicious Code Variants using Deep Learning

International Journal For Research In Applied Science & Engineering Technology, 2020

Malware attacks are increasing exponentially with usage of the internet. The first step towards safeguarding against malware attacks is distinguishing malware files from benign ones and classifying them into known classes. Classification of malware is helpful for analysts, as it gives them better insight into the functioning of the malware. This paper proposes a system for classifying malware variants into their families using a Convolutional Neural Network along with a Spatial Pyramid Pooling layer. This system involves the visualization of malware binary files and uses the texture-based similarity in the images of the same families for classification. The Convolutional Neural Network along with the Spatial Pyramid Pooling layer allows the use of multi-scale images, which improved the classification accuracy.
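Spatial pyramid pooling is what lets a network accept images of different sizes: a feature map is pooled over a fixed set of grids, so the output length depends only on the grid sizes and not on the input dimensions. The following is a minimal pure-Python sketch using max pooling on a single 2D feature map. The function name, the pyramid levels (1, 2, 4), and the bin-boundary rule are illustrative assumptions, not the configuration used in the paper.

```python
def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a 2D feature map over n x n grids for each n in levels.

    Returns a flat list of length sum(n * n for n in levels), regardless
    of the height and width of fmap (assumed non-empty and rectangular).
    """
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin i covers rows floor(i*h/n) .. ceil((i+1)*h/n) - 1,
            # which is never empty, even when n is larger than h.
            r0, r1 = (i * h) // n, ((i + 1) * h + n - 1) // n
            for j in range(n):
                c0, c1 = (j * w) // n, ((j + 1) * w + n - 1) // n
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Usage: two maps of different sizes give vectors of the same length.
map_a = [[r * 5 + c for c in range(5)] for r in range(3)]   # 3 x 5
map_b = [[r + c for c in range(4)] for r in range(7)]       # 7 x 4
print(len(spatial_pyramid_pool(map_a)), len(spatial_pyramid_pool(map_b)))  # 21 21
```

With levels (1, 2, 4) the output always has 1 + 4 + 16 = 21 values per feature map, which is why a fully connected classifier can follow the pooling layer even when the input images differ in size.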