Discriminant malware distance learning on structural information for automated malware classification

Malware classification based on call graph clustering

Journal in Computer Virology, 2011

Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away and enable the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs with one another, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DBSCAN. Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes.
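As an illustration of the clustering step described in this abstract, the following minimal sketch runs a simple alternating k-medoids on a precomputed pairwise distance matrix. It assumes the distances have already been obtained (in the paper, from approximate graph edit distance); the toy matrix, the function names, and the choice of initial medoids are all hypothetical, and the alternating update shown here is not necessarily the k-medoids variant the authors used.

```python
def assign(dist, medoids):
    """Label each sample with the index of its closest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternating k-medoids on a symmetric, precomputed distance matrix."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the previous medoid if a cluster happens to be empty.
                new_medoids.append(old)
                continue
            # The new medoid is the member with the smallest total
            # distance to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Hypothetical pairwise distances between five call graphs:
# samples 0-2 resemble each other, as do samples 3-4.
toy_dist = [
    [0.00, 0.10, 0.20, 0.90, 0.80],
    [0.10, 0.00, 0.15, 0.85, 0.90],
    [0.20, 0.15, 0.00, 0.80, 0.95],
    [0.90, 0.85, 0.80, 0.00, 0.10],
    [0.80, 0.90, 0.95, 0.10, 0.00],
]

medoids, labels = k_medoids(toy_dist, initial_medoids=[0, 3])
print(medoids, labels)
```

Working on a distance matrix rather than on feature vectors is what makes medoid-based clustering a natural fit here: a call graph has no obvious "mean", but one sample can always be picked as the representative of its cluster.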

MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

ArXiv, 2021

Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. In order to provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3× larger than any prior expert-labeled corpus and 36× larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in connecting opaque malware samples to human-readable descriptions. This enables important evaluations that ...

MalClassifier: Malware family classification using network flow sequence behaviour

2018 APWG Symposium on Electronic Crime Research (eCrime), 2018

Anti-malware vendors receive thousands of potentially malicious binaries daily to analyse and categorise before deploying the appropriate defence measure. Considering the limitations of existing malware analysis and classification methods, we present MalClassifier, a novel privacy-preserving system for the automatic analysis and classification of malware using network flow sequence mining. MalClassifier allows identifying the malware family behind detected malicious network activity without requiring access to the infected host or malicious executable, reducing overall response time. MalClassifier abstracts the malware families' network flow sequence order and semantics behaviour as an n-flow. By mining and extracting the distinctive n-flows for each malware family, it automatically generates network flow sequence behaviour profiles. These profiles are used as features to build supervised machine learning classifiers (K-Nearest Neighbour and Random Forest) for malware family classification. We compute the degree of similarity between a flow sequence and the extracted profiles using a novel fuzzy similarity measure that computes the similarity between flow attributes and the similarity between the order of the flow sequences. For classifier performance evaluation, we use network traffic datasets of ransomware and botnets, obtaining 96% F-measure for family classification. MalClassifier is resilient to malware evasion through flow sequence manipulation, maintaining the classifier's high accuracy. Our results demonstrate that this type of network flow-level sequence analysis is highly effective in malware family classification, providing insights on recurring malware network flow patterns.
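The idea of mining distinctive n-flows per family can be sketched roughly as follows. This is an illustration under simplifying assumptions, not MalClassifier's actual procedure: each flow is reduced to a single hypothetical protocol token, an n-flow is taken to be n consecutive flows, and "distinctive" is taken to mean "seen at least `min_count` times in one family and never in the others". The paper's fuzzy similarity measure is not reproduced.

```python
from collections import Counter


def n_flows(sequence, n):
    """Return all runs of n consecutive flows in a flow sequence."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def distinctive_n_flows(traces_by_family, n, min_count=2):
    """Keep n-flows that are frequent in one family and absent from the rest."""
    counts = {
        family: Counter(g for trace in traces for g in n_flows(trace, n))
        for family, traces in traces_by_family.items()
    }
    profiles = {}
    for family, counter in counts.items():
        seen_elsewhere = set()
        for other, other_counter in counts.items():
            if other != family:
                seen_elsewhere.update(other_counter)
        profiles[family] = {
            g for g, k in counter.items()
            if k >= min_count and g not in seen_elsewhere
        }
    return profiles


# Hypothetical flow traces; the tokens stand in for richer flow attributes.
toy_traces = {
    "family_a": [["DNS", "HTTP", "HTTP", "TLS"], ["DNS", "HTTP", "TLS"]],
    "family_b": [["DNS", "SMB", "SMB"], ["DNS", "SMB", "SMB", "HTTP"]],
}

profiles = distinctive_n_flows(toy_traces, n=2)
print(profiles)
```

In a fuller pipeline, the resulting profiles would be turned into feature vectors for a supervised classifier such as the K-Nearest Neighbour or Random Forest models named in the abstract.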

Selecting Prominent API Calls and Labeling Malicious Samples for Effective Malware Family Classification

IJCSIS Vol 17 No 5 May Issue, 2019

Today's threats have become very complex and serious in their packing and encryption techniques. Every day, new malware variants increase in both quantity and quality by using packing and encryption techniques. The challenge in this research field is that traditional malware detection systems sometimes fail to detect new malware variants and produce false alarms. Malicious software in the form of viruses, worms, trojans, ransomware, and spyware harms our computer systems, network environments, and organizations in various ways. Therefore, malware analysis for detection and family classification plays a significant role in Cyber Crime Incident Handling Systems. This system contributes malware family classification with 10 prominent features obtained by conducting a feature selection process. It also contributes a process for labeling malicious samples using regular expressions. The proposed malware classification system distinguishes 7 different families, including malware and benign, using machine learning classifiers. Our experimental findings show that the selected 10 API features provide the best evaluation metrics in terms of accuracy, precision-recall, and ROC scores.

Mal-Netminer: Malware Classification Approach Based on Social Network Analysis of System Call Graph

Mathematical Problems in Engineering, 2015

As the security landscape evolves over time, where thousands of species of malicious code are seen every day, antivirus vendors strive to detect and classify malware families for efficient and effective responses against malware campaigns. To enrich this effort and by capitalizing on ideas from the social network analysis domain, we build a tool that can help classify malware families using features derived from the graph structure of their system calls. To achieve that, we first construct a system call graph that consists of system calls found in the execution of the individual malware families. To explore distinguishing features of various malware species, we study social network properties as applied to the call graph, including the degree distribution, degree centrality, average distance, clustering coefficient, network density, and component ratio. We utilize features derived from those properties to build a classifier for malware families. Our experimental results show that “in...
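A few of the graph properties listed in this abstract can be computed with the standard definitions, as in the sketch below. It assumes an undirected, simple graph stored as adjacency sets; the actual system call graph in the paper may be directed or weighted, and the call names and edges here are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency-set graph from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Fraction of possible edges that are present: 2m / (n(n-1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree n-1."""
    n = len(adj)
    return {u: (len(nbrs) / (n - 1) if n > 1 else 0.0)
            for u, nbrs in adj.items()}


def local_clustering(adj, u):
    """Fraction of pairs of u's neighbours that are themselves connected."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    return sum(local_clustering(adj, u) for u in adj) / len(adj)


# Hypothetical edges between system calls observed in one execution trace.
toy_edges = [("open", "read"), ("read", "write"),
             ("open", "write"), ("write", "close")]
graph = build_graph(toy_edges)

features = {
    "density": density(graph),
    "max_degree_centrality": max(degree_centrality(graph).values()),
    "avg_clustering": average_clustering(graph),
}
print(features)
```

Collapsing each graph into a short vector of such scalar properties is one way to obtain fixed-length features for a classifier, regardless of how many nodes the underlying graph has.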

A Comparative Study of Malware Family Classification

Lecture Notes in Computer Science, 2012

In this paper, we present a comparative study of conventional malware family classification techniques and identify their limitations. In our study, we investigate three different feature sets: function length frequency and printable string information as static features, and Application Programming Interface (API) calls and API parameters as dynamic features. In our classification process, we used some well-known machine-learning algorithms by invoking WEKA libraries. We made a comparative analysis and conclude that the independent features are not good enough to defend against current as well as future malware.

Classification of Malware Based on String and Function Feature Selection

2010

Anti-malware software producers are continually challenged to identify and counter new malware as it is released into the wild. A dramatic increase in malware production in recent years has rendered the conventional method of manually determining a signature for each new malware sample untenable. This paper presents a scalable, automated approach for detecting and classifying malware by using pattern recognition algorithms and statistical methods at various stages of the malware analysis life cycle. Our framework combines the static features of function length and printable string information extracted from malware samples into a single test which gives classification results better than those achieved by using either feature individually. In our testing we input feature information from close to 1400 unpacked malware samples to a number of different classification algorithms. Using k-fold cross validation on the malware, which includes Trojans and viruses, along with 151 clean files, we achieve an overall classification accuracy of over 98%.

Structural classification and similarity measurement of malware

IEEJ Transactions on Electrical and Electronic Engineering, 2014

This paper proposes a new lightweight method that utilizes the growing hierarchical self-organizing map (GHSOM) for malware detection and structural classification. It also shows a new method for measuring the structural similarity between classes. A dynamic link library (DLL) file is an executable file used in the Windows operating system that allows applications to share code and other resources to perform particular tasks. In this paper, we classify different malware by data mining the DLL files used by the malware. Since the malware families are evolving quickly, they present many new problems, such as how to link them to other existing malware families. The experiment shows that our GHSOM-based structural classification can solve these issues and generate a malware classification tree according to the similarity of malware families.

Robust Malware Family Classification Using Effective Features and Classifiers

Applied Sciences

Malware development has significantly increased recently, posing a serious security risk to both consumers and businesses. Malware developers continually find new ways to circumvent security research’s ongoing efforts to guard against malware attacks. Malware Classification (MC) entails assigning a malware class to a specific sample, while malware detection merely entails finding malware without identifying which kind of malware it is. There are two main reasons why the most popular MC techniques have a low classification rate. First, finding and developing accurate features requires highly specialized domain expertise. Second, data imbalance makes it challenging to classify and correctly identify malware. Furthermore, the proposed malware classification (MC) method consists of the following five steps: (i) Dataset preparation: 2D malware images are created from the malware binary files; (ii) Visualized Malware Pre-processing: the visual malware images need to be scaled to ...

Malware classification using static analysis based features

2017 IEEE Symposium Series on Computational Intelligence (SSCI), 2017

Anti-virus vendors receive hundreds of thousands of malware samples to be analysed each day. Some are new malware while others are variations or evolutions of existing malware. Because analysing each malware sample by hand is impossible, automated techniques to analyse and categorize incoming samples are needed. In this work, we explore various machine learning features extracted from malware samples through static analysis for classification of malware binaries into already known malware families. We present a new feature based on control statement shingling that has a comparable accuracy to ordinary opcode n-gram based features while requiring smaller dimensions. This, in turn, results in a shorter training time.
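For context, the opcode n-gram baseline that this abstract compares against can be sketched as below. This is only the generic n-gram idea, not the control statement shingling feature, whose construction the abstract does not describe. The opcode sequences are hypothetical and would in practice come from a disassembler.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n=2):
    """Count each run of n consecutive opcodes in a disassembled sample."""
    return Counter(tuple(opcodes[i:i + n])
                   for i in range(len(opcodes) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical opcode sequences: a sample, a close variant, and an
# unrelated program.
sample = ["push", "mov", "call", "test", "jz", "ret"]
variant = ["push", "mov", "call", "test", "jnz", "ret"]
unrelated = ["xor", "xor", "add", "loop", "ret"]

sim_variant = cosine(opcode_ngrams(sample), opcode_ngrams(variant))
sim_unrelated = cosine(opcode_ngrams(sample), opcode_ngrams(unrelated))
print(sim_variant, sim_unrelated)
```

The number of distinct n-grams grows quickly with n and with the size of the opcode vocabulary, which is the dimensionality cost that a more compact feature aims to reduce.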