Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Information Security

The Curious Case of Machine Learning in Malware Detection

Proceedings of the 5th International Conference on Information Systems Security and Privacy, 2019

In this paper, we argue that detecting malware attacks in the wild is a unique challenge for machine learning techniques. Given the current trend in malware development and the increase of unconventional malware attacks, we expect that dynamic malware analysis is the future for antimalware detection and prevention systems. A comprehensive review of machine learning for malware detection is presented. Then, we discuss how malware detection in the wild presents unique challenges for the current state-of-the-art machine learning techniques. We define three critical problems that limit the success of malware detectors powered by machine learning in the wild. Next, we discuss possible solutions to these challenges and present the requirements of next-generation malware detection. Finally, we outline potential research directions in machine learning for malware detection.

IJERT-Integrating Machine Learning in Malware Detection

International Journal of Engineering Research and Technology (IJERT), 2021

https://www.ijert.org/integrating-machine-learning-in-malware-detection https://www.ijert.org/research/integrating-machine-learning-in-malware-detection-IJERTV10IS080016.pdf

Malware has become one of the biggest cyber threats today with the rapid growth of the Internet. Malware can be defined as any programme that performs malicious acts, including data theft, espionage, etc. In a world of growing technology, protection should grow at the same pace. Machine learning has played a significant role in operating systems over the years. Cybersecurity teams can use machine learning to improve organisations' malware detection, triage, breach recognition and security alerting. Machine learning will significantly change the cyber security climate. New techniques such as machine learning must be used to solve the rising malware problem. This paper aims to research how machine learning can be used for cybersecurity and how it can be used to detect malware. We will look at the PE (portable executable) headers of malware and non-malware samples and create a classifier that can detect whether or not malware is present.
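As a rough illustration of what "looking at PE headers" can mean in practice, the sketch below reads a few fields of the COFF file header using only the Python standard library. It is not the paper's feature extractor: the function names, the choice of fields, and the synthetic test file built by `build_minimal_pe` are all assumptions made for this example. The offsets follow the published PE layout (DOS `MZ` signature, `e_lfanew` at offset 0x3C, `PE\0\0` signature, then a 20-byte COFF header).

```python
import struct


def parse_coff_header(data: bytes) -> dict:
    """Extract a few COFF file-header fields from the raw bytes of a PE file."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # e_lfanew (offset 0x3C in the DOS header) points at the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header immediately follows the 4-byte signature.
    (machine, n_sections, timestamp, _sym_ptr, _n_syms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "time_date_stamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
    }


def build_minimal_pe(machine: int = 0x8664, n_sections: int = 3) -> bytes:
    """Hypothetical helper: build a header-only byte string for demonstration."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature starts right after the DOS header
    coff = struct.pack("<HHIIIHH", machine, n_sections, 0, 0, 0, 0xF0, 0x0022)
    return bytes(dos) + b"PE\x00\x00" + coff
```

The resulting dictionary could be turned into a numeric feature vector for any classifier; which fields are actually informative is an empirical question the sketch does not address.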

A Survey of Machine Learning Methods and Challenges for Windows Malware Classification

ArXiv, 2020

Malware classification is a difficult problem, to which machine learning methods have been applied for decades. Yet progress has often been slow, in part due to a number of unique difficulties with the task that occur through all stages of developing a machine learning system: data collection, labeling, feature creation and selection, model selection, and evaluation. In this survey we will review a number of the current methods and challenges related to malware classification, including data collection, feature extraction, model construction, and evaluation. Our discussion will include thoughts on the constraints that must be considered for machine learning based solutions in this domain, and yet-to-be-tackled problems for which machine learning could also provide a solution. This survey aims to be useful both to cybersecurity practitioners who wish to learn more about how machine learning can be applied to the malware problem, and to give data scientists the necessary backg...

Towards an Automated Pipeline for Detecting and Classifying Malware through Machine Learning

arXiv (Cornell University), 2021

The constant growth in the number of malware samples (software or code fragments potentially harmful to computers and information networks) and the use of sophisticated evasion and obfuscation techniques have seriously hindered classic signature-based approaches. On the other hand, malware detection systems based on machine learning techniques have started offering a promising alternative to standard approaches, drastically reducing analysis time and turning out to be more robust against evasion and obfuscation techniques. In this paper, we propose a malware taxonomic classification pipeline able to classify Windows Portable Executable files (PEs). Given an input PE sample, it is first classified as either malicious or benign. If malicious, the pipeline further analyzes it in order to establish its threat type, family, and behavior(s). We tested the proposed pipeline on the open source dataset EMBER, containing approximately 1 million PE samples, analyzed through static analysis. The malware detection results obtained are comparable to other academic works in the current state of the art and, in addition, we provide an in-depth classification of malicious samples. Models used in the pipeline provide interpretable results which can help security analysts better understand decisions taken by the automated pipeline.
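The two-stage flow described in this abstract (detect first, then characterise only the malicious samples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Verdict` type, the `classify` function, and the toy stand-in models in the usage note are hypothetical, and the behaviour stage is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Features = Sequence[float]


@dataclass
class Verdict:
    malicious: bool
    threat_type: Optional[str] = None
    family: Optional[str] = None


def classify(
    features: Features,
    is_malicious: Callable[[Features], bool],
    threat_type_of: Callable[[Features], str],
    family_of: Callable[[Features], str],
) -> Verdict:
    """Stage 1 decides malicious vs benign; stage 2 runs only on malicious samples."""
    if not is_malicious(features):
        return Verdict(malicious=False)
    return Verdict(
        malicious=True,
        threat_type=threat_type_of(features),
        family=family_of(features),
    )
```

With toy stand-ins such as `is_malicious=lambda f: f[0] > 0.5`, a benign sample short-circuits after the first stage, so the more specific models are never asked to label files that were not flagged.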

Machine learning in computer security

The past few years have witnessed a rise in the use of AI and Machine Learning techniques in a variety of application areas, such as image understanding and autonomous vehicle driving. Wireless and cloud technologies have also made it possible for millions of people to access and use services available via the internet. During the same period, the world has also witnessed a rise in cyber-crime, with criminals continually expanding their methods of attack. Weapons like ransomware, botnets, and attack vectors became popular forms of malware attacks. This paper examines the state-of-the-art in computer security and the use of machine learning techniques therein. True, machine learning did make an impact on some narrow application areas such as spam filtering and fraud detection. However, in spite of extensive academic research, it did not seem to make a visible impact on the problem of intrusion detection in real operational settings. A possible reason for this apparent failure is that computer security is inherently a difficult problem. It is difficult because it is not just one problem; it is a group of problems characterized by a diversity of operational settings and a multitude of attack scenarios. This is one reason why machine learning has not yet found its niche in the cyber warfare armory. This paper first summarizes the state-of-the-art in computer security and then examines the process of applying machine learning to solve a sample problem.

Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware Detection

2020

There is a lack of scientific testing of commercially available malware detectors, especially those that boast accurate classification of never-before-seen (zero-day) files using machine learning (ML). The result is that the efficacy and trade-offs among the different available approaches are opaque. In this paper, we address this gap in the scientific literature with an evaluation of commercially available malware detection tools. We tested each tool against 3,536 total files (2,554 malicious, 72%; 982 benign, 28%) including over 400 zero-day malware, and tested with a variety of file types and protocols for delivery. Specifically, we investigate three questions: Do ML-based malware detectors provide better detection than signature-based detectors? Is it worth purchasing a network-level malware detector to complement host-based detection? What is the trade-off in detection time and detection accuracy among commercially available tools using static and dynamic analysis? We present sta...

Microsoft Malware Classification Challenge

2018

The Microsoft Malware Classification Challenge was announced in 2015 along with the publication of a huge dataset of nearly 0.5 terabytes, consisting of disassembly and bytecode of more than 20K malware samples. Apart from serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, the dataset has been cited in more than 50 research papers. Here we provide a high-level comparison of the publications citing the dataset. The comparison simplifies finding potential research directions in this field and future performance evaluation on the dataset.

Towards an Understanding of the Misclassification Rates of Machine Learning-based Malware Detection Systems

Proceedings of the 3rd International Conference on Information Systems Security and Privacy, 2017

A number of machine learning based malware detection systems have been suggested to replace signature based detection methods. These systems have shown that they can provide a high detection rate when recognising previously unseen malware samples. However, in systems based on behavioural features, some new malware can go undetected as a result of changes in behaviour compared to the training data. In this paper we analysed misclassified malware instances and investigated whether there were recognisable patterns across these misclassifications. Several questions needed to be answered: Can we claim that malware changes over time directly affect the detection rate? Do changes that affect classification occur in malware at the level of families, where all instances that belong to certain families are hard to detect? Alternatively, can such changes be traced back to certain malware variants instead of families? Our experiments showed that these changes are mostly due to behavioural changes at the level of variants across malware families, where variants did not behave as expected. This can be due to the adoption of anti-virtualisation techniques, to the fact that these variants were looking for a specific argument to be activated, or to the fact that these variants were actually corrupted.

IRJET- Using Static and Dynamic Malware Features to Perform Malware Ascription

IRJET, 2021

Machine learning is amongst the most celebrated research avenues today and is emerging as the harbinger of advancements in every field. It is receiving growing attention in the area of privacy and security for building robust systems. Malware ascription is a relatively unexplored area, and it is rather difficult to attribute malware and detect authorship. Our work focuses on leveraging machine learning models for malware detection by determining the relation between the training dataset and the output achieved. To this end, we develop three different datasets that include pure malware data, non-malware data, and obscure malware data. We present three different scenarios to train the model and test its effectiveness, moving from a more simulated scenario to a more realistic one. In our model, we apply temporal-based methodologies to train and validate the classifier. Further, we study how much we can reduce the training dataset without compromising the optimal results. Upon applying a multi-layer approach, we improved our base model by 20%. Our reports are extremely useful in malware ascription.
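The abstract does not say what its "temporal-based methodologies" consist of. One common reading, assumed here purely for illustration, is a time-ordered split: train on samples first seen before a cutoff date and validate only on later ones, so the classifier is never evaluated on data older than what it was trained on. The `Sample` type, its fields, and `temporal_split` are hypothetical names for this sketch.

```python
from datetime import date
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    name: str
    first_seen: date
    label: int  # 1 = malware, 0 = benign


def temporal_split(
    samples: List[Sample], cutoff: date
) -> Tuple[List[Sample], List[Sample]]:
    """Train on samples first seen before the cutoff, validate on the rest."""
    train = [s for s in samples if s.first_seen < cutoff]
    validate = [s for s in samples if s.first_seen >= cutoff]
    return train, validate
```

Compared with a random split, this arrangement tends to give a less optimistic estimate, because newer samples may behave differently from anything in the training window.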

MACHINE LEARNING FOR CYBERSECURITY: IMPLEMENTATION OF MALWARE DETECTION USING P.E FILE, N-GRAMS AND DEEP LEARNING ON EXECUTABLES

Cyber threats and attacks are increasingly pervasive and evolve constantly and rapidly. The battle between malware developers and anti-malware developers seems nowhere close to its end. The Covid-19 pandemic also increased reliance on the internet, computers and electronic systems for remote and virtual communication, which sharply increased demand and created an opportunity for cybercriminals, leading to a significant increase in malware attacks. Modern malware uses various sophisticated and novel means to gain access to and attack computer systems without detection. As a result, classical ways of detecting malware, such as signature-based detection and AV scanning, have proven obsolete and no longer efficient. Dealing with malware is therefore one of the major challenges in computer science, engineering and technology development. Stakeholders such as law enforcement agents, business entities and general users have suffered severe losses. The significance of data science in cybersecurity cannot be overstated when it comes to predicting possible future threats for the mitigation and prevention of malware attacks. Several applications of data science, machine learning and artificial intelligence to the prevention and control of malware have been proposed in academia and implemented in the real world. Our work in this paper focuses on a literature survey and an exploratory implementation of machine learning for malware detection, leveraging the application of data science to cybersecurity. Public data sets have been used for experiments, and an overview is given of ML techniques and algorithms and the situations in which they are recommended. The shortfalls in the use of data science in cybersecurity are also explored.
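The title of this entry mentions n-grams on executables. A minimal sketch of that idea, assuming byte-level n-grams counted over the raw file contents (the paper's exact feature definition is not given in the abstract), could look like this. The function name `byte_ngrams` and the default `n=2` are choices made for the example.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping run of n consecutive bytes in data."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Inputs shorter than n yield an empty range and therefore an empty Counter.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))
```

For instance, `byte_ngrams(b"\x90\x90\x90\xc3")` counts the bigram `b"\x90\x90"` twice and `b"\x90\xc3"` once. In practice the counts for a fixed vocabulary of frequent n-grams would be used as a feature vector for a classifier.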