ReGVD: Revisiting Graph Neural Networks for Vulnerability Detection

Code Aggregate Graph: Effective Representation for Graph Neural Networks to Detect Vulnerable Code

IEEE Access

Deep learning, especially graph neural networks (GNNs), provides efficient, fast, and automated methods to detect vulnerable code. However, the accuracy could be improved, as previous studies were limited by existing code representations. Additionally, the diversity of embedding techniques and GNN models can make selecting the appropriate method challenging. Herein we propose the Code Aggregate Graph (CAG) to improve vulnerability detection efficiency. CAG combines the principles of different code analyses such as the abstract syntax tree, control flow graph, and program dependence graph with dominator and post-dominator trees. This extensive representation empowers deep graph networks for enhanced classification. We also implement different data encoding methods and neural networks to provide a multidimensional view of the system performance. Specifically, three word embedding approaches and three deep GNNs are utilized to build classifiers. CAG is then evaluated using two datasets: a real-world open-source dataset and the Software Assurance Reference Dataset (SARD). CAG is also compared with seven state-of-the-art methods and six classic representations, and shows the best performance. Compared to previous studies, CAG improves accuracy by 5.4% and F1-score by 5.1%. Additionally, experiments confirm that the choice of encoding has a positive impact on accuracy (4-6%), whereas the choice of network type does not. The study should contribute to a meaningful benchmark for future research on code representations, data encoding, and GNNs.
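The aggregate-graph idea described above can be made concrete with a minimal, hypothetical sketch (not the authors' code): edges from several analyses of the same function (AST, CFG, PDG, dominator/post-dominator trees) are unioned into one adjacency matrix, and a small graph network classifies the pooled node representation. All names, dimensions, and the simple row normalization are illustrative assumptions.

```python
# Hypothetical sketch: merge edges from several code analyses into one
# aggregate graph, then classify the whole function with a small graph network.
import torch
import torch.nn as nn


def aggregate_adjacency(num_nodes, *edge_lists):
    """Union the edge lists of the individual analyses into one adjacency
    matrix with self-loops; simple row normalization stands in for the usual
    GCN-style symmetric normalization."""
    a = torch.eye(num_nodes)
    for edges in edge_lists:
        for src, dst in edges:
            a[src, dst] = 1.0
            a[dst, src] = 1.0
    return a / a.sum(dim=1, keepdim=True)


class AggregateGraphClassifier(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, 2)  # vulnerable / non-vulnerable

    def forward(self, x, a):
        h = torch.relu(self.w1(a @ x))   # first round of neighbor averaging
        h = torch.relu(self.w2(a @ h))   # second propagation step
        return self.out(h.mean(dim=0))   # mean-pool nodes -> function logits


# Toy usage: 4 statement nodes, edges coming from two different analyses.
ast_edges, cfg_edges = [(0, 1), (1, 2)], [(0, 2), (2, 3)]
x = torch.randn(4, 16)                       # pre-computed node embeddings
a = aggregate_adjacency(4, ast_edges, cfg_edges)
logits = AggregateGraphClassifier(16, 32)(x, a)
```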

Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation

2021

Identifying vulnerable code is a precautionary measure to counter software security breaches. Tedious expert effort has been spent to build static analyzers, yet insecure patterns are barely fully enumerated. This work explores a deep learning approach to automatically learn the insecure patterns from code corpora. Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program, in order to improve prediction performance. Compared with a generic GNN, our enhancements include a synthesis of multiple representations learned from the several parsed graphs of a program, and a new training loss metric that leverages the fine granularity of labeling. Our model outperforms multiple text, image and graph-based approaches, across two real-world datasets.
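The "synthesis of multiple representations learned from the several parsed graphs" can be pictured with a hedged sketch: each parsed graph of the same function is encoded separately and the pooled vectors are fused before classification. The encoder, the choice of two graphs (AST and data-flow), and all dimensions are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: embed each parsed graph of a program separately, then fuse
# the pooled vectors for one prediction.
import torch
import torch.nn as nn


class SimpleGraphEncoder(nn.Module):
    """One neighbor-averaging layer followed by mean pooling."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        return torch.relu(self.proj(adj @ x)).mean(dim=0)


class MultiGraphFusion(nn.Module):
    """Encode, e.g., an AST graph and a data-flow graph of the same function,
    concatenate the two embeddings, and classify."""
    def __init__(self, in_dim, hid):
        super().__init__()
        self.enc_ast = SimpleGraphEncoder(in_dim, hid)
        self.enc_dfg = SimpleGraphEncoder(in_dim, hid)
        self.head = nn.Linear(2 * hid, 2)

    def forward(self, x_ast, a_ast, x_dfg, a_dfg):
        fused = torch.cat([self.enc_ast(x_ast, a_ast),
                           self.enc_dfg(x_dfg, a_dfg)])
        return self.head(fused)


x = torch.randn(6, 32)                  # toy node features shared by both views
logits = MultiGraphFusion(32, 64)(x, torch.eye(6), x, torch.eye(6))
```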

LineVD: Statement-level Vulnerability Detection using Graph Neural Networks

arXiv, 2022

Current machine-learning based software vulnerability detection methods are primarily conducted at the function-level. However, a key limitation of these methods is that they do not indicate the specific lines of code contributing to vulnerabilities. This limits the ability of developers to efficiently inspect and interpret the predictions from a learnt model, which is crucial for integrating machine-learning based tools into the software development workflow. Graph-based models have shown promising performance in function-level vulnerability detection, but their capability for statement-level vulnerability detection has not been extensively explored. While interpreting function-level predictions through explainable AI is one promising direction, we herein consider the statement-level software vulnerability detection task from a fully supervised learning perspective. We propose a novel deep learning framework, LineVD, which formulates statement-level vulnerability detection as a node classification task. LineVD leverages control and data dependencies between statements using graph neural networks, and a transformer-based model to encode the raw source code tokens. In particular, by addressing the conflicting outputs between function-level and statement-level information, LineVD significantly improves the prediction performance without vulnerability status for function code. We have conducted extensive experiments against a large-scale collection of real-world C/C++ vulnerabilities obtained from multiple real-world projects, and demonstrate an increase of 105% in F1-score over the current state-of-the-art.
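Framing statement-level detection as node classification can be illustrated with a rough sketch: every statement is a node whose embedding (e.g. from a transformer encoder) is propagated over control/data dependencies, and the model emits one score per node plus a pooled function-level score. This is illustrative only and is not the LineVD architecture; all names and shapes are assumptions.

```python
# Rough sketch of statement-level detection as node classification.
import torch
import torch.nn as nn


class StatementNodeClassifier(nn.Module):
    def __init__(self, stmt_dim, hid):
        super().__init__()
        self.gnn = nn.Linear(stmt_dim, hid)
        self.node_head = nn.Linear(hid, 2)   # per-statement prediction
        self.func_head = nn.Linear(hid, 2)   # pooled function-level prediction

    def forward(self, stmt_emb, adj):
        # stmt_emb: (num_statements, stmt_dim), e.g. from a transformer encoder
        h = torch.relu(self.gnn(adj @ stmt_emb))   # propagate over dependency edges
        return self.node_head(h), self.func_head(h.mean(dim=0))


stmts = torch.randn(5, 64)        # stand-in for transformer statement embeddings
adj = torch.eye(5)                # stand-in control/data dependency graph
node_logits, func_logits = StatementNodeClassifier(64, 32)(stmts, adj)
```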

Learning to map source code to software vulnerability using code-as-a-graph

arXiv, 2020

We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective. Specifically, whether signatures of vulnerabilities in source code can be learned from its graph representation, in terms of relationships between nodes and edges. We create a pipeline we call AI4VA, which first encodes a sample source code into a Code Property Graph. The extracted graph is then vectorized in a manner which preserves its semantic information. A Gated Graph Neural Network is then trained using several such graphs to automatically extract templates differentiating the graph of a vulnerable sample from a healthy one. Our model outperforms static analyzers, classic machine learning, as well as CNN and RNN-based deep learning models on two of the three datasets we experiment with. We thus show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches.
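The gated-graph-network idea referenced above can be summarized with a minimal illustration (not the AI4VA pipeline): node states on the code property graph are repeatedly updated by a GRU cell from messages aggregated over the graph's edges, then pooled for classification. The class, step count, and dimensions below are assumptions.

```python
# Minimal gated graph network sketch over a code property graph.
import torch
import torch.nn as nn


class TinyGatedGraphNet(nn.Module):
    def __init__(self, dim, steps=3):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.steps = steps
        self.head = nn.Linear(dim, 2)

    def forward(self, x, adj):
        h = x
        for _ in range(self.steps):
            m = adj @ self.msg(h)        # aggregate transformed neighbor states
            h = self.gru(m, h)           # gated update of each node state
        return self.head(h.mean(dim=0))  # graph-level vulnerable/safe logits


h = torch.randn(7, 48)                   # toy node features from a CPG
out = TinyGatedGraphNet(48)(h, torch.eye(7))
```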

Combining Graph-Based Learning With Automated Data Collection for Code Vulnerability Detection

IEEE Transactions on Information Forensics and Security

This paper presents FUNDED, a novel learning framework for building vulnerability detection models. FUNDED leverages the advances in graph neural networks (GNNs) to develop a novel graph-based learning method to capture and reason about the program's control, data, and call dependencies. Unlike prior work that treats the program as a linear token sequence or an untyped graph, FUNDED learns and operates on a graph representation of the program source code, in which individual statements are connected to other statements through relational edges. By capturing the program syntax, semantics and flows, FUNDED finds better code representation for the downstream software vulnerability detection task. To provide sufficient training data to build an effective deep learning model, we combine probabilistic learning and statistical assessments to automatically gather high-quality training samples from open-source projects. This provides many real-life vulnerable code training samples to complement the limited vulnerable code samples available in standard vulnerability databases. We apply FUNDED to identify software vulnerabilities at the function level from program source code. We evaluate FUNDED on large real-world datasets with programs written in C, Java, Swift and PHP, and compare it against six state-of-the-art code vulnerability detection models. Experimental results show that FUNDED significantly outperforms alternative approaches across evaluation settings.
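A hedged sketch of "statements connected through relational (typed) edges": each edge type (e.g. control, data, call) gets its own adjacency matrix and weight matrix, and the per-type messages are summed. This is an illustrative relational layer in the spirit of typed-edge GNNs, not the FUNDED implementation; the edge types and sizes are assumptions.

```python
# Illustrative relational (typed-edge) graph layer.
import torch
import torch.nn as nn


class RelationalLayer(nn.Module):
    def __init__(self, dim, num_edge_types):
        super().__init__()
        self.per_type = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_edge_types)])
        self.self_loop = nn.Linear(dim, dim)

    def forward(self, x, adjs):
        # adjs: list of (num_nodes, num_nodes) adjacency matrices, one per edge type
        out = self.self_loop(x)
        for adj, lin in zip(adjs, self.per_type):
            out = out + adj @ lin(x)     # messages from this edge type's neighbors
        return torch.relu(out)


layer = RelationalLayer(dim=32, num_edge_types=3)   # e.g. control, data, call edges
x = torch.randn(5, 32)
out = layer(x, [torch.eye(5), torch.eye(5), torch.eye(5)])
```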

Automated Vulnerability Detection in Source Code Using Deep Representation Learning

2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA)

Increasing numbers of software vulnerabilities are discovered every year, whether they are reported publicly or discovered internally in proprietary code. These vulnerabilities can pose a serious risk of exploitation and result in system compromise, information leaks, or denial of service. We leveraged the wealth of C and C++ open-source code available to develop a large-scale function-level vulnerability detection system using machine learning. To supplement existing labeled vulnerability datasets, we compiled a vast dataset of millions of open-source functions and labeled it with carefully selected findings from three different static analyzers that indicate potential exploits. The labeled dataset is available at: https://osf.io/d45bw/. Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. We evaluated our tool on code from both real software packages and the NIST SATE IV benchmark dataset. Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection.
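Learning directly from lexed source code can be pictured with an illustrative sketch (not the paper's model): token ids from a C/C++ lexer are embedded, passed through a 1-D convolution, max-pooled, and classified. Vocabulary size, kernel width, and dimensions are assumptions.

```python
# Illustrative token-level detector over lexed source code.
import torch
import torch.nn as nn


class TokenConvDetector(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, channels=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, channels, kernel_size=5, padding=2)
        self.head = nn.Linear(channels, 2)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer ids from a C/C++ lexer
        e = self.emb(token_ids).transpose(1, 2)        # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(e)).max(dim=2).values  # max-pool over the sequence
        return self.head(h)


logits = TokenConvDetector(vocab_size=5000)(torch.randint(0, 5000, (2, 200)))
```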

VulANalyzeR: Explainable Binary Vulnerability Detection with Multi-Task Learning and Attentional Graph Convolution

ACM Transactions on Privacy and Security

Software vulnerabilities have been posing tremendous reliability threats to the general public as well as critical infrastructures, and there have been many studies aiming to detect and mitigate software defects at the binary level. Most of the standard practices leverage both static and dynamic analysis, which have several drawbacks like heavy manual workload and high complexity. Existing deep learning-based solutions not only struggle to capture the complex relationships among different variables from raw binary code, but also lack the explainability required for humans to verify, evaluate, and patch the detected bugs. We propose VulANalyzeR, a deep learning-based model, for automated binary vulnerability detection, CWE type classification, and root cause analysis to enhance safety and security. VulANalyzeR features sequential and topological learning through recurrent units and graph convolution to simulate how a program is executed. The attention mechanism is integrated throughout...
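The combination of sequential and topological learning with multiple task heads can be sketched loosely as follows; this is not the VulANalyzeR implementation, and the block embeddings, single propagation step, and head dimensions are assumptions.

```python
# Loose sketch: recurrent pass over basic blocks, one CFG propagation step,
# and two task heads (vulnerable/not and CWE class).
import torch
import torch.nn as nn


class SeqGraphMultiTask(nn.Module):
    def __init__(self, dim, num_cwe):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)    # sequential view of blocks
        self.graph = nn.Linear(dim, dim)                  # one CFG propagation step
        self.detect = nn.Linear(dim, 2)
        self.cwe = nn.Linear(dim, num_cwe)

    def forward(self, block_emb, adj):
        seq, _ = self.rnn(block_emb.unsqueeze(0))         # (1, blocks, dim)
        h = torch.relu(self.graph(adj @ seq.squeeze(0)))  # propagate over the CFG
        pooled = h.mean(dim=0)
        return self.detect(pooled), self.cwe(pooled)


blocks = torch.randn(6, 40)              # toy basic-block embeddings
det, cwe = SeqGraphMultiTask(40, num_cwe=10)(blocks, torch.eye(6))
```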

On using distributed representations of source code for the detection of C security vulnerabilities

arXiv, 2021

This paper presents an evaluation of the code representation model Code2vec when trained on the task of detecting security vulnerabilities in C source code. We leverage the open-source library astminer to extract path-contexts from the abstract syntax trees of a corpus of labeled C functions. Code2vec is trained on the resulting path-contexts with the task of classifying a function as vulnerable or non-vulnerable. Using the CodeXGLUE benchmark, we show that the accuracy of Code2vec for this task is comparable to simple transformer-based methods such as pretrained RoBERTa, and outperforms more naive NLP-based methods. We achieved an accuracy of 61.43% while maintaining low computational requirements relative to larger models.
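A hedged sketch of a code2vec-style classifier as described above: each function is a bag of (start token, AST path, end token) contexts; contexts are embedded, combined with attention, and the pooled vector is classified as vulnerable or not. The class, vocabularies, and dimensions are hypothetical, and astminer would supply the path-contexts in practice.

```python
# Hedged code2vec-style path-context classifier.
import torch
import torch.nn as nn


class PathContextClassifier(nn.Module):
    def __init__(self, token_vocab, path_vocab, dim=128):
        super().__init__()
        self.tok = nn.Embedding(token_vocab, dim)
        self.path = nn.Embedding(path_vocab, dim)
        self.combine = nn.Linear(3 * dim, dim)
        self.attn = nn.Linear(dim, 1)
        self.head = nn.Linear(dim, 2)   # vulnerable / non-vulnerable

    def forward(self, starts, paths, ends):
        # starts, paths, ends: (num_contexts,) integer ids for one function
        c = torch.tanh(self.combine(
            torch.cat([self.tok(starts), self.path(paths), self.tok(ends)], dim=1)))
        w = torch.softmax(self.attn(c), dim=0)   # attention over contexts
        return self.head((w * c).sum(dim=0))


n = 30                                   # toy number of contexts for one function
model = PathContextClassifier(token_vocab=2000, path_vocab=500)
logits = model(torch.randint(0, 2000, (n,)),
               torch.randint(0, 500, (n,)),
               torch.randint(0, 2000, (n,)))
```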

Detecting Software Vulnerabilities Using Neural Networks

ICMLC, 2021

As software vulnerabilities remain prevalent, automatically detecting software vulnerabilities is crucial for software security. Recently neural networks have been shown to be a promising tool in detecting software vulnerabilities. In this paper, we use neural networks trained with program slices, which extract the syntax and semantic characteristics of the source code of programs, to detect software vulnerabilities in C/C++ programs. To achieve a strong prediction model, we combine different types of program slices and optimize different types of neural networks. Our results show that combining different types of characteristics of the source code and using a balanced ratio of vulnerable and non-vulnerable program slices yields a balanced accuracy in predicting both vulnerable and non-vulnerable code. Among different neural networks, BGRU performs the best in detecting software vulnerabilities with an accuracy of 94.89%.
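A small sketch of the bidirectional GRU (BGRU) classifier mentioned above, operating over the token ids of a program slice. The vocabulary size, sequence length, and use of the final time step are illustrative assumptions rather than the paper's configuration.

```python
# Small BGRU classifier over program-slice tokens.
import torch
import torch.nn as nn


class SliceBGRU(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bgru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hid, 2)

    def forward(self, slice_tokens):
        # slice_tokens: (batch, seq_len) ids of the tokens in each program slice
        out, _ = self.bgru(self.emb(slice_tokens))
        return self.head(out[:, -1])   # last step, both directions concatenated


logits = SliceBGRU(vocab_size=10000)(torch.randint(0, 10000, (4, 120)))
```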

Learning-based Vulnerability Detection in Binary Code

ICMLC, 2022

Cyberattacks typically exploit software vulnerabilities to compromise computers and smart devices. To address vulnerabilities, many approaches have been developed to detect vulnerabilities using deep learning. However, most learning-based approaches detect vulnerabilities in source code instead of binary code. In this paper, we present our approach to detecting vulnerabilities in binary code. Our approach uses binary code compiled from the SARD dataset to build deep learning models to detect vulnerabilities. It extracts features from the syntax information of the assembly instructions in binary code, and trains two deep learning models on the features for vulnerability detection. From our evaluation, we find that the BLSTM model has the best performance, achieving an accuracy rate of 81% in detecting vulnerabilities. In particular, the F1-score, recall, and specificity of the BLSTM model are 75%, 95%, and 75%, respectively. This indicates that the model is balanced in detecting both vulnerable code and non-vulnerable code.
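A hedged sketch of the bidirectional LSTM (BLSTM) side of this approach, run over tokenized assembly instructions (e.g. opcodes and operand kinds). The tokenization, vocabulary size, and mean pooling below are assumptions, not the paper's exact feature extraction.

```python
# Hedged BLSTM classifier over tokenized assembly instructions.
import torch
import torch.nn as nn


class AsmBLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.blstm = nn.LSTM(emb_dim, hid, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hid, 2)

    def forward(self, asm_tokens):
        # asm_tokens: (batch, seq_len) ids of tokenized assembly instructions
        out, _ = self.blstm(self.emb(asm_tokens))
        return self.head(out.mean(dim=1))   # average over the instruction sequence


logits = AsmBLSTM(vocab_size=3000)(torch.randint(0, 3000, (2, 150)))
```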