Jason Upchurch - Academia.edu
Papers by Jason Upchurch
In this paper, we describe the use of Bloom filters as a sliding window hash storage mechanism for similarity comparisons. The focus of the paper is overcoming accuracy issues in current file similarity methods.
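As a rough illustration of the idea in this abstract, the sketch below stores every sliding window of a reference byte string in a Bloom filter and then reports the fraction of a candidate's windows that the filter claims to contain. It is not the paper's implementation: the names `BloomFilter`, `windows`, and `containment`, the window width, the filter size, and the use of SHA-256 to derive bit positions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest (4 bytes each).
        digest = hashlib.sha256(item).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def windows(data, width=8):
    """Yield every contiguous window of `width` bytes."""
    for i in range(len(data) - width + 1):
        yield data[i:i + width]


def containment(reference, candidate, width=8):
    """Fraction of candidate windows that the reference's filter reports as present."""
    bloom = BloomFilter()
    for window in windows(reference, width):
        bloom.add(window)
    candidate_windows = list(windows(candidate, width))
    if not candidate_windows:
        return 0.0
    hits = sum(1 for window in candidate_windows if window in bloom)
    return hits / len(candidate_windows)
```

A Bloom filter never yields false negatives, so a candidate that appears verbatim inside the reference always scores 1.0; false positives can inflate scores for unrelated inputs, which is the kind of accuracy trade-off a filter's size and hash count control.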
2016 11th International Conference on Malicious and Unwanted Software (MALWARE), 2016
Detecting code reuse in software has applications in malicious code analysis and in malware code search and retrieval, but is complicated by the lack of available source code. In this paper, we revisit similarity detection using the First Byte instruction block normalization approach proposed previously, and examine the performance and characteristics of the proposed Locality Sensitive Hashing (LSH) scheme for search and retrieval. We demonstrate that our approach allows for the construction of new super signatures without the availability of the original malware input and that signatures from constituent malware blocks can be used to construct signatures of malware variants. We compare our approach with other projects that propose a similar method and show the effectiveness of our approach with regard to a known malware dataset. Experimental results show that our approach is advantageous in detection accuracy and comparison time.
2015 10th International Conference on Malicious and Unwanted Software (MALWARE), 2015
This paper describes Variant, a testing framework for projects attempting to locate variants of malware families through similarity testing. The framework is a series of tests and data standards to evaluate recall and precision in tools that attempt to statically measure similarity in implementation of compiled software, specifically in determining code reuse in compiled software to identify malware variants. The paper offers a malware test dataset that has been manually analyzed to provide a gold standard dataset to be used in current and future malware variant detection works. This set provides a much-needed resource in standardizing results across numerous works that have, so far, been tested against datasets that are either not reproducible, algorithmically derived, or both. The framework and dataset provided in this paper are used to test several malware detection approaches published in academic works or used in industry. Finally, the paper brings alignment of testing and reporting methods used in malware variant detection to those used in other statistical testing methods used in industry and academia.
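For readers unfamiliar with the two measures named in this abstract, the following minimal sketch computes precision and recall for a tool's reported variant pairs against a manually labeled gold standard. The function name `precision_recall` and the representation of a match as an unordered pair of sample identifiers are assumptions for illustration, not the framework's actual data format.

```python
def precision_recall(predicted_pairs, gold_pairs):
    """Return (precision, recall) for predicted variant pairs.

    Each pair is treated as unordered, so ("a", "b") equals ("b", "a").
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    gold = {frozenset(pair) for pair in gold_pairs}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Precision is the share of reported pairs that are real variants; recall is the share of real variant pairs that the tool found.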
2013 8th International Conference on Malicious and Unwanted Software: "The Americas" (MALWARE), 2013
Detecting code reuse in malicious software is complicated by the lack of source code. The same circumstance that makes code reuse detection in malicious software desirable, that is, the limited availability of original source code, also contributes to the difficulty of detecting code reuse. In this paper, we propose a method for detecting code reuse in software, specifically malicious software, that moves beyond the limitations of targeting variant detection (categorization of families). This method expands n-gram analysis to target basic blocks extracted from compiled code rather than entire text sections. It also targets individual relationships between basic blocks found in localized code reuse, while preserving the ability to detect variants and families of variants found with generalized code reuse. We demonstrate the limitations of similarity calculated without first disassembling the instructions and show that our First Byte normalization gives dramatic improvements in detection of code reuse. To visualize results, our method proposes force-based clustering as a solution to rapidly detect relationships between compiled binaries and detect relationships without complex analysis. Our methods retain the previously demonstrated ability of n-gram analysis to detect variants, while adding the ability to detect code reuse in non-variant malware. We show that our proposed filtering method reduces the number of similarity calculations and highlights only meaningful relationships in our malware set.
2013 8th International Conference on Malicious and Unwanted Software: "The Americas" (MALWARE), 2013
Each day, malware analysts are tasked with more samples than they have the ability to analyze by hand. To produce this volume of samples, malware authors often reuse a significant portion of their code. In this paper, we introduce a technique to statically decompose malicious software to identify shared code. This technique variably applies a sliding-window methodology to either full files or individual basic blocks to produce representative similarity ratios either between two binaries or between two functionalities within binaries, respectively. This grants the ability to apply heuristic detection via threshold similarity matching as well as full-inclusivity matching for malicious functionality. Additionally, we apply generalization techniques to minimize local assembly variants while still maintaining consistent structural matching. We also identify improvements that this technique provides over previous technologies and demonstrate its success in practical sample detection. Finally, we suggest further applications of this technique and highlight possible contributions to modern malware detection.
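The sliding-window comparison described in this abstract can be sketched roughly as follows. The abstract does not say how the similarity ratio is computed, so this example assumes a Jaccard ratio over sets of byte n-grams; the function names, the window width of 4, and the 0.5 threshold are likewise hypothetical.

```python
def ngrams(data, n=4):
    """Set of all n-byte sliding windows in `data` (empty if data is shorter than n)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def similarity_ratio(a, b, n=4):
    """Jaccard ratio of shared windows; an assumed stand-in for the paper's ratio."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_similar(a, b, threshold=0.5, n=4):
    """Threshold matching: flag two inputs whose ratio meets the cutoff."""
    return similarity_ratio(a, b, n) >= threshold


def fully_included(fragment, binary, n=4):
    """Full-inclusivity matching: every window of `fragment` also occurs in `binary`."""
    fragment_grams = ngrams(fragment, n)
    return bool(fragment_grams) and fragment_grams <= ngrams(binary, n)
```

The same functions apply whether the inputs are whole files or the bytes of individual basic blocks; only the granularity of what is passed in changes.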