Automatic extraction of function bodies from software binaries
Related papers
Estimation of similarity between functions extracted from x86 executable files
Serbian Journal of Electrical Engineering, 2015
Comparison of functions is required in various domains of software engineering. In most domains, comparison is done using source code, but in some, such as license-violation or malware analysis, only binary code is available. The goal of this paper is to evaluate whether an existing solution designed for the ARM architecture can be applied to the x86 architecture. The existing solution encompasses multiple approaches; for the purposes of this paper, three representative approaches are implemented: two based on machine learning, and a third that requires no prior knowledge. Results show that the best recalls obtained for the first ten positions on the two architectures are comparable and do not differ significantly. The results confirm that adapting all approaches of the existing solution is not only possible but also promising, and they represent an adequate basis for future research.
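To make the evaluation metric above concrete, here is a minimal sketch (not from the paper) of per-query recall over the first ten ranked positions; the candidate rankings and ground-truth matches are hypothetical.

```python
def recall_at_k(ranked_candidates, true_match, k=10):
    """Per-query recall: 1.0 if the true matching function appears
    among the top-k ranked candidates, else 0.0."""
    return 1.0 if true_match in ranked_candidates[:k] else 0.0

# Hypothetical (ranking, true match) pairs for two query functions:
queries = [
    (["sub_4011a0", "sub_402f30", "memcpy_like_1"], "memcpy_like_1"),
    (["sub_9910c0", "sub_9804a0", "sub_977f00"], "sub_401000"),  # miss
]
mean_recall = sum(recall_at_k(r, t) for r, t in queries) / len(queries)
print(mean_recall)  # 0.5 -- averaged over queries, this is the recall@10 such studies report
```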
A lightweight framework for function name reassignment based on large-scale stripped binaries
Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2021
Software in the wild is usually released as stripped binaries that contain no debug information (e.g., function names). This paper studies the problem of reassigning descriptive names to functions in order to facilitate reverse engineering. Since the essence of this problem is a data-driven prediction task, persuasive research should be based on sufficiently large-scale and diverse data. However, prior studies were limited to small-scale datasets because their techniques rely on heavyweight binary analysis, leaving them powerless in the face of large binaries at scale. This paper presents the Neural Function Rename Engine (NFRE), a lightweight framework for function name reassignment that utilizes both the sequential and the structural information of assembly code. NFRE uses fine-grained and easily acquired features to model assembly code, making it more effective and efficient than existing techniques. In addition, we construct a large-scale dataset and present two data-preprocessing approaches to improve its usability. Benefiting from the lightweight design, NFRE can be trained efficiently on the large-scale dataset and therefore generalizes better to unknown functions. Comparative experiments show that NFRE outperforms two existing techniques by relative improvements of 32% and 16%, respectively, while its time cost for binary analysis is much lower.
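The abstract does not spell out what "fine-grained and easily acquired features" look like; a common lightweight choice is to normalize instruction operands so a sequence model sees a small token vocabulary. The sketch below illustrates that general idea under assumed token rules; it is not NFRE's actual feature design.

```python
import re

def normalize_instruction(ins: str) -> str:
    """Map one x86 assembly instruction to a coarse feature token:
    immediates become IMM, memory operands become MEM, registers stay."""
    ins = re.sub(r"0x[0-9a-fA-F]+", "IMM", ins)  # hex immediates
    ins = re.sub(r"\[[^\]]*\]", "MEM", ins)      # memory operands
    ins = re.sub(r"\b\d+\b", "IMM", ins)         # decimal immediates
    return ins

# Hypothetical function body; the token stream would feed a sequence model
# that predicts name tokens such as ["read", "config"].
body = ["push rbp", "mov rbp, rsp", "mov eax, [rbp-0x8]", "add eax, 0x10", "ret"]
print([normalize_instruction(i) for i in body])
# ['push rbp', 'mov rbp, rsp', 'mov eax, MEM', 'add eax, IMM', 'ret']
```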
Reuse-oriented reverse engineering of functional components from x86 binaries
Proceedings of the 36th International Conference on Software Engineering (ICSE), 2014
Locating, extracting, and reusing the implementation of a feature within an existing binary program is challenging. This paper proposes a novel algorithm to identify modular functions corresponding to such features and to provide usable interfaces for the extracted functions. We represent a desired feature with two executions that both exercise the feature but with different inputs. Instead of reverse engineering the interface of a function, we wrap the existing interface and provide a simpler, more intuitive interface for the function through concretization and redirection. Experiments show that our technique can extract varied features from several real-world applications, including a malicious one.
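A minimal sketch of the two-execution idea: basic blocks executed by both feature-exercising runs are a first approximation of the feature's code, and subtracting a baseline run (an extra assumption here, not stated in the abstract) removes shared startup code. The traces are hypothetical.

```python
def feature_blocks(trace_a: set, trace_b: set, baseline: set) -> set:
    """Blocks common to two feature-exercising traces, minus blocks
    also executed when the feature is not used."""
    return (trace_a & trace_b) - baseline

# Hypothetical basic-block address traces:
run1 = {0x401000, 0x401020, 0x401400, 0x401440}  # feature, input A
run2 = {0x401000, 0x401030, 0x401400, 0x401440}  # feature, input B
base = {0x401000, 0x401020, 0x401030}            # feature not exercised
print(sorted(hex(b) for b in feature_blocks(run1, run2, base)))
# ['0x401400', '0x401440']
```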
Automatic Retargeting of Binary Utilities for Embedded Code Generation
IEEE Computer Society Annual Symposium on VLSI (ISVLSI '07), 2007
Contemporary SoC design involves the proper selection of cores from a reference platform. Such selection implies design exploration of alternative CPUs, which requires the generation of binary code for each possible target. However, the embedded-computing market shows a broad spectrum of instruction-set architectures, ranging from microcontrollers to RISCs and ASIPs. As a consequence, binary utilities cannot always rely on pre-existing tools within standard packages. Moreover, manually retargeting every binary utility is not acceptable under time-to-market pressure. This paper describes a technique for the automatic generation of binary utilities from an abstract model of the target CPU, which can be synthesized from an arbitrary ADL. The technique is based upon two key mechanisms: model provision for tool generation (at the front-end) and automatic library modification (at the back-end). To illustrate the technique's effectiveness, we describe the generation of assemblers, linkers, and disassemblers. We have successfully compared the files produced by the generated tools to those produced by conventional tools. Moreover, to give proper evidence of retargetability, we present results for MIPS, SPARC, PowerPC, and i8051.
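To make the generation idea concrete, here is a hedged, toy sketch of a table-driven disassembler whose instruction table is exactly the kind of artifact that could be emitted from an ADL model of the CPU; the two-byte, three-instruction ISA is invented for illustration.

```python
# Toy instruction table for a hypothetical fixed-width 2-byte ISA.
# In the paper's setting, such a table would be generated from the ADL model.
OPCODES = {
    0x01: ("add", 2),  # (mnemonic, number of register operands)
    0x02: ("ld",  2),
    0x03: ("jmp", 1),
}

def disassemble(code: bytes) -> list:
    """Decode fixed-width instructions laid out as [opcode][operand byte],
    where the operand byte packs up to two 4-bit register indices."""
    out = []
    for i in range(0, len(code), 2):
        mnemonic, nregs = OPCODES[code[i]]
        operand = code[i + 1]
        regs = [f"r{(operand >> 4) & 0xF}", f"r{operand & 0xF}"][:nregs]
        out.append(f"{mnemonic} " + ", ".join(regs))
    return out

print(disassemble(bytes([0x01, 0x12, 0x03, 0x20])))
# ['add r1, r2', 'jmp r2']
```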
Learning to Find Usages of Library Functions in Optimized Binaries
IEEE Transactions on Software Engineering, 2022
Much software, whether beneficent or malevolent, is distributed only as binaries, sans source code. Absent source code, understanding binaries' behavior can be quite challenging, especially when they are compiled under higher levels of compiler optimization. These optimizations can transform comprehensible, "natural" source constructions into something entirely unrecognizable. Reverse engineering binaries, especially those suspected of being malevolent or guilty of intellectual property theft, is an important and time-consuming task. There is a great deal of interest in tools that "decompile" binaries back into more natural source code to aid reverse engineering. Decompilation involves several desirable steps, including recreating source-language constructions, variable names, and perhaps even comments. One central step in creating binaries is optimizing function calls, using transformations such as inlining. Recovering these (possibly inlined) function calls from optimized binaries is an essential task that most state-of-the-art decompilers attempt but do not perform very well. In this paper, we evaluate a supervised learning approach to the problem of recovering optimized function calls. We leverage open-source software and develop an automated labeling scheme to generate a reasonably large dataset of binaries labeled with actual function usages. We augment this large but limited labeled dataset with a pre-training step, which learns decompiled-code statistics from a much larger unlabeled dataset. Thus augmented, our learned labeling model can be combined with an existing decompilation tool, Ghidra, to achieve substantially improved performance in function call recovery, especially at higher levels of optimization.
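The pre-training step described above is, in broad strokes, a masked-token objective over decompiled code. A minimal sketch of preparing such training pairs follows; the tokenization, mask rate, and mask symbol are assumptions for illustration, not the paper's exact setup.

```python
import random

MASK = "<MASK>"

def make_pretraining_pair(tokens, mask_rate=0.15, rng=random.Random(0)):
    """Randomly mask tokens of decompiled code; a model is pre-trained
    to recover the original tokens from the masked sequence."""
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets.append(tok)   # loss computed here
        else:
            masked.append(tok)
            targets.append(None)  # no loss on unmasked positions
    return masked, targets

# Hypothetical Ghidra-style decompiled-code token stream:
decompiled = "uVar1 = strlen ( param_1 ) ;".split()
masked, targets = make_pretraining_pair(decompiled)
print(masked)
print(targets)
```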
Towards automatic program partitioning
Proceedings of the 6th ACM Conference on Computing Frontiers (CF), 2009
There is a trend towards using accelerators to increase performance and energy efficiency of general-purpose processors. Adoption of accelerators, however, depends on the availability of tools to facilitate programming these devices.
Component Identification Through Program Slicing
Electronic Notes in Theoretical Computer Science, 2006
This paper reports on the development of slicing techniques specific to functional programs and their use in identifying coherent component candidates within monolithic code. An associated tool is also introduced. This piece of research is part of a broader project on program understanding and re-engineering of legacy code supported by formal methods.
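As a rough illustration of slicing at the level of top-level definitions (a simplification of what the paper does for functional programs), a component candidate can be taken as everything a seed definition transitively depends on. The dependency graph below is hypothetical.

```python
def slice_from(seed: str, deps: dict) -> set:
    """Transitive closure of 'uses' edges from a seed definition:
    a crude component/slice candidate."""
    component, stack = set(), [seed]
    while stack:
        d = stack.pop()
        if d not in component:
            component.add(d)
            stack.extend(deps.get(d, ()))
    return component

# Hypothetical definition-level dependency graph of a functional program:
deps = {
    "render": {"layout", "format"},
    "layout": {"measure"},
    "format": set(),
    "parse":  {"lex"},
}
print(sorted(slice_from("render", deps)))
# ['format', 'layout', 'measure', 'render']  -- 'parse'/'lex' are excluded
```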
Extracting Classes from Routine Calls in Legacy Software
Extracting object-oriented design from procedural code is an important issue in software maintenance. Existing research in this direction puts a heavy burden on the experts of the system being studied. To automate the process, we propose a new method for clustering together routines that are semantically related. The method is based on routine-call analysis. Experiments on a subset of the system we are studying (23 KLOC) are discussed; they give very promising results.

While maintaining legacy applications, a large portion of software engineers' effort is spent trying to understand the program and its data [11]. To help software engineers in this task, we have built a tool that lets them easily browse through the code and find what they are looking for. One component of this browsing tool is an "object-oriented browser": a browser that presents the (procedural) code as a hierarchy of classes. A number of researchers have tried to migrate p...
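A hedged sketch of the routine-call idea: routines that share many callees are candidates to belong to the same class. The similarity measure (Jaccard over callee sets), the greedy grouping, and the threshold are assumptions for illustration, not the paper's exact method.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two callee sets, 0.0 when both are empty."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_routines(calls: dict, threshold: float = 0.5):
    """Greedy single-link grouping of routines whose callee sets overlap."""
    clusters = []
    for r in calls:
        for c in clusters:
            if any(jaccard(calls[r], calls[o]) >= threshold for o in c):
                c.add(r)
                break
        else:
            clusters.append({r})
    return clusters

# Hypothetical routine -> callee-set map extracted from legacy code:
calls = {
    "acct_open":  {"lock", "read_rec", "log"},
    "acct_close": {"lock", "write_rec", "log"},
    "ui_draw":    {"blit", "font"},
}
print(cluster_routines(calls))
# e.g. [{'acct_open', 'acct_close'}, {'ui_draw'}] (set print order may vary)
```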