[GSoC 2025] Input-Gen: A Scalable Framework for Stateful Input Generation
I am posting the project from my GSoC proposal here to invite discussion about it and about possible future steps.
Description: This project aims to enhance the Input-Gen tool, a scalable framework for stateful input generation, to extend coverage and compatibility with LLVM-supported languages (C/C++, Rust, Julia, and Swift). Introductory Discourse post here. Input-Gen generates inputs for arbitrary program fragments by instrumenting LLVM Intermediate Representation (IR) code, then capturing and replaying program states. The tool operates through a multi-stage process involving module preparation, LLVM IR instrumentation, runtime execution, and input storage. Using the LLVM ComPile dataset, a large number of inputs can be generated and evaluated to determine the accuracy of Input-Gen. The goal is to improve the tool’s accuracy when run on arbitrary IR files, enabling its adoption for practical purposes defined by LLVM developers, such as comprehensive testing, performance tuning, and ML training.
Expected Results: Enhanced accuracy of the Input-Gen tool, increased coverage percentage, and successful instrumentation and execution of generated inputs from IR bitcode files or modules. By the end of the GSoC timeline, Input-Gen is expected to achieve a larger number of successfully instrumented and executed functions, as well as a higher average number of basic blocks executed per IR file, relative to the results discussed in the Input-Gen paper. This will be accomplished by directly editing input-gen.cpp and its associated files, found here.
Project Size: Medium
Requirement: Basic C & C++ skills, familiarity with LLVM IR features
Confirmed Mentors: Aiden Grossman, Ivan Ivanov, Johannes Doerfert
andrewka April 10, 2025, 2:16pm 2
Current Individual Progress: The Input-Gen tool has been run on two x86 systems and has shown promising results. Initial testing with the ComPile dataset has demonstrated successful instrumentation and execution of generated inputs. The mass input generation was run using the run_local_mass_input_gen.sh script with a configuration that specified:
- dataset location: ~/.cache/huggingface/datasets/llvm-ml___com_pile (which was supplied to HuggingFace Datasets.load_dataset())
- LLVM installation directory: /path/to/llvm-input-gen-install
- jugfile data location: $SCRIPT_DIR/jugfile.jugdata
- output directory: /path/to/compile-input-gen-out
The shell arguments were:
VERBOSE=1 ADDITIONAL_FLAGS="--verbose -g" JUG=run START=0 END=99 LANGUAGE=c \
./scripts/run_local_mass_input_gen.sh
Verbosity was added to visually confirm the tool was executing correctly.
The key statistics from the run are:
- 685 functions
- 665 inputs generated (all)
- 618 inputs generated for functions with normal exit paths
- 657 inputs ran (all)
- 645 inputs ran for functions with normal exit paths
- 7246 basic blocks from all IR files
- 3577 basic blocks executed
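To put the counts above in proportion, here is a small sketch that turns them into percentages. It is only arithmetic on the numbers listed in this post; the dictionary keys and the `percent` helper are names made up for illustration, and treating "inputs generated" as a fraction of "functions" assumes at most one input per function.

```python
# Counts copied from the run statistics listed above.
stats = {
    "functions": 685,
    "inputs_generated": 665,
    "inputs_ran": 657,
    "basic_blocks_total": 7246,
    "basic_blocks_executed": 3577,
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Share of functions for which an input was generated
# (assumes at most one input per function).
generation_rate = percent(stats["inputs_generated"], stats["functions"])
# Share of generated inputs that were then run.
run_rate = percent(stats["inputs_ran"], stats["inputs_generated"])
# Share of all basic blocks that were executed at least once.
block_coverage = percent(
    stats["basic_blocks_executed"], stats["basic_blocks_total"]
)

print(f"inputs generated / functions: {generation_rate}%")
print(f"inputs ran / inputs generated: {run_rate}%")
print(f"basic blocks executed / total: {block_coverage}%")
```

On these numbers, roughly 97% of functions got an input, about 99% of generated inputs ran, and about 49% of basic blocks were executed.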
It is still too early for a visual comparison with the results previously obtained using Input-Gen: some issues remain to be addressed, and the tool has not yet been run on enough files for a meaningful comparison. For now, these results serve as an indication that the tool is performing as expected.
One pressing issue when running Input-Gen was an error with branch-hints.ll in the lit test suite, which still needs to be addressed. Additionally, a modification was made to llvm/tools/input-gen/input-gen.cpp, as shown in the following diff:
@@ -354,7 +354,7 @@ public:
std::string RuntimeName) {
if (ClCompileInputGenExecutables) {
LLVM_DEBUG(dbgs() << "Compiling" << ExecutableName << "\n";
- SmallVector<StringRef, 10> Args = {Clang, "-ldl", "-rdynamic",
+ SmallVector<StringRef, 10> Args = {Clang, "-ldl", "-rdynamic", "--gcc-toolchain=/packages/gcc/13.2.0",
RuntimeName, ModuleName, "-o",
ExecutableName};
This was done because I was unable to successfully build LLVM with LLVM_ENABLE_LIBCXX, which would provide clang++ with the C++ standard library (libc++) that Input-Gen needs. Instead, I relied on the GNU C++ library (libstdc++), but this is not a reliable solution.
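To make the effect of the diff concrete, here is a rough Python sketch (not the tool's C++ code) of the compiler argument list it produces. It treats the GCC toolchain path as an optional parameter instead of a hardcoded string. The function name `build_link_args` and the example file names are hypothetical; only the flags themselves come from the diff.

```python
from typing import List, Optional


def build_link_args(
    clang: str,
    runtime_name: str,
    module_name: str,
    executable_name: str,
    gcc_toolchain: Optional[str] = None,
) -> List[str]:
    """Build the clang argument list, mirroring the order used in the diff."""
    args = [clang, "-ldl", "-rdynamic"]
    # Only add --gcc-toolchain when a path is supplied, so the same code
    # works on machines where the default toolchain is sufficient.
    if gcc_toolchain:
        args.append(f"--gcc-toolchain={gcc_toolchain}")
    args += [runtime_name, module_name, "-o", executable_name]
    return args


# Example with the path from the diff; the file names are placeholders.
example = build_link_args(
    "clang++", "rt.cpp", "module.bc", "module.exe",
    gcc_toolchain="/packages/gcc/13.2.0",
)
print(" ".join(example))
```

Without the `gcc_toolchain` argument, the list matches the original, unmodified line of the diff.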
Hello
It is a bit strange to see this here now:
- It is usually the mentor who submits the project proposal. The project submission guidelines were posted at LLVM+GSoC 2025: call for mentors and projects!
- The proposal submission deadline has already passed
@akorobeynikov, from what Andrew told me, he thought he had to make a Discourse post after he submitted his project proposal. Maybe there was just some confusion going around.
~ J
Not sure where he got this (see the link above with the instructions). The project was not listed in Open Projects, and I was not aware of it, or of its mentors, until recently.
We are having lots of irrelevant / spam / LLM-generated proposals this year.
andrewka April 10, 2025, 6:35pm 6
@akorobeynikov, from my understanding, the link above is relevant to those who are mentors for a GSoC project. I submitted my proposal as a contributor, and I created this post to allow open discussion of the proposed project. I submitted the proposal recently, so I understand why people were not yet aware of it.