Introducing input-gen: Automatically generate runnable inputs for your IR (original) (raw)

Introduction

We have built a tool that can be used to generate valid inputs for arbitrary LLVM IR functions. This encompasses the memory state prior to the function execution and the arguments to the function.

For example, suppose you have some C code:

typedef struct LinkedList {
  int Payload;
  struct LinkedList *Next;
} LinkedList;

void sum(LinkedList *LL) {
  int S = 0, L = 0;
  while (LL != 0) {
    S += LL->Payload;
    L += 1;
    LL = LL->Next;
  }
  printf("Length: %i, sum %i\n", L, S);
}

We can use our input-gen tool to generate inputs for it:

$ input_gen_module.py --function sum --outdir ./outdir --input-module linked_list.bc --input-gen-num 5 --verbose
Length: 5, sum 121
Length: 24, sum 457
Length: 1, sum 6
Length: 2, sum 29
Length: 9, sum 164

This is achieved by a sanitizer-inspired flow involving an LLVM pass which instruments memory effects of the code in conjunction with runtimes for generating inputs and replaying them.

High-Level Approach

To illustrate, suppose we want to generate inputs for this function:

define foo(ptr %p) {
  ...
  %a = load i64 %p
  %b = ...
  store f32 %b, %p2
  ...

First, we will provide an entry point for the input generation and instrument all sources of side-effects, such as loads, stores, and input arguments:

define __inputgen_entry() {
  %p = __ig_arg_ptr()
  call @foo(%p)
}
 
define foo(ptr %p) {
  ...
  __ig_read_i64(%p)
  %a = load i64 %p
  %b = ...
  __ig_store_f32(%p)
  store f32 %b, %p2
  ...

Here, __ig_arg_ptr, __ig_read_i64, and __ig_store_f32 are calls into the input generation runtime.

Then, the runtime will call the entry function when the program is started, and act upon the various events that happen through the execution.

In the above example, it will first get asked to generate a pointer for the argument by the call to __ig_arg_ptr. It will allocate an object backed by some memory and return a pointer to that. When the pointer will be read from (__ig_read_i64), that means the program in question expects an i64 value to already exist there. The runtime will check whether a value already existed in that location in memory, and will either do nothing if that was the case, or pull an integer value out of a random distribution and store it at that location to be read from by the program in the next instruction. __ig_store_* calls in turn update the accessed object bounds (this will later be used in the replay stage to allocate enough memory for each object to execute the input).

The resulting input state and arguments are then dumped to a binary file.

An alternative instrumented module is generated which contains an entry point for replaying the dumped inputs and it looks roughly like this:

define __inputrun_entry() {
  %p = __ir_arg_ptr()
  call @foo(%p)
}
 
define foo(ptr %p) {
  ...
  %a = load i64 %p
  %b = ...
  store f32 %b, %p2
  ...

where the original function is kept unchanged.

This is just the high-level approach and there are a lot of interesting details with regards to how to handle objects, indirect calls, aggregates, external functions, globals, and how to explore paths.

Use Cases

We believe the ability to automatically generate a large amount of runnable inputs can be very valuable.

A couple of examples are ML model training, large scale evaluation of compiler optimizations and tuning parameters thereof, fuzzing, and testing.

More Details

For more details, refer to our paper and our implementation in this branch.

Future Work

We are currently working on code clean up / rewrite and concretizing the tool interface and features we would like to have upstream. We plan on submitting an RFC and pull requests, as usual, so be on the lookout for that!

We would love to hear your thoughts and any exciting use cases you see for this.

@ivanradanov @jdoerfert @wsmoses @fodinabor @boomanaiden154-1