Initial Migration of Fuzz Tests & Integration Scripts From the OSS-Fuzz Project Repo by DaveLak · Pull Request #1901 · gitpython-developers/GitPython (original) (raw)

Ok, sorry for the delay. Below are my thoughts regarding fuzz_tree.py, but also, a little extra context on the OSS-Fuzz execution environment because it may be helpful for thinking about #1901 (comment).

About the OSS-Fuzz Environment

You both touched on it, but for the sake of shared understanding, the scripts and tests look like they do in part because of the OSS-Fuzz container environment implementation details.

The key points to understand are:

There are two stages of an OSS-Fuzz run that influence why the shell scripts and fuzz tests look like they do: a build stage and a fuzzer execution stage.
Each stage has its own distinct container environment.
Each container environment's filesystem is mostly read-only.
The containers are Linux based (Ubuntu 20.04 right now I believe.)

Build Stage Execution Environment

As per this page of the docs:

build.sh script environment
When your build.sh script is executed, the following locations are available within the image:

Location Env Variable Description

/out/ $OUT Directory to store build artifacts (fuzz targets, dictionaries, options files, seed corpus archives).

/src/ $SRC Directory to checkout source files.

/work/ $WORK Directory to store intermediate files.

Location	Env Variable	Description
/out/	$OUT	Directory to store build artifacts (fuzz targets, dictionaries, options files, seed corpus archives).
/src/	$SRC	Directory to checkout source files.
/work/	$WORK	Directory to store intermediate files.

Everything else is read-only.

Fuzzer Execution Stage Environment

As per this page of the docs:

File system
Everything except /tmp is read-only, including the directory that your fuzz target executable lives in.
/dev is also unavailable.

Regarding the Fuzz Tests

When I initially started on this, my primary objective was really only about fixing the build. I did try to clean them up a bit in the process and apply some best practices, but those changes were mostly incidental and I intentionally avoided any significant refactors.

After we decided to bring them here, my focus shifted to lowering the barrier to entry for contributors because of the nature of the project. In particular, I tried to distill the useful information from various place that I haven't seen documented in one place yet. One such tip is the direct execution section, but in hindsight I overlooked the environment assumptions in fuzz_tree.py.

On `fuzz_tree.py` Specifically

This test is less than ideal for several reasons.

Writing to the /tmp dir

In addition to what has been mentioned, modifying global state and I/O operations (e.g., writing to the temp dir) are problematic for fuzzing in general.

Further, @EliahKagan's remark in #1901 (comment) about "implicating other things in /tmp" made me realize this may already be causing issues inside the OSS-Fuzz execution environment. A non-obvious implementation detail specific to Python projects in OSS-Fuzz that the compile_python_fuzzer function in the build.sh script uses Pyinstaller to bundle the fuzz harness and all related dependencies into a single file archive that gets transferred to the fuzzer execution stage, where it is unpacked into a subdirectory of /tmp for execution. Which, I believe means, the test execution is actually using the test executable as a fixture 🥴

I have yet to confirm this, but I wouldn't be surprised if that is the cause of the issue I described as a possible " circular dependency " in google/oss-fuzz#11763

It's very slow

With an average execs per second of ~4 after the seed corpus loads on a short local run, it is far, far, below what is hoped for. In contrast, I'm seeing about 1432 exec/s on fuzz_config.py in a similar local run. fuzz_config.py creates a config fake in-memory but never writes it to disk.

I haven't dug in yet, but I think this is partly related to point 1 above and partly related to some of the implementation details of the legacy code in GitPython.

This is less of a concern/priority than the /tmp dir issue though because even if it's slow, it's still at least testing the API.

Thoughts on Next Steps

I'm fully in agreement that the /tmp issue and documentation related to it must be fixed before other updates are considered. I think it can be broken down into smaller problems in the following prioritized order:

Address adverse potential that the local execution documentation exposes unaware contributors to.

Possible quick fix options:

Remove that section all together for now (probably quickest and safest?)
Add a Warning or Caution callout to that section about fuzz_tree.py (this feels not as great IMHO)

Address the /tmp writing implementation issues in fuzz_tree.py

TemporaryDirectory as suggested by @EliahKagan looks like a good solution but I would need to test it in OSS-Fuzz (which I can certainly do regardless of the downstream PR's status.)
I've also considered proposing we add a Dockerfile to the fuzzing/ dir that would provide a "batteries included" way for folks to use the direct execution method quickly and easily inside a container environment prepared specifically to support it. Even before this discussion, I thought that would be the ideal way to document it because users wouldn't need to worry about any of the particulars beyond having Docker installed, which they'd already need for the alternative anyway.

I think the most robust solution would be some combination of both points above, but I don't know enough to comment on the potential issues that we may face goinf the TemporaryDirectory route inside OSS-Fuzz. The Dockerfile solution on the other hand is something I've already been toying with locally.

I also looked at options like MemoryFS from https://github.com/PyFilesystem/pyfilesystem2 because an in-memory solution would be ideal, but I think the git executable invocation may cause issues with that approach.